Last modified on 30-Nov-23
The roadmap to get started offers a curated experience to help you get from zero to productive with Soda software.
However, if a guided experience is not your style, take a different path!
- Follow a 15-min tutorial to set up and run Soda using demo data.
- Follow a Use case guide for implementation instructions that target a specific outcome.
- Request a demo so we can help you get the most out of your Soda experience.
- Choose a flavor of Soda 🚀 Start here!
- Set up Soda
- Write SodaCL checks
- Run scans and review results
- Organize, alert, investigate
Need help? Join the Soda community on Slack.
Soda enables Data Engineers, Data Scientists, and Data Analysts to test data for quality where and when they need to.
Is your data fresh? Is it complete or missing values? Are there unexpected duplicate values? Did something go wrong during transformation? Are all the data values valid? These are the questions that Soda answers.
- Use Soda with GitHub Actions to test data quality during CI/CD development.
- Use Soda to build data quality rules in a collaborative, browser user interface.
- Use it with Airflow to test data quality after ingestion and transformation in your pipeline.
- Import your dbt tests into Soda to facilitate issue investigation and track dataset health over time.
- Integrate Soda with your data catalog to gauge dataset health from within the catalog.
Soda works by taking the data quality checks that you prepare and using them to run a scan of datasets in a data source. A scan is a command which instructs Soda to prepare optimized SQL queries that execute data quality checks on your data source to find invalid, missing, or unexpected data. When checks fail, they surface bad-quality data and present check results that help you investigate and address quality issues.
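To make the idea concrete, the sketch below mimics a scan in miniature: each check becomes one aggregate SQL query, and only the resulting metric (a count) leaves the database. This is a conceptual illustration, not Soda's implementation or API. The `customers` table, the `CHECKS` dictionary, and the `run_scan` function are all hypothetical, and an in-memory SQLite database stands in for a real data source.

```python
import sqlite3

# Hypothetical checks: each maps a name to an aggregate SQL query and the
# largest metric value that still counts as passing.
CHECKS = {
    "missing_email": ("SELECT COUNT(*) FROM customers WHERE email IS NULL", 0),
    "duplicate_id": (
        "SELECT COUNT(*) FROM "
        "(SELECT id FROM customers GROUP BY id HAVING COUNT(*) > 1)",
        0,
    ),
}


def run_scan(conn, checks):
    """Run every check query and compare its metric with the threshold."""
    results = {}
    for name, (sql, max_allowed) in checks.items():
        metric = conn.execute(sql).fetchone()[0]
        results[name] = "pass" if metric <= max_allowed else "fail"
    return results


# In-memory demo data source (an assumption made for this example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "b@example.com")],
)

scan_results = run_scan(conn, CHECKS)
print(scan_results)
```

With this demo data, one row has no email, so the missing-value check fails, while every `id` is unique, so the duplicate check passes. In Soda itself you would express such checks in SodaCL rather than hand-written SQL.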
To test your data quality, first choose a flavor of Soda, that is, a deployment model. The flavor you choose lets you configure connections to your data sources and define data quality checks; you then run scans that execute those checks.
- Connect to your data source.
Connect Soda to a data source such as Snowflake, Amazon Athena, or BigQuery by providing access details such as host, port, and login credentials.
- Define checks to surface bad-quality data.
Define data quality checks using Soda Checks Language (SodaCL), a domain-specific language for data quality testing. A Soda Check is a test that Soda performs when it scans a dataset in your data source.
- Run a scan to execute your data quality checks.
During a scan, Soda does not ingest your data; it only scans it for quality metrics, then uses the metadata to prepare scan results [1]. After a scan, each check results in one of three default states:
- pass: the values in the dataset match or fall within the thresholds you specified
- fail: the values in the dataset neither match nor fall within the thresholds you specified
- error: the syntax of the check is invalid, or there are runtime or credential errors
- A fourth state, warn, is something you can explicitly configure for individual checks.
- Review scan results and investigate issues.
You can review the scan output on the command line and in your Soda Cloud account. In Soda Cloud, you can access visualized scan results, set alert notifications, track trends in data quality over time, and integrate with the messaging, ticketing, and data cataloging tools you already use, like Slack, Jira, and Atlan.
[1] An exception to this rule: Soda can collect failed row samples and present them in scan output to aid issue investigation. You can disable this feature.
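The check states described above (pass, fail, error, and the optional warn) can be sketched as a small decision function. This is an illustrative model only, not Soda's code: `evaluate_check`, its parameters, and `broken_metric` are hypothetical names, and the thresholds are arbitrary.

```python
def evaluate_check(compute_metric, fail_when, warn_when=None):
    """Return 'pass', 'warn', 'fail', or 'error' for one check.

    compute_metric: zero-argument callable that returns the measured value.
    fail_when / warn_when: predicates over that value. warn_when is optional,
    mirroring the fact that warn must be configured explicitly.
    """
    try:
        value = compute_metric()
    except Exception:
        # Stands in for invalid syntax, runtime, or credential problems.
        return "error"
    if fail_when(value):
        return "fail"
    if warn_when is not None and warn_when(value):
        return "warn"
    return "pass"


def broken_metric():
    # Simulates a check that cannot be evaluated at all.
    raise RuntimeError("could not connect to data source")


outcomes = [
    evaluate_check(lambda: 0, fail_when=lambda v: v > 10),
    evaluate_check(lambda: 25, fail_when=lambda v: v > 10),
    evaluate_check(lambda: 5, fail_when=lambda v: v > 10, warn_when=lambda v: v > 0),
    evaluate_check(broken_metric, fail_when=lambda v: v > 10),
]
print(outcomes)
```

The sketch evaluates the fail condition before the warn condition, so a value that satisfies both is reported as a failure. That ordering is a design choice of this example, not a documented Soda behavior.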
Documentation always applies to the latest version of Soda products