Take a sip of Soda
Last modified on 27-Sep-23
Is Soda the data quality testing solution you’ve been looking for? Take a sip and see!
Use the example data in this tutorial to set up and run a simple Soda scan for data quality.
01 Learn the basics of Soda | 2 minutes
02 Install Soda | 10 minutes
03 Set up example data source | 5 minutes
04 Connect Soda to the data source | 5 minutes
05 Write checks and run a scan | 5 minutes
Soda basics
Soda enables Data Engineers to test data for quality where and when they need to.
Is your data fresh? Is it complete or missing values? Are there unexpected duplicate values? Did something go wrong during transformation? Are all the data values valid? These are the questions that Soda answers for Data Engineers.
- Use Soda with GitHub Actions to test data quality during CI/CD development.
- Use it with Airflow to test data quality after ingestion and transformation in your pipeline.
- Import your dbt tests into Soda to facilitate issue investigation and track dataset health over time.
- Integrate Soda with your data catalog to gauge dataset health from within the catalog.
How it works
Soda works by taking the data quality checks that you prepare and using them to run a scan of datasets in a data source. A scan is a CLI command which instructs Soda to prepare optimized SQL queries that execute data quality checks on your data source to find invalid, missing, or unexpected data. When checks fail, they surface bad-quality data and present check results that help you investigate and address quality issues.
To test your data quality, you install the Soda Library, and sign up for a Soda Cloud account that connects to your Soda Library using API keys so that you can complete the following tasks:
- Connect to your data source.
  To connect to a data source such as Snowflake, Amazon Athena, or BigQuery, you use a `configuration.yml` file which stores access details for your data source such as host, port, and data source login credentials.
- Define checks to surface bad-quality data.
  To define the data quality checks that Soda runs against a dataset, you use a `checks.yml` file. A Soda check is a test that Soda performs when it scans a dataset in your data source. The checks YAML file stores the checks you write using the Soda Checks Language (SodaCL), a domain-specific language for data quality testing.
- Run a scan to execute your data quality checks.
  During a scan, Soda does not ingest your data; it only scans it for quality metrics, then uses the metadata to prepare scan results¹. After a scan, each check results in one of three default states:
  - pass: the values in the dataset match or fall within the thresholds you specified
  - fail: the values in the dataset do not match or fall within the thresholds you specified
  - error: the syntax of the check is invalid, or there are runtime or credential errors
  A fourth state, warn, is something you can explicitly configure for individual checks.
- Review scan results and investigate issues.
  You can review the scan output in the command line and in your Soda Cloud account. Add API keys to the same `configuration.yml` file to push check results to your account so you can access visualized scan results, set alert notifications, track trends in data quality over time, and integrate with the messaging, ticketing, and data cataloging tools you already use, like Slack, Jira, and Alation.
¹ An exception to this rule is when Soda collects failed row samples that it presents in scan output to aid with issue investigation, a feature you can disable.
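As a sketch of how a warn state works, a single SodaCL check can carry both a warn and a fail threshold; the check and thresholds below are illustrative, not part of this tutorial's example.

```yaml
checks for dim_customer:
  - missing_count(last_name):
      warn: when > 0
      fail: when > 10
```

With this configuration, a few missing last names produce a warn result, while more than ten produce a fail.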
Learn more about How Soda works.
Learn more about running Soda scans.
Learn more about SodaCL.
Access the Glossary for a full list of Soda terminology.
Install Soda from the command line
This tutorial references a macOS environment.
- Check the following prerequisites:
- You have installed Python 3.8 or greater.
- You have installed Pip 21.0 or greater.
- You have installed Docker Desktop and have access to GitHub (to set up an example data source).
- In your command-line interface, create a Soda project directory in your local environment, then navigate to the directory.
```shell
mkdir soda_sip
cd soda_sip
```
- Best practice dictates that you install Soda using a virtual environment. In your command-line interface, create a virtual environment in the `.venv` directory.
```shell
python3 -m venv .venv
```
- Activate the virtual environment.
```shell
source .venv/bin/activate
```
- Execute the following command to install the Soda package for PostgreSQL in your virtual environment. The example data is in a PostgreSQL data source, but there are 15+ data sources with which you can connect your own data beyond this tutorial.
```shell
pip install -i https://pypi.cloud.soda.io soda-postgres
```
- Validate the installation.
```shell
soda --help
```
To exit the virtual environment when you are done with this tutorial, use the `deactivate` command.
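Taken together, the installation steps above can be sketched as one shell session; the pip install and soda commands are shown as comments because they need network access to Soda's package index.

```shell
# Create a project directory and enter it
mkdir -p soda_sip
cd soda_sip

# Create an isolated virtual environment in .venv and activate it
# ("." is the portable spelling of "source")
python3 -m venv .venv
. .venv/bin/activate

# The active interpreter now lives inside the virtual environment
python -c 'import sys; print(sys.prefix)'

# Install the Soda package for PostgreSQL, then verify the CLI:
#   pip install -i https://pypi.cloud.soda.io soda-postgres
#   soda --help

# Leave the virtual environment when you are done
deactivate
```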
Build an example data source
To enable you to take a first sip of Soda, you can use Docker to quickly build an example PostgreSQL data source against which you can run scans for data quality. The example data source contains data for AdventureWorks, a fictional e-commerce organization.
- (Optional) Access the `sodadata/sip-of-soda` repository in GitHub.
- (Optional) Access a quick view of the AdventureWorks schema.
- Open a new tab in Terminal.
- If it is not already running, start Docker Desktop.
- Run the following command in Terminal to set up the prepared example data source.
```shell
docker run \
  --name sip-of-soda \
  -p 5432:5432 \
  -e POSTGRES_PASSWORD=secret \
  sodadata/soda-adventureworks
```
When the output reads `database system is ready to accept connections`, your data source is set up and you are ready to proceed.
Troubleshoot

Problem: When you run `docker-compose up` you get an error that reads `[17168] Failed to execute script docker-compose`.
Solution: Start Docker Desktop.

Problem: When you run `docker-compose up` you get an error that reads `Cannot start service soda-adventureworks: Ports are not available: exposing port TCP 0.0.0.0:5432 -> 0.0.0.0:0: listen tcp 0.0.0.0:5432: bind: address already in use`.
Solution:
- Execute the command `lsof -i tcp:5432` to print a list of PIDs using the port.
- Use the PID value to run `kill -9 your_PID_value` to free up the port. You may need to prepend the commands with `sudo`.
- Run the `docker run` command again.
Connect Soda to the data source and a cloud account
To connect to a data source such as Snowflake, PostgreSQL, Amazon Athena, or GCP BigQuery, you use a `configuration.yml` file which stores access details for your data source.
This tutorial also instructs you to connect to a Soda Cloud account using API keys that you create and add to the same `configuration.yml` file. Available for free as a 45-day trial, your Soda Cloud account gives you access to visualized scan results, tracks trends in data quality over time, lets you set alert notifications, and much more.
- In a code editor such as Sublime or Visual Studio Code, create a new file called `configuration.yml` and save it in your `soda_sip` directory.
- Copy and paste the following connection details into the file. The `data_source` configuration details connect Soda to the example AdventureWorks data source you set up using Docker.
```yaml
data_source adventureworks:
  type: postgres
  connection:
    host: localhost
    username: postgres
    password: secret
    database: postgres
  schema: public
```
- In a browser, navigate to cloud.soda.io/signup to create a new Soda account. If you already have a Soda account, log in.
- Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys. Copy and paste the `soda_cloud` configuration syntax, including the API keys, into the `configuration.yml` file, as in the example below.
```yaml
data_source adventureworks:
  type: postgres
  connection:
    host: localhost
    ...
soda_cloud:
  host: cloud.soda.io
  api_key_id: 2e0ba0cb-**7b
  api_key_secret: 5wdx**aGuRg
```
- Save the `configuration.yml` file and close the API modal in your Soda account.
- In Terminal, return to the tab in which the virtual environment is active in the `soda_sip` directory. Run the following command to test Soda's connection to the data source.
Command:
```shell
soda test-connection -d adventureworks -c configuration.yml
```
Output:
```
Soda Library 1.0.x
Soda Core 3.0.x
Successfully connected to 'adventureworks'.
Connection 'adventureworks' is valid.
```
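As an aside, you need not store literal credentials in `configuration.yml`: Soda's configuration supports environment variable resolution, so a sketch like the following keeps the secret out of the file. This assumes you have exported `POSTGRES_PASSWORD` in your shell; see Soda's configuration documentation for specifics.

```yaml
data_source adventureworks:
  type: postgres
  connection:
    host: localhost
    username: postgres
    password: ${POSTGRES_PASSWORD}
    database: postgres
  schema: public
```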
Write some checks and run a scan
- Create another file in the `soda_sip` directory called `checks.yml`. A check is a test that Soda executes when it scans a dataset in your data source. The `checks.yml` file stores the checks you write using the Soda Checks Language (SodaCL).
- Open the `checks.yml` file in your code editor, then copy and paste the following checks into the file.
```yaml
checks for dim_customer:
  - invalid_count(email_address) = 0:
      valid format: email
      name: Ensure values are formatted as email addresses
  - missing_count(last_name) = 0:
      name: Ensure there are no null values in the Last Name column
  - duplicate_count(phone) = 0:
      name: No duplicate phone numbers
  - freshness(date_first_purchase) < 7d:
      name: Data in this dataset is less than 7 days old
  - schema:
      warn:
        when schema changes: any
      name: Columns have not been added, removed, or changed
```
What do these checks do?
- Ensure values are formatted as email addresses checks that all entries in the `email_address` column are formatted as `name@domain.extension`. See Validity metrics.
- Ensure there are no null values in the Last Name column automatically checks for NULL values in the `last_name` column. See Missing metrics.
- No duplicate phone numbers validates that each value in the `phone` column is unique. See Numeric metrics.
- Columns have not been added, removed, or changed compares the schema of the dataset to the last scan result to determine if any columns were added, deleted, changed data type, or changed index. The first time this check executes, the results show `[NOT EVALUATED]` because there are no previous values to which to compare current results. In other words, this check requires a minimum of two scans to evaluate properly. See Schema checks.
- Data in this dataset is less than 7 days old confirms that the data in the dataset is less than seven days old. See Freshness checks.
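Beyond the built-in email format, a validity check can also be sketched with a custom pattern via SodaCL's `valid regex` configuration; the check and regex below are illustrative additions, not part of the tutorial's example.

```yaml
checks for dim_customer:
  - invalid_count(phone) = 0:
      valid regex: '^[0-9+() -]+$'
      name: Phone numbers contain only digits and phone punctuation
```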
- Save the changes to the `checks.yml` file, then, in Terminal, use the following command to run a scan. A scan is a CLI command which instructs Soda to prepare SQL queries that execute data quality checks on your data source. As input, the command requires:
  - `-d` the name of the data source to scan
  - `-c` the filepath and name of the `configuration.yml` file
  - the filepath and name of the `checks.yml` file
Command:
```shell
soda scan -d adventureworks -c configuration.yml checks.yml
```
Output:
```
Soda Library 1.0.x
Soda Core 3.0.x
Sending failed row samples to Soda Cloud
Scan summary:
3/5 checks PASSED:
    dim_customer in adventureworks
      No changes to schema [PASSED]
      Emails formatted correctly [PASSED]
      No null values for last name [PASSED]
2/5 checks FAILED:
    dim_customer in adventureworks
      No duplicate phone numbers [FAILED]
        check_value: 715
      Data is fresh [FAILED]
        max_column_timestamp: 2014-01-28 23:59:59.999999
        max_column_timestamp_utc: 2014-01-28 23:59:59.999999+00:00
        now_variable_name: NOW
        now_timestamp: 2023-04-24T21:02:15.900007+00:00
        now_timestamp_utc: 2023-04-24 21:02:15.900007+00:00
        freshness: 3372 days, 21:02:15.900008
Oops! 2 failures. 0 warnings. 0 errors. 3 pass.
Sending results to Soda Cloud
Soda Cloud Trace: 4417******32502
```
- As you can see from the output, some checks failed and Soda sent the results to your Cloud account. To access visualized check results and further examine the failed checks, return to your Soda account in your browser and click Checks.
- In the table of checks that Soda displays, you can click the line item for one of the checks that failed to examine the visualized results in a line graph, and to access the failed row samples that Soda automatically collected when it ran the scan and executed the checks.
Use the failed row samples, as in the example below, to determine what caused a data quality check to fail.
✨Well done!✨ You’ve taken the first step towards a future in which you and your colleagues can trust the quality and reliability of your data. Huzzah!
Now what?

Sip more Soda
- Integrate with Slack
- Add alert notification rules
- Integrate with dbt
- Report on data health

Choose your adventure
- Test data during development
- Test data in your pipeline
- Enable end-user testing

If you are done with the example data, you can delete it from your account to start fresh with your own data.
- Navigate to your avatar > Scans & Data.
- In the Data Sources tab, click the stacked dots to the right of the `adventureworks` data source, then select Delete Data Source.
- Follow the steps to confirm deletion.
- Connect to your own data by configuring your data source connections in your existing `configuration.yml` file.
- Adjust your `checks.yml` file to point to your own dataset in your data source, then adjust the checks to apply to your own data. Go ahead and run a scan!
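When you point Soda at your own data, a minimal `checks.yml` can start as small as the sketch below; the dataset and column names are placeholders to replace with your own.

```yaml
checks for your_dataset_name:
  - row_count > 0:
      name: Dataset is not empty
  - missing_count(your_critical_column) = 0:
      name: No missing values in a critical column
```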
Need help?
- Request a demo. Hey, what can Soda do for you?
- Join the Soda community on Slack.