Test data quality during CI/CD development
Last modified on 31-May-23
Use this guide to install and set up Soda to test the quality of your data during your development lifecycle. Catch data quality issues in a GitHub pull request before merging data management changes, such as transformations, into production.
(Not quite ready for this big gulp of Soda? 🥤Try taking a sip, first.)
01 Learn the basics of Soda
02 Get context for this guide
03 Install Soda from the command-line
04 Connect Soda to your data source and platform account
05 Write checks for data quality
06 Create a GitHub Action job
07 Set up Slack integration and notification rules
08 Trigger a scan and examine the scan results
Soda basics
Soda is a platform that enables Data Engineers to test data for quality where and when they need to.
Is your data fresh? Is it complete or missing values? Are there unexpected duplicate values? Did something go wrong during transformation? Are all the data values valid? These are the questions that Soda answers for Data Engineers.
- Use Soda with GitHub Actions to test data quality during CI/CD development.
- Use it with Airflow to test data quality after ingestion and transformation in your pipeline.
- Import your dbt tests into the Soda platform to facilitate issue investigation and track dataset health over time.
- Integrate Soda with your data catalog to gauge dataset health from within the catalog.
How it works
Soda works by taking the data quality checks that you prepare and using them to run a scan of datasets in a data source. A scan is a CLI command which instructs Soda to prepare optimized SQL queries that execute data quality checks on your data source to find invalid, missing, or unexpected data. When checks fail, they surface bad-quality data and present check results that help you investigate and address quality issues.
To test your data quality, you install the Soda CLI tool and sign up for a Soda platform account so that you can complete the following tasks:
- Connect to your data source. To connect to a data source such as Snowflake, Amazon Athena, or Big Query, you use a `configuration.yml` file which stores access details for your data source such as host, port, and data source login credentials.
- Define checks to surface “bad” data. To define the data quality checks that Soda runs against a dataset, you use a `checks.yml` file. A Soda Check is a test that Soda performs when it scans a dataset in your data source. The checks YAML file stores the checks you write using the Soda Checks Language (SodaCL), a domain-specific language for data quality testing.
- Run a scan to execute your data quality checks. During a scan, Soda does not ingest your data; it only scans it for quality metrics, then uses the metadata to prepare scan results.¹ After a scan, each check results in one of three default states:
  - pass: the values in the dataset match or fall within the thresholds you specified
  - fail: the values in the dataset do not match or fall within the thresholds you specified
  - error: the syntax of the check is invalid, or there are runtime or credential errors
  A fourth state, warn, is something you can explicitly configure for individual checks.
- Review scan results and investigate issues. You can review the scan output in the command-line and in your Soda platform account. Add API keys to the same `configuration.yml` file to push check results to your account so you can access visualized scan results, set alert notifications, track trends in data quality over time, and integrate with the messaging, ticketing, and data cataloging tools you already use, like Slack, Jira, and Alation.
¹ An exception to this rule is when Soda collects failed row samples that it presents in scan output to aid with issue investigation, a feature you can disable.
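To make the four states concrete, here is a purely illustrative evaluator, not Soda's implementation; the function name and threshold semantics are hypothetical, but it classifies a measured metric the same way a check outcome is reported:

```python
# Purely illustrative: classify a measured metric into the check states
# described above. This is NOT Soda's implementation; the function name
# and threshold semantics are hypothetical.

def evaluate_check(metric_value, fail_threshold, warn_threshold=None):
    """Return "pass", "warn", "fail", or "error" for a measured metric."""
    if metric_value is None:
        return "error"  # e.g. the metric query could not be evaluated
    if metric_value > fail_threshold:
        return "fail"
    if warn_threshold is not None and metric_value > warn_threshold:
        return "warn"
    return "pass"

# e.g. a missing_count(column1) = 0 check: any missing value fails
print(evaluate_check(0, fail_threshold=0))  # pass
print(evaluate_check(3, fail_threshold=0))  # fail
```

Note that `warn` only enters the picture when you explicitly configure a warn threshold, mirroring the optional fourth state above.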
Learn more about How Soda works.
Learn more about running Soda scans.
Learn more about SodaCL.
Access the Glossary for a full list of Soda terminology.
About this guide
The instructions below offer Data Engineers an example of how to use GitHub Actions to execute SodaCL checks for data quality on data in a Snowflake data source.
(Not a GitHub Actions user? Stay tuned for more guides coming soon.)
For context, the example assumes that a team of people use GitHub to collaborate on managing data ingestion and transformation with dbt. In the same repo, team members can collaborate to write tests for data quality in SodaCL checks YAML files. With each new PR, or commit to an existing PR, in the repo that adds a transformation or makes changes to a dbt model, a GitHub Action executes a Soda scan for data quality and presents the results of the scan in a comment in the pull request, and in the Soda platform.
Where the scan results indicate an issue with data quality, Soda notifies the team via a notification in Slack so that they can investigate and address any issues before merging the PR into production.
Borrow from this guide to connect to your own data source, set up a GitHub Action job, and execute your own relevant tests for data quality to prevent issues in production.
Install Soda from the command line
MacOS and Linux

- Ensure that you have the following prerequisites installed in your environment.
  - Python 3.8 or greater
  - Pip 21.0 or greater
- Best practice dictates that you install Soda using a virtual environment. In Terminal, create a virtual environment using the commands below. Depending on your version of Python, you may need to replace `python` with `python3` in the first command.

python -m venv .venv
source .venv/bin/activate

- Execute the following command, replacing `soda-core-postgres` with the install package that matches the type of data source you use to store data; expand the Soda packages list below for a complete list.

pip install soda-core-postgres
Soda packages
Data source | Install package |
---|---|
Amazon Athena | soda-core-athena |
Amazon Redshift | soda-core-redshift |
Apache Spark DataFrames (For use with programmatic Soda scans, only.) | soda-core-spark-df |
Azure Synapse (Experimental) | soda-core-sqlserver |
ClickHouse (Experimental) | soda-core-mysql |
Dask and Pandas (Experimental) | soda-core-pandas-dask |
Databricks | soda-core-spark[databricks] |
Denodo (Experimental) | soda-core-denodo |
Dremio | soda-core-dremio |
DuckDB (Experimental) | soda-core-duckdb |
GCP Big Query | soda-core-bigquery |
IBM DB2 | soda-core-db2 |
Local file | Use Dask. |
MS SQL Server | soda-core-sqlserver |
MySQL | soda-core-mysql |
OracleDB | soda-core-oracle |
PostgreSQL | soda-core-postgres |
Snowflake | soda-core-snowflake |
Trino | soda-core-trino |
Vertica (Experimental) | soda-core-vertica |
To deactivate the virtual environment, use the following command:
deactivate
Windows

- Ensure that you have the following prerequisites installed in your environment.
  - Python 3.8 or greater
  - Pip 21.0 or greater
- Best practice dictates that you install Soda using a virtual environment. In Terminal, create a virtual environment using the commands below. Depending on your version of Python, you may need to replace `python` with `python3` in the first command. Refer to the virtualenv documentation for activating a Windows script.

python -m venv .venv
.venv\Scripts\activate

- Execute the following command, replacing `soda-core-postgres` with the install package that matches the type of data source you use to store data; expand the Soda packages list below for a complete list.

pip install soda-core-postgres
Soda packages
Data source | Install package |
---|---|
Amazon Athena | soda-core-athena |
Amazon Redshift | soda-core-redshift |
Apache Spark DataFrames (For use with programmatic Soda scans, only.) | soda-core-spark-df |
Azure Synapse (Experimental) | soda-core-sqlserver |
ClickHouse (Experimental) | soda-core-mysql |
Dask and Pandas (Experimental) | soda-core-pandas-dask |
Databricks | soda-core-spark[databricks] |
Denodo (Experimental) | soda-core-denodo |
Dremio | soda-core-dremio |
DuckDB (Experimental) | soda-core-duckdb |
GCP Big Query | soda-core-bigquery |
IBM DB2 | soda-core-db2 |
Local file | Use Dask. |
MS SQL Server | soda-core-sqlserver |
MySQL | soda-core-mysql |
OracleDB | soda-core-oracle |
PostgreSQL | soda-core-postgres |
Snowflake | soda-core-snowflake |
Trino | soda-core-trino |
Vertica (Experimental) | soda-core-vertica |
To deactivate the virtual environment, use the following command:
deactivate
Refer to the virtualenv documentation for deactivating a Windows script.
Connect Soda to your data source and platform account
To connect to a data source such as Snowflake, PostgreSQL, Amazon Athena, or GCP Big Query, you use a `configuration.yml` file which stores access details for your data source.
This guide also instructs you to connect to a Soda platform account using API keys that you create and add to the same `configuration.yml` file. Available for free as a 45-day trial, your Soda platform account gives you access to visualized scan results, tracks trends in data quality over time, enables you to set alert notifications, and much more.
- In the GitHub repository in which you work with your dbt models, or ingest and transform data, create a directory to contain your Soda configuration and check YAML files.
- Use GitHub Secrets to securely store the values for your data source login credentials. Soda requires access to the credentials so it can access the data source to scan the data.
- In your new directory, create a new file called `configuration.yml`.
- Open the `configuration.yml` file in a code editor, then copy and paste the connection configuration for the data source that you use. The `data_source` configuration details connect Soda to your data source; the example below is the connection configuration for Snowflake. The `soda_cloud` configuration connects Soda to your platform account; leave it blank for a moment.

data_source my_datasource_name:
  type: snowflake
  connection:
    username: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASS}
    account: plu449.us-west-1
    database: sodadata_test
    warehouse: compute_wh
    role: analyst
    session_parameters:
      QUERY_TAG: soda-queries
      QUOTED_IDENTIFIERS_IGNORE_CASE: false
  schema: public

soda_cloud:
  host:
  api_key_id:
  api_key_secret:
- In a browser, navigate to cloud.soda.io/signup to create a new Soda account. If you already have a Soda account, log in.
- In your platform account, navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate a new set of API keys.
  - Copy the API Key ID, then paste it into the `configuration.yml` as the value for `api_key_id`.
  - Copy the API Key Secret, then paste it into the `configuration.yml` as the value for `api_key_secret`.
  - Enter the value for `host` according to the region your Soda platform account uses: `cloud.soda.io` for EU region; `cloud.us.soda.io` for USA region.
- Save the `configuration.yml` file and close the API modal in your Soda account.
- In Terminal, run the following command to test Soda’s connection to your data source, replacing the value of `my_datasource_name` with the name of your data source.

soda test-connection -d my_datasource_name -c configuration.yml
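The `${SNOWFLAKE_USER}` and `${SNOWFLAKE_PASS}` placeholders in the configuration file resolve from environment variables at scan time. A minimal sketch of that substitution, using Python's `string.Template` semantics rather than Soda's actual resolver, which may differ:

```python
# Simplified sketch of ${VAR} placeholder resolution in configuration.yml.
# Soda's actual resolver may differ; this only illustrates why credentials
# must exist as environment variables when the scan runs.
import os
from string import Template

# Hypothetical values; in CI these come from repository secrets.
os.environ["SNOWFLAKE_USER"] = "ci_user"
os.environ["SNOWFLAKE_PASS"] = "ci_password"

config_text = "username: ${SNOWFLAKE_USER}\npassword: ${SNOWFLAKE_PASS}"
resolved = Template(config_text).substitute(os.environ)
print(resolved)
```

If the variables are missing from the environment when the scan runs, the credentials cannot resolve and the connection fails, which is why the GitHub Action later in this guide passes them in via `env:`.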
Write checks for data quality
A check is a test that Soda executes when it scans a dataset in your data source. The checks.yml
file stores the checks you write using the Soda Checks Language (SodaCL). You can create multiple checks.yml
files to organize your data quality checks and run all, or some of them, at scan time.
- In the same directory in which you created the `configuration.yml`, create another file named `checks.yml`.
- Open the `checks.yml` file in your code editor, then copy and paste the following rather generic checks into the file. Note that the `row_count` check is written to fail to demonstrate alerting when a data quality check fails.
  - Replace the value of `dataset_name` with the name of a dataset in your data source.
  - Replace the value of `column1` with the name of a column in the dataset.

checks for dataset_name:
  # Checks that dataset contains fewer than 2 rows; written to fail
  - row_count < 2:
      name: Dataset is unreasonably small
  # Checks that column contains no NULL values
  - missing_count(column1) = 0:
      name: No NULL values
  # Checks for columns removed or added, or change to index or type
  - schema:
      warn:
        when schema changes: any
      name: No changes to schema

- Save the `checks.yml` file.
Learn more about SodaCL.
Learn more about using multiple checks YAML files.
Create a GitHub Action job
Use GitHub Actions to execute a Soda scan for data quality each time you create a new pull request or commit to an existing one. The GitHub action posts the data quality scan results in a PR comment.
- Be sure to trigger a Soda scan after you have completed a dbt run that executed your dbt tests.
- Note that GitHub PR comments are limited to 65,536 characters. If the GitHub Action tries to post a comment that exceeds this limit, the job may not complete successfully.
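Given that limit, if your check results table can grow large, you may want to truncate the comment body before posting it. A minimal sketch; the helper and its name are hypothetical, not part of Soda or GitHub tooling:

```python
# Hypothetical helper: trim a comment body to GitHub's 65,536-character
# limit for PR comments before posting, noting that content was cut.
GITHUB_COMMENT_LIMIT = 65_536

def truncate_comment(body: str, limit: int = GITHUB_COMMENT_LIMIT) -> str:
    suffix = "\n\n(table truncated to fit GitHub's comment limit)"
    if len(body) <= limit:
        return body
    return body[: limit - len(suffix)] + suffix
```

You could run such a step between generating the markdown table and posting the comment in the workflow below.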
What does the GitHub Action do?
To summarize, the action completes the following tasks:
- Checks out the repo.
- Sets up Python 3.10. (The minimum version that Soda requires is Python 3.8.)
- Installs Soda via `pip install`.
- Uses the `configuration.yml` and `checks.yml` you created to run a scan of your data and save the results to a JSON file.
- Extracts and converts Soda scan results from JSON to a markdown table.
- Searches the PR for any existing comment containing "Soda Scan Summary".
  - If no such comment exists, creates a new comment in the PR and posts the "Soda Scan Summary" table of check results.
  - If such a comment already exists, updates it with the new "Soda Scan Summary" table of check results.
- If the Soda scan failed, posts a comment advising of the scan failure and directing you to the job logs of step 4, then fails the job using an `if` condition.
instruction. - Create a file in the
.github/workflows
directory of your repository calledsoda-actions.yml
. - Copy and paste the example below into the new file, adjusting environment and data source-specific details as needed. You must adjust some of the values to apply to your own repo and your own data source, but the sequence of steps is suitable to copy+paste.
- Save the
soda-actions.yml
file.
# This GitHub Action runs a Soda scan on a Snowflake data source called reporting_api__marts.
# Best practice dictates that you set one action per data source and point to a folder that contains the relevant Soda files.
# This example sets up the action to run on a specific data source and points `soda scan` to a folder of Soda check files.
name: Run Soda Scan on [reporting_api__marts]
# GitHub triggers this job when a user creates or updates a pull request.
on:
  pull_request:
jobs:
  run:
    # The job runs on the latest Docker image for Ubuntu.
    runs-on: ubuntu-latest
    steps:
      # Step 1: Check out the repository code into the container.
      - name: Checkout
        uses: actions/checkout@v2
      # Step 2: Set up Python 3.10, which Soda requires.
      - uses: actions/setup-python@v2
        with:
          python-version: '3.10'
      # Step 3: Install the Soda package via pip.
      # Be sure to install the package that corresponds to your specific data source.
      # This example connects to Snowflake.
      - name: Install Soda
        run: pip install -U pip && pip install -q soda-core-snowflake
      # Step 4: Perform a Soda scan on the dataset and save the result as ci_scan_results.json.
      # Store sensitive information such as credentials in the GitHub repository secrets.
      # The configuration.yml file expects the username and password to come from the environment
      # variables SNOWFLAKE_USER and SNOWFLAKE_PASS.
      - name: Perform Soda scan
        id: soda_scan
        env:
          SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
          SNOWFLAKE_PASS: ${{ secrets.SNOWFLAKE_PASS }}
        run: soda scan -d reporting_api__marts -c ./soda_checks/configs/configuration.yml ./soda_checks/checks -srf ci_scan_results.json
        # This option ensures that the rest of the steps run even if the scan fails.
        # The action as a whole still fails if the scan fails; see the last step of the action.
        continue-on-error: true
      # Step 5: Extract and convert Soda scan results from JSON to a markdown table stored in checks_markdown_table.md.
      # This step uses a CLI JSON processor available on Docker Hub (https://github.com/stedolan/jq).
      # However, you can implement your own custom script in your preferred language to use in its place.
      - name: Convert JSON to markdown
        id: convert
        run: |
          MESSAGE_HEADER="**Soda Scan Summary**:\n"
          TABLE_HEADER="| Dataset | Name | Type | Definition | Outcome | Value |\n|------|---------|------|------------|---------|-------|\n"
          TABLE_CONTENT=$(cat ci_scan_results.json | docker run --rm -i stedolan/jq -r '.checks[] | "| \(.table) | \(.name) | \(.type) | \(.definition | gsub("\n"; " ")) | \(.outcome) | \(.diagnostics.value) |"')
          if [ -z "$TABLE_CONTENT" ]; then
            echo "😿 No scan results found. Check the logs of the 'Perform Soda scan' step in your GitHub action" > checks_markdown_table.md
          else
            if [ "${{ steps.soda_scan.outcome }}" == "success" ]; then
              echo -e "✅ All checks passed.\n$MESSAGE_HEADER$TABLE_HEADER$TABLE_CONTENT" > checks_markdown_table.md
            else
              echo -e "❌ Some checks failed.\n$MESSAGE_HEADER$TABLE_HEADER$TABLE_CONTENT" > checks_markdown_table.md
            fi
          fi
      # Step 6: Find an existing comment by the GitHub Actions bot containing 'Soda Scan Summary'.
      # This step determines whether to create a new comment, or update an existing one.
      - name: Find comment
        uses: peter-evans/find-comment@v2
        id: fc
        with:
          issue-number: ${{ github.event.pull_request.number }}
          comment-author: 'github-actions[bot]'
          body-includes: 'Soda Scan Summary'
      # Step 7: If no comment with Soda scan results exists, create a new one.
      # The GITHUB_TOKEN variable is present out-of-the-box in all actions.
      # GitHub automatically has access to the variable in the container environment.
      - name: Create comment
        if: steps.fc.outputs.comment-id == ''
        uses: peter-evans/create-or-update-comment@v3
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          issue-number: ${{ github.event.pull_request.number }}
          body-path: 'checks_markdown_table.md'
          edit-mode: replace
      # Step 8: If a comment with Soda scan results exists, update the existing comment.
      - name: Update comment
        if: steps.fc.outputs.comment-id != ''
        uses: peter-evans/create-or-update-comment@v3
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          issue-number: ${{ github.event.pull_request.number }}
          comment-id: ${{ steps.fc.outputs.comment-id }}
          body-path: 'checks_markdown_table.md'
          edit-mode: replace
      # Step 9: Fail the job if the Soda scan did not complete successfully.
      - name: Fail job if Soda scan failed
        if: steps.soda_scan.outcome != 'success'
        run: echo "Soda scan failed. Check the logs of the 'Perform Soda scan' step." && exit 1
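Step 5 uses jq in a Docker container to build the markdown table. If you prefer not to depend on Docker, a short Python script can do the same conversion; this sketch assumes the scan-results JSON exposes the same fields the jq filter reads (`checks[].table`, `name`, `type`, `definition`, `outcome`, `diagnostics.value`):

```python
# Convert Soda scan results JSON to a markdown table, one row per check.
# Assumes the same JSON shape the jq filter in Step 5 reads: a top-level
# "checks" list with table/name/type/definition/outcome fields and a
# diagnostics.value field.
import json

def checks_to_markdown(results_json: str) -> str:
    checks = json.loads(results_json).get("checks", [])
    header = (
        "| Dataset | Name | Type | Definition | Outcome | Value |\n"
        "|------|---------|------|------------|---------|-------|\n"
    )
    rows = []
    for c in checks:
        definition = (c.get("definition") or "").replace("\n", " ")
        rows.append(
            f"| {c.get('table')} | {c.get('name')} | {c.get('type')} "
            f"| {definition} | {c.get('outcome')} | "
            f"{c.get('diagnostics', {}).get('value')} |"
        )
    return header + "\n".join(rows)

# Hypothetical sample payload to demonstrate the conversion.
sample = json.dumps({"checks": [{
    "table": "dim_customer", "name": "No NULL values",
    "type": "missing_count", "definition": "missing_count(column1) = 0",
    "outcome": "pass", "diagnostics": {"value": 0},
}]})
print(checks_to_markdown(sample))
```

A script like this could replace the `docker run ... jq` pipeline in the Convert JSON to markdown step, since Python is already installed earlier in the job.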
Set up Slack integration and notification rules
Use this integration to enable Soda to send alert notifications to a Slack channel to notify your team of warn and fail check results. If your team does not use Slack, you can skip this step; Soda sends alert notifications via email instead.
- Log in to your Soda platform account and navigate to your avatar > Organization Settings, then navigate to the Integrations tab and click the + icon to add a new integration.
- Follow the guided steps to authorize Soda to connect to your Slack workspace. If necessary, contact your organization’s Slack Administrator to approve the integration with Soda.
- Configuration tab: select the public channels to which Soda can post messages; Soda cannot post to private channels.
- Scope tab: select the Soda features, both alert notifications and incidents, which can access the Slack integration.
- To dictate where Soda should send alert notifications for checks that fail, create a new notification rule. Navigate to your avatar > Notification Rules, then click New Notification Rule. Follow the guided steps to complete the new rule, directing Soda to send check results that fail to a specific channel in your Slack workspace.
Learn more about Integrating with Slack.
Learn more about Setting notification rules.
Trigger a scan and examine the scan results
To trigger the GitHub Action job and initiate a Soda scan for data quality, create a new pull request in your repository. Be sure to trigger a Soda scan after you have completed a dbt run that executed your dbt tests.
- For the purposes of this exercise, create a new branch in your GitHub repo, then make a small change to an existing file and commit and push the change to the branch.
- Execute a dbt run.
- Create a new pull request, then navigate to your GitHub account and review the PR you just created. Notice that the Soda scan action is queued and perhaps already running against your data to check for quality.
- When the job completes, you can see a new comment in the PR with a table of checks that indicate which checks have passed and failed.
- Access your Slack workspace, then navigate to the channel to which you directed Soda to send fail notifications in the Notification Rule you created. Notice the alert notification of the check that purposely failed during the Soda scan.
- Navigate to your Soda platform account, then click Checks to access the Check Results page. The results from the scan that Soda performed during the GitHub Action job appear in the results table, where you can click each line item to learn more about the results, including, for some types of checks, samples of failed rows to help in your investigation of a data quality issue.
Troubleshoot Soda scan execution
If the Soda scan did not complete successfully, you can review the scan logs to determine the cause of the problem.
- In the pull request in which the scan failed, navigate to Actions > Jobs > run > Perform Soda scan.
- Expand the step to examine the scan logs.
- Access Troubleshoot SodaCL for help diagnosing issues.
✨Well done!✨ You’ve taken the first step towards a future in which you and your colleagues prevent data quality issues from getting into production. Huzzah!
Now what?
Need help?
- Not quite ready for this big gulp of Soda?
Try taking a sip, first.
- Request a demo. Hey, what can Soda do for you?
- Join the Soda community on Slack.