
Soda v3

Data source reference

Connect Soda to Databricks

Access configuration details to connect Soda to Databricks using a Spark data source.

You can use the Soda Library packages for Apache Spark to connect to Databricks SQL or to use Spark DataFrames on Databricks.

  • Refer to Connect to Spark for Databricks SQL.

  • Refer to Use Soda Library with Spark DataFrames on Databricks.

🎥 Watch a video that demonstrates how to add Soda to your Databricks pipeline: https://go.soda.io/soda-databricks-video
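A connection configuration for Databricks SQL via the Spark data source generally follows the pattern below. This is an illustrative sketch: the data source name, catalog, schema, host, http_path, and token values are placeholders you replace with the details from your Databricks SQL warehouse settings.

```yaml
data_source my_databricks_ds:
  type: spark
  method: databricks
  catalog: samples          # placeholder catalog
  schema: nyctaxi           # placeholder schema
  host: dbc-1234abcd-5678.cloud.databricks.com   # from your SQL warehouse connection details
  http_path: /sql/1.0/warehouses/abc123          # from your SQL warehouse connection details
  token: ${DATABRICKS_TOKEN}                     # personal access token via an environment variable
```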

SodaCL reference

Integrate Soda with Atlan

Integrate Soda with Atlan to access details about the quality of your data from right within your data catalog.


  • Run data quality checks using Soda and visualize quality metrics and rules within the context of a data source, dataset, or column in Atlan.

  • Use Soda Cloud to flag poor-quality data in lineage diagrams.

  • Give your Atlan users the confidence of knowing that the data they are using is sound.

Prerequisites

  • You have completed at least one Soda scan to validate that the data source’s datasets appear in Soda Cloud as expected.

  • You have an Atlan account with the privileges necessary to allow you to set up a Connection in your Atlan workspace.

Set up the integration

  1. Follow the instructions to Generate API keys in Soda to use for authentication in your Atlan connection.

  2. Follow Atlan's documentation to set up the Connection to Soda in your Atlan workspace.

🎥 Watch the Atlan-Soda integration in action!

Go further

  • Access a list of all integrations that Soda Cloud supports.

  • Use a webhook to integrate with Jira, ServiceNow, and other tools your team already uses.

Soda v3 documentation

Soda is a data quality platform that provides tools to monitor, test, and improve data quality across all stacks.

Welcome to the Soda documentation hub, your one-stop resource for everything you need to know about Soda’s data quality platform. Dive into our guides, tutorials, reference materials, and integration pages to learn how to keep your data quality high across your entire stack.

Get started!

Soda v3 vs v4

This is the documentation for Soda v3. If you are using Soda v4 or want to learn more about the next iteration of Soda, head to the Soda v4 documentation.

Soda v3 is a checks-based, CLI-driven data quality tool.

Soda v4 has incorporated collaborative data contracts and end-to-end observability features to become a unified data-quality platform for all.

The comparison below summarizes each capability in Soda v3 versus Soda v4.

📚 Guides & Tutorials

Learn core concepts and best practices:

  • Use case guides: practical Soda usage scenarios

  • Write checks: define data quality checks

  • Run scans: execute Soda data scans

  • Organize, alert, investigate: review check results and investigate issues

🔌 Integrations

Extend Soda into your existing tools and workflows:

📖 Reference

Detailed command, API, and configuration docs:


💬 Community & Support

Need help or want to contribute?

  • Join our Slack Community

  • Browse GitHub Discussions


Still have questions? Use the search bar above or reach out through our community channels for additional help.

Integrate Soda with Purview

Integrate Soda with Microsoft Purview to access details about the quality of your data from right within your data catalog.


  • Run data quality checks using Soda and visualize quality metrics and rules within the context of a table in Purview.

  • Give your Purview-using colleagues the confidence of knowing that the data they are using is sound.

  • Encourage others to add data quality checks using a link in Purview that connects directly to Soda.

In Purview, you can see all the Soda data quality checks and the value associated with the check's latest measurement, the health score of the dataset, and the timestamp for the most recent update. Each of these checks listed in Purview includes a link that opens a new page in Soda Cloud so you can examine diagnostic and historic information about the check.

Purview displays the latest check results according to the most recent Soda scan for data quality, where color-coded icons indicate the latest result. A gray icon indicates that a check was not evaluated as part of a scan.

If Soda is performing no data quality checks on a dataset, the instructions in Purview invite a catalog user to access Soda and create new checks.

Prerequisites

  • You have completed at least one Soda scan to validate that the data source’s datasets appear in Soda Cloud as expected.

  • You have a Purview account with the privileges necessary to collect the information Soda needs to complete the integration.

  • The data source that contains the data you wish to check for data quality is available in Purview.

Set up the integration

  1. Sign into your Soda Cloud account and confirm that you see the datasets you expect to see in the data source you wish to test for quality.

  2. In your Soda Cloud account, navigate to your avatar > Profile, then navigate to the API Keys tab. Click the plus icon to generate new API keys.

  3. Copy the following values and paste to a temporary, secure, local location.

  • API Key ID

  • API Key Secret

  4. Access the Purview tutorial using REST APIs for instructions on how to create the following values, then paste them to a temporary, secure, local location.

  • client_id

  • client_secret

  • tenant_id

  5. Copy the value of your Purview endpoint from the URL (https://XXX.purview.azure.com) and paste it to a temporary, secure, local location.

  6. To connect your Soda Cloud account to your Purview account, contact your Soda Account Executive or email Soda Support with the details you collected in the previous steps to request the Purview integration.

Go further

Connect Soda to Google CloudSQL

Access configuration details to connect Soda to a Google CloudSQL data source.

Connection configuration reference

Because Google CloudSQL is compatible with PostgreSQL wire protocol, Soda offers support for Google CloudSQL data sources using the soda-postgres package.
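Because the soda-postgres package is used, a CloudSQL connection configuration mirrors a PostgreSQL one. A minimal sketch, with placeholder host and credential values supplied via environment variables:

```yaml
data_source my_cloudsql_ds:
  type: postgres
  host: 10.11.12.13        # Cloud SQL instance address (placeholder)
  port: "5432"
  username: ${CLOUDSQL_USER}
  password: ${CLOUDSQL_PASS}
  database: postgres
  schema: public
```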


Integrate Soda with Slack

Integrate your Slack workspace in your Soda Cloud account so that Soda Cloud can send Slack notifications to your team when a data issue triggers an alert.

As a user with permission to do so in your Soda Cloud account, you can integrate your Slack workspace in your Soda Cloud account so that Soda Cloud can interact with individuals and channels in the workspace. Use the Slack integration to:

  • send notifications to Slack when a check result triggers an alert

  • create a private channel whenever you open new incident to investigate a failed check result

  • track Soda Discussions wherein your fellow Soda users collaborate on data quality checks

  1. In Soda Cloud, navigate to your avatar > Organization Settings, then navigate to the Integrations tab and click the + icon to add a new integration.

  2. Follow the guided steps to authorize Soda Cloud to connect to your Slack workspace. If necessary, contact your organization’s Slack Administrator to approve the integration with Soda Cloud.

    • Configuration tab: select the public channels to which Soda can post messages; Soda cannot post to private channels.

Note that Soda caches the response from the Slack API, refreshing it hourly. If you created a new public channel in Slack to use for your integration with Soda, be aware that the new channel may not appear in the Configuration tab in Soda until the hourly Slack API refresh is complete.

Integration for Soda Cloud alert notifications

You can use this integration to enable Soda Cloud to send alert notifications to a Slack channel to notify your team of warn and fail check results.

With such an integration, Soda Cloud lets users select a Slack channel as the destination for alert notifications of an individual check, of checks that form part of an agreement, or of multiple checks at once.

To send notifications that apply to multiple checks, see Set notification rules.

Integration for Soda Cloud incidents

You can use this integration to notify your team when a new incident has been created in Soda Cloud. With such an integration, Soda Cloud displays an external link to an incident-specific Slack channel in the Incident Details.

Refer to Incidents for more details about using incidents in Soda Cloud.

Set a default Slack channel for notifications

You can set a default Slack channel that Soda Cloud applies to all alert notifications. If you did not set a default Slack channel when you initially set up the integration, you can edit the integration to set one.

  1. In your Soda Cloud account, go to your avatar > Organization Settings.

  2. Go to the Integrations tab, then click the stacked dots to the right of the Slack integration. Select Edit Integration Settings.

  3. In the Slack Channels dialog, go to the Scope tab.

Go further

  • Set notification rules that apply to multiple checks in your account.

  • Learn more about using Slack to collaborate on resolving incidents.

  • Access a list of all integrations that Soda Cloud supports.

Connect Soda to MotherDuck

Access reference configuration to connect Soda to a MotherDuck data source.

Connection configuration reference

Install package: soda-duckdb. Refer to MotherDuck instructions for further detail.

data_source quack:
  type: duckdb
  database: "md:sample_data?motherduck_token=eyJhbGciOxxxxx.eyJzZXxxxxx.l4sxxxxx"
  read_only: true

Write checks with Ask AI

Use Soda's Ask AI assistant to turn natural language into production-ready data quality checks in SodaCL.

Ask AI is an in-product generative AI assistant for data quality testing. Ask AI replaces SodaGPT, the original implementation of a generative AI assistant.

✖️ Requires Soda Core Scientific (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud + Soda Agent


Log in to your Soda Cloud account, click the Ask AI button in the main nav, then provide natural language instructions to the interface to:

  • receive fully-formed, syntax-correct checks in the Soda Checks Language (SodaCL)

Generate API keys

Generate API keys to securely connect Soda Library or a Soda Agent to Soda Cloud, or to access Soda Cloud via API.

Soda Cloud uses API keys to securely communicate with other entities such as Soda Library and self-hosted Soda Agents, and to provide secure access to Soda Cloud via API.

There are two sets of API keys that you can generate and use with Soda Cloud:

  • API keys for communicating with Soda Library, the Soda Cloud API or Soda Cloud Reporting API, and the Soda Library Docker image that the GitHub Action for Soda uses

  • API keys for communicating with a self-hosted Soda Agent

Connect Soda to DuckDB

Access configuration details to connect Soda to a DuckDB data source.

Connection configuration reference

Install package: soda-duckdb
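A minimal connection sketch for a local DuckDB file, using the properties described in this reference; the data source name and file path are placeholders:

```yaml
data_source my_duckdb_ds:
  type: duckdb
  database: /path/to/database.duckdb
  read_only: true
```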


Connect Soda to IBM DB2

Access configuration details to connect Soda to an IBM DB2 data source.

Compatibility

Soda supports connections to IBM DB2 for Linux, UNIX, and Windows (LUW). Soda does not support connections to IBM DB2 for z/OS. Refer to IBM Developer documentation for more information.

Connection configuration reference
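A connection configuration for DB2 generally follows the pattern below. This is an illustrative sketch: the host, port, database, and schema values are placeholders, and credentials are supplied via environment variables.

```yaml
data_source my_db2_ds:
  type: db2
  host: db2.example.com
  port: "50000"
  username: ${DB2_USER}
  password: ${DB2_PASS}
  database: mydb
  schema: myschema
```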

Get started roadmap

Get started with Soda! Use this curated set of instructions to quickly get data quality tests up and running.

The Soda environment has been updated since this tutorial.

Refer to the Soda v4 documentation for updated tutorials.

The roadmap to get started offers a curated experience to help you get from zero to productive with Soda software.

However, if a guided experience is not your style, take a different path!

data_source my_datasource_name:
  type: postgres
  host: db
  port: "5432"
  username: simple
  password: simple_pass
  database: postgres
  schema: public

Data Testing (Checks)

  • Soda v3: CLI-centric checks written in YAML/SodaCL, run via the Python library or Agent.

  • Soda v4: still supports YAML/SodaCL checks, and adds full Data Testing workflows in both the CLI and the Web UI.

Data Observability (Monitoring)

  • Soda v3: anomaly dashboards provide threshold-based monitoring configured via Soda Cloud.

  • Soda v4: Metric Monitoring leverages an in-house anomaly detection algorithm to monitor data and metadata metric trends, and provides built-in alerts via Soda Cloud.

Data Contracts

  • Soda v3: file-based contracts executed via CLI/Git; verification via soda scan.

  • Soda v4: collaborative data contracts, file-based and UI-based, executed via CLI/Git or the Soda Cloud UI.


  • Scope tab: select the Soda features (alert notifications and/or incidents) that can access the Slack integration.

Select a default Slack channel to which Soda Cloud sends notifications for all existing and new checks. Save your changes to take effect.
  • get answers to questions about how to configure or use Soda

  • obtain advice about how to resolve an error while using Soda
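For example, a prompt such as “write a check that fails when the email_address column in dim_customer contains missing values” might yield SodaCL like the following; the dataset and column names here are illustrative:

```yaml
checks for dim_customer:
  - missing_count(email_address) = 0
```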

  • Enable Ask AI

    If you do not already have an account, sign up for Soda Cloud for a 45-day free trial. Then, as a user with permission to do so, navigate to your avatar > Organization Settings, then check the box to Enable Ask AI powered by Kapa.

    Can't see the Ask AI button?

    If you are an existing Soda customer, you must accept Soda's revised terms and conditions for service that include the use of third-party tools that facilitate generative AI capabilities. Reply to Soda's Terms & Conditions email to accept the revisions, or contact Soda Support to arrange acceptance and enable the feature. If you have accepted the revised terms and conditions but still cannot see the Ask AI button, as a user with permission to do so, navigate to your avatar > Organization Settings, then check the box to Enable Ask AI powered by Kapa.

    About Ask AI

    The Ask AI Assistant is powered by kapa.ai and replaces SodaGPT. While Soda collaborates with third parties to develop certain AI features, Soda does not share any primary data with these partners, such as data samples or data profiling details. Soda shares only prompts and some schema information with OpenAI and kapa.ai to enhance the accuracy of the assistant.

    Refer to Soda’s General Terms & Conditions in the Use of AI section for further details.

    Go further

    • Create no-code checks via the Soda Cloud user interface.

    • Get started with Soda by following a tutorial.

    • Consider using check suggestions to profile your data and suggest basic checks for data quality.

    Note that you can use other authentication methods to access Soda Cloud metadata via the Reporting API such as HTTPBasic authentication with username and password, or authentication using tokens; use API keys to authenticate access if your organization employs Single Sign On (SSO) to access Soda Cloud.

    Generate API keys for use with Soda Library or a Soda Cloud API

    1. In your Soda Cloud account, navigate to your avatar > Profile, then navigate to the API Keys tab. Click the plus icon to generate new API keys.

    2. Copy the syntax for the soda_cloud configuration, including the values API Key ID and API Key Secret, then apply the keys according to how you intend to use them:

      • for use in a configuration.yml file: follow

      • for use with the Reporting API if your organization uses Single Sign On (SSO) to access Soda Cloud: follow
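The soda_cloud configuration you copy generally looks like the sketch below; supplying the key values via environment variables keeps them out of version control:

```yaml
soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
```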

    Generate API keys for use with a Soda Agent

    1. In your Soda Cloud account, navigate to your avatar > Data Sources, then navigate to the Agents tab. Click New Soda Agent.

    2. Copy the values of the API Key ID and API Key Secret to a secure location, then apply the keys according to the instructions in the Deploy a Soda Agent documentation.

    Go further

    • Learn more about integrating with third-party tools via a webhook.

    • Access a list of all integrations that Soda Cloud supports.

    type (required): Identify the type of data source for Soda.

    database (required): Identify the location of the database. Refer to DuckDB documentation for details. Some users have reported issues using the database key, but have been successful using path instead.

    read_only (required): Indicate users’ access by providing a boolean value: true or false.

    schema_name (optional): Provide an identifier for the schema in which your dataset exists.

    Test the data source connection

    To confirm that you have correctly configured the connection details for the data source(s) in your configuration YAML file, use the test-connection command. If you wish, add a -V option to the command to return results in verbose mode in the CLI.
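For example, assuming a data source named my_duckdb_ds is defined in configuration.yml, the command looks like:

```shell
soda test-connection -d my_duckdb_ds -c configuration.yml -V
```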

    Supported data types

    text: CHARACTER VARYING, CHARACTER, CHAR, TEXT

    number: SMALLINT, INTEGER, BIGINT, DECIMAL, NUMERIC, VARIABLE, REAL, DOUBLE PRECISION, SMALLSERIAL, SERIAL, BIGSERIAL

    time: TIMESTAMP, DATE, TIME, TIMESTAMP WITH TIME ZONE, TIMESTAMP WITHOUT TIME ZONE, TIME WITH TIME ZONE, TIME WITHOUT TIME ZONE


    Install package: soda-db2

    type (required): Identify the type of data source for Soda.

    host (required): Provide a host identifier.

    port (required): Provide a port identifier.

    username (required): Consider using system variables to retrieve this value securely.

    Supported data types

    text: VARCHAR

    number: INT, INTEGER, DOUBLE, FLOAT

    time: DATE, TIME, TIMESTAMP


  • Follow a 15-min tutorial to set up and run Soda using demo data.

  • Follow a Use case guide for implementation instructions that target a specific outcome.

  • Request a demo so we can help you get the most out of your Soda experience.

  • Get started roadmap

    1. Choose a flavor of Soda 🚀 Start here!

    2. Set up Soda

    • Self-operated

    • Soda-hosted agent

    • Self-hosted agent

    • Programmatic

    3. Write SodaCL checks

    4. Run scans and review results

    5. Organize, alert, investigate

    Need help? Join the Soda community on Slack.

    About Soda

    Soda enables Data Engineers, Data Scientists, and Data Analysts to test data for quality where and when they need to.

    Is your data fresh? Is it complete or missing values? Are there unexpected duplicate values? Did something go wrong during transformation? Are all the data values valid? These are the questions that Soda answers.

    • Use Soda with GitHub Actions to test data quality during CI/CD development.

    • Use Soda to build data quality rules in a collaborative, browser user interface.

    • Use it with Airflow to test data quality after ingestion and transformation in your pipeline.

    • Import your dbt tests into Soda to facilitate issue investigation and track dataset health over time.

    • Integrate Soda with your data catalog to gauge dataset health from within the catalog.

    How it works

    Soda works by taking the data quality checks that you prepare and using them to run a scan of datasets in a data source. A scan is a command which instructs Soda to prepare optimized SQL queries that execute data quality checks on your data source to find invalid, missing, or unexpected data. When checks fail, they surface bad-quality data and present check results that help you investigate and address quality issues.

    To test your data quality, you choose a flavor of Soda (choose a deployment model) which enables you to configure connections with your data sources and define data quality checks, then run scans that execute your data quality checks.

    • Connect to your data source. Connect Soda to a data source such as Snowflake, Amazon Athena, or BigQuery by providing access details for your data source such as host, port, and data source login credentials.

    • Define checks to surface bad-quality data. Define data quality checks using Soda Checks Language (SodaCL), a domain-specific language for data quality testing. A Soda Check is a test that Soda performs when it scans a dataset in your data source.

    • Run a scan to execute your data quality checks. During a scan, Soda does not ingest your data, it only scans it for quality metrics, then uses the metadata to prepare scan results1. After a scan, each check results in one of three default states:

      • pass: the values in the dataset match or fall within the thresholds you specified

      • fail: the values in the dataset do not match or fall within the thresholds you specified

      • error: the syntax of the check is invalid, or there are runtime or credential errors

      • A fourth state, warn, is something you can explicitly configure for individual checks.

    • Review scan results and investigate issues. You can review the scan output in the command-line and in your Soda Cloud account. Access visualized scan results, set alert notifications, track trends in data quality over time, and integrate with the messaging, ticketing, and data cataloging tools you already use, like Slack, Jira, and Atlan.

    1 An exception to this rule is when Soda collects failed row samples that it presents in scan output to aid with issue investigation, a feature you can disable.
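The pass, fail, and warn states above map directly onto check thresholds in SodaCL. A small illustrative example, with a hypothetical dataset name:

```yaml
checks for dim_customer:
  - row_count:
      warn: when < 100   # warn when the dataset looks suspiciously small
      fail: when = 0     # fail when the dataset is empty
```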

    Access a Soda product overview.

    Learn more about How Soda works.

    Learn more about SodaCL.

    Access the Glossary for a full list of Soda terminology.

    type (required): Identify the type of data source for Soda.

    host (required): Provide a host identifier.

    port (required): Provide a port identifier.

    username (required): Consider using system variables to retrieve this value securely.

    password (required): Consider using system variables to retrieve this value securely.

    database (required): Identify the name of your database.

    schema (required): Provide an identifier for the schema in which your table exists.

    text: CHARACTER VARYING, CHARACTER, CHAR, TEXT

    number: SMALLINT, INTEGER, BIGINT, DECIMAL, NUMERIC, VARIABLE, REAL, DOUBLE PRECISION, SMALLSERIAL, SERIAL, BIGSERIAL

    time: TIMESTAMP, DATE, TIME, TIMESTAMP WITH TIME ZONE, TIMESTAMP WITHOUT TIME ZONE, TIME WITH TIME ZONE, TIME WITHOUT TIME ZONE

    type (required): Identify the type of data source for Soda.

    database (required): Provide an identifier for your database. Some users have reported issues using the database key, but have been successful using path instead.

    read_only (required): Indicate users' access by providing a boolean value: true or false.

    text: CHARACTER VARYING, CHARACTER, CHAR, TEXT

    number: SMALLINT, INTEGER, BIGINT, DECIMAL, NUMERIC, VARIABLE, REAL, DOUBLE PRECISION, SMALLSERIAL, SERIAL, BIGSERIAL

    time: TIMESTAMP, DATE, TIME, TIMESTAMP WITH TIME ZONE, TIMESTAMP WITHOUT TIME ZONE, TIME WITH TIME ZONE, TIME WITHOUT TIME ZONE


    Get started

    Follow this tutorial to set up and run a simple Soda scan for data quality using example data.

    The Soda environment has been updated since this tutorial.

    Refer to the Soda v4 documentation for updated tutorials.

    Is Soda the data quality testing solution you've been looking for? Take a sip and see! 🫧

    Use the example data in this quick tutorial to set up and run a simple Soda scan for data quality.

    Set up Soda | 3 minutes · Build an example data source | 2 minutes · Connect Soda | 5 minutes · Write some checks and run a scan | 5 minutes

    💡 For standard set up instructions, access the .

    ✨ Want a total UI experience? Use the out-of-the-box Soda-hosted agent to skip the CLI.

    Set up Soda

    This tutorial references a MacOS environment.

    1. Check the following prerequisites:

    • You have installed Python 3.8, 3.9, or 3.10.

    • You have installed Pip 21.0 or greater.

    • (Optional) You have installed, and have access to, Docker, to set up an example data source.

    2. Sign up for a Soda Cloud account, which is free for a 45-day trial.

    3. In your command-line interface, create a Soda project directory called soda_sip in your local environment, then navigate to the directory.

    4. Best practice dictates that you install the Soda Library package using a virtual environment. In your command-line interface, create a virtual environment in the .venv directory, then activate the environment.

    5. Execute the following command to install the Soda Library package for PostgreSQL in your virtual environment. The example data is in a PostgreSQL data source, but there are 15+ data sources to which you can connect your own data beyond this tutorial.

    6. Validate the installation.

    To exit the virtual environment when you are done with this tutorial, use the command deactivate.
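The setup steps above can be sketched as the following commands; the soda_sip directory name matches this tutorial, and the package index URL is the one Soda's install instructions direct you to use for Soda Library:

```shell
mkdir soda_sip && cd soda_sip            # create the project directory
python -m venv .venv                     # create a virtual environment
source .venv/bin/activate                # activate it
pip install -i https://pypi.cloud.soda.io soda-postgres   # install the Soda package for PostgreSQL
soda --version                           # validate the installation
```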

    Build an example data source

    To enable you to take a first sip of Soda, you can use Docker to quickly build an example PostgreSQL data source against which you can run scans for data quality. The example data source contains data for AdventureWorks, an imaginary online e-commerce organization.

    • (Optional) Access the repository in GitHub.

    • (Optional) Access a quick view of the .

    1. Open a new tab in Terminal.

    2. If it is not already running, start Docker Desktop.

    3. Run the following command in Terminal to set up the prepared example data source.

    When the output reads data system is ready to accept connections, your data source is set up and you are ready to proceed.

    Troubleshoot

    Problem: When you run docker-compose up you get an error that reads [17168] Failed to execute script docker-compose.

    Alternatively, you can use your own data for this tutorial. To do so:

    1. Skip the steps above involving Docker.

    2. Install the Soda Library package that corresponds with your data source, such as soda-bigquery, soda-athena, etc. See full list.

    3. Collect your data source's login credentials that you must provide to Soda so that it can scan your data for quality.

    Connect Soda

    To connect to a data source such as Snowflake, PostgreSQL, Amazon Athena, or GCP BigQuery, you use a configuration.yml file which stores access details for your data source.

    This tutorial also instructs you to connect to a Soda Cloud account using API keys that you create and add to the same configuration.yml file. Available for free as a 45-day trial, your Soda Cloud account validates your free trial or license, gives you access to visualized scan results, tracks trends in data quality over time, lets you set alert notifications, and much more.

    1. In a code editor such as Sublime or Visual Studio Code, create a new file called configuration.yml and save it in your soda_sip directory.

    2. Copy and paste the following connection details into the file. The data_source configuration details connect Soda to the example AdventureWorks data source you set up using Docker. If you are using your own data, provide the data_source values that correspond with your own data source.

    Output:

    Write some checks and run a scan

    1. Create another file in the soda_sip directory called checks.yml. A check is a test that Soda executes when it scans a dataset in your data source. The checks.yml file stores the checks you write using the Soda Checks Language (SodaCL).

    2. Open the checks.yml file in your code editor, then copy and paste the following checks into the file.

    What do these checks do?
    • Ensure values are formatted as email addresses checks that all entries in the email_address column follow a valid email address format.

    • Ensure there are no null values in the Last Name column automatically checks for NULL values in the last_name column.
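The two checks described above can be expressed in SodaCL roughly as follows; the dataset name dim_customer is illustrative:

```yaml
checks for dim_customer:
  - invalid_count(email_address) = 0:
      valid format: email
  - missing_count(last_name) = 0
```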

    3. Save the changes to the checks.yml file, then, in Terminal, use the following command to run a scan. A scan is a CLI command which instructs Soda to prepare SQL queries that execute data quality checks on your data source. As input, the command requires:

    • -d the name of the data source to scan

    • -c the filepath and name of the configuration.yml file

    • the filepath and name of the checks.yml file

    Command:
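Assuming the example adventureworks data source and the files created above, the scan command looks like:

```shell
soda scan -d adventureworks -c configuration.yml checks.yml
```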

    Output:

    4. As you can see in the Scan Summary in the command-line output, some checks failed and Soda sent the results to your Soda Cloud account. To access visualized check results and further examine the failed checks, return to your Soda account in your browser and click Checks.

    5. In the table of checks that Soda displays, you can click the line item for one of the checks that failed to examine the visualized results in a line graph, and to access the failed row samples that Soda automatically collected when it ran the scan and executed the checks. Use the failed row samples, as in the example below, to determine what caused a data quality check to fail.

    ✨Well done!✨ You've taken the first step towards a future in which you and your colleagues can trust the quality and reliability of your data. Huzzah!

    If you are done with the example data, you can delete it from your account to start fresh with your own data.

    1. Navigate to your avatar > Data Sources.

    2. In the Data Sources tab, click the stacked dots to the right of the adventureworks data source, then select Delete Data Source.

    3. Follow the steps to confirm deletion.

    Go further

    • Get inspired on how to set up Soda to meet your use case.

    • Use check suggestions to quickly get off the ground with basic checks for data quality.

    • Learn the basics of writing SodaCL checks.

    • Read more about Soda in general.

    Need help?

    • What can Soda do for you? Request a demo.

    • Join the Soda community on Slack.

    Set notification rules

    Use Soda Cloud to set alert notification rules for multiple checks across datasets in your account.

    In Soda Cloud, you can define where and when to send alert notifications when check results warn or fail. You can define these parameters for:

    • agreements as you create or edit them; see Define SodaCL checks for Use an agreement.

    • no-code checks after you have created them; see Define SodaCL checks for Use a no-code check.

    • multiple checks by defining notification rules; read on!

    For example, you can define a notification rule to instruct Soda Cloud to send an alert to your #sales-engineering Slack channel whenever a data quality check on the snowflake_sales data source fails.

    Default rules

    By default, Soda Cloud establishes two notification rules on your Soda Cloud account. You can edit or delete these rules if you wish.

    Refer to for details on resource ownership.

    Set new rules

    For a new rule, you define conditions for sending notifications including the severity of a check result and whom to notify when bad data triggers an alert.

    In Soda Cloud, navigate to your avatar > Notification Rules, then click New Notification Rule. Follow the guided steps to complete the new rule. Use the table below for insight into the values to enter in the fields and editing panels.

    Field or Label
    Guidance

    Edit or delete rules

    Navigate to your avatar > Notification Rules, then click the stacked dots at the right of a rule and select Edit Notification Rule or Delete Notification Rule.

    Go further

    • Learn more about SodaCL .

    • Integrate your Soda Cloud account with your .

    • Integrate your Soda Cloud account with a third-party tool using a .

    Add automated monitoring checks

    Use a SodaCL automated monitoring check to automatically check for row count anomalies and schema changes.

    This feature is not supported in Soda Core OSS.

    Migrate to Soda Library in minutes to start using this feature for free with a 45-day trial.

    Use automated monitoring checks to instruct Soda to automatically check for row count anomalies and schema changes in a dataset.

    ✔️ Requires Soda Core Scientific (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as no-code checks


    About automated monitoring checks

    When you add automated monitoring checks to a data source connected to your Soda Cloud account via a self-hosted agent, Soda prepares and executes two checks on all the datasets you indicated as included in the configuration.

    Anomaly score check on row count: This check counts the number of rows in a dataset during a scan and registers anomalous counts relative to previous measurements for the row count metric. Refer to for details. Anomaly score checks require a minimum of four data points (four scans at stable intervals) to establish a baseline against which to gauge anomalies. If you do not see check results immediately, allow Soda Library to accumulate the necessary data points for relative comparison.

    Schema evolution check: This check monitors schema changes in datasets, including column addition, deletion, data type changes, and index changes. By default, this automated check results in a failure if a column is deleted, its type changes, or its index changes; it results in a warning if a column is added. Refer to for details. Schema checks require a minimum of one data point to use as a baseline against which to gauge schema changes. If you do not see check results immediately, wait until after you have scanned the dataset twice.

    Add automated monitoring checks

    Add automated monitoring checks as part of the guided workflow to create a new data source only in deployment models that use a self-hosted Soda Agent, not a Soda-hosted agent. For a Soda-hosted agent, consider using the automated anomaly dashboards for observability into basic data quality in your datasets.

    If you are using a self-operated deployment model that leverages Soda Library, add the automated monitoring configuration outlined below to your checks YAML file.

    In Soda Cloud, navigate to your avatar > Data Sources > New Data Source to begin.

    In step 5. Check of the guided workflow, you have the option of listing the datasets to which you wish to automatically add anomaly score and schema evolution checks. (Note that if you have signed up for early access to anomaly dashboards for datasets, this Check tab is unavailable as Soda performs all automated monitoring automatically in the dashboards.)

    The example check below uses a wildcard character (%) to specify that Soda Library executes automated monitoring checks against all datasets with names that begin with prod, and not to execute the checks against any dataset with a name that begins with test.
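A configuration matching that description looks like the following; the syntax mirrors the automated monitoring block shown in the code examples in this documentation.

```yaml
automated monitoring:
  datasets:
    - include prod%
    - exclude test%
```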

    You can also specify individual datasets to include or exclude, as in the following example.
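A sketch of an include/exclude configuration for individual datasets; the dataset names orders and test_orders are illustrative placeholders, not names from your data source.

```yaml
automated monitoring:
  datasets:
    - include orders
    - exclude test_orders
```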

    Scan results in Soda Cloud

    To review the check results for automated monitoring checks in Soda Cloud, you can:

    • navigate to the Checks dashboard to see the check results

    • navigate to the Datasets dashboard to see the check results for an individual dataset

    Add quotes to all datasets

    If your dataset names include white spaces or use special characters, you must wrap those dataset names in quotes whenever you identify them to Soda, such as in a checks YAML file.

    To add those necessary quotes to dataset names that Soda acts upon automatically – discovering, profiling, or sampling datasets, or creating automated monitoring checks – you can add a quote_tables configuration to your data source, as in the following example.
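A minimal sketch of the quote_tables setting, assuming a PostgreSQL data source; the connection values are placeholders, and quote_tables sits alongside the other connection properties in the data source configuration.

```yaml
data_source my_datasource_name:
  type: postgres
  host: 127.0.0.1
  port: 5432
  username: simple
  password: simple_pass
  database: database
  schema: public
  quote_tables: true
```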

    Go further

    • Learn more about the anomaly dashboards for datasets.

    • Reference .

    • Use a freshness check to gauge how recently your data was captured.

    • Use reference checks to compare the values of one column to another.

    Integrate Soda

    Soda enables you to seamlessly integrate data quality checks into the tools and workflows you already use across your data stack, whether aligning data governance efforts, collaborating across teams, or triggering automated CI/CD and incident-management workflows.

    These integrations surface data quality metrics and rule definitions directly within your existing tools, automate alert notifications to your teams, and streamline the creation and tracking of incidents and tickets based on check results.

    To get started, select the integration you need from the list below for detailed instructions, prerequisites, and troubleshooting tips:

    Data catalogs & governance tools

    Access data quality insights directly within your Alation catalog. Run Soda scans to surface quality metrics and rules in the context of data sources, datasets, or columns.

    Surface Soda-driven quality checks and metrics alongside your Atlan metadata. Flag poor-quality data in lineage diagrams and explore data-profile details in Atlan.


    Explore any of these guides to get started with your preferred integration, and unlock end-to-end data-quality observability across your stack.

    Manage scheduled scans

    From time to time, Soda may encounter runtime issues when it attempts to run a data quality scan on data in your data source. Issues such as unresponsive databases or incorrectly defined checks may cause delays in the scan process, which can result in excessive check execution times, sluggish database responsiveness due to heavy loads, or scheduling conflicts with other processes that cause bottlenecks.

    You can view the status of scans that are in progress, queuing, completed, or partially complete with errors in the Scans dashboard in Soda Cloud.

    Set alert notifications

    To provide visibility into slow, incomplete, or failed Soda scans, you can set up customized alert notifications for each Scan Definition that you created using Soda Cloud.

    1. Log in to your Soda Cloud account, then navigate to Scans, and access the Agents tab. (You cannot set scan definition notifications for scans that you run using Soda Library.)

    2. From the list, select one that uses the Scan Definition for which you wish to configure alerts.

    3. On the scan definition's page, click the stacked dots at right, then select Edit Scan Definition.

    Field or checkbox
    Guidance

    Investigate scan issues

    When you notice or receive a notification about a scan failure or delay, you can access the scan’s logs to investigate what is causing the issue.

    1. Log in to your Soda Cloud account, then navigate to Scans, and access the Agents tab.

    2. From the list of scan definitions, select the one that failed or timed out.

    3. On the scan definition's page, in the list of scan results, locate the one that failed or timed out, then click the stacked dots to its right and select Scan Logs.

    4. Review the scan log, using the filter to show only warnings or errors if you wish, or downloading the log file for external analysis.

    Cancel and restart scans

    Use the Scans page to access an overview of the executing and queuing scans in your Soda Cloud account. If you wish, you can cancel and restart a scan to manage the order in the queue.

    1. On the Scans page, select a scan that is in an Executing state.

    2. On the scan definition's page, click Cancel Scan.

    3. When the scan state reads Canceled, you can click Run Scan from the same page to restart the scan.

    Configure scan timeouts

    To prevent processing bottlenecks, configure a scan timeout on your Soda Agent to ensure that excessively long-running scans stop automatically. If you have configured a delayed completion alert using the procedure above, Soda uses this timeout value to trigger alert notifications.

    By default, Soda sets the scan timeout to two hours; follow the steps below to adjust that value.

    1. Log in to your Soda Cloud account, then navigate to your avatar > Data Sources, and access the Agents tab.

    2. From the list of Agents, select the one for which you wish to adjust the timeout value.

    3. On the agent's page, click the stacked dots at right, then select Edit Agent.

    4. Use the dropdown to adjust the value of Timeout Scans After

    Best practices for optimized scheduled scans

    • To enhance scan efficiency, avoid scheduling resource-intensive tasks, such as data profiling, concurrently with checks. This practice minimizes the likelihood of delays caused by resource contention, ensuring smoother execution of scans.

    • Do not set all of your scan definitions to run at the same time, particularly if the scans use the same Soda Agent. Mindfully stagger scan definition times to more evenly distribute executions and reduce the risk of bottlenecks, delays, and failed scans.

    • As the volume of checks a scan executes organically increases over time, scans may take longer to execute. If your scans are timing out too frequently, adjust the scan timeout to a higher threshold.

    Go further

    • Set alert notification rules for checks that fail or warn during a Soda scan.

    Integrate Soda with ServiceNow

    Configure a webhook to connect Soda to your ServiceNow account.

    Configure a webhook in Soda Cloud to connect to your ServiceNow account.

    In ServiceNow, you can create a Scripted REST API that enables you to prepare a resource to work as an incoming webhook. Use the ServiceNow Resource Path in the URL field in the Soda Cloud integration setup.

    This example offers guidance on how to set up a Scripted REST API Resource to generate an external link which Soda Cloud displays in the Incident Details; see the image below. When you change the status of a Soda Cloud incident, the webhook also updates the status of the ServiceNow issue that corresponds with the incident.

    Refer to Event payloads for detailed information.

    The following steps offer a brief overview of how to set up a ServiceNow Scripted REST API Resource to integrate with a Soda Cloud webhook. Reference the ServiceNow documentation for details:

    • Create a Scripted REST API and Create a Scripted REST API Resource; see also ServiceNow Developer: Creating Scripted REST APIs

    1. In ServiceNow, start by navigating to the All menu, then use the filter to search for and select Scripted REST APIs.

    2. Click New to create a new scripted REST API. Provide a name and API ID, then click Submit to save.

    3. In the Scripted REST APIs list, find and open your newly-created API, then, in the Resources tab, click New to create a new resource.
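As a sketch of the mapping logic such a script performs — written here in Python rather than ServiceNow's server-side JavaScript, with hypothetical status names and state codes that are illustrative, not the documented Soda payload schema:

```python
# Hypothetical mapping from a Soda Cloud incident status to a ServiceNow
# incident state code. The status strings and numeric codes below are
# illustrative placeholders for the values your handler would use.
STATUS_TO_SNOW_STATE = {
    "reported": 1,       # New
    "investigating": 2,  # In Progress
    "fixed": 6,          # Resolved
}

def snow_state_for(incident_status: str) -> int:
    # Fall back to "New" for any status the mapping does not recognize
    return STATUS_TO_SNOW_STATE.get(incident_status.lower(), 1)
```

In a Scripted REST API Resource, the equivalent logic would read the incident status from the webhook payload and update or create the matching ServiceNow record.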

    Go further

    • As a business user, learn more about writing no-code checks in Soda Cloud.

    • Set notification rules that apply to multiple checks in your account.

    • Learn more about creating, tracking, and resolving data quality Incidents.

    • Access a list of all integrations that Soda Cloud supports.

    Double-onboard a data source

    Learn how to double-onboard a data source to leverage all the features supported by Soda Agents.

    To scan your data for quality, Soda must connect to a data source using connection configurations (host, port, login credentials, etc.) that you either define in Soda Cloud during onboarding using a Soda Agent, or in a configuration YAML file you reference during programmatic or CLI scans using Soda Library. Soda recognizes each data source you onboard as an independent resource in Soda Cloud, where it displays all scan results and failed row samples for all data sources regardless of onboarding method.

    However, data sources you connect via a Soda agent using the guided workflow in Soda Cloud support several features which data sources you connect via Soda Library do not, including:

    • no-code checks

    • Discussions

    • Available in 2025

    If you have onboarded a data source via Soda Library but you wish to take advantage of the features available to Soda Agent-onboarded data sources, you can double-onboard an existing data source.

    See also: Soda overview

    See also: Choose a flavor of Soda

    See also: Add a new data source in Soda Cloud

    Prerequisites

    • You have installed Soda Library, you have configured it to connect to your data source, and you have run at least one scan programmatically or via the Soda Library CLI.

    • You have deployed a self-hosted Soda Agent via helm chart in a Kubernetes cluster in your cloud services environment, OR someone with Soda Admin privileges in your organization’s Soda Cloud account has navigated to your avatar > Organization Settings and checked the box to Enable Soda-hosted Agent; see Set up a Soda-hosted agent.

    • You have access to the connection configurations (host, port, login credentials, etc.) for your data source.

    Self-hosted agent

    1 MS SQL Server with Windows Authentication does not work with Soda Agent out-of-the-box.

    Soda-hosted agent

    Onboard an existing data source

    1. Log in to Soda Cloud, then navigate to your avatar > Data Sources.

    2. From the list of data sources connected to your Soda Cloud account, click to select and open the one you onboarded via Soda Library and now wish to double-onboard via a Soda Agent.

    3. Follow the guided workflow to onboard the existing data source via a Soda Agent, starting by using the dropdown to select the Default Scan Agent you wish to use to connect to the data source.

    • define a schedule for your default scan definition

    • provide connection configuration details for the data source such as name, schema, and login credentials, and test the connection to the data source

    • profile the datasets in the data source to gather basic metadata about the contents of each

    • identify the datasets to which you wish to apply automated monitoring for anomalies and schema changes

    1. Save your changes, then navigate to the Datasets page and select a dataset in the data source you just double-onboarded.

    2. (Optional) If you wish, and if you have requested preview access for the feature, you can follow the instructions to activate the anomaly dashboard for the dataset.

    3. (Optional) Click Add Check and begin adding no-code checks to the dataset.

    Known issue: Double-onboarding a data source renders Soda Library API keys invalid. After double-onboarding a data source, if you run a programmatic or CLI scan of that data source using Soda Library, an error appears to indicate that the API keys are invalid. As a workaround, generate new API keys in Soda Cloud, then, in your configuration YAML, replace the old API key values with the newly-generated ones.

    Go further

    • Learn more about automating anomaly detection for observability.

    Connect Soda to Microsoft Fabric

    Access configuration details to connect Soda to a Microsoft Fabric data source.

    Connection configuration reference

    Install package: soda-fabric

    Soda support for the Fabric data source is based on the soda-sqlserver package.

    data_source my_datasource_name:
      type: fabric
      host: host
      port: '1433'
      username: simple
      password: simple_pass
      database: database
      schema: dbo
      trusted_connection: false
      encrypt: false
      trust_server_certificate: false
      driver: ODBC Driver 18 for SQL Server
      scope: DW
      connection_parameters:
        multi_subnet_failover: true
      authentication: sql
    Property
    Required
    Notes

    Connect Soda to Denodo

    Access configuration details to connect Soda to a Denodo data source.

    Connection configuration reference

    Install package: soda-denodo
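A minimal sketch with placeholder values; it matches the Denodo connection block shown in the code examples in this documentation.

```yaml
data_source my_datasource_name:
  type: denodo
  username: simple
  password: simple_pass
  host: 127.0.0.1
  port: 5432
  sslmode: prefer
```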

    Property
    Required
    Notes

    Supported data types

    Category
    Data type

    Create and track incidents

    If you have integrated Soda Cloud with Slack, you can use an Incident’s built-in ability to create a channel that your team can use to investigate an issue.

    When Soda runs a scan to execute the SodaCL checks you defined, Soda Cloud displays the checks and their latest scan results in the Checks dashboard. For a check that failed or triggered a warning, you have the option of creating an Incident for that check result in Soda Cloud to track your team's investigation and resolution of a data quality issue.

    If you have integrated your Soda Cloud account with a Slack workspace, MS Teams channel, or another third-party messaging or ticketing tool that your team uses, such as Jira or ServiceNow, you can use an incident’s built-in ability to create an incident-specific link where you and your team can collaborate on the issue investigation.

    Create Incidents

    1. Log in to your Soda Cloud account, then navigate to the Checks dashboard.

    2. For the check you wish to investigate, click the stacked dots at right, then select Create Incident. Provide a Title, Severity, and Description of your new incident, then save.

    3. In the Incident column of the check result, click the Incident link to access the Incident page where you can record the following details:

    Track Incidents

    • As your team works through the investigation of an Incident, use the Incident's Status field to keep track of your progress.

    • In the Incidents dashboard, review all Incidents, their severity and status, and the assigned lead. Sort the list of Incidents by severity.

    • From an Incident's page, link other check results to the same Incident to expand the investigation landscape.

    • If you opened a Slack channel to investigate the incident, Soda archives the channel when you set the Incident's status to Resolved.

    Go further

    • to facilitate your search for the right data.

    • for a check result.

    • Collaborate with your team using a .

    • Integrate Soda with your or .

    Connect Soda to ClickHouse

    Access configuration details to connect Soda to a ClickHouse data source.

    Connection configuration reference

    Because ClickHouse is compatible with the MySQL wire protocol, Soda offers indirect support for ClickHouse data sources through the soda-mysql package.
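A sketch of such a configuration: point the soda-mysql connection settings at ClickHouse's MySQL interface. Port 9004 is ClickHouse's default MySQL-protocol port; the other values are placeholders.

```yaml
data_source my_datasource_name:
  type: mysql
  host: 127.0.0.1
  port: 9004
  username: simple
  password: simple_pass
  database: customers
```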

    Property
    Required
    Notes

    Test the data source connection

    To confirm that you have correctly configured the connection details for the data source(s) in your configuration YAML file, use the test-connection command. If you wish, add a -V option to the command to return results in verbose mode in the CLI.
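For example, assuming the data source is named my_datasource_name and the configuration file is configuration.yml:

```
soda test-connection -d my_datasource_name -c configuration.yml -V
```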

    Supported data types

    Category
    Data type

    Connect Soda to Dremio

    Access configuration details to connect Soda to a Dremio data source.

    Compatibility

    Soda supports Dremio version 22 or greater.

    Connection configuration reference

    Add Soda to a Databricks notebook

    Use this guide to invoke Soda data quality tests from inside a Databricks notebook.

    Use this guide to install and set up Soda in a Databricks notebook so you can run data quality tests on data in a Spark data source.

    🎥 Watch a video that demonstrates how to add Soda to your Databricks pipeline:

    About this guide

    The instructions below offer Data Engineers an example of how to write Python in a Databricks notebook to set up Soda, then write and execute scans for data quality in Spark.

    This example uses a programmatic deployment model which invokes the Soda Python library, and uses Soda Cloud to validate a commercial usage license and display visualized data quality test results. See:
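A sketch of what the notebook cells can contain, assuming Soda Library with the Spark DataFrame package is installed in the cluster; the source table, dataset name customers, and the check are illustrative, and the Scan API calls mirror Soda's programmatic scan pattern.

```python
from pyspark.sql import SparkSession
from soda.scan import Scan

spark = SparkSession.builder.getOrCreate()

# Register a DataFrame as a temporary view so Soda can query it as a dataset;
# the source table name here is a placeholder
df = spark.read.table("samples.tpch.customer")
df.createOrReplaceTempView("customers")

# Create the Scan object and point it at the active Spark session
scan = Scan()
scan.set_scan_definition_name("databricks_notebook_scan")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark, data_source_name="spark_df")

# Define checks in SodaCL, then execute the scan
scan.add_sodacl_yaml_str("""
checks for customers:
  - row_count > 0
""")
scan.execute()

# Inspect the scan object to review scan results
print(scan.get_scan_results())
```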

    Connect Soda to Amazon Athena

    Access configuration details to connect Soda to an Athena data source.

    For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see .

    Connection configuration reference

    Install package: soda-athena
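A hedged sketch of an Athena connection configuration; the property names below follow the soda-athena package as the author understands it, the values are placeholders, and staging_dir points to an S3 location where Athena can write query results. Confirm each property against the table below.

```yaml
data_source my_datasource_name:
  type: athena
  access_key_id: xxx
  secret_access_key: xxx
  region_name: eu-west-1
  staging_dir: s3://your-bucket/soda-staging
  schema: public
```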

    Property
    Required

    Connect Soda to MS SQL Server

    Access configuration details to connect Soda to an MS SQL Server data source.

    Connection configuration reference

    Install package: soda-sqlserver
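A minimal sketch with placeholder values, mirroring the Microsoft Fabric example in this documentation, which is based on the same soda-sqlserver package.

```yaml
data_source my_datasource_name:
  type: sqlserver
  host: host
  port: '1433'
  username: simple
  password: simple_pass
  database: database
  schema: dbo
  trusted_connection: false
  encrypt: false
  trust_server_certificate: false
  driver: ODBC Driver 18 for SQL Server
```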

    Property
    Required
    Notes

    Organize datasets

    Use attributes, tags, and filters to facilitate your search for the specific data quality status of your datasets.

    With dozens or even hundreds of datasets in your Soda Cloud account, it may be laborious to find the data quality information you're looking for. To facilitate your search for specific data quality status, consider defining your own Attributes and Tags for datasets, then use filters to narrow your search.

    Define dataset attributes

    Define new attributes for datasets in your organization that your colleagues can use to categorize datasets for easy identification and discovery. Consider adding multiple attributes to access precise cross-sections of data quality.

    Cross checks

    Use a SodaCL cross check to compare row counts across datasets in the same, or different, data sources.

    Use a cross check to compare row counts between datasets within the same, or different, data sources.

    See also:

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    Define cross checks

    In the context of , cross checks are unique. This check employs the row_count metric and is limited in its syntax variation, with only a few mutable parts to specify dataset and data source names.
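For example — the dataset and data source names below are illustrative placeholders:

```yaml
checks for dim_employee:
  # Compare row counts between datasets in the same data source
  - row_count same as dim_department_group
  # Compare row counts across data sources
  - row_count same as retail_customers in aws_postgres_retail
```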

    Integrate Soda with Microsoft Teams

    Integrate MS Teams in your Soda Cloud account so that Soda sends alert notifications and incident events to your MS Teams conversation.

    Configure Soda Cloud to connect your account to MS Teams so that you can:

    • send alert notifications for failed or warning check results to an MS Teams channel

    • start conversations to track and resolve data quality incidents with MS Teams

    Use case guides

    Access examples of Soda implementations according to use case and data quality testing needs.

    Use the following guides as example implementations based on how you intend to use Soda for data quality testing. For standard set up instructions, see .

    Connect Soda to a local file using Dask

    Set up Soda to programmatically scan the contents of a local file using Dask.

    For use with programmatic Soda scans only. Refer to Connect Soda to Dask and Pandas.

    Define a programmatic scan to use Soda to scan a local file for data quality. Refer to the following example, which executes a simple check on the row count of the dataset.

    data_source my_datasource_name:
      type: duckdb
      database: filename.db
      read_only: true
      schema_name: public
    soda test-connection -d my_datasource_name -c configuration.yml -V
    data_source my_datasource_name:
      type: db2
      host: 127.0.0.1
      port: 50000
      username: simple
      password: simple_pass
      database: database
      schema: public
    automated monitoring:
      datasets:
        - include prod%
        - exclude test%
    data_source my_datasource_name:
      type: denodo
      username: simple
      password: simple_pass
      host: 127.0.0.1
      port: 5432 
      sslmode: prefer
    data_source my_datasource_name:
      type: mysql
      host: 127.0.0.1
      port: 9004
      username: simple
      password: simple_pass
      database: customers

    Metaphor

    Embed Soda’s data quality details within Metaphor’s catalog interface. Leverage the Metaphor integration to visualize rules, metrics, and profiles in your governance workflow.

    Purview

    View Soda-powered data quality rules and metrics in Microsoft Purview. Provide your colleagues with confidence in table-level data through inline quality indicators.

    CI/CD & development tools

    dbt

    Ingest dbt-core or dbt Cloud test results into Soda Cloud to track and visualize your test outcomes over time, set alerts on failures, and manage data-quality incidents alongside dbt runs.

    GitHub Workflow

    Add the Soda GitHub Action to your workflows to automatically execute data-quality scans on pull requests or commits, with results posted as PR comments and in Soda Cloud reports.

    Messaging & collaboration

    Slack

    Send alert notifications and incident events to Slack channels. Create private incident channels, track Soda Discussions, and collaborate on failed checks in real time.

    MS Teams

    Route Soda alert notifications and incident updates into Microsoft Teams channels. Use Teams workflows to manage and triage data-quality issues.

    Incident & ticketing systems

    Jira

    Configure Soda Cloud incidents to automatically create and sync Jira tickets. Keep incident status, descriptions, and links up to date between Soda and Jira.

    ServiceNow

    Use webhooks to bridge Soda incidents with ServiceNow issues. Automate ticket creation and status updates from Soda Cloud into ServiceNow.

    webhooks

    Connect Soda Cloud to any HTTP endpoint—PagerDuty, OpsGenie, custom apps—for alert notifications, incident tracking, and agreement events.

    Security & access

    SSO

    Configure SAML 2.0 SSO with Azure AD, Okta, Google Workspace, or other IdPs. Simplify secure access, user provisioning, and group syncing for your Soda Cloud organization.

    Alation
    Atlan
    import dask.dataframe as dd
    from soda.scan import Scan
    
    # Create Soda Library Scan object and set a few required properties
    scan = Scan()
    scan.set_scan_definition_name("test")
    scan.set_data_source_name("dask")
    
    # Read a `cities` CSV file with columns 'city', 'population'
    ddf = dd.read_csv('cities.csv')
    
    scan.add_dask_dataframe(dataset_name="cities", dask_df=ddf)
    
    # Define checks using SodaCL
    
    checks = """
    checks for cities:
        - row_count > 0
    """
    
    # Add the checks to the scan and set output to verbose
    scan.add_sodacl_yaml_str(checks)
    
    scan.set_verbose(True)
    
    # Execute the scan
    scan.execute()
    
    # Inspect the scan object to review scan results
    scan.get_scan_results()

    Solution: Start Docker Desktop.


    Problem: When you run docker-compose up you get an error that reads Cannot start service soda-adventureworks: Ports are not available: exposing port TCP 0.0.0.0:5432 -> 0.0.0.0:0: listen tcp 0.0.0.0:5432: bind: address already in use.

    Solution: 1. Execute the command lsof -i tcp:5432 to print a list of PIDs using the port. 2. Use the PID value to run the following command to free up the port: kill -9 your_PID_value. You may need to prepend the commands with sudo. 3. Run the docker-compose up command again.

    Move on to Connect Soda.
    In your Soda account, navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys. Copy+paste the soda_cloud configuration syntax, including the API keys, into the configuration.yml file, as in the example below.
  • Save the configuration.yml file and close the API modal in your Soda account.

  • In Terminal, return to the tab in which the virtual environment is active in the soda_sip directory. Run the following command to test Soda's connection to the data source. Command:

  • No missing values in the last_name column. See Missing metrics.
  • No duplicate phone numbers validates that each value in the phone column is unique. See Numeric metrics.

  • Columns have not been added, removed, or changed compares the schema of the dataset to the last scan result to determine if any columns were added, deleted, changed data type, or changed index. The first time this check executes, the results show [NOT EVALUATED] because there are no previous values to which to compare current results. In other words, this check requires a minimum of two scans to evaluate properly. See Schema checks.

  • Data in this dataset is less than 7 days old confirms that the data in the dataset is less than seven days old. See Freshness checks.

  • Connect to your own data by configuring your data source connections in your existing configuration.yml file.

  • Adjust your checks.yml to point to your own dataset in your data source, then adjust the checks to apply to your own data. Go ahead and run a scan!

    Learn more about How Soda works.

    Adjust the settings in the Notifications section to customize your scan definition alerts, then Save. Refer to the table below for guidance.
    Then Save. Soda applies this timeout value to all scan definitions that use this Soda Agent.

    Send a notification when a scan fails.

    Check this box to instruct Soda to send a notification when a scan fails to complete, or completes with errors.

    Send a notification when a scan does not occur according to the scan definition.

    Check this box to instruct Soda to send a notification when a scan times out, meaning it does not complete within a specific time frame after the scheduled start time. See: Configure scan timeouts.

    Notify after

    Use the dropdown to select the time delay between when a scheduled scan fails or does not complete within the expected timeframe, and when Soda sends an alert notification. For example, set this to 12h to receive a notification 12 hours after Soda logged the failed or delayed scan.

    Notify recipients

    Use this field to identify to whom Soda sends scan failure or delay alert notifications.


    Need help? Join the .

    Provide a Name for your resource, then select POST as the HTTP method.
  • In the Script field, define a script that creates new tickets when a Soda Cloud incident is opened, and updates existing tickets when a Soda Cloud incident status is updated. Use the example below for reference. You may also need to define Security settings according to your organization's authentication rules.

  • Click Submit, then copy the value of the Resource path to use in the URL field in the Soda Cloud integration setup.


    Need help? Join the .

    Your data source is compatible with a Soda Agent; refer to tables below.

    Complete the guided steps to:

    assign ownership roles for the data source and its datasets

    Amazon Athena, Amazon Redshift, Azure Synapse, ClickHouse, Databricks SQL, Denodo, Dremio, DuckDB, GCP BigQuery, Google CloudSQL, IBM DB2, MotherDuck, MS SQL Server 1, MySQL, OracleDB, PostgreSQL, Presto, Snowflake, Trino, Vertica

    BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, Snowflake

    scan scheduling
    anomaly dashboards
    Soda overview
    Choose a flavor of Soda
    Add a new data source
    installed Soda Library
    scan
    self-hosted Soda Agent
    Set up a Soda-hosted agent
    requested preview access
    activate the anomaly dashboard
    no-code checks
    generate new API keys
    automating anomaly detection

Need help? Join the Soda community on Slack.

    Provide an identifier for your database.

    schema

    required

    Provide an identifier for the schema in which your dataset exists.

    trusted_connection

    optional

    Indicate connection trustworthiness by providing a boolean value: true or false. The default value is false. Set to true if you are using Active Directory authentication.

    encrypt

    optional

    Indicate the encryption status by providing a boolean value: true or false. The default value is false.

    trust_server_certificate

    optional

    Specifies whether encryption occurs if there is no verifiable server certificate. Provide a boolean value: true or false. The default value is false.

    driver

    optional

    Use this config setting to specify the ODBC driver version you use. For example, SQL Server Native Client 11.0 or ODBC Driver 18 for SQL Server.

    scope

    optional

    Access token scope.

    multi_subnet_failover

    optional

    Enable MultiSubnetFailover; see .

    authentication

    optional

    Authentication method to use. Supported values: sql, activedirectoryinteractive, activedirectorypassword, activedirectoryserviceprincipal, activedirectory, auto, cli, environment, synapsespark, and fabricspark. The default value is sql, which uses username and password
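Combined, the settings described above can be sketched as a configuration.yml entry. This is a sketch, not a definitive template: the data source name my_sqlserver and the environment variable names are placeholders, and which optional settings you need depends on your server.

```yaml
data_source my_sqlserver:
  type: sqlserver
  host: ${SQLSERVER_HOST}
  port: 1433
  username: ${SQLSERVER_USERNAME}
  password: ${SQLSERVER_PASSWORD}
  database: sodatest
  schema: dbo
  trusted_connection: false
  encrypt: true
  trust_server_certificate: false
  driver: ODBC Driver 18 for SQL Server
  authentication: sql
```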

    type

    required

    Identify the type of data source for Soda.

    host

    required

    Provide a host identifier.

    port

    optional

    Provide a port identifier. You can remove the port config setting entirely. Default: 1433.

    username

    required

    Use system variables to retrieve this value securely.

    password

    required

    Use system variables to retrieve this value securely.

    database

    required

    For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see .

    Severity: Minor, Major, or Critical

  • Status: Reported, Investigating, Fixing, Resolved

  • Lead: a list of team members from whom you can assign the Lead Investigator role

  • Save your changes.

  • If you have connected your Soda Cloud account to Slack, navigate to the Integrations tile, then click the auto-generated link that connects directly to a newly-created, public channel in your Slack workspace dedicated to investigating and resolving the incident. Invite team members to the channel to collaborate on resolving the data quality issue. If you have integrated Soda Cloud with MS Teams or another third-party tool, such as Jira or ServiceNow, you can access those tools via auto-generated links in the Integrations tile as well.

  • Status
    to Resolved.
    Organize your datasets
    Manage failed row samples
    Single Sign-on IdP
    data catalogs
    data pipeline tools

Need help? Join the Soda community on Slack.

    Install package: soda-dremio
    Property
    Required
    Notes

    type

    required

    Identify the type of data source for Soda.

    host

    required

    Provide a host identifier.

    port

    required

    Provide a port identifier.

    username

    required

    Consider using system variables to retrieve this value securely.
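Assembled into a configuration.yml entry, the properties above look roughly like the following. The data source name is a placeholder; the password key is assumed to follow the same pattern as the other data sources in this reference, and the port value is an assumption based on Dremio's default ODBC port:

```yaml
data_source my_dremio:
  type: dremio
  host: ${DREMIO_HOST}
  port: 31010
  username: ${DREMIO_USERNAME}
  password: ${DREMIO_PASSWORD}
```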

    Test the data source connection

To confirm that you have correctly configured the connection details for the data source(s) in your configuration YAML file, use the test-connection command. If you wish, add a -V option to the command to return results in verbose mode in the CLI.
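For example, for a data source named my_datasource (a placeholder) defined in configuration.yml:

```shell
soda test-connection -d my_datasource -c configuration.yml -V
```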

    Supported data types

    Category
    Data type

    text

    CHAR, VARCHAR, STRING

    number

    TINYINT, SMALLINT, INT, INTEGER, BIGINT, DOUBLE, FLOAT, DECIMAL

    time

    DATE, TIMESTAMP

    For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see .

    .

    Create a Soda Cloud account

    To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.

    1. In a browser, navigate to cloud.soda.io/signup to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.

    2. Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys.

    3. Copy+paste the API key values to a temporary, secure place in your local environment.
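The API key values plug into the soda_cloud section of your configuration.yml; retrieving them from environment variables, as sketched here, keeps them out of version control:

```yaml
soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
```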

    Set up Soda

    Soda Library has the following requirements:

    • Python 3.8, 3.9, or 3.10

    • Pip 21.0 or greater

    Python versions Soda supports

Soda officially supports Python versions 3.8, 3.9, and 3.10. Though largely functional, efforts to fully support Python 3.11 and 3.12 are ongoing.

Using Python 3.11, some users might encounter issues with dependency constraints. At times, the combination of Python 3.11 and dependency constraints requires that a dependency be built from source rather than downloaded pre-built.

The same applies to Python 3.12, although there is some anecdotal evidence indicating that 3.12 might not work in all scenarios due to dependency constraints.
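With a supported Python version in place, installation follows the usual pip pattern. The soda-postgres package name below is only an example; choose the package that matches your data source:

```shell
# create and activate a virtual environment on a supported Python version
python3.10 -m venv .venv
source .venv/bin/activate

# install the Soda Library package for your data source from Soda's package index
pip install -i https://pypi.cloud.soda.io soda-postgres
```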

    Download the notebook: Soda Databricks notebook
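In outline, the notebook's scan of a Spark DataFrame follows the shape below. This is a minimal sketch, assuming the Soda package for Spark DataFrames is installed and that spark (the session) and df (a DataFrame) already exist in the notebook; the view name and check are illustrative only:

```python
from soda.scan import Scan

# expose the DataFrame as a temporary view that Soda can query
df.createOrReplaceTempView("customers")

scan = Scan()
scan.set_scan_definition_name("databricks_notebook_scan")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark, data_source_name="spark_df")
scan.add_sodacl_yaml_str("""
checks for customers:
  - row_count > 0
""")
scan.execute()
print(scan.get_logs_text())
```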

    Go further

    • Use Soda to test data in a Databricks pipeline.

    • Learn more about SodaCL checks and metrics.

    • Access instructions to Generate API Keys.

    https://go.soda.io/soda-databricks-video
    Choose a flavor of Soda

Need help? Join the Soda community on Slack.

    Notes

    type

    required

    Identify the type of data source for Soda.

    access_key_id

    required 1

    Consider using system variables to retrieve this value securely. See .

    secret_access_key

    required 1

    Consider using system variables to retrieve this value securely. See .

    region_name

    optional

    The endpoint your AWS account uses. Refer to .

    role_arn

    optional 2

    Specify role to use for authentication and authorization.

    staging_dir

    1 access_key_id and secret_access_key are required parameters to obtain an authentication token from Amazon Athena or Redshift. You can provide these key values in the configuration file or as environment variables.

2 You may add the optional role_arn parameter, which first authenticates with the access keys, then uses the role to obtain temporary tokens that allow for authentication. Depending on your Athena or Redshift setup, you may be able to use only the role_arn to authenticate, though Athena still must access the keys from a config file or environment variables. See AWS Boto3 documentation for details on the progressive steps it takes to access the credentials it needs to authenticate.

    Some users who access their Athena or Redshift data source via a self-hosted Soda Agent deployed in a Kubernetes cluster have reported that they can use IAM roles for Service Accounts to authenticate, as long as the IAM role that the Kubernetes pod has from the Kubernetes Service Account has the permissions to access Athena or Redshift. See Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster.
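Pulled together, an Athena entry in configuration.yml might look like the following sketch; the data source name, region, S3 bucket, and role ARN are all placeholders:

```yaml
data_source my_athena:
  type: athena
  access_key_id: ${AWS_ACCESS_KEY_ID}
  secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: eu-west-1
  staging_dir: s3://my-bucket/soda-athena-staging/
  # optional: assume this role after authenticating with the access keys
  role_arn: arn:aws:iam::123456789012:role/soda-scan
```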

    Test the data source connection

To confirm that you have correctly configured the connection details for the data source(s) in your configuration YAML file, use the test-connection command. If you wish, add a -V option to the command to return results in verbose mode in the CLI.

    Supported data types

    Category
    Data type

    text

    CHAR, VARCHAR, STRING

    number

    TINYINT, SMALLINT, INT, INTEGER, BIGINT, DOUBLE, FLOAT, DECIMAL

    time

    DATE, TIMESTAMP

    Get started

    required

    Identify the type of data source for Soda.

    host

    required

    Provide a host identifier.

    port

    optional

    Provide a port identifier. You can remove the port config setting entirely. Default: 1433.

    username

    required

    Use system variables to retrieve this value securely.

    password

    required

    Use system variables to retrieve this value securely.

    database

    required

    Provide an identifier for your database.

    schema

    required

    Provide an identifier for the schema in which your dataset exists.

    trusted_connection

    optional

    Indicate connection trustworthiness by providing a boolean value: true or false. The default value is false. Set to true if you are using Active Directory authentication.

    encrypt

    optional

    Indicate the encryption status by providing a boolean value: true or false. The default value is false.

    trust_server_certificate

    optional

Specifies whether encryption occurs if there is no verifiable server certificate. Provide a boolean value: true or false. The default value is false.

    driver

    optional

    Use this config setting to specify the ODBC driver version you use. For example, SQL Server Native Client 11.0 or ODBC Driver 18 for SQL Server.

    scope

    optional

    Access token scope.

    multi_subnet_failover

    optional

    Enable MultiSubnetFailover; see .

    authentication

    optional

    Authentication method to use. Supported values: sql, activedirectoryinteractive, activedirectorypassword, activedirectoryserviceprincipal, activedirectory. The default value is sql which uses username and password to authenticate.

    Supported data types

    Category
    Data type

    text

    CHAR, VARCHAR, TEXT, NCHAR, NVARCHAR, BINARY

    number

    BIG INT, NUMERIC, BIT, SMALLINT, DECIMAL, SMALLMONEY, INT, TINYINT, MONEY, FLOAT, REAL

    time

    DATE, TIME, DATETIME, DATETIMEOFFSET

    type

    For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see .

    by organizational department: Product Marketing, Engineering-FE, Finance-AP, Customer Success

  • by product

  • by data domain: Customer data, Product data, Order & Fulfillment data

  • by internal objectives and key results (OKR)

    1. As a user with the permission to do so in your Soda Cloud account, navigate to your avatar > Attributes > New Attribute.

    2. Follow the guided steps to create the new attribute. Use the details below for insight into the values to enter in the fields in the guided steps.

    Field or Label
    Guidance

    Label

Enter the key for the key:value pair that makes up the attribute. For example, if you define a dataset attribute's key as department, its value could be marketing or finance.

    Resource Type

    Select Dataset to define an attribute for a dataset.

    Type

Define the type of input a dataset owner may use for the value that pairs with the attribute's key: Single select, Multi select, Checkbox, Text, Number, or Date. Note that during a scan, Soda validates that the type of input for an attribute's value matches the expected type. For example, if your attribute's type is Number and the dataset owner enters a value of one instead of 1, the scan produces an error to indicate the incorrect attribute value.

    Allowed Values

    Applies only to Single select and Multi select. Provide a list of values that a check author may use when applying the attribute key:value pair to a check.

    Description

    (Optional) Provide details about the check attribute to offer guidance for your fellow Soda users.

    Adjust attributes

    • Once created, you cannot change the type of your attribute. For example, you cannot change a checkbox attribute into a multi-select attribute.

    • Once created, you can change the display name of an attribute.

    • For a single- or multi-select attribute, you can remove, change, or add values to the list of available selections. However, if you remove or change values on such a list, you cannot search for the deleted or previous value in the dataset filter.

    Apply an attribute to a dataset

    While only a Soda Cloud Admin can define or revise dataset attributes, any Admin, Manager, or Editor for a dataset can apply attributes to it.

    1. Navigate to the Datasets dashboard, click the stacked dots next to a dataset, then select Edit Dataset. Use the attributes fields to apply the appropriate attributes to the dataset.

    2. While editing a dataset, consider adding Tags to the dataset as well. Use tags to:

      • identify datasets that are associated with a particular marketing campaign

      • identify datasets that are relevant for a particular customer account

      • identify datasets whose quality is critical to business operations, or to categorize datasets according to their criticality in general, such as “high”, “medium”, and “low”.

      • identify datasets that populate a particular report or dashboard

    3. After saving your changes and applying tags and attributes to multiple datasets, use the Filters in the Datasets dashboard to display the datasets that help narrow your study of data quality, then click Save Collection to name the custom filtered view.

    4. In the future, use the dropdown in the Checks dashboard to quickly access your collection again.

    Go further

    • Create alerts to notify your team of data quality issues.

    • Learn how to create and track data quality Incidents.

Need help? Join the Soda community on Slack.

    The example check below compares the volume of rows in two datasets in the same data source. If the row count in the dim_department_group is not the same as in dim_customer, the check fails.

    You can use cross checks to compare row counts between datasets in different data sources, as in the example below.

    In the example, retail_customers is the name of the other dataset, and aws_postgres_retail is the name of the data source in which retail_customers exists.

    • If you wish to compare row counts of datasets in different data sources, you must have configured a connection to both data sources. Soda needs access to both data sources in order to execute a cross check between data sources.

    • The data sources do not need to be the same type; you can compare a dataset in a PostgreSQL data source to a dataset in a BigQuery data source.
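Expressed in SodaCL, the two cases above look like this; the dataset and data source names follow the examples in this section:

```yaml
checks for dim_department_group:
  # same data source: compare row counts of two datasets
  - row_count same as dim_customer

checks for dim_customer:
  # different data source: compare against retail_customers
  # in the aws_postgres_retail data source
  - row_count same as retail_customers in aws_postgres_retail
```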

    Optional check configurations

    Supported
    Configuration
    Documentation

    ✓

    Define a name for a cross check; see .

    ✓

    Add an identity to a check.

    Define alert configurations to specify warn and fail alert conditions.

    -

    Apply an in-check filter to return results for a specific portion of the data in your dataset.

    Example with check name

    Example with quotes

    Example with dataset filters

    Go further

    • Learn more about SodaCL metrics and checks in general.

    • Learn more about Comparing data using SodaCL.

    • Use a schema check to discover missing or forbidden columns in a dataset.

    • Reference tips and best practices for SodaCL.

    Compare data using SodaCL
    SodaCL check types

Need help? Join the Soda community on Slack.

    Configure an MS Teams integration
    1. As a user with permission to do so, log in to your Soda Cloud account, navigate to your avatar > Organization Settings, then select the Integrations tab.

    2. Click the + at the upper right of the table of integrations to add a new integration.

    3. In the Add Integration dialog box, select Microsoft Teams.

    4. In the first step of the guided integration workflow, follow the instructions to navigate to your MS Teams account to create a Workflow; see Microsoft's documentation for . Use the Workflow template to Post to a channel when a webhook request is received.

    5. In the last step of the guided Workflow creation, copy the URL created after successfully adding the workflow.

    6. Returning to Soda Cloud with the URL for Workflow, continue to follow the guided steps to complete the integration. Reference the following tables for guidance on the values to input in the guided steps.

    Field or Label
    Guidance

    Name

    Provide a unique name for your integration in Soda Cloud.

    URL

    Input the Workflow URL you obtained from MS Teams.

    Enable to send notifications to Microsoft Teams when a check result triggers an alert.

    Check to allow users to select MS Teams as a destination for alert notifications when check results warn or fail.

    Use Microsoft Teams to track and resolve incidents in Soda Cloud.

    Check to automatically send incident information to an MS Teams channel.

    Channel URL

    Provide a channel identifier to which Soda Cloud sends all incident events.

    Use Microsoft Teams to track discussions in Soda Cloud.

    Check to automatically send notifications to an MS Teams channel when a user creates or modifies a discussion in Soda Cloud.

    About integration scopes

Use the Alert Notification scope to enable Soda Cloud to send alert notifications to an MS Teams channel to notify your team of warn and fail check results. With such an integration, Soda Cloud enables users to select MS Teams as the destination for an alert notification of an individual check, of checks that form part of an agreement, or of multiple checks. To send notifications that apply to multiple checks, see Set notification rules.

    Use the Incident scope to notify your team when a new incident has been created in Soda Cloud. With such a scope, Soda Cloud displays an external link to the MS Teams channel in the Incident Details. Soda Cloud sends all incident events to only one channel in MS Teams. As such, you must provide a separate link in the Channel URL field in the Define Scope tab. For example, https://teams.microsoft.com/mychannel. To obtain the channel link in MS Teams, right-click on the channel name in the overview sidebar. Refer to Incidents for more details about using incidents in Soda Cloud.

Use the Discussions scope to post to a channel when a user creates or modifies a Soda Cloud discussion. Soda Cloud sends all discussion events to only one channel in MS Teams. As such, you must provide a separate link in the Channel URL field in the Define Scope tab. For example, https://teams.microsoft.com/mychannel. To obtain the channel link in MS Teams, right-click on the channel name in the overview sidebar. Refer to Begin a discussion and propose checks for more details about using discussions in Soda Cloud.

    Troubleshoot

    Problem: You encounter an error that reads, "Error encountered while rendering this message."

    Solution: A fix is documented, the short version of which is as follows.

    1. Restart MS Teams.

    2. Clear your cache and cookies.

    3. If you have not already done so, update to the latest version of MS Teams.

    Go further

    • Learn more about general webhooks to integrate Soda Cloud with other third-party service providers.

    • Set notification rules that apply to multiple checks in your account.

    • Access a list of all integrations that Soda Cloud supports.

    alert notifications
    incidents

Need help? Join the Soda community on Slack.

    Soda Library Soda Cloud

    Use this guide to set up Soda to test before and after data migration between data sources.

    Soda Library Soda Cloud

    Use this guide to set up Soda Cloud to enable users across your organization to serve themselves when it comes to testing data quality.

    Soda Cloud Soda Agent

    Use this guide to set up Soda to test the quality of your data during your development lifecycle in a GitHub Workflow.

    Soda Library Soda Cloud

    Use this guide to set up Soda to automatically monitor data quality.

    Soda Cloud Soda Agent

Use the following how-tos for practical advice, examples, and instructions for using Soda.

    Learn how to build a customized data quality reporting dashboard in Sigma using the Soda Cloud API.

    Soda Library Soda Cloud

    Learn how to build a customized data quality reporting dashboard in Grafana using the Soda Cloud API.

    Soda Cloud

    Learn how to invoke Soda data quality tests in a Databricks notebook.

    Soda Library Soda Cloud

    Learn how to set up a Soda Agent to use an External Secrets Manager to retrieve frequently-rotated data source passwords.

    Soda Cloud Self-hosted Agent

    Learn how to use Soda Cloud API keys to securely communicate with other entities such as Soda Library and self-hosted Soda Agents, and to provide secure access to Soda Cloud via API.

    Soda Cloud

    Need help? Join the Soda community on Slack.

    Test data in an Airflow pipeline

    Use this guide as an example for how to set up Soda to test the quality of your data in an Airflow pipeline that uses dbt transformations.

    Soda Library Soda Cloud

    Test data quality in an ADF pipeline

    Learn how to invoke Soda data quality tests in an ETL pipeline in Azure Data Factory.

    Soda Library Soda Cloud

    Test data quality in a Dagster pipeline

    Learn how to invoke Soda data quality tests in a Dagster pipeline.

    Soda Library Soda Cloud

    Test data quality in Databricks pipeline

    Get started

Learn how to use Databricks notebooks with Soda to test data quality before feeding a machine learning model.

    password

    required

    Consider using system variables to retrieve this value securely.

    database

    required

    Provide an identifier for your database.

    schema

    optional

    Provide an identifier for the schema in which your dataset exists.

    For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see .

    Send all check alerts to the Check Owner

    Soda Cloud sends all check results that fail or warn to the Soda Cloud user who created or owns an individual check.

    Send all check alerts to the Dataset Owner

    Soda Cloud sends all check results that fail or warn to the Soda Cloud user who created or owns the dataset to which the checks are associated.

    Name

    Provide a unique identifier for your notification.

    For

    Select All Checks, or select Selected Checks to use conditions to identify specific checks to which you want the rule to apply. You can identify checks according to several attributes such as Data Source Name, Dataset Name, or Check Name.

    Notify Recipient

    Select the destination to which this rule sends its notifications. For example, you can send the rule’s notifications to a channel in Slack.

    Notify About

    Identify the notifications this rule sends based on the severity of the check result: warn, fail, or both.

    edit or delete
    Data source, dataset, agreement, and check owners
    metrics and checks
    Slack workspace
    webhook

Need help? Join the Soda community on Slack.

    Anomaly score checks
    Schema checks
    anomaly dashboard
    anomaly dashboards
    anomaly dashboard
    tips and best practices for SodaCL
    freshness check
    reference checks

Need help? Join the Soda community on Slack.

    type

    required

    Identify the type of data source for Soda.

    username

    required

    Consider using system variables to retrieve this value securely.

    password

    required

    Consider using system variables to retrieve this value securely.

    host

    required

    Provide a host identifier.

    port

    optional

    Provide a port identifier.

    database

    optional

    Provide a virtual database (VDB) name.

    connection_timeout

    optional

    Provide an integer value to represent seconds.

    sslmode

    optional

Provide a value to indicate the type of SSL support: prefer, require, allow, or disable. The default value is prefer.
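Taken together, a Denodo entry in configuration.yml might look like the following sketch; the data source name and VDB name are placeholders, and the port value is an assumption based on Denodo's default ODBC port:

```yaml
data_source my_denodo:
  type: denodo
  host: ${DENODO_HOST}
  port: 9996
  username: ${DENODO_USERNAME}
  password: ${DENODO_PASSWORD}
  database: my_virtual_database
  connection_timeout: 60
  sslmode: prefer
```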

    text

    CHARACTER VARYING, CHARACTER, CHAR, TEXT

    number

    SMALLINT, INTEGER, BIGINT, DECIMAL, NUMERIC, VARIABLE, REAL, DOUBLE PRECISION, SMALLSERIAL, SERIAL, BIGSERIAL

    time

    TIMESTAMP, DATE, TIME, TIMESTAMP WITH TIME ZONE, TIMESTAMP WITHOUT TIME ZONE, TIME WITH TIME ZONE, TIME WITHOUT TIME ZONE

    For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see .

    type

    required

    Identify the type of data source for Soda.

    host

    required

    Provide a host identifier.

    port

    required

    Provide a port identifier.

    username

    required

    Use system variables to retrieve this value securely.

    password

    required

    Use system variables to retrieve this value securely.

    database

    required

    Provide an identifier for your database.

    text

    CHAR, VARCHAR, TEXT

    number

    BIG INT, NUMERIC, BIT, SMALLINT, DECIMAL, SMALLMONEY, INT, TINYINT, MONEY, FLOAT, REAL

    time

    DATE, TIME, DATETIME, DATETIMEOFFSET

    For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see .

    Configure Soda
    Reporting API authentication

Need help? Join the Soda community on Slack.

    duckdb database
    persistent storage
    how to create a .db file
    MotherDuck database

    For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see .

    Compare data using SodaCL

    Learn how to use the Soda Checks Language to compare data across datasets in the same, or different, data sources.

    There are several ways to use SodaCL metrics and checks to compare data across datasets and data sources. The following offers some advice about how and when to use different types of checks to obtain the comparison results you need.

    See also: Reconciliation checks

    Have you got an idea or example of how to compare data that we haven't documented here? Let us know!


    Compare data in the same data source and schema

    Use a to conduct a row count comparison between datasets in the same data source. If you wish to compare datasets in different data sources, or datasets in the same data source but with different schemas, see .

    Use a to conduct a row-by-row comparison of values in two datasets in the same data source and return a result that indicates the volume and samples of mismatched rows, as in the following example which ensures that the values in each of the two names columns are identical. If you wish to compare datasets in the same data source but with different schemas, see .

    Alternatively, you can use a to customize a SQL query that compares the values of datasets.

    Compare partitioned data in the same data source but different schemas

    If you wish to compare data between datasets in different schemas, but only compare partitioned data from each dataset, you can use dataset filters.

    Note that not all data sources fully support the schema.dataset format for the dataset identifier in a check, as included in the following example. Some users have reported success using this syntax.

    Output:

    Compare data in different data sources or schemas

Use a to conduct a simple row count comparison of datasets in two different data sources, as in the following example that compares the row counts of two datasets in different data sources. Note that each data source involved in this check has been connected to Soda, either in the configuration.yml file with Soda Library or in the Add Data Source workflow in Soda Cloud.

You can use a to compare the values of different datasets in the same data source and schema. If the datasets are in different schemas, as might happen when you have different environments such as production, staging, and development, then Soda considers those datasets as belonging to different data sources. Where that is the case, you have a couple of options.

    You can use a cross check to compare the row count of datasets in the same data source, but with different schemas. First, you must add dataset + schema as a separate data source connection in your configuration.yml, as in the following example that uses the same connection details but provides different schemas:
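A sketch of such a configuration, assuming a PostgreSQL data source: the connection details are placeholders, and only the schema value differs between the two entries:

```yaml
data_source prod_retail:
  type: postgres
  host: ${POSTGRES_HOST}
  username: ${POSTGRES_USERNAME}
  password: ${POSTGRES_PASSWORD}
  database: retail
  schema: prod

data_source staging_retail:
  type: postgres
  host: ${POSTGRES_HOST}
  username: ${POSTGRES_USERNAME}
  password: ${POSTGRES_PASSWORD}
  database: retail
  schema: staging
```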

    Then, you can define a cross check that compares values across these data sources.

    Alternatively, depending on the type of data source you are using, you can use a to write a custom SQL query that compares contents of datasets that you define by adding the schema before the dataset name, such as prod.retail_customers and staging.retail_customers.

    The following example accesses a single Snowflake data source and compares values between the same datasets but in different databases and schemas: prod.staging.dmds_scores and prod.measurement.post_scores.
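As a sketch, a failed rows check with a custom query could express that comparison; the join key id and the compared column score are hypothetical, so substitute your own columns:

```yaml
checks for dmds_scores:
  - failed rows:
      fail query: |
        SELECT a.id, a.score AS staging_score, b.score AS post_score
        FROM prod.staging.dmds_scores a
        FULL OUTER JOIN prod.measurement.post_scores b
          ON a.id = b.id
        WHERE a.score IS DISTINCT FROM b.score
           OR a.id IS NULL
           OR b.id IS NULL
```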

    See also:

    Compare dates in a dataset to validate event sequence

    You can use a user-defined metric to write a custom SQL query that compares date values in the same dataset. Refer to .

    Go further

    • Read more about in Soda Cloud.

    • Learn more about in general.

    Choose a flavor of Soda

    Use this guide to help you decide which Soda deployment model best fits your data quality testing needs.

    The Soda environment has been updated since this tutorial.

    Refer to for updated tutorials.

Soda is a lightweight, versatile tool for testing and monitoring data quality, and you have several options for deploying it in your environment.

    As the first step in the Get started roadmap, this guide helps you decide how to set up Soda to best meet your data quality testing and monitoring needs. After choosing a flavor of Soda (type of deployment model), access the corresponding Set up Soda instructions below.

    Get started roadmap

    1. Choose a flavor of Soda 📍 You are here!

    2. Set up Soda: sign up and install, deploy, or invoke

    3. Write SodaCL checks

    4. Run scans and review results

    Choose a flavor of Soda

    This guide helps you decide how to set up Soda to best meet your data quality testing and monitoring needs. You can set up Soda in one or more of four flavors.

    Why do I need a Soda Cloud account?

    To validate your account license or free trial, Soda Library or a Soda Agent must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library or a Soda Agent.

    Self-operated

    This simple setup enables you to pip install Soda Library from the command-line, then prepare YAML files to:

    • configure connections to your data sources to run scans

    • configure the connection to your Soda Cloud account to validate your license and visualize and share data quality check results

    • write data quality checks

    Use this setup for:

    ✅ A small team: Manage data quality within a small data engineering team or data analytics team who is comfortable working with the command-line and YAML files to design and execute scans for data quality.

    ✅ POC: Conduct a proof-of-concept evaluation of Soda as a data quality testing and monitoring tool. See:

    ✅ Basic DQ: Start from scratch to set up basic data quality checks on key datasets. See:

    ✅ Data migration: Migrate good-quality data from one data source to another. See:

    Requirements:

    • Python 3.8, 3.9, or 3.10

    • Pip 21.0 or greater

    • Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)

    Soda-hosted agent

    Recommended

    This setup provides a secure, out-of-the-box Soda Agent to manage access to data sources from within your Soda Cloud account. Quickly configure connections to your data sources in the Soda Cloud user interface, then empower all your colleagues to explore datasets, access check results, customize collections, and create their own no-code checks for data quality.

    See also:

    Use this setup for:

    ✅ A quick start: Use the out-of-the-box agent to start testing data quality right away from within the Soda Cloud user interface, without the need to install or deploy any other tools.

    ✅ Anomaly detection dashboard:

    Available in 2025

    Use Soda's out-of-the-box anomaly dashboards to get automated insights into basic data quality metrics for your datasets. See:

    ✅ Automated data monitoring: Set up data profiling and automated data quality monitoring. See:

    ✅ Self-serve data quality: Empower data analysts and scientists to self-serve and create their own no-code checks for data quality. See:

    ✅ Data migration: Migrate good-quality data from one data source to another. See:

    ✅ Data catalog integration: Integrate Soda with a data catalog such as Atlan, Alation, or Metaphor. See:

    Soda hosts agents in a secure environment in Amazon AWS. As a SOC 2 Type 2 certified business, Soda responsibly manages Soda-hosted agents to ensure that they remain private, secure, and independent of all other hosted agents. See for details.

    Requirements:

    • Login credentials for your data source (BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake); Soda securely stores passwords as

    Self-hosted agent

    This setup enables a data or infrastructure engineer to deploy Soda Library as an agent in a Kubernetes cluster within a cloud-services environment such as Google Cloud Platform, Azure, or AWS.

    The engineer can manage access to data sources while giving Soda Cloud end-users easy access to Soda check results and enabling them to write their own checks for data quality. Users connect to data sources and create no-code checks for data quality directly in the Soda Cloud user interface.

    See also:

    Use this setup for:

    ✅ Self-serve data quality: Empower data analysts and scientists to self-serve and create their own checks for data quality. See:

    ✅ Data migration: Migrate good-quality data from one data source to another. See:

    ✅ Anomaly detection dashboard:

    Available in 2025

    Use Soda's out-of-the-box anomaly dashboards to get automated insights into basic data quality metrics for your datasets. See:

    ✅ Data catalog integration: Integrate Soda with a data catalog such as Atlan, Alation, or Metaphor. See:

    ✅ Secrets manager integration: Integrate your Soda Agent with an external secrets manager to securely access frequently-rotated data source login credentials. See:

    Requirements:

    • Access to your cloud-services environment, plus the authorization to deploy containerized apps in a new or existing Kubernetes cluster

    • Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)

    Programmatic

Use this setup to invoke Soda programmatically in, for example, an Airflow DAG or a GitHub workflow. You provide connection details for data sources and Soda Cloud inline or in external YAML files, and similarly define data quality checks inline or in a separate YAML file.

    Use this setup for:

    ✅ Testing during development: Test data before and after ingestion and transformation during development. See:

    ✅ Circuit-breaking in a pipeline: Test data in an Airflow pipeline so as to enable circuit breaking that prevents bad-quality data from having a downstream impact. See:

    ✅ Databricks Notebook: Invoke Soda data quality scans in a Databricks Notebook. See:

    Requirements:

    • Python 3.8, 3.9, or 3.10

    • Pip 21.0 or greater

    • Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)

    Soda-hosted vs. self-hosted agent

Though the two agent types are similar, your choice between them depends upon the following factors.

    Factor
    Soda-hosted agent
    Self-hosted agent

    Next

    1. Choose a flavor of Soda

    2. Set up Soda. Select the setup instructions that correspond with your flavor of Soda:

3. Write SodaCL checks

4. Run scans and review results

5. Organize, alert, investigate

Need help? Join the Soda community on Slack.

    Sample data with Soda

    Configure Soda Cloud to retrieve sample data from your datasets so you can leverage the information to write SodaCL checks for data quality.

    When you add or edit a data source in Soda Cloud, use the sample datasets configuration to send 100 sample rows to Soda Cloud. Examine the sample rows to gain insight into the type of checks you can prepare to test for data quality.

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent


    Sample datasets

The sample datasets configuration captures sample rows from datasets you identify. You add sample datasets configurations as part of the guided workflow to create a new data source or edit an existing one. Navigate to your avatar > Data Sources > New Data Source, or select an existing data source, to begin. You can add this configuration to one of two places:

    • to either step OR

    • or step

    The example configuration below uses a wildcard character (%) to specify that Soda Library sends sample rows to Soda Cloud for all datasets with names that begin with customer, and not to send samples for any dataset with a name that begins with test.
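A sketch of that configuration, using the sample datasets syntax that appears elsewhere in this guide:

```yaml
sample datasets:
  datasets:
    - include customer%
    - exclude test%
```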

    You can also specify individual datasets to include or exclude, as in the following example.

    Scan results in Soda Cloud

1. To review the sample rows in Soda Cloud, first run a scan of your data source so that Soda can gather and send samples to Soda Cloud.

    2. In Soda Cloud, navigate to the Datasets dashboard, then click a dataset name to open the dataset's info page.

    3. Access the Sample Data tab to review the sample rows.

    Add quotes to all datasets

    If your dataset names include white spaces or use special characters, you must wrap those dataset names in quotes whenever you identify them to Soda, such as in a checks YAML file.

    To add those necessary quotes to dataset names that Soda acts upon automatically – discovering, profiling, or sampling datasets, or creating automated monitoring checks – you can add a quote_tables configuration to your data source, as in the following example.
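A minimal sketch of that configuration; the data source name, type, and credentials are placeholders, so keep your own connection properties and add the quote_tables key:

```yaml
data_source soda_demo:
  type: sqlserver
  host: localhost
  username: ${SQL_USERNAME}
  password: ${SQL_PASSWORD}
  quote_tables: true
```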

    Inclusion and exclusion rules

    • If you configure sample datasets to include specific datasets, Soda implicitly excludes all other datasets from sampling.

    • If you combine an include config and an exclude config and a dataset fits both patterns, Soda excludes the dataset from sampling.
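These precedence rules can be sketched in a few lines of Python. This is a hypothetical helper for illustration only, not Soda's implementation; SodaCL's % wildcard is translated to fnmatch's * here.

```python
import fnmatch

def is_sampled(dataset: str, includes: list[str], excludes: list[str]) -> bool:
    """Hypothetical helper illustrating the precedence rules; not Soda's implementation."""
    # SodaCL uses % as the wildcard; fnmatch uses *
    inc = [p.replace("%", "*") for p in includes]
    exc = [p.replace("%", "*") for p in excludes]
    # When include patterns are configured, all other datasets are implicitly excluded
    included = any(fnmatch.fnmatch(dataset, p) for p in inc) if inc else True
    # A dataset that fits both an include and an exclude pattern is excluded
    excluded = any(fnmatch.fnmatch(dataset, p) for p in exc)
    return included and not excluded

print(is_sampled("prod_orders", ["prod%"], ["prod_test%"]))       # True
print(is_sampled("prod_test_orders", ["prod%"], ["prod_test%"]))  # False: exclusion wins
print(is_sampled("staging_orders", ["prod%"], []))                # False: implicitly excluded
```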

    Disable samples in Soda Cloud

    Where your datasets contain sensitive or private information, you may not want to send samples from your data source to Soda Cloud. In such a circumstance, you can disable the feature completely in Soda Cloud.

    To prevent Soda Cloud from receiving any sample data or failed row samples for any datasets in any data sources to which you have connected your Soda Cloud account, proceed as follows:

1. As a user with the permissions to do so, log in to your Soda Cloud account and navigate to your avatar > Organization Settings.

    2. In the Organization tab, uncheck the box to Allow Soda to collect sample data and failed row samples for all datasets, then Save.

    Alternatively, if you use Soda Library, you can adjust the configuration in your configuration.yml to disable all samples for an individual data source, as in the following example.
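A sketch of that configuration, assuming the sampler's disable_samples flag (connection properties abbreviated):

```yaml
data_source my_datasource:
  type: postgres
  host: localhost
  ...
  sampler:
    disable_samples: true
```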

Note that you cannot use an exclude_columns configuration to disable sample row collections from specific columns in a dataset. That configuration applies only to failed row samples.

    Go further

• Learn more about managing sensitive data with Soda.

    • Reference .

• Use a freshness check to gauge how recently your data was captured.

• Use a reference check to compare the values of one column to another.

    Connect Soda to MySQL

    Access configuration details to connect Soda to a MySQL data source.

For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see Get started.

    Connection configuration reference

    Install package: soda-mysql

    data_source my_datasource_name:
      type: mysql
      host: 127.0.0.1
      username: simple
      password: simple_pass
      database: customers
    Property
    Required
    Notes

    Supported data types

    Category
    Data type

    Configure orchestrated scans

    Integrate Soda Library with a data orchestration tool to automate and schedule your search for "bad" data.

Integrate Soda Library with a data orchestration tool such as Airflow to automate and schedule your search for bad-quality data.

    Configure actions that the orchestration tool can take based on scan output. For example, if the output of a scan reveals a large number of failed tests, the orchestration tool can automatically block "bad" data from contaminating your data pipeline.

    📚 Consider following the guide for specific details about embedding Soda tests in an Airflow pipeline.

    🎥 Consider following a 30-minute Astronomer tutorial for .

    Integrate Soda with Alation

    Integrate Soda with Alation to access details about the quality of your data from right within your data catalog.

    Integrate Soda with Alation to access details about the quality of your data from within the data catalog.

    • Run data quality checks using Soda and visualize quality metrics and rules within the context of a data source, dataset, or column in Alation.

    • Use Soda Cloud to flag poor-quality data in lineage diagrams and during live querying.

    • Give your Alation users the confidence of knowing that the data they are using is sound.

    For each

Use a SodaCL for each check to specify a list of checks you wish to execute on multiple datasets.

    Use a for each configuration to execute checks against multiple datasets during a scan.

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    Define a for each configuration

    Add a for each section to your checks configuration to specify a list of checks you wish to execute on multiple datasets.
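For example, the following sketch uses SodaCL's for each syntax to run the same two checks against every dataset whose name begins with dim_; the dataset patterns and column name are illustrative:

```yaml
for each dataset T:
  datasets:
    - include dim_%
    - exclude test_%
  checks:
    - row_count > 0
    - missing_count(id) = 0
```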

    Deploy a self-hosted Soda Agent from Soda's private container registry

    What has changed?

As of July 2025, the container images required for the self-hosted Soda agent are distributed via private registries hosted by Soda.

    EU cloud customers will use the EU registry located at registry.cloud.soda.io. US cloud customers will use the US registry located at registry.us.soda.io.

    The images currently distributed through Docker Hub will stay available there. New releases will only be available in the Soda-hosted registries.

Existing or new Soda Cloud API keys can be used to authenticate to the Soda-hosted registries.


For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see Get started.

For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see Get started.

    mkdir soda_sip
    cd soda_sip
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -i https://pypi.cloud.soda.io soda-postgres
    soda --help
    docker run \
     --name sip-of-soda \
     -p 5432:5432 \
     -e POSTGRES_PASSWORD=secret \
     sodadata/soda-adventureworks
    data_source adventureworks:
      type: postgres
      host: localhost
      username: postgres
      password: secret
      database: postgres
      schema: public
    soda test-connection -d adventureworks -c configuration.yml
    Soda Library 1.x.x
    Soda Core 3.x.x
    Successfully connected to 'adventureworks'.
    Connection 'adventureworks' is valid.
    checks for dim_customer:
      - invalid_count(email_address) = 0:
              valid format: email
              name: Ensure values are formatted as email addresses
      - missing_count(last_name) = 0:
              name: Ensure there are no null values in the Last Name column
      - duplicate_count(phone) = 0:
              name: No duplicate phone numbers
      - freshness(date_first_purchase) < 7d:
              name: Data in this dataset is less than 7 days old
      - schema:
              warn:
                when schema changes: any
              name: Columns have not been added, removed, or changed
    soda scan -d adventureworks -c configuration.yml checks.yml
    Soda Library 1.0.x
    Soda Core 3.0.x
    Sending failed row samples to Soda Cloud
    Scan summary:
    3/5 checks PASSED: 
        dim_customer in adventureworks
          No changes to schema [PASSED]
          Emails formatted correctly [PASSED]
          No null values for last name [PASSED]
    2/5 checks FAILED: 
        dim_customer in adventureworks
          No duplicate phone numbers [FAILED]
            check_value: 715
          Data is fresh [FAILED]
            max_column_timestamp: 2014-01-28 23:59:59.999999
            max_column_timestamp_utc: 2014-01-28 23:59:59.999999+00:00
            now_variable_name: NOW
            now_timestamp: 2023-04-24T21:02:15.900007+00:00
            now_timestamp_utc: 2023-04-24 21:02:15.900007+00:00
            freshness: 3372 days, 21:02:15.900008
    Oops! 2 failures. 0 warnings. 0 errors. 3 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 4417******32502
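The failed freshness figure in the output above is simply now_timestamp minus the most recent date_first_purchase value; the arithmetic can be reproduced with the standard library:

```python
from datetime import datetime, timezone

# Values copied from the scan output above
max_column_timestamp = datetime(2014, 1, 28, 23, 59, 59, 999999, tzinfo=timezone.utc)
now_timestamp = datetime(2023, 4, 24, 21, 2, 15, 900007, tzinfo=timezone.utc)

# Freshness is the elapsed time between the scan and the newest row
freshness = now_timestamp - max_column_timestamp
print(freshness)  # 3372 days, 21:02:15.900008 -- far beyond the 7d threshold, so the check fails
```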
    data_source adventureworks:
      type: postgres
      host: localhost
      ...
      schema: public
    
    soda_cloud:
      host: cloud.soda.io
      api_key_id: 2e0ba0cb-**7b
      api_key_secret: 5wdx**aGuRg
    (function process(/*RESTAPIRequest*/ request, /*RESTAPIResponse*/ response) {
    
    
    	var businessServiceId = '28***';
    	var snowInstanceId = 'dev***';
    	
    	var requestBody = request.body;
    	var requestData = requestBody.data;
    	gs.info(requestData.event);
    	if (requestData.event == 'incidentCreated'){
    		gs.log("*** Incident Created ***");
    		var grIncident = new GlideRecord('incident');
    		grIncident.initialize();
    		grIncident.short_description = requestData.incident.description;
    
    		grIncident.description = requestData.incident.sodaCloudUrl;
    		grIncident.correlation_id = requestData.incident.id;
    		if(requestData.incident.severity == 'critical'){
    			grIncident.impact = 1;
    		}else if(requestData.incident.severity == 'major'){
    			grIncident.impact = 2;
    		}else if(requestData.incident.severity == 'minor'){
    			grIncident.impact = 3;
    		}
    		
    		grIncident.business_service = businessServiceId;
    		grIncident.insert();
    		var incidentNumber = grIncident.number;
    		var sysid = grIncident.sys_id;
    		var callBackURL = requestData.incidentLinkCallbackUrl;
		var req, resp;
    		
    		req = new sn_ws.RESTMessageV2();
    
    
    		req.setEndpoint(callBackURL.toString());
    		req.setHttpMethod("post");
    		var sodaUpdate = '{"url":"https://'+ snowInstanceId +'.service-now.com/incident.do?sys_id='+sysid + '", "text":"SNOW Incident '+incidentNumber+'"}';
    		req.setRequestBody(sodaUpdate.toString());
    		resp = req.execute();
    		gs.log(resp.getBody());
    		
    
    	}else if(requestData.event == 'incidentUpdated'){
    		gs.log("*** Incident Updated ***");
    		var target = new GlideRecord('incident');
    		target.addQuery('correlation_id', requestData.incident.id);
    		target.query();
    		target.next();
    
    		if(requestData.incident.status == 'resolved'){
    			//Change this according to how SNOW is used.
    			target.state = 6;
    			target.close_notes = requestData.incident.resolutionNotes;
    		}else{
    			//Change this according to how SNOW is used.
    			target.state = 4;
    		}
    		target.update();
    		
    	}
    
    
    })(request, response);
    data_source my_datasource_name:
      type: dremio
      host: 127.0.0.1
      port: 5432
      username: simple
      password: simple_pass
      schema: public
      use_encryption: "false"
      routing_queue: queue
      disable_certificate_verification: "false"
    
    soda test-connection -d my_datasource -c configuration.yml -V
    # Install a Soda Library package with Apache Spark DataFrame
    pip install -i https://pypi.cloud.soda.io soda-spark-df
    
    # Import Scan from Soda Library
    # A scan is a command that executes checks to extract information about data in a dataset. 
    from soda.scan import Scan
    
    # Create a Spark DataFrame, or use the Spark API to read data and create a DataFrame
    # A Spark DataFrame is a distributed collection of data organized into named columns which provides a structured and tabular representation of data within the Apache Spark framework. 
    df = spark.table("delta.`/databricks-datasets/adventureworks/tables/adventureworks`")
    
    # Create a view that Soda uses as a dataset
    df.createOrReplaceTempView("adventureworks")
    
    # Create a scan object
    scan = Scan()
    
    # Set a scan definition
    # Use a scan definition to configure which data to scan, and when and how to execute the scan.
    scan.set_scan_definition_name("Databricks Notebook")
    scan.set_data_source_name("spark_df")
    
    # Attach a Spark session
    scan.add_spark_session(spark)
    
    # Define checks for datasets
# A Soda Check is a test that Soda Library performs when it scans a dataset in your data source. You can define your checks in-line in the notebook, or define them in a separate checks.yml file that is accessible by Spark.
    checks = """
    checks for dim_customer:
      - invalid_count(email_address) = 0:
          valid format: email
          name: Ensure values are formatted as email addresses
      - missing_count(last_name) = 0:
          name: Ensure there are no null values in the Last Name column
      - duplicate_count(phone) = 0:
          name: No duplicate phone numbers
      - freshness(date_first_purchase) < 7d:
          name: Data in this dataset is less than 7 days old
      - schema:
          warn:
            when schema changes: any
          name: Columns have not been added, removed, or changed
    sample datasets:
      datasets:
        - include dim_%
    """
    
# OR, define checks in a file accessible via Spark, then use the scan.add_sodacl_yaml_file method to retrieve the checks
    scan.add_sodacl_yaml_str(checks)
    
    # Add your Soda Cloud connection configuration using the API Keys you created in Soda Cloud
    # Use cloud.soda.io for EU region
    # Use cloud.us.soda.io for US region
    
config = """
    soda_cloud:
      host: cloud.soda.io
      api_key_id: 39**9
      api_key_secret: hN**_W1Q
    """
    
# OR, configure the connection details in a file accessible via Spark, then use the scan.add_configuration_yaml_file method to retrieve the config
    scan.add_configuration_yaml_str(config)
    
    # Execute a scan
    scan.execute()
    
    # Check the Scan object for methods to inspect the scan result
    # The following prints all logs to the console
    print(scan.get_logs_text()) 
    data_source my_datasource_name:
      type: athena
      access_key_id: kk9gDU6800xxxx
      secret_access_key: 88f&eeTuT47xxxx
      region_name: eu-west-1
      staging_dir: s3://s3-results-bucket/output/
      schema: public
    soda test-connection -d my_datasource -c configuration.yml -V
    data_source my_datasource_name:
      type: sqlserver
      host: host
      port: '1433'
      username: simple
      password: simple_pass
      database: database
      schema: dbo
      trusted_connection: false
      encrypt: false
      trust_server_certificate: false
      driver: ODBC Driver 18 for SQL Server
      scope: DW
      connection_parameters:
        multi_subnet_failover: true
    checks for dim_customer:
    # Check row count between datasets in one data source
      - row_count same as dim_department_group
    # Check row count between datasets in different data sources
      - row_count same as retail_customers in aws_postgres_retail
    checks for dim_customer:
      - row_count same as dim_department_group
    checks for dim_customer:
      - row_count same as retail_customers in aws_postgres_retail
    checks for dim_customer:
      - row_count same as retail_customers in aws_postgres_retail:
          name: Cross check customer datasets
    checks for dim_customer:
      - row_count same as "dim_department_group"
    filter dim_promotion [daily]:
      where: discount_pct = '0.5'
    
    filter retail_orders [daily]:
  where: discount = '50'
    
    checks for dim_promotion [daily]:
      - row_count same as retail_orders [daily] in aws_postgres_retail:
          name: Cross check between data sources
    automated monitoring:
      datasets:
        - include prod%
        - exclude test%
    automated monitoring:
      datasets:
        - include orders
    data_source soda_demo:
      type: sqlserver
      host: localhost
      username: ${SQL_USERNAME}
      password: ${SQL_PASSWORD}
      quote_tables: true
    soda test-connection -d my_datasource -c configuration.yml -V
    sample datasets:
      datasets:
        - dim_customer
        - include prod%
        - exclude test%

staging_dir

required

Identify the Amazon S3 Staging Directory (the Query Result Location in AWS); see Specifying a query result location

    schema

    required

    Identify the schema in the data source in which your tables exist.

    catalog

    optional

    Identify the name of the Data Source, also referred to as a Catalog. The default value is awsdatacatalog.

    work_group

    optional

    Identify a non-default workgroup in your region. In your Athena console, access your current workgroup in the Workgroup option on the upper right. Read more about Athena Workgroups.

    session_token

    optional

    Add a session Token to use for authentication and authorization.

    profile_name

    optional

    Specify the profile Name from local AWS configuration to use for authentication and authorization.

Manage access keys for IAM users
    Amazon Athena endpoints and quotas

    Manage sensitive data

    Learn how to adjust several configurable settings that help you manage access to sensitive data in Soda Cloud.

    Soda Cloud

    Reroute failed row samples

    Learn how to programmatically set up Soda Library to display failed row samples in the command-line.

    Soda Library Soda Cloud

    Double-onboard a data source

    Learn how to onboard a data source in Soda Cloud that you have already onboarded via Soda Library.

    Soda Library Soda Cloud

    Test data before migration
    Self-serve Soda
    Test data during development
    Automate monitoring
    Build a Sigma dashboard
    Build a Grafana dashboard
    Invoke Soda in Databricks
    Use a Secrets Manager
    Generate API keys
    Soda community on Slack

    -

    ✓

    Use quotes when identifying dataset or column names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    Use quotes in a check

    Use wildcard characters ( % or * ) in values in the check.

    -

    Use for each to apply schema checks to multiple datasets in one scan.

    -

    ✓

    Apply a dataset filter to partition data during a scan; see example.

    -

    example
    Customize check names
    Add a check identity
    Soda community on Slack
    Get started

    password

    required

    Consider using system variables to retrieve this value securely.

    schema

    optional

    Provide an identifier for the schema in which your dataset exists.

    use_encryption

    optional

    Specify a boolean value to use, or not use, encryption. Default is false. Value requires double quotes.

    routing_queue

    optional

    Provide an identifier for the routing queue to use.

    disable_certificate_verification

    optional

Specify a boolean value to control whether Dremio verifies the host certificate against the truststore. If set to "true", Dremio does not verify the host certificate. Default value is "false", which verifies the certificate. Value requires double quotes.

Get started

    Test the data source connection

    To confirm that you have correctly configured the connection details for the data source(s) in your configuration YAML file, use the test-connection command. If you wish, add a -V option to the command to return results in verbose mode in the CLI.

    soda test-connection -d my_datasource -c configuration.yml -V

    type

    required

    Identify the type of data source for Soda.

    host

    required

    Provide a host identifier.

    username

    required

    Use system variables to retrieve this value securely.

    password

    required

    Use system variables to retrieve this value securely.

    database

    required

    Provide an identifier for your database.

    text

    CHAR, VARCHAR, TEXT

    number

    BIG INT, NUMERIC, BIT, SMALLINT, DECIMAL, SMALLMONEY, INT, TINYINT, MONEY, FLOAT, REAL

    time

    DATE, TIME, DATETIME, DATETIMEOFFSET

Soda community on Slack
    cross check
    Compare data in different data sources or schemas
    reference check
    failed rows check
    Configure the same scan to run in multiple environments
    Custom check templates
    Failed row samples
    SodaCL metrics and checks

    Need help? Join the Soda community on Slack.

Starting from version 1.2.0, the soda-agent Helm chart supports working with Soda-hosted image registries.

To enjoy the latest features Soda has to offer, upgrade any self-hosted Soda agent you manage using one of the following guides.

    How-to's

    Registry access using your existing API key

    Follow the self-hosted Soda agent upgrade or redeployment guides. Don't execute the final helm install or helm upgrade step yet.

    Ensure you retrieve the soda.apikey.id and soda.apikey.secret values first, by using helm get values -n <namespace> <release_name> .

    Now pass these values back to the upgrade command via the CLI

    or by using a values file:
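For instance, you can pass them on the CLI with --set soda.apikey.id=... --set soda.apikey.secret=..., or collect them in a values file along these lines (the release and chart names in the usage note are assumptions):

```yaml
soda:
  apikey:
    id: "<your-api-key-id>"
    secret: "<your-api-key-secret>"
```

Then run, for example, helm upgrade <release_name> soda-agent/soda-agent -n <namespace> --values values.yml.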

    Registry access using a separate API key

    Ensure you have a new API key id and secret by following the API key creation guide .

    Follow the self-hosted Soda agent upgrade or redeployment guides. Don't execute the final helm install or helm upgrade step yet.

Now pass the API keys to use for registry access in the upgrade command via the CLI, using the imageCredentials.apikey.id and imageCredentials.apikey.secret properties. Note that we still pass the soda.apikey.id and soda.apikey.secret values, which remain required for the agent to authenticate to Soda Cloud.

    Or when using a values file:
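A values-file sketch, with the nesting assumed from the dotted property names above:

```yaml
soda:
  apikey:
    id: "<agent-api-key-id>"
    secret: "<agent-api-key-secret>"
imageCredentials:
  apikey:
    id: "<registry-api-key-id>"
    secret: "<registry-api-key-secret>"
```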

    Using existing (external) secrets

You can also use a self-managed, existing secret to authenticate to the Soda-hosted registry or your own self-hosted private container registry, e.g. when mirroring container images.

    You can refer to existing secrets as follows for the CLI:

    Or using a values file:
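A hypothetical values-file sketch; the exact property for referencing an existing image pull secret is not shown in this excerpt, so treat the key name below as an assumption and consult the soda-agent chart's values reference:

```yaml
imageCredentials:
  existingSecret: my-registry-secret  # assumption: the chart's key name may differ
```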

    Using the US image registry

When you're onboarded to the US region of Soda Cloud, you must use the container registry associated with that region.

    You can alter the soda.cloud.region value to automatically render the correct container registry and Soda Cloud API endpoint. Simply follow any of the above instructions and include the soda.cloud.region value.

    To do so in the CLI:

    Or using a values file:
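A values-file sketch; the region value us is an assumption based on the registry.us.soda.io naming:

```yaml
soda:
  cloud:
    region: us  # assumption: renders registry.us.soda.io and the US API endpoint
```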

    FAQ

    Mirroring images

If you want to mirror the Soda images into your own registry, you need to log in to the appropriate Soda container registry first. This allows you to pull the images and push them into your own container image registry.

    The following values.yaml file illustrates the changes required for the Helm release to work with mirrored images:
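Since the original file is not reproduced in this excerpt, the following is only a hypothetical illustration; the image override keys are assumptions, so check the soda-agent chart's values reference for the exact property names:

```yaml
# Hypothetical key names for a mirrored-image setup
soda:
  agent:
    image:
      repository: registry.mycompany.com/mirrors/soda-agent
imagePullSecrets:
  - name: my-mirror-pull-secret
```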

    Do I have to upgrade? What if we can't do that right away?

    Your existing Soda agent deployments will continue to function.

However, without upgrading, your self-hosted agent will not be able to support newer features such as collaborative data contracts and the fully revamped metric monitoring.

The images hosted on Docker Hub, which are required to run the self-hosted agent, will remain there in their current state for a grace period of six months. There will be no further maintenance (updates, bug fixes, or security patches) for old self-hosted agent versions.

    checks for dim_employee:
      - row_count same as dim_department_group
    checks for dim_customers_dev:
      - values in (last_name, first_name) must exist in dim_customers_prod (last_name, first_name)
    - failed rows:
        name: Validate that the data is the same as retail customers
        fail query: |
          with table_1_not_in_table_2 as (
            select *
            from retail_customers
            except
            select *
            from retail_sfdc_customers
          )
          , table_2_not_in_table_1 as (
            select *
            from retail_sfdc_customers
            except
            select *
            from retail_customers
          )
          select
            'found in retail_customers but missing in retail_sfdc_customers' as directionality,
            *
          from table_1_not_in_table_2
          union all
          select
            'found in retail_sfdc_customers but missing in retail_customers' as directionality,
            *
          from table_2_not_in_table_1
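The symmetric EXCEPT pattern above can be exercised in miniature with SQLite; the table contents below are illustrative stand-ins for the two customer tables:

```python
import sqlite3

# Toy stand-ins for retail_customers and retail_sfdc_customers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE retail_customers (id INTEGER, last_name TEXT);
    CREATE TABLE retail_sfdc_customers (id INTEGER, last_name TEXT);
    INSERT INTO retail_customers VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO retail_sfdc_customers VALUES (2, 'Jones'), (3, 'Brown');
""")

# Diff in both directions, then union the results, like the fail query above.
rows = conn.execute("""
    SELECT 'found in retail_customers but missing in retail_sfdc_customers', *
    FROM (SELECT * FROM retail_customers
          EXCEPT
          SELECT * FROM retail_sfdc_customers)
    UNION ALL
    SELECT 'found in retail_sfdc_customers but missing in retail_customers', *
    FROM (SELECT * FROM retail_sfdc_customers
          EXCEPT
          SELECT * FROM retail_customers)
""").fetchall()
```

Each returned row carries a directionality label, so failed row samples show which table is missing the record.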
    filter public.employee_dimension [west]:
      where: employee_region = 'West'
    
    # Add a second filter for the dataset
    filter online_sales.online_page_dimension [monthly]:
      where: page_type = 'monthly'
    
    checks for public.employee_dimension [west]:
      # Add the second filter to the check but without brackets
      - row_count same as online_sales.online_page_dimension monthly: 
    ...
    DEBUG | Query vertica_local.public.employee_dimension[west].aggregation[0]:
    SELECT
    COUNT(*)
    FROM public.employee_dimension
    WHERE employee_region = 'West'
    DEBUG | Query vertica_local.online_sales.online_page_dimension[monthly].aggregation[0]:
    SELECT
    COUNT(*)
    FROM online_sales.online_page_dimension
    WHERE page_type = 'monthly'
    ...
    checks for dim_customer:
      - row_count same as dim_customer in aws_postgres_retail
    data_source retail_customers_stage:
      type: postgres
      host: location.eu-west-1.rds.amazonaws.com
      username: ${USER}
      password: ${PASS}
      database: postgres
      schema: staging
    
    data_source retail_customers_prod:
      type: postgres
      host: location.eu-west-1.rds.amazonaws.com
      username: ${USER}
      password: ${PASS}
      database: postgres
      schema: production
    checks for dim_customer:
      # Check row count between datasets in different data sources
      - row_count same as dim_customer in retail_customers_prod
    - failed rows:
        fail query: |
          WITH src AS (
            SELECT src_page_id, src_post_id
            FROM prod.staging.dmds_scores
          ), tgt AS (
            SELECT page_id, post_id, partition_date
            FROM prod.measurement.post_scores
          )
          SELECT src_page_id, src_post_id
          FROM src
          LEFT JOIN tgt
            ON src.src_page_id = tgt.page_id AND src.src_post_id = tgt.post_id
          WHERE (src.src_page_id IS NOT NULL AND src.src_post_id IS NOT NULL)
            AND (tgt.page_id IS NULL AND tgt.post_id IS NULL)
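The same left-anti-join pattern, miniaturized with SQLite; table names and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (src_page_id INTEGER, src_post_id INTEGER);
    CREATE TABLE tgt (page_id INTEGER, post_id INTEGER);
    INSERT INTO src VALUES (1, 10), (2, 20);
    INSERT INTO tgt VALUES (1, 10);
""")

# Left anti-join: keep only source rows that have no matching target row.
missing = conn.execute("""
    SELECT src.src_page_id, src.src_post_id
    FROM src
    LEFT JOIN tgt
      ON src.src_page_id = tgt.page_id AND src.src_post_id = tgt.post_id
    WHERE tgt.page_id IS NULL AND tgt.post_id IS NULL
""").fetchall()
```

Here only `(2, 20)` exists in the source and not the target, so it is the single failed row.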
    helm upgrade <release> soda-agent/soda-agent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=***
    > cat values-local.yaml
    soda:
      apikey:
        id: ***
        secret: ***
    > helm upgrade soda-agent soda-agent/soda-agent \
      --values values-local.yaml --namespace soda-agent
    helm upgrade <release> soda-agent/soda-agent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=*** \
      --set imageCredentials.apikey.id=*** \
      --set imageCredentials.apikey.secret=***
    > cat values-local.yaml
    soda:
      apikey:
        id: ***
        secret: ***
    imageCredentials:
      apikey:
        id: ***
        secret: ***
    > helm upgrade soda-agent soda-agent/soda-agent \
      --values values-local.yaml --namespace soda-agent
    helm upgrade <release> soda-agent/soda-agent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=*** \
      --set existingImagePullSecrets[0].name=my-existing-secret  # Mind the array and indexing syntax!
    > cat values-local.yaml
    soda:
      apikey:
        id: ***
        secret: ***
    existingImagePullSecrets:
      - name: my-existing-secret
    > helm upgrade soda-agent soda-agent/soda-agent \
      --values values-local.yaml --namespace soda-agent
    helm upgrade <release> soda-agent/soda-agent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=*** \
      --set soda.cloud.region=us
    > cat values-local.yaml
    soda:
      apikey:
        id: ***
        secret: ***
      cloud:
        region: "us"
    > helm upgrade soda-agent soda-agent/soda-agent \
      --values values-local.yaml --namespace soda-agent
    # For Soda Cloud customers in the EU region
    docker login registry.cloud.soda.io -u <APIKEY_ID> -p <APIKEY_SECRET>
    
    # For Soda Cloud customers in the US region
    docker login registry.us.soda.io -u <APIKEY_ID> -p <APIKEY_SECRET>
    existingImagePullSecrets:
      - name: my-existing-secret
    soda:
      apikey:
        id: ***
        secret: ***
      agent:
        image:
          repository: custom.registry.org/sodadata/agent-orchestrator
      scanLauncher:
        image:
          repository: custom.registry.org/sodadata/soda-scan-launcher
      contractLauncher:
        image:
          repository: custom.registry.org/sodadata/soda-contract-launcher
      hooks:
        image:
          repository: custom.registry.org/sodadata/soda-agent-utils
    Self-operated

    A simple setup in which you install Soda Library locally and connect it to Soda Cloud via API keys.

    Soda-hosted agent

    Recommended. A SaaS-style setup in which you manage data quality entirely from your Soda Cloud account.

    Self-hosted agent

    A setup in which you deploy a Soda Agent in a Kubernetes cluster in a cloud-services environment and connect it to Soda Cloud via different API keys.

    Programmatic

    Data source compatibility

    • Soda-hosted agent: compatible with a limited subset of Soda-supported data sources.

    • Self-hosted agent: compatible with nearly all Soda-supported data sources.

    Upgrade maintenance

    • Soda-hosted agent: Soda manages all upgrades to the latest available version of the Soda Agent.

    • Self-hosted agent: you manage all upgrades to your Soda Agent deployed on your Kubernetes cluster.

    External Secrets manager integration

    • Soda-hosted agent: unable to integrate with an external secrets manager.

    • Self-hosted agent: able to integrate with an external secrets manager (HashiCorp Vault, Azure Key Vault, etc.) to better manage frequently-rotated login credentials.

    Network connectivity

    • Soda-hosted agent: access the Soda Agent via public networks or passlisting.


    A setup in which you invoke Soda Library programmatically.

    Deploy the Soda Agent inside your own private cloud or on-premises network infrastructure.

    About Soda and Airflow

    If you have an internal requirement to run Soda tasks in isolated environments in Airflow or Astro, you can do so using one of the following options; refer to Astro documentation for more detail.

    • A Virtualenv operator uses the same Python runtime as Airflow, but creates a new virtual environment. Python treats this as a separate runtime environment, though it uses the same Python executable as Airflow.

    • An External Python operator works much like the Virtualenv operator, but the environment is set up outside of Airflow, so it can use a different Python executable or a different version of Python.

    • A Kubernetes + Docker setup offers a completely separate environment; it is the only option that is fully detached from Airflow/Astro, but it requires a Kubernetes cluster. Soda provides a Docker image that you can use in a cluster; see Install Soda Library > Docker tab for details.

    As a Python library, Soda can handle big data engineering tasks. Soda's compute occurs almost entirely in the data source: Soda runs queries to gather metrics in the data source, then evaluates only the metric outcomes in Python. Soda does not extract large volumes of data out of the data source to process in Python. There are two exceptions to this rule:

    • For user-defined failed rows queries, Soda executes the query as provided, so if a user includes select * … , then Soda loads that data in Python.

    • For record-level reconciliation checks, Soda loads all data into memory, but only one row at a time (or a defined batch of rows, based on configuration). This does not result in large volumes of data in memory, because rows simply pass through during processing.
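The row-at-a-time (or batched) pass-through described above can be sketched with a plain Python generator. This is only an illustration of why memory stays flat, not Soda's actual implementation:

```python
from itertools import islice

def batched(rows, batch_size):
    """Yield fixed-size batches from any row iterator without materializing it."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# A generator stands in for a database cursor: rows stream through one batch
# at a time, so at most `batch_size` rows are ever held in memory.
source = ((i, f"row-{i}") for i in range(10))
batch_sizes = [len(b) for b in batched(source, 4)]
```

With 10 source rows and a batch size of 4, the stream yields batches of 4, 4, and 2 rows.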

    Airflow using PythonOperator

    Also, configure the following:

    Example DAG

    Go further

    • Learn more about the Metrics and checks you can use to check for data quality.

    • Follow an example implementation in Test data in an Airflow pipeline.

    Test data in an Airflow pipeline
    Data Quality Checks with Airflow, Snowflake and Soda

    Need help? Join the Soda community on Slack.

    🎥 Watch a 5-minute overview showcasing the integration of Soda and Alation.

    Prerequisites

    • You have completed at least one Soda scan to validate that the data source’s datasets appear in Soda Cloud as expected.

    • You have an Alation account with the privileges necessary to allow you to add a data source, create custom fields, and customize templates.

    • You have a git repository in which to store the integration project files.

    Set up the integration

    🎥 Watch a 5-minute video that demonstrates how to integrate Soda and Alation.

    1. Sign into your Soda Cloud account and confirm that you see the datasets you expect to see in the data source you wish to test for quality.

    2. To connect your Soda Cloud account to your Alation Service Account, create a configuration according to the example below. Refer to Generate API keys to obtain the values for your Soda API keys.

    1. To sync a data source and schema in the Alation catalog to a data source in Soda Cloud, you must map it from Soda Cloud to Alation. Create a .datasource-mapping.yml file in your integration project and populate it with mapping data according to the following example. The table below describes where to retrieve the values for each field.

    Field
    Retrieve value from

    name

    A name you choose as an identifier for an integration between Soda Cloud and a data catalog.

    soda: datasource_id

    The data source information panel in Soda Cloud.

    soda: datasource_name

    The data source information panel in Soda Cloud.

    soda: dataset_mapping

    (Optional) When you run the integration, Soda automatically maps all of the datasets between data sources. However, if the names of the datasets differ between the tools, you can use this property to manually map datasets between tools.

    catalog: type:

    The name of the cataloging software; in this case, “alation”.

    catalog: datasource_id

    Retrieve this value from the URL on the data source page in the Alation catalog; see image below.

    Retrieve the Alation datasource_id from the URL

    Retrieve the Alation datasource_container_name (schema) from the data source page

    Retrieve the Alation datasource_container_id for the datasource_container_name from the URL in the Schema page.

    Enable API access to Alation with SSO

    If your Alation account employs single sign-on (SSO) access, you must Create an API service account for Soda to integrate with Alation.

    If your Alation account does not use SSO, skip this step and proceed to Customize the catalog.

    Customize the catalog

    1. Create custom fields in Alation that reference information that Soda Cloud pushes to the catalog. These are the fields the catalog users will see that will display Soda Cloud data quality details. In your Alation account, navigate to Settings > Catalog Admin > Customize Catalog. In the Custom Fields tab, create the following fields:

      • Under the Pickers heading, create a field for “Has DQ” with Options “True” and “False”. The Alation API is case-sensitive, so be sure to use these exact values.

      • Under the Dates heading, create a field for “Profile - Last Run”.

      • Under the Rich Texts heading, create the following fields:

        • “Soda DQ Overview”

        • “Soda Data Quality Rules”

        • “Data Quality Metrics”

    2. Add each new custom field to a Custom Template in Alation. In Customize Catalog, in the Custom Templates tab, select the Table template, then click Insert… to add a custom field to the template:

      • “Soda DQ Overview”

    3. In the Table template, click Insert… to add a Grouping of Custom Fields. Label the grouping “Data Quality Info”, then Insert… two custom fields:

      • “Has DQ”

      • “Profile - Last Run”

    4. In the Column template, click Insert… to add a custom field to the template:

      • “Has DQ”

    5. In the Column template, click Insert… to add a Grouping of Custom Fields. Label the grouping “Soda Data Profile Information”, then Insert… two custom fields:

      • Data Quality Metrics

      • Soda Data Quality Rules

    Run the integration

    Contact [email protected] directly to acquire the assets and instructions to run the integration and view Soda Cloud details in your Alation catalog.

    Use the integration

    Access Soda Cloud to create no-code checks or create agreements that execute checks against datasets in your data source each time you run a Soda scan manually, or orchestrate a scan using a data pipeline tool such as Airflow. Soda Cloud pushes data quality scan results to the corresponding data source in Alation so that users can review data quality information from within the catalog.

    In Alation, beyond reviewing data quality information for the data source, users can access the Joins and Lineage tabs of individual datasets to examine details and investigate the source of any data quality issues.

    Open in Soda

    In a dataset page in Alation, in the Overview tab, users have the opportunity to click links to directly access Soda Cloud to scrutinize data quality details; see image below.

    • Under the Soda DQ Overview heading in Alation, click Open in Soda to access the dataset page in Soda Cloud.

    • Under the Dataset Level Monitors heading in Alation, click the title of any monitor to access the check info page in Soda Cloud.

    Go further

    • Access a list of all integrations that Soda Cloud supports.

    Need help? Join the Soda community on Slack.

    Add a for each dataset T section header anywhere in your YAML file. The purpose of the T identifier is only to ensure that every for each configuration has a unique name.

  • Nested under the section header, add two nested keys, one for datasets and one for checks.

  • Nested under datasets, add a list of datasets against which to run the checks. Refer to the example below that illustrates how to use include and exclude configurations and wildcard characters (%) .

  • Nested under checks, write the checks you wish to execute against all the datasets listed under datasets.

  • Limitations and specifics for for each

    • For each is not compatible with dataset filters.

    • Soda dataset names matching is case insensitive.

    • You cannot use quotes around dataset names in a for each configuration.

    • If any of your checks specify column names as arguments, make sure the column exists in all datasets listed under the datasets heading.

    • To add multiple for each configurations, configure another for each section header with a different letter identifier, such as for each dataset R.
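Assembled, a for each configuration following the rules above might look like this; the dataset names and wildcard patterns are illustrative:

```yaml
for each dataset T:
  datasets:
    - dim_%               # include every dataset whose name begins with dim_
    - exclude dim_test%   # exclude a subset, also by wildcard
  checks:
    - row_count > 0
```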

    Optional check configurations

    ✓ Define a name for a for each check.

    ✓ Add an identity to a check.

    ✓ Define alert configurations to specify warn and fail alert conditions.

    ✓ Apply an in-check filter to return results for a specific portion of the data in your dataset.

    Example with check name

    Example with alert configuration

    Example with in-check filter

    Example with wildcard

    Add a dynamic name to for each checks

    To keep your for each check results organized in Soda Cloud, you may wish to dynamically add a name to each check so that you can easily identify to which dataset the check result applies.

    For example, if you use for each to execute an anomaly detection check on many datasets, you can use a variable in the syntax of the check name so that Soda dynamically adds a dataset name to each check result.

    For each results in Soda Cloud

    Soda pushes the check results for each dataset to Soda Cloud where each check appears in the Checks dashboard, with an icon indicating their latest scan result. Filter the results by dataset to review dataset-specific results.

    Go further

    • Reference tips and best practices for SodaCL.

    Need help? Join the Soda community on Slack.

    Channel URL

    Provide a channel identifier to which Soda Cloud sends all discussion events.

    Creating a workflow from a channel in Teams

    Need help? Join the Soda community on Slack.

    Soda community on Slack
    Soda community on Slack
    Soda community on Slack
    Soda community on Slack

    Self-serve Soda

    Follow this guide to enable Soda Cloud end users to create no-code checks for data quality for the data that matters to them the most.

    Use this guide to set up Soda Cloud and enable users across your organization to serve themselves when it comes to testing data quality.

    Deploy a Soda Agent in a Kubernetes cluster to connect to both a data source and Soda Cloud, then invite your Data Analyst and Data Scientist colleagues to join the account, start data quality discussions, and begin creating their own SodaCL checks for data quality.

    About this guide

    The instructions below offer Data Engineers an example of how to set up Soda Cloud to enable non-coding colleagues to propose, discuss, and create their own data quality tests. After all, data quality testing is a team sport!

    Once you have completed the set-up, you can direct your non-coding colleagues to log in to Soda Cloud and begin creating Discussions. A Discussion in Soda is a messaging space that facilitates collaboration between data producers and data consumers. Together, colleagues can establish the expected and agreed-upon state of data quality in a dataset by proposing, then approving data quality checks that execute as part of a scheduled scan in Soda.

    When checks fail during data quality scans, you and your colleagues receive alerts via Slack, which enable you to address issues before they have a downstream impact on the users or applications that depend on the data.

    Access or deploy a Soda Agent

    1. If you have not already done so, create a Soda Cloud account. If you already have a Soda account, log in.

    2. By default, Soda prepares a Soda-hosted agent for all newly-created accounts. However, if you are an Admin in an existing Soda Cloud account and wish to use a Soda-hosted agent, navigate to your avatar > Organization Settings. In the Organization tab, click the checkbox to Enable Soda-hosted Agent.

    3. Navigate to your avatar

    Connect a data source

    Depending on your deployment model, Soda Agent supports connections with the following data sources.

    Self-hosted agent

    1 MS SQL Server with Windows Authentication does not work with Soda Agent out-of-the-box.

    Soda-hosted agent

    1. Log in to your Soda Cloud account, then navigate to your avatar > Data Sources.

    2. In the Agents tab, confirm that you can see a Soda-hosted agent, or the Soda Agent you deployed, and that its status is “green” in the Last Seen column. If not, refer to the Soda Agent documentation to troubleshoot its status.

    3. Navigate to the Data Sources tab, then click New Data Source and follow the guided steps to:

    Set up Slack integration and notification rules

    Use this integration to enable Soda to send alert notifications to a Slack channel to notify your team when check results warn and fail.

    If your team does not use Slack, you can follow the instructions to integrate with MS Teams instead, or skip this step, as Soda sends alert notifications via email by default.

    1. Log in to your Soda Cloud account and navigate to your avatar > Organization Settings, then navigate to the Integrations tab and click the + icon to add a new integration.

    2. Follow the guided steps to authorize Soda to connect to your Slack workspace. If necessary, contact your organization's Slack Administrator to approve the integration with Soda.

    • Configuration tab: select the public channels to which Soda can post messages; Soda cannot post to private channels.

    • Scope tab: select the two Soda features, Alert Notifications and Discussions, which can access the Slack integration.

    1. To dictate where Soda must send alert notifications for checks that fail, create a new notification rule. Navigate to your avatar > Notification Rules, then click New Notification Rule. Follow the guided steps to complete the new rule, directing Soda to send check results that fail to a specific channel in your Slack workspace.


    Invite your colleagues

    After testing and saving the new data source, invite your colleagues to your Soda Cloud account so they can begin creating new agreements.

    Navigate to your avatar > Invite Team Members, then complete the form to send invitations to your colleagues.

    Begin a discussion and propose checks

    While waiting for your colleagues to accept your Soda invitation, get a head start on setting up data quality checks on the data that matters the most to your data consumers.

    🎥 Watch a video of the following procedure, if you like!

    1. In Soda Cloud, navigate to Discussions from the main navigation bar.

    2. Start a New Discussion, providing relevant details for a discussion on data quality metrics, and adding people whose perspectives will add value to the data quality of a particular dataset.

    3. Kick off the data quality discussion with your colleagues: begin with Propose Check, then use the no-code check interface to select from the list of available checks for the dataset. The most common baseline data quality checks include missing, invalid, duplicate, and freshness checks.

    ✨Well done!✨ You've taken the first step towards a future in which you and your colleagues can collaborate on defining and maintaining good-quality data. Huzzah!


    Connect Soda to Dask and Pandas

    Access configuration details to connect Soda to Dask and Pandas.

    Connection configuration reference

    For use with programmatic Soda scans, only.

    Install package: soda-pandas-dask

    Define a programmatic scan for the data in the DataFrames. You do not need to configure a connection to a data source, but you must still configure a connection to Soda Cloud using API Keys. Refer to the following example.
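A minimal programmatic sketch, assuming the soda-pandas-dask package is installed; the dataset name, check contents, and API key placeholders are all illustrative:

```python
# SodaCL checks for the DataFrame registered below (names are illustrative).
checks_yaml = """
checks for employees:
  - row_count > 0
  - missing_count(id) = 0
"""

try:
    import pandas as pd
    from soda.scan import Scan  # pip install soda-pandas-dask

    df = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cal"]})

    scan = Scan()
    scan.set_scan_definition_name("pandas_example")
    scan.set_data_source_name("dask")
    scan.add_pandas_dataframe(dataset_name="employees", pandas_df=df)

    # No data source connection needed, but Soda Cloud API keys are required.
    scan.add_configuration_yaml_str(
        "soda_cloud:\n"
        "  host: cloud.soda.io\n"
        "  api_key_id: ${API_KEY_ID}\n"
        "  api_key_secret: ${API_KEY_SECRET}\n"
    )

    scan.add_sodacl_yaml_str(checks_yaml)
    # scan.execute()  # runs the checks and pushes results to Soda Cloud
except ImportError:
    pass  # soda-pandas-dask is not installed in this environment
```

Resolve the `${API_KEY_ID}` and `${API_KEY_SECRET}` placeholders from environment variables rather than hard-coding credentials.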

    Why do I need a Soda Cloud account?

    To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.

    Load CSV file into Dataframe

    Load JSON file into Dataframe
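The two loading steps above can be sketched as follows; in-memory buffers stand in for real file paths such as `data.csv` and `data.json`:

```python
import io
import pandas as pd

# CSV: a small inline example standing in for a real data.csv file.
csv_buffer = io.StringIO("id,name\n1,Alice\n2,Bob\n")
df_csv = pd.read_csv(csv_buffer)

# JSON: a records-oriented document standing in for a real data.json file.
json_buffer = io.StringIO('[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]')
df_json = pd.read_json(json_buffer)
```

Either DataFrame can then be registered with a scan via `scan.add_pandas_dataframe()`.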

    Add optional parameter for COUNT

    Prior to soda-pandas-dask version 1.6.4, Soda only supported dask-sql versions up to 2023.10, in which the COUNT(*) clause behaved as COUNT(1) by default. With dask-sql versions greater than 2023.10, the behavior changed so that COUNT(*) behaves as a true COUNT(*). Therefore, upgrading your soda-pandas-dask package, which supports newer versions of dask-sql with the new behavior, might lead to unexpected differences in your check results.

    To mitigate confusion, with soda-pandas-dask version 1.6.4 or greater, use the optional use_dask_count_star_as_count_one parameter when calling scan.add_dask_dataframe() or scan.add_pandas_dataframe() to explicitly set the behavior of the COUNT(*) clause, as in the following example.
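A sketch of the parameter in use, assuming soda-pandas-dask 1.6.4 or greater; the dataset name is illustrative and the snippet is guarded so it runs even where the package is absent:

```python
# True pins the legacy behavior (COUNT(*) treated as COUNT(1)); the constant
# is only for illustration -- pass the boolean directly in real code.
LEGACY_COUNT_BEHAVIOR = True

try:
    import pandas as pd
    from soda.scan import Scan  # requires soda-pandas-dask >= 1.6.4

    scan = Scan()
    scan.set_scan_definition_name("count_behavior_example")
    scan.set_data_source_name("dask")
    scan.add_pandas_dataframe(
        dataset_name="my_dataset",  # illustrative name
        pandas_df=pd.DataFrame({"id": [1, 2, 3]}),
        use_dask_count_star_as_count_one=LEGACY_COUNT_BEHAVIOR,
    )
except ImportError:
    pass  # soda-pandas-dask is not installed in this environment
```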


    If you do not add the parameter, Soda defaults to use_dask_count_star_as_count_one=True.

    Add optional parameter for text data conversion

    With dask>=2023.7.1, if you use pandas>=2 and pyarrow>=12, Dask DataFrame automatically converts text data to the string[pyarrow] data type. As of soda-pandas-dask version 1.6.4, Soda's updated codebase uses dask>=2023.7.1, but it still expects text data to be converted to the object data type.

    Set dask.config.set({"dataframe.convert-string": False}), as in the following example, to avoid KeyError: string[pyarrow] errors.
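For example, guarded here with a try/except so the snippet runs even without Dask installed:

```python
try:
    import dask

    # Keep text columns as object dtype rather than string[pyarrow].
    dask.config.set({"dataframe.convert-string": False})
    convert_string = dask.config.get("dataframe.convert-string")
except ImportError:
    convert_string = None  # dask is not installed in this environment
```

Apply the setting before constructing any Dask DataFrames that you pass to Soda.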

    Troubleshoot

    Problem: You encounter errors when trying to install soda-pandas-dask in an environment that uses Python 3.11. This may manifest as an issue with dependencies or as an error that reads, Pre-scan validation failed, see logs for details.

    Workaround: Uninstall the soda-pandas-dask package, then downgrade the version of Python your environment uses to Python 3.9. Install the soda-pandas-dask package again.

    Problem: COUNT(*) in dask-sql behaves unexpectedly or yields confusing check results.

    Solution: Upgrade soda-pandas-dask to version 1.6.4 or greater and use the optional use_dask_count_star_as_count_one=True parameter when calling scan.add_dask_dataframe() or scan.add_pandas_dataframe() to retain the old dask-sql behavior.

    Problem: You encounter an error that reads KeyError: string[pyarrow].

    Solution: Upgrade soda-pandas-dask to version 1.6.4 or greater and set dask.config.set({"dataframe.convert-string": False}).

    Test data quality during CI/CD development

    Follow this guide to set up and run automated Soda scans for data quality during CI/CD development using GitHub Actions.

    Use this guide to install and set up Soda to test the quality of your data during your development lifecycle. Catch data quality issues in a GitHub pull request before merging data management changes, such as transformations, into production.

    About this guide

    The instructions below offer Data Engineers an example of how to use the Soda Library Action to execute SodaCL checks for data quality on data in a Snowflake data source.

    For context, the example assumes that a team of people uses GitHub to collaborate on managing data ingestion and transformation with dbt. In the same repo, team members collaborate to write tests for data quality in SodaCL checks YAML files. With each new pull request, or each commit to an existing one, that adds a transformation or changes a dbt model, the GitHub Action in the Workflow executes a Soda scan for data quality and presents the scan results in a comment on the pull request, and in Soda Cloud.

    Where the scan results indicate an issue with data quality, Soda notifies the team via a notification in Slack so that they can investigate and address any issues before merging the PR into production.

    Borrow from this guide to connect to your own data source, add the GitHub Action for Soda to a Workflow, and execute your own relevant tests for data quality to prevent issues in production.

    Add the GitHub Action for Soda to a Workflow

    1. In a browser, navigate to Soda Cloud and create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.

    2. Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys. Copy+paste the API key values to a temporary, secure place in your local environment.

    Why do I need a Soda Cloud account?

    To validate your account license or free trial, the Soda Library Docker image that the GitHub Action uses to execute scans must communicate with a Soda Cloud account via API keys. Create new API keys in your Soda Cloud account, then use them to configure the connection between the Soda Library Docker image and your account later in this procedure.

    1. In the GitHub repository in which you wish to include data quality scans in a Workflow, create a folder named soda for the configuration files that Soda requires as input to run a scan.

    2. In this folder, create two files:

    • a configuration.yml file to store the connection configuration Soda needs to connect to your data source and your Soda Cloud account.

    • a checks.yml file to store the SodaCL checks you wish to execute to test for data quality.

    1. Follow the instructions to add connection configuration details for both your data source and your Soda Cloud account to the configuration.yml, as per the example below.

    1. In the .github/workflows folder in your GitHub repository, open an existing Workflow file or create a new one.

    2. In your browser, navigate to the GitHub Marketplace to access the Soda Library Action. Click Use latest version to copy the code snippet for the Action.

    3. Paste the snippet into your new or existing workflow as an independent step, then add the required action inputs and environment variable as in the following example.

    • Be sure to add the Soda Action after the step in the workflow that completes a dbt run that executes your dbt tests.

    • Best practice dictates that you configure sensitive credentials using GitHub secrets. Read more about secrets in the GitHub documentation.

    1. Save the changes to your workflow file.

    Write checks for data quality

    A check is a test that Soda executes when it scans a dataset in your data source. The checks.yml file stores the checks you write using the Soda Checks Language (SodaCL). You can create multiple checks.yml files to organize your data quality checks and run all, or some of them, at scan time.

    1. In your soda folder, open the checks.yml file, then copy and paste the following rather generic checks into the file.

    • Replace the value of dataset_name with the name of a dataset in your data source.

    • Replace the value of column1 with the name of a column in the dataset.

    checks for dataset_name:
      # Checks that dataset contains rows
      - row_count > 0:
          name: Dataset contains data
      # Checks that column contains no NULL values
      - missing_count(column1) = 0:
          name: No NULL values

    1. Save the checks.yml file.

    Trigger a scan and examine the scan results

    To trigger the GitHub Action and initiate a Soda scan for data quality, create a new pull request in your repository. Be sure to trigger a Soda scan after the step in your Workflow that completes the dbt run that executed your dbt tests.

    What does the GitHub Action do?

    To summarize, the action completes the following tasks:

    1. Checks to validate that the required Action input values are set.

    2. Builds a Docker image with a specific Soda Library version as the base image.

    1. For the purposes of this exercise, create a new branch in your GitHub repo, then make a small change to an existing file and commit and push the change to the branch.

    2. Execute a .

    3. Create a new pull request, then navigate to your GitHub account and review the pull request you just created. Notice that the Soda scan action is queued and perhaps already running against your data to check for quality.

    4. When the job completes, navigate to the pull request's Conversation tab to view the comment the Action posted via the github-action bot. The table indicates the states and volumes of the check results.

    ✨Well done!✨ You've taken the first step towards a future in which you and your colleagues prevent data quality issues from getting into production. Huzzah!

    Go further

    • in Soda!

    • . Hey, what can Soda do for you?

    Failed rows checks

    Use a SodaCL failed rows check to explicitly send sample failed rows to Soda Cloud.

    Use a failed rows check to explicitly send samples of rows that failed a check to Soda Cloud.

    You can also use a failed row check to configure Soda Library to execute a CTE or SQL query against your data, or to group failed check results by one or more categories.

    ✔️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent

    ✔️ Available as a no-code check with a self-hosted Soda Agent connected to any Soda-supported data source, except Spark, Dask, and Pandas, OR with a Soda-hosted Agent connected to a BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake data source

    Prerequisites

    To send failed rows samples to Soda Cloud, samples collection must be enabled in Soda Cloud.

    As a user with permission to do so, navigate to your avatar > Organization Settings, then check the box to Allow Soda to collect sample data and failed row samples for all datasets.

    Define failed rows checks

    In the context of , failed row checks are user-defined. This check is limited in its syntax variation, but you can customize your expression or query as much as you like.

    When a failed rows check results in a warn or fail, Soda collects up to 100 failed row samples by default. You can decrease or increase the volume of sample rows using the samples limit parameter; see .

    The example below uses to define the fail condition that rows in the dim_customer dataset must meet in order to qualify as failed rows and, during a scan, get sent to Soda Cloud. Soda sends any rows which contain the value 2 in the total_children column and a value greater than or equal to 3 in the number_cars_owned column to Soda Cloud as failed row samples, up to a default volume of 100 rows. The check also uses the name configuration key to customize a name for the check so that it displays in a more readable form in Soda Cloud; see image below.

    If you prefer, you can use a SQL query to define what qualifies as a failed row for Soda to send to Soda Cloud, as in the following simple example. Use this configuration to include complete SQL queries in the Soda scan of your data.

    Optional check configurations

    Supported
    Configuration
    Documentation

    Example with check name

    Example with alert

    Example with quotes

    Example with dataset filter

    Known issue: Dataset filters are not compatible with failed rows checks which use a SQL query. With such a check, Soda does not apply the dataset filter at scan time.

    Example with column parameter

    Go further

    • Learn how to .

    • Learn more about in general.

    • Borrow user-defined check syntax to define a reusable .

    • Use a to discover missing or forbidden columns in a dataset.

    Integrate Soda with Jira

    Configure a webhook to connect Soda to your Jira workspace.

    Configure a webhook in Soda Cloud to connect to your Jira workspace.

    In this guide, we will show how you can integrate Soda Cloud Incidents with Jira. After the integration is set up, creating an incident in Soda will automatically trigger the creation of a corresponding bug ticket in Jira. The Jira ticket will include information related to the incident created in Soda, including:

    • The number and title of the Incident

    • The description of the Incident

    • The severity of the incident

    • The status of the incident

    • The user who reported the Incident

    • A link to the Incident in Soda Cloud

    • A link to the associated Check in Soda Cloud

    A link to this Jira ticket will be sent back to Soda and displayed on the Incident page in the Integrations box. Any updates to the status of the Incident in Soda Cloud will trigger corresponding changes to the Status of the Jira ticket. Any updates to the status of the Jira ticket will trigger corresponding changes to the Status of the Incident in Soda Cloud.

    In Jira, you can set up an Automation Rule that enables you to define what you want an incoming webhook to do, then provides you with a URL that you use in the URL field in the Soda Cloud integration setup.

    This integration is built on two webhook events IncidentCreated and IncidentUpdated (Soda -> Jira; ), as well as the Soda Cloud API endpoint for updating incidents (Jira -> Soda; ).

    Create a Jira project for DQ tickets

    In Jira, start by creating a new project dedicated to tracking data quality tickets. Navigate to Project settings > Work Items, and make sure you have a bug-type work item with the fields shown in the image below:

    • Summary

    • Description

    • Assignee

    • IncidentSeverity

    From the same page, next click the Edit Workflow button, and make sure your workflow includes the following statuses:

    • Reported

    • Investigating

    • Fixing

    • Resolved

    Automation Rule (Inbound)

    Initialize the webhook-trigger

    Here we will set up the automation in Jira so that when an Incident is created or updated in Soda, then a bug ticket will automatically be created or updated in Jira.

    Navigate to Project settings > Automation, then click Create rule and, for the type of New trigger, select Incoming webhook.

    Under the When: Incoming webhook trigger, click Add a component, select IF: Add a condition, then smart values condition.

    This means that if an incoming webhook has the incidentCreated event, then the rule proceeds with the actions that follow.

    Automatic creation of the Jira ticket

    Next we will add another component: THEN: Add an action. The action will be Create work item, with the Issue Type set to Bug and the Project set to our new project.

    Next we add some steps to fill out our ticket with extra information obtained from the webhook data.

    We start by creating a branch rule to identify our ticket:

    Then we Edit the ticket fields:

    Finally, the last step in our incident creation workflow is to send a post request back to Soda with a link to the issue in Jira:

    Automatic updates to the Jira ticket

    The remaining parts of this automation rule cover the scenarios where the status of the incident is updated in Soda: we detect this change and make the corresponding updates to the issue in Jira.

    When the status changes to Reported:

    The same logic is used for other status changes such as Investigating and Fixing.

    In case the status changes to Resolved, our rule uses a similar logic, but with the additional step of adding resolution notes as a comment to the issue in Jira:

    Once you save and enable this new rule, you can access a URL and secret that you will provide to Soda when setting up the new webhook integration.

    After saving or enabling the rule, you can view details of the webhook trigger as shown below:

    Define the Webhook integration in Soda

    Next, you create a new webhook integration in Soda and provide the details from the webhook trigger above, as shown in the image below.

    Paste the Webhook URL from Jira into the URL field in Soda and paste the Secret from Jira into a custom HTTP header called X-Automation-Webhook-Token.

    Finally, in the Define Scope tab, make sure to select Incidents - Triggered when users create or update incidents.

    Automation Rule (outbound)

    Lastly we will set up a second automation rule in Jira so that when the status of the ticket changes in Jira, these changes are also reflected in Soda.

    First, we set up the trigger for this automation to be when a Work item is transitioned:

    Finally, we send a post request to the Soda Cloud API incidents endpoint, using information from our Jira ticket to update the severity and status of the corresponding incident in Soda:

    Note that the Authorization header value must be formatted like: Basic <base64_encoded_credentials>.

    Base64-encoded credentials can be generated using Soda Cloud API keys in Python like so:
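A minimal sketch of that encoding in Python, using placeholder values in place of your real Soda Cloud API key pair:

```python
import base64

# Placeholder values; substitute your own Soda Cloud API key pair
api_key_id = "your-api-key-id"
api_key_secret = "your-api-key-secret"

# HTTP Basic auth expects base64("<key id>:<key secret>")
token = base64.b64encode(f"{api_key_id}:{api_key_secret}".encode()).decode()
authorization_header = f"Basic {token}"
```

Pass the resulting value in the Authorization header of the web request component in your Jira automation rule.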

    Go further

    • As a business user, learn more about in Soda Cloud.

    • Set that apply to multiple checks in your account.

    • Learn more about creating, tracking, and resolving data quality .

    • Access a list of that Soda Cloud supports.

    Build a Sigma dashboard

    This example helps you build a customized data quality reporting dashboard in Sigma using the Soda Cloud API.

    This guide offers a comprehensive example for building a customized data quality reporting dashboard in Sigma. Use the Soda Cloud API to capture metadata from your Soda Cloud account, store it in Snowflake, then access the data in Snowflake to create a Sigma dashboard.

    Prerequisites

    • Python 3.8, 3.9, or 3.10

    • Pip 21.0 or greater

    • access to an account in Sigma

    • access to a Snowflake data source

    • a Soda Cloud account; see

    • permission in Soda Cloud to access dataset metadata; see

    Set up a Python script

    1. Install an HTTP request library and Snowflake connector.

    2. In a new Python script, configure the following details to integrate with Soda Cloud. See for detailed instructions.

    3. In the same script, define the tables in which to store the Soda dataset information and check results in Snowflake, ensuring they are in uppercase to avoid issues with Snowflake's case sensitivity requirements.

    4. In the same script, configure your Snowflake connection details. This configuration enables your script to securely access your Snowflake data source.

    5. In the script, prepare an HTTP GET request to the Soda Cloud API to retrieve dataset information. Direct the request to the endpoint, including the authentication API keys to access the data. This script prints an error if the request is unauthorized.

    6. Run the script to ensure that the GET request results in HTTP status code 200, confirming the successful connection to Soda Cloud.
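The request preparation in the steps above can be sketched as follows. The host, placeholder credentials, and endpoint path are assumptions for illustration; check them against the Soda Cloud API reference for your account and region.

```python
import base64

# Assumed values: region host and placeholder API key pair
SODA_CLOUD_HOST = "cloud.soda.io"  # use cloud.us.soda.io for the US region
API_KEY_ID = "your-api-key-id"
API_KEY_SECRET = "your-api-key-secret"

# Soda Cloud authenticates API calls with HTTP Basic auth over the key pair
token = base64.b64encode(f"{API_KEY_ID}:{API_KEY_SECRET}".encode()).decode()
headers = {"Authorization": f"Basic {token}"}

# Assumed endpoint for dataset metadata
datasets_url = f"https://{SODA_CLOUD_HOST}/api/v1/datasets"

# With the requests library installed, the call itself would look like:
# response = requests.get(datasets_url, headers=headers)
# if response.status_code != 200:
#     print(f"Request failed or unauthorized: {response.status_code}")
```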

    Capture and store metadata

    1. With a functional connection to Soda Cloud, adjust the API call to extract all dataset information from Soda Cloud, iterating over each page of the datasets. Then, create a Pandas Dataframe to contain the retrieved metadata. This adjusted call retrieves information about each dataset's name, its last update, the data source in which it exists, its health status, and the volume of checks and incidents with which it is associated.

    2. Inspect the information you retrieved with the following Pandas command; see example output below.

    3. Following the same logic, extract all the check-related information from Soda Cloud using the endpoint. This call retrieves information about the checks in Soda Cloud, including the dataset and column each runs against, the latest check evaluation time and the result—pass, warn, or fail—and any attributes associated with the check.

    4. Again, inspect the output with a Pandas command.

    5. Finally, move the two sets of metadata into your Snowflake data source. Optionally, if you wish to track updates and changes to dataset and check metadata over time, you can store the metadata to incremental tables and set up a flow to update the values on a regular basis using the latest information retrieved from Soda Cloud.

    6. Run the script to populate the tables in Snowflake with the metadata pulled from Soda Cloud.
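The page-by-page extraction described above can be sketched as a simple loop. This sketch stands in a hypothetical fetch_page function and an assumed response shape (a content list plus a last-page flag) for the authenticated API call; the real pagination fields may differ, so match them to the Soda Cloud API responses.

```python
import pandas as pd

def fetch_page(page: int) -> dict:
    """Stand-in for the authenticated GET to the Soda Cloud datasets
    endpoint; a real call would request page <n> and return the JSON body."""
    data = {
        0: [{"name": "dim_customer", "healthStatus": "healthy"}],
        1: [{"name": "fact_sales", "healthStatus": "warning"}],
        2: [],
    }
    return {"content": data.get(page, []), "last": page >= 2}

# Iterate over each page of results, accumulating rows
rows, page = [], 0
while True:
    body = fetch_page(page)
    rows.extend(body["content"])
    if body["last"]:  # stop once the API reports the final page
        break
    page += 1

# Collect the retrieved metadata into a pandas DataFrame
datasets_df = pd.DataFrame(rows)
```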

    Build a data quality dashboard in Sigma

    To build a custom dashboard, this example uses Sigma, a cloud-based analytics and business intelligence platform designed to facilitate data exploration and analysis. You may wish to use a different tool to build a dashboard such as Metabase, Lightdash, Looker, PowerBI, or Tableau.

    This example leverages check attributes, an optional configuration that helps categorize or segment check results so you can better filter and organize not only your views in Soda Cloud, but your customized dashboard. Checks in this example use the following attributes:

    • Data Quality Dimension: Completeness, Validity, Consistency, Accuracy, Timeliness, Uniqueness

    • Data Domain: Customer, Location, Product, Transaction

    • Data Team: Data Engineering, Data Science, Sales Operations

    • Pipeline stage: Destination, Ingest, Report, Transform

    The weight attribute, in particular, is useful for allocating a numerical level of importance to checks, which you can use to create a custom data quality score.
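As an illustration of how such a weighted score might be computed, the sketch below scores passing checks by their weight attribute; the check names, weights, and outcomes are invented for the example.

```python
import pandas as pd

# Illustrative check results: one row per check, with its custom
# "weight" attribute and latest evaluation outcome
checks_df = pd.DataFrame({
    "check":   ["row_count > 0", "missing_count = 0", "schema"],
    "weight":  [5, 3, 2],
    "outcome": ["pass", "fail", "pass"],
})

# Weighted score: share of the total weight carried by passing checks
passed_weight = checks_df.loc[checks_df["outcome"] == "pass", "weight"].sum()
score = 100 * passed_weight / checks_df["weight"].sum()
print(f"Weighted data quality score: {score:.1f}%")  # 70.0%
```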

    1. Follow the Sigma documentation to .

    2. Follow Sigma documentation to access the metadata you stored in Snowflake, either by , or .

    3. Create a new workbook in Sigma where you can create your dashboard.

    The Sigma dashboard below tracks data quality status within an organization. It includes some basic KPI information including the number of datasets monitored by Soda, as well as the number of checks that it regularly executes. It displays a weighted data quality score based on the custom values provided in the Weight attribute for each check (here shown according to data quality dimension) which it compares to previous measurements gathered over time.

    Go further

    • Access full and documentation.

    • Learn more about and .

    Connect Soda to OracleDB

    Access configuration details to connect Soda to an OracleDB data source.

    For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see .

    Compatibility

    Soda supports Oracle version 21.3 or greater.

    Connection configuration reference

    Install package: soda-oracle

    Alternatively, you can configure a connection without a connectstring.

    Property
    Required
    Notes

    Supported data types

    Category
    Data type

    Organize results, set alerts, investigate issues

    Data quality is a team sport! Integrate with Slack so Soda Cloud can send alerts to your team. Invite your team to join your Soda Cloud account.

    After you have set up Soda, there are several recommended steps to take to customize your implementation and maximize your team's efficiency in monitoring data quality. Though recommended, these customizations are optional.

    As the last step in the Get started roadmap, this guide offers instructions to organize your check results, customize alert notifications, open incidents to investigate issues, and more.

    Get started roadmap

    Set up a Soda-hosted agent

    Use an out-of-the-box Soda-hosted agent to connect to your data sources and begin testing data quality.

    The Soda environment has been updated since this tutorial.

    Refer to for updated tutorials.

    The Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality.

    Use the secure, out-of-the-box Soda-hosted agent made available for every Soda Cloud organization or, alternatively, create a Kubernetes cluster in your organization's environment and use Helm to deploy a self-hosted Soda Agent in the cluster; see

    Manage sensitive data

    Learn how to mitigate the exposure of sensitive information in Soda Cloud.

    Soda provides several capabilities and configurable settings that help you manage access to sensitive data. What follows are several options that you can implement to guard against unauthorized access to sensitive data that Soda may check for data quality.

    Utilize roles and permissions in Soda Cloud

    Soda Cloud employs roles and permissions that apply to users of an organization's account. These access controls enable you to define who can access, add, change, or delete metadata or access to data in the account.

    Refer to for much more detail and guidance on how to limit access.

    Integrate Soda with a GitHub Workflow

    Use the GitHub Action for Soda to automatically scan for data quality during development.

    Add the to your GitHub Workflow to automatically execute scans for data quality during development.

    About Soda and the Soda GitHub Action

    Soda works by taking the data quality checks that you prepare and using them to run a scan of datasets in a data source. A scan is a CLI command which instructs Soda to prepare optimized SQL queries that execute data quality checks on your data source to find invalid, missing, or unexpected data. When checks fail, they surface bad-quality data and present check results that help you investigate and address quality issues.

    For example, in a repository in which you are adding a transformation or making changes to a dbt model, you can add the GitHub Action for Soda to your workflow, as above. With each new pull request, or commit to an existing one, it executes a Soda scan for data quality and presents the results of the scan in a comment in the pull request, and in a report in Soda Cloud.

    Integrate Soda with dbt

    Integrate Soda with dbt-core or dbt Cloud to access dbt test results from within your Soda Cloud account and leverage all its features.

    Integrate Soda with dbt to access dbt test results from within your Soda Cloud account.

    Use Soda Library to ingest the results of your dbt tests and push them to Soda Cloud so you can leverage features such as:

    • visualizing your data quality over time

    • setting up alert notifications for your team when dbt tests fail

    from typing import Optional

    from airflow import DAG
    from airflow.models.variable import Variable
    from airflow.operators.python import PythonOperator

    class SodaScanOperator(PythonOperator):
        def __init__(self,
                     task_id: str,
                     dag: DAG,
                     data_sources: list,
                     soda_cl_path: str,
                     variables: dict = None,
                     airflow_variables: list = None,
                     airflow_variables_json: list = None,
                     soda_cloud_api_key: Optional[str] = None,
                     soda_cloud_api_key_var_name: Optional[str] = None):
            
            if variables is None:
                variables = {}
            if isinstance(airflow_variables, list):
                for airflow_variable in airflow_variables:
                    variables[airflow_variable] = Variable.get(airflow_variable)
            if isinstance(airflow_variables_json, list):
                for airflow_variable in airflow_variables_json:
                    variables[airflow_variable] = Variable.get(airflow_variable, deserialize_json=True)
                    
            if not soda_cloud_api_key and soda_cloud_api_key_var_name:
                soda_cloud_api_key = Variable.get(soda_cloud_api_key_var_name)
            
            super().__init__(
                task_id=task_id,
                python_callable=SodaAirflow.scan,
                op_kwargs={
                    'scan_name': f'{dag.dag_id}.{task_id}',
                    'data_sources': data_sources,
                    'soda_cl_path': soda_cl_path,
                    'variables': variables,
                    'soda_cloud_api_key': soda_cloud_api_key
                },
                dag=dag
            )
    from typing import Optional

    from airflow.models.variable import Variable
    from soda.scan import Scan

    class SodaAirflow:
    
        @staticmethod
        def scan(scan_name: str,
                 data_sources: list,
                 soda_cl_path: str,
                 schedule_name: Optional[str] = None,
                 variables: dict = None,
                 soda_cloud_api_key: str = None):

            scan = Scan()
            scan.set_scan_definition_name(scan_name)
            scan.set_data_source_name('')
    
            if data_sources:
                for data_source_details in data_sources:
                    data_source_properties = data_source_details.copy()
                    data_source_name = data_source_properties.pop('data_source_name')
                    airflow_conn_id = data_source_properties.pop('airflow_conn_id')
                    connection = Variable.get(f'conn.{airflow_conn_id}')
                    scan.add_environment_provided_data_source_connection(
                        connection=connection,
                        data_source_name=data_source_name,
                        data_source_properties=data_source_properties
                    )
    
            scan.add_sodacl_yaml_files(soda_cl_path)
            scan.add_variables(variables)
            scan.add_soda_cloud_api_key(soda_cloud_api_key)
            scan.execute()
            scan.assert_no_error_logs()
            scan.assert_no_checks_fail()
    from airflow import DAG
    from airflow.models.variable import Variable
    from airflow.operators.python import PythonVirtualenvOperator
    from airflow.operators.dummy import DummyOperator
    from airflow.utils.dates import days_ago
    from datetime import timedelta
    import os
    from airflow.exceptions import AirflowFailException
    
    default_args = {
        'owner': 'soda_core',
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }
    
    
    def run_soda_scan():
        from soda.scan import Scan
        print("Running Soda Scan ...")
        config_file = "/Users/path-to-your-config-file/configuration.yml"
        checks_file = "/Users/path-to-your-checks-file/checks.yml"
        data_source = "srcdb"
    
        scan = Scan()
        scan.set_verbose()
        scan.add_configuration_yaml_file(config_file)
        scan.set_data_source_name(data_source)
        scan.add_sodacl_yaml_files(checks_file)
        scan.execute()
    
        print(scan.get_logs_text())
        if scan.has_check_fails():
            raise AirflowFailException("Soda Scan failed with errors!")
        else:
            print("Soda scan successful")
            return 0
    
    
    dag = DAG(
        'soda_core_python_venv_op',
        default_args=default_args,
        description='A simple Soda Library scan DAG',
        schedule_interval=timedelta(days=1),
        start_date=days_ago(1),
    )
    
    ingest_data_op = DummyOperator(
        task_id='ingest_data'
    )
    
    soda_core_scan_op = PythonVirtualenvOperator(
        task_id='soda_core_scan_demodata',
        python_callable=run_soda_scan,
        requirements=["-i https://pypi.cloud.soda.io", "soda-postgres"],
        system_site_packages=False,
        dag=dag
    )
    
    publish_data_op = DummyOperator(
        task_id='publish_data'
    )
    
    ingest_data_op >> soda_core_scan_op >> publish_data_op
    - name: Cars
      soda:
        datasource_id: 2d33bf0a-9a1c-4c4b-b148-b5af318761b3
        datasource_name: adventureworks
        # optional dataset_mapping   soda: catalog
        dataset_mapping:
           Cars_data: Cars
      catalog:
        type: "alation"
        datasource_id: "31"
        datasource_container_name: "soda"
        datasource_container_id: "1"
    - name: Soda Demo
      soda:
        datasource_id: 8505cbbd-d8b3-48a4-bad4-cfb0bec4c02f
      catalog:
        type: "alation"
        datasource_id: "37"
        datasource_container_name: "public"
        datasource_container_id: "2"
    ALATION_HOST=yourcompany.alationcatalog.com
    ALATION_USER=<your username for your Alation account>
    ALATION_PASSWORD=<your password for your Alation account>
    SODA_HOST=cloud.soda.io
    SODA_API_KEY_ID=<your Soda Cloud public key>
    SODA_API_KEY_SECRET=<your Soda Cloud private key>
    for each dataset T:
      datasets:
        - dim_products%
        - fact%
        - exclude fact_survey_response
      checks:
        - row_count > 0
    for each dataset T:
      datasets:
        # include the dataset 
        - dim_customers
        # include all datasets matching the wildcard expression
        - dim_products%
        # (optional) explicitly add the word include to make the list more readable
        - include dim_employee
        # exclude a specific dataset
        - exclude fact_survey_response
        # exclude any datasets matching the wildcard expression
        - exclude prospective_%
      checks:
        - row_count > 0
    for each dataset T:
      datasets:
        - dim_employee
    
      checks:
        - max(vacation_hours) < 80:
            name: Too many vacation hours for US Sales
    for each dataset T:
      datasets:
        - dim_employee
        - dim_customer
    
      checks:
        - row_count:
            fail:
              when < 5
            warn:
              when > 10
    for each dataset T:
      datasets:
        - dim_employee
    
      checks:
        - max(vacation_hours) < 80:
            filter: sales_territory_key = 11
    for each dataset T:
      datasets:
        - dim_%
    
      checks:
        - row_count > 1
    for each dataset R:
      datasets:
        - retail%
      checks:
        - anomaly detection for row_count:
            name: Row count anomaly for ${R}
    for each dataset T:
      datasets:
        - dim_employee
        - dim_customer
    
      checks:
        - row_count > 1
    sample datasets:
      datasets:
        - include customer%
        - exclude test%
    sample datasets:
      datasets:
        - include retail_orders
    data_source soda_demo:
      type: sqlserver
      host: localhost
      username: ${SQL_USERNAME}
      password: ${SQL_PASSWORD}
      quote_tables: true
    data_source my_datasource:
      type: postgres
      ...
      sampler:
        disable_samples: True
    checks for dim_customer:
    # Failed rows defined using common table expression
      - failed rows:
          samples limit: 50
          fail condition: total_children = '2' and number_cars_owned >= 3
    checks for dim_customer:
    # Failed rows defined using SQL query
      - failed rows:
          fail query: |
            SELECT DISTINCT geography_key
            FROM dim_customer as customer

    use_dask_count_star_as_count_one=True: COUNT(*) behaves as SQL COUNT(1) operation

    use_dask_count_star_as_count_one=False: COUNT(*) behaves as SQL COUNT(*) operation

    Learn more:

    • Dask documentation

    • Add optional parameter for COUNT

    • Add optional parameter for text data conversion
    import pandas as pd
    
    import dask
    import dask.datasets
    from soda.scan import Scan
    
    # Read more info in "Note on new release" section
    dask.config.set({"dataframe.convert-string": False})
    
    # Create a Soda scan object
    scan = Scan()
    
    # Load timeseries data from dask datasets
    df_timeseries = dask.datasets.timeseries().reset_index()
    df_timeseries["email"] = "[email protected]"
    
    # Create an artificial pandas dataframe
    df_employee = pd.DataFrame({"email": ["[email protected]", "[email protected]", "[email protected]"]})
    
    # Either add Dask dataframe to scan and assign a dataset name to refer from checks.yaml
    scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders")
    # OR, add Pandas dataframe to scan and assign a dataset name to refer from checks.yaml
    scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee, data_source_name="orders")
    
    # Optionally, add multiple dataframes as unique data sources. Note the change of 
    # the data_source_name parameter. 
    scan.add_dask_dataframe(dataset_name="inquiries", dask_df=[...], data_source_name="customers")
    
    # Set the scan definition name and default data source to use
    scan.set_scan_definition_name("test")
    scan.set_data_source_name("orders")
    
    # Add configuration YAML file
    # You do not need connection to a data source; you must have a connection to Soda Cloud
    # Choose one of the following two options:
    # 1) From a file
    scan.add_configuration_yaml_file(file_path="~/.soda/configuration.yml")
    # 2) Inline in the code
    # For host, use cloud.soda.io for EU region; use cloud.us.soda.io for US region
    scan.add_configuration_yaml_str(
        """
        soda_cloud:
          host: cloud.soda.io
          api_key_id: 2e0ba0cb-your-api-key-7b
          api_key_secret: 5wd-your-api-key-secret-aGuRg
    """
    )
    
    # Define checks in yaml format
    # Alternatively, refer to a yaml file using scan.add_sodacl_yaml_file(<filepath>)
    checks = """
    for each dataset T:
      datasets:
        - include %
      checks:
        - row_count > 0
    profile columns:
      columns:
        - employee.%
    checks for employee:
        - values in (email) must exist in timeseries (email) # Error expected
        - row_count same as timeseries # Error expected
    checks for timeseries:
      - avg_x_minus_y between -1 and 1:
          avg_x_minus_y expression: AVG(x - y)
      - failed rows:
          samples limit: 50
          fail condition: x >= 3
      - schema:
          name: Confirm that required columns are present
          warn:
            when required column missing: [x]
            when forbidden column present: [email]
            when wrong column type:
              email: varchar
          fail:
            when required column missing:
              - y
      - invalid_count(email) = 0:
          valid format: email
      - valid_count(email) > 0:
          valid format: email
    """
    
    scan.add_sodacl_yaml_str(checks)
    
    scan.set_verbose(True)
    scan.execute()
    import pandas as pd
    from soda.scan import Scan
    
    # Create a Soda scan object
    scan = Scan()
    
    # Load JSON file into DataFrame
    df = pd.read_json('your_file.json')
    
    ...
    import pandas as pd
    
    import dask
    import dask.datasets
    from soda.scan import Scan
    
    # Create a Soda scan object
    scan = Scan()
    
    # Load timeseries data from Dask datasets
    df_timeseries = dask.datasets.timeseries().reset_index()
    df_timeseries["email"] = "[email protected]"
    
    # Add Dask Dataframe to scan and assign a dataset name to refer from checks.yaml
    # Dask uses SQL COUNT(*) operation, instead of COUNT(1)
    scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)
    import pandas as pd
    
    import dask
    import dask.datasets
    from soda.scan import Scan
    
    # Avoid string conversion errors
    dask.config.set({"dataframe.convert-string": False})
    
    # Create a Soda scan object
    scan = Scan()
    
    # Load timeseries data from Dask datasets
    df_timeseries = dask.datasets.timeseries().reset_index()
    df_timeseries["email"] = "[email protected]"
    
    # Add Dask Dataframe to scan and assign a dataset name to refer from checks.yaml
    scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)
    Navigate to Data Sources, then access the Agents tab. Notice your out-of-the-box Soda-hosted agent that is up and running.

    Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality. Create a Kubernetes cluster in a cloud services provider environment, then use Helm to deploy a self-hosted Soda Agent in the cluster.

    For context, this example assumes that:

    • you have the appropriate access to a cloud services provider environment such as Azure, AWS, or Google Cloud that allows you to create and deploy applications to a cluster,

    • you, or someone on your team, has access to the login credentials that Soda needs to be able to access a data source such as MS SQL, BigQuery, or Athena so that it can run scans of the data.

    Access the exhaustive deployment instructions for the cloud services provider you use.

    • Cloud services provider-agnostic instructions

    • Amazon Elastic Kubernetes Service (EKS)

    • Microsoft Azure Kubernetes Service (AKS)

    • Google Kubernetes Engine (GKE)

    See also:

  • identify the new data source and its default scan definition

  • provide connection configuration details for the data source such as name, schema, and login credentials, and test the connection to the data source

  • profile the datasets in the data source to gather basic metadata about the contents of each

  • identify the datasets to which you wish to apply automated monitoring for anomalies and schema changes

  • assign ownership roles for the data source and its datasets

  • Save the new data source.

  • After filling in the blanks and testing the check, click Propose Check to add the SodaCL check to the discussion. When your colleagues join and review the Discussions, they can add comments or propose new or different checks to address the data quality issues of this dataset.

  • When you and your team agree on the data quality checks to add to the dataset, you, as the data producer, can Review & Add the check to a scan for the dataset – either existing or new – so that Soda begins executing the check as per the data source's default scan schedule.

  • Amazon Athena Amazon Redshift Azure Synapse ClickHouse Databricks SQL Denodo Dremio DuckDB GCP BigQuery Google CloudSQL

    IBM DB2 MotherDuck MS SQL Server1 MySQL OracleDB PostgreSQL Presto Snowflake Trino Vertica

    BigQuery Databricks SQL MS SQL Server MySQL

    PostgreSQL Redshift Snowflake


    Need help? Join the Soda community on Slack.

    Expands the environment variables to pass to the Docker run command; these variables can be configured in the workflow file and may contain secrets.

  • Runs the built image to trigger the Soda scan for data quality.

  • Converts the Soda Library scan results to a markdown table using the newest commit hash of the 1.0.0 version.

  • Creates a pull request comment.

  • Posts any additional messages to make it clear whether or not the scan failed.

  • See the public soda-github-action repository for more detail.

  • To examine the full scan report and troubleshoot any issues, click the link in the comment to View full scan results, then click View Scan Log. Use Troubleshoot SodaCL for help diagnosing issues.


    ✓

    Use quotes when identifying dataset or column names. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    ✓

    Use wildcard characters in the value in the check, as you would with CTE or SQL.

    Use for each to apply failed rows checks to multiple datasets in one scan.

    -

    ✓

    Apply a dataset filter to partition data during a scan. Known issue: Dataset filters are not compatible with failed rows checks which use a SQL query. With such a check, Soda does not apply the dataset filter at scan time.

    ✓

    Specify a single column against which to run a failed rows check.

    -

    Supports samples columns parameter to specify columns from which Soda draws failed row samples.

    Supports samples limit parameter to control the volume of failed row samples Soda collects.

    Supports collect failed rows parameter to instruct Soda to collect, or not to collect, failed row samples for a check.

    Reference tips and best practices for SodaCL.

    ✓

    Define a name for a failed rows check; see Customize check names.

    ✓

    Add an identity to a check; see Add a check identity.

    ✓

    Define alert configurations to specify warn and fail alert conditions; see Add alert configurations.

    Apply an in-check filter to return results for a specific portion of the data in your dataset.


  • IncidentID

  • IncidentURL

  • CheckURL


    port (optional)

    Provide a port identifier. Default is 1523. Only used when connectstring is not provided.

    service_name (optional)

    Provide a service_name. Only used when connectstring is not provided.

    connectstring (optional)

    Specify connection information for the Oracle database. Must be a semicolon-separated list of attribute name and value pairings. See the Oracle documentation for details. If you do not specify one, Soda attempts to construct a connectstring using the host, port, and service_name properties.

    dataset_prefix (optional)

    Added in 1.10.1. A list of strings used to prefix datasets. Useful for catalog integrations. Example: dataset_prefix: ["my_db", "my_schema"]

    type (required)

    Identify the type of data source for Soda.

    username (required)

    Consider using system variables to retrieve this value securely.

    password (required)

    Consider using system variables to retrieve this value securely.

    host (optional)

    Provide a host identifier. Only used when connectstring is not provided.

    text: CHARACTER VARYING, CHARACTER, CHAR, TEXT

    number: SMALLINT, INTEGER, BIGINT, DECIMAL, NUMERIC, VARIABLE, REAL, DOUBLE PRECISION, SMALLSERIAL, SERIAL, BIGSERIAL

    time: TIMESTAMP, DATE, TIME, TIMESTAMP WITH TIME ZONE, TIMESTAMP WITHOUT TIME ZONE, TIME WITH TIME ZONE, TIME WITHOUT TIME ZONE

    Get started

    Test the data source connection

    To confirm that you have correctly configured the connection details for the data source(s) in your configuration YAML file, use the test-connection command. If you wish, add a -V option to the command to return results in verbose mode in the CLI.

    soda test-connection -d my_datasource -c configuration.yml -V
  • Choose a flavor of Soda

  • Set up Soda: install, deploy, or invoke

  • Write SodaCL checks

  • Run scans and review results

  • Organize, alert, investigate 📍 You are here!


  • Customize your dashboard

    Customize your dashboard by adding filters to distill the data the dashboard displays. Save your customized dashboard so you can easily return to your distilled view.

    Activate anomaly dashboards

    Available in 2025.

    For preview participants only.

    ✔️ Requires Soda Core Scientific (included in a Soda Agent)
    ✖️ Supported in Soda Core
    ✖️ Supported in Soda Library + Soda Cloud
    ✔️ Supported in Soda Cloud + self-hosted Soda Agent connected to any Soda-supported data source, except Spark, Dask, and Pandas
    ✔️ Supported in Soda Cloud + Soda-hosted Agent connected to a BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake data source

    To automatically build an anomaly dashboard for one or more datasets in a data source, you can configure Soda Cloud to profile the columns in datasets. During the guided workflow to add a new data source, add profiling configuration to indicate the datasets for which you wish to activate an anomaly dashboard.

    During the following five or more days, Soda's machine learning algorithm gathers measurements that allow it to recognize patterns in your data. When it has gathered enough information to reliably discern patterns, it automatically begins detecting anomalies in your data relative to those patterns. In the Anomalies tab of a dataset, you can access an anomaly dashboard that displays the results of multiple automated anomaly detection checks that Soda calibrated to your data.

    Integrate with Slack

    As a user with permission to do so in your Soda Cloud account, you can integrate your Slack workspace in your Soda Cloud account so that Soda Cloud can interact with individuals and channels in the workspace. Use the Slack integration to:

    • send notifications to Slack when a check result triggers an alert

    • create a private channel whenever you open a new incident to investigate a failed check result

    • track Soda Discussions wherein your fellow Soda users collaborate on data quality checks

    1. In Soda Cloud, navigate to your avatar > Organization Settings, then navigate to the Integrations tab and click the + icon to add a new integration.

    2. Follow the guided steps to authorize Soda Cloud to connect to your Slack workspace. If necessary, contact your organization’s Slack Administrator to approve the integration with Soda Cloud.

      • Configuration tab: select the public channels to which Soda can post messages; Soda cannot post to private channels.

      • Scope tab: select the Soda features (alert notifications and/or incidents) which can access the Slack integration.

    Note that Soda caches the response from the Slack API, refreshing it hourly. If you created a new public channel in Slack to use for your integration with Soda, be aware that the new channel may not appear in the Configuration tab in Soda until the hourly Slack API refresh is complete.

    Organize results, set alerts, investigate issues Alternatively, you can integrate Soda with MS Teams or another third-party ticketing or messaging tool using a webhook.

    Invite your team members

    Invite the members of your team to join you in your work to monitor data quality in your organization.

    In your Soda Cloud account, navigate to your avatar > Invite Team Members and fill in the blanks.

    When your team members receive the invitation email, they can click the link in the email to create their own login credentials to access your Soda Cloud account directly. Refer to Manage global roles, user groups, and settings to learn more about the default access rights Soda Cloud assigns to new users.

    Note that if your organization uses a single sign-on (SSO) identity provider to access Soda Cloud, you cannot invite team members in Soda Cloud. Instead, contact your IT Admin to request access to Soda Cloud using your SSO. See also, Single Sign-on with Soda Cloud.

    Add check attributes

    Define check attributes that your team can apply to checks to filter check results and customize alert notifications.

    • Apply attributes to checks to label and sort them by department, priority, location, etc.

    • Add a check attribute to identify, for example, checks that execute against personally identifiable information (PII).

    • Define rules to route alert notifications according to check attributes.

    1. You must define check attributes first, before a user can apply the attribute to new or existing checks. In your Soda Cloud account, navigate to your avatar > Attributes > New Attribute.

    2. Follow the guided steps to create and save a new attribute. Learn more

    3. Apply the new attribute to SodaCL checks using key:value pairs, as in the following example which applies five attributes to a new row_count check.
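    As a sketch, such a check might look like the following; the dataset name and the attribute names and values here (department, priority, pii, location, tags) are hypothetical, and each attribute must already be defined in Soda Cloud before you can apply it.

    ```yaml
    checks for dim_customer:
      - row_count > 0:
          # Hypothetical attributes; define each one in Soda Cloud first
          attributes:
            department: Marketing
            priority: 1
            pii: true
            location: EMEA
            tags: [customer-data]
    ```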

    Set alert notification rules

    Ascribing to a "No noise" policy, Soda enables you to define rules to customize the alert notifications you receive when check results warn or fail. For example, you can define a notification rule to instruct Soda Cloud to send an alert to your #sales-engineering Slack channel whenever a data quality check on the snowflake_sales data source fails.

    In Soda Cloud, navigate to your avatar > Notification Rules, then click New Notification Rule and follow the guided steps to complete the new rule. Learn more

    Build check collections

    If there are checks which you wish to review frequently, consider building a Collection.

    1. In your Soda Cloud account, navigate to the Checks dashboard.

    2. Use a combination of Soda Cloud filters to display your ideal set of data quality checks, then click Save Collection to name the custom filtered view.

    3. In the future, use the dropdown in the Checks dashboard to quickly access your collection again.

    Create incidents

    When a check fails, you can create an incident in Soda Cloud to track your team’s investigation and resolution of a data quality issue. Read more

    1. Log in to your Soda Cloud account, then navigate to the Checks dashboard.

    2. For the check you wish to investigate, click the stacked dots at right, then select Create Incident. Provide a Title, Severity, and Description of your new incident, then save.

    3. In the Incident column of the check result, click the Incident link to access the Incident page where you can record the following details:

      • Severity: Minor, Major, or Critical

      • Status: Reported, Investigating, Fixing, Resolved

      • Lead: a list of team members from whom you can assign the Lead Investigator role

    4. Save your changes.

    5. If you have connected your Soda Cloud account to Slack, navigate to the Integrations tile, then click the auto-generated link that connects directly to a newly-created, public channel in your Slack workspace dedicated to the investigation and resolution of the incident, and invite team members to the channel to collaborate on resolving the data quality issue. If you have integrated Soda Cloud with MS Teams or another third-party tool, like Jira or ServiceNow, you can access those tools via auto-generated links in the Integrations tile, as well.

    If you have integrated your Soda Cloud account with a Slack workspace, you can use an incident's built-in ability to create an incident-specific Slack channel where you and your team can collaborate on the issue investigation.

    Add dataset attributes

    With dozens, or even hundreds of datasets in your Soda Cloud account, it may be difficult to find the data quality information you’re looking for. To facilitate your search for specific data quality status, consider defining your own Attributes and Tags for datasets, then use filters to narrow your search.

    Use dataset attributes to:

    • identify datasets that are associated with a particular marketing campaign

    • identify datasets that are relevant for a particular customer account

    • identify datasets whose quality is critical to business operations, or to categorize datasets according to their criticality in general, such as “high”, “medium”, and “low”.

    • identify datasets that populate a particular report or dashboard

    1. You must define attributes first, before applying them to datasets. In your Soda Cloud account, navigate to your avatar > Attributes > New Attribute.

    2. Follow the guided steps to create the new attribute. Learn more

    3. Navigate to the Datasets dashboard, click the stacked dots next to a dataset, then select Edit Dataset. Use the attributes fields to apply the appropriate attributes to the dataset, and add any tags you wish as further dataset identifiers. 4. After saving your changes and applying tags and attributes to multiple datasets, use the Filters in the Datasets dashboard to display the datasets that help narrow your study of data quality.

    Integrate with a data catalog

    If your team uses a data catalog such as Alation, Atlan, or Metaphor, consider integrating it with Soda to access details about the quality of your data directly within the data catalog.

    • Run data quality checks using Soda and visualize quality metrics and rules within the context of a data source, dataset, or column in Alation.

    • Use Soda Cloud to flag poor-quality data in lineage diagrams and during live querying.

    • Give your Alation users the confidence of knowing that the data they are using is sound.

    Use the links below to access catalog-specific integration instructions.

    Integrate with Alation

    Integrate with Atlan

    Integrate with Metaphor

    Integrate with Microsoft Purview

    Go further

    1. Choose a flavor of Soda

    2. Set up Soda: install, deploy, or invoke

    3. Write SodaCL checks

    4. Run scans and review results

    5. Organize, alert, investigate

    🌟 Well done! You've completed the roadmap! 🌟

    • Use the Reporting API to access metadata about your Soda Cloud account.

    • Are you a dbt user? Consider ingesting dbt tests into Soda Cloud for a single-pane-of-glass view of your data quality tests.

    • Access the Use case guides for example implementations of Soda.


    A Soda-hosted agent enables Soda Cloud users to securely connect to supported data sources and create checks for data quality in the new data source.

    As a step in the Get started roadmap, this guide offers instructions to set up Soda in a Soda-hosted agent deployment model.

    Get started roadmap

    1. Choose a flavor of Soda

    2. Set up Soda: Soda-hosted agent 📍 You are here!
       a. Create a Soda Cloud account
       b. Add a new data source

    3. Write SodaCL checks

    4. Run scans and review results

    5. Organize, alert, investigate

    Compatibility

    • BigQuery

    • Databricks SQL

    • MS SQL Server

    • MySQL

    • PostgreSQL

    • Redshift

    • Snowflake

    Create a Soda Cloud account

    1. If you have not already done so, create a Soda Cloud account at cloud.soda.io. If you already have a Soda account, log in.

    2. By default, Soda prepares a Soda-hosted agent for all newly-created accounts. However, if you are an Admin in an existing Soda Cloud account and wish to use a Soda-hosted agent, navigate to your avatar > Organization Settings. In the Organization tab, click the checkbox to Enable Soda-hosted Agent.

    3. Navigate to your avatar > Data Sources, then access the Agents tab. Notice your out-of-the-box Soda-hosted agent that is up and running.

    Add a new data source

    In your Soda Cloud account, navigate to your avatar > Data Sources. Click New Data Source, then follow the guided steps to create a new data source. Refer to the sections below for insight into the values to enter in the fields and editing panels in the guided steps.

    Already have a data source connected to a self-hosted agent?

    You can migrate a data source to a Soda-hosted agent.

    1. Attributes

    Field or Label
    Guidance

    Data Source Label

    Provide a unique identifier for the data source. Soda Cloud uses the label you provide to define the immutable name of the data source against which it runs the Default Scan.

    Default Scan Agent

    Select the Soda-hosted agent, or the name of a Soda Agent that you have previously set up in your secure environment. This identifies the Soda Agent to which Soda Cloud must connect in order to run its scan.

    Check Schedule

    Provide the scan frequency details Soda Cloud uses to execute scans according to your needs. If you wish, you can define the schedule as a cron expression.

    Starting At (UTC)

    Select the time of day to run the scan. The default value is midnight.

    Custom Cron Expression

    (Optional) Write your own cron expression to define the schedule Soda Cloud uses to run scans.

    Anomaly Dashboard Scan Schedule (available in 2025)

    Provide the scan frequency details Soda Cloud uses to execute a daily scan to automatically detect anomalies for the anomaly dashboard.

    2. Connect

    Enter values in the fields to provide the connection configurations Soda Cloud needs to be able to access the data in the data source. Connection configurations are data source-specific and include values for things such as a database's host and access credentials.

    Soda hosts agents in a secure environment in Amazon AWS. As a SOC 2 Type 2 certified business, Soda responsibly manages Soda-hosted agents to ensure that they remain private, secure, and independent of all other hosted agents. See Data security and privacy for details.

    Use the following data source-specific connection configuration pages to populate the connection fields in Soda Cloud.

    • Connect to BigQuery

    • Connect to Databricks SQL

    • Connect to MS SQL Server

    • Connect to MySQL

    3. Discover

    During its initial scan of your data source, Soda Cloud discovers all the datasets the data source contains. It captures basic information about each dataset, including dataset names, the columns each contains, and the type of data each column contains such as integer, character varying, timestamp, etc.

    In the editing panel, specify the datasets that Soda Cloud must include or exclude from this basic discovery activity. The default syntax in the editing panel instructs Soda to collect basic dataset information from all datasets in the data source except those with names that begin with test_. The % is a wildcard character. See Add dataset discovery for more detail on profiling syntax.
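    As a sketch, the default discovery configuration described above looks like the following in the editing panel:

    ```yaml
    discover datasets:
      datasets:
        - include %       # discover all datasets...
        - exclude test_%  # ...except those whose names begin with test_
    ```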

    Known issue: SodaCL does not support using variables in column profiling and dataset discovery configurations.

    4. Profile

    To gather more detailed profile information about datasets in your data source and automatically build an anomaly dashboard for data quality observability (preview, only), you can configure Soda Cloud to profile the columns in datasets.

    Profiling a dataset produces two tabs' worth of data in a dataset page:

    • In the Columns tab, you can see column profile information including details such as the calculated mean value of data in a column, the maximum and minimum values in a column, and the number of rows with missing data.

    • In the Anomalies tab, you can access an out-of-the-box anomaly dashboard that uses the column profile information to automatically begin detecting anomalies in your data relative to the patterns the machine learning algorithm learns over the course of approximately five days. (Available in 2025. Learn more)

    In the editing panel, provide details that Soda Cloud uses to determine which datasets to include or exclude when it profiles the columns in a dataset. The default syntax in the editing panel instructs Soda to profile every column of every dataset in this data source, and, superfluously, all datasets with names that begin with prod. The % is a wildcard character. See Add column profiling for more detail on profiling syntax.
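    As a sketch, the default profiling configuration described above looks like the following, where each entry takes the form dataset_pattern.column_pattern:

    ```yaml
    profile columns:
      columns:
        - "%.%"    # profile every column of every dataset
        - prod%.%  # (superfluously) every column of datasets beginning with prod
    ```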

    Column profiling and automated anomaly detection can be resource-heavy, so carefully consider the datasets for which you truly need column profile information. Refer to Compute consumption and cost considerations for more detail.

    5. Check

    When Soda Cloud automatically discovers the datasets in a data source, it prepares automated monitoring checks for each dataset. These checks detect anomalies and monitor schema evolution, corresponding to the SodaCL anomaly detection and schema evolution checks, respectively.

    (Note that if you have signed up for early access to anomaly dashboards for datasets, this Check tab is unavailable as Soda performs all automated monitoring automatically in the dashboards.)

    In the editing panel, specify the datasets that Soda Cloud must include or exclude when preparing automated monitoring checks. The default syntax in the editing panel indicates that Soda will add automated monitoring to all datasets in the data source except those with names that begin with test_. The % is a wildcard character. Refer to Add automated monitoring checks for further detail.
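    As a sketch, the default automated monitoring configuration described above looks like the following:

    ```yaml
    automated monitoring:
      datasets:
        - include %       # monitor all datasets...
        - exclude test_%  # ...except those whose names begin with test_
    ```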

    6. Assign

    This tab is the fifth step in the guided workflow if the 5. Check tab is absent because you requested access to the anomaly dashboards feature.

    Field or Label
    Guidance

    Data Source Owner

    The Data Source Owner maintains the connection details and settings for this data source and its Default Scan Definition.

    Default Dataset Owner


    Next

    1. Choose a flavor of Soda

    2. Set up Soda: self-hosted agent

    3. Write SodaCL checks

    4. Run scans and review results

    5. Organize, alert, investigate


    Deploy a Soda Agent
    Deploy a self-hosted Soda Agent

    Soda's self-hosted agent is a containerized Soda Library deployed in a Kubernetes cluster in your cloud services provider environment, such as Azure or AWS. It enables users of Soda Cloud to securely access your data sources so it can perform data quality scanning while meeting your infrastructure team’s security rules and requirements that protect credentials and record-level data from exposure.

    Consider deploying a self-hosted agent in your own infrastructure to securely manage access to your data sources. See also: Soda architecture

    Further, if you use an external secrets manager such as Hashicorp Vault or AWS Secrets Manager, you may wish to integrate your self-hosted Soda Agent with your secrets manager to securely and efficiently grant Soda access to data sources that use frequently-rotated login credentials.

    Limit data sampling

    During the data source onboarding process, you have the option to configure Soda to collect and store 100 rows of sample data for the datasets in the data source. This is a feature you must explicitly configure; Soda does not collect sample rows of data by default.

    These samples, accessible in Soda Cloud, enable users to gain insight into the data's characteristics, facilitating the formulation of data quality rules.

    Turn off sample data collection

    Where your datasets contain sensitive or private information, you may not want to collect, send, store, or visualize any samples from your data source to Soda Cloud. In such a circumstance, you can disable the feature completely in Soda Cloud.

    To prevent Soda Cloud from receiving any sample data or failed row samples for any datasets in any data sources to which you have connected your Soda Cloud account, proceed as follows:

    1. As a user with permission to do so, log in to your Soda Cloud account and navigate to your avatar > Organization Settings.

    2. In the Organization tab, uncheck the box to Allow Soda to collect sample data and failed row samples for all datasets, then Save.

    Alternatively, if you use Soda Library, you can adjust the configuration in your configuration.yml to disable all samples for an individual data source, as in the following example.
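    A minimal sketch of such a configuration.yml, assuming a hypothetical PostgreSQL data source named my_datasource with credentials supplied via environment variables:

    ```yaml
    data_source my_datasource:
      type: postgres
      host: ${POSTGRES_HOST}
      username: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}
      database: analytics
      schema: public
      # Disable all sample collection for this data source
      sampler:
        disable_samples: true
    ```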

    Limit sample data collection

    If you wish to provide sample rows for some datasets and only wish to limit the ones for which Soda collects samples, you can add a sample datasets configuration to your data source.

    Navigate to your avatar > Data Sources > New Data Source, or select an existing data source, to begin. You can add this configuration to one of two places:

    • to either step 3. Discover OR

    • step 4. Profile

    The example configuration below uses a wildcard character (%) to specify that Soda collects sample rows for all datasets with names that begin with region, and not to send samples for any other datasets in the data source.
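    A sketch of that configuration:

    ```yaml
    sample datasets:
      datasets:
        - include region%  # collect samples only for datasets beginning with region
    ```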

    The following example excludes a list of datasets from any sampling, and implicitly collects samples for all other datasets in the data source.
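    A sketch of such an exclusion, using hypothetical dataset names:

    ```yaml
    sample datasets:
      datasets:
        - exclude test_orders
        - exclude staging_customers
    ```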

    • If you configure sample datasets to include specific datasets, Soda implicitly excludes all other datasets from sampling.

    • If you combine an include config and an exclude config and a dataset fits both patterns, Soda excludes the dataset from sampling.

    • For excluded datasets, Soda does not generate, store, or offer visualizations of sample rows anywhere. For those datasets without sample rows, users must use another tool, such as a query builder for a data source, to collect any sample data they require.

    • No other functionality within Soda relies on these sample rows; if you exclude a dataset in a sample configuration, you can still configure individual failed row checks which collect independent failed row samples at scan time.

    Limit data profiling

    During the data source onboarding process, you have the option to configure Soda to profile the datasets, and/or their columns, when it connects to the data source.

    When it discovers datasets, Soda captures only the names of the datasets in the data source, each dataset's schema, and the data type of each column.

    When it profiles datasets, Soda automatically evaluates several data quality metrics for each column of a dataset based on the column's data type, such as missing and distinct values, calculated statistical metrics, and frequently occurring values. The majority of these metrics are aggregated, which safeguards against the exposure of record-level data.

    In instances where a column contains categorical data, profiling provides insights into the most extreme and frequent values, which could potentially reveal information about the data. However, as Soda only exposes individual metric values, end-users cannot link these calculated metrics to specific records.

    Limit or turn off dataset discovery

    If you wish to limit the profiling that Soda performs on datasets in a data source, or limit the datasets which it discovers, you can do so at the data source level as part of the guided workflow to create or edit a data source. Navigate to your avatar > Data Sources > New Data Source, or select an existing data source, to begin.

    In step 3 of the guided workflow, Discover, you have the option of listing the datasets you wish to discover, as in the example below. Refer to Add dataset discovery for many examples and variations of this configuration.
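    For example, a sketch that discovers only a hypothetical set of datasets whose names begin with retail_:

    ```yaml
    discover datasets:
      datasets:
        - include retail_%
    ```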

    To avoid discovering any datasets in your data source, use the following configuration.
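    A sketch of that configuration, where % matches every dataset name:

    ```yaml
    discover datasets:
      datasets:
        - exclude %
    ```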

    Limit or turn off dataset profiling

    If you wish to limit the profiling that Soda performs on datasets in a data source, or limit the datasets which it profiles, you can do so at the data source level as part of the guided workflow to create or edit a data source. Navigate to your avatar > Data Sources > New Data Source, or select an existing data source, to begin.

    In step 4 of the guided workflow, Profile, you have the option of listing the datasets you wish to profile, as in the example below which excludes columns that begin with pii and any columns that contain email in their names. Refer to Add column profiling for many examples and variations of this configuration.
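    Sketched in SodaCL, an exclusion of that shape might look as follows; the patterns are illustrative assumptions to adapt to your own column names:

```yaml
profile columns:
  columns:
    - include %.%
    - exclude %.pii%
    - exclude %.%email%
```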

    To avoid profiling any datasets in your data source, use the following configuration.
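    Excluding every column of every dataset, roughly as below, turns profiling off; verify the exact syntax against the column profiling documentation:

```yaml
profile columns:
  columns:
    - exclude %.%
```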

    Dataset profiling can be resource-heavy, so carefully consider the datasets for which you truly need column profile information. Refer to Compute consumption and cost considerations for more detail.

    Limit failed row sampling

    When a scan results in a failed check, the CLI output displays information about the check that failed and why, including the actual SQL queries that retrieve failed row samples. To offer more insight into the data that failed a check, Soda Cloud displays failed row samples in a check result’s measurement history, as in the example below.

    There are two ways Soda collects and displays failed row samples in your Soda Cloud account.

    • Implicitly: Soda automatically collects 100 failed row samples for the following checks:

      • reference check

      • checks that use a missing metric

      • checks that use a validity metric

      • checks that use a duplicate metric

    • Explicitly: Soda automatically collects 100 failed row samples for the following explicitly-configured checks:

      • checks that use the failed rows check type or the failed rows query configuration

    Turn off failed row sampling

    Where your datasets contain sensitive or private information, you may not want to collect, send, store, or visualize any samples from your data source to Soda Cloud. In such a circumstance, you can disable the feature completely in Soda Cloud.

    Users frequently disable failed row sampling in Soda Cloud and, instead, reroute failed row samples to an internal database; see Reroute failed row samples below.

    Customize failed row sampling

    For checks that implicitly collect failed row samples, you can add a configuration to prevent Soda from collecting those samples from specific columns or datasets that contain sensitive data. For example, you may wish to exclude a column that contains personally identifiable information (PII), such as credit card numbers, from the Soda query that collects samples.
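    One documented pattern places a sampler block in the data source configuration; treat the exact keys, dataset, and column names below as assumptions to verify against Manage failed row samples:

```yaml
sampler:
  exclude_columns:
    dim_employee:
      - email_address
      - credit_card_number
```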

    Refer to manage failed row samples for extensive options and details.

    Reroute failed row samples

    If the data you are checking contains sensitive information, you may wish to send any failed row samples that Soda collects to a secure, internal location rather than to Soda Cloud. These configurations apply to checks defined as no-code checks, in an agreement, or in a checks YAML file.

    To do so, you have two options:

    1. HTTP sampler: Create a function, such as a lambda function, available at a specific URL within your environment that Soda can invoke for every check result in a data source that fails and includes failed row samples. Use the function to perform any necessary parsing from JSON to your desired format (CSV, Parquet, etc.) and store the failed row samples in a location of your choice.

    2. Python CustomSampler: If you run programmatic Soda scans of your data, add a custom sampler to your Python script to collect samples of rows with a fail check result. Once collected, you can print the failed row samples in the CLI, for example, or save them to an alternate destination.
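    The parsing-and-storage step described in option 1 can be sketched in plain Python; the payload shape and field names below are illustrative assumptions, not Soda's exact wire format:

```python
import csv
import io
import json

def failed_rows_to_csv(payload_json: str) -> str:
    """Convert a JSON payload of failed row samples to CSV text.

    Assumes a payload with "columns" and "rows" keys; adapt this to the
    actual body Soda sends to your HTTP sampler endpoint.
    """
    payload = json.loads(payload_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(payload["columns"])  # header row
    writer.writerows(payload["rows"])    # one line per failed row sample
    return buffer.getvalue()

# Example payload a hypothetical endpoint might receive
example = json.dumps({
    "columns": ["id", "email"],
    "rows": [[1, None], [2, "n/a"]],
})
```

    From here, the function could upload the CSV to internal object storage instead of returning it.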

    Learn how to define custom samplers in Manage failed rows samples.

    Go further

    • Learn more about how to manage failed row samples.

    • Run scans locally to prevent Soda Library from pushing results to Soda Cloud. Access the Prevent pushing scan results to Soda Cloud in the Run a scan tab.

    Manage roles, user groups, and settings

    Need help? Join the Soda community on Slack.

    Where the scan results indicate an issue with data quality, Soda notifies you in both a PR comment and by email so that you can investigate and address any issues before merging your PR into production. Note that the Action does not yet support sending notifications via Slack, only email; see Notes and limitations.

    Further, you can access a full report of the data quality scan results, including scan logs, in your Soda Cloud account via the link in the PR comment.

    Prerequisites

    • You have a GitHub account, and are familiar with using GitHub Workflows and Actions.

    • You have access to the data source login credentials that Soda needs to access your data to run a scan for quality.

    Add the Action to a Workflow

    1. If you have not already done so, create a Soda Cloud account, which is free for a 45-day trial.

    Why do I need a Soda Cloud account?

    To validate your account license or free trial, the Soda Library Docker image that the GitHub Action uses to execute scans must communicate with a Soda Cloud account via API keys. Create new API keys in your Soda Cloud account, then use them to configure the connection between the Soda Library Docker image and your account in step 4 of this procedure.

    1. In the GitHub repository in which you wish to include data quality scans in a Workflow, create a folder named soda for the configuration files that Soda requires as input to run a scan.

    2. In this folder, create two files:

      • a configuration.yml file to store the connection configuration Soda needs to connect to your data source and your Soda Cloud account.

      • a checks.yml file to store the SodaCL checks you wish to execute to test for data quality. A check is a test that Soda executes when it scans a dataset in your data source.

    3. Follow the configuration instructions to add connection configuration details for both your data source and Soda Cloud account to the configuration.yml, and add checks for data quality for a dataset to your checks.yml. Examples of each follow.

    1. In the .github/workflows folder in your GitHub repository, open an existing Workflow or create a new workflow file. Determine where you wish to add a Soda scan for data quality in your workflow, such as after a transformation and dbt run. Refer to Test data in development for a recommended approach.

    2. Access the GitHub Marketplace to access the Soda GitHub Action. Click Use latest version to copy the code snippet for the Action.

    3. Paste the snippet into your new or existing workflow as an independent step, then add the required action inputs as in the following example. Refer to the table below for input details.

    1. (Optional) Following best practice, add a list of variables for sensitive login credentials and keys, as in the following example. Read more about GitHub encrypted secrets.
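    Combined, the Action step and secret-based variables might look like the following sketch; the action reference, version tag, data source name, and secret names are assumptions to adapt to your repository:

```yaml
- name: Run Soda scan
  uses: sodadata/soda-github-action@v1  # copy the exact reference from the Marketplace
  with:
    soda_library_version: v1.0.4
    data_source: my_datasource
    configuration: ./soda/configuration.yml
    checks: ./soda/checks.yml
  env:
    SODA_CLOUD_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}
    SODA_CLOUD_API_SECRET: ${{ secrets.SODA_CLOUD_API_SECRET }}
```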

    1. Save the changes to your workflow file, then test the action's functionality by triggering the event that starts the workflow job in GitHub, such as creating a pull request. To monitor the progress of the workflow, access the Actions tab in your GitHub repository, select the workflow in which you added the GitHub Action for Soda, then find the run in the list of Workflow Runs.

    2. When the job completes, navigate to the pull request’s Conversation tab to view the comment the Action posted via the github-action bot. To examine the full scan report and troubleshoot any issues, click the link to View the full scan results in the comment, then click View Scan Log. Use Troubleshoot SodaCL for help diagnosing issues with SodaCL checks.

    Next:

    • Add more SodaCL checks to your checks.yml file to validate data according to your own use cases and requirements. Refer to SodaCL reference documentation, and the SodaCL tutorial.

    • Follow the guide for Test data during development for more insight into a use case for the GitHub Action for Soda.

    Required Action input

    Input
    Description
    Required

    soda_library_version

    Version of the Soda Library that runs the scan. Supply a specific version, such as v1.0.4, or latest. See the list of Soda Library releases for possible versions. Compatible with Soda Library 1.0.4 and higher.

    ✓

    data_source

    Name of data source on which to perform the scan.

    ✓

    configuration

    File path to configuration YAML file. See Soda docs.

    ✓

    checks

    File path to checks YAML file. See Soda docs. Compatible with shell filename extensions. Identify multiple check files, if you wish. For example: ./checks_*.yaml or ./{check1.yaml,check2.yaml}

    ✓

    Notes and limitations

    • Be aware that for self-hosted runners in GitHub:

      • Windows runners are not supported, including the use of official Windows-based images such as windows-latest

      • macOS runners require Docker installation because the macos-latest image does not come with Docker pre-installed.

    • The scan results that the GitHub Action for Soda produces do not appear among your primary checks results. The results are ephemeral and serve only to flag and fix issues during development. Though the results are ephemeral, checks that Soda executes via the GitHub Action for Soda count towards the check allotment associated with your license.

    • The ephemeral scan results that the GitHub Action for Soda produces do not persist historical measurements. Thus, checks that normally evaluate against stored values in the Cloud Metric Store, such as schema checks, do not evaluate in scans that the GitHub Action for Soda executes.

    • The ephemeral scan results that the GitHub Action for Soda produces cannot send notifications according to Notification Rules in your Soda Cloud account. The only notifications for the results are:

      • the status report in the GitHub PR comment

      • an email to the email address you used to create your Soda Cloud account

    Go further

    • Learn how to Test data in an Airflow pipeline.

    • Learn more about using webhooks to integrate Soda Cloud with other third-party service providers.

    • Access a list of all integrations that Soda Cloud supports.

    Soda GitHub Action

    Need help? Join the Soda community on Slack.

    creating and tracking data quality incidents

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✖️ Supported in Soda Cloud + Self-hosted Soda Agent ✖️ Supported in Soda Cloud + Soda-hosted Agent

    Prerequisites

    • You have installed a Soda Library package in your environment and have configured it to connect to a data source and your Soda Cloud account using a configuration.yml file.

    • You use dbt Cloud or dbt-core version >= 1.5, <2.0. Note: As dbt no longer supports v1.4, Soda does not support that version.

    Videos

    Integrate dbt core with Soda.

    Integrate dbt Cloud with Soda.

    Ingest dbt test results from dbt-core into Soda Cloud

    Every time you execute tests in dbt, dbt captures information about the test results. Soda Library can access this information and translate it into test results that Soda Cloud can display. You must first run your tests in dbt before Soda Library can find and translate test results, then push them to Soda Cloud.

    1. First, ensure that your dbt test results are available in your local filesystem. You can generate the necessary files by running your dbt pipeline with one of the following commands:

    • dbt build

    • dbt test

    1. If you have not already done so, install the soda-dbt package in the Python environment used for your Soda Library scans:

    Note: It is recommended that you use separate Python environments for your dbt pipeline and Soda scans to avoid dependency conflicts.

    1. Have a configuration.yml file that includes your Soda Cloud credentials and the Soda Datasource to be associated with the dbt tests.

    2. To ingest dbt test results, Soda Library uses the files that dbt generates when it builds or tests models: manifest.json and run_results.json. Use Soda Library to execute one of the following ingest commands to ingest the JSON files into Soda Cloud.

    • Specify the file path for the directory in which you store both the manifest.json and run_results.json files; Soda finds the files it needs in this directory.

    OR

    • Specify the path and filename for each individual JSON file that Soda Cloud must ingest.
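    Assuming a data source named snowflake_retail and dbt artifacts in ./dbt/target (both hypothetical), the two variants might look like this; confirm the flag names with soda ingest --help:

```shell
# Point Soda at the directory containing both manifest.json and run_results.json
soda ingest dbt -d snowflake_retail -c configuration.yml --dbt-artifacts ./dbt/target

# Or name each JSON file individually
soda ingest dbt -d snowflake_retail -c configuration.yml \
  --dbt-manifest ./dbt/target/manifest.json \
  --dbt-run-results ./dbt/target/run_results.json
```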

    Run soda ingest --help to review a list of all command options.

    Ingest results from dbt Cloud into Soda Cloud

    Every run that is part of a Job on dbt Cloud generates metadata about your dbt project as well as the results from the run. Use Soda Library to get this data directly from the dbt Cloud API.

    Note that you must use Soda Library to run the CLI command to ingest dbt test results into Soda Cloud from dbt Cloud. You cannot configure the connection to dbt Cloud from within the Soda Cloud user interface, as you can with a new data source, for example.

    1. If you have not already done so, install the soda-dbt package in the Python environment that also runs Soda Library scans:

    1. Obtain a dbt Cloud Admin API Service Token.

    2. Add the dbt Cloud configuration to your Soda configuration.yml file as in the following example. Look for the account ID after the word accounts in a dbt Cloud URL, for example https://cloud.getdbt.com/#/accounts/840923545***/, or navigate to your dbt Cloud Account Settings page.

    Note that as of March 1, 2024, dbt Cloud users must use region-specific access URLs for API connections. Because the Soda integration with dbt Cloud interacts with dbt's Admin API, users may have to specify the base URL of the Admin API via the access_url property, as in the example below. Find your access URL in your dbt Cloud account in Account Settings. If you do not provide this in your configuration, Soda defaults to cloud.getdbt.com. Find out more in Access, Regions & IP Addresses.
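    A configuration.yml fragment of roughly this shape is expected; the account ID and token values are placeholders, and access_url is only needed for region-specific accounts:

```yaml
dbt_cloud:
  account_id: "123456"               # from the dbt Cloud URL or Account Settings
  api_token: ${DBT_CLOUD_API_TOKEN}  # your dbt Cloud Admin API service token
  access_url: cloud.getdbt.com       # your region-specific access URL, if any
```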

    1. From the command-line, run the soda ingest command to capture the test results from dbt Cloud and send them to Soda Cloud and include one of two identifiers from dbt Cloud. Refer to dbt Cloud documentation for more information.

    • Use the run ID from which you want Soda to ingest results. Look for the run ID at the top of any Run page, such as "Run #40732579", in dbt Cloud, or in the URL of the Run page. For example, https://cloud.getdbt.com/#/accounts/1234/projects/1234/runs/40732579/

    OR

    • Use the job ID from which you want Soda to ingest results. Using the job ID enables you to write the command once and know that Soda always ingests the latest run of the job, which is ideal if you perform ingests on a regular schedule via a cron job or other scheduler. Look for the job ID after the word "jobs" in the URL of the Job page in dbt Cloud. For example, https://cloud.getdbt.com/#/accounts/1234/projects/5678/jobs/123445/
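    With a hypothetical data source named snowflake_retail, the two identifier variants might look like this; confirm the flag names with soda ingest --help:

```shell
# Ingest the results of one specific dbt Cloud run
soda ingest dbt-cloud -d snowflake_retail -c configuration.yml --dbt-cloud-run-id 40732579

# Or always ingest the latest run of a job, e.g. from a cron schedule
soda ingest dbt-cloud -d snowflake_retail -c configuration.yml --dbt-cloud-job-id 123445
```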

    Ingestion notes and constraints

    • When you call the ingestion integration, Soda Library reads the information from manifest.json and run_results.json files (or gets them from the dbt Cloud API), then maps the information onto the corresponding datasets in Soda Cloud. If the mapping fails, Soda Library creates a new dataset and Soda Cloud displays the dbt monitor results associated with the new dataset.

    • In Soda Cloud, the displayed scan time of a dbt test is the time that Soda Library ingested the test result from dbt. The scan time in Soda Cloud does not represent the time that the dbt pipeline executed the test. If you want those times to be close to each other, we recommend running a soda ingest right after your dbt transformation or testing pipeline has completed.

    • The command soda scan cannot trigger a dbt run, and the command dbt run cannot trigger a Soda scan. You must execute Soda scans and dbt runs individually, then ingest the results from a dbt run into Soda by explicitly executing a soda ingest command.

    • Soda can ingest dbt tests that:

      • have test metadata (test_metadata in the test node json)

      • have a run result

    View dbt test results in Soda Cloud

    After completing the steps above to ingest dbt tests, log in to your Soda Cloud account, then navigate to the Checks dashboard.

    Each row in the table of Checks represents a check that Soda Library executed, or a dbt test that Soda Library ingested. dbt tests are prefixed with dbt: in the table of Checks.

    • Click the row of a dbt test to examine visualized historic data for the test, details of the results, and information that can help you diagnose a data quality issue.

    • Click the stacked dots at the far right of a dbt check, then select Create Incident to begin investigating a data quality issue with your team.

    • Set up an alert notification rule for checks with fail or warn results. Navigate to your avatar > Notification Rules, then click New Notification Rule. Follow the guided steps to complete the new rule. Send notifications to an individual or a team in Slack.

    Go further

    • Learn more about How Soda works.

    • Read more about running a Soda scan.

    • As a business user, learn how to create no-code checks in Soda Cloud.

    • Learn more about creating, tracking, and resolving data quality incidents in Soda Cloud.

    • Access a list of integrations that Soda Cloud supports.

    Need help? Join the Soda community on Slack.

    catalog: datasource_container_name

    The schema of the data source; retrieve this value from the data source page in the Alation catalog under the subheading Schemas. See image below.

    catalog: datasource_container_id

    The ID of the datasource_container_name (the schema of the data source); retrieve this value from the schema page in the Alation catalog. See image below


    Manage dataset roles

    Learn how to manage user access to datasets in an organization's Soda Cloud account.

    To manage the dataset-level permissions of users that belong to a single organization, Soda Cloud uses roles, groups, and access permissions. These role-based access permissions enforce limits on the abilities for people to make additions and changes to datasets in Soda Cloud.

    There are two types of roles that regulate permissions in Soda Cloud: Global and Dataset. You can assign each type of role to users or user groups in Soda Cloud to organize role-based access control to resources and functionality in your account. You can also customize the permissions of the out-of-the-box roles Soda Cloud includes, or you can create new roles and assign permissions to roles as you wish.

    The content that follows offers information about dataset roles. For details on terminology, global roles, custom user groups, and organizational settings, see Manage global roles, user groups, and settings.

    Dataset roles and permissions

    The out-of-the-box roles that define who has permission to access or make changes to datasets in your Soda Cloud account are Admin, Manager, Editor, and Viewer. An Admin role has all permissions to access or act upon a dataset; the following table outlines the permission groups for the remaining out-of-the-box dataset roles.

    Permission group
    Manager
    Editor
    Viewer

    View dataset

    This permission group cannot be removed from any of the out-of-the-box or custom dataset roles.

    • View a dataset in the list on the Datasets page

    • View a dataset's checks in the Checks page

    • Access a dataset via API

    • Access a dataset's checks via API

    Access dataset profiling and samples

    • View a dataset's Columns tab, schema and profiling info

    • View a dataset's Samples tab

    Access failed row samples for checks

    • View the check history for a dataset's checks, including failed row samples

    Configure dataset

    • Edit a dataset's attributes

    • Edit a dataset's profiling configuration

    Manage dataset responsibilities

    • Edit a dataset's responsibilities

    Propose checks

    • Select a dataset in a New Discussion form

    • Select a dataset in an Add Check form

    • Click Propose Check when creating a no-code check

    Manage checks

    • Push a dataset's check results from Soda Library scans to Soda Cloud. At present, Soda Cloud does not reject check results from a Soda Library scan executed by a user without "Manage checks" permission for a dataset. Instead, Soda issues a soft warning to indicate that the user does not have permission to manage checks for the dataset. In future iterations, the warning will be changed to a rejection of any results pushed without proper permissions for the dataset.

    • Edit the description of a dataset's checks

    • Edit the owner of a dataset's checks

    • Delete a dataset's checks

    Manage incidents

    • Create an incident related to a dataset's check

    • Update an incident related to a dataset's check

    Delete dataset

    • Delete a dataset

    Create dataset roles

    You can create or edit dataset roles to assign to users or user groups in Soda Cloud.

    As a user with permission to do so, navigate to your avatar > Organization Settings, then access the Dataset Roles tab. Click Add Dataset Role, then follow the guided workflow to name a role and add permission groups. Refer to the table above for a list of permission groups, and their associated permissions, that you can assign to dataset roles.

    Assign dataset roles

    The only out-of-the-box user group that Soda Cloud provides is called Everyone. When a new user accepts an invitation to join an existing Soda Cloud organization, or when they gain access to an organization via SSO, Soda Cloud applies the global role of user in the organization and, depending on the Responsibilities settings, may add the new user to the Everyone user group. You cannot add users to, or remove them from, the Everyone user group. To learn how to create your own user groups, see Manage global roles, user groups, and settings.

    When setting responsibilities for newly-onboarded, or discovered, datasets, users with permissions to do so can access the Organization Settings to define:

    • whether to add newly invited or added users to the out-of-the-box Everyone user group

    • the default dataset role of the Everyone user group

    • the default dataset role to assign to Dataset Owners of datasets that are onboarded in Soda Cloud

    When any user uses Soda Library or Soda Cloud to add a new data source, and its datasets, to the Soda Cloud account, the user automatically becomes the Dataset Owner of each dataset in the data source. Depending upon the Responsibilities settings in the Dataset Roles tab of Organization Settings, the Dataset Owner is assigned a role according to the Default Dataset Owner Role setting.

    Beyond the default users and roles assigned to a dataset upon addition to Soda Cloud, you can edit the responsibilities for an individual dataset to make changes to the way users and user groups can access or act upon the dataset.

    1. As a user with the permission to do so, login to your Soda Cloud account and navigate to the Datasets dashboard.

    2. Click the stacked dots to the right of the dataset for which you wish to adjust the role assignments, then select Edit Responsibilities.

    3. Use the search bar to find specific users or user groups to which you wish to assign a role for the dataset, then use the dropdown next to each name to adjust their role, then Save your changes.

    If you have added a user to a group to which you have assigned a level of permission for a dataset, then manually assigned a different level of permission to the individual user for a dataset, Soda honors the higher set of permissions.

    For example, say you add Manny Jacinto to a user group called Marketing Team. For a new_signups dataset, you assign the Marketing Team the out-of-the-box role of Viewer. Then, for the same dataset, you assign Manny's individual user the out-of-the-box role of Manager. Soda honors the permissions of the higher role, Manager, for Manny's access to new_signups.

    Data source, dataset, agreement, and check owners

    There are four types of resource owners in Soda Cloud that identify the user, or user group, that owns a data source, dataset, agreement, or check. These ownership roles do not enforce any permissions, they are simply resource metadata.

    • By default, the user who added the data source becomes the Data Source Owner and Dataset Owner of all datasets in that data source. The default that Soda Cloud assigns to the Dataset Owner is that of Manager.

    • By default, the user who creates an agreement becomes the Check Owner of all checks defined in the agreement.

    • By default, the user who creates a no-code check becomes its Check Owner.

    Change the Data Source Owner

    1. With the permission to do so, login to your Soda Cloud account and navigate to your avatar > Data Sources.

    2. In the Data Sources tab, click the stacked dots to the right of the data source for which you wish to adjust the ownership, then select Edit Datasource.

    3. In the Assign Owner tab, use the dropdown to select the name of another user or user group to take ownership of the data source, then Save.

    Change the Dataset Owner

    1. With the permission to do so, login to your Soda Cloud account and navigate to the Datasets dashboard.

    2. Click the stacked dots to the right of the dataset for which you wish to adjust the ownership, then select Edit Dataset.

    3. In the Attributes tab, use the dropdown to select the name of another user or user group to take ownership of the dataset, then Save.

    To bulk-change the owner of all new datasets added to a data source, follow the steps to change the Data Source Owner and, in the Assign Owner tab, use the dropdown to change the owner of all the datasets in the data source.

    Change the Check Owner

    1. If you are the Admin of the organization, or have a Manager or Editor role for the check's dataset, login to your Soda Cloud account and navigate to the Checks dashboard.

    2. Click the stacked dots to the right of the check for which you wish to adjust the ownership, then select Edit Check.

    3. In the Attributes tab, use the dropdown to select the name of another user to take ownership of the check, then Save. Note that you cannot assign a user group as a check owner.

    Go further

    • Learn more about the relationship between resources in Soda Cloud.

    • to facilitate your search for the right data.

    • to join your organization’s Soda Cloud account.

    • Learn more about creating and tracking data quality incidents.

    Invoke Soda Library

    Use Soda Library to programmatically execute scans and automate the checks for bad-quality data.

    The Soda environment has been updated since this tutorial.

    Refer to the latest Soda documentation for updated tutorials.

    To automate the search for bad-quality data, you can use Soda Library to programmatically set up and execute scans. Because it is a Python library, you can invoke Soda just about anywhere you need it; the invocation instructions below offer a very simple example to extrapolate from. Consult the use case guides for more examples of how to programmatically run Soda scans for data quality.

    Alternatively, you can install and use the Soda Library CLI to run scans.

    As a step in the Get started roadmap, this guide offers instructions to set up, install, and configure Soda for programmatic scans.

    Get started roadmap

    1. Choose a flavor of Soda

    2. Set up Soda: programmatic 📍 You are here!

    3. Write SodaCL checks

    4. Run scans and review results

    Requirements

    To use Soda Library, you must have installed the following on your system.

    • Python 3.8, 3.9, or 3.10

    • Pip 21.0 or greater

    • A Soda Cloud account; see next section.

    Python versions Soda supports

    Soda officially supports Python versions 3.8, 3.9, and 3.10. Though Soda Library is largely functional on Python 3.11 and 3.12, efforts to fully support those versions are ongoing. With Python 3.11, some users may encounter issues with dependency constraints; at times, the combination of Python 3.11 and those constraints requires that a dependency be built from source rather than downloaded pre-built. The same applies to Python 3.12, and there is some anecdotal evidence that 3.12 might not work in all scenarios due to dependency constraints.

    Create a Soda Cloud account

    1. In a browser, navigate to the Soda Cloud signup page to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.

    2. Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys.

    3. Copy+paste the API key values to a temporary, secure place in your local environment.

    Why do I need a Soda Cloud account?

    To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.

    Set up basic programmatic invocation in Python

    As in the simple example below, invoke the Python library and provide:

    • your data source connection configuration details, including environment variables, using one of the listed methods; consult the data source reference for data source-specific connection configuration

    • your Soda Cloud account API key values:

      • use cloud.soda.io for EU region

      • use cloud.us.soda.io for US region
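    A minimal sketch of such an invocation follows, assuming a Soda Library package is installed and that the data source name and YAML file paths (hypothetical here) match your setup:

```python
from soda.scan import Scan  # requires a Soda Library package to be installed

scan = Scan()
scan.set_data_source_name("my_datasource")

# Data source connection details plus Soda Cloud API key values
scan.add_configuration_yaml_file(file_path="configuration.yml")

# A unique name per programmatic scan so Soda Cloud correlates results correctly
scan.set_scan_definition_name("orders_pipeline_scan")

# The SodaCL checks to execute during the scan
scan.add_sodacl_yaml_file("checks.yml")

exit_code = scan.execute()
print(scan.get_logs_text())
```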

    Use the following guidance for optional elements of a programmatic scan.

    • You can save Soda Library scan results anywhere in your system; the scan_result object contains all the scan result information. To import Soda Library in Python so you can utilize the Scan() object, install a Soda Library package, then use from soda.scan import Scan.

    • If you provide a name for the scan definition to identify inline checks in a programmatic scan as independent of other inline checks in a different programmatic scan or pipeline, be sure to set a unique scan definition name for each programmatic scan. Using the same scan definition name in multiple programmatic scans results in conflated check results in Soda Cloud.

    See the Soda Library Python API reference for detailed documentation of the Scan class in Soda Library.

    Tips and best practices

    • Be sure to include any variables in your programmatic scan before the check YAML files. Soda requires the variable input for any variables defined in the check YAML files.

    • Because Soda Library pushes scan results to Soda Cloud, you may not want to change the scan definition name with each scan. Soda Cloud uses the scan definition name to correlate subsequent scan results, thus retaining a historical record of the measurements over time. Sometimes changing the name is useful, such as when you wish to configure a single scan to run in multiple environments. Be aware, however, that if you change the scan definition name with each scan for the same environment, Soda Cloud treats each set of scan results as independent from previous scan results, making it appear as though it records a new, separate check result with each scan and archives or "disappears" previous results.

    Scan exit codes

    Soda Library’s scan output includes an exit code which indicates the outcome of the scan.

    To obtain the exit code, you can add the following to your programmatic scan.
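    As a sketch, the exit code is the return value of execute() on a configured Scan object:

```python
from soda.scan import Scan

scan = Scan()
# ... add data source, configuration, and checks as described above ...
exit_code = scan.execute()
print(f"Scan exit code: {exit_code}")
```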

    Next

    1. Choose a flavor of Soda

    2. Set up Soda: programmatic

    3. Run scans and review results

    Need help? Join the .

    Soda Library Python API reference

    Access Python reference content for the Soda Scan class and its methods.

    Use the Python API to programmatically execute Soda scans. The following content offers a reference for the Soda scan class and its methods.

    Refer to Program a scan for instructional details and an example of a complete file.

    Classes

    Use the Scan class to programmatically define and execute data quality scans. See Invoke Soda Library for an example of how to use the Soda Library Python API in a programmatic scan.

    Methods

    Use this method to execute the scan. When executed, Soda returns an integer exit code as per the table that follows.

    Exit code
    Description

    Provide required scan settings

    Specify the data source on which Soda executes the checks.

    Provide the scan definition name if the scan has been defined in Soda Cloud. By providing this value, Soda correlates subsequent scans from the same pipeline.

    To retrieve this value, navigate to the Scans page in Soda Cloud, then select the scan definition you wish to execute remotely and copy the scan name, which is the smaller text under the label. For example, weekday_scan_schedule.
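As a minimal sketch of these required settings, assuming soda-core or Soda Library is installed; the data source name is illustrative:

```python
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("snowflake_prod")            # illustrative data source name
scan.set_scan_definition_name("weekday_scan_schedule") # correlates results in Soda Cloud
```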

    Add configurations to a scan

    Add data source and Soda Cloud connection configurations from a YAML file. file_path is a string that points to a configuration file. ~ expands to the user's home directory.

    Optionally, add all connection configurations from all matching YAML files in the file path according to your specifications.

    • path is a string that is the path to a directory in which to search for configuration files, though you can also use it as a path to a single configuration file. ~ expands to the user's home directory.

    • recursive requires a boolean value that controls whether Soda scans nested directories. If unspecified, the default value is true.

    • suffixes

    Optionally, add connection configurations from a YAML-formatted string.

    • environment_yaml_str is a string that represents a configuration and must be YAML-formatted.

    • file_path is an optional string that you use to get the location of errors in the logs.
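The three ways of adding configurations can be sketched as follows, assuming soda-core or Soda Library is installed; the paths and YAML content are illustrative.

```python
from soda.scan import Scan

scan = Scan()
# From a single YAML file
scan.add_configuration_yaml_file(file_path="~/.soda/configuration.yml")
# From all matching YAML files in a directory
scan.add_configuration_yaml_files(path="~/.soda", recursive=True)
# From a YAML-formatted string
scan.add_configuration_yaml_str("""
soda_cloud:
  host: cloud.soda.io
""")
```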

    Add SodaCL checks to a scan

    Add a SodaCL checks YAML file to the scan according to a file path you specify. file_path is a string that identifies a checks YAML file.

    Optionally, add all the files in a directory to the scan as SodaCL checks YAML files.

    • path is a string that is the path to a directory in which to search for checks YAML files, though you can also use it as a path to a single checks file. ~ expands to the user's home directory.

    • recursive is an optional boolean value that controls whether Soda scans nested directories. If unspecified, the default value is true.

    • suffixes

    Optionally, add SodaCL checks from a YAML-formatted string.

    • sodacl_yaml_str is a string that represents the SodaCL checks and must be YAML-formatted.

    • file_path is an optional string that you use to get the location of errors in the logs.
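The three ways of adding SodaCL checks can be sketched as follows, assuming soda-core or Soda Library is installed; the paths and the inline check are illustrative.

```python
from soda.scan import Scan

scan = Scan()
# From a single checks YAML file
scan.add_sodacl_yaml_file(file_path="checks.yml")
# From all files in a directory
scan.add_sodacl_yaml_files(path="./soda/checks", recursive=True)
# From a YAML-formatted string
scan.add_sodacl_yaml_str("""
checks for dim_customer:
  - row_count > 0
""")
```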

    If you use a template for SodaCL checks, add a SodaCL template file to the scan. file_path is a string that identifies a SodaCL template file.

    If you use multiple templates for SodaCL checks, add all the template files in a directory to the scan. path is a string that identifies the directory that contains the SodaCL template files.

    Add local data to a scan

    If you use Pandas, add a Pandas Dataframe dataset to the scan.

    • dataset_name is a string to identify a dataset.

    • pandas_df is a Pandas Dataframe object.

    • data_source_name is a string to identify a data source.

    If you use Dask, add a Dask Dataframe dataset to the scan.

    • dataset_name is a string used to identify a dataset.

    • dask_df is a Dask Dataframe object.

    • data_source_name is a string to identify a data source.

    If you use PySpark, add a Spark session to the scan.

    • spark_session is a Spark session object.

    • data_source_name is a string to identify a data source.

    If you use a pre-existing DuckDB connection object as a data source, add a DuckDB connection to the scan.

    • duckdb_connection is a DuckDB connection object.

    • data_source_name is a string to identify a data source.
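As an example of the Pandas variant, the following is a hedged sketch that assumes the soda-pandas-dask package is installed; the dataset, data source, and column names are illustrative.

```python
import pandas as pd
from soda.scan import Scan

df = pd.DataFrame({"id": [1, 2, 3], "country": ["BE", "US", None]})

scan = Scan()
scan.set_scan_definition_name("local_pandas_scan")  # illustrative name
scan.set_data_source_name("pandas_source")
scan.add_pandas_dataframe(
    dataset_name="my_dataset",        # how checks refer to the data
    pandas_df=df,
    data_source_name="pandas_source",
)
```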

    Add optional scan settings

    Configure a scan to output verbose log information. This is useful when you wish to see the SQL queries that Soda executes or to troubleshoot scan issues.

    Configure Soda to prevent it from sending scan results to Soda Cloud. This is useful if, for example, you are testing checks locally and do not wish to muddy the measurements in your Soda Cloud account with test run metadata.

    Configure a scan to have access to custom variables that can be referenced in your SodaCL files. variables is a dictionary with string keys and string values.
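A sketch of these optional settings, assuming soda-core or Soda Library is installed; verify the method names against your installed version.

```python
from soda.scan import Scan

scan = Scan()
scan.set_verbose(True)   # log executed SQL and extra detail for troubleshooting
scan.set_is_local(True)  # keep results local; do not push to Soda Cloud
scan.add_variables({"date": "2022-01-01"})  # referenced as ${date} in SodaCL
```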

    Add configurations to handle scan results

    Use the following configurations to handle errors and/or warnings that occurred during a Soda scan.

    Instruct Soda to raise an AssertionError when errors occur in the scan logs.

    Instruct Soda to raise an AssertionError when errors or warnings occur in the scan logs.

    Instruct Soda to raise an AssertionError when a specific error message occurs in the scan logs. Use expected_error_message to specify the error message as a string.

    Instruct Soda to return a boolean value to indicate that errors occurred in the scan logs.

    Instruct Soda to return a boolean value to indicate that errors or warnings occurred in the scan logs.

    Instruct Soda to return a string that represents the logs from the scan.

    Instruct Soda to return a list of strings of scan errors in the logs.

    Instruct Soda to return a list of strings of scan errors and warnings in the logs.

    Instruct Soda to return a string of all scan errors in the logs.

    Instruct Soda to return a dictionary containing the results of the scan.

    The scan results dictionary includes the following keys:
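The configurations above correspond to methods on the Scan object. A hedged sketch follows; the method names reflect recent soda-core versions, so verify them against your installed version.

```python
from soda.scan import Scan

scan = Scan()
# ... add configuration and checks, then call scan.execute() ...

scan.assert_no_error_logs()              # AssertionError if errors occurred
scan.assert_no_error_nor_warning_logs()  # AssertionError on errors or warnings
if scan.has_error_logs():                # boolean: errors occurred
    print(scan.get_error_logs_text())    # string of all scan errors
print(scan.get_logs_text())              # string of the full scan logs
results = scan.get_scan_results()        # dictionary of scan results
```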

    Add configurations to handle check results

    Use the following configurations to handle the results of checks executed during a Soda scan.

    Instruct Soda to raise an AssertionError when any check execution results in a fail state.

    Instruct Soda to raise an AssertionError when any check execution results in a fail or warn state.

    Instruct Soda to return a boolean value to indicate that one or more checks executed during the scan resulted in a fail state.

    Instruct Soda to return a boolean value to indicate that one or more checks executed during the scan resulted in a warn state.

    Instruct Soda to return a boolean value to indicate that one or more checks executed during the scan resulted in a fail or warn state.

    Instruct Soda to return a list of strings of checks that resulted in a fail state.

    Instruct Soda to return a string of checks that resulted in a fail state.

    Instruct Soda to return a list of strings of checks that resulted in a fail or warn state.

    Instruct Soda to return a string of checks that resulted in a fail or warn state.

    Instruct Soda to return a string of all check results.
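Similarly, the check-result configurations map to Scan methods. A hedged sketch follows; verify the method names against your installed soda-core version.

```python
from soda.scan import Scan

scan = Scan()
# ... add configuration and checks, then call scan.execute() ...

scan.assert_no_checks_fail()            # AssertionError on any failed check
scan.assert_no_checks_warn_or_fail()    # AssertionError on warn or fail
if scan.has_check_fails():              # boolean: one or more checks failed
    print(scan.get_checks_fail_text())  # string of failed checks
print(scan.get_all_checks_text())       # string of all check results
```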

    Attributes

    Configure the datasource-level samples limit for the failed rows sampler. This is useful when scanning Pandas, Dask, or Spark Dataframes.

    Replace the failed rows sampler with a custom sampler. See for instructions about how to define a custom sampler.


    Connect Soda to GCP BigQuery

    Access configuration details to connect Soda to a BigQuery data source.

    For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda and configure it to connect to your data sources, see Get started.

    A note about BigQuery datasets: Google uses the term dataset slightly differently than Soda (and many others) do.

    • In the context of Soda, a dataset is a representation of a tabular data structure with rows and columns. A dataset can take the form of a table in PostgreSQL or Snowflake, or a DataFrame in a Spark application.

    • In the context of BigQuery, a dataset is “a top-level container that is used to organize and control access to your tables and views. A table or view must belong to a dataset…”

    Instances of "dataset" in Soda documentation always reference the former.

    Connection configuration reference

    Install package: soda-bigquery

    Authentication methods

    Using GCP BigQuery, you have the option of using one of several methods to authenticate the connection.

    1. Application Default Credentials

    2. Application Default Credentials with Service Account impersonation

    3. Service Account Key (see above)

    4. Service Account Key with Service Account Impersonation

    Application Default Credentials

    Add the use_context_auth property to your connection configuration, as per the following example.
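A minimal sketch of such a configuration; use_context_auth comes from the text above, while the project and dataset values are illustrative placeholders.

```yaml
data_source my_bigquery_datasource:
  type: bigquery
  use_context_auth: true
  project_id: my-gcp-project   # illustrative
  dataset: sodatest            # illustrative BigQuery dataset name
```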

    Application Default Credentials with Service Account impersonation

    Add the use_context_auth and impersonation_account properties to your connection configuration, as per the following example.

    Service Account Key with Service Account impersonation

    Add the impersonation_account property to your connection configuration, as per the following example.
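A hedged sketch combining a service account key with impersonation; the key content, account, project, and dataset values are illustrative placeholders.

```yaml
data_source my_bigquery_datasource:
  type: bigquery
  account_info_json: |
    {
      "type": "service_account",
      ...
    }
  impersonation_account: target-sa@my-gcp-project.iam.gserviceaccount.com
  project_id: my-gcp-project   # illustrative
  dataset: sodatest            # illustrative
```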

    Supported data types

    Category
    Data type

    Use a file reference for a BigQuery data source connection

    If you already store information about your data source in a JSON file in a secure location, you can configure your BigQuery data source connection details in Soda Cloud to refer to the JSON file for service account information. To do so, you must add two elements:

    • volumes and volumeMounts parameters in the values.yml file that your Soda Agent helm chart uses

    • the account_info_json_path in your data source connection configuration

    You, or an IT Admin in your organization, can add the following scanlauncher parameters to the existing values.yml that your Soda Agent uses for deployment and redeployment in your Kubernetes cluster. Refer to for details.

    Use the following command to add the service account information to a Kubernetes secret that the Soda Agent consumes according to the configuration above; replace the angle brackets and the values in them with your own values.

    After you make both of these changes, you must redeploy the Soda Agent. Refer to for details.

    Adjust the data source connection configuration to include the account_info_json_path configuration, as per the following example.

    Troubleshoot

    Problem: When running a scan, you encounter an error that reads, 400 Cannot query over table 'event_logs' without a filter over column(s) 'serverTimestamp' that can be used for partition elimination.

    Workaround: The error occurs because the table in BigQuery is configured to require partitioning.

    • If the error occurs when you are profiling your data with Soda, you must disable profiling.

    • If the error occurs when the scan is executing regular SodaCL checks, be sure you always apply a filter on serverTimestamp. See

    Integrate an External Secrets Manager with a Soda Agent

    Follow this guide to set up an integration with an External Secrets Manager for a Soda Agent to use to securely retrieve frequently-rotated passwords.

    Use this guide to set up a Soda Agent to securely retrieve frequently-rotated and/or encrypted data source login credentials.

    Rather than managing sensitive login credentials for data sources passed via the Helm chart value soda.env, you can set up a Soda Agent to integrate with external secrets managers such as Hashicorp Vault, AWS Secrets Manager, or Azure Key Vault, so that it can securely access up-to-date, externally-stored login credentials for data sources.

    About this guide

    This exercise points to a GitHub repository from which you can set up a locally-run, example Kubernetes cluster to illustrate what an integration between a Soda Agent and an external secrets manager looks like.

    When you complete the exercise, you will have examples of the things you need for a Soda Agent to access an external secrets manager:

    • External Secrets Operator (ESO) which is a Kubernetes operator that facilitates a connection between the Soda Agent and your secrets manager; see .

    • a ClusterSecretStore resource which provides a central gateway with instructions on how to access your secret backend

    • an ExternalSecret resource which instructs the cluster on which values to fetch, and references the ClusterSecretStore

    Follow the instructions below to use the Terraform files in the repository to:

    • set up and configure a local Kubernetes cluster

    • deploy External Secrets Operator

    • configure both a ClusterSecretStore and ExternalSecrets to access username and password examples in a Hashicorp Vault

    • set up an example PostgreSQL data source containing NYC bus breakdowns and delays data

    Prerequisites

    For this exercise, you must have installed the following tools:

    • to build a locally-run example environment

    • One container runtime that provides containers to use as local Kubernetes cluster nodes, either:

      • , for users who prefer to use a UI

      • , for users who prefer to use the command-line OR

    Set up

    1. Clone the repository locally.

    2. Navigate to the setup directory in the repository.

    3. Use the Terraform commands below to:

      • create a local Kubernetes cluster,

    Configure

    1. Navigate to the configure directory in the repository.

    2. Use the Terraform commands below to:

      • configure a Hashicorp Vault,

      • configure the Vault provider of the ESO,

    Access the Hashicorp Vault

    1. The configuration output produces a URL value for vault_access which, by default, is . Click the link to access the Hashicorp Vault login page in your browser.

    2. To log in, change the Method to Username, then use the vault_admin_username to populate the first field. To extract the value for vault_admin_password for the second field, use the following command:

      Hint: To copy the password directly to the clipboard, use one of the following commands:

    Deploy a Soda Agent and pass login credentials

    1. Access and follow the instructions to create a free, 45-day trial Soda Cloud account, an API key ID, and an API key secret for the Soda Agent.

    2. Prepare a values YAML file to deploy a Soda Agent in your cluster, as per the following example.

    3. Deploy the Soda Agent using the following command:

    Create the example data source in Soda Cloud

    To use your newly-deployed Soda Agent, you start by creating a new data source in your Soda Cloud account, then you can create a Soda Agreement to write checks for data quality.

    1. In your Soda Cloud account, navigate to your avatar > Scans & Data. Click New Data Source, then follow the guided steps to create a new data source. Refer to for full instructions for setting up a data source.

    2. In step 2 of the flow, use the following data source connection configuration. This connects to the example data source you created during .

    3. Complete the guided workflow to Save & Run a scan of the data source to validate that Soda Cloud can access the data in the example data source via the Soda Agent. It uses the external secrets manager configuration you set up to fetch, then pass the username and password to the data source.

    About the ClusterSecretStore

    The ClusterSecretStore is a YAML-configured set of instructions for accessing the external secrets manager which, in this case, is a Hashicorp Vault using a KV Secrets Engine V2. Note that some values in the example are generated; values in your own file vary.

    About the ExternalSecret

    The ExternalSecret is a separate YAML-based set of instructions for which secrets to fetch. The example below references the ClusterSecretStore above, which facilitates access to the Hashicorp Vault. The Soda Agent uses the ExternalSecret to retrieve data source credential values.

    The target template configuration in the ExternalSecret creates a file called soda-agent.conf into which it adds the username and password values in the format that the Soda Agent expects.
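An illustrative ExternalSecret of this shape is sketched below; the resource names, store reference, Vault paths, and key names are all assumptions, not values from the example repository.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: soda-agent-secrets        # illustrative name
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: my-cluster-secret-store # illustrative; the ClusterSecretStore above
    kind: ClusterSecretStore
  target:
    name: soda-agent
    template:
      data:
        # Renders the fetched values into the file the Soda Agent reads
        soda-agent.conf: |
          DATASOURCE_USERNAME: {{ .username }}
          DATASOURCE_PASSWORD: {{ .password }}
  data:
    - secretKey: username
      remoteRef:
        key: datasource           # illustrative Vault KV path
        property: username
    - secretKey: password
      remoteRef:
        key: datasource
        property: password
```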

    Go further

    • Access standard instructions to .

    • in Soda!

    • . Hey, what can Soda do for you?

    Reroute failed row samples

    Learn how to programmatically use Soda Library with an example script to reroute failed row samples to the CLI output instead of Soda Cloud.

    Using Soda Library, you can programmatically run scans that reroute failed row samples to display them in the command-line instead of Soda Cloud.

    By default, Soda Library implicitly pushes samples of any failed rows to Soda Cloud for missing, validity, duplicate, and reference checks; see About failed row samples. Instead of sending the results to Soda Cloud, you can use a Python custom sampler to programmatically instruct Soda to display those samples in the command-line.

    Follow the instructions below to modify an example script and run it locally to invoke Soda to run a scan on example data and display samples in the command-line for the rows that failed missing, validity, duplicate, and reference checks. This example uses Dask and Pandas to convert CSV sample data into a DataFrame on which Soda can run a scan, and also to convert failed row samples into a CSV to route them to, or display them in, a non-Soda Cloud location.

    Note that although the example does not send failed row samples to Soda Cloud, it does still send dataset profile information and the data quality check results to Soda Cloud.
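The custom sampler in the example follows Soda's sampler interface. The following is a hedged sketch, assuming soda-core or Soda Library is installed; the class name is illustrative.

```python
from soda.sampler.sampler import Sampler
from soda.sampler.sample_context import SampleContext

class CLISampler(Sampler):
    """Prints failed row samples to the command line instead of Soda Cloud."""

    def store_sample(self, sample_context: SampleContext):
        rows = sample_context.sample.get_rows()
        print(sample_context.query)  # the SQL that produced the failed rows
        for row in rows:
            print(row)

# Attach the sampler to the scan before executing it:
# scan.sampler = CLISampler()
```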

    Prerequisites

    • a code or text editor such as PyCharm or Visual Studio Code

    • Python 3.8, 3.9, or 3.10

    • Pip 21.0 or greater

    Set up and run example script

    Jump to:

    1. In a browser, navigate to to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.

    2. Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys. Copy+paste the API key values to a temporary, secure place in your local environment.

    Why do I need a Soda Cloud account?

    To validate your account license or free trial, the Soda Library Docker image that the GitHub Action uses to execute scans must communicate with a Soda Cloud account via API keys. Create new API keys in your Soda Cloud account, then use them to configure the connection between the Soda Library Docker image and your account later in this procedure.

    3. Best practice dictates that you run Soda in a virtual environment. From the command line, create a new directory in your environment, then use the following command to create, then activate, a virtual environment called .sodadataframes.
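The standard commands for this step, assuming python3 on macOS or Linux, are:

```shell
# Create the virtual environment named in the step above
python3 -m venv .sodadataframes
# Activate it for the current shell session
source .sodadataframes/bin/activate
```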

    4. Run the following commands to upgrade pip, then install Soda Library for Dask and Pandas.

    5. Copy and paste the example script below into a new Soda-dask-pandas-example.py file in the same directory in which you created your virtual environment. In the file, replace the above-the-line values with your own Soda Cloud values, then save the file.

    6. From the command line, use the following command to run the example and see both the scan results and the failed row samples as command-line output.

    Output:

    7. In your Soda Cloud account, navigate to Datasets, then click to open soda.pandas.example. Soda displays the check results for the scan you just executed via the command-line. If you wish, click the Columns tab to view the dataset profile information Soda Library collected and pushed to Soda Cloud.

    8. Click the Alpha2 Country Codes must be valid row to view the latest check result, which failed. Note that Soda Cloud does not display a tab for Failed Rows Analysis which would normally contain samples of failed rows from the scan.

    Example script

    Go further

    • Learn how to in Soda Cloud.

    • Learn how to .

    • entirely.

    • Learn how to use a custom sampler to route failed row samples to an .

    Activate anomaly dashboards

    Use Soda's anomaly dashboard to get automated observability insights into your data quality.

    Available in 2025 — See

    Use Soda's anomaly dashboards to get automated insights into basic data quality metrics for your datasets.

    To activate these out-of-the-box dashboards, Soda learns enough about your data to automatically create checks for your datasets that monitor several built-in metrics for anomalous measurements. To offer this observability into the basic quality of your data, the anomaly dashboard gauges:

    Check template

    Use a check template to write one SQL query that you can reuse in multiple Soda checks for data quality.

    This feature is not supported in Soda Core OSS. to Soda Library in minutes to start using this feature for free with a 45-day trial.

    Use a check template to define a reusable, user-defined metric that you can apply to many checks in multiple checks files.

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✖️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    SodaCL metrics and checks

    SodaCL uses metrics in checks for data quality in one or more datasets.

    Soda Checks Language (SodaCL) is a YAML-based, domain-specific language for data reliability. Use SodaCL to write checks for data quality which Soda then executes when it scans the data in your data source.

    A metric is a property of the data in your dataset. A threshold is the value for a metric that Soda checks against during a scan. Usually, you use both a metric and a threshold to define a SodaCL check in a checks YAML file, like the following example that checks that the dim_customer dataset is not empty.

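Such a check, written in a checks YAML file, takes this minimal SodaCL form, using the row_count metric with a threshold:

```yaml
checks for dim_customer:
  - row_count > 0
```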

    A check is a test for data quality that you write using the Soda Checks Language (SodaCL). SodaCL includes over 25 built-in metrics that you can use to write checks, but you also have the option of writing your own SQL queries or expressions using SodaCL.

    See a .

    When it scans datasets in your data source, Soda Library executes the checks you defined in your checks YAML file. Technically, a check is a Python expression that, during a Soda scan, checks metrics to see if they match the parameters you defined for a threshold. A single Soda scan executes multiple checks against one or more datasets in your data source. Read more about

    Automate anomaly detection

    Use this guide to set up Soda and start automatically monitoring your data for quality.

    Available in 2025: refer to the new .

    Use this guide to set up Soda and begin automatically monitoring the data quality of datasets in a data source. Use the guided workflow in Soda Cloud to connect to a data source, profile your data, and activate anomaly dashboards for your datasets.

    Troubleshoot SodaCL

    Access guidance for resolving issues with Soda Checks Language checks and metrics.

    NoneType object is not iterable

    Problem: During a scan, Soda returns an error that reads NoneType object is not iterable.

    Solution: The most likely cause of the error is incorrect indentation of your SodaCL. Double check that nested items in checks have proper indentation; refer to to validate your syntax.

    Group evolution checks

    Use a SodaCL group evolution data quality check to validate changes to the categorical groups you defined.

    This feature is not supported in Soda Core OSS. to Soda Library in minutes to start using this feature for free with a 45-day trial.

    Use a group evolution check to validate the presence or absence of a group in a dataset, or to check for changes to groups in a dataset relative to their previous state.

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    Add check attributes

    Define and apply check attributes to categorize and organize SodaCL checks in Soda Cloud.

    As a user with the permissions to do so, you can define check attributes that your team can apply to checks when they write them.

    Use attributes to organize your checks and alert notifications in Soda Cloud.

    • Apply attributes to checks to label and sort them by department, priority, location, etc.

    • Add a check attribute to identify, for example, checks that execute against personally identifiable information (PII).

        data_source my_datasource_name:
          type: snowflake
          username: ${ SNOWFLAKE_USER }
          password: ${ SNOWFLAKE_PASS }
          account: ${ SNOWFLAKE_ACCOUNT } 
          database: sodadata_test
          warehouse: compute_wh
          role: analyst
          session_parameters:
            QUERY_TAG: soda-queries
            QUOTED_IDENTIFIERS_IGNORE_CASE: false
          schema: public
        
        soda_cloud:
          host: cloud.us.soda.io
          api_key_id: ${ SODA_CLOUD_API_KEY }
          api_key_secret: ${ SODA_CLOUD_API_SECRET } 
    # This GitHub Action runs a Soda scan on a Snowflake data source called reporting_api_marts.
    name: Run Soda Scan on [reporting_api_marts]
    # GitHub triggers this job when a user creates or updates a pull request.
    on: pull_request
    jobs:
      soda_scan:
        runs-on: ubuntu-latest
        name: Run Soda Scan
        steps:
          - name: Checkout
            uses: actions/checkout@v3
    
          - name: Perform Soda Scan
            uses: sodadata/soda-github-action@main
            env:
              SODA_CLOUD_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}
              SODA_CLOUD_API_SECRET: ${{ secrets.SODA_CLOUD_API_SECRET }}
              SNOWFLAKE_USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }}
              SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
    
            with:
              soda_library_version: v1.0.4
              data_source: snowflake_api_marts
              configuration: ./soda/configuration.yml
              checks: ./soda/checks.yml
    checks for dim_customer:
      - failed rows:
          name: Failed rows with CTE
          fail condition: total_children = '2' and number_cars_owned >= 3
    
    # OR
    
      - failed rows:
          name: Failed rows with CTE
          fail condition: |
            total_children = '2' and number_cars_owned >= 3
    
    checks for dim_customer:
      - failed rows:
          fail query: |
            SELECT DISTINCT geography_key
            FROM dim_customer as customer
    checks for dim_customer:
      - failed rows:
          name: Failed rows query test
          fail query: |
            SELECT DISTINCT geography_key
            FROM dim_customer as customer
    checks for dim_customer:
      - failed rows:
          fail condition: total_children = '2' and number_cars_owned >= 3
          warn: when between 1 and 10
          fail: when > 10
    checks for dim_customer:
      - failed rows:
          name: Failed rows query test
          fail query: |
            SELECT DISTINCT "geography_key"
            FROM dim_customer as customer
    filter dim_product [new]:
      where: start_date < TIMESTAMP '2015-01-01'
    
    checks for dim_product [new]:
      - failed rows:
          name: Failed CTE with filter
          fail condition: weight < '200' and reorder_point >= 3
    checks for dim_product:
      # with SQL query
      - failed rows:
          name: Brand must be LUCKY DOG
          column: product_line
          fail query: |
            SELECT *
            FROM dim_product
            WHERE product_line LIKE '%LUCKY DOG%'
      # with CTE
      - failed rows:
          name: Brand must be LUCKY DOG
          column: product_line
          fail condition: brand LIKE '%LUCKY DOG%'
    import base64
    api_key_id = "your_api_key_id"
    api_key_secret = "your_api_key_secret"
    
    credentials = f"{api_key_id}:{api_key_secret}"
    encoded_credentials = base64.b64encode(credentials.encode()).decode()
    print(f"Basic {encoded_credentials}")
    pip install pandas requests
    pip install snowflake-connector-python
    # Use cloud.us.soda.io in the US region; use cloud.soda.io in the EU region
    soda_cloud_url = 'https://cloud.us.soda.io'  
    soda_apikey = 'xxx' # API key ID from Soda Cloud
    soda_apikey_secret = 'xxx' # API key secret from Soda Cloud
    # Tables to store Soda metadata. Use UPPERCASE.
    datasets_table = 'DATASETS_REPORT'
    checks_table = 'CHECKS_REPORT'
    # Snowflake connection details
    snowflake_details = snowflake.connector.connect(
        user=user,
        password=password,
        account=account,
        warehouse=warehouse,
        database=database,
        schema=schema,
        )
    response_datasets = requests.get(
            soda_cloud_url + '/api/v1/datasets?page=0', 
            auth=(soda_apikey , soda_apikey_secret)
            )
    
    if response_datasets.status_code == 401 or response_datasets.status_code == 403:
        print("Unauthorized or Forbidden access. Please check your API keys and/or permissions in Soda.")
        sys.exit()
    # Fetch info about all datasets
    
    if response_datasets.status_code == 200:
        dataset_pages = response_datasets.json().get('totalPages')
    
        i = 0
        while i < dataset_pages:
            dq_datasets = requests.get(
            soda_cloud_url + '/api/v1/datasets?page='+str(i), 
            auth=(soda_apikey , soda_apikey_secret))
    
            if dq_datasets.status_code == 200:
                print("Fetching all datasets on page: "+str(i))
                dataset_list = dq_datasets.json().get("content")
                datasets.extend(dataset_list)
                i += 1
            elif dq_datasets.status_code == 429:
                print("API Rate Limit reached when fetching datasets on page: " +str(i)+ ". Pausing for 30 seconds.")
                time.sleep(30)
                # Retry fetching the same page
            else:
                print("Error fetching datasets on page "+str(i)+". Status code:", dq_datasets.status_code)
    
    else:
        print("Error fetching initial datasets. Status code:", response_datasets.status_code)
        sys.exit()
    
    df_datasets = pd.DataFrame(datasets)
    df_datasets.head()
    # Fetch info about all checks
    
    response_checks = requests.get(
        soda_cloud_url + '/api/v1/checks?size=100', 
        auth=(soda_apikey , soda_apikey_secret))
    
    if response_checks.status_code == 200:
        check_pages = response_checks.json().get('totalPages')
    
        i = 0
        while i < check_pages:
            dq_checks = requests.get(
                soda_cloud_url + '/api/v1/checks?size=100&page='+str(i), 
                auth=(soda_apikey , soda_apikey_secret))
    
            if dq_checks.status_code == 200:
                print("Fetching all checks on page "+str(i))
                check_list = dq_checks.json().get("content")
                checks.extend(check_list)
                i += 1 
            elif dq_checks.status_code == 429:
                print("API Rate Limit reached when fetching checks on page: " +str(i)+ ". Pausing for 30 seconds.")
                time.sleep(30)
                # Retry fetching the same page
            else:
                print("Error fetching checks on page "+str(i)+". Status code:", dq_checks.status_code)
    
    else:
        print("Error fetching initial checks. Status code:", response_checks.status_code)
        sys.exit(1)
    
    df_checks = pd.DataFrame(checks)
    df_checks.head()
    write_pandas(snowflake_details, df_checks, checks_table, auto_create_table=True)
    write_pandas(snowflake_details, df_datasets, datasets_table, auto_create_table=True)
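The retry-on-429 pagination pattern used in the script above can be factored into a reusable helper. The sketch below is stdlib-only and illustrative: the `fetch_all_pages` name is hypothetical, and the `fetch_page` callable stands in for the `requests.get` calls above (anything exposing `status_code` and `json()`).

```python
import time

def fetch_all_pages(fetch_page, total_pages, pause_seconds=30):
    """Collect the 'content' list from every page, pausing and
    retrying the same page whenever the API returns HTTP 429."""
    items = []
    page = 0
    while page < total_pages:
        response = fetch_page(page)
        if response.status_code == 200:
            items.extend(response.json().get("content"))
            page += 1  # advance only on success
        elif response.status_code == 429:
            time.sleep(pause_seconds)  # rate limited: retry the same page
        else:
            raise RuntimeError(
                f"Error fetching page {page}. Status code: {response.status_code}")
    return items
```

You could then call it with, for example, `fetch_all_pages(lambda p: requests.get(url + "&page=" + str(p), auth=auth), total_pages)`, replacing both while loops above.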
    data_source my_datasource_name:
      type: oracle
      username: ${USARBIG_USER}
      password: ${USARBIG_PASSWORD}
      connectstring: "${USARBIG_HOST}:${USARBIG_PORT}/${USARBIG_SID}"
    data_source my_datasource_name:
      type: oracle
      username: simple
      password: simple_pass
      host: host
      service_name: service
    checks for dim_product:
      - row_count = 10:
          attributes:
            department: Marketing
            priority: 1
            tags: [event_campaign, webinar]
            pii: true
            best_before: 2022-02-20
    discover datasets:
      datasets:
        - include %
        - exclude test_%
    profile columns:
      columns:
        - "%.%"  # Includes all your datasets
        - prod%  # Includes all datasets that begin with 'prod'
    automated monitoring:
      datasets:
        - include %
        - exclude test_%
    data_source my_datasource:
      type: postgres
      ...
      sampler:
        disable_samples: True
    sample datasets:
      datasets:
        - include region%
    sample datasets:
      datasets:
        - exclude [credit_card, birth_date]
    discover datasets:
      datasets:
        - include %
        - exclude test%
    discover datasets:
      datasets:
        - exclude %
    profile columns:
      columns:
        - exclude %.pii_%
        - exclude %.%email%
    profile columns:
      columns:
        - exclude %.% 
    name: Scan for data quality
    
    on: pull_request
    jobs:
      soda_scan:
        runs-on: ubuntu-latest
        name: Run Soda Scan
        steps:
          - name: Checkout
            uses: actions/checkout@v3
    
          - name: Perform Soda Scan
            uses: sodadata/soda-github-action@v1
            env:
              SODA_CLOUD_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}
              SODA_CLOUD_API_SECRET: ${{ secrets.SODA_CLOUD_API_SECRET }}
              SNOWFLAKE_USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }}
              SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
            with:
              soda_library_version: v1.0.4
              data_source: snowflake
              configuration: ./configuration.yaml
              checks: ./checks.yaml
    # checks.yml file
    checks for retail_orders:
      - row_count > 0
      - missing_count(order_quantity) < 3
    - name: Soda Library Action
      uses: sodadata/soda-github-action@v1
      with:
        soda_library_version: v1.0.4
        data_source: aws_postgres_retail
        configuration: .soda/configuration.yaml
        checks: .soda/checks.yaml
    - name: Perform Soda Scan
      uses: sodadata/soda-github-action@v1
      env:
        SODA_CLOUD_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}
        SODA_CLOUD_API_SECRET: ${{ secrets.SODA_CLOUD_API_SECRET }}
        POSTGRES_USERNAME: ${{ secrets.POSTGRES_USERNAME }}
        POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }}
      with:
        soda_library_version: v1.0.4
        data_source: snowflake1
        configuration: .soda/configuration.yaml
        checks: .soda/checks.yaml
    pip install -i https://pypi.cloud.soda.io soda-dbt
    soda ingest dbt -d my_datasource_name  -c /path/to/configuration.yml  --dbt-artifacts /path/to/files
    soda ingest dbt -d my_datasource_name  -c /path/to/configuration.yml  --dbt-manifest path/to/manifest.json --dbt-run-results path/to/run_results.json
    pip install -i https://pypi.cloud.soda.io soda-dbt
    dbt_cloud:
      account_id: account_id
      api_token: serviceAccountTokenFromDbt1234
    dbt_cloud:
      account_id: account_id
      api_token: serviceAccountTokenFromDbt1234
      access_url: ab123.us1.dbt.com
    soda ingest dbt -d my_datasource_name -c /path/to/configuration.yml --dbt-cloud-run-id the_run_id
    soda ingest dbt -d my_datasource_name -c /path/to/configuration.yml --dbt-cloud-job-id the_job_id
    class Scan()
    ConnectionString
    instructions
    soda-library docker images
    Soda community on Slack
    is an optional list of strings that you use when recursively scanning directories to load only those files with a specific extension. If unspecified, the default values are .yml and .yaml.

    • 0: All checks passed. No runtime errors.

    • 1: Soda recorded a warn result for one or more checks.

    • 2: Soda recorded a fail result for one or more checks.

    • 3: Soda encountered a runtime issue but was able to send check results to Soda Cloud.

    • 4: Soda encountered a runtime issue and was unable to send check results to Soda Cloud.
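In a CI wrapper or orchestration script, you can branch on these exit codes after invoking a scan. A minimal stdlib-only sketch: the constant and function names are illustrative, and the blocking policy shown (block on fail or runtime issue, tolerate warns) is one possible choice, not a Soda recommendation.

```python
# Exit codes documented for Soda Library scans.
SCAN_EXIT_CODES = {
    0: "All checks passed. No runtime errors.",
    1: "Soda recorded a warn result for one or more checks.",
    2: "Soda recorded a fail result for one or more checks.",
    3: "Soda encountered a runtime issue but sent check results to Soda Cloud.",
    4: "Soda encountered a runtime issue and sent no results to Soda Cloud.",
}

def should_block_pipeline(exit_code: int) -> bool:
    """Treat fails and runtime issues as blocking; warns as non-blocking."""
    return exit_code in (2, 3, 4)
```

For example, after `result = subprocess.run([...])`, check `should_block_pipeline(result.returncode)` to decide whether to halt downstream jobs.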

    check template
    check templates
    Configure a custom sampler

    Need help? Join the Soda community on Slack.

    # configuration.yml file
    data_source aws_postgres_retail:
      type: postgres
      host: soda-demo
      username: ${POSTGRES_USERNAME}
      password: ${POSTGRES_PASSWORD}
      database: postgres
      schema: public
    # Refer to https://go.soda.io/api-keys
    soda_cloud:
      host: cloud.us.soda.io
      api_key_id: ${SODA_CLOUD_API_KEY}
      api_key_secret: ${SODA_CLOUD_API_SECRET}
    def execute(self) -> int
    def set_data_source_name(self, data_source_name: str)
    def set_scan_definition_name(self, scan_definition_name: str)
    def add_configuration_yaml_file(self, file_path: str)
    def add_configuration_yaml_files(self, path: str, recursive: bool | None = True, suffixes: str | None = None)
    def add_configuration_yaml_str(self, environment_yaml_str: str, file_path: str = "yaml string")
    def add_sodacl_yaml_file(self, file_path: str)
    def add_sodacl_yaml_files(self, path: str, recursive: bool | None = True, suffixes: list[str] | None = None)
    def add_sodacl_yaml_str(self, sodacl_yaml_str: str, file_name: str | None = None):
    def add_template_file(self, file_path: str)
    def add_template_files(self, path: str)
    def add_pandas_dataframe(self, dataset_name: str, pandas_df, data_source_name: str = "dask")
    def add_dask_dataframe(self, dataset_name: str, dask_df, data_source_name: str = "dask")
    def add_spark_session(self, spark_session, data_source_name: str = "spark_df")
    def add_duckdb_connection(self, duckdb_connection, data_source_name: str = "duckdb")
    def set_verbose(self, verbose_var: bool = True)
    def set_is_local(self, local_var: bool = True)
    def add_variables(self, variables: dict[str, str])
    def assert_no_error_logs(self)
    def assert_no_error_nor_warning_logs(self)
    def assert_has_error(self, expected_error_message: str)
    def has_error_logs(self) -> bool
    def has_error_or_warning_logs(self) -> bool
    def get_logs_text(self) -> str | None
    def get_error_logs(self) -> list[Log]
    def get_error_or_warning_logs(self) -> list[Log]
    def get_error_logs_text(self) -> str | None
    def get_scan_results(self) -> dict
    "definitionName"
    "defaultDataSource"
    "dataTimestamp"
    "scanStartTimestamp"
    "scanEndTimestamp"
    "hasErrors"
    "hasWarnings"
    "hasFailures"
    "metrics"
    "checks"
    "checksMetadata"
    "queries"
    "automatedMonitoringChecks"
    "profiling"
    "metadata"
    "logs"
    def assert_no_checks_fail(self)
    def assert_no_checks_warn_or_fail(self)
    def has_check_fails(self) -> bool
    def has_check_warns(self) -> bool
    def has_check_warns_or_fails(self) -> bool
    def get_checks_fail(self) -> list[Check]
    def get_checks_fail_text(self) -> str | None
    def get_checks_warn_or_fail(self) -> list[Check]
    def get_checks_warn_or_fail_text(self) -> str | None
    def get_all_checks_text(self) -> str | None
    self._configuration.samples_limit: int
    self.sampler: Sampler


  • View a dataset's Checks tab

  • View a dataset's Anomalies tab

  • View a dataset's Agreements tab

  • View a dataset's Columns tab, schema info only

  • View the check history for a dataset's checks, though not failed row samples

  • Create no-code checks for a dataset

  • Edit no-code checks for a dataset

  • Delete no-code checks for a dataset

  • Add proposed no-code checks to a dataset

  • By default, all Owners use an Author license, if you use the legacy license billing model.
    Soda Cloud automatically assigns the role of Manager to the new Dataset Owner.

    View dataset

    Access dataset profiling and samples

    Access failed row samples for checks
    Read more
    table above
    Manage user groups
    dataset role
    Change the Data Source Owner
    Soda’s architecture
    Organize your datasets
    Invite colleagues
    Soda Incidents



    Organize, alert, investigate

    If you wish to collect samples of failed rows when a check fails, you can employ a custom sampler; see Configure a failed row sampler.
  • Be sure to include any variables in your programmatic scan before the check YAML files. Soda requires the variable input for any variables defined in the check YAML files.


    • 0: All checks passed; all good from both a runtime and a Soda perspective.

    • 1: Soda issued a warn result on one or more checks.

    • 2: Soda issued a fail result on one or more checks.

    • 3: Soda encountered a runtime issue, and was able to submit scan results to Soda Cloud.

    • 4: Soda encountered a runtime issue, but was unable to submit any results to Soda Cloud.

    Install Soda Library
    programmatic deployment model
    Review requirements
    Create a Soda Cloud account
    Set up basic programmatic invocation in Python
    cloud.soda.io/signup
    Learn more
    Data source reference
    install a Soda Library package
    Python API Reference page
    install a Soda Library package
    Write SodaCL checks
    Soda community on Slack
    Missing check results in Soda Cloud

    required

    A unique identifier that you generate in your console. See .

    private_key

    required

    A unique identifier that you generate in your console. See .

    client_email

    required

    Also known as the service account ID, find this value in the IAM & Admin > Service Accounts > Details tab in your Google Cloud Console.

    client_id

    required

    Your unique ID, find this value in the IAM & Admin > Service Accounts > Details tab in your Google Cloud Console.

    auth_uri

    required

    BigQuery's authentication URI to which you send auth credentials. Default: https://accounts.google.com/o/oauth2/auth

    token_uri

    required

    BigQuery's token URI to which you send access tokens. Default: https://oauth2.googleapis.com/token

    auth_provider_x509_cert_url

    required

    BigQuery's public x509 certificate URL that it uses to verify the JWT signed by the authentication provider. Default: https://www.googleapis.com/oauth2/v1/certs

    client_x509_cert_url

    required

    BigQuery's public x509 certificate URL that it uses to verify the JWT signed by the client.

    auth_scopes

    optional

    Soda applies three default auth scopes: • https://www.googleapis.com/auth/bigquery to view and manage your data in BigQuery • https://www.googleapis.com/auth/cloud-platform to view, configure, and delete your Google Cloud data • https://www.googleapis.com/auth/drive to view and add to the record of file activity in your Google Drive

    project_id

    optional

    Add an identifier to override the project_id from the account_info_json

    storage_project_id

    optional

    Add an identifier to use a separate BigQuery project for compute and storage.

    dataset

    required

    The identifier for your BigQuery dataset, the top-level container that is used to organize and control access to your tables and views.

    Property

    Required

    Notes (See Google BigQuery Integration parameters)

    type

    required

    Identify the type of data source for Soda.

    account_info_json

    required

    The integration parameters for account info are listed below. If you do not provide values for the properties, Soda uses the Google application default values.

    type

    required

    This is the type of BigQuery account. Default: service_account

    project_id

    required

    This is the unique identifier for the project in your console. See Locate the project ID.

    text: STRING

    number: INT64, DECIMAL, BIGNUMERIC, BIGDECIMAL, FLOAT64

    time: DATE, DATETIME, TIME, TIMESTAMP

    connection configuration
    Deploy using a values YAML file
    profiling
    Dataset filters

    private_key_id

    deploy a Soda Agent configured to use the external secrets manager to access login credentials for a data source

  • create a Soda Cloud account and set up a new data source that accesses the data in the PostgreSQL data source via the Soda Agent and the external secrets manager

  • Podman Desktop, for users who prefer to use a UI

  • Podman engine, for users who prefer to use the command-line

  • (Optional) kind to create and run a Kubernetes cluster locally

  • (Optional) kubectl to execute commands against the Kubernetes cluster

  • set up a HashiCorp Vault,

  • deploy External Secrets Operator,

  • set up a Kubernetes UI dashboard called Headlamp

  • create a PostgreSQL data source containing a NYC bus breakdowns and delays dataset

  • Output (last few lines):

    create a Kubernetes secret with a value for appRoleSecretId to access the Hashicorp Vault,

  • create an ExternalSecret,

  • create a ClusterSecretstore,

  • populate some secret values into Vault

  • Output (last few lines):

    Now logged in, from the list of Secret Engines, navigate to kv/local/soda to see the example username and password secrets in the vault. If you wish, you can set new secrets that the Soda Agent can use.

    Follow the instructions to Define SodaCL checks using no-code checks in Soda Cloud, then run scans for data quality.

    external-secrets.io
    Terraform
    Docker Desktop
    Docker engine
    github.com/sodadata/soda-agent-use-cases
    https://127.0.0.1:30200
    Deploy a Soda Agent
    Add a new data source
    Set up
    dotenv format
    integrate with a secrets manager
    Get organized
    Request a demo


    anomalies in a dataset's row count volume

  • anomalies in the timeliness of new data in datasets that contain a column with a TIME data type

  • evolutions in a dataset's schemas, monitoring columns that have been moved, added, or removed

  • anomalies in the volume of missing values in columns in a dataset

  • anomalies in the volume of duplicate values in columns in a dataset

  • anomalies in the calculated average of the values in columns in a dataset that contain numeric values

  • Using a self-hosted or Soda-hosted agent connected to your Soda Cloud account, you configure a data source to partition, then profile the datasets to which you wish to add an anomaly dashboard. Soda then leverages machine learning algorithms to run daily scans of your datasets to gather measurements which, after a few days, enable Soda to recognize patterns in your data.

    After establishing these patterns, Soda automatically detects anomalies relative to the patterns and flags them for your review in each dataset's anomaly dashboard.

    ✔️ Requires Soda Core Scientific (included in a Soda Agent) ✖️ Supported in Soda Core ✖️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud + self-hosted Soda Agent connected to a BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake data source

    Compatibility

    BigQuery Databricks SQL MS SQL Server MySQL

    PostgreSQL Redshift Snowflake

    Set up anomaly dashboards

    Available in 2025.

    For preview participants only

    Activate an anomaly dashboard to one or more datasets by configuring profiling for a new data source in Soda Cloud. Refer to the Get started documentation for full data source onboarding instructions.

    1. To activate anomaly dashboards, you must use a self-hosted or Soda-hosted agent to connect to your data sources. If you already use a self-hosted Soda agent, upgrade the agent to version 1.1.2 or greater. If you do not already have an active Soda agent in your Soda Cloud account:

      • navigate to your avatar > Organization Settings to validate that the checkbox for Enable Soda-hosted Agent is checked OR

      • follow the instructions to deploy a self-hosted agent in a Kubernetes cluster in your cloud services environment

    2. You can activate anomaly dashboards on existing data sources or on new ones you add via a Soda agent.

      • For existing data sources, follow the instructions to activate an anomaly dashboard to an existing dataset.

      • For a new data source, navigate to your avatar > Data Sources, then click Add New to begin the guided data source onboarding workflow.

    3. In the editing panel of 4. Profile, use the include and exclude syntax to indicate the datasets for which Soda must profile and prepare an anomaly dashboard. The default syntax in the editing panel instructs Soda to profile every column of every dataset in the data source, and, superfluously, all datasets with names that begin with prod. The % is a wildcard character. See the profiling documentation for more detail on profiling syntax.

    4. Continue the remaining steps to add your new data source, then Test Connection, if you wish, and Save the data source configuration. Soda begins profiling the datasets according to your Profile configuration while the algorithm uses the first measurements collected from a scan of your data to begin the work of identifying patterns in the data.

    5. After approximately five days, during which Soda's machine learning studies your data, you can navigate to the Dataset page for a dataset you included in profiling. Click the Anomalies tab to view the issues Soda automatically detected.

    6. (Optional) Consider setting up a notification for any of the automated anomaly detection checks in the dashboard; see Add anomaly notifications.

    7. (Optional) If you wish, you can adjust the time of day that the daily anomaly detection scan definition runs to collect its measurements. To do so, navigate to the Scans dashboard, then, for the scan definition that runs daily for your anomaly dashboard updates, click the stacked dots at right and select Edit Scan Definition. Adjust the time of day as you wish, then Save.

    Activate an anomaly dashboard to an existing dataset

    Use the following procedure to activate the anomaly dashboard for an existing dataset in a data source you already connected to your Soda Cloud account via a self-hosted or Soda-hosted agent.

    1. If you have the permission to do so for a dataset, navigate to the Datasets dashboard, then open the dataset to which you wish to activate an anomaly dashboard.

    2. Navigate to the Anomalies tab where a message appears that advises you that the anomaly dashboard has not been activated for this dataset. Click Activate.

    3. Follow the guided steps and carefully read the warning about the changes to any existing profiling you have configured for the data source (see below). If you accept the permanent changes, specify the time of day you wish to run the daily anomaly scan, then proceed.

    To activate the anomaly dashboard for this dataset, Soda creates a new, dedicated scan definition for its data source that runs dataset discovery, profiling, and anomaly detection on a daily schedule. With this activation, be aware that:

    • Soda moves your existing dataset discovery and profiling configurations from this data source’s default scan definition to the new scan definition to indicate which datasets the anomaly dashboard should profile.

    • Any automated monitoring checks you previously configured for any datasets in this data source cease to exist; the new scan definition runs all automated anomaly detection checks.

    4. After approximately five days, during which Soda's machine learning studies your data, you can return to the Anomalies tab on the Dataset page to view the issues Soda automatically detected.

    5. (Optional) Consider setting up a notification for any of the automated anomaly detection checks in the dashboard; see Add anomaly notification.

    6. (Optional) If you wish, you can adjust the time of day that the daily anomaly detection scan definition runs to collect its measurements. To do so, navigate to the Scans dashboard, then, for the scan definition that runs daily for your anomaly dashboard updates, click the stacked dots at right and select Edit Scan Definition. Adjust the time of day as you wish, then Save.

    About the anomaly dashboard

    To access a dataset's anomaly dashboard in Soda Cloud, navigate to the Datasets dashboard, then select a dataset from the presented list to open an individual dataset page. Navigate to the Anomalies tab.

    The three Dataset Metrics tiles represent the most recent measurement or, in other words, one day’s worth of data anomaly detection. The three Column Metrics tiles display the last seven days’ worth of measurements and any anomalies that Soda detected.

    When you click a Column Metrics tile to access more information, the list below details which columns contained anomalies.

    • A red warning icon for a column indicates that Soda registered an anomaly in the last daily scan of the dataset.

    • A green check icon for a column indicates that Soda registered no anomalies in the last daily scan of the dataset.

    • A grayed-out icon for a column indicates that Soda registered an anomaly for a check at least once in the last seven days, but not on the most recent daily scan.

    Click a Dataset Metric tile or the column name for a Column Metric to open the Check History for the anomaly detection check. Optionally, you can add feedback to individual data points in the check history graph to help refine the anomaly detection’s algorithm pattern recognition and its ability to recognize anomalies.

    Empty metrics tiles

    If, after the anomaly detection algorithm has completed its pattern training, the anomaly dashboard does not display anomaly info in one or more tiles, it may be for one of a couple of reasons.

    • There is no column that contains TIME type data (TIMESTAMP, DATE, DATETIME, etc.) which a freshness check requires. Where it cannot detect a column with the necessary data type, Soda leaves the Freshness tile blank.

    • There is no column that contains NUMBER type data (INT, FLOAT, etc.) which an average metric check requires. Where it cannot detect a column with the necessary data type, Soda leaves the Average tile blank.

    Known issues and limitations

    • The Soda anomaly dashboard does not profile columns that contain timestamps or dates. For such columns, Soda executes only a freshness check to validate data freshness; it does not detect anomalies within the date or timestamp values themselves.

    Add anomaly notifications

    The anomaly dashboard adheres to Soda’s “no noise” policy when it comes to alert notifications for data quality issues. As such, the dashboard does not automatically send any notifications to anyone out of the box. If you wish to receive alert notifications for any of the anomalies the dashboard detects, use the bell (🔔) icon.

    If your Soda Admin has integrated your Soda Cloud account with Slack or MS Teams to receive check notifications, you can direct anomaly dashboard alerts to those channels. The dashboard does not support sending alerts via webhook.

    For a Dataset Metric, click the bell to follow the guided instructions to set up a rule that defines where to send an alert notification when Soda detects an anomalous measurement for the metric.

    For a Column Metric, click the bell next to an individual column name from those listed in the table below the three column metric tiles. Follow the guided instructions to set up a rule that defines where to send an alert notification when Soda detects an anomalous measurement for the metric.

    For example, if you want to receive notifications any time Soda detects an anomalous volume of duplicate values in an order_id column, click the Duplicate tile to display all the columns for which Soda automatically detects anomalies, then click the bell for order_id and set up a rule. If you also wish to receive notifications for anomalous volumes of missing values in the same column, click the Missing tile, then click the bell for order_id to set up a second rule.

    About profiling and partitioning

    The anomaly dashboard is powered by a machine learning algorithm that works with measured values for a metric that occur over time. Soda leverages the Facebook Prophet algorithm to learn patterns in your data so it can identify and flag anomalies.

    As the checks in the dashboard track and analyze metrics over time, the algorithm learns from historical patterns in your data, including trends and seasonal variations in the measurements it collects. After learning the normal behavior of your data, the checks become capable of detecting variations from the norm which it flags as anomalies.

    Notably, it takes some time – approximately five or more days – for the anomaly dashboard to learn the patterns of your data before it can display meaningful results.

    When you set up or activate the anomaly dashboard, Soda begins by partitioning your data. To maximize efficiency, Soda does not profile the entirety of data in a dataset; instead, it partitions your data so that it profiles only a sample of the data.

    To partition the data, first, Soda detects a column that contains TIME type data that it can use to partition the data to only the last 30 days' worth of data. If it does not detect a column of TIME type data, it uses one million rows of data against which to perform its profiling. If there are fewer than one million rows in a dataset, it profiles all the data; if there are more than a million rows, it selects a random sample of a million rows to use to profile the data.
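The partitioning rules described above can be sketched as a small decision function. This is a stdlib-only illustration of the documented logic, not Soda's actual implementation; the function and constant names are hypothetical.

```python
SAMPLE_ROW_LIMIT = 1_000_000
TIME_TYPES = {"TIME", "DATE", "DATETIME", "TIMESTAMP"}

def choose_partition(column_types: dict, row_count: int) -> str:
    """Mirror the documented strategy: prefer a 30-day window on a TIME-type
    column; otherwise fall back to at most one million (possibly sampled) rows."""
    time_columns = [name for name, ctype in column_types.items()
                    if ctype.upper() in TIME_TYPES]
    if time_columns:
        return f"last 30 days by column '{time_columns[0]}'"
    if row_count <= SAMPLE_ROW_LIMIT:
        return "all rows"
    return f"random sample of {SAMPLE_ROW_LIMIT} rows"
```

For example, a dataset with a `created_at` TIMESTAMP column is partitioned to the last 30 days regardless of size, while a five-million-row dataset with no TIME-type column falls back to a million-row random sample.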

    After partitioning a sample of data, Soda begins profiling it. The profiling activity collects metadata for your datasets such as the names of the columns in the datasets you configured for profiling, and the type of data that each contains. After profiling the data, Soda automatically creates relevant anomaly detection checks for the dataset and some of its columns.

    Change the time partitioning column

    If you wish, you can change the column which Soda automatically selected to partition your data. For example, if Soda selected a column with TIMESTAMP data labeled created_at to partition your data, but you would prefer that it use a last_updated column instead, you can make the change in Soda Cloud.

    When you choose a new time partition column, the anomaly dashboard algorithm resets, freshly partitioning the data based on the new column, then profiling the data and training on at least five days of measurements before displaying new results. The dashboard does not persist any existing anomaly dashboard measurements.

    1. With Admin, Manager, or Editor rights to a dataset in Soda Cloud, navigate to the Dataset page, then access the Anomalies tab.

    2. Click the stacked dots at the upper right of the page, then select Edit dataset.

    3. In the dialog box that appears, access the Profiling tab, then use the dropdown list of columns to select the one that you want Soda to use to partition your data for profiling for use in the anomaly dashboard.

    4. Carefully read the warning message about the consequences of the change, then Save.

    Go further

    • Add your own anomaly detection checks for other metrics for your data.


    Define a check template

    Requires Soda Library. Not yet supported in Soda Cloud.

    A check template involves both a template YAML file, in which you define reusable user-defined metrics, and at least one checks YAML file, in which you use the metric in a check for data quality.

    A check template borrows from the user-defined check syntax and has several parameters to define:

    a name

    a description

    an author

    a metric

    a query

    In the very simple example below, in a file called template.yml, the SQL query defines a metric called alpha. Together with the other parameters, this user-defined metric forms the template named template_alpha. The SQL query uses a variable for the value of table so that Soda uses the value for the table parameter that you provide when you write the SodaCL check in the checks.yml file.

    a name

    template_alpha

    a description

    Reusable SQL for writing checks.

    an author

    Jean-Paul

    a metric

    alpha

    a query

    SELECT count(*) as alpha FROM ${table}

    Having defined the check template, you can now use it in a check in your checks.yml file, as in the following example.

    • Because the SQL query in the check template uses a variable for the value of table, you must supply the value in the check as a parameter.

    • Be sure to add an identifier for the dataset in the first line, even if you supply the name of the dataset in the check using a parameter. To render properly in Soda Cloud, the check must include a dataset identifier.

    • The check must include at least one alert configuration to define when the check result ought to fail or warn.
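Assembled from the parameters above, the two files might look as follows. This is an illustrative sketch only: the names and the SQL come from the example in this section, but the surrounding YAML shape is an assumption; verify the exact check template syntax against the Soda documentation for your version.

```yaml
# template.yml -- illustrative sketch; verify exact syntax against the docs
templates:
  - name: template_alpha
    description: Reusable SQL for writing checks.
    author: Jean-Paul
    metric: alpha
    query: |
      SELECT count(*) as alpha FROM ${table}

# checks.yml -- note the dataset identifier on the first line, the table
# parameter supplied to the template, and the required alert configuration
checks for dim_product:
  - $template_alpha:
      parameters:
        table: dim_product
      fail: when < 1
```

As described below, you would pass the template file to the scan with the -T option, for example `soda scan -d my_datasource -c configuration.yml -T template.yml checks.yml` (command shape assumed from the surrounding text).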

    When you run a scan from the command-line, you must include a -T option to identify the file path and file name of the template YAML file in which you defined your reusable metric(s). In a programmatic scan, add the path to the template file.

    Command:

    Add to programmatic scan:

    Output:

    In a variation of the example above, you can use a template within a failed row check so as to collect failed row samples, as in the example below.

    In the following example, the same template.yml file contains a second template definition for beta. Together with the other parameters, this user-defined metric forms the template named template_beta and does not use a variable for the table name.

    You can then use the template in a check in the same, or different, checks.yml file. Even though the name of the dataset is included in the SQL query, you need to identify it in the check. The check must include at least one alert configuration to define when the check result ought to fail or warn.

    When you run a scan from the command-line, you must include a -T option to identify the file path and file name of the template YAML file in which you defined your reusable metric(s). In a programmatic scan, add the path to the template file.

    CLI command:

    Add to programmatic scan:

    Output:

    Optional check configurations

    Supported
    Configuration
    Documentation

    ✓

    Define a name for a freshness check; see .

    ✓

    Add an identity to a check.

    ✓

    Define alert configurations to specify warn and fail thresholds; see .

    Apply an in-check filter to return results for a specific portion of the data in your dataset.

    Example with check name

    Example with alert configuration

    Example with quotes

    List of comparison symbols and phrases

    Go further

    • Learn more about user-defined checks.


Need help? Join the Soda community on Slack.

    As a result of a scan, each check results in one of three default states:

    • pass: the values in the dataset match or fall within the thresholds you specified

    • fail: the values in the dataset do not match or fall within the thresholds you specified

    • error: the syntax of the check is invalid

    A fourth state, warn, is something you can explicitly configure for individual checks. See Add alert configurations.

    The scan results appear in your Soda Library command-line interface (CLI) and the latest result appears in the Checks dashboard in the Soda Cloud web application; examples follow.

Optionally, you can add the --local option to the scan command to prevent Soda Library from sending check results and any other metadata to Soda Cloud.

    Check types

    In general, SodaCL checks fall into one of three broad categories:

    1. standard

    2. unique

    3. user-defined

    A standard check, as illustrated above with row_count, uses a language pattern that includes a metric and a threshold. All numeric, missing, and validity metrics use this pattern and have a multitude of optional configurations. Read more about standard check types below.

    Quick view of standard check metrics

avg, avg_length, duplicate_count, duplicate_percent, invalid_count, invalid_percent, max, max_length, min, min_length, missing_count, missing_percent, percentile, row_count, stddev, stddev_pop, stddev_samp, sum, variance, var_pop, var_samp

    Some checks that you write with SodaCL do not use metrics and thresholds, and instead follow unique patterns relevant to the data quality parameters they check. Each unique check type has its own documentation.

    For example, a reference check that validates that the values in a column in one dataset match exactly with the values in another column in another dataset uses a unique pattern.

    Quick view of unique check types

anomaly detection, distribution, freshness, reconciliation, reference, cross, schema

    Finally, the user-defined checks make use of common table expressions (CTE) or SQL queries to construct a check; see an example below. This check type is designed to meet the needs of more complex and specific data quality checks, needs which cannot otherwise be met using the built-in standard and unique checks SodaCL provides. Each user-defined check type has its own documentation.

    Use these checks to prepare expressions or queries for your data that Soda Library executes during a scan along with all the other checks in your checks YAML file.

    Quick view of user-defined check types

failed rows, user-defined

    Standard check types

    Standard check types use the same pattern to compose a check, but the metrics they use can, themselves, be divided into three categories:

    1. numeric - metrics that involve tabulation or calculation of data

    2. missing - metrics that identify values or formats of data that qualify as missing, such as NULL

    3. validity - metrics that identify values or formats of data that, according to your own business rules, are acceptable or unacceptable

    Checks with fixed thresholds

    All standard checks that use numeric, missing, or validity metrics can specify a fixed threshold which is not relative to any other threshold. row_count > 0 is an example of a check with a fixed threshold as the threshold value, 0, is absolute.

    Generally, a fixed threshold check has three or four mutable parts:

    a metric

    an argument (optional)

    a comparison symbol or phrase

    a threshold

    The example above defines two checks. The first check applies to the entire dataset and counts the rows to confirm that it is not empty. If the retail_products dataset contains more than 0 rows, the check result is pass.

    metric

    row_count

    comparison symbol

    >

    threshold

    0

    The second check applies to only the size column in the dataset and checks that the values in that column do not exceed 500. If the size column in the retail_products dataset contains values larger than 500, the check result is fail.

    metric

    max

    argument

    (size)

    comparison symbol

    <=

    threshold

    500
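Taken together, the two checks described above can be written in a checks YAML file as:

```yaml
checks for retail_products:
  - row_count > 0
  - max(size) <= 500
```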

    Checks with change-over-time thresholds

Only checks that use numeric metrics can specify a change-over-time threshold, a value that is relative to a previously-measured, or historic, value. Sometimes referred to as a dynamic threshold, a change-over-time threshold gauges changes to the same metric over time.

    You must have a Soda Cloud account to use change-over-time thresholds.

    Refer to change-over-time thresholds for further details.

    Define boundaries with fixed thresholds

    While the most basic of standard checks use a single value to identify a fixed threshold, such as row_count >= 10, you can use comparison phrases to define the upper and lower boundaries for a fixed threshold value. Read more about fixed and dynamic thresholds.

    The following sections present several ways to set boundaries using the row_count metric in the example checks. You can use any numeric, missing, or validity metric in lieu of row_count.

    Implicitly include thresholds in a check

    By default, SodaCL includes the values that define the boundary thresholds when Soda Library executes a check. In the following example, the check passes if the number of rows is equal to 10, 11, 12, 13, 14, or 15 because SodaCL includes both boundary thresholds, 10 and 15, when Soda Library executes the check.
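A sketch of such a check, reusing the retail_products dataset from the earlier example:

```yaml
checks for retail_products:
  - row_count between 10 and 15
```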

    Use negative values to set boundaries, if you wish. The check in the following example passes if the number of rows is equal to -3, -2, -1, 0, 1, 2, 3, 4, or 5.
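With negative boundary values, the same pattern applies:

```yaml
checks for retail_products:
  - row_count between -3 and 5
```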

Use the not between comparison phrase to establish a range of unacceptable thresholds, so that anything that falls within the boundaries you specify yields a fail check result. The check in the following example passes if the number of rows is not equal to -3, -2, -1, 0, 1, 2, 3, 4, or 5.
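In SodaCL syntax, the not between check reads:

```yaml
checks for retail_products:
  - row_count not between -3 and 5
```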

    Explicitly exclude thresholds in a check

    To exclude the values that define the boundary thresholds, use the opening bracket ( and closing bracket ) characters. In the following example, the check passes if the number of rows is equal to 11, 12, 13, 14, or 15 because the opening bracket excludes 10 as an acceptable value.
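The opening bracket sits directly before the lower boundary value:

```yaml
checks for retail_products:
  - row_count between (10 and 15
```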

    Similarly, the following example check passes if the number of rows is equal to 11, 12, 13, or 14.
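With both boundaries excluded, the check reads:

```yaml
checks for retail_products:
  - row_count between (10 and 15)
```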

    Explicitly include thresholds in a check

    Though SodaCL includes the values that define the boundary thresholds during a check by default, you can use square brackets, [ and ], to explicitly specify which values to include, if you wish.

    For example, all of the following checks are equivalent and pass if the number of rows is equal to 10, 11, 12, 13, 14, or 15.
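All four of the following forms are equivalent:

```yaml
checks for retail_products:
  - row_count between 10 and 15
  - row_count between [10 and 15
  - row_count between 10 and 15]
  - row_count between [10 and 15]
```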

    List of SodaCL metrics and checks

    Go further

    • Access information about optional configurations that you can use in SodaCL checks.

    • Reference tips and best practices for SodaCL.

• See the full list of SodaCL metrics and checks.

    • Learn about running scans on multiple environments.

Need help? Join the Soda community on Slack.

    About this guide

    This guide offers Data Analysts, Data Scientists, and business users instructions to set up Soda to profile and begin monitoring data for quality, right out of the box.

This example offers instructions for both the self-hosted and Soda-hosted agent deployment models, which use Soda Cloud connected to a Soda Agent to securely access data sources and execute scheduled scans for data quality anomaly detection. See: Choose a flavor of Soda.

    Set up a Soda Agent

    This setup provides a secure, out-of-the-box Soda-hosted Agent to manage access to data sources from within your Soda Cloud account.

    Compatibility

BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, Snowflake

    Set up

1. If you have not already done so, create a Soda Cloud account. If you already have a Soda account, log in.

    2. By default, Soda prepares a Soda-hosted agent for all newly-created accounts. However, if you are an Admin in an existing Soda Cloud account and wish to use a Soda-hosted agent, navigate to your avatar > Organization Settings. In the Organization tab, click the checkbox to Enable Soda-hosted Agent.

    3. Navigate to your avatar > Data Sources, then access the Agents tab. Notice your out-of-the-box Soda-hosted agent that is up and running.

    Invite your colleague(s) to your Soda Cloud organization so they can access the newly-deployed Soda Agent to connect to data sources and begin monitoring data quality. In your Soda Cloud account, navigate to your avatar > Invite Team Members and fill in the blanks.

    This setup uses a secure self-hosted Soda Agent to manage access to data sources from within your Soda Cloud account.

    Compatibility

    1 MS SQL Server with Windows Authentication does not work with Soda Agent out-of-the-box.

    Automate data quality monitoring

For preview participants only.

    1. As a user with permission to do so in your Soda Cloud account, navigate to your avatar > Data Sources.

    2. In the Agents tab, confirm that you can see your Soda-hosted agent and that its status is "green" in the Last Seen column.

    3. Navigate to the Data source tab, then click New Data Source and follow the guided steps to connect to a new data source. Refer to the subsections below for insight into the values to enter in the fields and editing panels in the guided steps.

    1. Attributes

    Field or Label
    Guidance

    Data Source Label

    Provide a unique identifier for the data source. Soda Cloud uses the label you provide to define the immutable name of the data source against which it runs the Default Scan.

    Default Scan Agent

    Select the Soda-hosted agent, or the name of a Soda Agent that you have previously set up in your secure environment. This identifies the Soda Agent to which Soda Cloud must connect in order to run its scan.

    Check Schedule

    Provide the scan frequency details Soda Cloud uses to execute scans according to your needs. If you wish, you can define the schedule as a cron expression.

    Starting At

    Select the time of day to run the scan. The default value is midnight.

    Cron Expression

    (Optional) Write your own to define the schedule Soda Cloud uses to run scans.

Anomaly Dashboard Scan Schedule (available in 2025)

    Provide the scan frequency details Soda Cloud uses to execute a daily scan to automatically detect anomalies for the anomaly dashboard.

    2. Connect

    In the editing panel, provide the connection configurations Soda Cloud needs to be able to access the data in the data source. Connection configurations are data source-specific and include values for things such as a database's host and access credentials.

    Access the data source-specific connection configurations for the connection syntax and descriptions; adjust the values to correspond with your data source’s details.

    To more securely provide sensitive values such as usernames and passwords in a self-hosted agent deployment model, use environment variables in a values.yml file when you deploy the Soda Agent. See Use environment variables for data source connection credentials for details.

    3. Discover

During its initial scan of your data source, Soda Cloud discovers all the datasets the data source contains. It captures basic information about each dataset, including its schema and the columns it contains.

    In the editing panel, specify the datasets that Soda Cloud must include or exclude from this basic discovery activity. The default syntax in the editing panel instructs Soda to collect basic dataset information from all datasets in the data source except those with names that begin with test_. The % is a wildcard character. See Add dataset discovery for more detail on profiling syntax.
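The default configuration described above corresponds to syntax along these lines — a sketch; see Add dataset discovery for the authoritative form:

```yaml
discover datasets:
  datasets:
    - include %        # collect basic info from all datasets...
    - exclude test_%   # ...except those whose names begin with test_
```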

    Known issue: SodaCL does not support using variables in column profiling and dataset discovery configurations.

    4. Profile

    To gather more detailed profile information about datasets in your data source and automatically build an anomaly dashboard for data quality observability, you can configure Soda Cloud to profile the columns in datasets.

    Profiling a dataset produces two tabs' worth of data in a dataset page:

    • In the Columns tab, you can see column profile information including details such as the calculated mean value of data in a column, the maximum and minimum values in a column, and the number of rows with missing data.

• In the Anomalies tab, you can access an out-of-the-box Anomaly Dashboard that uses the column profile information to automatically begin detecting anomalies in your data relative to the patterns the machine learning algorithm learns over the course of approximately five days. (Available in 2025.)

    In the editing panel, provide details that Soda Cloud uses to determine which datasets to include or exclude when it profiles the columns in a dataset. The default syntax in the editing panel instructs Soda to profile every column of every dataset in this data source, and, superfluously, all datasets with names that begin with prod. The % is a wildcard character. See Add column profiling for more detail on profiling syntax.
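The default described above corresponds to syntax along these lines — a sketch; see Add column profiling for the authoritative form:

```yaml
profile columns:
  columns:
    - "%.%"      # every column of every dataset in the data source
    - prod%.%    # superfluously, all datasets with names beginning with prod
```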

    Column profiling can be resource-heavy, so carefully consider the datasets for which you truly need column profile information. Refer to Compute consumption and cost considerations for more detail.

    5. Assign Owner

    Field or Label
    Guidance

    Data Source Owner

    The Data Source Owner maintains the connection details and settings for this data source and its Default Scan Definition.

    Default Dataset Owner

The Datasets Owner is the user who, by default, becomes the owner of each dataset the Default Scan discovers. You can adjust the Dataset Owner of individual datasets afterward.

    Access an anomaly dashboard

    After approximately five days, during which Soda’s machine learning studies your data, you can navigate to the Anomalies tab on the Dataset page on one of the datasets you included in profiling to view the issues Soda automatically detected.

    The three Dataset Metrics tiles represent the most recent measurement or, in other words, one day’s worth of data anomaly detection. The three Column Metrics tiles display the last seven days’ worth of measurements and any anomalies that Soda detected.

    When you click a Column Metrics tile to access more information, the list below details which columns contained anomalies.

    • A red warning icon for a column indicates that Soda registered an anomaly in the last daily scan of the dataset.

• A green check icon for a column indicates that Soda registered no anomalies in the last daily scan of the dataset.

    • A grayed-out icon for a column indicates that Soda registered an anomaly for a check at least once in the last seven days, but not on the most recent daily scan.

    Click a Dataset Metric tile or the column name for a Column Metric to open the Check History for the anomaly detection check. Optionally, you can add feedback to individual data points in the check history graph to help refine the anomaly detection’s algorithm pattern recognition and its ability to recognize anomalies.

    Set up alert notifications

The anomaly dashboard adheres to Soda’s “no noise” policy when it comes to alert notifications for data quality issues. As such, the dashboard does not automatically send any notifications to anyone out of the box. If you wish to receive alert notifications for any of the anomalies the dashboard detects, use the bell (🔔) icon.

    If your Soda Admin has integrated your Soda Cloud account with Slack or MS Teams to receive check notifications, you can direct anomaly dashboard alerts to those channels. The dashboard does not support sending alerts via webhook.

    For a Dataset Metric, click the bell to follow the guided instructions to set up a rule that defines where to send an alert notification when Soda detects an anomalous measurement for the metric.

    For a Column Metric, click the bell next to an individual column name from those listed in the table below the three column metric tiles. Follow the guided instructions to set up a rule that defines where to send an alert notification when Soda detects an anomalous measurement for the metric.

    For example, if you want to receive notifications any time Soda detects an anomalous volume of duplicate values in an order_id column, click the Duplicate tile to display all the columns for which Soda automatically detects anomalies, then click the bell for order_id and set up a rule. If you also wish to receive notifications for anomalous volumes of missing values in the same column, click the Missing tile, then click the bell for order_id to set up a second rule.

    Go further

    • Learn more about the anomaly dashboard for datasets.

    • Learn more about organizing check results, setting alerts, and investigating issues.

    • Write your own checks for data quality.

    • Integrate Soda with Slack to send alert notifications directly to channels in your workspace.

    • Integrate Soda with a data catalog to see data quality results from within the catalog:


Need help? Join the Soda community on Slack.

    Errors with valid format

    Problem: You have written a check using an invalid_count or invalid_percent metric and used a valid format config key to specify the values that qualify as valid, but Soda errors on scan.

    Solution: The valid format configuration key only works with data type TEXT. See Specify valid format.

    See also: Tips and best practices for SodaCL

    Errors with missing checks

Problem: You have implemented a missing_count check on a Redshift dataset and it properly detects NULL values, but when you apply the same check to an Athena dataset, the check does not detect the missing values.

    Solution: In some data sources, rather than detecting NULL values, Soda ought to look for empty strings. Configure your missing check to explicitly check for empty strings as in the example below.
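A sketch of such a configuration, assuming a hypothetical column named last_name:

```yaml
checks for dim_customer:
  - missing_count(last_name) = 0:
      missing values: ['']   # treat empty strings as missing, in addition to NULL
```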

    Soda does not recognize variables

    Problem: You execute a programmatic scan using Soda Library, but Soda does not seem to recognize the variables you included in the programmatic scan.

    Solution: Be sure to include any variables in your programmatic scan before the check YAML file identification. Refer to a basic programmatic scan for an example.

    Missing check results in Soda Cloud

    Problem: You wrote one or more checks for a dataset and the scan produced check results for the check as expected. Then, you adjusted the check -- for example, to apply a different threshold value, as in the example below -- and ran another scan. The latest scan appears in the check results, but the previous check result seems to have disappeared or been archived.

    Solution: Soda Cloud archives check results if they have been removed, by deletion or alteration, from the check file. If two scans run using the same checks YAML file, but an alteration or deletion of the checks in the file took place between scans (such as adjusting the threshold in the example above), Soda Cloud automatically archives the check results of any check that appeared in the file for the first scan, but does not exist in the same checks YAML file during the second scan.

    Note that this behavior does not apply to changing values that use an in-check variable, as in the example below.

    To force Soda Cloud to retain the check results of previous scans, you can use one of the following options:

    • Write individual checks and keep them static between scan executions.

    • Add the same check to different checks YAML files, then execute the scan command to include two separate checks YAML files.

    • Add a check identity parameter to the check so that Soda Cloud can accurately correlate new measurements from scan results to the same check, thus maintaining the history of check results.
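For the last option, a check identity can be added as in this sketch; the identity value itself is arbitrary, as long as it stays stable between scans:

```yaml
checks for dim_customer:
  - row_count > 0:
      identity: customer-row-count
```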

    Metrics were not computed for check

    Problem, variation 1: You have written a check using the exact syntax provided in SodaCL documentation but when you run a scan, Soda produces an error that reads something like, Metrics 'schema' were not computed for check 'schema'.

    Problem, variation 2: You can run scans successfully on some datasets but one or two of them always produce errors when trying to execute checks.

    Solution: In your checks YAML file, you cannot use a dataset identifier that includes a schema, such as soda.test_table. You can only use a dataset name as an identifier, such as test_table.

    However, if you were including the schema in the dataset identifier in an attempt to run the same set of checks against multiple environments, you can do so using the instructions to Configure a single scan to run in multiple environments in the Run a scan tab.

    See also: Add a check identity

    Errors with freshness checks

Problem: When you run a scan to execute a freshness check, the CLI returns one of the following error messages.

Solution: The error indicates that you are using an incorrect comparison symbol. Remember that freshness checks can only use < in a check, unless the freshness check employs an alert configuration, in which case it can only use >.
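For reference, a basic freshness check uses the < symbol, as in this sketch; the column name is hypothetical:

```yaml
checks for dim_product:
  - freshness(start_date) < 1d
```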

Problem: When you run a scan to execute a freshness check that uses a NOW variable, the CLI returns an Invalid check error message.

    Solution: Until the known issue is resolved, use a deprecated syntax for freshness checks using a NOW variable, and ignore the deprecated syntax message in the output. For example, define a check as per the following.

    Checks not evaluated

Problem: You have written a check that has accurate syntax but which returns scan results that include a [NOT EVALUATED] message like the following:

    Solution: The cause of the issue may be one of the following:

• Where a check returns None, it means there are no results or the value is 0, which Soda cannot evaluate. In the example above, the check involved calculating a sum which resulted in a value of 0 which, consequently, translates as [NOT EVALUATED] by Soda.

    • For a change-over-time check, if the previous measurement value is 0 and the new value is 0, Soda calculates the relative change as 0%. However, if the previous measurement value is 0 and the new value is not 0, then Soda indicates the check as [NOT EVALUATED] because the calculation is a division by zero.

• If your check involves a threshold that compares relative values, such as a change-over-time threshold, an anomaly detection check, or a schema evolution check, Soda needs a value for a previous measurement before it can make a comparison. In other words, if you are executing these checks for the first time, there is no previous measurement value against which Soda can compare, so it returns a check result of [NOT EVALUATED]. Soda begins evaluating schema check results after the first scan; anomaly detection after four scans of regular frequency.

    Filter not passed with reference check

Problem: When trying to run a Soda Library reference check against a partitioned dataset in combination with a dataset filter, Soda does not pass the filter, which results in an execution error.

    Solution: Where both datasets in a reference check have the same name, the dataset filter cannot build a valid query because it does not know to which dataset to apply the filter.

    For example, this reference check compares values of columns in datasets with the same name, customers_c8d90f60. In this case, Soda does not know which ts column to use to apply the WHERE clause because the column is present in both datasets. Thus, it produces an error.

As a workaround, you can create a separate dataset filter for such a reference check and prefix the column name with either SOURCE. or TARGET. to identify to Soda the column to which it should apply the filter.

In a separate filter in the example below, the ts column uses the prefix SOURCE. to specify that Soda ought to apply the dataset filter to the source of the comparison and not the target.
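A sketch of such a dataset filter; the filter name and the timestamp value are illustrative, and the exact filter syntax should be confirmed against the dataset filter documentation:

```yaml
filter customers_c8d90f60 [daily]:
  where: SOURCE.ts > TIMESTAMP '2022-01-01'
```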

    Failed row check with CTE error

Problem: Running a scan with a failed rows check produces an error that reads YAML syntax error while parsing a block mapping.

    Solution: If you are using a failed row check with a CTE fail condition, the syntax checker does not accept an expression that begins with double-quotes. In that case, as a workaround, add a meaningless true and to the beginning of the CTE, as in the following example.
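For example, assuming a hypothetical column last_name, the leading true and lets the condition begin with something other than double quotes:

```yaml
checks for dim_customer:
  - failed rows:
      fail condition: true and "last_name" IS NULL
```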

    Errors when column names contain periods or colons

    Problem: A check you've written executes against a column with a name that includes a period or colon, and scans produce an error.

    Solution: Column names that contain colons or periods can interfere with SodaCL’s YAML-based syntax. For any column names that contain these punctuation marks, apply quotes to the column name in the check to prevent issues.
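A sketch with a hypothetical column name containing a period:

```yaml
checks for dim_customer:
  - missing_count("customer.id") = 0
```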

    Errors when using in-check filters

    Problem: When preparing an in-check filter using quotes for the column names, the Soda scan produces an error.

    Solution: The quotes are the cause of the problem; they produce invalid YAML syntax which results in an error message. Instead, write the check without the quotes or, if the quotes are mandatory for the filter to work, prepare the filter in a text block as in the following example.
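A sketch of a filter written as a text block; the dataset, column, and filter values are hypothetical:

```yaml
checks for dim_product:
  - missing_count(size) = 0:
      filter: |
        "category" = 'shoes'
```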

    Using reference checks with Spark DataFrames

    If you are using reference checks with a Spark or Databricks data source to validate the existence of values in two datasets within the same schema, you must first convert your DataFrames into temp views to add them to the Spark session, as in the following example.

    Single quotes in valid values list result in error

Problem: Using an invalid_count check, the list of valid_values includes a value with a single quote, such as Tuesday's orders. During scanning, the check results in an error because it does not recognize the special character.

    Solution: When using single-quoted strings, any single quote ' inside its contents must be doubled to escape it. For example, Tuesday''s orders.
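A sketch of the escaped value in a check; the dataset and column names are hypothetical:

```yaml
checks for orders:
  - invalid_count(batch_label) = 0:
      valid values: ['Tuesday''s orders']   # '' escapes the single quote
```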

Databricks issue with column names that begin with a number

    Problem: When running scans on Databricks, Soda encounters an error on columns that begin with a number.

    Solution: In Databricks, when dealing with column names that start with numbers or contain special characters such as spaces, you typically need to use backticks to enclose the column identifier. This is because Databricks uses a SQL dialect that is similar to Hive SQL, which supports backticks for escaping identifiers. For example:
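A sketch using a failed rows check; the column name 2nd_quarter_sales is hypothetical:

```yaml
checks for dim_product:
  - failed rows:
      fail query: |
        SELECT * FROM dim_product WHERE `2nd_quarter_sales` IS NULL
```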

    Go further

    SodaCL reference docs

Need help? Join the Soda community on Slack.

    Define group evolution checks

In the context of SodaCL check types, group evolution checks are unique. These checks always employ a custom SQL query and an alert configuration – specifying warn and/or fail alert conditions – with validation keys. Refer to Add alert configurations for exhaustive alert configuration details.

    The validation key:value pairs in group evolution checks set the conditions for a warn or a fail check result. See a List of validation keys below.

For example, the following check uses a group by configuration to execute a check on a dataset and return check results in groups. In a group evolution check, the when required group missing validation key confirms that specific groups are present in a dataset; if any of the groups in the list are absent, the check result is warn.

    In the example above, the values for the validation key are in a nested list format, but you can use an inline list of comma-separated values inside square brackets instead. The following example yields identical checks results to the example above.
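A sketch using the inline list style; the dataset, column, and group values are illustrative:

```yaml
checks for fact_internet_sales:
  - group evolution:
      name: Sales territory groups
      query: |
        SELECT sales_territory_key
        FROM fact_internet_sales
        GROUP BY sales_territory_key
      warn:
        when required group missing: [1, 2]
```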

    You can define a group evolution check with both warn and fail alert conditions, each with multiple validation keys. Refer to Configure multiple alerts for details. Be aware, however, that a single group evolution check only ever produces a single check result. See Expect one check result below for details.

    The following example is a single check; Soda executes each of its validations during a scan and returns a single result for the check: pass, warn, or fail.

    Define group changes

    Rather than specifying exact parameters for group changes, you can use the when groups change validation key to warn or fail when indistinct changes occur in a dataset.

    Soda Cloud must have at least two measurements to yield a check result for group changes. In other words, the first time you run a scan to execute a group evolution check, Soda does not evaluate the check because it has nothing against which to compare; the second scan that executes the check yields a check result.
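A sketch of a group-changes validation; names and values are illustrative:

```yaml
checks for fact_internet_sales:
  - group evolution:
      name: Group changes
      query: |
        SELECT sales_territory_key
        FROM fact_internet_sales
        GROUP BY sales_territory_key
      fail:
        when groups change: any
```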

    Optional check configurations

    Supported
    Configuration
    Documentation

    ✓

Define a name for a group evolution check; see Customize check names.

    ✓

    Add an identity to a check.

    ✓

Define alert configurations to specify warn and fail alert conditions; see Add alert configurations.

    Apply an in-check filter to return results for a specific portion of the data in your dataset.

    Example with check name

    Example with alert configuration

    Be aware that Soda only ever returns a single check result per check. See Expect one check result for details.

    Example with quotes

    Example with wildcards

    You can use * or % as wildcard characters in a list of column names. If the column name begins with a wildcard character, add single quotes as per the example below.

    List of validation keys

    Validation key
    Values

    when required group missing

    one or more group names in an inline list of comma-separated values, or a nested list

    when forbidden group present

    one or more group names in an inline list of comma-separated values, or a nested list

    when groups change

any as an inline value; group add as a nested list item; group delete as a nested list item

    Expect one check result

    Be aware that a check that contains one or more alert configurations only ever yields a single check result; one check yields one check result. If your check triggers both a warn and a fail, the check result only displays the more severe, failed check result. (Schema checks behave slightly differently; see Schema checks.)
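A minimal sketch of a check with both alert levels; the thresholds are illustrative:

```yaml
checks for dim_product:
  - row_count:
      warn: when < 100
      fail: when < 10
```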

Using the following example, Soda Library, during a scan, discovers that the data in the dataset triggers both alerts, but the check result is still Only 1 warning. Nonetheless, the results in the CLI still display both alerts as having triggered a [WARNED] state.

The data in the example below triggers both warn alerts and the fail alert, but the check only returns a single check result, the more severe Oops! 1 failures.

    Go further

    • Use a group by configuration to categorize your check results into groups.

    • Learn more about alert configurations.

    • Learn more about SodaCL metrics and checks in general.

    • Reference tips and best practices for SodaCL.


Need help? Join the Soda community on Slack.

• Use the Reporting API to access information about checks according to their attributes.

  • Define rules to route alert notifications according to check attributes.

    Prerequisites

    • To define new check attributes, you must have the permission to do so in your Soda Cloud account. Any Soda Cloud user or Soda Library user can apply existing attributes to new or existing checks.

    Define a check attribute

    Note that you can only define or edit check attributes as a user with permission to do so in Soda Cloud. You cannot define new attributes in Soda Library. Once defined in Soda Cloud, any Soda Cloud or Soda Library user can apply the attribute to new or existing checks.

    1. In your Soda Cloud account, navigate to your avatar > Attributes > New Attribute.

    2. Follow the guided steps to create the new attribute. Use the details below for insight into the values to enter in the fields in the guided steps.

    Field or Label
    Guidance

    Label

    Enter the key for the key:value pair that makes up the attribute. In the example above, the check attribute's key is department and the value is marketing. Note that though the Label you enter may contain spaces or uppercase characters, users must use the attribute's NAME as the key, not the Label, because Soda Cloud automatically formats the Label into SodaCL-friendly syntax. Refer to the screenshot in the .

    Resource Type

    Select Check to define an attribute for a check.

    Type

    Define the type of input a check author may use for the value that pairs with the attribute's key:
    - Single select
    - Multi select
    - Checkbox
    - Text
    - Number
    - Date

    Allowed Values

    Applies only to Single select and Multi select. Provide a list of values that a check author may use when applying the attribute key:value pair to a check.

    Description

    (Optional) Provide details about the check attribute to offer guidance for your fellow Soda users.

    Adjust attributes

    • Once created, you cannot change the type of your attribute. For example, you cannot change a checkbox attribute into a multi-select attribute.

    • Once created, you can change the display name of an attribute.

    • For a single- or multi-select attribute, you can remove, change, or add values to the list of available selections. However, if you remove or change values on such a list, you cannot use a previous value to route alert notifications.

    Apply an attribute to one or more checks

    While only a Soda Cloud Admin can define or revise check attributes, any user with permission to define or change checks on a dataset can apply attributes to new or existing checks when:

    • writing or editing checks in an agreement in Soda Cloud

    • creating or editing no-code checks in Soda Cloud

    • writing or editing checks in a checks YAML file for Soda Library

    Apply attributes to checks using key:value pairs, as in the following example which applies five Soda Cloud-created attributes to a new row_count check.
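    The SodaCL syntax for applying attributes is an attributes block nested under a check. The following is a sketch only: the five attribute keys shown here are hypothetical and must match attributes that a Soda Cloud Admin has already defined in your account.

```yaml
checks for dim_product:
  - row_count > 0:
      attributes:
        # Single select: value must match one of the Allowed Values
        department: Marketing
        # Multi select: wrap the values in square brackets
        tags: [event_data, priority]
        # Checkbox: true or false
        pii: false
        # Text: any string
        team_contact: data-platform
        # Date: ISO-formatted date or datetime
        review_date: 2025-01-31
```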

    Optionally, you can add attributes to all the checks for a dataset. Using the following example configuration, Soda applies the check attributes to the duplicate_count and missing_percent checks for the dim_product dataset. Note that if you specify a different attribute value for an individual check than is defined in the configurations for block, Soda obeys the individual check's attribute instructions.

    During a scan, Soda validates the attribute's input—NAME (the key in the key:value pair), Type, Allowed Values—to ensure that the key:value pairs match the expected input. If the input is unexpected, Soda evaluates no checks, and the scan results in an error. For example, if your attribute's type is Number and the check author enters a value of one instead of 1, the scan produces an error to indicate the incorrect attribute value.

    The following table outlines the expected values for each type of attribute.

    Attribute type (key)
    Attribute value

    Single select

    Any value that exactly matches the Allowed Values for the attribute as defined by the Soda Admin who created the attribute. Values are case sensitive. Refer to the example above, in which the department attribute is a Single select attribute.

    Multi select

    Any value(s) that exactly match the Allowed Values for the attribute as defined by the Soda Admin who created the attribute. Values are case sensitive. You must wrap input in square brackets, which indicates a list, when adding a Multi select attribute key:value pair to a check. Refer to the example above, in which the tags attribute is a Multi select attribute.

    Checkbox

    true or false

    Text

    string

    Number

    integer or float

    Date

    ISO-formatted date or datetime.

    Note that users must use the attribute's NAME as the attribute's key in a check, not the LABEL as defined by a Soda Admin in Soda Cloud. Refer to the screenshot below.

    Optional check attribute SodaCL configurations

    Using SodaCL, you can use variables to populate either the key or value of an existing attribute, as in the following example. Refer to Configure variables in SodaCL for further details.
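    A minimal sketch of using a variable for an attribute value, assuming a Soda Cloud-defined department attribute; pass the value at scan time with the -v option:

```yaml
checks for dim_product:
  - row_count > 0:
      attributes:
        department: ${DEPARTMENT}
```

    For example: soda scan -d my_datasource -c configuration.yml -v DEPARTMENT=Marketing checks.yml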

    You cannot use variables in checks you write in an agreement in Soda Cloud as it is impossible to provide the variable values at scan time.

    You can use attributes in checks that Soda executes as part of a for each configuration, as in the following example. Refer to Optional check configuration for further details on for each.

    Go further

    • Add attributes to datasets to get organized in Soda Cloud.

    • Add Optional check configurations.



    Test data quality before migration

    Use this guide to set up Soda to check data quality before and after migrating data between data sources.

    Use this guide to install and set up Soda to test data quality in a data migration project. Test data quality at both source and target, both before and after migration, to prevent data quality issues from polluting a new data source.

    Not quite ready for this big gulp of Soda? 🥤Try taking a sip, first.

    About this guide

    The instructions below offer Data Engineers an example of how to set up Soda and use reconciliation checks to compare data quality between data sources before and after migrating data.

    For context, this guide presents an example of how you could use Soda to prepare to migrate data from one data source, such as PostgreSQL, to another, such as Snowflake. It makes suggestions about how to prepare for a data migration project and use a staging environment to validate data quality before migrating data in production.

    This example uses a self-operated deployment model which uses Soda Library and Soda Cloud, though you could as easily use a self-hosted agent model (Soda Agent and Soda Cloud) instead.

    Prepare for data migration

    This example imagines moving data from PostgreSQL to Snowflake. The following outlines the high level steps involved in preparing for and executing such a project.

    1. Confirm your access to the source data in a PostgreSQL data source; you have the authorization and access credentials to query the data.

    2. Set up or confirm that you have a Snowflake account and the authorization and credentials to set up and query a new data source.

    3. Confirm that you have a data orchestration tool such as Airflow to extract data from PostgreSQL, perform any transformations, then load the data into Snowflake. Reference for an Airflow setup example.

    4. to perform preliminary tests for data quality in the source data. Use this opportunity to make sure that the quality of the data you are about to migrate is in a good state. Ideally, you perform this step in a production environment, before replicating the source data source in a staging environment to ensure that you begin the project with good-quality data.

    Install and set up Soda

    What follows is an abridged version of installing and configuring Soda for PostgreSQL. Refer to for details.

    1. In a browser, navigate to to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.

    2. Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys. Copy+paste the API key values to a temporary, secure place in your local environment.

    3. With Python 3.8, 3.9, or 3.10 and Pip 21.0 or greater, use the command-line to install Soda locally in a new virtual environment.
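    As a sketch, the installation commands might look like the following; this assumes the soda-postgres package from Soda's package index, so check the install documentation for the package that matches your data source and Soda version.

```shell
# Create and activate a new virtual environment
python -m venv .venv
source .venv/bin/activate
# Install the Soda Library package for PostgreSQL
pip install -i https://pypi.cloud.soda.io soda-postgres
```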

    4. In a code editor, create a new file called configuration.yml, then copy+paste the following config details into the file. Provide your own values for the fields, using the API key and secret values you created in Soda Cloud. Replace the value of my_database_name with the name of your PostgreSQL data source.
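    A minimal configuration.yml sketch for a PostgreSQL data source with a Soda Cloud connection follows; all host, credential, and name values are placeholders.

```yaml
data_source my_database_name:
  type: postgres
  host: localhost
  port: 5432
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public

soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
```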

    5. Save the file. From the command-line, in the same directory in which you created the configuration.yml, run the following command to test Soda's connection to your data source. Replace the value of my_datasource with the name of your own PostgreSQL data source.
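    The connection test command looks like this:

```shell
soda test-connection -d my_datasource -c configuration.yml
```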

    6. To create some basic checks for data quality, run the following command to launch Check Suggestions, which auto-generates checks using the Soda Checks Language (SodaCL), a domain-specific language for data quality testing.

    • Identify one dataset in your data source to use as the value for the -ds option in the command below.

    • Replace the value of my_datasource with the name of your own PostgreSQL data source.

    • Answer the prompts in the command-line and, at the end, select y to run a scan using the suggested checks.
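    The check suggestions command from the steps above might look like this sketch, assuming a dataset named dim_customer:

```shell
soda suggest -d my_datasource -c configuration.yml -ds dim_customer
```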

    7. In a browser, log in to your Soda Cloud account, then navigate to the Checks dashboard. Here, you can review the results of the checks that Soda executed in the first scan for data quality. After a scan, each check results in one of three default states:

    • pass: the values in the dataset match or fall within the thresholds you specified

    • fail: the values in the dataset do not match or fall within the thresholds you specified

    • error: the syntax of the check is invalid, or there are runtime or credential errors

    8. Based on the check results from the first scan, address any data quality issues that Soda surfaced so that your data migration project begins with good-quality data. Refer to for much more detail.

    9. If you wish, open the checks.yml that the check suggestions command saved locally for you and add more checks for data quality, then use the following command to run the scan again. Refer to for exhaustive details on all types of checks.
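    The scan command:

```shell
soda scan -d my_datasource -c configuration.yml checks.yml
```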

    Migrate data in staging

    1. Having tested data quality on the PostgreSQL data source, best practice dictates that you back up the existing data in the PostgreSQL data source, then replicate both the PostgreSQL and an empty Snowflake data source in a staging environment.

    2. As in the example that follows, add two more configurations to your configuration.yml for:

    • the PostgreSQL staging data source

    • the Snowflake staging data source
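    A sketch of the two additional data source configurations; the staging host names and credentials are placeholders, and the exact Snowflake connection fields may vary by Soda version, so refer to the data source reference for details.

```yaml
data_source postgres_staging:
  type: postgres
  host: postgres-staging.example.com
  port: 5432
  username: ${STAGING_PG_USER}
  password: ${STAGING_PG_PASSWORD}
  database: postgres
  schema: public

data_source snowflake_staging:
  type: snowflake
  username: ${STAGING_SF_USER}
  password: ${STAGING_SF_PASSWORD}
  account: myaccount.eu-central-1
  database: staging
  warehouse: compute_wh
  schema: public
```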

    3. Run the following commands to test the connection to each new data source in the staging environment.
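    For example:

```shell
soda test-connection -d postgres_staging -c configuration.yml
soda test-connection -d snowflake_staging -c configuration.yml
```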

    4. Using an orchestrator such as Airflow, migrate your data in the staging environment from PostgreSQL to Snowflake, making any necessary transformations to your data to populate the new data source. Reference for an Airflow setup example.

    Reconcile data and migrate in production

    1. With both source and target data sources, you can use to compare the data in the target to the source to ensure that it is expected and free of data quality issues. Begin by using a code editor to prepare a recon.yml file in the same directory as you installed Soda, as per the following example which identifies the source and target datasets to compare, and defines basic checks to compare schemas and row counts.
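    The recon.yml sketch below identifies source and target datasets and defines schema and row count comparisons. The dataset and data source names are placeholders, and the exact reconciliation syntax may vary by Soda Library version, so refer to the reconciliation checks reference.

```yaml
reconciliation dim_customer:
  label: "Staging migration: PostgreSQL to Snowflake"
  datasets:
    source:
      dataset: dim_customer
      datasource: postgres_staging
    target:
      dataset: dim_customer
      datasource: snowflake_staging
  checks:
    # Compare schemas between source and target
    - schema
    # Fail if the row counts differ
    - row_count diff = 0
```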

    2. Referencing the checks that checks suggestions created, add corresponding to the file to surface any delta between the metrics Soda measures for the source and the measurements it collects for the target. Refer to the that are available as reconciliation checks. Examples of checks.yml and recon.yml files follow.

    3. Based on the scan results, make adjustments to the transformations in your orchestrated flow and repeat the scans, adding more metric reconciliation checks as needed.

    4. Compare more source and target datasets by adding more reconciliation blocks to the recon.yml file. Tip: You can run check suggestions against new datasets and use those checks as a baseline for writing metric reconciliation checks for other datasets in your data source.

    5. After reconciling metrics between multiple datasets, consider writing more granular for the most critical data, as in the example below. As these checks execute a row-by-row comparison of data in a dataset, they are resource-heavy relative to metric and schema reconciliation checks. However, for the datasets that matter most, the resource usage is warranted to ensure that the data you migrate remains intact and as expected in the target data source.

    6. After reviewing multiple scan results and correcting any reconciliation issues between source and target datasets, you can execute the migration in production. After the migration, use the same recon.yml file to run a scan on the migrated data in production to confirm that the data in the target is as expected. Adjust the soda scan command to run against your production data source instead of the staging data source.

    7. (Optional) If you intend to execute the migration of data between data sources frequently, you may wish to invoke a Soda scan with the reconciliation checks programmatically within your pipeline orchestration, such as in your Airflow DAG. To access an example of how to include Soda scans in your DAG, see .

    Go further

    • Learn more about in general.

    • Write reconciliation checks that produce in Soda Cloud to help you investigate the root cause of data quality issues.

    Custom check examples

    If the built-in metrics that SodaCL offers do not quite cover your more specific or complex needs, you can define your own metrics. See examples to copy+paste.

    Out of the box, Soda Checks Language (SodaCL) makes several built-in metrics and checks, such as row_count, available for you to use to define checks for data quality. If the built-in metrics that Soda offers do not quite cover some of your more specific or complex needs, you can use user-defined and failed rows checks.

    User-defined checks and failed rows checks enable you to define your own metrics that you can use in a SodaCL check. You can also use these checks to simply define SQL queries or Common Table Expressions (CTE) that Soda executes during a scan, which is what most of these examples do.

    The sections below offer examples of how you can define user-defined checks to extract more complex, customized, business-specific measurements from your data: in your checks YAML file if you use Soda Library, or within a no-code SQL Failed Rows check or an agreement if you use Soda Cloud.
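    As a skeletal sketch, a failed rows check with a custom SQL query in a checks YAML file takes the following shape; the dataset, check name, and query here are hypothetical.

```yaml
checks for dim_product:
  - failed rows:
      name: Price must not be below cost
      fail query: |
        SELECT *
        FROM dim_product
        WHERE list_price < standard_cost
```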


    Set an acceptable threshold for row count delta

    Though you can use a built-in to compare row counts between datasets in the same, or different, data sources, you may wish to add a little more complexity to the comparison, as in the following example. Replace the values in the double curly braces {{ }} with your own relevant values.

    If you want to compare row counts between two datasets and allow for some acceptable difference between counts, use the following query.

    ✅ Amazon Redshift   ✅ GCP BigQuery   ✅ PostgreSQL   ✅ Snowflake
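    A sketch of such a query, consistent with the explanation that follows; replace the {{ }} placeholders with your own table names and acceptable delta.

```sql
-- Store the row count of each dataset in an intermediate table
WITH table_counts AS (
  SELECT
    (SELECT COUNT(*) FROM {{ table_1 }}) AS count_1,
    (SELECT COUNT(*) FROM {{ table_2 }}) AS count_2
)
-- Return a failed row only when the delta exceeds the threshold
SELECT
  count_1,
  count_2,
  ABS(count_1 - count_2) AS row_count_delta
FROM table_counts
WHERE ABS(count_1 - count_2) > {{ acceptable_delta }}
```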

    Explain the SQL
    • First, the query counts the rows in each of two datasets.

    • Next, it defines an intermediate table to store the temporary row count values for each table so it can use those values in a calculation.

    Find duplicates in a dataset without a unique ID column

    You can use the built-in to check the contents of a column for duplicate values and Soda automatically sends any failed rows – that is, rows containing duplicate values – to Soda Cloud for you to .

    However, if your dataset does not contain a unique ID column, as with a denormalized dataset or a dataset produced from several joins, you may need to define uniqueness using a combination of columns. This example uses a failed rows check with SQL queries to go beyond a simple, single-column check. Replace the values in the double curly braces {{ }} with your own relevant values.

    Ideally, you would generate a from the concatenation of columns as part of a transformation, such as with this dbt Core™ utility that . However, if that is not possible, you can use the following example to test for uniqueness using a .

    ✅ Amazon Redshift   ✅ GCP BigQuery   ✅ PostgreSQL   ✅ Snowflake
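    A sketch of the failed rows query, consistent with the explanation that follows; it asserts uniqueness on a combination of two columns, and you can extend it with more.

```sql
-- List every combination of values that appears more than once
WITH duplicated_records AS (
  SELECT {{ column_a }}, {{ column_b }}
  FROM {{ table }}
  GROUP BY {{ column_a }}, {{ column_b }}
  HAVING COUNT(*) > 1
)
-- Join back to the dataset to return the full failed rows
SELECT t.*
FROM {{ table }} AS t
JOIN duplicated_records AS d
  ON  t.{{ column_a }} = d.{{ column_a }}
  AND t.{{ column_b }} = d.{{ column_b }}
```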

    Explain the SQL
    • First, the duplicated_records lists all of the IDs that appear more than once in a dataset, allowing for a pattern that asserts uniqueness using more than one column. The example uses two columns but you can add as many as you need. If you add more, be sure to add them to the join at the end of the query.

    • Next, it joins the duplicated_records back to the dataset itself so that it can identify and send the failed rows for those duplicate IDs to Soda Cloud.

    Validate business logic at the row level

    Use one of the following examples to validate that data in records in your data source match your expectations.

    The first example is a skeletal query into which you can insert a variety of conditions; the others offer examples of how you might use the query. Replace the values in the double curly braces {{ }} with your own relevant values.

    ✅ Amazon Redshift   ✅ GCP BigQuery   ✅ PostgreSQL   ✅ Snowflake
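    The skeletal query might look like the following sketch; insert your own conditions in place of the {{ condition }} placeholder.

```sql
-- Identify the dataset in which to validate records
WITH records_to_validate AS (
  SELECT * FROM {{ table }}
)
-- Return the rows that do not meet the condition
SELECT *
FROM records_to_validate
WHERE NOT ( {{ condition }} )
```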

    Explain the SQL

    The CTE identifies a dataset in which to find records that do not meet the conditions you set in the not expression.

    Check the sum of column values

    ✅ Amazon Redshift   ✅ GCP BigQuery   ✅ PostgreSQL   ✅ Snowflake
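    A sketch of the query, consistent with the explanation that follows; the three column placeholders are yours to replace.

```sql
WITH records_to_validate AS (
  SELECT * FROM {{ table }}
)
-- Return the rows where the two columns do not sum to the third
SELECT *
FROM records_to_validate
WHERE NOT ( {{ column_a }} + {{ column_b }} = {{ column_total }} )
```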

    Explain the SQL

    The CTE validates that the sum of two columns in a dataset matches the value in a third column, and identifies those rows which do not match.

    Confirm Paid in Full

    ✅ Amazon Redshift   ✅ GCP BigQuery   ✅ PostgreSQL   ✅ Snowflake

    Explain the SQL

    The CTE validates that an order that is being paid for in installments will be fully paid by its deadline, and identifies those rows which do not meet the deadline.

    Check for incorrectly mapped values across columns

    Where a dataset does not validate its contents on entry, you may wish to assert that entries map correctly to standard values. For example, where end users enter a free-form value for a country field, you can use a SQL query to confirm that the entry maps correctly to an ISO country code, as in the following table.

    country_name
    country_code

    Use one of the following data source-specific custom metric examples in your checks YAML file. Replace the values in the double curly braces {{ }} with your own relevant values.

    ✅ Amazon Redshift   ✅ PostgreSQL

    Explain the SQL
    • The first query counts the number of rows in which the values in either column are distinct relative to the other column contents, and displays the full contents of the failed rows that contain distinct values.

    • The second query is the same as the first, but displays only the distinct values that appear in either column.

    ✅ GCP BigQuery  

    Explain the SQL
    • The first query counts the number of rows in which the values in either column are distinct relative to the other column contents, and displays the full contents of the failed rows that contain distinct values.

    • The second query is the same as the first, but displays only the distinct values that appear in either column.

    ✅ Snowflake

    Explain the SQL
    • The first query counts the number of rows in which the values in either column are distinct relative to the other column contents, and displays the full contents of the failed rows that contain distinct values.

    • The second query is the same as the first, but displays only the distinct values that appear in either column.

    Compare dates to validate event sequence

    You can use a user-defined metric to compare date values in the same dataset. For example, you may wish to compare the value of start_date to end_date to confirm that an event does not end before it starts, as in the second line, below.

    ✅ GCP BigQuery  
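    A generic sketch of such a user-defined metric, not specific to BigQuery; the events dataset and the start_date and end_date columns are hypothetical.

```yaml
checks for events:
  - invalid_event_sequence = 0:
      invalid_event_sequence query: |
        SELECT COUNT(*)
        FROM events
        WHERE end_date < start_date
```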

    Go further

    • Learn more about .

    • Read more about in Soda Cloud.

    Integrate Soda with SSO

    Organizations that use a Security Assertion Markup Language (SAML) 2.0 single sign-on (SSO) identity provider can add Soda Cloud as a service provider.

    Organizations that use a Security Assertion Markup Language (SAML) 2.0 single sign-on (SSO) identity provider (IdP) can add Soda Cloud as a service provider.

    Once added, employees of the organization can gain authorized and authenticated access to the organization's Soda Cloud account by successfully logging in to their SSO. This solution not only simplifies a secure login experience for users, but also enables IT Admins to:

    • grant their internal users' access to Soda Cloud from within their existing SSO solution

    • revoke their internal users' access to Soda Cloud from within their existing SSO solution if a user leaves their organization or no longer requires access to Soda Cloud

    • set up one-way user group syncing from their IdP into Soda Cloud (tested and documented for Azure Active Directory and Okta)

    Compatibility

    Soda Cloud is able to act as a service provider for any SAML 2.0 SSO identity provider. In particular, Soda has tested and written instructions for setting up SSO access with the following identity providers:

    Soda has tested and confirmed that SSO setup works with the following identity providers:

    • OneLogin

    • Auth0

    • Patronus

    SSO access to Soda Cloud

    When an employee uses their SSO provider to access Soda Cloud for the first time, Soda Cloud automatically assigns the new user to roles and groups according to the for any new users. Soda Cloud also notifies the Soda Cloud Admin that a new user has joined the organization, and the new user receives a message indicating that their Soda Cloud Admin was notified of their first login. A Soda Cloud Admin or user with the permission to do so can adjust users’ roles in Organization Settings. See for details.

    When an organization’s IT Admin revokes a user’s access to Soda Cloud through the SSO provider, a Soda Cloud Admin is responsible for updating the resources and ownerships linked to the User.

    Once your organization enables SSO for all Soda Cloud users, Soda Cloud blocks all non-SSO login attempts and password changes via . If an employee attempts a non-SSO login or attempts to change a password using “Forgot password?” on , Soda Cloud presents a message that explains that they must log in or change their password using their SSO provider.

    Optionally, you can set up the SSO integration with Soda to include a one-way sync of user groups from your IdP into Soda Cloud, which synchronizes with each user login to Soda via SSO.

    Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations. Be sure to indicate which type of SSO your organization uses when setting it up with the Soda Support team.

    Add Soda Cloud to Azure AD

    1. Email to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab. Soda Support sends you the samlUrl that you need to configure the setup with your identity provider.

    2. As a user with sufficient privileges in your organization's Azure AD account, sign in through , then navigate to Enterprise applications. Click New application.

    3. Click Create your own application.

    Add Soda Cloud to Okta

    1. Email to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab. Soda Support sends you the samlURL that you need to configure the setup with your identity provider.

    2. As an Okta Administrator, log in to Okta and navigate Applications > Applications overview, then click Create App Integration. Refer to for full procedure.

    3. Select SAML 2.0.

    Add Soda Cloud to Google Workspace

    1. Email to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab. Soda Support sends you the samlURL that you need to configure the setup with your identity provider.

    2. As an administrator in your Google Workspace, follow the instructions in to Set up your own custom SAML application.

    3. Optionally, upload the so it appears in the app launcher with the logo instead of the first two letters of the app name.

    Sync user groups from an IdP

    If you wish, you can choose to regularly one-way sync the user groups you have defined in your IdP into Soda Cloud.

    This syncs into Soda Cloud the user groups that you have already defined in your IdP, and enables your team to select IdP-managed user groups when assigning ownership or access permissions to a resource, in addition to any user groups you may have created manually in Soda Cloud. See:

    • Soda has tested and documented one-way syncing of user groups with Soda Cloud for Okta and Azure Active Directory. to request tested and documented support for other IdPs.

    • Soda synchronizes user groups with the IdP every time a user in your organization logs in to Soda via SSO. Soda updates the user's group membership according to the IdP user groups to which they belong at each login.

    • You cannot manage IdP user group settings or membership in Soda Cloud. Any changes that you wish to make to IdP-managed user groups must be done in the IdP itself.

    Set up user group sync in Azure AD

    1. In step 10 of the SAML application setup procedure , in the same User Attributes & Claims section of your Soda SAML Application in Azure AD, follow to add a group claim to your Soda SAML Application.

      • For the choice of which groups should be returned in the claim, best practice suggests selecting Groups assigned to the application.

      • For the choice of Source attribute, select Cloud-only group display names.

    Set up user group sync in Okta

    1. In step 7 of the SAML application integration procedure , follow Okta's instructions to .

    • For the Name value, use Group.Authorization.

    • Leave the optional Name Format value as Unspecified.

    • Use the Filter to find a group that you wish to make available in Soda Cloud to manage access and permissions. Exercise caution! A broad filter may include user groups you do not wish to include in the sync. Double-check that the groups you select are appropriate.

    1. Use the Add Another button to add as many groups as you wish to make available in Soda Cloud.

    2. In your message to Soda Support or your Soda Customer Engineer, advise Soda that you wish to enable user group syncing. Soda adds a setting to your SSO configuration to enable it.

    3. When the SSO integration is complete, you and your team can select your IdP user groups from the dropdown list of choices available when assigning ownership or permissions to resources.

    Renew SSO certificate

    To renew an SSO certificate, you need to provide Soda with the new X.509 certificate, with which Soda updates your Soda organization's SSO configuration. Because Soda can only validate SSO against one certificate, there is downtime between the moment you deactivate the old certificate and the moment Soda updates the SSO configuration.

    Depending on your organization's certificate renewal process, you can notify Soda (or arrange for a call) in advance of the specific date and time at which you want to renew, so Soda can prepare for your update and minimize this downtime.

    Go further

    • Learn more about .

    • Learn more about creating and tracking in Soda Cloud.

    Freshness checks

    Use a SodaCL freshness check to infer data freshness according to the age of the most recently added row in a table.

    Use a freshness check to determine the relative age of the data in a column in your dataset.

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✔️ Available as a no-code check

    Define freshness checks

    In the context of SodaCL check types, freshness checks are unique. The check syntax offers little variation, with only a few mutable parts to specify the column name, the threshold, and, optionally, a NOW variable.

    A freshness check has two or three mutable parts:

    The example below defines a check that measures freshness relative to "now", where "now" is the moment you run the scan that executes the freshness check. This example discovers when the last row was added to the start_date timestamp column, then compares that timestamp to "now". If Soda discovers that the last row was added more than three days ago, the check fails.
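    The check described above might read as follows, assuming a dataset named dim_product:

```yaml
checks for dim_product:
  - freshness(start_date) < 3d
```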

    Instead of using the default value for "now" (the time you run the scan that executes the freshness check), you can use a variable to specify the value of "now" at scan time. For example, the following check measures freshness relative to a date that a user specifies at scan time. You cannot use variables in checks you write in an agreement in Soda Cloud as it is impossible to provide the variable values at scan time.

    At scan time, you use a -v option to pass a value for the variable that the check expects for the value of "now". The scan command below passes a variable that the check uses. In your scan command, if you pass a variable with a timestamp, the variable must be in ISO8601 format such as "2022-02-16 21:00:00" or "2022-02-16T21:00:00".
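    As a sketch, assuming the check expects a variable named NOW:

```shell
soda scan -d my_datasource -c configuration.yml -v NOW="2022-02-16 21:00:00" checks.yml
```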

    Known issue:

    When introducing a NOW variable into a freshness check, you must use the deprecated syntax that includes using. This syntax yields an error message in the scan output, Syntax of freshness check has changed and is deprecated. Use freshness(column_name) < 24h30m See docs but does not prevent Soda from executing the check. Workaround: ignore the deprecated syntax message.

    Details and limitations

    • Out-of-the-box, freshness checks only work with columns that contain data types TIMESTAMP or DATE. However, though it does not universally apply to all data sources, you may be able to apply a freshness check to TEXT type data using the following syntax to cast the column:

    • Note that casting a column in a check does not work with a NOW variable.

    • The only comparison symbol you can use with freshness checks is < except when you employ an alert configuration. See for details.

    • The default value for "now" is the time you run the scan that executes the freshness check.

    Troubleshoot errors with freshness checks

    Problem: When you run a scan to execute a freshness check, the CLI returns one of the following error messages.

    Solution: The error indicates that you are using an incorrect comparison symbol. Remember that freshness checks can only use < in a check, unless the freshness check employs an alert configuration, in which case it can only use > in the check.

    Problem: When you run a scan to execute a freshness check that uses a NOW variable, the CLI returns the following Invalid check error message.

    Solution: Until the known issue is resolved, use the deprecated syntax for freshness checks that use a NOW variable, and ignore the deprecated syntax message in the output. For example, define a check as follows.
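    The deprecated syntax uses the using and with keywords, as in the following example.

    ```yaml
    checks for dim_product:
      - freshness using end_date with NOW < 1d
    ```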

    Freshness check results

    When you run a scan that includes a freshness check, the output in the Soda Library CLI provides several values for measurements Soda used to calculate freshness. The value for freshness itself is displayed in days, hours, minutes, seconds, and milliseconds; see the example below.

    In Soda Cloud, the freshness value represents the age of the data in days, hours, minutes, etc., relative to now_timestamp. In other words, (scan time - (max of date_column)).
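    As a minimal sketch of that arithmetic, independent of Soda itself, the freshness value is the scan timestamp minus the most recent value in the date column; the timestamps below are hypothetical.

    ```python
    from datetime import datetime, timedelta

    def freshness(max_column_ts: datetime, now: datetime) -> timedelta:
        """Freshness = scan time minus the max value of the date column."""
        return now - max_column_ts

    # Hypothetical values for illustration
    now_timestamp = datetime(2022, 2, 16, 21, 0, 0)     # scan time
    max_date_column = datetime(2022, 2, 15, 18, 30, 0)  # newest row in the column
    age = freshness(max_date_column, now_timestamp)
    print(age)  # 1 day, 2:30:00
    # A check such as "freshness(column) < 2d" compares this age to the threshold
    print(age < timedelta(days=2))  # True
    ```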

    Optional check configurations

    Supported
    Configuration
    Documentation

    Example with check name

    Example with alert configuration

    The only comparison symbol that you can use with freshness checks that employ an alert configuration is >.

    OR

    Example with in-check filter

    Example with quotes

    Example with for each

    Example with dataset filter

    List of freshness thresholds

    Threshold
    Example
    Reads as

    List of comparison symbols and phrases

    Go further

    • Use missing metrics in checks with alert configurations to establish

    • Use missing metrics in checks to define ranges of acceptable thresholds using .

    • Reference .

    Reference checks

    Use a SodaCL reference check to validate that the values in a column in a table are present in a column in a different table.

    Use a reference check to validate that column contents match between datasets in the same data source.

    See also: Compare data using SodaCL

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    Define reference checks

    In the context of SodaCL check types, reference checks are unique. This check is limited in its syntax variation, with only a few mutable parts to specify column and dataset names.

    The example below checks that the values in the source column, department_group_name, in the dim_department_group dataset exist somewhere in the destination column, department_name, in the dim_employee dataset. If the values are absent in the department_name column, the check fails.
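    Written in SodaCL, that example looks like the following.

    ```yaml
    checks for dim_department_group:
      - values in (department_group_name) must exist in dim_employee (department_name)
    ```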

    • SodaCL considers missing values in the source column to be invalid.

    • Optionally, do not use brackets around column names. The brackets serve as visual aids to improve check readability.

    You can also validate that data in one dataset does not exist in another.
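    Assuming the must not exist variant of the same syntax, such a check might look like this; dataset and column names are hypothetical.

    ```yaml
    checks for dim_department_group:
      - values in (department_group_name) must not exist in dim_employee (department_name)
    ```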

    Reference checks and dataframes

    If you are using reference checks with a Spark or Databricks data source to validate the existence of values in two datasets within the same schema, you must first convert your DataFrames into temp views to add them to the Spark session, as in the following example.

    Failed row samples

    Reference checks automatically collect samples of any failed rows to display in Soda Cloud. The default number of failed row samples that Soda collects and displays is 100.

    If you wish to limit or broaden the sample size, you can use the samples limit configuration in a reference check configuration. You can add this configuration to your checks YAML file for Soda Library, or when writing checks as part of an agreement in Soda Cloud. See: .

    For security, you can add a configuration to your data source connection details to prevent Soda from collecting failed rows samples from specific columns that contain sensitive data. See: .

    Alternatively, you can set the samples limit to 0 to prevent Soda from collecting and sending failed rows samples for an individual check, as in the following example.
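    A sketch of a reference check with sampling disabled; dataset and column names are hypothetical.

    ```yaml
    checks for dim_department_group:
      - values in (department_group_name) must exist in dim_employee (department_name):
          samples limit: 0
    ```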

    You can also add a samples columns or a collect failed rows configuration to a check to specify the columns for which Soda must implicitly collect failed row sample values, as in the following example with the former. Soda only collects this check’s failed row samples for the columns you specify in the list. See: .
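    A sketch with a samples columns list; dataset and column names are hypothetical.

    ```yaml
    checks for dim_department_group:
      - values in (department_group_name) must exist in dim_employee (department_name):
          samples columns: [department_group_name]
    ```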

    Note that the comma-separated list of samples columns does not support wildcard characters (%).

    To review the failed rows in Soda Cloud, navigate to the Checks dashboard, then click the row for a reference check. Examine failed rows in the Failed Rows Analysis tab; see for further details.

    Optional check configurations

    Supported
    Configuration
    Documentation

    Example with check name

    Example with quotes

    Example with dataset filter

    Refer to to address challenges specific to reference checks with dataset filters.

    Go further

    • Problems with reference checks and dataset filters? Refer to .

    • Learn more about in general.

    • Learn more about using SodaCL.

    • Use a to discover missing or forbidden columns in a dataset.

    Not quite ready for this big gulp of Soda? 🥤Try taking a sip, first.


    Upgrade, redeploy, or uninstall Soda

    Learn how to upgrade or uninstall Soda Library, or redeploy a Soda Agent.

    The Soda environment has been updated since this tutorial.

    Refer to for updated tutorials.


    Soda Agent extras

    Learn how to adjust the Soda Agent to fit your security standards by leveraging secrets managers, environment variables, and other controls.

    The Soda environment has been updated since this tutorial.

    Refer to for updated tutorials.

    When you deploy a self-hosted Soda Agent to a Kubernetes cluster in your cloud service provider environment, you need to provide several key parameters and values to ensure optimal operation, to allow the agent to connect to your Soda Cloud account (API keys), and to connect to your data sources (data source login credentials) so that Soda can run data quality scans on the data.
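    As a sketch, the API keys and agent name typically arrive through a Helm values file such as the following; all values are placeholders.

    ```yaml
    soda:
      apikey:
        id: "your-soda-cloud-api-key-id"
        secret: "your-soda-cloud-api-key-secret"
      agent:
        name: "my-soda-agent"
      cloud:
        # Use https://cloud.us.soda.io for US region
        endpoint: "https://cloud.soda.io"
    ```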

    Manage global roles, user groups, and settings

    To manage the actions of users that belong to a single organization, Soda Cloud uses roles and access permissions. Admins can access an Audit Trail of user actions.

    To manage the actions of users that belong to a single organization, Soda Cloud uses roles, groups, and access permissions.

    These roles and groups and their associated permissions enforce limits on the abilities for users to access or make changes to resources, or to make additions and changes to organization settings and default access permissions.

    About roles, groups, and permissions

    Soda Cloud makes use of roles, groups, and permissions to manage user access to functionalities, such as alert notifications, and resources, such as datasets and data sources, in the organization. The following table defines the terminology Soda Cloud uses.

    Schema checks

    Use a SodaCL schema check to validate column presence, absence, or position in a table, or the type of data a column contains.

    Use a schema check to validate the presence, absence, or position of columns in a dataset, or to validate the type of data a column contains.

    Define schema checks

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✔️ Available as a no-code check with a self-hosted Soda Agent connected to any Soda-supported data source, except Spark, and Dask and Pandas OR with a Soda-hosted Agent connected to a BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake data source

    In the context of SodaCL check types, schema checks are unique. Schema checks always employ alert configurations, specifying warn and/or fail alert conditions, with validation keys. Refer to
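    A sketch of a schema check with warn and fail validation keys; the column names are hypothetical.

    ```yaml
    checks for dim_product:
      - schema:
          warn:
            when required column missing: [id, product_name]
          fail:
            when forbidden column present: [credit_card_number]
            when wrong column type:
              id: integer
    ```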

    SodaCL tutorial

    Follow the quick start tutorial to get started with SodaCL, a human-readable, domain-specific language for data reliability.

    If you are staring at a blank page wondering what SodaCL checks to write to surface data quality issues, this quick start tutorial is for you.

    Alternatively, use the assistant in the Soda Library CLI to profile a dataset and auto-generate basic checks for data quality.

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✔️ Some available as no-code checks


    SodaCL: In brief

    Profile data with Soda

    Configure Soda Cloud to profile datasets and their columns so you can leverage the information to write SodaCL checks for data quality.

    When you add or edit a data source in Soda Cloud, use the discover datasets and/or profile columns configurations to automatically profile data in your data source.

    • Examine the profile information to gain insight into the type of SodaCL checks you can prepare to test for data quality.

    • Use profiled data to create no-code data quality checks.
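    The profiling configuration itself uses include and exclude patterns with % wildcards, as in the following example.

    ```yaml
    discover datasets:
      datasets:
        - include %
        - exclude test_%

    profile columns:
      columns:
        - "%.%"  # Includes all your datasets
        - prod%  # Includes all datasets that begin with 'prod'
    ```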

    Integrate Soda with webhooks

    Configure webhooks to connect Soda to any number of third-party service providers.

    Configure a webhook in Soda Cloud to connect your account to a third-party service provider such as Jira, ServiceNow, PagerDuty, and more.

    Use a webhook to:

    • send alert notifications for failed or warning check results to a third-party, such as ServiceNow

    • create and track data quality incidents with a third-party, such as Jira

    • send a notification to a third-party when a user adds, changes, or deletes a Soda agreement

    Filters and variables

    Instead of checking whole sets of data, use filters to specify a portion of data against which to execute a check. Use variables to specify values at scan time.

    Use filters or variables to specify portions of data in your dataset against which Soda executes checks during a scan.

    In-check vs. dataset filters

    The following explanation aims to help you decide when to use an in-check filter, and when to use a dataset filter.

    Use dataset filters to create one or more partitions of data, commonly time partitions, upon which you want to execute large volumes of checks.

    Instead of executing a great many checks on all the data in a dataset, you can specify a smaller portion of data against which to execute all the checks. Doing so helps avoid having to repeatedly apply the same filter to many checks, and it produces a
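    As a sketch of the pattern, a dataset filter partitions the data and checks reference the filter by name; the dataset and column names below are hypothetical.

    ```yaml
    filter customers [daily]:
      where: ts > TIMESTAMP '${NOW}' - interval '1d'

    checks for customers [daily]:
      - duplicate_count(cat) < 10
      - row_count > 10
    ```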

    from soda.scan import Scan
    
    scan = Scan()
    scan.set_data_source_name("events")
    
    # Add configuration YAML files
    #########################
    # Choose one of the following to specify data source connection configurations :
    # 1) From a file
    scan.add_configuration_yaml_file(file_path="~/.soda/my_local_soda_environment.yml")
    # 2) Inline in the code
    # For host, use cloud.soda.io for EU region; use cloud.us.soda.io for US region
    scan.add_configuration_yaml_str(
        """
        data_source events:
          type: snowflake
          host: ${SNOWFLAKE_HOST}
          username: ${SNOWFLAKE_USERNAME}
          password: ${SNOWFLAKE_PASSWORD}
          database: events
          schema: public
    
        soda_cloud:
          host: cloud.soda.io
          api_key_id: 2e0ba0cb-your-api-key-7b
          api_key_secret: 5wd-your-api-key-secret-aGuRg
          scheme:
    """
    )
    
    # Add variables
    ###############
    scan.add_variables({"date": "2022-01-01"})
    
    
    # Add check YAML files
    ##################
    scan.add_sodacl_yaml_file("./my_programmatic_test_scan/sodacl_file_one.yml")
    scan.add_sodacl_yaml_file("./my_programmatic_test_scan/sodacl_file_two.yml")
    scan.add_sodacl_yaml_files("./my_scan_dir")
    scan.add_sodacl_yaml_files("./my_scan_dir/sodacl_file_three.yml")
    
    # OR
    
    # Define checks using SodaCL
    ##################
    checks = """
    checks for cities:
        - row_count > 0
    """
    
    # Add template YAML files, if used
    ##################
    scan.add_template_files(template_path)
    
    
    # Add the checks to the scan
    ####################
    scan.add_sodacl_yaml_str(checks)
    
    # OR Add the checks to scan with virtual filename identifier
    # for advanced use cases such as partial/concurrent scans
    ####################
    scan.add_sodacl_yaml_str(
        checks,
        file_name=f"checks-{scan_name}.yml",
    )
    
    
    # Set scan definition name, equivalent to CLI -s option
    # The scan definition name MUST be unique to this scan, and
    # not duplicated in any other programmatic scan
    ##################
    scan.set_scan_definition_name("YOUR_SCHEDULE_NAME")
    
    # Execute the scan
    ##################
    scan.execute()
    
    
    # Set logs to verbose mode, equivalent to CLI -V option
    ##################
    scan.set_verbose(True)
    
    # Do not send results to Soda Cloud, equivalent to CLI -l option;
    ##################
    scan.set_is_local(True)
    
    
    # Inspect the scan result
    #########################
    scan.get_scan_results()
    
    # Inspect the scan logs
    #######################
    scan.get_logs_text()
    
    # Typical log inspection
    ##################
    scan.assert_no_error_logs()
    scan.assert_no_checks_fail()
    
    # Advanced methods to inspect scan execution logs
    #################################################
    scan.has_error_logs()
    scan.get_error_logs_text()
    
    # Advanced methods to review check results details
    ########################################
    scan.get_checks_fail()
    scan.has_check_fails()
    scan.get_checks_fail_text()
    scan.assert_no_checks_warn_or_fail()
    scan.get_checks_warn_or_fail()
    scan.has_checks_warn_or_fail()
    scan.get_checks_warn_or_fail_text()
    scan.get_all_checks_text()
    exit_code = scan.execute()
    print(exit_code)
    # Service Account Key authentication method
    # See Authentication methods below for more config options
    data_source my_datasource_name:
      type: bigquery
      account_info_json: '{
          "type": "service_account",
          "project_id": "gold-platform-67883",
          "private_key_id": "d0121d000000870xxx",
          "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
          "client_email": "[email protected]",
          "client_id": "XXXXXXXXXXXXXXXXXXXX.apps.googleusercontent.com",
          "auth_uri": "https://accounts.google.com/o/oauth2/auth",
          "token_uri": "https://accounts.google.com/o/oauth2/token",
          "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
          "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/..."
        }'
      auth_scopes:
      - https://www.googleapis.com/auth/bigquery
      - https://www.googleapis.com/auth/cloud-platform
      - https://www.googleapis.com/auth/drive
      project_id: "platinum-platform-67883"
      dataset: sodacore
    data_source my_datasource:
      type: bigquery
      ...
      use_context_auth: True
    data_source my_datasource:
      type: bigquery
      ...
      use_context_auth: True
      impersonation_account: <SA_EMAIL>
    data_source my_database_name:
      type: bigquery
      ...
      account_info_json: '{
          "type": "service_account",
          "project_id": "...",
          "private_key_id": "...",
        ...}'
      impersonation_account: <SA_EMAIL>
    soda:
      scanlauncher:
        volumeMounts:
          - name: gcloud-credentials
            mountPath: /opt/soda/etc
        volumes:
          - name: gcloud-credentials
            secret:
              secretName: gcloud-credentials
              items:
                - key: serviceaccount.json
                  path: serviceaccount.json
    kubectl create secret generic -n <soda-agent-namespace> gcloud-credentials --from-file=serviceaccount.json=<local path to the serviceaccount.json>
    my_datasource_name:
      type: bigquery
      account_info_json_path: /opt/soda/etc/serviceaccount.json
      auth_scopes:
      - https://www.googleapis.com/auth/bigquery
      - https://www.googleapis.com/auth/cloud-platform
      - https://www.googleapis.com/auth/drive
      project_id: ***
      dataset: sodalibrary
    cd soda-agent-external-secrets/setup
    cd ..
    cd configure
    terraform output -raw vault_admin_password
    # MacOS
    terraform output -raw vault_admin_password | pbcopy
    # Linux
    terraform output -raw vault_admin_password | xclip -selection clipboard
    soda:
      apikey:
        id: "value-from-step1"
        secret: "value-from-step1"
      agent:
        name: "my-soda-agent-external-secrets"
      scanlauncher:
        existingSecrets:
          # from spec.target.name in the ExternalSecret file
          - soda-agent-secrets 
        idle:
          enabled: true
          replicas: 1
      cloud:
        # Use https://cloud.us.soda.io for US region 
        # Use https://cloud.soda.io for EU region
        endpoint: "https://cloud.soda.io"
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    data_source nyc_bus_breakdowns_and_delays:
      type: postgres
      connection:
        host: sodademonyc-postgresql
        port: "5432"
        username: ${POSTGRES_USERNAME}
        password: ${POSTGRES_PASSWORD}
        database: nyc
      schema: public
    apiVersion: external-secrets.io/v1beta1
    kind: ClusterSecretStore
    metadata:
      name: vault-app-role
    spec:
      provider:
        vault:
          auth:
            appRole:
              path: approle
              roleId: 3e94ee54-1799-936e-9cec-5c5a19a5eeeb
              secretRef:
                key: appRoleSecretId
                name: external-secrets-vault-app-role-secret-id
                namespace: external-secrets
          path: kv
          server: http://vault.vault.svc.cluster.local:8200
          version: v2
    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: soda-agent
      namespace: soda-agent
    spec:
      data:
      - remoteRef:
          key: local/soda
          property: POSTGRES_USERNAME
        secretKey: POSTGRES_USERNAME
      - remoteRef:
          key: local/soda
          property: POSTGRES_PASSWORD
        secretKey: POSTGRES_PASSWORD
      refreshInterval: 1m
      secretStoreRef:
        kind: ClusterSecretStore
        name: vault-app-role
      target:
        name: soda-agent-secrets
        template:
          data:
            soda-agent.conf: |
              POSTGRES_USERNAME={{ .POSTGRES_USERNAME }}
              POSTGRES_PASSWORD={{ .POSTGRES_PASSWORD }}
    
          engineVersion: v2
    terraform init
    terraform apply -auto-approve
    ...
    Apply complete! Resources: 13 added, 0 changed, 0 destroyed.
    Outputs:
    base_port = 30200
    cluster_admin_token = <sensitive>
    cluster_name = "soda-agent-external-secrets"
    dashboard_access = "http://127.0.0.1:30202"
    soda_agent_namespace = "soda-agent"
    vault_access = "http://127.0.0.1:30200"
    vault_init_access = "http://127.0.0.1:30201"
    vault_root_token = <sensitive>
    terraform init
    terraform apply -auto-approve
    ...
    Apply complete! Resources: 20 added, 0 changed, 0 destroyed.
    Outputs:
    dashboard_access = "http://127.0.0.1:30202"
    dashboard_token = <sensitive>
    vault_access = "http://127.0.0.1:30200"
    vault_admin_password = <sensitive>
    vault_admin_username = "admin"
    vault_read_only_password = <sensitive>
    vault_read_only_role_id = "3e94ee54-1799-936e-9cec-5c5a19a5eeeb"
    vault_read_only_role_secret_id = <sensitive>
    vault_read_only_token = <sensitive>
    vault_read_only_username = "soda"
    templates:
      - name: template_alpha
        description: Reusable SQL for writing checks.
        author: Jean-Claude
        metric: alpha
        query: |
          SELECT count(*) as alpha FROM ${table}
    checks for dim_account:
      - $template_alpha:
          parameters:
            table: dim_account
          fail: when > 0
    templates:
      - name: template_alpha
        description: Reusable SQL for writing checks.
        author: Jean-Paul
        metric: alpha
        query: |
          SELECT count(*) as alpha FROM ${table}
    checks for dim_account:
      - $template_alpha:
          parameters:
            table: dim_account
          fail: when > 0
    soda scan -d adventureworks -c configuration.yml checks.yml -T templates.yml
    scan.add_template_files(template_path)
    Soda 1.0.x
    Soda Core 3.0.x
    Loaded check templates from templates.yml
    Processing template $template_alpha
    Scan summary:
    1/1 checks FAILED: 
        $template_alpha fail when > 0 [FAILED]
          check_value: 99.0
    Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
    checks for dim_account:
      - failed rows:
          $template_alpha:
            parameters:
              table: dim_account
      - name: template_beta
        description: Simplified reusable SQL query.
        author: Jean-Claude
        metric: beta
        query: |
          SELECT count(*) as beta FROM dim_customer
    checks for dim_customer:
      - $template_beta:
          warn: when between 1000 and 9999
    soda scan -d adventureworks -c configuration.yml checks.yml -T templates.yml
    scan.add_template_files(template_path)
    soda scan -d adventureworks -c configuration.yml checks2.yml -T templates.yml
    Soda 1.0.x
    Soda Core 3.0.x
    Loaded check templates from templates.yml
    Processing template $template_beta 
    Scan summary:
    1/1 check PASSED: 
        $template_beta warn when between 1000 and 9999 [PASSED]
    All is good. No failures. No warnings. No errors.
    checks:
      - $template_beta:
          warn: when between 1000 and 9999
          name: Check with beta template
    checks:
      - $template_alpha:
          parameters:
            table: dim_account
          fail: when > 0
    checks:
      - $template_alpha:
          parameters:
            table: "dim_account"
          fail: when > 0
     = 
     < 
     >
     <=
     >=
     !=
     <> 
     between 
     not between 
    checks for dim_customer:
      - row_count > 0
    Soda Library 1.0.x
    Soda Core 3.0.x
    Sending failed row samples to Soda Cloud
    Scan summary:
    6/9 checks PASSED: 
        paxstats in paxstats2
          row_count > 0  [PASSED]
            check_value: 15007
          Look for PII  [PASSED]
          duplicate_percent(id) = 0  [PASSED]
            check_value: 0.0
            row_count: 15007
            duplicate_count: 0
          missing_count(adjusted_passenger_count) = 0  [PASSED]
            check_value: 0
          anomaly detection for row_count  [PASSED]
            check_value: 0.0
          Schema Check [PASSED]
    1/9 checks WARNED: 
        paxstats in paxstats2
          Abnormally large PAX count [WARNED]
            check_value: 659837
    2/9 checks FAILED: 
        paxstats in paxstats2
          Validate terminal ID [FAILED]
            check_value: 27
          Verify 2-digit IATA [FAILED]
            check_value: 3
    Oops! 2 failure. 1 warning. 0 errors. 6 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 4774***8
    checks for dim_employees_dev:
      - values in salary must exist in dim_employee_prod salary
    checks for customers:
      - avg_surface < 1068:
          avg_surface expression: AVG(size * distance)
    checks for retail_products: 
      - row_count > 0
      - max(size) <= 500
    checks for dim_customer:
      - row_count between 10 and 15
    checks for dim_customer:
      - row_count between -3 and 5
    checks for dim_customer:
      - row_count not between -3 and 5
    checks for dim_customer:
      - row_count between (10 and 15
    checks for dim_customer:
      - row_count between (10 and 15)
    checks for dim_customer:
      - row_count between 10 and 15
      - row_count between [10 and 15
      - row_count between 10 and 15]
      - row_count between [10 and 15]
    anomaly detection
    anomaly score (deprecated)
    avg
    avg_length
    cross
    distribution
    duplicate_count
    duplicate_percent
    failed rows
    freshness
    group by
    group evolution
    invalid_count
    invalid_percent
    max
    max_length
    min
    min_length
    missing_count
    missing_percent
    percentile
    reconciliation 
    reference
    row_count
    schema
    schema evolution
    stddev
    stddev_pop
    stddev_samp
    sum
    user-defined
    variance
    var_pop
    var_samp
    discover datasets:
      datasets:
        - include %
        - exclude test_%
    profile columns:
      columns:
        - "%.%"  # Includes all your datasets
        - prod%  # Includes all datasets that begin with 'prod'
    - missing_count(column) = 0:
          missing values: ['']
    checks for dataset_1:
      - row_count > 0
    checks for dataset_1:
      - row_count > 10
    checks for dataset_1:
      - row_count > ${VAR}
    soda scan -d adventureworks -c configuration.yml checks_test.yml checks_test2.yml
    Invalid staleness threshold "when < 3256d"
      +-> line=2,col=5 in checks_test.yml
    
    Invalid check "freshness(start_date) > 1d": no viable alternative at input ' >'
    Invalid check "freshness(end_date) ${NOW} < 1d": mismatched input '${NOW}' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
    checks for dim_product:
      - freshness using end_date with NOW < 1d
    1/3 checks NOT EVALUATED: 
    INFO:soda.scan:[13:50:53]     my_df in dask
    INFO:soda.scan:[13:50:53]       time_key_duplicates < 1 [soda-checks/checks.yaml] [NOT EVALUATED]
    INFO:soda.scan:[13:50:53]         check_value: None
    INFO:soda.scan:[13:50:53] 1 checks not evaluated.
    filter customers_c8d90f60 [daily]:
      where: ts > TIMESTAMP '${NOW}' - interval '100y'
    
    checks for customers_c8d90f60 [daily]:
      - values in (cat) must exist in customers_c8d90f60 (cat2)
    # This is a reference check using the same dataset name as both target and source of the comparison.
    filter customers_c8d90f60 [daily]:
      where: ts > TIMESTAMP '${NOW}' - interval '100y'
    
    filter customers_c8d90f60 [daily-ref]:
      where: SOURCE.ts > TIMESTAMP '${NOW}' - interval '100y'
    
    checks for customers_c8d90f60 [daily]:
      - duplicate_count(cat) < 10
      - row_count > 10
    
    checks for customers_c8d90f60 [daily-ref]:
      - values in (cst_size, cat) must exist in customers_c8d90f60 (cst_size, cat)
    checks for corp_value:
      - failed rows:
          fail condition: true and "column.name.PX" IS NOT null
    checks for my_dataset:
    - missing_count("Email") = 0:
        name: missing email
        filter: "Status" = 'Client'
    checks for my_dataset:
      - missing_count("Email") = 0:
          name: missing email
          filter: |
            "Status" = 'Client'  
    # after adding your Spark session to the scan
    df.createOrReplaceTempView("df")
    df2.createOrReplaceTempView("df2")
    checks for soda_test:
      - missing_count(`1_bigint`):
          name: test
          fail: when > 0
    checks for dim_customer:
      - group evolution:
          name: Marital status
          query: |
            SELECT marital_status FROM dim_employee GROUP BY marital_status
          warn:
            when required group missing: [M]
            when forbidden group present: [T]
          fail:
            when groups change: any
    checks for dim_product:
      - group by:
          query: |
            SELECT style, AVG(days_to_manufacture) as rare
            FROM dim_product 
            GROUP BY style
          fields:
            - style
          checks:
            - rare > 3:
                name: Rare
    
      - group evolution:
          query: | 
            SELECT style FROM dim_product GROUP BY style
          warn:
            when required group missing:
              - U
              - W
    checks for dim_product:
      - group evolution:
          query: | 
            SELECT style FROM dim_product GROUP BY style
          warn:
            when required group missing: [U, W]
    checks for dim_employee:
      - group evolution:
          name: Marital status
          query: |
            SELECT marital_status FROM dim_employee GROUP BY marital_status
          warn:
            when required group missing: [M]
            when forbidden group present: [S]
          fail:
            when required group missing: [T]
    - group evolution:
        name: Rare product
        query: | 
          SELECT style FROM dim_product GROUP BY style
        warn:
          when groups change: any
        fail:
          when groups change: 
            - group delete
            - group add
    - group evolution:
        name: Rare product
        query: | 
          SELECT style FROM dim_product GROUP BY style
        warn:
          when groups change: any
    - group evolution:
        name: Rare product
        query: | 
          SELECT style FROM dim_product GROUP BY style
        warn:
          when forbidden column present: [T]
        fail:
          when groups change: 
            - group delete
            - group add
    - group evolution:
        name: Marital status
        query: |
          SELECT marital_status FROM "dim_employee" GROUP BY marital_status
        warn:
          when required group missing: ["M"]
          when forbidden group present: ["T"]
    - group evolution:
        name: Rare product
        query: | 
          SELECT style FROM dim_product GROUP BY style
        warn:
          when forbidden group present: [T%]
    checks for dim_customer:
      - row_count:
          warn:
            when > 2
            when < 0
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check WARNED: 
        dim_customer in adventureworks
          row_count warn when > 2 when > 3 [WARNED]
            check_value: 18484
    Only 1 warning. 0 failure. 0 errors. 0 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 42812***
    checks for dim_product:
      - sum(safety_stock_level):
          name: Stock levels are safe
          warn:
            when > 0
          fail:
            when > 0
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check FAILED: 
        dim_product in adventureworks
          Stock levels are safe [FAILED]
            check_value: 275936
    Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 6016***
    checks for dim_product:
      - missing_count(discount) < 10:
          attributes:
            department: Marketing
            priority: 1
            tags: [event_campaign, webinar]
            pii: true
            created_at: 2022-02-20
    checks for dim_product:
      - row_count = 10:
          attributes:
            department: Marketing
            priority: 1
            tags: [event_campaign, webinar]
            pii: true
            best_before: 2022-02-20
    configurations for dim_product:
      attributes: 
    department: [Marketing]
        priority: [1]
      
      
    checks for dim_product:
      - duplicate_count(product_line) = 0
      - missing_percent(standard_cost) < 3%
    checks for dim_product:
      - row_count = 10:
          attributes:
            department: ${DEPT}
            ${DEPT}_owner: Mohammed Patel
    for each dataset T:
     datasets:
       - dim_customers
     checks:
       - row_count > 0:
            attributes:
              department: [Marketing]
              priority: 2
    python -m venv .sodadataframes
    # MacOS
    source .sodadataframes/bin/activate 
    # Windows
    .sodadataframes\Scripts\activate
    # MacOS
    pip install --upgrade pip  
    # Windows 
    python.exe -m pip install --upgrade pip
    
    pip install -i https://pypi.cloud.soda.io soda-pandas-dask
    python Soda-dask-pandas-example.py 
    By downloading and using Soda Library, you agree to Soda's Terms & Conditions (https://go.soda.io/t&c) and Privacy Policy (https://go.soda.io/privacy). 
    Running column profiling for data source: pandas_reference_example
    Profiling columns for the following tables:
      - soda_pandas_example
    Scan summary:
    5/7 checks PASSED: 
        soda_pandas_example in pandas_reference_example
          No blank values in Name [PASSED]
          No blank values in Age [PASSED]
          No blank values in City [PASSED]
          No blank values in Country [PASSED]
          Email addresses are formatted correctly [PASSED]
    2/7 checks FAILED: 
        soda_pandas_example in pandas_reference_example
          Alpha2 Country Codes must be valid [FAILED]
            value: 1
          No duplicate Email Addresses [FAILED]
            check_value: 1
    Oops! 2 failures. 0 warnings. 0 errors. 5 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 628131****
    Failed Rows in a Dataframe Example
    -----------------------------------
          name  age         city              email country                        failed_check                  created_at
    0    Alice   25     New York  [email protected]      US        No duplicate Email Addresses  2024-03-12 10:40:55.681690
    1      Bob   30  Los Angeles  [email protected]      BT        No duplicate Email Addresses  2024-03-12 10:40:55.681690
    2  Charlie   66      Chicago  [email protected]      BO        No duplicate Email Addresses  2024-03-12 10:40:55.681690
    3    David   87     Chicago1  [email protected]     ABC        No duplicate Email Addresses  2024-03-12 10:40:55.681690
    4    David   87     Chicago1  [email protected]     ABC  Alpha2 Country Codes must be valid  2024-03-12 10:40:55.731225
    import pandas as pd
    from soda.scan import Scan
    from soda.sampler.sampler import Sampler
    from soda.sampler.sample_context import SampleContext
    from datetime import datetime
    import json
    import os
    
    # For the US Region, use "cloud.us.soda.io".
    # For the EU region, use "cloud.soda.io".
    soda_cloud_host = "cloud.soda.io" 
    
    # Input the API keys you generated in step 2.
    cloud_apikeyID = "XXX"  
    cloud_apikeySecret = "XXX"
    
    # Set to "true" to view failed row samples in Soda Cloud.
    # Set to "false" to view samples in the CLI/separate DataFrame.
    failed_rows_cloud = "false"
    
    # ----------------------------------------------------------------------------------------
    
    # Reroute failed row samples (exceptions)
    
    class CustomSampler(Sampler):
        def store_sample(self, sample_context: SampleContext):
            rows = sample_context.sample.get_rows()
            json_data = json.dumps(rows) # Convert failed rows to JSON
            exceptions_df = pd.read_json(json_data) #create dataframe with failed rows
            # Define exceptions dataframe
            exceptions_schema = sample_context.sample.get_schema().get_dict()
            exception_df_schema = []
            for n in exceptions_schema:
                exception_df_schema.append(n["name"])
            exceptions_df.columns = exception_df_schema
            check_name = sample_context.check_name
            exceptions_df['failed_check'] = check_name
            exceptions_df['created_at'] = datetime.now()
            exceptions_df.to_csv(check_name+".csv", sep=",", index=False, encoding="utf-8")
    
    
    # Sample data1
    data_list = [
        {'name': 'Alice', 'age': 25, 'city': 'New York', 'email': '[email protected]', 'country': 'US'},
        {'name': 'Bob', 'age': 30, 'city': 'Los Angeles', 'email': '[email protected]', 'country': 'BT'},
        {'name': 'Charlie', 'age': 66, 'city': 'Chicago', 'email': '[email protected]', 'country': 'BO'},
        {'name': 'David', 'age': 87, 'city': 'Chicago1', 'email': '[email protected]', 'country': 'ABC'}
    ]
    
    # Sample data2
    reference_list = [
        {'iso2_country': 'US'},
        {'iso2_country': 'BT'},
        {'iso2_country': 'BO'},
        {'iso2_country': 'CN'}
    ]
    
    # Convert Sample data1 to a Pandas DataFrame
    pandas_frame1 = pd.DataFrame(data_list)
    
    # Convert Sample data2 to a Pandas DataFrame
    pandas_frame2 = pd.DataFrame(reference_list)
    
    # Setup Soda data quality scan
    scan = Scan()
    scan.add_pandas_dataframe(dataset_name="soda_pandas_example", pandas_df=pandas_frame1, data_source_name="pandas_reference_example")
    scan.add_pandas_dataframe(dataset_name="reference", pandas_df=pandas_frame2, data_source_name="pandas_reference_example") # reference List
    scan.set_scan_definition_name("pandas_reference_example")
    scan.set_data_source_name("pandas_reference_example")
    if failed_rows_cloud == "false":
        scan.sampler = CustomSampler()
    
    # Define data quality checks using SodaCL
    
    checks = """
    checks for soda_pandas_example:
      
      - missing_count(name) = 0:
          name: No blank values in Name
    
      - missing_count(age) = 0:
          name: No blank values in Age
    
      - missing_count(city) = 0:
          name: No blank values in City
    
      - missing_count(country) = 0:
          name: No blank values in Country
      
      - invalid_count(email) = 0:
          valid format: email
          name: Email addresses are formatted correctly
      
      - duplicate_count(email) = 0:
          name: No duplicate email addresses
    
      - values in (country) must exist in reference (iso2_country):
          name: Alpha2 country codes must be valid
    
    profile columns:
      columns:
        - include soda_pandas_example.%
    """
    
    config = f"""
    soda_cloud:
      host: {soda_cloud_host}
      api_key_id: {cloud_apikeyID}
      api_key_secret: {cloud_apikeySecret}
    """
    
    # Execute a scan
    
    scan.add_sodacl_yaml_str(checks)
    scan.add_configuration_yaml_str(config)
    # When testing, you can set scan.set_is_local(True) to avoid sending failed row samples to Soda Cloud.
    scan.set_is_local(False)
    scan.execute()
    
    
    
    # Create a DataFrame for any exceptions
    # Optionally, you can write this DataFrame to an external table.
    if failed_rows_cloud == "false":
        current_dir = os.path.dirname(os.path.realpath(__file__))
        csv_files = [file for file in os.listdir(current_dir) if file.endswith('.csv')]
        if len(csv_files) == 0:
            pass
        else:
            dfs = []
            for file in csv_files:
                file_path = os.path.join(current_dir, file)
                df = pd.read_csv(file_path)
                dfs.append(df)
            if len(dfs) == 1:
                combined_df = dfs[0]
            else:
                combined_df = pd.concat(dfs, ignore_index=True)
            print("Failed Rows in a Dataframe Example")
            print("-----------------------------------")
            print(combined_df)
            # remove the CSV files that were created
            for file in csv_files:
                os.remove(os.path.join(current_dir, file))
    checks for dim_product:
      - freshness(start_date) < 3d
    checks for dim_department_group:
      - values in (department_group_name) must exist in dim_employee (department_name)
      - values in (birthdate) must not exist in dim_department_group_prod (birthdate)
    Create an API key
    OAuth 2.0 scopes

    | Supported | Configuration | Documentation |
    |---|---|---|
    | ✓ | Use quotes when identifying dataset or column names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark. | Use quotes in a check |
    | ✓ | Use wildcard characters in the value in the check. | Use wildcard values as you would with CTE or SQL. |
    | - | Use for each to apply checks that use templates to multiple datasets in one scan. | - |
    | - | Apply a dataset filter to partition data during a scan. Known issue: Dataset filters are not compatible with user-defined metrics in check templates. | - |

    example
    Customize check names
    Add a check identity
    example
    Add alert configurations
    Soda community on Slack
    change-over-time checks
    anomaly detection checks
    schema checks
    Soda community on Slack

    | Supported | Configuration | Documentation |
    |---|---|---|
    | ✓ | Use quotes when identifying dataset or group names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark. | Use quotes in a check |
    | ✓ | Use wildcard characters ( % or * ) in values in the check; see example. | See note in example below. |
    | - | Use for each to apply group evolution checks to multiple datasets in one scan. | - |
    | - | Apply a dataset filter to partition data during a scan. | - |

    example
    Customize check names
    Add a check identity
    example
    Add alert configurations
    Soda community on Slack
    Then, the query uses the data in the intermediate table to perform a calculation that compares the row count values of the datasets and produces a value that represents the difference in the number of rows, which it labels row_delta.
  • Lastly, it captures the value it calculated for row_delta to compare to the value you set for acceptance_threshold in the user-defined check, or the amount of row count inconsistency you are willing to accept between datasets. If you want the row count values to be equal, set the threshold to 0.0.

    | Country name | Country code |
    |---|---|
    | Holland | NL |
    | Netherlands | NL |
    | Britain | GB |
    | United states | US |

    cross check
    duplicate_count metric
    examine
    surrogate key
    generates a surrogate_key
    composite key
    common table expression (CTE)
    Comparing data using SodaCL
    Failed row samples

    Need help? Join the Soda community on Slack.

    checks for dim_product:
      - row_delta > {{acceptance_threshold}}:
          row_delta query: |
            with table1 as (
              select count(*) as table_1_rows from {{ table_1 }}
            ), 
            table2 as (
              select count(*) as table_2_rows from {{ table_2 }}
            ),
            intermediate as (
              select 
                (select table_1_rows from table1) as table_1_rows,
                (select table_2_rows from table2) as table_2_rows
            ),
            difference_calculation as (
              select 
                ABS( table_1_rows - table_2_rows)
                as row_delta
              from intermediate
            )
            select 
              row_delta
            from difference_calculation
    checks for dim_product:
      - failed rows:
          fail query: |
            with duplicated_records as (
              select
                {{ column_a }},
                {{ column_b }}
              from {{ table }}
              group by {{ column_a }}, {{ column_b }}
              having count(*) > 1
            )
            select
              q.*
            from {{ table }} q
            join duplicated_records dup
              on q.{{ column_a }} = dup.{{ column_a }}
              and q.{{ column_b }} = dup.{{ column_b }}
    checks for dim_product:
      - failed rows: 
          fail condition: not({{ condition_logic }})
    checks for dim_product:
      - failed rows: 
          fail condition: not(full_payment_deadline < dateadd(month, number_of_installments, first_payment_date))
    checks for dim_product:
      - failed rows:
          fail query: |
            -- this query returns failed rows
            select 
                *
            from(
                select
                    *,
                    count({{ column_2 }}) over (
                      partition by {{column_1}}
                      ) as number_duplicated_records_per_key
    
                from {{ table }}
                ) as mapping_aggregations
    
            where number_duplicated_records_per_key > 1
            order by {{ column_1 }}, {{ column_2 }}
            ;
    
            -- this query only returns the distinct failed mappings
            select distinct 
                {{ column_1 }}, 
                {{ column_2 }}
            from(
                select
                    *,
                    count({{ column_2 }}) over (
                      partition by {{ column_1 }}
                      ) as number_duplicated_records_per_key
    
                from {{ table }}
                ) as mapping_aggregations
    
            where number_duplicated_records_per_key > 1
            order by {{ column_1 }}, {{ column_2 }}
            ;
    checks for dim_product:
      - failed rows:
          fail query: |
            -- this query returns failed rows
            select
                *
            from {{ table }}
            where 1 = 1
            qualify  count(*) over (partition by {{ column_1 }} order by {{ column_2 }}) > 1;
    
            -- this query only returns the distinct failed mappings
            select distinct
                {{ column_1 }},
                {{ column_2 }}
            from {{ table }}
            where 1 = 1
            qualify  count(*) over (partition by {{ column_1 }} order by {{ column_2 }}) > 1;
    checks for dim_product:
      - failed rows:
          fail query: |
            -- this query returns failed rows
            select
                *
            from {{ table }}
            qualify count(*) over (partition by {{ column_1 }} order by {{ column_2 }}) > 1
    
            -- this query only returns the distinct failed mappings
            select
                distinct
                {{ column_1 }},
                {{ column_2 }}
            from {{ table }}
            qualify count(*) over (partition by {{ column_1 }} order by {{ column_2 }}) > 1
    index ; start_date ; end_date
    1 ; 2020-01-01 ; 2021-03-13
    2 ; 2022-01-01 ; 2019-03-13
    checks for exchange_operations:
      - UpdatedDateOk = 1:
          name: Verify that, if there is an update date, it is greater than the creation date
          UpdatedDateOk query: |
            SELECT
                CASE
                    WHEN (updated_at_ts > '2017-10-01' AND updated_at_ts < current_timestamp AND updated_at_ts >= created_at_ts) OR updated_at_ts IS NULL THEN 1
                    ELSE 0
                END as rdo
            FROM exchange_operations
    Set up

    A self-hosted Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality. Create a Kubernetes cluster in a cloud services provider environment, then use Helm to deploy a self-hosted Soda Agent in the cluster.

    For context, the instructions to deploy a self-hosted agent assume:

    • you have the appropriate access to a cloud services provider environment such as Azure, AWS, or Google Cloud that allows you to create and deploy applications to a cluster,

    • you, or someone on your team, has access to the login credentials that Soda needs to be able to access a data source such as MS SQL, BigQuery, or Athena so that it can run scans of the data.

    Access the exhaustive deployment instructions for the cloud services provider you use.

    • Cloud services provider-agnostic instructions

    • Amazon Elastic Kubernetes Service (EKS)

    • Microsoft Azure Kubernetes Service (AKS)

    • Google Kubernetes Engine (GKE)

    See also: Soda Agent basic concepts

    Amazon Athena Amazon Redshift Azure Synapse ClickHouse Databricks SQL Denodo Dremio DuckDB GCP BigQuery Google CloudSQL

    IBM DB2 MotherDuck MS SQL Server1 MySQL OracleDB PostgreSQL Presto Snowflake Trino Vertica

    cloud.soda.io
    Atlan
    Alation
    Metaphor
    cron expression
    Manage global roles, user groups, and settings
    Soda community on Slack
  • You have backed up the existing data in the PostgreSQL source data source, and created a staging environment which replicates the production PostgreSQL data source.

  • Use Airflow to execute the data migration from PostgreSQL to Snowflake in a staging environment.

  • In the staging environment, use Soda to run reconciliation checks on both the source and target data sources to validate that the data has been transformed and loaded as expected, and the quality of data in the target is sound.

  • Adjust your data transformations as needed in order to address any issues surfaced by Soda. Repeat the data migration in staging, checking for quality after each run, until you are satisfied with the outcome and the data that loads into the target Snowflake data source.

  • Prepare an Airflow DAG to execute the data migration in production. Execute the data migration in production, then use Soda to scan for data quality on the target data source for final validation.

  • (Optional) For regular migration events, consider invoking Soda scans for data quality after extraction and transformation(s) in the DAG.

  • Run a scan to execute the checks in the recon.yml file. When you run a scan against either the source or target data source, the Scan summary in the output indicates the check value, which is the calculated delta between measurements, the measurement value of each metric or check for both the source and target datasets, along with the diff value and percentage, and the absolute value and percentage. Review the results Soda Library produces in the command-line and/or in the Checks dashboard in Soda Cloud.
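    To illustrate, a minimal recon.yml for this migration might look like the following sketch; the data source and dataset names (postgres_staging, snowflake_staging, orders) are illustrative assumptions, not values from this guide.

    ```yaml
    # Compare the orders dataset between the source and target data sources
    reconciliation Orders:
      label: "PostgreSQL to Snowflake migration"
      datasets:
        source:
          dataset: orders
          datasource: postgres_staging
        target:
          dataset: orders
          datasource: snowflake_staging
      checks:
        # Pass only when source and target contain the same number of rows
        - row_count diff = 0
    ```

    A row_count diff of 0 corresponds to an acceptance threshold of zero: any difference in row counts between source and target fails the check.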

    Migrating data using Airflow
    Install and set up Soda
    full installation instructions
    cloud.soda.io/signup
    Run a scan and review results
    SodaCL reference
    Migrating data using Airflow
    SodaCL reconciliation checks
    metric reconciliation checks
    list of metrics and checks
    record reconciliation checks
    Test data in production
    reconciliation checks
    failed row samples
    taking a sip

    Need help? Join the Soda community on Slack.

    In the right panel that appears, provide a name for your app, such as Soda Cloud, then select the (Non-gallery) option. Click Create.

  • After Azure AD creates your app, click Single sign-on in the left nav under the Manage heading, then select the SAML tile.

  • In the Basic SAML Configuration block that appears, click Edit.

  • In the Basic SAML Configuration panel, there are two fields to populate:

    • Identifier (Entity ID), which is the value of samlUrl from step 1.

    • Reply URL, which is the value of samlUrl from step 1.

  • Click Save, then close the confirmation message pop-up.

  • In the User Attributes & Claims panel, click Edit to add some attribute mappings.

  • Configure the claims as per the following example. Soda Cloud uses familyname and givenname, and maps emailaddress to user.userprincipalname. (Optional) Follow the additional steps to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Azure AD.

  • Scroll down to collect the values of three fields that Soda needs to complete the Azure AD SSO integration:

    • Azure AD Identifier (Section 4 in Azure). This is the IdP entity ID, or Identity Provider Issuer, that Soda needs.

    • Login URL (Section 4 in Azure). This is the IdP SSO service URL, or Identity Provider Single Sign-On URL that Soda needs.

    • X.509 Certificate. Click the Download link next to Certificate (Base64).

  • Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion.

    • Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.

    • (Optional) Ask Soda to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Azure AD.

  • Test the integration by assigning the Soda application in Azure AD to a single user, then requesting that they log in.

  • After a successful single-user test of the sign in, assign access to the Soda Azure AD app to users and/or user groups in your organization.

  • Provide a name for the application, Soda Cloud, and upload the Soda logo.

  • Click Next. In the Configure SAML tab, there are two fields to populate:

    • Single sign on URL, which is the value of samlUrl from step 1.

    • Audience URI (SP Entity ID), which is also the value of samlUrl from step 1. The values for these fields are unique to your organization, are provided to you by Soda, and follow this pattern: https://cloud.soda.io/sso/<your-organization-identifier>/saml.

  • Be sure to use an email address as the application username.

  • Scroll down to Attribute Statements to map the following values, then click Next to continue.

    • map User.GivenName to user.firstName

    • map User.FamilyName to user.lastName

    • map User.Email to user.email

    • (Optional) Follow the additional steps to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Okta.

  • Select the following options, then click Finish.

    • I’m an Okta customer adding an internal app.

    • This is an internal app that we have created.

  • In the Sign On pane of the application, scroll down to click View Setup Instructions.

  • Collect the values of three fields that Soda needs to complete the Okta SSO integration:

    • Identity Provider Single Sign-On URL

    • Identity Provider Issuer

    • X.509 Certificate

  • Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion.

    • Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.

    • (Optional) Ask Soda to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Okta.

  • Test the integration by assigning the Soda application in Okta to a single user, then requesting that they log in.

  • After a successful single-user test of the sign in, assign access to the Soda Okta app to users and/or user groups in your organization.

  • On the Google Identity Provider details page, be sure to copy or download the following values:

    • SSO URL

    • Entity ID

    • IDP metadata

    • Certificate

  • On the SAML Attribute mapping page, add two Google directory attributes and map as follows:

    • Last Name → User.FamilyName

    • First Name → User.GivenName

  • Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion. Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.

  • In the Google Workspace admin portal, use Google's instructions to Turn on your SAML app and verify that SSO works with the new custom app for Soda.

  • After saving the group claim, navigate to Users and Groups in the left menu, and follow Microsoft's instructions to Assign a user or group to an enterprise application. Add any existing groups to the Soda SAML Application that you wish to make available in Soda Cloud to manage access and permissions.

  • In your message to Soda Support or your Soda Customer Engineer, advise Soda that you wish to enable user group syncing. Soda adds a setting to your SSO configuration to enable it.

  • When the SSO integration is complete, you and your team can select your IdP user groups from the dropdown list of choices available when assigning ownership or permissions to resources.

  • Azure Active Directory
    Okta
    Google Workspace
    Global roles and permissions
    Manage organization roles and settings
    cloud.soda.io/login
    cloud.soda.io/login
    Read more
    [email protected]
    portal.azure.com
    [email protected]
    Okta documentation
    [email protected]
    Google Workspace documentation
    Soda logo
    Manage user groups
    Contact Soda
    above
    Microsoft’s instructions
    above
    Define group attribute statements
    Manage global roles, user groups, and settings
    Incidents

    Need help? Join the Soda community on Slack.

    If no timezone information is available in either the timestamp of the check (scan time), or in the data in the column, a freshness check uses the UTC timezone. Soda converts both timestamps to UTC to compare values.
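    Conceptually, the UTC normalization works like the following sketch; this is a simplified illustration, not Soda's actual implementation, and the function names are invented:

    ```python
    from datetime import datetime, timezone

    def is_fresh(event_ts: datetime, scan_ts: datetime, threshold_seconds: float) -> bool:
        """Return True when event_ts is within threshold_seconds of scan_ts, comparing in UTC."""

        def to_utc(ts: datetime) -> datetime:
            # Naive timestamps (no timezone info) are assumed to be UTC;
            # aware timestamps are converted to UTC before comparing.
            if ts.tzinfo is None:
                return ts.replace(tzinfo=timezone.utc)
            return ts.astimezone(timezone.utc)

        return (to_utc(scan_ts) - to_utc(event_ts)).total_seconds() < threshold_seconds
    ```

    Because both timestamps end up in UTC, a naive column value and a timezone-aware scan time can still be compared consistently.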
  • You cannot use variables in checks you write in an agreement in Soda Cloud as it is impossible to provide the variable values at scan time.

  • ✓

    Use quotes when identifying dataset or column names; see . Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    Use wildcard characters ( % or * ) in values in the check.

    -

    ✓

    Use for each to apply freshness checks to multiple datasets in one scan; see .

    ✓

    Apply a dataset filter to partition data during a scan; see .

    | Format | Example | Duration |
    |---|---|---|
    | #h#m | 1h30m | 1 hour and 30 minutes |

    • a timestamp column name

    • a variable to specify the value of “now” (optional)

    • a threshold

    | Element | Example 1 | Example 2 |
    |---|---|---|
    | column name | start_date | end_date |
    | variable to specify the value of “now” (optional) | - | NOW |
    | threshold | 3d | 1d |
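    Assembled, these elements yield checks such as the following sketch; the dataset name is reused from earlier examples, and NOW is a variable you can supply at scan time (for example, with Soda Library's -v option):

    ```yaml
    checks for dim_product:
      - freshness(start_date) < 3d
      - freshness(end_date, NOW) < 1d
    ```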

    | Supported | Configuration | Documentation |
    |---|---|---|
    | ✓ | Define a name for a freshness check; see example. | Customize check names |
    | ✓ | Add an identity to a check. | Add a check identity |
    | ✓ | Define alert configurations to specify warn and fail thresholds; see example. | Add alert configurations |
    | ✓ | Apply an in-check filter to return results for a specific portion of the data in your dataset; see example. | - |

    | Format | Example | Duration |
    |---|---|---|
    | #d | 3d | 3 days |
    | #h | 1h | 1 hour |
    | #m | 30m | 30 minutes |
    | #d#h | 1d6h | 1 day and 6 hours |

    Example with alert configuration
    warn and fail zones
    boundary thresholds
    tips and best practices for SodaCL

    Need help? Join the Soda community on Slack.

    | Supported | Configuration |
    |---|---|
    | ✓ | Use quotes when identifying dataset or column names. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark. |
    | - | Use wildcard characters ( % or * ) in values in the check. |
    | - | Use for each to apply reference checks to multiple datasets in one scan. |
    | ✓ | Apply a dataset filter to partition data during a scan. |
    | ✓ | Supports samples columns parameter to specify columns from which Soda draws failed row samples. |
    | ✓ | Supports samples limit parameter to control the volume of failed row samples Soda collects. |
    | ✓ | Supports collect failed rows parameter to instruct Soda to collect, or not to collect, failed row samples for a check. |

    Reference tips and best practices for SodaCL.

    | Supported | Configuration | Documentation |
    |---|---|---|
    | ✓ | Define a name for a reference check; see example. | Customize check names |
    | ✓ | Add an identity to a check. | Add a check identity |
    | - | Define alert configurations to specify warn and fail alert conditions. | - |
    | - | Apply an in-check filter to return results for a specific portion of the data in your dataset. | - |

    Set a sample limit
    Disable failed row samples
    Customize sampling for checks
    Manage failed row samples
    Troubleshoot SodaCL
    Troubleshoot SodaCL
    SodaCL metrics and checks
    comparing data
    schema check

    Need help? Join the Soda community on Slack.

    -

    Migrate a data source from a self-hosted to a Soda-hosted agent

    If you already use a self-hosted Soda Agent deployed in a Kubernetes cluster to connect to your data source(s), you have the option of migrating a connected data source to a Soda-hosted agent. Though you must reconfigure your data source connection to the new Soda agent, your checks, check history, and scan definition remain intact.

    • Be aware that Soda-hosted agents are only compatible with the following data sources: BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, Snowflake.

    • 🔴 When you migrate to a Soda-hosted agent, Soda Cloud resets all the connection configuration details for your data source. Be sure to capture all existing data source connection details before migrating so you can re-enter the details for the data source connection.

    1. As a user with permission to do so in Soda Cloud, navigate to your avatar > Organization Settings. In the Organization tab, click the checkbox to Enable Soda-hosted Agent.

    2. Navigate to your avatar > Data Sources, then access the Agents tab. Notice your out-of-the-box Soda-hosted agent that is up and running.

    3. Navigate to the Data Sources tab, then click to select the data source you wish to migrate to the Soda-hosted agent.

    4. In the 2. Connect the Data Source tab, copy+paste the contents of the editing panel to a temporary, secure, local place in your system. Switching agents resets all connection configuration parameters, so be sure to record existing parameter settings before proceeding.

    5. In the 1. Attributes tab, use the dropdown for Default Scan Agent to select soda-hosted-agent.

    6. Return to the 2. Connect the Data Source tab, then, using the configuration values you recorded in step 4, use the dropdowns to re-enter the values, then Test Data Source.

    7. When the test completes successfully, Save your changes to the data source.

    Redeploy a self-hosted Soda Agent

    The Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality. Create a Kubernetes cluster in a cloud services provider environment, then use Helm to deploy a self-hosted Soda Agent in the cluster. Read more.

    When you delete the Soda Agent Helm chart from your cluster, you also delete all the agent resources on your cluster. However, if you wish to redeploy the previously-registered agent (use the same name), you need to specify the agent ID in your override values in your values YAML file.

    1. In Soda Cloud, navigate to your avatar > Agents.

    2. Click to select the agent you wish to redeploy, then copy the agent ID of the previously-registered agent from the URL. For example, in the following URL, the agent ID is the long UUID at the end. https://cloud.soda.io/agents/842feab3-snip-87eb-06d2813a72c1. Alternatively, if you use the base64 CLI tool, you can run the following command to obtain the agentID.

    1. Open your values.yml file, then add the id key:value pair under agent, using the agent ID you copied from the URL as the value.
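    For example, the relevant fragment of values.yml might look like this sketch; the ID shown is a placeholder, and the soda.agent structure follows the Soda Agent Helm chart's values layout:

    ```yaml
    soda:
      agent:
        # Placeholder: use the agent ID you copied from the Soda Cloud URL
        id: "842feab3-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
        name: "soda-agent"
    ```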

    1. To redeploy the agent, you need to provide the values for the API keys the agent uses to connect to Soda Cloud in the values YAML file. Access the values by running the following command, replacing the soda-agent values with your own details, then paste the values into your values YAML file.

    Alternatively, if you use the base64 CLI tool, you can run the following commands to obtain the API key and API secret, respectively.

    1. In the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.

6. Validate the Soda Agent deployment by running the following command:
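A minimal validation, assuming the soda-agent namespace, is to confirm the agent pods are running:

```shell
kubectl get pods --namespace soda-agent
```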

    Upgrade a self-hosted Soda Agent

    The Soda Agent is a Helm chart that you deploy on a Kubernetes cluster and connect to your Soda Cloud account using API keys.

    To take advantage of new or improved features and functionality in the Soda Agent, including new features in the Soda Library, you can upgrade your agent when a new version becomes available in ArtifactHub.io.

Note that there is no downtime associated with upgrading a self-hosted Soda Agent. Because Soda does not define the .spec.strategy in the deployment manifest of the Soda Agent Helm chart, Kubernetes uses the default RollingUpdate to upgrade; refer to Kubernetes documentation.

1. If you regularly access multiple clusters, you must first ensure that you are accessing the cluster that contains your deployed Soda Agent. Use the following command to determine which cluster you are accessing.

If you must switch contexts to access a different cluster, copy the name of the cluster you wish to use, then run the following command.
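For example, using kubectl's built-in context commands:

```shell
# Show the cluster/context you are currently accessing
kubectl config current-context

# Switch to the context that contains the deployed Soda Agent
kubectl config use-context <context-name>
```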

2. To upgrade the agent, you must know the values for:

    • namespace - the namespace you created, and into which you deployed the Soda Agent

    • release - the name of the instance of a helm chart that is running in your Kubernetes cluster

• API keys - the values Soda Cloud created which you used to run the agent application in the cluster

Access the first two values by running the following command.
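One way to reveal the release name and namespace is to list every Helm release in the cluster:

```shell
helm ls --all-namespaces
```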

    Output:

3. Access the API key values by running the following command, replacing the placeholder values with your own details.

    From the output above, the command to use is:
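A sketch, assuming a release named soda-agent deployed in the soda-agent namespace:

```shell
helm get values soda-agent --namespace soda-agent
```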

4. Use the following command to search ArtifactHub for the most recent version of the Soda Agent Helm chart.
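helm can query ArtifactHub directly; a sketch:

```shell
helm search hub soda-agent
```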

5. Use the following command to update the Helm repository.
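Refreshing the local chart index pulls the latest chart versions from the repositories you have added:

```shell
helm repo update
```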

6. Upgrade the Soda Agent Helm chart. The value for the chart argument can be a chart reference such as example/agent, a path to a chart directory, a packaged chart, or a URL. To upgrade the agent, Soda uses a chart reference: soda-agent/soda-agent.

    From the output above, the command to use is

    OR, if you use a values YAML file,
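A sketch of both forms, assuming a release named soda-agent in the soda-agent namespace:

```shell
# With API key values passed directly:
helm upgrade soda-agent soda-agent/soda-agent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=*** \
  --namespace soda-agent

# Or, with a values YAML file:
helm upgrade soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent
```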

    Upgrade to Soda Agent 1.0.0 or greater

    Soda Agent 1.0.0 includes several key changes to the way the Soda Agent works. If you already use a Soda Agent, carefully consider the changes that Soda Agent 1.0.0 introduces and make appropriate changes to your configured parameters.

Soda Agent 1.0.0 favors managed or self-managed node groups over AWS Fargate, AKS Virtual Nodes, or GKE Autopilot profiles. Though this version of the agent still works with those profiles, scan performance is slower because the profiles provision new nodes for each scan. To migrate your agent to a managed node group:

    1. Add a managed node group to your Kubernetes cluster.

2. Check your cloud-services provider’s recommendations for node size and adapt them to your needs based on the volume of scans you anticipate. Best practice dictates that you set your cluster to have at least 2 CPU and 2GB of RAM, which, in general, is sufficient to run up to six scans in parallel.

    3. Upgrade to Soda Agent 1.0.0, configuring the helm chart to not use Fargate, Virtual Nodes, or GKE Autopilot by:

      • removing the provider.eks.fargate.enabled property, or setting the value to false

      • removing the provider.aks.virtualNodes.enabled property, or setting the value to false

      • removing the provider.gke.autopilot.enabled property, or setting the value to false

      • removing the soda.agent.target property
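In values YAML terms, disabling the profile amounts to a single flag per provider; a sketch for EKS (the aks and gke keys follow the same pattern):

```yaml
provider:
  eks:
    fargate:
      enabled: false   # or remove this block entirely
```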

    4. Remove the Fargate profiles, and drain existing workloads from virtual nodes in the namespace in which you deployed the Soda Agent so that the agent uses the node group to execute scans, not the profiles.

    Upgrading from 1.1.x to 1.2.x

Starting from version 1.2.0, all images required for the Soda Agent are distributed via a Soda-hosted image registry.

    For more information, see .

    Set up authentication for the Soda image registry

Using your existing Soda API key and secret

By default, we use your existing Soda API key and secret values to authenticate to the Soda image registry.

Ensure these values are still present in your values.yaml; no further action is required.

    Using a separate Soda API key and secret

    You might also opt to use a new, separate Soda API key and secret to perform the authentication to the Soda image registry.

    In this case, ensure the imageCredentials.apikey.id and imageCredentials.apikey.secret values are set to these new values:
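A sketch, assuming the dotted property paths map directly into the values file:

```yaml
imageCredentials:
  apikey:
    id: "***"      # separate Soda API key id for the image registry
    secret: "***"  # separate Soda API key secret for the image registry
```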

    Specify existing imagePullSecrets

    If you're providing your own imagePullSecrets on the cluster, e.g. when you're pulling images from your own mirroring image registry, you must modify your existing values file.

The imagePullSecrets property that was present in versions 1.1.x has been renamed to the more standard existingImagePullSecrets.

    If applicable to you, please perform the following rename in your values file:
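The rename, shown as a before/after sketch with a hypothetical secret name:

```yaml
# 1.1.x (old):
# imagePullSecrets:
#   - name: my-registry-secret

# 1.2.x (new):
existingImagePullSecrets:
  - name: my-registry-secret
```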

    For more information on setting up image mirroring, see

    Update the region

If you are a customer using the US instance of Soda Cloud, you must configure your Agent setup accordingly. Otherwise, you can ignore this section.

In version 1.2.0, we introduce a soda.cloud.region property that determines which registry and Soda Cloud endpoint to use. Possible values are eu and us. When the soda.cloud.region property is not set explicitly, it defaults to eu.

    If applicable to you, please perform the following changes in your values file:
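For a US-instance account, the values file gains:

```yaml
soda:
  cloud:
    region: us   # defaults to eu when omitted
```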

    For more information about using the US region, see .

    Rename scanlauncher to scanLauncher

    The scanlauncher section in the values file has been renamed to scanLauncher. Please ensure the correct name is used in your values file if you have any configuration values there:
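The rename as a sketch; any keys you had under scanlauncher move unchanged:

```yaml
# Before:
# scanlauncher:
#   ...your configuration values...

# After:
scanLauncher:
  # ...the same configuration values, unchanged...
```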

    Upgrade Soda Library

    To upgrade your existing Soda Library tool to the latest version, use the following command, replacing redshift with the install package that matches the type of data source you are using.
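Assuming a standard pip-based install of Soda Library from Soda's package index, an upgrade looks like this (the package name and index URL are as commonly documented; adjust to your environment):

```shell
pip install -i https://pypi.cloud.soda.io -U soda-redshift
```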

    Uninstall Soda Library

    1. (Optional) From the command-line, run the following command to determine which Soda packages exist in your environment.
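One way to list the installed Soda packages with pip:

```shell
pip freeze | grep -i soda
```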

2. (Optional) Run the following command to uninstall a specific Soda package from your environment.
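For example, with a hypothetical Redshift package:

```shell
pip uninstall -y soda-redshift   # replace with the package you wish to remove
```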

3. Run the following command to uninstall all Soda packages from your environment, completely.
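A sketch that removes every installed Soda package in one go, piping the names that pip freeze reports into pip uninstall:

```shell
pip freeze | grep -i soda | xargs pip uninstall -y
```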

    Migrate from Soda Core

    Soda Core, the free, open-source Python library and CLI tool upon which Soda Library is built, continues to exist as an OSS project in GitHub. To migrate from an existing Soda Core installation to Soda Library, simply uninstall the old and install the new from the command-line.

    1. Uninstall your existing Soda Core packages using the following command.
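A sketch, assuming a Redshift setup; adjust the package names to those that pip freeze reports in your environment:

```shell
pip uninstall -y soda-core soda-core-redshift
```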

2. Install a Soda Library package that corresponds to your data source. Your new package automatically comes with a 45-day free trial. Our Soda team will contact you with licensing options after the trial period.
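Assuming Soda Library is installed from Soda's package index, a Redshift install looks like (swap in the package for your data source):

```shell
pip install -i https://pypi.cloud.soda.io soda-redshift
```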

3. If you had connected Soda Core to Soda Cloud, you do not need to change anything for Soda Library to work with your Soda Cloud account. If you had not connected Soda Core to Soda Cloud, you need to connect Soda Library to Soda Cloud. Soda Library requires API keys to validate licensing or trial status and run scans for data quality. See Configure Soda for instructions.

4. You do not need to adjust your existing configuration.yml or checks.yml files; they continue to work as before.

    Go further

    • Learn more about the ways you can use Soda in Use case guides.

    • Write custom SQL checks for your own use cases.

    Need help? Join the Soda community on Slack.

    Handle sensitive values

    By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

    As these values are sensitive, you may wish to employ the following alternative strategies to keep them secure.


    Use a values YAML file to store API key values

    When you deploy a self-hosted Soda Agent from the command-line, you provide values for the API key id and API key secret which the agent uses to connect to your Soda Cloud account. You can provide these values during agent deployment in one of two ways:

    • directly in the helm install command that deploys the agent and stores the values as Kubernetes secrets in your cluster; see deploy using CLI only OR

    • in a values.yml file which you store locally but reference in the helm install command as in the example below.
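A sketch of such a values.yml, with placeholder API key values and an illustrative agent name:

```yaml
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
```

You then reference the file from the install command, e.g. `helm install soda-agent soda-agent/soda-agent --values values.yml --namespace soda-agent` (release and namespace names are assumptions).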

    Refer to the exhaustive cloud service provider-specific instructions for more detail on how to deploy an agent using a values YAML file.

    Use a values file to store private key authentication values

If you use private key authentication with Snowflake or BigQuery, you can provide the required private key values in a values.yml file when you deploy or redeploy the agent.

    • Private key authentication with Snowflake

    • Private key authentication with BigQuery

    Use environment variables to store data source connection credentials

    When you, or someone in your organization, follows the guided steps to use a self-hosted Soda Agent to add a data source in Soda Cloud, one of the steps involves providing the connection details and credentials Soda needs to connect to the data source to run scans.

    You can add those details directly in Soda Cloud, but because any user can then access these values, you may wish to store them securely in the values YAML file as environment variables.

    1. Create or edit your local values YAML file to include the values for the environment variables you input into the connection configuration.
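A hypothetical sketch; the exact nesting under which your chart version accepts environment variables may differ, and the variable names shown are assumptions:

```yaml
soda:
  env:
    POSTGRES_USERNAME: "sodauser"   # hypothetical variable name
    POSTGRES_PASSWORD: "***"
```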

2. After adding the environment variables to the values YAML file, update the Soda Agent using the following command:

3. In step 2 of the add a data source guided steps, add data source connection configuration that looks something like the following example for a PostgreSQL data source. Note the environment variable values for username and password.
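A sketch of PostgreSQL connection configuration that references the environment variables; the host, database, and schema values are placeholders:

```yaml
data_source my_datasource:
  type: postgres
  host: db.example.com
  port: 5432
  username: ${POSTGRES_USERNAME}
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public
```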

4. Follow the remaining guided steps to add a new data source in Soda Cloud. When you save the data source and test the connection, Soda Cloud uses the values you stored as environment variables in the values YAML file you supplied during redeployment.

    Integrate with a secrets manager

    Use External Secrets Operator (ESO) to integrate your self-hosted Soda Agent with your secrets manager, such as a Hashicorp Vault, AWS Secrets Manager, or Azure Key Vault, and securely reconcile the login credentials that Soda Agent uses for your data sources.

    For example, imagine you use a Hashicorp Vault to store data source login credentials and your security protocol demands frequent rotation of passwords. In this situation, the challenge is that apps running in your Kubernetes cluster, such as a Soda Agent, need access to the up-to-date passwords.

    To address the challenge, you can set up and configure ESO in your Kubernetes cluster to regularly reconcile externally-stored password values so that your apps always have the credentials they need. Doing so obviates the need to manually redeploy a values YAML file with new passwords for apps running in the cluster each time your system refreshes the passwords.

    The current integration of Soda Agent and a secrets manager does not yet support the configuration of the Soda Cloud credentials. For those credentials, use a tool such as helm-secrets or vals.

    To integrate Soda Agent with a secret manager, you need the following:

    • External Secrets Operator (ESO) which is a Kubernetes operator that facilitates a connection between the Soda Agent and your secrets manager

    • a ClusterSecretStore resource which provides a central gateway with instructions on how to access your secret backend

    • an ExternalSecret resource which instructs the cluster on what values to fetch, and references the ClusterSecretStore

    Read more about the ESO's Resource Model.

    The following procedure outlines how to use ESO to integrate with a Hashicorp Vault that uses a KV Secrets Engine v2. Extrapolate from this procedure to integrate with another secrets manager such as:

    • AWS Secrets Manager

    • Azure Key Vault

    Prerequisites

    • You have set up a Kubernetes cluster in your cloud services environment and deployed a self-hosted Soda Agent in the cluster.

    • For the purpose of this example procedure, you have set up and are using a Hashicorp Vault which contains a key-value pair for POSTGRES_USERNAME and POSTGRES_PASSWORD at the path local/soda.

    Install and set up the External Secrets Operator

    Consider referencing the use case guide for integrating an External Secrets Manager with a Soda Agent which offers step-by-step instructions to set everything up locally to see the integration in action.

    1. Use helm to install the External Secrets Operator from the Helm chart repository into the same Kubernetes cluster in which you deployed your Soda Agent.

    2. Verify the installation using the following command:
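The two steps above, sketched with ESO's documented Helm chart repository:

```shell
# Install ESO from its Helm chart repository
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets --create-namespace

# Verify the installation
kubectl get pods --namespace external-secrets
```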

3. Create a cluster-secret-store.yml file for the ClusterSecretStore configuration. The details in this file instruct the Soda Agent how to access the external secrets manager vault. This example uses Hashicorp Vault AppRole authentication. AppRole authenticates with Vault using the App Role auth mechanism to access the contents of the secret store. It uses the SecretID in the Kubernetes secret, referenced by secretRef and the roleID, to acquire a temporary access token so that it can fetch secrets. Access external-secrets.io documentation for configuration examples for:

    • AWS Secrets Manager

    • Azure Key Vault
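A sketch of such a ClusterSecretStore for a Vault KV v2 backend; the server address, roleId, and secret names are assumptions you must replace:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "http://vault.default.svc.cluster.local:8200"  # assumption: in-cluster Vault
      path: "local"        # KV mount that contains the soda secrets
      version: "v2"        # KV Secrets Engine v2
      auth:
        appRole:
          path: "approle"
          roleId: "db02de05-xxxx-xxxx-xxxx-67221c5c2f63"     # illustrative roleID
          secretRef:
            name: "vault-approle-secret"   # Kubernetes secret holding the SecretID
            key: "secret-id"
            namespace: "default"           # cluster-scoped stores need an explicit namespace
```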

4. Deploy the ClusterSecretStore to your cluster.

5. Create a soda-secret.yml file for the ExternalSecret configuration. The details in this file instruct the Soda Agent which values to fetch from the external secrets manager vault.

    This example identifies:

    • the namespace of the Soda Agent,

    • two remoteRef configurations, including the file path in the vault, one each for POSTGRES_USERNAME and POSTGRES_PASSWORD, to detail what the ExternalSecret must fetch from the Hashicorp Vault,

    • a refreshInterval to indicate how often the ESO must reconcile the remoteRef values; this ought to correspond to the frequency with which your passwords are reset,

    • the secretStoreRef to indicate the ClusterSecretStore through which to access the vault, and

    • a target template that creates a file called soda-agent.conf into which it adds the username and password values in the dotenv format that the Soda Agent expects.
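The bullets above can be sketched as an ExternalSecret; the resource names are assumptions, and the remoteRef path matches the example vault path local/soda:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: soda-agent-secrets
  namespace: soda-agent            # the namespace of the Soda Agent
spec:
  refreshInterval: 1h              # align with your password-rotation frequency
  secretStoreRef:
    name: vault-backend            # the ClusterSecretStore through which to access the vault
    kind: ClusterSecretStore
  target:
    name: soda-agent-secrets
    template:
      data:
        # dotenv-format file that the Soda Agent expects
        soda-agent.conf: |
          POSTGRES_USERNAME={{ .username }}
          POSTGRES_PASSWORD={{ .password }}
  data:
    - secretKey: username
      remoteRef:
        key: local/soda
        property: POSTGRES_USERNAME
    - secretKey: password
      remoteRef:
        key: local/soda
        property: POSTGRES_PASSWORD
```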

6. Deploy the ExternalSecret to your cluster.

7. Use the following command to get the ExternalSecret to authenticate to the Hashicorp Vault using the ClusterSecretStore and fetch secrets.
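A sketch, assuming the agent's namespace; a SecretSynced condition in the status indicates a successful fetch:

```shell
kubectl get externalsecret --namespace soda-agent
```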

    Output:

8. Prepare a values.yml file to deploy the Soda Agent with the existingSecrets parameter that instructs it to access the ExternalSecret file to fetch data source login credentials. Refer to complete deploy instructions, or redeploy instructions if you already have an agent running in a cluster.

9. Deploy the Soda Agent using the following command:

    Output:

    Use Soda Cloud API Keys from an existing secret

    By default, the Soda Agent creates a secret for storing the Soda Cloud API Key details securely in your cluster. If you want to use a different secret, you can point the Soda Agent to an existing Kubernetes Secret in your cluster using the soda.apikey.existingSecret property.

    To use an existing Kubernetes secret for Soda Agent’s Cloud API credentials, add existingSecret and the secretKeys values to your agent's values YAML file, as in the following example.
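A hypothetical sketch; the secret name and the key names inside the secret are assumptions you must replace with your own:

```yaml
soda:
  apikey:
    existingSecret: "my-existing-secret"   # hypothetical Kubernetes secret name
    secretKeys:
      idKey: "soda-apikey-id"              # hypothetical key names within the secret
      secretKey: "soda-apikey-secret"
```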

    Optimize performance

    The default Soda Agent settings balance performance and cost-efficiency. You can adjust these settings to better suit your needs, optimizing for larger datasets, faster scans, or improved resource management.

    Change sample data and failed rows memory limits

    The hard query cursor limit setting controls how many rows Soda Library can store in memory during a scan. By default, this value is 10,000 rows, preventing Out-Of-Memory (OOM) errors by capping the number of rows Soda holds in memory at any given time.

    If you need to work with larger sets of sample data or failed rows, you can raise the query_cursor_hard_limit. Be aware that if you increase or remove the limit, you must ensure that the Soda Agent has enough memory to prevent it from causing OOM errors.

    To turn off the limit completely, set the value of query_cursor_hard_limit to null.

    The example below demonstrates how you can clear the limit and increase the memory limit using settings in your values.yml file:
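A hypothetical sketch; the key paths shown under scanLauncher are assumptions that may differ per chart version, while query_cursor_hard_limit: null and the resource limits illustrate the two settings the text describes:

```yaml
soda:
  scanLauncher:
    configuration:
      query_cursor_hard_limit: null   # hypothetical nesting; removes the 10,000-row cap
    resources:
      limits:
        memory: "4Gi"                 # raise the pod memory limit accordingly
```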

    Go further

    • Consider referencing the use case guide for integrating an External Secrets Manager with a Soda Agent which offers step-by-step instructions to set everything up locally to see the integration in action.

    • Learn more about Soda Agent basic concepts.

    Need help? Join the Soda community on Slack.

    Term
    Description

    User

Refers to anyone with access to a Soda Cloud account, or organization. Users may belong to multiple Soda Cloud organizations, as when teams set up separate organizations for staging, development, and production environments. You can invite a person to join your Soda Cloud account as a user (your avatar > Invite Users), or you can use an SSO integration to manage your team’s access to a Soda Cloud account.

    User Group

    Refers to a named collection of individual users in a Soda Cloud account. If you use an SSO integration to manage your team's access to Soda Cloud, you can optionally choose to synchronize the user groups you have defined in your identity provider (Okta, Azure AD, etc.) and assign roles to those synched user groups in Soda Cloud.

    Role

    Refers to a named set of permissions that, when assigned to a user or user group, define how the user or group may access or act upon resources or functionalities in Soda Cloud. Roles in Soda Cloud exist at either a global or dataset level.

    Permission

    Refers to a rule that governs an activity or access as it relates to a resource or functionality in Soda Cloud.

    Permission group

Refers to a named set of permissions. When you create a new global or dataset role in Soda Cloud, you add permission groups, instead of individual, granular permissions. For example, you can assign the permission group "Manage scan definitions" to a custom global role called "Engineers", giving users or user groups who are assigned this role the ability to create, edit, or delete scan definitions for a data source.

    Responsibilities

    Refers to a subset of role-based access controls for newly-onboarded datasets. These settings determine inclusion in the Everyone user group and the roles Dataset Owners get for newly-onboarded datasets; see .

    License

    Refers to a legacy billing model that encourages unlimited Viewers with read-only access to Soda Cloud, and some Authors with read-write access to resources and functionality.

    Roles

There are two types of roles that regulate permissions in Soda Cloud: Global and Dataset. You can assign each type of role to users or user groups in Soda Cloud to organize role-based access control to resources and functionality in your account. You can also customize the permissions of the out-of-the-box roles Soda Cloud includes, or you can create new roles and assign permissions to roles as you wish.

| Type of role | Description | OOTB roles | Permissions |
| --- | --- | --- | --- |
| Global | Regulates permissions to access account-level functionalities and resources such as notification rules, integrations, and scan definitions. | Admin, User | |
| Dataset | Regulates permissions to access, and act upon, individual datasets. | Manager, Editor, Viewer | |

    Global roles and permissions

By default, when a new user accepts an invitation to join an existing Soda Cloud organization, or when they gain access to an organization via SSO, Soda Cloud applies the global role of User in the organization. If you are the first user in your organization to sign up for Soda Cloud, you become a global Admin for the account by default. Note, you can have more than one global Admin user in a Soda Cloud account.

    The following table outlines the permission groups for each out-of-the-box global role.

| Permission group | Permissions | Admin | User |
| --- | --- | --- | --- |
| Create agreements | Create new agreements | ✓ | ✓ |
| Create new datasets and data sources with Soda Library | Create datasets through Soda Library for an existing data source | ✓ | ✓ |
| Manage attributes | Create, edit, or delete check attributes | ✓ | |

    1 Global admin users have these permissions, but you cannot add this nameless permission group to a custom global role.

    Manage organization settings

As a user with the permission to do so, log in to your Soda Cloud account and navigate to your avatar > Organization Settings. Use the table below as reference for the tasks you can perform within each tab.

    Tab
    Tasks

    Organization

• Adjust the name of the organization.
    • Review the type of Soda Cloud Plan to which your organization subscribes.
    • Adjust enablement settings for data sampling, access to a Soda-hosted Agent, and access to Soda AI features in your account.

    Users

• View a list of people who have access to the Soda Cloud account.
    • Review each user's License status as an Author or Viewer, their access to Admin permissions, and the user groups to which they belong.
    • Reset a user's password.
    • Deactivate a user's account.

    User Groups

    Create and manage custom groups of users in your Soda Cloud organization; see .

    Global Roles

• View, create, edit, or delete out-of-the-box or custom global roles.
    • View the users or user groups assigned to each global role.

    Dataset Roles

• View, create, edit, or delete out-of-the-box or custom dataset roles.
    • View or edit the datasets that use each dataset role.
    • Review or edit Responsibilities for newly onboarded datasets; see [Assign dataset roles](#assign-dataset-roles).

    Integrations

Connect Soda Cloud to your organization's Slack workspace, MS Teams channel, or other third-party tool via webhook.

    Add multiple organizations

    You may find it useful to set up multiple organizations in Soda Cloud so that each corresponds with a different environment in your network infrastructure, such as production, staging, and development. Such a setup makes it easy for you and your team to access multiple, independent Soda Cloud organizations using the same profile, or login credentials.

    Note that Soda Cloud associates any API keys that you generate within an organization with both your profile and the organization in which you generated the keys. API keys are not interchangeable between organizations.

    Contact [email protected] to request multiple organizations for Soda Cloud.

    View users

    A few Soda Cloud legacy licensing models include a specific number of Author licenses for users of the Soda Cloud account. A user's license status controls whether they can make changes to any datasets, checks, and agreements in the Soda Cloud account.

    • Authors essentially have read-write access to Soda Cloud resources and functionalities, and maintain the dataset role of Admin, Manager, or Editor.

    • Viewers essentially have read-only access to Soda Cloud resources and maintain the dataset role of Viewer.

1. To review the licenses that your users have, as a user with permission to do so, log in to your Soda Cloud account and navigate to your avatar > Organization Settings.

    2. Access the Users tab to view a list of people who have access to your Soda Cloud account, including:

    • the license each user has, if relevant

    • the user groups they belong to

    • if they have global Admin permissions

3. Click a user's Author or Viewer label in the License column to access a Responsibilities window that lists the user's access to resources (datasets, agreements, and checks), the role they hold for each resource, and their license status relative to the resource.

    Manage user groups

    Create or edit user groups in Soda Cloud to manage global and dataset role-based permissions to resources.

    As a user with permission to do so, navigate to your avatar > Organization Settings, then access the User Groups tab. Click Create User Group, then follow the guided workflow to create a group and add individual members. Once created, assign the user group to any of the following resources.

    • In the User Groups tab, assign an out-of-the-box or custom global role to user groups instead of individually assigning global roles to users.

    • In Edit Dataset Responsibilities, add a user group as a member and assign it a dataset role to control the way users in the group access or act upon the dataset.

    • Assign user groups as alert notification rules recipients to make sure the right team, with the right permissions for the dataset(s), gets notified when checks warn or fail.

    • For redundancy, assign to user groups instead of individual users.

    • Add a user group to a in Soda Cloud so the whole team can review newly-proposed no-code checks.

    • Add user groups as in an agreement so that whole teams can collaborate on the expected state of data quality for one or more datasets.

    If you use an SSO integration to manage your team’s access to Soda Cloud, you can optionally choose to synchronize the user groups you have defined in your identity provider (Okta, Azure AD, etc.) and assign roles to those synched user groups in Soda Cloud. See: Sync user groups from an IdP

    Manage global roles

    Create or edit global and dataset roles to assign to users or user groups in Soda Cloud.

    As a user with permission to do so, navigate to your avatar > Organization Settings, then access the Global Roles tab. Click Add Global Role, then follow the guided workflow to name a role and add permissions groups. Refer to the table above for a list of permissions groups, and their associated permissions, that you can assign to global roles.

    To associate individual users or user groups with global roles, you can do so in one of two ways:

• Add users or groups to role: Navigate to your avatar > Organization Settings. In the Global Roles tab, click the stacked dots next to the role you wish to assign to users or groups and select Assign Members.

• Add role to user or group: Navigate to your avatar > Organization Settings. In the Users or User Groups tab, click the stacked dots next to the user or group to which you wish to assign a particular global role and select Assign Global Roles.

    Access an audit trail

    To meet your organization's regulatory and policy mandates, you can download a CSV file that contains an audit trail of activity on your Soda Cloud account for a date range you specify. The file contains details of each user's actions, their email and IP addresses, and a timestamp of the action. An Admin is the only account-level role that can access an audit trail for a Soda Cloud account.

1. As a user with the permission to do so, log in to your Soda Cloud account and navigate to your avatar > Organization Settings. Only Admins can view Organization Settings.

    2. Access the Audit Trail tab, then set the date range of usage details you wish to examine and click Download.

    Alternatively, you can use the Audit Trail endpoint in Soda Cloud’s Reporting API to access audit trail data.

    Go further

    • Learn more about the relationship between resources in Soda’s architecture.

    • Organize your datasets to facilitate your search for the right data.

    • Invite colleagues to join your organization’s Soda Cloud account.

    • Learn more about creating and tracking Soda Incidents.

Need help? Join the Soda community on Slack.

See Add alert configurations for exhaustive alert configuration details.

    The validation key:value pairs in schema checks set the conditions for a warn or a fail check result. See a List of validation keys below.

For example, the following check uses the when required column missing validation key to validate that specific columns are present in a dataset; if any of the columns in the list are absent, the check result is fail.
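A sketch of such a check in SodaCL; the dataset and column names are illustrative:

```yaml
checks for dim_product:
  - schema:
      fail:
        when required column missing:
          - id
          - product_name
          - size
```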

    In the example above, the value for the validation key is in a nested list format, but you can use an inline list of comma-separated values inside square brackets instead. The following example yields identical check results to the example above.
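The same check with the column names as an inline list (illustrative names again):

```yaml
checks for dim_product:
  - schema:
      fail:
        when required column missing: [id, product_name, size]
```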

    You can define a schema check with both warn and fail alert conditions, each with multiple validation keys. Refer to Configure multiple alerts for details. Be aware, however, that a single schema check only ever produces a single check result. See Expect one check result below for details.

    The following example is a single check; Soda executes each of its validations during a scan. Note that unlike the nested list of column names in the example above, the nested key:value pairs that form the value for these validation keys are indented, but do not use a -.
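A sketch of a single check combining warn and fail conditions, each with multiple validation keys; all names are illustrative:

```yaml
checks for dim_product:
  - schema:
      warn:
        when required column missing: [id]
        when wrong column type:
          id: integer
      fail:
        when forbidden column present: [credit_card_number]
        when wrong column index:
          id: 0
```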

    Add a schema_name parameter to a schema check to address a situation in which you need to explicitly identify or override a dataset's schema in the data source.

    Define schema evolution checks

✖️ Requires Soda Core Scientific (included in a Soda Agent)
    ✖️ Supported in Soda Core
    ✔️ Supported in Soda Library + Soda Cloud
    ✔️ Supported in Soda Cloud Agreements + Soda Agent
    ✔️ Available as a no-code check with a self-hosted Soda Agent connected to any Soda-supported data source, except Spark, Dask, and Pandas
    ✔️ Available as a no-code check with a Soda-hosted Agent connected to a BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake data source

    Rather than specifying exact parameters for column changes, you can use the when schema changes validation key to warn or fail when indistinct changes occur in a dataset.
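A sketch of a schema evolution check; any as an inline value catches all change types, while individual change types go in a nested list (dataset name is illustrative):

```yaml
checks for dim_product:
  - schema:
      warn:
        when schema changes: any
      fail:
        when schema changes:
          - column delete
          - column index change
```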

    Soda Cloud must have at least two measurements to yield a check result. In other words, the first time you run a scan to execute a schema evolution check, Soda returns no results because it has nothing against which to compare; the second scan that executes the check yields a check result.

    The output in Soda Cloud displays the output of all the alert states during the scan.

    Optional check configurations

| Supported | Configuration |
| --- | --- |
| ✓ | Define a name for a schema check. |
| ✓ | Add an identity to a check. |
| ✓ | Define alert configurations to specify warn and fail alert conditions. |
| | Apply an in-check filter to return results for a specific portion of the data in your dataset. |

    Example with check name

    Example with alert configuration

    Example with quotes

    Example with wildcards

    You can use * or % as wildcard characters in a list of column names. If the column name begins with a wildcard character, add single quotes as per the example below.
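A sketch using wildcards, with single quotes around a value that begins with a wildcard character (column names are illustrative):

```yaml
checks for dim_product:
  - schema:
      fail:
        when forbidden column present:
          - credit_card%
          - '%pii%'
```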

    Example with for each

    Example with dataset filter

    List of validation keys

| Validation key | Values |
| --- | --- |
| when required column missing | one or more column names in an inline list of comma-separated values, or a nested list |
| when forbidden column present | one or more column names in an inline list of comma-separated values, or a nested list |
| when wrong column type | nested key:value pair to identify column:expected_data_type |
| when wrong column index | nested key:value pair to identify column:expected_position_in_dataset_index |
| when schema changes | any as an inline value; column add, column delete, column index change, or column type change as nested list items |

    Expect one check result

    Be aware that a check that contains one or more alert configurations only ever yields a single check result; one check yields one check result. If your check triggers both a warn and a fail, the check result only displays the more severe, failed check result.

    Using the following example, Soda Library, during a scan, discovers that the data in the dataset triggers both alerts, but the check result at the bottom is Oops! 1 failures. Nonetheless, the results in the Scan summary section of the CLI output still display both the warn and fail alerts as having been triggered.

    Example: Detect PII

    To address a common use case, you may wish to use a schema check in combination with a for each configuration and wildcard characters to automatically detect columns that contain personally identifiable information (PII) in your datasets, as in the following example.
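A sketch of such a configuration might look like the following; the dataset selection and column-name patterns are illustrative placeholders:

```yaml
for each dataset D:
  datasets:
    - include %
  checks:
    - schema:
        fail:
          when forbidden column present: ['%ssn%', '%credit_card%']
```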

    Go further

    • Learn more about SodaCL metrics and checks in general.

    • Use a reference check to validate matching contents between datasets.

    • Reference tips and best practices for SodaCL.

    SodaCL check types
    Add alert configurations

Need help? Join the Soda community on Slack.

    Soda Checks Language (SodaCL) is a YAML-based, domain-specific language for data reliability. Used in conjunction with Soda software, you use SodaCL to write checks for data quality, then run a scan of the data in your data source to execute those checks.

    A Soda Check is a test that Soda performs when it scans a dataset in your data source. A Soda scan executes the checks you defined and returns a result for each check: pass, fail, or error. Optionally, you can configure a check to warn instead of fail by setting an alert configuration.

    About this tutorial

With over 25 built-in SodaCL checks and metrics to choose from, it can be hard to know where to begin. This tutorial offers suggestions for some basic checks you can write to begin surfacing missing, invalid, or unexpected data in your datasets.

    All the example checks in this tutorial use placeholder values for dataset and column name identifiers, but you can copy+paste the examples into your own checks YAML file and adjust the details to correspond to your own data.

    You do not need to follow the tutorial sequentially.

    Tutorial prerequisites

    • You have completed the Get started tutorial OR you have followed the instructions on the Roadmap on your own.

    • You have created a new YAML file in your code editor and named it checks.yml OR you are on step 2 in the guided flow to create a new Soda Agreement.

    • (Optional) You have read the first two sections in Metrics and checks as a primer for SodaCL.

    Row count and cross checks

    One of the most basic checks you can write uses the row_count metric. When it executes the following check during a scan, Soda simply counts the rows in the dataset you identify in the checks for section header to confirm that the dataset is not empty. If it counts one or more rows, the check result is pass.
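A minimal version of that check, using a placeholder dataset name:

```yaml
checks for dataset_name:
  - row_count > 0
```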

The check above is an example that uses a numeric metric in a standard check pattern. By contrast, the following cross check compares row counts between datasets within the same data source without setting a threshold for volume, like > 50.

    This type of check is useful when, for example, you want to compare row counts to validate that a transformed dataset contains the same volume of data as the source from which it came.
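A sketch of a cross check, with placeholder dataset names:

```yaml
checks for dataset_name:
  - row_count same as other_dataset_name
```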

    Run a scan to execute your checks:

    Read more

    • Numeric metrics

    • Standard check pattern

    • Cross checks

    Duplicate check

    For the nearly universal use case that demands uniqueness, you can use the duplicate_count or duplicate_percent metrics. In the following example, Soda counts the number of duplicate values in the column_name column, identified as the argument in parentheses appended to the metric. If there is even one value that is a duplicate of another, the check result is fail.

This type of check is useful when, for example, you need to make sure that values in an id column are unique, such as customer_id or product_id.

    If you wish, you can check for duplicate pairs in multiple columns. In the following example, Soda counts the number of duplicate values in both column_name1 and column_name2. Be sure to add a space between the comma-separated values in the list of column names.
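Both variants can be sketched as follows, using placeholder column names:

```yaml
checks for dataset_name:
  - duplicate_count(column_name) = 0
  - duplicate_count(column_name1, column_name2) = 0
```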

    Example of duplicate pairs

    Rows 1 and 4 are duplicates.

| –  | column 1 | column 2 |
|----|----------|----------|
| 1. | apple    | banana   |
| 2. | apple    |          |
| 4. | apple    | banana   |

    Run a scan to execute your checks:

    Read more

    • Numeric metrics

    Freshness check

    If your dataset contains a column that stores timestamp information, you can configure a freshness check. This type of check is useful when, for example, you need to validate that the data feeding a weekly report or dashboard is not stale. Timely data is reliable data!

    In this example, the check fails if the most-recently added row (in other words, the youngest row) in the timestamp_column_name column is more than 24 hours old.
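The freshness check described above can be sketched as:

```yaml
checks for dataset_name:
  - freshness(timestamp_column_name) < 24h
```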

    Run a scan to execute your checks:

    Read more

    • Freshness checks

    Missing and invalid checks

    SodaCL's missing metrics make it easy to find null values in a column. You don't even have to specify that NULL qualifies as a missing value because SodaCL registers null values as missing by default. The following check passes if there are no null values in column_name, identified as the value in parentheses.
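A minimal sketch of such a check, with placeholder dataset and column names:

```yaml
checks for dataset_name:
  - missing_count(column_name) = 0
```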

If the type of data a dataset contains is TEXT (string, character varying, etc.), you can use an invalid metric to surface any rows that contain ill-formatted data. This type of check is useful when, for example, you need to validate that all values in an email address column follow a valid email address format.

    The following example fails if, during a scan, Soda discovers that more than 5% of the values in the email_column_name do not follow the email address format.

    If you want to surface more than just null values as missing, you can specify a list of values that, in the context of your business rules, qualify as missing. In the example check below, Soda registers N/A, 0000, or none as missing values in addition to NULL; if it discovers more than 5% of the rows contain one of these values, the check fails.

    Note that the missing value 0000 is wrapped in single quotes; all numeric values you include in such a list must be wrapped in single quotes.
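The two checks described above can be sketched as follows; the valid format: email configuration key tells Soda to validate values against an email address pattern:

```yaml
checks for dataset_name:
  - invalid_percent(email_column_name) < 5%:
      valid format: email
  - missing_percent(column_name) < 5%:
      missing values: [N/A, '0000', none]
```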

    Run a scan to execute your checks:

    Read more

    • Missing metrics

    • Validity metrics

    Reference checks

If you need to validate that data in one column of a dataset exists in a column in another dataset, you can use a reference check. The following check compares the values of state_code to confirm that those values exist in code in the iso_3166-2 dataset in the same data source. The check passes if the values in state_code exist in code.

    If you wish, you can compare the values of multiple columns in one check. Soda compares the column names respectively, so that in the following example, column_name1 compares to other_column1, and column_name2 compares to other_column2.
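Both reference checks can be sketched as follows; dataset and column names other than state_code, code, and iso_3166-2 are placeholders:

```yaml
checks for dataset_name:
  - values in (state_code) must exist in iso_3166-2 (code)
  - values in (column_name1, column_name2) must exist in other_dataset_name (other_column1, other_column2)
```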

    Run a scan to execute your checks:

    Read more

    • Reference checks

    Schema checks

    To eliminate the frustration of the silently evolving dataset schema, use schema checks with alert configurations to notify you when column changes occur.

    If you have set up a Soda Cloud account, you can use a catch-all schema check, also known as a schema evolution check, that results in a warning whenever a Soda scan reveals that a column has been added, removed, moved within the context of an index, or changed data type relative to the results of the previous scan.

    If you wish to apply a more granular approach to monitoring schema changes, you can specify columns in a dataset that ought to be present or which should not exist in the dataset.

    The following example warns you when, during a scan, Soda discovers that column_name is missing in the dataset; the check fails if either column_name1 or column_name2 exist in the dataset. This type of check is useful when, for example, you need to ensure that datasets do not contain columns of sensitive data such as credit card numbers or personally identifiable information (PII).
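A sketch of that schema check, with placeholder column names:

```yaml
checks for dataset_name:
  - schema:
      warn:
        when required column missing: [column_name]
      fail:
        when forbidden column present: [column_name1, column_name2]
```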

    Be aware that a check that contains one or more alert configurations only ever yields a single check result; one check yields one check result. If your check triggers both a warn and a fail, the check result only displays the more severe, failed check result. Read more.

    Run a scan to execute your checks:

    Read more

    • Alert configuration

    • Schema checks

    • Expect one check result

    Tips and best practices for SodaCL

    • Get your logic straight: your check defines a passing state, what you expect to see in your dataset. Do not define a failed state.

    • Take careful note of the data type of the column against which you run a check. For example, if numeric values are stored in a column as data type TEXT, a numeric check such as min or avg is incalculable.

    • A check that uses alert configurations only ever returns one check result. See Expect one check result.

• The invalid format configuration key only works with data type TEXT.

• Not all checks support in-check filters.

    Best practices

    • To avoid typos or spelling errors, best practice dictates that you copy + paste any dataset or column names into your checks.

    • It is good practice to add a custom name to your check. Establish a naming convention – word order, underscores, identifiers – and apply easily-digestible check names for any colleagues with whom you collaborate.

    Syntax tips

    • Be sure to add a colon to the end of a check whenever you add a second line to a check such as for a missing or invalid configuration key, or if you add a custom name for your check.

    • Indentations in the SodaCL syntax are critical. If you encounter an error, check your indentation first.

    • Spaces in the SodaCL syntax are critical. For example, be sure to add a space before and after your threshold symbol ( =, >, >= ); do not add a space between a metric and the column to which it applies, such as duplicate_count(column1).

    • All comma-separated values in lists in SodaCL use a comma + space syntax, such as duplicate_count(column1, column2); do not forget to add the space.

• Note that multi-word checks such as missing_count use underscores, but configuration keys, such as missing regex, do not.

    • If you use missing values or invalid values configuration keys, note that values in a comma-separated list must be enclosed in square brackets. For example, [US, BE, CN].

• Column names that contain colons or periods can interfere with SodaCL’s YAML-based syntax. For any column names that contain these punctuation marks, add quotes to the column name in the check to prevent issues. If you are using a failed row check with a CTE fail condition, however, the syntax checker does not accept an expression that begins with double-quotes. In that case, as a workaround, add a meaningless true and to the beginning of the CTE condition.

    Go further

    • Learn more about SodaCL metrics and checks in general.

    • Read about the Optional configurations you can apply to SodaCL checks.

    • Get started to run a simple data quality scan on example data.

    Check suggestions

Need help? Join the Soda community on Slack.

    Available in 2025: Activate an anomaly dashboard to automatically gain observability insight into data quality.

    ✔️ Requires Soda Core Scientific (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud + self-hosted Soda Agent connected to any Soda-supported data source, except Spark, and Dask and Pandas ✔️ Supported in Soda Cloud + Soda-hosted Agent connected to a BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake data source


    Add dataset discovery

    Dataset discovery captures basic information about each dataset, including a dataset's schema and the data type of each column. You add dataset discovery as part of the guided workflow to create a new data source. Navigate to your avatar > Data Sources > New Data Source to begin.

    In step 3 of the guided workflow, you have the option of listing the datasets you wish to profile. Dataset discovery can be resource-heavy, so carefully consider the datasets about which you truly need profile information. Refer to Compute consumption and cost considerations for more detail.

    SodaCL supports SQL wildcard characters such as %, *, or _. Refer to your data source's documentation to determine which SQL wildcard characters it supports and how to escape the characters, such as with a backslash \, if your dataset or column names use characters that SQL would consider wildcards.

    The example configuration below uses a wildcard character (%) to specify that, during a scan, Soda Library discovers all the datasets the data source contains except those with names that begin with test.
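A sketch of such a configuration:

```yaml
discover datasets:
  datasets:
    - include %
    - exclude test%
```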

    The example configuration below uses a wildcard character (_). During a scan, Soda discovers all the datasets that start with customer and any single character after that, such as customer1, customer2, customer3. However, in the example below, Soda does not include dataset names that are exactly eight characters or are more than nine characters, as with customer or customermain.

    The example configuration below uses both an escaped wildcard character (\_) and wildcard character(*). During a scan, Soda discovers all the datasets that start with north_ and any single or multiple character after that. For example, it includes north_star, north_end, north_pole. Note that your data source may not support backslashes to escape a character, so you may need to use a different escape character.

    You can also specify individual datasets to include or exclude, as in the following example.

    Disable dataset discovery

    If your data source is very large, you may wish to disable dataset discovery completely. To do so, you can use the following configuration.
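One way to express this, excluding every dataset with a wildcard character:

```yaml
discover datasets:
  datasets:
    - exclude %
```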

    Access dataset profile information

    After you have added the data source in Soda Cloud and the first scan to profile your data is complete, you can review the discovered datasets in Soda Cloud.

    Navigate to the Datasets dashboard, then click a dataset name to open the dataset's info page. Access the Columns tab to review the datasets that Soda Library discovered, including the type of data each column contains.

    discover datasets

    Add column profiling

    Column profile information includes details such as the calculated mean value of data in a column, the maximum and minimum values in a column, and the number of rows with missing data.

    Depending on your deployment model, or flavor, of Soda, profiling a dataset produces one or two tabs' worth of data in a Dataset page in Soda Cloud.


    In the Columns tab, you can see column profile information including details such as the calculated mean value of data in a column, the maximum and minimum values in a column, and the number of rows with missing data.

    In the Anomalies tab, you can access an out-of-the-box anomaly dashboard that uses the column profile information to automatically begin detecting anomalies in your data relative to the patterns the machine learning algorithm learns over the course of approximately five days. Learn more (Available in 2025)

    Add column profiling configuration

    In Soda Cloud, you add column profiling as part of the guided workflow to create a new data source. Navigate to your avatar > Data Sources > New Data Source to begin.

For preview participants only: If you have already added a data source to your Soda Cloud account via a self-hosted or Soda-hosted agent and wish to activate an anomaly dashboard for one or more datasets, refer to the activation instructions.

    If you are using a self-operated deployment model that leverages Soda Library, add the column profiling configuration outlined below to your checks YAML file.

    In step 4 of the guided workflow, or in your checks YAML file, add configuration to list the columns of datasets you wish to profile.

    • Be aware that Soda can only profile columns that contain NUMBERS or TEXT type data; it cannot profile columns that contain TIME or DATE data except to create a freshness check for the anomaly dashboard.

    • Soda performs the Discover datasets and Profile datasets actions independently, relative to each other. If you define exclude or include rules in the Discover tab, the Profile configuration does not inherit the Discover rules. For example, if, for Discover, you exclude all datasets that begin with staging_, then configure Profile to include all datasets, Soda discovers and profiles all datasets.

    • Column profiling can be resource-heavy, so carefully consider the datasets for which you truly need column profile information. Refer to Compute consumption and cost considerations for more detail.

    The example configuration below uses a wildcard character (%) to specify that, during a scan, Soda captures the column profile information for all the columns in the dataset named retail_orders. The . in the syntax separates the dataset name from the column name. Since _ is a wildcard character, the example escapes the character with a backslash \. Note that your data source may not support backslashes to escape a character, so you may need to use a different escape character.

    You can also specify individual columns to profile, as in the following example.
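A sketch of both configurations; dataset_name and column_name are placeholders, and the backslash escape may vary by data source:

```yaml
profile columns:
  columns:
    - retail\_orders.%
    - dataset_name.column_name
```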

    Refer to the top of the page for more example configurations for column profiling.

    Disable column profiling

    If you wish to disable column profiling and any automated anomaly detection checks completely so that Soda Cloud profiles no columns at all, you can use the following configuration.

    Disable column profiling at the organization level

    If you wish to disable column profiling at the organization level, you must possess Admin privileges in your Soda Cloud account. Once confirmed, follow these steps:

    Navigate to your avatar.

    Click on Organization settings.

    Uncheck the box labeled Allow Soda to collect column profile information.

    Access column profile information

    After you have added the data source in Soda Cloud and the first scan to profile your data is complete, you can review the profiled columns in Soda Cloud.

    Navigate to the Datasets dashboard, then click a dataset name to open the dataset's info page.

    Access the Columns tab to review the datasets that Soda Library discovered, including the column profile details you can expand to review as in the example below.

    When available and activated for an anomaly dashboard for a dataset, access the Anomalies tab to review the automated anomaly detection checks that Soda applied to your data based on the profiling information it collected.

    Add quotes to all datasets

    To add those necessary quotes to dataset names that Soda acts upon automatically – discovering, profiling, or sampling datasets, or creating automated monitoring checks – you can add a quote_tables configuration to your data source, as in the following example.
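A minimal sketch, assuming a PostgreSQL data source named my_datasource_name, with other connection details omitted:

```yaml
data_source my_datasource_name:
  type: postgres
  quote_tables: true
```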

    If your dataset names include white spaces or use special characters, you must wrap those dataset names in quotes whenever you identify them to Soda, such as in a checks YAML file.

    Compute consumption and cost considerations

    Both column profiling and dataset discovery can lead to increased computation costs on your data sources. Consider adding these configurations to a select few datasets to keep costs low.

    Discover Datasets

    Dataset discovery gathers metadata to discover:

    • the datasets in a data source

    • the columns that datasets contain

    • the data type of columns

    Profile Columns

    Column profiling aims to issue the most optimized queries for your data source, however, given the nature of the derived metrics, those queries can result in full dataset scans and can be slow and costly on large datasets. Column profiling derives the following metrics:

    Numeric Columns

    • minimum value

    • maximum value

    • five smallest values

    • five largest values

    • five most frequent values

    • average

    • sum

    • standard deviation

    • variance

    • count of distinct values

    • count of missing values

    • histogram

    Text Columns

    • five most frequent values

    • count of distinct values

    • count of missing values

    • average length

    • minimum length

    • maximum length

    Date Time Columns

    • five smallest values

    • five largest values

    • five most frequent values

    • count of distinct values

    • count of missing values

    • minimum timestamp

    • maximum timestamp

    Inclusion and exclusion rules

    • If you configure discover datasets or profile columns to include specific datasets or columns, Soda implicitly excludes all other datasets or columns from discovery or profiling.

    • If you combine an include config and an exclude config and a dataset or column fits both patterns, Soda excludes the dataset or column from discovery or profiling.

    • Soda performs the Discover datasets and Profile datasets actions independently, relative to each other. If you configured discover datasets to exclude a dataset but do not explicitly also exclude its columns in profile columns, Soda discovers the dataset and profiles its columns. For example, if, for discover datasets, you exclude all datasets that begin with staging_, then configure profile columns to include all datasets, Soda discovers and profiles all datasets.

    Limitations and known issues

    • Known issue: Currently, SodaCL does not support column exclusion for the column profiling and dataset discovery configurations when connecting to a Spark DataFrame data source (soda-library-spark-df).

    • Known issue: SodaCL does not support using variables in column profiling and dataset discovery configurations.

• Data type: Soda can only profile columns that contain NUMBERS, TEXT, DATE/TIMESTAMP, or BOOLEAN type data.

• Spark: Soda usually uses the profiling include/exclude pattern to build the query that retrieves a dataset’s metadata, but Spark does not support such profiling. Instead, Soda retrieves all the datasets in a schema, then filters the list based on the include/exclude pattern, replacing all % wildcard values with .* to translate a SQL pattern into a regular expression pattern.

• Performance: Both column profiling and dataset discovery can lead to increased computation costs on your data sources. Consider adding these configurations to a select few datasets to keep costs low. See Compute consumption and cost considerations for more detail.

    • Workaround: If you wish, you can indicate to Soda to include all datasets in its dataset discovery or column profiling by using wildcard characters, as in %.%. Because YAML, upon which SodaCL is based, does not naturally recognize %.% as a string, you must wrap the value in quotes, as in the following example.
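For example:

```yaml
profile columns:
  columns:
    - "%.%"
```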

    Go further

• Learn about managing failed row samples for SodaCL checks that collect and display failed rows in Soda Cloud to aid issue investigation.

    • Learn more about the anomaly dashboard for datasets.

    • Reference tips and best practices for SodaCL.

    • Use a freshness check to gauge how recently your data was captured.

• Use a reference check to compare the values of one column to another.

Need help? Join the Soda community on Slack.

    See also: Integrate with Jira

    See also: Integrate with ServiceNow

    Configure a webhook

    1. Confirm that the third-party can provide an incoming webhook URL that meets the following technical specifications:

      • can return an HTTP status code between 200 and 400

      • can reply to a request within 10 seconds (otherwise the request from Soda Cloud times out)

      • provides an SSL-secured endpoint (https://) of TLS 1.2 or greater

    2. In your Soda Cloud account, navigate to your avatar > Organization Settings, then select the Integrations tab.

    3. Click the + at the upper right of the table of integrations to add a new integration.

    4. In the Add Integration dialog box, select Webhook then follow the guided steps to configure the integration. Reference the following tables for guidance on the values to input in the guided steps.

| Field or Label | Guidance |
|----------------|----------|
| Name | Provide a unique name for your webhook in Soda Cloud. Required |
| URL | Input the incoming webhook URL or API endpoint provided by your service provider. See sections below for details. Required |
| HTTP Headers, Name | For example, Authorization: |
| HTTP Headers, Value | For example, bearer [token] |
| Enable to send notifications to this webhook when a check result triggers an alert. | Check to allow users to select this webhook as a destination for alert notifications when check results warn or fail. |
| Use this webhook as the default notification channel for all check result alerts. | Check to automatically configure check result alert notifications to this webhook by default. Users can deselect the webhook as the notification destination in an individual check, but it is the prepopulated destination by default. |

    Webhooks for Soda Cloud alert notifications

    You can use a webhook to enable Soda Cloud to send alert notifications to a third-party provider, such as OpsGenie, to notify your team of warn and fail check results. With such an integration, Soda Cloud enables users to select the webhook as the destination for an individual check or checks that form a part of an agreement, or multiple checks.

    To send notifications that apply to multiple checks, see Set notification rules.

    Soda Cloud alert notifications make use of the following events:

    • validate

    • checkEvaluation

    Access a third-party service provider's documentation for details on how to set up an incoming webhook or API call, and obtain a URL to input into the Soda webhook configuration in step 4, above. The following links may be helpful starting points.

    • PagerDuty

    • OpsGenie

    Webhooks for Soda Cloud incident integrations

    You can use a webhook to integrate with a third-party service provider, such as Jira, to track incidents. With such an integration, Soda Cloud displays an external link for the integration in the Incident Details.

    Soda Cloud incident integrations make use of the following events:

    • validate

    • incidentCreated

    • incidentUpdated

    When Soda Cloud sends an incidentCreated event to a webhook endpoint, the third-party service provider can respond with a link message. In such a case, Soda Cloud adds the link to the incident. The following is an example of the response payload.

    For incident integrations with third-party service providers that do not provide a link message in the response, you can use a callback URL. In such a case, when Soda Cloud sends an incidentCreated event to the third-party, you can configure the third-party response to include an incidentLinkCallbackUrl property.

    Configure the third-party response to make a POST request to this callback URL, including the text and url in the body of the JSON payload. Soda Cloud adds the callback URL as an integration link in the incident details.

    The following is an example of the response payload with a callback URL.

    Webhooks for Soda Cloud agreements

You can use a webhook to enable Soda Cloud to send Soda agreement events to a third-party service provider. By integrating Soda with a third-party service provider for version control, such as GitHub, your team can maintain visibility into agreement changes, additions, and deletions.

    Soda Cloud agreement notifications make use of the following events:

    • validate

    • agreementCreated

    • agreementContentsUpdated

    • agreementDeleted

    Access a third-party service provider's documentation for details on how to set up an incoming webhook or API call, and obtain a URL to input into the Soda webhook configuration in step 4, above.

    Soda Cloud expects the integration party to return an HTTP status code 200 success response; it ignores the body of the response.

    Event payloads

    The following list of event payloads outlines the information that Soda Cloud sends when an action triggers a webhook.

    validate

    Soda Cloud sends this event payload to validate that the integration with the third-party service provider works. Soda Cloud sends this event during the guided workflow to set up an integration.

    agreementCreated

    Soda Cloud sends this event payload when a user creates a new agreement in the Soda Cloud account.

    agreementContentsUpdated

Soda Cloud sends this event payload when a user adjusts the contents of an agreement. Soda Cloud does not send this event when an agreement's review status has changed.

    agreementDeleted

    Soda Cloud sends this event payload when a user deletes an agreement in the Soda Cloud account.

    checkEvaluation

    Soda Cloud sends this event payload when it receives new check results. If the check is part of an agreement, the payload includes the agreement identifier.

    incidentCreated

    Soda Cloud sends this event payload when you create a new incident.

    incidentUpdated

    Soda Cloud sends this event payload when an incident has been updated with, for example, a status change, when a new Lead has been assigned, or when check results have been added to the incident.

    Go further

    • As a business user, learn more about writing no-code checks in Soda Cloud.

    • Set notification rules that apply to multiple checks in your account.

    • Learn more about creating, tracking, and resolving data quality Incidents.

    • Access a list of all integrations that Soda Cloud supports.

Need help? Join the Soda community on Slack.

• Soda applies a dataset filter as a WHERE clause in the SQL query that it prepares and executes against your data.
    • Except with a NOW variable, you cannot use variables in checks you write in an agreement in Soda Cloud as it is impossible to provide the variable values at scan time.

    • Known issue: Dataset filters are not compatible with failed rows checks which use a SQL query. With such a check, Soda does not apply the dataset filter at scan time.

    Use in-check filters to exclude rows from an individual check evaluation.

    In-check filters provide the ability to create conditions, or business rules, that data in a column must meet before Soda includes a row in a check evaluation. In other words, Soda first finds rows that match the filter, then executes the check on those rows. As an example, you may wish to use an in-check filter to support a use case in which "Column X must be filled in for all rows that have value Y in column Z".

    When you find yourself adding the same in-check filters to multiple checks, you may wish to promote an in-check filter to a dataset filter.

    How Soda applies filters

Soda uses the checks you define to prepare SQL queries that it executes against the datasets in your data source, putting as many checks under the same checks for header into a single query as it can. An in-check filter translates to a CASE expression, which Soda puts into that same query alongside other unfiltered checks. For a dataset filter, Soda generates a separate query and, again, attempts to put all checks under a checks for header into one query, including any checks that also have an in-check filter. If your checks YAML defines some unfiltered checks for a dataset and applies a dataset filter to other checks on a particular partition of that data, Soda prepares two queries, each of which has several calculated metrics in the SELECT statement; the results then flow back to their respective checks to evaluate whether they pass, warn, or fail.

    Configure in-check filters

✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✔️ Available as a no-code check with a self-hosted Soda Agent connected to any Soda-supported data source, except Spark, Dask, and Pandas, OR with a Soda-hosted Agent connected to a BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake data source

    Add a filter to a check to apply conditions that specify a portion of the data against which Soda executes the check. For example, you may wish to use an in-check filter to support a use case in which “Column X must be filled in for all rows that have value Y in column Z”.

    Add a filter as a nested key:value pair, as in the following example which filters the scan results to display only those rows with a value of 81 or greater and which contain 11 in the sales_territory_key column. You cannot use a variable to specify an in-check filter.
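A minimal sketch of such a check; the dataset and column names are illustrative:

```yaml
checks for dim_reseller:
  - min(annual_sales) >= 81:
      filter: sales_territory_key = 11
```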

    If your filter uses a string as a value, be sure to wrap the string in single quotes, as in the following example.
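For instance, with a hypothetical string-valued column:

```yaml
checks for dim_product:
  - missing_count(weight) = 0:
      filter: color = 'Silver'
```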

    You can use AND or OR to add multiple filter conditions to a filter key:value pair to further refine your results, as in the following example.
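A sketch combining two conditions with AND (hypothetical column names):

```yaml
checks for dim_product:
  - missing_count(weight) = 0:
      filter: color = 'Silver' AND style = 'U'
```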

    To improve the readability of multiple filters in a check, consider adding filters as separate line items, as per the following example.
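Using a YAML text block to put each condition on its own line (hypothetical column names):

```yaml
checks for dim_product:
  - missing_count(weight) = 0:
      filter: |
        color = 'Silver'
        AND style = 'U'
        AND class = 'H'
```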

    If your column names use quotes, these quotes produce invalid YAML syntax which results in an error message. Instead, write the check without the quotes or, if the quotes are mandatory for the filter to work, prepare the filter in a text block as in the following example.
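Preparing the filter in a text block keeps the quoted column name out of the YAML key, as in this sketch with hypothetical names:

```yaml
checks for dim_product:
  - missing_count(weight) = 0:
      filter: |
        "color" = 'Silver'
```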

    Be aware that if no rows match the filter parameters you set, Soda does not evaluate the check. In other words, Soda first finds rows that match the filter, then executes the check on those rows.

    If, in the example above, none of the rows contained a value of 11 in the sales_territory_key column, Soda does not evaluate the check and returns a NOT EVALUATED message in the CLI scan output, such as the following.

    See also: Troubleshoot SodaCL.

    List of compatible metrics and checks

    • all numeric metrics, except duplicate_count and duplicate_percent

    • both missing metrics

    • both validity metrics

    Configure dataset filters

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✖️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    It can be time-consuming to check exceptionally large datasets for data quality in their entirety. Instead of checking whole datasets, you can use a dataset filter to specify a portion of data in a dataset against which Soda Library executes a check.

    • Except with a NOW variable, you cannot use variables in checks you write in an agreement in Soda Cloud as it is impossible to provide the variable values at scan time.

    • Known issue: Dataset filters are not compatible with failed rows checks which use a SQL query. With such a check, Soda does not apply the dataset filter at scan time.

    1. In your checks YAML file, add a section header called filter, then append a dataset name and, in square brackets, the name of the filter. The name of the filter cannot contain spaces. Refer to the example below.

    2. Nested under the filter header, use a SQL expression to specify the portion of data in a dataset that Soda Library must check.

      • The SQL expression in the example references two variables: ts_start and ts_end.

      • Variables must use the following syntax: ${VAR_NAME}.

      • When you run the soda scan command, you must include these two variables as options in the command; see step 5.

    3. Add a separate section for checks for your_dataset_name [filter name]. Any checks you nest under this header execute only against the portion of data that the expression in the filter section defines. Refer to the example below.

    4. Write any checks you wish for the dataset and the columns in it.

    5. When you wish to execute the checks, use Soda Library to run a scan of your data source and use the -v option to include each value for the variables you included in your filter expression, as in the example below.
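Putting the steps together, a minimal sketch; the dataset, column, and variable names are illustrative:

```yaml
filter CUSTOMERS [daily]:
  where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'

checks for CUSTOMERS [daily]:
  - row_count > 0
```

```shell
soda scan -d my_datasource -c configuration.yml -v ts_start=2022-03-15 -v ts_end=2022-03-16 checks.yml
```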

    If you wish to run checks on the same dataset without using a filter, add a separate section for checks for your_dataset_name without the appended filter name. Any checks you nest under this header execute against all the data in the dataset.

    Configure a time partition using the NOW variable

    If your data source is partitioned, or if you wish to apply checks in your agreement to a specific interval of time, you can do so using a dataset filter.

    Use the built-in NOW variable to specify a relative time partition. Reference the following example to add a dataset filter to either your checks YAML file, or to the Write Checks step in the agreement workflow in Soda Cloud. The where clause in the example defines the time partition to mean "now, less one day".
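A sketch of such a filter; the dataset name is illustrative and the checks nested under the header are your own:

```yaml
filter dim_product [daily]:
  where: start_date > TIMESTAMP '${NOW}' - interval '1d'

checks for dim_product [daily]:
  - row_count > 0
```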

    Configure variables in SodaCL

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✖️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    You can use variables in SodaCL to:

    • define dynamic dataset filters

    • customize dynamic check names

    • define dynamic in-check values; see examples below

    • define dynamic in-check filters; see example below

    Except with a NOW variable, you cannot use variables in checks you write in an agreement in Soda Cloud as it is impossible to provide the variable values at scan time.

    To provide a variable at scan time, as with dynamic dataset filters or with in-check values, add a -v option to the scan command and specify the key:value pair for the variable, as in the following example.
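For instance, with hypothetical data source, variable, and file names:

```shell
soda scan -d my_datasource -c configuration.yml -v ts_start=2022-03-15 checks.yml
```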

If you wish, you can provide values for more than one variable at scan time, as in the following example.
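Each variable gets its own -v option; the names here are illustrative:

```shell
soda scan -d my_datasource -c configuration.yml -v ts_start=2022-03-15 -v ts_end=2022-03-31 checks.yml
```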

    Example: customize a check name
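A sketch with a hypothetical TODAY variable, whose value you pass at scan time with -v TODAY=2023-06-01:

```yaml
checks for dim_product:
  - row_count > 0:
      name: Row count in dim_product on ${TODAY}
```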

    See also: Customize check names.

    Example: provide a dataset name at scan time
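A sketch, assuming a DATASET variable of your own naming:

```yaml
checks for ${DATASET}:
  - row_count > 0
```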

    Scan command:
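A hypothetical invocation; the data source, dataset, and file names are placeholders:

```shell
soda scan -d my_datasource -c configuration.yml -v DATASET=dim_customer checks.yml
```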

    Example: provide a column name at scan time
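A sketch, assuming a COLUMN variable of your own naming:

```yaml
checks for dim_customer:
  - missing_count(${COLUMN}) = 0
```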

    Scan command:
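A hypothetical invocation; the data source, column, and file names are placeholders:

```shell
soda scan -d my_datasource -c configuration.yml -v COLUMN=last_name checks.yml
```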

    Example: provide a threshold value at scan time
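A sketch, assuming a ROW_THRESHOLD variable of your own naming:

```yaml
checks for dim_customer:
  - row_count > ${ROW_THRESHOLD}
```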

    Scan command:
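A hypothetical invocation; the data source, threshold value, and file names are placeholders:

```shell
soda scan -d my_datasource -c configuration.yml -v ROW_THRESHOLD=100 checks.yml
```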

    Example: use a variable in an in-check filter
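A sketch, assuming your Soda version supports a variable inside the in-check filter expression; the column and variable names are hypothetical, and the date value is wrapped in single quotes for data sources that require them:

```yaml
checks for dim_customer:
  - missing_count(email_address) = 0:
      filter: date_first_purchase > '${START_DATE}'
```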

    Example: use a variable for a check identity
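A sketch, assuming the identity check configuration key and a hypothetical CHECK_IDENTITY variable you provide at scan time:

```yaml
checks for dim_product:
  - row_count > 0:
      identity: ${CHECK_IDENTITY}
```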

    Read more about adding a check identity.

    Configure variables for connection configuration

    You can use variables to:

    • resolve credentials in configuration files using system variables; see Configure Soda

    • pass variables for values in configuration files; see instructions below

    If you use Soda Library to execute Soda scans for data quality, you can pass variables at scan time to provide values for data source connection configuration keys in your configuration YAML file. For example, you may wish to pass a variable for the value of password in your configuration YAML. Except with a NOW variable, you cannot use variables in checks you write in an agreement in Soda Cloud as it is impossible to provide the variable values at scan time.

    1. Adjust the data source connection configuration in your configuration YAML to include a variable.

2. Save the file, then run a scan that uses a -v option to include the value of the variable in the scan command.

    You can provide the values for multiple variables in a single scan command.
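Putting it together, a sketch with a hypothetical PostgreSQL data source whose password is supplied at scan time:

```yaml
data_source my_database_name:
  type: postgres
  host: localhost
  port: '5432'
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public
```

```shell
soda scan -d my_database_name -c configuration.yml -v POSTGRES_USER=sodacore -v POSTGRES_PASSWORD=secret checks.yml
```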

    Configuration details and limitations

    • Variables must use the following syntax: ${VAR_NAME}.

    • For consistency, best practice dictates that you use upper case for variable names, though you can use lower case if you wish.

    • If you do not explicitly specify a variable value at scan time to resolve credentials for a connection configuration, Soda uses environment variables.

• You cannot use a variable to provide a scan-time value for a check configuration value, such as the value for valid length for an invalid_count check.

    • You may need to wrap date values for variables in single quotes for a check to execute properly. The use of single quotes is bound to the data source, so if your data source demands single quotes around date values for SQL queries, you must also include them when providing date values in SodaCL. Refer to the example at the top of this page.

• Except for using the ${NOW} variable in a dataset filter for checks, you cannot use variables when defining checks in an agreement in Soda Cloud. When using variables, you normally pass the values for those variables at scan time, adding them to the soda scan command with a -v option. However, because scans that execute checks defined in an agreement run according to a scan definition, there is no opportunity to add dynamic values for variables at scan time.

• Known issue: SodaCL does not support using variables in profiling configurations.

    • Except with a NOW variable, you cannot use variables in checks you write in an agreement in Soda Cloud as it is impossible to provide the variable values at scan time.

    Go further

    • Reference tips and best practices for SodaCL.

    • Use a for each configuration to execute checks on multiple datasets.

    • Learn more about Optional check configurations.

Need help? Join the Soda community on Slack.


    Test data quality in an Azure Data Factory pipeline

    Use this guide to invoke Soda data quality tests in an Azure Data Factory pipeline.

    Use this guide as an example of how to set up Soda to run data quality tests on data in an ETL pipeline in Azure Data Factory.

    About this guide

    This guide offers an example of how to set up and trigger Soda to run data quality scans from an Azure Data Factory (ADF) pipeline.

    The Data Engineer in this example has copied data from a PostgreSQL data source to an Azure SQL Server data source and uses Soda reconciliation checks in a Synapse notebook to validate that data copied from the source to the target is the same. Next, they create a second notebook to execute Soda checks to validate the completeness of the ingested data. Finally, the Engineer generates a visualized report of the data quality results.

    This example uses a programmatic deployment model which invokes the Soda Python library, and uses Soda Cloud to validate a commercial usage license and display visualized data quality test results.


    Prerequisites

    The Data Engineer in this example has the following:

    • permission to configure Azure Cloud resources through the user interface

    • access to:

      • an Azure Data Factory pipeline

      • a Synapse workspace

    Python versions Soda supports

Soda officially supports Python versions 3.8, 3.9, and 3.10. Though largely functional, efforts to fully support Python 3.11 and 3.12 are ongoing.

Using Python 3.11, some users might encounter issues with dependency constraints. At times, the combination of Python 3.11 and dependency constraints requires that a dependency be built from source rather than downloaded pre-built.

The same applies to Python 3.12, although some anecdotal evidence indicates that 3.12 might not work in all scenarios due to dependency constraints.

    Create a Soda Cloud account

    To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.

1. In a browser, the engineer navigated to Soda Cloud to create a new Soda account, which is free for a 45-day trial.

2. They navigated to their avatar > Profile, then accessed the API keys tab and clicked the plus icon to generate new API keys.

    3. They copy+pasted the API key values to their Azure Key Vault.

    Use Soda to reconcile data

This example executes checks which, after a data migration, validate that the source and target data match. The first ADF Notebook Activity links to a notebook which contains the Soda connection details, the check definitions, and the script to run a Soda scan for data quality, which executes the reconciliation checks.

    Download the notebook:

1. In the ADF pipeline, the Data Engineer adds a Notebook activity for Synapse to the pipeline. In the Settings tab, they name the notebook Reconciliation Checks.

2. Next, in the Azure Synapse Analytics (Artifacts) tab, they create a linked service that serves to execute the Notebook activity.

3. In the Settings tab, they select the notebook and define the base parameters to pass to it.

1. They create a notebook in their Synapse workspace, then add the following contents that enable Soda to connect with the data sources, and with Soda Cloud. For the sensitive data source login credentials and Soda Cloud API key values, the example fetches the values from an Azure Key Vault. Read more: [Integrate Soda with a secrets manager](#integrate-with-a-secrets-manager)

1. They define the SodaCL reconciliation checks inside another YAML string. The checks include check attributes which they created in Soda Cloud. When added to checks, the Data Engineer can use the attributes to filter check results in Soda Cloud, build custom views, and stay organized as they monitor data quality in the Soda Cloud user interface.

1. Finally, they define the script that runs the Soda scan for data quality, executing the reconciliation checks that validate that the source and target data match. If scan.assert_no_checks_fail() returns an AssertionError indicating that one or more checks have failed during the scan, then the Azure Data Factory pipeline halts.

    Add post-ingestion checks

    Beyond reconciling the copied data, the Data Engineer uses SodaCL checks to gauge the completeness of data. In a new ADF Notebook Activity, they follow the same pattern as the reconciliation check notebook in which they configured connections to Soda Cloud and the data source, defined SodaCL checks, then prepared a script to run the scan and execute the checks.

    Download the notebook:

    Generate a data visualization report

    The last activity in the pipeline is another Notebook Activity which runs a new Synapse notebook called Report. This notebook loads the data into a dataframe, creates a plot of the data, then saves the plot to an Azure Data Lake Storage location.

    Download the notebook:

    Review check results in Soda Cloud

    After running the ADF pipeline, the Data Engineer can access their Soda Cloud account to review the check results.

    In the Checks page, they apply a filter to narrow the results to display only those associated with the Azure SQL Server data source against which Soda ran the data quality scans. Soda displays the results of the most recent scan.

    Go further


• Set notification rules to receive alerts when checks fail.

    Missing metrics

    Use missing metrics in SodaCL checks to detect missing values in a dataset.

    Use a missing metric in a check to surface missing values in the data in your dataset.

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✔️ Available as a no-code check

    Define checks with missing metrics

In the context of SodaCL check types, you use missing metrics in standard checks. Refer to Metrics and checks for exhaustive configuration details.

You can use both missing metrics, missing_count and missing_percent, in checks that apply to individual columns in a dataset; you cannot use missing metrics in checks that apply to entire datasets. Identify the column by adding its name as the argument between brackets in the check.
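A minimal sketch of both metrics; the dataset and column names are illustrative:

```yaml
checks for dim_customer:
  - missing_count(email_address) = 0
  - missing_percent(middle_name) < 5
```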

    python3 -m venv .venv
    source .venv/bin/activate 
    pip install -i https://pypi.cloud.soda.io soda-postgres
    data_source my_database_name:
      type: postgres
      host: 
      port: 
      username: 
      password: 
      database: 
      schema: 
    
    soda_cloud:
      # For US region, use cloud.us.soda.io
      # For EU region, use cloud.soda.io 
      host: cloud.soda.io
      api_key_id: 
      api_key_secret: 
    soda test-connection -d my_datasource -c configuration.yml
    soda suggest -d my_datasource -c configuration.yml -ds your_dataset_name
    soda scan -d my_datasource -c configuration.yml checks.yml
    data_source fulfillment_apac_prod:
       type: postgres
       host: 127.0.0.1
       port: '5432'
       username: ${POSTGRES_USER}
       password: ${POSTGRES_PASSWORD}
       database: postgres
       schema: public
       
    data_source fulfillment_apac_staging:
      type: postgres
      host: localhost
      port: '5432'
      username: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}
      database: postgres
      schema: public
    
data_source fulfillment_apac1_staging:
  type: snowflake
  username: ${SNOWFLAKE_USER}
  password: ${SNOWFLAKE_PASSWORD}
  account: my_account
  database: snowflake_database
  warehouse: snowflake_warehouse
  connection_timeout: 240
  role: PUBLIC
  client_session_keep_alive: true
  authenticator: externalbrowser
  session_parameters:
    QUERY_TAG: soda-queries
    QUOTED_IDENTIFIERS_IGNORE_CASE: false
  schema: public
    soda test-connection -d fulfillment_apac_staging -c configuration.yml
    soda test-connection -d fulfillment_apac1_staging -c configuration.yml
    reconciliation OrdersAPAC:
      label: "Recon APAC orders"
      datasets:
        source:
          dataset: orders_apac
          datasource: fulfillment_apac_staging
        target:
          dataset: orders_apac
          datasource: fulfillment_apac1_staging
      checks:
        - schema
        - row_count diff = 0
    # checks.yml prepared by check suggestions
    filter dim_product [daily]:
      where: start_date > TIMESTAMP'${NOW}' - interval '1d'
    
    checks for dim_product [daily]:
      - schema:
          name: Any schema changes
          fail:
            when schema changes:
              - column delete
              - column add
              - column index change
              - column type change
      - row_count > 0
      - anomaly detection for row_count
      - freshness(start_date) < 398d
      - missing_count(weight_unit_measure_code) = 0
      - missing_count(color) = 0
      - duplicate_count(safety_stock_level) = 0
    # recon.yml
    reconciliation OrdersAPAC:
      label: "Recon datasets"
      ...
      checks:
        - schema
        - row_count diff = 0
        - freshness(start_date) diff = 0
        - missing_count(weight_unit_measure_code) diff = 0
        - missing_count(color) diff = 0
        - duplicate_count(safety_stock_level):
            fail: when diff > 10
            warn: when diff between 5 and 9
    soda scan -d fulfillment_apac_staging -c configuration.yml recon.yml
        reconciliation OrdersAPAC:
          label: "Recon APAC orders"
          datasets:
            source:
              dataset: orders_apac
              datasource: fulfillment_apac_staging
            target:
              dataset: orders_apac
              datasource: fulfillment_apac1_staging
          checks:
            - schema
            - row_count diff = 0
    
        reconciliation DiscountAPAC:
          label: "Recon APAC discount"
          datasets:
            source:
              dataset: discount_apac
              datasource: fulfillment_apac_staging
            target:
              dataset: discount_apac
              datasource: fulfillment_apac1_staging
          checks:
            - schema
            - row_count diff = 0
    reconciliation CommissionAPAC:
      label: "Recon APAC commission"
      datasets:
        source:
          dataset: commission_apac
          datasource: fulfillment_apac_staging
        target:
          dataset: commission_apac
          datasource: fulfillment_apac1_staging
      checks:
        - rows diff = 0
    soda scan -d fulfillment_apac1_prod -c configuration.yml recon.yml
    checks for dim_product:
      - freshness(start_date) < 3d
    checks for dim_product:
      - freshness using end_date with NOW < 1d
    soda scan -d adventureworks -c configuration.yml -v NOW="2022-05-31 21:00:00" checks_test.yml
    checks for dim_product:
      - freshness(createdat::datetime) < 1d
    Invalid staleness threshold "when < 3256d"
      +-> line=2,col=5 in checks_test.yml
    
    Invalid check "freshness(start_date) > 1d": no viable alternative at input ' >'
    Invalid check "freshness(end_date) ${NOW} < 1d": mismatched input '${NOW}' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
    checks for dim_product:
      - freshness using end_date with NOW < 1d
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 checks FAILED: 
          Data is fresh [FAILED]
            max_column_timestamp: 2013-07-01 00:00:00
            max_column_timestamp_utc: 2013-07-01 00:00:00+00:00
            now_variable_name: NOW
            now_timestamp: 2022-09-13T16:40:39.196522+00:00
            now_timestamp_utc: 2022-09-13 16:40:39.196522+00:00
            freshness: 3361 days, 16:40:39.196522
    Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
    checks for dim_product:
      - freshness(start_date) < 27h:
          name: Data is fresh
    checks for dim_product:
      - freshness(start_date):
          warn: when > 3256d
          fail: when > 3258d
    checks for dim_product:
      - freshness(start_date):
          warn: 
            when > 3256d
          fail: 
            when > 3258d
    checks for dim_product:
      - freshness(start_date) < 27h:
          filter: weight = 10
    checks for dim_product:
      - freshness("end_date") < 3d
    for each dataset T:
      datasets:
        - dim_prod%
      checks:
        - freshness(end_date) < 3d
    filter CUSTOMERS [daily]:
  where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    
    checks for CUSTOMERS [daily]:
      - freshness(end_date) < 3d
    # If using without an alert configuration
    <
    # If using with an alert configuration
    >
    checks for dim_department_group:
      - values in (department_group_name) must exist in dim_employee (department_name)
    checks for dim_customer_staging:
      - values in (birthdate) must not exist in dim_customer_prod (birthdate)
    # after adding your Spark session to the scan
    df.createOrReplaceTempView("df")
    df2.createOrReplaceTempView("df2")
    checks for dim_customers:
      - values in (state_code, state_name) must exist in iso_3166-2 (code, subdivision_name):
          samples limit: 20
    checks for dim_customers:
      - values in (state_code, state_name) must exist in iso_3166-2 (code, subdivision_name):
          samples limit: 0
    checks for dim_customers:
      - values in (state_code, state_name) must exist in iso_3166-2 (code, subdivision_name):
          samples columns: [state_code]
    checks for dim_department_group:
      - values in (department_group_name) must exist in dim_employee (department_name):
          name: Compare department datasets
    checks for dim_department_group:
      - values in ("department_group_name") must exist in dim_employee ("department_name")
    filter customers_c8d90f60 [daily]:
      where: ts > TIMESTAMP '${NOW}' - interval '100y'
    
    checks for customers_c8d90f60 [daily]:
      - values in (cat) must exist in customers_europe (cat2)
     kubectl get secret/soda-agent-id -n soda-agent --template={{.data.SODA_AGENT_ID}} | base64 --decode
    soda:
      apikey:
            id: "***"
            secret: "***"
      agent:
            id: "842feab3-snip-87eb-06d2813a72c1"
            name: "myuniqueagent"
    helm get values -n soda-agent soda-agent
    kubectl get secret/soda-agent-apikey -n soda-agent --template={{.data.SODA_API_KEY_ID}} | base64 --decode
    kubectl get secret/soda-agent-apikey -n soda-agent --template={{.data.SODA_API_KEY_SECRET}} | base64 --decode
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    kubectl describe pods
    kubectl config get-contexts
    kubectl config use-context <name of cluster>
    helm list
    NAME      	NAMESPACE 	REVISION	UPDATED                             	STATUS	  CHART            	APP VERSION     
    soda-agent	soda-agent	5       	2023-01-20 11:55:49.387634 -0800 PST	deployed	soda-agent-0.8.26	Soda_Library_1.0.0
    helm get values -n <namespace> <release name>
    helm get values -n soda-agent soda-agent 
    helm search hub soda-agent
    helm repo update
    helm upgrade <release> <chart>
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** 
    helm upgrade soda-agent soda-agent/soda-agent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** 
    helm upgrade soda-agent soda-agent/soda-agent \
       --values values-local.yml --namespace soda-agent
    soda:
      # These values will also be used to authenticate to the Soda image registry
      apikey:
        id: existing-key-id
        secret: existing-key-secret
    soda:
      apikey:
        id: existing-key-id
    secret: existing-key-secret
    imageCredentials:
      apikey:
        id: my-new-key-id
        secret: my-new-key-secret
    soda:
      apikey:
        id: ***
        secret: ***
        
# This is no longer supported
# imagePullSecrets:
#   - name: my-existing-secret

# Instead, use this!
existingImagePullSecrets:
  - name: my-existing-secret
    soda:
      apikey:
        id: ***
        secret: ***
      cloud:
        # This also sets the correct endpoint under the covers.
        region: "us"
        
        # This can be removed now, as the region property sets this up correctly. 
        # endpoint: https://cloud.us.soda.io
    soda:
      apikey:
        id: ***
        secret: ***
  # Rename this ...
  # scanlauncher:
  # to become
  scanLauncher:
        existingSecrets:
          - soda-agent-secrets 
    pip install -i https://pypi.cloud.soda.io soda-redshift -U
    pip freeze | grep soda
    pip uninstall soda-postgres
    pip freeze | grep soda | xargs pip uninstall -y
    pip freeze | grep soda | xargs pip uninstall -y
    pip install -i https://pypi.cloud.soda.io soda-postgres
    helm repo add external-secrets https://charts.external-secrets.io
    
    helm install external-secrets \
       external-secrets/external-secrets \
        -n external-secrets \
        --create-namespace
    apiVersion: external-secrets.io/v1beta1
    kind: ClusterSecretStore
    metadata:
      name: vault-app-role
    spec:
      provider:
        vault:
          auth:
            appRole:
              path: approle
              roleId: 3e****54-****-936e-****-5c5a19a5eeeb
              secretRef:
                key: appRoleSecretId
                name: external-secrets-vault-app-role-secret-id
                namespace: external-secrets
          path: kv
          server: http://vault.vault.svc.cluster.local:8200
          version: v2
    soda:
      apikey:
        id: "***"
        secret: "***"
      agent:
        name: "myuniqueagent"
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    soda:
        apikey:
          id: "***"
          secret: "***"
        agent:
          name: "myuniqueagent"
        env:
          POSTGRES_USER: "sodalibrary"
          POSTGRES_PASS: "sodalibrary"
    helm upgrade soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    data_source local_postgres_test:
        type: postgres
        host: 172.17.0.7
        port: 5432
        username: ${POSTGRES_USER}
        password: ${POSTGRES_PASS}
        database: postgres
        schema: new_york
    kubectl -n external-secrets get all
    kubectl apply -f cluster-secret-store.yaml
    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: soda-agent
      namespace: soda-agent
spec:
  data:
    - remoteRef:
        key: local/soda
        property: POSTGRES_USERNAME
      secretKey: POSTGRES_USERNAME
    - remoteRef:
        key: local/soda
        property: POSTGRES_PASSWORD
      secretKey: POSTGRES_PASSWORD
  refreshInterval: 1m
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-app-role
  target:
    name: soda-agent-secrets
    template:
      data:
        soda-agent.conf: |
          POSTGRES_USERNAME={{ .POSTGRES_USERNAME }}
          POSTGRES_PASSWORD={{ .POSTGRES_PASSWORD }}
      engineVersion: v2
kubectl apply -n soda-agent -f soda-secret.yaml
    kubectl get secret -n soda-agent soda-agent-secrets
    NAME                 TYPE     DATA   AGE
    soda-agent-secrets   Opaque   1      24h
    soda:
       apikey:
         id: "154k***889"
         secret: "9sfjf****ff4"
       agent:
         name: "my-soda-agent-external-secrets"
       scanlauncher:
         existingSecrets:
           # from spec.target.name in the ExternalSecret file
           - soda-agent-secrets 
       cloud:
         # Use https://cloud.us.soda.io for US region 
         # Use https://cloud.soda.io for EU region
         endpoint: "https://cloud.soda.io"
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    NAME: soda-agent
    LAST DEPLOYED: Tue Aug 29 13:08:51 2023
    NAMESPACE: soda-agent
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    NOTES:
    Success, the Soda Agent is now running. 
    You can inspect the Orchestrators logs if you like, but if all was configured correctly, the Agent should show up in Soda Cloud. 
    Check the logs using:
         kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent
    soda:
      apikey:
        existingSecret: "<existing-secret-name>"
        secretKeys:
          idKey: "<key-for-api-id>"
          secretKey: "<key-for-api-secret>"
    soda:
      scanlauncher:
        config:
          query_cursor_hard_limit: null
        resources:
          limits:
            memory: 2Gi
    checks for dim_product:
      - schema:
          name: Confirm that required columns are present
          warn:
            when required column missing: [weight_unit_measure_code, product_subcategory_key, made_up_column]
          fail:
            when required column missing:
              - product_key 
              - product_alternate_key
      - schema:
          warn:
            when forbidden column present: [credit_card]
            when wrong column type:
              standard_cost: money
          fail:
            when forbidden column present: [pii*]
            when wrong column type:
              reorder_point: smallint
      - schema:
          name: Columns out of order
          warn:
            when wrong column index:
              style: 1
          fail:
            when wrong column index:
              model_name: 22
      - schema:
          name: Any schema changes
          warn: 
            when schema changes: any
    checks for dim_product:
      - schema:
          fail:
            when required column missing:
              - standard_cost
              - list_price
              - weight
    checks for dim_product:
      - schema:
          fail:
            when required column missing: [standard_cost, list_price, weight]
    
    checks for dim_product:
      - schema:
          warn:
            when forbidden column present: [standard_cost]
            when wrong column type:
              standard_cost: money
              weight: double precision
          fail:
            when forbidden column present: [sombrero]
            when wrong column type:
              reorder_point: smallint
    checks for dim_employee:
       - schema:
          schema_name: staff.pr
          name: Required columns present
          warn:
            when required column missing: [last_name, birth_date]
    checks for dim_customer:
      - schema:
          warn:
            when schema changes: any
          fail:
            when schema changes: 
             - column delete
             - column add
             - column index change
             - column type change
    checks for dim_product:
      - schema:
          name: Confirm that required columns are present
          warn:
            when required column missing: [weight_unit_measure_code, product_subcategory_key]
    checks for dim_product:
      - schema:
          warn:
            when forbidden column present: [standard_cost]
    checks for dim_product:
      - schema:
          warn:
            when wrong column type:
              standard_cost: "money"
    checks for dim_product:
      - schema:
          fail:
            when forbidden column present:
              - credit_card
              - obsolete_%
              - '%SALARY%'
              - pii*
    for each dataset T:
      datasets:
        - dim_product_%
      checks:
        - schema:
           warn:
             when schema changes: any
    filter CUSTOMERS [daily]:
      where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    
    checks for CUSTOMERS [daily]:
      - schema:
          fail:
            when forbidden column present:
              - credit_card
    checks for dim_product:
      - schema:
          name: Required columns all present
          warn:
            when required column missing: [weight_unit_measure_code, product_subcategory_key, made_up_column]
          fail:
            when required column missing: [pretend_column]
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check FAILED: 
        dim_product in adventureworks
          Required columns all present [FAILED]
            fail_missing_column_names = [pretend_column]
            warn_missing_column_names = [made_up_column]
            schema_measured = [product_key integer, product_alternate_key character varying ...]
    Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 7845***
    for each dataset R:
      tables:
        # Apply the check to any dataset that begins with retail.
        - retail%
      checks:
        - schema:
            fail:
              when forbidden column present: ['*name*', '*address*', '*phone*', '*email*']
    # Check that a dataset contains rows
    checks for dataset_name:
      - row_count > 0
    # Compare row counts between datasets
    checks for dataset_name:
      - row_count same as other_dataset_name
    soda scan -d datasource_name -c configuration.yml checks.yml
    # Check that a column does not contain any duplicate values
    checks for dataset_name:
      - duplicate_count(column_name) = 0
    # Check that duplicate pairs do not exist between columns
    checks for dataset_name:
      - duplicate_count(column_name1, column_name2) = 0
    soda scan -d datasource_name -c configuration.yml checks.yml
    # Check that data in dataset is less than one day old
    checks for dataset_name:
      - freshness(timestamp_column_name) < 1d
    soda scan -d datasource_name -c configuration.yml checks.yml
    # Check that there are no null values in a column
    checks for dataset_name:
      - missing_count(column_name) = 0
    # Check an email column that all values are in email format
    checks for dataset_name:
      - invalid_percent(email_column_name) > 5%:
          valid format: email
    # Check that fewer than 5% of values in column contain missing values
    checks for dataset_name:
      - missing_percent(column_name) < 5%:
          missing values: [N/A, '0000', none]
    soda scan -d datasource_name -c configuration.yml checks.yml
    # Check that values in a column exist in another column in a different dataset
    checks for dataset_name:
      - values in (state_code) must exist in iso_3166-2 (code)
    # Check that values in two columns exist in two other columns in a different dataset
    checks for dataset_name:
      - values in (column_name1, column_name2) must exist in different_dataset_name (other_column1, other_column2)
    soda scan -d datasource_name -c configuration.yml checks.yml
    # Check for any schema changes to dataset
    checks for dataset_name:
      - schema:
          warn: 
            when schema changes: any
    # Check for absent or forbidden columns in dataset
    checks for dataset_name:
      - schema:
          warn:
            when required column missing: [column_name]
          fail:
            when forbidden column present: [column_name1, column_name2]
    soda scan -d datasource_name -c configuration.yml checks.yml
    checks for corp_value:
      - failed rows:
          fail condition: true and "column.name.PX" IS NOT null
    discover datasets:
      datasets:
        - prod% # all datasets starting with prod
        - include prod% # same as above
        - exclude dev% # exclude all datasets starting with dev
    profile columns:
      columns:
        - datasetA.columnA # columnA of datasetA
        - datasetA.% # all columns of datasetA
        - dataset%.columnA # columnA of all datasets starting with dataset
        - dataset%.% # all columns of datasets starting with dataset
        - "%.%" # all datasets and all columns
        - include datasetA.% # same as datasetA.%
        - exclude datasetA.prod% # exclude  all columns starting with prod in datasetA
        - exclude dimgeography.% # exclude all columns of dimgeography dataset 
    discover datasets:
      datasets:
        - include %
        - exclude test%
    discover datasets:
      datasets:
        - include customer_
    discover datasets:
      datasets:
        - include north\_*
    discover datasets:
      datasets:
        - include retailorders
    discover datasets:
      datasets:
        - exclude %
    profile columns:
      columns:
        - retail\_orders.%
    profile columns:
      columns:
        - retail\_orders.billing\_address
        - fulfillment.discount
    profile columns:
      columns:
        - exclude %.%
    data_source soda_demo:
      type: sqlserver
      host: localhost
      username: ${SQL_USERNAME}
      password: ${SQL_PASSWORD}
      quote_tables: true
    // > POST [webhook URL]
    {
      "event": "incidentCreated",
      // ...
    }
    // < 200 OK
    {
      "link": {
        "url": "https://sodadata.atlassian.net/browse/SODA-69",
        "text": "[SODA-69] Notification & Incident Webhook"
      }
    }
    // > POST [webhook URL]
    {
      "event": "incidentCreated",
      "incident": { ... },
      "incidentLinkCallbackUrl": "https://cloud.soda.io/integrations/webhook/8224bbc2-2c80-4c6d-a*****/incident-link/510fad8c-dc43-419a-a122-712a***/uLYosxWNwVGHSdR-_noJjlNAA--WyQwe1ygqGBg*****Q"
    }
    // < 200 OK
    { }
    Followed by a POST request to incidentLinkCallbackUrl:
    // > POST https://cloud.soda.io/integrations/webhook/8224bbc2-2c80-4c6d-a002-16***4e/incident-link/510fad8c-dc43-419a-a122-7***97/uLYosxWNwVGHSdR-_noJjlNAA--WyQwe1ygqGBg****IrQ
    {
      "url": "https://sodadata.atlassian.net/browse/SODA-69",
      "text": "[SODA-69] Notification & Incident Webhook"
    }
    {
      "event": "validate",
      "sentAt": "2022-10-01T09:12:10.042323Z" 
    }
    {
      "event": "agreementCreated",
      "agreement": {
        "id": "string",
        "sodaCloudUrl": "string",
        "label": "string",
        "testsFile": {
          "path": "string",
          "contents": "string"
        },
        "createdBy": {
          "email": "[email protected]"
        }
      }
    }
    {
      "event": "agreementContentsUpdated",
      "agreement": {
        "id": "string",
        "sodaCloudUrl": "string",
        "label": "string",
        "testsFile": {
          "path": "string",
          "contents": "string"
        },
        "updatedBy": {
          "email": "[email protected]"
        }
      }
    }
    {
      "event": "agreementDeleted",
      "agreement": {
        "id": "string",
        "label": "string",
        "testsFile": {
          "path": "string"
        },
        "deletedBy": {
          "email": "[email protected]"
        }
      }
    }
    {
      "event": "checkEvaluation",
      "checkResults": [
        {
          "id": "39d706c3-5a48-4f4b***",
          "sodaCloudUrl": "https://cloud.soda.io/checks/39d706c3-5a48-b",
          "definition": "checks for SODATEST_Customers_6f90f4ad:\ncount same as SODATEST_RAWCUSTOMERS_7275c02c in postgres2",
          "datasets": [
            {
              "id": "e8f1fe55-ae3c-44bd-",
              "sodaCloudUrl": "https://cloud.soda.io/datasets/e8f1fe55-ae3c",
              "name": "bnm_orders",
              "label": "bnm_orders",
              "tags": [],
              "owner": {
                "id": "31781df5-93cf-***",
                "email": "[email protected]"
              },
              "datasource": {
                "id": "5a152025-26f6-",
                "name": "sodaspark",
                "label": "sodaspark"
              },
              "attributes": [
                {
                  "id": "f0cd7b0f-4ac6-42a1-",
                  "label": "Data Domain",
                  "name": "data_domain",
                  "value": "Product"
                },
                {
                  "id": "32986775-3c7a-4a81-bfdb-5f9853746c39",
                  "label": "Origin",
                  "name": "origin",
                  "value": "Pipeline"
                }
              ]
            }
          ],
          "column": "columnName",
          // pass, warn or fail
          "outcome": "pass",
          "dataTimestamp": "2022-01-04T09:49:48.060897Z",
          "diagnostics": {
            "value": 0.0
          },
          // included when a check belongs to an agreement
          "agreement": {
            "id": "AGREEMENT-001-0000-0000-0",
            "sodaCloudUrl": "https://cloud.soda.io/agreements/AGREEMEN-T001-0000-0000-0",
            "label": "My new agreement pending",
            "approvalState": "pending",
            "evaluationResult": "warning"
          }
        }
      ]
    }
    {
      "event": "incidentCreated",
      "incident": {
        "id": "e1f399a3-09ea-***",
        "sodaCloudUrl": "https://cloud.soda.io/incidents/e1f399a3-******-1992d2744ef6",
        "number": 196,
        "title": "Invalid customer ids",
        "description": "Invalid customer ids",
        "severity": "major",
        "status": "opened",
        "createdTimestamp": "2022-05-18T06:07:34Z",
        "lastUpdatedTimestamp": "2022-05-18T06:08:23Z",
        "resolutionNotes": "Stan is fixing the issue",
        "resolutionTimestamp": "2022-05-18T06:08:22.620196441Z",
        "links": [
          {
            "integrationType": "slack",
            "name": "soda-inc-196-2022-05-18-invalid-customer-ids",
            "url": "https://example.slack.com/channels/C03FU9GR7P7"
          }
        ],
        "lead": {
          "id": "31781df5-93cf-***",
          "email": "[email protected]"
        },
        "reporter": {
          "id": "31781df5-***",
          "email": "[email protected]"
        },
        "checkResults": [
          // Contains the same payload as 
          // event checkEvaluation
        ]
      },
      "incidentLinkCallbackUrl": "https://cloud.soda.io/integrations/webhook/8224bbc2-******-16907465484e/incident-link/510fad8c-******-712a23f27197/uL******Kr6rvMcIrQ*"
    }
    {
      "event": "incidentUpdated",
      "incident": {
        // Contains the same payload as 
        // event incidentCreated
      }
    }
    # In-check filter
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours for US Sales
          filter: sales_territory_key = 11
    # Dataset filter with variables
    filter CUSTOMERS [daily]:
      where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    
    checks for CUSTOMERS [daily]:
      - row_count = 6
      - missing(cat) = 2
    # In-check variable 
    checks for ${DATASET}:
      - invalid_count(last_name) = 0:
          valid length: 10 
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours for US Sales
          filter: sales_territory_key = 11
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours for US Sales
          filter: middle_name = 'Henry'
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours for US Sales
          filter: sales_territory_key = 11 AND salaried_flag = 1
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours for US Sales
          filter: sales_territory_key = 11 AND 
                  sick_leave_hours > 0 OR
                  pay_frequency > 1
    checks for my_dataset:
      - missing_count("Email") = 0:
          name: missing email
          filter: |
            "Status" = 'Client'  
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check NOT EVALUATED: 
        dim_employee in adventureworks
          Too many vacation hours for US Sales [NOT EVALUATED]
            check_value: None
    1 checks not evaluated.
    Apart from the checks that have not been evaluated, no failures, no warnings and no errors.
    filter sodatest_dataset [daily]:
      where: ts > TIMESTAMP '${NOW}' - interval '1d'
    
    checks for sodatest_dataset [daily]:
      - duplicate_count(email_address) < 5
    soda scan -d aws_postgres_retail -c configuration.yml -v TODAY=2022-03-31 checks.yml
    soda scan -d aws_postgres_retail duplicate_count_filter.yml -v date=2022-07-25 -v name='rowcount check'
    variables:
      name: Customers UK
    checks for dim_customer:
      - row_count > 1:
         name: Row count in ${name}
    checks for ${DATASET}:
      - invalid_count(last_name) = 0:
          valid length: 10 
    soda scan -d my_datasource_name -c configuration.yml -v DATASET=dim_customer checks.yml
    checks for dim_customer:
      - invalid_count(${COLUMN}) = 0:
          valid length: 10 
    soda scan -d my_datasource_name -c configuration.yml -v COLUMN=last_name checks.yml
    checks for dim_customer:
      - invalid_count(last_name) = ${LENGTH}:
          valid length: 10 
    soda scan -d my_datasource_name -c configuration.yml -v LENGTH=0 checks.yml
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours for US Sales
          filter: sales_territory_key = ${SALES_TER}
    checks for dim_product:
      - row_count > 0:
          identity: ${IDENTITY}
    data_source adventureworks:
      type: postgres
      host: localhost
      username: noname
      password: ${PASSWORD}
      database: sodacore
      schema: public
    soda scan -d adventureworks -c configuration.yml -v PASSWORD=123abc checks.yml
    soda scan -d adventureworks -c configuration.yml -v USERNAME=sodacore -v PASSWORD=123abc -v FRESH_NOW="2022-05-31 21:00:00" checks.yml
    profile columns:  
      columns:
        - "%.%"  # Includes all your datasets
        - prod%  # Includes all datasets that begin with 'prod'

    Manage data sources and agents

    • Add, edit, or delete a new data source in Soda Cloud
    • Add, edit, or delete a new data source via Soda Library
    • Add, edit, or delete a self-hosted Soda agent

    ✓

    Manage notification rules

    • Create, edit, or delete notification rules

    ✓

    ✓

    Manage organization settings Read more

    • Manage organization settings
    • Deactivate users
    • Create, edit, or delete user groups
    • Create, edit, or delete dataset roles
    • Create, edit, or delete global roles
    • Assign global roles to users or user groups
    • Add, edit, or delete integrations
    • Access and download the audit trail

    ✓

    Manage scan definitions

    • Create, edit, or delete scan definitions.

    ✓

    ✓

    n/a 1

    • Read-write access to all agreements
    • Read-write access to all datasets

    ✓

    Audit Trail

    Download a CSV file that contains user audit trail information.

    filter CUSTOMERS [daily]:
       where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    checks for CUSTOMERS [daily]:
      - row_count = 6
      - missing(cat) = 2
    soda scan -d snowflake_customer_data -v ts_start=2022-03-11 -v ts_end=2022-03-15 checks.yml

    a dedicated SQL pool in Synapse

  • a dedicated Apache Spark pool in Synapse

  • an external source SQL database such as PostgreSQL

  • an Azure Data Lake Storage account

  • an Azure Key Vault

  • The above-listed resources have permissions to interact with each other; for example the Synapse workspace has permission to fetch secrets from the Key Vault.

  • Python 3.8, 3.9, or 3.10

  • Pip 21.0 or greater

  • The Spark Pool that runs the notebook must have the Soda Library packages it needs to run scans of the data. Before creating the notebook in the Synapse workspace, add a requirements.txt file to the Spark Pool and include the following contents. Access Spark Pool instructions. Because this example runs scans on both the source (PostgreSQL) and target (SQL Server) data sources, it requires two Soda Library packages.

    Need help? Join the Soda community on Slack.

    • SodaCL considers NULL as the default value for "missing".

    • If you wish, you can add a % character to the threshold for a missing_percent metric for improved readability.

    You can use missing metrics in checks with fixed thresholds, or relative thresholds, but not change-over-time thresholds. See Checks with fixed thresholds for more detail.

    What is a relative threshold?

    When it scans a column in your dataset, Soda automatically separates all values in the column into one of three categories:

    • missing

    • invalid

    • valid

    Soda then performs two calculations. The sum of the count for all categories in a column is always equal to the total row count for the dataset.

    missing count(column name) + invalid count(column name) + valid count(column name) = row count

    Similarly, a calculation that uses percentage always adds up to a total of 100 for the column.

    missing percent(name) + invalid percent(name) + valid percent(name) = 100

    These calculations enable you to write checks that use relative thresholds. In the missing_percent example above, the missing values (in this case, NULL) of the number_employees column must be less than five percent of the total row count, or the check fails. Percentage thresholds are between 0 and 100, not between 0 and 1.
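As a sketch, the missing_percent check described above looks like the following in SodaCL; the dataset name dim_customer is assumed for illustration:

```yaml
checks for dim_customer:
  # Fails if NULL values make up 5% or more of the rows in the column
  - missing_percent(number_employees) < 5%
```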

    Specify missing values or missing regex

    SodaCL considers NULL as the default value for "missing". In the two check examples above, Soda executes the checks to count the number of values that are NULL, or the percent of values that are NULL relative to the total row count of the column.

    However, you can use a nested configuration key:value pair to provide your own definition of a missing value. See List of configuration keys below.

    A check that uses a missing metric has four or six mutable parts:

    a metric

    an argument

    a comparison symbol or phrase

    a threshold

    a configuration key (optional)

    a configuration value (optional)

    The example below defines two checks. The first check applies to the column last_name. The missing values configuration key specifies that if any of the three values in the list exists in a row in that column, Soda recognizes it as a missing value. The check fails if Soda discovers more than five values that match NA, n/a, or 0.

    • Values in a list must be enclosed in square brackets.

    • Known issue: Do not wrap numeric values in single quotes if you are scanning data in a BigQuery data source.

    The second check uses a regular expression to define what qualifies as a missing value in the first_name column so that any values that are N/A qualify as missing. This check passes if Soda discovers no values that match the pattern defined by the regex.
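The two checks described above can be sketched as follows; the dataset name is illustrative, and note the BigQuery caveat about quoting numeric values:

```yaml
checks for dim_customer:
  # First check: custom missing values supplied as a list
  - missing_count(last_name) < 5:
      missing values: [NA, n/a, '0']
  # Second check: custom missing values defined by a regex
  - missing_count(first_name) = 0:
      missing regex: (?:N/A)
```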

    First check:

    • metric: missing_count
    • argument: last_name
    • comparison symbol: <
    • threshold: 5
    • configuration key: missing values
    • configuration value(s): NA, n/a, 0

    Second check:

    • metric: missing_count
    • argument: first_name
    • comparison symbol or phrase: =
    • threshold: 0
    • configuration key: missing regex
    • configuration value(s): (?:N/A)

    Failed row samples

    Checks with missing metrics automatically collect samples of any failed rows to display in Soda Cloud. The default number of failed row samples that Soda collects and displays is 100.

    If you wish to limit or broaden the sample size, you can use the samples limit configuration in a check with a missing metric. You can add this configuration to your checks YAML file for Soda Library, or when writing checks as part of an agreement in Soda Cloud. See: Set a sample limit.

    For security, you can add a configuration to your data source connection details to prevent Soda from collecting failed rows samples from specific columns that contain sensitive data. See: Disable failed row samples.

    Alternatively, you can set the samples limit to 0 to prevent Soda from collecting and sending failed rows samples for an individual check, as in the following example.

    You can also add a samples columns or a collect failed rows configuration to a check to specify the columns for which Soda must implicitly collect failed row sample values, as in the following example, which uses the former. Soda only collects this check’s failed row samples for the columns you specify in the list. See: Customize sampling for checks.

    Note that the comma-separated list of samples columns does not support wildcard characters (%).
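A sketch of both sampling configurations described above; the dataset and column names are assumptions:

```yaml
checks for dim_customer:
  # Prevent failed row sample collection for this check only
  - missing_count(last_name) = 0:
      samples limit: 0
  # Collect failed row samples only for the listed columns (no wildcards)
  - missing_count(first_name) = 0:
      samples columns: [first_name, last_name]
```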

    To review the failed rows in Soda Cloud, navigate to the Checks dashboard, then click the row for a check for missing values. Examine failed rows in the Failed Rows Analysis tab; see Manage failed row samples for further details.

    Optional check configurations

    Supported
    Configuration
    Documentation

    ✓

    Define a name for a check with missing metrics; see .

    ✓

    Add an identity to a check.

    ✓

    Define alert configurations to specify warn and fail thresholds; see .

    ✓

    Apply an in-check filter to return results for a specific portion of the data in your dataset; see .

    Example with check name

    Example with alert configuration

    Example with in-check filter

    Example with quotes

    Example with for each

    Example with dataset filter

    List of missing metrics

    Metric: missing_count
    Column config key: missing values
    Description: The number of rows in a column that contain NULL values and any other user-defined values that qualify as missing.
    Supported data types: number, text, time

    Metric: missing_count
    Column config key: missing regex
    Description: The number of rows in a column that contain NULL values and any other user-defined values that qualify as missing.
    Supported data types: text

    Metric: missing_percent
    Column config key: missing values
    Description: The percentage of rows in a column, relative to the total row count, that contain NULL values and any other user-defined values that qualify as missing.

    List of configuration keys

    The column configuration key:value pair defines what SodaCL ought to consider as missing values.

    Column config key: missing regex
    Description: Specifies a regular expression to define your own custom missing values.
    Values: regex, no forward slash delimiters, string only

    Column config key: missing values
    Description: Specifies the values that Soda is to consider missing.
    Values: values in a list

    List of comparison symbols and phrases

    Go further

    • Use missing metrics in checks with alert configurations to establish warn and fail zones

    • Use missing metrics in checks to define ranges of acceptable thresholds using boundary thresholds.

    • Reference tips and best practices for SodaCL.


    Need help? Join the Soda community on Slack.

    -

    ✓

    Use quotes when identifying dataset or column names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    Use quotes in a check

    ✓

    Use wildcard characters ( % or * ) in values in the check; see example.

    See note in example below.

    ✓

    Use for each to apply schema checks to multiple datasets in one scan; see example.

    Apply checks to multiple datasets

    ✓

    Apply a dataset filter to partition data during a scan; see example.

    Scan a portion of your dataset



    Enable to use this webhook to track and resolve incidents in Soda Cloud.

    Check to allow users to send incident information to a destination. For example, a user creating a new incident can choose to use this webhook to create a new issue in Jira.

    Send events to this webhook when an agreement is created, updated, or removed.

    Check to automatically send notifications to a third-party service provider whenever a user adds, changes, or removes an agreement.


    User-defined checks

    Use a SodaCL user-defined check to define elements of a check using SQL expressions or queries.

    If the built-in set of metrics and checks that SodaCL offers do not quite give you the information you need from a scan, you can define your own metrics to customize your checks. User-defined checks essentially enable you to create common-table expressions (CTE) or SQL queries that Soda Library runs during a scan, or you can reference a file that contains your CTE or SQL query.

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✔️ SQL-defined metric available as a no-code check with a self-hosted Soda Agent connected to any Soda-supported data source, except Spark, and Dask and Pandas OR with a Soda-hosted Agent connected to a BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake data source

    Define user-defined checks

    In the context of SodaCL check types, these are user-defined checks. Strictly speaking, it is the metric that you define yourself, then use in a check.

    The example below uses a common table expression (CTE) to define the metric that is then used in the check. The check itself follows the simple pattern of a standard check that uses a metric, a comparison symbol or phrase, and a threshold.

    You specify the CTE value for the custom metric using a nested expression key which also defines the name of the new custom metric. The name you provide for a custom metric must not contain spaces.
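For illustration, a CTE-based user-defined metric might look like the following; the metric name avg_surface and the dataset and column names are assumptions:

```yaml
checks for dim_reseller:
  # avg_surface is the custom metric; the nested expression key defines it
  - avg_surface between 1068 and 1069:
      avg_surface expression: AVG(min_payment_type * number_employees)
```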

    Instead of using CTE to define a custom metric, you can use a SQL query. The example check below follows the same standard check pattern, but includes a nested query key to define the custom metric and its name.
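A sketch of the query-based variant; the metric name product_stock and the dataset are illustrative:

```yaml
checks for dim_product:
  # product_stock is the custom metric; the nested query key defines it
  - product_stock >= 50:
      product_stock query: |
        SELECT COUNT(*)
        FROM dim_product
        WHERE days_to_manufacture = 0
```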

    • The name you provide for a custom metric must not contain spaces.

    • Though you specify the dataset against which to run the query in the SQL query, you must also provide the dataset identifier in the checks for section header. Without the dataset identifier, Soda cannot send the check results to Soda Cloud.

    Instead of embedding an expression or a query directly in the check definition, you can direct Soda to use a query or expression you have defined in a different file. The example check below follows the same pattern as the metrics that use CTEs or SQL queries, but includes a nested key that identifies the file path of your query file.

    • The name you provide for a custom metric must not contain spaces.

    • Though you specify the dataset against which to run the query in the SQL query, you must also provide the dataset identifier in the checks for section header. Without the dataset identifier, Soda cannot send the check results to Soda Cloud.

    You can also use a user-defined metric with an anomaly detection metric by defining the check, then nesting the query for the custom metric in the check, as in the following example.

    Optional check configurations

    Supported
    Configuration
    Documentation

    Example with check name

    Example with alert configuration

    Example with quotes

    Example with for each

    Example with dataset filter

    Example with failed row sample query

    Example with column parameter

    List of comparison symbols and phrases

    Go further

    • Learn more about in general.

    • Borrow user-defined check syntax to define a reusable .

    • Use a to discover missing or forbidden columns in a dataset.

    • Reference .

    --extra-index-url https://pypi.cloud.soda.io
    soda-postgres
    soda-sqlserver
    from notebookutils import mssparkutils
    
    config_str = f"""
    data_source postgres_data:
      type: postgres
      host: soda-demo.xxx.eu-west-1.rds.amazonaws.com
      port: 5432
      username: my_user
      password: {mssparkutils.credentials.getSecret('soda-vault' , 'postgres-pw')}
      database: postgres
      schema: soda_demo_data_testing
    data_source azure_sql_data:
      type: sqlserver
      driver: ODBC Driver 18 for SQL Server
      host: soda.sql.azuresynapse.net
      port: xxxx
      username: my_sql_user
      password: {mssparkutils.credentials.getSecret('soda-vault' , 'sql-pw')}
      database: soda_sqlserver
      schema: soda_demo_data_testing
    soda_cloud:
      host: cloud.us.soda.io
      api_key_id: {mssparkutils.credentials.getSecret('soda-vault' , 'soda-api-key-id')}
      api_key_secret: {mssparkutils.credentials.getSecret('soda-vault' , 'soda-api-key-secret')}
    """
    check_str = """reconciliation retail_customers:
      label: 'Reconcile Postgres source and Azure SQL target'
      datasets:
        source:
          dataset: retail_customers
          datasource: postgres_data
        target:
          dataset: retail_customers
          datasource: azure_sql_data
    
      checks:
        - row_count diff = 0:
            attributes:
              data_quality_dimension: [Reconciliation, Volume]
              pipeline: ADF_pipeline_demo
              pipeline_stage: Migration
              data_domain: Sales
        - duplicate_count(customer_id):
            fail: when diff > 0
            attributes:
              data_quality_dimension: [Reconciliation, Uniqueness]
              pipeline: ADF_pipeline_demo
              pipeline_stage: Migration
              data_domain: Sales
        - missing_count(customer_id):
            fail: when diff > 0
            attributes:
              data_quality_dimension: [Reconciliation, Completeness]
              pipeline: ADF_pipeline_demo
              pipeline_stage: Migration
              data_domain: Sales
        - missing_count(country_code):
            fail: when diff > 0
            attributes:
              data_quality_dimension: [Reconciliation, Completeness]
              pipeline: ADF_pipeline_demo
              pipeline_stage: Migration
              data_domain: Sales
    """
    from soda.scan import Scan
    scan = Scan()
    scan.set_data_source_name('azure_sql_data')
    scan.add_configuration_yaml_str(config_str)
    scan.set_scan_definition_name('reconciliation')
    scan.set_verbose(True)
    scan.add_sodacl_yaml_str(check_str)
    scan.execute()
    scan.assert_no_checks_fail()
    ## Configure connections to the data source and Soda Cloud
    config_str = f"""
    data_source azure_sql_data:
      type: sqlserver
      driver: ODBC Driver 18 for SQL Server
      host: soda.sql.azuresynapse.net
      port: xxxx
      username: my_sql_user
      password: {mssparkutils.credentials.getSecret('soda-vault' , 'sql-pw')}
      database: soda_sqlserver
      schema: soda_demo_data_testing
    soda_cloud:
      host: cloud.us.soda.io
      api_key_id: {mssparkutils.credentials.getSecret('soda-vault' , 'soda-api-key-id')}
      api_key_secret: {mssparkutils.credentials.getSecret('soda-vault' , 'soda-api-key-secret')}
    """
    ## Define data quality checks using Soda Checks Language (SodaCL)
    check_str = """checks for retail_customers:
    - missing_percent(customer_id):
        name: check completeness of customer_id
        fail: when > 5%
    - duplicate_percent(customer_id):
        name: check uniqueness of customer_id
        fail: when > 5%
    - missing_percent(country_code):
        name: check completeness of country_code
        fail: when > 5%
    """
    ## Run the Soda scan
    from soda.scan import Scan
    scan = Scan()
    scan.set_verbose(True)
    scan.set_data_source_name('azure_sql_data')
    scan.add_configuration_yaml_str(config_str)
    scan.set_scan_definition_name('retail_customers_scan')
    scan.add_sodacl_yaml_str(check_str)
    scan.execute()
    scan.assert_no_checks_fail()
    # Visualize the number of customers per country
    # The first step loads the data from the Azure SQL database.
    import pandas as pd
    import pyodbc
    from notebookutils import mssparkutils
    
    server = 'soda.sql.azuresynapse.net'
    database = 'soda'
    username = 'my_sql_user'
    password = mssparkutils.credentials.getSecret("soda-vault" , "sql-pw")
    
    connection_string = f'DRIVER={{ODBC Driver 18 for SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}'
    conn = pyodbc.connect(connection_string)
    
    query = 'SELECT * FROM soda_demo_data_testing.retail_customer_count_by_country_code'
    df = pd.read_sql(query, con=conn)
    df.head()
    
    # The second step makes the plot.
    import pandas as pd
    import plotly.express as px
    
    fig = px.bar(
        df.sort_values(by=['customer_count', 'country_code'], ascending=True),
        x='country_code',
        y='customer_count',
        color='customer_count',
        title='Customer Count by Country Code',
        labels={'country_code': 'Country Code', 'customer_count': 'Number of Customers'}
    )
    
    fig.show()
    
    # Lastly, save the plot.
    fig.write_html("/tmp/retail_customer_count_by_country_code_hist.html")
    mssparkutils.fs.cp(
        "file:/tmp/retail_customer_count_by_country_code_hist.html",
        "abfss://[email protected]/Soda-in-ADF-pipeline/fig/retail_customer_count_by_country_code_hist.html"
    )
    checks for dim_customer:
      - missing_count(birthday) = 0
      - missing_percent(gender) < 5%
      - missing_count(first_name) = 0:
          missing regex: (?:N/A)
      - missing_count(last_name) < 5:
          missing values: [n/a, NA, none]
      - missing_percent(email_address) = 0%
    checks for dim_customer:
      - missing_count(birthday) = 0
    checks for dim_reseller:
    # a check with a fixed threshold
      - missing_count(phone) < 5
    # a check with a relative threshold
      - missing_percent(number_employees) < 5%
    checks for dim_customer:
      - missing_count(last_name) < 5:
          missing values: [NA, n/a, 0]
      - missing_count(first_name) = 0:
          missing regex: (?:N/A)
    checks for dim_customer:
      - missing_percent(email_address) < 50:
          samples limit: 2
    checks for dim_customer:
      - missing_percent(email_address) < 50:
          samples limit: 0
    checks for dim_employee:
      - missing_count(gender) = 0:
          missing values: ["M", "Q"]
          samples columns: [employee_key, first_name]
    checks for dim_customer:
      - missing_count(first_name) = 0:
          missing regex: (?:N/A)
          name: First names valid
    checks for dim_customer:
      - missing_percent(marital_status):
          valid length: 1
          warn: when < 5
          fail: when >= 5  
    checks for dim_customer:
      - missing_count(first_name) < 5:
          missing values: [NA, none]
          filter: number_children_at_home > 2
    checks for dim_reseller:
      - missing_percent("phone") = 0
    for each dataset T:
      datasets:
        - dim_product
        - dim_product_%
      checks:
        - missing_count(product_line) = 0
    filter CUSTOMERS [daily]:
      where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    
    checks for CUSTOMERS [daily]:
      - missing_count(user_id) = 0
     = 
     < 
     >
     <=
     >=
     !=
     <> 
     between 
     not between 
      profile columns:
        columns:
          - "%.%"

    ✓ Use quotes when identifying dataset or column names. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    ✓ Use wildcard characters in the value in the check; use wildcard values as you would with CTE or SQL.

    ✓ Use for each to apply user-defined checks to multiple datasets in one scan.

    ✓ Apply a dataset filter to partition data during a scan. Known issue: Dataset filters are not compatible with user-defined checks which use a SQL query. With such a check, Soda does not apply the dataset filter at scan time.

    ✓ Include a failed row sample query inside a SQL or CTE user-defined metric configuration to send failed row samples to Soda Cloud; see example.

    ✓ Specify a single column against which to run a check that uses a user-defined metric.

    - Supports samples columns parameter to specify columns from which Soda draws failed row samples.

    Supports samples limit parameter to control the volume of failed row samples Soda collects.

    Supports collect failed rows parameter to instruct Soda to collect, or not to collect, failed row samples for a check.

    Anatomy of a check that uses an expression:

    • custom metric: avg_order_span

    • comparison symbol or phrase: between

    • threshold: 5 and 10

    • expression key: avg_order_span expression

    • expression value: AVG(last_order_year - first_order_year)
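    Assembled into SodaCL, those parts form a check like the following sketch; the dataset name is illustrative:

    ```yaml
    checks for customers:
      - avg_order_span between 5 and 10:
          avg_order_span expression: AVG(last_order_year - first_order_year)
    ```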

    Anatomy of a check that uses a query:

    • custom metric: product_stock

    • comparison symbol or phrase: >=

    • threshold: 50

    • query key: product_stock query

    • query value: SELECT COUNT(safety_stock_level - days_to_manufacture) FROM dim_product
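    Assembled into SodaCL, those parts form a check like the following sketch:

    ```yaml
    checks for dim_product:
      - product_stock >= 50:
          product_stock query: |
            SELECT COUNT(safety_stock_level - days_to_manufacture) FROM dim_product
    ```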

    ✓ Define a name for a user-defined check; see Customize check names.

    ✓ Add an identity to a check; see Add a check identity.

    ✓ Define alert configurations to specify warn and fail alert conditions; see Add alert configurations.

    - Apply an in-check filter to return results for a specific portion of the data in your dataset; see Add an in-check filter to a check.

    Need help? Join the Soda community on Slack.

    ✓ Use quotes when identifying dataset or column names; see Use quotes in a check. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    - Use wildcard characters ( % or * ) in values in the check.

    ✓ Use for each to apply checks with missing metrics to multiple datasets in one scan; see Apply checks to multiple datasets.

    ✓ Apply a dataset filter to partition data during a scan; see Scan a portion of your dataset.

    ✓ Supports samples columns parameter to specify columns from which Soda draws failed row samples; see Customize sampling for checks.

    ✓ Supports samples limit parameter to control the volume of failed row samples Soda collects; see Set a sample limit.

    ✓ Supports collect failed rows parameter to instruct Soda to collect, or not to collect, failed row samples for a check; see Customize sampling for checks.

    missing_percent: The percentage of rows in a column, relative to the total row count, that contain NULL values and any other user-defined values that qualify as missing.

    Test data quality in an Airflow data pipeline

    Follow this guide to set up and run scheduled Soda scans for data quality in your Airflow data pipeline.

    Use this guide as an example for how to set up and use Soda to test the quality of your data in an Airflow pipeline. Automatically catch data quality issues after ingestion or transformation to prevent negative downstream impact.

    Not quite ready for this big gulp of Soda? 🥤Try , first.

    About this guide

    The instructions below offer Data Engineers an example of how to execute SodaCL checks for data quality on data in an Apache Airflow pipeline.

    For context, this guide presents an example of a Data Engineer at a small firm who was tasked with building a simple products report of sales by category for AdventureWorks data. This Engineer uses dbt to build a simple model transformation to gather data, then builds more models to transform and push gathered information to a reporting and visualization tool. The Engineer uses Airflow for scheduling and monitoring workflows, including data ingestion and transformation events.

    The Engineer's goal in this example is to make sure that after such events, and before pushing information into a reporting tool, they run scans to check the quality of the data. Where the scan results indicate an issue with data quality, Soda notifies the Engineer so that they can potentially stop the pipeline and investigate and address any issues before the issue causes problems in the report.

    Access the folder to review the dbt models and Soda checks files that the Data Engineer uses.

    Borrow from this guide to connect to your own data source, set up scan points in your pipeline, and execute your own relevant tests for data quality.

    Install Soda from the command-line

    With Python 3.8, 3.9, or 3.10 installed, the Engineer creates a virtual environment in Terminal, then installs the Soda package for PostgreSQL using the following command.
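    Based on the package index that appears elsewhere in this guide (pypi.cloud.soda.io), the commands likely resemble the following sketch:

    ```shell
    # Create and activate a virtual environment, then install the Soda package for PostgreSQL
    python -m venv .venv
    source .venv/bin/activate
    pip install -i https://pypi.cloud.soda.io soda-postgres
    ```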

    Refer to for all supported data sources, if you wish.

    Connect Soda to a data source and Soda Cloud account

    To connect to a data source such as Snowflake, PostgreSQL, Amazon Athena, or BigQuery, you use a configuration.yml file which stores access details for your data source.

    This guide also includes instructions for how to connect to a Soda Cloud account using API keys that you create and add to the same configuration.yml file. Available for free as a 45-day trial, a Soda Cloud account gives you access to visualized scan results, tracks trends in data quality over time, enables you to set alert notifications, and much more.

    1. In the directory in which they work with their dbt models, the Data Engineer creates a soda directory to contain the Soda configuration and check YAML files.

    2. In the new directory, they create a new file called configuration.yml.

    3. In the configuration.yml file, they add the data source connection configuration for the PostgreSQL data source that contains the AdventureWorks data. The example below is the connection configuration for a PostgreSQL data source. Access the . See a complete list of supported .

    1. In a browser, they navigate to to create a free, 45-day trial Soda account.

    2. They navigate to avatar > Profile, then navigate to the API Keys tab and click the plus icon to generate new API keys.

    • They copy the syntax for the soda_cloud configuration, including the values API Key ID and API Key Secret, and paste it into the configuration.yml.

    • They are careful not to nest the soda_cloud configuration in the data_source configuration.

    3. They save the configuration.yml file and close the API modal in the Soda account.

    4. In Terminal, they run the following command to test Soda's connection to the data source.

    Write checks for data quality

    A check is a test that Soda executes when it scans a dataset in your data source. The checks.yml file stores the checks you write using the Soda Checks Language (SodaCL). You can create multiple checks.yml files to organize your data quality checks and run all, or some of them, at scan time.

    In this example, the Data Engineer creates multiple checks after ingestion, after initial transformation, and before pushing the information to a visualization or reporting tool.

    Transform checks

    After building a simple dbt model transformation that creates a new fact table which gathers data about products, product categories, and subcategories, the Engineer realizes that some of the products in the dataset do not have an assigned category or subcategory, which means those values would erroneously be excluded from the report.

    To mitigate the issue and get a warning when these values are missing, they create a new checks YAML file and write the following checks to execute after the transformation produces the fact_product_category dataset.

    Ingest checks

    Because the Engineer does not have the ability or access to fix upstream data themselves, they create another checks YAML file and write checks to apply to each dataset they use in the transformation, after the data is ingested but before it is transformed.

    For any checks that fail, the Engineer can notify upstream Data Engineers or Data Product Owners to address the issue of missing categories and subcategories.

    Reports checks

    Finally, the Engineer builds category and subcategory sales report models using dbt.

    The checks files they create to run on the new transform models contain similar user-defined checks. Ultimately, the Engineer wants data quality checks to fail if the sales of uncategorized products rise above normal (0.85%), and if the sum of sales orders in the model that prepares the report differs greatly from the sum of raw sales order numbers.

    Create a DAG and run the workflow

    The Engineer creates an Airflow DAG that runs programmatic Soda scans; see the example in the repo. Note that the value for scan-name must be unique to every programmatic scan you define. In other words, it cannot be the same as a programmatic scan in another pipeline.

    Run Soda scans manually

    Without using an Airflow DAG, the Engineer can use Soda locally to run scans for data quality using the checks YAML files they created.

    1. They use the soda scan command to run the ingest checks on the raw data, pointing Soda to the checks YAML files in the ingest-checks folder.

    2. If the ingest check results pass, they run dbt to create the new fact_product_category dataset.

    3. Accordingly, they run a scan on the new dataset, pointing Soda to the checks YAML file in the transform-checks folder.

    4. If the transform check results pass, they run dbt to create the reports.

    5. Lastly, they run a scan on the reports data, pointing Soda to the checks YAML file in the reports-checks folder.

    6. If the reports check results pass, the data is reliable enough to push to the reporting or visualization tool for consumers.

    Learn more about .

    View results and tag datasets

    1. In their Soda Cloud account, the Engineer clicks Checks to access the Checks dashboard. The check results from the scan that Soda performed appear in the table, where they can click each line item to learn more about the results, as in the example below.

    2. To more easily retrieve Soda scan results by dbt model, the Engineer navigates to Datasets, then clicks the stacked dots at the right of the dim_product dataset and selects Edit Dataset.

    3. In the Tags

    ✨Hey, hey!✨ Now you know what it's like to add data quality checks to your production data pipeline. Huzzah!

    Go further

    • in Soda!

    • . Hey, what can Soda do for you?

    Join the .

    SodaCL optional check configurations

    Add optional configurations to your SodaCL checks to optimize and clarify.

    When you define SodaCL checks for data quality in your checks YAML file, you have the option of adding one or more extra configurations or syntax variations. Read more about SodaCL metrics and checks in general.

    The following optional configurations are available to use with most, though not all, check types. The detailed documentation for metrics and individual check types indicate specifically which optional configurations are compatible.

    Customize check names

    Add a customized, plain-language name to your check so that anyone reviewing the check results can easily grasp the intent of the check.

    Add the name to the check as a nested key:value pair, as per the example below.

    • Be sure to add the : to the end of your check, before the nested content.

    • If name is configured, Soda Library sends the value of name to Soda Cloud as the check identifier.

    • Avoid applying the same customized check names in multiple agreements. Soda Cloud associates check results with agreements according to name, so if you reuse custom names, Soda Cloud may link check results to the wrong agreement.
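    A minimal sketch of a check with a customized name; the dataset, column, and name are illustrative:

    ```yaml
    checks for dim_customer:
      - missing_count(last_name) = 0:
          name: Last names are filled in
    ```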

    If you wish, you can use a variable to customize a dynamic check name. Read more about .

    When you run a scan with Soda Library, it uses the value you specified for your variable in the scan results, as in the example below.

    Add a check identity

    Soda Cloud identifies a check using details such as the check definition, the check YAML file name, and the file's location. When you modify an individual check, the check identity changes, which results in a new check in Soda Cloud. For example, the following check sends one check result to Soda Cloud after a scan.

    If you changed the threshold from 0 to 99, then after the next scan, Soda Cloud considers this as a new check and discards the previous check result's history; it would appear as though the original check and its results had disappeared. Note that this behaviour does not apply to changing values that use an in-check variable, as in the example below.

    If you anticipate modifying a check, you can explicitly specify a check identity so that Soda Cloud can correctly accumulate the results of a single check and retain its history even if the check has been modified. Be sure to complete the steps below before making any changes to the check so that you do not lose the existing check result history.

    1. Add an identity property to your check using the identifier you copied as the identity's value.

    1. Choosing a Value for identity

    The most important rule is that the identity value must be unique across all your checks. Here are some recommended approaches:

    • Generate a UUID yourself.

    • Use the generated check ID from Soda Cloud (available in the check details).

    • Follow a naming pattern, for example:

      Example:

    This ensures no accidental collisions between checks and preserves a clear mapping over time.
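    A sketch of a check with an explicit identity; the identity value here follows the naming-pattern approach and is hypothetical:

    ```yaml
    checks for dim_customer:
      - missing_count(last_name) = 0:
          identity: dim_customer__last_name__missing_count
    ```

    Because the identity is stable, you can later change the threshold or other details of the check without Soda Cloud treating it as a brand-new check.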

    1. Save your changes, then run a scan to push new results to Soda Cloud that include the check identity.

    2. With the check identity now associated with the check in Soda Cloud, you may proceed to make changes to the check.

    See also:

    Difference Between Check Identity and Soda Cloud Check ID

    It’s important to note that check identity is not the same as check ID in Soda Cloud.

    • Check ID

      • Generated automatically by Soda Cloud as a UUID when a check is first created.

      • Used to uniquely reference that check.

    • Check Identity

    Think of check identity as the link between old and new versions of your check, while the check ID is simply the identifier inside Soda Cloud.

    Add alert configurations

    When Soda runs a scan of your data, it returns a check result for each check. Each check results in one of three default states:

    • pass: the values in the dataset match or fall within the thresholds you specified

    • fail: the values in the dataset do not match or fall within the thresholds you specified

    • error: the syntax of the check is invalid

    However, you can add alert configurations to a check to explicitly specify the conditions that warrant a warn result. Setting more granular conditions for a warn, or fail, state of a check result gives you more insight into the severity of a data quality issue.

    For example, perhaps 50 missing values in a column is acceptable, but more than 50 is cause for concern; you can use alert configurations to warn you when there are 0 - 50 missing values, but fail when there are 51 or more missing values.

    Configure a single alert

    Add alert configurations as nested key:value pairs, as in the following example which adds a single alert configuration. It produces a warn check result when the volume of duplicate phone numbers in the dataset exceeds five. Refer to the CLI output below.
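    A sketch of that single-alert check, assuming a dim_customer dataset:

    ```yaml
    checks for dim_customer:
      - duplicate_count(phone):
          warn: when > 5
    ```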

    Configure multiple alerts

    Add multiple nested key:value pairs to define both warn alert conditions and fail alert conditions.

    The following example defines the conditions for both a warn and a fail state. After a scan, the check result is warn when there are between one and ten duplicate phone numbers in the dataset, but if Soda Library discovers more than ten duplicates, as it does in the example, the check fails. If there are no duplicate phone numbers, the check passes.
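    A sketch of that check with both alert conditions:

    ```yaml
    checks for dim_customer:
      - duplicate_count(phone):
          warn: when between 1 and 10
          fail: when > 10
    ```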

    • Be sure to add the : to the end of your check, before the nested content.

    • Be aware that a check that contains one or more alert configurations only ever yields a single check result; see Expect one check result.

    Expect one check result

    Be aware that a check that contains one or more alert configurations only ever yields a single check result; one check yields one check result. If your check triggers both a warn and a fail, the check result only displays the more severe, failed check result. (Schema checks behave slightly differently; see .)

    Using the following example, Soda Library, during a scan, discovers that the data in the dataset triggers both alerts, but the check result is still Only 1 warning. Nonetheless, the results in the CLI still display both alerts as having both triggered a [WARNED] state.

    The check in the example below data triggers both warn alerts and the fail alert, but only returns a single check result, the more severe Oops! 1 failures.

    Define zones using alert configurations

    Use alert configurations to write checks that define fail or warn zones. By establishing these zones, the check results register as more severe the further a measured value falls outside the threshold parameters you specify as acceptable for your data quality.

    The example that follows defines split warning and failure zones in which inner is good, and outer is bad. The chart below illustrates the pass (white), warn (yellow), and fail (red) zones. Note that an individual check only ever yields one check result. If your check triggers both a warn and a fail, the check result only displays the more serious, failed check result. See for details.
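    A sketch of such split zones; the metric and thresholds are illustrative. Values between 80 and 120 pass, values in the bands from 50 to 80 and from 120 to 150 warn, and anything beyond those bands fails:

    ```yaml
    checks for dim_product:
      - row_count:
          warn: when not between 80 and 120
          fail: when not between 50 and 150
    ```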

    The next example defines a different kind of zone in which inner is bad, and outer is good. The chart below illustrates the fail (red), warn (yellow), and pass (white) zones.

    Add a filter to a check

    Add a filter to a check to apply conditions that specify a portion of the data against which Soda executes the check. For example, you may wish to use an in-check filter to support a use case in which “Column X must be filled in for all rows that have value Y in column Z”.

    Add a filter as a nested key:value pair, as in the following example which filters the scan results to display only those rows with a value of 81 or greater and which contain 11 in the sales_territory_key column. You cannot use a variable to specify an in-check filter.
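    One reading of that example, sketched with illustrative metric and column names:

    ```yaml
    checks for dim_employee:
      - max(vacation_hours) >= 81:
          filter: sales_territory_key = 11
    ```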

    If your filter uses a string as a value, be sure to wrap the string in single quotes, as in the following example.

    You can use AND or OR to add multiple filter conditions to a filter key:value pair to further refine your results, as in the following example.
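    A sketch of a filter with multiple conditions, including a string value wrapped in single quotes; the column names are illustrative:

    ```yaml
    checks for dim_employee:
      - missing_count(end_date) = 0:
          filter: status = 'Active' AND sales_territory_key = 11
    ```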

    To improve the readability of multiple filters in a check, consider adding filters as separate line items, as per the following example.

    If your column names use quotes, these quotes produce invalid YAML syntax which results in an error message. Instead, write the check without the quotes or, if the quotes are mandatory for the filter to work, prepare the filter in a text block as in the following example.

    Be aware that if no rows match the filter parameters you set, Soda does not evaluate the check. In other words, Soda first finds rows that match the filter, then executes the check on those rows.

    If, in the example above, none of the rows contained a value of 11 in the sales_territory_key column, Soda does not evaluate the check and returns a NOT EVALUATED message in the CLI scan output, such as the following.

    See for further details.

    See also: .

    Use quotes in a check

    In the checks you write with SodaCL, you can apply the quoting style that your data source uses for dataset or column names. Soda Library uses the quoting style you specify in the aggregated SQL queries it prepares, then executes during a scan.

    • Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    • Soda does not support quotes in the dataset name identifier, as in checks for "CUSTOMERS":

    Check:

    Resulting SQL query:
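    For example, a check that quotes a column name, assuming a dim_reseller dataset:

    ```yaml
    checks for dim_reseller:
      - missing_count("phone") = 0
    ```

    In the aggregated SQL query that Soda prepares, the column reference keeps the same quoting, appearing as "phone" in the generated SELECT statement.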

    Apply checks to multiple datasets

    Add a for each section to your checks configuration to specify a list of checks you wish to execute on multiple datasets.

    1. Add a for each dataset T section header anywhere in your YAML file. The purpose of the T is only to ensure that every for each configuration has a unique name.

    2. Nested under the section header, add two nested keys, one for datasets and one for checks.
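    The two steps above produce a configuration like the following sketch; the dataset names and check are illustrative:

    ```yaml
    for each dataset T:
      datasets:
        - dim_product
        - dim_customer
      checks:
        - row_count > 0
    ```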

    Limitations and specifics for for each

    • For each is not compatible with dataset filters.

    • Soda dataset name matching is case-insensitive.

    • You cannot use quotes around dataset names in a for each configuration.

    • If any of your checks specify column names as arguments, make sure the column exists in all datasets listed under the datasets heading.

    See for further details.

    Scan a portion of your dataset

    It can be time-consuming to check exceptionally large datasets for data quality in their entirety. Instead of checking whole datasets, you can use a dataset filter to specify a portion of data in a dataset against which Soda Library executes a check.

    • Except with a NOW variable, you cannot use variables in checks you write in an agreement in Soda Cloud as it is impossible to provide the variable values at scan time.

    • Known issue: Dataset filters are not compatible with user-defined checks that use a SQL query. With such a check, Soda does not apply the dataset filter at scan time.

    1. In your checks YAML file, add a section header called filter, then append a dataset name and, in square brackets, the name of the filter. The name of the filter cannot contain spaces. Refer to the example below.

    2. Nested under the filter header, use a SQL expression to specify the portion of data in a dataset that Soda Library must check.

      • The SQL expression in the example references two variables: ts_start and ts_end.

    If you wish to run checks on the same dataset without using a filter, add a separate section for checks for your_dataset_name without the appended filter name. Any checks you nest under this header execute against all the data in the dataset.
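    Putting the pieces together, a filtered section can sit alongside an unfiltered one for the same dataset; the row_count check here is illustrative:

    ```yaml
    filter CUSTOMERS [daily]:
      where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'

    checks for CUSTOMERS [daily]:
      - missing_count(user_id) = 0

    checks for CUSTOMERS:
      - row_count > 0
    ```

    Checks nested under the unfiltered header run against all the data in the dataset.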

    See for further details.

    Collect failed rows samples

    Soda collects failed rows samples explicitly and implicitly.

    To explicitly collect failed row samples, you can add a check that is configured to collect them. Soda collects 100 failed row samples for the following explicitly-configured checks:

    • failed rows checks that use the failed rows query configuration

    Implicitly, Soda automatically collects 100 failed row samples for the following checks:

    • checks that use a

    • checks that use a

    • checks that use a

    Beyond the default behavior of collecting and sending 100 failed row samples to Soda Cloud when a check fails, you can:

    • customize the sample size

    • customize columns from which to collect samples

    • disable failed row collection

    • reroute failed row samples to a non-Soda Cloud destination, such as an S3 bucket.
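    For instance, a sketch that combines a custom sample limit with specific sample columns; the names are illustrative:

    ```yaml
    checks for dim_customer:
      - missing_percent(email_address) < 50:
          samples limit: 20
          samples columns: [customer_key, email_address]
    ```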

    Learn how to .

    Go further

    • Reference .

    Anomaly score checks (deprecated)

    Anomaly score checks use a machine learning algorithm to automatically detect anomalies in your time-series data.

    This check is being deprecated. Soda recommends using the new anomaly detection checks, which are rebuilt from the ground up, 70% more accurate, and significantly faster.

    Use an anomaly score check to automatically discover anomalies in your time-series data.

    ✔️ Requires Soda Core Scientific (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    About anomaly score checks

    The anomaly score check is powered by a machine learning algorithm that works with measured values for a metric that occur over time. The algorithm learns the patterns of your data – its trends and seasonality – to identify and flag anomalies in time-series data.

    Install Soda Scientific

    To use an anomaly score check, you must install Soda Scientific in the same directory or virtual environment in which you installed Soda Library. Soda Scientific is included in a Soda Agent deployment. Best practice recommends installing Soda Library and Soda Scientific in a virtual environment to avoid library conflicts, but you can install Soda Scientific locally if you prefer.

    1. Set up a virtual environment, and install Soda Library in your new virtual environment.

    2. Use the following command to install Soda Scientific.
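    The command itself is not reproduced above; assuming the Soda-hosted package index used elsewhere in this guide, it is likely:

```shell
pip install -i https://pypi.cloud.soda.io soda-scientific
```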

    Refer to Troubleshoot Soda Scientific installation for help with issues during installation.

    Define an anomaly score check

    The following example demonstrates how to use the anomaly score for the row_count metric in a check. You can use any numeric, missing, or validity metric in lieu of row_count.

    • Currently, you can only use < default to define the threshold in an anomaly score check.

    • By default, anomaly score checks yield warn check results, not fails.

    You can use any numeric, missing, or validity metric in anomaly score checks. The following example detects anomalies for the average of order_price in an orders dataset.

    The following example detects anomalies for the count of missing values in the id column.
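    The two examples described above can be sketched in SodaCL as follows; the orders dataset and its order_price and id columns are illustrative assumptions.

```yaml
checks for orders:
  # anomalies in the average order price over time
  - anomaly score for avg(order_price) < default
  # anomalies in the count of missing values in the id column
  - anomaly score for missing_count(id) < default
```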

    Anomaly score check results

    Because the anomaly score check requires at least four data points before it can start detecting what counts as an anomalous measurement, the first few scans yield a check result that indicates that Soda does not have enough data.

    Though your first instinct may be to run several scans in a row to produce the four measurements that the anomaly score needs, the measurements don’t “count” if the frequency of occurrence is too random; in other words, the measurements must represent a reasonably stable scan frequency.

    If, for example, you attempt to run eight back-to-back scans in five minutes, the anomaly score does not register the measurements resulting from those scans as a reliable pattern against which to evaluate an anomaly.

    Consider using Soda Library to set up a programmatic scan that produces a check result for an anomaly score check on a regular schedule.

    Produce warnings instead of fails

    By default, an anomaly score check yields either a pass or fail result; pass if Soda does not detect an anomaly, fail if it does.

    If you wish, you can instruct Soda to issue warn check results instead of fails by adding a warn_only configuration, as in the following example.
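    The example itself did not survive extraction; a minimal sketch of the warn_only configuration looks like this.

```yaml
checks for dim_customer:
  - anomaly score for row_count < default:
      warn_only: true
```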

    Reset anomaly history

    If you wish, you can reset an anomaly score's history, effectively recalibrating what Soda considers anomalous on a dataset.

    1. In Soda Cloud, navigate to the Check History page of the anomaly check you wish to reset.

    2. Click to select a node in the graph that represents a measurement, then click Feedback.

    3. In the modal that appears, you can choose to exclude the individual measurement, or all previous data up to that measurement, the latter of which resets the anomaly score's history.

    Optional check configurations

    Supported
    Configuration
    Documentation

    Example with quotes

    Example with for each

    Track anomalies and relative changes by group

    You can use a group by configuration to detect anomalies by category, and monitor relative changes over time in each category.

    ✔️ Requires Soda Core Scientific for anomaly check (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library 1.1.27 or greater + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent 0.8.57 or greater ✖️ Available as a no-code check

    The following example includes three checks grouped by gender.

    • The first check uses the custom metric average_children to collect measurements and gauge them against an absolute threshold of 2. Soda Cloud displays the check results grouped by gender.

    • The second check uses the same custom metric to detect anomalous measurements relative to previous measurements. Soda must collect a minimum of four regular-cadence measurements to have enough data from which to gauge an anomalous measurement. Until it has enough measurements, Soda returns a check result of [NOT EVALUATED]. Soda Cloud displays any detected anomalies grouped by gender.
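    The three grouped checks described here can be sketched as follows; the dim_customer dataset, the total_children column, and the threshold direction are illustrative assumptions, not the guide's original example.

```yaml
checks for dim_customer:
  - group by:
      query: |
        SELECT gender, AVG(total_children) AS average_children
        FROM dim_customer
        GROUP BY gender
      fields:
        - gender
      checks:
        # absolute threshold of 2, results displayed per gender
        - average_children < 2:
            name: Average children by gender
        # anomalies relative to previous measurements, per gender
        - anomaly score for average_children < default
        # relative change between -5 and 5 versus the previous measurement
        - change for average_children between -5 and 5
```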

    Troubleshoot Soda Scientific installation

    While installing Soda Scientific works on Linux, you may encounter issues if you install Soda Scientific on macOS (particularly machines with the M1 ARM-based processor) or any other operating system. If that is the case, consider using one of the following alternative installation procedures.

    Need help? Ask the team in the Soda community on Slack.

    Install Soda Scientific Locally

    1. Set up a virtual environment, and install Soda Library in your new virtual environment.

    2. Use the following command to install Soda Scientific.

    List of Soda Scientific dependencies
    • pandas<2.0.0

    • wheel

    • pydantic>=1.8.1,<2.0.0

    Use Docker to run Soda Library

    Use Soda’s Docker image in which Soda Scientific is pre-installed. You need Soda Scientific to be able to use SodaCL distribution checks or anomaly detection checks.

    1. If you have not already done so, install Docker in your local environment.

    2. From Terminal, run the following command to pull Soda Library’s official Docker image; adjust the version to reflect the most recent release.
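    The pull command did not survive extraction; a sketch, assuming the sodadata/soda-library image name and an illustrative version tag:

```shell
docker pull sodadata/soda-library:v1.0.0
```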

    3. Verify the pull by running the following command.
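    The verification command is likely the standard Docker image listing:

```shell
docker images
```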

      Output:

      When you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.

    Error: Mounts denied

    If you encounter the following error, follow the procedure below.

    You need to give Docker permission to access your configuration.yml and checks.yml files in your environment. To do so:

    1. Access your Docker Dashboard, then select Preferences (gear symbol).

    2. Select Resources, then follow the Docker instructions to add your Soda project directory – the one you use to store your configuration.yml and checks.yml files – to the list of directories that can be bind-mounted into Docker containers.

    3. Click Apply & Restart, then repeat steps 2 - 4 above.

    Error: Configuration path does not exist

    If you encounter the following error, double check the syntax of the scan command in step 4 above.

    • Be sure to prepend /sodacl/ to both the configuration.yml filepath and the checks.yml filepath.

    • Be sure to mount your files into the container by including the -v option. For example, -v /Users/MyName/soda_project:/sodacl.

    Troubleshoot Soda Scientific installation in a virtual env

    If you have defined an anomaly detection check and you use an M1 macOS machine, you may get a Library not loaded: @rpath/libtbb.dylib error. This is a known issue in the macOS community and is caused by issues during the installation of the prophet library. There currently are no official workarounds or releases to fix the problem, but the following adjustments may address the issue.

    1. Install soda-scientific as per the local environment installation instructions and activate the virtual environment.

    2. Use the following command to navigate to the directory in which the stan_model of the prophet package is installed in your virtual environment.

      For example, if you have created a python virtual environment in a /venvs directory in your home directory and you use Python 3.9, you would use the following command.
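    That example command is not reproduced above; a sketch, assuming a virtual environment named soda inside ~/venvs and Python 3.9, as the surrounding text describes:

```shell
cd ~/venvs/soda/lib/python3.9/site-packages/prophet/stan_model
```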

    List of comparison symbols and phrases

    Go further

    • Reference tips and best practices for SodaCL.

    Build a Grafana dashboard

    This example helps you build a customized data quality reporting dashboard in Grafana using the Soda Cloud API.

    This guide offers an example of how to create a data quality reporting dashboard using the Soda Cloud API and Grafana. Such a dashboard enables data engineers to monitor the status of Soda scans and capture and display check results.

    Use the Soda Cloud API to trigger data quality scans and extract metadata from your Soda Cloud account, then store the metadata in PostgreSQL and use it to customize visualized data quality results in Grafana.

    Prerequisites

    • access to a Grafana account

    • Python 3.8, 3.9, or 3.10

    • familiarity with Python, and with using Python libraries to interact with APIs

    • access to a PostgreSQL data source

    • a Soda Cloud account; see Sign Up

    • permission in Soda Cloud to access dataset metadata; see Manage dataset roles

    • at least one agreement or no-code check associated with a scan definition in Soda Cloud; see Use no-code checks

    Choose a scan definition

    Because this guide uses the Soda Cloud API to trigger a scan execution, you must first choose an existing check in Soda Cloud to identify its associated scan definition, which you will use to identify which checks to execute during the triggered scan.

    See also: Trigger a scan via API.

    1. Log in to your Soda Cloud account and navigate to the Checks page. Choose a check that originated in Soda Cloud, identifiable by the cloud icon, that you can use to complete this exercise. Use the action menu (stacked dots) next to the check to select Edit Check.

    2. In the dialog that opens, copy the scan definition name from the Add to Scan Definition field. Under Scans, above the scan definition name, copy the scan definition ID that uses underscores to represent spaces. Paste the scan definition ID in a temporary local file; you will use it in the next steps to trigger a scan via the Soda Cloud API.

    Prepare to use the Soda Cloud API

    1. As per best practice, set up a new Python virtual environment so that you can keep your projects isolated and avoid library clashes. The example below uses the built-in venv module to create, then navigate to and activate, a virtual environment named soda-grafana. Run deactivate to close the virtual environment when you wish.
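    The create, navigate, and activate commands described here can be sketched as:

```shell
python3 -m venv soda-grafana
cd soda-grafana
source bin/activate
```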

    2. Run the following command to install the requests library in your virtual environment, which you need to connect to Soda Cloud API endpoints. Because this exercise moves the data it extracts from your Soda Cloud account into a PostgreSQL data source, it also requires the psycopg2 library. Alternatively, you can list and save all the requirements in a requirements.txt file, then install them from the command-line using pip install -r requirements.txt. If you use a different type of data source, find a corresponding plugin, or check SQLAlchemy's built-in database compatibility.
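    The install command for the two libraries named in this step:

```shell
pip install requests psycopg2
```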

    1. In the same directory, create a new file named apiscan.py. Paste the following contents into the file to define an ApiScan class, which you will use to interact with the Soda Cloud API.
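    The original contents of apiscan.py did not survive extraction. A minimal sketch of such a class follows; the scans endpoint path, the scanDefinition parameter, and the X-Soda-Scan-Id response header are assumptions drawn from the Soda Cloud API docs, not the guide's original code.

```python
import os
import requests


class ApiScan:
    """Minimal client for the Soda Cloud API (a sketch, not the guide's file)."""

    def __init__(self, base_url=None, api_key=None, api_secret=None):
        # SODA_URL, API_KEY, and API_SECRET are the environment variables
        # this guide asks you to define in the next step
        self.base_url = (base_url or os.environ["SODA_URL"]).rstrip("/") + "/"
        self.auth = (api_key or os.environ.get("API_KEY"),
                     api_secret or os.environ.get("API_SECRET"))

    def trigger(self, scan_definition):
        """Trigger a scan and return its scan id."""
        response = requests.post(self.base_url + "scans", auth=self.auth,
                                 data={"scanDefinition": scan_definition})
        response.raise_for_status()
        return response.headers["X-Soda-Scan-Id"]

    def state(self, scan_id):
        """Return the current state payload for a scan."""
        response = requests.get(self.base_url + "scans/" + scan_id,
                                auth=self.auth)
        response.raise_for_status()
        return response.json()
```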

    2. From the command-line, create the following environment variables to facilitate a connection to your Soda Cloud account and your PostgreSQL data source.

    • SODA_URL: use https://cloud.soda.io/api/v1/ or https://cloud.us.soda.io/api/v1/ as the value, according to the region in which you created your Soda Cloud account.

    • API_KEY and API_SECRET: see Generate API keys
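    For example, from the command-line; the API key values are placeholders, and the PostgreSQL variable names are illustrative assumptions:

```shell
export SODA_URL="https://cloud.soda.io/api/v1/"
export API_KEY="your-api-key"
export API_SECRET="your-api-secret"
# connection details for your PostgreSQL data source
export PG_HOST="localhost"
export PG_DATABASE="postgres"
export PG_USERNAME="postgres"
export PG_PASSWORD="secret"
```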

    Troubleshoot

    Problem: You get an error that reads, "psycopg2 installation fails with error: metadata-generation-failed" and the suggestion "If you prefer to avoid building psycopg2 from source, please install the PyPI 'psycopg2-binary' package instead."

    Solution: As suggested, install the binary package instead, using pip install psycopg2-binary.

    Trigger and monitor a Soda scan

    1. In the same directory in which you created the apiscan.py file, create a new file named main.py.

    2. To the file, add the following code which:

    • imports necessary libraries, as well as the ApiScan class from apiscan.py

    • initializes an ApiScan object as ascan, uses the object to trigger a scan with scan_definition as a parameter which, in this case, is grafanascan0; replace grafanascan0 with the scan definition ID you copied to a local file earlier.
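    The trigger-then-poll flow in main.py can be sketched as a small helper. Here, get_state stands in for the ApiScan object's state call, and the completion states are those this guide names (completedWithErrors, completedWithFailures, completedWithWarnings, completed); the function shape itself is an assumption.

```python
import time

# Completion states after which the scan results are available
DONE_STATES = {"completed", "completedWithErrors",
               "completedWithFailures", "completedWithWarnings"}


def wait_for_scan(get_state, scan_id, poll_seconds=10, sleep=time.sleep):
    """Poll the scan's state every poll_seconds until it reaches a
    completion state, then return the final scan-state payload."""
    while True:
        r = get_state(scan_id)
        if r.get("state") in DONE_STATES:
            return r
        sleep(poll_seconds)
```

In main.py you would pass ascan.state as get_state and store the returned payload as the variable r.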

    Extract scan results

    To the main.py file, add the following code which:

    • extracts Soda scan details from the scan results stored in variable r

    • extracts dataset details, using Soda Cloud API's datasets endpoint

    • extracts checks details, using Soda Cloud API's checks endpoint

    • combines scan, dataset and checks details into one dictionary per check, and appends the dictionary to a list of checks
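    The combining step can be sketched as follows; the payload field names (scanId, datasetId, evaluationStatus, and so on) are illustrative assumptions, not the exact Soda Cloud API response shape.

```python
def combine_results(scan_details, datasets, checks):
    """Combine scan, dataset, and check details into one flat dict
    per check, and return the list of those dicts."""
    datasets_by_id = {d.get("id"): d for d in datasets}
    combined = []
    for check in checks:
        dataset = datasets_by_id.get(check.get("datasetId"), {})
        combined.append({
            "scan_id": scan_details.get("scanId"),
            "scan_ended": scan_details.get("ended"),
            "dataset_name": dataset.get("name"),
            "check_name": check.get("name"),
            "check_outcome": check.get("evaluationStatus"),
        })
    return combined
```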

    Process scan results into a PostgreSQL data source

    The following example code serves as reference for adding data to a PostgreSQL data source. Replace it if you intend to store scan results in another type of data source.

    1. To the main.py file, add the code below which:

    • connects to a PostgreSQL data source, using the psycopg2 library

    • creates a table in the data source in which to store scan results, if one does not already exist

    • processes the list of dicts, and inserts them into the table of scan results
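    The SQL side of that step can be sketched with two pure helpers; the table layout and column names are illustrative assumptions. With psycopg2, you would pass each rendered statement and its parameters to cursor.execute inside a connection opened with psycopg2.connect.

```python
# Table in which to store scan results, created only if absent
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS soda_scan_results (
    scan_id TEXT,
    scan_ended TEXT,
    dataset_name TEXT,
    check_name TEXT,
    check_outcome TEXT
)
"""


def insert_statement(row):
    """Render a parameterized INSERT for one result dict, suitable
    for cursor.execute(sql, params)."""
    columns = sorted(row)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = (f"INSERT INTO soda_scan_results ({', '.join(columns)}) "
           f"VALUES ({placeholders})")
    return sql, [row[c] for c in columns]
```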

    2. From the command-line, run python3 main.py.

    Visualize scan results in a Grafana dashboard

    1. Log into your Grafana account, select My Account, then launch Grafana Cloud.

    2. Follow Grafana's instructions to add your data source which contains the Soda check results.

    3. Follow Grafana's instructions to create a new dashboard. Use the following details for reference in the Edit panel for Visualizations.

    • In the Queries tab, configure a Query using Builder or Code, then Run query on the data source. Toggle the Table view at the top to see Query results.

    • In the Transformations tab, create, edit, or delete Transformations that transform Query results into the data and format that Visualization needs.

    • Access Grafana's Visualizations documentation for further guidance on Visualizations.

    The example code included in this guide produces the following visualizations.

    Go further

    • Access full Soda Cloud API and Soda Cloud Reporting API documentation.

    • Learn more about remotely running a Soda scan.

    checks for dim_reseller:
      - avg_order_span between 5 and 10:
          avg_order_span expression: AVG(last_order_year - first_order_year)
    checks for dim_product:
      - product_stock >= 50:
          product_stock query: |
            SELECT COUNT(safety_stock_level - days_to_manufacture)
            FROM dim_product
    checks for product_desc:
      - avg_surface between 1068 and 1069:
          avg_surface sql_file: "filepath/filename.sql"
    checks for dim_product:
      - anomaly detection for product_stock:
          product_stock query: |
            SELECT COUNT(safety_stock_level - days_to_manufacture)
            FROM dim_product
    checks for dim_product:
      - product_stock >= 50:
          name: Product stock 
          product_stock query: |
            SELECT COUNT(safety_stock_level - days_to_manufacture)
            FROM dim_product
      - avg_order_span:
          avg_order_span expression: AVG(last_order_year - first_order_year)
          warn: when > 50
          fail: when > 200
    checks for dim_product:
      - product_stock >= 50:
          product_stock query: |
            SELECT COUNT("safety_stock_level" - "days_to_manufacture")
            FROM dim_product
    for each dataset T:
      datasets:
        - dim_reseller
      checks:
        - avg_order_span between 5 and 10:
            avg_order_span expression: AVG(last_order_year - first_order_year)
    filter FULFILLMENT [daily]:
      where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    
    checks for FULFILLMENT [daily]:
      - avg_order_span between 5 and 10:
          avg_order_span expression: AVG(last_order_day - first_order_day)
    checks for CUSTOMERS:
      - belgium_customers < 6:
          belgium_customers query: |
            SELECT count(*) as belgium_customers
            FROM CUSTOMERS
            WHERE country = 'BE'
          failed rows query: |
              SELECT *
              FROM CUSTOMERS
              WHERE country != 'BE'
    checks for product_b:
      - id_for_belgium:
          id_for_belgium query: SELECT count(*) FROM product_b
          failed rows query: SELECT id FROM product_b WHERE id IS NULL
          name: ID in Belgium is empty
          column: id
          fail: when > 62
     = 
     < 
     >
     <=
     >=
     !=
     <> 
     between 
     not between 
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours
    checks for dim_customer:
      - anomaly score for row_count < default
    check field, they add a value for fact_product_category, the dbt model that uses this dataset, and a tag to indicate the kind of data that Soda is scanning, raw, transformed, or reporting, then saves. They repeat these steps to add tags to all the datasets in their Soda Cloud account.
  • Navigating again to the Datasets page, they use the filters to display datasets according to Tags and Arrival Time to narrow the search for the most recent quality checks associated with their models which have failed or warned.

  • After filtering the datasets according to the tags, the Engineer saves the filter setup as a Collection that they can revisit daily.

  • If you were in the Data Engineer's shoes, you may further wish to set up Slack notifications for any checks that warn or fail during scans.


    Provided by Soda Core, Soda Library, or Soda Agent.

  • Acts as a correlation key so Soda Cloud can associate results with the correct check even if the check definition changes.

  • Nested under datasets, add a list of datasets against which to run the checks. Refer to the example below that illustrates how to use include and exclude configurations and wildcard characters (%) .
  • Nested under checks, write the checks you wish to execute against all the datasets listed under datasets.

  • To add multiple for each configurations, configure another for each section header with a different letter identifier, such as for each dataset R.

    and
    ts_end
    .
  • Variables must use the following syntax: ${VAR_NAME}.

  • When you run the soda scan command, you must include these two variables as options in the command; see step 5.

  • Add a separate section for checks for your_dataset_name [filter name]. Any checks you nest under this header execute only against the portion of data that the expression in the filter section defines. Refer to the example below.

  • Write any checks you wish for the dataset and the columns in it.

  • When you wish to execute the checks, use Soda Library to run a scan of your data source and use the -v option to include each value for the variables you included in your filter expression, as in the example below.
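    That scan command can be sketched as follows; the data source name, file names, and date values are illustrative assumptions that match this filter's ts_start and ts_end variables:

```shell
soda scan -d adventureworks -c configuration.yml \
  -v ts_start=2022-03-11 -v ts_end=2022-03-15 checks.yml
```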

  • metric reconciliation checks that include missing, validity, or duplicate metrics, or reference checks

  • record reconciliation checks


    Need help? Join the Soda community on Slack.

    ✓ Use quotes when identifying dataset names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    ✖️ Use wildcard characters ( % or * ) in values in the check.

    ✓ Use for each to apply anomaly score checks to multiple datasets in one scan; see example.

    ✓ Apply a dataset filter to partition data during a scan.

    The third check uses the same custom metric to detect changes over time in the calculated average measurement, and gauge the measurement against a threshold of between -5 and 5 relative to the previously-recorded measurement. See Change-over-time thresholds for supported syntax variations for change-over-time checks. Soda Cloud displays any detected changes grouped by gender.

    scipy>=1.8.0

  • numpy>=1.23.3, <2.0.0

  • inflection==0.5.1

  • httpx>=0.18.1,<2.0.0

  • PyYAML>=5.4.1,<7.0.0

  • cython>=0.22

  • prophet>=1.1.0,<2.0.0

  • When you are ready to run a Soda scan, use the following command to run the scan via the docker image. Replace the placeholder values with your own file paths and names.

    Optionally, you can specify the version of Soda Library to use to execute the scan. This may be useful when you do not wish to use the latest released version of Soda Library to run your scans. The example scan command below specifies Soda Library version 1.0.0.

    Use the ls command to determine the version number of cmdstan that prophet installed. The cmdstan directory name includes the version number.

  • Add the rpath of the tbb library to your prophet installation using the following command.

    With cmdstan version 2.26.1, you would use the following command.
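    The command did not survive extraction; a sketch, assuming the tbb library lives under the cmdstan directory inside your prophet installation (the exact path is an assumption and varies by cmdstan version):

```shell
install_name_tool -add_rpath \
  @executable_path/cmdstan-2.26.1/stan/lib/stan_math/lib/tbb \
  prophet_model.bin
```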

    ✓ Define a name for an anomaly score check.

    ✓ Add an identity to a check; see Add a check identity.

    ✖️ Define alert configurations to specify warn and fail thresholds.

    ✓ Apply an in-check filter to return results for a specific portion of the data in your dataset; see example.


    Need help? Join the Soda community on Slack.

    Be aware that when you use the Soda Cloud API to trigger the execution of this scan definition remotely, Soda executes all checks associated with the scan definition. This is a good thing, as you can see the metadata for multiple check results in the Grafana dashboard this guide prepares.

    stores the scan id as a variable

  • checks the state of the scan every 10 seconds, then only when it is in a completion state (completedWithErrors, completedWithFailures, completedWithWarnings, or completed), stores the scan results as variable r.


    Need help? Join the Soda community on Slack.

    pip install -i https://pypi.cloud.soda.io soda-postgres
    data_source soda-demo:
          type: postgres
          host: localhost
          username: postgres
          password: secret
          database: postgres
          schema: public
    soda test-connection -d adventureworks -c configuration.yml
    checks for fact_product_category:
      # Check warns when any NULL values exist in the column
      - missing_count(category_key): 
          name: All products have a category
          warn: when > 0
      # Check warns when any NULL values exist in the column
      - missing_count(subcategory_key): 
          name: All products have a subcategory
          warn: when > 0
      # Check warns when any NULL values exist in the column
      - missing_count(product_key) = 0:
          name: All products have a key
    checks for dim_product:
      # Check fails when product_key or english_product_name is missing, OR
      # when the data type of those columns is other than specified
      - schema:
          fail:
            when required column missing: [product_key, english_product_name]
            when wrong column type:
              product_key: integer
              english_product_name: varchar
      # Check fails when any NULL values exist in the column
      - missing_count(product_key) = 0:
          name: All products have a key
      # Check fails when any NULL values exist in the column
      - missing_count(english_product_name) = 0:
          name: All products have a name
      # Check fails when any NULL values exist in the column
      - missing_count(product_subcategory_key):
          name: All products have a subcategory
          warn: when > 0     
      # Check fails when the number of products, relative to the
      # previous scan, changes by 10 or more
      - change for row_count < 10:
          name: Products are stable
    checks for dim_product_category:
      # Check fails when product_category_key or english_product_category name 
      # is missing, OR
      # when the data type of those columns is other than specified
      - schema:
          fail:
            when required column missing:
              [product_category_key, english_product_category_name]
            when wrong column type:
              product_category_key: integer
              english_product_category_name: varchar
      # Check fails when any NULL values exist in the column
      - missing_count(product_category_key) = 0:
          name: All categories have a key
      # Check fails when any NULL values exist in the column
      - missing_count(english_product_category_name) = 0:
          name: All categories have a name
      # Check fails when the number of categories, relative to the
      # previous scan, changes by 5 or more
      - change for row_count < 5:
          name: Categories are stable
    checks for dim_product_subcategory:
      # Check fails when product_subcategory_key or english_product_subcategory_name 
      # is missing, OR
      # when the data type of those columns is other than specified
      - schema:
          fail:
            when required column missing:
              [product_subcategory_key, english_product_subcategory_name]
            when wrong column type:
              product_subcategory_key: integer
              english_product_subcategory_name: varchar
      # Check fails when any NULL values exist in the column
      - missing_count(product_subcategory_key) = 0:
          name: All subcategories have a key
      # Check fails when any NULL values exist in the column
      - missing_count(english_product_subcategory_name) = 0:
          name: All subcategories have a name
      # Check fails when the number of categories, relative to the
      # previous scan, changes by 5 or more
      - change for row_count < 5:
          name: Subcategories are stable
    checks for fact_internet_sales:
      # Check fails when product_key, order_quantity, or sales_amount 
      # is missing, OR
      # when the data type of those columns is other than specified
      - schema:
          fail:
            when required column missing:
              [product_key, order_quantity, sales_amount]
            when wrong column type:
              product_key: integer
              order_quantity: smallint
              sales_amount: money
      # Check fails when any NULL values exist in the column
      - missing_count(product_key) = 0:
          name: All sales have a product associated
      # Check fails when any order contains no items 
      - min(order_quantity) > 0:
          name: All sales have a non-zero order quantity
      # Check fails when the amount of any sales order is zero
      - failed rows:
          name: All sales have a non-zero order amount
          fail query: |
            SELECT sales_order_number, sales_amount::NUMERIC
              FROM fact_internet_sales
            WHERE sales_amount::NUMERIC <= 0
      # Check warns when there are fewer than 5 new internet sales 
      # relative to the previous scan results
      # Check fails when there are more than 500 new internet sales
      # relative to the previous scan results
      - change for row_count:
          warn: when < 5 
          fail: when > 500 
          name: Sales are within expected range
      # Check fails when the average of the column is abnormal
      # relative to previous measurements for average sales amount
      # sales_amount is cast from data type MONEY to enable calculation
      - anomaly detection for avg(sales_amount::NUMERIC)
    checks for report_category_sales:
      # Check fails if the percentage of sales of products with no 
      # category exceeds 0.90%
      - uncategorized_sales_percent < 0.9:
          uncategorized_sales_percent query: >
            select ROUND(CAST((sales_total * 100) / (select sum(sales_total) from report_category_sales) AS numeric), 2) as uncategorized_sales_percent from report_category_sales where category_key is NULL
          name: Most sales are categorized
      # Check fails if the sum of sales produced by the model is different
      # than the sum of sales in the fact_internet_sales dataset
      - sales_diff = 0:
          name: Category sales total matches
          sales_diff query: >
            SELECT CAST((SELECT SUM(fact_internet_sales.sales_amount) FROM fact_internet_sales)
            - (SELECT SUM(report_category_sales.sales_total) FROM report_category_sales) as numeric) AS sales_diff
    checks for report_subcategory_sales:
      # Check fails if the percentage of sales of products with no 
      # subcategory exceeds 0.90%
      - uncategorized_sales_percent < 0.9:
          uncategorized_sales_percent query: >
            select ROUND(CAST((sales_total * 100) / (select sum(sales_total) from report_subcategory_sales) AS numeric), 2) as uncategorized_sales_percent from report_subcategory_sales where category_key is NULL OR subcategory_key is NULL
          name: Most sales are categorized
      # Check fails if the sum of sales produced by the model is different
      # than the sum of sales in the fact_internet_sales dataset
      - sales_diff = 0:
          name: Subcategory sales total matches
          sales_diff query: >
            SELECT CAST((SELECT SUM(fact_internet_sales.sales_amount) FROM fact_internet_sales)
            - (SELECT SUM(report_subcategory_sales.sales_total) FROM report_subcategory_sales) as numeric) AS sales_diff
    from airflow import DAG
    from airflow.models.variable import Variable
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonVirtualenvOperator
    from airflow.operators.dummy import DummyOperator
    from airflow.utils.dates import days_ago
    from datetime import timedelta
    
    default_args = {
        "owner": "soda",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }
    
    PROJECT_ROOT = "<this_project_full_path_root>"
    
    
    def run_soda_scan(project_root, scan_name, checks_subpath = None):
        from soda.scan import Scan
    
        print("Running Soda Scan ...")
        config_file = f"{project_root}/soda/configuration.yml"
        checks_path = f"{project_root}/soda/checks"
    
        if checks_subpath:
            checks_path += f"/{checks_subpath}"
    
        data_source = "soda_demo"
    
        scan = Scan()
        scan.set_verbose()
        scan.add_configuration_yaml_file(config_file)
        scan.set_data_source_name(data_source)
        scan.add_sodacl_yaml_files(checks_path)
        scan.set_scan_definition_name(scan_name)
    
        result = scan.execute()
        print(scan.get_logs_text())
    
        if result != 0:
            raise ValueError('Soda Scan failed')
    
        return result
    
    
    with DAG(
        "model_adventureworks_sales_category",
        default_args=default_args,
        description="A simple Soda Library scan DAG",
        schedule_interval=timedelta(days=1),
        start_date=days_ago(1),
    ):
        ingest_raw_data = DummyOperator(task_id="ingest_raw_data")
    
        checks_ingest = PythonVirtualenvOperator(
            task_id="checks_ingest",
            python_callable=run_soda_scan,
        requirements=["-i https://pypi.cloud.soda.io", "soda-postgres", "soda-scientific"],
            system_site_packages=False,
            op_kwargs={
                "project_root": PROJECT_ROOT,
                "scan_name": "model_adventureworks_sales_category_ingest",
                "checks_subpath": "ingest/dim_product_category.yml"
            },
        )
    
        dbt_transform = BashOperator(
            task_id="dbt_transform",
            bash_command=f"dbt run --project-dir {PROJECT_ROOT}/dbt --select transform",
        )
    
        checks_transform = PythonVirtualenvOperator(
            task_id="checks_transform",
            python_callable=run_soda_scan,
            requirements=["-i https://pypi.cloud.soda.io", "soda-postgres", "soda-scientific"],
            system_site_packages=False,
            op_kwargs={
                "project_root": PROJECT_ROOT,
                "scan_name": "model_adventureworks_sales_category_transform",
                "checks_subpath": "transform"
            },
        )
    
        dbt_report = BashOperator(
            task_id="dbt_report",
            bash_command=f"dbt run --project-dir {PROJECT_ROOT}/dbt --select report",
        )
    
        checks_report = PythonVirtualenvOperator(
            task_id="checks_report",
            python_callable=run_soda_scan,
            requirements=["-i https://pypi.cloud.soda.io", "soda-postgres", "soda-scientific"],
            system_site_packages=False,
            op_kwargs={
                "project_root": PROJECT_ROOT,
                "scan_name": "model_adventureworks_sales_category_report",
                "checks_subpath": "report"
            },
        )
    
        publish_data = DummyOperator(task_id="publish_data")
    
        ingest_raw_data >> checks_ingest >> dbt_transform >> checks_transform >> dbt_report >> checks_report >> publish_data
    soda scan -d soda_demo -c soda/configuration.yml soda/ingest-checks/
    dbt run --project-dir dbt --select transform
    soda scan -d soda_demo -c soda/configuration.yml soda/transform-checks/
    dbt run --project-dir dbt --select report
soda scan -d soda_demo -c soda/configuration.yml soda/reports-checks/
    filter CUSTOMERS [daily]:
       where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    checks for CUSTOMERS [daily]:
      - row_count = 6
      - missing(cat) = 2
    soda scan -d snowflake_customer_data -v ts_start=2022-03-11 -v ts_end=2022-03-15 checks.yml
    variables:
      name: Customers UK
    checks for dim_customer:
      - row_count > 1:
         name: Row count in ${name}
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check PASSED: 
        dim_customer in adventureworks
          Row count in Customers UK [PASSED]
    All is good. No failures. No warnings. No errors.
    checks for dim_customer:
      - missing_count(last_name) > 0
    checks for dataset_1:
      - row_count > ${VAR}
    
    checks for dim_customer:
      - missing_count(last_name) > 99:
             identity: aa457447-60f6-4b09-4h8t-02fbb78f9587
    {data_source}-{dataset}-{column}-{5 random characters}
    sales_db-dim_customer-last_name-a8k4z
    checks for dim_reseller:
      - duplicate_count(phone):
          warn: when > 5
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check WARNED: 
        dim_reseller in adventureworks
          duplicate_count(phone) [WARNED]
            check_value: 48
    Only 1 warning. 0 failure. 0 errors. 0 pass.
    Sending results to Soda Cloud
    checks for dim_reseller:
      - duplicate_count(phone):
          warn: when between 1 and 10
          fail: when > 10
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check FAILED: 
        dim_reseller in adventureworks
          duplicate_count(phone) [FAILED]
            check_value: 48
    Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
    Sending results to Soda Cloud
    checks for dim_customer:
      - row_count:
          warn:
            when > 2
            when < 0
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check WARNED: 
        dim_customer in adventureworks
      row_count warn when > 2 when < 0 [WARNED]
            check_value: 18484
    Only 1 warning. 0 failure. 0 errors. 0 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 42812***
    checks for dim_product:
      - sum(safety_stock_level):
          name: Stock levels are safe
          warn:
            when > 0
          fail:
            when > 0
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check FAILED: 
        dim_product in adventureworks
          Stock levels are safe [FAILED]
            check_value: 275936
    Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 6016***
    checks for CUSTOMERS:
      - row_count:
          warn: when not between -10 and 10
          fail: when not between -20 and 20
    checks for CUSTOMERS:
      - row_count:
          warn: when between -20 and 20
          fail: when between -10 and 10
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours for US Sales
          filter: sales_territory_key = 11
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours for US Sales
          filter: middle_name = 'Henry'
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours for US Sales
          filter: sales_territory_key = 11 AND salaried_flag = 1
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours for US Sales
          filter: sales_territory_key = 11 AND 
                  sick_leave_hours > 0 OR
                  pay_frequency > 1
    checks for my_dataset:
      - missing_count("Email") = 0:
          name: missing email
          filter: |
            "Status" = 'Client'  
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check NOT EVALUATED: 
        dim_employee in adventureworks
          Too many vacation hours for US Sales [NOT EVALUATED]
            check_value: None
    1 checks not evaluated.
    Apart from the checks that have not been evaluated, no failures, no warnings and no errors.
    checks for CUSTOMERS:
      - missing("id") = 0
    SELECT
      COUNT(CASE WHEN "id" IS NULL THEN 1 END)
    FROM CUSTOMERS
    for each dataset T:
      datasets:
        # include the dataset 
        - dim_customers
        # include all datasets matching the wildcard expression
        - dim_products%
        # (optional) explicitly add the word include to make the list more readable
        - include dim_employee
        # exclude a specific dataset
        - exclude fact_survey_response
        # exclude any datasets matching the wildcard expression
        - exclude prospective_%
      checks:
        - row_count > 0
    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library:v1.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    ls
    cmdstan-2.26.1		prophet_model.bin
install_name_tool -add_rpath @executable_path/cmdstan-your_cmdstan_version/stan/lib/stan_math/lib/tbb prophet_model.bin
    install_name_tool -add_rpath @executable_path/cmdstan-2.26.1/stan/lib/stan_math/lib/tbb prophet_model.bin
    pip install -i https://pypi.cloud.soda.io soda-scientific
    checks for dim_customer:
      - anomaly score for row_count < default
    checks for orders:
      - anomaly score for avg(order_price) < default
    checks for orders:
      - anomaly score for missing_count(id) < default:
          missing_values: [None, No Value]
    Soda Library 1.0.x
Soda Core 3.0.x
    Anomaly Detection Frequency Warning: Coerced into daily dataset with last daily time point kept
    Data frame must have at least 4 measurements
    Skipping anomaly metric check eval because there is not enough historic data yet
    Scan summary:
    1/1 check NOT EVALUATED: 
        dim_customer in adventureworks
          anomaly score for missing_count(last_name) < default [NOT EVALUATED]
            check_value: None
    1 checks not evaluated.
    Apart from the checks that have not been evaluated, no failures, no warnings and no errors.
    Sending results to Soda Cloud
    checks for dim_customer:
      - anomaly score for row_count < default:
          warn_only: True
    checks for dim_product:
      - anomaly score for avg("order_price") < default
    for each dataset T:
      datasets:
        - dim_customer
      checks:
        - anomaly score for row_count < default
    checks for dim_customer:
      - group by:
          name: Group by gender
          query: |
            SELECT gender, AVG(total_children) as average_children
            FROM dim_customer
            GROUP BY gender
          fields:
            - gender
          checks:
            - average_children > 2:
                name: Average children per gender should be more than 2
            - anomaly detection for average_children:
                name: Detect anomaly for average children
            - change for average_children between -5 and 5:
                name: Detect unexpected changes for average children
    pip install -i https://pypi.cloud.soda.io soda-scientific
    docker pull sodadata/soda-library:v1.0.3
    docker run sodadata/soda-library:v1.0.3 --help
     Usage: soda [OPTIONS] COMMAND [ARGS]...
    
       Soda Library CLI version 1.0.x, Soda Core CLI version 3.0.xx
    
     Options:
       --version  Show the version and exit.
       --help     Show this message and exit.
    
     Commands:
       ingest           Ingests test results from a different tool
       scan             Runs a scan
       suggest          Generates suggestions for a dataset
       test-connection  Tests a connection
       update-dro       Updates contents of a distribution reference file
    WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
    docker: Error response from daemon: Mounts denied: 
    The path /soda-library-test/files is not shared from the host and is not known to Docker.
    You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
    See https://docs.docker.com/desktop/mac for more info.
    Soda Library 1.0.x
    Configuration path 'configuration.yml' does not exist
    Path "checks.yml" does not exist
    Scan summary:
    No checks found, 0 checks evaluated.
    2 errors.
    Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
    ERRORS:
    Configuration path 'configuration.yml' does not exist
    Path "checks.yml" does not exist
cd path_to_your_python_virtual_env/lib/pythonyour_version/site-packages/prophet/stan_model/
    cd ~/venvs/soda-library-prophet11/lib/python3.9/site-packages/prophet/stan_model/
=
<
>
<=
>=
!=
<>
between
not between
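Any of these symbols and phrases can set a check's threshold. A minimal sketch combining a few of them, with dataset and column names borrowed from examples elsewhere in this documentation:

```yaml
checks for dim_reseller:
  - row_count != 0
  - duplicate_count(phone) between 0 and 10
```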
    python3 -m venv ~/venvs/soda-grafana
    
    cd venvs
    
    source soda-grafana/bin/activate
    pip install requests psycopg2
    import os
    import requests
    
    
    class ApiScan():
        def __init__(self):
           self.url = os.environ.get("SODA_URL")
           self.auth = (os.environ.get("API_KEY"), os.environ.get("API_SECRET"))
    
        def _get(self, endpt: str, headers: dict = None):
            r = requests.get(url=self.url + endpt, auth=self.auth,
                             headers=headers)
            print(f"_get result for endpoint {endpt}: {r}")
            r = r.json()
            return r
    
        def _post(self, endpt: str, headers: dict = None, data: dict = None):
            r = requests.post(url=self.url + endpt, auth=self.auth,
                              headers=headers, data=data)
            return r
    
        def test(self):  # expect response 200
            return self._get(endpt="test-login")
    
        def checks(self, datasetID: str, size: int = 100):
            headers = {"Accept": "application/json"}
            params = f"?size={size}&datasetID={datasetID}"
            return self._get(endpt=f"checks{params}", headers=headers)
    
        def datasets(self, from_ts: str, size: int = 100):
            headers = {"Accept": "application/json"}
            params = f"?size={size}&from={from_ts}"
            return self._get(endpt=f"datasets{params}", headers=headers)
    
        def trigger(self, scan: str):
            headers = {"Accept": "application/json",
                       "Content-Type": "application/x-www-form-urlencoded"}
            data = {"scanDefinition": scan}
            return self._post(endpt="scans", headers=headers, data=data)
    
        def state(self, scan_id: str):
            return self._get(endpt=f"scans/{scan_id}")
        # Soda Cloud API keys, used in apiscan.py
        SODA_URL = https://cloud.soda.io/api/v1/
        API_KEY = xxx
        API_SECRET = xxx
        
        # PostgreSQL access credentials, used in main.py
        HOST = host_name
        PG_USER = user_login
        PG_PASSWORD = user_pass
    from apiscan import ApiScan
    import os
    import psycopg2
    import time
    scan_definition = "grafanascan0"
    ascan = ApiScan()
    sc = ascan.trigger(scan=scan_definition)
    id = sc.headers["X-Soda-Scan-Id"]
    state = ""  # do not get logs until scan completed
    while "completed" not in state:
        r = ascan.state(scan_id=id)
        state = r["state"]
        print(f"Scan state: {state}", end="\r")
        time.sleep(10)
    r = ascan.state(scan_id=id)
print("Scan done!")
    # EXTRACT SCAN DETAILS
    s = {}
    s["definitionName"] = r["scanDefinition"]["name"]
    s["scanStartTimestamp"] = r["started"]
    s["scanEndTimestamp"] = r["ended"]
    s["hasErrors"] = (False if r["errors"] == 0 else True)
    # EXTRACT DATASETS DETAILS
    ### get only the datasets just scanned
    d_api = ascan.datasets(from_ts=r["started"])
    d_all = [{"datasetName": d["name"],
              "datasetID": d["id"],
              "dataSource": d["datasource"]["name"]}
              for d in d_api["content"]]
    # EXTRACT CHECKS DETAILS
    c_all = []  ### get only the checks for datasets just scanned
    for d in d_all:
        d_checks = ascan.checks(datasetID=d["datasetID"])
        c_all = c_all + d_checks["content"]
    c_cols = ["id", "name", "evaluationStatus", "column"]
    checks = []  # list of rows, each row has both scan and check details
    for check in r["checks"]:  ### find scanned check in all checks from api
        c_single = next(c for c in c_all if c["id"] == check["id"])
        c = {col:c_single[col] for col in c_cols}
        # rename keys to standardize and avoid SQL special words
        c["identity"] = c.pop("id")
        c["outcome"] = c.pop("evaluationStatus")
        c["columnName"] = c.pop("column")
        # add dataset details - datasetName, dataSource
        c["datasetName"] = c_single["datasets"][0]["name"]
        d_full = next(d for d in d_all if d["datasetName"] == c["datasetName"])
        c["dataSource"] = d_full["dataSource"]
        checks.append({**s, **c})  # combine scan, dataset and check details in one row
    # POSTGRES / SQL
    target = "my_schema.api_results"  # <schema>.<table>
    
    try:
        print("postgres.py trying to connect to database...")
        conn = psycopg2.connect(database="postgres",
                                user=os.environ.get("PG_USER"),
                                password=os.environ.get("PG_PASSWORD"),
                                host=os.environ.get("HOST"),
                                port="5432")
        print("postgres.py connected to database!")
        curs = conn.cursor()
    except Exception as e:
        print(f"postgres.py failed to connect to database: {e}")
    
    # Create a table if one does not exist
    schema = []
    for col in checks[0].keys():
        if "Timestamp" in col:
            schema.append(f"{col} TIMESTAMP")
        elif col[0:3] == "has":
            schema.append(f"{col} BOOLEAN")
        else:
            schema.append(f"{col} VARCHAR")
    
    schema = "(" + ", ".join(schema) + ")"
    
    curs.execute(f"""
    create table if not exists {target} {schema}
    """
    )
    curs.connection.commit()
    
    # Insert into table
    # Create list of cols without quotation marks " "
    cols = "(%s)" % ", ".join(map(str, checks[0].keys()))
    
    # create tuples of values to be appended
    values = [tuple(str(v) for v in check.values()) for check in checks]
    values = str(values).strip('[]')
    
    curs.execute(f"""
    insert into {target} {cols} values {values}
    """)
    curs.connection.commit()
    
print("Scan processed to PostgreSQL!")
    Add an in-check filter to a check
    example
    Use quotes in a check
    example
    Apply checks to multiple datasets
    Scan a portion of your dataset
    What does the scan command do?
• docker run ensures that the Docker engine runs a specific image.

• -v mounts your SodaCL files into the container. In other words, it makes the configuration.yml and checks.yml files in your local environment available to the Docker container. The command example maps your local directory to /sodacl.

    List of Soda Scientific dependencies
    • pandas<2.0.0

    • wheel

    • pydantic>=1.8.1,<2.0.0

    Numeric metrics

    Use numeric metrics in SodaCL checks for data quality.

    Use a numeric metric in a check to perform basic calculations on the data in your dataset.

✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✔️ Some available as a no-code check with a self-hosted Soda Agent connected to any Soda-supported data source, except Spark, Dask, and Pandas, OR with a Soda-hosted Agent connected to a BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake data source

    Define checks with numeric metrics

    In the context of Soda check types, you use numeric metrics in Standard checks. Refer to Standard check types for exhaustive configuration details.

    You can use the row_count metric in checks that apply to entire datasets.

    You can use all numeric metrics in checks that apply to individual columns in a dataset. Identify the column by adding a value in the argument between brackets in the check.
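For example, a sketch applying numeric metrics to individual columns, with column names assumed from the adventureworks sample data used elsewhere in this documentation:

```yaml
checks for dim_product:
  - avg(list_price) between 100 and 200
  - max(safety_stock_level) <= 1000
```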

You can use some numeric metrics in checks with either fixed or change-over-time thresholds; see Change-over-time thresholds, below, for more detail.

    Failed row samples

    Checks that use the duplicate_count or duplicate_percent metrics automatically collect samples of any failed rows to display in Soda Cloud. The default number of failed row samples that Soda collects and displays is 100.

If you wish to limit or broaden the sample size, you can use the samples limit configuration in a check with a validity metric. You can add this configuration to your checks YAML file for Soda Library, or when writing checks as part of an agreement in Soda Cloud.

For security, you can add a configuration to your data source connection details to prevent Soda from collecting failed row samples from specific columns that contain sensitive data.

Alternatively, you can set the samples limit to 0 to prevent Soda from collecting and sending failed row samples for an individual check.
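A sketch of such a check, reusing the dim_reseller example from earlier in this section:

```yaml
checks for dim_reseller:
  - duplicate_count(phone):
      warn: when > 5
      # prevent Soda from collecting failed row samples for this check
      samples limit: 0
```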

You can also add a samples columns or a collect failed rows configuration to a check to specify the columns for which Soda must implicitly collect failed row sample values. Soda only collects a check's failed row samples for the columns you specify in the list.

    Note that the comma-separated list of samples columns does not support wildcard characters (%).
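A sketch of the samples columns configuration; the reseller_name column here is an assumed, illustrative name:

```yaml
checks for dim_reseller:
  - duplicate_count(phone):
      warn: when > 5
      # collect failed row samples only for these columns
      samples columns: [phone, reseller_name]
```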

To review the failed rows in Soda Cloud, navigate to the Checks dashboard, then click the row for a check for duplicate values, and examine failed rows in the Failed Rows Analysis tab.

    Optional check configurations

    Supported
    Configuration
    Documentation

    Example with alert configuration

    Example with check name

    Example with in-check filter

    Example with quotes

    Example with dataset filter

    Example with for each

    List of numeric metrics

    Metric
    Description
    Supported data type
    Supported data sources

    List of comparison symbols and phrases

    Change-over-time thresholds

Numeric metrics can specify a fixed threshold which is not relative to any other threshold. row_count > 0 is an example of a check with a fixed threshold, as the threshold value, 0, is absolute.

    Only checks that use numeric metrics can specify a change-over-time threshold, a value that is relative to a previously-measured, or historic, value. Sometimes referred to as a dynamic threshold or historic metrics, you use these change-over-time thresholds to gauge changes to the same metric over time. Most of the examples below use the row_count metric, but you can use any numeric metric in checks that use change-over-time thresholds.

    The most basic of change-over-time threshold checks has three or four mutable parts:

    The example below defines a check that applies to the entire dataset and counts the rows in the dataset, then compares that value to the preceding value contained in the Cloud Metric Store. If the row_count at present is greater than the previously-recorded historic value for row_count by more than 50 or less than -20, the check fails.
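In SodaCL, that check can be sketched as follows; change for compares the new measurement with the most recent historic value in the Cloud Metric Store:

```yaml
checks for dim_customer:
  - change for row_count between -20 and +50
```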

    Use between for checks with change-over-time thresholds as much as possible to trigger check failures when the measurement falls outside of a range of acceptable values. This practice ensures that you get visibility into changes that either exceed or fall short of threshold expectations.

You can also use a change-over-time threshold to compare check results relative to the same day in the previous week. The example below uses change-over-time to compare today's value with the same check result from last week to confirm that the delta is greater than 10.
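A sketch of that comparison, using the same day last week modifier:

```yaml
checks for dim_customer:
  - change same day last week for row_count > 10
```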

    The example below defines a check that applies to the entire dataset and counts the rows in the dataset, then compares that value to the preceding value contained in the Cloud Metric Store. If the row_count at present is greater than the previously-recorded historic value for row_count by more than 50%, the check fails.
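A sketch of the percentage variant, using change percent:

```yaml
checks for dim_customer:
  - change percent for row_count < 50
```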

    For example, the previously-recorded historic measurement for row count is 80, and the newly-recorded value is 100, the relative change is 25%, which is less than the 50% specified in the threshold, so the check passes.

    • Percentage thresholds are between 0 and 100, not between 0 and 1.

    • If you wish, you can add a % character to the threshold for a change-over-time threshold for improved readability.

    • If the previous measurement value is 0 and the new value is 0, Soda calculates the relative change as 0%. However, if the previous measurement value is 0 and the new value is not 0, then Soda indicates the check as NOT EVALUATED because the calculation is a division by zero.

    The example below applies to only the phone column in the dataset and counts the rows that contain duplicate values, then compares that value to the preceding value contained in the Cloud Metric Store. If the number of duplicate phone numbers at present is greater than the preceding historic values for duplicate_count by more than 20, the check fails.
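That column-level check can be sketched as:

```yaml
checks for dim_reseller:
  - change for duplicate_count(phone) < 20
```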

    A more complex change-over-time threshold check includes two more optional mutable parts:
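Those optional parts, a calculation type (avg, min, or max) and a count of historical values, can be sketched as three checks, each drawing on the preceding seven historic measurements:

```yaml
checks for dim_customer:
  - change avg last 7 for row_count < 50
  - change min last 7 for row_count < 50
  - change max percent last 7 for row_count < 50
```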

    The example above defines three checks, one for each type of calculation available to use, avg, min, and max, all of which apply to the entire dataset.

    The first check counts the rows in the dataset, then compares that value to the calculated average of the preceding seven measurement values for that metric contained in the Cloud Metric Store. If the row_count at present is greater than the average of the seven preceding historic values by more than 50, the check fails. The only valid historical value definition you can use is seven.

    The second check in the example determines the minimum value of the preceding seven historic values, then uses that value to compare to the present measurement value.

    The third check in the example determines the maximum value of the preceding seven historic values, then uses that value and the present measurement value to calculate the percentage of change.

    Go further

• Use numeric metrics in checks with alert configurations to establish warn and fail zones.

• Use numeric metrics in checks to define ranges of acceptable thresholds using between and not between phrases.

    Test data quality in a Databricks pipeline

    Use this guide as an example of how to invoke Soda data quality tests in a Databricks pipeline.

    Use this guide as an example for how to set up and use Soda to test the quality of data in a Databricks pipeline. Automatically catch data quality issues after ingestion or transformation, and before using the data to train a machine learning model.

    Jump to Databricks notebooks

    About this guide

The instructions below offer an example of how to execute Soda Checks Language (SodaCL) checks for data quality within a Databricks pipeline that handles data which trains a machine learning (ML) model.

For context, this guide demonstrates a Data Scientist and Data Engineer working with Human Resources data to build a forecast model for employee attrition. The Data Engineer, working with a Data Scientist, uses a Databricks notebook to gather data from SQL-accessible datasets, transforms the data into the correct format for their ML model, then uses the data to train the model.

    Though they do not have direct access to the data to be able to resolve issues themselves, the Data Engineer can use Soda to detect data quality issues before the data model trains on poor-quality data. The pipeline the Data Engineer creates includes various SodaCL checks embedded at two stages in the pipeline: after data ingestion and after data transformation. At the end of the process, the pipeline stores the checks' metadata in a Databricks table which feeds into a data quality dashboard. The Data Engineer utilizes Databricks workflows to schedule this process on a daily basis.

    Prerequisites

    The Data Engineer in this example uses the following:

    • Python 3.8, 3.9, or 3.10

    • Pip 21.0 or greater

    • a Databricks account

    • access to a Unity catalog

    Create a Soda Cloud account

    To validate an account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.

1. In a browser, the Data Engineer navigates to Soda Cloud to create a new Soda account, which is free for a 45-day trial.

2. They navigate to their avatar > Profile, access the API keys tab, then click the plus icon to generate new API keys.

    3. They copy+paste the API key values to a temporary, secure place in their local environment.

    Connect Soda Cloud to Soda Library and data source

    1. Within Databricks, the Data Engineer creates two notebooks:

    • Data Ingestion Checks, which runs scans for data quality after data is ingested into a Unity catalog

    • Input Data Checks, which prepares data for training a machine learning model and runs data quality scans before submitting to the model for training

2. In the same directory as the Databricks notebooks, the Data Engineer creates a soda_settings directory to contain a configuration file and, later, the check YAML files that Soda needs to run scans. To connect Soda to the Unity catalog, the Data Engineer prepares a soda_conf.yml file in this directory which stores the data source connection details.

3. To the file, they add the data source connection configuration for the Unity catalog that contains the Human Resources data the Data Engineer uses, and the Soda Cloud API key connection configuration, then they save the file.


    Write checks for data quality

    A check is a test that Soda executes when it scans a dataset in your data source. The checks.yml file stores the checks you write using the Soda Checks Language. You can create multiple checks files to organize your data quality checks and run all, or some of them, at scan time.

In this example, the Data Engineer creates three checks files in the soda_settings directory in Databricks:

    • ingestion_checks.yml to execute quality checks after data ingestion into the Unity catalog in the Data Ingestion Checks notebook

• input_data_checks.yml to execute quality checks after transformation, and before using the data to train their ML model in the Input Data Checks notebook.

• output_data_checks.yml to execute quality checks after training the model and to monitor the model's performance.

    The raw data in this example is divided into two main categories.

    • The first category is Human Resources data, which the Unity catalog contains in three datasets: basic employee information, results of manager surveys, and results of employee surveys. The survey datasets are updated on a frequent basis.

    • The second category is application login data, which is a file in the Databricks file system; it is updated daily.

    Download:

    Read more:

    Post-ingestion checks

    The Data Engineer creates a checks YAML file to write checks that apply to the datasets they use to train their ML model. The Data Ingestion Checks notebook runs these checks after the data is ingested into the Unity catalog. For any checks that fail, the Data Engineer can notify upstream Data Engineers or Data Product Owners to address issues such as missing data or invalid entries.

Many of the checks that the Data Engineer prepares include check attributes which they created in Soda Cloud. When added to checks, the Data Engineer can use the attributes to filter check results in Soda Cloud, build custom views (Collections), and stay organized as they monitor data quality in the Soda Cloud user interface. Skip to Review check results in Soda Cloud to see an example.

The Data Engineer also added a dataset filter to the quality checks that apply to the application login data. The filter serves to partition the data against which Soda executes the checks; instead of checking for quality on the entire dataset, the filter limits the scan to the previous day’s data.
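As a minimal sketch, such a partition filter might look like the following, assuming a hypothetical login_data dataset with a login_date column; the ts_start and ts_end variables would be supplied at scan time:

```yaml
# Hypothetical dataset and column names, for illustration only
filter login_data [daily]:
  where: login_date >= TIMESTAMP '${ts_start}' AND login_date < TIMESTAMP '${ts_end}'

checks for login_data [daily]:
  - row_count > 0
```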

    ingestion_checks.yml

    Post-transformation checks

The Data Engineer also prepared a second set of SodaCL checks in a separate file to run after transformation in the Input Data Checks notebook. Curious readers can download the notebook to review the transformations and the resulting input_data_attrition_model output in a DataFrame.

    Two of the checks the Data Engineer prepares involve checking groups of data. The group evolution check validates the presence or absence of a group in a dataset, or checks for changes to groups in a dataset relative to their previous state; in this case, it confirms the presence of the Married group in the data, and warns when any group changes. Further, the group by check collects and presents check results by category; in this case, it groups the results according to JobLevel.
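    A sketch of these two check types, assuming illustrative column names rather than the exact contents of the downloadable file:

```yaml
checks for input_data_attrition_model:
  # Group evolution: fail if the Married group is absent; warn when any group changes
  - group evolution:
      name: Married group is present
      query: |
        SELECT MaritalStatus FROM input_data_attrition_model GROUP BY MaritalStatus
      fail:
        when required group missing: [Married]
      warn:
        when groups change: any
  # Group by: collect and present check results per JobLevel
  - group by:
      query: |
        SELECT JobLevel, AVG(MonthlyIncome) as avg_income
        FROM input_data_attrition_model
        GROUP BY JobLevel
      fields:
        - JobLevel
      checks:
        - avg_income > 0:
            name: Average income per job level
```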

    input_data_checks.yml

    Invoke Soda in Databricks notebooks

    At the beginning of this exercise, the Data Engineer created two notebooks in their Databricks workflow:

    • Data Ingestion Checks to run after data is ingested into the Unity catalog

    • Input Data Check to run after transformation, and before using the data to train the ML model

    The following outlines the contents of each notebook and the steps included to install Soda and invoke it to run scans for data quality, thereby executing the data quality checks in the checks YAML files. Beyond invoking Soda to scan for data quality, the notebooks also save the checks' metadata for further analysis.

    Data ingestion checks

    Download:

    Input data checks and model output checks

    Download:

    Using the same structure, the data scientists define some extra checks to validate and monitor the performance of their model after training. They define a ratio between the categories and apply anomaly detection to make sure that there are no spikes or unexpected shifts in the label distribution. Furthermore, they add a check to ensure that they are notified when the model accuracy falls below 60% and/or when the dataset is incomplete.
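    A sketch of what such checks can look like in SodaCL; the dataset, column, and metric names below are assumptions for illustration, not the exact contents of model_output_checks.yml:

```yaml
checks for model_output:
  # Detect spikes or unexpected shifts in the label distribution
  - anomaly detection for attrition_ratio:
      attrition_ratio query: |
        SELECT AVG(CASE WHEN PredictedAttrition = 'Yes' THEN 1.0 ELSE 0.0 END)
        FROM model_output
      name: No unexpected shifts in the label distribution
  # Alert when model accuracy drops below 60%
  - accuracy >= 0.6:
      accuracy query: |
        SELECT AVG(CASE WHEN PredictedAttrition = ActualAttrition THEN 1.0 ELSE 0.0 END)
        FROM model_output
      name: Model accuracy is at least 60%
  # Alert when the dataset is incomplete
  - missing_count(PredictedAttrition) = 0:
      name: Predictions are complete
```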

    model_output_checks.yml

    Review check results in Soda Cloud

    After running the notebooks, the Data Engineer accesses Soda Cloud to review the check results.

    In the Checks page, they apply filters to narrow the results to the datasets involved in the Employee Attrition ML model, and distill the results even further by selecting to display only those results with the Pipeline attribute of Ingest. They save the results as a Collection labeled Employee Attrition - Ingestion to easily access the relevant quality results in the future.

    Review check results in a Unity dashboard

    After the Data Engineer trains the model to forecast employee attrition, they decide to devise an extra step in the process: use the Soda Cloud API to export all the Soda check results and dataset metadata back into the Unity catalog, then build a dashboard to display the results.

    Coming soon: a tutorial for building a dashboard using the Soda Cloud API.

    Go further

    • Learn more about SodaCL metrics and checks.

    • Learn more about getting organized in Soda Cloud.

    • Set notification rules to receive alerts when checks fail.

    Group by

    Use a SodaCL group by configuration to group data quality check results by category.

    This feature is not supported in Soda Core OSS. Migrate to Soda Library in minutes to start using this feature for free with a 45-day trial.

    Use a group by configuration to collect and present check results by category.

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    checks for retail_products:
      - avg(size) between 100 and 300 
      - avg_length(manufacturer) > 10
      - duplicate_count(product_id) = 0
      - duplicate_percent(user_id) < 2%
      - max(size) <= 500
      - max_length(manufacturer) = 25
      - min(size) >= 50
      - min_length(manufacturer) = 5
      - row_count > 0
      - percentile(size, 0.95) > 50
    checks for retail_orders_postgres:
      - stddev(order_quantity) > 0
      - stddev_pop(order_quantity) between 3 and 4
      - stddev_samp(order_quantity) not between 3 and 4
      - sum(discount) < 120
      - variance(discount) > 0
      - var_pop(discount) between 0 and 5
      - var_samp(discount) not between 0 and 5

    | Supported | Configuration | Documentation |
    |---|---|---|
    | ✓ | Use quotes when identifying dataset or column names. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark. | |
    | - | Use wildcard characters ( % or * ) in values in the check. | - |
    | ✓ | Use for each to apply checks with numeric metrics to multiple datasets in one scan. | |
    | ✓ | Apply a dataset filter to partition data during a scan. | |

    | Metric | Description | Supported data type | Supported data sources |
    |---|---|---|---|
    | duplicate_percent | duplicate_count (as defined above) over the total row count, expressed as a percentage. See also: Duplicate check | number, text, time | all |
    | max | The greatest value in a numeric column. | number | all |
    | max_length | The greatest length in a text column. | text | all |
    | min | The smallest value in a numeric column. | number | all |
    | min_length | The smallest length in a text column. | text | all |
    | percentile | The value below which a percentage of observations fall within a group of observations. For example, percentile(distance, 0.7). | number | PostgreSQL, Snowflake |
    | row_count | The number of rows in a dataset or column, if specified. | number, text, time | all |
    | stddev | The calculated standard deviation of values in a numeric column. | number | Athena, BigQuery, PostgreSQL, Redshift, Snowflake |
    | stddev_pop | The calculated population standard deviation of values in a numeric column. | number | Athena, BigQuery, PostgreSQL, Redshift, Snowflake |
    | stddev_samp | The calculated sample standard deviation of values in a numeric column. | number | Athena, BigQuery, PostgreSQL, Redshift, Snowflake |
    | sum | The calculated sum of the values in a numeric column. | number | all |
    | variance | The calculated variance of the values in a numeric column. | number, time | Athena, BigQuery, PostgreSQL, Redshift, Snowflake |
    | var_pop | The calculated population variance of the values in a numeric column. | number, time | Athena, BigQuery, PostgreSQL, Redshift, Snowflake |
    | var_samp | The calculated sample variance of the values in a numeric column. | number, time | Athena, BigQuery, PostgreSQL, Redshift, Snowflake |

    | Supported | Configuration | Documentation |
    |---|---|---|
    | ✓ | Define a name for a check with numeric metrics; see example. | Customize check names |
    | ✓ | Add an identity to a check. | Add a check identity |
    | ✓ | Define alert configurations to specify warn and fail thresholds; see example. | Add alert configurations |
    | ✓ | Apply an in-check filter to return results for a specific portion of the data in your dataset; see example. | |

    | Metric | Description | Supported data type | Supported data sources |
    |---|---|---|---|
    | avg | The average value in a numeric column. | number | all |
    | avg_length | The average length in a text column. | text | all |
    | duplicate_count | The count of distinct values that have duplicates. Multiple column names can be specified to count duplicate sets of values, as in duplicate_count(a, b). See also: Duplicate check | number, text, time | all |

    • a metric
    • an argument (optional)
    • a comparison symbol or phrase
    • a threshold

    | metric | argument (optional) | comparison symbol or phrase | threshold |
    |---|---|---|---|
    | row_count | - | - | between -20 and +50 |
    | row_count | - | - | > 10 |
    | row_count | - | < | 50 % |
    | duplicate_count | (phone) | < | 20 |

    • a calculation type (optional): avg, min, max
    • a historical value definition (optional): last 7
    • percent (optional)
    • a metric
    • an argument (optional)
    • a comparison symbol or phrase
    • a threshold

    | calculation type (optional) | historical value definition (optional) | percent (optional) | metric | argument (optional) | comparison symbol or phrase | threshold |
    |---|---|---|---|---|---|---|
    | avg | last 7 | - | row_count | - | < | 50 |
    | min | last 7 | - | row_count | - | < | 50 |
    | max | last 7 | percent | row_count | - | < | 50 |


    Need help? Join the Soda community on Slack.


    Define a group by configuration

    For an individual dataset, add a group by configuration to specify the categories into which Soda must group the check results.

    The example below uses a SQL query to define a custom metric for the fact_internet_sales dataset. It calculates the average order discount based on the contents of the discount_amount column, then groups the results according to the value in the sales_territory_key. This check supports up to a maximum of 1000 groups.

    The check itself uses the custom metric average_discount and an alert configuration to determine whether the measurement for each group passes, warns, or fails. In this case, any calculated measurement for average_discount that exceeds 40 results in a fail.

    | Parameter | Required/Optional | Description |
    |---|---|---|
    | group by | required | configuration section label |
    | group_limit | optional | the maximum number of groups, or column values, into which Soda must categorize the results. This value must correspond with the number of unique values in the column you identify in the fields section; see example below. This check supports up to a maximum of 1000 groups. |
    | group_name | optional | specify a name for the group; Soda does not evaluate this parameter |
    | query | required | custom query subsection label; the nested SQL query defines the custom metric average_discount |
    | fields | required | column subsection label |

    You can also use multi-column groups in a group by check, as in the example below that groups results both by gender and english_education.

    Group by check results

    When you run a scan that includes checks nested in a group by configuration, the output in Soda Library CLI groups the results according to the unique values in the column you identified in the fields subsection. The number of unique values in the column must match the value you provided for group_limit.

    In the example results below, the calculated average of average_discount for each sales territory is less than 40%, so the check for each group passed. The value in the square brackets next to the custom check name identifies the group which, in this case, is a number that corresponds to a territory.

    In Soda Cloud, the check results appear by territory.

    Be aware that a check that contains one or more alert configurations only ever yields a single check result; one check yields one check result. If your check triggers both a warn and a fail, the check result only displays the more severe, failed check result. (Schema checks behave slightly differently; see Schema checks.)

    In the following example, Soda Library discovers during a scan that the data in the dataset triggers both alerts, but the check result is still Only 1 warning. Nonetheless, the results in the CLI still display both alerts as having triggered a [WARNED] state.

    The check in the example below triggers both the warn alert and the fail alert, but only returns a single check result, the more severe Oops! 1 failures.

    Optional check configurations

    | Supported | Configuration | Documentation |
    |---|---|---|
    | ✓ | Define a name for a group by; see example. | Example with check name |
    | ✓ | Add an identity to a check. | Example with identity |
    | ✓ | Define alert configurations to specify warn and fail alert conditions; see example. | Example with alert configuration |
    | - | Apply an in-check filter to return results for a specific portion of the data in your dataset. | - |

    Example with check name

    When the check results appear in Soda Cloud, the checks use the name you define for the check or, if you do not specify a name parameter, the syntax of the check itself.

    Example with identity

    See also: Change configurations and preserve check history.

    Example with alert configuration

    Be aware that Soda only ever returns a single check result per check. See Expect one check result for details.

    Example with quotes

    Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    Add multiple group configurations

    You can add multiple group by configurations to the same dataset in the same checks YAML file and produce separately grouped check results. To do so, you must include an identifier in the group by configuration key of the extra group configurations you add, as in the example below.

    • The first group by in the example requires no identifier, though for readability and completeness, you have the option of adding one.

    • The second group by includes title as an identifier to differentiate it from the first.

    When the check results appear in Soda Cloud, the checks use the name you define for each check or, if you do not specify a name parameter, the syntax of the checks themselves. The group by identifiers appear in the CLI output, but do not appear in the check results in Soda Cloud.

    Change configurations and preserve check history

    When you make changes to your group by configuration, some changes result in a resetting of the check history in Soda Cloud.

    The following changes result in a reset of a group by check's history; all historical measurements disappear.

    • change the list of fields, either adding, removing, or changing an existing field

    • change the group by identifier when more than one group by configuration exists; see Add multiple group configurations

    • change the syntax of a group by check, such as a change to the threshold value; see Configure variables in SodaCL to mitigate disruption using dynamic threshold values

    The following changes result in no changes to the check's history; Soda preserves all historical measurements to a maximum of 90 days.

    • change the SQL query that forms part of the group by configuration

    • add a check to the group by configuration

    Further, you can add an identity parameter to a group by check to be able to make changes to the check and still preserve its historical measurements, as in the example below. See also: Add a check identity.

    Track anomalies and relative changes by group

    You can use a group by configuration to detect anomalies by category, and monitor relative changes over time in each category.

    ✔️ Requires Soda Core Scientific for anomaly check (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library 1.1.27 or greater + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent 0.8.57 or greater ✖️ Available as a no-code check

    The following example includes three checks grouped by gender.

    • The first check uses the custom metric average_children to collect measurements and gauge them against an absolute threshold of 2. Soda Cloud displays the check results grouped by gender.

    • The second check uses the same custom metric to detect anomalous measurements relative to previous measurements. Soda must collect a minimum of four regular-cadence measurements to have enough data from which to gauge an anomalous measurement. Until it has enough measurements, Soda returns a check result of [NOT EVALUATED]. Soda Cloud displays any detected anomalies grouped by gender.

    • The third check uses the same custom metric to detect changes over time in the calculated average measurement, and gauge the measurement against a threshold of between -5 and 5 relative to the previously-recorded measurement. See Change-over-time thresholds for supported syntax variations for change-over-time checks. Soda Cloud displays any detected changes grouped by gender.

    Troubleshoot

    If you use an optional group_limit parameter, you must always match the number of unique values in the group by column to the value you provide for the parameter.

    In the following example, the number of unique values in the sales_territory_key column is greater than the group_limit: 2 so Soda does not evaluate the check.

    Resolve the issue by increasing the group limit value, or by removing the parameter entirely.

    Go further

    • Use a group evolution check to surface changes in groups in a dataset.

    • Learn more about alert configurations.

    • Learn more about SodaCL metrics and checks in general.

    Migrate

    Need help? Join the Soda community on Slack.

    inside of the docker container.
  • sodadata/soda-library refers to the image that docker run must use.

  • scan instructs Soda Library to execute a scan of your data.

  • -d indicates the name of the data source to scan.

  • -c specifies the filepath and name of the configuration YAML file.

  • scipy>=1.8.0

  • numpy>=1.23.3, <2.0.0

  • inflection==0.5.1

  • httpx>=0.18.1,<2.0.0

  • PyYAML>=5.4.1,<7.0.0

  • cython>=0.22

  • prophet>=1.1.0,<2.0.0

  • cloud.soda.io/signup

    Need help? Join the Soda community on Slack.

    Write SodaCL checks

    Soda Checks Language is a human-readable, domain-specific language for data reliability. You use SodaCL to define Soda Checks in a checks YAML file.

    Soda Checks Language (SodaCL) is a YAML-based, domain-specific language for data reliability. Used in conjunction with Soda tools, you use SodaCL to write checks for data quality, then run a scan of the data in your data source to execute those checks. A Soda Check is a test that Soda performs when it scans a dataset in your data source.

    A Soda scan executes the checks you write in an agreement, in a checks YAML file, or inline in a programmatic invocation, and returns a result for each check: pass, fail, or error. Optionally, you can configure a check to warn instead of fail by setting an alert configuration.

    As a step in the Get started roadmap, this guide offers instructions to define your first SodaCL checks in the Soda Cloud UI as no-code checks or in agreements, in a checks YAML file, or within a programmatic invocation of Soda.

    checks for dim_reseller:
      - row_count > 0
    checks for dim_reseller:
      - duplicate_count(phone) = 0
    checks for dim_reseller:
    # a check with a fixed threshold
      - duplicate_count(phone) = 0
    # a check with a dynamic threshold
      - change avg last 7 for row_count < 50
    checks for dim_customer:
      - duplicate_count(email_address) < 50:
          samples limit: 2
    checks for dim_customer:
      - duplicate_count(email_address) < 50:
          samples limit: 0
    checks for dim_customer:
      - duplicate_count(email_address) < 50:
          samples columns: [last_name, first_name]
    checks for dim_reseller:
      - duplicate_count(phone):
          warn: when > 5
          fail: when >= 10  
    checks for dim_reseller:
      - duplicate_count(phone) = 0:
          name: Duplicate phone numbers
    checks for dim_employee:
      - max(vacation_hours) < 80:
          name: Too many vacation hours for sales territory US
          filter: sales_territory_key = 11
    checks for dim_reseller:
      - duplicate_count("phone") = 0
    filter CUSTOMERS [daily]:
      where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    
    checks for CUSTOMERS [daily]:
      - duplicate_count(phone) > 10
    for each dataset T:
      datasets:
        - dim_product
        - dim_customer
        - dim_reseller
      checks:
        - row_count > 0
     = 
     < 
     >
     <=
     >=
     !=
     <> 
     between 
     not between 
    checks for dim_customer:
      - change for row_count between -20 and +50
    checks for dim_customer:
      - change same day last week for row_count > 10
    checks for dim_customer:
      - change percent for row_count < 50%
    checks for dim_customer:
      - change for duplicate_count(phone) < 20
    checks for dim_customer:
      - change avg last 7 for row_count < 50
      - change min last 7 for row_count < 50
      - change max last 7 percent for row_count < 50
    checks for fact_internet_sales:
      - group by: # Not supported in Soda Core
          query: |
            SELECT sales_territory_key, AVG(discount_amount) as average_discount
            FROM fact_internet_sales
            GROUP BY sales_territory_key
          fields:
            - sales_territory_key
          checks:
            - average_discount:
                fail: when > 40
                name: Average discount percentage is less than 40% (grouped by sales territory)
    checks for fact_internet_sales:
      - group by:
          group_limit: 10
          name: average discount
          query: |
            SELECT sales_territory_key, AVG(discount_amount) as average_discount
            FROM fact_internet_sales
            GROUP BY sales_territory_key
          fields:
            - sales_territory_key
          checks:
            - average_discount:
                fail: when > 40
                name: Average discount percentage is less than 40% (grouped-by sales territory)
    checks for dim_customer:
        - group by:
            query: |
                SELECT
                    gender,
                    english_education,
                    sum(total_children) as sum_total_children
                FROM dim_customer
                GROUP BY gender, english_education
            fields:
                - gender
                - english_education
            checks:
                - sum_total_children:
                    fail: when < 100000
                    name: Total number of children
    Soda 1.0.x
    Soda Core 3.0.x
    Scan summary:
    11/11 checks PASSED: 
        fact_internet_sales in adventureworks
          group by [PASSED]
          Average discount percentage is less than 40% (grouped-by sales territory) [8] [PASSED]
          Average discount percentage is less than 40% (grouped-by sales territory) [10] [PASSED]
          Average discount percentage is less than 40% (grouped-by sales territory) [9] [PASSED]
          Average discount percentage is less than 40% (grouped-by sales territory) [7] [PASSED]
          Average discount percentage is less than 40% (grouped-by sales territory) [1] [PASSED]
          Average discount percentage is less than 40% (grouped-by sales territory) [5] [PASSED]
          Average discount percentage is less than 40% (grouped-by sales territory) [2] [PASSED]
          Average discount percentage is less than 40% (grouped-by sales territory) [4] [PASSED]
          Average discount percentage is less than 40% (grouped-by sales territory) [6] [PASSED]
          Average discount percentage is less than 40% (grouped-by sales territory) [3] [PASSED]
    All is good. No failures. No warnings. No errors.
    Sending results to Soda Cloud
    Soda Cloud Trace: 14733***37
    checks for dim_customer:
      - row_count:
          warn:
            when > 2
            when < 0
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check WARNED: 
        dim_customer in adventureworks
          row_count warn when > 2 when > 3 [WARNED]
            check_value: 18484
    Only 1 warning. 0 failure. 0 errors. 0 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 42812***
    checks for dim_product:
      - sum(safety_stock_level):
          name: Stock levels are safe
          warn:
            when > 0
          fail:
            when > 0
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    1/1 check FAILED: 
        dim_product in adventureworks
          Stock levels are safe [FAILED]
            check_value: 275936
    Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 6016***
    checks for dim_employee:
      - group by:
          name: Grouped vacation hours
          group_limit: 2
          query: |
            SELECT marital_status, AVG(vacation_hours) as vacation_hours
            FROM dim_employee
            GROUP BY marital_status
          fields:
            - marital_status
          checks:
            - vacation_hours > 60:
                name: Too many vacation hours
    checks for dim_employee:
      - group by:
        ...
          checks:
            - vacation_hours > 0:
                identity: custom_identity
    checks for dim_employee:
      - group by:
          group_limit: 2
          query: |
            SELECT marital_status, AVG(vacation_hours) as vacation_hours
            FROM dim_employee
            GROUP BY marital_status
          fields:
            - marital_status
          checks:
            - vacation_hours:
                fail: when > 65
                warn: when between 50 and 65
                name: Too many vacation hours
    checks for dim_employee:
      - group by:
          group_limit: 2
          query: |
            SELECT "marital_status", AVG("vacation_hours") as vacation_hours
            FROM "dim_employee"
            GROUP BY marital_status
          fields:
            - marital_status
          checks:
            - vacation_hours > 60:
                name: Too many vacation hours
    checks for dim_employee:
      - group by:
          query: |
            SELECT marital_status, AVG(vacation_hours) as vacation_hours, MAX(vacation_hours) as max_vacation_hours
            FROM dim_employee
            GROUP BY marital_status
          fields:
            - marital_status
          checks:
            - vacation_hours > 0
            - max_vacation_hours < 100:
                name: MAX vacation hours less than 100 [marital_status]
      - group by title:
          query: |
            SELECT title, AVG(vacation_hours) as vacation_hours, MAX(vacation_hours) as max_vacation_hours
            FROM dim_employee
            GROUP BY title
          fields:
            - title
          checks:
            - vacation_hours > 0:
            - max_vacation_hours < 100:
                name: MAX vacation hours less than 100 [title]
    Soda Library 1.3.1
    Soda Core 3.0.47
    By downloading and using Soda Library, you agree to Soda's Terms & Conditions (https://go.soda.io/t&c) and Privacy Policy (https://go.soda.io/privacy). 
    Scan summary:
    138/140 checks PASSED: 
        dim_employee in adventureworks
          group by [PASSED]
          vacation_hours > 0 [M] [PASSED]
          MAX vacation hours less than 100 [marital_status] [M] [PASSED]
          vacation_hours > 0 [S] [PASSED]
          MAX vacation hours less than 100 [marital_status] [S] [PASSED]
          vacation_hours > 0 [Database Administrator] [PASSED]
          MAX vacation hours less than 100 [title] [Database Administrator] [PASSED]
          vacation_hours > 0 [Design Engineer] [PASSED]
          MAX vacation hours less than 100 [title] [Design Engineer] [PASSED]
          vacation_hours > 0 [Production Supervisor - WC20] [PASSED]
          MAX vacation hours less than 100 [title] [Production Supervisor - WC20] [PASSED]
          vacation_hours > 0 [Research and Development Engineer] [PASSED]
          MAX vacation hours less than 100 [title] [Research and Development Engineer] [PASSED]
          vacation_hours > 0 [Research and Development Manager] [PASSED]
          ...
    2/140 checks FAILED: 
        dim_employee in adventureworks
          group by title [FAILED]
          vacation_hours > 0 [Chief Financial Officer] [FAILED]
            check_value: 0E-20
    Oops! 2 failures. 0 warnings. 0 errors. 138 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 693****
    checks for dim_employee:
      - group by:
        ...
          checks:
            - vacation_hours > 0:
                identity: custom_identity
    checks for dim_customer:
      - group by:
          name: Group by gender
          query: |
            SELECT gender, AVG(total_children) as average_children
            FROM dim_customer
            GROUP BY gender
          fields:
            - gender
          checks:
            - average_children > 2:
                name: Average children per gender should be more than 2
            - anomaly detection for average_children:
                name: Detect anomaly for average children
            - change for average_children between -5 and 5:
                name: Detect unexpected changes for average children
    checks for dim_employee:
      - group by:
          group_limit: 2
          query: |
            SELECT sales_territory_key, AVG(vacation_hours) as vacation_calc
            FROM dim_employee
            GROUP BY sales_territory_key
          fields:
            - sales_territory_key
          checks:
            - vacation_calc > 60:
                name: Reasonable vacation hours
    Soda Library 1.0.x
    Soda Core 3.0.x
    Evaluation of check group by failed: Total number of groups 11 exceeds configured group limit: 2
      | Total number of groups 11 exceeds configured group limit: 2
      +-> line=2,col=5 in checks_groupby.yml
    Scan summary:
    1/1 check NOT EVALUATED: 
        dim_employee in adventureworks
          group by [NOT EVALUATED]
    1 checks not evaluated.
    1 errors.
    Oops! 1 error. 0 failures. 0 warnings. 0 pass.
    ERRORS:
    Evaluation of check group by failed: Total number of groups 11 exceeds configured group limit: 2
      | Total number of groups 11 exceeds configured group limit: 2
      +-> line=2,col=5 in checks_groupby.yml
    checks for dim_employee:
      - group by:
          group_limit: 11
          query: |
            SELECT sales_territory_key, AVG(vacation_hours) as vacation_calc
            FROM dim_employee
            GROUP BY sales_territory_key
          fields:
            - sales_territory_key
          checks:
            - vacation_calc > 20:
                name: Too much vacation
    Soda Library 1.0.x
    Soda Core 3.0.x
    Scan summary:
    12/12 checks FAILED: 
        dim_employee in adventureworks
          group by [FAILED]
          Too much vacation [3] [FAILED]
            check_value: 24.0000000000000000
          Too much vacation [8] [FAILED]
            check_value: 35.0000000000000000
          Too much vacation [11] [FAILED]
            check_value: 51.3297872340425532
          Too much vacation [9] [FAILED]
            check_value: 36.0000000000000000
          Too much vacation [7] [FAILED]
            check_value: 34.0000000000000000
          Too much vacation [10] [FAILED]
            check_value: 37.0000000000000000
          Too much vacation [1] [FAILED]
            check_value: 28.0000000000000000
          Too much vacation [5] [FAILED]
            check_value: 29.0000000000000000
          Too much vacation [4] [FAILED]
            check_value: 26.5000000000000000
          Too much vacation [2] [FAILED]
            check_value: 38.0000000000000000
          Too much vacation [6] [FAILED]
            check_value: 32.0000000000000000
    Oops! 12 failures. 0 warnings. 0 errors. 0 pass.
    data_source employees:
     type: spark
     method: databricks
     catalog: unity_catalog
     schema: employees 
     host:  hostname_from_Databricks_SQL_settings
     http_path: http_path_from_Databricks_SQL_settings
     token: my_access_token
    
    soda_cloud:
     # Use cloud.soda.io for EU region
     # Use cloud.us.soda.io for US region
     host: https://cloud.soda.io
     api_key_id: soda-api-key-id
     api_key_secret: soda-api-key-secret
    checks for employee_info:
     - invalid_count(Department) = 0:
         valid values: ['Sales', 'Research & Development', 'Human Resources']
         name: Only correct departments are present in the dataset
         attributes:
           dimension: [Validity]
           pipeline_stage: Ingest
           team: Data Engineering
     - missing_count(EmployeeID) = 0:
         name: No null values in the Employee ID column
         attributes:
           dimension: [Completeness]
           pipeline_stage: Ingest
           team: Data Engineering
     - duplicate_count(EmployeeID) = 0:
         name: No duplicate IDs
         attributes:
           dimension: [Uniqueness]
           pipeline_stage: Ingest
           team: Data Engineering
     - invalid_count(Gender) = 0:
         valid values: ['Female', 'Male', 'Non-binary']
         name: Value for gender is valid
         attributes:
           dimension: [Validity]
           pipeline_stage: Ingest
           team: Data Engineering
     - invalid_count(Age) = 0:
         valid min: 18
         name: All employees are over 18
         attributes:
           dimension: [Validity]
           pipeline_stage: Ingest
           team: Data Engineering
     - missing_count(MonthlyIncome) = 0:
         name: No null values in MonthlyIncome
         attributes:
           dimension: [Completeness]
           pipeline_stage: Ingest
           team: Data Engineering
     - failed rows:
         name: Monthly Salary equals or exceeds legally required salary
         fail condition: MonthlyIncome < 11000
     - schema:
         warn:
           when schema changes: any
         name: Columns have not been added, removed, or changed
         attributes:
           dimension: [Consistency]
           pipeline_stage: Ingest
           team: Data Engineering
    
    
    checks for employee_survey:
     - invalid_count(EnvironmentSatisfaction) = 0:
         valid min: 1
         valid max: 5
         name: Values are formatted in range 1-5
         attributes:
           dimension: [Validity]
           pipeline_stage: Ingest
           team: Data Engineering
     - missing_count(EmployeeID) = 0:
         name: No null values in Employee ID
         attributes:
           dimension: [Completeness]
           pipeline_stage: Ingest
           team: Data Engineering
     - duplicate_count(EmployeeID) = 0:
         name: No duplicate IDs
         attributes:
           dimension: [Uniqueness]
           pipeline_stage: Ingest
           team: Data Engineering
     - invalid_count(WorkLifeBalance) = 0:
         valid min: 1
         valid max: 5
         name: Values are formatted in range 1-5
         attributes:
           dimension: [Validity]
           pipeline_stage: Ingest
           team: Data Engineering
     - schema:
         warn:
           when schema changes: any
         name: Columns have not been added, removed, or changed
     - values in EmployeeID must exist in employee_info EmployeeID:
        name: EmployeeID Integrity Check for employee survey
    
    
    checks for manager_survey:
     - invalid_count(PerformanceRating) = 0:
         valid min: 1
         valid max: 5
         name: Values are formatted in range 1-5
         attributes:
           dimension: [Validity]
           pipeline_stage: Ingest
           team: Data Engineering
     - schema:
         warn:
           when schema changes: any
         name: Columns have not been added, removed, or changed
         attributes:
           dimension: [Consistency]
           pipeline_stage: Ingest
           team: Data Engineering
     - values in EmployeeID must exist in employee_info EmployeeID:
        name: EmployeeID integrity check for manager survey
    
    # This filter partitions data included in the quality scan
    # because the data in the dataset lags by one day 
    filter login_logout [daily]:
      where: LogoutTime < CAST(current_date() AS TIMESTAMP) - INTERVAL 1 DAY AND LoginTime > CAST(current_date() AS TIMESTAMP) - INTERVAL 2 DAY
    
    checks for login_logout [daily]:
     - invalid_count(LoginTime):
         valid regex: "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d+)?(Z|[+-]\\d{2}:\\d{2})$"
         name: Login time format
         fail: when > 0
         attributes:
           dimension: [Validity]
           pipeline_stage: Ingest
           team: Data Engineering
     - missing_count(LogoutTime) = 0:
         name: No nulls in LogoutTime
         attributes:
           dimension: [Completeness]
           pipeline_stage: Ingest
           team: Data Engineering
     - values in EmployeeID must exist in employee_info EmployeeID:
        name: EmployeeID Integrity Check for login times
     - freshness(LogoutTime) < 2d:
         name: Data is updated
         attributes:
           dimension: [Timeliness]
           pipeline_stage: Ingest
           team: Data Engineering
    filter input_data_attrition_model [daily]:
      where: PartitionDate < CAST(current_date() AS TIMESTAMP) - INTERVAL 1 DAY AND PartitionDate > CAST(current_date() AS TIMESTAMP) - INTERVAL 2 DAY
    
    
    checks for input_data_attrition_model [daily]:
     - missing_count(Attrition) = 0:
         name: Target value is not missing
         attributes:
           pipeline: Transform
           team: Data Science
           dimension: [Completeness]
     - invalid_percent(TotalWorkingYears):
         valid min: 0
         name: Working years can't be negative
         warn: when > 0%
         fail: when > 10%
         attributes:
           pipeline: Transform
           team: Data Science
           dimension: [Validity]
     - values in EmployeeID must exist in employee_info EmployeeID:
        name: EmployeeID Integrity Check
     - failed rows:
         name: Overtime detected
         fail query: |
           SELECT *
           FROM input_data_attrition_model
           WHERE WorkingMinutes > 750
         attributes:
           pipeline: Transform
           team: Data Science
     - freshness(PartitionDate) < 2d:
         name: Data is fresh
         attributes:
           pipeline: Transform
           team: Data Science
           dimension: [Timeliness]
     - group evolution:
         name: Marital status
         query: |
           SELECT MaritalStatus FROM input_data_attrition_model GROUP BY 1
         fail:
           when required group missing: [Married]
         warn:
           when groups change: any
         attributes:
           pipeline: Transform
           team: Data Science
           dimension: [Consistency]
     - group by:
         query: |
           SELECT JobLevel, min(MonthlyIncome) AS salary
           FROM input_data_attrition_model
           GROUP BY 1
         fields:
           - JobLevel
         checks:
           - salary:
               warn: when < 0
               fail: when < -1
               name: Min Salary Normalised cannot be below -1
               attributes:
                 pipeline: Transform
                 team: Data Science
                 dimension: [Accuracy]
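The timestamp pattern in the LoginTime validity check above can be sanity-checked outside of Soda. This is a hypothetical snippet, not part of the scan; note that the YAML doubles each backslash, while a Python raw string uses single ones:

```python
import re

# ISO-8601 timestamp pattern from the LoginTime validity check
ISO_TS = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$")

print(bool(ISO_TS.match("2024-05-01T07:30:00Z")))           # → True (UTC timestamp)
print(bool(ISO_TS.match("2024-05-01T07:30:00.123+02:00")))  # → True (fraction + offset)
print(bool(ISO_TS.match("2024-05-01 07:30:00")))            # → False (space instead of 'T')
```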
    # Install the Soda Library package for Spark DataFrames to run checks contained in files
    %pip install -i https://pypi.cloud.soda.io soda-spark-df
    
    # Import Scan from Soda Library
    from soda.scan import Scan 
    import yaml
    from io import StringIO
    from pathlib import Path
    from datetime import datetime, timedelta
    
    # Define file directory
    settings_path = Path('/Workspace/Users/my_user_id/employee_attrition/soda_settings')
    
    # Define results file directory
    result_path = Path('/Workspace/Users/my_user_id/employee_attrition/checks_output')
    
    # Define the file partition
    partition = (datetime.today().date() - timedelta(days=1)).strftime("%Y-%m-%d")
    # Create a scan object
    scan = Scan()
    
    # Set scan name and data source name
    scan.set_scan_definition_name("Employee Attrition Scan")
    scan.set_data_source_name("employees")
    
    # Add file to be scanned 
    df = spark.read.option("header", True).csv(f"dbfs:/Workspace/Users/my_user_id/employee_attrition/soda_settings/login_logout/PartitionDate={partition}")
    
    # Create temporary View to run the checks 
    df.createOrReplaceTempView("login_logout")
    
    # Function to create temporary views of the tables to be included in the same scan
    def create_temp_views(spark, schema, table_names):
        for table in table_names:
            full_table_name = f"{schema}.{table}"
            df = spark.table(full_table_name)
            df.createOrReplaceTempView(table)
    
    # Create the temp view from the table list
    schema = "unity_catalog.employees"
    table_names = ["employee_info", "employee_survey", "manager_survey"]
    
    create_temp_views(spark, schema, table_names)
    
    # Add Views to the scan object
    scan.add_spark_session(spark, data_source_name="employees")
    
    # Access the checks YAML file 
    with open(settings_path/"ingestion_checks.yml") as ing_checks:
        ingestion = ing_checks.read()
    
    # Add the checks YAML content to the scan; add_sodacl_yaml_str expects a string
    scan.add_sodacl_yaml_str(ingestion)
    
    # Retrieve the configuration file
    with open(settings_path/"soda_conf.yml") as cfg:
        cfg_content = cfg.read()
    
    # Add the data source connection configuration to the scan; add_configuration_yaml_str expects a string
    scan.add_configuration_yaml_str(cfg_content)
    
    # Execute the scan
    scan.execute()
    
    # Check the Scan object for methods to inspect the scan result; print all logs to console
    print(scan.get_logs_text())
    
    # Save the checks metadata for further analysis
    metadata = scan.build_scan_results()
    
    scan_date = datetime.now().date().strftime("%Y-%m-%d")
    
    scan.save_scan_result_to_file(result_path/f"ingestion_result_{scan_date}.json", metadata['checks'])
    # Install the Soda Library package for Databricks to run checks on data in Unity datasets
    %pip install -i https://pypi.cloud.soda.io soda-spark[databricks]
    
    # Restart Python to use the updated packages
    %restart_python
    
    # Import Scan from Soda Library
    from soda.scan import Scan 
    import yaml
    from io import StringIO
    from pathlib import Path
    
    # Define file directory
    settings_path = Path('/Workspace/Users/my_user_id/employee_attrition/soda_settings')
    
    # Create a scan object
    scan = Scan()
    
    # Set scan name and data source name
    scan.set_scan_definition_name("Attrition Model - Input Data Checks")
    scan.set_data_source_name("employee_info")
    
    # Attach a Spark session
    scan.add_spark_session(spark)
    
    # Access the checks YAML file 
    with open(settings_path/"input_data_checks.yml") as input_checks:
        input_data = input_checks.read()
    
    # Add the checks YAML content to the scan; add_sodacl_yaml_str expects a string
    scan.add_sodacl_yaml_str(input_data)
    
    # Retrieve the configuration file
    with open(settings_path/"soda_conf.yml") as cfg:
        cfg_content = cfg.read()
    
    # Add the connection configuration to the scan; add_configuration_yaml_str expects a string
    scan.add_configuration_yaml_str(cfg_content)
    scan.execute()
    
    # Check the Scan object for methods to inspect the scan result; print all logs to console
    print(scan.get_logs_text())
    discover datasets:
      datasets:
        - attrition_model_output
    
    profile columns:
      columns:
        - include attrition_model_output.%
    
    filter attrition_model_output [daily]:
       where: PartitionDate < CAST(current_date() AS TIMESTAMP) - INTERVAL 1 DAY AND PartitionDate > CAST(current_date() AS TIMESTAMP) - INTERVAL 2 DAY
    
    checks for attrition_model_output [daily]:
      - row_count > 0:
          name: Dataset cannot be empty
          attributes: 
            pipeline_stage: Training
            team: Data Science
            dimension: [Completeness]
    
      - missing_count(Attrition) = 0:
          name: Attrition field is not completed
          attributes: 
            pipeline_stage: Training
            team: Data Science
            dimension: [Completeness]
    
      - avg(Accuracy):
          name: Accuracy is not below 60%
          fail: when < 0.60
          warn: when < 0.70
          attributes:
            pipeline_stage: Training
            team: Data Science
            dimension: [Accuracy]
            
      - anomaly detection for attrition_ratio:
          name: Attrition ratio anomaly detection
          attrition_ratio query: |
            SELECT (COUNT(CASE WHEN Attrition = true THEN 1 END) * 1.0) / COUNT(*) AS attrition_ratio
            FROM attrition_model_output
          attributes:
            pipeline_stage: Training
            team: Data Science
            dimension: [Accuracy]
        
    Get started roadmap
    1. Choose a flavor of Soda

    2. Set up Soda: install, deploy, or invoke

    3. Write SodaCL checks 📍 You are here!

    4. Run scans and review results

    5. Organize, alert, investigate


    Examples

    Define SodaCL checks

    🎥 Watch a 5-minute video for no-code checks and discussions, if you like!

    ✖️ Requires Soda Core Scientific ✖️ Requires Soda Core ✖️ Requires Soda Library + Soda Cloud ✔️ Requires Soda Agent + Soda Cloud

    Prerequisites

    • You, or an Admin on your Soda Cloud account, has deployed a Soda Agent version 0.8.52 or greater and connected it to your Soda Cloud account.

    • You, or an Admin on your Soda Cloud account, has connected a data source via the Soda Agent in your Soda Cloud account and configured dataset discovery in the data source for which you want to write no-code checks. (Soda must have access to dataset names and column names to present those values in dropdown menus during no-code check creation.)

    • You must have permission to edit the dataset.

    Create a new check

    SodaCL includes over 25 built-in metrics that you can use to write checks, a subset of which are accessible via no-code check creation. The table below lists the checks available to create via the no-code interface; see the SodaCL reference for detailed information about each metric or check.

    1. As a user with permission to edit a dataset to which you wish to add checks, navigate to the dataset, then click Add Check. You can only create a check via the no-code interface for datasets in data sources connected via a Soda Agent.

    2. Select the type of check you wish to create, then complete the form to create the check. Refer to table below for guidance on the values to enter.

    3. Optionally, Test your check, then click Propose check to initiate a discussion with colleagues. Soda executes the check during the next scan according to the schedule you selected, or whenever a Soda Cloud user runs the scheduled scan manually. Be aware that a schema check requires a minimum of two measurements before it yields a useful check result because it needs at least one historical measurement of the existing schema against which to compare a new measurement to look for changes. Thus, the first time Soda executes this check, the result is not evaluated.

    Field or Label
    Guidance

    About Soda AI assistants

    Powered by OpenAI's GPT-3.5 & GPT-4, the generative SQL and regular expression assistants available in Soda Cloud's no-code checks help you write the queries and expressions you can add to validity, missing, SQL failed rows, and SQL metric checks.

    When creating a Missing or Validity check in the no-code user interface in Soda Cloud, you can click for help from the Soda AI Regex Assistant to translate an English request into a regular expression you can use to define missing or valid values. Similarly, access the Soda AI SQL Assistant in SQL Failed Rows or SQL Metric checks to generate SQL queries based on requests in plain English.

    Soda AI SQL and Regex Assistants are enabled for new Soda Cloud accounts by default. If you do not wish to use them, navigate to your avatar > Organization Settings, then click to remove the check from the box for Enable SQL and Regex Assistants Powered by OpenAI.

    Existing Soda customers can review and accept the revised terms, then enable the assistants.

    Soda acknowledges that the output of the assistants may not be fully accurate or reliable. Leverage the assistants’ output, but be sure to carefully review all queries and expressions you add to your checks. Refer to the Use of AI section for further details.

    Be aware that Soda shares the content of all SQL and Regex assistant prompts/input and output with OpenAI to perform the processing that yields the output. Following OpenAI’s suggestion, Soda also sends metadata, such as schema information, to OpenAI along with the prompts/input in order to improve the quality of the output. Read more on OpenAI’s website.

    The Ask AI Assistant is powered by kapa.ai and replaces SodaGPT. While Soda collaborates with third parties to develop certain AI features, it’s important to note that Soda does not disclose any primary data, such as data samples or data profiling details, to our partners. We only share prompts and some schema information with OpenAI and kapa.ai to enhance the accuracy of the assistants.

    Refer to the Use of AI section for further details.

    Define alert notification rules

    By default, alert notifications for your no-code check go to the Dataset Owner and Check Owner. If you wish to send alerts elsewhere, in addition to the owner, create a notification rule.

    For a new rule, you define conditions for sending notifications including the severity of a check result and whom to notify when bad data triggers an alert.

    In Soda Cloud, navigate to your avatar > Notification Rules, then click New Notification Rule. Follow the guided steps to complete the new rule. Use the table below for insight into the values to enter in the fields and editing panels.

    Field or Label
    Guidance

    Edit an existing check

    1. As a user with permission to do so, navigate to the dataset in which the no-code check exists.

    2. To the right of the check you wish to edit, click the stacked dots, then select Edit Check. You can only edit a check via the no-code interface if it was first created as a no-code check, as indicated by the cloud icon in the Origin column of the table of checks.

    3. Adjust the check as needed, test your check, then save. Soda executes the check during the next scan according to the scan definition you selected.

    You can write SodaCL checks directly in the Soda Cloud user interface within an agreement. An agreement is a contract between stakeholders that stipulates the expected and agreed-upon state of data quality in a data source.

    In an agreement, use SodaCL checks to define the state of “good quality” for data in this data source, then identify and get approval from stakeholders in your organization. Define whom Soda Cloud will notify when a check in the agreement fails, then set a schedule to regularly execute the Soda Checks to uphold the tenets of the agreement.

    ✖️ Requires Soda Core Scientific ✖️ Requires Soda Core ✖️ Requires Soda Library + Soda Cloud ✔️ Requires Soda Agent + Soda Cloud

    Prerequisites

    As a Data Engineer, you can write SodaCL checks directly in a checks.yml file, or leverage check suggestions in the Soda Library CLI to prepare a basic set of data quality checks for you. Alternatively, you can add SodaCL checks to a programmatic invocation of Soda Library.

    Manually write SodaCL checks

    ✔️ Some checks require Soda Core Scientific ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent


    The checks YAML file stores the Soda Checks you write using SodaCL. Use this file to manually write your own SodaCL checks.

    Next

    1. Choose a flavor of Soda

    2. Set up Soda: install, deploy, or invoke

    3. Write SodaCL checks

    4. Run scans and review results

    5. Organize, alert, investigate

    alert configuration

    Need help? Join the Soda community on Slack.

    sales_territory_key (required): column identifier; the values in this column identify how Soda groups the results.

    checks (required): check subsection label.

    average_discount (required): custom metric identifier.

    fail: when > 40 (required): fail condition and threshold.

    warn: when between 50 and 60 (only one alert condition is required): warn condition and threshold.

    name (optional): custom name for the check; if not defined, Soda derives the name of the check in Soda Cloud from the check syntax.

    ✔️ Use quotes when identifying dataset or column names (see Use quotes in a check). Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    ✔️ Use wildcard characters in the value in the check; use wildcard values as you would with SQL.

    ✖️ Use for each to apply group by checks to multiple datasets in one scan.

    ✖️ Apply a dataset filter to partition data during a scan.
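A minimal sketch of quoting in a check, with hypothetical dataset and column names; match the quote character your data source expects:

```yaml
checks for "Dim_Employee":
  - missing_count("Last Name") = 0
```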


    Test data quality in a Dagster pipeline

    Use this guide as an example of how to invoke Soda data quality tests in a Dagster pipeline.

    Use this guide as an example for how to use Soda to test for data quality in an ETL pipeline in Dagster.

    About this guide

    The instructions below offer an example of how to execute several Soda Checks Language (SodaCL) tests for data quality at multiple points within a Dagster pipeline.

    For context, the example follows a fictional organization called Bikes 4 All that operates several bicycle retail stores in different regions. The Data Analysts at the company are struggling with their sales forecasts and reporting dashboards. The company has tasked the Data Engineering team to automate the ETL pipeline that uses Dagster and dbt to orchestrate the ingestion and transformation of data before exporting it for use by Data Analysts in their business intelligence tools.

    The pipeline, built in an assets.py file in the Dagster project, automates a flow which:

    1. Tests data before ingestion: The stores upload their data to S3, and the Data Engineers run Soda data quality checks on it before copying the data to a Redshift data source. To do so, they use Soda Library to load the files into DataFrames, then run a Soda scan for data quality to catch any issues with incomplete, missing, or invalid data early in the pipeline. For any Soda checks that fail, the team routes failed row samples, which contain sensitive data, back to their own S3 bucket to use to investigate data quality issues.

    2. Loads data: After addressing any data quality issues in the retail data in S3, they load the data into Redshift in a staging environment.

    3. Transforms data in staging: Using dbt, the Data Engineers build the models in a staging environment which transform the data for efficient use by the Data Analysts.

    As a final step, outside the Dagster pipeline, the Data Engineers also design a dashboard in Tableau to monitor data quality status.

    Prerequisites

    The Data Engineers in this example use the following:

    • Python 3.8, 3.9, or 3.10

    • Pip 21.0 or greater

    • dbt-core and the required database adapter (dbt-redshift)

    • a Dagster account

    Install dbt, Dagster, and Soda Library

    Though listed as prerequisites, the following instructions include details for installing and initializing dbt-core and Dagster.

    1. From the command-line, a Data Engineer installs dbt-core and the required database adapter for Redshift, and initializes a dbt project directory. Consult the dbt documentation for details.

    2. In the same directory that contains the dbt_project.yml, they install and initialize the Dagster project inside the dbt project. Consult the Dagster and dagster-dbt documentation for details.
    cd project-name
    pip install dagster-dbt dagster-webserver dagster-aws
    dagster-dbt project scaffold --project-name my-dagster-project
    cd my-dagster-project

    3. They install the Soda Library packages they need to run data quality scans both in Redshift and on data in DataFrames using Dask and Pandas.
    pip install -i https://pypi.cloud.soda.io soda-redshift
    pip install -i https://pypi.cloud.soda.io soda-pandas-dask

    Create and connect a Soda Cloud account

    To validate an account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.

    1. In a browser, a Data Engineer navigates to the Soda Cloud signup page to create a new Soda account, which is free for a 45-day trial.

    2. They navigate to their avatar > Profile, access the API keys tab, then click the plus icon to generate new API keys.

    3. They create a new file called configuration.yml in the same directory in which they installed the Soda Library packages, then copy+paste the API key values into the file according to the following configuration. This config enables Soda Library to connect to Soda Cloud via API.
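The soda_cloud block in their configuration.yml takes the same shape as the one shown earlier in this guide:

```yaml
soda_cloud:
  # Use cloud.soda.io for EU region
  # Use cloud.us.soda.io for US region
  host: https://cloud.soda.io
  api_key_id: soda-api-key-id
  api_key_secret: soda-api-key-secret
```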

    No Redshift connection details in the configuration.yml?

    Normally, when connecting Soda Library to a data source so it can run data quality scans, you must configure data source connection details in a configuration.yml file, as instructed in the data source connection documentation.

    However, in this example, because the Data Engineers need only use Soda Library to programmatically run scans on data loaded as DataFrames from S3, it is not necessary to provide the connection config details.

    Later in this example, when the Data Engineers run Soda scans remotely, they do so via calls to Soda Cloud API endpoints. Soda Cloud is configured to connect to the Redshift data source and Soda executes the scan via the Soda-hosted Agent included out-of-the-box with a Soda Cloud account. Learn more about the Soda-hosted Agent.

    Set up Soda

    To empower their Data Analyst colleagues to write their own no-code checks for data quality, a Data Engineer volunteers to set up Soda to:

    • connect to the Redshift data source that will contain the ingested data in a staging environment

    • discover the datasets and make them accessible by others in the Soda Cloud user interface

    • create check attributes to keep data quality check results organized

    1. Logged in to Soda Cloud, the Data Engineer, who, as the initiator of the Soda Cloud account for the organization, is automatically the Soda Admin, decides to use the out-of-the-box Soda-hosted agent made available for every Soda Cloud organization to securely connect to their Redshift data source.

    2. The Data Engineer follows the guided workflow to add the Redshift data source to the Soda Cloud account, making sure to include all datasets during dataset discovery, and to exclude sensitive datasets from sampling to avoid exposing any customer information in the Soda Cloud UI.

    3. Lastly, they follow the instructions to create check attributes, which serve to label and sort check results by pipeline stage, data domain, etc.

    Write pre-ingestion SodaCL checks

    Before the Data Engineer loads the existing retail data from S3 to Redshift, they prepare several data quality tests using the Soda Checks Language (SodaCL), a YAML-based, domain-specific language for data reliability.


    After creating a new checks.yml file in the same directory in which they installed the Soda Library packages, the Data Engineer consults with their colleagues and defines the following checks for four datasets—stores, stocks, customers, and orders—being sure to add attributes to each to keep the check results organized.

    Run pre-ingestion checks

    In the assets.py file of their Dagster project, the Data Engineer begins defining the first asset under the @asset decorator. Consult the Dagster documentation for details.

    The first definition loads the S3 data into a DataFrame, then runs the pre-ingestion checks on the data. Because the data contains sensitive customer information, the Data Engineer also includes a custom sampler which sends failed row samples for checks that fail to an S3 bucket instead of automatically pushing them to Soda Cloud. To execute the scan programmatically, the script references two files that Soda uses:

    • the configuration.yml file, which contains the Soda Cloud API key values that Soda Library needs to validate the user license before executing a scan, and

    • the checks.yml file which contains all the pre-ingestion SodaCL checks that the Data Engineer prepared.
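A minimal sketch of that first asset, assuming the soda-pandas-dask package is installed; the S3 path, dataset name, and scan definition name are illustrative placeholders, and in the real pipeline the function body sits under Dagster's @asset decorator:

```python
def pre_ingestion_scan(csv_path="s3://my-bucket/retail/orders.csv"):
    """Load raw retail data into a DataFrame and run the pre-ingestion checks.

    Hypothetical sketch: the S3 path, dataset name, and scan definition name
    are placeholders; pandas and soda-pandas-dask must be installed.
    """
    import pandas as pd
    from soda.scan import Scan

    df = pd.read_csv(csv_path)

    scan = Scan()
    scan.set_scan_definition_name("pre_ingestion_checks")
    scan.set_data_source_name("dask")
    # Register the DataFrame so the SodaCL checks can reference it as "orders"
    scan.add_pandas_dataframe(dataset_name="orders", pandas_df=df)
    scan.add_configuration_yaml_file("configuration.yml")  # Soda Cloud API keys
    scan.add_sodacl_yaml_file("checks.yml")                # pre-ingestion checks
    scan.execute()
    # Circuit breaker: raise so the asset fails when any check fails
    scan.assert_no_checks_fail()
    return df
```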

    Load data into Redshift and define staging transformations

    After all SodaCL checks pass, indicating that the data quality is good, the next step in the Dagster pipeline loads the data from the S3 bucket into Amazon Redshift. As the Redshift data source is connected to Soda Cloud, both Data Engineers and Data Analysts in the Soda Cloud account can access the data and prepare no-code SodaCL checks to test data for quality.

    The Data Engineer then defines the dbt models that transform the data and which run under the @dbt_assets decorator in the staging environment.

    Write post-transformation SodaCL checks

    With the transformed data available in Redshift in a staging environment, the Data Engineer invites their Data Analyst colleagues to define their own no-code checks for data quality.

    The Data Analysts in the organization know their data the best, particularly the data feeding their reports and dashboards. However, as they prefer not to write code—SQL, Python, or SodaCL—Soda Cloud offers them a UI-based experience to define the data quality tests they know are required.

    When they create a check for a dataset in Soda Cloud, they also make two selections that help gather and analyze check results later:

    • a scan definition in which to include their check

    • one or more check attributes

    The scan definition is what Soda uses to run regularly-scheduled scans of data. For example, a scan definition may instruct Soda to use the Soda-hosted agent connected to a Redshift data source to execute the checks associated with it every day at 07:00 UTC. Additionally, a Data Engineer can programmatically trigger a scheduled scan in Soda Cloud using the scanDefinition identifier; see the next step!

    The creator of a no-code check can select an existing scan definition, or choose to create a new one to define a schedule that runs at a different time of day, or at a different frequency. In this example, the Data Analysts creating the checks are following the Data Engineer's instruction that they use the same scan definition for their checks, dagsterredshift_default_scan, to facilitate running a single remote scan in the pipeline, later.

    The check attributes that the Data Engineer defined during setup are available in the Soda Cloud user interface for Data Analysts to select when they are creating a check. For example, a missing check on the store_id column validates that there are no NULL values in the column. By adding four attributes to the check, the Data Analyst makes it easier for themselves and their colleagues to filter and analyze check results in Soda Cloud, and other BI tools, according to these custom attributes.

    Trigger a Soda scan via API

    After the Data Analysts have added the data quality checks they need to the datasets in Soda Cloud, the next step in the pipeline triggers a Soda scan of the data remotely, via the Soda Cloud API. To do this, a Data Engineer uses the scan definition that the Data Analysts assigned to checks as they created them.

    In the Dagster pipeline, the Data Engineer adds a script that first calls the Soda Cloud API endpoint that triggers a scan.

    Then, using the scanID from the response of the first call, they poll the endpoint that reports scan status, calling it repeatedly as the scan executes until the scan status reaches an end state indicating that the scan completed, issued a warning, or failed to complete. If the scan completes successfully, the pipeline continues to the next step; otherwise, it trips a “circuit breaker”, halting the pipeline.
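    The trigger-and-poll control flow can be sketched as follows. This is a simplified illustration, not the example project's actual script: the HTTP call is abstracted behind a callable (for instance, a function that sends GET requests to the Soda Cloud scan-status endpoint with your API keys), and the state names used here are assumptions.

```python
import time

# End states the poller recognizes; the names are assumptions,
# not the exact strings the Soda Cloud API returns.
END_STATES = {"completed", "completedWithWarnings", "failed"}


def wait_for_scan(get_scan_state, scan_id, poll_seconds=5, timeout_seconds=1800):
    """Poll a scan until it reaches an end state.

    `get_scan_state` is any callable that takes a scan ID and returns the
    scan's current state string, e.g. by requesting the scan-status endpoint
    with your Soda Cloud API key credentials.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        state = get_scan_state(scan_id)
        if state in END_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Scan {scan_id} did not finish within {timeout_seconds}s")


def check_gate(state):
    """Circuit breaker: halt the pipeline unless the scan completed cleanly."""
    if state != "completed":
        raise RuntimeError(f"Data quality gate tripped: scan ended in state {state!r}")
```

    In a pipeline step, `check_gate(wait_for_scan(...))` either returns, letting downstream assets run, or raises, halting the run.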

    Transform data in production

    When all the Data Analysts' checks have been executed and the results indicate that the data is sound in staging, the Data Engineer adds a step in the pipeline to perform the same transformations on data in the production environment. The production data in Redshift feeds the reports and dashboards that the Data Analysts use, who now work with more confidence in the reliability of the data.

    Export data quality test results

    As a last step in the Dagster pipeline, the Data Engineer goes the extra mile to export data quality check results to tables in Redshift. The script again accesses the Soda Cloud API to gather results, then transforms the API data quality responses into DataFrames and writes them to Redshift.

    In this example, the check attributes that both the Data Engineers and Data Analysts applied to the checks they created prove useful: during export, the script adds separate columns to the tables in Redshift for the attributes' keys and values so that anyone using the data to create, say, a dashboard in Tableau, can organize the data according to attributes like Data Quality Dimension, Pipeline Stage, or Data Domain.

    Download this asset definition:

    Create a Dagster asset job

    After defining all the assets for the Dagster pipeline, the Data Engineer must define the asset jobs, schedules, and resources for the Dagster and dbt assets. The definitions.py file in the Dagster project wires everything together. Consult the Dagster documentation for more information.

    Review results

    To review check results from the latest Soda scan for data quality, along with the historical measurements for each check, both Data Analysts and Data Engineers can use Soda Cloud.

    They navigate to the Datasets page, then select a dataset from those listed to access a dataset overview page which offers info about check coverage, the dataset's health, and a list of its latest check results.

    To keep sensitive customer data secure, the Data Engineers in this example chose to reroute any failed row samples that Soda implicitly collects for missing, validity, and reference checks, and *explicitly* collects for failed row checks, to an S3 bucket. Those with access to the bucket can review the CSV files containing the failed row samples, which can help Data Engineers investigate the cause of data quality issues.

    Further, because the Data Engineer went the extra mile to export data quality check results via the Soda Cloud API to tables in Redshift, they are able to prepare a Tableau dashboard using the check attributes to present data according to Domain, Dimension, etc.

    To do so in Tableau, they added their data source, selected the Redshift connector, and entered the database connection configuration details. Consult the Tableau documentation for details.

    Go further

    • Learn more about the Atlan integration to review Soda check results from within the catalog.

    • Learn more about notifications to set up alerts for data quality checks that fail.

    Install Soda Library

    From your command-line interface, execute a pip install command to install Soda Library in your environment.

    The Soda environment has been updated since this tutorial.

    Refer to the latest Soda documentation for updated tutorials.

    Soda Library is a Python library and command-line interface (CLI) tool that enables Data Engineers to test the data in a data source to surface invalid, missing, or unexpected data.

    As a step in the Get started roadmap, this guide offers instructions to set up, install, and configure Soda in a self-operated deployment model.

    Get started roadmap

    1. Choose a flavor of Soda

    2. Set up Soda: self-operated 📍 You are here!

    3. Write SodaCL checks

    4. Run scans and review results

    💡 TL;DR: Follow a quick start tutorial to set up and run Soda with example data.

    Requirements

    To use Soda Library, you must have installed the following on your system.

    • Python 3.8, 3.9, or 3.10. To check your existing version, use the CLI command: python --version or python3 --version If you have not already installed Python, consider using a version manager such as pyenv to manage multiple versions of Python in your environment.

    • Pip 21.0 or greater. To check your existing version, use the CLI command: pip --version

    • A Soda Cloud account; see next section.

    Python versions Soda supports

    Soda officially supports Python versions 3.8, 3.9, and 3.10. Though Soda is largely functional on Python 3.11 and 3.12, efforts to fully support those versions are ongoing.

    Using Python 3.11, some users might encounter issues with dependency constraints. At times, the combination of Python 3.11 and those constraints requires that a dependency be built from source rather than downloaded pre-built.

    The same applies to Python 3.12, although there is some anecdotal evidence that 3.12 might not work in all scenarios due to dependency constraints.

    Create a Soda Cloud account

    1. In a browser, navigate to the Soda Cloud signup page to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.

    2. Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys.

    3. Copy+paste the API key values to a temporary, secure place in your local environment.

    Why do I need a Soda Cloud account?

    To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.

    Install Soda Library

    1. Best practice dictates that you install the Soda Library CLI using a virtual environment. In your command-line interface tool, create a virtual environment in the .venv directory using the commands below. Depending on your version of Python, you may need to replace python with python3 in the first command.

    2. Upgrade pip inside your new virtual environment.
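    Assembled, the setup steps above look like this in a terminal:

```shell
# Create a virtual environment in the .venv directory
# (replace python3 with python if that is how Python is named on your system)
python3 -m venv .venv
# Activate it (on Windows: .venv\Scripts\activate)
source .venv/bin/activate
# Upgrade pip inside the new environment
pip install --upgrade pip
```

    With the environment active, install the Soda Library package for your data source, for example `pip install -i https://pypi.cloud.soda.io soda-postgres`; the package name varies by data source, and soda-postgres is shown here only as an assumed example.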

    Configure Soda

    1. Soda Library connects with Spark DataFrames in a unique way, using programmatic scans.

      • If you are using Spark DataFrames, follow the configuration details in .

      • If you are not using Spark DataFrames, continue to step 2.

    2. In the same directory and environment in which you installed Soda Library, use a code editor to create a new file named configuration.yml.

    Provide credentials as system variables

    If you wish, you can provide data source login credentials or any of the properties in the configuration YAML file as system variables instead of storing the values directly in the file. System variables persist only for as long as you have the terminal session open in which you created the variable. For a longer-term solution, consider using permanent environment variables stored in your ~/.bash_profile or ~/.zprofile files.

    For connection configuration values

    1. From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following command to define a system variable for your password.

    2. Test that the system retrieves the value that you set by running an echo command.

    3. In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.

    4. Save the configuration YAML file, then run a scan to confirm that Soda Library connects to your data source without issue.
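    For illustration, a configuration YAML file for a hypothetical PostgreSQL data source might reference the variable like this; the data source name and connection details are placeholders, not values from this guide:

```yaml
data_source my_datasource:
  type: postgres
  host: localhost
  port: 5432
  username: soda_user
  # Soda resolves ${POSTGRES_PASSWORD} from the system variable at scan time
  password: ${POSTGRES_PASSWORD}
  database: analytics
```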

    For API key values

    1. From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following command to define a system variable for your API key secret.

    2. Test that the system retrieves the value that you set by running an echo command.

    3. In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.

    4. Save the configuration YAML file, then run a scan to confirm that Soda Library connects to Soda Cloud without issue.
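    For example, the soda_cloud section of the configuration YAML file can reference system variables rather than hard-coded keys; the variable names here are illustrative:

```yaml
soda_cloud:
  host: cloud.soda.io
  # Resolved from system variables, so the key values never appear in the file
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
```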

    Next

    1. Choose a flavor of Soda

    2. Set up Soda: self-operated

    3. Run scans and review results

    Need help? Join the Soda community on Slack.

    Distribution checks

    Use a SodaCL distribution check to monitor the consistency of a column over time.

    Distribution checks will no longer be supported in Soda v4; they will be deprecated and replaced by MAD.

    In the short term, v3 users can use summary statistics instead.

    Use a distribution check to determine whether the distribution of a column has changed between two points in time. For example, if you trained a model at a particular moment in time, you can use a distribution check to find out how much the data in the column has changed over time, or if it has changed at all.

    ✔️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✖️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    About distribution checks

    To detect changes in the distribution of a column between different points in time, Soda uses approaches based on hypothesis testing and approaches based on metrics that quantify the distance between samples.

    When using hypothesis testing, a distribution check allows you to determine whether enough evidence exists to conclude that the distribution of a column has changed. It returns the probability that the difference between samples taken at two points in time would have occurred if they came from the same distribution: the p-value. If this probability is smaller than a threshold that you define, the check warns you that the column's distribution has changed.

    You can use the following statistical tests for hypothesis testing in your distribution checks.

    • The Kolmogorov-Smirnov test, for continuous data

    • The Chi-square test, for categorical data

    When using a metric to measure distance between samples, a distribution check returns the value of the distance metric that you chose based on samples taken at two points in time. If the value of the distance metric is larger than a threshold that you define, the check warns that the column's distribution has changed.

    You can use the following distance metrics in your distribution checks.

    • The Population Stability Index (PSI), for continuous or categorical data

    • The Standardized Wasserstein Distance (SWD) (standardized using the sample standard deviation), for continuous or categorical data

    • The Standardized Earth Mover's Distance (SEMD) (standardized using the sample standard deviation; this metric is equal to the SWD), for continuous or categorical data

    Sample sizes in distribution checks

    In hypothesis testing, the power of a test refers to its ability to reject the null hypothesis when it is false. Specifically, the power of a test tells you how likely it is that the null hypothesis will be rejected if the true difference with the alternative hypothesis were of a particular size. A very powerful test is able to reject the null hypothesis even if the true difference is small. Since distribution checks issue warnings based on the p-value alone and do not take effect size into account, having too much power can make the results of the checks hard to interpret: an extremely powerful test rejects the null hypothesis for effect sizes that are negligible. Because the power of a test increases as its sample size increases, distribution checks impose a sample size limit of one million rows.

    The default sample size limit of one million rows is based on simulations that used the Kolmogorov-Smirnov test. The simulation generated samples from a normal distribution, an exponential distribution, a Laplacian distribution, a beta distribution, and a mixture distribution (generated by randomly choosing between two normal distributions). The Kolmogorov-Smirnov test compared these samples to samples that came from the same distributions, but with different means. For example, it compared samples from a normal distribution to samples from another normal distribution with a different mean. For each distribution type, with a sample size of one million, the Kolmogorov-Smirnov test rejected the null hypothesis 100% of the time if the effect size was equal to, or larger than, a shift in the mean of 1% of the standard deviation. Using such a sample size does not cause problems with local memory.

    If you wish, you can define your own sample size using a SQL query; see the Define the sample size section, below.

    Distribution check thresholds for distance metrics

    The values of the Population Stability Index (PSI) and the Standardized Wasserstein Distance (SWD) can be hard to interpret. Consider carefully investigating which distribution thresholds make sense for your use case. Some common interpretations of the PSI result are as follows:

    • PSI < 0.1: no significant distribution change

    • 0.1 < PSI < 0.2: moderate distribution change

    • PSI > 0.2: significant distribution change

    Install Soda Scientific

    To use a distribution check, you must install Soda Scientific in the same directory or virtual environment in which you installed Soda Library. Best practice recommends installing Soda Library and Soda Scientific in a virtual environment to avoid library conflicts, but you can install them locally if you prefer.

    1. Set up a virtual environment, and install Soda Library in your new virtual environment.

    2. Use the following command to install Soda Scientific.

    Refer to Troubleshoot Soda Scientific installation for help with issues during installation.

    Generate a distribution reference object (DRO)

    Not yet supported in Soda Cloud

    Before defining a distribution check, you must generate a distribution reference object (DRO).

    When you run a distribution check, Soda compares the data in a column of your dataset with a snapshot of the same column at a different point in time. This snapshot exists in the DRO, which serves as a point of reference. The distribution check result indicates whether the difference between the distributions of the snapshot and the actual datasets is statistically significant.

    To create a DRO, you use the CLI command soda update-dro. When you execute the command, Soda stores the entire contents of the column(s) you specified in local memory. Before executing the command, examine the volume of data the column(s) contains and ensure that your system can accommodate storing it in local memory.

    1. If you have not already done so, create a directory to contain the files that Soda uses for a distribution check.

    2. Use a code editor to create a file called distribution_reference.yml (though you can name it anything you wish) in your Soda project directory, then add the following example content to the file.

      Optionally, you can define multiple DROs in your distribution_reference.yml file by naming them. The following example defines two DROs.

    3. Change the values for dataset, column, and the other properties to reflect your own data.

    Read more below about bins and weights, and how Soda computes the number of bins for a DRO.
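    As a sketch, a minimal distribution_reference.yml might look like the following; the dataset and column names are illustrative. After saving the file, generate the DRO with a command along the lines of `soda update-dro -c configuration.yml -d my_datasource ./distribution_reference.yml`.

```yaml
dataset: dim_customer
column: number_cars_owned
# continuous or categorical, depending on the nature of the data
distribution_type: categorical
# (optional) restrict the reference sample to a point in time or other dimension
filter: purchase_date > 2022-10-01 and purchase_date < 2022-12-01
```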

    Define a distribution check

    1. If you have not already done so, create a checks.yml file in your Soda project directory. The checks YAML file stores the Soda Checks you write, including distribution checks; Soda Library executes the checks in the file when it runs a scan of your data.

    2. In your new file, add the following example content.

    3. Replace the following values with your own dataset and threshold details.

    Distribution check details

    • For continuous columns, when you execute the soda scan command, Soda stores up to one million records in local memory. If the column has more than one million records, Soda applies a LIMIT SQL clause to make sure that your system can accommodate storing the records in local memory.

    • For continuous columns, as explained in Bins and weights below, Soda uses bins and weights to take random samples from your DRO. Therefore, it is possible that the original dataset that you used to create the DRO resembles a different underlying distribution than the dataset that Soda creates by sampling from the DRO. To limit the impact of this possibility, Soda runs the tests in each distribution check ten times and returns the median of the results (either as a p-value or a distance metric). For example, if you use the Kolmogorov-Smirnov test and a threshold of 0.05, the distribution check uses the Kolmogorov-Smirnov test to compare ten different samples from your DRO to the data in your column. If the median of the returned p-values is smaller than 0.05, the check issues a warning. This approach does change the interpretation of the distribution check results. For example, the probability of a type I error is multiple orders of magnitude smaller than the significance level that you choose.

    Bins and weights

    Soda uses the bins and weights to generate a sample from the reference distribution when it executes the distribution check during a scan. By creating a sample using the DRO's bins and weights, you do not have to save the entire, potentially very large, sample. The distribution_type value impacts how the weights and bins are used to generate a sample, so make sure your choice reflects the nature of your data (continuous or categorical).

    To compute the number of bins for a DRO, Soda uses different strategies based on whether outlier values are present in the dataset.

    By default, Soda automatically computes the number of bins for each DRO by taking the maximum of the Sturges and Freedman-Diaconis methods; NumPy's "auto" bin estimator also applies this practice by default.

    For datasets with outliers, such as in the example below, the default strategy does not work well. Taking the maximum of the Sturges and Freedman-Diaconis methods produces a great number of bins, 3466808, while there are only nine elements in the array; the outlier value 10e6 results in a misleading bin size.

    If the number of bins is greater than the size of the data, Soda uses the interquartile range (IQR) method to detect and filter the outliers. Essentially, Soda removes values that are greater than Q3 + 1.5 IQR or less than Q1 - 1.5 IQR, then recomputes the number of bins with the same method, taking the maximum of the Sturges and Freedman-Diaconis estimates.

    After removing the outliers, if the number of bins still exceeds the size of the filtered data, Soda takes the square root of the dataset size to set the number of bins. To cover edge cases, if the square root of the dataset size exceeds one million, then Soda sets the number of bins to one million to prevent it from generating too many bins.
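    The bin-count strategy described above can be sketched as follows; this is an illustrative reimplementation based on the description, not Soda's actual code:

```python
import math
import statistics


def auto_bins(vals):
    """Max of the Sturges and Freedman-Diaconis estimates (NumPy's 'auto' rule)."""
    n = len(vals)
    sturges = math.ceil(math.log2(n)) + 1
    q1, _, q3 = statistics.quantiles(vals, n=4)
    iqr = q3 - q1
    if iqr == 0:
        return sturges
    fd_width = 2 * iqr / n ** (1 / 3)
    fd = math.ceil((max(vals) - min(vals)) / fd_width)
    return max(sturges, fd)


def choose_bin_count(values):
    """Sketch of the strategy described above, not Soda's implementation."""
    bins = auto_bins(values)
    if bins > len(values):
        # Too many bins: drop IQR outliers, then re-estimate
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        filtered = [v for v in values if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr]
        bins = auto_bins(filtered)
        if bins > len(filtered):
            # Fall back to sqrt(n), capped at one million bins
            bins = min(int(math.sqrt(len(values))), 1_000_000)
    return bins
```

    On a nine-element array with a 10e6 outlier, the auto estimate explodes into millions of bins, so the function filters the outlier and re-estimates on the remaining values.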

    Define the sample size

    You can add a sample parameter to both a distribution check and a DRO to include a sampling SQL clause that Soda applies when it executes the check during a scan.

    Apply a sample to a distribution check for continuous columns

    If the data to which you wish to apply a distribution check does not fit in memory or involves a time constraint, use a sample parameter to specify a SQL query that returns a sample of the data. The SQL query that you provide is specific to the type of data source you use. In the example below, the SQL query for a PostgreSQL data source randomly samples 50% of the data with seed 61. You can customize the sample SQL query to meet your needs.

    Use sample for continuous values only. For categorical values, use a filter parameter instead, as described below.

    Sampling Caveats

    Some data sources do not have a built-in sampling function. For example, BigQuery does not support TABLESAMPLE BERNOULLI. In such a case, add a filter parameter to randomly obtain a sample of the data. The filter parameter applies a data source-specific SQL WHERE clause to the data. In the example below, the SQL query for a BigQuery data source randomly samples 50% of the data.
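    For example, a distribution check on a BigQuery data source might use a filter in place of a sample; the same filter parameter also works in a DRO definition. The dataset, column, and file names here are illustrative:

```yaml
checks for dim_customer:
  - distribution_difference(number_cars_owned) > 0.05:
      distribution reference file: ./cars_owned_dist_ref.yml
      method: chi_square
      # BigQuery does not support TABLESAMPLE BERNOULLI, so randomly
      # keep 50% of the rows with a WHERE-clause filter instead
      filter: RAND() < 0.5
```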


    Distribution check examples

    You can define multiple distribution checks in a single checks.yml file. If you create a new DRO for another dataset and column in sales_dist_ref.yml for example, you can define two distribution checks in the same checks.yml file, as per the following.

    Alternatively, you can define two DROs in distribution_reference.yml, naming them cars_owned_dro and calendar_quarter_dro, and use both in a single checks.yml file.

    You can also define multiple checks for different columns in the same dataset by generating multiple DROs for those columns. Refer to the following example.

    Optional check configurations

    Supported
    Configuration
    Documentation

    Example with check name

    Example with quotes

    Example with for each

    Example with in-check filter

    Example with dataset filter

    Example with in-check sampling

    The following example works for PostgreSQL data sources. It randomly samples 50% of the dataset with seed value 61.

    List of comparison symbols and phrases

    Troubleshoot Soda Scientific installation

    While installing Soda Scientific works on Linux, you may encounter issues if you install Soda Scientific on macOS (particularly on machines with the M1 ARM-based processor) or another operating system. If that is the case, consider using one of the following alternative installation procedures.

    Need help? Ask the team in the Soda community on Slack.

    Install Soda Scientific Locally

    1. Set up a virtual environment, and install Soda Library in your new virtual environment.

    2. Use the following command to install Soda Scientific.

    Refer to Troubleshoot Soda Scientific installation for help with issues during installation.

    Use Docker to run Soda Library

    Use Soda’s Docker image in which Soda Scientific is pre-installed. You need Soda Scientific to be able to use SodaCL distribution checks or anomaly detection checks.

    1. If you have not already done so, install Docker in your local environment.

    2. From Terminal, run the following command to pull Soda Library’s official Docker image; adjust the version to reflect the most recent release.

    3. Verify the pull by running the following command.

      Output:

      When you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.

    Error: Mounts denied

    If you encounter the following error, follow the procedure below.

    You need to give Docker permission to access your configuration.yml and checks.yml files in your environment. To do so:

    1. Access your Docker Dashboard, then select Preferences (gear symbol).

    2. Select Resources, then follow the steps to add your Soda project directory – the one you use to store your configuration.yml and checks.yml files – to the list of directories that can be bind-mounted into Docker containers.

    3. Click Apply & Restart, then repeat steps 2 - 4 above.

    Error: Configuration path does not exist

    If you encounter the following error, double check the syntax of the scan command in step 4 above.

    • Be sure to prepend /sodacl/ to both the configuration.yml filepath and the checks.yml filepath.

    • Be sure to mount your files into the container by including the -v option. For example, -v /Users/MyName/soda_project:/sodacl.

    Go further

    • Reference SodaCL metrics and checks.

    • Use a freshness check to gauge how recently your data was captured.

    • Use a reference check to compare the values of one column to another.

    # Checks for basic validations
    checks for dim_customer:
      - row_count between 10 and 1000
      - missing_count(birth_date) = 0
      - invalid_percent(phone) < 1 %:
          valid format: phone number
      - invalid_count(number_cars_owned) = 0:
          valid min: 1
          valid max: 6
      - duplicate_count(phone) = 0
    
    checks for dim_product:
      - avg(safety_stock_level) > 50
    # Checks for schema changes
      - schema:
          name: Find forbidden, missing, or wrong type
          warn:
            when required column missing: [dealer_price, list_price]
            when forbidden column present: [credit_card]
            when wrong column type:
              standard_cost: money
          fail:
            when forbidden column present: [pii*]
            when wrong column index:
              model_name: 22
    
    # Check for freshness 
      - freshness(start_date) < 1d
    
    # Check for referential integrity
    checks for dim_department_group:
      - values in (department_group_name) must exist in dim_employee (department_name)
    checks for dim_customer:
      - distribution_difference(number_cars_owned) > 0.05:
          distribution reference file: ./cars_owned_dist_ref.yml
          method: chi_square
          # (optional) filter to a specific point in time or any other dimension 
          filter: purchase_date > 2022-10-01 and purchase_date < 2022-12-01
          # (optional) database specific sampling query for continuous columns. For 
          # example, for PostgreSQL the following query randomly samples 50% of the data 
          # with seed 61.
          sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)

    This feature is not supported in Soda Core OSS. Migrate to Soda Library in minutes to start using this feature for free with a 45-day trial.

    [NOT EVALUATED], indicated by a gray, question mark status icon.
  • Click Add Check to include the new, no-code check in the next scheduled scan of the dataset. Note that a user with Viewer permissions cannot add a check; they can only propose checks.

  • Optionally, you can manually execute your check immediately. From the dataset’s page, locate the check you just created and click the stacked dots, then select Execute Check. Soda executes only your check.

  • Fail Condition, Value, and Value Type

    Set the values of these fields to specify the threshold that constitutes a fail or warn check result. For example, if you are creating a Duplicate Check and you want to make sure that less than 5% of the rows in the column you identified contain duplicates, set:

    • Fail Condition to >

    • Value to 5

    • Value Type to Percent

    Attribute fields

    Select from among the list of existing attributes to apply to your check so as to organize your checks and alert notifications in Soda Cloud. Refer to for details.

    Optionally, you can execute your check immediately. Locate the check you just edited and click the stacked dots, then select Execute Check. Soda executes only your check.

    You, or an Admin on your Soda Cloud account, has deployed a Soda Agent and connected it to your Soda Cloud account.

  • You, or an Admin on your Soda Cloud account, has added a new data source via the Soda Agent in your Soda Cloud account.

  • Create a new agreement

    For a new agreement, you define several details including which data to check, what checks to execute during a scan, and whom to notify when bad data triggers an alert.

    In Soda Cloud, navigate to the Agreements dashboard, then click New Agreement. Follow the guided steps to complete the new agreement. Use the sections below for insight into the values to enter in the fields and editing panels in the guided steps.

    1. Select a Data Source

    You can only create an agreement that uses a data source that has been added to Soda Cloud via a Soda Agent.

    Field or Label
    Guidance

    Agreement Label

    Provide a name for your agreement.

    Data Source

    Select the data source that contains the datasets to which your agreement applies. If you have no options to select in the dropdown, it is because you have not added a data source via a Soda Agent. You can only create agreements on datasets that are in a data source that has been onboarded into Soda Cloud via a Soda Agent.

    2. Write Checks

    Use SodaCL to define the checks that Soda Cloud executes on a regular schedule to uphold the tenets of this agreement. If any of these checks fail during a regularly-scheduled scan, Soda Cloud notifies the stakeholders you specify in the Notifications section.

    Be sure to click Test checks to validate that the SodaCL syntax you have written is valid, and that Soda can execute the checks against your datasets without errors.

    For help writing your first checks:

    • browse the library of SodaCL snippets that insert correctly-formatted syntax for the most commonly-used checks for basic data quality

    • use Ask AI, a generative AI assistant that turns natural-language requests into production-ready SodaCL checks. Read more

    • consider following the Quick start for SodaCL, including the Tips and best practices section

    • refer to SodaCL reference for exhaustive details on every type of metric and check

    3. Identify Stakeholders

    Add Stakeholders to this agreement who have an interest in maintaining or using the good-quality data in this data source. Consider adding a co-owner to your agreement for redundancy should you, as the agreement author, be absent.

    Soda Cloud sends emails to request review and approval from all stakeholders, and waits to run the scans that execute the checks in the agreement until all stakeholders have approved the agreement.

    4. Set Notifications

    By default, Soda Cloud includes an out-of-the-box email notification to all the agreement’s stakeholders when a check in your agreement fails. You can remove or adjust this notification, or use the search bar to add more. Access View scan results to learn more about pass, warn, and fail check results.

    (Optional) If you have integrated your Soda Cloud account with Slack or another third-party service provider via a webhook, use the search field to type a channel name to add the channel as a notification recipient. Alternatively, use the field to enter names of individual teammates with whom you collaborate in Soda Cloud.

    5. Set a Scan Definition

    After you have set up a new agreement, Soda Cloud sends approval requests to the stakeholders you identified in step 3. When stakeholders approve or reject your agreement, Soda Cloud sends you an email notification.

    Regardless of the approval status of the agreement, however, Soda Cloud begins running scans of your data according to the scan definition you set. Soda Cloud sends notifications after each scan according to the settings you defined in step 4.

    (Optional) You can click the link provided to create a new scan definition if you wish to run a scan to execute the checks in this agreement more or less frequently, or at a different time of day, relative to the default scan definition for the data source.

    To review existing scan definitions, navigate to the Scans menu item.

    Agreement tips and best practices

    Further, take into account the following tips and best practices when writing SodaCL checks in an agreement.

    • Avoid applying the same customized check names in multiple agreements. Soda Cloud associates check results with agreements according to name, so if you reuse custom names, Soda Cloud may link check results to the wrong agreement.

    • If you use an anomaly detection check, be aware that when you Test Checks, this type of check results in [NOT EVALUATED]. The ML algorithm that anomaly detection checks use requires a minimum of four regular-frequency scans before it has collected enough historic measurements against which to gauge an anomaly. Until it has collected enough historical measurements, Soda does not evaluate the check.

    • Note that any checks you test in the context of this step in the agreements workflow do not appear as “real” checks in the Checks dashboard.

    • Except for the NOW variable in freshness checks, you cannot use variables in checks you write in an agreement in Soda Cloud, as it is impossible to provide the variable values at scan time.

    See also: Tips and best practices for SodaCL

    Using a code editor, create a new file called checks.yml.

  • Copy+paste the following basic check syntax in your file, then adjust the value for dataset_name to correspond with the name of one of the datasets in your data source.
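The basic check syntax referred to above is a simple row count check; a minimal sketch:

```yaml
checks for dataset_name:
  - row_count > 0
```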

    1. Save the changes to the checks.yml file.

    2. To test the check and confirm the syntax is valid and error-free, use the following command to run a scan of the data in your data source. Replace the value for my_datasource with the name of the data source you added to your configuration.yml file. Read more about scans.
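The scan command takes the data source name plus the configuration and checks files as arguments; running it requires an installed Soda Library package and a configured data source:

```shell
soda scan -d my_datasource -c configuration.yml checks.yml
```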

    Command-line Output:

    1. Add more checks to the checks.yml file to test for multiple data quality metrics. Consult the SodaCL tutorial for advice and the Use case guides for example checks. Refer to SodaCL reference for exhaustive details on every type of metric and check.

    Add a schema check

    This type of check validates the schema, or structure, of your data. It ensures that the columns you expect to exist are present in the dataset, and that they have the correct data type and index location.
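For example, a schema check might look like the following sketch; the column names and types are placeholders:

```yaml
checks for dataset_name:
  - schema:
      fail:
        when required column missing: [id, created_at]
        when wrong column type:
          id: integer
```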

    Refer to Schema checks for more information.

    Add row count checks

    This step adds two checks: one to confirm that the dataset is not empty, and one to ensure that the current row count is not significantly different from the expected row count. Soda determines the expected row count relative to the previous row count value using a time series-based anomaly detection model.
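A sketch of the two checks described above; the anomaly check syntax may vary by Soda Library version:

```yaml
checks for dataset_name:
  - row_count > 0
  - anomaly detection for row_count
```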

    Refer to Anomaly detection checks for more information.

    Add time-based partitioning

    Also referred to as dataset filtering, this step prompts you to specify a time range on which to apply the data quality checks.

    By default, check suggestions sets the time-based partition to one day if the column contains DATE type data, and the preceding 24 hours if the column contains DATETIME data. When generating a list of candidate columns to which to apply the time-based partition, the assistant uses heuristic methods to automatically identify and rank column names.

    Read more about heuristic ranking

    The heuristic ranking strategy identifies the most suitable columns for effectively partitioning your data. The algorithm it uses for ranking applies several criteria and heuristic scoring to assess the columns' incrementality, standard deviation, maximum date, missing values, and column names.

    1. Incrementality: This criterion checks whether the values in a time-based column incrementally change over time. It assesses if the date or time values consistently increase as new records are added. Columns with higher incrementality scores are more likely to provide a meaningful partitioning mechanism.

    2. Standard Deviation: Check suggestions uses standard deviation between dates to assess the uniformity or distribution of values in a time-based column. Columns with low standard deviation indicate that the dates are closely packed together, suggesting a more consistent and evenly-spaced distribution.

    3. Maximum Date: This step examines the maximum date value in a column and compares it to the current date. Columns with a maximum date value that is less than the current date receive a higher score. This criterion helps identify columns with recent data.

    4. Missing Value: Check suggestions considers the number of missing values in a column; those with fewer missing values receive a higher score. This criterion helps identify columns with more complete data.

    5. Column Name: Check suggestions analyzes the names of the columns to determine their relevance for partitioning. The algorithm assigns higher points to columns with names that contain keywords such as "create", "insert", "generate", etc. This criterion aims to identify columns that are likely to represent meaningful, time-based information.

    After calculating scores from each of the five criteria, the algorithm combines them to obtain a comprehensive score for each time-based column. The assistant then ranks the columns from highest to lowest score, providing guidance on the partitioning suitability of each column.
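The score-combination step can be illustrated with a small, hypothetical sketch; the criterion scores, weights, and column names below are invented for illustration and are not Soda's actual algorithm:

```python
def rank_candidate_columns(scores):
    """Combine per-criterion scores (each in [0, 1]) into a total and
    rank columns from highest to lowest total score."""
    totals = {col: sum(crit.values()) for col, crit in scores.items()}
    return sorted(totals, key=totals.get, reverse=True)

# Hypothetical scores for two candidate time-based columns
example = {
    "created_at": {"incrementality": 0.9, "std_dev": 0.8, "max_date": 1.0,
                   "missing": 1.0, "name": 1.0},
    "birth_date": {"incrementality": 0.1, "std_dev": 0.4, "max_date": 0.7,
                   "missing": 0.9, "name": 0.0},
}
print(rank_candidate_columns(example))  # ['created_at', 'birth_date']
```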

    Refer to Configure dataset filters for more information.

    Add a freshness check

    A freshness check ensures that the data in the dataset is up-to-date according to the latest value entered in a column containing date or timestamp values. Check suggestions uses the same heuristic methods as time-based partitioning to rank the columns. After ranking the columns, the CLI estimates the threshold using the standard error of date differences. It then prompts you to select the column and threshold to use for the freshness check.
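A freshness check sketch, assuming a created_at timestamp column:

```yaml
checks for dataset_name:
  - freshness(created_at) < 1d
```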

    Refer to Freshness checks for more information.

    Add validity checks

    A validity check compares the data in text columns to a specific format (see the list that follows) to determine whether the content is valid. For example, such a check can validate that all rows in an id column contain UUID-formatted values.

    Check suggestions prompts you to select the columns that are candidates for validity checks, which must contain text type data such as CHAR, VARCHAR, or TEXT.

    Valid formats:

    • UUID

    • email

    • phone number

    • credit card number

    • IP address (IPv4 and IPv6)

    • money

    • timestamp

    • date

    • time
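For example, a validity check for a UUID-formatted column might look like this sketch (column name assumed):

```yaml
checks for dataset_name:
  - invalid_count(id) = 0:
      valid format: uuid
```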

    Refer to Validity metrics for more information.

    Add missing checks

    A missing check automatically identifies any NULL values within your dataset. Check suggestions prompts you to select the columns to which you want to apply a missing check. By default, it sets each check threshold to 0, which means that a check fails if there are any NULL values in the column.
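A missing check with the default threshold of 0 looks like this sketch (column name assumed):

```yaml
checks for dataset_name:
  - missing_count(email) = 0
```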

    Refer to Missing metrics for more information.

    Add duplicate checks

    A duplicate check identifies duplicate records or entries within your dataset. By default, it sets each check threshold to 0, which means that a check fails if there are any duplicate values in the column.
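A duplicate check with the default threshold of 0 looks like this sketch (column name assumed):

```yaml
checks for dataset_name:
  - duplicate_count(phone) = 0
```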

    Refer to Numeric metrics for more information.

    Programmatically add checks

    ✔️ Some checks require Soda Core Scientific ✔️ Some checks supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✖️ Supported in Soda Cloud Agreements + Soda Agent


    Follow the steps above to create a checks.yml file to define your checks for data quality. Then, add the file(s) to your Python program as in the example below. Be sure to include any variables in your programmatic scan before the check YAML files. Soda requires the variable input for any variables defined in the check YAML files.
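A minimal programmatic sketch using the Soda Library Scan API; the data source, variable, and file names are placeholders, and running it requires an installed Soda Library package and a reachable data source:

```python
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("my_datasource")
scan.add_configuration_yaml_file("configuration.yml")
# Add variables before the check YAML files that reference them
scan.add_variables({"date": "2022-01-01"})
scan.add_sodacl_yaml_file("checks.yml")
scan.execute()
print(scan.get_logs_text())
```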

    No-code check types: Missing, Validity, Numeric, Duplicate, Row count, Freshness, Schema, SQL Failed rows, SQL Metric

    Dataset

    Select the dataset to which you want the check to apply.

    Check Name

    Provide a unique name for your check.

    Add to Scan Definition

    Select the scan definition to which you wish to add your check. Optionally, you can click create a new Scan Definition if you want Soda to execute the check more or less frequently, or at a different time of day than existing scan definitions dictate. See Manage scheduled scans for details.

    Filter fields

    Optionally, add an in-check filter to apply conditions that specify a portion of the data against which Soda executes the check.

    Define Metric/Values/Column/SQL

    As each metric or check requires different values, refer to SodaCL reference for detailed information about each metric or check. Learn more about how Soda uses OpenAI to process the input for SQL and Regex assistants in no-code checks.

    Alert Level

    Select the check result state(s) for which you wish to be notified: Fail, Warn, or Fail and Warn. See View scan results for details. By default, alert notifications for your check go to the Dataset Owner. See Define alert notification rules to set up more alert notifications.

    Name

    Provide a unique identifier for your notification.

    For

    Select All Checks, or select Selected Checks to use conditions to identify specific checks to which you want the rule to apply. You can identify checks according to several attributes such as Data Source Name, Dataset Name, or Check Name.

    Notify Recipient

    Select the destination to which this rule sends its notifications. For example, you can send the rule’s notifications to a channel in Slack.

    Notify About

    Identify the notifications this rule sends based on the severity of the check result: warn, fail, or both.


    Tests transformed data: In a Soda Cloud staging environment, Data Analysts can prepare no-code checks for data quality based on their knowledge of the reports and dashboards that the data feeds. The Data Engineers use the Soda Cloud API to execute remote Soda scans for data quality that include the checks the Data Analysts defined.

  • Transforms data in production: After addressing any data quality issues that surface after transformation in staging, the Data Engineers build the dbt models in the production environment.

  • Exports data quality results: The Data Engineers use the Soda Cloud API to load the data quality results into tables in Redshift from which other BI tools can fetch data quality results.

  • access permission and connection credentials and details for Amazon Redshift
  • access permission and connection credentials and details for an Amazon S3 bucket

  • access to a Tableau account

  • They create a file in the ~/.dbt/ directory named profiles.yml, then add the following configuration to use dbt with Dagster. Consult the Dagster documentation.

  • Lastly, they make sure that Dagster can read the dbt project directories in project.py.


    Need help? Join the Soda community on Slack.

    Organize, alert, investigate

    Execute the following command, replacing soda-postgres with the install package that matches the type of data source you use to store data.
    Data source
    Install package

    Amazon Athena

    soda-athena

    Amazon Redshift

    soda-redshift

    Apache Spark DataFrames (For use with programmatic Soda scans, only.)

    soda-spark-df

    Azure Synapse

    soda-sqlserver

    ClickHouse

    soda-mysql

    Dask and Pandas

    soda-pandas-dask

    To deactivate the virtual environment, use the deactivate command.

    Troubleshoot

    As of version 1.7.0, Soda Library packages include Pydantic version 2 for data validation. If your systems require the use of Pydantic version 1, you can install an extra package that uses Pydantic version 1. To do so, use the following command, adjusting the type of library to correspond with your data source.

    1. Best practice dictates that you install the Soda Library CLI using a virtual environment. In your command-line interface tool, create a virtual environment in the .venv directory using the commands below. Depending on your version of Python, you may need to replace python with python3 in the first command. Reference the virtualenv documentation for activating a Windows script.
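The steps above can be sketched as follows; replace python3 with python if that is how Python is invoked on your system:

```shell
# Create a virtual environment in the .venv directory
python3 -m venv .venv
# Activate it (on Windows, run .venv\Scripts\activate instead)
. .venv/bin/activate
```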

    1. Upgrade pip inside your new virtual environment.

    1. Execute the following command, replacing soda-postgres with the install package that matches the type of data source you use to store data.

    Data source
    Install package

    To deactivate the virtual environment, use the deactivate command.

    Reference the virtualenv documentation for activating a Windows script.

    Troubleshoot

    As of version 1.7.0, Soda Library packages include Pydantic version 2 for data validation. If your systems require the use of Pydantic version 1, you can install an extra package that uses Pydantic version 1. To do so, use the following command, adjusting the type of library to correspond with your data source.

    Use Soda’s Docker image in which Soda Scientific is pre-installed. You need Soda Scientific to be able to use SodaCL distribution checks or anomaly detection checks.

    1. If you have not already done so, install Docker in your local environment.

    2. From Terminal, run the following command to pull Soda Library’s official Docker image; adjust the version to reflect the most recent release.

    3. Verify the pull by running the following command.

      Output:

      When you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.

    4. When you are ready to run a Soda scan, use the following command to run the scan via the docker image. Replace the placeholder values with your own file paths and names.

      Optionally, you can specify the version of Soda Library to use to execute the scan. This may be useful when you do not wish to use the latest released version of Soda Library to run your scans. The example scan command below specifies Soda Library version 1.0.0.

    What does the scan command do?
    • docker run ensures that the docker engine runs a specific image.

    • -v mounts your SodaCL files into the container. In other words, it makes the configuration.yml and checks.yml files in your local environment available to the docker container. The command example maps your local directory to /sodacl inside of the docker container.

    Error: Mounts denied

    If you encounter the following error, follow the procedure below.

    You need to give Docker permission to access your configuration.yml and checks.yml files in your environment. To do so:

    1. Access your Docker Dashboard, then select Preferences (gear symbol).

    2. Select Resources, then follow the Docker instructions to add your Soda project directory—the one you use to store your configuration.yml and checks.yml files—to the list of directories that can be bind-mounted into Docker containers.

    3. Click Apply & Restart, then repeat steps 2 - 4 above.

    Error: Configuration path does not exist

    If you encounter the following error, double check the syntax of the scan command in step 4 above.

    • Be sure to prepend /sodacl/ to both the configuration.yml filepath and the checks.yml filepath.

    • Be sure to mount your files into the container by including the -v option. For example, -v /Users/MyName/soda_project:/sodacl.

    Install Soda Scientific to be able to use SodaCL distribution checks or anomaly detection checks.

    You have two installation options to choose from:

    • Install Soda Scientific in a virtual environment (Recommended)

    • Use Docker to run Soda Library with Soda Scientific

    Install Soda Scientific in a virtual environment (Recommended)

    1. Set up a virtual environment, and install Soda Library in your new virtual environment.

    2. Use the following command to install Soda Scientific.

    List of Soda Scientific dependencies
    • pandas<2.0.0

    • wheel

    • pydantic>=1.8.1,<2.0.0

    • scipy>=1.8.0

    • numpy>=1.23.3, <2.0.0

    • inflection==0.5.1

    • httpx>=0.18.1,<2.0.0

    • PyYAML>=5.4.1,<7.0.0

    • cython>=0.22

    • prophet>=1.1.0,<2.0.0

    Error: Library not loaded

    If you have defined an anomaly detection check and you use an M1 MacOS machine, you may get a Library not loaded: @rpath/libtbb.dylib error. This is a known issue in the MacOS community and is caused by issues during the installation of the prophet package. There currently are no official workarounds or releases to fix the problem, but the following adjustments may address the issue.

    1. Install soda-scientific as per the local environment installation instructions and activate the virtual environment.

    2. Use the following command to navigate to the directory in which the stan_model of the prophet package is installed in your virtual environment.

      For example, if you have created a python virtual environment in a /venvs directory in your home directory and you use Python 3.9, you would use the following command.

    Use Docker to run Soda Scientific

    Use Soda’s Docker image in which Soda Scientific is pre-installed. You need Soda Scientific to be able to use SodaCL distribution checks or anomaly detection checks.

    1. If you have not already done so, install Docker in your local environment.

    2. From Terminal, run the following command to pull Soda Library’s official Docker image; adjust the version to reflect the most recent release.

    3. Verify the pull by running the following command.

      Output:

      When you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.

    What does the scan command do?
    • docker run ensures that the docker engine runs a specific image.

    • -v mounts your SodaCL files into the container. In other words, it makes the configuration.yml and checks.yml files in your local environment available to the docker container. The command example maps your local directory to /sodacl inside of the docker container.

    Error: Mounts denied

    If you encounter the following error, follow the procedure below.

    You need to give Docker permission to access your configuration.yml and checks.yml files in your environment. To do so:

    1. Access your Docker Dashboard, then select Preferences (gear symbol).

    2. Select Resources, then follow the Docker instructions to add your Soda project directory—the one you use to store your configuration.yml and checks.yml files—to the list of directories that can be bind-mounted into Docker containers.

    3. Click Apply & Restart, then repeat steps 2 - 4 above.

    Error: Configuration path does not exist

    If you encounter the following error, double check the syntax of the scan command in step 4 above.

    • Be sure to prepend /sodacl/ to both the configuration.yml filepath and the checks.yml filepath.

    • Be sure to mount your files into the container by including the -v option. For example, -v /Users/MyName/soda_project:/sodacl.

  • Using a code editor, create a configuration.yml file. This file stores connection details for your data sources and your Soda Cloud account. Use the data source-specific connection configurations (see: Data source reference) to copy+paste the connection syntax into your file, then adjust the values to correspond with your data source’s details, as in the following example for PostgreSQL.
    • You can use system variables to pass sensitive values, if you wish.

    • If you want to run scans on multiple schemas in the data source, add one data source config block per schema.

  • Copy+paste the following soda_cloud configuration syntax into the configuration.yml file, as in the example below. Input the API key values you created in Soda Cloud.

    • Do not nest the soda_cloud configuration under the datasource configuration.

    • For host, use cloud.soda.io for EU region; use cloud.us.soda.io for USA region, according to your selection when you created your Soda Cloud account.

    • Optionally, provide a value for the scheme property to indicate which scheme to use to initialize the URI instance. If you do not explicitly include a scheme property, Soda uses the default https.

    • Save the configuration.yml file. Run the following scan to confirm that Soda can successfully connect with your data source.
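Assembled, a configuration.yml for PostgreSQL might look like the following sketch; exact connection keys vary per data source, and the values shown are placeholders:

```yaml
data_source my_datasource:
  type: postgres
  host: localhost
  port: 5432
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public

soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
```

To confirm the connection, you can run `soda test-connection -d my_datasource -c configuration.yml`.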


    PSI >= 0.2: significant distribution change

    During simulations, for a difference in mean between distributions equal to 10% of their standard deviation, the SWD value converged to approximately 0.1.

  • Change the values for dataset and column to reflect your own dataset's identifiers.
  • (Optional) Change the value for distribution_type to capture categorical or continuous data.

  • (Optional) Define the value of filter to specify the portion of the data in your dataset for which you are creating a DRO. If you trained a model on data in which the date_first_customer column contained values between 2010-01-01 and 2020-01-01, you can use a filter based on that period to test whether the distribution of the column has changed since then. If you do not wish to define a filter, remove the key-value pair from the file.

  • (Optional) If you wish to define multiple DROs in a single distribution_reference.yml file, change the names dro_name1 and dro_name2.

  • Save the file, then, while still in your Soda project directory, run the soda update-dro command to create a distribution reference object. For a list of options available to use with the command, run soda update-dro --help.

    If you defined multiple DROs in your distribution_reference.yml file, specify which DRO you want to update using the -n argument. -n indicates name. When multiple DROs are defined in a single distribution_reference.yml file, Soda requires all of them to be named. Thus, you must provide the DRO name with the -n argument when using the soda update-dro command.

  • Review the changed contents of your distribution_reference.yml file. The following is an example of the information that Soda added to the file.

    Soda appended a new key called distribution reference to the file, together with an array of bins and a corresponding array of weights.

  • your_dataset_name - the name of your dataset
  • column_name - the column against which to compare the DRO

  • dro_name - the name of the DRO (optional, required if distribution_reference.yml contains named DROs)

  • > your_threshold - the threshold for the distribution check that you specify as acceptable

  • Replace the value of your_method_of_choice with the type of test you want to use in the distribution check. If you do not specify a method, the distribution check defaults to ks for continuous data, or chi_square for categorical data.

    • ks for the Kolmogorov-Smirnov test

    • chi_square for the Chi-square test

    • psi for the Population Stability Index metric

    • swd for the Standardized Wasserstein Distance (SWD) metric

    • semd for the Standardized Earth Mover's Distance (SEMD) metric; note that the SWD and the SEMD are the same metric.

  • (Optional) To filter the data in the distribution check, replace the value of filter with a filter that specifies the portion of the data in your dataset for which you are checking the distribution.

  • (Optional) To sample the data in the distribution check, for continuous columns only, replace the value of sample with a query that specifies the portion of the data in your dataset for which you are checking the distribution. The data source you are using must support the query you write. For example, for PostgreSQL, you can use the TABLESAMPLE clause to randomly sample 50% of the data with seed 61. Best practice dictates that you use sampling for large datasets that might not fit in memory. Refer to your data source’s documentation for details on defining the sample size. If you do not use sample or filter in a distribution check for continuous columns, Soda fetches up to 1 million records by applying a limit clause for better memory management. For categorical columns, Soda does not support sample.

  • Run a soda scan of your data source to execute the distribution check(s) you defined.

    When Soda Library executes the distribution check above, it compares the values in column_name to a sample that Soda creates based on the bins, weights, and data_type in dro_name defined in the distribution_reference.yml file. Specifically, it checks whether the value of your_method_of_choice is larger than 0.05.

  • For categorical columns, Soda fetches the aggregated calculated value counts of each category. If there are more than one million distinct categories, Soda skips the distribution check and issues a warning.
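Putting the pieces above together, a distribution check takes this shape, using the same placeholder names as above:

```yaml
checks for your_dataset_name:
  - distribution_difference(column_name, dro_name) > your_threshold:
      method: your_method_of_choice
      distribution reference file: ./distribution_reference.yml
```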

    ✓ Use quotes when identifying dataset or column names. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    ✖️ Use wildcard characters ( % or * ) in values in the check.

    ✓ Use for each to apply distribution checks to multiple datasets in one scan.

    ✓ Apply a dataset filter to partition data during a scan.

    ✓ Instruct Soda to collect random samples. Because sampling SQL clauses vary significantly between data sources, consult your data source’s documentation.


    ✓ Define a name for a distribution check; see Customize check names.

    ✓ Add an identity to a check; see Add a check identity.

    ✖️ Define alert configurations to specify warn and fail thresholds.

    ✓ Apply an in-check filter to return results for a specific portion of the data in your dataset.

    List of Soda Scientific dependencies
    • pandas<2.0.0

    • wheel

    • pydantic>=1.8.1,<2.0.0

    • scipy>=1.8.0

    • numpy>=1.23.3, <2.0.0

    • inflection==0.5.1

    • httpx>=0.18.1,<2.0.0

    • PyYAML>=5.4.1,<7.0.0

    • cython>=0.22

    • prophet>=1.1.0,<2.0.0


    What does the scan command do?
    • docker run ensures that the docker engine runs a specific image.

    • -v mounts your SodaCL files into the container. In other words, it makes the configuration.yml and checks.yml files in your local environment available to the docker container. The command example maps your local directory to /sodacl inside of the docker container.

    • sodadata/soda-library refers to the image that docker run must use.

    • scan instructs Soda Library to execute a scan of your data.

    • -d indicates the name of the data source to scan.

    • -c specifies the filepath and name of the configuration YAML file.


    Validity metrics

    Use validity metrics in SodaCL checks to detect invalid values in a dataset.

    Use a validity metric in a check to surface invalid or unexpected values in your dataset.

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✔️ Available as a no-code check with a self-hosted Soda Agent connected to any Soda-supported data source, except Spark, and Dask and Pandas OR with a Soda-hosted Agent connected to a BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake data source

    Define checks with validity metrics

    In the context of SodaCL check types, you use validity metrics in standard checks. Refer to Standard check types for exhaustive configuration details.

    You can use all validity metrics in checks that apply to individual columns in a dataset; you cannot use validity metrics in checks that apply to entire datasets. Identify the column by adding a value in the argument between brackets in the check.

    • You must use a configuration key to define what qualifies as a valid value or invalid value.

    • If you wish, you can add a % character to the threshold for an invalid_percent metric for improved readability. This character does not behave as a wildcard in this context.

    You can use validity metrics in checks with fixed thresholds or relative thresholds, but not change-over-time thresholds.

    What is a relative threshold?

    When it scans a column in your dataset, Soda automatically separates all values in the column into one of three categories:

    • missing

    • invalid

    • valid

    Specify valid or invalid values

    Use a nested configuration key:value pair to provide your own definition of a valid or invalid value. There are several configuration keys that you can use to define what qualifies as valid; the examples below illustrate the use of just a few config keys. See a complete list of configuration keys below.

    A check that uses a validity metric has six mutable parts:

    The example below defines two checks. The first check applies to the column house_owner_flag. The valid values configuration key specifies that if a row in that column contains anything other than the two valid values in the list, Soda registers them as invalid. The check fails if Soda discovers any values that are not 0 or 1.

    • Values in a list must be enclosed in square brackets.

    • Known issue: Do not wrap numeric values in single quotes if you are scanning data in a BigQuery data source.

    The second check uses a regular expression to define what qualifies as an invalid value in the last_name column so that any values that match the pattern defined by the regex qualify as invalid.

    First check:

    Second check:
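Based on the descriptions above, the two checks might look like the following sketch; the regex pattern is illustrative only:

```yaml
checks for dataset_name:
  - invalid_count(house_owner_flag) = 0:
      valid values: [0, 1]
  - invalid_count(last_name) = 0:
      invalid regex: '^[0-9]+$'
```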

    The invalid values configuration key specifies that if a row in that column contains the invalid values in the list, Soda registers them as invalid. In the example below, the check fails if Soda discovers any values that are Antonio.

    Values in a list must be enclosed in square brackets.

    Specify valid format

    If the data type of the column you are checking is TEXT (such as character, character varying, or string) then you can use the valid format configuration key. This config key uses built-in values that test the data in the column for specific formats, such as email address format, date format, or uuid format. See the list of valid formats below.

    The check below validates that all values in the email_address column conform to an email address format.
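Such a check might look like this sketch (column name assumed):

```yaml
checks for dataset_name:
  - invalid_count(email_address) = 0:
      valid format: email
```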

    Troubleshoot valid format and values

    Problem: You are using a valid format to test the format of values in a column and the CLI returns the following error message when you run a scan.

    Solution: The error indicates that the data type of the column is not TEXT. Adjust your check to use a different configuration key instead.
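For example, if the column is numeric, you can swap valid format for a numeric configuration key. A sketch using the valid max key on a hypothetical total_children column:

```yaml
checks for dim_customer:
  - invalid_percent(total_children) <= 2:
      valid max: 6
```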

    Failed row samples

    Checks with validity metrics automatically collect samples of any failed rows to display in Soda Cloud. The default number of failed row samples that Soda collects and displays is 100.

    If you wish to limit or broaden the sample size, you can use the samples limit configuration in a check with a validity metric. You can add this configuration to your checks YAML file for Soda Library, or when writing checks as part of an agreement in Soda Cloud.

    For security, you can add a configuration to your data source connection details to prevent Soda from collecting failed rows samples from specific columns that contain sensitive data.

    Alternatively, you can set the samples limit to 0 to prevent Soda from collecting and sending failed rows samples for an individual check, as in the following example.
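A sketch that disables failed row sample collection for a single validity check:

```yaml
checks for dim_customer:
  - invalid_count(email_address) = 0:
      valid format: email
      samples limit: 0
```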

    You can also add a samples columns or a collect failed rows configuration to a check to specify the columns for which Soda must implicitly collect failed row sample values, as in the following example with the former. Soda only collects this check’s failed row samples for the columns you specify in the list.
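A sketch using samples columns, with hypothetical column names; Soda collects this check's failed row samples only for the listed columns:

```yaml
checks for dim_customer:
  - invalid_count(email_address) = 0:
      valid format: email
      samples columns: [email_address, last_name]
```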

    Note that the comma-separated list of samples columns does not support wildcard characters (%).

    To review the failed rows in Soda Cloud, navigate to the Checks dashboard, then click the row for a check for validity values and examine the failed rows in the Failed Rows Analysis tab.

    Optional check configurations

    Supported
    Configuration
    Documentation

    Example with check name

    Example with alert configuration

    Example with in-check filter

    Example with quotes

    Example with for each

    Example with dataset filter

    List of validity metrics

    Metric
    Column config keys
    Description
    Supported data types

    List of configuration keys

    The column configuration key:value pair defines what SodaCL ought to consider as valid values.

    Column config key
    Description
    Values

    List of valid formats

    • Though the table below lists valid formats, the same formats apply for invalid formats.

    • Valid formats apply only to columns using data type TEXT, not DATE or NUMBER.

    • The Soda Library package for MS SQL Server has limited support for valid formats. See the list of formats supported for MS SQL Server below.

    Valid format value
    Format

    Formats supported with Soda for MS SQL Server

    Valid format value
    Format

    List of comparison symbols and phrases

    Go further

    • Use validity metrics in checks with alert configurations to establish warn and fail thresholds.

    • Use validity metrics in checks to define ranges of acceptable thresholds using comparison symbols and phrases.


    Run a scan and view results

    Soda uses your checks and data source connection configurations to prepare a scan that it runs against datasets to extract metadata and gauge data quality.

    A check is a test that Soda performs when it scans a dataset in your data source. Soda uses the checks you defined as no-code checks in Soda Cloud, or wrote in a checks YAML file, to prepare SQL queries that it runs against the data in a dataset. Soda can execute multiple checks against one or more datasets in a single scan.

    As a step in the Get started roadmap, this guide offers instructions to schedule a Soda scan, run a scan, or invoke a scan programmatically.

    Get started roadmap

    checks for dataset_name:
      - row_count > 0
    soda scan -d my_datasource -c configuration.yml checks.yml
    Soda Library 1.0.x
    Scan summary:
    1/1 check PASSED: 
        dim_customer in adventureworks
          row_count > 0 [PASSED]
    All is good. No failures. No warnings. No errors.
    Sending results to Soda Cloud
    Soda Cloud Trace: 67592***474
    checks for dataset_A:
      - schema:
          name: Any schema changes
          fail:
            when schema changes:
              - column delete
              - column add
              - column index change
              - column type change
    checks for dataset_A:
      - row_count > 0
      - anomaly detection for row_count
    filter customer [daily]:
      where: created_at > TIMESTAMP '${NOW}' - interval '1d'
    
    checks for customer [daily]:
      - missing_count(name) < 5
      - duplicate_count(phone) = 0
    checks for dataset_A:
      - freshness(date_first_purchase) < 24h
    checks for dataset_A:
      - invalid_count(email_address) = 0:
          valid format: email
    checks for dataset_A:
      - missing_count(customer_key) = 0
      - missing_count(geography_key) = 0
      - missing_count(customer_alternate_key) = 0
      - missing_count(title) = 0
      - missing_count(first_name) = 0
      - missing_count(middle_name) = 0
      - missing_count(last_name) = 0
      - missing_count(name_style) = 0
      - missing_count(birth_date) = 0
      - missing_count(marital_status) = 0
      - missing_count(suffix) = 0
      - missing_count(gender) = 0
    checks for dataset_A:
      - duplicate_count(customer_key) = 0
      - duplicate_count(geography_key) = 0
      - duplicate_count(customer_alternate_key) = 0
      - duplicate_count(title) = 0
      - duplicate_count(first_name) = 0
      - duplicate_count(middle_name) = 0
      - duplicate_count(last_name) = 0
      - duplicate_count(name_style) = 0
      - duplicate_count(birth_date) = 0
      - duplicate_count(marital_status) = 0
      - duplicate_count(suffix) = 0
      - duplicate_count(gender) = 0
    from soda.scan import Scan
    
    scan = Scan()
    scan.set_data_source_name("events")
    
    scan.add_configuration_yaml_file(file_path="~/.soda/my_local_soda_environment.yml")
    
    
    # Add variables
    ###############
    scan.add_variables({"date": "2022-01-01"})
    
    
    # Add check YAML files
    ##################
    scan.add_sodacl_yaml_file("./my_programmatic_test_scan/sodacl_file_one.yml")
    scan.add_sodacl_yaml_file("./my_programmatic_test_scan/sodacl_file_two.yml")
    scan.add_sodacl_yaml_files("./my_scan_dir")
    scan.add_sodacl_yaml_files("./my_scan_dir/sodacl_file_three.yml")
    bike_retail: # profile name referenced by dbt_project.yml
      target: dev
      outputs:
        dev:
          host: # DB host
          method: database 
          port: 5439
          schema: demo # schema name
          threads: 2 # threads in the concurrency of dbt tasks
          type: redshift # DB source connection
          dbname: # the target db name 
          user: # username
          password: # password
    from pathlib import Path
    from dagster_dbt import DbtProject
    
    dbt_project = DbtProject(
        project_dir=Path(__file__).joinpath("..", "..", "..").resolve(),
        packaged_project_dir=Path(__file__).joinpath("..", "..", "dbt-project").resolve(),
    )
    dbt_project.prepare_if_dev()
    pip install \
      dbt-core \
      dbt-redshift
    
    cd project-directory
    dbt init project-name
    soda_cloud:
      # For host, use cloud.us.soda.io for US regions; use cloud.soda.io for European region
      host: cloud.soda.io 
      api_key_id: 5r-xxxx-t6
      api_key_secret: Lvv8-xxx-xxx-sd0m
    checks for stores:
      - row_count > 0:
          name: Invalid row count
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Completeness
            data_domain: Location
            weight: 3
      - missing_count(store_id) = 0:
          name: Store must have ID
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Completeness
            data_domain: Location
            weight: 2
      - invalid_count(email) = 0:
          valid format: email
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Validity
            data_domain: Location
            weight: 1
      - invalid_count(phone) = 0:
          valid format: phone number
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Validity
            data_domain: Location
            weight: 1
    
    checks for stocks:
      - row_count > 0:
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Completeness
            data_domain: Product
            weight: 3
      - values in (store_id) must exist in stores (store_id):
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Consistency
            data_domain: Product
            weight: 2
      - values in (product_id) must exist in products (product_id):
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Consistency
            data_domain: Product
            weight: 2
      - min(quantity) >= 0:
          name: No negative quantities
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Validity
            data_domain: Product
            weight: 2
    
    checks for customers:
      - missing_count(phone) < 5% :
          name: Missing phone number
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Completeness
            data_domain: Product
            weight: 1
      - missing_count(email) < 5% :
          name: Missing email address
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Completeness
            data_domain: Product
            weight: 1
      - invalid_count(email) = 0:
          valid format: email
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Validity
            data_domain: Location
            weight: 1
      - invalid_count(phone) = 0:
          valid format: phone number
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Validity
            data_domain: Location
            weight: 1
    
    checks for orders:
      - failed rows:
          name: Shipment Late
          fail query: |
            select order_id as failed_orders
            from orders
            where shipped_date < required_date;
          attributes:
            pipeline_stage: Pre-ingestion
            data_quality_dimension:
              - Timeliness
            data_domain: Transaction
            weight: 3
    import json
    from datetime import datetime

    import s3fs
    import boto3
    import pandas as pd
    from soda.scan import Scan
    from soda.sampler.sampler import Sampler
    from soda.sampler.sample_context import SampleContext
    from dagster import asset, Output, get_dagster_logger, MetadataValue
    
    
    # Create a class for a Soda Custom Sampler
    class CustomSampler(Sampler):
        def store_sample(self, sample_context: SampleContext):
            rows = sample_context.sample.get_rows()
            json_data = json.dumps(rows) # Convert failed row samples to JSON
            exceptions_df = pd.read_json(json_data) # Create a DataFrame with failed rows samples
            # Define exceptions DataFrame
            exceptions_schema = sample_context.sample.get_schema().get_dict()
            exception_df_schema = []
            for n in exceptions_schema:
                exception_df_schema.append(n["name"])
            exceptions_df.columns = exception_df_schema
            check_name = sample_context.check_name
            exceptions_df['failed_check'] = check_name
            exceptions_df['created_at'] = datetime.now()
            exceptions_df.to_csv(check_name+".csv", sep=",", index=False, encoding="utf-8")
            bytestowrite = exceptions_df.to_csv(None).encode()
            # Write the failed row samples CSV file to S3
            fs = s3fs.S3FileSystem(key=AWS_ACCESS_KEY, secret=AWS_SECRET_KEY)
            with fs.open(f's3://BUCKET-NAME/PATH/{check_name}.csv', 'wb') as f:
              f.write(bytestowrite)
        get_dagster_logger().info(f'Successfully sent failed rows to {check_name}.csv')
    
    @asset(compute_kind='python')
    def ingestion_checks(context):
        # Initiate the client
        s3 = boto3.client('s3')
        dataframes = {}
        dataframes = {}
    
        for i, file_key in enumerate(FILE_KEYS, start=1):
            try:
                # Read the file from S3
                response = s3.get_object(Bucket=BUCKET_NAME, Key=file_key)
                file_content = response['Body']
    
                # Load CSV into DataFrame
                df = pd.read_csv(file_content)
                dataframes[i] = df
                get_dagster_logger().info(f"Successfully loaded DataFrame for {file_key} with {len(df)} rows.")
                
            except Exception as e:
                get_dagster_logger().error(f"Error loading {file_key}: {e}")
        failed_rows_cloud = 'false'
        # Execute a Soda scan
        scan = Scan()
        scan.set_scan_definition_name('Soda Dagster Demo')
        scan.set_data_source_name('soda-dagster')
        dataset_names = [
            'customers', 'orders',
            'stocks', 'stores'
        ]
    
    # Add DataFrames to Soda scan in a loop
        try:
            for i, dataset_name in enumerate(dataset_names, start=1):
                scan.add_pandas_dataframe(
                    dataset_name=dataset_name,
                    pandas_df=dataframes[i],
                    data_source_name='soda-dagster'
                )
        except KeyError as e:
            get_dagster_logger().error(f"DataFrame missing for index {e}. Check if all files are loaded correctly.")
    
    # Add the configuration YAML file
        scan.add_configuration_yaml_file('path/config.yml') 
    
    # Add the SodaCL checks YAML file
        scan.add_sodacl_yaml_file('path/checks.yml')
        if failed_rows_cloud == 'false':
            scan.sampler = CustomSampler()
        scan.execute() # Runs the scan
        logs = scan.get_logs_text()
    
        scan_results = scan.get_scan_results()
        context.log.info("Scan executed successfully.")
        get_dagster_logger().info(scan_results)
        get_dagster_logger().info(logs)
        scan.assert_no_checks_fail() # Terminate the pipeline if any checks fail
    
        return Output(
            value=scan_results, 
            metadata={
                "scan_results": MetadataValue.json(scan_results),
                'logs':MetadataValue.json(logs)  # Save the results as JSON 
            },
        )
    from dagster_dbt import DbtCliResource, dbt_assets
    from dagster import AssetExecutionContext
    from .project import dbt_project
    
    # Select argument selects only models in the models/staging/ directory 
    
    @dbt_assets(select='staging', manifest=dbt_project.manifest_path)
    def dbt_staging(context: AssetExecutionContext, dbt: DbtCliResource):
        yield from dbt.cli(["build"],context=context, manifest=dbt_project.manifest_path).stream()
    import base64
    import time

    import requests
    from dagster import asset, get_dagster_logger, Failure
    
    url = 'https://cloud.soda.io/api/v1/scans' # cloud.us.soda.io for US region
    api_key_id = 'soda_api_key_id'
    api_key_secret = 'soda_api_key_secret'
    credentials = f"{api_key_id}:{api_key_secret}"
    encoded_credentials = base64.b64encode(credentials.encode('utf-8')).decode('utf-8')
    # Headers, including the authorization token 
    headers = {
        'Accept': 'application/json',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Authorization': f'Basic {encoded_credentials}'
    }
    
    # Data for the POST request
    payload = {
    
        "scanDefinition": "dagsterredshift_default_scan"
    }
    def trigger_scan():
    
        response = requests.post(url, headers=headers, data=payload)
    
        # Check the response status code
        if response.status_code == 201:
            get_dagster_logger().info('Request successful')
            # Print the response content
            scan_id = response.headers.get('X-Soda-Scan-Id')
            if not scan_id:
                get_dagster_logger().info('X-Soda-Scan-Id header not found')
                raise Failure('Scan ID not found')
    
        else:
            get_dagster_logger().error(f'Request failed with status code {response.status_code}')
            raise Failure(f'Request Failed: {response.status_code}')
    
        # Check the scan status in a loop
        
        while scan_id:
            get_response = requests.get(f'{url}/{scan_id}', headers=headers)
            
            if get_response.status_code == 200:
                scan_status = get_response.json()
                state = scan_status.get('state')
                
                if state in ['queuing', 'executing']:
                    # Wait for a few seconds before checking again
                    time.sleep(5)
                    get_dagster_logger().info(f'Scan state: {state}')
                # The pipeline terminates when the scan either warns or fails
                elif state == 'completed':
                    get_dagster_logger().info(f'Scan {state} successfully')
    
                    break
                else:
                    get_dagster_logger().info(f'Scan failed with status: {state}')
                    raise Failure('Soda Cloud Check Failed')
          
            else:
                get_dagster_logger().info(f'GET request failed with status code {get_response.status_code}')
                raise Failure(f'Request failed: {get_response.status_code}')
    
    @asset(deps=[dbt_staging], compute_kind='python')
    def soda_UI_check():
        trigger_scan()
    from dagster_dbt import DbtCliResource, dbt_assets
    from dagster import AssetExecutionContext
    from .project import dbt_project
    
    @dbt_assets(select='prod', manifest=dbt_project.manifest_path)
    def dbt_prod(context: AssetExecutionContext, dbt: DbtCliResource):
        yield from dbt.cli(["build"], context=context, manifest=dbt_project.manifest_path).stream()
    
    from dagster import Definitions, load_assets_from_modules, define_asset_job, AssetSelection, ScheduleDefinition
    from dagster_dbt import DbtCliResource
    from .project import dbt_project
    from . import assets
    from dagster_aws.s3 import S3Resource
    
    # Load the assets defined in the module
    all_assets = load_assets_from_modules([assets])
    
    # Define one asset job with all the assets
    dagster_pipeline = define_asset_job("dagster_pipeline", selection=AssetSelection.all()) 
    
    # Create a schedule
    daily_schedule = ScheduleDefinition(
        name="Bikes_Pipeline",
        cron_schedule="0 9 * * *",
        job=dagster_pipeline,
        run_config={},  # Provide run configuration if needed
        execution_timezone="UTC"
    )
    
    # Wire it all together, along with resources
    defs = Definitions(
        assets=[*all_assets],
        jobs=[dagster_pipeline],
        schedules=[daily_schedule],
        resources={
            "dbt": DbtCliResource(project_dir=dbt_project),
            "s3": S3Resource(
                region_name="your-region",
                aws_access_key_id="your-aws-key",
                aws_secret_access_key="your-aws-secret",
    
            )
        }
    )
    python -m venv .venv
    .venv\Scripts\activate
    pip install --upgrade pip
    docker pull sodadata/soda-library:v1.0.3
    docker run sodadata/soda-library:v1.0.3 --help
     Usage: soda [OPTIONS] COMMAND [ARGS]...
    
       Soda Library CLI version 1.0.x, Soda Core CLI version 3.0.xx
    
     Options:
       --version  Show the version and exit.
       --help     Show this message and exit.
    
     Commands:
       ingest           Ingests test results from a different tool
       scan             Runs a scan
       suggest          Generates suggestions for a dataset
       test-connection  Tests a connection
       update-dro       Updates contents of a distribution reference file
    data_source my_datasource:
      type: postgres
      host: localhost
      username: postgres
      password: secret
      database: postgres
      schema: public
    python -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip
    export POSTGRES_PASSWORD=1234
    echo $POSTGRES_PASSWORD
    data_source my_database_name:
      type: postgres
      host: soda-temp-demo
      port: '5432'
      username: sodademo
      password: ${POSTGRES_PASSWORD}
      database: postgres
      schema: public
    soda test-connection -d my_datasource -c configuration.yml
    export API_KEY=1234
    echo $API_KEY
    data_source my_database_name:
      type: postgres
      host: soda-temp-demo
      port: '5432'
      username: sodademo
      password: ${POSTGRES_PASSWORD}
      database: postgres
      schema: public
    
    soda_cloud:
      host: cloud.soda.io
      api_key_id: ${API_KEY}
      api_key_secret: ${API_SECRET}
    soda test-connection -d my_datasource -c configuration.yml
    # For bash interactive shell
    pip install -i https://pypi.cloud.soda.io soda-postgres
    # For zsh interactive shell
    pip install -i https://pypi.cloud.soda.io "soda-postgres"
    deactivate
    #bash
    pip install -i https://pypi.cloud.soda.io soda-postgres[pydanticv1]
    
    #zsh
    pip install -i https://pypi.cloud.soda.io  "soda-spark-df[pydanticv1]"
    soda update-dro -d your_datasource_name -c your_configuration_file.yml ./distribution_reference.yml 
    soda update-dro -n dro_name1 -d your_datasource_name -c your_configuration_file.yml ./distribution_reference.yml 
    dataset: dim_customer
    column: number_cars_owned
    distribution_type: categorical
    filter: date_first_purchase between '2010-01-01' and '2020-01-01'
    distribution reference:
      weights:
        - 0.34932914953473276
        - 0.2641744211209695
        - 0.22927937675827742
        - 0.08899588833585804
        - 0.06822116425016231
      bins:
        - 2
        - 1
        - 0
        - 3
        - 4
    soda scan -d your_datasource_name -c /path/to/your_configuration_file.yml your_check_file.yml
    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library:v1.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    pip install -i https://pypi.cloud.soda.io soda-scientific
    dataset: your_dataset_name
    column: column_name_in_dataset
    distribution_type: categorical
    # (optional) filter to a specific point in time or any other dimension 
    filter: "column_name between '2010-01-01' and '2020-01-01'"
    # (optional) database specific sampling query; for example, for postgres
    # the following query randomly samples 50% of the data with seed 61
    dro_name1:
      dataset: your_dataset_name
      column: column_name_in_dataset
      distribution_type: categorical
    dro_name2:
      dataset: your_dataset_name
      column: column_name2_in_dataset
      distribution_type: continuous
    checks for your_dataset_name:
      - distribution_difference(column_name, dro_name) > your_threshold:
          method: your_method_of_choice
          distribution reference file: ./distribution_reference.yml
          # (optional) filter to a specific point in time, or any other dimension 
          filter: column_name > min_allowed_column_value and column_name < max_allowed_value
          # (optional) database specific sampling query for continuous columns. For 
          # example, for PostgreSQL, the following query randomly samples 50% of the data 
          # with seed 61
          sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)
    import numpy as np
    arr = np.array([0, 0, 0, 1, 2, 3, 3, 4, 10e6])
    number_of_bins = np.histogram_bin_edges(arr, bins='auto').size # return 3466808
    checks for dim_customer:
      - distribution_difference(budget) < 0.05:
          distribution reference file: ./dro_dim_customer.yml 
          method: ks
          # (optional) data source-specific sampling query; for example, for postgres
          # the following query randomly samples 50% of the data with seed 61
          sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)
    checks for dim_customer:
      - distribution_difference(number_cars_owned) > 0.05:
          distribution reference file: ./cars_owned_dist_ref.yml 
          method: chi_square
          # (optional) data source-specific sampling query; for example, for postgres
          # the following query randomly samples 50% of the data
          filter: rand() < 0.5
    dataset: your_dataset_name
    column: column_name_in_dataset
    distribution_type: categorical
    # (optional) data source-specific sampling query; for example, for postgres
    # the following query randomly samples 50% of the data
    filter: rand() < 0.5
    checks for dim_customer:
      - distribution_difference(number_cars_owned) > 0.05:
          method: chi_square
          distribution reference file: ./cars_owned_dist_ref.yml
    
    checks for fact_sales_quota:
      - distribution_difference(calendar_quarter) < 0.2:
          method: psi
          distribution reference file: ./sales_dist_ref.yml
    checks for dim_customer:
      - distribution_difference(number_cars_owned, cars_owned_dro) > 0.05:
          method: chi_square
          distribution reference file: ./distribution_reference.yml
    
    checks for fact_sales_quota:
      - distribution_difference(calendar_quarter, calendar_quarter_dro) < 0.2:
          method: psi
          distribution reference file: ./distribution_reference.yml
    checks for dim_customer:
      - distribution_difference(number_cars_owned, cars_owned_dro) > 0.05:
          method: chi_square
          distribution reference file: ./distribution_reference.yml
      - distribution_difference(total_children, total_children_dro) < 0.2:
          method: psi
          distribution reference file: ./distribution_reference.yml
    
    checks for fact_sales_quota:
      - distribution_difference(calendar_quarter, calendar_quarter_dro) < 0.2:
          method: psi
          distribution reference file: ./distribution_reference.yml
    checks for dim_customer:
    - distribution_difference(number_cars_owned) > 0.05: 
        method: chi_square
        distribution reference file: dist_ref.yml
        name: Distribution check
    checks for dim_customer:
    - distribution_difference("number_cars_owned") < 0.2:
        method: psi
        distribution reference file: dist_ref.yml
        name: Distribution check
    for each dataset T:
        datasets:
            - dim_customer
        checks:
        - distribution_difference(number_cars_owned) < 0.15:
            method: swd
            distribution reference file: dist_ref.yml
    checks for dim_customer:
    - distribution_difference(number_cars_owned) < 0.05: 
        method: swd
        distribution reference file: dist_ref.yml
        filter: date_first_purchase between '2010-01-01' and '2022-01-01'
    filter dim_customer [first_purchase]:
      where: date_first_purchase between '2010-01-01' and '2022-01-01' 
    
    checks for dim_customer [first_purchase]:
    - distribution_difference(number_cars_owned) < 0.05: 
        method: swd
        distribution reference file: dist_ref.yml
    checks for dim_customer:
      - distribution_difference(number_cars_owned) > 0.05:
          distribution reference file: ./cars_owned_dist_ref.yml
          method: chi_square
          sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)
     = 
     < 
     >
     <=
     >=
     !=
     <> 
     between 
     not between 
    pip install -i https://pypi.cloud.soda.io soda-scientific
    docker pull sodadata/soda-library:v1.0.3
    docker run sodadata/soda-library:v1.0.3 --help
     Usage: soda [OPTIONS] COMMAND [ARGS]...
    
       Soda Library CLI version 1.0.x, Soda Core CLI version 3.0.xx
    
     Options:
       --version  Show the version and exit.
       --help     Show this message and exit.
    
     Commands:
       ingest           Ingests test results from a different tool
       scan             Runs a scan
       suggest          Generates suggestions for a dataset
       test-connection  Tests a connection
       update-dro       Updates contents of a distribution reference file
    WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
    docker: Error response from daemon: Mounts denied: 
    The path /soda-library-test/files is not shared from the host and is not known to Docker.
    You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
    See https://docs.docker.com/desktop/mac for more info.
    Soda Library 1.0.x
    Configuration path 'configuration.yml' does not exist
    Path "checks.yml" does not exist
    Scan summary:
    No checks found, 0 checks evaluated.
    2 errors.
    Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
    ERRORS:
    Configuration path 'configuration.yml' does not exist
    Path "checks.yml" does not exist
    checks for dim_customer:
    # Check for valid values
      - invalid_count(customer_id) = 0:
          invalid regex: ^(?!\d{8}$).+$
      - invalid_count(email_address) = 0:
          valid format: email
      - invalid_percent(english_education) = 0:
          valid length: 100
      - invalid_percent(total_children) <= 2:
          valid max: 6
      - invalid_percent(marital_status) = 0:
          valid max length: 10
      - invalid_count(number_cars_owned) = 0:
          valid min: 1
      - invalid_percent(marital_status) = 0:
          valid min length: 1
      - invalid_count(house_owner_flag) = 0:
          valid values: [0, 1]
    checks for dim_customer:
    # Check for invalid values
      - invalid_count(first_name) = 0:
          invalid values: [Antonio]
      - invalid_count(number_cars_owned) = 0:
          invalid values: [0, 3] 
    • Configure in-check filters (example)

    • Use quotes in a check (example)

    • Apply checks to multiple datasets (example)

    • Scan a portion of your dataset (example)

    Databricks

    soda-spark[databricks]

    Denodo

    soda-denodo

    Dremio

    soda-dremio

    DuckDB

    soda-duckdb

    GCP BigQuery

    soda-bigquery

    Google CloudSQL

    soda-postgres

    IBM DB2

    soda-db2

    MS SQL Server

    soda-sqlserver

    MySQL

    soda-mysql

    OracleDB

    soda-oracle

    PostgreSQL

    soda-postgres

    Snowflake

    soda-snowflake

    Trino

    soda-trino

    Vertica

    soda-vertica

    inside of the docker container.
  • sodadata/soda-library refers to the image that docker run must use.

  • scan instructs Soda Library to execute a scan of your data.

  • -d indicates the name of the data source to scan.

  • -c specifies the filepath and name of the configuration YAML file.

  • scipy>=1.8.0

  • numpy>=1.23.3, <2.0.0

  • inflection==0.5.1

  • httpx>=0.18.1,<2.0.0

  • PyYAML>=5.4.1,<7.0.0

  • cython>=0.22

  • prophet>=1.1.0,<2.0.0

  • Use the ls command to determine the version number of cmndstan that prophet installed. The cmndstan directory name includes the version number.

  • Add the rpath of the tbb library to your prophet installation using the following command.

    With cmdstan version 2.26.1, you would use the following command.

  • When you are ready to run a Soda scan, use the following command to run the scan via the docker image. Replace the placeholder values with your own file paths and names.

    Optionally, you can specify the version of Soda Library to use to execute the scan. This may be useful when you do not wish to use the latest released version of Soda Library to run your scans. The example scan command below specifies Soda Library version 1.0.0.


  • Databricks

    soda-spark[databricks]

    Denodo

    soda-denodo

    Dremio

    soda-dremio

    DuckDB

    soda-duckdb

    GCP BigQuery

    soda-bigquery

    Google CloudSQL

    soda-postgres

    IBM DB2

    soda-db2

    Local file

    Use Dask.

    MotherDuck

    soda-duckdb

    MS SQL Server

    soda-sqlserver

    MySQL

    soda-mysql

    OracleDB

    soda-oracle

    PostgreSQL

    soda-postgres

    Presto

    soda-presto

    Snowflake

    soda-snowflake

    Trino

    soda-trino

    Vertica

    soda-vertica

    Amazon Athena

    soda-athena

    Amazon Redshift

    soda-redshift

    Apache Spark DataFrame (For use with programmatic Soda scans, only.)

    soda-spark-df

    Azure Synapse

    soda-sqlserver

    ClickHouse

    soda-mysql

    Dask and Pandas

    soda-pandas-dask

    virtualenv documentation
    Docker instructions
    prophet library
    distribution checks
    anomaly detection checks
    install Docker
    recent release
    Docker instructions
    valid

    Soda then performs two calculations. The sum of the counts for the three categories (missing, invalid, valid) in a column is always equal to the total row count for the dataset:

    missing_count(column_name) + invalid_count(column_name) + valid_count(column_name) = row_count

    Similarly, the percentage metrics always add up to a total of 100 for the column:

    missing_percent(column_name) + invalid_percent(column_name) + valid_percent(column_name) = 100

    These calculations enable you to write checks that use relative thresholds. In the example above, the invalid values in the english_education column must amount to less than three percent of the total row count, or the check fails. Percentage thresholds are between 0 and 100, not between 0 and 1.
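The identity above can be sanity-checked with plain Python on a small, made-up column; this is an illustration of the arithmetic, not Soda syntax.

```python
# Toy illustration of the identity:
# missing_count + invalid_count + valid_count = row_count
# using a hypothetical column where the valid values are "M" and "F".
column = ["M", "F", None, "X", "F", None, "M", "Q"]
valid_values = {"M", "F"}

missing_count = sum(1 for v in column if v is None)
invalid_count = sum(1 for v in column if v is not None and v not in valid_values)
valid_count = sum(1 for v in column if v in valid_values)
row_count = len(column)

# The three categories partition the column, so the counts sum to the row count.
assert missing_count + invalid_count + valid_count == row_count

# Percentages are relative to the total row count and sum to 100, not to 1.
missing_percent = 100 * missing_count / row_count
invalid_percent = 100 * invalid_count / row_count
valid_percent = 100 * valid_count / row_count
assert missing_percent + invalid_percent + valid_percent == 100
```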

    ✓

    Use quotes when identifying dataset or column names; see . Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    Use wildcard characters ( % or * ) in values in the check.

    -

    ✓

    Use for each to apply checks with validity metrics to multiple datasets in one scan; see .

    ✓

    Apply a dataset filter to partition data during a scan; see .

    ✓

    Supports samples columns parameter to specify columns from which Soda draws failed row samples.

    ✓

    Supports samples limit parameter to control the volume of failed row samples Soda collects.

    ✓

    Supports the collect failed rows parameter to instruct Soda to collect, or not to collect, failed row samples for a check.

    invalid regex valid regex

    text

    valid length

    Specifies a valid length for a string. Works with columns that contain data type TEXT, and also with INTEGER on most databases, where implicit casting from string to integer is supported. Note: PostgreSQL does not support this behavior, as it does not implicitly cast strings to integers for this use case.

    integer

    valid max

    Specifies a maximum numerical value for valid values.

    integer or float

    valid max length

    Specifies a valid maximum length for a string. Only works with columns that contain data type TEXT.

    integer

    valid min

    Specifies a minimum numerical value for valid values.

    integer or float

    valid min length

    Specifies a valid minimum length for a string. Only works with columns that contain data type TEXT.

    integer

    valid regex

    Specifies a regular expression to define your own custom valid values.

    regex, no forward slash delimiters

    valid values

    Specifies the values that Soda ought to consider valid.

    values in a list
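As an illustration with hypothetical column names, the two most general keys, valid values and valid regex, are configured as nested key:value pairs under a check; note that the regex takes no forward-slash delimiters.

```yaml
checks for dim_customer:
  # an explicit list of valid values
  - invalid_count(gender) = 0:
      valid values: [M, F]
  # a regular expression, written without /.../ delimiters
  - invalid_count(postal_code) = 0:
      valid regex: ^[0-9]{5}$
```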

    decimal comma

    Number uses , as decimal indicator.

    decimal point

    Number uses . as decimal indicator.

    email

    [email protected]

    integer

    Number is whole.

    ip address

    Four whole numbers separated by .

    ipv4 address

    Four whole numbers separated by .

    ipv6 address

    Eight values separated by :

    money

    A money pattern with currency symbol + decimal point or comma + currency abbreviation.

    money comma

    A money pattern with currency symbol + decimal comma + currency abbreviation.

    money point

    A money pattern with currency symbol + decimal point + currency abbreviation.

    negative decimal

    Negative number uses a , or . as a decimal indicator.

    negative decimal comma

    Negative number uses , as decimal indicator.

    negative decimal point

    Negative number uses . as decimal indicator.

    negative integer

    Number is negative and whole.

    negative percentage

    Negative number is a percentage.

    negative percentage comma

    Negative number is a percentage with a , decimal indicator.

    negative percentage point

    Negative number is a percentage with a . decimal indicator.

    percentage comma

    Number is a percentage with a , decimal indicator.

    percentage point

    Number is a percentage with a . decimal indicator.

    percentage

    Number is a percentage.

    phone number

    +12 123 123 1234, 123 123 1234, +1 123-123-1234, +12 123-123-1234, +12 123 123-1234, 555-2368, 555-ABCD

    positive decimal

    Positive number uses a , or . as a decimal indicator.

    positive decimal comma

    Positive number uses , as decimal indicator.

    positive decimal point

    Positive number uses . as decimal indicator.

    positive integer

    Number is positive and whole.

    positive percentage

    Positive number is a percentage.

    positive percentage comma

    Positive number is a percentage with a , decimal indicator.

    positive percentage point

    Positive number is a percentage with a . decimal indicator.

    time 12h

    Validates against the 12-hour clock. hh:mm:ss

    time 12h nosec

    Validates against the 12-hour clock. hh:mm

    time 24h

    Validates against the 24-hour clock. hh:mm:ss

    time 24h nosec

    Validates against the 24-hour clock. hh:mm

    timestamp 12h

    Validates against the 12-hour clock. hh:mm:ss

    timestamp 24h

    Validates against the 24-hour clock. hh:mm:ss

    uuid

    Universally unique identifier.

    negative integer

    Number is negative and whole.

    phone number

    +12 123 123 1234, 123 123 1234, +1 123-123-1234, +12 123-123-1234, +12 123 123-1234, 555-2368, 555-ABCD

    positive integer

    Number is positive and whole.

    uuid

    Universally unique identifier.

    a metric

    an argument

    a comparison symbol or phrase

    a threshold

    a configuration key

    a configuration value

    metric

    invalid_count

    argument

    house_owner_flag

    comparison symbol

    =

    threshold

    0

    configuration key

    valid values

    configuration value(s)

    0, 1

    metric

    invalid_count

    argument

    last_name

    comparison symbol or phrase

    =

    threshold

    0

    configuration key

    invalid regex

    configuration value(s)

    (?:XX)

    metric

    invalid_percent

    argument

    email_address

    comparison symbol or phrase

    =

    threshold

    0

    configuration key

    valid format

    configuration value(s)

    email
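Mapped back onto SodaCL, those parts line up as follows; the comments label each element of the house_owner_flag example from above.

```yaml
checks for dim_customer:
  # metric(argument) comparison-symbol threshold
  - invalid_count(house_owner_flag) = 0:
      # configuration key: configuration value(s)
      valid values: [0, 1]
```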

    ✓

    Define a name for a check with validity metrics; see example.

    Customize check names

    ✓

    Add an identity to a check.

    Add a check identity

    ✓

    Define alert configurations to specify warn and fail thresholds; see example.

    Add alert configurations

    ✓

    Apply an in-check filter to return results for a specific portion of the data in your dataset; see example.

    invalid_count

    invalid format invalid values valid format valid length valid max valid max length valid min valid min length valid values

    The number of rows in a column that contain values that are not valid.

    number text time

    invalid regex valid regex

    text

    invalid_percent

    invalid format invalid values valid format valid length valid max valid max length valid min valid min length valid values

    The percentage of rows in a column, relative to the total row count, that contain values that are not valid.

    invalid format

    Defines the format of a value that Soda ought to register as invalid. Only works with columns that contain data type TEXT.

    See List of valid formats.

    invalid regex

    Specifies a regular expression to define your own custom invalid values.

    regex, no forward slash delimiters

    invalid values

    Specifies the values that Soda ought to consider invalid.

    valid format

    Defines the format of a value that Soda ought to register as valid. Only works with columns that contain data type TEXT.

    credit card number

    Four four-digit numbers separated by spaces. Four four-digit numbers separated by dashes. Sixteen-digit number. Four five-digit numbers separated by spaces.

    date eu

    Validates date only, not time. dd/mm/yyyy

    date inverse

    Validates date only, not time. yyyy/mm/dd

    date iso 8601

    Validates date and/or time according to ISO 8601 format. 2021-04-28T09:00:00+02:00

    date us

    Validates date only, not time. mm/dd/yyyy

    decimal

    Number uses a , or . as a decimal indicator.

    date eu

    Validates date only, not time. dd/mm/yyyy

    date inverse

    Validates date only, not time. yyyy/mm/dd

    date us

    Validates date only, not time. mm/dd/yyyy

    decimal

    Number uses a , or . as a decimal indicator.

    integer

    Number is whole.

    ip address

    Four whole numbers separated by .

    configuration key:value pair
    Checks with fixed thresholds
    List of configuration keys
    List of valid formats
    Set a sample limit
    Disable failed row samples
    Customize sampling for checks
    Manage failed row samples
    separate list below
    warn and fail zones
    boundary thresholds
    tips and best practices for SodaCL

    Need help? Join the Soda community on Slack.

    number text time

    See .

    Choose a flavor of Soda

  • Set up Soda: install, deploy, or invoke

  • Write SodaCL checks

  • Run scans and review results 📍 You are here! a. Scan for data quality b. View scan results

  • Organize, alert, investigate

  • Scan for data quality

    Set a scan definition in a no-code check

    ✖️ Requires Soda Core Scientific ✖️ Requires Soda Core ✖️ Requires Soda Library + Soda Cloud ✔️ Requires Soda Agent + Soda Cloud


    When you create a no-code check in Soda Cloud, one of the required fields asks that you associate the check with an existing scan definition, or that you create a new scan definition.

    If you wish to change a no-code check's existing scan definition:

    1. As a user with permission to do so, navigate to the dataset in which the no-code check exists.

    2. From the dataset's page, locate the check you wish to adjust, and click the stacked dots at right, then select Edit Check. You can only edit a check via the no-code interface if it was first created as a no-code check, as indicated by the cloud icon in the Origin column of the table of checks.

    3. Adjust the value in the Add to Scan Definition field as needed, then save. Soda executes the check during the next scan according to the definition you selected.

    If you wish to schedule a new scan to execute a no-code check more or less frequently, or at a different time of day:

    1. From the dataset's page, locate the check you wish to adjust and click the stacked dots at right, then select Edit Check. You can only edit a check via the no-code interface if it was first created as a no-code check, as indicated by the cloud icon in the Origin column of the table of checks.

    2. Use the dropdown in the Add to Scan Definition field to access the create a new Scan Definition link.

    3. Fill out the form to define your new scan definition, then save it. Save the change to your no-code check. Soda executes the check during the next scan according to your new definition.

    Set a scan definition in an agreement

    ✖️ Requires Soda Core Scientific ✖️ Requires Soda Core ✖️ Requires Soda Library + Soda Cloud ✔️ Requires Soda Agent + Soda Cloud


    When you create a Soda Agreement in Soda Cloud, the last step in the flow demands that you select a scan definition. The scan definition indicates which Soda Agent to use to execute the scan, on which data source, and when. Effectively, a scan definition defines the what, when, and where to run a scheduled scan.

    If you wish to change an agreement's existing scan definition:

    1. Navigate to Agreements, then click the stacked dots next to the agreement you wish to change and select Edit Agreement.

    2. In the Set a Scan schedule tab, use the dropdown menu to select a different scan definition.

    3. Save your change. The agreement edit triggers a new approval request to all stakeholders. Your revised agreement does not run again until all stakeholders have approved it.

    If you wish to schedule a new scan to execute the checks in an agreement more or less frequently, or at a different time of day:

    1. Navigate to Agreements, then click the stacked dots next to the agreement you wish to change and select Edit Agreement.

    2. In the Set a Scan schedule tab, click the new Scan Definition link and populate the fields.

    3. Save your change. The agreement edit triggers a new approval request to all stakeholders. Your revised agreement does not run again until all stakeholders have approved it.

    Run a scan for a no-code check

    ✖️ Requires Soda Core Scientific ✖️ Requires Soda Core ✖️ Requires Soda Library + Soda Cloud ✔️ Requires Soda Agent + Soda Cloud


    If you wish to run a scan immediately to see the scan results for a no-code check, you can execute an ad hoc scan for a single check.

    1. As a user with the permission to do so, navigate to the dataset associated with the no-code check you wish to execute.

    Run a basic programmatic scan using Python

    ✖️ Requires Soda Core Scientific ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✖️ Requires Soda Agent + Soda Cloud


    Based on a set of conditions or a specific event schedule, you can programmatically invoke Soda Library to automatically scan a data source. For example, you may wish to scan your data at several points along your data pipeline, perhaps when new data enters a data source, after it is transformed, and before it is exported to another data source.

    Refer to for more information.

    Trigger a scan via API

    ✖️ Requires Soda Core Scientific ✖️ Requires Soda Core ✖️ Requires Soda Library + Soda Cloud ✔️ Requires Soda Agent + Soda Cloud


    You can programmatically initiate a scan that your team defined in Soda Cloud using the Soda Cloud API.

    If you have defined a scan definition in Soda Cloud, and the scan definition executes on a schedule via a self-hosted or Soda-hosted agent, and you have the permission to do so in your Soda Cloud account, you can use the API to:

    Troubleshoot

    Problem: When running a programmatic scan or a scan from the command-line, I get an error that reads Error while executing Soda Cloud command response code: 400.

    Solution: While there may be several reasons Soda returns a 400 error, you can address the following which may resolve the issue:

    • Upgrade to the latest version of Soda Library.

    • Confirm that all the checks in your checks YAML file identify a dataset against which to execute. For example, syntax that begins with a bare checks: yields a 400 error because it does not identify a dataset.
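For example, the first block below yields the 400 error because checks: identifies no dataset; the second, corrected block scopes the same check to a dataset.

```yaml
# Yields a 400 error: no dataset identified
checks:
  - row_count > 0

# Correct: the check is scoped to the dim_customer dataset
checks for dim_customer:
  - row_count > 0
```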

    View scan results

    Soda Cloud displays the latest status of all of your checks in the Checks dashboard. There are two methods through which a check and its latest result appear on the dashboard.

    • When you define checks in a checks YAML file and use Soda Library to run a scan, the checks and their latest results manifest in the Checks dashboard in Soda Cloud.

    • Any time Soda Cloud runs a scheduled scan of your data as part of an agreement, it displays the checks and their latest results in the Checks dashboard.

    As a result of a scan, each check results in one of three default states:

    • pass: the values in the dataset match or fall within the thresholds you specified

    • fail: the values in the dataset do not match or fall within the thresholds you specified

    • error: the syntax of the check is invalid

    A fourth state, warn, is something you can explicitly configure for individual checks. See Add alert configurations.

    The scan results appear in your Soda Library command-line interface (CLI) and the latest result appears in the Checks dashboard in the Soda Cloud web application; examples follow.

    Optionally, you can add the --local option to the scan command to prevent Soda Library from sending check results and any other metadata to Soda Cloud.

    Scan failed

    Check results indicate whether a check passed, warned, or failed during the scan. However, if a scan itself fails to complete successfully, Soda Cloud displays a warning in the Datasets dashboard to indicate the dataset for which a scheduled scan has failed.

    See Manage scheduled scans for instructions on how to set up scan failure alerts.

    Examine scan logs

    When you notice or receive a notification about a scan failure or delay, you can access the scan’s logs to investigate what is causing the issue.

    1. Log in to your Soda Cloud account, then navigate to Scans, and access the Agents tab.

    2. From the list of scan definitions, select the one that failed or timed out.

    3. On the scan definitions’s page, in the list of scan results, locate the one that failed or timed out, then click the stacked dots to its right and select Scan Logs.

    4. Review the scan log, using the filter to show only warning or errors if you wish, or downloading the log file for external analysis.

    Alternatively, you can access the scan logs from within an agreement.

    1. To examine a detailed scan log of the latest scan for an agreement, navigate to Agreements, then click to select an agreement.

    2. In the Agreement dashboard, click See results in the Last scan tile, then click the Scan Logs tabs.

    Examine a scan's SQL queries in the command-line output

    To examine the SQL queries that Soda Library prepares and executes as part of a scan, you can add the -V option to your soda scan command. This option prints the queries as part of the scan results.

    Programmatically use scan output

    Optionally, you can insert the output of Soda Library scans into your data orchestration tool, such as Dagster or Apache Airflow.

    You can save Soda Library scan results anywhere in your system; the scan_result object contains all the scan result information. To import Soda Library in Python so you can utilize the Scan() object, install a Soda Library package, then use from soda.scan import Scan. Refer to Define programmatic scans and Test data in an Airflow pipeline for details.
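A minimal programmatic scan might look like the following sketch; the data source name, file paths, and scan definition name are placeholders, and the snippet assumes a Soda Library package is installed.

```python
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("your_data_source")          # placeholder data source name
scan.add_configuration_yaml_file("configuration.yml")  # connection + soda_cloud config
scan.add_sodacl_yaml_file("checks.yml")                # your SodaCL checks
scan.set_scan_definition_name("nightly_pipeline")      # correlates results in Soda Cloud

exit_code = scan.execute()                             # 0 when all checks pass

# The scan_result object contains all the scan result information
scan_result = scan.get_scan_results()
print(scan.get_logs_text())
```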

    Next

    1. Choose a flavor of Soda

    2. Set up Soda: install, deploy, or invoke

    3. Write SodaCL checks

    4. Run scans and review results

    Need help? Join the Soda community on Slack.

    SodaCL reference
    Add check attributes
    ls
    cmdstan-2.26.1		prophet_model.bin
    install_name_tool -add_rpath @executable_path/cmdstan-your_cmdstan_version/stan/lib/stan_math/lib/tbb prophet_model.bin
    install_name_tool -add_rpath @executable_path/cmdstan-2.26.1/stan/lib/stan_math/lib/tbb prophet_model.bin
    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library:v1.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    pip install -i https://pypi.cloud.soda.io soda-postgres
    deactivate
    #bash
    pip install -i https://pypi.cloud.soda.io soda-postgres[pydanticv1]
    
    #zsh
    pip install -i https://pypi.cloud.soda.io  "soda-spark-df[pydanticv1]"
    WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library:v1.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    docker: Error response from daemon: Mounts denied: 
    The path /soda-library-test/files is not shared from the host and is not known to Docker.
    You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
    See https://docs.docker.com/desktop/mac for more info.
    Soda Library 1.0.x
    Configuration path 'configuration.yml' does not exist
    Path "checks.yml" does not exist
    Scan summary:
    No checks found, 0 checks evaluated.
    2 errors.
    Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
    ERRORS:
    Configuration path 'configuration.yml' does not exist
    Path "checks.yml" does not exist
    pip install -i https://pypi.cloud.soda.io soda-scientific
    cd path_to_your_python_virtual_env/lib/python_your_version/site-packages/prophet/stan_model/
    cd ~/venvs/soda-library-prophet11/lib/python3.9/site-packages/prophet/stan_model/
    docker pull sodadata/soda-library:v1.0.3
    docker run sodadata/soda-library:v1.0.3 --help
     Usage: soda [OPTIONS] COMMAND [ARGS]...
    
       Soda Library CLI version 1.0.x, Soda Core CLI version 3.0.xx
    
     Options:
       --version  Show the version and exit.
       --help     Show this message and exit.
    
     Commands:
       ingest           Ingests test results from a different tool
       scan             Runs a scan
       suggest          Generates suggestions for a dataset
       test-connection  Tests a connection
       update-dro       Updates contents of a distribution reference file
    WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
    docker: Error response from daemon: Mounts denied: 
    The path /soda-library-test/files is not shared from the host and is not known to Docker.
    You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
    See https://docs.docker.com/desktop/mac for more info.
    Soda Library 1.0.x
    Configuration path 'configuration.yml' does not exist
    Path "checks.yml" does not exist
    Scan summary:
    No checks found, 0 checks evaluated.
    2 errors.
    Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
    ERRORS:
    Configuration path 'configuration.yml' does not exist
    Path "checks.yml" does not exist
       soda_cloud:
         # Use cloud.soda.io for EU region
         # Use cloud.us.soda.io for US region
         host: https://cloud.soda.io
         api_key_id: 2e0ba0cb-your-api-key-7b
         api_key_secret: 5wd-your-api-key-secret-aGuRg
         scheme:
    soda test-connection -d my_datasource -c configuration.yml
    checks for dim_customer:
      - invalid_count(number_cars_owned) = 0:
          valid min: 1
    checks for dim_reseller:
    # a check with a fixed threshold
      - invalid_count(email_address) = 0:
          valid format: email
    # a check with a relative threshold
      - invalid_percent(english_education) < 3%:
          valid max length: 100
    checks for dim_customer:
      - invalid_count(house_owner_flag) = 0:
          valid values: [0, 1]
      - invalid_count(last_name) = 0:
          invalid regex: (?:XX)
    checks for dim_customer:
      - invalid_count(first_name) = 0:
          invalid values: [Antonio]
    checks for dim_customer:
      - invalid_percent(email_address) = 0:
          valid format: email
      | HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
    
    Error occurred while executing scan.
      | unsupported operand type(s) for *: 'Undefined' and 'int'
    
    checks for dim_customer:
      - invalid_percent(email_address) < 50:
          samples limit: 2
    checks for dim_customer:
      - invalid_percent(email_address) < 50:
          samples limit: 0
    checks for dim_employee:
      - invalid_count(gender) = 0:
          valid values: ["M", "Q"]
          samples columns: [employee_key, first_name]
    checks for dim_customer:
      - invalid_count(first_name) = 0 :
          valid min length: 2
          name: First name has 2 or more characters
      - invalid_count(house_owner_flag):
          valid values: [0, 1]
          warn: when between 1 and 5
          fail: when > 6  
    checks for dim_customer:
      - invalid_percent(marital_status) = 0:
          valid max length: 1
          filter: total_children = 0
    checks for dim_customer:
      - invalid_count("number_cars_owned") = 0:
          valid min: 1
    for each dataset T:
      datasets:
        - dim_customer
        - dim_customer_%
      checks:
        - invalid_count(email_address) = 0:
            valid format: email
    filter CUSTOMERS [daily]:
      where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    
    checks for CUSTOMERS [daily]:
      - invalid_count(email_address) = 0:
          valid format: email
     = 
     < 
     >
     <=
     >=
     !=
     <> 
     between 
     not between 
    checks:
        - schema:
            warn:
                when schema changes: any
    Soda Library 1.0.x
    Soda Core 3.0.x
    Sending failed row samples to Soda Cloud
    Scan summary:
    6/9 checks PASSED: 
        paxstats in paxstats2
          row_count > 0  [PASSED]
            check_value: 15007
          Look for PII  [PASSED]
          duplicate_percent(id) = 0  [PASSED]
            check_value: 0.0
            row_count: 15007
            duplicate_count: 0
          missing_count(adjusted_passenger_count) = 0  [PASSED]
            check_value: 0
          anomaly detection for row_count  [PASSED]
            check_value: 0.0
          Schema Check [PASSED]
    1/9 checks WARNED: 
        paxstats in paxstats2
          Abnormally large PAX count [WARNED]
            check_value: 659837
    2/9 checks FAILED: 
        paxstats in paxstats2
          Validate terminal ID [FAILED]
            check_value: 27
          Verify 2-digit IATA [FAILED]
            check_value: 3
    Oops! 2 failure. 1 warning. 0 errors. 6 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 4774***8
    soda scan -d postgres_retail -c configuration.yml -V checks.yml

    In the table of checks, locate the check you wish to execute and click the stacked dots, then select Execute Check. Alternatively, click the check and in the check's page, click Execute. You can only execute an individual check if it was first created as a no-code check, as indicated by the cloud icon in the Origin column of the table of checks.

  • Soda executes only your check.

  • You can also run an ad hoc scan to execute all checks associated with a scan definition.

    1. In Soda Cloud, navigate to Scans.

    2. In the list of scan definitions, click the one that is associated with the checks you wish to execute.

    3. In the scan definition page, click Run Scan to immediately execute all checks that use this scan definition.

    Run a scan in an agreement

    ✖️ Requires Soda Core Scientific ✖️ Requires Soda Core ✖️ Requires Soda Library + Soda Cloud ✔️ Requires Soda Agent + Soda Cloud


    If you wish to run a scan immediately to see the scan results for the checks you included in your agreement, you can run an ad hoc scan from the scan definition.

    1. As a user with the permission to do so in your Soda Cloud account, navigate to Scans.

    2. In the list of scan definitions, click the one that is associated with your agreement. If you don’t know which scan definition your agreement uses, navigate to Agreements, select your agreement, then find the name of the scan definition in the upper-left tile.

    3. In the scan definition page, click Run Scan to immediately execute all agreements and checks that use this scan definition.

    Run a scan from the command-line

    ✖️ Requires Soda Core Scientific ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✖️ Requires Soda Agent + Soda Cloud


    Each scan requires the following as input:

    • the name of the data source that contains the dataset you wish to scan, identified using the -d option

    • a configuration.yml file, which contains details about how Soda Library can connect to your data source, identified using the -c option

    • a checks.yml file which contains the checks you write using SodaCL

    Scan command:

    Note that you can use the -c option to include multiple configuration YAML files in one scan execution. Include the filepath of each YAML file if you stored them in a directory other than the one in which you installed Soda Library.

    You can also include multiple checks YAML files in one scan execution. Use multiple checks YAML files to execute different sets of checks during a single scan.
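For instance, with placeholder file names, you might pass two configuration files via repeated -c options and list two checks files to execute in the same scan.

```shell
soda scan -d your_data_source \
  -c configuration.yml -c soda_cloud.yml \
  checks_ingest.yml checks_transform.yml
```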

    Use the soda scan --help command to review options you can include to customize the scan. See also: Add scan options.

    Input scan-time variables

    ✖️ Requires Soda Core Scientific ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✖️ Requires Soda Agent + Soda Cloud


    There are several ways you can use variables in checks, filters, and in your data source configuration to pass values at scan time; a few examples follow.

    Refer to the comprehensive Filters and variables documentation for details.

    To provide a variable at scan time, as with dynamic dataset filters or with in-check values, add a -v option to the scan command and specify the key:value pair for the variable, as in the following example.
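A sketch with placeholder names, supplying a value for a hypothetical ts_end variable:

```shell
soda scan -d your_data_source -c configuration.yml -v ts_end=2025-01-01 checks.yml
```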

    If you wish, you can provide values for more than one variable at scan time, as in the following example.
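Supplying two variables, such as a ts_start and ts_end pair for a dataset filter, simply repeats the -v option; the names and paths here are placeholders.

```shell
soda scan -d your_data_source -c configuration.yml \
  -v ts_start=2025-01-01 -v ts_end=2025-01-02 checks.yml
```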

    Prevent pushing scan results to Soda Cloud

    If you wish, you can execute a scan using the Soda Library CLI and avoid sending any scan results to Soda Cloud. This is useful if, for example, you are testing checks locally and do not wish to muddy the measurements in your Soda Cloud account with test run metadata.

    To do so, add a --local option to your scan command in the CLI, as in the following example.
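A sketch with placeholder names; the --local option simply appends to the usual scan command.

```shell
soda scan -d your_data_source -c configuration.yml --local checks.yml
```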

    Configure the same scan to run in multiple environments

    When you want to run a scan that executes the same checks on different environments or schemas, such as development, production, and staging, you must apply the following configurations to ensure that Soda Cloud does not merge the check results from scans of multiple environments.

    1. In your configuration.yml file, provide separate connection configurations for each environment, as in the following example.
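Such a configuration might look like the following sketch, with hypothetical names and connection details; each environment gets its own uniquely named data source entry.

```yaml
data_source adventureworks_staging:
  type: postgres
  host: staging-db.example.com
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: adventureworks
  schema: staging

data_source adventureworks_prod:
  type: postgres
  host: prod-db.example.com
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: adventureworks
  schema: public
```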

    2. Provide a scan definition name at scan time using the -s option. The scan definition helps Soda Cloud to distinguish different scan contexts and therefore plays a crucial role when the checks.yml file names and the checks themselves are the same.

    See also: Troubleshoot missing check results

    See also: Add a check identity

    Add scan options

    When you run a scan in Soda Library, you can specify some options that modify the scan actions or output. Add one or more of the following options to a soda scan command.

    Option
    Required
    Description and examples

    -c TEXT or --configuration TEXT

    ✓

    Use this option to specify the file path and file name for the configuration YAML file.

    -d TEXT or --data-source TEXT

    ✓

    Use this option to specify the data source that contains the datasets you wish to scan.

    -l or --local

    Use this local option to prevent Soda Library from pushing check results or any other metadata to Soda Cloud.

    -s TEXT or --scan-definition TEXT

    Troubleshoot

    Problem: When you run a scan, you get an error that reads, Exception while exporting Span batch.

    Solution: Without an internet connection, Soda Library is unable to communicate with soda.connect.io to transmit anonymous usage statistics about the software. If you are using Soda Library offline, you can resolve the issue by setting send_anonymous_usage_stats: false in your configuration.yml file. Refer to Soda Library usage statistics for further details.

    Problem: Check results appear to be missing in Soda Cloud.

    Solution: Because Soda Library pushes scan results to Soda Cloud, you may not want to change the scan definition name with each scan. Soda Cloud uses the scan definition name to correlate subsequent scan results, thus retaining a historical record of the measurements over time. Sometimes, changing the name is useful, such as when you wish to Configure the same scan to run in multiple environments. Be aware, however, that if you change the scan definition name with each scan for the same environment, Soda Cloud recognizes each set of scan results as independent from previous scan results, making it appear as though it records a new, separate check result with each scan and archives or "disappears" previous results. See also: Missing check results in Soda Cloud

    Problem: In a Windows environment, you see an error that reads [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (ssl_c:997).

    Solution: Use pip install pip-system-certs to potentially resolve the issue. This install resolves the issue only on Windows machines where the Ops team installs all required certificates through Group Policy Objects, or similar.

    You can save Soda Library scan results anywhere in your system; the scan_result object contains all the scan result information. To import Soda Library in Python so you can utilize the Scan() object, install a Soda Library package, then use from soda.scan import Scan.

  • If you provide a name for the scan definition to identify inline checks in a programmatic scan as independent of other inline checks in a different programmatic scan or pipeline, be sure to set a unique scan definition name for each programmatic scan. Using the same scan definition name in multiple programmatic scans causes Soda Cloud to conflate their check results.

  • If you wish to collect samples of failed rows when a check fails, you can employ a custom sampler; see Configure a failed row sampler.

  • Be sure to include any variables in your programmatic scan before the check YAML files. Soda requires the variable input for any variables defined in the check YAML files.

  • retrieve information about checks and datasets in your Soda Cloud account

  • execute scans

  • retrieve information about the state of a scan during execution

  • access the scan logs of an executed scan

  • Access the Soda Cloud API documentation to get details about how to programmatically get info and execute Soda Cloud scans.

    Run a Soda Cloud scan from the command-line

    ✖️ Requires Soda Core Scientific ✖️ Requires Soda Core ✔️ Requires Soda Library + Soda Cloud ✔️ Requires Soda Agent + Soda Cloud


    You can initiate a scan that your team defined in Soda Cloud using the Soda Library CLI.

    If you have defined a scan definition in Soda Cloud, and the scan definition executes on a schedule via a self-hosted or Soda-hosted agent, and you have the permission to do so in your Soda Cloud account, you can use Soda Library CLI to:

    • execute a remote scan and synchronously receive logs of the scan execution result

    • execute a remote scan and asynchronously retrieve status and logs of the scan during, and after its execution

    To execute a remote scan and synchronously receive scan results:

    1. In Soda Cloud, navigate to Scans, then, from the list of scans, click to open the one which you wish to execute remotely.

    2. To retrieve the scan definition ID that you need for the remote scan command, copy the scan definition identifier; see image below.

    3. Run the following command to execute the Soda Cloud scan remotely, where the value of the -s option is the scan definition identifier you copied from the URL.

    4. The Soda Agent that executes your scan definition proceeds to run the scan and returns the result of the scan in the CLI output. A truncated example follows. Notice that the version of Soda Library that you use to execute the remote scan command may be different from the version of Soda Library that is deployed as an Agent in your environment and which performs the actual scan execution. This does not present any issues for remote scan execution.

    5. In your Soda Cloud account, refresh the scan definition page to display the results of the scan you ran remotely.

    To execute a remote scan and asynchronously retrieve the status and results of the scan:

    1. In Soda Cloud, navigate to Scans, then, from the list of scans, click to open the one which you wish to execute remotely.

    2. To retrieve the scan definition ID that you need for the remote scan command, copy the scan definition identifier; see image below.

    3. Run the following command to execute the Soda Cloud scan remotely, where the value of the -s option is the scan definition identifier you copied from the URL.

    4. The Soda Agent that executes your scan definition proceeds to run the scan. The agent does not automatically return scan status or logs to the CLI output. Instead, it returns a unique value for Status URL. Copy the last part of the URL that identifies the scan you started.

    5. To retrieve the status of the scan as it executes and completes, use the following command, pasting the value you copied from the Status URL as the scan identifier. Refer to the Soda Cloud API documentation for the possible status messages the Soda Agent can return. Notice that the version of Soda Library that you use to execute the remote scan command may be different from the version of Soda Library that is deployed as an Agent in your environment and which performs the actual scan execution. This does not present any issues for remote scan execution.

    Truncated output:

    6. In your Soda Cloud account, refresh the scan definition page to display the results of the scan you ran remotely.

    Soda Library Python API reference
    scan definition
    permission
    Organize, alert, investigate
    Soda community on Slack
    Add an in-check filter to a check
    example
    Use quotes in a check
    example
    Apply checks to multiple datasets
    example
    Scan a portion of your dataset
    Customize sampling for checks
    Set a sample limit
    Customize sampling for checks
    List of valid formats
    soda scan -d postgres_retail -c configuration.yml checks.yml
    soda scan -d postgres_retail -c other-directory/configuration.yml other-directory/checks.yml
    soda scan -d postgres_retail -c configuration.yml checks_stats1.yml checks_stats2.yml
    # Dataset filter with variables
    filter CUSTOMERS [daily]:
      where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    
    checks for CUSTOMERS [daily]:
      - row_count = 6
      - missing(cat) = 2
    
    # In-check variable
    checks for ${DATASET}:
      - invalid_count(last_name) = 0:
          valid length: 10
    soda scan -d aws_postgres_retail -c configuration.yml -v TODAY=2022-03-31 checks.yml
    soda scan -d aws_postgres_retail duplicate_count_filter.yml -c configuration.yml -v date=2022-07-25 -v name='rowcount check'
    soda scan -d aws_postgres_retail -c configuration.yml checks.yml --local
    data_source nyc_dev:
      type: postgres
      host: host
      port: '5432'
      username: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}
      database: postgres
      schema: staging
    data_source nyc_prod:
      type: postgres
      host: host
      port: '5432'
      username: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}
      database: postgres
      schema: public
    # for NYC data source for dev
    soda scan -d nyc_dev -c configuration.yml -s nyc_a checks.yml
    # for NYC data source for prod
    soda scan -d nyc_prod -c configuration.yml -s nyc_b checks.yml
    from soda.scan import Scan
    
    scan = Scan()
    scan.set_data_source_name("events")
    
    # Add configuration YAML files
    #########################
    # Choose one of the following to specify data source connection configurations :
    # 1) From a file
    scan.add_configuration_yaml_file(file_path="~/.soda/my_local_soda_environment.yml")
    # 2) Inline in the code
    scan.add_configuration_yaml_str(
        """
        data_source events:
          type: snowflake
          host: ${SNOWFLAKE_HOST}
          username: ${SNOWFLAKE_USERNAME}
          password: ${SNOWFLAKE_PASSWORD}
          database: events
          schema: public
    """
    )
    
    # Add variables
    ###############
    scan.add_variables({"date": "2022-01-01"})
    
    
    # Add check YAML files
    ##################
    scan.add_sodacl_yaml_file("./my_programmatic_test_scan/sodacl_file_one.yml")
    scan.add_sodacl_yaml_file("./my_programmatic_test_scan/sodacl_file_two.yml")
    scan.add_sodacl_yaml_files("./my_scan_dir")
    scan.add_sodacl_yaml_files("./my_scan_dir/sodacl_file_three.yml")
    
    # OR
    
    # Define checks using SodaCL
    ##################
    checks = """
    checks for cities:
        - row_count > 0
    """
    scan.add_sodacl_yaml_str(checks)
    
    # Add template YAML files, if used
    ##################
    scan.add_template_files(template_path)
    
    # Execute the scan
    ##################
    scan.execute()
    
    # Set logs to verbose mode, equivalent to CLI -V option
    ##################
    scan.set_verbose(True)
    
    # Set scan definition name, equivalent to CLI -s option
    # The scan definition name MUST be unique to this scan, and
    # not duplicated in any other programmatic scan
    ##################
    scan.set_scan_definition_name("YOUR_SCHEDULE_NAME")
    
    # Do not send results to Soda Cloud, equivalent to CLI -l option;
    ##################
    scan.set_is_local(True)
    
    # Inspect the scan result
    #########################
    scan.get_scan_results()
    
    # Inspect the scan logs
    #######################
    scan.get_logs_text()
    
    # Typical log inspection
    ##################
    scan.assert_no_error_logs()
    scan.assert_no_checks_fail()
    
    # Advanced methods to inspect scan execution logs
    #################################################
    scan.has_error_logs()
    scan.get_error_logs_text()
    
    # Advanced methods to review check results details
    ########################################
    scan.get_checks_fail()
    scan.has_check_fails()
    scan.get_checks_fail_text()
    scan.assert_no_checks_warn_or_fail()
    scan.get_checks_warn_or_fail()
    scan.has_checks_warn_or_fail()
    scan.get_checks_warn_or_fail_text()
    scan.get_all_checks_text()
    soda scan -c configuration.yml --remote -s paxstats_default_scan
    Soda Library 1.3.x
    Soda Core 3.0.x
    By downloading and using Soda Library, you agree to Soda's Terms & Conditions (https://go.soda.io/t&c) and Privacy Policy (https://go.soda.io/privacy). 
    Remote scan sync mode
    Remote Scan started.
    Status URL: https://dev.sodadata.io/api/v1/scans/14b38f00-bc69-47dc-801b-676e676e676
    Waiting for remote scan to complete.
    Remote scan completed.
    Fetching scan logs.
    Scan logs fetched.
    Soda Library 1.2.4
    Soda Core 3.0.47
    Reading configuration file "datasources/soda_cloud_configuration.yml"
    Reading configuration file "datasources/configuration_paxstats.yml"
    ...
    Scan summary:
    48/48 queries OK
      paxstats.discover-tables-find-tables-and-row-counts [OK] 0:00:00.156126
      ...
    2/2 checks PASSED: 
        paxstats in paxstats
          anomaly score for row_count < default [scan_definitions/paxstats_default_scan/automated_monitoring_paxstats.yml] [PASSED]
            check_value: None
          Schema Check [scan_definitions/paxstats_default_scan/automated_monitoring_paxstats.yml] [PASSED]
            schema_measured = [id integer, index integer, activity_period character varying, operating_airline character varying, ...]
    All is good. No failures. No warnings. No errors.
    Sending results to Soda Cloud
    Soda Cloud Trace: 3015***
    soda scan -c configuration.yml --remote -s paxstats_default_scan -rm async
    [10:38:36] Soda Library 1.3.3
    [10:38:36] Soda Core 3.0.47
    [10:38:36] By downloading and using Soda Library, you agree to Soda's Terms & Conditions (https://go.soda.io/t&c) and Privacy Policy (https://go.soda.io/privacy). 
    [10:38:38] Remote scan async mode
    [10:38:39] Remote Scan started.
    [10:38:39] Status URL: https://cloud.soda.io/api/v1/scans/4651ba64-04ae-4b21-9fad-552314552314
    [10:38:39] Remote scan started in async mode.
    soda scan-status -c configuration.yml -s 4651ba64-04ae-4b21-9fad-552314552314
    Soda Library 1.3.3
    Soda Core 3.0.47
    Retrieving state of the scan '4651ba64-04ae-4b21-9fad-552314552314'.
    Current state of the scan: 'completed'.
    Fetching scan logs.
    Parsing scan logs.
    Soda Library 1.2.4
    Soda Core 3.0.47
    Reading configuration file "datasources/soda_cloud_configuration.yml"
    Reading configuration file "datasources/configuration_paxstats.yml"
    ...
    Scan summary:
    48/48 queries OK
      paxstats.discover-tables-find-tables-and-row-counts [OK] 0:00:00.156002
      ...
    2/2 checks PASSED: 
        paxstats in paxstats
          anomaly score for row_count < default [scan_definitions/paxstats_default_scan/automated_monitoring_paxstats.yml] [PASSED]
            check_value: None
          Schema Check [scan_definitions/paxstats_default_scan/automated_monitoring_paxstats.yml] [PASSED]
            schema_measured = [id integer, index integer, activity_period character varying, ...]
    All is good. No failures. No warnings. No errors.
    Sending results to Soda Cloud
    Soda Cloud Trace: 6974126***

    Use this option to provide a scan definition name so that Soda Cloud keeps check results from different environments (dev, prod, staging) separate. See Configure the same scan to run in multiple environments.

    -srf or --scan-results-file TEXT

    Specify the file name and file path to which Soda Library sends a JSON file of the scan results. You can use this in addition to, or instead of, sending results to Soda Cloud. soda scan -d adventureworks -c configuration.yml -srf test.json checks.yml

    -t TEXT or --data-timestamp TEXT

    Specify the logical time associated with the data being validated. It should be provided in ISO 8601 format with UTC timezone (e.g., 2025-08-12T14:30:00Z). By default, Soda uses the current execution time as the data timestamp.

    -T TEXT or --template TEXT

    Use this option to specify the file path and file name for a templates YAML file.

    -v TEXT or --variable TEXT

    Replace TEXT with variables you wish to apply to the scan, such as a filter for a date. Put single or double quotes around any value with spaces. soda scan -d my_datasource -v start=2020-04-12 -c configuration.yml checks.yml

    -V or --verbose

    Return scan output in verbose mode to review query details.

    Reconciliation checks

    Use SodaCL reconciliation checks to validate target and source data before conducting a data migration in production.

    This feature is not supported in Soda Core OSS. Migrate to Soda Library in minutes to start using this feature for free with a 45-day trial.

    Use a reconciliation check to validate that target data matches source data before and/or after migrating between data sources.

    For example, if you must migrate data from a MySQL data source to a Snowflake data source, you can use reconciliation checks to make sure the MySQL data appears intact in Snowflake in staging before conducting the migration in production.

    ✖️ Requires Soda Core Scientific (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    Prerequisites

    • Python version 3.9.x or greater.

    • A Soda Cloud account connected to Soda Library via API keys. See .

    • Soda Library; one Soda Library package for each of the source and target data sources involved in your migration. See , below.

    Types of reconciliation checks

    Soda supports four types of reconciliation checks:

    • metric reconciliation checks

    • record reconciliation checks

    • schema reconciliation checks

    • reference reconciliation checks

    A metric reconciliation check calculates the measurement of a metric such as sum or avg on data in the same dataset in two different data sources; where the delta between the calculated measurements exceeds the threshold you set in the check, the check fails. Note that you can also compare data between datasets within the same data source.

    In other words, the check validates the delta between calculated measurements of a metric in multiple datasets.

    In the following example, the metric reconciliation check calculates the sum of column 1 in dataset X in both data source A and data source B. The calculated value of each is the measurement for the sum metric. It then compares the calculated measurements and gauges the difference between them. In this example, the difference between measurements is 4, so the check passes.
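    The comparison described above amounts to simple arithmetic on the two measurements. The following is a minimal, self-contained sketch of that logic only, not Soda's implementation; the function name and sample values are illustrative.

    ```python
    def metric_diff(source_rows, target_rows, threshold):
        """Conceptual illustration of a metric reconciliation check:
        compute the sum of a column on each side, take the absolute
        delta, and pass when the delta stays within the threshold."""
        source_measurement = sum(source_rows)
        target_measurement = sum(target_rows)
        delta = abs(target_measurement - source_measurement)
        return delta, delta <= threshold

    # Source and target copies of "column 1" in dataset X:
    # the delta is 4, within the threshold, so the check passes.
    delta, passed = metric_diff([10, 20, 30], [10, 20, 34], threshold=5)
    print(delta, passed)
    ```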

    Read more about in general.

    A record reconciliation check performs a row-to-row comparison of the contents of each column, or specific columns, in datasets in two different data sources; where the values do not match exactly, the check fails. The numeric value the check result produces represents the number of rows with different, additional, or missing contents.

    For example, the following check compares the entire contents of dataset Y in data source A and dataset Y in data source B. Though the contents of the rows match exactly, one dataset contains additional rows, so it is not an exact match and the reconciliation check fails with a numeric value of 2.
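    The counting logic described above can be sketched as a multiset difference; this is a conceptual illustration only, not how Soda executes the comparison.

    ```python
    from collections import Counter

    def record_diff_count(source_rows, target_rows):
        """Conceptual sketch of a record reconciliation result: the
        check value is the number of rows that are different, missing,
        or additional between the two datasets."""
        source, target = Counter(source_rows), Counter(target_rows)
        missing_in_target = source - target
        additional_in_target = target - source
        return sum(missing_in_target.values()) + sum(additional_in_target.values())

    # The rows that exist on both sides match exactly, but the target
    # holds two additional rows, so the check fails with a value of 2.
    source = [("a", 1), ("b", 2)]
    target = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
    print(record_diff_count(source, target))
    ```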

    Read more about the strategies and optional configurations you can add to a .

    A schema reconciliation check compares the columns of two datasets to reveal any differences between target and source; where the column names differ, or the data type has changed, Soda registers a mismatch and the check fails.
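    Conceptually, this comparison reduces to diffing two column-name-to-data-type maps. A hedged sketch with hypothetical column names, illustrating the idea rather than Soda's mechanism:

    ```python
    def schema_mismatches(source_schema, target_schema):
        """Conceptual sketch of a schema reconciliation check: flag
        columns whose names or data types differ between source and
        target, and return the list of mismatches."""
        mismatches = []
        for column, dtype in source_schema.items():
            if column not in target_schema:
                mismatches.append(f"{column}: missing in target")
            elif target_schema[column] != dtype:
                mismatches.append(f"{column}: {dtype} != {target_schema[column]}")
        for column in target_schema.keys() - source_schema.keys():
            mismatches.append(f"{column}: missing in source")
        return mismatches

    # "id" changed type and "extra" exists only in the target,
    # so both register as mismatches and the check would fail.
    source = {"id": "integer", "last_name": "varchar"}
    target = {"id": "bigint", "last_name": "varchar", "extra": "boolean"}
    print(schema_mismatches(source, target))
    ```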

    A reference reconciliation check verifies that all target values exist in the source. It performs the same comparison as a standard reference check but uses a different mechanism, allowing you to validate referential integrity across different data sources.
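    The underlying validation amounts to a set-membership test of target values against source values. A conceptual sketch (not Soda's mechanism), using a country-code reference list as in the use case below:

    ```python
    def reference_violations(source_values, target_values):
        """Conceptual sketch of a reference reconciliation check:
        every value in the target column must exist in the source
        column; return the target values that have no match."""
        source = set(source_values)
        return [value for value in target_values if value not in source]

    # "XX" does not appear in the reference list, so the check
    # flags it as a referential-integrity violation.
    reference = ["BE", "NL", "FR"]
    target = ["BE", "FR", "XX"]
    print(reference_violations(reference, target))
    ```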

    Best practice for using reconciliation checks

    To efficiently use resources at scan time, best practice dictates that you first configure and run metric reconciliation checks, then use the output to write refined record reconciliation checks to fine-tune the comparison.

    Depending on the volume of data on which you must perform reconciliation checks, metric recon checks run considerably faster and use far fewer resources. Start by defining metric reconciliation checks that test grouping, filters, and joins to get meaningful insight into whether your ingestion or transformation works as expected. Where these checks do not surface all the details you need, or do not provide enough confidence in the output, proceed with record reconciliation checks.

    For record reconciliation checks, if primary keys exist in your dataset, best practice recommends that you use the simple strategy for executing a record-by-record comparison. This strategy loads rows into memory in batches, thereby reducing the risk of system overload and increasing the speed with which Soda can execute the comparison. See for details about strategies.

    Read more about .

    Define reconciliation checks

    The following outlines the basic steps to configure and execute reconciliation checks.

    1. Install a Soda Library package for both the migration source and target data sources. For the very first example above, you would install both soda-mysql and soda-snowflake. If you use a Soda Agent and connect data sources via Soda Cloud, add both data sources to your account.

    2. Configure both data sources in a configuration YAML file, and add your soda_cloud configuration. For the very first example above, you would add both MySQL and Snowflake connection configuration details to a configuration YAML file.

    To define reconciliation checks, best practice dictates that you prepare a dedicated agreement or recon.yml file, separate from the checks YAML file that contains your regular, non-reconciliation data quality checks. Technically, you can use one YAML file or agreement to contain all recon and regular SodaCL checks, but troubleshooting and issue investigation is easier if you use separate files.

    In a recon.yml file, you must first provide reconciliation metadata for the checks, as per the configuration in the example and table below.

    Metric reconciliation checks

    The syntax of metric reconciliation checks follows the basic patterns of standard SodaCL metrics and checks with the addition of diff in the syntax. Metric reconciliation checks do not support all SodaCL metrics and checks; see below.

    For example, you define a regular SodaCL check for data quality that checks for duplicate values in a last_name column as follows:

    For a metric reconciliation check, you add the word diff to indicate that it ought to compare the count of duplicate values between the source dataset and the target dataset to confirm that the delta between those counts is zero. Refer to examples below.

    Note that with reconciliation checks, there is no need to identify the dataset as you specified both source and target datasets in the project metadata configuration.

    When you run a scan against either the source or target data source, the Scan summary in the output indicates the check value, which is the calculated delta between measurements, the measurement value of each metric or check for both the source and target datasets, along with the diff value and percentage, and the absolute value and percentage.

    To customize your metric reconciliation checks, you can borrow from the syntax of to execute SQL queries on the source and target datasets. You can also write a to define a SQL query or a common table expression (CTE) that Soda executes on both datasets to reconcile data; see examples below.

    Learn about reconciliation check .

    Record reconciliation checks

    Requires Soda Library 1.2.0 or greater

    The syntax of record reconciliation checks expects a rows diff input to perform a record-by-record comparison of data between datasets. Choose between two strategies to refine how this type of check executes during a Soda scan:

    • simple

    • deepdiff

    The simple strategy works by processing record comparisons according to one or more primary key identifiers in batches and pages. This type of processing serves to temper large-scale comparisons by loading rows into memory in batches so that a system is not overloaded; it is typically faster than the deepdiff strategy.

    • If you do not specify a strategy, Soda executes the record reconciliation check using the simple strategy.

    • If you do not specify batch size and/or page size, Soda applies default values of 1 and 100000, respectively.
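    The paging idea behind the simple strategy can be illustrated with a small sketch: rows are ordered by primary key and compared one page at a time, so only one page per side sits in memory at once. This helper is purely illustrative; only the default page size of 100000 comes from the documentation above.

    ```python
    def paged_keys(primary_keys, page_size):
        """Conceptual sketch of the simple strategy's paging: sort the
        primary keys and yield them one page at a time, so that at most
        page_size rows per side are compared in memory at once."""
        keys = sorted(primary_keys)
        for start in range(0, len(keys), page_size):
            yield keys[start:start + page_size]

    # A tiny example with page_size=2; with the default page size of
    # 100000, a 250K-row dataset would be compared in three pages.
    print(list(paged_keys([4, 1, 3, 2, 5], page_size=2)))
    ```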

    The deepdiff strategy works by processing record comparisons of entire datasets by loading all rows into memory at once. This type of processing is more memory-heavy but allows you to work without primary key identifiers, or without specifying any other details about the data to be compared; it is typically slower than the simple strategy.

    Record reconciliation strategy comparison

    Simple strategy
    Deepdiff strategy

    Beyond choosing a strategy, you can configure a number of granular details for Soda to refine its execution of a record reconciliation check.

    Configuration
    Compares
    Description and example

    To customize your record reconciliation checks, you can borrow from the syntax of to execute SQL queries on the source and target datasets. You can also write a to define a SQL query or a common table expression (CTE) that Soda executes record-by-record on both datasets to reconcile data; see example below.

    Custom value comparator

    If you use programmatic Soda scans to execute reconciliation checks, you may wish to use a custom value comparator, an example of which follows.

    Schema reconciliation checks

    The syntax of schema reconciliation checks is simple, and without configuration details beyond the check identifier.

    Optionally, you can add a mapping configuration to a schema check to properly compare columns that use different data types. For example, you can use this configuration to map the comparison of a Snowflake column that uses Boolean and an MSSQL Server column that uses bit.

    Reference reconciliation checks

    Requires Soda Library 1.11.2 or greater

    A reference reconciliation check assesses whether all target values are present in the source. It validates referential integrity across data sources by checking that each value in the target column(s) has a corresponding match in the source column(s).

    The check is performed on the target dataset, which is treated as the dataset under test. If the target contains values that do not exist in the source, Soda flags those discrepancies. To configure this check, you must specify the column(s) in the source and target to compare.

    This check supports two primary use cases:

    • Downstream-upstream consistency: Verify that records in a downstream dataset (target dataset) also exist in an upstream dataset (source dataset).

    • Reference table validation: Validate that values in a dataset (target dataset) exist in a reference or lookup table (source dataset), such as ensuring all country codes in your data are part of a standardized country code list.

    Add attributes

    Add attributes to reconciliation checks to organize your checks and alert notifications in Soda Cloud. For example, you can apply attributes to checks to label and sort check results by department, priority, location, etc.

    You can add custom attributes to reconciliation checks in two ways:

    • in bulk, so that Soda adds the attribute to all checks in the reconciliation project

    • individually, so that Soda adds the attribute to individual reconciliation checks in the project

    After following the instructions to in Soda Cloud, you can add the attribute to a reconciliation project, and/or to individual checks, as in the following example.

    Where attribute values for the project and the individual check conflict or overlap, Soda uses the value for the individual check.
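    This precedence rule behaves like a dictionary merge in which check-level keys win over project-level keys. A sketch with hypothetical attribute names; only the priority attribute appears in the example configuration in this section.

    ```python
    def effective_attributes(project_attrs, check_attrs):
        """Conceptual sketch of attribute precedence: start from the
        project-level attributes, then let check-level values override
        any conflicting keys."""
        return {**project_attrs, **check_attrs}

    project = {"priority": 3, "department": "finance"}  # applied in bulk
    check = {"priority": 1}                             # set on one check
    print(effective_attributes(project, check))
    ```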

    Add a filter

    You can add a filter to a reconciliation project's configuration to constrain the data on which Soda executes the reconciliation checks. Refer to the example below.

    Best practice dictates that you add filters when using record reconciliation checks to mitigate heavy memory usage and long scan times when performing record-to-record comparisons of data. See .

    Output:

    Failed row samples

    Record reconciliation checks, and metric reconciliation checks that borrow from failed rows check syntax (such as the name_combo check in the example above), explicitly collect samples of any failed rows to display in Soda Cloud. The default number of failed row samples that Soda collects and displays is 100.

    Read more .

    If you wish to limit or broaden the sample size, you can add the samples limit configuration to a check. Read more about .

    Alternatively, you can set the samples limit to 0 to prevent Soda from collecting and sending failed rows samples for an individual check, as in the following example.

    To review the failed rows in Soda Cloud, navigate to the Checks dashboard, then click the row for the grouped reference checks. Examine failed rows in the Failed Rows Analysis tab; see for further details.

    List of compatible metrics and checks for metric reconciliation checks

    Metric or check
    Supported data sources

    Optional check configurations

    Supported
    Configuration
    Documentation

    Example with name

    Example with identity

    Example with alerts

    Example with quotes

    Limitations and constraints

    • The Python environment in which deepdiff record reconciliation checks run consumes more time/CPU/memory because this type of check loads all data into memory to execute a comparison. Because record-to-record comparison is dense, exercise caution when executing scans with record reconciliation checks as they can cause usage spikes in the data source, and cost spikes in case of cloud-managed data sources. Best practice dictates that you and use record reconciliation checks whenever possible to mitigate cost and performance issues. See also: .

    • Reconciliation checks on TEXT type columns are case sensitive.

    • Record reconciliation checks do not support samples columns configuration.

    Go further

    • Learn more about in general.

    • Learn more about .

    • Use a to discover missing or forbidden columns in a dataset.

    • Reference .

    reconciliation Production:
      label: "Reconcile MySQL to Snowflake"
      attributes:
         priority: 3
      datasets:
        source:
          dataset: dim_customer
          datasource: mysql_adventureworks
        target:
          dataset: dim_customer
          datasource: snowflake_retail
    
      checks:
      # Metric reconciliation checks
        - row_count diff = 0
        - duplicate_count(last_name):
            fail: when diff > 10%
            warn: when diff < 5%
        - avg(total_children) diff < 10
        - name_combo diff = 0:
            name: Name Combo
            source query: |
              SELECT count(*)
              FROM dim_customer
              WHERE first_name = 'Rob' or last_name = 'Walters'
            target query: |
              SELECT count(*)
              FROM dim_customer
              WHERE last_name = 'Walters'
    
      # Record reconciliation checks
        - rows diff < 5:
            key columns: [customer_key]
        - rows diff = 0:
            strategy: deepdiff
            source columns: [customer_key, region_id]
            target columns: [customer_base_key, region]
    
      # Schema reconciliation check
        - schema
  • Prepare a recon.yml file or a new Soda agreement and configure the reconciliation metadata; see details below.
  • Define reconciliation checks to compare data between data sources; see details below.

  • Run a Soda scan against either the source or target data source to execute the reconciliation checks and review results in the command-line output and in Soda Cloud. Note that check results are associated with the target dataset in Soda Cloud.

    target

    required

    Key-value pairs to identify the dataset and data source of the target, or destination location of the data to be migrated.

    checks

    required

    A subheading to contain the checks that reconcile the data between source and target. In this section, you can define any number of both metric and record reconciliation checks; see details below.

    If you want to use the simple strategy for comparing datasets with different numbers of columns, you must define the key columns that order the data and match rows between the two datasets. Additionally, you must map the source columns to the target columns that you wish to compare.

    Specify column-constrained comparisons
    • simple: optional
    • deepdiff: optional

    Best for
    • simple: standard comparisons in which a primary key exists in the data
    • deepdiff: comparisons in which no primary key exists in the data

    Benchmark: 10 columns, 1% changes in target, 500K rows
    • simple: <80MB RAM, 9s to execute diff
    • deepdiff: 8GB RAM, 136s to execute diff

    Benchmark: 360 columns, 1% changes in target, 100K rows
    • simple: <80MB RAM, 1m to execute diff
    • deepdiff: 8GB RAM, ~6m to execute diff

    Benchmark: 360 columns, 1% changes in target, 1M rows
    • simple: <80MB RAM, 35m to execute diff
    • deepdiff: does not compute on a 16GB RAM machine
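The trade-off behind these benchmarks can be illustrated with a toy comparison: the simple strategy matches rows on a key and can fetch rows batch by batch, while deepdiff holds both datasets fully in memory and compares whole rows. The following is a conceptual stdlib sketch only, not Soda's implementation:

```python
# Illustration only: Soda's actual strategies differ in implementation.
source = [(1, "Mercury", "small"), (2, "Venus", "small"), (3, "Jupiter", "large")]
target = [(1, "Mercury", "small"), (2, "Venus", "small"), (3, "Jupiter", "huge")]

def simple_diff(src, tgt, key=0):
    """Match rows on a key column; rows could be fetched batch by batch."""
    tgt_by_key = {row[key]: row for row in tgt}
    return sum(1 for row in src if tgt_by_key.get(row[key]) != row)

def deepdiff_changed(src, tgt):
    """Load both datasets fully into memory and compare whole rows."""
    return set(src) ^ set(tgt)  # contains both versions of each changed row

print(simple_diff(source, target))             # 1 record differs (Jupiter)
print(len(deepdiff_changed(source, target)))   # 2: old and new Jupiter rows
```

The keyed approach only needs one side's keys resident at a time in a real streaming implementation, which is why its memory footprint stays small.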

    invalid_count: Athena, BigQuery, DB2, SQL Server, PostgreSQL, Redshift, Snowflake, Spark DataFrames

    invalid_percent: Athena, BigQuery, DB2, SQL Server, PostgreSQL, Redshift, Snowflake, Spark DataFrames

    max: all

    max_length: all

    min: all

    min_length: all

    missing_count: Athena, BigQuery, DB2, SQL Server, PostgreSQL, Redshift, Snowflake, Spark DataFrames

    missing_percent: Athena, BigQuery, DB2, SQL Server, PostgreSQL, Redshift, Snowflake, Spark DataFrames

    percentile: PostgreSQL, Snowflake

    row_count: all

    stddev: Athena, BigQuery, PostgreSQL, Redshift, Snowflake

    stddev_pop: Athena, BigQuery, PostgreSQL, Redshift, Snowflake

    stddev_samp: Athena, BigQuery, PostgreSQL, Redshift, Snowflake

    sum: all

    user-defined: all

    variance: Athena, BigQuery, PostgreSQL, Redshift, Snowflake

    var_pop: Athena, BigQuery, PostgreSQL, Redshift, Snowflake

    var_samp: Athena, BigQuery, PostgreSQL, Redshift, Snowflake

    ✓ Use quotes when identifying dataset or column names. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

    - Use wildcard characters ( % or * ) in values in the check.

    - Use for each to apply reconciliation checks to multiple datasets in one scan.

    - Apply a dataset filter to partition data during a scan.

    ✓ Supports samples columns parameter to specify columns from which Soda draws failed row samples.

    ✓ Supports samples limit parameter to control the volume of failed row samples Soda collects.

    ✓ Supports collect failed rows parameter to instruct Soda to collect, or not to collect, failed row samples for a check.

  • Reconciliation checks do not support exclude columns in the data source configuration in a configuration YAML file; see Disable failed rows sampling for specific columns.

  • Known issue: Do not define a threshold as a percentage ( % ) if you expect the measurement of a metric to equal 0. Using a percentage for a threshold causes an error for an absolute check; the check evaluates correctly but the error persists with a non-zero exit code.

    reconciliation my_project_name

    required

    An identifier for the reconciliation project.

    label

    required

    An identifier that prepends check result name identifiers in Soda Cloud.

    attributes

    optional

    A list of attributes that Soda applies to the reconciliation project’s check results in Soda Cloud so that you can filter and find the project’s results. See: Add attributes

    datasets

    required

    A subheading to contain the list of datasets to apply your reconciliation checks.

    source

    required

    Key-value pairs to identify the dataset and data source of the source, or origin location of the data to be migrated.

    Default strategy
    • simple: ✓
    • deepdiff: -

    Processing
    • simple: loads rows into memory one by one, or by batch, for comparison
    • deepdiff: loads all rows into memory for comparison

    Specify key columns
    • simple: required; can be one or more keys
    • deepdiff: optional

    Specify batch and page sizes
    • simple: optional
    • deepdiff: N/A

    Column-constrained

    Only the data in a specified list of columns.

    In the example below, the first check applies a deepdiff strategy and compares only the contents of the listed columns, mapping the columns according to the order in which they appear in the list: Planet to Planet, Hotness to Relative Temp. This check passes because the values of the mapped columns are the same.

    With composite primary key

    The entire contents of the datasets, specifying columns to define a primary key in the source.

    In the example below, the second check applies a simple strategy by default and uses the key columns you identify to form a primary key in the source that defines a single record. Soda uses the key to map records between datasets. Note that you can list column names as comma-separated values in square brackets, or as an unordered list as in the example. This check fails because of the mismatched value for Jupiter's size.

    With different primary keys in source and target

    The entire contents of the datasets, specifying columns to define multiple primary keys in both the source and target. This is useful when the column names in your datasets are different.

    In the example below, the third check applies a simple strategy by default and enables you to define the primary keys in both the source and target datasets. Soda uses the key to map records between datasets. This check passes because with only one failed row, it does not exceed the threshold of 5 that the check sets.
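For column-constrained comparisons, the source and target column lists are mapped positionally: the first source column pairs with the first target column, and so on. A minimal illustration of that order-based mapping:

```python
# Illustration only: deepdiff pairs source and target columns by list position.
source_columns = ["Planet", "Hotness"]
target_columns = ["Planet", "Relative Temp"]

mapping = dict(zip(source_columns, target_columns))
print(mapping)  # {'Planet': 'Planet', 'Hotness': 'Relative Temp'}
```

Because the pairing is positional, reordering either list silently changes which columns Soda compares, so keep the two lists aligned.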

    avg: all

    avg_length: all

    duplicate_count: all

    duplicate_percent: all

    failed rows: all

    freshness: all

    ✓ Define a name for a reconciliation check; see example in Customize check names.

    ✓ Add an identity to a check; see example in Add a check identity.

    ✓ Define alert configurations to specify warn and fail alert conditions; see example in Add alert configurations. Exception: schema reconciliation checks do not support alert configurations.

    - Apply an in-check filter to return results for a specific portion of the data in your dataset.


    Need help? Join the Soda community on Slack.


    Manage failed row samples

    When a Soda scan results in a failed check, Soda Cloud displays details of the scan results in each check's Check History page. To offer more insight into the data that failed a check during a scan, you can enable Soda Cloud to display failed rows samples in a check's history.

    After a scan has completed, from the Checks dashboard, select an individual check to access its Check History page, then click the Failed Rows Analysis tab (pictured below) to see the failed rows samples associated with a failed check result.

    About failed row samples

    There are two ways Soda collects and displays failed row samples in your Soda Cloud account.

    reconciliation Production:
      label: "Recon metric check"
      datasets:
        source:
          dataset: dataset X
          datasource: Data source A
        target:
          dataset: dataset X
          datasource: Data source B
      checks:
        - sum(column1) < 5
    reconciliation Production:
      label: "Recon diff check"
      datasets:
        source:
          dataset: dataset Y
          datasource: Data source A
        target:
          dataset: dataset Y
          datasource: Data source B
      checks:
        - rows diff = 0:
            key columns: [Planet]
    reconciliation Production:
      label: "Recon diff check"
      datasets:
        source:
          dataset: dataset Y
          datasource: Data source A
        target:
          dataset: dataset Y
          datasource: Data source B
      checks:
        - schema
    reconciliation Production:
      label: "Recon diff check"
      datasets:
        source:
          dataset: dataset Y
          datasource: Data source A
        target:
          dataset: dataset Y
          datasource: Data source B
      checks:
        - values in target must exist in source:
            source columns: [first_name, last_name]
            target columns: [fname, lname]
    soda scan -d mysql_adventureworks -c configuration.yml recon.yml
    
    reconciliation my_project_name:
      label: "Reconcile MySQL to Snowflake"
      attributes:
         priority: 3
      datasets:
        source:
          dataset: dim_customer
          datasource: mysql_adventureworks
        target:
          dataset: dim_customer
          datasource: snowflake_retail
    
      checks:
        - row_count diff = 0
    checks for dim_customer:
      - duplicate_count(last_name) = 0
    reconciliation Production:
    ...
      checks:
        - duplicate_count(last_name) diff = 0
        - avg(total_children) diff < 10
        - freshness(date_first_purchase) diff < 100h
        - row_count:
            fail: when diff > 10%
            warn: when diff between 5% and 9%
        - missing_count(middle_name) diff = 0:
            samples columns: [last_name, first_name]
    soda scan -d adventureworks -c configuration.yml recon2.yml
    Soda Library 1.x.x
    Soda Core 3.0.xx
    xxx
    Sending failed row samples to Soda Cloud
    Sending failed row samples to Soda Cloud
    Sending failed row samples to Soda Cloud
    Scan summary:
    3/5 checks PASSED:
        dim_customer in aws_postgres_retail
          Recon Test: duplicate_count(last_name) diff = 0 [PASSED]
          Recon Test: avg(total_children) diff < 10 [PASSED]
          freshness(date_first_purchase) diff < 100h [PASSED]
    1/5 checks WARNED:
        dim_customer in aws_postgres_retail
          Recon Test: row_count warn when diff < 5% fail when diff > 10% [WARNED]
            check_value: 0.0
            source_row_count: 18484
            target_row_count: 18484
            diff_value: 0
            diff_percentage: 0.0%
    1/5 checks FAILED:
        dim_customer in aws_postgres_retail
          Recon Test: missing_count(middle_name) diff = 0 [FAILED]
            check_value: 7830
            source_missing_count: 7830
            target_missing_count: 0
            diff_value: 7830
            diff_percentage: 100.0%
    Oops! 1 failure. 1 warning. 0 errors. 3 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 6925***98
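The diff_value and diff_percentage figures in this output follow a simple relationship: the diff is the absolute difference between the source and target measurements, and the percentage expresses that diff relative to the source measurement (7830 versus 0 yields 100%). A minimal sketch, inferred from the sample output rather than from Soda's internals:

```python
# Relationship inferred from the scan output above; Soda's internal
# implementation may differ.
def recon_diff(source_value, target_value):
    diff_value = abs(source_value - target_value)
    diff_percentage = (diff_value / source_value * 100) if source_value else 0.0
    return diff_value, diff_percentage

print(recon_diff(7830, 0))       # (7830, 100.0) as in the failed check above
print(recon_diff(18484, 18484))  # (0, 0.0) as in the warned row_count check
```

Note that because the percentage is relative to the source, it can exceed 100% when the target measurement is much larger than the source.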
    reconciliation Production:
    ...
      checks:
        - name_combo diff = 0:
            name: Name Combo
            source query: |
              SELECT count(*)
              FROM dim_customer
              WHERE first_name = 'Rob' or last_name = 'Walters'
            target query: |
              SELECT count(*)
              FROM dim_customer
              WHERE last_name = 'Walters'
    
        - average_children diff = 0:
            average_children expression: avg(total_children)
    reconciliation Production:
      label: "Reconcile Planet Info"
      datasets:
        source:
          dataset: dataset_Y
          datasource: datasource_A
        target:
          dataset: dataset_Y
          datasource: datasource_B
    
      checks:
      # simple strategy with default page and batch sizes
      # If not explicitly defined, Soda defaults to simple strategy
        - rows diff = 0:
            key columns: [Planet, Size]
      # simple strategy with custom page and batch sizes
        - rows diff = 0:
            key columns: [Planet, Size]
            batch size: 100
            page size: 1000
      # simple strategy with different primary key column names
        - rows diff < 5:
            source key columns: [Planet, Hotness]
            target key columns: [Planet, Relative Temp]
      # simple strategy with different primary key column names and different number of columns
        - rows diff < 5:
            source key columns: [City]  # Key columns to match rows between source and target
            target key columns: [Town]
            source columns: [City, Hotness] # Columns Soda compares in the source table
            target columns: [Town, Relative Temp] # Columns Soda compares in the target table
      # deepdiff strategy
        - rows diff = 0:
            strategy: deepdiff
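The batch size and page size parameters of the simple strategy control how many rows Soda fetches and compares at a time; conceptually, paging slices the dataset into fixed-size chunks so the whole table never needs to be resident in memory. A toy sketch of that slicing (illustrative only, not Soda's implementation):

```python
# Conceptual paging: yield fixed-size chunks of a row list.
def pages(rows, page_size):
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]

rows = list(range(10))
print([len(page) for page in pages(rows, 4)])  # [4, 4, 2]
```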
    reconciliation Production:
    ...
      checks:
        # Column-constrained
        - rows diff = 0:
            strategy: deepdiff
            source columns: [Planet, Hotness]
            target columns: [Planet, Relative Temp]
        # With composite primary key
        - rows diff = 0:
            key columns:
              - Planet
              - Size
        # With different primary keys in source and target
        - rows diff < 5:
            source key columns: [Planet, Hotness]
            target key columns: [Planet, Relative Temp]
    reconciliation records_recon_check:
      datasets:
        source:
          dataset: retail_customers
          datasource: postgres_soda_demo_data_testing
        target:
          dataset: retail_customers_sfdc
          datasource: postgres_soda_demo_data_testing
    
      checks:
        - rows diff = 0:
            strategy: deepdiff
            source query: |
              SELECT DISTINCT salary
              FROM retail_customers
            target query: |
              SELECT DISTINCT annual_pay
              FROM retail_sfdc_customers
    from soda.scan import Scan
    from soda.execution.compare.value_comparator import ValueComparator
    from datetime import datetime

    class CustomValueComparator(ValueComparator):
        def equals(self, x, y):
            # Ignore microsecond differences of 4ms or less in datetimes
            if isinstance(x, datetime) and isinstance(y, datetime):
                xms = x.microsecond
                yms = y.microsecond
                ms_diff = abs(xms - yms)
                if ms_diff > 0 and ms_diff <= 4000:
                    x = x.replace(microsecond=0)
                    y = y.replace(microsecond=0)
            return x == y

    if __name__ == "__main__":
        s = Scan()
        s.value_comparator = CustomValueComparator()
        s.set_scan_definition_name("test_scan")
        s.set_verbose(True)
        s.set_data_source_name("soda_demo")
        s.add_configuration_yaml_file("configuration.yml")
        s.add_sodacl_yaml_file("perf.yml")
        s.execute()
        print(s.get_logs_text())
        print(s.build_scan_results())
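The tolerance rule in the comparator above can be exercised on its own with plain datetime values. One detail worth noting: it compares only the microsecond fields of the two values, so two timestamps a few milliseconds apart that straddle a whole-second boundary still compare unequal. A minimal stdlib restatement of the rule:

```python
from datetime import datetime

# Standalone version of the comparator's tolerance rule (illustration only).
def within_4ms(x, y):
    if isinstance(x, datetime) and isinstance(y, datetime):
        if 0 < abs(x.microsecond - y.microsecond) <= 4000:
            x = x.replace(microsecond=0)
            y = y.replace(microsecond=0)
    return x == y

a = datetime(2024, 1, 1, 12, 0, 0, 1000)
b = datetime(2024, 1, 1, 12, 0, 0, 3500)
print(within_4ms(a, b))  # True: 2.5ms apart, treated as equal

c = datetime(2024, 1, 1, 12, 0, 0, 9000)
print(within_4ms(a, c))  # False: 8ms apart
```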
    reconciliation Production:
    ...
      checks:
        - schema
    reconciliation Production:
      label: "Reconcile MS SQL to Snowflake"
      datasets:
        source:
          dataset: opt-in-campaign
          datasource: sqlserver1
        target:
          dataset: optin-campaign
          datasource: snowflake_retail
    
      checks:
        - schema:
            types:
              - source: bit
                target: boolean
              - source: enum
                target: string
    reconciliation Production:
      label: "Reference check"
      datasets:
        source:
          dataset: dataset Y
          datasource: Data source A
        target:
          dataset: dataset Y
          datasource: Data source B
      checks:
        - values in target must exist in source:
            source columns: [first_name, last_name]
            target columns: [fname, lname]
    reconciliation Production:
      label: "Reconcile MySQL to Snowflake"
      # Soda adds this attribute to each check in the reconciliation project
      attributes:
         priority: 3
      datasets:
        source:
          dataset: dim_customer
          datasource: mysql_adventureworks
        target:
          dataset: dim_customer
          datasource: snowflake_retail
      checks:
        - row_count diff = 0:
            # Soda adds this attribute to this check, only.
            attributes:
               department: [Marketing]
        - rows diff:
            # Soda adds this attribute to this check, only.
            name: Row diff check
            attributes:
                department: [Development]
            fail: when > 10
            warn: when between 5 and 9
    reconciliation Production:
      label: "Recon Test"
      datasets:
        source:
          dataset: dim_customer
          datasource: adventureworks
          filter: total_children > 3
        target:
          dataset: dim_customer
          datasource: aws_postgres_retail
    
      checks:
        - row_count diff = 0
    soda scan -d adventureworks -c configuration.yml recon2.yml
    Soda Library 1.x.x
    Soda Core 3.0.x
    ...
    Scan summary:
    1/1 check FAILED:
        dim_customer in aws_postgres_retail
          row_count diff = 0 [FAILED]
            check_value: 14757
            source_row_count: 3727
            target_row_count: 18484
            diff_value: 14757
            diff_percentage: 395.95%
    Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
    Sending results to Soda Cloud
    Soda Cloud Trace: 4380***10
    checks:
      - rows diff = 0:
          samples limit: 20
    checks:
      - rows diff = 0:
          samples limit: 0
      checks:
        - rows diff between 35000 and 36000:
            name: Simple row diff
      checks:
        - duplicate_count(last_name) diff < 1:
            identity: 05229d67-e3f0-***-a327-b2***84
      checks:
        - row_count:
            fail: when diff > 10%
            warn: when diff between 5% and 9%
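Thresholds like these partition the measured diff percentage into pass, warn, and fail zones. A minimal sketch of that evaluation logic (illustrative only, not Soda's implementation); note that with these particular bounds a diff of 9.5% falls in neither the warn nor the fail zone and therefore passes:

```python
# Illustrative evaluation for: fail when diff > 10%,
# warn when diff between 5% and 9%.
def evaluate(diff_pct, warn_low=5, warn_high=9, fail_above=10):
    if diff_pct > fail_above:
        return "fail"
    if warn_low <= diff_pct <= warn_high:
        return "warn"
    return "pass"

print(evaluate(0.0))   # pass
print(evaluate(7.0))   # warn
print(evaluate(12.0))  # fail
```

Choosing contiguous bounds, such as warn between 5% and 10% with fail above 10%, avoids the gap.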
      checks:
        - duplicate_count("last_name") diff = 0
    • Implicitly: Soda automatically collects 100 failed row samples for the following checks:

      • reference check

      • checks that use a missing metric

      • checks that use a validity metric

      • checks that use a duplicate metric

      • reconciliation checks that include missing, validity, or duplicate metrics, or reference checks

    • Explicitly: Soda automatically collects 100 failed row samples for the following explicitly-configured checks:

      • failed rows checks and user-defined checks that use the failed rows query configuration

    By default, implicit and explicit collection of failed row samples is enabled in Soda Cloud. If you wish, you can adjust this setting as follows.

    1. As a user with permission to do so, navigate to your avatar > Organization Settings.

    2. Check, or uncheck, the box to Allow Soda to collect sample data and failed row samples for all datasets. A checked box means default sampling is enabled.

    3. (Optional; requires Soda Library 1.6.1 or greater, or Soda Agent 1.1.27 or greater) Check the nested box to Allow Soda to collect sample data and failed row samples only for datasets and checks with the explicit configuration to do so to limit both dataset sampling and implicit failed row collection to only those checks which have configured samples columns or collect failed rows parameters, or to datasets configured to collect failed row samples in the Dataset Settings. This setting does not apply to checks that explicitly collect failed row samples.

    4. Save the settings.

    Checkbox side effects

    In the Organization Settings, the checkbox to Allow Soda to collect sample data and failed row samples only for datasets and checks with the explicit configuration to do so is only compatible with Soda Library 1.6.1 or greater, or Soda Agent 1.1.27 or greater. You can only check the box if your Soda versions are compatible. Enabling this checkbox comes with a side effect: Soda Cloud rejects all failed row samples that Soda Library versions earlier than 1.6.1 try to send to Soda Cloud.

    Determine which sampling method to use

    While the following tables may be useful in deciding how to configure failed row sample collection for your organization, be aware that you can use combinations of configurations to achieve your sampling objectives.

    Some configurations apply only to no-code checks, or only to checks defined using SodaCL in an agreement in Soda Cloud or in a YAML file for use with Soda Library; refer to individual configuration instructions for details.

    Method:

    Soda Cloud sampler

    Description:

    Enabled by default in Soda Cloud. By default, Soda collects up to 100 failed row samples for any check that implicitly or explicitly collects samples, and displays them in a check’s Check History page in Soda Cloud.

    Appropriate for:

    • Some or none of the data being scanned by Soda is sensitive; it is okay for users to view samples in Soda Cloud. • Data is allowed to be stored outside your internal network infrastructure. • Teams would find failed row samples useful in investigating data quality issues.

    Method:

    HTTP custom sampler

    Description:

    By default, Soda collects up to 100 failed row samples for any check that implicitly or explicitly collects samples, and routes them to the storage destination you specify in your custom sampler.

    Appropriate for:

    • Teams define both no-code checks and checks in agreements in the Soda Cloud user interface, and may use SodaCL to define checks for use with CLI or programmatic scans. • Some or all data scanned by Soda is very sensitive. • Sample data is allowed to be stored outside your internal network infrastructure. • Teams would find it useful to have samples of failed rows to aid in data quality issue investigation. • Teams wish to use failed row samples to prepare other reports or dashboards outside of Soda Cloud. • Teams wish to collect and store most or all failed row samples.

    Method:

    Python custom sampler

    Description:

    By default, Soda collects up to 100 failed row samples for any check that implicitly or explicitly collects samples, and routes them to the storage destination you specify in your custom sampler.

    Appropriate for:

    • Teams only define checks using SodaCL for use with CLI or programmatic scan; teams do not use no-code checks or agreements. • Some or all data scanned by Soda is very sensitive. • Sample data is allowed to be stored outside your internal network infrastructure. • Teams would find it useful to have samples of failed rows to aid in data quality issue investigation. • Teams wish to use failed row samples to prepare other reports or dashboards outside of Soda Cloud.

    Method:

    No sampler

    Description:

    Soda does not collect any failed row samples for any checks.

    Appropriate for:

    • All data scanned by Soda is very sensitive. • No sample data is allowed to be stored outside your internal network infrastructure. • Teams do not need samples of failed rows to aid in data quality issue investigation.

    Customize sampling via data source configuration

    Where some of your data is sensitive, you can either disable the Soda Cloud sampler completely for individual data sources, or use one of several ways to customize the Soda Cloud sampler to restrict failed row sample collection to only those datasets and columns you wish.

    See also: Manage sensitive data

    See also: Configuration and setting hierarchy

    Disable failed row samples

    Applies to: ✔️ implicit collection of failed row samples ✖️ explicit collection of failed row samples

    For checks defined as: ✔️ no-code checks in Soda Cloud ✔️ in an agreement in Soda Cloud ✔️ in a checks YAML file ✔️ inline in a programmatic scan

    Where datasets in a data source contain sensitive or private information, you may not want to collect failed row samples. In such a circumstance, you can disable the collection of failed row samples for checks that implicitly do so.

    Adjust your data source connection configuration in Soda Cloud or in a configuration YAML file to disable all samples for an individual data source, as in the following example.

    Customize failed row samples for datasets and columns

    Applies to: ✔️ implicit collection of failed row samples ✖️ explicit collection of failed row samples

    For checks defined as: ✔️ no-code checks in Soda Cloud ✔️ in an agreement in Soda Cloud ✔️ in a checks YAML file ✔️ inline in a programmatic scan

    For checks which implicitly collect failed rows samples, you can add a configuration to the data source connection configuration to prevent Soda from collecting failed rows samples from specific datasets that contain sensitive data.

    To do so, add the sampler configuration to your data source connection configuration in Soda Cloud or in a configuration YAML file to specify exclusion of all the columns in datasets you list, as per the following example which disables all failed row sample collection from the customer_info and payment_methods datasets.

    Rather than disabling failed row collection for all the columns in a dataset, you can add a configuration to prevent Soda from collecting failed rows samples from specific columns that contain sensitive data. For example, you may wish to exclude a column that contains personal identifiable information (PII) such as credit card numbers from the Soda query that collects samples.

    To do so, add the sampler configuration to your data source connection configuration in Soda Cloud or in a configuration YAML file to specify the columns you wish to exclude, as per the following examples. Note that the dataset names and the lists of samples columns support wildcard characters (% or *).

    OR

    Optionally, you can use wildcard characters in the sampler configuration to design the sampling exclusion you wish.

    If you wish to set a limit on the samples that Soda implicitly collects for an entire data source, you can do so by adjusting the configuration YAML file, or editing the Data Source connection details in Soda Cloud, as per the following syntax. This configuration also applies to checks defined as no-code checks.

    Alternatively, you can set a samples limit for a data source using Soda Library by modifying the value of an attribute of the Scan class object:

    Sampler configuration details

    • Soda executes the exclude_columns values cumulatively. For example, for the following configuration, Soda excludes the columns password, last_name and any columns that begin with pii_ from the retail_customers dataset.
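The configuration referenced above is not reproduced in this extract. A hypothetical sketch of what such a cumulative sampler configuration could look like, with dataset-name patterns that all match retail_customers; verify the exact keys and wildcard syntax against the Soda data source configuration reference for your version:

```yaml
# Hypothetical sketch only -- not copied from the original document.
sampler:
  exclude_columns:
    retail_customers: [password]
    retail_*: [last_name]
    "*": [pii_*]
```

Because the values are applied cumulatively, every pattern whose key matches a dataset contributes its column exclusions to that dataset.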

    • The exclude_columns configuration also applies to any custom sampler, in addition to the Soda Cloud sampler.

    • The exclude_columns configuration does not apply to sample data collection.

    • A samples columns or collect failed rows configuration for an individual check does not override an exclude_columns setting in a data source. For example, if you configured a data source to exclude any columns in a customer dataset from collecting failed row samples, but included a last name column in a samples columns configuration on an individual check for the customer dataset, Soda obeys the exclude_columns configuration and does not collect or display failed row samples for last name.

    • Checks in which you provide a complete SQL query, such as failed rows checks or user-defined checks that use a failed rows query, do not honor the exclude_columns configuration. Instead, a gatekeeper component parses all queries that Soda runs to collect samples and ensures that none of the columns listed in an exclude_columns configuration slip through when generating the sample queries. In such a case, the Soda Library CLI provides a message to indicate the gatekeeper’s behavior:

    Customize sampling via user interface

    Where some of your data is sensitive, you can either disable the Soda Cloud sampler completely, or use one of several ways to customize the Soda Cloud sampler in the user interface to restrict failed row sample collection to only those datasets and columns you wish.

    The configurations described below correspond with the optional Soda Cloud setting in Organization Settings (see image below) which limits failed row sample collection to only those checks which implicitly collect failed row samples and which include the samples columns or collect failed rows configuration, and/or to checks in datasets that are set to inherit organization settings for failed row samples, or for which failed row samples is disabled. The Allow Soda to collect sample data and failed row samples only for datasets and checks with the explicit configuration to do so setting is compatible with Soda Library 1.6.1 or Soda Agent 1.1.27 or greater.

    Checkbox side effects

    In the Organization Settings, the checkbox to Allow Soda to collect sample data and failed row samples only for datasets and checks with the explicit configuration to do so is only compatible with Soda Library 1.6.1 or greater, or Soda Agent 1.1.27 or greater. You can only check the box if your Soda versions are compatible. Enabling this checkbox comes with a side effect: Soda Cloud rejects all failed row samples that Soda Library versions earlier than 1.6.1 try to send to Soda Cloud.

    See also: Manage sensitive data

    See also: Configuration and setting hierarchy

    Disable failed row samples in Soda Cloud

    Applies to: ✔️ implicit collection of failed row samples ✔️ explicit collection of failed row samples

    For checks defined as: ✔️ no-code checks in Soda Cloud ✔️ in an agreement in Soda Cloud ✔️ in a checks YAML file ✔️ inline in a programmatic scan

    If your data contains sensitive or private information, you may not want to collect any failed row samples, whatsoever. In such a circumstance, you can disable the collection of failed row samples completely. To prevent Soda Cloud from receiving any sample data or failed row samples for any datasets in any data sources to which you have connected your Soda Cloud account, proceed as follows:

    1. As a user with permission to do so, log in to your Soda Cloud account and navigate to your avatar > Organization Settings.

    2. In the Organization tab, uncheck the box to Allow Soda to collect sample data and failed row samples for all datasets, then Save.

    If disabled for a dataset, Soda executes the check during a scan and does not display any failed rows in the Check History page. Instead, it displays an explanatory message and offers the failed rows SQL query that a user with direct access to the data can copy and run elsewhere to retrieve failed row samples for the check.

    Customize failed row samples for datasets

    Applies to: ✔️ implicit collection of failed row samples ✖️ explicit collection of failed row samples

    For checks defined as: ✔️ no-code checks in Soda Cloud ✔️ in an agreement in Soda Cloud ✔️ in a checks YAML file ✔️ inline in a programmatic scan

    Rather than adjusting the data source connection configuration, you can adjust an individual dataset's settings in the Soda Cloud user interface so that it collects no failed row samples for checks which implicitly collect them.

    Note, however, that users with the permission to add checks to a dataset can add the collect failed rows or samples columns parameters to a check in an agreement, in a checks YAML file, or inline in a programmatic scan to override the dataset's Disabled setting. See Customize sampling for checks.

    1. As a user with permission to edit a dataset, log in to Soda Cloud, then navigate to the Dataset for which you never want Soda to collect failed row samples.

    2. Click the stacked dots at the upper-right, then select Edit Dataset.

    3. In the Failed Row Samples tab, use the Failed Rows Sample Collection dropdown to select Disabled. Alternatively, if you have checked the box in Organization Settings to Allow Soda to collect sample data and failed row samples for all datasets, you can select Inherited from organization in this dropdown to apply the Soda Cloud account-level rule that applies to all datasets.

    4. Save your settings.

    If disabled for a dataset, Soda executes the check during a scan and does not display any failed rows in the Check History page. Instead, it displays an explanatory message and offers the failed rows SQL query that a user with direct access to the data can copy and run elsewhere to retrieve failed row samples for the check.

    Alternatively, you can adjust a dataset's settings in Soda Cloud so that it collects failed row samples only for specific columns.

    1. As a user with permission to edit a dataset, log in to Soda Cloud, then navigate to the Dataset for which you want Soda to collect failed row samples only from specific columns.

    2. Click the stacked dots at the upper-right, then select Edit Dataset.

    3. In the Failed Row Samples tab, use the dropdown to select Specific Columns, further selecting the columns from which to gather failed row samples.

    4. Save your settings.

    Customize sampling for checks

    Applies to: ✔️ implicit collection of failed row samples ✖️ explicit collection of failed row samples

    For checks defined as: ✖️ no-code checks in Soda Cloud ✔️ in an agreement in Soda Cloud ✔️ in a checks YAML file ✔️ inline in a programmatic scan

    💡 Consider customizing sampling for checks via Soda Cloud settings; see Customize failed row samples for datasets.

    When you add sampling parameters to checks—collect failed rows or samples columns—the check level configurations override other settings or configurations according to the following table. See also: Configuration and setting hierarchy.

    Setting or Configuration

    Sampling parameters override settings or config

    Sampling parameters DO NOT override settings or config

    UNCHECKED Organization Settings > Allow Soda to collect sample data and failed row samples for all datasets

    ✔️

    CHECKED Organization Settings > Allow Soda to collect sample data and failed row samples for all datasets AND CHECKED Allow Soda to collect sample data and failed row samples only for datasets and checks with the explicit configuration to do so

    ✔️

    CHECKED Dataset Settings > Failed Row Samples > Disable all failed row sample collection

    ✔️

    CHECKED Dataset Settings > Failed Row Samples > Enable failed row collection for specific columns AND The check's column is not selected for enablement.

    ✔️

    Data source connection sampler configuration uses exclude_columns to prevent failed row sample collection on all, or some, of its datasets and columns

    ✔️

    Data source connection sampler configuration uses disable_samples: True to prevent failed row sample collection on all datasets in the data source

    ✔️

    Add a collect failed rows parameter to a check that implicitly collects failed row samples to instruct the Soda Cloud sampler to collect failed row samples for an individual check. Provide a boolean value for the configuration key, either true or false.

    You can also add a samples columns configuration to an individual check to specify the columns for which Soda implicitly collects failed row sample values. At the check level, Soda only collects the check's failed row samples for the columns you specify in the list, as in the duplicate_count example below. The comma-separated list of samples columns supports wildcard characters (% or *).
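    For instance, a sketch of a check whose samples columns list uses a wildcard; the dataset and column names follow the other examples in this section, and the wildcard pattern is illustrative:

```yaml
checks for dim_customer:
  - duplicate_count(email_address) < 50:
      samples columns: [last_name, email%]
```

    Soda would then collect this check's failed row samples only from last_name and from any column whose name begins with email.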

    Note that reconciliation checks do not support the samples columns parameter. Instead, Soda dynamically generates failed row samples based on the recon check’s diagnostic, and displays only the columns that are relevant to the data being compared.

    Alternatively, you can specify sample collection and/or the columns from which Soda must draw failed row samples for multiple checks using a dataset-level configuration, as in the following example. Note that if you specify a different samples columns or collect failed rows value for an individual check than is defined in the configurations for block, Soda obeys the individual check's instructions.

    Set a sample limit

    Applies to: ✔️ implicit collection of failed row samples ✖️ explicit collection of failed row samples

    For checks defined as: ✖️ no-code checks in Soda Cloud ✔️ in an agreement in Soda Cloud ✔️ in a checks YAML file ✔️ inline in a programmatic scan

    By default, Soda collects 100 failed row samples. You can limit the number of sample rows that Soda sends using the samples limit key:value pair configuration, as in the following missing check example.

    If you wish to collect a larger volume of failed row samples, you can set the limit to a larger number. Be aware, however, that collecting large volumes of failed row samples comes at a compute cost: Soda Library or a Soda Agent must have enough memory to process the request; see: About failed row sampling queries.

    If you wish to prevent Soda from collecting and sending failed row samples to Soda Cloud for an individual check, you can set the samples limit to 0. To achieve the same objective, you can use a collect failed rows: false parameter, instead.
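    As a sketch, the equivalent configuration using the collect failed rows parameter; the check and column names follow the other examples in this section:

```yaml
checks for dim_customer:
  - missing_percent(email_address) < 50:
      collect failed rows: false
```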

    Customize a failed row samples query

    Applies to: ✖️ implicit collection of failed row samples ✔️ explicit collection of failed row samples for a user-defined check, only

    For checks defined as: ✔️ no-code checks in Soda Cloud ✔️ in an agreement in Soda Cloud ✔️ in a checks YAML file ✔️ inline in a programmatic scan

    At times, you may find it useful to customize a SQL query that Soda can use to collect failed row samples for a user-defined check. To do so, you can add an independent failed row samples query.

    For example, you may wish to limit the columns from which Soda draws failed row samples, or limit the volume of samples. Further, you could customize a query to run an aggregate metric such as avg on a discount column, for example, and return failed row samples that you can compare to an anomaly such as rows with a discount greater than 50%.
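    A sketch of such a customized query for the discount scenario, modeled on the failed rows query pattern shown in this section; the retail_orders table, discount column, and thresholds are assumptions:

```yaml
checks for retail_orders:
  - avg_discount:
      avg_discount query: SELECT AVG(discount) FROM retail_orders
      failed rows query: SELECT * FROM retail_orders WHERE discount > 0.5
      fail: when > 0.3
```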

    To add a custom failed row sample query to a user-defined check using SodaCL, add a failed rows query configuration as in the following example.

    In Soda Cloud, you can add an optional failed rows query to a no-code SQL Metric check in the user interface, as in the image below. No-code SQL Metric checks are supported in Soda Cloud with Soda-hosted Agent or self-hosted Agent.

    Configuration and setting hierarchy

    With many different options available to configure failed row sampling in various formats and at various levels (data source, dataset, check) with Soda, some combinations of customization may be complex. Generally speaking, configurations you define in configuration YAML or checks YAML files override settings defined in Soda Cloud, with the exception of the top-most setting that allows, or disallows, failed row sample collection entirely.

    What follows is a hierarchy of configurations and settings to help you determine how Soda enforces failed row sample collection and display, in descending order of precedence.

    Disabling failed row samples in Soda Cloud prevents Soda from displaying any failed row samples for any checks as part of a Soda Library scan or Soda Cloud scheduled scan definition.

    Disabling failed row samples via the data source disable_samples configuration prevents Soda from displaying any failed row samples for checks that implicitly collect samples and which are applied to datasets in the data source. If disable_samples is set to false while the above Sample Data setting in Soda Cloud is unchecked, Soda obeys the Soda Cloud setting and does not display any failed row samples for any checks.

    Setting a sample limit to 0 via the data source samples limit configuration prevents Soda from displaying any failed row samples for checks that implicitly collect samples and which are applied to datasets in the data source. If set to 10, for example, while the above disable_samples setting is set to true, Soda obeys the disable_samples setting and does not display failed row samples for checks for the data source.

    Disabling failed row samples for datasets and columns via the data source exclude_columns configuration prevents Soda from displaying any failed row samples for checks that implicitly collect samples and which are applied to datasets in the data source. If specified for a data source while the above data source samples limit configuration is set to 0, Soda obeys the samples limit and does not display failed row samples for checks for the data source.

    Disabling sampling for checks via sampling parameters (samples columns or collect failed rows) in a SodaCL check configuration specifies sampling instructions, or prevents/allows sampling for individual checks that implicitly collect samples. If any of the above configurations conflict with the individual check settings, Soda obeys the above configurations. For example, if a duplicate_count check includes the configuration collect failed rows: true but the samples limit configuration in the data source configuration is set to 0, Soda obeys the samples limit and does not display failed row samples for the duplicate_count check.

    However, if you specify a different samples columns or collect failed rows value for an individual check than is defined in the configurations for block for a dataset, Soda obeys the individual check’s instructions.

    Customize sampling via user interface to Allow Soda to collect sample data and failed row samples only for datasets and checks with the explicit configuration to do so setting in Organization Settings in Soda Cloud limits failed row sample collection to only those checks which implicitly collect failed row samples and which include a samples columns or collect failed rows configuration, and/or to checks in datasets that are set to inherit organization settings for failed row samples, or for which failed row samples is disabled. If any of the above configurations conflict with this setting, Soda obeys the above configurations.

    Disabling failed row samples for datasets via Soda Cloud Edit Dataset > Failed Row Samples prevents Soda from displaying any failed row samples for checks that implicitly collect samples and which are applied to the individual dataset. If any of the above configurations conflict with the dataset's Failed Row Samples settings, Soda obeys the above configurations. For example, if you set the value of Failed Rows Sample Collection to Disabled for a dataset, then use SodaCL to configure an individual check to collect failed rows, Soda obeys the check configuration and displays the failed row samples for the individual check.

    Customizing failed row samples for datasets via Soda Cloud to collect samples only for columns you specify in the Collect Failed Row Samples For setting instructs Soda to display failed row samples for your specified columns for checks that implicitly collect samples and which are applied to the individual dataset. If any of the above configurations conflict with the dataset's Collect Failed Row Samples For settings, Soda obeys the above configurations.

    Reroute failed row samples

    If the data you are checking contains sensitive information, you may wish to send any failed rows samples that Soda collects to a secure, internal location rather than Soda Cloud. These configurations apply to checks defined as no-code checks, in an agreement, or in a checks YAML file.

    To do so, you have two options:

    1. HTTP sampler: Create a function, such as a lambda function, available at a specific URL within your environment that Soda can invoke for every check result in a data source that fails and includes failed row samples. Use the function to perform any necessary parsing from JSON to your desired format (CSV, Parquet, etc.) and store the failed row samples in a location of your choice.

    2. Python CustomSampler: If you run programmatic Soda scans of your data, add a custom sampler to your Python script to collect samples of rows with a fail check result. Once collected, you can print the failed row samples in the CLI, for example, or save them to an alternate destination.

    Characteristic
    HTTP sampler
    Python CustomSampler

    Only usable with a programmatic Soda scan

    🟢

    Displays failed row sample storage location in a message in Soda Cloud

    🟢

    🟢

    Can pass a DataFrame into the scan to store the failed row samples, then access failed row samples after scan completion

    🟢

    🟢

    Requires corresponding configuration in the data source connection configuration

    🟢

    Configure an HTTP custom sampler

    Soda sends the failed rows samples as a JSON event payload and includes the following, as in the example below.

    • data source name

    • dataset name

    • scan definition name

    • check name

    1. Configure an HTTP failed row sampler; see example below.

    2. In Soda Cloud, in the Data Sources tab, select the data source for which you wish to reroute failed rows samples, then navigate to its Connect the Data Source tab. If you use a configuration.yml file to store data source connection configuration details, open the file.

    3. To the connection configuration, add the sampler and storage configuration as outlined below, then save.

    Parameter
    Value
    Description

    type

    http

    Provide an HTTP endpoint such as a Lambda function, or a custom Python HTTP service.

    url

    any URL

    Provide a valid URL that accepts JSON payloads.

    message

    any string

    (Optional) Provide a customized message that Soda Cloud displays in the failed rows tab, prepended to the sampler response, to instruct your fellow Soda Cloud users how to find where the failed rows samples are stored in your environment. For example, if you wish the complete message to read: "Failed rows have been sent to dir/file.json", configure the syntax as in the example above and return the file location path in the sampler's response.

    link

    any URL

    (Optional) Provide a link to a web application through which users can access the stored sample.

    link_text

    any string

    (Optional) Provide text for the link button. For example, "View Failed Samples".

    Example: HTTP failed row sampler

    The following is an example of a custom failed row sampler that gets the failed rows from the Soda event object (JSON payload, see example below) and prints the failed rows in CSV format.

    Borrow from this example to create your own custom sampler that you can use to reroute failed row samples.

    Example CSV output:

    Configure a Python custom sampler

    If you are running Soda scans programmatically, you can add a custom sampler to collect samples of rows with a fail check result.

    The contents of the tabs below offer examples of how you can implement a custom sampler.

    • The Simple Example prints failed rows in the CLI.

    • The Example with Dataframes uses a scan context to read data from a scan and build a Dataframe with the results.

    • The Example with Sample Reference uses SampleRef to reroute failed rows and customize the message that appears in Soda Cloud to direct users to the alternate storage location, including using a variable to dynamically populate the message.

    💡 To see this sampler in action, copy+paste and run an example script locally to print failed row samples in the CLI scan output.

    This simple example prints the failed rows samples in the CLI. If you prefer to send the output of the failed row sampler to a destination other than Soda Cloud, you can do so by customizing the sampler as above, then using the Python API to save the rows to a JSON file. Refer to Python docs for Reading and writing files for details.
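    The JSON-file approach mentioned above can be sketched with plain standard-library code. The helper below is an illustrative assumption, not part of the Soda API; in a real CustomSampler you would call something like it from store_sample, with the column names and rows taken from the SampleContext.

```python
import json
import tempfile
from pathlib import Path

# Illustrative helper (not part of the Soda API): writes failed row
# samples to <out_dir>/<check_name>.json as a list of JSON records.
def write_failed_rows_json(check_name, column_names, rows, out_dir):
    # Pair each row tuple with the column names to build JSON records.
    records = [dict(zip(column_names, row)) for row in rows]
    out_path = Path(out_dir) / f"{check_name}.json"
    out_path.write_text(json.dumps(records, indent=2))
    return out_path

# Hypothetical sample data shaped like sample.get_rows() output
columns = ["id", "email"]
failed_rows = [(1, None), (2, None)]
path = write_failed_rows_json("missing_count_email", columns, failed_rows, tempfile.gettempdir())
print(path.read_text())
```

    Writing one JSON record per failed row keeps the file self-describing, so downstream tools do not need the dataset schema to interpret it.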

    This example uses a scan context to read data from, or write data to a scan. This enables users to build some data structure in the custom sampler, then use it after scan execution.

    For example, you can use scan context to build a DataFrame that contains unique failed row samples (as opposed to standard failed row samples Soda collects per check and which can contain the same sample rows in different checks). You can also use scan context to pass data to a scan and make it available during execution so as to provide additional context that helps to build meaningful results using filters, for example.

    Optionally, you can include SampleRef, as in the example below, to display a message in Soda Cloud that directs users to the alternate location to find the rerouted failed row samples for a check.

    In the message parameter, you can use one or more of the following variables to customize the details of the message that Soda presents to users when directing them to the alternate location.

    • {scan_time}

    • {check_label}

    • {data_source_label}

    • {dataset_label}

    About failed row sampling queries

    For the most part, when you exclude a column from failed row sampling, Soda does not include the column in its query to collect samples. In other words, Soda does not collect the samples and then withhold them from Soda Cloud; it does not query the column for samples at all. (There are some edge cases in which this is not the case, and for those instances, a gatekeeper component ensures that no excluded columns are included in failed row samples.)

    As an example, imagine a check that looks for NULL values in a column that you included in your exclude_columns configuration. (A missing metric in a check implicitly collects failed rows samples.)

    If the cat column were not an excluded column, Soda would generate two queries:

    • a query that executes the check

    • another query to collect failed rows samples for checks that failed

    But because the cat column is excluded, Soda must generate three queries:

    • a query that executes the check

    • a query to gather the schema of the dataset to identify all columns

    • another query to collect failed rows samples for checks that failed, only on columns identified on the list returned by the preceding query
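    As a toy illustration of how wildcard patterns in an exclude_columns configuration decide which columns a sampling query may select — this is not Soda's actual implementation, and the dataset and column names are made up — the matching can be modeled with fnmatch-style * wildcards:

```python
from fnmatch import fnmatch

# Hypothetical exclude_columns configuration using * wildcards,
# mirroring the SodaCL sampler examples in this section.
exclude_columns = {
    "customer_*": ["last_name", "pii_*"],
}

def is_excluded(dataset: str, column: str) -> bool:
    # A column is excluded if any dataset pattern matches the dataset
    # AND any of that pattern's column patterns matches the column.
    return any(
        fnmatch(dataset, ds_pat) and any(fnmatch(column, col_pat) for col_pat in col_pats)
        for ds_pat, col_pats in exclude_columns.items()
    )

# Columns a failed-rows sampling query could still select for customer_orders
columns = ["id", "last_name", "pii_ssn", "order_total"]
sampled = [c for c in columns if not is_excluded("customer_orders", c)]
print(sampled)  # ['id', 'order_total']
```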

    Go further

    • Follow a hands-on How-to guide to reroute failed rows using a Python CustomSampler.

    • Learn more about sampling data in Soda Cloud.

    • Learn how to discover and profile datasets.

    • Organize datasets in Soda Cloud using attributes.

    • Run Soda in a Dagster pipeline and reroute failed rows to Redshift.

    Need help? Join the Soda community on Slack.

    from soda.scan import Scan
    from soda.sampler.sampler import Sampler
    from soda.sampler.sample_context import SampleContext
    
    # Create a custom sampler by extending the Sampler class
    class CustomSampler(Sampler):
        def store_sample(self, sample_context: SampleContext):
            # Retrieve the rows from the sample for a check.
            rows = sample_context.sample.get_rows()
            # Check SampleContext for more details that you can extract.
            # This example simply prints the failed row samples.
            print(sample_context.query)
            print(sample_context.sample.get_schema())
            print(rows)
    
    
    if __name__ == '__main__':
        # Create a Scan object.
        s = Scan()
        # Configure an instance of custom sampler.
        s.sampler = CustomSampler()
    
        s.set_scan_definition_name("test_scan")
        s.set_data_source_name("test")
        s.add_configuration_yaml_str(f"""
        data_source test:
          type: postgres
          schema: public
          host: localhost
          port: 5433
          username: postgres
          password: secret
          database: postgres
        """)
    
        s.add_sodacl_yaml_str(f"""
        checks for dim_account:
            - invalid_percent(account_type) = 0:
                valid format: email
    
        """)
        s.execute()
        print(s.get_logs_text())
    from soda.scan import Scan
    from soda.sampler.sampler import Sampler
    from soda.sampler.sample_context import SampleContext
    import pandas as pd
    
    
    class CustomSampler(Sampler):
        def store_sample(self, sample_context: SampleContext):
            # Read data from scan context and use it in the sampler.
            # This example uses a list of unique ids from the scan context to filter the failed row sample DataFrame by ID.
            unique_ids = sample_context.scan_context_get("unique_ids")
    
            rows = sample_context.sample.get_rows()
    
            filtered_rows = [row for row in rows if row[0] in unique_ids]
    
            columns = [col.name for col in sample_context.sample.get_schema().columns]
    
            df = pd.DataFrame(filtered_rows, columns=columns)
    
            # scan_context_set takes both a string and a list of strings to set a nested value
            # This example stores the sample DataFrame in the scan_context in a nested dictionary "samples.soda_demo.public.dim_employee.duplicate_count(gender) = 0": df
            sample_context.scan_context_set(
                [
                    "samples",
                    sample_context.data_source.data_source_name,
                    sample_context.data_source.schema,
                    sample_context.partition.table.table_name,
                    sample_context.check_name,
                ],
                df,
            )
    
    
    if __name__ == "__main__":
        s = Scan()
        s.sampler = CustomSampler()
    
        s.set_scan_definition_name("test_scan")
        s.set_verbose(True)
        s.set_data_source_name("soda_demo")
    
        s.scan_context_set("unique_ids", [1, 2, 3, 4, 5])
    
        s.add_configuration_yaml_str(
            f"""
        data_source soda_demo:
            type: postgres
            schema: public
            host: localhost
            username: ******
            password: ******
            database: postgres
        """
        )
    
        s.add_sodacl_yaml_str(
            f"""
        checks for dim_employee:
            - missing_count(status) = 0
            - failed rows:
                fail condition: employee_key = 1
            # The following check does not collect failed rows samples; it does not invoke the CustomSampler.
            - duplicate_count(gender) = 0:
                samples limit: 0
        """
        )
        s.execute()
    
        # DataFrames created in CustomSampler are available in the scan context.
        print(s.scan_context["samples"])
        # Prints:
        # {
        #     'soda_demo': {
        #         'public': {
        #             'dim_employee': {
        #                 'missing_count(status) = 0': [df]
        #                 'failed rows': [df]
    
        #             }
        #         }
        #     }
        # }
    
        # This simple example collects all queries that end with ".failing_sql", which you can use to execute failed rows queries manually.
        failed_rows_queries = [
            query["sql"] for query in s.scan_results["queries"] if query["name"].endswith(".failing_sql")
        ]
        print(failed_rows_queries)
        # Prints two queries:
        # [
        #     'SELECT * FROM public.dim_employee \n WHERE (status IS NULL)',
        #     '\nWITH frequencies AS (\n    SELECT gender\n    FROM public.dim_employee\n    WHERE gender IS NOT NULL\n    GROUP BY gender\n    HAVING COUNT(*) > 1)\nSELECT main.*\nFROM public.dim_employee main\nJOIN frequencies ON main.gender = frequencies.gender\n'
        # ]
    data_source my_datasource:
      type: postgres
      ...
      sampler:
        disable_samples: True
    data_source my_datasource_name:
      type: postgres
      ...
      sampler:
        exclude_columns:
          customer_info: ['*']
          payment_methods: ['*']
    data_source my_datasource_name:
      type: postgres
      host: localhost
      port: '5432'
      username: ***
      password: ***
      database: postgres
      schema: public
      sampler:
        exclude_columns:
          dataset_sales:
            - commission_percent
            - salary
          customer_%:
            - birthdate
            - credit%
    data_source my_datasource_name:
      type: postgres
      ...
      sampler:
        exclude_columns:
          dataset_sales: [commission_percent, salary]
          customer_%: [birthdate, credit%]
    # disable all failed rows samples on all datasets
    sampler:
      exclude_columns:
        '*': ['*']
    
    # disable failed rows samples on all columns named "password" in all datasets
    sampler:
      exclude_columns:
        '*': [password]
    
    # disable failed rows samples on the "last_name" column and all columns that begin with "pii_" from all datasets that begin with "customer_"
    sampler:
      exclude_columns:
        customer_*: [last_name, pii_*]
    data_source soda_test:
      type: postgres
      host: xyz.xya.com
      port: 5432
      ...
      sampler:
        samples_limit: 50
    from soda.scan import Scan
    scan = Scan()
    scan._configuration.samples_limit = 50
    sampler:
      exclude_columns:
        retail_*: [password]
        retail_customers: [last_name, pii_*]
    checks for dim_customer:
      - duplicate_count(first_name) < 5:
          collect failed rows: true
    checks for dim_customer:
      - duplicate_count(email_address) < 50:
          samples columns: [last_name, first_name]
    configurations for dim_product:
        samples columns: [product_line]
        collect failed rows: true
      
    checks for dim_product:
      - duplicate_count(product_line) = 0
      - missing_percent(standard_cost) < 3%
    checks for dim_customer:
      - missing_count(number_cars_owned) >= 3:
          samples limit: 50
    checks for dim_customer:
      - missing_percent(email_address) < 50:
          samples limit: 99999
    checks for dim_customer:
      - missing_percent(email_address) < 50:
          samples limit: 0
    checks for retail_orders:
      - test_sql:
          test_sql query: SELECT count(*) FROM retail_orders
          failed rows query: SELECT id FROM retail_orders WHERE id IS NULL
          name: With failed row samples
          fail: when > 0
    {
        "check_name": "String",
        "count": "Integer",
        "dataset": "String",
        "datasource": "String",
        "rows": [
            {
                "column1": "String|Number|Boolean",
                "column2": "String|Number|Boolean",
                ...
            }
        ],
        "schema": [
            {
                "name": "String",
                "type": "String"
            }
        ]
    }
    data_source my_datasource_name:
      type: postgres
      host: localhost
      port: '5432'
      username: ***
      password: ***
      database: postgres
      schema: public
      sampler:
        storage:
          type: http
          url: http://failedrows.example.com
          message: Failed rows have been sent to
          link: https://www.example-S3.url
          link_text: S3
    import csv
    import io
    
    # Function to put failed row samples in an AWS Lambda function / Azure function / Google Cloud function
    def lambda_handler(event):
        check_name = event['check_name']
        count = event['count']
        dataset = event['dataset']
        datasource = event['datasource']
        rows = event['rows']
        schema = event['schema']
    
        csv_buffer = io.StringIO()
    
        # Write data to CSV buffer
        csv_writer = csv.writer(csv_buffer)
    
        # Write row header
        header_row = [column['name'] for column in schema]
        csv_writer.writerow(header_row)
    
        # Write each row of data
        for row in rows:
            csv_writer.writerow(row)
    
        # Move to the beginning of the buffer
        csv_buffer.seek(0)
    
        # Read the content of the buffer
        csv_content = csv_buffer.getvalue()
    
        # Print the content
        print(csv_content) 
    column_1_name,column_2_name
    row_1_column_1_value,row_1_column_2_value
    row_2_column_1_value,row_2_column_2_value
    checks for retail_orders:
      - missing_count(cat) = 0
    SELECT * FROM dev_m1n0.sodatest_customers_6c2f3574
     WHERE cat IS NULL
    
    Query soda_test.cat.failed_rows[missing_count]:
    SELECT * FROM dev_m1n0.sodatest_customers_6c2f3574
     WHERE cat IS NULL
    SELECT
      COUNT(CASE WHEN cat IS NULL THEN 1 END)
    FROM sodatest_customers
    
    Query soda_test.get_table_columns_sodatest_customers:
    SELECT column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE lower(table_name) = 'sodatest_customers'
      AND lower(table_catalog) = 'soda'
      AND lower(table_schema) = 'dev_1'
    ORDER BY ORDINAL_POSITION
    
    Skipping columns ['cat'] from table 'sodatest_customers' when selecting all columns data.
    
    Query soda_test.cat.failed_rows[missing_count]:
    SELECT id, cst_size, cst_size_txt, distance, pct, country, zip, email, date_updated, ts, ts_with_tz FROM sodatest_customers
     WHERE cat IS NULL



    Skipping samples from query 'retail_orders.last_name.failed_rows[missing_count]'. Excluded column(s) present: ['*'].
    from soda.scan import Scan
    from soda.sampler.sampler import Sampler
    from soda.sampler.sample_context import SampleContext
    # import SampleRef
    from soda.sampler.sample_ref import SampleRef
    
    # Create a custom sampler by extending the Sampler class
    class CustomSampler(Sampler):
        def store_sample(self, sample_context: SampleContext):
            rows = sample_context.sample.get_rows()
            row_count = len(rows)
            sample_schema = sample_context.sample.get_schema()
            # Provide details about where to access failed row samples
            return SampleRef(
                name=sample_context.sample_name,
                schema=sample_schema,
                total_row_count=row_count,
                stored_row_count=row_count,
                type=SampleRef.TYPE_PYTHON_CUSTOM_SAMPLER,
                link="https://www.example.com",
                message="Access failed row samples for {dataset_label} in external file storage.",
                link_text="File storage",
            )
    
    
    if __name__ == '__main__':
        # Create a Scan object.
        s = Scan()
        ...

    Anomaly detection checks (deprecated)

    Anomaly detection checks use a machine learning algorithm to automatically detect anomalies in your time-series data.

    This check is being deprecated. A new version, rebuilt from the ground up, 70% more accurate and significantly faster, was launched at the Databricks AI Summit 2025. 👉 Try it now!

    Use an anomaly detection check to automatically discover anomalies in your check metrics.

    ✔️ Requires Soda Core Scientific (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library 1.2.2 or greater + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check

    About anomaly detection checks

    The anomaly detection check is powered by a machine learning algorithm that works with measured values for a metric that occur over time. Soda leverages the algorithm to learn patterns in your data so it can identify and flag anomalies. As a relatively easy algorithm to use and tune, Facebook Prophet is ideally suited to both analyzing metrics and giving you control over optional configurations.

    As this check tracks and analyzes metrics over time, the algorithm it uses learns from historical patterns in your data, including trends and seasonal variations in the measurements it collects. After learning the normal behavior of your data, the check becomes capable of detecting variations from the norm which it flags as anomalies.

    Once flagged, Soda can alert you to the anomaly so that you can take action to correct any issues with your data. Alternatively, you can add a notation to an anomalous measurement to indicate that the anomaly is something you expected to see, such as a spike in order volumes during an aggressive marketing campaign, so that the check knows to discount the measurement as an anomaly.

    Importantly, you can fine tune an anomaly detection check to customize some of the algorithm's parameters and improve the check's ability to recognize truly anomalous behavior in your data.

    Install Soda Scientific

    To use an anomaly detection check, you must install Soda Scientific in the same directory or virtual environment in which you installed Soda Library. Best practice recommends installing Soda Library and Soda Scientific in a virtual environment to avoid library conflicts, but you can install them elsewhere if you prefer.

    Soda Scientific is included in Soda Agent deployment.

    1. Set up a virtual environment, and install Soda Library in your new virtual environment.

    2. Use the following command to install Soda Scientific.

    List of Soda Scientific dependencies
    • pandas<2.0.0

    • wheel

    • pydantic>=1.8.1,<2.0.0

    Refer to Troubleshoot Soda Scientific installation for help with issues during installation.

    Define an anomaly detection check

    The following basic examples demonstrate how to use an anomaly detection check with a few metrics. You can use numeric, missing, validity, duplicate, or user-defined metrics with an anomaly detection check.

    The first example simply detects anomalies in row_count measurements for the dataset over time, while the second identifies anomalies in the calculated average of values in the order_price column.

    The third example gauges anomalies in timeliness of the data in the dataset based on the value of the start_date column.
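A minimal SodaCL sketch of these three checks (the dataset and column names are illustrative):

```yaml
checks for dim_customer:
  # 1: detect anomalies in row count over time
  - anomaly detection for row_count
  # 2: detect anomalies in the average of order_price values
  - anomaly detection for avg(order_price)
  # 3: detect anomalies in data timeliness based on start_date
  - anomaly detection for freshness(start_date)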

    The following example includes two user-defined metrics: the first uses a SQL query to define the metric, the second uses CTE to do so.
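A hedged sketch of how such user-defined metrics might be written; the metric names, query, and expression here are illustrative, and the exact syntax for user-defined metrics is covered in the user-defined checks documentation:

```yaml
checks for dim_customer:
  # user-defined metric supplied as a full SQL query
  - anomaly detection for total_order_value:
      total_order_value query: |
        SELECT SUM(order_price) FROM dim_customer
  # user-defined metric supplied as a CTE-style expression
  - anomaly detection for avg_distance:
      avg_distance expression: AVG(distance)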

    The following examples demonstrate how to define a check that detects anomalies in the number of missing values in the id column relative to historical volumes; the second example detects anomalies in the volume of incorrectly formatted email addresses.
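A SodaCL sketch of these two checks (the column names and the validity configuration are illustrative):

```yaml
checks for dim_customer:
  # anomalies in the number of missing values in the id column
  - anomaly detection for missing_count(id)
  # anomalies in the volume of incorrectly formatted email addresses
  - anomaly detection for invalid_count(email):
      valid format: email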

    Anomaly detection check results

    Because the anomaly detection check requires at least four measurements before it can start detecting what counts as an anomalous measurement, your first few scans yield a [NOT EVALUATED] check result that indicates that Soda does not have enough historical data to be able to detect an anomaly.

    Though your first instinct may be to run several scans in a row to produce the four measurements that the anomaly detection needs, the measurements don’t count if the frequency of occurrence is too random, or rather, the measurements don't represent enough of a stable frequency.

    If, for example, you attempt to run eight back-to-back scans in five minutes, the anomaly detection does not register the measurements resulting from those scans as a reliable pattern against which to evaluate an anomaly.

    Consider using the Soda library to set up a programmatic scan that produces a check result for an anomaly detection check on a regular schedule.

    Migrate to anomaly detection

    If you have an existing anomaly score check, you can migrate to use an anomaly detection check. To migrate to the new check, you have three options.

    Default The first option is to create a new anomaly detection check to replace an existing anomaly score check. This is the easiest path and the default behavior, but you lose all the historic check results for the anomaly score check and any feedback that you applied to the anomaly score check's measurements. This means that the algorithm starts from scratch to learn patterns in your data that eventually enable it to identify anomalous measurements.

    To follow this path, revise your existing anomaly score check to use the anomaly detection syntax, as in the following example.
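For example, a check that previously used the anomaly score syntax becomes an anomaly detection check (the row_count metric here is illustrative):

```yaml
# Before: existing anomaly score check
checks for dim_customer:
  - anomaly score for row_count < default

# After: revised anomaly detection check
checks for dim_customer:
  - anomaly detection for row_count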

    Recommended The second option is to create a new anomaly detection check and port the historic anomaly score check results and feedback to the new check. This path retains the anomaly score's historic check results and feedback to preserve the algorithm's memory of patterns in your data, though you cannot directly access the retained information via Soda Cloud.

    To follow this path, revise your existing anomaly score check to use the anomaly detection syntax, and add the take-over... parameter.

    The third option is to keep the existing anomaly score check as is, and create an anomaly detection check in a separate checks YAML file, essentially running two checks in parallel. This path retains the existing anomaly score's historic check results and feedback which you can continue to access directly in Soda Cloud. At the same time, the anomaly detection check begins learning the patterns in your data and accruing separate feedback you add to help it identify anomalous measurements.

    To follow this path, create a new checks YAML file, then add an anomaly detection check to the file. Be sure to include both file names in any Soda scan you run programmatically or via the Soda Library CLI, as in the following example command.

    Reset anomaly history

    If you wish, you can reset an anomaly detection's history, effectively recalibrating what Soda considers anomalous on a dataset.

    1. In Soda Cloud, navigate to the Check History page of the anomaly check you wish to reset.

    2. Click to select a node in the graph that represents a measurement, then click Feedback.

    3. In the modal that appears, you can choose to exclude the individual measurement, or all previous data up to that measurement, the latter of which resets the anomaly detection's history.

    Optional check configurations

    Supported
    Configuration
    Documentation

    Example with quotes

    Example with for each

    Manage alert severity levels

    You can add optional severity_level_parameters to an anomaly detection check to customize the way that Soda determines the severity level of an anomaly. The following example includes two optional severity level parameters.
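A sketch of such a check; the parameter values shown are illustrative:

```yaml
checks for dim_customer:
  - anomaly detection for row_count:
      severity_level_parameters:
        warning_ratio: 0.1
        min_confidence_interval_ratio: 0.001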

    Configuration key
    Value
    Default

    The warning_ratio parameter determines the width of the warning range, which is a buffer on top of the confidence interval. The aim is to help the anomaly detection check distinguish between warning and critical alerts. If the check result is within the warning range, Soda flags the check result as a warning. This behavior is similar to setting fail and warn conditions in other Soda checks.

    Soda calculates the warning range using the following formula:

    For example, if the model's confidence interval is [10, 20] and the warning_ratio is 0.1, the upper warning range is (20, 22] and the lower warning range is [9, 10). If the check result falls within the warning ranges, Soda flags the check result as a warning. If you need wider warning ranges to decrease the number of critical alerts, gradually increase the warning_ratio value until you achieve your ideal range.
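The arithmetic implied by this example can be sketched in Python; the formula below is inferred from the documented numbers, not taken from Soda's implementation:

```python
def warning_range(lower: float, upper: float, warning_ratio: float):
    """Extend a confidence interval [lower, upper] by a warning buffer.

    Inferred from the example: interval [10, 20] with warning_ratio 0.1
    yields a lower warning range starting at 9 and an upper warning
    range ending at 22.
    """
    warn_lower = lower - warning_ratio * lower  # 10 - 0.1 * 10 = 9
    warn_upper = upper + warning_ratio * upper  # 20 + 0.1 * 20 = 22
    return warn_lower, warn_upper

print(warning_range(10, 20, 0.1))  # (9.0, 22.0)
```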

    The graph below illustrates how Soda computes the warning range. The yellow area represents the warning range which is 10% of the confidence interval, as shown with blue and red arrows.

    The min_confidence_interval_ratio parameter determines the minimum width of the confidence interval. The confidence interval is the range of values that the model predicts for the next measurement. If the prediction problem is too easy for the model, the confidence interval becomes too narrow and the model becomes too sensitive to small noises in the data.

    In such cases, the model may flag normal measurements as anomalies due to small decimal differences. To avoid these scenarios, Soda uses a minimum confidence interval width parameter to handle very narrow confidence intervals. The formula that updates the confidence interval is as follows:

    To increase the minimum confidence interval width, you can gradually increase the min_confidence_interval_ratio value until you reach your ideal minimum confidence interval.

    The graph below illustrates the impact of the min_confidence_interval_ratio parameter. For this example, the min_confidence_interval_ratio is set to 0 and the measurements are very easy to predict. With such a low setting, the confidence interval is very narrow and insignificant noises produce many falsely-identified anomalies.

    To artificially introduce a minimum confidence interval buffer to prevent falsely-identified anomalies, this example sets the min_confidence_interval_ratio to the default value of 0.001. The result is a wider confidence interval that is far less sensitive to small noises in the data.

    Add optional training dataset configurations

    A training dataset is one that Soda uses to teach the algorithm to identify patterns in the measurements the check collects. To enhance the flexibility of anomaly detection, you can add an optional training_dataset_parameters configuration to your anomaly detection check to customize the way that the check uses the training dataset. You can apply training dataset configurations to the training dataset, time-series prediction model, and/or the anomaly detection check itself.

    The following example includes three optional, customizable training dataset parameters.
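A sketch of these parameters in SodaCL; the values shown are illustrative:

```yaml
checks for dim_customer:
  - anomaly detection for row_count:
      training_dataset_parameters:
        frequency: D
        window_length: 1000
        aggregation_function: last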

    Configuration key
    Value
    Default

    The frequency parameter determines the regularity of each measurement in the training dataset. If Soda cannot detect a clear frequency, it assumes a frequency of once-daily, and uses the last measurement for each day, if there is more than one measurement per day.

    The window_length parameter sets the number of historical measurements that Soda uses for training the model. The default value is 1000. For instance, if your frequency is daily D, the model trains on the last 1000 days of available historical data to recognize anomalies, ignoring earlier measurements. Be aware that a small value for this parameter may result in less sensitivity to seasonality that Soda recognizes in your data.

    When Soda collects more measurements than the automatically-detected or specified frequency, the aggregation_function parameter defines how Soda aggregates the data within each window. For example, if your frequency is hourly and your aggregation function is last and Soda collected two measurements for the same hour, Soda uses the most recent, or latest, measurement for that hour to gauge anomalies.

    See the example below for a demonstration of how Soda aggregates the training data using the configurations.

    The auto_exclude_anomalies parameter determines whether Soda ignores or includes unusual data points in the training dataset. When set to True, Soda excludes anomalies from future model training without the need to manually provide feedback in the Soda Cloud user interface. Though excluded from the training dataset, Soda still issues alerts when new anomalies occur.

    To understand the effect of the parameter, the examples below present the difference between settings. In the image on the left, the parameter is set to False, so Soda includes existing, recorded anomalies in the training dataset, which leads to broader confidence intervals, indicated in green and yellow. In contrast, with the parameter set to True, as in the image on the right, Soda excludes existing anomalies from the training dataset, resulting in narrower confidence intervals over time.

    Add optional alert directionality configuration

    The alert_directionality setting lets you choose which types of anomalies you want to receive alerts for. For example, if you only want to be alerted about unusually low values (and not high ones), set alert_directionality: "lower_bound_only".

    By default, alert_directionality is set to "upper_and_lower_bounds", which means you'll get alerts for both high and low anomalies, just like before this option was available. If you prefer not to be alerted about values that fall below the lower confidence interval, switch to "upper_bound_only", as shown in the right of the image below.
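A sketch of this setting, assuming it nests under severity_level_parameters alongside the parameters described earlier:

```yaml
checks for dim_customer:
  - anomaly detection for row_count:
      severity_level_parameters:
        alert_directionality: "upper_bound_only"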

    Add optional model configurations

    The anomaly detection check uses Facebook Prophet to train the model that detects anomalies. If you wish, you can add a model configuration to customize the hyperparameters and tune the model.

    Facebook Prophet uses a variety of hyperparameters that influence the model's ability to accurately detect anomalies in time-series data. Because fine-tuning these customizable parameters can be quite complex, Soda offers two out-of-the-box, fine-tuned profiles that automatically optimize the model's performance according to your anomaly sensitivity preference.

    There are two values you can use for the profile parameter:

    • coverage

    • MAPE

    Alternatively, you can customize your own hyperparameters.
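A sketch of selecting a profile; the nesting under the model key is an assumption based on this section:

```yaml
checks for dim_customer:
  - anomaly detection for row_count:
      model:
        hyperparameters:
          static:
            profile: coverage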

    Configuration key
    Value
    Default

    For each of these values, Soda has adjusted the values of a few of the model's hyperparameters to tailor its sensitivity to anomalies, particularly the changepoint_prior_scale and seasonality_prior_scale hyperparameters.

    coverage refers to the concept of prediction interval coverage and represents the percentage of actual measurements that fall within the model's predicted confidence intervals. For example, if a model forecasts sales between 10-20 units on certain days and 90 out of 100 actual sales figures fall within this range, the coverage is 90%. This coverage-optimized profile is more tolerant of small noises in the data that can lead to falsely-identified anomalies, since it uses larger confidence intervals to cover as many measurements as possible. However, the model might underfit the data if there is a fluctuating pattern.

    For reference, the following lists the hyperparameters that Soda has set for the coverage profile.

    The MAPE value refers to mean absolute percentage error, which is a statistical measure of how accurate a forecasting method is. It calculates the average percentage error between the forecasted and the actual values. This profile aims to maximize prediction precision: the lower the MAPE value, the more accurate the model's predictions. When optimizing for MAPE, the model is more sensitive to changepoints and seasonal variations, providing a tighter fit to the training data.

    For reference, the following lists the hyperparameters that Soda has set for the MAPE profile.

    coverage is less sensitive to anomalies than MAPE. If you have set the profile value to coverage and find that the model seems to miss some anomalies, try changing the value to MAPE. Conversely, if you set the value to MAPE and find that the model is mistakenly identifying normal measurements as anomalies, try changing the value to coverage.

    See Address common anomaly detection issues for further guidance.

    Customize hyperparameters

    If the Soda-tuned profiles do not meet your specific data and forecasting needs for model sensitivity, you can customize Prophet's hyperparameters using the custom_hyperparameters configuration.

    You can modify any hyperparameter supported by Facebook Prophet in the custom_hyperparameters section of your configuration. For in-depth guidance, refer to Prophet's documentation.

    It is important to note that customized hyperparameters override the Soda-tuned coverage hyperparameter profile. For example, if you set the changepoint_prior_scale hyperparameter to 0.05 in the custom_hyperparameters section, the model uses this value instead of the 0.001 value that Soda sets for the coverage profile. The other hyperparameters remain the same as the coverage profile.

    The following example specifies custom values for the seasonality_mode and interval_width hyperparameters; not shown are the remaining parameters set to mimic the coverage profile settings.

    Customize country-specific holidays

    Add a holidays_country_code parameter to customize your anomaly detection check to account for country-specific holidays. Access the list of available country codes in the public holidays repository.

    For example, the following configuration accounts for US American holidays in the model.

    Facebook Prophet's holidays_prior_scale hyperparameter, defaulted at 10.0, controls how much holidays influence the model. If holidays have a minimal impact on your data, set a lower value for holidays_prior_scale between 0.01 and 10 as in the following example, to decrease holiday sensitivity and ensure more accurate model representation for non-holiday periods.
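A sketch combining the country code with a lowered holidays_prior_scale; the nesting shown is an assumption, and the values are illustrative:

```yaml
checks for dim_customer:
  - anomaly detection for row_count:
      model:
        holidays_country_code: US
        hyperparameters:
          custom_hyperparameters:
            holidays_prior_scale: 5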

    Add optional dynamic hyperparameter tuning configurations

    To dynamically tune Prophet to evaluate and select the best hyperparameters values to use before each scan, you can add a dynamic parameter and any number of optional hyperparameter configurations. Be aware that hyperparameter tuning can be time-consuming and resource-intensive, so best practice dictates that you use these configurations sparingly.

    The following offers an example of how to add automatic hyperparameter tuning. This configuration allows the anomaly detection model to adapt and improve over time by identifying the most effective hyperparameter settings for your specific data. Remember to weigh the benefits of improved accuracy against the increased computational demands of this process.

    Configuration key
    Value
    Default

    The objective_metric hyperparameter evaluates the model's performance. You can set the value to a single string, or a list of strings. If you provide a list, the model optimizes each metric in sequence. In the example above, the model first optimizes for coverage, then SMAPE in the event of a tie. Best practice dictates that you use coverage as the first objective metric and SMAPE as the second to optimize for a model that is more tolerant of noise in your data.

    The parallel hyperparameter specifies whether the model saves time by using multiprocessing to parallelize the cross-validations. Set the value to True if you have multiple cores.

    The cross_validation_folds hyperparameter sets the number of periods for each cross-validation fold. For example, with the frequency set to daily D and a cross_validation_folds of 5, the model conducts cross-validation in five-day intervals. It trains on the first n-5 days, then tests on the n-4th day. Subsequently, it trains on n-4 days, testing on the n-3rd day, and so on. The cross-validation process computes the objective_metric across different data segments for each hyperparameter combination. The model then uses the best objective_metric according to the value or list of values configured for that hyperparameter.

    The parameter_grid hyperparameter is a dictionary that lists hyperparameters and their possible values. The model tests every possible combination of the listed values for each hyperparameter to identify the best value to use to detect anomalies. You can configure any Prophet hyperparameter in the grid.
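Putting the dynamic tuning parameters together, a hedged sketch (the nesting and grid values are illustrative; a two-by-four grid like this yields 16 combinations):

```yaml
checks for dim_customer:
  - anomaly detection for row_count:
      model:
        hyperparameters:
          dynamic:
            objective_metric: ["coverage", "smape"]
            parallel: True
            cross_validation_folds: 5
            parameter_grid:
              changepoint_prior_scale: [0.001, 0.01, 0.1, 0.5]
              seasonality_prior_scale: [0.01, 0.1, 1.0, 10.0]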

    Execution time analysis for dynamic hyperparameter tuning

    The execution time for dynamic hyperparameter tuning varies based on several factors including the number of hyperparameters and the number of folds. For example, the default hyperparameter grid has 16 combinations since changepoint_prior_scale and seasonality_prior_scale have four values each. Consider using a small number of hyperparameters to avoid long execution times. By default, the model processes each fold in parallel. If you use multiple cores, you can set the parallel parameter to True to speed up the execution time.

    Use the following tables to estimate the execution time for checks with dynamic hyperparameter tuning.

    Parallel
    Number of hyperparameters
    Number of folds
    Training dataset size
    Execution time

    Best practices for model configurations

    • Set the value of the profile parameter to coverage. This profile is more tolerant of small noises in the data that could lead to falsely-identified anomalies. If you need a very sensitive model, try the MAPE profile instead.

    • Only use the custom_hyperparameters configuration if you know how Facebook Prophet works. Before making any customizations, consult the Prophet documentation. The changepoint_prior_scale and seasonality_prior_scale hyperparameters have the most impact on the model, so best practice dictates that you experiment with the values of these two hyperparameters first before customizing or tuning others.

    Test optional configuration using a simulator

    Requires Soda Library CLI

    Soda provides an anomaly detection simulator to enable you to test and observe how parameter adjustments you make impact the algorithm's confidence interval and anomaly detection sensitivity. The purpose of this local, simulator application is to help you to choose the most suitable parameter settings for your anomaly detection needs.

    1. From the command-line, install the simulator package using the following command.

    4. To launch the application, use the following command. After running the command, a new tab opens in your default browser displaying the simulator as shown in the screenshot below.

    1. Paste the check URL you copied for your anomaly check into the main field and press enter. Refer to the screenshot below.

    2. Use the slider that appears to simulate the most recent n measurements, ideally not more than 60 so as to keep the simulator execution time reasonable.

    3. Click Start Simulation to display graphic results using the default parameter values.

    Adjust advanced simulator parameters

    For Model Hyperparameter Profiles, the first two options correspond with the coverage and MAPE profiles described in Add optional model configurations, and the third option, custom, corresponds to the ability to customize your own hyperparameters. Experiment with the coverage and MAPE profiles first, before considering the custom profile.

    For further hyperparameter customization, turn on the Advanced toggle in the simulator and edit hyperparameters in the Custom Prophet Hyperparameters field which accepts JSON, as in the example below. Note that if you do not specify a customized value for a hyperparameter, Soda uses the default values from the coverage profile.

    For Training Dataset Parameters, the adjustable settings correspond to the parameters in Add optional training dataset configurations.

    For Severity Level Parameters, the adjustable settings correspond to the parameters in Manage alert severity levels.

    Address common anomaly detection issues

    What follows are some examples of how to adjust optional configurations to address common issues with the sensitivity of anomaly detection checks.

    Insensitive detection

    The default coverage hyperparameter profile is more tolerant of small noises in data quality measurements. However, as in the following example, the profile may not be sensitive enough if there are fluctuating data patterns. This is because the coverage profile uses a low changepoint_prior_scale=0.001 value and a low seasonality_prior_scale=0.01 which make the model less sensitive to changepoints.

    As in the following graph, the predicted ŷ values produce a steady trend and the algorithm does not capture the fluctuating pattern of the actual measurements. As a result, it misses the anomaly in the red rectangle.

    In such a case, consider using the MAPE profile which is more sensitive to changepoints and seasonal variations.

    With the profile set to MAPE, the model uses higher changepoint_prior_scale=0.1 and seasonality_prior_scale=0.1 values, which makes it more sensitive to changepoints and seasonal variations. The graph below illustrates the higher sensitivity, wherein the algorithm recognizes more measurements as anomalous. As a result, ŷ values better capture the fluctuating pattern of the actual measurements over time.

    Consecutive falsely-identified anomalies

    To decrease the rate of falsely-detected anomalies, Soda optimized the default hyperparameters of the anomaly detection check to detect anomalies in time-series data which exhibits a stable pattern. If the data exhibits pattern changes, as illustrated in the graph below, you may need to adjust the default parameters to improve the model's ability to detect anomalies to prevent alert fatigue.

    As an example, the graph below indicates that up until November 2023, the data follows a stable pattern and the coverage profile is sufficient to detect anomalies. However, after November 2023, the pattern changes and the model needs to adapt to the new pattern. The default coverage profile has very low changepoint_prior_scale=0.001 and seasonality_prior_scale=0.01 values, which makes the model insensitive to trend changes. For this reason, during the adaptation period, the model falsely identified consecutive measurements as anomalies for a long time; refer to the red rectangle in the graph below.

    In such a case, consider using the MAPE profile as a first action, as explained in the previous section. Because the MAPE profile is more sensitive, it converges faster than the coverage profile when a pattern changes; see the graph below.

    The MAPE profile achieves a much better fit since ŷ values closely follow the actual measurements. Compared to the coverage profile, MAPE causes fewer false positives, but it still falsely identifies consecutive measurements as anomalies for a long time. This is because the model uses the last 1000 measurements to gauge pattern changes, and it takes time to adapt to the new pattern which, in this case, is a weekly seasonality.

    Each Monday, there is a jump in the y value and the other days follow a steady increase. Thus, using the last four weeks' data points, or 30 measurements, calibrates the model better than using the last 1000 measurements, so it can capture the weekly seasonality effect. In such a case, consider decreasing the window_length parameter to 30, or experiment with different values to find the optimal window_length for your data and business use case. Refer to Add optional training dataset configurations for guidance.

    Having adjusted the window_length and MAPE profile, the graph below illustrates that the model is more sensitive to recent measurements and does not create alert fatigue after November 2023; refer to the green rectangle.

    Large boundaries that ignore anomalies

    Anomalous records can confuse the model and cause excessively large confidence intervals if the model does not ignore anomalous measurements. Consider the graph below: because of the anomalies in the red rectangle, the model's confidence interval is very large and the model is not sensitive to anomalies in the blue rectangle.

    The anomalies create larger intervals because the model uses them for training data. To address the issue, consider removing these anomalous records from the training dataset. Use Soda Cloud to ignore the anomalies in the red rectangle by using the Feedback feature. Hover over the anomalous measurement in your anomaly detection check page, then click the Feedback button and choose to Ignore this value in future anomaly detection as in the screenshot below.

    After instructing Soda to ignore the anomalous measurements, the model's confidence interval is smaller and the model is more sensitive to anomalies, as indicated in the graph below.

    Track anomalies and relative changes by group

    You can use a group by configuration to detect anomalies by category, and monitor relative changes over time in each category.

    ✔️ Requires Soda Core Scientific for anomaly check (included in a Soda Agent) ✖️ Supported in Soda Core ✔️ Supported in Soda Library 1.1.27 or greater + Soda Cloud ✔️ Supported in Soda Cloud Agreements + Soda Agent 0.8.57 or greater ✖️ Available as a no-code check

    The following example includes three checks grouped by gender.

    • The first check uses the custom metric average_children to collect measurements and gauge them against an absolute threshold of 2. Soda Cloud displays the check results grouped by gender.

    • The second check uses the same custom metric to detect anomalous measurements relative to previous measurements. Soda must collect a minimum of four regular-cadence measurements to have enough data from which to gauge an anomalous measurement. Until it has enough measurements, Soda returns a check result of [NOT EVALUATED]. Soda Cloud displays any detected anomalies grouped by gender.
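A hedged sketch of such grouped checks; the table, column names, query, and threshold direction are illustrative, and the exact group by syntax is covered in the SodaCL reference:

```yaml
checks for dim_employee:
  - group by:
      query: |
        SELECT gender, AVG(number_children) AS average_children
        FROM dim_employee
        GROUP BY gender
      fields:
        - gender
      checks:
        # absolute threshold, evaluated per gender group
        - average_children < 2
        # anomaly detection on the same metric, per gender group
        - anomaly detection for average_children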

    Troubleshoot Soda Scientific installation

    While installing Soda Scientific works on Linux, you may encounter issues if you install Soda Scientific on Mac OS (particularly, machines with the M1 ARM-based processor) or any other operating system. If that is the case, consider using one of the following alternative installation procedures.

    Need help? Ask the team in the Soda community on Slack.

    Install Soda Scientific Locally

    1. Set up a virtual environment, and install Soda Library in your new virtual environment.

    2. Use the following command to install Soda Scientific.

    List of Soda Scientific dependencies
    • pandas<2.0.0

    • wheel

    • pydantic>=1.8.1,<2.0.0

    Use Docker to run Soda Library

    Use Soda’s Docker image in which Soda Scientific is pre-installed. You need Soda Scientific to be able to use SodaCL anomaly detection checks or distribution checks.

    1. If you have not already done so, install Docker in your local environment.

    2. From Terminal, run the following command to pull Soda Library’s official Docker image; adjust the version to reflect the most recent release.

    3. Verify the pull by running the following command.

      Output:

      When you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.

    What does the scan command do?
    • docker run ensures that the docker engine runs a specific image.

    • -v mounts your SodaCL files into the container. In other words, it makes the configuration.yml and checks.yml files in your local environment available to the docker container. The command example maps your local directory to /sodacl.
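Putting those pieces together, a scan command might look like the following sketch; the local path, data source name, and version tag are illustrative:

```shell
docker run -v /Users/MyName/soda_project:/sodacl \
  sodadata/soda-library:v1.2.2 \
  scan -d my_datasource -c /sodacl/configuration.yml /sodacl/checks.yml
```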

    Error: mounts denied

    If you encounter the following error, follow the procedure below.

    You need to give Docker permission to access your configuration.yml and checks.yml files in your environment. To do so:

    1. Access your Docker Dashboard, then select Preferences (gear symbol).

    2. Select Resources, then follow the Docker instructions to add your Soda project directory—the one you use to store your configuration.yml and checks.yml files—to the list of directories that can be bind-mounted into Docker containers.

    3. Click Apply & Restart, then repeat steps 2 - 4 above.

    Error: Configuration path does not exist

    If you encounter the following error, double check the syntax of the scan command in step 4 above.

    • Be sure to prepend /sodacl/ to both the configuration.yml filepath and the checks.yml filepath.

    • Be sure to mount your files into the container by including the -v option. For example, -v /Users/MyName/soda_project:/sodacl.

    Troubleshoot Soda Scientific installation in a virtual env

    If you have defined an anomaly detection check and you use an M1 macOS machine, you may get a Library not loaded: @rpath/libtbb.dylib error. This is a known issue in the macOS community and is caused by issues during the installation of the prophet library. There currently are no official workarounds or releases to fix the problem, but the following adjustments may address the issue.

    1. Install soda-scientific as per the local environment installation instructions and activate the virtual environment.

    2. Use the following command to navigate to the directory in which the stan_model of the prophet package is installed in your virtual environment.

      For example, if you have created a python virtual environment in a /venvs directory in your home directory and you use Python 3.9, you would use the following command.

    Go further

    • Learn more about the anomaly dashboard for datasets.

    • Reference tips and best practices for SodaCL.

    # Basic example for row count
    checks for dim_customer:
      - anomaly detection for row_count

    # Advanced example with optional training and model configurations
    checks for dim_customer:
      - anomaly detection for row_count:
          name: "Anomaly detection for row_count" # optional
          identity: "anomaly-detection-row-count" # optional
          severity_level_parameters: # optional
            warning_ratio: 0.1
            min_confidence_interval_ratio: 0.001
          training_dataset_parameters: # optional
            frequency: auto
            window_length: 1000
            aggregation_function: last
            auto_exclude_anomalies: True
          alert_directionality: "upper_and_lower_bounds" # optional
          model: # optional
            hyperparameters:
              static:
                profile:
                  custom_hyperparameters:
                    changepoint_prior_scale: 0.05
                    seasonality_prior_scale: 10
                    seasonality_mode: additive
                    interval_width: 0.999
                    changepoint_range: 0.8
              dynamic:
                objective_metric: ["mape", "rmse"]
                parallelize_cross_validation: True
                cross_validation_folds: 2
                parameter_grid:
                  changepoint_prior_scale: [0.001]
                  seasonality_prior_scale: [0.01, 0.1]
                  seasonality_mode: ['additive', 'multiplicative']
                  changepoint_range: [0.8]
                  interval_width: [0.999]


  • ✓ Use quotes when identifying dataset names. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

  • – Use wildcard characters ( % or * ) in values in the check.

  • ✓ Use for each to apply anomaly detection checks to multiple datasets in one scan.

  • ✓ Apply a dataset filter to partition data during a scan.

    Yes | 8 | 5 | 30 | 4.5 sec
    Yes | 8 | 5 | 90 | 6.05 sec
    No | 4 | 5 | 30 | 5.8 sec
    No | 4 | 5 | 90 | 8.05 sec
    No | 4 | 10 | 30 | 7.2 sec
    No | 4 | 10 | 90 | 10.6 sec
    Yes | 4 | 10 | 30 | 2.5 sec
    Yes | 4 | 10 | 90 | 3.06 sec

  • Adjust the value of the interval_width hyperparameter to obtain a more anomaly-sensitive model. The default value for this hyperparameter is 0.999 which means that the model applies a confidence interval of 99.9% which, in turn, means that if the predicted value is outside of the 99.9% interval, Soda flags it as an anomaly. If you want to have a more sensitive model, you can decrease this value though be aware that a lower value may result in more falsely-identified anomalies.

  • Use the dynamic tuning configuration only if necessary. Hyperparameter tuning is a computationally expensive process since the model tries all possible combinations of each hyperparameter's listed values to dynamically determine the best value to use to detect anomalies. See Execution time analysis for dynamic hyperparameter tuning. If you need to use hyperparameter tuning, experiment with tuning the values of the changepoint_prior_scale and seasonality_prior_scale hyperparameters first as these two have the most impact on the model's sensitivity.

  • Use the tools in the sidebar to adjust parameter settings until the simulator displays your ideal anomaly sensitivity results. Apply your optimized parameter settings to the check configuration in your checks YAML file.
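    The cost of dynamic tuning grows multiplicatively with the size of the parameter grid, since a model is fit once per hyperparameter combination per cross-validation fold. A back-of-the-envelope sketch in Python, using grid values that mirror the defaults shown on this page:

```python
from itertools import product

# Grid values mirroring the default parameter_grid shown on this page
parameter_grid = {
    "changepoint_prior_scale": [0.001, 0.01, 0.1, 0.5],
    "seasonality_prior_scale": [0.01, 0.1, 1.0, 10.0],
    "seasonality_mode": ["additive", "multiplicative"],
}

# Every combination of the listed values is tried during tuning
combinations = list(product(*parameter_grid.values()))
cross_validation_folds = 5

print(len(combinations))                           # 32 hyperparameter combinations
print(len(combinations) * cross_validation_folds)  # 160 model fits per tuning run
```

Trimming even one list in the grid cuts the number of fits multiplicatively, which is why narrowing the grid is the first lever for reducing tuning time.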

    The third check uses the same custom metric to detect changes over time in the calculated average measurement, and gauge the measurement against a threshold of between -5 and 5 relative to the previously-recorded measurement. See Change-over-time thresholds for supported syntax variations for change-over-time checks. Soda Cloud displays any detected changes grouped by gender.


  • When you are ready to run a Soda scan, use the following command to run the scan via the docker image. Replace the placeholder values with your own file paths and names.

    Optionally, you can specify the version of Soda Library to use to execute the scan. This may be useful when you do not wish to use the latest released version of Soda Library to run your scans. The example scan command below specifies Soda Library version 1.0.0.

  • sodadata/soda-library refers to the image that docker run must use.

  • scan instructs Soda Library to execute a scan of your data.

  • -d indicates the name of the data source to scan.

  • -c specifies the filepath and name of the configuration YAML file.

  • Use the ls command to determine the version number of cmdstan that prophet installed. The cmdstan directory name includes the version number.

  • Add the rpath of the tbb library to your prophet installation using the following command.

    With cmdstan version 2.26.1, you would use the following command.

  • ✓ Define a name for an anomaly detection check; see Customize check names.

  • ✓ Add an identity to a check; see Add a check identity.

  • – Define alert configurations to specify warn and fail thresholds. See alternative: Manage alert severity levels.

  • ✓ Apply an in-check filter to return results for a specific portion of the data in your dataset.

    severity_level_parameters:

    • warning_ratio — decimal between 0 and 1; default: 0.1

    • min_confidence_interval_ratio — decimal between 0 and 1; default: 0.001

    training_dataset_parameters:

    • frequency — auto: automatically detected by Soda; T or min: by minute; H: by hour; D: by calendar day; B: by business day; W: by week; M: by month end; MS: by month start; Q: by quarter end; QS: by quarter start; A: by year end; AS: by year start; or a customized value, such as 5H for every 5 hours. Default: auto

    • window_length — integer, number of historical measurements; default: 1000

    • aggregation_function — last: uses the last non-null value in the window; first: uses the first non-null value in the window; mean: calculates the average of values in the window; min: uses the minimum value in the window; max: uses the maximum value in the window. Default: last

    • auto_exclude_anomalies — boolean, True or False

    model (static hyperparameters):

    • type — prophet; default: prophet

    • profile — coverage or MAPE; default: coverage

    model (dynamic hyperparameters):

    • objective_metric — coverage, MSE, RMSE, MAE, MAPE, MDAPE, or SMAPE; default: n/a

    • parallel — true or false; default: true

    • cross_validation_folds — integer; default: 5

    • parameter_grid — any Prophet-supported hyperparameters

    Model Name: MacBook Pro
    Model Identifier: MacBookPro18,3
    Chip: Apple M1 Pro
    Number of Cores: 10 (8 performance and 2 efficiency)
    Memory: 16 GB

    Yes | 4 | 5 | 30 | 2.23 sec
    Yes | 4 | 5 | 90 | 2.80 sec


    Need help? Join the Soda community on Slack.

    False. Soda automatically includes anomalies in the training dataset unless you add this parameter and set it to True.

    changepoint_prior_scale: [0.001, 0.01, 0.1, 0.5]
    seasonality_prior_scale: [0.01, 0.1, 1.0, 10.0]
    (other hyperparameters set to the defaults in the coverage profile)

    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library:v1.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    ls
    cmdstan-2.26.1		prophet_model.bin
    install_name_tool -add_rpath @executable_path/cmdstan-your_cmdstan_version/stan/lib/stan_math/lib/tbb prophet_model.bin
    install_name_tool -add_rpath @executable_path/cmdstan-2.26.1/stan/lib/stan_math/lib/tbb prophet_model.bin
    pip install -i https://pypi.cloud.soda.io soda-scientific
    checks for dim_customer:
      - anomaly detection for row_count
    checks for orders:
      - anomaly detection for avg(order_price)
    checks for dim_promotion:
      - anomaly detection for freshness(start_date)
    checks for dim_customer:
      - anomaly detection for customers:
          customers query: |
            SELECT COUNT(*)
            FROM dim_customer
    
    checks for dim_reseller:
      - avg_order_span between 5 and 10:
          avg_order_span expression: AVG(last_order_year - first_order_year)
      - anomaly detection for avg_order_span
    checks for orders:
      - anomaly detection for missing_count(id):
          missing values: [None, No Value] 
    checks for dim_customer:
      - anomaly detection for invalid_count(user_email):
          valid format: email
    Soda Library 1.0.x
    Soda Core 3.0.0x
    Anomaly Detection Frequency Warning: Coerced into daily dataset with last daily time point kept
    Data frame must have at least 4 measurements
    Skipping anomaly metric check eval because there is not enough historic data yet
    Scan summary:
    1/1 check NOT EVALUATED: 
        dim_customer in adventureworks
          anomaly detection for missing_count(last_name) [NOT EVALUATED]
            check_value: None
    1 checks not evaluated.
    Apart from the checks that have not been evaluated, no failures, no warnings and no errors.
    Sending results to Soda Cloud
    checks for dim_customer:
    # previous syntax
    #  - anomaly score for row_count < default
    # new syntax
      - anomaly detection for row_count:
          name: Anomalies in dataset
    checks for dim_customer:
      - anomaly detection for row_count:
          name: Anomalies in dataset
          take_over_existing_anomaly_score_check: True
    soda scan -d adventureworks -c configuration.yml checks_anomaly_score.yml checks_anomaly_detection.yml
    checks for dim_product:
      - anomaly detection for avg("order_price")
    for each dataset T:
      datasets:
        - dim_customer
      checks:
        - anomaly detection for row_count
    checks for dim_customer:
      - anomaly detection for row_count:
          severity_level_parameters:
            warning_ratio: 0.1
            min_confidence_interval_ratio: 0.001
    upper confidence interval: u
    lower confidence interval: l
    confidence interval width: w = u - l
    upper warning range = between u and (u + warning_ratio * w)
    lower warning range = between l and (l - warning_ratio * w)
    upper confidence interval: u
    lower confidence interval: l
    predicted value: y^
    upper confidence interval width: w_u = u - y^
    lower confidence interval width: w_l = y^ - l
    
    u = max(u, y^ + y^ * min_confidence_interval_ratio)
    l = min(l, y^ - y^ * min_confidence_interval_ratio)
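    The range arithmetic above can be sketched in a few lines of Python. This is illustrative only, not Soda Library code; the function name is hypothetical:

```python
def severity_ranges(u, l, y_hat, warning_ratio=0.1,
                    min_confidence_interval_ratio=0.001):
    """Illustrate how warning_ratio and min_confidence_interval_ratio
    shape the pass and warn ranges around a predicted value y_hat."""
    # Enforce a minimum confidence interval width around the prediction
    u = max(u, y_hat + y_hat * min_confidence_interval_ratio)
    l = min(l, y_hat - y_hat * min_confidence_interval_ratio)
    w = u - l  # confidence interval width
    return {
        "pass": (l, u),
        "warn_upper": (u, u + warning_ratio * w),
        "warn_lower": (l - warning_ratio * w, l),
    }

# Example: a prediction of 100 with a confidence interval of [90, 110]
ranges = severity_ranges(u=110.0, l=90.0, y_hat=100.0)
print(ranges["warn_upper"])  # (110.0, 112.0): beyond 112.0 the check fails
```

With the default warning_ratio of 0.1, the warn band extends the confidence interval by 10% of its width on each side; measurements beyond the warn band fail the check.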
    checks for dim_customer:
      - anomaly detection for row_count:
          training_dataset_parameters:
            frequency: auto
            window_length: 1000
            aggregation_function: last
            auto_exclude_anomalies: False
    checks for dim_customer:
      - anomaly detection for row_count:
          name: Anomaly detection for row_count
          model: 
            hyperparameters:
              static:
                profile: coverage 
    # hyperparameters set by Soda for the coverage profile
    seasonality_mode = "multiplicative"
    seasonality_prior_scale = 0.01
    changepoint_prior_scale = 0.001
    interval_width = 0.999
    
    # other default hyperparameters set by Facebook Prophet 
    growth = "linear"
    changepoints = None
    n_changepoints = 25
    changepoint_range = 0.8
    yearly_seasonality = "auto"
    weekly_seasonality = "auto"
    daily_seasonality = "auto"
    holidays = None
    holidays_prior_scale = 10.0
    mcmc_samples = 0
    uncertainty_samples = 1000
    stan_backend = None
    scaling = "absmax"
    holidays_mode = None
    # hyperparameters set by Soda for the MAPE profile
    seasonality_mode = "multiplicative"
    seasonality_prior_scale = 0.1
    changepoint_prior_scale = 0.1
    interval_width = 0.999
    
    # other default hyperparameters set by Facebook Prophet
    growth = "linear"
    changepoints = None
    n_changepoints = 25
    changepoint_range = 0.8
    yearly_seasonality = "auto"
    weekly_seasonality = "auto"
    daily_seasonality = "auto"
    holidays = None
    holidays_prior_scale = 10.0
    mcmc_samples = 0
    uncertainty_samples = 1000
    stan_backend = None
    scaling = "absmax"
    holidays_mode = None
    checks for dim_customer:
      - anomaly detection for row_count:
          name: Anomaly detection for row_count
          model:
            hyperparameters:
              static:
                profile:
                  custom_hyperparameters:
                    seasonality_mode: additive
                    interval_width: 0.8
                    ...
    checks for dim_customer:
      - anomaly detection for row_count:
          name: Anomaly detection for row_count
          model:
            holidays_country_code: US
    checks for dim_customer:
      - anomaly detection for row_count:
          name: Anomaly detection for row_count
          model:
            holidays_country_code: US
            hyperparameters:
              static:
                profile:
                  custom_hyperparameters:
                    holidays_prior_scale: 0.1
    checks for dim_customer:
      - anomaly detection for row_count:
          model:
            hyperparameters:
              dynamic:
                objective_metric: ["coverage", "SMAPE"]
                parallelize_cross_validation: True
                cross_validation_folds: 5
                parameter_grid:
                  changepoint_prior_scale: [0.001, 0.01, 0.1, 0.5]
                  seasonality_prior_scale: [0.01, 0.1, 1.0, 10.0]
                  seasonality_mode: ['additive', 'multiplicative']
    pip install -i https://pypi.cloud.soda.io "soda-scientific[simulator]"
    soda simulate-anomaly-detection -c configuration.yaml
    {
      "growth": "linear",
      "changepoint_prior_scale": 0.1,
      "seasonality_prior_scale": 0.1,
      "n_changepoints": 20,
    }
    checks for your-table-name:
      - anomaly detection for your-metric-name:
          model:
            hyperparameters:
              static:
                profile: MAPE
    checks for your-table-name:
      - anomaly detection for your-metric-name:
          training_dataset_parameters:
            window_length: 30
          model:
            hyperparameters:
              static:
                profile: MAPE
    checks for dim_customer:
      - group by:
          name: Group by gender
          query: |
            SELECT gender, AVG(total_children) as average_children
            FROM dim_customer
            GROUP BY gender
          fields:
            - gender
          checks:
            - average_children > 2:
                name: Average children per gender should be more than 2
            - anomaly detection for average_children:
                name: Detect anomaly for average children
            - change for average_children between -5 and 5:
                name: Detect unexpected changes for average children
    pip install -i https://pypi.cloud.soda.io soda-scientific
    docker pull sodadata/soda-library:v1.0.3
    docker run sodadata/soda-library:v1.0.3 --help
     Usage: soda [OPTIONS] COMMAND [ARGS]...
    
       Soda Library CLI version 1.0.x, Soda Core CLI version 3.0.xx
    
     Options:
       --version  Show the version and exit.
       --help     Show this message and exit.
    
     Commands:
       ingest           Ingests test results from a different tool
       scan             Runs a scan
       suggest          Generates suggestions for a dataset
       test-connection  Tests a connection
       update-dro       Updates contents of a distribution reference file
    WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
    docker: Error response from daemon: Mounts denied: 
    The path /soda-library-test/files is not shared from the host and is not known to Docker.
    You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
    See https://docs.docker.com/desktop/mac for more info.
    Soda Library 1.0.x
    Configuration path 'configuration.yml' does not exist
    Path "checks.yml" does not exist
    Scan summary:
    No checks found, 0 checks evaluated.
    2 errors.
    Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
    ERRORS:
    Configuration path 'configuration.yml' does not exist
    Path "checks.yml" does not exist
    cd path_to_your_python_virtual_env/lib/python<your_version>/site-packages/prophet/stan_model/
    cd ~/venvs/soda-library-prophet11/lib/python3.9/site-packages/prophet/stan_model/

    Deploy a Soda Agent

    Learn how to deploy a Soda Agent in a Kubernetes cluster.

    The Soda environment has been updated since this tutorial.

    Refer to v4 documentation for updated tutorials.

    The Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality. Create a Kubernetes cluster, then use Helm to deploy a self-hosted Soda Agent in the cluster.

    This setup enables Soda Cloud users to securely connect to data sources (BigQuery, Snowflake, etc.) from within the Soda Cloud web application. Any user in your Soda Cloud account can add a new data source via the agent, then write their own no-code checks and agreements to check for data quality in the new data source. Alternatively, if you use a BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake data source, you can use a secure, out-of-the-box Soda-hosted agent made available for every Soda Cloud organization.

    As a step in the Get started roadmap, this guide offers instructions to set up, install, and configure Soda in a Kubernetes cluster.

    Get started roadmap

    1. Choose a flavor of Soda

    2. Set up Soda: self-hosted agent 📍 You are here!

    3. Write SodaCL checks

    4. Run scans and review results

    Create a Soda Cloud account

    The Soda Agent communicates with your Soda Cloud account using API public and private keys. Note that the keys a Soda Agent uses are different from the API keys Soda Library uses to connect to Soda Cloud.

    1. If you have not already done so, create a Soda Cloud account at cloud.soda.io. If you already have a Soda account, log in.

    2. In your Soda Cloud account, navigate to your avatar > Data Sources, then navigate to the Agents tab. Click New Soda Agent.

    3. The dialog box that appears offers abridged instructions to set up a new Soda Agent from the command-line; more thorough instructions exist in this documentation, below. For now, copy and paste the values for both the API Key ID and API Key Secret to a temporary, secure place in your local environment. You will need these values when you deploy the agent in your Kubernetes cluster.

    4. You can keep the dialog box open in Soda Cloud, or close it.

    Deploy a Soda Agent in a Kubernetes cluster

    What follows are detailed deployment instructions according to the type of environment in which you create a cluster to deploy an agent. The high-level steps to complete the deployment remain the same regardless of environment.

    1. (Optional) Familiarize yourself with basic Soda, Kubernetes, and Helm concepts.

    2. Install, or confirm the installation of, a few required command-line tools.

    3. Create a new Kubernetes cluster in your environment, or identify an existing cluster you can use to deploy a Soda Agent.

    4. Deploy the Soda Agent in the cluster.

    Compatibility

    Soda supports Kubernetes cluster version 1.21 or greater.

    You can deploy a Soda Agent to connect with the following data sources:

    1 MS SQL Server/MS Fabric with Windows Authentication does not work with Soda Agent out-of-the-box.

    These deployment instructions offer generic guidance for deploying a Soda Agent in a Kubernetes cluster.


    Prerequisites

    • You have created, or have access to an existing Kubernetes cluster into which you can deploy a Soda Agent.

    Add a new data source

    In your Soda Cloud account, navigate to your avatar > Data Sources. Click New Data Source, then follow the guided steps to create a new data source. Refer to the sections below for insight into the values to enter in the fields and editing panels in the guided steps.

    1. Attributes

    Field or Label
    Guidance

    2. Connect

    In the editing panel, provide the connection configurations Soda Cloud needs to be able to access the data in the data source. Connection configurations are data source-specific and include values for things such as a database's host and access credentials.

    To more securely provide sensitive values such as usernames and passwords, use environment variables in a values.yml file when you deploy the Soda Agent. See Deploy using a values YAML file for details.

    Access the data source-specific connection configurations listed below to copy+paste the connection syntax into the editing panel, then adjust the values to correspond with your data source's details. Access connection configuration details in the Data source reference section of Soda documentation.

    See also:

    3. Discover

    During its initial scan of your data source, Soda Cloud discovers all the datasets the data source contains. It captures basic information about each dataset, including dataset names and the columns each contains.

    In the editing panel, specify the datasets that Soda Cloud must include or exclude from this basic discovery activity. The default syntax in the editing panel instructs Soda to collect basic dataset information from all datasets in the data source except those with names that begin with test_. The % is a wildcard character. See for more detail on profiling syntax.

    Known issue: SodaCL does not support using variables in column profiling and dataset discovery configurations.

    4. Profile

    To gather more detailed profile information about datasets in your data source and automatically build an anomaly dashboard for data quality observability (preview only), you can configure Soda Cloud to profile the columns in datasets.

    Profiling a dataset produces two tabs' worth of data in a dataset page:

    • In the Columns tab, you can see column profile information including details such as the calculated mean value of data in a column, the maximum and minimum values in a column, and the number of rows with missing data.

    • In the Anomalies tab, you can access an out-of-the-box anomaly dashboard that uses the column profile information to automatically begin detecting anomalies in your data relative to the patterns the machine learning algorithm learns over the course of approximately five days. (available in 2025)

    In the editing panel, provide details that Soda Cloud uses to determine which datasets to include or exclude when it profiles the columns in a dataset. The default syntax in the editing panel instructs Soda to profile every column of every dataset in this data source, and, superfluously, all datasets with names that begin with prod. The % is a wildcard character. See for more detail on profiling syntax.

    Column profiling and automated anomaly detection can be resource-heavy, so carefully consider the datasets for which you truly need column profile information. Refer to for more detail.

    5. Check

    When Soda Cloud automatically discovers the datasets in a data source, it prepares automated monitoring checks for each dataset. These checks detect anomalies and monitor schema evolution, corresponding to the SodaCL anomaly detection and schema checks, respectively.

    (Note that if you have signed up for early access to anomaly dashboards for datasets, this Check tab is unavailable as Soda performs all automated monitoring automatically in the dashboards.)

    In the editing panel, specify the datasets that Soda Cloud must include or exclude when preparing automated monitoring checks. The default syntax in the editing panel indicates that Soda will add automated monitoring to all datasets in the data source except those with names that begin with test_. The % is a wildcard character.

    6. Assign Owner

    This tab is the fifth step in the guided workflow if the 5. Check tab is absent because you requested access to the anomaly dashboards feature.

    Field or Label
    Guidance

    Use a file reference for a BigQuery data source connection

    If you already store information about your data source in a JSON file in a secure location, you can configure your BigQuery data source connection details in Soda Cloud to refer to the JSON file for service account information. To do so, you must add two elements:

    • volumes and volumeMounts parameters in the values.yml file that your Soda Agent helm chart uses

    • the account_info_json_path in your data source connection configuration

    You, or an IT Admin in your organization, can add the following scanlauncher parameters to the existing values.yml that your Soda Agent uses for deployment and redeployment in your Kubernetes cluster. Refer to the Google GKE instructions above.
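    A sketch of those scanlauncher parameters; the volume name, secret name, key, and mount path (gcloud-credentials, serviceaccount.json, /opt/soda/etc) are hypothetical examples to adapt to your own setup:

```yaml
soda:
  scanlauncher:
    volumeMounts:
      # Hypothetical volume name and mount path; adapt to your setup
      - name: gcloud-credentials
        mountPath: /opt/soda/etc
    volumes:
      - name: gcloud-credentials
        secret:
          # Hypothetical Kubernetes secret holding the service account JSON
          secretName: gcloud-credentials
          items:
            - key: serviceaccount.json
              path: serviceaccount.json
```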

    Use the following command to add the service account information to a Kubernetes secret that the Soda Agent consumes according to the configuration above.
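    As a sketch, using kubectl create secret generic with hypothetical secret, key, file, and namespace names that you should adapt:

```shell
# Hypothetical names; the secret name and key must match your values.yml
kubectl create secret generic gcloud-credentials \
  --from-file=serviceaccount.json=/path/to/your/serviceaccount.json \
  --namespace soda-agent
```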

    After you make both of these changes, you must redeploy the Soda Agent.

    Adjust the data source connection configuration to include the account_info_json_path configuration, as per the following example.
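    A sketch of such a connection configuration for a BigQuery data source; the data source name, project, dataset, and JSON path are hypothetical, and the path must match wherever your agent mounts the file:

```yaml
my_bigquery_datasource:
  type: bigquery
  connection:
    # Path inside the scan-launcher container where the JSON file is mounted
    account_info_json_path: /opt/soda/etc/serviceaccount.json
    project_id: my-gcp-project
    dataset: my_dataset
```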

    Next

    1. Choose a flavor of Soda

    2. Set up Soda: self-hosted agent

    3. Run scans and review results

    Need help? Join the Soda community on Slack.

    Organize, alert, investigate

    Verify the existence of your new Soda Agent in your Soda Cloud account.

  • You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. With Docker running, use the command kubectl version --output=yaml to check the version of an existing install.

  • You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.

  • System requirements

    Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.

    Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.

    To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.
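    A sketch of what those parameters could look like in a values.yml, assuming the soda.scanlauncher.resources key named above and standard Kubernetes requests/limits syntax; the placement of the agent-orchestrator resources under soda.agent is an assumption to verify, and each x is a placeholder for your own values:

```yaml
soda:
  agent:
    # Agent-orchestrator resources (key placement is an assumption to verify)
    resources:
      requests:
        cpu: x
        memory: x
      limits:
        cpu: x
        memory: x
  scanlauncher:
    # Scan-launcher resources, as named in the text above
    resources:
      requests:
        cpu: x
        memory: x
      limits:
        cpu: x
        memory: x
```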

    For reference, a Soda-hosted agent specifies resources as follows:

    Deploy an agent

    The following outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.

    • CLI only: install the Helm chart via the CLI by providing values directly in the install command. Use this as a straightforward way of deploying an agent on a cluster in a secure or local environment.

    • Values YAML file: install the Helm chart via the CLI by providing values in a values YAML file. Use this to deploy an agent on a cluster while keeping sensitive values secure: provide sensitive API key values in this local file, and store data source login credentials as environment variables in this local file or in an external secrets manager. Soda needs access to the credentials to be able to connect to your data source to run scans of your data.

    Deploy using CLI only

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. Add the Soda Agent Helm chart repository.

    3. Use the following command to install the Helm chart to deploy a Soda Agent in your cluster. Learn more about the helm install command.
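    As a sketch, steps 2 and 3 combined look like the following; the chart repository URL reflects Soda's published Helm instructions and should be verified against current documentation, and the namespace and agent name are example values:

```shell
# Add the Soda Agent Helm chart repository (verify the URL in current Soda docs)
helm repo add soda-agent https://helm.soda.io/soda-agent/

# Install the chart; replace the *** placeholders with your API key values
helm install soda-agent soda-agent/soda-agent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=*** \
  --set soda.agent.name=myuniqueagent \
  --set soda.cloud.endpoint=https://cloud.soda.io \
  --namespace soda-agent --create-namespace
```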

      • Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

      • Replace the value of soda.agent.name with a custom name for your agent, if you wish.

      • Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

      • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

      • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

        The command-line produces output like the following message:

    4. (Optional) Validate the Soda Agent deployment by running the following command:

    5. In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step 3 to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    Deploy using a values YAML file

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.

    3. Using a code editor, create a new YAML file called values.yml.

    4. In that file, copy+paste the content below, replacing the following values:

      • id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

      • Replace the value of name with a custom name for your agent, if you wish.

      • Specify the value for endpoint according to your local region:

    5. Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.

    6. (Optional) Validate the Soda Agent deployment by running the following command:

    7. In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step three to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
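A minimal sketch of the values.yml referenced above, assuming the keys mirror the --set parameters (soda.apikey.id, soda.apikey.secret, soda.agent.name, soda.cloud.endpoint) described in About the helm install command; the API key values are placeholders:

```yaml
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
  cloud:
    endpoint: "https://cloud.soda.io"
```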

    If you use private key authentication with a Soda Agent, refer to Soda Agent extras.

    About the helm install command

    Command part
    Description

    helm install

    the action helm is to take

    soda-agent (the first one)

    a release named soda-agent on your cluster

    soda-agent (the second one)

    the name of the helm repo you installed

    soda-agent (the third one)

    the name of the helm chart that is the Soda Agent

    The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with --set flags as this command does, or you can specify the override values using a values.yml file.
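Put together, the command parts and --set options described above form an install command like the following (API key values elided):

```shell
helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --namespace soda-agent
```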

    Parameter key
    Parameter value, description

    --set soda.agent.name

    A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.

    --set soda.apikey.id

    With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.apikey.secret

    With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.agent.logFormat

    (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

    --set soda.agent.loglevel

    (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

    --namespace soda-agent

    Deploys the Soda Agent in the soda-agent namespace on your cluster.

    Decommission the Soda Agent and cluster

    1. Uninstall the Soda Agent in the cluster.

    2. Delete the cluster.
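For a local minikube cluster, assuming the agent was installed as a release named soda-agent in the soda-agent namespace, the two steps map to:

```shell
# 1. Uninstall the Soda Agent release
helm uninstall soda-agent -n soda-agent
# 2. Delete the local cluster
minikube delete
```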

    Troubleshoot deployment

    Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.

    Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:

    • Use https://cloud.us.soda.io for the United States

    • Use https://cloud.soda.io for all else

    Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.

    Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:

    • cloud.us.soda.io for Soda Cloud accounts created in the US region OR

    • cloud.soda.io for Soda Cloud accounts created in the EU region AND

    • collect.soda.io

    These deployment instructions offer guidance for setting up an Amazon Elastic Kubernetes Service (EKS) cluster and deploying a Soda Agent in it.

    Prerequisites System requirements Deploy an agent Deploy using CLI only Deploy using a values YAML file (Optional) Connect via AWS PrivateLink About the helm install command Decommission the Soda Agent and the EKS cluster Troubleshoot deployment


    Prerequisites

    • You have an AWS account and the necessary permissions to enable you to create, or gain access to an EKS cluster in your region.

    • You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. Run kubectl version --output=yaml to check the version of an existing install.

    • You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.

    System requirements

    Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.

    Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider:

    • fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%.

    • adding more nodes to the node group; see the AWS documentation.

    Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.

    To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.

    For reference, a Soda-hosted agent specifies resources as follows:

    Deploy an agent

    The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.

    Method
    Description
    When to use

    Deploy using CLI only

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. (Optional) If you wish, you can establish an AWS PrivateLink to provide private connectivity with Soda Cloud. Refer to Connect via AWS PrivateLink before deploying an agent.

    3. (Optional) If you are deploying to an existing Virtual Private Cloud (VPC), consider supplying public or private subnets with your deployment. Consult the eksctl documentation for details.

    4. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart. Best practice advises creating a new namespace into which you can deploy the agent.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    Deploy using a values YAML file

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. (Optional) If you wish, you can establish an AWS PrivateLink to provide private connectivity with Soda Cloud. Refer to Connect via AWS PrivateLink before deploying an agent.

    3. (Optional) If you are deploying to an existing Virtual Private Cloud (VPC), consider supplying public or private subnets with your deployment. Consult the eksctl documentation for details.

    4. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart. Best practice advises creating a new namespace into which you can deploy the agent.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    (Optional) Connect via AWS PrivateLink

    If you use AWS services for your infrastructure and you have deployed or will deploy a Soda Agent in an EKS cluster, you can use an AWS PrivateLink to provide private connectivity with Soda Cloud.

    1. Log in to your AWS console and navigate to your VPC dashboard.

    2. Follow the AWS documentation to create a VPC endpoint. For security reasons, Soda does not publish its service name. Email Soda with your AWS account ID to request the PrivateLink service name. Refer to the AWS documentation for instructions on how to obtain your account ID.

    3. After creating the endpoint, return to the VPC dashboard. When the status of the endpoint becomes Available, the PrivateLink is ready to use. Be aware that this may take more than 10 minutes.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    About the helm install command

    Command part
    Description

    The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with --set flags as this command does, or you can specify the override values using a values.yml file.

    Parameter key
    Parameter value, description

    Decommission the Soda Agent and the EKS cluster

    1. Uninstall the Soda Agent in the cluster.

    2. Delete the EKS cluster itself.

    3. (Optional) Access your AWS CloudFormation console, then click Stacks to view the status of your decommissioned cluster. If you do not see your Stack, use the region drop-down menu at upper-right to select the region in which you created the cluster.

    Troubleshoot deployment

    Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.

    Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:

    • Use https://cloud.us.soda.io for the United States

    • Use https://cloud.soda.io for all else

    Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.

    Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:

    • cloud.us.soda.io for Soda Cloud accounts created in the US region OR

    • cloud.soda.io for Soda Cloud accounts created in the EU region AND

    • collect.soda.io

    Problem: UnauthorizedOperation: You are not authorized to perform this operation.

    Solution: This error indicates that your user profile is not authorized to create the cluster. Contact your AWS Administrator to request the appropriate permissions.

    These deployment instructions offer guidance for setting up an Azure Kubernetes Service (AKS) cluster and deploying a Soda Agent in it.

    Prerequisites System requirements Deploy an agent Deploy using CLI only Deploy using a values YAML file About the helm install command Decommission the Soda Agent and the AKS cluster Troubleshoot deployment


    Prerequisites

    • You have an Azure account and the necessary permissions to enable you to create, or gain access to an existing AKS cluster in your region. Consult the Azure documentation for details.

    • You have installed the Azure CLI. This is the command-line tool you need to access your Azure account from the command line. Run az --version to check the version of an existing install. Consult the Azure documentation for details.

    • You have logged in to your Azure account. Run az login to open a browser and log in to your account.

    • You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have already installed the Azure CLI tool, you can install kubectl using the following command: az aks install-cli. Run kubectl version --output=yaml to check the version of an existing install.

    • You have installed . This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.

    System requirements

    Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.

    Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware that allocating too many resources may be costly relative to the small benefit of improved scan times.

    To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.

    For reference, a Soda-hosted agent specifies resources as follows:

    Deploy an agent

    The following table outlines the ways you can install the Helm chart to deploy a Soda Agent in your cluster.

    Method
    Description
    When to use

    Deploy using CLI only

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.

    3. Use Helm to add the Soda Agent Helm chart repository.

    4. Use the following command to install the Helm chart which deploys a Soda Agent in your cluster. (Learn more about the helm install command.)

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    Deploy using a values YAML file

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.

    3. Use Helm to add the Soda Agent Helm chart repository.

    4. Using a code editor, create a new YAML file called values.yml.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    About the helm install command

    Command part
    Description

    The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with --set flags as this command does, or you can specify the override values using a values.yml file.

    Decommission the Soda Agent and the AKS cluster

    1. Delete everything in the namespace which you created for the Soda Agent.

    2. Delete the cluster. Be patient; this task may take some time to complete.
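As a sketch of the two steps above, assuming the agent lives in a namespace named soda-agent; the resource group and cluster names are placeholders:

```shell
# 1. Delete everything in the agent's namespace
kubectl delete namespace soda-agent
# 2. Delete the AKS cluster; this may take some time
az aks delete --resource-group <your-resource-group> --name <your-cluster-name>
```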

    Troubleshoot deployment

    Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.

    Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:

    • Use https://cloud.us.soda.io for the United States

    • Use https://cloud.soda.io for all else

    Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.

    Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:

    • cloud.us.soda.io for Soda Cloud accounts created in the US region OR

    • cloud.soda.io for Soda Cloud accounts created in the EU region AND

    • collect.soda.io

    Problem: When you attempt to create a cluster, you get an error that reads, An RSA key file or key value must be supplied to SSH Key Value. You can use --generate-ssh-keys to let CLI generate one for you.

    Solution: Run the same command to create a cluster but include an extra line at the end to generate RSA keys.
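For example, a hypothetical az aks create command with the extra flag appended; the resource group and cluster names are placeholders:

```shell
az aks create \
  --resource-group <your-resource-group> \
  --name <your-cluster-name> \
  --generate-ssh-keys
```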

    These deployment instructions offer guidance for setting up a Google Kubernetes Engine (GKE) cluster and deploying a Soda Agent in it.

    Prerequisites System requirements Deploy an agent Deploy using CLI only Deploy using a values YAML file About the helm install command Decommission the Soda Agent and cluster Troubleshoot deployment


    Prerequisites

    • You have a Google Cloud Platform (GCP) account and the necessary permissions to enable you to create, or gain access to an existing Google Kubernetes Engine (GKE) cluster in your region.

    • You have installed the gcloud CLI. Use the command gcloud version to verify the version of an existing install.

      • If you have already installed the gcloud CLI, use the following commands to login and verify your configuration settings, respectively: gcloud auth login gcloud config list

      • If you are installing the gcloud CLI for the first time, be sure to complete all the steps in the installation guide to properly install and configure the setup.

    • You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. With Docker running, use the command kubectl version --output=yaml to check the version of an existing install.

    • You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.

    System requirements

    Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.

    Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.

    To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.

    For reference, a Soda-hosted agent specifies resources as follows:

    Deploy an agent

    The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.

    Method
    Description
    When to use

    Deploy using CLI only

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.

    3. Add the Soda Agent Helm chart repository.

    4. Use the following command to install the Helm chart to deploy a Soda Agent in your cluster. (Learn more about the helm install command.)

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    Deploy using a values YAML file

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.

    3. Using a code editor, create a new YAML file called values.yml.

    4. In that file, copy+paste the content below, replacing the following values:

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    About the helm install command

    Command part
    Description

    The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with --set flags as this command does, or you can specify the override values using a values.yml file.

    Parameter key
    Parameter value, description

    Decommission the Soda Agent and cluster

    1. Uninstall the Soda Agent in the cluster.

    2. Delete the cluster.

    Refer to the Google Kubernetes Engine documentation for details.

    Troubleshoot deployment

    Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.

    Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:

    • Use https://cloud.us.soda.io for the United States

    • Use https://cloud.soda.io for all else

    Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.

    Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:

    • cloud.us.soda.io for Soda Cloud accounts created in the US region OR

    • cloud.soda.io for Soda Cloud accounts created in the EU region AND

    • collect.soda.io

    Organize, alert, investigate
    • Amazon Athena

    • Amazon Redshift

    • Azure Synapse

    • ClickHouse

    • Databricks SQL

    • Denodo

    • Dremio

    • DuckDB

    • GCP BigQuery

    • Google CloudSQL

    • IBM DB2

    • MotherDuck

    • MS SQL Server1

    • MS Fabric1

    • MySQL

    • OracleDB

    • PostgreSQL

    • Presto

    • Snowflake

    • Trino

    • Vertica

    Data Source Label

    Provide a unique identifier for the data source. Soda Cloud uses the label you provide to define the immutable name of the data source against which it runs the Default Scan.

    Default Scan Agent

    Select the Soda-hosted agent, or the name of a Soda Agent that you have previously set up in your secure environment. This identifies the Soda Agent to which Soda Cloud must connect in order to run its scan.

    Check Schedule

    Provide the scan frequency details Soda Cloud uses to execute scans according to your needs. If you wish, you can define the schedule as a cron expression.

    Starting At

    Select the time of day to run the scan. The default value is midnight.

    Cron Expression

    (Optional) Write your own cron expression to define the schedule Soda Cloud uses to run scans.

    Anomaly Dashboard Scan Schedule (Available in 2025)

    Provide the scan frequency details Soda Cloud uses to execute a daily scan to automatically detect anomalies for the anomaly dashboard.

    Data Source Owner

    The Data Source Owner maintains the connection details and settings for this data source and its Default Scan Definition.

    Default Dataset Owner

    The Datasets Owner is the user who, by default, becomes the owner of each dataset the Default Scan discovers. Refer to Manage roles and permissions in Soda Cloud to learn how to adjust the Dataset Owner of individual datasets.
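The optional cron expression for a scan schedule uses standard five-field cron syntax (minute, hour, day of month, month, day of week). A hypothetical example that runs scans at 06:00 on weekdays:

```
0 6 * * 1-5
```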

    Soda-hosted agent
    self-hosted agent deployment model
    Create a Soda Cloud account
    Deploy a Soda Agent in a Kubernetes cluster
    Add a new data source
    cloud.soda.io
    basic Soda, Kubernetes, and Helm concepts
    Prerequisites
    System requirements
    Deploy an agent
    Deploy using CLI only
    Deploy using a values YAML file
    About the helm install command
    Decommission the Soda Agent and cluster
    Troubleshoot deployment
    Use environment variables for data source connection credentials
    Data source reference
    Use a file reference for a BigQuery data source connection
    Add dataset discovery
    Learn more
    Add column profiling
    Compute consumption and cost considerations
    anomaly detection
    schema
    anomaly dashboards
    Write SodaCL checks
    Soda community on Slack
    helm repo add soda-agent https://helm.soda.io/soda-agent/
    helm uninstall soda-agent -n soda-agent
    minikube delete
    💀  Removed all traces of the "minikube" cluster.
    discover datasets:
      datasets:
        - include %
        - exclude test_%
    profile columns:
      columns:
        - "%.%"  # Includes all your datasets
        - prod%  # Includes all datasets that begin with 'prod'
    automated monitoring:
      datasets:
        - include %
        - exclude test_%
    soda:
      scanlauncher:
        volumeMounts:
          - name: gcloud-credentials
            mountPath: /opt/soda/etc
        volumes:
          - name: gcloud-credentials
            secret:
              secretName: gcloud-credentials
              items:
                - key: serviceaccount.json
                  path: serviceaccount.json
    kubectl create secret generic -n <soda-agent-namespace> gcloud-credentials --from-file=serviceaccount.json=<local path to the serviceaccount.json>
    my_datasource_name:
      type: bigquery
      account_info_json_path: /opt/soda/etc/serviceaccount.json
      auth_scopes:
        - https://www.googleapis.com/auth/bigquery
        - https://www.googleapis.com/auth/cloud-platform
        - https://www.googleapis.com/auth/drive
      project_id: ***
      dataset: sodacore
    soda:
      agent:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
      scanlauncher:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
    soda:
      agent:
        resources:
          limits:
            cpu: 250m
            memory: 375Mi
          requests:
            cpu: 250m
            memory: 375Mi
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --namespace soda-agent
  • Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
  • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

  • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

  • adding a cluster auto-scaler to your Kubernetes cluster; see AWS documentation for Autoscaling
  • Use Helm to add the Soda Agent Helm chart repository.

  • Use the following command to install the Helm chart which deploys a Soda Agent in your cluster.

    • Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

    • Replace the value of soda.agent.name with a custom name for your agent, if you wish.

    • Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

    • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

    • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.


      The command-line produces output like the following message:

  • (Optional) Validate the Soda Agent deployment by running the following command:

  • In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step 3 to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.

  • Using a code editor, create a new YAML file called values.yml.

  • To that file, copy+paste the content below, replacing the following values:

    • id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

    • Replace the value of name with a custom name for your agent, if you wish.

    • Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

    • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

    • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

  • Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.

  • (Optional) Validate the Soda Agent deployment by running the following command:

  • In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step four to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.

  • Deploy a Soda Agent to your AWS EKS cluster, or, if you have already deployed one, restart your Soda Agent to begin sending data to Soda Cloud via the PrivateLink.

  • After you have started the agent and validated that it is running, log into your Soda Cloud account, then navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.

  • Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

  • Replace the value of soda.agent.name with a custom name for your agent, if you wish.

  • Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

  • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

  • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

    The command line produces output similar to the following:

  • (Optional) Validate the Soda Agent deployment by running the following command:

  • In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents.
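The API key values you pass this way end up in Kubernetes Secrets, which store their data base64-encoded rather than encrypted. As a quick illustration of that encoding, using a hypothetical key value:

```shell
# Base64-encode a hypothetical API key value, the way a Kubernetes Secret stores it
echo -n 'my-api-key' | base64
# prints bXktYXBpLWtleQ==

# The encoding is reversible; anyone with read access to the Secret can decode it
echo -n 'bXktYXBpLWtleQ==' | base64 --decode
# prints my-api-key
```

Because the encoding is trivially reversible, restrict read access to the agent's namespace with Kubernetes RBAC as you would for any Secret.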

  • To that file, copy+paste the content below, replacing the following values:

    • id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

    • Replace the value of name with a custom name for your agent, if you wish.

    • Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

    • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

    • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

  • Save the file. Then, create a namespace for the agent.

  • In the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.

  • (Optional) Validate the Soda Agent deployment by running the following command:

  • In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.

  • Consider using the following command to learn a few basic gcloud commands: gcloud cheat-sheet.

    Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

  • Replace the value of soda.agent.name with a custom name for your agent, if you wish.

  • Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

  • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

  • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

    The command line produces output similar to the following:

  • (Optional) Validate the Soda Agent deployment by running the following command:

  • In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that the agent may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step three to check the status of the deployment. When the pod reports Status: Running, you can refresh the page and see the agent in Soda Cloud.

  • id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

  • Replace the value of name with a custom name for your agent, if you wish.

  • Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

  • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

  • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

  • Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.

  • (Optional) Validate the Soda Agent deployment by running the following command:

  • In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that the agent may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step four to check the status of the deployment. When the pod reports Status: Running, you can refresh the page and see the agent in Soda Cloud.

  • Use the namespace value to identify the namespace in which to deploy the agent.

    CLI only

    Install the Helm chart via CLI by providing values directly in the install command.

    Use this as a straightforward way of deploying an agent on a cluster.

    Use a values YAML file

    Install the Helm chart via CLI by providing values in a values YAML file.

    Use this as a way of deploying an agent on a cluster while keeping sensitive values secure.

      • Provide sensitive API key values in this local file.

      • Store data source login credentials as environment variables in this local file or in an external secrets manager; Soda needs access to the credentials to be able to connect to your data source to run scans of your data. See: Soda Agent extras.

    helm install

    the action helm is to take

    soda-agent (the first one)

    a release named soda-agent on your cluster

    soda-agent (the second one)

    the name of the helm repo you installed

    soda-agent (the third one)

    the name of the helm chart that is the Soda Agent

    --set soda.agent.name

    A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.

    --set soda.apikey.id

    With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.apikey.secret

    With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.agent.logFormat

    (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

    --set soda.agent.loglevel

    (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

    --namespace soda-agent

    CLI only - regular cluster

    Install the Helm chart via CLI by providing values directly in the install command.

    Use this as a straightforward way of deploying an agent on a cluster.

    Use a values YAML file

    Install the Helm chart via CLI by providing values in a values YAML file.

    Use this as a way of deploying an agent on a cluster while keeping sensitive values secure.

      • Provide sensitive API key values in this local file or in an external secrets manager.

      • Store data source login credentials as environment variables in this local file; Soda needs access to the credentials to be able to connect to your data source to run scans of your data. See: Soda Agent extras.

    helm install

    the action helm is to take

    soda-agent (the first one)

    a release named soda-agent on your cluster

    soda-agent (the second one)

    the name of the helm repo you installed

    soda-agent (the third one)

    the name of the helm chart that is the Soda Agent

    CLI only

    Install the Helm chart via CLI by providing values directly in the install command.

    Use this as a straightforward way of deploying an agent on a cluster in a secure or local environment.

    Use a values YAML file

    Install the Helm chart via CLI by providing values in a values YAML file.

    Use this as a way of deploying an agent on a cluster while keeping sensitive values secure.

      • Provide sensitive API key values in this local file.

      • Store data source login credentials as environment variables in this local file or in an external secrets manager; Soda needs access to the credentials to be able to connect to your data source to run scans of your data. See: Soda Agent extras.

    helm install

    the action helm is to take

    soda-agent (the first one)

    a release named soda-agent on your cluster

    soda-agent (the second one)

    the name of the helm repo you installed

    soda-agent (the third one)

    the name of the helm chart that is the Soda Agent

    --set soda.agent.name

    A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.

    --set soda.apikey.id

    With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.apikey.secret

    With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.agent.logFormat

    (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

    --set soda.agent.loglevel

    (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

    --namespace soda-agent

    soda:
       apikey:
         id: "***"
         secret: "***"
       agent:
         name: "myuniqueagent"
         logformat: "raw"
         loglevel: "ERROR"
       cloud:
         # Use https://cloud.us.soda.io for US region
         # Use https://cloud.soda.io for EU region
         endpoint: "https://cloud.soda.io"
    helm repo add soda-agent https://helm.soda.io/soda-agent/
    kubectl describe pods
    ...
    Containers:
      soda-agent-orchestrator:
         Container ID:   docker://081*33a7
         Image:          sodadata/agent-orchestrator:latest
         Image ID:       docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
         Port:           <none>
         Host Port:      <none>
         State:          Running
           Started:      Thu, 16 Jun 2022 15:50:28 -0700
         Ready:          True
    ...
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    kubectl describe pods -n soda-agent
    ...
    Containers:
      soda-agent-orchestrator:
         Container ID:   docker://081*33a7
         Image:          sodadata/agent-orchestrator:latest
         Image ID:       docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
         Port:           <none>
         Host Port:      <none>
         State:          Running
           Started:      Thu, 16 Jun 2022 15:50:28 -0700
         Ready:          True
         ...
    kubectl -n soda-agent rollout restart deploy
    kubectl get pods -n soda-agent
    NAME                                     READY   STATUS    RESTARTS   AGE
    soda-agent-orchestrator-ffd74c76-5g7tl   1/1     Running   0          32s
    kubectl create ns soda-agent
    namespace/soda-agent created
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    kubectl describe pods -n soda-agent
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.cloud.endpoint=https://cloud.soda.io \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=*** \
      --set soda.agent.logFormat=raw \
      --set soda.agent.loglevel=ERROR \
      --namespace soda-agent
    NAME: soda-agent
    LAST DEPLOYED: Wed Dec 14 11:45:13 2022
    NAMESPACE: soda-agent
    STATUS: deployed
    REVISION: 1
    kubectl describe pods
    Name:             soda-agent-orchestrator-66-snip
    Namespace:        soda-agent
    Priority:         0
    Service Account:  soda-agent
    Node:             <none>
    Labels:           agent.soda.io/component=orchestrator
                      agent.soda.io/service=queue
                      app.kubernetes.io/instance=soda-agent
                      app.kubernetes.io/name=soda-agent
                      pod-template-hash=669snip
    Annotations:      seccomp.security.alpha.kubernetes.io/pod: runtime/default
    Status:           Running
    ...
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    kubectl describe pods
    Name:             soda-agent-orchestrator-66-snip
    Namespace:        soda-agent
    Priority:         0
    Service Account:  soda-agent
    Node:             <none>
    Labels:           agent.soda.io/component=orchestrator
                      agent.soda.io/service=queue
                      app.kubernetes.io/instance=soda-agent
                      app.kubernetes.io/name=soda-agent
                      pod-template-hash=669snip
    Annotations:      seccomp.security.alpha.kubernetes.io/pod: runtime/default
    Status:           Running
    ...
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.cloud.endpoint=https://cloud.soda.io \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --set soda.agent.logFormat=raw \
      --set soda.agent.loglevel=ERROR \
      --namespace soda-agent
    NAME: soda-agent
    LAST DEPLOYED: Thu Jun 16 15:03:10 2022
    NAMESPACE: soda-agent
    STATUS: deployed
    REVISION: 1
    minikube kubectl -- describe pods
    ...
    Containers:
      soda-agent-orchestrator:
         Container ID:   docker://081*33a7
         Image:          sodadata/agent-orchestrator:latest
         Image ID:       docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
         Port:           <none>
         Host Port:      <none>
         State:          Running
           Started:      Thu, 16 Jun 2022 15:50:28 -0700
         Ready:          True
         ...
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    minikube kubectl -- describe pods
    ...
    Containers:
      soda-agent-orchestrator:
         Container ID:   docker://081*33a7
         Image:          sodadata/agent-orchestrator:latest
         Image ID:       docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
         Port:           <none>
         Host Port:      <none>
         State:          Running
           Started:      Thu, 16 Jun 2022 15:50:28 -0700
         Ready:          True
    ...
    soda:
      agent:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
      scanlauncher:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
    soda:
      agent:
        resources:
          limits:
            cpu: 250m
            memory: 375Mi
          requests:
            cpu: 250m
            memory: 375Mi
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --namespace soda-agent
    helm uninstall soda-agent -n soda-agent
    eksctl delete cluster --name soda-agent
    soda:
      agent:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
      scanlauncher:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
    soda:
      agent:
        resources:
          limits:
            cpu: 250m
            memory: 375Mi
          requests:
            cpu: 250m
            memory: 375Mi
    helm repo add soda-agent https://helm.soda.io/soda-agent/
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    helm repo add soda-agent https://helm.soda.io/soda-agent/
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --namespace soda-agent
    kubectl delete ns soda-agent
    az aks delete --resource-group SodaAgent --name soda-agent-cli-test --yes
    az aks create \
      --resource-group SodaAgent \
      --name SodaAgentCluster \
      --node-count 1 \
      --generate-ssh-keys
    soda:
      agent:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
      scanlauncher:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
    soda:
      agent:
        resources:
          limits:
            cpu: 250m
            memory: 375Mi
          requests:
            cpu: 250m
            memory: 375Mi
    helm repo add soda-agent https://helm.soda.io/soda-agent/
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --namespace soda-agent
    helm uninstall soda-agent -n soda-agent
    gcloud container clusters delete soda-agent-gke
    about the helm install command
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.cloud.endpoint=https://cloud.soda.io \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --set soda.agent.logFormat=raw \
      --set soda.agent.loglevel=ERROR \
      --namespace soda-agent
    NAME: soda-agent
    LAST DEPLOYED: Thu Jun 16 10:12:47 2022
    NAMESPACE: soda-agent
    STATUS: deployed
    REVISION: 1
    soda:
       apikey:
         id: "***"
         secret: "***"
       agent:
         name: "myuniqueagent"
         logformat: "raw"
         loglevel: "ERROR"
       cloud:
         # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
         endpoint: "https://cloud.soda.io"
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.cloud.endpoint=https://cloud.soda.io \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --set soda.agent.logFormat=raw \
      --set soda.agent.loglevel=ERROR \
      --namespace soda-agent
    NAME: soda-agent
    LAST DEPLOYED: Mon Nov 21 16:29:38 2022
    NAMESPACE: soda-agent
    STATUS: deployed
    REVISION: 1
    soda:
       apikey:
         id: "***"
         secret: "***"
       agent:
         name: "myuniqueagent"
         logformat: "raw"
         loglevel: "ERROR"
       cloud:
         # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
         endpoint: "https://cloud.soda.io"
    soda:
       apikey:
         id: "***"
         secret: "***"
       agent:
         name: "myuniqueagent"
         logformat: "raw"
         loglevel: "ERROR"
       cloud:
         # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
         endpoint: "https://cloud.soda.io"