
Learning resources

Glossary

Access a glossary of Soda terminology.

active check

Soda's licensing model is based on the volume of active checks. An active check is one that Soda has executed during a scan at least once in the past 90 days. A single check, whether it has been executed during a scan one time, fifty times, or five hundred times in the last 90 days, counts as one active check.

agreement

(Deprecating) A collection of checks that serve as a contract between stakeholders that stipulates the expected and agreed-upon state of data quality in a data source.

alert configuration

A configuration in a SodaCL check that you use to explicitly specify the conditions that warrant a warn result. See .

built-in metric

An out-of-the-box metric that you can configure in a checks YAML file. See .

check

A test for data quality that you write using the Soda Checks Language (SodaCL). Technically, it is a Python expression that checks metrics to see if they match the parameters you defined for a measurement. See .

checks YAML

The file in which you define SodaCL checks. Soda Library uses the input from this file to prepare, then run SQL queries against your data. See .

cloud metric store

The component in Soda Cloud that stores metric measurements. This component facilitates the visualization of changes to your data over time.

collection

A saved set of filters in the Checks dashboard that you can access via a dropdown. Also known as a Saved View.

column

A column in a dataset in your data source.

configuration key

The key in the key-value pair that you use to define what qualifies as a missing or valid value in a column. A Soda scan uses the value of a column configuration key to determine if a check should pass, warn, or fail. For example, in valid format: UUID, valid format is a column configuration key and UUID is the only format of the data in the column that Soda considers valid. See and .

configuration YAML

The file in which you configure data source connection details and Soda Cloud connection details.

data source

A storage location that contains a collection of datasets, such as Snowflake, Amazon Athena, or GCP BigQuery.

dataset

A representation of a tabular data structure with rows and columns. A dataset can take the form of a table in PostgreSQL or Snowflake, a stream, or a DataFrame in a Spark application.

discussion

A collaborative messaging and check proposal space that data producers and consumers can use to establish agreed-upon rules for data quality. See: .

incident

A ticket you create and associate with a failed check result so as to track your team’s investigation and resolution of a data quality issue. See .

measurement

The value for a metric that Soda Library collects during a scan.

metric

A property of the data in your dataset. See .

monitor

(Deprecated) A set of details you define in Soda Cloud which Soda SQL used when it ran a scan. Now deprecated and replaced by a .

no-code check

A SodaCL check you create via the Soda Cloud user interface.

notification

A setting you configure in a Soda Cloud agreement that defines whom to notify with check results after a scan.

recon YAML

The file in which you define SodaCL reconciliation checks. See .

scan

A command that executes checks to extract information about data in a data source. See .

scan definition

A collection of checks YAML files that contain the checks for data quality you wish to scan at a specific time, including details for which Soda Agent to use to connect to which data source. Effectively, a scan definition provides the what, when, and where to run a scan.

scan definition name

A unique identifier that you add to a programmatic scan or to the soda scan command using the -s option. Soda Cloud uses the scan definition name to correlate subsequent scan results, thus retaining an historical record of the measurements over time.

scan schedule

The schedule you customize in Soda Cloud to instruct a Soda Agent to execute scans at a regular cadence.

Soda Agent

The self-hosted or Soda-hosted Helm chart that facilitates a secure connection between your Soda Cloud account and your data sources. See .

SodaCL

The domain-specific language to define Soda Checks in a checks YAML file. A Soda Check is a test that Soda Library executes when it scans a dataset in your data source.

Soda Cloud

A web application that enables you to examine scan results and create agreements. Create a Soda Cloud account at .

Soda Core

A free, open-source, Python library and command-line tool that enables you to use the Soda Checks Language to turn user-defined input into aggregated SQL queries that test for data quality. See Soda Core in .

Soda Library

A Python library and CLI tool that is a commercial extension of Soda Core. Connect Soda Library with over a dozen data sources and Soda Cloud, and use the Soda Checks Language to turn user-defined input into aggregated SQL queries that test for data quality.

Soda Spark (Deprecated)

Soda Spark was an extension of Soda SQL that allowed you to run Soda SQL functionality programmatically on a Spark DataFrame. It has been replaced by Soda Library configured to .

Soda SQL (Deprecated)

Soda SQL was an open-source command-line tool that scanned the data in your data source. Replaced by Soda Library.

threshold

The value for a metric that Soda checks against during a scan. See .

validity rule

In Soda Cloud, the key-value pair that you use to define what qualifies as a missing or valid value in a column. A Soda scan uses the value defined in a validity rule to determine if it should pass or fail a check. See also: .

Soda overview

Soda utilizes user-defined input to prepare SQL queries to find bad data, visualize results, set up alerts, and track dataset health over time.

Soda is a tool that enables Data Engineers, Data Scientists, and Data Analysts to test data for quality where and when they need to.

Is your data fresh?
Is it complete or missing values?
Are there unexpected duplicate values?
Did something go wrong during transformation?
Are all the date values valid?
Are anomalous values disrupting downstream reports?

These are questions that Soda answers.

What does Soda do?

Soda works by taking the data quality checks that you prepare and using them to run a scan of datasets in a data source.

A scan is a command which instructs Soda to execute data quality checks on your data source to find invalid, missing, or unexpected data. When data quality checks fail, they surface bad-quality data and present check results that help you investigate and address quality issues.

Working together, Soda Library or a Soda Agent, Soda Cloud and Soda Checks Language (SodaCL) empower you and your colleagues to collaborate on data quality testing.

Soda Library is a Python library and CLI tool that performs the work of converting user-defined input into SQL queries that execute when you run scans for data quality. This "engine" of Soda uses the data source connection information you provide in a configuration YAML file, and the data quality checks you define in a checks YAML file, to run on-demand or scheduled scans of your data. Soda Library pushes scan results to your Soda Cloud account to enable you and your colleagues to analyze check results, investigate issues, and track dataset health over time.
The Soda Agent is a self-hosted or Soda-hosted containerized Soda Library deployed in a Kubernetes cluster in a cloud services provider environment, such as Azure or AWS. Deploy a self-hosted Soda Agent to use Soda Library while meeting your infrastructure team’s security rules and requirements. See for details.
Soda Cloud communicates with Soda Library installed as a library and CLI tool in your development environment, or as an agent in your cloud service-based environment. While Soda Library is the mechanism that executes scans, Soda Cloud is what makes data quality results accessible and shareable by multiple team members. Use it to access visualized scan results, discover data quality anomalies, set up alerts for quality checks that fail, and track data quality health over time. Connect your Soda Cloud account to the ticketing, messaging, and data cataloging tools you already use to embed Soda quality checks into your team's existing processes and pipelines.

Soda Core, the free, open-source Python library and CLI tool from which Soda Library extends, continues to exist as an OSS project in GitHub, including all . to connect to Soda Cloud and access all the newest Soda features.

Example SodaCL checks

Soda uses the input in the checks YAML files to prepare SQL queries that it runs against your data during a scan. During a scan, Soda does not ingest your data; it only scans it for quality metrics, then uses the metadata to prepare scan results. (An exception to this rule is when Soda collects failed row samples that it presents in scan output to aid with issue investigation, a feature you can .)

After a scan, each check results in one of three default states:

pass: the values in the dataset match or fall within the thresholds you specified
fail: the values in the dataset do not match or fall within the thresholds you specified
error: the syntax of the check is invalid
A fourth state, warn, is something you can explicitly configure for individual checks.
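
As a rough illustration of how a single measurement maps to these states, the following Python sketch compares a value against a fail condition and an optional warn condition. It is not Soda's implementation: the function names, the "operator value" condition format, and the thresholds are all assumptions made for this example.

```python
import operator
from typing import Optional

# Comparison operators this sketch understands (an assumption, not SodaCL's grammar).
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    "!=": operator.ne,
}


def parse_condition(condition: str):
    """Split a condition such as '> 10' into a comparison function and a number."""
    symbol, _, raw_value = condition.strip().partition(" ")
    if symbol not in OPERATORS:
        raise ValueError(f"unsupported condition: {condition!r}")
    return OPERATORS[symbol], float(raw_value)


def evaluate_check(measurement: float, fail_when: str, warn_when: Optional[str] = None) -> str:
    """Return 'pass', 'warn', 'fail', or 'error' for a single measurement."""
    try:
        fail_op, fail_value = parse_condition(fail_when)
        warn = parse_condition(warn_when) if warn_when is not None else None
    except ValueError:
        # An unparseable condition stands in for invalid check syntax.
        return "error"
    if fail_op(measurement, fail_value):
        return "fail"
    if warn is not None and warn[0](measurement, warn[1]):
        return "warn"
    return "pass"


# Hypothetical thresholds: warn above 5, fail above 10.
print(evaluate_check(3, fail_when="> 10", warn_when="> 5"))
print(evaluate_check(7, fail_when="> 10", warn_when="> 5"))
print(evaluate_check(12, fail_when="> 10", warn_when="> 5"))
print(evaluate_check(12, fail_when=">> 10"))
```

The fail condition is tested before the warn condition so that a value that satisfies both is reported as a failure; without a warn condition, only the three default states can occur.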

Soda makes the results available in the command-line and in your online account, and notifies you of failed checks by email, Slack, MS Teams, or any messaging platform your team already uses.

Where do you use Soda?

You can programmatically embed Soda scan executions in your data pipeline after ingestion and transformation to get early and precise warnings in Soda about data quality issues before they have a downstream impact. Upon receiving a data quality alert in Slack, for example, your team can take quick action in Soda Cloud to identify the issue and open an incident to investigate the root cause. See .

You can also add Soda scans to your CI/CD development lifecycle to ensure that changes to dbt models, and any other new or modified transformations, are checked for data quality before merging into production, preventing data quality issues from impacting business operations. In conjunction with GitHub Actions, for example, you can automate scans for data quality whenever a team member creates a new pull request to ensure that “checking for data quality” is a regular part of your software development lifecycle. An ounce of prevention in development is worth a pound of cure in production! See .

Use Soda to test the quality in a data migration project at both source and target, both before and after migration to prevent data quality issues from polluting a new data source. See .

Go further

Learn more about the ways you can use Soda in .
with your data catalog.
Auto-generate tailored to your data.
Set up bulk to send alerts for failed checks.

How Soda works

Learn Soda Library Basics, Soda Library Operation, Soda Library Automation and Soda Cloud.

Soda Library is a Python library and CLI tool that enables Data Engineers to test data for quality where and when they need to. The Soda Agent is a self-hosted or Soda-hosted containerized Soda Library deployed in a Kubernetes cluster, so the behavior described below for Soda Library is more or less the same for Soda Agent.

Soda Library utilizes user-defined input to prepare SQL queries that run checks on datasets in a data source to find invalid, missing, or unexpected data. When checks fail, they surface the data that you defined as “bad” in the check. Armed with this information, you and your data engineering team can diagnose where the “bad” data entered your data pipeline and take steps to prioritize and resolve issues.

Use Soda Library to manually or programmatically scan the data that your organization uses to make decisions. Optionally, you can integrate Soda Library with your data orchestration tool, such as Airflow, to schedule scans and automate actions based on scan results. Connect Soda Library to a Soda Cloud account where you and your team can use the web application to monitor check results and collaborate to keep your data issue-free.

Soda Library basics

This tool checks the quality of data inside . It enables you to perform four basic tasks:

connect to your data source
connect to a Soda Cloud account
define checks to surface bad-quality data
run a scan for data quality against your data

To connect to a data source such as Snowflake, Amazon Athena, or GCP BigQuery, you use a configuration.yml file which stores access details for your data source and connection details for your Soda Cloud account. (Except for connections to Spark DataFrames which do not use a configuration YAML file.) Refer to for details and links to data source-specific connection configurations.

Configuration YAML example

To define the data quality checks that Soda Library runs against a dataset, you use a checks.yml file. A Soda Check is a test that Soda Library performs when it scans a dataset in your data source. The checks YAML file stores the Soda Checks you write using .

For example, you can define checks that look for things like missing or forbidden columns in a dataset, or rows that contain data in an invalid format. See for more details.

Checks YAML example

In your own local environment, you create and store your checks YAML file anywhere you wish, then identify its name and filepath in the scan command. In fact, you can name the file whatever you like, as long as it is a .yml file and it contains checks using the SodaCL syntax.

You write Soda Checks using SodaCL’s built-in metrics, though you can go beyond the built-in metrics and write your own SQL queries, if you wish. The example above illustrates two simple checks on two datasets, but SodaCL offers a wealth of that enable you to define checks for more complex situations.

To scan your data, you use the soda scan CLI command. Soda Library uses the input in the checks YAML file to prepare SQL queries that it runs against the data in one or more datasets in a data source. It returns the output of the scan with each check's results in the CLI.

Soda Library operation

The following image illustrates what Soda Library does when you initiate a scan.

1 - You trigger a scan using the soda scan CLI command from your Soda project directory which contains the configuration.yml and checks.yml files. The scan specifies which data source to scan, where to get data source access info, and which checks to run on which datasets.

2 - Soda Library uses the checks you defined in the checks YAML to prepare SQL queries that it runs on the datasets in your data source.

3 - When Soda Library runs a scan, it performs the following actions:

fetches column metadata (column name, type, and nullable)
executes a single aggregation query that computes aggregate metrics for multiple columns, such as missing, min, or max
for each column in each dataset, executes several more queries
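
To make the first two actions concrete, here is a minimal Python sketch that uses an in-memory SQLite database. It only illustrates the idea of fetching column metadata and computing several aggregate metrics in one query; the orders table, the helper names, and the SQL itself are assumptions for this example and are not the queries Soda generates.

```python
import sqlite3


def fetch_column_metadata(conn, table):
    """Return (name, type, nullable) for each column via SQLite's table_info pragma."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, default_value, pk).
    return [(name, col_type, not notnull) for _, name, col_type, notnull, _, _ in rows]


def aggregate_metrics(conn, table, columns):
    """Compute row count plus missing, min, and max per column in a single query."""
    # Table and column names are interpolated directly, so they must be trusted.
    parts = ["COUNT(*) AS row_count"]
    for column in columns:
        parts.append(f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS missing_{column}")
        parts.append(f"MIN({column}) AS min_{column}")
        parts.append(f"MAX({column}) AS max_{column}")
    cursor = conn.execute(f"SELECT {', '.join(parts)} FROM {table}")
    names = [description[0] for description in cursor.description]
    return dict(zip(names, cursor.fetchone()))


# Hypothetical dataset used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER NOT NULL, amount REAL, discount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, None), (2, 25.5, 2.0), (3, None, 1.0)],
)

metadata = fetch_column_metadata(conn, "orders")
metrics = aggregate_metrics(conn, "orders", ["amount", "discount"])
print(metadata)
print(metrics)
```

Batching the aggregates into one SELECT means the dataset is read once for all of these metrics, which is the motivation for a single aggregation query.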

4 - As a result of a scan, each check results in one of three default states:

pass: the values in the dataset match or fall within the thresholds you specified
fail: the values in the dataset do not match or fall within the thresholds you specified
error: the syntax of the check is invalid

A fourth state, warn, is something you can explicitly configure for individual checks. See .

The scan results appear in your Soda Library command-line interface (CLI) and the latest result appears in the Checks dashboard in the Soda Cloud web application; examples follow.

Optionally, you can add the --local option to the scan command to prevent Soda Library from sending check results and any other metadata to Soda Cloud.
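
The following Python sketch assembles a soda scan command line without running it. The -s and --local options come from this documentation; the -d (data source) and -c (configuration file) options, and the data source name, are assumptions, so confirm them against soda scan --help for the version you have installed.

```python
import shlex
from typing import List, Optional


def build_scan_command(
    data_source: str,
    configuration: str,
    checks_files: List[str],
    scan_definition: Optional[str] = None,
    local: bool = False,
) -> List[str]:
    """Assemble the argument list for a scan; nothing is executed here."""
    # -d and -c are assumed option names for the data source and configuration file.
    command = ["soda", "scan", "-d", data_source, "-c", configuration]
    if scan_definition is not None:
        command += ["-s", scan_definition]
    if local:
        # Keeps check results and other metadata from being sent to Soda Cloud.
        command.append("--local")
    command.extend(checks_files)
    return command


# "my_data_source" is a hypothetical name defined in configuration.yml.
command = build_scan_command("my_data_source", "configuration.yml", ["checks.yml"], local=True)
print(shlex.join(command))
```

To run the scan, pass the returned list to subprocess.run in an environment where Soda Library is installed.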

The Soda Cloud web application integrates with your Soda Library implementation giving your team broader visibility into your organization’s data quality. Soda Library pushes scan results to your Soda Cloud account where you can examine the results.

Soda Library does not send data to Soda Cloud; it only ever pushes metadata to the cloud. All your data stays inside your private network. An exception to this rule is when Soda collects failed row samples that it presents in scan output to aid with issue investigation, a feature you can .

The web app serves to complement Soda Library. Use Soda Cloud to:

access visualized check results
track check results over time with the Cloud Metric Store that records past measurements
set up and send alert notifications when bad-quality data surfaces
examine failed row samples

Soda Library automation

To automate scans on your data, you can use Soda Library to programmatically execute scans. Based on a set of conditions or a specific schedule of events, you can instruct Soda Library to, for example, automatically run scans in your development workflow in GitHub. Refer to the for details.
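
A programmatic scan might look like the following Python sketch. It assumes the Scan class in the soda.scan module of Soda's Python package and the method names shown; treat these as assumptions to verify against the version you install. The scan_factory parameter is a hypothetical hook, added only so the function can be exercised without Soda installed.

```python
from typing import Callable, Optional


def run_programmatic_scan(
    data_source: str,
    configuration_path: str,
    checks_path: str,
    scan_definition_name: str,
    scan_factory: Optional[Callable[[], object]] = None,
) -> int:
    """Configure and execute one scan, returning the exit code from execute()."""
    if scan_factory is None:
        # Imported lazily so this module loads even where Soda is not installed.
        from soda.scan import Scan  # assumed import path
        scan_factory = Scan

    scan = scan_factory()
    scan.set_data_source_name(data_source)
    scan.add_configuration_yaml_file(file_path=configuration_path)
    scan.add_sodacl_yaml_file(checks_path)
    # The name lets Soda Cloud correlate results from successive scans.
    scan.set_scan_definition_name(scan_definition_name)
    return scan.execute()


# Example call (file names and data source name are hypothetical):
# exit_code = run_programmatic_scan(
#     "my_data_source", "configuration.yml", "checks.yml", "nightly_orders_scan"
# )
```

A pipeline step can inspect the returned exit code and decide whether to continue, alert, or halt downstream tasks.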

Alternatively, you can integrate Soda Library with a data orchestration tool such as Airflow, Dagster, or Prefect to schedule automated scans. You can also configure actions that the orchestration tool can take based on scan output. For example, you may wish to scan your data at several points along your data pipeline, perhaps when new data enters a warehouse, after it is transformed, and before it is exported to another data source or tool. Refer to for details.

Go further

Learn more about the you can use to check for data quality.
Learn how to prepare of your data.
Learn more about the ways you can use Soda in .
Use to investigate data quality issues.

Soda Agent basic concepts

Establish a baseline understanding of the concepts involved in deploying a Soda Agent.

The Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality. For a self-hosted agent, create a Kubernetes cluster in a cloud services provider environment, then use Helm to deploy a Soda Agent in the cluster.

This setup enables Soda Cloud users to securely connect to data sources (Snowflake, Amazon Athena, etc.) from within the Soda Cloud web application. Any user in your Soda Cloud account can add a new data source via the agent, then write their own no-code checks to check for data quality in the new data source.

What follows is an extremely abridged introduction to a few basic elements involved in the deployment and setup of a self-hosted Soda Agent.

Soda Library is a Python library and command-line tool that serves as the backbone of Soda technology. It is the software that performs the work of converting user-defined input into SQL queries that execute when you run scans for data quality in a data source. Connect Soda Library to a Soda Cloud account where you and your team can use the web application to collaborate on data quality monitoring.

Both Soda Library and Soda Cloud make use of Soda Checks Language (SodaCL) to write checks for data quality. The checks are tests that Soda Library executes when it runs a scan of your data.

Soda Agent is essentially Soda Library functionality that you deploy in a Kubernetes cluster in your own cloud services provider environment. When you deploy an agent, you also deploy two types of workloads in your Kubernetes cluster from a Docker image:

a Soda Agent Orchestrator which creates Kubernetes Jobs to trigger scheduled and on-demand scans of data
a Soda Agent Scan Launcher which wraps around Soda Library, the tool which performs the scan itself

Kubernetes is a system for orchestrating containerized applications; a Kubernetes cluster is a set of resources that supports an application deployment.

You need a Kubernetes cluster in which to deploy the containerized applications that make up the Soda Agent. Kubernetes uses the concept of Secrets that the Soda Agent Helm chart employs to store connection secrets that you specify as values during the Helm release of the Soda Agent. Depending on your cloud provider, you can arrange to store these Secrets in a specialized storage such as or . See: .

The Jobs that the agent creates access these Secrets when they execute. Learn more about .

You create your Kubernetes cluster within a cloud services provider environment. You can deploy a Soda Agent in any environment in which you can create Kubernetes clusters, such as:

Amazon Elastic Kubernetes Service (EKS)
Microsoft Azure Kubernetes Service (AKS)
Google Kubernetes Engine (GKE)
Any Kubernetes cluster version 1.21 or greater which uses standard Kubernetes

Helm is a package manager for Kubernetes which bundles YAML files together for storage in a public or private repository. This bundle of YAML files is referred to as a Helm chart. The Soda Agent is a Helm chart. Anyone with access to the Helm chart’s repo can deploy the chart to make use of YAML files in it. Learn more about .

The Soda Agent Helm chart is stored on a and published on . Anyone can use Helm to find and deploy the Soda Agent Helm chart in their Kubernetes cluster.

Go further

in a Kubernetes cluster.

Soda architecture

Review the architecture and resources of Soda, which connects to data sources to perform scans of datasets.

Your Soda architecture depends upon the flavor of Soda (deployment model) you chose when you set up your environment. The following offers a high-level view of the architecture of a few flavors of Soda.

Self-operated deployment

This deployment model is a simple setup in which you install Soda Library locally and connect it to Soda Cloud via API keys.

Soda Library connects to data sources and performs scans of each dataset in a data source. When you connect Soda Library to a Soda Cloud account, it pushes scan results to Soda Cloud where users in your organization can view check results, access Cloud Metric Store data, and integrate with Slack to investigate data quality Incidents.

When Soda Library completes a scan, it uses a secure API to push the results to your Soda Cloud account where you can log in and examine the details in the web application. Notably, Soda Library only pushes metadata to Soda Cloud, barring any failed rows you explicitly instruct Soda Library to send to Soda Cloud. By default, all your data stays inside your private network. See .

You can use checks to view samples of data that , and track data quality over time. Soda Cloud stores your scan results and prepares charts that represent the volume of failed checks in each scan. These visualizations of your scan results enable you to see where your data quality is improving or deteriorating over time.

Soda-hosted agent deployment

This deployment model provides a secure, out-of-the-box Soda Agent to manage access to data sources from within your Soda Cloud account. Quickly configure connections to your data sources in the Soda Cloud user interface, then empower all your colleagues to explore datasets, access check results, customize collections, and create their own no-code checks for data quality. Soda-hosted agent supports connections to BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, and Snowflake data sources.

Self-hosted agent deployment

This deployment model enables a data or infrastructure engineer to deploy Soda Library as an agent in a Kubernetes cluster within a cloud-services environment such as Google Cloud Platform, Azure, or AWS.

The engineer can manage access to data sources while giving Soda Cloud end-users easy access to Soda check results and enabling them to write their own checks for data quality. Users connect to data sources and write checks for data quality directly in the Soda Cloud user interface.

Soda Cloud resources

Soda Cloud is made up of several parts, or resources, that work together to define checks, execute scans, and display results that help you gauge the quality and reliability of your data.

It is helpful to understand these resources and how they relate, or connect, to each other if you are establishing for your organization’s Soda Cloud account, or if you are planning to delete an existing resource.

The following diagram illustrates an example deployment of a single Soda Cloud account with two Soda Agents, each of which connects to two data sources. A Soda Cloud Administrator has also created integrations with Slack, Jira (via a webhook), and MS Teams.

A Soda Agent is Soda Library that has been deployed in a Kubernetes cluster in a cloud services provider environment. It enables Soda Cloud users to securely connect to data sources such as Snowflake, BigQuery, and PostgreSQL.

about Soda Agent.

A data source in Soda Cloud is a representation of the connection to your data source. Notably, it does not contain any of your data^†, only data source metadata that it uses to check for data quality.

about adding new data sources.

Within the context of Soda Cloud, a data source contains:

datasets which represent tabular structures with rows and columns in your data source; like data sources, they do not contain your data^†, only metadata. Datasets can contain user-defined attributes that help filter and organize check results.
scan definitions, sometimes referred to as a scan schedule, which you use to define a Soda scan execution schedule for the data source.
agreements in which you write checks to define what good data looks like. Agreements also specify where to send alert notifications when a check result warns or fails, such as to a Slack channel in your organization.

about notification rules.

^† An exception to this rule exists when you configure Soda Cloud to collect sample data from a dataset, or samples of failed rows from a dataset when a check result fails.

Delete resources

As the example deployment diagram illustrates, the different resources in Soda Cloud have several connections to each other. You can responsibly delete resources in Soda Cloud -- it warns you about the relevant impact before executing a deletion! -- but it may help to visualize the impact a deletion may have on your deployment before proceeding.

The following non-exhaustive list of example deletions serves to illustrate the potential impact of deleting.

Delete a dataset

Deleting a dataset affects individual checks defined inside an agreement. If you have multiple agreements which contain checks against a particular dataset, all of those checks, and consequently the agreements they are in, are impacted when you delete a dataset. Further, if the dataset contains attributes, those attributes disappear with the dataset upon deletion.

Delete a data source

Deleting a data source affects many other resources in Soda Cloud. As the following diagram illustrates, when you delete a data source, you delete all its datasets, scan definitions, agreements, and the checks in the agreements.

If an agreement contains a that compares the row count of datasets between data sources (as does the agreement in Data source C in the diagram), deleting a data source affects more than the checks and agreements it contains.

Delete a scan definition

Deleting a scan definition has the potential to impact multiple agreements in a data source. Among other things, the scan definition defines the cadence that Soda Cloud uses to execute scans of data in the data source.

Soda does not scan any agreements that reference a deleted scan definition. Consequently, your Checks dashboard in Soda Cloud no longer displays checks for the agreement, nor would Soda Cloud send alert notifications.

Delete an integration

A Soda Cloud Administrator has the ability to add, edit, and delete integrations with third-party service providers.

As the example diagram indicates, deleting a Slack integration prevents Soda Cloud from sending alert notifications to Slack when check results warn or fail, and prevents users from connecting an to a Slack channel to collaborate on data quality issue resolution.

Example combination deployment

If your Soda Cloud account is also connected to Soda Library, your deployment may resemble something like the following diagram.

Note that you can delete resources that appear in Soda Cloud as a result of a manual or programmatic Soda Library scan. However, unless you delete the reference to the resource at its source – the checks.yml file or configuration.yml file – the resource will reappear in Soda Cloud when Soda Library sends its next set of scan results.

For example, imagine you use Soda Library to run scans and send results to Soda Cloud. In the checks.yml file that you use to define your checks, you have the following configuration:

In Soda Cloud, you can see dataset-q because Soda Library pushed the scan results to Soda Cloud which resulted in the creation of a resource for that dataset. In Soda Cloud, you can use the UI to delete dataset-q, but unless you also remove the checks for dataset-q configuration from your checks.yml file, the dataset reappears in Soda Cloud the next time you run a scan.

Go further

Create a Soda Cloud account at .
As a business user, learn how to write SodaCL checks in in Soda Cloud.
Learn more about viewing in Soda Cloud.

Active checks and datasets

Learn more about active checks and datasets as they are defined in Soda's licensing model.

Soda’s licensing model can include volume-based measures of active checks, or a similar model based on active datasets.

An active dataset is one for which Soda has executed at least one check, excluding empty datasets. A dataset counts as active if you add configuration for:

a check
automated monitoring checks; see Add automated monitoring checks
an anomaly detection dashboard (available in 2025); see

An active check is one that Soda has executed during a scan at least once in the past 90 days. A single check, whether it has been executed during one scan, fifty scans, or five hundred scans in the last 90 days, counts as one active check.
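
The 90-day rule can be sketched in Python as follows. The check identifiers, the dates, and the choice to treat an execution exactly 90 days old as still inside the window are assumptions for illustration; this is not how Soda computes licensing volumes internally.

```python
from datetime import datetime, timedelta


def count_active_checks(executions, now, window_days=90):
    """Count distinct checks executed at least once within the window."""
    cutoff = now - timedelta(days=window_days)
    # A set collapses repeated executions of the same check into one entry.
    recent = {check_id for check_id, executed_at in executions if executed_at >= cutoff}
    return len(recent)


now = datetime(2025, 6, 30)
executions = [
    ("row_count_orders", datetime(2025, 6, 1)),
    ("row_count_orders", datetime(2025, 6, 15)),
    ("row_count_orders", datetime(2025, 6, 29)),
    ("missing_email_customers", datetime(2025, 1, 10)),  # older than 90 days
]
print(count_active_checks(executions, now))  # prints 1
```

The first check ran three times inside the window but counts once; the second check last ran outside the window and does not count at all.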

A single check is identifiable as a test that yields a single result.

A check with one or more counts as a single check. The following is an example of a single active check as it only ever yields one result: pass, warn, fail, or error. Note that a check that results in an error counts as an active check. Soda executes the check during a scan in order to yield a result; if the result is an error, it is still a result.

A check that is included as part of a configuration yields a single result for each dataset against which it is executed. The following example produces four check results and, thus, has four active checks and four active datasets.

Similarly, a single check that is included in a scan against two data sources, or two environments such as staging and production, counts as two active checks for two active datasets. The following example checks.yml file contains as a single check. The scan commands that follow instruct Soda to execute the check on two different environments which counts as two active checks for two active datasets. See also: .
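
The counting rule in the two preceding paragraphs can be sketched in Python by treating each distinct combination of check, data source, and dataset as one active check. The check text and the dataset and data source names are hypothetical, and the sketch deliberately ignores cross-dataset comparison checks, which are counted differently.

```python
from typing import Iterable, NamedTuple, Tuple


class CheckResult(NamedTuple):
    check: str
    data_source: str  # also stands in for an environment such as staging
    dataset: str


def count_active(results: Iterable[CheckResult]) -> Tuple[int, int]:
    """Return (active checks, active datasets) for a collection of results."""
    results = list(results)
    active_checks = {(r.check, r.data_source, r.dataset) for r in results}
    active_datasets = {(r.data_source, r.dataset) for r in results}
    return len(active_checks), len(active_datasets)


# One check applied to four datasets in a single data source.
per_dataset = [
    CheckResult("row_count > 0", "warehouse", name)
    for name in ("customers", "orders", "products", "returns")
]

# The same check and dataset scanned in two environments.
per_environment = [
    CheckResult("row_count > 0", environment, "orders")
    for environment in ("staging", "production")
]

print(count_active(per_dataset))      # prints (4, 4)
print(count_active(per_environment))  # prints (2, 2)
```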

A check that involves data comparison between multiple datasets in the same, or different, data sources counts as a single check. The following example has four checks, two and two , and counts as two active datasets.

Similarly, a that compares data between source and target datasets in the same, or different, data sources counts as a single active check running against a single active dataset which, for this type of check, is the target dataset. The following example has five active checks for five active datasets.

Where a check involves grouping its results by category, as in a configuration, the check itself still counts as a single check. The following example has one active check for one active dataset.

Go further

Access information about that you can use in SodaCL checks.

Data security and privacy

Soda works in several ways to ensure your data and systems remain private. We offer secure connections, SSO, and observe compliance and reporting regulations.

Soda works in several ways to ensure your data and systems are secure and remain private.

Compliance and reporting

As a result of an independent review in July 2025, Soda has been found to be SOC 2 Type 2 compliant. Contact [email protected] for more information.

Using a Soda-hosted agent

Soda hosts agents in a secure environment in Amazon AWS. As a SOC 2 Type 2 certified business, Soda responsibly manages Soda-hosted agents to ensure that they remain private, secure, and independent of all other hosted agents.

Soda encrypts values pertaining to data source connections and only uses the values to access the data to perform scans for data quality. It uses AWS KMS to encrypt and store the values you provide for access to your data source. AWS KMS keys are certified under the .
Soda encrypts the secrets you provide via Soda Cloud both in transit and at rest. This means that secrets leave your browser already encrypted and can only be decrypted using a Private Key that only the Soda Agent can access.
Once you enter data source access credentials into Soda Cloud, neither you nor any other user or entity can access the values, because they have been encrypted and can only be decrypted by the Soda Agent.

Connecting with Soda Library

With Soda Library installed in your environment, you use its command-line tools to securely connect to a data source, using system variables to store login credentials.

Sending data to Soda Cloud

Soda Library uses a secure API to connect to Soda Cloud. When Soda Library completes a scan, it pushes the scan results to your Soda Cloud account where you can log in and examine the details in the web application.

Notably, your Soda Cloud account does not store the raw data that Soda Library scans. Soda Library pushes metadata to Soda Cloud; by default, all your data stays inside your private network.

Soda Cloud does store the following:

metadata, such as column names
aggregated metrics, such as averages
sample rows and failed rows, if you explicitly set up your configuration to send this data to Soda Cloud

Where your datasets contain sensitive or private information, you may not want to send failed row samples from your data source to Soda Cloud. In such a circumstance, you can disable failed row samples in Soda Cloud.

Receiving events from Soda Cloud

You can set up Soda Cloud to send events to your services using integrations like Soda Webhooks. If your destination services are behind a firewall, you may need to passlist Soda Cloud's egress IP addresses to allow communication. The current IP addresses used by Soda Cloud are:

EU: 54.78.91.111, 52.49.181.67
US: 34.208.202.240, 52.35.114.145

Ensure these addresses are allowed in your firewall settings to avoid any disruptions in receiving events from Soda Cloud.
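
If the service that receives events can also inspect the sender's address, the same list can serve as an application-level check. The following is a minimal sketch, assuming the addresses listed above are still current and that your service sees the real source IP, which may not hold behind a proxy or load balancer. The function name is_soda_cloud_address is hypothetical.

```python
import ipaddress
from typing import Optional

# Egress IP addresses as listed in this section; they may change over time.
SODA_CLOUD_EGRESS_IPS = {
    "EU": ("54.78.91.111", "52.49.181.67"),
    "US": ("34.208.202.240", "52.35.114.145"),
}


def is_soda_cloud_address(remote_addr: str, region: Optional[str] = None) -> bool:
    """Return True if remote_addr is a listed Soda Cloud egress address.

    If region is given ("EU" or "US"), only that region's addresses match.
    """
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # not a valid IP address at all
    regions = [region] if region is not None else list(SODA_CLOUD_EGRESS_IPS)
    return any(
        addr == ipaddress.ip_address(listed)
        for name in regions
        for listed in SODA_CLOUD_EGRESS_IPS.get(name, ())
    )


print(is_soda_cloud_address("54.78.91.111"))               # True
print(is_soda_cloud_address("54.78.91.111", region="US"))  # False
print(is_soda_cloud_address("10.0.0.1"))                   # False
```

A check like this complements the firewall rule rather than replacing it.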

Single sign-on with Soda Cloud

Organizations that use a SAML 2.0 single sign-on (SSO) identity provider can add Soda Cloud as a service provider. Once added, employees of the organization can gain authorized and authenticated access to the organization’s Soda Cloud account by successfully logging in to their SSO. Refer to for details.

Soda Library usage statistics

To understand how users are using Soda Library, and to proactively capture bugs and performance issues, the Soda development team has added telemetry event tracking to Soda Library. See the instructions below to opt out.

Soda tracks usage statistics using the OpenTelemetry framework. The data Soda tracks is completely anonymous, does not contain any personally identifiable information (PII) in any form, and is purely for internal use.

Opt out of usage statistics

Soda Library collects usage statistics by default. You can opt out of sending Soda Library usage statistics at any time by adding the following to your ~/.soda/config.yml or .soda/config.yml file.
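
The following is a minimal sketch of applying the opt-out programmatically. It assumes the setting is a top-level key named send_anonymous_usage_stats set to false; that key name does not appear in this section, so verify it against the current Soda Library documentation before relying on it. The helper name opt_out_of_usage_stats is hypothetical.

```python
from pathlib import Path

# Assumption: this key name is not confirmed by the text in this section.
OPT_OUT_KEY = "send_anonymous_usage_stats"


def opt_out_of_usage_stats(config_path: Path) -> None:
    """Set the opt-out key to false in a Soda config file, creating it if needed."""
    lines = config_path.read_text().splitlines() if config_path.exists() else []
    # Drop any existing top-level setting so the key is not defined twice.
    lines = [line for line in lines if not line.startswith(f"{OPT_OUT_KEY}:")]
    lines.append(f"{OPT_OUT_KEY}: false")
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text("\n".join(lines) + "\n")
```

For example, calling opt_out_of_usage_stats(Path.home() / ".soda" / "config.yml") updates the user-level file, and passing Path(".soda") / "config.yml" updates the project-level one.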

Support

For the open source developer tools and free trial version of our software, Soda offers free support to the Soda community of users in Slack. Join the Soda community in Slack to ask and answer questions.

Community members are also welcome to create or resolve issues in the public, open-source Soda Core repository in GitHub.

Service Level Agreement

For customers using Soda Cloud Enterprise, Soda adheres to a Service Level Agreement (SLA) that outlines the support and maintenance services that Soda provides.

Generally speaking, the SLA for enterprise customers outlines Soda Cloud availability and Soda's incident and error support. Refer to the official SLA in your Enterprise contract for details.

Soda community code of conduct

Refer to the Soda Community Code of Conduct for behavior guidelines. Be safe, be respectful, be yourself.

This code of conduct describes the tenets of the inclusive and harassment-free environment that is the Soda Community. We do not tolerate harassment or discrimination of any participant in any form.

Soda Community: A welcome environment

We are genuinely proud of our growing and enthusiastic Soda community, both on and offline. It is delightful to be a part of something new and exciting in the industry, and we want to keep it that way!

As such, we are dedicated to making sure the Soda community is a welcoming place for everyone to participate, collaborate, innovate, and have fun! It is in this spirit that we offer this code of conduct and encourage all our members to be safe, be respectful, and be yourself.

We hope to maintain an environment in which all individuals can interact and collaborate in a positive way. Examples of behavior that contributes to creating a welcoming environment include:

Soda SQL and Soda Spark are now Soda Core

Soda SQL and Soda Spark have been deprecated and replaced by Soda Core.

The very first Soda OSS tools, Soda SQL and Soda Spark, served their community well since 2021. They have been deprecated. Read more about the decision to deprecate and move forward with Soda Core.

Soda SQL was the original command-line tool that Soda created to test for data quality. It has been replaced by Soda Core.
Soda Spark was an extension of Soda SQL that allowed you to run Soda SQL functionality programmatically on a Spark DataFrame. It has been replaced by Soda Core configured to connect with Apache Spark.

Soda architecture

Review the architecture and resources of Soda, which connects to data sources to perform scans of datasets.

Self-operated deployment

This deployment model is a simple setup in which you install Soda Library locally and connect it to Soda Cloud via API keys.

Soda-hosted agent deployment

Self-hosted agent deployment

Soda Cloud resources

Soda Cloud is made up of several parts, or resources, that work together to define checks, execute scans, and display results that help you gauge the quality and reliability of your data.

A Soda Agent is Soda Library that has been deployed in a Kubernetes cluster in a cloud services provider environment. It enables Soda Cloud users to securely connect to data sources such as Snowflake, BigQuery, and PostgreSQL.

Read more about Soda Agent.

A data source in Soda Cloud is a representation of the connection to your data source. Notably, it does not contain any of your data^†, only data source metadata that it uses to check for data quality.

Read more about adding new data sources.

Within the context of Soda Cloud, a data source contains:

datasets which represent tabular structures with rows and columns in your data source; like data sources, they do not contain your data^†, only metadata. Datasets can contain user-defined attributes that help filter and organize check results.
scan definitions, sometimes referred to as scan schedules, which you use to define a Soda scan execution schedule for the data source.
agreements in which you write checks to define what good data looks like. Agreements also specify where to send alert notifications when a check result warns or fails, such as to a Slack channel in your organization.

Read more about notification rules.

^† An exception to this rule exists when you configure Soda Cloud to collect sample data from a dataset, or samples of failed rows from a dataset when a check result fails.

Delete resources

The following non-exhaustive list of example deletions serves to illustrate the potential impact of deleting resources.

Delete a dataset

Delete a data source

Delete a scan definition

Delete an integration

A Soda Cloud Administrator has the ability to add, edit, and delete integrations with third-party service providers.

Example combination deployment

If your Soda Cloud account is also connected to Soda Library, your deployment may resemble something like the following diagram.

For example, imagine you use Soda Library to run scans and send results to Soda Cloud. In the checks.yml file that you use to define your checks, you have the following configuration:

Go further

Create a Soda Cloud account at .
As a business user, learn how to write SodaCL checks in in Soda Cloud.
Learn more about viewing in Soda Cloud.