These monitors require executing queries against the data itself to surface usage and content recency patterns, for example:
Most recent timestamp: the latest event or ingestion time across all rows
Partition row count: the number of records within the current partition (e.g. today’s data)
Query-based monitors give you a window into data flow and freshness, helping detect lags in ingestion pipelines or staleness in source systems.
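Conceptually, these monitors boil down to SQL like the following (table and column names here are hypothetical, for illustration only):

```sql
-- Most recent timestamp: the latest event or ingestion time across all rows
SELECT MAX(event_ts) FROM events;

-- Partition row count: records in the current partition (e.g. today's data)
SELECT COUNT(*) FROM events WHERE DATE(event_ts) = CURRENT_DATE;
```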
These monitors are derived directly from the data platform’s system metadata, without scanning row-level values. They surface structural signals, such as:
Last modification time: when the dataset was last updated
Schema changes: any alterations to the schema
Row count: the overall number of records in the dataset
Text metrics help catch formatting issues, truncated values, or unexpectedly long/free-form entries.
With Soda, it's possible to assess the character-length properties of string columns:
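The underlying character-length measurements correspond to SQL along these lines (table and column names are hypothetical):

```sql
-- Illustrative: profile the length of values in a string column
SELECT
  MIN(LENGTH(customer_name)) AS min_length,
  MAX(LENGTH(customer_name)) AS max_length,
  AVG(LENGTH(customer_name)) AS avg_length
FROM customers;
```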
Timestamp metrics highlight recency and time-based anomalies, which is crucial for validating timeliness in event streams and incremental loads:
This section introduces the key features and workflows in Soda for managing data quality issues and reporting.
Learn how to find datasets and checks, navigate dashboards, and understand check results.
You’ll also learn how to set up notifications to stay informed and build custom dashboards using tools like Power BI or Tableau.
Numeric metrics capture central tendency and dispersion in numerical columns, such as:
Mean (AVG)
Standard deviation (STDDEV_SAMP)
Minimum & maximum (MIN / MAX)
Quartiles (Q1, median, Q3)
Total row count change: the delta in row count compared to the previous observation
Because they read only metadata, these monitors are extremely lightweight to compute and ideal for continuous, real-time dashboarding of dataset activity.
Depending on your data source(s), metadata-based Metric Monitors may only be supported on Tables, and not on other data objects such as Views.
As alternatives, set up these metadata-based Metric Monitors on the source Tables of your non-Table data objects, or store these data objects as Tables instead.
How each metric is computed depends on the metric and, in some cases, on the data source:
Maximum: MAX(column).
Total row count change: Soda keeps track of the previous total row count, fetches the total row count again at scan time, and subtracts the two.
Standard deviation: uses the SQL-standard STDDEV_SAMP() for all databases, which is a sampling-based method.
Third quartile (Q3), numeric data: for data sources supporting exact percentiles (e.g. PostgreSQL’s PERCENTILE_DISC(0.75)), Soda uses that function; for data sources that provide approximations (such as BigQuery, SQL Server, Redshift, and Trino), Soda uses those approximated values.
Total row count: through the row count value provided by the metadata, which is calculated differently for every database.
First quartile (Q1), numeric data: for data sources supporting exact percentiles (e.g. PostgreSQL’s PERCENTILE_DISC(0.25)), Soda uses that function; for data sources that provide approximations, Soda uses those approximated values.
Median, numeric data: for data sources supporting exact percentiles (e.g. PostgreSQL’s PERCENTILE_DISC(0.5)), Soda uses that function; for data sources that provide approximations, Soda uses those approximated values.
Minimum: MIN(column).
Partition row count: through count(*) for the partition.
Most recent timestamp: MAX(timestamp).
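The exact-vs-approximate distinction can be illustrated with two Q3 queries that return comparable results (table and column names are hypothetical):

```sql
-- Exact (PostgreSQL): discrete 75th percentile
SELECT PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY amount) FROM orders;

-- Approximate (BigQuery): APPROX_QUANTILES(x, 4) returns 5 boundaries
-- (min, Q1, median, Q3, max); OFFSET(3) picks Q3
SELECT APPROX_QUANTILES(amount, 4)[OFFSET(3)] FROM orders;
```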
Numeric metrics enable you to detect outliers, shifts in scale, or drifts in distribution over time.
Soda helps data teams make sure their data can be trusted. It makes it easy to find, understand, and fix problems in the data.
You can use Soda to:
Monitor production data with automated, ML-powered observability that spots unexpected changes without needing to define every rule up front.
Define data contracts, making expectations explicit and enabling producers and consumers to collaborate on reliable data at the source.
Test data earlier in the pipeline, as part of CI/CD workflows or during development, to prevent bad data from reaching production.
Soda helps teams start right, automatically detecting anomalies in metrics after they happen, and shift left, preventing issues from happening again with collaborative data contracts.
This is the documentation for Soda v4. If you are still using Soda v3, head to the Soda v3 documentation.
The new version of Soda has transformed the software into a full data-quality platform by layering on:
End-to-end data observability:
Collaborative data contracts:
This marks the shift from a CLI-centric checks engine toward a unified, observability-driven data quality platform with a refined, three-tier Core + Agent + Cloud architecture, built-in contracts, orchestration, and deep integrations.
Read on to learn more about Soda's capabilities.
Data quality refers to how well a dataset meets the expectations of completeness, accuracy, timeliness, uniqueness, and consistency. Good data supports business goals, drives confident decision-making, and is the foundation of great data products.
Poor data quality causes failed pipelines, incorrect reports, and broken AI models. Managing data quality means proactively validating assumptions and reactively monitoring for drift or degradation.
Soda helps you answer questions like:
Is the data fresh and complete?
Are there unexpected values or duplicates?
Did values shift outside of expected ranges?
Are schema or contract changes causing breakage?
Data observability is a reactive approach to monitoring data in production and catching unexpected issues as they emerge. It helps answer the question: What is happening with my data right now, and how is that changing over time?
Use data observability to:
Detect anomalies in data quality metrics such as freshness, row counts, null values or custom ones
Monitor metric trends and seasonality
Identify late-arriving or missing records
Get alerted when values deviate from historical norms
Data testing is a proactive approach that validates known expectations about your data during development, deployment, or transformation. It helps you catch issues before they reach production, break reports, or impact downstream systems.
Use data testing to:
Align on what “good data” looks like through data contracts
Verify that your data meets those expectations, including schema, values, and transformations
Test data at every step of the pipeline to prevent bad data from reaching downstream systems
Integrate with CI/CD workflows for continuous quality checks during development
Data contracts define what a dataset should look like, including its schema, data types, value ranges, and other constraints. They establish a shared agreement between data producers and consumers about what’s expected and what must be upheld.
Both testing and observability play a role in upholding data contracts:
Testing validates that data meets the contract during development, pipeline execution, and on schedule.
Observability monitors contract adherence in production and detects unexpected issues.
While data testing and observability are different in when and how they operate, they work best together as a unified strategy.
Together, they enable end-to-end data quality management: testing prevents problems, and observability detects those that escape prevention. At the same time, observability can help prioritize which issues to address and shift left to resolve them upstream.
Managing data quality across hundreds or thousands of datasets requires a scalable, federated approach. Soda enables this through:
Metadata-driven observability that adapts checks to each dataset's structure and context.
Role-based collaboration so teams can take ownership of the data they know best.
An interface for both engineering and business users, enabling collaboration through code, UI, or APIs, depending on user preference and role.
Integration with existing tools and workflows, such as data catalogs and incident management systems.
Reliable data depends on collaboration across roles:
Data engineers embed tests and monitor pipelines to catch issues early.
Data producers and consumers align on expectations through data contracts.
Data consumers report issues and collaborate with producers to interpret metrics and resolve problems.
Governance teams define and enforce data quality standards.
Soda Cloud acts as the shared workspace where these roles collaborate, triage incidents, and resolve issues.
Soda offers three deployment models, depending on your infrastructure and data privacy needs.
Read more about deployment options.
Soda integrates with the modern data stack:
Data warehouses and databases: Databricks, Snowflake, BigQuery, Redshift, PostgreSQL, MySQL, Spark, Presto, DuckDB, and more.
Orchestration platforms: Airflow, Dagster, Prefect, Azure Data Factory.
Metadata tools: Atlan, Alation, Collibra, data.world, Zeenea.
Cloud providers: AWS, Google Cloud, Azure.
To get started with Soda, check out the end-to-end guide.
Need help or want to contribute?
Join our Slack Community:
Browse GitHub Discussions:
Still have questions? Use the search bar above or reach out through our community channels for additional help.
Metrics that support all data types are foundational checks that apply to any column, regardless of its type:
Count of non-NULL values
Count of distinct entries
These metrics form the backbone of data completeness and consistency monitoring, ensuring every column meets basic quality expectations.
For teams that manage data like software, Git-managed data contracts offer a code-first way to define and enforce data quality expectations.
In this model, contracts are written in YAML and stored in your Git repository, right alongside your data models, transformation logic, and CI/CD workflows. You write, version, test, and promote contracts just like any other code artifact.
This approach gives engineers full control, reproducibility, and integration into development pipelines. And with the right setup, you can still collaborate with non-technical users via Soda Cloud and even sync UI-authored changes into Git using our future proposal workflow.
Full version control Track every change, roll back when needed, and manage contracts with the same discipline as application code.
Code-first workflow Keep contracts close to your data models and transformations for better alignment, automation, and traceability.
CI/CD integration Run contract verifications in your existing pipelines; on every commit, PR, or deployment.
Team governance
If you're already managing your data infrastructure in Git, Git-managed contracts are the natural extension for bringing data quality under control without adding friction or silos.
In the next sections, we’ll walk you through how to set up, author, and run Git-managed contracts using the Soda CLI.
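As a sketch, a Git-managed contract for a hypothetical orders dataset might look like the following. The check names and layout here are illustrative only; consult the Soda contract language reference for the exact syntax supported by your version.

```yaml
# Illustrative contract sketch -- not authoritative syntax
dataset: orders
columns:
  - name: order_id
    data_type: varchar
    checks:
      - missing:      # no NULL order ids
      - duplicate:    # order ids must be unique
  - name: amount
    data_type: decimal
checks:
  - row_count:        # the dataset must not be empty
```

Because the file lives in Git next to your models, every change to these expectations goes through the same review and CI process as the rest of your code.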
Cloud-managed Data Contracts let you define and manage expectations for your data directly in the Soda Cloud UI.
This approach is perfect for data analysts, product owners, and business stakeholders who know what “good data” looks like but prefer intuitive tools over code. It’s also ideal for teams that want to move fast, collaborate visually, and integrate seamlessly with engineering workflows when needed.
With Soda Cloud, you can browse datasets, add quality rules, test and publish contracts, and set up scheduled or on-demand verification. All from your browser.
Faster time to value – no setup required
Accessible to everyone – empower domain experts, not just engineers
Built for collaboration – share, comment, and propose changes in a shared UI
Easily operationalized – schedule tests and trigger verifications programmatically
Cloud-managed contracts are a powerful way to bring your organization together around trusted data.
Before creating contracts in Soda Cloud, make sure:
You have a Soda Cloud account
You have access to an organization in Soda Cloud
You have connected at least one data source via a Soda Agent
The Soda Agent allows you to securely scan your data sources for quality issues directly from Soda Cloud. It can be self-hosted or Soda-hosted, depending on your deployment preferences. The self-hosted option allows for a more custom and secure deployment, while the Soda-hosted agent is easier to start with. Learn more about Deployment options
You can deploy a self-hosted agent in the infrastructure of your choice:
Kubernetes cluster
Amazon EKS
Azure AKS
Google GKE
Soda-hosted agents are included in all Free, Team, and Enterprise plans at no additional cost. However, self-hosted agents require an Enterprise plan.
If you wish to use self-hosted agents, please contact us to discuss Enterprise plan options, or reach out via the support portal if you are an existing customer.
When you access a monitor (from Metric Monitoring) or a check (from the Contract), Soda Cloud provides a time series view that shows how the monitored metric or check result evolves. This helps you explore the history of data quality issues, spot trends, and understand changes in your data.
For certain types of checks and monitors, additional diagnostic information is also available for each monitor or check result to help you investigate issues in more detail.
For example:
Schema Checks: View a side-by-side comparison of the actual vs. expected schema to identify differences.
Missing, Duplicate, or Invalid Checks: See the percentage of failed rows vs. passing rows to understand the scale and impact of the issue.
This view helps you drill down into specific data issues, explore context, and take informed action.
As soon as a data source is connected, the metric monitoring dashboard is available and will have historical information. Soda establishes a statistical baseline for each metric and continually compares new scan results against that baseline, flagging anomalies according to the sensitivity, exclusions, and threshold strategy you’ve configured.
Metric monitors are the foundation of data observability in Soda. Monitors track data quality metrics over time and leverage historical values for analysis. Soda automatically collects these metrics and examines how they evolve over time to identify when metrics deviate from expected patterns and trigger alerts. These deviations are surfaced and recorded in the Metric Monitors dashboard.
The Organization and Admin Settings in Soda Cloud provide a centralized interface for managing your organization’s configuration, roles, user access, and integrations. From setting your organization’s name to defining global roles, managing user groups, and enabling the Soda-hosted Agent, these settings help you tailor Soda Cloud to your team’s needs and governance policies.
To access the settings, click on your avatar on the top right, and then click Organization Settings.
Only users with the Manage Organization Settings global role can access and modify these settings.
Are data quality metrics changing over time?
Pipeline and CI/CD integration to automate data quality checks.
Platform teams deploy, manage, and secure the underlying infrastructure.
Self-hosted Agent
Same as Soda-hosted Agent, but deployed and managed in your own Kubernetes environment.
Teams needing full control over infrastructure and deployment.
Similar to Soda-hosted Agent, but deployed within the customer’s environment; data stays within your network.
Required for observability features. Cannot scan in-memory sources like Spark or DataFrames. Kubernetes expertise required.
BI tools: Looker, Tableau, Power BI.
Messaging and ticketing: Slack, Microsoft Teams, Jira, PagerDuty, ServiceNow, Opsgenie.
Data Testing
Proactive and preventative: Pre-production, during development or CI/CD
Prevent breakages before they happen: Validate known rules and enforce contracts
Data Observability
Reactive and adaptive: In production, runtime monitoring
Monitor data behavior and changes over time with automated detection of anomalies, schema changes, and other unexpected issues.
Soda Core
Open-source Python library (with commercial extensions) and CLI for running Data Contracts in your pipelines.
Data engineers integrating Soda into custom workflows.
Full control over orchestration, in-memory data support, contract verification.
No observability features. Required for in-memory sources (e.g., Spark, DataFrames). Data source connections managed at the environment level.
Soda-hosted Agent
Managed version of Soda that runs observability features, and executes and schedules Data Contracts.
Teams seeking a simple, managed solution for data quality.
Centralized data source access, no setup required, observability features enabled. Enables users to create, test, execute, and schedule contracts and checks directly from the Soda Cloud UI.
Required for observability features. Cannot scan in-memory sources like Spark or DataFrames.
Hybrid collaboration Combine Git workflows with Soda Cloud for monitoring, visualization, and cross-functional input via contract proposals.
The main difference between monitors and metrics is that monitors are configurable, while metrics are not.
Monitors build on top of metrics by wrapping their static measurement in a configurable context. Each monitor is customizable, so the user can select scan time, scan frequency, thresholds, and metric to be used.
Metrics, on the other hand, are only a part of the monitor. They are built-in, static definitions of data properties; it is not possible to alter how a metric is computed at its source, but it is possible to select which metric to track through a metric monitor.
Soda offers two main types of monitors to support scalable, layered observability: dataset monitors and column monitors.
Dataset monitors provide instant, no-setup monitoring based on metadata. They track high-level metrics like row count changes, schema updates, and insert activity, making them ideal for catching structural or pipeline-level issues across large numbers of datasets.
Column monitors are more granular and customizable. They focus on specific fields, allowing users to monitor things like missing values, averages, or freshness. These monitors are useful for capturing data issues that impact accuracy or business logic at the column level.
Together, they offer broad coverage and targeted insight, helping teams detect both systemic and localized data quality issues.
Each of these sections contains summarized information about the latest scan results for each monitor. From the health tab, you can access each monitor for further investigation and configuration, as well as create alerts.
You can turn any metric monitor into a proactive alert by clicking its bell icon on the Metric Monitors dashboard and selecting Add Notification Rule. This brings up the Add Notification Rule panel:
Name Enter a descriptive title for your rule (e.g. “Row-Count Alerts – Prod Sales”).
Data source Choose the warehouse or connection to scope your rule. Then, search for and check the specific tables (or columns) this rule should cover. The “Matches X datasets” badge updates in real time so you know exactly what you’ll be alerting on.
Applies to Pick which check type you want to alert on.
Recipients Select one or more notification targets:
Email addresses
Slack channels
Other integrations
This dialog lets you reuse a single rule for multiple datasets or checks, ensuring your team only gets the notifications they care about.

The compute method depends on the database: Soda requires specific metadata fields that differ for every database. No sampling is used.
In Redshift, adding columns is not part of last_modification_time.
Last modification time: Soda uses metadata
Note that past data is only available for a limited amount of time, which varies depending on the system; at minimum, history goes back 120 hours.
Non-UTC timestamps are not recommended when connecting Soda to Oracle data sources. Soda uses timezone data when available, but assumes UTC when the timezone is not provided by the data source.
Some databases convert timestamps to UTC, but Oracle does not do any implicit conversions and stores timestamps and timezone information as the user inputs them. Because of Oracle Python client limitations, all timezone information is stripped when Soda retrieves it, which means that Soda will read all timestamps as if they were UTC regardless of the original input.
Metadata is supported, but it requires some additional setup on the Postgres side.
Historical backfilling: not possible.
Row count: enabled out-of-the-box.
Last modification time: track_commit_timestamp must be enabled: https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-TRACK-COMMIT-TIMESTAMP
If track_commit_timestamp is not enabled, Soda will return a warning.
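Enabling track_commit_timestamp is a server-level change that requires superuser privileges and a server restart to take effect. A minimal sketch:

```sql
-- Writes the setting to postgresql.auto.conf (requires superuser)
ALTER SYSTEM SET track_commit_timestamp = on;

-- Restart the PostgreSQL server, then confirm:
SHOW track_commit_timestamp;
```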
Metadata metrics are available and supported in BigQuery.
Historical backfilling: possible.
Partition column: can be suggested based on metadata available in BigQuery.
Soda will prioritize user-suggested columns.
If there are no user-suggested columns, Soda will try a metadata approach to find the partition column automatically.
If there are no columns found in the metadata of BigQuery, Soda will fall back on its own heuristic.
Historical backfilling is supported on Redshift and it is limited to 7 days for the metadata.
Modification time does not include schema changes. Only:
inserts
updates
deletes
Synapse does not provide metadata history tables.
Historical backfilling: not possible.
Last modification time: not possible.
Row count: current row counts are calculated via count(*).
Soda does not use metadata for this metric in Synapse.
Quartile metrics (Q1, median, Q3): not possible. Synapse does not support quartile metrics.
You may find it useful to set up multiple organizations in Soda Cloud so that each corresponds with a different environment in your network infrastructure, such as production, staging, and development. Such a setup makes it easy for you and your team to access multiple, independent Soda Cloud organizations using the same profile, or login credentials.
Note that Soda Cloud associates any API keys that you generate within an organization with both your profile and the organization in which you generated the keys. API keys are not interchangeable between organizations.
Contact [email protected] to request multiple organizations for Soda Cloud.



Before you can define, test, or verify Git-managed data contracts, you need to install the Soda CLI and configure your environment.
This setup gives you full control over your contracts, letting you version them in Git, execute them locally or remotely, and integrate them into your CI/CD pipelines.
Install Soda Core using pip:
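For example (the package names below are assumptions based on Soda's published packages; check the installation guide for the package that matches your version and data source):

```shell
pip install soda-core
# plus a data-source-specific package, e.g. for PostgreSQL:
pip install soda-core-postgres
```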
If you need authenticated access, follow the installation instructions to set up your environment and install the necessary Soda extensions.
If you want to interact with Soda Cloud to publish contracts and view verification results, or to use Soda Agent, you’ll need to connect the CLI to your Soda Cloud account.
Don’t have an account? Sign up to get started.
This generates a basic Soda Cloud configuration file.
Open sc.yml and fill in your API key and organization details.
Learn more about how to generate keys:
This ensures the CLI can authenticate and communicate with Soda Cloud.
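As a sketch, a filled-in sc.yml might look like the following (key names are illustrative; consult the Soda Cloud configuration reference for the exact schema):

```yaml
# Illustrative sc.yml -- use environment variables rather than
# committing secrets to Git
soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
```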
To verify a contract, Soda needs to know how to connect to your data source. You have two options:
If you prefer to define your own connection locally (or aren’t using a Soda Agent), you can create a data source config file for Soda Core.
Install the required package for your data source. For example, for PostgreSQL:
See the reference documentation for supported packages and configurations.
Open ds.yml and provide the necessary credentials.
For example with PostgreSQL:
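A sketch of a PostgreSQL ds.yml follows; the key layout is illustrative, and the host, database, and credential values are placeholders — refer to the data source reference for the exact configuration your version expects:

```yaml
# Illustrative ds.yml for a PostgreSQL data source
data_source my_postgres:
  type: postgres
  host: localhost
  port: "5432"
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: analytics
  schema: public
```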
Refer to the reference documentation for the configurations of each data source type.
If your data source is already connected to Soda Cloud using a Soda Agent (hosted or self-hosted), you can reuse that connection without managing credentials or configs locally.
You just need to ensure you have set up the connection with Soda Cloud.
Choose the method that best fits your setup:
Use Soda Agent for a centralized, cloud-managed connection, or local configuration if you want full control within your environment.
The Soda Agent is a Helm chart that you deploy on a Kubernetes cluster and connect to your Soda Cloud account using API keys.
To take advantage of new or improved features and functionality in the Soda Agent, including new features in the Soda Library, you can upgrade your agent when a new version becomes available in ArtifactHub.io.
Note that there is no downtime associated with upgrading a self-hosted Soda Agent. Because Soda does not define .spec.strategy in the deployment manifest of the Soda Agent Helm chart, Kubernetes uses the default RollingUpdate strategy to upgrade; refer to the Kubernetes documentation.
If you regularly access multiple clusters, ensure that you are first accessing the cluster that contains your deployed Soda Agent. Use the following command to determine which cluster you are accessing.
If you must switch contexts to access a different cluster, copy the name of the cluster you wish to use, then run the following command.
To upgrade the agent, you must know the values for:
namespace - the namespace you created, and into which you deployed the Soda Agent
release - the name of the instance of a helm chart that is running in your Kubernetes cluster
API keys - the values Soda Cloud created, which you used to run the agent application in the cluster.
Access the first two values by running the following command.
Output:
Access the API key values by running the following command, replacing the placeholder values with your own details.
From the output above, the command to use is:
Use the following command to search ArtifactHub for the most recent version of the Soda Agent Helm chart.
Use the following command to upgrade the Helm repository.
Upgrade the Soda Agent Helm chart. The value for the chart argument can be a chart reference such as example/agent, a path to a chart directory, a packaged chart, or a URL. To upgrade the agent, Soda uses a chart reference: soda-agent/soda-agent.
From the output above, the command to use would be:
OR, if you use a values YAML file,
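Pulled together, the upgrade workflow looks roughly like the following. The release name, namespace, and `--set` key paths are placeholders: verify each command and value against your own environment before running it.

```shell
# Confirm you are pointing at the right cluster; switch if needed
kubectl config current-context
kubectl config use-context <cluster-name>

# Find the namespace and release name of the running agent
helm list --all-namespaces

# Retrieve the values (including API keys) currently in use
helm get values <release> --namespace <namespace>

# Find the latest chart version on ArtifactHub and refresh the repo
helm search hub soda-agent
helm repo update

# Upgrade with explicit values...
helm upgrade <release> soda-agent/soda-agent \
  --namespace <namespace> \
  --set soda.apikey.id=<api-key-id> \
  --set soda.apikey.secret=<api-key-secret>

# ...or with a values YAML file
helm upgrade <release> soda-agent/soda-agent \
  --namespace <namespace> \
  --values values.yml
```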
The Datasets page displays all datasets that have been onboarded into Soda Cloud—either through publishing a contract or via the onboarding process: Onboard datasets on Soda Cloud .
It provides a quick overview of each dataset’s health, showing at a glance if a dataset has issues, how many checks from its contract are failing, how many anomalies were detected through metric monitoring, and when the last scan was executed.
You can filter datasets by properties like data source, owners, arrival time, attributes, or flags such as failures or anomalies. Use the search bar to quickly find a specific dataset by name, and sort the list by name, creation time, or data quality status.
Learn more about custom attributes: Dataset Attributes & Responsibilities
You can also sort the datasets list by name, creation date, check failures, or anomalies to prioritize your focus.
You can tailor the Datasets view to focus on the areas that matter most to you:
Use the filter options to narrow down the view
Click the Save Dashboard button to store your current filter configuration as a collection.
Enter a name for the collection and click Save
Once saved, your collection will be available in the dropdown at the top right of the dashboard. Simply select it to switch views.
Use the context menu next to the collection name to:
Delete the collection if it’s no longer needed.
Share the collection with others in your organization.
Diagnostics Warehouse provides a clear, detailed view of the state of data checks while allowing access to failed rows in order to take a closer look and resolve data quality issues.
Diagnostics Warehouse stores all Soda scan details, failed records, and historical data quality issues directly in your data warehouse of choice, safely and securely. Nothing is stored outside. This gives you the ability to run diagnostics, resolve issues, and see exactly why problems happen. You can go as deep as you need, from a single record to a full dataset.
Each time a Soda scan runs, Diagnostics Warehouse stores failed rows together with check and scan results, and related metadata attributes. With that information, data teams can quickly diagnose and resolve issues at both dataset and row level. Additionally, Soda's Diagnostics Warehouse makes it easier for teams to build on top of Soda's outputs to set up operational workflows, and connect to BI tools you already know and trust.
Full diagnostic information in one place, including attributes.
Transparency for all: replace black-box runs with auditable facts and keep an immutable, queryable history of what was checked, when, how long it took, what failed, and why.
Faster root-cause analysis: jump from a failed check to the exact failed rows, affected datasets/columns, and prior history to see if it’s a one-off issue or a pattern.
Data minimization: Diagnostics Warehouse stores metadata about runs and checks and, for row-level checks, it only stores failed rows when the option is enabled.
Warehouse residency: Diagnostics are not stored in Soda. They live in your analytics warehouse, respecting your access controls, encryption, and audit trails.
Enable Diagnostics Warehouse in your Soda data source settings.
Grant the service identity permission to create and write to the Diagnostics Warehouse schema in your warehouse.
Run your checks; Diagnostics Warehouse tables populate automatically.
Query your warehouse and connect to your BI tools to start exploring.
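For example, a first exploration query might look like the following. The schema, table, and column names here are entirely hypothetical; the actual schema is defined by the Diagnostics Warehouse in your warehouse.

```sql
-- Hypothetical names: inspect recent failing checks
SELECT check_name, dataset_name, outcome, scan_time
FROM soda_diagnostics.check_results
WHERE outcome = 'fail'
ORDER BY scan_time DESC;
```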
Next: to enable Diagnostics Warehouse in your organization, reach out to Soda.
The Checks page displays all checks defined in a data contract and tracked in Soda Cloud. It provides a quick overview of check health across datasets, allowing you to create custom groupings by applying filters such as data source, dataset, owners, or status (pass, fail, warning). This helps you focus on specific areas or teams that matter most.
You can also review key details like the check type, the dataset it belongs to, and the time of the last scan. Use the search bar to quickly find a specific check by name, and sort the list by name, last run time, or check status.
You can filter checks by properties such as data source, dataset, owners, attributes, or status (pass, fail, warning). Use the search bar to quickly find a specific check by name.
Learn more about custom attributes: Check and dataset attributes
You can also sort the list by name, last run time, or check status.
You can tailor the Checks view to focus on the areas that matter most to you:
Use the filter options to narrow down the view
Click the Save Dashboard button to store your current filter configuration as a collection.
Enter a name for the collection and click Save
Once saved, your collection will be available in the dropdown at the top right of the dashboard. Simply select it to switch views.
Use the context menu next to the collection name to:
Delete the collection if it’s no longer needed.
Share the collection with others in your organization.
The Dataset Page provides a detailed view of each dataset’s health and monitoring information in Soda Cloud. It includes several tabs to help you explore and manage data quality at the dataset level.
Displays the results of all checks defined in the dataset’s contract. Checks are grouped by column, with column-level checks nested under their respective columns. Columns with failed checks are automatically expanded so you can spot issues quickly. You can filter the view to show only failed checks and search for specific checks or columns by name for faster navigation and troubleshooting.
Learn more about Data Testing
Shows the metrics that are actively monitored for the dataset, helping you track trends and detect anomalies over time.
Learn more about
Provides an overview of the dataset’s structure, including column names, data types, distinct counts, and other statistics. You can search for a specific column by name to quickly locate and review its profiling details.
Lists incidents related to the dataset, helping you track issues and collaborate on resolution. You can filter incidents based on criteria such as user lead, status, or severity to focus on the most important or urgent cases.
Learn more about
The Organization Dashboard provides a high-level overview of your data quality across datasets and checks in Soda Cloud. It shows key trends over time, such as the number of checks that are passing, failing, or in a warning state, helping you identify issues early.
You’ll also find key metrics, including:
Scans in failed mode: Datasets that are currently blocked due to failing checks.
Checks currently failing: Active checks that need attention.
Overall Health Score: The proportion of failing checks out of the total number of checks.
These insights allow you to quickly identify where action is needed.
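As a rough sketch, the health-score arithmetic described above can be expressed as follows. The function name and the exact formula are illustrative assumptions, not Soda's implementation:

```python
# Hypothetical sketch of an overall health score derived from check counts,
# mirroring how the Organization Dashboard summarizes failing vs. total checks.
def health_score(failing: int, total: int) -> float:
    """Return the fraction of passing checks (1.0 = fully healthy)."""
    if total == 0:
        return 1.0  # no checks defined: nothing is failing
    return (total - failing) / total

print(health_score(3, 20))  # 0.85
```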
You can tailor the Organization Dashboard to focus on the areas that matter most to you:
Apply filters based on attributes: Use the filter options to narrow down the view by attributes
Click the Save Dashboard button to store your current filter configuration as a collection.
Enter a name for the collection and click Save
Once saved, your collection will be available in the dropdown at the top right of the dashboard. Simply select it to switch views.
Use the context menu next to the collection name to:
Delete the collection if it’s no longer needed.
Share the collection with others in your organization.
The Activity section offers insights into how Soda is being used across your organization. It tracks adoption metrics, such as active users, active checks, active datasets, and the number of alerts in the last 90 days.
In addition to the built-in dashboards in Soda Cloud, you can build custom dashboards tailored to your organization’s specific needs. By leveraging the Soda REST API, you can programmatically retrieve data quality metrics, check results, and incident details, and integrate them into external dashboarding tools such as Power BI, Tableau, or Looker.
This enables you to create tailored views and reports that align with your business logic and audience, ensuring stakeholders get the right insights in the tools they already use.
Learn more on
The General Settings page allows you to configure foundational settings for your organization in Soda Cloud. These settings impact how your organization operates and how users interact with the platform.
Set the name of your organization. This name appears throughout Soda Cloud, such as in dashboards, reports and notifications.
Enable the Login As feature to allow the Soda Support team to log in as an admin within your organization. This can be useful when troubleshooting issues or providing assistance.
You can choose whether to use the Soda-hosted Agent by enabling or disabling it in the Organization Settings:
Toggle the Soda-hosted Agent option to enable or disable the agent for your organization.
Disabling the agent prevents Soda Cloud from running scans or checks via the managed agent. You’ll need to use a self-hosted agent or Soda Core in your environment instead.
By default, the Soda-hosted Agent collects profiling information (such as column-level statistics and schema details) to support features like dataset discovery and monitoring in Soda Cloud.
You can choose to disable profiling if you prefer not to send profiling data to Soda Cloud: Toggle the Profiling Data Collection option to disable profiling for your organization.
This ensures that no profiling information is collected or pushed to Soda Cloud. Only check results and metadata necessary for contract validation will be processed.
Manage secure storage of secrets such as API keys, credentials or connection details. Secrets can be used in data source configurations, checks, and other automated processes.
For more information, see the
Once your contract is authored and published (or available locally), you can verify whether the actual data complies with the defined expectations. Soda provides two execution options:
Soda Core – run verifications locally, typically in CI/CD pipelines or dev environments.
Soda Agent – run verifications remotely using an agent deployed in your environment, triggered via Soda Cloud.
Both approaches support variable overrides, publishing results to Soda Cloud, and integration into automated workflows.
Learn more about Deployment options
Soda Core runs the verification locally, connecting to your data source using the defined data source configuration file.
This command:
Connects to your database using the local config
Loads the contract
Runs all checks and returns a pass/fail result
You can pass variables defined in the contract using the --set flag:
Learn about variables in Data Contract:
To send verification results to Soda Cloud for visibility and reporting, add the --publish flag to the command.
This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration.
Learn more about permissions here:
Learn how to connect the CLI to Soda Cloud:
This is recommended if you want stakeholders to see the outcomes in Soda Cloud or include them in dashboards and alerting.
Soda Agent executes verifications using data sources configured in Soda Cloud.
This setup:
Runs verifications through the Soda Agent connected to your data source
Fetches the published contract from Soda Cloud
Returns the result locally in the CLI
You can pass variables defined in the contract using the --set flag:
Learn about variables in Data Contract:
You can also push results to Soda Cloud from the agent-based run.
Add the flag --publish to the command.
This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration.
Learn more about permissions here:
This is recommended if you want stakeholders to see the outcomes in Soda Cloud or include them in dashboards and alerting.
Soda’s notification system helps you stay informed when data issues occur—whether it’s a failed check or an anomaly detected through metric monitoring. Notifications are dynamically dispatched using notification rules, allowing you to target alerts based on specific properties, attributes, or datasets.
Notification rules define when and to whom a notification is sent. Rules can be configured to match specific checks or anomalies, ensuring the right people are notified at the right time.
Only users with the Manage Notification Rules permission can create or edit rules. All users can view rules. Read about
To create a new notification rule:
Click on your profile in Soda Cloud and select Notification Rules from the menu.
Click New Rule.
Provide a name for the rule.
Define the Rule Scope
Checks:
All checks: The rule applies to every check in your organization.
Specific checks: Build custom rules by filtering on check properties, dataset properties, or attributes.
Anomalies from Metric Monitoring: Select specific datasets where the rule applies.
Define the recipients (users, groups, or integrations like Slack, Teams, or webhooks).
...and choose the alert type (only applicable for checks, not anomalies):
Only failures
Failures and warnings
All statuses
Save to create the notification rule
You can pause a notification rule at any time to temporarily disable alerts without deleting the rule.
When you delete the Soda Agent Helm chart from your cluster, you also delete all the agent resources on your cluster. However, if you wish to redeploy the previously-registered agent (using the same name), you need to specify the agent ID in your override values in your values YAML file.
In Soda Cloud, navigate to your avatar > Agents.
Click to select the agent you wish to redeploy, then copy the agent ID of the previously-registered agent from the URL.
For example, in the following URL, the agent ID is the long UUID at the end. https://cloud.soda.io/agents/842feab3-snip-87eb-06d2813a72c1.
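If you script this step, the agent ID can be pulled from the URL programmatically. A minimal Python sketch; the helper name is invented for illustration:

```python
# Hypothetical helper: extract the agent ID (the trailing segment) from a
# Soda Cloud agent URL such as https://cloud.soda.io/agents/<agent-id>.
from urllib.parse import urlparse

def agent_id_from_url(url: str) -> str:
    path = urlparse(url).path            # e.g. "/agents/842feab3-..."
    return path.rstrip("/").rsplit("/", 1)[-1]

print(agent_id_from_url("https://cloud.soda.io/agents/842feab3-snip-87eb-06d2813a72c1"))
# 842feab3-snip-87eb-06d2813a72c1
```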
Alternatively, if you use the base64 CLI tool, you can run the following command to obtain the agent ID.
kubectl get secret/soda-agent-id -n soda-agent --template={{.data.SODA_AGENT_ID}} | base64 --decode

Open your values.yml file, then add the id key:value pair under agent, using the agent ID you copied from the URL as the value.
To redeploy the agent, you need to provide the values for the API keys the agent uses to connect to Soda Cloud in the values YAML file. Access the values by running the following command, replacing the soda-agent values with your own details, then paste the values into your values YAML file.
Alternatively, if you use the base64 CLI tool, you can run the following commands to obtain the API key and API secret, respectively.
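The `| base64 --decode` step is needed because Kubernetes stores secret values base64-encoded. A small Python illustration of what the decoding does:

```python
# Kubernetes secrets hold base64-encoded values; the template output must be
# decoded before use. The sample value here is made up for illustration.
import base64

encoded = base64.b64encode(b"my-api-key-id").decode()  # as stored in the secret
print(base64.b64decode(encoded).decode())              # my-api-key-id
```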
In the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.
Validate the Soda Agent deployment by running the following command:
Group By monitors enable you to track data quality metrics across specific segments of your dataset. Instead of monitoring a metric for a column as a whole, you can break it down per category (for example, per region, per school year, per status).
This functionality is especially valuable when:
You want to detect anomalies at a more granular level, within each segment or category.
You need visibility into how data quality differs across categories.
Incidents help you track, investigate, and resolve data quality issues when they occur. An incident is created when a data issue, such as a failed or warning check, has been confirmed and assigned to someone for resolution.
To create or update an incident, the user must have the "Manage Incidents" permission on the related dataset.
With Git-managed contracts, you define your expectations as code using YAML. This gives you full control over how your data is validated, and allows you to manage contracts just like any other code artifact: versioned, tested, and deployed via Git.
To learn all about the structure and supported features, refer to the full specification in the
The contract structure includes:
Dataset and column structure
Available check types (missing, invalid, duplicate, freshness, and more)
Once a Data Contract is published, the next step is to verify that the actual data matches the expectations you’ve defined. Soda offers several flexible ways to execute contract verifications, depending on your needs and technical setup.
You can manually verify a contract at any time from the dataset page in Soda Cloud.
Simply open the dataset and click Verify Contract. This will:
Attributes allow you to add descriptive metadata to your datasets and checks. This metadata can then be:
Used for filtering in Soda Cloud, making it easier to search and organize datasets and checks based on specific criteria (e.g., business domain, sensitivity, criticality).
Leveraged in reporting, enabling you to group datasets, track ownership, and monitor data quality across different categories or dimensions.
Adding meaningful attributes enhances discoverability, governance, and collaboration within Soda and its integrations.
This reference hub includes detailed documentation for Soda’s key interfaces and configuration options, as well as information on Soda's architecture and specifics, including:
Data Contract Language Reference – Author, validate, and manage data contracts using YAML-based definitions.
CLI Command and Python Reference – Use Soda’s command-line interface to configure, run, and automate verification workflows.
REST API Reference – Interact with Soda Cloud programmatically to manage datasets, run verifications, and retrieve results.
pip install -i https://pypi.dev.sodadata.io/simple soda-postgres

kubectl config get-contexts

Organization-level visibility: roll up results by domain, team, or pipeline. Show the impact of your data quality program to leadership with real, defensible metrics.
Open & portable features: it’s just tables in your warehouse. Query with SQL, power dashboards, join with lineage, incident, or cost data, and automate workflows.
Security & Governance: Diagnostics Warehouse stores tables in your own warehouse, giving you full control over security, retention and access.
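As an illustration of "just tables in your warehouse", here is a hypothetical sketch querying check results with plain SQL, using an in-memory SQLite table. The table and column names are invented; the real Diagnostics Warehouse schema may differ:

```python
# Hypothetical example: once results live as warehouse tables, any SQL client
# can aggregate them. Table/column names here are assumptions for the sketch.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE check_results (dataset TEXT, outcome TEXT)")
conn.executemany(
    "INSERT INTO check_results VALUES (?, ?)",
    [("orders", "pass"), ("orders", "fail"), ("customers", "pass")],
)

# Failing checks per dataset, the kind of roll-up a BI dashboard would show.
failing = conn.execute(
    "SELECT dataset, COUNT(*) FROM check_results "
    "WHERE outcome = 'fail' GROUP BY dataset"
).fetchall()
print(failing)  # [('orders', 1)]
```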
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    id: "842feab3-snip-87eb-06d2813a72c1"
    name: "myuniqueagent"

kubectl config use-context <name of cluster>

helm list
NAME        NAMESPACE   REVISION  UPDATED                               STATUS    CHART              APP VERSION
soda-agent  soda-agent  5         2023-01-20 11:55:49.387634 -0800 PST  deployed  soda-agent-0.8.26  Soda_Library_1.0.0

helm get values -n <namespace> <release name>
helm get values -n soda-agent soda-agent

helm search hub soda-agent

helm repo update

helm upgrade <release> <chart> \
  --set soda.agent.name=myuniqueagent \
  # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
  --set soda.cloud.endpoint=https://cloud.soda.io \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set soda.agent.logFormat=raw \
  --set soda.agent.loglevel=ERROR \
  --namespace soda-agent

helm upgrade soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
  --set soda.cloud.endpoint=https://cloud.soda.io \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set soda.agent.logFormat=raw \
  --set soda.agent.loglevel=ERROR \
  --namespace soda-agent

helm upgrade soda-agent soda-agent/soda-agent \
  --values values-local.yml --namespace soda-agent

helm get values -n soda-agent soda-agent

kubectl get secret/soda-agent-apikey -n soda-agent --template={{.data.SODA_API_KEY_ID}} | base64 --decode
kubectl get secret/soda-agent-apikey -n soda-agent --template={{.data.SODA_API_KEY_SECRET}} | base64 --decode

helm install soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent

kubectl describe pods

Only one Group By monitor can be configured at a time.
Because a Group By monitor spawns multiple monitors (one per category), limiting this to a single configuration helps manage performance.
When a Group By monitor is active in a dataset, results are displayed at the bottom of the Metric Monitors tab, on the Column Monitors table:
There, you will see:
A Group By monitor is listed like any other monitor, but its description indicates the Group By column(s) and the metric being measured (e.g. "Maximum length of Bus_No grouped by Breakdown_or_Running_Late").
From the Column Monitors table, it is possible to turn on notifications at the column level by clicking on the bell icon. Note that notifications at a category level are not available at the moment.
Expanding the monitor displays a groups table, which shows the results for each group or category. Each row corresponds to one category (or combination of categories if multiple columns are grouped). From the groups table, it is possible to delete specific categories by clicking on the bin icon on the right.
Deleting from the groups table is intended to remove groups/categories that are no longer present in the data.
Deleting a category removes the history for that monitor.
If the group/category is still present in the data, the monitor will be re-created on the next scan. It will not be backfilled, unless a historical metric collection scan is triggered.
Example:
Group By Breakdown_or_Running_Late + metric Maximum length of Bus_No → a row for each Breakdown_or_Running_Late value, with the maximum bus number length observed in that category, alongside its anomaly detection status.
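The computation behind this example can be sketched in a few lines of Python. The column and category names come from the example above; the sample rows are made up:

```python
# Sketch of what a Group By monitor measures in the example above:
# the maximum length of Bus_No within each Breakdown_or_Running_Late category.
rows = [
    {"Bus_No": "1234", "Breakdown_or_Running_Late": "Running Late"},
    {"Bus_No": "98",   "Breakdown_or_Running_Late": "Breakdown"},
    {"Bus_No": "567",  "Breakdown_or_Running_Late": "Running Late"},
]

max_len_per_group: dict[str, int] = {}
for row in rows:
    group = row["Breakdown_or_Running_Late"]
    length = len(row["Bus_No"])
    max_len_per_group[group] = max(max_len_per_group.get(group, 0), length)

print(max_len_per_group)  # {'Running Late': 4, 'Breakdown': 2}
```

Each key in the result corresponds to one row of the groups table, with anomaly detection applied to the metric's history per category.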
You can add a Group By monitor from the Metric Monitors section of the dataset page.
Scroll to the Column Monitors table and click Add Column Monitors.
In the Add Column Monitors panel, toggle on Group By.
Select one or more columns to group by.
For the time being, only columns with a maximum of 50 distinct categories are eligible for Group By monitoring.
(Optional) Exclude specific categories (segments) that you don’t want to monitor.
Select one or more columns to monitor under Column Selection.
Enable one or more metrics from the right-hand list.
Click Add 1 Monitor on the top right to save.
The monitor now appears in the Column Monitors table and starts tracking anomalies across each category.
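The 50-distinct-category eligibility rule from the steps above amounts to a simple distinct count. A hypothetical pre-check (the function is an illustration, not part of Soda):

```python
# Hypothetical eligibility check: a column qualifies for Group By monitoring
# only if it has at most 50 distinct categories (the documented limit).
def eligible_for_group_by(values: list[str], limit: int = 50) -> bool:
    return len(set(values)) <= limit

print(eligible_for_group_by(["EU", "US", "EU", "APAC"]))  # True
```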
Categories can be excluded when configuring the monitor. See Step 4 on Add Group By monitors.
Categories can be deleted after creation from the Groups table if you decide they should no longer be monitored.
One Group By monitor at a time: Only one configuration is allowed, since Group By monitors expand into many underlying monitors.
Multiple Group By columns: More than one column can be selected, but the generated categories are combinatory.
Category limits: Columns with more than 50 categories cannot be used for Group By monitoring.
Exclusions and deletions: You can exclude categories at configuration time or delete them later from the Groups table.
Notifications: Notifications are configured at the column level, not yet at the per-category level.
With Group By monitors, you gain more granular visibility into your data quality, while keeping control over compute cost and category management.
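The combinatory growth of categories when grouping by multiple columns can be illustrated with a cartesian product (the column values here are invented):

```python
# Grouping by two columns multiplies the categories: every combination of
# distinct values becomes its own underlying monitor.
from itertools import product

regions = ["EU", "US", "APAC"]     # 3 distinct values
statuses = ["active", "inactive"]  # 2 distinct values

combined = list(product(regions, statuses))
print(len(combined))  # 6 -> one underlying monitor per combination
```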
You can create an incident directly from a check result when an issue has been identified:
On a check page, use the context menu to select Create Incident.
Provide a name and description for the incident.
Select one or multiple related check results that you want to associate with the incident.
Click Save to proceed
Once created, the incident will appear in the Incidents tab of the corresponding Dataset Page
It is possible to filter incidents based on lead, status, reporter, and severity.
Incidents can also be seen in a central place in Soda Cloud. In the top navigation, click on Incidents to see all the incidents of the organization.
Use the filters and the title search to find relevant incidents.
Assign a lead: Every incident requires a lead: the user responsible for resolving the issue.
Update status: Track progress by updating the incident’s status as the investigation and resolution evolve.
Add a resolution note: When marking an incident as resolved, a resolution note is mandatory to document what was done.
Include more check results: If new results are failing, you can include them in the incident.
After any changes, click Save to apply them.
You can integrate incidents with Slack, MS Teams, or other external systems using Soda’s webhook capabilities or the Incidents API. Learn on how to integrate with Soda: Integrations
Filters (dataset-level and check-level) — Optional
Threshold configuration — Optional
Use of variables — Optional
Scheduling — Optional
...and more
Before publishing or verifying your contract, you can run a test command to ensure the contract is correctly defined and points to a valid dataset.
This will:
Validate the YAML syntax
Confirm that the referenced dataset exists
Check for any issues with variables, filters, or structure
Before publishing your contract, you may want to execute it to verify that it runs correctly.
Read more about how to Verify a contract
If you have Soda Cloud, once your contract is finalized and tested, you can publish it to Soda Cloud, making it the authoritative version for verification and scheduling.
This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration.
Learn more about permissions here: Dataset Attributes & Responsibilities
Learn how to connect the CLI to Soda Cloud:
Publishing:
Uploads the contract to Soda Cloud
Makes it available for manual, scheduled, or programmatic verifications
Enables visibility and collaboration through the UI
Once published, the contract becomes the source of truth for the dataset until a new version is published.
You’re now ready to start verifying your contract and monitoring your data.
By default, your existing Soda API key and secret values are used to authenticate to the Soda image registry.
Ensure these values are still present in your values.yaml; no further action is required.
You might also opt to use a new, separate Soda API key and secret to perform the authentication to the Soda image registry.
In this case, ensure the imageCredentials.apikey.id and imageCredentials.apikey.secret values are set to these new values:
If you provide your own imagePullSecrets on the cluster, e.g. when you pull images from your own mirroring image registry, you must modify your existing values file.
The imagePullSecrets property that was present in versions 1.1.x has been renamed to the more standard existingImagePullSecrets.
If applicable to you, perform the following rename in your values file:
For more information on setting up image mirroring, see Mirroring images
If you are a customer using the US instance of Soda Cloud, you'll have to configure your Agent setup accordingly. Otherwise you can ignore this section.
In version 1.2.0 we introduce a soda.cloud.region property that determines which registry and Soda Cloud endpoint to use. Possible values are eu and us. When soda.cloud.region is not set explicitly, it defaults to eu.
If applicable to you, please perform the following changes in your values file:
For more information about using the US region, see Using the US image registry.
The scanlauncher section in the values file has been renamed to scanLauncher.
Please ensure the correct name is used in your values file if you have any configuration values there:
Execute the checks in the published contract
Use the latest available data
Display pass/fail results directly in the UI
This is especially useful for one-off validations, exploratory testing, or during incident investigations.
This action requires the "Manage contract" permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities
To monitor data quality over time, you can set up scheduled verifications directly in the contract editor.
When editing or viewing a contract:
Go to the Schedule section
Choose how often you want the contract to be verified (e.g., hourly, daily, weekly)
Save the schedule
Soda Cloud will automatically run the contract at the specified intervals, using the selected agent. All results are stored and visualized in Soda Cloud, with alerts triggered when rules fail (if configured).
For advanced workflows and full automation, you can verify contracts programmatically using the Soda CLI and a Soda Agent.
This is ideal for:
CI/CD pipelines
Custom orchestration (e.g., Airflow, dbt Cloud, Dagster)
Triggering verifications after data loads
First, create a Soda Cloud configuration file:
This generates a basic config file. Open it and fill in your API key and organization details.
Learn how to Generate API keys
You can test the connection:
Now you can run a verification using the CLI and a remote Soda Agent.
To verify a dataset without pushing the results to Soda Cloud:
This allows you to verify that the contract produces the expected results before pushing results to Soda Cloud.
To verify and also push the results to Soda Cloud:
This makes the verification results available in the UI for stakeholders, triggers notifications, and feeds monitoring dashboards.
This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration. Learn more about permissions here: Dataset Attributes & Responsibilities
Only users with the Manage Attributes permission can create or edit attributes. See Global and Dataset Roles.
To create a new attribute:
Click your profile icon in the top-right corner and select Attributes from the menu.
Click New Attribute.
Provide a Label for the attribute. Note that a unique name will be generated from this label. This name is immutable and is used in Data Contract definitions to reference the attribute.
Select the Resource Type where the attribute applies: Dataset or Check
Choose the Type of attribute: Single select, Multi select, Checkbox, Text, Number, Date
Add a Description for context.
Click Save
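The steps above note that an immutable name is generated from the label. The exact rule isn't documented here, so the following slugification is only an assumption for illustration:

```python
# Hypothetical sketch of deriving an attribute name from its label.
# The real generation rule used by Soda Cloud may differ.
import re

def attribute_name(label: str) -> str:
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")

print(attribute_name("Business Domain"))  # business_domain
```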
To edit an attribute, use the context menu next to the attribute name and select Edit Attribute.
Note that the name property and the assigned resource type cannot be changed.
Learn how to set attributes for datasets: Dataset Attributes & Responsibilities
Attributes for checks will be defined as part of the Data Contract.
Learn how to set attributes for datasets:
Authoring in Soda Cloud:
Data Contract as code:
Once an attribute has been assigned at least once, either to a dataset or a check, it becomes available as a filter in Soda Cloud. Attributes that have not yet been used will not appear in filter options.
Test the data source connection:
pip install -i https://pypi.dev.sodadata.io/simple -U soda-athena

type: athena
name: my_athena
connection:
  catalog: ${env.ATHENA_CATALOG}
  access_key_id: ${env.ATHENA_ACCESS_KEY_ID}
  secret_access_key: ${env.ATHENA_SECRET_ACCESS_KEY}
  staging_dir: ${env.ATHENA_STAGING_DIR}
  region_name: ${env.ATHENA_REGION}
  work_group: ${env.ATHENA_WORKGROUP}
  # role_arn: <my_role_arn>
  # profile_name: <my_aws_profile>
  # session_token: <my_session_token>

Data flow and data source reference – Understand how Soda interacts with other systems and manage exceptions.
Each section includes practical, example-based documentation structured to help data engineers, analysts, and platform teams apply Soda in real-world use cases.




Data testing is the practice of validating that your data meets the expectations you’ve defined for it before it reaches stakeholders, dashboards, or downstream systems. Just like software testing ensures your code behaves as intended, data testing safeguards the quality and reliability of your data.
At Soda, we see data testing as the foundation of data trust. Whether you’re verifying row counts, checking for missing or invalid values, or enforcing schema integrity, the goal is the same: catch issues early, reduce incidents, and keep your data consumers confident.
A Data Contract is a formal agreement between data producers and data consumers that defines what “good data” looks like. It sets expectations about schema, freshness, quality rules, and more, and makes those expectations explicit and testable.
With a data contract in place, producers commit to delivering data that meets certain standards. Consumers, in turn, can rely on that contract to build reports, models, or pipelines without second-guessing the data.
At Soda, Data Contracts are testable artifacts that can be authored, versioned, verified, and monitored, whether in code or in the UI. They’re the connective tissue between producers and consumers, aligning teams and eliminating ambiguity.
Defining a contract is only the first step. Verifying that your data actually meets the expectations is where the value is realized. Contract verification is the process of testing whether the data in your datasets aligns with the rules, thresholds, and schema defined in the contract.
At Soda, contract verification is fully automated. Whether triggered manually, on a schedule, or as part of your CI/CD pipelines, each verification run checks that:
The schema matches the contract definition (columns, data types, structure)
The data complies with checks like missing, duplicate, invalid values, and custom rules
This helps you catch issues early, ensure data quality over time, and build trust across your organization.
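A minimal sketch, not Soda's actual engine, of the two verification aspects listed above, schema conformance and a simple missing-values check:

```python
# Illustrative only: compare a contract's declared schema against the dataset,
# then count missing values in a column. Names and data are invented.
expected_schema = {"id": "int", "email": "str"}  # from the contract
actual_schema = {"id": "int", "email": "str"}    # discovered from the dataset

rows = [{"id": 1, "email": "a@x.io"}, {"id": 2, "email": None}]

schema_ok = expected_schema == actual_schema
missing_emails = sum(1 for r in rows if r["email"] is None)

print(schema_ok, missing_emails)  # True 1
```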
Soda supports two complementary ways to author and manage data contracts. They are designed to fit the way your team works.
If you’re a data analyst, product owner, or stakeholder, who prefers intuitive interfaces over code, Soda Cloud is the ideal workspace.
With the Soda Cloud UI, you can:
Browse datasets and view profiling insights
Define a contract with a no-code Editor
Schedule and monitor contract verifications
Collaborate with your team and publish contracts with a click
There’s no setup or YAML required, just fast, visual workflows that enable domain experts to contribute directly to data quality.
If you live in your terminal and manage your data pipelines as code, you’ll want to use Soda Core and the Soda CLI.
With this setup, you can:
Define contracts in YAML
Run contract verifications in CI/CD
Push the contract and verification results to Soda Cloud for visibility
Use Git as the source of truth for version control, collaboration, and reviews
This path offers full control, transparency, and seamless integration into your dev tooling.
Soda gives you the flexibility to blend both approaches. For example, non-technical users can define or adjust contracts visually in Soda Cloud for the datasets they manage, while engineers can use Git-managed contracts for the datasets they own.
This hybrid model enables collaboration across teams:
Business users bring domain expertise directly into the contract
Engineers maintain quality, consistency, and governance
Each dataset follows the authoring method that best suits the team responsible for it
You can mix and match—using the UI for some contracts, and code for others—depending on your team's structure and preferences.
And even if Data Contracts are managed in Git, you can still involve non-technical users who can propose changes to a contract in the UI. These approved changes can be embedded into engineering workflows and synced to Git, ensuring that every update follows your organization’s quality and deployment standards.
Choose the model, or combination of models, that works best for your organization.
Once a contract is published, you’ll want to verify that the actual data meets the contract’s expectations. This verification can be done in two ways:
Soda Agent is our managed runner that lives in your environment and connects to Soda Cloud. It handles contract verification, scheduling, and execution securely, without exposing your data externally. It is great for teams who want central management without maintaining CLI infrastructure.
Soda Core is our open-source engine you can run anywhere: locally, in CI, or data pipeline. It’s lightweight, customizable, and great for teams that prefer full control or have strict environment constraints.
Both approaches support the same Data Contract logic. Choose the one that best fits your deployment model.
This page explains Record-level Anomaly Detection (RAD) and Soda's anomaly detection capabilities through RAD.
Coming soon!
RAD functionalities will be available soon for Enterprise plan users.
Ensuring data quality can be difficult, especially when you need broad coverage quickly. Checks and column monitors are great for enforcing specific rules, but they take time to set up and require a deep understanding of your data. Soda’s Record-level Anomaly Detection (RAD) helps you get started fast, providing instant coverage across all columns, rows, and segments, without any configuration.
The algorithm analyzes historical data to build a clear picture of what normal data is supposed to look like. When incoming rows show unusual patterns, unexpected values, inconsistencies, or errors, RAD automatically triggers an alert and runs a Root Cause Analysis to pinpoint the issue. This provides quick, actionable insights while you work toward more detailed control using checks and column monitors.
Instant, broad coverage Monitor all columns, rows, and segments at once, detecting both known and unknown issues.
No configuration needed Get started immediately: no metrics or checks need to be defined. RAD automatically determines which columns to use.
One metric to track and alert on The Record-level Drift Score provides a single, explainable metric to monitor data health.
Order of operations to achieve the best coverage in the most efficient way:
Firstly: always begin with high-level monitors to verify that the right amount of data arrived on time and in the correct format. These require no configuration; they just need to be enabled.
Secondly, RAD: apply Record-level Anomaly Detection to validate the actual content of the data. This step also requires no configuration (only enablement) and provides broad coverage across all columns and segments.
Next: apply column-level monitoring for specific use cases where the potential data quality issue and metric are known but expected to change over time. These should be kept to a minimum, as they are prone to generating false alerts.
For a dataset to be monitored by RAD, the following conditions must be met:
Time partition column: the dataset must include a column that marks when records arrive (for example, created_at).
Primary key: the dataset must have a primary key to uniquely identify rows.
Diagnostics Warehouse setup: a Diagnostics Warehouse must be configured to store the daily sample, consisting of either primary keys or, ideally, a full copy of the sampled rows.
Next: to enable Record-level Anomaly Detection in your organization, reach out to Soda.
This page describes the bi-directional integration between Soda and Collibra.
The Soda↔Collibra optimized integration synchronizes data quality checks from Soda to Collibra, creating a unified view of your data quality metrics. The implementation is optimized for performance, reliability, and maintainability, with support for bi-directional ownership sync and advanced diagnostic metrics.
High Performance: 3-5x faster execution through caching, batching, and parallel processing
Custom Attribute Syncing: Flexible mapping of Soda check attributes to Collibra attributes for rich business context
Ownership Synchronization: Bi-directional ownership sync between Collibra and Soda
Deletion Synchronization: Automatically removes obsolete check assets from Collibra when checks are deleted in Soda
Multiple Dimensions Support: Link checks to multiple data quality dimensions simultaneously
Monitor Exclusion: Option to exclude Soda monitors from synchronization, focusing only on data quality checks
Diagnostic Metrics Processing: Automatic extraction of diagnostic metrics from any Soda check type with intelligent fallbacks
Robust Error Handling: Comprehensive retry logic and graceful error recovery
Advanced Monitoring: Real-time metrics, performance tracking, and detailed reporting
CLI Interface: Flexible command-line options for different use cases
Backward Compatibility: Legacy test methods preserved for smooth migration
For technical details on how to configure the bi-directional Collibra integration, head to the configuration guide.
Python 3.10+ required
Valid Soda Cloud API credentials
Valid Collibra API credentials
Properly configured Collibra asset types and relations
Smart Filtering: Only processes datasets marked for synchronization
Parallel Processing: Handles multiple operations concurrently
Caching: Reduces API calls through intelligent caching
Batch Operations: Groups similar operations for efficiency
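The integration's source isn't shown here, but the caching and batching ideas can be sketched generically. The function name and the "asset type" example below are made up for illustration:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fetch_asset_type(name):
    # In the real integration this would call the Collibra API;
    # caching means each distinct asset type is fetched at most once.
    print(f"API call for {name}")
    return {"name": name}

def batched(items, size):
    # Group operations so many assets can be sent in a single request.
    for i in range(0, len(items), size):
        yield items[i:i + size]

fetch_asset_type("Table")
fetch_asset_type("Table")   # served from cache, no second API call
print(list(batched([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```

Combining memoized lookups with batched writes is the standard way to cut round-trips against a REST API like Collibra's.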
For each check in a dataset:
Bulk Creation/Updates: Processes multiple assets simultaneously
Duplicate Handling: Intelligent naming to avoid conflicts
Status Tracking: Monitors creation vs. update operations
Standard Attributes: Evaluation status, timestamps, definitions
Diagnostic Metrics: Automatically extracts and calculates diagnostic metrics from check results
Custom Attributes: Flexible mappings for business context
Batch Updates: Groups attribute operations for performance
Dimension Relations: Links checks to data quality dimensions
Table/Column Relations: Creates appropriate asset relationships
Error Recovery: Graceful handling of missing or ambiguous assets
Collibra to Soda Sync: Automatically syncs dataset owners from Collibra to Soda
User Mapping: Maps Collibra users to Soda users by email address
Error Handling: Tracks missing users and synchronization failures
Metrics Tracking: Monitors successful ownership transfers
Retry Logic: Exponential backoff for transient failures
Rate Limiting: Intelligent throttling to avoid API limits
Error Aggregation: Collects and reports all issues at the end
Graceful Degradation: Continues processing despite individual failures
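Retry with exponential backoff, as listed above, follows a standard pattern. A minimal sketch (the delays, attempt count, and exception type are illustrative, not the integration's actual settings):

```python
import time

def with_retries(fn, max_attempts=4, base_delay=0.01):
    """Retry fn with exponentially growing delays between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 10ms, 20ms, 40ms, ...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky))  # ok (succeeds on the third attempt)
```

Production versions usually add jitter to the delay and honor API rate-limit headers, which is what the "Rate Limiting" bullet refers to.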
Head to the configuration guide to learn how to integrate Collibra.
Profiling provides a quick and comprehensive overview of a dataset’s structure and key statistics.
Profiling helps you understand the shape, quality, and uniqueness of your data before creating checks or metric monitors.
With profiling, you can explore metadata about your dataset, such as column names, data types, distinct counts, null counts, and summary statistics. You can also quickly search for specific columns to focus on the attributes that matter most to your analysis.
Profiling is useful for:
Business teams: Gain a fast understanding of what’s inside a dataset, its completeness, and potential anomalies.
Data teams: Validate schema, data types, and distributions before writing quality tests or transformations.
Data owners: Quickly identify unexpected values, nulls, or structural changes in a dataset.
Dataset overview: Displays a structured view of all columns, their types, and counts.
Interactive navigation: Scroll through the dataset structure or jump directly to a column of interest.
Search and filter: Quickly locate a column by name to review its profiling details.
Column-level insights:
You can enable Profiling during .
If you want to enable Profiling on an existing dataset, follow the next steps:
Click on Datasets > The dataset of your choosing
Navigate to the Columns tab in the dataset view
Click on Update Profiling Configuration
Once Profiling has been enabled, you can configure it to adapt to your organization's needs.
1. Choose a Profiling schedule
Profiling happens every 24 hours. Choose a UTC time from the dropdown menu to pick a specific hour when the scan will be scheduled.
2. Choose a Profiling strategy
Use sampling: To perform Profiling, Soda will use a sample of up to 1 million rows from the dataset.
Use a time window: To perform Profiling, Soda will use data present in a 30-day time window, based on the dataset time-partition column.
3. Click on Finish
Now, Profiling will be scheduled.
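The two strategies correspond to two ways of selecting the rows that profiling reads. A sketch under the limits stated above (1 million rows, 30-day window); the row shape and selection logic are illustrative, not Soda's internals:

```python
import random
from datetime import datetime, timedelta

def sample_rows(rows, limit=1_000_000, seed=0):
    """'Use sampling': take at most `limit` randomly chosen rows."""
    if len(rows) <= limit:
        return rows
    random.seed(seed)
    return random.sample(rows, limit)

def window_rows(rows, now, days=30):
    """'Use a time window': keep rows whose time-partition column is recent."""
    cutoff = now - timedelta(days=days)
    return [r for r in rows if r["created_at"] >= cutoff]

now = datetime(2024, 6, 1)
rows = [{"created_at": datetime(2024, 5, 20)},
        {"created_at": datetime(2024, 3, 1)}]
print(len(window_rows(rows, now)))  # 1 -> only the row inside the 30-day window
```

Sampling bounds compute cost regardless of table size; the time window instead bounds it by recency, which suits datasets with a reliable time-partition column.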
If you wish to disable column profiling at the organization level, you must possess Admin privileges in your Soda Cloud account. Once confirmed, follow these steps:
Navigate to your avatar.
Click on Organization settings.
Uncheck the box labeled Allow Soda to collect column profile information.
When you open Profiling for a dataset:
Soda runs a lightweight scan of the dataset’s metadata and a sample of the data (depending on configuration).
It calculates summary statistics for each column.
Results are displayed in the Profiling view for exploration.
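The per-column statistics described here are straightforward to reproduce; a minimal sketch for one numeric column (field names are illustrative):

```python
def profile_column(values):
    """Compute the kind of summary statistics a profiler reports."""
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "missing": len(values) - len(non_null),   # null count
        "distinct": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "mean": sum(non_null) / len(non_null),
    }

print(profile_column([10, 20, 20, None, 40]))
# {'row_count': 5, 'missing': 1, 'distinct': 3, 'min': 10, 'max': 40, 'mean': 22.5}
```

In practice Soda pushes these aggregations down to the data source as SQL rather than pulling rows into memory, which is why profiling stays lightweight.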
Soda can only profile columns that contain NUMBERS or TEXT type data; it cannot profile columns that contain TIMESTAMP data except to create a freshness check for the anomaly dashboard.
Soda performs the Discover datasets and Profile datasets actions independently, relative to each other. If you define exclude or include rules in the Discover tab, the Profile configuration does not inherit the Discover rules. For example, if, for Discover, you exclude all datasets that begin with staging_, then configure Profile to include all datasets, Soda discovers and profiles all datasets.
After reviewing profiling results, you can:
Create tests based on profiling insights (e.g., "column should not have nulls").
Set up monitors to track data quality over time.
Export profiling information to support documentation and governance processes.
Configure Soda Cloud to connect your account to Slack so that you can:
Send notifications for failed or warning check results to Slack channels
Start conversations to track and resolve data quality Incidents with Slack channels
Only users with the Manage Notification Rules permission can create or edit rules. All users can view rules.
In Soda Cloud, navigate to your avatar > Organization Settings, then navigate to the Integrations tab and click the + icon to add a new integration.
Choose Slack and proceed
Follow the guided steps to authorize Soda Cloud to connect to your Slack workspace. If necessary, contact your organization’s Slack Administrator to approve the integration with Soda Cloud.
Configuration tab: select the public channels to which Soda can post messages; Soda cannot post to private channels.
Note that Soda caches the response from the Slack API, refreshing it hourly. If you created a new public channel in Slack to use for your integration with Soda, be aware that the new channel may not appear in the Configuration tab in Soda until the hourly Slack API refresh is complete.
Scope tab: select the Soda features (alert notifications and/or incidents) that can access the Slack integration.
You can use this integration to enable Soda Cloud to send alert notifications to a Slack channel to notify your team of warn and fail check results.
With this integration, users can select a Slack channel as the destination for alert notifications for an individual check, for checks that form part of an agreement, or for multiple checks at once.
To send notifications that apply to multiple checks, see the documentation on notification rules.
You can use this integration to notify your team when a new incident has been created in Soda Cloud. With such an integration, Soda Cloud displays an external link to an incident-specific Slack channel in the Incident Details.
Selecting the right scan time is essential for accurate data monitoring and reliable metric collection. Scans that occur too early may run before the data has been fully loaded into the database, leading to false positives or misleading results. This guide outlines how to determine the best scan time based on your data load patterns and operational needs.
Scans can be scheduled to occur anywhere from hourly to weekly. The available intervals fit a 24-hour cycle that matches the hourly and daily seasonality of how people organize their day, so metric monitoring can run at any supported interval from 1 hour up to 1 week.
Currently, this feature is only supported in Snowflake data sources.
When testing a data contract, Soda allows you to run contract validation on a sample of your dataset instead of the full data. This feature helps you quickly and cost-efficiently verify that your contract runs correctly before executing full scans.
As of July 2025, the container images required for the self-hosted Soda agent will be distributed using private registries, hosted by Soda.
EU cloud customers will use the EU registry located at registry.cloud.soda.io. US cloud customers will use the US registry located at registry.us.soda.io.
The images currently distributed through Docker Hub will stay available there. New releases will only be available in the Soda-hosted registries.
Existing or new Soda Cloud API keys can be used to authenticate to the Soda-hosted registries, starting from version 1.2.0 of the soda-agent Helm chart.
Soda offers flexible deployment options to suit your team’s infrastructure, scale, and security needs. Whether you want to embed Soda directly into your pipelines, use a centrally managed deployment, or rely on Soda’s fully-hosted solution, there’s an option for you.
This guide provides an overview of the three main deployment options: Soda Python Libraries, Soda-hosted Soda Agent, and Self-hosted Soda Agent, to help you choose the right setup for your organization.
Learn more about how to use your own SQL queries to build custom SQL Metric Monitors.
Custom SQL Monitors are available for Enterprise users and can be configured via both the Soda Cloud UI and the Monitoring Configuration API.
Custom SQL Monitors enable you to define monitoring logic using your own SQL queries. This is ideal when built-in Soda metrics or anomaly checks don’t meet your needs; for example, when you must aggregate data across multiple tables, compute ratios, or detect anomalies in grouped datasets.
A Custom SQL Monitor runs your SQL query against your connected data source and evaluates its results on a schedule, just like any other Soda monitor.
This feature can be used to:
```shell
soda contract test --data-source ds.yml --contract contract.yaml
soda contract publish --contract contract.yaml --soda-cloud sc.yml
```

```yaml
soda:
  # These values will also be used to authenticate to the Soda image registry
  apikey:
    id: existing-key-id
    secret: existing-key-secret
```

```yaml
soda:
  apikey:
    id: existing-key-id
    secret: existing-key-secret
  imageCredentials:
    apikey:
      id: my-new-key-id
      secret: my-new-key-secret
```

```yaml
soda:
  apikey:
    id: ***
    secret: ***
# This is no longer supported
# imagePullSecrets:
#   - name: my-existing-secret
# Instead, use this!
existingImagePullSecrets:
  - name: my-existing-secret
```

```yaml
soda:
  apikey:
    id: ***
    secret: ***
  cloud:
    # This also sets the correct endpoint under the covers.
    region: "us"
    # This can be removed now, as the region property sets this up correctly.
    # endpoint: https://cloud.us.soda.io
```

```yaml
soda:
  apikey:
    id: ***
    secret: ***
  # Rename this ...
  # scanLauncher:
  # to become
  scanLauncher:
    existingSecrets:
      - soda-agent-secrets
```

```shell
soda cloud create -f sc.yml
soda cloud test -sc sc.yml
soda contract verify --dataset datasource/db/schema/table --use-agent --soda-cloud sc.yml
soda contract verify --dataset datasource/db/schema/table --publish --use-agent --soda-cloud sc.yml
soda data-source test -ds ds.yml
soda cloud create -f sc.yml
soda cloud test -sc sc.yml
pip install -i https://pypi.dev.sodadata.io "soda-postgres>=4.0.0.dev1" -U
soda data-source create -f ds.yml
```

```yaml
type: postgres
name: postgres
connection:
  host: host_name
  port: 5432
  user: user_name
  password: ${env.SODA_DEMO_POSTGRES_PW}
  database: db_name
```

```shell
soda data-source test -ds ds.yml
soda contract verify --data-source ds.yml --contract contract.yaml
soda contract verify --data-source ds.yml --contract contract.yaml --set START_DATE=2024-05-01
soda contract verify --data-source ds.yml --contract contract.yaml --publish --soda-cloud sc.yml
soda contract verify --contract contract.yaml --use-agent --soda-cloud sc.yml
soda contract verify --contract contract.yaml --use-agent --soda-cloud sc.yml --set START_DATE=2024-05-01
soda contract verify --dataset datasource/db/schema/table --publish --use-agent --soda-cloud sc.yml
```

Collaborate with non-technical users that use Soda Cloud and integrate with engineering workflows via Git
Prioritize what matters: use the Record-level Drift Score consistently across datasets and data sources to rank and focus on the most critical data quality issues.
Reduce false alerts: traditional column-level monitoring increases the risk of false positives with every additional monitor. With RAD, you only need one anomaly detection monitor per dataset, minimizing noise.
Optimize compute usage: monitoring a single metric per dataset lowers computational overhead. Additionally, RAD can work with sampled data, further reducing processing demands.
Built‑in root cause analysis Quickly understand what changed and why.
Native support for backfilling and back‑testing Automatically generate and assess historical Record-level Drift Scores to review past data quality trends.
Lastly, checks: use checks for critical tables where expectations are clearly defined. For example:
| Monitor type | Issue | Metric | Threshold | Example |
| --- | --- | --- | --- | --- |
| RAD monitor | Unknown | Unknown | Unknown | RAD on all columns |
| Checks | Known | Known | Known | Missing values in Amount < 5% |
| Column monitors | Known | Known | Unknown | Anomaly detection on Amount for missing values |
```shell
# Run the integration with default settings
python main.py

# Run with debug logging for troubleshooting
python main.py --debug

# Use a custom configuration file
python main.py --config custom.yaml

# Show help and all available options
python main.py --help
```

```shell
# Run legacy Soda client tests
python main.py --test-soda

# Run legacy Collibra client tests
python main.py --test-collibra

# Run with verbose logging (info level)
python main.py --verbose
```

Statistics
Column name
Column data type
Number of distinct values
Number of missing (null) values
Minimum, maximum, mean (for numeric columns)
Length, patterns, or categories (for text columns)
Histogram for numeric columns
Frequent values
Extreme values, for numeric columns
Data checks that exist for this column
Supported scan intervals: 1 h, 2 h, 3 h, 4 h, 6 h, 8 h, 12 h, 1 day, or 1 week.
When is the database load expected to be complete?
Determine when the relevant tables or datasets are expected to be fully loaded.
Factor in common variances: if a load is expected to complete by 00:00 UTC but occasionally finishes at 00:10 UTC, account for the expected, albeit sporadic, delay.
Knowing this helps avoid scanning too early and capturing incomplete data.
When is a delayed load considered late or "problematic"?
If data arriving by 02:30 UTC is still valid for monitoring purposes, it may be better to delay the scan to reduce false alerts.
Scanning immediately after the earliest expected load time is not always necessary.
Understanding what qualifies as "late data" helps define the tolerance window for scan timing.
How fast after the load can someone respond to issues flagged by monitors?
If nobody can take action until 09:00 UTC, scanning earlier may not be useful unless scans feed downstream processes or dashboards.
Choose a scan time that aligns with both data readiness and team readiness.
Consistency is key
Running scans at the same time every day allows you to build a reliable baseline of expected behavior. This helps surface anomalies clearly when something deviates from the norm.
Scan frequency: daily
Expected load completion: 00:00 UTC
Occasional load delay: up to 00:10 UTC
Team available from: 08:00 UTC
| Strategy | Scan time | Rationale |
| --- | --- | --- |
| Minimal buffer | 00:15 UTC | Captures data soon after load with minor delay tolerance. |
| Conservative buffer | 01:30 UTC | Allows extra time for delayed loads, reduces risk of false positives. |
| Operationally aligned | 07:30 UTC | Ensures scan results are fresh and complete when the team starts reviewing. |
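The buffer arithmetic behind these options is simple; a sketch that derives a scan time from the expected load completion plus a safety buffer (the helper is hypothetical, not a Soda feature):

```python
from datetime import datetime, timedelta

def scan_time(load_complete_utc, buffer_minutes):
    """Expected load completion plus a safety buffer, as HH:MM UTC."""
    base = datetime.strptime(load_complete_utc, "%H:%M")
    return (base + timedelta(minutes=buffer_minutes)).strftime("%H:%M")

print(scan_time("00:00", 15))   # 00:15 -> minimal buffer
print(scan_time("00:00", 90))   # 01:30 -> conservative buffer
```

Pick the buffer from the variance you observe in load completion times: the occasional 10-minute delay in the example scenario makes 15 minutes the smallest sensible buffer.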
When scanning large volumes of tables:
It is acceptable to configure scans for the same scheduled time (e.g. 00:00 UTC).
Scans scheduled in large volumes (thousands of tables) may be configured to run at the same logical time; the system distributes execution based on queuing and available resources, so the actual execution is staggered.
Historical metric collection scans (for metric baseline backfilling) run only once at configuration time.
These scans are not governed by the scan schedule. They occur once and they are typically the most resource-intensive.
Consistency is key: Using the same scan time daily establishes a stable baseline for anomaly detection.
Early scans should be avoided: Scheduling scans before the last acceptable load time is not recommended unless business needs require it.
Time zones should be centralized: Aligning scan time with the database time zone is ideal, especially when your time partitioning column is based on the insert/load time in that time zone.
Monitoring and adjusting: If load patterns or SLAs change, scan times should be revisited and adjusted accordingly.
Running a test contract on a sample enables you to:
Validate that your contract syntax, checks, and filters work as expected.
Reduce data warehouse compute cost while verifying new or updated contracts.
Iterate faster on contract definitions in development environments.
Results from sampled runs reflect only a subset of your data and may not represent its actual quality. Use full verification once your contract logic is validated.
This feature can be enabled at the data source level, applying to all datasets that use that connection.
You need the "Manage data sources" global permission to add a new data source. Learn more about Global and Dataset Roles.
To enable this feature:
Go to Data sources.
Click Edit connection for a data source.
Under the Connection Details section, toggle Data Sampling.
Specify your sample size in the Limit field.
Click Connect.
Currently available in preview. This feature is only supported in Snowflake data sources.
When connecting to Snowflake, you must provide a warehouse as part of the data source configuration. By default, this single warehouse is used for all operations, including discovery, metric monitoring, profiling, data contract executions, and the diagnostics warehouse.
The Configure warehouses per dataset feature gives you greater control and flexibility by allowing you to define specific warehouses for individual datasets. This helps you optimize cost, manage compute workloads, and allocate resources efficiently across your data operations.
You need the “Manage data sources” global permission to enable or modify this feature. Learn more about Global and Dataset Roles.
Go to Data sources in Soda Cloud.
Click Edit connection for your Snowflake data source.
Toggle on Configure Warehouses.
Specify the list of allowed warehouses that can be used by this connection.
Choose a default warehouse to use for all datasets unless otherwise specified.
Click Save on the top right to save your configuration.
Once enabled:
The warehouse specified in the data source connection is used for discovery.
The default warehouse (defined under Configure Warehouses) is used for:
Metric monitoring
Profiling
Data contract executions
Diagnostics Warehouse operations
A different warehouse can be configured at the dataset level, overriding the default.
You need the “Configure dataset” permission to edit dataset-level configurations. Learn more about Global and Dataset Roles.
Go to a dataset in Soda Cloud.
Click Edit dataset.
Under the Snowflake section, select the warehouse to use for this dataset.
Click Save to apply your changes.
In order to enjoy the latest features Soda has to offer, please upgrade any self-hosted Soda agent you manage using one of the following guides.
Follow the self-hosted Soda agent upgrade or redeployment guides. Don't execute the final helm install or helm upgrade step yet.
Ensure you retrieve the soda.apikey.id and soda.apikey.secret values first, by using helm get values -n <namespace> <release_name>.
Now pass these values back to the upgrade command via the CLI
or by using a values file:
Ensure you have a new API key id and secret by following the API key creation guide .
Follow the self-hosted Soda agent upgrade or redeployment guides. Don't execute the final helm install or helm upgrade step yet.
Now pass the API keys to use for registry access in the upgrade command via the CLI, using the imageCredentials.apikey.id and imageCredentials.apikey.secret properties.
Note that we're also still passing the soda.apikey.id and soda.apikey.secret values, which are still required for the agent to authenticate to Soda Cloud.
Or when using a values file:
You can also use a self-managed, existing secret to authenticate to the Soda-hosted or your own self-hosted private container registry, e.g. when mirroring container images.
You can refer to existing secrets as follows for the CLI:
Or using a values file:
When you're onboarded on the US region of Soda Cloud, you'll have to use the container registry associated with that region.
You can alter the soda.cloud.region value to automatically render the correct container registry and Soda Cloud API endpoint. Simply follow any of the above instructions and include the soda.cloud.region value.
To do so in the CLI:
Or using a values file:
If you want to mirror the Soda images into your own registries, you'll need to login to the appropriate container registry. This will allow you to pull the images into your custom container image registry.
The following values.yaml file illustrates the changes required for the Helm release to work with mirrored images:
Your existing Soda agent deployments will continue to function.
However, an agent that is not upgraded will not be able to support features like collaborative data contracts and the fully revamped metric monitoring.
The images hosted on Docker Hub, required to run the self-hosted agent, will remain there in their current state for a grace period of 6 months. There will be no further maintenance (updates, bug fixes, security patches) for the old self-hosted agent versions.
Soda-hosted Soda Agent
Fully-managed Soda Agent, hosted by Soda.
Teams seeking a simple, managed solution for data quality.
Centralized data source access
No setup required
Observability features enabled
Enables users to create, test, execute, and schedule contracts and checks directly from the Soda Cloud UI.
Required for observability features. Cannot scan in-memory sources like Spark or DataFrames.
Available for Free, Team, and Enterprise plans.
Self-hosted Soda Agent
Same as Soda-hosted Soda Agent, but deployed and managed in your own Kubernetes environment.
Teams needing full control over infrastructure and deployment.
Similar to Soda-hosted Agent, but deployed within the customer’s environment; data stays within your network.
Full control over deployment
Integration with secrets managers
Customization to meet your organization’s specific requirements
Required for observability features. Cannot scan in-memory sources like Spark or DataFrames. Kubernetes expertise required.
Available for the Enterprise plan; contact us for access.
Soda Python Libraries
Open-source Python library (with commercial extensions) for programmatic configuration and enforcement of data contracts in your pipelines.
Data engineers integrating Soda into custom workflows.
Full control over orchestration
In-memory data support
Contract verification
No observability features. Required for in-memory sources (e.g., Spark, DataFrames). Data source connections managed at the environment level.
Soda-hosted Soda Agent is a fully-managed deployment of the Soda Agent, hosted by Soda in our infrastructure. It allows you to connect to your data sources and manage data quality directly from the Soda Cloud UI without any infrastructure setup on your end. You need only whitelist the IP address of the Soda-hosted agent so that it can connect to your data.
Key points:
No setup or management required. Soda handles deployment and scaling.
Data source connections are centralized in Soda Cloud, and users can leverage the Soda Agent to execute scans across those data sources.
Enable observability features in Soda Cloud, such as profiling, metric monitoring, and anomaly detection.
Enables users to create, test, execute, and schedule contracts and checks directly from the Soda Cloud UI.
Onboard your datasets in Soda Cloud with Soda-hosted agent: Onboard datasets on Soda Cloud
The Self-hosted Agent offers the same capabilities as the Soda-hosted Agent, but it is deployed and managed by your team within your own Kubernetes environment (e.g., AWS, GCP, Azure). This model provides full control over deployment, infrastructure, and security, while enabling the same centralized data source access and Soda Cloud integration for scans, contract execution, and observability features.
Learn how to deploy the Self-hosted Soda Agent: Deploy Soda Agent.
Onboard your datasets in Soda Cloud with self-hosted agent: Onboard datasets on Soda Cloud.
Soda Core is an open-source Python library and CLI that allows you to embed Soda directly in your data pipelines. You can orchestrate scans using your preferred orchestration tools or pipelines, and execute them within your own infrastructure. Additional commercial extensions are available via extension packages, such as soda-groupby, soda-reconciliation, etc.
See detailed installation instructions here: Install Soda Python Libraries
Key points:
Ideal for teams who want full control over scan orchestration and execution.
Data source connections are configured and managed at the environment level.
Required for working with in-memory data sources like Spark and Pandas DataFrames.
Define metrics using custom aggregations or joins.
Compute grouped results (e.g., GROUP BY customer, institution, or region).
Apply filters, CTEs, and where clauses to narrow down data.
Integrate results with notification rules to alert your team when certain conditions are met.
Example scenario: an organization needs to monitor daily incidents per borough and reason in their Bus Breakdowns and Delays dataset, and flag unusual spikes/drops via notification rules.
The goal is to know which boroughs are the ones suffering the most incidents and why that's happening.
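The grouped metric from this scenario can be expressed as plain SQL. Here it runs against an in-memory SQLite table with made-up sample data; the table and column names mirror the example, not the real Bus Breakdowns dataset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE breakdowns (boro TEXT, reason TEXT)")
conn.executemany(
    "INSERT INTO breakdowns VALUES (?, ?)",
    [("Bronx", "Heavy Traffic"), ("Bronx", "Heavy Traffic"),
     ("Queens", "Mechanical"), ("Bronx", "Mechanical")],
)

# One incident_count per (Boro, Reason) pair, as the monitor would compute.
rows = conn.execute(
    """SELECT boro, reason, COUNT(*) AS incident_count
       FROM breakdowns
       GROUP BY boro, reason
       ORDER BY incident_count DESC, boro, reason"""
).fetchall()
print(rows)  # [('Bronx', 'Heavy Traffic', 2), ('Bronx', 'Mechanical', 1), ('Queens', 'Mechanical', 1)]
```

A Custom SQL Monitor would run a query of this shape on schedule against the connected data source and track incident_count per group over time.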
Enterprise plan.
A dataset connected in Soda Cloud.
An API token with permission to author monitors (if creating monitors via the Monitoring Configuration API).
Navigate to Datasets → Custom Monitors at the bottom of the page. Click on Add Column Monitor.
Name your custom monitor and provide the custom SQL query. In this case, we are monitoring incident count by borough and reason.
Provide a Result metric and a Valid range, and define a Threshold strategy.
In this case, the result metric is incident_count; we want to group by Boro and Reason, and the valid range cannot be negative, so the minimum value is 0. Both Upper range and Lower range anomaly detection are enabled to catch unusual spikes/drops per group.
Click on Add Monitor on the top right.
The monitor will now be visible at the bottom of the Metric Monitoring dashboard.
This monitor will:
Run daily and compute incident_count for every (Boro, Reason) pair within the partitioned time window.
Store grouped results so you can see which areas and causes are trending.
Trigger notifications (based on your organization’s notification rule) when anomaly detection flags a group.
List of all the variables currently supported using ${soda.<variable>} syntax:
SCAN_TIME: time for which the scan is running; has the same value as PARTITION_END_TIME (note this is different from when the scan is running)
PARTITION_COLUMN: column used to perform time-based partitioning
PARTITION_START_TIME: start time for the partition time window
PARTITION_END_TIME: end time for the partition time window
PARTITION_INTERVAL: duration of the partition time window
TABLE: qualified name of the table being analyzed, e.g. "my-schema"."my-table"
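How Soda expands these placeholders internally isn't documented here, but the effect on a query can be illustrated with a simple string substitution. The query below is hypothetical; only the ${soda.<variable>} names come from the list above:

```python
import re

def render(query, variables):
    """Replace ${soda.NAME} placeholders with supplied values (illustrative)."""
    return re.sub(
        r"\$\{soda\.(\w+)\}",
        lambda m: str(variables[m.group(1)]),
        query,
    )

query = ("SELECT COUNT(*) FROM ${soda.TABLE} "
         "WHERE ${soda.PARTITION_COLUMN} >= '${soda.PARTITION_START_TIME}'")
print(render(query, {
    "TABLE": '"my-schema"."my-table"',
    "PARTITION_COLUMN": "created_at",
    "PARTITION_START_TIME": "2024-05-01 00:00:00",
}))
```

Writing queries against the partition variables rather than hard-coded dates is what lets the same monitor query run correctly for every scheduled time window.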
Note: Set use_context_auth=True to use application default credentials, in which case account_info_json or account_info_json_path are not necessary.
See BigQuery's locations documentation to learn more about location.
Test the data source connection:
Learn more about Metric Monitors that run scans at a column level.
A column monitor in Soda tracks a specific statistical metric for a given column over time. It helps detect unusual patterns or unexpected changes in column behavior, such as spikes in missing values or shifts in averages.
You can find column monitors by opening the Metric Monitors tab on any dataset and scrolling to the bottom of the page. This section lists all active column monitors in a structured, searchable view. The list can be sorted by recency or by the number of detected anomalies, allowing you to quickly focus on the most relevant issues.
Unlike dataset-level monitors, which can be applied at the data source level, column monitors are configured at the dataset level and are tailored to specific use cases. It is recommended to add column monitors only to columns where changes are likely to reflect actual data quality issues. Adding too many monitors may increase false positives and create unnecessary noise.
For column monitors to work, a time partition column must be defined. Soda uses this column to divide the data into time-based partitions, typically by day, and calculates the selected metrics within each partition. The column must be a timestamp and should reflect when records arrive in the database to ensure accurate and meaningful results.
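The partitioning step can be sketched: group rows by the day of their time-partition column and compute the monitored metric inside each partition. Column names and the missing-value metric below are assumptions for illustration:

```python
from collections import defaultdict
from datetime import datetime

rows = [
    {"created_at": "2024-05-01T09:30:00", "amount": None},
    {"created_at": "2024-05-01T17:10:00", "amount": 12.5},
    {"created_at": "2024-05-02T08:00:00", "amount": 7.0},
]

# Partition by day of the time-partition column, then compute a
# per-partition metric (here: missing-value percentage of `amount`).
partitions = defaultdict(list)
for row in rows:
    day = datetime.fromisoformat(row["created_at"]).date().isoformat()
    partitions[day].append(row)

missing_pct = {
    day: 100 * sum(r["amount"] is None for r in part) / len(part)
    for day, part in partitions.items()
}
print(missing_pct)  # {'2024-05-01': 50.0, '2024-05-02': 0.0}
```

This is why the partition column must reflect arrival time: if it lagged the actual load, each day's metric would be computed over an incomplete partition.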
For each dataset, you’ll see a scrollable list that includes:
Result of the anomaly detection: Anomaly, Expected, or Unknown (not yet evaluated)
Column name
Metric name (e.g. Missing values percentage, Average)
Column being tracked
At the bottom of the list you can load more monitors. Every monitor can be deleted and configured with opt-in notifications.
Column monitors can be added one by one or in bulk. When multiple columns are selected, only metrics that are applicable to all selected columns are shown.
Open the column monitor wizard
In the Metric Monitors dashboard, click Add Column Monitors.
Select columns
Search or scroll your table’s columns.
Check one or many boxes to select columns in bulk.
Column monitors are typed: a metric is available only when the column has the required data type. For example, if a column's type is str (text-based), numeric metrics cannot be enabled.
Pick metrics
Select the metrics of interest.
Search or expand metrics for further configuration:
Valid Range: define MIN and MAX values the metric can take (defaults to –∞/∞ or 0–∞ for time-based metrics).
Threshold Strategy:
Add monitors
Once you’ve selected your columns and toggled the desired metrics on, click Add Monitors.
Empty monitors will be added to the list, and at the top of the page you will be prompted to run a Historical Metric Collection Scan.
Tip: add all your column monitors first, then run the historical scan in one go. This will save time and computing costs, and ensures every monitor shares the same look-back window.
Column Monitors can be configured when setting them up and while they're in production. To fine-tune the monitor to your specific needs, go to the page for each specific metric.
Learn more about
The Users and User Groups in Soda Cloud settings allows you to control access to your organization by managing individual users and user groups. This ensures that team members have the appropriate permissions to use Soda Cloud effectively.
With SSO enabled, users and groups can be synced directly from your identity provider, reducing manual effort and ensuring alignment with your organization’s access policies. Learn more about SSO configuration User and user group management with SSO.
The Invite Users feature is only available when SSO is not enabled.
To invite users manually:
Go to the Users tab in Settings.
Click the + icon at the top of the user list
Enter the email addresses of the users you want to invite.
Invited users will receive an email with a link to set their password and join your organization in Soda Cloud. Once they complete the setup, they will have access to Soda Cloud based on the roles and permissions you assign.
Deactivating a user blocks their access to Soda Cloud and disables any existing API keys associated with their account. This is useful when a user no longer needs access, but you want to retain their account for record-keeping or future reactivation.
To deactivate a user:
Go to the Users tab in Settings.
Find the user you want to deactivate.
Click the context menu for the user and select Deactivate From This Organization.
You can reactivate a user later if they need access again.
Assigning users to groups allows you to manage access and permissions more efficiently by applying global roles to groups rather than individual users.
To assign users to groups:
Go to the Users tab in Settings.
Find the user you want to assign to a group.
Click the context menu next to their name and select Edit User Groups.
Select one or more user groups from the list.
Click Save
Global roles define a user’s permissions across Soda Cloud. Assigning global roles directly to users allows you to grant them specific access rights, such as managing datasets, running scans, or configuring organization settings.
To assign a global role to a user:
Go to the Users tab in Settings.
Find the user you want to assign a role to.
Click the context menu for the user and select Assign Global Roles.
Choose one or more global roles from the list.
Click Save
User groups allow you to manage access and permissions for multiple users at once, helping you simplify and scale permission management across your organization. By assigning global roles to groups, you ensure that all members of the group have consistent access rights, without the need to assign permissions individually.
If you have Single Sign-On (SSO) enabled, user groups can also be synced automatically from your identity provider, ensuring your Soda Cloud user management aligns with your existing access policies.
Note that, by default, there is an Everyone group, which is not editable and contains all users in the organization.
You can manually create user groups in Soda Cloud, whether you’re importing user groups via SSO or not.
To create a user group:
Go to the User Groups tab in Settings.
Click Create User Group at the top of the user group list.
Enter a name for the group.
(Optional) Add users to the group immediately, or add them later.
You can edit the members of user groups that you have created on Soda Cloud. SSO-managed user groups cannot be edited.
To edit a user group:
Go to the User Groups tab in Settings.
Find the group you want to modify.
Click the context menu next to the group and select Edit Members.
Select the users that should be in the user group and click Save.
You can view the list of users in a group to understand who has access through that group and to help manage permissions across your organization.
To view the members of a group:
Go to the User Groups tab in Settings.
Click the group name or the row to open its details.
The list of users assigned to the group will be displayed.
This view helps you track group membership and verify that the correct users have the appropriate access.
Global roles define a user’s permissions across Soda Cloud. Assigning global roles directly to user groups allows you to grant them specific access rights, such as managing datasets, running scans, or configuring organization settings.
To assign a global role to a user group:
Go to the User Groups tab in Settings.
Find the user group you want to assign a role to.
Click the context menu for the user group and select Assign Global Roles.
Choose one or more global roles from the list.
Click Save
This page describes how Soda handles data reconciliation through different types of reconciliation checks.
Available on the 15th of September 2025
Reconciliation checks are a validation step used to ensure that data remains consistent and accurate when moving, transforming, or syncing between different systems. The core purpose is to confirm that the target data matches the source data, whether that’s during a one-time migration, a recurring data pipeline run, or ongoing synchronization across environments.
For instance, if you are migrating from a MySQL database to Snowflake, reconciliation checks can verify that the data transferred into Snowflake staging is intact and reliable before promoting it to production. This minimizes the risk of data loss, duplication, or corruption during critical migrations.
Beyond migrations, reconciliation checks are also used in data pipelines and integrations. They help validate that transformations applied in-flight do not compromise accuracy, and that downstream datasets remain coherent with upstream sources.
Other use cases include regulatory compliance, where organizations must prove that financial or operational data has been faithfully replicated across systems, and system upgrades, where schema changes or infrastructure shifts can introduce unexpected mismatches.
By systematically applying reconciliation checks, teams can maintain trust in their data, reduce operational risk, and streamline incident detection when anomalies arise.
Before defining reconciliation checks, you first specify the source dataset. This represents the system of record against which you want to validate consistency. It is possible to define a filter on the source dataset, allowing you to reconcile only a subset of records that match certain criteria (for example, only transactions from the current month, or only rows belonging to a specific business unit).
For the target dataset, the reconciliation check applies the dataset filter defined at the top of the contract (see ).
Ensure that both source and target are constrained to the same logical scope before comparisons are made, keeping the validation consistent and relevant.
At this level, aggregate metrics from the source and target datasets are compared. Examples include totals (e.g., revenue, number of rows), averages, or other summary statistics. This approach is efficient and provides a high-level signal that the data remains consistent. It is especially useful for large-scale migrations or pipelines where exact row-by-row comparison may not be necessary at all times.
Comparisons at the metric level are evaluated against a defined threshold, which represents the acceptable difference between source and target. This tolerance can be set depending on the business context. Some use cases may allow small discrepancies (e.g., rounding differences), while others require exact equality.
When comparing integrity checks such as missing values, duplicates, or invalid entries, you can reconcile either by looking at the raw count of affected records or by comparing the percentage metric (e.g., the percentage of rows with missing values in each dataset). This flexibility ensures that reconciliation is meaningful regardless of dataset size or distribution.
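A metric-level comparison against a tolerance can be sketched as follows. This is a simplified illustration, not Soda's implementation; the function name and values are hypothetical:

```python
def reconcile_metric(source_value, target_value, tolerance_pct):
    """Pass when the relative difference between the source and target
    aggregates stays within the configured tolerance (in percent)."""
    if source_value == 0:
        return target_value == 0
    diff_pct = abs(target_value - source_value) / abs(source_value) * 100
    return diff_pct <= tolerance_pct

# Hypothetical aggregates: row counts in source vs. target after a migration
print(reconcile_metric(1_000_000, 999_500, tolerance_pct=0.1))  # 0.05% drift: passes
print(reconcile_metric(1_000_000, 980_000, tolerance_pct=0.1))  # 2% drift: fails
```

Setting the tolerance to zero expresses the exact-equality case described above.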
In addition to dataset-level filters, reconciliation checks support check-level filters, which are applied consistently to both the source and target within the scope of a specific check. These filters make it possible to validate a subset of the data relevant to the context of the check. The check-level filter is applied on top of any existing source or target dataset filters.
For more granular validation, reconciliation can be performed at the row level. This type of check surfaces detailed differences such as missing records, mismatched values, or unexpected duplicates. Row-level reconciliation is critical in scenarios where accuracy at the record level is non-negotiable, such as records involving financial transactions, user data, or regulatory reporting.
This requires specifying a primary key (or a composite key) to uniquely identify rows between the source and the target. Once rows are aligned, you can define a list of columns to test for exact matches or acceptable tolerances. If no column list is provided, the check defaults to comparing all columns in order. This flexibility ensures that comparisons can range from broad validation across the entire dataset to focused checks on only the most critical attributes.
Row-level reconciliation supports thresholds expressed either as the count of differing rows between source and target, or as the percentage of differing rows relative to the source dataset row count. These thresholds determine the acceptable level of variance before the check is considered failed, giving you fine control over sensitivity and tolerance.
This dual approach allows teams to adapt reconciliation logic to different contexts, using absolute counts when every record matters, and percentages when evaluating proportional differences in large datasets.
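The row-level logic described above can be sketched as follows. This is a simplified illustration, not Soda's implementation; the function name and data are hypothetical:

```python
def reconcile_rows(source, target, compare_columns, max_diff_pct):
    """Row-level reconciliation sketch: rows are dicts keyed by primary key.
    Counts missing, mismatched, and unexpected extra rows, then evaluates
    the percentage of differing rows relative to the source row count."""
    diffs = 0
    for key, src_row in source.items():
        tgt_row = target.get(key)
        if tgt_row is None:  # row missing in target
            diffs += 1
        elif any(src_row[c] != tgt_row[c] for c in compare_columns):
            diffs += 1       # value mismatch on a compared column
    diffs += sum(1 for key in target if key not in source)  # extra rows in target
    diff_pct = 100 * diffs / len(source)
    return diffs, diff_pct, diff_pct <= max_diff_pct

source = {1: {"amount": 10}, 2: {"amount": 20}, 3: {"amount": 30}}
target = {1: {"amount": 10}, 2: {"amount": 25}}  # one mismatch, one missing row
print(reconcile_rows(source, target, ["amount"], max_diff_pct=50))
# 2 differing rows out of 3 exceeds the 50% threshold, so the check fails
```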
As with metric-level checks, you can define a check-level filter that is applied on top of any existing dataset filters. This allows you to reconcile only a targeted segment of data within the context of the specific check—for example, testing only a single business unit, product family, or date range.
Row-level reconciliation is inherently heavier than metric-level reconciliation, as it requires comparing records across potentially large datasets. To enable comparisons even when data lives in different systems, data is loaded into memory from both the source and the target, where the diff is executed. A paginated approach is used to maintain scalability; this ensures that memory usage remains stable, but execution time will increase as the dataset size and column count grow.
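A minimal sketch of the paginated approach (illustrative only; `fetch_page` stands in for a keyed, ordered query against the actual data source):

```python
from itertools import islice

def fetch_page(rows, after_key, page_size):
    """Stand-in for 'SELECT ... WHERE pk > :after ORDER BY pk LIMIT :page_size'."""
    keys = sorted(k for k in rows if after_key is None or k > after_key)
    return [(k, rows[k]) for k in islice(iter(keys), page_size)]

def paginated_diff(source, target, page_size=2):
    """Walk the source in primary-key order one page at a time, so only a
    single page plus the matching target rows is held in memory; execution
    time grows with dataset size while memory stays bounded."""
    differing, after = 0, None
    while True:
        page = fetch_page(source, after, page_size)
        if not page:
            break
        for key, src_row in page:
            if target.get(key) != src_row:
                differing += 1  # missing in target, or values differ
        after = page[-1][0]
    differing += sum(1 for k in target if k not in source)  # extras in target
    return differing

source = {1: "a", 2: "b", 3: "c"}
target = {1: "a", 2: "x", 4: "d"}  # one mismatch, one missing, one extra
print(paginated_diff(source, target))  # 3
```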
Recommendations
Leverage filters to scope checks to new or incremental batches of data wherever possible, rather than repeatedly reconciling the entire dataset. This reduces both execution time and operational overhead.
Use metric-level reconciliation as a first line of validation. It is significantly more efficient and scalable, and can quickly highlight whether deeper row-level analysis is even necessary.
Soda is suitable for no-code and programmatic users alike. If you are implementing checks programmatically, you can learn more about the contract language syntax for reconciliation on the . Reconciliation checks can be used for both metric- and row-level validation.
This page describes what a Soda Agent is.
The Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality. For a self-hosted agent, create a Kubernetes cluster in a cloud services provider environment, then use Helm to deploy a Soda Agent in the cluster.
This setup enables Soda Cloud users to securely connect to data sources (Snowflake, Amazon Athena, etc.) from within the Soda Cloud web application. Any user in your Soda Cloud account can add a new data source via the agent, then write their own no-code checks to check for data quality in the new data source.
When you deploy an agent, you also deploy two types of workloads in your Kubernetes cluster from a Docker image:
a Soda Agent Orchestrator which creates Kubernetes Jobs to trigger scheduled and on-demand scans of data
a Soda Agent Scan Launcher which wraps the Soda Python libraries that implement the scans.
Kubernetes is a system for orchestrating containerized applications; a Kubernetes cluster is a set of resources that supports an application deployment.
You need a Kubernetes cluster in which to deploy the containerized applications that make up the Soda Agent. Kubernetes uses the concept of Secrets, which the Soda Agent Helm chart employs to store connection secrets that you specify as values during the Helm release of the Soda Agent. Depending on your cloud provider, you can arrange to store these Secrets in your provider's specialized secret-storage service.
Learn more about .
The Jobs that the agent creates access these Secrets when they execute.
Learn more about .
Within a cloud services provider environment is where you create your Kubernetes cluster. You can deploy a Soda Agent in any environment in which you can create Kubernetes clusters, such as:
Amazon Elastic Kubernetes Service (EKS)
Microsoft Azure Kubernetes Service (AKS)
Google Kubernetes Engine (GKE)
Any Kubernetes cluster, version 1.21 or greater, that uses standard Kubernetes APIs
Helm is a package manager for Kubernetes which bundles YAML files together for storage in a public or private repository. This bundle of YAML files is referred to as a Helm chart. The Soda Agent is a Helm chart. Anyone with access to the Helm chart’s repo can deploy the chart to make use of YAML files in it.
Learn more about .
The Soda Agent Helm chart is stored on a and published on . Anyone can use Helm to find and deploy the Soda Agent Helm chart in their Kubernetes cluster.
Kubernetes is the most powerful and future-proof platform for running the Soda Agent because it delivers the best of both worlds: the flexibility of raw compute without the operational burden, and the scalability of managed services without their restrictions.
Kubernetes goes far beyond raw compute like EC2 or traditional Virtual Machines (VMs) by abstracting away the heavy lifting of networking, deployments, and scaling, while still giving teams precise control when needed. Practically, this makes it easy for Soda’s customers to deploy and upgrade the agent, always staying up to date with the latest releases.
Unlike fully managed options such as AWS Lambda, Kubernetes has no execution time limits and is built to handle long-running, stateful, and highly scalable workloads. This means Soda is not limited to lightweight samples but can perform complete, row-level operations, powering advanced capabilities like the Diagnostics Warehouse, which securely stores the exact failing records inside your own infrastructure, and reconciliation checks, which compare data at row level across sources.
Whether running in the cloud or on-premises, Kubernetes ensures resilience, portability, and cost-efficient resource use, making it the clear choice for complex, enterprise-grade data quality workloads.
An overview of Soda's key observability features and how they help catch data issues early.
Data observability is the ongoing process of monitoring and assessing the health of your data throughout its lifecycle. It focuses on analyzing metadata, metrics, and logs to detect issues as they arise, helping teams maintain trust in their data.
At the core of data observability are monitors that track key data quality metrics over time. When a metric behaves unexpectedly, anomaly detection algorithms analyze historical patterns to determine whether an alert should be triggered.
Typical data quality metrics to monitor are:
Schema changes to surface structural modifications
Row counts to detect unexpected changes in data volume
Most recent timestamps to detect data freshness, missing or delayed data
Missing values to track data completeness
Averages to observe shifts in distributions
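A toy example of history-based anomaly detection (a plain z-score rule, deliberately much simpler than Soda's proprietary algorithm; the metric history below is hypothetical):

```python
from statistics import mean, stdev

def is_anomaly(history, latest, z_threshold=3.0):
    """Flag the latest measurement when it sits more than z_threshold
    standard deviations away from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Hypothetical history of daily row counts for one dataset
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
print(is_anomaly(daily_row_counts, 1002))  # within the normal range
print(is_anomaly(daily_row_counts, 100))   # sudden volume drop
```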
Soda embraces pragmatism over purity: practical outcomes and effectiveness are more important than ideal, unidimensional approaches. Effective data quality comes from combining data observability and data testing. Each serves a different purpose. Observability is about speed and broad coverage. Testing is about precision, enforcement, and prevention.
Benefits of Data Observability
Enables broad coverage quickly, even across large data sources
Surfaces unknown issues without needing to define every rule
Requires minimal configuration to get started
Leverages existing metadata for fast and efficient monitoring
Limitations of Data Observability
Serves only as a signal. An anomaly suggests an issue but doesn’t confirm it
Can generate false alerts, since detection is driven by algorithms
Requires further investigation to validate and resolve alerts
Does not prevent issues. It flags them after they’ve happened
Start with Observability, but rely on Testing
Observability is a fast and efficient way to get initial coverage. It helps surface unknown issues with minimal setup and delivers immediate value across many datasets. However, for lasting reliability and trust in your data, testing is more important.
Testing requires more effort up front. It involves defining explicit expectations and rules for your data. But that investment pays off. When a test fails, you know there is a real data quality issue, no guesswork, no false alerts. When an anomaly is detected, it doesn't necessarily mean there is an underlying data quality issue, and more investigation effort is required.
For long-term reliability, testing is essential. It adds rigor by enforcing defined standards and helps prevent bad data from going into production. Start with your most critical datasets, then expand gradually using a collaborative approach, where business users help by proposing checks. This creates a scalable framework that grows with your organization while ensuring lasting data quality.
Soda’s data observability allows teams to monitor data health across large environments without manual setup. All anomalies are surfaced in a single, easy-to-navigate dashboard, making it simple to spot issues and investigate patterns. Behind the scenes, a proprietary anomaly detection algorithm ensures high precision by minimizing false positives and focusing on meaningful deviations. Notifications are opt-in and alerts are only triggered when they matter, helping teams stay focused without being overwhelmed by noise.
Soda enables large-scale observability with ease. Instead of configuring each table manually, monitoring is applied at the data source level and automatically extends to all datasets underneath. This allows teams to activate observability across hundreds or even thousands of tables in minutes.
By leveraging metadata such as row counts, schema evolution, and insert activity, Soda delivers lightweight and efficient monitoring. There is no need to scan entire datasets or write custom logic for each table. You can do that if needed, but it is not required. Observability starts working immediately and is built to handle even the largest data platforms.
Observability is not just about what happens next. With built-in backfilling and backtesting, Soda instantly analyzes historical metadata and metric trends. From the moment observability is enabled, teams gain visibility into past data quality metrics and can detect potential anomalies that may have gone unnoticed.
This historical context is essential. It helps determine whether a current anomaly is truly new or part of an ongoing pattern. It also allows the anomaly detection algorithm to establish baselines immediately, which improves the quality of alerts from the very beginning.
Soda’s proprietary anomaly detection algorithm is specifically designed for data quality monitoring. Every component has been developed entirely in-house without relying on third-party frameworks. This gives Soda full control over the modeling stack and ensures transparency, customization, and explainability. These attributes are especially important in production environments where trust in alerts is essential.
The algorithm is built on a proprietary evaluation framework that rigorously tests its performance using hundreds of internally curated datasets with known data quality issues. This framework enables structured, repeatable experimentation and continuous benchmarking of new techniques. It prioritizes reducing false positives to ensure alerts are accurate, meaningful, and reliable.
In benchmark testing, Soda’s algorithm demonstrated a 70 percent improvement in anomaly detection accuracy compared to Facebook Prophet. Unlike generic forecasting tools that rely on rigid assumptions, Soda’s model is tailored to the real-world challenges of monitoring data quality at scale.
The system is flexible and adapts to different team needs. It can run autonomously with smart defaults or be fine-tuned through a user-in-the-loop approach. Teams can improve detection by providing feedback and adjusting sensitivity. This flexibility ensures that alerts remain focused, useful, and aligned with the needs of each organization.
Soda’s Metric Monitoring feature is the foundation of Data Observability, allowing users to automatically track key dataset and column-level statistics over time, detect deviations, and get alerted before data issues impact downstream analytics. While quality checks also keep track of measurements over time, metric monitors use that history of measurements to learn from them and automatically adjust thresholds to inform about expected values or alert about anomalies.
Metric Monitoring is developed to be a hassle-free feature. You can unlock organization‐wide observability through Soda Cloud’s . This instantly provides automated metric monitoring across hundreds of tables by simply selecting all the datasets you care about and defining a shared schedule in one step. No more configuring each table by hand: stay ahead of pipeline failures, data delivery delays, and structural changes with consistent, centralized monitoring that grows as fast as your data.
Learn more about how roles and permissions affect Metric Monitoring capabilities: .
The Webhook Integration in Soda Cloud allows you to send notifications about check results (based on notification rules) and incident updates to external systems, such as monitoring tools, incident management platforms, or custom endpoints.
This integration is ideal for teams who want to build custom workflows or integrate Soda Cloud alerts into their existing tools.
Only users with the Manage Organization Settings global role can define webhook integrations.
Follow these steps to configure a Webhook integration in Soda Cloud:
Go to the Integrations section in Settings.
Click the + button to add a new integration.
Select the integration type: Webhook, and click next.
Configure the Webhook
Name: Provide a clear name for your integration.
URL: Enter the Webhook endpoint where Soda Cloud should send notifications. Headers: (Optional)
Add authentication or custom headers required by your endpoint.
Test the Webhook
Use the built-in testing tool to simulate events and validate your Webhook integration.
You can select different event types to test and develop your integration.
For the exact payload structure and details, see the
Choose the events to send
Alert Notifications: The integration becomes available for use in notification rules. It will only send notifications when you explicitly configure a notification rule to use this Webhook.
Incidents: Triggered when users create or update incidents in Soda Cloud.
Click Save to apply
After configuring your Webhook integration with the Alert Notification scope, you can use it in your notification rules to send alerts when specific checks fail.
When creating or editing a notification rule, select your configured Webhook integration as the recipient.
For detailed steps and advanced examples, see the
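As an illustration only, a minimal receiver that enforces a custom authentication header might look like the sketch below. The handler class, header name, and secret are hypothetical, and the real payload structure is documented in the webhook reference:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

SHARED_SECRET = "replace-me"  # hypothetical value for the custom header configured in the integration

class SodaWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Reject calls that lack the custom authentication header
        if self.headers.get("X-Auth-Token") != SHARED_SECRET:
            self.send_response(401)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        self.log_message("received event: %s", event)  # hand off to your own workflow here
        self.send_response(200)
        self.end_headers()

def serve(port=8080):
    HTTPServer(("", port), SodaWebhookHandler).serve_forever()
```

Calling `serve()` runs the endpoint; in practice you would route the parsed event into your monitoring or incident tooling.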
Integrate Soda with Microsoft’s Purview data catalog to access details about the quality of your data from within the catalog.
Run data quality checks using Soda and visualize quality metrics and rules within the context of a table in Purview.
Give your Purview-using colleagues the confidence of knowing that the data they are using is sound.
Encourage others to add data quality checks using a link in Purview that connects directly to Soda.
In Purview, you can see all the Soda data quality checks and the value associated with the check’s latest measurement, the health score of the dataset, and the timestamp for the most recent update. Each of these checks listed in Purview includes a link that opens a new page in Soda Cloud so you can examine diagnostic and historic information about the check.
Purview displays the latest check results according to the most recent Soda scan for data quality, where color-coded icons indicate the latest result. A gray icon indicates that a check was not evaluated as part of a scan.
If Soda is performing no data quality checks on a dataset, the instructions in Purview invite a catalog user to access Soda and create new checks.
You have verified some contracts and published the results to Soda Cloud.
You have a Purview account with the privileges necessary to collect the information Soda needs to complete the integration.
The data source that contains the data you wish to check for data quality is available in Purview.
Sign into your Soda Cloud account and confirm that you see the datasets you expect to see in the data source you wish to test for quality.
In your Soda Cloud account, navigate to your avatar > Profile, then navigate to the API Keys tab. Click the plus icon to generate new API keys.
Copy the following values and paste to a temporary, secure, local location.
This page lists the supported data source types and their required connection parameters for use with Soda Core.
Soda uses the official Python drivers for each supported data source. The configuration examples below include the default required fields, but you can extend them with any additional parameters supported by the underlying driver.
Each data source configuration must be written in a YAML file and passed as an argument using the CLI or Python API.
Each configuration must include:
helm upgrade <release> soda-agent/soda-agent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=****

> cat values-local.yaml
soda:
  apikey:
    id: ***
    secret: ***
> helm upgrade soda-agent soda-agent/soda-agent \
  --values values-local.yml --namespace soda-agent

helm upgrade <release> soda-agent/soda-agent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set imageCredentials.apikey.id=*** \
  --set imageCredentials.apikey.secret=***

> cat values-local.yaml
soda:
  apikey:
    id: ***
    secret: ***
imageCredentials:
  apikey:
    id: ***
    secret: ***
> helm upgrade soda-agent soda-agent/soda-agent \
  --values values-local.yml --namespace soda-agent

helm upgrade <release> soda-agent/soda-agent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set existingImagePullSecrets[0].name=my-existing-secret # Mind the array and indexing syntax!

> cat values-local.yaml
soda:
  apikey:
    id: ***
    secret: ***
existingImagePullSecrets:
  - name: my-existing-secret
> helm upgrade soda-agent soda-agent/soda-agent \
  --values values-local.yml --namespace soda-agent

helm upgrade <release> soda-agent/soda-agent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set soda.cloud.region=us

> cat values-local.yaml
soda:
  apikey:
    id: ***
    secret: ***
  cloud:
    region: "us"
> helm upgrade soda-agent soda-agent/soda-agent \
  --values values-local.yml --namespace soda-agent

# For Soda Cloud customers in the EU region
docker login registry.cloud.soda.io -u <APIKEY_ID> -p <APIKEY_SECRET>
# For Soda Cloud customers in the US region
docker login registry.us.soda.io -u <APIKEY_ID> -p <APIKEY_SECRET>

existingImagePullSecrets:
  - name: my-existing-secret

soda:
  apikey:
    id: ***
    secret: ***
agent:
  image:
    repository: custom.registry.org/sodadata/agent-orchestrator
scanLauncher:
  image:
    repository: custom.registry.org/sodadata/soda-scan-launcher
contractLauncher:
  image:
    repository: custom.registry.org/sodadata/soda-contract-launcher
hooks:
  image:
    repository: custom.registry.org/sodadata/soda-agent-utils

pip install -i https://pypi.dev.sodadata.io/simple -U soda-bigquery

type: bigquery
name: my_bigquery
connection:
  account_info_json: '{
    "type": "service_account",
    "project_id": "dbt-quickstart-44203",
    "private_key_id": "fe0a60e9cb7d4369f73f7b5691ce397d1e",
    "private_key": "-----BEGIN PRIVATE KEY-----<insert-private-key>-----END PRIVATE KEY-----\n",
    "client_email": "[email protected]",
    "client_id": "114963712293161062",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/dbt-user%40dbt-quickstart-44803.iam.gserviceaccount.com",
    "universe_domain": "googleapis.com"
  }' # example service account JSON string, exported from BQ; SEE NOTE
  dataset: ${env.BQ_DATASET_NAME}
  # optional
  account_info_json_path: /path/to/service-account.json # SEE NOTE
  auth_scopes:
    - https://www.googleapis.com/auth/bigquery
    - https://www.googleapis.com/auth/cloud-platform
    - https://www.googleapis.com/auth/drive
  project_id: ${env.BQ_PROJECT_ID} # Defaults to the one embedded in the account JSON
  storage_project_id: ${env.BQ_STORAGE_PROJECT_ID}
  location: ${env.BQ_LOCATION} # Defaults to the specified project's location
  client_options: <options-dict-for-bq-client>
  labels: <labels-dict-for-bq-client>
  impersonation_account: <name-of-impersonation-account>
  delegates: <list-of-delegates-names>
  use_context_auth: false # whether to use Application Default Credentials

soda data-source test -ds ds.yml

Locally, for testing purposes, using tools like Minikube, microk8s, kind, k3s, or Docker Desktop with Kubernetes support.

Latest value
Trend sparkline
Missing values percentage
Detects anomalies in the maximum (highest) value in a column.
Unique count
Detects anomalies in the number of distinct (unique) values in a column.
Timestamp
Most recent timestamp
Detects anomalies in the most recent (latest) timestamp value in a column.
Numeric
Average
Detects anomalies in the average (mean) value of a column.
Standard deviation
Detects anomalies in the standard deviation of values in a column.
Sum
Detects anomalies in the total (sum) of values in a column.
Variance
Detects anomalies in the variance (spread) of values in a column.
Q1
Detects anomalies in the 25th percentile (first quartile) value of a column.
Median
Detects anomalies in the 50th percentile (median - Q2) value of a column.
Q3
Detects anomalies in the 75th percentile (third quartile) value of a column.
Text
Average length
Detects anomalies in the average character length of text values.
Maximum length
Detects anomalies in the longest character length of text values.
Minimum length
Detects anomalies in the shortest character length of text values.
Exclusion Values: specify literal values or ranges to ignore when marking anomalies.
All data types
Count
Detects anomalies in the number of non-missing (non-NULL) values in a column.
Duplicate percentage
Detects anomalies in the percentage of duplicate values in a column.
Maximum value
Detects anomalies in the maximum (highest) value in a column.
Minimum value
Detects anomalies in the minimum (lowest) value in a column.




















Provides early signals when something might be wrong
May result in extra work to follow up and interpret alerts








API Key Secret
See the Access Purview tutorial using REST APIs for instructions on how to create the following values, then paste them to a temporary, secure, local location.
client_id
client_secret
tenant_id
Copy the value of your Purview endpoint from the URL (https://XXX.purview.azure.com) and paste it to a temporary, secure, local location.
To connect your Soda Cloud account to your Purview Account, contact your Soda Account Executive or email Soda Support with the details you collected in the previous steps to request Purview integration.




Open Source. Available for Free, Team and Enterprise Plans.







Dataset size             Changes    Memory      Runtime
10 columns, 500K rows    1%         <80MB RAM   9s
360 columns, 100K rows   1%         <80MB RAM   1m
360 columns, 1M rows     1%         <80MB RAM   35m
The type, name, and connection fields must use the exact structure required by the underlying Python driver.
Test the connection before using the configuration in a contract.
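For example, if the configuration is saved as ds.yml, you can test it from the CLI with the connection-test command shown elsewhere in these docs:

```
soda data-source test -ds ds.yml
```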
You can run verifications using Soda Core (local execution) or a Soda Agent (remote execution). To ensure consistency and compatibility, you must use the same data source name in both your local configuration for Soda Core and in Soda Cloud. See: Onboard datasets on Soda Cloud
This matching by name ensures that the data source is recognized and treated as the same across both execution modes, whether you’re running locally in Soda Core or remotely via a Soda Agent.
It’s also possible to onboard a data source to Soda Cloud and a Soda Agent after it was onboarded using Soda Core.
To learn how: Onboard datasets on Soda Cloud
You can reference environment variables in your data source configuration. This is useful for securely managing sensitive values (like credentials) or dynamically setting parameters based on your environment (e.g., dev, staging, prod).
Example:
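A minimal sketch, assuming a Postgres data source and environment variables named POSTGRES_USERNAME and POSTGRES_PASSWORD (the variable names are illustrative):

```yaml
type: postgres
name: postgres_ds
connection:
  host: localhost
  port: 5432
  database: analytics
  user: ${env.POSTGRES_USERNAME}      # resolved from the runtime environment
  password: ${env.POSTGRES_PASSWORD}
```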
Environment variables must be available in the runtime environment where Soda is executed (e.g., your terminal, CI/CD runner, or Docker container).
For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda from scratch and configure it to connect to your data sources, see Soda's Quickstart.



Every monitor in Soda has its own dedicated detail page. This page is designed to help you explore the monitor's history, understand its behavior over time, and take action when needed. From here, you can investigate anomalies, give feedback to improve the detection algorithm, create incidents, and fine-tune the monitor's sensitivity or configuration.
The page consists of two main components:
An interactive plot that visualizes metric trends, anomalies, and historical context
A results table that lists all metric values and events visible in the plot
The interactive plot gives you a time-based view of how the monitor metric has evolved. It combines metric values, expected behavior, and any detected anomalies in a single visual.
Select a time window using the range slider below the plot to zoom in or out on a specific period
Click and drag to zoom into a custom time range
Hover over data points to view detailed information for each result
Expected range: the shaded area that represents the predicted normal behavior, as defined by the anomaly detection model
Measurement: the actual metric value for each scan
Anomaly: points marked when the metric falls outside the expected range and is flagged by the algorithm
Missing: scans where no metric could be collected, typically due to unavailable data or delayed scans
Monitor created: marks the date the monitor is created
Initial configuration: shows the starting settings used when the monitor was first enabled
Configuration updated: marks changes to thresholds, exclusions, or sensitivity applied over time
Below the plot, the table lists all historical scan results, including metric values, anomaly status, and any user actions (like feedback or incidents). The plot is aligned with the table, so each data point in the plot directly corresponds to a result in the table.
This makes it easy to correlate visual trends with specific events, compare changes, and drill into the context of any anomaly or data quality issue.
Using the three-dot menu, you can give feedback to the anomaly detection algorithm (in bulk, for multiple results at once) or create an incident and link the results to it.
Soda's Observability tools work out of the box with predefined baselines, but you can fine-tune them to your specific needs. Do this from the page for each specific metric.
By default Soda uses an adaptive statistical method, but you can control which sides of the expected range should trigger an anomaly alert:
Open the panel
Click on "Set Threshold Strategy" button on the metric of your choice
Choose your alert ranges
Upper range: when checked, Soda will flag any metric value that exceeds the upper bound of its statistical baseline.
Lower range: when checked, Soda will flag any metric value that falls below the lower bound.
Apply your settings
Click Set Threshold Strategy to save.
With this simple toggle interface you can, for example, watch only for unexpectedly high values, only for drops below your baseline, or both.
Exclude values from monitoring to tell Soda which specific values or ranges should be ignored when evaluating anomalies (e.g. test rows, manual overrides). The available inputs depend on the metric type.
Open the panel: click the Set Exclusion Values button on the metric of your choice.
Define your exclusions: click + Add exclusion.
Numeric metrics (Total row count, Total row count change, Partition row count):
Type: Value or Value range
Value: enter the exact metric value (e.g. 205) or, for a range, specify both lower and upper bounds.
You can stack multiple rules by clicking + Add exclusion.
Apply: click Set Exclusion Values to save your rules.
This will not retroactively change past results. It only affects future anomaly evaluations.
Soda uses a statistical baseline to define an “expected range” for anomaly detection. You can adapt how tight or loose that range is.
Open the panel: click the Set Sensitivity button on the metric of your choice.
Adjust the sensitivity
Provide a z-score: enter a value between 0.3 and 6 to control the exact width of the expected range OR use the slider to drag between Narrow (lower z-score) and Wide (higher z-score).
Default: z = 3
Preview how changing sensitivity widens or narrows the gray “expected” band in the plot
Apply
Click Apply sensitivity to save.
This will not retroactively change past results. It only affects future anomaly evaluations.
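Soda's adaptive model is proprietary, but the effect of the z-score can be illustrated with a simple mean/standard-deviation baseline. The sketch below is an assumption-laden approximation, not Soda's actual algorithm: a higher z widens the expected band, a lower z narrows it.

```python
import statistics

def expected_range(history, z=3.0):
    """Naive expected range: mean +/- z * sample stdev.
    Illustrative only -- Soda's baseline is adaptive, not a fixed formula."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return (mean - z * stdev, mean + z * stdev)

def is_anomaly(value, history, z=3.0):
    """Flag a value that falls outside the expected band."""
    low, high = expected_range(history, z)
    return value < low or value > high

history = [100, 102, 98, 101, 99, 103, 97]
print(is_anomaly(150, history, z=3.0))  # True: far outside the band
print(is_anomaly(101, history, z=3.0))  # False: within the band
```

With the default z = 3, only values roughly three standard deviations from the mean are flagged; dragging toward Narrow (e.g. z = 1) makes the same history flag much smaller deviations.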
Our home-brewed anomaly detection algorithm draws trends from historical data, but it can also learn from your input as you give it feedback.
When a monitor flags an anomaly you can:
Mark as expected Teach Soda that this deviation is acceptable: future similar variations will no longer trigger alerts.
Mark as anomaly Explicitly flag a point as an anomaly, even if it fell inside the baseline. This helps refine your alert definitions.
Create new incident Create a ticket in your incident management tool directly from the panel.
Link to existing incident Attach this scan to a ticket in your external system (Jira, ServiceNow, PagerDuty, etc.), keeping engineering triage in one place.
Bulk feedback More than one scan can be added to an incident or feedback. Simply check the boxes of the scans you want to add.
Dataset settings allow you to define key metadata, ownership, and business context for your datasets. This information helps ensure data governance, accountability, and seamless integration with other tools like your data catalog.
Each dataset should have a designated dataset owner: a person or team responsible for the dataset's quality, availability and usage.
Typically, the role of a Dataset Owner includes:
Defining and maintaining the dataset's purpose and documentation.
Ensuring the dataset meets data quality standards and contract requirements.
Responding to issues, such as failed checks or data quality alerts.
Reviewing and approving changes to the dataset schema or contract.
Updating the Dataset Owner requires the following dataset role: "Configure dataset".
To assign a Dataset Owner:
Open the dataset page.
Click the context menu (⋮) in the top-right corner and select Edit Dataset.
In the Owned by section, select one or more users and/or user groups.
Click Save to apply the changes.
Responsibilities allow you to assign permissions to users or user groups, ensuring they have the access they need to work with a dataset.
A Responsibility is a combination of:
A User or User Group.
A Dataset Role, which is a predefined collection of permissions (such as the ability to edit contracts, view checks, or manage settings).
By assigning Responsibilities, you define who can do what for each dataset, supporting clear ownership, governance, and collaboration.
Learn about defining custom roles
Managing responsibilities requires the following dataset role: "Manage dataset responsibilities"
To assign a Responsibility to a user or group:
Open the dataset page.
Click the context menu (⋮) in the top-right corner and select Edit Responsibilities.
Add the desired users or user groups.
Select the appropriate Dataset Role for each.
Click Save to apply the changes.
Every dataset has a default Dataset Owner role, automatically assigned to the designated Dataset Owner(s).
This role provides essential permissions to manage and maintain the dataset.
The Dataset Owner role cannot be removed, but it can be combined with other roles for additional permissions.
Updating the dataset attributes requires the following dataset role: "Configure dataset".
Dataset attributes allow you to add descriptive metadata to your datasets. This metadata can then be:
Used for filtering in Soda Cloud, making it easier to search and organize datasets and checks based on specific criteria (e.g., business domain, sensitivity, criticality).
Leveraged in reporting, enabling you to group datasets, track ownership, and monitor data quality across different categories or dimensions.
Adding meaningful attributes enhances discoverability, governance, and collaboration within Soda and its integrations.
Learn how to define attribute types:
You can add or modify dataset attributes in the Dataset Settings page:
Click the context menu (⋮) in the top-right corner and select Edit Dataset.
Set a value for the existing attribute type. They are all optional.
Save your changes.
When managing multiple datasets, you can save time by applying changes in bulk using the Bulk Edit feature.
Go to the Datasets page.
Select the datasets you want to edit using the checkboxes.
Click Edit in the action bar.
Define attributes you want to add or modify across the selected datasets.
Define responsibilities you want to add or modify across the selected datasets.
Choose whether to update existing responsibilities (add new without removing existing) or reset (replace all existing responsibilities with the new definition).
Click Continue to review your changes.
You can automate the management of dataset attributes and responsibilities in Soda Cloud using our REST API. This allows you to:
Programmatically set or update attributes for multiple datasets.
Assign responsibilities (users, groups, and roles) to datasets at scale.
Keep your Soda Cloud configuration in sync with your data catalog or external metadata management systems.
This automation ensures that your metadata stays up-to-date and consistent across your ecosystem, supporting seamless governance and discoverability.
To do so, you can leverage our APIs.
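The endpoint URLs and payload schema are defined in Soda Cloud's API reference; as a hypothetical sketch of what such automation could look like (the field names and payload shape below are illustrative assumptions, not the real API schema):

```python
import json

def build_dataset_attributes_payload(dataset_id, attributes):
    """Assemble a payload for updating dataset attributes in bulk.
    NOTE: hypothetical payload shape for illustration; consult the
    Soda Cloud API reference for the actual schema."""
    return {
        "datasetId": dataset_id,
        "attributes": [{"name": k, "value": v} for k, v in attributes.items()],
    }

payload = build_dataset_attributes_payload(
    "orders_dataset",
    {"domain": "sales", "criticality": "high"},
)
print(json.dumps(payload, indent=2))
```

A sync job could build one such payload per dataset from your catalog export and submit them via authenticated HTTP requests, keeping Soda Cloud aligned with your external metadata system.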
This page describes how to install the Soda Python packages, which are required for running Soda scans via the CLI or Python API.
To use Soda, you must have installed the following on your system.
Python 3.8, 3.9, 3.10 or 3.11.
To check your existing version, use the CLI command: python --version or python3 --version. If you have not already installed Python, consider using pyenv to manage multiple versions of Python in your environment.
Pip 21.0 or greater.
To check your existing version, use the CLI command: pip --version
A Soda Cloud account; see how to .
Best practice dictates that you install the Soda CLI using a virtual environment. If you haven't yet, in your command-line interface tool, create a virtual environment in the .venv directory using the commands below. Depending on your version of Python, you may need to replace python with python3 in the first command.
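The commands referred to above follow standard Python tooling (adjust python vs python3 for your system):

```shell
# Create a virtual environment in the .venv directory and activate it
python3 -m venv .venv
. .venv/bin/activate
```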
Before you install the Soda CLI, decide which installation flow applies to your environment and license type. The two flows available serve different purposes:
How to differentiate between free open-source Soda, and paid licensed Soda?
Soda V3: package names included core if the package was free open-source. E.g.:
soda-core-postgres
To use the open source Soda Core python packages, you must install them from the public Soda PyPi registry: https://pypi.dev.sodadata.io/simple.
Install the Soda Core package for your data source. This gives you access to all the basic CLI functionality for working with contracts.
Replace soda-postgres with the appropriate package for your data source. See the for supported packages and configurations.
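For example, using the public registry URL mentioned above (the package name here is illustrative; pick the one for your data source):

```
pip install -i https://pypi.dev.sodadata.io/simple soda-postgres
```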
Now you can .
soda: "umbrella" package (does not include Diagnostics Warehouse)
Data-source-specific packages: naming pattern is “soda-<datasource>” (e.g. soda-postgres, soda-bigquery, soda-sparkdf, etc.)
If you wish to use commercial extensions to the Soda Core python package, you must install them from one of the private Soda PyPi registries below. The private PyPI installation process adds an authentication layer and region-based repositories for license-based access control of Team and Enterprise customers.
Upgrade pip inside your new virtual environment.
Choose the correct repository based on your license and region.
1 Team: Any license except "Trial" or "Enterprise" (see below)
2 Enterprise: one of enterprise , enterprise_user_based , dataset_standard , premier licenses.
Set your credentials. See how to generate your own .
Execute the following command, replacing soda>=4.0.0b0 with the package that you need to install.
Soda with Alation to access details about the quality of your data from within the data catalog.
Run data quality checks using Soda and visualize quality metrics and rules within the context of a data source, dataset, or column in Alation.
Use Soda Cloud to flag poor-quality data in lineage diagrams and during live querying.
Give your Alation users the confidence of knowing that the data they are using is sound.
🎥 Watch a video showcasing the integration of Soda and Alation.
You have verified some contracts and published the results to Soda Cloud.
You have an Alation account with the privileges necessary to allow you to add a data source, create custom fields, and customize templates.
You have a git repository in which to store the integration project files.
🎥 Watch a 5-minute video that demonstrates how to integrate Soda and Alation.
Sign into your Soda Cloud account and confirm that you see the datasets you expect to see in the data source you wish to test for quality.
To connect your Soda Cloud account to your Alation Service Account, create an .env file in your integration project in your git repo and include details according to the example below. Refer to to obtain the values for your Soda API keys.
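The exact variable names are defined by the integration assets; a hypothetical sketch of such a .env file (all names below are illustrative placeholders):

```
SODA_API_KEY_ID=your-soda-api-key-id
SODA_API_KEY_SECRET=your-soda-api-key-secret
ALATION_HOST=https://yourcompany.alationcatalog.com
ALATION_TOKEN=your-alation-api-token
```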
To sync a data source and schema in the Alation catalog to a data source in Soda Cloud, you must map it from Soda Cloud to Alation. Create a .datasource-mapping.yml file in your integration project and populate it with mapping data according to the following example. The table below describes where to retrieve the values for each field.
Retrieve the Alation datasource_id from the URL
Retrieve the Alation datasource_container_name (schema) from the data source page
Retrieve the Alation datasource_container_id for the datasource_container_name from the URL in the Schema page.
If your Alation account employs single sign-on (SSO) access, you must for Soda to integrate with Alation.
If your Alation account does not use SSO, skip this step and proceed to .
Create custom fields in Alation that reference information that Soda Cloud pushes to the catalog. These are the fields the catalog users will see that will display Soda Cloud data quality details. In your Alation account, navigate to Settings > Catalog Admin > Customize Catalog. In the Custom Fields tab, create the following fields:
Under the Pickers heading, create a field for “Has DQ” with Options “True” and “False”. The Alation API is case-sensitive, so be sure to use these exact values.
Under the Dates heading, create a field for “Profile - Last Run”.
Contact directly to acquire the assets and instructions to run the integration and view Soda Cloud details in your Alation catalog.
Access Soda Cloud to or that execute checks against datasets in your data source each time you , or using a data pipeline tool such as Airflow. Soda Cloud pushes data quality scan results to the corresponding data source in Alation so that users can review data quality information from within the catalog.
In Alation, beyond reviewing data quality information for the data source, users can access the Joins and Lineage tabs of individual datasets to examine details and investigate the source of any data quality issues.
In a dataset page in Alation, in the Overview tab, users have the opportunity to click links to directly access Soda Cloud to scrutinize data quality details; see image below.
Under the Soda DQ Overview heading in Alation, click Open in Soda to access the dataset page in Soda Cloud.
Under the Dataset Level Monitors heading in Alation, click the title of any monitor to access the check info page in Soda Cloud.
Configure Soda Cloud to connect your account to MS Teams so that you can:
Send notifications for failed or warning check results to an MS Teams channel
Start conversations to track and resolve data quality Incidents with MS Teams
Only users with the Manage Notification Rules permission can create or edit rules. All users can view rules. Read about
As a user with permission to do so, log in to your Soda Cloud account, navigate to your avatar > Organization Settings, then select the Integrations tab.
In the Add Integration dialog box, select Microsoft Teams.
In the first step of the guided integration workflow, follow the instructions to navigate to your MS Teams account to create a Workflow; see Microsoft’s documentation for . Use the Workflow template to Post to a channel when a webhook request is received.
In the last step of the guided Workflow creation, copy the URL created after successfully adding the workflow.
Returning to Soda Cloud with the URL for Workflow, continue to follow the guided steps to complete the integration. Reference the following tables for guidance on the values to input in the guided steps.
Configuration tab: Provide the following information
Scope tab: select the Soda features (alert notifications and/or incidents) that can access the MS Teams integration.
Use the Alert Notification scope to enable Soda Cloud to send alert notifications to an MS Teams channel to notify your team of warn and fail check results. With such an integration, Soda Cloud enables users to select MS Teams as the destination for an alert notification of an individual check or checks that form a part of an agreement, or multiple checks. To send notifications that apply to multiple checks, see .
Use the Incident scope to notify your team when a new incident has been created in Soda Cloud. With such a scope, Soda Cloud displays an external link to the MS Teams channel in the Incident Details. Soda Cloud sends all incident events to only one channel in MS Teams. As such, you must provide a separate link in the Channel URL field in the Define Scope tab. For example, https://teams.microsoft.com/mychannel. To obtain the channel link in MS Teams, right-click on the channel name in the overview sidebar. Refer to for more details about using incidents in Soda Cloud.
Problem: You encounter an error that reads, “Error encountered while rendering this message.”
Solution: A fix is , the short version of which is as follows.
Restart MS Teams.
Clear your cache and cookies.
If you have not already done so, update to the latest version of MS Teams.
Configure a Webhook in Soda Cloud to connect to your ServiceNow account.
In ServiceNow, you can create a Scripted REST API that enables you to prepare a resource to work as an incoming webhook. Use the ServiceNow Resource Path in the URL field in the Soda Cloud integration setup.
This example offers guidance on how to set up a Scripted REST API Resource to generate an external link which Soda Cloud displays in the Incident Details; see image below. When you change the status of a Soda Cloud incident, the webhook also updates the status of the SNOW issue that corresponds with the incident.
Refer to the Webhook API for detailed information.
The following steps offer a brief overview of how to set up a ServiceNow Scripted REST API Resource to integrate with a Soda Cloud webhook. Reference the ServiceNow documentation for details:
and
In ServiceNow, start by navigating to the All menu, then use the filter to search for and select Scripted REST APIs.
Click New to create a new scripted REST API. Provide a name and API ID, then click Submit to save.
In the Scripted REST APIs list, find and open your newly created API, then, in the Resources tab, click New to create a new resource.
Integrate Soda with Atlan to access details about the quality of your data from within the data catalog.
Run data quality checks using Soda and visualize quality metrics and rules within the context of a data source, dataset, or column in Atlan.
Use Soda Cloud to flag poor-quality data in lineage diagrams.
Give your Atlan users the confidence of knowing that the data they are using is sound.
soda data-source test -ds ds.yml

type: postgres
name: postgres
connection:
  host:
  port:
  database:
  user: ${env.POSTGRES_USERNAME}
  password: ${env.POSTGRES_PASSWORD}

Feedback: shows if the user provided feedback on a result (e.g. confirmed or dismissed an anomaly)
Configuration change: visual markers indicating when the monitor’s configuration was updated
Time-based metrics (Last modification time, Most recent timestamp):
Type: Value or Value range
Value: enter the cutoff you want to ignore (e.g. 0 days, 10 hours, 49 minutes) or, for a range, specify both lower and upper bounds.
Schema changes: exclusions are not supported for schema-drift monitors.





































In the Script field, define a script that creates new tickets when a Soda Cloud incident is opened, and updates existing tickets when a Soda Cloud incident status is updated. Use the example below for reference. You may also need to define Security settings according to your organization's authentication rules.
Click Submit, then copy the value of the Resource path to use in the URL field in the Soda Cloud integration setup.

(function process(/*RESTAPIRequest*/ request, /*RESTAPIResponse*/ response) {
var businessServiceId = '28***';
var snowInstanceId = 'dev***';
var requestBody = request.body;
var requestData = requestBody.data;
gs.info(requestData.event);
if (requestData.event == 'incidentCreated'){
gs.log("*** Incident Created ***");
var grIncident = new GlideRecord('incident');
grIncident.initialize();
grIncident.short_description = requestData.incident.description;
grIncident.description = requestData.incident.sodaCloudUrl;
grIncident.correlation_id = requestData.incident.id;
if(requestData.incident.severity == 'critical'){
grIncident.impact = 1;
}else if(requestData.incident.severity == 'major'){
grIncident.impact = 2;
}else if(requestData.incident.severity == 'minor'){
grIncident.impact = 3;
}
grIncident.business_service = businessServiceId;
grIncident.insert();
var incidentNumber = grIncident.number;
var sysid = grIncident.sys_id;
var callBackURL = requestData.incidentLinkCallbackUrl;
var req, resp;
req = new sn_ws.RESTMessageV2();
req.setEndpoint(callBackURL.toString());
req.setHttpMethod("post");
var sodaUpdate = '{"url":"https://'+ snowInstanceId +'.service-now.com/incident.do?sys_id='+sysid + '", "text":"SNOW Incident '+incidentNumber+'"}';
req.setRequestBody(sodaUpdate.toString());
resp = req.execute();
gs.log(resp.getBody());
}else if(requestData.event == 'incidentUpdated'){
gs.log("*** Incident Updated ***");
var target = new GlideRecord('incident');
target.addQuery('correlation_id', requestData.incident.id);
target.query();
target.next();
if(requestData.incident.status == 'resolved'){
//Change this according to how SNOW is used.
target.state = 6;
target.close_notes = requestData.incident.resolutionNotes;
}else{
//Change this according to how SNOW is used.
target.state = 4;
}
target.update();
}
})(request, response);

soda-postgres (paid licensed Soda).
Soda V4: no differentiation using core in package names. Differentiation will be based on the installation flows listed above.
soda-migration
soda-reconciliation
soda-oracle
Executing data contracts with basic data quality checks on enterprise data sources.
Use this installation method if you’re just getting started.
The Public PyPI index hosts Soda Core packages for all supported data sources.
Same as above, plus: group by checks, reconciliation checks, migrating checks from v3 to v4, running checks on Oracle data, and capturing failed rows with the Diagnostics Warehouse.
Private PyPI repositories are region-specific and require authentication using your API key credentials. This method ensures secure access to licensed components, enterprise-only extensions, and region-compliant hosting.
Team1
EU
Team
US
Enterprise2
EU
Enterprise
US
catalog:
datasource_container_name
The schema of the data source; retrieve this value from the data source page in the Alation catalog under the subheading Schemas. See image below.
catalog:
datasource_container_id
The ID of the datasource_container_name (the schema of the data source); retrieve this value from the schema page in the Alation catalog. See image below
Under the Rich Texts heading, create the following fields:
“Soda DQ Overview”
“Soda Data Quality Rules”
“Data Quality Metrics”
Add each new custom field to a Custom Template in Alation. In Customize Catalog, in the Custom Templates tab, select the Table template, then click Insert… to add a custom field to the template:
“Soda DQ Overview”
In the Table template, click Insert… to add a Grouping of Custom Fields. Label the grouping “Data Quality Info”, then Insert… two custom fields:
“Has DQ”
“Profile - Last Run”
In the Column template, click Insert… to add a custom field to the template:
“Has DQ”
In the Column template, click Insert… to add a Grouping of Custom Fields. Label the grouping “Soda Data Profile Information”, then Insert… two custom fields:
Data Quality Metrics
Soda Data Quality Rules
name
A name you choose as an identifier for an integration between Soda Cloud and a data catalog.
soda:
datasource_id
The data source information panel in Soda Cloud.
soda:
datasource_name
The data source information panel in Soda Cloud.
soda:
dataset_mapping
(Optional) When you run the integration, Soda automatically maps all of the datasets between data sources. However, if the names of the datasets differ in the tools you can use this property to manually map datasets between tools.
catalog:
type:
The name of the cataloging software; in this case, “alation”.
catalog:
datasource_id
Retrieve this value from the URL on the data source page in the Alation catalog; see image below.
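Assembling the fields described above, a .datasource-mapping.yml could look like the following (structure inferred from the field descriptions; replace the placeholders with your own values):

```yaml
name: soda-alation-integration
soda:
  datasource_id: <soda-datasource-id>        # from the data source information panel in Soda Cloud
  datasource_name: <soda-datasource-name>    # from the data source information panel in Soda Cloud
  dataset_mapping: {}                        # optional; map dataset names that differ between tools
catalog:
  type: alation
  datasource_id: <alation-datasource-id>     # from the URL of the data source page in Alation
  datasource_container_name: <schema-name>   # the schema, from the data source page
  datasource_container_id: <schema-id>       # from the URL of the schema page
```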





You’ll be taken to the Contract Editor, a powerful interface where you can define your contract in two ways:
No-code view: Point-and-click UI to add quality checks and configure settings
Code view: YAML editor for full control and advanced use cases.
See language reference: Contract Language reference
You can switch between views at any time using the editor toggle in the top right corner.
Understanding how to structure your contract is essential. Soda supports several types of checks and configuration options:
Filter: applies a global filter to limit which rows are considered across the entire contract (e.g., only the latest partition or rows from the past 7 days.)
Variables: help you parameterize your contract, making it flexible and adaptable to different contexts (e.g., environments, schedules, or partitions.)
Dataset-level Checks: rules that apply to the dataset as a whole, like row count, freshness, or schema checks.
Column-level Checks: rules that apply to individual columns, like missing values, uniqueness, ranges, or regex formats.
All visible columns are detected during onboarding. You can also manually add columns if needed.
Variables allow dynamic substitution of values in contracts. They help you:
Parameterize values that differ across environments, datasets, or schedules.
Reuse values in multiple places within the same contract to reduce duplication and improve maintainability.
You can define variables at the top of your contract:
Then use them throughout your contract using the ${var.VARIABLE_NAME} syntax.
For example:
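A sketch of how this can look (the variable name and surrounding keys are illustrative; see the Contract Language reference for the exact schema): a variable with a default, referenced in a filter via the ${var.VARIABLE_NAME} syntax.

```yaml
variables:
  START_DATE:
    default: '2025-01-01'

filter: |
  created_at >= '${var.START_DATE}'
```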
When running the contract, variable values must be provided unless a default is defined.
Variables are ideal for partitioned datasets, date-based rules, or customizing checks based on context.
Now: You can use ${soda.NOW} in your Contract to access the current timestamp.
Use attributes to label, sort, and route your checks in Soda Cloud. Attributes help you organize checks by properties such as domain, priority, location, and sensitivity (e.g., PII).
Learn how to leverage attributes with Notifications and Browse datasets.
Apply Attributes to Checks
You can add attributes directly to individual checks. For example:
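A hedged sketch of what check-level attributes can look like (the check type and attribute names below are illustrative; attributes must already be defined in Soda Cloud):

```yaml
checks:
  - missing:
      column: customer_id
      attributes:
        domain: sales
        priority: high
        pii: false
```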
Set Default Attributes at the Top Level
You can also define default attributes at the dataset level. These attributes apply to all checks, unless overridden at the individual check level.
When publishing contract results to Soda Cloud, all check attributes must be pre-defined in Soda Cloud. If any attribute used in a contract is not registered in your Soda Cloud environment, the results will not be published, and the data contract scan will be marked as failed.
Learn how to configure attributes in Soda Cloud: Check and dataset attributes.
Before publishing, click Test to simulate a contract verification against your live data. Soda will:
Run all defined checks
Display which rules pass or fail
Surface profiling and diagnostic insights
This dry run helps ensure your contract behaves as expected, before making it official.
This action requires the "Manage contract" permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities
You can test a contract on a sample of your data. Learn more at the onboarding Additional settings.
Once you're happy with the results, click Publish.
Publishing sets this version as the source of truth for that dataset. From this point on:
Verifications will use the published version
All users see this contract as the authoritative definition of data quality for that dataset
Changes will require a new version or a proposal (depending on permissions)
Publishing ensures your data expectations are versioned, visible, and enforceable.
This action requires the Manage contract permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities
You’re now ready to start verifying your contract and monitoring your data.
Contract history provides a snapshot view of all changes that have been made to a contract.
To access contract history:
Navigate to a dataset with an existing data contract.
Click the history icon next to Edit Contract, at the top right.
Review contract history by choosing a version on the left panel and inspecting it on the right panel.
Just as when a contract is being created or edited, you can toggle between the code and no-code views.
The code view lets you toggle a diff view and a split view.
While contract history lets you see the changes a contract has undergone, request history provides an overview of the change requests that have been made for a specific contract.
To access the request history of any dataset, navigate to the dataset > tab Requests.
The list of requests can be filtered by title keyword and by state (Open, Done, and Won't do).
From this view, you can also create a request.
This view provides a snapshot of each request, making visible:
The title, description (if any), and creation time of the request
The state of the request (Open, Done, and Won't do)
An icon indicating that the request has a proposal
An icon indicating that the request has comments
To access all request history in an organization, navigate to tab Requests on top of the page.
This page provides an overview of all requests made within the organization. The requests can be filtered by:
Title keyword(s)
Status
User that created the request
Users that are participants in the request
You have verified some contracts and published the results to Soda Cloud.
You have an Atlan account with the privileges necessary to allow you to set up a Connection in your Atlan workspace.
Follow the instructions to Generate API keys in Soda to use for authentication in your Atlan connection.
Follow Atlan’s documentation to set up the Connection to Soda in your Atlan workspace.
🎥 Watch the Atlan-Soda integration in action!

Name: Provide a unique name for your integration in Soda Cloud.
URL: Input the Workflow URL you obtained from MS Teams.
Enable to send notifications to Microsoft Teams when a check result triggers an alert.
Check to allow users to select MS Teams as a destination for alert notifications when check results warn or fail. See Notifications.
Use Microsoft Teams to track and resolve incidents in Soda Cloud.
Check to automatically send incident information to an MS Teams channel.
Channel URL: Provide a channel identifier to which Soda Cloud sends all incident events.



Data contracts define the expectations between data producers and data consumers, ensuring that the data delivered is fit for purpose and aligned with business needs. However, data requirements evolve, and consumers often identify gaps or new use cases that require adjustments.
To support this, Soda provides a collaborative process that allows data consumers to request changes to existing data contracts or propose the creation of new ones. Consumers can directly propose changes by editing the data contract with Soda's no-code editor, suggesting concrete modifications for the dataset owner to review.
This approach enables data consumers to express their requirements not just in abstract terms but in actionable, implementable contract changes. By doing so, the consumer helps the dataset owner by:
Making their needs clearer and more concrete.
Supporting faster alignment between producers and consumers.
Contributing to quicker and smoother implementation.
Reducing unnecessary communication overhead.
The dataset owner remains the final decision-maker, reviewing proposed changes, iterating with the consumer as needed, and then publishing the updated contract once consensus is reached.
This collaborative workflow ensures that data contracts remain living agreements that continuously adapt to evolving business use cases while maintaining producer accountability.
In Soda Cloud, you can access a view of the contract history, which allows you to inspect all changes made to a specific contract.
Learn more about contract history.
This action requires the Propose checks permission on the dataset. Learn more about permissions here:
Users can:
Request a change by simply describing their needs and use cases.
Propose changes directly by editing the data contract, suggesting concrete modifications for the dataset owner to review.
When a request is created, dataset owners automatically receive an email notification, ensuring they can promptly review and collaborate with the requester.
To propose a change or create a new contract, data consumers can initiate a request directly from the dataset page.
Navigate to a dataset: Go to any onboarded dataset in Soda.
Start editing:
If the dataset does not yet have a contract, click Create Contract.
If a contract already exists, click Edit Contract.
Provide details: You will be prompted to:
Enter a title for the request.
Provide a reasoning or description of the changes, explaining why they are needed.
Save the request: Once you click Save, a new request is created containing your proposed changes. This proposal is then shared with the dataset owner for review and follow-up.
In some cases, data consumers may want to request changes without directly editing the contract themselves. This allows them to highlight a need while leaving the implementation details to the dataset owner.
Navigate to the dataset: Open the dataset in Soda.
Go to the Requests tab: Select the Requests tab for that dataset.
Create a new request: Click Create a Request.
Provide details: You will be prompted to:
Enter a title for the request.
Provide a reasoning or description of the changes, explaining why they are needed.
Save the request: Once you click Save, the request is created. The dataset owner will be notified and can review, clarify, and propose changes to the contract based on your input.
Each dataset page includes a Requests tab where all requests related to that dataset are listed. From here, users can:
Search for a request by name.
Filter requests by status: Open, Done, or Won’t Do.
Click on any request to access collaboration tools.
Once inside a request, users can collaborate in the following ways:
Click View Proposal to examine an existing proposal associated with the request.
When viewing a proposal, visual indicators show exactly what has changed in the contract:
Blue icon → element was modified (M).
Red icon → element was removed (R).
Green icon → element was added (A).
Blue dot → a parent element has one or more changed child elements.
Participants can post text messages within the request to clarify needs, align on requirements, and discuss next steps.
Users can contribute new proposals to move the request forward.
Iterate on an existing proposal: while viewing a proposal, click the pen icon to edit and build upon it.
From scratch: click Add Proposal to create a brand-new proposal.
In both cases:
Make your edits.
Click Save.
Provide a message to explain what you have done.
Click Save again.
All participants are automatically notified by email when a new proposal is created or an iteration is made, ensuring everyone stays aligned and can respond promptly.
This action requires the Manage Contract permission on the dataset. Learn more about permissions here:
After reviewing a proposal, you can publish it by clicking the Publish button. Once published, all participants associated with the request will automatically receive a notification, ensuring they are informed of the update.
If a new version of the contract has been published in the meantime, you must sync the proposal with the latest version before you can publish it.
When reviewing the proposal, click Sync to latest.
Two scenarios can then arise:
Soda can automatically merge the two versions. You can then proceed to the next step.
There are conflicts that Soda cannot resolve. In this case, you must resolve them yourself. Soda offers a tool that lets you compare the latest published version (left side) with the proposal version (right side). Edit the proposal version to resolve the conflicts, then click Continue to proceed.
Optionally, make additional edits.
Click Save to create a new proposal, which you can now publish.
You can fetch the content of a proposal from Soda Cloud and save it as a contract file, which can then be published to Git. This allows you to incorporate approved changes into version-controlled data contracts.
Request and proposal numbers can be found on Soda Cloud when reviewing a proposal. The first number is the request, and the decimal is the proposal.
After fetching the proposal, you can optionally use the publish command to publish it from Soda Cloud to Git:
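As a hedged sketch of the flow (the subcommand names `soda contract fetch` and `soda contract publish` are assumptions; the `-r`, `-p`, `-sc`, and `--f` options are those documented for the fetch command, and the request/proposal numbers and file path are illustrative):

```shell
# Hypothetical invocation: fetch proposal 3 of request 12 into a contract file
soda contract fetch -r 12 -p 3 -sc soda-cloud.yaml --f contracts/orders.yaml

# Optionally publish the fetched contract from Soda Cloud to Git
soda contract publish -sc soda-cloud.yaml --f contracts/orders.yaml
```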
Each request in Soda has a status to reflect its lifecycle. Initially, a request is created in the Open state. Once the requested changes have been implemented and published, the request can be moved to Done. If the decision is made not to implement the request, it can be transitioned to Won’t Do. Whenever a request’s status is updated, all participants are automatically notified by email, ensuring transparency and alignment across the collaboration process.
Configure a Webhook in Soda Cloud to connect to your Jira workspace.
In this guide, we will show how you can integrate Soda Cloud Incidents with Jira. After the integration is set up, creating an incident in Soda will automatically trigger the creation of a corresponding bug ticket in Jira. The Jira ticket will include information related to the incident created in Soda, including:
The number and title of the Incident
The description of the Incident
The severity of the incident
The status of the incident
The user who reported the Incident
A link to the Incident in Soda Cloud
A link to the associated Check in Soda Cloud
A link to this Jira ticket will be sent back to Soda and displayed on the Incident page in the Integrations box. Any updates to the status of the Incident in Soda Cloud will trigger corresponding changes to the Status of the Jira ticket. Any updates to the status of the Jira ticket will trigger corresponding changes to the Status of the Incident in Soda Cloud.
In Jira, you can set up an Automation Rule that enables you to define what you want an incoming webhook to do, then provides you with a URL that you use in the URL field in the Soda Cloud integration setup.
This integration is built on two webhook events, IncidentCreated and IncidentUpdated (Soda -> Jira), as well as the Soda Cloud API endpoint for updating incidents (Jira -> Soda).
In Jira, start by creating a new project dedicated to tracking data quality tickets. Navigate to the Project settings > Work Items, and make sure you have a bug type work item with the fields, as shown in the image below:
Summary
Description
Assignee
IncidentSeverity
From the same page, next click the Edit Workflow button, and make sure your workflow includes the following statuses:
Reported
Investigating
Fixing
Resolved
Here we will set up the automation in Jira so that when an Incident is created or updated in Soda, then a bug ticket will automatically be created or updated in Jira.
Navigate to Project settings > Automation, then click Create rule and, for the type of New trigger, select Incoming webhook.
Under the When: Incoming webhook trigger, click Add a component, select IF: Add a condition, then smart values condition.
What this means is that, if an incoming webhook has the incidentCreated event, then we will do something.
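Concretely, the smart values condition might compare the event field of the incoming payload against the event name. The exact payload path is an assumption here; check the webhook body Soda sends:

```
First value:  {{webhookData.event}}
Condition:    equals
Second value: incidentCreated
```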
Next we will add another component: THEN: Add an action.
The action will be to Create work item and the Issue Type should be Bug and the Project should be our new project.
Next we add some steps to fill out our ticket with extra information obtained from the webhook data.
We start by creating a branch rule to identify our ticket:
Then we Edit the ticket fields:
Finally, the last step in our incident creation workflow is to send a post request back to Soda with a link to the issue in Jira:
The remaining parts of this automation rule cover the scenarios where the status of the incident is updated in Soda, then we will detect this change and make the corresponding updates to the issue in Jira.
When the status changes to Reported:
The same logic is used for other status changes such as Investigating and Fixing.
In case the status changes to Resolved, our rule uses a similar logic, but with the additional step of adding resolution notes as a comment to the issue in Jira:
Once you save and enable this new rule, you can access a URL and secret that you will provide to Soda when setting up the new webhook integration.
After saving or enabling the rule, you can view details of the webhook trigger as shown below:
Next, you create a new webhook integration in Soda and provide the details from the webhook trigger above, as shown in the image below.
Paste the Webhook URL from Jira into the URL field in Soda and paste the Secret from Jira into a custom HTTP header called X-Automation-Webhook-Token.
Finally, in the Define Scope tab, make sure to select Incidents - Triggered when users create or update incidents.
We will set up a second automation rule in Jira so that when the status of the ticket changes in Jira, these changes are also reflected in Soda.
First, we set up the trigger for this automation to be when a Work item is transitioned:
Finally, we send a POST request to the Soda Cloud API incidents endpoint, using information from our Jira ticket to update the severity and status of the corresponding incident in Soda:
Note that the Authorization header value must be formatted like:
Basic <base64_encoded_credentials>. Base64-encoded credentials can be generated using Soda Cloud API keys in Python like so:
Soda offers seamless integrations with many tools across your data stack. Whether you're aligning data governance efforts, collaborating across teams, or triggering workflows, you can enhance Soda’s observability capabilities with the following connections:
For more details on notification rules, see the .
To create an integration:
Go to the Integrations section in Settings.
Click the + button to add a new integration.
Select the integration type (Slack, Microsoft Teams, or Webhook).
Follow the setup steps for the chosen integration
You can update existing integrations if connection details or configurations change.
To edit an integration:
Go to the Integrations section in Settings.
Find the integration you want to update.
Click the context menu and select Edit Integration Settings.
Update the configuration as needed.
You can temporarily pause an integration if you want to stop sending notifications and incident updates without fully deleting the configuration. The integration will no longer be available in notification rules.
To pause an integration:
Go to the Integrations section in Settings.
Locate the integration you want to pause.
Change the status to "Paused" in the table
Select Pause.
While paused, the integration will no longer send any notifications. You can resume it at any time by following the same steps and selecting Active.
This quickstart shows how Soda detects unexpected data issues by leveraging AI-powered Anomaly Detection and prevents future problems by using data contracts. The example uses Databricks, but you can do the same with any other database.
A data engineer at a retail company needs to maintain the regional_sales dataset so their team can manage regional sales data from hundreds of stores across the country. The dataset feeds executive dashboards and downstream ML models for inventory planning. Accuracy and freshness are critical, so you need both:
python -m venv .venv
source .venv/bin/activate
pip install -i https://pypi.dev.sodadata.io/simple -U soda-postgres
pip install --upgrade pip

export SODA_API_KEY_ID="your_key_id"
export SODA_API_KEY_SECRET="your_key_secret"

pip install "soda>=4.0.0b0" --pre -i https://${SODA_API_KEY_ID}:${SODA_API_KEY_SECRET}@enterprise.pypi.cloud.soda.io --extra-index-url=https://pypi.dev.sodadata.io
pip install "soda" --pre -i "https://${SODA_API_KEY_ID}:${SODA_API_KEY_SECRET}@team.pypi.cloud.soda.io" --extra-index-url=https://pypi.dev.sodadata.io
pip install soda --pre -i https://${SODA_API_KEY_ID}:${SODA_API_KEY_SECRET}@enterprise.pypi.cloud.soda.io --extra-index-url=https://pypi.dev.sodadata.io

ALATION_HOST=yourcompany.alationcatalog.com
ALATION_USER=<your username for your Alation account>
ALATION_PASSWORD=<your password for your Alation account>
SODA_HOST=cloud.soda.io
SODA_API_KEY_ID=<your Soda Cloud public key>
SODA_API_KEY_SECRET=<your Soda Cloud private key>

- name: Cars
  soda:
    datasource_id: 2d33bf0a-9a1c-4c4b-b148-b5af318761b3
    datasource_name: adventureworks
    # optional dataset_mapping soda: catalog
    dataset_mapping:
      Cars_data: Cars
  catalog:
    type: "alation"
    datasource_id: "31"
    datasource_container_name: "soda"
    datasource_container_id: "1"
- name: Soda Demo
  soda:
    datasource_id: 8505cbbd-d8b3-48a4-bad4-cfb0bec4c02f
  catalog:
    type: "alation"
    datasource_id: "37"
    datasource_container_name: "public"
    datasource_container_id: "2"

filter: country = "${var.country}"

Click Save to activate the integration.
Click Save to apply the changes.
IncidentURL
CheckURL
import base64
api_key_id = "your_api_key_id"
api_key_secret = "your_api_key_secret"
credentials = f"{api_key_id}:{api_key_secret}"
encoded_credentials = base64.b64encode(credentials.encode()).decode()
print(f"Basic {encoded_credentials}")

Make changes: Update the contract based on your needs and use case. You can add, modify, or remove elements to ensure the contract reflects the requirements you want to address.
Create a new request: After making your edits, click Create a Request.
-r (required): The request number. Identifies the request to fetch. Request numbers can be found when reviewing a proposal; the first number is the request.
-p (optional): The proposal number. Defaults to the latest proposal if not specified. Proposal numbers are shown as the decimal part when reviewing a proposal.
--soda-cloud, -sc (required): Path to the Soda Cloud config file (e.g., soda-cloud.yaml).
--f (required): Path to the output file where the contract will be written.
Automated anomaly detection on key metrics (row counts, freshness, schema drift)
Proactive enforcement of business rules via data contracts
Contact us at [email protected] to get an account set up.
After signing up, you can follow the steps below to set up a data source and start improving data quality.
Soda Cloud’s no-code UI lets you connect to any data source in minutes.
In cloud.soda.io or cloud.us.soda.io, click on Data Sources → New Data Source.
Choose your data source provider.
Name your data source under Data Source Label.
Scroll down and fill in the following credentials from your data source:
Click Connect or Test connection. This will trigger the connection and move to the next step.
Select the datasets you want to onboard on Soda Cloud.
Enable Metric Monitoring. By default, Metric Monitoring is enabled to automatically track key metrics on all the datasets you onboard and alert you when anomalies are detected. It is powered by built-in machine learning that compares current values against historical trends. You can also enable Advanced Monitor Configuration.
Enable Profiling and configure it. By default, Profiling is scheduled daily at 12:00AM UTC.
Click Finish to onboard your datasets. Soda Cloud will now spin up its Soda-hosted Agent and perform an initial Profiling & Historical Metric Collection scan. This usually takes only a few minutes.
Congratulations, you’ve onboarded your first dataset! Now let’s make sure you always know what’s happening with it.
That’s where Metric Monitoring comes in. It automatically tracks key metrics like volume, freshness, and schema changes, with no manual setup required. You’ll spot anomalies, detect trends, and catch unexpected shifts before they become problems.
Go to Datasets → select the dataset to inspect.
Navigate to the Metric Monitors tab to learn more about the metrics calculated.
You'll immediately see that key metrics are automatically monitored by default, helping you detect pipeline issues, data delays, and unexpected structural changes as they happen. No setup needed, just visibility you can trust.
In this guide, we will focus on the Most recent timestamp monitor. The panel shows that it was expected to be in a range of 0 - 5m 31s, but the recorded value at scan time was 56m 49s. In order to take a closer look:
Click the Most recent timestamp (or monitor of your choice) block.
In the monitor page you’ll see:
measured value vs. expected range,
any red-dot anomalies flagged by the model,
buttons to Mark as expected, Create new incident, etc.
Flag an outlier as "expected" or investigate it further.
Soda’s anomaly detection engine was built in-house (no third-party libraries) and optimized for high precision. It continuously adapts to your data patterns, and it incorporates your feedback to reduce false alarms. Designed to minimize false positives and missed detections, it shows a 70% improvement in detecting anomalous data quality metrics compared to Facebook Prophet across hundreds of diverse, internally curated datasets containing known data quality issues.
The Anomaly Detection Algorithm offers complete control and transparency in the modeling process to allow for interpretability and adaptations. It features high accuracy while leveraging historical data, delivering improvements over time.
Our automated anomaly detection has just done the heavy lifting for you, identifying unusual patterns and potential data issues without any setup required.
But to prevent those issues from happening again, you must define exactly what your data should look like: every column, every rule, every expectation.
That’s where Data Contracts come in. They let you proactively set the standards for your data, so problems like this are flagged or even prevented before they impact your business.
Create a new data contract to define and enforce data quality expectations.
In your Dataset Details page, go to the Checks tab.
Click Create Contract.
When creating a data contract, Soda will connect to your dataset and build a data contract template based on the dataset schema. From this point, you can start adding both dataset-level checks and column-level checks, as well as defining a verification schedule or a partition.
Toggle View Code if you’d like to inspect the generated SodaCL/YAML. This gives you access to the full contract code.
You can copy the following full example, paste it into the editor and edit it as you wish. You can toggle back to no-code view to see and edit the checks in the no-code editor.
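The full example is not reproduced here. As a hedged illustration only (the column names come from the regional_sales scenario in this quickstart, and the check keys shown are assumptions that may differ from the contract syntax in your Soda version), a minimal contract might look like:

```yaml
# Hypothetical contract sketch for the regional_sales dataset
dataset: regional_sales
columns:
  - name: store_id
    checks:
      - missing: {}          # no NULL store ids
  - name: sale_amount
    checks:
      - invalid:
          valid_min: 0       # sale amounts must be non-negative
checks:
  - schema: {}               # flag unexpected schema changes
  - row_count: {}            # track dataset volume
```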
That’s right: with Soda, you can edit a contract using either a no-code interface or directly in code. This ensures an optimal experience for all user personas while also providing a version-controlled code format that can be synced with a Git repository.
Click Test to verify the contract executes as expected
When you are done with the contract, click Publish
Click Verify. Soda will evaluate your rules against the current data.
Review the outcomes of the contract checks to confirm whether the data meets expectations. You can drill into those failures in the Checks tab.
You can trigger contract verification programmatically as part of your pipeline, so your data gets tested every time it runs.
We’ve prepared an example notebook to show you how it works:
Open the following Notebook example: https://colab.research.google.com/drive/1zkV_2tLJ4ohdzmKGS3LgdFDDnTNTUXew?usp=sharing
In your Python environment, first install the Soda Core library
Then, in the same environment, create a soda-cloud.yml file that contains your API keys, which are necessary to connect to Soda Cloud. You can create this YAML file from your Profile: Generate API keys
The soda-cloud.yml file should look like the following:
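A minimal sketch of the file (the key names follow the common Soda Cloud configuration layout; verify them against the file generated from your Profile):

```yaml
soda_cloud:
  host: cloud.soda.io        # or cloud.us.soda.io for the US region
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
```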
Now you are ready to trigger the verification of the contract. To do that, provide the identifier of your dataset as well as the path to the configuration file you created in the previous step. This will trigger a verification using the Soda Agent and return the logs.
Create a verify_contract.py file in your environment with the code below (or run it from a Jupyter notebook/Python interpreter):
You can learn more about the Python API here: Python API
You’ve completed the tutorial and are now ready to start catching data quality issues with Soda
Explore Profiling in the Discover tab to curate column selections for deeper analysis.
Set up Notification Rules (bell icon → Add Notification Rule) to push alerts to Slack, Jira, PagerDuty, etc.
Dive into Custom Monitors via scan.yml or the UI for even more tailored metrics.
On this page, you’ll see a list of connected sources and an Add Data Source button.
You need the "Manage data sources" global permission to add a new data source.
Learn more about Global and Dataset Roles
Click Add Data Source and select your data source from the list of supported data source types.
After selecting a source, you’ll be presented with a configuration form.
Enter a friendly, unique label. A unique name will be automatically generated from this label. This becomes the immutable ID of the data source and can also be used to reference the same connection in Soda Core.
You’ll be asked to select an agent. This is the component that connects to your data source and runs scans.
You can choose from:
Soda-hosted agent – Quickest option, fully managed by Soda (recommended for getting started)
Self-hosted agent – For custom or secure deployments where you manage the agent yourself
Learn more about deployment options: Deployment options
You’ll need to fill in the connection details. Soda uses the official Python packages for each supported data source, which means you can define any properties required by those libraries, flexibly and reliably.
This includes common fields like host, port, database name, username, and more, depending on the data source.
For sensitive values such as passwords, tokens, or keys, you should use Soda Secrets instead of entering them directly in the configuration.
Secrets are encrypted and securely stored in Soda Cloud.
They can be safely referenced in your data source configuration without exposing them in plain text.
To add secrets:
Navigate to the Data Sources tab in the top navigation.
Click the Secrets tab.
Define key-value pairs for your sensitive credentials.
You can then reference a secret in your data source configuration using this syntax:
This ensures your sensitive values stay secure while still being accessible to the agent at runtime.
Once the form is complete:
Click Test Connection to validate that Soda can successfully connect to your data source.
If the test passes, click Connect to finalize the setup.
After connecting, Soda will perform an automated dataset discovery. Soda triggers a scan that analyzes the datasets and retrieves their metadata, including columns and column data types. This reduces manual setup efforts, ensures data coverage in your environment and keeps Soda's dataset inventory aligned with your data sources. This feature allows other Soda features to work seamlessly:
Contract generation
Automated discovery of time partition column
Automated discovery of Primary Keys for Diagnostics Warehouse
Dataset selection can be manual or rules-based.
Manual selection allows you to browse a directory view of all the datasets in your data source.
The Scope can range from the entire data source to a specific schema. Any element selected on the left panel becomes the scope of the dataset search.
The manual selection is made for scale; it can easily handle thousands of schemas and hundreds of thousands of datasets.
Datasets that have already been onboarded will not be visible in the manual dataset selection.
Rules-based selection allows you to automate the dataset onboarding process, only selecting those which match specified rules.
You can add rules to include or exclude datasets that match certain conditions, such as "name contains" or "name starts with", or provide your own regex pattern.
In the example below, only datasets whose name does not start with "dwh" from the public schema will be onboarded.
Once you click on Validate rule, Soda will calculate how many datasets currently match the defined conditions:
Click on Next to finish the process.
Once the onboarding process is finished (after Step 3: Enable Metric Monitoring & Profiling (optional)), an overview of the Onboarding Rules will be provided. From this view, rules can be edited or deleted.
Rules will be executed in order of appearance on this view.
The order of the rules can be changed. As soon as a dataset matches a rule, it will be onboarded automatically; datasets can only be onboarded once.
Once onboarded, datasets will appear in your Soda Cloud UI and become available for contract creation or metric monitoring.
Through Metric Monitoring, you can enable built-in monitors to automatically track row counts, schema changes, freshness, and more across your datasets. This step is optional but recommended. This can be enabled in bulk when onboarding data sources and datasets.
Learn more about Metric Monitoring: Metric Monitoring dashboard
Toggle on Metric Monitoring
When metric monitoring is enabled, you can later add column monitors at the dataset level or override any of the settings.
Set a Monitoring Schedule
The monitoring schedule defines when Soda scans a dataset to capture and evaluate metrics. While scans may run slightly later due to system delays, Soda uses the actual execution time, not the scheduled time, when visualizing time-sensitive metadata metrics like insert lag or row count deltas. This ensures accuracy.
Data-based metrics like averages or null rates are not affected by small delays, as Soda only scans complete partitions, keeping these metrics stable and reliable.
Scans can be scheduled to occur from hourly to weekly, depending on your needs.
Learn more about how to pick a scan time.
Toggle on/off Historical Metric Collection
When Historical Metric Collection is enabled, Soda automatically calculates past data quality metrics through backfilling and applies the anomaly detection algorithm to that historical data through backtesting. This gives you immediate visibility into past data quality issues, even before monitoring was activated. The historical data also helps train the anomaly detection algorithm, improving its accuracy from day one. You can specify a start date to control how far back the backfilling process should begin.
Suggest a Time Partition Column
Metrics that are not based on metadata require a time partition column to group data into daily intervals or 24-hour buckets, depending on the monitoring schedule. This column must be a timestamp field, ideally something like a created_at or last_updated column. It's important that this timestamp reflects when the data arrives in the database, rather than when the record was originally created.
Soda uses a list of suggested time partition columns to determine which column to apply. If multiple columns are suggested, Soda checks them in the order they are listed, starting with the first. It will try to match one by validating that the column is a proper timestamp and suitable for partitioning.
If none of the suggested columns match, Soda falls back to a heuristic approach. This heuristic looks at metadata, typical naming conventions, and column content to infer the most likely time partition column.
If the heuristic fails to find a suitable column or selects the wrong one, the time partition column can be manually configured after onboarding under dataset settings.
Click on Next.
Enable Profiling (optional)
Learn more about Profiling.
From this view, you can also enable Failed row collection if Diagnostics Warehouse is enabled for this data source.
Click on Finish. If you used Rules-based selection to onboard datasets, an Active Onboarding Rule Pipeline view will appear now to confirm the conditions.
Once onboarding is completed, your data source will appear in the Data Sources list. You can click the Onboarded Datasets link to access the connected datasets.
🎉 Congrats! You’ve successfully onboarded your data source. You’re now ready to create data contracts and start monitoring the quality of your data.
Note that you can repeat the dataset onboarding process at any time to add more datasets from the same data source. Datasets that have previously been onboarded will not reappear in the data selection step. Simply return to the data source page and click Onboard Datasets to update your selection.
You need the Manage data sources global permission to add a new data source. Learn about Global and Dataset Roles
Learn more about Metric Monitors that run scans at a dataset level.
A dataset monitor in Soda tracks a specific high-level metric for an entire table (or partition) over time. It helps detect unusual patterns or unexpected changes in overall data health, such as sudden spikes or drops in row count, delays in fresh data, or schema drift.
You can find dataset monitors by opening the Metric Monitors tab on any dataset and looking at the top section labeled “Dataset Monitors.” This section lists all active dataset monitors, both metadata-based and partition-based, in a clear overview of monitor cards. At a glance, this overview shows critical information about each monitor: its status, the value from the last scan, and any detected anomalies, giving you a one-look summary of the health of your data systems.
Unlike column monitors, which are configured at the dataset level but target individual columns, dataset monitors apply to the entire table (or its latest partition) and capture broad indicators of data quality. When the necessary data and metadata are available, dataset-level monitors work out of the box with no further configuration needed.
Soda Cloud uses Global Roles and Dataset Roles to manage access and permissions. These roles ensure users and user groups have the right level of access based on their responsibilities.
Global roles define permissions across the entire organization in Soda Cloud.
By default, Soda Cloud provides two Global Roles: Admin and User. You can create custom roles with a subset of the permissions.
When you deploy a self-hosted Soda Agent to a Kubernetes cluster in your cloud service provider environment, you need to provide several key parameters and values to ensure optimal operation and to allow the agent to connect to your Soda Cloud account (API keys), and connect to your data sources (data source login credentials) so that Soda can run data quality scans on the data.
By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
As these values are sensitive, you may wish to employ the following alternative strategies to keep them secure.
soda request fetch -r 7 -p 1 -sc soda-cloud.yaml --f ./contracts/ecommerce_orders.yaml
dataset: databricks_demo/unity_catalog/demo_sales_operations/regional_sales
filter: |
order_date >= ${var.start_timestamp}
AND order_date < ${var.end_timestamp}
variables:
start_timestamp:
default: DATE_TRUNC('week', CAST('${soda.NOW}' AS TIMESTAMP))
end_timestamp:
default: DATE_TRUNC('week', CAST('${soda.NOW}' AS TIMESTAMP)) + INTERVAL '7 days'
checks:
- row_count:
- schema:
columns:
- name: order_id
data_type: INTEGER
checks:
- missing:
name: Must not have null values
- name: customer_id
data_type: INTEGER
checks:
- missing:
name: Must not have null values
- name: order_date
data_type: DATE
checks:
- missing:
name: Must not have null values
- failed_rows:
name: Cannot be in the future
expression: order_date > DATE_TRUNC('day', CAST('${soda.NOW}' AS TIMESTAMP)) +
INTERVAL '1 day'
threshold:
must_be: 0
- name: region
data_type: VARCHAR
checks:
- invalid:
valid_values:
- North
- South
- East
- West
name: Valid values
- name: product_category
data_type: VARCHAR
- name: quantity
data_type: INTEGER
checks:
- missing:
name: Must not have null values
- invalid:
valid_min: 0
name: Must be higher than 0
- name: price
data_type: NUMERIC
checks:
- invalid:
valid_min: 0
name: Must be higher than 0
- missing:
name: Must not have null values
- name: payment_method
data_type: VARCHAR
checks:
- missing:
name: Must not have null values
- invalid:
threshold:
metric: count
must_be: 0
filter: region <> 'north'
valid_values:
- PayPal
- Bank Transfer
- Cash
- Credit Card
name: Valid values in all regions except North
- invalid:
name: Valid values in North
filter: region = 'north'
valid_values:
- PayPal
- Bank Transfer
- Credit Card
qualifier: ABC124
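To illustrate the default filter window defined by the start_timestamp and end_timestamp variables above, here is a plain-Python sketch (not Soda code), assuming PostgreSQL-style DATE_TRUNC('week', …) semantics where weeks start on Monday:

```python
from datetime import datetime, timedelta

def week_window(now: datetime):
    """Return the [start, end) weekly window containing `now`, mirroring
    DATE_TRUNC('week', NOW) and DATE_TRUNC('week', NOW) + INTERVAL '7 days'
    with weeks starting on Monday."""
    # Truncate to the Monday of the current week, at midnight.
    start = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return start, start + timedelta(days=7)

# A scan on Wednesday 2024-05-15 filters rows from Monday the 13th (inclusive)
# up to Monday the 20th (exclusive):
start, end = week_window(datetime(2024, 5, 15, 10, 30))
print(start.date(), end.date())  # 2024-05-13 2024-05-20
```

Every scan within the same week therefore evaluates the same partition of order_date values, which keeps the weekly checks stable.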
pip install -i https://pypi.dev.sodadata.io/simple -U soda-core
soda_cloud:
host: cloud.soda.io ## Or cloud.us.soda.io
api_key_id: YOUR_API_KEY_ID ## Replace with your actual key ID
api_key_secret: YOUR_API_KEY_SECRET ## Replace with your actual key secret
from soda_core import configure_logging
from soda_core.contracts import verify_contracts_on_agent
configure_logging(verbose=False)
res = verify_contracts_on_agent(
dataset_identifiers=["databricks_demo/unity_catalog/demo_sales_operations/regional_sales"],
soda_cloud_file_path="soda-cloud.yml",
)
print(res.get_logs())
${secret.SECRET_NAME}
Soda supports two categories of dataset-level monitors: those that rely purely on system metadata, and those that compute values by querying a designated time-partition column. Below is an in-depth description of each built-in monitor. For a more detailed discussion of monitors based on querying the metadata vs monitors based on querying the data, see the Metadata vs Data-Based section on this page.
Based on metadata
Total row count
The total number of rows in the dataset at scan time.
Total row count change
Change in total row count compared to the previous scan.
Last modification time
The most recent time the data was changed, relative to the last scan.
Schema changes
Columns added, removed, or changed in type since the previous scan.
Monitors based on time partition columns look at data in the most recent partition based on a timestamp. If data is altered in an old partition, it will not be evaluated.
For example, data inserted today with a timestamp from 2 days ago will not be evaluated if the partition interval is 1 day.
The dashboard provides a health table with an overview of the monitors. Each monitor card is clickable and links to the metric's Anomaly History page.
Each monitor card will have the following information:
Monitor name: the given name of the specific monitor.
Monitor explanation: a brief description of the metric used.
Status: ✅ healthy / ⚠️ violated
Today's value at scan time: last recorded value.
Expected range: calculated by the anomaly detection algorithm, based on historical data.
Trend line with last 7 observations: a sparkline that shows an overview of the monitor plot.
Bell icon: to enable/disable opt-in alerts.
Dataset-level monitors fall into two categories depending on their source of truth:
Metadata-based monitors rely solely on system metadata exposed by your data warehouse; fields like “row count,” “last modified time,” or “schema version” that the catalog provides without scanning table rows. Because they don’t touch actual data, metadata monitors are extremely efficient and run quickly. They alert you if your table grows, shrinks, stops updating, or changes structure.
Data-based monitors look directly at the contents of a designated time-partition column (e.g., a date or timestamp field) and compute a value from the rows in that partition. Examples include “Partition Row Count” (how many rows landed in today’s partition) or “Most Recent Timestamp” (the newest timestamp in that partition). Data-based monitors require a full scan of each partition they monitor, but they capture freshness and volume signals that metadata alone cannot provide. If your dataset has no time-partition column defined (or your warehouse can’t surface the needed metadata), Soda will disable the appropriate monitors so you only see the metrics that can be collected.
Use the Configure Dataset Monitors panel to pick which built-in metadata and partition-based metrics you want Soda to track at the dataset level.
Open the panel → From any dataset’s Metric Monitors dashboard, click Configure Dataset Monitors.
Enable or disable → Toggle metrics on/off directly from here. If the data source doesn't support a given metric, it is automatically disabled.
Modify the monitor
Auto-apply → Changes take effect immediately for the next scan. Simply close the panel when you’re done.
Many data‐based monitors—such as Partition Row Count and Most Recent Timestamp—rely on a designated “time partition” column to know which slice of data to scan. The time partition column should be a date or timestamp field that naturally groups rows into discrete, regularly updated partitions (for example, a daily order_date or event_time). When Soda cannot detect a time partition column, metrics based on that data will not be available.
A good time partition column meets all of the following criteria:
Date or timestamp type: Each row contains a valid date (or timestamp) value.
Regular arrival cadence: New rows for each date/timestamp appear on a predictable schedule (e.g., daily, hourly).
Reflects ingestion/arrival time: The column's value must correspond to when the record actually landed in this dataset, not when it was originally created upstream, so freshness checks remain accurate.
Logical partition boundary: It matches how you want to slice your data (e.g., order_date for daily sales, event_time for hourly logs).
When these conditions hold, partition-based monitors will reliably focus on the correct slice of data—namely, the rows that truly arrived during each time window—so any delays or backfills become immediately visible.
When you onboard a new dataset from your data source, Soda attempts to automatically detect the most likely time partition column. You can:
Finish onboarding without editing the Time Partition Column field, allowing Soda to detect it.
Suggest a Time Partition Column of your choice, forcing Soda to use that one for monitoring.
If you ever need to confirm or search for the right partition column:
Navigate to the Datasets page, select your dataset, and click the Columns tab.
Search for columns with "timestamp" in the name. Any column with a date or timestamp data type is a candidate.
After onboarding, you can override the time partition column at any time. Changing it will reset Soda’s anomaly detection model for partition‐based metrics, so you’ll be retraining on historical data under the new partition definition. To override:
Access the Dataset Settings
Navigate to the Datasets tab
From this list or from the dataset page itself, click on the (⋮) menu > Edit Dataset
Find Time Partition Columns
Click on the Profiling & Metric Monitoring tab
Here you will see the current column being used for Time Partition.
Reveal the Time Partition Column drop-down menu
This will show all date and timestamp columns that can be used as a Time Partition Column.
Select your new Time Partition Column
Changing this column resets the model and historical baselines.
Click Save. Soda will:
Reset the partition‐based monitors (Partition Row Count, Most Recent Timestamp) to “training mode” and retrain baselines on the new partition.
Preserve any metadata‐based monitors (Total Row Count, Schema Changes) unchanged.
By following these steps, you ensure that Soda’s data‐based monitors always reference the correct daily (or hourly) slice of your dataset, so partition‐level metrics and freshness checks produce accurate results.
When Soda Cloud cannot obtain the underlying metadata required to calculate a dataset-level metric, it prevents you from configuring or viewing a metric that would always fail. There are two cases:
If a connected data source cannot provide the required metadata for a given dataset-level metric, such as row counts or schema timestamps, Soda will automatically disable that metric both on the Metric Monitors dashboard and in the Configure Dataset Monitors panel so you only see and configure metrics that your source can actually collect.
Some warehouses expose current metadata but don’t provide historical snapshots (for example, systems that only track the latest row count). In this case, Soda will compute the metric starting from your very first scan, but it cannot backfill any history prior to that point. As a result, anomaly detection baselines for that metric begin only at scan #1 and there is no retroactive historical data to train against.
Unlike other metrics, the Schema changes monitor does not backfill historical metadata. It only starts recording from the moment the dataset is onboarded.
Even when a metric is enabled and historical baselines exist, you may occasionally see gaps due to delayed or skipped scans. A “missing” metric indicates that Soda attempted to run the scan but did not receive a valid result for that metric, either because the scan agent was down, the query timed out, or metadata couldn’t be retrieved in time. Missing values do not count as anomalies; they simply mark a gap in the time series.
In Soda Cloud, you can identify these gaps as follows:
On the Metric Monitors dashboard, any missing value is shown either as a grey point or an empty checkbox in the metric sparkline:
In the detailed anomaly plot, missing points render as open circles (◯) along the timeline, and the trend line becomes dashed.
In Schema changes, no plot is available since the expected value is always 0. Hovering over an empty checkbox will display “No measurement” in the tooltip, making it easy to distinguish a gap from a healthy measurement or a flagged anomaly.
These visual cues let you immediately recognize when a scan didn't complete successfully, so you can investigate and restore full observability before critical issues go unnoticed.
Databricks
June 6th
✅
✅
✅
Snowflake
June 6th
✅
September 1st
✅
You have an Azure account and the necessary permissions to enable you to create, or gain access to an existing AKS cluster in your region. Consult the Azure access control documentation for details.
You have installed the Azure CLI tool. This is the command-line tool you need to access your Azure account from the command-line. Run az --version to check the version of an existing install. Consult the Azure Command-Line Interface documentation for details.
You have logged in to your Azure account. Run az login to open a browser and log in to your account.
You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have already installed the Azure CLI tool, you can install kubectl using the following command: az aks install-cli.
Run kubectl version --output=yaml to check the version of an existing install.
You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.
You have whitelisted these URLs, depending on whether you are using Soda EU cloud.soda.io or Soda US cloud.us.soda.io:
Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.
Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware that allocating too many resources may be costly relative to the small benefit of improved scan times.
To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.
For reference, a Soda-hosted agent specifies resources as follows:
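The resource snippets themselves are not reproduced on this page. As an illustrative sketch only, assuming standard Kubernetes resource requests/limits and the parameter names mentioned above (resources for the agent orchestrator, soda.scanlauncher.resources for the scan launcher), a values.yml fragment could look like the following; confirm the exact keys against the Soda Agent Helm chart:

```yaml
soda:
  agent:
    resources:
      requests:
        cpu: "x"        # e.g. "500m"
        memory: "x"     # e.g. "512Mi"
      limits:
        cpu: "x"
        memory: "x"
  scanlauncher:
    resources:
      requests:
        cpu: "x"
        memory: "x"
      limits:
        cpu: "x"
        memory: "x"
```

Replace each x with values appropriate for your workload, as described in the Kubernetes resource management documentation.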
The following table outlines the ways you can install the Helm chart to deploy a Soda Agent in your cluster.
Install the Helm chart via CLI by providing values directly in the install command.
Use this as a straightforward way of deploying an agent on a cluster.
Install the Helm chart via CLI by providing values in a values YAML file.
Use this as a way of deploying an agent on a cluster while keeping sensitive values secure: provide sensitive API key values in this local file or in an external secrets manager, and store data source login credentials as environment variables in this local file. Soda needs access to these credentials to connect to your data source and run scans of your data. See:
(Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.
Use Helm to add the Soda Agent Helm chart repository.
Use the following command to install the Helm chart which deploys a Soda Agent in your cluster. (Learn more about the helm install command.)
Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of soda.agent.name with a custom name for your agent, if you wish.
Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
The command-line produces output like the following message:
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
(Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.
Use Helm to add the Soda Agent Helm chart repository.
Using a code editor, create a new YAML file called values.yml.
To that file, copy+paste the content below, replacing the following values:
id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of name with a custom name for your agent, if you wish.
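The referenced file content is not shown on this page. Based on the parameter names used elsewhere in this guide (soda.apikey.id, soda.apikey.secret, soda.agent.name, soda.cloud.endpoint), a values.yml sketch might look like the following; treat the exact structure as an assumption and verify it against the chart's documented values:

```yaml
soda:
  apikey:
    id: "***"       # value from the New Soda Agent dialog box
    secret: "***"   # value from the New Soda Agent dialog box
  agent:
    name: "my-soda-agent"   # any name unique in your Soda Cloud account
  cloud:
    endpoint: "https://cloud.soda.io"  # or https://cloud.us.soda.io for the US region
```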
Save the file. Then, create a namespace for the agent.
In the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
helm install
the action helm is to take
soda-agent (the first one)
a release named soda-agent on your cluster
soda-agent (the second one)
the name of the helm repo you installed
soda-agent (the third one)
the name of the helm chart that is the Soda Agent
The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with --set flags as this command does, or you can specify the override values using a values.yml file.
--set soda.agent.name
A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.
--set soda.apikey.id
With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.apikey.secret
With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.agent.logFormat
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
--set soda.agent.loglevel
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
--namespace soda-agent
Use the namespace value to identify the namespace in which to deploy the agent.
Delete everything in the namespace which you created for the Soda Agent.
Delete the cluster. Be patient; this task may take some time to complete.
Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.
Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:
Use https://cloud.us.soda.io for the United States
Use https://cloud.soda.io for all else
Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.
Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:
cloud.us.soda.io for Soda Cloud account created in the US region
OR
cloud.soda.io for Soda Cloud account created in the EU region
AND
collect.soda.io
Problem: When you attempt to create a cluster, you get an error that reads, An RSA key file or key value must be supplied to SSH Key Value. You can use --generate-ssh-keys to let CLI generate one for you.
Solution: Run the same command to create a cluster but include an extra line at the end to generate RSA keys.
Manage data sources and agents
Allow users to deploy a new Soda Agent and to configure data source connections in Soda Cloud.
✓
Create new datasets and data sources with Soda Core library
Allow the creation of new data sources in Soda Cloud when using the Soda Core library.
Allow users to onboard datasets in Soda Cloud on data sources connected with a Soda Agent. See
✓
✓
Manage attributes
Allow users to define which dataset and check attributes are available to use in the organization.
✓
Manage notification rules
Allow users to manage how notifications are sent.
✓
✓
Manage organization settings
You can create custom global roles to match your organization’s needs.
To create a global role:
Go to the Global Roles section in Settings.
Click Add Global Role to create a new role.
Enter a name for the role.
Select the permissions the role should have.
Click Save.
You can edit global roles at any time to adjust permissions as your organization’s needs evolve.
To edit a global role:
Go to the Global Roles section in Settings.
Find the global role you want to modify.
Click the context menu next to the role and select Edit Global Role.
Adjust the role’s name and permissions as needed.
Click Save to apply your changes.
You can assign roles to individual users or user groups to grant them the associated permissions.
To assign a global role:
Go to the Global Roles section in Settings.
Find the global role you want to assign.
Click the context menu next to the role and select Assign Members.
Select the users or user groups that should have the global role.
Click Save to apply your changes.
You can also assign roles on the Users and User groups tabs:
For users: User management
For user groups: User management
Dataset roles define permissions for specific datasets.
By default, Soda Cloud provides three Dataset Roles: Manager, Editor, and User. You can create custom roles with a subset of the permissions.
View dataset
Access the dataset and view checks
✓
✓
✓
Access dataset profiling and samples
Allow users to see insights about the data
✓
✓
✓
You can create custom dataset roles to match your organization’s needs.
To create a dataset role:
Go to the Dataset Roles section in Settings.
Click Add Dataset Role to create a new role.
Enter a name for the role.
Select the permissions the role should have.
Click Save to apply your changes.
You can edit dataset roles at any time to adjust permissions as your organization’s needs evolve.
To edit a dataset role:
Go to the Dataset Roles section in Settings.
Find the dataset role you want to modify.
Click the context menu next to the role and select Edit Dataset Role.
Adjust the role’s name and permissions as needed.
Click Save to apply your changes.
Responsibilities in Soda Cloud define who has access to a dataset and what they are allowed to do. They are assigned by mapping users or user groups to a dataset role.
This ensures that the right people have the appropriate permissions for each dataset, such as the ability to manage checks, propose new rules, or view profiling information.
For example:
Assign a Manager role to a dataset owner who needs full control.
Assign a Viewer role to a business user who only needs to monitor data quality results.
By assigning responsibilities, you ensure clear access control, accountability, and governance across your datasets.
Learn about how to set up responsibilities on a dataset: Dataset Attributes & Responsibilities
Soda Cloud allows you to define default responsibilities for the dataset owner, which will automatically be granted for all dataset owners. This ensures that all users have a consistent baseline level of access unless you choose to customize it.
By default, all dataset owners have the "Manager" role.
How to Configure Default Responsibilities
Go to the Organization Settings page in Soda Cloud.
Locate the Datasets Roles section.
Select the dataset role to assign to the Dataset Owners.
Click Save at the top right of the page to apply your changes.
For everyone
Soda Cloud allows you to define default responsibilities for the Everyone group, which will automatically apply to all newly onboarded datasets. This ensures that all users have a consistent baseline level of access unless you choose to customize it.
By default:
The Everyone group is assigned as a "Viewer" for all new datasets.
This setting applies to all users in your organization unless disabled.
You can either customize the default role or disable the default responsibilities if you do not want the Everyone group to receive any automatic access to new datasets.
How to Configure Default Responsibilities
Go to the Organization Settings page in Soda Cloud.
Locate the Datasets Roles section.
Select the dataset role to assign to the Everyone group for new datasets.
To disable default responsibilities, toggle the feature off.
Click Save at the top right of the page to apply your changes.
When you deploy a self-hosted Soda Agent from the command-line, you provide values for the API key id and API key secret which the agent uses to connect to your Soda Cloud account. You can provide these values during agent deployment in one of two ways:
directly in the helm install command that deploys the agent and stores the values as Kubernetes secrets in your cluster; see deploy using CLI only
OR
in a values.yml file which you store locally but reference in the helm install command as in the example below.
Refer to the exhaustive cloud service provider-specific instructions for more detail on how to deploy an agent using a values YAML file.
If you use a private key with Snowflake or BigQuery, you can provide the required private key values in a values.yml file when you deploy or redeploy the agent.
When you, or someone in your organization, follows the guided steps to use a self-hosted Soda Agent to add a data source in Soda Cloud, one of the steps involves providing the connection details and credentials Soda needs to connect to the data source to run scans.
You can add those details directly in Soda Cloud, but because any user can then access these values, you may wish to store them securely in the values YAML file as environment variables.
Create or edit your local values YAML file to include the values for the environment variables you input into the connection configuration.
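As a sketch, the PostgreSQL credentials used later in this section could be stored as follows; the key under which the agent chart expects environment variables (soda.env here) is an assumption, so check the chart's values reference:

```yaml
soda:
  env:
    POSTGRES_USERNAME: "sodauser"   # referenced in the data source configuration
    POSTGRES_PASSWORD: "***"        # keep this file out of version control
```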
After adding the environment variables to the values YAML file, update the Soda Agent using the following command:
In step 2 of the add a data source guided steps, add the data source connection configuration, which looks something like the following example for a PostgreSQL data source. Note the environment variable values for username and password.
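The example itself does not survive on this page. A sketch of such a connection configuration, with field names assumed from typical PostgreSQL settings, could look like:

```yaml
type: postgres
host: your-postgres-host.example.com
port: 5432
username: ${POSTGRES_USERNAME}   # resolved from the environment variable in the values YAML
password: ${POSTGRES_PASSWORD}   # resolved from the environment variable in the values YAML
database: your_database
```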
Follow the remaining guided steps to add a new data source in Soda Cloud. When you save the data source and test the connection, Soda Cloud uses the values you stored as environment variables in the values YAML file you supplied during redeployment.
Use External Secrets Operator (ESO) to integrate your self-hosted Soda Agent with your secrets manager, such as a Hashicorp Vault, AWS Secrets Manager, or Azure Key Vault, and securely reconcile the login credentials that Soda Agent uses for your data sources.
Say you use a Hashicorp Vault to store data source login credentials and your security protocol demands frequent rotation of passwords. In this situation, the challenge is that apps running in your Kubernetes cluster, such as a Soda Agent, need access to the up-to-date passwords.
To address the challenge, you can set up and configure ESO in your Kubernetes cluster to regularly reconcile externally-stored password values so that your apps always have the credentials they need. Doing so obviates the need to manually redeploy a values YAML file with new passwords for apps running in the cluster each time your system refreshes the passwords.
The current integration of Soda Agent and a secrets manager does not yet support the configuration of the Soda Cloud credentials. For those credentials, use a tool such as helm-secrets or vals.
To integrate Soda Agent with a secret manager, you need the following:
External Secrets Operator (ESO) which is a Kubernetes operator that facilitates a connection between the Soda Agent and your secrets manager
a ClusterSecretStore resource which provides a central gateway with instructions on how to access your secret backend
an ExternalSecret resource which instructs the cluster on what values to fetch, and references the ClusterSecretStore
Read more about the ESO’s Resource Model.
The following procedure outlines how to use ESO to integrate with a Hashicorp Vault that uses a KV Secrets Engine v2. Extrapolate from this procedure to integrate with another secrets manager such as:
You have set up a Kubernetes cluster in your cloud services environment and deployed a self-hosted Soda Agent in the cluster.
For the purpose of this example procedure, you have set up and are using a HashiCorp Vault that contains a key-value pair for POSTGRES_USERNAME and POSTGRES_PASSWORD at the path local/soda.
Consider referencing the use case guide for integrating an External Secrets Manager with a Soda Agent which offers step-by-step instructions to set everything up locally to see the integration in action.
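For reference, a key-value pair like the one this example assumes could be created with the Vault CLI as follows; the sodalibrary values are the illustrative credentials used elsewhere on this page, and the local/ mount is an assumption of this example:

```shell
# Requires a running Vault with a KV v2 secrets engine mounted at "local";
# adjust the mount and path to your own setup.
vault kv put local/soda \
  POSTGRES_USERNAME=sodalibrary \
  POSTGRES_PASSWORD=sodalibrary
```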
Use helm to install the External Secrets Operator from the Helm chart repository into the same Kubernetes cluster in which you deployed your Soda Agent.
Verify the installation using the following command:
Create a cluster-secret-store.yml file for the ClusterSecretStore configuration. The details in this file instruct the Soda Agent how to access the external secrets manager vault.
This example uses HashiCorp Vault AppRole authentication. AppRole authenticates with Vault using the AppRole auth mechanism to access the contents of the secret store. It uses the SecretID in the Kubernetes secret, referenced by secretRef, together with the roleID, to acquire a temporary access token so that it can fetch secrets.
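The SecretID itself lives in an ordinary Kubernetes secret that the ClusterSecretStore's secretRef points to. Assuming the names used in the ClusterSecretStore example on this page, it could be created like this (the SecretID value is a placeholder):

```shell
# Secret name, key, and namespace must match the secretRef in the
# ClusterSecretStore configuration.
kubectl create secret generic external-secrets-vault-app-role-secret-id \
  --namespace external-secrets \
  --from-literal=appRoleSecretId='<your-approle-secret-id>'
```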
Access external-secrets.io documentation for configuration examples for:
Deploy the ClusterSecretStore to your cluster.
Create a soda-secret.yml file for the ExternalSecret configuration. The details in this file instruct the Soda Agent which values to fetch from the external secrets manager vault.
This example identifies:
the namespace of the Soda Agent
two remoteRef entries that fetch the POSTGRES_USERNAME and POSTGRES_PASSWORD values from the vault
Deploy the ExternalSecret to your cluster.
Use the following command to get the ExternalSecret to authenticate to the HashiCorp Vault using the ClusterSecretStore and fetch secrets.
Output:
Prepare a values.yml file to deploy the Soda Agent with the existingSecrets parameter that instructs it to access the ExternalSecret file to fetch data source login credentials. Refer to the complete deployment instructions, or adapt your existing values file if you already have an agent running in a cluster.
Deploy the Soda Agent using the following command:
Output:
By default, the Soda Agent creates a secret for storing the Soda Cloud API Key details securely in your cluster. If you want to use a different secret, you can point the Soda Agent to an existing Kubernetes Secret in your cluster using the soda.apikey.existingSecret property.
To use an existing Kubernetes secret for Soda Agent’s Cloud API credentials, add existingSecret and the secretKeys values to your agent’s values YAML file, as in the following example.
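As a sketch, a matching secret could be created as follows; the secret name and key names here are hypothetical and must agree with the existingSecret, idKey, and secretKey values in your values YAML file:

```shell
# Hypothetical names: soda-api-credentials, api-key-id, api-key-secret.
# Substitute your actual Soda Cloud API key pair for the *** placeholders.
kubectl create secret generic soda-api-credentials \
  --namespace soda-agent \
  --from-literal=api-key-id='***' \
  --from-literal=api-key-secret='***'
```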
The default Soda Agent settings balance performance and cost-efficiency. You can adjust these settings to better suit your needs, optimizing for larger datasets, faster scans, or improved resource management.
The example below demonstrates how you can increase the memory limit using settings in your values.yml file:
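A sketch of such a fragment, using the scan launcher limits that appear elsewhere on this page; the figures are illustrative, not recommendations:

```yaml
soda:
  scanlauncher:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
```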
Organizations that use a Security Assertion Markup Language (SAML) 2.0 single sign-on (SSO) identity provider (IdP) can add Soda Cloud as a service provider.
Once added, employees of the organization can gain authorized and authenticated access to the organization’s Soda Cloud account by successfully logging in to their SSO. This solution not only simplifies a secure login experience for users, it enables IT Admins to:
grant their internal users access to Soda Cloud from within their existing SSO solution
revoke their internal users’ access to Soda Cloud from within their existing SSO solution if a user leaves their organization or no longer requires access to Soda Cloud
set up one-way user group syncing from their IdP into Soda Cloud (tested and documented for Azure Active Directory and Okta)
Soda Cloud can act as a service provider for any SAML 2.0 SSO identity provider. In particular, Soda has tested and written setup instructions for the identity providers covered below.
Soda has also tested and confirmed that SSO setup works with the following identity providers:
OneLogin
Auth0
Patronus
When an employee uses their SSO provider to access Soda Cloud for the first time, Soda Cloud automatically assigns the new user to roles and groups according to the default settings for any new users. Soda Cloud also notifies the Soda Cloud Admin that a new user has joined the organization, and the new user receives a message indicating that their Soda Cloud Admin was notified of their first login. A Soda Cloud Admin, or a user with the permission to do so, can adjust users’ roles in Organization Settings; see the roles and permissions documentation for details.
When an organization’s IT Admin revokes a user’s access to Soda Cloud through the SSO provider, a Soda Cloud Admin is responsible for updating the resources and ownerships linked to the User.
Once your organization enables SSO for all Soda Cloud users, Soda Cloud blocks all non-SSO login attempts and password changes. If an employee attempts a non-SSO login or attempts to change a password using “Forgot password?”, Soda Cloud presents a message that explains that they must log in or change their password using their SSO provider.
Optionally, you can set up the SSO integration with Soda to include a one-way sync of user groups from your IdP into Soda Cloud, which synchronizes on each user login to Soda via SSO.
Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations. Be sure to indicate which type of SSO your organization uses when setting it up with the Soda Support team.
Email Soda Support to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab.
Soda Support sends you the samlUrl that you need to configure the setup with your identity provider.
As a user with sufficient privileges in your organization’s Azure AD account, sign in to the Azure portal, then navigate to Enterprise applications. Click New application.
Click Create your own application.
Email Soda Support to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab.
Soda Support sends you the samlUrl that you need to configure the setup with your identity provider.
As an Okta Administrator, log in to Okta and navigate to Applications > Applications overview, then click Create App Integration. Refer to Okta documentation for the full procedure.
Select SAML 2.0.
Email Soda Support to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab.
Soda Support sends you the samlUrl that you need to configure the setup with your identity provider.
As an administrator in your Google Workspace, follow the instructions in Google documentation to Set up your own custom SAML application.
Optionally, upload the Soda logo so the app appears in the app launcher with the logo instead of the first two letters of the app name.
If you wish, you can choose to regularly one-way sync the user groups you have defined in your IdP into Soda Cloud.
Doing so obviates the need to manually create user groups in Soda Cloud that you have already defined in your IdP, and enables your team to select IdP-managed user groups when assigning ownership or access permissions to a resource, in addition to any user groups you may have created manually in Soda Cloud.
Soda has tested and documented one-way syncing of user groups with Soda Cloud for Okta and Azure Active Directory. Contact Soda Support to request tested and documented support for other IdPs.
Soda synchronizes user groups with the IdP every time a user in your organization logs in to Soda via SSO. Soda updates the user’s group membership according to the IdP user groups to which they belong at each log in.
You cannot manage IdP user group settings or membership in Soda Cloud. Any changes that you wish to make to IdP-managed user groups must be done in the IdP itself.
In step 10 of the SAML application setup procedure above, in the same User Attributes & Claims section of your Soda SAML Application in Azure AD, follow Microsoft’s instructions to add a group claim to your Soda SAML Application.
For the choice of which groups should be returned in the claim, best practice suggests selecting Groups assigned to the application.
For the choice of Source attribute, select Cloud-only group display names.
In step 7 of the SAML application integration procedure above, follow Okta’s instructions to add a group attribute statement.
For the Name value, use Group.Authorization.
Leave the optional Name Format value as Unspecified.
Use the Filter to find a group that you wish to make available in Soda Cloud to manage access and permissions. Exercise caution! A broad filter may include user groups you do not wish to include in the sync. Double-check that the groups you select are appropriate.
To renew an SSO certificate, you need to provide Soda with the new X.509 certificate, with which Soda will update your Soda organization's SSO configuration. Since Soda can only validate SSO against one certificate, there will be downtime between you deactivating the old certificate, and Soda updating the SSO configuration.
Depending on your organization's certificate renewal process, you can notify Soda (or arrange a call) in advance of the specific date and time you plan to renew, so that Soda is prepared for your update and the downtime is minimized.
```shell
helm repo add soda-agent [REPOSITORY_URL_PROVIDED]
```

```shell
kubectl delete ns soda-agent
```

```shell
az aks delete --resource-group SodaAgent --name soda-agent-cli-test --yes
```

```yaml
soda:
  agent:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x
  scanlauncher:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x
```

```yaml
soda:
  agent:
    resources:
      limits:
        cpu: 250m
        memory: 375Mi
      requests:
        cpu: 250m
        memory: 375Mi
```

```shell
kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
```

```shell
helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --namespace soda-agent
```

```shell
az aks create \
  --resource-group SodaAgent \
  --name SodaAgentCluster \
  --node-count 1 \
  --generate-ssh-keys
```

```yaml
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
    env:
      POSTGRES_USER: "sodalibrary"
      POSTGRES_PASS: "sodalibrary"
```

```shell
helm upgrade soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent
```

```yaml
type: postgres
name: postgres
connection:
  host:
  port:
  database:
  user: ${env.POSTGRES_USER}
  password: ${env.POSTGRES_PASS}
```

```shell
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets \
  external-secrets/external-secrets \
  -n external-secrets \
  --create-namespace
```

```shell
kubectl -n external-secrets get all
```

```yaml
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
```

```shell
helm install soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent
```

```yaml
soda:
  apikey:
    existingSecret: "<existing-secret-name>"
    secretKeys:
      idKey: "<key-for-api-id>"
      secretKey: "<key-for-api-secret>"
```

```yaml
soda:
  scanlauncher:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
  contractlauncher:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
```
Schema changes: changes in the schema compared to the previous scan; any change is automatically flagged as an anomaly.
Partition row count (based on the time partition column): the number of rows in the last partition at scan time.
Most recent timestamp (based on the time partition column): time difference between scan time and the maximum timestamp in the partition column (at scan time).

| Data source | Available since | Schema changes | Partition row count | Most recent timestamp |
| PostgreSQL | June 6th | ✅ | — | ✅ |
| AWS Aurora | June 30th | ✅ | — | ✅ |
| MS SQL Server | June 30th | ✅ | — | ✅ |
| Oracle | June 30th | June 30th | — | ✅ |
| Redshift | September 1st | June 30th | June 30th | ✅ |
| BigQuery | September 1st | ✅ | June 30th | ✅ |
| MySQL | Upcoming | — | — | ✅ |
| Trino | Upcoming | Upcoming | — | ✅ |
| Athena | Upcoming | Upcoming | — | ✅ |
Organization-level permissions (✓): Manage organization settings; Deactivate users; Create, edit, or delete user groups; Create, edit, or delete dataset roles; Create, edit, or delete global roles; Assign global roles to users or user groups; Add, edit, or delete integrations; Access and download the audit trail.
Manage scan definitions: update scan definitions and run scan definitions manually. (✓)
Access failed row samples for checks: allow users to see samples of rows that are considered invalid. (✓ ✓ ✓)
Configure dataset: allow users to define dataset attributes and owner, change settings, and add/enable/configure metric monitors at a dataset level. (✓ ✓)
Manage dataset responsibilities: allow users to grant and remove permissions through responsibilities. (✓)
Manage data contracts: allow users to modify and verify the data contract. (✓ ✓)
Propose checks: allow users to propose changes in the data contract. (✓ ✓ ✓)
Manage incidents: allow users to edit and close incidents. (✓ ✓ ✓)
Delete dataset: allow users to remove a dataset and its checks. (✓)
a refreshInterval to indicate how often the ESO must reconcile the remoteRef values; this ought to correspond to the frequency with which your passwords are reset
the secretStoreRef to indicate the ClusterSecretStore through which to access the vault
a target template that creates a file called soda-agent.conf into which it adds the username and password values in the dotenv format that the Soda Agent expects.
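With the illustrative credentials used on this page, the rendered soda-agent.conf would contain plain dotenv lines, one KEY=value pair per line:

```
POSTGRES_USERNAME=sodalibrary
POSTGRES_PASSWORD=sodalibrary
```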
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-app-role
spec:
  provider:
    vault:
      auth:
        appRole:
          path: approle
          roleId: 3e****54-****-936e-****-5c5a19a5eeeb
          secretRef:
            key: appRoleSecretId
            name: external-secrets-vault-app-role-secret-id
            namespace: external-secrets
      path: kv
      server: http://vault.vault.svc.cluster.local:8200
      version: v2
```

```shell
kubectl apply -f cluster-secret-store.yaml
```

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: soda-agent
  namespace: soda-agent
spec:
  data:
    - remoteRef:
        key: local/soda
        property: POSTGRES_USERNAME
      secretKey: POSTGRES_USERNAME
    - remoteRef:
        key: local/soda
        property: POSTGRES_PASSWORD
      secretKey: POSTGRES_PASSWORD
  refreshInterval: 1m
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-app-role
  target:
    name: soda-agent-secrets
    template:
      data:
        soda-agent.conf: |
          POSTGRES_USERNAME={{ .POSTGRES_USERNAME }}
          POSTGRES_PASSWORD={{ .POSTGRES_PASSWORD }}
      engineVersion: v2
```

```shell
kubectl apply -n soda-agent -f soda-secret.yaml
```

```shell
kubectl get secret -n soda-agent soda-agent-secrets
```

Output:
```
NAME                 TYPE     DATA   AGE
soda-agent-secrets   Opaque   1      24h
```

```yaml
soda:
  apikey:
    id: "154k***889"
    secret: "9sfjf****ff4"
  agent:
    name: "my-soda-agent-external-secrets"
  scanLauncher:
    existingSecrets:
      # from spec.target.name in the ExternalSecret file
      - soda-agent-secrets
  contractLauncher:
    existingSecrets:
      # from spec.target.name in the ExternalSecret file
      - soda-agent-secrets
  cloud:
    # Use https://cloud.us.soda.io for US region
    # Use https://cloud.soda.io for EU region
    endpoint: "https://cloud.soda.io"
```

```shell
helm install soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent
```

Output:
```
NAME: soda-agent
LAST DEPLOYED: Tue Aug 29 13:08:51 2023
NAMESPACE: soda-agent
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Success, the Soda Agent is now running.
You can inspect the Orchestrators logs if you like, but if all was configured correctly, the Agent should show up in Soda Cloud.
Check the logs using:
kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent
```

endpoint: specify the endpoint according to your local region, https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
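Both options correspond to the soda.agent.logFormat and soda.agent.loglevel parameters described later on this page. As a sketch, they can be set at deploy time like this (agent name and namespace as used elsewhere on this page):

```shell
helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  --set soda.agent.logFormat=json \
  --set soda.agent.loglevel=DEBUG \
  --namespace soda-agent
```

Alternatively, set the same values in your values.yml file rather than via --set flags.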


In the right pane that appears, provide a name for your app, such as Soda Cloud, then select the (Non-gallery) option. Click Create.
After Azure AD creates your app, click Single sign-on in the left nav under the Manage heading, then select the SAML tile.
In the Basic SAML Configuration block that appears, click Edit.
In the Basic SAML Configuration panel, there are two fields to populate:
Identifier (Entity ID), which is the value of samlUrl from step 1.
Reply URL, which is the value of samlUrl from step 1.
Click Save, then close the confirmation message pop-up.
In the User Attributes & Claims panel, click Edit to add some attribute mappings.
Configure the claims as per the following example. Soda Cloud uses familyname and givenname, and maps emailaddress to user.userprincipalname.
(Optional) Follow the additional steps to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Azure AD.
Scroll down to collect the values of three fields that Soda needs to complete the Azure AD SSO integration:
Azure AD Identifier (Section 4 in Azure). This is the IdP entity ID, or Identity Provider Issuer, that Soda needs.
Login URL (Section 4 in Azure). This is the IdP SSO service URL, or Identity Provider Single Sign-On URL that Soda needs.
X.509 Certificate. Click the Download link next to Certificate (Base64).
Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion.
Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.
(Optional) Ask Soda to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Azure AD.
Test the integration by assigning the Soda application in Azure AD to a single user, then requesting that they log in.
After a successful single-user test of the sign in, assign access to the Soda Azure AD app to users and/or user groups in your organization.
Provide a name for the application, Soda Cloud, and upload the Soda logo.
Click Next. In the Configure SAML tab, there are two fields to populate:
Single sign on URL, which is the value of samlUrl from step 1.
Audience URI (SP Entity ID), which is also the value of samlUrl from step 1.
The values for these fields are unique to your organization and are provided to you by Soda; they follow this pattern: https://cloud.soda.io/sso/<your-organization-identifier>/saml.
Be sure to use an email address as the application username.
Scroll down to Attribute Statements to map the following values, then click Next to continue.
map User.GivenName to user.firstName
map User.FamilyName to user.lastName
map User.Email to user.email
(Optional) Follow the additional steps to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Okta.
Select the following options, then click Finish.
I’m an Okta customer adding an internal app.
This is an internal app that we have created.
In the Sign On pane of the application, scroll down to click View Setup Instructions.
Collect the values of three fields that Soda needs to complete the Okta SSO integration:
Identity Provider Single Sign-On URL
Identity Provider Issuer
X.509 Certificate
Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion.
Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.
(Optional) Ask Soda to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Okta.
Test the integration by assigning the Soda application in Okta to a single user, then requesting that they log in.
After a successful single-user test of the sign in, assign access to the Soda Okta app to users and/or user groups in your organization.
On the Google Identity Provider details page, be sure to copy or download the following values:
SSO URL
Entity ID
IDP metadata
Certificate
On the SAML Attribute mapping page, add two Google directory attributes and map as follows:
Last Name → User.FamilyName
First Name → User.GivenName
Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion. Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.
In the Google Workspace admin portal, use Google’s instructions to Turn on your SAML app and verify that SSO works with the new custom app for Soda.
After saving the group claim, navigate to Users and Groups in the left menu, and follow Microsoft’s instructions to Assign a user or group to an enterprise application. Add any existing groups to the Soda SAML Application that you wish to make available in Soda Cloud to manage access and permissions.
In your message to Soda Support or your Soda Customer Engineer, advise Soda that you wish to enable user group syncing. Soda adds a setting to your SSO configuration to enable it.
When the SSO integration is complete, you and your team can select your IdP user groups from the dropdown list of choices available when assigning ownership or permissions to resources.
Use the Add Another button to add as many groups as you wish to make available in Soda Cloud.
In your message to Soda Support or your Soda Customer Engineer, advise Soda that you wish to enable user group syncing. Soda adds a setting to your SSO configuration to enable it.
When the SSO integration is complete, you and your team can select your IdP user groups from the dropdown list of choices available when assigning ownership or permissions to resources.
You have an AWS account and the necessary permissions to enable you to create, or gain access to an EKS cluster in your region.
You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. Run kubectl version --output=yaml to check the version of an existing install.
You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.
You have whitelisted these URLs, depending on whether you are using Soda EU cloud.soda.io or Soda US cloud.us.soda.io:
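Per the firewall guidance later on this page, the agent needs outbound access on port 443 to the following hosts:

```
cloud.soda.io       # Soda Cloud, EU region
cloud.us.soda.io    # Soda Cloud, US region
collect.soda.io     # both regions
```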
Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.
Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider:
fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%.
adding more nodes to the node group; see AWS documentation for Scaling Managed Nodegroups.
adding a cluster autoscaler to your Kubernetes cluster; for AWS, see the EKS documentation on autoscaling.
Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.
To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.
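A filled-in version of that values.yml fragment might look like the following; the CPU and memory figures are placeholders to adapt, not recommendations:

```yaml
soda:
  agent:
    resources:
      limits:
        cpu: 500m
        memory: 1Gi
      requests:
        cpu: 250m
        memory: 512Mi
  scanlauncher:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
```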
For reference, a Soda-hosted agent specifies resources as follows:
The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.
Install the Helm chart via CLI by providing values directly in the install command.
Use this as a straightforward way of deploying an agent on a cluster.
Install the Helm chart via CLI by providing values in a values YAML file.
Use this as a way of deploying an agent on a cluster while keeping sensitive values secure: provide sensitive API key values in this local file, and store data source login credentials as environment variables in this local file or in an external secrets manager. Soda needs access to the credentials to be able to connect to your data source to run scans of your data.
(Optional) If you wish, you can establish an AWS PrivateLink to provide private connectivity with Soda Cloud. Refer to (Optional) Connect via AWS PrivateLink before deploying an agent.
(Optional) If you are deploying to an existing Virtual Private Cloud (VPC), consider supplying public or private subnets with your deployment. Consult the eksctl documentation to Use existing VPC.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent Helm chart. Best practice advises creating a managed node group into which you can deploy the agent.
Use Helm to add the Soda Agent Helm chart repository.
Use the following command to install the Helm chart, which deploys a Soda Agent in your cluster.
Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of soda.agent.name with a custom name for your agent, if you wish.
(Optional) Validate the Soda Agent deployment by running the following command:
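The validation referenced here is the standard kubectl pod inspection that the note below also mentions; as a sketch:

```shell
# Inspect the agent pods; look for State: Running and Ready: True
kubectl describe pods -n soda-agent
```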
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step 3 to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
(Optional) If you wish, you can establish an AWS PrivateLink to provide private connectivity with Soda Cloud. Refer to Connect via AWS PrivateLink before deploying an agent.
(Optional) If you are deploying to an existing Virtual Private Cloud (VPC), consider supplying public or private subnets with your deployment. Consult the eksctl documentation to Use existing VPC.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent Helm chart. Best practice advises creating a managed node group into which you can deploy the agent.
Use Helm to add the Soda Agent Helm chart repository.
Using a code editor, create a new YAML file called values.yml.
To that file, copy+paste the content below, replacing the following values:
id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of name with a custom name for your agent, if you wish.
Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step four to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
If you use AWS services for your infrastructure and you have deployed or will deploy a Soda Agent in an EKS cluster, you can use an AWS PrivateLink to provide private connectivity with Soda Cloud.
Log in to your AWS console and navigate to your VPC dashboard.
Follow the AWS documentation to Connect to an endpoint service as the service customer. For security reasons, Soda does not publish its Service name. Email [email protected] with your AWS account ID to request the PrivateLink service name. Refer to AWS documentation for instructions on how to obtain your account ID.
After creating the endpoint, return to the VPC dashboard. When the status of the endpoint becomes Available, the PrivateLink is ready to use. Be aware that this may take more than 10 minutes.
Deploy a Soda Agent to your AWS EKS cluster, or, if you have already deployed one, restart your Soda Agent to begin sending data to Soda Cloud via the PrivateLink.
After you have started the agent and validated that it is running, log into your Soda Cloud account, then navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
helm install: the action helm is to take
soda-agent (the first one): a release named soda-agent on your cluster
soda-agent (the second one): the name of the helm repo you installed
soda-agent (the third one): the name of the helm chart that is the Soda Agent
The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with --set flags as this command does, or you can specify the override values using a values.yml file.
--set soda.agent.name
A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.
--set soda.apikey.id
With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.apikey.secret
With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.agent.logFormat
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
--set soda.agent.loglevel
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
--namespace soda-agent
Use the namespace value to identify the namespace in which to deploy the agent.
Uninstall the Soda Agent in the cluster.
Delete the EKS cluster itself.
(Optional) Access your CloudFormation console, then click Stacks to view the status of your decommissioned cluster. If you do not see your Stack, use the region drop-down menu at upper-right to select the region in which you created the cluster.
Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.
Solution: The value you specify for the soda-cloud-endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:
Use https://cloud.us.soda.io for the United States
Use https://cloud.soda.io for all else
Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.
Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:
cloud.us.soda.io for Soda Cloud accounts created in the US region
OR
cloud.soda.io for Soda Cloud accounts created in the EU region
AND
collect.soda.io
Problem: UnauthorizedOperation: You are not authorized to perform this operation.
Solution: This error indicates that your user profile is not authorized to create the cluster. Contact your AWS Administrator to request the appropriate permissions.
You have a Google Cloud Platform (GCP) account and the necessary permissions to enable you to create, or gain access to an existing Google Kubernetes Engine (GKE) cluster in your region.
You have installed the gcloud CLI tool. Use the command gcloud version to verify the version of an existing install.
If you have already installed the gcloud CLI, use the following commands to log in and verify your configuration settings, respectively:
gcloud auth login
gcloud config list
If you are installing the gcloud CLI for the first time, be sure to complete all the installation steps to properly install and configure the tool.
Consider using the command gcloud cheat-sheet to learn a few basic gcloud commands.
You have installed v1.22 or v1.23 of . This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. With Docker running, use the command kubectl version --output=yaml to check the version of an existing install.
You have installed . This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.
You have whitelisted these URLs, depending on whether you are using Soda EU cloud.soda.io or Soda US cloud.us.soda.io:
Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.
Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.
To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.
For reference, a Soda-hosted agent specifies resources as follows:
The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.
Install the Helm chart via CLI by providing values directly in the install command.
Use this as a straightforward way of deploying an agent on a cluster in a secure or local environment.
Install the Helm chart via CLI by providing values in a values YAML file.
Use this as a way of deploying an agent on a cluster while keeping sensitive values secure:
- provide sensitive API key values in this local file
- store data source login credentials as environment variables in this local file or in an external secrets manager; Soda needs access to the credentials to be able to connect to your data source to run scans of your data. See:
(Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.
Add the Soda Agent Helm chart repository.
Use the following command to install the Helm chart to deploy a Soda Agent in your cluster. (Learn more about the helm install command.)
Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of soda.agent.name with a custom name for your agent, if you wish.
Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
The command-line produces output like the following message:
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that the agent may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step three to check the status of the deployment. When the status is Running, you can refresh the page and see the agent in Soda Cloud.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
(Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.
Add the Soda Agent Helm chart repository.
Using a code editor, create a new YAML file called values.yml.
In that file, copy+paste the content below, replacing the following values:
id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of name with a custom name for your agent, if you wish.
Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that the agent may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step four to check the status of the deployment. When the status is Running, you can refresh the page and see the agent in Soda Cloud.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
helm install
the action helm is to take
soda-agent (the first one)
a release named soda-agent on your cluster
soda-agent (the second one)
the name of the helm repo you installed
soda-agent (the third one)
the name of the helm chart that is the Soda Agent
The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with the --set flags as this command does, or you can specify the override values using a values.yml file.
--set soda.agent.name
A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.
--set soda.apikey.id
With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.apikey.secret
With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.agent.logFormat
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
--set soda.agent.loglevel
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
--namespace soda-agent
Use the namespace value to identify the namespace in which to deploy the agent.
Uninstall the Soda Agent in the cluster.
Delete the cluster.
Refer to Google Kubernetes Engine documentation for details.
Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.
Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:
Use https://cloud.us.soda.io for the United States
Use https://cloud.soda.io for all else
Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.
Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:
cloud.us.soda.io for Soda Cloud account created in the US region
OR
cloud.soda.io for Soda Cloud account created in the EU region
AND
collect.soda.io
This page provides detailed information about how to configure the Soda↔Collibra integration.
Both Collibra and Soda need to be configured so the integration can run successfully. This page covers both Collibra and Soda settings, including asset types, attribute types, relation types, and domain mappings. These settings establish the foundation for reliable synchronization of data quality checks and metadata between Soda and Collibra.
Configure the different types of assets in Collibra:
Define the attributes that will be set on check assets:
Flexible Extraction: Automatically extracts metrics from any diagnostic type (missing, aggregate, valid, etc.)
Future-Proof: Works with new diagnostic types that Soda may introduce
Smart Fallbacks: Falls back to datasetRowsTested
Define the types of relationships between assets:
Configure ownership role mappings:
Configure the domains where assets will be created:
Define Soda attributes and their mappings:
Multiple dimensions support
The integration supports both single and multiple dimensions for data quality checks:
Single dimension: Specify as a string value (e.g., "Completeness")
Multiple dimensions: Use a comma-separated string (e.g., "Completeness, Consistency")
When multiple dimensions are provided as a comma-separated string, the integration will:
Automatically split the string by commas and trim whitespace
Search for each dimension asset in Collibra individually
Create a relation for each dimension found
Log a warning for any dimension that cannot be found in Collibra
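The splitting behavior described above can be sketched in Python. This is an illustrative sketch only: `find_dimension_asset` is a hypothetical stand-in for the integration's Collibra lookup, not a function name from its source code.

```python
import logging

def resolve_dimension_relations(dimension_value, find_dimension_asset):
    """Split a comma-separated dimension string and look up each asset.

    `find_dimension_asset` should return a Collibra asset ID, or None
    when the dimension cannot be found.
    """
    relations = []
    for name in (part.strip() for part in dimension_value.split(",")):
        if not name:
            continue
        asset_id = find_dimension_asset(name)
        if asset_id is None:
            # Missing dimensions are logged but do not stop processing
            logging.warning("Dimension not found in Collibra: %s", name)
            continue
        relations.append((name, asset_id))
    return relations
```

With `dimension: "Completeness, Consistency, Accuracy"` and a lookup that knows only Completeness and Accuracy, the sketch returns two relations and logs one warning, matching the behavior described above.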
Example Configuration:
This will create three separate dimension relations in Collibra, one for each dimension specified.
Monitor Exclusion
The integration can exclude Soda monitors (items with metricType) from synchronization:
Enabled (sync_monitors: true): All checks and monitors are synchronized (default)
Disabled (sync_monitors: false): Only checks are synchronized, monitors are filtered out
When sync_monitors is disabled, the integration will:
Filter out all items that have a metricType attribute
Only process actual checks (items without metricType)
Log the number of monitors filtered out for each dataset
Continue processing with the remaining checks
This is useful when you want to focus on data quality checks and exclude monitoring metrics from your Collibra catalog.
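The filtering rule above amounts to dropping any item that carries a metricType attribute. A minimal sketch, assuming checks and monitors arrive as dictionaries (a simplification of the real API payloads):

```python
def filter_monitors(items, sync_monitors=True):
    """Return only checks when monitor syncing is disabled.

    Monitors are identified by the presence of a metricType key;
    checks have no metricType.
    """
    if sync_monitors:
        return items  # default: sync everything
    checks = [item for item in items if "metricType" not in item]
    filtered = len(items) - len(checks)
    if filtered:
        # The integration logs the number of monitors filtered per dataset
        print(f"Filtered out {filtered} monitor(s)")
    return checks
```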
Custom Attribute Syncing configuration
See the section below for detailed instructions.
The integration supports syncing custom attributes from Soda checks to Collibra assets, allowing you to enrich your Collibra assets with business context and additional metadata from your data quality checks.
Custom attribute syncing enables you to map specific attributes from your Soda checks to corresponding attribute types in Collibra. When a check is synchronized, the integration will automatically extract the values of these attributes and set them on the created/updated Collibra asset.
To enable custom attribute syncing, add the custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id configuration to your config.yaml file:
The configuration value is a JSON string containing key-value pairs where:
Key: The name of the attribute in Soda (as it appears on your Soda checks)
Value: The UUID of the corresponding attribute type in Collibra
First, identify which attributes from your Soda checks you want to sync to Collibra. Common examples include:
description - Check description
business_impact - Business impact assessment
data_domain - Data domain classification
For each Soda attribute, find the corresponding attribute type UUID in Collibra:
Navigate to your Collibra instance
Go to Settings → Metamodel → Attribute Types
Find or create the attribute types you want to map to
Copy the UUID of each attribute type
Create a JSON object mapping Soda attribute names to Collibra attribute type UUIDs:
Add the JSON mapping to your config.yaml file as a single-line string:
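A low-risk way to produce a valid single-line mapping string is to build the mapping as a dictionary and serialize it with `json.dumps`; round-tripping with `json.loads` confirms the string is valid JSON before you paste it into config.yaml. The UUIDs below are the example values used on this page.

```python
import json

# Soda attribute name -> Collibra attribute type UUID (example values from this page)
mapping = {
    "description": "00000000-0000-0000-0000-000000003114",
    "business_impact": "01975f7b-0c04-7b98-9fb8-6635261a7c7b",
}

# json.dumps yields the single-line string to place inside single quotes in config.yaml
single_line = json.dumps(mapping)
print(single_line)

# Quick sanity check: the string parses back to the same mapping
assert json.loads(single_line) == mapping
```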
Here's a complete example showing how to configure custom attribute syncing:
Soda Check with Custom Attributes:
Result: When this check is synchronized, the integration will create a Collibra asset with these attributes automatically set:
Description: "Ensures orders table is not empty"
Business Impact: "critical"
Data Domain: "sales"
Criticality: "high"
JSON Format: The mapping must be a valid JSON string enclosed in single quotes
Attribute Type UUIDs: Use the exact UUIDs from your Collibra metamodel
Case Sensitivity: Soda attribute names are case-sensitive and must match exactly
Missing Attributes: If a Soda check doesn't have an attribute defined in the mapping, it will be skipped (no error)
Common Issues:
Invalid JSON: Ensure the JSON string is properly formatted and enclosed in single quotes
Attribute Not Found: Verify the Soda attribute names match exactly what's defined in your checks
UUID Errors: Confirm the Collibra attribute type UUIDs are correct and exist in your instance
Permission Issues: Ensure your Collibra user has permissions to set the specified attribute types
Debug Mode: Run with debug mode to see detailed logging about custom attribute processing:
Look for log messages like:
Processing custom attribute: attribute_name
Successfully set custom attribute: attribute_name
Skipping custom attribute (not found in check): attribute_name
The integration automatically synchronizes deletions, removing obsolete check assets from Collibra when checks are deleted or removed in Soda.
Pattern Matching: For each dataset, the integration searches for all check assets in Collibra using the naming pattern {checkname}___{datasetName}
Comparison: Compares the list of check assets in Collibra with the current checks returned from Soda
Identification: Identifies assets that exist in Collibra but are no longer present in Soda
Automatic Cleanup: Keeps your Collibra catalog in sync with Soda without manual intervention
Efficient Processing: Uses bulk deletion operations to minimize API calls
Idempotent: Safe to run multiple times - handles already-deleted assets gracefully
Transparent: Shows deletion progress in the console output and tracks metrics
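The comparison step reduces to a set difference over expected asset names built with the {checkname}___{datasetName} pattern. A sketch under that assumption (the function name is illustrative, not from the integration's source):

```python
def find_obsolete_assets(collibra_asset_names, soda_check_names, dataset_name):
    """Return Collibra check assets that no longer exist in Soda.

    Expected names follow the {checkname}___{datasetName} pattern
    described above; anything in Collibra outside that set is obsolete.
    """
    expected = {f"{check}___{dataset_name}" for check in soda_check_names}
    return sorted(set(collibra_asset_names) - expected)
```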
When obsolete checks are found and deleted, you'll see:
And in the summary:
No additional configuration is required. Deletion synchronization is enabled by default and runs automatically for each dataset during the integration process.
Deletion synchronization is tracked in the integration metrics:
Checks deleted: Number of obsolete check assets removed from Collibra
Error Tracking: Any deletion failures are recorded in the error summary
404 Errors: If assets are already deleted (404 response), the integration treats this as success and continues
Other Errors: Network issues, authentication problems, or other HTTP errors are retried with exponential backoff
Missing Assets: If no check assets are found in Collibra for a dataset, deletion sync is skipped
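The retry-with-backoff and 404-as-success behavior described above can be sketched as follows; `delete_fn` is a hypothetical stand-in for the Collibra delete call, assumed here to return an HTTP status code.

```python
import time

def delete_with_retry(delete_fn, asset_id, max_attempts=4, base_delay=1.0):
    """Retry a deletion with exponential backoff; treat 404 as success.

    An already-deleted asset (404) counts as a successful deletion,
    so repeated runs remain idempotent.
    """
    for attempt in range(max_attempts):
        status = delete_fn(asset_id)
        if status in (200, 204, 404):
            return True
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return False
```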
The integration supports automatic synchronization of dataset ownership from Collibra to Soda.
Asset Discovery: For each dataset, finds the corresponding table asset in Collibra
Responsibility Extraction: Retrieves ownership responsibilities from Collibra
User Mapping: Maps Collibra users to Soda users by email address
Ownership Update: Updates the Soda dataset with synchronized owners
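The user-mapping step above matches Collibra owners to Soda users by email address. A simplified sketch, assuming both sides are lists of dictionaries with an email field (the real API payloads carry more fields):

```python
def map_owners_by_email(collibra_owners, soda_users):
    """Match Collibra owners to Soda users via (case-insensitive) email.

    Returns matched Soda user IDs plus the emails that could not be
    matched, which the integration tracks as errors and skips.
    """
    soda_by_email = {u["email"].lower(): u["id"] for u in soda_users}
    matched, unmatched = [], []
    for owner in collibra_owners:
        user_id = soda_by_email.get(owner["email"].lower())
        if user_id:
            matched.append(user_id)
        else:
            unmatched.append(owner["email"])
    return matched, unmatched
```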
Ensure the following are configured in your config.yaml:
Ownership synchronization is tracked in the integration metrics:
👥 Owners synchronized: Number of successful ownership transfers
❌ Ownership sync failures: Number of failed synchronization attempts
Common issues and their handling:
Missing Collibra Asset: Skip ownership sync for that dataset
No Collibra Owners: Log information message, continue processing
User Email Mismatch: Track as error, continue with remaining users
Soda API Failures: Retry with exponential backoff
In order to show the Soda Data Quality score in Collibra, you will need to create an aggregation path as follows:
Navigate to Collibra Settings > Operating Model > Quality Score Aggregation
Create a new score aggregation. You will create two different aggregations as follows:
If you are using Collibra as a report catalog and want to show Quality Scores on your reports, you will create a third aggregation using the path “Report is part of data structure” & “Asset complies with Governance Asset”.
Assign the new aggregation paths to the asset types COLUMN and TABLE (and any other asset types such as a REPORT).
Collibra Settings > Operating Model > Asset Types > Column
Click the assignment being used (Default Assignment) > Quality Score Aggregations > External Data Quality > Choose “Soda Data Quality [COLUMN]"
Navigate to Collibra Settings > Operating Model > Asset Types > Table
(Optional) If you want to show the Soda Data Quality score in a diagram view on the assets types, you will need to add the above aggregations as an overlay for each asset type (Column, Table, Report) as follows:
For advanced configuration details, head to .
Soda-hosted agents are included in all Free, Team, and Enterprise plans at no additional cost. However, self-hosted agents require an Enterprise plan.
If you wish to use self-hosted agents, please contact us at https://www.soda.io/contact to discuss Enterprise plan options or via the support portal for existing customers.
You have created, or have access to an existing Kubernetes cluster into which you can deploy a Soda Agent.
You have installed v1.22 or v1.23 of . This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. With Docker running, use the command kubectl version --output=yaml to check the version of an existing install.
You have installed . This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.
Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.
Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.
To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.
For reference, a Soda-hosted agent specifies resources as follows:
The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.
Add the Soda Agent Helm chart repository.
Use the following command to install the Helm chart to deploy a Soda Agent in your cluster. Learn more about the .
Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.
Add the Soda Agent Helm chart repository.
helm repo add soda-agent [REPOSITORY_URL_PROVIDED]
Using a code editor, create a new YAML file called values.yml.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
If you use private key authentication with a Soda Agent, refer to .
helm install command
The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with the --set flags as this command does, or you can specify the override values using a values.yml file.
Uninstall the Soda Agent in the cluster.
Delete the cluster.
Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.
Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:
Use https://cloud.us.soda.io for the United States
Use https://cloud.soda.io for all else
Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.
Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:
cloud.us.soda.io for Soda Cloud account created in the US region
OR
cloud.soda.io for Soda Cloud account created in the EU region
AND
collect.soda.io
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
    logformat: "raw"
    loglevel: "ERROR"
  cloud:
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    endpoint: "https://cloud.soda.io"

helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
  --set soda.cloud.endpoint=https://cloud.us.soda.io \
  # Use <us> for US region; use <eu> for EU region
  --set soda.cloud.region=us \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set soda.agent.logFormat=raw \
  --set soda.agent.loglevel=ERROR \
  --namespace soda-agent

NAME: soda-agent
LAST DEPLOYED: Mon Nov 21 16:29:38 2022
NAMESPACE: soda-agent
STATUS: deployed
REVISION: 1

kubectl get pods -n soda-agent

NAME                                     READY   STATUS    RESTARTS   AGE
soda-agent-orchestrator-ffd74c76-5g7tl   1/1     Running   0          32s

kubectl create ns soda-agent

namespace/soda-agent created

helm install soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent

kubectl describe pods -n soda-agent

helm uninstall soda-agent -n soda-agent

eksctl delete cluster --name soda-agent

soda:
  agent:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x
  scanlauncher:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x

soda:
  agent:
    resources:
      limits:
        cpu: 250m
        memory: 375Mi
      requests:
        cpu: 250m
        memory: 375Mi

kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f

helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --namespace soda-agent

helm repo add soda-agent [REPOSITORY_URL_PROVIDED]

helm uninstall soda-agent -n soda-agent

gcloud container clusters delete soda-agent-gke

soda:
  agent:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x
  scanlauncher:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x

soda:
  agent:
    resources:
      limits:
        cpu: 250m
        memory: 375Mi
      requests:
        cpu: 250m
        memory: 375Mi

kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f

helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --namespace soda-agent
cloud.soda.io
cloud.us.soda.io
registry.cloud.soda.io
registry.us.soda.io
soda-cloud-platform-registry.s3.eu-west-1.amazonaws.com
soda-cloud-us-platform-registry.s3.us-west-2.amazonaws.com
*.docker.io
*.docker.io
checkRowsTested
Calculated Values: Automatically computes check_rows_passed and check_passing_fraction when source data is available
Graceful Handling: Leaves attributes empty when diagnostic data is not present in the check result
Continue processing even if some dimensions are missing
criticality - Data criticality level
owner_team - Owning team information
Invalid UUIDs: Invalid Collibra attribute type UUIDs will cause the sync to fail for that attribute
Error Handling: Gracefully handles cases where assets are already deleted (404 errors), treating them as successful deletions
Metrics Tracking: Reports the number of checks deleted in the integration summary
Error Tracking: Records any failures for monitoring
Click the assignment being used (Default Assignment) > Quality Score Aggregations > External Data Quality > Choose “Soda Data Quality [TABLE]"






collibra:
  base_url: "https://your-instance.collibra.com/rest/2.0"
  username: "your-username"
  password: "your-password"
  general:
    naming_delimiter: ">"  # Used to separate parts of asset names
  asset_types:
    table_asset_type: "00000000-0000-0000-0000-000000031007"  # ID for Table assets
    soda_check_asset_type: "00000000-0000-0000-0000-000000031107"  # ID for Data Quality Metric type
    dimension_asset_type: "00000000-0000-0000-0000-000000031108"  # ID for Data Quality Dimension type
    column_asset_type: "00000000-0000-0000-0000-000000031109"  # ID for Column type
  attribute_types:
    # Standard Check Attributes
    check_evaluation_status_attribute: "00000000-0000-0000-0000-000000000238"  # Boolean attribute for pass/fail
    check_last_sync_date_attribute: "00000000-0000-0000-0000-000000000256"  # Last sync timestamp
    check_definition_attribute: "00000000-0000-0000-0000-000000000225"  # Check definition
    check_last_run_date_attribute: "01975dd9-a7b0-79fb-bb74-2c1f76402663"  # Last run timestamp
    check_cloud_url_attribute: "00000000-0000-0000-0000-000000000258"  # Link to Soda Cloud
    # Diagnostic Metric Attributes - Extracted from Soda check diagnostics
    check_loaded_rows_attribute: "00000000-0000-0000-0000-000000000233"  # Number of rows tested/loaded
    check_rows_failed_attribute: "00000000-0000-0000-0000-000000000237"  # Number of rows that failed
    check_rows_passed_attribute: "00000000-0000-0000-0000-000000000236"  # Number of rows that passed (calculated)
    check_passing_fraction_attribute: "00000000-0000-0000-0000-000000000240"  # Fraction of rows passing (calculated)
  relation_types:
    table_column_to_check_relation_type: "00000000-0000-0000-0000-000000007018"  # Relation between table/column and check
    check_to_dq_dimension_relation_type: "f7e0a26b-eed6-4ba9-9152-4a1363226640"  # Relation between check and dimension
  responsibilities:
    owner_role_id: "00000000-0000-0000-0000-000000005040"  # Collibra role ID for asset owners
  domains:
    data_quality_dimensions_domain: "00000000-0000-0000-0000-000000006019"  # Domain for DQ dimensions
    soda_collibra_domain_mapping: '{"Sales": "0197377f-e595-7434-82c7-3ce1499ac620"}'  # Dataset to domain mapping
    soda_collibra_default_domain: "01975b4a-0ace-79f6-b5ec-68656ca60b11"  # Default domain if no mapping

soda:
  api_key_id: "your-api-key-id"
  api_key_secret: "your-api-key-secret"
  base_url: "https://cloud.soda.io/api/v1"
  general:
    filter_datasets_to_sync_to_collibra: true  # Only sync datasets with sync attribute
    soda_no_collibra_dataset_skip_checks: false  # Skip checks if dataset not in Collibra
  attributes:
    soda_collibra_sync_dataset_attribute: "collibra_sync"  # Attribute to mark datasets for sync
    soda_collibra_domain_dataset_attribute_name: "rulebook"  # Attribute for domain mapping
    soda_dimension_attribute_name: "dimension"  # Attribute for DQ dimension

checks for orders:
  - row_count > 0:
      attributes:
        dimension: "Completeness, Consistency, Accuracy"

soda:
  attributes:
    # ... other attributes ...
    custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"soda_attribute_id": "collibra_attribute_type_uuid", "another_soda_attribute": "another_collibra_uuid"}'

{
  "description": "00000000-0000-0000-0000-000000003114",
  "business_impact": "01975f7b-0c04-7b98-9fb8-6635261a7c7b",
  "data_domain": "0197ca72-aee8-7259-9e88-5b98073147ed"
}

soda:
  attributes:
    custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"description": "00000000-0000-0000-0000-000000003114", "business_impact": "01975f7b-0c04-7b98-9fb8-6635261a7c7b", "data_domain": "0197ca72-aee8-7259-9e88-5b98073147ed"}'

checks for orders:
  - row_count > 0:
      attributes:
        description: "Ensures orders table is not empty"
        business_impact: "critical"
        data_domain: "sales"
        criticality: "high"

soda:
  attributes:
    soda_collibra_sync_dataset_attribute: "collibra_sync"
    soda_collibra_domain_dataset_attribute_name: "rulebook"
    soda_dimension_attribute_name: "dimension"
    custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"description": "00000000-0000-0000-0000-000000003114", "business_impact": "01975f7b-0c04-7b98-9fb8-6635261a7c7b", "data_domain": "0197ca72-aee8-7259-9e88-5b98073147ed", "criticality": "0197f2a8-1234-5678-9abc-def012345678"}'

python main.py --debug

Processing dataset 1/3: finance_loans
  📋 Getting checks...
  🔄 Processing 18 checks...
  🏗️ Preparing assets...
  📤 Creating/updating assets...
  📝 Processing metadata & relations...
  🗑️ Deleting 2 obsolete check(s)...
  👥 Syncing ownership...

🗑️ Checks deleted: 2

collibra:
  responsibilities:
    owner_role_id: "00000000-0000-0000-0000-000000005040"  # Collibra owner role ID

Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
Read more about the helm install command.
The command-line produces output like the following message:
Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
cloud.soda.io
cloud.us.soda.io
registry.cloud.soda.io
registry.us.soda.io
soda-cloud-platform-registry.s3.eu-west-1.amazonaws.com
soda-cloud-us-platform-registry.s3.us-west-2.amazonaws.com
*.docker.io
*.docker.io
You have been granted access to the private Soda Agent repository and received the necessary credentials and repository information.
You have whitelisted these URLs, depending on whether you are using Soda EU cloud.soda.io or Soda US cloud.us.soda.io:
Replace the value of soda.agent.name with a custom name for your agent, if you wish.
Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
The command-line produces output like the following message:
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step 3 to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.
Replace id and secret with the values you copied from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of name with a custom name for your agent, if you wish.
Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step 3 to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.
Install the Helm chart via CLI by providing values directly in the install command.
Use this as a straightforward way to deploy an agent on a cluster in a secure or local environment.
Install the Helm chart via CLI by providing values in a values YAML file.
Use this as a way of deploying an agent on a cluster while keeping sensitive values secure:
- provide sensitive API key values in this local file
- store data source login credentials as environment variables in this local file or in an external secrets manager; Soda needs access to the credentials to be able to connect to your data source to run scans of your data. See Soda Agent Extra
helm install
the action helm is to take
soda-agent (the first one)
a release named soda-agent on your cluster
soda-agent (the second one)
the name of the helm repo you installed
soda-agent (the third one)
the name of the helm chart that is the Soda Agent
--set soda.agent.name
A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.
--set soda.apikey.id
With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.apikey.secret
With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.agent.logFormat
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
--set soda.agent.loglevel
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
--namespace soda-agent
Use the namespace value to identify the namespace in which to deploy the agent.
The Publish Contracts on Merge workflow automates the publishing of any updated or new Soda contracts when changes are pushed to the main branch.
This ensures that all contract changes are automatically deployed to Soda Cloud whenever they're merged into the production branch.
Checks out the repo
Sets up Python
Installs the latest version of soda
Identifies changed files
Filters YAML files in the contracts/ directory
Publishes valid contracts to Soda Cloud
Make sure these are set in your repository’s GitHub Secrets:
SODA_CLOUD_API_KEY
SODA_CLOUD_API_SECRET
Learn more about how to Generate API keys
Change the Action trigger.
pip install
Can specify a fixed version of soda for stability.
SODA_CLOUD_CONFIG_FILE_PATH
Path to your Soda Cloud config. Can be replaced if your setup uses a different config file name or location.
contracts/*.yml or contracts/*.yaml
Modify file pattern to match a different directory or naming convention.
The Verify Contracts on Pull Request workflow ensures that contract changes in PRs are valid and do not break expectations before merging.
The workflow runs when a PR is opened, updated, or reopened.
Checks out the PR branch
Sets up Python
Installs latest soda-postgres
Identifies changed files
Filters contracts in the contracts/ directory
Runs verification checks against a configured data source
Make sure these are set in your repository’s GitHub Secrets:
DATASOURCE_USERNAME
DATASOURCE_PASSWORD
These secrets can be customized depending on the data source type and your needs.
Change the Action trigger
pip install
Adapt the install command to install the necessary package for your data source.
You can specify a fixed version of soda for stability.
contracts/*.yml or contracts/*.yaml
Change to match your directory structure.
DATASOURCE_CONFIG_FILE_PATH
Replace with the path to your data source configuration
DATASOURCE_USERNAME and DATASOURCE_PASSWORD
Adapt the secrets used to connect to your data source depending on the data source type and security requirements.




helm install soda-agent soda-agent/soda-agent \
--set soda.agent.name=myuniqueagent \
# Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
--set soda.cloud.endpoint=https://cloud.soda.io \
--set soda.apikey.id=*** \
--set soda.apikey.secret=**** \
--set soda.agent.logFormat=raw \
--set soda.agent.loglevel=ERROR \
--namespace soda-agent

NAME: soda-agent
LAST DEPLOYED: Thu Jun 16 10:12:47 2022
NAMESPACE: soda-agent
STATUS: deployed
REVISION: 1

soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
    logformat: "raw"
    loglevel: "ERROR"
  cloud:
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    endpoint: "https://cloud.soda.io"

helm repo add soda-agent [REPOSITORY_URL_PROVIDED]

kubectl describe pods

...
Containers:
soda-agent-orchestrator:
Container ID: docker://081*33a7
Image: sodadata/agent-orchestrator:latest
Image ID: docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 16 Jun 2022 15:50:28 -0700
Ready: True
...

helm repo add soda-agent [REPOSITORY_URL_PROVIDED]

helm install soda-agent soda-agent/soda-agent \
--values values.yml \
--namespace soda-agent

kubectl describe pods -n soda-agent

...
Containers:
soda-agent-orchestrator:
Container ID: docker://081*33a7
Image: sodadata/agent-orchestrator:latest
Image ID: docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 16 Jun 2022 15:50:28 -0700
Ready: True
...

kubectl -n soda-agent rollout restart deploy

soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
    logformat: "raw"
    loglevel: "ERROR"
  cloud:
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    endpoint: "https://cloud.soda.io"

helm install soda-agent soda-agent/soda-agent \
--set soda.agent.name=myuniqueagent \
# Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
--set soda.cloud.endpoint=https://cloud.soda.io \
--set soda.apikey.id=*** \
--set soda.apikey.secret=*** \
--set soda.agent.logFormat=raw \
--set soda.agent.loglevel=ERROR \
--namespace soda-agent

NAME: soda-agent
LAST DEPLOYED: Wed Dec 14 11:45:13 2022
NAMESPACE: soda-agent
STATUS: deployed
REVISION: 1

kubectl describe pods

Name: soda-agent-orchestrator-66-snip
Namespace: soda-agent
Priority: 0
Service Account: soda-agent
Node: <none>
Labels: agent.soda.io/component=orchestrator
agent.soda.io/service=queue
app.kubernetes.io/instance=soda-agent
app.kubernetes.io/name=soda-agent
pod-template-hash=669snip
Annotations: seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
...

helm install soda-agent soda-agent/soda-agent \
--values values.yml \
--namespace soda-agent

kubectl describe pods

Name: soda-agent-orchestrator-66-snip
Namespace: soda-agent
Priority: 0
Service Account: soda-agent
Node: <none>
Labels: agent.soda.io/component=orchestrator
agent.soda.io/service=queue
app.kubernetes.io/instance=soda-agent
app.kubernetes.io/name=soda-agent
pod-template-hash=669snip
Annotations: seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
...

helm install soda-agent soda-agent/soda-agent \
--set soda.agent.name=myuniqueagent \
# Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
--set soda.cloud.endpoint=https://cloud.soda.io \
--set soda.apikey.id=*** \
--set soda.apikey.secret=**** \
--set soda.agent.logFormat=raw \
--set soda.agent.loglevel=ERROR \
--namespace soda-agent

NAME: soda-agent
LAST DEPLOYED: Thu Jun 16 15:03:10 2022
NAMESPACE: soda-agent
STATUS: deployed
REVISION: 1

minikube kubectl -- describe pods

...
Containers:
soda-agent-orchestrator:
Container ID: docker://081*33a7
Image: sodadata/agent-orchestrator:latest
Image ID: docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 16 Jun 2022 15:50:28 -0700
Ready: True
...

helm install soda-agent soda-agent/soda-agent \
--values values.yml \
--namespace soda-agent

minikube kubectl -- describe pods

...
Containers:
soda-agent-orchestrator:
Container ID: docker://081*33a7
Image: sodadata/agent-orchestrator:latest
Image ID: docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 16 Jun 2022 15:50:28 -0700
Ready: True
...

soda:
  agent:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x
  scanlauncher:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x

soda:
  agent:
    resources:
      limits:
        cpu: 250m
        memory: 375Mi
      requests:
        cpu: 250m
        memory: 375Mi

helm repo add soda-agent [REPOSITORY_URL_PROVIDED]

kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f

helm install soda-agent soda-agent/soda-agent \
--set soda.agent.name=myuniqueagent \
--set soda.apikey.id=*** \
--set soda.apikey.secret=**** \
--namespace soda-agent

helm uninstall soda-agent -n soda-agent

minikube delete

💀 Removed all traces of the "minikube" cluster.

name: Publish Updated Contracts on Merge
on:
  push:
    branches:
      - main
jobs:
  publish-contracts:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install soda-postgres
        run: pip install -i https://pypi.dev.sodadata.io "soda>=4.0.0.dev1" -U
      - name: Get all changed files
        id: changed-files
        uses: tj-actions/changed-files@v46
      - name: List all changed files
        env:
          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
        run: |
          for file in ${ALL_CHANGED_FILES}; do
            echo "$file was changed"
          done
      - name: Debug environment variables
        env:
          SODA_CLOUD_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}
          SODA_CLOUD_API_SECRET: ${{ secrets.SODA_CLOUD_API_SECRET }}
        run: |
          echo "Environment variables status:"
          echo "SODA_CLOUD_API_KEY: $(if [ -n "$SODA_CLOUD_API_KEY" ]; then echo "✅ Set (${#SODA_CLOUD_API_KEY} chars)"; else echo "❌ Not set"; fi)"
          echo "SODA_CLOUD_API_SECRET: $(if [ -n "$SODA_CLOUD_API_SECRET" ]; then echo "✅ Set (${#SODA_CLOUD_API_SECRET} chars)"; else echo "❌ Not set"; fi)"
      - name: Filter and publish contracts
        env:
          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
          SODA_CLOUD_CONFIG_FILE_PATH: soda-cloud.yaml
          SODA_CLOUD_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}
          SODA_CLOUD_API_SECRET: ${{ secrets.SODA_CLOUD_API_SECRET }}
        run: |
          for file in ${ALL_CHANGED_FILES}; do
            if [[ "$file" == contracts/*.yml || "$file" == contracts/*.yaml ]]; then
              echo "Publishing $file"
              echo "Executing: soda contract publish --contract \"$file\" --soda-cloud ${SODA_CLOUD_CONFIG_FILE_PATH}"
              soda contract publish --contract "$file" --soda-cloud ${SODA_CLOUD_CONFIG_FILE_PATH}
            else
              echo "Skipping $file (not a contract)"
            fi
          done
name: Verify Data Contracts on pull request
on:
  pull_request:
    types: [opened, synchronize, reopened]
jobs:
  verify-contracts:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install soda-postgres
        run: pip install -i https://pypi.dev.sodadata.io/simple -U soda-postgres
      - name: Get all changed files
        id: changed-files
        uses: tj-actions/changed-files@v46
      - name: List all changed files
        env:
          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
        run: |
          for file in ${ALL_CHANGED_FILES}; do
            echo "$file was changed"
          done
      - name: Debug environment variables
        env:
          DATASOURCE_USERNAME: ${{ secrets.DATASOURCE_USERNAME }}
          DATASOURCE_PASSWORD: ${{ secrets.DATASOURCE_PASSWORD }}
        run: |
          echo "Environment variables status:"
          echo "DATASOURCE_USERNAME: $(if [ -n "$DATASOURCE_USERNAME" ]; then echo "✅ Set"; else echo "❌ Not set"; fi)"
          echo "DATASOURCE_PASSWORD: $(if [ -n "$DATASOURCE_PASSWORD" ]; then echo "✅ Set"; else echo "❌ Not set"; fi)"
      - name: Filter and verify contracts
        env:
          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
          DATASOURCE_CONFIG_FILE_PATH: postgres.yaml
          DATASOURCE_USERNAME: ${{ secrets.DATASOURCE_USERNAME }}
          DATASOURCE_PASSWORD: ${{ secrets.DATASOURCE_PASSWORD }}
        run: |
          for file in ${ALL_CHANGED_FILES}; do
            if [[ "$file" == contracts/*.yml || "$file" == contracts/*.yaml ]]; then
              echo "Verifying $file"
              echo "Executing: soda contract verify --data-source ${DATASOURCE_CONFIG_FILE_PATH} --contract \"$file\""
              soda contract verify --data-source ${DATASOURCE_CONFIG_FILE_PATH} --contract "$file"
            else
              echo "Skipping $file (not a contract)"
            fi
          done




on:
  push:
    branches:
      - main

on:
  pull_request:
    types: [opened, synchronize, reopened]

This page provides detailed information about everything that happens while running and after running the Soda↔Collibra integration.
Advanced usage focuses on running and maintaining the Soda↔Collibra bi-directional integration after setup. The goal is to equip technical implementers with the detail required to operate the integration efficiently, resolve issues quickly, and adapt it to complex environments.
Domain Mappings: Cached for the entire session
Asset Lookups: LRU cache reduces repeated API calls
Configuration Parsing: One-time parsing with caching
Asset Operations: Create/update multiple assets in single calls
Attribute Management: Bulk attribute creation and updates
Relation Creation: Batch relationship establishment
3-5x faster execution vs. original implementation
60% fewer API calls through caching
90% reduction in rate limit errors
Improved reliability with comprehensive error handling
Small datasets (< 100 checks): 30-60 seconds
Medium datasets (100-1000 checks): 2-5 minutes
Large datasets (1000+ checks): 5-15 minutes
Performance varies based on:
Network latency to APIs
Number of existing vs. new assets
Complexity of relationships
API rate limits
Enable detailed logging for troubleshooting:
Debug output includes:
Dataset processing details
API call timing and results
Caching hit/miss statistics
Error context and stack traces
The integration automatically extracts diagnostic metrics from Soda check results and populates detailed row-level statistics in Collibra.
The system automatically extracts metrics from any diagnostic type, making it future-proof:
The system uses a metric-focused approach rather than type-specific logic:
Scans All Diagnostic Types: Iterates through every diagnostic type in the response
Extracts Relevant Metrics: Looks for specific metric fields regardless of diagnostic type name
Applies Smart Fallbacks: Uses datasetRowsTested if checkRowsTested is not available
Input: Soda Check Result
Output: Collibra Attributes
✅ Future-Proof: Automatically works with new diagnostic types Soda introduces
✅ Comprehensive: Provides both raw metrics and calculated insights
✅ Flexible: Handles partial data gracefully with intelligent fallbacks
✅ Accurate: Uses check-specific row counts when available
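The metric-focused approach above can be sketched in a few lines of Python. This is a minimal illustration only, not the integration's actual code: the function name `extract_metrics` and the output keys are hypothetical, but the scanning and fallback behavior mirror the description above.

```python
# Hypothetical sketch of the metric-focused extraction: scan every
# diagnostic type in the check result, pull known metric fields, and
# fall back from checkRowsTested to datasetRowsTested when needed.
def extract_metrics(diagnostics: dict) -> dict:
    metrics = {}
    for diag_type, fields in (diagnostics or {}).items():
        if not isinstance(fields, dict):
            continue  # handle partial/malformed data gracefully
        if "failedRowsCount" in fields and "failed_rows" not in metrics:
            metrics["failed_rows"] = fields["failedRowsCount"]
        if "checkRowsTested" in fields:
            # Preferred: rows actually tested by this specific check
            metrics["loaded_rows"] = fields["checkRowsTested"]
        elif "datasetRowsTested" in fields and "loaded_rows" not in metrics:
            # Fallback: total dataset rows when no check-specific count
            metrics["loaded_rows"] = fields["datasetRowsTested"]
    return metrics
```

Because the loop iterates over every diagnostic type rather than matching on type names, a new diagnostic type that carries the same metric fields is picked up automatically.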
Head to the Kubernetes deployment documentation to learn more.
Modify constants.py for your environment:
For detailed information on configuring custom attribute syncing, see the section above.
Slow Processing: Increase BATCH_SIZE and DEFAULT_PAGE_SIZE
Rate Limiting: Increase RATE_LIMIT_DELAY
Memory Usage: Decrease CACHE_MAX_SIZE
API Timeouts: Check network connectivity and API endpoints
Authentication: Verify credentials and permissions
Rate Limits: Monitor API usage and adjust delays
Missing Assets: Ensure required asset types exist in Collibra
Relation Failures: Verify relation type configurations
Domain Mapping: Check domain IDs and JSON formatting
Missing Diagnostic Attributes: Check if Soda checks have lastCheckResultValue.diagnostics data
Incomplete Metrics: Some diagnostic types may only have partial metrics (e.g., aggregate checks lack failedRowsCount)
Attribute Type Configuration: Verify diagnostic attribute type IDs are configured correctly in config.yaml
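For the rate-limit issues listed above, a retry-with-delay wrapper along these lines can help. This is a hedged sketch, not the integration's actual implementation; it only reuses the semantics of the `MAX_RETRIES` and `RATE_LIMIT_DELAY` constants from `constants.py`, and the `call_with_retry` name is hypothetical.

```python
import time

MAX_RETRIES = 3       # assumed to match constants.py: API retry attempts
RATE_LIMIT_DELAY = 2  # assumed to match constants.py: seconds between attempts

def call_with_retry(request, retries=MAX_RETRIES, delay=RATE_LIMIT_DELAY):
    """Retry a failing API call, sleeping between attempts."""
    last_error = None
    for attempt in range(retries):
        try:
            return request()
        except Exception as exc:  # in practice, catch the client's error type
            last_error = exc
            time.sleep(delay)
    raise last_error  # all attempts exhausted
```

Increasing `RATE_LIMIT_DELAY` in this pattern trades runtime for fewer 429 responses, which matches the "Rate Limiting: Increase RATE_LIMIT_DELAY" guidance above.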
Look for these patterns in debug logs:
Rate limit prevention: Normal throttling behavior
Successfully updated/created: Successful operations
Skipping dataset: Expected filtering behavior
Processing diagnostics: Diagnostic data found in check result
Found failedRowsCount in 'X': Successfully extracted failure count from diagnostic type X
Found checkRowsTested in 'X': Successfully extracted row count from diagnostic type X
Collibra Base: collibra.base_url, collibra.username, collibra.password
Soda API: soda.api_key_id, soda.api_key_secret
Asset types (table, check, dimension, column)
Attribute types (evaluation status, sync date, diagnostic metrics)
Relation types (table-to-check, check-to-dimension)
Domain IDs for asset creation
For issues and questions:
Check the troubleshooting section
Enable debug logging for detailed information
Review the performance metrics for bottlenecks
Consult the usage examples
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
    logformat: "raw"
    loglevel: "ERROR"
  cloud:
    # Use https://cloud.us.soda.io for US region
    # Use https://cloud.soda.io for EU region
    endpoint: "https://cloud.soda.io"

Ownership synchronization details
Calculates Derived Metrics: Computes passing rows and fraction when source data is available
Handles Missing Data: Gracefully skips attributes when diagnostic data is unavailable
✅ Transparent: Detailed logging shows exactly which metrics were found and used
Zero Division Errors: System automatically prevents division by zero when calculating fractions
ERROR: Issues requiring attention
Using datasetRowsTested from 'X' as fallback: Fallback mechanism activated
No diagnostics found in check result: Check has no diagnostic data (normal for some check types)
Calculated check_rows_passed: Successfully computed passing rows
Added check_X_attribute: Diagnostic attribute successfully added to Collibra
soda.attributes.custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id
Domain Mapping: collibra.domains.soda_collibra_domain_mapping
Ownership Sync: collibra.responsibilities.owner_role_id
Contact [email protected] for additional help
check_loaded_rows_attribute
checkRowsTested or datasetRowsTested
Total number of rows evaluated by the check
check_rows_failed_attribute
failedRowsCount
Number of rows that failed the check
check_rows_passed_attribute
Calculated
check_loaded_rows - check_rows_failed
check_passing_fraction_attribute
Calculated
check_rows_passed / check_loaded_rows
1st
checkRowsTested
Preferred - rows actually tested by the specific check
2nd
datasetRowsTested
Fallback - total dataset rows when check-specific count unavailable
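The derived attributes above reduce to two small calculations, shown here as a sketch. The function name `derive_metrics` is hypothetical; the arithmetic and the zero-division guard follow the rules described in this section.

```python
def derive_metrics(loaded_rows, failed_rows):
    """Compute check_rows_passed and check_passing_fraction.

    Returns (None, None) when source data is missing, and guards
    against division by zero when the check tested no rows.
    """
    if loaded_rows is None or failed_rows is None:
        return None, None  # skip attributes when diagnostics are unavailable
    passed = loaded_rows - failed_rows
    fraction = round(passed / loaded_rows, 4) if loaded_rows else None
    return passed, fraction
```

With the example values from this page (274577 rows tested, 3331 failed), this yields 271246 passing rows and a passing fraction of 0.9879.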

============================================================
🎉 INTEGRATION COMPLETED SUCCESSFULLY 🎉
============================================================
📊 Datasets processed: 15
⏭️ Datasets skipped: 2
✅ Checks created: 45
🔄 Checks updated: 67
📝 Attributes created: 224
🔄 Attributes updated: 156
🔗 Dimension relations created: 89
📋 Table relations created: 23
📊 Column relations created: 89
👥 Owners synchronized: 12
❌ Ownership sync failures: 1
🎯 Total operations performed: 693
============================================================

python main.py --debug

// Missing value checks
{
  "diagnostics": {
    "missing": {
      "failedRowsCount": 3331,
      "failedRowsPercent": 1.213,
      "datasetRowsTested": 274577,
      "checkRowsTested": 274577
    }
  }
}

// Aggregate checks
{
  "diagnostics": {
    "aggregate": {
      "datasetRowsTested": 274577,
      "checkRowsTested": 274577
    }
  }
}

// Hypothetical future types
{
  "diagnostics": {
    "valid": {
      "failedRowsCount": 450,
      "validRowsCount": 9550,
      "checkRowsTested": 10000
    },
    "duplicate": {
      "duplicateRowsCount": 200,
      "checkRowsTested": 8000
    }
  }
}

{
  "name": "customer_id is present",
  "evaluationStatus": "fail",
  "lastCheckResultValue": {
    "value": 1.213,
    "diagnostics": {
      "missing": {
        "failedRowsCount": 3331,
        "checkRowsTested": 274577
      }
    }
  }
}

Attributes Set:
- check_loaded_rows_attribute: 274577      # From checkRowsTested
- check_rows_failed_attribute: 3331        # From failedRowsCount
- check_rows_passed_attribute: 271246      # Calculated: 274577 - 3331
- check_passing_fraction_attribute: 0.9879 # Calculated: 271246 / 274577

# Run all tests
python -m pytest tests/ -v
# Run specific test file
python -m pytest tests/test_integration.py -v
# Run with coverage
python -m pytest tests/ --cov=integration --cov-report=html

# Comprehensive local testing (recommended)
python testing/test_k8s_local.py
# Docker-specific testing
./testing/test_docker_local.sh
# Quick validation
python testing/validate_k8s.py

# Test Soda client functionality
python main.py --test-soda
# Test Collibra client functionality
python main.py --test-collibra

class IntegrationConstants:
    MAX_RETRIES = 3           # API retry attempts
    BATCH_SIZE = 50           # Batch operation size
    DEFAULT_PAGE_SIZE = 1000  # API pagination size
    RATE_LIMIT_DELAY = 2      # Rate limiting delay
    CACHE_MAX_SIZE = 128      # LRU cache size

# In your code
import logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Set custom config path
export SODA_COLLIBRA_CONFIG=/path/to/custom/config.yaml
# Enable debug mode
export SODA_COLLIBRA_DEBUG=true

# Full debug output
python main.py --debug 2>&1 | tee debug.log
# Verbose logging with timestamps
python main.py --verbose
# Test specific components
python main.py --test-soda --debug
python main.py --test-collibra --debug

# Basic run with default config
python main.py
# Debug mode with detailed logging
python main.py --debug
# Use custom configuration file
python main.py --config custom.yaml
# Test individual components
python main.py --test-soda --debug
python main.py --test-collibra --debug









