These monitors require executing queries against the data itself to surface usage and content recency patterns, for example:
Most recent timestamp: the latest event or ingestion time across all rows
Partition row count: the number of records within the current partition (e.g. today’s data)
Query-based monitors give you a window into data flow and freshness, helping detect lags in ingestion pipelines or staleness in source systems.
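Conceptually, these monitors boil down to SQL like the following (table and column names here are hypothetical, for illustration only):

```sql
-- Most recent timestamp: the latest event or ingestion time across all rows
SELECT MAX(event_ts) FROM events;

-- Partition row count: records in the current partition (e.g. today's data)
SELECT COUNT(*) FROM events WHERE DATE(event_ts) = CURRENT_DATE;
```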
These monitors are derived directly from the data platform’s system metadata, without scanning row-level values. They surface structural signals, such as:
Last modification time: when the dataset was last updated
Schema changes: any alterations to the schema
Row count: the overall number of records in the dataset
Text metrics help catch formatting issues, truncated values, or unexpectedly long/free-form entries.
With Soda, it's possible to assess the character-length properties of string columns:
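The underlying character-length measurements correspond to SQL along these lines (table and column names are hypothetical):

```sql
-- Illustrative: profile the length of values in a string column
SELECT
  MIN(LENGTH(customer_name)) AS min_length,
  MAX(LENGTH(customer_name)) AS max_length,
  AVG(LENGTH(customer_name)) AS avg_length
FROM customers;
```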
Timestamp metrics highlight recency and time-based anomalies, which is crucial for validating timeliness in event streams and incremental loads:
This section introduces the key features and workflows in Soda for managing data quality issues and reporting.
Learn how to find datasets and checks, navigate dashboards, and understand check results.
You’ll also learn how to set up notifications to stay informed and build custom dashboards using tools like Power BI or Tableau.
Numeric metrics capture central tendency and dispersion in numerical columns, such as:
Mean (AVG)
Standard deviation (STDDEV_SAMP)
Minimum & maximum (MIN / MAX)
Quartiles (Q1, median, Q3)
Total row count change: the delta in row count compared to the previous observation
Because they read only metadata, these monitors are extremely lightweight to compute and ideal for continuous, real-time dashboarding of dataset activity.
Depending on your data source(s), metadata-based Metric Monitors may only be supported on Tables, and not on other data objects such as Views.
As alternatives, set up these metadata-based Metric Monitors on the source Tables of your non-Table data objects, or store these data objects as Tables instead.
How each metric is computed depends on the metric and, in some cases, on the data source:
Maximum: MAX(column).
Total row count change: Soda keeps track of the previous total row count, fetches the total row count again at scan time, and subtracts the two.
Standard deviation: uses the SQL-standard STDDEV_SAMP() for all databases, which is a sampling-based method.
Third quartile (Q3), numeric data: for data sources supporting exact percentiles (e.g. PostgreSQL’s PERCENTILE_DISC(0.75)), Soda uses that function; for data sources that provide approximations (such as BigQuery, SQL Server, Redshift, and Trino), Soda uses those approximated values.
Total row count: through the row count value provided by the metadata, which is calculated differently for every database.
First quartile (Q1), numeric data: for data sources supporting exact percentiles (e.g. PostgreSQL’s PERCENTILE_DISC(0.25)), Soda uses that function; for data sources that provide approximations, Soda uses those approximated values.
Median, numeric data: for data sources supporting exact percentiles (e.g. PostgreSQL’s PERCENTILE_DISC(0.5)), Soda uses that function; for data sources that provide approximations, Soda uses those approximated values.
Minimum: MIN(column).
Partition row count: through count(*) for the partition.
Most recent timestamp: MAX(timestamp).
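The exact-vs-approximate distinction can be illustrated with two Q3 queries that return comparable results (table and column names are hypothetical):

```sql
-- Exact (PostgreSQL): discrete 75th percentile
SELECT PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY amount) FROM orders;

-- Approximate (BigQuery): APPROX_QUANTILES(x, 4) returns 5 boundaries
-- (min, Q1, median, Q3, max); OFFSET(3) picks Q3
SELECT APPROX_QUANTILES(amount, 4)[OFFSET(3)] FROM orders;
```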
Numeric metrics enable you to detect outliers, shifts in scale, or drifts in distribution over time.
Soda helps data teams make sure their data can be trusted. It makes it easy to find, understand, and fix problems in the data.
You can use Soda to:
Monitor production data with automated, ML-powered observability that spots unexpected changes without needing to define every rule up front.
Define data contracts, making expectations explicit and enabling producers and consumers to collaborate on reliable data at the source.
Test data earlier in the pipeline, as part of CI/CD workflows or during development, to prevent bad data from reaching production.
Soda helps teams start right, automatically detecting anomalies in metrics after they happen, and shift left, preventing issues from happening again with collaborative data contracts.
This is the documentation for Soda v4. If you are still using Soda v3, head to the Soda v3 documentation.
The new version of Soda has transformed the software into a full data-quality platform by layering on:
End-to-end data observability:
Collaborative data contracts:
This marks the shift from a CLI-centric checks engine toward a unified, observability-driven data quality platform with a refined, three-tier Core + Agent + Cloud architecture, built-in contracts, orchestration, and deep integrations.
Read on to learn more about Soda's capabilities.
Data quality refers to how well a dataset meets the expectations of completeness, accuracy, timeliness, uniqueness, and consistency. Good data supports business goals, drives confident decision-making, and is the foundation of great data products.
Poor data quality causes failed pipelines, incorrect reports, and broken AI models. Managing data quality means proactively validating assumptions and reactively monitoring for drift or degradation.
Soda helps you answer questions like:
Is the data fresh and complete?
Are there unexpected values or duplicates?
Did values shift outside of expected ranges?
Are schema or contract changes causing breakage?
Data observability is a reactive approach to monitoring data in production and catching unexpected issues as they emerge. It helps answer the question: What is happening with my data right now, and how is that changing over time?
Use data observability to:
Detect anomalies in data quality metrics such as freshness, row counts, null values or custom ones
Monitor metric trends and seasonality
Identify late-arriving or missing records
Get alerted when values deviate from historical norms
Data testing is a proactive approach that validates known expectations about your data during development, deployment, or transformation. It helps you catch issues before they reach production, break reports, or impact downstream systems.
Use data testing to:
Align on what “good data” looks like through data contracts
Verify that your data meets those expectations, including schema, values, and transformations
Test data at every step of the pipeline to prevent bad data from reaching downstream systems
Integrate with CI/CD workflows for continuous quality checks during development
Data contracts define what a dataset should look like, including its schema, data types, value ranges, and other constraints. They establish a shared agreement between data producers and consumers about what’s expected and what must be upheld.
Both testing and observability play a role in upholding data contracts:
Testing validates that data meets the contract during development, pipeline execution, and on schedule.
Observability monitors contract adherence in production and detects unexpected issues.
While data testing and observability are different in when and how they operate, they work best together as a unified strategy.
Together, they enable end-to-end data quality management: testing prevents problems, and observability detects those that escape prevention. At the same time, observability can help prioritize which issues to address and shift left to resolve them upstream.
Managing data quality across hundreds or thousands of datasets requires a scalable, federated approach. Soda enables this through:
Metadata-driven observability that adapts checks to each dataset's structure and context.
Role-based collaboration so teams can take ownership of the data they know best.
An interface for both engineering and business users, enabling collaboration through code, UI, or APIs, depending on user preference and role.
Integration with existing tools and workflows, such as data catalogs and incident management systems.
Reliable data depends on collaboration across roles:
Data engineers embed tests and monitor pipelines to catch issues early.
Data producers and consumers align on expectations through data contracts.
Data consumers report issues and collaborate with producers to interpret metrics and resolve problems.
Governance teams define and enforce data quality standards.
Soda Cloud acts as the shared workspace where these roles collaborate, triage incidents, and resolve issues.
Soda offers three deployment models, depending on your infrastructure and data privacy needs.
Read more about deployment options.
Soda integrates with the modern data stack:
Data warehouses and databases: Databricks, Snowflake, BigQuery, Redshift, PostgreSQL, MySQL, Spark, Presto, DuckDB, and more.
Orchestration platforms: Airflow, Dagster, Prefect, Azure Data Factory.
Metadata tools: Atlan, Alation, Collibra, data.world, Zeenea.
Cloud providers: AWS, Google Cloud, Azure.
To get started with Soda, check out the end-to-end guide.
Need help or want to contribute?
Join our Slack Community:
Browse GitHub Discussions:
Still have questions? Use the search bar above or reach out through our community channels for additional help.
Metrics that support all data types are foundational checks that apply to any column, regardless of its type:
Count of non-NULL values
Count of distinct entries
These metrics form the backbone of data completeness and consistency monitoring, ensuring every column meets basic quality expectations.
For teams that manage data like software, Git-managed data contracts offer a code-first way to define and enforce data quality expectations.
In this model, contracts are written in YAML and stored in your Git repository, right alongside your data models, transformation logic, and CI/CD workflows. You write, version, test, and promote contracts just like any other code artifact.
This approach gives engineers full control, reproducibility, and integration into development pipelines. And with the right setup, you can still collaborate with non-technical users via Soda Cloud and even sync UI-authored changes into Git using our future proposal workflow.
Full version control Track every change, roll back when needed, and manage contracts with the same discipline as application code.
Code-first workflow Keep contracts close to your data models and transformations for better alignment, automation, and traceability.
CI/CD integration Run contract verifications in your existing pipelines; on every commit, PR, or deployment.
Team governance
If you're already managing your data infrastructure in Git, Git-managed contracts are the natural extension for bringing data quality under control without adding friction or silos.
In the next sections, we’ll walk you through how to set up, author, and run Git-managed contracts using the Soda CLI.
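As a sketch, a Git-managed contract for a hypothetical orders dataset might look like the following. The check names and layout here are illustrative only; consult the Soda contract language reference for the exact syntax supported by your version.

```yaml
# Illustrative contract sketch -- not authoritative syntax
dataset: orders
columns:
  - name: order_id
    data_type: varchar
    checks:
      - missing:      # no NULL order ids
      - duplicate:    # order ids must be unique
  - name: amount
    data_type: decimal
checks:
  - row_count:        # the dataset must not be empty
```

Because the file lives in Git next to your models, every change to these expectations goes through the same review and CI process as the rest of your code.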
Cloud-managed Data Contracts let you define and manage expectations for your data directly in the Soda Cloud UI.
This approach is perfect for data analysts, product owners, and business stakeholders who know what “good data” looks like but prefer intuitive tools over code. It’s also ideal for teams that want to move fast, collaborate visually, and integrate seamlessly with engineering workflows when needed.
With Soda Cloud, you can browse datasets, add quality rules, test and publish contracts, and set up scheduled or on-demand verification. All from your browser.
Faster time to value – no setup required
Accessible to everyone – empower domain experts, not just engineers
Built for collaboration – share, comment, and propose changes in a shared UI
Easily operationalized – schedule tests and trigger verifications programmatically
Cloud-managed contracts are a powerful way to bring your organization together around trusted data.
Before creating contracts in Soda Cloud, make sure:
You have a Soda Cloud account
You have access to an organization in Soda Cloud
You have connected at least one data source via a Soda Agent
The Soda Agent allows you to securely scan your data sources for quality issues directly from Soda Cloud. It can be self-hosted or Soda-hosted, depending on your deployment preferences. The self-hosted option allows for a more custom and secure deployment, while the Soda-hosted agent is easier to start with. Learn more about Deployment options
You can deploy a self-hosted agent in the infrastructure of your choice:
Kubernetes cluster
Amazon EKS
Azure AKS
Google GKE
Soda-hosted agents are included in all Free, Team, and Enterprise plans at no additional cost. However, self-hosted agents require an Enterprise plan.
If you wish to use self-hosted agents, please contact us to discuss Enterprise plan options, or reach out via the support portal if you are an existing customer.
When you access a monitor (from Metric Monitoring) or a check (from the Contract), Soda Cloud provides a time series view that shows how the monitored metric or check result evolves. This helps you explore the history of data quality issues, spot trends, and understand changes in your data.
For certain types of checks and monitors, additional diagnostic information is also available for each monitor or check result to help you investigate issues in more detail.
For example:
Schema Checks: View a side-by-side comparison of the actual vs. expected schema to identify differences.
Missing, Duplicate, or Invalid Checks: See the percentage of failed rows vs. passing rows to understand the scale and impact of the issue.
This view helps you drill down into specific data issues, explore context, and take informed action.
As soon as a data source is connected, the metric monitoring dashboard is available and will have historical information. Soda establishes a statistical baseline for each metric and continually compares new scan results against that baseline, flagging anomalies according to the sensitivity, exclusions, and threshold strategy you’ve configured.
Metric monitors are the foundation of data observability in Soda. Monitors track data quality metrics over time and leverage historical values for analysis. Soda automatically collects these metrics and examines how they evolve over time to identify when metrics deviate from expected patterns and trigger alerts. These deviations are surfaced and recorded in the Metric Monitors dashboard.
The Organization and Admin Settings in Soda Cloud provide a centralized interface for managing your organization’s configuration, roles, user access, and integrations. From setting your organization’s name to defining global roles, managing user groups, and enabling the Soda-hosted Agent, these settings help you tailor Soda Cloud to your team’s needs and governance policies.
To access the settings, click on your avatar on the top right, and then click Organization Settings.
Only users with the Manage Organization Settings global role can access and modify these settings.
Are data quality metrics changing over time?
Pipeline and CI/CD integration to automate data quality checks.
Platform teams deploy, manage, and secure the underlying infrastructure.
Self-hosted Agent
Same as Soda-hosted Agent, but deployed and managed in your own Kubernetes environment.
Teams needing full control over infrastructure and deployment.
Similar to Soda-hosted Agent, but deployed within the customer’s environment; data stays within your network.
Required for observability features. Cannot scan in-memory sources like Spark or DataFrames. Kubernetes expertise required.
BI tools: Looker, Tableau, Power BI.
Messaging and ticketing: Slack, Microsoft Teams, Jira, PagerDuty, ServiceNow, Opsgenie.
Data Testing
Proactive and preventative: Pre-production, during development or CI/CD
Prevent breakages before they happen: Validate known rules and enforce contracts
Data Observability
Reactive and adaptive: In production, runtime monitoring
Monitor data behavior and changes over time with automated detection of anomalies, schema changes, and other unexpected issues.
Soda Core
Open-source Python library (with commercial extensions) and CLI for running Data Contracts in your pipelines.
Data engineers integrating Soda into custom workflows.
Full control over orchestration, in-memory data support, contract verification.
No observability features. Required for in-memory sources (e.g., Spark, DataFrames). Data source connections managed at the environment level.
Soda-hosted Agent
Managed version of Soda that runs observability features, and executes and schedules Data Contracts.
Teams seeking a simple, managed solution for data quality.
Centralized data source access, no setup required, observability features enabled. Enables users to create, test, execute, and schedule contracts and checks directly from the Soda Cloud UI.
Required for observability features. Cannot scan in-memory sources like Spark or DataFrames.
Hybrid collaboration Combine Git workflows with Soda Cloud for monitoring, visualization, and cross-functional input via contract proposals.
The main difference between monitors and metrics is that monitors are configurable, while metrics are not.
Monitors build on top of metrics by wrapping their static measurement in a configurable context. Each monitor is customizable, so the user can select scan time, scan frequency, thresholds, and metric to be used.
Metrics, on the other hand, are only a part of the monitor. They are built-in, static definitions of data properties; it is not possible to alter how a metric is computed at its source, but it is possible to select which metric to track through a metric monitor.
Soda offers two main types of monitors to support scalable, layered observability: dataset monitors and column monitors.
Dataset monitors provide instant, no-setup monitoring based on metadata. They track high-level metrics like row count changes, schema updates, and insert activity, making them ideal for catching structural or pipeline-level issues across large numbers of datasets.
Column monitors are more granular and customizable. They focus on specific fields, allowing users to monitor things like missing values, averages, or freshness. These monitors are useful for capturing data issues that impact accuracy or business logic at the column level.
Together, they offer broad coverage and targeted insight, helping teams detect both systemic and localized data quality issues.
Each of these sections contains summarized information about the latest scan results for each monitor. From the health tab, you can access each monitor for further investigation and configuration, as well as create alerts.
You can turn any metric monitor into a proactive alert by clicking its bell icon on the Metric Monitors dashboard and selecting Add Notification Rule. This brings up the Add Notification Rule panel:
Name Enter a descriptive title for your rule (e.g. “Row-Count Alerts – Prod Sales”).
Data source Choose the warehouse or connection to scope your rule. Then, search for and check the specific tables (or columns) this rule should cover. The “Matches X datasets” badge updates in real time so you know exactly what you’ll be alerting on.
Applies to Pick which check type you want to alert on.
Recipients Select one or more notification targets:
Email addresses
Slack channels
Other integrations
This dialog lets you reuse a single rule for multiple datasets or checks, ensuring your team only gets the notifications they care about.

The compute method depends on the database: Soda requires specific metadata fields that differ for every database. No sampling is used.
In Redshift, adding columns is not part of last_modification_time.
Last modification time: Soda uses metadata
Note that past data is only available for a limited amount of time, which varies depending on the system; at minimum, history goes back 120 hours.
Non-UTC timestamps are not recommended when connecting Soda to Oracle data sources. Soda uses timezone data when available, but assumes UTC when the timezone is not provided by the data source.
Some databases convert timestamps to UTC, but Oracle does not do any implicit conversions and stores timestamps and timezone information as the user inputs them. Because of Oracle Python client limitations, all timezone information is stripped when Soda retrieves it, which means that Soda will read all timestamps as if they were UTC regardless of the original input.
Metadata is supported, but it requires some additional setup on the Postgres side.
Historical backfilling: not possible.
Row count: enabled out-of-the-box.
Last modification time: track_commit_timestamp must be enabled: https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-TRACK-COMMIT-TIMESTAMP
If track_commit_timestamp is not enabled, Soda will return a warning.
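Enabling track_commit_timestamp is a server-level change that requires superuser privileges and a server restart to take effect. A minimal sketch:

```sql
-- Writes the setting to postgresql.auto.conf (requires superuser)
ALTER SYSTEM SET track_commit_timestamp = on;

-- Restart the PostgreSQL server, then confirm:
SHOW track_commit_timestamp;
```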
Metadata metrics are available and supported in BigQuery.
Historical backfilling: possible.
Partition column: can be suggested based on metadata available in BigQuery.
Soda will prioritize user-suggested columns.
If there are no user-suggested columns, Soda will try a metadata approach to find the partition column automatically.
If there are no columns found in the metadata of BigQuery, Soda will fall back on its own heuristic.
Historical backfilling is supported on Redshift and it is limited to 7 days for the metadata.
Modification time does not include schema changes. Only:
inserts
updates
deletes
Synapse does not provide metadata history tables.
Historical backfilling: not possible.
Last modification time: not possible.
Row count: current row counts are calculated via count(*).
Soda does not use metadata for this metric in Synapse.
Quartile metrics (Q1, median, Q3): not possible. Synapse does not support quartile metrics.
You may find it useful to set up multiple organizations in Soda Cloud so that each corresponds with a different environment in your network infrastructure, such as production, staging, and development. Such a setup makes it easy for you and your team to access multiple, independent Soda Cloud organizations using the same profile, or login credentials.
Note that Soda Cloud associates any API keys that you generate within an organization with both your profile and the organization in which you generated the keys. API keys are not interchangeable between organizations.
Contact [email protected] to request multiple organizations for Soda Cloud.



Before you can define, test, or verify Git-managed data contracts, you need to install the Soda CLI and configure your environment.
This setup gives you full control over your contracts, letting you version them in Git, execute them locally or remotely, and integrate them into your CI/CD pipelines.
Install Soda Core using pip:
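For example (the package names below are assumptions based on Soda's published packages; check the installation guide for the package that matches your version and data source):

```shell
pip install soda-core
# plus a data-source-specific package, e.g. for PostgreSQL:
pip install soda-core-postgres
```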
If you need authenticated access, follow the installation instructions to set up your environment and install the necessary Soda extensions.
If you want to interact with Soda Cloud to publish contracts and view verification results, or to use Soda Agent, you’ll need to connect the CLI to your Soda Cloud account.
Don’t have an account? Sign up to get started.
This generates a basic Soda Cloud configuration file.
Open sc.yml and fill in your API key and organization details.
Learn more about how to generate keys:
This ensures the CLI can authenticate and communicate with Soda Cloud.
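As a sketch, a filled-in sc.yml might look like the following (key names are illustrative; consult the Soda Cloud configuration reference for the exact schema):

```yaml
# Illustrative sc.yml -- use environment variables rather than
# committing secrets to Git
soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
```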
To verify a contract, Soda needs to know how to connect to your data source. You have two options:
If you prefer to define your own connection locally (or aren’t using a Soda Agent), you can create a data source config file for Soda Core.
Install the required package for your data source. For example, for PostgreSQL:
See the reference documentation for supported packages and configurations.
Open ds.yml and provide the necessary credentials.
For example with PostgreSQL:
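A sketch of a PostgreSQL ds.yml follows; the key layout is illustrative, and the host, database, and credential values are placeholders — refer to the data source reference for the exact configuration your version expects:

```yaml
# Illustrative ds.yml for a PostgreSQL data source
data_source my_postgres:
  type: postgres
  host: localhost
  port: "5432"
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: analytics
  schema: public
```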
Refer to the reference documentation for the configurations of each data source type.
If your data source is already connected to Soda Cloud using a Soda Agent (hosted or self-hosted), you can reuse that connection without managing credentials or configs locally.
You just need to ensure you have set up the connection with Soda Cloud.
Choose the method that best fits your setup:
Use Soda Agent for a centralized, cloud-managed connection, or local configuration if you want full control within your environment.
The Soda Agent is a Helm chart that you deploy on a Kubernetes cluster and connect to your Soda Cloud account using API keys.
To take advantage of new or improved features and functionality in the Soda Agent, including new features in the Soda Library, you can upgrade your agent when a new version becomes available in ArtifactHub.io.
Note that there is no downtime associated with upgrading a self-hosted Soda Agent. Because Soda does not define .spec.strategy in the deployment manifest of the Soda Agent Helm chart, Kubernetes uses the default RollingUpdate strategy to upgrade; refer to the Kubernetes documentation.
If you regularly access multiple clusters, ensure that you are first accessing the cluster that contains your deployed Soda Agent. Use the following command to determine which cluster you are accessing.
If you must switch contexts to access a different cluster, copy the name of the cluster you wish to use, then run the following command.
To upgrade the agent, you must know the values for:
namespace - the namespace you created, and into which you deployed the Soda Agent
release - the name of the instance of a helm chart that is running in your Kubernetes cluster
API keys - the values Soda Cloud created, which you used to run the agent application in the cluster.
Access the first two values by running the following command.
Output:
Access the API key values by running the following command, replacing the placeholder values with your own details.
From the output above, the command to use is:
Use the following command to search ArtifactHub for the most recent version of the Soda Agent Helm chart.
Use the following command to upgrade the Helm repository.
Upgrade the Soda Agent Helm chart. The value for the chart argument can be a chart reference such as example/agent, a path to a chart directory, a packaged chart, or a URL. To upgrade the agent, Soda uses a chart reference: soda-agent/soda-agent.
From the output above, the command to use would be:
OR, if you use a values YAML file,
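Pulled together, the upgrade workflow looks roughly like the following. The release name, namespace, and `--set` key paths are placeholders: verify each command and value against your own environment before running it.

```shell
# Confirm you are pointing at the right cluster; switch if needed
kubectl config current-context
kubectl config use-context <cluster-name>

# Find the namespace and release name of the running agent
helm list --all-namespaces

# Retrieve the values (including API keys) currently in use
helm get values <release> --namespace <namespace>

# Find the latest chart version on ArtifactHub and refresh the repo
helm search hub soda-agent
helm repo update

# Upgrade with explicit values...
helm upgrade <release> soda-agent/soda-agent \
  --namespace <namespace> \
  --set soda.apikey.id=<api-key-id> \
  --set soda.apikey.secret=<api-key-secret>

# ...or with a values YAML file
helm upgrade <release> soda-agent/soda-agent \
  --namespace <namespace> \
  --values values.yml
```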
The Datasets page displays all datasets that have been onboarded into Soda Cloud—either through publishing a contract or via the onboarding process: Onboard datasets on Soda Cloud .
It provides a quick overview of each dataset’s health, showing at a glance if a dataset has issues, how many checks from its contract are failing, how many anomalies were detected through metric monitoring, and when the last scan was executed.
You can filter datasets by properties like data source, owners, arrival time, attributes, or flags such as failures or anomalies. Use the search bar to quickly find a specific dataset by name, and sort the list by name, creation time, or data quality status.
Learn more about custom attributes: Dataset Attributes & Responsibilities
You can also sort the datasets list by name, creation date, check failures, or anomalies to prioritize your focus.
You can tailor the Datasets view to focus on the areas that matter most to you:
Use the filter options to narrow down the view
Click the Save Dashboard button to store your current filter configuration as a collection.
Enter a name for the collection and click Save
Once saved, your collection will be available in the dropdown at the top right of the dashboard. Simply select it to switch views.
Use the context menu next to the collection name to:
Delete the collection if it’s no longer needed.
Share the collection with others in your organization.
Diagnostics Warehouse provides a clear, detailed view of the state of data checks while allowing access to failed rows in order to take a closer look and resolve data quality issues.
Diagnostics Warehouse stores all Soda scan details, failed records, and historical data quality issues directly in your data warehouse of choice, safely and securely. Nothing is stored outside. This gives you the ability to run diagnostics, resolve issues, and see exactly why problems happen. You can go as deep as you need, from a single record to a full dataset.
Each time a Soda scan runs, Diagnostics Warehouse stores failed rows together with check and scan results, and related metadata attributes. With that information, data teams can quickly diagnose and resolve issues at both dataset and row level. Additionally, Soda's Diagnostics Warehouse makes it easier for teams to build on top of Soda's outputs to set up operational workflows, and connect to BI tools you already know and trust.
Full diagnostic information in one place, including attributes.
Transparency for all: replace black-box runs with auditable facts and keep an immutable, queryable history of what was checked, when, how long it took, what failed, and why.
Faster root-cause analysis: jump from a failed check to the exact failed rows, affected datasets/columns, and prior history to see if it’s a one-off issue or a pattern.
Data minimization: Diagnostics Warehouse stores metadata about runs and checks and, for row-level checks, it only stores failed rows when the option is enabled.
Warehouse residency: Diagnostics are not stored in Soda. They live in your analytics warehouse, respecting your access controls, encryption, and audit trails.
Enable Diagnostics Warehouse in your Soda data source settings.
Grant the service identity permission to create and write to the Diagnostics Warehouse schema in your warehouse.
Run your checks; Diagnostics Warehouse tables populate automatically.
Query your warehouse and connect to your BI tools to start exploring.
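For example, a first exploration query might look like the following. The schema, table, and column names here are entirely hypothetical; the actual schema is defined by the Diagnostics Warehouse in your warehouse.

```sql
-- Hypothetical names: inspect recent failing checks
SELECT check_name, dataset_name, outcome, scan_time
FROM soda_diagnostics.check_results
WHERE outcome = 'fail'
ORDER BY scan_time DESC;
```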
Next: to enable Diagnostics Warehouse in your organization, reach out to Soda.
The Checks page displays all checks defined in a data contract and tracked in Soda Cloud. It provides a quick overview of check health across datasets, allowing you to create custom groupings by applying filters such as data source, dataset, owners, or status (pass, fail, warning). This helps you focus on specific areas or teams that matter most.
You can also review key details like the check type, the dataset it belongs to, and the time of the last scan. Use the search bar to quickly find a specific check by name, and sort the list by name, last run time, or check status.
You can filter checks by properties such as data source, dataset, owners, attributes, or status (pass, fail, warning). Use the search bar to quickly find a specific check by name.
Learn more about custom attributes: Check and dataset attributes
You can also sort the list by name, last run time, or check status.
You can tailor the Checks view to focus on the areas that matter most to you:
Use the filter options to narrow down the view
Click the Save Dashboard button to store your current filter configuration as a collection.
Enter a name for the collection and click Save
Once saved, your collection will be available in the dropdown at the top right of the dashboard. Simply select it to switch views.
Use the context menu next to the collection name to:
Delete the collection if it’s no longer needed.
Share the collection with others in your organization.
The Dataset Page provides a detailed view of each dataset’s health and monitoring information in Soda Cloud. It includes several tabs to help you explore and manage data quality at the dataset level.
Displays the results of all checks defined in the dataset’s contract. Checks are grouped by column, with column-level checks nested under their respective columns. Columns with failed checks are automatically expanded so you can spot issues quickly. You can filter the view to show only failed checks and search for specific checks or columns by name for faster navigation and troubleshooting.
Learn more about Data Testing
Shows the metrics that are actively monitored for the dataset, helping you track trends and detect anomalies over time.
Learn more about
Provides an overview of the dataset’s structure, including column names, data types, distinct counts, and other statistics. You can search for a specific column by name to quickly locate and review its profiling details.
Lists incidents related to the dataset, helping you track issues and collaborate on resolution. You can filter incidents based on criteria such as user lead, status, or severity to focus on the most important or urgent cases.
Learn more about
The Organization Dashboard provides a high-level overview of your data quality across datasets and checks in Soda Cloud. It shows key trends over time, such as the number of checks that are passing, failing, or in a warning state, helping you identify issues early.
You’ll also find key metrics, including:
Scans in failed mode: Datasets that are currently blocked due to failing checks.
Checks currently failing: Active checks that need attention.
Overall Health Score: The proportion of failing checks out of the total number of checks.
These insights allow you to quickly identify where action is needed.
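As a rough sketch, the health-score arithmetic described above can be expressed as follows. The function name and the exact formula are illustrative assumptions, not Soda's implementation:

```python
# Hypothetical sketch of an overall health score derived from check counts,
# mirroring how the Organization Dashboard summarizes failing vs. total checks.
def health_score(failing: int, total: int) -> float:
    """Return the fraction of passing checks (1.0 = fully healthy)."""
    if total == 0:
        return 1.0  # no checks defined: nothing is failing
    return (total - failing) / total

print(health_score(3, 20))  # 0.85
```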
You can tailor the Organization Dashboard to focus on the areas that matter most to you:
Apply filters based on attributes: Use the filter options to narrow down the view by attributes
Click the Save Dashboard button to store your current filter configuration as a collection.
Enter a name for the collection and click Save
Once saved, your collection will be available in the dropdown at the top right of the dashboard. Simply select it to switch views.
Use the context menu next to the collection name to:
Delete the collection if it’s no longer needed.
Share the collection with others in your organization.
The Activity section offers insights into how Soda is being used across your organization. It tracks adoption metrics, such as active users, active checks, active datasets, and the number of alerts in the last 90 days.
In addition to the built-in dashboards in Soda Cloud, you can build custom dashboards tailored to your organization’s specific needs. By leveraging the Soda REST API, you can programmatically retrieve data quality metrics, check results, and incident details, and integrate them into external dashboarding tools such as Power BI, Tableau, or Looker.
This enables you to create tailored views and reports that align with your business logic and audience, ensuring stakeholders get the right insights in the tools they already use.
Learn more on
The General Settings page allows you to configure foundational settings for your organization in Soda Cloud. These settings impact how your organization operates and how users interact with the platform.
Set the name of your organization. This name appears throughout Soda Cloud, such as in dashboards, reports and notifications.
Enable the Login As feature to allow the Soda Support team to log in as an admin within your organization. This can be useful when troubleshooting issues or providing assistance.
You can choose whether to use the Soda-hosted Agent by enabling or disabling it in the Organization Settings:
Toggle the Soda-hosted Agent option to enable or disable the agent for your organization.
Disabling the agent prevents Soda Cloud from running scans or checks via the managed agent. You’ll need to use a self-hosted agent or Soda Core in your environment instead.
By default, the Soda-hosted Agent collects profiling information (such as column-level statistics and schema details) to support features like dataset discovery and monitoring in Soda Cloud.
You can choose to disable profiling if you prefer not to send profiling data to Soda Cloud: Toggle the Profiling Data Collection option to disable profiling for your organization.
This ensures that no profiling information is collected or pushed to Soda Cloud. Only check results and metadata necessary for contract validation will be processed.
Manage secure storage of secrets such as API keys, credentials or connection details. Secrets can be used in data source configurations, checks, and other automated processes.
For more information, see the
Once your contract is authored and published (or available locally), you can verify whether the actual data complies with the defined expectations. Soda provides two execution options:
Soda Core – run verifications locally, typically in CI/CD pipelines or dev environments.
Soda Agent – run verifications remotely using an agent deployed in your environment, triggered via Soda Cloud.
Both approaches support variable overrides, publishing results to Soda Cloud, and integration into automated workflows.
Learn more about Deployment options
Soda Core runs the verification locally, connecting to your data source using the defined data source configuration file.
This command:
Connects to your database using the local config
Loads the contract
Runs all checks and returns a pass/fail result
You can pass variables defined in the contract using the --set flag:
Learn about variables in Data Contract:
To send verification results to Soda Cloud for visibility and reporting, add the --publish flag to the command.
This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration.
Learn more about permissions here:
Learn how to connect the CLI to Soda Cloud:
This is recommended if you want stakeholders to see the outcomes in Soda Cloud or include them in dashboards and alerting.
Soda Agent executes verifications using data sources configured in Soda Cloud.
This setup:
Runs verifications through the Soda Agent connected to your data source
Fetches the published contract from Soda Cloud
Returns the result locally in the CLI
You can pass variables defined in the contract using the --set flag:
Learn about variables in Data Contract:
You can also push results to Soda Cloud from the agent-based run.
Add the flag --publish to the command.
This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration.
Learn more about permissions here:
This is recommended if you want stakeholders to see the outcomes in Soda Cloud or include them in dashboards and alerting.
Soda’s notification system helps you stay informed when data issues occur—whether it’s a failed check or an anomaly detected through metric monitoring. Notifications are dynamically dispatched using notification rules, allowing you to target alerts based on specific properties, attributes, or datasets.
Notification rules define when and to whom a notification is sent. Rules can be configured to match specific checks or anomalies, ensuring the right people are notified at the right time.
Only users with the Manage Notification Rules permission can create or edit rules. All users can view rules. Read about
To create a new notification rule:
Click on your profile in Soda Cloud and select Notification Rules from the menu.
Click New Rule.
Provide a name for the rule.
Define the Rule Scope
Checks:
All checks: The rule applies to every check in your organization.
Specific checks: Build custom rules by filtering on check properties, dataset properties, or attributes.
Anomalies from Metric Monitoring: Select specific datasets where the rule applies.
Define the recipients (users, groups, or integrations like Slack, Teams, or webhooks).
...and choose the alert type (only applicable for checks, not anomalies):
Only failures
Failures and warnings
All statuses
Save to create the notification rule
You can pause a notification rule at any time to temporarily disable alerts without deleting the rule.
When you delete the Soda Agent Helm chart from your cluster, you also delete all the agent resources on your cluster. However, if you wish to redeploy the previously-registered agent (using the same name), you need to specify the agent ID in your override values in your values YAML file.
In Soda Cloud, navigate to your avatar > Agents.
Click to select the agent you wish to redeploy, then copy the agent ID of the previously-registered agent from the URL.
For example, in the following URL, the agent ID is the long UUID at the end. https://cloud.soda.io/agents/842feab3-snip-87eb-06d2813a72c1.
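If you script this step, the agent ID can be pulled from the URL programmatically. A minimal Python sketch; the helper name is invented for illustration:

```python
# Hypothetical helper: extract the agent ID (the trailing segment) from a
# Soda Cloud agent URL such as https://cloud.soda.io/agents/<agent-id>.
from urllib.parse import urlparse

def agent_id_from_url(url: str) -> str:
    path = urlparse(url).path            # e.g. "/agents/842feab3-..."
    return path.rstrip("/").rsplit("/", 1)[-1]

print(agent_id_from_url("https://cloud.soda.io/agents/842feab3-snip-87eb-06d2813a72c1"))
# 842feab3-snip-87eb-06d2813a72c1
```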
Alternatively, if you use the base64 CLI tool, you can run the following command to obtain the agent ID.
kubectl get secret/soda-agent-id -n soda-agent --template={{.data.SODA_AGENT_ID}} | base64 --decode

Open your values.yml file, then add the id key:value pair under agent, using the agent ID you copied from the URL as the value.
To redeploy the agent, you need to provide the values for the API keys the agent uses to connect to Soda Cloud in the values YAML file. Access the values by running the following command, replacing the soda-agent values with your own details, then paste the values into your values YAML file.
Alternatively, if you use the base64 CLI tool, you can run the following commands to obtain the API key and API secret, respectively.
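The `| base64 --decode` step is needed because Kubernetes stores secret values base64-encoded. A small Python illustration of what the decoding does:

```python
# Kubernetes secrets hold base64-encoded values; the template output must be
# decoded before use. The sample value here is made up for illustration.
import base64

encoded = base64.b64encode(b"my-api-key-id").decode()  # as stored in the secret
print(base64.b64decode(encoded).decode())              # my-api-key-id
```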
In the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.
Validate the Soda Agent deployment by running the following command:
Group By monitors enable you to track data quality metrics across specific segments of your dataset. Instead of monitoring a metric for a column as a whole, you can break it down per category (for example, per region, per school year, per status).
This functionality is especially valuable when:
You want to detect anomalies at a more granular level, within each segment or category.
You need visibility into how data quality differs across categories.
Incidents help you track, investigate, and resolve data quality issues when they occur. An incident is created when a data issue, such as a failed or warning check, has been confirmed and assigned to someone for resolution.
To create or update an incident, the user must have the "Manage Incidents" permission on the related dataset.
With Git-managed contracts, you define your expectations as code using YAML. This gives you full control over how your data is validated, and allows you to manage contracts just like any other code artifact: versioned, tested, and deployed via Git.
To learn all about the structure and supported features, refer to the full specification in the
The contract structure includes:
Dataset and column structure
Available check types (missing, invalid, duplicate, freshness, and more)
Once a Data Contract is published, the next step is to verify that the actual data matches the expectations you’ve defined. Soda offers several flexible ways to execute contract verifications, depending on your needs and technical setup.
You can manually verify a contract at any time from the dataset page in Soda Cloud.
Simply open the dataset and click Verify Contract. This will:
Attributes allow you to add descriptive metadata to your datasets and checks. This metadata can then be:
Used for filtering in Soda Cloud, making it easier to search and organize datasets and checks based on specific criteria (e.g., business domain, sensitivity, criticality).
Leveraged in reporting, enabling you to group datasets, track ownership, and monitor data quality across different categories or dimensions.
Adding meaningful attributes enhances discoverability, governance, and collaboration within Soda and its integrations.
This reference hub includes detailed documentation for Soda’s key interfaces and configuration options, as well as information on Soda's architecture and specifics, including:
Data Contract Language Reference – Author, validate, and manage data contracts using YAML-based definitions.
CLI Command and Python Reference – Use Soda’s command-line interface to configure, run, and automate verification workflows.
REST API Reference – Interact with Soda Cloud programmatically to manage datasets, run verifications, and retrieve results.
pip install -i https://pypi.dev.sodadata.io/simple soda-postgres

kubectl config get-contexts

Organization-level visibility: roll up results by domain, team, or pipeline. Show the impact of your data quality program to leadership with real, defensible metrics.
Open & portable features: it’s just tables in your warehouse. Query with SQL, power dashboards, join with lineage, incident, or cost data, and automate workflows.
Security & Governance: Diagnostics Warehouse stores tables in your own warehouse, giving you full control over security, retention and access.
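As an illustration of "just tables in your warehouse", here is a hypothetical sketch querying check results with plain SQL, using an in-memory SQLite table. The table and column names are invented; the real Diagnostics Warehouse schema may differ:

```python
# Hypothetical example: once results live as warehouse tables, any SQL client
# can aggregate them. Table/column names here are assumptions for the sketch.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE check_results (dataset TEXT, outcome TEXT)")
conn.executemany(
    "INSERT INTO check_results VALUES (?, ?)",
    [("orders", "pass"), ("orders", "fail"), ("customers", "pass")],
)

# Failing checks per dataset, the kind of roll-up a BI dashboard would show.
failing = conn.execute(
    "SELECT dataset, COUNT(*) FROM check_results "
    "WHERE outcome = 'fail' GROUP BY dataset"
).fetchall()
print(failing)  # [('orders', 1)]
```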
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    id: "842feab3-snip-87eb-06d2813a72c1"
    name: "myuniqueagent"

kubectl config use-context <name of cluster>

helm list
NAME        NAMESPACE   REVISION  UPDATED                               STATUS    CHART              APP VERSION
soda-agent  soda-agent  5         2023-01-20 11:55:49.387634 -0800 PST  deployed  soda-agent-0.8.26  Soda_Library_1.0.0

helm get values -n <namespace> <release name>
helm get values -n soda-agent soda-agent

helm search hub soda-agent

helm repo update

helm upgrade <release> <chart> \
  --set soda.agent.name=myuniqueagent \
  # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
  --set soda.cloud.endpoint=https://cloud.soda.io \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set soda.agent.logFormat=raw \
  --set soda.agent.loglevel=ERROR \
  --namespace soda-agent

helm upgrade soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
  --set soda.cloud.endpoint=https://cloud.soda.io \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set soda.agent.logFormat=raw \
  --set soda.agent.loglevel=ERROR \
  --namespace soda-agent

helm upgrade soda-agent soda-agent/soda-agent \
  --values values-local.yml --namespace soda-agent

helm get values -n soda-agent soda-agent

kubectl get secret/soda-agent-apikey -n soda-agent --template={{.data.SODA_API_KEY_ID}} | base64 --decode
kubectl get secret/soda-agent-apikey -n soda-agent --template={{.data.SODA_API_KEY_SECRET}} | base64 --decode

helm install soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent

kubectl describe pods

Only one Group By monitor can be configured at a time.
Because a Group By monitor spawns multiple monitors (one per category), limiting this to a single configuration helps manage performance.
When a Group By monitor is active in a dataset, results are displayed at the bottom of the Metric Monitors tab, on the Column Monitors table:
There, you will see:
A Group By monitor is listed like any other monitor, but its description indicates the Group By column(s) and the metric being measured (e.g. "Maximum length of Bus_No grouped by Breakdown_or_Running_Late").
From the Column Monitors table, it is possible to turn on notifications at the column level by clicking on the bell icon. Note that notifications at a category level are not available at the moment.
Expanding the monitor displays a groups table, which shows the results for each group or category. Each row corresponds to one category (or combination of categories if multiple columns are grouped). From the groups table, it is possible to delete specific categories by clicking on the bin icon on the right.
Deleting from the groups table is intended to remove groups/categories that are no longer present in the data.
Deleting a category removes the history for that monitor.
If the group/category is still present in the data, the monitor will be re-created on the next scan. It will not be backfilled, unless a historical metric collection scan is triggered.
Example:
Group By Breakdown_or_Running_Late + metric Maximum length of Bus_No → a row for each Breakdown_or_Running_Late value, with the maximum bus number length observed in that category, alongside its anomaly detection status.
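The computation behind this example can be sketched in a few lines of Python. The column and category names come from the example above; the sample rows are made up:

```python
# Sketch of what a Group By monitor measures in the example above:
# the maximum length of Bus_No within each Breakdown_or_Running_Late category.
rows = [
    {"Bus_No": "1234", "Breakdown_or_Running_Late": "Running Late"},
    {"Bus_No": "98",   "Breakdown_or_Running_Late": "Breakdown"},
    {"Bus_No": "567",  "Breakdown_or_Running_Late": "Running Late"},
]

max_len_per_group: dict[str, int] = {}
for row in rows:
    group = row["Breakdown_or_Running_Late"]
    length = len(row["Bus_No"])
    max_len_per_group[group] = max(max_len_per_group.get(group, 0), length)

print(max_len_per_group)  # {'Running Late': 4, 'Breakdown': 2}
```

Each key in the result corresponds to one row of the groups table, with anomaly detection applied to the metric's history per category.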
You can add a Group By monitor from the Metric Monitors section of the dataset page.
Scroll to the Column Monitors table and click Add Column Monitors.
In the Add Column Monitors panel, toggle on Group By.
Select one or more columns to group by.
For the time being, only columns with a maximum of 50 distinct categories are eligible for Group By monitoring.
(Optional) Exclude specific categories (segments) that you don’t want to monitor.
Select one or more columns to monitor under Column Selection.
Enable one or more metrics from the right-hand list.
Click Add 1 Monitor on the top right to save.
The monitor now appears in the Column Monitors table and starts tracking anomalies across each category.
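The 50-distinct-category eligibility rule from the steps above amounts to a simple distinct count. A hypothetical pre-check (the function is an illustration, not part of Soda):

```python
# Hypothetical eligibility check: a column qualifies for Group By monitoring
# only if it has at most 50 distinct categories (the documented limit).
def eligible_for_group_by(values: list[str], limit: int = 50) -> bool:
    return len(set(values)) <= limit

print(eligible_for_group_by(["EU", "US", "EU", "APAC"]))  # True
```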
Categories can be excluded when configuring the monitor. See Step 4 on Add Group By monitors.
Categories can be deleted after creation from the Groups table if you decide they should no longer be monitored.
One Group By monitor at a time: Only one configuration is allowed, since Group By monitors expand into many underlying monitors.
Multiple Group By columns: More than one column can be selected, but the generated categories are combinatory.
Category limits: Columns with more than 50 categories cannot be used for Group By monitoring.
Exclusions and deletions: You can exclude categories at configuration time or delete them later from the Groups table.
Notifications: Notifications are configured at the column level, not yet at the per-category level.
With Group By monitors, you gain more granular visibility into your data quality, while keeping control over compute cost and category management.
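The combinatory growth of categories when grouping by multiple columns can be illustrated with a cartesian product (the column values here are invented):

```python
# Grouping by two columns multiplies the categories: every combination of
# distinct values becomes its own underlying monitor.
from itertools import product

regions = ["EU", "US", "APAC"]     # 3 distinct values
statuses = ["active", "inactive"]  # 2 distinct values

combined = list(product(regions, statuses))
print(len(combined))  # 6 -> one underlying monitor per combination
```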
You can create an incident directly from a check result when an issue has been identified:
On a check page, use the context menu to select Create Incident.
Provide a name and description for the incident.
Select one or multiple related check results that you want to associate with the incident.
Click Save to proceed
Once created, the incident will appear in the Incidents tab of the corresponding Dataset Page
It is possible to filter incidents based on lead, status, reporter, and severity.
Incidents can also be seen in a central place in Soda Cloud. In the top navigation, click on Incidents to see all the incidents of the organization.
Use the filters and the title search to find relevant incidents.
Assign a lead: Every incident requires a lead: the user responsible for resolving the issue.
Update status: Track progress by updating the incident’s status as the investigation and resolution evolve.
Add a resolution note: When marking an incident as resolved, a resolution note is mandatory to document what was done.
Include more check results: If new results are failing, you can include them in the incident.
After any changes, click Save to apply them.
You can integrate incidents with Slack, MS Teams, or other external systems using Soda’s webhook capabilities or the Incidents API. Learn on how to integrate with Soda: Integrations
Filters (dataset-level and check-level) — Optional
Threshold configuration — Optional
Use of variables — Optional
Scheduling — Optional
...and more
Before publishing or verifying your contract, you can run a test command to ensure the contract is correctly defined and points to a valid dataset.
This will:
Validate the YAML syntax
Confirm that the referenced dataset exists
Check for any issues with variables, filters, or structure
Before publishing your contract, you may want to execute it to verify that it runs correctly.
Read more about how to Verify a contract
If you have Soda Cloud, once your contract is finalized and tested, you can publish it to Soda Cloud, making it the authoritative version for verification and scheduling.
This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration.
Learn more about permissions here: Dataset Attributes & Responsibilities
Learn how to connect the CLI to Soda Cloud:
Publishing:
Uploads the contract to Soda Cloud
Makes it available for manual, scheduled, or programmatic verifications
Enables visibility and collaboration through the UI
Once published, the contract becomes the source of truth for the dataset until a new version is published.
You’re now ready to start verifying your contract and monitoring your data.
By default, your existing Soda API key and secret values are used to authenticate to the Soda image registry.
Ensure these values are still present in your values.yaml; no further action is required.
You might also opt to use a new, separate Soda API key and secret to perform the authentication to the Soda image registry.
In this case, ensure the imageCredentials.apikey.id and imageCredentials.apikey.secret values are set to these new values:
If you provide your own imagePullSecrets on the cluster, e.g. when you pull images from your own mirroring image registry, you must modify your existing values file.
The imagePullSecrets property that was present in versions 1.1.x has been renamed to the more standard existingImagePullSecrets.
If applicable to you, perform the following rename in your values file:
For more information on setting up image mirroring, see Mirroring images
If you are a customer using the US instance of Soda Cloud, you'll have to configure your Agent setup accordingly. Otherwise you can ignore this section.
In version 1.2.0 we introduce a soda.cloud.region property that determines which registry and Soda Cloud endpoint to use. Possible values are eu and us. When soda.cloud.region is not set explicitly, it defaults to eu.
If applicable to you, please perform the following changes in your values file:
For more information about using the US region, see Using the US image registry.
The scanlauncher section in the values file has been renamed to scanLauncher.
Please ensure the correct name is used in your values file if you have any configuration values there:
Execute the checks in the published contract
Use the latest available data
Display pass/fail results directly in the UI
This is especially useful for one-off validations, exploratory testing, or during incident investigations.
This action requires the "Manage contract" permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities
To monitor data quality over time, you can set up scheduled verifications directly in the contract editor.
When editing or viewing a contract:
Go to the Schedule section
Choose how often you want the contract to be verified (e.g., hourly, daily, weekly)
Save the schedule
Soda Cloud will automatically run the contract at the specified intervals, using the selected agent. All results are stored and visualized in Soda Cloud, with alerts triggered when rules fail (if configured).
For advanced workflows and full automation, you can verify contracts programmatically using the Soda CLI and a Soda Agent.
This is ideal for:
CI/CD pipelines
Custom orchestration (e.g., Airflow, dbt Cloud, Dagster)
Triggering verifications after data loads
First, create a Soda Cloud configuration file:
This generates a basic config file. Open it and fill in your API key and organization details.
Learn how to Generate API keys
You can test the connection:
Now you can run a verification using the CLI and a remote Soda Agent.
To verify a dataset without pushing the results to Soda Cloud:
This allows you to verify that the contract produces the expected results before pushing results to Soda Cloud.
To verify and also push the results to Soda Cloud:
This makes the verification results available in the UI for stakeholders, triggers notifications, and feeds monitoring dashboards.
This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration. Learn more about permissions here: Dataset Attributes & Responsibilities
Only users with the Manage Attributes permission can create or edit attributes. See Global and Dataset Roles.
To create a new attribute:
Click your profile icon in the top-right corner and select Attributes from the menu.
Click New Attribute.
Provide a Label for the attribute. Note that a unique name will be generated from this label. This name is immutable and is used in Data Contract definitions to reference the attribute.
Select the Resource Type where the attribute applies: Dataset or Check
Choose the Type of attribute: Single select, Multi select, Checkbox, Text, Number, Date
Add a Description for context.
Click Save
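The steps above note that an immutable name is generated from the label. The exact rule isn't documented here, so the following slugification is only an assumption for illustration:

```python
# Hypothetical sketch of deriving an attribute name from its label.
# The real generation rule used by Soda Cloud may differ.
import re

def attribute_name(label: str) -> str:
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")

print(attribute_name("Business Domain"))  # business_domain
```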
To edit an attribute, use the context menu next to the attribute name and select Edit Attribute.
Note that the name property and the assigned resource type cannot be changed.
Learn how to set attributes for datasets: Dataset Attributes & Responsibilities
Attributes for checks will be defined as part of the Data Contract.
Learn how to set attributes for datasets:
Authoring in Soda Cloud:
Data Contract as code:
Once an attribute has been assigned at least once, either to a dataset or a check, it becomes available as a filter in Soda Cloud. Attributes that have not yet been used will not appear in filter options.
Test the data source connection:
pip install -i https://pypi.dev.sodadata.io/simple -U soda-athena

type: athena
name: my_athena
connection:
  catalog: ${env.ATHENA_CATALOG}
  access_key_id: ${env.ATHENA_ACCESS_KEY_ID}
  secret_access_key: ${env.ATHENA_SECRET_ACCESS_KEY}
  staging_dir: ${env.ATHENA_STAGING_DIR}
  region_name: ${env.ATHENA_REGION}
  work_group: ${env.ATHENA_WORKGROUP}
  # role_arn: <my_role_arn>
  # profile_name: <my_aws_profile>
  # session_token: <my_session_token>

Data flow and data source reference – Understand how Soda interacts with other systems and manage exceptions.
Each section includes practical, example-based documentation structured to help data engineers, analysts, and platform teams apply Soda in real-world use cases.




Data testing is the practice of validating that your data meets the expectations you’ve defined for it before it reaches stakeholders, dashboards, or downstream systems. Just like software testing ensures your code behaves as intended, data testing safeguards the quality and reliability of your data.
At Soda, we see data testing as the foundation of data trust. Whether you’re verifying row counts, checking for missing or invalid values, or enforcing schema integrity, the goal is the same: catch issues early, reduce incidents, and keep your data consumers confident.
A Data Contract is a formal agreement between data producers and data consumers that defines what “good data” looks like. It sets expectations about schema, freshness, quality rules, and more, and makes those expectations explicit and testable.
With a data contract in place, producers commit to delivering data that meets certain standards. Consumers, in turn, can rely on that contract to build reports, models, or pipelines without second-guessing the data.
At Soda, Data Contracts are testable artifacts that can be authored, versioned, verified, and monitored, whether in code or in the UI. They’re the connective tissue between producers and consumers, aligning teams and eliminating ambiguity.
Defining a contract is only the first step. Verifying that your data actually meets the expectations is where the value is realized. Contract verification is the process of testing whether the data in your datasets aligns with the rules, thresholds, and schema defined in the contract.
At Soda, contract verification is fully automated. Whether triggered manually, on a schedule, or as part of your CI/CD pipelines, each verification run checks that:
The schema matches the contract definition (columns, data types, structure)
The data complies with checks like missing, duplicate, invalid values, and custom rules
This helps you catch issues early, ensure data quality over time, and build trust across your organization.
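A minimal sketch, not Soda's actual engine, of the two verification aspects listed above, schema conformance and a simple missing-values check:

```python
# Illustrative only: compare a contract's declared schema against the dataset,
# then count missing values in a column. Names and data are invented.
expected_schema = {"id": "int", "email": "str"}  # from the contract
actual_schema = {"id": "int", "email": "str"}    # discovered from the dataset

rows = [{"id": 1, "email": "a@x.io"}, {"id": 2, "email": None}]

schema_ok = expected_schema == actual_schema
missing_emails = sum(1 for r in rows if r["email"] is None)

print(schema_ok, missing_emails)  # True 1
```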
Soda supports two complementary ways to author and manage data contracts. They are designed to fit the way your team works.
If you’re a data analyst, product owner, or stakeholder, who prefers intuitive interfaces over code, Soda Cloud is the ideal workspace.
With the Soda Cloud UI, you can:
Browse datasets and view profiling insights
Define a contract with a no-code Editor
Schedule and monitor contract verifications
Collaborate with your team and publish contracts with a click
There’s no setup or YAML required, just fast, visual workflows that enable domain experts to contribute directly to data quality.
If you live in your terminal and manage your data pipelines as code, you’ll want to use Soda Core and the Soda CLI.
With this setup, you can:
Define contracts in YAML
Run contract verifications in CI/CD
Push the contract and verification results to Soda Cloud for visibility
Use Git as the source of truth for version control, collaboration, and reviews
This path offers full control, transparency, and seamless integration into your dev tooling.
Soda gives you the flexibility to blend both approaches. For example, non-technical users can define or adjust contracts visually in Soda Cloud for the datasets they manage, while engineers can use Git-managed contracts for the datasets they own.
This hybrid model enables collaboration across teams:
Business users bring domain expertise directly into the contract
Engineers maintain quality, consistency, and governance
Each dataset follows the authoring method that best suits the team responsible for it
You can mix and match—using the UI for some contracts, and code for others—depending on your team's structure and preferences.
And even if Data Contracts are managed in Git, you can still involve non-technical users who can propose changes to a contract in the UI. These approved changes can be embedded into engineering workflows and synced to Git, ensuring that every update follows your organization’s quality and deployment standards.
Choose the model, or combination of models, that works best for your organization.
Once a contract is published, you’ll want to verify that the actual data meets the contract’s expectations. This verification can be done in two ways:
Soda Agent is our managed runner that lives in your environment and connects to Soda Cloud. It handles contract verification, scheduling, and execution securely, without exposing your data externally. It is great for teams who want central management without maintaining CLI infrastructure.
Soda Core is our open-source engine you can run anywhere: locally, in CI, or data pipeline. It’s lightweight, customizable, and great for teams that prefer full control or have strict environment constraints.
Both approaches support the same Data Contract logic. Choose the one that best fits your deployment model.
This page explains Record-level Anomaly Detection (RAD) and Soda's anomaly detection capabilities through RAD.
Coming soon!
RAD functionalities will be available soon for Enterprise plan users.
Ensuring data quality can be difficult, especially when you need broad coverage quickly. Checks and column monitors are great for enforcing specific rules, but they take time to set up and require a deep understanding of your data. Soda’s Record-level Anomaly Detection (RAD) helps you get started fast, providing instant coverage across all columns, rows, and segments, without any configuration.
The algorithm analyzes historical data to build a clear picture of what normal data is supposed to look like. When incoming rows show unusual patterns, unexpected values, inconsistencies, or errors, RAD automatically triggers an alert and runs a Root Cause Analysis to pinpoint the issue. This provides quick, actionable insights while you work toward more detailed control using checks and column monitors.
Instant, broad coverage Monitor all columns, rows, and segments at once, detecting both known and unknown issues.
No configuration needed Get started immediately: no metrics or checks need to be defined. RAD automatically determines which columns to use.
One metric to track and alert on The Record-level Drift Score provides a single, explainable metric to monitor data health.
Order of operations to achieve the best coverage in the most efficient way:
Firstly: always begin with high-level monitors to verify that the right amount of data arrived on time and in the correct format. These require no configuration; they just need to be enabled.
Secondly, RAD: apply Record-level Anomaly Detection to validate the actual content of the data. This step also requires no configuration (only enablement) and provides broad coverage across all columns and segments.
Next: apply column-level monitoring for specific use cases where the potential data quality issue and metric are known but expected to change over time. These should be kept to a minimum, as they are prone to generating false alerts.
For a dataset to be monitored by RAD, the following conditions must be met:
Time partition column: the dataset must include a column that marks when records arrive (for example, created_at).
Primary key: the dataset must have a primary key to uniquely identify rows.
Diagnostics Warehouse setup: a Diagnostics Warehouse must be configured to store the daily sample, consisting of either primary keys or, ideally, a full copy of the sampled rows.
Next: to enable Record-level Anomaly Detection in your organization, reach out to Soda.
This page describes the bi-directional integration between Soda and Collibra.
The Soda↔Collibra optimized integration synchronizes data quality checks from Soda to Collibra, creating a unified view of your data quality metrics. The implementation is optimized for performance, reliability, and maintainability, with support for bi-directional ownership sync and advanced diagnostic metrics.
High Performance: 3-5x faster execution through caching, batching, and parallel processing
Custom Attribute Syncing: Flexible mapping of Soda check attributes to Collibra attributes for rich business context
Ownership Synchronization: Bi-directional ownership sync between Collibra and Soda
Deletion Synchronization: Automatically removes obsolete check assets from Collibra when checks are deleted in Soda
Multiple Dimensions Support: Link checks to multiple data quality dimensions simultaneously
Monitor Exclusion: Option to exclude Soda monitors from synchronization, focusing only on data quality checks
Diagnostic Metrics Processing: Automatic extraction of diagnostic metrics from any Soda check type with intelligent fallbacks
Robust Error Handling: Comprehensive retry logic and graceful error recovery
Advanced Monitoring: Real-time metrics, performance tracking, and detailed reporting
CLI Interface: Flexible command-line options for different use cases
Backward Compatibility: Legacy test methods preserved for smooth migration
For technical details on how to configure the bi-directional Collibra integration, head to the configuration guide.
Python 3.10+ required
Valid Soda Cloud API credentials
Valid Collibra API credentials
Properly configured Collibra asset types and relations
Smart Filtering: Only processes datasets marked for synchronization
Parallel Processing: Handles multiple operations concurrently
Caching: Reduces API calls through intelligent caching
Batch Operations: Groups similar operations for efficiency
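The integration's source isn't shown here, but the caching and batching ideas can be sketched generically. The function name and the "asset type" example below are made up for illustration:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fetch_asset_type(name):
    # In the real integration this would call the Collibra API;
    # caching means each distinct asset type is fetched at most once.
    print(f"API call for {name}")
    return {"name": name}

def batched(items, size):
    # Group operations so many assets can be sent in a single request.
    for i in range(0, len(items), size):
        yield items[i:i + size]

fetch_asset_type("Table")
fetch_asset_type("Table")   # served from cache, no second API call
print(list(batched([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```

Combining memoized lookups with batched writes is the standard way to cut round-trips against a REST API like Collibra's.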
For each check in a dataset:
Bulk Creation/Updates: Processes multiple assets simultaneously
Duplicate Handling: Intelligent naming to avoid conflicts
Status Tracking: Monitors creation vs. update operations
Standard Attributes: Evaluation status, timestamps, definitions
Diagnostic Metrics: Automatically extracts and calculates diagnostic metrics from check results
Custom Attributes: Flexible mappings for business context
Batch Updates: Groups attribute operations for performance
Dimension Relations: Links checks to data quality dimensions
Table/Column Relations: Creates appropriate asset relationships
Error Recovery: Graceful handling of missing or ambiguous assets
Collibra to Soda Sync: Automatically syncs dataset owners from Collibra to Soda
User Mapping: Maps Collibra users to Soda users by email address
Error Handling: Tracks missing users and synchronization failures
Metrics Tracking: Monitors successful ownership transfers
Retry Logic: Exponential backoff for transient failures
Rate Limiting: Intelligent throttling to avoid API limits
Error Aggregation: Collects and reports all issues at the end
Graceful Degradation: Continues processing despite individual failures
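Retry with exponential backoff, as listed above, follows a standard pattern. A minimal sketch (the delays, attempt count, and exception type are illustrative, not the integration's actual settings):

```python
import time

def with_retries(fn, max_attempts=4, base_delay=0.01):
    """Retry fn with exponentially growing delays between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 10ms, 20ms, 40ms, ...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky))  # ok (succeeds on the third attempt)
```

Production versions usually add jitter to the delay and honor API rate-limit headers, which is what the "Rate Limiting" bullet refers to.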
Head to the configuration guide to learn how to integrate Collibra.
Profiling provides a quick and comprehensive overview of a dataset’s structure and key statistics.
Profiling helps you understand the shape, quality, and uniqueness of your data before creating checks or metric monitors.
With profiling, you can explore metadata about your dataset, such as column names, data types, distinct counts, null counts, and summary statistics. You can also quickly search for specific columns to focus on the attributes that matter most to your analysis.
Profiling is useful for:
Business teams: Gain a fast understanding of what’s inside a dataset, its completeness, and potential anomalies.
Data teams: Validate schema, data types, and distributions before writing quality tests or transformations.
Data owners: Quickly identify unexpected values, nulls, or structural changes in a dataset.
Dataset overview: Displays a structured view of all columns, their types, and counts.
Interactive navigation: Scroll through the dataset structure or jump directly to a column of interest.
Search and filter: Quickly locate a column by name to review its profiling details.
Column-level insights:
You can enable Profiling during .
If you want to enable Profiling on an existing dataset, follow the next steps:
Click on Datasets > The dataset of your choosing
Navigate to the Columns tab in the dataset view
Click on Update Profiling Configuration
Once Profiling has been enabled, you can configure it to adapt to your organization's needs.
1. Choose a Profiling schedule
Profiling happens every 24 hours. Choose a UTC time from the dropdown menu to pick a specific hour when the scan will be scheduled.
2. Choose a Profiling strategy
Use sampling: To perform Profiling, Soda will use a sample of up to 1 million rows from the dataset.
Use a time window: To perform Profiling, Soda will use data present in a 30-day time window, based on the dataset time-partition column.
3. Click on Finish
Now, Profiling will be scheduled.
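The two strategies correspond to two ways of selecting the rows that profiling reads. A sketch under the limits stated above (1 million rows, 30-day window); the row shape and selection logic are illustrative, not Soda's internals:

```python
import random
from datetime import datetime, timedelta

def sample_rows(rows, limit=1_000_000, seed=0):
    """'Use sampling': take at most `limit` randomly chosen rows."""
    if len(rows) <= limit:
        return rows
    random.seed(seed)
    return random.sample(rows, limit)

def window_rows(rows, now, days=30):
    """'Use a time window': keep rows whose time-partition column is recent."""
    cutoff = now - timedelta(days=days)
    return [r for r in rows if r["created_at"] >= cutoff]

now = datetime(2024, 6, 1)
rows = [{"created_at": datetime(2024, 5, 20)},
        {"created_at": datetime(2024, 3, 1)}]
print(len(window_rows(rows, now)))  # 1 -> only the row inside the 30-day window
```

Sampling bounds compute cost regardless of table size; the time window instead bounds it by recency, which suits datasets with a reliable time-partition column.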
If you wish to disable column profiling at the organization level, you must possess Admin privileges in your Soda Cloud account. Once confirmed, follow these steps:
Navigate to your avatar.
Click on Organization settings.
Uncheck the box labeled Allow Soda to collect column profile information.
When you open Profiling for a dataset:
Soda runs a lightweight scan of the dataset’s metadata and a sample of the data (depending on configuration).
It calculates summary statistics for each column.
Results are displayed in the Profiling view for exploration.
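The per-column statistics described here are straightforward to reproduce; a minimal sketch for one numeric column (field names are illustrative):

```python
def profile_column(values):
    """Compute the kind of summary statistics a profiler reports."""
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "missing": len(values) - len(non_null),   # null count
        "distinct": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "mean": sum(non_null) / len(non_null),
    }

print(profile_column([10, 20, 20, None, 40]))
# {'row_count': 5, 'missing': 1, 'distinct': 3, 'min': 10, 'max': 40, 'mean': 22.5}
```

In practice Soda pushes these aggregations down to the data source as SQL rather than pulling rows into memory, which is why profiling stays lightweight.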
Soda can only profile columns that contain NUMBERS or TEXT type data; it cannot profile columns that contain TIMESTAMP data except to create a freshness check for the anomaly dashboard.
Soda performs the Discover datasets and Profile datasets actions independently, relative to each other. If you define exclude or include rules in the Discover tab, the Profile configuration does not inherit the Discover rules. For example, if, for Discover, you exclude all datasets that begin with staging_, then configure Profile to include all datasets, Soda discovers and profiles all datasets.
After reviewing profiling results, you can:
Create tests based on profiling insights (e.g., "column should not have nulls").
Set up monitors to track data quality over time.
Export profiling information to support documentation and governance processes.
Configure Soda Cloud to connect your account to Slack so that you can:
Send notifications for failed or warning check results to Slack channels
Start conversations to track and resolve data quality Incidents with Slack channels
Only users with the Manage Notification Rules permission can create or edit rules. All users can view rules.
In Soda Cloud, navigate to your avatar > Organization Settings, then navigate to the Integrations tab and click the + icon to add a new integration.
Choose Slack and proceed
Follow the guided steps to authorize Soda Cloud to connect to your Slack workspace. If necessary, contact your organization’s Slack Administrator to approve the integration with Soda Cloud.
Configuration tab: select the public channels to which Soda can post messages; Soda cannot post to private channels.
Note that Soda caches the response from the Slack API, refreshing it hourly. If you created a new public channel in Slack to use for your integration with Soda, be aware that the new channel may not appear in the Configuration tab in Soda until the hourly Slack API refresh is complete.
Scope tab: select the Soda features (alert notifications and/or incidents) that can access the Slack integration.
You can use this integration to enable Soda Cloud to send alert notifications to a Slack channel to notify your team of warn and fail check results.
With this integration, users can select a Slack channel as the destination for alert notifications for an individual check, for checks that form part of an agreement, or for multiple checks at once.
To send notifications that apply to multiple checks, see the documentation on notification rules.
You can use this integration to notify your team when a new incident has been created in Soda Cloud. With such an integration, Soda Cloud displays an external link to an incident-specific Slack channel in the Incident Details.
Selecting the right scan time is essential for accurate data monitoring and reliable metric collection. Scans that occur too early may run before the data has been fully loaded into the database, leading to false positives or misleading results. This guide outlines how to determine the best scan time based on your data load patterns and operational needs.
Scans can be scheduled to occur anywhere from hourly to weekly. The available intervals fit a 24-hour cycle that matches the hourly and daily seasonality of how people organize their day, so metric monitoring can run at any supported interval from 1 hour up to 1 week.
Currently, this feature is only supported in Snowflake data sources.
When testing a data contract, Soda allows you to run contract validation on a sample of your dataset instead of the full data. This feature helps you quickly and cost-efficiently verify that your contract runs correctly before executing full scans.
As of July 2025, the container images required for the self-hosted Soda agent will be distributed using private registries, hosted by Soda.
EU cloud customers will use the EU registry located at registry.cloud.soda.io. US cloud customers will use the US registry located at registry.us.soda.io.
The images currently distributed through Docker Hub will stay available there. New releases will only be available in the Soda-hosted registries.
Existing or new Soda Cloud API keys can be used to authenticate to the Soda-hosted registries, starting from version 1.2.0 of the soda-agent Helm chart.
Soda offers flexible deployment options to suit your team’s infrastructure, scale, and security needs. Whether you want to embed Soda directly into your pipelines, use a centrally managed deployment, or rely on Soda’s fully-hosted solution, there’s an option for you.
This guide provides an overview of the three main deployment options: Soda Python Libraries, Soda-hosted Soda Agent, and Self-hosted Soda Agent, to help you choose the right setup for your organization.
Learn more about how to use your own SQL queries to build custom SQL Metric Monitors.
Custom SQL Monitors are available for Enterprise users and can be configured via both the Soda Cloud UI and the Monitoring Configuration API.
Custom SQL Monitors enable you to define monitoring logic using your own SQL queries. This is ideal when built-in Soda metrics or anomaly checks don’t meet your needs; for example, when you must aggregate data across multiple tables, compute ratios, or detect anomalies in grouped datasets.
A Custom SQL Monitor runs your SQL query against your connected data source and evaluates its results on a schedule, just like any other Soda monitor.
This feature can be used to:
```shell
soda contract test --data-source ds.yml --contract contract.yaml
soda contract publish --contract contract.yaml --soda-cloud sc.yml
```

```yaml
soda:
  # These values will also be used to authenticate to the Soda image registry
  apikey:
    id: existing-key-id
    secret: existing-key-secret
```

```yaml
soda:
  apikey:
    id: existing-key-id
    secret: existing-key-secret
  imageCredentials:
    apikey:
      id: my-new-key-id
      secret: my-new-key-secret
```

```yaml
soda:
  apikey:
    id: ***
    secret: ***
# This is no longer supported
# imagePullSecrets:
#   - name: my-existing-secret
# Instead, use this!
existingImagePullSecrets:
  - name: my-existing-secret
```

```yaml
soda:
  apikey:
    id: ***
    secret: ***
  cloud:
    # This also sets the correct endpoint under the covers.
    region: "us"
    # This can be removed now, as the region property sets this up correctly.
    # endpoint: https://cloud.us.soda.io
```

```yaml
soda:
  apikey:
    id: ***
    secret: ***
  # Rename this ...
  # scanLauncher:
  # to become
  scanLauncher:
    existingSecrets:
      - soda-agent-secrets
```

```shell
soda cloud create -f sc.yml
soda cloud test -sc sc.yml
soda contract verify --dataset datasource/db/schema/table --use-agent --soda-cloud sc.yml
soda contract verify --dataset datasource/db/schema/table --publish --use-agent --soda-cloud sc.yml
soda data-source test -ds ds.yml
soda cloud create -f sc.yml
soda cloud test -sc sc.yml
pip install -i https://pypi.dev.sodadata.io "soda-postgres>=4.0.0.dev1" -U
soda data-source create -f ds.yml
```

```yaml
type: postgres
name: postgres
connection:
  host: host_name
  port: 5432
  user: user_name
  password: ${env.SODA_DEMO_POSTGRES_PW}
  database: db_name
```

```shell
soda data-source test -ds ds.yml
soda contract verify --data-source ds.yml --contract contract.yaml
soda contract verify --data-source ds.yml --contract contract.yaml --set START_DATE=2024-05-01
soda contract verify --data-source ds.yml --contract contract.yaml --publish --soda-cloud sc.yml
soda contract verify --contract contract.yaml --use-agent --soda-cloud sc.yml
soda contract verify --contract contract.yaml --use-agent --soda-cloud sc.yml --set START_DATE=2024-05-01
soda contract verify --dataset datasource/db/schema/table --publish --use-agent --soda-cloud sc.yml
```

Collaborate with non-technical users that use Soda Cloud and integrate with engineering workflows via Git
Prioritize what matters: use the Record-level Drift Score consistently across datasets and data sources to rank and focus on the most critical data quality issues.
Reduce false alerts: traditional column-level monitoring increases the risk of false positives with every additional monitor. With RAD, you only need one anomaly detection monitor per dataset, minimizing noise.
Optimize compute usage: monitoring a single metric per dataset lowers computational overhead. Additionally, RAD can work with sampled data, further reducing processing demands.
Built‑in root cause analysis Quickly understand what changed and why.
Native support for backfilling and back‑testing Automatically generate and assess historical Record-level Drift Scores to review past data quality trends.
Lastly, checks: use checks for critical tables where expectations are clearly defined. For example:
| Monitor type | Issue | Metric | Threshold | Example |
| --- | --- | --- | --- | --- |
| RAD monitor | Unknown | Unknown | Unknown | RAD on all columns |
| Checks | Known | Known | Known | Missing values in Amount < 5% |
| Column monitors | Known | Known | Unknown | Anomaly detection on Amount for missing values |
```shell
# Run the integration with default settings
python main.py

# Run with debug logging for troubleshooting
python main.py --debug

# Use a custom configuration file
python main.py --config custom.yaml

# Show help and all available options
python main.py --help
```

```shell
# Run legacy Soda client tests
python main.py --test-soda

# Run legacy Collibra client tests
python main.py --test-collibra

# Run with verbose logging (info level)
python main.py --verbose
```

Statistics
Column name
Column data type
Number of distinct values
Number of missing (null) values
Minimum, maximum, mean (for numeric columns)
Length, patterns, or categories (for text columns)
Histogram for numeric columns
Frequent values
Extreme values, for numeric columns
Data checks that exist for this column
Supported scan intervals: 1 h, 2 h, 3 h, 4 h, 6 h, 8 h, 12 h, 1 day, or 1 week.
When is the database load expected to be complete?
Determine when the relevant tables or datasets are expected to be fully loaded.
Factor in common variances: if a load is expected to complete by 00:00 UTC but occasionally finishes at 00:10 UTC, account for the expected, albeit sporadic, delay.
Knowing this helps avoid scanning too early and capturing incomplete data.
When is a delayed load considered late or "problematic"?
If data arriving by 02:30 UTC is still valid for monitoring purposes, it may be better to delay the scan to reduce false alerts.
Scanning immediately after the earliest expected load time is not always necessary.
Understanding what qualifies as "late data" helps define the tolerance window for scan timing.
How fast after the load can someone respond to issues flagged by monitors?
If nobody can take action until 09:00 UTC, scanning earlier may not be useful unless scans feed downstream processes or dashboards.
Choose a scan time that aligns with both data readiness and team readiness.
Consistency is key
Running scans at the same time every day allows you to build a reliable baseline of expected behavior. This helps surface anomalies clearly when something deviates from the norm.
Scan frequency: daily
Expected load completion: 00:00 UTC
Occasional load delay: up to 00:10 UTC
Team available from: 08:00 UTC
| Strategy | Scan time | Rationale |
| --- | --- | --- |
| Minimal buffer | 00:15 UTC | Captures data soon after load with minor delay tolerance. |
| Conservative buffer | 01:30 UTC | Allows extra time for delayed loads, reduces risk of false positives. |
| Operationally aligned | 07:30 UTC | Ensures scan results are fresh and complete when the team starts reviewing. |
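The buffer arithmetic behind these options is simple; a sketch that derives a scan time from the expected load completion plus a safety buffer (the helper is hypothetical, not a Soda feature):

```python
from datetime import datetime, timedelta

def scan_time(load_complete_utc, buffer_minutes):
    """Expected load completion plus a safety buffer, as HH:MM UTC."""
    base = datetime.strptime(load_complete_utc, "%H:%M")
    return (base + timedelta(minutes=buffer_minutes)).strftime("%H:%M")

print(scan_time("00:00", 15))   # 00:15 -> minimal buffer
print(scan_time("00:00", 90))   # 01:30 -> conservative buffer
```

Pick the buffer from the variance you observe in load completion times: the occasional 10-minute delay in the example scenario makes 15 minutes the smallest sensible buffer.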
When scanning large volumes of tables:
It is acceptable to configure scans for the same scheduled time (e.g. 00:00 UTC).
Scans scheduled in large volumes (thousands of tables) may be configured to run at the same logical time; the system distributes execution based on queuing and available resources, so the actual execution is staggered.
Historical metric collection scans (for metric baseline backfilling) run only once at configuration time.
These scans are not governed by the scan schedule. They occur once and they are typically the most resource-intensive.
Consistency is key: Using the same scan time daily establishes a stable baseline for anomaly detection.
Early scans should be avoided: Scheduling scans before the last acceptable load time is not recommended unless business needs require it.
Time zones should be centralized: Aligning scan time with the database time zone is ideal, especially when your time partitioning column is based on the insert/load time in that time zone.
Monitoring and adjusting: If load patterns or SLAs change, scan times should be revisited and adjusted accordingly.
Running a test contract on a sample enables you to:
Validate that your contract syntax, checks, and filters work as expected.
Reduce data warehouse compute cost while verifying new or updated contracts.
Iterate faster on contract definitions in development environments.
Results from sampled runs reflect only a subset of your data and may not represent its actual quality. Use full verification once your contract logic is validated.
This feature can be enabled at the data source level, applying to all datasets that use that connection.
You need the "Manage data sources" global permission to add a new data source. Learn more about Global and Dataset Roles.
To enable this feature:
Go to Data sources.
Click Edit connection for a data source.
Under the Connection Details section, toggle Data Sampling.
Specify your sample size in the Limit field.
Click Connect.
Currently available in preview. This feature is only supported in Snowflake data sources.
When connecting to Snowflake, you must provide a warehouse as part of the data source configuration. By default, this single warehouse is used for all operations, including discovery, metric monitoring, profiling, data contract executions, and the diagnostics warehouse.
The Configure warehouses per dataset feature gives you greater control and flexibility by allowing you to define specific warehouses for individual datasets. This helps you optimize cost, manage compute workloads, and allocate resources efficiently across your data operations.
You need the “Manage data sources” global permission to enable or modify this feature. Learn more about Global and Dataset Roles.
Go to Data sources in Soda Cloud.
Click Edit connection for your Snowflake data source.
Toggle on Configure Warehouses.
Specify the list of allowed warehouses that can be used by this connection.
Choose a default warehouse to use for all datasets unless otherwise specified.
Click Save on the top right to save your configuration.
Once enabled:
The warehouse specified in the data source connection is used for discovery.
The default warehouse (defined under Configure Warehouses) is used for:
Metric monitoring
Profiling
Data contract executions
Diagnostics Warehouse operations
A different warehouse can be configured at the dataset level, overriding the default.
You need the “Configure dataset” permission to edit dataset-level configurations. Learn more about Global and Dataset Roles.
Go to a dataset in Soda Cloud.
Click Edit dataset.
Under the Snowflake section, select the warehouse to use for this dataset.
Click Save to apply your changes.
In order to enjoy the latest features Soda has to offer, please upgrade any self-hosted Soda agent you manage using one of the following guides.
Follow the self-hosted Soda agent upgrade or redeployment guides. Don't execute the final helm install or helm upgrade step yet.
Ensure you retrieve the soda.apikey.id and soda.apikey.secret values first, by using helm get values -n <namespace> <release_name>.
Now pass these values back to the upgrade command via the CLI
or by using a values file:
Ensure you have a new API key id and secret by following the API key creation guide .
Follow the self-hosted Soda agent upgrade or redeployment guides. Don't execute the final helm install or helm upgrade step yet.
Now pass the API keys to use for registry access in the upgrade command via the CLI, using the imageCredentials.apikey.id and imageCredentials.apikey.secret properties.
Note that we're also still passing the soda.apikey.id and soda.apikey.secret values, which are still required for the agent to authenticate to Soda Cloud.
Or when using a values file:
You can also use a self-managed, existing secret to authenticate to the Soda-hosted or your own self-hosted private container registry, e.g. when mirroring container images.
You can refer to existing secrets as follows for the CLI:
Or using a values file:
When you're onboarded on the US region of Soda Cloud, you'll have to use the container registry associated with that region.
You can alter the soda.cloud.region value to automatically render the correct container registry and Soda Cloud API endpoint. Simply follow any of the above instructions and include the soda.cloud.region value.
To do so in the CLI:
Or using a values file:
If you want to mirror the Soda images into your own registries, you'll need to login to the appropriate container registry. This will allow you to pull the images into your custom container image registry.
The following values.yaml file illustrates the changes required for the Helm release to work with mirrored images:
Your existing Soda agent deployments will continue to function.
However, an agent that is not upgraded will not be able to support features like collaborative data contracts and the fully revamped metric monitoring.
The images hosted on Docker Hub, required to run the self-hosted agent, will remain there in their current state for a grace period of 6 months. There will be no further maintenance (updates, bug fixes, security patches) for the old self-hosted agent versions.
Soda-hosted Soda Agent
Fully-managed Soda Agent, hosted by Soda.
Teams seeking a simple, managed solution for data quality.
Centralized data source access
No setup required
Observability features enabled
Enables users to create, test, execute, and schedule contracts and checks directly from the Soda Cloud UI.
Required for observability features. Cannot scan in-memory sources like Spark or DataFrames.
Available for Free, Team, and Enterprise plans.
Self-hosted Soda Agent
Same as Soda-hosted Soda Agent, but deployed and managed in your own Kubernetes environment.
Teams needing full control over infrastructure and deployment.
Similar to Soda-hosted Agent, but deployed within the customer’s environment; data stays within your network.
Full control over deployment
Integration with secrets managers
Customization to meet your organization’s specific requirements
Required for observability features. Cannot scan in-memory sources like Spark or DataFrames. Kubernetes expertise required.
Available for the Enterprise plan; contact us for access.
Soda Python Libraries
Open-source Python library (with commercial extensions) for programmatic configuration and enforcement of data contracts in your pipelines.
Data engineers integrating Soda into custom workflows.
Full control over orchestration
In-memory data support
Contract verification
No observability features. Required for in-memory sources (e.g., Spark, DataFrames). Data source connections managed at the environment level.
Soda-hosted Soda Agent is a fully-managed deployment of the Soda Agent, hosted by Soda in our infrastructure. It allows you to connect to your data sources and manage data quality directly from the Soda Cloud UI without any infrastructure setup on your end. You need only whitelist the IP address of the Soda-hosted agent so that it can connect to your data.
Key points:
No setup or management required. Soda handles deployment and scaling.
Data source connections are centralized in Soda Cloud, and users can leverage the Soda Agent to execute scans across those data sources.
Enable observability features in Soda Cloud, such as profiling, metric monitoring, and anomaly detection.
Enables users to create, test, execute, and schedule contracts and checks directly from the Soda Cloud UI.
Onboard your datasets in Soda Cloud with Soda-hosted agent: Onboard datasets on Soda Cloud
The Self-hosted Agent offers the same capabilities as the Soda-hosted Agent, but it is deployed and managed by your team within your own Kubernetes environment (e.g., AWS, GCP, Azure). This model provides full control over deployment, infrastructure, and security, while enabling the same centralized data source access and Soda Cloud integration for scans, contract execution, and observability features.
Learn how to deploy the Self-hosted Soda Agent: Deploy Soda Agent.
Onboard your datasets in Soda Cloud with self-hosted agent: Onboard datasets on Soda Cloud.
Soda Core is an open-source Python library and CLI that allows you to embed Soda directly in your data pipelines. You can orchestrate scans using your preferred orchestration tools or pipelines, and execute them within your own infrastructure. Additional commercial extensions are available via extension packages, such as soda-groupby, soda-reconciliation, etc.
See detailed installation instructions here: Install Soda Python Libraries
Key points:
Ideal for teams who want full control over scan orchestration and execution.
Data source connections are configured and managed at the environment level.
Required for working with in-memory data sources like Spark and Pandas DataFrames.
Define metrics using custom aggregations or joins.
Compute grouped results (e.g., GROUP BY customer, institution, or region).
Apply filters, CTEs, and where clauses to narrow down data.
Integrate results with notification rules to alert your team when certain conditions are met.
Example scenario: an organization needs to monitor daily incidents per borough and reason in their Bus Breakdowns and Delays dataset, and flag unusual spikes/drops via notification rules.
The goal is to know which boroughs are the ones suffering the most incidents and why that's happening.
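The grouped metric from this scenario can be expressed as plain SQL. Here it runs against an in-memory SQLite table with made-up sample data; the table and column names mirror the example, not the real Bus Breakdowns dataset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE breakdowns (boro TEXT, reason TEXT)")
conn.executemany(
    "INSERT INTO breakdowns VALUES (?, ?)",
    [("Bronx", "Heavy Traffic"), ("Bronx", "Heavy Traffic"),
     ("Queens", "Mechanical"), ("Bronx", "Mechanical")],
)

# One incident_count per (Boro, Reason) pair, as the monitor would compute.
rows = conn.execute(
    """SELECT boro, reason, COUNT(*) AS incident_count
       FROM breakdowns
       GROUP BY boro, reason
       ORDER BY incident_count DESC, boro, reason"""
).fetchall()
print(rows)  # [('Bronx', 'Heavy Traffic', 2), ('Bronx', 'Mechanical', 1), ('Queens', 'Mechanical', 1)]
```

A Custom SQL Monitor would run a query of this shape on schedule against the connected data source and track incident_count per group over time.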
Enterprise plan.
A dataset connected in Soda Cloud.
An API token with permission to author monitors (if creating monitors via the Monitoring Configuration API).
Navigate to Datasets → Custom Monitors at the bottom of the page. Click on Add Column Monitor.
Name your custom monitor and provide the custom SQL query. In this case, we are monitoring incident count by borough and reason.
Provide a Result metric and a Valid range, and define a Threshold strategy.
In this case, the result metric is incident_count; we want to group by Boro and Reason, and the valid range cannot be negative, so the minimum value is 0. Both Upper range and Lower range anomaly detection are enabled to catch unusual spikes/drops per group.
Click on Add Monitor on the top right.
The monitor will now be visible at the bottom of the Metric Monitoring dashboard.
This monitor will:
Run daily and compute incident_count for every (Boro, Reason) pair within the partitioned time window.
Store grouped results so you can see which areas and causes are trending.
Trigger notifications (based on your organization’s notification rule) when anomaly detection flags a group.
List of all the variables currently supported using ${soda.<variable>} syntax:
SCAN_TIME: time for which the scan is running; has the same value as PARTITION_END_TIME (note this is different from when the scan is running)
PARTITION_COLUMN: column used to perform time-based partitioning
PARTITION_START_TIME: start time for the partition time window
PARTITION_END_TIME: end time for the partition time window
PARTITION_INTERVAL: duration of the partition time window
TABLE: qualified name of the table being analyzed, e.g. "my-schema"."my-table"
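How Soda expands these placeholders internally isn't documented here, but the effect on a query can be illustrated with a simple string substitution. The query below is hypothetical; only the ${soda.<variable>} names come from the list above:

```python
import re

def render(query, variables):
    """Replace ${soda.NAME} placeholders with supplied values (illustrative)."""
    return re.sub(
        r"\$\{soda\.(\w+)\}",
        lambda m: str(variables[m.group(1)]),
        query,
    )

query = ("SELECT COUNT(*) FROM ${soda.TABLE} "
         "WHERE ${soda.PARTITION_COLUMN} >= '${soda.PARTITION_START_TIME}'")
print(render(query, {
    "TABLE": '"my-schema"."my-table"',
    "PARTITION_COLUMN": "created_at",
    "PARTITION_START_TIME": "2024-05-01 00:00:00",
}))
```

Writing queries against the partition variables rather than hard-coded dates is what lets the same monitor query run correctly for every scheduled time window.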
Note: Set use_context_auth=True to use application default credentials, in which case account_info_json or account_info_json_path are not necessary.
See BigQuery's locations documentation to learn more about location.
Test the data source connection:
Learn more about Metric Monitors that run scans at a column level.
A column monitor in Soda tracks a specific statistical metric for a given column over time. It helps detect unusual patterns or unexpected changes in column behavior, such as spikes in missing values or shifts in averages.
You can find column monitors by opening the Metric Monitors tab on any dataset and scrolling to the bottom of the page. This section lists all active column monitors in a structured, searchable view. The list can be sorted by recency or by the number of detected anomalies, allowing you to quickly focus on the most relevant issues.
Unlike dataset-level monitors, which can be applied at the data source level, column monitors are configured at the dataset level and are tailored to specific use cases. It is recommended to add column monitors only to columns where changes are likely to reflect actual data quality issues. Adding too many monitors may increase false positives and create unnecessary noise.
For column monitors to work, a time partition column must be defined. Soda uses this column to divide the data into time-based partitions, typically by day, and calculates the selected metrics within each partition. The column must be a timestamp and should reflect when records arrive in the database to ensure accurate and meaningful results.
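The partitioning step can be sketched: group rows by the day of their time-partition column and compute the monitored metric inside each partition. Column names and the missing-value metric below are assumptions for illustration:

```python
from collections import defaultdict
from datetime import datetime

rows = [
    {"created_at": "2024-05-01T09:30:00", "amount": None},
    {"created_at": "2024-05-01T17:10:00", "amount": 12.5},
    {"created_at": "2024-05-02T08:00:00", "amount": 7.0},
]

# Partition by day of the time-partition column, then compute a
# per-partition metric (here: missing-value percentage of `amount`).
partitions = defaultdict(list)
for row in rows:
    day = datetime.fromisoformat(row["created_at"]).date().isoformat()
    partitions[day].append(row)

missing_pct = {
    day: 100 * sum(r["amount"] is None for r in part) / len(part)
    for day, part in partitions.items()
}
print(missing_pct)  # {'2024-05-01': 50.0, '2024-05-02': 0.0}
```

This is why the partition column must reflect arrival time: if it lagged the actual load, each day's metric would be computed over an incomplete partition.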
For each dataset, you’ll see a scrollable list that includes:
Result of the anomaly detection: Anomaly, Expected, or Unknown (not yet evaluated)
Column name
Metric name (e.g. Missing values percentage, Average)
Column being tracked
At the bottom of the list you can load more monitors. Every monitor can be deleted and configured with opt-in notifications.
Column monitors can be added one by one or in bulk. When multiple columns are selected, only metrics that are applicable to all selected columns are shown.
Open the column monitor wizard
In the Metric Monitors dashboard, click Add Column Monitors.
Select columns
Search or scroll your table’s columns.
Check one or many boxes to select columns in bulk.
Column monitors are typed: a metric is available only when the column has the required data type. For example, if a column's type is str (text-based), numeric metrics cannot be enabled.
Pick metrics
Select the metrics of interest.
Search or expand metrics for further configuration:
Valid Range: define MIN and MAX values the metric can take (defaults to –∞/∞ or 0–∞ for time-based metrics).
Threshold Strategy:
Add monitors
Once you’ve selected your columns and toggled the desired metrics on, click Add Monitors.
Empty monitors will be added to the list, and at the top of the page you will be prompted to run a Historical Metric Collection Scan.
Tip: add all your column monitors first, then run the historical scan in one go. This will save time and computing costs, and ensures every monitor shares the same look-back window.
Column Monitors can be configured when setting them up and while they're in production. To fine-tune the monitor to your specific needs, go to the page for each specific metric.
Learn more about
The Users and User Groups in Soda Cloud settings allows you to control access to your organization by managing individual users and user groups. This ensures that team members have the appropriate permissions to use Soda Cloud effectively.
With SSO enabled, users and groups can be synced directly from your identity provider, reducing manual effort and ensuring alignment with your organization’s access policies. Learn more about SSO configuration User and user group management with SSO.
The Invite Users feature is only available when SSO is not enabled.
To invite users manually:
Go to the Users tab in Settings.
Click the + icon at the top of the user list
Enter the email addresses of the users you want to invite.
Invited users will receive an email with a link to set their password and join your organization in Soda Cloud. Once they complete the setup, they will have access to Soda Cloud based on the roles and permissions you assign.
Deactivating a user blocks their access to Soda Cloud and disables any existing API keys associated with their account. This is useful when a user no longer needs access, but you want to retain their account for record-keeping or future reactivation.
To deactivate a user:
Go to the Users tab in Settings.
Find the user you want to deactivate.
Click the context menu for the user and select Deactivate From This Organization.
You can reactivate a user later if they need access again.
Assigning users to groups allows you to manage access and permissions more efficiently by applying global roles to groups rather than individual users.
To assign users to groups:
Go to the Users tab in Settings.
Find the user you want to assign to a group.
Click the context menu next to their name and select Edit User Groups.
Select one or more user groups from the list.
Click Save
Global roles define a user’s permissions across Soda Cloud. Assigning global roles directly to users allows you to grant them specific access rights, such as managing datasets, running scans, or configuring organization settings.
To assign a global role to a user:
Go to the Users tab in Settings.
Find the user you want to assign a role to.
Click the context menu for the user and select Assign Global Roles.
Choose one or more global roles from the list.
Click Save
User groups allow you to manage access and permissions for multiple users at once, helping you simplify and scale permission management across your organization. By assigning global roles to groups, you ensure that all members of the group have consistent access rights, without the need to assign permissions individually.
If you have Single Sign-On (SSO) enabled, user groups can also be synced automatically from your identity provider, ensuring your Soda Cloud user management aligns with your existing access policies.
Note that, by default, there is an Everyone group, which is not editable and contains all users in the organization.
You can manually create user groups in Soda Cloud, whether you’re importing user groups via SSO or not.
To create a user group:
Go to the User Groups tab in Settings.
Click Create User Group at the top of the user group list.
Enter a name for the group.
(Optional) Add users to the group immediately, or add them later.
You can edit the members of user groups that you have created on Soda Cloud. SSO-managed user groups cannot be edited.
To edit a user group:
Go to the User Groups tab in Settings.
Find the group you want to modify.
Click the context menu next to the group and select Edit Members.
Select the users that should be in the user group and click Save.
You can view the list of users in a group to understand who has access through that group and to help manage permissions across your organization.
To view the members of a group:
Go to the User Groups tab in Settings.
Click the group name or the row to open its details.
The list of users assigned to the group will be displayed.
This view helps you track group membership and verify that the correct users have the appropriate access.
Global roles define a user’s permissions across Soda Cloud. Assigning global roles directly to user groups allows you to grant them specific access rights, such as managing datasets, running scans, or configuring organization settings.
To assign a global role to a user group:
Go to the User Groups tab in Settings.
Find the user group you want to assign a role to.
Click the context menu for the user group and select Assign Global Roles.
Choose one or more global roles from the list.
Click Save
This page describes how Soda handles data reconciliation through different types of reconciliation checks.
Available on the 15th of September 2025
Reconciliation checks are a validation step used to ensure that data remains consistent and accurate when moving, transforming, or syncing between different systems. The core purpose is to confirm that the target data matches the source data, whether that’s during a one-time migration, a recurring data pipeline run, or ongoing synchronization across environments.
For instance, if you are migrating from a MySQL database to Snowflake, reconciliation checks can verify that the data transferred into Snowflake staging is intact and reliable before promoting it to production. This minimizes the risk of data loss, duplication, or corruption during critical migrations.
Beyond migrations, reconciliation checks are also used in data pipelines and integrations. They help validate that transformations applied in-flight do not compromise accuracy, and that downstream datasets remain coherent with upstream sources.
Other use cases include regulatory compliance, where organizations must prove that financial or operational data has been faithfully replicated across systems, and system upgrades, where schema changes or infrastructure shifts can introduce unexpected mismatches.
By systematically applying reconciliation checks, teams can maintain trust in their data, reduce operational risk, and streamline incident detection when anomalies arise.
Before defining reconciliation checks, you first specify the source dataset. This represents the system of record against which you want to validate consistency. It is possible to define a filter on the source dataset, allowing you to reconcile only a subset of records that match certain criteria (for example, only transactions from the current month, or only rows belonging to a specific business unit).
For the target dataset, the reconciliation check applies the dataset filter defined at the top of the contract (see ).
Ensure that both source and target are constrained to the same logical scope before comparisons are made, keeping the validation consistent and relevant.
At this level, aggregate metrics from the source and target datasets are compared. Examples include totals (e.g., revenue, number of rows), averages, or other summary statistics. This approach is efficient and provides a high-level signal that the data remains consistent. It is especially useful for large-scale migrations or pipelines where exact row-by-row comparison may not be necessary at all times.
Comparisons at the metric level are evaluated against a defined threshold, which represents the acceptable difference between source and target. This tolerance can be set depending on the business context. Some use cases may allow small discrepancies (e.g., rounding differences), while others require exact equality.
When comparing integrity checks such as missing values, duplicates, or invalid entries, you can reconcile either by looking at the raw count of affected records or by comparing the percentage metric (e.g., the percentage of rows with missing values in each dataset). This flexibility ensures that reconciliation is meaningful regardless of dataset size or distribution.
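A metric-level comparison against a tolerance can be sketched as follows. This is a simplified illustration, not Soda's implementation; the function name and values are hypothetical:

```python
def reconcile_metric(source_value, target_value, tolerance_pct):
    """Pass when the relative difference between the source and target
    aggregates stays within the configured tolerance (in percent)."""
    if source_value == 0:
        return target_value == 0
    diff_pct = abs(target_value - source_value) / abs(source_value) * 100
    return diff_pct <= tolerance_pct

# Hypothetical aggregates: row counts in source vs. target after a migration
print(reconcile_metric(1_000_000, 999_500, tolerance_pct=0.1))  # 0.05% drift: passes
print(reconcile_metric(1_000_000, 980_000, tolerance_pct=0.1))  # 2% drift: fails
```

Setting the tolerance to zero expresses the exact-equality case described above.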
In addition to dataset-level filters, reconciliation checks support check-level filters, which are applied consistently to both the source and target within the scope of a specific check. These filters make it possible to validate a subset of the data relevant to the context of the check. The check-level filter is applied on top of any existing source or target dataset filters.
For more granular validation, reconciliation can be performed at the row level. This type of check surfaces detailed differences such as missing records, mismatched values, or unexpected duplicates. Row-level reconciliation is critical in scenarios where accuracy at the record level is non-negotiable, such as records involving financial transactions, user data, or regulatory reporting.
This requires specifying a primary key (or a composite key) to uniquely identify rows between the source and the target. Once rows are aligned, you can define a list of columns to test for exact matches or acceptable tolerances. If no column list is provided, the check defaults to comparing all columns in order. This flexibility ensures that comparisons can range from broad validation across the entire dataset to focused checks on only the most critical attributes.
Row-level reconciliation supports thresholds expressed either as the count of differing rows between source and target, or as the percentage of differing rows relative to the source dataset row count. These thresholds determine the acceptable level of variance before the check is considered failed, giving you fine control over sensitivity and tolerance.
This dual approach allows teams to adapt reconciliation logic to different contexts, using absolute counts when every record matters, and percentages when evaluating proportional differences in large datasets.
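The row-level logic described above can be sketched as follows. This is a simplified illustration, not Soda's implementation; the function name and data are hypothetical:

```python
def reconcile_rows(source, target, compare_columns, max_diff_pct):
    """Row-level reconciliation sketch: rows are dicts keyed by primary key.
    Counts missing, mismatched, and unexpected extra rows, then evaluates
    the percentage of differing rows relative to the source row count."""
    diffs = 0
    for key, src_row in source.items():
        tgt_row = target.get(key)
        if tgt_row is None:  # row missing in target
            diffs += 1
        elif any(src_row[c] != tgt_row[c] for c in compare_columns):
            diffs += 1       # value mismatch on a compared column
    diffs += sum(1 for key in target if key not in source)  # extra rows in target
    diff_pct = 100 * diffs / len(source)
    return diffs, diff_pct, diff_pct <= max_diff_pct

source = {1: {"amount": 10}, 2: {"amount": 20}, 3: {"amount": 30}}
target = {1: {"amount": 10}, 2: {"amount": 25}}  # one mismatch, one missing row
print(reconcile_rows(source, target, ["amount"], max_diff_pct=50))
# 2 differing rows out of 3 exceeds the 50% threshold, so the check fails
```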
As with metric-level checks, you can define a check-level filter that is applied on top of any existing dataset filters. This allows you to reconcile only a targeted segment of data within the context of the specific check—for example, testing only a single business unit, product family, or date range.
Row-level reconciliation is inherently heavier than metric-level reconciliation, as it requires comparing records across potentially large datasets. To enable comparisons even when data lives in different systems, data is loaded into memory from both the source and the target, where the diff is executed. A paginated approach is used to maintain scalability; this ensures that memory usage remains stable, but execution time will increase as the dataset size and column count grow.
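A minimal sketch of the paginated approach (illustrative only; `fetch_page` stands in for a keyed, ordered query against the actual data source):

```python
from itertools import islice

def fetch_page(rows, after_key, page_size):
    """Stand-in for 'SELECT ... WHERE pk > :after ORDER BY pk LIMIT :page_size'."""
    keys = sorted(k for k in rows if after_key is None or k > after_key)
    return [(k, rows[k]) for k in islice(iter(keys), page_size)]

def paginated_diff(source, target, page_size=2):
    """Walk the source in primary-key order one page at a time, so only a
    single page plus the matching target rows is held in memory; execution
    time grows with dataset size while memory stays bounded."""
    differing, after = 0, None
    while True:
        page = fetch_page(source, after, page_size)
        if not page:
            break
        for key, src_row in page:
            if target.get(key) != src_row:
                differing += 1  # missing in target, or values differ
        after = page[-1][0]
    differing += sum(1 for k in target if k not in source)  # extras in target
    return differing

source = {1: "a", 2: "b", 3: "c"}
target = {1: "a", 2: "x", 4: "d"}  # one mismatch, one missing, one extra
print(paginated_diff(source, target))  # 3
```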
Recommendations
Leverage filters to scope checks to new or incremental batches of data wherever possible, rather than repeatedly reconciling the entire dataset. This reduces both execution time and operational overhead.
Use metric-level reconciliation as a first line of validation. It is significantly more efficient and scalable, and can quickly highlight whether deeper row-level analysis is even necessary.
Soda is suitable for no-code and programmatic users alike. If you are implementing checks programmatically, you can learn more about the contract language syntax for reconciliation on the . Reconciliation checks can be used for both metric- and row-level validation.
This page describes what a Soda Agent is.
The Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality. For a self-hosted agent, create a Kubernetes cluster in a cloud services provider environment, then use Helm to deploy a Soda Agent in the cluster.
This setup enables Soda Cloud users to securely connect to data sources (Snowflake, Amazon Athena, etc.) from within the Soda Cloud web application. Any user in your Soda Cloud account can add a new data source via the agent, then write their own no-code checks to check for data quality in the new data source.
When you deploy an agent, you also deploy two types of workloads in your Kubernetes cluster from a Docker image:
a Soda Agent Orchestrator which creates Kubernetes Jobs to trigger scheduled and on-demand scans of data
a Soda Agent Scan Launcher which wraps the Soda Python libraries that implement the scans.
Kubernetes is a system for orchestrating containerized applications; a Kubernetes cluster is a set of resources that supports an application deployment.
You need a Kubernetes cluster in which to deploy the containerized applications that make up the Soda Agent. Kubernetes uses the concept of Secrets, which the Soda Agent Helm chart employs to store connection secrets that you specify as values during the Helm release of the Soda Agent. Depending on your cloud provider, you can arrange to store these Secrets in your provider's specialized secret-storage service.
Learn more about .
The Jobs that the agent creates access these Secrets when they execute.
Learn more about .
Within a cloud services provider environment is where you create your Kubernetes cluster. You can deploy a Soda Agent in any environment in which you can create Kubernetes clusters, such as:
Amazon Elastic Kubernetes Service (EKS)
Microsoft Azure Kubernetes Service (AKS)
Google Kubernetes Engine (GKE)
Any Kubernetes cluster, version 1.21 or greater, that uses standard Kubernetes APIs
Helm is a package manager for Kubernetes which bundles YAML files together for storage in a public or private repository. This bundle of YAML files is referred to as a Helm chart. The Soda Agent is a Helm chart. Anyone with access to the Helm chart’s repo can deploy the chart to make use of YAML files in it.
Learn more about .
The Soda Agent Helm chart is stored on a and published on . Anyone can use Helm to find and deploy the Soda Agent Helm chart in their Kubernetes cluster.
Kubernetes is the most powerful and future-proof platform for running the Soda Agent because it delivers the best of both worlds: the flexibility of raw compute without the operational burden, and the scalability of managed services without their restrictions.
Kubernetes goes far beyond raw compute like EC2 or traditional Virtual Machines (VMs) by abstracting away the heavy lifting of networking, deployments, and scaling, while still giving teams precise control when needed. Practically, this makes it easy for Soda’s customers to deploy and upgrade the agent, always staying up to date with the latest releases.
Unlike fully managed options such as AWS Lambda, Kubernetes has no execution time limits and is built to handle long-running, stateful, and highly scalable workloads. This means Soda is not limited to lightweight samples but can perform complete, row-level operations, powering advanced capabilities like the Diagnostics Warehouse, which securely stores the exact failing records inside your own infrastructure, and reconciliation checks, which compare data at row level across sources.
Whether running in the cloud or on-premises, Kubernetes ensures resilience, portability, and cost-efficient resource use, making it the clear choice for complex, enterprise-grade data quality workloads.
An overview of Soda's key observability features and how they help catch data issues early.
Data observability is the ongoing process of monitoring and assessing the health of your data throughout its lifecycle. It focuses on analyzing metadata, metrics, and logs to detect issues as they arise, helping teams maintain trust in their data.
At the core of data observability are monitors that track key data quality metrics over time. When a metric behaves unexpectedly, anomaly detection algorithms analyze historical patterns to determine whether an alert should be triggered.
Typical data quality metrics to monitor are:
Schema changes to surface structural modifications
Row counts to detect unexpected changes in data volume
Most recent timestamps to detect data freshness, missing or delayed data
Missing values to track data completeness
Averages to observe shifts in distributions
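A toy example of history-based anomaly detection (a plain z-score rule, deliberately much simpler than Soda's proprietary algorithm; the metric history below is hypothetical):

```python
from statistics import mean, stdev

def is_anomaly(history, latest, z_threshold=3.0):
    """Flag the latest measurement when it sits more than z_threshold
    standard deviations away from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Hypothetical history of daily row counts for one dataset
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
print(is_anomaly(daily_row_counts, 1002))  # within the normal range
print(is_anomaly(daily_row_counts, 100))   # sudden volume drop
```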
Soda embraces pragmatism over purity: practical outcomes and effectiveness are more important than ideal, unidimensional approaches. Effective data quality comes from combining data observability and data testing. Each serves a different purpose. Observability is about speed and broad coverage. Testing is about precision, enforcement, and prevention.
Benefits of Data Observability
Enables broad coverage quickly, even across large data sources
Surfaces unknown issues without needing to define every rule
Requires minimal configuration to get started
Leverages existing metadata for fast and efficient monitoring
Limitations of Data Observability
Serves only as a signal. An anomaly suggests an issue but doesn’t confirm it
Can generate false alerts, since detection is driven by algorithms
Requires further investigation to validate and resolve alerts
Does not prevent issues. It flags them after they’ve happened
Start with Observability, but rely on Testing
Observability is a fast and efficient way to get initial coverage. It helps surface unknown issues with minimal setup and delivers immediate value across many datasets. However, for lasting reliability and trust in your data, testing is more important.
Testing requires more effort up front. It involves defining explicit expectations and rules for your data. But that investment pays off. When a test fails, you know there is a real data quality issue, no guesswork, no false alerts. When an anomaly is detected, it doesn't necessarily mean there is an underlying data quality issue, and more investigation effort is required.
For long-term reliability, testing is essential. It adds rigor by enforcing defined standards and helps prevent bad data from going into production. Start with your most critical datasets, then expand gradually using a collaborative approach, where business users help by proposing checks. This creates a scalable framework that grows with your organization while ensuring lasting data quality.
Soda’s data observability allows teams to monitor data health across large environments without manual setup. All anomalies are surfaced in a single, easy-to-navigate dashboard, making it simple to spot issues and investigate patterns. Behind the scenes, a proprietary anomaly detection algorithm ensures high precision by minimizing false positives and focusing on meaningful deviations. Notifications are opt-in and alerts are only triggered when they matter, helping teams stay focused without being overwhelmed by noise.
Soda enables large-scale observability with ease. Instead of configuring each table manually, monitoring is applied at the data source level and automatically extends to all datasets underneath. This allows teams to activate observability across hundreds or even thousands of tables in minutes.
By leveraging metadata such as row counts, schema evolution, and insert activity, Soda delivers lightweight and efficient monitoring. There is no need to scan entire datasets or write custom logic for each table. You can do that if needed, but it is not required. Observability starts working immediately and is built to handle even the largest data platforms.
Observability is not just about what happens next. With built-in backfilling and backtesting, Soda instantly analyzes historical metadata and metric trends. From the moment observability is enabled, teams gain visibility into past data quality metrics and can detect potential anomalies that may have gone unnoticed.
This historical context is essential. It helps determine whether a current anomaly is truly new or part of an ongoing pattern. It also allows the anomaly detection algorithm to establish baselines immediately, which improves the quality of alerts from the very beginning.
Soda’s proprietary anomaly detection algorithm is specifically designed for data quality monitoring. Every component has been developed entirely in-house without relying on third-party frameworks. This gives Soda full control over the modeling stack and ensures transparency, customization, and explainability. These attributes are especially important in production environments where trust in alerts is essential.
The algorithm is built on a proprietary evaluation framework that rigorously tests its performance using hundreds of internally curated datasets with known data quality issues. This framework enables structured, repeatable experimentation and continuous benchmarking of new techniques. It prioritizes reducing false positives to ensure alerts are accurate, meaningful, and reliable.
In benchmark testing, Soda’s algorithm demonstrated a 70 percent improvement in anomaly detection accuracy compared to Facebook Prophet. Unlike generic forecasting tools that rely on rigid assumptions, Soda’s model is tailored to the real-world challenges of monitoring data quality at scale.
The system is flexible and adapts to different team needs. It can run autonomously with smart defaults or be fine-tuned through a user-in-the-loop approach. Teams can improve detection by providing feedback and adjusting sensitivity. This flexibility ensures that alerts remain focused, useful, and aligned with the needs of each organization.
Soda’s Metric Monitoring feature is the foundation of Data Observability, allowing users to automatically track key dataset and column-level statistics over time, detect deviations, and get alerted before data issues impact downstream analytics. While quality checks also keep track of measurements over time, metric monitors use that history of measurements to learn from them and automatically adjust thresholds to inform about expected values or alert about anomalies.
Metric Monitoring is developed to be a hassle-free feature. You can unlock organization‐wide observability through Soda Cloud’s . This instantly provides automated metric monitoring across hundreds of tables by simply selecting all the datasets you care about and defining a shared schedule in one step. No more configuring each table by hand: stay ahead of pipeline failures, data delivery delays, and structural changes with consistent, centralized monitoring that grows as fast as your data.
Learn more about how roles and permissions affect Metric Monitoring capabilities: .
The Webhook Integration in Soda Cloud allows you to send notifications about check results (based on notification rules) and incident updates to external systems, such as monitoring tools, incident management platforms, or custom endpoints.
This integration is ideal for teams who want to build custom workflows or integrate Soda Cloud alerts into their existing tools.
Only users with the Manage Organization Settings global role can define webhook integrations.
Follow these steps to configure a Webhook integration in Soda Cloud:
Go to the Integrations section in Settings.
Click the + button to add a new integration.
Select the integration type: Webhook, and click next.
Configure the Webhook
Name: Provide a clear name for your integration.
URL: Enter the Webhook endpoint where Soda Cloud should send notifications. Headers: (Optional)
Add authentication or custom headers required by your endpoint.
Test the Webhook
Use the built-in testing tool to simulate events and validate your Webhook integration.
You can select different event types to test and develop your integration.
For the exact payload structure and details, see the
Choose the events to send
Alert Notifications: The integration becomes available for use in notification rules. It will only send notifications when you explicitly configure a notification rule to use this Webhook.
Incidents: Triggered when users create or update incidents in Soda Cloud.
Click Save to apply
After configuring your Webhook integration with the Alert Notification scope, you can use it in your notification rules to send alerts when specific checks fail.
When creating or editing a notification rule, select your configured Webhook integration as the recipient.
For detailed steps and advanced examples, see the
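As an illustration only, a minimal receiver that enforces a custom authentication header might look like the sketch below. The handler class, header name, and secret are hypothetical, and the real payload structure is documented in the webhook reference:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

SHARED_SECRET = "replace-me"  # hypothetical value for the custom header configured in the integration

class SodaWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Reject calls that lack the custom authentication header
        if self.headers.get("X-Auth-Token") != SHARED_SECRET:
            self.send_response(401)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        self.log_message("received event: %s", event)  # hand off to your own workflow here
        self.send_response(200)
        self.end_headers()

def serve(port=8080):
    HTTPServer(("", port), SodaWebhookHandler).serve_forever()
```

Calling `serve()` runs the endpoint; in practice you would route the parsed event into your monitoring or incident tooling.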
Integrate Soda with Microsoft’s Purview data catalog to access details about the quality of your data from within the catalog.
Run data quality checks using Soda and visualize quality metrics and rules within the context of a table in Purview.
Give your Purview-using colleagues the confidence of knowing that the data they are using is sound.
Encourage others to add data quality checks using a link in Purview that connects directly to Soda.
In Purview, you can see all the Soda data quality checks and the value associated with the check’s latest measurement, the health score of the dataset, and the timestamp for the most recent update. Each of these checks listed in Purview includes a link that opens a new page in Soda Cloud so you can examine diagnostic and historic information about the check.
Purview displays the latest check results according to the most recent Soda scan for data quality, where color-coded icons indicate the latest result. A gray icon indicates that a check was not evaluated as part of a scan.
If Soda is performing no data quality checks on a dataset, the instructions in Purview invite a catalog user to access Soda and create new checks.
You have verified some contracts and published the results to Soda Cloud.
You have a Purview account with the privileges necessary to collect the information Soda needs to complete the integration.
The data source that contains the data you wish to check for data quality is available in Purview.
Sign into your Soda Cloud account and confirm that you see the datasets you expect to see in the data source you wish to test for quality.
In your Soda Cloud account, navigate to your avatar > Profile, then navigate to the API Keys tab. Click the plus icon to generate new API keys.
Copy the following values and paste to a temporary, secure, local location.
This page lists the supported data source types and their required connection parameters for use with Soda Core.
Soda uses the official Python drivers for each supported data source. The configuration examples below include the default required fields, but you can extend them with any additional parameters supported by the underlying driver.
Each data source configuration must be written in a YAML file and passed as an argument using the CLI or Python API.
Each configuration must include:
helm upgrade <release> soda-agent/soda-agent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=****

> cat values-local.yaml
soda:
  apikey:
    id: ***
    secret: ***
> helm upgrade soda-agent soda-agent/soda-agent \
  --values values-local.yml --namespace soda-agent

helm upgrade <release> soda-agent/soda-agent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set imageCredentials.apikey.id=*** \
  --set imageCredentials.apikey.secret=***

> cat values-local.yaml
soda:
  apikey:
    id: ***
    secret: ***
imageCredentials:
  apikey:
    id: ***
    secret: ***
> helm upgrade soda-agent soda-agent/soda-agent \
  --values values-local.yml --namespace soda-agent

helm upgrade <release> soda-agent/soda-agent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set existingImagePullSecrets[0].name=my-existing-secret # Mind the array and indexing syntax!

> cat values-local.yaml
soda:
  apikey:
    id: ***
    secret: ***
existingImagePullSecrets:
  - name: my-existing-secret
> helm upgrade soda-agent soda-agent/soda-agent \
  --values values-local.yml --namespace soda-agent

helm upgrade <release> soda-agent/soda-agent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set soda.cloud.region=us

> cat values-local.yaml
soda:
  apikey:
    id: ***
    secret: ***
  cloud:
    region: "us"
> helm upgrade soda-agent soda-agent/soda-agent \
  --values values-local.yml --namespace soda-agent

# For Soda Cloud customers in the EU region
docker login registry.cloud.soda.io -u <APIKEY_ID> -p <APIKEY_SECRET>
# For Soda Cloud customers in the US region
docker login registry.us.soda.io -u <APIKEY_ID> -p <APIKEY_SECRET>

existingImagePullSecrets:
  - name: my-existing-secret

soda:
  apikey:
    id: ***
    secret: ***
agent:
  image:
    repository: custom.registry.org/sodadata/agent-orchestrator
scanLauncher:
  image:
    repository: custom.registry.org/sodadata/soda-scan-launcher
contractLauncher:
  image:
    repository: custom.registry.org/sodadata/soda-contract-launcher
hooks:
  image:
    repository: custom.registry.org/sodadata/soda-agent-utils

pip install -i https://pypi.dev.sodadata.io/simple -U soda-bigquery

type: bigquery
name: my_bigquery
connection:
  account_info_json: '{
    "type": "service_account",
    "project_id": "dbt-quickstart-44203",
    "private_key_id": "fe0a60e9cb7d4369f73f7b5691ce397d1e",
    "private_key": "-----BEGIN PRIVATE KEY-----<insert-private-key>-----END PRIVATE KEY-----\n",
    "client_email": "[email protected]",
    "client_id": "114963712293161062",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/dbt-user%40dbt-quickstart-44803.iam.gserviceaccount.com",
    "universe_domain": "googleapis.com"
  }' # example service account JSON string, exported from BQ; SEE NOTE
  dataset: ${env.BQ_DATASET_NAME}
  # optional
  account_info_json_path: /path/to/service-account.json # SEE NOTE
  auth_scopes:
    - https://www.googleapis.com/auth/bigquery
    - https://www.googleapis.com/auth/cloud-platform
    - https://www.googleapis.com/auth/drive
  project_id: ${env.BQ_PROJECT_ID} # Defaults to the one embedded in the account JSON
  storage_project_id: ${env.BQ_STORAGE_PROJECT_ID}
  location: ${env.BQ_LOCATION} # Defaults to the specified project's location
  client_options: <options-dict-for-bq-client>
  labels: <labels-dict-for-bq-client>
  impersonation_account: <name-of-impersonation-account>
  delegates: <list-of-delegates-names>
  use_context_auth: false # whether to use Application Default Credentials

soda data-source test -ds ds.yml

Locally, for testing purposes, using tools like Minikube, microk8s, kind, k3s, or Docker Desktop with Kubernetes support.

Latest value
Trend sparkline
Missing values percentage
Detects anomalies in the maximum (highest) value in a column.
Unique count
Detects anomalies in the number of distinct (unique) values in a column.
Timestamp
Most recent timestamp
Detects anomalies in the most recent (latest) timestamp value in a column.
Numeric
Average
Detects anomalies in the average (mean) value of a column.
Standard deviation
Detects anomalies in the standard deviation of values in a column.
Sum
Detects anomalies in the total (sum) of values in a column.
Variance
Detects anomalies in the variance (spread) of values in a column.
Q1
Detects anomalies in the 25th percentile (first quartile) value of a column.
Median
Detects anomalies in the 50th percentile (median - Q2) value of a column.
Q3
Detects anomalies in the 75th percentile (third quartile) value of a column.
Text
Average length
Detects anomalies in the average character length of text values.
Maximum length
Detects anomalies in the longest character length of text values.
Minimum length
Detects anomalies in the shortest character length of text values.
Exclusion Values: specify literal values or ranges to ignore when marking anomalies.
All data types
Count
Detects anomalies in the number of non-missing (non-NULL) values in a column.
Duplicate percentage
Detects anomalies in the percentage of duplicate values in a column.
Maximum value
Detects anomalies in the maximum (highest) value in a column.
Minimum value
Detects anomalies in the minimum (lowest) value in a column.




















Provides early signals when something might be wrong
May result in extra work to follow up and interpret alerts








API Key Secret
See the Access Purview tutorial using REST APIs for instructions on how to create the following values, then paste them to a temporary, secure, local location.
client_id
client_secret
tenant_id
Copy the value of your Purview endpoint from the URL (https://XXX.purview.azure.com) and paste it to a temporary, secure, local location.
To connect your Soda Cloud account to your Purview Account, contact your Soda Account Executive or email Soda Support with the details you collected in the previous steps to request Purview integration.




Open Source. Available for Free, Team and Enterprise Plans.







Dataset size             Changes    Memory      Runtime
10 columns, 500K rows    1%         <80MB RAM   9s
360 columns, 100K rows   1%         <80MB RAM   1m
360 columns, 1M rows     1%         <80MB RAM   35m
The type, name, and connection fields must use the exact structure required by the underlying Python driver.
Test the connection before using the configuration in a contract.
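For example, if the configuration is saved as ds.yml, you can test it from the CLI with the connection-test command shown elsewhere in these docs:

```
soda data-source test -ds ds.yml
```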
You can run verifications using Soda Core (local execution) or a Soda Agent (remote execution). To ensure consistency and compatibility, you must use the same data source name in both your local configuration for Soda Core and in Soda Cloud. See: Onboard datasets on Soda Cloud
This matching by name ensures that the data source is recognized and treated as the same across both execution modes, whether you’re running locally in Soda Core or remotely via a Soda Agent.
It’s also possible to onboard a data source to Soda Cloud and a Soda Agent after it was onboarded using Soda Core.
To learn how: Onboard datasets on Soda Cloud
You can reference environment variables in your data source configuration. This is useful for securely managing sensitive values (like credentials) or dynamically setting parameters based on your environment (e.g., dev, staging, prod).
Example:
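A minimal sketch, assuming a Postgres data source and environment variables named POSTGRES_USERNAME and POSTGRES_PASSWORD (the variable names are illustrative):

```yaml
type: postgres
name: postgres_ds
connection:
  host: localhost
  port: 5432
  database: analytics
  user: ${env.POSTGRES_USERNAME}      # resolved from the runtime environment
  password: ${env.POSTGRES_PASSWORD}
```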
Environment variables must be available in the runtime environment where Soda is executed (e.g., your terminal, CI/CD runner, or Docker container).
For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda from scratch and configure it to connect to your data sources, see Soda's Quickstart.



Every monitor in Soda has its own dedicated detail page. This page is designed to help you explore the monitor's history, understand its behavior over time, and take action when needed. From here, you can investigate anomalies, give feedback to improve the detection algorithm, create incidents, and fine-tune the monitor's sensitivity or configuration.
The page consists of two main components:
An interactive plot that visualizes metric trends, anomalies, and historical context
A results table that lists all metric values and events visible in the plot
The interactive plot gives you a time-based view of how the monitor metric has evolved. It combines metric values, expected behavior, and any detected anomalies in a single visual.
Select a time window using the range slider below the plot to zoom in or out on a specific period
Click and drag to zoom into a custom time range
Hover over data points to view detailed information for each result
Expected range: the shaded area that represents the predicted normal behavior, as defined by the anomaly detection model
Measurement: the actual metric value for each scan
Anomaly: points marked when the metric falls outside the expected range and is flagged by the algorithm
Missing: scans where no metric could be collected, typically due to unavailable data or delayed scans
Monitor created: marks the date the monitor is created
Initial configuration: shows the starting settings used when the monitor was first enabled
Configuration updated: marks changes to thresholds, exclusions, or sensitivity applied over time
Below the plot, the table lists all historical scan results, including metric values, anomaly status, and any user actions (like feedback or incidents). The plot is aligned with the table, so each data point in the plot directly corresponds to a result in the table.
This makes it easy to correlate visual trends with specific events, compare changes, and drill into the context of any anomaly or data quality issue.
Using the three-dot menu, you can give feedback to the anomaly detection algorithm (in bulk, for multiple results at once) or create an incident and link the results to it.
Soda's Observability tools work out of the box with predefined baselines, but you can fine-tune them to your specific needs. Do this from the page for each specific metric.
By default Soda uses an adaptive statistical method, but you can control which sides of the expected range should trigger an anomaly alert:
Open the panel
Click on "Set Threshold Strategy" button on the metric of your choice
Choose your alert ranges
Upper range: when checked, Soda will flag any metric value that exceeds the upper bound of its statistical baseline.
Lower range: when checked, Soda will flag any metric value that falls below the lower bound.
Apply your settings
Click Set Threshold Strategy to save.
With this simple toggle interface you can, for example, watch only for unexpectedly high values, only for drops below your baseline, or both.
Exclude values from monitoring to tell Soda which specific values or ranges should be ignored when evaluating anomalies (e.g. test rows, manual overrides). The available inputs depend on the metric type.
Open the panel: click the Set Exclusion Values button on the metric of your choice.
Define your exclusions: click + Add exclusion.
Numeric metrics (Total row count, Total row count change, Partition row count):
Type: Value or Value range
Value: enter the exact metric value (e.g. 205) or, for a range, specify both lower and upper bounds.
You can stack multiple rules by clicking + Add exclusion.
Apply: click Set Exclusion Values to save your rules.
This will not retroactively change past results. It only affects future anomaly evaluations.
Soda uses a statistical baseline to define an “expected range” for anomaly detection. You can adapt how tight or loose that range is.
Open the panel: click the Set Sensitivity button on the metric of your choice.
Adjust the sensitivity
Provide a z-score: enter a value between 0.3 and 6 to control the exact width of the expected range OR use the slider to drag between Narrow (lower z-score) and Wide (higher z-score).
Default: z = 3
Preview how changing sensitivity widens or narrows the gray “expected” band in the plot
Apply
Click Apply sensitivity to save.
This will not retroactively change past results. It only affects future anomaly evaluations.
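Soda's adaptive model is proprietary, but the effect of the z-score can be illustrated with a simple mean/standard-deviation baseline. The sketch below is an assumption-laden approximation, not Soda's actual algorithm: a higher z widens the expected band, a lower z narrows it.

```python
import statistics

def expected_range(history, z=3.0):
    """Naive expected range: mean +/- z * sample stdev.
    Illustrative only -- Soda's baseline is adaptive, not a fixed formula."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return (mean - z * stdev, mean + z * stdev)

def is_anomaly(value, history, z=3.0):
    """Flag a value that falls outside the expected band."""
    low, high = expected_range(history, z)
    return value < low or value > high

history = [100, 102, 98, 101, 99, 103, 97]
print(is_anomaly(150, history, z=3.0))  # True: far outside the band
print(is_anomaly(101, history, z=3.0))  # False: within the band
```

With the default z = 3, only values roughly three standard deviations from the mean are flagged; dragging toward Narrow (e.g. z = 1) makes the same history flag much smaller deviations.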
Our home-brewed anomaly detection algorithm draws trends from historical data, but it can also learn from your input as you give it feedback.
When a monitor flags an anomaly you can:
Mark as expected Teach Soda that this deviation is acceptable: future similar variations will no longer trigger alerts.
Mark as anomaly Explicitly flag a point as an anomaly, even if it fell inside the baseline. This helps refine your alert definitions.
Create new incident Create a ticket in your incident management tool directly from the panel.
Link to existing incident Attach this scan to a ticket in your external system (Jira, ServiceNow, PagerDuty, etc.), keeping engineering triage in one place.
Bulk feedback More than one scan can be added to an incident or feedback. Simply check the boxes of the scans you want to add.
Dataset settings allow you to define key metadata, ownership, and business context for your datasets. This information helps ensure data governance, accountability, and seamless integration with other tools like your data catalog.
Each dataset should have a designated dataset owner: a person or team responsible for the dataset's quality, availability and usage.
Typically, the role of a Dataset Owner includes:
Defining and maintaining the dataset's purpose and documentation.
Ensuring the dataset meets data quality standards and contract requirements.
Responding to issues, such as failed checks or data quality alerts.
Reviewing and approving changes to the dataset schema or contract.
Updating the Dataset Owner requires the following dataset role: "Configure dataset".
To assign a Dataset Owner:
Open the dataset page.
Click the context menu (⋮) in the top-right corner and select Edit Dataset.
In the Owned by section, select one or more users and/or user groups.
Click Save to apply the changes.
Responsibilities allow you to assign permissions to users or user groups, ensuring they have the access they need to work with a dataset.
A Responsibility is a combination of:
A User or User Group.
A Dataset Role, which is a predefined collection of permissions (such as the ability to edit contracts, view checks, or manage settings).
By assigning Responsibilities, you define who can do what for each dataset, supporting clear ownership, governance, and collaboration.
Learn about defining custom roles
Managing responsibilities requires the following dataset role: "Manage dataset responsibilities"
To assign a Responsibility to a user or group:
Open the dataset page.
Click the context menu (⋮) in the top-right corner and select Edit Responsibilities.
Add the desired users or user groups.
Select the appropriate Dataset Role for each.
Click Save to apply the changes.
Every dataset has a default Dataset Owner role, automatically assigned to the designated Dataset Owner(s).
This role provides essential permissions to manage and maintain the dataset.
The Dataset Owner role cannot be removed, but it can be combined with other roles for additional permissions.
Updating the dataset attributes requires the following dataset role: "Configure dataset".
Dataset attributes allow you to add descriptive metadata to your datasets. This metadata can then be:
Used for filtering in Soda Cloud, making it easier to search and organize datasets and checks based on specific criteria (e.g., business domain, sensitivity, criticality).
Leveraged in reporting, enabling you to group datasets, track ownership, and monitor data quality across different categories or dimensions.
Adding meaningful attributes enhances discoverability, governance, and collaboration within Soda and its integrations.
Learn how to define attribute types:
You can add or modify dataset attributes in the Dataset Settings page:
Click the context menu (⋮) in the top-right corner and select Edit Dataset.
Set a value for the existing attribute type. They are all optional.
Save your changes.
When managing multiple datasets, you can save time by applying changes in bulk using the Bulk Edit feature.
Go to the Datasets page.
Select the datasets you want to edit using the checkboxes.
Click Edit in the action bar.
Define attributes you want to add or modify across the selected datasets.
Define responsibilities you want to add or modify across the selected datasets.
Choose whether to update existing responsibilities (add new without removing existing) or reset (replace all existing responsibilities with the new definition).
Click Continue to review your changes.
You can automate the management of dataset attributes and responsibilities in Soda Cloud using our REST API. This allows you to:
Programmatically set or update attributes for multiple datasets.
Assign responsibilities (users, groups, and roles) to datasets at scale.
Keep your Soda Cloud configuration in sync with your data catalog or external metadata management systems.
This automation ensures that your metadata stays up-to-date and consistent across your ecosystem, supporting seamless governance and discoverability.
To do so, you can leverage our APIs.
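The endpoint URLs and payload schema are defined in Soda Cloud's API reference; as a hypothetical sketch of what such automation could look like (the field names and payload shape below are illustrative assumptions, not the real API schema):

```python
import json

def build_dataset_attributes_payload(dataset_id, attributes):
    """Assemble a payload for updating dataset attributes in bulk.
    NOTE: hypothetical payload shape for illustration; consult the
    Soda Cloud API reference for the actual schema."""
    return {
        "datasetId": dataset_id,
        "attributes": [{"name": k, "value": v} for k, v in attributes.items()],
    }

payload = build_dataset_attributes_payload(
    "orders_dataset",
    {"domain": "sales", "criticality": "high"},
)
print(json.dumps(payload, indent=2))
```

A sync job could build one such payload per dataset from your catalog export and submit them via authenticated HTTP requests, keeping Soda Cloud aligned with your external metadata system.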
This page describes how to install the Soda Python packages, which are required for running Soda scans via the CLI or Python API.
To use Soda, you must have installed the following on your system.
Python 3.8, 3.9, 3.10 or 3.11.
To check your existing version, use the CLI command: python --version or python3 --version. If you have not already installed Python, consider using pyenv to manage multiple versions of Python in your environment.
Pip 21.0 or greater.
To check your existing version, use the CLI command: pip --version
A Soda Cloud account; see how to .
Best practice dictates that you install the Soda CLI using a virtual environment. If you haven't yet, in your command-line interface tool, create a virtual environment in the .venv directory using the commands below. Depending on your version of Python, you may need to replace python with python3 in the first command.
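The commands referred to above follow standard Python tooling (adjust python vs python3 for your system):

```shell
# Create a virtual environment in the .venv directory and activate it
python3 -m venv .venv
. .venv/bin/activate
```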
Before you install the Soda CLI, decide which installation flow applies to your environment and license type. The two flows available serve different purposes:
How to differentiate between free open-source Soda, and paid licensed Soda?
Soda V3: package names included core if the package was free open-source. E.g.:
soda-core-postgres
To use the open source Soda Core python packages, you must install them from the public Soda PyPi registry: https://pypi.dev.sodadata.io/simple.
Install the Soda Core package for your data source. This gives you access to all the basic CLI functionality for working with contracts.
Replace soda-postgres with the appropriate package for your data source. See the for supported packages and configurations.
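For example, using the public registry URL mentioned above (the package name here is illustrative; pick the one for your data source):

```
pip install -i https://pypi.dev.sodadata.io/simple soda-postgres
```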
Now you can .
soda: "umbrella" package (does not include Diagnostics Warehouse)
Data-source-specific packages: naming pattern is “soda-<datasource>” (e.g. soda-postgres, soda-bigquery, soda-sparkdf, etc.)
If you wish to use commercial extensions to the Soda Core python package, you must install them from one of the private Soda PyPi registries below. The private PyPI installation process adds an authentication layer and region-based repositories for license-based access control of Team and Enterprise customers.
Upgrade pip inside your new virtual environment.
Choose the correct repository based on your license and region.
1 Team: Any license except "Trial" or "Enterprise" (see below)
2 Enterprise: one of enterprise , enterprise_user_based , dataset_standard , premier licenses.
Set your credentials. See how to generate your own .
Execute the following command, replacing soda>=4.0.0b0 with the package that you need to install.
Soda with Alation to access details about the quality of your data from within the data catalog.
Run data quality checks using Soda and visualize quality metrics and rules within the context of a data source, dataset, or column in Alation.
Use Soda Cloud to flag poor-quality data in lineage diagrams and during live querying.
Give your Alation users the confidence of knowing that the data they are using is sound.
🎥 Watch a video showcasing the integration of Soda and Alation.
You have verified some contracts and published the results to Soda Cloud.
You have an Alation account with the privileges necessary to allow you to add a data source, create custom fields, and customize templates.
You have a git repository in which to store the integration project files.
🎥 Watch a 5-minute video that demonstrates how to integrate Soda and Alation.
Sign into your Soda Cloud account and confirm that you see the datasets you expect to see in the data source you wish to test for quality.
To connect your Soda Cloud account to your Alation Service Account, create an .env file in your integration project in your git repo and include details according to the example below. Refer to to obtain the values for your Soda API keys.
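The exact variable names are defined by the integration assets; a hypothetical sketch of such a .env file (all names below are illustrative placeholders):

```
SODA_API_KEY_ID=your-soda-api-key-id
SODA_API_KEY_SECRET=your-soda-api-key-secret
ALATION_HOST=https://yourcompany.alationcatalog.com
ALATION_TOKEN=your-alation-api-token
```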
To sync a data source and schema in the Alation catalog to a data source in Soda Cloud, you must map it from Soda Cloud to Alation. Create a .datasource-mapping.yml file in your integration project and populate it with mapping data according to the following example. The table below describes where to retrieve the values for each field.
Retrieve the Alation datasource_id from the URL
Retrieve the Alation datasource_container_name (schema) from the data source page
Retrieve the Alation datasource_container_id for the datasource_container_name from the URL in the Schema page.
If your Alation account employs single sign-on (SSO) access, you must for Soda to integrate with Alation.
If your Alation account does not use SSO, skip this step and proceed to .
Create custom fields in Alation that reference information that Soda Cloud pushes to the catalog. These are the fields the catalog users will see that will display Soda Cloud data quality details. In your Alation account, navigate to Settings > Catalog Admin > Customize Catalog. In the Custom Fields tab, create the following fields:
Under the Pickers heading, create a field for “Has DQ” with Options “True” and “False”. The Alation API is case-sensitive, so be sure to use these exact values.
Under the Dates heading, create a field for “Profile - Last Run”.
Contact directly to acquire the assets and instructions to run the integration and view Soda Cloud details in your Alation catalog.
Access Soda Cloud to or that execute checks against datasets in your data source each time you , or using a data pipeline tool such as Airflow. Soda Cloud pushes data quality scan results to the corresponding data source in Alation so that users can review data quality information from within the catalog.
In Alation, beyond reviewing data quality information for the data source, users can access the Joins and Lineage tabs of individual datasets to examine details and investigate the source of any data quality issues.
In a dataset page in Alation, in the Overview tab, users have the opportunity to click links to directly access Soda Cloud to scrutinize data quality details; see image below.
Under the Soda DQ Overview heading in Alation, click Open in Soda to access the dataset page in Soda Cloud.
Under the Dataset Level Monitors heading in Alation, click the title of any monitor to access the check info page in Soda Cloud.
Configure Soda Cloud to connect your account to MS Teams so that you can:
Send notifications for failed or warning check results to an MS Teams channel
Start conversations to track and resolve data quality Incidents with MS Teams
Only users with the Manage Notification Rules permission can create or edit rules. All users can view rules. Read about
As a user with permission to do so, log in to your Soda Cloud account, navigate to your avatar > Organization Settings, then select the Integrations tab.
In the Add Integration dialog box, select Microsoft Teams.
In the first step of the guided integration workflow, follow the instructions to navigate to your MS Teams account to create a Workflow; see Microsoft’s documentation for . Use the Workflow template to Post to a channel when a webhook request is received.
In the last step of the guided Workflow creation, copy the URL created after successfully adding the workflow.
Returning to Soda Cloud with the URL for Workflow, continue to follow the guided steps to complete the integration. Reference the following tables for guidance on the values to input in the guided steps.
Configuration tab: Provide the following information
Scope tab: select the Soda features (alert notifications and/or incidents) that can access the MS Teams integration.
Use the Alert Notification scope to enable Soda Cloud to send alert notifications to an MS Teams channel to notify your team of warn and fail check results. With such an integration, Soda Cloud enables users to select MS Teams as the destination for an alert notification of an individual check or checks that form a part of an agreement, or multiple checks. To send notifications that apply to multiple checks, see .
Use the Incident scope to notify your team when a new incident has been created in Soda Cloud. With such a scope, Soda Cloud displays an external link to the MS Teams channel in the Incident Details. Soda Cloud sends all incident events to only one channel in MS Teams. As such, you must provide a separate link in the Channel URL field in the Define Scope tab. For example, https://teams.microsoft.com/mychannel. To obtain the channel link in MS Teams, right-click on the channel name in the overview sidebar. Refer to for more details about using incidents in Soda Cloud.
Problem: You encounter an error that reads, “Error encountered while rendering this message.”
Solution: A fix is , the short version of which is as follows.
Restart MS Teams.
Clear your cache and cookies.
If you have not already done so, update to the latest version of MS Teams.
Configure a Webhook in Soda Cloud to connect to your ServiceNow account.
In ServiceNow, you can create a Scripted REST API that enables you to prepare a resource to work as an incoming webhook. Use the ServiceNow Resource Path in the URL field in the Soda Cloud integration setup.
This example offers guidance on how to set up a Scripted REST API Resource to generate an external link which Soda Cloud displays in the Incident Details; see image below. When you change the status of a Soda Cloud incident, the webhook also updates the status of the SNOW issue that corresponds with the incident.
Refer to the Webhook API for detailed information.
The following steps offer a brief overview of how to set up a ServiceNow Scripted REST API Resource to integrate with a Soda Cloud webhook. Reference the ServiceNow documentation for details:
and
In ServiceNow, start by navigating to the All menu, then use the filter to search for and select Scripted REST APIs.
Click New to create a new scripted REST API. Provide a name and API ID, then click Submit to save.
In the Scripted REST APIs list, find and open your newly created API, then, in the Resources tab, click New to create a new resource.
Integrate Soda with Atlan to access details about the quality of your data from within the data catalog.
Run data quality checks using Soda and visualize quality metrics and rules within the context of a data source, dataset, or column in Atlan.
Use Soda Cloud to flag poor-quality data in lineage diagrams.
Give your Atlan users the confidence of knowing that the data they are using is sound.
soda data-source test -ds ds.yml

type: postgres
name: postgres
connection:
  host:
  port:
  database:
  user: ${env.POSTGRES_USERNAME}
  password: ${env.POSTGRES_PASSWORD}

Feedback: shows if the user provided feedback on a result (e.g. confirmed or dismissed an anomaly)
Configuration change: visual markers indicating when the monitor’s configuration was updated
Time-based metrics (Last modification time, Most recent timestamp):
Type: Value or Value range
Value: enter the cutoff you want to ignore (e.g. 0 days, 10 hours, 49 minutes) or, for a range, specify both lower and upper bounds.
Schema changes: exclusions are not supported for schema-drift monitors.





































In the Script field, define a script that creates new tickets when a Soda Cloud incident is opened, and updates existing tickets when a Soda Cloud incident status is updated. Use the example below for reference. You may also need to define Security settings according to your organization's authentication rules.
Click Submit, then copy the value of the Resource path to use in the URL field in the Soda Cloud integration setup.

(function process(/*RESTAPIRequest*/ request, /*RESTAPIResponse*/ response) {
var businessServiceId = '28***';
var snowInstanceId = 'dev***';
var requestBody = request.body;
var requestData = requestBody.data;
gs.info(requestData.event);
if (requestData.event == 'incidentCreated'){
gs.log("*** Incident Created ***");
var grIncident = new GlideRecord('incident');
grIncident.initialize();
grIncident.short_description = requestData.incident.description;
grIncident.description = requestData.incident.sodaCloudUrl;
grIncident.correlation_id = requestData.incident.id;
if(requestData.incident.severity == 'critical'){
grIncident.impact = 1;
}else if(requestData.incident.severity == 'major'){
grIncident.impact = 2;
}else if(requestData.incident.severity == 'minor'){
grIncident.impact = 3;
}
grIncident.business_service = businessServiceId;
grIncident.insert();
var incidentNumber = grIncident.number;
var sysid = grIncident.sys_id;
var callBackURL = requestData.incidentLinkCallbackUrl;
var req, resp;
req = new sn_ws.RESTMessageV2();
req.setEndpoint(callBackURL.toString());
req.setHttpMethod("post");
var sodaUpdate = '{"url":"https://'+ snowInstanceId +'.service-now.com/incident.do?sys_id='+sysid + '", "text":"SNOW Incident '+incidentNumber+'"}';
req.setRequestBody(sodaUpdate.toString());
resp = req.execute();
gs.log(resp.getBody());
}else if(requestData.event == 'incidentUpdated'){
gs.log("*** Incident Updated ***");
var target = new GlideRecord('incident');
target.addQuery('correlation_id', requestData.incident.id);
target.query();
target.next();
if(requestData.incident.status == 'resolved'){
//Change this according to how SNOW is used.
target.state = 6;
target.close_notes = requestData.incident.resolutionNotes;
}else{
//Change this according to how SNOW is used.
target.state = 4;
}
target.update();
}
})(request, response);

soda-postgres (paid licensed Soda).
Soda V4: no differentiation using core in package names. Differentiation will be based on the installation flows listed above.
soda-migration
soda-reconciliation
soda-oracle
Executing data contracts with basic data quality checks on enterprise data sources.
Use this installation method if you’re just getting started.
The Public PyPI index hosts Soda Core packages for all supported data sources.
Same as above, plus: group by checks, reconciliation checks, migrating checks from v3 to v4, running checks on Oracle data, and capturing failed rows with the Diagnostics Warehouse.
Private PyPI repositories are region-specific and require authentication using your API key credentials. This method ensures secure access to licensed components, enterprise-only extensions, and region-compliant hosting.
Team1
EU
Team
US
Enterprise2
EU
Enterprise
US
catalog:
datasource_container_name
The schema of the data source; retrieve this value from the data source page in the Alation catalog under the subheading Schemas. See image below.
catalog:
datasource_container_id
The ID of the datasource_container_name (the schema of the data source); retrieve this value from the schema page in the Alation catalog. See image below
Under the Rich Texts heading, create the following fields:
“Soda DQ Overview”
“Soda Data Quality Rules”
“Data Quality Metrics”
Add each new custom field to a Custom Template in Alation. In Customize Catalog, in the Custom Templates tab, select the Table template, then click Insert… to add a custom field to the template:
“Soda DQ Overview”
In the Table template, click Insert… to add a Grouping of Custom Fields. Label the grouping “Data Quality Info”, then Insert… two custom fields:
“Has DQ”
“Profile - Last Run”
In the Column template, click Insert… to add a custom field to the template:
“Has DQ”
In the Column template, click Insert… to add a Grouping of Custom Fields. Label the grouping “Soda Data Profile Information”, then Insert… two custom fields:
Data Quality Metrics
Soda Data Quality Rules
name
A name you choose as an identifier for an integration between Soda Cloud and a data catalog.
soda:
datasource_id
The data source information panel in Soda Cloud.
soda:
datasource_name
The data source information panel in Soda Cloud.
soda:
dataset_mapping
(Optional) When you run the integration, Soda automatically maps all of the datasets between data sources. However, if the names of the datasets differ in the tools you can use this property to manually map datasets between tools.
catalog:
type:
The name of the cataloging software; in this case, “alation”.
catalog:
datasource_id
Retrieve this value from the URL on the data source page in the Alation catalog; see image below.
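Assembling the fields described above, a .datasource-mapping.yml could look like the following (structure inferred from the field descriptions; replace the placeholders with your own values):

```yaml
name: soda-alation-integration
soda:
  datasource_id: <soda-datasource-id>        # from the data source information panel in Soda Cloud
  datasource_name: <soda-datasource-name>    # from the data source information panel in Soda Cloud
  dataset_mapping: {}                        # optional; map dataset names that differ between tools
catalog:
  type: alation
  datasource_id: <alation-datasource-id>     # from the URL of the data source page in Alation
  datasource_container_name: <schema-name>   # the schema, from the data source page
  datasource_container_id: <schema-id>       # from the URL of the schema page
```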





You’ll be taken to the Contract Editor, a powerful interface where you can define your contract in two ways:
No-code view: Point-and-click UI to add quality checks and configure settings
Code view: YAML editor for full control and advanced use cases.
See language reference: Contract Language reference
You can switch between views at any time using the editor toggle in the top right corner.
Understanding how to structure your contract is essential. Soda supports several types of checks and configuration options:
Filter: applies a global filter to limit which rows are considered across the entire contract (e.g., only the latest partition or rows from the past 7 days.)
Variables: help you parameterize your contract, making it flexible and adaptable to different contexts (e.g., environments, schedules, or partitions.)
Dataset-level Checks: rules that apply to the dataset as a whole, like row count, freshness, or schema checks.
Column-level Checks: rules that apply to individual columns, like missing values, uniqueness, ranges, or regex formats.
All visible columns are detected during onboarding. You can also manually add columns if needed.
Variables allow dynamic substitution of values in contracts. They help you:
Parameterize values that differ across environments, datasets, or schedules.
Reuse values in multiple places within the same contract to reduce duplication and improve maintainability.
You can define variables at the top of your contract:
Then use them throughout your contract using the ${var.VARIABLE_NAME} syntax.
For example:
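A sketch of how this can look (the variable name and surrounding keys are illustrative; see the Contract Language reference for the exact schema): a variable with a default, referenced in a filter via the ${var.VARIABLE_NAME} syntax.

```yaml
variables:
  START_DATE:
    default: '2025-01-01'

filter: |
  created_at >= '${var.START_DATE}'
```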
When running the contract, variable values must be provided unless a default is defined.
Variables are ideal for partitioned datasets, date-based rules, or customizing checks based on context.
Now: You can use ${soda.NOW} in your Contract to access the current timestamp.
Use attributes to label, sort, and route your checks in Soda Cloud. Attributes help you organize checks by properties such as domain, priority, location, and sensitivity (e.g., PII).
Learn how to leverage attributes with Notifications and Browse datasets.
Apply Attributes to Checks
You can add attributes directly to individual checks. For example:
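A hedged sketch of what check-level attributes can look like (the check type and attribute names below are illustrative; attributes must already be defined in Soda Cloud):

```yaml
checks:
  - missing:
      column: customer_id
      attributes:
        domain: sales
        priority: high
        pii: false
```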
Set Default Attributes at the Top Level
You can also define default attributes at the dataset level. These attributes apply to all checks, unless overridden at the individual check level.
When publishing contract results to Soda Cloud, all check attributes must be pre-defined in Soda Cloud. If any attribute used in a contract is not registered in your Soda Cloud environment, the results will not be published, and the data contract scan will be marked as failed.
Learn how to configure attributes in Soda Cloud: Check and dataset attributes.
Before publishing, click Test to simulate a contract verification against your live data. Soda will:
Run all defined checks
Display which rules pass or fail
Surface profiling and diagnostic insights
This dry run helps ensure your contract behaves as expected, before making it official.
This action requires the "Manage contract" permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities
You can test a contract on a sample of your data. Learn more at the onboarding Additional settings.
Once you're happy with the results, click Publish.
Publishing sets this version as the source of truth for that dataset. From this point on:
Verifications will use the published version
All users see this contract as the authoritative definition of data quality for that dataset
Changes will require a new version or a proposal (depending on permissions)
Publishing ensures your data expectations are versioned, visible, and enforceable.
This action requires the Manage contract permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities
You’re now ready to start verifying your contract and monitoring your data.
Contract history provides a snapshot view of all changes that have been made to a contract.
To access contract history:
Navigate to a dataset with an existing data contract.
Click the history icon next to Edit Contract, at the top right.
Review contract history by choosing a version on the left panel and inspecting it on the right panel.
Just as when a contract is being created or edited, you can toggle between the code and no-code views.
The code view lets you toggle a diff view and a split view.
While contract history lets you see the changes a contract has undergone, request history provides an overview of the change requests that have been made for a specific contract.
To access the request history of any dataset, navigate to the dataset > tab Requests.
The list of requests can be filtered by title keyword and by state (Open, Done, and Won't do).
From this view, you can also create a request.
This view provides a snapshot of each request, making visible:
The title, description (if any), and creation time of the request
The state of the request (Open, Done, and Won't do)
An icon indicating that the request has a proposal
An icon indicating that the request has comments
To access all request history in an organization, navigate to tab Requests on top of the page.
This page provides an overview of all requests made within the organization. The requests can be filtered by:
Title keyword(s)
Status
User that created the request
Users that are participants in the request
You have verified some contracts and published the results to Soda Cloud.
You have an Atlan account with the privileges necessary to allow you to set up a Connection in your Atlan workspace.
Follow the instructions to Generate API keys in Soda to use for authentication in your Atlan connection.
Follow Atlan’s documentation to set up the Connection to Soda in your Atlan workspace.
🎥 Watch the Atlan-Soda integration in action!

Name: Provide a unique name for your integration in Soda Cloud.
URL: Input the Workflow URL you obtained from MS Teams.
Enable to send notifications to Microsoft Teams when a check result triggers an alert.
Check to allow users to select MS Teams as a destination for alert notifications when check results warn or fail. See Notifications.
Use Microsoft Teams to track and resolve incidents in Soda Cloud.
Check to automatically send incident information to an MS Teams channel.
Channel URL: Provide a channel identifier to which Soda Cloud sends all incident events.



Data contracts define the expectations between data producers and data consumers, ensuring that the data delivered is fit for purpose and aligned with business needs. However, data requirements evolve, and consumers often identify gaps or new use cases that require adjustments.
To support this, Soda provides a collaborative process that allows data consumers to request changes to existing data contracts or propose the creation of new ones. Consumers can directly propose changes by editing the data contract with Soda's no-code editor, suggesting concrete modifications for the dataset owner to review.
This approach enables data consumers to express their requirements not just in abstract terms but in actionable, implementable contract changes. By doing so, the consumer helps the dataset owner by:
Making their needs clearer and more concrete.
Supporting faster alignment between producers and consumers.
Contributing to quicker and smoother implementation.
Reducing unnecessary communication overhead.
The dataset owner remains the final decision-maker, reviewing proposed changes, iterating with the consumer as needed, and then publishing the updated contract once consensus is reached.
This collaborative workflow ensures that data contracts remain living agreements that continuously adapt to evolving business use cases while maintaining producer accountability.
In Soda Cloud, you can access a view of the contract history, which allows you to inspect all changes made to a specific contract.
Learn more about contract history.
This action requires the Propose checks permission on the dataset. Learn more about permissions here:
Users can:
Request a change by simply describing their needs and use cases.
Propose changes directly by editing the data contract, suggesting concrete modifications for the dataset owner to review.
When a request is created, dataset owners automatically receive an email notification, ensuring they can promptly review and collaborate with the requester.
To propose a change or create a new contract, data consumers can initiate a request directly from the dataset page.
Navigate to a dataset: Go to any onboarded dataset in Soda.
Start editing:
If the dataset does not yet have a contract, click Create Contract.
If a contract already exists, click Edit Contract.
Provide details: You will be prompted to:
Enter a title for the request.
Provide a reasoning or description of the changes, explaining why they are needed.
Save the request: Once you click Save, a new request is created containing your proposed changes. This proposal is then shared with the dataset owner for review and follow-up.
In some cases, data consumers may want to request changes without directly editing the contract themselves. This allows them to highlight a need while leaving the implementation details to the dataset owner.
Navigate to the dataset: Open the dataset in Soda.
Go to the Requests tab: Select the Requests tab for that dataset.
Create a new request: Click Create a Request.
Provide details: You will be prompted to:
Enter a title for the request.
Provide a reasoning or description of the changes, explaining why they are needed.
Save the request: Once you click Save, the request is created. The dataset owner will be notified and can review, clarify, and propose changes to the contract based on your input.
Each dataset page includes a Requests tab where all requests related to that dataset are listed. From here, users can:
Search for a request by name.
Filter requests by status: Open, Done, or Won’t Do.
Click on any request to access collaboration tools.
Once inside a request, users can collaborate in the following ways:
Click View Proposal to examine an existing proposal associated with the request.
When viewing a proposal, visual indicators show exactly what has changed in the contract:
Blue icon → element was modified (M).
Red icon → element was removed (R).
Green icon → element was added (A).
Blue dot → a parent element has one or more changed child elements.
Participants can post text messages within the request to clarify needs, align on requirements, and discuss next steps.
Users can contribute new proposals to move the request forward.
Iterate on an existing proposal: while viewing a proposal, click the pen icon to edit and build upon it.
From scratch: click Add Proposal to create a brand-new proposal.
In both cases:
Make your edits.
Click Save.
Provide a message to explain what you have done.
Click Save again.
All participants are automatically notified by email when a new proposal is created or an iteration is made, ensuring everyone stays aligned and can respond promptly.
This action requires the Manage Contract permission on the dataset. Learn more about permissions here:
After reviewing a proposal, you can publish it by clicking the Publish button. Once published, all participants associated with the request will automatically receive a notification, ensuring they are informed of the update.
If a new version of the contract has been published in the meantime, you must sync the proposal with the latest version before you can publish it.
When reviewing the proposal, click Sync to latest.
Two scenarios can then arise:
Soda can automatically merge the two versions. You can then proceed to the next step.
There are conflicts that Soda cannot resolve. In this case, you must resolve them yourself. Soda offers a tool that lets you compare the latest published version (left side) with the proposal version (right side). Edit the proposal version to resolve the conflicts, then click Continue to proceed.
Optionally, make additional edits.
Click Save to create a new proposal, which you can now publish.
You can fetch the content of a proposal from Soda Cloud and save it as a contract file, which can then be published to Git. This allows you to incorporate approved changes into version-controlled data contracts.
Request and proposal numbers can be found on Soda Cloud when reviewing a proposal. The first number is the request, and the decimal is the proposal.
After fetching the proposal, you can optionally use the publish command to publish it from Soda Cloud to Git:
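As a hedged sketch of the flow (the subcommand names `soda contract fetch` and `soda contract publish` are assumptions; the `-r`, `-p`, `-sc`, and `--f` options are those documented for the fetch command, and the request/proposal numbers and file path are illustrative):

```shell
# Hypothetical invocation: fetch proposal 3 of request 12 into a contract file
soda contract fetch -r 12 -p 3 -sc soda-cloud.yaml --f contracts/orders.yaml

# Optionally publish the fetched contract from Soda Cloud to Git
soda contract publish -sc soda-cloud.yaml --f contracts/orders.yaml
```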
Each request in Soda has a status to reflect its lifecycle. Initially, a request is created in the Open state. Once the requested changes have been implemented and published, the request can be moved to Done. If the decision is made not to implement the request, it can be transitioned to Won’t Do. Whenever a request’s status is updated, all participants are automatically notified by email, ensuring transparency and alignment across the collaboration process.
Configure a Webhook in Soda Cloud to connect to your Jira workspace.
In this guide, we will show how you can integrate Soda Cloud Incidents with Jira. After the integration is set up, creating an incident in Soda will automatically trigger the creation of a corresponding bug ticket in Jira. The Jira ticket will include information related to the incident created in Soda, including:
The number and title of the Incident
The description of the Incident
The severity of the incident
The status of the incident
The user who reported the Incident
A link to the Incident in Soda Cloud
A link to the associated Check in Soda Cloud
A link to this Jira ticket will be sent back to Soda and displayed on the Incident page in the Integrations box. Any updates to the status of the Incident in Soda Cloud will trigger corresponding changes to the Status of the Jira ticket. Any updates to the status of the Jira ticket will trigger corresponding changes to the Status of the Incident in Soda Cloud.
In Jira, you can set up an Automation Rule that enables you to define what you want an incoming webhook to do, then provides you with a URL that you use in the URL field in the Soda Cloud integration setup.
This integration is built on two webhook events, IncidentCreated and IncidentUpdated (Soda -> Jira), as well as the Soda Cloud API endpoint for updating incidents (Jira -> Soda).
In Jira, start by creating a new project dedicated to tracking data quality tickets. Navigate to the Project settings > Work Items, and make sure you have a bug type work item with the fields, as shown in the image below:
Summary
Description
Assignee
IncidentSeverity
From the same page, next click the Edit Workflow button, and make sure your workflow includes the following statuses:
Reported
Investigating
Fixing
Resolved
Here we will set up the automation in Jira so that when an Incident is created or updated in Soda, then a bug ticket will automatically be created or updated in Jira.
Navigate to Project settings > Automation, then click Create rule and, for the type of New trigger, select Incoming webhook.
Under the When: Incoming webhook trigger, click Add a component, select IF: Add a condition, then smart values condition.
What this means is that, if an incoming webhook has the incidentCreated event, then we will do something.
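Concretely, the smart values condition might compare the event field of the incoming payload against the event name. The exact payload path is an assumption here; check the webhook body Soda sends:

```
First value:  {{webhookData.event}}
Condition:    equals
Second value: incidentCreated
```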
Next we will add another component: THEN: Add an action.
The action will be to Create work item and the Issue Type should be Bug and the Project should be our new project.
Next we add some steps to fill out our ticket with extra information obtained from the webhook data.
We start by creating a branch rule to identify our ticket:
Then we Edit the ticket fields:
Finally, the last step in our incident creation workflow is to send a post request back to Soda with a link to the issue in Jira:
The remaining parts of this automation rule cover the scenarios where the status of the incident is updated in Soda, then we will detect this change and make the corresponding updates to the issue in Jira.
When the status changes to Reported:
The same logic is used for other status changes such as Investigating and Fixing.
In case the status changes to Resolved, our rule uses a similar logic, but with the additional step of adding resolution notes as a comment to the issue in Jira:
Once you save and enable this new rule, you can access a URL and secret that you will provide to Soda when setting up the new webhook integration.
After saving or enabling the rule, you can view details of the webhook trigger as shown below:
Next, you create a new webhook integration in Soda and provide the details from the webhook trigger above, as shown in the image below.
Paste the Webhook URL from Jira into the URL field in Soda and paste the Secret from Jira into a custom HTTP header called X-Automation-Webhook-Token.
Finally, in the Define Scope tab, make sure to select Incidents - Triggered when users create or update incidents.
We will set up a second automation rule in Jira so that when the status of the ticket changes in Jira, these changes are also reflected in Soda.
First, we set up the trigger for this automation to be when a Work item is transitioned:
Finally, we send a POST request to the Soda Cloud API incidents endpoint, using information from our Jira ticket to update the severity and status of the corresponding incident in Soda:
Note that the Authorization header value must be formatted like:
Basic <base64_encoded_credentials>. Base64-encoded credentials can be generated using Soda Cloud API keys in Python like so:
Soda offers seamless integrations with many tools across your data stack. Whether you're aligning data governance efforts, collaborating across teams, or triggering workflows, you can enhance Soda’s observability capabilities with the following connections:
For more details on notification rules, see the .
To create an integration:
Go to the Integrations section in Settings.
Click the + button to add a new integration.
Select the integration type (Slack, Microsoft Teams, or Webhook).
Follow the setup steps for the chosen integration
You can update existing integrations if connection details or configurations change.
To edit an integration:
Go to the Integrations section in Settings.
Find the integration you want to update.
Click the context menu and select Edit Integration Settings.
Update the configuration as needed.
You can temporarily pause an integration if you want to stop sending notifications and incident updates without fully deleting the configuration. The integration will no longer be available in notification rules.
To pause an integration:
Go to the Integrations section in Settings.
Locate the integration you want to pause.
Change the status to "Paused" in the table
Select Pause.
While paused, the integration will no longer send any notifications. You can resume it at any time by following the same steps and selecting Active.
This quickstart shows how Soda detects unexpected data issues by leveraging AI-powered Anomaly Detection and prevents future problems by using data contracts. The example uses Databricks, but you can do the same with any other database.
A data engineer at a retail company needs to maintain the regional_sales dataset so their team can manage regional sales data from hundreds of stores across the country. The dataset feeds executive dashboards and downstream ML models for inventory planning. Accuracy and freshness are critical, so you need both:
python -m venv .venv
source .venv/bin/activate
pip install -i https://pypi.dev.sodadata.io/simple -U soda-postgres
pip install --upgrade pip

export SODA_API_KEY_ID="your_key_id"
export SODA_API_KEY_SECRET="your_key_secret"

pip install "soda>=4.0.0b0" --pre -i https://${SODA_API_KEY_ID}:${SODA_API_KEY_SECRET}@enterprise.pypi.cloud.soda.io --extra-index-url=https://pypi.dev.sodadata.io
pip install "soda" --pre -i "https://${SODA_API_KEY_ID}:${SODA_API_KEY_SECRET}@team.pypi.cloud.soda.io" --extra-index-url=https://pypi.dev.sodadata.io
pip install soda --pre -i https://${SODA_API_KEY_ID}:${SODA_API_KEY_SECRET}@enterprise.pypi.cloud.soda.io --extra-index-url=https://pypi.dev.sodadata.io

ALATION_HOST=yourcompany.alationcatalog.com
ALATION_USER=<your username for your Alation account>
ALATION_PASSWORD=<your password for your Alation account>
SODA_HOST=cloud.soda.io
SODA_API_KEY_ID=<your Soda Cloud public key>
SODA_API_KEY_SECRET=<your Soda Cloud private key>

- name: Cars
  soda:
    datasource_id: 2d33bf0a-9a1c-4c4b-b148-b5af318761b3
    datasource_name: adventureworks
    # optional dataset_mapping soda: catalog
    dataset_mapping:
      Cars_data: Cars
  catalog:
    type: "alation"
    datasource_id: "31"
    datasource_container_name: "soda"
    datasource_container_id: "1"
- name: Soda Demo
  soda:
    datasource_id: 8505cbbd-d8b3-48a4-bad4-cfb0bec4c02f
  catalog:
    type: "alation"
    datasource_id: "37"
    datasource_container_name: "public"
    datasource_container_id: "2"

filter: country = "${var.country}"

Click Save to activate the integration.
Click Save to apply the changes.
IncidentURL
CheckURL
import base64
api_key_id = "your_api_key_id"
api_key_secret = "your_api_key_secret"
credentials = f"{api_key_id}:{api_key_secret}"
encoded_credentials = base64.b64encode(credentials.encode()).decode()
print(f"Basic {encoded_credentials}")

Make changes: Update the contract based on your needs and use case. You can add, modify, or remove elements to ensure the contract reflects the requirements you want to address.
Create a new request: After making your edits, click Create a Request.
-r (required): The request number. Identifies the request to fetch. Request numbers can be found when reviewing a proposal; the first number is the request.
-p (optional): The proposal number. Defaults to the latest proposal if not specified. Proposal numbers are shown as the decimal part when reviewing a proposal.
--soda-cloud, -sc (required): Path to the Soda Cloud config file (e.g., soda-cloud.yaml).
--f (required): Path to the output file where the contract will be written.
Automated anomaly detection on key metrics (row counts, freshness, schema drift)
Proactive enforcement of business rules via data contracts
Contact us at [email protected] to get an account set up.
After signing up, you can follow the steps below to set up a data source and start improving data quality.
Soda Cloud’s no-code UI lets you connect to any data source in minutes.
In cloud.soda.io or cloud.us.soda.io, click on Data Sources → New Data Source.
Choose your data source provider.
Name your data source under Data Source Label.
Scroll down and fill in the following credentials from your data source:
Click Connect or Test connection. This will trigger the connection and move to the next step.
Select the datasets you want to onboard on Soda Cloud.
Enable Metric Monitoring. By default, Metric Monitoring is enabled to automatically track key metrics on all the datasets you onboard and alert you when anomalies are detected. It is powered by built-in machine learning that compares current values against historical trends. You can also enable Advanced Monitor Configuration.
Enable Profiling and configure it. By default, Profiling is scheduled daily at 12:00AM UTC.
Click Finish to onboard your datasets. Soda Cloud will now spin up its Soda-hosted Agent and perform an initial Profiling & Historical Metric Collection scan. This usually takes only a few minutes.
Congratulations, you’ve onboarded your first dataset! Now let’s make sure you always know what’s happening with it.
That’s where Metric Monitoring comes in. It automatically tracks key metrics like volume, freshness, and schema changes, with no manual setup required. You’ll spot anomalies, detect trends, and catch unexpected shifts before they become problems.
Go to Datasets → select the dataset to inspect.
Navigate to the Metric Monitors tab to learn more about the metrics calculated.
You'll immediately see that key metrics are automatically monitored by default, helping you detect pipeline issues, data delays, and unexpected structural changes as they happen. No setup needed, just visibility you can trust.
In this guide, we will focus on the Most recent timestamp monitor. The panel shows that it was expected to be in a range of 0 - 5m 31s, but the recorded value at scan time was 56m 49s. In order to take a closer look:
Click the Most recent timestamp (or monitor of your choice) block.
In the monitor page you’ll see:
measured value vs. expected range,
any red-dot anomalies flagged by the model,
buttons to Mark as expected, Create new incident, etc.
Flag an outlier as "expected" or investigate it further.
Soda’s anomaly detection engine was built in-house (no third-party libraries) and optimized for high precision. It continuously adapts to your data patterns, and it incorporates your feedback to reduce false alarms. Designed to minimize false positives and missed detections, it shows a 70% improvement in detecting anomalous data quality metrics compared to Facebook Prophet across hundreds of diverse, internally curated datasets containing known data quality issues.
The Anomaly Detection Algorithm offers complete control and transparency in the modeling process to allow for interpretability and adaptations. It features high accuracy while leveraging historical data, delivering improvements over time.
Our automated anomaly detection has just done the heavy lifting for you, identifying unusual patterns and potential data issues without any setup required.
But to prevent those issues from happening again, you must define exactly what your data should look like: every column, every rule, every expectation.
That’s where Data Contracts come in. They let you proactively set the standards for your data, so problems like this are flagged or even prevented before they impact your business.
Create a new data contract to define and enforce data quality expectations.
In your Dataset Details page, go to the Checks tab.
Click Create Contract.
When creating a data contract, Soda will connect to your dataset and build a data contract template based on the dataset schema. From this point, you can start adding both dataset-level checks and column-level checks, as well as defining a verification schedule or a partition.
Toggle View Code if you’d like to inspect the generated SodaCL/YAML. This gives you access to the full contract code.
You can copy the following full example, paste it into the editor and edit it as you wish. You can toggle back to no-code view to see and edit the checks in the no-code editor.
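The full example is not reproduced here. As a hedged illustration only (the column names come from the regional_sales scenario in this quickstart, and the check keys shown are assumptions that may differ from the contract syntax in your Soda version), a minimal contract might look like:

```yaml
# Hypothetical contract sketch for the regional_sales dataset
dataset: regional_sales
columns:
  - name: store_id
    checks:
      - missing: {}          # no NULL store ids
  - name: sale_amount
    checks:
      - invalid:
          valid_min: 0       # sale amounts must be non-negative
checks:
  - schema: {}               # flag unexpected schema changes
  - row_count: {}            # track dataset volume
```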
That’s right: with Soda, you can edit a contract using either a no-code interface or directly in code. This ensures an optimal experience for all user personas while also providing a version-controlled code format that can be synced with a Git repository.
Click Test to verify the contract executes as expected
When you are done with the contract, click Publish
Click Verify. Soda will evaluate your rules against the current data.
Review the outcomes of the contract checks to confirm whether the data meets expectations. You can drill into those failures in the Checks tab.
You can trigger contract verification programmatically as part of your pipeline, so your data gets tested every time it runs.
We’ve prepared an example notebook to show you how it works:
Open the following Notebook example: https://colab.research.google.com/drive/1zkV_2tLJ4ohdzmKGS3LgdFDDnTNTUXew?usp=sharing
In your Python environment, first install the Soda Core library
Then, in the same environment, create a soda-cloud.yml file that contains your API keys, which are necessary to connect to Soda Cloud. You can create this YAML file from your Profile: Generate API keys
The soda-cloud.yml file should look like the following:
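A minimal sketch of the file (the key names follow the common Soda Cloud configuration layout; verify them against the file generated from your Profile):

```yaml
soda_cloud:
  host: cloud.soda.io        # or cloud.us.soda.io for the US region
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
```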
Now you are ready to trigger the verification of the contract. To do that, provide the identifier of your dataset as well as the path to the configuration file you created in the previous step. This will trigger a verification using the Soda Agent and return the logs.
Create a verify_contract.py file in your environment with the code below (or run it from a Jupyter notebook/Python interpreter):
You can learn more about the Python API here: Python API
You’ve completed the tutorial and are now ready to start catching data quality issues with Soda
Explore Profiling in the Discover tab to curate column selections for deeper analysis.
Set up Notification Rules (bell icon → Add Notification Rule) to push alerts to Slack, Jira, PagerDuty, etc.
Dive into Custom Monitors via scan.yml or the UI for even more tailored metrics.
On this page, you’ll see a list of connected sources and an Add Data Source button.
You need the "Manage data sources" global permission to add a new data source.
Learn more about Global and Dataset Roles
Click Add Data Source and select your data source from the list of supported data source types.
After selecting a source, you’ll be presented with a configuration form.
Enter a friendly, unique label. A unique name will be automatically generated from this label. This becomes the immutable ID of the data source and can also be used to reference the same connection in Soda Core.
You’ll be asked to select an agent. This is the component that connects to your data source and runs scans.
You can choose from:
Soda-hosted agent – Quickest option, fully managed by Soda (recommended for getting started)
Self-hosted agent – For custom or secure deployments where you manage the agent yourself
Learn more about deployment options: Deployment options
You’ll need to fill in the connection details. Soda uses the official Python packages for each supported data source, which means you can define any properties required by those libraries, flexibly and reliably.
This includes common fields like host, port, database name, username, and more, depending on the data source.
For sensitive values such as passwords, tokens, or keys, you should use Soda Secrets instead of entering them directly in the configuration.
Secrets are encrypted and securely stored in Soda Cloud.
They can be safely referenced in your data source configuration without exposing them in plain text.
To add secrets:
Navigate to the Data Sources tab in the top navigation.
Click the Secrets tab.
Define key-value pairs for your sensitive credentials.
You can then reference a secret in your data source configuration using this syntax:
This ensures your sensitive values stay secure while still being accessible to the agent at runtime.
Once the form is complete:
Click Test Connection to validate that Soda can successfully connect to your data source.
If the test passes, click Connect to finalize the setup.
After connecting, Soda will perform an automated dataset discovery. Soda triggers a scan that analyzes the datasets and retrieves their metadata, including columns and column data types. This reduces manual setup efforts, ensures data coverage in your environment and keeps Soda's dataset inventory aligned with your data sources. This feature allows other Soda features to work seamlessly:
Contract generation
Automated discovery of time partition column
Automated discovery of Primary Keys for Diagnostics Warehouse
Dataset selection can be manual or rules-based.
Manual selection allows you to browse a directory view of all the datasets in your data source.
The Scope can range from the entire data source to a specific schema. Any element selected on the left panel becomes the scope of the dataset search.
The manual selection is made for scale; it can easily handle thousands of schemas and hundreds of thousands of datasets.
Datasets that have already been onboarded will not be visible in the manual dataset selection.
Rules-based selection allows you to automate the dataset onboarding process, only selecting those which match specified rules.
You can add rules to include or exclude datasets that match certain conditions, such as "name contains" or "name starts with", or provide your own regex pattern.
In the example below, only datasets whose name does not start with "dwh" from the public schema will be onboarded.
Once you click on Validate rule, Soda will calculate how many datasets currently match the defined conditions:
Click on Next to finish the process.
Once the onboarding process is finished (after Step 3: Enable Metric Monitoring & Profiling (optional)), an overview of the Onboarding Rules will be provided. From this view, rules can be edited or deleted.
Rules will be executed in order of appearance on this view.
The order of the rules can be changed. As soon as a dataset matches a rule, it will be onboarded automatically; datasets can only be onboarded once.
Once onboarded, datasets will appear in your Soda Cloud UI and become available for contract creation or metric monitoring.
Through Metric Monitoring, you can enable built-in monitors to automatically track row counts, schema changes, freshness, and more across your datasets. This step is optional but recommended. This can be enabled in bulk when onboarding data sources and datasets.
Learn more about Metric Monitoring: Metric Monitoring dashboard
Toggle on Metric Monitoring
When metric monitoring is enabled, you can later add column monitors at the dataset level or override any of the settings.
Set a Monitoring Schedule
The monitoring schedule defines when Soda scans a dataset to capture and evaluate metrics. While scans may run slightly later due to system delays, Soda uses the actual execution time, not the scheduled time, when visualizing time-sensitive metadata metrics like insert lag or row count deltas. This ensures accuracy.
Data-based metrics like averages or null rates are not affected by small delays, as Soda only scans complete partitions, keeping these metrics stable and reliable.
Scans can be scheduled to occur from hourly to weekly, depending on your needs.
Learn more about how to pick a scan time.
Toggle on/off Historical Metric Collection
When Historical Metric Collection is enabled, Soda automatically calculates past data quality metrics through backfilling and applies the anomaly detection algorithm to that historical data through backtesting. This gives you immediate visibility into past data quality issues, even before monitoring was activated. The historical data also helps train the anomaly detection algorithm, improving its accuracy from day one. You can specify a start date to control how far back the backfilling process should begin.
Suggest a Time Partition Column
Metrics that are not based on metadata require a time partition column to group data into daily intervals or 24-hour buckets, depending on the monitoring schedule. This column must be a timestamp field, ideally something like a created_at or last_updated column. It's important that this timestamp reflects when the data arrives in the database, rather than when the record was originally created.
Soda uses a list of suggested time partition columns to determine which column to apply. If multiple columns are suggested, Soda checks them in the order they are listed, starting with the first. It will try to match one by validating that the column is a proper timestamp and suitable for partitioning.
If none of the suggested columns match, Soda falls back to a heuristic approach. This heuristic looks at metadata, typical naming conventions, and column content to infer the most likely time partition column.
If the heuristic fails to find a suitable column or selects the wrong one, the time partition column can be manually configured after onboarding under dataset settings.
Click on Next.
Enable Profiling (optional)
Learn more about Profiling.
From this view, you can also enable Failed row collection if Diagnostics Warehouse is enabled for this data source.
Click on Finish. If you used Rules-based selection to onboard datasets, an Active Onboarding Rule Pipeline view will appear now to confirm the conditions.
Once onboarding is completed, your data source will appear in the Data Sources list. You can click the Onboarded Datasets link to access the connected datasets.
🎉 Congrats! You’ve successfully onboarded your data source. You’re now ready to create data contracts and start monitoring the quality of your data.
Note that you can repeat the dataset onboarding process at any time to add more datasets from the same data source. Datasets that have previously been onboarded will not reappear in the data selection step. Simply return to the data source page and click Onboard Datasets to update your selection.
You need the Manage data sources global permission to add a new data source. Learn about Global and Dataset Roles
Learn more about Metric Monitors that run scans at a dataset level.
A dataset monitor in Soda tracks a specific high-level metric for an entire table (or partition) over time. It helps detect unusual patterns or unexpected changes in overall data health, such as sudden spikes or drops in row count, delays in fresh data, or schema drift.
You can find dataset monitors by opening the Metric Monitors tab on any dataset and looking at the top section labeled “Dataset Monitors.” This section lists all active dataset monitors, both metadata-based and partition-based, in a clear overview of monitor cards. At a glance, this overview shows critical information about each monitor: its status, the value from the last scan, and any detected anomalies, giving you a one-look summary of the health of your data systems.
Unlike column monitors, which are configured at the dataset level but target individual columns, dataset monitors apply to the entire table (or its latest partition) and capture broad indicators of data quality. When the necessary data and metadata are available, dataset-level monitors work out of the box with no further configuration needed.
Soda Cloud uses Global Roles and Dataset Roles to manage access and permissions. These roles ensure users and user groups have the right level of access based on their responsibilities.
Global roles define permissions across the entire organization in Soda Cloud.
By default, Soda Cloud provides two Global Roles: Admin and User. You can create custom roles with a subset of the permissions.
When you deploy a self-hosted Soda Agent to a Kubernetes cluster in your cloud service provider environment, you need to provide several key parameters and values to ensure optimal operation and to allow the agent to connect to your Soda Cloud account (API keys), and connect to your data sources (data source login credentials) so that Soda can run data quality scans on the data.
By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
As these values are sensitive, you may wish to employ the following alternative strategies to keep them secure.
soda request fetch -r 7 -p 1 -sc soda-cloud.yaml --f ./contracts/ecommerce_orders.yaml
dataset: databricks_demo/unity_catalog/demo_sales_operations/regional_sales
filter: |
order_date >= ${var.start_timestamp}
AND order_date < ${var.end_timestamp}
variables:
start_timestamp:
default: DATE_TRUNC('week', CAST('${soda.NOW}' AS TIMESTAMP))
end_timestamp:
default: DATE_TRUNC('week', CAST('${soda.NOW}' AS TIMESTAMP)) + INTERVAL '7 days'
checks:
- row_count:
- schema:
columns:
- name: order_id
data_type: INTEGER
checks:
- missing:
name: Must not have null values
- name: customer_id
data_type: INTEGER
checks:
- missing:
name: Must not have null values
- name: order_date
data_type: DATE
checks:
- missing:
name: Must not have null values
- failed_rows:
name: Cannot be in the future
expression: order_date > DATE_TRUNC('day', CAST('${soda.NOW}' AS TIMESTAMP)) +
INTERVAL '1 day'
threshold:
must_be: 0
- name: region
data_type: VARCHAR
checks:
- invalid:
valid_values:
- North
- South
- East
- West
name: Valid values
- name: product_category
data_type: VARCHAR
- name: quantity
data_type: INTEGER
checks:
- missing:
name: Must not have null values
- invalid:
valid_min: 0
name: Must be higher than 0
- name: price
data_type: NUMERIC
checks:
- invalid:
valid_min: 0
name: Must be higher than 0
- missing:
name: Must not have null values
- name: payment_method
data_type: VARCHAR
checks:
- missing:
name: Must not have null values
- invalid:
threshold:
metric: count
must_be: 0
filter: region <> 'north'
valid_values:
- PayPal
- Bank Transfer
- Cash
- Credit Card
name: Valid values in all regions except North
- invalid:
name: Valid values in North
filter: region = 'north'
valid_values:
- PayPal
- Bank Transfer
- Credit Card
qualifier: ABC124
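To illustrate the default filter window defined by the start_timestamp and end_timestamp variables above, here is a plain-Python sketch (not Soda code), assuming PostgreSQL-style DATE_TRUNC('week', …) semantics where weeks start on Monday:

```python
from datetime import datetime, timedelta

def week_window(now: datetime):
    """Return the [start, end) weekly window containing `now`, mirroring
    DATE_TRUNC('week', NOW) and DATE_TRUNC('week', NOW) + INTERVAL '7 days'
    with weeks starting on Monday."""
    # Truncate to the Monday of the current week, at midnight.
    start = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return start, start + timedelta(days=7)

# A scan on Wednesday 2024-05-15 filters rows from Monday the 13th (inclusive)
# up to Monday the 20th (exclusive):
start, end = week_window(datetime(2024, 5, 15, 10, 30))
print(start.date(), end.date())  # 2024-05-13 2024-05-20
```

Every scan within the same week therefore evaluates the same partition of order_date values, which keeps the weekly checks stable.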
pip install -i https://pypi.dev.sodadata.io/simple -U soda-core
soda_cloud:
host: cloud.soda.io ## Or cloud.us.soda.io
api_key_id: YOUR_API_KEY_ID ## Replace with your actual key ID
api_key_secret: YOUR_API_KEY_SECRET ## Replace with your actual key secret
from soda_core import configure_logging
from soda_core.contracts import verify_contracts_on_agent
configure_logging(verbose=False)
res = verify_contracts_on_agent(
dataset_identifiers=["databricks_demo/unity_catalog/demo_sales_operations/regional_sales"],
soda_cloud_file_path="soda-cloud.yml",
)
print(res.get_logs())
${secret.SECRET_NAME}
Soda supports two categories of dataset-level monitors: those that rely purely on system metadata, and those that compute values by querying a designated time-partition column. Below is an in-depth description of each built-in monitor. For a more detailed discussion of monitors based on querying the metadata vs monitors based on querying the data, see the Metadata vs Data-Based section on this page.
Based on metadata
Total row count
The total number of rows in the dataset at scan time.
Total row count change
Change in total row count compared to the previous scan.
Last modification time
The most recent time the data was changed, relative to the last scan.
Schema changes
Columns added, removed, or changed in type since the previous scan.
Monitors based on time partition columns look at data in the most recent partition based on a timestamp. If data is altered in an old partition, it will not be evaluated.
For example, data inserted today with a timestamp from 2 days ago will not be evaluated if the partition interval is 1 day.
The dashboard provides a health table with an overview of the monitors. Each monitor card is clickable and links to the metric's Anomaly History page.
Each monitor card will have the following information:
Monitor name: the given name of the specific monitor.
Monitor explanation: a brief description of the metric used.
Status: ✅ healthy / ⚠️ violated
Today's value at scan time: last recorded value.
Expected range: calculated by the anomaly detection algorithm, based on historical data.
Trend line with last 7 observations: a sparkline that shows an overview of the monitor plot.
Bell icon: to enable/disable opt-in alerts.
Dataset-level monitors fall into two categories depending on their source of truth:
Metadata-based monitors rely solely on system metadata exposed by your data warehouse; fields like “row count,” “last modified time,” or “schema version” that the catalog provides without scanning table rows. Because they don’t touch actual data, metadata monitors are extremely efficient and run quickly. They alert you if your table grows, shrinks, stops updating, or changes structure.
Data-based monitors look directly at the contents of a designated time-partition column (e.g., a date or timestamp field) and compute a value from the rows in that partition. Examples include “Partition Row Count” (how many rows landed in today’s partition) or “Most Recent Timestamp” (the newest timestamp in that partition). Data-based monitors require a full scan of each partition they monitor, but they capture freshness and volume signals that metadata alone cannot provide. If your dataset has no time-partition column defined (or your warehouse can’t surface the needed metadata), Soda will disable the appropriate monitors so you only see the metrics that can be collected.
Use the Configure Dataset Monitors panel to pick which built-in metadata and partition-based metrics you want Soda to track at the dataset level.
Open the panel → From any dataset’s Metric Monitors dashboard, click Configure Dataset Monitors.
Enable or disable → Toggle metrics on/off directly from here. If the data source doesn't support a given metric, it is automatically disabled.
Modify the monitor
Auto-apply → Changes take effect immediately for the next scan. Simply close the panel when you’re done.
Many data‐based monitors—such as Partition Row Count and Most Recent Timestamp—rely on a designated “time partition” column to know which slice of data to scan. The time partition column should be a date or timestamp field that naturally groups rows into discrete, regularly updated partitions (for example, a daily order_date or event_time). When Soda cannot detect a time partition column, metrics based on that data will not be available.
A good time partition column meets all of the following criteria:
Date or timestamp type: Each row contains a valid date (or timestamp) value.
Regular arrival cadence: New rows for each date/timestamp appear on a predictable schedule (e.g., daily, hourly).
Reflects ingestion/arrival time: The column's value must correspond to when the record actually landed in this dataset, not when it was originally created upstream, so freshness checks remain accurate.
Logical partition boundary: It matches how you want to slice your data (e.g., order_date for daily sales, event_time for hourly logs).
When these conditions hold, partition-based monitors will reliably focus on the correct slice of data—namely, the rows that truly arrived during each time window—so any delays or backfills become immediately visible.
When you onboard a new dataset from your data source, Soda attempts to automatically detect the most likely time partition column. You can:
Finish onboarding without editing the Time Partition Column field, allowing Soda to detect it.
Suggest a Time Partition Column of your choice, forcing Soda to use that one for monitoring.
If you ever need to confirm or search for the right partition column:
Navigate to the Datasets page, select your dataset, and click the Columns tab.
Search for columns with "timestamp" in the name. Any column with a date or timestamp data type is a candidate.
After onboarding, you can override the time partition column at any time. Changing it will reset Soda’s anomaly detection model for partition‐based metrics, so you’ll be retraining on historical data under the new partition definition. To override:
Access the Dataset Settings
Navigate to the Datasets tab
From this list or from the dataset page itself, click on the (⋮) menu > Edit Dataset
Find Time Partition Columns
Click on the Profiling & Metric Monitoring tab
Here you will see the current column being used for Time Partition.
Reveal the Time Partition Column drop-down menu
This will show all date and timestamp columns that can be used as a Time Partition Column.
Select your new Time Partition Column
Changing this column resets the model and historical baselines.
Click Save. Soda will:
Reset the partition‐based monitors (Partition Row Count, Most Recent Timestamp) to “training mode” and retrain baselines on the new partition.
Preserve any metadata‐based monitors (Total Row Count, Schema Changes) unchanged.
By following these steps, you ensure that Soda’s data‐based monitors always reference the correct daily (or hourly) slice of your dataset, so partition‐level metrics and freshness checks produce accurate results.
When Soda Cloud cannot obtain the underlying metadata required to calculate a dataset-level metric, it prevents you from configuring or viewing a metric that would always fail. There are two cases:
If a connected data source cannot provide the required metadata for a given dataset-level metric, such as row counts or schema timestamps, Soda will automatically disable that metric both on the Metric Monitors dashboard and in the Configure Dataset Monitors panel so you only see and configure metrics that your source can actually collect.
Some warehouses expose current metadata but don’t provide historical snapshots (for example, systems that only track the latest row count). In this case, Soda will compute the metric starting from your very first scan, but it cannot backfill any history prior to that point. As a result, anomaly detection baselines for that metric begin only at scan #1 and there is no retroactive historical data to train against.
Unlike other metrics, the Schema changes monitor does not backfill historical metadata. It only starts recording from the moment the dataset is onboarded.
Even when a metric is enabled and historical baselines exist, you may occasionally see gaps due to delayed or skipped scans. A “missing” metric indicates that Soda attempted to run the scan but did not receive a valid result for that metric, either because the scan agent was down, the query timed out, or metadata couldn’t be retrieved in time. Missing values do not count as anomalies; they simply mark a gap in the time series.
In Soda Cloud, you can identify these gaps as follows:
On the Metric Monitors dashboard, any missing value is shown either as a grey point or an empty checkbox in the metric sparkline:
In the detailed anomaly plot, missing points render as open circles (◯) along the timeline, and the trend line becomes dashed.
In Schema changes, no plot is available since the expected value is always 0. Hovering over an empty checkbox will display “No measurement” in the tooltip, making it easy to distinguish a gap from a healthy measurement or a flagged anomaly.
These visual cues let you immediately recognize when a scan didn't complete successfully, so you can investigate and restore full observability before critical issues go unnoticed.
Databricks
June 6th
✅
✅
✅
Snowflake
June 6th
✅
September 1st
✅
You have an Azure account and the necessary permissions to enable you to create, or gain access to an existing AKS cluster in your region. Consult the Azure access control documentation for details.
You have installed the Azure CLI tool. This is the command-line tool you need to access your Azure account from the command-line. Run az --version to check the version of an existing install. Consult the Azure Command-Line Interface documentation for details.
You have logged in to your Azure account. Run az login to open a browser and log in to your account.
You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have already installed the Azure CLI tool, you can install kubectl using the following command: az aks install-cli.
Run kubectl version --output=yaml to check the version of an existing install.
You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.
You have whitelisted these URLs, depending on whether you are using Soda EU cloud.soda.io or Soda US cloud.us.soda.io:
Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.
Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware that allocating too many resources may be costly relative to the small benefit of improved scan times.
To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.
For reference, a Soda-hosted agent specifies resources as follows:
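The resource snippets themselves are not reproduced on this page. As an illustrative sketch only, assuming standard Kubernetes resource requests/limits and the parameter names mentioned above (resources for the agent orchestrator, soda.scanlauncher.resources for the scan launcher), a values.yml fragment could look like the following; confirm the exact keys against the Soda Agent Helm chart:

```yaml
soda:
  agent:
    resources:
      requests:
        cpu: "x"        # e.g. "500m"
        memory: "x"     # e.g. "512Mi"
      limits:
        cpu: "x"
        memory: "x"
  scanlauncher:
    resources:
      requests:
        cpu: "x"
        memory: "x"
      limits:
        cpu: "x"
        memory: "x"
```

Replace each x with values appropriate for your workload, as described in the Kubernetes resource management documentation.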
The following table outlines the ways you can install the Helm chart to deploy a Soda Agent in your cluster.
Install the Helm chart via CLI by providing values directly in the install command.
Use this as a straightforward way of deploying an agent on a cluster.
Install the Helm chart via CLI by providing values in a values YAML file.
Use this as a way of deploying an agent on a cluster while keeping sensitive values secure: provide sensitive API key values in this local file or in an external secrets manager, and store data source login credentials as environment variables in this local file. Soda needs access to these credentials to connect to your data source and run scans of your data. See:
(Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.
Use Helm to add the Soda Agent Helm chart repository.
Use the following command to install the Helm chart which deploys a Soda Agent in your cluster. (Learn more about the helm install command.)
Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of soda.agent.name with a custom name for your agent, if you wish.
Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
The command-line produces output like the following message:
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
(Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.
Use Helm to add the Soda Agent Helm chart repository.
Using a code editor, create a new YAML file called values.yml.
To that file, copy+paste the content below, replacing the following values:
id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of name with a custom name for your agent, if you wish.
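The referenced file content is not shown on this page. Based on the parameter names used elsewhere in this guide (soda.apikey.id, soda.apikey.secret, soda.agent.name, soda.cloud.endpoint), a values.yml sketch might look like the following; treat the exact structure as an assumption and verify it against the chart's documented values:

```yaml
soda:
  apikey:
    id: "***"       # value from the New Soda Agent dialog box
    secret: "***"   # value from the New Soda Agent dialog box
  agent:
    name: "my-soda-agent"   # any name unique in your Soda Cloud account
  cloud:
    endpoint: "https://cloud.soda.io"  # or https://cloud.us.soda.io for the US region
```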
Save the file. Then, create a namespace for the agent.
In the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
helm install
the action helm is to take
soda-agent (the first one)
a release named soda-agent on your cluster
soda-agent (the second one)
the name of the helm repo you installed
soda-agent (the third one)
the name of the helm chart that is the Soda Agent
The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with --set flags as this command does, or you can specify the override values using a values.yml file.
--set soda.agent.name
A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.
--set soda.apikey.id
With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.apikey.secret
With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.agent.logFormat
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
--set soda.agent.loglevel
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
--namespace soda-agent
Use the namespace value to identify the namespace in which to deploy the agent.
Delete everything in the namespace which you created for the Soda Agent.
Delete the cluster. Be patient; this task may take some time to complete.
Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.
Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:
Use https://cloud.us.soda.io for the United States
Use https://cloud.soda.io for all else
Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.
Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:
cloud.us.soda.io for Soda Cloud account created in the US region
OR
cloud.soda.io for Soda Cloud account created in the EU region
AND
collect.soda.io
Problem: When you attempt to create a cluster, you get an error that reads, An RSA key file or key value must be supplied to SSH Key Value. You can use --generate-ssh-keys to let CLI generate one for you.
Solution: Run the same command to create a cluster but include an extra line at the end to generate RSA keys.
Manage data sources and agents
Allow users to deploy a new Soda Agent and to configure data source connections in Soda Cloud.
✓
Create new datasets and data sources with Soda Core library
Allow the creation of new data sources in Soda Cloud when using the Soda Core library.
Allow users to onboard datasets in Soda Cloud on data sources connected with a Soda Agent. See
✓
✓
Manage attributes
Allow users to define which dataset and check attributes are available to use in the organization.
✓
Manage notification rules
Allow users to manage how notifications are sent.
✓
✓
Manage organization settings
You can create custom global roles to match your organization’s needs.
To create a global role:
Go to the Global Roles section in Settings.
Click Add Global Role to create a new role.
Enter a name for the role.
Select the permissions the role should have.
Click Save.
You can edit global roles at any time to adjust permissions as your organization’s needs evolve.
To edit a global role:
Go to the Global Roles section in Settings.
Find the global role you want to modify.
Click the context menu next to the role and select Edit Global Role.
Adjust the role’s name and permissions as needed.
Click Save to apply your changes.
You can assign roles to individual users or user groups to grant them the associated permissions.
To assign a global role:
Go to the Global Roles section in Settings.
Find the global role you want to assign.
Click the context menu next to the role and select Assign Members.
Select the users or user groups that should have the global role.
Click Save to apply your changes.
You can also assign roles on the Users and User groups tabs:
For users: User management
For user groups: User management
Dataset roles define permissions for specific datasets.
By default, Soda Cloud provides three Dataset Roles: Manager, Editor, and User. You can create custom roles with a subset of the permissions.
View dataset
Access the dataset and view checks
✓
✓
✓
Access dataset profiling and samples
Allow users to see insights about the data
✓
✓
✓
You can create custom dataset roles to match your organization’s needs.
To create a dataset role:
Go to the Dataset Roles section in Settings.
Click Add Dataset Role to create a new role.
Enter a name for the role.
Select the permissions the role should have.
Click Save to apply your changes.
You can edit dataset roles at any time to adjust permissions as your organization’s needs evolve.
To edit a dataset role:
Go to the Dataset Roles section in Settings.
Find the dataset role you want to modify.
Click the context menu next to the role and select Edit Dataset Role.
Adjust the role’s name and permissions as needed.
Click Save to apply your changes.
Responsibilities in Soda Cloud define who has access to a dataset and what they are allowed to do. They are assigned by mapping users or user groups to a dataset role.
This ensures that the right people have the appropriate permissions for each dataset, such as the ability to manage checks, propose new rules, or view profiling information.
For example:
Assign a Manager role to a dataset owner who needs full control.
Assign a Viewer role to a business user who only needs to monitor data quality results.
By assigning responsibilities, you ensure clear access control, accountability, and governance across your datasets.
Learn about how to set up responsibilities on a dataset: Dataset Attributes & Responsibilities
Soda Cloud allows you to define default responsibilities for the dataset owner, which will automatically be granted for all dataset owners. This ensures that all users have a consistent baseline level of access unless you choose to customize it.
By default, all dataset owners have the "Manager" role.
How to Configure Default Responsibilities
Go to the Organization Settings page in Soda Cloud.
Locate the Datasets Roles section.
Select the dataset role to assign to the Dataset Owners.
Click Save at the top right of the page to apply your changes.
For everyone
Soda Cloud allows you to define default responsibilities for the Everyone group, which will automatically apply to all newly onboarded datasets. This ensures that all users have a consistent baseline level of access unless you choose to customize it.
By default:
The Everyone group is assigned as a "Viewer" for all new datasets.
This setting applies to all users in your organization unless disabled.
You can either customize the default role or disable the default responsibilities if you do not want the Everyone group to receive any automatic access to new datasets.
How to Configure Default Responsibilities
Go to the Organization Settings page in Soda Cloud.
Locate the Datasets Roles section.
Select the dataset role to assign to the Everyone group for new datasets.
To disable default responsibilities, toggle the feature off.
Click Save at the top right of the page to apply your changes.
When you deploy a self-hosted Soda Agent from the command-line, you provide values for the API key id and API key secret which the agent uses to connect to your Soda Cloud account. You can provide these values during agent deployment in one of two ways:
directly in the helm install command that deploys the agent and stores the values as Kubernetes secrets in your cluster; see deploy using CLI only
OR
in a values.yml file which you store locally but reference in the helm install command as in the example below.
Refer to the exhaustive cloud service provider-specific instructions for more detail on how to deploy an agent using a values YAML file.
If you use a private key with Snowflake or BigQuery, you can provide the required private key values in a values.yml file when you deploy or redeploy the agent.
When you, or someone in your organization, follows the guided steps to use a self-hosted Soda Agent to add a data source in Soda Cloud, one of the steps involves providing the connection details and credentials Soda needs to connect to the data source to run scans.
You can add those details directly in Soda Cloud, but because any user can then access these values, you may wish to store them securely in the values YAML file as environment variables.
Create or edit your local values YAML file to include the values for the environment variables you input into the connection configuration.
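As a sketch, the PostgreSQL credentials used later in this section could be stored as follows; the key under which the agent chart expects environment variables (soda.env here) is an assumption, so check the chart's values reference:

```yaml
soda:
  env:
    POSTGRES_USERNAME: "sodauser"   # referenced in the data source configuration
    POSTGRES_PASSWORD: "***"        # keep this file out of version control
```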
After adding the environment variables to the values YAML file, update the Soda Agent using the following command:
In step 2 of the add a data source guided steps, add the data source connection configuration, which looks something like the following example for a PostgreSQL data source. Note the environment variable values for username and password.
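The example itself does not survive on this page. A sketch of such a connection configuration, with field names assumed from typical PostgreSQL settings, could look like:

```yaml
type: postgres
host: your-postgres-host.example.com
port: 5432
username: ${POSTGRES_USERNAME}   # resolved from the environment variable in the values YAML
password: ${POSTGRES_PASSWORD}   # resolved from the environment variable in the values YAML
database: your_database
```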
Follow the remaining guided steps to add a new data source in Soda Cloud. When you save the data source and test the connection, Soda Cloud uses the values you stored as environment variables in the values YAML file you supplied during redeployment.
Use External Secrets Operator (ESO) to integrate your self-hosted Soda Agent with your secrets manager, such as a Hashicorp Vault, AWS Secrets Manager, or Azure Key Vault, and securely reconcile the login credentials that Soda Agent uses for your data sources.
Say you use a Hashicorp Vault to store data source login credentials and your security protocol demands frequent rotation of passwords. In this situation, the challenge is that apps running in your Kubernetes cluster, such as a Soda Agent, need access to the up-to-date passwords.
To address the challenge, you can set up and configure ESO in your Kubernetes cluster to regularly reconcile externally-stored password values so that your apps always have the credentials they need. Doing so obviates the need to manually redeploy a values YAML file with new passwords for apps running in the cluster each time your system refreshes the passwords.
The current integration of Soda Agent and a secrets manager does not yet support the configuration of the Soda Cloud credentials. For those credentials, use a tool such as helm-secrets or vals.
To integrate Soda Agent with a secret manager, you need the following:
External Secrets Operator (ESO) which is a Kubernetes operator that facilitates a connection between the Soda Agent and your secrets manager
a ClusterSecretStore resource which provides a central gateway with instructions on how to access your secret backend
an ExternalSecret resource which instructs the cluster on what values to fetch, and references the ClusterSecretStore
Read more about the ESO’s Resource Model.
The following procedure outlines how to use ESO to integrate with a Hashicorp Vault that uses a KV Secrets Engine v2. Extrapolate from this procedure to integrate with another secrets manager such as:
You have set up a Kubernetes cluster in your cloud services environment and deployed a self-hosted Soda Agent in the cluster.
For the purpose of this example procedure, you have set up and are using a HashiCorp Vault that contains a key-value pair for POSTGRES_USERNAME and POSTGRES_PASSWORD at the path local/soda.
Consider referencing the use case guide for integrating an External Secrets Manager with a Soda Agent which offers step-by-step instructions to set everything up locally to see the integration in action.
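For reference, a key-value pair like the one this example assumes could be created with the Vault CLI as follows; the sodalibrary values are the illustrative credentials used elsewhere on this page, and the local/ mount is an assumption of this example:

```shell
# Requires a running Vault with a KV v2 secrets engine mounted at "local";
# adjust the mount and path to your own setup.
vault kv put local/soda \
  POSTGRES_USERNAME=sodalibrary \
  POSTGRES_PASSWORD=sodalibrary
```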
Use helm to install the External Secrets Operator from the Helm chart repository into the same Kubernetes cluster in which you deployed your Soda Agent.
Verify the installation using the following command:
Create a cluster-secret-store.yml file for the ClusterSecretStore configuration. The details in this file instruct the Soda Agent how to access the external secrets manager vault.
This example uses HashiCorp Vault AppRole authentication. AppRole authenticates with Vault using the AppRole auth mechanism to access the contents of the secret store. It uses the SecretID in the Kubernetes secret, referenced by secretRef, together with the roleID, to acquire a temporary access token so that it can fetch secrets.
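The SecretID itself lives in an ordinary Kubernetes secret that the ClusterSecretStore's secretRef points to. Assuming the names used in the ClusterSecretStore example on this page, it could be created like this (the SecretID value is a placeholder):

```shell
# Secret name, key, and namespace must match the secretRef in the
# ClusterSecretStore configuration.
kubectl create secret generic external-secrets-vault-app-role-secret-id \
  --namespace external-secrets \
  --from-literal=appRoleSecretId='<your-approle-secret-id>'
```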
Access external-secrets.io documentation for configuration examples for:
Deploy the ClusterSecretStore to your cluster.
Create a soda-secret.yml file for the ExternalSecret configuration. The details in this file instruct the Soda Agent which values to fetch from the external secrets manager vault.
This example identifies:
the namespace of the Soda Agent
two remoteRef entries that fetch the POSTGRES_USERNAME and POSTGRES_PASSWORD values from the vault
Deploy the ExternalSecret to your cluster.
Use the following command to get the ExternalSecret to authenticate to the HashiCorp Vault using the ClusterSecretStore and fetch secrets.
Output:
Prepare a values.yml file to deploy the Soda Agent with the existingSecrets parameter that instructs it to access the ExternalSecret file to fetch data source login credentials. Refer to the complete deployment instructions, or adapt your existing values file if you already have an agent running in a cluster.
Deploy the Soda Agent using the following command:
Output:
By default, the Soda Agent creates a secret for storing the Soda Cloud API Key details securely in your cluster. If you want to use a different secret, you can point the Soda Agent to an existing Kubernetes Secret in your cluster using the soda.apikey.existingSecret property.
To use an existing Kubernetes secret for Soda Agent’s Cloud API credentials, add existingSecret and the secretKeys values to your agent’s values YAML file, as in the following example.
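As a sketch, a matching secret could be created as follows; the secret name and key names here are hypothetical and must agree with the existingSecret, idKey, and secretKey values in your values YAML file:

```shell
# Hypothetical names: soda-api-credentials, api-key-id, api-key-secret.
# Substitute your actual Soda Cloud API key pair for the *** placeholders.
kubectl create secret generic soda-api-credentials \
  --namespace soda-agent \
  --from-literal=api-key-id='***' \
  --from-literal=api-key-secret='***'
```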
The default Soda Agent settings balance performance and cost-efficiency. You can adjust these settings to better suit your needs, optimizing for larger datasets, faster scans, or improved resource management.
The example below demonstrates how you can increase the memory limit using settings in your values.yml file:
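A sketch of such a fragment, using the scan launcher limits that appear elsewhere on this page; the figures are illustrative, not recommendations:

```yaml
soda:
  scanlauncher:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
```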
Organizations that use a Security Assertion Markup Language (SAML) 2.0 single sign-on (SSO) identity provider (IdP) can add Soda Cloud as a service provider.
Once added, employees of the organization can gain authorized and authenticated access to the organization’s Soda Cloud account by successfully logging in to their SSO. This solution not only simplifies a secure login experience for users, it enables IT Admins to:
grant their internal users access to Soda Cloud from within their existing SSO solution
revoke their internal users’ access to Soda Cloud from within their existing SSO solution if a user leaves their organization or no longer requires access to Soda Cloud
set up one-way user group syncing from their IdP into Soda Cloud (tested and documented for Azure Active Directory and Okta)
Soda Cloud can act as a service provider for any SAML 2.0 SSO identity provider. In particular, Soda has tested and written setup instructions for the identity providers covered below.
Soda has also tested and confirmed that SSO setup works with the following identity providers:
OneLogin
Auth0
Patronus
When an employee uses their SSO provider to access Soda Cloud for the first time, Soda Cloud automatically assigns the new user to roles and groups according to the default settings for any new users. Soda Cloud also notifies the Soda Cloud Admin that a new user has joined the organization, and the new user receives a message indicating that their Soda Cloud Admin was notified of their first login. A Soda Cloud Admin, or a user with the permission to do so, can adjust users’ roles in Organization Settings; see the roles and permissions documentation for details.
When an organization’s IT Admin revokes a user’s access to Soda Cloud through the SSO provider, a Soda Cloud Admin is responsible for updating the resources and ownerships linked to the User.
Once your organization enables SSO for all Soda Cloud users, Soda Cloud blocks all non-SSO login attempts and password changes. If an employee attempts a non-SSO login or attempts to change a password using “Forgot password?”, Soda Cloud presents a message that explains that they must log in or change their password using their SSO provider.
Optionally, you can set up the SSO integration with Soda to include a one-way sync of user groups from your IdP into Soda Cloud, which synchronizes on each user login to Soda via SSO.
Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations. Be sure to indicate which type of SSO your organization uses when setting it up with the Soda Support team.
Email Soda Support to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab.
Soda Support sends you the samlUrl that you need to configure the setup with your identity provider.
As a user with sufficient privileges in your organization’s Azure AD account, sign in to the Azure portal, then navigate to Enterprise applications. Click New application.
Click Create your own application.
Email Soda Support to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab.
Soda Support sends you the samlUrl that you need to configure the setup with your identity provider.
As an Okta Administrator, log in to Okta and navigate to Applications > Applications overview, then click Create App Integration. Refer to Okta documentation for the full procedure.
Select SAML 2.0.
Email Soda Support to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab.
Soda Support sends you the samlUrl that you need to configure the setup with your identity provider.
As an administrator in your Google Workspace, follow the instructions in Google documentation to Set up your own custom SAML application.
Optionally, upload the Soda logo so the app appears in the app launcher with the logo instead of the first two letters of the app name.
If you wish, you can choose to regularly one-way sync the user groups you have defined in your IdP into Soda Cloud.
Doing so obviates the need to manually create user groups in Soda Cloud that you have already defined in your IdP, and enables your team to select IdP-managed user groups when assigning ownership or access permissions to a resource, in addition to any user groups you may have created manually in Soda Cloud.
Soda has tested and documented one-way syncing of user groups with Soda Cloud for Okta and Azure Active Directory. Contact Soda Support to request tested and documented support for other IdPs.
Soda synchronizes user groups with the IdP every time a user in your organization logs in to Soda via SSO. Soda updates the user’s group membership according to the IdP user groups to which they belong at each log in.
You cannot manage IdP user group settings or membership in Soda Cloud. Any changes that you wish to make to IdP-managed user groups must be done in the IdP itself.
In step 10 of the SAML application setup procedure above, in the same User Attributes & Claims section of your Soda SAML Application in Azure AD, follow Microsoft’s instructions to add a group claim to your Soda SAML Application.
For the choice of which groups should be returned in the claim, best practice suggests selecting Groups assigned to the application.
For the choice of Source attribute, select Cloud-only group display names.
In step 7 of the SAML application integration procedure above, follow Okta’s instructions to add a group attribute statement.
For the Name value, use Group.Authorization.
Leave the optional Name Format value as Unspecified.
Use the Filter to find a group that you wish to make available in Soda Cloud to manage access and permissions. Exercise caution! A broad filter may include user groups you do not wish to include in the sync. Double-check that the groups you select are appropriate.
To renew an SSO certificate, you need to provide Soda with the new X.509 certificate, with which Soda will update your Soda organization's SSO configuration. Since Soda can only validate SSO against one certificate, there will be downtime between you deactivating the old certificate, and Soda updating the SSO configuration.
Depending on your organization's certificate renewal process, you can notify Soda (or arrange a call) in advance of the specific date and time you plan to renew, so that Soda is prepared for your update and the downtime is minimized.
```shell
helm repo add soda-agent [REPOSITORY_URL_PROVIDED]
```

```shell
kubectl delete ns soda-agent
```

```shell
az aks delete --resource-group SodaAgent --name soda-agent-cli-test --yes
```

```yaml
soda:
  agent:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x
  scanlauncher:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x
```

```yaml
soda:
  agent:
    resources:
      limits:
        cpu: 250m
        memory: 375Mi
      requests:
        cpu: 250m
        memory: 375Mi
```

```shell
kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
```

```shell
helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --namespace soda-agent
```

```shell
az aks create \
  --resource-group SodaAgent \
  --name SodaAgentCluster \
  --node-count 1 \
  --generate-ssh-keys
```

```yaml
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
    env:
      POSTGRES_USER: "sodalibrary"
      POSTGRES_PASS: "sodalibrary"
```

```shell
helm upgrade soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent
```

```yaml
type: postgres
name: postgres
connection:
  host:
  port:
  database:
  user: ${env.POSTGRES_USER}
  password: ${env.POSTGRES_PASS}
```

```shell
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets \
  external-secrets/external-secrets \
  -n external-secrets \
  --create-namespace
```

```shell
kubectl -n external-secrets get all
```

```yaml
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
```

```shell
helm install soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent
```

```yaml
soda:
  apikey:
    existingSecret: "<existing-secret-name>"
    secretKeys:
      idKey: "<key-for-api-id>"
      secretKey: "<key-for-api-secret>"
```

```yaml
soda:
  scanlauncher:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
  contractlauncher:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
```
Schema changes: changes in the schema compared to the previous scan; any change is automatically flagged as an anomaly.
Partition row count (based on the time partition column): the number of rows in the last partition at scan time.
Most recent timestamp (based on the time partition column): time difference between scan time and the maximum timestamp in the partition column (at scan time).

| Data source | Available since | Schema changes | Partition row count | Most recent timestamp |
| PostgreSQL | June 6th | ✅ | — | ✅ |
| AWS Aurora | June 30th | ✅ | — | ✅ |
| MS SQL Server | June 30th | ✅ | — | ✅ |
| Oracle | June 30th | June 30th | — | ✅ |
| Redshift | September 1st | June 30th | June 30th | ✅ |
| BigQuery | September 1st | ✅ | June 30th | ✅ |
| MySQL | Upcoming | — | — | ✅ |
| Trino | Upcoming | Upcoming | — | ✅ |
| Athena | Upcoming | Upcoming | — | ✅ |
Organization-level permissions (✓): Manage organization settings; Deactivate users; Create, edit, or delete user groups; Create, edit, or delete dataset roles; Create, edit, or delete global roles; Assign global roles to users or user groups; Add, edit, or delete integrations; Access and download the audit trail.
Manage scan definitions: update scan definitions and run scan definitions manually. (✓)
Access failed row samples for checks: allow users to see samples of rows that are considered invalid. (✓ ✓ ✓)
Configure dataset: allow users to define dataset attributes and owner, change settings, and add/enable/configure metric monitors at a dataset level. (✓ ✓)
Manage dataset responsibilities: allow users to grant and remove permissions through responsibilities. (✓)
Manage data contracts: allow users to modify and verify the data contract. (✓ ✓)
Propose checks: allow users to propose changes in the data contract. (✓ ✓ ✓)
Manage incidents: allow users to edit and close incidents. (✓ ✓ ✓)
Delete dataset: allow users to remove a dataset and its checks. (✓)
a refreshInterval to indicate how often the ESO must reconcile the remoteRef values; this ought to correspond to the frequency with which your passwords are reset
the secretStoreRef to indicate the ClusterSecretStore through which to access the vault
a target template that creates a file called soda-agent.conf into which it adds the username and password values in the dotenv format that the Soda Agent expects.
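With the illustrative credentials used on this page, the rendered soda-agent.conf would contain plain dotenv lines, one KEY=value pair per line:

```
POSTGRES_USERNAME=sodalibrary
POSTGRES_PASSWORD=sodalibrary
```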
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-app-role
spec:
  provider:
    vault:
      auth:
        appRole:
          path: approle
          roleId: 3e****54-****-936e-****-5c5a19a5eeeb
          secretRef:
            key: appRoleSecretId
            name: external-secrets-vault-app-role-secret-id
            namespace: external-secrets
      path: kv
      server: http://vault.vault.svc.cluster.local:8200
      version: v2
```

```shell
kubectl apply -f cluster-secret-store.yaml
```

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: soda-agent
  namespace: soda-agent
spec:
  data:
    - remoteRef:
        key: local/soda
        property: POSTGRES_USERNAME
      secretKey: POSTGRES_USERNAME
    - remoteRef:
        key: local/soda
        property: POSTGRES_PASSWORD
      secretKey: POSTGRES_PASSWORD
  refreshInterval: 1m
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-app-role
  target:
    name: soda-agent-secrets
    template:
      data:
        soda-agent.conf: |
          POSTGRES_USERNAME={{ .POSTGRES_USERNAME }}
          POSTGRES_PASSWORD={{ .POSTGRES_PASSWORD }}
      engineVersion: v2
```

```shell
kubectl apply -n soda-agent -f soda-secret.yaml
```

```shell
kubectl get secret -n soda-agent soda-agent-secrets
```

Output:
```
NAME                 TYPE     DATA   AGE
soda-agent-secrets   Opaque   1      24h
```

```yaml
soda:
  apikey:
    id: "154k***889"
    secret: "9sfjf****ff4"
  agent:
    name: "my-soda-agent-external-secrets"
  scanLauncher:
    existingSecrets:
      # from spec.target.name in the ExternalSecret file
      - soda-agent-secrets
  contractLauncher:
    existingSecrets:
      # from spec.target.name in the ExternalSecret file
      - soda-agent-secrets
  cloud:
    # Use https://cloud.us.soda.io for US region
    # Use https://cloud.soda.io for EU region
    endpoint: "https://cloud.soda.io"
```

```shell
helm install soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent
```

Output:
```
NAME: soda-agent
LAST DEPLOYED: Tue Aug 29 13:08:51 2023
NAMESPACE: soda-agent
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Success, the Soda Agent is now running.
You can inspect the Orchestrators logs if you like, but if all was configured correctly, the Agent should show up in Soda Cloud.
Check the logs using:
kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent
```

endpoint: specify the endpoint according to your local region, https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
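Both options correspond to the soda.agent.logFormat and soda.agent.loglevel parameters described later on this page. As a sketch, they can be set at deploy time like this (agent name and namespace as used elsewhere on this page):

```shell
helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  --set soda.agent.logFormat=json \
  --set soda.agent.loglevel=DEBUG \
  --namespace soda-agent
```

Alternatively, set the same values in your values.yml file rather than via --set flags.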


In the right pane that appears, provide a name for your app, such as Soda Cloud, then select the (Non-gallery) option. Click Create.
After Azure AD creates your app, click Single sign-on in the left nav under the Manage heading, then select the SAML tile.
In the Basic SAML Configuration block that appears, click Edit.
In the Basic SAML Configuration panel, there are two fields to populate:
Identifier (Entity ID), which is the value of samlUrl from step 1.
Reply URL, which is the value of samlUrl from step 1.
Click Save, then close the confirmation message pop-up.
In the User Attributes & Claims panel, click Edit to add some attribute mappings.
Configure the claims as per the following example. Soda Cloud uses familyname and givenname, and maps emailaddress to user.userprincipalname.
(Optional) Follow the additional steps to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Azure AD.
Scroll down to collect the values of three fields that Soda needs to complete the Azure AD SSO integration:
Azure AD Identifier (Section 4 in Azure). This is the IdP entity ID, or Identity Provider Issuer, that Soda needs.
Login URL (Section 4 in Azure). This is the IdP SSO service URL, or Identity Provider Single Sign-On URL that Soda needs.
X.509 Certificate. Click the Download link next to Certificate (Base64).
Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion.
Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.
(Optional) Ask Soda to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Azure AD.
Test the integration by assigning the Soda application in Azure AD to a single user, then requesting that they log in.
After a successful single-user test of the sign in, assign access to the Soda Azure AD app to users and/or user groups in your organization.
Provide a name for the application, Soda Cloud, and upload the Soda logo.
Click Next. In the Configure SAML tab, there are two fields to populate:
Single sign on URL, which is the value of samlUrl from step 1.
Audience URI (SP Entity ID), which is also the value of samlUrl from step 1.
The values for these fields are unique to your organization and are provided to you by Soda; they follow this pattern: https://cloud.soda.io/sso/<your-organization-identifier>/saml.
Be sure to use an email address as the application username.
Scroll down to Attribute Statements to map the following values, then click Next to continue.
map User.GivenName to user.firstName
map User.FamilyName to user.lastName
map User.Email to user.email
(Optional) Follow the additional steps to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Okta.
Select the following options, then click Finish.
I’m an Okta customer adding an internal app.
This is an internal app that we have created.
In the Sign On pane of the application, scroll down to click View Setup Instructions.
Collect the values of three fields that Soda needs to complete the Okta SSO integration:
Identity Provider Single Sign-On URL
Identity Provider Issuer
X.509 Certificate
Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion.
Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.
(Optional) Ask Soda to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Okta.
Test the integration by assigning the Soda application in Okta to a single user, then requesting that they log in.
After a successful single-user test of the sign in, assign access to the Soda Okta app to users and/or user groups in your organization.
On the Google Identity Provider details page, be sure to copy or download the following values:
SSO URL
Entity ID
IDP metadata
Certificate
On the SAML Attribute mapping page, add two Google directory attributes and map as follows:
Last Name → User.FamilyName
First Name → User.GivenName
Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion. Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.
In the Google Workspace admin portal, use Google’s instructions to Turn on your SAML app and verify that SSO works with the new custom app for Soda.
After saving the group claim, navigate to Users and Groups in the left menu, and follow Microsoft’s instructions to Assign a user or group to an enterprise application. Add any existing groups to the Soda SAML Application that you wish to make available in Soda Cloud to manage access and permissions.
In your message to Soda Support or your Soda Customer Engineer, advise Soda that you wish to enable user group syncing. Soda adds a setting to your SSO configuration to enable it.
When the SSO integration is complete, you and your team can select your IdP user groups from the dropdown list of choices available when assigning ownership or permissions to resources.
Use the Add Another button to add as many groups as you wish to make available in Soda Cloud.
In your message to Soda Support or your Soda Customer Engineer, advise Soda that you wish to enable user group syncing. Soda adds a setting to your SSO configuration to enable it.
When the SSO integration is complete, you and your team can select your IdP user groups from the dropdown list of choices available when assigning ownership or permissions to resources.
You have an AWS account and the necessary permissions to enable you to create, or gain access to an EKS cluster in your region.
You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. Run kubectl version --output=yaml to check the version of an existing install.
You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.
You have whitelisted these URLs, depending on whether you are using Soda EU cloud.soda.io or Soda US cloud.us.soda.io:
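Per the firewall guidance later on this page, the agent needs outbound access on port 443 to the following hosts:

```
cloud.soda.io       # Soda Cloud, EU region
cloud.us.soda.io    # Soda Cloud, US region
collect.soda.io     # both regions
```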
Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.
Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider:
fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%.
adding more nodes to the node group; see AWS documentation for Scaling Managed Nodegroups.
adding a cluster autoscaler to your Kubernetes cluster; for AWS, see the EKS documentation on autoscaling.
Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.
To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.
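A filled-in version of that values.yml fragment might look like the following; the CPU and memory figures are placeholders to adapt, not recommendations:

```yaml
soda:
  agent:
    resources:
      limits:
        cpu: 500m
        memory: 1Gi
      requests:
        cpu: 250m
        memory: 512Mi
  scanlauncher:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
```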
For reference, a Soda-hosted agent specifies resources as follows:
The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.
Install the Helm chart via CLI by providing values directly in the install command.
Use this as a straightforward way of deploying an agent on a cluster.
Install the Helm chart via CLI by providing values in a values YAML file.
Use this as a way of deploying an agent on a cluster while keeping sensitive values secure: provide sensitive API key values in this local file, and store data source login credentials as environment variables in this local file or in an external secrets manager. Soda needs access to the credentials to be able to connect to your data source to run scans of your data.
(Optional) If you wish, you can establish an AWS PrivateLink to provide private connectivity with Soda Cloud. Refer to (Optional) Connect via AWS PrivateLink before deploying an agent.
(Optional) If you are deploying to an existing Virtual Private Cloud (VPC), consider supplying public or private subnets with your deployment. Consult the eksctl documentation to Use existing VPC.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent Helm chart. Best practice advises creating a managed node group into which you can deploy the agent.
Use Helm to add the Soda Agent Helm chart repository.
Use the following command to install the Helm chart, which deploys a Soda Agent in your cluster.
Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of soda.agent.name with a custom name for your agent, if you wish.
(Optional) Validate the Soda Agent deployment by running the following command:
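The validation referenced here is the standard kubectl pod inspection that the note below also mentions; as a sketch:

```shell
# Inspect the agent pods; look for State: Running and Ready: True
kubectl describe pods -n soda-agent
```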
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step 3 to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
(Optional) If you wish, you can establish an AWS PrivateLink to provide private connectivity with Soda Cloud. Refer to Connect via AWS PrivateLink before deploying an agent.
(Optional) If you are deploying to an existing Virtual Private Cloud (VPC), consider supplying public or private subnets with your deployment. Consult the eksctl documentation to Use existing VPC.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent Helm chart. Best practice advises creating a managed node group into which you can deploy the agent.
Use Helm to add the Soda Agent Helm chart repository.
Using a code editor, create a new YAML file called values.yml.
To that file, copy+paste the content below, replacing the following values:
id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of name with a custom name for your agent, if you wish.
Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step four to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
If you use AWS services for your infrastructure and you have deployed or will deploy a Soda Agent in an EKS cluster, you can use an AWS PrivateLink to provide private connectivity with Soda Cloud.
Log in to your AWS console and navigate to your VPC dashboard.
Follow the AWS documentation to Connect to an endpoint service as the service customer. For security reasons, Soda does not publish its Service name. Email [email protected] with your AWS account ID to request the PrivateLink service name. Refer to AWS documentation for instructions on how to obtain your account ID.
After creating the endpoint, return to the VPC dashboard. When the status of the endpoint becomes Available, the PrivateLink is ready to use. Be aware that this may take more than 10 minutes.
Deploy a Soda Agent to your AWS EKS cluster, or, if you have already deployed one, restart your Soda Agent to begin sending data to Soda Cloud via the PrivateLink.
After you have started the agent and validated that it is running, log into your Soda Cloud account, then navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
helm install: the action helm is to take
soda-agent (the first one): a release named soda-agent on your cluster
soda-agent (the second one): the name of the helm repo you installed
soda-agent (the third one): the name of the helm chart that is the Soda Agent
The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with --set flags as this command does, or you can specify the override values using a values.yml file.
--set soda.agent.name
A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.
--set soda.apikey.id
With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.apikey.secret
With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.agent.logFormat
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
--set soda.agent.loglevel
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
--namespace soda-agent
Use the namespace value to identify the namespace in which to deploy the agent.
Uninstall the Soda Agent in the cluster.
Delete the EKS cluster itself.
(Optional) Access your CloudFormation console, then click Stacks to view the status of your decommissioned cluster. If you do not see your Stack, use the region drop-down menu at upper-right to select the region in which you created the cluster.
Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.
Solution: The value you specify for the soda-cloud-endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:
Use https://cloud.us.soda.io for the United States
Use https://cloud.soda.io for all else
Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.
Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:
cloud.us.soda.io for Soda Cloud accounts created in the US region
OR
cloud.soda.io for Soda Cloud accounts created in the EU region
AND
collect.soda.io
Problem: UnauthorizedOperation: You are not authorized to perform this operation.
Solution: This error indicates that your user profile is not authorized to create the cluster. Contact your AWS Administrator to request the appropriate permissions.
You have a Google Cloud Platform (GCP) account and the necessary permissions to enable you to create, or gain access to an existing Google Kubernetes Engine (GKE) cluster in your region.
You have installed the gcloud CLI tool. Use the command gcloud version to verify the version of an existing install.
If you have already installed the gcloud CLI, use the following commands to log in and verify your configuration settings, respectively:
gcloud auth login
gcloud config list
If you are installing the gcloud CLI for the first time, be sure to complete all the installation steps to properly install and configure the tool.
Consider using the command gcloud cheat-sheet to learn a few basic gcloud commands.
You have installed v1.22 or v1.23 of . This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. With Docker running, use the command kubectl version --output=yaml to check the version of an existing install.
You have installed . This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.
You have whitelisted these URLs, depending on whether you are using Soda EU cloud.soda.io or Soda US cloud.us.soda.io:
Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.
Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.
To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.
For reference, a Soda-hosted agent specifies resources as follows:
The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.
Install the Helm chart via CLI by providing values directly in the install command.
Use this as a straightforward way of deploying an agent on a cluster in a secure or local environment.
Install the Helm chart via CLI by providing values in a values YAML file.
Use this as a way of deploying an agent on a cluster while keeping sensitive values secure:
- provide sensitive API key values in this local file
- store data source login credentials as environment variables in this local file or in an external secrets manager; Soda needs access to the credentials to be able to connect to your data source to run scans of your data. See:
(Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.
Add the Soda Agent Helm chart repository.
Use the following command to install the Helm chart to deploy a Soda Agent in your cluster. (Learn more about the helm install command.)
Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of soda.agent.name with a custom name for your agent, if you wish.
Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
The command-line produces output like the following message:
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that the agent may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step three to check the status of the deployment. When the status is Running, you can refresh the page and see the agent in Soda Cloud.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
(Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.
Add the Soda Agent Helm chart repository.
Using a code editor, create a new YAML file called values.yml.
In that file, copy+paste the content below, replacing the following values:
id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of name with a custom name for your agent, if you wish.
Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that the agent may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step four to check the status of the deployment. When the status is Running, you can refresh the page and see the agent in Soda Cloud.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
helm install
the action helm is to take
soda-agent (the first one)
a release named soda-agent on your cluster
soda-agent (the second one)
the name of the helm repo you installed
soda-agent (the third one)
the name of the helm chart that is the Soda Agent
The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with the --set flags as this command does, or you can specify the override values using a values.yml file.
--set soda.agent.name
A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.
--set soda.apikey.id
With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.apikey.secret
With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.agent.logFormat
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
--set soda.agent.loglevel
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
--namespace soda-agent
Use the namespace value to identify the namespace in which to deploy the agent.
Uninstall the Soda Agent in the cluster.
Delete the cluster.
Refer to Google Kubernetes Engine documentation for details.
Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.
Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:
Use https://cloud.us.soda.io for the United States
Use https://cloud.soda.io for all else
Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.
Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:
cloud.us.soda.io for Soda Cloud account created in the US region
OR
cloud.soda.io for Soda Cloud account created in the EU region
AND
collect.soda.io
This page provides detailed information about how to configure the Soda↔Collibra integration.
Both Collibra and Soda need to be configured so the integration can run successfully. This page covers both Collibra and Soda settings, including asset types, attribute types, relation types, and domain mappings. These settings establish the foundation for reliable synchronization of data quality checks and metadata between Soda and Collibra.
Configure the different types of assets in Collibra:
Define the attributes that will be set on check assets:
Flexible Extraction: Automatically extracts metrics from any diagnostic type (missing, aggregate, valid, etc.)
Future-Proof: Works with new diagnostic types that Soda may introduce
Smart Fallbacks: Falls back to datasetRowsTested
Define the types of relationships between assets:
Configure ownership role mappings:
Configure the domains where assets will be created:
Define Soda attributes and their mappings:
Multiple dimensions support
The integration supports both single and multiple dimensions for data quality checks:
Single dimension: Specify as a string value (e.g., "Completeness")
Multiple dimensions: Use a comma-separated string (e.g., "Completeness, Consistency")
When multiple dimensions are provided as a comma-separated string, the integration will:
Automatically split the string by commas and trim whitespace
Search for each dimension asset in Collibra individually
Create a relation for each dimension found
Log a warning for any dimension that cannot be found in Collibra
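The splitting behavior described above can be sketched in Python. This is an illustrative sketch only: `find_dimension_asset` is a hypothetical stand-in for the integration's Collibra lookup, not a function name from its source code.

```python
import logging

def resolve_dimension_relations(dimension_value, find_dimension_asset):
    """Split a comma-separated dimension string and look up each asset.

    `find_dimension_asset` should return a Collibra asset ID, or None
    when the dimension cannot be found.
    """
    relations = []
    for name in (part.strip() for part in dimension_value.split(",")):
        if not name:
            continue
        asset_id = find_dimension_asset(name)
        if asset_id is None:
            # Missing dimensions are logged but do not stop processing
            logging.warning("Dimension not found in Collibra: %s", name)
            continue
        relations.append((name, asset_id))
    return relations
```

With `dimension: "Completeness, Consistency, Accuracy"` and a lookup that knows only Completeness and Accuracy, the sketch returns two relations and logs one warning, matching the behavior described above.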
Example Configuration:
This will create three separate dimension relations in Collibra, one for each dimension specified.
Monitor Exclusion
The integration can exclude Soda monitors (items with metricType) from synchronization:
Enabled (sync_monitors: true): All checks and monitors are synchronized (default)
Disabled (sync_monitors: false): Only checks are synchronized, monitors are filtered out
When sync_monitors is disabled, the integration will:
Filter out all items that have a metricType attribute
Only process actual checks (items without metricType)
Log the number of monitors filtered out for each dataset
Continue processing with the remaining checks
This is useful when you want to focus on data quality checks and exclude monitoring metrics from your Collibra catalog.
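The filtering rule above amounts to dropping any item that carries a metricType attribute. A minimal sketch, assuming checks and monitors arrive as dictionaries (a simplification of the real API payloads):

```python
def filter_monitors(items, sync_monitors=True):
    """Return only checks when monitor syncing is disabled.

    Monitors are identified by the presence of a metricType key;
    checks have no metricType.
    """
    if sync_monitors:
        return items  # default: sync everything
    checks = [item for item in items if "metricType" not in item]
    filtered = len(items) - len(checks)
    if filtered:
        # The integration logs the number of monitors filtered per dataset
        print(f"Filtered out {filtered} monitor(s)")
    return checks
```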
Custom Attribute Syncing configuration
See the section below for detailed instructions.
The integration supports syncing custom attributes from Soda checks to Collibra assets, allowing you to enrich your Collibra assets with business context and additional metadata from your data quality checks.
Custom attribute syncing enables you to map specific attributes from your Soda checks to corresponding attribute types in Collibra. When a check is synchronized, the integration will automatically extract the values of these attributes and set them on the created/updated Collibra asset.
To enable custom attribute syncing, add the custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id configuration to your config.yaml file:
The configuration value is a JSON string containing key-value pairs where:
Key: The name of the attribute in Soda (as it appears on your Soda checks)
Value: The UUID of the corresponding attribute type in Collibra
First, identify which attributes from your Soda checks you want to sync to Collibra. Common examples include:
description - Check description
business_impact - Business impact assessment
data_domain - Data domain classification
For each Soda attribute, find the corresponding attribute type UUID in Collibra:
Navigate to your Collibra instance
Go to Settings → Metamodel → Attribute Types
Find or create the attribute types you want to map to
Copy the UUID of each attribute type
Create a JSON object mapping Soda attribute names to Collibra attribute type UUIDs:
Add the JSON mapping to your config.yaml file as a single-line string:
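A low-risk way to produce a valid single-line mapping string is to build the mapping as a dictionary and serialize it with `json.dumps`; round-tripping with `json.loads` confirms the string is valid JSON before you paste it into config.yaml. The UUIDs below are the example values used on this page.

```python
import json

# Soda attribute name -> Collibra attribute type UUID (example values from this page)
mapping = {
    "description": "00000000-0000-0000-0000-000000003114",
    "business_impact": "01975f7b-0c04-7b98-9fb8-6635261a7c7b",
}

# json.dumps yields the single-line string to place inside single quotes in config.yaml
single_line = json.dumps(mapping)
print(single_line)

# Quick sanity check: the string parses back to the same mapping
assert json.loads(single_line) == mapping
```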
Here's a complete example showing how to configure custom attribute syncing:
Soda Check with Custom Attributes:
Result: When this check is synchronized, the integration will create a Collibra asset with these attributes automatically set:
Description: "Ensures orders table is not empty"
Business Impact: "critical"
Data Domain: "sales"
Criticality: "high"
JSON Format: The mapping must be a valid JSON string enclosed in single quotes
Attribute Type UUIDs: Use the exact UUIDs from your Collibra metamodel
Case Sensitivity: Soda attribute names are case-sensitive and must match exactly
Missing Attributes: If a Soda check doesn't have an attribute defined in the mapping, it will be skipped (no error)
Common Issues:
Invalid JSON: Ensure the JSON string is properly formatted and enclosed in single quotes
Attribute Not Found: Verify the Soda attribute names match exactly what's defined in your checks
UUID Errors: Confirm the Collibra attribute type UUIDs are correct and exist in your instance
Permission Issues: Ensure your Collibra user has permissions to set the specified attribute types
Debug Mode: Run with debug mode to see detailed logging about custom attribute processing:
Look for log messages like:
Processing custom attribute: attribute_name
Successfully set custom attribute: attribute_name
Skipping custom attribute (not found in check): attribute_name
The integration automatically synchronizes deletions, removing obsolete check assets from Collibra when checks are deleted or removed in Soda.
Pattern Matching: For each dataset, the integration searches for all check assets in Collibra using the naming pattern {checkname}___{datasetName}
Comparison: Compares the list of check assets in Collibra with the current checks returned from Soda
Identification: Identifies assets that exist in Collibra but are no longer present in Soda
Automatic Cleanup: Keeps your Collibra catalog in sync with Soda without manual intervention
Efficient Processing: Uses bulk deletion operations to minimize API calls
Idempotent: Safe to run multiple times - handles already-deleted assets gracefully
Transparent: Shows deletion progress in the console output and tracks metrics
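The comparison step reduces to a set difference over expected asset names built with the {checkname}___{datasetName} pattern. A sketch under that assumption (the function name is illustrative, not from the integration's source):

```python
def find_obsolete_assets(collibra_asset_names, soda_check_names, dataset_name):
    """Return Collibra check assets that no longer exist in Soda.

    Expected names follow the {checkname}___{datasetName} pattern
    described above; anything in Collibra outside that set is obsolete.
    """
    expected = {f"{check}___{dataset_name}" for check in soda_check_names}
    return sorted(set(collibra_asset_names) - expected)
```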
When obsolete checks are found and deleted, you'll see:
And in the summary:
No additional configuration is required. Deletion synchronization is enabled by default and runs automatically for each dataset during the integration process.
Deletion synchronization is tracked in the integration metrics:
Checks deleted: Number of obsolete check assets removed from Collibra
Error Tracking: Any deletion failures are recorded in the error summary
404 Errors: If assets are already deleted (404 response), the integration treats this as success and continues
Other Errors: Network issues, authentication problems, or other HTTP errors are retried with exponential backoff
Missing Assets: If no check assets are found in Collibra for a dataset, deletion sync is skipped
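The retry-with-backoff and 404-as-success behavior described above can be sketched as follows; `delete_fn` is a hypothetical stand-in for the Collibra delete call, assumed here to return an HTTP status code.

```python
import time

def delete_with_retry(delete_fn, asset_id, max_attempts=4, base_delay=1.0):
    """Retry a deletion with exponential backoff; treat 404 as success.

    An already-deleted asset (404) counts as a successful deletion,
    so repeated runs remain idempotent.
    """
    for attempt in range(max_attempts):
        status = delete_fn(asset_id)
        if status in (200, 204, 404):
            return True
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return False
```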
The integration supports automatic synchronization of dataset ownership from Collibra to Soda.
Asset Discovery: For each dataset, finds the corresponding table asset in Collibra
Responsibility Extraction: Retrieves ownership responsibilities from Collibra
User Mapping: Maps Collibra users to Soda users by email address
Ownership Update: Updates the Soda dataset with synchronized owners
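The user-mapping step above matches Collibra owners to Soda users by email address. A simplified sketch, assuming both sides are lists of dictionaries with an email field (the real API payloads carry more fields):

```python
def map_owners_by_email(collibra_owners, soda_users):
    """Match Collibra owners to Soda users via (case-insensitive) email.

    Returns matched Soda user IDs plus the emails that could not be
    matched, which the integration tracks as errors and skips.
    """
    soda_by_email = {u["email"].lower(): u["id"] for u in soda_users}
    matched, unmatched = [], []
    for owner in collibra_owners:
        user_id = soda_by_email.get(owner["email"].lower())
        if user_id:
            matched.append(user_id)
        else:
            unmatched.append(owner["email"])
    return matched, unmatched
```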
Ensure the following are configured in your config.yaml:
Ownership synchronization is tracked in the integration metrics:
👥 Owners synchronized: Number of successful ownership transfers
❌ Ownership sync failures: Number of failed synchronization attempts
Common issues and their handling:
Missing Collibra Asset: Skip ownership sync for that dataset
No Collibra Owners: Log information message, continue processing
User Email Mismatch: Track as error, continue with remaining users
Soda API Failures: Retry with exponential backoff
In order to show the Soda Data Quality score in Collibra, you will need to create an aggregation path as follows:
Navigate to Collibra Settings > Operating Model > Quality Score Aggregation
Create a new score aggregation. You will create two different aggregations as follows:
If you are using Collibra as a report catalog and want to show Quality Scores on your reports, you will create a third aggregation using the path “Report is part of data structure” & “Asset complies with Governance Asset”.
Assign the new aggregation paths to the asset types COLUMN and TABLE (and any other asset types such as a REPORT).
Collibra Settings > Operating Model > Asset Types > Column
Click the assignment being used (Default Assignment) > Quality Score Aggregations > External Data Quality > Choose “Soda Data Quality [COLUMN]"
Navigate to Collibra Settings > Operating Model > Asset Types > Table
(Optional) If you want to show the Soda Data Quality score in a diagram view on the assets types, you will need to add the above aggregations as an overlay for each asset type (Column, Table, Report) as follows:
For advanced configuration details, head to .
Soda-hosted agents are included in all Free, Team, and Enterprise plans at no additional cost. However, self-hosted agents require an Enterprise plan.
If you wish to use self-hosted agents, please contact us at https://www.soda.io/contact to discuss Enterprise plan options or via the support portal for existing customers.
You have created, or have access to an existing Kubernetes cluster into which you can deploy a Soda Agent.
You have installed v1.22 or v1.23 of . This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. With Docker running, use the command kubectl version --output=yaml to check the version of an existing install.
You have installed . This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.
Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.
Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.
To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.
For reference, a Soda-hosted agent specifies resources as follows:
The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.
Add the Soda Agent Helm chart repository.
Use the following command to install the Helm chart to deploy a Soda Agent in your cluster. Learn more about the .
Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.
Add the Soda Agent Helm chart repository.
helm repo add soda-agent [REPOSITORY_URL_PROVIDED]
Using a code editor, create a new YAML file called values.yml.
If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
If you use private key authentication with a Soda Agent, refer to .
helm install command
The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with the --set flags as this command does, or you can specify the override values using a values.yml file.
Uninstall the Soda Agent in the cluster.
Delete the cluster.
Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.
Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:
Use https://cloud.us.soda.io for the United States
Use https://cloud.soda.io for all else
Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.
Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:
cloud.us.soda.io for Soda Cloud account created in the US region
OR
cloud.soda.io for Soda Cloud account created in the EU region
AND
collect.soda.io
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
    logformat: "raw"
    loglevel: "ERROR"
  cloud:
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    endpoint: "https://cloud.soda.io"

helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
  --set soda.cloud.endpoint=https://cloud.us.soda.io \
  # Use <us> for US region; use <eu> for EU region
  --set soda.cloud.region=us \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --set soda.agent.logFormat=raw \
  --set soda.agent.loglevel=ERROR \
  --namespace soda-agent

NAME: soda-agent
LAST DEPLOYED: Mon Nov 21 16:29:38 2022
NAMESPACE: soda-agent
STATUS: deployed
REVISION: 1

kubectl get pods -n soda-agent

NAME                                     READY   STATUS    RESTARTS   AGE
soda-agent-orchestrator-ffd74c76-5g7tl   1/1     Running   0          32s

kubectl create ns soda-agent

namespace/soda-agent created

helm install soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent

kubectl describe pods -n soda-agent

helm uninstall soda-agent -n soda-agent

eksctl delete cluster --name soda-agent

soda:
  agent:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x
  scanlauncher:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x

soda:
  agent:
    resources:
      limits:
        cpu: 250m
        memory: 375Mi
      requests:
        cpu: 250m
        memory: 375Mi

kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f

helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --namespace soda-agent

helm repo add soda-agent [REPOSITORY_URL_PROVIDED]

helm uninstall soda-agent -n soda-agent

gcloud container clusters delete soda-agent-gke

soda:
  agent:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x
  scanlauncher:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x

soda:
  agent:
    resources:
      limits:
        cpu: 250m
        memory: 375Mi
      requests:
        cpu: 250m
        memory: 375Mi

kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f

helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=**** \
  --namespace soda-agent
cloud.soda.io
cloud.us.soda.io
registry.cloud.soda.io
registry.us.soda.io
soda-cloud-platform-registry.s3.eu-west-1.amazonaws.com
soda-cloud-us-platform-registry.s3.us-west-2.amazonaws.com
*.docker.io
*.docker.io
checkRowsTested
Calculated Values: Automatically computes check_rows_passed and check_passing_fraction when source data is available
Graceful Handling: Leaves attributes empty when diagnostic data is not present in the check result
Continue processing even if some dimensions are missing
criticality - Data criticality level
owner_team - Owning team information
Invalid UUIDs: Invalid Collibra attribute type UUIDs will cause the sync to fail for that attribute
Error Handling: Gracefully handles cases where assets are already deleted (404 errors), treating them as successful deletions
Metrics Tracking: Reports the number of checks deleted in the integration summary
Error Tracking: Records any failures for monitoring
Click the assignment being used (Default Assignment) > Quality Score Aggregations > External Data Quality > Choose “Soda Data Quality [TABLE]"






collibra:
  base_url: "https://your-instance.collibra.com/rest/2.0"
  username: "your-username"
  password: "your-password"
  general:
    naming_delimiter: ">"  # Used to separate parts of asset names
  asset_types:
    table_asset_type: "00000000-0000-0000-0000-000000031007"  # ID for Table assets
    soda_check_asset_type: "00000000-0000-0000-0000-000000031107"  # ID for Data Quality Metric type
    dimension_asset_type: "00000000-0000-0000-0000-000000031108"  # ID for Data Quality Dimension type
    column_asset_type: "00000000-0000-0000-0000-000000031109"  # ID for Column type
  attribute_types:
    # Standard Check Attributes
    check_evaluation_status_attribute: "00000000-0000-0000-0000-000000000238"  # Boolean attribute for pass/fail
    check_last_sync_date_attribute: "00000000-0000-0000-0000-000000000256"  # Last sync timestamp
    check_definition_attribute: "00000000-0000-0000-0000-000000000225"  # Check definition
    check_last_run_date_attribute: "01975dd9-a7b0-79fb-bb74-2c1f76402663"  # Last run timestamp
    check_cloud_url_attribute: "00000000-0000-0000-0000-000000000258"  # Link to Soda Cloud
    # Diagnostic Metric Attributes - Extracted from Soda check diagnostics
    check_loaded_rows_attribute: "00000000-0000-0000-0000-000000000233"  # Number of rows tested/loaded
    check_rows_failed_attribute: "00000000-0000-0000-0000-000000000237"  # Number of rows that failed
    check_rows_passed_attribute: "00000000-0000-0000-0000-000000000236"  # Number of rows that passed (calculated)
    check_passing_fraction_attribute: "00000000-0000-0000-0000-000000000240"  # Fraction of rows passing (calculated)
  relation_types:
    table_column_to_check_relation_type: "00000000-0000-0000-0000-000000007018"  # Relation between table/column and check
    check_to_dq_dimension_relation_type: "f7e0a26b-eed6-4ba9-9152-4a1363226640"  # Relation between check and dimension
  responsibilities:
    owner_role_id: "00000000-0000-0000-0000-000000005040"  # Collibra role ID for asset owners
  domains:
    data_quality_dimensions_domain: "00000000-0000-0000-0000-000000006019"  # Domain for DQ dimensions
    soda_collibra_domain_mapping: '{"Sales": "0197377f-e595-7434-82c7-3ce1499ac620"}'  # Dataset to domain mapping
    soda_collibra_default_domain: "01975b4a-0ace-79f6-b5ec-68656ca60b11"  # Default domain if no mapping

soda:
  api_key_id: "your-api-key-id"
  api_key_secret: "your-api-key-secret"
  base_url: "https://cloud.soda.io/api/v1"
  general:
    filter_datasets_to_sync_to_collibra: true  # Only sync datasets with sync attribute
    soda_no_collibra_dataset_skip_checks: false  # Skip checks if dataset not in Collibra
  attributes:
    soda_collibra_sync_dataset_attribute: "collibra_sync"  # Attribute to mark datasets for sync
    soda_collibra_domain_dataset_attribute_name: "rulebook"  # Attribute for domain mapping
    soda_dimension_attribute_name: "dimension"  # Attribute for DQ dimension

checks for orders:
  - row_count > 0:
      attributes:
        dimension: "Completeness, Consistency, Accuracy"

soda:
  attributes:
    # ... other attributes ...
    custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"soda_attribute_id": "collibra_attribute_type_uuid", "another_soda_attribute": "another_collibra_uuid"}'

{
  "description": "00000000-0000-0000-0000-000000003114",
  "business_impact": "01975f7b-0c04-7b98-9fb8-6635261a7c7b",
  "data_domain": "0197ca72-aee8-7259-9e88-5b98073147ed"
}

soda:
  attributes:
    custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"description": "00000000-0000-0000-0000-000000003114", "business_impact": "01975f7b-0c04-7b98-9fb8-6635261a7c7b", "data_domain": "0197ca72-aee8-7259-9e88-5b98073147ed"}'

checks for orders:
  - row_count > 0:
      attributes:
        description: "Ensures orders table is not empty"
        business_impact: "critical"
        data_domain: "sales"
        criticality: "high"

soda:
  attributes:
    soda_collibra_sync_dataset_attribute: "collibra_sync"
    soda_collibra_domain_dataset_attribute_name: "rulebook"
    soda_dimension_attribute_name: "dimension"
    custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"description": "00000000-0000-0000-0000-000000003114", "business_impact": "01975f7b-0c04-7b98-9fb8-6635261a7c7b", "data_domain": "0197ca72-aee8-7259-9e88-5b98073147ed", "criticality": "0197f2a8-1234-5678-9abc-def012345678"}'

python main.py --debug

Processing dataset 1/3: finance_loans
  📋 Getting checks...
  🔄 Processing 18 checks...
  🏗️ Preparing assets...
  📤 Creating/updating assets...
  📝 Processing metadata & relations...
  🗑️ Deleting 2 obsolete check(s)...
  👥 Syncing ownership...

🗑️ Checks deleted: 2

collibra:
  responsibilities:
    owner_role_id: "00000000-0000-0000-0000-000000005040"  # Collibra owner role ID

Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
Read more about the helm install command.
The command-line produces output like the following message:
Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
cloud.soda.io
cloud.us.soda.io
registry.cloud.soda.io
registry.us.soda.io
soda-cloud-platform-registry.s3.eu-west-1.amazonaws.com
soda-cloud-us-platform-registry.s3.us-west-2.amazonaws.com
*.docker.io
*.docker.io
You have been granted access to the private Soda Agent repository and received the necessary credentials and repository information.
You have whitelisted these URLs, depending on whether you are using Soda EU cloud.soda.io or Soda US cloud.us.soda.io:
Replace the value of soda.agent.name with a custom name for your agent, if you wish.
Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
The command-line produces output like the following message:
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step 3 to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.
Replace id and secret with the values you copied from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.
Replace the value of name with a custom name for your agent, if you wish.
Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.
(Optional) Validate the Soda Agent deployment by running the following command:
In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step 3 to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.
Install the Helm chart via CLI by providing values directly in the install command.
Use this as a straightforward way to deploy an agent on a cluster in a secure or local environment.
Install the Helm chart via CLI by providing values in a values YAML file.
Use this as a way of deploying an agent on a cluster while keeping sensitive values secure:
- provide sensitive API key values in this local file
- store data source login credentials as environment variables in this local file or in an external secrets manager; Soda needs access to the credentials to be able to connect to your data source to run scans of your data. See Soda Agent Extra
helm install
the action helm is to take
soda-agent (the first one)
a release named soda-agent on your cluster
soda-agent (the second one)
the name of the helm repo you installed
soda-agent (the third one)
the name of the helm chart that is the Soda Agent
--set soda.agent.name
A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.
--set soda.apikey.id
With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.apikey.secret
With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.
--set soda.agent.logFormat
(Optional) Specify the format for log output: raw for plain text, or json for JSON format.
--set soda.agent.loglevel
(Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.
--namespace soda-agent
Use the namespace value to identify the namespace in which to deploy the agent.
The Publish Contracts on Merge workflow automates the publishing of any updated or new Soda contracts when changes are pushed to the main branch.
This ensures that all contract changes are automatically deployed to Soda Cloud whenever they're merged into the production branch.
Checks out the repo
Sets up Python
Installs the latest version of soda
Identifies changed files
Filters YAML files in the contracts/ directory
Publishes valid contracts to Soda Cloud
Make sure these are set in your repository’s GitHub Secrets:
SODA_CLOUD_API_KEY
SODA_CLOUD_API_SECRET
Learn more about how to Generate API keys
Change the Action trigger.
pip install
Can specify a fixed version of soda for stability.
SODA_CLOUD_CONFIG_FILE_PATH
Path to your Soda Cloud config. Can be replaced if your setup uses a different config file name or location.
contracts/*.yml or contracts/*.yaml
Modify file pattern to match a different directory or naming convention.
The Verify Contracts on Pull Request workflow ensures that contract changes in PRs are valid and do not break expectations before merging.
The workflow runs when a PR is opened, updated, or reopened.
Checks out the PR branch
Sets up Python
Installs latest soda-postgres
Identifies changed files
Filters contracts in the contracts/ directory
Runs verification checks against a configured data source
Make sure these are set in your repository’s GitHub Secrets:
DATASOURCE_USERNAME
DATASOURCE_PASSWORD
These secrets can be customized depending on the data source type and your needs.
Change the Action trigger
pip install
Adapt the install command to install the necessary package for your data source.
You can specify a fixed version of soda for stability.
contracts/*.yml or contracts/*.yaml
Change to match your directory structure.
DATASOURCE_CONFIG_FILE_PATH
Replace with the path to your data source configuration
DATASOURCE_USERNAME and DATASOURCE_PASSWORD
Adapt the secrets used to connect to your data source depending on the data source type and security requirements.




helm install soda-agent soda-agent/soda-agent \
--set soda.agent.name=myuniqueagent \
# Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
--set soda.cloud.endpoint=https://cloud.soda.io \
--set soda.apikey.id=*** \
--set soda.apikey.secret=**** \
--set soda.agent.logFormat=raw \
--set soda.agent.loglevel=ERROR \
--namespace soda-agent

NAME: soda-agent
LAST DEPLOYED: Thu Jun 16 10:12:47 2022
NAMESPACE: soda-agent
STATUS: deployed
REVISION: 1

soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
    logformat: "raw"
    loglevel: "ERROR"
  cloud:
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    endpoint: "https://cloud.soda.io"

helm repo add soda-agent [REPOSITORY_URL_PROVIDED]

kubectl describe pods

...
Containers:
soda-agent-orchestrator:
Container ID: docker://081*33a7
Image: sodadata/agent-orchestrator:latest
Image ID: docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 16 Jun 2022 15:50:28 -0700
Ready: True
...

helm repo add soda-agent [REPOSITORY_URL_PROVIDED]

helm install soda-agent soda-agent/soda-agent \
--values values.yml \
--namespace soda-agent

kubectl describe pods -n soda-agent

...
Containers:
soda-agent-orchestrator:
Container ID: docker://081*33a7
Image: sodadata/agent-orchestrator:latest
Image ID: docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 16 Jun 2022 15:50:28 -0700
Ready: True
...

kubectl -n soda-agent rollout restart deploy

soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
    logformat: "raw"
    loglevel: "ERROR"
  cloud:
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    endpoint: "https://cloud.soda.io"

helm install soda-agent soda-agent/soda-agent \
--set soda.agent.name=myuniqueagent \
# Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
--set soda.cloud.endpoint=https://cloud.soda.io \
--set soda.apikey.id=*** \
--set soda.apikey.secret=*** \
--set soda.agent.logFormat=raw \
--set soda.agent.loglevel=ERROR \
--namespace soda-agent

NAME: soda-agent
LAST DEPLOYED: Wed Dec 14 11:45:13 2022
NAMESPACE: soda-agent
STATUS: deployed
REVISION: 1

kubectl describe pods

Name: soda-agent-orchestrator-66-snip
Namespace: soda-agent
Priority: 0
Service Account: soda-agent
Node: <none>
Labels: agent.soda.io/component=orchestrator
agent.soda.io/service=queue
app.kubernetes.io/instance=soda-agent
app.kubernetes.io/name=soda-agent
pod-template-hash=669snip
Annotations: seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
...

helm install soda-agent soda-agent/soda-agent \
--values values.yml \
--namespace soda-agent

kubectl describe pods

Name: soda-agent-orchestrator-66-snip
Namespace: soda-agent
Priority: 0
Service Account: soda-agent
Node: <none>
Labels: agent.soda.io/component=orchestrator
agent.soda.io/service=queue
app.kubernetes.io/instance=soda-agent
app.kubernetes.io/name=soda-agent
pod-template-hash=669snip
Annotations: seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
...

helm install soda-agent soda-agent/soda-agent \
--set soda.agent.name=myuniqueagent \
# Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
--set soda.cloud.endpoint=https://cloud.soda.io \
--set soda.apikey.id=*** \
--set soda.apikey.secret=**** \
--set soda.agent.logFormat=raw \
--set soda.agent.loglevel=ERROR \
--namespace soda-agent

NAME: soda-agent
LAST DEPLOYED: Thu Jun 16 15:03:10 2022
NAMESPACE: soda-agent
STATUS: deployed
REVISION: 1

minikube kubectl -- describe pods

...
Containers:
soda-agent-orchestrator:
Container ID: docker://081*33a7
Image: sodadata/agent-orchestrator:latest
Image ID: docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 16 Jun 2022 15:50:28 -0700
Ready: True
...

helm install soda-agent soda-agent/soda-agent \
--values values.yml \
--namespace soda-agent

minikube kubectl -- describe pods

...
Containers:
soda-agent-orchestrator:
Container ID: docker://081*33a7
Image: sodadata/agent-orchestrator:latest
Image ID: docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
Port: <none>
Host Port: <none>
State: Running
Started: Thu, 16 Jun 2022 15:50:28 -0700
Ready: True
...

soda:
  agent:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x
  scanlauncher:
    resources:
      limits:
        cpu: x
        memory: x
      requests:
        cpu: x
        memory: x

soda:
  agent:
    resources:
      limits:
        cpu: 250m
        memory: 375Mi
      requests:
        cpu: 250m
        memory: 375Mi

helm repo add soda-agent [REPOSITORY_URL_PROVIDED]

kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f

helm install soda-agent soda-agent/soda-agent \
--set soda.agent.name=myuniqueagent \
--set soda.apikey.id=*** \
--set soda.apikey.secret=**** \
--namespace soda-agent

helm uninstall soda-agent -n soda-agent

minikube delete

💀 Removed all traces of the "minikube" cluster.

name: Publish Updated Contracts on Merge
on:
  push:
    branches:
      - main
jobs:
  publish-contracts:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install soda-postgres
        run: pip install -i https://pypi.dev.sodadata.io "soda>=4.0.0.dev1" -U
      - name: Get all changed files
        id: changed-files
        uses: tj-actions/changed-files@v46
      - name: List all changed files
        env:
          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
        run: |
          for file in ${ALL_CHANGED_FILES}; do
            echo "$file was changed"
          done
      - name: Debug environment variables
        env:
          SODA_CLOUD_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}
          SODA_CLOUD_API_SECRET: ${{ secrets.SODA_CLOUD_API_SECRET }}
        run: |
          echo "Environment variables status:"
          echo "SODA_CLOUD_API_KEY: $(if [ -n "$SODA_CLOUD_API_KEY" ]; then echo "✅ Set (${#SODA_CLOUD_API_KEY} chars)"; else echo "❌ Not set"; fi)"
          echo "SODA_CLOUD_API_SECRET: $(if [ -n "$SODA_CLOUD_API_SECRET" ]; then echo "✅ Set (${#SODA_CLOUD_API_SECRET} chars)"; else echo "❌ Not set"; fi)"
      - name: Filter and publish contracts
        env:
          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
          SODA_CLOUD_CONFIG_FILE_PATH: soda-cloud.yaml
          SODA_CLOUD_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}
          SODA_CLOUD_API_SECRET: ${{ secrets.SODA_CLOUD_API_SECRET }}
        run: |
          for file in ${ALL_CHANGED_FILES}; do
            if [[ "$file" == contracts/*.yml || "$file" == contracts/*.yaml ]]; then
              echo "Publishing $file"
              echo "Executing: soda contract publish --contract \"$file\" --soda-cloud ${SODA_CLOUD_CONFIG_FILE_PATH}"
              soda contract publish --contract "$file" --soda-cloud ${SODA_CLOUD_CONFIG_FILE_PATH}
            else
              echo "Skipping $file (not a contract)"
            fi
          done
name: Verify Data Contracts on pull request
on:
  pull_request:
    types: [opened, synchronize, reopened]
jobs:
  verify-contracts:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install soda-postgres
        run: pip install -i https://pypi.dev.sodadata.io/simple -U soda-postgres
      - name: Get all changed files
        id: changed-files
        uses: tj-actions/changed-files@v46
      - name: List all changed files
        env:
          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
        run: |
          for file in ${ALL_CHANGED_FILES}; do
            echo "$file was changed"
          done
      - name: Debug environment variables
        env:
          DATASOURCE_USERNAME: ${{ secrets.DATASOURCE_USERNAME }}
          DATASOURCE_PASSWORD: ${{ secrets.DATASOURCE_PASSWORD }}
        run: |
          echo "Environment variables status:"
          echo "DATASOURCE_USERNAME: $(if [ -n "$DATASOURCE_USERNAME" ]; then echo "✅ Set"; else echo "❌ Not set"; fi)"
          echo "DATASOURCE_PASSWORD: $(if [ -n "$DATASOURCE_PASSWORD" ]; then echo "✅ Set"; else echo "❌ Not set"; fi)"
      - name: Filter and verify contracts
        env:
          ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
          DATASOURCE_CONFIG_FILE_PATH: postgres.yaml
          DATASOURCE_USERNAME: ${{ secrets.DATASOURCE_USERNAME }}
          DATASOURCE_PASSWORD: ${{ secrets.DATASOURCE_PASSWORD }}
        run: |
          for file in ${ALL_CHANGED_FILES}; do
            if [[ "$file" == contracts/*.yml || "$file" == contracts/*.yaml ]]; then
              echo "Verifying $file"
              echo "Executing: soda contract verify --data-source ${DATASOURCE_CONFIG_FILE_PATH} --contract \"$file\""
              soda contract verify --data-source ${DATASOURCE_CONFIG_FILE_PATH} --contract "$file"
            else
              echo "Skipping $file (not a contract)"
            fi
          done




on:
  push:
    branches:
      - main

on:
  pull_request:
    types: [opened, synchronize, reopened]

This page provides detailed information about everything that happens while running and after running the Soda↔Collibra integration.
Advanced usage focuses on running and maintaining the Soda↔Collibra bi-directional integration after setup. The goal is to equip technical implementers with the detail required to operate the integration efficiently, resolve issues quickly, and adapt it to complex environments.
Domain Mappings: Cached for the entire session
Asset Lookups: LRU cache reduces repeated API calls
Configuration Parsing: One-time parsing with caching
Asset Operations: Create/update multiple assets in single calls
Attribute Management: Bulk attribute creation and updates
Relation Creation: Batch relationship establishment
3-5x faster execution vs. original implementation
60% fewer API calls through caching
90% reduction in rate limit errors
Improved reliability with comprehensive error handling
Small datasets (< 100 checks): 30-60 seconds
Medium datasets (100-1000 checks): 2-5 minutes
Large datasets (1000+ checks): 5-15 minutes
Performance varies based on:
Network latency to APIs
Number of existing vs. new assets
Complexity of relationships
API rate limits
Enable detailed logging for troubleshooting:
Debug output includes:
Dataset processing details
API call timing and results
Caching hit/miss statistics
Error context and stack traces
The integration automatically extracts diagnostic metrics from Soda check results and populates detailed row-level statistics in Collibra.
The system automatically extracts metrics from any diagnostic type, making it future-proof:
The system uses a metric-focused approach rather than type-specific logic:
Scans All Diagnostic Types: Iterates through every diagnostic type in the response
Extracts Relevant Metrics: Looks for specific metric fields regardless of diagnostic type name
Applies Smart Fallbacks: Uses datasetRowsTested if checkRowsTested is not available
Input: Soda Check Result
Output: Collibra Attributes
✅ Future-Proof: Automatically works with new diagnostic types Soda introduces
✅ Comprehensive: Provides both raw metrics and calculated insights
✅ Flexible: Handles partial data gracefully with intelligent fallbacks
✅ Accurate: Uses check-specific row counts when available
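The metric-focused approach above can be sketched in a few lines of Python. This is a minimal illustration only, not the integration's actual code: the function name `extract_metrics` and the output keys are hypothetical, but the scanning and fallback behavior mirror the description above.

```python
# Hypothetical sketch of the metric-focused extraction: scan every
# diagnostic type in the check result, pull known metric fields, and
# fall back from checkRowsTested to datasetRowsTested when needed.
def extract_metrics(diagnostics: dict) -> dict:
    metrics = {}
    for diag_type, fields in (diagnostics or {}).items():
        if not isinstance(fields, dict):
            continue  # handle partial/malformed data gracefully
        if "failedRowsCount" in fields and "failed_rows" not in metrics:
            metrics["failed_rows"] = fields["failedRowsCount"]
        if "checkRowsTested" in fields:
            # Preferred: rows actually tested by this specific check
            metrics["loaded_rows"] = fields["checkRowsTested"]
        elif "datasetRowsTested" in fields and "loaded_rows" not in metrics:
            # Fallback: total dataset rows when no check-specific count
            metrics["loaded_rows"] = fields["datasetRowsTested"]
    return metrics
```

Because the loop iterates over every diagnostic type rather than matching on type names, a new diagnostic type that carries the same metric fields is picked up automatically.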
Head to the Kubernetes deployment documentation to learn more.
Modify constants.py for your environment:
For detailed information on configuring custom attribute syncing, see the section above.
Slow Processing: Increase BATCH_SIZE and DEFAULT_PAGE_SIZE
Rate Limiting: Increase RATE_LIMIT_DELAY
Memory Usage: Decrease CACHE_MAX_SIZE
API Timeouts: Check network connectivity and API endpoints
Authentication: Verify credentials and permissions
Rate Limits: Monitor API usage and adjust delays
Missing Assets: Ensure required asset types exist in Collibra
Relation Failures: Verify relation type configurations
Domain Mapping: Check domain IDs and JSON formatting
Missing Diagnostic Attributes: Check if Soda checks have lastCheckResultValue.diagnostics data
Incomplete Metrics: Some diagnostic types may only have partial metrics (e.g., aggregate checks lack failedRowsCount)
Attribute Type Configuration: Verify diagnostic attribute type IDs are configured correctly in config.yaml
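For the rate-limit issues listed above, a retry-with-delay wrapper along these lines can help. This is a hedged sketch, not the integration's actual implementation; it only reuses the semantics of the `MAX_RETRIES` and `RATE_LIMIT_DELAY` constants from `constants.py`, and the `call_with_retry` name is hypothetical.

```python
import time

MAX_RETRIES = 3       # assumed to match constants.py: API retry attempts
RATE_LIMIT_DELAY = 2  # assumed to match constants.py: seconds between attempts

def call_with_retry(request, retries=MAX_RETRIES, delay=RATE_LIMIT_DELAY):
    """Retry a failing API call, sleeping between attempts."""
    last_error = None
    for attempt in range(retries):
        try:
            return request()
        except Exception as exc:  # in practice, catch the client's error type
            last_error = exc
            time.sleep(delay)
    raise last_error  # all attempts exhausted
```

Increasing `RATE_LIMIT_DELAY` in this pattern trades runtime for fewer 429 responses, which matches the "Rate Limiting: Increase RATE_LIMIT_DELAY" guidance above.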
Look for these patterns in debug logs:
Rate limit prevention: Normal throttling behavior
Successfully updated/created: Successful operations
Skipping dataset: Expected filtering behavior
Processing diagnostics: Diagnostic data found in check result
Found failedRowsCount in 'X': Successfully extracted failure count from diagnostic type X
Found checkRowsTested in 'X': Successfully extracted row count from diagnostic type X
Collibra Base: collibra.base_url, collibra.username, collibra.password
Soda API: soda.api_key_id, soda.api_key_secret
Asset types (table, check, dimension, column)
Attribute types (evaluation status, sync date, diagnostic metrics)
Relation types (table-to-check, check-to-dimension)
Domain IDs for asset creation
For issues and questions:
Check the troubleshooting section
Enable debug logging for detailed information
Review the performance metrics for bottlenecks
Consult the usage examples
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
    logformat: "raw"
    loglevel: "ERROR"
  cloud:
    # Use https://cloud.us.soda.io for US region
    # Use https://cloud.soda.io for EU region
    endpoint: "https://cloud.soda.io"

Ownership synchronization details
Calculates Derived Metrics: Computes passing rows and fraction when source data is available
Handles Missing Data: Gracefully skips attributes when diagnostic data is unavailable
✅ Transparent: Detailed logging shows exactly which metrics were found and used
Zero Division Errors: System automatically prevents division by zero when calculating fractions
ERROR: Issues requiring attention
Using datasetRowsTested from 'X' as fallback: Fallback mechanism activated
No diagnostics found in check result: Check has no diagnostic data (normal for some check types)
Calculated check_rows_passed: Successfully computed passing rows
Added check_X_attribute: Diagnostic attribute successfully added to Collibra
soda.attributes.custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id
Domain Mapping: collibra.domains.soda_collibra_domain_mapping
Ownership Sync: collibra.responsibilities.owner_role_id
Contact [email protected] for additional help
check_loaded_rows_attribute
checkRowsTested or datasetRowsTested
Total number of rows evaluated by the check
check_rows_failed_attribute
failedRowsCount
Number of rows that failed the check
check_rows_passed_attribute
Calculated
check_loaded_rows - check_rows_failed
check_passing_fraction_attribute
Calculated
check_rows_passed / check_loaded_rows
1st
checkRowsTested
Preferred - rows actually tested by the specific check
2nd
datasetRowsTested
Fallback - total dataset rows when check-specific count unavailable
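The derived attributes above reduce to two small calculations, shown here as a sketch. The function name `derive_metrics` is hypothetical; the arithmetic and the zero-division guard follow the rules described in this section.

```python
def derive_metrics(loaded_rows, failed_rows):
    """Compute check_rows_passed and check_passing_fraction.

    Returns (None, None) when source data is missing, and guards
    against division by zero when the check tested no rows.
    """
    if loaded_rows is None or failed_rows is None:
        return None, None  # skip attributes when diagnostics are unavailable
    passed = loaded_rows - failed_rows
    fraction = round(passed / loaded_rows, 4) if loaded_rows else None
    return passed, fraction
```

With the example values from this page (274577 rows tested, 3331 failed), this yields 271246 passing rows and a passing fraction of 0.9879.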

============================================================
🎉 INTEGRATION COMPLETED SUCCESSFULLY 🎉
============================================================
📊 Datasets processed: 15
⏭️ Datasets skipped: 2
✅ Checks created: 45
🔄 Checks updated: 67
📝 Attributes created: 224
🔄 Attributes updated: 156
🔗 Dimension relations created: 89
📋 Table relations created: 23
📊 Column relations created: 89
👥 Owners synchronized: 12
❌ Ownership sync failures: 1
🎯 Total operations performed: 693
============================================================

python main.py --debug

// Missing value checks
{
  "diagnostics": {
    "missing": {
      "failedRowsCount": 3331,
      "failedRowsPercent": 1.213,
      "datasetRowsTested": 274577,
      "checkRowsTested": 274577
    }
  }
}

// Aggregate checks
{
  "diagnostics": {
    "aggregate": {
      "datasetRowsTested": 274577,
      "checkRowsTested": 274577
    }
  }
}

// Hypothetical future types
{
  "diagnostics": {
    "valid": {
      "failedRowsCount": 450,
      "validRowsCount": 9550,
      "checkRowsTested": 10000
    },
    "duplicate": {
      "duplicateRowsCount": 200,
      "checkRowsTested": 8000
    }
  }
}

{
  "name": "customer_id is present",
  "evaluationStatus": "fail",
  "lastCheckResultValue": {
    "value": 1.213,
    "diagnostics": {
      "missing": {
        "failedRowsCount": 3331,
        "checkRowsTested": 274577
      }
    }
  }
}

Attributes Set:
- check_loaded_rows_attribute: 274577      # From checkRowsTested
- check_rows_failed_attribute: 3331        # From failedRowsCount
- check_rows_passed_attribute: 271246      # Calculated: 274577 - 3331
- check_passing_fraction_attribute: 0.9879 # Calculated: 271246 / 274577

# Run all tests
python -m pytest tests/ -v
# Run specific test file
python -m pytest tests/test_integration.py -v
# Run with coverage
python -m pytest tests/ --cov=integration --cov-report=html

# Comprehensive local testing (recommended)
python testing/test_k8s_local.py
# Docker-specific testing
./testing/test_docker_local.sh
# Quick validation
python testing/validate_k8s.py

# Test Soda client functionality
python main.py --test-soda
# Test Collibra client functionality
python main.py --test-collibra

class IntegrationConstants:
    MAX_RETRIES = 3           # API retry attempts
    BATCH_SIZE = 50           # Batch operation size
    DEFAULT_PAGE_SIZE = 1000  # API pagination size
    RATE_LIMIT_DELAY = 2      # Rate limiting delay
    CACHE_MAX_SIZE = 128      # LRU cache size

# In your code
import logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Set custom config path
export SODA_COLLIBRA_CONFIG=/path/to/custom/config.yaml
# Enable debug mode
export SODA_COLLIBRA_DEBUG=true

# Full debug output
python main.py --debug 2>&1 | tee debug.log
# Verbose logging with timestamps
python main.py --verbose
# Test specific components
python main.py --test-soda --debug
python main.py --test-collibra --debug

# Basic run with default config
python main.py
# Debug mode with detailed logging
python main.py --debug
# Use custom configuration file
python main.py --config custom.yaml
# Test individual components
python main.py --test-soda --debug
python main.py --test-collibra --debug









