
Soda v4


Unique count

Definition

The number of distinct non-NULL values in the monitored column. Highlights unexpected changes in cardinality (e.g., new user IDs, codes).

Source

data

Computation

COUNT(DISTINCT <column>)

Query based

These monitors require executing queries against the data itself to surface usage and content recency patterns, for example:

  • Most recent timestamp: the latest event or ingestion time across all rows

  • Partition row count: the number of records within the current partition (e.g. today’s data)

Query-based monitors give you a window into data flow and freshness, helping detect lags in ingestion pipelines or staleness in source systems.
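For illustration, the two examples above map to queries along these lines (the table, timestamp column, and partition filter are hypothetical):

    -- Most recent timestamp: latest event or ingestion time across all rows
    SELECT MAX(created_at) AS most_recent_timestamp
    FROM orders;

    -- Partition row count: records in the current partition (e.g. today's data)
    SELECT COUNT(*) AS partition_row_count
    FROM orders
    WHERE created_at >= CURRENT_DATE;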

Metadata based

These monitors are derived directly from the data platform’s system metadata, without scanning row-level values. They surface structural signals, such as:

  • Last modification time: when the dataset was last updated

  • Schema changes: any alterations to the schema

  • Total row count: the overall number of records in the dataset

  • Total row count change: the delta in row count compared to the previous observation

Because they read only metadata, these monitors are extremely lightweight to compute and ideal for continuous, real-time dashboarding of dataset activity.

Depending on your data source(s), metadata-based Metric Monitors may only be supported on Tables, and not on other data objects such as Views. Alternatives are to set up these metadata-based Metric Monitors on the source Tables of your non-Table data objects, and/or to store these data objects as Tables instead.

Maximum value

Definition

The maximum value within a specific column.

Source

data (any datatype with an order defined for max/min)

Computation

MAX(column)

Total row count change

Definition

The difference in total row count between the current scan and the immediately preceding scan (current_count – previous_count).

Source

metadata

Computation

Soda keeps track of the previous total row count, fetches the total row count again at scan time, and subtracts the previous value from the current one.

Standard deviation

Definition

The standard deviation of the values within a specific column.

Source

data (numeric)

Computation

Uses the SQL-standard STDDEV_SAMP() function in all databases, which is a sampling-based method.

Text

Text metrics help catch formatting issues, truncated values, or unexpectedly long/free-form entries.

With Soda, it's possible to assess the character-length properties of string columns:

  • Average length

  • Minimum length

  • Maximum length

Q3

This metric is not supported in MySQL.

Definition

Quartiles divide all non-NULL values in the column within the latest partition into four equal parts based on value: Q3 represents the 75th percentile.

Source

data (numeric)

Computation

  • For data sources supporting exact percentiles (e.g. PostgreSQL’s PERCENTILE_DISC(0.75)), Soda uses that function.

  • For data sources that provide approximations (such as BigQuery, SQLServer, Redshift and Trino), Soda uses those approximated values.
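As a sketch, the exact and approximate variants look roughly like this (table and column names are hypothetical; the approximate form shown uses BigQuery's APPROX_QUANTILES):

    -- Exact 75th percentile (PostgreSQL)
    SELECT PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY amount) AS q3
    FROM orders;

    -- Approximate 75th percentile (BigQuery): 100 quantile boundaries, take index 75
    SELECT APPROX_QUANTILES(amount, 100)[OFFSET(75)] AS q3
    FROM orders;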

Total row count

Definition

The total number of rows in the dataset at scan time.

Source

metadata

Computation

Through the row count value provided by the metadata, which is calculated differently for every database.

For example:

  • In Oracle, the total row count is calculated with a COUNT(*), because this metadata is not available by default.

  • BigQuery, on the other hand, provides metadata information through the INFORMATION_SCHEMA.TABLE_STORAGE metadata table; specifically, Soda uses the total_rows column from that table.
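As an illustration, the BigQuery metadata lookup can be reproduced with a query along these lines (project, region, schema, and table names are hypothetical):

    -- TABLE_STORAGE is a region-qualified INFORMATION_SCHEMA view in BigQuery
    SELECT table_name, total_rows
    FROM `my-project.region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
    WHERE table_schema = 'sales'
      AND table_name = 'orders';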

Q1

This metric is not supported in MySQL.

Definition

Quartiles divide all non-NULL values in the column within the latest partition into four equal parts based on value: Q1 represents the 25th percentile.

Source

data (numeric)

Computation

  • For data sources supporting exact percentiles (e.g. PostgreSQL’s PERCENTILE_DISC(0.25)), Soda uses that function.

  • For data sources that provide approximations (such as BigQuery, SQLServer, Redshift and Trino), Soda uses those approximated values.

Timestamp

Timestamp metrics highlight recency and time-based anomalies, which is crucial for validating timeliness in event streams and incremental loads:

  • Most recent timestamp

Median

This metric is not supported in MySQL.

Definition

Quartiles divide all non-NULL values in the column within the latest partition into four equal parts based on value: Q2 represents the 50th percentile, that is, the median.

Source

data (numeric)

Computation

  • For data sources supporting exact percentiles (e.g. PostgreSQL’s PERCENTILE_DISC(0.5)), Soda uses that function.

  • For data sources that provide approximations (such as BigQuery, SQLServer, Redshift and Trino), Soda uses those approximated values.

Minimum value

Definition

The minimum value within a specific column.

Source

data (any datatype with an order defined for max/min)

Computation

MIN(column)

Partition row count

Definition

The number of rows in the most recent time partition at scan time (e.g. all rows where partition_col = current_partition).

Source

data (time partition)

Computation

Through COUNT(*) for the partition.

Most recent timestamp (column)

Definition

Interval between scan time and the maximum value in a date/time/time-stamp column (within the partition). Supported only on date/datetime/time columns.

Source

data (timestamp)

Computation

MAX(timestamp)

Manage data quality issues

This section introduces the key features and workflows in Soda for managing data quality issues and reporting.

Learn how to find datasets and checks, navigate dashboards, and understand check results.

You’ll also learn how to set up notifications to stay informed and build custom dashboards using tools like Power BI or Tableau.

Numeric

Numeric metrics capture central tendency and dispersion in numerical columns, such as:

  • Mean (average)

  • Standard deviation & variance

  • Sum

  • Percentiles (Q1, median, Q3)

Numeric metrics enable you to detect outliers, shifts in scale, or drifts in distribution over time.

What is Soda?

Soda helps data teams make sure their data can be trusted. It makes it easy to find, understand, and fix problems in the data.

You can use Soda to:

  • Monitor production data with automated, ML-powered observability that spots unexpected changes without needing to define every rule up front.

  • Define data contracts, making expectations explicit and enabling producers and consumers to collaborate on reliable data at the source.

  • Test data earlier in the pipeline, as part of CI/CD workflows or during development, to prevent bad data from reaching production.

Soda helps teams start right, automatically detecting anomalies in metrics for issues that have already happened, and shift left, preventing those issues from happening again with collaborative data contracts.

Soda v4 vs v3

This is the documentation for Soda v4. If you are still using Soda v3, head to the v3 documentation.

The new version of Soda has transformed the software into a full data-quality platform by layering on:

  • End-to-end data observability

  • Collaborative data contracts

This marks the shift from a CLI-centric checks engine toward a unified, observability-driven data quality platform with a refined, three-tier Core + Agent + Cloud architecture, built-in contracts, orchestration, and deep integrations.

Read on to learn more about Soda's capabilities.

What is data quality?

Data quality refers to how well a dataset meets the expectations of completeness, accuracy, timeliness, uniqueness, and consistency. Good data supports business goals, drives confident decision-making, and is the base for great data products.

Poor data quality causes failed pipelines, incorrect reports, and broken AI models. Managing data quality means proactively validating assumptions and reactively monitoring for drift or degradation.

Soda helps you answer questions like:

  • Is the data fresh and complete?

  • Are there unexpected values or duplicates?

  • Did values shift outside of expected ranges?

  • Are schema or contract changes causing breakage?

  • Are data quality metrics changing over time?

Key Concepts

Data Observability

Data observability is a reactive approach to monitoring data in production and catching unexpected issues as they emerge. It helps answer the question: What is happening with my data right now, and how is that changing over time?

Use data observability to:

  • Detect anomalies in data quality metrics such as freshness, row counts, null values or custom ones

  • Monitor metric trends and seasonality

  • Identify late-arriving or missing records

  • Get alerted when values deviate from historical norms

Data Testing

Data testing is a proactive approach that validates known expectations about your data during development, deployment, or transformation. It helps you catch issues before they reach production, break reports, or impact downstream systems.

Use data testing to:

  • Align on what “good data” looks like through data contracts

  • Verify that your data meets those expectations, including schema, values, and transformations

  • Test data at every step of the pipeline to prevent bad data from reaching downstream systems

  • Integrate with CI/CD workflows for continuous quality checks during development

Data Contracts

Data contracts define what a dataset should look like, including its schema, data types, value ranges, and other constraints. They establish a shared agreement between data producers and consumers about what’s expected and what must be upheld.

Both testing and observability play a role in upholding data contracts:

  • Testing validates that data meets the contract during development, pipeline execution, and on schedule.

  • Observability monitors contract adherence in production and detects unexpected issues.

Data Observability vs Data Testing

While data testing and observability are different in when and how they operate, they work best together as a unified strategy.

Approach: Data Testing
Timing: Proactive and preventative (pre-production, during development or CI/CD)
Use case: Prevent breakages before they happen by validating known rules and enforcing contracts

Approach: Data Observability
Timing: Reactive and adaptive (in production, runtime monitoring)
Use case: Monitor data behavior and changes over time with automated detection of anomalies, schema changes, and other unexpected issues

Together, they enable end-to-end data quality management: testing prevents problems, and observability detects those that escape prevention. At the same time, observability can help prioritize which issues to address and shift left to resolve them upstream.

Data quality at scale across the enterprise

Divide and conquer

Managing data quality across hundreds or thousands of datasets requires a scalable, federated approach. Soda enables this through:

  • Metadata-driven observability that adapts checks to each dataset's structure and context.

  • Role-based collaboration so teams can take ownership of the data they know best.

  • An interface for both engineering and business users, enabling collaboration through code, UI, or APIs, depending on user preference and role.

  • Integration with existing tools and workflows, such as data catalogs and incident management systems.

  • Pipeline and CI/CD integration to automate data quality checks.

Data quality as a team sport

Reliable data depends on collaboration across roles:

  • Data engineers embed tests and monitor pipelines to catch issues early.

  • Data producers and consumers align on expectations through data contracts.

  • Data consumers report issues and collaborate with producers to interpret metrics and resolve problems.

  • Governance teams define and enforce data quality standards.

  • Platform teams deploy, manage, and secure the underlying infrastructure.

Soda Cloud acts as the shared workspace where these roles collaborate, triage incidents, and resolve issues.

Deployment options

Soda offers three deployment models, depending on your infrastructure and data privacy needs.

Deployment Model: Soda-hosted Agent
Description: Managed version of Soda that runs observability features and executes and schedules Data Contracts.
Ideal For: Teams seeking a simple, managed solution for data quality.
Key Features: Centralized data source access, no setup required, observability features enabled. Enables users to create, test, execute, and schedule contracts and checks directly from the Soda Cloud UI.
Considerations: Required for observability features. Cannot scan in-memory sources like Spark or DataFrames.

Deployment Model: Self-hosted Agent
Description: Same as the Soda-hosted Agent, but deployed and managed in your own Kubernetes environment.
Ideal For: Teams needing full control over infrastructure and deployment.
Key Features: Similar to the Soda-hosted Agent, but deployed within the customer’s environment; data stays within your network.
Considerations: Required for observability features. Cannot scan in-memory sources like Spark or DataFrames. Kubernetes expertise required.

Deployment Model: Soda Core
Description: Open-source Python library (with commercial extensions) and CLI for running Data Contracts in your pipelines.
Ideal For: Data engineers integrating Soda into custom workflows.
Key Features: Full control over orchestration, in-memory data support, contract verification.
Considerations: No observability features. Required for in-memory sources (e.g., Spark, DataFrames). Data source connections managed at the environment level.

Read more about Deployment options.

Supported data sources and integrations

Soda integrates with the modern data stack:

  • Data warehouses and databases: Databricks, Snowflake, BigQuery, Redshift, PostgreSQL, MySQL, Spark, Presto, DuckDB, and more.

  • Orchestration platforms: Airflow, Dagster, Prefect, Azure Data Factory.

  • Metadata tools: Atlan, Alation, Collibra, data.world, Zeenea.

  • Cloud providers: AWS, Google Cloud, Azure.

  • BI tools: Looker, Tableau, Power BI.

  • Messaging and ticketing: Slack, Microsoft Teams, Jira, PagerDuty, ServiceNow, Opsgenie.

What’s next?

  • To get started with Soda, check out the end-to-end guide.


Community & Support

Need help or want to contribute?

  • Join our Slack Community

  • Browse GitHub Discussions

Still have questions? Use the search bar above or reach out through our community channels for additional help.

All data types

Metrics that support all data types are foundational checks that apply to any column regardless of its data type:

  • Count of non-NULL values

  • Missing values percentage

  • Duplicate percentage

  • Minimum and maximum values

  • Unique count of distinct entries

These metrics form the backbone of data completeness and consistency monitoring, ensuring every column meets basic quality expectations.
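Concretely, these foundational metrics translate to simple aggregations (hypothetical table and column names):

    SELECT
        COUNT(customer_id)          AS non_null_count,
        COUNT(DISTINCT customer_id) AS unique_count,
        MIN(customer_id)            AS min_value,
        MAX(customer_id)            AS max_value
    FROM customers;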

Average

Definition

Arithmetic mean of all non-NULL values in the column, computed per partition.

Source

data (numeric)

Computation

Using the built-in AVG(column_name) function of every database.

Maximum length

Definition

The maximum length (maxLength) of non-NULL string values in the column.

Source

data (value length in characters)

SQLServer: instead of length in characters, it uses data length (number of bytes)

Computation

For each value, Soda encapsulates the length in the aggregation metric: MAX(LENGTH(column))

Git-managed Data Contracts

Define, version, and test contracts as code

For teams that manage data like software, Git-managed data contracts offer a code-first way to define and enforce data quality expectations.

In this model, contracts are written in YAML and stored in your Git repository, right alongside your data models, transformation logic, and CI/CD workflows. You write, version, test, and promote contracts just like any other code artifact.

This approach gives engineers full control, reproducibility, and integration into development pipelines. And with the right setup, you can still collaborate with non-technical users via Soda Cloud and even sync UI-authored changes into Git using our future proposal workflow.


Why Git-managed?

  • Full version control Track every change, roll back when needed, and manage contracts with the same discipline as application code.

  • Code-first workflow Keep contracts close to your data models and transformations for better alignment, automation, and traceability.

  • CI/CD integration Run contract verifications in your existing pipelines; on every commit, PR, or deployment.

  • Team governance Ensure all changes are reviewed, tested, and approved through standard Git workflows (pull requests, approvals, branching).

  • Hybrid collaboration Combine Git workflows with Soda Cloud for monitoring, visualization, and cross-functional input via contract proposals.


If you're already managing your data infrastructure in Git, Git-managed contracts are the natural extension for bringing data quality under control without adding friction or silos.

In the next sections, we’ll walk you through how to set up, author, and run Git-managed contracts using the Soda CLI.

Cloud-managed Data Contracts

Cloud-managed Data Contracts let you define and manage expectations for your data directly in the Soda Cloud UI.

This approach is perfect for data analysts, product owners, and business stakeholders who know what “good data” looks like but prefer intuitive tools over code. It’s also ideal for teams that want to move fast, collaborate visually, and integrate seamlessly with engineering workflows when needed.

With Soda Cloud, you can browse datasets, add quality rules, test and publish contracts, and set up scheduled or on-demand verification. All from your browser.

Why Cloud-managed?

  • Faster time to value – no setup required

  • Accessible to everyone – empower domain experts, not just engineers

  • Built for collaboration – share, comment, and propose changes in a shared UI

  • Easily operationalized – schedule tests and trigger verifications programmatically

Cloud-managed contracts are a powerful way to bring your organization together around trusted data.

Prerequisites

Before creating contracts in Soda Cloud, make sure:

  • You have a Soda Cloud account

  • You have access to an organization in Soda Cloud

  • You have connected at least one data source via a Soda Agent

Missing values percentage

Definition

Number of missing values relative to the number of rows in the partition, expressed as percentage → (number of null or missing values in column ÷ total rows in partition) × 100

Source

data

Computation

(1 - COUNT(column) / COUNT(*)) × 100
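For instance, in plain SQL (the table and column are hypothetical; the 1.0 factor avoids integer division):

    SELECT (1 - COUNT(email) * 1.0 / COUNT(*)) * 100 AS missing_values_percent
    FROM customers;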

Deploy Soda Agent

The Soda Agent allows you to securely scan your data sources for quality issues directly from Soda Cloud. It can be self-hosted or Soda-hosted, depending on your deployment preferences. The self-hosted option allows for a more custom and secure deployment, while the Soda-hosted agent is easier to start with. Learn more about Deployment options

You can deploy a self-hosted agent in the infrastructure of your choice:

  • Kubernetes cluster

  • Amazon EKS

  • Azure AKS

  • Google GKE

Soda-hosted agents are included in all Free, Team, and Enterprise plans at no additional cost. However, self-hosted agents require an Enterprise plan.

If you wish to use self-hosted agents, please contact us at https://www.soda.io/contact to discuss Enterprise plan options, or reach out via the support portal for existing customers.

Average length

Definition

The average length (avgLength) of non-NULL string values in the column.

Source

data (value length in characters)

SQLServer: instead of length in characters, it uses data length (number of bytes).

Computation

For each value, Soda encapsulates the length in the aggregation metric: AVG(LENGTH(column))

Count

Definition

The total number of non-NULL values in the monitored column. Useful for identifying unexpected drops or spikes in data completeness.

Source

data

Computation

COUNT(<column>)

Analyze monitor and check results

When you access a monitor (from Metric Monitoring) or a check (from the Contract), Soda Cloud provides a time series view that shows how the monitored metric or check result evolves. This helps you explore the history of data quality issues, spot trends, and understand changes in your data.

Diagnostics

For certain types of checks and monitors, additional diagnostic information is also available for each monitor or check result to help you investigate issues in more detail.

For example:

  • Schema Checks: View a side-by-side comparison of the actual vs. expected schema to identify differences.

  • Missing, Duplicate, or Invalid Checks: See the percentage of failed rows vs. passing rows to understand the scale and impact of the issue.

This view helps you drill down into specific data issues, explore context, and take informed action.

Sum

Definition

Detects anomalies in the total (sum) of all non-NULL values in a given column over the latest partition. It flags unexpected increases or decreases in the aggregate amount.

Source

data (numeric)

Computation

SUM(<column>)

Variance

Definition

The variance of the values within a specific column.

Source

data (numeric)

Computation

Uses the SQL-standard VAR_SAMP() function in all databases, which is a sampling-based method.

Most recent timestamp (dataset)

Definition

The interval between scan time and the maximum timestamp in the partition column (within the latest partition).

Source

data (time partition)

Computation

Via MAX(column) for any time-related column.

No sampling used.

Minimum length

Definition

The minimum length (minLength) of non-NULL string values in the column.

Source

data (value length in characters)

SQLServer: instead of length in characters, it uses data length (number of bytes)

Computation

For each value, Soda encapsulates the length in the aggregation metric: MIN(LENGTH(column))

Schema changes

Definition

The count of schema alterations (column additions, removals, or data-type changes) detected since the previous scan. Any schema change is treated as an anomaly.

Source

metadata

Computation

No sampling used. The value is calculated through the difference of two full table definitions.

Metric Monitoring dashboard

As soon as a data source is connected, the metric monitoring dashboard is available and will have historical information. Soda establishes a statistical baseline for each metric and continually compares new scan results against that baseline, flagging anomalies according to the sensitivity, exclusions, and threshold strategy you’ve configured.

What are Metric Monitors?

Metric monitors are the foundation of data observability in Soda. Monitors track data quality metrics over time and leverage historical values for analysis. Soda automatically collects these metrics and examines how they evolve over time through a proprietary anomaly detection algorithm to identify when metrics deviate from expected patterns and trigger alerts. These deviations are surfaced and recorded in the Metric Monitors tab for each dataset.

Last modification time

Definition

The elapsed interval between the scan time and the timestamp of the most recent change to the database. This includes any change to the data (inserts, updates, deletes) as well as any change to the schema.

Source

metadata

Computation

The compute method depends on the database; Soda requires specific metadata fields that differ for every database. No sampling is used.

Exceptions

In Redshift, adding columns is not part of last_modification_time.


Organization and Admin settings

The Organization and Admin Settings in Soda Cloud provide a centralized interface for managing your organization’s configuration, roles, user access, and integrations. From setting your organization’s name to defining global roles, managing user groups, and enabling the Soda-hosted Agent, these settings help you tailor Soda Cloud to your team’s needs and governance policies.

To access the settings, click on your avatar on the top right, and then click Organization Settings.

Only users with the Manage Organization Settings global role can access and modify these settings.

Duplicate percentage

This metric is a work in progress.

Definition

Percentage of all duplicate non-NULL records in a column.


For example, duplicate_percent(id) on this table is 0.66 (or 66%):

id      name
1       a
1       b
2       c
null    d
null    e

Source

data
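One way to reproduce that figure in SQL (PostgreSQL syntax for the FILTER clause; the table and column names are hypothetical, and this is not necessarily the exact query Soda runs):

    -- Share of non-NULL rows whose value occurs more than once
    SELECT 100.0 * COUNT(*) FILTER (WHERE value_count > 1) / NULLIF(COUNT(*), 0) AS duplicate_percent
    FROM (
        SELECT id, COUNT(*) OVER (PARTITION BY id) AS value_count
        FROM example_table
        WHERE id IS NOT NULL
    ) AS non_null_ids;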


    Monitors vs metrics: what's the difference?

    The main difference between monitors and metrics is that monitors are configurable, while metrics are not.

    Monitors build on top of metrics by wrapping their static measurement in a configurable context. Each monitor is customizable, so the user can select scan time, scan frequency, thresholds, and metric to be used.

    Metrics, on the other hand, are only a part of the monitor. They are built-in, static definitions of data properties; it is not possible to alter how a metric is computed at its source, but it is possible to select which metric to track through a metric monitor.

    Types of monitors

    Soda offers two main types of monitors to support scalable, layered observability: dataset monitors and column monitors.

    1. Dataset monitors provide instant, no-setup monitoring based on metadata. They track high-level metrics like row count changes, schema updates, and insert activity, making them ideal for catching structural or pipeline-level issues across large numbers of datasets.

    2. Column monitors are more granular and customizable. They focus on specific fields, allowing users to monitor things like missing values, averages, or freshness. These monitors are useful for capturing data issues that impact accuracy or business logic at the column level.

    Together, they offer broad coverage and targeted insight, helping teams detect both systemic and localized data quality issues.

Each of these monitor types has its own section containing summarized information about the latest scan results for each monitor. From the health tab, you can access each monitor for further investigation and configuration, as well as create alerts.

    Configure Opt-in Alerts

    You can turn any metric monitor into a proactive alert by clicking its bell icon on the Metric Monitors dashboard and selecting Add Notification Rule. This brings up the Add Notification Rule panel:

    1. Name Enter a descriptive title for your rule (e.g. “Row-Count Alerts – Prod Sales”).

    2. Data source Choose the warehouse or connection to scope your rule. Then, search for and check the specific tables (or columns) this rule should cover. The “Matches X datasets” badge updates in real time so you know exactly what you’ll be alerting on.

    3. Applies to Pick which check type you want to alert on.

    4. Recipients Select one or more notification targets:

    • Email addresses

    • Slack channels

    • Other integrations

    This dialog lets you reuse a single rule for multiple datasets or checks, ensuring your team only gets the notifications they care about.


Metadata data sources

Oracle

  • Historical backfilling: not possible.

  • Row count: metadata row counts are calculated via COUNT(*). Soda does not use metadata for this metric in Oracle; it requires an additional package and/or is unreliable based on the schedule of that package.

  • Last modification time: Soda uses metadata. Note that past data is only available for a limited amount of time, which varies depending on the system; the minimum goes back 120 hours.

Non-UTC timestamps are not recommended when connecting Soda to Oracle data sources. Soda uses timezone data when available, but assumes UTC when the timezone is not provided by the data source.

Some databases convert timestamps to UTC, but Oracle does not do any implicit conversions and stores timestamps and timezone information as the user inputs them. Because of Oracle Python client limitations, all timezone information is stripped when Soda retrieves it, which means that Soda reads all timestamps as if they were UTC regardless of the original input.


    Postgres

    Metadata is supported, but it requires some additional setup on Postgres's side.

    • Historical backfilling: not possible.

    • Row count: enabled out-of-the-box.

    • Last modification time: track_commit_timestamp must be enabled: https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-TRACK-COMMIT-TIMESTAMP

      • If track_commit_timestamp is not enabled, Soda will return a warning.
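To enable it, a superuser can run, for example, the statement below (an assumed approach; the setting only takes effect after a PostgreSQL server restart):

    ALTER SYSTEM SET track_commit_timestamp = on;
    -- Restart the PostgreSQL server for the change to take effect.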


    BigQuery

    Metadata metrics are available and supported in BigQuery.

    • Historical backfilling: possible.

    • Partition column: can be suggested based on metadata available in BigQuery.

      • Soda will prioritize user-suggested columns.

      • If there are no user-suggested columns, Soda will try a metadata approach to find the partition column automatically.

      • If there are no columns found in the metadata of BigQuery, Soda will fall back on its own heuristic.

    Partition column availability in BigQuery:

    If the user has configured a partitioning column on BigQuery's side, Soda will use it (given that it is a date/timestamp column).

    Otherwise, Soda will fall back on a standard sampling method to detect the partition column.


    Redshift

  • Historical backfilling is supported on Redshift, but it is limited to 7 days of metadata.

    • Modification time does not include schema changes. Only:

      • inserts

      • updates

      • deletes


    Synapse

    Synapse does not provide metadata history tables.

    • Historical backfilling: not possible.

    • Last modification time: not possible.

    • Row count: current row counts are calculated via count(*). Soda does not use metadata for this metric in Synapse.

    • Quartile metrics (Q1, median, Q3): not possible. Synapse does not support quartile metrics.



    Add multiple organizations

    You may find it useful to set up multiple organizations in Soda Cloud so that each corresponds with a different environment in your network infrastructure, such as production, staging, and development. Such a setup makes it easy for you and your team to access multiple, independent Soda Cloud organizations using the same profile, or login credentials.

    Note that Soda Cloud associates any API keys that you generate within an organization with both your profile and the organization in which you generated the keys. API keys are not interchangeable between organizations.

    Contact [email protected] to request multiple organizations for Soda Cloud.




    Install and configure

    Before you can define, test, or verify Git-managed data contracts, you need to install the Soda CLI and configure your environment.

    This setup gives you full control over your contracts, letting you version them in Git, execute them locally or remotely, and integrate them into your CI/CD pipelines.

    Install the Soda Core Python Package

    Install Soda Core using pip:
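A minimal sketch of the install command, assuming the package follows a soda-<data source type> naming pattern (verify the exact package name and index for your Soda version):

    pip install soda-postgres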

Replace postgres with the name of your data source type, e.g., snowflake, databricks, etc. See the data source reference for a list of supported data source types.

If you need authenticated access, follow the setup guide to set up your environment and install the necessary Soda extensions.

    Connect to Soda Cloud (optional)

    If you want to interact with Soda Cloud to publish contracts and view verification results, or to use Soda Agent, you’ll need to connect the CLI to your Soda Cloud account.

Don’t have an account? Sign up to get started.

    Create the config file

    This generates a basic Soda Cloud configuration file.

    Add your API keys

    Open sc.yml and fill in your API key and organization details.
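As an illustrative shape only (the exact keys in sc.yml may differ for your Soda version; the environment variable names are placeholders):

    soda_cloud:
      host: cloud.soda.io
      api_key_id: ${SODA_CLOUD_API_KEY_ID}
      api_key_secret: ${SODA_CLOUD_API_KEY_SECRET}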

    Learn more about how to generate keys:

    Test the connection

    This ensures the CLI can authenticate and communicate with Soda Cloud.

    Configure your data source

    To verify a contract, Soda needs to know how to connect to your data source. You have two options:

    Connect with Soda Core

    If you prefer to define your own connection locally (or aren’t using a Soda Agent), you can create a data source config file for Soda Core.

    Install the required package for your data source. For example, for PostgreSQL:

See the data source reference for supported packages and configurations.

    Create the config file

    Open ds.yml and provide the necessary credentials.

    For example with PostgreSQL:
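A sketch of what ds.yml might contain for PostgreSQL; the key names below are assumptions, so check the data source reference for the exact schema. Credentials are read from environment variables rather than hardcoded:

    data_source my_postgres:
      type: postgres
      host: ${POSTGRES_HOST}
      port: 5432
      username: ${POSTGRES_USERNAME}
      password: ${POSTGRES_PASSWORD}
      database: analytics
      schema: public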

Refer to the data source reference for the configurations of each data source type.

    Avoid hardcoding secrets. Use environment variables or a secrets manager where possible.

    Test the connection:

    Use an existing Data Source via Soda Agent

    If your data source is already connected to Soda Cloud using a Soda Agent (hosted or self-hosted), you can reuse that connection without managing credentials or configs locally.

    You just need to ensure you have set up the connection with Soda Cloud.


    Choose the method that best fits your setup:

    Use Soda Agent for a centralized, cloud-managed connection, or local configuration if you want full control within your environment.

    Upgrade Soda Agent

    The Soda Agent is a Helm chart that you deploy on a Kubernetes cluster and connect to your Soda Cloud account using API keys.

    To take advantage of new or improved features and functionality in the Soda Agent, including new features in the Soda Library, you can upgrade your agent when a new version becomes available in ArtifactHub.io.

Note that there is no downtime associated with upgrading a self-hosted Soda Agent. Because Soda does not define the .spec.strategy in the deployment manifest of the Soda Agent Helm chart, Kubernetes uses the default RollingUpdate strategy to upgrade; refer to the Kubernetes documentation.

1. If you regularly access multiple clusters, you must ensure that you are first accessing the cluster which contains your deployed Soda Agent. Use the following command to determine which cluster you are accessing.

If you must switch contexts to access a different cluster, copy the name of the cluster you wish to use, then run the following command.
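For example, with standard kubectl commands (the context name is a placeholder):

    kubectl config get-contexts                 # list clusters/contexts and show the current one
    kubectl config use-context my-soda-cluster  # switch to the cluster that runs the Soda Agent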

2. To upgrade the agent, you must know the values for:

    • namespace - the namespace you created, and into which you deployed the Soda Agent

    • release - the name of the instance of a helm chart that is running in your Kubernetes cluster

• API keys - the values Soda Cloud created which you used to run the agent application in the cluster. Access the first two values by running the following command.
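For instance, a standard Helm command that lists the release name and namespace of every installed chart:

    helm list --all-namespaces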

    Output:

3. Access the API key values by running the following command, replacing the placeholder values with your own details.

    From the output above, the command to use is:

4. Use the following command to search ArtifactHub for the most recent version of the Soda Agent Helm chart.

5. Use the following command to update the Helm repository.

6. Upgrade the Soda Agent Helm chart. The value for the chart argument can be a chart reference such as example/agent, a path to a chart directory, a packaged chart, or a URL. To upgrade the agent, Soda uses a chart reference: soda-agent/soda-agent.

    From the output above, the command to use would be:

    OR, if you use a values YAML file,
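Putting the Helm steps together, the sequence might look like the sketch below; the repository alias, release name, namespace, and --set keys are assumptions, so substitute the values from your own output above:

    helm search hub soda-agent    # find the latest chart version on ArtifactHub
    helm repo update              # refresh your local chart repositories

    # Upgrade by passing the API keys directly ...
    helm upgrade soda-agent soda-agent/soda-agent \
      --namespace soda-agent \
      --set soda.apikey.id=<api-key-id> \
      --set soda.apikey.secret=<api-key-secret>

    # ... or by reusing your values YAML file
    helm upgrade soda-agent soda-agent/soda-agent \
      --namespace soda-agent \
      --values values.yml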

    Browse datasets

The Datasets page displays all datasets that have been onboarded into Soda Cloud, either through publishing a contract or via the onboarding process: Onboard datasets on Soda Cloud.

    It provides a quick overview of each dataset’s health, showing at a glance if a dataset has issues, how many checks from its contract are failing, how many anomalies were detected through metric monitoring, and when the last scan was executed.

    You can filter datasets by properties like data source, owners, arrival time, attributes, or flags such as failures or anomalies. Use the search bar to quickly find a specific dataset by name, and sort the list by name, creation time, or data quality status.

    Learn more about custom attributes: Dataset Attributes & Responsibilities

    You can also sort the datasets list by name, creation date, check failures, or anomalies to prioritize your focus.

    Customize your datasets' view

    You can tailor the Datasets view to focus on the areas that matter most to you:

    1. Use the filter options to narrow down the view

    2. Click the Save Dashboard button to store your current filter configuration as a collection.

3. Enter a name for the collection and click Save

4. Once saved, your collection will be available in the dropdown at the top right of the dashboard. Simply select it to switch views.

5. Use the context menu next to the collection name to:

    • Delete the collection if it’s no longer needed.

    • Share the collection with others in your organization.

    Diagnostics Warehouse

    Diagnostics Warehouse provides a clear, detailed view of the state of data checks while allowing access to failed rows in order to take a closer look and resolve data quality issues.

    Overview

    Diagnostics Warehouse stores all Soda scan details, failed records, and historical data quality issues directly in your data warehouse of choice, safely and securely. Nothing is stored outside. This gives you the ability to run diagnostics, resolve issues, and see exactly why problems happen. You can go as deep as you need, from a single record to a full dataset.

    Each time a Soda scan runs, Diagnostics Warehouse stores failed rows together with check and scan results, and related metadata attributes. With that information, data teams can quickly diagnose and resolve issues at both dataset and row level. Additionally, Soda's Diagnostics Warehouse makes it easier for teams to build on top of Soda's outputs to set up operational workflows, and connect to BI tools you already know and trust.

    Features & capabilities

    • Full diagnostic information in one place, including attributes.

    • Transparency for all: replace black-box runs with auditable facts and keep an immutable, queryable history of what was checked, when, how long it took, what failed, and why.

    • Faster root-cause analysis: jump from a failed check to the exact failed rows, affected datasets/columns, and prior history to see if it’s a one-off issue or a pattern.

    Security & governance

    • Data minimization: Diagnostics Warehouse stores metadata about runs and checks and, for row-level checks, it only stores failed rows when the option is enabled.

    • Warehouse residency: Diagnostics are not stored in Soda. They live in your analytics warehouse, respecting your access controls, encryption, and audit trails.


    Get started

    1. Enable Diagnostics Warehouse in your Soda data source settings.

    2. Grant the service identity permission to create and write to the Diagnostics Warehouse schema in your warehouse.

    3. Run your checks; Diagnostics Warehouse tables populate automatically.

    4. Query your warehouse and connect to your BI tools to start exploring.

Next: to enable Diagnostics Warehouse in your organization, reach out to Soda at https://www.soda.io/contact.

    Browse checks

The Checks page displays all checks defined in a data contract and tracked in Soda Cloud. It provides a quick overview of check health across datasets and lets you review key details such as the check type, the dataset it belongs to, and the time of the last scan.

You can filter checks by properties such as data source, dataset, owners, attributes, or status (pass, fail, warning) to create custom groupings and focus on the areas or teams that matter most. Use the search bar to quickly find a specific check by name, and sort the list by name, last run time, or check status.

Learn more about custom attributes: Check and dataset attributes

    Customize your check view

    You can tailor the Checks view to focus on the areas that matter most to you:

    1. Use the filter options to narrow down the view

    2. Click the Save Dashboard button to store your current filter configuration as a collection.

3. Enter a name for the collection and click Save

4. Once saved, your collection will be available in the dropdown at the top right of the dashboard. Simply select it to switch views.

5. Use the context menu next to the collection name to:

    • Delete the collection if it’s no longer needed.

    • Share the collection with others in your organization.

    Dataset dashboard

    The Dataset Page provides a detailed view of each dataset’s health and monitoring information in Soda Cloud. It includes several tabs to help you explore and manage data quality at the dataset level.

    Checks tab

    Displays the results of all checks defined in the dataset’s contract. Checks are grouped by column, with column-level checks nested under their respective columns. Columns with failed checks are automatically expanded so you can spot issues quickly. You can filter the view to show only failed checks and search for specific checks or columns by name for faster navigation and troubleshooting.

    Learn more about Data Testing

    Metric Monitoring tab

    Shows the metrics that are actively monitored for the dataset, helping you track trends and detect anomalies over time.

Learn more about Metric Monitoring.

    Profiling tab

    Provides an overview of the dataset’s structure, including column names, data types, distinct counts, and other statistics. You can search for a specific column by name to quickly locate and review its profiling details.

    Incidents tab

    Lists incidents related to the dataset, helping you track issues and collaborate on resolution. You can filter incidents based on criteria such as user lead, status, or severity to focus on the most important or urgent cases.

Learn more about Incidents.

    Organization dashboard

    The Organization Dashboard provides a high-level overview of your data quality across datasets and checks in Soda Cloud. It shows key trends over time, such as the number of checks that are passing, failing, or in a warning state, helping you identify issues early.

    You’ll also find key metrics, including:

    • Scans in failed mode: Datasets that are currently blocked due to failing checks.

    • Checks currently failing: Active checks that need attention.

    • Overall Health Score: The number of failing checks out of the total number of checks

    These insights allow you to quickly identify where action is needed.

    Customize your dashboard

    You can tailor the Organization Dashboard to focus on the areas that matter most to you:

    1. Apply filters based on attributes: Use the filter options to narrow down the view by attributes

2. Click the Save Dashboard button to store your current filter configuration as a collection.

3. Enter a name for the collection and click Save

4. Once saved, your collection will be available in the dropdown at the top right of the dashboard. Simply select it to switch views.

5. Use the context menu next to the collection name to:

    • Delete the collection if it’s no longer needed.

    • Share the collection with others in your organization.

    Activity section

    The Activity section offers insights into how Soda is being used across your organization. It tracks adoption metrics, such as active users, active checks, active datasets, and the number of alerts in the last 90 days.

    Custom Dashboards

    In addition to the built-in dashboards in Soda Cloud, you can build custom dashboards tailored to your organization’s specific needs. By leveraging the Soda REST API, you can programmatically retrieve data quality metrics, check results, and incident details, and integrate them into external dashboarding tools such as Power BI, Tableau, or Looker.

    This enables you to create tailored views and reports that align with your business logic and audience, ensuring stakeholders get the right insights in the tools they already use.

Learn more in the Soda REST API documentation.

    General settings

    The General Settings page allows you to configure foundational settings for your organization in Soda Cloud. These settings impact how your organization operates and how users interact with the platform.

    Organization Name

    Set the name of your organization. This name appears throughout Soda Cloud, such as in dashboards, reports and notifications.

    Allow "Login As"

    Enable the Login As feature to allow the Soda Support team to log in as an admin within your organization. This can be useful when troubleshooting issues or providing assistance.

    Enable or Disable the Soda-hosted Agent

    You can choose whether to use the Soda-hosted Agent by enabling or disabling it in the Organization Settings:

    • Toggle the Soda-hosted Agent option to enable or disable the agent for your organization.

    • Disabling the agent prevents Soda Cloud from running scans or checks via the managed agent. You’ll need to use a self-hosted agent or Soda Core in your environment instead.

    Profiling Data Collection

    By default, the Soda-hosted Agent collects profiling information (such as column-level statistics and schema details) to support features like dataset discovery and monitoring in Soda Cloud.

    You can choose to disable profiling if you prefer not to send profiling data to Soda Cloud: Toggle the Profiling Data Collection option to disable profiling for your organization.

    This ensures that no profiling information is collected or pushed to Soda Cloud. Only check results and metadata necessary for contract validation will be processed.

    Enable Data Source Secrets

    Manage secure storage of secrets such as API keys, credentials or connection details. Secrets can be used in data source configurations, checks, and other automated processes.

For more information, see the documentation on data source secrets.

    Verify a contract

    Once your contract is authored and published (or available locally), you can verify whether the actual data complies with the defined expectations. Soda provides two execution options:

    • Soda Core – run verifications locally, typically in CI/CD pipelines or dev environments.

    • Soda Agent – run verifications remotely using an agent deployed in your environment, triggered via Soda Cloud.

    Both approaches support variable overrides, publishing results to Soda Cloud, and integration into automated workflows.

    Learn more about Deployment options

    Using Soda Core

    Soda Core runs the verification locally, connecting to your data source using the defined data source configuration file.

    This command:

    • Connects to your database using the local config

    • Loads the contract

    • Runs all checks and returns a pass/fail result
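The verification command itself is not reproduced on this page; a hypothetical invocation might look like the sketch below. Only the --set and --publish flags are named in this guide; the subcommand and file flags are assumptions to be checked against soda --help:

    # Hypothetical invocation (check `soda --help` for the real syntax);
    # --set overrides a contract variable, --publish sends results to Soda Cloud
    soda contract verify \
      --data-source ds.yml \
      --contract contract.yml \
      --set run_date=2025-01-01 \
      --publish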

    With variable overrides

    You can pass variables defined in the contract using the --set flag:

    Learn about variables in Data Contract:

    Publish results to Soda Cloud

To send verification results to Soda Cloud for visibility and reporting, add the --publish flag to the command.

    This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration.

    Learn more about permissions here:

    Learn how to connect the CLI to Soda Cloud:

    This is recommended if you want stakeholders to see the outcomes in Soda Cloud or include them in dashboards and alerting.

    Using Soda Agent

    Soda Agent executes verifications using data sources configured in Soda Cloud.

    This setup:

    • Runs verifications through the Soda Agent connected to your data source

    • Fetches the published contract from Soda Cloud

    • Returns the result locally in the CLI

    With variable overrides

    You can pass variables defined in the contract using the --set flag:

    Learn about variables in Data Contract:

    Publish results to Soda Cloud

    You can also push results to Soda Cloud from the agent-based run.

    Add the flag --publish to the command.

    This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration.

    Learn more about permissions here:

This is recommended if you want stakeholders to see the outcomes in Soda Cloud or include them in dashboards and alerting.

    Notifications

    Soda’s notification system helps you stay informed when data issues occur—whether it’s a failed check or an anomaly detected through metric monitoring. Notifications are dynamically dispatched using notification rules, allowing you to target alerts based on specific properties, attributes, or datasets.

    How Notification Rules Work

    Notification rules define when and to whom a notification is sent. Rules can be configured to match specific checks or anomalies, ensuring the right people are notified at the right time.

    Creating a Notification Rule

Only users with the Manage Notification Rules permission can create or edit rules. All users can view rules. Read more about permissions.

    To create a new notification rule:

    1. Click on your profile in Soda Cloud and select Notification Rules from the menu.

2. Click New Rule.

3. Provide a name for the rule.

4. Define the Rule Scope

    Checks:

    • All checks: The rule applies to every check in your organization.

    • Specific checks: Build custom rules by filtering on check properties, dataset properties, or attributes.

    • Anomalies from Metric Monitoring: Select specific datasets where the rule applies.

5. Define the recipients (users, groups, or integrations like Slack, Teams, or webhooks).

6. Choose the alert type (only applicable for checks, not anomalies):

    • Only failures

    • Failures and warnings

    • All statuses

7. Save to create the notification rule.


    Pausing Notification Rules

    You can pause a notification rule at any time to temporarily disable alerts without deleting the rule.

    Redeploy Soda Agent

    When you delete the Soda Agent Helm chart from your cluster, you also delete all the agent resources on your cluster. However, if you wish to redeploy the previously-registered agent (using the same name), you need to specify the agent ID in your override values in your values YAML file.

    1. In Soda Cloud, navigate to your avatar > Agents.

    2. Click to select the agent you wish to redeploy, then copy the agent ID of the previously-registered agent from the URL. For example, in the following URL, the agent ID is the long UUID at the end. https://cloud.soda.io/agents/842feab3-snip-87eb-06d2813a72c1. Alternatively, if you use the base64 CLI tool, you can run the following command to obtain the agentID.

     kubectl get secret/soda-agent-id -n soda-agent --template={{.data.SODA_AGENT_ID}} | base64 --decode
3. Open your values.yml file, then add the id key:value pair under agent, using the agent ID you copied from the URL as the value.
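For example, the relevant fragment of values.yml might look like this (the exact nesting under soda is an assumption, and the ID shown is a placeholder):

    soda:
      agent:
        id: "<agent-id-copied-from-the-url>"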

4. To redeploy the agent, you need to provide the values for the API keys the agent uses to connect to Soda Cloud in the values YAML file. Access the values by running the following command, replacing the soda-agent values with your own details, then paste the values into your values YAML file.

    Alternatively, if you use the base64 CLI tool, you can run the following commands to obtain the API key and API secret, respectively.

5. In the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.

6. Validate the Soda Agent deployment by running the following command:
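One generic way to confirm the deployment (the documented validation command may differ) is to check that the agent pods are running:

    kubectl get pods --namespace soda-agent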

    Group By monitors

    Group By monitors enable you to track data quality metrics across specific segments of your dataset. Instead of monitoring a metric for a column as a whole, you can break it down per category (for example, per region, per school year, per status).

    This functionality is especially valuable when:

    • You want to detect anomalies at a more granular level, within each segment or category.

    • You need visibility into how data quality differs across categories.

    Incidents

    Incidents help you track, investigate, and resolve data quality issues when they occur. An incident is created when a data issue, such as a failed or warning check, has been confirmed and assigned to someone for resolution.

To create or update an incident, the user must have the "Manage Incidents" permission on the related dataset.

    Creating an Incident

    Create and edit contracts

    With Git-managed contracts, you define your expectations as code using YAML. This gives you full control over how your data is validated, and allows you to manage contracts just like any other code artifact: versioned, tested, and deployed via Git.

    To learn all about the structure and supported features, refer to the full specification in the

    The contract structure includes:

    • Dataset and column structure

  • Available check types (missing, invalid, and more)
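Purely as an illustration of the shape, and not the authoritative syntax (the key names and structure are defined in the contract specification referenced above), a minimal contract file might look something like:

    dataset: dim_customer
    columns:
      - name: id
        checks:
          - missing:
      - name: country
    checks:
      - schema: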

    Upgrading from 1.1.x to 1.2.x+

    Starting from version 1.2.0 all images required for the Soda Agent are distributed using a Soda-hosted image registry.

For more information, see Set up authentication for the Soda image registry below.

    Set up authentication for the Soda image registry

    Verify a contract

    Once a Data Contract is published, the next step is to verify that the actual data matches the expectations you’ve defined. Soda offers several flexible ways to execute contract verifications, depending on your needs and technical setup.


    Manual execution (from the Dataset Page)

    You can manually verify a contract at any time from the dataset page in Soda Cloud.

    Simply open the dataset and click Verify Contract. This will:

    Check and dataset attributes

    Attributes allow you to add descriptive metadata to your datasets and checks. This metadata can then be:

    • Used for filtering in Soda Cloud, making it easier to search and organize datasets and checks based on specific criteria (e.g., business domain, sensitivity, criticality).

    • Leveraged in reporting, enabling you to group datasets, track ownership, and monitor data quality across different categories or dimensions.

    Adding meaningful attributes enhances discoverability, governance, and collaboration within Soda and its integrations.

    Athena

    Access configuration details to connect Soda to an Amazon Athena data source.

    Connection configuration reference

    Install the following package:

    Data source YAML


    Reference

    This reference hub includes detailed documentation for Soda’s key interfaces and configuration options, as well as information on Soda's architecture and specifics, including:

    • Data Contract Language Reference – Author, validate, and manage data contracts using YAML-based definitions.

    • CLI Command and Python Reference – Use Soda’s command-line interface to configure, run, and automate verification workflows.

    • REST API Reference – Interact with Soda Cloud programmatically to manage datasets, run verifications, and retrieve results.

    pip install -i https://pypi.dev.sodadata.io/simple soda-postgres
    kubectl config get-contexts
  • Operational excellence: monitor failure rates, flaky checks, and run performance. Set SLOs for data quality, and measure MTTR and improvement over time.
  • Organization-level visibility: roll up results by domain, team, or pipeline. Show the impact of your data quality program to leadership with real, defensible metrics.

  • Open & portable features: it’s just tables in your warehouse. Query with SQL, power dashboards, join with lineage, incident, or cost data, and automate workflows.

  • Security & Governance: Diagnostics Warehouse stores tables in your own warehouse, giving you full control over security, retention and access.

  • [email protected]


    Metric Monitoring dashboard
    Incidents


    https://docs.soda.io/api-docs/public-cloud-api-v1.html


    Global and Dataset Roles





    soda:
      apikey:
        id: "***"
        secret: "***"
      agent:
        id: "842feab3-snip-87eb-06d2813a72c1"
        name: "myuniqueagent"


    kubectl config use-context <name of cluster>
    helm list
    NAME      	NAMESPACE 	REVISION	UPDATED                             	STATUS	  CHART            	APP VERSION     
    soda-agent	soda-agent	5       	2023-01-20 11:55:49.387634 -0800 PST	deployed	soda-agent-0.8.26	Soda_Library_1.0.0
    helm get values -n <namespace> <release name>
    helm get values -n soda-agent soda-agent 
    helm search hub soda-agent
    helm repo update
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    helm upgrade <release> <chart> \
     --set soda.agent.name=myuniqueagent \
     --set soda.cloud.endpoint=https://cloud.soda.io \
     --set soda.apikey.id=*** \
     --set soda.apikey.secret=**** \
     --set soda.agent.logFormat=raw \
     --set soda.agent.loglevel=ERROR \
     --namespace soda-agent
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    helm upgrade soda-agent soda-agent/soda-agent \
     --set soda.agent.name=myuniqueagent \
     --set soda.cloud.endpoint=https://cloud.soda.io \
     --set soda.apikey.id=*** \
     --set soda.apikey.secret=**** \
     --set soda.agent.logFormat=raw \
     --set soda.agent.loglevel=ERROR \
     --namespace soda-agent
    helm upgrade soda-agent soda-agent/soda-agent \
       --values values-local.yml --namespace soda-agent
    helm get values -n soda-agent soda-agent
    kubectl get secret/soda-agent-apikey -n soda-agent --template={{.data.SODA_API_KEY_ID}} | base64 --decode
    kubectl get secret/soda-agent-apikey -n soda-agent --template={{.data.SODA_API_KEY_SECRET}} | base64 --decode
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    kubectl describe pods
    You want to monitor trends and patterns that would otherwise be hidden in aggregated metrics.

    Only one Group By monitor can be configured at a time.

    Because a Group By monitor spawns multiple monitors (one per category), limiting this to a single configuration helps manage performance.

    Explore Group By monitors

    When a Group By monitor is active in a dataset, results are displayed at the bottom of the Metric Monitors tab, on the Column Monitors table:

    There, you will see:

    Entry in Column Monitors table

    A Group By monitor is listed like any other monitor, but its description indicates the Group By column(s) and the metric being measured (e.g. "Maximum length of Bus_No grouped by Breakdown_or_Running_Late").

    Group By monitors will always be displayed at the top of the Column Monitors table, even when no anomalous groups were detected.

    From the Column Monitors table, it is possible to turn on notifications at the column level by clicking on the bell icon. Note that notifications at a category level are not available at the moment.

    Groups table

    Expanding the monitor displays a groups table, which shows the results for each group or category. Each row corresponds to one category (or combination of categories if multiple columns are grouped). From the groups table, it is possible to delete specific categories by clicking on the bin icon on the right.

    Deleting from the groups table is intended to remove groups/categories that are no longer present in the data.

    • Deleting a category removes the history for that monitor.

    • If the group/category is still present in the data, the monitor will be re-created on the next scan. It will not be backfilled, unless a historical metric collection scan is triggered.

    Example:

    Group By Breakdown_or_Running_Late + metric Maximum length of Bus_No → a row for each Breakdown_or_Running_Late value, with the maximum bus number length observed in that category, alongside its anomaly detection status.

    Add Group By monitors

    You can add a Group By monitor from the Metric Monitors section of the dataset page.

    1. Scroll to the Column Monitors table and click Add Column Monitors.

    2. In the Add Column Monitors panel, toggle on Group By.

    3. Select one or more columns to group by.

    For the time being, only columns with a maximum of 50 distinct categories are eligible for Group By monitoring.

    Multiple columns can be selected, but note that the resulting categories are combinatory (e.g., Column A × Column B).

    4. (Optional) Exclude specific categories (segments) that you don’t want to monitor.

    5. Select one or more columns to monitor under Column Selection.

    6. Enable one or more metrics from the right-hand list.

    7. Click Add 1 Monitor on the top right to save.

    The monitor now appears in the Column Monitors table and starts tracking anomalies across each category.

    Category management

    • Categories can be excluded when configuring the monitor. See Step 4 on Add Group By monitors.

    • Categories can be deleted after creation from the Groups table if you decide they should no longer be monitored.


    Key Considerations

    • One Group By monitor at a time: Only one configuration is allowed, since Group By monitors expand into many underlying monitors.

    • Multiple Group By columns: More than one column can be selected, but the categories generated are combinatory.

    • Category limits: Columns with more than 50 categories cannot be used for Group By monitoring.

    • Exclusions and deletions: You can exclude categories at configuration time or delete them later from the Groups table.

    • Notifications: Notifications are configured at the column level, not yet at the per-category level.


    With Group By monitors, you gain more granular visibility into your data quality, while keeping control over compute cost and category management.



    Only users with the Manage Incidents permission for the dataset can create or edit incidents. All users with access to the dataset can view existing incidents. Read about Global and Dataset Roles

    You can create an incident directly from a check result when an issue has been identified:

    1. On a check page, use the context menu to select Create Incident.

    2. Provide a name and description for the incident.

    3. Select one or multiple related check results that you want to associate with the incident.

    4. Click Save to proceed.

    View incidents for a dataset

    Once created, the incident will appear in the Incidents tab of the corresponding Dataset Page

    It is possible to filter incidents based on lead, status, reporter, and severity.

    View incidents across the organization

    Incidents can also be seen in a central place in Soda Cloud. In the top navigation, click on Incidents to see all the incidents of the organization.

    Use the filters and the title search to find relevant incidents.

    Updating an Incident

    Assign a lead: Every incident requires a lead: the user responsible for resolving the issue.

    Update status: Track progress by updating the incident’s status as the investigation and resolution evolve.

    Add a resolution note: When marking an incident as resolved, a resolution note is mandatory to document what was done.

    Include more check results: If new results are failing, you can include them in the incident.

    After any changes, click Save to apply them.

    Integration with External Systems

    You can integrate incidents with Slack, MS Teams, or other external systems using Soda’s webhook capabilities or the Incidents API. Learn how to integrate with Soda: Integrations



  • Filters (dataset-level and check-level) — Optional

  • Threshold configuration — Optional

  • Use of variables — Optional

  • Scheduling — Optional

  • ...and more

    Test your contract

    Before publishing or verifying your contract, you can run a test command to ensure the contract is correctly defined and points to a valid dataset.

    This will:

    • Validate the YAML syntax

    • Confirm that the referenced dataset exists

    • Check for any issues with variables, filters, or structure

    Run this as part of your development workflow or CI to catch errors early.
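
    For example, assuming a ds.yml data source file and a contract.yaml contract, as used in the CLI examples elsewhere on this page:

    soda contract test --data-source ds.yml --contract contract.yaml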

    Verify your contract

    Before publishing your contract, you may want to execute it to verify that it runs correctly.

    Read more about how to Verify a contract

    Publish the Contract

    If you have Soda Cloud, once your contract is finalized and tested, you can publish it to Soda Cloud, making it the authoritative version for verification and scheduling.

    This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration.

    Learn more about permissions here: Dataset Attributes & Responsibilities

    Learn how to connect the CLI to Soda Cloud:

    Publishing:

    • Uploads the contract to Soda Cloud

    • Makes it available for manual, scheduled, or programmatic verifications

    • Enables visibility and collaboration through the UI

    Once published, the contract becomes the source of truth for the dataset until a new version is published.
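
    For example, assuming your contract lives in contract.yaml and sc.yml holds your Soda Cloud configuration:

    soda contract publish --contract contract.yaml --soda-cloud sc.yml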

    You’re now ready to start verifying your contract and monitoring your data.

    Contract Language reference


    Using your existing Soda API key and secret

    By default, your existing Soda API key and secret values are used to authenticate to the Soda image registry.

    Ensure these values are still present in your values.yaml; no further action is required.

    Using a separate Soda API key and secret

    You might also opt to use a new, separate Soda API key and secret to perform the authentication to the Soda image registry.

    In this case, ensure the imageCredentials.apikey.id and imageCredentials.apikey.secret values are set to these new values:

    Specify existing imagePullSecrets

    If you're providing your own imagePullSecrets on the cluster, e.g. when you're pulling images from your own mirroring image registry, you must modify your existing values file.

    The imagePullSecrets property that was present in versions 1.1.x has been renamed to the more standard existingImagePullSecrets .

    If applicable to you, please perform the following rename in your values file:

    For more information on setting up image mirroring, see Mirroring images

    Update the region

    If you are a customer using the US instance of Soda Cloud, you'll have to configure your Agent setup accordingly. Otherwise you can ignore this section.

    In version 1.2.0, we introduce a soda.cloud.region property that determines which registry and Soda Cloud endpoint to use. Possible values are eu and us. When soda.cloud.region is not set explicitly, it defaults to eu.

    If applicable to you, please perform the following changes in your values file:

    For more information about using the US region, see Using the US image registry.

    Rename scanlauncher to scanLauncher

    The scanlauncher section in the values file has been renamed to scanLauncher. Please ensure the correct name is used in your values file if you have any configuration values there:

    More info about Soda's private container registry


    • Execute the checks in the published contract

  • Use the latest available data

  • Display pass/fail results directly in the UI

    This is especially useful for one-off validations, exploratory testing, or during incident investigations.

    This action requires the "Manage contract" permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities

    Scheduled execution (from the Contract Editor)

    To monitor data quality over time, you can set up scheduled verifications directly in the contract editor.

    When editing or viewing a contract:

    1. Go to the Schedule section

    2. Choose how often you want the contract to be verified (e.g., hourly, daily, weekly)

    3. Save the schedule

    Soda Cloud will automatically run the contract at the specified intervals, using the selected agent. All results will be stored and visualized in Soda Cloud, with alerts triggered when rules fail (if configured).

    Programmatic execution (via CLI)

    For advanced workflows and full automation, you can verify contracts programmatically using the Soda CLI and a Soda Agent.

    This is ideal for:

    • CI/CD pipelines

    • Custom orchestration (e.g., Airflow, dbt Cloud, Dagster)

    • Triggering verifications after data loads

    Step 1: Connect to Soda Cloud

    First, create a Soda Cloud configuration file:
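
    For example, assuming you name the configuration file sc.yml:

    soda cloud create -f sc.yml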

    This generates a basic config file. Open it and fill in your API key and organization details.

    Learn how to Generate API keys

    You can test the connection:
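
    For example, using the same sc.yml file:

    soda cloud test -sc sc.yml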

    Step 2: Verify a Contract

    Now you can run a verification using the CLI and a remote Soda Agent.

    To verify a dataset without pushing the results to Soda Cloud:
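
    For example, assuming the dataset is addressed as datasource/db/schema/table and sc.yml holds your Soda Cloud configuration:

    soda contract verify --dataset datasource/db/schema/table --use-agent --soda-cloud sc.yml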

    This allows you to verify that the contract produces the expected results before pushing results to Soda Cloud.

    To verify and also push the results to Soda Cloud:
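
    For example, adding the --publish flag to the same command:

    soda contract verify --dataset datasource/db/schema/table --publish --use-agent --soda-cloud sc.yml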

    This makes the verification results available in the UI for stakeholders, triggers notifications, and feeds monitoring dashboards.

    This action requires the "Manage contract" permission on the dataset; the user is identified based on the API key provided in the Soda Cloud configuration. Learn more about permissions here: Dataset Attributes & Responsibilities



    Creating an attribute

    Only users with the Manage Attributes permission can create or edit attributes. Read about Global and Dataset Roles

    To create a new attribute:

    1. Click your profile icon in the top-right corner and select Attributes from the menu.

    2. Click New Attribute.

    3. Provide a Label for the attribute. Note that a unique name will be generated from this label. This name is immutable and is used in Data Contract definitions to reference the attribute.

    4. Select the Resource Type where the attribute applies: Dataset or Check.

    5. Choose the Type of attribute: Single select, Multi select, Checkbox, Text, Number, Date.

    6. Add a Description for context.

    7. Click Save.

    Edit attributes

    To edit an attribute, use the context menu next to the attribute name and select Edit Attribute.

    Note that the name property and the assigned resource type cannot be changed.

    Attributes in datasets

    Learn how to set attributes for datasets: Dataset Attributes & Responsibilities

    Attributes in checks

    Attributes for checks will be defined as part of the Data Contract.

    Learn how to set attributes for checks:

    • Authoring in Soda Cloud:

    • Data Contract as code:

    Attributes in filters

    Once an attribute has been assigned at least once, either to a dataset or a check, it becomes available as a filter in Soda Cloud. Attributes that have not yet been used will not appear in filter options.



    Connection test

    Test the data source connection:
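
    For example, assuming your data source YAML is saved as ds.yml:

    soda data-source test -ds ds.yml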

    pip install -i https://pypi.dev.sodadata.io/simple -U soda-athena
    type: athena
    name: my_athena
    connection:
      catalog: ${env.ATHENA_CATALOG}
      access_key_id: ${env.ATHENA_ACCESS_KEY_ID}
      secret_access_key: ${env.ATHENA_SECRET_ACCESS_KEY}
      staging_dir: ${env.ATHENA_STAGING_DIR}
      region_name: ${env.ATHENA_REGION}
      work_group: ${env.ATHENA_WORKGROUP}
    #  role_arn: <my_role_arn>
    #  profile_name: <my_aws_profile>
    #  session_token: <my_session_token>


    • Data flow and data source reference – Understand how Soda interacts with other systems and manage exceptions.

    Each section includes practical, example-based documentation structured to help data engineers, analysts, and platform teams apply Soda in real-world use cases.



    here
    Sign up here
    Generate API keys
    Data source reference for Soda Core
    Data source reference for Soda Core


    Dataset Attributes & Responsibilities
    Dataset Attributes & Responsibilities


    Data Testing

    What is Data Testing?

    Data testing is the practice of validating that your data meets the expectations you’ve defined for it before it reaches stakeholders, dashboards, or downstream systems. Just like software testing ensures your code behaves as intended, data testing safeguards the quality and reliability of your data.

    At Soda, we see data testing as the foundation of data trust. Whether you’re verifying row counts, checking for missing or invalid values, or enforcing schema integrity, the goal is the same: catch issues early, reduce incidents, and keep your data consumers confident.

    What is a Data Contract?

    A Data Contract is a formal agreement between data producers and data consumers that defines what “good data” looks like. It sets expectations about schema, freshness, quality rules, and more, and makes those expectations explicit and testable.

    With a data contract in place, producers commit to delivering data that meets certain standards. Consumers, in turn, can rely on that contract to build reports, models, or pipelines without second-guessing the data.

    At Soda, Data Contracts are testable artifacts that can be authored, versioned, verified, and monitored, whether in code or in the UI. They’re the connective tissue between producers and consumers, aligning teams and eliminating ambiguity.

    What is Contract Verification?

    Defining a contract is only the first step. Verifying that your data actually meets the expectations is where the value is realized. Contract verification is the process of testing whether the data in your datasets aligns with the rules, thresholds, and schema defined in the contract.

    At Soda, contract verification is fully automated. Whether triggered manually, on a schedule, or as part of your CI/CD pipelines, each verification run checks that:

    • The schema matches the contract definition (columns, data types, structure)

    • The data complies with checks like missing, duplicate, invalid values, and custom rules

    This helps you catch issues early, ensure data quality over time, and build trust across your organization.

    Authoring and managing Data Contracts: choose your style

    Soda supports two complementary ways to author and manage data contracts. They are designed to fit the way your team works.

    Cloud-managed Contracts (Soda Cloud UI)

    If you’re a data analyst, product owner, or stakeholder, who prefers intuitive interfaces over code, Soda Cloud is the ideal workspace.

    With the Soda Cloud UI, you can:

    • Browse datasets and view profiling insights

    • Define a contract with a no-code Editor

    • Schedule and monitor contract verifications

    • Collaborate with your team and publish contracts with a click

    There’s no setup or YAML required, just fast, visual workflows that enable domain experts to contribute directly to data quality.

    Git-managed Contracts (Soda Core CLI)

    If you live in your terminal and manage your data pipelines as code, you’ll want to use Soda Core and the Soda CLI.

    With this setup, you can:

    • Define contracts in YAML

    • Run contract verifications in CI/CD

    • Push the contract and verification results to Soda Cloud for visibility

    • Use Git as the source of truth for version control, collaboration, and reviews

    This path offers full control, transparency, and seamless integration into your dev tooling.


    Combine UI and Git for cross-team collaboration

    Soda gives you the flexibility to blend both approaches. For example, non-technical users can define or adjust contracts visually in Soda Cloud for the datasets they manage, while engineers can use Git-managed contracts for the datasets they own.

    This hybrid model enables collaboration across teams:

    • Business users bring domain expertise directly into the contract

    • Engineers maintain quality, consistency, and governance

    • Each dataset follows the authoring method that best suits the team responsible for it

    You can mix and match—using the UI for some contracts, and code for others—depending on your team's structure and preferences.

    And even if Data Contracts are managed in Git, you can still involve non-technical users who can propose changes to a contract in the UI. These approved changes can be embedded into engineering workflows and synced to Git, ensuring that every update follows your organization’s quality and deployment standards.

    Choose the model, or combination of models, that works best for your organization.


    Soda Agent vs. Soda Core for execution

    Once a contract is published, you’ll want to verify that the actual data meets the contract’s expectations. This verification can be done in two ways:

    • Soda Agent is our managed runner that lives in your environment and connects to Soda Cloud. It handles contract verification, scheduling, and execution securely, without exposing your data externally. It is great for teams who want central management without maintaining CLI infrastructure.

    • Soda Core is our open-source engine you can run anywhere: locally, in CI, or inside your data pipelines. It’s lightweight, customizable, and great for teams that prefer full control or have strict environment constraints.

    Both approaches support the same Data Contract logic. Choose the one that best fits your deployment model.

    RAD (Record-level Anomaly Detection)

    This page explains Record-level Anomaly Detection (RAD) and Soda's anomaly detection capabilities through RAD.

    Coming soon!

    RAD functionalities will be available soon for Enterprise plan users.

    What is RAD?

    Ensuring data quality can be difficult, especially when you need broad coverage quickly. Checks and column monitors are great for enforcing specific rules, but they take time to set up and require a deep understanding of your data. Soda’s Record-level Anomaly Detection (RAD) helps you get started fast, providing instant coverage across all columns, rows, and segments, without any configuration.

    The algorithm analyzes historical data to build a clear picture of what normal data is supposed to look like. When incoming rows show unusual patterns, unexpected values, inconsistencies, or errors, RAD automatically triggers an alert and runs a Root Cause Analysis to pinpoint the issue. This provides quick, actionable insights while you work toward more detailed control using checks and column monitors.

    Why use RAD?

    1. Instant, broad coverage: monitor all columns, rows, and segments at once, detecting both known and unknown issues.

    2. No configuration needed: get started immediately; no metrics or checks need to be defined. RAD automatically determines which columns to use.

    3. One metric to track and alert on: the Record-level Drift Score provides a single, explainable metric to monitor data health.

    When to use RAD?

    Order of operations to achieve the best coverage in the most efficient way:

    1. Firstly, dataset-level metadata metrics: always begin with high‑level monitors to verify that the right amount of data arrived on time and in the correct format. These require no configuration; they just need to be enabled.

    2. Secondly, RAD: apply Record-level Anomaly Detection to validate the actual content of the data. This step also requires no configuration (only enablement) and provides broad coverage across all columns and segments.

    3. Next, column monitors: apply column‑level monitoring for specific use cases where the potential data quality issue and metric are known but expected to change over time. These should be minimized, as they are prone to generating false alerts.

    Data quality tool | Column  | Metric  | Failure | Configuration
    RAD               | Unknown | Unknown | Unknown | RAD on all columns
    Checks            | Known   | Known   | Known   | Missing values in Amount < 5%
    Column monitors   | Known   | Known   | Unknown | Anomaly detection on Amount for missing values

    RAD requirements

    For a dataset to be monitored by RAD, the following conditions must be met:

    • Time partition column: the dataset must include a column that partitions data by time (for example, created_at).

    • Primary key: the dataset must have a primary key to uniquely identify rows.

    • Diagnostics Warehouse setup: a Diagnostics Warehouse must be configured to store the daily sample, consisting of either primary keys or, ideally, a full copy of the sampled rows.

    Next: to enable Record-level Anomaly Detection in your organization, reach out to Soda at .

    Collibra

    This page describes the bi-directional integration between Soda and Collibra.

    The Soda↔Collibra optimized integration synchronizes data quality checks from Soda to Collibra, creating a unified view of your data quality metrics. The implementation is optimized for performance, reliability, and maintainability, with support for bi-directional ownership sync and advanced diagnostic metrics.

    Key features

    • High Performance: 3-5x faster execution through caching, batching, and parallel processing

    • Custom Attribute Syncing: Flexible mapping of Soda check attributes to Collibra attributes for rich business context

    • Ownership Synchronization: Bi-directional ownership sync between Collibra and Soda

    • Deletion Synchronization: Automatically removes obsolete check assets from Collibra when checks are deleted in Soda

    • Multiple Dimensions Support: Link checks to multiple data quality dimensions simultaneously

    • Monitor Exclusion: Option to exclude Soda monitors from synchronization, focusing only on data quality checks

    • Diagnostic Metrics Processing: Automatic extraction of diagnostic metrics from any Soda check type with intelligent fallbacks

    • Robust Error Handling: Comprehensive retry logic and graceful error recovery

    • Advanced Monitoring: Real-time metrics, performance tracking, and detailed reporting

    • CLI Interface: Flexible command-line options for different use cases

    • Backward Compatibility: Legacy test methods preserved for smooth migration

    Quickstart

    For technical details on how to configure the bi-directional Collibra integration, head to Setup & configuration.

    Prerequisites

    • Python 3.10+ required

    • Valid Soda Cloud API credentials

    • Valid Collibra API credentials

    • Properly configured Collibra asset types and relations

    Basic Usage

    Advanced Usage

    How It Works

    1. Optimized Dataset Processing

    • Smart Filtering: Only processes datasets marked for synchronization

    • Parallel Processing: Handles multiple operations concurrently

    • Caching: Reduces API calls through intelligent caching

    • Batch Operations: Groups similar operations for efficiency

    2. Enhanced Check Processing

    For each check in a dataset:

    Asset Management

    • Bulk Creation/Updates: Processes multiple assets simultaneously

    • Duplicate Handling: Intelligent naming to avoid conflicts

    • Status Tracking: Monitors creation vs. update operations

    Attribute Processing

    • Standard Attributes: Evaluation status, timestamps, definitions

    • Diagnostic Metrics: Automatically extracts and calculates diagnostic metrics from check results

    • Custom Attributes: Flexible mappings for business context (see )

    • Batch Updates: Groups attribute operations for performance

    Relationship Management

    • Dimension Relations: Links checks to data quality dimensions

    • Table/Column Relations: Creates appropriate asset relationships

    • Error Recovery: Graceful handling of missing or ambiguous assets

    3. Ownership Synchronization

    • Collibra to Soda Sync: Automatically syncs dataset owners from Collibra to Soda

    • User Mapping: Maps Collibra users to Soda users by email address

    • Error Handling: Tracks missing users and synchronization failures

    • Metrics Tracking: Monitors successful ownership transfers

    4. Advanced Error Handling

    • Retry Logic: Exponential backoff for transient failures

    • Rate Limiting: Intelligent throttling to avoid API limits

    • Error Aggregation: Collects and reports all issues at the end

    • Graceful Degradation: Continues processing despite individual failures


    Head to Setup & configuration to learn how to integrate Collibra.

    Profiling

    Profiling provides a quick and comprehensive overview of a dataset’s structure and key statistics.

    Profiling helps you understand the shape, quality, and uniqueness of your data before creating checks or metric monitors.

    With profiling, you can explore metadata about your dataset, such as column names, data types, distinct counts, null counts, and summary statistics. You can also quickly search for specific columns to focus on the attributes that matter most to your analysis.

    Profiling is useful for:

    • Business teams: Gain a fast understanding of what’s inside a dataset, its completeness, and potential anomalies.

    • Data teams: Validate schema, data types, and distributions before writing quality tests or transformations.

    • Data owners: Quickly identify unexpected values, nulls, or structural changes in a dataset.

    Key features

    • Dataset overview: Displays a structured view of all columns, their types, and counts.

    • Interactive navigation: Scroll through the dataset structure or jump directly to a column of interest.

    • Search and filter: Quickly locate a column by name to review its profiling details.

    • Column-level insights:


    Enable & configure Profiling

    1. Enable Profiling

    You can enable Profiling during dataset onboarding.

    If you want to enable Profiling on an existing dataset, follow the next steps:

    1. Click on Datasets > The dataset of your choosing

    2. Navigate to the Columns tab in the dataset view

    3. Click on Update Profiling Configuration

    2. Configure Profiling

    Once Profiling has been enabled, you can configure it to adapt to your organization's needs.

    1. Choose a Profiling schedule

    Profiling happens every 24 hours. Choose a UTC time from the dropdown menu to pick a specific hour when the scan will be scheduled.

    2. Choose a Profiling strategy

      • Use sampling: To perform Profiling, Soda will use a sample of up to 1 million rows from the dataset.

      • Use a time window: To perform Profiling, Soda will use data present in a 30-day time window, based on the dataset time-partition column.

    The time-partition column is specified above the columns table, on the Columns tab of any given dataset.

    3. Click on Finish

    Now, Profiling will be scheduled.

    Disable Profiling

    Disable column profiling at the organization level

    If you wish to disable column profiling at the organization level, you must possess Admin privileges in your Soda Cloud account. Once confirmed, follow these steps:

    1. Navigate to your avatar.

    2. Click on Organization settings.

    3. Uncheck the box labeled Allow Soda to collect column profile information.


    How it works

    When you open Profiling for a dataset:

    1. Soda runs a lightweight scan of the dataset’s metadata and a sample of the data (depending on configuration).

    2. It calculates summary statistics for each column.

    3. Results are displayed in the Profiling view for exploration.

    Key considerations

    • Soda can only profile columns that contain NUMBERS or TEXT type data; it cannot profile columns that contain TIMESTAMP data except to create a freshness check for the anomaly dashboard.

    • Soda performs the Discover datasets and Profile datasets actions independently, relative to each other. If you define exclude or include rules in the Discover tab, the Profile configuration does not inherit the Discover rules. For example, if, for Discover, you exclude all datasets that begin with staging_, then configure Profile to include all datasets, Soda discovers and profiles all datasets.


    Next Steps

    After reviewing profiling results, you can:

    • Create tests based on profiling insights (e.g., "column should not have nulls").

    • Set up monitors to track data quality over time.

    • Export profiling information to support documentation and governance processes.

    Slack

    Configure Soda Cloud to connect your account to Slack so that you can:

    • Send Notifications for failed or warning check results to Slack channels

    • Start conversations to track and resolve data quality Incidents with Slack channels

    Configure a Slack integration

    Only users with the Manage Notification Rules permission can create or edit rules. All users can view rules. Read about Global and Dataset Roles

    1. In Soda Cloud, navigate to your avatar > Organization Settings, then navigate to the Integrations tab and click the + icon to add a new integration.

    2. Choose Slack and proceed.

    3. Follow the guided steps to authorize Soda Cloud to connect to your Slack workspace. If necessary, contact your organization’s Slack Administrator to approve the integration with Soda Cloud.

    Configuration tab: select the public channels to which Soda can post messages; Soda cannot post to private channels.

    Note that Soda caches the response from the Slack API, refreshing it hourly. If you created a new public channel in Slack to use for your integration with Soda, be aware that the new channel may not appear in the Configuration tab in Soda until the hourly Slack API refresh is complete.

    Scope tab: select the Soda features (alert notifications and/or incidents) that can access the Slack integration.

    About integration scopes

    Integration for alert notifications

    You can use this integration to enable Soda Cloud to send alert notifications to a Slack channel to notify your team of warn and fail check results.

    With such an integration, Soda Cloud enables users to select a Slack channel as the destination for an alert notification of an individual check or checks that form a part of an agreement, or multiple checks.

    To send notifications that apply to multiple checks, see Notifications.

    Integration for Soda Cloud incidents

    You can use this integration to notify your team when a new incident has been created in Soda Cloud. With such an integration, Soda Cloud displays an external link to an incident-specific Slack channel in the Incident Details.

    Scan time strategy

    Selecting the right scan time is essential for accurate data monitoring and reliable metric collection. Scans that occur too early may run before the data has been fully loaded into the database, leading to false positives or misleading results. This guide outlines how to determine the best scan time based on your data load patterns and operational needs.

    Scan frequency

    Scans can be scheduled to occur from hourly to weekly. The time jumps are meant to fit into a 24-hour cycle that matches hourly/daily seasonalities related to how humans organize their day. Metric Monitoring can happen every:

    Additional settings

    Test a contract on a sample

    Currently, this feature is only supported in Snowflake data sources.

    When testing a data contract, Soda allows you to run contract validation on a sample of your dataset instead of the full data. This feature helps you quickly and cost-efficiently verify that your contract runs correctly before executing full scans.

    More info about Soda's private container registry

    What has changed?

    As of July 2025, the container images required for the self-hosted Soda agent will be distributed using private registries, hosted by Soda.

    EU cloud customers will use the EU registry located at registry.cloud.soda.io. US cloud customers will use the US registry located at registry.us.soda.io.

    The images currently distributed through Docker Hub will stay available there. New releases will only be available in the Soda-hosted registries.

    Existing or new Soda cloud API keys can be used to authenticate to the Soda-hosted registries. Starting from version

    Deployment options

    Soda offers flexible deployment options to suit your team’s infrastructure, scale, and security needs. Whether you want to embed Soda directly into your pipelines, use a centrally managed deployment, or rely on Soda’s fully-hosted solution, there’s an option for you.

    This guide provides an overview of the three main deployment options: Soda Python Libraries, Soda-hosted Soda Agent, and Self-hosted Soda Agent, to help you choose the right setup for your organization.


    Overview of Deployment Options

    Deployment Model

    Custom monitors

    Learn more about how to use your own SQL queries to build custom SQL Metric Monitors.

    Custom SQL Monitors are available for Enterprise users and can be configured via both the Soda Cloud UI and the Monitoring Configuration API.

    Custom SQL Monitors enable you to define monitoring logic using your own SQL queries. This is ideal when built-in Soda metrics or anomaly checks don’t meet your needs; for example, when you must aggregate data across multiple tables, compute ratios, or detect anomalies in grouped datasets.

    A Custom SQL Monitor runs your SQL query against your connected data source and evaluates its results on a schedule, just like any other Soda monitor.

    This feature can be used to:

    BigQuery

    Access configuration details to connect Soda to a Google Cloud BigQuery data source.

    Connection configuration reference

    Install the following package:
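
    For example, assuming the BigQuery connector follows the same package naming pattern as the other data sources on this site (soda-athena, soda-postgres); verify the exact package name for your setup:

    pip install -i https://pypi.dev.sodadata.io/simple -U soda-bigquery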

    Data source YAML

    soda contract test --data-source ds.yml --contract contract.yaml
    soda contract publish --contract contract.yaml --soda-cloud sc.yml
    soda:
      # These values will also be used to authenticate to the Soda image registry
      apikey:
        id: existing-key-id
        secret: existing-key-secret
    soda:
      apikey:
        id: existing-key-id
        secret: existing-key-secret
    imageCredentials:
      apikey:
        id: my-new-key-id
        secret: my-new-key-secret
    soda:
      apikey:
        id: ***
        secret: ***
        
    # This is no longer supported
    # imagePullSecrets
    #   - name: my-existing-secret
    
    # Instead, use this!
    existingImagePullSecrets:
      - name: my-existing-secret
    soda:
      apikey:
        id: ***
        secret: ***
      cloud:
        # This also sets the correct endpoint under the covers.
        region: "us"
        
        # This can be removed now, as the region property sets this up correctly. 
        # endpoint: https://cloud.us.soda.io
    soda:
      apikey:
        id: ***
        secret: ***
      # Rename this ...
      # scanlauncher:
      # to become
      scanLauncher:
        existingSecrets:
          - soda-agent-secrets 
    soda cloud create -f sc.yml
    soda cloud test -sc sc.yml
    soda contract verify --dataset datasource/db/schema/table --use-agent --soda-cloud sc.yml
    soda contract verify --dataset datasource/db/schema/table --publish --use-agent --soda-cloud sc.yml
    soda data-source test -ds ds.yml
    soda cloud create -f sc.yml
    soda cloud test -sc sc.yml
    pip install -i https://pypi.dev.sodadata.io "soda-postgres>=4.0.0.dev1" -U
    soda data-source create -f ds.yml
    type: postgres
    name: postgres
    connection:
      host: host_name
      port: 5432
      user: user_name
      password: ${env.SODA_DEMO_POSTGRES_PW}
      database: db_name
    soda data-source test -ds ds.yml
    soda contract verify --data-source ds.yml --contract contract.yaml
    soda contract verify --data-source ds.yml --contract contract.yaml --set START_DATE=2024-05-01
    soda contract verify --data-source ds.yml --contract contract.yaml --publish --soda-cloud sc.yml
    soda contract verify --contract contract.yaml --use-agent --soda-cloud sc.yml
    soda contract verify --contract contract.yaml --use-agent --soda-cloud sc.yml --set START_DATE=2024-05-01
    soda contract verify --dataset datasource/db/schema/table --publish --use-agent --soda-cloud sc.yml

    Collaborate with non-technical users that use Soda Cloud and integrate with engineering workflows via Git



    Global and Dataset Roles
    Notifications


    Groups table
  • Assess the impact of data quality issues: easily determine how many rows in your dataset are affected by data quality problems.
  • Prioritize what matters: use the Record-level Drift Score consistently across datasets and data sources to rank and focus on the most critical data quality issues.

  • Reduce false alerts: traditional column-level monitoring increases the risk of false positives with every additional monitor. With RAD, you only need one anomaly detection monitor per dataset, minimizing noise.

  • Optimize compute usage: monitoring a single metric per dataset lowers computational overhead. Additionally, RAD can work with sampled data, further reducing processing demands.

  • Built‑in root cause analysis: quickly understand what changed and why.

  • Native support for backfilling and back‑testing: automatically generate and assess historical Record-level Drift Scores to review past data quality trends.

  • Lastly, checks: use checks for critical tables where expectations are clearly defined. For example


    dataset‑level metadata metrics
    column monitors
    partitions data by time
    Diagnostics Warehouse
    [email protected]


    RAD monitor

    Setup & configuration
    Custom Attribute Syncing
    Setup & configuration
    Rulebook-level collection of Soda checks synced and mapped into Collibra.


    # Run the integration with default settings
    python main.py
    
    # Run with debug logging for troubleshooting
    python main.py --debug
    
    # Use a custom configuration file
    python main.py --config custom.yaml
    
    # Show help and all available options
    python main.py --help
    # Run legacy Soda client tests
    python main.py --test-soda
    
    # Run legacy Collibra client tests
    python main.py --test-collibra
    
    # Run with verbose logging (info level)
    python main.py --verbose
  • Statistics

    • Column name

    • Column data type

    • Number of distinct values

    • Number of missing (null) values

    • Minimum, maximum, mean (for numeric columns)

    • Length, patterns, or categories (for text columns)

  • Histogram for numeric columns

  • Frequent values

  • Extreme values, for numeric columns

  • Data checks that exist for this column

  • Toggle on Enable Profiling

    dataset onboarding
    In the Bus Breakdown and Delays dataset, the time-partition column is Last_Updated_On


    • 1 h

    • 2 h

    • 3 h

    • 4 h

    • 6 h

    • 8 h

    • 12 h

    • 1 day

    • 1 week

    Key considerations

    Data load completion time

    When is the database load expected to be complete?

    • Determine when the relevant tables or datasets are expected to be fully loaded.

    • Factor in common variances: if a load is expected to complete by 00:00 UTC but occasionally finishes at 00:10 UTC, account for the expected, albeit sporadic, delay.

    Knowing this helps avoid scanning too early and capturing incomplete data.

    Acceptable load delay tolerance

    When is a delayed load considered late or "problematic"?

    • If data arriving by 02:30 UTC is still valid for monitoring purposes, it may be better to delay the scan to reduce false alerts.

    • Scanning immediately after the earliest expected load time is not always necessary.

    Understanding what qualifies as "late data" helps define the tolerance window for scan timing.

    Response window & team availability

    How fast after the load can someone respond to issues flagged by monitors?

    • If nobody can take action until 09:00 UTC, scanning earlier may not be useful unless scans feed downstream processes or dashboards.

    Choose a scan time that aligns with both data readiness and team readiness.

    Consistency is key

    Running scans at the same time every day allows you to build up a reliable baseline of expected behavior. This helps surface anomalies clearly when something deviates from the norm.


    Example scenario

    • Scan frequency: daily

    • Expected load completion: 00:00 UTC

    • Occasional load delay: up to 00:10 UTC

    • Team available from: 08:00 UTC

    Scan options

    Strategy              | Scan time | Rationale
    Minimal buffer        | 00:15 UTC | Captures data soon after load with minor delay tolerance.
    Conservative buffer   | 01:30 UTC | Allows extra time for delayed loads, reduces risk of false positives.
    Operationally aligned | 07:30 UTC | Ensures scan results are fresh and complete when the team starts reviewing.

    Scan scheduling at scale

    When scanning large volumes of tables:

    • It is acceptable to configure scans for the same scheduled time (e.g. 00:00 UTC).

    • Scans that are scheduled in large volumes (thousands of tables) may be configured to run at the same logical time, but the system naturally distributes execution based on queuing and available resources, so the actual execution will be staggered.


    Historical scans

    • Historical metric collection scans (for metric baseline backfilling) run only once at configuration time.

    • These scans are not governed by the scan schedule. They occur once and they are typically the most resource-intensive.


    Best practices

    • Consistency is key: Using the same scan time daily establishes a stable baseline for anomaly detection.

    • Early scans should be avoided: Scheduling scans before the last acceptable load time is not recommended unless business needs require it.

    • Time zones should be centralized: Aligning scan time with the database time zone is ideal, especially when your time partitioning column is based on the insert/load time in that time zone.

    • Monitoring and adjusting: If load patterns or SLAs change, scan times should be revisited and adjusted accordingly.




    Running a test contract on a sample enables you to:

    • Validate that your contract syntax, checks, and filters work as expected.

    • Reduce data warehouse compute cost while verifying new or updated contracts.

    • Iterate faster on contract definitions in development environments.

    Results from sampled runs reflect only a subset of your data and may not represent its actual quality. Use full verification once your contract logic is validated.

    Enable sampling for test contracts

    This feature can be enabled at the data source level, applying to all datasets that use that connection.

    You need the "Manage data sources" global permission to add a new data source. Learn more about Global and Dataset Roles.

    To enable this feature:

    1. Go to Data sources.

    2. Click Edit connection for a data source.

    3. Under the Connection Details section, toggle Data Sampling.

    4. Specify your sample size in the Limit field.

    5. Click Connect.


    Optimize computing with multiple warehouses

    Currently available in preview. This feature is only supported in Snowflake data sources.

    When connecting to Snowflake, you must provide a warehouse as part of the data source configuration. By default, this single warehouse is used for all operations, including discovery, metric monitoring, profiling, data contract executions, and the diagnostics warehouse.

    The Configure warehouses per dataset feature gives you greater control and flexibility by allowing you to define specific warehouses for individual datasets. This helps you optimize cost, manage compute workloads, and allocate resources efficiently across your data operations.

    This feature is available only when using Soda Agent. When using Soda Core, the warehouse can be specified directly in the connection YAML instead.

    Enable the use of multiple warehouses

    You need the “Manage data sources” global permission to enable or modify this feature. Learn more about Global and Dataset Roles.

    1. Go to Data sources in Soda Cloud.

    2. Click Edit connection for your Snowflake data source.

    3. Toggle on Configure Warehouses.

    4. Specify the list of allowed warehouses that can be used by this connection.

    5. Choose a default warehouse to use for all datasets unless otherwise specified.

    6. Click Save on the top right to save your configuration.

    Default warehouse behavior

    Once enabled:

    • The warehouse specified in the data source connection is used for discovery.

    • The default warehouse (defined under Configure Warehouses) is used for:

      • Metric monitoring

      • Profiling

      • Data contract executions

      • Diagnostics Warehouse operations

    • A different warehouse can be configured at the dataset level, overriding the default.

    Specify a warehouse at the dataset level

    You need the “Configure dataset” permission to edit dataset-level configurations. Learn more about Global and Dataset Roles.

    1. Go to a dataset in Soda Cloud.

    2. Click Edit dataset.

    3. Under the Snowflake section, select the warehouse to use for this dataset.

    4. Click Save to apply your changes.

    Since version 1.2.0, the soda-agent Helm chart supports working with Soda-hosted image registries.

    In order to enjoy the latest features Soda has to offer, please upgrade any self-hosted Soda agent you manage using one of the following guides.

    How-to's

    Registry access using your existing API key

    Follow the self-hosted Soda agent upgrade or redeployment guides. Don't execute the final helm install or helm upgrade step yet.

    Ensure you retrieve the soda.apikey.id and soda.apikey.secret values first, by using helm get values -n <namespace> <release_name> .

    Now pass these values back to the upgrade command via the CLI

    helm upgrade <release> soda-agent/soda-agent \
     --set soda.apikey.id=*** \
     --set soda.apikey.secret=****

    or by using a values file:

    > cat values-local.yaml
    soda:
      apikey:
        id: ***
        secret: ***
    > helm upgrade soda-agent soda-agent/soda-agent \
    --values values-local.yaml --namespace soda-agent

    Registry access using a separate API key

    Ensure you have a new API key id and secret by following the API key creation guide .

    Follow the self-hosted Soda agent upgrade or redeployment guides. Don't execute the final helm install or helm upgrade step yet.

    Now pass the API keys to use for registry access in the upgrade command via the CLI, using the imageCredentials.apikey.id and imageCredentials.apikey.secret properties. Note that we're also still passing the soda.apikey.id and soda.apikey.secret values, which are still required for the agent to authenticate to Soda Cloud.

    helm upgrade <release> soda-agent/soda-agent \
     --set soda.apikey.id=*** \
     --set soda.apikey.secret=**** \
     --set imageCredentials.apikey.id=*** \
     --set imageCredentials.apikey.secret=***

    Or when using a values file:

    > cat values-local.yaml
    soda:
      apikey:
        id: ***
        secret: ***
    imageCredentials:
      apikey:
        id: ***
        secret: ***
    > helm upgrade soda-agent soda-agent/soda-agent \
    --values values-local.yaml --namespace soda-agent

    Using existing (external) secrets

    You can also use a self-managed, existing secret to authenticate to the Soda-hosted or your own self-hosted private container registry, e.g. when mirroring container images.

    You can refer to existing secrets as follows for the CLI:

    helm upgrade <release> soda-agent/soda-agent \
     --set soda.apikey.id=*** \
     --set soda.apikey.secret=**** \
     --set existingImagePullSecrets[0].name=my-existing-secret  # Mind the array and indexing syntax!

    Or using a values file:

    > cat values-local.yaml
    soda:
      apikey:
        id: ***
        secret: ***
    existingImagePullSecrets:
      - name: my-existing-secret
    > helm upgrade soda-agent soda-agent/soda-agent \
    --values values-local.yaml --namespace soda-agent

    Using the US image registry

    When you're onboarded on the US region of Soda Cloud, you'll have to use the container registry associated with that region.

    You can alter the soda.cloud.region value to automatically render the correct container registry and Soda Cloud API endpoint. Simply follow any of the above instructions and include the soda.cloud.region value.

    To do so in the CLI:

    helm upgrade <release> soda-agent/soda-agent \
     --set soda.apikey.id=*** \
     --set soda.apikey.secret=**** \
     --set soda.cloud.region=us

    Or using a values file:

    > cat values-local.yaml
    soda:
      apikey:
        id: ***
        secret: ***
      cloud:
        region: "us"
    > helm upgrade soda-agent soda-agent/soda-agent \
    --values values-local.yaml --namespace soda-agent

    FAQ

    Mirroring images

    If you want to mirror the Soda images into your own registries, you'll need to log in to the appropriate container registry. This allows you to pull the images into your custom container image registry.

    # For Soda Cloud customers in the EU region
    docker login registry.cloud.soda.io -u <APIKEY_ID> -p <APIKEY_SECRET>

    # For Soda Cloud customers in the US region
    docker login registry.us.soda.io -u <APIKEY_ID> -p <APIKEY_SECRET>

    The following values.yaml file illustrates the changes required for the Helm release to work with mirrored images:

    existingImagePullSecrets:
      - name: my-existing-secret
    soda:
      apikey:
        id: ***
        secret: ***
      agent:
        image:
          repository: custom.registry.org/sodadata/agent-orchestrator
      scanLauncher:
        image:
          repository: custom.registry.org/sodadata/soda-scan-launcher
      contractLauncher:
        image:
          repository: custom.registry.org/sodadata/soda-contract-launcher
      hooks:
        image:
          repository: custom.registry.org/sodadata/soda-agent-utils

    Do I have to upgrade? What if we can't do that right away?

    Your existing Soda agent deployments will continue to function.

    This does mean that your self-hosted agent will not be able to support features like collaborative data contracts and the fully revamped metric monitoring.

    The images hosted on Dockerhub, required to run the self-hosted agent, will remain there in their current state for a grace period of 6 months. There will be no more maintenance (updates, bug fixes, security patches) for the old self-hosted agent versions.



    Description
    Ideal For
    Key Features
    Considerations
    Plans

    Soda-hosted Soda Agent

    Fully-managed Soda Agent, hosted by Soda.

    Teams seeking a simple, managed solution for data quality.

    • Centralized data source access

    • No setup required

    • Observability features enabled

    Enables users to create, test, execute, and schedule contracts and checks directly from the Soda Cloud UI.

    Required for observability features. Cannot scan in-memory sources like Spark or DataFrames.

    Available for Free, Team and Enterprise Plans.

    Self-hosted Soda Agent

    Same as Soda-hosted Soda Agent, but deployed and managed in your own Kubernetes environment.

    Teams needing full control over infrastructure and deployment.

    Similar to Soda-hosted Agent, but deployed within the customer’s environment; data stays within your network.

    • Full control over deployment

    • Integration with secrets managers

    • Customization to meet your organization’s specific requirements

    Required for observability features. Cannot scan in-memory sources like Spark or DataFrames. Kubernetes expertise required.

    Available for Enterprise Plan. Contact us: https://www.soda.io/contact

    Soda Python Libraries

    Open-source Python library (with commercial extensions) for programmatic configuration and enforcement of data contracts in your pipelines.

    Data engineers integrating Soda into custom workflows.

    • Full control over orchestration

    • In-memory data support

    • Contract verification

    No observability features. Required for in-memory sources (e.g., Spark, DataFrames). Data source connections managed at the environment level.

    Open Source. Available for Free, Team and Enterprise Plans.

    Deployment Options in Detail

    Soda Agent

    Soda-hosted

    Soda-hosted Soda Agent is a fully-managed deployment of the Soda Agent, hosted by Soda in our infrastructure. It allows you to connect to your data sources and manage data quality directly from the Soda Cloud UI without any infrastructure setup on your end. You need only whitelist the IP address of the Soda-hosted agent so that it can connect to your data.

    Key points:

    • No setup or management required. Soda handles deployment and scaling.

    • Data source connections are centralized in Soda Cloud, and users can leverage the Soda Agent to execute scans across those data sources.

    • Enable observability features in Soda Cloud, such as profiling, metric monitoring, and anomaly detection.

    • Enables users to create, test, execute, and schedule contracts and checks directly from the Soda Cloud UI.


    Onboard your datasets in Soda Cloud with Soda-hosted agent: Onboard datasets on Soda Cloud


    Self-hosted Agent

    The Self-hosted Agent offers the same capabilities as the Soda-hosted Agent, but it is deployed and managed by your team within your own Kubernetes environment (e.g., AWS, GCP, Azure). This model provides full control over deployment, infrastructure, and security, while enabling the same centralized data source access and Soda Cloud integration for scans, contract execution, and observability features.

    Learn how to deploy the Self-hosted Soda Agent: Deploy Soda Agent.

    Onboard your datasets in Soda Cloud with self-hosted agent: Onboard datasets on Soda Cloud.

    Soda Python Libraries

    Soda Core is an open-source Python library and CLI that allows you to embed Soda directly in your data pipelines. You can orchestrate scans using your preferred orchestration tools or pipelines, and execute them within your own infrastructure. Additional commercial extensions are available via extension packages, such as soda-groupby, soda-reconciliation, etc.

    See detailed installation instructions here: Install Soda Python Libraries

    Key points:

    • Ideal for teams who want full control over scan orchestration and execution.

    • Data source connections are configured and managed at the environment level.

    • Required for working with in-memory data sources like Spark and Pandas DataFrames.



  • Define metrics using custom aggregations or joins.

  • Compute grouped results (e.g., GROUP BY customer, institution, or region).

  • Apply filters, CTEs, and where clauses to narrow down data.

  • Integrate results with notification rules to alert your team when certain conditions are met.

  • How to create a custom SQL monitor

    Example scenario: an organization needs to monitor daily incidents per borough and reason in their Bus Breakdowns and Delays dataset, and flag unusual spikes/drops via notification rules.

    The goal is to know which boroughs are the ones suffering the most incidents and why that's happening.

    Prerequisites

    • Enterprise plan.

    • A dataset connected in Soda Cloud.

    • An API token with permission to author monitors (if creating monitors via the Column Monitoring Configuration API).

    In the Soda Cloud UI

    1. Navigate to Datasets → Custom Monitors at the bottom of the page. Click on Add Column Monitor.

    2. Name your custom monitor and provide the custom SQL query. In this case, we are monitoring incident count by borough and reason.

    3. Provide a Result metric and a Valid range, and define a Threshold strategy. In this case, the result metric is incident_count; we want to group by Boro and Reason, and the valid range cannot be negative, so the minimum value is 0. Both Upper range and Lower range anomaly detection are enabled to catch unusual spikes/drops per group.

    4. Click on Add Monitor on the top right.

    The monitor will now be visible at the bottom of the Metric Monitoring dashboard.

    This monitor will:

    • Run daily and compute incident_count for every (Boro, Reason) pair within the partitioned time window.

    • Store grouped results so you can see which areas and causes are trending.

    • Trigger notifications (based on your organization’s notification rule) when anomaly detection flags a group.

    In the Column Monitoring Configuration API

    Coming soon


    Supported variables

    List of all the variables currently supported using ${soda.<variable>} syntax:

    • SCAN_TIME: time for which the scan is running; has the same value as PARTITION_END_TIME (note this is different from when the scan is running)

    • PARTITION_COLUMN: column used to perform time-based partitioning

    • PARTITION_START_TIME: start time for the partition time window

    • PARTITION_END_TIME: end time for the partition time window

    • PARTITION_INTERVAL: duration of the partition time window

    • TABLE: qualified name of the table being analyzed, e.g. "my-schema"."my-table"
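
    For illustration, a grouped query like the one in the example scenario above could reference these variables as follows. This is only a sketch: the table and column names are placeholders, and whether the time variables need quoting or casting depends on your data source.

    SELECT
      "Boro",
      "Reason",
      COUNT(*) AS incident_count
    FROM ${soda.TABLE}
    WHERE ${soda.PARTITION_COLUMN} >= ${soda.PARTITION_START_TIME}
      AND ${soda.PARTITION_COLUMN} < ${soda.PARTITION_END_TIME}
    GROUP BY "Boro", "Reason"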


    BigQuery

    Install the soda-bigquery package:

    pip install -i https://pypi.dev.sodadata.io/simple -U soda-bigquery

    Example data source configuration:

    type: bigquery
    name: my_bigquery
    connection:
      account_info_json: '{
        "type": "service_account",
        "project_id": "dbt-quickstart-44203",
        "private_key_id": "fe0a60e9cb7d4369f73f7b5691ce397d1e",
        "private_key": "-----BEGIN PRIVATE KEY-----<insert-private-key>-----END PRIVATE KEY-----\n",
        "client_email": "[email protected]",
        "client_id": "114963712293161062",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/dbt-user%40dbt-quickstart-44803.iam.gserviceaccount.com",
        "universe_domain": "googleapis.com"
      }' # example service account JSON string, exported from BQ; SEE NOTE
      dataset: ${env.BQ_DATASET_NAME}
      # optional
      account_info_json_path: /path/to/service-account.json  # SEE NOTE
      auth_scopes:
        - https://www.googleapis.com/auth/bigquery
        - https://www.googleapis.com/auth/cloud-platform
        - https://www.googleapis.com/auth/drive
      project_id: ${env.BQ_PROJECT_ID}  # Defaults to the one embedded in the account JSON
      storage_project_id: ${env.BQ_STORAGE_PROJECT_ID}
      location: ${env.BQ_LOCATION}  # Defaults to the specified project's location
      client_options: <options-dict-for-bq-client>
      labels: <labels-dict-for-bq-client>
      impersonation_account: <name-of-impersonation-account>
      delegates: <list-of-delegates-names>
      use_context_auth: false     # whether to use Application Default Credentials

    Note: Set use_context_auth=True to use application default credentials, in which case account_info_json or account_info_json_path are not necessary.

    Note: Google uses the term "dataset" differently than Soda:

    • In the context of Soda, a dataset is a representation of a tabular data structure with rows and columns, such as a table, view, or data frame.

    • In the context of BigQuery, a dataset is “a top-level container that is used to organize and control access to your tables and views. A table or view must belong to a dataset…”

    Instances of "dataset" in Soda documentation always reference the former.

    • See Google BigQuery Integration parameters

    • See BigQuery's locations documentation to learn more about location.

    Connection test

    Test the data source connection:

    soda data-source test -ds ds.yml



    Column monitors

    Learn more about Metric Monitors that run scans at a column level.

    What is a Column Monitor?

    A column monitor in Soda tracks a specific statistical metric for a given column over time. It helps detect unusual patterns or unexpected changes in column behavior, such as spikes in missing values or shifts in averages.

    You can find column monitors by opening the Metric Monitors tab on any dataset and scrolling to the bottom of the page. This section lists all active column monitors in a structured, searchable view. The list can be sorted by recency or by the number of detected anomalies, allowing you to quickly focus on the most relevant issues.

    Unlike dataset-level monitors, which can be applied at the data source level, column monitors are configured at the dataset level and are tailored to specific use cases. It is recommended to add column monitors only to columns where changes are likely to reflect actual data quality issues. Adding too many monitors may increase false positives and create unnecessary noise.

    For column monitors to work, a time partition column must be defined. Soda uses this column to divide the data into time-based partitions, typically by day, and calculates the selected metrics within each partition. The column must be a timestamp and should reflect when records arrive in the database to ensure accurate and meaningful results.

    For each dataset, you’ll see a scrollable list that includes:

    • Result of the anomaly detection: Anomaly, Expected or Unknown (not evaluated yet)

    • Column name

    • Metric name (e.g. Missing values percentage, Average)

    • Column being tracked

    • Latest value

    • Trend sparkline

    At the bottom of the list you can load more monitors, and every monitor can be deleted or configured with opt-in notifications.

    Types of column monitors

    Data type | Metric | Description
    All data types | Count | Detects anomalies in the number of non-missing (non-NULL) values in a column.
    All data types | Duplicate percentage | Detects anomalies in the percentage of duplicate values in a column.
    All data types | Maximum value | Detects anomalies in the maximum (highest) value in a column.
    All data types | Minimum value | Detects anomalies in the minimum (lowest) value in a column.
    All data types | Missing values percentage | Detects anomalies in the percentage of missing (NULL) values in a column.
    All data types | Unique count | Detects anomalies in the number of distinct (unique) values in a column.
    Timestamp | Most recent timestamp | Detects anomalies in the most recent (latest) timestamp value in a column.
    Numeric | Average | Detects anomalies in the average (mean) value of a column.
    Numeric | Standard deviation | Detects anomalies in the standard deviation of values in a column.
    Numeric | Sum | Detects anomalies in the total (sum) of values in a column.
    Numeric | Variance | Detects anomalies in the variance (spread) of values in a column.
    Numeric | Q1 | Detects anomalies in the 25th percentile (first quartile) value of a column.
    Numeric | Median | Detects anomalies in the 50th percentile (median, Q2) value of a column.
    Numeric | Q3 | Detects anomalies in the 75th percentile (third quartile) value of a column.
    Text | Average length | Detects anomalies in the average character length of text values.
    Text | Maximum length | Detects anomalies in the longest character length of text values.
    Text | Minimum length | Detects anomalies in the shortest character length of text values.

    More metrics and monitors will be released in the future.

    Add Column Monitors

    Column monitors can be added one by one or in bulk. When multiple columns are selected only metrics that are applicable to all columns will be shown.

    1. Open the column monitor wizard

    • In the Metric Monitors dashboard, click Add Column Monitors.

    2. Select columns

    • Search or scroll your table’s columns.

    • Check one or many boxes to select columns in bulk.

    Column monitors are typed: metrics will appear as long as the necessary data type is available. For example, if a column type is str (text based), it will not be possible to enable numeric metrics.

    3. Pick metrics

    For all column metrics:

    • Data is not sampled.

    • Missing values are ignored (except in Missing values percentage).

    • Select the metrics of interest.

    • Search or expand metrics for further configuration:

      • Valid Range: define MIN and MAX values the metric can take (defaults to –∞/∞ or 0–∞ for time-based metrics).

      • Threshold Strategy: choose whether to alert on the Upper range, the Lower range, or both.

      • Exclusion Values: specify literal values or ranges to ignore when marking anomalies.

    4. Add monitors

    • Once you’ve selected your columns and toggled the desired metrics on, click Add Monitors.

    • Empty monitors will be added to the list.

    • At the top of the page, you will be prompted to run a Historical Metric Collection Scan.

    Tip: add all your column monitors first, then run the historical scan in one go. This will save time and computing costs, and ensures every monitor shares the same look-back window.

    Configure and fine-tune Column Monitors

    Column Monitors can be configured when setting them up and while they're in production. To fine-tune the monitor to your specific needs, go to the page for each specific metric.

    Learn more about How to fine-tune Metric Monitoring.

    User management

    The Users and User Groups sections in Soda Cloud settings allow you to control access to your organization by managing individual users and user groups. This ensures that team members have the appropriate permissions to use Soda Cloud effectively.

    With SSO enabled, users and groups can be synced directly from your identity provider, reducing manual effort and ensuring alignment with your organization’s access policies. Learn more about SSO configuration in User and user group management with SSO.

    Invite Users

    The Invite Users feature is only available when SSO is not enabled.

    To invite users manually:

    1. Go to the Users tab in Settings.

    2. Click the + icon at the top of the user list.

    3. Enter the email addresses of the users you want to invite.

    Invited users will receive an email with a link to set their password and join your organization in Soda Cloud. Once they complete the setup, they will have access to Soda Cloud based on the roles and permissions you assign.

    Deactivate Users

    Deactivating a user blocks their access to Soda Cloud and disables any existing API keys associated with their account. This is useful when a user no longer needs access, but you want to retain their account for record-keeping or future reactivation.

    To deactivate a user:

    1. Go to the Users tab in Settings.

    2. Find the user you want to deactivate.

    3. Click the context menu for the user and select Deactivate From This Organization.

    You can reactivate a user later if they need access again.

    Assign User to Groups

    Assigning users to groups allows you to manage access and permissions more efficiently by applying global roles to groups rather than individual users.

    To assign users to groups:

    1. Go to the Users tab in Settings.

    2. Find the user you want to assign to a group.

    3. Click the context menu next to their name and select Edit User Groups.

    4. Select one or more user groups from the list.

    5. Click Save.

    Assign Global Roles to Users

    Global roles define a user’s permissions across Soda Cloud. Assigning global roles directly to users allows you to grant them specific access rights, such as managing datasets, running scans, or configuring organization settings.

    To assign a global role to a user:

    1. Go to the Users tab in Settings.

    2. Find the user you want to assign a role to.

    3. Click the context menu for the user and select Assign Global Roles.

    4. Choose one or more global roles from the list.

    5. Click Save.

    User Groups

    User groups allow you to manage access and permissions for multiple users at once, helping you simplify and scale permission management across your organization. By assigning global roles to groups, you ensure that all members of the group have consistent access rights, without the need to assign permissions individually.

    If you have Single Sign-On (SSO) enabled, user groups can also be synced automatically from your identity provider, ensuring your Soda Cloud user management aligns with your existing access policies. Learn more about SSO integration.

    Note that by default, there is an Everyone group which is not editable and contains all the users in the organization.

    Create User Groups

    You can manually create user groups in Soda Cloud, whether you’re importing user groups from SSO or not.

    To create a user group:

    1. Go to the User Groups tab in Settings.

    2. Click Create User Group at the top of the user group list.

    3. Enter a name for the group.

    4. (Optional) Add users to the group immediately, or add them later.

    Edit Group Members

    You can edit the members of user groups that you have created on Soda Cloud. SSO-managed user groups cannot be edited.

    To edit a user group:

    1. Go to the User Groups tab in Settings.

    2. Find the group you want to modify.

    3. Click the context menu next to the group and select Edit Members.

    4. Select the users that should be in the user group and click Save.

    View Group Members

    You can view the list of users in a group to understand who has access through that group and to help manage permissions across your organization.

    To view the members of a group:

    1. Go to the User Groups tab in Settings.

    2. Click the group name or the row to open its details.

    3. The list of users assigned to the group will be displayed.

    This view helps you track group membership and verify that the correct users have the appropriate access.

    Assign Global Roles to User Group

    Global roles define a user’s permissions across Soda Cloud. Assigning global roles directly to user groups allows you to grant them specific access rights, such as managing datasets, running scans, or configuring organization settings.

    To assign a global role to a user group:

    1. Go to the User Groups tab in Settings.

    2. Find the user group you want to assign a role to.

    3. Click the context menu for the user group and select Assign Global Roles.

    4. Choose one or more global roles from the list.

    5. Click Save.

    Data reconciliation

    This page describes how Soda handles data reconciliation through different types of reconciliation checks.

    Available on the 15th of September 2025

    Reconciliation checks are a validation step used to ensure that data remains consistent and accurate when moving, transforming, or syncing between different systems. The core purpose is to confirm that the target data matches the source data, whether that’s during a one-time migration, a recurring data pipeline run, or ongoing synchronization across environments.

    For instance, if you are migrating from a MySQL database to Snowflake, reconciliation checks can verify that the data transferred into Snowflake staging is intact and reliable before promoting it to production. This minimizes the risk of data loss, duplication, or corruption during critical migrations.

    Beyond migrations, reconciliation checks are also used in data pipelines and integrations. They help validate that transformations applied in-flight do not compromise accuracy, and that downstream datasets remain coherent with upstream sources.

    Other use cases include regulatory compliance, where organizations must prove that financial or operational data has been faithfully replicated across systems, and system upgrades, where schema changes or infrastructure shifts can introduce unexpected mismatches.

    By systematically applying reconciliation checks, teams can maintain trust in their data, reduce operational risk, and streamline incident detection when anomalies arise.

    Defining source dataset

    Before defining reconciliation checks, you first specify the source dataset. This represents the system of record against which you want to validate consistency. It is possible to define a filter on the source dataset, allowing you to reconcile only a subset of records that match certain criteria (for example, only transactions from the current month, or only rows belonging to a specific business unit).

    For the target dataset, the reconciliation check applies the dataset filter defined at the top of the contract.

    Ensure that both source and target are constrained to the same logical scope before comparisons are made, keeping the validation consistent and relevant.

    Metric-Level Reconciliation

    At this level, aggregate metrics from the source and target datasets are compared. Examples include totals (e.g., revenue, number of rows), averages, or other summary statistics. This approach is efficient and provides a high-level signal that the data remains consistent. It is especially useful for large-scale migrations or pipelines where exact row-by-row comparison may not be necessary at all times.

    Thresholds

    Comparisons at the metric level are evaluated against a defined threshold, which represents the acceptable difference between source and target. This tolerance can be set depending on the business context. Some use cases may allow small discrepancies (e.g., rounding differences), while others require exact equality.

    When comparing integrity checks such as missing values, duplicates, or invalid entries, you can reconcile either by looking at the raw count of affected records or by comparing the percentage metric (e.g., the percentage of rows with missing values in each dataset). This flexibility ensures that reconciliation is meaningful regardless of dataset size or distribution.
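
    As a simple illustration (with made-up numbers): if 1.2% of rows are missing a value in the source and 1.3% are missing it in the target, comparing the percentage metric shows a 0.1 percentage-point difference regardless of how many rows each system holds, whereas comparing raw counts would instead flag the absolute number of affected records. Which of the two is more meaningful depends on your tolerance for discrepancies.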

    Check-level filter

    In addition to dataset-level filters, reconciliation checks support check-level filters, which are applied consistently to both the source and target within the scope of a specific check. These filters make it possible to validate a subset of the data relevant to the context of the check. The check-level filter is applied on top of any existing source or target dataset filters.

    Row-level reconciliation

    For more granular validation, reconciliation can be performed at the row level. This type of check surfaces detailed differences such as missing records, mismatched values, or unexpected duplicates. Row-level reconciliation is critical in scenarios where accuracy at the record level is non-negotiable, such as records involving financial transactions, user data, or regulatory reporting.

    This requires specifying a primary key (or a composite key) to uniquely identify rows between the source and the target. Once rows are aligned, you can define a list of columns to test for exact matches or acceptable tolerances. If no column list is provided, the check defaults to comparing all columns in order. This flexibility ensures that comparisons can range from broad validation across the entire dataset to focused checks on only the most critical attributes.

    Thresholds

    Row-level reconciliation supports thresholds expressed either as the count of differing rows between source and target, or as the percentage of differing rows relative to the source dataset row count. These thresholds determine the acceptable level of variance before the check is considered failed, giving you fine control over sensitivity and tolerance.

    This dual approach allows teams to adapt reconciliation logic to different contexts, using absolute counts when every record matters, and percentages when evaluating proportional differences in large datasets.
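
    As a worked example (with made-up numbers): if the source contains 2,000,000 rows and the row-level diff finds 1,000 rows that are missing, changed, or duplicated in the target, that difference can be expressed either as an absolute count of 1,000 rows or as 0.05% of the source row count. A percentage threshold of 0.1% would then pass, while an absolute threshold of 500 rows would fail.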

    Check-level filter

    As with metric-level checks, you can define a check-level filter that is applied on top of any existing dataset filters. This allows you to reconcile only a targeted segment of data within the context of the specific check—for example, testing only a single business unit, product family, or date range.

    Performance considerations

    Row-level reconciliation is inherently heavier than metric-level reconciliation, as it requires comparing records across potentially large datasets. To enable comparisons even when data lives in different systems, data is loaded into memory from both the source and the target, where the diff is executed. A paginated approach is used to maintain scalability; this ensures that memory usage remains stable, but execution time will increase as the dataset size and column count grow.

    Benchmark

    Dataset Shape | Change Rate | Memory Usage | Execution Time
    10 columns, 500K rows | 1% changes | <80MB RAM | 9s
    360 columns, 100K rows | 1% changes | <80MB RAM | 1m
    360 columns, 1M rows | 1% changes | <80MB RAM | 35m

    Recommendations

    • Leverage filters to scope checks to new or incremental batches of data wherever possible, rather than repeatedly reconciling the entire dataset. This reduces both execution time and operational overhead.

    • Use metric-level reconciliation as a first line of validation. It is significantly more efficient and scalable, and can quickly highlight whether deeper row-level analysis is even necessary.


    Implement reconciliation checks programmatically

    Soda is suitable for no-code and programmatic users alike. If you are implementing checks programmatically, you can learn more about the contract language syntax for reconciliation on the Contract Language reference. Reconciliation checks can be used for both metric- and row-level validation.

    Soda Agent

    This page describes what is a Soda Agent

    The Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality. For a self-hosted agent, create a Kubernetes cluster in a cloud services provider environment, then use Helm to deploy a Soda Agent in the cluster.

    This setup enables Soda Cloud users to securely connect to data sources (Snowflake, Amazon Athena, etc.) from within the Soda Cloud web application. Any user in your Soda Cloud account can add a new data source via the agent, then write their own no-code checks to check for data quality in the new data source.

    When you deploy an agent, you also deploy two types of workloads in your Kubernetes cluster from a Docker image:

    • a Soda Agent Orchestrator which creates Kubernetes Jobs to trigger scheduled and on-demand scans of data

    • a Soda Agent Scan Launcher which wraps around the Soda Python Libraries, which implement the scans.

    How does Soda integrate with Kubernetes?

    Kubernetes is a system for orchestrating containerized applications; a Kubernetes cluster is a set of resources that supports an application deployment.

    You need a Kubernetes cluster in which to deploy the containerized applications that make up the Soda Agent. Kubernetes uses the concept of Secrets that the Soda Agent Helm chart employs to store connection secrets that you specify as values during the Helm release of the Soda Agent. Depending on your cloud provider, you can arrange to store these Secrets in a specialized storage such as Azure Key Vault or AWS Key Management Service (KMS).

    Learn more about using external secrets.

    The Jobs that the agent creates access these Secrets when they execute.

    Learn more about Kubernetes concepts.

    Where can a Soda Agent be deployed?

    Within a cloud services provider environment is where you create your Kubernetes cluster. You can deploy a Soda Agent in any environment in which you can create Kubernetes clusters, such as:

    • Amazon Elastic Kubernetes Service (EKS)

    • Microsoft Azure Kubernetes Service (AKS)

    • Google Kubernetes Engine (GKE)

    • Any Kubernetes cluster version 1.21 or greater which uses standard Kubernetes

    • Locally, for testing purposes, using tools like Minikube, microk8s, kind, k3s, or Docker Desktop with Kubernetes support

    What is Helm?

    Helm is a package manager for Kubernetes which bundles YAML files together for storage in a public or private repository. This bundle of YAML files is referred to as a Helm chart. The Soda Agent is a Helm chart. Anyone with access to the Helm chart’s repo can deploy the chart to make use of YAML files in it.

    Learn more about Helm concepts.

    The Soda Agent Helm chart is stored on a public repository and published on ArtifactHub.io. Anyone can use Helm to find and deploy the Soda Agent Helm chart in their Kubernetes cluster.

    Why Kubernetes?

    Kubernetes is the most powerful and future-proof platform for running the Soda Agent because it delivers the best of both worlds: the flexibility of raw compute without the operational burden, and the scalability of managed services without their restrictions.

    • Kubernetes goes far beyond raw compute like EC2 or traditional Virtual Machines (VMs) by abstracting away the heavy lifting of networking, deployments, and scaling, while still giving teams precise control when needed. Practically, this makes it easy for Soda’s customers to deploy, manage, and upgrade Soda Agents using Kubernetes and Helm, always staying up to date with the latest releases.

    • Unlike fully managed options such as AWS Lambda, Kubernetes has no execution time limits and is built to handle long-running, stateful, and highly scalable workloads. This means Soda is not limited to lightweight samples but can perform complete, row-level operations—powering advanced capabilities like Diagnostics Warehouse, which securely stores the exact failing records inside your own infrastructure, and Reconciliation Checks, which compare data at row level across sources.

    Whether running in the cloud or on-premises, Kubernetes ensures resilience, portability, and cost-efficient resource use, making it the clear choice for complex, enterprise-grade data quality workloads.

    Data Observability

    An overview of Soda's key observability features and how they help catch data issues early.

    What is Data Observability?

    Data observability is the ongoing process of monitoring and assessing the health of your data throughout its lifecycle. It focuses on analyzing metadata, metrics, and logs to detect issues as they arise, helping teams maintain trust in their data.

    At the core of data observability are monitors that track key data quality metrics over time. When a metric behaves unexpectedly, anomaly detection algorithms analyze historical patterns to determine whether an alert should be triggered.

    Typical data quality metrics to monitor are:

    • Schema changes to surface structural modifications

    • Row counts to detect unexpected changes in data volume

    • Most recent timestamps to detect data freshness, missing or delayed data

    • Missing values to track data completeness

    • Averages to observe shifts in distributions

    Soda’s practical approach to Data Quality

    Soda embraces pragmatism over purity: practical outcomes and effectiveness are more important than ideal, unidimensional approaches. Effective data quality comes from combining data observability and data testing. Each serves a different purpose. Observability is about speed and broad coverage. Testing is about precision, enforcement, and prevention.

    Benefits of Data Observability

    • Enables broad coverage quickly, even across large data sources

    • Surfaces unknown issues without needing to define every rule

    • Requires minimal configuration to get started

    • Leverages existing metadata for fast and efficient monitoring

    • Provides early signals when something might be wrong

    Limitations of Data Observability

    • Serves only as a signal. An anomaly suggests an issue but doesn’t confirm it

    • Can generate false alerts, since detection is driven by algorithms

    • Requires further investigation to validate and resolve alerts

    • Does not prevent issues. It flags them after they’ve happened

    • May result in extra work to follow up and interpret alerts

    Start with Observability, but rely on Testing

    Observability is a fast and efficient way to get initial coverage. It helps surface unknown issues with minimal setup and delivers immediate value across many datasets. However, for lasting reliability and trust in your data, testing is more important.

    Testing requires more effort up front. It involves defining explicit expectations and rules for your data. But that investment pays off. When a test fails, you know there is a real data quality issue, no guesswork, no false alerts. When an anomaly is detected, it doesn't necessarily mean there is an underlying data quality issue, and more investigation effort is required.

    For long-term reliability, testing is essential. It adds rigor by enforcing defined standards and helps prevent bad data from going into production. Start with your most critical datasets, then expand gradually using a collaborative approach, where business users help by proposing checks. This creates a scalable framework that grows with your organization while ensuring lasting data quality.

    What makes Soda’s Data Observability so useful?

    Soda’s data observability allows teams to monitor data health across large environments without manual setup. All anomalies are surfaced in a single, easy-to-navigate dashboard, making it simple to spot issues and investigate patterns. Behind the scenes, a proprietary anomaly detection algorithm ensures high precision by minimizing false positives and focusing on meaningful deviations. Notifications are opt-in and alerts are only triggered when they matter, helping teams stay focused without being overwhelmed by noise.

    Effortlessly monitor thousands of tables

    Soda enables large-scale observability with ease. Instead of configuring each table manually, monitoring is applied at the data source level and automatically extends to all datasets underneath. This allows teams to activate observability across hundreds or even thousands of tables in minutes.

    By leveraging metadata such as row counts, schema evolution, and insert activity, Soda delivers lightweight and efficient monitoring. There is no need to scan entire datasets or write custom logic for each table. You can do that if needed, but it is not required. Observability starts working immediately and is built to handle even the largest data platforms.

    Start today. Look one year back.

    Observability is not just about what happens next. With built-in backfilling and backtesting, Soda instantly analyzes historical metadata and metric trends. From the moment observability is enabled, teams gain visibility into past data quality metrics and can detect potential anomalies that may have gone unnoticed.

    This historical context is essential. It helps determine whether a current anomaly is truly new or part of an ongoing pattern. It also allows the anomaly detection algorithm to establish baselines immediately, which improves the quality of alerts from the very beginning.

    High precision alerts with fewer false positives

    Soda’s proprietary anomaly detection algorithm is specifically designed for data quality monitoring. Every component has been developed entirely in-house without relying on third-party frameworks. This gives Soda full control over the modeling stack and ensures transparency, customization, and explainability. These attributes are especially important in production environments where trust in alerts is essential.

    The algorithm is built on a proprietary evaluation framework that rigorously tests its performance using hundreds of internally curated datasets with known data quality issues. This framework enables structured, repeatable experimentation and continuous benchmarking of new techniques. It prioritizes reducing false positives to ensure alerts are accurate, meaningful, and reliable.

    In benchmark testing, Soda’s algorithm demonstrated a 70 percent improvement in anomaly detection accuracy compared to Facebook Prophet. Unlike generic forecasting tools that rely on rigid assumptions, Soda’s model is tailored to the real-world challenges of monitoring data quality at scale.

    The system is flexible and adapts to different team needs. It can run autonomously with smart defaults or be fine-tuned through a user-in-the-loop approach. Teams can improve detection by providing feedback and adjusting sensitivity. This flexibility ensures that alerts remain focused, useful, and aligned with the needs of each organization.

    Metric Monitoring

    Soda’s Metric Monitoring feature is the foundation of Data Observability, allowing users to automatically track key dataset and column-level statistics over time, detect deviations, and get alerted before data issues impact downstream analytics. While quality checks also keep track of measurements over time, metric monitors use that history of measurements to learn from them and automatically adjust thresholds to inform about expected values or alert about anomalies.

    Implement Metric Monitoring at scale

    Metric Monitoring is developed to be a hassle-free feature. You can unlock organization‐wide observability through Soda Cloud’s no-code dataset onboarding. This instantly provides automated metric monitoring across hundreds of tables by simply selecting all the datasets you care about and defining a shared schedule in one step. No more configuring each table by hand: stay ahead of pipeline failures, data delivery delays, and structural changes with consistent, centralized monitoring that grows as fast as your data.

    Learn more about how roles and permissions affect Metric Monitoring capabilities: Global and Dataset Roles.

    Webhook

    The Webhook Integration in Soda Cloud allows you to send notifications about check results (based on notification rules) and incident updates to external systems, such as monitoring tools, incident management platforms, or custom endpoints.

    This integration is ideal for teams who want to build custom workflows or integrate Soda Cloud alerts into their existing tools.

    Set Up a Webhook Integration

    Only users with the Manage Organization Settings global role can define webhook integrations.

    Follow these steps to configure a Webhook integration in Soda Cloud:

    1. Go to the Integrations section in Settings.

    2. Click the + button to add a new integration.

    3. Select the integration type: Webhook, and click Next.

    4. Configure the Webhook

      • Name: Provide a clear name for your integration.

      • URL: Enter the Webhook endpoint where Soda Cloud should send notifications.

      • Headers (Optional): Add authentication or custom headers required by your endpoint.

    5. Test the Webhook

      • Use the built-in testing tool to simulate events and validate your Webhook integration.

      • You can select different event types to test and develop your integration.

      • For the exact payload structure and details, see the Webhook reference.

    6. Choose the events to send

      • Alert Notifications: The integration becomes available for use in notification rules. It will only send notifications when you explicitly configure a notification rule to use this Webhook.

      • Incidents: Triggered when users create or update incidents in Soda Cloud.

      • Contracts: Triggered when users publish new or updated contracts in Soda Cloud.

    7. Click Save to apply.

    Use in Notification Rules

    After configuring your Webhook integration with the Alert Notification scope, you can use it in your notification rules to send alerts when specific checks fail.

    When creating or editing a notification rule, select your configured Webhook integration as the recipient.

    For detailed steps and advanced examples, see the Notifications documentation.

    Purview

    Integrate Soda with Microsoft’s Purview data catalog to access details about the quality of your data from within the catalog.

    • Run data quality checks using Soda and visualize quality metrics and rules within the context of a table in Purview.

    • Give your Purview-using colleagues the confidence of knowing that the data they are using is sound.

    • Encourage others to add data quality checks using a link in Purview that connects directly to Soda.

    In Purview, you can see all the Soda data quality checks and the value associated with the check’s latest measurement, the health score of the dataset, and the timestamp for the most recent update. Each of these checks listed in Purview includes a link that opens a new page in Soda Cloud so you can examine diagnostic and historic information about the check.

    Purview displays the latest check results according to the most recent Soda scan for data quality, where color-coded icons indicate the latest result. A gray icon indicates that a check was not evaluated as part of a scan.

    If Soda is performing no data quality checks on a dataset, the instructions in Purview invite a catalog user to access Soda and create new checks.

    Prerequisites

    • You have verified some contracts and published the results to Soda Cloud.

    • You have a Purview account with the privileges necessary to collect the information Soda needs to complete the integration.

    • The data source that contains the data you wish to check for data quality is available in Purview.

    Set up the integration

    1. Sign into your Soda Cloud account and confirm that you see the datasets you expect to see in the data source you wish to test for quality.

    2. In your Soda Cloud account, navigate to your avatar > Profile, then navigate to the API Keys tab. Click the plus icon to generate new API keys.

    3. Copy the following values and paste to a temporary, secure, local location.

      • API Key ID

      • API Key Secret

    4. Access the Purview tutorial using REST APIs for instructions on how to create the following values, then paste to a temporary, secure, local location.

      • client_id

      • client_secret

      • tenant_id

    5. Copy the value of your Purview endpoint from the URL (https://XXX.purview.azure.com) and paste to a temporary, secure, local location.

    6. To connect your Soda Cloud account to your Purview Account, contact your Soda Account Executive or email Soda Support with the details you collected in the previous steps to request Purview integration.

    Data source reference for Soda Core

    This page lists the supported data source types and their required connection parameters for use with Soda Core.

    Soda uses the official Python drivers for each supported data source. The configuration examples below include the default required fields, but you can extend them with any additional parameters supported by the underlying driver.

    Each data source configuration must be written in a YAML file and passed as an argument using the CLI or Python API.

    General Guidelines

    • Each configuration must include type, name, and a connection block.




  • Use the exact structure required by the underlying Python driver.

  • Test the connection before using the configuration in a contract.


  • Connect to a data source already onboarded in Soda Cloud (via Soda Agent)

    You can run verifications using Soda Core (local execution) or a Soda Agent (remote execution). To ensure consistency and compatibility, you must use the same data source name in both your local configuration for Soda Core and in Soda Cloud. See: Onboard datasets on Soda Cloud

    This matching by name ensures that the data source is recognized and treated as the same across both execution modes, whether you’re running locally in Soda Core or remotely via a Soda Agent.


    Onboard a data source in Soda Cloud after using Soda Core

    It’s also possible to onboard a data source to Soda Cloud and a Soda Agent after it was onboarded using Soda Core.

    To learn how: Onboard datasets on Soda Cloud

    Using Environment Variables

    You can reference environment variables in your data source configuration. This is useful for securely managing sensitive values (like credentials) or dynamically setting parameters based on your environment (e.g., dev, staging, prod).

    Example:
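
    Below is a minimal sketch of a data source configuration that reads its credentials from environment variables. The postgres type and parameter names shown here are illustrative; use the connection fields required by your data source's driver.

    type: postgres
    name: my_postgres
    connection:
      host: ${env.POSTGRES_HOST}
      username: ${env.POSTGRES_USERNAME}
      password: ${env.POSTGRES_PASSWORD}
      database: ${env.POSTGRES_DATABASE}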

    Environment variables must be available in the runtime environment where Soda is executed (e.g., your terminal, CI/CD runner, or Docker container).

    For Soda to run quality scans on your data, you must configure it to connect to your data source. To learn how to set up Soda from scratch and configure it to connect to your data sources, see Soda's Quickstart.


    Supported data sources



    Metric monitor page

    Every monitor in Soda has its own dedicated detail page. This page is designed to help you explore the monitor's history, understand its behavior over time, and take action when needed. From here, you can investigate anomalies, give feedback to improve the detection algorithm, create incidents, and fine-tune the monitor's sensitivity or configuration.

    The page consists of two main components:

    1. An interactive plot that visualizes metric trends, anomalies, and historical context

    2. A results table that lists all metric values and events visible in the plot

    The Interactive Monitor Plot

    The interactive plot gives you a time-based view of how the monitor metric has evolved. It combines metric values, expected behavior, and any detected anomalies in a single visual.

    Navigating the Plot

    • Select a time window using the range slider below the plot to zoom in or out on a specific period

    • Click and drag to zoom into a custom time range

    • Hover over data points to view detailed information for each result

    What the plot shows

    • Expected range: the shaded area that represents the predicted normal behavior, as defined by the anomaly detection model

    • Measurement: the actual metric value for each scan

    • Anomaly: points marked when the metric falls outside the expected range and is flagged by the algorithm

    • Missing: scans where no metric could be collected, typically due to unavailable data or delayed scans

    • Feedback: shows if the user provided feedback on a result (e.g. confirmed or dismissed an anomaly)

    • Configuration change: visual markers indicating when the monitor’s configuration was updated

    Key events in the plot

    • Monitor created: marks the date the monitor is created

    • Initial configuration: shows the starting settings used when the monitor was first enabled

    • Configuration updated: marks changes to thresholds, exclusions, or sensitivity applied over time

    Using the Results Table

    Below the plot, the table lists all historical scan results, including metric values, anomaly status, and any user actions (like feedback or incidents). The plot is aligned with the table, so each data point in the plot directly corresponds to a result in the table.

    This makes it easy to correlate visual trends with specific events, compare changes, and drill into the context of any anomaly or data quality issue.

    Using the three-dots menu, you can give feedback to the anomaly detection algorithm (in bulk, for multiple results at once) or create an incident and link the results to it.

    How to fine-tune Metric Monitoring

    Soda's Observability tools work out of the box with predefined baselines, but you can fine-tune them to your specific needs. Do this from the page for each specific metric.

    Set Threshold Strategy

    By default Soda uses an adaptive statistical method, but you can control which sides of the expected range should trigger an anomaly alert:

    1. Open the panel

    Click the Set Threshold Strategy button on the metric of your choice.

    2. Choose your alert ranges

    • Upper range: when checked, Soda will flag any metric value that exceeds the upper bound of its statistical baseline.

    • Lower range: when checked, Soda will flag any metric value that falls below the lower bound.

    3. Apply your settings

    Click Set Threshold Strategy to save.

    With this simple toggle interface you can, for example, watch only for unexpectedly high values, only for drops below your baseline, or both.

    Set Exclusion Values

    Exclude values from monitoring to tell Soda which specific values or ranges should be ignored when evaluating anomalies (e.g. test rows, manual overrides). The available inputs depend on the metric type.

    1. Open the panel: click the Set Exclusion Values button on the metric of your choice.

    2. Define your exclusions: click + Add exclusion.

    • Numeric metrics (Total row count, Total row count change, Partition row count):

      • Type: Value or Value range

      • Value: enter the exact metric value (e.g. 205) or, for a range, specify both lower and upper bounds.

    • Time-based metrics (Last modification time, Most recent timestamp):

      • Type: Value or Value range

      • Value: enter the cutoff you want to ignore (e.g. 0 days, 10 hours, 49 minutes) or, for a range, specify both lower and upper bounds.

    • Schema changes: exclusions are not supported for schema-drift monitors.

    You can stack multiple rules by clicking + Add exclusion.

    3. Apply: click Set Exclusion Values to save your rules.

    This will not retroactively change past results. It only affects future anomaly evaluations.

    Set Sensitivity

    Soda uses a statistical baseline to define an “expected range” for anomaly detection. You can adapt how tight or loose that range is.

    1. Open the panel: click the Set Sensitivity button on the metric of your choice.

    2. Adjust the sensitivity

    • Provide a z-score: enter a value between 0.3 and 6 to control the exact width of the expected range OR use the slider to drag between Narrow (lower z-score) and Wide (higher z-score).

    • Default: z = 3

    Preview how changing sensitivity widens or narrows the gray “expected” band in the plot.

    3. Apply

    Click Apply sensitivity to save.

    This will not retroactively change past results. It only affects future anomaly evaluations.

    Give feedback to improve detection

    Our home-brewed anomaly detection algorithm draws trends from historical data, but it can also learn from your input as you give it feedback.

    When a monitor flags an anomaly you can:

    1. Mark as expected Teach Soda that this deviation is acceptable: future similar variations will no longer trigger alerts.

    2. Mark as anomaly Explicitly flag a point as an anomaly, even if it fell inside the baseline. This helps refine your alert definitions.

    Create incidents

    • Create new incident Create a ticket in your incident management tool directly from the panel.

    • Link to existing incident Attach this scan to a ticket in your external system (Jira, ServiceNow, PagerDuty, etc.), keeping engineering triage in one place.

    • Bulk feedback More than one scan can be added to an incident or feedback. Simply check the boxes of the scans you want to add.

    All feedback and incident links become part of the scan history, providing an auditable trail for both data engineers and business stakeholders.

    Dataset Attributes & Responsibilities

    Dataset settings allow you to define key metadata, ownership, and business context for your datasets. This information helps ensure data governance, accountability, and seamless integration with other tools like your data catalog.

    Dataset Owner

    Purpose of the Dataset Owner

    Each dataset should have a designated dataset owner: a person or team responsible for the dataset's quality, availability and usage.

    Typically, the role of a Dataset Owner includes:

    • Defining and maintaining the dataset's purpose and documentation.

    • Ensuring the dataset meets data quality standards and contract requirements.

    • Responding to issues, such as failed checks or data quality alerts.

    • Reviewing and approving changes to the dataset schema or contract.

    Updating the Dataset Owner

    Updating the Dataset Owner requires the following dataset role: "Configure dataset".

    To assign a Dataset Owner:

    1. Open the dataset page.

    2. Click the context menu (⋮) in the top-right corner and select Edit Dataset.

    3. In the Owned by section, select one or more users and/or user groups.

    4. Click Save to apply the changes.

    Responsibilities

    What are Responsibilities?

    Responsibilities allow you to assign permissions to users or user groups, ensuring they have the access they need to work with a dataset.

    A Responsibility is a combination of:

    • A User or User Group.

    • A Dataset Role, which is a predefined collection of permissions (such as the ability to edit contracts, view checks, or manage settings).

    By assigning Responsibilities, you define who can do what for each dataset, supporting clear ownership, governance, and collaboration.

    Learn about defining custom roles

    How to Add Responsibilities

    Managing responsibilities requires the following dataset role: "Manage dataset responsibilities"

    To assign a Responsibility to a user or group:

    1. Open the dataset page.

    2. Click the context menu (⋮) in the top-right corner and select Edit Responsibilities.

    3. Add the desired users or user groups.

    4. Select the appropriate Dataset Role for each.

    5. Click Save to apply the changes.

    Default Dataset Owner Role

    Every dataset has a default Dataset Owner role, automatically assigned to the designated Dataset Owner(s).

    • This role provides essential permissions to manage and maintain the dataset.

    • The Dataset Owner role cannot be removed, but it can be combined with other roles for additional permissions.

    The default permissions granted to the Dataset Owner role are customizable at the organization level. For more details on configuring the default Dataset Owner role and other roles, see Global and Dataset Roles.

    Dataset Attributes

    Updating the dataset attributes requires the following dataset role: "Configure dataset".

    Purpose of Attributes

    Dataset attributes allow you to add descriptive metadata to your datasets. This metadata can then be:

    • Used for filtering in Soda Cloud, making it easier to search and organize datasets and checks based on specific criteria (e.g., business domain, sensitivity, criticality).

    • Leveraged in reporting, enabling you to group datasets, track ownership, and monitor data quality across different categories or dimensions.

    Adding meaningful attributes enhances discoverability, governance, and collaboration within Soda and its integrations.

    Learn how to define attribute types: Check and dataset attributes.

    Adding Dataset Attributes

    You can add or modify dataset attributes in the Dataset Settings page:

    1. Click the context menu (⋮) in the top-right corner and select Edit Dataset.

    2. Set a value for the existing attribute types. They are all optional.

    3. Save your changes.

    Bulk Edit of Attributes and Responsibilities

    When managing multiple datasets, you can save time by applying changes in bulk using the Bulk Edit feature.

    How to Bulk Edit Datasets

    1. Go to the Datasets page.

    2. Select the datasets you want to edit using the checkboxes.

    3. Click Edit in the action bar.

    4. Define attributes you want to add or modify across the selected datasets.

    5. Define responsibilities you want to add or modify across the selected datasets.

      • Choose whether to update existing responsibilities (add new without removing existing) or reset (replace all existing responsibilities with the new definition).

    6. Click Continue to review your changes.

    Integrate With Data Catalog

    You can automate the management of dataset attributes and responsibilities in Soda Cloud using our REST API. This allows you to:

    • Programmatically set or update attributes for multiple datasets.

    • Assign responsibilities (users, groups, and roles) to datasets at scale.

    • Keep your Soda Cloud configuration in sync with your data catalog or external metadata management systems.

    This automation ensures that your metadata stays up-to-date and consistent across your ecosystem, supporting seamless governance and discoverability.

    To do so, you can leverage our APIs: Update dataset and Update dataset responsibilities.

    Soda Python Libraries

    This page describes how to install the Soda Python packages, which are required for running Soda scans via the CLI or Python API.

    Installation

    Requirements

    To use Soda, you must have installed the following on your system.

    • Python 3.8, 3.9, 3.10 or 3.11.

    To check your existing version, use the CLI command: python --version or python3 --version. If you have not already installed Python, consider using pyenv to manage multiple versions of Python in your environment.

    • Pip 21.0 or greater.

    To check your existing version, use the CLI command: pip --version

    • A Soda Cloud account; see how to sign up.

    Best practice dictates that you install the Soda CLI using a virtual environment. If you haven't yet, in your command-line interface tool, create a virtual environment in the .venv directory using the commands below. Depending on your version of Python, you may need to replace python with python3 in the first command.

    Choose an installation flow

    Before you install the Soda CLI, decide which installation flow applies to your environment and license type. The two flows available serve different purposes:

    Public PyPI

    • Use case: executing data contracts with basic data quality checks on enterprise data sources.

    • Description: use this installation method if you’re just getting started. The Public PyPI index hosts Soda Core packages for all supported data sources.

    Private PyPI

    • Use case: same as above, plus group by checks, reconciliation checks, migrating checks from v3 to v4, running checks on Oracle data, and capturing failed rows with the Diagnostics Warehouse.

    • Description: private PyPI repositories are region-specific and require authentication using your API key credentials. This method ensures secure access to licensed components, enterprise-only extensions, and region-compliant hosting.

    Different installations support different packages. Learn more about which packages are supported in public and private PyPI.

    How to differentiate between free open-source Soda and paid licensed Soda?

    Soda V3: package names included core if the package was free open-source, e.g.:

    • soda-core-postgres (free open-source)

    • soda-postgres (paid licensed Soda)

    Soda V4: no differentiation using core in package names. Differentiation is based on the installation flows listed above.


    Public PyPI installation flow

    To use the open source Soda Core Python packages, you must install them from the public Soda PyPI registry: https://pypi.dev.sodadata.io/simple.

    1. Install the Soda Core package for your data source. This gives you access to all the basic CLI functionality for working with contracts.

    Replace soda-postgres with the appropriate package for your data source. See the Data source reference for Soda Core for supported packages and configurations.

    Now you can connect a data source and start verifying contracts.

    Supported packages

    • soda: "umbrella" package (does not include Diagnostics Warehouse)

    • Data-source-specific packages: naming pattern is “soda-<datasource>” (e.g. soda-postgres, soda-bigquery, soda-sparkdf, etc.)


    Private PyPI installation flow

    If you wish to use commercial extensions to the Soda Core Python packages, you must install them from one of the private Soda PyPI registries below. The private PyPI installation process adds an authentication layer and region-based repositories for license-based access control for Team and Enterprise customers.

    1. Upgrade pip inside your new virtual environment.

    2. Choose the correct repository based on your license and region.

    • Team (EU): team.pypi.cloud.soda.io

    • Team (US): team.pypi.us.soda.io

    • Enterprise (EU): enterprise.pypi.cloud.soda.io

    • Enterprise (US): enterprise.pypi.us.soda.io

    Team: any license except “Trial” or “Enterprise” (see below). Enterprise: one of the enterprise, enterprise_user_based, dataset_standard, or premier licenses.

    3. Set your credentials. See how to generate your own API key values.

    4. Execute the following command, replacing soda>=4.0.0b0 with the package that you need to install.

    Included packages

    Team
    • soda: required for the contract generator (includes Diagnostics Warehouse)

    • soda-groupby

    • soda-migration

    Enterprise
    • soda: required for the contract generator (includes Diagnostics Warehouse)

    • soda-groupby

    • soda-migration

    • soda-reconciliation

    • soda-oracle

    Alation

    Integrate Soda with Alation to access details about the quality of your data from within the data catalog.

    • Run data quality checks using Soda and visualize quality metrics and rules within the context of a data source, dataset, or column in Alation.

    • Use Soda Cloud to flag poor-quality data in lineage diagrams and during live querying.

    • Give your Alation users the confidence of knowing that the data they are using is sound.

    • 🎥 Watch a 5-minute overview showcasing the integration of Soda and Alation.

    Prerequisites

    • You have verified some contracts and published the results to Soda Cloud.

    • You have an Alation account with the privileges necessary to allow you to add a data source, create custom fields, and customize templates.

    • You have a git repository in which to store the integration project files.

    Set up the integration

    🎥 Watch a 5-minute video that demonstrates how to integrate Soda and Alation.

    1. Sign into your Soda Cloud account and confirm that you see the datasets you expect to see in the data source you wish to test for quality.

    2. To connect your Soda Cloud account to your Alation Service Account, create an .env file in your integration project in your git repo and include details according to the example below. Refer to Generate API keys to obtain the values for your Soda API keys.

    3. To sync a data source and schema in the Alation catalog to a data source in Soda Cloud, you must map it from Soda Cloud to Alation. Create a .datasource-mapping.yml file in your integration project and populate it with mapping data according to the following example. The table below describes where to retrieve the values for each field.

    The fields and where to retrieve their values:

    • name: a name you choose as an identifier for an integration between Soda Cloud and a data catalog.

    • soda: datasource_id: the data source information panel in Soda Cloud.

    • soda: datasource_name: the data source information panel in Soda Cloud.

    • soda: dataset_mapping: (optional) when you run the integration, Soda automatically maps all of the datasets between data sources. However, if the names of the datasets differ between the tools, you can use this property to manually map datasets between tools.

    • catalog: type: the name of the cataloging software; in this case, “alation”.

    • catalog: datasource_id: retrieve this value from the URL on the data source page in the Alation catalog; see image below.

    • catalog: datasource_container_name: the schema of the data source; retrieve this value from the data source page in the Alation catalog under the subheading Schemas; see image below.

    • catalog: datasource_container_id: the ID of the datasource_container_name (the schema of the data source); retrieve this value from the schema page in the Alation catalog; see image below.

    Enable API access to Alation with SSO

    If your Alation account employs single sign-on (SSO) access, you must create an API service account for Soda to integrate with Alation.

    If your Alation account does not use SSO, skip this step and proceed to Customize the catalog.

    Customize the catalog

    1. Create custom fields in Alation that reference information that Soda Cloud pushes to the catalog. These are the fields the catalog users will see that will display Soda Cloud data quality details. In your Alation account, navigate to Settings > Catalog Admin > Customize Catalog. In the Custom Fields tab, create the following fields:

      • Under the Pickers heading, create a field for “Has DQ” with Options “True” and “False”. The Alation API is case sensitive so be sure to use these exact values.

      • Under the Dates heading, create a field for “Profile - Last Run”.

      • Under the Rich Texts heading, create the following fields:

        • “Soda DQ Overview”

        • “Soda Data Quality Rules”

        • “Data Quality Metrics”

    2. Add each new custom field to a Custom Template in Alation. In Customize Catalog, in the Custom Templates tab, select the Table template, then click Insert… to add a custom field to the template:

      • “Soda DQ Overview”

    3. In the Table template, click Insert… to add a Grouping of Custom Fields. Label the grouping “Data Quality Info”, then Insert… two custom fields:

      • “Has DQ”

      • “Profile - Last Run”

    4. In the Column template, click Insert… to add a custom field to the template:

      • “Has DQ”

    5. In the Column template, click Insert… to add a Grouping of Custom Fields. Label the grouping “Soda Data Profile Information”, then Insert… two custom fields:

      • Data Quality Metrics

      • Soda Data Quality Rules

    Run the integration

    Contact Soda Support directly to acquire the assets and instructions to run the integration and view Soda Cloud details in your Alation catalog.

    Use the integration

    Access Soda Cloud to create no-code checks or agreements that execute checks against datasets in your data source each time you run a Soda scan manually or orchestrate a scan using a data pipeline tool such as Airflow. Soda Cloud pushes data quality scan results to the corresponding data source in Alation so that users can review data quality information from within the catalog.

    In Alation, beyond reviewing data quality information for the data source, users can access the Joins and Lineage tabs of individual datasets to examine details and investigate the source of any data quality issues.

    Open in Soda

    In a dataset page in Alation, in the Overview tab, users have the opportunity to click links to directly access Soda Cloud to scrutinize data quality details; see image below.

    • Under the Soda DQ Overview heading in Alation, click Open in Soda to access the dataset page in Soda Cloud.

    • Under the Dataset Level Monitors heading in Alation, click the title of any monitor to access the check info page in Soda Cloud.

    MS Teams

    Configure Soda Cloud to connect your account to MS Teams so that you can:

    • Send Notifications for failed or warning check results to an MS Teams channel

    • Start conversations to track and resolve data quality Incidents with MS Teams

    Configure an MS Teams integration

    Only users with the Manage Notification Rules permission can create or edit rules. All users can view rules. Read about Global and Dataset Roles.

    1. As a user with permission to do so, log in to your Soda Cloud account, navigate to your avatar > Organization Settings, then select the Integrations tab.

    2. In the Add Integration dialog box, select Microsoft Teams.

    3. In the first step of the guided integration workflow, follow the instructions to navigate to your MS Teams account to create a Workflow; see Microsoft’s documentation for creating a workflow from a channel in Teams. Use the Workflow template to Post to a channel when a webhook request is received.

    4. In the last step of the guided Workflow creation, copy the URL created after successfully adding the workflow.

    5. Returning to Soda Cloud with the URL for the Workflow, continue to follow the guided steps to complete the integration. Reference the following tables for guidance on the values to input in the guided steps.

    6. Click Save to activate the integration.

    Configuration tab: Provide the following information

    • Name: provide a unique name for your integration in Soda Cloud.

    • URL: input the Workflow URL you obtained from MS Teams.

    Scope tab: select the Soda features (alert notifications and/or incidents) that can access the MS Teams integration.

    • Alert notifications: enable to send notifications to Microsoft Teams when a check result triggers an alert. Check to allow users to select MS Teams as a destination for alert notifications when check results warn or fail.

    • Incidents: use Microsoft Teams to track and resolve incidents in Soda Cloud. Check to automatically send incident information to an MS Teams channel.

    • Channel URL: provide a channel identifier to which Soda Cloud sends all incident events.

    About integration scopes

    Integration for alert notifications

    Use the Alert Notification scope to enable Soda Cloud to send alert notifications to an MS Teams channel to notify your team of warn and fail check results. With such an integration, Soda Cloud enables users to select MS Teams as the destination for an alert notification of an individual check, of checks that form a part of an agreement, or of multiple checks. To send notifications that apply to multiple checks, see Notifications.

    Integration for Soda Cloud incident

    Use the Incident scope to notify your team when a new incident has been created in Soda Cloud. With such a scope, Soda Cloud displays an external link to the MS Teams channel in the Incident Details. Soda Cloud sends all incident events to only one channel in MS Teams. As such, you must provide a separate channel link in the Channel URL field in the Define Scope tab. For example, https://teams.microsoft.com/mychannel. To obtain the channel link in MS Teams, right-click on the channel name in the overview sidebar. Refer to Incidents for more details about using incidents in Soda Cloud.

    Troubleshoot

    Problem: You encounter an error that reads, “Error encountered while rendering this message.”

    Solution: A fix is documented, the short version of which is as follows.

    1. Restart MS Teams.

    2. Clear your cache and cookies.

    3. If you have not already done so, update to the latest version of MS Teams.

    ServiceNow

    Configure a Webhook in Soda Cloud to connect to your ServiceNow account.

    In ServiceNow, you can create a Scripted REST API that enables you to prepare a resource to work as an incoming webhook. Use the ServiceNow Resource Path in the URL field in the Soda Cloud integration setup.

    This example offers guidance on how to set up a Scripted REST API Resource to generate an external link which Soda Cloud displays in the Incident Details; see image below. When you change the status of a Soda Cloud incident, the webhook also updates the status of the SNOW issue that corresponds with the incident.

    Refer to the Webhook API for detailed information.

    The following steps offer a brief overview of how to set up a ServiceNow Scripted REST API Resource to integrate with a Soda Cloud webhook. Reference the ServiceNow documentation for details:

    • Create a Scripted REST API and Create a Scripted REST API Resource

    • ServiceNow Developer: Creating Scripted REST APIs

    1. In ServiceNow, start by navigating to the All menu, then use the filter to search for and select Scripted REST APIs.

    2. Click New to create a new scripted REST API. Provide a name and API ID, then click Submit to save.

    3. In the Scripted REST APIs list, find and open your newly-created API, then, in the Resources tab, click New to create a new resource.

    4. Provide a Name for your resource, then select POST as the HTTP method.

    5. In the Script field, define a script that creates new tickets when a Soda Cloud incident is opened, and updates existing tickets when a Soda Cloud incident status is updated. Use the example below for reference. You may also need to define Security settings according to your organization’s authentication rules.

    6. Click Submit, then copy the value of the Resource path to use in the URL field in the Soda Cloud integration setup.

    Author a contract in Soda Cloud

    Once your dataset is onboarded, you can begin defining the expectations that make up your Data Contract.

    Creating a Contract

    To create a Data Contract, navigate to any onboarded dataset and click Create Contract.

    Define Attributes

    Atlan

    Integrate Soda with Atlan to access details about the quality of your data from within the data catalog.

    • Run data quality checks using Soda and visualize quality metrics and rules within the context of a data source, dataset, or column in Atlan.

    • Use Soda Cloud to flag poor-quality data in lineage diagrams.

    • Give your Atlan users the confidence of knowing that the data they are using is sound.

    soda data-source test -ds ds.yml
    type: postgres
    name: postgres
    connection:
      host:
      port:
      database:
      user: ${env.POSTGRES_USERNAME}
      password: ${env.POSTGRES_PASSWORD}



    (function process(/*RESTAPIRequest*/ request, /*RESTAPIResponse*/ response) {
    
    
    	var businessServiceId = '28***';
    	var snowInstanceId = 'dev***';
    	
    	var requestBody = request.body;
    	var requestData = requestBody.data;
    	gs.info(requestData.event);
    	if (requestData.event == 'incidentCreated'){
    		gs.log("*** Incident Created ***");
    		var grIncident = new GlideRecord('incident');
    		grIncident.initialize();
    		grIncident.short_description = requestData.incident.description;
    
    		grIncident.description = requestData.incident.sodaCloudUrl;
    		grIncident.correlation_id = requestData.incident.id;
    		if(requestData.incident.severity == 'critical'){
    			grIncident.impact = 1;
    		}else if(requestData.incident.severity == 'major'){
    			grIncident.impact = 2;
    		}else if(requestData.incident.severity == 'minor'){
    			grIncident.impact = 3;
    		}
    		
    		grIncident.business_service = businessServiceId;
    		grIncident.insert();
    		var incidentNumber = grIncident.number;
    		var sysid = grIncident.sys_id;
    		var callBackURL = requestData.incidentLinkCallbackUrl;
    		var req, resp;
    		
    		req = new sn_ws.RESTMessageV2();
    
    
    		req.setEndpoint(callBackURL.toString());
    		req.setHttpMethod("post");
    		var sodaUpdate = '{"url":"https://'+ snowInstanceId +'.service-now.com/incident.do?sys_id='+sysid + '", "text":"SNOW Incident '+incidentNumber+'"}';
    		req.setRequestBody(sodaUpdate.toString());
    		resp = req.execute();
    		gs.log(resp.getBody());
    		
    
    	}else if(requestData.event == 'incidentUpdated'){
    		gs.log("*** Incident Updated ***");
    		var target = new GlideRecord('incident');
    		target.addQuery('correlation_id', requestData.incident.id);
    		target.query();
    		target.next();
    
    		if(requestData.incident.status == 'resolved'){
    			//Change this according to how SNOW is used.
    			target.state = 6;
    			target.close_notes = requestData.incident.resolutionNotes;
    		}else{
    			//Change this according to how SNOW is used.
    			target.state = 4;
    		}
    		target.update();
    		
    	}
    
    
    })(request, response);





    This action requires the Manage contract permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities

    You’ll be taken to the Contract Editor, a powerful interface where you can define your contract in two ways:

    • No-code view: Point-and-click UI to add quality checks and configure settings

    • Code view: YAML editor for full control and advanced use cases.

    See language reference: Contract Language reference

    You can switch between views at any time using the editor toggle in the top right corner.

    Key Concepts in Contract Authoring

    Understanding how to structure your contract is essential. Soda supports several types of checks and configuration options:

    • Filter: applies a global filter to limit which rows are considered across the entire contract (e.g., only the latest partition or rows from the past 7 days.)

    • Variables: help you parameterize your contract, making it flexible and adaptable to different contexts (e.g., environments, schedules, or partitions.)

    • Dataset-level Checks: rules that apply to the dataset as a whole, like row count, freshness, or schema checks.

    • Column-level Checks: rules that apply to individual columns, like missing values, uniqueness, ranges, or regex formats.

    All visible columns are detected during onboarding. You can also manually add columns if needed.

    Use Variables

    Variables allow dynamic substitution of values in contracts. They help you:

    • Parameterize values that differ across environments, datasets, or schedules.

    • Reuse values in multiple places within the same contract to reduce duplication and improve maintainability.

    You can define variables at the top of your contract:
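    A minimal sketch of such a variables block; the variable name country and its default value are illustrative, and the exact schema is described in the Contract Language reference:

    variables:
      country:
        default: BE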

    Then use them throughout your contract using the ${var.VARIABLE_NAME} syntax.

    For example:
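    As a sketch, the country variable defined above can drive a dataset filter (mirroring the filter example shown later on this page):

    filter: country = "${var.country}"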

    When running the contract, variable values must be provided unless a default is defined.

    Variables are ideal for partitioned datasets, date-based rules, or customizing checks based on context.

    Out of the box variables

    Now: You can use ${soda.NOW} in your Contract to access the current timestamp.
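    As a sketch, assuming ${soda.NOW} can be substituted wherever a timestamp value is expected, a filter could keep only rows up to the current scan time; the created_at column is a placeholder:

    filter: created_at <= ${soda.NOW}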

    Define Attributes

    Use attributes to label, sort, and route your checks in Soda Cloud. Attributes help you organize checks by properties such as domain, priority, location, and sensitivity (e.g., PII).

    Learn how to leverage attributes with Notifications and Browse datasets.

    Apply Attributes to Checks

    You can add attributes directly to individual checks. For example:
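    A minimal sketch of a check carrying attributes; the abbreviated check syntax and the attribute names department and priority are illustrative, so refer to the Contract Language reference for the exact check definitions:

    checks:
      - row_count:
          attributes:
            department: marketing
            priority: high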

    Set Default Attributes at the Top Level

    You can also define default attributes at the dataset level. These attributes apply to all checks, unless overridden at the individual check level.
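    A sketch under the same assumptions as above, with a top-level attributes block providing defaults that an individual check can override:

    attributes:
      department: marketing

    checks:
      - row_count:
          attributes:
            priority: high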

    Attribute Validation in Soda Cloud

    When publishing contract results to Soda Cloud, all check attributes must be pre-defined in Soda Cloud. If any attribute used in a contract is not registered in your Soda Cloud environment, the results will not be published, and the data contract scan will be marked as failed.

    Learn how to configure attributes in Soda Cloud: Check and dataset attributes.


    Testing the Contract

    Before publishing, click Test to simulate a contract verification against your live data. Soda will:

    • Run all defined checks

    • Display which rules pass or fail

    • Surface profiling and diagnostic insights

    This dry run helps ensure your contract behaves as expected, before making it official.

    This action requires the "Manage contract" permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities

    Test a contract on a data sample

    You can test a contract on a sample of your data. Learn more at the onboarding Additional settings.

    Publishing the Contract

    Once you're happy with the results, click Publish.

    Publishing sets this version as the source of truth for that dataset. From this point on:

    • Verifications will use the published version

    • All users see this contract as the authoritative definition of data quality for that dataset

    • Changes will require a new version or a proposal (depending on permissions)

    Publishing ensures your data expectations are versioned, visible, and enforceable.

    This action requires the Manage contract permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities


    You’re now ready to start verifying your contract and monitoring your data.

    Contract history

    Contract history provides a snapshot view of all changes that have been made to a contract.

    To access contract history:

    1. Navigate to a dataset with an existing data contract.

    2. Click the icon next to Edit Contract, in the top right (or click Edit Contract > Contract history).

    3. Review contract history by choosing a version on the left panel and inspecting it on the right panel.

    Just as when a contract is being created or edited, you can toggle between the code and no-code views.

    The code view allows you to toggle a diff view and a split view.

    Request history

    While contract history allows you to see the changes that a contract has undergone, request history provides an overview of the change requests that have been made for a specific contract.

    View dataset request history

    To access the request history of any dataset, navigate to the dataset > tab Requests.

    The list of requests can be filtered by title keyword and by state (Open, Done, and Won’t do).

    From this view, you can also create a request.

    This view provides a snapshot of each request, making visible:

    • The title, description (if any), and time of creation of the request

    • The state of the request (Open, Done and Won't do)

    • The icon, which indicates that the request has a proposal

    • The icon, which indicates that the request has comments

    View organization request history

    To access all request history in an organization, navigate to the Requests tab at the top of the page.

    This page provides an overview of all requests made within the organization. The requests can be filtered by:

    • Title key word(s)

    • Status

    • User that created the request

    • Users that are participants in the request



    Prerequisites

    • You have verified some contracts and published the results to Soda Cloud.

    • You have an Atlan account with the privileges necessary to allow you to set up a Connection in your Atlan workspace.

    Set up the integration

    1. Follow the instructions to Generate API keys in Soda to use for authentication in your Atlan connection.

    2. Follow Atlan’s documentation to set up the Connection to Soda in your Atlan workspace.

    3. 🎥 Watch the Atlan-Soda integration in action!




    Contract collaboration

    Data contracts define the expectations between data producers and data consumers, ensuring that the data delivered is fit for purpose and aligned with business needs. However, data requirements evolve, and consumers often identify gaps or new use cases that require adjustments.

    To support this, Soda provides a collaborative process that allows data consumers to request changes to existing data contracts or propose the creation of new ones. Consumers can directly propose changes by editing the data contract with Soda's no-code editor, suggesting concrete modifications for the dataset owner to review.

    This approach enables data consumers to express their requirements not just in abstract terms but in actionable, implementable contract changes. By doing so, the consumer helps the dataset owner by:

    • Making their needs clearer and more concrete.

    • Supporting faster alignment between producers and consumers.

    • Contributing to quicker and smoother implementation.

    • Reducing unnecessary communication overhead.

    The dataset owner remains the final decision-maker, reviewing proposed changes, iterating with the consumer as needed, and then publishing the updated contract once consensus is reached.

    This collaborative workflow ensures that data contracts remain living agreements that continuously adapt to evolving business use cases while maintaining producer accountability.

    In Soda Cloud, you can access a view of the contract history, which allows you to inspect all changes made to a specific contract.

    Learn more about Contract history.

    Initiating a request

    This action requires the Propose checks permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities.

    Users can:

    • Request a change by simply describing their needs and use cases.

    • Propose changes directly by editing the data contract, suggesting concrete modifications for the dataset owner to review.

    When a request is created, dataset owners automatically receive an email notification, ensuring they can promptly review and collaborate with the requester.

    Initiate a request with a proposal

    To propose a change or create a new contract, data consumers can initiate a request directly from the dataset page.

    1. Navigate to a dataset Go to any onboarded dataset in Soda.

    2. Start editing

      • If the dataset does not yet have a contract, click Create Contract.

      • If a contract already exists, click Edit Contract.

    3. Make changes Update the contract based on your needs and use case. You can add, modify, or remove elements to ensure the contract reflects the requirements you want to address.

    4. Create a new request After making your edits, click Create a Request.

    5. Provide details You will be prompted to:

      1. Enter a title for the request.

      2. Provide a reasoning or description of the changes, explaining why they are needed.

    6. Save the request Once you click Save, a new request is created containing your proposed changes. This proposal is then shared with the dataset owner for review and follow-up.

    Initiate a request without a proposal

    In some cases, data consumers may want to request changes without directly editing the contract themselves. This allows them to highlight a need while leaving the implementation details to the dataset owner.

    1. Navigate to the dataset Open the dataset in Soda.

    2. Go to the Requests tab Select the Requests tab for that dataset.

    3. Create a new request Click Create a Request.

    4. Provide details You will be prompted to:

      1. Enter a title for the request.

      2. Provide a reasoning or description of the changes, explaining why they are needed.

    5. Save the request Once you click Save, the request is created. The dataset owner will be notified and can review, clarify, and propose changes to the contract based on your input.

    Collaborating over a Request

    Each dataset page includes a Requests tab where all requests related to that dataset are listed. From here, users can:

    • Search for a request by name.

    • Filter requests by status: Open, Done, or Won’t Do.

    • Click on any request to access collaboration tools.

    Once inside a request, users can collaborate in the following ways:

    Review Proposals

    Click View Proposal to examine an existing proposal associated with the request.

    When viewing a proposal, visual indicators show exactly what has changed in the contract:

    • Blue icon → element was modified (M).

    • Red icon → element was removed (R).

    • Green icon → element was added (A).

    • Blue dot → a parent element has one or more children that were updated.

    Exchange Messages

    Participants can post text messages within the request to clarify needs, align on requirements, and discuss next steps.

    Create a New Proposal

    Users can contribute new proposals to move the request forward.

    Iterate on an existing proposal: while viewing a proposal, click the pen icon to edit and build upon it.

    From scratch: click Add Proposal to create a brand-new proposal.

    In both cases:

    1. Make your edits.

    2. Click Save.

    3. Provide a message to explain what you have done.

    4. Click Save again.

    All participants are automatically notified by email when a new proposal is created or an iteration is made, ensuring everyone stays aligned and can respond promptly.

    Publish a proposal

    This action requires the Manage Contract permission on the dataset. Learn more about permissions here: Dataset Attributes & Responsibilities.

    Publishing to Soda Cloud

    After reviewing a proposal, you can publish it by clicking the Publish button. Once published, all participants associated with the request will automatically receive a notification, ensuring they are informed of the update.

    Resolving conflict

    If a new version of the contract has been published in the meantime, you must sync the proposal with the latest version before you can publish it.

    1. When reviewing a proposal, click Sync to latest.

    2. There are then two scenarios that can arise:

      1. Soda can automatically merge the two versions. You can then proceed to the next step.

      2. There are conflicts that Soda cannot resolve. In this case, you will be required to resolve the conflict. Soda offers a tool allowing you to compare the latest published version (left side) with the version of the proposal (right side). You can then edit the proposal version to resolve the conflicts. Click Continue to proceed.

    3. Optionally, make extra edits.

    4. Click Save to create a new proposal, which you can now publish.

    Publish to Git

    You can fetch the content of a proposal from Soda Cloud and save it as a contract file, which can then be published to Git. This allows you to incorporate approved changes into version-controlled data contracts.

    • -r (required): the request number. Identifies the request to fetch. Request numbers can be found when reviewing a proposal; see the screenshot below.

    • -p (optional): the proposal number. Defaults to the latest proposal if not specified. Proposal numbers are shown as the decimal part when reviewing a proposal; see the screenshot below.

    • --soda-cloud, -sc (required): path to the Soda Cloud config file (e.g., soda-cloud.yaml).

    • --f (required): path to the output file where the contract will be written.

    Request and proposal numbers can be found on Soda Cloud when reviewing a proposal. The first number is the request, and the decimal is the proposal.

    After fetching the proposal, you can optionally use the publish command to publish it from Soda Cloud to Git:

    Close a request

    Each request in Soda has a status to reflect its lifecycle. Initially, a request is created in the Open state. Once the requested changes have been implemented and published, the request can be moved to Done. If the decision is made not to implement the request, it can be transitioned to Won’t Do. Whenever a request’s status is updated, all participants are automatically notified by email, ensuring transparency and alignment across the collaboration process.

    Jira

    Configure a Webhook in Soda Cloud to connect to your Jira workspace.

    In this guide, we will show how you can integrate Soda Cloud Incidents with Jira. After the integration is set up, creating an incident in Soda will automatically trigger the creation of a corresponding bug ticket in Jira. The Jira ticket will include information related to the incident created in Soda, including:

    • The number and title of the Incident

    • The description of the Incident

    • The severity of the incident

    • The status of the incident

    • The user who reported the Incident

    • A link to the Incident in Soda Cloud

    • A link to the associated Check in Soda Cloud

    A link to this Jira ticket will be sent back to Soda and displayed on the Incident page in the Integrations box. Any updates to the status of the Incident in Soda Cloud will trigger corresponding changes to the Status of the Jira ticket. Any updates to the status of the Jira ticket will trigger corresponding changes to the Status of the Incident in Soda Cloud.

    In Jira, you can set up an Automation Rule that enables you to define what you want an incoming webhook to do, then provides you with a URL that you use in the URL field in the Soda Cloud integration setup.

    This integration is built on two webhook events, IncidentCreated and IncidentUpdated (Soda -> Jira; see Event payloads), as well as the Soda Cloud API endpoint for updating incidents (Jira -> Soda; see the API reference).

    Create a Jira project for DQ tickets

    In Jira, start by creating a new project dedicated to tracking data quality tickets. Navigate to Project settings > Work Items, and make sure you have a bug-type work item with the following fields, as shown in the image below:

    • Summary

    • Description

    • Assignee

    • IncidentSeverity

    • IncidentID

    • IncidentURL

    • CheckURL

    From the same page, next click the Edit Workflow button, and make sure your workflow includes the following statuses:

    • Reported

    • Investigating

    • Fixing

    • Resolved

    Automation Rule (Inbound)

    Initialize the webhook-trigger

    Here we will set up the automation in Jira so that when an Incident is created or updated in Soda, then a bug ticket will automatically be created or updated in Jira.

    Navigate to Project settings > Automation, then click Create rule and, for the type of New trigger, select Incoming webhook.

    Under the When: Incoming webhook trigger, click Add a component, select IF: Add a condition, then smart values condition.

    What this means is that, if an incoming webhook has the incidentCreated event, then we will do something.

    Automatic creation of the Jira ticket

    Next we will add another component: THEN: Add an action.

    The action will be to Create work item and the Issue Type should be Bug and the Project should be our new project.

    Next we add some steps to fill out our ticket with extra information obtained from the webhook data.

    We start by creating a branch rule to identify our ticket:

    Then we Edit the ticket fields:

    Finally, the last step in our incident creation workflow is to send a post request back to Soda with a link to the issue in Jira:

    Automatic updates to the Jira ticket

    The remaining parts of this automation rule cover the scenarios where the status of the incident is updated in Soda, then we will detect this change and make the corresponding updates to the issue in Jira.

    When the status changes to Reported:

    The same logic is used for other status changes such as Investigating and Fixing.

    In case the status changes to Resolved, our rule uses a similar logic, but with the additional step of adding resolution notes as a comment to the issue in Jira:

    Once you save/enable this new rule, then you can access a URL and secret that you will provide to Soda when setting up the new webhook integration.

    After saving or enabling the rule, you can view details of the webhook trigger as shown below:

    Define the Webhook integration in Soda

    Next, you create a new webhook integration in Soda and provide the details from the webhook trigger above, as shown in the image below.

    Paste the Webhook URL from Jira into the URL field in Soda and paste the Secret from Jira into a custom HTTP header called X-Automation-Webhook-Token.

    Finally, in the Define Scope tab, make sure to select Incidents - Triggered when users create or update incidents.

    Automation Rule (outbound)

    We will set up a second automation rule in Jira so that when the status of the ticket changes in Jira, these changes are also reflected in Soda.

    First, we set up the trigger for this automation to be when a Work item is transitioned:

    Finally, we send a post request to the Soda Cloud API incidents endpoint, using information from our Jira ticket to update the severity and status of the corresponding incident in Soda:

    Note that the Authorization header value must be formatted like:

    Basic <base64_encoded_credentials>. Base64-encoded credentials can be generated using Soda Cloud API keys in Python like so:

    Integrations

    Soda offers seamless integrations with many tools across your data stack. Whether you're aligning data governance efforts, collaborating across teams, or triggering workflows, you can enhance Soda’s observability capabilities with the following connections:

    Supported integrations

    Messaging and collaboration

    • Slack

    • MS Teams

    • Jira

    • ServiceNow

    • Webhook

    For more details on notification rules, see the Notification rules documentation.

    Catalogs and governance tools

    • Alation

    • Atlan

    • Metaphor

    • Purview

    Data transformation and code repositories

    • Github


    Create an Integration

    To create an integration:

    1. Go to the Integrations section in Settings.

    2. Click the + button to add a new integration.

    3. Select the integration type (Slack, Microsoft Teams, or Webhook).

    4. Follow the setup steps for the chosen integration


    Edit an Integration

    You can update existing integrations if connection details or configurations change.

    To edit an integration:

    1. Go to the Integrations section in Settings.

    2. Find the integration you want to update.

    3. Click the context menu and select Edit Integration Settings.

    4. Update the configuration as needed.

    Pause an Integration

    You can temporarily pause an integration if you want to stop sending notifications and incident updates without fully deleting the configuration. The integration will no longer be available in notification rules.

    To pause an integration:

    1. Go to the Integrations section in Settings.

    2. Locate the integration you want to pause.

    3. Change the status to "Paused" in the table

    4. Select Pause.

    While paused, the integration will no longer send any notifications. You can resume it at any time by following the same steps and selecting Active.

    Quickstart

    This quickstart shows how Soda detects unexpected data issues by leveraging AI-powered Anomaly Detection and prevents future problems by using data contracts. The example uses Databricks, but you can do the same with any other database.

    Scenario

    A data engineer at a retail company needs to maintain the regional_sales dataset so their team can manage regional sales data from hundreds of stores across the country. The dataset feeds executive dashboards and downstream ML models for inventory planning. Accuracy and freshness are critical, so you need both:

    • Automated anomaly detection on key metrics (row counts, freshness, schema drift)

    • Proactive enforcement of business rules via data contracts

    Onboard datasets on Soda Cloud

    Step 1: Connect a data source

    Before you can define contracts, you need to connect Soda Cloud to your data source. This allows Soda to access your datasets for profiling, metric monitoring, and contract verification.

    1.1 Navigate to the Data Sources page

    python -m venv .venv
    source .venv/bin/activate
    pip install -i https://pypi.dev.sodadata.io/simple -U soda-postgres
    pip install --upgrade pip
    export SODA_API_KEY_ID="your_key_id"
    export SODA_API_KEY_SECRET="your_key_secret"
    pip install "soda>=4.0.0b0" --pre -i https://${SODA_API_KEY_ID}:${SODA_API_KEY_SECRET}@enterprise.pypi.cloud.soda.io --extra-index-url=https://pypi.dev.sodadata.io
    pip install soda --pre -i https://${SODA_API_KEY_ID}:${SODA_API_KEY_SECRET}@team.pypi.cloud.soda.io --extra-index-url=https://pypi.dev.sodadata.io
    pip install soda --pre -i https://${SODA_API_KEY_ID}:${SODA_API_KEY_SECRET}@enterprise.pypi.cloud.soda.io --extra-index-url=https://pypi.dev.sodadata.io
    ALATION_HOST=yourcompany.alationcatalog.com
    ALATION_USER=<your username for your Alation account>
    ALATION_PASSWORD=<your password for your Alation account>
    SODA_HOST=cloud.soda.io
    SODA_API_KEY_ID=<your Soda Cloud public key>
    SODA_API_KEY_SECRET=<your Soda Cloud private key>
     - name: Cars
       soda:
         datasource_id: 2d33bf0a-9a1c-4c4b-b148-b5af318761b3
         datasource_name: adventureworks
         # optional dataset_mapping: keys are Soda dataset names, values are catalog dataset names
         dataset_mapping:
            Cars_data: Cars
       catalog:
         type: "alation"
         datasource_id: "31"
         datasource_container_name: "soda"
         datasource_container_id: "1"
     - name: Soda Demo
       soda:
         datasource_id: 8505cbbd-d8b3-48a4-bad4-cfb0bec4c02f
       catalog:
         type: "alation"
         datasource_id: "37"
         datasource_container_name: "public"
         datasource_container_id: "2"
    filter: country = "${var.country}";


    import base64
    api_key_id = "your_api_key_id"
    api_key_secret = "your_api_key_secret"
    
    credentials = f"{api_key_id}:{api_key_secret}"
    encoded_credentials = base64.b64encode(credentials.encode()).decode()
    print(f"Basic {encoded_credentials}")


    -r

    Yes

    The request number. Identifies the request to fetch. Request numbers can be found when reviewing a proposal.

    -p

    No

    The proposal number. Defaults to the latest proposal if not specified. Proposal numbers are shown as the decimal part when reviewing a proposal.

    --soda-cloud, -sc

    Yes

    Path to the Soda Cloud config file (e.g., soda-cloud.yaml).

    --f

    Yes


    Path to the output file where the contract will be written.

  • Automated anomaly detection on key metrics (row counts, freshness, schema drift)

  • Proactive enforcement of business rules via data contracts

    Sign up

    Contact us at [email protected] to get an account set up.

    After signing up, you can follow the steps below to set up a data source and start improving data quality.

    Add a Data Source

    Soda Cloud’s no-code UI lets you connect to any data source in minutes.

    1. In cloud.soda.io or cloud.us.soda.io, click on Data Sources → New Data Source.

    2. Choose your data source provider.

    3. Name your data source under Data Source Label.

    4. Scroll down and fill in the connection credentials for your data source.

    5. Click Connect or Test connection. This will trigger the connection and move to the next step.

    1. Select the datasets you want to onboard on Soda Cloud.

    2. Enable Metric Monitoring. By default, Metric Monitoring is enabled to automatically track key metrics on all the datasets you onboard and alert you when anomalies are detected. It is powered by built-in machine learning that compares current values against historical trends. You can also enable Advanced Monitor Configuration.

    3. Enable Profiling and configure it. By default, Profiling is scheduled daily at 12:00AM UTC.

    4. Click Finish to onboard your datasets. Soda Cloud will now spin up its Soda-hosted Agent and perform an initial Profiling & Historical Metric Collection scan. This usually takes only a few minutes.

    Part 1: Review Anomaly Detection results

    Congratulations, you’ve onboarded your first dataset! Now let’s make sure you always know what’s happening with it.

    That’s where Metric Monitoring comes in. It automatically tracks key metrics like volume, freshness, and schema changes, with no manual setup required. You’ll spot anomalies, detect trends, and catch unexpected shifts before they become problems.

    Step 1: Open the Metric Monitors dashboard

    1. Go to Datasets → select the dataset to inspect.

    2. Navigate to the Metric Monitors tab to learn more about the metrics calculated.

    You'll immediately see that key metrics are automatically monitored by default, helping you detect pipeline issues, data delays, and unexpected structural changes as they happen. No setup needed, just visibility you can trust.

    Step 2: View anomalies in a specific monitor

    In this guide, we will focus on the Most recent timestamp monitor. The panel shows that it was expected to be in a range of 0 - 5m 31s, but the recorded value at scan time was 56m 49s. In order to take a closer look:

    1. Click the Most recent timestamp (or monitor of your choice) block.

    2. In the monitor page you’ll see:

      • measured value vs. expected range,

      • any red-dot anomalies flagged by the model,

      • buttons to Mark as expected, Create new incident, etc.

    3. Flag an outlier as "expected" or investigate it further.

    Soda’s anomaly detection engine was built in-house (no third-party libraries) and optimized for high precision. It continuously adapts to your data patterns, and it incorporates your feedback to reduce false alarms. Designed to minimize false positives and missed detections, it shows a 70% improvement in detecting anomalous data quality metrics compared to Facebook Prophet across hundreds of diverse, internally curated datasets containing known data quality issues.

    The Anomaly Detection Algorithm offers complete control and transparency in the modeling process to allow for interpretability and adaptations. It features high accuracy while leveraging historical data, delivering improvements over time.

    Part 2: Attack the Issues at Source (No-Code)

    Our automated anomaly detection has just done the heavy lifting for you, identifying unusual patterns and potential data issues without any setup required.

    But to prevent those issues from happening again, you must define exactly what your data should look like: every column, every rule, every expectation.

    That’s where Data Contracts come in. They let you proactively set the standards for your data, so problems like this are flagged or even prevented before they impact your business.

    Step 1: Create a Data Contract

    Create a new data contract to define and enforce data quality expectations.

    1. In your Dataset Details page, go to the Checks tab.

    2. Click Create Contract.

    3. When creating a data contract, Soda will connect to your dataset and build a data contract template based on the dataset schema. From this point, you can start adding both dataset-level checks and column-level checks, as well as defining a verification schedule or a partition.

    4. Toggle View Code if you’d like to inspect the generated SodaCL/YAML. This gives you access to the full contract code.

    5. You can copy the following full example, paste it into the editor and edit it as you wish. You can toggle back to no-code view to see and edit the checks in the no-code editor.

    That’s right: with Soda, you can edit a contract using either a no-code interface or directly in code. This ensures an optimal experience for all user personas while also providing a version-controlled code format that can be synced with a Git repository.

    Step 2: Publish & verify

    1. Click Test to verify the contract executes as expected.

    2. When you are done with the contract, click Publish.

    3. Click Verify. Soda will evaluate your rules against the current data.

    Step 3: Review check results

    Review the outcomes of the contract checks to confirm whether the data meets expectations. You can drill into those failures in the Checks tab.

    Part 3: Attack the Issues at Source (Code)

    You can trigger contract verification programmatically as part of your pipeline, so your data gets tested every time it runs.

    We’ve prepared an example notebook to show you how it works:

    Open the following Notebook example: https://colab.research.google.com/drive/1zkV_2tLJ4ohdzmKGS3LgdFDDnTNTUXew?usp=sharing

    In your Python environment, first install the Soda Core library

    Then, in the same environment, create a soda-cloud.yml file that contains your API keys, which are necessary to connect to Soda Cloud. You can create this YAML file from your Profile: Generate API keys

    The soda-cloud.yml file should look like the following:
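    For example (replace the placeholders with the API key values you generated from your profile):

    soda_cloud:
      host: cloud.soda.io                ## Or cloud.us.soda.io for the US region
      api_key_id: YOUR_API_KEY_ID
      api_key_secret: YOUR_API_KEY_SECRET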

    Now you are ready to trigger the verification of the contract. To do that just provide the identifier of your dataset as well as the path to the configuration file you just created in the previous step. This will trigger a verification using Soda Agent and return the logs.

    Create a verify_contract.py file in your environment with the code below (or run it from a Jupyter notebook/Python interpreter):

    You can learn more about the Python API here: Python API

    You’ve completed the tutorial and are now ready to start catching data quality issues with Soda

    What’s Next?

    • Explore Profiling in the Discover tab to curate column selections for deeper analysis.

    • Set up Notification Rules (bell icon → Add Notification Rule) to push alerts to Slack, Jira, PagerDuty, etc.

    • Dive into Custom Monitors via scan.yml or the UI for even more tailored metrics.



    From the top navigation bar in Soda Cloud, click Data Sources.

    On this page, you’ll see a list of connected sources and an Add Data Source button.

    You need the "Manage data sources" global permission to add a new data source.

    Learn more about Global and Dataset Roles

    1.2 Add a new Data Source

    Click Add Data Source and select your data source from the list of supported data source types.

    After selecting a source, you’ll be presented with a configuration form.

    1.2.1 Data source label

    Enter a friendly, unique label. A unique name will be automatically generated from this label. This becomes the immutable ID of the data source and can also be used to reference the same connection in Soda Core.

    1.2.2 Choose your agent

    You’ll be asked to select an agent. This is the component that connects to your data source and runs scans.

    You can choose from:

    • Soda-hosted agent – Quickest option, fully managed by Soda (recommended for getting started)

    • Self-hosted agent – For custom or secure deployments where you manage the agent yourself

    Learn more about deployment options: Deployment options

    1.2.3 Secure your credentials with secrets

    You’ll need to fill in the connection details. Soda uses the official Python packages for each supported data source, which means you can define any properties required by those libraries, flexibly and reliably.

    This includes common fields like host, port, database name, username, and more, depending on the data source.

    1.2.4 Using secrets for sensitive credentials

    For sensitive values such as passwords, tokens, or keys, you should use Soda Secrets instead of entering them directly in the configuration.

    • Secrets are encrypted and securely stored in Soda Cloud.

    • They can be safely referenced in your data source configuration without exposing them in plain text.

    To add secrets:

    1. Navigate to the Data Sources tab in the top navigation.

    2. Click the Secrets tab.

    3. Define key-value pairs for your sensitive credentials.

    You can then reference a secret in your data source configuration using this syntax:
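    ${secret.SECRET_NAME}

    For example, here is a sketch of PostgreSQL-style connection properties; the field names are illustrative and depend on the data source type, and POSTGRES_PASSWORD is a hypothetical secret defined on the Secrets tab:

    host: db.example.com
    port: 5432
    database: analytics
    username: soda_agent
    password: ${secret.POSTGRES_PASSWORD}   # resolved from the Soda Secret at runtime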

    This ensures your sensitive values stay secure while still being accessible to the agent at runtime.

    1.2.5 Test and Connect

    Once the form is complete:

    • Click Test Connection to validate that Soda can successfully connect to your data source.

    • If the test passes, click Connect to finalize the setup.

    Step 2: Onboard Datasets

    After connecting, Soda will perform an automated dataset discovery. Soda triggers a scan that analyzes the datasets and retrieves their metadata, including columns and column data types. This reduces manual setup efforts, ensures data coverage in your environment and keeps Soda's dataset inventory aligned with your data sources. This feature allows other Soda features to work seamlessly:

    • Contract generation

    • Automated discovery of time partition column

    • Automated discovery of Primary Keys for Diagnostics Warehouse

    Choose a dataset selection strategy

    Dataset selection can be manual or rules-based.

    • Manual selection allows you to browse a directory view of all the datasets in your data source.

    The Scope can range from the entire data source to a specific schema. Any element selected on the left panel becomes the scope of the dataset search.

    Manual selection is built for scale; it can easily handle thousands of schemas and hundreds of thousands of datasets.

    Datasets that have already been onboarded will not be visible in the manual dataset selection.

    • Rules-based selection allows you to automate the dataset onboarding process, only selecting those which match specified rules.

    Rules-based selection includes existing and future datasets that match the conditions.

    Soda will run hourly discovery scans on the data source. When Soda discovers a new dataset that matches the conditions set in the rules, it will automatically onboard it.

    You can choose the specific schemas where your datasets of interest are located.

    You can add rules to include or exclude datasets that match certain conditions, such as "name contains" or "name starts with", or provide your own regex pattern.

    After selecting a scope, you can filter datasets using inclusion or exclusion rules.
    To create a rule, choose a condition that datasets must match in order to be onboarded.

    In the example below, only datasets whose name does not start with "dwh" from the public schema will be onboarded.

    A Rule Name can be provided to identify this dataset selection rule.

    Once you click on Validate rule, Soda will calculate how many datasets currently match the defined conditions:

    Click on Next to finish the process.


    Once the onboarding process is finished (after Step 3: Enable Metric Monitoring & Profiling (optional)), an overview of the Onboarding Rules will be provided. From this view, rules can be edited or deleted.

    • Rules will be executed in order of appearance on this view.

    • The order of the rules can be changed. As soon as a dataset matches a rule, it will be onboarded automatically; datasets can only be onboarded once.


    Once onboarded, datasets will appear in your Soda Cloud UI and become available for contract creation or metric monitoring.

    Refresh dataset discovery: Soda runs discovery scans hourly to get the latest view of tables and schemas within a data source. By pressing on the icon on the top right of the page, you can run the scan on demand.

    Step 3: Enable Metric Monitoring & Profiling (optional)

    Through Metric Monitoring, you can enable built-in monitors to automatically track row counts, schema changes, freshness, and more across your datasets. This step is optional but recommended. This can be enabled in bulk when onboarding data sources and datasets.

    Learn more about Metric Monitoring: Metric Monitoring dashboard

    1. Toggle on Metric Monitoring

    When metric monitoring is enabled, you can later add column monitors at the dataset level or override any of the settings.

    2. Set a Monitoring Schedule

    The monitoring schedule defines when Soda scans a dataset to capture and evaluate metrics. While scans may run slightly later due to system delays, Soda uses the actual execution time, not the scheduled time, when visualizing time-sensitive metadata metrics like insert lag or row count deltas. This ensures accuracy.

    Data-based metrics like averages or null rates are not affected by small delays, as Soda only scans complete partitions, keeping these metrics stable and reliable.

    Scans can be scheduled to occur from hourly to weekly, depending on your needs.

    Learn more about how to pick a scan time.

    3. Toggle on/off Historical Metric Collection

    When Historical Metric Collection is enabled, Soda automatically calculates past data quality metrics through backfilling and applies the anomaly detection algorithm to that historical data through backtesting. This gives you immediate visibility into past data quality issues, even before monitoring was activated. The historical data also helps train the anomaly detection algorithm, improving its accuracy from day one. You can specify a start date to control how far back the backfilling process should begin.

    4. Suggest a Time Partition Column

    Metrics that are not based on metadata require a time partition column to group data into daily intervals or 24-hour buckets, depending on the monitoring schedule. This column must be a timestamp field, ideally something like a created_at or last_updated column. It's important that this timestamp reflects when the data arrives in the database, rather than when the record was originally created.

    Soda uses a list of suggested time partition columns to determine which column to apply. If multiple columns are suggested, Soda checks them in the order they are listed, starting with the first. It will try to match one by validating that the column is a proper timestamp and suitable for partitioning.

    If none of the suggested columns match, Soda falls back to a heuristic approach. This heuristic looks at metadata, typical naming conventions, and column content to infer the most likely time partition column.

    If the heuristic fails to find a suitable column or selects the wrong one, the time partition column can be manually configured after onboarding under dataset settings.

    Available monitors will be enabled by default based on the information on the datasets.

    Click on Next.

    5. Enable Profiling (optional)

    Learn more about Profiling.

    From this view, you can also enable Failed row collection if Diagnostics Warehouse is enabled for this data source.

    Click on Finish. If you used Rules-based selection to onboard datasets, an Active Onboarding Rule Pipeline view will appear now to confirm the conditions.

    Step 4: Access the datasets

    Once onboarding is completed, your data source will appear in the Data Sources list. You can click the Onboarded Datasets link to access the connected datasets.

    🎉 Congrats! You’ve successfully onboarded your data source. You’re now ready to create data contracts and start monitoring the quality of your data.


    Onboard new datasets

    Note that you can repeat the dataset onboarding process at any time to add more datasets from the same data source. Datasets that have previously been onboarded will not re-appear in the dataset selection step. Simply return to the data source page and click Onboard Datasets to update your selection.

    You need the Manage data sources global permission to add a new data source. Learn about Global and Dataset Roles



    Dataset monitors

    Learn more about Metric Monitors that run scans at a dataset level.

    What is a dataset monitor?

    A dataset monitor in Soda tracks a specific high-level metric for an entire table (or partition) over time. It helps detect unusual patterns or unexpected changes in overall data health, such as sudden spikes or drops in row count, delays in fresh data, or schema drift.

    You can find dataset monitors by opening the Metric Monitors tab on any dataset and looking at the top section labeled “Dataset Monitors.” This section lists all active dataset monitors, both metadata-based and partition-based, as an overview of monitor cards. The overview shows, at a glance, the status of each monitor, the value from the last scan, and any detected anomalies, giving you a one-look summary of the health of your data systems.

    Unlike column monitors, which are configured at the dataset level but target individual columns, dataset monitors apply to the entire table (or its latest partition) and capture broad indicators of data quality. When the necessary data and metadata are available, dataset-level monitors work out of the box with no further configuration needed.

    Deploy a Soda Agent in an Azure AKS cluster

    Soda-hosted agents are included in all Free, Team, and Enterprise plans at no additional cost. However, self-hosted agents require an Enterprise plan.

    If you wish to use self-hosted agents, please contact us (https://www.soda.io/contact) to discuss Enterprise plan options, or reach out via the support portal if you are an existing customer.

    Global and Dataset Roles

    Soda Cloud uses Global Roles and Dataset Roles to manage access and permissions. These roles ensure users and user groups have the right level of access based on their responsibilities.

    Global Roles

    Global roles define permissions across the entire organization in Soda Cloud.

    By default, Soda Cloud provides two Global Roles: Admin and User. You can create custom roles with a subset of the permissions.

    Permission Group
    Descriptions

    Soda Agent Extra

    When you deploy a self-hosted Soda Agent to a Kubernetes cluster in your cloud service provider environment, you need to provide several key parameters and values to ensure optimal operation and to allow the agent to connect to your Soda Cloud account (API keys), and connect to your data sources (data source login credentials) so that Soda can run data quality scans on the data.

    Handle sensitive values

    By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

    As these values are sensitive, you may wish to employ the following alternative strategies to keep them secure.

    soda request fetch -r 7 -p 1 -sc soda-cloud.yaml --f ./contracts/ecommerce_orders.yaml
    dataset: databricks_demo/unity_catalog/demo_sales_operations/regional_sales
    filter: |
      order_date >= ${var.start_timestamp}
      AND order_date < ${var.end_timestamp}
    variables:
      start_timestamp:
        default: DATE_TRUNC('week', CAST('${soda.NOW}' AS TIMESTAMP))
      end_timestamp:
        default: DATE_TRUNC('week', CAST('${soda.NOW}' AS TIMESTAMP)) + INTERVAL '7 days'
    checks:
      - row_count:
      - schema:
    columns:
      - name: order_id
        data_type: INTEGER
        checks:
          - missing:
              name: Must not have null values
      - name: customer_id
        data_type: INTEGER
        checks:
          - missing:
              name: Must not have null values
      - name: order_date
        data_type: DATE
        checks:
          - missing:
              name: Must not have null values
          - failed_rows:
              name: Cannot be in the future
              expression: order_date > DATE_TRUNC('day', CAST('${soda.NOW}' AS TIMESTAMP)) +
                INTERVAL '1 day'
              threshold:
                must_be: 0
      - name: region
        data_type: VARCHAR
        checks:
          - invalid:
              valid_values:
                - North
                - South
                - East
                - West
              name: Valid values
      - name: product_category
        data_type: VARCHAR
      - name: quantity
        data_type: INTEGER
        checks:
          - missing:
              name: Must not have null values
          - invalid:
              valid_min: 0
              name: Must be higher than 0
      - name: price
        data_type: NUMERIC
        checks:
          - invalid:
              valid_min: 0
              name: Must be higher than 0
          - missing:
              name: Must not have null values
      - name: payment_method
        data_type: VARCHAR
        checks:
          - missing:
              name: Must not have null values
          - invalid:
              threshold:
                metric: count
                must_be: 0
              filter: region <> 'north'
              valid_values:
                - PayPal
                - Bank Transfer
                - Cash
                - Credit Card
              name: Valid values in all regions except North
          - invalid:
              name: Valid values in North
              filter: region = 'north'
              valid_values:
                - PayPal
                - Bank Transfer
                - Credit Card
              qualifier: ABC124
    
    pip install -i https://pypi.dev.sodadata.io/simple -U soda-core
    soda_cloud:
      host: cloud.soda.io                ## Or cloud.us.soda.io
      api_key_id: YOUR_API_KEY_ID        ## Replace with your actual key ID
      api_key_secret: YOUR_API_KEY_SECRET    ## Replace with your actual key secret
    from soda_core import configure_logging
    from soda_core.contracts import verify_contracts_on_agent

    # Configure logging (verbose=False keeps the output concise)
    configure_logging(verbose=False)

    # Trigger contract verification for the dataset on the Soda Agent via Soda Cloud
    res = verify_contracts_on_agent(
        dataset_identifiers=["databricks_demo/unity_catalog/demo_sales_operations/regional_sales"],
        soda_cloud_file_path="soda-cloud.yml",
    )

    # Print the logs returned by the verification run
    print(res.get_logs())
    ${secret.SECRET_NAME}

    Types of dataset monitors

    Soda supports two categories of dataset‐level monitors: those that rely purely on system metadata, and those that compute values by querying a designated time‐partition column. Below is an in‐depth description of each built‐in monitor. For a more detailed discussion of monitors based on querying the metadata vs monitors based on querying the data, see the Metadata vs Data-Based section in this page.

    Dataset monitor type
    Monitor
    Description

    Based on metadata

    Total row count

    The total number of rows in the dataset at scan time.

    Total row count change

    Change in total row count compared to the previous scan

    Last modification time

    Most recent time the data was changed relative to the last scan

    Schema changes

    Monitors based on time partition columns look at data in the most recent partition based on a timestamp. If data is altered in an old partition, it will not be evaluated.

    For example, data inserted today with timestamp of 2 days ago will not be evaluated if the partition interval is 1 day.

    For Schema Changes, the expected result is always to have no schema changes, regardless of whether there have been frequent schema changes in the past or not.

    Understanding the monitor card

    The dashboard provides a health table summarizing an overview of the monitors. Each monitor card is clickable and links to the Anomaly History page of the metric.

    Each monitor card will have the following information:

    • Monitor name: the given name of the specific monitor.

    • Monitor explanation: a brief description of the metric used.

    • Status: ✅ healthy / ⚠️ violated

    • Today's value at scan time: last recorded value.

    • Expected range: calculated by the anomaly detection algorithm, based on historical data.

    • Trend line with last 7 observations: a sparkline that shows an overview of the monitor plot.

    • Bell icon: to enable/disable opt-in alerts.

    Metadata vs Data based monitors

    Dataset-level monitors fall into two categories depending on their source of truth:

    Metadata-based dataset monitors

    Metadata-based monitors rely solely on system metadata exposed by your data warehouse; fields like “row count,” “last modified time,” or “schema version” that the catalog provides without scanning table rows. Because they don’t touch actual data, metadata monitors are extremely efficient and run quickly. They alert you if your table grows, shrinks, stops updating, or changes structure.

    Data-based dataset monitors

    Data-based monitors look directly at the contents of a designated time-partition column (e.g., a date or timestamp field) and compute a value from the rows in that partition. Examples include “Partition Row Count” (how many rows landed in today’s partition) or “Most Recent Timestamp” (the newest timestamp in that partition). Data-based monitors require a full scan of each partition they monitor, but they capture freshness and volume signals that metadata alone cannot provide. If your dataset has no time-partition column defined (or your warehouse can’t surface the needed metadata), Soda will disable the appropriate monitors so you only see the metrics that can be collected.

    Configure Dataset Monitors

    Use the Configure Dataset Monitors panel to pick which built-in metadata and partition-based metrics you want Soda to track at the dataset level.

    1. Open the panel → From any dataset’s Metric Monitors dashboard, click Configure Dataset Monitors.

    2. Enable or disable → Toggle metrics on/off directly from here. If the data source doesn't support a given metric, it will be automatically off.

    3. Modify the monitor

    4. Auto-apply → Changes take effect immediately for the next scan. Simply close the panel when you’re done.

    Time partition column

    Many data‐based monitors—such as Partition Row Count and Most Recent Timestamp—rely on a designated “time partition” column to know which slice of data to scan. The time partition column should be a date or timestamp field that naturally groups rows into discrete, regularly updated partitions (for example, a daily order_date or event_time). When Soda cannot detect a time partition column, metrics based on that data will not be available.

    What's a good time partition column?

    A good time partition column meets all of the following criteria:

    1. Date or timestamp type: Each row contains a valid date (or timestamp) value.

    2. Regular arrival cadence: New rows for each date/timestamp appear on a predictable schedule (e.g., daily, hourly).

    3. Reflects ingestion/arrival time: The column’s value must correspond to when the record actually landed in this dataset, not when it was originally created upstream. The partition column should always show arrival date to the dataset so freshness checks remain accurate.

    4. Logical partition boundary: It matches how you want to slice your data (e.g., order_date for daily sales, event_time for hourly logs).

    When these conditions hold, partition-based monitors will reliably focus on the correct slice of data—namely, the rows that truly arrived during each time window—so any delays or backfills become immediately visible.

    Suggest a time partition column during onboarding

    When you onboard a new dataset from your data source, Soda attempts to automatically detect the most likely time partition column. You can:

    • Finish onboarding without editing the Time Partition Column field, allowing Soda to detect it.

    • Suggest a Time Partition Column of your choice, forcing Soda to use that one for monitoring.

    Find a time partition column

    If you ever need to confirm or search for the right partition column:

    1. Navigate to the Datasets page, select your dataset, and click the Columns tab.

    2. Search for columns with "timestamp" in the name. Any column with a date or timestamp data type is a candidate.

    Manually override time partition column

    After onboarding, you can override the time partition column at any time. Changing it will reset Soda’s anomaly detection model for partition‐based metrics, so you’ll be retraining on historical data under the new partition definition. To override:

    1. Access the Dataset Settings

    • Navigate to the Datasets tab

    • From this list or from the dataset page itself, click on the (⋮) menu > Edit Dataset

    2. Find Time Partition Columns

    • Click on the Profiling & Metric Monitoring tab

    Here you will see the current column being used for Time Partition.

    • Reveal the Time Partition Column drop-down menu

    This will show all date and timestamp columns that can be used as a Time Partition Column.

    3. Select your new Time Partition Column

    Changing this column resets the model and historical baselines.

    4. Click Save. Soda will:

      • Reset the partition‐based monitors (Partition Row Count, Most Recent Timestamp) to “training mode” and retrain baselines on the new partition.

      • Preserve any metadata‐based monitors (Total Row Count, Schema Changes) unchanged.

    By following these steps, you ensure that Soda’s data‐based monitors always reference the correct daily (or hourly) slice of your dataset, so partition‐level metrics and freshness checks produce accurate results.

    Unavailable metadata

    When Soda Cloud cannot obtain the underlying metadata required to calculate a dataset-level metric, it prevents you from configuring or viewing a metric that would always fail. There are two cases:

    Non-retrievable metadata from data source

    If a connected data source cannot provide the required metadata for a given dataset-level metric, such as row counts or schema timestamps, Soda will automatically disable that metric both on the Metric Monitors dashboard and in the Configure Dataset Monitors panel so you only see and configure metrics that your source can actually collect.

    Unavailable metadata (history)

    Some warehouses expose current metadata but don’t provide historical snapshots (for example, systems that only track the latest row count). In this case, Soda will compute the metric starting from your very first scan, but it cannot backfill any history prior to that point. As a result, anomaly detection baselines for that metric begin only at scan #1 and there is no retroactive historical data to train against.

    The Schema changes monitor will not add historical metadata and backfilling will not be available, unlike with other metrics. The monitor only starts recording from the moment the dataset is onboarded.

    Missing metric values

    Even when a metric is enabled and historical baselines exist, you may occasionally see gaps due to delayed or skipped scans. A “missing” metric indicates that Soda attempted to run the scan but did not receive a valid result for that metric, either because the scan agent was down, the query timed out, or metadata couldn’t be retrieved in time. Missing values do not count as anomalies; they simply mark a gap in the time series.

    In Soda Cloud, you can identify these gaps as follows:

    • On the Metric Monitors dashboard, any missing value is shown either as a grey point or an empty checkbox in the metric sparkline.

    • In the detailed anomaly plot, missing points render as open circles (◯) along the timeline, and the trend line becomes dashed.

    • In Schema changes, no plot is available since the expected value is always 0. Hovering over an empty checkbox will display “No measurement” in the tooltip, making it easy to distinguish a gap from a healthy measurement or a flagged anomaly.

    These visual cues allow you to immediately recognize when a scan didn’t complete successfully, so you can investigate and restore full observability before critical issues go unnoticed.

    Supported data sources

    Data source | Onboarding | Metadata | Metadata history | Querying data
    Databricks | June 6th | ✅ | ✅ | ✅
    Snowflake | June 6th | ✅ | September 1st | ✅



    Prerequisites
    • You have an Azure account and the necessary permissions to enable you to create, or gain access to an existing AKS cluster in your region. Consult the Azure access control documentation for details.

    • You have installed the Azure CLI tool. This is the command-line tool you need to access your Azure account from the command-line. Run az --version to check the version of an existing install. Consult the Azure Command-Line Interface documentation for details.

    • You have logged in to your Azure account. Run az login to open a browser and log in to your account.

    • You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have already installed the Azure CLI tool, you can install kubectl using the following command: az aks install-cli. Run kubectl version --output=yaml to check the version of an existing install.

    • You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.

    • You have whitelisted the required Soda Cloud URLs, depending on whether you are using Soda EU (cloud.soda.io) or Soda US (cloud.us.soda.io); see the troubleshooting section below for the domains to allow.

    System requirements

    Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.

    Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware that allocating too many resources may be costly relative to the small benefit of improved scan times.

    To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.
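    As a shape-only sketch of such a values.yml section (the soda.agent.resources key path for the agent-orchestrator is an assumption; soda.scanlauncher.resources is named above), with x replaced by values suited to your workload:

    soda:
      agent:
        resources:            # agent-orchestrator resources (key path assumed)
          requests:
            cpu: x
            memory: x
          limits:
            cpu: x
            memory: x
      scanlauncher:
        resources:            # scan-launcher resources
          requests:
            cpu: x
            memory: x
          limits:
            cpu: x
            memory: x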

    For reference, a Soda-hosted agent specifies resources as follows:

    Deploy an agent

    The following table outlines the ways you can install the Helm chart to deploy a Soda Agent in your cluster.

    Method
    Description
    When to use

    Install the Helm chart via CLI by providing values directly in the install command.

    Use this as a straightforward way of deploying an agent on a cluster.

    Install the Helm chart via CLI by providing values in a values YAML file.

    Use this as a way of deploying an agent on a cluster while keeping sensitive values secure: provide sensitive API key values in this local file or in an external secrets manager, and store data source login credentials as environment variables in this local file. Soda needs access to the credentials to be able to connect to your data source to run scans of your data.

    Deploy using CLI only

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.

    3. Use Helm to add the Soda Agent Helm chart repository.

    4. Use the following command to install the Helm chart which deploys a Soda Agent in your cluster. (Learn more about the helm install command.)

      • Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

      • Replace the value of soda.agent.name with a custom name for your agent, if you wish.

      • Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

      • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

      • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

        The command-line produces output like the following message:

    5. (Optional) Validate the Soda Agent deployment by running the following command:

    6. In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    Deploy using a values YAML file

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.

    3. Use Helm to add the Soda Agent Helm chart repository.

    4. Using a code editor, create a new YAML file called values.yml.

    5. To that file, copy+paste the content shown in the sketch after these steps, replacing the following values:

      • id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

      • Replace the value of name with a custom name for your agent, if you wish.

    6. Save the file. Then, create a namespace for the agent.

    7. In the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.

    8. (Optional) Validate the Soda Agent deployment by running the following command:

    9. In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
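    As referenced in step 5 above, a minimal values.yml could look like the following sketch; the agent name and endpoint shown are placeholders to replace with your own values:

    soda:
      apikey:
        id: "***"                           # value copied from the New Soda Agent dialog
        secret: "***"                       # value copied from the New Soda Agent dialog
      agent:
        name: "my-soda-agent"               # any unique name you choose
      cloud:
        endpoint: "https://cloud.soda.io"   # or https://cloud.us.soda.io for the US region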

    About the helm install command

    Command part
    Description

    helm install

    the action helm is to take

    soda-agent (the first one)

    a release named soda-agent on your cluster

    soda-agent (the second one)

    the name of the helm repo you installed

    soda-agent (the third one)

    the name of the helm chart that is the Soda Agent

    The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with --set flags as this command does, or you can specify the override values using a values.yml file.

    Parameter key
    Parameter value, description

    --set soda.agent.name

    A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.

    --set soda.apikey.id

    With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.apikey.secret

    With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.agent.logFormat

    (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

    --set soda.agent.loglevel

    (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

    --namespace soda-agent

    Use the namespace value to identify the namespace in which to deploy the agent.

    Decommission the Soda Agent and the AKS cluster

    1. Delete everything in the namespace which you created for the Soda Agent.

    2. Delete the cluster. Be patient; this task may take some time to complete.

    Troubleshoot deployment

    Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.

    Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:

    • Use https://cloud.us.soda.io for the United States

    • Use https://cloud.soda.io for all else

    Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.

    Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:

    • cloud.us.soda.io for Soda Cloud account created in the US region OR

    • cloud.soda.io for Soda Cloud account created in the EU region AND

    • collect.soda.io

    Problem: When you attempt to create a cluster, you get an error that reads, An RSA key file or key value must be supplied to SSH Key Value. You can use --generate-ssh-keys to let CLI generate one for you.

    Solution: Run the same command to create a cluster but include an extra line at the end to generate RSA keys.



    Admin
    User

    Manage data sources and agents

    • Allows deploying a new Soda Agent as well as configuring data source connections in Soda Cloud.

    ✓

    Create new datasets and data sources with Soda Core library

    • Allow the creation of new data sources in Soda Cloud when using the Soda Core library.

    • Allows onboarding datasets in Soda Cloud on data sources connected with a Soda Agent.

    ✓

    ✓

    Manage attributes

    • Allows defining which dataset and check attributes are available to use in the organization.

    ✓

    Manage notification rules

    • Allows managing how notifications are sent.

    ✓

    ✓

    Manage organization settings

    Create Custom Global Roles

    You can create custom global roles to match your organization’s needs.

    To create a global role:

    1. Go to the Global Roles section in Settings.

    2. Click Add Global Role to create a new role.

    3. Enter a name for the role.

    4. Select the permissions the role should have.

    5. Click Save.

    Edit Custom Global Roles

    You can edit global roles at any time to adjust permissions as your organization’s needs evolve.

    To edit a global role:

    1. Go to the Global Roles section in Settings.

    2. Find the global role you want to modify.

    3. Click the context menu next to the role and select Edit Global Role.

    4. Adjust the role’s name and permissions as needed.

    5. Click Save to apply your changes.

    Assign Members to Global Roles

    You can assign roles to individual users or user groups to grant them the associated permissions.

    To assign a global role:

    1. Go to the Global Roles section in Settings.

    2. Find the global role you want to assign.

    3. Click the context menu next to the role and select Assign Members.

    4. Select the users or user groups that should have the global role.

    5. Click Save to apply your changes.

    You can also assign roles on the Users and User groups tabs:

    • For users: User management

    • For user groups: User management

    Dataset roles

    Dataset roles define permissions for specific datasets.

    By default, Soda Cloud provides three Dataset Roles: Manager, Editor, and Viewer. You can create custom roles with a subset of the permissions.

    Permission Group
    Description
    Manager
    Editor
    Viewer

    View dataset

    Access the dataset and view checks

    ✓

    ✓

    ✓

    Access dataset profiling and samples

    Allow users to see insights about the data

    ✓

    ✓

    ✓

    Create Custom Dataset Roles

    You can create custom dataset roles to match your organization’s needs.

    To create a dataset role:

    1. Go to the Dataset Roles section in Settings.

    2. Click Add Dataset Role to create a new role.

    3. Enter a name for the role.

    4. Select the permissions the role should have.

    5. Click Save to apply your changes.

    Edit Dataset Roles

    You can edit dataset roles at any time to adjust permissions as your organization’s needs evolve.

    To edit a dataset role:

    1. Go to the Dataset Roles section in Settings.

    2. Find the dataset role you want to modify.

    3. Click the context menu next to the role and select Edit Dataset Role.

    4. Adjust the role’s name and permissions as needed.

    5. Click Save to apply your changes.

    Assign dataset responsibilities

    Responsibilities in Soda Cloud define who has access to a dataset and what they are allowed to do. They are assigned by mapping users or user groups to a dataset role.

    This ensures that the right people have the appropriate permissions for each dataset, such as the ability to manage checks, propose new rules, or view profiling information.

    For example:

    • Assign a Manager role to a dataset owner who needs full control.

    • Assign a Viewer role to a business user who only needs to monitor data quality results.

    By assigning responsibilities, you ensure clear access control, accountability, and governance across your datasets.

    Learn about how to set up responsibilities on a dataset: Dataset Attributes & Responsibilities

    Define default responsibilities

    For the dataset owner

    Soda Cloud allows you to define default responsibilities for the dataset owner, which will automatically be granted for all dataset owners. This ensures that all users have a consistent baseline level of access unless you choose to customize it.

    By default, all dataset owners have the "Manager" role.

    How to Configure Default Responsibilities

    1. Go to the Organization Settings page in Soda Cloud.

    2. Locate the Datasets Roles section.

    3. Select the dataset role to assign to the Dataset Owners.

    4. Click Save at the top right of the page to apply your changes.

    For everyone

    Soda Cloud allows you to define default responsibilities for the Everyone group, which will automatically apply to all newly onboarded datasets. This ensures that all users have a consistent baseline level of access unless you choose to customize it.

    By default:

    • The Everyone group is assigned as a "Viewer" for all new datasets.

    • This setting applies to all users in your organization unless disabled.

    You can either customize the default role or disable the default responsibilities if you do not want the Everyone group to receive any automatic access to new datasets.

    How to Configure Default Responsibilities

    1. Go to the Organization Settings page in Soda Cloud.

    2. Locate the Datasets Roles section.

    3. Select the dataset role to assign to the Everyone group for new datasets.

    4. To disable default responsibilities, toggle the feature off.

    5. Click Save at the top right of the page to apply your changes.



    Use a values YAML file to store API key values

    When you deploy a self-hosted Soda Agent from the command-line, you provide values for the API key id and API key secret which the agent uses to connect to your Soda Cloud account. You can provide these values during agent deployment in one of two ways:

    • directly in the helm install command that deploys the agent and stores the values as Kubernetes secrets in your cluster; see deploy using CLI only OR

    • in a values.yml file which you store locally and reference in the helm install command.

    Refer to the exhaustive cloud service provider-specific instructions for more detail on how to deploy an agent using a values YAML file.

    Use a values file to store private key authentication values

    If you use private key authentication with Snowflake or BigQuery, you can provide the required private key values in a values.yml file when you deploy or redeploy the agent.

    • Private key authentication with Snowflake

    • Private key authentication with BigQuery

    Use environment variables to store data source connection credentials

    When you, or someone in your organization, follows the guided steps to use a self-hosted Soda Agent to add a data source in Soda Cloud, one of the steps involves providing the connection details and credentials Soda needs to connect to the data source to run scans.

    You can add those details directly in Soda Cloud, but because any user can then access these values, you may wish to store them securely in the values YAML file as environment variables.

    1. Create or edit your local values YAML file to include the values for the environment variables you input into the connection configuration.

    2. After adding the environment variables to the values YAML file, update the Soda Agent using the following command:

    3. In step 2 of the add a data source guided steps, add a data source connection configuration that looks something like the PostgreSQL example shown after these steps. Note the environment variable values for username and password.

    4. Follow the remaining guided steps to add a new data source in Soda Cloud. When you save the data source and test the connection, Soda Cloud uses the values you stored as environment variables in the values YAML file you supplied during redeployment.
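    Here is a sketch of the PostgreSQL connection configuration referenced in step 3; the field names are illustrative, and the ${POSTGRES_USERNAME} and ${POSTGRES_PASSWORD} references assume those names were defined as environment variables in your values YAML file:

    type: postgres
    host: db.example.com
    port: 5432
    database: analytics
    username: ${POSTGRES_USERNAME}   # resolved from the agent's environment at scan time
    password: ${POSTGRES_PASSWORD}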

    Integrate with a secrets manager

    Use External Secrets Operator (ESO) to integrate your self-hosted Soda Agent with your secrets manager, such as a Hashicorp Vault, AWS Secrets Manager, or Azure Key Vault, and securely reconcile the login credentials that Soda Agent uses for your data sources.

    Say you use a Hashicorp Vault to store data source login credentials and your security protocol demands frequent rotation of passwords. In this situation, the challenge is that apps running in your Kubernetes cluster, such as a Soda Agent, need access to the up-to-date passwords.

    To address the challenge, you can set up and configure ESO in your Kubernetes cluster to regularly reconcile externally-stored password values so that your apps always have the credentials they need. Doing so obviates the need to manually redeploy a values YAML file with new passwords for apps running in the cluster each time your system refreshes the passwords.

    The current integration of Soda Agent and a secrets manager does not yet support the configuration of the Soda Cloud credentials. For those credentials, use a tool such as helm-secrets or vals.

    To integrate Soda Agent with a secret manager, you need the following:

    • External Secrets Operator (ESO) which is a Kubernetes operator that facilitates a connection between the Soda Agent and your secrets manager

    • a ClusterSecretStore resource which provides a central gateway with instructions on how to access your secret backend

    • an ExternalSecret resource which instructs the cluster on what values to fetch, and references the ClusterSecretStore

    Read more about the ESO’s Resource Model.

    The following procedure outlines how to use ESO to integrate with a Hashicorp Vault that uses a KV Secrets Engine v2. Extrapolate from this procedure to integrate with another secrets manager such as:

    • AWS Secrets Manager

    • Azure Key Vault

    Prerequisites

    • You have set up a Kubernetes cluster in your cloud services environment and deployed a self-hosted Soda Agent in the cluster.

    • For the purpose of this example procedure, you have set up and are using a Hashicorp Vault which contains a key-value pair for POSTGRES_USERNAME and POSTGRES_PASSWORD at the path local/soda.

    Install and set up the External Secrets Operator

    Consider referencing the use case guide for integrating an External Secrets Manager with a Soda Agent which offers step-by-step instructions to set everything up locally to see the integration in action.

    1. Use helm to install the External Secrets Operator from the Helm chart repository into the same Kubernetes cluster in which you deployed your Soda Agent.

    2. Verify the installation using the following command:

    3. Create a cluster-secret-store.yml file for the ClusterSecretStore configuration. The details in this file instruct the Soda Agent how to access the external secrets manager vault. This example uses Hashicorp Vault AppRole authentication. AppRole authenticates with Vault using the App Role auth mechanism to access the contents of the secret store. It uses the SecretID in the Kubernetes secret, referenced by secretRef and the roleID, to acquire a temporary access token so that it can fetch secrets. Access the external-secrets.io documentation for configuration examples.

    4. Deploy the ClusterSecretStore to your cluster.

    5. Create a soda-secret.yml file for the ExternalSecret configuration (a sketch is shown after these steps). The details in this file instruct the Soda Agent which values to fetch from the external secrets manager vault.

      This example identifies:

      • the namespace of the Soda Agent

      • the two remoteRef entries to fetch (POSTGRES_USERNAME and POSTGRES_PASSWORD)

    6. Deploy the ExternalSecret to your cluster.

    7. Use the following command to get the ExternalSecret to authenticate to the Hashicorp Vault using the ClusterSecretStore and fetch secrets.

      Output:

    8. Prepare a values.yml file to deploy the Soda Agent with the existingSecrets parameter that instructs it to access the ExternalSecret file to fetch data source login credentials. Refer to the complete deployment instructions, or redeploy the agent if you already have one running in a cluster.

    9. Deploy the Soda Agent using the following command:

      Output:

    Use Soda Cloud API Keys from an existing secret

    By default, the Soda Agent creates a secret for storing the Soda Cloud API Key details securely in your cluster. If you want to use a different secret, you can point the Soda Agent to an existing Kubernetes Secret in your cluster using the soda.apikey.existingSecret property.

    To use an existing Kubernetes secret for Soda Agent’s Cloud API credentials, add existingSecret and the secretKeys values to your agent’s values YAML file, as in the following example.

    Optimize performance

    The default Soda Agent settings balance performance and cost-efficiency. You can adjust these settings to better suit your needs, optimizing for larger datasets, faster scans, or improved resource management.

    The example below demonstrates how you can increase the memory limit using settings in your values.yml file:


    User and user group management with SSO

    Organizations that use a Security Assertion Markup Language (SAML) 2.0 single sign-on (SSO) identity provider (IdP) can add Soda Cloud as a service provider.

    Once added, employees of the organization can gain authorized and authenticated access to the organization’s Soda Cloud account by successfully logging in to their SSO. This solution not only simplifies a secure login experience for users, it enables IT Admins to:

    • grant their internal users access to Soda Cloud from within their existing SSO solution

    • revoke their internal users’ access to Soda Cloud from within their existing SSO solution if a user leaves their organization or no longer requires access to Soda Cloud

    • set up one-way user group syncing from their IdP into Soda Cloud (tested and documented for Azure Active Directory and Okta)

    Compatibility

    Soda Cloud is able to act as a service provider for any SAML 2.0 SSO identity provider. In particular, Soda has tested and has written instructions for setting up SSO access with the following identity providers:

    • Azure Active Directory

    • Okta

    • Google Workspace

    Soda has also tested and confirmed that SSO setup works with the following identity providers:

    • OneLogin

    • Auth0

    • Patronus

    SSO access to Soda Cloud

    When an employee uses their SSO provider to access Soda Cloud for the first time, Soda Cloud automatically assigns the new user to the default roles and groups for new users. Soda Cloud also notifies the Soda Cloud Admin that a new user has joined the organization, and the new user receives a message indicating that their Soda Cloud Admin was notified of their first login. A Soda Cloud Admin or user with the permission to do so can adjust users’ roles in Organization Settings. See Global and Dataset Roles for details.

    When an organization’s IT Admin revokes a user’s access to Soda Cloud through the SSO provider, a Soda Cloud Admin is responsible for updating the resources and ownerships linked to the User.

    Once your organization enables SSO for all Soda Cloud users, Soda Cloud blocks all non-SSO login attempts and password changes. If an employee attempts a non-SSO login or attempts to change a password using “Forgot password?”, Soda Cloud presents a message that explains that they must log in or change their password using their SSO provider.

    Optionally, you can set up the SSO integration with Soda to include a one-way sync of user groups from your IdP into Soda Cloud, which synchronizes with each user login to Soda via SSO.

    Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations. Be sure to indicate which type of SSO your organization uses when setting it up with the Soda Support team.

    Add Soda Cloud to Azure AD

    1. Email Soda Support to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab. Soda Support sends you the samlUrl that you need to configure the set-up with your identity provider.

    2. As a user with sufficient privileges in your organization’s Azure AD account, sign in through portal.azure.com, then navigate to Enterprise applications. Click New application.

    3. Click Create your own application.

    Add Soda Cloud to Okta

    1. Email Soda Support to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab. Soda Support sends you the samlURL that you need to configure the set-up with your identity provider.

    2. As an Okta Administrator, log in to Okta and navigate to Applications > Applications overview, then click Create App Integration. Refer to Okta documentation for the full procedure.

    3. Select SAML 2.0.

    Add Soda Cloud to Google Workspace

    1. Email Soda Support to request SSO set-up for Soda Cloud and provide your Soda Cloud organization identifier, accessible via your avatar > Organization Settings, in the Organization tab. Soda Support sends you the samlURL that you need to configure the set-up with your identity provider.

    2. As an administrator in your Google Workspace, follow the instructions in Google Workspace documentation to Set up your own custom SAML application.

    3. Optionally, upload the Soda logo so it appears in the app launcher with the logo instead of the first two letters of the app name.

    Sync user groups from an IdP

    If you wish, you can choose to regularly one-way sync the user groups you have defined in your IdP into Soda Cloud.

    Doing so obviates the need to manually create user groups in Soda Cloud that you have already defined in your IdP, and enables your team to select IdP-managed user groups when assigning ownership or access permissions to a resource, in addition to any user groups you may have created manually in Soda Cloud.

    • Soda has tested and documented one-way syncing of user groups with Soda Cloud for Okta and Azure Active Directory. Contact Soda to request tested and documented support for other IdPs.

    • Soda synchronizes user groups with the IdP every time a user in your organization logs in to Soda via SSO. Soda updates the user’s group membership according to the IdP user groups to which they belong at each login.

    • You cannot manage IdP user group settings or membership in Soda Cloud. Any changes that you wish to make to IdP-managed user groups must be done in the IdP itself.

    Set up user group sync in Azure AD

    1. In step 10 of the SAML application setup procedure above, in the same User Attributes & Claims section of your Soda SAML Application in Azure AD, follow Microsoft’s instructions to add a group claim to your Soda SAML Application.

      • For the choice of which groups should be returned in the claim, best practice suggests selecting Groups assigned to the application.

      • For the choice of Source attribute, select Cloud-only group display names.

    Set up user group sync in Okta

    1. In step 7 of the SAML application integration procedure above, follow Okta’s instructions to Define group attribute statements.

      • For the Name value, use Group.Authorization.

      • Leave the optional Name Format value as Unspecified.

      • Use the Filter to find a group that you wish to make available in Soda Cloud to manage access and permissions. Exercise caution! A broad filter may include user groups you do not wish to include in the sync. Double-check that the groups you select are appropriate.

    Renew SSO certificate

    To renew an SSO certificate, you need to provide Soda with the new X.509 certificate, with which Soda updates your Soda organization's SSO configuration. Because Soda can only validate SSO against one certificate, there is downtime between the moment you deactivate the old certificate and the moment Soda updates the SSO configuration.

    Depending on your organization's certificate renewal process, you can notify Soda (or arrange a call) in advance of the specific date and time you want to renew, so that Soda can be prepared for your update and minimize this downtime.

    Connect to Soda Cloud

    Deploy a Soda Agent in an Amazon EKS cluster

    Soda-hosted agents are included in all Free, Team, and Enterprise plans at no additional cost. However, self-hosted agents require an Enterprise plan.

    If you wish to use self-hosted agents, please contact us at https://www.soda.io/contact to discuss Enterprise plan options, or via the support portal for existing customers.

    Deploy a Soda Agent in a Google GKE cluster

    Soda-hosted agents are included in all Free, Team, and Enterprise plans at no additional cost. However, self-hosted agents require an Enterprise plan.

    If you wish to use self-hosted agents, please contact us at https://www.soda.io/contact to discuss Enterprise plan options, or via the support portal for existing customers.

    helm repo add soda-agent [REPOSITORY_URL_PROVIDED]
    helm repo add soda-agent [REPOSITORY_URL_PROVIDED]
    kubectl delete ns soda-agent
    az aks delete --resource-group SodaAgent --name soda-agent-cli-test --yes
    soda:
      agent:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
      scanlauncher:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
    soda:
      agent:
        resources:
          limits:
            cpu: 250m
            memory: 375Mi
          requests:
            cpu: 250m
            memory: 375Mi
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --namespace soda-agent
    az aks create \
    >   --resource-group SodaAgent \
    >   --name SodaAgentCluster \
    >   --node-count 1 \
    >   --generate-ssh-keys
    soda:
     apikey:
       id: "***"
       secret: "***"
     agent:
       name: "myuniqueagent"
     env:
       POSTGRES_USER: "sodalibrary"
       POSTGRES_PASS: "sodalibrary"
    helm upgrade soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    type: postgres
    name: postgres
    connection:
      host:
      port:
      database:
      user: ${env.POSTGRES_USER}
      password: ${env.POSTGRES_PASS}
     helm repo add external-secrets https://charts.external-secrets.io
    
     helm install external-secrets \
        external-secrets/external-secrets \
         -n external-secrets \
         --create-namespace
    kubectl -n external-secrets get all
    soda:
      apikey:
        id: "***"
        secret: "***"
      agent:
        name: "myuniqueagent"
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    soda:
      apikey:
        existingSecret: "<existing-secret-name>"
        secretKeys:
          idKey: "<key-for-api-id>"
          secretKey: "<key-for-api-secret>"
    soda:
      scanlauncher:
        resources:
          limits:
            cpu: 1
            memory: 2Gi
      contractlauncher:
        resources:
          limits:
            cpu: 1
            memory: 2Gi
      

    Changes in the schema compared to the previous scan—any change is automatically flagged as an anomaly

    Based on time partition column

    Partition row count

    The number of rows in the last partition at scan time

    Most recent timestamp

    Time difference between scan time and the maximum timestamp in the partition column (at scan time)

    PostgreSQL: June 6th | ✅ | — | ✅

    AWS Aurora: June 30th | ✅ | — | ✅

    MS SQL Server: June 30th | ✅ | — | ✅

    Oracle: June 30th | June 30th | — | ✅

    Redshift: September 1st | June 30th | June 30th | ✅

    BigQuery: September 1st | ✅ | June 30th | ✅

    MySQL: Upcoming | — | — | ✅

    Trino: Upcoming | Upcoming | — | ✅

    Athena: Upcoming | Upcoming | — | ✅

    • Manage organization settings

    • Deactivate users

    • Create, edit, or delete user groups

    • Create, edit, or delete dataset roles

    • Create, edit, or delete global roles

    • Assign global roles to users or user groups

    • Add, edit, or delete integrations

    • Access and download the audit trail

    ✓

    Manage scan definitions

    • Update scan definition

    • Run scan definition manually

    ✓

    Access failed row samples for checks

    Allow users to see samples of rows that are considered invalid

    ✓

    ✓

    ✓

    Configure dataset

    Allow users to define dataset attributes and owner, change settings, and add/enable/configure metric monitors at a dataset level

    ✓

    ✓

    Manage dataset responsibilities

    Allow users to grant and remove permissions through responsibilities.

    ✓

    Manage Data Contracts

    Allow users to modify as well as verify the Data Contract.

    ✓

    ✓

    Propose checks

    Allow users to propose changes in the Data Contract

    ✓

    ✓

    ✓

    Manage incidents

    Allow users to edit and close incidents.

    ✓

    ✓

    ✓

    Delete dataset

    Allow users to remove a dataset and its checks.

    ✓

     apiVersion: external-secrets.io/v1beta1
     kind: ClusterSecretStore
     metadata:
       name: vault-app-role
     spec:
       provider:
         vault:
           auth:
             appRole:
               path: approle
               roleId: 3e****54-****-936e-****-5c5a19a5eeeb
               secretRef:
                 key: appRoleSecretId
                 name: external-secrets-vault-app-role-secret-id
                 namespace: external-secrets
           path: kv
           server: http://vault.vault.svc.cluster.local:8200
           version: v2
    kubectl apply -f cluster-secret-store.yaml
    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: soda-agent
      namespace: soda-agent
    spec:
      data:
      - remoteRef:
          key: local/soda
          property: POSTGRES_USERNAME
        secretKey: POSTGRES_USERNAME
      - remoteRef:
          key: local/soda
          property: POSTGRES_PASSWORD
        secretKey: POSTGRES_PASSWORD
      refreshInterval: 1m
      secretStoreRef:
        kind: ClusterSecretStore
        name: vault-app-role
      target:
        name: soda-agent-secrets
        template:
          data:
            soda-agent.conf: |
              POSTGRES_USERNAME={{ .POSTGRES_USERNAME }}
              POSTGRES_PASSWORD={{ .POSTGRES_PASSWORD }}
          engineVersion: v2
    kubectl apply -n soda-agent -f soda-secret.yaml
    kubectl get secret -n soda-agent soda-agent-secrets
    NAME                 TYPE     DATA   AGE
    soda-agent-secrets   Opaque   1      24h
     soda:
       apikey:
         id: "154k***889"
         secret: "9sfjf****ff4"
       agent:
         name: "my-soda-agent-external-secrets"
       scanLauncher:
         existingSecrets:
           # from spec.target.name in the ExternalSecret file
           - soda-agent-secrets 
       contractLauncher:
         existingSecrets:
           # from spec.target.name in the ExternalSecret file
           - soda-agent-secrets 
       cloud:
         # Use https://cloud.us.soda.io for US region 
         # Use https://cloud.soda.io for EU region
         endpoint: "https://cloud.soda.io"
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
      NAME: soda-agent
      LAST DEPLOYED: Tue Aug 29 13:08:51 2023
      NAMESPACE: soda-agent
      STATUS: deployed
      REVISION: 1
      TEST SUITE: None
      NOTES:
      Success, the Soda Agent is now running. 
      You can inspect the Orchestrators logs if you like, but if all was configured correctly, the Agent should show up in Soda Cloud. 
      Check the logs using:
         kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent
    Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
  • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

  • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.


    In the right pane that appears, provide a name for your app, such as Soda Cloud, then select the (Non-gallery) option. Click Create.

  • After Azure AD creates your app, click Single sign-on in the left nav under the Manage heading, then select the SAML tile.

  • In the Basic SAML Configuration block that appears, click Edit.

  • In the Basic SAML Configuration panel, there are two fields to populate:

    • Identifier (Entity ID), which is the value of samlUrl from step 1.

    • Reply URL, which is the value of samlUrl from step 1.

  • Click Save, then close the confirmation message pop-up.

  • In the User Attributes & Claims panel, click Edit to add some attribute mappings.

  • Configure the claims as per the following example. Soda Cloud uses familyname and givenname, and maps emailaddress to user.userprincipalname. (Optional) Follow the additional steps to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Azure AD.

  • Scroll down to collect the values of three fields that Soda needs to complete the Azure AD SSO integration:

    • Azure AD Identifier (Section 4 in Azure). This is the IdP entity ID, or Identity Provider Issuer, that Soda needs.

    • Login URL (Section 4 in Azure). This is the IdP SSO service URL, or Identity Provider Single Sign-On URL that Soda needs.

    • X.509 Certificate. Click the Download link next to Certificate (Base64).

  • Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion.

    • Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.

    • (Optional) Ask Soda to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Azure AD.

  • Test the integration by assigning the Soda application in Azure AD to a single user, then requesting that they log in.

  • After a successful single-user test of the sign in, assign access to the Soda Azure AD app to users and/or user groups in your organization.

  • Provide a name for the application, Soda Cloud, and upload the Soda logo.

  • Click Next. In the Configure SAML tab, there are two fields to populate:

    • Single sign on URL, which is the value of samlUrl from step 1.

    • Audience URI (SP Entity ID), which is also the value of samlUrl from step 1. The values for these fields are unique to your organization and are provided to you by Soda; they follow this pattern: https://cloud.soda.io/sso/<your-organization-identifier>/saml.

  • Be sure to use an email address as the application username.

  • Scroll down to Attribute Statements to map the following values, then click Next to continue.

    • map User.GivenName to user.firstName

    • map User.FamilyName to user.lastName

    • map User.Email to user.email

    • (Optional) Follow the additional steps to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Okta.

  • Select the following options, then click Finish.

    • I’m an Okta customer adding an internal app.

    • This is an internal app that we have created.

  • In the Sign On pane of the application, scroll down to click View Setup Instructions.

  • Collect the values of three fields that Soda needs to complete the Okta SSO integration:

    • Identity Provider Single Sign-On URL

    • Identity Provider Issuer

    • X.509 Certificate

  • Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion.

    • Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.

    • (Optional) Ask Soda to enable one-way user group syncing to your SSO configuration; see Set up user group sync in Okta.

  • Test the integration by assigning the Soda application in Okta to a single user, then requesting that they log in.

  • After a successful single-user test of the sign in, assign access to the Soda Okta app to users and/or user groups in your organization.

  • On the Google Identity Provider details page, be sure to copy or download the following values:

    • SSO URL

    • Entity ID

    • IDP metadata

    • Certificate

  • On the SAML Attribute mapping page, add two Google directory attributes and map as follows:

    • Last Name → User.FamilyName

    • First Name → User.GivenName

  • Email the copied and downloaded values to [email protected]. With those values, Soda completes the SSO configuration for your organization in cloud.soda.io and notifies you of completion. Soda Cloud supports both Identity Provider Initiated (IdP-initiated), and Service Provider Initiated (SP-initiated) single sign-on integrations; be sure to indicate which type of SSO your organization uses.

  • In the Google Workspace admin portal, use Google’s instructions to Turn on your SAML app and verify that SSO works with the new custom app for Soda.

  • After saving the group claim, navigate to Users and Groups in the left menu, and follow Microsoft’s instructions to Assign a user or group to an enterprise application. Add any existing groups to the Soda SAML Application that you wish to make available in Soda Cloud to manage access and permissions.

  • In your message to Soda Support or your Soda Customer Engineer, advise Soda that you wish to enable user group syncing. Soda adds a setting to your SSO configuration to enable it.

  • When the SSO integration is complete, you and your team can select your IdP user groups from the dropdown list of choices available when assigning ownership or permissions to resources.

  • Use the Add Another button to add as many groups as you wish to make available in Soda Cloud.

  • In your message to Soda Support or your Soda Customer Engineer, advise Soda that you wish to enable user group syncing. Soda adds a setting to your SSO configuration to enable it.

  • When the SSO integration is complete, you and your team can select your IdP user groups from the dropdown list of choices available when assigning ownership or permissions to resources.



    Prerequisites
    • You have an AWS account and the necessary permissions to enable you to create, or gain access to an EKS cluster in your region.

    • You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. Run kubectl version --output=yaml to check the version of an existing install.

    • You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.

    • You have whitelisted these URLs, depending on whether you are using Soda EU cloud.soda.io or Soda US cloud.us.soda.io:

    System requirements

    Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.

    Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider:

    • fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%.

    • adding more nodes to the node group; see AWS documentation for Scaling Managed Nodegroups.

    • adding a cluster auto-scaler to your Kubernetes cluster; see the relevant AWS documentation for cluster autoscaling

    Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.

    To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.

    For reference, a Soda-hosted agent specifies resources as follows:

    Deploy an agent

    The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.

    Method
    Description
    When to use

    Deploy using CLI only

    Install the Helm chart via CLI by providing values directly in the install command.

    Use this as a straight-forward way of deploying an agent on a cluster.

    Deploy using a values YAML file

    Install the Helm chart via CLI by providing values in a values YAML file.

    Use this as a way of deploying an agent on a cluster while keeping sensitive values secure:
    • provide sensitive API key values in this local file
    • store data source login credentials as environment variables in this local file or in an external secrets manager; Soda needs access to the credentials to be able to connect to your data source to run scans of your data.

    Deploy using CLI only

    1. (Optional) If you wish, you can establish an AWS PrivateLink to provide private connectivity with Soda Cloud. Refer to (Optional) Connect via AWS PrivateLink before deploying an agent.

    2. (Optional) If you are deploying to an existing Virtual Private Cloud (VPC), consider supplying public or private subnets with your deployment. Consult the eksctl documentation to Use existing VPC.

    3. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart. Best practice advises creating a managed node group into which you can deploy the agent.

    4. Use Helm to add the Soda Agent Helm chart repository.

    5. Use the following command to install the Helm chart which deploys a Soda Agent in your cluster.

      • Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

      • Replace the value of soda.agent.name with a custom name for your agent, if you wish.

    6. (Optional) Validate the Soda Agent deployment by running the following command:

    7. In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step 3 to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.
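    A minimal sketch of the validation and log commands referenced in the steps above, assuming the agent runs in the soda-agent namespace; adjust the namespace to match your deployment:

    # Check the status of the agent pods (look for State: Running and Ready: True)
    kubectl describe pods -n soda-agent

    # Follow the orchestrator logs to investigate issues
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f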

    Deploy using a values YAML file

    1. (Optional) If you wish, you can establish an AWS PrivateLink to provide private connectivity with Soda Cloud. Refer to Connect via AWS PrivateLink before deploying an agent.

    2. (Optional) If you are deploying to an existing Virtual Private Cloud (VPC), consider supplying public or private subnets with your deployment. Consult the eksctl documentation to Use existing VPC.

    3. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart. Best practice advises creating a managed node group into which you can deploy the agent.

    4. Use Helm to add the Soda Agent Helm chart repository.

    5. Using a code editor, create a new YAML file called values.yml.

    6. To that file, copy+paste the content below, replacing the following values:

      • id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

      • Replace the value of name with a custom name for your agent, if you wish.

    7. Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.

    8. (Optional) Validate the Soda Agent deployment by running the following command:

    9. In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step four to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    (Optional) Connect via AWS PrivateLink

    If you use AWS services for your infrastructure and you have deployed or will deploy a Soda Agent in an EKS cluster, you can use an AWS PrivateLink to provide private connectivity with Soda Cloud.

    1. Log in to your AWS console and navigate to your VPC dashboard.

    2. Follow the AWS documentation to Connect to an endpoint service as the service customer. For security reasons, Soda does not publish its Service name. Email [email protected] with your AWS account ID to request the PrivateLink service name. Refer to AWS documentation for instructions on how to obtain your account ID.

    3. After creating the endpoint, return to the VPC dashboard. When the status of the endpoint becomes Available, the PrivateLink is ready to use. Be aware that this may take more than 10 minutes.

    4. Deploy a Soda Agent to your AWS EKS cluster, or, if you have already deployed one, restart your Soda Agent to begin sending data to Soda Cloud via the PrivateLink.

    5. After you have started the agent and validated that it is running, log into your Soda Cloud account, then navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    About the helm install command

    Command part
    Description

    helm install

    the action helm is to take

    soda-agent (the first one)

    a release named soda-agent on your cluster

    soda-agent (the second one)

    the name of the helm repo you installed

    soda-agent (the third one)

    the name of the helm chart that is the Soda Agent

    The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with the --set flags as this command does, or you can specify the override values using a values.yml file.

    Parameter key
    Parameter value, description

    --set soda.agent.name

    A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.

    --set soda.apikey.id

    With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.apikey.secret

    With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.agent.logFormat

    (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

    --set soda.agent.loglevel

    (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

    --namespace soda-agent

    Use the namespace value to identify the namespace in which to deploy the agent.

    Decommission the Soda Agent and the EKS cluster

    1. Uninstall the Soda Agent in the cluster.

    2. Delete the EKS cluster itself.

    3. (Optional) Access your CloudFormation console, then click Stacks to view the status of your decommissioned cluster. If you do not see your Stack, use the region drop-down menu at upper-right to select the region in which you created the cluster.
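    A minimal sketch of the decommissioning commands for steps 1 and 2, assuming the agent was installed as a Helm release named soda-agent in the soda-agent namespace and the cluster was created with eksctl; substitute your own release, namespace, cluster name, and region:

    # Step 1: uninstall the Soda Agent release and remove its namespace
    helm uninstall soda-agent -n soda-agent
    kubectl delete ns soda-agent

    # Step 2: delete the EKS cluster (for eksctl-created clusters)
    eksctl delete cluster --name soda-agent-cluster --region eu-west-1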

    Troubleshoot deployment

    Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.

    Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:

    • Use https://cloud.us.soda.io for the United States

    • Use https://cloud.soda.io for all else

    Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.

    Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:

    • cloud.us.soda.io for Soda Cloud account created in the US region OR

    • cloud.soda.io for Soda Cloud account created in the EU region AND

    • collect.soda.io

    Problem: UnauthorizedOperation: You are not authorized to perform this operation.

    Solution: This error indicates that your user profile is not authorized to create the cluster. Contact your AWS Administrator to request the appropriate permissions.



    Prerequisites
    • You have a Google Cloud Platform (GCP) account and the necessary permissions to enable you to create, or gain access to an existing Google Kubernetes Engine (GKE) cluster in your region.

    • You have installed the gcloud CLI tool. Use the command gcloud version to verify the version of an existing install.

      • If you have already installed the gcloud CLI, use the following commands to login and verify your configuration settings, respectively: gcloud auth login gcloud config list

      • If you are installing the gcloud CLI for the first time, be sure to complete all the steps in the installation to properly install and configure the setup.

      • Consider using the following command to learn a few basic gcloud commands: gcloud cheat-sheet.

    • You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. With Docker running, use the command kubectl version --output=yaml to check the version of an existing install.

    • You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.

    • You have whitelisted these URLs, depending on whether you are using Soda EU cloud.soda.io or Soda US cloud.us.soda.io:

    System requirements

    Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.

    Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.

    To specify resources, add the following parameters to your values.yml file during deployment. Refer to Kubernetes documentation for Resource Management for Pods and Containers for information on values to supply for x.

    For reference, a Soda-hosted agent specifies resources as follows:

    Deploy an Agent

    The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.

    Method
    Description
    When to use

    Deploy using CLI only

    Install the Helm chart via CLI by providing values directly in the install command.

    Use this as a straight-forward way of deploying an agent on a cluster in a secure or local environment.

    Deploy using a values YAML file

    Install the Helm chart via CLI by providing values in a values YAML file.

    Use this as a way of deploying an agent on a cluster while keeping sensitive values secure:
    • provide sensitive API key values in this local file
    • store data source login credentials as environment variables in this local file or in an external secrets manager; Soda needs access to the credentials to be able to connect to your data source to run scans of your data.

    Deploy using CLI only

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.

    3. Add the Soda Agent Helm chart repository.

    4. Use the following command to install the Helm chart to deploy a Soda Agent in your cluster. (Learn more about the helm install command.)

      • Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

      • Replace the value of soda.agent.name with a custom name for your agent, if you wish.

      • Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

      • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

      • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

        The command-line produces output like the following message:

    5. (Optional) Validate the Soda Agent deployment by running the following command:

    6. In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step three to check the status of the deployment. When Status: Running, then you can refresh and see the agent in Soda Cloud.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    Deploy using a values YAML file

    1. (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.

    2. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.

    3. Add the Soda Agent Helm chart repository.

    4. Using a code editor, create a new YAML file called values.yml.

    5. In that file, copy+paste the content below, replacing the following values:

      • id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

      • Replace the value of name with a custom name for your agent, if you wish.

    6. Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.

    7. (Optional) Validate the Soda Agent deployment by running the following command:

    8. In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step four to check the status of the deployment. When Status: Running, then you can refresh and see the agent in Soda Cloud.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    About the helm install command

    Command part
    Description

    helm install

    the action helm is to take

    soda-agent (the first one)

    a release named soda-agent on your cluster

    soda-agent (the second one)

    the name of the helm repo you installed

    soda-agent (the third one)

    the name of the helm chart that is the Soda Agent

    The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with the --set flags as this command does, or you can specify the override values using a values.yml file.

    Parameter key
    Parameter value, description

    --set soda.agent.name

    A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.

    --set soda.apikey.id

    With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.apikey.secret

    With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.agent.logFormat

    (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

    --set soda.agent.loglevel

    (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

    --namespace soda-agent

    Use the namespace value to identify the namespace in which to deploy the agent.

    Decommission the Soda Agent and cluster

    1. Uninstall the Soda Agent in the cluster.

    2. Delete the cluster.

    Refer to Google Kubernetes Engine documentation for details.
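    A minimal sketch of these commands, assuming a Helm release named soda-agent in the soda-agent namespace and a GKE cluster created with the gcloud CLI; substitute your own release, namespace, cluster name, and zone:

    # Step 1: uninstall the Soda Agent release and remove its namespace
    helm uninstall soda-agent -n soda-agent
    kubectl delete ns soda-agent

    # Step 2: delete the GKE cluster
    gcloud container clusters delete soda-agent-gke --zone europe-west1-b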

    Troubleshoot deployment

    Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.

    Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:

    • Use https://cloud.us.soda.io for the United States

    • Use https://cloud.soda.io for all else

    Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.

    Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:

    • cloud.us.soda.io for Soda Cloud account created in the US region OR

    • cloud.soda.io for Soda Cloud account created in the EU region AND

    • collect.soda.io



    Setup & configuration

    This page provides detailed information about how to configure the Soda↔Collibra integration.

    Both Collibra and Soda need to be configured so the integration can run successfully. This page covers both Collibra and Soda settings, including asset types, attribute types, relation types, and domain mappings. These settings establish the foundation for reliable synchronization of data quality checks and metadata between Soda and Collibra.


    Configuration Guide

    1. Collibra Configuration

    Base Settings

    Asset Types

    Configure the different types of assets in Collibra:

    Attribute Types

    Define the attributes that will be set on check assets:

    Diagnostic Attributes Behavior:

    • Flexible Extraction: Automatically extracts metrics from any diagnostic type (missing, aggregate, valid, etc.)

    • Future-Proof: Works with new diagnostic types that Soda may introduce

    • Smart Fallbacks: Falls back to datasetRowsTested

    Relation Types

    Define the types of relationships between assets:

    Responsibilities

    Configure ownership role mappings:

    Domains

    Configure the domains where assets will be created:

    2. Soda Configuration

    Base Settings

    General Settings

    Attributes

    Define Soda attributes and their mappings:

    Multiple dimensions support

    The integration supports both single and multiple dimensions for data quality checks:

    • Single dimension: Specify as a string value (e.g., "Completeness")

    • Multiple dimensions: Use a comma-separated string (e.g., "Completeness, Consistency")

    When multiple dimensions are provided as a comma-separated string, the integration will:

    1. Automatically split the string by commas and trim whitespace

    2. Search for each dimension asset in Collibra individually

    3. Create a relation for each dimension found

    4. Log a warning for any dimension that cannot be found in Collibra

    Example Configuration:
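    A minimal sketch of such a configuration, assuming the dimensions are supplied through a check attribute named dimension (the attribute name in your setup may differ):

    checks for orders:
      - row_count > 0:
          attributes:
            dimension: "Completeness, Consistency, Accuracy"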

    This will create three separate dimension relations in Collibra, one for each dimension specified.

    Monitor Exclusion

    The integration can exclude Soda monitors (items with metricType) from synchronization:

    • Enabled (sync_monitors: true): All checks and monitors are synchronized (default)

    • Disabled (sync_monitors: false): Only checks are synchronized, monitors are filtered out

    When sync_monitors is disabled, the integration will:

    1. Filter out all items that have a metricType attribute

    2. Only process actual checks (items without metricType)

    3. Log the number of monitors filtered out for each dataset

    4. Continue processing with the remaining checks

    This is useful when you want to focus on data quality checks and exclude monitoring metrics from your Collibra catalog.
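    A minimal config.yaml sketch, assuming the option sits alongside the other Soda settings in the integration configuration; adjust the placement to match where your configuration defines its Soda section:

    soda:
      sync_monitors: false   # synchronize checks only; items with metricType are filtered out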

    Custom Attribute Syncing configuration

    See the section below for detailed instructions.


    Custom Attribute Syncing

    The integration supports syncing custom attributes from Soda checks to Collibra assets, allowing you to enrich your Collibra assets with business context and additional metadata from your data quality checks.

    How Custom Attribute Syncing Works

    Custom attribute syncing enables you to map specific attributes from your Soda checks to corresponding attribute types in Collibra. When a check is synchronized, the integration will automatically extract the values of these attributes and set them on the created/updated Collibra asset.

    Configuration

    To enable custom attribute syncing, add the custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id configuration to your config.yaml file:
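    A minimal sketch of this setting; the attribute name and UUID shown here are illustrative placeholders only:

    custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"business_impact": "00000000-0000-0000-0000-000000000000"}'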

    The configuration value is a JSON string containing key-value pairs where:

    • Key: The name of the attribute in Soda (as it appears on your Soda checks)

    • Value: The UUID of the corresponding attribute type in Collibra

    Step-by-Step Setup

    1. Identify Soda Attributes

    First, identify which attributes from your Soda checks you want to sync to Collibra. Common examples include:

    • description - Check description

    • business_impact - Business impact assessment

    • data_domain - Data domain classification

    2. Find Collibra Attribute Type UUIDs

    For each Soda attribute, find the corresponding attribute type UUID in Collibra:

    1. Navigate to your Collibra instance

    2. Go to Settings → Metamodel → Attribute Types

    3. Find or create the attribute types you want to map to

    4. Copy the UUID of each attribute type

    3. Create the JSON Mapping

    Create a JSON object mapping Soda attribute names to Collibra attribute type UUIDs:
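    For example, a mapping for the attributes listed above could look like the following; the UUIDs are placeholders for the values copied from your Collibra metamodel:

    {
      "description": "11111111-1111-1111-1111-111111111111",
      "business_impact": "22222222-2222-2222-2222-222222222222",
      "data_domain": "33333333-3333-3333-3333-333333333333"
    }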

    4. Add to Configuration

    Add the JSON mapping to your config.yaml file as a single-line string:
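    Continuing the example above, the mapping is added to config.yaml as one line, wrapped in single quotes (placeholder UUIDs shown):

    custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"description": "11111111-1111-1111-1111-111111111111", "business_impact": "22222222-2222-2222-2222-222222222222", "data_domain": "33333333-3333-3333-3333-333333333333"}'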

    Complete Example

    Here's a complete example showing how to configure custom attribute syncing:

    Soda Check with Custom Attributes:
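    A sketch of a check carrying these custom attributes, written in SodaCL-style syntax; the exact check definition depends on how you define checks, so treat this as illustrative only:

    checks for orders:
      - row_count > 0:
          attributes:
            description: Ensures orders table is not empty
            business_impact: critical
            data_domain: sales
            criticality: high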

    Collibra Configuration:
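    And the corresponding mapping in config.yaml, again with placeholder UUIDs standing in for your Collibra attribute type IDs:

    custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"description": "11111111-1111-1111-1111-111111111111", "business_impact": "22222222-2222-2222-2222-222222222222", "data_domain": "33333333-3333-3333-3333-333333333333", "criticality": "44444444-4444-4444-4444-444444444444"}'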

    Result: When this check is synchronized, the integration will create a Collibra asset with these attributes automatically set:

    • Description: "Ensures orders table is not empty"

    • Business Impact: "critical"

    • Data Domain: "sales"

    • Criticality: "high"

    ⚠️ Important Notes

    1. JSON Format: The mapping must be a valid JSON string enclosed in single quotes

    2. Attribute Type UUIDs: Use the exact UUIDs from your Collibra metamodel

    3. Case Sensitivity: Soda attribute names are case-sensitive and must match exactly

    4. Missing Attributes: If a Soda check doesn't have an attribute defined in the mapping, it will be skipped (no error)

    Troubleshooting

    Common Issues:

    • Invalid JSON: Ensure the JSON string is properly formatted and enclosed in single quotes

    • Attribute Not Found: Verify the Soda attribute names match exactly what's defined in your checks

    • UUID Errors: Confirm the Collibra attribute type UUIDs are correct and exist in your instance

    • Permission Issues: Ensure your Collibra user has permissions to set the specified attribute types

    Debug Mode: Run with debug mode to see detailed logging about custom attribute processing:

    Look for log messages like:

    • Processing custom attribute: attribute_name

    • Successfully set custom attribute: attribute_name

    • Skipping custom attribute (not found in check): attribute_name


    Deletion Synchronization

    The integration automatically synchronizes deletions, removing obsolete check assets from Collibra when checks are deleted or removed in Soda.

    How It Works

    1. Pattern Matching: For each dataset, the integration searches for all check assets in Collibra using the naming pattern {checkname}___{datasetName}

    2. Comparison: Compares the list of check assets in Collibra with the current checks returned from Soda

    3. Identification: Identifies assets that exist in Collibra but are no longer present in Soda

    Benefits

    • Automatic Cleanup: Keeps your Collibra catalog in sync with Soda without manual intervention

    • Efficient Processing: Uses bulk deletion operations to minimize API calls

    • Idempotent: Safe to run multiple times - handles already-deleted assets gracefully

    • Transparent: Shows deletion progress in the console output and tracks metrics

    Example Output

    When obsolete checks are found and deleted, you'll see:

    And in the summary:

    Configuration

    No additional configuration is required. Deletion synchronization is enabled by default and runs automatically for each dataset during the integration process.

    Monitoring

    Deletion synchronization is tracked in the integration metrics:

    • Checks deleted: Number of obsolete check assets removed from Collibra

    • Error Tracking: Any deletion failures are recorded in the error summary

    Error Handling

    • 404 Errors: If assets are already deleted (404 response), the integration treats this as success and continues

    • Other Errors: Network issues, authentication problems, or other HTTP errors are retried with exponential backoff

    • Missing Assets: If no check assets are found in Collibra for a dataset, deletion sync is skipped


    Ownership Synchronization

    The integration supports automatic synchronization of dataset ownership from Collibra to Soda.

    How It Works

    1. Asset Discovery: For each dataset, finds the corresponding table asset in Collibra

    2. Responsibility Extraction: Retrieves ownership responsibilities from Collibra

    3. User Mapping: Maps Collibra users to Soda users by email address

    4. Ownership Update: Updates the Soda dataset with synchronized owners

    Configuration Requirements

    Ensure the following are configured in your config.yaml:

    Monitoring

    Ownership synchronization is tracked in the integration metrics:

    • 👥 Owners synchronized: Number of successful ownership transfers

    • ❌ Ownership sync failures: Number of failed synchronization attempts

    Error Handling

    Common issues and their handling:

    • Missing Collibra Asset: Skip ownership sync for that dataset

    • No Collibra Owners: Log information message, continue processing

    • User Email Mismatch: Track as error, continue with remaining users

    • Soda API Failures: Retry with exponential backoff

    Data Quality score guide

    In order to show the Soda Data Quality score in Collibra, you will need to create an aggregation path as follows:

    1. Navigate to Collibra Settings > Operating Model > Quality Score Aggregation

    2. Create a new score aggregation. You will create two different aggregations as follows:

    If you are using Collibra as a report catalog and want to show Quality Scores on your reports, you will create a third aggregation using the path “Report is part of data structure” & “Asset complies with Governance Asset”.

    3. Assign the new aggregation paths to the asset types COLUMN and TABLE (and any other asset types such as a REPORT).

    • Collibra Settings > Operating Model > Asset Types > Column

    • Click the assignment being used (Default Assignment) > Quality Score Aggregations > External Data Quality > Choose “Soda Data Quality [COLUMN]"

    • Navigate to Collibra Settings > Operating Model > Asset Types > Table

    4. (Optional) If you want to show the Soda Data Quality score in a diagram view on the asset types, you will need to add the above aggregations as an overlay for each asset type (Column, Table, Report) as follows:


    For advanced configuration details, head to Operations & advanced usage.

    Deploy a Soda Agent in a Kubernetes cluster

    Soda-hosted agents are included in all Free, Team, and Enterprise plans at no additional cost. However, self-hosted agents require an Enterprise plan.

    If you wish to use self-hosted agents, please contact us at https://www.soda.io/contact to discuss Enterprise plan options or via the support portal for existing customers.

    Prerequisites

    • You have created, or have access to an existing Kubernetes cluster into which you can deploy a Soda Agent.

    • You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. With Docker running, use the command kubectl version --output=yaml to check the version of an existing install.

    • You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. Run helm version to check the version of an existing install.

    System requirements

    Kubernetes cluster size and capacity: 2 CPU and 2GB of RAM. In general, this is sufficient to run up to six scans in parallel.

    Scan performance may vary according to the workload, or the number of scans running in parallel. To improve performance for larger workloads, consider fine-tuning the cluster size using the soda.agent.resources parameter for the agent-orchestrator and soda.scanlauncher.resources for the scan-launcher. Adding more resources to the scan-launcher can improve scan times by as much as 30%. Be aware, however, that allocating too many resources may be costly relative to the small benefit of improved scan times.

    To specify resources, add the following parameters to your values.yml file during deployment. Refer to the Kubernetes documentation on Resource Management for Pods and Containers for information on the values to supply for x.

    For reference, a Soda-hosted agent specifies resources as follows:

    Deploy an agent

    The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.

    Method
    Description
    When to use

    Deploy using CLI only

    1. Add the Soda Agent Helm chart repository.

    2. Use the following command to install the Helm chart to deploy a Soda Agent in your cluster. Learn more about the helm install command.

      • Replace the values of soda.apikey.id and soda.apikey.secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    Deploy using a values YAML file

    1. Create or navigate to an existing Kubernetes cluster in your environment in which you can deploy the Soda Agent helm chart.

    2. Add the Soda Agent Helm chart repository. helm repo add soda-agent [REPOSITORY_URL_PROVIDED]

    3. Using a code editor, create a new YAML file called values.yml.

    If you do not see the agent listed in Soda Cloud, use the following command to review status and investigate the logs.

    If you use private key authentication with a Soda Agent, refer to .

    About the helm install command

    Command part
    Description

    The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with --set flags as this command does, or you can specify the override values using a values.yml file.

    Parameter key
    Parameter value, description

    Decommission the Soda Agent and cluster

    1. Uninstall the Soda Agent in the cluster.

    2. Delete the cluster.

    Troubleshoot deployment

    Problem: After setting up a cluster and deploying the agent, you are unable to see the agent running in Soda Cloud.

    Solution: The value you specify for soda.cloud.endpoint must correspond with the region you selected when you signed up for a Soda Cloud account:

    • Use https://cloud.us.soda.io for the United States

    • Use https://cloud.soda.io for all else

    Problem: You need to define the outgoing port and IP address with which a self-hosted Soda Agent can communicate with Soda Cloud. Soda Agent does not require setting any inbound rules as it only polls Soda Cloud looking for instruction, which requires only outbound communication. When Soda Cloud must deliver instructions, the Soda Agent opens a bidirectional channel.

    Solution: Use port 443 and passlist the fully-qualified domain names for Soda Cloud:

    • cloud.us.soda.io for a Soda Cloud account created in the US region, OR

    • cloud.soda.io for a Soda Cloud account created in the EU region, AND

    • collect.soda.io
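
    If you want a quick sanity check that outbound traffic on port 443 to these hosts is allowed from your network (for example, from a pod inside the cluster), a small probe like the sketch below can help. It only tests TCP reachability on port 443 and is not part of the Soda Agent itself.

    import socket

    # Hosts for a US-region account; swap cloud.us.soda.io for cloud.soda.io in the EU region
    HOSTS = ["cloud.us.soda.io", "collect.soda.io"]

    for host in HOSTS:
        try:
            # Open and immediately close a TCP connection on port 443
            with socket.create_connection((host, 443), timeout=5):
                print(f"{host}: reachable on port 443")
        except OSError as error:
            print(f"{host}: NOT reachable ({error})")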

    Github

    Learn how to integrate GitHub with Soda.

    This page explains the GitHub Actions workflows that integrate with Soda to manage data contracts automatically. It covers two key workflows:

    1. Publish Contracts on Merge to Main

    2. Verify Contracts on Pull Request


    soda:
       apikey:
         id: "***"
         secret: "***"
       agent:
         name: "myuniqueagent"
         logformat: "raw"
         loglevel: "ERROR"
       cloud:
         # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
         endpoint: "https://cloud.soda.io"
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    # Use region us for US region; region eu for EU region
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.cloud.endpoint=https://cloud.us.soda.io \
      --set soda.cloud.region=us \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --set soda.agent.logFormat=raw \
      --set soda.agent.loglevel=ERROR \
      --namespace soda-agent
    NAME: soda-agent
    LAST DEPLOYED: Mon Nov 21 16:29:38 2022
    NAMESPACE: soda-agent
    STATUS: deployed
    REVISION: 1
    kubectl get pods -n soda-agent
    NAME                                     READY   STATUS    RESTARTS   AGE
    soda-agent-orchestrator-ffd74c76-5g7tl   1/1     Running   0          32s
    kubectl create ns soda-agent
    namespace/soda-agent created
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    kubectl describe pods -n soda-agent
    helm uninstall soda-agent -n soda-agent
    eksctl delete cluster --name soda-agent
    soda:
      agent:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
      scanlauncher:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
    soda:
      agent:
        resources:
          limits:
            cpu: 250m
            memory: 375Mi
          requests:
            cpu: 250m
            memory: 375Mi
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --namespace soda-agent
    helm repo add soda-agent [REPOSITORY_URL_PROVIDED]
    helm uninstall soda-agent -n soda-agent
    gcloud container clusters delete soda-agent-gke
    soda:
      agent:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
      scanlauncher:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
    soda:
      agent:
        resources:
          limits:
            cpu: 250m
            memory: 375Mi
          requests:
            cpu: 250m
            memory: 375Mi
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --namespace soda-agent
    Soda EU
    Soda US

    cloud.soda.io

    cloud.us.soda.io

    registry.cloud.soda.io

    registry.us.soda.io

    soda-cloud-platform-registry.s3.eu-west-1.amazonaws.com

    soda-cloud-us-platform-registry.s3.us-west-2.amazonaws.com

    *.docker.io

    *.docker.io

    if checkRowsTested is not available
  • Calculated Values: Automatically computes check_rows_passed and check_passing_fraction when source data is available

  • Graceful Handling: Leaves attributes empty when diagnostic data is not present in the check result

  • Continue processing even if some dimensions are missing

    criticality - Data criticality level
  • owner_team - Owning team information

  • Invalid UUIDs: Invalid Collibra attribute type UUIDs will cause the sync to fail for that attribute

    Bulk Deletion: Deletes all obsolete assets in a single bulk operation for efficiency
  • Error Handling: Gracefully handles cases where assets are already deleted (404 errors), treating them as successful deletions

  • Metrics Tracking: Reports the number of checks deleted in the integration summary

  • Error Tracking: Records any failures for monitoring

    Click the assignment being used (Default Assignment) > Quality Score Aggregations > External Data Quality > Choose “Soda Data Quality [TABLE]"

    Operations & advanced usage
    Soda-calculated Data Quality scores published into Collibra, plotted as a time-series quality score history.
    Custom Attribute Syncing


    collibra:
      base_url: "https://your-instance.collibra.com/rest/2.0"
      username: "your-username"
      password: "your-password"
      general:
        naming_delimiter: ">"  # Used to separate parts of asset names
      asset_types:
        table_asset_type: "00000000-0000-0000-0000-000000031007"  # ID for Table assets
        soda_check_asset_type: "00000000-0000-0000-0000-000000031107"  # ID for Data Quality Metric type
        dimension_asset_type: "00000000-0000-0000-0000-000000031108"  # ID for Data Quality Dimension type
        column_asset_type: "00000000-0000-0000-0000-000000031109"  # ID for Column type
      attribute_types:
        # Standard Check Attributes
        check_evaluation_status_attribute: "00000000-0000-0000-0000-000000000238"  # Boolean attribute for pass/fail
        check_last_sync_date_attribute: "00000000-0000-0000-0000-000000000256"  # Last sync timestamp
        check_definition_attribute: "00000000-0000-0000-0000-000000000225"  # Check definition
        check_last_run_date_attribute: "01975dd9-a7b0-79fb-bb74-2c1f76402663"  # Last run timestamp
        check_cloud_url_attribute: "00000000-0000-0000-0000-000000000258"  # Link to Soda Cloud
        
        # Diagnostic Metric Attributes - Extracted from Soda check diagnostics
        check_loaded_rows_attribute: "00000000-0000-0000-0000-000000000233"      # Number of rows tested/loaded
        check_rows_failed_attribute: "00000000-0000-0000-0000-000000000237"      # Number of rows that failed
        check_rows_passed_attribute: "00000000-0000-0000-0000-000000000236"      # Number of rows that passed (calculated)
        check_passing_fraction_attribute: "00000000-0000-0000-0000-000000000240" # Fraction of rows passing (calculated)
      relation_types:
        table_column_to_check_relation_type: "00000000-0000-0000-0000-000000007018"  # Relation between table/column and check
        check_to_dq_dimension_relation_type: "f7e0a26b-eed6-4ba9-9152-4a1363226640"  # Relation between check and dimension
      responsibilities:
        owner_role_id: "00000000-0000-0000-0000-000000005040"  # Collibra role ID for asset owners
      domains:
        data_quality_dimensions_domain: "00000000-0000-0000-0000-000000006019"  # Domain for DQ dimensions
        soda_collibra_domain_mapping: '{"Sales": "0197377f-e595-7434-82c7-3ce1499ac620"}'  # Dataset to domain mapping
        soda_collibra_default_domain: "01975b4a-0ace-79f6-b5ec-68656ca60b11"  # Default domain if no mapping
    soda:
      api_key_id: "your-api-key-id"
      api_key_secret: "your-api-key-secret"
      base_url: "https://cloud.soda.io/api/v1"
      general:
        filter_datasets_to_sync_to_collibra: true  # Only sync datasets with sync attribute
        soda_no_collibra_dataset_skip_checks: false  # Skip checks if dataset not in Collibra
      attributes:
        soda_collibra_sync_dataset_attribute: "collibra_sync"  # Attribute to mark datasets for sync
        soda_collibra_domain_dataset_attribute_name: "rulebook"  # Attribute for domain mapping
        soda_dimension_attribute_name: "dimension"  # Attribute for DQ dimension
    checks for orders:
      - row_count > 0:
          attributes:
            dimension: "Completeness, Consistency, Accuracy"
    soda:
      attributes:
        # ... other attributes ...
        custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"soda_attribute_id": "collibra_attribute_type_uuid", "another_soda_attribute": "another_collibra_uuid"}'
    {
      "description": "00000000-0000-0000-0000-000000003114",
      "business_impact": "01975f7b-0c04-7b98-9fb8-6635261a7c7b",
      "data_domain": "0197ca72-aee8-7259-9e88-5b98073147ed"
    }
    soda:
      attributes:
        custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"description": "00000000-0000-0000-0000-000000003114", "business_impact": "01975f7b-0c04-7b98-9fb8-6635261a7c7b", "data_domain": "0197ca72-aee8-7259-9e88-5b98073147ed"}'
    checks for orders:
      - row_count > 0:
          attributes:
            description: "Ensures orders table is not empty"
            business_impact: "critical"
            data_domain: "sales"
            criticality: "high"
    soda:
      attributes:
        soda_collibra_sync_dataset_attribute: "collibra_sync"
        soda_collibra_domain_dataset_attribute_name: "rulebook"
        soda_dimension_attribute_name: "dimension"
        custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id: '{"description": "00000000-0000-0000-0000-000000003114", "business_impact": "01975f7b-0c04-7b98-9fb8-6635261a7c7b", "data_domain": "0197ca72-aee8-7259-9e88-5b98073147ed", "criticality": "0197f2a8-1234-5678-9abc-def012345678"}'
    python main.py --debug
    Processing dataset 1/3: finance_loans
      📋 Getting checks...
      🔄 Processing 18 checks...
        🏗️ Preparing assets...
        📤 Creating/updating assets...
        📝 Processing metadata & relations...
        🗑️  Deleting 2 obsolete check(s)...
      👥 Syncing ownership...
    🗑️  Checks deleted: 2
    collibra:
      responsibilities:
        owner_role_id: "00000000-0000-0000-0000-000000005040"  # Collibra owner role ID
  • Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

  • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

  • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

  • Read more about the helm install command.

    The command-line produces output like the following message:

  • Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

  • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

  • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

  • Autoscaling
    Kubernetes Secrets
    Kubernetes Secrets
    Soda Agent Extra
    Documentation access & licensing
    agent-deployed
    agent-deployed
    Deploy using CLI only
    Deploy using a values YAML file
    Soda EU
    Soda US

    cloud.soda.io

    cloud.us.soda.io

    registry.cloud.soda.io

    registry.us.soda.io

    soda-cloud-platform-registry.s3.eu-west-1.amazonaws.com

    soda-cloud-us-platform-registry.s3.us-west-2.amazonaws.com

    *.docker.io

    *.docker.io

    Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

  • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

  • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

  • all the steps
    kubectl
    Helm
    Kubernetes Secrets
    Kubernetes Secrets
    Soda Agent Extra
    Documentation access & licensing
    agent-deployed
    agent-deployed
    Deploy using CLI only
    Deploy using a values YAML file
    Soda EU
    Soda US

    cloud.soda.io

    cloud.us.soda.io

    registry.cloud.soda.io

    registry.us.soda.io

    soda-cloud-platform-registry.s3.eu-west-1.amazonaws.com

    soda-cloud-us-platform-registry.s3.us-west-2.amazonaws.com

    *.docker.io

    *.docker.io

    You have been granted access to the private Soda Agent repository and received the necessary credentials and repository information.

  • You have whitelisted these URLs, depending on whether you are using Soda EU cloud.soda.io or Soda US cloud.us.soda.io:

  • Replace the value of soda.agent.name with a custom name for your agent, if you wish.

  • Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

  • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

  • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

    The command-line produces output like the following message:

  • (Optional) Validate the Soda Agent deployment by running the following command:

  • In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step 3 to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.

    agent-deployed
  • In that file, copy+paste the content below, replacing the following values:
    • id and secret with the values you copy+pasted from the New Soda Agent dialog box in your Soda Cloud account. By default, Soda uses Kubernetes Secrets as part of the Soda Agent deployment. The agent automatically converts any sensitive values you add to a values YAML file, or directly via the CLI, into Kubernetes Secrets.

    • Replace the value of name with a custom name for your agent, if you wish.

    • Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.

    • (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

    • (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

  • Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent helm chart.

  • (Optional) Validate the Soda Agent deployment by running the following command:

  • In your Soda Cloud account, navigate to your avatar > Agents. Refresh the page to verify that you see the agent you just created in the list of Agents. Be aware that this may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step three to check the status of the deployment. When State: Running and Ready: True, then you can refresh and see the agent in Soda Cloud.

    agent-deployed
  • Deploy using CLI only

    Install the Helm chart via CLI by providing values directly in the install command.

    Use this as a straightforward way of deploying an agent on a cluster in a secure or local environment.

    Deploy using a values YAML file

    Install the Helm chart via CLI by providing values in a values YAML file.

    Use this as a way of deploying an agent on a cluster while keeping sensitive values secure: provide sensitive API key values in this local file, and store data source login credentials as environment variables in this local file or in an external secrets manager. Soda needs access to these credentials to connect to your data source and run scans of your data. See Soda Agent Extra.

    helm install

    the action helm is to take

    soda-agent (the first one)

    a release named soda-agent on your cluster

    soda-agent (the second one)

    the name of the helm repo you installed

    soda-agent (the third one)

    the name of the helm chart that is the Soda Agent

    --set soda.agent.name

    A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account.

    --set soda.apikey.id

    With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.apikey.secret

    With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here.

    --set soda.agent.logFormat

    (Optional) Specify the format for log output: raw for plain text, or json for JSON format.

    --set soda.agent.loglevel

    (Optional) Specify the level of log information you wish to see when deploying the agent: ERROR, WARN, INFO, DEBUG, or TRACE.

    --namespace soda-agent

    Use the namespace value to identify the namespace in which to deploy the agent.

    kubectl
    Helm
    Resource Management for Pods and Containers
    helm install command
    Kubernetes Secrets
    Soda Agent extras


    Soda EU
    Soda US

    cloud.soda.io

    cloud.us.soda.io

    registry.cloud.soda.io

    registry.us.soda.io

    soda-cloud-platform-registry.s3.eu-west-1.amazonaws.com

    soda-cloud-us-platform-registry.s3.us-west-2.amazonaws.com

    *.docker.io

    *.docker.io

    Publish Contracts on Merge

    Overview

    The Publish Contracts on Merge workflow automates the publishing of any updated or new Soda contracts when changes are pushed to the main branch.

    This ensures that all contract changes are automatically deployed to Soda Cloud whenever they're merged into the production branch.

    Action

    What It Does

    1. Checks out the repo

    2. Sets up Python

    3. Installs the latest version of soda

    4. Identifies changed files

    5. Filters YAML files in the contracts/ directory

    6. Publishes valid contracts to Soda Cloud

    Required GitHub Secrets

    Make sure these are set in your repository’s GitHub Secrets:

    • SODA_CLOUD_API_KEY

    • SODA_CLOUD_API_SECRET

    Learn more about how to Generate API keys

    Customization Options

    Option
    Description

    Change the Action trigger.

    pip install

    Can specify a fixed version of soda for stability.

    SODA_CLOUD_CONFIG_FILE_PATH

    Path to your Soda Cloud config. Can be replaced if your setup uses a different config file name or location.

    contracts/*.yml or contracts/*.yaml

    Modify file pattern to match a different directory or naming convention.

    Example output


    Verify Contracts on Pull Request

    Overview

    The Verify Contracts on Pull Request workflow ensures that contract changes in PRs are valid and do not break expectations before merging.

    The workflow runs when a PR is opened, updated, or reopened.

    Action

    What It Does

    1. Checks out the PR branch

    2. Sets up Python

    3. Installs latest soda-postgres

    4. Identifies changed files

    5. Filters contracts in the contracts/ directory

    6. Runs verification checks against a configured data source

    Required Secrets

    Make sure these are set in your repository’s GitHub Secrets:

    • DATASOURCE_USERNAME

    • DATASOURCE_PASSWORD

    These secrets can be customized depending on the data source type and your needs.

    Customization Options

    Option
    Description

    Change the Action trigger

    pip install

    Adapt the install command to install the necessary package for your data source.

    You can specify a fixed version of soda for stability.

    contracts/*.yml or contracts/*.yaml

    Change to match your directory structure.

    DATASOURCE_CONFIG_FILE_PATH

    Replace with the path to your data source configuration

    DATASOURCE_USERNAME and DATASOURCE_PASSWORD

    Adapt the secrets used to connect to your data source depending on the data source type and security requirements.

    Example output

    Soda


    Set up user group sync in Okta
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.cloud.endpoint=https://cloud.soda.io \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --set soda.agent.logFormat=raw \
      --set soda.agent.loglevel=ERROR \
      --namespace soda-agent
    NAME: soda-agent
    LAST DEPLOYED: Thu Jun 16 10:12:47 2022
    NAMESPACE: soda-agent
    STATUS: deployed
    REVISION: 1
    soda:
       apikey:
         id: "***"
         secret: "***"
       agent:
         name: "myuniqueagent"
         logformat: "raw"
         loglevel: "ERROR"
       cloud:
         # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
         endpoint: "https://cloud.soda.io"
    helm repo add soda-agent [REPOSITORY_URL_PROVIDED]
    kubectl describe pods
    ...
    Containers:
      soda-agent-orchestrator:
         Container ID:   docker://081*33a7
         Image:          sodadata/agent-orchestrator:latest
         Image ID:       docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
         Port:           <none>
         Host Port:      <none>
         State:          Running
           Started:      Thu, 16 Jun 2022 15:50:28 -0700
         Ready:          True
    ...
    helm repo add soda-agent [REPOSITORY_URL_PROVIDED]
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    kubectl describe pods -n soda-agent
    ...
    Containers:
      soda-agent-orchestrator:
         Container ID:   docker://081*33a7
         Image:          sodadata/agent-orchestrator:latest
         Image ID:       docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
         Port:           <none>
         Host Port:      <none>
         State:          Running
           Started:      Thu, 16 Jun 2022 15:50:28 -0700
         Ready:          True
         ...
    kubectl -n soda-agent rollout restart deploy
    soda:
       apikey:
         id: "***"
         secret: "***"
       agent:
         name: "myuniqueagent"
         logformat: "raw"
         loglevel: "ERROR"
       cloud:
         # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
         endpoint: "https://cloud.soda.io"
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.cloud.endpoint=https://cloud.soda.io \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=*** \
      --set soda.agent.logFormat=raw \
      --set soda.agent.loglevel=ERROR \
      --namespace soda-agent
    NAME: soda-agent
    LAST DEPLOYED: Wed Dec 14 11:45:13 2022
    NAMESPACE: soda-agent
    STATUS: deployed
    REVISION: 1
    kubectl describe pods
    Name:             soda-agent-orchestrator-66-snip
    Namespace:        soda-agent
    Priority:         0
    Service Account:  soda-agent
    Node:             <none>
    Labels:           agent.soda.io/component=orchestrator
                   agent.soda.io/service=queue
                   app.kubernetes.io/instance=soda-agent
                   app.kubernetes.io/name=soda-agent
                   pod-template-hash=669snip
    Annotations:      seccomp.security.alpha.kubernetes.io/pod: runtime/default
    Status:           Running
    ...
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    kubectl describe pods
    Name:             soda-agent-orchestrator-66-snip
    Namespace:        soda-agent
    Priority:         0
    Service Account:  soda-agent
    Node:             <none>
    Labels:           agent.soda.io/component=orchestrator
                   agent.soda.io/service=queue
                   app.kubernetes.io/instance=soda-agent
                   app.kubernetes.io/name=soda-agent
                   pod-template-hash=669snip
    Annotations:      seccomp.security.alpha.kubernetes.io/pod: runtime/default
    Status:           Running
    ...
    # Use https://cloud.us.soda.io for US region; use https://cloud.soda.io for EU region
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.cloud.endpoint=https://cloud.soda.io \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --set soda.agent.logFormat=raw \
      --set soda.agent.loglevel=ERROR \
      --namespace soda-agent
    NAME: soda-agent
    LAST DEPLOYED: Thu Jun 16 15:03:10 2022
    NAMESPACE: soda-agent
    STATUS: deployed
    REVISION: 1
    minikube kubectl -- describe pods
    ...
    Containers:
      soda-agent-orchestrator:
         Container ID:   docker://081*33a7
         Image:          sodadata/agent-orchestrator:latest
         Image ID:       docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
         Port:           <none>
         Host Port:      <none>
         State:          Running
           Started:      Thu, 16 Jun 2022 15:50:28 -0700
         Ready:          True
         ...
    helm install soda-agent soda-agent/soda-agent \
      --values values.yml \
      --namespace soda-agent
    minikube kubectl -- describe pods
    ...
    Containers:
      soda-agent-orchestrator:
     Container ID:   docker://081*33a7
     Image:          sodadata/agent-orchestrator:latest
     Image ID:       docker-pullable://sodadata/agent-orchestrator@sha256:394e7c1**b5f
     Port:           <none>
     Host Port:      <none>
     State:          Running
       Started:      Thu, 16 Jun 2022 15:50:28 -0700
     Ready:          True
    ...
    soda:
      agent:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
      scanlauncher:
        resources:
          limits:
            cpu: x
            memory: x
          requests:
            cpu: x
            memory: x
    soda:
      agent:
        resources:
          limits:
            cpu: 250m
            memory: 375Mi
          requests:
            cpu: 250m
            memory: 375Mi
    helm repo add soda-agent [REPOSITORY_URL_PROVIDED]
    kubectl logs -l agent.soda.io/component=orchestrator -n soda-agent -f
    helm install soda-agent soda-agent/soda-agent \
      --set soda.agent.name=myuniqueagent \
      --set soda.apikey.id=*** \
      --set soda.apikey.secret=**** \
      --namespace soda-agent
    helm uninstall soda-agent -n soda-agent
    minikube delete
    💀  Removed all traces of the "minikube" cluster.
    name: Publish Updated Contracts on Merge
    
    on:
      push:
        branches:
          - main
    
    jobs:
      publish-contracts:
        runs-on: ubuntu-latest
    
        steps:
          - name: Checkout repository
            uses: actions/checkout@v4
            with:
              fetch-depth: 0
    
          - name: Set up Python
            uses: actions/setup-python@v4
            with:
              python-version: '3.10'
    
          - name: Install soda-postgres
            run: pip install -i https://pypi.dev.sodadata.io "soda>=4.0.0.dev1" -U
    
          - name: Get all changed files
            id: changed-files
            uses: tj-actions/changed-files@v46
    
          - name: List all changed files
            env:
              ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
            run: |
              for file in ${ALL_CHANGED_FILES}; do
                echo "$file was changed"
              done
    
          - name: Debug environment variables
            env:
              SODA_CLOUD_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}
              SODA_CLOUD_API_SECRET: ${{ secrets.SODA_CLOUD_API_SECRET }}
            run: |
              echo "Environment variables status:"
              echo "SODA_CLOUD_API_KEY: $(if [ -n "$SODA_CLOUD_API_KEY" ]; then echo "✅ Set (${#SODA_CLOUD_API_KEY} chars)"; else echo "❌ Not set"; fi)"
              echo "SODA_CLOUD_API_SECRET: $(if [ -n "$SODA_CLOUD_API_SECRET" ]; then echo "✅ Set (${#SODA_CLOUD_API_SECRET} chars)"; else echo "❌ Not set"; fi)"
    
          - name: Filter and publish contracts
            env:
              ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
              SODA_CLOUD_CONFIG_FILE_PATH: soda-cloud.yaml
              SODA_CLOUD_API_KEY: ${{ secrets.SODA_CLOUD_API_KEY }}
              SODA_CLOUD_API_SECRET: ${{ secrets.SODA_CLOUD_API_SECRET }}
            run: |
              for file in ${ALL_CHANGED_FILES}; do
                if [[ "$file" == contracts/*.yml || "$file" == contracts/*.yaml ]]; then
                  echo "Publishing $file"
                  echo "Executing: soda contract publish --contract \"$file\" --soda-cloud ${SODA_CLOUD_CONFIG_FILE_PATH}"
                  soda contract publish --contract "$file" --soda-cloud ${SODA_CLOUD_CONFIG_FILE_PATH}
                else
                  echo "Skipping $file (not a contract)"
                fi
              done
    
    name: Verify Data Contracts on pull request
    
    on:
      pull_request:
        types: [opened, synchronize, reopened]
    
    jobs:
      verify-contracts:
        runs-on: ubuntu-latest
    
        steps:
          - name: Checkout repository
            uses: actions/checkout@v4
            with:
              fetch-depth: 0
    
          - name: Set up Python
            uses: actions/setup-python@v4
            with:
              python-version: '3.10'
    
          - name: Install soda-postgres
            run: pip install -i https://pypi.dev.sodadata.io/simple -U soda-postgres
    
          - name: Get all changed files
            id: changed-files
            uses: tj-actions/changed-files@v46
    
          - name: List all changed files
            env:
              ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
            run: |
              for file in ${ALL_CHANGED_FILES}; do
                echo "$file was changed"
              done
    
          - name: Debug environment variables
            env:
              DATASOURCE_USERNAME: ${{ secrets.DATASOURCE_USERNAME }}
              DATASOURCE_PASSWORD: ${{ secrets.DATASOURCE_PASSWORD }}
            run: |
              echo "Environment variables status:"
              echo "DATASOURCE_USERNAME: $(if [ -n "$DATASOURCE_USERNAME" ]; then echo "✅ Set"; else echo "❌ Not set"; fi)"
              echo "DATASOURCE_PASSWORD: $(if [ -n "$DATASOURCE_PASSWORD" ]; then echo "✅ Set"; else echo "❌ Not set"; fi)"
    
          - name: Filter and verify contracts
            env:
              ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
              DATASOURCE_CONFIG_FILE_PATH: postgres.yaml
              DATASOURCE_USERNAME: ${{ secrets.DATASOURCE_USERNAME }}
              DATASOURCE_PASSWORD: ${{ secrets.DATASOURCE_PASSWORD }}
            run: |
              for file in ${ALL_CHANGED_FILES}; do
                if [[ "$file" == contracts/*.yml || "$file" == contracts/*.yaml ]]; then
                  echo "Verifying $file"
                  echo "Executing: soda contract verify --data-source ${DATASOURCE_CONFIG_FILE_PATH} --contract \"$file\""
                  soda contract verify --data-source ${DATASOURCE_CONFIG_FILE_PATH} --contract "$file"
                else
                  echo "Skipping $file (not a contract)"
                fi
              done
    
    Data source reference for Soda Core
    Documentation access & licensing
    on:
      push:
        branches:
          - main
    on:
      pull_request:
        types: [opened, synchronize, reopened]

    Operations & advanced usage

    This page provides detailed information about everything that happens while running and after running the Soda↔Collibra integration.

    Advanced usage focuses on running and maintaining the Soda↔Collibra bi-directional integration after setup. The goal is to equip technical implementers with the detail required to operate the integration efficiently, resolve issues quickly, and adapt it to complex environments.


    Performance & Monitoring

    Performance Optimization

    Caching System

    • Domain Mappings: Cached for the entire session

    • Asset Lookups: LRU cache reduces repeated API calls

    • Configuration Parsing: One-time parsing with caching

    Batch Processing

    • Asset Operations: Create/update multiple assets in single calls

    • Attribute Management: Bulk attribute creation and updates

    • Relation Creation: Batch relationship establishment
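
    The asset-lookup cache described above is ordinary memoization. The sketch below illustrates the idea with functools.lru_cache and a stand-in lookup function; it is not the integration's actual code, though the cache size mirrors the CACHE_MAX_SIZE constant described later on this page.

    from functools import lru_cache

    @lru_cache(maxsize=128)  # mirrors CACHE_MAX_SIZE
    def lookup_asset_id(full_name: str) -> str:
        # Stand-in for a Collibra API lookup; repeated calls with the same
        # name are answered from the cache instead of issuing a new request.
        print(f"API call for {full_name}")
        return f"asset-id-for-{full_name}"

    lookup_asset_id("Sales>orders")    # triggers the (fake) API call
    lookup_asset_id("Sales>orders")    # served from the cache, no second call
    print(lookup_asset_id.cache_info())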

    Performance Results

    • 3-5x faster execution vs. original implementation

    • 60% fewer API calls through caching

    • 90% reduction in rate limit errors

    • Improved reliability with comprehensive error handling

    Performance Benchmarks

    Typical Performance

    • Small datasets (< 100 checks): 30-60 seconds

    • Medium datasets (100-1000 checks): 2-5 minutes

    • Large datasets (1000+ checks): 5-15 minutes

    Performance varies based on:

    • Network latency to APIs

    • Number of existing vs. new assets

    • Complexity of relationships

    • API rate limits

    Monitoring & Metrics

    Integration Completion Report

    Debug Logging

    Enable detailed logging for troubleshooting:

    Debug output includes:

    • Dataset processing details

    • API call timing and results

    • Caching hit/miss statistics

    • Error context and stack traces


    Diagnostic Metrics Processing

    The integration automatically extracts diagnostic metrics from Soda check results and populates detailed row-level statistics in Collibra.

    Supported Metrics

    Metric
    Source
    Description

    Flexible Diagnostic Type Support

    The system automatically extracts metrics from any diagnostic type, making it future-proof:

    Current Soda Diagnostic Types

    Future Diagnostic Types (Automatically Supported)

    Intelligent Extraction Logic

    The system uses a metric-focused approach rather than type-specific logic:

    1. Scans All Diagnostic Types: Iterates through every diagnostic type in the response

    2. Extracts Relevant Metrics: Looks for specific metric fields regardless of diagnostic type name

    3. Applies Smart Fallbacks: Uses datasetRowsTested if checkRowsTested is not available
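
    As an illustration of that metric-focused approach, the sketch below (a simplified stand-in for the integration's extraction code, not the code itself) walks every diagnostic type, picks up failedRowsCount and checkRowsTested wherever they appear, falls back to datasetRowsTested, and derives the calculated attributes.

    def extract_diagnostic_metrics(diagnostics):
        """Collect row metrics from whatever diagnostic types are present."""
        loaded = failed = fallback_loaded = None
        for _diagnostic_type, values in (diagnostics or {}).items():
            # Metric-focused: inspect the fields, not the diagnostic type name
            if failed is None and "failedRowsCount" in values:
                failed = values["failedRowsCount"]
            if loaded is None and "checkRowsTested" in values:
                loaded = values["checkRowsTested"]
            if fallback_loaded is None and "datasetRowsTested" in values:
                fallback_loaded = values["datasetRowsTested"]
        if loaded is None:
            loaded = fallback_loaded  # smart fallback to the dataset-level count
        metrics = {"check_loaded_rows": loaded, "check_rows_failed": failed}
        if loaded and failed is not None:
            # loaded > 0 here, so the fraction cannot divide by zero
            metrics["check_rows_passed"] = loaded - failed
            metrics["check_passing_fraction"] = (loaded - failed) / loaded
        return metrics

    print(extract_diagnostic_metrics({"missing": {"failedRowsCount": 3331, "checkRowsTested": 274577}}))
    # -> loaded 274577, failed 3331, passed 271246, fraction ~0.9879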

    Fallback Mechanisms

    Priority
    Field Used
    Fallback Reason

    Example Processing Flow

    Input: Soda Check Result

    Output: Collibra Attributes

    Benefits

    • ✅ Future-Proof: Automatically works with new diagnostic types Soda introduces

    • ✅ Comprehensive: Provides both raw metrics and calculated insights

    • ✅ Flexible: Handles partial data gracefully with intelligent fallbacks

    • ✅ Accurate: Uses check-specific row counts when available


    Testing

    Unit Tests

    Local Kubernetes Testing

    Head to Deploy on Kubernetes to learn more about the Kubernetes deployment.

    Legacy Tests


    Advanced Configuration

    Performance Tuning

    Modify constants.py for your environment:

    Enhanced Configuration Options

    For detailed information on configuring custom attribute syncing, see the section above.

    Custom Logging

    Environment Variables


    Troubleshooting

    Common Issues

    Performance Issues

    • Slow Processing: Increase BATCH_SIZE and DEFAULT_PAGE_SIZE

    • Rate Limiting: Increase RATE_LIMIT_DELAY

    • Memory Usage: Decrease CACHE_MAX_SIZE

    Connection Issues

    • API Timeouts: Check network connectivity and API endpoints

    • Authentication: Verify credentials and permissions

    • Rate Limits: Monitor API usage and adjust delays

    Data Issues

    • Missing Assets: Ensure required asset types exist in Collibra

    • Relation Failures: Verify relation type configurations

    • Domain Mapping: Check domain IDs and JSON formatting

    Diagnostic Metrics Issues

    • Missing Diagnostic Attributes: Check if Soda checks have lastCheckResultValue.diagnostics data

    • Incomplete Metrics: Some diagnostic types may only have partial metrics (e.g., aggregate checks lack failedRowsCount)

    • Attribute Type Configuration: Verify diagnostic attribute type IDs are configured correctly in config.yaml

    Debug Commands

    Log Analysis

    Look for these patterns in debug logs:

    General Operation Patterns:

    • Rate limit prevention: Normal throttling behavior

    • Successfully updated/created: Successful operations

    • Skipping dataset: Expected filtering behavior

    Diagnostic Processing Patterns:

    • Processing diagnostics: Diagnostic data found in check result

    • Found failedRowsCount in 'X': Successfully extracted failure count from diagnostic type X

    • Found checkRowsTested in 'X': Successfully extracted row count from diagnostic type X


    Reference

    Common Commands

    Key Configuration Sections

    • Collibra Base: collibra.base_url, collibra.username, collibra.password

    • Soda API: soda.api_key_id, soda.api_key_secret

    Essential UUIDs to Configure

    • Asset types (table, check, dimension, column)

    • Attribute types (evaluation status, sync date, diagnostic metrics)

    • Relation types (table-to-check, check-to-dimension)

    • Domain IDs for asset creation


    Support

    For issues and questions:

    1. Check the Troubleshooting section

    2. Enable Debug Logging for detailed information

    3. Review the performance metrics for bottlenecks

    4. Consult the Unit Tests section for usage examples

    soda:
       apikey:
         id: "***"
         secret: "***"
       agent:
         name: "myuniqueagent"
         logformat: "raw"
         loglevel: "ERROR"
       cloud:
         # Use https://cloud.us.soda.io for US region
         # Use https://cloud.soda.io for EU region
         endpoint: "https://cloud.soda.io"
    Performance metrics per operation
  • Ownership synchronization details

  • Calculates Derived Metrics: Computes passing rows and fraction when source data is available

  • Handles Missing Data: Gracefully skips attributes when diagnostic data is unavailable

  • ✅ Transparent: Detailed logging shows exactly which metrics were found and used

  • Zero Division Errors: System automatically prevents division by zero when calculating fractions

  • ERROR: Issues requiring attention

    Using datasetRowsTested from 'X' as fallback: Fallback mechanism activated

  • No diagnostics found in check result: Check has no diagnostic data (normal for some check types)

  • Calculated check_rows_passed: Successfully computed passing rows

  • Added check_X_attribute: Diagnostic attribute successfully added to Collibra

  • Custom Attributes: soda.attributes.custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id
  • Domain Mapping: collibra.domains.soda_collibra_domain_mapping

  • Ownership Sync: collibra.responsibilities.owner_role_id

  • Contact [email protected] for additional help

    check_loaded_rows_attribute

    checkRowsTested or datasetRowsTested

    Total number of rows evaluated by the check

    check_rows_failed_attribute

    failedRowsCount

    Number of rows that failed the check

    check_rows_passed_attribute

    Calculated

    check_loaded_rows - check_rows_failed

    check_passing_fraction_attribute

    Calculated

    check_rows_passed / check_loaded_rows

    1st

    checkRowsTested

    Preferred - rows actually tested by the specific check

    2nd

    datasetRowsTested

    Fallback - total dataset rows when check-specific count unavailable

    Deploy on Kubernetes
    Custom Attribute Syncing
    Troubleshooting
    Debug Logging
    Unit Tests


    ============================================================
    🎉 INTEGRATION COMPLETED SUCCESSFULLY 🎉
    ============================================================
    📊 Datasets processed: 15
    ⏭️  Datasets skipped: 2
    ✅ Checks created: 45
    🔄 Checks updated: 67
    📝 Attributes created: 224
    🔄 Attributes updated: 156
    🔗 Dimension relations created: 89
    📋 Table relations created: 23
    📊 Column relations created: 89
    👥 Owners synchronized: 12
    ❌ Ownership sync failures: 1
    
    🎯 Total operations performed: 693
    ============================================================
    python main.py --debug
    // Missing value checks
    {
      "diagnostics": {
        "missing": {
          "failedRowsCount": 3331,
          "failedRowsPercent": 1.213,
          "datasetRowsTested": 274577,
          "checkRowsTested": 274577
        }
      }
    }
    
    // Aggregate checks  
    {
      "diagnostics": {
        "aggregate": {
          "datasetRowsTested": 274577,
          "checkRowsTested": 274577
        }
      }
    }
    // Hypothetical future types
    {
      "diagnostics": {
        "valid": {
          "failedRowsCount": 450,
          "validRowsCount": 9550,
          "checkRowsTested": 10000
        },
        "duplicate": {
          "duplicateRowsCount": 200,
          "checkRowsTested": 8000
        }
      }
    }
    {
      "name": "customer_id is present",
      "evaluationStatus": "fail",
      "lastCheckResultValue": {
        "value": 1.213,
        "diagnostics": {
          "missing": {
            "failedRowsCount": 3331,
            "checkRowsTested": 274577
          }
        }
      }
    }
    Attributes Set:
      - check_loaded_rows_attribute: 274577           # From checkRowsTested
      - check_rows_failed_attribute: 3331             # From failedRowsCount  
      - check_rows_passed_attribute: 271246           # Calculated: 274577 - 3331
      - check_passing_fraction_attribute: 0.9879      # Calculated: 271246 / 274577
    # Run all tests
    python -m pytest tests/ -v
    
    # Run specific test file
    python -m pytest tests/test_integration.py -v
    
    # Run with coverage
    python -m pytest tests/ --cov=integration --cov-report=html
    # Comprehensive local testing (recommended)
    python testing/test_k8s_local.py
    
    # Docker-specific testing
    ./testing/test_docker_local.sh
    
    # Quick validation
    python testing/validate_k8s.py
    # Test Soda client functionality
    python main.py --test-soda
    
    # Test Collibra client functionality
    python main.py --test-collibra
    class IntegrationConstants:
        MAX_RETRIES = 3              # API retry attempts
        BATCH_SIZE = 50              # Batch operation size
        DEFAULT_PAGE_SIZE = 1000     # API pagination size
        RATE_LIMIT_DELAY = 2         # Rate limiting delay
        CACHE_MAX_SIZE = 128         # LRU cache size
    # In your code
    import logging
    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    # Set custom config path
    export SODA_COLLIBRA_CONFIG=/path/to/custom/config.yaml
    
    # Enable debug mode
    export SODA_COLLIBRA_DEBUG=true
    # Full debug output
    python main.py --debug 2>&1 | tee debug.log
    
    # Verbose logging with timestamps
    python main.py --verbose
    
    # Test specific components
    python main.py --test-soda --debug
    python main.py --test-collibra --debug
    # Basic run with default config
    python main.py
    
    # Debug mode with detailed logging
    python main.py --debug
    
    # Use custom configuration file
    python main.py --config custom.yaml
    
    # Test individual components
    python main.py --test-soda --debug
    python main.py --test-collibra --debug
    Make contracts dynamic with variables
    Make contracts dynamic with variables
    Check Attributes
    dataset
    agent-deployed