Automate anomaly detection

Use this guide to set up Soda and start automatically monitoring your data for quality.

Available in 2025: refer to the new Soda docs.

Use this guide to set up Soda and begin automatically monitoring the data quality of datasets in a data source. Use the guided workflow in Soda Cloud to connect to a data source, profile your data, and activate anomaly dashboards for your datasets.

About this guide

This guide offers Data Analysts, Data Scientists, and business users instructions to set up Soda to profile and begin monitoring data for quality, right out of the box.

This example offers instructions for both a self-hosted and Soda-hosted agent deployment models which use Soda Cloud connected to a Soda Agent to securely access data sources and execute scheduled scans for data quality anomaly detections. See: Choose a flavor of Soda.

Set up a Soda Agent

This setup provides a secure, out-of-the-box Soda-hosted Agent to manage access to data sources from within your Soda Cloud account.

Compatibility

BigQuery Databricks SQL MS SQL Server MySQL

PostgreSQL Redshift Snowflake

Set up

If you have not already done so, create a Soda Cloud account at cloud.soda.io. If you already have a Soda account, log in.
By default, Soda prepares a Soda-hosted agent for all newly-created accounts. However, if you are an Admin in an existing Soda Cloud account and wish to use a Soda-hosted agent, navigate to your avatar > Organization Settings. In the Organization tab, click the checkbox to Enable Soda-hosted Agent.
Navigate to your avatar > Data Sources, then access the Agents tab. Notice your out-of-the-box Soda-hosted agent that is up and running.

Invite your colleague(s) to your Soda Cloud organization so they can access the newly-deployed Soda Agent to connect to data sources and begin monitoring data quality. In your Soda Cloud account, navigate to your avatar > Invite Team Members and fill in the blanks.

Automate data quality monitoring

For preview participants, only.

As a user with permission to do so in your Soda Cloud account, navigate to your avatar > Data Sources.
In the Agents tab, confirm that you can see your Soda-hosted agent and that its status is "green" in the Last Seen column.
Navigate to the Data source tab, then click New Data Source and follow the guided steps to connect to a new data source. Refer to the subsections below for insight into the values to enter in the fields and editing panels in the guided steps.

1. Attributes

Field or Label

Guidance

Data Source Label

Provide a unique identifier for the data source. Soda Cloud uses the label you provide to define the immutable name of the data source against which it runs the Default Scan.

Default Scan Agent

Select the Soda-hosted agent, or the name of a Soda Agent that you have previously set up in your secure environment. This identifies the Soda Agent to which Soda Cloud must connect in order to run its scan.

Check Schedule

Provide the scan frequency details Soda Cloud uses to execute scans according to your needs. If you wish, you can define the schedule as a cron expression.

Starting At

Select the time of day to run the scan. The default value is midnight.

Cron Expression

(Optional) Write your own cron expression to define the schedule Soda Cloud uses to run scans.

Anomaly Dashboard Scan Schedule Available in 2025

Provide the scan frequency details Soda Cloud uses to execute a daily scan to automatically detect anomalies for the anomaly dashboard.

2. Connect

In the editing panel, provide the connection configurations Soda Cloud needs to be able to access the data in the data source. Connection configurations are data source-specific and include values for things such as a database's host and access credentials.

Access the data source-specific connection configurations for the connection syntax and descriptions; adjust the values to correspond with your data source’s details.

To more securely provide sensitive values such as usernames and passwords in a self-hosted agent deployment model, use environment variables in a values.yml file when you deploy the Soda Agent. See Use environment variables for data source connection credentials for details.

3. Discover

During its initial scan of your datasource, Soda Cloud discovers all the datasets the data source contains. It captures basic information about each dataset, including a dataset's schema and the columns it contains.

In the editing panel, specify the datasets that Soda Cloud must include or exclude from this basic discovery activity. The default syntax in the editing panel instructs Soda to collect basic dataset information from all datasets in the data source except those with names that begin with test_. The % is a wildcard character. See Add dataset discovery for more detail on profiling syntax.

Known issue: SodaCL does not support using variables in column profiling and dataset discovery configurations.

discover datasets:
  datasets:
    - include %
    - exclude test_%

4. Profile

To gather more detailed profile information about datasets in your data source and automatically build an anomaly dashboard for data quality observability, you can configure Soda Cloud to profile the columns in datasets.

Profiling a dataset produces two tabs' worth of data in a dataset page:

In the Columns tab, you can see column profile information including details such as the calculated mean value of data in a column, the maximum and minimum values in a column, and the number of rows with missing data.
In the Anomalies tab, you can access an out-of-the-box Anomaly Dashboard that uses the column profile information to automatically begin detecting anomalies in your data relative to the patterns the machine learning algorithm learns over the course of approximately five days. Available in 2025: Learn more

In the editing panel, provide details that Soda Cloud uses to determine which datasets to include or exclude when it profiles the columns in a dataset. The default syntax in the editing panel instructs Soda to profile every column of every dataset in this data source, and, superfluously, all datasets with names that begin with prod. The % is a wildcard character. See Add column profiling for more detail on profiling syntax.

Column profiling can be resource-heavy, so carefully consider the datasets for which you truly need column profile information. Refer to Compute consumption and cost considerations for more detail.

profile columns:
  columns:
    - "%.%"  # Includes all your datasets
    - prod%  # Includes all datasets that begin with 'prod'

5. Assign Owner

Field or Label

Guidance

Data Source Owner

The Data Source Owner maintains the connection details and settings for this data source and its Default Scan Definition.

Default Dataset Owner

The Datasets Owner is the user who, by default, becomes the owner of each dataset the Default Scan discovers. Refer to Manage global roles, user groups, and settings to learn how to adjust the Dataset Owner of individual datasets

Access an anomaly dashboard

After approximately five days, during which Soda’s machine learning studies your data, you can navigate to the Anomalies tab on the Dataset page on one of the datasets you included in profiling to view the issues Soda automatically detected.

The three Dataset Metrics tiles represent the most recent measurement or, in other words, one day’s worth of data anomaly detection. The three Column Metrics tiles display the last seven days’ worth of measurements and any anomalies that Soda detected.

When you click a Column Metrics tile to access more information, the list below details which columns contained anomalies.

A red warning icon for a column indicates that Soda registered an anomaly in the last daily scan of the dataset.
A green check icon for a column indicates that Soda resgisterd no anomalies in the last daily scan of the dataset.
A grayed-out icon for a column indicates that Soda registered an anomaly for a check at least once in the last seven days, but not on the most recent daily scan.

Click a Dataset Metric tile or the column name for a Column Metric to open the Check History for the anomaly detection check. Optionally, you can add feedback to individual data points in the check history graph to help refine the anomaly detection’s algorithm pattern recognition and its ability to recognize anomalies.

Set up alert notifications

The anomaly dashboard adheres to Soda’s “no noise” policy when it comes to alert notifications for data quality issues. As such, the dashboard does not automatically send any notifications to anyone out of the box. If you wish to received alert notifications for any of the anomalies the dashboard detects, use the bell (🔔) icon.

If your Soda Admin has integrated your Soda Cloud account with Slack or MS Teams to receive check notifications, you can direct anomaly dashboard alerts to those channels. The dashboard does not support sending alerts via webhook.

For a Dataset Metric, click the bell to follow the guided instructions to set up a rule that defines where to send an alert notification when Soda detects an anomalous measurement for the metric.

For a Column Metric, click the bell next to an individual column name from those listed in the table below the three column metric tiles. Follow the guided instructions to set up a rule that defines where to send an alert notification when Soda detects an anomalous measurement for the metric.

For example, if you want to receive notifications any time Soda detects an anomalous volume of duplicate values in an order_id column, click the Duplicate tile to display all the columns for which Soda automatically detects anomalies, then click the bell for order_id and set up a rule. If you also wish to receive notifications for anomalous volumes of missing values in the same column, click the Missing tile, then click the bell for order_id to set up a second rule.

Go further

Learn more about the anomaly dashboard for datasets.
Learn more about organizing check results, setting alerts, and investigating issues.
Write your own checks for data quality.
Integrate Soda with Slack to send alert notifications directly to channels in your workspace.
Integrate Soda with a data catalog to see data quality results from within the catalog:

Need help? Join the Soda community on Slack.

PreviousSelf-serve Soda NextBuild a Sigma dashboard

Last updated 1 month ago

Was this helpful?