Quickstart
This quickstart shows how Soda detects unexpected data issues with AI-powered Anomaly Detection and prevents future problems with data contracts. The example uses Databricks, but the same steps apply to any other supported data source.
Scenario
You are a data engineer at a retail company responsible for the regional_sales dataset, which your team uses to manage regional sales data from hundreds of stores across the country. The dataset feeds executive dashboards and downstream ML models for inventory planning. Accuracy and freshness are critical, so you need both:
Automated anomaly detection on key metrics (row counts, freshness, schema drift)
Proactive enforcement of business rules via data contracts
Sign up
Contact us at [email protected] to get an account set up.
After signing up, you'll be guided through the setup flow with an in-product tour. Below are the steps to set up a data source and start improving data quality.
Add a Data Source
Soda Cloud’s no-code UI lets you connect to any data source in minutes.
In Soda Cloud, click Data Sources → + New Data Source

Name your data source under Data Source Label.
Switch to the Connect tab and fill in the connection credentials. The example below connects the Soda instance to Databricks:

Click Connect. This will test the connection and move to the next step.
Select the datasets you want to onboard to Soda Cloud.

Enable Monitoring and Profiling. By default, Metric Monitoring is enabled to automatically track key metrics on all the datasets you onboard and alert you when anomalies are detected. It is powered by built-in machine learning that compares current values against historical trends.

Click Finish to onboard your datasets. Soda Cloud will now spin up its Soda-hosted Agent and perform an initial profiling & historical metric collection scan. This usually takes only a few minutes.
Part 1: Review Anomaly Detection Results
Congratulations, you’ve onboarded your first dataset! Now let’s make sure you always know what’s happening with it.
That’s where Metric Monitoring comes in. It automatically tracks key metrics like volume, freshness, and schema changes, with no manual setup required. You’ll spot anomalies, detect trends, and catch unexpected shifts before they become problems.
Step 1: Open the Metric Monitors dashboard
Go to Datasets → select the dataset to inspect.

Navigate to the Metric Monitors tab to learn more about the metrics calculated.
You'll immediately see that key metrics are automatically monitored by default, helping you detect pipeline issues, data delays, and unexpected structural changes as they happen. No setup needed, just visibility you can trust.

Step 2: View anomalies in a specific monitor
In this guide, we will focus on the Most recent timestamp monitor. The panel shows that the value was expected to fall within a range of 0 to 5m 31s, but the recorded value at scan time was 56m 49s (a short illustration of how this value is derived follows the list below). To take a closer look:
Click the Most recent timestamp (or monitor of your choice) block.

In the monitor page you’ll see:
measured value vs. expected range,
any red-dot anomalies flagged by the model,
buttons to Mark as expected, Create new incident, etc.
Flag an outlier as "expected" or investigate it further.
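For context on what this monitor measures: the value is simply the gap between the scan time and the newest timestamp observed in the dataset. The snippet below is a minimal, self-contained illustration with hypothetical timestamps chosen to reproduce the 56m 49s reading above; it is not how Soda computes the metric internally, just the idea behind it.
from datetime import datetime, timezone

# Hypothetical values: the newest order_date in the dataset and the scan time
latest_order_ts = datetime(2024, 5, 15, 9, 3, 11, tzinfo=timezone.utc)
scan_time = datetime(2024, 5, 15, 10, 0, 0, tzinfo=timezone.utc)

# The "Most recent timestamp" reading is the lag between the two
lag = scan_time - latest_order_ts
print(f"Most recent timestamp lag: {lag}")  # 0:56:49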
Soda’s anomaly detection engine was built in-house (no third-party libraries) and optimized for high precision. It continuously adapts to your data patterns and incorporates your feedback to reduce false alarms. Designed to minimize both false positives and missed detections, it showed a 70% improvement in detecting anomalous data quality metrics compared to Facebook Prophet across hundreds of diverse, internally curated datasets containing known data quality issues.
The algorithm also keeps the modeling process transparent and under your control, so its results remain interpretable and can be adapted to your needs. Because it learns from historical data, its accuracy improves over time.
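Soda’s model itself is proprietary, but the core idea of an expected range can be illustrated with a deliberately simplified sketch. This is not Soda’s algorithm; it only shows the principle of comparing the latest value of a metric (here, a hypothetical daily row count) against a range derived from its history.
from statistics import mean, stdev

def expected_range(history: list[float], k: float = 3.0) -> tuple[float, float]:
    """Derive a naive expected range (mean ± k standard deviations) from history."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma

# Hypothetical daily row counts for regional_sales
history = [10_120, 10_340, 9_980, 10_410, 10_250, 10_190, 10_305]
latest = 4_732  # value measured at the most recent scan

low, high = expected_range(history)
if not (low <= latest <= high):
    print(f"Anomaly: {latest} is outside the expected range {low:.0f} - {high:.0f}")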
Part 2: Attack the Issues at Source (No-Code)
Our automated anomaly detection has just done the heavy lifting for you, identifying unusual patterns and potential data issues without any setup required.
But to prevent those issues from happening again, you must define exactly what your data should look like: every column, every rule, every expectation.
That’s where Data Contracts come in. They let you proactively set the standards for your data, so problems like this are flagged or even prevented before they impact your business.
Step 1: Create a Data Contract
Create a new data contract to define and enforce data quality expectations.
In your Dataset Details page, go to the Checks tab.
Click Create Contract.

When you create a data contract, Soda connects to your dataset and builds a contract template based on the dataset schema. From there, you can add both dataset-level and column-level checks, as well as define a verification schedule or a partition.

Toggle View Code if you’d like to inspect the generated SodaCL/YAML. This gives you access to the full contract code.

You can copy the full example below, paste it into the editor, and adapt it as you wish. At any point, you can toggle back to the no-code view to see and edit the same checks there.
dataset: databricks_demo/unity_catalog/demo_sales_operations/regional_sales

filter: |
  order_date >= ${var.start_timestamp}
  AND order_date < ${var.end_timestamp}

variables:
  start_timestamp:
    default: DATE_TRUNC('week', CAST('${soda.NOW}' AS TIMESTAMP))
  end_timestamp:
    default: DATE_TRUNC('week', CAST('${soda.NOW}' AS TIMESTAMP)) + INTERVAL '7 days'

checks:
  - row_count:
  - schema:

columns:
  - name: order_id
    data_type: INTEGER
    checks:
      - missing:
          name: Must not have null values
  - name: customer_id
    data_type: INTEGER
    checks:
      - missing:
          name: Must not have null values
  - name: order_date
    data_type: DATE
    checks:
      - missing:
          name: Must not have null values
      - failed_rows:
          name: Cannot be in the future
          expression: order_date > DATE_TRUNC('day', CAST('${soda.NOW}' AS TIMESTAMP)) + INTERVAL '1 day'
          threshold:
            must_be: 0
  - name: region
    data_type: VARCHAR
    checks:
      - invalid:
          valid_values:
            - North
            - South
            - East
            - West
          name: Valid values
  - name: product_category
    data_type: VARCHAR
  - name: quantity
    data_type: INTEGER
    checks:
      - missing:
          name: Must not have null values
      - invalid:
          valid_min: 0
          name: Must be higher than 0
  - name: price
    data_type: NUMERIC
    checks:
      - invalid:
          valid_min: 0
          name: Must be higher than 0
      - missing:
          name: Must not have null values
  - name: payment_method
    data_type: VARCHAR
    checks:
      - missing:
          name: Must not have null values
      - invalid:
          threshold:
            metric: count
            must_be: 0
          filter: region <> 'North'
          valid_values:
            - PayPal
            - Bank Transfer
            - Cash
            - Credit Card
          name: Valid values in all regions except North
      - invalid:
          name: Valid values in North
          filter: region = 'North'
          valid_values:
            - PayPal
            - Bank Transfer
            - Credit Card
          qualifier: ABC124
That’s right: with Soda, you can edit a contract using either a no-code interface or directly in code. This ensures an optimal experience for all user personas while also providing a version-controlled code format that can be synced with a Git repository.
Step 2: Publish & verify
Click Test to verify that the contract executes as expected.
When you are done with the contract, click Publish.

Click Verify. Soda will evaluate your rules against the current data.

Step 3: Review check results
Review the outcomes of the contract checks to confirm whether the data meets expectations. You can drill into any failures from the Checks tab.

Part 3: Attack the Issues at Source (Code)
You can trigger contract verification programmatically as part of your pipeline, so your data gets tested every time it runs.
We’ve prepared an example notebook to show you how it works:
Open the following Notebook example: https://colab.research.google.com/drive/1zkV_2tLJ4ohdzmKGS3LgdFDDnTNTUXew?usp=sharing
In your Python environment, first install the Soda Core library
pip install -i https://pypi.dev.sodadata.io/simple -U soda-core
Then create a configuration file with your API keys, which are required to connect to Soda Cloud. You can generate a key pair from your Profile: Generate API keys
import os

# Build the Soda Cloud configuration, reading the API keys from environment variables
soda_cloud_config = f"""
soda_cloud:
  host: beta.soda.io
  api_key_id: {os.getenv("api_key_id")}
  api_key_secret: {os.getenv("api_key_secret")}
"""

# Write the configuration to the file the Soda library will read
with open("soda-cloud.yml", "w") as f:
    f.write(soda_cloud_config)

print("✅ soda-cloud.yml written.")
Now you are ready to trigger verification of the contract. To do so, provide the identifier of your dataset and the path to the configuration file you created in the previous step. This triggers a verification on the Soda Agent and returns the logs.
from soda_core import configure_logging
from soda_core.contracts import verify_contracts_on_agent

configure_logging(verbose=False)

res = verify_contracts_on_agent(
    dataset_identifiers=["databricks_demo/unity_catalog/demo_sales_operations/regional_sales"],
    soda_cloud_file_path="soda-cloud.yml",
)
print(res.get_logs())
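If you call this from a scheduled pipeline (for example a Databricks job or an orchestrator task), you will typically want the task itself to fail when the contract does. The sketch below shows that pattern using the same call as above; is_failed() is a hypothetical placeholder for whatever pass/fail indicator the result object exposes, so check the Python API reference linked below for the exact interface.
from soda_core import configure_logging
from soda_core.contracts import verify_contracts_on_agent

def run_contract_gate() -> None:
    """Verify the contract and raise so the surrounding pipeline task fails on violations."""
    configure_logging(verbose=False)
    result = verify_contracts_on_agent(
        dataset_identifiers=["databricks_demo/unity_catalog/demo_sales_operations/regional_sales"],
        soda_cloud_file_path="soda-cloud.yml",
    )
    print(result.get_logs())
    # NOTE: is_failed() is a hypothetical accessor used for illustration only;
    # consult the Python API reference for the exact result interface.
    if result.is_failed():
        raise RuntimeError("Soda contract verification failed")

if __name__ == "__main__":
    run_contract_gate()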
You can learn more about the Python API here: Python API
You’ve completed the tutorial and are now ready to start catching data quality issues with Soda.
What’s Next?
Explore Profiling in the Discover tab to curate column selections for deeper analysis.
Set up Notification Rules (bell icon → Add Notification Rule) to push alerts to Slack, Jira, PagerDuty, etc.
Dive into Custom Monitors via scan.yml or the UI for even more tailored metrics.