Link Search Menu Expand Document

Quick start for SodaCL

If you are staring at a blank YAML file wondering what SodaCL checks to write to surface data quality issues, this quick start tutorial is for you.

blank-page

Soda Core and SodaCL: In brief
About this tutorial
Tutorial prerequisites
Row count and cross checks
Duplicate check
Freshness check
Missing and invalid checks
Reference checks
Schema checks
Go further

Soda Core and SodaCL: In brief

Soda Checks Language (SodaCL) is a YAML-based, domain-specific language for data reliability. Used in conjunction with Soda Core, Soda’s open-source, command-line tool, you use SodaCL to write checks for data quality, then use Soda Core to scan the data in your data source and execute those checks.

After installing Soda Core, you connect Soda Core to your data source (Snowflake, BigQuery, etc.) by defining connection details such as host, username, and password, in a configuration YAML file. Then, you define your Soda Checks for data quality in a checks YAML file. A Soda Check is a test that Soda Core performs when it scans a dataset in your data source. When you use Soda Core to run a scan on data in your data source, you reference both the configuration and checks YAML files in the scan command.

A Soda scan executes the checks you defined in the checks YAML file and returns a result for each check: pass, fail, or error. (Optionally, you can configure a check to warn instead of fail by setting an alert configuration.)

About this tutorial

With over 25 built-in SodaCL checks and metrics to choose from, it can be hard to know where to begin. This tutorial offers suggestions for some basic checks you can write to begin surfacing missing, invalid, unexpected data in your datasets.

All the example checks in this tutorial use placeholder values for dataset and column name identifiers, but you can copy+paste the examples into your own checks YAML file and adjust the details to correspond to your own data.

You do not need to follow the tutorial sequentially.

Tutorial prerequisites

  • You have completed the Quick start for Soda Core and Soda Cloud
    OR
    you have followed the instructions to install and configure Soda Core on your own.
  • You have installed a code editor such as Sublime or Visual Studio Code.
  • You have created a new YAML file in your code editor and named it checks.yml.
  • (Optional) You have read the first two sections in Metrics and checks as a primer for SodaCL.

Row count and cross checks

One of the most basic checks you can write uses the row_count metric. When it executes the following check during a scan, Soda simply counts the rows in the dataset you identify in the checks for section header to confirm that the dataset is not empty. If it counts one or more rows, the check result is pass.

# Check that a dataset contains rows
checks for dataset_name:
  - row_count > 0


The check above is an example that use a numeric metric in a standard check pattern. By contrast, the following unique cross check compares row counts between datasets within the same data source without setting a threshold for volume, like > 50.

This type of check is useful when, for example, you want to compare row counts to validate that a transformed dataset contains the same volume of data as the source from which it came.

# Compare row counts between datasets
checks for dataset_name:
  - row_count same as other_dataset_name

Run a scan to execute your checks:

soda scan -d datasource_name -c configuration.yml checks.yml

Read more

Duplicate check

For the nearly universal use case of making sure that values in a column are not duplicated, you can use the duplicate_count metrics. In the following example, Soda counts the number of duplicate values in the column_name column, identified as the argument in parentheses appended to the metric. If there is even one value that is a duplicate of another, the check result is fail.

This type of check is useful when, for example, you need to make sure that values in an id column are unique, such customer_id or product_id.

# Check that a column does not contain any duplicate values
checks for dataset_name:
  - duplicate_count(column_name) = 0


If you wish, you can quickly check for duplicate values across multiple columns. In the following example, Soda counts the number of values in column_name1 that are duplicates of values in column_name2. This type of check is useful when, for example, you need to make sure that order numbers or other unique identifiers are not duplicated.

# Check that duplicate values do not exist across columns
checks for dataset_name:
  - duplicate_count(column_name1, column_name2) = 0

Run a scan to execute your checks:

soda scan -d datasource_name -c configuration.yml checks.yml

Read more

Freshness check

If your dataset contains a column that stores timestamp information, you can configure a freshness check. This type of check is useful when, for example, you need to validate that the data feeding a weekly report or dashboard is not stale. Timely data is reliable data!

In this example, the check fails if the most-recently added row (in other words, the “youngest” row) in the timestamp_column_name column is more than 24 hours old.

# Check that data in dataset is less than one day old
checks for dataset_name:
  - freshness(timestamp_column_name) < 1d

Run a scan to execute your checks:

soda scan -d datasource_name -c configuration.yml checks.yml

Read more

Missing and invalid checks

SodaCL’s missing metrics make it easy to find null values in a column. You don’t even have to specify that NULL qualifies as a missing value because SodaCL registers null values as missing by default. The following check passes if there are no null values in column_name, identified as the value in parentheses.

# Check that there are no null values in a column
checks for dataset_name:
  - missing_count(column_name) = 0


If the type of data a dataset contains is TEXT (string, character varying, etc.), you can use an invalid metric to surface any rows that contain ill-formatted data. This type of check is useful when, for example, you need to validate that all values in an email address column are formatted as name@domain.extension.

The following example fails if, during a scan, Soda discovers that more than 5% of the values in the email_column_name do not follow the email address format.

# Check an email column that all values are in email format
checks for dataset_name:
  - invalid_percent(email_column_name) > 5:
      valid format: email


If you want to surface more than just null values as missing, you can specify a list of values that, in the context of your business rules, qualify as missing. In the example check below, Soda registers N/A, 0000, or none as missing values in addition to NULL; if it discovers more than 5% of the rows contain one of these values, the check fails.

Note that the missing value 0000 is wrapped in single quotes; all numeric values you include in such a list must be wrapped in single quotes.

# Check that fewer than 5% of values in column contain missing values
checks for dataset_name:
  - missing_percent(column_name) < 5%:
      missing values: [N/A, '0000', none]

Run a scan to execute your checks:

soda scan -d datasource_name -c configuration.yml checks.yml

Read more

Reference checks

If you need to validate that two datasets contain the same values, you can use a reference check. The following unique check compares the values of column_name and another_column, identified as the values in parentheses, between datasets within the same data source. The check passes if the values in the columns are exactly the same.

# Check that values in a column exist in another column in a different dataset
checks for dataset_name:
  - values in (column_name) must exist in different_dataset_name (another_column)


If you wish, you can compare the values of multiple columns in one check. Soda compares the column names respectively, so that in the following example, column_name1 compares to other_column1, and column_name2 compares to other_column2.

# Check that values in two columns exist in two other columns in a different dataset
checks for dataset_name:
  - values in (column_name1, column_name2) must exist in different_dataset_name (other_column1, other_column2)

Run a scan to execute your checks:

soda scan -d datasource_name -c configuration.yml checks.yml

Read more

Schema checks

To eliminate the frustration of the silently evolving dataset schema, use schema checks with alert configurations to notify you when column changes occur.

If you have set up a Soda Cloud account and connected it to Soda Core, you can use a catch-all schema check that results in a warning whenever a Soda scan reveals that a column has been added, removed, moved within the context of an index, or changed data type relative to the results of the previous scan.

# Requires a Soda Cloud account
# Check for any schema changes to dataset
checks for dataset_name:
  - schema:
      warn: 
        when schema changes: any


If you wish to apply a more granular approach to monitoring schema evolution, you can specify columns in a dataset that ought to be present or which should not exist in the dataset.

The following example warns you when, during a scan, Soda discovers that column_name is missing in the dataset; the check fails if either column_name1 or column_name2 exist in the dataset. This type of check is useful when, for example, you need to ensure that datasets do not contain columns of sensitive data such as credit card numbers or personally identifiable information (PII).

# Check for absent or forbidden columns in dataset
checks for dataset_name:
  - schema:
      warn:
        when required column missing: [column_name]
      fail:
        when forbidden column present: [column_name1, column_name2]

Be aware that a check that contains one or more alert configurations only ever yields a single check result; one check yields one check result. If your check triggers both a warn and a fail, the check result only displays the more severe, failed check result. Read more.

Run a scan to execute your checks:

soda scan -d datasource_name -c configuration.yml checks.yml

Read more

Go further


Last modified on 01-Jul-22

Was this documentation helpful?
Share feedback in the Soda community on Slack.

Help improve our docs!

  • Request a docs change.
  • Edit this page in our GitHub repo.