Link Search Menu Expand Document

Data contract check reference


experimental
Last modified on 19-Apr-24

Soda data contracts is a Python library that verifies data quality standards as early and often as possible in a data pipeline so as to prevent negative downstream impact. Learn more About Soda data contracts.

✖️    Requires Soda Core Scientific
✔️    Supported in Soda Core 3.3.0 or greater
✖️    Supported in Soda Library + Soda Cloud
✖️    Supported in Soda Cloud Agreements + Soda Agent
✖️    Supported by SodaGPT
✖️    Available as a no-code check

What follows is reference documentation and examples of each type of data contract check.
Note that data contracts checks do not follow SodaCL syntax.

Duplicate
Freshness
Missing
Row count
SQL aggregation
User-defined SQL checks
Validity
List of threshold keys

Duplicate

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
no_duplicate_values no - name
columns
duplicate_count required - name
duplicate_percent required - name
dataset: dim_employee

columns:
- name: id
  checks:
  - type: no_duplicate_values
- name: last_name
  checks:
  - type: duplicate_count
    must_be_less_than: 10
    name: Fewer than 10 duplicate names
- name: address_line1
  checks:
  - type: duplicate_percent
    must_be_less_than: 1

checks:
- type: no_duplicate_values
  columns: ['phone', 'email']

Freshness

This check compares the maximum value in the column to the time the scan runs; the check fails if that computed value exceeds the threshold you specified in the check.

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
freshness_in_days required - name
freshness_in_hours required - name
freshness_in_minutes required - name
dataset: dim_customer

columns:
- name: date_first_purchase
  checks:
    type: freshness_in_days
    must_be_less_than: 2
    name: New data arrived within the last 2 days

Missing

If you do not use an optional column configuration key to identify the values Soda ought to consider as missing, Soda uses NULL to identify missing values.

See also: Combine missing and validity

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
no_missing_values no - name
missing_values
missing_sql_regex
missing_count required - name
missing_values
missing_sql_ regex
missing_percent required - name
missing_values
missing_sql_regex
dataset: dim_customer

columns: 
- name: title
  checks: 
  - type: no_missing_values 
- name: middle_name
  checks: 
  - type: missing_count
    must_be_less_than: 10
    # Soda includes 'NULL' in list of values by default
    missing_values: ['xxx', 'none', 'NA']
- name: last_name
  checks:
  - type: missing_count
    must_be_less_than: 5 
- name: first_name
  checks: 
  - type: missing_percent
    must_be_less_than: 1
    name: No whitespace entries
    # regular expression must match the dialect of your SQL engine
    missing_sql_regex: '[\s]'

Row count

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
rows_exist no - name
row_count required - name
dataset: dim_customer

columns: 
- name: first_name
  checks: 
  - type: row_count
    must_be_between: [100, 120]
    name: Verify row count range

checks: 
- type: rows_exist

SQL aggregation

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
avg required - name
sum required - name
dataset: dim_customer

columns:
- name: yearly_income
  checks:
  - type: avg
    must_be_between: [50000, 80000]
    name: Average salary within expected range

- name: total_children
  checks:
  - type: sum
    must_be_less_than: 10

User-defined SQL checks

Use a SQL expression or SQL query check to customize your data contract check. Apply these checks at the column or dataset level.

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
sql_expression required metric
metric_sql_expression
name
user_defined_sql required metric
sql_query
name
dataset: CUSTOMERS

columns:
- name: country
  checks:
  - type: sql_expression
    # define a name for your custom metric
    metric: us_count
    metric_sql_expression: COUNT(CASE WHEN country = 'US' THEN 1 END)
    must_be_not_between: [100, 120]

checks:
- type: user_defined_sql
  # define a name for your custom metric
  metric: count_america
  sql_query: |
      SELECT COUNT(*)
      FROM {table_name}
      WHERE country = 'US'
  must_be_between: [0, 5]

Validity

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
no_invalid_values no At least one of:
valid_values
valid_format Valid formats
valid_sql_regex
valid_min
valid_max
valid_length
valid_min_length
valid_max_length
valid_values_reference_data
invalid_values
invalid_format
invalid_sql_regex
name
invalid_count required At least one of:
valid_values
valid_format Valid formats
valid_sql_regex
valid_min
valid_max
valid_length
valid_min_length
valid_max_length
valid_values_reference_data
invalid_values
invalid_format
invalid_sql_regex
name
invalid_percent required At least one of:
valid_values
valid_format Valid formats
valid_sql_regex
valid_min
valid_max
valid_length
valid_min_length
valid_max_length
valid_values_reference_data
invalid_values
invalid_format
invalid_sql_regex
name
dataset: dim_customer

columns: 
- name: first_name
  data_type: character varying
  checks: 
  - type: no_invalid_values
    valid_min_length: 2
- name: email_address
  checks: 
  - type: invalid_count
    must_be_less_than: 25
    valid_format: email
- name: id
  checks:
  - type: invalid_percent
    must_be_less_than: 5
    valid_sql_regex: '^ID.$'
    name: Less than 5% invalid
- name: total_children
  checks:
  - type: invalid_count
    # With multiple configurations, rows must meet ALL criteria
    valid_min: 0
    valid_max: 12
    must_be_less_than: 10
    name: Acceptable range of offspring count
  - name: comment
    checks:
    - type: no_invalid_values
      valid_min_length: 0
      valid_max_length: 160


Valid formats

For a list of the available formats to use with the valid_formats column configuration key, see: List of valid formats and Formats supported with Soda for MS SQL Server for SodaCL.


Validity reference

Also known as a referential integrity or foreign key check, Soda executes a validity check with a valid_values_reference_data column configuration key as a separate query, relative to other validity queries. The query counts all values that exist in the named column which also do not exist in the column in the referenced dataset.

The referential dataset must exist in the same warehouse as the dataset identified by the contract.

dataset: dim_employee

columns:
- name: country
  checks:
  - type: invalid_percent
    must_be_less_than: 3
    valid_values_reference_data: 
      dataset: countryID
      column: id


Combine missing and validity

You can combine column configuration keys to include both missing and validity parameters. Soda separately evaluates the parameters to prevent double-counting any rows that fail to meet the specified thresholds so that a row that fails both parameters only counts as one failed row.

dataset: dim_product

columns:
- name: size
  checks:
  - type: no_invalid_values
    missing_values: ['N/A']
    valid_values: ['S', 'M', 'L']


If you add both a missing and validity check to a single column, Soda leverages the results of preceding checks when evaluating subsequent ones.

In the example below, Soda considers any row that failed the no_missing_values check as one that will fail the second, no_invalid_values check without re-evaluating the row.

dataset: dim_product

columns:
- name: size
  checks:
  - type: no_missing_values
    missing_values: ['N/A']
  - type: no_invalid_values
    valid_values: ['S', 'M', 'L']

In the case where you have configured multiple missing checks that specify different missing values, Soda does not merge the results of the check evaluation; it only honors that last set of missing values. Not supported by valid_values_reference_data.


List of threhold keys

Threshold key Expected value Example
must_be number must_be: 0
must_not_be number must_not_be: 0
must_be_greater_than number must_be_greater_than: 100
must_be_greater_than_or_equal_to number must_be_greater_than_or_equal_to: 100
must_be_less_than number must_be_less_than: 100
must_be_less_than_or_equal_to number must_be_less_than_or_equal_to: 100
must_be_between list of 2 numbers must_be_between: [0, 100]
must_be_not_between list of 2 numbers must_be_not_between: [0, 100]

Threshold boundaries

When you use must_be_between threshold keys, Soda includes the boundary values as acceptable. In the following example, a check result of 100 or 120 each passes.

dataset: dim_customer

columns:
- name: first_name
- name: middle_name
- name: last_name

checks:
- type: row_count
  must_be_between: [100, 120]


When you use must_be_between threshold keys, Soda includes the boundary values as acceptable. In the following example, a check result of 0 or 120 each fails.

dataset: dim_customer

columns:
- name: first_name
- name: middle_name
- name: last_name

checks:
- type: row_count
  must_be_not_between: [0, 120]


Use multiple thresholds to adjust the inclusion of boundary values.

dataset: dim_customer

columns:
- name: total_children
  # check passes if values are outside the range, inclusive of 20 
  checks:
  - type: avg
    must_be_less_than: 10
    must_be_greater_than_or_equal_to: 20
- name: yearly_income
  # check passes if values are inside the range, inclusive of 100
  checks:
  - type: avg
    must_be_greater_than_or_equal_to: 100
    must_be_less_than: 200

Go further


Was this documentation helpful?

What could we do to improve this page?

Documentation always applies to the latest version of Soda products
Last modified on 19-Apr-24