Data contract check reference
Last modified on 26-Apr-24
Soda data contracts is a Python library that verifies data quality standards as early and often as possible in a data pipeline so as to prevent negative downstream impact. Learn more About Soda data contracts.
✖️ Requires Soda Core Scientific
✔️ Supported in Soda Core 3.3.0 or greater
✖️ Supported in Soda Library + Soda Cloud
✖️ Supported in Soda Cloud Agreements + Soda Agent
✖️ Supported by SodaGPT
✖️ Available as a no-code check
What follows is reference documentation and examples of each type of data contract check.
Note that data contracts checks do not follow SodaCL syntax.
Duplicate
Freshness
Missing
Row count
SQL aggregation
User-defined SQL checks
Validity
List of threshold keys
Duplicate
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
no_duplicate_values | no | - | name columns |
duplicate_count | required | - | name |
duplicate_percent | required | - | name |
dataset: dim_employee
columns:
- name: id
checks:
- type: no_duplicate_values
- name: last_name
checks:
- type: duplicate_count
must_be_less_than: 10
name: Fewer than 10 duplicate names
- name: address_line1
checks:
- type: duplicate_percent
must_be_less_than: 1
checks:
- type: no_duplicate_values
columns: ['phone', 'email']
Freshness
This check compares the maximum value in the column to the time the scan runs; the check fails if that computed value exceeds the threshold you specified in the check.
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
freshness_in_days | required | - | name |
freshness_in_hours | required | - | name |
freshness_in_minutes | required | - | name |
dataset: dim_customer
columns:
- name: date_first_purchase
checks:
type: freshness_in_days
must_be_less_than: 2
name: New data arrived within the last 2 days
Missing
If you do not use an optional column configuration key to identify the values Soda ought to consider as missing, Soda uses NULL to identify missing values.
See also: Combine missing and validity
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
no_missing_values | no | - | name missing_values missing_sql_regex |
missing_count | required | - | name missing_values missing_sql_ regex |
missing_percent | required | - | name missing_values missing_sql_regex |
dataset: dim_customer
columns:
- name: title
checks:
- type: no_missing_values
- name: middle_name
checks:
- type: missing_count
must_be_less_than: 10
# Soda includes 'NULL' in list of values by default
missing_values: ['xxx', 'none', 'NA']
- name: last_name
checks:
- type: missing_count
must_be_less_than: 5
- name: first_name
checks:
- type: missing_percent
must_be_less_than: 1
name: No whitespace entries
# regular expression must match the dialect of your SQL engine
missing_sql_regex: '[\s]'
Row count
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
rows_exist | no | - | name |
row_count | required | - | name |
dataset: dim_customer
columns:
- name: first_name
checks:
- type: row_count
must_be_between: [100, 120]
name: Verify row count range
checks:
- type: rows_exist
SQL aggregation
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
avg | required | - | name |
sum | required | - | name |
dataset: dim_customer
columns:
- name: yearly_income
checks:
- type: avg
must_be_between: [50000, 80000]
name: Average salary within expected range
- name: total_children
checks:
- type: sum
must_be_less_than: 10
User-defined SQL checks
Use a SQL expression or SQL query check to customize your data contract check. Apply these checks at the column or dataset level.
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
sql_expression | required | metric metric_sql_expression | name |
user_defined_sql | required | metric sql_query | name |
dataset: CUSTOMERS
columns:
- name: country
checks:
- type: sql_expression
# define a name for your custom metric
metric: us_count
metric_sql_expression: COUNT(CASE WHEN country = 'US' THEN 1 END)
must_be_not_between: [100, 120]
checks:
- type: user_defined_sql
# define a name for your custom metric
metric: count_america
sql_query: |
SELECT COUNT(*)
FROM {table_name}
WHERE country = 'US'
must_be_between: [0, 5]
Validity
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
no_invalid_values | no | At least one of:valid_values valid_format Valid formatsvalid_sql_regex valid_min valid_max valid_length valid_min_length valid_max_length valid_values_reference_data invalid_values invalid_format invalid_sql_regex | name |
invalid_count | required | At least one of:valid_values valid_format Valid formatsvalid_sql_regex valid_min valid_max valid_length valid_min_length valid_max_length valid_values_reference_data invalid_values invalid_format invalid_sql_regex | name |
invalid_percent | required | At least one of:valid_values valid_format Valid formatsvalid_sql_regex valid_min valid_max valid_length valid_min_length valid_max_length valid_values_reference_data invalid_values invalid_format invalid_sql_regex | name |
dataset: dim_customer
columns:
- name: first_name
data_type: character varying
checks:
- type: no_invalid_values
valid_min_length: 2
- name: email_address
checks:
- type: invalid_count
must_be_less_than: 25
valid_format: email
- name: id
checks:
- type: invalid_percent
must_be_less_than: 5
valid_sql_regex: '^ID.$'
name: Less than 5% invalid
- name: total_children
checks:
- type: invalid_count
# With multiple configurations, rows must meet ALL criteria
valid_min: 0
valid_max: 12
must_be_less_than: 10
name: Acceptable range of offspring count
- name: comment
checks:
- type: no_invalid_values
valid_min_length: 0
valid_max_length: 160
Valid formats
For a list of the available formats to use with the valid_formats
column configuration key, see: List of valid formats and Formats supported with Soda for MS SQL Server for SodaCL.
Validity reference
Also known as a referential integrity or foreign key check, Soda executes a validity check with a valid_values_reference_data
column configuration key as a separate query, relative to other validity queries. The query counts all values that exist in the named column which also do not exist in the column in the referenced dataset.
The referential dataset must exist in the same warehouse as the dataset identified by the contract.
dataset: dim_employee
columns:
- name: country
checks:
- type: invalid_percent
must_be_less_than: 3
valid_values_reference_data:
dataset: countryID
column: id
Combine missing and validity
You can combine column configuration keys to include both missing and validity parameters. Soda separately evaluates the parameters to prevent double-counting any rows that fail to meet the specified thresholds so that a row that fails both parameters only counts as one failed row.
dataset: dim_product
columns:
- name: size
checks:
- type: no_invalid_values
missing_values: ['N/A']
valid_values: ['S', 'M', 'L']
If you add both a missing and validity check to a single column, Soda leverages the results of preceding checks when evaluating subsequent ones.
In the example below, Soda considers any row that failed the no_missing_values
check as one that will fail the second, no_invalid_values
check without re-evaluating the row.
dataset: dim_product
columns:
- name: size
checks:
- type: no_missing_values
missing_values: ['N/A']
- type: no_invalid_values
valid_values: ['S', 'M', 'L']
In the case where you have configured multiple missing checks that specify different missing values, Soda does not merge the results of the check evaluation; it only honors that last set of missing values. Not supported by valid_values_reference_data
.
List of threhold keys
Threshold key | Expected value | Example |
---|---|---|
must_be | number | must_be: 0 |
must_not_be | number | must_not_be: 0 |
must_be_greater_than | number | must_be_greater_than: 100 |
must_be_greater_than_or_equal_to | number | must_be_greater_than_or_equal_to: 100 |
must_be_less_than | number | must_be_less_than: 100 |
must_be_less_than_or_equal_to | number | must_be_less_than_or_equal_to: 100 |
must_be_between | list of 2 numbers | must_be_between: [0, 100] |
must_be_not_between | list of 2 numbers | must_be_not_between: [0, 100] |
Threshold boundaries
When you use must_be_between
threshold keys, Soda includes the boundary values as acceptable. In the following example, a check result of 100
or 120
each passes.
dataset: dim_customer
columns:
- name: first_name
- name: middle_name
- name: last_name
checks:
- type: row_count
must_be_between: [100, 120]
When you use must_be_between
threshold keys, Soda includes the boundary values as acceptable. In the following example, a check result of 0
or 120
each fails.
dataset: dim_customer
columns:
- name: first_name
- name: middle_name
- name: last_name
checks:
- type: row_count
must_be_not_between: [0, 120]
Use multiple thresholds to adjust the inclusion of boundary values.
dataset: dim_customer
columns:
- name: total_children
# check passes if values are outside the range, inclusive of 20
checks:
- type: avg
must_be_less_than: 10
must_be_greater_than_or_equal_to: 20
- name: yearly_income
# check passes if values are inside the range, inclusive of 100
checks:
- type: avg
must_be_greater_than_or_equal_to: 100
must_be_less_than: 200
Go further
- Learn how to prepare a data contract.
- Need help? Join the Soda community on Slack.
Was this documentation helpful?
What could we do to improve this page?
- Suggest a docs change in GitHub.
- Share feedback in the Soda community on Slack.
Documentation always applies to the latest version of Soda products
Last modified on 26-Apr-24