Last modified on 30-Nov-23
Soda Checks Language (SodaCL) is a YAML-based, domain-specific language for data reliability. Use SodaCL to write checks for data quality which Soda then executes when it scans the data in your data source.
A metric is a property of the data in your dataset. A threshold is the value for a metric that Soda checks against during a scan. Usually, you use both a metric and a threshold to define a SodaCL check in a checks YAML file, like the following example that checks that the
dim_customer dataset is not empty.
checks for dim_customer: - row_count > 0
A check is a test for data quality that you write using the Soda Checks Language (SodaCL). SodaCL includes over 25 built-in metrics that you can use to write checks, but you also have the option of writing your own SQL queries or expressions using SodaCL.
When it scans datasets in your data source, Soda Library executes the checks you defined in your checks YAML file. Technically, a check is a Python expression that, during a Soda scan, checks metrics to see if they match the parameters you defined for a threshold. A single Soda scan executes multiple checks against one or more datasets in your data source. Read more about running scans on multiple environments.
As a result of a scan, each check results in one of three default states:
- pass: the values in the dataset match or fall within the thresholds you specified
- fail: the values in the dataset do not match or fall within the thresholds you specified
- error: the syntax of the check is invalid
A fourth state, warn, is something you can explicitly configure for individual checks. See Add alert configurations.
The scan results appear in your Soda Library command-line interface (CLI) and the latest result appears in the Checks dashboard in the Soda Cloud web application; examples follow.
Optionally, you can add
--local option to the scan command to prevent Soda Library from sending check results and any other metadata to Soda Cloud.
Soda Library 1.0.x Soda Core 3.0.x Sending failed row samples to Soda Cloud Scan summary: 6/9 checks PASSED: paxstats in paxstats2 row_count > 0 [PASSED] check_value: 15007 Look for PII [PASSED] duplicate_percent(id) = 0 [PASSED] check_value: 0.0 row_count: 15007 duplicate_count: 0 missing_count(adjusted_passenger_count) = 0 [PASSED] check_value: 0 anomaly score for row_count < default [PASSED] check_value: 0.0 Schema Check [PASSED] 1/9 checks WARNED: paxstats in paxstats2 Abnormally large PAX count [WARNED] check_value: 659837 2/9 checks FAILED: paxstats in paxstats2 Validate terminal ID [FAILED] check_value: 27 Verify 2-digit IATA [FAILED] check_value: 3 Oops! 2 failure. 1 warning. 0 errors. 6 pass. Sending results to Soda Cloud Soda Cloud Trace: 4774***8
In general, SodaCL checks fall into one of three broad categories:
A standard check, as illustrated above with
row_count, uses a language pattern that includes a metric and a threshold. All numeric, missing, and validity metrics use this pattern and have a multitude of optional configurations. Read more about standard check types below.
Quick view of standard check metricsavg
Some checks that you write with SodaCL do not use metrics and thresholds, and instead follow unique patterns relevant to the data quality parameters they check. Each unique check type has its own documentation.
For example, a reference check that validates that the values in a column in one dataset match exactly with the values in another column in another dataset uses a unique pattern.
checks for dim_employees_dev: - values in salary must exist in dim_employee_prod salary
Quick view of unique check typesanomaly score
Finally, the user-defined checks make use of common table expressions (CTE) or SQL queries to construct a check; see an example below. This check type is designed to meet the needs of more complex and specific data quality checks, needs which cannot otherwise be met using the built-in standard and unique checks SodaCL provides. Each user-defined check type has its own documentation.
Use these checks to prepare expressions or queries for your data that Soda Library executes during a scan along with all the other checks in your checks YAML file.
checks for customers: - avg_surface < 1068: avg_surface expression: AVG(size * distance)
Quick view of user-defined check typesfailed rows
Standard check types use the same pattern to compose a check, but the metrics they use can, themselves, be divided into three categories:
- numeric - metrics that involve tabulation or calculation of data
- missing - metrics that identify values or formats of data that qualify as missing, such as NULL
- validity - metrics that identify values or formats of data that, according to your own business rules, are acceptable or unacceptable
All standard checks that use numeric, missing, or validity metrics can specify a fixed threshold which is not relative to any other threshold.
row_count > 0 is an example of a check with a fixed threshold as the threshold value,
0, is absolute.
Generally, a fixed threshold check has three or four mutable parts:
|an argument (optional)|
|a comparison symbol or phrase|
checks for retail_products: - row_count > 0 - max(size) <= 500
The example above defines two checks. The first check applies to the entire dataset and counts the rows to confirm that it is not empty. If the
retail_products dataset contains more than
0 rows, the check result is
|comparison symbol|| |
The second check applies to only the
size column in the dataset and checks that the values in that column do not exceed
500. If the
size column in the
retail_products dataset contains values larger than
500, the check result is
|comparison symbol|| |
Only checks that use numeric metrics can specify a change-over-time threshold, a value that is relative to a previously-measured, or historic, value. Sometimes referred to a dynamic threshold, you use these change-over-time thresholds to gauge changes to the same metric over time.
You must have a Soda Cloud account to use change-over-time thresholds.
Refer to change-over-time thresholds for further details.
While the most basic of standard checks use a single value to identify a fixed threshold, such as
row_count >= 10, you can use comparison phrases to define the upper and lower boundaries for a fixed threshold value. Read more about fixed and dynamic thresholds.
The following sections present several ways to set boundaries using the
row_count metric in the example checks. You can use any numeric, missing, or validity metric in lieu of
By default, SodaCL includes the values that define the boundary thresholds when Soda Library executes a check. In the following example, the check passes if the number of rows is equal to 10, 11, 12, 13, 14, or 15 because SodaCL includes both boundary thresholds,
15, when Soda Library executes the check.
checks for dim_customer: - row_count between 10 and 15
Use negative values to set boundaries, if you wish. The check in the following example passes if the number of rows is equal to -3, -2, -1, 0, 1, 2, 3, 4, or 5.
checks for dim_customer: - row_count between -3 and 5
not between comparison phrase to establish a range of acceptable thresholds, so that anything that falls outside the boundaries you specify yields a fail check result. The check in the following example passes if the number of rows is not equal to -3, -2, -1, 0, 1, 2, 3, 4, or 5.
checks for dim_customer: - row_count not between -3 and 5
To exclude the values that define the boundary thresholds, use the opening bracket
( and closing bracket
) characters. In the following example, the check passes if the number of rows is equal to 11, 12, 13, 14, or 15 because the opening bracket excludes 10 as an acceptable value.
checks for dim_customer: - row_count between (10 and 15
Similarly, the following example check passes if the number of rows is equal to 11, 12, 13, or 14.
checks for dim_customer: - row_count between (10 and 15)
Though SodaCL includes the values that define the boundary thresholds during a check by default, you can use square brackets,
], to explicitly specify which values to include, if you wish.
For example, all of the following checks are equivalent and pass if the number of rows is equal to 10, 11, 12, 13, 14, or 15.
checks for dim_customer: - row_count between 10 and 15 - row_count between [10 and 15 - row_count between 10 and 15] - row_count between [10 and 15]
anomaly score avg avg_length cross distribution duplicate_count duplicate_percent failed rows freshness invalid_count invalid_percent max max_length min min_length missing_count missing_percent percentile reconciliation reference row_count schema stddev stddev_pop stddev_samp sum user-defined variance var_pop var_samp
- Access information about optional configurations that you can use in SodaCL checks.
- Need help? Join the Soda community on Slack.
- Reference tips and best practices for SodaCL.
Was this documentation helpful?
What could we do to improve this page?
Documentation always applies to the latest version of Soda products
Last modified on 30-Nov-23