Soda Checks Language (SodaCL) is a YAML-based, domain-specific language for data reliability. Used in conjunction with Soda Core, Soda’s open-source, command-line tool, you use SodaCL to write checks for data quality, then use Soda Core to scan the data in your data source and execute those checks.
A metric is a property of the data in your dataset. A threshold is the value for a metric that Soda checks against during a scan. Usually, you use both a metric and a threshold to define a SodaCL check in a checks YAML file, like the following example that checks that the
dim_customer dataset is not empty.
checks for dim_customer: - row_count > 0
A check is a test for data quality that you write using the Soda Checks Language (SodaCL). SodaCL includes over 25 built-in metrics that you can use to write checks, but you also have the option of writing your own SQL queries or expressions using SodaCL.
When it scans datasets in your data source, Soda Core executes the checks you defined in your checks YAML file. Technically, a check is a Python expression that, during a Soda scan, checks metrics to see if they match the parameters you defined for a threshold. A single Soda scan executes multiple checks against one or more datasets in your data source.
As a result of a scan, each check results in one of three default states:
- pass: the values in the dataset match or fall within the thresholds you specified
- fail: the values in the dataset do not match or fall within the thresholds you specified
- error: the syntax of the check is invalid
A fourth state, warn, is something you can explicitly configure for individual checks. See Add alert configurations.
The scan results appear in your command-line interface (CLI) and, if you have connected Soda Core to a Soda Cloud account, in the Monitors Results dashboard in the Soda Cloud web application.
Soda Core 3.0.0bx Scan summary: 1/1 check PASSED: dim_customer in adventureworks row_count > 0 [PASSED] All is good. No failures. No warnings. No errors. Sending results to Soda Cloud
In general, SodaCL checks fall into one of three broad categories:
A standard check, as illustrated above with
row_count, uses a language pattern that includes a metric and a threshold. All numeric, missing, and validity metrics use this pattern and have a multitude of optional configurations. Read more about standard check types below.
Quick view of standard check metricsavg
Some checks that you write with SodaCL do not use metrics and thresholds, and instead follow unique patterns relevant to the data quality parameters they check. Each unique check type has its own documentation.
For example, a reference check that validates that the values in a column in one dataset match exactly with the values in another column in another dataset uses a unique pattern.
checks for dim_employees_dev: - values in salary must exist in dim_employee_prod salary
Quick view of unique check typesanomaly score
Finally, the user-defined checks make use of common table expressions (CTE) or SQL queries to construct a check; see an example below. This check type is designed to meet the needs of more complex and specific data quality checks, needs which cannot otherwise be met using the built-in standard and unique checks SodaCL provides. Each user-defined check type has its own documentation.
Use these checks to prepare expressions or queries for your data that Soda Core executes during a scan along with all the other checks in your checks YAML file.
checks for customers: - avg_surface < 1068: avg_surface expression: AVG(size * distance)
Quick view of user-defined check typesfailed rows
Standard check types use the same pattern to compose a check, but the metrics they use can, themselves, be divided into three categories:
- numeric - metrics that involve tabulation or calculation of data
- missing - metrics that identify values or formats of data that qualify as missing, such as NULL
- validity - metrics that identify values or formats of data that, according to your own business rules, are acceptable or unacceptable
All standard checks that use numeric, missing, or validity metrics can specify a fixed threshold which is not relative to any other threshold.
row_count > 0 is an example of a check with a fixed threshold as the threshold value,
0, is absolute.
Generally, a fixed threshold check has three or four mutable parts:
|an argument (optional)|
|a comparison symbol or phrase|
checks for retail_products: - row_count > 0 - max(size) <= 500
The example above defines two checks. The first check applies to the entire dataset and counts the rows to confirm that it is not empty. If the
retail_products dataset contains more than
0 rows, the check result is
|comparison symbol|| |
The second check applies to only the
size column in the dataset and checks that the values in that column do not exceed
500. If the
size column in the
retail_products dataset contains values larger than
500, the check result is
|comparison symbol|| |
Only checks that use numeric metrics can specify a dynamic threshold, a value that is relative to a previously-measured, or historic, value. Sometimes referred to a change-over-time threshold, you use these dynamic thresholds to gauge changes to the same metric over time.
You must have a Soda Cloud account to use dynamic thresholds.
Refer to Dynamic thresholds for further details.
While the most basic of standard checks use a single value to identify a fixed threshold, such as
row_count >= 10, you can use comparison phrases to define the upper and lower boundaries for a fixed threshold value. Read more about fixed and dynamic thresholds.
The following sections present several ways to set boundaries using the
row_count metric in the example checks. You can use any numeric, missing, or validity metric in lieu of
By default, SodaCL includes the values that define the boundary thresholds when Soda Core executes a check. In the following example, the check passes if the number of rows is equal to 10, 11, 12, 13, 14, or 15 because SodaCL includes both boundary thresholds,
15, when Soda Core executes the check.
checks for dim_customer: - row_count between 10 and 15
Use negative values to set boundaries, if you wish. The check in the following example passes if the number of rows is equal to -3, -2, -1, 0, 1, 2, 3, 4, or 5.
checks for dim_customer: - row_count between -3 and 5
not between comparison phrase to establish a range of acceptable thresholds, so that anything that falls outside the boundaries you specify yields a fail check result. The check in the following example passes if the number of rows is not equal to -3, -2, -1, 0, 1, 2, 3, 4, or 5.
checks for dim_customer: - row_count not between -3 and 5
To exclude the values that define the boundary thresholds, use the opening bracket
( and closing bracket
) characters. In the following example, the check passes if the number of rows is equal to 11, 12, 13, 14, or 15 because the opening bracket excludes 10 as an acceptable value.
checks for dim_customer: - row_count between (10 and 15
Similarly, the following example check passes if the number of rows is equal to 11, 12, 13, or 14.
checks for dim_customer: - row_count between (10 and 15)
Though SodaCL includes the values that define the boundary thresholds during a check by default, you can use square brackets,
], to explicitly specify which values to include, if you wish.
For example, all of the following checks are equivalent and pass if the number of rows is equal to 10, 11, 12, 13, 14, or 15.
checks for dim_customer: - row_count between 10 and 15 - row_count between [10 and 15 - row_count between 10 and 15] - row_count between [10 and 15]
- Access information about optional configurations that you can use in SodaCL checks.
- Need help? Join the Soda community on Slack.
Last modified on 01-Jul-22
Was this documentation helpful?
Share feedback in the Soda community on Slack.
Help improve our docs!