Link Search Menu Expand Document

Optional check configurations

When you define SodaCL checks for data quality in your checks YAML file, you have the option of adding one or more extra configurations or syntax variations. Read more about SodaCL metrics and checks in general.

The following optional configurations are available to use with most, though not all, check types. The detailed documentation for metrics and individual check types indicate specifically which optional configurations are compatible.

Customize check names
Add alert configurations
Add a filter to a check
Use quotes in a check
Apply checks to multiple datasets
Scan a portion of your dataset


Customize check names

Add a customized, plain-language name to your check so that anyone reviewing the check results can easily grasp the intent of the check.

Add the name to the check as a nested key:value pair, as per the example below.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours
  • Be sure to add the : to the end of your check, before the nested content.
  • If name is configured, Soda Core sends the value of name to Soda Cloud as the check identifier.

Add alert configurations

When Soda Core runs a scan of your data, it returns a check result for each check. Each check results in one of three default states:

  • pass: the values in the dataset match or fall within the thresholds you specified
  • fail: the values in the dataset do not match or fall within the thresholds you specified
  • error: the syntax of the check is invalid

However, you can add alert configurations to a check to explicitly specify the conditions that warrant a warn result. Setting more granular conditions for a warn, or fail, state of a check result gives you more insight into the severity of a data quality issue.

For example, perhaps 50 missing values in a column is acceptable, but more than 50 is cause for concern; you can use alert configurations to warn you when there are 0 - 50 missing values, but fail when there are 51 or more missing values.

Configure a single alert

Add alert configurations as nested key:value pairs, as in the following example which adds a single alert configuration. It produces a warn check result when the volume of duplicate phone numbers in the dataset exceeds five. Refer to the CLI output below.

checks for dim_reseller:
  - duplicate_count(phone):
      warn: when > 5
Soda Core 3.0.0bxx
Scan summary:
1/1 check WARNED: 
    dim_reseller in adventureworks
      duplicate_count(phone) [WARNED]
        check_value: 48
Only 1 warning. 0 failure. 0 errors. 0 pass.
Sending results to Soda Cloud

Configure multiple alerts

Add multiple nested key:value pairs to define both warn alert conditions and fail alert conditions.

The following example defines the conditions for both a warn and a fail state. After a scan, the check result is warn when there are between one and ten duplicate phone numbers in the dataset, but if Soda Core discovers more than ten duplicates, as it does in the example, the check fails. If there are no duplicate phone numbers, the check passes.

checks for dim_reseller:
  - duplicate_count(phone):
      warn: when between 1 and 10
      fail: when > 10
Soda Core 3.0.0bxx
Scan summary:
1/1 check FAILED: 
    dim_reseller in adventureworks
      duplicate_count(phone) [FAILED]
        check_value: 48
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
Sending results to Soda Cloud

You can add multiple conditions to each type of alert, as with the following example, but be aware that a check that contains one or more alert configurations only ever yields a single check result.

checks for dim_reseller:
  - duplicate_count(phone):
      warn: 
        when between 1 and 2
        when between 8 and 10
      fail: when > 10
  • Be sure to add the : to the end of your check, before the nested content.
  • Be aware that a check that contains one or more alert configurations only ever yields a single check result.
  • If it is configured, Soda Core sends warn results to Soda Cloud where they appear as Warning; for fail, the result appears as Critical.

Expect one check result

Be aware that a check that contains one or more alert configurations only ever yields a single check result; one check yields one check result. If your check triggers both a warn and a fail, the check result only displays the more severe, failed check result.

Using the following example, Soda Core, during a scan, discovers that the data in the dataset triggers both alerts, but the check result is still Only 1 warning. Nonetheless, the results in the CLI still display both alerts as having both triggered a warn.

checks for dim_employee:
  - schema:
      warn:
        when required column missing: [not_so_important_column]
        when forbidden column present: [birth_date]
Soda Core 3.0.0bxx
Scan summary:
1/1 check WARNED: 
    dim_employee in adventureworks
      schema [WARNED]
        missing_column_names = [not_so_important_column]
        forbidden_present_column_names = [birth_date]
        schema_measured = [employee_key integer, ...]
Only 1 warning. 0 failure. 0 errors. 0 pass.
Sending results to Soda Cloud

Adding to the example check above, the check in the example below data triggers both warn alerts and the fail alert, but only returns a single check result, the more severe Oops! 1 failures.

checks for dim_employee:
  - schema:
      warn:
        when required column missing: [not_so_important_column]
        when forbidden column present: [birth_date]
      fail:
        when required column missing: [very_important_column]
Soda Core 3.0.0bxx
Scan summary:
1/1 check FAILED: 
    dim_employee in adventureworks
      schema [FAILED]
        missing_column_names = [very_important_column]
        schema_measured = [employee_key integer, ...]
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
Sending results to Soda Cloud

Define zones using alert configurations

Use alert configurations to write checks that define fail or warn zones. By establishing these zones, the check results register as more severe the further a measured value falls outside the threshold parameters you specify as acceptable for your data quality.

The example that follows defines split warning and failure zones in which inner is good, and outer is bad. The chart below illustrates the pass (white), warn (yellow), and fail (red) zones. Note that an individual check only ever yields one check result. If your check triggers both a warn and a fail, the check result only displays the more serious, failed check result. See Expect one check result for details.

checks for CUSTOMERS:
  - row_count:
      warn: when not between -10 and 10
      fail: when not between -20 and 20

historic-chart


The next example defines a different kind of zone in which inner is bad, and outer is good. The chart below illustrates the fail (red), warn (yellow), and pass (white) zones.

checks for CUSTOMERS:
  - row_count:
      warn: when between -20 and 20
      fail: when between -10 and 10

historic-chart2

Add a filter to a check

Add a filter to a check to specify a portion of the data against which Soda executes the check.

Add a filter as a nested key:value pair, as in the following example which filters the check results to display only those rows with a value of 81 or greater and which contain 11 in the sales_territory_key column. You cannot use a variable to specify a filter.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11

You can use AND or OR to add multiple filter conditions to a filter key:value pair to further refine your results, as in the following example.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11 AND salaried_flag = 1

To improve the readability of multiple filters in a check, consider adding filters as separate line items, as per the following example.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11 AND 
              sick_leave_hours > 0 OR
              pay_frequency > 1


Be aware that if no rows match the filter parameters you set, Soda does not evaluate the check. In other words, Soda first finds rows that match the filter, then executes the check on those rows.

If, in the example above, none of the rows contained a value of 11 in the sales_territory_key column, Soda would not evaluate the check and would return a NOT EVALUATED message in the CLI scan output, such as the following.

Soda Core 3.0.0bxx
Scan summary:
1/1 check NOT EVALUATED: 
    dim_employee in adventureworks
      Too many vacation hours for US Sales [NOT EVALUATED]
        check_value: None
1 checks not evaluated.
Apart from the checks that have not been evaluated, no failures, no warnings and no errors.

Use quotes in a check

In the checks you write with SodaCL, you can apply the quoting style that your data source uses for dataset or column names. Soda Core uses the quoting style you specify in the aggregated SQL queries it prepares, then executes during a scan.

Write a check referencing a dataset name with quotes to produce a SQL query that references the dataset name with quotes.

Check:

checks for "CUSTOMERS":
  - row_count > 0

Resulting SQL query:

SELECT
  COUNT(*)
FROM "CUSTOMERS"


Write a check referencing a column name with quotes to produce a SQL query that references the column name with quotes.

Check:

checks for CUSTOMERS:
  - missing("id") = 0

Resulting SQL query:

SELECT
  COUNT(CASE WHEN "id" IS NULL THEN 1 END)
FROM CUSTOMERS

Apply checks to multiple datasets

Add a for each section to your checks YAML file to specify a list of checks you wish to execute on multiple datasets.

  1. Add a for each dataset T section header anywhere in your YAML file. The purpose of the T is only to ensure that every for each configuration has a unique name.
  2. Nested under the section header, add two nested keys, one for datasets and one for checks.
  3. Nested under datasets, add a list of datasets against which to run the checks. Refer to the example below that illustrates how to use include and exclude configurations and wildcard characters (%) .
  4. Nested under checks, write the checks you wish to execute against all the datasets listed under datasets.
for each dataset T:
  datasets:
    # include the dataset 
    - dim_customers
    # include all datasets matching the wildcard expression
    - dim_products%
    # (optional) explicitly add the word include to make the list more readable
    - include dim_employee
    # exclude a specific dataset
    - exclude fact_survey_response
    # exclude any datasets matching the wildcard expression
    - exclude prospective_%
  checks:
    - row_count > 0


  • Soda Core dataset names matching is case insensitive.
  • If any of your checks specify column names as arguments, make sure the column exists in all datasets listed under the datasets heading.
  • To add multiple for each configurations in your checks YAML file, configure another for each section header with a different letter identifier, such as for each dataset R.

Scan a portion of your dataset

It can be time-consuming to check exceptionally large datasets for data quality in their entirety. Instead of checking whole datasets, you can use a dataset filter to specify a portion of data in a dataset against which Soda Core executes a check.

  1. In your checks YAML file, add a section header called filter, then append a dataset name and, in square brackets, the name of the filter. The name of the filter cannot contain spaces. Refer to the example below.
  2. Nested under the filter header, use a SQL expression to specify the portion of data in a dataset that Soda Core must check. The SQL expression in the example references two variables: ts_start and ts_end. When you run the soda scan command, you must include these two variables as options in the command.
    filter CUSTOMERS [daily]:
      where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    
  3. Add a separate section for checks for your_dataset_name [filter name]. Any checks you nest under this header execute only against the portion of data that the expression in the filter section defines. Refer to the example below.
  4. Write any checks you wish for the dataset and the columns in it.
    checks for CUSTOMERS [daily]:
      - row_count = 6
      - missing(cat) = 2
    
  5. When you wish to execute the checks, use Soda Core to run a scan of your data source and use the -v option to include the values for the variables you included in your filter expression, as in the example below.
    soda scan -d snowflake_customer_data -v ts_start=2022-03-11 ts_end=2022-03-15 checks.yml
    

If you wish to run checks on the same dataset without using a filter, add a separate section for checks for your_dataset_name without the appended filter name. Any checks you nest under this header execute against all the data in the dataset.

Go further


Last modified on 01-Jul-22

Was this documentation helpful?
Share feedback in the Soda community on Slack.

Help improve our docs!

  • Request a docs change.
  • Edit this page in our GitHub repo.