Link Search Menu Expand Document

Filters

Use filters to specify portions of data in your dataset against which Soda Core executes checks during a scan.

# In-check filter
checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11

# Dataset filter
filter CUSTOMERS [daily]:
  where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'

checks for CUSTOMERS [daily]:
  - row_count = 6
  - missing(cat) = 2

In-check filters
Dataset filters

Configure in-check filters

Add a filter to a check to specify a portion of the data against which Soda executes the check.

Add a filter as a nested key:value pair, as in the following example which filters the check results to display only those rows with a value of 81 or greater and which contain 11 in the sales_territory_key column. You cannot use a variable to specify a filter.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11

You can use AND or OR to add multiple filter conditions to a filter key:value pair to further refine your results, as in the following example.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11 AND salaried_flag = 1

To improve the readability of multiple filters in a check, consider adding filters as separate line items, as per the following example.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11 AND 
              sick_leave_hours > 0 OR
              pay_frequency > 1


Be aware that if no rows match the filter parameters you set, Soda does not evaluate the check. In other words, Soda first finds rows that match the filter, then executes the check on those rows.

If, in the example above, none of the rows contained a value of 11 in the sales_territory_key column, Soda would not evaluate the check and would return a NOT EVALUATED message in the CLI scan output, such as the following.

Soda Core 3.0.0bxx
Scan summary:
1/1 check NOT EVALUATED: 
    dim_employee in adventureworks
      Too many vacation hours for US Sales [NOT EVALUATED]
        check_value: None
1 checks not evaluated.
Apart from the checks that have not been evaluated, no failures, no warnings and no errors.

List of compatible metrics and checks

  • all numeric metrics, except duplicate_count
  • all missing metrics
  • all validity metrics

Configure dataset filters

It can be time-consuming to check exceptionally large datasets for data quality in their entirety. Instead of checking whole datasets, you can use a dataset filter to specify a portion of data in a dataset against which Soda Core executes a check.

  1. In your checks YAML file, add a section header called filter, then append a dataset name and, in square brackets, the name of the filter. The name of the filter cannot contain spaces. Refer to the example below.
  2. Nested under the filter header, use a SQL expression to specify the portion of data in a dataset that Soda Core must check. The SQL expression in the example references two variables: ts_start and ts_end. When you run the soda scan command, you must include these two variables as options in the command.
    filter CUSTOMERS [daily]:
      where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
    
  3. Add a separate section for checks for your_dataset_name [filter name]. Any checks you nest under this header execute only against the portion of data that the expression in the filter section defines. Refer to the example below.
  4. Write any checks you wish for the dataset and the columns in it.
    checks for CUSTOMERS [daily]:
      - row_count = 6
      - missing(cat) = 2
    
  5. When you wish to execute the checks, use Soda Core to run a scan of your data source and use the -v option to include the values for the variables you included in your filter expression, as in the example below.
    soda scan -d snowflake_customer_data -v ts_start=2022-03-11 ts_end=2022-03-15 checks.yml
    

If you wish to run checks on the same dataset without using a filter, add a separate section for checks for your_dataset_name without the appended filter name. Any checks you nest under this header execute against all the data in the dataset.

Go further


Last modified on 01-Jul-22

Was this documentation helpful?
Share feedback in the Soda community on Slack.

Help improve our docs!

  • Request a docs change.
  • Edit this page in our GitHub repo.