Link Search Menu Expand Document

Filters and variables

Use filters or variables to specify portions of data in your dataset against which Soda Core executes checks during a scan.

# In-check filter
checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11

# Dataset filter
filter CUSTOMERS [daily]:
  where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'

checks for CUSTOMERS [daily]:
  - row_count = 6
  - missing(cat) = 2

# Variable in a customized check name 
variables:
  name: Customers UK
checks for dim_customer:
  - row_count > 1:
     name: Row count in ${name}

In-check filters
Dataset filters
Configure variables
Go further

Configure in-check filters

Add a filter to a check to specify a portion of the data against which Soda executes the check.

Add a filter as a nested key:value pair, as in the following example which filters the check results to display only those rows with a value of 81 or greater and which contain 11 in the sales_territory_key column. You cannot use a variable to specify an in-check filter.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11

You can use AND or OR to add multiple filter conditions to a filter key:value pair to further refine your results, as in the following example.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11 AND salaried_flag = 1

To improve the readability of multiple filters in a check, consider adding filters as separate line items, as per the following example.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11 AND 
              sick_leave_hours > 0 OR
              pay_frequency > 1


Be aware that if no rows match the filter parameters you set, Soda does not evaluate the check. In other words, Soda first finds rows that match the filter, then executes the check on those rows.

If, in the example above, none of the rows contained a value of 11 in the sales_territory_key column, Soda does not evaluate the check and returns a NOT EVALUATED message in the CLI scan output, such as the following.

Soda Core 3.0.xxx
Scan summary:
1/1 check NOT EVALUATED: 
    dim_employee in adventureworks
      Too many vacation hours for US Sales [NOT EVALUATED]
        check_value: None
1 checks not evaluated.
Apart from the checks that have not been evaluated, no failures, no warnings and no errors.

List of compatible metrics and checks

  • all numeric metrics, except duplicate_count
  • both missing metrics
  • both validity metrics

Configure dataset filters

It can be time-consuming to check exceptionally large datasets for data quality in their entirety. Instead of checking whole datasets, you can use a dataset filter to specify a portion of data in a dataset against which Soda Core executes a check.

  1. In your checks YAML file, add a section header called filter, then append a dataset name and, in square brackets, the name of the filter. The name of the filter cannot contain spaces. Refer to the example below.
  2. Nested under the filter header, use a SQL expression to specify the portion of data in a dataset that Soda Core must check.
    • The SQL expression in the example references two variables: ts_start and ts_end.
    • Variables must use the following syntax: ${VAR_NAME}.
    • When you run the soda scan command, you must include these two variables as options in the command; see step 5.
      filter CUSTOMERS [daily]:
         where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
      
  3. Add a separate section for checks for your_dataset_name [filter name]. Any checks you nest under this header execute only against the portion of data that the expression in the filter section defines. Refer to the example below.
  4. Write any checks you wish for the dataset and the columns in it.
    checks for CUSTOMERS [daily]:
      - row_count = 6
      - missing(cat) = 2
    
  5. When you wish to execute the checks, use Soda Core to run a scan of your data source and use the -v option to include the values for the variables you included in your filter expression, as in the example below.
    soda scan -d snowflake_customer_data -v ts_start=2022-03-11 ts_end=2022-03-15 checks.yml
    

If you wish to run checks on the same dataset without using a filter, add a separate section for checks for your_dataset_name without the appended filter name. Any checks you nest under this header execute against all the data in the dataset.

Configure variables

You can use variables to:

For example, you can use a variable in a check to customize the check at scan time, as in the following example.

checks for dim_customers:
  - row_count > ${VAR_2}


To use a variable at scan time, as with a dataset filter, add a -v option to the scan command and specify the key:value pair for the variable, as in the following example.

soda scan -d aws_postgres_retail -c configuration.yml -v TODAY=2022-03-31 checks.yml

If you wish, you can use more than one variable at scan time, as in the following example.

soda scan -d aws_postgres_retail duplicate_count_filter.yml -v date=2022-07-25 -v name='rowcount check'


Configuration details and limitations

  • Variables must use the following syntax: ${VAR_NAME}.
  • If you do not explicitly specify a variable value at scan time, Soda uses environment variables.
  • For consistency, best practice dictates that you use upper case for variable names, though you can use lower case if you wish.

Go further


Was this documentation helpful?

What could we do to improve this page?


Last modified on 30-Sep-22