Link Search Menu Expand Document

Filters and variables

Last modified on 26-Jan-23

Use filters or variables to specify portions of data in your dataset against which Soda Core executes checks during a scan.

# In-check filter
checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11

# Dataset filter with variables
filter CUSTOMERS [daily]:
  where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'

checks for CUSTOMERS [daily]:
  - row_count = 6
  - missing(cat) = 2

# In-check variable 
checks for ${DATASET}:
  - invalid_count(last_name) = 0:
      valid length: 10 

In-check vs. dataset filters
Configure in-check filters
Configure dataset filters
Configure variables in SodaCL
Configure variables for connection configuration
Go further

In-check vs. dataset filters

The following expanation aims to help you decide when to use an in-check filter, and when to use a dataset filter.

Use dataset filters to create one or more partitions of data, commonly time partitions, upon which you want to execute large volumes of checks.

Instead of executing a great many checks on all the data in a dataset, you can specify a smaller portion of data against which to execute all the checks. Doing so helps avoid having to repeatedly apply the same filter to many checks, and it produces a WHERE clause in the SQL query that Soda prepares and executes against your data.

Known issue: Dataset filters are not compatible with failed rows checks which use a SQL query. With such a check, Soda does not apply the dataset filter at scan time.

Use in-check filters to exclude rows from an individual check evaluation.

In-check filters provide the ability to create conditions, or business rules, that data in a column must meet before Soda includes a row in a check evaluation. In other words, Soda first finds rows that match the filter, then executes the check on those rows. As an example, you may wish to use an in-check filter to support a use case in which “Column X must be filled in for all rows that have value Y in column Z”.

When you find yourself adding the same in-check filters to multiple checks, you may wish to promote an in-check filter to a dataset filter.

How Soda Core applies filters
Soda Core uses the checks you define to prepare SQL queries that it executes against the datasets in your data source. It puts as many checks under the same checks for header into a single query as it can. An in-check filter translates to a CASE syntax which Soda puts into that same query with other unfiltered checks.

For a dataset filter, Soda Core generates a separate query and, again, attempts to put all checks under a checks for header into one query including any checks that also have a in-check filter. If your checks YAML has defined some unfiltered checks for a dataset, and applied a dataset filter to other checks on a particular partition of that data, Soda Core prepares two queries, each of which has several calculated metrics in the SELECT statement and which then flow back to their respective checks to evaluate whether they pass, warn, or fail.

Configure in-check filters

Add a filter to a check to apply conditions that specify a portion of the data against which Soda executes the check. For example, you may wish to use an in-check filter to support a use case in which “Column X must be filled in for all rows that have value Y in column Z”.

Add a filter as a nested key:value pair, as in the following example which filters the check results to display only those rows with a value of 81 or greater and which contain 11 in the sales_territory_key column. You cannot use a variable to specify an in-check filter.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11

You can use AND or OR to add multiple filter conditions to a filter key:value pair to further refine your results, as in the following example.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11 AND salaried_flag = 1

To improve the readability of multiple filters in a check, consider adding filters as separate line items, as per the following example.

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11 AND 
              sick_leave_hours > 0 OR
              pay_frequency > 1


Be aware that if no rows match the filter parameters you set, Soda does not evaluate the check. In other words, Soda first finds rows that match the filter, then executes the check on those rows.

If, in the example above, none of the rows contained a value of 11 in the sales_territory_key column, Soda does not evaluate the check and returns a NOT EVALUATED message in the CLI scan output, such as the following.

Soda Core 3.0.xxx
Scan summary:
1/1 check NOT EVALUATED: 
    dim_employee in adventureworks
      Too many vacation hours for US Sales [NOT EVALUATED]
        check_value: None
1 checks not evaluated.
Apart from the checks that have not been evaluated, no failures, no warnings and no errors.

List of compatible metrics and checks

  • all numeric metrics, except duplicate_count and duplicate_percent
  • both missing metrics
  • both validity metrics

Configure dataset filters

It can be time-consuming to check exceptionally large datasets for data quality in their entirety. Instead of checking whole datasets, you can use a dataset filter to specify a portion of data in a dataset against which Soda Core executes a check.

Known issue: Dataset filters are not compatible with failed rows checks which use a SQL query. With such a check, Soda does not apply the dataset filter at scan time.

  1. In your checks YAML file, add a section header called filter, then append a dataset name and, in square brackets, the name of the filter. The name of the filter cannot contain spaces. Refer to the example below.
  2. Nested under the filter header, use a SQL expression to specify the portion of data in a dataset that Soda Core must check.
    • The SQL expression in the example references two variables: ts_start and ts_end.
    • Variables must use the following syntax: ${VAR_NAME}.
    • When you run the soda scan command, you must include these two variables as options in the command; see step 5.
      filter CUSTOMERS [daily]:
         where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
      
  3. Add a separate section for checks for your_dataset_name [filter name]. Any checks you nest under this header execute only against the portion of data that the expression in the filter section defines. Refer to the example below.
  4. Write any checks you wish for the dataset and the columns in it.
    checks for CUSTOMERS [daily]:
      - row_count = 6
      - missing(cat) = 2
    
  5. When you wish to execute the checks, use Soda Core to run a scan of your data source and use the -v option to include each value for the variables you included in your filter expression, as in the example below.
    soda scan -d snowflake_customer_data -v ts_start=2022-03-11 -v ts_end=2022-03-15 checks.yml
    

If you wish to run checks on the same dataset without using a filter, add a separate section for checks for your_dataset_name without the appended filter name. Any checks you nest under this header execute against all the data in the dataset.

Configure variables in SodaCL

You can use variables in SodaCL to:

To provide a variable at scan time, as with dynamic dataset filters or with in-check values, add a -v option to the scan command and specify the key:value pair for the variable, as in the following example.

soda scan -d aws_postgres_retail -c configuration.yml -v TODAY=2022-03-31 checks.yml

If you wish, you can provide the value more than one variable at scan time, as in the following example.

soda scan -d aws_postgres_retail duplicate_count_filter.yml -v date=2022-07-25 -v name='rowcount check'


Example: customize a check name

See also: Customize check names.

variables:
  name: Customers UK
checks for dim_customer:
  - row_count > 1:
     name: Row count in ${name}


Example: provide a dataset name at scan time

checks for ${DATASET}:
  - invalid_count(last_name) = 0:
      valid length: 10 

Scan command:

soda scan -d my_datasource_name -c configuration.yml -v DATASET=dim_customer checks.yml


Example: provide a column name at scan time

checks for dim_customer:
  - invalid_count(${COLUMN}) = 0:
      valid length: 10 

Scan command:

soda scan -d my_datasource_name -c configuration.yml -v COLUMN=last_name checks.yml


Example: provide a threshold value at scan time

checks for dim_customer:
  - invalid_count(last_name) = ${LENGTH}:
      valid length: 10 

Scan command:

soda scan -d my_datasource_name -c configuration.yml -v LENGTH=0 checks.yml

Example: use a variable in an in-check filter

checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = ${SALES_TER}

Example: use a variable for a check identity

Read more about adding a check identity.

checks for dim_product:
  - row_count > 0:
      identity: ${IDENTITY}


Configuration details and limitations

  • Variables must use the following syntax: ${VAR_NAME}.
  • You cannot use a variable to provide a scan-time value for a configuration key value, such as the value for valid length for an invalid_count check.
  • For consistency, best practice dictates that you use upper case for variable names, though you can use lower case if you wish.
  • You may need to wrap date values for variables in single quotes for a check to execute properly. The use of single quotes is bound to the data source, so if your data source demands single quotes around date values for SQL queries, you must also include them when providing date values in Soda. Refer to the # Dataset filter with variables example at the top of this page.

Configure variables for connection configuration

You can use variables to:

  • resolve credentials in configuration files using system variables; see Configure Soda Core
  • pass variables for values in configuration files; see instructions below

If you use Soda Core to execute Soda scans for data quality, you can pass variables at scan time to provide values for data source connection configuration keys in your configuration YAML file. For example, you may wish to pass a variable for the value of password in your configuration YAML.

  1. Adjust the data source connection configuration in your configuration YAML to include a variable.
    data_source adventureworks:
      type: postgres
      host: localhost
      username: noname
      password: ${PASSWORD}
      database: sodacore
      schema: public
    
  2. Save then file, then run a scan that uses a -v option to include the value of the variable in the scan command.
    soda scan -d adventureworks -c configuration.yml -v PASSWORD=123abc checks.yml
    

You can provide the values for multiple variables in a single scan command.

soda scan -d adventureworks -c configuration.yml -v USERNAME=sodacore -v PASSWORD=123abc -v FRESH_NOW=2022-05-31 21:00:00 checks.yml


Configuration details and limitations

  • If you do not explicitly specify a variable value at scan time to resolve credentials, Soda uses environment variables.
  • For consistency, best practice dictates that you use upper case for variable names, though you can use lower case if you wish.

Go further


Was this documentation helpful?

What could we do to improve this page?


Last modified on 26-Jan-23