Reference checks
Last modified on 26-Jan-23
Use a reference check to validate that column contents match between datasets in the same data source.
checks for dim_department_group:
- values in (department_group_name) must exist in dim_employee (department_name)
Define reference checks
Failed row samples
Optional check configurations
Go further
Define reference checks
In the context of SodaCL check types, reference checks are unique. This check is limited in its syntax variation, with only a few mutable parts to specify column and dataset names.
The example below checks that the values in the source column, department_group_name
, in the dim_department_group
dataset exist in the destination column, department_name
, in the dim_employee
dataset. If the values are absent in the department_name
column, the check fails.
- Soda CL considers missing values in the source column as invalid.
- Optionally, do not use brackets around column names. The brackets serve as visual aids to improve readability.
checks for dim_department_group:
- values in (department_group_name) must exist in dim_employee (department_name)
You can use reference checks to compare the values of multiple columns in different datasets, as in the following example. Soda compares the columns in the order you list them, so in the example below, last_name
compares to last_name
, and first_name
compares to first_name
.
checks for dim_customers_dev:
- values in (last_name, first_name) must exist in dim_customers_prod (last_name, first_name)
Failed row samples
Reference checks automatically collect samples of any failed rows to display Soda Cloud. The default number of failed row samples that Soda collects and displays is 100.
If you wish to limit or broaden the sample size, you can use the samples limit
configuration in a reference check configuration. You can add this configuration to your checks YAML file for Soda Core, or when writing checks as part of an agreement in Soda Cloud.
checks for dim_customers_dev:
- values in (last_name, first_name) must exist in dim_customers_prod (last_name, first_name):
samples limit: 20
For security, you can add a configuration to your data source connection details to prevent Soda from collecting failed rows samples from specific columns that contain sensitive data. Refer to Disable failed rows sampling for specific columns.
To review the failed rows in Soda Cloud, navigate to the Checks dashboard, then click the row for a reference check. Examine failed rows in the Failed rows tab; see Examine failed rows for further details.
Optional check configurations
Supported | Configuration | Documentation |
---|---|---|
✓ | Define a name for a schema check; see example. | Customize check names |
✓ | Add an identity to a check. | Add a check identity |
Define alert configurations to specify warn and fail alert conditions. | - | |
Apply an in-check filter to return results for a specific portion of the data in your dataset. | - | |
✓ | Use quotes when identifying dataset or column names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark. | Use quotes in a check |
Use wildcard characters ( % or * ) in values in the check. | - | |
Use for each to apply schema checks to multiple datasets in one scan. | - | |
Apply a dataset filter to partition data during a scan; see example. | Scan a portion of your dataset |
Example with check name
checks for dim_department_group:
- values in (department_group_name) must exist in dim_employee (department_name):
name: Compare department datasets
Example with quotes
checks for dim_department_group:
- values in ("department_group_name") must exist in dim_employee ("department_name")
Go further
- Learn more about SodaCL metrics and checks in general.
- Use a schema check to discover missing or forbidden columns in a dataset.
- Need help? Join the Soda community on Slack.
- Reference tips and best practices for SodaCL.
Was this documentation helpful?
What could we do to improve this page?
- Suggest a docs change in GitHub.
- Share feedback in the Soda community on Slack.
Last modified on 26-Jan-23