Schema checks
Last modified on 29-Mar-23
Use a schema check to validate the presence, absence or position of columns in a dataset, or to validate the type of data column contains. Read more about SodaCL metrics and checks in general.
checks for dim_product:
- schema:
name: Confirm that required columns are present
warn:
when required column missing: [weight_unit_measure_code, product_subcategory_key, made_up_column]
fail:
when required column missing:
- product_key
- product_alternate_key
- schema:
warn:
when forbidden column present: [credit_card]
when wrong column type:
standard_cost: money
fail:
when forbidden column present: [pii*]
when wrong column type:
reorder_point: smallint
- schema:
name: Columns out of order
warn:
when wrong column index:
style: 1
fail:
when wrong column index:
model_name: 22
# Requires a Soda Cloud account
- schema:
name: Any schema changes
warn:
when schema changes: any
Define schema checks
Optional check configurations
List of validation keys
Expect one check result
Go further
Define schema checks
In the context of SodaCL check types, schema checks are unique. Schema checks always employ alert configurations – specifying warn and/or fail alert conditions – with validation keys that are unique to this type of check. Refer to Add alert configurations for exhaustive alert configuration details.
The validation key:value pairs in schema checks set the conditions for a warn or a fail check result. See a List of validation keys below.
For example, the following check uses the when required column missing
validation key to validate that specific columns are present in a dataset; if any of columns in the list are absent, the check result is fail.
checks for dim_product:
- schema:
fail:
when required column missing:
- standard_cost
- list_price
- weight
In the example above, the value for the validation key is in a nested list format, but you can use an inline list of comma-separated values inside square brackets instead. The following example yields identical checks results to the example above.
checks for dim_product:
- schema:
fail:
when required column missing: [standard_cost, list_price, weight]
You can define a schema check with both warn and fail alert conditions, each with multiple validation keys. Refer to Configure multiple alerts for details. Be aware, however, that a single schema check only ever produces a single check result. See Expect one check result below for details.
The following example is a single check; Soda executes each of its validations during a scan. Note that unlike the nested list of column names in the example above, the nested key:value pairs that form the value for these validation keys are indented, but do not use a -
.
checks for dim_product:
- schema:
warn:
when forbidden column present: [standard_cost]
when wrong column type:
standard_cost: money
weight: double precision
fail:
when forbidden column present: [sombrero]
when wrong column type:
reorder_point: smallint
Define schema evolution checks
Rather than specifying exact parameters for column changes, you can use the when schema changes
validation key to warn or fail when indistinct changes occur in a dataset.
This type of validation key requires a Soda Cloud account. If you have connected Soda Core to a Soda Cloud account, Soda Core pushes check results to your cloud account where Soda Cloud stores all the previously-measured, historic values for your checks in the Cloud Metric Store. SodaCL can then use these stored values to establish a relative state against which to evaluate future schema checks. Therefore, you must have a created and connected a Soda Cloud account to use schema evolution checks.
Soda Cloud must have at least two measurements to yield a check result. In other words, the first time you run a scan to execute a schema evolution check, Soda returns no results because it has nothing against which to compare; the second scan that executes the check yields a check result.
checks for dim_customer:
- schema:
warn:
when schema changes: any
fail:
when schema changes:
- column delete
- column add
- column index change
- column type change
Optional check configurations
Supported | Configuration | Documentation |
---|---|---|
✓ | Define a name for a schema check; see example. | Customize check names |
✓ | Add an identity to a check. | Add a check identity |
✓ | Define alert configurations to specify warn and fail alert conditions; see example. | Add alert configurations |
Apply an in-check filter to return results for a specific portion of the data in your dataset. | - | |
✓ | Use quotes when identifying dataset or column names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark. | Use quotes in a check |
✓ | Use wildcard characters ( % or * ) in values in the check; see example. | See note in example below. |
✓ | Use for each to apply schema checks to multiple datasets in one scan; see example. | Apply checks to multiple datasets |
✓ | Apply a dataset filter to partition data during a scan; see example. | Scan a portion of your dataset |
Example with check name
checks for dim_product:
- schema:
name: Confirm that required columns are present
warn:
when required column missing: [weight_unit_measure_code, product_subcategory_key]
Example with alert configuration
checks for dim_product:
- schema:
warn:
when forbidden column present: [standard_cost]
Example with quotes
checks for dim_product:
- schema:
warn:
when wrong column type:
standard_cost: "money"
Example with wildcards
You can use *
or %
as wildcard characters in a list of column names. If the column name begins with a wildcard character, add single quotes as per the example below.
checks for dim_product:
- schema:
fail:
when forbidden column present:
- credit_card
- obsolete_%
- '%SALARY%'
- pii*
Example with for each
for each dataset T:
datasets:
- dim_product_%
checks:
schema:
warn:
when schema changes: any
Example with dataset filter
filter CUSTOMERS [daily]:
where: TIMESTAMP '{ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
checks for CUSTOMERS [daily]:
- schema:
fail:
when forbidden column present:
- credit_card
List of validation keys
Validation key | Values | Requires a Soda Cloud account |
---|---|---|
when required column missing | one or more column names in an inline list of comma-separated values, or a nested list | |
when forbidden column present | one or more column names in an inline list of comma-separated values, or a nested list | |
when wrong column type | nested key:value pair to identify column:expected_data_type | |
when wrong column index | nested key:value pair to identify column:expected_position_in_dataset_index | |
when schema changes | any as an inline valuecolumn add as a nested list itemcolumn delete as a nested list itemcolumn index change as a nested list item column type change as a nested list item | ✓ |
Expect one check result
Be aware that a check that contains one or more alert configurations only ever yields a single check result; one check yields one check result. If your check triggers both a warn
and a fail
, the check result only displays the more severe, failed check result.
Using the following example, Soda Core, during a scan, discovers that the data in the dataset triggers both alerts, but the check result is still Only 1 warning
. Nonetheless, the results in the CLI still display both alerts as having both triggered a warn
.
checks for dim_employee:
- schema:
warn:
when required column missing: [not_so_important_column]
when forbidden column present: [birth_date]
Soda Core 3.0.xxx
Scan summary:
1/1 check WARNED:
dim_employee in adventureworks
schema [WARNED]
missing_column_names = [not_so_important_column]
forbidden_present_column_names = [birth_date]
schema_measured = [employee_key integer, ...]
Only 1 warning. 0 failure. 0 errors. 0 pass.
Sending results to Soda Cloud
Adding to the example check above, the check in the example below data triggers both warn
alerts and the fail
alert, but only returns a single check result, the more severe Oops! 1 failures.
checks for dim_employee:
- schema:
warn:
when required column missing: [not_so_important_column]
when forbidden column present: [birth_date]
fail:
when required column missing: [very_important_column]
Soda Core 3.0.xxx
Scan summary:
1/1 check FAILED:
dim_employee in adventureworks
schema [FAILED]
missing_column_names = [very_important_column]
schema_measured = [employee_key integer, ...]
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
Sending results to Soda Cloud
Go further
- Learn more about SodaCL metrics and checks in general.
- Use a reference check to validate matching contents between datasets.
- Need help? Join the Soda community on Slack.
- Reference tips and best practices for SodaCL.
Was this documentation helpful?
What could we do to improve this page?
- Suggest a docs change in GitHub.
- Share feedback in the Soda community on Slack.
Last modified on 29-Mar-23