Link Search Menu Expand Document

Schema checks

Last modified on 23-Feb-24

Use a schema check to validate the presence, absence or position of columns in a dataset, or to validate the type of data column contains.

checks for dim_product:
  - schema:
      name: Confirm that required columns are present
      warn:
        when required column missing: [weight_unit_measure_code, product_subcategory_key, made_up_column]
      fail:
        when required column missing:
          - product_key 
          - product_alternate_key
  - schema:
      warn:
        when forbidden column present: [credit_card]
        when wrong column type:
          standard_cost: money
      fail:
        when forbidden column present: [pii*]
        when wrong column type:
          reorder_point: smallint
  - schema:
      name: Columns out of order
      warn:
        when wrong column index:
          style: 1
      fail:
        when wrong column index:
          model_name: 22
  - schema:
      name: Any schema changes
      warn: 
        when schema changes: any

Define schema checks
Define schema evolution checks
Optional check configurations
List of validation keys
Expect one check result
Example: Detect PII
Go further

Define schema checks

In the context of SodaCL check types, schema checks are unique. Schema checks always employ alert configurations – specifying warn and/or fail alert conditions – with validation keys. Refer to Add alert configurations for exhaustive alert configuration details.

The validation key:value pairs in schema checks set the conditions for a warn or a fail check result. See a List of validation keys below.

For example, the following check uses the when required column missing validation key to validate that specific columns are present in a dataset; if any of columns in the list are absent, the check result is fail.

checks for dim_product:
  - schema:
      fail:
        when required column missing:
          - standard_cost
          - list_price
          - weight

In the example above, the value for the validation key is in a nested list format, but you can use an inline list of comma-separated values inside square brackets instead. The following example yields identical check results to the example above.

checks for dim_product:
  - schema:
      fail:
        when required column missing: [standard_cost, list_price, weight]

You can define a schema check with both warn and fail alert conditions, each with multiple validation keys. Refer to Configure multiple alerts for details. Be aware, however, that a single schema check only ever produces a single check result. See Expect one check result below for details.

The following example is a single check; Soda executes each of its validations during a scan. Note that unlike the nested list of column names in the example above, the nested key:value pairs that form the value for these validation keys are indented, but do not use a -.

checks for dim_product:
  - schema:
      warn:
        when forbidden column present: [standard_cost]
        when wrong column type:
          standard_cost: money
          weight: double precision
      fail:
        when forbidden column present: [sombrero]
        when wrong column type:
          reorder_point: smallint


Add a schema_name parameter to a schema check to address a situation in which you need to explicitly identify or override a dataset’s schema in the data source.

checks for dim_employee:
   - schema:
      schema_name: staff.pr
      name: Required columns present
      warn:
        when required column missing: [last_name, birth_date]


Define schema evolution checks

Rather than specifying exact parameters for column changes, you can use the when schema changes validation key to warn or fail when indistinct changes occur in a dataset.

Soda Cloud must have at least two measurements to yield a check result. In other words, the first time you run a scan to execute a schema evolution check, Soda returns no results because it has nothing against which to compare; the second scan that executes the check yields a check result.

checks for dim_customer:
  - schema:
      warn:
        when schema changes: any
      fail:
        when schema changes: 
         - column delete
         - column add
         - column index change
         - column type change

The output in Soda Cloud displays the output of all the alert states during the scan.

schema-results

Optional check configurations

Supported Configuration Documentation
Define a name for a schema check; see example. Customize check names
Add an identity to a check. Add a check identity
Define alert configurations to specify warn and fail alert conditions; see example. Add alert configurations
  Apply an in-check filter to return results for a specific portion of the data in your dataset. -
Use quotes when identifying dataset or column names; see example.
Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.
Use quotes in a check
Use wildcard characters ( % or * ) in values in the check; see example. See note in example below.
Use for each to apply schema checks to multiple datasets in one scan; see example. Apply checks to multiple datasets
Apply a dataset filter to partition data during a scan; see example. Scan a portion of your dataset

Example with check name

checks for dim_product:
  - schema:
      name: Confirm that required columns are present
      warn:
        when required column missing: [weight_unit_measure_code, product_subcategory_key]

Example with alert configuration

checks for dim_product:
  - schema:
      warn:
        when forbidden column present: [standard_cost]

Example with quotes

checks for dim_product:
  - schema:
      warn:
        when wrong column type:
          standard_cost: "money"

Example with wildcards

You can use * or % as wildcard characters in a list of column names. If the column name begins with a wildcard character, add single quotes as per the example below.

checks for dim_product:
  - schema:
      fail:
        when forbidden column present:
          - credit_card
          - obsolete_%
          - '%SALARY%'
          - pii*

Example with for each

for each dataset T:
  datasets:
    - dim_product_%
  checks:
    - schema:
       warn:
         when schema changes: any

Example with dataset filter

filter CUSTOMERS [daily]:
  where: TIMESTAMP '{ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'

checks for CUSTOMERS [daily]:
  - schema:
      fail:
        when forbidden column present:
          - credit_card


List of validation keys

Validation key Values
when required column missing one or more column names in an inline
list of comma-separated values, or a nested list
when forbidden column present one or more column names in an inline
list of comma-separated values, or a nested list
when wrong column type nested key:value pair to identify column:expected_data_type
when wrong column index nested key:value pair to identify
column:expected_position_in_dataset_index
when schema changes any as an inline value
column add as a nested list item
column delete as a nested list item
column index change as a nested list item
column type change as a nested list item

Expect one check result

Be aware that a check that contains one or more alert configurations only ever yields a single check result; one check yields one check result. If your check triggers both a warn and a fail, the check result only displays the more severe, failed check result.

Using the following example, Soda Library, during a scan, discovers that the data in the dataset triggers both alerts, but the check result at the bottom is Oops! 1 failures. Nonetheless, the results in the Scan summary section of the CLI output still display both the warn and fail alerts as having been triggered.

checks for dim_product:
  - schema:
      name: Required columns all present
      warn:
        when required column missing: [weight_unit_measure_code, product_subcategory_key, made_up_column]
      fail:
        when required column missing: [pretend_column]
Soda Library 1.0.x
Soda Core 3.0.x
Scan summary:
1/1 check FAILED: 
    dim_product in adventureworks
      Required columns all present [FAILED]
        fail_missing_column_names = [pretend_column]
        warn_missing_column_names = [made_up_column]
        schema_measured = [product_key integer, product_alternate_key character varying ...]
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
Sending results to Soda Cloud
Soda Cloud Trace: 7845***

Example: Detect PII

To address a common use case, you may wish to use a schema check in combination with a for each configuration and wildcard characters to automatically detect columns that contain personally identifiable information (PII) in your datasets, as in the following example.

for each dataset R:
  tables:
    # Apply the check to any dataset that begins with retail.
    - retail%
  checks:
    - schema:
        fail:
          when forbidden column present: ['*name*', '*address*', '*phone*', '*email*']

Go further


Was this documentation helpful?

What could we do to improve this page?

Documentation always applies to the latest version of Soda products
Last modified on 23-Feb-24