Write a data contract
Write a contract for data quality that stipulates the standards to which all data moving through a pipeline or workflow must adhere.
dataset: dim_customer

filter_sql: |
  created > ${FILTER_START_TIME}

owner: zaynabissa@company.com

columns:
  - name: last_name
    data_type: character varying
    checks:
      - type: no_missing_values
      - type: no_duplicate_values
      - type: no_invalid_values
        valid_regex: '^(?:[A-Z])$'
  - name: total_children
    data_type: integer
    checks:
      - type: avg
        must_be_between: [2, 10]
  - name: country_id
    checks:
      - type: invalid_percent
        valid_values_column:
          dataset: COUNTRIES
          column: id
        must_be_less_than: 5
  - name: date_first_purchase
    checks:
      - type: freshness_in_hours
        must_be_less_than: 6

checks:
  - type: rows_exist
  - type: no_duplicate_values
    columns: ['phone', 'email']

Prepare a data contract
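When preparing a contract, it can help to test a `valid_regex` pattern against a few values before relying on it. The pattern in the example contract above, `^(?:[A-Z])$`, is anchored at both ends around a single character class, so it accepts only values that are exactly one uppercase ASCII letter. The sketch below illustrates this with Python's `re` module. It is an illustration only: it assumes the data source's regex engine treats this pattern the same way Python does, and the helper `is_valid` and the sample values are hypothetical, not part of Soda.

```python
import re

# Pattern copied from the no_invalid_values check in the example contract.
VALID_REGEX = r"^(?:[A-Z])$"


def is_valid(value: str) -> bool:
    """Return True when the whole value matches the contract's valid_regex."""
    # Hypothetical helper for local experimentation; not a Soda API.
    return re.fullmatch(VALID_REGEX, value) is not None


# Hypothetical sample values, not taken from any real dataset.
samples = ["A", "Z", "Issa", "a", ""]
results = {value: is_valid(value) for value in samples}
print(results)
```

Under these assumptions, a column holding full surnames would fail this check on nearly every row, so adjust the pattern to match the values you actually expect.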
Organize your data contracts
(Optional) Add YAML code completion in VS Code
(Optional) Add YAML code completion in PyCharm
List of configuration keys
Top-level key | Value | Required
Column key | Value | Required
Checks key | Value | Required
Threshold key | Expected value | Example
Threshold boundaries
Leverage Soda YAML extensibility
Go further
