Create a data contract
Last modified on 30-Nov-23
Use Soda’s data contracts to set data quality standards for data products. In a programmatic scan, Soda translates contract standards into Soda Checks Language, then executes the standards as data quality checks during a scan.
dataset: dim_customer
columns:
- name: id
data_type: character varying
unique: true
- name: cst_size
data_type: decimal
- name: cst_size_txt
valid_values: [1, 2, 3]
- name: distance
data_type: integer
- name: country
data_type: varchar
not_null: true
reference:
dataset: COUNTRIES
column: id
- name: ts
checks:
- missing_count(id) = 0
Install data contracts
Define a data contract
Enforce a data contract
Go further
Install data contracts
Data contracts are only available for use in programmatic scans using Soda Core or Soda Library.
Soda Core CLI, Soda Library CLI, and Soda Cloud do not support data contracts.
- Confirm that you are using Python 3.8 or greater and Pip 21.0 or greater.
- Best practice dictates that you install data contracts in a virtual environment. In your command-line interface tool, create a virtual environment.
- Use the following command to install
soda-core-contracts
pip install soda-core-contracts
- Execute the following command, replacing the package name with the install package that matches the type of data source you use to store data.
If you install Soda Library, you must also create a Soda Cloud account in order to execute scans.# For Soda Library, enterprise software pip install -i https://pypi.cloud.soda.io soda-postgres # For Soda Core, OSS pip install soda-core-postgres
Data source | Soda Library install package | Soda Core install package |
---|---|---|
Amazon Athena | soda-athena | soda-core-athena |
Amazon Redshift | soda-redshift | soda-core-redshift |
Apache Spark DataFrames | soda-spark-df | soda-core-spark-df |
Azure Synapse | soda-sqlserver | soda-core-sqlserver |
ClickHouse | soda-mysql | soda-core-mysql |
Dask and Pandas | soda-pandas-dask | soda-core-pandas-dask |
Databricks | soda-spark[databricks] | |
Denodo | soda-denodo | soda-core-denodo |
Dremio | soda-dremio | soda-core-dremio |
DuckDB | soda-duckdb | soda-core-duckdb |
GCP Big Query | soda-bigquery | soda-core-bigquery |
Google CloudSQL | soda-postgres | soda-core-postgres |
IBM DB2 | soda-db2 | soda-core-db2 |
Local file | Use Dask. | Use Dask. |
MotherDuck | soda-duckdb | soda-core-duckdb |
MS SQL Server | soda-sqlserver | soda-core-sqlserver |
MySQL | soda-mysql | soda-core-mysql |
OracleDB | soda-oracle | soda-core-oracle |
PostgreSQL | soda-postgres | soda-core-postgres |
Presto | soda-presto | soda-core-presto |
Snowflake | soda-snowflake | soda-core-snowflake |
Trino | soda-trino | soda-core-trino |
Vertica | soda-vertica | soda-core-vertica |
Define a data contract
Using a code editor, create a new YAML file name contracts.yml
. In the contracts.yml
file, define the data quality standards you want to enforce for your data quality product.
A data contract consists of two required top-level configuration keys for dataset
and columns
, one required column key for name
, and any number of optional column keys to define the quality standards for your data that you wish to enforce.
Optionally, you can use a checks
configuration key to add SodaCL checks to the same file so that Soda executes both your defined checks, and the data contract standards as checks, during a scan of your data.
The following simple data contract example ensures that the data type of all values in the last_name
column in the dim_customer
dataset are character varying
. Soda translates the contract standards into a SodaCL check. During a scan, Soda executes the SodaCL check it prepared and, if any value in that column is not character varying, the check fails indicating a breach of the data contract standard.
dataset: dim_customer
columns:
- name: last_name
data_type: character varying
Even simpler, you can use a data contract to enforce only the schema of a dataset, ensuring that the columns you name exist.
dataset: dim_customer
columns:
- name: first_name
- name: last_name
- name: birthdate
Slightly different than most of the data quality standards is the reference
configuration. Use a reference standard to validate that column contents match between datasets in the same data source. For example, you may wish to use a reference standard to ensure that all values in a state column are properly entered by comparing the contents to a separate dataset with ISO 3166-2 state name standards.
dataset: dim_employee
columns:
- name: state
reference:
dataset: iso_3166-2
column: state_code
Reference standards automatically collect samples of any values that breach the standard. The default number of samples that Soda collects is 100.
If you wish to limit or broaden the sample size, you can use the samples_limit
configuration to the reference standard.
dataset: dim_employee
columns:
- name: state
reference:
dataset: iso_3166-2
column: state_code
samples_limit: 20
Configuration keys
Top-level key | Value | Required | YAML data type |
---|---|---|---|
dataset | Specify the name of the dataset upon which you wish to enforce the contract. | Required | string |
columns | Provide a list of columns that form part of the data contract. | Required | list of objects |
checks | Define SodaCL checks that Soda executes during a scan in addition to the data contract standard you define. | Optional | list of SodaCL checks |
Column key | Value | Required | YAML data type |
---|---|---|---|
name | Specify the name of a column in your dataset. | Required | string |
data_type | Identify the type of data the column must contain. | Optional | string |
invalid_format | Define the format of a value that Soda ought to register as invalid. Only works with columns that contain data type TEXT. | Optional | See List of valid formats. |
invalid_regex | Specify a regular expression to define your own custom invalid values. | Optional | regex, no forward slash delimiters |
invalid_values | Specify the values that Soda ought to consider invalid. Numeric characters in a valid values list must be enclosed in single quotes. | Optional | values in a list |
not_null | Use the value true to ensure that there are no NULL values in the column; use false to ensure that all values in a column are NULL. | Optional | boolean |
missing_regex | Provide a regular expression that specifies the values that Soda out to consider as missing. | Optional | regex, no forward slash delimiters |
missing_values | Provide a comma-separated list of values to specify the values that Soda ought to consider as missing. | Optional | list of strings or numbers |
valid_format | Define the format of a value that Soda ought to register as valid. Only works with columns that contain data type TEXT. | Optional | See List of valid formats. |
valid_length | Specify a valid length for a string. Only works with columns that contain data type TEXT. | Optional | integer |
valid_max | Specify a maximum numerical value for valid values. | Optional | integer or float |
valid_max_length | Specify a valid maximum length for a string. Only works with columns that contain data type TEXT. | Optional | integer |
valid_min | Specify a minimum numerical value for valid values. | Optional | integer or float |
valid_min_length | Specify a valid minimum length for a string. Only works with columns that contain data type TEXT. | Optional | integer |
valid_regex | Specify a regular expression to define your own custom valid values. | Optional | regex, no forward slash delimiters |
valid_values | Specify the values that Soda ought to consider valid. Numeric characters in a valid values list must be enclosed in single quotes. | Optional | values in a list |
unique | Use the value true to ensure that all values in the column are unique; use false to ensure that all values in a column are the same. | Optional | boolean |
reference | This contract standard ensures that values in the named column exist in a column in a different dataset in the same data source. Configure at least two required reference keys to indicate the dataset and column to which Soda ought to compare values in the named column. | Optional | object |
Reference key | Value | Required | YAML data type |
---|---|---|---|
dataset | Provide the name of the dataset that contains the values which must exist in the named column to which the reference standard applies. | Required | string |
column | Provide the name of the column that contains the values which must exist in the named column to which the reference standard applies. | Required | string |
samples_limit | Specify the maximum number of failed row samples that Soda collects when this standard is breached. | Optional | number |
Enforce a data contract
In a programmatic scan, Soda translates data contract standards into Soda Checks Language, then executes the standards as data quality checks during a scan. To initiate the translation and scan, prepare a programmatic scan using the following example.
from soda.contracts.data_contract_translator import DataContractTranslator
from soda.scan import Scan
import logging
# Read the data contract file as a Python str
with open("dim_customer_data_contract.yml") as f:
data_contract_yaml_str: str = f.read()
# Translate the data contract standards into SodaCL
data_contract_parser = DataContractTranslator()
sodacl_yaml_str = data_contract_parser.translate_data_contract_yaml_str(data_contract_yaml_str)
# Log or save the SodaCL checks file to help with debugging
logging.debug(sodacl_yaml_str)
# Execute the translated SodaCL checks in a scan
scan = Scan()
scan.set_data_source_name("SALESDB")
scan.add_configuration_yaml_file(file_path="~/.soda/my_local_soda_environment.yml")
scan.add_sodacl_yaml_str(sodacl_yaml_str)
scan.execute()
scan.assert_no_checks_fail()
Go further
- Need help? Join the Soda community on Slack.
Was this documentation helpful?
What could we do to improve this page?
- Suggest a docs change in GitHub.
- Share feedback in the Soda community on Slack.
Documentation always applies to the latest version of Soda products
Last modified on 30-Nov-23