Verify a data contract
Last modified on 20-Nov-24
As the development team explores data contracts, expect minor imperfections, inconsistencies, and limited support, compatibility, and functionality if you download and use the soda-core-contracts package.
To verify a Soda data contract is to scan the data in a data source to execute the data contract checks you defined in a contracts YAML file. Soda data contracts is available as a Python library, so you run the scan programmatically, invoking it in a CI/CD workflow when you create a new pull request, or in a data pipeline after importing or transforming new data.
When deciding when to verify a data contract, consider that contract verification works best on new data as soon as it is produced so as to limit its exposure to other systems or users who might access it. The earlier in a pipeline or workflow, the better! Further, best practice suggests that you store batches of new data in a temporary table, verify a contract on the batches, then append the data to a larger table.
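The stage-verify-append pattern described above can be sketched with the standard library alone. This is a minimal illustration, not Soda code: it uses an in-memory SQLite database, and the table names, the `load_batch` helper, and the `no_missing_ids` function are all hypothetical. In a real pipeline, the `verify` callable would be replaced by a Soda contract verification run against the staging table.

```python
import sqlite3
from typing import Callable, Iterable, Optional, Tuple

Row = Tuple[Optional[int], str]


def no_missing_ids(conn: sqlite3.Connection, table: str) -> bool:
    """Hypothetical stand-in for contract verification: passes when no id is NULL."""
    missing = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE id IS NULL").fetchone()[0]
    return missing == 0


def load_batch(
    conn: sqlite3.Connection,
    rows: Iterable[Row],
    verify: Callable[[sqlite3.Connection, str], bool],
) -> bool:
    """Stage a batch, verify it, and append it to the main table only if it passes."""
    conn.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT)")
    conn.execute("DROP TABLE IF EXISTS customers_staging")
    conn.execute("CREATE TEMP TABLE customers_staging (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customers_staging VALUES (?, ?)", rows)

    # Verify the new batch in isolation, before other systems can read it.
    passed = verify(conn, "customers_staging")
    if passed:
        conn.execute("INSERT INTO customers SELECT id, name FROM customers_staging")

    # Discard the staging table whether or not the batch was appended.
    conn.execute("DROP TABLE customers_staging")
    conn.commit()
    return passed
```

A batch that fails verification never reaches the larger table, which is the point of verifying as early in the pipeline as possible.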
✖️ Requires Soda Core Scientific
✔️ Experimentally supported in Soda Core 3.3.3 or greater for PostgreSQL, Snowflake, and Spark
✖️ Supported in Soda Core CLI
✖️ Supported in Soda Library + Soda Cloud
✖️ Supported in Soda Cloud Agreements + Soda Agent
✖️ Available as a no-code check
Prerequisites
Verify a data contract via API
Review contract verification results
About data source configurations
Verify data contracts with Spark
Validate data contracts
Add a check identity
Skip checks during contract verification
Go further
Prerequisites
- Python 3.8 or greater
- a code or text editor
- your data source connection credentials and details
- a soda-core-contracts package and a soda-core[package] installed in a virtual environment. Refer to the list of data source-specific Soda Core packages available to use.
- a Soda data contracts YAML file; see Write a data contract
Verify a data contract via API
- In your code or text editor, create a new file named data_source.yml accessible from within your working directory in your virtual environment.
- To that file, add a data source configuration for Soda to connect to your data source and access the data within it to verify the contract. The example that follows is for a PostgreSQL data source; see data source configuration for further details.
  Best practice dictates that you store sensitive credential values as environment variables using uppercase and underscores for the variables.
    name: local_postgres
    type: postgres
    connection:
      host: localhost
      database: yourdatabase
      username: ${POSTGRES_USERNAME}
      password: ${POSTGRES_PASSWORD}
  Alternatively, you can use a YAML string or dict to define connection details; use one of the with_data_source_...(...) methods.
- Add the following block to your Python working environment. Replace the values of the file paths with your own data source YAML file and contract YAML file respectively.
    from soda.contracts.contract_verification import ContractVerification, ContractVerificationResult

    contract_verification_result: ContractVerificationResult = (
        ContractVerification.builder()
        .with_contract_yaml_file('soda/local_postgres/public/customers.yml')
        .with_data_source_yaml_file('soda/local_postgres/data_source.yml')
        .execute()
    )

    print(str(contract_verification_result))
- At runtime, Soda connects with your data source and verifies the contract by executing the data contract checks in your file. Use ${SCHEMA} syntax to provide any environment variable values in a contract YAML file. Soda returns results of the verification as pass or fail check results, or indicates errors if any exist; see below.
Review contract verification results
Contract verification results make a distinction between two types of problems: failed checks, and execution errors.
Output | Meaning | Action | Method |
---|---|---|---|
Failed checks | A failed check indicates that the values in the dataset do not match or fall within the thresholds you specified in the check. | Review the data at its source to determine the cause of the failure. | .has_failures() |
Execution errors | An execution error means that Soda could not evaluate one or more checks in the data contract. Errors include incorrect inputs such as missing files, invalid files, connection issues, or invalid contract format, or query execution exceptions. | Use the error logs to investigate the root cause of the issue. | .has_errors() |
When Soda surfaces a failed check or an execution error, you may wish to stop the pipeline from processing the data any further. To do so, you can use the Soda data contracts API in one of two ways:
- Append .assert_ok() at the end of the contract verification result, which produces a SodaException when a check fails or when execution errors occur. The exception message includes a full report.
- Test for the result using if not contract_verification_result.is_ok(): and use str(contract_verification_result) to get a report.
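The second approach can be sketched as a small gate that a pipeline step calls after verification. This is an illustration under stated assumptions: it relies only on the is_ok(), has_failures(), and has_errors() methods named in this section, while the `VerificationResult` protocol, the `exit_code_for` and `gate` helpers, and the exit codes themselves are hypothetical choices, not part of the Soda API.

```python
import sys
from typing import Protocol


class VerificationResult(Protocol):
    """The subset of result methods this sketch relies on."""

    def is_ok(self) -> bool: ...
    def has_failures(self) -> bool: ...
    def has_errors(self) -> bool: ...


def exit_code_for(result: VerificationResult) -> int:
    """Map a verification result to a process exit code (the codes are an assumption)."""
    if result.is_ok():
        return 0
    # Report execution errors first: some checks could not be evaluated at all.
    if result.has_errors():
        return 2
    # Otherwise at least one check was evaluated and failed.
    return 1


def gate(result: VerificationResult) -> None:
    """Print the report and stop the process when verification is not OK."""
    code = exit_code_for(result)
    if code != 0:
        print(str(result), file=sys.stderr)
        sys.exit(code)
```

Separating the two non-zero codes lets a scheduler treat data problems (review the data at its source) differently from execution problems (inspect the error logs).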
About data source configurations
Soda data contracts connects to a data source to perform queries and to verify schemas and data quality checks on the data stored there. Notably, it does not extract or ingest data; it only scans your data to complete contract verification. If you are using the Contract API, you only need to provide one data source configuration in the contract verification, which Soda uses to verify contracts.
Best practice dictates that you store sensitive credential values as environment variables that use uppercase and underscores, such as password: ${DATA_SOURCE_PASSWORD}. Soda data contracts uses environment variables by default; you can pass extra variables via the API using .with_variables({"DATA_SOURCE_PASSWORD": "***"}).
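To make the ${NAME} placeholder idea concrete, the following sketch resolves placeholders in a configuration string from the environment plus a dict of extra variables. It is not Soda's implementation: the `resolve_placeholders` helper is hypothetical, and the choice to let extra variables take precedence over the environment and to raise on a missing variable are assumptions made for the example.

```python
import os
import re
from typing import Mapping, Optional

# Matches placeholders such as ${DATA_SOURCE_PASSWORD}
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_placeholders(text: str, extra: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${NAME} in text with a value from extra or the environment."""
    # Assumption: explicitly passed variables override environment variables.
    variables = {**os.environ, **(extra or {})}

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"No value provided for variable {name}")
        return variables[name]

    return _PLACEHOLDER.sub(substitute, text)
```

Keeping only the placeholder in the YAML file means the file can be committed to version control without exposing credentials.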
Verify data contracts with Spark
Where you have a Spark session that potentially includes data frames that live in-memory, you can pass a Spark session into the contract verification API to verify a data contract in data frames without persisting and reloading.
Use with_data_source_spark_session to pass your Spark session into the contract verification, as in the example below.
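from pyspark.sql import SparkSession
from soda.contracts.contract_verification import ContractVerification, ContractVerificationResult

# An existing Spark session and the contract YAML content (supply your own)
spark_session: SparkSession = ...
contract_yaml_str: str = ...

# execute() returns the verification result, not the ContractVerification itself
contract_verification_result: ContractVerificationResult = (
    ContractVerification.builder()
    .with_contract_yaml_str(contract_yaml_str)
    .with_data_source_spark_session(spark_session=spark_session, data_source_name="spark_ds")
    .execute()
)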
Validate data contracts
If you wish to validate the syntax of a data contract without actually executing the contract verification, use the build method instead of execute on the contract verification builder, as in the following example.
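from soda.contracts.contract_verification import ContractVerification

# build() parses the contract without executing the contract verification
contract_verification: ContractVerification = (
    ContractVerification.builder()
    .with_contract_yaml_file('soda/local_postgres/public/customers.yml')
    .build()
)
# Report any syntax or semantic errors collected in the logs
if contract_verification.logs.has_errors():
    print(f"The contract has syntax or semantic errors: \n{contract_verification.logs}")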
Add a check identity
Add an identity to a check to correlate the check’s verification results with a check in Soda Cloud.
In a contract YAML file, every check must have a unique identity. By default, Soda generates a check identity based on the location of the checks list and two properties: type and name. This is generally enough information to correlate a data contracts check with a check in Soda Cloud.
However, if the error Duplicate check identity appears in the verification output, that indicates that two checks exist with the same type and name, or same type and no name. Where this occurs, manually change the name of one of the checks or, in the case where neither check has a name, add a name to one of the checks.
Be aware that if you do change or add a name to a data contract check, Soda Cloud considers this as a new check and discards the previous check result’s history; it would appear as though the original check and its results had disappeared.
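The duplicate condition can be illustrated with a simplified model. The sketch below treats an identity as a (location, type, name) triple, following the description above; it is not Soda's identity algorithm, and the `find_duplicate_identities` helper and the dict-based representation of checks are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Check = Dict[str, Optional[str]]
Identity = Tuple[str, Optional[str], Optional[str]]


def find_duplicate_identities(checks_by_location: Dict[str, List[Check]]) -> List[Identity]:
    """Return every (location, type, name) triple shared by more than one check."""
    counts: Counter = Counter(
        # A check without a name contributes None as its name.
        (location, check["type"], check.get("name"))
        for location, checks in checks_by_location.items()
        for check in checks
    )
    return [identity for identity, count in counts.items() if count > 1]
```

In this model, two unnamed checks of the same type in the same checks list collide, and adding a name to either one resolves the collision, which mirrors the remedy described above.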
Skip checks during contract verification
During a contract verification, you can arrange to skip checks using check.skip, as in the following example, which skips every check except the schema check.
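from soda.contracts.contract_verification import ContractVerification, ContractVerificationResult

# Build the verification first so the checks can be adjusted before execution
contract_verification: ContractVerification = (
    ContractVerification.builder()
    .with_data_source_yaml_file('soda/local_postgres/data_source.yml')
    .with_contract_yaml_file('soda/local_postgres/public/customers.yml')
    .build()
)
contract = contract_verification.contracts[0]
# Mark every check that is not a schema check as skipped
for check in contract.checks:
    if check.type != "schema":
        check.skip = True
contract_verification_result: ContractVerificationResult = contract_verification.execute()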
Go further
- Need help? Join the Soda community on Slack.
Documentation always applies to the latest version of Soda products