
Set up data contracts


experimental
Last modified on 26-Apr-24

Use Soda data contracts to set data quality standards for data products. In a programmatic Soda scan, Soda executes the standards as data quality checks.

dataset: dim_product

columns:

- name: id
  data_type: VARCHAR
  checks:
  - type: duplicate_count

- name: size
  data_type: VARCHAR
  checks:
  - type: invalid_count
    valid_values: ['S', 'M', 'L']
    must_be_greater_than_or_equal: 10

- name: distance
  checks:
  - type: invalid_count
    valid_min: 0
    valid_max: 1000

- name: created
  optional: true

checks:
  - type: row_count

✖️    Requires Soda Core Scientific
✔️    Supported in Soda Core 3.3.0 or greater
✖️    Supported in Soda Library + Soda Cloud
✖️    Supported in Soda Cloud Agreements + Soda Agent
✖️    Supported by SodaGPT
✖️    Available as a no-code check


About data contracts

Soda data contracts is a Python library that verifies data quality standards as early and as often as possible in a data pipeline to prevent negative downstream impact.

Begin by preparing a contract in a YAML file that stipulates the quality standards to which any newly ingested or transformed data must adhere, such as schema and column data types, freshness, and missing-value or validity standards. Each time the pipeline accepts or produces new data, Soda executes the checks in the contract; a failed check indicates that the new data does not meet the contract’s data quality standards and warrants investigation or quarantining.
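To make the pass/fail idea concrete, the following is a minimal Python sketch of what the duplicate_count, invalid_count, and row_count checks in the example contract measure. It is not Soda's implementation and does not use the soda-contracts API. The function names and sample rows are hypothetical, and the sketch assumes that duplicate_count counts distinct values that occur more than once and that null values are ignored.

```python
from collections import Counter


def duplicate_count(rows, column):
    """Count distinct non-null values that appear more than once (assumed semantics)."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return sum(1 for occurrences in counts.values() if occurrences > 1)


def invalid_count(rows, column, valid_values):
    """Count non-null values that are not in the list of valid values."""
    return sum(
        1
        for row in rows
        if row.get(column) is not None and row[column] not in valid_values
    )


# Hypothetical sample of dim_product rows; real data comes from your data source.
rows = [
    {"id": "a1", "size": "S"},
    {"id": "a2", "size": "XL"},
    {"id": "a2", "size": "M"},
]

metrics = {
    "row_count": len(rows),
    "id.duplicate_count": duplicate_count(rows, "id"),
    "size.invalid_count": invalid_count(rows, "size", ["S", "M", "L"]),
}
print(metrics)
```

Each metric is then compared against the threshold declared in the contract, and a value outside the threshold counts as a failed check.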

If you consider a data pipeline as a set of components (ingestions, transformations, and so on), you can apply a data contract to each component to verify its output against your data quality standards. Doing so frequently and consistently effectively breaks a dense data pipeline into manageable parts, in which data quality is verified before data moves from one component to the next. Use the same strategy of frequent verification in a CI/CD workflow to make sure that newly committed code adheres to your stipulated data quality standards.
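The idea of verifying data between components can be sketched in plain Python as follows. This is an illustration of the gating pattern only, not the soda-contracts API: run_with_contract, ingest, transform, and the check functions are all hypothetical stand-ins for real pipeline components and contract checks.

```python
def run_with_contract(produce, checks):
    """Run one pipeline component, then verify its output before handing it on.

    `produce` returns a list of rows; `checks` maps a check name to a function
    that returns True when the rows meet the standard. Both are hypothetical.
    """
    rows = produce()
    failed = [name for name, check in checks.items() if not check(rows)]
    if failed:
        # Stop here so that bad data never reaches the next component.
        raise ValueError(f"contract failed: {', '.join(failed)}")
    return rows


def ingest():
    # Stand-in for an ingestion step.
    return [{"id": "a1", "distance": 12}, {"id": "a2", "distance": 950}]


def transform(rows):
    # Stand-in for a transformation step.
    return [{**row, "distance_km": row["distance"] / 1000} for row in rows]


ingest_checks = {
    "row_count > 0": lambda rows: len(rows) > 0,
    "distance within 0..1000": lambda rows: all(
        0 <= row["distance"] <= 1000 for row in rows
    ),
}
transform_checks = {
    "distance_km present": lambda rows: all("distance_km" in row for row in rows),
}

ingested = run_with_contract(ingest, ingest_checks)
transformed = run_with_contract(lambda: transform(ingested), transform_checks)
print(len(transformed), "rows passed both contracts")
```

Raising an exception is one possible reaction to a failed contract; a real pipeline might instead route the failing batch to a quarantine location.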

Prerequisites

Soda Core 3.3.0 supports the newest, experimental version of soda-contracts. The new version introduces changes that may not be compatible with the previous experimental version of soda-contracts. To continue using the first version of soda-contracts without any adjustments, upgrade to Soda Core 3.2.4 for the latest in bug fixes and updates.

  • Python 3.8 or greater
  • Pip 21.0 or greater
  • a code or text editor
  • your data source connection credentials and details
  • (optional) a local development environment in which to test data contract execution
  • (optional) a git repository to store and control the versions of your data contract YAML files
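As a convenience, the two version requirements above can be checked with a short script. This is a sketch that uses only the Python standard library; the helper names version_tuple and check_prerequisites are illustrative and not part of Soda.

```python
import re
import sys
from importlib import metadata


def version_tuple(text):
    """Turn a version string such as '21.0.1' into (21, 0) for comparison."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:2]]
    return tuple(numbers + [0] * (2 - len(numbers)))


def check_prerequisites():
    """Report whether the running Python and the installed pip are recent enough."""
    python_ok = sys.version_info >= (3, 8)
    try:
        pip_version = metadata.version("pip")
    except metadata.PackageNotFoundError:
        # pip is not installed in this environment.
        pip_version = None
    pip_ok = pip_version is not None and version_tuple(pip_version) >= (21, 0)
    return {"python": python_ok, "pip": pip_ok}


print(check_prerequisites())
```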

Install data contracts

Data contracts are only available for use in programmatic scans using Soda Core.
Soda Core CLI does not support data contracts.

  1. Best practice dictates that you install data contracts in a virtual environment. In your command-line interface tool, create and activate a Python virtual environment.
  2. Execute the following command, replacing the package name with the install package that matches the type of data source you use to store data; see the complete list of packages.
    pip install soda-core-postgres
    
  3. Use the following command to install soda-core-contracts.
    pip install soda-core-contracts
    
  4. Validate the installation using the following command.
    soda --help
    

To exit the virtual environment, use the command deactivate.
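Because data contracts run in programmatic scans, it can also be useful to confirm from Python that the packages are visible in the active virtual environment. The following sketch uses only the standard library to look up installed package versions; installed_versions is a hypothetical helper, and the package names are the ones used in the steps above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Replace soda-core-postgres with the package for your own data source.
print(installed_versions(["soda-core-postgres", "soda-core-contracts"]))
```

A value of None for either package suggests that the install command ran in a different environment than the one currently active.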

Go further


