# Contract Language reference

A Soda data contract is a YAML document that contains data quality checks. The checks in a Soda data contract are similar to unit tests for testing software. Verifying a contract means evaluating the checks against the actual data. If contract verification fails, notifications can be sent out or the new data can be stopped from being released to users.

This page documents the necessary contract structure and the supported check types, filters, and configuration options available in soda's data contract language.

## Contract YAML document structure

### Example

A Soda data contract YAML must include the following blocks:

* A top-level `dataset:` key
* One of the following:
  * A `checks:` block, if you’re defining dataset-wide checks
  * A `columns:` list

{% code title="contract.yml" %}

```yaml
dataset: datasource/db/schema/dataset #mandatory dataset fully qualified name

checks: #dataset level checks
  - schema:
  - row_count: 

columns: #columns list
  - name: id
    checks: # column level checks (optional)
      - missing:
  - name: name
    checks:
      - missing:
          threshold:
            metric: percent
            must_be_less_than: 10
  - name: size
    checks:
      - invalid:
          valid_values: ['S', 'M', 'L'] 
```

{% endcode %}

### Dataset fully qualified name

Every contract must have a dataset fully qualified name that follows the structure below:

```yaml
dataset: datasource/db/schema/dataset
```

Casing has to match the casing in the data source.

A dataset fully qualified name is composed of 3 parts, all slash-separated:

* The **data source name**: This is the name of the data source in which the dataset exists. The data source name is configured in a data source YAML configuration file or in Soda Cloud.
  * A **list of prefixes**: Prefixes represent the full list of hierarchical elements of the data source, like the **database name** and the **schema name**.\
    Postgres, for example, has a database name and a schema name as prefixes. Databricks has catalog and schema as prefixes. Prefix names are also slash-separated.
* The **dataset name**: The name of the table or view.

### Columns

Every contract must include a list of columns. If a schema check is configured, this list can be used to validate the **presence, data type, and order** of columns.

Each column entry includes:

* `name`: the column name (required)
* `data_type` (optional): used by the schema check
* `character_maximum_length` (optional): used by the schema check
* `checks` (optional): a list of checks that apply only to that column
  * `type`
  * Threshold keys (for any check that *needs* a threshold, such as `must_be`, `must_be_between`, etc.)

Example:

{% code title="contract.yml" %}

```yaml
# Example of fully operational contract with only column-level checks

dataset: datasource/db/schema/dataset
columns:
  - name: id
    data_type: varchar
    checks:    # Column-level checks definition
      - missing:
      - duplicate:
  - name: name
  - name: code
```

{% endcode %}

{% hint style="warning" %}
A contract that exclusively defines **dataset-level checks** that do not run against any specific column **still requires the `columns` list**. In this case, providing an empty list is enough.
{% endhint %}

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml"># Example of fully operational contract with only dataset-level checks

dataset: datasource/db/schema/dataset

checks:
  - schema:
  - row_count:
  
<strong>columns: []
</strong></code></pre>

### Checks

The `checks` section defines the quality rules and expectations for your dataset. Checks can be written at two levels:

* **Dataset-level**: Rules that apply to the dataset as a whole (e.g., row count, schema, freshness, duplicates across multiple columns).
* **Column-level**: Rules that apply to a specific column (e.g., missing, invalid, duplicate on a single column).

Each check entry includes:

* `type`
* `column:` (not required to compile)
* Threshold keys (for any check that *needs* a threshold, such as `must_be`, `must_be_between`, etc.)

**Example of a dataset-level `checks` block:**

```yaml
checks: 
  - schema: 
  - row_count:  
```

***

**Example of column-level `checks` block:**

<pre class="language-yaml"><code class="lang-yaml">columns:
  - name: email
    checks:
<strong>      - missing:
</strong><strong>      - invalid:
</strong>          invalid_format:
            name: Email format
            regex: ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$
</code></pre>

Each check defines what “good data” looks like, ensuring that your dataset meets the expected standards.

You can customize each check by adding filters, thresholds, or variable references depending on your specific use case. Learn more about the different check types on this page.

***

## Schema check

A Schema check verifies that the structure of a dataset matches the expectations defined in the contract—such as required columns, data types, and optional constraints. It ensures that changes to the dataset schema (like missing columns, unexpected data types, or reordered fields) are detected early, helping maintain stability and trust in your data.

**Example:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: id
  - name: last_name
  - name: address_line1

<strong>checks:
</strong><strong>  - schema:
</strong><strong>      allow_extra_columns: false
</strong><strong>      allow_other_column_order: false
</strong></code></pre>

**Column keys used by the schema check:**

The schema check uses the columns as specified in the contract and compares them with the measured dataset columns as as the expected columns.

<table><thead><tr><th>Key</th><th>Description</th><th width="112.0333251953125">Optional</th></tr></thead><tbody><tr><td><code>name</code></td><td>The name of the column. The name is <strong>case sensitive</strong>.</td><td>No</td></tr><tr><td><code>data_type</code></td><td>The data type of the column as is returned by the metadata from the data source. Data types are compared case insensitive.</td><td>Yes</td></tr><tr><td><code>character_maximum_length</code></td><td>The character maximum length. This represents the maximum length for data types like <code>VARCHAR(255)</code></td><td>Yes</td></tr></tbody></table>

**Check configuration keys:**

<table><thead><tr><th>Key</th><th>Description</th><th width="106.33331298828125">Optional</th></tr></thead><tbody><tr><td><code>allow_extra_columns</code></td><td>Specifies if columns other than the ones listed in the contract are allowed in the dataset. Values are <code>true</code> or <code>false</code>. Default is false, which means that all columns in the dataset must be specified in the contract.</td><td>Yes</td></tr><tr><td><code>allow_other_column_order</code></td><td>Specifies if column ordering must be exact the same as in the contract YAML document. Values are <code>true</code> or <code>false</code>. Default is false, which means that columns must appear in the order as in the contract.</td><td>Yes</td></tr><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

***

## Row count check

A Row Count check verifies that the dataset contains the expected number of rows. It ensures that the data source has at least a minimum number of records and no unexpected gaps. This is a fundamental check to confirm the presence and completeness of data in a dataset.

**Example**:

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

<strong>checks:
</strong><strong>  - row_count:
</strong>
columns: []
</code></pre>

**Check configuration keys**

<table><thead><tr><th>Key</th><th>Description</th><th width="106.33331298828125">Optional</th></tr></thead><tbody><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td>The default row count threshold verifies there is at least 1 row present. <a data-mention href="#thresholds">#thresholds</a></td><td>Yes</td></tr><tr><td><code>filter</code></td><td><a data-mention href="#configure-a-check-filter">#configure-a-check-filter</a></td><td>Yes</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

***

## Freshness check

A freshness check verifies if data is not too old by measuring the period between the current time and the most recent timestamp present in a given column.

***

**Example:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

<strong>checks:
</strong><strong>  - freshness:
</strong><strong>      column: created_at
</strong><strong>      threshold:
</strong><strong>        unit: hour
</strong><strong>        must_be_less_than: 24
</strong>
columns: []
</code></pre>

**Check configuration keys**

<table><thead><tr><th>Key</th><th>Description</th><th width="106.33331298828125">Optional</th></tr></thead><tbody><tr><td><code>column</code></td><td>Specifies a timestamp column that represents the time of the row or</td><td>No</td></tr><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td><p>It defines the <strong>acceptable data age limit.</strong> It sets the maximum time allowed between the current time and the newest timestamp found in the target column.<br></p><ul><li><strong><code>unit</code></strong> : The time unit used for the threshold, e.g., <code>minute</code>, <code>hour</code>, <code>day</code>.</li><li><a data-mention href="#thresholds">#thresholds</a></li></ul></td><td>No</td></tr><tr><td><code>filter</code></td><td><a data-mention href="#configure-a-check-filter">#configure-a-check-filter</a></td><td>Yes</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

***

## Missing check

A Missing check verifies that a specific column does not contain null or empty values beyond an acceptable threshold. It ensures that critical fields are populated and helps catch incomplete or corrupted data that could impact downstream processes.

**Example:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: id
    checks:
<strong>      - missing:
</strong></code></pre>

***

**Check configuration keys**

<table><thead><tr><th>Key</th><th>Description</th><th width="106.33331298828125">Optional</th></tr></thead><tbody><tr><td><code>missing_values</code></td><td>The list of values is considered missing. NULL is always considered a missing value and doesn't need to be included</td><td>No*</td></tr><tr><td><code>missing_format</code></td><td>A SQL regex that matches with missing values. (Advanced)</td><td>No*</td></tr><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td>The default missing check threshold verifies there are no missing values in the column. <a data-mention href="#thresholds">#thresholds</a></td><td>Yes</td></tr><tr><td><code>filter</code></td><td><a data-mention href="#configure-a-check-filter">#configure-a-check-filter</a></td><td>Yes</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

\* there are several configuration keys to configure the missing values. At least one missing configuration is required.

***

**Configure extra missing values in a list**

Configure a list of values that, apart from NULL, must also be considered as missing values. Typical examples are `'-'`, `'No value'`, `'N/A'`, `'None'`, `'null'`, `-1`, `999`

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: id
    checks:
      - missing:
<strong>          missing_values: ['N/A', '-']
</strong></code></pre>

{% hint style="info" %}
The `missing_values` configuration is only supported on **string** and **numeric** data types.
{% endhint %}

***

**Configure extra missing values with a regular expression**

Configure a SQL regular expression that, apart from NULL, also matches values that are considered as missing values. Only supported for columns having a text data type

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: id
    checks:
      - missing:
          missing_format:
<strong>            name: All dashes
</strong><strong>            regex: ^[-]+$
</strong></code></pre>

The `regex` syntax must match the SQL engine of the data source.

The `name` is a human-readable description of what the regex does. It helps explain the purpose of the pattern, making it easier for others to understand the intent of the check.

***

## Invalid check

An Invalid check verifies that the values in a column conform to a set of allowed formats, values, or constraints. It helps catch data that does not meet the expected standards, such as incorrect formats, out-of-range values, or entries that violate business rules.

**Example with valid values**:

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: size
    checks:
<strong>      - invalid:
</strong><strong>          valid_values: ['S', 'M', 'L']
</strong></code></pre>

**Check configuration keys**

<table><thead><tr><th>Key</th><th>Description</th><th width="105.699951171875">Optional</th></tr></thead><tbody><tr><td><code>valid_values</code></td><td>A list of valid values</td><td>No*</td></tr><tr><td><code>valid_format</code></td><td>A SQL regular expression that matches with valid values</td><td>No*</td></tr><tr><td><code>valid_reference_data</code></td><td>References a dataset and column that contains valid values</td><td>No*</td></tr><tr><td><code>valid_min</code></td><td>Valid values must be greater or equal than the configured value</td><td>No*</td></tr><tr><td><code>valid_max</code></td><td>Valid values must be less or equal than the configured value</td><td>No*</td></tr><tr><td><code>valid_length</code></td><td>The fixed length of valid values</td><td>No*</td></tr><tr><td><code>valid_min_length</code></td><td>The minimum length of valid values</td><td>No*</td></tr><tr><td><code>valid_max_length</code></td><td>The maximum length of valid values</td><td>No*</td></tr><tr><td><code>invalid_values</code></td><td>A list of invalid values</td><td>No*</td></tr><tr><td><code>invalid_format</code></td><td>A SQL regular expression that matches with invalid values</td><td>No*</td></tr><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td>The default invalid check threshold ensures there must not be any invalid values. <a data-mention href="#thresholds">#thresholds</a></td><td>Yes</td></tr><tr><td><code>filter</code></td><td><a data-mention href="#configure-a-check-filter">#configure-a-check-filter</a></td><td>Yes</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

\* there are several configuration keys to configure the invalid values. At least one validity configuration is required. Multiple validity configurations can be combined.

**Example with a regular expression**

Format is only supported for string data type.

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: code
    checks:
      - invalid:
<strong>          valid_format:
</strong><strong>            name: Alpha Code
</strong><strong>            regex: \^[A-Z]{3}$\
</strong></code></pre>

The `regex` syntax must match the SQL engine of the data source.

The `name` is a human-readable description of what the regex does. It helps explain the purpose of the pattern, making it easier for others to understand the intent of the check.

***

**Example minimum and maximum values:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: code
    checks:
      - invalid:
<strong>          valid_min: 1
</strong><strong>          valid_max: 100
</strong></code></pre>

`valid_min` and `valid_max` can be used independently.

It's supported on columns with numerical data types.

***

**Example minimum and maximum length:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: code
    checks:
      - invalid:
<strong>          valid_length: 2
</strong><strong>          valid_min_length: 1
</strong><strong>          valid_max_length: 5
</strong></code></pre>

`valid_length` , `valid_min_length` and `valid_max_length` can be used independently.

***

#### **Example with a reference dataset**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: code
    checks:
      - invalid:
<strong>          valid_reference_data:
</strong><strong>            dataset: data_source/db/schema/reference_dataset
</strong><strong>            column: valid_code
</strong></code></pre>

`valid_reference_data` has 2 nested configurations:

* `dataset`: fully qualified dataset name. Limitation: The reference dataset must be in the same data source.
* `column`: the name of the column containing the valid values

Commonly referred to as a reference check or referential integrity check.

***

## Duplicate check

A Duplicate check ensures that values in a column or combination of columns are unique, helping prevent data integrity issues. It can be used at the **column level** or **dataset level**.

### Single column duplicate check

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: id
    checks:
<strong>      - duplicate:
</strong></code></pre>

**Configuration keys:**

<table><thead><tr><th>Key</th><th>Description</th><th width="106.33331298828125">Optional</th></tr></thead><tbody><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td>The default duplicate check threshold requires that all values are unique. <a data-mention href="#thresholds">#thresholds</a></td><td>Yes</td></tr><tr><td><code>filter</code></td><td>Check filter</td><td>Yes</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

### Multi-column duplicate check

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

checks:
  - duplicate:
<strong>      columns: ['col1', 'col2']
</strong>      
columns: []
</code></pre>

**Configuration keys:**

<table><thead><tr><th>Key</th><th>Description</th><th width="106.33331298828125">Optional</th></tr></thead><tbody><tr><td><code>columns</code></td><td>List of column names</td><td>No</td></tr><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td>The default duplicate check threshold requires that all values are unique. Data Contract Language Reference ✅</td><td>Yes</td></tr><tr><td><code>filter</code></td><td><a data-mention href="#configure-a-check-filter">#configure-a-check-filter</a></td><td>Yes</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

***

## Aggregate check

An Aggregate check verifies that the result of an aggregate function, such as `avg`, `sum`, `min`, `max`, or `count`—meets the expected thresholds. It helps ensure that summary statistics over a column remain within acceptable ranges and that key metrics derived from your data stay accurate and consistent.

**Example:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: age
    checks:
<strong>      - aggregate:
</strong><strong>          function: avg
</strong><strong>          threshold:
</strong><strong>            must_be_between:
</strong><strong>              greater_than: 20
</strong><strong>              less_than: 50
</strong></code></pre>

**Check configuration keys**

<table><thead><tr><th>Key</th><th>Description</th><th width="105.70001220703125">Optional</th></tr></thead><tbody><tr><td><code>function</code></td><td>Supported by all data sources: <code>avg</code>, <code>avg_length</code>, <code>max</code>, <code>min</code>, <code>max_length</code>, <code>min_length</code>, <code>sum</code></td><td>No</td></tr><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td><a data-mention href="#thresholds">#thresholds</a></td><td>No</td></tr><tr><td><code>filter</code></td><td><a data-mention href="#configure-a-check-filter">#configure-a-check-filter</a></td><td>Yes</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

***

## Metric check

A Metric check validates the result of a custom SQL expression or query against a threshold. It supports complex business logic, calculated metrics, or cross-column relationships. Metric checks can be defined at the dataset level (across multiple columns) or column level (single column).

For better performance, we recommend SQL expressions over full SQL queries—they’re simpler to write and can be combined with other checks in a single query.

### Verify a SQL expression value against a threshold

**Example at the dataset level:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

checks:
  - metric:
<strong>      expression: AVG("end" - "start")
</strong><strong>      threshold:
</strong><strong>        must_be_between:
</strong><strong>          greater_than: 3
</strong><strong>          less_than: 8
</strong>
columns: []
</code></pre>

**Example at the column level:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: end
    checks:
      - metric:
<strong>          expression: AVG("end" - "start")
</strong><strong>          threshold:
</strong><strong>            must_be_between:
</strong><strong>              greater_than: 3
</strong><strong>              less_than: 8
</strong></code></pre>

***

**Check configuration keys**

<table><thead><tr><th>Key</th><th>Description</th><th width="106.33331298828125">Optional</th></tr></thead><tbody><tr><td><code>expression</code></td><td>A SQL expression that produces the numeric metric value that will be compared with the threshold.</td><td>No</td></tr><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td><a data-mention href="#thresholds">#thresholds</a></td><td>No</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

### Verify a SQL query value against a threshold

**Example at the dataset level:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

checks:
  - metric:
<strong>      query: |
</strong><strong>        SELECT AVG("end" - "start")
</strong><strong>        FROM datasource.db.schema.table
</strong><strong>      threshold:
</strong><strong>        must_be_greater_than: 3
</strong>
columns: []
</code></pre>

**Example at the column level:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: end
    checks:
      - metric:
<strong>          query: |
</strong><strong>            SELECT AVG("end" - "start")
</strong><strong>            FROM datasource.db.schema.table
</strong><strong>          threshold:
</strong><strong>            must_be_greater_than: 3
</strong></code></pre>

<table><thead><tr><th>Key</th><th>Description</th><th width="106.33331298828125">Optional</th></tr></thead><tbody><tr><td><code>query</code></td><td>A SQL query that produces a single numeric metric value that will be compared with the threshold.</td><td>No</td></tr><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td><a data-mention href="#thresholds">#thresholds</a></td><td>No</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

***

## Failed rows check

A Failed Rows check identifies rows that violate a specific condition, such as a filter or SQL expression. It helps pinpoint problematic data like outliers or rule violations and can save references for further review.

Failed Rows checks can apply at the dataset level (testing conditions across multiple columns) or at the column level (validating individual column values).

We recommend using SQL expressions over full SQL queries for simplicity and efficiency—they allow combining multiple checks into a single query for better performance.

### Verify failed rows with a SQL expression

**Example at the dataset level:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

checks:
<strong>  - failed_rows:
</strong><strong>      expression: ("end" - "start") > 5
</strong>      
columns: []
</code></pre>

**Example at the column level:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: end
    data_type: DATE
    checks:
<strong>      - failed_rows:
</strong><strong>          expression: ("end" - "start") > 0
</strong></code></pre>

**Check configuration keys**

<table><thead><tr><th>Key</th><th>Description</th><th width="106.33331298828125">Optional</th></tr></thead><tbody><tr><td><code>Expression</code></td><td>A SQL expression, used as the WHERE clause, that produces any tabular data indicating failures with the data. An expression returning zero rows indicates there is no problem with the data.</td><td>No</td></tr><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td>The default is that there should be no failed rows.<a data-mention href="#thresholds">#thresholds</a></td><td>Yes</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

### Verify and visualize failed rows with a SQL query

**Example at the dataset level:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

checks:
<strong>  - failed_rows:
</strong><strong>      query: |
</strong><strong>        SELECT * FROM datasource.db.schema.table WHERE ("end" - "start") > 5
</strong>        
columns: []
</code></pre>

**Example at the column level:**

<pre class="language-yaml" data-title=""><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: end
    data_type: DATE
    checks:
<strong>      - failed_rows:
</strong><strong>        query: |
</strong><strong>          SELECT * FROM datasource.db.schema.table WHERE ("end" - "start") > 0
</strong></code></pre>

**Check configuration keys**

<table><thead><tr><th>Key</th><th>Description</th><th width="106.33331298828125">Optional</th></tr></thead><tbody><tr><td><code>query</code></td><td>A SQL query that produces any tabular data indicating failures with the data. An expression returning zero rows indicates there is no problem with the data. I</td><td>No</td></tr><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td>The default is that there should be no failed rows. The threshold can only be count-based. Percent thresholds are not supported. <a data-mention href="#thresholds">#thresholds</a></td><td>Yes</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

***

## Reconciliation checks

> Documentation for reconciliation checks is available in [reconciliation-checks](https://docs.soda.io/reference/contract-language-reference/reconciliation-checks "mention")

## Group By check

A `group by` check allows you to run validations on **aggregated metrics per group of records** based on one or more fields. You can then apply thresholds to metrics like counts, sums, or averages for each group.

{% hint style="warning" %}
Group By checks require having installed **the `soda-groupby` package using the** [**private PyPI**](#private-pypi-installation-flow) (Team or Entreprise license), unless you are using an agent.

Need access to the private PyPI? Please [contact us](https://www.soda.io/contact).
{% endhint %}

> **Follow the** [**private PyPi installation flow**](https://docs.soda.io/deployment-options/soda-python-libraries#private-pypi-installation-flow) to set up your environment and install the necessary Soda extensions.

### Verify a SQL query value against a threshold

**Example at the dataset level:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

checks:
  - group_by:
      name: ""
<strong>      query: |-
</strong><strong>        SELECT
</strong><strong>            warehouse_id,
</strong><strong>            product_id,
</strong><strong>            COUNT(*) AS record_count,
</strong><strong>            SUM(qty_on_hand) AS total_qty_on_hand,
</strong><strong>        FROM soda_demo.retail_stock_levels
</strong><strong>        GROUP BY warehouse_id, product_id
</strong>      fields:
        - warehouse_id
        - product_id
      metric_name: total_qty_on_hand
<strong>      threshold:
</strong><strong>        must_be_greater_than: 1
</strong>    
columns: []
</code></pre>

**Check configuration keys**

<table><thead><tr><th>Key</th><th>Description</th><th width="105.70001220703125">Optional</th></tr></thead><tbody><tr><td><code>query</code></td><td>SQL query returning grouping fields and metrics</td><td>No</td></tr><tr><td><code>fields</code></td><td>List of column(s) used to group results</td><td>No</td></tr><tr><td><code>metric_name</code></td><td>Metric (from query alias) used for threshold evaluation</td><td></td></tr><tr><td><code>name</code></td><td><a data-mention href="#check-names">#check-names</a></td><td>Yes</td></tr><tr><td><code>threshold</code></td><td>The threshold to compare each of the group metrics with.<a data-mention href="#thresholds">#thresholds</a></td><td>No</td></tr><tr><td><code>qualifier</code></td><td><a data-mention href="#check-qualifiers">#check-qualifiers</a></td><td>Yes</td></tr><tr><td><code>attributes</code></td><td><a data-mention href="#check-attributes">#check-attributes</a></td><td>Yes</td></tr></tbody></table>

***

## Common check configurations

### Check filter

Most checks support a `filter` to limit the rows on which the check applies:

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: id
    checks:
      - missing:
<strong>          filter: "country = 'USA'"
</strong></code></pre>

Note: If a dataset filter is configured, both the dataset and the check filter will be applied.

### Column expression

**All column-level checks** (except `failed_rows` check type) as well as the dataset-level **Freshness check,** **support a `column_expression` configuration** that allows you to modify the column value before the check is evaluated. You can optionally define:

```yaml
column_expression: <SQL expression>
```

This SQL expression:

* Uses the SQL dialect of your connected data source.
* Is applied during query generation.
* Modifies the column value before the check is evaluated

Soda then runs the check against the result of that expression instead of the raw column.

{% hint style="info" %}
Performance will depend on the complexity of the SQL expression. All expressions must be valid in the underlying warehouse dialect; **Soda does not validate SQL syntax before execution**.
{% endhint %}

You can use `column_expression` to:

* Access and validate **nested properties** in structured columns (for example, JSON, STRUCT, or OBJECT fields).
* **Cast data types** before applying checks (e.g., convert a string to a date for freshness validation).
* Reuse standard Soda checks (freshness, missing, validity, etc.) without writing custom SQL.

#### Define a column expression at the column level

You can define `column_expression` at the column level so that it applies to all checks on that column.

<pre class="language-yaml"><code class="lang-yaml">columns:
  - name: column_name
<strong>    column_expression: &#x3C;SQL expression>
</strong>    checks:
      - missing:
      - duplicate:
      - invalid:
          valid_min: 0
</code></pre>

#### Override at check level

You can also define a `column_expression` inside an individual check:

<pre class="language-yaml"><code class="lang-yaml">columns:
  - name: employee
    checks:
      - invalid:
          valid_min: 100
<strong>          column_expression: &#x3C;SQL expression>
</strong></code></pre>

If both column-level and check-level `column_expression` are defined, the check-level expression overrides the column-level expression.

#### Use Cases

<details>

<summary><strong>Validate nested fields in structured columns</strong></summary>

If a column "employee" contains structured data such as JSON or STRUCT:

```yaml
{
  "data": {
    "id": 1234,
    "value": 250
  }
}
```

You can validate nested properties directly using standard checks.

**Example (Postgres syntax)**

```yaml
columns:
  - name: employee
    data_type: struct
    column_expression: employee.data->>'id'
    checks:
      - invalid:
          valid_min: 1000
          column_expression: employee.data->>'id'
      - invalid:
          valid_min: 200
          column_expression: employee.data->>'value'
```

Soda applies built-in validation logic to the extracted nested value. In this example, `employee.data->>'id'` is evaluated before applying the `invalid` check.

</details>

<details>

<summary><strong>Cast data types for compatibility with checks</strong></summary>

Some checks require specific data types.

For example:

* `freshness` requires a date/timestamp
* `valid format` (regex) works on strings
* Numeric validations require numeric types

If your column type does not match the check’s requirement, you can cast it.

**Example: Cast string to date for freshness (Postgres syntax)**

```yaml
checks:
  - freshness:
      column: hire_date
      column_expression: hire_date::date
      threshold:
        must_be_less_than: "7 days"
```

If `hire_date` is stored as a string, casting it to `date` using `::date` allows freshness evaluation, since the check requires a valid date-formatted value.

**Example: Cast numeric to string for regex validation (Postgres syntax)**

```yaml
columns:
  - name: employee_id
    type: 
    checks:
      - invalid:
          valid_format: "^[A-Z]{2}[0-9]{6}$"
          column_expression: employee_id::text
```

This enables regex validation on numeric IDs, since in some data sources, regular expressions can only be executed on text-type columns.

</details>

{% hint style="warning" %}
**Limitations & considerations**

* Expressions must use the **data source dialect** (Snowflake, BigQuery, Databricks, etc.).
* Soda does not normalize SQL syntax across data sources.
* `column_expression` is not supported for:
  * `failed_rows`
  * `row_count`
  * `duplicate` (dataset level)
  * `group_by`
  * `schema`
  * Custom metric checks
    {% endhint %}

### Thresholds

Every check (except the schema check) has a metric value that is evaluated and produces a check outcome. The threshold is the check part that specifies which numeric values make the check pass or fail.

An example threshold is: "The missing count metric value must be less than 5."

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: status
    checks:
      - missing:
<strong>          threshold:
</strong><strong>            must_be_less_than: 0
</strong></code></pre>

Use one of the following threshold configuration keys to specify an open range.

| Key                             | Description                                                                                                                                                                                                                                                                                                                  |
| ------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `must_be`                       | The metric value must be equal to the configured value                                                                                                                                                                                                                                                                       |
| `must_not_be`                   | The metric value must be different from the configured value                                                                                                                                                                                                                                                                 |
| `must_be_greater_than`          | The metric value must be greater than the configured value                                                                                                                                                                                                                                                                   |
| `must_be_greater_than_or_equal` | The metric value must be greater than or equal to the configured value                                                                                                                                                                                                                                                       |
| `must_be_less_than`             | The metric value must be less than the configured value                                                                                                                                                                                                                                                                      |
| `must_be_less_than_or_equal`    | The metric value must be less than or equal to the configured value                                                                                                                                                                                                                                                          |
| `must_be_between`               | The metric value must be between the acceptable bounds for a metric. The bounds must be expressed as either a **strict** range (`greater_than` / `less_than`) or an **inclusive** range (`greater_than_or_equal` / `less_than_or_equal`). The bounds are specified as **nested properties** under `must_be_between`.         |
| `must_be_not_between`           | The metric value must not be between the acceptable bounds for a metric. The bounds must be expressed as either a **strict** range (`greater_than` / `less_than`) or an **inclusive** range (`greater_than_or_equal` / `less_than_or_equal`). The bounds are specified as **nested properties** under `must_not_be_between`. |

Use a closed range to ensure metric values are between 2 boundaries.

**Example:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: success_ratio
    checks:
      - missing:
          threshold:
            must_be_between:
<strong>              greater_than: 0
</strong><strong>              less_than: 100
</strong></code></pre>

* Use `greater_than` or `greater_than_or_equal` to specify the lower bound
* Use `less_than` or `less_than_or_equal` to specify the upper bound

Use an open range to ensure metric values are outside 2 boundaries.

**Example:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: success_ratio
    checks:
      - missing:
          threshold:
            must_be_not_between:
<strong>              less_than: 0
</strong><strong>              greater_than: 100
</strong></code></pre>

* Use `less_than` or `less_than_or_equal` to specify the lower bound
* Use `greater_than` or `greater_than_or_equal` to specify the upper bound

**Set a threshold as a percentage**

To specify a threshold as a percentage of the total, add `metric: percent` to the threshold.

The `metric` The property has 2 possible values: `count` is the default and `percentage` can be configured.

Percentage-based thresholds are only available on checks where the metric value can be expressed as a percentage of a total value. These are missing, invalid, and duplicate checks.

Note that the total value refers to the total number of rows checked. In case of a dataset filter or a check filter, the total refers to the number of rows passing the filters.

**Example:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/table

columns:
  - name: status
    checks:
      - missing:
          threshold:
<strong>            metric: percent
</strong><strong>            must_be_less_than: 5
</strong></code></pre>

#### Set a threshold level

To change the default "fail" threshold level, add `level: warn` to the threshold.

This will cause a given check to emit warning instead of check failure.

**Example:**

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: status
    checks:
      - missing:
          threshold:
<strong>            level: warn
</strong></code></pre>

### Check names

The `name` property provides a human-readable description for a check. It helps explain the purpose of the check in plain language, making it easier for users to understand the intent behind each rule, especially when reviewing results in Soda Cloud.

For example, instead of relying on technical details like a regex pattern or threshold, a well-written `name` offers a simple summary, such as:

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: status
    checks:
      - missing:
<strong>          name:  The missing count metric value must be less than 5%
</strong>          threshold:
            metric: percent
            must_be_less_than: 5
</code></pre>

In this example, the `name` ("Email format") clarifies the purpose of the regex. This is particularly helpful for non-technical users reviewing the contract or verification results.

**Best Practices for Check Names:**

* Write check names in plain language.
* Focus on the intent or business rule the check enforces.
* Avoid repeating technical details already visible in the check configuration.

The `name` Property is optional but highly recommended for improving clarity and collaboration.

By default, the check name is the check type.

### Check Attributes

Use attributes to **label**, **sort**, and **route** your checks in Soda Cloud. Attributes help you organize checks by properties such as domain, priority, location, and sensitivity (e.g., PII).

> Learn how to leverage attributes with [notifications](https://docs.soda.io/manage-issues/notifications "mention") and [browse-datasets](https://docs.soda.io/manage-issues/browse-datasets "mention").

**Apply Attributes to Checks**

You can add attributes directly to individual checks. For example:

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

columns:
  - name: customer_email
    data_type: VARCHAR
    checks:
      - missing:
<strong>          attributes:
</strong><strong>            domain: Sales
</strong><strong>            pii: True
</strong><strong>            dimension: completeness
</strong></code></pre>

Attributes can also hold **list values**:

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">columns:
  - name: customer_email
    checks:
      - missing:
<strong>          attributes:
</strong><strong>            tags: [prod, critical]
</strong><strong>            domain: Sales
</strong></code></pre>

**Set Default Attributes at the Top Level**

You can also define default attributes at the dataset level. These attributes apply to **all checks**, unless overridden at the individual check level.

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

<strong>check_attributes:
</strong><strong>  domain: Sales
</strong>  
columns:
  - name: customer_email
    data_type: VARCHAR
    checks:
      - missing:
          attributes:
            pii: True
            dimension: completeness
</code></pre>

#### **Filter checks by attributes at runtime**

Attributes can also be used to select which checks to run during contract verification using the `--check-filter` CLI option. For example, `-cf attributes.severity=critical` runs only checks with `severity: critical`.

For list-valued attributes (e.g., `tags: [prod, critical]`), use `-cf attributes.tags=prod` to match any check where the list contains `prod`, or `-cf "attributes.tags=[prod,critical]"` to match the exact list.

> Learn more in the [CLI reference](https://docs.soda.io/cli-reference#check-filters).

#### **Attribute Validation in Soda Cloud**

When publishing contract results to Soda Cloud, **all check attributes must be pre-defined in Soda Cloud**. If any attribute used in a contract is not registered in your Soda Cloud environment, the results will **not be published**, and the data contract scan will be **marked as failed**.

> Learn how to configure attributes in Soda Cloud: [check-and-dataset-attributes](https://docs.soda.io/manage-issues/check-and-dataset-attributes "mention").

### Check qualifiers

When sending results to Soda Cloud, check qualifiers ensure that the checks in the source YAML document can be correlated with the checks in Soda Cloud.

Most checks don't need a qualifier because there is only one check in a `checks` section of the same type. If you have multiple checks of the same type in a `checks` section, you need to ensure that each check has a distinct string `qualifier` value.

Keep in mind that if you change the qualifier, the historical information of this check on Soda Cloud is lost.

We want to ensure that all contracts are Soda Cloud ready so we have made it mandatory to specify qualifiers where needed.

In case two checks have the same identity, contract verification will complain with an error `Duplicate identity`

Configure distinct qualifiers to create unique check identities:

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/table

columns:
  - name: status
    checks:
      - invalid:
          valid_values: ["C", "A", "D"]
      - invalid:
<strong>          qualifier: c2
</strong>          valid_format:
            regex: ^[CAD]$
            name: Characters format
</code></pre>

Any text or numeric value is allowed as the qualifier. No qualifier is considered different from other checks with a qualifier like in the example above.

***

## Configure a filter for all checks

A dataset filter ensures that all checks in the contract are only applied to a subset of the rows in the dataset.

This is most commonly used to apply all checks to the latest time partition of the data. Each time new data is appended to an incremental dataset, the contract verification should evaluate the checks only on the rows in the latest time partition.

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource / db / schema / dataset;

<strong>filter: "created_at >= CURRENT_DATE - INTERVAL '1 day'";
</strong>    
columns: []
</code></pre>

## Make contracts dynamic with variables

Variables allow dynamic substitution of values in contracts. They help you:

* **Parameterize** values that differ across environments, datasets, or schedules.
* **Reuse values** in multiple places within the same contract to reduce duplication and improve maintainability.

Variables can be used in filters, checks, thresholds, and more.

Declare variables at the top of the contract:

<pre class="language-yaml" data-title="contract.yml"><code class="lang-yaml">dataset: datasource/db/schema/dataset

<strong>variables:
</strong><strong>  COUNTRY:
</strong><strong>    default: France
</strong>    
columns: []
</code></pre>

The default value is optional. Variables without a `default` are required when verifying a contract (see below). *If you don't specify a* `default` *value, don't forget the colon (*`:`*) after the variable name.*

Use variables in other places in the contract with syntax: `${var.VARIABLE_NAME}`

{% code title="contract.yml" %}

```yaml
dataset: datasource/db/schema/dataset

filter: "created_at >= '${var.START_DATE}'"

columns:
  - name: status
    checks:
      - missing:
```

{% endcode %}

{% hint style="info" %}
**Variables are case sensitive**. We recommend always using upper case and underscores for variable names.
{% endhint %}

You can specify variables values when verifying a contract:

```bash
soda contract verify --set START_DATE=2024-05-19T14:30:00Z
```

{% hint style="danger" %}
Setting a variable at verification time overrides the default value configured in the contract.
{% endhint %}

#### Allowed variable type

Soda currently supports only **string** and **numeric** variables, meaning that variables must be represented as either **text values** or **numbers**. Other variable types (such as boolean, date/time, arrays, or complex objects) are not supported at this time, so **any inputs should be formatted or converted into one of these two supported formats** before being used in Soda.

#### Soda variables

`${soda.NOW}` is a built-in variable that provides the **current timestamp at the moment of scan execution**, allowing you to create dynamic filters relative to the time of scan execution. For example, to check for records in the last 24 hours.

***

## Schedule a contract verification

Soda Cloud can execute the contract verification on a time schedule.

First, configure the time schedule in the contract.

{% code title="" %}

```yaml
dataset: datasource/db/schema/dataset

soda_agent:
  checks_schedule:
    cron: 0 0 * * *
    timezone: UTC
    
columns: []
```

{% endcode %}

* This defines a cron-based schedule to trigger contract **verification**
* Requires the data source to be configured in Soda Cloud with Soda Agent

Second, publish the contract to Soda Cloud.

<br>

***

{% if visitor.claims.plan === 'free' %}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Free license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if visitor.claims.plan === 'teams' %}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Team license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if visitor.claims.plan === 'enterprise' || visitor.claims.plan === 'enterpriseUserBased' %}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Enterprise license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if !(visitor.claims.plan === 'free' || visitor.claims.plan === 'teams' || visitor.claims.plan === 'enterprise' || visitor.claims.plan === 'enterpriseUserBased') %}
{% hint style="info" %}
You are **not logged in to Soda** and are viewing the default public documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}
