# Reconciliation checks

{% hint style="warning" %}
Reconciliation checks require the **`soda-reconciliation`** package, installed from the [**private PyPI**](#private-pypi-installation-flow) under an **Enterprise license**, unless you run your contracts on a Soda Agent.

Need access to the private PyPI? Please [contact us](mailto:support@soda.io).
{% endhint %}

Reconciliation checks validate that a **target dataset** matches one or more **source datasets**, ensuring that data remains consistent after migrations, within pipelines, or during synchronization. They support both **metric-level (aggregate) validation** and **row-level (record-by-record) validation**.

> Learn more about the use cases and performance considerations in the [Data reconciliation](https://docs.soda.io/data-testing/data-reconciliation) page.

## Prerequisites

To use **reconciliation checks**, you must either:

* Run your contract with **Soda Agent**, which has reconciliation support built in, **or**
* Install the **Soda Reconciliation extension** locally:

{% code overflow="wrap" %}

```shellscript
pip install "soda-reconciliation" --pre -i "https://${SODA_API_KEY_ID}:${SODA_API_KEY_SECRET}@enterprise.pypi.cloud.soda.io" --extra-index-url=https://pypi.cloud.soda.io
```

{% endcode %}

> **Follow the** [**private PyPI installation flow**](https://docs.soda.io/deployment-options/soda-python-libraries#private-pypi-installation-flow) to set up your environment and install the necessary Soda extensions.

## Example

#### Structure

* `dataset:` defines the **target dataset**.
* `reconciliation.sources:` defines the **source datasets** to compare against. Each source has a `name` and a `dataset`.
* `reconciliation.sources[].filter:` is an optional filter applied only to that source dataset.
* Each reconciliation `check` specifies which source it targets via the `source:` field.
* Each reconciliation `check` supports an additional **check-level filter** applied consistently to both source and target, layered on top of dataset-level filters.
* Thresholds define acceptable differences.

> All Common Check Configurations (filters, thresholds, names, qualifiers, attributes) apply to reconciliation checks. Learn more about [Common Check Configurations](https://docs.soda.io/reference/contract-language-reference/..#common-check-configurations).

```yaml
dataset: cloud_data_source/db/schema/dataset
filter: created_at >= CURRENT_DATE - INTERVAL '1 day'

reconciliation:
  sources:
    - name: on_prem
      dataset: on_prem_data_source/db/schema/dataset
      filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
    - name: staging
      dataset: staging_data_source/db/schema/dataset
  checks:
    - row_count_diff:
        source: on_prem
        threshold:
          must_be_less_than: 1
    - row_count_diff:
        source: staging
    - aggregate_diff:
        source: on_prem
        function: avg
        column: employee_key
        filter: employee_key < 100
    - duplicate_diff:
        source: on_prem
        columns: [employee_key]
        threshold:
          must_be_less_than: 1
          metric: percent
    - freshness_diff:
        source: staging
        column: hire_date
        threshold:
          must_be: 0
          unit: hour
    - metric_diff:
        source: on_prem
        source_expression: SUM(employee_key + parent_employee_key)
        target_expression: SUM(employee_key + parent_employee_key)
        threshold:
          must_be_less_than: 100
          metric: percent
    - rows_diff:
        source: on_prem
        source_key_columns: [employee_key]
        target_key_columns: [employee_key]
        source_columns: [price, order_date]
        target_columns: [price, order_date]
        threshold:
          must_be: 0
```

## Source configuration

Each entry in the `sources:` list configures a source dataset to compare against the target.

<table><thead><tr><th width="160.199951171875">Key</th><th width="159.7999267578125">Required</th><th>Description</th></tr></thead><tbody><tr><td><code>name</code></td><td>Yes</td><td>Identifier for this source. Used in the <code>source:</code> field on checks. Must be unique across sources.</td></tr><tr><td><code>dataset</code></td><td>Yes</td><td>DQN path to the source dataset (e.g., <code>data_source/db/schema/table</code>). Must be unique across sources.</td></tr><tr><td><code>filter</code></td><td>No</td><td>SQL filter expression applied only to this source dataset.</td></tr><tr><td><code>default</code></td><td>No</td><td>Set to <code>true</code> to mark this as the unnamed default source. Used only when <a href="#migrating-from-single-source-to-multi-source">migrating from single-source</a>. Cannot be combined with <code>name</code>.</td></tr></tbody></table>
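
Putting those keys together, a `sources:` block might look like this (the names and DQN paths are illustrative):

```yaml
reconciliation:
  sources:
    - name: on_prem                                   # referenced by checks via `source: on_prem`
      dataset: on_prem_data_source/db/schema/dataset  # DQN path to the source dataset
      filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
    - name: staging
      dataset: staging_data_source/db/schema/dataset
```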

### Targeting checks to a source

Every check should specify which source it validates against using the `source:` field:

```yaml
checks:
  - row_count_diff:
      source: staging        # runs against the "staging" source
  - aggregate_diff:
      source: warehouse      # runs against the "warehouse" source
      function: sum
      column: amount
```

If a [default (unnamed) source](#migrating-from-single-source-to-multi-source) exists, checks without `source:` run against it. If no default source exists, every check **must** have an explicit `source:`.

### Validation rules

* All sources must have a `name` (unless using `default: true` for migration)
* `default: true` cannot be combined with `name`
* Source names must be unique
* Datasets must be unique across sources
* If no default source exists, every check must specify `source:`
* Source names are trimmed of whitespace; empty or whitespace-only names are rejected

***

## Row count diff

Compares the row count of the source and target datasets.

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - row_count_diff:
        source: staging
        threshold:
          must_be_less_than: 1
```

**Configuration keys**

<table><thead><tr><th width="159.60003662109375">Key</th><th width="160.20001220703125">Optional</th><th>Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br><br>By default, threshold = 0</p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a>, applied to both source and target.</td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>

***

## Aggregate diff

Compares the result of an aggregate function on a column between source and target.

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - aggregate_diff:
        source: staging
        function: avg
        column: employee_key
        filter: employee_key < 100
        threshold:
          must_be_less_than: 0.5
```

**Configuration keys**

<table><thead><tr><th width="160.20001220703125">Key</th><th width="160.20001220703125">Optional</th><th width="446.20001220703125">Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>function</code></td><td>No</td><td>Aggregate function (<code>avg</code>, <code>sum</code>, <code>min</code>, <code>max</code>, <code>avg_length</code>, etc.)</td></tr><tr><td><code>column</code></td><td>Yes</td><td>Column to aggregate</td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br><br>By default, threshold = 0</p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><p><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a></p><p>Filter applied to both source and target</p></td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>

***

## Duplicate diff

Compares the number or percentage of duplicate rows based on one or more columns.

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - duplicate_diff:
        source: staging
        columns: [employee_key]
        threshold:
          must_be_less_than: 1
          metric: percent
```

**Configuration keys**

<table><thead><tr><th width="160.20001220703125">Key</th><th width="160.20001220703125">Optional</th><th>Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>columns</code></td><td>No</td><td>List of column(s) to evaluate duplicates on</td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br><br>By default, threshold = 0<br><br>Supports comparison by both <code>metric: count</code> and <code>metric: percent</code></p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><p><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a></p><p>Filter applied to both source and target</p></td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>
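
The example above uses a percentage threshold. Since both `metric: count` and `metric: percent` are supported, a count-based variant looks like this (the threshold value is illustrative):

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - duplicate_diff:
        source: staging
        columns: [employee_key]
        threshold:
          must_be_less_than: 10   # absolute difference in duplicate-row count must stay below 10
          metric: count
```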

***

## Freshness diff

Compares freshness (recency of the latest timestamp) between source and target.

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - freshness_diff:
        source: staging
        column: hire_date
        threshold:
          must_be_less_than: 1
          unit: hour
```

**Configuration keys**

<table><thead><tr><th width="160.20001220703125">Key</th><th width="160.20001220703125">Optional</th><th>Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>column</code></td><td>Yes</td><td>Timestamp column used to measure freshness</td></tr><tr><td><code>unit</code></td><td>Yes</td><td>Unit of time (<code>hour</code>, <code>minute</code>, <code>day</code>)</td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br><br>By default, threshold = 0</p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><p><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a></p><p>Filter applied to both source and target</p></td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>

***

## Metric diff

Compares results of custom SQL expressions or queries across source and target.

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - metric_diff:
        source: staging
        source_expression: SUM(employee_key + parent_employee_key)
        target_expression: SUM(employee_key + parent_employee_key)
        threshold:
          must_be_less_than: 100
```

**Configuration keys**

<table><thead><tr><th width="185.79998779296875">Key</th><th width="159.4000244140625">Optional</th><th>Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>source_expression</code></td><td>No*</td><td>SQL expression for source</td></tr><tr><td><code>target_expression</code></td><td>No*</td><td>SQL expression for target</td></tr><tr><td><code>source_query</code></td><td>No*</td><td>Full SQL query for source metric</td></tr><tr><td><code>target_query</code></td><td>No*</td><td>Full SQL query for target metric</td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br><br>By default, threshold = 0</p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><p><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a></p><p>Filter applied to both source and target</p></td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>

\* Either expression or query must be defined.
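
The expression form covers single aggregates; per the configuration table, you can instead supply full queries with `source_query` and `target_query`. A sketch, with illustrative table names and SQL:

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - metric_diff:
        source: staging
        source_query: SELECT SUM(salary) FROM dim_employee_copy WHERE status = 'active'
        target_query: SELECT SUM(salary) FROM dim_employee WHERE status = 'active'
        threshold:
          must_be_less_than: 100
```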

***

## Rows diff

Compares rows between source and target based on keys, and checks specified columns for differences.

{% hint style="warning" %}
Rows diff reconciliation checks are **not supported in Synapse**.
{% endhint %}

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - rows_diff:
        source: staging
        source_key_columns: [employee_key]
        target_key_columns: [employee_key]
        source_columns: [price, order_date]
        target_columns: [price, order_date]
        threshold:
          must_be: 0
          metric: percent
```

**Configuration keys**

<table><thead><tr><th width="185.20001220703125">Key</th><th width="159.4000244140625">Optional</th><th>Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>source_key_columns</code></td><td>No</td><td>Key column(s) to align rows in source dataset</td></tr><tr><td><code>target_key_columns</code></td><td>No</td><td>Key column(s) to align rows in target dataset</td></tr><tr><td><code>source_columns</code></td><td>Yes</td><td>Columns to compare in the source dataset. If omitted, all columns are compared based on column order.<br><br>The number of defined source columns must match the number of defined target columns.</td></tr><tr><td><code>target_columns</code></td><td>Yes</td><td>Columns to compare in the target dataset. If omitted, all columns are compared based on column order.<br><br>The number of defined target columns must match the number of defined source columns.</td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br></p><p><strong>Thresholds</strong> can be defined in two ways:</p><ul><li>As the <strong>count of differing rows</strong> between source and target.</li><li>As the <strong>percentage of differing rows</strong>, relative to the number of tested rows in the <strong>source dataset</strong>.</li></ul><p>By default, threshold = 0</p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><p><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a></p><p>Filter applied to both source and target</p></td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>

### Limitations

* It is **not possible** to perform rows diff reconciliation with **datasets located in the same Snowflake data source**.\
  If your data source is Snowflake, and you wish to reconcile datasets within Snowflake, one of the datasets to reconcile must be:
  * in a different Snowflake instance, **or**
  * in a different data source type.

***

## Migrating from single-source to multi-source

If you have an existing contract that uses the single-source `source:` syntax and want to add additional sources, you can migrate without losing check history in Soda Cloud.

Check identity in Soda Cloud is derived from the check's path. When you move from single-source to multi-source, checks that target the original source need to preserve their path so that Soda Cloud treats them as the same check and retains their historical results.

### How to migrate

Use the `default: true` flag on a source **without a name**. This unnamed default source preserves the original check path format, so Soda Cloud continues to recognize those checks as the same ones from before the migration.

**Before** (single source):

```yaml
reconciliation:
  source:
    dataset: on_prem_data_source/db/schema/dataset
    filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
  checks:
    - row_count_diff:
    - aggregate_diff:
        function: avg
        column: employee_key
```

**After** (multi-source, preserving history):

```yaml
reconciliation:
  sources:
    - dataset: on_prem_data_source/db/schema/dataset
      filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
      default: true
    - name: warehouse
      dataset: warehouse_data_source/db/schema/dataset
  checks:
    - row_count_diff:                    # runs against default source, same check identity
    - aggregate_diff:                    # runs against default source, same check identity
        function: avg
        column: employee_key
    - row_count_diff:
        source: warehouse                # new check against warehouse
```

### Key points

* The original source becomes an entry with `default: true` and no `name`. This preserves check paths and therefore history.
* Checks without a `source:` field continue to run against the default source.
* New sources get a `name`, and checks targeting them use `source: <name>`.
* `default: true` **cannot** be combined with `name`; the unnamed default is exclusively a migration mechanism.
* Only one source can be `default: true`.

***

<details>

<summary><strong>Legacy single-source syntax</strong></summary>

The original single-source syntax uses `source:` (singular) instead of `sources:` (plural). This syntax is still supported and is equivalent to defining a single unnamed default source.

```yaml
reconciliation:
  source:
    dataset: on_prem_data_source/db/schema/dataset
    filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
  checks:
    - row_count_diff:
        threshold:
          must_be_less_than: 1
    - aggregate_diff:
        function: avg
        column: employee_key
```

With single-source syntax, checks do not need a `source:` field since there is only one source to target.

{% hint style="info" %}
If you are writing a new contract, use the `sources:` (plural) syntax instead. See [Migrating from single-source to multi-source](#migrating-from-single-source-to-multi-source) if you want to add additional sources to an existing single-source contract.
{% endhint %}

</details>

***

## Reconciliation checks with in-memory data sources (DuckDB, Spark)

Reconciliation checks can compare datasets that live in **different execution contexts**, such as:

* a dataset in a database (for example, PostgreSQL, Snowflake), and
* a dataset in an **in-memory data source** (for example, [DuckDB](https://docs.soda.io/reference/data-source-reference-for-soda-core/duckdb) in-memory or a [Spark Dataframe](https://docs.soda.io/reference/data-source-reference-for-soda-core/spark-dataframe)).

***

### Compare DuckDB datasets

To compare DuckDB datasets, you must pass **two different values** in `data_sources`:

1. The in-memory data source
2. The database data source
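
A minimal sketch of that setup, assuming two existing DuckDB cursors and reusing the `DuckDBDataSourceImpl.from_existing_cursor` and `verify_contract_locally` calls shown in the PostgreSQL example below; the data source names, cursor variables, and file path are illustrative:

```python
# Both DuckDB data sources are passed directly in `data_sources`;
# each `name` must match the data source part of the DQNs in the contract.
result = verify_contract_locally(
    data_sources=[
        DuckDBDataSourceImpl.from_existing_cursor(source_cursor, name="duckdb_source"),
        DuckDBDataSourceImpl.from_existing_cursor(target_cursor, name="duckdb_target"),
    ],
    contract_file_path="reconciliation.contract.yaml",
)
```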

### Compare DuckDB to another data source

To compare a DuckDB dataset to a different data source (for example, PostgreSQL), pass:

1. The in-memory data source, directly as a `data_sources` object
2. The database data source, via a `data_source_file_paths` configuration file

#### Example: DuckDB (in-memory) vs PostgreSQL

The example below shows how to reconcile a DuckDB in-memory dataset with a PostgreSQL dataset using `verify_contract_locally`:

```python
result = verify_contract_locally(
    data_sources=[DuckDBDataSourceImpl.from_existing_cursor(cursor, name="duckdb")],
    data_source_file_paths=["../postgres_supabase.yaml"],
    contract_file_path="🐼.contract.yaml",
)
```

### Spark DataFrames

The same pattern applies when reconciling against **Spark DataFrames**:

* The Spark DataFrame is passed as an in-memory data source
* The database connection is provided via a YAML file in `data_source_file_paths`


***

{% if (visitor.claims.plan === 'datasetStandard')%}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Free license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if (visitor.claims.plan === 'enterprise')%}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Team license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if (visitor.claims.plan === 'enterpriseUserBased')%}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Enterprise license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if !(visitor.claims.plan === 'enterprise' || visitor.claims.plan === 'enterpriseUserBased' || visitor.claims.plan === 'datasetStandard')%}
{% hint style="info" %}
You are **not logged in to Soda** and are viewing the default public documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

