# Data reconciliation

**Reconciliation checks** are a **validation step** used to ensure that data remains consistent and accurate when moving, transforming, or syncing between different systems. The core purpose is to confirm that the **target data matches one or more source datasets,** whether that's during a one-time migration, a recurring data pipeline run, or ongoing synchronization across environments.

For instance, if you are migrating from a MySQL database to Snowflake, reconciliation checks can verify that the data transferred into Snowflake staging is intact and reliable before promoting it to production. This **minimizes the risk of data loss, duplication, or corruption during critical migrations**.

Beyond migrations, reconciliation checks are also used in **data pipelines and integrations**. They help validate that transformations applied in-flight do not compromise accuracy, and that downstream datasets remain coherent with upstream sources.

Other use cases include **regulatory compliance**, where organizations must prove that financial or operational data has been faithfully replicated across systems, and **system upgrades**, where schema changes or infrastructure shifts can introduce unexpected mismatches.

By systematically applying reconciliation checks, teams can maintain trust in their data, reduce operational risk, and streamline incident detection when anomalies arise.

## Multiple source datasets

Reconciliation supports comparing your target dataset against **multiple source datasets** simultaneously. This is useful when:

* **Multiple upstream systems** feed into a single target: for example, validating that a consolidated data warehouse matches both a staging environment and an operational database.
* **Cross-environment validation**: ensuring consistency across regional databases, different cloud providers, or separate business units that all contribute to one target.
* **Phased migrations**: during gradual cutovers, you may need to reconcile against both the old and new source systems until the migration is complete.

Each source is defined with a `name` and a `dataset`, and each check specifies which source it targets using the `source:` field. This gives you full control over which validations run against which upstream system.
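
The `sources:` structure above is part of the contract language, but the underlying logic can be sketched in plain Python: each named source is reconciled independently against the same target. This is an illustrative sketch only (not Soda syntax); the source names and row counts are hypothetical.

```python
# Illustrative sketch (not Soda contract syntax): reconciling one target
# against several named sources, e.g. during a phased migration.
target_row_count = 1_000_000

# Each entry mirrors a named source from the `sources:` list; the measured
# metric here is a simple row count (values are hypothetical).
sources = {
    "legacy_mysql": 1_000_000,
    "staging_snowflake": 999_950,
}

def reconcile_counts(target: int, sources: dict[str, int]) -> dict[str, int]:
    """Return the absolute row-count difference per named source."""
    return {name: abs(target - count) for name, count in sources.items()}

diffs = reconcile_counts(target_row_count, sources)
# → {"legacy_mysql": 0, "staging_snowflake": 50}
```

Because each source is evaluated independently, one source can pass while another fails, which is exactly what you want during a gradual cutover.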

> Learn more about the syntax in the [reconciliation checks reference](https://docs.soda.io/reference/contract-language-reference/reconciliation-checks).

## Defining source datasets

Before defining reconciliation checks, you specify one or more **source datasets** in the `sources:` list. Each source represents a system of record against which you want to validate consistency. It is possible to define a **filter** on each source dataset, allowing you to reconcile only a subset of records that match certain criteria (for example, only transactions from the current month, or only rows belonging to a specific business unit).

For the **target dataset**, the reconciliation check applies the **dataset filter defined at the top of the contract** (see [#key-concepts-in-contract-authoring](https://docs.soda.io/cloud-managed-data-contracts/author-a-contract-in-soda-cloud#key-concepts-in-contract-authoring "mention")).

Ensure that both the source and the target are constrained to the same logical scope before comparisons run; otherwise, filter mismatches surface as spurious differences and make the validation unreliable.

<div align="center"><figure><img src="https://1123167021-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FA2PmHkO5cBgeRPdiPPOG%2Fuploads%2FqnZYM9jwi8Prdg3n0Xh7%2Fimage.png?alt=media&#x26;token=43e754ab-65f8-4cf7-ab41-381589bc0f9e" alt="" width="450"><figcaption><p>In a data contract, configure a reconciliation check by defining the <strong>source dataset</strong> and applying <strong>aligned filters</strong></p></figcaption></figure></div>

## **Metric-level reconciliation**

At this level, aggregate metrics from the source and target datasets are compared. Examples include totals (e.g., revenue, number of rows), averages, or other summary statistics. This approach is efficient and provides a high-level signal that the data remains consistent. It is especially useful for large-scale migrations or pipelines where exact row-by-row comparison is not always necessary.

### Thresholds

Comparisons at the metric level are evaluated against a defined threshold, which represents the **acceptable difference between source and target**. This tolerance can be set depending on the business context. Some use cases may allow small discrepancies (e.g., rounding differences), while others require exact equality.

When reconciling integrity metrics such as missing values, duplicates, or invalid entries, you can compare either the **raw count** of affected records or the **percentage metric** (e.g., the percentage of rows with missing values in each dataset). This flexibility keeps reconciliation meaningful regardless of dataset size or distribution.
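
The threshold logic described above can be sketched in Python: a metric-level check passes when the source/target difference stays within a tolerance, expressed either absolutely or as a percentage of the source value. This is a conceptual sketch, not Soda syntax; all values are hypothetical.

```python
# Illustrative sketch: evaluating a metric-level reconciliation check
# against an absolute or percentage tolerance.

def metric_check_passes(source_value: float, target_value: float,
                        tolerance: float, as_percent: bool = False) -> bool:
    """Pass when |source - target| stays within the tolerance.

    With as_percent=True, tolerance is a percentage of the source value;
    otherwise it is an absolute difference.
    """
    diff = abs(source_value - target_value)
    if as_percent:
        if source_value == 0:
            return diff == 0
        return (diff / abs(source_value)) * 100 <= tolerance
    return diff <= tolerance

# Rounding differences allowed: 0.5 absolute tolerance on a revenue total.
metric_check_passes(10_000.00, 10_000.25, tolerance=0.5)       # passes
# Exact equality required on a row count: zero tolerance.
metric_check_passes(500_000, 499_980, tolerance=0)             # fails
```

The same shape covers both cases in the text: exact equality is simply a tolerance of zero.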

### Check-level filter

In addition to dataset-level filters, reconciliation checks support **check-level filters**, which are applied consistently to both the source and target within the scope of a specific check. These filters make it possible to validate a **subset of the data** relevant to the context of the check. The check-level filter is applied **on top of any existing source or target dataset filters**.
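
The layering described above amounts to applying every filter predicate together, identically on both sides. A minimal Python sketch (not Soda syntax; the field names `month` and `unit` are hypothetical):

```python
# Illustrative sketch: a check-level filter layered on top of a
# dataset-level filter, applied identically to source and target rows.

def apply_filters(rows, *predicates):
    """Keep only rows that satisfy every filter predicate."""
    return [r for r in rows if all(p(r) for p in predicates)]

dataset_filter = lambda r: r["month"] == "2024-06"   # dataset-level scope
check_filter = lambda r: r["unit"] == "EMEA"         # check-level scope

source_rows = [
    {"id": 1, "month": "2024-06", "unit": "EMEA"},
    {"id": 2, "month": "2024-06", "unit": "APAC"},
    {"id": 3, "month": "2024-05", "unit": "EMEA"},
]

# Both filters apply together, so this check compares only June EMEA rows.
scoped = apply_filters(source_rows, dataset_filter, check_filter)
```

Applying the same composed predicate to the target guarantees that both sides of the comparison cover the same slice of data.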

<figure><img src="https://1123167021-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FA2PmHkO5cBgeRPdiPPOG%2Fuploads%2FDbeHODWtAgAy7XKGDhvu%2Fimage.png?alt=media&#x26;token=2a43b107-2721-4f91-b3b5-6f916882c4c9" alt="" width="563"><figcaption></figcaption></figure>

## **Row-level reconciliation**

For more granular validation, reconciliation can be performed at the **row level**. This type of check surfaces detailed differences such as **missing records**, **mismatched values**, or **unexpected duplicates**. Row-level reconciliation is critical in scenarios where accuracy at the record level is non-negotiable, such as records that capture financial transactions, user data, or regulatory reporting.

This requires specifying a **primary key (or a composite key)** to uniquely identify rows between the source and the target. Once rows are aligned, you can define a **list of columns to test** for exact matches or acceptable tolerances. If no column list is provided, the check defaults to comparing **all columns in order**. This flexibility ensures that comparisons can range from broad validation across the entire dataset to focused checks on only the most critical attributes.
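
The key-based alignment described above can be sketched in Python: rows are matched on the primary key, then the listed columns are compared. This is a conceptual illustration, not Soda's implementation; the data is hypothetical.

```python
# Illustrative sketch: a row-level diff keyed on a primary key, surfacing
# missing rows, unexpected rows, and mismatched column values.

def row_diff(source, target, key, columns):
    """Return (missing_in_target, unexpected_in_target, mismatched_keys)."""
    src = {r[key]: r for r in source}
    tgt = {r[key]: r for r in target}
    missing_in_target = sorted(src.keys() - tgt.keys())
    unexpected_in_target = sorted(tgt.keys() - src.keys())
    mismatched = sorted(
        k for k in src.keys() & tgt.keys()
        if any(src[k][c] != tgt[k][c] for c in columns)
    )
    return missing_in_target, unexpected_in_target, mismatched

source = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
target = [{"id": 1, "amount": 100}, {"id": 2, "amount": 249}, {"id": 3, "amount": 10}]

diff = row_diff(source, target, key="id", columns=["amount"])
# → ([], [3], [2]): nothing missing, id 3 is unexpected, id 2 mismatches.
```

A composite key works the same way, with a tuple of column values standing in for the single key.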

{% hint style="warning" %}
[Rows diff](https://docs.soda.io/reference/contract-language-reference/reconciliation-checks#rows-diff) reconciliation checks are **not supported in Synapse**.
{% endhint %}

### Thresholds

Row-level reconciliation supports thresholds expressed either as the **count of differing rows** between source and target, or as the **percentage of differing rows relative to the source dataset row count**. These thresholds determine the acceptable level of variance before the check is considered failed, giving you fine control over sensitivity and tolerance.

This dual approach allows teams to adapt reconciliation logic to different contexts, using absolute counts when every record matters, and percentages when evaluating proportional differences in large datasets.
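
As a sketch of the pass/fail decision (not Soda syntax; the values are hypothetical), the differing-row count is compared either directly to an absolute threshold or as a percentage of the source dataset's row count:

```python
# Illustrative sketch: deciding pass/fail for a row-level check from the
# number of differing rows.

def rows_diff_passes(differing_rows: int, source_row_count: int,
                     threshold: float, as_percent: bool = False) -> bool:
    """Pass when the differing-row count (or its percentage of the source
    row count) stays within the threshold."""
    if as_percent:
        return (differing_rows / source_row_count) * 100 <= threshold
    return differing_rows <= threshold

# "Every record matters": fail as soon as any row differs.
rows_diff_passes(3, 500_000, threshold=0)                      # fails
# Large dataset: tolerate up to 0.01% of source rows differing.
rows_diff_passes(3, 500_000, threshold=0.01, as_percent=True)  # passes
```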

### Check-level filter

As with metric-level checks, you can define a **check-level filter** that is applied on top of any existing dataset filters. This allows you to reconcile only a targeted segment of data within the context of the specific check—for example, testing only a single business unit, product family, or date range.

<figure><img src="https://1123167021-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FA2PmHkO5cBgeRPdiPPOG%2Fuploads%2FFvSbXSZrelE5DZ8PaQ0m%2Fimage.png?alt=media&#x26;token=4c580abd-2af0-42b9-a845-952bc8824b7d" alt="" width="563"><figcaption></figcaption></figure>

### Performance considerations

Row-level reconciliation is inherently **heavier** than metric-level reconciliation, as it requires comparing records across potentially large datasets. To enable comparisons even when data lives in different systems, data is loaded into memory from both the source and the target, where the diff is executed. A **paginated approach** is used to maintain scalability; this ensures that memory usage remains stable, but execution time will increase as the dataset size and column count grow.
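
The pagination idea can be sketched in Python: rows are fetched in fixed-size pages, so memory is bounded by the page size rather than the dataset size, while total execution time still grows with the number of pages. The `fetch_page` callable and page size below are hypothetical stand-ins for reading from a database.

```python
# Illustrative sketch: streaming rows page by page so memory stays bounded
# by the page size, not the full dataset.

def paginated_rows(fetch_page, page_size=10_000):
    """Yield rows from a fetch_page(offset, limit) callable, one page at a time."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:   # an empty page means we are past the last row
            break
        yield from page
        offset += page_size

# Stand-in for a database query with OFFSET/LIMIT semantics.
data = list(range(25))
fetch = lambda offset, limit: data[offset:offset + limit]

rows = list(paginated_rows(fetch, page_size=10))
# → all 25 rows, fetched in three pages of at most 10
```

Doubling the dataset doubles the number of pages (and roughly the execution time), but peak memory stays proportional to the page size, which matches the benchmark behavior below.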

### Benchmarks

| Dataset Shape          | Change Rate | Memory Usage | Execution Time |
| ---------------------- | ----------- | ------------ | -------------- |
| 10 columns, 500K rows  | 1% changes  | <80MB RAM    | **9s**         |
| 360 columns, 100K rows | 1% changes  | <80MB RAM    | **1m**         |
| 360 columns, 1M rows   | 1% changes  | <80MB RAM    | **35m**        |

**Recommendations**

* **Leverage filters to scope checks to new or incremental batches of data** wherever possible, rather than repeatedly reconciling the entire dataset. This reduces both execution time and operational overhead.
* Use **metric-level reconciliation as a first line of validation**. It is significantly more efficient and scalable, and can quickly highlight whether deeper row-level analysis is even necessary.

***

## Implement reconciliation checks programmatically

Soda is suitable for no-code and programmatic users alike. If you are implementing checks programmatically, you can learn more about the **contract language syntax for reconciliation** on the [Contract Language reference](https://docs.soda.io/reference/contract-language-reference/reconciliation-checks). Reconciliation checks can be used for both metric- and row-level validation.


<br>

***

{% if visitor.claims.plan === 'free' %}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Free license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if visitor.claims.plan === 'teams' %}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Team license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if visitor.claims.plan === 'enterprise' || visitor.claims.plan === 'enterpriseUserBased' %}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Enterprise license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if !(visitor.claims.plan === 'free' || visitor.claims.plan === 'teams' || visitor.claims.plan === 'enterprise' || visitor.claims.plan === 'enterpriseUserBased') %}
{% hint style="info" %}
You are **not logged in to Soda** and are viewing the default public documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}
