# Reconciliation checks

{% hint style="warning" %}
Reconciliation checks require the **`soda-reconciliation`** package, installed from the [**private PyPI**](#private-pypi-installation-flow) under an **Enterprise license**, unless you run your contracts on a Soda Agent.

Need access to the private PyPI? Please [contact us](mailto:support@soda.io).
{% endhint %}

Reconciliation checks validate that a **target dataset** matches one or more **source datasets**, ensuring that data remains consistent after migrations, within pipelines, or during synchronization. They support both **metric-level (aggregate) validation** and **row-level (record-by-record) validation**.

> Learn more about the use cases and performance considerations in the [Data reconciliation](https://docs.soda.io/data-testing/data-reconciliation) page.

## Prerequisites

To use **reconciliation checks**, you must either:

* Run your contract with **Soda Agent**, which has reconciliation support built in, **or**
* Install the **Soda Reconciliation extension** locally:

{% code overflow="wrap" %}

```shellscript
pip install "soda-reconciliation" --pre -i "https://${SODA_API_KEY_ID}:${SODA_API_KEY_SECRET}@enterprise.pypi.cloud.soda.io" --extra-index-url=https://pypi.cloud.soda.io
```

{% endcode %}

> **Follow the** [**private PyPI installation flow**](https://docs.soda.io/deployment-options/soda-python-libraries#private-pypi-installation-flow) to set up your environment and install the necessary Soda extensions.

## Example

#### Structure

* `dataset:` defines the **target dataset**.
* `reconciliation.sources:` defines the **source datasets** to compare against. Each source has a `name` and a `dataset`.
* `reconciliation.sources[].filter:` is an optional filter applied only to that source dataset.
* Each reconciliation `check` specifies which source it targets via the `source:` field.
* Each reconciliation `check` supports an additional **check-level filter** applied consistently to both source and target, layered on top of dataset-level filters.
* Thresholds define acceptable differences.

> All Common Check Configurations (filters, thresholds, names, qualifiers, attributes) apply to reconciliation checks. Learn more about [Common Check Configurations](https://docs.soda.io/reference/contract-language-reference/..#common-check-configurations).

```yaml
dataset: cloud_data_source/db/schema/dataset
filter: created_at >= CURRENT_DATE - INTERVAL '1 day'

reconciliation:
  sources:
    - name: on_prem
      dataset: on_prem_data_source/db/schema/dataset
      filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
    - name: staging
      dataset: staging_data_source/db/schema/dataset
  checks:
    - row_count_diff:
        source: on_prem
        threshold:
          must_be_less_than: 1
    - row_count_diff:
        source: staging
    - aggregate_diff:
        source: on_prem
        function: avg
        column: employee_key
        filter: employee_key < 100
    - duplicate_diff:
        source: on_prem
        columns: [employee_key]
        threshold:
          must_be_less_than: 1
          metric: percent
    - freshness_diff:
        source: staging
        column: hire_date
        threshold:
          must_be: 0
          unit: hour
    - metric_diff:
        source: on_prem
        source_expression: SUM(employee_key + parent_employee_key)
        target_expression: SUM(employee_key + parent_employee_key)
        threshold:
          must_be_less_than: 100
          metric: percent
    - rows_diff:
        source: on_prem
        source_key_columns: [employee_key]
        target_key_columns: [employee_key]
        source_columns: [price, order_date]
        target_columns: [price, order_date]
        threshold:
          must_be: 0
```

## Source configuration

Each entry in the `sources:` list configures a source dataset to compare against the target.

<table><thead><tr><th width="160.199951171875">Key</th><th width="159.7999267578125">Required</th><th>Description</th></tr></thead><tbody><tr><td><code>name</code></td><td>Yes</td><td>Identifier for this source. Used in the <code>source:</code> field on checks. Must be unique across sources.</td></tr><tr><td><code>dataset</code></td><td>Yes</td><td>DQN path to the source dataset (e.g., <code>data_source/db/schema/table</code>). Must be unique across sources.</td></tr><tr><td><code>filter</code></td><td>No</td><td>SQL filter expression applied only to this source dataset.</td></tr><tr><td><code>default</code></td><td>No</td><td>Set to <code>true</code> to mark this as the unnamed default source. Used only when <a href="#migrating-from-single-source-to-multi-source">migrating from single-source</a>. Cannot be combined with <code>name</code>.</td></tr></tbody></table>
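
Putting those keys together, a `sources:` block might look like this (the names and DQN paths are illustrative):

```yaml
reconciliation:
  sources:
    - name: on_prem                                   # referenced by checks via `source: on_prem`
      dataset: on_prem_data_source/db/schema/dataset  # DQN path to the source dataset
      filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
    - name: staging
      dataset: staging_data_source/db/schema/dataset
```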

### Targeting checks to a source

Every check should specify which source it validates against using the `source:` field:

```yaml
checks:
  - row_count_diff:
      source: staging        # runs against the "staging" source
  - aggregate_diff:
      source: warehouse      # runs against the "warehouse" source
      function: sum
      column: amount
```

If a [default (unnamed) source](#migrating-from-single-source-to-multi-source) exists, checks without `source:` run against it. If no default source exists, every check **must** have an explicit `source:`.

### Validation rules

* All sources must have a `name` (unless using `default: true` for migration)
* `default: true` cannot be combined with `name`
* Source names must be unique
* Datasets must be unique across sources
* If no default source exists, every check must specify `source:`
* Source names are trimmed of whitespace; empty or whitespace-only names are rejected

***

## Row count diff

Compares the row count of the source and target datasets.

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - row_count_diff:
        source: staging
        threshold:
          must_be_less_than: 1
```

**Configuration keys**

<table><thead><tr><th width="159.60003662109375">Key</th><th width="160.20001220703125">Optional</th><th>Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br><br>By default, threshold = 0</p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a>, applied to both source and target.</td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>

***

## Aggregate diff

Compares the result of an aggregate function on a column between source and target.

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - aggregate_diff:
        source: staging
        function: avg
        column: employee_key
        filter: employee_key < 100
        threshold:
          must_be_less_than: 0.5
```

**Configuration keys**

<table><thead><tr><th width="160.20001220703125">Key</th><th width="160.20001220703125">Optional</th><th width="446.20001220703125">Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>function</code></td><td>No</td><td>Aggregate function (<code>avg</code>, <code>sum</code>, <code>min</code>, <code>max</code>, <code>avg_length</code>, etc.)</td></tr><tr><td><code>column</code></td><td>Yes</td><td>Column to aggregate</td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br><br>By default, threshold = 0</p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><p><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a></p><p>Filter applied to both source and target</p></td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>

***

## Duplicate diff

Compares the number or percentage of duplicate rows based on one or more columns.

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - duplicate_diff:
        source: staging
        columns: [employee_key]
        threshold:
          must_be_less_than: 1
          metric: percent
```

**Configuration keys**

<table><thead><tr><th width="160.20001220703125">Key</th><th width="160.20001220703125">Optional</th><th>Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>columns</code></td><td>No</td><td>List of column(s) to evaluate duplicates on</td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br><br>By default, threshold = 0<br><br>Supports comparison by both <code>metric: count</code> and <code>metric: percent</code></p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><p><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a></p><p>Filter applied to both source and target</p></td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>
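
The example above uses a percentage threshold. Since both `metric: count` and `metric: percent` are supported, a count-based variant looks like this (the threshold value is illustrative):

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - duplicate_diff:
        source: staging
        columns: [employee_key]
        threshold:
          must_be_less_than: 10   # absolute difference in duplicate-row count must stay below 10
          metric: count
```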

***

## Freshness diff

Compares freshness (recency of the latest timestamp) between source and target.

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - freshness_diff:
        source: staging
        column: hire_date
        threshold:
          must_be_less_than: 1
          unit: hour
```

**Configuration keys**

<table><thead><tr><th width="160.20001220703125">Key</th><th width="160.20001220703125">Optional</th><th>Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>column</code></td><td>Yes</td><td>Timestamp column used to measure freshness</td></tr><tr><td><code>unit</code></td><td>Yes</td><td>Unit of time (<code>hour</code>, <code>minute</code>, <code>day</code>)</td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br><br>By default, threshold = 0</p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><p><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a></p><p>Filter applied to both source and target</p></td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>

***

## Metric diff

Compares results of custom SQL expressions or queries across source and target.

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - metric_diff:
        source: staging
        source_expression: SUM(employee_key + parent_employee_key)
        target_expression: SUM(employee_key + parent_employee_key)
        threshold:
          must_be_less_than: 100
```

**Configuration keys**

<table><thead><tr><th width="185.79998779296875">Key</th><th width="159.4000244140625">Optional</th><th>Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>source_expression</code></td><td>No*</td><td>SQL expression for source</td></tr><tr><td><code>target_expression</code></td><td>No*</td><td>SQL expression for target</td></tr><tr><td><code>source_query</code></td><td>No*</td><td>Full SQL query for source metric</td></tr><tr><td><code>target_query</code></td><td>No*</td><td>Full SQL query for target metric</td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br><br>By default, threshold = 0</p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><p><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a></p><p>Filter applied to both source and target</p></td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>

\* Either expression or query must be defined.
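
The expression form covers single aggregates; per the configuration table, you can instead supply full queries with `source_query` and `target_query`. A sketch, with illustrative table names and SQL:

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - metric_diff:
        source: staging
        source_query: SELECT SUM(salary) FROM dim_employee_copy WHERE status = 'active'
        target_query: SELECT SUM(salary) FROM dim_employee WHERE status = 'active'
        threshold:
          must_be_less_than: 100
```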

***

## Rows diff

Compares rows between source and target based on keys, and checks specified columns for differences.

{% hint style="warning" %}
Rows diff reconciliation checks are **not supported in Synapse**.
{% endhint %}

**Example**

```yaml
reconciliation:
  sources:
    - name: staging
      dataset: contracts-source/postgres/public/dim_employee_copy
  checks:
    - rows_diff:
        source: staging
        source_key_columns: [employee_key]
        target_key_columns: [employee_key]
        source_columns: [price, order_date]
        target_columns: [price, order_date]
        threshold:
          must_be: 0
          metric: percent
```

**Configuration keys**

<table><thead><tr><th width="185.20001220703125">Key</th><th width="159.4000244140625">Optional</th><th>Description</th></tr></thead><tbody><tr><td><code>source</code></td><td>No</td><td>Name of the source to validate against</td></tr><tr><td><code>source_key_columns</code></td><td>No</td><td>Key column(s) to align rows in source dataset</td></tr><tr><td><code>target_key_columns</code></td><td>No</td><td>Key column(s) to align rows in target dataset</td></tr><tr><td><code>source_columns</code></td><td>Yes</td><td>Columns to compare in the source dataset. If omitted, all columns are compared based on column order.<br><br>The number of defined source columns must match the number of defined target columns.</td></tr><tr><td><code>target_columns</code></td><td>Yes</td><td>Columns to compare in the target dataset. If omitted, all columns are compared based on column order.<br><br>The number of defined target columns must match the number of defined source columns.</td></tr><tr><td><code>threshold</code></td><td>Yes</td><td><p><a href="..#thresholds">Thresholds</a></p><p>Acceptable difference between source and target.<br></p><p><strong>Thresholds</strong> can be defined in two ways:</p><ul><li>As the <strong>count of differing rows</strong> between source and target.</li><li>As the <strong>percentage of differing rows</strong>, relative to the number of tested rows in the <strong>source dataset</strong>.</li></ul><p>By default, threshold = 0</p></td></tr><tr><td><code>filter</code></td><td>Yes</td><td><p><a href="../../../data-testing/data-reconciliation#check-level-filter">Configure a check filter</a></p><p>Filter applied to both source and target</p></td></tr><tr><td><code>name</code></td><td>Yes</td><td><a href="..#check-names">Check names</a></td></tr><tr><td><code>qualifier</code></td><td>Yes</td><td><a href="..#check-qualifiers">Check qualifiers</a></td></tr><tr><td><code>attributes</code></td><td>Yes</td><td><a href="..#check-attributes">Check attributes</a></td></tr></tbody></table>

### Limitations

* It is **not possible** to perform rows diff reconciliation with **datasets located in the same Snowflake data source**.\
  If your data source is Snowflake, and you wish to reconcile datasets within Snowflake, one of the datasets to reconcile must be:
  * in a different Snowflake instance, **or**
  * in a different data source type.

***

## Migrating from single-source to multi-source

If you have an existing contract that uses the single-source `source:` syntax and want to add additional sources, you can migrate without losing check history in Soda Cloud.

Check identity in Soda Cloud is derived from the check's path. When you move from single-source to multi-source, checks that target the original source need to preserve their path so that Soda Cloud treats them as the same check and retains their historical results.

### How to migrate

Use the `default: true` flag on a source **without a name**. This unnamed default source preserves the original check path format, so Soda Cloud continues to recognize those checks as the same ones from before the migration.

**Before** (single source):

```yaml
reconciliation:
  source:
    dataset: on_prem_data_source/db/schema/dataset
    filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
  checks:
    - row_count_diff:
    - aggregate_diff:
        function: avg
        column: employee_key
```

**After** (multi-source, preserving history):

```yaml
reconciliation:
  sources:
    - dataset: on_prem_data_source/db/schema/dataset
      filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
      default: true
    - name: warehouse
      dataset: warehouse_data_source/db/schema/dataset
  checks:
    - row_count_diff:                    # runs against default source, same check identity
    - aggregate_diff:                    # runs against default source, same check identity
        function: avg
        column: employee_key
    - row_count_diff:
        source: warehouse                # new check against warehouse
```

### Key points

* The original source becomes an entry with `default: true` and no `name`. This preserves check paths and therefore history.
* Checks without a `source:` field continue to run against the default source.
* New sources get a `name`, and checks targeting them use `source: <name>`.
* `default: true` **cannot** be combined with `name`; the unnamed default is exclusively a migration mechanism.
* Only one source can be `default: true`.

***

<details>

<summary><strong>Legacy single-source syntax</strong></summary>

The original single-source syntax uses `source:` (singular) instead of `sources:` (plural). This syntax is still supported and is equivalent to defining a single unnamed default source.

```yaml
reconciliation:
  source:
    dataset: on_prem_data_source/db/schema/dataset
    filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
  checks:
    - row_count_diff:
        threshold:
          must_be_less_than: 1
    - aggregate_diff:
        function: avg
        column: employee_key
```

With single-source syntax, checks do not need a `source:` field since there is only one source to target.

{% hint style="info" %}
If you are writing a new contract, use the `sources:` (plural) syntax instead. See [Migrating from single-source to multi-source](#migrating-from-single-source-to-multi-source) if you want to add additional sources to an existing single-source contract.
{% endhint %}

</details>

***

## Reconciliation checks with in-memory data sources (DuckDB, Spark)

Reconciliation checks can compare datasets that live in **different execution contexts**, such as:

* a dataset in a database (for example, PostgreSQL, Snowflake), and
* a dataset in an **in-memory data source** (for example, [DuckDB](https://docs.soda.io/reference/data-source-reference-for-soda-core/duckdb) in-memory or a [Spark Dataframe](https://docs.soda.io/reference/data-source-reference-for-soda-core/spark-dataframe)).

***

### Compare DuckDB datasets

To compare DuckDB datasets, you must pass **two different values** in `data_sources`:

1. The in-memory data source
2. The database data source
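
A minimal sketch of that setup, assuming two existing DuckDB cursors and reusing the `DuckDBDataSourceImpl.from_existing_cursor` and `verify_contract_locally` calls shown in the PostgreSQL example below; the data source names, cursor variables, and file path are illustrative:

```python
# Both DuckDB data sources are passed directly in `data_sources`;
# each `name` must match the data source part of the DQNs in the contract.
result = verify_contract_locally(
    data_sources=[
        DuckDBDataSourceImpl.from_existing_cursor(source_cursor, name="duckdb_source"),
        DuckDBDataSourceImpl.from_existing_cursor(target_cursor, name="duckdb_target"),
    ],
    contract_file_path="reconciliation.contract.yaml",
)
```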

### Compare DuckDB to another data source

To compare a DuckDB dataset to a different data source (for example, PostgreSQL), pass:

1. The in-memory data source, directly as a `data_sources` object
2. The database data source, via a `data_source_file_paths` configuration file

#### Example: DuckDB (in-memory) vs PostgreSQL

The example below shows how to reconcile a DuckDB in-memory dataset with a PostgreSQL dataset using `verify_contract_locally`:

```python
result = verify_contract_locally(
    data_sources=[DuckDBDataSourceImpl.from_existing_cursor(cursor, name="duckdb")],
    data_source_file_paths=["../postgres_supabase.yaml"],
    contract_file_path="🐼.contract.yaml",
)
```

### Spark DataFrames

The same pattern applies when reconciling against **Spark DataFrames**:

* The Spark DataFrame is passed as an in-memory data source
* The database connection is provided via a YAML file in `data_source_file_paths`


***

{% if (visitor.claims.plan === 'datasetStandard')%}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Free license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if (visitor.claims.plan === 'enterprise')%}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Team license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if (visitor.claims.plan === 'enterpriseUserBased')%}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Enterprise license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if !(visitor.claims.plan === 'enterprise' || visitor.claims.plan === 'enterpriseUserBased' || visitor.claims.plan === 'datasetStandard')%}
{% hint style="info" %}
You are **not logged in to Soda** and are viewing the default public documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

