Reconciliation checks
This page documents the contract language syntax for reconciliation.
Only available with Soda Agent.
Reconciliation checks validate that a target dataset matches a source dataset, ensuring that data remains consistent after migrations, in pipelines, or during synchronizations. They can be used for both metric-level (aggregate) validation and row-level (record-by-record) validation.
Learn more about the use cases and performance considerations: Data reconciliation
Prerequisites
To use reconciliation checks, you must either:
Run your contract with Soda Agent, which has reconciliation support built in, or
Install the Soda Reconciliation extension locally:
pip install -i https://pypi.dev.sodadata.io/simple -U soda-reconciliationFollow the CLI reference #Private PyPI installation flow to set up your environment and install the necessary Soda extensions.
Example
Structure
dataset:defines the target dataset.reconciliation.source.dataset:defines the source dataset to compare against.reconciliation.source.filter:is an optional filter applied only to the source dataset.Each reconciliation
check:supports an additional check-level filter applied consistently to both source and target, layered on top of dataset-level filters.Thresholds define acceptable differences.
All Common Check Configurations (filters, thresholds, names, qualifiers, attributes) apply to reconciliation checks. Contract Language reference #Common check configurations
dataset: cloud_data_source/db/schema/dataset
filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
reconciliation:
source:
dataset: on_prem_data_source/db/schema/dataset
filter: created_at >= CURRENT_DATE - INTERVAL '1 day'
checks:
- row_count_diff:
threshold:
must_be_less_than: 1
- aggregate_diff:
function: avg
column: employee_key
filter: employee_key < 100
- duplicate_diff:
columns: [employee_key]
threshold:
must_be_less_than: 1
metric: percent
- freshness_diff:
column: hire_date
threshold:
must_be: 0
unit: hour
- metric_diff:
source_expression: SUM(employee_key + parent_employee_key)
target_expression: SUM(employee_key + parent_employee_key)
threshold:
must_be_less_than: 100
metric: percent
- rows_diff:
source_key_columns: [employee_key]
target_key_columns: [employee_key]
source_columns: [price, order_date]
target_columns: [price, order_date]
threshold:
must_be: 0Row count diff
Compares the row count of the source and target datasets.
Example
reconciliation:
source:
dataset: contracts-source/postgres/public/dim_employee_copy
checks:
- row_count_diff:
threshold:
must_be_less_than: 1Configuration keys
threshold
Contract Language reference #Thresholds
Acceptable difference between source and target. By default, threshold = 0
Yes
Aggregate diff
Compares the result of an aggregate function on a column between source and target.
Example
reconciliation:
source:
dataset: contracts-source/postgres/public/dim_employee_copy
checks:
- aggregate_diff:
function: avg
column: employee_key
filter: employee_key < 100
threshold:
must_be_less_than: 0.5Configuration keys
function
Aggregate function (avg, sum, min, max, avg_length, etc.)
No
column
Column to aggregate
Yes
threshold
Contract Language reference #Thresholds
Acceptable difference between source and target. By default, threshold = 0
Yes
Duplicate diff
Compares the number or percentage of duplicate rows based on one or more columns.
Example
reconciliation:
source:
dataset: contracts-source/postgres/public/dim_employee_copy
checks:
- duplicate_diff:
columns: [employee_key]
threshold:
must_be_less_than: 1
metric: percentConfiguration keys
columns
List of column(s) to evaluate duplicates on
No
threshold
Contract Language reference #Thresholds
Acceptable difference between source and target. By default, threshold = 0
Yes
filter
Reconciliation checks. Filter applied to both source and target.
Support both comparison of metric:percent and metric:count
Yes
Freshness diff
Compares freshness (recency of the latest timestamp) between source and target.
Example
reconciliation:
source:
dataset: contracts-source/postgres/public/dim_employee_copy
checks:
- freshness_diff:
column: hire_date
threshold:
must_be_less_than: 1
unit: hourConfiguration keys
column
Timestamp column used to measure freshness
Yes
unit
Unit of time (hour, minute, day)
Yes
threshold
Contract Language reference #Thresholds
Acceptable difference between source and target. By default, threshold = 0
Yes
Metric diff
Compares results of custom SQL expressions or queries across source and target.
Example
reconciliation:
source:
dataset: contracts-source/postgres/public/dim_employee_copy
checks:
- metric_diff:
source_expression: SUM(employee_key + parent_employee_key)
target_expression: SUM(employee_key + parent_employee_key)
threshold:
must_be_less_than: 100Configuration keys
source_expression
SQL expression for source
No*
target_expression
SQL expression for target
No*
source_query
Full SQL query for source metric
No*
target_query
Full SQL query for target metric
No*
threshold
Contract Language reference #Thresholds
Acceptable difference between source and target. By default, threshold = 0
Yes
* Either expression or query must be defined.
Rows diff
Compares rows between source and target based on keys, and checks specified columns for differences.
Example
reconciliation:
source:
dataset: contracts-source/postgres/public/dim_employee_copy
checks:
- rows_diff:
source_key_columns: [employee_key]
target_key_columns: [employee_key]
source_columns: [price, order_date]
target_columns: [price, order_date]
threshold:
must_be: 0
metric: percentConfiguration keys
source_key_columns
Key column(s) to align rows in source dataset
No
target_key_columns
Key column(s) to align rows in target dataset
No
source_columns
Columns to compare in the target dataset. If omitted, all columns are compared based on column order. The number of defined source columns must match the number of defined target columns
Yes
target_columns
Columns to compare in the target dataset. If omitted, all columns are compared based on column order. The number of defined target columns must match the number of defined source columns.
Yes
threshold
Contract Language reference #Thresholds
Acceptable difference between source and target.
Thresholds can be defined in two ways:
As the count of differing rows between source and target.
As the percentage of differing rows, relative to the number of tested rows in the source dataset.
By default, threshold = 0
Yes
Last updated
Was this helpful?
