Compare data using SodaCL

Learn how to use the Soda Checks Language to compare data across datasets in the same, or different, data sources.

There are several ways to use SodaCL metrics and checks to compare data across datasets and data sources. The following offers some advice about how and when to use different types of checks to obtain the comparison results you need.

See also: Reconciliation checks

Have you got an idea or example of how to compare data that we haven't documented here? Let us know!


Compare data in the same data source and schema

Use a cross check to conduct a row count comparison between datasets in the same data source. If you wish to compare datasets in different data sources, or datasets in the same data source but with different schemas, see Compare data in different data sources or schemas.

checks for dim_employee:
  - row_count same as dim_department_group

Use a reference check to conduct a row-by-row comparison of values in two datasets in the same data source and return a result that indicates the volume and samples of mismatched rows, as in the following example which ensures that the values in each of the two names columns are identical. If you wish to compare datasets in the same data source but with different schemas, see Compare data in different data sources or schemas.

checks for dim_customers_dev:
  - values in (last_name, first_name) must exist in dim_customers_prod (last_name, first_name)

Alternatively, you can use a failed rows check to customize a SQL query that compares the values of datasets.

- failed rows:
      name: Validate that the data is the same as retail customers
      fail query: |
                  with table_1_not_in_table_2 as (
                  select
                    *
                  from retail_customers
                  except
                  select
                    *
                  from retail_sfdc_customers
                  )
                  , table_2_not_in_table_1 as (
                  select
                    *
                  from retail_sfdc_customers
                  except
                  select
                    *
                  from retail_customers
                  )
                  select
                  'found in retail_sfdc_customers but missing in retail_customers' as directionality,
                  *
                  from table_1_not_in_table_2
                  union all
                  select
                  'found in retail_customers but missing in retail_sfdc_customers' as directionality,
                  *
                  from table_2_not_in_table_1

Compare partitioned data in the same data source but different schemas

If you wish to compare data between datasets in different schemas, but only compare partitioned data from each dataset, you can use dataset filters.

Note that not all data sources fully support the schema.dataset format for the dataset identifier in a check, as included in the following example. Some users have reported success using this syntax.

Output:

Compare data in different data sources or schemas

Use a cross check to conduct a simple row count comparison of datasets in two different data sources, as in the following example that compares the row counts of two datasets in different data sources. Note that each data source involved in this check has been connected to data source either in the configuration.yml file with Soda Library, or in the Add Data Source workflow in Soda Cloud.

You can use a reference check to compare the values of different datasets in the same data source (same data source, same schema), but if the datasets are in different schemas, as might happen when you have different environments like production, staging, development, etc., then Soda considers those datasets as different data sources. Where that is the case, you have a couple of options.

You can use a cross check to compare the row count of datasets in the same data source, but with different schemas. First, you must add dataset + schema as a separate data source connection in your configuration.yml, as in the following example that uses the same connection details but provides different schemas:

Then, you can define a cross check that compares values across these data sources.

Alternatively, depending on the type of data source you are using, you can use a failed rows check to write a custom SQL query that compares contents of datasets that you define by adding the schema before the dataset name, such as prod.retail_customers and staging.retail_customers.

The following example accesses a single Snowflake data source and compares values between the same datasets but in different databases and schemas: prod.staging.dmds_scores and prod.measurement.post_scores.

See also: Configure the same scan to run in multiple environments

Compare dates in a dataset to validate event sequence

You can use a user-defined metric to write a custom SQL query that compares date values in the same dataset. Refer to Custom check templates.

Go further

Need help? Join the Soda community on Slack.

Last updated

Was this helpful?