Distribution checks
Use a SodaCL distribution check to monitor the consistency of a column over time.
Distribution checks will no longer be supported in Soda v4; they will deprecated and replaced by MAD.
In the short term, v3 users can use summary statistics instead.
Use a distribution check to determine whether the distribution of a column has changed between two points in time. For example, if you trained a model at a particular moment in time, you can use a distribution check to find out how much the data in the column has changed over time, or if it has changed all.
checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05:
distribution reference file: ./cars_owned_dist_ref.yml
method: chi_square
# (optional) filter to a specific point in time or any other dimension
filter: purchase_date > 2022-10-01 and purchase_date < 2022-12-01
# (optional) database specific sampling query for continuous columns. For
# example, for PostgreSQL the following query randomly samples 50% of the data
# with seed 61.
sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)✔️ Requires Soda Core Scientific (included in a Soda Agent) ✔️ Supported in Soda Core ✔️ Supported in Soda Library + Soda Cloud ✖️ Supported in Soda Cloud Agreements + Soda Agent ✖️ Available as a no-code check
About distribution checks
To detect changes in the distribution of a column between different points in time, Soda uses approaches based on hypothesis testing and based on metrics that quantify the distance between samples.
When using hypothesis testing, a distribution check allows you to determine whether enough evidence exists to conclude that the distribution of a column has changed. It returns the probability that the difference between samples taken at two points in time would have occurred if they came from the same distribution (see p-value). If this probability is smaller than a threshold that you define, the check warns you that the column's distribution has changed.
You can use the following statistical tests for hypothesis testing in your distribution checks.
Kolmogorov-Smirnov for continuous data
Chi-square for categorical data
When using a metric to measure distance between samples, a distribution check returns the value of the distance metric that you chose based on samples taken at two points in time. If the value of the distance metric is larger than a threshold that you define, the check warns that the column's distribution has changed.
You can use the following distance metrics in your distribution checks.
Population Stability Index (PSI) for continuous or categorical data
Standardized Wasserstein Distance (SWD) (standardized using the sample standard deviation) for continuous or categorical data
Standardized Earth Mover's Distance (standardized using the sample standard deviation, this metric is equal to the SWD) for continuous or categorical data
Install Soda Scientific
To use a distribution check, you must install Soda Scientific in the same directory or virtual environment in which you installed Soda Library. Best practice recommends installing Soda Library and Soda Scientific in a virtual environment to avoid library conflicts, but you can Install Soda Scientific locally if you prefer.
Set up a virtual environment, and install Soda Library in your new virtual environment.
Use the following command to install Soda Scientific.
pip install -i https://pypi.cloud.soda.io soda-scientificRefer to Troubleshoot Soda Scientific installation for help with issues during installation.
Generate a distribution reference object (DRO)
Not yet supported in Soda Cloud
Before defining a distribution check, you must generate a distribution reference object (DRO).
When you run a distribution check, Soda compares the data in a column of your dataset with a snapshot of the same column at a different point in time. This snapshot exists in the DRO, which serves as a point of reference. The distribution check result indicates whether the difference between the distributions of the snapshot and the actual datasets is statistically significant.
To create a DRO, you use the CLI command soda update-dro. When you execute the command, Soda stores the entire contents of the column(s) you specified in local memory. Before executing the command, examine the volume of data the column(s) contains and ensure that your system can accommodate storing it in local memory.
If you have not already done so, create a directory to contain the files that Soda uses for a distribution check.
Use a code editor to create a file called
distribution_reference.yml(though, you can name it anything you wish) in your Soda project directory, then add the following example content to the file.dataset: your_dataset_name column: column_name_in_dataset distribution_type: categorical # (optional) filter to a specific point in time or any other dimension filter: "column_name between '2010-01-01' and '2020-01-01'" # (optional) database specific sampling query, for example for postgres\ # the following query randomly samples 50% of the data with seed 61Optionally, you can define multiple DROs in your
distribution_reference.ymlfile by naming them. The following example defines two DROs.dro_name1: dataset: your_dataset_name column: column_name_in_dataset distribution_type: categorical dro_name2: dataset: your_dataset_name column: column_name2_in_dataset distribution_type: continuousChange the values for
datasetandcolumnto reflect your own dataset's identifiers.(Optional) Change the value for
distribution_typeto capturecategoricalorcontinuousdata.(Optional) Define the value of
filterto specify the portion of the data in your dataset for which you are creating a DRO. If you trained a model on data in which thedate_first_customercolumn contained values between 2010-01-01 and 2020-01-01, you can use a filter based on that period to test whether the distribution of the column has changed since then. If you do not wish to define a filter, remove the key-value pair from the file.(Optional) If you wish to define multiple DROs in a single
distribution_reference.ymlfile, change the namesdro_name1anddro_name2.Save the file, then, while still in your Soda project directory, run the
soda update-drocommand to create a distribution reference object. For a list of options available to use with the command, runsoda update-dro --help.soda update-dro -d your_datasource_name -c your_configuration_file.yml ./distribution_reference.ymlIf you defined multiple DROs in your
distribution_reference.ymlfile, specify which DRO you want to update using the-nargument.-nindicates name. When multiple DROs are defined in a singledistribution_reference.ymlfile, Soda requires all of them to be named. Thus, you must provide the DRO name with the-nargument when using thesoda update-drocommand.soda update-dro -n dro_name1 -d your_datasource_name -c your_configuration_file.yml ./distribution_reference.ymlReview the changed contents of your
distribution_reference.ymlfile. The following is an example of the information that Soda added to the file.dataset: dim_customer column: number_cars_owned distribution_type: categorical filter: date_first_purchase between '2010-01-01' and '2020-01-01' distribution reference: weights: - 0.34932914953473276 - 0.2641744211209695 - 0.22927937675827742 - 0.08899588833585804 - 0.06822116425016231 bins: - 2 - 1 - 0 - 3 - 4Soda appended a new key called
distribution referenceto the file, together with an array ofbinsand a corresponding array ofweights.
Read more about
binsandweights, and how Soda computes the number of bins for a DRO.
Define a distribution check
If you have not already done so, create a
checks.ymlfile in your Soda project directory. The checks YAML file stores the Soda Checks you write, including distribution checks; Soda Library executes the checks in the file when it runs a scan of your data.In your new file, add the following example content.
checks for your_dataset_name: - distribution_difference(column_name, dro_name) > your_threshold: method: your_method_of_choice distribution reference file: ./distribution_reference.yml # (optional) filter to a specific point in time, or any other dimension filter: column_name > min_allowed_column_value and column_name < max_allowed_value # (optional) database specific sampling query for continuous columns. For # example, for PostgreSQL, the following query randomly samples 50% of the data # with seed 61 sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)Replace the following values with your own dataset and threshold details.
your_dataset_name- the name of your datasetcolumn_name- the column against which to compare the DROdro_name- the name of the DRO (optional, required ifdistribution_reference.ymlcontains named DROs)> your_threshold- the threshold for the distribution check that you specify as acceptable
Replace the value of
your_method_of_choicewith the type of test you want to use in the distribution check. If you do not specify amethod, the distribution check defaults toksfor continuous data, orchi_squarefor categorical data.ksfor the Kolmogorov-Smirnov testchi_squarefor the Chi-square testpsifor the Population Stability Index metricswdfor the Standardized Wasserstein Distance (SWD) metricsemdfor the Standardized Earth Mover's Distance (SEMD) metric SWD and the SEMD are the same metric.
(Optional), to filter the data in the distribution check, replace the value of
filterwith a filter that specifies the portion of the data in your dataset for which you are checking the distribution.(Optional), to sample the data in the distribution check for continuous columns only, replace the value of
samplewith a query that specifies the portion of the data in your dataset for which you are checking the distribution. The data source you are using must support the query you write. For example, for PostgreSQL, you can use theTABLESAMPLEclause to randomly sample 50% of the data with seed 61. Best practice dictates that you use sampling for large datasets that might not fit in memory. Refer to the define the sample size for details. If you do not usesampleorfilterin a distribution check for continuous columns, Soda fetches up to 1 million records by applying alimitclause for better memory management. For categorical columns, Soda does not supportsample.Run a soda scan of your data source to execute the distribution check(s) you defined.
soda scan -d your_datasource_name checks.yml -c /path/to/your_configuration_file.yml your_check_file.ymlWhen Soda Library executes the distribution check above, it compares the values in
column_nameto a sample that Soda creates based on thebins,weights, anddata_typeindro_namedefined in thedistribution_reference.ymlfile. Specifically, it checks whether the value ofyour_method_of_choiceis larger than0.05.
Distribution check details
For continuous columns, When you execute the
soda scancommand, Soda stores up to one million records in local memory. If the column has more than one million records, then Soda applieslimitSQL clause to make sure that your system can accommodate storing it in local memory.For continuous columns, as explained in Generate a Distribution Reference Object (DRO), Soda uses bins and weights to take random samples from your DRO. Therefore, it is possible that the original dataset that you used to create the DRO resembles a different underlying distribution than the dataset that Soda creates by sampling from the DRO. To limit the impact of this possibility, Soda runs the tests in each distribution check ten times and returns the median of the results, either as a p-value or a distance metric). For example, if you use the Kolmogorov-Smirnov test and a threshold of 0.05, the distribution check uses the Kolmogorov-Smirnov test to compare ten different samples from your DRO to the data in your column. If the median of the returned p-values is smaller than 0.05, the check issues a warning. This approach does change the interpretation of the distribution check results. For example, the probability of a type I error is multiple orders of magnitude smaller than the significance level that you choose.
For categorical columns, Soda fetches the aggregated calculated value counts of each category. If there are more than one million distinct categories, Soda skips the distribution check and issues a warning.
Bins and weights
Soda uses the bins and weights to generate a sample from the reference distribution when it executes the distribution check during a scan. By creating a sample using the DRO's bins and weights, you do not have to save the entire – potentially very large - sample. The distribution_type value impacts how the weights and bins will be used to generate a sample, so make sure your choice reflects the nature of your data (continuous or categorical).
To compute the number of bins for a DRO, Soda uses different strategies based on whether outlier values are present in the dataset.
By default Soda automatically computes the number of bins for each DRO by taking the maximum of Sturges and Freedman Diaconis Estimator methods. numpy.histogram_bin_edges(data, bins='auto') also applies this practice by default.
For datasets with outliers, such as in the example below, the default strategy does not work well. When taking the maximum of Sturges and Freedman Diaconis Estimator methods, it produces a great number of bins, 3466808, while there are only nine elements in the array. The outlier value 10e6 result in a misleading bin size.
import numpy as np
arr = np.array([0, 0, 0, 1, 2, 3, 3, 4, 10e6])
number_of_bins = np.histogram_bin_edges(arr, bins='auto').size # return 3466808If the number of bins is greater than the size of data, Soda uses interquantile range (IQR) to detect and filter the outliers. Basically, for data that is greater than Q3 + 1.5 IQR and less than Q1 - 1.5 IQR Soda removes the datasets, then recomputes the number of bins with the same method by taking the maximum of Sturges and Freedman Diaconis Estimator.
After removing the outliers, if the number of bins still exceeds the size of the filtered data, Soda takes the square root of the dataset size to set the number of bins. To cover edge cases, if the square root of the dataset size exceeds one million, then Soda sets the number of bins to one million to prevent it from generating too many bins.
Define the sample size
You can add a sample parameter for both a distribution check and DRO to include a sample SQL clause that Soda passes when it executes the check during a scan.
Apply a sample to a distribution check for continuous columns
If the data to which you wish to apply distribution check does not fit in memory or involves a time constraint, use a sample to specify a SQL query that returns a sample of the data. The SQL query that you provide is specific to the type of data source you use. In the example below, the SQL query for a PostgreSQL data source randomly samples 50% of the data with seed 61. You can customize the sample SQL query to meet your needs.
Use sample for continuous values, only. For categorical values refer to Distribution check details.
checks for dim_customer:
- distribution_difference(budget) < 0.05:
distribution reference file: ./dro_dim_customer.yml
method: ks
# (optional) data source-specific sampling query; for example for postgres\
# the following query randomly samples 50% of the data with seed 61
sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)Sampling Caveats
Some data sources do not have a built-in sampling function. For example, BigQuery does not support TABLESAMPLE BERNOULLI. In such a case, add a filter parameter to randomly obtain a sample of the data. The filter parameter applies a data source-specific SQL WHERE clause to the data. In the example below, the SQL query for a BigQuery data source randomly samples 50% of the data.
Distribution Check
checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05:
distribution reference file: ./cars_owned_dist_ref.yml
method: chi_square
# (optional) data source-specific sampling query, for example for postgres\
# the following query randomly samples 50% of the data
filter: rand() < 0.5DRO
dataset: your_dataset_name
column: column_name_in_dataset
distribution_type: categorical
# (optional) data source-specific sampling query; for example for postgres\
# the following query randomly samples 50% of the data
filter: rand() < 0.5Distribution check examples
You can define multiple distribution checks in a single checks.yml file. If you create a new DRO for another dataset and column in sales_dist_ref.yml for example, you can define two distribution checks in the same checks.yml file, as per the following.
checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05:
method: chi_square
distribution reference file: ./cars_owned_dist_ref.yml
checks for fact_sales_quota:
- distribution_difference(calendar_quarter) < 0.2:
method: psi
distribution reference file: ./sales_dist_ref.ymlAlternatively you can define two DROs in distribution_reference.yml, naming them cars_owned_dro and calendar_quarter_dro, and use both in a single checks.yml file
checks for dim_customer:
- distribution_difference(number_cars_owned, cars_owned_dro) > 0.05:
method: chi_square
distribution reference file: ./distribution_reference.yml
checks for fact_sales_quota:
- distribution_difference(calendar_quarter, calendar_quarter_dro) < 0.2:
method: psi
distribution reference file: ./distribution_reference.ymlYou can also define multiple checks for different columns in the same dataset by generating multiple DROs for those columns. Refer to the following example.
checks for dim_customer:
- distribution_difference(number_cars_owned, cars_owned_dro) > 0.05:
method: chi_square
distribution reference file: ./distribution_reference.yml
- distribution_difference(total_children, total_children_dro) < 0.2:
method: psi
distribution reference file: ./distribution_reference.yml
checks for fact_sales_quota:
- distribution_difference(calendar_quarter, calendar_quarter_dro) < 0.2:
method: psi
distribution reference file: ./distribution_reference.ymlOptional check configurations
Define alert configurations to specify warn and fail thresholds.
-
✓
Apply an in-check filter to return results for a specific portion of the data in your dataset; see example.
✓
Use quotes when identifying dataset or column names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.
Use wildcard characters ( % or * ) in values in the check.
-
✓
Use for each to apply distribution checks to multiple datasets in one scan; see example.
✓
Instruct Soda to collect random samples. Because sampling SQL clauses vary significantly between data sources, consult your data source’s documentation; see example.
-
Example with check name
checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05:
method: chi_square
distribution reference file: dist_ref.yml
name: Distribution checkExample with quotes
checks for dim_customer:
- distribution_difference("number_cars_owned") < 0.2:
method: psi
distribution reference file: dist_ref.yml
name: Distribution checkExample with for each
for each dataset T:
dataset:
- dim_customer
checks:
- distribution_difference(number_cars_owned) < 0.15:
method: swd
distribution reference file: dist_ref.ymlExample with in-check filter
checks for dim_customer:
- distribution_difference(number_cars_owned) < 0.05:
method: swd
distribution reference file: dist_ref.yml
filter: date_first_purchase between '2010-01-01' and '2022-01-01'Example with dataset filter
filter dim_customer [first_purchase]:
where: date_first_purchase between '2010-01-01' and '2022-01-01'
checks for dim_customer [first_purchase]:
- distribution_difference(number_cars_owned) < 0.05:
method: swd
distribution reference file: dist_ref.ymlExample with in-check sampling
The following example works for PostgreSQL data sources. It randomly samples 50% of the dataset with seed value 61.
checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05:
distribution reference file: ./cars_owned_dist_ref.yml
method: chi_square
sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)List of comparison symbols and phrases
=
<
>
<=
>=
!=
<>
between
not between Troubleshoot Soda Scientific installation
While installing Soda Scientific works on Linux, you may encounter issues if you install Soda Scientific on Mac OS (particularly, machines with the M1 ARM-based processor) or any other operating system. If that is the case, consider using one of the following alternative installation procedures.
Need help? Ask the team in the Soda community on Slack.
Install Soda Scientific Locally
Set up a virtual environment, and install Soda Library in your new virtual environment.
Use the following command to install Soda Scientific.
pip install -i https://pypi.cloud.soda.io soda-scientificRefer to Troubleshoot Soda Scientific installation for help with issues during installation.
Use Docker to run Soda Library
Use Soda’s Docker image in which Soda Scientific is pre-installed. You need Soda Scientific to be able to use SodaCL distribution checks or anomaly detection checks.
If you have not already done so, install Docker in your local environment.
From Terminal, run the following command to pull Soda Library’s official Docker image; adjust the version to reflect the most recent release.
docker pull sodadata/soda-library:v1.0.3Verify the pull by running the following command.
docker run sodadata/soda-library:v1.0.3 --helpOutput:
Usage: soda [OPTIONS] COMMAND [ARGS]... Soda Library CLI version 1.0.x, Soda Core CLI version 3.0.xx Options: --version Show the version and exit. --help Show this message and exit. Commands: ingest Ingests test results from a different tool scan Runs a scan suggest Generates suggestions for a dataset test-connection Tests a connection update-dro Updates contents of a distribution reference fileWhen you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requestedWhen you are ready to run a Soda scan, use the following command to run the scan via the docker image. Replace the placeholder values with your own file paths and names.
docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.ymlOptionally, you can specify the version of Soda Library to use to execute the scan. This may be useful when you do not wish to use the latest released version of Soda Library to run your scans. The example scan command below specifies Soda Library version 1.0.0.
docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library:v1.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
Error: Mounts denied
If you encounter the following error, follow the procedure below.
docker: Error response from daemon: Mounts denied:
The path /soda-library-test/files is not shared from the host and is not known to Docker.
You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
See https://docs.docker.com/desktop/mac for more info.You need to give Docker permission to acccess your configuration.yml and checks.yml files in your environment. To do so:
Access your Docker Dashboard, then select Preferences (gear symbol).
Select Resources, then follow the Docker instructions to add your Soda project directory – the one you use to store your configuration.yml and checks.yml files – to the list of directories that can be bind-mounted into Docker containers.
Click Apply & Restart, then repeat steps 2 - 4 above.
Error: Configuration path does not exist
If you encounter the following error, double check the syntax of the scan command in step 4 above.
Be sure to prepend
/sodacl/to both the congifuration.yml filepath and the checks.yml filepath.Be sure to mount your files into the container by including the
-voption. For example,-v /Users/MyName/soda_project:/sodacl.
Soda Library 1.0.x
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist
Scan summary:
No checks found, 0 checks evaluated.
2 errors.
Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
ERRORS:
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not existGo further
Reference tips and best practices for SodaCL.
Use a freshness check to gauge how recently your data was captured.
Use reference checks to compare the values of one column to another.
Need help? Join the Soda community on Slack.
Last updated
Was this helpful?
