
Distribution checks

Last modified on 06-Dec-22

Use a distribution check to determine whether the distribution of a column has changed between two points in time. For example, if you trained a model at a particular moment in time, you can use a distribution check to find out how much the data in the column has changed since then, or whether it has changed at all.

  • Requires Soda Core Scientific.
  • Limitation: Soda Cloud cannot yet maintain the distribution reference object (DRO), but distribution check results appear in the Check Results dashboard.
checks for dim_customer:
  - distribution_difference(number_cars_owned) > 0.05:
      distribution reference file: ./cars_owned_dist_ref.yml
      method: chi_square
      # (optional) filter to a specific point in time or any other dimension 
      filter: purchase_date > 2022-10-01 and purchase_date < 2022-12-01
      # (optional) database-specific sampling query; for example, for PostgreSQL
      # the following query randomly samples 50% of the data with seed 61
      sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)

About distribution checks
Prerequisites
Install Soda Core Scientific
Generate a distribution reference object (DRO)
Define a distribution check
    Distribution check details
    Bins and weights
Distribution check examples
Optional check configurations
List of comparison symbols and phrases
Troubleshoot Soda Core Scientific installation
Go further

About distribution checks

To detect changes in the distribution of a column between different points in time, Soda uses two approaches: hypothesis testing, and metrics that quantify the distance between samples.

When using hypothesis testing, a distribution check allows you to determine whether enough evidence exists to conclude that the distribution of a column has changed. It returns the probability that the difference between samples taken at two points in time would have occurred if they came from the same distribution (see p-value). If this probability is smaller than a threshold that you define, the check warns you that the column’s distribution has changed.

You can use the following statistical tests for hypothesis testing in your distribution checks:
  • the Kolmogorov-Smirnov test (ks), for continuous data
  • the Chi-square test (chi_square), for categorical data
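The sketch below illustrates the decision logic of a hypothesis-testing check using the Kolmogorov-Smirnov test from SciPy. It mirrors the p-value comparison described above, not Soda's internal code.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)  # snapshot at training time
current = rng.normal(loc=0.3, scale=1.0, size=10_000)    # the column as it is today

statistic, p_value = stats.ks_2samp(reference, current)
threshold = 0.05
if p_value < threshold:
    print(f"warn: distribution changed (p={p_value:.2e})")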

When using a metric to measure distance between samples, a distribution check returns the value of the distance metric that you chose based on samples taken at two points in time. If the value of the distance metric is larger than a threshold that you define, the check warns that the column’s distribution has changed.

You can use the following distance metrics in your distribution checks:
  • the Population Stability Index (psi)
  • the Standardized Wasserstein Distance (swd), also called the Standardized Earth Mover's Distance (semd)
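For intuition, the following is a minimal sketch of both metrics using common textbook formulations; standardizing the SWD by the pooled standard deviation is an assumption here, and Soda's exact implementation may differ in detail.

import numpy as np
from scipy import stats

def psi(expected_fractions, actual_fractions, eps=1e-6):
    # Population Stability Index over matching bin/category fractions;
    # eps guards against empty bins in the logarithm.
    e = np.clip(np.asarray(expected_fractions), eps, None)
    a = np.clip(np.asarray(actual_fractions), eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

def swd(reference_sample, current_sample):
    # Wasserstein-1 distance scaled by the pooled standard deviation,
    # so the result is unit-free (assumed formulation).
    pooled_std = np.std(np.concatenate([reference_sample, current_sample]))
    return stats.wasserstein_distance(reference_sample, current_sample) / pooled_std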

Sample sizes in distribution checks

In hypothesis testing, the statistical power of a test refers to its ability to reject the null hypothesis when it is false. Specifically, the power of a test tells you how likely the test is to reject the null hypothesis when the true difference from the alternative hypothesis is of a particular size; see effect size. A very powerful test can reject the null hypothesis even when the true difference is small.

Since distribution checks issue warnings based on the p-value alone and do not take effect size into account, having too much power can make the results of the checks hard to interpret. An extremely powerful test rejects the null hypothesis for effect sizes that are negligible. Because the power of a test increases as its sample size increases, there is a sample size limit of one million in distribution checks. Soon, users will be able to define the sample size in a distribution check.

The default sample size limit of 1 million rows is based on simulations that used the Kolmogorov-Smirnov test. The simulation generated samples from a normal distribution, an exponential distribution, a Laplace distribution, a beta distribution, and a mixture distribution (generated by randomly choosing between two normal distributions). The Kolmogorov-Smirnov test compared these samples to samples that came from the same distributions, but with different means. For example, it compared samples from a normal distribution to samples from another normal distribution with a different mean.

For each distribution type, the Kolmogorov-Smirnov test rejected the null hypothesis 100% of the time if the effect size was equal to, or larger than, a shift to the mean of 1% of the standard deviation, when using a sample size of one million. Using such a sample size does not result in problems with local memory.
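The following sketch reproduces the spirit of that simulation (it is not the original code): with one million samples per side and a mean shift of 1% of the standard deviation, the Kolmogorov-Smirnov test virtually always rejects.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000
baseline = rng.normal(loc=0.0, scale=1.0, size=n)
shifted = rng.normal(loc=0.01, scale=1.0, size=n)  # shift = 1% of the standard deviation

_, p_value = stats.ks_2samp(baseline, shifted)
print(p_value)  # vanishingly small at this sample size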
Distribution check thresholds for distance metrics

The values of the Population Stability Index (PSI) and the Standardized Wasserstein Distance (SWD) can be hard to interpret. Carefully investigate which thresholds make sense for your use case.

Some common interpretations of the PSI result are as follows:
  • PSI < 0.1: no significant distribution change
  • 0.1 < PSI < 0.2: moderate distribution change
  • PSI >= 0.2: significant distribution change
During simulations, for a difference in means equal to 10% of the distributions' standard deviation, the SWD value converged to approximately 0.1.
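You can verify that convergence with a quick simulation. The sketch below assumes normally distributed data and standardizes by the pooled standard deviation; for two normal distributions with equal spread, the Wasserstein-1 distance equals the difference in means, so the SWD approaches 0.1.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=500_000)
b = rng.normal(0.1, 1.0, size=500_000)  # mean shifted by 10% of the standard deviation
swd = stats.wasserstein_distance(a, b) / np.std(np.concatenate([a, b]))
print(round(swd, 3))  # ~0.1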

Prerequisites

Install Soda Core Scientific

To use a distribution check, you must install Soda Core Scientific in the same directory or virtual environment in which you installed Soda Core. Best practice recommends installing Soda Core and Soda Core Scientific in a virtual environment to avoid library conflicts, but you can Install Soda Core Scientific locally if you prefer.

  1. Set up a virtual environment, as described in the Soda Core install documentation.
  2. Install Soda Core in your new virtual environment.
  3. Use the following command to install Soda Core Scientific.
pip install soda-core-scientific

Note that installing the Soda Core Scientific package also installs several scientific dependencies. Reference the soda-core-scientific setup file in the public GitHub repository for details.

Refer to Troubleshoot Soda Core Scientific installation for help with issues during installation.

Generate a distribution reference object (DRO)

Before defining a distribution check, you must generate a distribution reference object (DRO).

When you run a distribution check, Soda compares the data in a column of your dataset with a snapshot of the same column at a different point in time. This snapshot exists in the DRO, which serves as a point of reference. The distribution check result indicates whether the difference between the distributions of the snapshot and the actual datasets is statistically significant.

To create a DRO, you use the CLI command soda update-dro. When you execute the command, Soda stores the entire contents of the column(s) you specified in local memory. Before executing the command, examine the volume of data the column(s) contains and ensure that your system can accommodate storing it in local memory.

  1. If you have not already done so, create a directory to contain the files that Soda uses for a distribution check.
  2. Use a code editor to create a file called distribution_reference.yml (though you can name it anything you wish) in your Soda project directory, then add the following example content to the file.
dataset: your_dataset_name
column: column_name_in_dataset
distribution_type: categorical
# (optional) filter to a specific point in time or any other dimension 
filter: "column_name between '2010-01-01' and '2020-01-01'"

Alternatively, you can define multiple DROs in your distribution_reference.yml file by naming them. The following example content defines two DROs.

dro_name1:
  dataset: your_dataset_name
  column: column_name_in_dataset
  distribution_type: categorical
dro_name2:
  dataset: your_dataset_name
  column: column_name2_in_dataset
  distribution_type: continuous
  3. Change the values for dataset and column to reflect your own dataset's identifiers.
  4. (Optional) Change the value for distribution_type to capture categorical or continuous data.
  5. (Optional) Define the value of filter to specify the portion of the data in your dataset for which you are creating a DRO. If you trained a model on data in which the date_first_customer column contained values between 2010-01-01 and 2020-01-01, you can use a filter based on that period to test whether the distribution of the column has changed since then.
    If you do not wish to define a filter, remove the key-value pair from the file.
  6. (Optional) If you wish to define multiple DROs in a single distribution_reference.yml file, change the names dro_name1 and dro_name2.
  7. Save the file, then, while still in your Soda project directory, run the soda update-dro command to create a distribution reference object. For a list of options available to use with the command, run soda update-dro --help.
soda update-dro -d your_datasource_name -c your_configuration_file.yml ./distribution_reference.yml 

If you defined multiple DROs in your distribution_reference.yml file, use the -n argument to specify the name of the DRO you want to update. When multiple DROs are defined in a single distribution_reference.yml file, Soda requires all of them to be named, so you must always provide a DRO name with the -n argument when using the soda update-dro command.

soda update-dro -n dro_name1 -d your_datasource_name -c your_configuration_file.yml ./distribution_reference.yml 
  8. Review the changed contents of your distribution_reference.yml file. The following is an example of the information that Soda added to the file.
dataset: dim_customer
column: number_cars_owned
distribution_type: categorical
filter: date_first_purchase between '2010-01-01' and '2020-01-01'
distribution reference:
  weights:
    - 0.34932914953473276
    - 0.2641744211209695
    - 0.22927937675827742
    - 0.08899588833585804
    - 0.06822116425016231
  bins:
    - 2
    - 1
    - 0
    - 3
    - 4

Soda appended a new key called distribution reference to the file, together with an array of bins and a corresponding array of weights. Read more about bins and weights, and how Soda computes the number of bins for a DRO.

Define a distribution check

  1. If you have not already done so, create a checks.yml file in your Soda project directory. The checks YAML file stores the Soda Checks you write, including distribution checks; Soda Core executes the checks in the file when it runs a scan of your data. Refer to more detailed instructions in the Soda Core documentation.
  2. In your new file, add the following example content.
checks for your_dataset_name:
  - distribution_difference(column_name, dro_name) > your_threshold:
      method: your_method_of_choice
      distribution reference file: ./distribution_reference.yml
  3. Replace the following values with your own dataset and threshold details.
  • your_dataset_name - the name of your dataset
  • column_name - the column against which to compare the DRO
  • dro_name - the name of the DRO (optional, required if distribution_reference.yml contains named DROs)
  • your_threshold - the threshold for the distribution check that you specify as acceptable, for example 0.05
  4. Replace the value of your_method_of_choice with the type of test you want to use in the distribution check.
    • ks for the Kolmogorov-Smirnov test
    • chi_square for the Chi-square test
    • psi for the Population Stability Index metric
    • swd for the Standardized Wasserstein Distance (SWD) metric
    • semd for the Standardized Earth Mover’s Distance (SEMD) metric (the SWD and the SEMD are the same metric)
      If you do not specify a method, the distribution check defaults to ks for continuous data, or chi_square for categorical data.
  5. Run a soda scan of your data source to execute the distribution check(s) you defined. Refer to Soda Core documentation for more details.
soda scan -d your_datasource_name -c /path/to/your_configuration_file.yml your_check_file.yml

When Soda Core executes the distribution check above, it compares the values in column_name to a sample that Soda creates based on the bins, weights, and distribution_type of dro_name in the distribution_reference.yml file. Specifically, it checks whether the value that your_method_of_choice returns is larger than the threshold you defined.

Distribution check details

  • When you execute the soda scan command, Soda stores the entire contents of the column(s) you specified in local memory. Before executing the command, examine the volume of data the column(s) contains and ensure that your system can accommodate storing it in local memory.

  • As explained in Generate a distribution reference object (DRO), Soda uses bins and weights to take random samples from your DRO. Therefore, it is possible that the original dataset you used to create the DRO follows a different underlying distribution than the dataset Soda creates by sampling from the DRO. To limit the impact of this possibility, Soda runs the tests in each distribution check ten times and returns the median of the results (either p-value or distance metric).

    For example, if you use the Kolmogorov-Smirnov test and a threshold of 0.05, the distribution check uses the Kolmogorov-Smirnov test to compare ten different samples from your DRO to the data in your column. If the median of the returned p-values is smaller than 0.05, the check issues a warning. This approach does change the interpretation of the distribution check results. For example, the probability of a type I error is multiple orders of magnitude smaller than the significance level that you choose.
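A minimal sketch of that median-of-ten logic, assuming a hypothetical draw_sample_from_dro() helper that resamples from the DRO's bins and weights; the function names are illustrative, not Soda's API.

import numpy as np
from scipy import stats

def check_distribution(column_values, draw_sample_from_dro, threshold=0.05):
    # Run the test ten times against fresh DRO samples and take the median p-value.
    p_values = []
    for _ in range(10):
        dro_sample = draw_sample_from_dro()  # fresh sample on every run
        _, p = stats.ks_2samp(dro_sample, column_values)
        p_values.append(p)
    median_p = float(np.median(p_values))
    return median_p < threshold  # True means the check issues a warning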

Bins and weights

Soda uses the bins and weights to generate a sample from the reference distribution when it executes the distribution check during a scan. By creating a sample using the DRO's bins and weights, you do not have to save the entire, potentially very large, sample. The distribution_type value affects how the weights and bins are used to generate a sample, so make sure your choice reflects the nature of your data (continuous or categorical).
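For intuition, the following sketch reconstructs a categorical sample from the bins and weights shown in the DRO example above; Soda's internal sampling code may differ.

import numpy as np

bins = np.array([2, 1, 0, 3, 4])  # categories from the DRO example above
weights = np.array([0.349, 0.264, 0.229, 0.089, 0.068])
weights = weights / weights.sum()  # normalize into a proper probability distribution

rng = np.random.default_rng(7)
sample = rng.choice(bins, size=100_000, p=weights)  # categorical resample
# For continuous data, the bins entries are bin edges instead, and values
# would be drawn within each chosen bin (illustration only).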

To compute the number of bins for a DRO, Soda uses different strategies based on whether outlier values are present in the dataset.

By default, Soda automatically computes the number of bins for each DRO by taking the maximum of the Sturges and Freedman-Diaconis estimators. numpy.histogram_bin_edges(data, bins='auto') applies the same practice by default.

For datasets with outliers, such as in the example below, the default strategy does not work well. Taking the maximum of the Sturges and Freedman-Diaconis estimators produces a very large number of bins, 3466808, even though there are only nine elements in the array. The outlier value 10e6 causes this misleading bin count.

import numpy as np

arr = np.array([0, 0, 0, 1, 2, 3, 3, 4, 10e6])  # nine values, one extreme outlier
number_of_bins = np.histogram_bin_edges(arr, bins='auto').size  # returns 3466808

If the number of bins is greater than the size of the data, Soda uses the interquartile range (IQR) to detect and filter the outliers: it removes values greater than Q3 + 1.5 * IQR or less than Q1 - 1.5 * IQR, then recomputes the number of bins with the same method, taking the maximum of the Sturges and Freedman-Diaconis estimators.

After removing the outliers, if the number of bins still exceeds the size of the filtered data, Soda takes the square root of the dataset size as the number of bins. To cover edge cases, if that square root exceeds one million, Soda caps the number of bins at one million to prevent generating too many bins.
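The following sketch pulls this strategy together, assuming the standard NumPy estimators; it illustrates the logic described above rather than Soda's actual implementation.

import numpy as np

def robust_bin_count(data, max_bins=1_000_000):
    # Default: maximum of the Sturges and Freedman-Diaconis estimators.
    n_bins = np.histogram_bin_edges(data, bins='auto').size - 1
    if n_bins > data.size:
        # Filter outliers outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], then retry.
        q1, q3 = np.percentile(data, [25, 75])
        iqr = q3 - q1
        filtered = data[(data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)]
        n_bins = np.histogram_bin_edges(filtered, bins='auto').size - 1
        if n_bins > filtered.size:
            # Last resort: square root of the dataset size, capped at one million.
            n_bins = min(int(np.sqrt(filtered.size)), max_bins)
    return n_bins

arr = np.array([0, 0, 0, 1, 2, 3, 3, 4, 10e6])
print(robust_bin_count(arr))  # a handful of bins instead of millions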

Distribution check examples

You can define multiple distribution checks in a single checks.yml file. If you create a new DRO for another dataset and column in sales_dist_ref.yml for example, you can define two distribution checks in the same checks.yml file, as per the following.

checks for dim_customer:
  - distribution_difference(number_cars_owned) > 0.05:
      method: chi_square
      distribution reference file: ./cars_owned_dist_ref.yml

checks for fact_sales_quota:
  - distribution_difference(calendar_quarter) < 0.2:
      method: psi
      distribution reference file: ./sales_dist_ref.yml

Alternatively, you can define two DROs in distribution_reference.yml, naming them cars_owned_dro and calendar_quarter_dro, and use both in a single checks.yml file:

checks for dim_customer:
  - distribution_difference(number_cars_owned, cars_owned_dro) > 0.05:
      method: chi_square
      distribution reference file: ./distribution_reference.yml

checks for fact_sales_quota:
  - distribution_difference(calendar_quarter, calendar_quarter_dro) < 0.2:
      method: psi
      distribution reference file: ./distribution_reference.yml

You can also define multiple checks for different columns in the same dataset by generating multiple DROs for those columns. Refer to the following example.

checks for dim_customer:
  - distribution_difference(number_cars_owned, cars_owned_dro) > 0.05:
      method: chi_square
      distribution reference file: ./distribution_reference.yml
  - distribution_difference(total_children, total_children_dro) < 0.2:
      method: psi
      distribution reference file: ./distribution_reference.yml

checks for fact_sales_quota:
  - distribution_difference(calendar_quarter, calendar_quarter_dro) < 0.2:
      method: psi
      distribution reference file: ./distribution_reference.yml

Optional check configurations

| Supported configuration | Documentation |
| ----------------------- | ------------- |
| Define a name for a distribution check; see example. | Customize check names |
| Add an identity to a check. | Add a check identity |
| Define alert configurations to specify warn and fail thresholds. | - |
| Apply an in-check filter to return results for a specific portion of the data in your dataset; see example. | Configure in-check filters |
| Use quotes when identifying dataset or column names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark. | Use quotes in a check |
| Use wildcard characters ( % or * ) in values in the check. | - |
| Use for each to apply distribution checks to multiple datasets in one scan; see example. | Apply checks to multiple datasets |
| Apply a dataset filter to partition data during a scan; see example. | Scan a portion of your dataset |

Example with check name

checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05: 
    method: chi_square
    distribution reference file: dist_ref.yml
    name: Distribution check

Example with quotes

checks for dim_customer:
- distribution_difference("number_cars_owned") < 0.2:
    method: psi
    distribution reference file: dist_ref.yml
    name: Distribution check

Example with for each

for each dataset T:
    datasets:
        - dim_customer
    checks:
    - distribution_difference(number_cars_owned) < 0.15:
        method: swd
        distribution reference file: dist_ref.yml

Example with in-check filter

checks for dim_customer:
- distribution_difference(number_cars_owned) < 0.05: 
    method: swd
    distribution reference file: dist_ref.yml
    filter: date_first_purchase between '2010-01-01' and '2022-01-01'

Example with dataset filter

filter dim_customer [first_purchase]:
  where: date_first_purchase between '2010-01-01' and '2022-01-01' 

checks for dim_customer [first_purchase]:
- distribution_difference(number_cars_owned) < 0.05: 
    method: swd
    distribution reference file: dist_ref.yml


Example with in-check sampling

The following example works for PostgreSQL. It randomly samples 50% of the table with seed value 61. Because sampling SQL clauses vary significantly between databases, consult your database's documentation for the correct syntax.

checks for dim_customer:
  - distribution_difference(number_cars_owned) > 0.05:
      distribution reference file: ./cars_owned_dist_ref.yml
      method: chi_square
      sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)

List of comparison symbols and phrases

 = 
 < 
 >
 <=
 >=
 !=
 <> 
 between 
 not between 

Troubleshoot Soda Core Scientific installation

While installing Soda Core Scientific works on Linux, you may encounter issues if you install it on macOS (particularly on machines with an M1 ARM-based processor) or any other operating system. If that is the case, consider using one of the following alternative installation procedures.

Need help? Ask the team in the Soda community on Slack.

Use Docker to run Soda Core

Use Soda’s Docker image in which Soda Core Scientific is pre-installed.

  1. If you have not already done so, install Docker in your local environment.
  2. From Terminal, run the following command to pull the latest official Soda Core Docker image.
    docker pull sodadata/soda-core
    
  3. Verify the pull by running the following command.
    docker run sodadata/soda-core --help
    

    Output:

     Usage: soda [OPTIONS] COMMAND [ARGS]...
    
     Soda Core CLI version 3.0.xxx
    
     Options:
     --help  Show this message and exit.
    
     Commands:
     scan    runs a scan
     update-dro  updates a distribution reference file
    

    When you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.

    WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
    
  4. When you are ready to run a Soda scan, use the following command to run the scan via the Docker image. Replace the placeholder values with your own file paths and names.
    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-core scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    

    Optionally, you can specify the version of Soda Core to use to execute the scan. This may be useful when you do not wish to use the latest released version of Soda Core to run your scans. The example scan command below specifies Soda Core version 3.0.0.

    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-core:v3.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    
What does the scan command do?
  • docker run ensures that the docker engine runs a specific image.
  • -v mounts your SodaCL files into the container. In other words, it makes the configuration.yml and checks.yml files in your local environment available to the docker container. The command example maps your local directory to /sodacl inside of the docker container.
  • sodadata/soda-core refers to the image that docker run must use.
  • scan instructs Soda Core to execute a scan of your data.
  • -d indicates the name of the data source to scan.
  • -c specifies the filepath and name of the configuration YAML file.


Error: Mounts denied

If you encounter the following error, follow the procedure below.

docker: Error response from daemon: Mounts denied: 
The path /soda-core-test/files is not shared from the host and is not known to Docker.
You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
See https://docs.docker.com/desktop/mac for more info.

You need to give Docker permission to access your configuration.yml and checks.yml files in your environment. To do so:

  1. Access your Docker Dashboard, then select Preferences (gear symbol).
  2. Select Resources, then follow the Docker instructions to add your Soda project directory – the one you use to store your configuration.yml and checks.yml files – to the list of directories that can be bind-mounted into Docker containers.
  3. Click Apply & Restart, then repeat steps 2 - 4 above.


Error: Configuration path does not exist

If you encounter the following error, double-check the syntax of the scan command in step 4 above.

  • Be sure to prepend /sodacl/ to both the configuration.yml filepath and the checks.yml filepath.
  • Be sure to mount your files into the container by including the -v option. For example, -v /Users/MyName/soda_core_project:/sodacl.
Soda Core 3.0.xxx
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist
Scan summary:
No checks found, 0 checks evaluated.
2 errors.
Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
ERRORS:
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist


Install Soda Core Scientific locally

The following works on macOS on machines with an M1 ARM-based processor. Consult the sections below to troubleshoot errors that may arise.

From your command-line interface, use the following command to install Soda Core Scientific.

pip install soda-core-scientific

Error: No module named ‘wheel’

If you encounter the following error, follow the procedure below.

Collecting lightgbm>=2.2.3
  Using cached lightgbm-3.3.2.tar.gz (1.5 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/private/var/folders/vj/7nxglgz93mv6cv472sl0pnm40000gq/T/pip-install-j0txphmm/lightgbm_327e689fd1a645dfa052e5669c31918c/setup.py", line 17, in <module>
          from wheel.bdist_wheel import bdist_wheel
      ModuleNotFoundError: No module named 'wheel'
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
  1. Install wheel.
    pip install wheel
    
  2. Run the command to install Soda Core Scientific, again.
    pip install soda-core-scientific 
    


Error: RuntimeError: Could not find a ‘llvm-config’ binary

If you encounter the following error, follow the procedure below.

      RuntimeError: Could not find a `llvm-config` binary. There are a number of reasons this could occur, please see: https://llvmlite.readthedocs.io/en/latest/admin-guide/install.html#using-pip for help.
      error: command '/Users/yourname/Projects/testing/venv/bin/python3' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llvmlite
  1. To install llvmlite, you must have an llvm-config binary file that the llvmlite installation process uses. In Terminal, use Homebrew to run the following command.
    brew install llvm@11
    
  2. Homebrew installs this file in /opt/homebrew/opt/llvm@11/bin/llvm-config. To ensure that the llvmlite installation process uses this binary file, run the following command.
    export LLVM_CONFIG=/opt/homebrew/opt/llvm@11/bin/llvm-config
    
  3. Run the command to install Soda Core Scientific, again.
    pip install soda-core-scientific 
    

Go further

