# Distribution checks

{% hint style="danger" %}
Distribution checks will no longer be supported in Soda v4; they will deprecated and replaced by MAD.

In the short term, v3 users can use summary statistics instead.
{% endhint %}

Use a distribution check to determine whether the distribution of a column has changed between two points in time. For example, if you trained a model at a particular moment in time, you can use a distribution check to find out how much the data in the column has changed over time, or if it has changed all.

```yaml
checks for dim_customer:
  - distribution_difference(number_cars_owned) > 0.05:
      distribution reference file: ./cars_owned_dist_ref.yml
      method: chi_square
      # (optional) filter to a specific point in time or any other dimension 
      filter: purchase_date > 2022-10-01 and purchase_date < 2022-12-01
      # (optional) database specific sampling query for continuous columns. For 
      # example, for PostgreSQL the following query randomly samples 50% of the data 
      # with seed 61.
      sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)
```

✔️    Requires Soda Core Scientific (included in a Soda Agent)\
✔️    Supported in Soda Core\
✔️    Supported in Soda Library + Soda Cloud\
✖️    Supported in Soda Cloud Agreements + Soda Agent\
✖️    Available as a no-code check

## About distribution checks

To detect changes in the distribution of a column between different points in time, Soda uses approaches based on [hypothesis testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing) and based on metrics that quantify the distance between samples.

When using hypothesis testing, a distribution check allows you to determine whether enough evidence exists to conclude that the distribution of a column has changed. It returns the probability that the difference between samples taken at two points in time would have occurred if they came from the same distribution (see [p-value](https://en.wikipedia.org/wiki/P-value)). If this probability is smaller than a threshold that you define, the check warns you that the column's distribution has changed.

You can use the following statistical tests for hypothesis testing in your distribution checks.

* [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) for continuous data
* [Chi-square](https://en.wikipedia.org/wiki/Chi-squared_test) for categorical data

When using a metric to measure distance between samples, a distribution check returns the value of the distance metric that you chose based on samples taken at two points in time. If the value of the distance metric is larger than a threshold that you define, the check warns that the column's distribution has changed.

You can use the following distance metrics in your distribution checks.

* [Population Stability Index (PSI)](https://mwburke.github.io/data%20science/2018/04/29/population-stability-index.html) for continuous or categorical data
* [Standardized Wasserstein Distance (SWD) ](https://en.wikipedia.org/wiki/Wasserstein_metric)(standardized using the sample standard deviation) for continuous or categorical data
* [Standardized Earth Mover's Distance](https://en.wikipedia.org/wiki/Earth_mover's_distance) (standardized using the sample standard deviation, this metric is equal to the SWD) for continuous or categorical data

<details>

<summary>Sample sizes in distribution checks</summary>

In hypothesis testing, the [statistical power](https://en.wikipedia.org/wiki/Power_of_a_test) of a test refers to its ability to reject the null hypothesis when it is false. Specifically, the power of a test tells you how likely it is that the null hypothesis will be rejected if the true difference with the alternative hypothesis were of a particular size; see [effect size](https://en.wikipedia.org/wiki/Effect_size#Relationship_to_test_statistics). A very powerful test is able to reject the null hypothesis even if the true difference is small.\
\
Since distribution checks issue warnings based on the p-value alone and do not take effect size into account, having too much power can make the results of the checks hard to interpret. An extremely powerful test rejects the null hypothesis for effect sizes that are negligible. Because the power of a test increases as its sample size increases, there is a sample size limit of one million in distribution checks.\
\
The default sample size limit of 1 million rows is based on simulations that used the Kolmogorov-Smirnov test. The simulation generated samples from a normal distribution, an exponential distribution, a laplacian distribution, a beta distribution, and a mixture distribution (generated by randomly choosing between two normal distributions). The Kolmogorov-Smirnov test compared these samples to samples that came from the same distributions, but with different means. For example, it compared samples from a normal distribution to samples from another normal distribution with a different mean.\
\
For each distribution type, the Kolmogorov-Smirnov test rejected the null hypothesis 100% of the time if the effect size was equal to, or larger than, a shift to the mean of 1% of the standard deviation, when using a sample size of one million. Using such a sample size does not result in problems with local memory.\
\
If you wish, you can define your own sample size using a SQL query. See Define the sample size section, below.

</details>

<details>

<summary>Distribution check thresholds for distance metrics</summary>

The values of the Population Stability Index (PSI) and the Standardized Wasserstein Distance (SWD) can be hard to interpret. Consider carefully investigating which distribution thresholds make sense for your use case.\
\
Some common interpretations of the PSI result are as follows:

* PSI < 0.1: no significant distribution change
* 0.1 < PSI < 0.2: moderate distribution change
* PSI >= 0.2: significant distribution change

During simulations, for a difference in mean between distributions of size relative to 10% of their standard deviation, the SWD value converged to approximately 0.1.

</details>

## Install Soda Scientific

To use a distribution check, you must install Soda Scientific in the same directory or virtual environment in which you installed Soda Library. Best practice recommends installing Soda Library and Soda Scientific in a virtual environment to avoid library conflicts, but you can [Install Soda Scientific locally](#install-soda-scientific-locally) if you prefer.

1. Set up a virtual environment, and install Soda Library in your new virtual environment.
2. Use the following command to install Soda Scientific.

```sh
pip install -i https://pypi.cloud.soda.io soda-scientific
```

<details>

<summary>List of Soda Scientific dependencies</summary>

* pandas<2.0.0
* wheel
* pydantic>=1.8.1,<2.0.0
* scipy>=1.8.0
* numpy>=1.23.3, <2.0.0
* inflection==0.5.1
* httpx>=0.18.1,<2.0.0
* PyYAML>=5.4.1,<7.0.0
* cython>=0.22
* prophet>=1.1.0,<2.0.0

</details>

Refer to [Troubleshoot Soda Scientific installation](#troubleshoot-soda-scientific-installation) for help with issues during installation.

## Generate a distribution reference object (DRO)

*Not yet supported in Soda Cloud*

Before defining a distribution check, you must generate a distribution reference object (DRO).

When you run a distribution check, Soda compares the data in a column of your dataset with a snapshot of the same column at a different point in time. This snapshot exists in the DRO, which serves as a point of reference. The distribution check result indicates whether the difference between the distributions of the snapshot and the actual datasets is statistically significant.

To create a DRO, you use the CLI command `soda update-dro`. When you execute the command, Soda stores the entire contents of the column(s) you specified in local memory. Before executing the command, examine the volume of data the column(s) contains and ensure that your system can accommodate storing it in local memory.

1. If you have not already done so, create a directory to contain the files that Soda uses for a distribution check.
2. Use a code editor to create a file called `distribution_reference.yml` (though, you can name it anything you wish) in your Soda project directory, then add the following example content to the file.

   ```yaml
   dataset: your_dataset_name
   column: column_name_in_dataset
   distribution_type: categorical
   # (optional) filter to a specific point in time or any other dimension 
   filter: "column_name between '2010-01-01' and '2020-01-01'"
   # (optional) database specific sampling query, for example for postgres\
   # the following query randomly samples 50% of the data with seed 61
   ```

   Optionally, you can define multiple DROs in your `distribution_reference.yml` file by naming them. The following example defines two DROs.

   ```yaml
   dro_name1:
     dataset: your_dataset_name
     column: column_name_in_dataset
     distribution_type: categorical
   dro_name2:
     dataset: your_dataset_name
     column: column_name2_in_dataset
     distribution_type: continuous
   ```
3. Change the values for `dataset` and `column` to reflect your own dataset's identifiers.
4. (Optional) Change the value for `distribution_type` to capture `categorical` or `continuous` data.
5. (Optional) Define the value of `filter` to specify the portion of the data in your dataset for which you are creating a DRO. If you trained a model on data in which the `date_first_customer` column contained values between 2010-01-01 and 2020-01-01, you can use a filter based on that period to test whether the distribution of the column has changed since then.\
   If you do not wish to define a filter, remove the key-value pair from the file.
6. (Optional) If you wish to define multiple DROs in a single `distribution_reference.yml` file, change the names `dro_name1` and `dro_name2`.
7. Save the file, then, while still in your Soda project directory, run the `soda update-dro` command to create a distribution reference object. For a list of options available to use with the command, run `soda update-dro --help`.

   ```bash
   soda update-dro -d your_datasource_name -c your_configuration_file.yml ./distribution_reference.yml 
   ```

   If you defined multiple DROs in your `distribution_reference.yml` file, specify which DRO you want to update using the `-n` argument. `-n` indicates name. When multiple DROs are defined in a single `distribution_reference.yml` file, Soda requires all of them to be named. Thus, you must provide the DRO name with the `-n` argument when using the `soda update-dro` command.

   ```bash
   soda update-dro -n dro_name1 -d your_datasource_name -c your_configuration_file.yml ./distribution_reference.yml 
   ```
8. Review the changed contents of your `distribution_reference.yml` file. The following is an example of the information that Soda added to the file.

   ```yaml
   dataset: dim_customer
   column: number_cars_owned
   distribution_type: categorical
   filter: date_first_purchase between '2010-01-01' and '2020-01-01'
   distribution reference:
     weights:
       - 0.34932914953473276
       - 0.2641744211209695
       - 0.22927937675827742
       - 0.08899588833585804
       - 0.06822116425016231
     bins:
       - 2
       - 1
       - 0
       - 3
       - 4
   ```

   Soda appended a new key called `distribution reference` to the file, together with an array of `bins` and a corresponding array of `weights`.

> [Read more](#bins-and-weights) about `bins` and `weights`, and how Soda computes the number of bins for a DRO.

## Define a distribution check

1. If you have not already done so, create a `checks.yml` file in your Soda project directory. The checks YAML file stores the Soda Checks you write, including distribution checks; Soda Library executes the checks in the file when it runs a scan of your data.
2. In your new file, add the following example content.

   ```yaml
   checks for your_dataset_name:
     - distribution_difference(column_name, dro_name) > your_threshold:
         method: your_method_of_choice
         distribution reference file: ./distribution_reference.yml
         # (optional) filter to a specific point in time, or any other dimension 
         filter: column_name > min_allowed_column_value and column_name < max_allowed_value
         # (optional) database specific sampling query for continuous columns. For 
         # example, for PostgreSQL, the following query randomly samples 50% of the data 
         # with seed 61
         sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)
   ```
3. Replace the following values with your own dataset and threshold details.
   * `your_dataset_name` - the name of your dataset
   * `column_name` - the column against which to compare the DRO
   * `dro_name` - the name of the DRO (optional, required if `distribution_reference.yml` contains named DROs)
   * `> your_threshold` - the threshold for the distribution check that you specify as acceptable
4. Replace the value of `your_method_of_choice` with the type of test you want to use in the distribution check. If you do not specify a `method`, the distribution check defaults to `ks` for continuous data, or `chi_square` for categorical data.
   * `ks` for the Kolmogorov-Smirnov test
   * `chi_square` for the Chi-square test
   * `psi` for the Population Stability Index metric
   * `swd` for the Standardized Wasserstein Distance (SWD) metric
   * `semd` for the Standardized Earth Mover's Distance (SEMD) metric\
     SWD and the SEMD are the same metric.<br>
5. (Optional), to filter the data in the distribution check, replace the value of `filter` with a filter that specifies the portion of the data in your dataset for which you are checking the distribution.
6. (Optional), to sample the data in the distribution check for **continuous columns** only, replace the value of `sample` with a query that specifies the portion of the data in your dataset for which you are checking the distribution. The data source you are using must support the query you write.\
   For example, for PostgreSQL, you can use the `TABLESAMPLE` clause to randomly sample 50% of the data with seed 61. Best practice dictates that you use sampling for large datasets that might not fit in memory. Refer to the [define the sample size](#define-the-sample-size) for details. If you do not use `sample` or `filter` in a distribution check for **continuous columns**, Soda fetches up to 1 million records by applying a `limit` clause for better memory management. For **categorical columns**, Soda does not support `sample`.
7. Run a soda scan of your data source to execute the distribution check(s) you defined.

   ```bash
   soda scan -d your_datasource_name checks.yml -c /path/to/your_configuration_file.yml your_check_file.yml
   ```

   When Soda Library executes the distribution check above, it compares the values in `column_name` to a sample that Soda creates based on the `bins`, `weights`, and `data_type` in `dro_name` defined in the `distribution_reference.yml` file. Specifically, it checks whether the value of `your_method_of_choice` is larger than `0.05`.

### Distribution check details

* For continuous columns, When you execute the `soda scan` command, Soda stores up to one million records in local memory. If the column has more than one million records, then Soda applies `limit` SQL clause to make sure that your system can accommodate storing it in local memory.
* For continuous columns, as explained in [Generate a Distribution Reference Object (DRO)](#generate-a-distribution-reference-object-dro), Soda uses bins and weights to take random samples from your DRO. Therefore, it is possible that the original dataset that you used to create the DRO resembles a different underlying distribution than the dataset that Soda creates by sampling from the DRO. To limit the impact of this possibility, Soda runs the tests in each distribution check ten times and returns the median of the results, either as a p-value or a distance metric).\
  \
  For example, if you use the Kolmogorov-Smirnov test and a threshold of 0.05, the distribution check uses the Kolmogorov-Smirnov test to compare ten different samples from your DRO to the data in your column. If the median of the returned p-values is smaller than 0.05, the check issues a warning. This approach does change the interpretation of the distribution check results. For example, the probability of a type I error is multiple orders of magnitude smaller than the significance level that you choose.
* For categorical columns, Soda fetches the aggregated calculated value counts of each category. If there are more than one million distinct categories, Soda skips the distribution check and issues a warning.

### Bins and weights

Soda uses the `bins` and `weights` to generate a sample from the reference distribution when it executes the distribution check during a scan. By creating a sample using the DRO's bins and weights, you do not have to save the entire – potentially very large - sample. The `distribution_type` value impacts how the weights and bins will be used to generate a sample, so make sure your choice reflects the nature of your data (continuous or categorical).

To compute the number of bins for a DRO, Soda uses different strategies based on whether outlier values are present in the dataset.

By default Soda automatically computes the number of bins for each DRO by taking the maximum of [Sturges](https://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width) and [Freedman Diaconis Estimator](https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule) methods. [numpy.histogram\_bin\_edges(data, bins='auto')](https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges) also applies this practice by default.

For datasets *with* outliers, such as in the example below, the default strategy does not work well. When taking the maximum of [Sturges](https://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width) and [Freedman Diaconis Estimator](https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule) methods, it produces a great number of bins, `3466808`, while there are only nine elements in the array. The outlier value `10e6` result in a misleading bin size.

```python
import numpy as np
arr = np.array([0, 0, 0, 1, 2, 3, 3, 4, 10e6])
number_of_bins = np.histogram_bin_edges(arr, bins='auto').size # return 3466808
```

If the number of bins is greater than the size of data, Soda uses [interquantile range (IQR)](https://en.wikipedia.org/wiki/Interquartile_range) to detect and filter the outliers. Basically, for data that is greater than `Q3 + 1.5 IQR` and less than `Q1 - 1.5 IQR` Soda removes the datasets, then recomputes the number of bins with the same method by taking the maximum of [Sturges](https://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width) and [Freedman Diaconis Estimator](https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule).

After removing the outliers, if the number of bins still exceeds the size of the filtered data, Soda takes the square root of the dataset size to set the number of bins. To cover edge cases, if the square root of the dataset size exceeds one million, then Soda sets the number of bins to one million to prevent it from generating too many bins.

### Define the sample size

You can add a `sample` parameter for both a distribution check and DRO to include a sample SQL clause that Soda passes when it executes the check during a scan.

#### Apply a sample to a distribution check for continuous columns

If the data to which you wish to apply distribution check does not fit in memory or involves a time constraint, use a `sample` to specify a SQL query that returns a sample of the data. The SQL query that you provide is specific to the type of data source you use. In the example below, the SQL query for a PostgreSQL data source randomly samples 50% of the data with seed 61. You can customize the `sample` SQL query to meet your needs.

Use `sample` for continuous values, only. For categorical values refer to [Distribution check details](#distribution-check-details).

```yaml
checks for dim_customer:
  - distribution_difference(budget) < 0.05:
      distribution reference file: ./dro_dim_customer.yml 
      method: ks
      # (optional) data source-specific sampling query; for example for postgres\
      # the following query randomly samples 50% of the data with seed 61
      sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)
```

#### Sampling Caveats

Some data sources do not have a built-in sampling function. For example, BigQuery does not support `TABLESAMPLE BERNOULLI`. In such a case, add a `filter` parameter to randomly obtain a sample of the data. The `filter` parameter applies a data source-specific SQL `WHERE` clause to the data. In the example below, the SQL query for a BigQuery data source randomly samples 50% of the data.

**Distribution Check**

```yaml
checks for dim_customer:
  - distribution_difference(number_cars_owned) > 0.05:
      distribution reference file: ./cars_owned_dist_ref.yml 
      method: chi_square
      # (optional) data source-specific sampling query, for example for postgres\
      # the following query randomly samples 50% of the data
      filter: rand() < 0.5
```

**DRO**

```yaml
dataset: your_dataset_name
column: column_name_in_dataset
distribution_type: categorical
# (optional) data source-specific sampling query; for example for postgres\
# the following query randomly samples 50% of the data
filter: rand() < 0.5
```

## Distribution check examples

You can define multiple distribution checks in a single `checks.yml` file. If you create a new DRO for another dataset and column in `sales_dist_ref.yml` for example, you can define two distribution checks in the same `checks.yml` file, as per the following.

```yaml
checks for dim_customer:
  - distribution_difference(number_cars_owned) > 0.05:
      method: chi_square
      distribution reference file: ./cars_owned_dist_ref.yml

checks for fact_sales_quota:
  - distribution_difference(calendar_quarter) < 0.2:
      method: psi
      distribution reference file: ./sales_dist_ref.yml
```

Alternatively you can define two DROs in `distribution_reference.yml`, naming them `cars_owned_dro` and `calendar_quarter_dro`, and use both in a single `checks.yml` file

```yaml
checks for dim_customer:
  - distribution_difference(number_cars_owned, cars_owned_dro) > 0.05:
      method: chi_square
      distribution reference file: ./distribution_reference.yml

checks for fact_sales_quota:
  - distribution_difference(calendar_quarter, calendar_quarter_dro) < 0.2:
      method: psi
      distribution reference file: ./distribution_reference.yml
```

You can also define multiple checks for different columns in the same dataset by generating multiple DROs for those columns. Refer to the following example.

```yaml
checks for dim_customer:
  - distribution_difference(number_cars_owned, cars_owned_dro) > 0.05:
      method: chi_square
      distribution reference file: ./distribution_reference.yml
   - distribution_difference(total_children, total_children_dro) < 0.2:
      method: psi
      distribution reference file: ./distribution_reference.yml

checks for fact_sales_quota:
  - distribution_difference(calendar_quarter, calendar_quarter_dro) < 0.2:
      method: psi
      distribution reference file: ./distribution_reference.yml
```

## Optional check configurations

<table><thead><tr><th width="100" align="center">Supported</th><th>Configuration</th><th>Documentation</th></tr></thead><tbody><tr><td align="center">✓</td><td>Define a name for a distribution check; see <a href="#example-with-check-name">example</a>.</td><td><a href="../optional-config#customize-check-names">Customize check names</a></td></tr><tr><td align="center">✓</td><td>Add an identity to a check.</td><td><a href="../optional-config#add-a-check-identity">Add a check identity</a></td></tr><tr><td align="center"> </td><td>Define alert configurations to specify warn and fail thresholds.</td><td>-</td></tr><tr><td align="center">✓</td><td>Apply an in-check filter to return results for a specific portion of the data in your dataset; see <a href="#example-with-in-check-filter">example</a>.</td><td><a href="../filters#configure-in-check-filters">Configure in-check filters</a></td></tr><tr><td align="center">✓</td><td>Use quotes when identifying dataset or column names; see <a href="#example-with-quotes">example</a>.<br>Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.</td><td><a href="../optional-config#use-quotes-in-a-check">Use quotes in a check</a></td></tr><tr><td align="center"> </td><td>Use wildcard characters ( % or * ) in values in the check.</td><td>-</td></tr><tr><td align="center">✓</td><td>Use for each to apply distribution checks to multiple datasets in one scan; see <a href="#example-with-for-each">example</a>.</td><td><a href="../optional-config#apply-checks-to-multiple-datasets">Apply checks to multiple datasets</a></td></tr><tr><td align="center">✓</td><td>Apply a dataset filter to partition data during a scan; see <a href="#example-with-dataset-filter">example</a>.</td><td><a href="../optional-config#scan-a-portion-of-your-dataset">Scan a portion of your dataset</a></td></tr><tr><td align="center">✓</td><td>Instruct Soda to collect random samples. Because sampling SQL clauses vary significantly between data sources, consult your data source’s documentation; see <a href="#example-with-in-check-sampling">example</a>.</td><td>-</td></tr></tbody></table>

#### Example with check name

```yaml
checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05: 
    method: chi_square
    distribution reference file: dist_ref.yml
    name: Distribution check
```

#### Example with quotes

```yaml
checks for dim_customer:
- distribution_difference("number_cars_owned") < 0.2:
    method: psi
    distribution reference file: dist_ref.yml
    name: Distribution check
```

#### Example with for each

```yaml
for each dataset T:
    dataset:
        - dim_customer
    checks:
    - distribution_difference(number_cars_owned) < 0.15:
        method: swd
        distribution reference file: dist_ref.yml
```

#### Example with in-check filter

```yaml
checks for dim_customer:
- distribution_difference(number_cars_owned) < 0.05: 
    method: swd
    distribution reference file: dist_ref.yml
    filter: date_first_purchase between '2010-01-01' and '2022-01-01'
```

#### Example with dataset filter

```yaml
filter dim_customer [first_purchase]:
  where: date_first_purchase between '2010-01-01' and '2022-01-01' 

checks for dim_customer [first_purchase]:
- distribution_difference(number_cars_owned) < 0.05: 
    method: swd
    distribution reference file: dist_ref.yml
```

#### Example with in-check sampling

The following example works for PostgreSQL data sources. It randomly samples 50% of the dataset with seed value 61.

```yaml
checks for dim_customer:
  - distribution_difference(number_cars_owned) > 0.05:
      distribution reference file: ./cars_owned_dist_ref.yml
      method: chi_square
      sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)
```

## List of comparison symbols and phrases

```
 = 
 < 
 >
 <=
 >=
 !=
 <> 
 between 
 not between 
```

## Troubleshoot Soda Scientific installation

While installing Soda Scientific works on Linux, you may encounter issues if you install Soda Scientific on Mac OS (particularly, machines with the M1 ARM-based processor) or any other operating system. If that is the case, consider using one of the following alternative installation procedures.

* [Install Soda Scientific locally](#install-soda-scientific-locally)
* [Use Docker to run Soda Library](#use-docker-to-run-soda-library)

> Need help? Ask the team in the [Soda community on Slack](https://community.soda.io/slack).

### Install Soda Scientific Locally

1. Set up a virtual environment, and install Soda Library in your new virtual environment.
2. Use the following command to install Soda Scientific.

```
pip install -i https://pypi.cloud.soda.io soda-scientific
```

<details>

<summary>List of Soda Scientific dependencies</summary>

* pandas<2.0.0
* wheel
* pydantic>=1.8.1,<2.0.0
* scipy>=1.8.0
* numpy>=1.23.3, <2.0.0
* inflection==0.5.1
* httpx>=0.18.1,<2.0.0
* PyYAML>=5.4.1,<7.0.0
* cython>=0.22
* prophet>=1.1.0,<2.0.0

</details>

Refer to [Troubleshoot Soda Scientific installation](https://docs.soda.io/soda-documentation/soda-v3/anomaly-score#troubleshoot-soda-scientific-installation) for help with issues during installation.

### Use Docker to run Soda Library

Use Soda’s Docker image in which Soda Scientific is pre-installed. You need Soda Scientific to be able to use SodaCL [distribution checks](https://docs.soda.io/soda-documentation/soda-v3/sodacl-reference/distribution) or [anomaly detection checks](https://docs.soda.io/soda-documentation/soda-v3/sodacl-reference/anomaly-detection).

1. If you have not already done so, [install Docker](https://docs.docker.com/get-docker/) in your local environment.
2. From Terminal, run the following command to pull Soda Library’s official Docker image; adjust the version to reflect the most [recent release](https://docs.soda.io/soda-documentation/soda-v3/release-notes/soda-library).

   ```
   docker pull sodadata/soda-library:v1.0.3
   ```
3. Verify the pull by running the following command.

   ```sh
   docker run sodadata/soda-library:v1.0.3 --help
   ```

   Output:

   ```sh
    Usage: soda [OPTIONS] COMMAND [ARGS]...

      Soda Library CLI version 1.0.x, Soda Core CLI version 3.0.xx

    Options:
      --version  Show the version and exit.
      --help     Show this message and exit.

    Commands:
      ingest           Ingests test results from a different tool
      scan             Runs a scan
      suggest          Generates suggestions for a dataset
      test-connection  Tests a connection
      update-dro       Updates contents of a distribution reference file
   ```

   When you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.

   ```sh
   WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
   ```
4. When you are ready to run a Soda scan, use the following command to run the scan via the docker image. Replace the placeholder values with your own file paths and names.

   ```sh
   docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
   ```

   Optionally, you can specify the version of Soda Library to use to execute the scan. This may be useful when you do not wish to use the latest released version of Soda Library to run your scans. The example scan command below specifies Soda Library version 1.0.0.

   ```sh
   docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library:v1.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
   ```

<details>

<summary>What does the scan command do?</summary>

* `docker run` ensures that the docker engine runs a specific image.
* `-v` mounts your SodaCL files into the container. In other words, it makes the configuration.yml and checks.yml files in your local environment available to the docker container. The command example maps your local directory to `/sodacl` inside of the docker container.
* `sodadata/soda-library` refers to the image that `docker run` must use.
* `scan` instructs Soda Library to execute a scan of your data.
* `-d` indicates the name of the data source to scan.
* `-c` specifies the filepath and name of the configuration YAML file.

</details>

#### **Error: Mounts denied**

If you encounter the following error, follow the procedure below.

```sh
docker: Error response from daemon: Mounts denied: 
The path /soda-library-test/files is not shared from the host and is not known to Docker.
You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
See https://docs.docker.com/desktop/mac for more info.
```

You need to give Docker permission to acccess your configuration.yml and checks.yml files in your environment. To do so:

1. Access your Docker Dashboard, then select Preferences (gear symbol).
2. Select Resources, then follow the [Docker instructions](https://www.docker.com/blog/file-sharing-with-docker-desktop/) to add your Soda project directory – the one you use to store your configuration.yml and checks.yml files – to the list of directories that can be bind-mounted into Docker containers.
3. Click Apply & Restart, then repeat steps 2 - 4 above.

<br>

#### **Error: Configuration path does not exist**

If you encounter the following error, double check the syntax of the scan command in step 4 above.

* Be sure to prepend `/sodacl/` to both the congifuration.yml filepath and the checks.yml filepath.
* Be sure to mount your files into the container by including the `-v` option. For example, `-v /Users/MyName/soda_project:/sodacl`.

```sh
Soda Library 1.0.x
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist
Scan summary:
No checks found, 0 checks evaluated.
2 errors.
Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
ERRORS:
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist
```

## Go further

* Reference [tips and best practices for SodaCL](https://docs.soda.io/soda-documentation/soda-v3/soda-cl-overview/quick-start-sodacl#tips-and-best-practices-for-sodacl).
* Use a [freshness check](https://docs.soda.io/soda-documentation/soda-v3/sodacl-reference/freshness) to gauge how recently your data was captured.
* Use [reference checks](https://docs.soda.io/soda-documentation/soda-v3/sodacl-reference/reference) to compare the values of one column to another.

> Need help? Join the [Soda community on Slack](https://community.soda.io/slack).
