Install Soda Library

Last modified on 27-Sep-23

Soda Core, the free, open-source Python library and CLI tool that Soda Library extends, continues to exist as an OSS project on GitHub, including all Soda Core documentation.

Migrate to Soda Library to connect to Soda Cloud and access all the newest Soda features.

Soda Library is a Python library and command-line interface (CLI) tool that enables Data Engineers to test the data in a data source to surface invalid, missing, or unexpected data.

Compatibility
Requirements
Install
Upgrade
Migrate from Soda Core
Install Soda Scientific
Uninstall
Go further

Compatibility

Use Soda Library to scan a variety of data sources.

Amazon Athena
Amazon Redshift
Apache Spark DataFrames¹
Apache Spark for Databricks SQL
Azure Synapse (Experimental)
ClickHouse (Experimental)
Dask and Pandas (Experimental)¹
Denodo (Experimental)
Dremio
DuckDB (Experimental)
GCP BigQuery
Google CloudSQL
IBM DB2
Local file using Dask¹
MS SQL Server
MySQL
OracleDB
PostgreSQL
Snowflake
Trino
Vertica (Experimental)

¹ For use with programmatic Soda scans only.

Requirements

To use Soda Library, you must have installed the following on your system.

  • Python 3.8 or greater. To check your existing version, use the CLI command: python --version or python3 --version
    If you have not already installed Python, consider using pyenv to manage multiple versions of Python in your environment.
  • Pip 21.0 or greater. To check your existing version, use the CLI command: pip --version
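
The two version requirements above can be checked in one pass. The following is a minimal shell sketch, assuming that python3 and pip are the command names on your system; version_ge and check_tool are hypothetical helpers written for this example and are not part of Soda Library.

```shell
#!/bin/sh

# version_ge HAVE WANT: succeed when dotted version HAVE is >= WANT.
# Hypothetical helper; fields are compared numerically, so 3.10 ranks above 3.8.
version_ge() {
  have=$1
  want=$2
  while [ -n "$have" ] || [ -n "$want" ]; do
    h=${have%%.*}
    w=${want%%.*}
    # Advance to the remaining fields, if any.
    case $have in *.*) have=${have#*.} ;; *) have= ;; esac
    case $want in *.*) want=${want#*.} ;; *) want= ;; esac
    # Drop non-numeric suffixes such as "rc1"; treat missing fields as 0.
    h=${h%%[!0-9]*}
    w=${w%%[!0-9]*}
    h=${h:-0}
    w=${w:-0}
    if [ "$h" -gt "$w" ]; then return 0; fi
    if [ "$h" -lt "$w" ]; then return 1; fi
  done
  return 0
}

# check_tool LABEL VERSION MINIMUM: report whether VERSION satisfies MINIMUM.
check_tool() {
  if version_ge "$2" "$3"; then
    echo "$1 $2 meets the minimum of $3"
  else
    echo "$1 $2 is older than the required $3"
  fi
}

# The version number is the second word of each tool's --version output.
if command -v python3 >/dev/null 2>&1; then
  check_tool Python "$(python3 --version 2>&1 | awk '{print $2}')" 3.8
fi
if command -v pip >/dev/null 2>&1; then
  check_tool pip "$(pip --version | awk '{print $2}')" 21.0
fi
```

A plain string comparison would rank 3.10 below 3.8, which is why the helper compares each field as a number.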

Install

For MacOS or Linux:

  1. As a best practice, install the Soda Library CLI in a virtual environment. In your command-line interface tool, create a virtual environment in the .venv directory using the commands below. Depending on your version of Python, you may need to replace python with python3 in the first command.
    python -m venv .venv
    source .venv/bin/activate
    
  2. Upgrade pip inside your new virtual environment.
    pip install --upgrade pip
    
  3. Execute the following command, replacing soda-postgres with the install package that matches the type of data source you use to store data.
    pip install -i https://pypi.cloud.soda.io soda-postgres
    
Data source                                         Install package
Amazon Athena                                       soda-athena
Amazon Redshift                                     soda-redshift
Apache Spark DataFrames (programmatic scans only)   soda-spark-df
Azure Synapse (Experimental)                        soda-sqlserver
ClickHouse (Experimental)                           soda-mysql
Dask and Pandas (Experimental)                      soda-pandas-dask
Databricks                                          soda-spark[databricks]
Denodo (Experimental)                               soda-denodo
Dremio                                              soda-dremio
DuckDB (Experimental)                               soda-duckdb
GCP BigQuery                                        soda-bigquery
Google CloudSQL                                     soda-postgres
IBM DB2                                             soda-db2
Local file                                          Use Dask.
MS SQL Server                                       soda-sqlserver
MySQL                                               soda-mysql
OracleDB                                            soda-oracle
PostgreSQL                                          soda-postgres
Snowflake                                           soda-snowflake
Trino                                               soda-trino
Vertica (Experimental)                              soda-vertica
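
To illustrate step 3, the sketch below looks up the install package for a data source and prints the matching pip command instead of running it. The function names and the lowercase keywords are hypothetical shorthand chosen for this example; only the package names and the index URL come from this page.

```shell
#!/bin/sh

# soda_package KEYWORD: print the install package for a data source.
# Keywords are hypothetical shorthand; package names come from the table above.
soda_package() {
  case $1 in
    athena)             echo "soda-athena" ;;
    redshift)           echo "soda-redshift" ;;
    spark-df)           echo "soda-spark-df" ;;
    synapse|sqlserver)  echo "soda-sqlserver" ;;
    clickhouse|mysql)   echo "soda-mysql" ;;
    dask|pandas)        echo "soda-pandas-dask" ;;
    databricks)         echo "soda-spark[databricks]" ;;
    denodo)             echo "soda-denodo" ;;
    dremio)             echo "soda-dremio" ;;
    duckdb)             echo "soda-duckdb" ;;
    bigquery)           echo "soda-bigquery" ;;
    cloudsql|postgres)  echo "soda-postgres" ;;
    db2)                echo "soda-db2" ;;
    oracle)             echo "soda-oracle" ;;
    snowflake)          echo "soda-snowflake" ;;
    trino)              echo "soda-trino" ;;
    vertica)            echo "soda-vertica" ;;
    *) echo "unknown data source: $1" >&2; return 1 ;;
  esac
}

# soda_install_cmd KEYWORD: print the install command so you can review it first.
soda_install_cmd() {
  pkg=$(soda_package "$1") || return 1
  echo "pip install -i https://pypi.cloud.soda.io '$pkg'"
}

soda_install_cmd databricks
```

The single quotes around the package name matter for soda-spark[databricks]: shells such as zsh treat unquoted square brackets as a filename pattern.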

To deactivate the virtual environment, use the following command:

deactivate

For Windows:

  1. As a best practice, install the Soda Library CLI in a virtual environment. In your command-line interface tool, create a virtual environment in the .venv directory using the commands below. Depending on your version of Python, you may need to replace python with python3 in the first command. Refer to the virtualenv documentation for details on activating the environment with a Windows script.
    python -m venv .venv
    .venv\Scripts\activate
    
  2. Upgrade pip inside your new virtual environment.
    pip install --upgrade pip
    
  3. Execute the following command, replacing soda-postgres with the install package that matches the type of data source you use to store data.
    pip install -i https://pypi.cloud.soda.io soda-postgres
    
Data source                                         Install package
Amazon Athena                                       soda-athena
Amazon Redshift                                     soda-redshift
Apache Spark DataFrames (programmatic scans only)   soda-spark-df
Azure Synapse (Experimental)                        soda-sqlserver
ClickHouse (Experimental)                           soda-mysql
Dask and Pandas (Experimental)                      soda-pandas-dask
Databricks                                          soda-spark[databricks]
Denodo (Experimental)                               soda-denodo
Dremio                                              soda-dremio
DuckDB (Experimental)                               soda-duckdb
GCP BigQuery                                        soda-bigquery
Google CloudSQL                                     soda-postgres
IBM DB2                                             soda-db2
MS SQL Server                                       soda-sqlserver
MySQL                                               soda-mysql
OracleDB                                            soda-oracle
PostgreSQL                                          soda-postgres
Snowflake                                           soda-snowflake
Trino                                               soda-trino
Vertica (Experimental)                              soda-vertica

To deactivate the virtual environment, use the following command:

deactivate

With Docker:

Use Soda’s Docker image, in which Soda Scientific is pre-installed.

  1. If you have not already done so, install Docker in your local environment.
  2. From Terminal, run the following command to pull Soda Library’s official Docker image; adjust the version to reflect the most recent release.
    docker pull sodadata/soda-library:v1.0.3
    
  3. Verify the pull by running the following command.
    docker run sodadata/soda-library:v1.0.3 --help
    

    Output:

     Usage: soda [OPTIONS] COMMAND [ARGS]...
    
       Soda Library CLI version 1.0.x, Soda Core CLI version 3.0.xx
    
     Options:
       --version  Show the version and exit.
       --help     Show this message and exit.
    
     Commands:
       ingest           Ingests test results from a different tool
       scan             Runs a scan
       suggest          Generates suggestions for a dataset
       test-connection  Tests a connection
       update-dro       Updates contents of a distribution reference file
    

    When you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.

    WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
    
  4. When you are ready to run a Soda scan, use the following command to run the scan via the docker image. Replace the placeholder values with your own file paths and names.
    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    

    Optionally, you can specify the version of Soda Library to use to execute the scan. This may be useful when you do not wish to use the latest released version of Soda Library to run your scans. The example scan command below specifies Soda Library version 1.0.0.

    docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library:v1.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
    
What does the scan command do?
  • docker run ensures that the docker engine runs a specific image.
  • -v mounts your SodaCL files into the container. In other words, it makes the configuration.yml and checks.yml files in your local environment available to the docker container. The command example maps your local directory to /sodacl inside of the docker container.
  • sodadata/soda-library refers to the image that docker run must use.
  • scan instructs Soda Library to execute a scan of your data.
  • -d indicates the name of the data source to scan.
  • -c specifies the filepath and name of the configuration YAML file.
  • The final argument specifies the filepath and name of the checks YAML file.
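
The parts of the scan command described above can be assembled from a few values. The sketch below prints the command rather than running it; soda_docker_scan_cmd is a hypothetical helper written for this example, and it assumes that both YAML files sit directly inside the directory you mount.

```shell
#!/bin/sh

# soda_docker_scan_cmd DIR DATA_SOURCE CONFIG CHECKS [TAG]
# Hypothetical helper: prints (does not run) the docker scan command.
# CONFIG and CHECKS are file names inside DIR; the helper prepends the
# /sodacl mount point so that the paths resolve inside the container.
soda_docker_scan_cmd() {
  dir=$1
  data_source=$2
  config=$3
  checks=$4
  tag=${5:-}
  image="sodadata/soda-library"
  # An optional tag pins the Soda Library version, for example v1.0.0.
  if [ -n "$tag" ]; then
    image="$image:$tag"
  fi
  echo "docker run -v $dir:/sodacl $image scan -d $data_source -c /sodacl/$config /sodacl/$checks"
}

soda_docker_scan_cmd /path/to/your_soda_directory your_data_source your_configuration.yml your_checks.yml v1.0.0
```

Printing the command first makes it easy to confirm that both file paths begin with /sodacl/ before you run it.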


Error: Mounts denied

If you encounter the following error, follow the procedure below.

docker: Error response from daemon: Mounts denied: 
The path /soda-library-test/files is not shared from the host and is not known to Docker.
You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
See https://docs.docker.com/desktop/mac for more info.

You need to give Docker permission to access your configuration.yml and checks.yml files in your environment. To do so:

  1. Access your Docker Dashboard, then select Preferences (gear symbol).
  2. Select Resources, then follow the Docker instructions to add your Soda project directory (the one you use to store your configuration.yml and checks.yml files) to the list of directories that can be bind-mounted into Docker containers.
  3. Click Apply & Restart, then repeat steps 2 through 4 of the Docker procedure above.


Error: Configuration path does not exist

If you encounter the following error, double check the syntax of the scan command in step 4 above.

Soda Library 1.0.x
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist
Scan summary:
No checks found, 0 checks evaluated.
2 errors.
Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
ERRORS:
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist

  • Be sure to prepend /sodacl/ to both the configuration.yml filepath and the checks.yml filepath.
  • Be sure to mount your files into the container by including the -v option. For example, -v /Users/MyName/soda_project:/sodacl.
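
One way to catch this error early is to confirm that both files exist in the local directory before mounting it. The following is a minimal sketch; check_soda_files is a hypothetical helper written for this example, and the directory and file names are placeholders.

```shell
#!/bin/sh

# check_soda_files DIR CONFIG CHECKS
# Hypothetical pre-flight check: confirms that both YAML files exist in the
# local directory you intend to mount with -v, before you call docker run.
check_soda_files() {
  status=0
  for name in "$2" "$3"; do
    if [ ! -f "$1/$name" ]; then
      echo "missing: $1/$name" >&2
      status=1
    fi
  done
  return $status
}

if check_soda_files /path/to/your_soda_directory your_configuration.yml your_checks.yml; then
  echo "Both files found; mount with -v /path/to/your_soda_directory:/sodacl"
fi
```

If a file is reported missing, correct the local path first; the container can only see files that exist in the mounted directory.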



Upgrade

To upgrade your existing Soda Library tool to the latest version, use the following command, replacing soda-redshift with the install package that matches the type of data source you are using.

pip install -i https://pypi.cloud.soda.io soda-redshift -U

Migrate from Soda Core

Soda Core, the free, open-source Python library and CLI tool upon which Soda Library is built, continues to exist as an OSS project on GitHub. To migrate from an existing Soda Core installation to Soda Library, uninstall the old packages and install the new ones from the command line.

  1. Uninstall your existing Soda Core packages using the following command.
    pip freeze | grep soda | xargs pip uninstall -y
    
  2. Install a Soda Library package that corresponds to your data source. Your new package automatically comes with a 45-day free trial. Our Soda team will contact you with licensing options after the trial period.
    pip install -i https://pypi.cloud.soda.io soda-postgres
    
  3. If you had connected Soda Core to Soda Cloud, you do not need to change anything for Soda Library to work with your Soda Cloud account.
    If you had not connected Soda Core to Soda Cloud, you need to connect Soda Library to Soda Cloud. Soda Library requires API keys to validate licensing or trial status and run scans for data quality. See Configure Soda Library for instructions.
  4. You do not need to adjust your existing configuration.yml or checks.yml files; they continue to work as before.
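
Before you uninstall in step 1, it can help to note which Soda Library packages you will need in step 2. The sketch below reads pip freeze output and suggests a replacement for each Soda Core data source package. It assumes that Soda Core packages are named soda-core-<name> and that the matching Soda Library package is soda-<name>; treat each suggestion as a hint and confirm it against the install package table. The function name and the sample version numbers are hypothetical.

```shell
#!/bin/sh

# suggest_library_packages: read `pip freeze` output on stdin and print the
# Soda Library package assumed to replace each Soda Core data source package.
# Assumption: soda-core-<name> maps to soda-<name>. The bare soda-core
# package has no data source suffix and is skipped.
suggest_library_packages() {
  sed -n 's/^soda-core-\([A-Za-z0-9-]*\)==.*/soda-\1/p'
}

# Example with sample freeze output; on a real system, use:
#   pip freeze | suggest_library_packages
printf 'requests==2.31.0\nsoda-core==3.0.0\nsoda-core-postgres==3.0.0\n' | suggest_library_packages
```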

Install Soda Scientific

Install Soda Scientific to be able to use SodaCL distribution checks or anomaly score checks.

You have two installation options to choose from: install in a virtual environment, as described in the steps below, or use Docker, as described in Use Docker to run Soda Scientific.

  1. Set up a virtual environment, as described in the Install section above.
  2. Install Soda Library in your new virtual environment.
  3. Use the following command to install Soda Scientific.
    pip install -i https://pypi.cloud.soda.io soda-scientific

List of Soda Scientific dependencies
  • pandas<2.0.0
  • wheel
  • pydantic>=1.8.1,<2.0.0
  • scipy>=1.8.0
  • numpy>=1.23.3, <2.0.0
  • inflection==0.5.1
  • httpx>=0.18.1,<2.0.0
  • PyYAML>=5.4.1,<7.0.0
  • cython>=0.22
  • prophet>=1.1.0,<2.0.0


Error: Library not loaded

If you have defined an anomaly score check and you use an M1 MacOS machine, you may get a Library not loaded: @rpath/libtbb.dylib error. This is a known issue in the MacOS community and is caused by issues during the installation of the prophet library. There currently are no official workarounds or releases to fix the problem, but the following adjustments may address the issue.

  1. Install soda-scientific as per the virtual environment installation instructions and activate the virtual environment.
  2. Use the following command to navigate to the directory in which the stan_model of the prophet package is installed in your virtual environment.
    cd path_to_your_python_virtual_env/lib/pythonyour_version/site-packages/prophet/stan_model/
    

    For example, if you have created a Python virtual environment named soda-library-prophet11 in a venvs directory in your home directory and you use Python 3.9, you would use the following command.

    cd ~/venvs/soda-library-prophet11/lib/python3.9/site-packages/prophet/stan_model/
    
  3. Use the ls command to determine the version number of cmdstan that prophet installed. The cmdstan directory name includes the version number.
    ls
    cmdstan-2.26.1		prophet_model.bin
    
  4. Add the rpath of the tbb library to your prophet installation using the following command.
    install_name_tool -add_rpath @executable_path/cmdstan-your_cmdstan_version/stan/lib/stan_math/lib/tbb prophet_model.bin
    

    With cmdstan version 2.26.1, you would use the following command.

    install_name_tool -add_rpath @executable_path/cmdstan-2.26.1/stan/lib/stan_math/lib/tbb prophet_model.bin
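
Steps 3 and 4 can be combined so that the cmdstan version is read from the directory name rather than typed by hand. The sketch below only prints the install_name_tool command for you to review and run yourself; tbb_rpath_cmd is a hypothetical helper written for this example, and it assumes the cmdstan directory sits directly inside stan_model, as in the ls output above.

```shell
#!/bin/sh

# tbb_rpath_cmd STAN_MODEL_DIR
# Hypothetical helper: finds the cmdstan-<version> directory inside
# STAN_MODEL_DIR and prints the matching install_name_tool command.
tbb_rpath_cmd() {
  for candidate in "$1"/cmdstan-*; do
    # If nothing matches, the pattern stays literal and fails the -d test.
    if [ -d "$candidate" ]; then
      echo "install_name_tool -add_rpath @executable_path/$(basename "$candidate")/stan/lib/stan_math/lib/tbb prophet_model.bin"
      return 0
    fi
  done
  echo "no cmdstan-* directory found in $1" >&2
  return 1
}

# Example: inspect the current directory (run this from stan_model/).
tbb_rpath_cmd . || echo "Change to the stan_model directory and try again."
```

Run the printed command from inside the stan_model directory, because it refers to prophet_model.bin by its relative name.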
    

Use Docker to run Soda Scientific

Use Soda’s Docker image in which Soda Scientific is pre-installed.

Follow the Docker procedure in the Install section above: pull the sodadata/soda-library image, verify the pull, then run your scan through the image. The troubleshooting guidance there for the Mounts denied and Configuration path does not exist errors applies here as well.


Uninstall

  1. (Optional) From the command-line, run the following command to determine which Soda packages exist in your environment.
    pip freeze | grep soda
    
  2. (Optional) Run the following command to uninstall a specific Soda package from your environment.
    pip uninstall soda-postgres
    
  3. Run the following command to uninstall all Soda packages from your environment, completely.
    pip freeze | grep soda | xargs pip uninstall -y
    

Go further


Documentation always applies to the latest version of Soda products