Install Soda Core
Last modified on 31-May-23
Soda Core is a command-line interface (CLI) tool that enables you to scan the data in your data source to surface invalid, missing, or unexpected data. Alternatively, you can use the Soda Core Python library to programmatically execute scans; see Define programmatic scans using Python.
Compatibility
Requirements
Install
Upgrade
Install Soda Core Scientific
Go further
Compatibility
Use Soda Core to scan a variety of data sources.
- Amazon Athena
- Amazon Redshift
- Apache Spark DataFrames¹
- Apache Spark for Databricks SQL
- Azure Synapse (Experimental)
- ClickHouse (Experimental)
- Dask and Pandas (Experimental)¹
- Denodo (Experimental)
- Dremio
- DuckDB (Experimental)
- GCP Big Query
- IBM DB2
- Local file using Dask¹
- MS SQL Server
- MySQL
- OracleDB
- PostgreSQL
- Snowflake
- Trino
- Vertica (Experimental)
¹ For use with programmatic Soda scans, only.
Requirements
To use Soda Core, you must have installed the following on your system.
- Python 3.8 or greater. To check your existing version, use the CLI command:
python --version
or
python3 --version
If you have not already installed Python, consider using pyenv to manage multiple versions of Python in your environment.
- Pip 21.0 or greater. To check your existing version, use the CLI command:
pip --version
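The two version checks above can also be scripted. The following is a minimal sketch, not part of Soda Core, that uses only the Python standard library. The helper names `parse_major_minor` and `check_requirements` are hypothetical, and the minimum versions are the ones listed in the requirements above.

```python
import re
import sys
from importlib import metadata

MIN_PYTHON = (3, 8)  # minimum Python version from the requirements above
MIN_PIP = (21, 0)    # minimum pip version from the requirements above


def parse_major_minor(version):
    """Return (major, minor) as integers from a string such as '23.1.2'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def check_requirements():
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:2])
        problems.append(f"Python {found} found, 3.8 or greater required")
    try:
        pip_version = metadata.version("pip")
    except metadata.PackageNotFoundError:
        problems.append("pip is not installed in this environment")
    else:
        if parse_major_minor(pip_version) < MIN_PIP:
            problems.append(f"pip {pip_version} found, 21.0 or greater required")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

Run the script with the same interpreter you intend to use for Soda Core. It prints nothing when both requirements are met.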
Install
macOS, Linux
- Best practice dictates that you install the Soda Core CLI using a virtual environment. In your command-line interface tool, create a virtual environment in the .venv directory using the commands below. Depending on your version of Python, you may need to replace python with python3 in the first command.
python -m venv .venv
source .venv/bin/activate
- Upgrade pip inside your new virtual environment.
pip install --upgrade pip
- Execute the following command, replacing soda-core-postgres with the install package that matches the type of data source you use to store data.
pip install soda-core-postgres
Data source | Install package |
---|---|
Amazon Athena | soda-core-athena |
Amazon Redshift | soda-core-redshift |
Apache Spark DataFrames (For use with programmatic Soda scans, only.) | soda-core-spark-df |
Azure Synapse (Experimental) | soda-core-sqlserver |
ClickHouse (Experimental) | soda-core-mysql |
Dask and Pandas (Experimental) | soda-core-pandas-dask |
Databricks | soda-core-spark[databricks] |
Denodo (Experimental) | soda-core-denodo |
Dremio | soda-core-dremio |
DuckDB (Experimental) | soda-core-duckdb |
GCP Big Query | soda-core-bigquery |
IBM DB2 | soda-core-db2 |
Local file | Use Dask. |
MS SQL Server | soda-core-sqlserver |
MySQL | soda-core-mysql |
OracleDB | soda-core-oracle |
PostgreSQL | soda-core-postgres |
Snowflake | soda-core-snowflake |
Trino | soda-core-trino |
Vertica (Experimental) | soda-core-vertica |
To deactivate the virtual environment, use the following command:
deactivate
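If you install Soda Core from a script, the table above can be expressed as a lookup. The following is an illustrative sketch only: the dictionary copies the package names from the table, and the helper `pip_install_args` is a hypothetical name, not part of Soda Core.

```python
# Install packages copied from the table above.
INSTALL_PACKAGES = {
    "Amazon Athena": "soda-core-athena",
    "Amazon Redshift": "soda-core-redshift",
    "Apache Spark DataFrames": "soda-core-spark-df",
    "Azure Synapse (Experimental)": "soda-core-sqlserver",
    "ClickHouse (Experimental)": "soda-core-mysql",
    "Dask and Pandas (Experimental)": "soda-core-pandas-dask",
    "Databricks": "soda-core-spark[databricks]",
    "Denodo (Experimental)": "soda-core-denodo",
    "Dremio": "soda-core-dremio",
    "DuckDB (Experimental)": "soda-core-duckdb",
    "GCP Big Query": "soda-core-bigquery",
    "IBM DB2": "soda-core-db2",
    "MS SQL Server": "soda-core-sqlserver",
    "MySQL": "soda-core-mysql",
    "OracleDB": "soda-core-oracle",
    "PostgreSQL": "soda-core-postgres",
    "Snowflake": "soda-core-snowflake",
    "Trino": "soda-core-trino",
    "Vertica (Experimental)": "soda-core-vertica",
}


def pip_install_args(data_source):
    """Return the pip command, as an argument list, for a data source name."""
    try:
        package = INSTALL_PACKAGES[data_source]
    except KeyError:
        raise ValueError(f"no install package listed for {data_source!r}") from None
    return ["pip", "install", package]


if __name__ == "__main__":
    print(" ".join(pip_install_args("PostgreSQL")))
```

Note that two pairs of data sources share a package in the table: Azure Synapse uses the MS SQL Server package, and ClickHouse uses the MySQL package.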
Windows
- Best practice dictates that you install the Soda Core CLI using a virtual environment. In your command-line interface tool, create a virtual environment in the .venv directory using the commands below. Depending on your version of Python, you may need to replace python with python3 in the first command. Reference the virtualenv documentation for activating a Windows script.
python -m venv .venv
.venv\Scripts\activate
- Upgrade pip inside your new virtual environment.
pip install --upgrade pip
- Execute the following command, replacing soda-core-postgres with the install package that matches the type of data source you use to store data.
pip install soda-core-postgres
Data source | Install package |
---|---|
Amazon Athena | soda-core-athena |
Amazon Redshift | soda-core-redshift |
Apache Spark DataFrame (For use with programmatic Soda scans, only.) | soda-core-spark-df |
Azure Synapse (Experimental) | soda-core-sqlserver |
ClickHouse (Experimental) | soda-core-mysql |
Dask and Pandas (Experimental) | soda-core-pandas-dask |
Databricks | soda-core-spark[databricks] |
Denodo (Experimental) | soda-core-denodo |
Dremio | soda-core-dremio |
DuckDB (Experimental) | soda-core-duckdb |
GCP Big Query | soda-core-bigquery |
IBM DB2 | soda-core-db2 |
MS SQL Server | soda-core-sqlserver |
MySQL | soda-core-mysql |
OracleDB | soda-core-oracle |
PostgreSQL | soda-core-postgres |
Snowflake | soda-core-snowflake |
Trino | soda-core-trino |
Vertica (Experimental) | soda-core-vertica |
To deactivate the virtual environment, use the following command:
deactivate
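The two virtual environment procedures above differ only in how the environment is activated. As a rough sketch of that difference, the hypothetical helper below returns the activation command shown in each procedure for a given virtual environment directory.

```python
from pathlib import PurePosixPath, PureWindowsPath


def activation_command(venv_dir=".venv", windows=False):
    """Return the activation command used in the install steps above."""
    if windows:
        # Windows: run the activate script in the Scripts directory.
        return str(PureWindowsPath(venv_dir) / "Scripts" / "activate")
    # Other platforms: source the activate script in the bin directory.
    return "source " + str(PurePosixPath(venv_dir) / "bin" / "activate")


if __name__ == "__main__":
    print(activation_command())
    print(activation_command(windows=True))
```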
Docker
Use Soda’s Docker image in which Soda Core Scientific is pre-installed.
- If you have not already done so, install Docker in your local environment.
- From Terminal, run the following command to pull the latest official Soda Core Docker image.
docker pull sodadata/soda-core
- Verify the pull by running the following command.
docker run sodadata/soda-core --help
Output:
Usage: soda [OPTIONS] COMMAND [ARGS]...
  Soda Core CLI version 3.0.xxx
Options:
  --help  Show this message and exit.
Commands:
  scan        runs a scan
  update-dro  updates a distribution reference file
When you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
- When you are ready to run a Soda scan, use the following command to run the scan via the docker image. Replace the placeholder values with your own file paths and names.
docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-core scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
Optionally, you can specify the version of Soda Core to use to execute the scan. This may be useful when you do not wish to use the latest released version of Soda Core to run your scans. The example scan command below specifies Soda Core version 3.0.0.
docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-core:v3.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
What does the scan command do?
- docker run ensures that the Docker engine runs a specific image.
- -v mounts your SodaCL files into the container. In other words, it makes the configuration.yml and checks.yml files in your local environment available to the Docker container. The command example maps your local directory to /sodacl inside of the Docker container.
- sodadata/soda-core refers to the image that docker run must use.
- scan instructs Soda Core to execute a scan of your data.
- -d indicates the name of the data source to scan.
- -c specifies the filepath and name of the configuration YAML file.
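To see how the parts of the scan command fit together, the sketch below assembles the same command as an argument list. It is an illustration under stated assumptions: `build_scan_command` is a hypothetical helper, not a Soda Core API, and it uses only the options and placeholder values that appear in the examples above.

```python
import shlex

CONTAINER_DIR = "/sodacl"     # mount point inside the container
IMAGE = "sodadata/soda-core"  # image name used in the examples above


def build_scan_command(soda_dir, data_source, configuration, checks, version=None):
    """Assemble the docker run scan command as a list of arguments."""
    # Append a tag such as v3.0.0 only when a specific version is requested.
    image = f"{IMAGE}:{version}" if version else IMAGE
    return [
        "docker", "run",
        "-v", f"{soda_dir}:{CONTAINER_DIR}",       # mount local SodaCL files
        image,
        "scan",
        "-d", data_source,                         # data source to scan
        "-c", f"{CONTAINER_DIR}/{configuration}",  # configuration YAML file
        f"{CONTAINER_DIR}/{checks}",               # checks YAML file
    ]


command = build_scan_command(
    "/path/to/your_soda_directory",
    "your_data_source",
    "your_configuration.yml",
    "your_checks.yml",
    version="v3.0.0",
)
print(shlex.join(command))
```

Building the command as a list keeps each option next to its value, which makes it harder to drop the /sodacl prefix from one of the file paths by accident.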
Error: Mounts denied
If you encounter the following error, follow the procedure below.
docker: Error response from daemon: Mounts denied:
The path /soda-core-test/files is not shared from the host and is not known to Docker.
You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
See https://docs.docker.com/desktop/mac for more info.
You need to give Docker permission to access your configuration.yml and checks.yml files in your environment. To do so:
- Access your Docker Dashboard, then select Preferences (gear symbol).
- Select Resources, then follow the Docker instructions to add your Soda project directory – the one you use to store your configuration.yml and checks.yml files – to the list of directories that can be bind-mounted into Docker containers.
- Click Apply & Restart, then repeat steps 2 - 4 above.
Error: Configuration path does not exist
If you encounter the error shown below, double-check the syntax of the scan command in step 4 above.
- Be sure to prepend /sodacl/ to both the configuration.yml filepath and the checks.yml filepath.
- Be sure to mount your files into the container by including the -v option. For example, -v /Users/MyName/soda_core_project:/sodacl.
Soda Core 3.0.xxx
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist
Scan summary:
No checks found, 0 checks evaluated.
2 errors.
Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
ERRORS:
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist
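Both causes of this error can be caught before Docker runs. The following is a hedged sketch of such a pre-flight check; `scan_command_problems` is a hypothetical helper that inspects a scan command given as an argument list, and it assumes that YAML files are recognizable by a .yml or .yaml extension.

```python
CONTAINER_PREFIX = "/sodacl/"  # prefix expected on paths inside the container


def scan_command_problems(argv):
    """Return a list of likely mistakes in a docker run scan command."""
    problems = []
    if "-v" not in argv:
        problems.append("no -v option: local files are not mounted into the container")
    for arg in argv:
        # Assumption: any argument ending in .yml or .yaml is a SodaCL file path.
        if arg.endswith((".yml", ".yaml")) and not arg.startswith(CONTAINER_PREFIX):
            problems.append(f"{arg} is not prefixed with {CONTAINER_PREFIX}")
    return problems


if __name__ == "__main__":
    # A command that would produce the error above: no mount, no /sodacl/ prefix.
    bad_command = [
        "docker", "run", "sodadata/soda-core", "scan",
        "-d", "your_data_source", "-c", "configuration.yml", "checks.yml",
    ]
    for problem in scan_command_problems(bad_command):
        print(problem)
```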
Upgrade
To upgrade your existing Soda Core tool to the latest version, use the following command, replacing soda-core-redshift with the install package that matches the type of data source you are using.
pip install soda-core-redshift -U
Install Soda Core Scientific
Install Soda Core Scientific to be able to use SodaCL distribution checks or anomaly score checks.
You have two installation options to choose from:
- Install Soda Core Scientific in a virtual environment (Recommended)
- Use Docker to run Soda Core with Soda Scientific
Install Soda Core Scientific in a virtual environment (Recommended)
- Set up a virtual environment, as described in the Soda Core install documentation.
- Install Soda Core in your new virtual environment.
- Use the following command to install Soda Core Scientific.
pip install soda-core-scientific
Note that installing the Soda Core Scientific package also installs several scientific dependencies. Reference the soda-core-scientific setup file in the public GitHub repository for details.
Error: Library not loaded
If you have defined an anomaly score check and you use an M1 macOS machine, you may get a Library not loaded: @rpath/libtbb.dylib error. This is a known issue in the macOS community and is caused by issues during the installation of the prophet library. There are currently no official workarounds or releases to fix the problem, but the following adjustments may address the issue.
- Install soda-core-scientific as per the virtual environment installation instructions and activate the virtual environment.
- Use the following command to navigate to the directory in which the stan_model of the prophet package is installed in your virtual environment.
cd path_to_your_python_virtual_env/lib/python<your_version>/site-packages/prophet/stan_model/
For example, if you have created a Python virtual environment in a /venvs directory in your home directory and you use Python 3.9, you would use the following command.
cd ~/venvs/soda-core-prophet11/lib/python3.9/site-packages/prophet/stan_model/
- Use the ls command to determine the version number of cmdstan that prophet installed. The cmdstan directory name includes the version number.
ls
cmdstan-2.26.1 prophet_model.bin
- Add the rpath of the tbb library to your prophet installation using the following command.
install_name_tool -add_rpath @executable_path/cmdstan-<your_cmdstan_version>/stan/lib/stan_math/lib/tbb prophet_model.bin
With cmdstan version 2.26.1, you would use the following command.
install_name_tool -add_rpath @executable_path/cmdstan-2.26.1/stan/lib/stan_math/lib/tbb prophet_model.bin
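The last two steps, reading the cmdstan version from the directory name and substituting it into the install_name_tool command, can be sketched in Python as follows. The helper names are hypothetical, and the sketch assumes that the stan_model directory contains exactly one directory whose name starts with cmdstan-, as in the ls output above. It only builds the command; it does not run it.

```python
from pathlib import Path


def find_cmdstan_dir(stan_model_dir):
    """Return the name of the single cmdstan-* directory in stan_model_dir."""
    matches = sorted(
        path.name for path in Path(stan_model_dir).glob("cmdstan-*") if path.is_dir()
    )
    if len(matches) != 1:
        raise RuntimeError(f"expected exactly one cmdstan-* directory, found {matches}")
    return matches[0]


def add_rpath_command(cmdstan_dir_name):
    """Build the install_name_tool command for the given cmdstan directory."""
    return [
        "install_name_tool",
        "-add_rpath",
        f"@executable_path/{cmdstan_dir_name}/stan/lib/stan_math/lib/tbb",
        "prophet_model.bin",
    ]


if __name__ == "__main__":
    # With the directory name from the ls example above:
    print(" ".join(add_rpath_command("cmdstan-2.26.1")))
```

In practice you would call find_cmdstan_dir on the stan_model directory from the earlier step, then run the resulting command from inside that directory.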
Use Docker to run Soda Core Scientific
Use Soda’s Docker image in which Soda Core Scientific is pre-installed. The procedure, including the scan command and the troubleshooting notes for the Mounts denied and Configuration path does not exist errors, is identical to the Docker instructions in the Install section above.
Go further
- Next: Configure your newly-installed Soda Core to connect to your data source.
- Need help? Join the Soda community on Slack.