Install Soda Library
From your command-line interface, execute a pip install command to install Soda Library in your environment.
The Soda environment has been updated since this tutorial.
Refer to v4 documentation for updated tutorials.
Soda Library is a Python library and command-line interface (CLI) tool that enables Data Engineers to test the data in a data source to surface invalid, missing, or unexpected data.
As a step in the Get started roadmap, this guide offers instructions to set up, install, and configure Soda in a self-operated deployment model.
Get started roadmap
Choose a flavor of Soda
Set up Soda: self-operated 📍 You are here!
a. Review requirements
b. Install Soda Library
c. Configure Soda
Write SodaCL checks
Run scans and review results
Organize, alert, investigate
💡 TL;DR: Follow a 15-min tutorial to set up and run Soda with example data.
Requirements
To use Soda Library, you must have installed the following on your system.
Python 3.8, 3.9, or 3.10. To check your existing version, use the CLI command:
python --version
or
python3 --version
If you have not already installed Python, consider using pyenv to manage multiple versions of Python in your environment.
Pip 21.0 or greater. To check your existing version, use the CLI command:
pip --version
A Soda Cloud account; see next section.
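The version requirements above can also be checked from a script. The following is a minimal sketch, not part of Soda Library: the function names `python_supported` and `pip_supported` are hypothetical, and the pip check assumes the usual `pip <version> from <path>` output format of `pip --version`.

```python
import re
import subprocess
import sys

# Versions listed in the requirements above.
SUPPORTED_PYTHON = {(3, 8), (3, 9), (3, 10)}
MIN_PIP = (21, 0)


def python_supported(version_info=sys.version_info):
    """Return True if the major.minor version is one of those listed in this guide."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


def pip_supported(version_output):
    """Parse the text printed by `pip --version` and compare it against 21.0."""
    match = re.search(r"pip (\d+)\.(\d+)", version_output)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= MIN_PIP


if __name__ == "__main__":
    # Ask the pip that belongs to the current interpreter for its version.
    result = subprocess.run(
        [sys.executable, "-m", "pip", "--version"],
        capture_output=True,
        text=True,
    )
    print("Python supported:", python_supported())
    print("pip supported:", pip_supported(result.stdout))
```

Running the script inside the environment you intend to use for Soda reports on that environment's interpreter and pip, which may differ from the system-wide ones.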
Create a Soda Cloud account
In a browser, navigate to cloud.soda.io/signup to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.
Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys.
Copy+paste the API key values to a temporary, secure place in your local environment.
Install Soda Library
Best practice dictates that you install the Soda Library CLI using a virtual environment. In your command-line interface tool, create a virtual environment in the .venv directory using the commands below. Depending on your version of Python, you may need to replace python with python3 in the first command.
python -m venv .venv
source .venv/bin/activate
Upgrade pip inside your new virtual environment.
pip install --upgrade pip
Execute the following command, replacing soda-postgres with the install package that matches the type of data source you use to store data.
# For bash interactive shell
pip install -i https://pypi.cloud.soda.io soda-postgres
# For zsh interactive shell
pip install -i https://pypi.cloud.soda.io "soda-postgres"
Data source: install package
Amazon Athena: soda-athena
Amazon Redshift: soda-redshift
Apache Spark DataFrames (For use with programmatic Soda scans, only.): soda-spark-df
Azure Synapse: soda-sqlserver
ClickHouse: soda-mysql
Dask and Pandas: soda-pandas-dask
Databricks: soda-spark[databricks]
Denodo: soda-denodo
Dremio: soda-dremio
DuckDB: soda-duckdb
GCP BigQuery: soda-bigquery
Google CloudSQL: soda-postgres
IBM DB2: soda-db2
Local file: Use Dask.
MotherDuck: soda-duckdb
MS SQL Server: soda-sqlserver
MySQL: soda-mysql
OracleDB: soda-oracle
PostgreSQL: soda-postgres
Presto: soda-presto
Snowflake: soda-snowflake
Trino: soda-trino
Vertica: soda-vertica
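If you script your installs, the package list above can be kept as a lookup. The following is a minimal sketch under that assumption: `SODA_PACKAGES` and `pip_install_args` are hypothetical names, not part of Soda Library, and the package names are copied from the list above. The "Local file" entry is omitted because the list names no package of its own for it.

```python
# Package names copied from the list in this guide.
SODA_PACKAGES = {
    "Amazon Athena": "soda-athena",
    "Amazon Redshift": "soda-redshift",
    "Apache Spark DataFrames": "soda-spark-df",
    "Azure Synapse": "soda-sqlserver",
    "ClickHouse": "soda-mysql",
    "Dask and Pandas": "soda-pandas-dask",
    "Databricks": "soda-spark[databricks]",
    "Denodo": "soda-denodo",
    "Dremio": "soda-dremio",
    "DuckDB": "soda-duckdb",
    "GCP BigQuery": "soda-bigquery",
    "Google CloudSQL": "soda-postgres",
    "IBM DB2": "soda-db2",
    "MotherDuck": "soda-duckdb",
    "MS SQL Server": "soda-sqlserver",
    "MySQL": "soda-mysql",
    "OracleDB": "soda-oracle",
    "PostgreSQL": "soda-postgres",
    "Presto": "soda-presto",
    "Snowflake": "soda-snowflake",
    "Trino": "soda-trino",
    "Vertica": "soda-vertica",
}

# Package index used by the install commands in this guide.
PYPI_INDEX = "https://pypi.cloud.soda.io"


def pip_install_args(data_source):
    """Return the pip argument list for a data source; raises KeyError if it is not listed."""
    package = SODA_PACKAGES[data_source]
    return ["pip", "install", "-i", PYPI_INDEX, package]


if __name__ == "__main__":
    print(" ".join(pip_install_args("PostgreSQL")))
```

Passing the result as an argument list to `subprocess.run` bypasses the shell, so a package name with brackets such as `soda-spark[databricks]` needs no extra quoting.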
To deactivate the virtual environment, use the following command:
deactivate
Troubleshoot
As of version 1.7.0, Soda Library packages include Pydantic version 2 for data validation. If your systems require the use of Pydantic version 1, you can install an extra package that uses Pydantic version 1. To do so, use the following command, adjusting the type of library to correspond with your data source.
#bash
pip install -i https://pypi.cloud.soda.io soda-postgres[pydanticv1]
#zsh
pip install -i https://pypi.cloud.soda.io "soda-spark-df[pydanticv1]"
Configure Soda
Soda Library connects with Spark DataFrames in a unique way, using programmatic scans.
If you are using Spark DataFrames, follow the configuration details in Connect to Spark.
If you are not using Spark DataFrames, continue to step 2.
In the same directory and environment in which you installed Soda Library, use a code editor to create a new configuration.yml file. This file stores connection details for your data sources and your Soda Cloud account. Use the data source-specific connection configurations (see: Data source reference) to copy+paste the connection syntax into your file, then adjust the values to correspond with your data source’s details, as in the following example for PostgreSQL.
You can use system variables to pass sensitive values, if you wish.
If you want to run scans on multiple schemas in the data source, add one data source config block per schema.
data_source my_datasource:
  type: postgres
  host: localhost
  username: postgres
  password: secret
  database: postgres
  schema: public
Copy+paste the following soda_cloud configuration syntax into the configuration.yml file, as in the example below. Input the API key values you created in Soda Cloud.
Do not nest the soda_cloud configuration under the data_source configuration.
For host, use cloud.soda.io for EU region; use cloud.us.soda.io for USA region, according to your selection when you created your Soda Cloud account.
Optionally, provide a value for the scheme property to indicate which scheme to use to initialize the URI instance. If you do not explicitly include a scheme property, Soda uses the default https.
soda_cloud:
  # Use cloud.soda.io for EU region
  # Use cloud.us.soda.io for US region
  host: cloud.soda.io
  api_key_id: 2e0ba0cb-your-api-key-7b
  api_key_secret: 5wd-your-api-key-secret-aGuRg
  scheme:
Save the configuration.yml file. Run the following command to confirm that Soda can successfully connect with your data source.
soda test-connection -d my_datasource -c configuration.yml
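The configuration steps above (one data source block per schema, plus a top-level soda_cloud block whose host depends on region) can be sketched in Python. This is an illustrative generator, not a Soda feature. The function names and the `<base>_<schema>` naming scheme are assumptions made for the example, and the connection values simply mirror the PostgreSQL examples in this guide.

```python
# Soda Cloud hosts per region, as described in the steps above.
SODA_CLOUD_HOSTS = {"EU": "cloud.soda.io", "US": "cloud.us.soda.io"}


def render_data_source(name, schema):
    """Return one PostgreSQL data_source block for a single schema."""
    return (
        f"data_source {name}:\n"
        "  type: postgres\n"
        "  host: localhost\n"
        "  username: postgres\n"
        "  password: ${POSTGRES_PASSWORD}\n"
        "  database: postgres\n"
        f"  schema: {schema}\n"
    )


def render_soda_cloud(region):
    """Return the soda_cloud block, kept at the top level (not nested)."""
    return (
        "soda_cloud:\n"
        f"  host: {SODA_CLOUD_HOSTS[region]}\n"
        "  api_key_id: ${API_KEY}\n"
        "  api_key_secret: ${API_SECRET}\n"
    )


def render_configuration(base_name, schemas, region):
    """Join one data_source block per schema with a single soda_cloud block."""
    # Hypothetical naming scheme: <base_name>_<schema>.
    blocks = [render_data_source(f"{base_name}_{s}", s) for s in schemas]
    blocks.append(render_soda_cloud(region))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(render_configuration("my_datasource", ["public", "sales"], "EU"))
```

Each generated data source name can then be passed to `soda test-connection` with the `-d` option, one schema at a time.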
Provide credentials as system variables
If you wish, you can provide data source login credentials or any of the properties in the configuration YAML file as system variables instead of storing the values directly in the file. System variables persist only for as long as you have the terminal session open in which you created the variable. For a longer-term solution, consider using permanent environment variables stored in your ~/.bash_profile or ~/.zprofile files.
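Because system variables disappear when the terminal session closes, it can help to check which variables a configuration file expects before you run Soda. The following is a minimal sketch of such a pre-flight check. It is not how Soda itself resolves variables; `missing_variables` is a hypothetical helper that only looks for the `${NAME}` placeholder syntax used in this guide.

```python
import os
import re

# Matches placeholders of the form ${NAME}, as used in configuration.yml.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_variables(config_text, environ=None):
    """Return the placeholder names in config_text that are not set in environ."""
    environ = os.environ if environ is None else environ
    missing = []
    for name in PLACEHOLDER.findall(config_text):
        # Report each unset variable once, in order of first appearance.
        if name not in environ and name not in missing:
            missing.append(name)
    return missing


if __name__ == "__main__":
    example = (
        "data_source my_database_name:\n"
        "  password: ${POSTGRES_PASSWORD}\n"
        "soda_cloud:\n"
        "  api_key_id: ${API_KEY}\n"
        "  api_key_secret: ${API_SECRET}\n"
    )
    print("Unset variables:", missing_variables(example))
```

An empty list means every placeholder in the file has a value in the current session.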
For connection configuration values
From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following command to define a system variable for your password.
export POSTGRES_PASSWORD=1234
Test that the system retrieves the value that you set by running an echo command.
echo $POSTGRES_PASSWORD
In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.
data_source my_database_name:
  type: postgres
  host: soda-temp-demo
  port: '5432'
  username: sodademo
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public
Save the configuration YAML file, then run the following command to confirm that Soda Library connects to your data source without issue.
soda test-connection -d my_database_name -c configuration.yml
For API key values
From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following command to define a system variable for your API key ID. Define a second variable, API_SECRET, in the same way to store the API key secret.
export API_KEY=1234
Test that the system retrieves the value that you set by running an echo command.
echo $API_KEY
In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.
data_source my_database_name:
  type: postgres
  host: soda-temp-demo
  port: '5432'
  username: sodademo
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public
soda_cloud:
  host: cloud.soda.io
  api_key_id: ${API_KEY}
  api_key_secret: ${API_SECRET}
Save the configuration YAML file, then run the following command to confirm that Soda Library connects to Soda Cloud without issue.
soda test-connection -d my_database_name -c configuration.yml
Next
Choose a flavor of Soda
Set up Soda: self-operated
Run scans and review results
Organize, alert, investigate
Need help? Join the Soda community on Slack.
