Install Soda Library

From your command-line interface, execute a pip install command to install Soda Library in your environment.

Soda Library is a Python library and command-line interface (CLI) tool that enables Data Engineers to test the data in a data source to surface invalid, missing, or unexpected data.

As a step in the Get started roadmap, this guide offers instructions to set up, install, and configure Soda in a self-operated deployment model.

Get started roadmap

  1. Choose a flavor of Soda

  2. Set up Soda: self-operated 📍 You are here!

     a. Review requirements

     b. Install Soda Library

     c. Configure Soda

  3. Write SodaCL checks

  4. Run scans and review results

  5. Organize, alert, investigate

💡 TL;DR: Follow a 15-min tutorial to set up and run Soda with example data.

Requirements

To use Soda Library, you must have installed the following on your system.

  • Python 3.8, 3.9, or 3.10. To check your existing version, use the CLI command: python --version or python3 --version. If you have not already installed Python, consider using pyenv to manage multiple versions of Python in your environment.

  • Pip 21.0 or greater. To check your existing version, use the CLI command: pip --version

  • A Soda Cloud account; see next section.

Python versions Soda supports

Soda officially supports Python versions 3.8, 3.9, and 3.10. Soda is largely functional on Python 3.11 and 3.12, but efforts to fully support those versions are ongoing.

With Python 3.11, some users might encounter issues with dependency constraints. At times, the combination of Python 3.11 and dependency constraints requires that a dependency be built from source rather than downloaded pre-built.

The same applies to Python 3.12, although anecdotal evidence indicates that 3.12 might not work in all scenarios due to dependency constraints.
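As a quick preflight, the version guidance above can be expressed as a small check. This is a minimal sketch and not part of Soda Library: the function name classify_python and the labels it returns are hypothetical, and the version sets only restate what this guide says.

```python
import sys

# Versions this guide lists as officially supported.
SUPPORTED = {(3, 8), (3, 9), (3, 10)}
# Versions this guide describes as largely functional but not fully supported.
EXPERIMENTAL = {(3, 11), (3, 12)}


def classify_python(version_info=sys.version_info):
    """Return 'supported', 'experimental', or 'unsupported' for a Python version."""
    key = (version_info[0], version_info[1])
    if key in SUPPORTED:
        return "supported"
    if key in EXPERIMENTAL:
        return "experimental"
    return "unsupported"


if __name__ == "__main__":
    major, minor = sys.version_info[0], sys.version_info[1]
    print(f"Python {major}.{minor}: {classify_python()}")
```

Running the script with the interpreter you plan to use in your virtual environment reports which category that interpreter falls into.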

Create a Soda Cloud account

  1. In a browser, navigate to cloud.soda.io/signup to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.

  2. Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys.

  3. Copy+paste the API key values to a temporary, secure place in your local environment.

Why do I need a Soda Cloud account?

To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.

Install Soda Library

  1. Best practice dictates that you install the Soda Library CLI using a virtual environment. In your command-line interface tool, create a virtual environment in the .venv directory using the commands below. Depending on your version of Python, you may need to replace python with python3 in the first command.

python -m venv .venv
source .venv/bin/activate

  2. Upgrade pip inside your new virtual environment.

pip install --upgrade pip

  3. Execute the following command, replacing soda-postgres with the install package that matches the type of data source you use to store data.

# For bash interactive shell
pip install -i https://pypi.cloud.soda.io soda-postgres
# For zsh interactive shell
pip install -i https://pypi.cloud.soda.io "soda-postgres"

     Data source                 Install package
     --------------------------  ----------------------
     Amazon Athena               soda-athena
     Amazon Redshift             soda-redshift
     Apache Spark DataFrames*    soda-spark-df
     Azure Synapse               soda-sqlserver
     ClickHouse                  soda-mysql
     Dask and Pandas             soda-pandas-dask
     Databricks                  soda-spark[databricks]
     Denodo                      soda-denodo
     Dremio                      soda-dremio
     DuckDB                      soda-duckdb
     GCP BigQuery                soda-bigquery
     Google CloudSQL             soda-postgres
     IBM DB2                     soda-db2
     Local file                  Use Dask.
     MotherDuck                  soda-duckdb
     MS SQL Server               soda-sqlserver
     MySQL                       soda-mysql
     OracleDB                    soda-oracle
     PostgreSQL                  soda-postgres
     Presto                      soda-presto
     Snowflake                   soda-snowflake
     Trino                       soda-trino
     Vertica                     soda-vertica

     * For use with programmatic Soda scans, only.

To deactivate the virtual environment, use the following command:

deactivate
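If you script your environment setup, the install command can be assembled from the data source name. The following is a minimal sketch under stated assumptions: the PACKAGES dictionary copies only a few rows from the table in this guide, and the names PACKAGES, SODA_INDEX_URL, and pip_install_command are hypothetical helpers, not part of Soda Library.

```python
import shlex
import sys

# Package index used by the install commands in this guide.
SODA_INDEX_URL = "https://pypi.cloud.soda.io"

# A subset of the data source table in this guide; extend as needed.
PACKAGES = {
    "PostgreSQL": "soda-postgres",
    "Snowflake": "soda-snowflake",
    "GCP BigQuery": "soda-bigquery",
    "Databricks": "soda-spark[databricks]",
    "MS SQL Server": "soda-sqlserver",
}


def pip_install_command(data_source, python=sys.executable):
    """Build the pip install command (as an argument list) for a data source."""
    package = PACKAGES[data_source]  # raises KeyError for unknown data sources
    return [python, "-m", "pip", "install", "-i", SODA_INDEX_URL, package]


if __name__ == "__main__":
    # shlex.join quotes arguments such as soda-spark[databricks] for the shell.
    print(shlex.join(pip_install_command("Databricks", python="python")))
```

Passing the argument list to subprocess.run avoids shell quoting entirely, which sidesteps the bash and zsh difference for package names that contain square brackets.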

Troubleshoot

As of version 1.7.0, Soda Library packages include Pydantic version 2 for data validation. If your systems require the use of Pydantic version 1, you can install an extra package that uses Pydantic version 1. To do so, use the following command, adjusting the type of library to correspond with your data source.

# For bash interactive shell
pip install -i https://pypi.cloud.soda.io soda-postgres[pydanticv1]

# For zsh interactive shell
pip install -i https://pypi.cloud.soda.io "soda-postgres[pydanticv1]"

Configure Soda

  1. Soda Library connects with Spark DataFrames in a unique way, using programmatic scans.

    • If you are using Spark DataFrames, follow the configuration details in Connect to Spark.

    • If you are not using Spark DataFrames, continue to step 2.

  2. In the same directory and environment in which you installed Soda Library, use a code editor to create a new configuration.yml file. This file stores connection details for your data sources and your Soda Cloud account. Use the data source-specific connection configurations (see: Data source reference) to copy+paste the connection syntax into your file, then adjust the values to correspond with your data source’s details, as in the following example for PostgreSQL.

    • You can use system variables to pass sensitive values, if you wish.

    • If you want to run scans on multiple schemas in the data source, add one data source config block per schema.

       data_source my_datasource:
         type: postgres
         host: localhost
         username: postgres
         password: secret
         database: postgres
         schema: public

  3. Copy+paste the following soda_cloud configuration syntax into the configuration.yml file, as in the example below. Input the API key values you created in Soda Cloud.

    • Do not nest the soda_cloud configuration under the data_source configuration.

    • For host, use cloud.soda.io for EU region; use cloud.us.soda.io for USA region, according to your selection when you created your Soda Cloud account.

    • Optionally, provide a value for the scheme property to indicate which scheme to use to initialize the URI instance. If you do not explicitly include a scheme property, Soda uses the default https.

         soda_cloud:
           # Use cloud.soda.io for EU region
           # Use cloud.us.soda.io for US region
           host: cloud.soda.io
           api_key_id: 2e0ba0cb-your-api-key-7b
           api_key_secret: 5wd-your-api-key-secret-aGuRg
           scheme:

  4. Save the configuration.yml file. Run the following command to confirm that Soda can successfully connect with your data source.

      soda test-connection -d my_datasource -c configuration.yml
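The region and host guidance in step 3 can be sketched as a small generator for the soda_cloud block. This is an illustration only: the function soda_cloud_block and the SODA_CLOUD_HOSTS mapping are hypothetical names, and the ${API_KEY} and ${API_SECRET} references assume you provide the key values as system variables, as described later in this guide.

```python
# Hosts per region, as stated in this guide.
SODA_CLOUD_HOSTS = {"EU": "cloud.soda.io", "US": "cloud.us.soda.io"}


def soda_cloud_block(region, key_id_var="API_KEY", secret_var="API_SECRET"):
    """Render a top-level soda_cloud block that references system variables."""
    host = SODA_CLOUD_HOSTS[region.upper()]  # raises KeyError for other regions
    lines = [
        "soda_cloud:",  # starts at column 0: not nested under a data source
        f"  host: {host}",
        f"  api_key_id: ${{{key_id_var}}}",
        f"  api_key_secret: ${{{secret_var}}}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(soda_cloud_block("EU"), end="")
```

The block is rendered at the top level of the file, in keeping with the instruction not to nest soda_cloud under a data source, and it omits the optional scheme property so that the default applies.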

Provide credentials as system variables

If you wish, you can provide data source login credentials or any of the properties in the configuration YAML file as system variables instead of storing the values directly in the file. System variables persist only for as long as you have the terminal session open in which you created the variable. For a longer-term solution, consider using permanent environment variables stored in your ~/.bash_profile or ~/.zprofile files.

For connection configuration values

  1. From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following command to define a system variable for your password.

export POSTGRES_PASSWORD=1234

  2. Test that the system retrieves the value that you set by running an echo command.

echo $POSTGRES_PASSWORD

  3. In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.

data_source my_database_name:
  type: postgres
  host: soda-temp-demo
  port: '5432'
  username: sodademo
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public

  4. Save the configuration YAML file, then run the following command to confirm that Soda Library connects to your data source without issue.

soda test-connection -d my_database_name -c configuration.yml
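Before you test the connection, it can help to confirm that every ${VARIABLE} reference in the configuration file has a value in the current terminal session. The sketch below illustrates the idea with the Python standard library; it is not how Soda Library resolves variables internally, and the function name resolve_env_refs is hypothetical.

```python
import os
import re

# Matches references of the form ${NAME}, as used in the example above.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_refs(text, env=None):
    """Replace ${NAME} references with values from env; fail if any are unset."""
    env = os.environ if env is None else env
    missing = sorted({name for name in _VAR.findall(text) if name not in env})
    if missing:
        raise KeyError("unset variables: " + ", ".join(missing))
    return _VAR.sub(lambda match: env[match.group(1)], text)


if __name__ == "__main__":
    with open("configuration.yml", encoding="utf-8") as handle:
        resolve_env_refs(handle.read())
    print("All referenced system variables are set.")
```

A KeyError that names the unset variables usually means the export command ran in a different terminal session than the one you are using now.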

For API key values

  1. From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following commands to define system variables for your API key ID and API key secret.

export API_KEY=1234
export API_SECRET=5678

  2. Test that the system retrieves the value that you set by running an echo command.

echo $API_KEY

  3. In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.

    data_source my_database_name:
      type: postgres
      host: soda-temp-demo
      port: '5432'
      username: sodademo
      password: ${POSTGRES_PASSWORD}
      database: postgres
      schema: public
    
    soda_cloud:
      host: cloud.soda.io
      api_key_id: ${API_KEY}
      api_key_secret: ${API_SECRET}

  4. Save the configuration YAML file, then run the following command to confirm that Soda Library connects to Soda Cloud without issue.

soda test-connection -d my_database_name -c configuration.yml

Next

  1. Choose a flavor of Soda

  2. Set up Soda: self-operated

  3. Write SodaCL checks

  4. Run scans and review results

  5. Organize, alert, investigate

Need help? Join the Soda community on Slack.
