
Choose a flavor of Soda

Last modified on 26-Apr-24

Soda is a lightweight, versatile tool for testing and monitoring data quality, and you have several options for deploying it in your environment.

As the first step in the Get started roadmap, this guide helps you decide how to set up Soda to best meet your data quality testing and monitoring needs. After choosing a flavor of Soda (type of deployment model), access the corresponding Set up Soda instructions below.

Get started roadmap

  1. Choose a flavor of Soda 📍 You are here!
  2. Set up Soda: sign up and install, deploy, or invoke
  3. Write SodaCL checks
  4. Run scans and review results
  5. Organize, alert, investigate

Choose a flavor of Soda

See also: Soda product overview.

You can set up Soda in one of four flavors:

| Flavor | Description | Soda Library | Soda Agent | Soda Cloud |
|---|---|---|---|---|
| Self-operated | A simple setup in which you install Soda Library locally and connect it to Soda Cloud via API keys. | ✓ | | ✓ |
| Soda-hosted agent (Recommended) | A setup in which you manage data quality entirely from your Soda Cloud account. | | | ✓ |
| Self-hosted agent | A setup in which you deploy a Soda Agent in a Kubernetes cluster in a cloud-services environment and connect it to Soda Cloud via different API keys. | | ✓ | ✓ |
| Programmatic | A setup in which you invoke Soda Library programmatically. | ✓ | | ✓ |

Why do I need a Soda Cloud account?

To validate your account license or free trial, Soda Library or a Soda Agent must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library or a Soda Agent.


Self-operated

In this simple setup, you install Soda Library from the command line using pip, then prepare YAML files to:

  • configure connections to your data sources to run scans
  • configure the connection to your Soda Cloud account to validate your license and visualize and share data quality check results
  • write data quality checks
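As a rough illustration of the YAML files listed above, the following Python sketch writes a configuration file and a checks file to a temporary directory, then prints the `soda scan` command that would use them. The data source name `my_datasource`, the dataset `dim_customer`, the `email` column, the PostgreSQL connection values, and the environment variable names are all hypothetical placeholders; consult the setup instructions for the exact settings your data source requires.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

# Hypothetical names, used for illustration only.
DATA_SOURCE = "my_datasource"
DATASET = "dim_customer"

# Connection settings for one data source plus the Soda Cloud account.
# ${...} values are placeholders to be resolved from environment variables.
CONFIGURATION_YML = f"""\
data_source {DATA_SOURCE}:
  type: postgres
  host: localhost
  port: "5432"
  username: ${{POSTGRES_USERNAME}}
  password: ${{POSTGRES_PASSWORD}}
  database: postgres
  schema: public

soda_cloud:
  host: cloud.soda.io
  api_key_id: ${{SODA_API_KEY_ID}}
  api_key_secret: ${{SODA_API_KEY_SECRET}}
"""

# Two simple SodaCL checks against the hypothetical dataset.
CHECKS_YML = f"""\
checks for {DATASET}:
  - row_count > 0
  - missing_count(email) = 0
"""


def write_soda_files(directory: Path) -> Tuple[Path, Path]:
    """Write configuration.yml and checks.yml into the given directory."""
    config_path = directory / "configuration.yml"
    checks_path = directory / "checks.yml"
    config_path.write_text(CONFIGURATION_YML)
    checks_path.write_text(CHECKS_YML)
    return config_path, checks_path


def scan_command(config_path: Path, checks_path: Path) -> List[str]:
    """Build the command-line invocation that runs a scan with these files."""
    return ["soda", "scan", "-d", DATA_SOURCE,
            "-c", str(config_path), str(checks_path)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config, checks = write_soda_files(Path(tmp))
        print(" ".join(scan_command(config, checks)))
```

In practice, most teams write these two files by hand and keep them in version control; the script form here only shows how the pieces relate to each other.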

Use this setup for:

  • A small team: Manage data quality within a small data engineering or data analytics team that is comfortable working with the command line and YAML files to design and execute scans for data quality.
  • POC: Conduct a proof-of-concept evaluation of Soda as a data quality testing and monitoring tool. See: Take a sip of Soda
  • Basic DQ: Start from scratch to set up basic data quality checks on key datasets. See: Check suggestions

Requirements:

  • Python 3.8 or greater
  • Pip 21.0 or greater
  • Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)
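Before installing, it can help to confirm that the local environment meets the Python and pip minimums listed above. The following is a minimal sketch that uses only the Python standard library; the helper names `parse_major_minor`, `meets_minimum`, and `check_environment` are invented for this example and are not part of Soda.

```python
import re
import sys
from importlib import metadata
from typing import List, Tuple

MIN_PYTHON = (3, 8)   # Python 3.8 or greater
MIN_PIP = (21, 0)     # pip 21.0 or greater


def parse_major_minor(version_text: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '23.3.1'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_text!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def meets_minimum(version_text: str, minimum: Tuple[int, int]) -> bool:
    """Return True if the version is at least the required (major, minor)."""
    return parse_major_minor(version_text) >= minimum


def check_environment() -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.8 or greater is required")
    try:
        pip_version = metadata.version("pip")
    except metadata.PackageNotFoundError:
        problems.append("pip is not installed in this environment")
    else:
        if not meets_minimum(pip_version, MIN_PIP):
            problems.append(f"pip 21.0 or greater is required, found {pip_version}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("Environment looks ready." if not issues else "\n".join(issues))
```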



Soda-hosted agent

Recommended

This setup provides a secure, out-of-the-box Soda Agent to manage access to data sources from within your Soda Cloud account. Quickly configure connections to your data sources in the Soda Cloud user interface, then empower all your colleagues to explore datasets, access check results, customize collections, and create their own no-code checks for data quality.

Use this setup for:

  • A quick start: Use the out-of-the-box agent to start testing data quality right away from within the Soda Cloud user interface, without the need to install or deploy any other tools.
  • Self-serve data quality: Empower data analysts and scientists to self-serve and create their own no-code checks for data quality. See: Self-serve Soda
  • Automated data monitoring: Set up data profiling and automated data quality monitoring. See: Automate monitoring
  • Data catalog integration: Integrate Soda with a data catalog such as Atlan, Alation, or Metaphor. See: Integrate Soda

Soda hosts agents in a secure environment in Amazon Web Services (AWS). As a SOC 2 Type 2 certified business, Soda responsibly manages Soda-hosted agents to ensure that they remain private, secure, and independent of all other hosted agents. See Data security and privacy for details.

Requirements:

  • Login credentials for your data source (BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake)



Self-hosted agent

This setup enables a data or infrastructure engineer to deploy Soda Library as an agent in a Kubernetes cluster within a cloud-services environment such as Google Cloud Platform, Azure, or AWS.

The engineer can manage access to data sources while giving Soda Cloud end-users easy access to Soda check results and enabling them to write their own checks for data quality. Users connect to data sources and create no-code checks for data quality directly in the Soda Cloud user interface.

Use this setup for:

  • Self-serve data quality: Empower data analysts and scientists to self-serve and create their own checks for data quality. See: Self-serve Soda
  • Data migration: Migrate good-quality data from one data source to another. See: Test before data migration
  • Automated data monitoring: Set up data profiling and automated data quality monitoring. See: Automate monitoring
  • Data catalog integration: Integrate Soda with a data catalog such as Atlan, Alation, or Metaphor. See: Integrate Soda
  • Secrets manager integration: Integrate your Soda Agent with an external secrets manager to securely access frequently rotated data source login credentials. See: Integrate with a Secrets Manager

Requirements:

  • Access to your cloud-services environment, plus the authorization to deploy containerized apps in a new or existing Kubernetes cluster
  • Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)



Programmatic

Use this setup to invoke Soda programmatically in, for example, an Airflow DAG or GitHub Workflow. You provide connection details for data sources and Soda Cloud inline or in external YAML files, and similarly define data quality checks inline or in a separate YAML file.
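A programmatic invocation might look roughly like the sketch below, which reads connection details from an external YAML file and defines the checks inline. It assumes Soda Library is installed and exposes the `Scan` class from `soda.scan`; the data source name `my_datasource`, the file name `configuration.yml`, and the dataset `dim_customer` are hypothetical placeholders.

```python
# Inline SodaCL checks for a hypothetical dataset.
CHECKS = """\
checks for dim_customer:
  - row_count > 0
"""


def run_scan(data_source: str = "my_datasource",
             configuration_file: str = "configuration.yml",
             checks: str = CHECKS) -> int:
    """Run one scan and return its exit code."""
    # Imported lazily so this module can be loaded (for example, by an
    # orchestrator parsing the file) where Soda Library is not installed.
    from soda.scan import Scan

    scan = Scan()
    scan.set_data_source_name(data_source)
    # Connection details for the data source and Soda Cloud come from YAML.
    scan.add_configuration_yaml_file(file_path=configuration_file)
    # Checks are supplied inline here; a separate YAML file also works.
    scan.add_sodacl_yaml_str(checks)

    exit_code = scan.execute()
    print(scan.get_logs_text())
    return exit_code
```

A task in an orchestrator could call `run_scan()` and decide what to do next based on the returned exit code.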

Use this setup for:

  • Testing during development: Test data before and after ingestion and transformation during development. See: Test data during development
  • Circuit-breaking in a pipeline: Test data in a pipeline to enable circuit breaking, which prevents bad-quality data from having a downstream impact. See: Test data in production
  • Databricks Notebook: Invoke Soda data quality scans in a Databricks Notebook. See: Add Soda to a Databricks notebook
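The circuit-breaking use case above can be sketched as a small decision function that a pipeline step calls after a scan completes. This example assumes the scan reports an integer exit code in which 0 means all checks passed, 1 means at least one check warned, and higher values mean a check failed or the scan hit an error; confirm the exact codes for your Soda version before relying on them. The names `should_halt` and `DataQualityError` are invented for this illustration.

```python
class DataQualityError(RuntimeError):
    """Raised to stop a pipeline when a scan reports bad-quality data."""


def should_halt(exit_code: int, halt_on_warning: bool = False) -> bool:
    """Decide whether downstream steps must be blocked.

    Assumed exit codes: 0 = pass, 1 = warning, 2 or more = failure or error.
    """
    if exit_code == 0:
        return False
    if exit_code == 1:
        return halt_on_warning
    return True


def guard(exit_code: int, halt_on_warning: bool = False) -> None:
    """Raise DataQualityError if the pipeline should not continue."""
    if should_halt(exit_code, halt_on_warning):
        raise DataQualityError(f"Scan finished with exit code {exit_code}")
```

Treating warnings as non-blocking by default keeps the pipeline running for minor issues, while `halt_on_warning=True` offers a stricter mode for critical datasets.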

Requirements:

  • Python 3.8 or greater
  • Pip 21.0 or greater
  • Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)



Next

  1. Choose a flavor of Soda
  2. Set up Soda. Select the setup instructions that correspond with your flavor of Soda.
  3. Write SodaCL checks
  4. Run scans and review results
  5. Organize, alert, investigate

Need help? Join the Soda community on Slack.



Documentation always applies to the latest version of Soda products