Choose a flavor of Soda
Last modified on 30-Nov-23
Soda is a lightweight, versatile tool for testing and monitoring data quality, and you have several options for deploying it in your environment.
As the first step in the Get started roadmap, this guide helps you decide how to set up Soda to best meet your data quality testing and monitoring needs. After choosing a flavor of Soda (type of deployment model), access the corresponding Set up Soda instructions below.
Get started roadmap
- Choose a flavor of Soda 📍 You are here!
- Set up Soda: install, deploy, or invoke
- Write SodaCL checks
- Run scans and review results
- Organize, alert, investigate
Choose a flavor of Soda
Optionally, access a full Soda product overview.
You can set up Soda in one of three (soon to be four!) flavors:
Flavor | Description | Soda Library | Soda Agent | Soda Cloud |
---|---|---|---|---|
Self-operated | A simple setup in which you install Soda Library locally and connect it to Soda Cloud via API keys. | ✅ | | ✅ |
Self-hosted agent | Recommended. A setup in which you deploy a Soda Agent in a Kubernetes cluster in a cloud-services environment and connect it to Soda Cloud via different API keys. | | ✅ | ✅ |
Fully-managed SaaS | Coming soon! A setup in which you manage data quality entirely from your Soda Cloud account. | | | ✅ |
Programmatic | A setup in which you invoke Soda Library programmatically. | ✅ | | ✅ |
Why do I need a Soda Cloud account?
To validate your account license or free trial, Soda Library or a Soda Agent must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library or a Soda Agent.
Self-operated
This simple setup enables you to pip install Soda Library from the command line, then prepare YAML files to:
- configure connections to your data sources to run scans
- configure the connection to your Soda Cloud account to validate your license and visualize and share data quality check results
- write data quality checks
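As a rough illustration of the three kinds of YAML content listed above, the sketch below writes a configuration file and a checks file from Python. The data source name `my_warehouse`, the dataset `orders`, the column `order_id`, the PostgreSQL connection fields, and the environment variable names are all placeholders chosen for this example, not values from this guide; adjust them to match your own data source.

```python
from pathlib import Path
from textwrap import dedent
from typing import List

# Placeholder connection details (assumptions for illustration only).
# The ${...} values are meant to be resolved from environment variables,
# so that no secrets are stored in the file itself.
CONFIGURATION_YML = dedent("""\
    data_source my_warehouse:
      type: postgres
      host: localhost
      port: 5432
      username: ${POSTGRES_USERNAME}
      password: ${POSTGRES_PASSWORD}
      database: analytics
      schema: public

    soda_cloud:
      host: cloud.soda.io
      api_key_id: ${SODA_API_KEY_ID}
      api_key_secret: ${SODA_API_KEY_SECRET}
""")

# Two simple checks against a hypothetical "orders" dataset.
CHECKS_YML = dedent("""\
    checks for orders:
      - row_count > 0
      - missing_count(order_id) = 0
""")


def write_soda_files(target_dir: Path) -> List[Path]:
    """Write both YAML files into target_dir and return their paths."""
    target_dir.mkdir(parents=True, exist_ok=True)
    files = {
        "configuration.yml": CONFIGURATION_YML,
        "checks.yml": CHECKS_YML,
    }
    written = []
    for name, content in files.items():
        path = target_dir / name
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    for path in write_soda_files(Path("soda_demo")):
        print(path)
```

With files like these in place, a scan would typically be started from the command line with something along the lines of `soda scan -d my_warehouse -c configuration.yml checks.yml`; refer to the Set up Soda instructions for the exact steps for your data source.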
Use this setup for:
✅ A small team: Manage data quality within a small data engineering team or data analytics team that is comfortable working with the command line and YAML files to design and execute scans for data quality.
✅ POC: Conduct a proof-of-concept evaluation of Soda as a data quality testing and monitoring tool. See: Take a sip of Soda
✅ Basic DQ: Start from scratch to set up basic data quality checks on key datasets. See: Check suggestions
Requirements:
- Python 3.8 or greater
- Pip 21.0 or greater
- Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)
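The Python and pip requirements above can be verified before installing anything. The following is a minimal sketch that uses only the standard library; the function names are invented for this example, and the minimum versions are the ones stated in the list.

```python
import re
import subprocess
import sys
from typing import Optional, Tuple

MIN_PYTHON = (3, 8)   # from the requirements list
MIN_PIP = (21, 0)     # from the requirements list


def parse_version(text: str) -> Tuple[int, int]:
    """Return the first major.minor pair found in a version string."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(found: Tuple[int, int], minimum: Tuple[int, int]) -> bool:
    """Compare (major, minor) tuples; tuples compare element by element."""
    return found >= minimum


def installed_pip_version() -> Optional[Tuple[int, int]]:
    """Ask the pip that belongs to this interpreter for its version."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "--version"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # pip is not available for this interpreter
    return parse_version(result.stdout)


if __name__ == "__main__":
    python_ok = meets_minimum(sys.version_info[:2], MIN_PYTHON)
    pip_version = installed_pip_version()
    pip_ok = pip_version is not None and meets_minimum(pip_version, MIN_PIP)
    print(f"Python requirement met: {python_ok}")
    print(f"pip requirement met: {pip_ok}")
```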
Self-hosted agent
Recommended
This setup enables a data or infrastructure engineer to deploy Soda Library as an agent in a Kubernetes cluster within a cloud-services environment such as Google Cloud Platform, Azure, or AWS.
The engineer can manage access to data sources while giving Soda Cloud end-users easy access to Soda check results and enabling them to write their own checks for data quality. Users connect to data sources and write checks for data quality directly in the Soda Cloud user interface.
Use this setup for:
✅ Self-serve data quality: Empower data analysts and scientists to self-serve and write their own checks for data quality. See: Self-serve Soda
✅ Data migration: Migrate good-quality data from one data source to another. See: Test before data migration
✅ Automated data monitoring: Set up data profiling and automated data quality monitoring. See: Automate monitoring
✅ Data catalog integration: Integrate Soda with a data catalog such as Atlan, Alation, or Metaphor. See: Integrate Soda
Requirements:
- Access to your cloud-services environment, plus the authorization to deploy containerized apps in a new or existing Kubernetes cluster
- Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)
Programmatic
Use this setup to invoke Soda programmatically in, for example, an Airflow DAG or GitHub Workflow. You provide connection details for data sources and Soda Cloud inline or in external YAML files, and similarly define data quality checks inline or in a separate YAML file.
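A programmatic invocation with inline YAML might look roughly like the sketch below. It assumes Soda Library is installed and exposes a `Scan` class in `soda.scan` with the methods called here; the `run_scan` wrapper and its `scan_factory` parameter are inventions of this example, added so the wiring can be exercised without a live data source. Treat it as an outline and confirm the details against the Set up Soda instructions for this flavor.

```python
from typing import Any, Callable, Optional


def run_scan(
    data_source_name: str,
    configuration_yaml: str,
    checks_yaml: str,
    scan_factory: Optional[Callable[[], Any]] = None,
) -> Any:
    """Configure and execute one scan, then return the scan object.

    scan_factory is a hypothetical hook for this sketch: pass a stand-in
    class in tests; leave it as None to use Soda Library's Scan class.
    """
    if scan_factory is None:
        # Imported lazily so this module still loads where Soda Library
        # is not installed (assumption: Scan lives in soda.scan).
        from soda.scan import Scan
        scan_factory = Scan

    scan = scan_factory()
    scan.set_data_source_name(data_source_name)
    # Connection details for the data source and Soda Cloud, inline.
    scan.add_configuration_yaml_str(configuration_yaml)
    # Data quality checks written in SodaCL, inline.
    scan.add_sodacl_yaml_str(checks_yaml)
    scan.execute()
    return scan
```

To read the connection details and checks from external files instead, the inline strings could be loaded from disk before calling `run_scan`, which keeps the same function usable for both styles described above.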
Use this setup for:
✅ Testing during development: Test data before and after ingestion and transformation during development. See: Test data during development
✅ Circuit-breaking in a pipeline: Test data in a pipeline so as to enable circuit breaking that prevents bad-quality data from having a downstream impact. See: Test data in production
✅ Databricks Notebook: Invoke Soda data quality scans in a Databricks Notebook. See: Add Soda to a Databricks notebook
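The circuit-breaking use case above can be sketched as a small guard around a pipeline step. The example assumes a scan object that offers `execute()`, `has_check_fails()`, and `get_logs_text()`; `DataQualityError`, `circuit_breaker`, and `run_pipeline_step` are hypothetical names introduced only for this illustration.

```python
from typing import Any, Callable


class DataQualityError(RuntimeError):
    """Raised to stop a pipeline when a scan reports failed checks."""


def circuit_breaker(scan: Any) -> None:
    """Raise if the executed scan contains failed checks."""
    if scan.has_check_fails():
        raise DataQualityError(
            "Data quality checks failed; halting downstream steps.\n"
            + scan.get_logs_text()
        )


def run_pipeline_step(scan: Any, load_downstream: Callable[[], Any]) -> Any:
    """Run the scan first; continue downstream only if no checks failed."""
    scan.execute()
    circuit_breaker(scan)
    return load_downstream()
```

In an orchestrator such as Airflow, an exception raised inside a task marks that task as failed, so by default the tasks that depend on it do not run. That behavior is what keeps bad-quality data from reaching downstream consumers.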
Requirements:
- Python 3.8 or greater
- Pip 21.0 or greater
- Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)
Next
- Choose a flavor of Soda
- Set up Soda. Select the setup instructions that correspond with your flavor of Soda.
- Write SodaCL checks
- Run scans and review results
- Organize, alert, investigate
Need help? Join the Soda community on Slack.
Documentation always applies to the latest version of Soda products