Choose a flavor of Soda
Last modified on 20-Nov-24
Soda is a lightweight, versatile tool for testing and monitoring data quality, and you have several options for deploying it in your environment.
As the first step in the Get started roadmap, this guide helps you decide how to set up Soda to best meet your data quality testing and monitoring needs. After choosing a flavor of Soda (type of deployment model), access the corresponding Set up Soda instructions below.
Get started roadmap
- Choose a flavor of Soda 📍 You are here!
- Set up Soda: sign up and install, deploy, or invoke
- Write SodaCL checks
- Run scans and review results
- Organize, alert, investigate
Choose a flavor of Soda
You can set up Soda in one or more of four flavors.
Flavor | Description | Soda Library | Soda Agent | Soda Cloud |
---|---|---|---|---|
Self-operated | A simple setup in which you install Soda Library locally and connect it to Soda Cloud via API keys. | |||
Soda-hosted agent | Recommended. A SaaS-style setup in which you manage data quality entirely from your Soda Cloud account. | |||
Self-hosted agent | A setup in which you deploy a Soda Agent in a Kubernetes cluster in a cloud-services environment and connect it to Soda Cloud via different API keys. | |||
Programmatic | A setup in which you invoke Soda Library programmatically. |
Why do I need a Soda Cloud account?
To validate your account license or free trial, Soda Library or a Soda Agent must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library or a Soda Agent.
Self-operated
This simple setup enables you to pip install Soda Library from the command line, then prepare YAML files to:
- configure connections to your data sources to run scans
- configure the connection to your Soda Cloud account to validate your license and visualize and share data quality check results
- write data quality checks
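The three kinds of YAML content listed above can be sketched as follows. This is a minimal illustration, written in Python so that it simply writes two files to disk. The file names `configuration.yml` and `checks.yml`, the data source name `my_postgres`, the dataset `dim_customer`, its column names, and the environment variable names are all assumptions made for this example; the exact connection keys depend on your data source type.

```python
from pathlib import Path

# Assumed placeholders: my_postgres, the connection values, and the ${...}
# environment variable names are illustrative, not taken from this guide.
CONFIGURATION_YAML = """\
data_source my_postgres:
  type: postgres
  host: localhost
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public

soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}
"""

# Assumed placeholders: the dim_customer dataset and its columns are hypothetical.
CHECKS_YAML = """\
checks for dim_customer:
  - row_count > 0
  - missing_count(email_address) = 0
  - duplicate_count(customer_id) = 0
"""


def write_soda_files(target_dir):
    """Write the configuration and checks files into target_dir; return both paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    config_path = target / "configuration.yml"
    checks_path = target / "checks.yml"
    config_path.write_text(CONFIGURATION_YAML, encoding="utf-8")
    checks_path.write_text(CHECKS_YAML, encoding="utf-8")
    return config_path, checks_path
```

With Soda Library installed, a scan against files like these would typically be run from the command line with something like `soda scan -d my_postgres -c configuration.yml checks.yml`; confirm the exact options against the Set up Soda instructions for your version.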
Use this setup for:
✅ A small team: Manage data quality within a small data engineering or data analytics team that is comfortable working with the command line and YAML files to design and execute scans for data quality.
✅ POC: Conduct a proof-of-concept evaluation of Soda as a data quality testing and monitoring tool. See: Take a sip of Soda
✅ Basic DQ: Start from scratch to set up basic data quality checks on key datasets. See: Check suggestions
✅ Data migration: Migrate good-quality data from one data source to another. See: Test before data migration
Requirements:
- Python 3.8, 3.9, or 3.10
- Pip 21.0 or greater
- Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)
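Before installing, it can help to confirm that the local environment satisfies the Python and pip requirements above. The following is a minimal sketch using only the standard library; the function names are hypothetical and not part of Soda.

```python
import re
import sys
from importlib import metadata

# Versions taken from the requirements list above.
SUPPORTED_PYTHON = {(3, 8), (3, 9), (3, 10)}
MIN_PIP = (21, 0)


def parse_major_minor(version_text):
    """Return (major, minor) from a version string such as '21.3.1'."""
    match = re.match(r"\s*(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_text!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def meets_requirements(python_version, pip_version):
    """True when both versions satisfy the stated requirements."""
    python_ok = parse_major_minor(python_version) in SUPPORTED_PYTHON
    pip_ok = parse_major_minor(pip_version) >= MIN_PIP
    return python_ok and pip_ok


def current_environment_ok():
    """Check the interpreter running this code and its installed pip."""
    python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    try:
        pip_version = metadata.version("pip")
    except metadata.PackageNotFoundError:
        return False  # pip is not installed in this environment
    return meets_requirements(python_version, pip_version)
```

For example, `meets_requirements("3.9.7", "23.1")` evaluates to `True`, while a Python 3.11 interpreter or pip 20.x fails the check.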
Soda-hosted agent
Recommended
This setup provides a secure, out-of-the-box Soda Agent to manage access to data sources from within your Soda Cloud account. Quickly configure connections to your data sources in the Soda Cloud user interface, then empower all your colleagues to explore datasets, access check results, customize collections, and create their own no-code checks for data quality.
See also: Soda-hosted vs. self-hosted agent
Use this setup for:
✅ A quick start: Use the out-of-the-box agent to start testing data quality right away from within the Soda Cloud user interface, without the need to install or deploy any other tools.
✅ Anomaly detection dashboard: Use Soda’s out-of-the-box anomaly dashboards to get automated insights into basic data quality metrics for your datasets. See: Add anomaly dashboards
✅ Automated data monitoring: Set up data profiling and automated data quality monitoring. See: Automate monitoring
✅ Self-serve data quality: Empower data analysts and scientists to self-serve and create their own no-code checks for data quality. See: Self-serve Soda
✅ Data migration: Migrate good-quality data from one data source to another. See: Test before data migration
✅ Data catalog integration: Integrate Soda with a data catalog such as Atlan, Alation, or Metaphor. See: Integrate Soda
Soda hosts agents in a secure environment in Amazon Web Services (AWS). As a SOC 2 Type 2 certified business, Soda responsibly manages Soda-hosted agents to ensure that they remain private, secure, and independent of all other hosted agents. See Data security and privacy for details.
Requirements:
- Login credentials for your data source (BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, or Snowflake); Soda securely stores passwords as Kubernetes secrets
Self-hosted agent
This setup enables a data or infrastructure engineer to deploy Soda Library as an agent in a Kubernetes cluster within a cloud-services environment such as Google Cloud Platform, Azure, or AWS.
The engineer can manage access to data sources while giving Soda Cloud end-users easy access to Soda check results and enabling them to write their own checks for data quality. Users connect to data sources and create no-code checks for data quality directly in the Soda Cloud user interface.
See also: Soda-hosted vs. self-hosted agent
Use this setup for:
✅ Self-serve data quality: Empower data analysts and scientists to self-serve and create their own checks for data quality. See: Self-serve Soda
✅ Data migration: Migrate good-quality data from one data source to another. See: Test before data migration
✅ Anomaly detection dashboard: Use Soda’s out-of-the-box anomaly dashboards to get automated insights into basic data quality metrics for your datasets. See: Add anomaly dashboards
✅ Data catalog integration: Integrate Soda with a data catalog such as Atlan, Alation, or Metaphor. See: Integrate Soda
✅ Secrets manager integration: Integrate your Soda Agent with an external secrets manager to securely access frequently-rotated data source login credentials. See: Integrate with a Secrets Manager
Requirements:
- Access to your cloud-services environment, plus the authorization to deploy containerized apps in a new or existing Kubernetes cluster
- Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)
Programmatic
Use this setup to invoke Soda programmatically in, for example, an Airflow DAG or GitHub Workflow. You provide connection details for data sources and Soda Cloud inline or in external YAML files, and similarly define data quality checks inline or in a separate YAML file.
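A programmatic invocation might look like the sketch below, which reads connection details from an external YAML file and defines checks inline. It assumes the `Scan` class from `soda.scan` and the method names shown, as provided by Soda Library at the time of writing; verify them against the Set up Soda instructions for your version. The data source name, dataset name, scan definition name, and the `scan_factory` parameter are hypothetical and exist only to keep the example self-contained.

```python
# Assumed placeholder: dim_customer and its column are hypothetical.
INLINE_CHECKS = """\
checks for dim_customer:
  - row_count > 0
  - missing_count(email_address) = 0
"""


def run_scan(data_source_name, configuration_path, checks_yaml, scan_factory=None):
    """Run one scan and return (exit_code, logs_text).

    scan_factory is a hypothetical hook for substituting a stand-in during
    tests; by default the real Scan class is imported lazily, so this module
    loads even where Soda Library is not installed.
    """
    if scan_factory is None:
        from soda.scan import Scan  # assumes a Soda Library package is installed
        scan_factory = Scan

    scan = scan_factory()
    scan.set_data_source_name(data_source_name)
    scan.set_scan_definition_name("programmatic_example")  # hypothetical name
    # Connection details for the data source and Soda Cloud: external YAML file.
    scan.add_configuration_yaml_file(file_path=configuration_path)
    # Data quality checks: defined inline as a SodaCL string.
    scan.add_sodacl_yaml_str(checks_yaml)
    exit_code = scan.execute()
    return exit_code, scan.get_logs_text()
```

A call such as `run_scan("my_datasource", "configuration.yml", INLINE_CHECKS)` could then sit inside an Airflow task or a step in a GitHub Workflow.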
Use this setup for:
✅ Testing during development: Test data before and after ingestion and transformation during development. See: Test data during development
✅ Circuit-breaking in a pipeline: Test data in an Airflow pipeline so as to enable circuit breaking that prevents bad-quality data from having a downstream impact. See: Test data in production
✅ Databricks Notebook: Invoke Soda data quality scans in a Databricks Notebook. See: Add Soda to a Databricks notebook
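The circuit-breaking idea above can be sketched as a small gate that inspects a scan's exit code and raises an exception to stop the pipeline. This sketch assumes the exit code convention 0 (all checks passed), 1 (warnings), 2 (failures), and 3 (scan error); confirm these values for your Soda version. The exception class and function name are hypothetical.

```python
class DataQualityGateError(RuntimeError):
    """Raised to halt a pipeline when a scan result is not acceptable."""


# Assumed exit code meanings; verify against your Soda version.
EXIT_CODE_MEANINGS = {
    0: "all checks passed",
    1: "one or more checks warned",
    2: "one or more checks failed",
    3: "the scan encountered an error",
}


def enforce_quality_gate(exit_code, fail_on_warning=False):
    """Return exit_code if acceptable; otherwise raise DataQualityGateError.

    By default only failures and errors (codes 2 and 3) break the circuit.
    Set fail_on_warning=True to also stop on warnings (code 1).
    """
    threshold = 1 if fail_on_warning else 2
    if exit_code >= threshold:
        meaning = EXIT_CODE_MEANINGS.get(exit_code, "unknown exit code")
        raise DataQualityGateError(f"Scan exit code {exit_code}: {meaning}")
    return exit_code
```

In an Airflow task, an uncaught exception marks the task as failed, so downstream tasks that depend on it do not run; that is what keeps bad-quality data from moving further along the pipeline.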
Requirements:
- Python 3.8, 3.9, or 3.10
- Pip 21.0 or greater
- Login credentials for your data source (Snowflake, Athena, MS SQL Server, etc.)
Soda-hosted vs. self-hosted agent
Though the two types of Soda Agent are similar, the one you choose depends upon the following factors.
Factor | Soda-hosted agent | Self-hosted agent |
---|---|---|
Data source compatibility | Compatible with a limited subset of Soda-supported data sources. | Compatible with nearly all Soda-supported data sources. |
Upgrade maintenance | Soda manages all upgrades to the latest available version of the Soda Agent. | You manage all upgrades to your Soda Agent deployed on your Kubernetes cluster. |
External Secrets manager integration | Unable to integrate with an External Secrets manager. | Able to integrate with an External Secrets manager (Hashicorp Vault, Azure Key Vault, etc.) to better manage frequently-rotated login credentials. |
Network connectivity | Access the Soda Agent via public networks or passlisting. | Deploy the Soda Agent inside your own private cloud or on-premises network infrastructure. |
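The factors in the table can be expressed as a simple decision helper. This is only a sketch of the reasoning, not an official selection tool. The function and parameter names are hypothetical, and the set of Soda-hosted data sources is copied from the Soda-hosted agent requirements earlier in this guide, so it may not reflect the current list.

```python
# Data sources listed in this guide's Soda-hosted agent requirements.
SODA_HOSTED_DATA_SOURCES = {
    "bigquery",
    "databricks sql",
    "ms sql server",
    "mysql",
    "postgresql",
    "redshift",
    "snowflake",
}


def suggest_agent(data_source, needs_secrets_manager=False, needs_private_network=False):
    """Return (agent_type, reasons) based on the factors in the table above."""
    reasons = []
    if data_source.strip().lower() not in SODA_HOSTED_DATA_SOURCES:
        reasons.append(f"{data_source} is not in the Soda-hosted data source list")
    if needs_secrets_manager:
        reasons.append("external secrets manager integration requires a self-hosted agent")
    if needs_private_network:
        reasons.append("the agent must run inside your own network infrastructure")
    if reasons:
        return "self-hosted", reasons
    return "soda-hosted", ["no factor requires a self-hosted agent; Soda manages upgrades"]
```

For instance, `suggest_agent("Snowflake")` suggests the Soda-hosted agent, while `suggest_agent("Athena")` or any call with `needs_secrets_manager=True` suggests a self-hosted agent.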
Next
- Choose a flavor of Soda
- Set up Soda. Select the setup instructions that correspond with your flavor of Soda.
- Write SodaCL checks
- Run scans and review results
- Organize, alert, investigate
Need help? Join the Soda community on Slack.
Documentation always applies to the latest version of Soda products