Soda Agent basic concepts

Establish a baseline understanding of the concepts involved in deploying a Soda Agent.

The Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality. For a self-hosted agent, create a Kubernetes cluster in a cloud services provider environment, then use Helm to deploy a Soda Agent in the cluster.

This setup enables Soda Cloud users to securely connect to data sources (Snowflake, Amazon Athena, etc.) from within the Soda Cloud web application. Any user in your Soda Cloud account can add a new data source via the agent, then write their own no-code checks for data quality in the new data source.

What follows is an extremely abridged introduction to a few basic elements involved in the deployment and setup of a self-hosted Soda Agent.

What is Soda Library?

Soda Library is a Python library and command-line tool that serves as the backbone of Soda technology. It is the software that performs the work of converting user-defined input into SQL queries that execute when you run scans for data quality in a data source. Connect Soda Library to a Soda Cloud account where you and your team can use the web application to collaborate on data quality monitoring.
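
To make the idea of converting user-defined input into SQL concrete, here is a minimal sketch in Python. It is not Soda Library's implementation or API: the check dictionary, the function name missing_count_sql, and the table and column names are all hypothetical, chosen only to show the general shape of turning a declared check into a query.

```python
# Illustrative only: a toy translation of a "missing values" check into SQL.
# The check format and function name are hypothetical, not Soda Library's API.

def missing_count_sql(dataset: str, column: str) -> str:
    """Build a SQL query that counts NULL values in one column."""
    # Reject anything that is not a plain identifier before interpolating it.
    for name in (dataset, column):
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"SELECT COUNT(*) FROM {dataset} WHERE {column} IS NULL"


# A hypothetical user-defined check, expressed as plain data.
check = {"dataset": "orders", "column": "customer_id", "type": "missing"}
query = missing_count_sql(check["dataset"], check["column"])
print(query)
```

A real scan would execute such a query against the data source and compare the result to a threshold; this sketch stops at query generation.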

What is Soda Cloud?

Soda Cloud is Soda's web application. It provides a friendly, intuitive and low-barrier interface that can be used by technical and non-technical users alike. Soda Cloud allows for collaboration between teams and for different permission levels depending on your organization's structure. In the web application, for example, non-technical users can propose new checks and contract changes, while engineering teams can review the suggestions and approve or modify them.

What are data contracts?

Data contracts are formal agreements between your organization's teams (usually between data producers and data consumers), and they specify the expected shape, quality, and behavior of a dataset. A data contract sets clear rules for what the data must look like and how it should perform, so that producers and consumers of data share a common understanding.

Both Soda Library and Soda Cloud make use of data contracts written in YAML. Contracts include checks for data quality. The checks are tests that Soda Library executes when it runs a scan of your data.

Go to Contract Language reference to learn more about contract structure and requirements.

What is Soda Agent?

Soda Agent is essentially Soda Library functionality that you deploy in a Kubernetes cluster in your own cloud services provider environment. When you deploy an agent, you also deploy two types of workloads in your Kubernetes cluster from a Docker image:

a Soda Agent Orchestrator which creates Kubernetes Jobs to trigger scheduled and on-demand scans of data
a Soda Agent Scan Launcher which wraps around Soda Library, the tool which performs the scan itself
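
As a rough illustration of what "creating a Kubernetes Job" involves, the sketch below builds a minimal batch/v1 Job manifest as a Python dictionary. The function name, the Job and container names, and the image reference are hypothetical placeholders; the manifest that the Soda Agent Orchestrator actually produces is not described here and is likely more elaborate.

```python
import json

# Illustrative only: the names and image below are placeholders,
# not the values the Soda Agent Orchestrator uses.

def scan_job_manifest(scan_id: str, image: str) -> dict:
    """Return a minimal Kubernetes Job manifest that runs one container once."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"scan-{scan_id}"},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed pod
            "template": {
                "spec": {
                    "restartPolicy": "Never",  # Jobs require Never or OnFailure
                    "containers": [
                        {"name": "scan-launcher", "image": image}
                    ],
                }
            },
        },
    }


manifest = scan_job_manifest("example-001", "registry.example.com/scan-launcher:latest")
print(json.dumps(manifest, indent=2))
```

Each scheduled or on-demand scan would correspond to one such short-lived Job, which Kubernetes runs to completion and then cleans up according to cluster policy.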

How does Soda integrate with Kubernetes?

Kubernetes is a system for orchestrating containerized applications; a Kubernetes cluster is a set of resources that supports an application deployment.

You need a Kubernetes cluster in which to deploy the containerized applications that make up the Soda Agent. Kubernetes provides Secrets, which the Soda Agent Helm chart uses to store the connection secrets that you specify as values when you release the Soda Agent with Helm. Depending on your cloud provider, you can arrange to store these Secrets in specialized storage such as Azure Key Vault or AWS Key Management Service (KMS).

Learn more about using external secrets.

The Jobs that the agent creates access these Secrets when they execute.
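
For readers new to Kubernetes Secrets, the following sketch shows the general structure of a Secret manifest: values in the data field are base64-encoded strings. The Secret name and keys are hypothetical and do not reflect what the Soda Agent Helm chart creates. Note that base64 is an encoding, not encryption, which is one reason teams turn to external secret stores.

```python
import base64

# Illustrative only: the Secret name and keys are placeholders,
# not the ones the Soda Agent Helm chart defines.

def secret_manifest(name: str, values: dict) -> dict:
    """Return an Opaque Secret manifest with base64-encoded values."""
    encoded = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": encoded,
    }


def read_secret_value(manifest: dict, key: str) -> str:
    """Decode one value, as a workload reading the Secret would see it."""
    return base64.b64decode(manifest["data"][key]).decode("utf-8")


secret = secret_manifest("example-datasource", {"username": "admin", "password": "example-password"})
print(sorted(secret["data"]))
print(read_secret_value(secret, "username"))
```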

Learn more about Kubernetes concepts.

Where can a Soda Agent be deployed?

You create your Kubernetes cluster within a cloud services provider environment. You can deploy a Soda Agent in any environment in which you can create Kubernetes clusters, such as:

Amazon Elastic Kubernetes Service (EKS)
Microsoft Azure Kubernetes Service (AKS)
Google Kubernetes Engine (GKE)
Any Kubernetes cluster, version 1.21 or greater, that uses standard Kubernetes
Locally, for testing purposes, using tools like Minikube, microk8s, kind, k3s, or Docker Desktop with Kubernetes support
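
The version requirement in the list above can be checked with a small helper like the one sketched here. It assumes the version string looks like the ones Kubernetes tooling typically reports (for example "v1.27.3", possibly with a provider suffix); the function name meets_minimum is hypothetical, and obtaining the version string from your cluster is left out.

```python
import re

# Minimum Kubernetes version stated in the list above.
MINIMUM = (1, 21)


def meets_minimum(version: str, minimum: tuple = MINIMUM) -> bool:
    """Return True if a version string such as 'v1.27.3' is >= the minimum."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element, so (1, 27) >= (1, 21) is True.
    return (major, minor) >= minimum


for candidate in ("v1.21.0", "1.20.15", "v1.27.3-gke.100"):
    print(candidate, meets_minimum(candidate))
```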

What is Helm?

Helm is a package manager for Kubernetes which bundles YAML files together for storage in a public or private repository. This bundle of YAML files is referred to as a Helm chart. The Soda Agent is a Helm chart. Anyone with access to the Helm chart’s repository can deploy the chart to make use of the YAML files in it.

Learn more about Helm concepts.

The Soda Agent Helm chart is stored on a public repository and published on ArtifactHub.io. Anyone can use Helm to find and deploy the Soda Agent Helm chart in their Kubernetes cluster.
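
The general flow of finding and deploying a chart with Helm is: register the repository, then install a release of the chart into a namespace with your own values file. The sketch below only assembles those two commands as argument lists so you can inspect them before running anything. The repository name, URL, chart name, release name, and namespace are placeholders, not the actual Soda Agent values; consult the deployment instructions for those.

```python
import shlex

# Illustrative only: every name and URL passed in below is a placeholder.

def helm_commands(repo_name: str, repo_url: str, release: str,
                  chart: str, namespace: str, values_file: str) -> list:
    """Return the argv lists for 'helm repo add' and 'helm install'."""
    add_repo = ["helm", "repo", "add", repo_name, repo_url]
    install = [
        "helm", "install", release, f"{repo_name}/{chart}",
        "--namespace", namespace,
        "--create-namespace",       # create the namespace if it is missing
        "--values", values_file,    # file holding your release values
    ]
    return [add_repo, install]


commands = helm_commands(
    repo_name="example-repo",
    repo_url="https://charts.example.com",
    release="my-agent",
    chart="example-agent-chart",
    namespace="example-namespace",
    values_file="values.yml",
)
for argv in commands:
    # shlex.join quotes arguments so the line is safe to paste into a shell.
    print(shlex.join(argv))
```

Keeping the commands as lists rather than one string makes it straightforward to pass them to subprocess.run later without shell quoting problems.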

Why Kubernetes?

Kubernetes is the most powerful and future-proof platform for running the Soda Agent because it delivers the best of both worlds: the flexibility of raw compute without the operational burden, and the scalability of managed services without their restrictions.

Kubernetes goes far beyond raw compute like EC2 or traditional Virtual Machines (VMs) by abstracting away the heavy lifting of networking, deployments, and scaling, while still giving teams precise control when needed. Practically, this makes it easy for Soda’s customers to deploy, manage, and upgrade Soda Agents using Kubernetes and Helm, and to stay up to date with the latest releases.

Unlike fully managed options such as AWS Lambda, Kubernetes has no execution time limits and is built to handle long-running, stateful, and highly scalable workloads. This means Soda is not limited to lightweight samples but can perform complete, row-level operations. These operations power advanced capabilities like Diagnostics Warehouse, which securely stores the exact failing records inside your own infrastructure, and Reconciliation Checks, which compare data at the row level across sources.

Whether running in the cloud or on-premises, Kubernetes ensures resilience, portability, and cost-efficient resource use, making it the clear choice for complex, enterprise-grade data quality workloads.

Go further

Deploy a Soda Agent in a Kubernetes cluster
