Data flows between Soda & user

A description of the data flows that take place when the user interacts with Soda in an enterprise deployment with a self-hosted agent.

Platform architecture

1. Instruction Retrieval by the Agent

The Soda Agent deployed in the customer’s Kubernetes cluster (within their VPC) continuously polls the Soda Cloud API for any new “instructions.”
If instructions are available, the Agent retrieves required resources from Soda Cloud, which include:
- The contract file defining the data quality checks.
- Data source connection details, where sensitive credentials are replaced by placeholders for environment variables.
- Diagnostic warehouse connection details for failed-row storage.

2. Launching a Kubernetes Job

Once the resources are retrieved, the Agent launches a Kubernetes Job inside the customer environment to execute the data quality scan.
Files such as the contract and configuration are passed into the Job via mounted volumes.
Environment variables (or workload identities via ServiceAccounts) are injected into the Pod to supply secure credentials at runtime.

3. Data Source Connections

The Soda process running inside the Job uses the provided details to connect directly to the customer’s data sources (e.g., databases or data warehouses).
This connection stays entirely within the customer’s network, ensuring no row-level data leaves the environment.

4. Running the Data Quality Contract

The Soda Agent translates the data contract into optimized SQL statements
The Soda process executes the contract verification, validating the datasets against the defined checks.
During this process:
- Data quality metrics (aggregated statistics, not raw data) are calculated.
- Any rows failing validation checks are isolated for further inspection.

5. Reporting Results

Metrics and summaries are sent back to Soda Cloud via an API call.
- These metrics include table/dataset names, check names, logic, and aggregated results.
No row-level data is transmitted to Soda Cloud.
Failed rows (if applicable) are written to the diagnostic warehouse inside the customer network, making them accessible for troubleshooting without ever leaving the environment.

6. Job Completion and Termination

Once the Soda process completes:
- The Kubernetes Job terminates automatically.
- The Agent is aware of the termination and signals completion of the instruction back to Soda Cloud through an API call.

Security and Isolation

All communication between Soda Cloud and the Agent is encrypted at rest and in transit.
Soda Cloud operates in a multi-tenant environment but does not store any record-level data.

Summary

In this architecture, Soda Cloud orchestrates the scans and collects only aggregated metrics, while all sensitive data stays in the customer’s network. The failed rows go to a customer-controlled diagnostics warehouse, ensuring compliance with data privacy policies.

You are not logged in to Soda and are viewing the default public documentation. Learn more about Documentation access & licensing.

PreviousWebhook API NextDocumentation access & licensing

Last updated 1 month ago

Was this helpful?