A description of the data flows that take place when the user interacts with Soda in an enterprise deployment with a self-hosted agent.
Platform architecture
Soda platform architecture diagram
All flows start with instruction retrieval and a Kubernetes Job launch:
1. Instruction retrieval by the Agent
1
The Soda Agent deployed in the customer's Kubernetes cluster (within their VPC) continuously polls the Soda Cloud API for any new "instructions."
2
If instructions are available, the Agent retrieves required resources from Soda Cloud, which include:
The contract file defining the data quality checks.
Data source connection details, where sensitive credentials are replaced by placeholders for environment variables.
Diagnostic warehouse connection details for failed-row storage.
2. Launching a Kubernetes Job
1
Once the resources are retrieved, the Agent launches a Kubernetes Job inside the customer environment to execute the data quality scan.
2
Environment variables (or workload identities via ServiceAccounts) are injected into the Pod to supply secure credentials at runtime.
3
Files such as the contract and configuration are retrieved by the scan process using a call to the Soda Cloud API.
Contract Verification
3. Data Source Connections
The Soda process running inside the Job uses the provided details to connect directly to the customer's data sources (e.g., databases or data warehouses).
This connection stays entirely within the customer's network, ensuring no row-level data leaves the environment.
4. Running the Data Quality Contract
1
The Soda Agent translates the data contract into optimized SQL statements
2
The Soda process executes the contract verification, validating the datasets against the defined checks.
During this process:
Data quality metrics (aggregated statistics, not raw data) are calculated.
Any rows failing validation checks are isolated for further inspection.
5. Reporting Results
1
Metrics and summaries are sent back to Soda Cloud via an API call.
These metrics include table/dataset names, check names, logic, and aggregated results.
2
No row-level data is transmitted to Soda Cloud.
3
Failed rows (if applicable) are written to the diagnostic warehouse inside the customer network, making them accessible for troubleshooting without ever leaving the environment.
6. Job Completion and Termination
Once the Soda process completes:
1
The Kubernetes Job terminates automatically.
2
The Agent is aware of the termination and signals completion of the instruction back to Soda Cloud through an API call.
Metrics Monitoring
3. Data source connections
The Soda process connects directly to the customer's data source and executes SQL to calculate the latest metric value (e.g. row count, null rate). This runs entirely within the customer's network.
4. Running the Metrics Monitoring Scan
1
The Agent fetches the metric's history (time-series data) from Soda Cloud via API.
2
The anomaly detection algorithm runs locally inside the Agent.
The algorithm uses the fetched history and the latest value to determine whether the result is anomalous.
3
The Agent sends only the metric value and the anomaly result (pass/fail) back to Soda Cloud.
No raw row data is transmitted.
5. Job completion and termination
Once the Soda process completes:
1
The Kubernetes Job terminates automatically.
2
The Agent signals completion of the instruction back to Soda Cloud through an API call.
Security and Isolation
All communication between Soda Cloud and the Agent is encrypted at rest and in transit.
Soda Cloud operates in a multi-tenant environment but does not store any record-level data.
Summary
In this architecture, Soda Cloud orchestrates the scans and collects only aggregated metrics, while all sensitive data stays in the customer's network. For contract verification, failed rows go to a customer-controlled diagnostics warehouse. For metrics monitoring, only the metric value and anomaly result are transmitted to Soda Cloud. In all cases, no row-level data ever leaves the customer's environment.
You are not logged in to Soda and are viewing the default public documentation. Learn more about Documentation access & licensing.