Deploy a Soda Agent in Google GKE
Last modified on 31-May-23
The Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality. Create a Google Kubernetes Engine (GKE) cluster, then use Helm to deploy a Soda Agent in the cluster.
This setup enables Soda Cloud users to securely connect to data sources (BigQuery, Snowflake, etc.) from within the Soda Cloud web application. Any user in your Soda Cloud account can add a new data source via the agent, then write their own agreements to check for data quality in the new data source.
Deployment overview
Compatibility
Prerequisites
Create a Soda Cloud account and API keys
Create a Kubernetes cluster
Deploy a Soda Agent
Deploy using CLI only
Deploy using a values YAML file
(Optional) Create a practice data source
About the helm install command
Decommission the Soda Agent and cluster
Troubleshoot deployment
Go further
Deployment overview
- (Optional) Familiarize yourself with basic Soda, Kubernetes, and Helm concepts.
- Install, or confirm the installation of, a few required command-line tools.
- Sign up for a Soda Cloud account and create new API keys.
- Use the command-line to create a Kubernetes cluster.
- Deploy the Soda Agent in the new cluster.
- Verify the existence of your new Soda Agent in your Soda Cloud account.
Compatibility
Soda supports Kubernetes cluster version 1.21 or greater.
You can deploy a Soda Agent to connect with the following data sources:
- Amazon Athena
- Amazon Redshift
- Azure Synapse (Experimental)
- ClickHouse (Experimental)
- Denodo (Experimental)
- Dremio
- DuckDB (Experimental)
- GCP Big Query
- IBM DB2
- MS SQL Server †
- MySQL
- OracleDB
- PostgreSQL
- Snowflake
- Trino
- Vertica (Experimental)
† MS SQL Server with Windows Authentication does not work with Soda Agent out-of-the-box.
Prerequisites
- (Optional) You have familiarized yourself with basic Soda, Kubernetes, and Helm concepts.
- You have a Google Cloud Platform (GCP) account and the necessary permissions to enable you to create a Google Kubernetes Engine (GKE) cluster in Autopilot mode in your region.
- You have installed the gcloud CLI tool. To verify the version of an existing install, use the command:
  gcloud version
  - If you have already installed the gcloud CLI, use the following commands to log in and verify your configuration settings, respectively:
    gcloud auth login
    gcloud config list
  - If you are installing the gcloud CLI for the first time, be sure to complete all the steps in the installation to properly install and configure the setup.
  - Consider using the following command to learn a few basic gcloud commands:
    gcloud cheat-sheet
- You have installed v1.22 or v1.23 of kubectl. This is the command-line tool you use to run commands against Kubernetes clusters. If you have installed Docker Desktop, kubectl is included out-of-the-box. With Docker running, check the version of an existing install with the command:
  kubectl version --output=yaml
- You have installed Helm. This is the package manager for Kubernetes which you will use to deploy the Soda Agent Helm chart. To check the version of an existing install, run:
  helm version
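As a quick way to confirm the tool prerequisites in one pass, the following is a minimal sketch that reports which of the three command-line tools are not on your PATH. The function name missing_tools is hypothetical, and the sketch only checks that each tool is present; it does not check versions.

```shell
#!/bin/sh
# Print the name of every listed tool that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    # command -v exits non-zero when the tool cannot be found
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# The three command-line tools this guide relies on.
missing=$(missing_tools gcloud kubectl helm)
if [ -n "$missing" ]; then
  echo "Install these tools before continuing:"
  echo "$missing"
else
  echo "All required command-line tools are installed."
fi
```

After this check passes, the version commands listed above are still the way to confirm that each install meets the version requirements.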
Create a Soda Cloud account and API keys
The Soda Agent communicates with your Soda Cloud account using API public and private keys. Note that the keys a Soda Agent uses are different from the API keys Soda Core uses to connect to Soda Cloud.
- If you have not already done so, create a Soda Cloud account at cloud.soda.io.
- In your Soda Cloud account, navigate to your avatar > Scans & Data, then navigate to the Agents tab. Click New Soda Agent.
- The dialog box that appears offers abridged instructions to set up a new Soda Agent from the command-line; more thorough instructions exist in this documentation, below.
For now, copy and paste the values for both the API Key ID and API Key Secret to a temporary, secure place in your local environment. You will need these values in the next section when you deploy the agent in your Kubernetes cluster.
- You can keep the dialog box open in Soda Cloud, or close it.
Create a GKE Autopilot cluster
To deploy a Soda Agent in a Kubernetes cluster, you must first create a network and a cluster.
GKE offers several types of clusters and modes you can use; refer to GKE documentation for details. The instructions below detail the steps to deploy a cluster using GKE Autopilot, in a single zone, as a private cluster with outbound internet access.
- Use the following command to create a network. Pick a name for the network that is unique in your environment.
gcloud compute networks create soda-agent-net-2 \
  --subnet-mode custom
Output:
Created [https://www.googleapis.com/compute/v1/projects/test-gke/global/networks/soda-agent-net-2].
NAME              SUBNET_MODE  BGP_ROUTING_MODE  IPV4_RANGE  GATEWAY_IPV4
soda-agent-net-2  CUSTOM       REGIONAL
...
- In the newly-created network, create a subnet with two secondary ranges, to be used for the pods and services in the cluster.
gcloud compute networks subnets create soda-agent-subnet-2 \
  --network soda-agent-net-2 \
  --range 192.168.0.0/20 \
  --secondary-range agent-pods=10.4.0.0/14,agent-services=10.0.32.0/20 \
  --enable-private-ip-google-access
Output:
Created [https://www.googleapis.com/compute/v1/projects/test-gke/regions/us-west1/subnetworks/soda-agent-subnet-2].
NAME                 REGION    NETWORK           RANGE           STACK_TYPE  IPV6_ACCESS_TYPE  INTERNAL_IPV6_PREFIX  EXTERNAL_IPV6_PREFIX
soda-agent-subnet-2  us-west1  soda-agent-net-2  xxx.xxx.x.x/20  IPV4_ONLY
- Use the following command to create a cluster in your network, providing a unique name for the cluster. Replace the values for region and master-authorized-networks with your own region and IP address, respectively. Read more about the create-auto command and its flags in Google documentation.
gcloud container clusters create-auto soda-agent-gke \
  --region us-west1 \
  --enable-private-nodes \
  --network soda-agent-net-2 \
  --subnetwork soda-agent-subnet-2 \
  --cluster-secondary-range-name agent-pods \
  --services-secondary-range-name agent-services \
  --enable-master-authorized-networks \
  --master-authorized-networks xxx.xxx.x.x/20
Output:
...
kubeconfig entry generated for soda-agent-gke.
NAME            LOCATION  MASTER_VERSION  MASTER_IP     MACHINE_TYPE  NODE_VERSION    NUM_NODES  STATUS
soda-agent-gke  us-west1  1.24.5-gke.600  xx.xxx.xx.xx  e2-medium     1.24.5-gke.600  3          RUNNING
- Because the cluster is private, it cannot reach public IP addresses directly. Use the following command to add a network address translation (NAT) router to your network to route outbound network requests towards public IP addresses, such as cloud.soda.io or cloud.us.soda.io, through a virtual NAT.
gcloud compute routers create agent-nat-router-1 \
  --network soda-agent-net-2 \
  --region=us-west1
Output:
Creating router [agent-nat-router-1]...done.
NAME                REGION    NETWORK
agent-nat-router-1  us-west1  soda-agent-net-2
- Use the following command to add configurations to the router.
gcloud compute routers nats create nat-config \
  --router agent-nat-router-1 \
  --nat-all-subnet-ip-ranges \
  --auto-allocate-nat-external-ips
- Use the following command to add configuration to connect the new cluster to your local kubectl configuration.
gcloud container clusters get-credentials soda-agent-gke \
  --region us-west1
- (Optional) To enable other machines or networks to connect to the cluster, use the following command to add broader IP ranges. Replace the value for Z.Z.Z.Z/29 with your own IP range.
gcloud container clusters update soda-agent-gke \
  --enable-master-authorized-networks \
  --master-authorized-networks Z.Z.Z.Z/29
- Use the following command to create a new namespace in your cluster.
kubectl create ns soda-agent
- Run the following command to set soda-agent as the namespace for the current context.
kubectl config set-context --current --namespace=soda-agent
- Run the following command to verify that kubectl recognizes soda-agent as the current namespace.
kubectl config get-contexts
Output:
CURRENT  NAME                            CLUSTER                         AUTHINFO                        NAMESPACE
*        gke_soda-agent-gke_us-west1...  gke_soda-agent-gke_us-west1...  gke_soda-agent-gke_us-west1***  soda-agent
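Because the same network, subnet, cluster, and region names recur across the gcloud commands above, it can help to define each name once. The following is a minimal sketch, not an official script: the variable names, the run helper, and the create_cluster_steps function are hypothetical, and the values are the examples used in this guide. With DRY_RUN set to 1, the sketch only prints each command so you can review the sequence before executing anything.

```shell
#!/bin/sh
# Example values from this guide; the variable names are hypothetical.
REGION="us-west1"
NETWORK="soda-agent-net-2"
SUBNET="soda-agent-subnet-2"
CLUSTER="soda-agent-gke"
ROUTER="agent-nat-router-1"
AUTHORIZED_NETWORK="xxx.xxx.x.x/20"   # placeholder: replace with your own IP range
DRY_RUN=1                             # 1 prints each command; set to 0 to execute

# Print a command, then execute it unless DRY_RUN is 1.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# The network, subnet, cluster, NAT, and credentials steps, in order.
# Each step runs only if the previous one succeeded.
create_cluster_steps() {
  run gcloud compute networks create "$NETWORK" --subnet-mode custom &&
  run gcloud compute networks subnets create "$SUBNET" \
    --network "$NETWORK" \
    --range 192.168.0.0/20 \
    --secondary-range agent-pods=10.4.0.0/14,agent-services=10.0.32.0/20 \
    --enable-private-ip-google-access &&
  run gcloud container clusters create-auto "$CLUSTER" \
    --region "$REGION" \
    --enable-private-nodes \
    --network "$NETWORK" \
    --subnetwork "$SUBNET" \
    --cluster-secondary-range-name agent-pods \
    --services-secondary-range-name agent-services \
    --enable-master-authorized-networks \
    --master-authorized-networks "$AUTHORIZED_NETWORK" &&
  run gcloud compute routers create "$ROUTER" \
    --network "$NETWORK" \
    --region "$REGION" &&
  run gcloud compute routers nats create nat-config \
    --router "$ROUTER" \
    --nat-all-subnet-ip-ranges \
    --auto-allocate-nat-external-ips &&
  run gcloud container clusters get-credentials "$CLUSTER" \
    --region "$REGION"
}

create_cluster_steps
```

Review the printed commands, replace the placeholder IP range, and only then set DRY_RUN to 0. The namespace steps with kubectl are left out of the sketch because they do not depend on these names.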
Deploy a Soda Agent
The following table outlines the two ways you can install the Helm chart to deploy a Soda Agent in your cluster.
Method | Description | When to use |
---|---|---|
CLI only | Install the Helm chart via CLI by providing values directly in the install command. | Use this as a straightforward way of deploying an agent on a cluster in a secure or local environment. |
Use a values YAML file | Install the Helm chart via CLI by providing values in a values YAML file. | Use this as a way of deploying an agent on a cluster while keeping sensitive values secure: provide sensitive API key values in this local file, and store data source login credentials as environment variables in this local file. Soda needs access to the credentials to be able to connect to your data source to run scans of your data. See: Manage sensitive values. |
Deploy using CLI only
- Add the Soda Agent Helm chart repository.
helm repo add soda-agent https://helm.soda.io/soda-agent/
- Use the following command to install the Helm chart to deploy a Soda Agent in your cluster. (Learn more about the helm install command.)
  - Replace the values of soda.apikey.id and soda.apikey.secret with the values you copied from the New Soda Agent dialog box in your Soda Cloud account.
  - Replace the value of soda.agent.name with a custom name for your agent, if you wish.
  - Specify the value for soda.cloud.endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
  - Optionally, add the soda.core settings to configure idle workers in the cluster. Launch an idle worker so at scan time, the agent can hand over instructions to an already running idle Scan Launcher to avoid the start-from-scratch setup time for a pod. This helps your test scans from Soda Cloud run faster. You can have multiple idle scan launchers waiting for instructions.
helm install soda-agent soda-agent/soda-agent \
  --set soda.agent.name=myuniqueagent \
  --set soda.cloud.endpoint=https://cloud.soda.io \
  --set soda.apikey.id=*** \
  --set soda.apikey.secret=*** \
  --namespace soda-agent \
  --set soda.core.idle=true \
  --set soda.core.replicas=1
The command-line produces output like the following message:
NAME: soda-agent
LAST DEPLOYED: Wed Dec 14 11:45:13 2022
NAMESPACE: soda-agent
STATUS: deployed
REVISION: 1
- (Optional) Validate the Soda Agent deployment by running the following command:
kubectl describe pods
- In your Soda Cloud account, navigate to your avatar > Scans & Data > Agents tab. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that the agent may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step three to check the status of the deployment. When the output shows Status: Running, refresh the page to see the agent in Soda Cloud.
Name:             soda-agent-orchestrator-66-snip
Namespace:        soda-agent
Priority:         0
Service Account:  soda-agent
Node:             <none>
Labels:           agent.soda.io/component=orchestrator
                  agent.soda.io/service=queue
                  app.kubernetes.io/instance=soda-agent
                  app.kubernetes.io/name=soda-agent
                  pod-template-hash=669snip
Annotations:      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
...
- Next: Add a data source in Soda Cloud using the Soda Agent you just deployed. If you wish, you can create a practice data source so you can try adding a data source in Soda Cloud using the Soda Agent you just deployed.
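The describe pods output is long, so when you only want to know whether each pod has reached Running, a small filter can help. The following is a minimal sketch, assuming the output format shown in this guide, in which the pod-level Name: and Status: fields start at the beginning of a line. The function name pod_statuses is hypothetical.

```shell
#!/bin/sh
# Read `kubectl describe pods` output on stdin and print "<pod name> <status>"
# for each pod, based on the unindented Name: and Status: fields.
pod_statuses() {
  awk '
    /^Name:/   { name = $2 }          # remember the most recent pod name
    /^Status:/ { print name, $2 }     # pair it with the pod-level status
  '
}

# Demonstration using an excerpt of the sample output from this guide.
pod_statuses <<'EOF'
Name:             soda-agent-orchestrator-66-snip
Namespace:        soda-agent
Status:           Running
EOF
# prints: soda-agent-orchestrator-66-snip Running
```

Against a live cluster, you would pipe the real output into the filter: kubectl describe pods -n soda-agent | pod_statuses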
Deploy using a values YAML file
- Using a code editor, create a new YAML file called values.yml.
- In that file, copy and paste the content below, replacing the following values:
  - Replace the values of id and secret with the values you copied from the New Soda Agent dialog box in your Soda Cloud account.
  - Replace the value of name with a custom name for your agent, if you wish.
  - Specify the value for endpoint according to your local region: https://cloud.us.soda.io for the United States, or https://cloud.soda.io for all else.
  - Optionally, add the soda.core settings to configure idle workers in the cluster. Launch an idle worker so at scan time, the agent can hand over instructions to an already running idle Scan Launcher to avoid the start-from-scratch setup time for a pod. This helps your test scans from Soda Cloud run faster. You can have multiple idle scan launchers waiting for instructions.
soda:
  apikey:
    id: "***"
    secret: "***"
  agent:
    name: "myuniqueagent"
  core:
    idle: true
    replicas: 1
  cloud:
    endpoint: "https://cloud.soda.io"
- Save the file. Then, in the same directory in which the values.yml file exists, use the following command to install the Soda Agent Helm chart.
helm install soda-agent soda-agent/soda-agent \
  --values values.yml \
  --namespace soda-agent
- (Optional) Validate the Soda Agent deployment by running the following command:
kubectl describe pods
- In your Soda Cloud account, navigate to your avatar > Scans & Data > Agents tab. Refresh the page to verify that you see the agent you just created in the list of Agents.
Be aware that the agent may take several minutes to appear in your list of Soda Agents. Use the describe pods command in step four to check the status of the deployment. When the output shows Status: Running, refresh the page to see the agent in Soda Cloud.
Name:             soda-agent-orchestrator-66-snip
Namespace:        soda-agent
Priority:         0
Service Account:  soda-agent
Node:             <none>
Labels:           agent.soda.io/component=orchestrator
                  agent.soda.io/service=queue
                  app.kubernetes.io/instance=soda-agent
                  app.kubernetes.io/name=soda-agent
                  pod-template-hash=669snip
Annotations:      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
...
- Next: Add a data source in Soda Cloud using the Soda Agent you just deployed. If you wish, you can create a practice data source so you can try adding a data source in Soda Cloud using the Soda Agent you just deployed.
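If you prefer not to type the API key values into an editor, one option is to generate the values file from environment variables. The following is a minimal sketch under stated assumptions: the function name write_values_file and the variable names SODA_API_KEY_ID, SODA_API_KEY_SECRET, SODA_AGENT_NAME, and SODA_CLOUD_ENDPOINT are all hypothetical, and the keys written to the file mirror the values.yml structure shown above. The sketch does not escape special characters, so it assumes the key values contain no double quotes or backslashes.

```shell
#!/bin/sh
# Write a values file for the Soda Agent Helm chart from environment variables.
# All variable names here are hypothetical.
write_values_file() {
  out_file=$1
  # Stop with an error message if either API key variable is unset or empty.
  : "${SODA_API_KEY_ID:?set SODA_API_KEY_ID first}"
  : "${SODA_API_KEY_SECRET:?set SODA_API_KEY_SECRET first}"
  (
    # In this subshell only: a newly created file gets owner-only permissions.
    umask 077
    cat > "$out_file" <<EOF
soda:
  apikey:
    id: "${SODA_API_KEY_ID}"
    secret: "${SODA_API_KEY_SECRET}"
  agent:
    name: "${SODA_AGENT_NAME:-myuniqueagent}"
  core:
    idle: true
    replicas: 1
  cloud:
    endpoint: "${SODA_CLOUD_ENDPOINT:-https://cloud.soda.io}"
EOF
  )
}
```

With both key variables exported in your shell, calling write_values_file values.yml produces a file you can pass to the helm install command with --values values.yml.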
(Optional) Create a practice data source
If you wish to try creating a new data source in Soda Cloud using the agent you deployed, you can use the following command to create a PostgreSQL warehouse containing example data from the NYC Bus Breakdowns and Delay Dataset.
From the command-line, copy+paste and run the following to create the data source as a pod on your new cluster.
cat <<EOF | kubectl apply -n soda-agent -f -
---
apiVersion: v1
kind: Pod
metadata:
  name: nybusbreakdowns
  labels:
    app: nybusbreakdowns
spec:
  containers:
    - image: sodadata/nybusbreakdowns
      imagePullPolicy: IfNotPresent
      name: nybusbreakdowns
      ports:
        - name: tcp-postgresql
          containerPort: 5432
  restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nybusbreakdowns
  name: nybusbreakdowns
spec:
  ports:
    - name: tcp-postgresql
      port: 5432
      protocol: TCP
      targetPort: tcp-postgresql
  selector:
    app: nybusbreakdowns
  type: ClusterIP
EOF
Output:
pod/nybusbreakdowns created
service/nybusbreakdowns created
Once the pod of practice data is running, you can use the following configuration details when you add a data source in Soda Cloud, in step 2, Connect the Data Source.
data_source your_datasource_name:
  type: postgres
  connection:
    host: nybusbreakdowns
    port: 5432
    username: sodacore
    password: sodacore
    database: sodacore
  schema: new_york
About the helm install command
helm install soda-agent soda-agent/soda-agent \
--set soda.agent.target=azure-aks-virtualnodes \
--set soda.agent.name=myuniqueagent \
--set soda.apikey.id=*** \
--set soda.apikey.secret=**** \
--namespace soda-agent
Command part | Description |
---|---|
helm install | the action helm is to take |
soda-agent (the first one) | a release named soda-agent on your cluster |
soda-agent (the second one) | the name of the helm repo you installed |
soda-agent (the third one) | the name of the helm chart that is the Soda Agent |
The --set options either override or set some of the values defined in and used by the Helm chart. You can override these values with --set flags as this command does, or you can specify the override values using a values.yml file.
Parameter key | Parameter value, description |
---|---|
--set soda.agent.target | (Optional) The cluster the command targets. Use when deploying to aws-eks or azure-aks-virtualnodes. |
--set soda.agent.name | A unique name for your Soda Agent. Choose any name you wish, as long as it is unique in your Soda Cloud account. |
--set soda.apikey.id | With the apikey.secret, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here. |
--set soda.apikey.secret | With the apikey.id, this connects the Soda Agent to your Soda Cloud account. Use the value you copied from the dialog box in Soda Cloud when adding a new agent. You can use a values.yml file to pass this value to the cluster instead of exposing it here. |
--set soda.core.idle=true | (Optional) Launch an idle worker so at scan time, the agent can hand over instructions to an already running idle Scan Launcher to avoid the start-from-scratch setup time for a pod. You can have multiple idle scan launchers waiting for instructions. |
--set soda.core.replicas=1 | (Optional) Replicate an idle worker to have more workers ready to handle instructions without setting up a new pod. |
--namespace soda-agent | Use the namespace value to identify the namespace in which to deploy the agent. |
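To make the relationship between a --set flag and a values.yml file concrete: Helm treats each dot in a --set key as one level of nesting. The following is a minimal sketch that prints the nested YAML for a single key=value pair. The function name set_to_yaml is hypothetical, and the sketch ignores Helm's handling of escaped dots, commas, and lists, and does not merge several keys into one document.

```shell
#!/bin/sh
# Print the nested YAML that corresponds to one dotted key=value pair,
# for example the argument of a single --set flag.
set_to_yaml() {
  key=${1%%=*}      # everything before the first "="
  value=${1#*=}     # everything after the first "="
  indent=""
  # While the key still contains a dot, print its first segment as a parent.
  while [ "$key" != "${key#*.}" ]; do
    printf '%s%s:\n' "$indent" "${key%%.*}"
    key=${key#*.}
    indent="$indent  "
  done
  printf '%s%s: %s\n' "$indent" "$key" "$value"
}

set_to_yaml soda.agent.name=myuniqueagent
# prints:
# soda:
#   agent:
#     name: myuniqueagent
```

Running the function on each key in the table above shows where that key sits under the top-level soda entry in a values.yml file.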
Decommission the Soda Agent and cluster
- Uninstall the Soda Agent in the cluster.
helm delete soda-agent -n soda-agent
- Delete the cluster.
gcloud container clusters delete soda-agent-gke
Refer to Google Kubernetes Engine documentation for details.
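Deleting the cluster does not remove the NAT configuration, router, subnet, and network created earlier in this guide. If you created those resources only for the Soda Agent, the following is a minimal sketch of removing them in reverse order of creation. It assumes the example names and region from this guide; the variable names, the run helper, and the cleanup_network_steps function are hypothetical. With DRY_RUN set to 1, the sketch only prints the commands.

```shell
#!/bin/sh
# Example values from this guide; the variable names are hypothetical.
REGION="us-west1"
NETWORK="soda-agent-net-2"
SUBNET="soda-agent-subnet-2"
ROUTER="agent-nat-router-1"
DRY_RUN=1   # 1 prints each command; set to 0 to execute

# Print a command, then execute it unless DRY_RUN is 1.
run() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Remove the networking resources in reverse order of creation.
cleanup_network_steps() {
  run gcloud compute routers nats delete nat-config \
    --router "$ROUTER" --region "$REGION" &&
  run gcloud compute routers delete "$ROUTER" --region "$REGION" &&
  run gcloud compute networks subnets delete "$SUBNET" --region "$REGION" &&
  run gcloud compute networks delete "$NETWORK"
}

cleanup_network_steps
```

Run these only after the cluster itself is deleted, and skip any resource that other workloads in your project still use.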
Troubleshoot deployment
Refer to Helpful kubectl commands for instructions on accessing logs and investigating issues.
Problem: Scans launched from Soda Cloud take an excessive amount of time to run.
Solution: Consider adjusting the number of replicas for idle workers with kubectl. Launch extra idle workers so at scan time, the agent can hand over instructions to an already running idle Scan Launcher to avoid the start-from-scratch setup time for a pod.
- Ensure that the agent was deployed with the soda.core configurations for idle: true and replicas: 1 or more.
- Run the following command to increase the number of active replicas to 2.
kubectl scale deployment/soda-agent-scanlauncher \
  --replicas 2 -n soda-agent
Go further
- Next: Add a data source in Soda Cloud using the Soda Agent you just deployed.
- Access a list of helpful kubectl commands for running commands on your Kubernetes cluster.
- Learn more about securely storing and accessing API keys and data source login credentials.
- Need help? Join the Soda community on Slack.