Connect Soda to GCP BigQuery
Last modified on 31-May-23
For Soda to run quality scans of your data, you must configure it to connect to your data source.
- For Soda Core, add the connection configurations to your configuration.yml file.
- For Soda Cloud, add the connection configurations to step 3 of the New Data Source workflow.
Connection configuration
Authentication methods
Test the data source connection
Supported data types
Use a file reference for a BigQuery data source connection
A note about BigQuery datasets: Google uses the term dataset slightly differently than Soda (and many others) do.
- In the context of Soda, a dataset is a representation of a tabular data structure with rows and columns. A dataset can take the form of a table in PostgreSQL or Snowflake, a stream, or a DataFrame in a Spark application.
- In the context of BigQuery, a dataset is “a top-level container that is used to organize and control access to your tables and views. A table or view must belong to a dataset…”
Instances of “dataset” in Soda documentation always reference the former.
Connection configuration
Install package: soda-core-bigquery
data_source my_datasource_name:
  type: bigquery
  connection:
    account_info_json: '{
        "type": "service_account",
        "project_id": "...",
        "private_key_id": "...",
        "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
        "client_email": "...@project.iam.gserviceaccount.com",
        "client_id": "...",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://accounts.google.com/o/oauth2/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/..."
      }'
    auth_scopes:
      - https://www.googleapis.com/auth/bigquery
      - https://www.googleapis.com/auth/cloud-platform
      - https://www.googleapis.com/auth/drive
    project_id: "..."
    dataset: sodacore
Property | Required |
---|---|
type | required |
account_info_json | optional; inline properties listed below; if not provided, Soda uses Google Application Default Credentials |
type | required |
project_id | required |
private_key_id | required |
private_key | required |
client_email | required |
client_id | required |
auth_uri | required |
token_uri | required |
auth_provider_x509_cert_url | required |
client_x509_cert_url | required |
auth_scopes | optional; Soda applies the three scopes listed above by default |
project_id | optional; overrides project_id from account_info_json |
storage_project_id | optional; enables you to use separate project for compute and storage |
dataset | required |
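Before pasting a service account key into account_info_json, it can help to confirm that it carries every key the table above marks as required. The following is a minimal sketch using only the Python standard library; the helper name missing_account_info_keys is hypothetical and is not part of Soda.

```python
import json

# Keys the table above lists as required inside account_info_json.
REQUIRED_ACCOUNT_INFO_KEYS = (
    "type",
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
    "client_id",
    "auth_uri",
    "token_uri",
    "auth_provider_x509_cert_url",
    "client_x509_cert_url",
)


def missing_account_info_keys(account_info_json: str) -> list[str]:
    """Return the required keys that are absent from the JSON string.

    Hypothetical helper for illustration only; Soda does not ship it.
    """
    info = json.loads(account_info_json)
    if not isinstance(info, dict):
        raise ValueError("account_info_json must be a JSON object")
    # Preserve the table's order so the report is easy to compare against it.
    return [key for key in REQUIRED_ACCOUNT_INFO_KEYS if key not in info]


if __name__ == "__main__":
    sample = '{"type": "service_account", "project_id": "..."}'
    print(missing_account_info_keys(sample))
```

The check only looks for the presence of each key; it does not verify that the values are valid credentials.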
Authentication methods
With GCP BigQuery, you can authenticate the connection using one of the following methods.
- Application Default Credentials
- Application Default Credentials with Service Account impersonation
- Service Account Key (see connection configuration above)
- Service Account Key with Service Account impersonation
Application Default Credentials
Add the use_context_auth property to your connection configuration, as per the following example.
data_source my_datasource:
  type: bigquery
  connection:
    use_context_auth: True
Application Default Credentials with Service Account impersonation
Add the use_context_auth and impersonation_account properties to your connection configuration, as per the following example.
data_source my_datasource:
  type: bigquery
  connection:
    use_context_auth: True
    impersonation_account: <SA_EMAIL>
Service Account Key with Service Account impersonation
Add the impersonation_account property to your connection configuration, as per the following example.
data_source my_database_name:
  type: bigquery
  connection:
    account_info_json: '{
        "type": "service_account",
        "project_id": "...",
        "private_key_id": "...",
        ...}'
    impersonation_account: <SA_EMAIL>
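The four authentication methods differ only in which connection properties they set. The sketch below summarizes that mapping as a small Python function; the method labels ("adc", "adc_impersonation", "key", "key_impersonation") and the function name are hypothetical and chosen for illustration, while the property names come from the examples above.

```python
from typing import Optional

# Hypothetical labels for the four methods described above.
METHODS = ("adc", "adc_impersonation", "key", "key_impersonation")


def connection_properties(
    method: str,
    account_info_json: Optional[str] = None,
    impersonation_account: Optional[str] = None,
) -> dict:
    """Return the connection properties that a given method needs."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")

    props: dict = {}
    if method.startswith("key"):
        # Service Account Key: credentials are supplied inline as JSON.
        if not account_info_json:
            raise ValueError("this method requires account_info_json")
        props["account_info_json"] = account_info_json
    else:
        # Application Default Credentials: rely on the ambient context.
        props["use_context_auth"] = True

    if method.endswith("impersonation"):
        if not impersonation_account:
            raise ValueError("this method requires impersonation_account")
        props["impersonation_account"] = impersonation_account
    return props


if __name__ == "__main__":
    print(connection_properties("adc_impersonation", impersonation_account="<SA_EMAIL>"))
```

The returned dictionary mirrors the keys that sit under connection in each YAML example; it is a reading aid, not a replacement for the configuration file.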
Test the data source connection
To confirm that you have correctly configured the connection details for the data source(s) in your configuration YAML file, use the test-connection command. If you wish, add the -V option to the command to return results in verbose mode in the CLI.
soda test-connection -d my_datasource -c configuration.yml -V
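If you want to run the same check from a script, for example in a CI job, the command can be assembled and invoked from Python. This is a minimal sketch: the function names are hypothetical, and it assumes the soda CLI is on the PATH and signals failure through a non-zero exit code.

```python
import subprocess


def build_test_connection_command(
    data_source: str,
    configuration_file: str = "configuration.yml",
    verbose: bool = False,
) -> list[str]:
    """Assemble the test-connection command shown above as an argument list."""
    command = ["soda", "test-connection", "-d", data_source, "-c", configuration_file]
    if verbose:
        command.append("-V")  # verbose output, as described above
    return command


def run_test_connection(data_source: str, configuration_file: str = "configuration.yml") -> int:
    """Run the command and return its exit code (assumes soda is installed)."""
    command = build_test_connection_command(data_source, configuration_file, verbose=True)
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch works without Soda.
    print(" ".join(build_test_connection_command("my_datasource", verbose=True)))
```

Passing the arguments as a list avoids shell quoting problems if a data source name or file path contains spaces.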
Supported data types
Category | Data type |
---|---|
text | STRING |
number | INT64, DECIMAL, BIGNUMERIC, BIGDECIMAL, FLOAT64 |
time | DATE, DATETIME, TIME, TIMESTAMP |
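To check whether a column's BigQuery type falls into one of the categories above, a simple lookup is enough. The sketch below is illustrative only; the names SUPPORTED_TYPES and soda_category are hypothetical, and the lookup does not handle parameterized types.

```python
from typing import Optional

# Categories and BigQuery type names from the table above.
# BIGNUMERIC is BigQuery's spelling of its wide decimal type.
SUPPORTED_TYPES = {
    "text": {"STRING"},
    "number": {"INT64", "DECIMAL", "BIGNUMERIC", "BIGDECIMAL", "FLOAT64"},
    "time": {"DATE", "DATETIME", "TIME", "TIMESTAMP"},
}


def soda_category(bigquery_type: str) -> Optional[str]:
    """Return the category for a BigQuery type name, or None if it is not listed."""
    normalized = bigquery_type.strip().upper()
    for category, types in SUPPORTED_TYPES.items():
        if normalized in types:
            return category
    return None


if __name__ == "__main__":
    for name in ("int64", "TIMESTAMP", "GEOGRAPHY"):
        print(name, "->", soda_category(name))
```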
Use a file reference for a BigQuery data source connection
If you already store information about your data source in a JSON file in a secure location, you can configure your BigQuery data source connection details in Soda Cloud to refer to the JSON file for service account information. To do so, you must add two elements:
- volumes and volumeMounts parameters in the values.yml file that your Soda Agent Helm chart uses
- the account_info_json_path in your data source connection configuration
You, or an IT Admin in your organization, can add the following scanlauncher parameters to the existing values.yml that your Soda Agent uses for deployment and redeployment in your Kubernetes cluster. Refer to Deploy using a values YAML file for details.
soda:
  scanlauncher:
    volumeMounts:
      - name: gcloud-credentials
        mountPath: /opt/soda/etc
    volumes:
      - name: gcloud-credentials
        secret:
          secretName: gcloud-credentials
          items:
            - key: serviceaccount.json
              path: serviceaccount.json
Use the following command to add the service account information to a Kubernetes secret that the Soda Agent consumes according to the configuration above; replace the angle brackets and the values in them with your own values.
kubectl create secret generic -n <soda-agent-namespace> gcloud-credentials --from-file=serviceaccount.json=<local path to the serviceaccount.json>
After you make both of these changes, you must redeploy the Soda Agent. Refer to Deploy using a values YAML file for details.
Adjust the data source connection configuration to include the account_info_json_path configuration, as per the following example.
my_datasource_name:
  type: bigquery
  connection:
    account_info_json_path: /opt/soda/etc/serviceaccount.json
    auth_scopes:
      - https://www.googleapis.com/auth/bigquery
      - https://www.googleapis.com/auth/cloud-platform
      - https://www.googleapis.com/auth/drive
    project_id: ***
    dataset: sodacore
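The value of account_info_json_path has to point at the file that Kubernetes places inside the container, which is the volume's mountPath joined with the secret item's path. The following sketch makes that relationship explicit; the function names are hypothetical, and the values are the ones used in the examples above.

```python
import posixpath


def mounted_key_path(mount_path: str, item_path: str) -> str:
    """Return where a secret item appears inside the container."""
    # Container paths are POSIX paths regardless of the local operating system.
    return posixpath.join(mount_path, item_path)


def path_matches_mount(account_info_json_path: str, mount_path: str, item_path: str) -> bool:
    """Check that the configured path points at the mounted secret file."""
    expected = mounted_key_path(mount_path, item_path)
    return posixpath.normpath(account_info_json_path) == posixpath.normpath(expected)


if __name__ == "__main__":
    print(mounted_key_path("/opt/soda/etc", "serviceaccount.json"))
    print(path_matches_mount(
        "/opt/soda/etc/serviceaccount.json", "/opt/soda/etc", "serviceaccount.json"
    ))
```

If you change mountPath or the item's path in values.yml, update account_info_json_path to match.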