
Configure Soda Core

Create a configuration YAML file to configure connection details for Soda Core to access your data source. Store the configuration.yml as a hidden file in your local user home directory: ~/.soda/configuration.yml. When you execute a scan, Soda Core uses that path and filename by default to find the configuration YAML and use the information within to connect to your data source.
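
In general, the file nests connection details under a named data_source section, alongside any global sections such as soda_cloud. The following is a minimal sketch of the file's shape, with placeholder names and values; refer to the per-data-source examples below for the real properties.

data_source my_datasource_name:
  type: postgres
  connection:
    host: localhost
    username:
    password:
    database: postgres
  schema: public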

To set the data source configurations, use the following example configurations that correspond to each kind of data source that Soda Core supports.

Connect to Amazon Athena
Connect to Amazon Redshift
Connect to Apache Spark DataFrames
Connect to GCP BigQuery
Connect to PostgreSQL
Connect to Snowflake
Add prefixes to datasets
Connect Soda Core to Soda Cloud

Connect to Amazon Athena

data_source athena:
  type: athena
  connection:
    access_key_id: 
    secret_access_key: 
    region_name: eu-west-1
    staging_dir: 
    database: 
Property Required Notes
type required  
access_key_id optional Use system variables to retrieve this value securely.
secret_access_key optional Use system variables to retrieve this value securely.
region_name optional  
staging_dir required  
database required  

Access keys and IAM role are mutually exclusive: if you provide values for access_key_id and secret_access_key, you cannot use an Identity and Access Management (IAM) role; if you provide a value for role_arn, you cannot use the access keys. Refer to Identity and Access Management in Athena for details.
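
For example, to follow the system-variable guidance above, you can export the keys as environment variables and reference them in the configuration. A sketch, in which the variable names ATHENA_ACCESS_KEY_ID and ATHENA_SECRET_ACCESS_KEY are illustrative rather than required names:

data_source athena:
  type: athena
  connection:
    access_key_id: ${ATHENA_ACCESS_KEY_ID}
    secret_access_key: ${ATHENA_SECRET_ACCESS_KEY}
    region_name: eu-west-1
    staging_dir:
    database: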

Supported data types

Category Data type
text CHAR, VARCHAR, STRING
number TINYINT, SMALLINT, INT, INTEGER, BIGINT, DOUBLE, FLOAT, DECIMAL
time DATE, TIMESTAMP

Connect to Amazon Redshift

data_source my_database_name:
  type: redshift
  connection:
    host: db
    username:
    password:
    database: soda_test
    access_key_id:
    secret_access_key:
    role_arn:
    region: eu-west-1
  schema: public
Property Required Notes
type required  
host required  
username required  
password required  
database required  
schema    
access_key_id optional Use system variables to retrieve this value securely.
secret_access_key optional Use system variables to retrieve this value securely.
role_arn optional The Amazon Resource Name of an IAM role that you want to use.
region optional  

Access keys and IAM role are mutually exclusive: if you provide values for access_key_id and secret_access_key, you cannot use an Identity and Access Management (IAM) role; if you provide a value for role_arn, you cannot use the access keys. Refer to Amazon Redshift Authorization parameters for details.
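
As a sketch of the IAM-role alternative, set role_arn and omit the access keys; the environment variable name REDSHIFT_ROLE_ARN here is illustrative:

data_source my_database_name:
  type: redshift
  connection:
    host: db
    username:
    password:
    database: soda_test
    role_arn: ${REDSHIFT_ROLE_ARN}
    region: eu-west-1
  schema: public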

Supported data types

Category Data type
text CHARACTER VARYING, CHARACTER, CHAR, TEXT, NCHAR, NVARCHAR, BPCHAR
number SMALLINT, INT2, INTEGER, INT, INT4, BIGINT, INT8
time DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ

Connect to Apache Spark DataFrames

  • For use with programmatic Soda scans only.
  • Unlike other data sources, Soda Core does not require a configuration YAML file to run scans against Spark DataFrames.

Note: There are two Soda Core packages for Spark:

  • soda-core-spark-df, in its early release candidate form, enables you to pass DataFrame objects into Soda scans programmatically, after you have associated temporary tables with DataFrames via the Spark API.
  • soda-core-spark continues as a work-in-progress and will connect to Soda Core much the same as other data sources, via connection details in a configuration YAML.

A Spark cluster contains a distributed collection of data. Spark DataFrames are distributed collections of data that are organized into named columns, much like a table in a database, and which are stored in-memory in a cluster. To make a DataFrame available to Soda Core to run scans against, you must use a driver program like PySpark and the Spark API to link DataFrames to individual, named, temporary tables in the cluster. You pass this information into a Soda scan programmatically.

  1. If you are not installing Soda Core Spark DataFrames on a cluster, skip to step 2. To install Soda Core Spark DataFrames on a cluster, such as a Kubernetes cluster or a Databricks cluster, install libsasl2-dev before installing soda-core-spark-df. For Ubuntu users, install libsasl2-dev using the following command:
sudo apt-get -y install unixodbc-dev libsasl2-dev gcc python-dev
  2. Confirm that you have already:
  • installed soda-core-spark-df
  • set up a Spark session
spark_session: SparkSession = ...user-defined-way-to-create-the-spark-session...
  • confirmed that your Spark cluster contains one or more DataFrames
df = ...user-defined-way-to-build-the-dataframe...
  3. Use the Spark API to link the name of a temporary table to a DataFrame. In this example, the name of the table is customers.
df.createOrReplaceTempView('customers')
  4. Use the Spark API to link a DataFrame to the name of each temporary table against which you wish to run Soda scans. Refer to the PySpark documentation.
  5. Define a programmatic scan for the data in the DataFrames, and include one extra method to pass all the DataFrames to Soda Core: add_spark_session(self, spark_session, data_source_name: str). The default value for data_source_name is "spark_df". Refer to the example below.
# Import the Scan class from Soda Core.
from soda.scan import Scan

spark_session = ...your_spark_session...
df1.createOrReplaceTempView("TABLE_ONE")
df2.createOrReplaceTempView("TABLE_TWO")
...

scan = Scan()
scan.set_scan_definition_name("YOUR_SCHEDULE_NAME")
scan.set_data_source_name("spark_df")
scan.add_configuration_yaml_file(file_path="somedirectory/your_configuration.yml")
# Pass the Spark session that holds the temporary views registered above.
scan.add_spark_session(spark_session)
... all other scan methods in the standard programmatic scan ...
scan.execute()


Connect to GCP BigQuery

A note about BigQuery datasets: Google uses the term dataset slightly differently than Soda (and many others) do.

  • In the context of Soda, a dataset is a representation of a tabular data structure with rows and columns. A dataset can take the form of a table in PostgreSQL or Snowflake, a stream in Kafka, or a DataFrame in a Spark application.
  • In the context of BigQuery, a dataset is “a top-level container that is used to organize and control access to your tables and views. A table or view must belong to a dataset…”

Instances of “dataset” in Soda documentation always reference the former.

data_source my_database_name:
  type: bigquery
  connection:
    account_info_json: '{
        "type": "service_account",
        "project_id": "...",
        "private_key_id": "...",
        "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
        "client_email": "...@project.iam.gserviceaccount.com",
        "client_id": "...",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://accounts.google.com/o/oauth2/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/..."
}'
    auth_scopes:
    - https://www.googleapis.com/auth/bigquery
    - https://www.googleapis.com/auth/cloud-platform
    - https://www.googleapis.com/auth/drive
    project_id: "..."
    dataset: sodacore
Property Required
type required
account_info_json required; inline properties listed below
  type required
  project_id required
  private_key_id required
  private_key required
  client_email required
  client_id required
  auth_uri required
  token_uri required
  auth_provider_x509_cert_url required
  client_x509_cert_url required
auth_scopes optional; Soda applies the three scopes listed above by default
project_id optional; overrides project_id from account_info_json
dataset required

Supported data types

Category Data type
text STRING
number INT64, DECIMAL, BIGNUMERIC, BIGDECIMAL, FLOAT64
time DATE, DATETIME, TIME, TIMESTAMP

Connect to PostgreSQL

data_source my_database_name:
  type: postgres
  connection:
    host: db
    port: "5432"
    username:
    password:
    database: postgres
  schema: public
Property Required Notes
type required  
host required  
port optional  
username required Use system variables to retrieve this value securely.
password required Use system variables to retrieve this value securely.
database required  
schema required  

Supported data types

Category Data type
text CHARACTER VARYING, CHARACTER, CHAR, TEXT
number SMALLINT, INTEGER, BIGINT, DECIMAL, NUMERIC, REAL, DOUBLE PRECISION, SMALLSERIAL, SERIAL, BIGSERIAL
time DATE, TIME, TIMESTAMP, TIMESTAMP WITH TIME ZONE, TIMESTAMP WITHOUT TIME ZONE, TIME WITH TIME ZONE, TIME WITHOUT TIME ZONE

Connect to Snowflake

data_source orders:
  type: snowflake
  connection:
    username: "SODATESTING"
    password: "abc123"
    account: sodadatapartner.eu-central-1
    database: SNOWFLAKE_SAMPLE_DATA
    warehouse:
    connection_timeout:
    role: PUBLIC
    client_session_keep_alive:
    session_parameters:
      QUERY_TAG: soda-queries
      QUOTED_IDENTIFIERS_IGNORE_CASE: false
  schema: public
Property Required Notes
type required  
username required Use system variables to retrieve this value securely using, for example, ${SNOWFLAKE_USER}.
password required Use system variables to retrieve this value securely using, for example, ${SNOWFLAKE_PASSWORD}.
account required Use system variables to retrieve this value securely using, for example, ${SNOWFLAKE_ACCOUNT}.
database required  
schema required  
warehouse required The name of your Snowflake virtual warehouse.
connection_timeout required  
role optional See Snowflake System-Defined Roles for details.
QUERY_TAG optional See QUERY_TAG in Snowflake documentation.
QUOTED_IDENTIFIERS_IGNORE_CASE optional See QUOTED_IDENTIFIERS_IGNORE_CASE in Snowflake documentation.
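
Putting the notes above together, the following sketch retrieves the Snowflake credentials from the SNOWFLAKE_USER, SNOWFLAKE_PASSWORD, and SNOWFLAKE_ACCOUNT system variables mentioned in the table:

data_source orders:
  type: snowflake
  connection:
    username: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    account: ${SNOWFLAKE_ACCOUNT}
    database: SNOWFLAKE_SAMPLE_DATA
    warehouse:
    connection_timeout:
    role: PUBLIC
  schema: public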

Supported data types

Category Data type
text CHAR, VARCHAR, CHARACTER, STRING, TEXT
number NUMBER, INT, INTEGER, BIGINT, SMALLINT, TINYINT, BYTEINT, FLOAT, FLOAT4, FLOAT8, DOUBLE, DOUBLE PRECISION, REAL
time DATE, DATETIME, TIME, TIMESTAMP, TIMESTAMP_LTZ, TIMESTAMP_NTZ, TIMESTAMP_TZ

Add prefixes to datasets

To assist in identifying details when Soda Core scans your data with the verbose option, you can prefix a table name with the name of a database or schema. Use the data source property table_prefix according to the following example.

data_source my_database_name:
  type: postgres
  connection:
    host: db
    username:
    password:
    database: postgres
  schema: public
  table_prefix: "public"

If you have configured the table_prefix property, a check for row_count yields scan output that reveals the table prefix information: FROM public.CUSTOMERS.

Checks YAML file:

checks for "CUSTOMERS":
  - row_count > 0

Soda scan command, with the -V option for verbose mode:

soda scan -d adventureworks -V -c configuration.yml checks.yml

Soda scan output:

Soda Core 0.0.x
Reading configuration file "configuration.yml"
Reading SodaCL file "checks.yml"
Scan execution starts
Query adventureworks.CUSTOMERS.aggregation[0]:
SELECT
  COUNT(*)
FROM public.CUSTOMERS
Scan summary:
1/1 query OK
  adventureworks.CUSTOMERS.aggregation[0] [OK] 0:00:00.035515
1/1 check PASSED:
    CUSTOMERS in adventureworks
      row_count > 0 [PASSED]
        check_value: 99
All is good. No failures. No warnings. No errors.

Connect Soda Core to Soda Cloud

  1. If you have not already done so, create a free Soda Cloud Developer account at cloud.soda.io/signup.
  2. To your configuration.yml, add the following syntax. Be sure to add the syntax for soda_cloud at the root level of the YAML file, not nested under any other data_source syntax.
    soda_cloud:
      host: cloud.soda.io
      api_key_id:
      api_key_secret:
    
  3. In your Soda Cloud account, navigate to your avatar > Profile > API Keys, then click the plus icon to generate new API keys.
    • Copy the API Key ID, then paste it into the configuration.yml as the value for api_key_id.
    • Copy the API Key Secret, then paste it into the configuration.yml as the value for api_key_secret.
  4. Save the changes to the configuration.yml file. Close the Create API Key dialog box in Soda Cloud.
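
As with data source credentials, you can avoid storing API key values in plain text. Assuming the same system-variable substitution applies to the soda_cloud section, and using the illustrative variable names SODA_API_KEY_ID and SODA_API_KEY_SECRET, the block could read:

soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_API_KEY_ID}
  api_key_secret: ${SODA_API_KEY_SECRET}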

The next time you execute a scan in Soda Core, it pushes the scan results to Soda Cloud, where you can view them in the Monitors dashboard. Refer to Soda Cloud documentation for more information.


Last modified on 01-Jul-22
