
Configure Soda Core

After you install Soda Core, you must create files and configure a few settings before you can run a scan.

  • Create a configuration YAML file to provide the details Soda Core needs to connect to your data source (except Apache Spark DataFrames, which do not use a configuration YAML file).
  • Create a checks YAML file to define your Soda Checks for data quality.

Configuration instructions
Provide credentials as system variables
Connect to Amazon Athena
Connect to Amazon Redshift
Connect to Apache Spark DataFrames
Connect to GCP BigQuery
Connect to PostgreSQL
Connect to Snowflake

Configuration instructions

Consider following the Quick start for Soda Core and Soda Cloud, which guides you through the steps to configure Soda Core and run a scan of your data.

  1. Soda Core connects with Spark DataFrames in a unique way, using programmatic scans.
    • If you are using Spark DataFrames, follow the configuration details in Connect to Apache Spark DataFrames, then skip to step 6 to create a checks YAML file.
    • If you are not using Spark DataFrames, continue to step 2.
  2. Create a directory in your environment in which to store your configuration and checks YAML files.
  3. In your code editor, create a new YAML file named configuration.yml and save it in the directory you just created.
  4. The configuration YAML file stores connection details for your data source. Use the data source-specific sections below to copy+paste the connection syntax into your file, then adjust the values to correspond with your data source’s details. You can use system variables to pass sensitive values, if you wish.
    The following is an example of the connection details Soda Core requires to connect to a PostgreSQL data source.
    data_source postgres_retail:
      type: postgres
      connection:
        host: db
        port: "5432"
        username: postgres
        password: secret
        database: postgres
      schema: public
    
  5. Save the configuration.yml file, then create another new YAML file named checks.yml and save it in the directory you created.
  6. A Soda Check is a test that Soda Core performs when it scans a dataset in your data source. The checks YAML file stores the Soda Checks you write using SodaCL. Copy+paste the following basic check syntax in your file, then adjust the value for dataset_name to correspond with the name of one of the datasets in your data source.
    checks for dataset_name:
      - row_count > 0
    
  7. Save the changes to the checks.yml file.
  8. Next: run a scan of the data in your data source, as in the example below.
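
For example, using the data source name postgres_retail from the example configuration in step 4 (replace it with the name you defined in your own configuration.yml):

soda scan -d postgres_retail -c configuration.yml checks.yml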

Provide credentials as system variables

If you wish, you can provide data source login credentials or any of the properties in the configuration YAML file as system variables instead of storing the values directly in the file. System variables persist only for as long as you have the terminal session open in which you created the variable. For a longer-term solution, consider using permanent environment variables stored in your ~/.bash_profile or ~/.zprofile files.

  1. From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following command to define a system variable for your password.
    export POSTGRES_PASSWORD=1234
    
  2. Test that the system retrieves the value that you set by running an echo command.
    echo $POSTGRES_PASSWORD
    
  3. In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.
data_source my_database_name:
  type: postgres
  connection:
    host: soda-temp-demo
    port: '5432'
    username: sodademo
    password: ${POSTGRES_PASSWORD}
    database: postgres
  schema: public

Run a scan to confirm that Soda Core connects to your data source without issue.

soda scan -d your_datasource -c configuration.yml checks.yml
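
As noted above, a variable set with export persists only for the current terminal session. For a longer-term setup, the following is a minimal sketch of the same variable stored in a shell profile; the file name assumes a zsh shell, so use ~/.bash_profile for bash.

# ~/.zprofile
export POSTGRES_PASSWORD=1234

Open a new terminal session, or run source ~/.zprofile, before running the scan so that the variable is available.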


Connect to Amazon Athena

data_source athena:
  type: athena
  connection:
    access_key_id: 
    secret_access_key: 
    region_name: eu-west-1
    staging_dir: 
    database: 
Property Required Notes
type required  
access_key_id optional Use system variables to retrieve this value securely.
secret_access_key optional Use system variables to retrieve this value securely.
region_name optional  
staging_dir required  
database required  

Access keys and IAM role are mutually exclusive: if you provide values for access_key_id and secret_access_key, you cannot use an Identity and Access Management (IAM) role; if you provide a value for role_arn, you cannot use the access keys. Refer to Identity and Access Management in Athena for details.
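
For example, the following is a sketch of an Athena configuration that reads the access keys from system variables, as described in Provide credentials as system variables. The variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, the staging directory, and the database name are illustrative; substitute your own values.

data_source athena:
  type: athena
  connection:
    access_key_id: ${AWS_ACCESS_KEY_ID}
    secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    region_name: eu-west-1
    staging_dir: s3://your-query-results-bucket/soda/
    database: your_database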

Supported data types

Category Data type
text CHAR, VARCHAR, STRING
number TINYINT, SMALLINT, INT, INTEGER, BIGINT, DOUBLE, FLOAT, DECIMAL
time DATE, TIMESTAMP

Connect to Amazon Redshift

data_source my_database_name:
  type: redshift
  connection:
    host: db
    username:
    password:
    database: soda_test
    access_key_id:
    secret_access_key:
    role_arn:
    region: eu-west-1
  schema: public
Property Required Notes
type required  
host required  
username required  
password required  
database required  
schema    
access_key_id optional Use system variables to retrieve this value securely.
secret_access_key optional Use system variables to retrieve this value securely.
role_arn optional The Amazon Resource Name of an IAM role that you want to use.
region optional  

Access keys and IAM role are mutually exclusive: if you provide values for access_key_id and secret_access_key, you cannot use an Identity and Access Management (IAM) role; if you provide a value for role_arn, you cannot use the access keys. Refer to Amazon Redshift Authorization parameters for details.
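
For example, the following is a sketch of a Redshift configuration that authenticates with an IAM role instead of access keys; the role_arn value is illustrative, so substitute the ARN of your own role.

data_source my_database_name:
  type: redshift
  connection:
    host: db
    username:
    password:
    database: soda_test
    role_arn: arn:aws:iam::123456789012:role/your-soda-role
    region: eu-west-1
  schema: public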

Supported data types

Category Data type
text CHARACTER VARYING, CHARACTER, CHAR, TEXT, NCHAR, NVARCHAR, BPCHAR
number SMALLINT, INT2, INTEGER, INT, INT4, BIGINT, INT8
time DATE, TIME, TIMETZ, TIMESTAMP, TIMESTAMPTZ

Connect to Apache Spark DataFrames

  • For use with programmatic Soda scans only.
  • Unlike other data sources, Soda Core does not require a configuration YAML file to run scans against Spark DataFrames.

Note: There are two Soda Core packages for Spark:

  • soda-core-spark-df, in its early release candidate form, enables you to pass DataFrame objects into Soda scans programmatically, after you have associated temporary tables with the DataFrames via the Spark API.
  • soda-core-spark is a work in progress; it will connect to a Spark data source much the same as other data sources do, via connection details in a configuration YAML file.

A Spark cluster contains a distributed collection of data. Spark DataFrames are distributed collections of data that are organized into named columns, much like tables in a database, and are stored in-memory in a cluster. To make a DataFrame available to Soda Core to run scans against, you must use a driver program like PySpark and the Spark API to link DataFrames to individual, named, temporary tables in the cluster. You pass this information into a Soda scan programmatically.

  1. If you are not installing Soda Core Spark DataFrames on a cluster, skip to step 2. To install Soda Core Spark DataFrames on a cluster, such as a Kubernetes cluster or a Databricks cluster, install libsasl2-dev before installing soda-core-spark-df. For Ubuntu users, install libsasl2-dev using the following command:
sudo apt-get -y install unixodbc-dev libsasl2-dev gcc python-dev
  2. Confirm that you have already:
  • installed soda-core-spark-df
  • set up a Spark session
spark_session: SparkSession = ...user-defined-way-to-create-the-spark-session...
  • built one or more DataFrames in your Spark cluster
df = ...user-defined-way-to-build-the-dataframe...
  3. Use the Spark API to link the name of a temporary table to a DataFrame. In this example, the name of the table is customers.
df.createOrReplaceTempView('customers')
  4. Use the Spark API to link a DataFrame to the name of each temporary table against which you wish to run Soda scans. Refer to PySpark documentation.
  5. Define a programmatic scan for the data in the DataFrames, and include one extra method to pass all the DataFrames to Soda Core: add_spark_session(self, spark_session, data_source_name: str). The default value for data_source_name is "spark_df". Refer to the example below.
from soda.scan import Scan

spark_session = ...your_spark_session...
df1.createOrReplaceTempView("TABLE_ONE")
df2.createOrReplaceTempView("TABLE_TWO")
...

scan = Scan()
scan.set_scan_definition_name("YOUR_SCHEDULE_NAME")
scan.set_data_source_name("spark_df")
scan.add_configuration_yaml_file(file_path="somedirectory/your_configuration.yml")
scan.add_spark_session(spark_session)
... all other scan methods in the standard programmatic scan ...
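
The following is a more complete sketch of the same pattern, assuming you have installed soda-core-spark-df and PySpark. The table name, the DataFrame contents, and the inline SodaCL check are illustrative; add_sodacl_yaml_str, execute, and get_logs_text are methods from the standard programmatic scan referenced above.

from pyspark.sql import SparkSession
from soda.scan import Scan

# Build or reuse a Spark session and a DataFrame in your own way.
spark_session = SparkSession.builder.appName("soda_demo").getOrCreate()
df = spark_session.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Link the DataFrame to a named temporary view that Soda can scan.
df.createOrReplaceTempView("customers")

scan = Scan()
scan.set_scan_definition_name("YOUR_SCHEDULE_NAME")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark_session)

# Checks can also come from a checks.yml file via add_sodacl_yaml_file().
scan.add_sodacl_yaml_str(
    """
    checks for customers:
      - row_count > 0
    """
)

scan.execute()
print(scan.get_logs_text())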


Connect to GCP BigQuery

A note about BigQuery datasets: Google uses the term dataset slightly differently than Soda (and many others) do.

  • In the context of Soda, a dataset is a representation of a tabular data structure with rows and columns. A dataset can take the form of a table in PostgreSQL or Snowflake, a stream in Kafka, or a DataFrame in a Spark application.
  • In the context of BigQuery, a dataset is “a top-level container that is used to organize and control access to your tables and views. A table or view must belong to a dataset…”

Instances of “dataset” in Soda documentation always reference the former.

data_source my_database_name:
  type: bigquery
  connection:
    account_info_json: '{
        "type": "service_account",
        "project_id": "...",
        "private_key_id": "...",
        "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
        "client_email": "...@project.iam.gserviceaccount.com",
        "client_id": "...",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://accounts.google.com/o/oauth2/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/..."
}'
    auth_scopes:
    - https://www.googleapis.com/auth/bigquery
    - https://www.googleapis.com/auth/cloud-platform
    - https://www.googleapis.com/auth/drive
    project_id: "..."
    dataset: sodacore
Property Required
type required
account_info_json required; inline properties listed below
  type required
  project_id required
  private_key_id required
  private_key required
  client_email required
  client_id required
  auth_uri required
  token_uri required
  auth_provider_x509_cert_url required
  client_x509_cert_url required
auth_scopes optional; Soda applies the three scopes listed above by default
project_id optional; overrides project_id from account_info_json
dataset required
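
As with other data sources, you can supply the service account key via a system variable instead of pasting the JSON into the file; see Provide credentials as system variables. A minimal sketch, assuming you have exported the contents of your key file to a variable named BIGQUERY_ACCOUNT_INFO_JSON (for example, with export BIGQUERY_ACCOUNT_INFO_JSON="$(cat service-account.json)"):

data_source my_database_name:
  type: bigquery
  connection:
    account_info_json: ${BIGQUERY_ACCOUNT_INFO_JSON}
    dataset: sodacore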

Supported data types

Category Data type
text STRING
number INT64, DECIMAL, BIGNUMERIC, BIGDECIMAL, FLOAT64
time DATE, DATETIME, TIME, TIMESTAMP

Connect to PostgreSQL

data_source my_database_name:
  type: postgres
  connection:
    host: db
    port: "5432"
    username:
    password:
    database: postgres
  schema: public
Property Required Notes
type required  
host required  
port optional  
username required Use system variables to retrieve this value securely.
password required Use system variables to retrieve this value securely.
database required  
schema required  

Supported data types

Category Data type
text CHARACTER VARYING, CHARACTER, CHAR, TEXT
number SMALLINT, INTEGER, BIGINT, DECIMAL, NUMERIC, VARIABLE, REAL, DOUBLE PRECISION, SMALLSERIAL, SERIAL, BIGSERIAL
time TIMESTAMP, DATE, TIME, TIMESTAMP WITH TIME ZONE, TIMESTAMP WITHOUT TIME ZONE, TIME WITH TIME ZONE, TIME WITHOUT TIME ZONE

Connect to Snowflake

data_source orders:
  type: snowflake
  connection:
    username: "SODATESTING"
    password: "abc123"
    account: sodadatapartner.eu-central-1
    database: SNOWFLAKE_SAMPLE_DATA
    warehouse:
    connection_timeout:
    role: PUBLIC
    client_session_keep_alive:
    session_parameters:
      QUERY_TAG: soda-queries
      QUOTED_IDENTIFIERS_IGNORE_CASE: false
  schema: public
Property Required Notes
type required The type of data source; in this example, snowflake.
username required Use system variables to retrieve this value securely using, for example, ${SNOWFLAKE_USER}.
password required Use system variables to retrieve this value securely using, for example, ${SNOWFLAKE_PASSWORD}.
account required Use system variables to retrieve this value securely using, for example, ${SNOWFLAKE_ACCOUNT}.
database required  
schema required  
warehouse required  
connection_timeout required  
role optional See Snowflake System-Defined Roles for details.
QUERY_TAG optional See QUERY_TAG in Snowflake documentation.
QUOTED_IDENTIFIERS_IGNORE_CASE optional See QUOTED_IDENTIFIERS_IGNORE_CASE in Snowflake documentation.
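
For example, the following is a sketch of the same Snowflake connection with the credentials supplied as system variables, using the variable names suggested in the table above; the warehouse and connection_timeout values are illustrative.

data_source orders:
  type: snowflake
  connection:
    username: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    account: ${SNOWFLAKE_ACCOUNT}
    database: SNOWFLAKE_SAMPLE_DATA
    warehouse: COMPUTE_WH
    connection_timeout: 240
    role: PUBLIC
  schema: public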

Supported data types

Category Data type
text CHAR, VARCHAR, CHARACTER, STRING, TEXT
number NUMBER, INT, INTEGER, BIGINT, SMALLINT, TINYINT, BYTEINT, FLOAT, FLOAT4, FLOAT8, DOUBLE, DOUBLE PRECISION, REAL
time DATE, DATETIME, TIME, TIMESTAMP, TIMESTAMP_LTZ, TIMESTAMP_NTZ, TIMESTAMP_TZ
