Link Search Menu Expand Document

Add a data source

Last modified on 06-Dec-22

To run scans for data quality in your data source such as PostgreSQL or GCP BigQuery, you must first connect Soda Cloud to a data source.

In setting up a data source, you provide details such as database access credentials, details about the frequency with which Soda Cloud runs scans on this data source, and how much profile information Soda Cloud fetches about the datasets in your data source.

Compatibility
Prerequisites
Add a new data source
Go further

Compatibility

You can connect Soda Cloud with the following data sources via a Soda Agent:

Amazon Athena
Amazon Redshift
Azure Synapse (Experimental)
ClickHouse (Experimental)
Denodo (Experimental)
Dremio
DuckDB (Experimental)
GCP Big Query
IBM DB2
MS SQL Server
MySQL
OracleDB
PostgreSQL
Snowflake
Trino

Soda Cloud can, however, accept scan results from Soda Core for the following data sources:

Amazon Athena
Amazon Redshift
Apache Spark DataFrames1
Apache Spark for Databricks SQL
Azure Synapse (Experimental)
ClickHouse (Experimental)
Denodo (Experimental)
Dremio
DuckDB (Experimental)
GCP Big Query
IBM DB2
MS SQL Server
MySQL
OracleDB
PostgreSQL
Snowflake
Trino

1 For use with programmatic Soda scans, only.

Prerequisites

  • You have created a Soda Cloud account.
  • You, or an IT Administrator in your organization, has deployed a Soda Agent in an Amazon Elastic Kubernetes Service (EKS) Fargate environment and connected it to your Soda Cloud account.

Add a new data source

In your Soda Cloud account, navigate to your avatar > Scans & Data. Click New Data Source, then follow the guided steps to create a new data source. Refer to the sections below for insight into the values to enter in the fields and editing panels in the guided steps.

1. Attributes

Field or Label Guidance
Data Source Label Provide a unique identifier for the data source. Soda Cloud uses the label you provide to define the immutable name of the data source against which it runs the Default Scan.
Default Scan Definition Label Provide a name for the default scan definition. A collection of checks YAML files that contain the checks for data quality you wish to scan at a specific time, including details for which Soda Agent to use to connect to which data source.
Default Scan Definition Agent Select the name of a Soda Agent that you have previously set up in your secure environment and connected to a specific data source. This identifies the Soda Agent to which Soda Cloud must connect in order to run its scan.
Schedule Definition Provide the scan frequency details Soda Cloud uses to execute scans according to your needs. If you wish, you can define the schedule as a cron expression.
Starting At Select the time of day to run the scan. The default value is midnight.
Time Zone Select a timezone. The default value is UTC.
Cron Expression (Optional) Write your own cron expression to define the schedule Soda Cloud uses to run scans.


2. Connect the Data Source

In the editing panel, provide the connection configurations Soda Cloud needs to be able to access the data in the data source. Connection configurations are data source-specific and include values for things such as a database’s host and access credentials.

To more securely provide sensitive values such as usernames and passwords, use environment variables in a values.yml file when you deploy the Soda Agent. See Use environment variables for data source connection credentials for details.

Access the data source-specific connection configurations listed below to copy+paste the connection syntax into the editing panel, then adjust the values to correspond with your data source’s details.


3. Discover Datasets

During its initial scan of your datasource, Soda Cloud discovers all the datasets the data source contains. It captures basic information about each dataset, including a dataset’s schema and the columns it contains.

In the editing panel, specify the datasets that Soda Cloud should include or exclude from this basic discovery activity. The default syntax in the editing panel indicates that Soda Cloud will collect basic dataset information from all datasets in the data source except those with names that begin with test_. The % is a wildcard character.

discover datasets:
  datasets:
    - include %
    - exclude test_%


4. Profile datasets

To gather more detailed profile information about datasets in your data source, you can configure Soda Cloud to profile the columns in datasets.

Column profile information includes details such as the calculated mean value of data in a column, the maximum and minimum values in a column, and the number of rows with missing data.

In the editing panel, provide details that Soda Cloud uses to determine which datasets to include or exclude when it profiles the columns in a dataset. The default syntax in the editing panel indicates that Soda will profile every column of every dataset in this data source, and, superfluously, all datasets with names that begin with prod. The % is a wildcard character.

Column profiling can be resource-heavy, so carefully consider the datasets for which you truly need column profile information.

profile columns:
  columns:
    - "%.%"    # Includes all your datasets
    - "prod%"  # Includes all datasets that begin with 'prod'


5. Check Datasets

When Soda Cloud automatically discovers the datasets in a data source, it prepares automated monitoring checks for each dataset. These checks detect anomalies and monitor schema evolution, corresponding to the SodaCL anomaly score and schema checks, respectively.

In the editing panel, specify the datasets that Soda Cloud should include or exclude when preparing automated monitoring checks. The default syntax in the editing panel indicates that Soda will add automated monitoring to all datasets in the data source except those with names that begin with test_. The % is a wildcard character.

automated monitoring:
  datasets:
    - include %
    - exclude test_%

6. Assign Owner

Field or Label Guidance
Data Source Owner The Data Source Owner maintains the connection details and settings for this data source and its Default Scan Schedule.
Default Dataset Owner The Datasets Owner is the user who, by default, becomes the owner of each dataset the Default Scan discovers. Refer to Roles and Rights in Soda Cloud to learn how to adjust the Dataset Owner of individual datasets.

Go further


Was this documentation helpful?

What could we do to improve this page?


Last modified on 06-Dec-22