Last modified on 30-Nov-23
Use this guide to set up Soda and begin automatically monitoring the data quality of datasets in a data source. Use the guided workflow in Soda Cloud to connect to a data source, profile your data, and add automated checks for data quality.
This guide offers Data Analysts, Data Scientists, and business users instructions to set up Soda to profile and begin monitoring data for quality, right out of the box.
This example uses the self-hosted agent deployment model, in which Soda Cloud connects to a Soda Agent that securely accesses data sources and executes scheduled scans for data quality. See: Choose a flavor of Soda.
The Soda Agent is a tool that empowers Soda Cloud users to securely access data sources to scan for data quality. To set up a Soda Agent, you create a Kubernetes cluster in a cloud services provider environment, then use Helm to deploy a Soda Agent in the cluster.
If you do not have the access or authorization to set up a cluster or deploy the containerized agent, pass the instructions to your data engineering or IT team to complete the exercise for you.
- Create a Soda Cloud account, which is free for a 45-day trial.
- Access the exhaustive deployment instructions to deploy a Soda Agent in the cloud services provider you use:
  - Cloud services provider-agnostic instructions
  - Amazon Elastic Kubernetes Service (EKS)
  - Microsoft Azure Kubernetes Service (AKS)
  - Google Kubernetes Engine (GKE)
- Use the instructions for managing sensitive values to securely store API key values and data source login credential values that the Soda Agent needs to connect to both Soda Cloud and your data sources. Pass the environment variable identifiers to your colleagues to use when adding a new data source in Soda Cloud.
- Invite your colleague(s) to your Soda Cloud organization so they can access the newly-deployed Soda Agent to connect to data sources and begin monitoring data quality. In your Soda Cloud account, navigate to your avatar > Invite Team Members and fill in the blanks.
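As a rough sketch of what the values file passed to Helm might contain, the example below stores the Soda Cloud API key values and data source login credentials as environment variables for the agent. The layout (soda.apikey, soda.agent.name, soda.env) follows the pattern used in Soda's agent deployment instructions, but treat the key names as assumptions to verify against the instructions for your provider and chart version. The agent name and the POSTGRES_USER and POSTGRES_PASS identifiers are hypothetical.

```yaml
# values.yml: a sketch, not a complete or authoritative chart configuration
soda:
  apikey:
    id: "***"        # API key ID generated in Soda Cloud
    secret: "***"    # API key secret generated in Soda Cloud
  agent:
    name: "my-soda-agent"   # hypothetical agent name
  env:
    # Hypothetical identifiers; share these names, not the values, with colleagues
    POSTGRES_USER: "***"
    POSTGRES_PASS: "***"
```

A colleague who adds a data source in Soda Cloud would then reference an identifier such as ${POSTGRES_USER} rather than the credential value itself.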
- If you have not already done so, create a Soda Cloud account, or accept your colleague’s emailed invitation to create an account and join their Soda Cloud organization.
- In Soda Cloud, navigate to your avatar > Data Sources.
- In the Agents tab, confirm that you can see the Soda Agent you deployed and that its status is “green” in the Last Seen column. If not, refer to the Soda Agent documentation to troubleshoot its status.
- Navigate to the Data source tab, then click New Data Source and follow the guided steps to connect to a new data source and opt in to automated monitoring checks.
Refer to the sections below for insight into the values to enter in the fields and editing panels in the guided steps.
| Field or Label | Guidance |
| --- | --- |
| Data Source Label | Provide a unique identifier for the data source. Soda Cloud uses the label you provide to define the immutable name of the data source against which it runs the Default Scan. |
| Default Scan Schedule Label | Provide a name for the default scan schedule for this data source. The scan schedule indicates which Soda Agent to use to execute the scan, and when. |
| Default Scan Schedule Agent | Select the name of a Soda Agent that you have previously set up in your secure environment and connected to a specific data source. This identifies the Soda Agent to which Soda Cloud must connect in order to run its scan. |
| Schedule Definition | Provide the scan frequency details Soda Cloud uses to execute scans according to your needs. If you wish, you can define the schedule as a cron expression. |
| Starting At | Select the time of day to run the scan. The default value is midnight. |
| Time Zone | Select a time zone. The default value is UTC. |
| Cron Expression | (Optional) Write your own cron expression to define the schedule Soda Cloud uses to run scans. |
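For readers new to cron, the reference card below lists a few example expressions. It assumes the standard five-field format (minute, hour, day of month, month, day of week); confirm in the guided workflow that Soda Cloud accepts this format. The keys are descriptive labels for this illustration only and are not Soda configuration; only the quoted expression would go in the Cron Expression field.

```yaml
# Illustrative labels only; not Soda configuration keys.
# Field order assumed: minute hour day-of-month month day-of-week
daily_at_midnight: "0 0 * * *"
weekdays_at_0600: "0 6 * * 1-5"
every_four_hours: "0 */4 * * *"
every_thirty_minutes: "*/30 * * * *"
```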
In the editing panel, provide the connection configurations Soda Cloud needs to be able to access the data in the data source. Connection configurations are data source-specific and include values for things such as a database’s host and access credentials.
To more securely provide sensitive values such as usernames and passwords, use environment variables in a values.yml file when you deploy the Soda Agent. See Use environment variables for data source connection credentials for details.
Access the data source-specific connection configurations to copy+paste the connection syntax into the editing panel, then adjust the values to correspond with your data source’s details.
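As an illustration of what such a connection configuration can look like, the sketch below assumes a PostgreSQL data source. The data source name, host, database, and schema are hypothetical, and the two environment variable identifiers are assumed to have been defined when the Soda Agent was deployed. Other data source types use different keys, so copy the syntax from the data source-specific documentation rather than from this sketch.

```yaml
# Sketch of a PostgreSQL connection configuration; all values are placeholders
data_source my_datasource_name:
  type: postgres
  host: db.example.com          # hypothetical host
  port: "5432"
  username: ${POSTGRES_USER}    # resolved from the agent's environment variables
  password: ${POSTGRES_PASS}    # never paste the actual password here
  database: postgres
  schema: public
```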
During its initial scan of your data source, Soda Cloud discovers all the datasets the data source contains. It captures basic information about each dataset, including a dataset’s schema and the columns it contains.
In the editing panel, specify the datasets that Soda Cloud must include or exclude from this basic discovery activity. The default syntax in the editing panel instructs Soda to collect basic dataset information from all datasets in the data source except those with names that begin with 'test_'. The % is a wildcard character. See Add dataset discovery for more detail on profiling syntax.
Known issue: SodaCL does not support using variables in column profiling and dataset discovery configurations.
    discover datasets:
      datasets:
        - include %
        - exclude test_%
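To narrow discovery instead of scanning everything, the include and exclude lines can use more specific patterns. The sketch below is a variation on the default configuration; the prod and prod_tmp prefixes are hypothetical dataset name prefixes, and the intent is that excluded datasets are skipped even though they also match the include pattern.

```yaml
discover datasets:
  datasets:
    - include prod%        # hypothetical: only datasets whose names begin with 'prod'
    - exclude prod_tmp%    # hypothetical: skip temporary datasets within that group
```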
To gather more detailed profile information about datasets in your data source, you can configure Soda Cloud to profile the columns in datasets.
Column profile information includes details such as the calculated mean value of data in a column, the maximum and minimum values in a column, and the number of rows with missing data.
In the editing panel, provide details that Soda Cloud uses to determine which datasets to include or exclude when it profiles the columns in a dataset. The default syntax in the editing panel instructs Soda to profile every column of every dataset in this data source, and, superfluously, all datasets with names that begin with 'prod'. The % is a wildcard character. See Add column profiling for more detail on profiling syntax.
Column profiling can be resource-heavy, so carefully consider the datasets for which you truly need column profile information. Refer to Compute consumption and cost considerations for more detail.
    profile columns:
      columns:
        - "%.%"  # Includes all your datasets
        - prod%  # Includes all datasets that begin with 'prod'
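Because profiling every column can be costly, a narrower configuration is often a better starting point. The sketch below limits profiling to a single dataset and leaves out one of its columns; the orders dataset and the customer_email column are hypothetical names, and each pattern takes the form dataset.column.

```yaml
profile columns:
  columns:
    - include orders.%                # every column in the hypothetical 'orders' dataset
    - exclude orders.customer_email   # skip one hypothetical column
```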
When Soda Cloud automatically discovers the datasets in a data source, it prepares automated monitoring checks for each dataset. These checks detect anomalies and monitor schema evolution, corresponding to the SodaCL anomaly score and schema checks, respectively.
- Anomaly score automatically detects anomalies in your time-series data. Its algorithm learns the patterns of your data – including trends and seasonality – to identify and flag anomalies in your data. A detected anomaly yields a failed check result.
- Schema evolution automatically detects changes in a dataset’s schema: columns that are added or removed, or a change to the index or data type of a column. Any of those schema changes yields a failed check result.
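For orientation, the two automated checks correspond roughly to the following hand-written SodaCL checks. The dataset name dim_customer is hypothetical, and applying the anomaly score check to row_count is an assumption about which metric the automated check observes.

```yaml
checks for dim_customer:   # hypothetical dataset name
  # Anomaly score check on the dataset's row count (assumed metric)
  - anomaly score for row_count < default
  # Schema check that fails when any column is added, removed, reordered, or retyped
  - schema:
      fail:
        when schema changes: any
```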
In the editing panel, specify the datasets that Soda Cloud must include or exclude when preparing automated monitoring checks. The default syntax in the editing panel indicates that Soda adds automated monitoring to all datasets in the data source except those with names that begin with 'test_'. The % is a wildcard character.
    automated monitoring:
      datasets:
        - include %
        - exclude test_%
| Field or Label | Guidance |
| --- | --- |
| Data Source Owner | The Data Source Owner maintains the connection details and settings for this data source and its Default Scan Schedule. |
| Default Dataset Owner | The Default Dataset Owner is the user who, by default, becomes the owner of each dataset the Default Scan discovers. Refer to Roles and Rights in Soda Cloud to learn how to adjust the Dataset Owner of individual datasets. |
Best practice dictates that you send a notification to someone on your team when bad-quality data triggers a failed check result on one of your automated checks. When alerted to an issue, a person or team can take action to investigate and resolve it.
In Soda Cloud, navigate to your avatar > Notification Rules, then click New Notification Rule. Follow the guided steps to complete the new rule, supplying values that will trigger a notification any time one of your automated monitoring checks fails.
The example below sends such notifications for the automated anomaly score check on the
- Learn more about organizing check results, setting alerts, and investigating issues.
- Write your own checks for data quality.
- Integrate Soda with Slack to send alert notifications directly to channels in your workspace.
- Integrate Soda with a data catalog to see data quality results from within the catalog.
- Need help? Join the Soda community on Slack.
Documentation always applies to the latest version of Soda products