Self-serve Soda

Follow this guide to enable Soda Cloud end users to create no-code data quality checks for the data that matters most to them.

Use this guide to set up Soda Cloud and enable users across your organization to serve themselves when it comes to testing data quality.

Deploy a Soda Agent in a Kubernetes cluster to connect to both a data source and Soda Cloud, then invite your Data Analyst and Data Scientist colleagues to join the account, start data quality discussions, and begin creating their own SodaCL checks for data quality.

About this guide

The instructions below offer Data Engineers an example of how to set up Soda Cloud so that non-coder colleagues can propose, discuss, and create their own data quality tests. After all, data quality testing is a team sport!

Once you have completed the set-up, you can direct your non-coding colleagues to log in to Soda Cloud and begin creating Discussions. A Discussion in Soda is a messaging space that facilitates collaboration between data producers and data consumers. Together, colleagues can establish the expected and agreed-upon state of data quality in a dataset by proposing, then approving data quality checks that execute as part of a scheduled scan in Soda.

When checks fail during data quality scans, you and your colleagues get alerts via Slack, which enable you to address issues before they have a downstream impact on the users or applications that depend upon the data.

Access or deploy a Soda Agent

If you have not already done so, create a Soda Cloud account at cloud.soda.io. If you already have a Soda account, log in.
By default, Soda prepares a Soda-hosted agent for all newly-created accounts. However, if you are an Admin in an existing Soda Cloud account and wish to use a Soda-hosted agent, navigate to your avatar > Organization Settings. In the Organization tab, click the checkbox to Enable Soda-hosted Agent.
Navigate to your avatar > Data Sources, then access the Agents tab. Notice your out-of-the-box Soda-hosted agent that is up and running.

Connect a data source

Depending on your deployment model, Soda Agent supports connections with the following data sources.

Self-hosted agent

Amazon Athena, Amazon Redshift, Azure Synapse, ClickHouse, Databricks SQL, Denodo, Dremio, DuckDB, GCP BigQuery, Google CloudSQL, IBM DB2, MotherDuck, MS SQL Server¹, MySQL, OracleDB, PostgreSQL, Presto, Snowflake, Trino, Vertica

¹ MS SQL Server with Windows Authentication does not work with Soda Agent out-of-the-box.

Soda-hosted agent

BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, Snowflake

Log in to your Soda Cloud account, then navigate to your avatar > Data Sources.
In the Agents tab, confirm that you can see a Soda-hosted agent, or the Soda Agent you deployed, and that its status is “green” in the Last Seen column. If not, refer to the Soda Agent documentation to troubleshoot its status.
Navigate to the Data Sources tab, then click New Data Source and follow the guided steps to:
- identify the new data source and its default scan definition
- provide connection configuration details for the data source such as name, schema, and login credentials, and test the connection to the data source
- profile the datasets in the data source to gather basic metadata about the contents of each
- identify the datasets to which you wish to apply automated monitoring for anomalies and schema changes
- assign ownership roles for the data source and its datasets
Save the new data source.
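
The guided steps above generally ask for configuration in YAML. The sketch below shows what the connection, profiling, and automated monitoring entries might look like, assuming a PostgreSQL data source. The data source name, host, environment variable names, and dataset name patterns are hypothetical placeholders, and the exact properties depend on the type of data source you connect.

```yaml
# Connection configuration (hypothetical PostgreSQL example)
data_source my_datasource:
  type: postgres
  host: db.example.com            # placeholder host
  port: "5432"
  username: ${POSTGRES_USERNAME}  # assumed variable names; avoid hard-coding credentials
  password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public
```

```yaml
# Profiling: gather basic metadata about columns in matching datasets
profile columns:
  columns:
    - include %.%        # all columns in all datasets
    - exclude test%.%    # skip datasets whose names begin with "test"

# Automated monitoring: anomaly and schema-change checks on matching datasets
automated monitoring:
  datasets:
    - include %
    - exclude test%
```

The % character acts as a wildcard, so narrowing the include patterns limits profiling and monitoring to the datasets your colleagues care about most.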

Set up Slack integration and notification rules

Use this integration to enable Soda to send alert notifications to a Slack channel so that your team knows when check results warn or fail.

If your team does not use Slack, you can follow the instructions to integrate with MS Teams instead, or skip this step, as Soda sends alert notifications via email by default.

Log in to your Soda Cloud account and navigate to your avatar > Organization Settings, then navigate to the Integrations tab and click the + icon to add a new integration.
Follow the guided steps to authorize Soda to connect to your Slack workspace. If necessary, contact your organization's Slack Administrator to approve the integration with Soda.

Configuration tab: select the public channels to which Soda can post messages; Soda cannot post to private channels.
Scope tab: select the two Soda features, Alert Notifications and Discussions, which can access the Slack integration.

To specify where Soda sends alert notifications for checks that fail, create a new notification rule. Navigate to your avatar > Notification Rules, then click New Notification Rule. Follow the guided steps to complete the new rule, directing Soda to send check results that fail to a specific channel in your Slack workspace.

Learn more about Integrating with Slack.
Learn more about Setting notification rules.

Invite your colleagues

After testing and saving the new data source, invite your colleagues to your Soda Cloud account so they can begin creating discussions and proposing checks.

Navigate to your avatar > Invite Team Members, then complete the form to send invitations to your colleagues.

Begin a discussion and propose checks

While waiting for your colleagues to accept your Soda invitation, get a head start on setting up data quality checks on the data that matters the most to your data consumers.

🎥 Watch a 5-minute video of the following procedure, if you like!

In Soda Cloud, navigate to Discussions from the main navigation bar.
Start a New Discussion, providing relevant details for a discussion on data quality metrics, and adding people whose perspectives will add value to the data quality of a particular dataset.
Kick off the data quality discussion with your colleagues: begin with Propose Check, then use the no-code check interface to select from the list of available checks for the dataset. The most common baseline data quality checks include: missing, invalid, duplicate, and freshness. Refer to Define SodaCL checks for more detail on how to leverage no-code checks.
After filling in the blanks and testing the check, click Propose Check to add the SodaCL check to the discussion. When your colleagues join and review the Discussion, they can add comments or propose new or different checks to address the data quality issues of this dataset.
When you and your team agree on the data quality checks to add to the dataset, you, as the data producer, can Review & Add the check to a scan for the dataset – either existing or new – so that Soda begins executing the check as per the data source's default scan schedule.
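
For reference, the four baseline check types named above correspond to SodaCL along the lines of the sketch below. The dataset name (dim_customer), the column names, and the thresholds are hypothetical; the no-code interface produces the SodaCL for the choices you actually make.

```yaml
# Hypothetical dataset and columns, for illustration only
checks for dim_customer:
  # Missing: no NULL values in the email column
  - missing_count(email_address) = 0
  # Invalid: every value must look like an email address
  - invalid_count(email_address) = 0:
      valid format: email
  # Duplicate: customer identifiers must be unique
  - duplicate_count(customer_id) = 0
  # Freshness: the newest row must be less than one day old
  - freshness(created_at) < 1d
```

Seeing the underlying SodaCL can help data producers review a proposed check before adding it to a scan.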

✨Well done!✨ You've taken the first step towards a future in which you and your colleagues can collaborate on defining and maintaining good-quality data. Huzzah!

Go further?

Get organized in Soda!
Integrate Soda with your data catalog.
Use failed row samples to investigate data quality issues.
Request a demo. Hey, what can Soda do for you?

Need help? Join the Soda community on Slack.

