Enable end users to test data quality
Last modified on 31-May-23
Use this guide to set up the Soda platform to enable users across your organization to serve themselves when it comes to testing data quality.
Deploy a Soda Agent in a Kubernetes cluster to connect to both a data source and the Soda platform, then invite your Data Analyst and Data Scientist colleagues to join the platform to create agreements and begin writing their own SodaCL checks for data quality.
01 Learn the basics of Soda
02 Get context for this guide
03 Deploy a Soda Agent in a Kubernetes cluster
04 Connect a data source to the Soda platform
05 Set up Slack integration and notification rules
06 Invite your colleagues to begin writing agreements
Soda works by taking data quality checks that you prepare and using them to run a scan of datasets in a data source. A scan is a CLI command which instructs Soda to prepare optimized SQL queries that execute data quality checks on your data source to find invalid, missing, or unexpected data. When checks fail, they surface bad-quality data and present check results that help you investigate and address quality issues.
To enable your colleagues to test data quality, you install Soda as an Agent in your own network infrastructure, and sign up for a Soda platform account so that you can complete the following tasks:
- Connect to your data source.
To connect to a data source such as Snowflake, Amazon Athena, or BigQuery, you add a new data source in the Soda platform, which stores access details for your data source such as host, port, and login credentials.
- Define checks to surface “bad” data.
To define the data quality checks that Soda runs against a dataset, you use an Agreement, a contract between stakeholders that stipulates the expected and agreed-upon state of data quality in a data source. These agreements contain checks, which are tests that Soda performs when it scans a dataset in your data source. The agreement stores the checks you write using the Soda Checks Language (SodaCL), a domain-specific language for data quality testing.
- Run a scan to execute your data quality checks.
During a scheduled scan, Soda does not ingest your data; it only scans it for quality metrics, then uses the metadata to prepare scan results¹. After a scan, each check results in one of three default states:
- pass: the values in the dataset match or fall within the thresholds you specified
- fail: the values in the dataset do not match or fall within the thresholds you specified
- error: the syntax of the check is invalid
- A fourth state, warn, is something you can explicitly configure for individual checks.
- Review scan results and investigate issues.
You can review the scan output in your Soda platform account which offers access to visualized scan results, trends in data quality over time, and the ability to integrate with the messaging, ticketing, and data cataloging tools you already use, like Slack, Jira, and Alation.
¹ An exception to this rule is when Soda collects failed row samples that it presents in scan output to aid issue investigation, a feature you can disable.
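As a minimal sketch of the checks described above, the following SodaCL fragment shows what an agreement might contain. The dataset name `dim_customer` and its columns are hypothetical placeholders, and the thresholds are arbitrary; adjust them to your own data.

```yaml
# Hypothetical dataset and column names, for illustration only.
checks for dim_customer:
  # Passes when the dataset contains at least one row; otherwise fails.
  - row_count > 0
  # Passes when no values are missing in the column; otherwise fails.
  - missing_count(email_address) = 0
  # Explicitly configures the optional warn state alongside fail.
  - duplicate_count(customer_id):
      warn: when between 1 and 10
      fail: when > 10
```

The first two checks can only pass, fail, or error. The third adds a warn threshold, so a small number of duplicates raises a warning rather than a failure.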
Learn more about How Soda works.
Learn more about running Soda scans.
Learn more about SodaCL Metrics and checks.
Access the Glossary for a full list of Soda terminology.
About this guide
The instructions below offer Data Engineers an example of how to set up the Soda platform to enable colleagues to prepare their own data quality tests. After all, data quality testing is a team sport!
For context, the example assumes that you have the appropriate access to a cloud services provider environment such as Azure, AWS, or Google Cloud that allows you to create and deploy applications to a cluster. Further, it assumes that you, or someone on your team, has the login credentials that Soda needs to access a data source such as MS SQL, BigQuery, or Athena so that Soda can run scans of the data.
Once you have completed the set-up, you can direct your colleagues to log in to the Soda platform and begin creating Agreements. An agreement is a contract between stakeholders that stipulates the expected and agreed-upon state of data quality in a data source. It contains data quality checks that run according to the schedule you defined for the data source.
When checks fail during data quality scans, you and your colleagues get alerts via Slack which enable you to address issues before they have a downstream impact on the users or systems that depend upon the data.
(Not quite ready for this big gulp of Soda? 🥤Try taking a sip, first.)
Deploy a Soda Agent
The Soda Agent is a tool that empowers Soda platform users to securely access data sources to scan for data quality. Create a Kubernetes cluster in a cloud services provider environment, then use Helm to deploy a Soda Agent in the cluster.
Access the exhaustive deployment instructions for the cloud services provider you use.
- Amazon Elastic Kubernetes Service (EKS)
- Microsoft Azure Kubernetes Service (AKS)
- Google Kubernetes Engine (GKE)
- Cloud services provider-agnostic instructions
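For orientation only, a Helm values file for the agent deployment might look roughly like the sketch below. The key names are an assumption about the shape of the Soda Agent chart's values, and every value shown is a placeholder; confirm both against the deployment instructions for your provider before using them.

```yaml
# Assumed structure; verify key names in the provider-specific instructions.
soda:
  apikey:
    # Placeholders: use the API key values generated in your Soda platform account.
    id: "***"
    secret: "***"
  agent:
    # Hypothetical name used to identify this agent in the Soda platform.
    name: "my-unique-agent"
```

Keeping these values in a file rather than passing them as command-line flags keeps credentials out of your shell history. Treat the file as a secret and avoid committing it to version control.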
Connect a data source
- Log in to your Soda platform account, then navigate to your avatar > Scans & Data.
- In the Agents tab, confirm that you can see the Soda Agent you deployed and that its status is “green” in the Last Seen column. If not, refer to the Soda Agent documentation to troubleshoot its status.
- Navigate to the Data source tab, then click New Data Source and follow the guided steps to:
- identify the new data source and its default scan schedule
- provide connection configuration details for the data source, and test the connection to the data source
- profile the datasets in the data source to gather basic metadata about the contents of each
- identify the datasets to which you wish to apply automated monitoring for anomalies and schema changes
- assign ownership roles for the data source and its datasets
- Save the new data source.
Set up Slack integration and notification rules
Use this integration to enable Soda to send alert notifications to a Slack channel to notify your team of warn and fail check results. If your team does not use Slack, you can skip this step; Soda sends alert notifications via email instead.
- Log in to your Soda platform account and navigate to your avatar > Organization Settings, then navigate to the Integrations tab and click the + icon to add a new integration.
- Follow the guided steps to authorize Soda to connect to your Slack workspace. If necessary, contact your organization’s Slack Administrator to approve the integration with Soda.
- Configuration tab: select the public channels to which Soda can post messages; Soda cannot post to private channels.
- Scope tab: select the Soda features, both alert notifications and incidents, which can access the Slack integration.
- To dictate where Soda should send alert notifications for checks that fail, create a new notification rule. Navigate to your avatar > Notification Rules, then click New Notification Rule. Follow the guided steps to complete the new rule, directing Soda to send check results that fail to a specific channel in your Slack workspace.
Learn more about Integrating with Slack.
Learn more about Setting notification rules.
Invite your colleagues
After testing and saving the new data source, invite your colleagues to your Soda platform account so they can begin creating new agreements.
Navigate to your avatar > Invite Team Members, then complete the form to send invitations to your colleagues. Provide them with the following links to help them get started:
✨Well done!✨ You’ve taken the first step towards a future in which you and your colleagues can collaborate on defining and maintaining good-quality data. Huzzah!
Experiment: SodaCL tutorial, Study metrics and checks, Compare data
Choose your adventure: Test data during development, Test data in a pipeline
- Request a demo. Hey, what can Soda do for you?
- Join the Soda community on Slack.