Add Soda to a Databricks notebook
Use this guide to install and set up Soda in a Databricks notebook so that you can run data quality tests, from inside the notebook, on data in a Spark data source.
🎥 Watch a video that demonstrates how to add Soda to your Databricks pipeline: https://go.soda.io/soda-databricks-video
About this guide
The instructions below show data engineers how to write Python in a Databricks notebook that sets up Soda, then defines and executes data quality scans on data in Spark.
This example uses a programmatic deployment model which invokes the Soda Python library, and uses Soda Cloud to validate a commercial usage license and display visualized data quality test results. See: Choose a flavor of Soda.
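As a rough illustration of the programmatic model, the sketch below wraps a scan of a Spark DataFrame in a single function. It assumes Soda Library for Spark DataFrames is already installed on the cluster, that `spark` is the notebook's Spark session, and that `soda_cloud_config` is a YAML string holding your Soda Cloud API keys. The view name `orders`, the column `order_id`, the scan definition name, and the function name `run_scan` are hypothetical placeholders, not part of the downloadable notebook.

```python
# Hypothetical SodaCL checks; "orders" and "order_id" are placeholder names.
CHECKS = """
checks for orders:
  - row_count > 0
  - missing_count(order_id) = 0
"""


def run_scan(spark, df, soda_cloud_config: str) -> int:
    """Run a Soda scan against a Spark DataFrame and return the scan's exit code."""
    # Imported lazily so this cell still loads before Soda Library is installed.
    from soda.scan import Scan

    # Expose the DataFrame to Soda under the dataset name used in CHECKS.
    df.createOrReplaceTempView("orders")

    scan = Scan()
    scan.set_scan_definition_name("databricks_notebook_example")  # assumed name
    scan.set_data_source_name("spark_df")
    scan.add_spark_session(spark, data_source_name="spark_df")
    scan.add_configuration_yaml_str(soda_cloud_config)  # Soda Cloud API keys
    scan.add_sodacl_yaml_str(CHECKS)

    exit_code = scan.execute()
    print(scan.get_logs_text())  # show the scan log in the notebook output
    return exit_code
```

In a notebook cell you would call something like `run_scan(spark, my_dataframe, config_yaml)` and decide whether to halt the pipeline based on the returned exit code.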
Create a Soda Cloud account
To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.
In a browser, navigate to cloud.soda.io/signup to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.
Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys.
Copy and paste the API key values to a temporary, secure place in your local environment.
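Once you have the API key values, one way to hand them to Soda Library is as a small YAML configuration string. The sketch below builds that string from two environment variables. The variable names `SODA_API_KEY_ID` and `SODA_API_KEY_SECRET`, the helper name `build_soda_cloud_config`, and the fallback placeholder values are assumptions for illustration; the default host assumes an account created at cloud.soda.io. In Databricks you might prefer to read the values from a secret scope instead of environment variables.

```python
import os


def build_soda_cloud_config(api_key_id: str, api_key_secret: str,
                            host: str = "cloud.soda.io") -> str:
    """Return a Soda Cloud configuration block as a YAML string."""
    if not api_key_id or not api_key_secret:
        raise ValueError("Both Soda Cloud API key values are required")
    return (
        "soda_cloud:\n"
        f"  host: {host}\n"
        f"  api_key_id: {api_key_id}\n"
        f"  api_key_secret: {api_key_secret}\n"
    )


# Hypothetical environment variable names; the fallbacks are placeholders only.
soda_cloud_config = build_soda_cloud_config(
    os.environ.get("SODA_API_KEY_ID", "example-key-id"),
    os.environ.get("SODA_API_KEY_SECRET", "example-key-secret"),
)
```

Keeping the keys out of the notebook source means they do not end up in version control or in shared notebook exports.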
Set up Soda
Soda Library has the following requirements:
Python 3.8, 3.9, or 3.10
Pip 21.0 or greater
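To confirm that a cluster meets these requirements before installing anything, you could run a quick check such as the sketch below in a notebook cell. It uses only the Python standard library; the helper names `python_supported` and `pip_supported` are hypothetical.

```python
import re
import sys
from importlib.metadata import PackageNotFoundError, version

SUPPORTED_PYTHON = {(3, 8), (3, 9), (3, 10)}
MIN_PIP = (21, 0)


def python_supported(version_info=sys.version_info) -> bool:
    """True if the major.minor Python version is one Soda Library lists."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHON


def pip_supported(pip_version: str) -> bool:
    """True if the pip version string is 21.0 or greater."""
    match = re.match(r"(\d+)(?:\.(\d+))?", pip_version)
    if match is None:
        return False
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor) >= MIN_PIP


try:
    installed_pip = version("pip")
except PackageNotFoundError:
    installed_pip = None  # pip is not installed in this environment

print("Python supported:", python_supported())
print("pip supported:", installed_pip is not None and pip_supported(installed_pip))
```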
Download the notebook: Soda Databricks notebook
Go further
Use Soda to test data in a Databricks pipeline.
Learn more about SodaCL checks and metrics.
Access instructions to Generate API Keys.
Need help? Join the Soda community on Slack.
