Add Soda to a Databricks notebook

Use this guide to install and set up Soda in a Databricks notebook so that you can run data quality tests on data in a Spark data source.

🎥 Watch a video that demonstrates how to add Soda to your Databricks pipeline: https://go.soda.io/soda-databricks-video

About this guide

The instructions below offer data engineers an example of how to write Python in a Databricks notebook that sets up Soda, then writes and executes data quality scans against data in Spark.

This example uses a programmatic deployment model which invokes the Soda Python library, and uses Soda Cloud to validate a commercial usage license and display visualized data quality test results. See: Choose a flavor of Soda.
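As a rough illustration of the programmatic model described above, the following sketch wraps the scan calls in one function. It assumes Soda Library's Spark DataFrame package is installed on the cluster and that `spark` is the session Databricks provides. The function name `run_soda_scan`, its optional `scan` parameter, the scan definition name, and the view name `my_table` are hypothetical and do not come from the downloadable notebook.

```python
def run_soda_scan(spark, config_yaml, checks_yaml, scan=None):
    """Configure and execute one Soda scan against Spark, then return the scan."""
    if scan is None:
        # Imported lazily so the function can be defined before
        # Soda Library is installed on the cluster.
        from soda.scan import Scan
        scan = Scan()
    # Hypothetical scan definition name; pick one that identifies this notebook.
    scan.set_scan_definition_name("databricks_notebook_example")
    scan.set_data_source_name("spark_df")
    scan.add_spark_session(spark, data_source_name="spark_df")
    scan.add_configuration_yaml_str(config_yaml)  # Soda Cloud connection details
    scan.add_sodacl_yaml_str(checks_yaml)         # the data quality checks
    scan.execute()
    return scan


# Hypothetical checks for a temporary view named "my_table".
EXAMPLE_CHECKS = """
checks for my_table:
  - row_count > 0
"""
```

In a notebook cell, a call might look like `scan = run_soda_scan(spark, config, EXAMPLE_CHECKS)` followed by `print(scan.get_logs_text())`, where `config` is a YAML string holding your Soda Cloud API keys.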

Create a Soda Cloud account

To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.

  1. In a browser, navigate to cloud.soda.io/signup to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.

  2. Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys.

  3. Copy and paste the API key values to a temporary, secure place in your local environment.
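Once you have the API keys, one way to hand them to Soda Library is as a small YAML configuration string. The sketch below assumes the `soda_cloud` configuration block with `host`, `api_key_id`, and `api_key_secret` keys. The helper name `build_soda_cloud_config` and the environment variable names are hypothetical, and the host value may differ for your Soda Cloud region.

```python
import os


def build_soda_cloud_config(api_key_id, api_key_secret, host="cloud.soda.io"):
    """Return a Soda Cloud connection configuration as a YAML string."""
    return (
        "soda_cloud:\n"
        f"  host: {host}\n"
        f"  api_key_id: {api_key_id}\n"
        f"  api_key_secret: {api_key_secret}\n"
    )


# Read the keys from the environment instead of hard-coding them in the notebook.
# SODA_API_KEY_ID and SODA_API_KEY_SECRET are hypothetical variable names.
config = build_soda_cloud_config(
    os.environ.get("SODA_API_KEY_ID", "<your-api-key-id>"),
    os.environ.get("SODA_API_KEY_SECRET", "<your-api-key-secret>"),
)
```

In Databricks, a secret scope is a reasonable alternative to environment variables for storing the key values.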

Set up Soda

Soda Library has the following requirements:

  • Python 3.8, 3.9, or 3.10

  • Pip 21.0 or greater

Python versions Soda supports

Soda officially supports Python versions 3.8, 3.9, and 3.10. Soda is largely functional on Python 3.11 and 3.12, but efforts to fully support those versions are ongoing.

With Python 3.11, some users might encounter issues with dependency constraints. At times, the combination of Python 3.11 and dependency constraints requires that a dependency be built from source rather than downloaded pre-built.

The same applies to Python 3.12, although anecdotal evidence indicates that 3.12 might not work in all scenarios because of dependency constraints.
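Before installing, it can help to confirm which Python version the cluster runs. The following minimal sketch classifies a version according to the support levels described above. The function name `soda_python_support` and the labels it returns are hypothetical and are not part of Soda Library.

```python
import sys

OFFICIALLY_SUPPORTED = {(3, 8), (3, 9), (3, 10)}
PARTIALLY_SUPPORTED = {(3, 11), (3, 12)}  # largely functional, full support ongoing


def soda_python_support(version_info=None):
    """Classify a Python version as 'supported', 'partial', or 'unsupported'."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = (version_info[0], version_info[1])
    if major_minor in OFFICIALLY_SUPPORTED:
        return "supported"
    if major_minor in PARTIALLY_SUPPORTED:
        return "partial"
    return "unsupported"


# Report the support level of the interpreter running this notebook.
print(f"Python {sys.version_info[0]}.{sys.version_info[1]}: {soda_python_support()}")
```

A result of `partial` suggests watching for dependencies that need to be built from source during installation.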

Download the notebook: Soda Databricks notebook

Go further

Need help? Join the Soda community on Slack.
