Add Soda to a Databricks notebook
Last modified on 30-Nov-23
Use this guide to install and set up Soda in a Databricks notebook so you can run data quality tests on data in a Spark data source.
(Not quite ready for this big gulp of Soda? 🥤Try taking a sip, first.)
About this guide
Create a Soda Cloud account
Set up Soda
Go further
About this guide
The instructions below offer Data Engineers an example of how to write Python in a Databricks notebook to set up Soda, then write and execute scans for data quality in Spark.
This example uses a programmatic deployment model which invokes the Soda Python library, and uses Soda Cloud to validate a commercial usage license and display visualized data quality test results. See: Choose a flavor of Soda.
Create a Soda Cloud account
To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.
- In a browser, navigate to cloud.soda.io/signup to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.
- Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys.
- Copy+paste the API key values to a temporary, secure place in your local environment.
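Once you have the key values, one way to keep them out of the notebook source is to read them from environment variables and assemble the `soda_cloud` configuration string at run time. The following is a minimal sketch, not Soda's prescribed method: the helper `build_soda_cloud_config` and the variable names `SODA_CLOUD_HOST`, `SODA_API_KEY_ID`, and `SODA_API_KEY_SECRET` are hypothetical names chosen for illustration.

```python
import os


def build_soda_cloud_config(host, api_key_id, api_key_secret):
    """Return a YAML string using the soda_cloud keys shown in this guide."""
    return (
        "soda_cloud:\n"
        f"  host: {host}\n"
        f"  api_key_id: {api_key_id}\n"
        f"  api_key_secret: {api_key_secret}\n"
    )


# Hypothetical environment variable names; set them wherever your
# workspace keeps secrets, rather than pasting the keys into a cell.
config = build_soda_cloud_config(
    host=os.environ.get("SODA_CLOUD_HOST", "cloud.soda.io"),
    api_key_id=os.environ.get("SODA_API_KEY_ID", ""),
    api_key_secret=os.environ.get("SODA_API_KEY_SECRET", ""),
)
```

The resulting string has the same shape as the in-line configuration used later in this guide. In Databricks, a secret scope read with `dbutils.secrets.get` is a common alternative source for the same two values.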
Set up Soda
Soda Library has the following requirements:
- Python 3.8 or greater
- Pip 21.0 or greater
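To confirm that a cluster meets these requirements before installing, a notebook cell along the following lines may help. It is a sketch that uses only the Python standard library; the helpers `version_tuple` and `meets_minimum` are hypothetical names, and the comparison looks only at the major and minor version numbers.

```python
import re
import sys
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Parse the leading major.minor of a version string into two ints."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:2]]
    return tuple(numbers + [0] * (2 - len(numbers)))


def meets_minimum(found, required):
    """True when the found version is at least the required version."""
    return version_tuple(found) >= version_tuple(required)


python_found = f"{sys.version_info.major}.{sys.version_info.minor}"
status = "ok" if meets_minimum(python_found, "3.8") else "too old"
print(f"Python {python_found}: {status}")

try:
    pip_found = version("pip")
    status = "ok" if meets_minimum(pip_found, "21.0") else "too old"
    print(f"pip {pip_found}: {status}")
except PackageNotFoundError:
    # pip metadata is not visible from this interpreter
    print("pip: not found")
```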
Download the notebook: Soda Databricks notebook
# Install a Soda Library package with Apache Spark DataFrame
pip install -i https://pypi.cloud.soda.io soda-spark-df
# Import Scan from Soda Library
# A scan is a command that executes checks to extract information about data in a dataset.
from soda.scan import Scan
# Create a Spark DataFrame, or use the Spark API to read data and create a DataFrame
# A Spark DataFrame is a distributed collection of data organized into named columns which provides a structured and tabular representation of data within the Apache Spark framework.
df = spark.table("delta.`/databricks-datasets/adventureworks/tables/adventureworks`")
# Create a view that Soda uses as a dataset
df.createOrReplaceTempView("adventureworks")
# Create a scan object
scan = Scan()
# Set a scan definition
# Use a scan definition to configure which data to scan, and when and how to execute the scan.
scan.set_scan_definition_name("Databricks Notebook")
scan.set_data_source_name("spark_df")
# Attach a Spark session
scan.add_spark_session(spark)
# Define checks for datasets
# A Soda Check is a test that Soda Library performs when it scans a dataset in your data source. You can define your checks in-line in the notebook, or define them in a separate checks.yml file that is accessible by Spark.
checks = """
checks for dim_customer:
  - invalid_count(email_address) = 0:
      valid format: email
      name: Ensure values are formatted as email addresses
  - missing_count(last_name) = 0:
      name: Ensure there are no null values in the Last Name column
  - duplicate_count(phone) = 0:
      name: No duplicate phone numbers
  - freshness(date_first_purchase) < 7d:
      name: Data in this dataset is less than 7 days old
  - schema:
      warn:
        when schema changes: any
      name: Columns have not been added, removed, or changed
sample datasets:
  datasets:
    - include dim_%
"""
# OR, define checks in a file accessible via Spark, then use the scan.add_sodacl_yaml_file method to retrieve the checks
scan.add_sodacl_yaml_str(checks)
# Add your Soda Cloud connection configuration using the API Keys you created in Soda Cloud
# Use cloud.soda.io for EU region
# Use cloud.us.soda.io for US region
config = """
soda_cloud:
  host: cloud.soda.io
  api_key_id: 39**9
  api_key_secret: hN**_W1Q
"""
# OR, configure the connection details in a file accessible via Spark, then use the scan.add_configuration_yaml_file method to retrieve the config
scan.add_configuration_yaml_str(config)
# Execute a scan
scan.execute()
# Check the Scan object for methods to inspect the scan result
# The following prints all logs to the console
print(scan.get_logs_text())
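Beyond printing logs, a scheduled notebook usually needs to decide whether to continue after a scan. The sketch below assumes that `scan.execute()` returns an integer exit code with the meanings listed in the dictionary; that mapping is an assumption to verify against your Soda Library version, and the helpers `describe_exit_code` and `should_stop_pipeline` are hypothetical names.

```python
# Assumed meanings of scan exit codes; verify against your Soda Library version.
EXIT_CODE_MEANINGS = {
    0: "all checks passed",
    1: "at least one check issued a warning",
    2: "at least one check failed",
    3: "Soda encountered a runtime error",
}


def describe_exit_code(exit_code):
    """Return a short description of a scan exit code."""
    return EXIT_CODE_MEANINGS.get(exit_code, f"unrecognized exit code {exit_code}")


def should_stop_pipeline(exit_code, fail_on_warn=False):
    """Decide whether a notebook job should halt after a scan."""
    threshold = 1 if fail_on_warn else 2
    return exit_code >= threshold


# In the notebook, replace the stand-in with: exit_code = scan.execute()
exit_code = 2  # stand-in value so this sketch runs without Soda installed
print(describe_exit_code(exit_code))
if should_stop_pipeline(exit_code):
    print("Stopping: data quality checks did not pass.")
```

Raising an exception instead of printing in the final branch would cause a Databricks job run to be marked as failed, which suits pipelines that must not load data that fails its checks.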
Go further
- Learn more about SodaCL checks and metrics.
- Access instructions to Generate API Keys.
- Need help? Join the Soda community on Slack.
Documentation always applies to the latest version of Soda products