Connect Soda to Apache Spark
Last modified on 31-May-23
For Soda to run quality scans of your data, you must configure it to connect to your data source.
- For Soda Core, add the connection configurations to your `configuration.yml` file. Read more.
- For Soda Cloud, add the connection configurations to step 3 of the New Data Source workflow. Read more.
- Spark packages
- Connect to Spark DataFrames
- Use Soda Core with Spark DataFrames on Databricks
- Connect to Spark for Hive
- Connect to Spark for ODBC
- Connect to Spark for Databricks SQL
- Test the data source connection
- Supported data types
Spark packages
There are several Soda Core install packages for Spark.
| Package | Description |
| --- | --- |
| `soda-core-spark-df` | Enables you to pass DataFrame objects into Soda scans programmatically, after you have associated the temporary tables to DataFrames via the Spark API. For use with programmatic Soda scans, only. Supports Delta Lake Tables on Databricks. Use for Spark DataFrames on Databricks. |
| `soda-core-spark[hive]` | A package you add to `soda-core-spark-df` if you are using Apache Hive. |
| `soda-core-spark[odbc]` | A package you add to `soda-core-spark-df` if you are using ODBC. |
| `soda-core-spark[databricks]` | A package you use to install Soda Core for Databricks SQL on the Databricks Lakehouse Platform. |
| `soda-core-spark` | A work in progress, this package will connect to Soda Core much the same as other data sources, via connection details in a configuration YAML. |
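If you install Soda Core packages with pip from PyPI, which is an assumption here rather than something this page prescribes, the package names above map to install commands such as the following; refer to Install Soda Core for the authoritative instructions for your environment.
```sh
# A sketch, assuming a supported Python 3 environment and installation from PyPI with pip
pip install soda-core-spark-df            # programmatic scans against Spark DataFrames
pip install "soda-core-spark[hive]"       # add-on if you use Apache Hive
pip install "soda-core-spark[odbc]"       # add-on if you use ODBC
pip install "soda-core-spark[databricks]" # Soda Core for Databricks SQL
```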
Connect to Spark DataFrames
- For use with programmatic Soda scans, only.
- Unlike other data sources, Soda Core for SparkDF does not require a configuration YAML file to run scans against Spark DataFrames.

A Spark cluster contains a distributed collection of data. Spark DataFrames are distributed collections of data organized into named columns, much like tables in a database, and stored in-memory in a cluster.

To make a DataFrame available to Soda Core to run scans against, you must use a driver program like PySpark and the Spark API to link DataFrames to individual, named, temporary tables in the cluster. You pass this information into a Soda scan programmatically. You can also pass Soda Cloud connection details programmatically; see Connect Soda Core for SparkDF to Soda Cloud. Refer to the soda-core repo in GitHub for details.
1. If you are not installing Soda Core Spark DataFrames on a cluster, skip to step 2. To install Soda Core Spark DataFrames on a cluster, such as a Kubernetes cluster or a Databricks cluster, install `libsasl2-dev` before installing `soda-core-spark-df`. For Ubuntu users, install `libsasl2-dev` using the following command:
    ```sh
    sudo apt-get -y install unixodbc-dev libsasl2-dev gcc python-dev
    ```
2. If you are not using Spark with Hive or ODBC, skip to step 3. Otherwise, install the separate dependencies as needed, and configure connection details for each dependency; see below.
3. Confirm that you have completed the following.
    - installed `soda-core-spark-df`
    - set up a Spark session
      ```python
      spark_session: SparkSession = ...user-defined-way-to-create-the-spark-session...
      ```
    - confirmed that your Spark cluster contains one or more DataFrames
      ```python
      df = ...user-defined-way-to-build-the-dataframe...
      ```
4. Use the Spark API to link the name of a temporary table to a DataFrame. In this example, the name of the table is `customers`.
    ```python
    df.createOrReplaceTempView('customers')
    ```
5. Use the Spark API to link a DataFrame to the name of each temporary table against which you wish to run Soda scans. Refer to PySpark documentation.
6. Define a programmatic scan for the data in the DataFrames, and include one extra method to pass all the DataFrames to Soda Core: `add_spark_session(self, spark_session, data_source_name: str)`. The default value for `data_source_name` is `"spark_df"`. Refer to the example below.
```python
spark_session = ...your_spark_session...
df1.createOrReplaceTempView("TABLE_ONE")
df2.createOrReplaceTempView("TABLE_TWO")
...

scan = Scan()
scan.set_scan_definition_name('YOUR_SCHEDULE_NAME')
scan.set_data_source_name("spark_df")
scan.add_configuration_yaml_file(file_path="somedirectory/your_configuration.yml")
scan.add_spark_session(spark_session)
... all other scan methods in the standard programmatic scan ...
```
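As a sketch of what "all other scan methods" might cover, the following continues the example above using the same check-definition and execution methods that appear in the Databricks Notebook example later on this page; the SodaCL check and dataset name are hypothetical.
```python
# A sketch continuing the example above; the check and dataset name are hypothetical
checks = """
checks for TABLE_ONE:
  - row_count > 0
"""

scan.add_sodacl_yaml_str(checks)   # attach the SodaCL checks
scan.execute()                     # run the scan
print(scan.get_logs_text())        # print all scan logs to the console
```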
Use Soda Core with Spark DataFrames on Databricks
Use the `soda-core-spark-df` package to connect to Databricks using a Notebook.
1. Follow steps 1-2 in the instructions to install `soda-core-spark-df`.
2. Reference the following Notebook example to connect to Databricks.
```python
# import Scan from Soda Core
from soda.scan import Scan

# Create a Spark DataFrame, or use the Spark API to read data and create a DataFrame
df = spark.createDataFrame([(1, "a"), (2, "b")], ("id", "name"))

# Create a view that SodaCL uses as a dataset
df.createOrReplaceTempView("my_df")

# Create a Scan object, set a scan definition, and attach a Spark session
scan = Scan()
scan.set_scan_definition_name("test")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark)

# Define checks for datasets
checks = """
checks for my_df:
  - row_count > 0
"""

# If you defined checks in a file accessible via Spark, you can use the scan.add_sodacl_yaml_file method to retrieve the checks
scan.add_sodacl_yaml_str(checks)

# Optionally, add a configuration file with Soda Cloud credentials
# config = """
# soda_cloud:
#   api_key_id: xyz
#   api_key_secret: xyz
# """
# scan.add_configuration_yaml_str(config)

# Execute a scan
scan.execute()

# Check the Scan object for methods to inspect the scan result; the following prints all logs to console
print(scan.get_logs_text())
```
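Beyond printing logs, the Scan object offers methods to act on the outcome programmatically. The following is a sketch, assuming it runs in the same Notebook right after `scan.execute()`; check the Scan object in your version of Soda Core for the methods it exposes.
```python
# A sketch of inspecting the outcome after scan.execute(); assumes the Scan
# object from the Notebook example above is still in scope
if scan.has_check_fails():
    print(scan.get_scan_results())  # dictionary with scan and check details

# Or raise an AssertionError if any check failed, which fails the Notebook cell
scan.assert_no_checks_fail()
```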
Connect to Spark for Hive
In addition to `soda-core-spark-df`, install and configure the `soda-core-spark[hive]` package if you use Apache Hive.
```yaml
data_source my_datasource_name:
  type: spark
  username:
  password:
  host:
  port:
  database:
  auth_method:
```
| Property | Required |
| --- | --- |
| type | required |
| username | required |
| password | required |
| host | required |
| port | required |
| database | required |
| auth_method | required |
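For illustration only, a filled-in version of this configuration might look like the following; every value, including the `auth_method`, is a hypothetical placeholder to replace with the details of your own Hive deployment.
```yaml
# Hypothetical placeholder values; replace with your own Hive connection details
data_source my_datasource_name:
  type: spark
  username: my_hive_user
  password: my_hive_password
  host: hive.example.com
  port: 10000
  database: default
  auth_method: hive_server2_auth_method_for_your_deployment
```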
Connect to Spark for ODBC
In addition to `soda-core-spark-df`, install and configure the `soda-core-spark[odbc]` package if you use ODBC.
```yaml
data_source my_datasource_name:
  type: spark
  driver:
  host:
  port:
  token:
  organization:
  cluster:
  server_side_parameters:
```
| Property | Required |
| --- | --- |
| type | required |
| driver | required |
| host | required |
| port | required |
| token | required |
| organization | required |
| cluster | required |
| server_side_parameters | required |
Connect to Spark for Databricks SQL
1. Install `soda-core-spark[databricks]` to connect to Databricks SQL. Refer to Install Soda Core for details.
2. If you have not done so already, install `databricks-sql-connector`. Refer to Databricks documentation for details.
3. Configure the data source connection in your `configuration.yml` file as per the following example.
```yaml
data_source my_datasource_name:
  type: spark
  catalog: samples
  schema: nyctaxi
  method: databricks
  host: hostname_from_Databricks_SQL_settings
  http_path: http_path_from_Databricks_SQL_settings
  token: my_access_token
```
| Property | Required |
| --- | --- |
| type | required |
| catalog | required |
| schema | required |
| method | required |
| host | required |
| http_path | required |
| token | required |
Test the data source connection
To confirm that you have correctly configured the connection details for the data source(s) in your configuration YAML file, use the `test-connection` command. If you wish, add a `-V` option to the command to return results in verbose mode in the CLI.
```sh
soda test-connection -d my_datasource -c configuration.yml -V
```
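Once the connection test succeeds, you can run a scan against the same data source with the `soda scan` command; in this sketch, `checks.yml` is a hypothetical file containing your SodaCL checks.
```sh
# my_datasource and checks.yml are placeholders for your own data source name and checks file
soda scan -d my_datasource -c configuration.yml checks.yml
```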
Supported data types
| Category | Data type |
| --- | --- |
| text | CHAR, VARCHAR, TEXT |
| number | BIG INT, NUMERIC, BIT, SMALLINT, DECIMAL, SMALLMONEY, INT, TINYINT, MONEY, FLOAT, REAL |
| time | DATE, TIME, DATETIME, DATETIMEOFFSET |
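As an illustrative sketch of checks that reference columns of these types, the dataset and column names below are hypothetical; see the SodaCL documentation for the full check syntax.
```yaml
# Hypothetical dataset and column names
checks for my_df:
  - row_count > 0              # dataset has rows
  - missing_count(name) = 0    # a text column with no missing values
  - duplicate_count(id) = 0    # a numeric key column with no duplicates
```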