
Install and use Soda Spark

Soda Spark is an extension of Soda SQL that allows you to run Soda SQL functionality programmatically on a Spark dataframe. Reference the Soda SQL documentation to learn how to use Soda Spark, particularly how to define tests in a scan YAML file.

Requirements
Compatibility
Install Soda Spark
Use Soda Spark
How Soda Spark works
Contribute
Go further


Requirements

To use Soda Spark, you must have the following installed on your system.

  • Python 3.7 or greater. To check your existing version, use the CLI command: python --version
  • Pip 21.0 or greater. To check your existing version, use the CLI command: pip --version

For Linux users only, install the following:

  • On Debian Buster: apt-get install g++ unixodbc-dev python3-dev libssl-dev libffi-dev
  • On CentOS 8: yum install gcc-c++ unixODBC-devel python38-devel libffi-devel openssl-devel

For MSSQL Server users only, install the following:

Compatibility

Use Soda Spark to scan a variety of data sources:

Amazon Athena
Amazon Redshift
Apache Hive (experimental)
Apache Spark (experimental)
GCP BigQuery
MySQL (experimental)
Microsoft SQL Server (experimental)
PostgreSQL
Snowflake

Install Soda Spark

From your command-line interface tool, execute the following command.

$ pip install soda-spark
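
To verify the installation, you can ask pip for the package details:

$ pip show soda-spark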

Use Soda Spark

From your Python prompt, execute the following commands to run a Soda SQL scan programmatically on a Spark dataframe.

>>> from pyspark.sql import DataFrame, SparkSession
>>> from sodaspark import scan
>>>
>>> spark_session = SparkSession.builder.getOrCreate()
>>>
>>> id = "a76824f0-50c0-11eb-8be8-88e9fe6293fd"
>>> df = spark_session.createDataFrame([
...     {"id": id, "name": "Paula Landry", "size": 3006},
...     {"id": id, "name": "Kevin Crawford", "size": 7243}
... ])
>>>
>>> scan_definition = ("""
... table_name: demodata
... metrics:
... - row_count
... - max
... - min_length
... tests:
... - row_count > 0
... columns:
...   id:
...     valid_format: uuid
...     tests:
...     - invalid_percentage == 0
... """)
>>> scan_result = scan.execute(scan_definition, df)
>>>
>>> scan_result.measurements
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>> scan_result.test_results
[TestResult(test=Test(..., expression='row_count > 0', ...), passed=True, skipped=False, ...)]
>>>
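
Each TestResult carries passed and skipped flags, as in the output above, so you can act on failures programmatically. A minimal sketch that raises when any test fails:

>>> failed = [r for r in scan_result.test_results if not r.passed and not r.skipped]
>>> if failed:
...     raise AssertionError(f"{len(failed)} Soda test(s) failed")
...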

Alternatively, you can prepare a scan YAML file that Soda Spark uses to prepare SQL queries to run against your data.
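
For example, a static/demodata.yml file that mirrors the inline scan definition above would contain:

table_name: demodata
metrics:
- row_count
- max
- min_length
tests:
- row_count > 0
columns:
  id:
    valid_format: uuid
    tests:
    - invalid_percentage == 0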

>>> scan_yml = "static/demodata.yml"
>>> scan_result = scan.execute(scan_yml, df)
>>>
>>> scan_result.measurements
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>>

Send scan results to Soda Cloud

Use the following commands to send Soda Spark scan results to Soda Cloud. Reference the Soda Cloud documentation to learn how to generate the API keys that connect Soda Spark to Soda Cloud.

>>> import os
>>> from sodasql.soda_server_client.soda_server_client import SodaServerClient
>>>
>>> soda_server_client = SodaServerClient(
...     host="cloud.soda.io",
...     api_key_id=os.getenv("API_PUBLIC"),
...     api_key_secret=os.getenv("API_PRIVATE"),
... )
>>> scan_result = scan.execute(scan_yml, df, soda_server_client=soda_server_client)
>>>
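
The example reads the API key pair from environment variables, so set them in your shell before starting the Python session. The variable names below simply match the example above; use any names you prefer:

$ export API_PUBLIC=<your-api-key-id>
$ export API_PRIVATE=<your-api-key-secret>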

How Soda Spark works

When you execute Soda Spark, it completes the following tasks:

  1. It sets up the scan using the Spark dialect and a Spark session as a warehouse connection.
  2. It creates, or replaces, a global temporary view for the Spark dataframe (see the sketch after this list).
  3. It executes the Soda scan on the temporary view.
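
As an illustrative sketch only, steps 2 and 3 roughly correspond to the following PySpark calls, reusing the df and spark_session from the example above; scan.execute performs the equivalent work internally:

>>> df.createOrReplaceGlobalTempView("demodata")
>>> spark_session.sql("SELECT COUNT(*) FROM global_temp.demodata").collect()
[Row(count(1)=2)]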

Contribute

Soda Spark is an open-source software project hosted on GitHub. Learn how to contribute to the Soda Spark project.

Go further


Last modified on 26-Nov-21
