# Spark DataFrame

## Connection configuration reference <a href="#connection-configuration-reference" id="connection-configuration-reference"></a>

Soda supports **Apache Spark** as a scalable, distributed SQL engine that works with in-memory DataFrames and existing Spark sessions.

Install the following package:

```bash
pip install soda-sparkdf
```

***

### From an existing Spark session

If you already have a running Spark session, you can initialize a Soda Spark DataFrame data source directly from it.

```python
from pyspark.sql import SparkSession
from soda_core.contracts import verify_contract_locally
from soda_sparkdf import SparkDataFrameDataSource

spark = (
    SparkSession.builder.master("local[*]")
    .appName("soda_sparkdf")
    .getOrCreate()
)

# Create a database (schema) for organization
spark.sql("CREATE DATABASE IF NOT EXISTS my_schema")
spark.sql("USE my_schema")

# Create the DataFrame and save it as a table in the schema
df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
df.write.mode("overwrite").saveAsTable("my_table")

spark_data_source = SparkDataFrameDataSource.from_existing_session(
    session=spark,
    name="my_sparkdf"
)

result = verify_contract_locally(
    data_sources=[spark_data_source],
    contract_file_path="./my_table.yaml",
    soda_cloud_file_path="../soda-cloud.yaml",
    publish=True
)

if result.is_ok:
    print("✅ Contract verification passed.")
else:
    print("❌ Contract verification failed:")
    print(result.get_errors_str())
```

> Learn more about [python-api](https://docs.soda.io/reference/python-api "mention").

#### Example contract

Here’s a minimal example of a **Soda contract** that validates the `my_table` dataset in Spark:

{% code title="my_table.yaml" %}

```yaml
dataset: my_sparkdf/my_schema/my_table
columns:
  - name: id
    data_type: integer
    checks:
      - missing:
checks:
  - row_count:
      threshold:
        must_be: 3
```

{% endcode %}
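The `dataset` field is a three-part qualified name in the form `data_source_name/schema/table`. As an illustration of that convention, the helper below is hypothetical and not part of the Soda API:

```python
def split_dataset_qualifier(dataset: str) -> tuple[str, str, str]:
    """Split a Soda dataset qualifier into (data source, schema, table).

    Hypothetical helper for illustration only; not part of the Soda API.
    """
    parts = dataset.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected data_source/schema/table, got {dataset!r}")
    return parts[0], parts[1], parts[2]
```

Here `my_sparkdf` must match the `name` passed to `SparkDataFrameDataSource.from_existing_session`, and `my_schema`/`my_table` must match the database and table created in the session.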


## Using Spark DataFrames on Databricks

### Example contract: Spark - Databricks

Below is an example of a contract verification that runs against a Databricks table using the Spark DataFrame connector:

```python
from soda_core.contracts import verify_contract_locally
from soda_sparkdf import SparkDataFrameDataSource
from soda_core import configure_logging

# Enable or disable verbose logging
configure_logging(verbose=True)

# Unity Catalog tables are available in the Spark session.
# The dataset qualified name in the contract should include the full path to the table, e.g.,
# dataset: soda_databricks_example/unity_catalog/tyler/obs_test_data_seasoned
spark_data_source = SparkDataFrameDataSource.from_existing_session(
    session=spark,
    name="soda_databricks_example"
)

result = verify_contract_locally(
    data_sources=[spark_data_source],
    contract_file_path="obs_test_data_seasoned.yml",
    soda_cloud_file_path="sc_config.yml",
    publish=False
)

if result.is_ok:
    print("✅ Contract verification passed.")
else:
    print("❌ Contract verification failed:")
    print(result.get_errors_str())
```

### Databricks Connect compatibility

The **Soda Data Contract Spark package** is compatible with the following **Databricks Connect** versions:

* **17.3.x:** 17.3.1 and above (within the same minor version)
* **17.2.x:** 17.2.4 and above (within the same minor version)
* **17.1.x:** 17.1.7 and above (within the same minor version)
* **17.0.x:** 17.0.10 and above (within the same minor version)
* **16.x:** 16.4.9 and above (within the same major version)

Versions below these minimums are **not supported**.
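The compatibility rules above can be sketched as a small version gate. This is an illustrative helper based only on the list above, not part of the Soda API:

```python
# Minimum supported Databricks Connect version per release series,
# per the compatibility list above.
SUPPORTED_MINIMUMS = {
    (17, 3): (17, 3, 1),
    (17, 2): (17, 2, 4),
    (17, 1): (17, 1, 7),
    (17, 0): (17, 0, 10),
    (16,): (16, 4, 9),  # 16.x is gated by major version only
}

def is_supported(version: str) -> bool:
    """Check a databricks-connect version string against the minimums above.

    Illustrative helper, not part of the Soda API.
    """
    parts = tuple(int(p) for p in version.split("."))
    for prefix, minimum in SUPPORTED_MINIMUMS.items():
        if parts[: len(prefix)] == prefix:
            return parts >= minimum
    return False  # series not in the list: not supported
```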

### Troubleshoot

<i class="fa-square-xmark">:square-xmark:</i> **Problem:** In a Databricks Notebook, running Soda results in validation or assertion errors, such as:

```
ValidationError
AssertionError: Expected a list, got <class 'NoneType'>
Unexpected or missing fields during Soda initialization
```

<i class="fa-square-check">:square-check:</i> **Solution:**\
Databricks runtimes ship with a preinstalled version of **`pydantic`** that may be incompatible with the version required by Soda.

Before invoking Soda, upgrade `pydantic` in your Databricks Notebook:

```python
!pip install -U pydantic
dbutils.library.restartPython()  # for pip installation to take effect
```

<i class="fa-square-xmark">:square-xmark:</i> **Problem:** In a Databricks Notebook, running the import

```python
from soda_sparkdf import SparkDataFrameDataSource
```

fails with the error

{% code overflow="wrap" %}

```
ImportError: cannot import name 'sql' from 'databricks' (/databricks/spark/python/databricks/__init__.py)
```

{% endcode %}

<i class="fa-square-check">:square-check:</i> **Solution:** Run the following in your Databricks Notebook:

```python
!pip install databricks-sql-connector
dbutils.library.restartPython()  # for pip installation to take effect
```


***

{% if (visitor.claims.plan === 'datasetStandard')%}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Free license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if (visitor.claims.plan === 'enterprise')%}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Team license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if (visitor.claims.plan === 'enterpriseUserBased')%}
{% hint style="success" %}
You are **logged in to Soda** and seeing the **Enterprise license** documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

{% if !(visitor.claims.plan === 'enterprise' || visitor.claims.plan === 'enterpriseUserBased' || visitor.claims.plan === 'datasetStandard')%}
{% hint style="info" %}
You are **not logged in to Soda** and are viewing the default public documentation. Learn more about [documentation-access-and-licensing](https://docs.soda.io/reference/documentation-access-and-licensing "mention").
{% endhint %}
{% endif %}

