# Connect Soda to Dask and Pandas

## Connection configuration reference

For use with [programmatic Soda scans](https://docs.soda.io/soda-documentation/soda-v3/quick-start-sip/programmatic) only.

Install package: `soda-pandas-dask`

Define a programmatic scan for the data in the DataFrames. You do not need to configure a connection to a data source, but you must still configure a connection to Soda Cloud using API keys. Refer to the following example.

<details>

<summary>Why do I need a Soda Cloud account?</summary>

To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.\
[Learn more](https://docs.soda.io/soda-documentation/soda-v3/quick-start-sip/get-started-roadmap#about-soda)

</details>
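
If you prefer to keep the Soda Cloud connection in a configuration file, as in the `scan.add_configuration_yaml_file()` option in the example below, a minimal `configuration.yml` mirrors the inline configuration shown there; the API key values here are placeholders for the keys you create in Soda Cloud.

```yaml
soda_cloud:
  # Use cloud.soda.io for EU region; use cloud.us.soda.io for US region
  host: cloud.soda.io
  api_key_id: 2e0ba0cb-your-api-key-7b
  api_key_secret: 5wd-your-api-key-secret-aGuRg
```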

### Load CSV file into Dataframe

```python
import pandas as pd

import dask
import dask.datasets
from soda.scan import Scan

# Avoid string conversion errors; see "Add optional parameter for text data conversion" below
dask.config.set({"dataframe.convert-string": False})

# Create a Soda scan object
scan = Scan()

# Load timeseries data from dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "a@soda.io"

# Create an artificial pandas dataframe
df_employee = pd.DataFrame({"email": ["a@soda.io", "b@soda.io", "c@soda.io"]})

# Either add a Dask DataFrame to the scan and assign a dataset name to reference in checks.yaml
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders")
# OR, add a pandas DataFrame to the scan and assign a dataset name to reference in checks.yaml
scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee, data_source_name="orders")

# Optionally, add multiple dataframes as unique data sources. Note the change of 
# the data_source_name parameter. 
scan.add_dask_dataframe(dataset_name="inquiries", dask_df=[...], data_source_name="customers")

# Set the scan definition name and default data source to use
scan.set_scan_definition_name("test")
scan.set_data_source_name("orders")

# Add configuration YAML file
# You do not need a connection to a data source; you must have a connection to Soda Cloud
# Choose one of the following two options:
# 1) From a file
scan.add_configuration_yaml_file(file_path="~/.soda/configuration.yml")
# 2) Inline in the code
# For host, use cloud.soda.io for EU region; use cloud.us.soda.io for US region
scan.add_configuration_yaml_str(
    """
    soda_cloud:
      host: cloud.soda.io
      api_key_id: 2e0ba0cb-your-api-key-7b
      api_key_secret: 5wd-your-api-key-secret-aGuRg
"""

# Define checks in yaml format
# Alternatively, refer to a yaml file using scan.add_sodacl_yaml_file(<filepath>)
checks = """
for each dataset T:
  datasets:
    - include %
  checks:
    - row_count > 0
profile columns:
  columns:
    - employee.%
checks for employee:
  - values in (email) must exist in timeseries (email) # Error expected
  - row_count same as timeseries # Error expected
checks for timeseries:
  - avg_x_minus_y between -1 and 1:
      avg_x_minus_y expression: AVG(x - y)
  - failed rows:
      samples limit: 50
      fail condition: x >= 3
  - schema:
      name: Confirm that required columns are present
      warn:
        when required column missing: [x]
        when forbidden column present: [email]
        when wrong column type:
          email: varchar
      fail:
        when required column missing:
          - y
  - invalid_count(email) = 0:
      valid format: email
  - valid_count(email) > 0:
      valid format: email
"""

scan.add_sodacl_yaml_str(checks)

scan.set_verbose(True)
scan.execute()
```
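
The example above loads Dask's example timeseries dataset and an artificial pandas DataFrame rather than reading a CSV file. To scan your own CSV data, load the file into a pandas DataFrame and add it to the scan in the same way; the following is a minimal sketch in which `your_file.csv` and the dataset name are placeholders.

```python
import pandas as pd
from soda.scan import Scan

# Create a Soda scan object
scan = Scan()

# Load CSV file into DataFrame; 'your_file.csv' is a placeholder file name
df = pd.read_csv('your_file.csv')

# Add the DataFrame to the scan, then configure and execute the scan as shown above
scan.add_pandas_dataframe(dataset_name="your_dataset", pandas_df=df, data_source_name="orders")
```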

### Load JSON file into Dataframe

```python
import pandas as pd
from soda.scan import Scan

# Create a Soda scan object
scan = Scan()

# Load JSON file into DataFrame
df = pd.read_json('your_file.json')

...
```
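
From this point, the flow is the same as in the full example above: add the DataFrame to the scan, connect to Soda Cloud, add checks, and execute. The following is a minimal sketch; the dataset name, data source name, and file paths are illustrative placeholders.

```python
# Add the DataFrame to the scan; dataset and data source names are placeholders
scan.add_pandas_dataframe(dataset_name="your_dataset", pandas_df=df, data_source_name="orders")

# Set the scan definition name and default data source
scan.set_scan_definition_name("test")
scan.set_data_source_name("orders")

# Connect to Soda Cloud and add SodaCL checks (placeholder file paths)
scan.add_configuration_yaml_file(file_path="~/.soda/configuration.yml")
scan.add_sodacl_yaml_file("checks.yml")  # or scan.add_sodacl_yaml_str(...) as shown above

# Run the scan
scan.execute()
```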

## Add optional parameter for `COUNT`

Prior to `soda-pandas-dask` version 1.6.4, Soda only supported `dask-sql` versions up to `2023.10`, in which the `COUNT(*)` clause behaved as `COUNT(1)` by default. In `dask-sql` versions greater than `2023.10`, the behavior changed so that `COUNT(*)` executes as a true SQL `COUNT(*)`. Therefore, upgrading your `soda-pandas-dask` package, which supports the newer `dask-sql` versions with this new behavior, might lead to unexpected differences in your check results.

To mitigate confusion, with `soda-pandas-dask` version 1.6.4 or greater, use the optional `use_dask_count_star_as_count_one` parameter when calling `scan.add_dask_dataframe()` or `scan.add_pandas_dataframe()` to explicitly set the behavior of the `COUNT(*)` clause, as in the following example.

| Parameter setting                        | Behavior                                           |
| ---------------------------------------- | -------------------------------------------------- |
| `use_dask_count_star_as_count_one=True`  | `COUNT(*)` behaves as the SQL `COUNT(1)` operation |
| `use_dask_count_star_as_count_one=False` | `COUNT(*)` behaves as the SQL `COUNT(*)` operation |

If you do not add the parameter, Soda defaults to `use_dask_count_star_as_count_one=True`.

```python
import pandas as pd

import dask
import dask.datasets
from soda.scan import Scan

# Create a Soda scan object
scan = Scan()

# Load timeseries data from Dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "a@soda.io"

# Add the Dask DataFrame to the scan and assign a dataset name to reference in checks.yaml
# Dask uses the SQL COUNT(*) operation, instead of COUNT(1)
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)
```

## Add optional parameter for text data conversion

In `dask>=2023.7.1`, if you use `pandas>=2` and `pyarrow>=12`, Dask DataFrame automatically converts text data to the `string[pyarrow]` data type. As of version 1.6.4, `soda-pandas-dask` uses `dask>=2023.7.1`, but it still expects text data to use the `object` data type.

To avoid `KeyError: string[pyarrow]` errors, set the `dataframe.convert-string` option to `False` with `dask.config.set({"dataframe.convert-string": False})`, as in the following example. See the [Dask documentation](https://docs.dask.org/en/stable/changelog.html#v2023-7-1) for further details.

```python
import pandas as pd

import dask
import dask.datasets
from soda.scan import Scan

# Avoid string conversion errors
dask.config.set({"dataframe.convert-string": False})

# Create a Soda scan object
scan = Scan()

# Load timeseries data from Dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "a@soda.io"

# Add the Dask DataFrame to the scan and assign a dataset name to reference in checks.yaml
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)
```

<br>

## Troubleshoot

**Problem:** You encounter errors when trying to install `soda-pandas-dask` in an environment that uses Python 3.11. This may manifest as an issue with dependencies or as an error that reads, `Pre-scan validation failed, see logs for details.`

**Workaround:** Uninstall the `soda-pandas-dask` package, then downgrade the version of Python your environment uses to Python 3.9. Install the `soda-pandas-dask` package again.

<br>

**Problem:** `COUNT(*)` in `dask-sql` behaves unexpectedly or yields confusing check results.

**Solution:** Upgrade `soda-pandas-dask` to version 1.6.4 or greater and use the optional `use_dask_count_star_as_count_one=True` parameter when calling `scan.add_dask_dataframe()` or `scan.add_pandas_dataframe()` to preserve the old `dask-sql` behavior. See [Add optional parameter for `COUNT`](#add-optional-parameter-for-count).

<br>

**Problem:** You encounter an error that reads `KeyError: string[pyarrow]`.

**Solution:** Upgrade `soda-pandas-dask` to version 1.6.4 or greater and set the `dataframe.convert-string` option to `False` with `dask.config.set({"dataframe.convert-string": False})`. See [Add optional parameter for text data conversion](#add-optional-parameter-for-text-data-conversion).
