Connect Soda to Dask and Pandas
Access configuration details to connect Soda to Dask and Pandas.
Connection configuration reference
For use with programmatic Soda scans only.
Install package: pip install soda-pandas-dask
Define a programmatic scan for the data in the DataFrames. You do not need to configure a connection to a data source, but you must still configure a connection to Soda Cloud using API Keys. Refer to the following example.
Load CSV file into Dataframe
import pandas as pd
from soda.scan import Scan
# Create a Soda scan object
scan = Scan()
# Load CSV file into DataFrame
df = pd.read_csv('your_file.csv')
...
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Read more info in the "Add optional parameter for text data conversion" section
dask.config.set({"dataframe.convert-string": False})
# Create a Soda scan object
scan = Scan()
# Load timeseries data from dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "user@example.com"
# Create an artificial pandas dataframe
df_employee = pd.DataFrame({"email": ["alice@example.com", "bob@example.com", "carol@example.com"]})
# Either add Dask dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders")
# OR, add Pandas dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee, data_source_name="orders")
# Optionally, add multiple dataframes as unique data sources. Note the change of
# the data_source_name parameter.
scan.add_dask_dataframe(dataset_name="inquiries", dask_df=[...], data_source_name="customers")
# Set the scan definition name and default data source to use
scan.set_scan_definition_name("test")
scan.set_data_source_name("orders")
# Add configuration YAML file
# You do not need connection to a data source; you must have a connection to Soda Cloud
# Choose one of the following two options:
# 1) From a file
scan.add_configuration_yaml_file(file_path="~/.soda/configuration.yml")
# 2) Inline in the code
# For host, use cloud.soda.io for EU region; use cloud.us.soda.io for US region
scan.add_configuration_yaml_str(
"""
soda_cloud:
host: cloud.soda.io
api_key_id: 2e0ba0cb-your-api-key-7b
api_key_secret: 5wd-your-api-key-secret-aGuRg
"""
)
# Define checks in yaml format
# Alternatively, refer to a yaml file using scan.add_sodacl_yaml_file(<filepath>)
checks = """
for each dataset T:
datasets:
- include %
checks:
- row_count > 0
profile columns:
columns:
- employee.%
checks for employee:
- values in (email) must exist in timeseries (email) # Error expected
- row_count same as timeseries # Error expected
checks for timeseries:
- avg_x_minus_y between -1 and 1:
avg_x_minus_y expression: AVG(x - y)
- failed rows:
samples limit: 50
fail condition: x >= 3
- schema:
name: Confirm that required columns are present
warn:
when required column missing: [x]
when forbidden column present: [email]
when wrong column type:
email: varchar
fail:
when required column missing:
- y
- invalid_count(email) = 0:
valid format: email
- valid_count(email) > 0:
valid format: email
"""
scan.add_sodacl_yaml_str(checks)
scan.set_verbose(True)
scan.execute()
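For option 1 above, the referenced configuration.yml file contains the same soda_cloud block shown inline in option 2; a minimal sketch with placeholder keys (host shown for the EU region):

```yaml
soda_cloud:
  host: cloud.soda.io
  api_key_id: 2e0ba0cb-your-api-key-7b
  api_key_secret: 5wd-your-api-key-secret-aGuRg
```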
Load JSON file into Dataframe
import pandas as pd
from soda.scan import Scan
# Create a Soda scan object
scan = Scan()
# Load JSON file into DataFrame
df = pd.read_json('your_file.json')
...
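The loading step above can be exercised on its own; a minimal sketch, assuming pandas is installed, that parses a small inline JSON document the same way pd.read_json reads a file path:

```python
import io
import pandas as pd

# Parse a JSON array of records, as pd.read_json does for a file path
json_data = '[{"email": "alice@example.com"}, {"email": "bob@example.com"}]'
df = pd.read_json(io.StringIO(json_data))
print(df.shape)  # (2, 1)
```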
Add optional parameter for COUNT
Prior to soda-pandas-dask version 1.6.4, Soda only supported dask-sql versions up to 2023.10, in which the COUNT(*) clause behaved as COUNT(1) by default. With dask-sql versions greater than 2023.10, Dask's behavior changed so that COUNT(*) behaves as COUNT(*). Therefore, upgrading your soda-pandas-dask package, which supports newer versions of dask-sql with the new behavior, might lead to unexpected differences in your check results.
To mitigate confusion, with soda-pandas-dask version 1.6.4 or greater, use the optional use_dask_count_star_as_count_one parameter when calling scan.add_dask_dataframe() or scan.add_pandas_dataframe() to explicitly set the behavior of the COUNT(*) clause, as in the following example.
use_dask_count_star_as_count_one=True: COUNT(*) behaves as the SQL COUNT(1) operation.
use_dask_count_star_as_count_one=False: COUNT(*) behaves as the SQL COUNT(*) operation.
If you do not add the parameter, Soda defaults to use_dask_count_star_as_count_one=True.
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Create a Soda scan object
scan = Scan()
# Load timeseries data from Dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "user@example.com"
# Add Dask Dataframe to scan and assign a dataset name to refer from checks.yaml
# Dask uses SQL COUNT(*) operation, instead of COUNT(1)
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)
Add optional parameter for text data conversion
If you use dask>=2023.7.1 with pandas>=2 and pyarrow>=12, Dask DataFrame automatically converts text data to the string[pyarrow] data type. With soda-pandas-dask version 1.6.4, Soda's updated codebase uses dask>=2023.7.1 but still expects text data to use the object data type.
Set dask.config.set({"dataframe.convert-string": False}), as in the following example, to avoid KeyError: string[pyarrow] errors. Access Dask documentation for further details.
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Avoid string conversion errors
dask.config.set({"dataframe.convert-string": False})
# Create a Soda scan object
scan = Scan()
# Load timeseries data from Dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "user@example.com"
# Add Dask Dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)
Troubleshoot
Problem: You encounter errors when trying to install soda-pandas-dask in an environment that uses Python 3.11. This may manifest as an issue with dependencies or as an error that reads, Pre-scan validation failed, see logs for details.
Workaround: Uninstall the soda-pandas-dask package, then downgrade the version of Python your environment uses to Python 3.9. Install the soda-pandas-dask package again.
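Before reinstalling, you can confirm which interpreter version your environment uses; a stdlib-only sketch (the 3.11 issue is as reported above; the exact version cutoff used here is an assumption):

```python
import sys

# soda-pandas-dask reportedly fails to install on Python 3.11; 3.9 is known to work
major, minor = sys.version_info[:2]
ok = (major, minor) < (3, 11)  # assumed cutoff based on the reported issue
print(f"Running Python {major}.{minor}; safe to install soda-pandas-dask: {ok}")
```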
Problem: The COUNT(*) behavior in dask-sql is behaving unexpectedly or yielding confusing check results.
Solution: Upgrade soda-pandas-dask to version 1.6.4 or greater and use the optional use_dask_count_star_as_count_one=True parameter when calling scan.add_dask_dataframe() or scan.add_pandas_dataframe() to preserve the old dask-sql behavior. See Add optional parameter for COUNT.
Problem: You encounter an error that reads KeyError: string[pyarrow].
Solution: Upgrade soda-pandas-dask to version 1.6.4 or greater and set dask.config.set({"dataframe.convert-string": False}). See Add optional parameter for text data conversion.