Connect Soda to Dask and Pandas
Access configuration details to connect Soda to Dask and Pandas.
Connection configuration reference
For use with programmatic Soda scans, only.
Install package: soda-pandas-dask
Define a programmatic scan for the data in the DataFrames. You do not need to configure a connection to a data source, but you must still configure a connection to Soda Cloud using API Keys. Refer to the following example.
Load CSV file into Dataframe
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Read more info in "Note on new release" section
dask.config.set({"dataframe.convert-string": False})
# Create a Soda scan object
scan = Scan()
# Load timeseries data from dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "[email protected]"
# Create an artificial pandas dataframe
df_employee = pd.DataFrame({"email": ["[email protected]", "[email protected]", "[email protected]"]})
# Either add Dask dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders")
# OR, add Pandas dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee, data_source_name="orders")
# Optionally, add multiple dataframes as unique data sources. Note the change of
# the data_source_name parameter.
scan.add_dask_dataframe(dataset_name="inquiries", dask_df=[...], data_source_name="customers")
# Set the scan definition name and default data source to use
scan.set_scan_definition_name("test")
scan.set_data_source_name("orders")
# Add configuration YAML file
# You do not need connection to a data source; you must have a connection to Soda Cloud
# Choose one of the following two options:
# 1) From a file
scan.add_configuration_yaml_file(file_path="~/.soda/configuration.yml")
# 2) Inline in the code
# For host, use cloud.soda.io for EU region; use cloud.us.soda.io for US region
scan.add_configuration_yaml_str(
"""
soda_cloud:
host: cloud.soda.io
api_key_id: 2e0ba0cb-your-api-key-7b
api_key_secret: 5wd-your-api-key-secret-aGuRg
"""
# Define checks in yaml format
# Alternatively, refer to a yaml file using scan.add_sodacl_yaml_file(<filepath>)
checks = """
for each dataset T:
datasets:
- include %
checks:
- row_count > 0
profile columns:
columns:
- employee.%
checks for employee:
- values in (email) must exist in timeseries (email) # Error expected
- row_count same as timeseries # Error expected
checks for timeseries:
- avg_x_minus_y between -1 and 1:
avg_x_minus_y expression: AVG(x - y)
- failed rows:
samples limit: 50
fail condition: x >= 3
- schema:
name: Confirm that required columns are present
warn:
when required column missing: [x]
when forbidden column present: [email]
when wrong column type:
email: varchar
fail:
when required column missing:
- y
- invalid_count(email) = 0:
valid format: email
- valid_count(email) > 0:
valid format: email
"""
scan.add_sodacl_yaml_str(checks)
scan.set_verbose(True)
scan.execute()Load JSON file into Dataframe
import pandas as pd
from soda.scan import Scan
# Create a Soda scan object
scan = Scan()
# Load JSON file into DataFrame
df = pd.read_json('your_file.json')
...Add optional parameter for COUNT
COUNTPrior to soda-pandas-dask version 1.6.4, Soda only supported dask-sql versions up to 2023.10 in which the COUNT(*) clause behaved as COUNT(1) by default. With dask-sql versions greater than 2023.10, Dask's behavior changed so that COUNT(*) behaves as COUNT(*). Therefore, upgrading your soda-pandas-dask package, which supports newer versions of dask-sql with the new behavior, might lead to unexpected differences in your check results.
To mitigate confusion, with soda-pandas-dask version 1.6.4 or greater, use the optional use_dask_count_star_as_count_one parameter when calling scan.add_dask_dataframe() or scan.add_pandas_dataframe() to explicitly set the behavior of the COUNT(*) clause, as in the following example.
use_dask_count_star_as_count_one=True
COUNT(*) behaves as SQL COUNT(1) operation
use_dask_count_star_as_count_one=False
COUNT(*) behaves as SQL COUNT(*) operation
If you do not add the parameter, Soda defaults to use_dask_count_star_as_count_one=True.
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Create a Soda scan object
scan = Scan()
# Load timeseries data from Dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "[email protected]"
# Add Dask Dataframe to scan and assign a dataset name to refer from checks.yaml
# Dask uses SQL COUNT(*) operation, instead of COUNT(1)
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)Add optional parameter for text data conversion
In dask>=2023.7.1 and later, if you use pandas>=2 and pyarrow>=12, Dask Dataframe automatically converts text data to string[pyarrow] data type. With soda-pandas-dask version 1.6.4, Soda's updated codebase uses dask>=2023.7.1 but it still expects text data to be converted to object data type.
Add the dask.config.set({"dataframe.convert-string": False}) parameter set to False, as in the following example, to avoid KeyError: string[pyarrow] errors. Access Dask documentation for further details.
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Avoid string conversion errors
dask.config.set({"dataframe.convert-string": False})
# Create a Soda scan object
scan = Scan()
# Load timeseries data from Dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "[email protected]"
# Add Dask Dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)
Troubleshoot
Problem: You encounter errors when trying to install soda-pandas-dask in an environment that uses Python 3.11. This may manifest as an issue with dependencies or as an error that reads, Pre-scan validation failed, see logs for details.
Workaround: Uninstall the soda-pandas-dask package, then downgrade the version of Python your environment uses to Python 3.9. Install the soda-pandas-dask package again.
Problem: The COUNT(*) behavior in dask-sql is behaving unexpectedly or yielding confusing check results.
Solution: Upgrade soda-pandas-dask to version 1.6.4 or greater and use the optional use_dask_count_star_as_count_one=True parameter when calling scan.add_dask_dataframe() or scan.add_pandas_dataframe() to persist old dask-sql behavior. See Add optional parameter for COUNT.
Problem: You encounter an error that reads KeyError: string[pyarrow].
Solution: Upgrade soda-pandas-dask to version 1.6.4 or greater and use the dask.config.set({"dataframe.convert-string": False}) parameter set to False. See Add optional parameter text data conversion.
Last updated
Was this helpful?
