Connect Soda to Dask and Pandas
Access configuration details to connect Soda to Dask and Pandas.
Connection configuration reference
For use with programmatic Soda scans only.
Install package: pip install soda-pandas-dask
Define a programmatic scan for the data in the DataFrames. You do not need to configure a connection to a data source, but you must still configure a connection to Soda Cloud using API Keys. Refer to the following example.
Load CSV file into Dataframe
import pandas as pd
from soda.scan import Scan
# Create a Soda scan object
scan = Scan()
# Load CSV file into DataFrame
df = pd.read_csv('your_file.csv')
...
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Read more info in the "Add optional parameter for text data conversion" section
dask.config.set({"dataframe.convert-string": False})
# Create a Soda scan object
scan = Scan()
# Load timeseries data from dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "user@example.com"
# Create an artificial pandas dataframe
df_employee = pd.DataFrame({"email": ["alice@example.com", "bob@example.com", "carol@example.com"]})
# Either add Dask dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders")
# OR, add Pandas dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee, data_source_name="orders")
# Optionally, add multiple dataframes as unique data sources. Note the change of
# the data_source_name parameter.
scan.add_dask_dataframe(dataset_name="inquiries", dask_df=[...], data_source_name="customers")
# Set the scan definition name and default data source to use
scan.set_scan_definition_name("test")
scan.set_data_source_name("orders")
# Add configuration YAML file
# You do not need connection to a data source; you must have a connection to Soda Cloud
# Choose one of the following two options:
# 1) From a file
scan.add_configuration_yaml_file(file_path="~/.soda/configuration.yml")
# 2) Inline in the code
# For host, use cloud.soda.io for EU region; use cloud.us.soda.io for US region
scan.add_configuration_yaml_str(
"""
soda_cloud:
host: cloud.soda.io
api_key_id: 2e0ba0cb-your-api-key-7b
api_key_secret: 5wd-your-api-key-secret-aGuRg
"""
)
# Define checks in yaml format
# Alternatively, refer to a yaml file using scan.add_sodacl_yaml_file(<filepath>)
checks = """
for each dataset T:
datasets:
- include %
checks:
- row_count > 0
profile columns:
columns:
- employee.%
checks for employee:
- values in (email) must exist in timeseries (email) # Error expected
- row_count same as timeseries # Error expected
checks for timeseries:
- avg_x_minus_y between -1 and 1:
avg_x_minus_y expression: AVG(x - y)
- failed rows:
samples limit: 50
fail condition: x >= 3
- schema:
name: Confirm that required columns are present
warn:
when required column missing: [x]
when forbidden column present: [email]
when wrong column type:
email: varchar
fail:
when required column missing:
- y
- invalid_count(email) = 0:
valid format: email
- valid_count(email) > 0:
valid format: email
"""
scan.add_sodacl_yaml_str(checks)
scan.set_verbose(True)
scan.execute()
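For option 1 above, the referenced configuration.yml file contains the same soda_cloud block shown inline in option 2; a minimal sketch with placeholder keys (host shown for the EU region):

```yaml
soda_cloud:
  host: cloud.soda.io
  api_key_id: 2e0ba0cb-your-api-key-7b
  api_key_secret: 5wd-your-api-key-secret-aGuRg
```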
Load JSON file into Dataframe
import pandas as pd
from soda.scan import Scan
# Create a Soda scan object
scan = Scan()
# Load JSON file into DataFrame
df = pd.read_json('your_file.json')
...
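The loading step above can be exercised on its own; a minimal sketch, assuming pandas is installed, that parses a small inline JSON document the same way pd.read_json reads a file path:

```python
import io
import pandas as pd

# Parse a JSON array of records, as pd.read_json does for a file path
json_data = '[{"email": "alice@example.com"}, {"email": "bob@example.com"}]'
df = pd.read_json(io.StringIO(json_data))
print(df.shape)  # (2, 1)
```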
Add optional parameter for COUNT
Prior to soda-pandas-dask version 1.6.4, Soda only supported dask-sql versions up to 2023.10, in which the COUNT(*) clause behaved as COUNT(1) by default. With dask-sql versions greater than 2023.10, Dask's behavior changed so that COUNT(*) behaves as COUNT(*). Therefore, upgrading your soda-pandas-dask package, which supports newer versions of dask-sql with the new behavior, might lead to unexpected differences in your check results.
To mitigate confusion, with soda-pandas-dask version 1.6.4 or greater, use the optional use_dask_count_star_as_count_one parameter when calling scan.add_dask_dataframe() or scan.add_pandas_dataframe() to explicitly set the behavior of the COUNT(*) clause, as in the following example.
use_dask_count_star_as_count_one=True: COUNT(*) behaves as the SQL COUNT(1) operation.
use_dask_count_star_as_count_one=False: COUNT(*) behaves as the SQL COUNT(*) operation.
If you do not add the parameter, Soda defaults to use_dask_count_star_as_count_one=True.
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Create a Soda scan object
scan = Scan()
# Load timeseries data from Dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "user@example.com"
# Add Dask Dataframe to scan and assign a dataset name to refer from checks.yaml
# Dask uses SQL COUNT(*) operation, instead of COUNT(1)
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)
Add optional parameter for text data conversion
If you use dask>=2023.7.1 with pandas>=2 and pyarrow>=12, Dask DataFrame automatically converts text data to the string[pyarrow] data type. With soda-pandas-dask version 1.6.4, Soda's updated codebase uses dask>=2023.7.1 but still expects text data to use the object data type.
Set dask.config.set({"dataframe.convert-string": False}), as in the following example, to avoid KeyError: string[pyarrow] errors. Access Dask documentation for further details.
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Avoid string conversion errors
dask.config.set({"dataframe.convert-string": False})
# Create a Soda scan object
scan = Scan()
# Load timeseries data from Dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "user@example.com"
# Add Dask Dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)
Troubleshoot
Problem: You encounter errors when trying to install soda-pandas-dask in an environment that uses Python 3.11. This may manifest as an issue with dependencies or as an error that reads, Pre-scan validation failed, see logs for details.
Workaround: Uninstall the soda-pandas-dask package, then downgrade the version of Python your environment uses to Python 3.9. Install the soda-pandas-dask package again.
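Before reinstalling, you can confirm which interpreter version your environment uses; a stdlib-only sketch (the 3.11 issue is as reported above; the exact version cutoff used here is an assumption):

```python
import sys

# soda-pandas-dask reportedly fails to install on Python 3.11; 3.9 is known to work
major, minor = sys.version_info[:2]
ok = (major, minor) < (3, 11)  # assumed cutoff based on the reported issue
print(f"Running Python {major}.{minor}; safe to install soda-pandas-dask: {ok}")
```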
Problem: The COUNT(*) behavior in dask-sql is behaving unexpectedly or yielding confusing check results.
Solution: Upgrade soda-pandas-dask to version 1.6.4 or greater and use the optional use_dask_count_star_as_count_one=True parameter when calling scan.add_dask_dataframe() or scan.add_pandas_dataframe() to preserve the old dask-sql behavior. See Add optional parameter for COUNT.
Problem: You encounter an error that reads KeyError: string[pyarrow].
Solution: Upgrade soda-pandas-dask to version 1.6.4 or greater and set dask.config.set({"dataframe.convert-string": False}). See Add optional parameter for text data conversion.