Soda Library Python API reference
Access Python reference content for the Soda Scan class and its methods.
Use the Python API to programmatically execute Soda scans. The following content offers a reference for the Soda Scan class and its methods.
Refer to the Program a scan tab in Program a scan for instructions and an example of a complete file.
Classes
Use the Scan class to programmatically define and execute data quality scans. See Invoke Soda Library for an example of how to use the Soda Library Python API in a programmatic scan.
class Scan()
Methods
Use this method to execute the scan. When executed, Soda returns an integer exit code as per the table that follows.
def execute(self) -> int
0: All checks passed. No runtime errors.
1: Soda recorded a warn result for one or more checks.
2: Soda recorded a fail result for one or more checks.
3: Soda encountered a runtime issue but was able to send check results to Soda Cloud.
4: Soda encountered a runtime issue and was unable to send check results to Soda Cloud.
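The exit codes above can be turned into a pipeline decision. The following is a minimal sketch, not part of Soda: the `ScanExitCode` names and the `should_block_pipeline` helper are hypothetical and only encode the five codes listed in the table.

```python
from enum import IntEnum


class ScanExitCode(IntEnum):
    """Hypothetical names for the exit codes listed above (not part of Soda)."""
    PASS = 0
    WARN = 1
    FAIL = 2
    ERROR_RESULTS_SENT = 3
    ERROR_RESULTS_NOT_SENT = 4


def should_block_pipeline(exit_code: int, block_on_warn: bool = False) -> bool:
    """Decide whether a pipeline step should stop, given a scan exit code."""
    code = ScanExitCode(exit_code)  # raises ValueError for codes outside 0-4
    if code is ScanExitCode.PASS:
        return False
    if code is ScanExitCode.WARN:
        return block_on_warn
    # Failed checks and both kinds of runtime issue stop the pipeline.
    return True
```

In a programmatic scan, you would typically pass the integer returned by `execute()` straight into such a helper, for example `should_block_pipeline(scan.execute())`.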
Provide required scan settings
Specify the data source on which Soda executes the checks.
Provide the scan definition name if the scan has been defined in Soda Cloud. By providing this value, Soda correlates subsequent scans from the same pipeline.
To retrieve this value, navigate to the Scans page in Soda Cloud, then select the scan definition you wish to execute remotely and copy the scan name, which is the smaller text under the label. For example, weekday_scan_schedule.
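As an illustration of the two required settings, the sketch below applies them to a Scan-like object. The method names `set_data_source_name` and `set_scan_definition_name` are assumptions that this page does not show; verify them against your installed Soda version. `apply_required_settings` and `my_datasource` are hypothetical, and a `MagicMock` stands in for a real Scan so the example runs without Soda installed.

```python
from unittest.mock import MagicMock


def apply_required_settings(scan, data_source_name: str, scan_definition_name: str) -> None:
    """Apply the two required settings to an existing Scan-like object."""
    if not data_source_name or not scan_definition_name:
        raise ValueError("both a data source name and a scan definition name are required")
    # Assumed method names; check them against your installed Soda version.
    scan.set_data_source_name(data_source_name)
    scan.set_scan_definition_name(scan_definition_name)


demo_scan = MagicMock()  # stand-in so the sketch runs without Soda installed
apply_required_settings(demo_scan, "my_datasource", "weekday_scan_schedule")
```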
Add configurations to a scan
Add data source and Soda Cloud connection configurations from a YAML file. file_path is a string that points to a configuration file. ~ expands to the user's home directory.
Optionally, add all connection configurations from all matching YAML files in the file path according to your specifications.
path is a string that identifies the directory in which to search for configuration files; it can also point to a single configuration file. A leading ~ expands to the user's home directory.
recursive is an optional boolean value that controls whether Soda scans nested directories. If unspecified, the default value is True.
suffixes is an optional list of strings that you use when recursively scanning directories to load only those files with a specific extension. If unspecified, the default values are .yml and .yaml.
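To make the meaning of path, recursive, and suffixes concrete, the following sketch mimics the described file selection with the standard library only. It is an illustration of the documented behavior under stated assumptions, not Soda's implementation; `find_yaml_files` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_yaml_files(path: str, recursive: bool = True,
                    suffixes: Iterable[str] = (".yml", ".yaml")) -> List[Path]:
    """Return the files that a directory-based load would consider."""
    root = Path(path).expanduser()  # "~" becomes the user's home directory
    if root.is_file():
        return [root]  # a path to a single file is accepted as-is
    pattern = "**/*" if recursive else "*"  # "**/*" also descends into nested directories
    wanted = tuple(suffixes)
    return sorted(p for p in root.glob(pattern)
                  if p.is_file() and p.name.endswith(wanted))


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "nested").mkdir()
    for name in ("configuration.yml", "notes.txt", "nested/extra.yaml"):
        (base / name).write_text("")
    top_level = [p.name for p in find_yaml_files(tmp, recursive=False)]
    everything = [p.name for p in find_yaml_files(tmp)]
```

With recursive disabled, only files directly inside the directory are selected; files with other extensions are skipped in both cases.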
Optionally, add connection configurations from a YAML-formatted string.
environment_yaml_str is a string that represents a configuration and must be YAML-formatted.
file_path is an optional string that you use to get the location of errors in the logs.
Add SodaCL checks to a scan
Add a SodaCL checks YAML file to the scan according to a file path you specify. file_path is a string that identifies a checks YAML file.
Optionally, add all the files in a directory to the scan as SodaCL checks YAML files.
path is a string that identifies the directory in which to search for checks YAML files; it can also point to a single file. A leading ~ expands to the user's home directory.
recursive is an optional boolean value that controls whether Soda scans nested directories. If unspecified, the default value is True.
suffixes is an optional list of strings that you use when recursively scanning directories to load only those files with a specific extension. If unspecified, the default values are .yml and .yaml.
Optionally, add SodaCL checks from a YAML-formatted string.
sodacl_yaml_str is a string that represents the SodaCL checks and must be YAML-formatted.
file_path is an optional string that you use to get the location of errors in the logs.
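A YAML-formatted SodaCL string can be assembled in code before it is handed to the scan. The sketch below builds a single row count check; `build_row_count_checks` and the dataset name `dim_customer` are hypothetical and serve only as an illustration.

```python
def build_row_count_checks(dataset_name: str, min_rows: int = 1) -> str:
    """Return a YAML-formatted SodaCL string with one row_count check."""
    if min_rows < 0:
        raise ValueError("min_rows must not be negative")
    return (
        f"checks for {dataset_name}:\n"
        f"  - row_count >= {min_rows}\n"
    )


sodacl_yaml_str = build_row_count_checks("dim_customer", min_rows=10)
```

The resulting string is what you would pass as sodacl_yaml_str. In Soda's published examples the corresponding method is named `add_sodacl_yaml_str`, but treat that name as an assumption and confirm it against your installed version.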
If you use a check template for SodaCL checks, add a SodaCL template file to the scan. file_path is a string that identifies a SodaCL template file.
If you use multiple check templates for SodaCL checks, add all the template files in a directory to the scan. path is a string that identifies the directory that contains the SodaCL template files.
Add local data to a scan
If you use Pandas, add a Pandas DataFrame dataset to the scan.
dataset_name is a string to identify a dataset.
pandas_df is a Pandas DataFrame object.
data_source_name is a string to identify a data source.
If you use Dask, add a Dask DataFrame dataset to the scan.
dataset_name is a string to identify a dataset.
dask_df is a Dask DataFrame object.
data_source_name is a string to identify a data source.
If you use PySpark, add a Spark session to the scan.
spark_session is a Spark session object.
data_source_name is a string to identify a data source.
If you use a pre-existing DuckDB connection object as a data source, add a DuckDB connection to the scan.
duckdb_connection is a DuckDB connection object.
data_source_name is a string to identify a data source.
Add optional scan settings
Configure a scan to output verbose log information. This is useful when you wish to see the SQL queries that Soda executes or to troubleshoot scan issues.
Configure Soda to prevent it from sending scan results to Soda Cloud. This is useful if, for example, you are testing checks locally and do not wish to muddy the measurements in your Soda Cloud account with test run metadata.
Configure a scan to have access to custom variables that can be referenced in your SodaCL files.
variables is a dictionary with string keys and string values.
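Because variables must be a dictionary with string keys and string values, values such as dates or integers need converting first. The helper below is a hypothetical sketch of that conversion; `to_scan_variables` and the variable names are illustrative and not part of Soda.

```python
from datetime import date
from typing import Any, Dict, Mapping


def to_scan_variables(values: Mapping[str, Any]) -> Dict[str, str]:
    """Convert arbitrary values into the string-to-string dictionary a scan expects."""
    result: Dict[str, str] = {}
    for key, value in values.items():
        if not isinstance(key, str):
            raise TypeError(f"variable name must be a string, got {type(key).__name__}")
        if hasattr(value, "isoformat"):
            result[key] = value.isoformat()  # dates and datetimes become ISO 8601 text
        else:
            result[key] = str(value)
    return result


scan_variables = to_scan_variables({"run_date": date(2024, 1, 31), "max_rows": 100})
```

The converted dictionary is what you would supply as variables. In Soda's published examples the method is named `add_variables`; treat that name as an assumption to verify.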
Add configurations to handle scan results
Use the following configurations to handle errors and/or warnings that occurred during a Soda scan.
Instruct Soda to raise an AssertionError when errors occur in the scan logs.
Instruct Soda to raise an AssertionError when errors or warnings occur in the scan logs.
Instruct Soda to raise an AssertionError when a specific error message occurs in the scan logs. Use expected_error_message to specify the error message as a string.
Instruct Soda to return a boolean value to indicate that errors occurred in the scan logs.
Instruct Soda to return a boolean value to indicate that errors or warnings occurred in the scan logs.
Instruct Soda to return a string that represents the logs from the scan.
Instruct Soda to return a list of strings of scan errors in the logs.
Instruct Soda to return a list of strings of scan errors and warnings in the logs.
Instruct Soda to return a string of all scan errors in the logs.
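The log-related methods described above can be combined into a single helper that returns the problems worth reporting. This is a sketch under assumptions: the method names `has_error_logs`, `get_error_logs`, `has_error_or_warning_logs`, and `get_error_or_warning_logs` are not shown on this page and should be verified against your installed Soda version. `collect_log_problems` is hypothetical, and a `MagicMock` with an invented log message stands in for a real Scan.

```python
from typing import List
from unittest.mock import MagicMock


def collect_log_problems(scan, include_warnings: bool = False) -> List[str]:
    """Return error (and optionally warning) log entries as strings."""
    if include_warnings:
        # Assumed method names for the errors-or-warnings variants.
        if not scan.has_error_or_warning_logs():
            return []
        return [str(entry) for entry in scan.get_error_or_warning_logs()]
    # Assumed method names for the errors-only variants.
    if not scan.has_error_logs():
        return []
    return [str(entry) for entry in scan.get_error_logs()]


demo_log_scan = MagicMock()  # stand-in so the sketch runs without Soda installed
demo_log_scan.has_error_logs.return_value = True
demo_log_scan.get_error_logs.return_value = ["Connection refused"]
demo_log_scan.has_error_or_warning_logs.return_value = False
errors_only = collect_log_problems(demo_log_scan)
errors_and_warnings = collect_log_problems(demo_log_scan, include_warnings=True)
```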
Instruct Soda to return a dictionary containing the results of the scan.
The scan results dictionary includes the following keys:
Add configurations to handle check results
Use the following configurations to handle the results of checks executed during a Soda scan.
Instruct Soda to raise an AssertionError when any check execution results in a fail state.
Instruct Soda to raise an AssertionError when any check execution results in a fail or warn state.
Instruct Soda to return a boolean value to indicate that one or more checks executed during the scan resulted in a fail state.
Instruct Soda to return a boolean value to indicate that one or more checks executed during the scan resulted in a warn state.
Instruct Soda to return a boolean value to indicate that one or more checks executed during the scan resulted in a fail or warn state.
Instruct Soda to return a list of strings of checks that resulted in a fail state.
Instruct Soda to return a string of checks that resulted in a fail state.
Instruct Soda to return a list of strings of checks that resulted in a fail or warn state.
Instruct Soda to return a string of checks that resulted in a fail or warn state.
Instruct Soda to return a string of all check results.
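The check-result methods described above lend themselves to a small summary function. The sketch below assumes the method names `has_check_fails`, `has_check_warns`, and `get_checks_fail_text`, which this page does not show; verify them against your installed Soda version. `summarize_checks` is hypothetical, and a `MagicMock` stands in for a real Scan.

```python
from unittest.mock import MagicMock


def summarize_checks(scan) -> str:
    """Reduce a scan's check results to 'pass', 'warn', or 'fail: <details>'."""
    if scan.has_check_fails():  # assumed method name
        return "fail: " + str(scan.get_checks_fail_text())  # assumed method name
    if scan.has_check_warns():  # assumed method name
        return "warn"
    return "pass"


demo_check_scan = MagicMock()  # stand-in so the sketch runs without Soda installed
demo_check_scan.has_check_fails.return_value = False
demo_check_scan.has_check_warns.return_value = True
outcome = summarize_checks(demo_check_scan)
```

Checking for failures before warnings means a scan with both is reported as failed, which matches the ordering of the exit codes returned by the execute method.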
Attributes
Configure the data source-level samples limit for the failed rows sampler. This is useful when scanning Pandas, Dask, or Spark DataFrames.
Replace the failed rows sampler with a custom sampler. See Configure a custom sampler for instructions about how to define a custom sampler.
