Invoke Soda Library
Last modified on 20-Nov-24
To automate the search for bad-quality data, you can use Soda library to programmatically set up and execute scans. As a Python library, you can invoke Soda just about anywhere you need it; the invocation instructions below offers a very simple invocation example to extrapolate from. Consult the Use case guides for more examples of how to programmatically run Soda scans for data quality.
Alternatively, you can install and use the Soda Library CLI to run scans; see Install Soda Library.
As a step in the Get started roadmap, this guide offers instructions to set up, install, and configure Soda in a programmatic deployment model.
Get started roadmap
-
Choose a flavor of Soda - Set up Soda: programmatic 📍 You are here!
a. Review requirements
b. Create a Soda Cloud account
c. Set up basic programmatic invocation in Python - Write SodaCL checks
- Run scans and review results
- Organize, alert, investigate
Requirements
To use Soda Library, you must have installed the following on your system.
- Python 3.8, 3.9, or 3.10
- Pip 21.0 or greater
- A Soda Cloud account; see next section.
Python versions Soda supports
Soda officially supports Python versions 3.8, 3.9, and 3.10.Though largely funcntional, efforts to fully support Python 3.11 and 3.12 are ongoing.
Using Python 3.11, some users might have some issues with dependencies constraints. At times, extra the combination of Python 3.11 and dependencies constraints requires that a dependency be built from source rather than downloaded pre-built.
The same applies to Python 3.12, although there is some anecdotal evidence that indicates that 3.12 might not work in all scenarios due to dependencies constraints.
Create a Soda Cloud account
- In a browser, navigate to cloud.soda.io/signup to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.
- Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys.
- Copy+paste the API key values to a temporary, secure place in your local environment.
Why do I need a Soda Cloud account?
To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.Learn more
Set up basic programmatic invocation in Python
As in the simple example below, invoke the Python library and provide:
- your data source connection configuration details, including environment variables, using one of the listed methods; consult Data source reference for data source-specific connection config
- your Soda Cloud account API key values:
- use cloud.soda.io for EU region
- use cloud.us.soda.io for US region
Use the following guidance for optional elements of a programmatic scan.
- You can save Soda Library scan results anywhere in your system; the
scan_result
object contains all the scan result information. To import Soda Library in Python so you can utilize theScan()
object, install a Soda Library package, then usefrom soda.scan import Scan
. - If you provide a name for the scan definition to identify inline checks in a programmatic scan as independent of other inline checks in a different programmatic scan or pipeline, be sure to set a unique scan definition name for each programmatic scan. Using the same scan definition name in multiple programmatic scans results in confused check results in Soda Cloud.
- If you wish to collect samples of failed rows when a check fails, you can employ a custom sampler; see Configure a failed row sampler.
- Be sure to include any variables in your programmatic scan before the check YAML files. Soda requires the variable input for any variables defined in the check YAML files.
from soda.scan import Scan
scan = Scan()
scan.set_data_source_name("events")
# Add configuration YAML files
#########################
# Choose one of the following to specify data source connection configurations :
# 1) From a file
scan.add_configuration_yaml_file(file_path="~/.soda/my_local_soda_environment.yml")
# 2) Inline in the code
# For host, use cloud.soda.io for EU region; use cloud.us.soda.io for US region
scan.add_configuration_yaml_str(
"""
data_source events:
type: snowflake
host: ${SNOWFLAKE_HOST}
username: ${SNOWFLAKE_USERNAME}
password: ${SNOWFLAKE_PASSWORD}
database: events
schema: public
soda_cloud:
host: cloud.soda.io
api_key_id: 2e0ba0cb-your-api-key-7b
api_key_secret: 5wd-your-api-key-secret-aGuRg
scheme:
"""
)
# Add variables
###############
scan.add_variables({"date": "2022-01-01"})
# Add check YAML files
##################
scan.add_sodacl_yaml_file("./my_programmatic_test_scan/sodacl_file_one.yml")
scan.add_sodacl_yaml_file("./my_programmatic_test_scan/sodacl_file_two.yml")
scan.add_sodacl_yaml_files("./my_scan_dir")
scan.add_sodacl_yaml_files("./my_scan_dir/sodacl_file_three.yml")
# OR
# Define checks using SodaCL
##################
checks = """
checks for cities:
- row_count > 0
"""
# Add template YAML files, if used
##################
scan.add_template_files(template_path)
# Add the checks to the scan
####################
scan.add_sodacl_yaml_str(checks)
# OR Add the checks to scan with virtual filename identifier
# for advanced use cases such as partial/concurrent scans
####################
scan.add_sodacl_yaml_str(
checks
file_name=f"checks-{scan_name}.yml",
)
# Execute the scan
##################
scan.execute()
# Set logs to verbose mode, equivalent to CLI -V option
##################
scan.set_verbose(True)
# Set scan definition name, equivalent to CLI -s option
# The scan definition name MUST be unique to this scan, and
# not duplicated in any other programmatic scan
##################
scan.set_scan_definition_name("YOUR_SCHEDULE_NAME")
# Do not send results to Soda Cloud, equivalent to CLI -l option;
##################
scan.set_is_local(True)
# Inspect the scan result
#########################
scan.get_scan_results()
# Inspect the scan logs
#######################
scan.get_logs_text()
# Typical log inspection
##################
scan.assert_no_error_logs()
scan.assert_no_checks_fail()
# Advanced methods to inspect scan execution logs
#################################################
scan.has_error_logs()
scan.get_error_logs_text()
# Advanced methods to review check results details
########################################
scan.get_checks_fail()
scan.has_check_fails()
scan.get_checks_fail_text()
scan.assert_no_checks_warn_or_fail()
scan.get_checks_warn_or_fail()
scan.has_checks_warn_or_fail()
scan.get_checks_warn_or_fail_text()
scan.get_all_checks_text()
See the Python API Reference page for detailed documentation of the Scan
class in Soda Library.
Tips and best practices
- You can save Soda Library scan results anywhere in your system; the
scan_result
object contains all the scan result information. To import Soda Library in Python so you can utilize theScan()
object, install a Soda Library package, then usefrom soda.scan import Scan
. - Be sure to include any variables in your programmatic scan before the check YAML files. Soda requires the variable input for any variables defined in the check YAML files.
- Because Soda Library pushes scan results to Soda Cloud, you may not want to change the scan definition name with each scan. Soda Cloud uses the scan definition name to correlate subsequent scan results, thus retaining an historical record of the measurements over time.
Sometimes, changing the name is useful, like when you wish to configure a single scan to run in multiple environments. Be aware, however, that if you change the scan definition name with each scan for the same environment, Soda Cloud recognizes each set of scan results as independent from previous scan results, thereby making it appear as though it records a new, separate check result with each scan and archives or “disappears” previous results. See also: Missing check results in Soda Cloud
Scan exit codes
Soda Library’s scan output includes an exit code which indicates the outcome of the scan.
0 | all checks passed, all good from both runtime and Soda perspective |
1 | Soda issues a warning on a check(s) |
2 | Soda issues a failure on a check(s) |
3 | Soda encountered a runtime issue, and was able to submit scan results to Soda Cloud |
4 | Soda encountered a runtime issue, but was unable to submit any results to Soda Cloud |
To obtain the exit code, you can add the following to your programmatic scan.
exit_code = scan.execute()
print(exit_code)
Next
-
Choose a flavor of Soda -
Set up Soda: programmatic - Write SodaCL checks
- Run scans and review results
- Organize, alert, investigate
Need help? Join the Soda community on Slack.
—
Was this documentation helpful?
What could we do to improve this page?
- Suggest a docs change in GitHub.
- Share feedback in the Soda community on Slack.
Documentation always applies to the latest version of Soda products
Last modified on 20-Nov-24