
Invoke Soda Library

Last modified on 26-Apr-24

To automate the search for bad-quality data, you can use Soda Library to programmatically set up and execute scans. Because Soda Library is a Python library, you can invoke it just about anywhere you need it; the invocation instructions below offer a very simple example to extrapolate from. Consult the Use case guides for more examples of how to programmatically run Soda scans for data quality.

Alternatively, you can install and use the Soda Library CLI to run scans; see Install Soda Library.

As a step in the Get started roadmap, this guide offers instructions to set up, install, and configure Soda in a programmatic deployment model.

Get started roadmap

  1. Choose a flavor of Soda
  2. Set up Soda: programmatic 📍 You are here!
         a. Review requirements
         b. Create a Soda Cloud account
         c. Set up basic programmatic invocation in Python
  3. Write SodaCL checks
  4. Run scans and review results
  5. Organize, alert, investigate


Requirements

To use Soda Library, you need the following.

  • Python 3.8 or greater
  • Pip 21.0 or greater
  • A Soda Cloud account; see next section.
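
The version requirements above can be checked from Python before you install anything else. The following is a minimal sketch, not part of Soda Library: the helper names `version_tuple`, `meets_minimum`, and `check_requirements` are hypothetical, and the sketch assumes pip is installed in the same environment as the interpreter that runs it.

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the requirements list above.
MIN_PYTHON = (3, 8)
MIN_PIP = (21, 0)


def version_tuple(version: str) -> tuple:
    """Convert a version string such as '21.3.1' to (21, 3) for comparison."""
    parts = []
    for piece in version.split(".")[:2]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 2:
        parts.append(0)  # pad '24' to (24, 0)
    return tuple(parts)


def meets_minimum(found, minimum: tuple) -> bool:
    """Compare the (major, minor) part of a version against a minimum."""
    return tuple(found[:2]) >= minimum


def check_requirements() -> dict:
    """Report whether the running Python and the installed pip are recent enough."""
    results = {"python": meets_minimum(sys.version_info, MIN_PYTHON)}
    try:
        pip_version = version_tuple(metadata.version("pip"))
        results["pip"] = meets_minimum(pip_version, MIN_PIP)
    except metadata.PackageNotFoundError:
        results["pip"] = False  # pip is not installed in this environment
    return results


if __name__ == "__main__":
    print(check_requirements())
```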

Create a Soda Cloud account

  1. In a browser, navigate to cloud.soda.io/signup to create a new Soda account, which is free for a 45-day trial. If you already have a Soda account, log in.
  2. Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys.
  3. Copy and paste the API key values to a temporary, secure place in your local environment.
Why do I need a Soda Cloud account? To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.

Set up basic programmatic invocation in Python

As in the simple example below, invoke the Python library and provide:

  • your data source connection configuration details, including environment variables, using one of the listed methods; consult Data source reference for data source-specific connection config
  • your Soda Cloud account API key values, along with the Soda Cloud host for your region:
    • use cloud.soda.io for EU region
    • use cloud.us.soda.io for US region
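
One way to keep the region choice and the API keys out of your source code is to build the `soda_cloud` configuration from a small helper. The following is a sketch under stated assumptions: the function `soda_cloud_config` and the environment variable names `SODA_API_KEY_ID` and `SODA_API_KEY_SECRET` are hypothetical, and the `${...}` placeholders assume that Soda resolves environment variables in this section the same way as in the data source section of the example on this page.

```python
# Hosts per region, as listed above.
SODA_CLOUD_HOSTS = {
    "eu": "cloud.soda.io",
    "us": "cloud.us.soda.io",
}


def soda_cloud_config(region: str) -> str:
    """Return a soda_cloud YAML snippet for the given region ('eu' or 'us')."""
    try:
        host = SODA_CLOUD_HOSTS[region.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown region {region!r}; expected one of {sorted(SODA_CLOUD_HOSTS)}"
        ) from None
    # The ${...} placeholders are left as literal text for later resolution;
    # the variable names are illustrative, not defined by Soda.
    return (
        "soda_cloud:\n"
        f"  host: {host}\n"
        "  api_key_id: ${SODA_API_KEY_ID}\n"
        "  api_key_secret: ${SODA_API_KEY_SECRET}\n"
    )


if __name__ == "__main__":
    print(soda_cloud_config("us"))
```

You could then pass the returned string to `scan.add_configuration_yaml_str()`.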

Use the following guidance for optional elements of a programmatic scan.

  • You can save Soda Library scan results anywhere in your system; the scan_result object contains all the scan result information. To import Soda Library in Python so you can utilize the Scan() object, install a Soda Library package, then use from soda.scan import Scan.
  • If you wish to collect samples of failed rows when a check fails, you can employ a custom sampler; see Configure a failed row sampler.
  • Be sure to include any variables in your programmatic scan before the check YAML files. Soda requires the variable input for any variables defined in the check YAML files.
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("events")

# Add configuration YAML files
#########################
# Choose one of the following to specify data source connection configurations:
# 1) From a file
scan.add_configuration_yaml_file(file_path="~/.soda/my_local_soda_environment.yml")
# 2) Inline in the code
# For host, use cloud.soda.io for EU region; use cloud.us.soda.io for US region
scan.add_configuration_yaml_str(
    """
    data_source events:
      type: snowflake
      host: ${SNOWFLAKE_HOST}
      username: ${SNOWFLAKE_USERNAME}
      password: ${SNOWFLAKE_PASSWORD}
      database: events
      schema: public

    soda_cloud:
      host: cloud.soda.io
      api_key_id: 2e0ba0cb-your-api-key-7b
      api_key_secret: 5wd-your-api-key-secret-aGuRg
      scheme:
"""
)

# Add variables
###############
scan.add_variables({"date": "2022-01-01"})


# Add check YAML files
##################
scan.add_sodacl_yaml_file("./my_programmatic_test_scan/sodacl_file_one.yml")
scan.add_sodacl_yaml_file("./my_programmatic_test_scan/sodacl_file_two.yml")
scan.add_sodacl_yaml_files("./my_scan_dir")
scan.add_sodacl_yaml_files("./my_scan_dir/sodacl_file_three.yml")

# OR

# Define checks using SodaCL
##################
checks = """
checks for cities:
    - row_count > 0
"""

# Add the checks to the scan
####################
scan.add_sodacl_yaml_str(checks)

# OR Add the checks to scan with virtual filename identifier
# for advanced use cases such as partial/concurrent scans;
# scan_name is a placeholder for a name that you define
####################
scan_name = "my_scan"
scan.add_sodacl_yaml_str(
    checks,
    file_name=f"checks-{scan_name}.yml",
)

# Set logs to verbose mode, equivalent to CLI -V option;
# set this before you execute the scan
##################
scan.set_verbose(True)

# Set scan definition name, equivalent to CLI -s option;
# see Tips and best practices below
##################
scan.set_scan_definition_name("YOUR_SCHEDULE_NAME")

# Execute the scan
##################
scan.execute()


# Inspect the scan result
#########################
scan.get_scan_results()

# Inspect the scan logs
#######################
scan.get_logs_text()

# Typical log inspection
##################
scan.assert_no_error_logs()
scan.assert_no_checks_fail()

# Advanced methods to inspect scan execution logs
#################################################
scan.has_error_logs()
scan.get_error_logs_text()

# Advanced methods to review check results details
########################################
scan.get_checks_fail()
scan.has_check_fails()
scan.get_checks_fail_text()
scan.assert_no_checks_warn_or_fail()
scan.get_checks_warn_or_fail()
scan.has_checks_warn_or_fail()
scan.get_checks_warn_or_fail_text()
scan.get_all_checks_text()
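
Because you can save scan results anywhere in your system, one simple option is to write them to a JSON file after the scan completes. The following is a minimal sketch, not a Soda Library feature: the helper `save_scan_results` and the output path are hypothetical, and it assumes that `scan.get_scan_results()` returns a dictionary; any value that JSON cannot represent is written as a string.

```python
import json
from pathlib import Path


def save_scan_results(scan_results: dict, path: str) -> Path:
    """Write scan results to a JSON file, creating parent folders as needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        # default=str is a fallback for values such as datetimes.
        json.dump(scan_results, handle, indent=2, default=str)
    return target
```

For example, after `scan.execute()`, you might call `save_scan_results(scan.get_scan_results(), "./scan_results/latest.json")`.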

Tips and best practices

  • Because Soda Library pushes scan results to Soda Cloud, you may not want to change the scan definition name with each scan. Soda Cloud uses the scan definition name to correlate subsequent scan results, thus retaining a historical record of the measurements over time.
    Sometimes, changing the name is useful, like when you wish to configure a single scan to run in multiple environments. Be aware, however, that if you change the scan definition name with each scan for the same environment, Soda Cloud treats each set of scan results as independent from previous scan results. It then appears to record a new, separate check result with each scan and to archive, or “disappear”, previous results. See also: Missing check results in Soda Cloud
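
To keep the scan definition name stable for each environment, you can derive it from values that do not change between runs. The following is a sketch only: the helper `scan_definition_name` and its `pipeline` and `environment` parameters are hypothetical, and the naming scheme is an illustration, not a Soda convention.

```python
import re


def scan_definition_name(pipeline: str, environment: str) -> str:
    """Build a name that is identical on every run for the same inputs.

    Deliberately excludes timestamps and run IDs, which would make
    each scan look like a new, unrelated scan definition.
    """
    raw = f"{pipeline}_{environment}".lower()
    return re.sub(r"[^a-z0-9]+", "_", raw).strip("_")


if __name__ == "__main__":
    print(scan_definition_name("Orders nightly", "prod"))
```

You could pass the result to `scan.set_scan_definition_name()` before you execute the scan.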

Scan exit codes

Soda Library’s scan output includes an exit code that indicates the outcome of the scan.

0 all checks passed, all good from both runtime and Soda perspective
1 Soda issues a warning on one or more checks
2 Soda issues a failure on one or more checks
3 Soda encountered a runtime issue, and was able to submit scan results to Soda Cloud
4 Soda encountered a runtime issue, but was unable to submit any results to Soda Cloud

To obtain the exit code, you can add the following to your programmatic scan.

exit_code = scan.execute()
print(exit_code)
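
In a pipeline, you may want to translate the exit code into a decision about whether to continue. The following is a minimal sketch based on the exit codes listed in this section: the names `describe_exit_code` and `should_block_pipeline`, and the `fail_on_warning` option, are hypothetical and not part of Soda Library.

```python
# Descriptions follow the exit code list in this section.
EXIT_CODES = {
    0: "all checks passed",
    1: "Soda issued a warning on one or more checks",
    2: "Soda issued a failure on one or more checks",
    3: "runtime issue; scan results were submitted to Soda Cloud",
    4: "runtime issue; no results were submitted to Soda Cloud",
}


def describe_exit_code(exit_code: int) -> str:
    """Return a readable description of a scan exit code."""
    return EXIT_CODES.get(exit_code, f"unrecognized exit code {exit_code}")


def should_block_pipeline(exit_code: int, fail_on_warning: bool = False) -> bool:
    """Decide whether downstream steps should stop after a scan."""
    if exit_code == 0:
        return False
    if exit_code == 1:
        return fail_on_warning  # warnings block only if you opt in
    return True  # failures, runtime issues, and unknown codes block
```

For example, `should_block_pipeline(scan.execute())` returns `True` when a check fails or when Soda encounters a runtime issue.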

Next

  1. Choose a flavor of Soda
  2. Set up Soda: programmatic
  3. Write SodaCL checks
  4. Run scans and review results
  5. Organize, alert, investigate

Need help? Join the Soda community on Slack.

Documentation always applies to the latest version of Soda products