Link Search Menu Expand Document

Run a Soda Library scan

Last modified on 27-Sep-23

× Soda Core, the free, open-source Python library and CLI tool from which Soda Library extends, continues to exist as an OSS project in GitHub, including all Soda Core documentation.

Migrate to Soda Library to connect to Soda Cloud and access all the newest Soda features.

A scan is a command that executes checks to extract information about data in a dataset.

soda scan -d postgres_retail_db  -c configuration.yml checks.yml

A check is a test that Soda Library performs when it scans a dataset in your data source. Soda Library uses the checks you define in the checks YAML file to prepare SQL queries that it runs against the data in a table. Soda Library can execute multiple checks against one or more datasets in a single scan.

Anatomy of a scan command
Variables
Scan output
Examine a scan’s SQL queries
Programmatically use scan output
Configure the same scan to run in multiple environments
Add scan options
Troubleshoot
Go further

Anatomy of a scan command

Each scan requires the following as input:

  • the name of the data source that contains the dataset you wish to scan, identified using the -d option
  • a configuration.yml file, which contains details about how Soda Library can connect to your data source, identified using the -c option
  • a checks.yml file which contains the checks you write using SodaCL

Scan command:

soda scan -d postgres_retail -c configuration.yml checks.yml


Note that you can use the -c option to include multiple configuration YAML files in one scan execution. Include the filepath of each YAML file if you stored them in a directory other than the one in which you installed Soda Library.

soda scan -d postgres_retail -c other-directory/configuration.yml other-directory/checks.yml


You can also include multiple checks YAML files in one scan execution. Use multiple checks YAML files to execute different sets of checks during a single scan.

soda scan -d postgres_retail -c configuration.yml checks_stats1.yml checks_stats2.yml


Use the soda soda scan --help command to review options you can include to customize the scan. See also: Add scan options.

Variables

There are several ways you can use variables in checks, filters, and in your data source configuration to pass values at scan time; a few examples follow.

Refer to the comprehensive Filters and variables documentation for details.

# In-check filter
checks for dim_employee:
  - max(vacation_hours) < 80:
      name: Too many vacation hours for US Sales
      filter: sales_territory_key = 11

# Dataset filter with variables
filter CUSTOMERS [daily]:
  where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'

checks for CUSTOMERS [daily]:
  - row_count = 6
  - missing(cat) = 2

# In-check variable 
checks for ${DATASET}:
  - invalid_count(last_name) = 0:
      valid length: 10 

Scan output

As a result of a scan, each check results in one of three default states:

  • pass: the values in the dataset match or fall within the thresholds you specified
  • fail: the values in the dataset do not match or fall within the thresholds you specified
  • error: the syntax of the check is invalid

A fourth state, warn, is something you can explicitly configure for individual checks. See Add alert configurations.

The scan results appear in your Soda Library command-line interface (CLI) and the latest result appears in the Checks dashboard in the Soda Cloud web application; examples follow.

Optionally, you can add --local option to the scan command to prevent Soda Library from sending check results and any other metadata to Soda Cloud.

Soda Library 1.0.x
Soda Core 3.0.x
Sending failed row samples to Soda Cloud
Scan summary:
6/9 checks PASSED: 
    paxstats in paxstats2
      row_count > 0  [PASSED]
        check_value: 15007
      Look for PII  [PASSED]
      duplicate_percent(id) = 0  [PASSED]
        check_value: 0.0
        row_count: 15007
        duplicate_count: 0
      missing_count(adjusted_passenger_count) = 0  [PASSED]
        check_value: 0
      anomaly score for row_count < default  [PASSED]
        check_value: 0.0
      Schema Check [PASSED]
1/9 checks WARNED: 
    paxstats in paxstats2
      Abnormally large PAX count [WARNED]
        check_value: 659837
2/9 checks FAILED: 
    paxstats in paxstats2
      Validate terminal ID [FAILED]
        check_value: 27
      Verify 2-digit IATA [FAILED]
        check_value: 3
Oops! 2 failure. 1 warning. 0 errors. 6 pass.
Sending results to Soda Cloud
Soda Cloud Trace: 4774***8

check-dashboard

Example output with a check that triggered a warning:

Soda Library 1.0.x
Scan summary:
1/1 check WARNED: 
    CUSTOMERS in postgres_retail
      schema [WARNED]
        missing_column_names = [sombrero]
        schema_measured = [geography_key, customer_alternate_key, title, first_name, last_name ...]
Only 1 warning. 0 failure. 0 errors. 0 pass.

Example output with a check that failed:

Soda Library 1.0.x
Scan summary:
1/1 check FAILED: 
    CUSTOMERS in postgres_retail
      freshness(full_date_alternate_key) < 3d [FAILED]
        max_column_timestamp: 2020-06-24 00:04:10+00:00
        max_column_timestamp_utc: 2020-06-24 00:04:10+00:00
        now_variable_name: NOW
        now_timestamp: 2022-03-10T16:30:12.608845
        now_timestamp_utc: 2022-03-10 16:30:12.608845+00:00
        freshness: 624 days, 16:26:02.608845
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.

Examine a scan’s SQL queries

To examine the SQL queries that Soda Library prepares and executes as part of a scan, you can add the -V option to your soda scan command. This option prints the queries as part of the scan results.

soda scan -d postgres_retail -c configuration.yml -V checks.yml

Programmatically use scan output

Optionally, you can insert the output of Soda Library scans into your data orchestration tool such as Dagster, or Apache Airflow.

You can save Soda Library scan results anywhere in your system; the scan_result object contains all the scan result information. To import the Soda Library library in Python so you can utilize the Scan() object, install a Soda Library package, then use from soda.scan import Scan. Refer to Define programmatic scans and Test data in a pipeline for details.

Configure the same scan to run in multiple environments

When you want to run a scan that executes the same checks on different environments or schemas, such as development, production, and staging, you must apply the following configurations to ensure that Soda Cloud does not incomprehensibly merge the checks results from scans of multiple environments.

  1. In your configuration.yml file, provide separate connection configurations for each environment, as in the following example.
    data_source nyc_dev:
      type: postgres
      connection:
     host: host
     port: '5432'
     username: ${POSTGRES_USER}
     password: ${POSTGRES_PASSWORD}
     database: postgres
     schema: public
    data_source nyc_prod:
      type: postgres
      connection:
     host: host
     port: '5432'
     username: ${POSTGRES_USER}
     password: ${POSTGRES_PASSWORD}
     database: postgres
     schema: public
    
  2. Provide a scan definition name at scan time using the -s option. The scan definition helps Soda Cloud to distinguish different scan contexts and therefore plays a crucial role when the checks.yml file names and the checks themselves are the same.
    # for NYC data source for dev
    soda scan -d nyc_dev -c configuration.yml -s nyc_a checks.yml
    # for NYC data source for prod
    soda scan -d nyc_prod -c configuration.yml -s nyc_b checks.yml
    

See also: Troubleshoot missing check results
See also: Add a check identity

Add scan options

When you run a scan in Soda Library, you can specify some options that modify the scan actions or output. Add one or more of the following options to a soda scan command.

Option Required Description and examples
-c TEXT or
--configuration TEXT
Use this option to specify the file path and file name for the configuration YAML file.
-d TEXT or
--data-source TEXT
Use this option to specify the data source that contains the datasets you wish to scan.
-l or
--local
  Use this option to prevent Soda Library from pushing check results or any other metadata to Soda Cloud.
-s TEXT or
--scan-definition TEXT
  Use this option to provide a scan definition name so that Soda Cloud keeps check results from different environments (dev, prod, staging) separate. See Configure a single scan to run in multiple environments.
-srf or
--scan-results-file TEXT
  Specify the file name and file path to which Soda Library sends a JSON file of the scan results. You can use this in addition to, or instead of, sending results to Soda Cloud.
soda scan -d adventureworks -c configuration.yml -srf test.json checks.yml
-t TEXT or
--data-timestamp TEXT
  Placeholder, only.
-T TEXT or
--template TEXT
Use this option to specify the file path and file name for a templates YAML file.  
-v TEXT or
--variable TEXT
  Replace TEXT with variables you wish to apply to the scan, such as a filter for a date. Put single or double quotes around any value with spaces.
soda scan -d my_datasource -v start=2020-04-12 -c configuration.yml checks.yml
V or
--verbose
  Return scan output in verbose mode to review query details.

Troubleshoot

Problem: When you run a scan, you get an error that reads, Exception while exporting Span batch.

Solution: Without an internet connection, Soda Library is unable to communicate with soda.connect.io to transmit anonymous usage statistics about the software.
If you are using Soda Library offline, you can resolve the issue by setting send_anonymous_usage_stats: false in your configuration.yml file. Refer to Soda Library usage statistics for further details.


Problem: In a Windows environment, you see an error that reads [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (ssl_c:997).

Short-term solution: Use pip install pip-system-certs to temporarily resolve the issue. This install works to resolve the issue only on Windows machines where the Ops team installs all the certificates needed through Group Policy Objects, or similar. However, the fix is short-term because when you try to run this in a pipeline on another machine, the error will reappear.

Short-term solution: Contact your Operations or System Admin team to obtain the proxy certificate.

Go further


Was this documentation helpful?

What could we do to improve this page?

Documentation always applies to the latest version of Soda products
Last modified on 27-Sep-23