Run a Soda Library scan
Last modified on 27-Sep-23
Migrate to Soda Library to connect to Soda Cloud and access all the newest Soda features.
A scan is a command that executes checks to extract information about data in a dataset.
soda scan -d postgres_retail_db -c configuration.yml checks.yml
A check is a test that Soda Library performs when it scans a dataset in your data source. Soda Library uses the checks you define in the checks YAML file to prepare SQL queries that it runs against the data in a table. Soda Library can execute multiple checks against one or more datasets in a single scan.
Anatomy of a scan command
Variables
Scan output
Examine a scan’s SQL queries
Programmatically use scan output
Configure the same scan to run in multiple environments
Add scan options
Troubleshoot
Go further
Anatomy of a scan command
Each scan requires the following as input:
- the name of the data source that contains the dataset you wish to scan, identified using the
-d
option - a
configuration.yml
file, which contains details about how Soda Library can connect to your data source, identified using the-c
option - a
checks.yml
file which contains the checks you write using SodaCL
Scan command:
soda scan -d postgres_retail -c configuration.yml checks.yml
Note that you can use the -c
option to include multiple configuration YAML files in one scan execution. Include the filepath of each YAML file if you stored them in a directory other than the one in which you installed Soda Library.
soda scan -d postgres_retail -c other-directory/configuration.yml other-directory/checks.yml
You can also include multiple checks YAML files in one scan execution. Use multiple checks YAML files to execute different sets of checks during a single scan.
soda scan -d postgres_retail -c configuration.yml checks_stats1.yml checks_stats2.yml
Use the soda soda scan --help
command to review options you can include to customize the scan. See also: Add scan options.
Variables
There are several ways you can use variables in checks, filters, and in your data source configuration to pass values at scan time; a few examples follow.
Refer to the comprehensive Filters and variables documentation for details.
# In-check filter
checks for dim_employee:
- max(vacation_hours) < 80:
name: Too many vacation hours for US Sales
filter: sales_territory_key = 11
# Dataset filter with variables
filter CUSTOMERS [daily]:
where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
checks for CUSTOMERS [daily]:
- row_count = 6
- missing(cat) = 2
# In-check variable
checks for ${DATASET}:
- invalid_count(last_name) = 0:
valid length: 10
Scan output
As a result of a scan, each check results in one of three default states:
- pass: the values in the dataset match or fall within the thresholds you specified
- fail: the values in the dataset do not match or fall within the thresholds you specified
- error: the syntax of the check is invalid
A fourth state, warn, is something you can explicitly configure for individual checks. See Add alert configurations.
The scan results appear in your Soda Library command-line interface (CLI) and the latest result appears in the Checks dashboard in the Soda Cloud web application; examples follow.
Optionally, you can add --local
option to the scan command to prevent Soda Library from sending check results and any other metadata to Soda Cloud.
Soda Library 1.0.x
Soda Core 3.0.x
Sending failed row samples to Soda Cloud
Scan summary:
6/9 checks PASSED:
paxstats in paxstats2
row_count > 0 [PASSED]
check_value: 15007
Look for PII [PASSED]
duplicate_percent(id) = 0 [PASSED]
check_value: 0.0
row_count: 15007
duplicate_count: 0
missing_count(adjusted_passenger_count) = 0 [PASSED]
check_value: 0
anomaly score for row_count < default [PASSED]
check_value: 0.0
Schema Check [PASSED]
1/9 checks WARNED:
paxstats in paxstats2
Abnormally large PAX count [WARNED]
check_value: 659837
2/9 checks FAILED:
paxstats in paxstats2
Validate terminal ID [FAILED]
check_value: 27
Verify 2-digit IATA [FAILED]
check_value: 3
Oops! 2 failure. 1 warning. 0 errors. 6 pass.
Sending results to Soda Cloud
Soda Cloud Trace: 4774***8
Example output with a check that triggered a warning:
Soda Library 1.0.x
Scan summary:
1/1 check WARNED:
CUSTOMERS in postgres_retail
schema [WARNED]
missing_column_names = [sombrero]
schema_measured = [geography_key, customer_alternate_key, title, first_name, last_name ...]
Only 1 warning. 0 failure. 0 errors. 0 pass.
Example output with a check that failed:
Soda Library 1.0.x
Scan summary:
1/1 check FAILED:
CUSTOMERS in postgres_retail
freshness(full_date_alternate_key) < 3d [FAILED]
max_column_timestamp: 2020-06-24 00:04:10+00:00
max_column_timestamp_utc: 2020-06-24 00:04:10+00:00
now_variable_name: NOW
now_timestamp: 2022-03-10T16:30:12.608845
now_timestamp_utc: 2022-03-10 16:30:12.608845+00:00
freshness: 624 days, 16:26:02.608845
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
Examine a scan’s SQL queries
To examine the SQL queries that Soda Library prepares and executes as part of a scan, you can add the -V
option to your soda scan
command. This option prints the queries as part of the scan results.
soda scan -d postgres_retail -c configuration.yml -V checks.yml
Programmatically use scan output
Optionally, you can insert the output of Soda Library scans into your data orchestration tool such as Dagster, or Apache Airflow.
You can save Soda Library scan results anywhere in your system; the scan_result
object contains all the scan result information. To import the Soda Library library in Python so you can utilize the Scan()
object, install a Soda Library package, then use from soda.scan import Scan
. Refer to Define programmatic scans and Test data in a pipeline for details.
Configure the same scan to run in multiple environments
When you want to run a scan that executes the same checks on different environments or schemas, such as development, production, and staging, you must apply the following configurations to ensure that Soda Cloud does not incomprehensibly merge the checks results from scans of multiple environments.
- In your
configuration.yml
file, provide separate connection configurations for each environment, as in the following example.data_source nyc_dev: type: postgres connection: host: host port: '5432' username: ${POSTGRES_USER} password: ${POSTGRES_PASSWORD} database: postgres schema: public data_source nyc_prod: type: postgres connection: host: host port: '5432' username: ${POSTGRES_USER} password: ${POSTGRES_PASSWORD} database: postgres schema: public
- Provide a
scan definition
name at scan time using the-s
option. The scan definition helps Soda Cloud to distinguish different scan contexts and therefore plays a crucial role when thechecks.yml
file names and the checks themselves are the same.# for NYC data source for dev soda scan -d nyc_dev -c configuration.yml -s nyc_a checks.yml # for NYC data source for prod soda scan -d nyc_prod -c configuration.yml -s nyc_b checks.yml
See also: Troubleshoot missing check results
See also: Add a check identity
Add scan options
When you run a scan in Soda Library, you can specify some options that modify the scan actions or output. Add one or more of the following options to a soda scan
command.
Option | Required | Description and examples |
---|---|---|
-c TEXT or--configuration TEXT | ✓ | Use this option to specify the file path and file name for the configuration YAML file. |
-d TEXT or--data-source TEXT | ✓ | Use this option to specify the data source that contains the datasets you wish to scan. |
-l or--local | Use this option to prevent Soda Library from pushing check results or any other metadata to Soda Cloud. | |
-s TEXT or--scan-definition TEXT | Use this option to provide a scan definition name so that Soda Cloud keeps check results from different environments (dev, prod, staging) separate. See Configure a single scan to run in multiple environments. | |
-srf or --scan-results-file TEXT | Specify the file name and file path to which Soda Library sends a JSON file of the scan results. You can use this in addition to, or instead of, sending results to Soda Cloud. soda scan -d adventureworks -c configuration.yml -srf test.json checks.yml | |
-t TEXT or--data-timestamp TEXT | Placeholder, only. | |
-T TEXT or--template TEXT | Use this option to specify the file path and file name for a templates YAML file. | |
-v TEXT or--variable TEXT | Replace TEXT with variables you wish to apply to the scan, such as a filter for a date. Put single or double quotes around any value with spaces. soda scan -d my_datasource -v start=2020-04-12 -c configuration.yml checks.yml | |
V or --verbose | Return scan output in verbose mode to review query details. |
Troubleshoot
Problem: When you run a scan, you get an error that reads, Exception while exporting Span batch.
Solution: Without an internet connection, Soda Library is unable to communicate with soda.connect.io
to transmit anonymous usage statistics about the software.
If you are using Soda Library offline, you can resolve the issue by setting send_anonymous_usage_stats: false
in your configuration.yml
file. Refer to Soda Library usage statistics for further details.
Problem: In a Windows environment, you see an error that reads [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (ssl_c:997)
.
Short-term solution: Use pip install pip-system-certs
to temporarily resolve the issue. This install works to resolve the issue only on Windows machines where the Ops team installs all the certificates needed through Group Policy Objects, or similar. However, the fix is short-term because when you try to run this in a pipeline on another machine, the error will reappear.
Short-term solution: Contact your Operations or System Admin team to obtain the proxy certificate.
Go further
- Consider completing the Quick start for SodaCL to learn how to write more checks for data quality.
- Need help? Join the Soda community on Slack.
Was this documentation helpful?
What could we do to improve this page?
- Suggest a docs change in GitHub.
- Share feedback in the Soda community on Slack.
Documentation always applies to the latest version of Soda products
Last modified on 27-Sep-23