Link Search Menu Expand Document

Scans

A scan is a command that executes tests to extract information about data in a dataset.

Soda SQL uses the input in the scan YAML file and Soda Cloud monitors to prepare SQL queries that it runs against the data in a dataset. All tests return true or false; if true, the test passed and you know your data is sound; if false, the test fails which means the scan discovered data that falls outside the expected or acceptable parameters you defined in your test.

Run a scan in Soda SQL
Scan output in Soda SQL
Schedule a scan in Soda Cloud
Scan output in Soda Cloud
Programmatically use scan output
Go further

Run a scan in Soda SQL

When you run a scan, Soda SQL uses the configurations in your scan YAML file and Soda Cloud monitors to prepare, then run SQL queries against data in your data source. The default tests and metrics Soda SQL configured when it created the YAML file focus on finding missing, invalid, or unexpected data in your datasets.

Each scan requires the following as input:

  • a warehouse YAML file, which represents a connection to your data source
  • a scan YAML file, including its filepath, which contains the metric and test instructions that Soda SQL uses to scan datasets in your data source

Example command

$ soda scan warehouse.yml tables/demodata.yml

When Soda SQL runs a scan, it performs the following actions:

  • fetches column metadata (column name, type, and nullable)
  • executes a single aggregation query that computes aggregate metrics for multiple columns, such as missing, min, or max
  • for each column, executes:
    • a query for distinct_count, unique_count, and valid_count
    • a query for mins (list of smallest values)
    • a query for maxs (list of greatest values)
    • a query for frequent_values
    • a query for histograms

To allow some databases, such as Snowflake, to cache scan results, the column queries use the same Column Table Expression (CTE). This practice aims to improve overall scan performance.

To test specific portions of data, such as data pertaining to a specific date, you can apply dynamic filters when you scan data in your warehouse. See Apply filters for instructions. Further, use the soda scan --help command to review optional parameters you can include to customize the scan.

Scan output in Soda SQL

By default, the output of a Soda SQL scan appears in your command-line interface. In the example below, Soda SQL executed three tests and all the tests passed. The Exit code is a process code: 0 indicates success with no test failures; a non-zero number indicates failures.

  | < 200 {}
  | 54 measurements computed
  | 3 tests executed
  | All is good. No tests failed.
  | Exiting with code 0

In the following example, some of the tests Soda SQL executed failed, as indicated by Exiting with code 2. The scan output indicates which tests failed and why, so that you can investigate and resolve the issues in the data source.

  | < 200 {}
  | 304 measurements computed
  | 8 tests executed
  | 2 of 8 tests failed:
  |   Test column(EMAIL) test(missing_count == 0) failed with metric values {"missing_count": 33}
  |   Test column(CREDIT_CARD_NUMBER) test(invalid_percentage == 0) failed with metric values {"invalid_percentage": 28.4}
  | Exiting with code 2 

Schedule a scan in Soda Cloud

When you connect a data source to your Soda Cloud account, the guided steps ask that you define a schedule for scans of your data. See Import settings for more information about setting a scan schedule. Note, you cannot run an ad hoc scan directly from Soda Cloud.

You can also define scan schedules for individual datasets. For example, you can specify a more frequent scan schedule for a dataset that changes often. Learn more about adjusting a dataset scan schedule.

Scan output in Soda Cloud

Whether you defined your tests in your scan YAML file for Soda SQL or in a monitor in Soda Cloud, in the Soda Cloud web user interface, all test results manifest as monitor results. Log in to view the Monitors dashboard; each row in the Monitor Results table represents the result of a test, and the icon indicates whether the test passed or failed.

monitor-results

Soda Cloud uses Soda SQL in the background to run scheduled scans. Soda SQL uses a secure API to connect to Soda Cloud. When it completes a scan, Soda SQL:

  1. pushes the results of any tests you configured in the scan YAML file to Soda Cloud
  2. fetches tests associated with any monitors you created in Soda Cloud, then executes the tests and pushes the test results to Soda Cloud

scan-with-cloud

Programmatically use scan output

Optionally, you can insert the output of Soda SQL scans into your data orchestration tool such as Dagster, or Apache Airflow. You can save Soda SQL scan results anywhere in your system; the scan_result object contains all the scan result information. See Configure programmatic scans for details.

Further, in your orchestration tool, you can use Soda SQL scan results to block the data pipeline if it encounters bad data, or to run in parallel to surface issues with your data. Learn more about orchestrating scans.

Go further



Last modified on 16-Jul-21

Was this documentation helpful?
Give us your feedback in the #soda-docs channel in the Soda community on Slack or open an issue in GitHub.