Run a scan and view results

Soda uses your checks and the data source connection configurations to prepare a scan that it runs against the data in your datasets to extract metadata and gauge data quality.

A check is a test that Soda performs when it scans a dataset in your data source. Soda uses the checks you defined as no-code checks in Soda Cloud, or wrote in a checks YAML file, to prepare SQL queries that it runs against the data in a dataset. Soda can execute multiple checks against one or more datasets in a single scan.

As a step in the Get started roadmap, this guide offers instructions to schedule a Soda scan, run a scan, or invoke a scan programmatically.

Get started roadmap

  1. Choose a flavor of Soda

  2. Set up Soda: install, deploy, or invoke

  3. Write SodaCL checks

  4. Run scans and review results 📍 You are here!

     a. Scan for data quality

     b. View scan results

  5. Organize, alert, investigate

Scan for data quality

Set a scan definition in a no-code check

✖️ Requires Soda Core Scientific ✖️ Requires Soda Core ✖️ Requires Soda Library + Soda Cloud ✔️ Requires Soda Agent + Soda Cloud


When you create a no-code check in Soda Cloud, one of the required fields asks that you associate the check with an existing scan definition, or that you create a new scan definition.

If you wish to change a no-code check's existing scan definition:

  1. As a user with permission to do so, navigate to the dataset in which the no-code check exists.

  2. From the dataset's page, locate the check you wish to adjust, and click the stacked dots at right, then select Edit Check. You can only edit a check via the no-code interface if it was first created as a no-code check, as indicated by the cloud icon in the Origin column of the table of checks.

  3. Adjust the value in the Add to Scan Definition field as needed, then save. Soda executes the check during the next scan according to the definition you selected.

If you wish to schedule a new scan to execute a no-code check more or less frequently, or at a different time of day:

  1. From the dataset's page, locate the check you wish to adjust and click the stacked dots at right, then select Edit Check. You can only edit a check via the no-code interface if it was first created as a no-code check, as indicated by the cloud icon in the Origin column of the table of checks.

  2. Use the dropdown in the Add to Scan Definition field to access the link to create a new scan definition.

  3. Fill out the form to define your new scan definition, then save it. Save the change to your no-code check. Soda executes the check during the next scan according to your new definition.

Set a scan definition in an agreement

✖️ Requires Soda Core Scientific ✖️ Requires Soda Core ✖️ Requires Soda Library + Soda Cloud ✔️ Requires Soda Agent + Soda Cloud


When you create a Soda Agreement in Soda Cloud, the last step in the flow demands that you select a scan definition. The scan definition indicates which Soda Agent to use to execute the scan, on which data source, and when. Effectively, a scan definition defines the what, when, and where to run a scheduled scan.

If you wish to change an agreement's existing scan definition:

  1. Navigate to Agreements, then click the stacked dots next to the agreement you wish to change and select Edit Agreement.

  2. In the Set a Scan schedule tab, use the dropdown menu to select a different scan definition.

  3. Save your change. Editing the agreement triggers a new approval request to all stakeholders. Your revised agreement does not run again until all stakeholders have approved it.

If you wish to schedule a new scan to execute the checks in an agreement more or less frequently, or at a different time of day:

  1. Navigate to Agreements, then click the stacked dots next to the agreement you wish to change and select Edit Agreement.

  2. In the Set a Scan schedule tab, click the new Scan Definition link and populate the fields.

  3. Save your change. Editing the agreement triggers a new approval request to all stakeholders. Your revised agreement does not run again until all stakeholders have approved it.

Troubleshoot

Problem: When running a programmatic scan or a scan from the command-line, I get an error that reads Error while executing Soda Cloud command response code: 400.

Solution: While there may be several reasons Soda returns a 400 error, you can address the following which may resolve the issue:

  • Upgrade to the latest version of Soda Library.

  • Confirm that all the checks in your checks YAML file identify a dataset against which to execute. For example, the following syntax yields a 400 error because the checks: key does not identify a dataset.

checks:
    - schema:
        warn:
            when schema changes: any
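
One way to catch this problem before a scan runs is to inspect the top-level keys of a checks file. The following is a minimal sketch of a hypothetical pre-flight helper; it is not part of Soda. It assumes that dataset-scoped checks begin with a header of the form checks for <dataset>: and it uses a simple regular expression rather than a full YAML parser. The function name and the dataset name dim_customer are illustrative assumptions.

```python
import re

# Matches a top-level "checks:" key that names no dataset.
# Indented lines are ignored because the pattern is anchored at column 0.
BARE_CHECKS = re.compile(r"^checks\s*:\s*(#.*)?$")


def find_bare_check_headers(checks_yaml: str) -> list[int]:
    """Return the 1-based line numbers of top-level `checks:` keys
    that do not identify a dataset (hypothetical helper)."""
    return [
        number
        for number, line in enumerate(checks_yaml.splitlines(), start=1)
        if BARE_CHECKS.match(line)
    ]


# The snippet from this guide: the header names no dataset.
problem_file = (
    "checks:\n"
    "    - schema:\n"
    "        warn:\n"
    "            when schema changes: any\n"
)

# The same check with a header that names a dataset
# (dim_customer is a placeholder dataset name).
fixed_file = problem_file.replace("checks:", "checks for dim_customer:", 1)

print(find_bare_check_headers(problem_file))
print(find_bare_check_headers(fixed_file))
```

A helper like this only flags the header pattern described above; it does not validate the rest of the SodaCL syntax.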

View scan results

Soda Cloud displays the latest status of all of your checks in the Checks dashboard. There are two ways in which a check and its latest result appear on the dashboard.

  • When you define checks in a checks YAML file and use Soda Library to run a scan, the checks and their latest results manifest in the Checks dashboard in Soda Cloud.

  • Any time Soda Cloud runs a scheduled scan of your data as part of an agreement, it displays the checks and their latest results in the Checks dashboard.

As a result of a scan, each check results in one of three default states:

  • pass: the values in the dataset match or fall within the thresholds you specified

  • fail: the values in the dataset do not match, or fall outside, the thresholds you specified

  • error: the syntax of the check is invalid

A fourth state, warn, is something you can explicitly configure for individual checks. See Add alert configurations.

The scan results appear in your Soda Library command-line interface (CLI) and the latest result appears in the Checks dashboard in the Soda Cloud web application; examples follow.

Optionally, you can add the --local option to the scan command to prevent Soda Library from sending check results and any other metadata to Soda Cloud.

Soda Library 1.0.x
Soda Core 3.0.x
Sending failed row samples to Soda Cloud
Scan summary:
6/9 checks PASSED: 
    paxstats in paxstats2
      row_count > 0  [PASSED]
        check_value: 15007
      Look for PII  [PASSED]
      duplicate_percent(id) = 0  [PASSED]
        check_value: 0.0
        row_count: 15007
        duplicate_count: 0
      missing_count(adjusted_passenger_count) = 0  [PASSED]
        check_value: 0
      anomaly detection for row_count  [PASSED]
        check_value: 0.0
      Schema Check [PASSED]
1/9 checks WARNED: 
    paxstats in paxstats2
      Abnormally large PAX count [WARNED]
        check_value: 659837
2/9 checks FAILED: 
    paxstats in paxstats2
      Validate terminal ID [FAILED]
        check_value: 27
      Verify 2-digit IATA [FAILED]
        check_value: 3
Oops! 2 failure. 1 warning. 0 errors. 6 pass.
Sending results to Soda Cloud
Soda Cloud Trace: 4774***8
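
If you capture this CLI output in a script, you may want the per-outcome counts as data. The sketch below is one possible approach, assuming the summary headers keep the N/M checks OUTCOME: shape shown in the example output above; the format may differ between Soda versions, and the function name is a hypothetical one chosen for illustration.

```python
import re

# Matches summary headers such as "6/9 checks PASSED:".
SUMMARY_HEADER = re.compile(r"^(\d+)/(\d+) checks (\w+):")


def count_outcomes(cli_output: str) -> dict[str, int]:
    """Collect the number of checks per outcome from scan CLI output
    (hypothetical helper; relies on the header format shown above)."""
    counts: dict[str, int] = {}
    for line in cli_output.splitlines():
        match = SUMMARY_HEADER.match(line.strip())
        if match:
            counts[match.group(3).lower()] = int(match.group(1))
    return counts


# An excerpt of the example output from this guide.
sample_output = """Scan summary:
6/9 checks PASSED: 
    paxstats in paxstats2
      row_count > 0  [PASSED]
1/9 checks WARNED: 
    paxstats in paxstats2
      Abnormally large PAX count [WARNED]
2/9 checks FAILED: 
    paxstats in paxstats2
      Validate terminal ID [FAILED]
"""

print(count_outcomes(sample_output))
```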

Scan failed

Check results indicate whether a check passed, warned, or failed during the scan. However, if a scan itself failed to complete successfully, Soda Cloud displays a warning in the Datasets dashboard to indicate the dataset for which a scheduled scan has failed.

See Manage scheduled scans for instructions on how to set up scan failure alerts.

Examine scan logs

When you notice or receive a notification about a scan failure or delay, you can access the scan’s logs to investigate what is causing the issue.

  1. Log in to your Soda Cloud account, then navigate to Scans, and access the Agents tab.

  2. From the list of scan definitions, select the one that failed or timed out.

  3. On the scan definition's page, in the list of scan results, locate the one that failed or timed out, then click the stacked dots to its right and select Scan Logs.

  4. Review the scan log, using the filter to show only warnings or errors if you wish, or download the log file for external analysis.

Alternatively, you can access the scan logs from within an agreement.

  1. To examine a detailed scan log of the latest scan for an agreement, navigate to Agreements, then click to select an agreement.

  2. In the Agreement dashboard, click See results in the Last scan tile, then click the Scan Logs tab.

Examine a scan's SQL queries in the command-line output

To examine the SQL queries that Soda Library prepares and executes as part of a scan, you can add the -V option to your soda scan command. This option prints the queries as part of the scan results.

soda scan -d postgres_retail -c configuration.yml -V checks.yml
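
If you launch scans from a script rather than by hand, you can assemble the same command in code and capture its output. This is a minimal sketch, assuming the soda executable is on your PATH and using only the -d, -c, -V, and --local options mentioned in this guide; the function names are hypothetical.

```python
import subprocess


def build_scan_command(data_source, configuration, checks_files,
                       verbose=False, local=False):
    """Assemble a `soda scan` command as an argument list (hypothetical helper)."""
    command = ["soda", "scan", "-d", data_source, "-c", configuration]
    if verbose:
        command.append("-V")       # print the SQL queries with the scan results
    if local:
        command.append("--local")  # do not send results to Soda Cloud
    command.extend(checks_files)
    return command


def run_scan(command):
    """Run the command and return the completed process with captured text output.
    check=False because a non-zero exit code can simply mean that checks failed."""
    return subprocess.run(command, capture_output=True, text=True, check=False)


command = build_scan_command(
    "postgres_retail", "configuration.yml", ["checks.yml"], verbose=True
)
print(" ".join(command))
```

Passing the command as a list rather than a single string avoids shell quoting problems when file paths contain spaces.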

Programmatically use scan output

Optionally, you can feed the output of Soda Library scans into your data orchestration tool, such as Dagster or Apache Airflow.

You can save Soda Library scan results anywhere in your system; the scan_result object contains all the scan result information. To use the Scan() object in Python, install a Soda Library package, then import it with from soda.scan import Scan. Refer to Define programmatic scans and Test data in an Airflow pipeline for details.
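
As a rough illustration of how scan output might gate a pipeline step, consider the sketch below. It assumes a Soda Library package is installed and that the Scan methods it calls behave as described in Define programmatic scans. The exit-code meanings in the comments (0 pass, 1 warn, 2 fail, 3 or higher for runtime errors) are an assumption to verify against your installed version, and the helper names and arguments are hypothetical.

```python
def run_programmatic_scan(data_source, configuration_path, checks_path,
                          definition_name):
    """Run a scan and return (exit_code, scan_results)."""
    # Imported inside the function so that this module still loads
    # in environments where Soda Library is not installed.
    from soda.scan import Scan

    scan = Scan()
    scan.set_data_source_name(data_source)
    scan.add_configuration_yaml_file(file_path=configuration_path)
    scan.add_sodacl_yaml_file(checks_path)
    scan.set_scan_definition_name(definition_name)
    exit_code = scan.execute()
    return exit_code, scan.get_scan_results()


def should_halt_pipeline(exit_code, halt_on_warn=False):
    """Decide whether downstream tasks should stop (hypothetical policy).

    Assumed exit codes: 0 = all checks passed, 1 = at least one warning,
    2 = at least one failure, 3 or higher = runtime error.
    """
    if exit_code >= 2:
        return True
    return halt_on_warn and exit_code == 1


# Example policy decisions, without running a scan:
print(should_halt_pipeline(1))                     # warnings tolerated by default
print(should_halt_pipeline(1, halt_on_warn=True))  # stricter policy
print(should_halt_pipeline(2))                     # failures always halt
```

In an orchestration task, you would call run_programmatic_scan with your own data source name and file paths, then raise an exception when should_halt_pipeline returns True so that the tool marks the task as failed.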

Next

  1. Choose a flavor of Soda

  2. Set up Soda: install, deploy, or invoke

  3. Write SodaCL checks

  4. Run scans and review results

Need help? Join the Soda community on Slack.
