
Define tests

A test is a check that Soda SQL performs when it scans a dataset in your data source. Technically, it is a Python expression that, during a Soda SQL scan, checks metrics to see if they match the parameters you defined for a measurement. A single Soda SQL scan runs against a single dataset in your data source, but each scan can run multiple tests against multiple columns.

As a result of a scan, each test either passes or fails. When a test fails, it means that a property of the data in your dataset did not match the test parameters you defined. In other words, any test that returns true during a Soda SQL scan passes; any test that returns false, fails.

The scan results appear in your command-line interface (CLI). The results include an exit code that indicates the test results: 0 means all tests passed; a non-zero value means one or more tests failed. See Scan output in Soda SQL for details.
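
For example, you can check the exit code after a scan to gate a step in a script or pipeline. This is a minimal sketch; warehouse.yml and tables/demodata.yml are placeholder paths for your own warehouse YAML and scan YAML files.

soda scan warehouse.yml tables/demodata.yml
echo $?   # prints 0 if all tests passed; non-zero if one or more tests failed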

In Soda Cloud, you define tests inside monitors. Refer to Create monitors and alerts to learn how to define a test in a monitor using Soda Cloud.

Define tests using metrics
Example tests using built-in metrics
Example tests using custom metrics
Define names for tests
Best practices for defining tests and running scans
Go further

Define tests using metrics

You define your tests in your scan YAML file, which is associated with a specific dataset in your data source. You can write tests using built-in dataset metrics that Soda SQL applies to an entire dataset, or built-in column metrics that Soda SQL applies to individual columns you identify. You can also write tests using custom metrics (also known as SQL metrics) that you can apply to an entire dataset or to individual columns. See example tests that use each built-in metric.

Regardless of where it applies, each test generally comprises three parts:

  • a metric (a property of the data in your data source)
  • a comparison operator
  • a value

For example:

tests:
  - row_count > 0

At times, a test may include only one part, a metric, which simply returns a calculated value. For example, you may wish to write a test that simply returns the calculated sum of the values in a numeric column.

columns:
  commission_paid:
    tests:
      - sum

However, where a test must determine whether data is valid, you must add a fourth element, a column configuration key, to define what qualifies as valid. In the scan YAML file, you define a column configuration key before the test that uses the definition of “valid”.

In the example below, the user defined the valid_format as date_eu, or dd/mm/yyyy, format. The metric invalid_percentage refers to the valid_format configuration key to determine whether the data in the column is valid. Note that valid_format applies only to columns with data type TEXT. Refer to Data types for details.

To see a list of all available column configuration keys, see Column configuration keys.

columns:
    start_date:
        valid_format: date_eu
        tests:
            - invalid_percentage < 2.0


Example tests using built-in metrics

Reference the table below, which corresponds to the following example scan YAML file. Both the id and feepct columns are of data type TEXT, which enables the user to define a valid_format for the contents of the columns. See Valid format values for details.

table_name: demodata
metrics:
  - row_count
  - missing_count
  - missing_percentage
  - values_count
  - ...
tests:
  - row_count > 0
columns:
  id:
    valid_format: uuid
    tests:
      - invalid_percentage == 0
  feepct:
    valid_format: number_percentage
    tests:
      - invalid_percentage == 0

| Built-in metric | Comparison operator | Value | Applies to | Test |
| --------------- | ------------------- | ----- | ---------- | ---- |
| row_count | > | 0 | whole dataset | Checks to see if the dataset has at least one row. If the test fails, it means the dataset has no rows and is empty. |
| invalid_percentage | == | 0 | id column | Checks to see if all rows in the id column contain data in a valid format. If the test fails, it means that more than 0% of the rows contain invalid data, which is data that is not in UUID format. |
| invalid_percentage | == | 0 | feepct column | Checks to see if all rows in the feepct column contain data in a valid format. If the test fails, it means that more than 0% of the rows contain invalid data, which is data that is not a numerical percentage. |

See example tests that use each built-in metric.


Example tests using custom metrics

If the built-in dataset and column metrics that Soda SQL offers do not quite give you the information you need from a scan, you can use custom metrics to customize your queries. Custom metrics enable you to add SQL queries to your scan YAML file so that Soda SQL runs them during a scan. See Custom metrics.

Reference the table below, which corresponds to the following example scan YAML file.

table_name: yourdataset
sql_metrics:
    - sql: |
          SELECT sum(volume) as total_volume_us
          FROM CUSTOMER_TRANSACTIONS
          WHERE country = 'US'
      tests:
          - total_volume_us > 5000

| Custom metric | Comparison operator | Value | Applies to | Test |
| ------------- | ------------------- | ----- | ---------- | ---- |
| total_volume_us | > | 5000 | whole dataset | Checks to see if the sum of all customer transactions in the United States exceeds 5000. If the test fails, it means that the total volume of transactions is 5000 or less. |

Define names for tests

Soda SQL can run both anonymous and named tests. Named tests are useful if you intend to push Soda SQL scan results to your Soda Cloud account, where you can update a test and retain its test history.

Example of an anonymous test

tests:
    - total_volume_us > 5000

Examples of named tests

tests:
    volume_test_max:  total_volume_us > 3000
    volume_test_min:  total_volume_us < 5000

Best practices for defining tests and running scans

  • There is no limit to the number of tests that Soda SQL can run during a scan. However, the volume of tests, the type of metrics you use, and the size of your dataset all affect the duration of a scan. (Testing to determine more precise compute costs per metric is ongoing.) Take these elements into consideration as you decide how to test the quality of your data.
  • Where you need to define tests that execute against facts and dimensions tables, you can use custom metrics to join facts and dimensions using SQL statements to validate your business metrics; see the sketch after this list. In general, write aggregation tests for facts tables and validity tests for data in dimensions tables.
  • If you are using a data orchestration tool to schedule regular Soda scans of your data, consider the frequency of change in your data streams, or how your data pipeline is scheduled. For example, if your data sources or streams provide new data in a batch once per day, schedule a Soda scan once per day after batch processing.
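
As an illustration of the facts-and-dimensions practice above, the sketch below uses a custom metric to join a hypothetical facts table to a hypothetical dimensions table. The fact_sales and dim_region tables, their columns, and the total_emea_sales metric name are assumptions for the example, not objects that Soda SQL provides.

table_name: fact_sales
sql_metrics:
    - sql: |
          SELECT sum(f.amount) as total_emea_sales
          FROM fact_sales f
          JOIN dim_region r ON f.region_id = r.region_id
          WHERE r.region_name = 'EMEA'
      tests:
          - total_emea_sales > 0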

Go further


