A test is a check that Soda SQL performs when it scans a dataset in your data source. Technically, it is a Python expression that, during a Soda SQL scan, checks metrics to see if they match the parameters you defined for a measurement. A single Soda SQL scan runs against a single dataset in your data source, but each scan can run multiple tests against multiple columns.
As a result of a scan, each test either passes or fails. When a test fails, it means that a property of the data in your dataset did not match the test parameters you defined. In other words, any test that returns true during a Soda SQL scan passes; any test that returns false fails.
The scan results appear in your command-line interface (CLI). The results include an exit code that indicates the outcome of the tests: 0 means all tests passed; a non-zero value means one or more tests failed. See Scan output in Soda SQL for details.
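The pass/fail mechanics described above can be sketched in plain Python. This is an illustration only, not Soda SQL's actual API: the `measurements` values and the `evaluate` helper are hypothetical.

```python
# Hypothetical measurements produced by a scan of one dataset
measurements = {"row_count": 1250, "invalid_percentage": 0.0}

def evaluate(expression, metrics):
    # A test is essentially a Python expression evaluated against
    # the measured metrics: True means pass, False means fail.
    return bool(eval(expression, {}, dict(metrics)))

results = [
    evaluate("row_count > 0", measurements),
    evaluate("invalid_percentage == 0", measurements),
]

# Mirror the CLI's exit-code convention: 0 when every test
# passes, non-zero when at least one test fails.
exit_code = 0 if all(results) else 1
print(exit_code)  # → 0
```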
Soda Cloud defines tests inside monitors. Refer to Create monitors and alerts to learn how to define a test in a monitor using Soda Cloud.
You define your tests in your scan YAML file which is associated with a specific dataset in your data source. You can write tests using built-in dataset metrics that Soda SQL applies to an entire dataset, or built-in column metrics that Soda SQL applies to individual columns you identify. You can also write tests using custom metrics (also known as SQL metrics) that you can apply to an entire dataset or to individual columns. See example tests that use each built-in metric.
Regardless of where it applies, each test generally consists of three parts:
- a metric (a property of the data in your data source)
- a comparison operator
- a value
```yaml
tests:
  - row_count > 0
```
At times, a test may only include one part, a metric, that simply returns a calculated value. For example, you may wish to write a test that simply returns the calculated sum of the values in a numeric column.
```yaml
columns:
  commission_paid:
    tests:
      - sum
```
However, where a test must determine whether or not data is valid, you must add a fourth element, a column configuration key to define what qualifies as valid. In the scan YAML file, you define a column configuration key before the test that will use the definition of “valid”.
In the example below, the user defined the valid_format column configuration key as date_eu, or dd/mm/yyyy format. The invalid_percentage metric refers to the valid_format configuration key to determine whether the data in the column is valid. Note that valid_format applies only to columns with data type TEXT. Refer to Data types for details.
To see a list of all available column configuration keys, see Column configuration keys.
```yaml
columns:
  start_date:
    valid_format: date_eu
    tests:
      - invalid_percentage < 2.0
```
Reference the table below, which corresponds to the following example scan YAML file. Both the id and feepct columns are of data type TEXT, which enables the user to define a valid_format for the contents of the columns. See Valid format values for details.
```yaml
table_name: demodata
metrics:
  - row_count
  - missing_count
  - missing_percentage
  - values_count
  - ...
tests:
  - row_count > 0
columns:
  id:
    valid_format: uuid
    tests:
      - invalid_percentage == 0
  feepct:
    valid_format: number_percentage
    tests:
      - invalid_percentage == 0
```
| Built-in metric | Comparison operator | Value | Applies to | Test |
| --- | --- | --- | --- | --- |
| `row_count` | `>` | `0` | whole dataset | Checks to see if the dataset has at least one row. If the test fails, it means the dataset has no rows, which means that the dataset is empty. |
| `invalid_percentage` | `==` | `0` | `id` column | Checks to see if all rows in the `id` column contain data in a valid format. If the test fails, it means that more than 0% of the rows contain invalid data, which is data that is in non-UUID format. |
| `invalid_percentage` | `==` | `0` | `feepct` column | Checks to see if all rows in the `feepct` column contain data in a valid format. If the test fails, it means that more than 0% of the rows contain invalid data, which is data that is in non-number_percentage format. |
See example tests that use each built-in metric.
If the built-in dataset and column metrics that Soda SQL offers do not quite give you the information you need from a scan, you can use custom metrics to customize your queries. Custom metrics essentially enable you to add SQL queries to your scan YAML file so that Soda SQL runs them during a scan. See Custom metrics.
Reference the table below which corresponds to the following example scan YAML file.
```yaml
table_name: yourdataset
sql_metrics:
  - sql: |
      SELECT sum(volume) as total_volume_us
      FROM CUSTOMER_TRANSACTIONS
      WHERE country = 'US'
    tests:
      - total_volume_us > 5000
```
| Custom metric | Comparison operator | Value | Applies to | Test |
| --- | --- | --- | --- | --- |
| `total_volume_us` | `>` | `5000` | whole dataset | Checks to see if the sum of all customer transactions in the United States exceeds 5000. |
Soda SQL can run both anonymous and named tests. If you intend to push Soda SQL scan results to your Soda Cloud account, a named test appears in the Monitor Results table with the title you gave it; anonymous tests also appear, of course, but their names default to the test expression.
Example of an anonymous test:

```yaml
valid_format: number_percentage
tests:
  - invalid_percentage == 0
```
Example of a named test:

```yaml
valid_format: number_percentage
tests:
  - name: inval_percent
    expression: invalid_percentage == 0
    title: Invalid Percentage
```
- Where you need to define tests that execute against fact and dimension tables, you can use custom metrics to join facts and dimensions with SQL statements to validate your business metrics. In general, write aggregation tests for fact tables and validity tests for data in dimension tables.
- There is no limit to the number of tests that Soda SQL can run during a scan.
- If you are using a data orchestration tool to schedule regular Soda scans of your data, consider the frequency of change in your data streams, or how your data pipeline is scheduled. For example, if your data sources or streams provide new data in a batch once per day, schedule a Soda scan once per day after batch processing.
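The facts-and-dimensions tip above can be sketched as a custom metric in a scan YAML file. This is only an illustration; the table names (`fact_orders`, `dim_status`), column names, and metric name are hypothetical.

```yaml
sql_metrics:
  - sql: |
      SELECT sum(f.amount) as total_completed_revenue
      FROM fact_orders f
      JOIN dim_status d ON f.status_id = d.status_id
      WHERE d.status = 'completed'
    tests:
      - total_completed_revenue > 0
```

The aggregation runs against the fact table, while the join to the dimension table restricts the sum to the business category being validated.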
- Learn how to apply filters, such as date, to a scan of your data.
- Learn more about Metrics.
- See example tests that use each built-in metric.
- Learn about How Soda works.
- Reference the Data types that Soda SQL supports when it scans columns.
- Need help? Join the Soda community on Slack.
Last modified on 18-Oct-21