Numeric metrics
Use a numeric metric in a check to perform basic calculations on the data in your dataset.
Read more about SodaCL metrics and checks in general.
checks for retail_products:
- avg(size) between 100 and 300
- avg_length(manufacturer) > 10
- duplicate_count(product_id) = 0
- max(size) <= 500
- max_length(manufacturer) = 25
- min(size) >= 50
- min_length(manufacturer) = 5
- row_count > 0
- percentile(size, 0.95) > 50
checks for retail_orders_postgres:
- stddev(order_quantity) > 0
- stddev_pop(order_quantity) between 3 and 4
- stddev_samp(order_quantity) not between 3 and 4
- sum(discount, order_quantity) < 120
- variance(discount) > 0
- var_pop(discount) between 0 and 5
- var_samp(discount) not between 0 and 5
Define checks with numeric metrics
Optional check configurations
List of numeric metrics
List of comparison symbols and phrases
Fixed and dynamic thresholds
Go further
Define checks with numeric metrics
In the context of Soda check types, you use numeric metrics in Standard checks. Refer to Standard check types for exhaustive configuration details.
You can use the row_count
metric in checks that apply to entire datasets.
checks for dim_reseller:
- row_count > 0
You can use all numeric metrics in checks that apply to individual columns in a dataset. Identify the column(s) by adding one or more values in the argument between brackets in the check.
checks for dim_reseller:
- duplicate_count(phone, address_line1) = 0
You can use some numeric metrics in checks with either fixed or dynamic thresholds. See Fixed and dynamic thresholds for more detail.
checks for dim_reseller:
# a check with a fixed threshold
- duplicate_count(phone) = 0
# a check with a dynamic threshold
- change avg last 7 for row_count < 50
Send failed rows to Soda Cloud
If you have connected Soda Core to a Soda Cloud account, checks that use a duplicate_count
metric automatically sends samples of any failed rows to Soda Cloud.
- To review the failed rows in Soda Cloud, navigate to the Monitors dashboard.
- Click the row for the check with
duplicate_count
, then go to the Failed rows tab.
Optional check configurations
✓ | Configuration | Documentation |
---|---|---|
✓ | Define a name for a check with numeric metrics; see example. | Customize check names |
✓ | Define alert configurations to specify warn and fail thresholds; see example. | Add alert configurations |
✓ | Apply a filter to return results for a specific portion of the data in your dataset; see example. Exception: you cannot use filters in checks with a duplicate_count metric. | Add a filter to a check |
✓ | Use quotes when identifying dataset or column names; see example | Use quotes in a check |
Use wildcard characters ( % or * ) in values in the check. | - | |
✓ | Use for each to apply checks with numeric metrics to multiple datasets in one scan; see example. | Apply checks to multiple datasets |
✓ | Apply a dataset filter to partition data during a scan; see example. | Scan a portion of your dataset |
Example with alert configuration
checks for dim_reseller:
- duplicate_count(phone):
warn: when > 5
fail: when >= 10
Example with check name
checks for dim_reseller:
- duplicate_count(phone) = 0:
name: Duplicate phone numbers
Example with filter
checks for dim_employee:
- max(vacation_hours) < 80:
name: Too many vacation hours for sales territory US
filter: sales_territory_key = 11
Example with quotes
checks for dim_reseller:
- duplicate_count("phone", "address_line1") = 0
Example with dataset filter
coming soon
Example with for each
for each dataset T:
datasets:
- dim_product
- dim_customer
- dim reseller
checks:
- row_count > 0
List of numeric metrics
Metric | Description | Supported data type | Supported data sources |
---|---|---|---|
avg | The average value in a numeric column. | number | Athena Redshift Big Query PostgreSQL Snowflake |
avg_length | The average length in a text column. | text | Athena Redshift Big Query PostgreSQL Snowflake |
duplicate_count | The number of rows that contain duplicate values. Include one column in the argument to compare values relative to that one column. Include more than one column in the argument to compare values across columns. See Duplicate check | number text time | Athena Redshift Big Query PostgreSQL Snowflake |
max | The greatest value in a numeric column. | number time | Athena Redshift Big Query PostgreSQL Snowflake |
max_length | The greatest length in a text column. | text | Athena Redshift Big Query PostgreSQL Snowflake |
min | The smallest value in a numeric column. | number time | Athena Redshift Big Query PostgreSQL Snowflake |
min_length | The smallest length in a text column. | text | Athena Redshift Big Query PostgreSQL Snowflake |
percentile | The value below which a percentage of observations fall within a group of observations. For example, percentile(distance, 0.7) . | number | PostgreSQL |
row_count | The number of rows in a dataset or column, if specified. | number text time | Athena Redshift Big Query PostgreSQL Snowflake |
stddev | The calculated standard deviation of values in a numeric column. | number | PostgreSQL |
stddev_pop | The calculated population standard deviation of values in a numeric column. | number | PostgreSQL |
stddev_samp | The calculated sample standard deviation of values in a numeric column. | number | PostgreSQL |
sum | The calculated sum of the values in a numeric column. | number | Athena Redshift Big Query PostgreSQL Snowflake |
variance | The calculated variance of the values in a numeric column. | number time | PostgreSQL |
var_pop | The calculated population variance of the values in a numeric column. | number time | PostgreSQL |
var_samp | The calculated sample variance of the values in a numeric column. | number time | PostgreSQL |
List of comparison symbols and phrases
=
<
>
<=
>=
!=
<>
between
not between
Fixed and dynamic thresholds
Numeric metrics can specify a fixed threshold which is not relative to any other threshold. row_count > 0
is an example of a check with a fixed threshold as the threshold value, 0
, is absolute. Refer to Checks with fixed thresholds for details.
Only checks that use numeric metrics can specify a dynamic threshold, a value that is relative to a previously-measured, or historic, value. Sometimes referred to as a change-over-time threshold, you use these dynamic thresholds to gauge changes to the same metric over time. Most of the examples below use the row_count
metric, but you can use any numeric metric in checks that use dynamic thresholds.
If you have connected Soda Core to a Soda Cloud account, Soda Core pushes check results to your cloud account. Soda Cloud stores the measured value of each metric that a check result produces during a scan in a Cloud Metric Store. Over time, these historic values accumulate and you can reference them to detect anomalous values relative to historic values for the same metric. Therefore, you must have a Soda Cloud account to use dynamic thresholds.
The most basic of dynamic threshold checks has three or four mutable parts:
a metric |
an argument (optional) |
a comparison symbol or phrase |
a threshold |
The example below defines a check that applies to the entire dataset and counts the rows in the dataset, then compares that value to the preceding value contained in the Cloud Metric Store. If the row_count
at present is greater than the previously-recorded historic value for row_count
by more than 50 or less than -20, the check fails.
Use between
for checks with dynamic thresholds as much as possible to trigger check failures when the measurement falls outside of a range of acceptable values. This practice ensures that you get visibility into changes that either exceed or fall short of threshold expectations.
checks for dim_customer:
- change for row_count between -20 and +50
metric | row_count |
threshold | between -20 and +50 |
The example below defines a check that applies to the entire dataset and counts the rows in the dataset, then compares that value to the preceding value contained in the Cloud Metric Store. If the row_count
at present is greater than the previously-recorded historic value for row_count
by more than 50%, the check fails.
For example, the previously-recorded historic measurement for row count is 80, and the newly-recorded value is 100, the relative change is 25%, which is less than the 50% specified in the threshold, so the check passes.
- Percentage thresholds are between 0 and 100, not between 0 and 1.
- If you wish, you can add a
%
character to the threshold for a dynamic threshold for improved readability. - If the previous measurement value is 0 and the new value is 0, Soda calculates the relative change as 0%. However, if the previous measurement value is 0 and the new value is not 0, then Soda indicates the check as
NOT EVALUATED
because the calculation is a division by zero.
checks for dim_customer:
- change percent for row_count > 50%
metric | row_count |
comparison symbol | > |
threshold | 50 % |
The example below applies to only the phone
column in the dataset and counts the rows that contain duplicate values, then compares that value to the preceding value contained in the Cloud Metric Store. If the number of duplicate phone numbers at present is greater than the preceding historic values for duplicate_count
by more than 20, the check fails.
checks for dim_customer:
- change for duplicate_count(phone) < 20
metric | duplicate_count |
argument | (phone) |
comparison symbol | < |
threshold | 20 |
A more complex dynamic threshold check includes two more optional mutable parts:
a calculation type (optional) avg , min , max |
a historical value definition (optional) |
percent (optional) |
a metric |
an argument (optional) |
a comparison symbol or phrase |
a threshold |
checks for dim_customer:
- change avg last 7 for row_count < 50
- change min last 7 for row_count < 50
- change max last 7 percent for row_count < 50
The example above defines three checks, one for each type of calculation available to use, avg
, min
, and max
, all of which apply to the entire dataset.
The first check counts the rows in the dataset, then compares that value to the calculated average of the preceding seven measurement values for that metric contained in the Cloud Metric Store. If the row_count
at present is greater than the average of the seven preceding historic values by more than 50, the check fails.
calculation type (optional) | avg |
a historical value definition (optional) | last 7 days |
percent (optional) | - |
metric | row_count |
argument (optional) | - |
comparison symbol or phrase | < |
a threshold | 50 |
The second check in the example determines the minimum value of the preceding seven historic values, then uses that value to compare to the present measurement value.
calculation type (optional) | min |
historical value definition (optional) | last 7 days |
percent (optional) | - |
metric | row_count |
argument (optional) | - |
comparison symbol or phrase | < |
a threshold | 50 |
The third check in the example determines the maximum value of the preceding seven historic values, then uses that value and the present measurement value to calculate the percentage of change.
calculation type (optional) | max |
historical value definition (optional) | last 7 days |
percent (optional) | percent |
metric | row_count |
argument (optional) | - |
comparison symbol or phrase | < |
a threshold | 50 |
Go further
- Use numeric metrics in checks with alert configurations to establish warn and fail zones
- Use numeric metrics in checks to define ranges of acceptable thresholds using boundary thresholds.
- Need help? Join the Soda community on Slack.
Last modified on 01-Jul-22
Was this documentation helpful?
Share feedback in the Soda community on Slack.
Help improve our docs!