Validity metrics
Use a validity metric in a check to surface invalid or unexpected values in your dataset.
Read more about SodaCL metrics and checks in general.
checks for dim_customer:
- invalid_count(email_address) = 0:
valid format: email
- invalid_percent(english_education) = 0:
valid length: 100
- invalid_percent(total_children) <= 2:
valid max: 6
- invalid_percent(marital_status) = 0:
valid max length: 10
- invalid_count(number_cars_owned) = 0:
valid min: 1
- invalid_percent(marital_status) = 0:
valid min length: 1
- invalid_percent(birthday) < 5%:
valid regex: (0?[0-9]|1[012])[/](0?[0-9]|[12][0-9]|3[01])[/](0000|(19|20)?\d\d)
- invalid_count(house_owner_flag) = 0:
valid values: ['0', '1']
Define checks with validity metrics
Optional check configurations
List of validity metrics
List of configuration keys
List of valid formats
List of comparison symbols and phrases
Go further
Define checks with validity metrics
In the context of SodaCL check types, you use validity metrics in standard checks. Refer to Standard check types for exhaustive configuration details.
You can use all validity metrics in checks that apply to individual columns in a dataset; you cannot use validity metrics in checks that apply to entire datasets. Identify the column(s) by adding one or more values in the argument between brackets in the check.
- You must use a configuration key:value pair to define what qualifies as an valid value.
- If you wish, you can add a
%
character to the threshold for ainvalid_percent
metric for improved readability.
checks for dim_customer
- invalid_count(number_cars_owned) = 0:
valid min: 1
You can use validity metrics in checks with fixed thresholds, or relative thresholds, but not dynamic thresholds. See Checks with fixed thresholds for more detail.
checks for dim_reseller:
# a check with a fixed threshold
- invalid_count(email_address) = 0:
valid format: email
# a check with a relative threshold
- invalid_percent(english_education) < 3%:
valid max length: 100
What is a relative threshold?
When it scans a column in your dataset, Soda automatically separates all values in the column into one of three categories:- missing
- invalid
- valid
missing_count(column_name) + invalid_count(column_name) + valid_count(column_name) = row_count
Similarly, a calculation that uses percentage always adds up to a total of 100 for the column.
missing_percent(name) + invalid_percent(name) + valid_percent(name) = 100
These calculations enable you to write checks that use relative thresholds.
In the example above, the invalid values of the
english_education
column must be less than three percent of the total row count, or the check fails.Percentage thresholds are between 0 and 100, not between 0 and 1.
Specify valid values
Use a nested configuration key:value pair to provide your own definition of a valid value. There are several configuration keys that you can use to define what qualifies as valid; the examples below illustrate the use of just a few config keys. See a complete List of configuration keys below.
A check that uses a validity metric has six mutable parts:
a metric |
an argument |
a comparison symbol or phrase |
a threshold |
a configuration key |
a configuration value |
The example below defines two checks. The first check applies to the column house_owner_flag
. The valid values
configuration key specifies that if a row in that column contains anything other than the two valid values in the list, Soda registers them as invalid. The check fails if Soda discovers more than five values that are not 0
or 1
.
- Values in a list must be enclosed in square brackets.
- Numeric characters in a
valid values
list must be enclosed in single quotes.
The second check uses a regular expression to define what qualifies as a valid value in the birthday
column so that any values that do not match the pattern defined by the regex qualify as invalid.
checks for dim_customer:
- invalid_count(house_owner_flag) = 0:
valid values: ['0', '1']
- invalid_count(birthday) = 0:
valid regex: ^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$
First check:
metric | invalid_count |
argument | house_owner_flag |
comparison symbol | = |
threshold | 0 |
configuration key | valid values |
configuration value(s) | '0', '1' |
Second check:
metric | invalid_count |
argument | birthday |
comparison symbol or phrase | = |
threshold | 0 |
configuration key | valid regex |
configuration value(s) | (0?[0-9]|1[012])[/](0?[0-9]|[12][0-9]|3[01])[/](0000|(19|20)?\d\d) |
Specify valid format
If the data type of the column you are checking is TEXT (such as character, character varying, or string) then you can use the valid format
configuration key. This config key uses built-in values that test the data in the column for specific formats, such as email address format, date format, or uuid format. See List of valid formats below.
The check below validates that all values in the email_address
column conform to an email address format.
checks for dim_customer:
- invalid_percent(email_address) = 0:
valid format: email
metric | invalid_percent |
argument | email_address |
comparison symbol or phrase | = |
threshold | 0 |
configuration key | valid format |
configuration value(s) | email |
Troubleshoot valid format
Problem: You are using a valid format
to test the format of values in a column and the CLI returns the following error message when you run a scan.
| HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
Error occurred while executing scan.
| unsupported operand type(s) for *: 'Undefined' and 'int'
Solution: The error indicates that the data type of the column is not TEXT. Adjust your check to use a different configuration key, instead.
Send failed rows to Soda Cloud
If you have connected Soda Core to a Soda Cloud account, checks with validity metrics automatically send samples of any failed rows to Soda Cloud.
- To review the failed rows in Soda Cloud, navigate to the Monitors dashboard.
- Click the row for a check for invalid values, then go to the Failed rows tab.
Optional check configurations
✓ | Configuration | Documentation |
---|---|---|
✓ | Define a name for a check with validity metrics; see example. | Customize check names |
✓ | Define alert configurations to specify warn and fail thresholds; see example. | Add alert configurations |
✓ | Apply a filter to return results for a specific portion of the data in your dataset; see example. | Add a filter to a check |
✓ | Use quotes when identifying dataset or column names; see example | Use quotes in a check |
Use wildcard characters ( % or * ) in values in the check. | - | |
✓ | Use for each to apply checks with validity metrics to multiple datasets in one scan; see example. | Apply checks to multiple datasets |
✓ | Apply a dataset filter to partition data during a scan; see example. | Scan a portion of your dataset |
Example with check name
checks for dim_customer:
- invalid_count(first_name) = 0 :
valid min length: 2
name: First name has 2 or more characters
Example with alert configuration
- invalid_count(house_owner_flag):
valid values: ['0', '1']
warn: when between 1 and 5
fail: when > 6
Example with filter
checks for dim_customer:
- invalid_percent(marital_status) = 0:
valid max length: 1
filter: total_children = 0
Example with quotes
checks for dim_customer:
- invalid_count("number_cars_owned") = 0:
valid min: 1
Example with for each
for each dataset T:
datasets:
- dim_customer
- dim_customer_%
checks:
- invalid_count(email_address) = 0:
valid format: email
Example with dataset filter
coming soon
List of validity metrics
Metric | Column config keys | Description | Supported data type | Supported data sources |
---|---|---|---|---|
invalid_count | valid format valid length valid max valid max length valid min valid min length valid regex valid values | The number of rows in a column that contain values that are not valid. | number, text, time | Athena Redshift Big Query PostgreSQL Snowflake |
invalid_percent | valid format valid length valid max valid max length valid min valid min length valid regex valid values | The percentage of rows in a column, relative to the total row count, that contain values that are not valid. | number, text, time | Athena Redshift Big Query PostgreSQL Snowflake |
List of configuration keys
The column configuration key:value pair defines what SodaCL ought to consider as valid values.
Column config key | Description | Values |
---|---|---|
valid format | Defines the format of a value that Soda ought to register as valid. Only works with columns that contain data type TEXT. | See List of valid formats. |
valid length | Specifies a valid length for a string. Only works with columns that contain data type TEXT. | integer |
valid max | Specifies a maximum numerical value for valid values. | integer or float |
valid max length | Specifies a valid maximum length for a string. Only works with columns that contain data type TEXT. | integer |
valid min | Specifies a minimum numerical value for valid values. | integer or float |
valid min length | Specifies a valid minimum length for a string. Only works with columns that contain data type TEXT. | integer |
valid regex | Specifies a regular expression to define your own custom valid values. | regex, no forward slash delimiters |
valid values | Specifies the values that Soda is to consider valid. Numeric characters in a valid values list must be enclosed in single quotes. | values in a list |
List of valid formats
Valid formats apply only to columns using data type TEXT.
Valid format value | Format |
---|---|
credit card number | Four four-digit numbers separated by spaces. Four four-digit numbers separated by dashes. Sixteen-digit number. Four five-digit numbers separated by spaces. |
date eu | Validates date only, not time. dd/mm/yyyy |
date inverse | Validates date only, not time. yyyy/mm/dd |
date iso 8601 | Validates date and/or time according to ISO 8601 format . 2021-04-28T09:00:00+02:00 |
date us | Validates date only, not time. mm/dd/yyyy |
decimal | Number uses a , or . as a decimal indicator. |
decimal comma | Number uses , as decimal indicator. |
decimal point | Number uses . as decimal indicator. |
email | name@domain.extension |
integer | Number is whole. |
ip_address | Four whole numbers separated by . |
ipv4_address | Four whole numbers separated by . |
ipv6_address | Eight values separated by : |
money | A money pattern with currency symbol + decimal point or comma + currency abbreviation. |
money comma | A money pattern with currency symbol + decimal comma + currency abbreviation. |
money point | A money pattern with currency symbol + decimal point + currency abbreviation. |
negative decimal | Negative number uses a , or . as a decimal indicator. |
negative decimal comma | Negative number uses , as decimal indicator. |
negative decimal point | Negative number uses . as decimal indicator. |
negative integer | Number is negative and whole. |
negative percentage | Negative number is a percentage. |
negative percentage comma | Negative number is a percentage with a , decimal indicator. |
negative percentage point | Negative number is a percentage with a . decimal indicator. |
percentage comma | Number is a percentage with a , decimal indicator. |
percentage point | Number is a percentage with a . decimal indicator. |
percentage | Number is a percentage. |
phone number | +12 123 123 1234 123 123 1234 +1 123-123-1234 +12 123-123-1234 +12 123 123-1234 555-2368 555-ABCD |
positive decimal | Postive number uses a , or . as a decimal indicator. |
positive decimal comma | Positive number uses , as decimal indicator. |
positive decimal point | Positive number uses . as decimal indicator. |
positive integer | Number is positive and whole. |
positive percentage | Positive number is a percentage. |
positive percentage comma | Positive number is a percentage with a , decimal indicator. |
positive percentage point | Positive number is a percentage with a . decimal indicator. |
time 12h | Validates against the 12-hour clock. hh:mm:ss |
time 12h nosec | Validates against the 12-hour clock. hh:mm |
time 24h | Validates against the 244-hour clock. hh:mm:ss |
time 24h nosec | Validates against the 24-hour clock. hh:mm |
timestamp 12h | Validates against the 12-hour clock. hh:mm:ss |
timestamp 24h | Validates against the 24-hour clock. hh:mm:ss |
uuid | Universally unique identifier. |
List of comparison symbols and phrases
=
<
>
<=
>=
!=
<>
between
not between
Go further
- Use validity metrics in checks with alert configurations to establish warn and fail zones
- Use validity metrics in checks to define ranges of acceptable thresholds using boundary thresholds.
- Need help? Join the Soda community on Slack.
Last modified on 01-Jul-22
Was this documentation helpful?
Share feedback in the Soda community on Slack.
Help improve our docs!