Link Search Menu Expand Document

Missing metrics

Use a missing metric in a check to surface missing values in the data in your dataset.
Read more about SodaCL metrics and checks in general.

checks for dim_customer
  - missing_count(birthday) = 0
  - missing_percent(gender) < 5%
  - missing_count(birthday) = 0:
      missing regex: (0?[0-9]|1[012])[/](0?[0-9]|[12][0-9]|3[01])[/](0000|(19|20)?\d\d)
  - missing_count(last_name) < 5:
      missing values: [n/a, NA, none]
  - missing_percent(email_address) = 0:
      valid format: email

Define checks with missing metrics
Optional check configurations
List of missing metrics
List of configuration keys
List of valid formats
List of comparison symbols and phrases
Go further

Define checks with missing metrics

In the context of SodaCL check types, you use missing metrics in standard checks. Refer to Standard check types for exhaustive configuration details.

You can use all missing metrics in checks that apply to individual columns in a dataset; you cannot use missing metrics in checks that apply to entire datasets. Identify the column(s) by adding one or more values in the argument between brackets in the check.

  • SodaCL considers NULL as the default value for “missing”.
  • If you wish, you can add a % character to the threshold for a missing_percent metric for improved readability.
checks for dim_customer
  - missing_count(birthday, last_name) = 0

You can use missing metrics in checks with fixed thresholds, or relative thresholds, but not dynamic thresholds. See Checks with fixed thresholds for more detail.

checks for dim_reseller:
# a check with a fixed threshold
  - missing_percent(phone) = 5%
# a check with a relative threshold
  - missing_percent(number_employees) < 5
What is a relative threshold? When it scans a column in your dataset, Soda automatically separates all values in the column into one of three categories:
  • missing
  • invalid
  • valid
Soda then performs two calculations. The sum of the count for all categories in a column is always equal to the total row count for the dataset.

missing count(column name) + invalid count(column name) + valid count(column name) = row count

Similarly, a calculation that uses percentage always adds up to a total of 100 for the column.

missing percent(name) + invalid percent(name) + valid percent(name) = 100

These calculations enable you to write checks that use relative thresholds.

In the example above, the missing values (in this case, NULL) of the number_employees column must be less than five percent of the total row count, or the check fails.

Percentage thresholds are between 0 and 100, not between 0 and 1.

Specify missing values or missing regex

SodaCL considers NULL as the default value for “missing”. In the two check examples above, Soda executes the checks to count the number or values which are NULL, or the percent of values which are NULL relative to the total row count of the column.

However, you can use a nested configuration key:value pair to provide your own definition of a missing value. See List of configuration keys below.

A check that uses a missing metric has four or six mutable parts:

a metric
an argument
a comparison symbol or phrase
a threshold
a configuration key (optional)
a configuration value (optional)


The example below defines two checks. The first check applies to the column last_name. The missing values configuration key specifies that any of the three values in the list exist in a row in that column, Soda recognizes those values as missing values. The check fails if Soda discovers more than five values that match NA, n/a, or '0'.

  • Values in a list must be enclosed in square brackets.
  • Numeric characters in a missing values list must be enclosed in single quotes.

The second check uses a regular expression to define what qualifies as a missing value in the birthday column so that any values that are 00/00/0000 qualify as missing. This check passes if Soda discovers no values that match the pattern defined by the regex.

checks for dim_customer:
  - missing_count(last_name) < 5:
      missing values: [NA, n/a, '0']
  - missing_count(birthday) = 0:
      missing regex: (0?[0-9]|1[012])[/](0?[0-9]|[12][0-9]|3[01])[/](0000|(19|20)?\d\d)

First check:

metric missing_count
argument last_name
comparison symbol <
threshold 5
configuration key missing values
configuration value(s) NA, n/a, '0'

Second check:

metric missing_count
argument birthday
comparison symbol or phrase =
threshold 0
configuration key missing regex
configuration value(s) (0?[0-9]|1[012])[/](0?[0-9]|[12][0-9]|3[01])[/](0000|(19|20)?\d\d)


Specify missing format

If the data type of the column you are checking is TEXT (such as character, character varying, or string) then you can use the missing format configuration key. This config key uses built-in values that test the data in the column for specific formats, such as email address format, date format, or uuid format. See List of valid formats below.

The check below validates that all values in the email_address column conform to an email address format.

checks for dim_customer:
  - missing_percent(email_address) = 0:
      valid format: email
metric missing_percent
argument email_address
comparison symbol or phrase =
threshold 0
configuration key valid format
configuration value(s) email


Troubleshoot missing format

Problem: You are using a valid format to test the format of values in a column and the CLI returns the following error message when you run a scan.

  | HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

Error occurred while executing scan.
  | unsupported operand type(s) for *: 'Undefined' and 'int'

Solution: The error indicates that the data type of the column is not TEXT. Adjust your check to use missing regex or missing values configuration keys, instead.


Send failed rows to Soda Cloud

If you have connected Soda Core to a Soda Cloud account, checks with missing metrics automatically send samples of any failed rows to Soda Cloud.

  1. To review the failed rows in Soda Cloud, navigate to the Monitors dashboard.
  2. Click the row for a check for missing values, then go to the Failed rows tab.

failed-missing-count

Optional check configurations

Configuration Documentation
Define a name for a check with missing metrics; see example. Customize check names
Define alert configurations to specify warn and fail thresholds; see example. Add alert configurations
Apply a filter to return results for a specific portion of the data in your dataset; see example. Add a filter to a check
Use quotes when identifying dataset or column names; see example Use quotes in a check
  Use wildcard characters ( % or * ) in values in the check. -
Use for each to apply checks with missing metrics to multiple datasets in one scan; see example. Apply checks to multiple datasets
Apply a dataset filter to partition data during a scan; see example. Scan a portion of your dataset

Example with check name

checks for dim_customer:
  - missing_count(birthday) = 0:
      missing regex: (0?[0-9]|1[012])[/](0?[0-9]|[12][0-9]|3[01])[/](0000|(19|20)?\d\d)
      name: Date entered as 00/00/0000

Example with alert configuration

checks for dim_customer:
  - missing_percent(marital_status):
      valid length: 1
      warn: when < 5
      fail: when >= 5  

Example with filter

checks for dim_customer:
  - missing_count(first_name) < 5:
      missing values: [NA, none]
      filter: number_children_at_home > 2

Example with quotes

checks for dim_reseller:
  - missing_percent("phone", "address_line1") = 0

Example with for each

for each dataset T:
  datasets:
    - dim_product
    - dim_product_%
  checks:
    - missing_count(product_line) = 0

Example with dataset filter

coming soon


List of missing metrics

Metric Column config keys Description Supported data type Supported data sources
missing_count missing format
missing regex
missing values
The number of rows in a column that contain NULL values and any other user-defined values that qualify as missing. number, text, time Athena
Redshift
Big Query
PostgreSQL
Snowflake
missing_percent missing format
missing regex
missing values
The percentage of rows in a column, relative to the total row count, that contain NULL values and any other user-defined values that qualify as missing. number, text, time Athena
Redshift
Big Query
PostgreSQL
Snowflake

List of configuration keys

The column configuration key:value pair defines what SodaCL ought to consider as missing values.

Column config key Description Values
missing format Defines the format of a value that Soda ought to register as missing.
Only works with columns that contain data type TEXT.
See List of valid formats
missing regex Specifies a regular expression to define your own custom missing values. regex, no forward slash delimiters, string only
missing values Specifies the values that Soda is to consider missing. Numeric characters in a valid values list must be enclosed in single quotes. values in a list

List of valid formats

Valid formats apply only to columns using data type TEXT.

Valid format value Format
credit card number Four four-digit numbers separated by spaces.
Four four-digit numbers separated by dashes.
Sixteen-digit number.
Four five-digit numbers separated by spaces.
date eu Validates date only, not time.
dd/mm/yyyy
date inverse Validates date only, not time.
yyyy/mm/dd
date iso 8601 Validates date and/or time according to ISO 8601 format .
2021-04-28T09:00:00+02:00
date us Validates date only, not time.
mm/dd/yyyy
decimal Number uses a , or . as a decimal indicator.
decimal comma Number uses , as decimal indicator.
decimal point Number uses . as decimal indicator.
email name@domain.extension
integer Number is whole.
ip_address Four whole numbers separated by .
ipv4_address Four whole numbers separated by .
ipv6_address Eight values separated by :
money A money pattern with currency symbol + decimal point or comma + currency abbreviation.
money comma A money pattern with currency symbol + decimal comma + currency abbreviation.
money point A money pattern with currency symbol + decimal point + currency abbreviation.
negative decimal Negative number uses a , or . as a decimal indicator.
negative decimal comma Negative number uses , as decimal indicator.
negative decimal point Negative number uses . as decimal indicator.
negative integer Number is negative and whole.
negative percentage Negative number is a percentage.
negative percentage comma Negative number is a percentage with a , decimal indicator.
negative percentage point Negative number is a percentage with a . decimal indicator.
percentage comma Number is a percentage with a , decimal indicator.
percentage point Number is a percentage with a . decimal indicator.
percentage Number is a percentage.
phone number +12 123 123 1234
123 123 1234
+1 123-123-1234
+12 123-123-1234
+12 123 123-1234
555-2368
555-ABCD
positive decimal Postive number uses a , or . as a decimal indicator.
positive decimal comma Positive number uses , as decimal indicator.
positive decimal point Positive number uses . as decimal indicator.
positive integer Number is positive and whole.
positive percentage Positive number is a percentage.
positive percentage comma Positive number is a percentage with a , decimal indicator.
positive percentage point Positive number is a percentage with a . decimal indicator.
time 12h Validates against the 12-hour clock.
hh:mm:ss
time 12h nosec Validates against the 12-hour clock.
hh:mm
time 24h Validates against the 244-hour clock.
hh:mm:ss
time 24h nosec Validates against the 24-hour clock.
hh:mm
timestamp 12h Validates against the 12-hour clock.
hh:mm:ss
timestamp 24h Validates against the 24-hour clock.
hh:mm:ss
uuid Universally unique identifier.

List of comparison symbols and phrases

 = 
 < 
 >
 <=
 >=
 !=
 <> 
 between 
 not between 

Go further


Last modified on 01-Jul-22

Was this documentation helpful?
Share feedback in the Soda community on Slack.

Help improve our docs!

  • Request a docs change.
  • Edit this page in our GitHub repo.