Link Search Menu Expand Document

Metrics

A metric is a property of the data in your database. A measurement is the value for a metric that Soda checks against during a scan. You use metrics to define the tests that Soda executes against your data during a scan.

For example, in the test defined as row_count > 0, row_count is the metric and 0 is the measurement. When it runs a scan, Soda executes the test against your dataset; if the row count is greater than 0, the test passes; if the dataset is empty, the test fails.

  • In Soda SQL, you use metrics to define tests in your scan YAML file. Read more about configuring metrics.
  • In Soda Cloud, you use metrics to define a test as part of the process to create a new monitor. Read more about creating a monitor.

There are three kinds of metrics Soda uses:

  • Dataset metrics for tests that execute against an entire dataset
  • Column metrics for tests that execute against an individual column
  • Custom metrics, also known as SQL metrics, enable you to define your own metric that you can use tests that execute against a dataset or a column; you can also use custom metrics to simply define SQL queries that Soda executes during a scan

Dataset metrics

Use dataset metrics in tests that execute against all data in the dataset during a scan.

Dataset metric Description
row_count The number of rows in a dataset.
schema A list of column names in a dataset, and their data types.

Column metrics

Use column metrics in tests that execute against specific columns in a dataset during a scan.

Where a column metric references a valid or invalid value, or a limit, use the metric in conjunction with a column configuration key in Soda SQL or a Validity Rule in Soda Cloud. A Soda scan uses the value of a column configuration key / validity rule to determine if it should pass or fail a test.

Column metric Description Applies to data type Column config key(s) / Validity Rule(s)
avg The calculated average of the values in a numeric column. number -
avg_length The average length of string values in a column. text -
invalid_count The number of rows that contain invalid values. text valid_format
valid_regex
valid_values
valid_min_length
valid_max_length
invalid_percentage The percentage of rows that contain invalid values. text valid_format
valid_regex
valid_values
valid_min_length
valid_max_length
max The greatest value in a numeric column. number -
max_length The maximum length of string values in a column. text -
min The smallest value in a numeric column. number -
min_length The minimum length of string values in a column. text -
missing_count The number of rows in a column that do not contain specific content. text, number, date missing_format
missing_regex
missing_values
missing_percentage The percentage of rows in a column that do not contain specific content. text, number, date missing_format
missing_regex
missing_values
row_count The number of rows in a column. text, number, date -
stddev The calculated standard deviation of values in a numeric column. number -
sum The calculated sum of the values in a numeric column. number -
valid_count The number of rows that contain valid content. text valid_format
valid_regex
valid_values
valid_min_length
valid_max_length
valid_percentage The percentage of rows that contain valid content. text valid_format
valid_regex
valid_values
valid_min_length
valid_max_length
values_count The number of rows that contain content included in a list of valid values. text valid_values
valid_regex
values_percentage The percentage of rows that contain content identified by valid values. text valid_values
valid_regex
variance The calculated variance of the values in a numeric column. number -

Column configuration keys or validity rules

Column config key(s) / Validity Rule(s) Description Values
metric_groups Only available in Soda SQL.
Specifies pre-defined groups of metrics that Soda computes for this column. See Metric groups and dependencies for details.
duplicates
length
missing
profiling
statistics
validity
missing_format Specifies missing values such as whitespace or empty strings.  
missing_regex Use regex expressions to specify your own custom missing values. regex, no forward slash delimiters
missing_values Specifies the values that Soda is to consider missing in list format. values in a list
valid_format Specifies a named valid text format. Can apply only to columns using data type TEXT. See Data types. See Valid format values table.
valid_max Specifies a maximum value for valid values. integer
valid_max_length Specifies a maximum string length for valid values. integer
valid_min Specifies a minimum value for valid values. integer
valid_min_length Specifies a minimum string length for valid values. integer
valid_regex Use regex expressions to specify your own custom valid values. regex, no forward slash delimiters
valid_values Specifies several valid values in list format. values in a list

Valid format values

Valid formats are experimental and subject to change.
Valid formats apply only to columns using data type TEXT. See Data types.

Valid format value
Format
number_whole Number is whole.
number_decimal_point Number uses . as decimal indicator.
number_decimal_comma Number uses , as decimal indicator.
number_percentage Number is a percentage.
number_percentage_point Number is a percentage with a . decimal indicator.
number_percentage_comma Number is a percentage with a , decimal indicator.
number_money_usd Number matches US dollar currency pattern.
number_money_eur Number matches Euro currency pattern.
number_money_gbp Number matches British pound currency pattern.
number_money_rmb Number matches Renminbi yuan currency pattern.
number_money_chf Number matches Swiss franc currency pattern.
number_money Format matches any of the number_money_ patterns listed above.
date_eu Validates date only, not time.
dd/mm/yyyy
date_us Validates date only, not time.
mm/dd/yyyy
date_inverse Validates date only, not time.
yyyy/mm/dd
time_24h Validates against the 24-hour clock.
23:00
time_12h Validates against the 12-hour clock.
11:00
time 11:59:00,000
11:59:00
11:59
11-59-00,000
23:59:00,000
Noon
1,159
date_iso_8601 Validates date and/or time according to ISO 8601 format .
2021-04-28T09:00:00+02:00
uuid Universally unique identifier.
ip_address Four whole numbers separated by .
ipv4_address Four whole numbers separated by .
ipv6_address Eight values separated by :
email name@domain.extension
phone_number +12 123 123 1234
123 123 1234
+1 123-123-1234
+12 123-123-1234
+12 123 123-1234
555-2368
555-ABCD
credit_card_number Four four-digit numbers separated by spaces.
Four four-digit numbers separated by dashes.
Sixteen-digit number.
Four five-digit numbers separated by spaces.

Custom metrics

If the built-in dataset and column metrics that Soda offers do not quite give you the information you need from a scan, you can use custom metrics to customize your queries. Custom metrics, also known as SQL metrics, essentially enable you to define SQL queries that Soda runs during a scan. You can also use custom metrics to define new metrics that you can use when you write tests. See Validate that row counts are equal for an example of a test that uses a custom metric.

Read more about using custom metrics in Soda SQL and custom metrics in Soda Cloud.

Go further


Last modified on 15-Sep-21

Was this documentation helpful?
Give us your feedback in the #soda-docs channel in the Soda community on Slack or open an issue in GitHub.