Link Search Menu Expand Document

Freshness checks

Last modified on 31-May-23

Use a freshness check to determine the relative age of the data in a column in your dataset.

checks for dim_product:
  - freshness(start_date) < 3d

Define freshness checks
    Details and limitations
    Troubleshoot errors with freshness checks
Freshness check results
Optional check configurations
List of freshness thresholds
List of comparison symbols and phrases
Go further

Define freshness checks

A freshness check measures the age of the youngest row in a table. Using this check, you can infer the freshness of the data in the dataset using the age of the most recently added row in a table.

In the context of SodaCL check types, freshness checks are unique. This check is limited in its syntax variation, with only a few mutable parts to specify column name, threshold, and, optionally, a “now” variable.

A freshness check has two or three mutable parts:

a timestamp column name
a variable to specify the value of “now” (optional)
a threshold

The example below defines a check that measures freshness relative to “now”, where “now” is the moment you run the scan that executes the freshness check. This example discovers when the last row was added to the start_date timestamp column, then compares that timestamp to “now”. If Soda discovers that the last row was added more than three days ago, the check fails.

checks for dim_product:
  - freshness(start_date) < 3d
column name start_date
threshold 3d

Instead of using the default value for “now” (the time you run the scan that executes the freshness check), you can use a variable to specify the value of “now” at scan time. For example, the following check measures freshness relative to a date that a user specifies at scan time.

checks for dim_product:
  - freshness using end_date with NOW < 1d
column name end_date
variable to specify the value of “now” (optional) NOW
threshold 1d

At scan time, you use a -v option to pass a value for the variable that the check expects for the value of “now”. The scan command below passes a variable that the check uses. In your scan command, if you pass a variable with a timestamp, the variable must be in ISO8601 format such as "2022-02-16 21:00:00" or "2022-02-16T21:00:00".

soda scan -d adventureworks -c configuration.yml -v NOW="2022-05-31 21:00:00" checks_test.yml

Known issue:
When introducing a NOW variable into a freshness check, you must use the deprecated syntax that includes using. This syntax yields an error message in the scan output, Syntax of freshness check has changed and is deprecated. Use freshness(column_name) < 24h30m See docs but does not prevent Soda from executing the check. Workaround: ignore the deprecated syntax message.

Details and limitations

  • Out-of-the-box, freshness checks only work with columns that contain data types TIMESTAMP or DATE. However, you can apply a freshness check to TEXT type data using the following syntax to cast the column:
checks for dim_product:
  - freshness(createdat::datetime) < 1d
  • Note that casting a column in a check does not work with a NOW variable.
  • The only comparison symbol you can use with freshness checks is < except when you employ and alert configuration. See Example with alert configuration for details.
  • The default value for “now” is the time you run the scan that executes the freshness check.
  • If no timezone information is available in either the timestamp of the check (scan time), or in the data in the column, a freshness check uses the UTC timezone. Soda converts both timestamps to UTC to compare values.

Troubleshoot errors with freshness checks

Problem: When you run a scan to execute a freshness check, the CLI returns one of the following error message.

Invalid staleness threshold "when < 3256d"
  +-> line=2,col=5 in checks_test.yml

Invalid check "freshness(start_date) > 1d": no viable alternative at input ' >'

Solution: The error indicates that you are using an incorrect comparison symbol. Remember that freshness checks can only use < in check, unless the freshness check employs an alert configuration, in which case it can only use > in the check.

Problem: When you run a scan to execute a freshness check that uses a NOW variable, the CLI returns an following error message for Invalid check.

Invalid check "freshness(end_date) ${NOW} < 1d": mismatched input '${NOW}' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}

Solution: Until the known issue is resolved, use a deprecated syntax for freshness checks using a NOW variable, and ignore the deprecated syntax message in the output. For example, define a check as per the following.

checks for dim_product:
  - freshness using end_date with NOW < 1d

Freshness check results

When you run a scan that includes a freshness check, the output in the Soda Core CLI provides several values for measurements Soda used to calculate freshness. The value for freshness itself is displayed in days, hours, minutes, seconds, and milliseconds; see the example below.

Soda Core 3.0.x
Scan summary:
1/1 checks FAILED: 
      Data is fresh [FAILED]
        max_column_timestamp: 2013-07-01 00:00:00
        max_column_timestamp_utc: 2013-07-01 00:00:00+00:00
        now_variable_name: NOW
        now_timestamp: 2022-09-13T16:40:39.196522+00:00
        now_timestamp_utc: 2022-09-13 16:40:39.196522+00:00
        freshness: 3361 days, 16:40:39.196522
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.

In Soda Cloud, the freshness value represents age of the data in the days, hours, minutes, etc. relative to now_timestamp.

Optional check configurations

Supported Configuration Documentation
Define a name for a freshness check; see example. Customize check names
Add an identity to a check. Add a check identity
Define alert configurations to specify warn and fail thresholds; see example. Add alert configurations
  Apply an in-check filter to return results for a specific portion of the data in your dataset. -
Use quotes when identifying dataset or column names; see example.
Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.
Use quotes in a check
  Use wildcard characters ( % or * ) in values in the check. -
Use for each to apply freshness checks to multiple datasets in one scan; see example. Apply checks to multiple datasets
Apply a dataset filter to partition data during a scan; see example. Scan a portion of your dataset

Example with check name

checks for dim_product:
  - freshness(start_date) < 27h:
      name: Data is fresh

Example with alert configuration

The only comparison symbol that you can use with freshness checks that employ an alert configuration is >.

checks for dim_product:
  - freshness(start_date):
      warn: when > 3256d
      fail: when > 3258d


checks for dim_product:
  - freshness(start_date):
        when > 3256d
        when > 3258d

Example with quotes

checks for dim_product:
  - freshness("end_date") < 3d

Example with for each

for each dataset T:
    - dim_prod%
    - freshness(end_date) < 3d

Example with dataset filter

filter CUSTOMERS [daily]:
  where: TIMESTAMP '{ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'

checks for CUSTOMERS [daily]:
  - freshness(end_date) < 3d

List of freshness thresholds

Threshold Example Reads as
#d 3d 3 days
#h 1h 1 hour
#m 30m 30 minutes
#d#h 1d6h 1 day and 6 hours
#h#m 1h30m 1 hour and 30 minutes

List of comparison symbols and phrases

# If using without an alert configuration
# If using with an alert configuration

Go further

Was this documentation helpful?

What could we do to improve this page?

Last modified on 31-May-23