Freshness checks
Last modified on 20-Nov-24
Use a freshness check to determine the relative age of the data in a column in your dataset.
checks for dim_product:
- freshness(start_date) < 3d
✖️ Requires Soda Core Scientific (included in a Soda Agent)
✔️ Supported in Soda Core
✔️ Supported in Soda Library + Soda Cloud
✔️ Supported in Soda Cloud Agreements + Soda Agent
✔️ Available as a no-code check
Define freshness checks
Details and limitations
Troubleshoot errors with freshness checks
Freshness check results
Optional check configurations
List of freshness thresholds
List of comparison symbols and phrases
Go further
Define freshness checks
A freshness check measures the age of the youngest row in a table. Using this check, you can infer the freshness of the data in the dataset using the age of the most recently added row in a dataset.
In the context of SodaCL check types, freshness checks are unique. This check is limited in its syntax variation, with only a few mutable parts to specify column name, threshold, and, optionally, a NOW
variable.
A freshness check has two or three mutable parts:
a timestamp column name |
a variable to specify the value of “now” (optional) |
a threshold |
The example below defines a check that measures freshness relative to “now”, where “now” is the moment you run the scan that executes the freshness check. This example discovers when the last row was added to the start_date
timestamp column, then compares that timestamp to “now”. If Soda discovers that the last row was added more than three days ago, the check fails.
checks for dim_product:
- freshness(start_date) < 3d
column name | start_date |
threshold | 3d |
Instead of using the default value for “now” (the time you run the scan that executes the freshness check), you can use a variable to specify the value of “now” at scan time. For example, the following check measures freshness relative to a date that a user specifies at scan time. You cannot use variables in checks you write in an agreement in Soda Cloud as it is impossible to provide the variable values at scan time.
checks for dim_product:
- freshness using end_date with NOW < 1d
column name | end_date |
variable to specify the value of “now” (optional) | NOW |
threshold | 1d |
At scan time, you use a -v
option to pass a value for the variable that the check expects for the value of “now”. The scan command below passes a variable that the check uses. In your scan command, if you pass a variable with a timestamp, the variable must be in ISO8601 format such as "2022-02-16 21:00:00"
or "2022-02-16T21:00:00"
.
soda scan -d adventureworks -c configuration.yml -v NOW="2022-05-31 21:00:00" checks_test.yml
Known issue:
When introducing a NOW variable into a freshness check, you must use the deprecated syntax that includes using
. This syntax yields an error message in the scan output, Syntax of freshness check has changed and is deprecated. Use freshness(column_name) < 24h30m See docs
but does not prevent Soda from executing the check. Workaround: ignore the deprecated syntax message.
Details and limitations
- Out-of-the-box, freshness checks only work with columns that contain data types TIMESTAMP or DATE. However, though it does not universally apply to all data sources, you may be able to apply a freshness check to TEXT type data using the following syntax to cast the column:
checks for dim_product:
- freshness(createdat::datetime) < 1d
- Note that casting a column in a check does not work with a NOW variable.
- The only comparison symbol you can use with freshness checks is
<
except when you employ and alert configuration. See Example with alert configuration for details. - The default value for “now” is the time you run the scan that executes the freshness check.
- If no timezone information is available in either the timestamp of the check (scan time), or in the data in the column, a freshness check uses the UTC timezone. Soda converts both timestamps to UTC to compare values.
- You cannot use variables in checks you write in an agreement in Soda Cloud as it is impossible to provide the variable values at scan time.
Troubleshoot errors with freshness checks
Problem: When you run a scan to execute a freshness check, the CLI returns one of the following error message.
Invalid staleness threshold "when < 3256d"
+-> line=2,col=5 in checks_test.yml
Invalid check "freshness(start_date) > 1d": no viable alternative at input ' >'
Solution: The error indicates that you are using an incorrect comparison symbol. Remember that freshness checks can only use <
in check, unless the freshness check employs an alert configuration, in which case it can only use >
in the check.
Problem: When you run a scan to execute a freshness check that uses a NOW variable, the CLI returns an following error message for Invalid check
.
Invalid check "freshness(end_date) ${NOW} < 1d": mismatched input '${NOW}' expecting {'between', 'not', '!=', '<>', '<=', '>=', '=', '<', '>'}
Solution: Until the known issue is resolved, use a deprecated syntax for freshness checks using a NOW variable, and ignore the deprecated syntax
message in the output. For example, define a check as per the following.
checks for dim_product:
- freshness using end_date with NOW < 1d
Freshness check results
When you run a scan that includes a freshness check, the output in the Soda Library CLI provides several values for measurements Soda used to calculate freshness. The value for freshness itself is displayed in days, hours, minutes, seconds, and milliseconds; see the example below.
In Soda Cloud, the freshness value represents age of the data in the days, hours, minutes, etc. relative to now_timestamp
. In other words, (scan time - (max of date_column))
.
Soda Library 1.0.x
Soda Core 3.0.x
Scan summary:
1/1 checks FAILED:
Data is fresh [FAILED]
max_column_timestamp: 2013-07-01 00:00:00
max_column_timestamp_utc: 2013-07-01 00:00:00+00:00
now_variable_name: NOW
now_timestamp: 2022-09-13T16:40:39.196522+00:00
now_timestamp_utc: 2022-09-13 16:40:39.196522+00:00
freshness: 3361 days, 16:40:39.196522
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
Optional check configurations
Supported | Configuration | Documentation |
---|---|---|
✓ | Define a name for a freshness check; see example. | Customize check names |
✓ | Add an identity to a check. | Add a check identity |
✓ | Define alert configurations to specify warn and fail thresholds; see example. | Add alert configurations |
✓ | Apply an in-check filter to return results for a specific portion of the data in your dataset; see example. | - |
✓ | Use quotes when identifying dataset or column names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark. | Use quotes in a check |
Use wildcard characters ( % or * ) in values in the check. | - | |
✓ | Use for each to apply freshness checks to multiple datasets in one scan; see example. | Apply checks to multiple datasets |
✓ | Apply a dataset filter to partition data during a scan; see example. | Scan a portion of your dataset |
Example with check name
checks for dim_product:
- freshness(start_date) < 27h:
name: Data is fresh
Example with alert configuration
The only comparison symbol that you can use with freshness checks that employ an alert configuration is >
.
checks for dim_product:
- freshness(start_date):
warn: when > 3256d
fail: when > 3258d
OR
checks for dim_product:
- freshness(start_date):
warn:
when > 3256d
fail:
when > 3258d
Example with in-check filter
checks for dim_product:
- freshness(start_date) < 27h:
filter: weight = 10
Example with quotes
checks for dim_product:
- freshness("end_date") < 3d
Example with for each
for each dataset T:
datasets:
- dim_prod%
checks:
- freshness(end_date) < 3d
Example with dataset filter
filter CUSTOMERS [daily]:
where: TIMESTAMP '{ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
checks for CUSTOMERS [daily]:
- freshness(end_date) < 3d
List of freshness thresholds
Threshold | Example | Reads as |
---|---|---|
#d | 3d | 3 days |
#h | 1h | 1 hour |
#m | 30m | 30 minutes |
#d#h | 1d6h | 1 day and 6 hours |
#h#m | 1h30m | 1 hour and 30 minutes |
List of comparison symbols and phrases
# If using without an alert configuration
<
# If using with an alert configuration
>
Go further
- Use missing metrics in checks with alert configurations to establish warn and fail zones
- Use missing metrics in checks to define ranges of acceptable thresholds using boundary thresholds.
- Need help? Join the Soda community on Slack.
- Reference tips and best practices for SodaCL.
Was this documentation helpful?
What could we do to improve this page?
- Suggest a docs change in GitHub.
- Share feedback in the Soda community on Slack.
Documentation always applies to the latest version of Soda products
Last modified on 20-Nov-24