Group by
Last modified on 27-Sep-23
Use a group by configuration to collect and present check results by category.
Requires Soda Library
checks for fact_internet_sales:
- group by:
group_limit: 10
query: |
SELECT sales_territory_key, AVG(discount_amount) as average_discount
FROM fact_internet_sales
GROUP BY sales_territory_key
fields:
- sales_territory_key
checks:
- average_discount:
fail: when > 40
name: Average discount percentage is less than 40% (grouped-by sales territory)
Define reference checks
Group by check results
Optional check configurations
Go further
Define a group by configuration
For an individual dataset, add a group by configuration to specify the categories into which Soda must group the check results.
The example below uses a SQL query to define a custom metric for the fact_internet_sales
dataset. It calculates the average order discount based on the contents of the discount_amount
column, then groups the results according to the value in the sales_territory_key
.
The check itself uses the custom metric average_discount
and an alert configuration to determine if the measurement for each group passes, warns, or fails. In this case, any calculated measurement for average that exceeds 40 results in a fail.
checks for fact_internet_sales:
- group by:
group_limit: 10
query: |
SELECT sales_territory_key, AVG(discount_amount) as average_discount
FROM fact_internet_sales
GROUP BY sales_territory_key
fields:
- sales_territory_key
checks:
- average_discount:
fail: when > 40
name: Average discount percentage is less than 40% (grouped-by sales territory)
group by | required | configuration section label |
group_limit | optional | the maximum number of groups, or column values, into which Soda must categorize the results. This value must correspond with the number of unique values in the column you identify in the fields section. |
query | required | custom query subsection label. The nested SQL query defines the custom metric average_discount |
fields | required | column subsection label |
sales_territory_key | required | column identifier; the values in this column identify how Soda groups the results. |
checks | required | check subsection label |
average_discount | required | custom metric identifier |
fail: when > 40 | required | fail condition and threshold |
warn: when between 50 and 60 | only one alert condition is required | warn condition and threshold |
name | required | custom name for the check |
Group by check results
When you run a scan that includes checks nested in a group by configuration, the output in Soda Library CLI groups the results according to the unique values in the column you identified in the fields
subsection. The number of unique values in the column must match the value you provided for group_limit
.
In the example results below, the calculated average of average_discount
for each sales territory is less than 40%, so the check for each group passed. The value in the square brackets next to the custom check name identifies the group which, in this case, is a number that corresponds to a territory.
Soda 1.0.x
Soda Core 3.0.x
Scan summary:
11/11 checks PASSED:
fact_internet_sales in adventureworks
group by [PASSED]
Average discount percentage is less than 40% (grouped-by sales territory) [8] [PASSED]
Average discount percentage is less than 40% (grouped-by sales territory) [10] [PASSED]
Average discount percentage is less than 40% (grouped-by sales territory) [9] [PASSED]
Average discount percentage is less than 40% (grouped-by sales territory) [7] [PASSED]
Average discount percentage is less than 40% (grouped-by sales territory) [1] [PASSED]
Average discount percentage is less than 40% (grouped-by sales territory) [5] [PASSED]
Average discount percentage is less than 40% (grouped-by sales territory) [2] [PASSED]
Average discount percentage is less than 40% (grouped-by sales territory) [4] [PASSED]
Average discount percentage is less than 40% (grouped-by sales territory) [6] [PASSED]
Average discount percentage is less than 40% (grouped-by sales territory) [3] [PASSED]
All is good. No failures. No warnings. No errors.
Sending results to Soda Cloud
Soda Cloud Trace: 14733***37
In Soda Cloud, the check results appear by territory.
You must always match the number of unique values in the group by column to the value you provide for the group_limit
. In the following example, the number of unique values in the sales_territory_key
column is greater than the group_limit: 2
so Soda does not evaluate the check.
checks for dim_employee:
- group by:
group_limit: 2
query: |
SELECT sales_territory_key, AVG(vacation_hours) as vacation_calc
FROM dim_employee
GROUP BY sales_territory_key
fields:
- sales_territory_key
checks:
- vacation_calc > 60:
name: Reasonable vacation hours
Soda Library 1.0.x
Soda Core 3.0.x
Evaluation of check group by failed: Total number of groups 11 exceeds configured group limit: 2
| Total number of groups 11 exceeds configured group limit: 2
+-> line=2,col=5 in checks_groupby.yml
Scan summary:
1/1 check NOT EVALUATED:
dim_employee in adventureworks
group by [NOT EVALUATED]
1 checks not evaluated.
1 errors.
Oops! 1 error. 0 failures. 0 warnings. 0 pass.
ERRORS:
Evaluation of check group by failed: Total number of groups 11 exceeds configured group limit: 2
| Total number of groups 11 exceeds configured group limit: 2
+-> line=2,col=5 in checks_groupby.yml
Resolve the issue by increasing the group limit value.
checks for dim_employee:
- group by:
group_limit: 11
query: |
SELECT sales_territory_key, AVG(vacation_hours) as vacation_calc
FROM dim_employee
GROUP BY sales_territory_key
fields:
- sales_territory_key
checks:
- vacation_calc > 20:
name: Too much vacation
Soda Library 1.0.x
Soda Core 3.0.x
Scan summary:
12/12 checks FAILED:
dim_employee in adventureworks
group by [FAILED]
Too much vacation [3] [FAILED]
check_value: 24.0000000000000000
Too much vacation [8] [FAILED]
check_value: 35.0000000000000000
Too much vacation [11] [FAILED]
check_value: 51.3297872340425532
Too much vacation [9] [FAILED]
check_value: 36.0000000000000000
Too much vacation [7] [FAILED]
check_value: 34.0000000000000000
Too much vacation [10] [FAILED]
check_value: 37.0000000000000000
Too much vacation [1] [FAILED]
check_value: 28.0000000000000000
Too much vacation [5] [FAILED]
check_value: 29.0000000000000000
Too much vacation [4] [FAILED]
check_value: 26.5000000000000000
Too much vacation [2] [FAILED]
check_value: 38.0000000000000000
Too much vacation [6] [FAILED]
check_value: 32.0000000000000000
Oops! 12 failures. 0 warnings. 0 errors. 0 pass.
Optional check configurations
Supported | Configuration | Documentation |
---|---|---|
✓ | Define a name for a group by; see example. For group by configurations, this normally optional parameter is required. | Customize check names |
✓ | Add an identity to a check. | Add a check identity |
✓ | Define alert configurations to specify warn and fail alert conditions; see example | Add alert configurations |
Apply an in-check filter to return results for a specific portion of the data in your dataset. | - | |
✓ | Use quotes when identifying dataset or column names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark. | Use quotes in a check |
✓ | Use wildcard characters in the value in the check. | Use wildcard values as you would with SQL. |
Use for each to apply group by checks to multiple datasets in one scan. | - | |
Apply a dataset filter to partition data during a scan. | - |
Example with check name
checks for dim_employee:
- group by:
group_limit: 2
query: |
SELECT marital_status, AVG(vacation_hours) as vacation_hours
FROM dim_employee
GROUP BY marital_status
fields:
- marital_status
checks:
- vacation_hours > 60:
name: Too many vacation hours
Example with alert configuration
Be aware that Soda only ever returns a single check result per check. See Expect one check result for details.
checks for dim_employee:
- group by:
group_limit: 2
query: |
SELECT marital_status, AVG(vacation_hours) as vacation_hours
FROM dim_employee
GROUP BY marital_status
fields:
- marital_status
checks:
- vacation_hours:
fail: when > 65
warn: when between 50 and 65
name: Too many vacation hours
Example with quotes
Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.
checks for dim_employee:
- group by:
group_limit: 2
query: |
SELECT "marital_status", AVG("vacation_hours") as vacation_hours
FROM "dim_employee"
GROUP BY marital_status
fields:
- marital_status
checks:
- vacation_hours > 60:
name: Too many vacation hours
Expect one check result
Be aware that a check that contains one or more alert configurations only ever yields a single check result; one check yields one check result. If your check triggers both a warn
and a fail
, the check result only displays the more severe, failed check result. (Schema checks behave slightly differently; see Schema checks.)
Using the following example, Soda Library, during a scan, discovers that the data in the dataset triggers both alerts, but the check result is still Only 1 warning
. Nonetheless, the results in the CLI still display both alerts as having both triggered a [WARNED]
state.
checks for dim_customer:
- row_count:
warn:
when > 2
when > 3
Soda Library 1.0.x
Soda Core 3.0.x
Scan summary:
1/1 check WARNED:
dim_customer in adventureworks
row_count warn when > 2 when > 3 [WARNED]
check_value: 18484
Only 1 warning. 0 failure. 0 errors. 0 pass.
Sending results to Soda Cloud
Soda Cloud Trace: 42812***
The check in the example below data triggers both warn
alerts and the fail
alert, but only returns a single check result, the more severe Oops! 1 failures.
checks for dim_product:
- sum(safety_stock_level):
name: Stock levels are safe
warn:
when > 0
fail:
when > 0
Soda Library 1.0.x
Soda Core 3.0.x
Scan summary:
1/1 check FAILED:
dim_product in adventureworks
Stock levels are safe [FAILED]
check_value: 275936
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
Sending results to Soda Cloud
Soda Cloud Trace: 6016***
Go further
- Use a group evolution check to surface changes in groups in a dataset.
- Learn more about alert configurations.
- Learn more about SodaCL metrics and checks in general.
- Need help? Join the Soda community on Slack.
Was this documentation helpful?
What could we do to improve this page?
- Suggest a docs change in GitHub.
- Share feedback in the Soda community on Slack.
Documentation always applies to the latest version of Soda products
Last modified on 27-Sep-23