Link Search Menu Expand Document

Group by

Last modified on 27-Sep-23

Use a group by configuration to collect and present check results by category.
Requires Soda Library

checks for fact_internet_sales:
  - group by:
      group_limit: 10
      query: |
        SELECT sales_territory_key, AVG(discount_amount) as average_discount
        FROM fact_internet_sales
        GROUP BY sales_territory_key
      fields:
        - sales_territory_key
      checks:
        - average_discount:
            fail: when > 40
            name: Average discount percentage is less than 40% (grouped-by sales territory)

Define reference checks
Group by check results
Optional check configurations
Go further

Define a group by configuration

For an individual dataset, add a group by configuration to specify the categories into which Soda must group the check results.

The example below uses a SQL query to define a custom metric for the fact_internet_sales dataset. It calculates the average order discount based on the contents of the discount_amount column, then groups the results according to the value in the sales_territory_key.

The check itself uses the custom metric average_discount and an alert configuration to determine if the measurement for each group passes, warns, or fails. In this case, any calculated measurement for average that exceeds 40 results in a fail.

checks for fact_internet_sales:
  - group by:
      group_limit: 10
      query: |
        SELECT sales_territory_key, AVG(discount_amount) as average_discount
        FROM fact_internet_sales
        GROUP BY sales_territory_key
      fields:
        - sales_territory_key
      checks:
        - average_discount:
            fail: when > 40
            name: Average discount percentage is less than 40% (grouped-by sales territory)
group by required configuration section label
group_limit optional the maximum number of groups, or column values, into which Soda must categorize the results. This value must correspond with the number of unique values in the column you identify in the fields section.
query required custom query subsection label. The nested SQL query defines the custom metric average_discount
fields required column subsection label
sales_territory_key required column identifier; the values in this column identify how Soda groups the results.
checks required check subsection label
average_discount required custom metric identifier
fail: when > 40 required fail condition and threshold
warn: when between 50 and 60 only one alert condition is required warn condition and threshold
name required custom name for the check

Group by check results

When you run a scan that includes checks nested in a group by configuration, the output in Soda Library CLI groups the results according to the unique values in the column you identified in the fields subsection. The number of unique values in the column must match the value you provided for group_limit.

In the example results below, the calculated average of average_discount for each sales territory is less than 40%, so the check for each group passed. The value in the square brackets next to the custom check name identifies the group which, in this case, is a number that corresponds to a territory.

Soda 1.0.x
Soda Core 3.0.x
Scan summary:
11/11 checks PASSED: 
    fact_internet_sales in adventureworks
      group by [PASSED]
      Average discount percentage is less than 40% (grouped-by sales territory) [8] [PASSED]
      Average discount percentage is less than 40% (grouped-by sales territory) [10] [PASSED]
      Average discount percentage is less than 40% (grouped-by sales territory) [9] [PASSED]
      Average discount percentage is less than 40% (grouped-by sales territory) [7] [PASSED]
      Average discount percentage is less than 40% (grouped-by sales territory) [1] [PASSED]
      Average discount percentage is less than 40% (grouped-by sales territory) [5] [PASSED]
      Average discount percentage is less than 40% (grouped-by sales territory) [2] [PASSED]
      Average discount percentage is less than 40% (grouped-by sales territory) [4] [PASSED]
      Average discount percentage is less than 40% (grouped-by sales territory) [6] [PASSED]
      Average discount percentage is less than 40% (grouped-by sales territory) [3] [PASSED]
All is good. No failures. No warnings. No errors.
Sending results to Soda Cloud
Soda Cloud Trace: 14733***37

In Soda Cloud, the check results appear by territory.

group-by-4


You must always match the number of unique values in the group by column to the value you provide for the group_limit. In the following example, the number of unique values in the sales_territory_key column is greater than the group_limit: 2 so Soda does not evaluate the check.

checks for dim_employee:
  - group by:
      group_limit: 2
      query: |
        SELECT sales_territory_key, AVG(vacation_hours) as vacation_calc
        FROM dim_employee
        GROUP BY sales_territory_key
      fields:
        - sales_territory_key
      checks:
        - vacation_calc > 60:
            name: Reasonable vacation hours
Soda Library 1.0.x
Soda Core 3.0.x
Evaluation of check group by failed: Total number of groups 11 exceeds configured group limit: 2
  | Total number of groups 11 exceeds configured group limit: 2
  +-> line=2,col=5 in checks_groupby.yml
Scan summary:
1/1 check NOT EVALUATED: 
    dim_employee in adventureworks
      group by [NOT EVALUATED]
1 checks not evaluated.
1 errors.
Oops! 1 error. 0 failures. 0 warnings. 0 pass.
ERRORS:
Evaluation of check group by failed: Total number of groups 11 exceeds configured group limit: 2
  | Total number of groups 11 exceeds configured group limit: 2
  +-> line=2,col=5 in checks_groupby.yml

Resolve the issue by increasing the group limit value.

checks for dim_employee:
  - group by:
      group_limit: 11
      query: |
        SELECT sales_territory_key, AVG(vacation_hours) as vacation_calc
        FROM dim_employee
        GROUP BY sales_territory_key
      fields:
        - sales_territory_key
      checks:
        - vacation_calc > 20:
            name: Too much vacation
Soda Library 1.0.x
Soda Core 3.0.x
Scan summary:
12/12 checks FAILED: 
    dim_employee in adventureworks
      group by [FAILED]
      Too much vacation [3] [FAILED]
        check_value: 24.0000000000000000
      Too much vacation [8] [FAILED]
        check_value: 35.0000000000000000
      Too much vacation [11] [FAILED]
        check_value: 51.3297872340425532
      Too much vacation [9] [FAILED]
        check_value: 36.0000000000000000
      Too much vacation [7] [FAILED]
        check_value: 34.0000000000000000
      Too much vacation [10] [FAILED]
        check_value: 37.0000000000000000
      Too much vacation [1] [FAILED]
        check_value: 28.0000000000000000
      Too much vacation [5] [FAILED]
        check_value: 29.0000000000000000
      Too much vacation [4] [FAILED]
        check_value: 26.5000000000000000
      Too much vacation [2] [FAILED]
        check_value: 38.0000000000000000
      Too much vacation [6] [FAILED]
        check_value: 32.0000000000000000
Oops! 12 failures. 0 warnings. 0 errors. 0 pass.

Optional check configurations

Supported Configuration Documentation
Define a name for a group by; see example. For group by configurations, this normally optional parameter is required. Customize check names
Add an identity to a check. Add a check identity
Define alert configurations to specify warn and fail alert conditions; see example Add alert configurations
  Apply an in-check filter to return results for a specific portion of the data in your dataset. -
Use quotes when identifying dataset or column names; see example.
Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.
Use quotes in a check
Use wildcard characters in the value in the check. Use wildcard values as you would with SQL.
  Use for each to apply group by checks to multiple datasets in one scan. -
  Apply a dataset filter to partition data during a scan. -

Example with check name

checks for dim_employee:
  - group by:
      group_limit: 2
      query: |
        SELECT marital_status, AVG(vacation_hours) as vacation_hours
        FROM dim_employee
        GROUP BY marital_status
      fields:
        - marital_status
      checks:
        - vacation_hours > 60:
            name: Too many vacation hours

Example with alert configuration

Be aware that Soda only ever returns a single check result per check. See Expect one check result for details.

checks for dim_employee:
  - group by:
      group_limit: 2
      query: |
        SELECT marital_status, AVG(vacation_hours) as vacation_hours
        FROM dim_employee
        GROUP BY marital_status
      fields:
        - marital_status
      checks:
        - vacation_hours:
            fail: when > 65
            warn: when between 50 and 65
            name: Too many vacation hours

Example with quotes

Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark.

checks for dim_employee:
  - group by:
      group_limit: 2
      query: |
        SELECT "marital_status", AVG("vacation_hours") as vacation_hours
        FROM "dim_employee"
        GROUP BY marital_status
      fields:
        - marital_status
      checks:
        - vacation_hours > 60:
            name: Too many vacation hours


Expect one check result

Be aware that a check that contains one or more alert configurations only ever yields a single check result; one check yields one check result. If your check triggers both a warn and a fail, the check result only displays the more severe, failed check result. (Schema checks behave slightly differently; see Schema checks.)

Using the following example, Soda Library, during a scan, discovers that the data in the dataset triggers both alerts, but the check result is still Only 1 warning. Nonetheless, the results in the CLI still display both alerts as having both triggered a [WARNED] state.

checks for dim_customer:
  - row_count:
      warn:
        when > 2
        when > 3
Soda Library 1.0.x
Soda Core 3.0.x
Scan summary:
1/1 check WARNED: 
    dim_customer in adventureworks
      row_count warn when > 2 when > 3 [WARNED]
        check_value: 18484
Only 1 warning. 0 failure. 0 errors. 0 pass.
Sending results to Soda Cloud
Soda Cloud Trace: 42812***

The check in the example below data triggers both warn alerts and the fail alert, but only returns a single check result, the more severe Oops! 1 failures.

checks for dim_product:
  - sum(safety_stock_level):
      name: Stock levels are safe
      warn:
        when > 0
      fail:
        when > 0
Soda Library 1.0.x
Soda Core 3.0.x
Scan summary:
1/1 check FAILED: 
    dim_product in adventureworks
      Stock levels are safe [FAILED]
        check_value: 275936
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
Sending results to Soda Cloud
Soda Cloud Trace: 6016***

Go further


Was this documentation helpful?

What could we do to improve this page?

Documentation always applies to the latest version of Soda products
Last modified on 27-Sep-23