
Send sample data to Soda Cloud

When you create new monitors in Soda Cloud, it can help to review sample data from your dataset so that you can decide which tests to run when Soda SQL scans your data; see the image below. To make sample data available, add a samples configuration key to your scan YAML file in Soda SQL.

Alternatively, you can Enable Sample Data directly in your Soda Cloud account. Refer to Display sample data for details.

[Image: sample-data]

Add a samples configuration key

DO NOT use sample data if your dataset contains sensitive information or personally identifiable information (PII).

  1. If you have not already done so, connect Soda SQL to your Soda Cloud account.
  2. Add a samples configuration key to your scan YAML file according to the Scan YAML example below; use table_limit to set the maximum number of rows from the dataset that Soda SQL sends to Soda Cloud after it executes a test during a scan. These rows appear as a sample of your data in the Sample Data tab when you create a new monitor; see image above.
  3. Save the changes to your scan YAML file, then run a scan on that dataset.
    soda scan warehouse.yml tables/orders.yml
    
  4. In your Soda Cloud account, navigate to the Monitors dashboard. Click the stacked-dots icon to Create Monitor. Note that in the first step of the guided monitor creation, you can review sample data from your dataset that Soda SQL collected during its last scan of your dataset.

Scan YAML Example

table_name: orders
metrics:
  - row_count
  - missing_count
  - missing_percentage
  - values_count
  ... 
samples:
  table_limit: 50
tests:
  - row_count > 0
columns:
  orderid:
    valid_format: uuid
    tests:
      - invalid_percentage <= 3

Using the example scan YAML above, the scan executes both tests against all the data in the dataset, but it sends at most 50 rows of data and metadata to Soda Cloud for review as sample data when you create a new monitor for the orders dataset.

The snippet below displays the CLI output of the query that collects the sample; the dataset contains 193 rows, but Soda SQL sends only 50 of them as a sample to Soda Cloud.

  | ...
  | Executing SQL query: 
SELECT * 
FROM "public"."orders" 
LIMIT 50;
  | SQL took 0:00:00.074957
  | Sent sample orders.sample (50/193) to Soda Cloud
  | ...
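
To make the effect of table_limit concrete, the following is a minimal Python sketch that reproduces the same behavior against a throwaway SQLite database. It is an illustration only, not Soda SQL's implementation: the collect_sample helper, the in-memory orders table, and its columns are all hypothetical names made up for this example.

```python
import sqlite3


def collect_sample(conn, table, table_limit):
    """Hypothetical helper: return (sample_rows, total_rows), capping the sample at table_limit."""
    # Table names cannot be bound as query parameters, so quote the identifier explicitly.
    quoted = '"' + table.replace('"', '""') + '"'
    total_rows = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    sample_rows = conn.execute(
        f"SELECT * FROM {quoted} LIMIT ?", (table_limit,)
    ).fetchall()
    return sample_rows, total_rows


# Build an in-memory table with 193 rows to mirror the row count in the output above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orderid INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (amount) VALUES (?)", [(float(i),) for i in range(193)]
)

sample, total = collect_sample(conn, "orders", table_limit=50)
summary = f"Sent sample orders.sample ({len(sample)}/{total})"
print(summary)
```

If table_limit is larger than the number of rows in the table, the LIMIT clause returns every row, so the sample can never exceed the size of the dataset.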


Last modified on 15-Sep-21
