Display sample data for a dataset

When creating new monitors in Soda Cloud, you may find it useful to review sample data from your dataset to help you determine the kinds of tests to run when Soda SQL scans your data. To make that sample data available, use Soda SQL to configure a samples configuration key in your scan YAML file.


Using the information Soda SQL discovered about your datasets, you can optionally instruct it to capture and send sample data to Soda Cloud for specific datasets during the next scan. Enable sample data to display sample rows of data in Soda Cloud (to a maximum of 1000) so that you can make informed choices about the tests to run against your data when you create a monitor. A sample contains the first n rows from the dataset, where n is the limit you specify.
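The sampling rule described above can be illustrated with a minimal Python sketch. It is not Soda SQL's implementation; the function name `take_sample` and the constant `MAX_SAMPLE_ROWS` are hypothetical, and applying the 1000-row display maximum as a hard cap on the limit is an assumption made for illustration.

```python
from itertools import islice
from typing import Iterable, List, TypeVar

Row = TypeVar("Row")

# Display maximum mentioned in this documentation; using it as a hard cap
# here is an assumption for illustration only.
MAX_SAMPLE_ROWS = 1000


def take_sample(rows: Iterable[Row], table_limit: int) -> List[Row]:
    """Return the first `table_limit` rows, never more than MAX_SAMPLE_ROWS."""
    if table_limit < 0:
        raise ValueError("table_limit must not be negative")
    limit = min(table_limit, MAX_SAMPLE_ROWS)
    # islice stops after `limit` rows, so the full dataset is never copied.
    return list(islice(rows, limit))


# Hypothetical dataset of 193 rows, sampled with a limit of 50.
dataset = [{"row": i} for i in range(193)]
sample = take_sample(dataset, 50)
print(len(sample), "of", len(dataset), "rows sampled")
```

The sample is taken from the start of the dataset, so it is not a random or representative selection of rows.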

Send sample data to Soda Cloud

DO NOT send sample data to Soda Cloud if your dataset contains sensitive information or personally identifiable information (PII). For security, you can disable the sample data feature entirely, or configure Soda SQL to reroute failed sample data to an alternate location.

  1. If you have not already done so, connect Soda SQL to your Soda Cloud account.
  2. Add a samples configuration key to your scan YAML file according to the Scan YAML example below. Use table_limit to define the maximum number of rows from a dataset that Soda SQL sends to Soda Cloud after it executes a test during a scan. These rows appear in the Sample Data tab when you create a new monitor. A sample contains the first n rows from the dataset, where n is the limit you specify.
  3. Save the changes to your scan YAML file, then run a scan on that dataset.
    soda scan warehouse.yml tables/orders.yml
  4. In your Soda Cloud account, navigate to the Monitors dashboard. Click the stacked-dots icon to Create Monitor. Note that in the first step of the guided monitor creation, you can review sample data from your dataset that Soda SQL collected during its last scan of your dataset.
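If you trigger scans from a script rather than typing the command by hand, step 3 could be wrapped as in the following sketch. It assumes the `soda` executable is on your PATH and that `soda scan` takes the warehouse YAML file and the scan YAML file as two separate arguments. The helper names `build_scan_command` and `run_scan` are hypothetical and are not part of Soda SQL.

```python
import subprocess
from typing import List


def build_scan_command(warehouse_yml: str, scan_yml: str) -> List[str]:
    """Build the argument list for a `soda scan` invocation."""
    return ["soda", "scan", warehouse_yml, scan_yml]


def run_scan(warehouse_yml: str, scan_yml: str) -> int:
    """Run the scan and return the process exit code.

    Assumes the `soda` CLI is installed and on PATH.
    """
    completed = subprocess.run(
        build_scan_command(warehouse_yml, scan_yml),
        check=False,  # let the caller decide how to treat a non-zero exit code
    )
    return completed.returncode


# Build (but do not run) the command for the orders dataset.
command = build_scan_command("warehouse.yml", "tables/orders.yml")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting issues if a file path contains spaces.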

Scan YAML Example

table_name: orders
metrics:
  - row_count
  - missing_count
  - missing_percentage
  - values_count
samples:
  table_limit: 50
tests:
  - row_count > 0
columns:
  order_id:  # hypothetical column name
    valid_format: uuid
    tests:
      - invalid_percentage <= 3

Using the example scan YAML above, the scan executes both tests against all the data in the dataset, but it only sends a maximum of 50 rows of data and metadata to Soda Cloud for review as sample data when creating a new monitor for the orders dataset.

The snippet below displays the CLI output of the query that counts the rows in the dataset; Soda SQL counts 193 rows but only sends 50 as a sample to Soda Cloud.

  | ...
  | Executing SQL query:
FROM "public"."orders"
  | SQL took 0:00:00.074957
  | Sent sample orders.sample (50/193) to Soda Cloud
  | ...
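To confirm from a script how many rows were sent, you could parse the "Sent sample" line of the CLI output. The sketch below is based only on the single log line shown above; the exact log format may differ between Soda SQL versions, and the name `parse_sent_sample` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Pattern derived from the one example line shown in this document.
SENT_SAMPLE_PATTERN = re.compile(
    r"Sent sample (?P<name>\S+) \((?P<sent>\d+)/(?P<total>\d+)\) to Soda Cloud"
)


def parse_sent_sample(line: str) -> Optional[Tuple[str, int, int]]:
    """Return (sample name, rows sent, total rows), or None if no match."""
    match = SENT_SAMPLE_PATTERN.search(line)
    if match is None:
        return None
    return match.group("name"), int(match.group("sent")), int(match.group("total"))


result = parse_sent_sample("  | Sent sample orders.sample (50/193) to Soda Cloud")
print(result)
```

For the log line above, the function returns the sample name together with 50 rows sent out of 193 rows counted.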

Disable sample data

Where your datasets contain sensitive or private information, you may not want to send sample data from your data source to Soda Cloud. In such a circumstance, you can disable the feature entirely in Soda Cloud.

To prevent Soda Cloud from receiving any sample data or failed row samples for any datasets in any data sources to which you have connected your Soda Cloud account, proceed as follows:

  1. As an Admin, log in to your Soda Cloud account and navigate to your avatar > Organization Settings.
  2. In the Company tab, check the box labeled “Disable storage of sample data and failed row samples in Soda Cloud”, then click Save.

If you use Soda SQL to programmatically schedule scans of individual datasets, you can configure Soda SQL to send a dataset’s samples to a secure location within your organization’s infrastructure, such as an Amazon S3 bucket or Google BigQuery. Refer to Reroute sample data for details.

Last modified on 18-Jan-22