Last modified on 06-Dec-22
Use the sample datasets configuration to send 100 sample rows to Soda Cloud. Examine the sample rows to gain insight into the types of checks you can prepare to test for data quality.
Requires Soda Cloud.
    sample datasets:
      datasets:
        - dim_customer
        - include prod%
        - exclude test%
- You have signed up for a Soda Cloud account.
- You have Administrator rights within your organization’s Soda Cloud account.
- You, or an Administrator in your organization’s Soda Cloud account, have deployed a Soda Agent, which enables you to connect to a data source in Soda Cloud.
To define samples for datasets, follow the guided steps to create a new data source and add the sample configuration in step 4, Profile datasets. Reference the section below for how to configure profiling using SodaCL.
- You have installed a Soda Core package in your environment.
- You have configured Soda Core to connect to a data source using a
- You have created and connected a Soda Cloud account to Soda Core.
Reference the section below for how to configure profiling in a checks YAML file using SodaCL.
This configuration is limited in its syntax variation, with only a couple of mutable parts to specify the datasets from which to gather and send sample rows to Soda Cloud. You can add this configuration to one of two places:
- to your checks YAML file
- to either step 3. Discover Datasets or step 4. Profile Datasets when you add a data source directly in Soda Cloud.
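Assuming the first location is a checks YAML file, as referenced earlier on this page, the following is a minimal sketch of how the configuration might sit alongside ordinary checks in the same file. The file name checks.yml and the row_count check are illustrative assumptions, not requirements stated on this page.

```yaml
# checks.yml (hypothetical file name)

# Send sample rows for this dataset to Soda Cloud
sample datasets:
  datasets:
    - include dim_customer

# An ordinary SodaCL check in the same file (illustrative only)
checks for dim_customer:
  - row_count > 0
```

The sample datasets block is a top-level key of its own, separate from any checks for block.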
The example configuration below uses a wildcard character (%) to specify that Soda Core sends sample rows to Soda Cloud for all datasets with names that begin with customer, and does not send samples for any dataset with a name that begins with test.
    sample datasets:
      datasets:
        - include customer%
        - exclude test%
You can also specify individual datasets to include or exclude, as in the following example.
    sample datasets:
      datasets:
        - include retail_orders
- To review the sample rows in Soda Cloud, first run a scan of your data source so that Soda Core can gather and send samples to Soda Cloud.
- In Soda Cloud, navigate to the Datasets dashboard, then click a dataset name to open the dataset’s info page.
- Access the Sample Data tab to review the sample rows.
| Supported | Configuration | Documentation |
| :-: | --- | --- |
|   | Define a name for sample data configuration. | - |
|   | Add an identity to a check. | - |
|   | Define alert configurations to specify warn and fail thresholds. | - |
|   | Apply an in-check filter to return results for a specific portion of the data in your dataset. | - |
| ✓ | Use quotes when identifying dataset names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark. | Use quotes in a check |
| ✓ | Use wildcard characters (%) with dataset names in the check; see example. | - |
|   | Use for each to apply anomaly score checks to multiple datasets in one scan. | - |
|   | Apply a dataset filter to partition data during a scan. | - |
Example with quotes:

    sample datasets:
      datasets:
        - include "prod_customer"

Example with wildcards:

    sample datasets:
      datasets:
        - include prod%
        - exclude test%
- If you configure sample datasets to include specific datasets, Soda implicitly excludes all other datasets from sampling.
- If you combine an include config and an exclude config and a dataset fits both patterns, Soda excludes the dataset from sampling.
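The two rules above can be illustrated with an annotated sketch. The dataset names in the comments (prod_orders, prod_test_orders, staging_orders) are hypothetical, and the comments restate the behavior described above rather than output from a real scan.

```yaml
sample datasets:
  datasets:
    - include prod%
    - exclude prod_test%

# Hypothetical dataset    Sampled?   Why
# prod_orders             yes        matches include prod%
# prod_test_orders        no         matches both patterns, so the exclude applies
# staging_orders          no         not covered by any include, so implicitly excluded
```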
Where your datasets contain sensitive or private information, you may not want to send samples from your data source to Soda Cloud. In such a circumstance, you can disable the feature completely in Soda Cloud.
To prevent Soda Cloud from receiving any sample data or failed row samples for any datasets in any data sources to which you have connected your Soda Cloud account, proceed as follows:
- As an Admin, log in to your Soda Cloud account and navigate to your avatar > Organization Settings.
- In the Organization tab, check the box to “Disable storage of sample data and failed row samples in Soda Cloud”, then click Save.
Note that you cannot use an exclude_columns configuration to disable sample row collection for specific columns in a dataset. That configuration applies only to failed rows sampling.
- Need help? Join the Soda community on Slack.
- Reference tips and best practices for SodaCL.
- Use a freshness check to gauge how recently your data was captured.
- Use reference checks to compare the values of one column to another.