Sample data with Soda
Last modified on 30-Nov-23
When you add or edit a data source in Soda Cloud, use the sample datasets configuration to send 100 sample rows to Soda Cloud. Examine the sample rows to gain insight into the types of checks you can prepare to test for data quality.
sample datasets:
  datasets:
    - dim_customer
    - include prod%
    - exclude test%
Sample datasets
The sample datasets configuration captures sample rows from the datasets you identify. You add sample datasets configurations as part of the guided workflow to create a new data source or edit an existing one. To begin, navigate to your avatar > Data Sources > New Data Source, or select an existing data source. You can add this configuration in one of two places:
- step 3, Discover Datasets
- step 4, Profile Datasets
The example configuration below uses a wildcard character (%) to specify that Soda Library sends sample rows to Soda Cloud for all datasets with names that begin with customer, and does not send samples for any dataset with a name that begins with test.
sample datasets:
  datasets:
    - include customer%
    - exclude test%
You can also specify individual datasets to include or exclude, as in the following example.
sample datasets:
  datasets:
    - include retail_orders
Scan results in Soda Cloud
- To review the sample rows in Soda Cloud, first run a scan of your data source so that Soda can gather and send samples to Soda Cloud.
- In Soda Cloud, navigate to the Datasets dashboard, then click a dataset name to open the dataset’s info page.
- Access the Sample Data tab to review the sample rows.
Inclusion and exclusion rules
- If you configure sample datasets to include specific datasets, Soda implicitly excludes all other datasets from sampling.
- If you combine an include config and an exclude config and a dataset fits both patterns, Soda excludes the dataset from sampling.
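The two rules above can be modeled in a few lines of Python. This is an illustrative sketch only, not Soda's implementation: the function names are hypothetical, and it assumes that % matches any run of characters, that matching is case-sensitive, that a bare dataset name (such as dim_customer) counts as an include, and that a configuration with no include rules samples every dataset that is not excluded.

```python
import re


def _wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a %-wildcard pattern into an anchored regular expression."""
    # Assumption: % behaves like "any run of characters", matched case-sensitively.
    parts = (re.escape(part) for part in pattern.split("%"))
    return re.compile("^" + ".*".join(parts) + "$")


def is_sampled(dataset: str, rules: list[str]) -> bool:
    """Hypothetical helper: decide whether a dataset would be sampled under the rules."""
    includes, excludes = [], []
    for rule in rules:
        action, _, pattern = rule.strip().partition(" ")
        if not pattern:
            # Assumption: a bare name such as "dim_customer" is treated as an include.
            action, pattern = "include", action
        target = excludes if action == "exclude" else includes
        target.append(_wildcard_to_regex(pattern))

    # A dataset that fits an exclude pattern is never sampled, even if it
    # also fits an include pattern.
    if any(rx.match(dataset) for rx in excludes):
        return False
    # Assumption: with no include rules at all, every remaining dataset is sampled.
    if not includes:
        return True
    # Any include rule implicitly excludes the datasets it does not match.
    return any(rx.match(dataset) for rx in includes)


rules = ["dim_customer", "include prod%", "exclude test%"]
for name in ["dim_customer", "prod_orders", "test_orders", "retail_orders"]:
    print(name, is_sampled(name, rules))
```

With these rules, dim_customer and prod_orders are sampled, test_orders is excluded explicitly, and retail_orders is excluded implicitly because it matches no include rule.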
Disable samples in Soda Cloud
Where your datasets contain sensitive or private information, you may not want to send samples from your data source to Soda Cloud. In such a circumstance, you can disable the feature completely in Soda Cloud.
To prevent Soda Cloud from receiving any sample data or failed row samples for any datasets in any data sources to which you have connected your Soda Cloud account, proceed as follows:
- As an Admin, log in to your Soda Cloud account and navigate to your avatar > Organization Settings.
- In the Organization tab, check the box to “Disable collecting samples and failed rows for metrics in Soda Cloud”, then Save.
Alternatively, if you use Soda Library, you can adjust the configuration in your configuration.yml to disable all samples, as in the following example.
data_source my_datasource:
  type: postgres
  ...
  sampler:
    disable_samples: True
Note that you cannot use an exclude_columns configuration to disable sample row collections from specific columns in a dataset. That configuration applies only to disabling failed rows sampling.
Go further
- Need help? Join the Soda community on Slack.
- Reference tips and best practices for SodaCL.
- Use a freshness check to gauge how recently your data was captured.
- Use reference checks to compare the values of one column to another.
Documentation always applies to the latest version of Soda products