Send profile information to Soda Cloud
Interested in getting early access? Let us know!
Use the discover datasets and/or profile columns configurations in your checks YAML file to send information about datasets and columns to Soda Cloud. Examine the profile information to gain insight into the types of checks you can prepare to test for data quality.
Requires Soda Cloud.
Known issue: Currently, SodaCL does not support column exclusion for the column profiling and dataset discovery configurations when using a Spark data source.
discover datasets:
  datasets:
    - prod_%
    - include prod_%
    - exclude dev_%
profile columns:
  columns:
    - dataset_a.column_a
    - dataset_a.%
    - dataset_%.column_a
    - dataset_%.%
    - %.%
    - include dataset_a.%
    - exclude dataset_a.prod_%
    - exclude dim_geography
Prerequisites
Define dataset discovery
Define column profiling
Optional check configurations
Go further
Prerequisites
- You have installed a Soda Core package in your environment.
- You have configured Soda Core to connect to a data source using a configuration.yml file.
- You have created and connected a Soda Cloud account to Soda Core.
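As a rough sketch of the second and third prerequisites, a configuration.yml file might look like the following. The data source name my_datasource, the PostgreSQL connection values, and the environment variable names are all placeholders assumed for illustration. The exact connection keys depend on your data source type and Soda Core version, so check the connection documentation for your setup.

```yaml
# configuration.yml: a minimal sketch, not a definitive reference.
# "my_datasource" and every value below are hypothetical placeholders.
data_source my_datasource:
  type: postgres
  connection:
    host: localhost
    port: "5432"
    username: ${POSTGRES_USERNAME}   # read from an environment variable
    password: ${POSTGRES_PASSWORD}
    database: postgres
  schema: public

# Keys that connect Soda Core to a Soda Cloud account.
soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_CLOUD_API_KEY_ID}
  api_key_secret: ${SODA_CLOUD_API_KEY_SECRET}
```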
Define dataset discovery
Dataset discovery captures basic information about each dataset, including a dataset’s schema and the columns it contains.
The syntax has only a couple of variable parts: you list the datasets, by name or by wildcard pattern, that Soda Core includes or excludes when it gathers dataset information to send to Soda Cloud.
The example configuration below uses a wildcard character (%) to specify that, during a scan, Soda Core discovers all the datasets the data source contains except those with names that begin with test_.
discover datasets:
  datasets:
    - include %
    - exclude test_%
You can also specify individual datasets to include or exclude, as in the following example.
discover datasets:
  datasets:
    - include retail_orders
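Wildcard and exact-name entries can presumably share one list. The sketch below uses hypothetical dataset names (retail_%, dim_customer, retail_tmp_%) and assumes, by analogy with the test_% example above, that an exclude entry removes its matches from whatever the include entries selected.

```yaml
discover datasets:
  datasets:
    - include retail_%        # every dataset whose name starts with retail_
    - include dim_customer    # plus one dataset named explicitly
    - exclude retail_tmp_%    # minus temporary retail datasets (assumed naming)
```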
Scan results in Soda Cloud
- To review the discovered datasets in Soda Cloud, first run a scan of your data source so that Soda Core can gather and send dataset information to Soda Cloud.
- In Soda Cloud, navigate to the Datasets dashboard, then click a dataset name to open the dataset’s info page.
- Access the Columns tab to review the columns that Soda Core discovered in the dataset, including the type of data each column contains.
Define column profiling
Column profile information includes details such as the calculated mean value of data in a column, the maximum and minimum values in a column, and the number of rows with missing data. Column profiling can be resource-heavy, so carefully consider the datasets for which you truly need column profile information.
The syntax has only a couple of variable parts: you list the columns, as dataset.column names or wildcard patterns, for which Soda Core gathers profile information to send to Soda Cloud.
The example configuration below uses a wildcard character (%) to specify that, during a scan, Soda Core captures the column profile information for all the columns in the dataset named retail_orders. The . in the syntax separates the dataset name from the column name.
profile columns:
  columns:
    - retail_orders.%
You can also specify individual columns to profile, as in the following example.
profile columns:
  columns:
    - retail_orders.billing_address
Refer to the top of the page for more example configurations for column profiling.
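Because column profiling can be resource-heavy, one way to narrow its scope is to pair a wildcard include with an exclude, following the include and exclude forms shown at the top of the page. The internal_ column prefix here is hypothetical, and per the known issue noted at the top of the page, column exclusion is not currently supported with a Spark data source.

```yaml
profile columns:
  columns:
    - include retail_orders.%            # all columns in retail_orders
    - exclude retail_orders.internal_%   # skip columns starting with internal_ (assumed naming)
```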
Scan results in Soda Cloud
- To review the profiled columns in Soda Cloud, first run a scan of your data source so that Soda Core can gather and send column profile information to Soda Cloud.
- In Soda Cloud, navigate to the Datasets dashboard, then click a dataset name to open the dataset’s info page.
- Access the Columns tab to review the columns that Soda Core profiled.
Optional check configurations
| Supported | Configuration | Documentation |
|---|---|---|
|   | Define a name for sample data configuration. | - |
|   | Define alert configurations to specify warn and fail thresholds. | - |
|   | Apply a filter to return results for a specific portion of the data in your dataset. | - |
| ✓ | Use quotes when identifying dataset names; see example. | Use quotes in a check |
| ✓ | Use wildcard characters (%) with dataset names in the check; see example. | - |
|   | Use for each to apply anomaly score checks to multiple datasets in one scan. | - |
|   | Apply a dataset filter to partition data during a scan. | - |
Example with quotes
discover datasets:
  datasets:
    - include "prod_customer"
Example with wildcards
profile columns:
  columns:
    - retail_orders.%
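Since both configurations belong in the checks YAML file, a single file could plausibly carry them together. The following sketch reuses dataset names from the examples on this page; the file name checks.yml is an assumption.

```yaml
# checks.yml (hypothetical file name)
discover datasets:
  datasets:
    - include %          # discover every dataset in the data source
    - exclude test_%     # except those whose names begin with test_

profile columns:
  columns:
    - retail_orders.%    # profile all columns in retail_orders only
```

Keeping the profiling list narrower than the discovery list reflects the advice above to profile only the datasets that truly need it.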
Go further
- Need help? Join the Soda community on Slack.
- Use a freshness check to gauge how recently your data was captured.
- Use reference checks to compare the values of one column to another.
Last modified on 10-Aug-22