# Profiling

Profiling helps you understand the shape, quality, and uniqueness of your data before creating checks or metric monitors.

With profiling, you can explore metadata about your dataset, such as **column names, data types, distinct counts, null counts, and summary statistics**. You can also quickly search for specific columns to focus on the attributes that matter most to your analysis.

Profiling is useful for:

* **Business teams**: Gain a fast understanding of what’s inside a dataset, its completeness, and potential anomalies.
* **Data teams**: Validate schema, data types, and distributions before writing quality tests or transformations.
* **Data owners**: Quickly identify unexpected values, nulls, or structural changes in a dataset.

### Key features

* **Dataset overview**: Displays a structured view of all columns, their types, and counts.
* **Interactive navigation**: Scroll through the dataset structure or jump directly to a column of interest.
* **Search and filter**: Quickly locate a column by name to review its profiling details.
* **Column-level insights**:
  * **Statistics**
    * Column name
    * Column data type
    * Number of distinct values
    * Number of missing (null) values
    * Minimum, maximum, mean (for numeric columns)
    * Length, patterns, or categories (for text columns)
  * **Histogram** for numeric columns
  * **Frequent values**
  * **Extreme values** for numeric columns
  * **Data checks** that exist for this column
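To make the column-level statistics listed above concrete, here is a minimal Python sketch that computes a few of them for a single column. It is an illustration only, not Soda's implementation: the function name `profile_column`, the sample column `delay_minutes`, and the choice to report text length as a minimum and maximum are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean


def profile_column(name, values):
    """Compute a few profiling statistics for one column (illustrative only)."""
    present = [v for v in values if v is not None]
    profile = {
        "column": name,
        "distinct": len(set(present)),           # number of distinct values
        "missing": len(values) - len(present),   # number of missing (null) values
        "frequent_values": Counter(present).most_common(3),
    }
    is_numeric = present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        # Minimum, maximum, and mean apply to numeric columns only.
        profile.update(min=min(present), max=max(present), mean=mean(present))
    elif present and all(isinstance(v, str) for v in present):
        # For text columns, report value lengths instead.
        lengths = [len(v) for v in present]
        profile.update(min_length=min(lengths), max_length=max(lengths))
    return profile


# Hypothetical column with one missing value.
delays = profile_column("delay_minutes", [10, 25, None, 10, 40])
print(delays)
```

Missing values are excluded before the distinct count and the numeric statistics are computed, so a null does not count as a distinct value in this sketch.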

<figure><img src="https://1123167021-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FA2PmHkO5cBgeRPdiPPOG%2Fuploads%2FKDHgXMsr46VeCmZmIftu%2Fhttps___files.gitbook.com_v0_b_gitbook-x-prod.appspot.com_o_spaces_2FA2PmHkO5cBgeRPdiPPOG_2Fuploads_2FLTpyRbrafUcyJ5jtphv5_2FScreenshot_202025-05-29_20at_205.09.23_20PM.avif?alt=media&#x26;token=9ea82194-9415-4e07-9204-002796616cbf" alt=""><figcaption></figcaption></figure>

***

## Enable & configure Profiling

### 1. Enable Profiling

You can enable Profiling during [dataset onboarding](https://docs.soda.io/onboard-data-sources-and-datasets/onboard-datasets-on-soda-cloud).

To enable Profiling on an existing dataset, follow these steps:

1. Click **Datasets**, then select the dataset you want to profile
2. Navigate to the **Columns** tab in the dataset view
3. Click on **Update Profiling Configuration**

   <figure><img src="https://1123167021-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FA2PmHkO5cBgeRPdiPPOG%2Fuploads%2FvlTEM40UjGAIOsdZjnQ9%2Fimage.png?alt=media&#x26;token=33231714-98f2-41dd-96cb-7b8bf81d16a2" alt=""><figcaption></figcaption></figure>
4. Toggle on **Enable Profiling**

   <figure><img src="https://1123167021-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FA2PmHkO5cBgeRPdiPPOG%2Fuploads%2FvG4oS3THsseyyKU7VLjL%2Fimage.png?alt=media&#x26;token=dd546172-2d44-4a8c-96d4-8988a3b1eafd" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
When Profiling is enabled **during** [**onboarding**](https://docs.soda.io/onboard-data-sources-and-datasets/onboard-datasets-on-soda-cloud), Soda runs a Profiling scan automatically, **regardless of whether execution is manual or scheduled**.
{% endhint %}

### 2. Configure Profiling

Once Profiling has been enabled, you can configure it to adapt to your organization's needs.

1. Choose a **Profiling schedule**

Profiling runs once every 24 hours. **Choose a UTC time** from the dropdown menu to set the hour at which the scan runs.

<figure><img src="https://1123167021-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FA2PmHkO5cBgeRPdiPPOG%2Fuploads%2FVkfdKvy0i20lKVrCpssM%2Fimage.png?alt=media&#x26;token=e455dbb3-4e04-4860-b560-5dba0459aec6" alt="" width="563"><figcaption></figcaption></figure>

2. Choose a **Profiling strategy**
   * **Use sampling:** Soda profiles a **sample of up to 1 million rows** from the dataset.
   * **Use a time window:** Soda profiles the data that falls within a **30-day time window**, based on the dataset's time-partition column.

{% hint style="info" %}
The **time-partition column** is specified **above the columns table**, on the **Columns** tab of any given dataset.
{% endhint %}

<figure><img src="https://1123167021-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FA2PmHkO5cBgeRPdiPPOG%2Fuploads%2F7wBUNb3lGMLKIKSwu60H%2Fimage.png?alt=media&#x26;token=c91bebb2-c095-4d6b-80ac-b5342376a3dd" alt=""><figcaption><p>In the Bus Breakdown and Delays dataset, the time-partition column is Last_Updated_On</p></figcaption></figure>

3. Click on **Finish**

Profiling is now scheduled.
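The schedule and the two strategies configured above can be sketched in Python as follows. This is a rough model under stated assumptions, not Soda's implementation: the function names `next_profiling_run` and `rows_to_profile` are hypothetical, the sample is taken naively as the first rows, and the 30-day window is assumed to end at the time of the scan. Only the 24-hour cadence, the UTC hour, the 1-million-row limit, and the 30-day window come from the configuration described above.

```python
from datetime import datetime, timedelta, timezone

MAX_SAMPLE_ROWS = 1_000_000  # "sample of up to 1 million rows"
WINDOW_DAYS = 30             # "30-day time window"


def next_profiling_run(now, hour_utc):
    """Return the next daily run at `hour_utc` (0-23), strictly after `now`.

    `now` must be a timezone-aware datetime.
    """
    now_utc = now.astimezone(timezone.utc)
    candidate = now_utc.replace(hour=hour_utc, minute=0, second=0, microsecond=0)
    if candidate <= now_utc:
        candidate += timedelta(days=1)  # today's slot has passed; run tomorrow
    return candidate


def rows_to_profile(rows, strategy, partition_column=None, now=None):
    """Select the rows a profiling scan would read under each strategy."""
    if strategy == "sampling":
        # Assumption: take the first N rows. The real sampling method
        # is not described in this document.
        return rows[:MAX_SAMPLE_ROWS]
    if strategy == "time_window":
        # Assumption: the window ends at `now` and is based on the
        # dataset's time-partition column.
        cutoff = now - timedelta(days=WINDOW_DAYS)
        return [row for row in rows if row[partition_column] >= cutoff]
    raise ValueError(f"unknown strategy: {strategy!r}")


now = datetime(2025, 5, 29, 17, 9, tzinfo=timezone.utc)
rows = [
    {"id": 1, "Last_Updated_On": now - timedelta(days=10)},
    {"id": 2, "Last_Updated_On": now - timedelta(days=45)},
]
print(next_profiling_run(now, hour_utc=3))
print(rows_to_profile(rows, "time_window", "Last_Updated_On", now))
```

In this model, a time window keeps only the row updated 10 days ago, while sampling keeps both rows because the dataset is far below the row limit.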

### Disable Profiling

#### Disable column profiling at the organization level <a href="#disable-column-profiling-at-the-organization-level" id="disable-column-profiling-at-the-organization-level"></a>

To disable column profiling at the organization level, you need **Admin privileges** in your Soda Cloud account. Then follow these steps:

1. Navigate to your avatar.
2. Click on **Organization settings**.
3. Uncheck the box labeled **Allow Soda to collect column profile information**.

***

### How it works

When you open Profiling for a dataset:

1. Soda runs a lightweight scan of the dataset’s metadata and a sample of the data (depending on configuration).
2. It calculates summary statistics for each column.
3. Results are displayed in the Profiling view for exploration.

#### Key considerations

* Soda can only profile columns that contain `NUMBERS` or `TEXT` type data; it cannot profile columns that contain `TIMESTAMP` data except to create a freshness check for the anomaly dashboard.
* Soda performs the Discover datasets and Profile datasets actions independently of each other. If you define `exclude` or `include` rules in the Discover tab, the Profile configuration does not inherit them. For example, if you exclude all datasets that begin with `staging_` in Discover, then configure Profile to include all datasets, Soda discovers and profiles all datasets.
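
The first consideration above can be pictured as a filter applied before any statistics are calculated. The sketch below is a simplified model, not Soda's implementation: the type categories `"number"`, `"text"`, and `"timestamp"`, the function `split_columns`, and the column names `delay_minutes` and `reason` are assumptions for illustration.

```python
# Assumption: each column has already been mapped to a coarse type category.
PROFILABLE_TYPES = {"number", "text"}


def split_columns(schema):
    """Separate columns that can be profiled from those that are skipped."""
    profiled, skipped = [], []
    for column, type_category in schema.items():
        if type_category in PROFILABLE_TYPES:
            profiled.append(column)
        else:
            skipped.append(column)  # for example, timestamp columns
    return profiled, skipped


schema = {
    "delay_minutes": "number",       # hypothetical column
    "reason": "text",                # hypothetical column
    "Last_Updated_On": "timestamp",  # time-partition column from the example dataset
}
profiled, skipped = split_columns(schema)
print(profiled, skipped)
```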

***

## Next steps

After reviewing profiling results, you can:

* Create **tests** based on profiling insights (e.g., "column should not have nulls").
* Set up **monitors** to track data quality over time.
* Export profiling information to support documentation and governance processes.

