# data.world

Integrate Soda with data.world to surface data quality check results directly within your data.world catalog.

* Run data quality checks using Soda and push check results and quality scores to the corresponding tables in your data.world catalog.
* Give data.world users confidence that the data they are using is sound.
* Drill down from a data.world table directly into Soda Cloud to investigate check details, diagnostics, and historical trends.

<figure><img src="https://1123167021-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FA2PmHkO5cBgeRPdiPPOG%2Fuploads%2Fe9nOxwTmVz6bwqqcqlSc%2Fimage.png?alt=media&#x26;token=6d6f70bd-0ac0-4f11-9c1c-c97c246faa4c" alt=""><figcaption><p>Example view of the Data Quality results tab for a table in Data.world</p></figcaption></figure>

### How it works

The integration is a lightweight CLI tool (`soda-dw-sync`) that you install and schedule. On each run it:

1. Reads your data source mapping from `ds_config.yml` to determine which Soda Cloud data sources to process and how they map to data.world collections.
2. For each data source, retrieves all datasets from the Soda Cloud API.
3. For each dataset, retrieves all checks and their latest evaluation results.
4. Transforms those results into the format expected by the data.world Check Runs and Badges APIs.
5. Posts check results and a dataset-level quality badge (Good / Moderate / Poor) to the matching table in data.world. Tables are identified by their fully-qualified name (`database.schema.table`) extracted from Soda Cloud.
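The transform in steps 3–5 can be sketched as follows. This is an illustrative sketch, not the tool's actual code: the function name and the result-field names (`checks`, `type`, `qualifiedName`, `latestOutcome`) are invented for the example.

```python
def build_check_runs(datasets):
    """Group each dataset's latest check results by fully-qualified
    table name, skipping Metric Monitor checks (which are not synced)."""
    payloads = {}
    for ds in datasets:
        eligible = [c for c in ds["checks"] if c["type"] != "metricMonitor"]
        if not eligible:
            # Datasets with no eligible checks are skipped entirely
            continue
        payloads[ds["qualifiedName"]] = [
            {"check": c["name"], "outcome": c["latestOutcome"]}
            for c in eligible
        ]
    return payloads
```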

***

## Prerequisites

* Python 3.10+
* You have completed at least one Soda scan and confirmed that the datasets you wish to sync appear in Soda Cloud.
* You have a data.world account with the privileges necessary to write to the target catalog collection.
* You have a git repository (or local directory) in which to store the integration project files.
* Access to the data.world repository. Contact us at <support@soda.io> to receive access.

## Set up the integration

{% stepper %}
{% step %}

#### Install

{% hint style="info" %}
Installing and using the data.world integration requires access to the integration repository. **Contact us at** [**support@soda.io**](mailto:support@soda.io) **to get access to the repository.**
{% endhint %}

As a best practice, **install the integration in a virtual environment** and store secrets in **environment variables**. If you haven't already, create a virtual environment in the `.venv` directory from your command-line interface.

```bash
# Optional: create a virtual environment first
python3.10 -m venv .venv
source .venv/bin/activate
```

Clone or download the data.world integration repository, then install from the repo root:

```bash
pip install .
```

Alternatively, install directly from a release wheel:

```bash
pip install soda_dw_integration-0.1.0-py3-none-any.whl
```

{% endstep %}

{% step %}

#### Configure Soda Cloud credentials

Create a `sc_config.yml` file with your Soda Cloud connection details.

{% code title="sc\_config.yml" %}

```yaml
soda_cloud:
  host: cloud.soda.io               # or cloud.us.soda.io for US region
  api_key_id: ${env.SODA_API_KEY_ID}
  api_key_secret: ${env.SODA_API_KEY_SECRET}
  dimension_attribute: quality_dimension   # optional; see note below
```

{% endcode %}

> Refer to [Generate API keys](https://docs.soda.io/soda-cloud/api-keys.html) to create the `api_key_id` and `api_key_secret` values.

**Note:** The `dimension_attribute` field is optional. When set, the tool reads the named check attribute from Soda Cloud and uses its value as the quality dimension label in data.world. If the attribute is absent or the field is omitted, the check's `checkType` is used instead.
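For example, export the variables referenced by the `${env.VAR}` placeholders in the shell that runs the sync (the values shown are placeholders, not real keys):

```bash
# Placeholders: replace with your own Soda Cloud API key values
export SODA_API_KEY_ID="your-key-id"
export SODA_API_KEY_SECRET="your-key-secret"
```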
{% endstep %}

{% step %}

#### Configure data source mappings

Create a `ds_config.yml` file that maps each Soda Cloud data source to the corresponding collection in data.world. Add one entry per data source.

{% code title="ds\_config.yml" %}

```yaml
data_sources:
  your_datasource_name:                     # must match the data source name in Soda Cloud
    context: <org-agent-id-of-source-org>
    databaseLocation: <host:port>
    orgAgentId: <org-agent-id-of-target-org>
    collectionIri: <collection-iri>
    dw_api_token: ${env.DW_API_TOKEN}
```

{% endcode %}

The table below describes each field. All values except `dw_api_token` come from data.world.

| Field              | Description                                                                                                               |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------- |
| `context`          | The agent ID of the data.world org that originally cataloged the database objects (tables).                               |
| `databaseLocation` | The `host:port` of the source database, e.g. `myorg.snowflakecomputing.com:443` or `dbc-abc123.cloud.databricks.com:443`. |
| `orgAgentId`       | The agent ID of the data.world org that owns the collection where check results will be stored.                           |
| `collectionIri`    | The IRI of the data.world collection where check results will be stored.                                                  |
| `dw_api_token`     | A data.world API token with write access to the target collection.                                                        |

> For more details on how data.world uses these identifiers, see the [data.world Add Check Runs API reference](https://developer.data.world/reference/addcheckruns).

{% endstep %}

{% step %}

#### Verify your setup with a dry run

Before uploading anything to data.world, use `--dry-run` to write the payloads to local JSON files for inspection:

```bash
soda-dw-sync \
  --ds-config ./ds_config.yml \
  --sc-config ./sc_config.yml \
  --dry-run \
  --out-dir ./output
```

Inspect `output/check_runs_<datasource>.json` and `output/badges_<datasource>.json` to confirm the data looks correct, then run without `--dry-run` to push results to data.world.
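As a quick sanity check, you can summarize the dry-run files before pushing. This helper assumes each payload file is a JSON array; adjust it if your version of the tool writes a different shape:

```python
import glob
import json

def summarize_dry_run(out_dir="output"):
    """Return {path: item_count} for each dry-run payload file."""
    summary = {}
    patterns = (f"{out_dir}/check_runs_*.json", f"{out_dir}/badges_*.json")
    for pattern in patterns:
        for path in glob.glob(pattern):
            with open(path) as f:
                summary[path] = len(json.load(f))
    return summary
```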
{% endstep %}

{% step %}

#### Run the integration

```bash
soda-dw-sync --ds-config ./ds_config.yml --sc-config ./sc_config.yml
```

**Useful CLI options:**

| Option                           | Description                                                                      |
| -------------------------------- | -------------------------------------------------------------------------------- |
| `--only-dataset db.schema.table` | Limit the sync to one or more specific datasets. Repeatable.                     |
| `--dry-run`                      | Write JSON payloads to `--out-dir` instead of uploading to data.world.           |
| `--out-dir ./output`             | Directory for dry-run JSON output files.                                         |
| `--log-level DEBUG`              | Increase log verbosity. Accepts `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. |

{% hint style="info" %}
Run `soda-dw-sync --help` to see all options.
{% endhint %}
{% endstep %}
{% endstepper %}

## Schedule the integration

### Cron

After a local install, schedule with cron to keep data.world up to date automatically:

```bash
# Runs daily at 02:15
15 2 * * * /path/to/.venv/bin/soda-dw-sync \
  --ds-config /etc/soda-dw/ds_config.yml \
  --sc-config /etc/soda-dw/sc_config.yml \
  >> /var/log/soda-dw-sync.log 2>&1
```

To schedule the job, add this entry to your crontab with `crontab -e`; remove the entry to stop the job.
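If a sync run could still be in progress when the next one starts (for example, against a large catalog), a lock guard such as `flock` prevents overlapping runs:

```bash
# Same job as above, but skip the run if the previous one is still going
15 2 * * * flock -n /tmp/soda-dw-sync.lock /path/to/.venv/bin/soda-dw-sync \
  --ds-config /etc/soda-dw/ds_config.yml \
  --sc-config /etc/soda-dw/sc_config.yml \
  >> /var/log/soda-dw-sync.log 2>&1
```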

***

### Docker

Build the image:

```bash
docker build -t soda-dw:local .
```

Run via cron using Docker:

```bash
15 2 * * * docker run --rm \
  -v /etc/soda-dw:/etc/soda-dw:ro \
  -e SODA_API_KEY_ID=$SODA_API_KEY_ID \
  -e SODA_API_KEY_SECRET=$SODA_API_KEY_SECRET \
  -e DW_API_TOKEN=$DW_API_TOKEN \
  soda-dw:local \
  --ds-config /etc/soda-dw/ds_config.yml \
  --sc-config /etc/soda-dw/sc_config.yml \
  >> /var/log/soda-dw-sync.log 2>&1
```

***

### Kubernetes

The `k8s/` directory in the repository contains ready-to-use manifest templates for running the sync as a Kubernetes CronJob.

Load your local image into the cluster:

```bash
minikube start -p minikube --driver=docker
minikube image load -p minikube soda-dw:local
```

Apply the manifests:

```bash
kubectl apply -f k8s/secret.example.yml
kubectl apply -f k8s/configmap.example.yml
kubectl apply -f k8s/cronjob.local-image.yml
kubectl get cronjobs
```

View logs from the most recent job run:

```bash
kubectl logs job/<job-name>
```

Delete the CronJob to stop scheduling:

```bash
kubectl delete cronjob soda-dw-sync
```

The ConfigMap mounts both `sc_config.yml` and `ds_config.yml` into the container. Store secrets (API keys) in a Kubernetes Secret and reference them via `envFrom` in the CronJob spec, using the `${env.VAR}` syntax in your YAML files.
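The Secret referenced by `envFrom` might look like the following sketch. The name `soda-dw-secrets` is illustrative; use whatever name your CronJob spec references (see `k8s/secret.example.yml` in the repository):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: soda-dw-secrets   # must match the secretRef in the CronJob's envFrom
type: Opaque
stringData:               # plain-text values; Kubernetes encodes them for you
  SODA_API_KEY_ID: your-key-id
  SODA_API_KEY_SECRET: your-key-secret
  DW_API_TOKEN: your-token
```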

***

## Use the integration

Once the sync runs, navigate to any cataloged table in data.world. Open the **Data Quality** tab to see:

* All Soda checks associated with the table, their latest result (Pass / Fail / Warn), and a direct link back to Soda Cloud for each check.
* A dataset-level quality badge. The default thresholds are:
  * **Good** (≥ 90% passing)
  * **Moderate** (≥ 70% passing)
  * **Poor** (< 70% passing)
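Those default thresholds amount to the following mapping (an illustrative sketch, not the integration's actual code):

```python
def quality_badge(passed: int, total: int) -> str:
    """Map a dataset's check pass rate to the default badge value."""
    if total == 0:
        raise ValueError("no evaluated checks")
    rate = passed / total
    if rate >= 0.90:
        return "Good"
    if rate >= 0.70:
        return "Moderate"
    return "Poor"
```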

Click any check link to open the corresponding check page in Soda Cloud, where you can review diagnostics, historical results, and metric values.

***

## Limitations & edge cases

* **Metric Monitor checks are not synced.** Checks that use Soda's anomaly detection (Metric Monitors) are automatically skipped. Only threshold-based checks are pushed to data.world.
* **Datasets with no eligible checks are skipped.** If a dataset has no checks, or all of its checks are Metric Monitors, no check runs or badge are posted for that dataset.
* **Table identity depends on qualified name format.** Tables are matched in data.world using the fully-qualified name extracted from Soda Cloud (`database.schema.table` or `database.table`). If the name in Soda Cloud does not match what data.world expects, results will not surface on the correct table.

