data.world

Learn how to integrate data.world with Soda.

Integrate Soda with data.world to surface data quality check results directly within your data.world catalog.

  • Run data quality checks using Soda and push check results and quality scores to the corresponding tables in your data.world catalog.

  • Give your data.world users the confidence of knowing that the data they are using is sound.

  • Drill down from a data.world table directly into Soda Cloud to investigate check details, diagnostics, and historical trends.

Example view of the Data Quality results tab for a table in data.world

How it works

The integration is a lightweight CLI tool (soda-dw-sync) that you install and schedule. On each run it:

  1. Reads your data source mapping from ds.yml to determine which Soda Cloud data sources to process and how they map to data.world collections.

  2. For each data source, retrieves all datasets from the Soda Cloud API.

  3. For each dataset, retrieves all checks and their latest evaluation results.

  4. Transforms those results into the format expected by the data.world Check Runs and Badges APIs.

  5. Posts check results and a dataset-level quality badge (Good / Moderate / Poor) to the matching table in data.world. Tables are identified by their fully-qualified name (database.schema.table) extracted from Soda Cloud.


Prerequisites

  • Python 3.10+

  • You have completed at least one Soda scan and confirmed that the datasets you wish to sync appear in Soda Cloud.

  • You have a data.world account with the privileges necessary to write to the target catalog collection.

  • You have a git repository (or local directory) in which to store the integration project files.

  • Access to the data.world repository. Contact us at [email protected] to receive access.

Set up the integration

1. Install

Note: Installing and using the data.world integration requires access to the integration repository. Contact us at [email protected] to get access to the repository.

As a best practice, install the integration in a virtual environment and store secrets in environment variables. If you haven't already, use your command-line interface to create a virtual environment in the .venv directory.
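For example:

```shell
# Create an isolated environment in .venv and activate it
python3 -m venv .venv
. .venv/bin/activate
```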

Clone or download the data.world integration repository, then install from the repo root:
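A sketch, assuming the repository is named soda-dw-sync (the actual clone URL is provided when you receive access):

```shell
git clone <repository-url> soda-dw-sync   # URL provided with repository access
cd soda-dw-sync
python -m pip install .
```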

Alternatively, install directly from a release wheel:
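For example (the wheel filename below is hypothetical; use the filename from the release you downloaded):

```shell
python -m pip install soda_dw_sync-1.0.0-py3-none-any.whl   # hypothetical filename
```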

2. Configure Soda Cloud credentials

Create a sc.yml file with your Soda Cloud connection details.
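A sketch of sc.yml; the field names api_key_id, api_key_secret, and dimension_attribute are documented here, while the surrounding layout and host value are assumptions:

```yaml
# sc.yml — Soda Cloud connection details (layout is illustrative)
soda_cloud:
  host: cloud.soda.io                     # adjust for your Soda Cloud region
  api_key_id: ${env.SODA_API_KEY_ID}      # keep secrets in environment variables
  api_key_secret: ${env.SODA_API_KEY_SECRET}
  dimension_attribute: quality_dimension  # optional
```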

Refer to Generate API keys to create the api_key_id and api_key_secret values.

Note: The dimension_attribute field is optional. When set, the tool reads the named check attribute from Soda Cloud and uses its value as the quality dimension label in data.world. If the attribute is absent or the field is omitted, the check's checkType is used instead.

3. Configure data source mappings

Create a ds.yml file that maps each Soda Cloud data source to the corresponding collection in data.world. Add one entry per data source.
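A sketch of one mapping entry; the field names match the descriptions below, but the top-level layout, the name key, and all values shown are assumptions or placeholders:

```yaml
# ds.yml — one entry per Soda Cloud data source (layout is illustrative)
datasources:
  - name: snowflake_prod                               # Soda Cloud data source name (assumed key)
    context: myorg                                     # agent ID that cataloged the tables
    databaseLocation: myorg.snowflakecomputing.com:443 # host:port of the source database
    orgAgentId: myorg                                  # agent ID that owns the target collection
    collectionIri: https://data.world/myorg/collection/abc123   # placeholder IRI
    dw_api_token: ${env.DW_API_TOKEN}                  # keep secrets in environment variables
```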

The list below describes each field. All values except dw_api_token come from data.world.

  • context: The agent ID of the data.world org that originally cataloged the database objects (tables).

  • databaseLocation: The host:port of the source database, e.g. myorg.snowflakecomputing.com:443 or dbc-abc123.cloud.databricks.com:443.

  • orgAgentId: The agent ID of the data.world org that owns the collection where check results will be stored.

  • collectionIri: The IRI of the data.world collection where check results will be stored.

  • dw_api_token: A data.world API token with write access to the target collection.

For more details on how data.world uses these identifiers, see the data.world Add Check Runs API reference.

4. Verify your setup with a dry run

Before uploading anything to data.world, use --dry-run to write the payloads to local JSON files for inspection:
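For example:

```shell
soda-dw-sync --dry-run --out-dir ./output
```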

Inspect output/check_runs_<datasource>.json and output/badges_<datasource>.json to confirm the data looks correct, then run without --dry-run to push results to data.world.

5. Run the integration
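With sc.yml and ds.yml in place, run the tool without --dry-run to push results to data.world. For example:

```shell
# Sync all mapped data sources
soda-dw-sync

# Or limit the run to a single dataset
soda-dw-sync --only-dataset analytics.public.orders
```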

Useful CLI options:

  • --only-dataset db.schema.table: Limit the sync to one or more specific datasets. Repeatable.

  • --dry-run: Write JSON payloads to --out-dir instead of uploading to data.world.

  • --out-dir ./output: Directory for dry-run JSON output files.

  • --log-level DEBUG: Increase log verbosity. Accepts DEBUG, INFO, WARNING, ERROR, CRITICAL.

Tip: Run soda-dw-sync --help to see all options.

Schedule the integration

Cron

After a local install, schedule with cron to keep data.world up to date automatically:
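For example, a crontab entry that runs the sync hourly (the install path and log location are assumptions):

```shell
0 * * * * cd /opt/soda-dw-sync && .venv/bin/soda-dw-sync >> /var/log/soda-dw-sync.log 2>&1
```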

To stop the job, edit your crontab with crontab -e and remove the entry.


Docker

Build the image:
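For example, from the repo root (the image tag is an assumption):

```shell
docker build -t soda-dw-sync:latest .
```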

Run via cron using Docker:
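For example (config and env-file paths are assumptions; cron entries must fit on a single line):

```shell
0 * * * * docker run --rm -v /opt/soda-dw-sync:/app/config --env-file /opt/soda-dw-sync/.env soda-dw-sync:latest
```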


Kubernetes

The k8s/ directory in the repository contains ready-to-use manifest templates for running the sync as a Kubernetes CronJob.

Load your local image into the cluster:
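For example, with a kind cluster (tooling varies; minikube uses minikube image load instead):

```shell
kind load docker-image soda-dw-sync:latest
```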

Apply the manifests:
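For example:

```shell
kubectl apply -f k8s/
```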

View logs from the most recent job run:
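For example, list the Jobs spawned by the CronJob and tail the newest one:

```shell
kubectl get jobs --sort-by=.metadata.creationTimestamp
kubectl logs job/<most-recent-job-name>
```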

Delete the CronJob to stop scheduling:
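For example (the CronJob name comes from the manifest; soda-dw-sync is an assumption):

```shell
kubectl delete cronjob soda-dw-sync
```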

The ConfigMap mounts both sc.yml and ds.yml into the container. Store secrets (API keys) in a Kubernetes Secret and reference them via envFrom in the CronJob spec, using the ${env.VAR} syntax in your YAML files.


Use the integration

Once the sync runs, navigate to any cataloged table in data.world. Open the Data Quality tab to see:

  • All Soda checks associated with the table, their latest result (Pass / Fail / Warn), and a direct link back to Soda Cloud for each check.

  • A dataset-level quality badge. By default, the values are set to:

    • Good (≥ 90% passing)

    • Moderate (≥ 70% passing)

    • Poor (< 70% passing)
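The badge logic above can be sketched as a small function; this is an illustration of the documented default thresholds, not the tool's actual code:

```python
def quality_badge(passed: int, total: int) -> str:
    """Map a dataset's check pass rate to a badge label using the default thresholds."""
    if total == 0:
        # Datasets with no eligible checks are skipped; no badge is posted.
        raise ValueError("dataset has no eligible checks")
    rate = passed / total
    if rate >= 0.90:
        return "Good"
    if rate >= 0.70:
        return "Moderate"
    return "Poor"
```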

Click any check link to open the corresponding check page in Soda Cloud, where you can review diagnostics, historical results, and metric values.


Limitations & edge cases

  • Metric Monitor checks are not synced. Checks that use Soda's anomaly detection (Metric Monitors) are automatically skipped. Only threshold-based checks are pushed to data.world.

  • Datasets with no eligible checks are skipped. If a dataset has no checks, or all of its checks are Metric Monitors, no check runs or badge are posted for that dataset.

  • Table identity depends on qualified name format. Tables are matched in data.world using the fully-qualified name extracted from Soda Cloud (database.schema.table or database.table). If the name in Soda Cloud does not match what data.world expects, results will not surface on the correct table.


