data.world

Learn how to integrate data.world with Soda.

Integrate Soda with data.world to surface data quality check results directly within your data.world catalog.

  • Run data quality checks using Soda and push check results and quality scores to the corresponding tables in your data.world catalog.

  • Give your data.world users the confidence of knowing that the data they are using is sound.

  • Drill down from a data.world table directly into Soda Cloud to investigate check details, diagnostics, and historical trends.

Example view of the Data Quality results tab for a table in data.world

How it works

The integration is a lightweight CLI tool (soda-dw-sync) that you install and schedule. On each run it:

  1. Reads your data source mapping from ds.yml to determine which Soda Cloud data sources to process and how they map to data.world collections.

  2. For each data source, retrieves all datasets from the Soda Cloud API.

  3. For each dataset, retrieves all checks and their latest evaluation results.

  4. Transforms those results into the format expected by the data.world Check Runs and Badges APIs.

  5. Posts check results and a dataset-level quality badge (Good / Moderate / Poor) to the matching table in data.world. Tables are identified by their fully-qualified name (database.schema.table) extracted from Soda Cloud.


Prerequisites

  • Python 3.10+

  • You have completed at least one Soda scan and confirmed that the datasets you wish to sync appear in Soda Cloud.

  • You have a data.world account with the privileges necessary to write to the target catalog collection.

  • You have a git repository (or local directory) in which to store the integration project files.

  • Access to the data.world repository. Contact us at [email protected] to receive access.

Set up the integration

1. Install

Note: Installing and using the data.world integration requires access to the integration repository. Contact us at [email protected] to get access to the repository.

As a best practice, install the integration in a virtual environment and store secrets in environment variables. If you haven't already, use your command-line interface to create a virtual environment in the .venv directory.
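For example:

```shell
# Create an isolated environment in .venv and activate it
python3 -m venv .venv
. .venv/bin/activate
```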

Clone or download the data.world integration repository, then install from the repo root:
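A sketch, assuming the repository is named soda-dw-sync (the actual clone URL is provided when you receive access):

```shell
git clone <repository-url> soda-dw-sync   # URL provided with repository access
cd soda-dw-sync
python -m pip install .
```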

Alternatively, install directly from a release wheel:
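For example (the wheel filename below is hypothetical; use the filename from the release you downloaded):

```shell
python -m pip install soda_dw_sync-1.0.0-py3-none-any.whl   # hypothetical filename
```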

2. Configure Soda Cloud credentials

Create a sc.yml file with your Soda Cloud connection details.
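A sketch of sc.yml; the field names api_key_id, api_key_secret, and dimension_attribute are documented here, while the surrounding layout and host value are assumptions:

```yaml
# sc.yml — Soda Cloud connection details (layout is illustrative)
soda_cloud:
  host: cloud.soda.io                     # adjust for your Soda Cloud region
  api_key_id: ${env.SODA_API_KEY_ID}      # keep secrets in environment variables
  api_key_secret: ${env.SODA_API_KEY_SECRET}
  dimension_attribute: quality_dimension  # optional
```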

Refer to Generate API keys to create the api_key_id and api_key_secret values.

Note: The dimension_attribute field is optional. When set, the tool reads the named check attribute from Soda Cloud and uses its value as the quality dimension label in data.world. If the attribute is absent or the field is omitted, the check's checkType is used instead.

3. Configure data source mappings

Create a ds.yml file that maps each Soda Cloud data source to the corresponding collection in data.world. Add one entry per data source.
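A sketch of one mapping entry; the field names match the descriptions below, but the top-level layout, the name key, and all values shown are assumptions or placeholders:

```yaml
# ds.yml — one entry per Soda Cloud data source (layout is illustrative)
datasources:
  - name: snowflake_prod                               # Soda Cloud data source name (assumed key)
    context: myorg                                     # agent ID that cataloged the tables
    databaseLocation: myorg.snowflakecomputing.com:443 # host:port of the source database
    orgAgentId: myorg                                  # agent ID that owns the target collection
    collectionIri: https://data.world/myorg/collection/abc123   # placeholder IRI
    dw_api_token: ${env.DW_API_TOKEN}                  # keep secrets in environment variables
```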

The list below describes each field. All values except dw_api_token come from data.world.

  • context: The agent ID of the data.world org that originally cataloged the database objects (tables).

  • databaseLocation: The host:port of the source database, e.g. myorg.snowflakecomputing.com:443 or dbc-abc123.cloud.databricks.com:443.

  • orgAgentId: The agent ID of the data.world org that owns the collection where check results will be stored.

  • collectionIri: The IRI of the data.world collection where check results will be stored.

  • dw_api_token: A data.world API token with write access to the target collection.

For more details on how data.world uses these identifiers, see the data.world Add Check Runs API reference.

4. Verify your setup with a dry run

Before uploading anything to data.world, use --dry-run to write the payloads to local JSON files for inspection:
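For example:

```shell
soda-dw-sync --dry-run --out-dir ./output
```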

Inspect output/check_runs_<datasource>.json and output/badges_<datasource>.json to confirm the data looks correct, then run without --dry-run to push results to data.world.

5. Run the integration
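With sc.yml and ds.yml in place, run the tool without --dry-run to push results to data.world. For example:

```shell
# Sync all mapped data sources
soda-dw-sync

# Or limit the run to a single dataset
soda-dw-sync --only-dataset analytics.public.orders
```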

Useful CLI options:

  • --only-dataset db.schema.table: Limit the sync to one or more specific datasets. Repeatable.

  • --dry-run: Write JSON payloads to --out-dir instead of uploading to data.world.

  • --out-dir ./output: Directory for dry-run JSON output files.

  • --log-level DEBUG: Increase log verbosity. Accepts DEBUG, INFO, WARNING, ERROR, CRITICAL.

Tip: Run soda-dw-sync --help to see all options.

Schedule the integration

Cron

After a local install, schedule with cron to keep data.world up to date automatically:
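For example, a crontab entry that runs the sync hourly (the install path and log location are assumptions):

```shell
0 * * * * cd /opt/soda-dw-sync && .venv/bin/soda-dw-sync >> /var/log/soda-dw-sync.log 2>&1
```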

To stop the job, edit your crontab with crontab -e and remove the entry.


Docker

Build the image:
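For example, from the repo root (the image tag is an assumption):

```shell
docker build -t soda-dw-sync:latest .
```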

Run via cron using Docker:
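For example (config and env-file paths are assumptions; cron entries must fit on a single line):

```shell
0 * * * * docker run --rm -v /opt/soda-dw-sync:/app/config --env-file /opt/soda-dw-sync/.env soda-dw-sync:latest
```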


Kubernetes

The k8s/ directory in the repository contains ready-to-use manifest templates for running the sync as a Kubernetes CronJob.

Load your local image into the cluster:
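For example, with a kind cluster (tooling varies; minikube uses minikube image load instead):

```shell
kind load docker-image soda-dw-sync:latest
```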

Apply the manifests:
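For example:

```shell
kubectl apply -f k8s/
```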

View logs from the most recent job run:
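For example, list the Jobs spawned by the CronJob and tail the newest one:

```shell
kubectl get jobs --sort-by=.metadata.creationTimestamp
kubectl logs job/<most-recent-job-name>
```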

Delete the CronJob to stop scheduling:
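For example (the CronJob name comes from the manifest; soda-dw-sync is an assumption):

```shell
kubectl delete cronjob soda-dw-sync
```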

The ConfigMap mounts both sc.yml and ds.yml into the container. Store secrets (API keys) in a Kubernetes Secret and reference them via envFrom in the CronJob spec, using the ${env.VAR} syntax in your YAML files.


Use the integration

Once the sync runs, navigate to any cataloged table in data.world. Open the Data Quality tab to see:

  • All Soda checks associated with the table, their latest result (Pass / Fail / Warn), and a direct link back to Soda Cloud for each check.

  • A dataset-level quality badge. By default, the values are set to:

    • Good (≥ 90% passing)

    • Moderate (≥ 70% passing)

    • Poor (< 70% passing)
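The badge logic above can be sketched as a small function; this is an illustration of the documented default thresholds, not the tool's actual code:

```python
def quality_badge(passed: int, total: int) -> str:
    """Map a dataset's check pass rate to a badge label using the default thresholds."""
    if total == 0:
        # Datasets with no eligible checks are skipped; no badge is posted.
        raise ValueError("dataset has no eligible checks")
    rate = passed / total
    if rate >= 0.90:
        return "Good"
    if rate >= 0.70:
        return "Moderate"
    return "Poor"
```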

Click any check link to open the corresponding check page in Soda Cloud, where you can review diagnostics, historical results, and metric values.


Limitations & edge cases

  • Metric Monitor checks are not synced. Checks that use Soda's anomaly detection (Metric Monitors) are automatically skipped. Only threshold-based checks are pushed to data.world.

  • Datasets with no eligible checks are skipped. If a dataset has no checks, or all of its checks are Metric Monitors, no check runs or badge are posted for that dataset.

  • Table identity depends on qualified name format. Tables are matched in data.world using the fully-qualified name extracted from Soda Cloud (database.schema.table or database.table). If the name in Soda Cloud does not match what data.world expects, results will not surface on the correct table.


