data.world
Learn how to integrate data.world with Soda.
Integrate Soda with data.world to surface data quality check results directly within your data.world catalog.
Run data quality checks using Soda and push check results and quality scores to the corresponding tables in your data.world catalog.
Give your data.world users the confidence of knowing that the data they are using is sound.
Drill down from a data.world table directly into Soda Cloud to investigate check details, diagnostics, and historical trends.

How it works
The integration is a lightweight CLI tool (soda-dw-sync) that you install and schedule. On each run it:
Reads your data source mapping from ds.yml to determine which Soda Cloud data sources to process and how they map to data.world collections.
For each data source, retrieves all datasets from the Soda Cloud API.
For each dataset, retrieves all checks and their latest evaluation results.
Transforms those results into the format expected by the data.world Check Runs and Badges APIs.
Posts check results and a dataset-level quality badge (Good / Moderate / Poor) to the matching table in data.world. Tables are identified by their fully-qualified name (database.schema.table) extracted from Soda Cloud.
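The per-run loop above can be sketched in plain Python. This is a minimal illustration only, not the tool's implementation: the function name plan_sync, the dictionary shapes, and the is_metric_monitor flag are assumptions made for the example, and the Soda Cloud and data.world API calls are replaced with in-memory data.

```python
def plan_sync(mapping, datasets_by_source, checks_by_dataset):
    """Return {qualified_table_name: [check, ...]} for tables that would be updated.

    All three arguments are hypothetical in-memory stand-ins:
      mapping:            {data_source_name: collection_settings}, as read from ds.yml
      datasets_by_source: {data_source_name: [qualified_table_name, ...]}
      checks_by_dataset:  {qualified_table_name: [check_dict, ...]}
    """
    plan = {}
    # Only data sources listed in the mapping are processed.
    for source in mapping:
        # Every dataset of that data source is considered.
        for table in datasets_by_source.get(source, []):
            checks = checks_by_dataset.get(table, [])
            # Assumed flag: checks marked as Metric Monitors are left out.
            eligible = [c for c in checks if not c.get("is_metric_monitor", False)]
            # Datasets with nothing left to report are skipped entirely.
            if eligible:
                plan[table] = eligible
    return plan
```

In this sketch, a data source that is missing from the mapping never contributes tables to the plan, which mirrors the role ds.yml plays in selecting what to sync.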
Prerequisites
You have installed Python 3.10 or later.
You have completed at least one Soda scan and confirmed that the datasets you wish to sync appear in Soda Cloud.
You have a data.world account with the privileges necessary to write to the target catalog collection.
You have a git repository (or local directory) in which to store the integration project files.
You have access to the data.world integration repository. Contact us at [email protected] to receive access.
Set up the integration
Install
Installing and using the data.world integration requires access to the integration repository. Contact us at [email protected] to get access to the repository.
As a best practice, install the integration in a virtual environment and store secrets in environment variables. If you haven't already, use your command-line interface to create a virtual environment in the .venv directory.
Clone or download the data.world integration repository, then install from the repo root:
Alternatively, install directly from a release wheel:
Configure Soda Cloud credentials
Create a sc.yml file with your Soda Cloud connection details.
Refer to Generate API keys to create the api_key_id and api_key_secret values.
Note: The dimension_attribute field is optional. When set, the tool reads the named check attribute from Soda Cloud and uses its value as the quality dimension label in data.world. If the attribute is absent or the field is omitted, the check's checkType is used instead.
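The fallback rule in the note can be illustrated with a short Python sketch. The function name dimension_label and the shape of the check dictionary (an "attributes" mapping next to "checkType") are assumptions for illustration; the actual Soda Cloud response format may differ. The sketch also assumes that an empty attribute value counts as absent.

```python
def dimension_label(check, dimension_attribute=None):
    """Pick the quality dimension label for one check.

    `check` is a hypothetical dict such as
    {"checkType": "missing", "attributes": {"dimension": "Completeness"}}.
    """
    attributes = check.get("attributes") or {}
    # Use the named attribute only when the field is configured and a value exists.
    if dimension_attribute and attributes.get(dimension_attribute):
        return attributes[dimension_attribute]
    # Otherwise fall back to the check's type.
    return check["checkType"]
```

With dimension_attribute set to "dimension", the example check above would be labeled "Completeness"; with the field omitted, it would be labeled "missing".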
Configure data source mappings
Create a ds.yml file that maps each Soda Cloud data source to the corresponding collection in data.world. Add one entry per data source.
The table below describes each field. All values except dw_api_token come from data.world.
context: The agent ID of the data.world org that originally cataloged the database objects (tables).
databaseLocation: The host:port of the source database, e.g. myorg.snowflakecomputing.com:443 or dbc-abc123.cloud.databricks.com:443.
orgAgentId: The agent ID of the data.world org that owns the collection where check results will be stored.
collectionIri: The IRI of the data.world collection where check results will be stored.
dw_api_token: A data.world API token with write access to the target collection.
For more details on how data.world uses these identifiers, see the data.world Add Check Runs API reference.
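Before the first run, it can help to confirm that each mapping entry carries every field described above. The following Python sketch assumes that each entry has already been loaded from ds.yml into a dictionary and that all five fields are required; the function name missing_fields and that dictionary layout are illustrative assumptions, not part of the integration.

```python
# Field names as listed in the field descriptions above.
REQUIRED_FIELDS = (
    "context",
    "databaseLocation",
    "orgAgentId",
    "collectionIri",
    "dw_api_token",
)


def missing_fields(entry):
    """Return the field names that are absent or empty in one mapping entry."""
    return [name for name in REQUIRED_FIELDS if not entry.get(name)]
```

An empty list means the entry is complete; any returned names point to values that still need to be copied from data.world.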
Verify your setup with a dry run
Before uploading anything to data.world, use --dry-run to write the payloads to local JSON files for inspection:
Inspect output/check_runs_<datasource>.json and output/badges_<datasource>.json to confirm the data looks correct, then run without --dry-run to push results to data.world.
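For a quick first look at the dry-run files, a small Python helper can confirm that each one is valid JSON and report its size. This sketch relies only on the file name patterns mentioned above; it makes no assumption about the payload structure, and the function name summarize_dry_run is hypothetical.

```python
import json
from pathlib import Path


def summarize_dry_run(out_dir="output"):
    """Return {file_name: summary} for each dry-run payload file in out_dir."""
    summary = {}
    for pattern in ("check_runs_*.json", "badges_*.json"):
        for path in sorted(Path(out_dir).glob(pattern)):
            # json.load raises an error if a file is not valid JSON.
            with path.open(encoding="utf-8") as handle:
                payload = json.load(handle)
            size = len(payload) if isinstance(payload, (list, dict)) else 1
            summary[path.name] = f"{type(payload).__name__} with {size} top-level item(s)"
    return summary


if __name__ == "__main__":
    for name, description in summarize_dry_run().items():
        print(f"{name}: {description}")
```

A file that is missing from the summary was not written at all, which is worth investigating before running without --dry-run.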
Run the integration
Useful CLI options:
--only-dataset db.schema.table: Limit the sync to one or more specific datasets. Repeatable.
--dry-run: Write JSON payloads to --out-dir instead of uploading to data.world.
--out-dir ./output: Directory for dry-run JSON output files.
--log-level DEBUG: Increase log verbosity. Accepts DEBUG, INFO, WARNING, ERROR, CRITICAL.
Run soda-dw-sync --help to see all options.
Schedule the integration
Cron
After a local install, schedule with cron to keep data.world up to date automatically:
To stop the job, edit your crontab with crontab -e and remove the entry.
Docker
Build the image:
Run via cron using Docker:
Kubernetes
The k8s/ directory in the repository contains ready-to-use manifest templates for running the sync as a Kubernetes CronJob.
Load your local image into the cluster:
Apply the manifests:
View logs from the most recent job run:
Delete the CronJob to stop scheduling:
The ConfigMap mounts both sc.yml and ds.yml into the container. Store secrets (API keys) in a Kubernetes Secret and reference them via envFrom in the CronJob spec, using the ${env.VAR} syntax in your YAML files.
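To illustrate what the ${env.VAR} syntax implies, the Python sketch below substitutes such placeholders with values from the environment. It shows the general idea only and is not the tool's own resolver: the function name resolve_env, the accepted variable-name pattern, and the choice to raise an error on unset variables are all assumptions, as is the variable name SODA_API_KEY_ID in the usage note.

```python
import os
import re

# Matches placeholders of the form ${env.NAME}.
_PLACEHOLDER = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(text, environ=None):
    """Replace every ${env.NAME} in text with the value of NAME."""
    environ = os.environ if environ is None else environ

    def replace(match):
        name = match.group(1)
        if name not in environ:
            # Failing loudly avoids sending an empty credential.
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]

    return _PLACEHOLDER.sub(replace, text)
```

For example, a line such as api_key_id: ${env.SODA_API_KEY_ID} would resolve to the value that the Kubernetes Secret injects into the container under that name.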
Use the integration
Once the sync runs, navigate to any cataloged table in data.world. Open the Data Quality tab to see:
All Soda checks associated with the table, their latest result (Pass / Fail / Warn), and a direct link back to Soda Cloud for each check.
A dataset-level quality badge. By default, the thresholds are:
Good (≥ 90% of checks passing)
Moderate (≥ 70% and < 90% of checks passing)
Poor (< 70% of checks passing)
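The default thresholds can be expressed as a small Python function. This is a sketch under stated assumptions: the function name quality_badge is hypothetical, and it assumes the percentage is the number of passing checks divided by the number of eligible checks. How warnings are counted is not covered here.

```python
def quality_badge(passed, total):
    """Map a pass count to a badge using the default thresholds.

    Returns None when there are no eligible checks, since no badge
    is posted for such a dataset.
    """
    if total <= 0:
        return None
    # Integer comparisons avoid floating-point rounding at the boundaries.
    if 100 * passed >= 90 * total:
        return "Good"
    if 100 * passed >= 70 * total:
        return "Moderate"
    return "Poor"
```

For instance, a table with 9 of 10 checks passing would be rated Good, while 7 of 10 would be rated Moderate.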
Click any check link to open the corresponding check page in Soda Cloud, where you can review diagnostics, historical results, and metric values.
Limitations & edge cases
Metric Monitor checks are not synced. Checks that use Soda's anomaly detection (Metric Monitors) are automatically skipped. Only threshold-based checks are pushed to data.world.
Datasets with no eligible checks are skipped. If a dataset has no checks, or all of its checks are Metric Monitors, neither check runs nor a badge are posted for that dataset.
Table identity depends on qualified name format. Tables are matched in data.world using the fully-qualified name extracted from Soda Cloud (database.schema.table or database.table). If the name in Soda Cloud does not match what data.world expects, results will not surface on the correct table.
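When troubleshooting a table that does not receive results, it can be useful to check how its name breaks down into parts. The Python sketch below splits the two name formats mentioned above. The function name split_qualified_name is hypothetical, and the simple split on dots is an assumption that does not handle quoted identifiers containing dots.

```python
def split_qualified_name(name):
    """Split database.schema.table or database.table into its parts.

    Returns (database, schema, table); schema is None for two-part names.
    """
    parts = name.split(".")
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"unexpected qualified name: {name!r}")
    if len(parts) == 3:
        database, schema, table = parts
    else:
        database, table = parts
        schema = None
    return database, schema, table
```

Comparing these parts with the database, schema, and table shown in the data.world catalog is a quick way to spot a naming mismatch.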
