

✔ An open-source CLI tool and Python library for data reliability

✔ Compatible with Soda Checks Language (SodaCL) and Soda Cloud

✔ Enables data quality testing both in and out of your data pipeline, for data observability and reliability

✔ Enables programmatic scans on a time-based schedule

Example checks

# Checks for basic validations
checks for dim_customer:
  - row_count between 10 and 1000
  - missing_count(birth_date) = 0
  - invalid_percent(phone) < 1 %:
      valid format: phone number
  - invalid_count(number_cars_owned) = 0:
      valid min: 1
      valid max: 6
  - duplicate_count(phone) = 0
checks for dim_product:
  - avg(safety_stock_level) > 50
# Check for schema changes
checks for dim_product:
  - schema:
      name: Find forbidden, missing, or wrong type
      warn:
        when required column missing: [dealer_price, list_price]
        when forbidden column present: [credit_card]
        when wrong column type:
          standard_cost: money
      fail:
        when forbidden column present: [pii*]
        when wrong column index:
          model_name: 22
# Check for freshness 
checks for dim_product:
  - freshness(start_date) < 1d
# Check for referential integrity
checks for dim_department_group:
  - values in (department_group_name) must exist in dim_employee (department_name)
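
To make the first few checks concrete, here is a rough Python illustration of what `row_count`, `missing_count`, and `invalid_count` with `valid min` / `valid max` measure. It is a sketch only, not how Soda Core evaluates checks (Soda Core runs them against your data source). The sample rows and the helper functions are hypothetical, and the sketch assumes that missing means NULL and that missing values are not counted as invalid.

```python
# Hypothetical sample rows standing in for the dim_customer dataset.
rows = [
    {"birth_date": "1980-04-02", "number_cars_owned": 2},
    {"birth_date": None, "number_cars_owned": 0},
    {"birth_date": "1975-11-23", "number_cars_owned": 7},
    {"birth_date": "1990-01-15", "number_cars_owned": None},
]


def missing_count(rows, column):
    """Count rows whose value in `column` is NULL (None in this sketch)."""
    return sum(1 for row in rows if row[column] is None)


def invalid_count(rows, column, valid_min, valid_max):
    """Count non-missing values that fall outside [valid_min, valid_max]."""
    return sum(
        1
        for row in rows
        if row[column] is not None
        and not (valid_min <= row[column] <= valid_max)
    )


# Each entry maps a check from the example to a pass (True) or fail (False).
results = {
    "row_count between 10 and 1000": 10 <= len(rows) <= 1000,
    "missing_count(birth_date) = 0": missing_count(rows, "birth_date") == 0,
    "invalid_count(number_cars_owned) = 0": invalid_count(
        rows, "number_cars_owned", valid_min=1, valid_max=6
    )
    == 0,
}

for check, passed in results.items():
    print(f"{check}: {'pass' if passed else 'fail'}")
```

On this toy data every check fails: there are only four rows, one birth date is missing, and two car counts (0 and 7) fall outside the range 1 to 6.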

Why Soda Core?

Simplify the work of maintaining healthy data and eliminate the bottlenecks in data quality management.

  • Download a free, OSS CLI tool and configure settings and data quality checks in two simple YAML files to start scanning your data within minutes.
  • Connect Soda Core to over a dozen data sources to scan volumes of data for quality.
  • Use the Soda Core Python library to build programmatic scans that you can run from orchestration tools such as Airflow or Prefect to automate pipeline actions when data quality checks fail.
  • Write data quality checks using SodaCL, a low-code, human-readable, domain-specific language for data quality management.
  • Run the same scans for data quality in multiple environments such as development, staging, and production.
  • Connect to Soda Cloud to unlock historic metric storage, data quality incident tracking, and change-over-time metrics.
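
A programmatic scan of the kind described in the list above might look like the following minimal sketch. It assumes a Soda Core connector package is installed and uses the `Scan` class from `soda.scan`. The function name `run_scan`, its `scan_factory` parameter (included only so the sketch can be exercised without a data source), the data source name, and the file names in the usage comment are all illustrative assumptions, not values from this page.

```python
def run_scan(data_source, config_path, checks_path, scan_factory=None):
    """Run one Soda Core scan and return True when no check failed."""
    if scan_factory is None:
        # Deferred import: requires a Soda Core connector package (assumption).
        from soda.scan import Scan

        scan_factory = Scan

    scan = scan_factory()
    # Name of the data source as defined in the configuration YAML file.
    scan.set_data_source_name(data_source)
    # The two YAML files: connection settings and SodaCL checks.
    scan.add_configuration_yaml_file(file_path=config_path)
    scan.add_sodacl_yaml_file(checks_path)
    scan.execute()
    return not scan.has_check_fails()


# Hypothetical usage inside an orchestrated task:
# if not run_scan("my_datasource", "configuration.yml", "checks.yml"):
#     raise RuntimeError("Soda scan reported failed checks; halting pipeline")
```

Returning a boolean keeps the decision about what to do on failure (stop the pipeline, quarantine data, send an alert) in the orchestration code rather than in the scan wrapper.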

Last modified on 31-May-23