Collibra
This page describes the bi-directional integration between Soda and Collibra.
The Soda↔Collibra integration synchronizes data quality checks from Soda to Collibra, giving you a unified view of your data quality metrics. The implementation is optimized for performance, reliability, and maintainability, and supports bi-directional ownership sync and advanced diagnostic metrics.
Key features
High Performance: 3-5x faster execution through caching, batching, and parallel processing
Custom Attribute Syncing: Flexible mapping of Soda check attributes to Collibra attributes for rich business context
Ownership Synchronization: Bi-directional ownership sync between Collibra and Soda
Diagnostic Metrics Processing: Automatic extraction of diagnostic metrics from any Soda check type with intelligent fallbacks
Robust Error Handling: Comprehensive retry logic and graceful error recovery
Advanced Monitoring: Real-time metrics, performance tracking, and detailed reporting
CLI Interface: Flexible command-line options for different use cases
Backward Compatibility: Legacy test methods preserved for smooth migration
Quickstart
For technical details on how to configure the bi-directional Collibra integration, head to Setup & configuration.
Prerequisites
Python 3.10 or later
Valid Soda Cloud API credentials
Valid Collibra API credentials
Properly configured Collibra asset types and relations
Basic Usage
# Run the integration with default settings
python main.py
# Run with debug logging for troubleshooting
python main.py --debug
# Use a custom configuration file
python main.py --config custom.yaml
# Show help and all available options
python main.py --help
Advanced Usage
# Run legacy Soda client tests
python main.py --test-soda
# Run legacy Collibra client tests
python main.py --test-collibra
# Run with verbose logging (info level)
python main.py --verbose
How It Works
1. Optimized Dataset Processing
Smart Filtering: Only processes datasets marked for synchronization
Parallel Processing: Handles multiple operations concurrently
Caching: Reduces API calls through intelligent caching
Batch Operations: Groups similar operations for efficiency
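The four techniques above can be illustrated with a minimal Python sketch. It is not the integration's actual code: the `sync_enabled` flag, the `lookup_asset_id` and `push_batch` helpers, and the batch and worker sizes are all hypothetical stand-ins chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def select_for_sync(datasets: Iterable[Dict]) -> List[Dict]:
    """Smart filtering: keep only datasets flagged for synchronization."""
    # "sync_enabled" is a hypothetical flag name, not a documented field.
    return [d for d in datasets if d.get("sync_enabled")]


@lru_cache(maxsize=None)
def lookup_asset_id(table_name: str) -> str:
    """Caching: repeated lookups for the same table skip the remote call."""
    # Stand-in for a remote API lookup.
    return f"asset::{table_name}"


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Batch operations: yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def push_batch(batch: List[str]) -> int:
    """Hypothetical stand-in for one bulk request; returns the items sent."""
    return len(batch)


def sync(datasets: Iterable[Dict], batch_size: int = 2, workers: int = 4) -> int:
    """Filter, resolve ids through the cache, then push batches in parallel."""
    names = [d["name"] for d in select_for_sync(datasets)]
    asset_ids = [lookup_asset_id(name) for name in names]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(push_batch, batched(asset_ids, batch_size)))
```

In this sketch, threads suit the workload because the time is spent waiting on network calls rather than on computation, and `lookup_asset_id.cache_info()` reports how many lookups were served from the cache.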
2. Enhanced Check Processing
For each check in a dataset:
Asset Management
Bulk Creation/Updates: Processes multiple assets simultaneously
Duplicate Handling: Intelligent naming to avoid conflicts
Status Tracking: Monitors creation vs. update operations
Attribute Processing
Standard Attributes: Evaluation status, timestamps, definitions
Diagnostic Metrics: Automatically extracts and calculates diagnostic metrics from check results
Custom Attributes: Flexible mappings for business context (see Custom Attribute Syncing)
Batch Updates: Groups attribute operations for performance
Relationship Management
Dimension Relations: Links checks to data quality dimensions
Table/Column Relations: Creates appropriate asset relationships
Error Recovery: Graceful handling of missing or ambiguous assets
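As an illustration of the Asset Management steps above, the following sketch shows one possible way to avoid name conflicts and to separate creations from updates. The numeric-suffix naming scheme and both function names are assumptions made for this example; the integration's actual naming rules are not described on this page.

```python
from typing import Dict, Iterable, List, Set, Tuple


def unique_asset_name(base: str, taken: Set[str]) -> str:
    """Return `base`, or `base (n)` with the lowest free n >= 2, and reserve it."""
    candidate = base
    suffix = 2
    while candidate in taken:
        candidate = f"{base} ({suffix})"
        suffix += 1
    taken.add(candidate)
    return candidate


def split_create_update(
    check_ids: Iterable[str], existing: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Partition check ids into those needing a new asset and those already mapped.

    `existing` maps a check id to the id of the asset that already represents it.
    """
    to_create: List[str] = []
    to_update: List[str] = []
    for check_id in check_ids:
        if check_id in existing:
            to_update.append(check_id)
        else:
            to_create.append(check_id)
    return to_create, to_update
```

Keeping the two lists separate makes it straightforward to send one bulk create request and one bulk update request, and to report how many assets fell into each group.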
3. Ownership Synchronization
Collibra to Soda Sync: Automatically syncs dataset owners from Collibra to Soda
User Mapping: Maps Collibra users to Soda users by email address
Error Handling: Tracks missing users and synchronization failures
Metrics Tracking: Monitors successful ownership transfers
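The email-based user mapping can be sketched as follows. This is a simplified illustration, not the integration's code: the function name and data shapes are hypothetical, and matching emails case-insensitively is an assumption.

```python
from typing import Dict, Iterable, List, Tuple


def map_owners_by_email(
    collibra_owner_emails: Iterable[str],
    soda_user_ids_by_email: Dict[str, str],
) -> Tuple[Dict[str, str], List[str]]:
    """Match Collibra owners to Soda users by email address.

    Returns (matched, missing): `matched` maps each Collibra email to a Soda
    user id, and `missing` lists emails with no Soda user.
    """
    # Assumption: emails are compared case-insensitively, ignoring padding.
    index = {
        email.strip().lower(): user_id
        for email, user_id in soda_user_ids_by_email.items()
    }
    matched: Dict[str, str] = {}
    missing: List[str] = []
    for email in collibra_owner_emails:
        key = email.strip().lower()
        if key in index:
            matched[email] = index[key]
        else:
            missing.append(email)
    return matched, missing
```

Returning the unmatched emails alongside the matches mirrors the error handling described above: missing users are recorded for the final report instead of stopping the run.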
4. Advanced Error Handling
Retry Logic: Exponential backoff for transient failures
Rate Limiting: Intelligent throttling to avoid API limits
Error Aggregation: Collects and reports all issues at the end
Graceful Degradation: Continues processing despite individual failures
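Retry with exponential backoff, error aggregation, and graceful degradation can be sketched in a few lines of Python. The function names, the choice of `ConnectionError` as the transient failure, and the delay values are assumptions for illustration only.

```python
import time
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, doubling the wait after each transient failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except ConnectionError:
            # Assumption: ConnectionError stands in for any transient failure.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1


def process_all(
    items: Iterable[str], handler: Callable[[str], None]
) -> List[Tuple[str, str]]:
    """Run `handler` on every item and collect failures instead of stopping."""
    errors: List[Tuple[str, str]] = []
    for item in items:
        try:
            handler(item)
        except Exception as exc:  # keep going; report everything at the end
            errors.append((item, str(exc)))
    return errors
```

With the defaults in this sketch, the waits between attempts are 0.5, 1, and 2 seconds. Passing `sleep` as a parameter lets tests substitute a recorder so they do not actually wait.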
Head to Setup & configuration to learn how to integrate Collibra.