Operations & advanced usage
This page describes what happens during and after a run of the Soda↔Collibra integration.
Advanced usage focuses on running and maintaining the Soda↔Collibra bi-directional integration after setup. The goal is to equip technical implementers with the detail required to operate the integration efficiently, resolve issues quickly, and adapt it to complex environments.
Domain Mappings: Cached for the entire session
Asset Lookups: LRU cache reduces repeated API calls
Configuration Parsing: One-time parsing with caching
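To illustrate how an LRU cache cuts repeated lookups, here is a minimal sketch using Python's standard `functools.lru_cache`. The function names `fetch_asset_from_api` and `find_asset`, the cache size, and the returned fields are hypothetical stand-ins, not the integration's actual code.

```python
from functools import lru_cache

# Counter used only to show how many "remote" calls were made.
api_calls = 0

def fetch_asset_from_api(asset_name: str, domain_id: str) -> dict:
    """Hypothetical stand-in for a Collibra asset lookup over HTTP."""
    global api_calls
    api_calls += 1
    return {"id": f"{domain_id}:{asset_name}", "name": asset_name, "domainId": domain_id}

@lru_cache(maxsize=128)  # illustrative size, not the integration's setting
def find_asset(asset_name: str, domain_id: str) -> dict:
    """Return the asset, calling the API only on a cache miss."""
    return fetch_asset_from_api(asset_name, domain_id)

first = find_asset("orders", "domain-1")      # miss: calls the API
second = find_asset("orders", "domain-1")     # hit: served from the cache
third = find_asset("customers", "domain-1")   # miss: different arguments

print(api_calls)                # 2
print(find_asset.cache_info())  # hit/miss statistics
```

The cache key is the argument tuple, so the same asset name in a different domain is looked up separately.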
Batch Processing
Asset Operations : Create/update multiple assets in single calls
Attribute Management : Bulk attribute creation and updates
Relation Creation : Batch relationship establishment
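The batching idea can be sketched as follows: instead of one request per asset, items are grouped into fixed-size chunks and each chunk is sent in a single call. `send_bulk_request` is a hypothetical placeholder for a bulk endpoint call, and the batch size of 100 is only an example.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def chunked(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def send_bulk_request(batch: Sequence[dict]) -> int:
    """Hypothetical bulk call; here it only reports how many items it received."""
    return len(batch)

assets: List[dict] = [{"name": f"check_{i}"} for i in range(250)]
batch_sizes = [send_bulk_request(batch) for batch in chunked(assets, 100)]
print(batch_sizes)  # [100, 100, 50]
```

With 250 assets and a batch size of 100, three requests replace 250 individual ones.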
3-5x faster execution vs. original implementation
60% fewer API calls through caching
90% reduction in rate limit errors
Improved reliability with comprehensive error handling
Small datasets (< 100 checks): 30-60 seconds
Medium datasets (100-1000 checks): 2-5 minutes
Large datasets (1000+ checks): 5-15 minutes
Performance varies based on:
Number of existing vs. new assets
Complexity of relationships
Monitoring & Metrics
Integration Completion Report
Enable detailed logging for troubleshooting:
Debug output includes:
Dataset processing details
API call timing and results
Caching hit/miss statistics
Error context and stack traces
Performance metrics per operation
Ownership synchronization details
Diagnostic Metrics Processing
The integration automatically extracts diagnostic metrics from Soda check results and populates detailed row-level statistics in Collibra.
Supported Metrics
check_loaded_rows_attribute: checkRowsTested or datasetRowsTested. Total number of rows evaluated by the check
check_rows_failed_attribute: failedRowsCount. Number of rows that failed the check
check_rows_passed_attribute: check_loaded_rows - check_rows_failed (derived)
check_passing_fraction_attribute: check_rows_passed / check_loaded_rows (derived)
Flexible Diagnostic Type Support
The system automatically extracts metrics from any diagnostic type, making it future-proof:
Current Soda Diagnostic Types
Future Diagnostic Types (Automatically Supported)
Intelligent Extraction Logic
The system uses a metric-focused approach rather than type-specific logic:
Scans All Diagnostic Types: Iterates through every diagnostic type in the response
Extracts Relevant Metrics: Looks for specific metric fields regardless of diagnostic type name
Applies Smart Fallbacks: Uses datasetRowsTested if checkRowsTested is not available
Calculates Derived Metrics: Computes passing rows and fraction when source data is available
Handles Missing Data: Gracefully skips attributes when diagnostic data is unavailable
Fallback Mechanisms
| Priority | Field Used | Fallback Reason |
| --- | --- | --- |
| 1 | checkRowsTested | Preferred - rows actually tested by the specific check |
| 2 | datasetRowsTested | Fallback - total dataset rows when check-specific count unavailable |
Example Processing Flow
Input: Soda Check Result
Output: Collibra Attributes
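The extraction steps, fallback order, and derived metrics described above can be sketched as below. This is a minimal illustration, not the integration's code. It assumes `lastCheckResultValue.diagnostics` is a mapping from diagnostic type name to a dictionary of metric fields. The type name `missing`, the sample numbers, and the function name `extract_diagnostic_metrics` are hypothetical.

```python
from typing import Any, Dict, Optional

def extract_diagnostic_metrics(diagnostics: Dict[str, Any]) -> Dict[str, float]:
    """Scan every diagnostic type and collect row-level metrics by field name."""
    failed: Optional[int] = None
    check_rows: Optional[int] = None
    dataset_rows: Optional[int] = None

    # Metric-focused scan: the diagnostic type name itself is never inspected.
    for payload in diagnostics.values():
        if not isinstance(payload, dict):
            continue
        if failed is None and payload.get("failedRowsCount") is not None:
            failed = payload["failedRowsCount"]
        if check_rows is None and payload.get("checkRowsTested") is not None:
            check_rows = payload["checkRowsTested"]
        if dataset_rows is None and payload.get("datasetRowsTested") is not None:
            dataset_rows = payload["datasetRowsTested"]

    # Priority 1: checkRowsTested. Priority 2: datasetRowsTested.
    loaded = check_rows if check_rows is not None else dataset_rows

    metrics: Dict[str, float] = {}
    if loaded is not None:
        metrics["check_loaded_rows"] = loaded
    if failed is not None:
        metrics["check_rows_failed"] = failed
    if loaded is not None and failed is not None:
        passed = loaded - failed
        metrics["check_rows_passed"] = passed
        if loaded > 0:  # skip the fraction rather than divide by zero
            metrics["check_passing_fraction"] = passed / loaded
    return metrics

# Hypothetical check result; real payloads may carry more fields.
check_result = {
    "lastCheckResultValue": {
        "diagnostics": {
            "missing": {"failedRowsCount": 5, "checkRowsTested": 100, "datasetRowsTested": 1000}
        }
    }
}
metrics = extract_diagnostic_metrics(check_result["lastCheckResultValue"]["diagnostics"])
print(metrics)
```

For this input, the check-specific count (100) wins over the dataset count (1000), so 95 rows pass and the passing fraction is 0.95. Metrics whose source fields are absent are left out of the result.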
✅ Future-Proof: Automatically works with new diagnostic types Soda introduces
✅ Comprehensive: Provides both raw metrics and calculated insights
✅ Flexible: Handles partial data gracefully with intelligent fallbacks
✅ Accurate: Uses check-specific row counts when available
✅ Transparent: Detailed logging shows exactly which metrics were found and used
Local Kubernetes Testing
Head to Deploy on Kubernetes to learn more about the Kubernetes deployment.
Advanced Configuration
Modify constants.py for your environment:
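As an illustration only, the excerpt below uses the constant names mentioned in the Troubleshooting section of this page. The values are placeholders, not the shipped defaults, and the unit of `RATE_LIMIT_DELAY` is assumed to be seconds. Check your own copy of constants.py for the actual names and defaults.

```python
# constants.py (illustrative excerpt; values are placeholders, not shipped defaults)

BATCH_SIZE = 50          # items sent per bulk API call
DEFAULT_PAGE_SIZE = 100  # items requested per page when listing
RATE_LIMIT_DELAY = 0.5   # pause between API calls (assumed to be seconds)
CACHE_MAX_SIZE = 1000    # maximum number of cached lookups
```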
Enhanced Configuration Options
For detailed information on configuring custom attribute syncing, see the Custom Attribute Syncing section above.
Environment Variables
Troubleshooting
Slow Processing: Increase BATCH_SIZE and DEFAULT_PAGE_SIZE
Rate Limiting: Increase RATE_LIMIT_DELAY
Memory Usage: Decrease CACHE_MAX_SIZE
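A rough way to reason about these trade-offs is sketched below. It assumes one API call per batch and a fixed pause of `RATE_LIMIT_DELAY` seconds between consecutive calls. Both are simplifying assumptions, and `estimate_throttle_overhead` is a hypothetical helper rather than part of the integration.

```python
import math
from typing import Tuple

def estimate_throttle_overhead(item_count: int, batch_size: int,
                               rate_limit_delay: float) -> Tuple[int, float]:
    """Return (number of API calls, total seconds spent pausing between them)."""
    calls = math.ceil(item_count / batch_size)
    pause_seconds = max(calls - 1, 0) * rate_limit_delay
    return calls, pause_seconds

print(estimate_throttle_overhead(1000, 50, 0.5))   # (20, 9.5)
print(estimate_throttle_overhead(1000, 100, 0.5))  # (10, 4.5)
```

Under these assumptions, doubling the batch size halves the number of calls and roughly halves the time spent waiting. Raising the delay does the opposite: it reduces the chance of rate limit errors but lengthens the run.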
Connection Issues
API Timeouts: Check network connectivity and API endpoints
Authentication: Verify credentials and permissions
Rate Limits: Monitor API usage and adjust delays
Missing Assets: Ensure required asset types exist in Collibra
Relation Failures: Verify relation type configurations
Domain Mapping: Check domain IDs and JSON formatting
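For the domain mapping check in particular, a quick validation can catch malformed JSON before a run. The sketch below assumes the mapping is a JSON object whose values are Collibra domain IDs given as non-empty strings. The exact shape of the mapping, the function name `parse_domain_mapping`, and the sample names and IDs are assumptions.

```python
import json
from typing import Dict

def parse_domain_mapping(raw: str) -> Dict[str, str]:
    """Parse a JSON domain mapping and reject obviously broken entries."""
    try:
        mapping = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Domain mapping is not valid JSON: {exc}") from exc
    if not isinstance(mapping, dict):
        raise ValueError("Domain mapping must be a JSON object")
    bad = [key for key, value in mapping.items()
           if not isinstance(value, str) or not value.strip()]
    if bad:
        raise ValueError(f"Empty or non-string domain IDs for: {bad}")
    return mapping

# Hypothetical mapping; the names and IDs are made up.
valid = parse_domain_mapping('{"Sales": "domain-id-1", "Finance": "domain-id-2"}')
print(valid)

try:
    parse_domain_mapping("{'Sales': 'domain-id-1'}")  # single quotes are not valid JSON
except ValueError as err:
    print(err)
```

Single-quoted strings and trailing commas are common causes of parse failures here.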
Diagnostic Metrics Issues
Missing Diagnostic Attributes: Check whether the Soda checks have lastCheckResultValue.diagnostics data
Incomplete Metrics: Some diagnostic types may only have partial metrics (e.g., aggregate checks lack failedRowsCount)
Attribute Type Configuration: Verify diagnostic attribute type IDs are configured correctly in config.yaml
Zero Division Errors: The system automatically prevents division by zero when calculating fractions
Look for these patterns in debug logs:
General Operation Patterns:
Rate limit prevention: Normal throttling behavior
Successfully updated/created: Successful operations
Skipping dataset: Expected filtering behavior
ERROR: Issues requiring attention
Diagnostic Processing Patterns:
Processing diagnostics: Diagnostic data found in check result
Found failedRowsCount in 'X': Successfully extracted failure count from diagnostic type X
Found checkRowsTested in 'X': Successfully extracted row count from diagnostic type X
Using datasetRowsTested from 'X' as fallback: Fallback mechanism activated
No diagnostics found in check result: Check has no diagnostic data (normal for some check types)
Calculated check_rows_passed: Successfully computed passing rows
Added check_X_attribute: Diagnostic attribute successfully added to Collibra
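To get a quick overview of a long debug log, the patterns above can be counted with a small script. This is a sketch under assumptions: the sample lines and their prefixes are invented, and the regular expressions assume the messages appear literally as listed, with "updated/created" meaning either word.

```python
import re
from collections import Counter
from typing import Iterable

PATTERNS = {
    "rate_limit": re.compile(r"Rate limit prevention"),
    "success": re.compile(r"Successfully (?:updated|created)"),
    "skipped": re.compile(r"Skipping dataset"),
    "error": re.compile(r"\bERROR\b"),
    "fallback": re.compile(r"Using datasetRowsTested from '[^']*' as fallback"),
    "no_diagnostics": re.compile(r"No diagnostics found in check result"),
}

def summarize_log(lines: Iterable[str]) -> Counter:
    """Count how many log lines match each known pattern."""
    counts: Counter = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts

# Hypothetical log lines; the real log format may differ.
sample_log = [
    "DEBUG Found failedRowsCount in 'missing'",
    "DEBUG Using datasetRowsTested from 'aggregate' as fallback",
    "INFO Successfully updated asset orders",
    "INFO Successfully created asset customers",
    "INFO Skipping dataset tmp_table",
    "ERROR Relation creation failed",
    "DEBUG Rate limit prevention: pausing before next call",
]
summary = summarize_log(sample_log)
for label in PATTERNS:
    print(f"{label}: {summary[label]}")
```

A high `fallback` or `no_diagnostics` count is not an error in itself, but it explains why some diagnostic attributes are missing or based on dataset-level row counts.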
Common Commands
Key Configuration Sections
Collibra Base: collibra.base_url, collibra.username, collibra.password
Soda API: soda.api_key_id, soda.api_key_secret
Custom Attributes: soda.attributes.custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id
Domain Mapping: collibra.domains.soda_collibra_domain_mapping
Ownership Sync: collibra.responsibilities.owner_role_id
Asset types (table, check, dimension, column)
Attribute types (evaluation status, sync date, diagnostic metrics)
Relation types (table-to-check, check-to-dimension)
Domain IDs for asset creation
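A small pre-flight check can confirm that the keys listed under Key Configuration Sections are present before a run. The sketch below works on an already parsed configuration dictionary. Which keys are mandatory is an assumption here, and `get_path` and `missing_keys` are hypothetical helpers.

```python
from typing import Any, Dict, List

# Dotted paths taken from the key configuration sections on this page.
# Treating exactly these as mandatory is an assumption.
KEYS_TO_CHECK = [
    "collibra.base_url",
    "collibra.username",
    "collibra.password",
    "soda.api_key_id",
    "soda.api_key_secret",
    "collibra.domains.soda_collibra_domain_mapping",
]

def get_path(config: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    node: Any = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def missing_keys(config: Dict[str, Any]) -> List[str]:
    """List the checked keys that are absent or empty."""
    return [key for key in KEYS_TO_CHECK if get_path(config, key) in (None, "")]

# Hypothetical, partially filled configuration.
config = {
    "collibra": {"base_url": "https://example.invalid", "username": "svc_user", "password": ""},
    "soda": {"api_key_id": "key-id"},
}
print(missing_keys(config))
```

For this sample, the empty password, the absent Soda API secret, and the absent domain mapping are reported.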
For issues and questions:
Review the performance metrics for bottlenecks