Operations & advanced usage

This page describes what happens during and after a run of the Soda↔Collibra integration.

It covers running and maintaining the bi-directional integration after setup, and aims to give technical implementers the detail they need to operate it efficiently, resolve issues quickly, and adapt it to complex environments.


Performance & Monitoring

Performance Optimization

Caching System

  • Domain Mappings: Cached for the entire session

  • Asset Lookups: LRU cache reduces repeated API calls

  • Configuration Parsing: One-time parsing with caching
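The effect of an LRU cache on repeated asset lookups can be illustrated with Python's standard `functools.lru_cache`. This is a minimal sketch, not the integration's actual code: `fetch_asset_id` is a hypothetical stand-in for a remote Collibra lookup, and the cache size of 1024 is an arbitrary illustrative value.

```python
from functools import lru_cache

api_calls = 0  # counts simulated remote lookups


def fetch_asset_id(asset_name: str, domain_id: str) -> str:
    """Hypothetical stand-in for a remote Collibra asset lookup."""
    global api_calls
    api_calls += 1
    return f"{domain_id}:{asset_name}"


@lru_cache(maxsize=1024)  # illustrative size, not the integration's default
def cached_asset_id(asset_name: str, domain_id: str) -> str:
    # Only cache misses reach the (simulated) API.
    return fetch_asset_id(asset_name, domain_id)


# Three lookups of the same asset trigger a single simulated API call.
for _ in range(3):
    cached_asset_id("orders", "domain-1")

print(api_calls)                     # 1
print(cached_asset_id.cache_info())  # hits=2, misses=1
```

Because the least recently used entries are evicted first, a smaller cache trades memory for additional API calls.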

Batch Processing

  • Asset Operations: Create/update multiple assets in single calls

  • Attribute Management: Bulk attribute creation and updates

  • Relation Creation: Batch relationship establishment
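Batching can be pictured as splitting a list of pending operations into fixed-size chunks and sending one request per chunk. The helper below is a hedged sketch of that idea only; `chunked` and the sample asset payloads are hypothetical and are not taken from the integration's source.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Seven hypothetical asset payloads sent in batches of three.
assets = [{"name": f"check_{i}"} for i in range(7)]
batches = list(chunked(assets, batch_size=3))

print([len(batch) for batch in batches])  # [3, 3, 1]
```

Each batch would then be passed to a single bulk create or update call instead of one call per asset.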

Performance Results

  • 3-5x faster execution vs. original implementation

  • 60% fewer API calls through caching

  • 90% reduction in rate limit errors

  • Improved reliability with comprehensive error handling

Performance Benchmarks

Typical Performance

  • Small datasets (fewer than 100 checks): 30-60 seconds

  • Medium datasets (100-1,000 checks): 2-5 minutes

  • Large datasets (more than 1,000 checks): 5-15 minutes

Performance varies based on:

  • Network latency to APIs

  • Number of existing vs. new assets

  • Complexity of relationships

  • API rate limits

Monitoring & Metrics

Integration Completion Report

Debug Logging

Enable detailed logging for troubleshooting:

Debug output includes:

  • Dataset processing details

  • API call timing and results

  • Caching hit/miss statistics

  • Error context and stack traces

  • Performance metrics per operation

  • Ownership synchronization details
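How debug logging is switched on for the integration itself is not reproduced on this page. As a general illustration only, and assuming the code uses Python's standard `logging` module (an assumption), debug-level output is typically enabled along these lines. The logger name `soda_collibra_example` is hypothetical.

```python
import logging


def configure_logging(debug: bool = False) -> logging.Logger:
    """Return a logger that emits DEBUG records only when debug is True."""
    logger = logging.getLogger("soda_collibra_example")  # hypothetical name
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = configure_logging(debug=True)
logger.debug("Processing dataset %s", "orders")
```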


Diagnostic Metrics Processing

The integration automatically extracts diagnostic metrics from Soda check results and populates detailed row-level statistics in Collibra.

Supported Metrics

| Metric | Source | Description |
| --- | --- | --- |
| check_loaded_rows_attribute | checkRowsTested or datasetRowsTested | Total number of rows evaluated by the check |
| check_rows_failed_attribute | failedRowsCount | Number of rows that failed the check |
| check_rows_passed_attribute | Calculated | check_loaded_rows - check_rows_failed |
| check_passing_fraction_attribute | Calculated | check_rows_passed / check_loaded_rows |

Flexible Diagnostic Type Support

The system automatically extracts metrics from any diagnostic type, making it future-proof:

Current Soda Diagnostic Types

Future Diagnostic Types (Automatically Supported)

Intelligent Extraction Logic

The system uses a metric-focused approach rather than type-specific logic:

  1. Scans All Diagnostic Types: Iterates through every diagnostic type in the response

  2. Extracts Relevant Metrics: Looks for specific metric fields regardless of diagnostic type name

  3. Applies Smart Fallbacks: Uses datasetRowsTested if checkRowsTested is not available

  4. Calculates Derived Metrics: Computes passing rows and fraction when source data is available

  5. Handles Missing Data: Gracefully skips attributes when diagnostic data is unavailable
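The five steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the integration's implementation: it assumes that `lastCheckResultValue.diagnostics` is a mapping from diagnostic type name to a dictionary of metric fields, the function names are hypothetical, and the diagnostic type `missing` in the sample input is made up.

```python
from typing import Any, Dict, Optional


def first_metric(diagnostics: Dict[str, Any], field: str) -> Optional[float]:
    """Return the first non-null value of field across all diagnostic types."""
    for metrics in diagnostics.values():
        if isinstance(metrics, dict) and metrics.get(field) is not None:
            return metrics[field]
    return None


def extract_diagnostic_metrics(check_result: Dict[str, Any]) -> Dict[str, float]:
    # Assumed layout: check_result["lastCheckResultValue"]["diagnostics"][type][field]
    diagnostics = (check_result.get("lastCheckResultValue") or {}).get("diagnostics") or {}

    # Prefer the check-specific row count, then fall back to the dataset count.
    loaded = first_metric(diagnostics, "checkRowsTested")
    if loaded is None:
        loaded = first_metric(diagnostics, "datasetRowsTested")
    failed = first_metric(diagnostics, "failedRowsCount")

    attributes: Dict[str, float] = {}
    if loaded is not None:
        attributes["check_loaded_rows_attribute"] = loaded
    if failed is not None:
        attributes["check_rows_failed_attribute"] = failed
    if loaded is not None and failed is not None:
        passed = loaded - failed
        attributes["check_rows_passed_attribute"] = passed
        if loaded > 0:  # skip the fraction rather than divide by zero
            attributes["check_passing_fraction_attribute"] = passed / loaded
    return attributes


# Hypothetical check result that only reports dataset-level row counts.
sample = {
    "lastCheckResultValue": {
        "diagnostics": {"missing": {"failedRowsCount": 25, "datasetRowsTested": 1000}}
    }
}
print(extract_diagnostic_metrics(sample))
```

For this sample input the fallback applies, so the sketch reports 1000 loaded rows, 25 failed rows, 975 passed rows, and a passing fraction of 0.975. A check result without diagnostics yields an empty dictionary, so no attributes would be written.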

Fallback Mechanisms

| Priority | Field Used | Fallback Reason |
| --- | --- | --- |
| 1st | checkRowsTested | Preferred - rows actually tested by the specific check |
| 2nd | datasetRowsTested | Fallback - total dataset rows when check-specific count unavailable |

Example Processing Flow

Input: Soda Check Result

Output: Collibra Attributes

Benefits

  • Future-Proof: Automatically works with new diagnostic types Soda introduces

  • Comprehensive: Provides both raw metrics and calculated insights

  • Flexible: Handles partial data gracefully with intelligent fallbacks

  • Accurate: Uses check-specific row counts when available

  • Transparent: Detailed logging shows exactly which metrics were found and used


Testing

Unit Tests

Local Kubernetes Testing

See Deploy on Kubernetes for details on the Kubernetes deployment.

Legacy Tests


Advanced Configuration

Performance Tuning

Modify constants.py for your environment:
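As a rough sketch, the tunable constants named in the Troubleshooting section could look like the following. The constant names come from this page; every value, the unit of `RATE_LIMIT_DELAY`, and the one-line descriptions are assumptions for illustration and are not the shipped defaults.

```python
# constants.py (illustrative values only, not the shipped defaults)

BATCH_SIZE = 50            # assumed: items sent per bulk create/update call
DEFAULT_PAGE_SIZE = 1000   # assumed: items requested per page when listing
RATE_LIMIT_DELAY = 2.0     # assumed: seconds to pause between API calls
CACHE_MAX_SIZE = 128       # assumed: maximum entries kept in the LRU cache
```

Check the values in your own copy of constants.py before changing anything, and adjust one constant at a time so that the effect on run time and rate-limit errors is easy to attribute.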

Enhanced Configuration Options

For detailed information on configuring custom attribute syncing, see the Custom Attribute Syncing section above.

Custom Logging

Environment Variables


Troubleshooting

Common Issues

Performance Issues

  • Slow Processing: Increase BATCH_SIZE and DEFAULT_PAGE_SIZE

  • Rate Limiting: Increase RATE_LIMIT_DELAY

  • Memory Usage: Decrease CACHE_MAX_SIZE
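To see why a larger `RATE_LIMIT_DELAY` reduces rate-limit errors at the cost of run time, consider a simple throttle that enforces a minimum gap between consecutive API calls. This is a hedged sketch of the general technique; the `Throttle` class is hypothetical and the integration's own throttling may work differently.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(
        self,
        delay_seconds: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._delay = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        # Sleep only for the part of the delay that has not already elapsed.
        if self._last is not None:
            remaining = self._delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


# Short delay so the demo finishes quickly; a real delay would be longer.
throttle = Throttle(delay_seconds=0.01)
for name in ["orders", "customers", "payments"]:
    throttle.wait()  # pauses before the second and third iterations
    print(f"calling API for {name}")
```

With N calls and a delay of d seconds, throttling adds at most roughly (N - 1) × d seconds to a run, which is the trade-off to weigh when raising the delay.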

Connection Issues

  • API Timeouts: Check network connectivity and API endpoints

  • Authentication: Verify credentials and permissions

  • Rate Limits: Monitor API usage and adjust delays

Data Issues

  • Missing Assets: Ensure required asset types exist in Collibra

  • Relation Failures: Verify relation type configurations

  • Domain Mapping: Check domain IDs and JSON formatting
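A quick way to rule out formatting problems in the domain mapping is to parse it and validate each ID before a run. The sketch below assumes the mapping is a JSON object whose values are Collibra domain UUIDs; that shape, the function name, and the sample values are assumptions for illustration.

```python
import json
import uuid
from typing import Dict


def parse_domain_mapping(raw: str) -> Dict[str, str]:
    """Parse a JSON domain mapping and check that every value is a UUID."""
    try:
        mapping = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Domain mapping is not valid JSON: {exc}") from exc
    if not isinstance(mapping, dict):
        raise ValueError("Domain mapping must be a JSON object")
    for source_name, domain_id in mapping.items():
        try:
            uuid.UUID(str(domain_id))
        except ValueError:
            raise ValueError(
                f"Domain ID for {source_name!r} is not a valid UUID: {domain_id!r}"
            ) from None
    return mapping


# Hypothetical mapping with a made-up domain name and UUID.
raw_mapping = '{"Sales": "123e4567-e89b-12d3-a456-426614174000"}'
print(parse_domain_mapping(raw_mapping))
```

A trailing comma, single quotes, or a truncated ID all surface here as a `ValueError` with a specific message rather than as a failed API call later in the run.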

Diagnostic Metrics Issues

  • Missing Diagnostic Attributes: Check if Soda checks have lastCheckResultValue.diagnostics data

  • Incomplete Metrics: Some diagnostic types may only have partial metrics (e.g., aggregate checks lack failedRowsCount)

  • Attribute Type Configuration: Verify diagnostic attribute type IDs are configured correctly in config.yaml

  • Zero Division Errors: System automatically prevents division by zero when calculating fractions

Debug Commands

Log Analysis

Look for these patterns in debug logs:

General Operation Patterns:

  • Rate limit prevention: Normal throttling behavior

  • Successfully updated/created: Successful operations

  • Skipping dataset: Expected filtering behavior

  • ERROR: Issues requiring attention

Diagnostic Processing Patterns:

  • Processing diagnostics: Diagnostic data found in check result

  • Found failedRowsCount in 'X': Successfully extracted failure count from diagnostic type X

  • Found checkRowsTested in 'X': Successfully extracted row count from diagnostic type X

  • Using datasetRowsTested from 'X' as fallback: Fallback mechanism activated

  • No diagnostics found in check result: Check has no diagnostic data (normal for some check types)

  • Calculated check_rows_passed: Successfully computed passing rows

  • Added check_X_attribute: Diagnostic attribute successfully added to Collibra
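Counting how often each pattern occurs gives a quick overview of a long debug log. The script below is a minimal sketch: the regular expressions are derived from the phrases listed above and assume the log uses exactly that wording, and the sample log lines are invented for the demonstration.

```python
import re
from collections import Counter
from typing import Iterable

# Patterns based on the phrases listed above; exact log wording is assumed.
PATTERNS = {
    "rate_limit": re.compile(r"Rate limit prevention"),
    "success": re.compile(r"Successfully (?:updated|created)"),
    "skipped": re.compile(r"Skipping dataset"),
    "error": re.compile(r"\bERROR\b"),
    "fallback": re.compile(r"Using datasetRowsTested from '.+' as fallback"),
    "no_diagnostics": re.compile(r"No diagnostics found in check result"),
}


def summarize_log(lines: Iterable[str]) -> Counter:
    """Count the lines that match each known pattern."""
    counts: Counter = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Invented sample lines for demonstration only.
sample_log = [
    "DEBUG Rate limit prevention: sleeping",
    "INFO Successfully created asset",
    "INFO Successfully updated asset",
    "DEBUG Using datasetRowsTested from 'missing' as fallback",
    "ERROR Request failed",
]
print(summarize_log(sample_log))
```

To analyze a real run, pass an open file object instead of the sample list, since a file iterates line by line.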


Reference

Common Commands

Key Configuration Sections

  • Collibra Base: collibra.base_url, collibra.username, collibra.password

  • Soda API: soda.api_key_id, soda.api_key_secret

  • Custom Attributes: soda.attributes.custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id

  • Domain Mapping: collibra.domains.soda_collibra_domain_mapping

  • Ownership Sync: collibra.responsibilities.owner_role_id
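Before a run, it can help to confirm that the configuration contains the keys listed above. The sketch below assumes the dotted names correspond to nested keys in the loaded config.yaml and works on an already-parsed dictionary. Which keys are treated as required is an assumption, and `missing_keys` is a hypothetical helper.

```python
from typing import Any, Dict, List, Sequence

# Dotted key paths taken from the list above; treating them as required
# is an assumption made for this example.
REQUIRED_KEYS = [
    "collibra.base_url",
    "collibra.username",
    "collibra.password",
    "soda.api_key_id",
    "soda.api_key_secret",
    "collibra.domains.soda_collibra_domain_mapping",
]


def missing_keys(config: Dict[str, Any], required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the dotted paths that are absent or empty in config."""
    missing = []
    for dotted in required:
        node: Any = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or node.get(part) in (None, ""):
                missing.append(dotted)
                break
            node = node[part]
    return missing


# Hypothetical, partially filled configuration.
config = {
    "collibra": {"base_url": "https://example.invalid", "username": "svc", "password": ""},
    "soda": {"api_key_id": "abc"},
}
print(missing_keys(config))
```

For this sample the sketch reports `collibra.password`, `soda.api_key_secret`, and `collibra.domains.soda_collibra_domain_mapping` as missing.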

Essential UUIDs to Configure

  • Asset types (table, check, dimension, column)

  • Attribute types (evaluation status, sync date, diagnostic metrics)

  • Relation types (table-to-check, check-to-dimension)

  • Domain IDs for asset creation


Support

For issues and questions:

  1. Check the Troubleshooting section

  2. Enable Debug Logging for detailed information

  3. Review the performance metrics for bottlenecks

  4. Consult the Unit Tests for usage examples

  5. Contact [email protected] for additional help

