Operations & advanced usage

This page covers running and maintaining the Soda↔Collibra bi-directional integration after setup: performance, monitoring, diagnostic metrics, testing, tuning, and troubleshooting. It aims to give technical implementers the detail they need to operate the integration efficiently, resolve issues quickly, and adapt it to complex environments.

Performance & Monitoring

Performance Optimization

Caching System

Domain Mappings: Cached for the entire session
Asset Lookups: LRU cache reduces repeated API calls
Configuration Parsing: One-time parsing with caching
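
As a rough illustration of the asset-lookup caching listed above, the sketch below wraps a lookup in Python's functools.lru_cache. The names find_asset_id and _fetch_asset_id are hypothetical and the remote call is faked; only the LRU approach and the size of 128 (the CACHE_MAX_SIZE default shown under Performance Tuning) come from this page.

```python
from functools import lru_cache

API_CALLS = {"count": 0}  # stands in for real HTTP traffic in this sketch


def _fetch_asset_id(domain_id: str, asset_name: str) -> str:
    """Hypothetical remote lookup; a real one would call the Collibra API."""
    API_CALLS["count"] += 1
    return f"{domain_id}:{asset_name}"


@lru_cache(maxsize=128)  # 128 mirrors the CACHE_MAX_SIZE default
def find_asset_id(domain_id: str, asset_name: str) -> str:
    """Return the asset ID, reaching the API only on a cache miss."""
    return _fetch_asset_id(domain_id, asset_name)


find_asset_id("domain-1", "customers")  # miss: one simulated API call
find_asset_id("domain-1", "customers")  # hit: served from the cache
info = find_asset_id.cache_info()       # hits=1, misses=1
```

Because the cache key is built from the function arguments, repeated lookups for the same domain and asset name within a session cost a single API call.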

Batch Processing

Asset Operations: Create/update multiple assets in single calls
Attribute Management: Bulk attribute creation and updates
Relation Creation: Batch relationship establishment
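
A minimal sketch of the batching idea, assuming items are grouped into slices of BATCH_SIZE (50 by default, see Performance Tuning) before each bulk call. The helper name chunked and the sample payloads are hypothetical; the actual bulk API calls are not shown.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int = 50) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 120 hypothetical asset payloads become three bulk requests instead of 120.
assets = [{"name": f"check_{i}"} for i in range(120)]
batches = list(chunked(assets, batch_size=50))
batch_sizes = [len(batch) for batch in batches]  # [50, 50, 20]
```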

Performance Results

3-5x faster execution vs. original implementation
60% fewer API calls through caching
90% reduction in rate limit errors
Improved reliability with comprehensive error handling

Performance Benchmarks

Typical Performance

Small datasets (< 100 checks): 30-60 seconds
Medium datasets (100-1000 checks): 2-5 minutes
Large datasets (1000+ checks): 5-15 minutes

Performance varies based on:

Network latency to APIs
Number of existing vs. new assets
Complexity of relationships
API rate limits

Monitoring & Metrics

Integration Completion Report

============================================================
🎉 INTEGRATION COMPLETED SUCCESSFULLY 🎉
============================================================
📊 Datasets processed: 15
⏭️  Datasets skipped: 2
✅ Checks created: 45
🔄 Checks updated: 67
📝 Attributes created: 224
🔄 Attributes updated: 156
🔗 Dimension relations created: 89
📋 Table relations created: 23
📊 Column relations created: 89
👥 Owners synchronized: 12
❌ Ownership sync failures: 1

🎯 Total operations performed: 693
============================================================
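
In the sample report above, the total of 693 equals the sum of the check, attribute, and relation counters; the dataset and ownership counters are not part of it. This reading is inferred from the sample numbers only, not from the integration's source, and the dictionary keys below are hypothetical.

```python
# Counters copied from the sample report; key names are illustrative only.
operation_counts = {
    "checks_created": 45,
    "checks_updated": 67,
    "attributes_created": 224,
    "attributes_updated": 156,
    "dimension_relations_created": 89,
    "table_relations_created": 23,
    "column_relations_created": 89,
}

total_operations = sum(operation_counts.values())  # 693
```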

Debug Logging

Enable detailed logging for troubleshooting:

python main.py --debug

Debug output includes:

Dataset processing details
API call timing and results
Caching hit/miss statistics
Error context and stack traces
Performance metrics per operation
Ownership synchronization details

Diagnostic Metrics Processing

The integration automatically extracts diagnostic metrics from Soda check results and populates detailed row-level statistics in Collibra.

Supported Metrics

| Metric | Source | Description |
| --- | --- | --- |
| check_loaded_rows_attribute | checkRowsTested or datasetRowsTested | Total number of rows evaluated by the check |
| check_rows_failed_attribute | failedRowsCount | Number of rows that failed the check |
| check_rows_passed_attribute | Calculated | check_loaded_rows - check_rows_failed |
| check_passing_fraction_attribute | Calculated | check_rows_passed / check_loaded_rows |
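
The two calculated metrics can be sketched as follows. The function name derive_passing_metrics is hypothetical, and returning None for values that cannot be computed is an assumption about how missing data and zero row counts are handled.

```python
from typing import Optional, Tuple


def derive_passing_metrics(
    loaded_rows: Optional[int], failed_rows: Optional[int]
) -> Tuple[Optional[int], Optional[float]]:
    """Return (rows_passed, passing_fraction); None where not computable."""
    if loaded_rows is None or failed_rows is None:
        return None, None
    rows_passed = loaded_rows - failed_rows
    if loaded_rows == 0:
        # Guard against division by zero: no fraction for an empty check.
        return rows_passed, None
    return rows_passed, rows_passed / loaded_rows


passed, fraction = derive_passing_metrics(274577, 3331)
rounded_fraction = round(fraction, 4)  # 0.9879, with passed == 271246
```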

Flexible Diagnostic Type Support

The system automatically extracts metrics from any diagnostic type, making it future-proof:

Current Soda Diagnostic Types

// Missing value checks
{
  "diagnostics": {
    "missing": {
      "failedRowsCount": 3331,
      "failedRowsPercent": 1.213,
      "datasetRowsTested": 274577,
      "checkRowsTested": 274577
    }
  }
}

// Aggregate checks  
{
  "diagnostics": {
    "aggregate": {
      "datasetRowsTested": 274577,
      "checkRowsTested": 274577
    }
  }
}

Future Diagnostic Types (Automatically Supported)

// Hypothetical future types
{
  "diagnostics": {
    "valid": {
      "failedRowsCount": 450,
      "validRowsCount": 9550,
      "checkRowsTested": 10000
    },
    "duplicate": {
      "duplicateRowsCount": 200,
      "checkRowsTested": 8000
    }
  }
}

Intelligent Extraction Logic

The system uses a metric-focused approach rather than type-specific logic:

Scans All Diagnostic Types: Iterates through every diagnostic type in the response
Extracts Relevant Metrics: Looks for specific metric fields regardless of diagnostic type name
Applies Smart Fallbacks: Uses datasetRowsTested if checkRowsTested is not available
Calculates Derived Metrics: Computes passing rows and fraction when source data is available
Handles Missing Data: Gracefully skips attributes when diagnostic data is unavailable
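
The steps above can be sketched as a single pass over the diagnostics object. This is an illustration under assumptions, not the integration's code: the function name extract_row_metrics and the returned keys are hypothetical, and taking the first value found when several diagnostic types carry the same field is a choice made for this sketch.

```python
from typing import Any, Dict, Optional


def extract_row_metrics(
    diagnostics: Dict[str, Dict[str, Any]]
) -> Dict[str, Optional[int]]:
    """Collect failed and loaded row counts across all diagnostic types."""
    failed = check_rows = dataset_rows = None
    for metrics in diagnostics.values():  # type names are deliberately ignored
        if failed is None:
            failed = metrics.get("failedRowsCount")
        if check_rows is None:
            check_rows = metrics.get("checkRowsTested")
        if dataset_rows is None:
            dataset_rows = metrics.get("datasetRowsTested")
    # Prefer the check-specific count; fall back to the dataset-level count.
    loaded = check_rows if check_rows is not None else dataset_rows
    return {"loaded_rows": loaded, "failed_rows": failed}


missing = extract_row_metrics(
    {"missing": {"failedRowsCount": 3331, "checkRowsTested": 274577}}
)
fallback = extract_row_metrics({"aggregate": {"datasetRowsTested": 274577}})
empty = extract_row_metrics({})
```

Here missing holds both counts, fallback holds a loaded-row count taken from datasetRowsTested with no failed-row count, and empty holds None for both, which a caller would treat as a signal to skip the corresponding attributes.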

Fallback Mechanisms

| Priority | Field Used | Fallback Reason |
| --- | --- | --- |
| 1st | checkRowsTested | Preferred - rows actually tested by the specific check |
| 2nd | datasetRowsTested | Fallback - total dataset rows when check-specific count unavailable |

Example Processing Flow

Input: Soda Check Result

{
  "name": "customer_id is present",
  "evaluationStatus": "fail",
  "lastCheckResultValue": {
    "value": 1.213,
    "diagnostics": {
      "missing": {
        "failedRowsCount": 3331,
        "checkRowsTested": 274577
      }
    }
  }
}

Output: Collibra Attributes

Attributes Set:
  - check_loaded_rows_attribute: 274577           # From checkRowsTested
  - check_rows_failed_attribute: 3331             # From failedRowsCount  
  - check_rows_passed_attribute: 271246           # Calculated: 274577 - 3331
  - check_passing_fraction_attribute: 0.9879      # Calculated: 271246 / 274577

Benefits

✅ Future-Proof: Automatically works with new diagnostic types Soda introduces
✅ Comprehensive: Provides both raw metrics and calculated insights
✅ Flexible: Handles partial data gracefully with intelligent fallbacks
✅ Accurate: Uses check-specific row counts when available
✅ Transparent: Detailed logging shows exactly which metrics were found and used

Testing

Unit Tests

# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_integration.py -v

# Run with coverage
python -m pytest tests/ --cov=integration --cov-report=html

Local Kubernetes Testing

Head to Deploy on Kubernetes to learn more about the Kubernetes deployment.

# Comprehensive local testing (recommended)
python testing/test_k8s_local.py

# Docker-specific testing
./testing/test_docker_local.sh

# Quick validation
python testing/validate_k8s.py

Legacy Tests

# Test Soda client functionality
python main.py --test-soda

# Test Collibra client functionality
python main.py --test-collibra

Advanced Configuration

Performance Tuning

Modify constants.py for your environment:

class IntegrationConstants:
    MAX_RETRIES = 3              # API retry attempts
    BATCH_SIZE = 50              # Batch operation size
    DEFAULT_PAGE_SIZE = 1000     # API pagination size
    RATE_LIMIT_DELAY = 2         # Rate limiting delay
    CACHE_MAX_SIZE = 128         # LRU cache size

Enhanced Configuration Options

For detailed information on configuring custom attribute syncing, see the Custom Attribute Syncing section.

Custom Logging

# In your code
import logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

Environment Variables

# Set custom config path
export SODA_COLLIBRA_CONFIG=/path/to/custom/config.yaml

# Enable debug mode
export SODA_COLLIBRA_DEBUG=true
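
For reference, a Python process could read these two variables as sketched below. The function name read_env_settings, the default path config.yaml, and treating only the value "true" (case-insensitive) as enabling debug mode are assumptions, not documented behavior.

```python
import os
from typing import Mapping, Tuple


def read_env_settings(
    env: Mapping[str, str] = os.environ,
    default_config: str = "config.yaml",  # assumed default path
) -> Tuple[str, bool]:
    """Return (config_path, debug_enabled) from the two variables above."""
    config_path = env.get("SODA_COLLIBRA_CONFIG", default_config)
    debug = env.get("SODA_COLLIBRA_DEBUG", "").strip().lower() == "true"
    return config_path, debug


custom = read_env_settings(
    {
        "SODA_COLLIBRA_CONFIG": "/path/to/custom/config.yaml",
        "SODA_COLLIBRA_DEBUG": "true",
    }
)
defaults = read_env_settings({})
```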

Troubleshooting

Common Issues

Performance Issues

Slow Processing: Increase BATCH_SIZE and DEFAULT_PAGE_SIZE
Rate Limiting: Increase RATE_LIMIT_DELAY
Memory Usage: Decrease CACHE_MAX_SIZE

Connection Issues

API Timeouts: Check network connectivity and API endpoints
Authentication: Verify credentials and permissions
Rate Limits: Monitor API usage and adjust delays

Data Issues

Missing Assets: Ensure required asset types exist in Collibra
Relation Failures: Verify relation type configurations
Domain Mapping: Check domain IDs and JSON formatting

Diagnostic Metrics Issues

Missing Diagnostic Attributes: Check if Soda checks have lastCheckResultValue.diagnostics data
Incomplete Metrics: Some diagnostic types may only have partial metrics (e.g., aggregate checks lack failedRowsCount)
Attribute Type Configuration: Verify diagnostic attribute type IDs are configured correctly in config.yaml
Zero Division Errors: System automatically prevents division by zero when calculating fractions

Debug Commands

# Full debug output
python main.py --debug 2>&1 | tee debug.log

# Verbose logging with timestamps
python main.py --verbose

# Test specific components
python main.py --test-soda --debug
python main.py --test-collibra --debug

Log Analysis

Look for these patterns in debug logs:

General Operation Patterns:

Rate limit prevention: Normal throttling behavior
Successfully updated/created: Successful operations
Skipping dataset: Expected filtering behavior
ERROR: Issues requiring attention

Diagnostic Processing Patterns:

Processing diagnostics: Diagnostic data found in check result
Found failedRowsCount in 'X': Successfully extracted failure count from diagnostic type X
Found checkRowsTested in 'X': Successfully extracted row count from diagnostic type X
Using datasetRowsTested from 'X' as fallback: Fallback mechanism activated
No diagnostics found in check result: Check has no diagnostic data (normal for some check types)
Calculated check_rows_passed: Successfully computed passing rows
Added check_X_attribute: Diagnostic attribute successfully added to Collibra
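
A small helper can tally these patterns in a captured log such as the debug.log produced under Debug Commands. The sketch below uses plain substring matching; the name count_patterns is hypothetical, and the sample lines are placeholders built only from the patterns listed above, since real log lines carry more context.

```python
from collections import Counter
from typing import Iterable

PATTERNS = (
    "Rate limit prevention",
    "Skipping dataset",
    "as fallback",
    "No diagnostics found in check result",
    "ERROR",
)


def count_patterns(lines: Iterable[str]) -> Counter:
    """Count how many lines contain each pattern (substring match)."""
    counts = Counter({pattern: 0 for pattern in PATTERNS})
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts


# Placeholder lines, not real integration output.
sample_lines = [
    "Rate limit prevention",
    "Skipping dataset",
    "Using datasetRowsTested from 'X' as fallback",
    "ERROR",
    "ERROR",
]
counts = count_patterns(sample_lines)

# Against a real capture:
# with open("debug.log", encoding="utf-8") as handle:
#     counts = count_patterns(handle)
```

A rising ERROR count or frequent fallback lines are usually the first things worth investigating.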

Reference

Common Commands

# Basic run with default config
python main.py

# Debug mode with detailed logging
python main.py --debug

# Use custom configuration file
python main.py --config custom.yaml

# Test individual components
python main.py --test-soda --debug
python main.py --test-collibra --debug

Key Configuration Sections

Collibra Base: collibra.base_url, collibra.username, collibra.password
Soda API: soda.api_key_id, soda.api_key_secret
Custom Attributes: soda.attributes.custom_attributes_mapping_soda_attribute_name_to_collibra_attribute_type_id
Domain Mapping: collibra.domains.soda_collibra_domain_mapping
Ownership Sync: collibra.responsibilities.owner_role_id
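
Before a run, it can help to confirm that the connection settings are present. The sketch below assumes the dotted names above map to nested keys in the parsed config.yaml, and that the five connection keys are mandatory. Both are assumptions, and missing_keys is a hypothetical helper that works on an already-loaded dictionary.

```python
from typing import Any, Dict, List, Sequence

REQUIRED_KEYS = (
    "collibra.base_url",
    "collibra.username",
    "collibra.password",
    "soda.api_key_id",
    "soda.api_key_secret",
)


def missing_keys(
    config: Dict[str, Any], required: Sequence[str] = REQUIRED_KEYS
) -> List[str]:
    """Return the dotted keys that are absent from a nested config dict."""
    missing = []
    for dotted in required:
        node: Any = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


# Placeholder values; the password is left out on purpose.
config = {
    "collibra": {"base_url": "https://example.invalid", "username": "svc"},
    "soda": {"api_key_id": "id", "api_key_secret": "secret"},
}
absent = missing_keys(config)  # ["collibra.password"]
```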

Essential UUIDs to Configure

Asset types (table, check, dimension, column)
Attribute types (evaluation status, sync date, diagnostic metrics)
Relation types (table-to-check, check-to-dimension)
Domain IDs for asset creation

Support

For issues and questions:

Check the Troubleshooting section
Enable Debug Logging for detailed information
Review the performance metrics for bottlenecks
Consult the Unit Tests for usage examples
Contact Soda support for additional help
