OpenTelemetry Infrastructure Requirements

This document outlines the infrastructure requirements and deployment strategy for OpenTelemetry in the ivatar project, considering the existing Fedora Project hosting environment and multi-instance setup.

Current Infrastructure Analysis

Existing Monitoring Stack

  • Prometheus + Alertmanager: Metrics collection and alerting
  • Loki: Log aggregation
  • Alloy: Observability data collection
  • Grafana: Visualization and dashboards
  • Custom exporters: Application-specific metrics

Production Environment

  • Scale: Millions of requests daily, 30k+ users, 33k+ avatar images
  • Infrastructure: Fedora Project hosted, high-performance system
  • Architecture: Apache HTTPD + Gunicorn containers + PostgreSQL
  • Containerization: Podman (not Docker)

Multi-Instance Setup

  • Production: production instance, deployed from the master branch
  • Development: development instance, deployed from the devel branch
  • Deployment: GitLab CI/CD with Puppet automation

Infrastructure Options

Option A: Extend Existing Alloy Configuration

Advantages:

  • Leverages existing infrastructure
  • Minimal additional complexity
  • Consistent with current monitoring approach
  • Cost-effective

Implementation:

// Alloy configuration extension (Alloy syntax; hostnames and labels are placeholders)
otelcol.receiver.otlp "ivatar" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    traces  = [otelcol.processor.batch.ivatar.input]
    metrics = [otelcol.processor.batch.ivatar.input]
  }
}

otelcol.processor.batch "ivatar" {
  timeout         = "1s"
  send_batch_size = 1024
  output {
    traces  = [otelcol.exporter.otlp.jaeger.input]
    metrics = [otelcol.exporter.prometheus.ivatar.input]
  }
}

// Jaeger accepts OTLP natively, so traces are sent over OTLP gRPC
otelcol.exporter.otlp "jaeger" {
  client {
    endpoint = "jaeger-collector:4317"
    tls {
      insecure = true // enable TLS in production (see Security Considerations)
    }
  }
}

// Convert OTLP metrics to Prometheus format and push them to the existing Prometheus
otelcol.exporter.prometheus "ivatar" {
  forward_to = [prometheus.remote_write.default.receiver]
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus.internal:9090/api/v1/write"
  }
}

Option B: Dedicated OpenTelemetry Collector

Advantages:

  • Full OpenTelemetry feature set
  • Better performance for high-volume tracing
  • More flexible configuration options
  • Future-proof architecture

Implementation:

  • Deploy standalone OpenTelemetry Collector
  • Configure OTLP receivers and exporters
  • Integrate with existing Prometheus/Grafana

Deployment Strategy

Phase 1: Development Environment

  1. Enable OpenTelemetry in Development

    # Development environment configuration
    export OTEL_ENABLED=true
    export OTEL_SERVICE_NAME=ivatar-development
    export OTEL_ENVIRONMENT=development
    export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
    export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
    
  2. Update Alloy Configuration

    • Add OTLP receivers to existing Alloy instance
    • Configure trace and metrics pipelines
    • Test data flow
  3. Verify Integration

    • Check metrics endpoint: http://dev-instance:9464/metrics
    • Verify trace data in Jaeger
    • Monitor Grafana dashboards

Phase 2: Production Deployment

  1. Production Configuration

    # Production environment configuration
    export OTEL_ENABLED=true
    export OTEL_SERVICE_NAME=ivatar-production
    export OTEL_ENVIRONMENT=production
    export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
    export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
    
  2. Gradual Rollout

    • Deploy to one Gunicorn container first
    • Monitor performance impact
    • Gradually enable on all containers
  3. Performance Monitoring

    • Monitor collector resource usage
    • Check application performance impact
    • Verify data quality

Resource Requirements

Collector Resources

Minimum Requirements:

  • CPU: 2 cores
  • Memory: 4GB RAM
  • Storage: 10GB for temporary data
  • Network: 1Gbps

Recommended for Production:

  • CPU: 4 cores
  • Memory: 8GB RAM
  • Storage: 50GB SSD
  • Network: 10Gbps

Network Requirements

Ports:

  • 4317: OTLP gRPC receiver
  • 4318: OTLP HTTP receiver
  • 9464: Prometheus metrics exporter
  • 14250: Jaeger collector gRPC (legacy protocol; Jaeger also accepts OTLP on 4317/4318)

Bandwidth:

  • Estimated 1-5 Mbps per instance
  • Burst capacity for peak loads
  • Low-latency connection to collectors

Configuration Management

Environment-Specific Settings

Production Environment

# Production OpenTelemetry configuration
OTEL_ENABLED=true
OTEL_SERVICE_NAME=ivatar-production
OTEL_ENVIRONMENT=production
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
OTEL_SAMPLING_RATIO=0.1  # 10% sampling for high volume
IVATAR_VERSION=1.8.0
HOSTNAME=prod-instance-01

Development Environment

# Development OpenTelemetry configuration
OTEL_ENABLED=true
OTEL_SERVICE_NAME=ivatar-development
OTEL_ENVIRONMENT=development
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
OTEL_SAMPLING_RATIO=1.0  # 100% sampling for debugging
IVATAR_VERSION=1.8.0-dev
HOSTNAME=dev-instance-01

Container Configuration

Podman Container Updates

# Add to existing Dockerfile (version specifiers must be quoted so the shell
# does not treat ">=" as a redirection)
RUN pip install "opentelemetry-api>=1.20.0" \
    "opentelemetry-sdk>=1.20.0" \
    "opentelemetry-instrumentation-django>=0.42b0" \
    "opentelemetry-instrumentation-psycopg2>=0.42b0" \
    "opentelemetry-instrumentation-pymysql>=0.42b0" \
    "opentelemetry-instrumentation-requests>=0.42b0" \
    "opentelemetry-instrumentation-urllib3>=0.42b0" \
    "opentelemetry-exporter-otlp>=1.20.0" \
    "opentelemetry-exporter-prometheus>=1.12.0rc1" \
    "opentelemetry-instrumentation-pymemcache>=0.42b0"
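
Installing these packages is not enough on its own; each instrumentation has to be activated at process start-up. A minimal sketch of doing so (the helper name and its placement in ivatar's WSGI entry point or settings module are assumptions, not the project's actual layout):

# Enable auto-instrumentation for the installed libraries; call this once,
# before Django starts handling requests.
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


def enable_instrumentation() -> None:
    DjangoInstrumentor().instrument()
    Psycopg2Instrumentor().instrument()
    RequestsInstrumentor().instrument()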

Container Environment Variables

# Add to container startup script
export OTEL_ENABLED=${OTEL_ENABLED:-false}
export OTEL_SERVICE_NAME=${OTEL_SERVICE_NAME:-ivatar}
export OTEL_ENVIRONMENT=${OTEL_ENVIRONMENT:-development}
export OTEL_EXPORTER_OTLP_ENDPOINT=${OTEL_EXPORTER_OTLP_ENDPOINT}
export OTEL_PROMETHEUS_ENDPOINT=${OTEL_PROMETHEUS_ENDPOINT:-0.0.0.0:9464}
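
A hedged sketch of how these variables could be consumed when the tracer provider is created, using the standard OpenTelemetry Python SDK (the setup_telemetry() helper and its placement are illustrative, not ivatar's actual code):

import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def setup_telemetry() -> None:
    # Honour the opt-in switch from the container startup script
    if os.environ.get("OTEL_ENABLED", "false").lower() != "true":
        return

    resource = Resource.create({
        "service.name": os.environ.get("OTEL_SERVICE_NAME", "ivatar"),
        "deployment.environment": os.environ.get("OTEL_ENVIRONMENT", "development"),
    })
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(
                endpoint=os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"],
                insecure=True,  # switch to TLS per the security section
            )
        )
    )
    trace.set_tracer_provider(provider)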

Monitoring and Alerting

Collector Health Monitoring

Collector Metrics

  • otelcol_receiver_accepted_spans: Spans received by collector
  • otelcol_receiver_refused_spans: Spans rejected by collector
  • otelcol_exporter_sent_spans: Spans sent to exporters
  • otelcol_exporter_send_failed_spans: Failed span exports

Health Checks

# Prometheus health check
- job_name: "otel-collector-health"
  static_configs:
    - targets: ["collector.internal:8888"]
  metrics_path: /metrics
  scrape_interval: 30s

Application Performance Impact

Key Metrics to Monitor

  • Application response time impact
  • Memory usage increase
  • CPU usage increase
  • Network bandwidth usage

Alerting Rules

# Prometheus alerting rules for collector health
groups:
  - name: otel-collector
    rules:
      - alert: HighCollectorCPU
        # depending on the collector version the counter may be exposed
        # without the _total suffix (otelcol_process_cpu_seconds)
        expr: rate(otelcol_process_cpu_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High collector CPU usage"
          description: "Collector CPU usage is {{ $value }}"

      - alert: HighCollectorMemory
        # otelcol_process_memory_rss is the collector's own RSS metric;
        # the threshold is roughly 80% of the 8 GB recommended above
        expr: otelcol_process_memory_rss > 6.4e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High collector memory usage"
          description: "Collector memory usage is {{ $value }} bytes"

Security Considerations

Network Security

  • Use TLS for collector communications
  • Restrict collector access to trusted networks
  • Implement proper firewall rules

Data Privacy

  • Ensure no sensitive data in trace attributes
  • Implement data sanitization (see the sketch below)
  • Configure appropriate retention policies
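
For ivatar, the most likely sensitive value is a user's email address. One hedged approach is to hash it before it is attached to a span, so raw addresses never leave the application (the helper name and attribute key are illustrative, not actual ivatar span fields):

import hashlib

from opentelemetry import trace


def set_email_attribute(email: str) -> None:
    # Attach only a digest of the normalised address to the current span
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    trace.get_current_span().set_attribute("enduser.id.hash", digest)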

Access Control

  • Restrict access to metrics endpoints
  • Implement authentication for collector access
  • Monitor access logs

Backup and Recovery

Data Retention

  • Traces: 7 days (configurable)
  • Metrics: 30 days (configurable)
  • Logs: 14 days (configurable)

Backup Strategy

  • Regular backup of collector configuration
  • Backup of Grafana dashboards
  • Backup of Prometheus rules

Performance Optimization

Sampling Strategy

  • Production: 10% sampling rate
  • Development: 100% sampling rate
  • Error traces: Always sample (see the note and sketch below)
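
Head-based sampling covers the first two points; always sampling error traces generally requires tail-based sampling in the collector, because the sampling decision is made before a span's status is known. A minimal SDK-side sketch of the ratio-based part, mirroring the OTEL_SAMPLING_RATIO values above (the code itself is illustrative):

import os

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# 0.1 in production, 1.0 in development (see Configuration Management)
ratio = float(os.environ.get("OTEL_SAMPLING_RATIO", "1.0"))
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(ratio)))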

Batch Processing

  • Optimize batch sizes for network conditions (see the sketch below)
  • Configure appropriate timeouts
  • Monitor queue depths
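
As an illustration of SDK-side batch tuning (the numbers are starting points, not measured recommendations for ivatar):

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://collector.internal:4317", insecure=True),
    max_queue_size=2048,          # bound memory; spans are dropped when the queue is full
    schedule_delay_millis=5000,   # flush at least every 5 s under low traffic
    max_export_batch_size=512,    # cap the batch size sent per export call
    export_timeout_millis=30000,  # give up on slow exports after 30 s
)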

Resource Optimization

  • Monitor collector resource usage
  • Scale collectors based on load
  • Implement horizontal scaling if needed

Troubleshooting

Common Issues

Collector Not Receiving Data

  • Check network connectivity
  • Verify OTLP endpoint configuration
  • Check collector logs

High Resource Usage

  • Adjust sampling rates
  • Optimize batch processing
  • Scale collector resources

Data Quality Issues

  • Verify instrumentation configuration
  • Check span attribute quality
  • Monitor error rates

Debug Procedures

  1. Check Collector Status

    curl http://collector.internal:8888/metrics
    
  2. Verify Application Configuration

    curl http://app:9464/metrics
    
  3. Check Trace Data

    • Access Jaeger UI
    • Search for recent traces
    • Verify span attributes (a standalone test emitter is sketched below)
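
If no traces appear, a small standalone emitter helps separate application problems from collector problems (endpoint and service name are placeholders):

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "otel-smoke-test"}))
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://collector.internal:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

# Emit a single span, then flush; it should be searchable in Jaeger shortly after
with trace.get_tracer("smoke-test").start_as_current_span("smoke-test-span"):
    pass
provider.shutdown()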

Future Enhancements

Advanced Features

  • Custom dashboards for avatar metrics
  • Advanced sampling strategies
  • Log correlation with traces
  • Performance profiling integration

Scalability Improvements

  • Horizontal collector scaling
  • Load balancing for collectors
  • Multi-region deployment
  • Edge collection points

Integration Enhancements

  • Additional exporter backends
  • Custom processors
  • Advanced filtering
  • Data transformation

Cost Considerations

Infrastructure Costs

  • Additional compute resources for collectors
  • Storage costs for trace data
  • Network bandwidth costs

Operational Costs

  • Monitoring and maintenance
  • Configuration management
  • Troubleshooting and support

Optimization Strategies

  • Implement efficient sampling
  • Use appropriate retention policies
  • Optimize batch processing
  • Monitor resource usage

Conclusion

The OpenTelemetry integration for ivatar provides comprehensive observability while leveraging the existing monitoring infrastructure. The phased deployment approach ensures minimal disruption to production services while providing valuable insights into avatar generation performance and user behavior.

Key success factors:

  • Gradual rollout with monitoring
  • Performance impact assessment
  • Proper resource planning
  • Security considerations
  • Ongoing optimization