OpenTelemetry Infrastructure Requirements
This document outlines the infrastructure requirements and deployment strategy for OpenTelemetry in the ivatar project, considering the existing Fedora Project hosting environment and multi-instance setup.
Current Infrastructure Analysis
Existing Monitoring Stack
- Prometheus + Alertmanager: Metrics collection and alerting
- Loki: Log aggregation
- Alloy: Observability data collection
- Grafana: Visualization and dashboards
- Custom exporters: Application-specific metrics
Production Environment
- Scale: Millions of requests daily, 30k+ users, 33k+ avatar images
- Infrastructure: Fedora Project hosted, high-performance system
- Architecture: Apache HTTPD + Gunicorn containers + PostgreSQL
- Containerization: Podman (not Docker)
Multi-Instance Setup
- Production: production instance (deployed from the master branch)
- Development: development instance (deployed from the devel branch)
- Deployment: GitLab CI/CD with Puppet automation
Infrastructure Options
Option A: Extend Existing Alloy Stack (Recommended)
Advantages:
- Leverages existing infrastructure
- Minimal additional complexity
- Consistent with current monitoring approach
- Cost-effective
Implementation:
# Alloy configuration extension
otelcol.receiver.otlp:
  grpc:
    endpoint: 0.0.0.0:4317
  http:
    endpoint: 0.0.0.0:4318
otelcol.processor.batch:
  timeout: 1s
  send_batch_size: 1024
otelcol.exporter.prometheus:
  endpoint: "0.0.0.0:9464"
otelcol.exporter.jaeger:
  endpoint: "jaeger-collector:14250"
otelcol.pipeline.traces:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.jaeger]
otelcol.pipeline.metrics:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.prometheus]
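With Alloy exposing the OTLP receiver above, the application side only needs standard SDK wiring. The following is a minimal sketch using the OpenTelemetry Python SDK; the collector.internal hostname is the placeholder used elsewhere in this document, and plaintext gRPC is shown only because the examples here use http:// endpoints (see Security Considerations for TLS).

# Minimal application-side setup: ship spans to the OTLP/gRPC receiver on port 4317.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "ivatar"})
)
provider.add_span_processor(
    BatchSpanProcessor(
        # insecure=True matches the plain http:// endpoint used in this document;
        # production should use TLS as noted under Security Considerations.
        OTLPSpanExporter(endpoint="http://collector.internal:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)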
Option B: Dedicated OpenTelemetry Collector
Advantages:
- Full OpenTelemetry feature set
- Better performance for high-volume tracing
- More flexible configuration options
- Future-proof architecture
Implementation:
- Deploy standalone OpenTelemetry Collector
- Configure OTLP receivers and exporters
- Integrate with existing Prometheus/Grafana
Deployment Strategy
Phase 1: Development Environment
1. Enable OpenTelemetry in Development
   # Development environment configuration
   export OTEL_ENABLED=true
   export OTEL_SERVICE_NAME=ivatar-development
   export OTEL_ENVIRONMENT=development
   export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
   export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
2. Update Alloy Configuration
   - Add OTLP receivers to existing Alloy instance
   - Configure trace and metrics pipelines
   - Test data flow
3. Verify Integration
   - Check metrics endpoint: http://dev-instance:9464/metrics (see the smoke-test sketch below)
   - Verify trace data in Jaeger
   - Monitor Grafana dashboards
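A small smoke test can confirm that the metrics endpoint answers before moving on to Jaeger and Grafana. This is a sketch; the dev-instance hostname is taken from the checklist above and may differ in practice.

# Smoke test: verify the Prometheus endpoint serves at least one metric line.
import urllib.request

METRICS_URL = "http://dev-instance:9464/metrics"  # hostname is illustrative

def metrics_endpoint_ok(url: str = METRICS_URL) -> bool:
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    # Ignore comment lines (# HELP / # TYPE); require at least one sample.
    return any(line and not line.startswith("#") for line in body.splitlines())

if __name__ == "__main__":
    print("metrics endpoint OK:", metrics_endpoint_ok())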
Phase 2: Production Deployment
1. Production Configuration
   # Production environment configuration
   export OTEL_ENABLED=true
   export OTEL_SERVICE_NAME=ivatar-production
   export OTEL_ENVIRONMENT=production
   export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
   export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
2. Gradual Rollout
   - Deploy to one Gunicorn container first
   - Monitor performance impact
   - Gradually enable on all containers
3. Performance Monitoring
   - Monitor collector resource usage
   - Check application performance impact
   - Verify data quality
Resource Requirements
Collector Resources
Minimum Requirements:
- CPU: 2 cores
- Memory: 4GB RAM
- Storage: 10GB for temporary data
- Network: 1Gbps
Recommended for Production:
- CPU: 4 cores
- Memory: 8GB RAM
- Storage: 50GB SSD
- Network: 10Gbps
Network Requirements
Ports:
- 4317: OTLP gRPC receiver
- 4318: OTLP HTTP receiver
- 9464: Prometheus metrics exporter
- 14250: Jaeger trace exporter
Bandwidth:
- Estimated 1-5 Mbps per instance
- Burst capacity for peak loads
- Low-latency connection to collectors
Configuration Management
Environment-Specific Settings
Production Environment
# Production OpenTelemetry configuration
OTEL_ENABLED=true
OTEL_SERVICE_NAME=ivatar-production
OTEL_ENVIRONMENT=production
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
OTEL_SAMPLING_RATIO=0.1 # 10% sampling for high volume
HOSTNAME=prod-instance-01
Development Environment
# Development OpenTelemetry configuration
OTEL_ENABLED=true
OTEL_SERVICE_NAME=ivatar-development
OTEL_ENVIRONMENT=development
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
OTEL_SAMPLING_RATIO=1.0 # 100% sampling for debugging
HOSTNAME=dev-instance-01
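OTEL_SAMPLING_RATIO and HOSTNAME are project-specific variables rather than standard SDK settings, so the application reads them explicitly. A sketch of mapping the values above onto OpenTelemetry resource attributes (the sampling ratio itself is handled by the sampler, see Sampling Strategy below):

import os
from opentelemetry.sdk.resources import Resource

def build_resource() -> Resource:
    # Map environment-specific settings onto standard resource attributes.
    return Resource.create({
        "service.name": os.environ.get("OTEL_SERVICE_NAME", "ivatar"),
        "deployment.environment": os.environ.get("OTEL_ENVIRONMENT", "development"),
        "host.name": os.environ.get("HOSTNAME", "unknown"),
    })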
Container Configuration
Podman Container Updates
# Add to existing Dockerfile (quote the specifiers so the shell
# does not treat ">=" as output redirection)
RUN pip install "opentelemetry-api>=1.20.0" \
    "opentelemetry-sdk>=1.20.0" \
    "opentelemetry-instrumentation-django>=0.42b0" \
    "opentelemetry-instrumentation-psycopg2>=0.42b0" \
    "opentelemetry-instrumentation-pymysql>=0.42b0" \
    "opentelemetry-instrumentation-requests>=0.42b0" \
    "opentelemetry-instrumentation-urllib3>=0.42b0" \
    "opentelemetry-exporter-otlp>=1.20.0" \
    "opentelemetry-exporter-prometheus>=1.12.0rc1" \
    "opentelemetry-instrumentation-memcached>=0.42b0"
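These packages only take effect once their instrumentors are activated at application start-up. A sketch of the corresponding calls for a subset of the list (the exact set should match the clients ivatar actually uses, e.g. psycopg2 vs. pymysql):

# Activate installed instrumentations at application start-up.
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

def instrument_app():
    DjangoInstrumentor().instrument()      # request/response spans
    Psycopg2Instrumentor().instrument()    # PostgreSQL query spans
    RequestsInstrumentor().instrument()    # outbound HTTP spans (e.g. remote avatar fetches)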
Container Environment Variables
# Add to container startup script
export OTEL_ENABLED=${OTEL_ENABLED:-false}
export OTEL_SERVICE_NAME=${OTEL_SERVICE_NAME:-ivatar}
export OTEL_ENVIRONMENT=${OTEL_ENVIRONMENT:-development}
export OTEL_EXPORTER_OTLP_ENDPOINT=${OTEL_EXPORTER_OTLP_ENDPOINT}
export OTEL_PROMETHEUS_ENDPOINT=${OTEL_PROMETHEUS_ENDPOINT:-0.0.0.0:9464}
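A sketch of how the application could honour these defaults, enabling telemetry only when OTEL_ENABLED is true and exposing the Prometheus endpoint configured above. It assumes the opentelemetry-exporter-prometheus and prometheus_client packages from the Dockerfile.

import os
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

def setup_metrics():
    # Honour the same defaults as the container startup script.
    if os.environ.get("OTEL_ENABLED", "false").lower() != "true":
        return
    addr, _, port = os.environ.get("OTEL_PROMETHEUS_ENDPOINT", "0.0.0.0:9464").rpartition(":")
    start_http_server(int(port), addr=addr)  # serves /metrics for Prometheus/Alloy scraping
    metrics.set_meter_provider(
        MeterProvider(metric_readers=[PrometheusMetricReader()])
    )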
Monitoring and Alerting
Collector Health Monitoring
Collector Metrics
- otelcol_receiver_accepted_spans: Spans received by the collector
- otelcol_receiver_refused_spans: Spans rejected by the collector
- otelcol_exporter_sent_spans: Spans sent to exporters
- otelcol_exporter_failed_spans: Failed span exports
Health Checks
# Prometheus health check
- job_name: "otel-collector-health"
  static_configs:
    - targets: ["collector.internal:8888"]
  metrics_path: /metrics
  scrape_interval: 30s
Application Performance Impact
Key Metrics to Monitor
- Application response time impact
- Memory usage increase
- CPU usage increase
- Network bandwidth usage
Alerting Rules
# High collector resource usage
- alert: HighCollectorCPU
  expr: rate(otelcol_process_cpu_seconds_total[5m]) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High collector CPU usage"
    description: "Collector CPU usage is {{ $value }}"

# Collector memory usage
- alert: HighCollectorMemory
  expr: otelcol_process_memory_usage_bytes / otelcol_process_memory_limit_bytes > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High collector memory usage"
    description: "Collector memory usage is {{ $value }}"
Security Considerations
Network Security
- Use TLS for collector communications
- Restrict collector access to trusted networks
- Implement proper firewall rules
Data Privacy
- Ensure no sensitive data in trace attributes
- Implement data sanitization
- Configure appropriate retention policies
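For ivatar the main concern is e-mail addresses. One way to keep them out of trace attributes is to record only the digest that the avatar lookup uses anyway. The following is an illustrative sketch; the function and attribute names are hypothetical, not part of the ivatar codebase.

import hashlib
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def find_avatar(email: str):
    # Hash before attaching to the span so raw addresses never leave the app.
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    with tracer.start_as_current_span("avatar.lookup") as span:
        span.set_attribute("avatar.email_hash", digest)
        # actual lookup logic lives in ivatar; omitted here
        return digest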
Access Control
- Restrict access to metrics endpoints
- Implement authentication for collector access
- Monitor access logs
Backup and Recovery
Data Retention
- Traces: 7 days (configurable)
- Metrics: 30 days (configurable)
- Logs: 14 days (configurable)
Backup Strategy
- Regular backup of collector configuration
- Backup of Grafana dashboards
- Backup of Prometheus rules
Performance Optimization
Sampling Strategy
- Production: 10% sampling rate
- Development: 100% sampling rate
- Error traces: Always sample
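On the application side, OTEL_SAMPLING_RATIO is turned into a head-based sampler. Note that head sampling decides before a span runs and therefore cannot see errors, so "always sample error traces" would normally be handled collector-side (e.g. tail-based sampling) rather than in the SDK. A minimal sketch, assuming the OpenTelemetry Python SDK:

# Build a sampler from OTEL_SAMPLING_RATIO (0.1 in production, 1.0 in development).
import os
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

def build_tracer_provider() -> TracerProvider:
    ratio = float(os.environ.get("OTEL_SAMPLING_RATIO", "1.0"))
    # ParentBased keeps whole traces consistent: child spans follow the
    # sampling decision made for the root span.
    return TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(ratio)))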
Batch Processing
- Optimize batch sizes for network conditions
- Configure appropriate timeouts
- Monitor queue depths
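In the SDK these knobs map onto the BatchSpanProcessor parameters. A sketch using the 1s timeout and 1024 batch size from the Alloy example above; the values are starting points, not tuned figures.

# Tune span batching: larger batches reduce network overhead,
# shorter delays reduce end-to-end latency of trace data.
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://collector.internal:4317", insecure=True),
    max_queue_size=2048,          # spans buffered before drops occur
    schedule_delay_millis=1000,   # matches the 1s batch timeout above
    max_export_batch_size=1024,   # matches send_batch_size above
)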
Resource Optimization
- Monitor collector resource usage
- Scale collectors based on load
- Implement horizontal scaling if needed
Troubleshooting
Common Issues
Collector Not Receiving Data
- Check network connectivity
- Verify OTLP endpoint configuration
- Check collector logs
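A quick reachability check for the OTLP ports can rule out network problems before digging into configuration. A sketch; the hostname is the placeholder used elsewhere in this document.

import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in (4317, 4318):
    print(f"collector.internal:{port} reachable:", port_reachable("collector.internal", port))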
High Resource Usage
- Adjust sampling rates
- Optimize batch processing
- Scale collector resources
Data Quality Issues
- Verify instrumentation configuration
- Check span attribute quality
- Monitor error rates
Debug Procedures
1. Check Collector Status
   curl http://collector.internal:8888/metrics
2. Verify Application Configuration
   curl http://app:9464/metrics
3. Check Trace Data
   - Access Jaeger UI
   - Search for recent traces
   - Verify span attributes
Future Enhancements
Advanced Features
- Custom dashboards for avatar metrics
- Advanced sampling strategies
- Log correlation with traces
- Performance profiling integration
Scalability Improvements
- Horizontal collector scaling
- Load balancing for collectors
- Multi-region deployment
- Edge collection points
Integration Enhancements
- Additional exporter backends
- Custom processors
- Advanced filtering
- Data transformation
Cost Considerations
Infrastructure Costs
- Additional compute resources for collectors
- Storage costs for trace data
- Network bandwidth costs
Operational Costs
- Monitoring and maintenance
- Configuration management
- Troubleshooting and support
Optimization Strategies
- Implement efficient sampling
- Use appropriate retention policies
- Optimize batch processing
- Monitor resource usage
Conclusion
The OpenTelemetry integration for ivatar provides comprehensive observability while leveraging the existing monitoring infrastructure. The phased deployment approach ensures minimal disruption to production services while providing valuable insights into avatar generation performance and user behavior.
Key success factors:
- Gradual rollout with monitoring
- Performance impact assessment
- Proper resource planning
- Security considerations
- Ongoing optimization