# OpenTelemetry Infrastructure Requirements

This document outlines the infrastructure requirements and deployment strategy for OpenTelemetry in the ivatar project, considering the existing Fedora Project hosting environment and multi-instance setup.
## Current Infrastructure Analysis

### Existing Monitoring Stack

- **Prometheus + Alertmanager**: Metrics collection and alerting
- **Loki**: Log aggregation
- **Alloy**: Observability data collection
- **Grafana**: Visualization and dashboards
- **Custom exporters**: Application-specific metrics
### Production Environment

- **Scale**: Millions of requests daily, 30k+ users, 33k+ avatar images
- **Infrastructure**: Fedora Project hosted, high-performance system
- **Architecture**: Apache HTTPD + Gunicorn containers + PostgreSQL
- **Containerization**: Podman (not Docker)
### Multi-Instance Setup

- **Production**: production instance, deployed from the `master` branch
- **Development**: development instance, deployed from the `devel` branch
- **Deployment**: GitLab CI/CD with Puppet automation
## Infrastructure Options

### Option A: Extend Existing Alloy Stack (Recommended)

**Advantages:**

- Leverages existing infrastructure
- Minimal additional complexity
- Consistent with current monitoring approach
- Cost-effective

**Implementation:**
```yaml
# Alloy configuration extension
otelcol.receiver.otlp:
  grpc:
    endpoint: 0.0.0.0:4317
  http:
    endpoint: 0.0.0.0:4318

otelcol.processor.batch:
  timeout: 1s
  send_batch_size: 1024

otelcol.exporter.prometheus:
  endpoint: "0.0.0.0:9464"

otelcol.exporter.jaeger:
  endpoint: "jaeger-collector:14250"

otelcol.pipeline.traces:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.jaeger]

otelcol.pipeline.metrics:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.prometheus]
```
### Option B: Dedicated OpenTelemetry Collector

**Advantages:**

- Full OpenTelemetry feature set
- Better performance for high-volume tracing
- More flexible configuration options
- Future-proof architecture

**Implementation:**

- Deploy a standalone OpenTelemetry Collector (see the sketch below)
- Configure OTLP receivers and exporters
- Integrate with the existing Prometheus/Grafana stack
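For illustration, a standalone Collector for Option B could be wired roughly as follows. This is a minimal sketch, not the deployed configuration: the Jaeger hostname and the assumption that Jaeger accepts OTLP directly are placeholders to adapt to the actual environment.

```yaml
# Minimal standalone Collector sketch for Option B (hostnames are placeholders)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: 0.0.0.0:9464
  otlp/jaeger:
    # Assumes a Jaeger version that ingests OTLP natively
    endpoint: jaeger.internal:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```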
## Deployment Strategy

### Phase 1: Development Environment

1. **Enable OpenTelemetry in Development**
```bash
# Development environment configuration
export OTEL_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-development
export OTEL_ENVIRONMENT=development
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
```
2. **Update Alloy Configuration**

- Add OTLP receivers to existing Alloy instance
- Configure trace and metrics pipelines
- Test data flow

3. **Verify Integration**

- Check metrics endpoint: `http://dev-instance:9464/metrics`
- Verify trace data in Jaeger
- Monitor Grafana dashboards
### Phase 2: Production Deployment

1. **Production Configuration**
```bash
# Production environment configuration
export OTEL_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-production
export OTEL_ENVIRONMENT=production
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
```
2. **Gradual Rollout**

- Deploy to one Gunicorn container first
- Monitor performance impact
- Gradually enable on all containers

3. **Performance Monitoring**

- Monitor collector resource usage
- Check application performance impact
- Verify data quality
## Resource Requirements

### Collector Resources

**Minimum Requirements:**

- CPU: 2 cores
- Memory: 4 GB RAM
- Storage: 10 GB for temporary data
- Network: 1 Gbps

**Recommended for Production:**

- CPU: 4 cores
- Memory: 8 GB RAM
- Storage: 50 GB SSD
- Network: 10 Gbps
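To keep the Collector within these memory bounds, the `memory_limiter` processor can be placed first in each pipeline. The numbers below are illustrative values derived from the 4 GB minimum, not tested settings:

```yaml
# Illustrative memory guard for a collector sized at ~4 GB RAM
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3200       # stay well below the container/VM memory
    spike_limit_mib: 640  # headroom for short bursts
```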
### Network Requirements

**Ports:**

- 4317: OTLP gRPC receiver
- 4318: OTLP HTTP receiver
- 9464: Prometheus metrics exporter
- 14250: Jaeger trace exporter

**Bandwidth:**

- Estimated 1-5 Mbps per instance
- Burst capacity for peak loads
- Low-latency connection to collectors
## Configuration Management

### Environment-Specific Settings

#### Production Environment
```bash
# Production OpenTelemetry configuration
OTEL_ENABLED=true
OTEL_SERVICE_NAME=ivatar-production
OTEL_ENVIRONMENT=production
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
OTEL_SAMPLING_RATIO=0.1 # 10% sampling for high volume
HOSTNAME=prod-instance-01
```
#### Development Environment

```bash
# Development OpenTelemetry configuration
OTEL_ENABLED=true
OTEL_SERVICE_NAME=ivatar-development
OTEL_ENVIRONMENT=development
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
OTEL_SAMPLING_RATIO=1.0 # 100% sampling for debugging
HOSTNAME=dev-instance-01
```
### Container Configuration

#### Podman Container Updates
```dockerfile
# Add to existing Dockerfile (version specifiers quoted so the shell
# does not interpret ">=" as output redirection)
RUN pip install "opentelemetry-api>=1.20.0" \
    "opentelemetry-sdk>=1.20.0" \
    "opentelemetry-instrumentation-django>=0.42b0" \
    "opentelemetry-instrumentation-psycopg2>=0.42b0" \
    "opentelemetry-instrumentation-pymysql>=0.42b0" \
    "opentelemetry-instrumentation-requests>=0.42b0" \
    "opentelemetry-instrumentation-urllib3>=0.42b0" \
    "opentelemetry-exporter-otlp>=1.20.0" \
    "opentelemetry-exporter-prometheus>=1.12.0rc1" \
    "opentelemetry-instrumentation-memcached>=0.42b0"
```
#### Container Environment Variables

```bash
# Add to container startup script
export OTEL_ENABLED=${OTEL_ENABLED:-false}
export OTEL_SERVICE_NAME=${OTEL_SERVICE_NAME:-ivatar}
export OTEL_ENVIRONMENT=${OTEL_ENVIRONMENT:-development}
export OTEL_EXPORTER_OTLP_ENDPOINT=${OTEL_EXPORTER_OTLP_ENDPOINT}
export OTEL_PROMETHEUS_ENDPOINT=${OTEL_PROMETHEUS_ENDPOINT:-0.0.0.0:9464}
```
## Monitoring and Alerting

### Collector Health Monitoring

#### Collector Metrics

- `otelcol_receiver_accepted_spans`: Spans received by the collector
- `otelcol_receiver_refused_spans`: Spans rejected by the collector
- `otelcol_exporter_sent_spans`: Spans sent to exporters
- `otelcol_exporter_send_failed_spans`: Failed span exports
#### Health Checks

```yaml
# Prometheus health check
- job_name: "otel-collector-health"
  static_configs:
    - targets: ["collector.internal:8888"]
  metrics_path: /metrics
  scrape_interval: 30s
```
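On the Collector side, the scrape target above corresponds to the Collector's own telemetry endpoint, and a `health_check` extension can additionally serve liveness probes. A sketch using the common defaults (8888 for internal metrics, 13133 for health), which may need adjusting per Collector version:

```yaml
# Collector self-monitoring (ports shown are the usual defaults)
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  telemetry:
    metrics:
      address: 0.0.0.0:8888
```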
### Application Performance Impact

#### Key Metrics to Monitor

- Application response time impact
- Memory usage increase
- CPU usage increase
- Network bandwidth usage
#### Alerting Rules

```yaml
groups:
  - name: otel-collector
    rules:
      # High collector CPU usage
      - alert: HighCollectorCPU
        expr: rate(otelcol_process_cpu_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High collector CPU usage"
          description: "Collector CPU usage is {{ $value }}"

      # High collector memory usage
      - alert: HighCollectorMemory
        expr: otelcol_process_memory_usage_bytes / otelcol_process_memory_limit_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High collector memory usage"
          description: "Collector memory usage is {{ $value }}"
```
## Security Considerations

### Network Security

- Use TLS for collector communications (sketch below)
- Restrict collector access to trusted networks
- Implement proper firewall rules
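For the TLS point above, certificates can be configured directly on the Collector's OTLP receiver. A sketch with placeholder certificate paths:

```yaml
# TLS on the OTLP receiver (certificate paths are placeholders)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otelcol/certs/collector.crt
          key_file: /etc/otelcol/certs/collector.key
```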
### Data Privacy

- Ensure no sensitive data in trace attributes
- Implement data sanitization (example below)
- Configure appropriate retention policies
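One way to enforce sanitization centrally is the Collector's `attributes` processor (shipped with the contrib distribution). The attribute keys below are examples, not a list derived from the ivatar code:

```yaml
# Drop or hash potentially sensitive span attributes before export (example keys)
processors:
  attributes/sanitize:
    actions:
      - key: enduser.id
        action: hash
      - key: http.request.header.cookie
        action: delete
      - key: http.request.header.authorization
        action: delete
```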
### Access Control

- Restrict access to metrics endpoints
- Implement authentication for collector access (sketch below)
- Monitor access logs
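If collector-side authentication is adopted, the contrib `basicauth` extension is one option for the OTLP HTTP receiver. A sketch with a placeholder htpasswd path; verify the extension is included in the deployed Collector build:

```yaml
# Basic authentication on the OTLP HTTP receiver (sketch, contrib extension)
extensions:
  basicauth/server:
    htpasswd:
      file: /etc/otelcol/.htpasswd  # placeholder path

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        auth:
          authenticator: basicauth/server

service:
  extensions: [basicauth/server]
```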
## Backup and Recovery

### Data Retention

- Traces: 7 days (configurable)
- Metrics: 30 days (configurable)
- Logs: 14 days (configurable; Loki sketch below)
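Trace and metric retention are set on the respective backends (for Prometheus, typically via `--storage.tsdb.retention.time=30d`). For logs, a Loki sketch along these lines would match the 14-day target; depending on the Loki version, additional compactor settings may be required:

```yaml
# Loki log retention sketch (14 days), enforced by the compactor
limits_config:
  retention_period: 336h
compactor:
  retention_enabled: true
```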
### Backup Strategy

- Regular backup of collector configuration
- Backup of Grafana dashboards
- Backup of Prometheus rules
## Performance Optimization

### Sampling Strategy

- Production: 10% sampling rate
- Development: 100% sampling rate
- Error traces: always sampled (see below)
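Head-based ratio sampling alone cannot guarantee that error traces are kept. One way to approximate this policy is the contrib `tail_sampling` processor on the Collector; the values below mirror the 10% production target and are a sketch, not a tuned configuration:

```yaml
# Keep all error traces, sample the remainder at 10% (contrib tail_sampling)
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```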
### Batch Processing

- Optimize batch sizes for network conditions (example below)
- Configure appropriate timeouts
- Monitor queue depths
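These knobs live on the Collector's `batch` processor. The values below are starting points to tune against observed queue depths, not recommendations:

```yaml
# Batch processor tuning (illustrative starting values)
processors:
  batch:
    timeout: 5s
    send_batch_size: 2048
    send_batch_max_size: 4096
```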
### Resource Optimization

- Monitor collector resource usage
- Scale collectors based on load
- Implement horizontal scaling if needed
## Troubleshooting

### Common Issues

#### Collector Not Receiving Data

- Check network connectivity
- Verify OTLP endpoint configuration
- Check collector logs (a temporary debug exporter helps; see below)
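To confirm whether spans are arriving at all, a `debug` exporter (named `logging` in older Collector releases) can be wired into a temporary pipeline:

```yaml
# Temporary pipeline that logs every received span to the collector's stdout
exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```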
#### High Resource Usage

- Adjust sampling rates
- Optimize batch processing
- Scale collector resources

#### Data Quality Issues

- Verify instrumentation configuration
- Check span attribute quality
- Monitor error rates
### Debug Procedures

1. **Check Collector Status**

```bash
curl http://collector.internal:8888/metrics
```
2. **Verify Application Configuration**

```bash
curl http://app:9464/metrics
```

3. **Check Trace Data**

- Access the Jaeger UI
- Search for recent traces
- Verify span attributes
## Future Enhancements

### Advanced Features

- Custom dashboards for avatar metrics
- Advanced sampling strategies
- Log correlation with traces
- Performance profiling integration

### Scalability Improvements

- Horizontal collector scaling
- Load balancing for collectors (sketched below)
- Multi-region deployment
- Edge collection points
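If horizontal scaling is pursued, the contrib `loadbalancing` exporter is one way to fan traces out across several collectors while keeping all spans of a trace on the same instance. A sketch with placeholder hostnames:

```yaml
# Spread traces across collectors, keyed by trace ID (hostnames are placeholders)
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - otel-collector-1.internal:4317
          - otel-collector-2.internal:4317
```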
### Integration Enhancements

- Additional exporter backends
- Custom processors
- Advanced filtering
- Data transformation
## Cost Considerations

### Infrastructure Costs

- Additional compute resources for collectors
- Storage costs for trace data
- Network bandwidth costs

### Operational Costs

- Monitoring and maintenance
- Configuration management
- Troubleshooting and support

### Optimization Strategies

- Implement efficient sampling
- Use appropriate retention policies
- Optimize batch processing
- Monitor resource usage
## Conclusion

The OpenTelemetry integration for ivatar provides comprehensive observability while leveraging the existing monitoring infrastructure. The phased deployment approach ensures minimal disruption to production services while providing valuable insights into avatar generation performance and user behavior.

Key success factors:

- Gradual rollout with monitoring
- Performance impact assessment
- Proper resource planning
- Security considerations
- Ongoing optimization