# OpenTelemetry Infrastructure Requirements
This document outlines the infrastructure requirements and deployment strategy for OpenTelemetry in the ivatar project, considering the existing Fedora Project hosting environment and multi-instance setup.
## Current Infrastructure Analysis
### Existing Monitoring Stack
- **Prometheus + Alertmanager**: Metrics collection and alerting
- **Loki**: Log aggregation
- **Alloy**: Observability data collection
- **Grafana**: Visualization and dashboards
- **Custom exporters**: Application-specific metrics
### Production Environment
- **Scale**: Millions of requests daily, 30k+ users, 33k+ avatar images
- **Infrastructure**: Fedora Project hosted, high-performance system
- **Architecture**: Apache HTTPD + Gunicorn containers + PostgreSQL
- **Containerization**: Podman (not Docker)
### Multi-Instance Setup
- **Production**: live production instance, deployed from the `master` branch
- **Development**: development/testing instance, deployed from the `devel` branch
- **Deployment**: GitLab CI/CD with Puppet automation
## Infrastructure Options
### Option A: Extend Existing Alloy Stack (Recommended)
**Advantages:**
- Leverages existing infrastructure
- Minimal additional complexity
- Consistent with current monitoring approach
- Cost-effective
**Implementation:**
```yaml
# Alloy configuration extension
otelcol.receiver.otlp:
  grpc:
    endpoint: 0.0.0.0:4317
  http:
    endpoint: 0.0.0.0:4318
otelcol.processor.batch:
  timeout: 1s
  send_batch_size: 1024
otelcol.exporter.prometheus:
  endpoint: "0.0.0.0:9464"
otelcol.exporter.jaeger:
  endpoint: "jaeger-collector:14250"
otelcol.pipeline.traces:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.jaeger]
otelcol.pipeline.metrics:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.prometheus]
```
### Option B: Dedicated OpenTelemetry Collector
**Advantages:**
- Full OpenTelemetry feature set
- Better performance for high-volume tracing
- More flexible configuration options
- Future-proof architecture
**Implementation:**
- Deploy a standalone OpenTelemetry Collector (see the configuration sketch below)
- Configure OTLP receivers and exporters
- Integrate with existing Prometheus/Grafana
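A minimal sketch of what a standalone collector configuration could look like (standard OpenTelemetry Collector YAML); the Jaeger endpoint, ports, and batch settings are assumptions to adapt to the actual environment:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  otlp/jaeger:
    # Recent collector releases send traces to Jaeger via OTLP
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```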
## Deployment Strategy
### Phase 1: Development Environment
1. **Enable OpenTelemetry in Development**
```bash
# Development environment configuration
export OTEL_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-development
export OTEL_ENVIRONMENT=development
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
```
2. **Update Alloy Configuration**
- Add OTLP receivers to existing Alloy instance
- Configure trace and metrics pipelines
- Test data flow
3. **Verify Integration**
- Check metrics endpoint: `http://dev-instance:9464/metrics` (see the commands below)
- Verify trace data in Jaeger
- Monitor Grafana dashboards
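A couple of quick checks, reusing the hostnames from this document, might look like:
```bash
# Application-side Prometheus exporter should answer
curl -s http://dev-instance:9464/metrics | head -n 20
# Collector internal telemetry should show accepted spans once traffic flows
curl -s http://collector.internal:8888/metrics | grep otelcol_receiver_accepted_spans
```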
### Phase 2: Production Deployment
1. **Production Configuration**
```bash
# Production environment configuration
export OTEL_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-production
export OTEL_ENVIRONMENT=production
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
```
2. **Gradual Rollout**
- Deploy to one Gunicorn container first
- Monitor performance impact
- Gradually enable on all containers
3. **Performance Monitoring**
- Monitor collector resource usage
- Check application performance impact
- Verify data quality
## Resource Requirements
### Collector Resources
**Minimum Requirements:**
- CPU: 2 cores
- Memory: 4GB RAM
- Storage: 10GB for temporary data
- Network: 1Gbps
**Recommended for Production:**
- CPU: 4 cores
- Memory: 8GB RAM
- Storage: 50GB SSD
- Network: 10Gbps
### Network Requirements
**Ports:**
- 4317: OTLP gRPC receiver
- 4318: OTLP HTTP receiver
- 9464: Prometheus metrics exporter
- 14250: Jaeger trace exporter
**Bandwidth:**
- Estimated 1-5 Mbps per instance
- Burst capacity for peak loads
- Low-latency connection to collectors
## Configuration Management
### Environment-Specific Settings
#### Production Environment
```bash
# Production OpenTelemetry configuration
OTEL_ENABLED=true
OTEL_SERVICE_NAME=ivatar-production
OTEL_ENVIRONMENT=production
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
OTEL_SAMPLING_RATIO=0.1 # 10% sampling for high volume
HOSTNAME=prod-instance-01
```
#### Development Environment
```bash
# Development OpenTelemetry configuration
OTEL_ENABLED=true
OTEL_SERVICE_NAME=ivatar-development
OTEL_ENVIRONMENT=development
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
OTEL_SAMPLING_RATIO=1.0 # 100% sampling for debugging
HOSTNAME=dev-instance-01
```
### Container Configuration
#### Podman Container Updates
```dockerfile
# Add to existing Dockerfile (version specifiers are quoted so the shell
# does not interpret ">=" as a redirect)
RUN pip install \
    "opentelemetry-api>=1.20.0" \
    "opentelemetry-sdk>=1.20.0" \
    "opentelemetry-instrumentation-django>=0.42b0" \
    "opentelemetry-instrumentation-psycopg2>=0.42b0" \
    "opentelemetry-instrumentation-pymysql>=0.42b0" \
    "opentelemetry-instrumentation-requests>=0.42b0" \
    "opentelemetry-instrumentation-urllib3>=0.42b0" \
    "opentelemetry-exporter-otlp>=1.20.0" \
    "opentelemetry-exporter-prometheus>=1.12.0rc1" \
    "opentelemetry-instrumentation-memcached>=0.42b0"
```
#### Container Environment Variables
```bash
# Add to container startup script
export OTEL_ENABLED=${OTEL_ENABLED:-false}
export OTEL_SERVICE_NAME=${OTEL_SERVICE_NAME:-ivatar}
export OTEL_ENVIRONMENT=${OTEL_ENVIRONMENT:-development}
export OTEL_EXPORTER_OTLP_ENDPOINT=${OTEL_EXPORTER_OTLP_ENDPOINT}
export OTEL_PROMETHEUS_ENDPOINT=${OTEL_PROMETHEUS_ENDPOINT:-0.0.0.0:9464}
```
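As an illustration, enabling OpenTelemetry on a single Gunicorn container could look like the following `podman run` sketch; the container name, image name, and port mapping are placeholders:
```bash
podman run -d --name ivatar-gunicorn-otel \
  -e OTEL_ENABLED=true \
  -e OTEL_SERVICE_NAME=ivatar-production \
  -e OTEL_ENVIRONMENT=production \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317 \
  -e OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464 \
  -p 9464:9464 \
  ivatar:latest
```
This matches the gradual-rollout approach above: one container is started with `OTEL_ENABLED=true` while the others keep the default of `false`.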
## Monitoring and Alerting
### Collector Health Monitoring
#### Collector Metrics
- `otelcol_receiver_accepted_spans`: Spans received by collector
- `otelcol_receiver_refused_spans`: Spans rejected by collector
- `otelcol_exporter_sent_spans`: Spans sent to exporters
- `otelcol_exporter_failed_spans`: Failed span exports
#### Health Checks
```yaml
# Prometheus health check
- job_name: "otel-collector-health"
  static_configs:
    - targets: ["collector.internal:8888"]
  metrics_path: /metrics
  scrape_interval: 30s
```
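This scrape job assumes the collector exposes its internal telemetry on port 8888; for a standalone collector (Option B) that is enabled roughly like this, depending on the collector version:
```yaml
service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
```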
### Application Performance Impact
#### Key Metrics to Monitor
- Application response time impact
- Memory usage increase
- CPU usage increase
- Network bandwidth usage
#### Alerting Rules
```yaml
# High collector resource usage
- alert: HighCollectorCPU
  expr: rate(otelcol_process_cpu_seconds_total[5m]) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High collector CPU usage"
    description: "Collector CPU usage is {{ $value }}"

# Collector memory usage
- alert: HighCollectorMemory
  expr: otelcol_process_memory_usage_bytes / otelcol_process_memory_limit_bytes > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High collector memory usage"
    description: "Collector memory usage is {{ $value }}"
```
## Security Considerations
### Network Security
- Use TLS for collector communications
- Restrict collector access to trusted networks
- Implement proper firewall rules (example below)
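On Fedora hosts, firewalld rich rules are one way to restrict OTLP access to a trusted network; the source subnet below is a placeholder:
```bash
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port port="4317" protocol="tcp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port port="4318" protocol="tcp" accept'
firewall-cmd --reload
```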
### Data Privacy
- Ensure no sensitive data in trace attributes
- Implement data sanitization (see the processor sketch below)
- Configure appropriate retention policies
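One option for sanitization is the collector's `attributes` processor; the attribute keys below are examples and should be adjusted to what the spans actually carry:
```yaml
processors:
  attributes/sanitize:
    actions:
      # Drop request headers that may contain credentials or session data
      - key: http.request.header.cookie
        action: delete
      - key: http.request.header.authorization
        action: delete
      # Hash user identifiers instead of exporting them in clear text
      - key: enduser.id
        action: hash
```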
### Access Control
- Restrict access to metrics endpoints
- Implement authentication for collector access
- Monitor access logs
## Backup and Recovery
### Data Retention
- Traces: 7 days (configurable)
- Metrics: 30 days (configurable)
- Logs: 14 days (configurable)
### Backup Strategy
- Regular backup of collector configuration
- Backup of Grafana dashboards
- Backup of Prometheus rules
## Performance Optimization
### Sampling Strategy
- Production: 10% sampling rate
- Development: 100% sampling rate
- Error traces: always sample (see the tail-sampling sketch below)
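A collector-side way to combine these rules is tail sampling, which keeps every error trace and a probabilistic share of the rest; the values below are illustrative:
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep traces that contain an error
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep roughly 10% of everything else (production setting)
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```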
### Batch Processing
- Optimize batch sizes for network conditions (see the sketch below)
- Configure appropriate timeouts
- Monitor queue depths
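A rough starting point for batch tuning in the collector, to be validated against observed latency and queue metrics:
```yaml
processors:
  batch:
    timeout: 5s             # flush at least every 5 seconds
    send_batch_size: 2048   # target batch size
    send_batch_max_size: 4096
```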
### Resource Optimization
- Monitor collector resource usage (see the `memory_limiter` sketch below)
- Scale collectors based on load
- Implement horizontal scaling if needed
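The collector's `memory_limiter` processor is a common guard against overload; the limits below are placeholders to be sized against the host:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3000        # hard ceiling for collector memory
    spike_limit_mib: 600   # headroom for short bursts
```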
## Troubleshooting
### Common Issues
#### Collector Not Receiving Data
- Check network connectivity
- Verify OTLP endpoint configuration
- Check collector logs
#### High Resource Usage
- Adjust sampling rates
- Optimize batch processing
- Scale collector resources
#### Data Quality Issues
- Verify instrumentation configuration
- Check span attribute quality
- Monitor error rates
### Debug Procedures
1. **Check Collector Status**
```bash
curl http://collector.internal:8888/metrics
```
2. **Verify Application Configuration**
```bash
curl http://app:9464/metrics
```
3. **Check Trace Data**
- Access Jaeger UI
- Search for recent traces
- Verify span attributes
## Future Enhancements
### Advanced Features
- Custom dashboards for avatar metrics
- Advanced sampling strategies
- Log correlation with traces
- Performance profiling integration
### Scalability Improvements
- Horizontal collector scaling
- Load balancing for collectors
- Multi-region deployment
- Edge collection points
### Integration Enhancements
- Additional exporter backends
- Custom processors
- Advanced filtering
- Data transformation
## Cost Considerations
### Infrastructure Costs
- Additional compute resources for collectors
- Storage costs for trace data
- Network bandwidth costs
### Operational Costs
- Monitoring and maintenance
- Configuration management
- Troubleshooting and support
### Optimization Strategies
- Implement efficient sampling
- Use appropriate retention policies
- Optimize batch processing
- Monitor resource usage
## Conclusion
The OpenTelemetry integration for ivatar provides comprehensive observability while leveraging the existing monitoring infrastructure. The phased deployment approach ensures minimal disruption to production services while providing valuable insights into avatar generation performance and user behavior.
Key success factors:
- Gradual rollout with monitoring
- Performance impact assessment
- Proper resource planning
- Security considerations
- Ongoing optimization