OpenTelemetry Infrastructure Requirements
This document outlines the infrastructure requirements and deployment strategy for OpenTelemetry in the ivatar project, considering the existing Fedora Project hosting environment and multi-instance setup.
Current Infrastructure Analysis
Existing Monitoring Stack
- Prometheus + Alertmanager: Metrics collection and alerting
- Loki: Log aggregation
- Alloy: Observability data collection
- Grafana: Visualization and dashboards
- Custom exporters: Application-specific metrics
Production Environment
- Scale: Millions of requests daily, 30k+ users, 33k+ avatar images
- Infrastructure: Fedora Project hosted, high-performance system
- Architecture: Apache HTTPD + Gunicorn containers + PostgreSQL
- Containerization: Podman (not Docker)
Multi-Instance Setup
- Production: production instance (deployed from the master branch)
- Development: development instance (deployed from the devel branch)
- Deployment: GitLab CI/CD with Puppet automation
Infrastructure Options
Option A: Extend Existing Alloy Stack (Recommended)
Advantages:
- Leverages existing infrastructure
- Minimal additional complexity
- Consistent with current monitoring approach
- Cost-effective
Implementation:
# Alloy configuration extension
otelcol.receiver.otlp:
  grpc:
    endpoint: 0.0.0.0:4317
  http:
    endpoint: 0.0.0.0:4318
otelcol.processor.batch:
  timeout: 1s
  send_batch_size: 1024
otelcol.exporter.prometheus:
  endpoint: "0.0.0.0:9464"
otelcol.exporter.jaeger:
  endpoint: "jaeger-collector:14250"
otelcol.pipeline.traces:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.jaeger]
otelcol.pipeline.metrics:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.prometheus]
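With Alloy exposing the OTLP receiver above, the application side only needs standard SDK wiring. The following is a minimal sketch using the OpenTelemetry Python SDK; the collector.internal hostname is the placeholder used elsewhere in this document, and plaintext gRPC is shown only because the examples here use http:// endpoints (see Security Considerations for TLS).

# Minimal application-side setup: ship spans to the OTLP/gRPC receiver on port 4317.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "ivatar"})
)
provider.add_span_processor(
    BatchSpanProcessor(
        # insecure=True matches the plain http:// endpoint used in this document;
        # production should use TLS as noted under Security Considerations.
        OTLPSpanExporter(endpoint="http://collector.internal:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)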
Option B: Dedicated OpenTelemetry Collector
Advantages:
- Full OpenTelemetry feature set
- Better performance for high-volume tracing
- More flexible configuration options
- Future-proof architecture
Implementation:
- Deploy standalone OpenTelemetry Collector
- Configure OTLP receivers and exporters
- Integrate with existing Prometheus/Grafana
Deployment Strategy
Phase 1: Development Environment
1. Enable OpenTelemetry in Development
   # Development environment configuration
   export OTEL_ENABLED=true
   export OTEL_SERVICE_NAME=ivatar-development
   export OTEL_ENVIRONMENT=development
   export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
   export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
2. Update Alloy Configuration
   - Add OTLP receivers to existing Alloy instance
   - Configure trace and metrics pipelines
   - Test data flow
3. Verify Integration
   - Check metrics endpoint: http://dev-instance:9464/metrics (see the smoke-test sketch below)
   - Verify trace data in Jaeger
   - Monitor Grafana dashboards
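A small smoke test can confirm that the metrics endpoint answers before moving on to Jaeger and Grafana. This is a sketch; the dev-instance hostname is taken from the checklist above and may differ in practice.

# Smoke test: verify the Prometheus endpoint serves at least one metric line.
import urllib.request

METRICS_URL = "http://dev-instance:9464/metrics"  # hostname is illustrative

def metrics_endpoint_ok(url: str = METRICS_URL) -> bool:
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    # Ignore comment lines (# HELP / # TYPE); require at least one sample.
    return any(line and not line.startswith("#") for line in body.splitlines())

if __name__ == "__main__":
    print("metrics endpoint OK:", metrics_endpoint_ok())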
Phase 2: Production Deployment
1. Production Configuration
   # Production environment configuration
   export OTEL_ENABLED=true
   export OTEL_SERVICE_NAME=ivatar-production
   export OTEL_ENVIRONMENT=production
   export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
   export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
2. Gradual Rollout
   - Deploy to one Gunicorn container first
   - Monitor performance impact
   - Gradually enable on all containers
3. Performance Monitoring
   - Monitor collector resource usage
   - Check application performance impact
   - Verify data quality
Resource Requirements
Collector Resources
Minimum Requirements:
- CPU: 2 cores
- Memory: 4GB RAM
- Storage: 10GB for temporary data
- Network: 1Gbps
Recommended for Production:
- CPU: 4 cores
- Memory: 8GB RAM
- Storage: 50GB SSD
- Network: 10Gbps
Network Requirements
Ports:
- 4317: OTLP gRPC receiver
- 4318: OTLP HTTP receiver
- 9464: Prometheus metrics exporter
- 14250: Jaeger trace exporter
Bandwidth:
- Estimated 1-5 Mbps per instance
- Burst capacity for peak loads
- Low-latency connection to collectors
Configuration Management
Environment-Specific Settings
Production Environment
# Production OpenTelemetry configuration
OTEL_ENABLED=true
OTEL_SERVICE_NAME=ivatar-production
OTEL_ENVIRONMENT=production
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
OTEL_SAMPLING_RATIO=0.1 # 10% sampling for high volume
HOSTNAME=prod-instance-01
Development Environment
# Development OpenTelemetry configuration
OTEL_ENABLED=true
OTEL_SERVICE_NAME=ivatar-development
OTEL_ENVIRONMENT=development
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
OTEL_SAMPLING_RATIO=1.0 # 100% sampling for debugging
HOSTNAME=dev-instance-01
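OTEL_SAMPLING_RATIO and HOSTNAME are project-specific variables rather than standard SDK settings, so the application reads them explicitly. A sketch of mapping the values above onto OpenTelemetry resource attributes (the sampling ratio itself is handled by the sampler, see Sampling Strategy below):

import os
from opentelemetry.sdk.resources import Resource

def build_resource() -> Resource:
    # Map environment-specific settings onto standard resource attributes.
    return Resource.create({
        "service.name": os.environ.get("OTEL_SERVICE_NAME", "ivatar"),
        "deployment.environment": os.environ.get("OTEL_ENVIRONMENT", "development"),
        "host.name": os.environ.get("HOSTNAME", "unknown"),
    })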
Container Configuration
Podman Container Updates
# Add to existing Dockerfile (quote the specifiers so the shell
# does not treat ">=" as output redirection)
RUN pip install "opentelemetry-api>=1.20.0" \
    "opentelemetry-sdk>=1.20.0" \
    "opentelemetry-instrumentation-django>=0.42b0" \
    "opentelemetry-instrumentation-psycopg2>=0.42b0" \
    "opentelemetry-instrumentation-pymysql>=0.42b0" \
    "opentelemetry-instrumentation-requests>=0.42b0" \
    "opentelemetry-instrumentation-urllib3>=0.42b0" \
    "opentelemetry-exporter-otlp>=1.20.0" \
    "opentelemetry-exporter-prometheus>=1.12.0rc1" \
    "opentelemetry-instrumentation-memcached>=0.42b0"
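These packages only take effect once their instrumentors are activated at application start-up. A sketch of the corresponding calls for a subset of the list (the exact set should match the clients ivatar actually uses, e.g. psycopg2 vs. pymysql):

# Activate installed instrumentations at application start-up.
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

def instrument_app():
    DjangoInstrumentor().instrument()      # request/response spans
    Psycopg2Instrumentor().instrument()    # PostgreSQL query spans
    RequestsInstrumentor().instrument()    # outbound HTTP spans (e.g. remote avatar fetches)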
Container Environment Variables
# Add to container startup script
export OTEL_ENABLED=${OTEL_ENABLED:-false}
export OTEL_SERVICE_NAME=${OTEL_SERVICE_NAME:-ivatar}
export OTEL_ENVIRONMENT=${OTEL_ENVIRONMENT:-development}
export OTEL_EXPORTER_OTLP_ENDPOINT=${OTEL_EXPORTER_OTLP_ENDPOINT}
export OTEL_PROMETHEUS_ENDPOINT=${OTEL_PROMETHEUS_ENDPOINT:-0.0.0.0:9464}
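A sketch of how the application could honour these defaults, enabling telemetry only when OTEL_ENABLED is true and exposing the Prometheus endpoint configured above. It assumes the opentelemetry-exporter-prometheus and prometheus_client packages from the Dockerfile.

import os
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

def setup_metrics():
    # Honour the same defaults as the container startup script.
    if os.environ.get("OTEL_ENABLED", "false").lower() != "true":
        return
    addr, _, port = os.environ.get("OTEL_PROMETHEUS_ENDPOINT", "0.0.0.0:9464").rpartition(":")
    start_http_server(int(port), addr=addr)  # serves /metrics for Prometheus/Alloy scraping
    metrics.set_meter_provider(
        MeterProvider(metric_readers=[PrometheusMetricReader()])
    )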
Monitoring and Alerting
Collector Health Monitoring
Collector Metrics
- otelcol_receiver_accepted_spans: Spans received by the collector
- otelcol_receiver_refused_spans: Spans rejected by the collector
- otelcol_exporter_sent_spans: Spans sent to exporters
- otelcol_exporter_failed_spans: Failed span exports
Health Checks
# Prometheus health check
- job_name: "otel-collector-health"
  static_configs:
    - targets: ["collector.internal:8888"]
  metrics_path: /metrics
  scrape_interval: 30s
Application Performance Impact
Key Metrics to Monitor
- Application response time impact
- Memory usage increase
- CPU usage increase
- Network bandwidth usage
Alerting Rules
# High collector resource usage
- alert: HighCollectorCPU
  expr: rate(otelcol_process_cpu_seconds_total[5m]) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High collector CPU usage"
    description: "Collector CPU usage is {{ $value }}"

# Collector memory usage
- alert: HighCollectorMemory
  expr: otelcol_process_memory_usage_bytes / otelcol_process_memory_limit_bytes > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High collector memory usage"
    description: "Collector memory usage is {{ $value }}"
Security Considerations
Network Security
- Use TLS for collector communications
- Restrict collector access to trusted networks
- Implement proper firewall rules
Data Privacy
- Ensure no sensitive data in trace attributes
- Implement data sanitization
- Configure appropriate retention policies
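For ivatar the main concern is e-mail addresses. One way to keep them out of trace attributes is to record only the digest that the avatar lookup uses anyway. The following is an illustrative sketch; the function and attribute names are hypothetical, not part of the ivatar codebase.

import hashlib
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def find_avatar(email: str):
    # Hash before attaching to the span so raw addresses never leave the app.
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    with tracer.start_as_current_span("avatar.lookup") as span:
        span.set_attribute("avatar.email_hash", digest)
        # actual lookup logic lives in ivatar; omitted here
        return digest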
Access Control
- Restrict access to metrics endpoints
- Implement authentication for collector access
- Monitor access logs
Backup and Recovery
Data Retention
- Traces: 7 days (configurable)
- Metrics: 30 days (configurable)
- Logs: 14 days (configurable)
Backup Strategy
- Regular backup of collector configuration
- Backup of Grafana dashboards
- Backup of Prometheus rules
Performance Optimization
Sampling Strategy
- Production: 10% sampling rate
- Development: 100% sampling rate
- Error traces: Always sample
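On the application side, OTEL_SAMPLING_RATIO is turned into a head-based sampler. Note that head sampling decides before a span runs and therefore cannot see errors, so "always sample error traces" would normally be handled collector-side (e.g. tail-based sampling) rather than in the SDK. A minimal sketch, assuming the OpenTelemetry Python SDK:

# Build a sampler from OTEL_SAMPLING_RATIO (0.1 in production, 1.0 in development).
import os
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

def build_tracer_provider() -> TracerProvider:
    ratio = float(os.environ.get("OTEL_SAMPLING_RATIO", "1.0"))
    # ParentBased keeps whole traces consistent: child spans follow the
    # sampling decision made for the root span.
    return TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(ratio)))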
Batch Processing
- Optimize batch sizes for network conditions
- Configure appropriate timeouts
- Monitor queue depths
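In the SDK these knobs map onto the BatchSpanProcessor parameters. A sketch using the 1s timeout and 1024 batch size from the Alloy example above; the values are starting points, not tuned figures.

# Tune span batching: larger batches reduce network overhead,
# shorter delays reduce end-to-end latency of trace data.
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://collector.internal:4317", insecure=True),
    max_queue_size=2048,          # spans buffered before drops occur
    schedule_delay_millis=1000,   # matches the 1s batch timeout above
    max_export_batch_size=1024,   # matches send_batch_size above
)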
Resource Optimization
- Monitor collector resource usage
- Scale collectors based on load
- Implement horizontal scaling if needed
Troubleshooting
Common Issues
Collector Not Receiving Data
- Check network connectivity
- Verify OTLP endpoint configuration
- Check collector logs
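A quick reachability check for the OTLP ports can rule out network problems before digging into configuration. A sketch; the hostname is the placeholder used elsewhere in this document.

import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in (4317, 4318):
    print(f"collector.internal:{port} reachable:", port_reachable("collector.internal", port))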
High Resource Usage
- Adjust sampling rates
- Optimize batch processing
- Scale collector resources
Data Quality Issues
- Verify instrumentation configuration
- Check span attribute quality
- Monitor error rates
Debug Procedures
1. Check Collector Status
   curl http://collector.internal:8888/metrics
2. Verify Application Configuration
   curl http://app:9464/metrics
3. Check Trace Data
   - Access Jaeger UI
   - Search for recent traces
   - Verify span attributes
Future Enhancements
Advanced Features
- Custom dashboards for avatar metrics
- Advanced sampling strategies
- Log correlation with traces
- Performance profiling integration
Scalability Improvements
- Horizontal collector scaling
- Load balancing for collectors
- Multi-region deployment
- Edge collection points
Integration Enhancements
- Additional exporter backends
- Custom processors
- Advanced filtering
- Data transformation
Cost Considerations
Infrastructure Costs
- Additional compute resources for collectors
- Storage costs for trace data
- Network bandwidth costs
Operational Costs
- Monitoring and maintenance
- Configuration management
- Troubleshooting and support
Optimization Strategies
- Implement efficient sampling
- Use appropriate retention policies
- Optimize batch processing
- Monitor resource usage
Conclusion
The OpenTelemetry integration for ivatar provides comprehensive observability while leveraging the existing monitoring infrastructure. The phased deployment approach ensures minimal disruption to production services while providing valuable insights into avatar generation performance and user behavior.
Key success factors:
- Gradual rollout with monitoring
- Performance impact assessment
- Proper resource planning
- Security considerations
- Ongoing optimization