# OpenTelemetry Integration for ivatar
This document describes the OpenTelemetry integration implemented in the ivatar project, providing comprehensive observability for avatar generation, file uploads, authentication, and system performance.
## Overview
OpenTelemetry is integrated into ivatar to provide:
- **Distributed Tracing**: Track requests across the entire avatar generation pipeline
- **Custom Metrics**: Monitor avatar-specific operations and performance
- **Multi-Instance Support**: Distinguish between production and development environments
- **Infrastructure Integration**: Works with existing Prometheus/Grafana stack
## Architecture

### Components
1. **OpenTelemetry Configuration** (`ivatar/opentelemetry_config.py`)
   - Centralized configuration management
   - Environment-based setup
   - Resource creation with service metadata

2. **Custom Middleware** (`ivatar/opentelemetry_middleware.py`)
   - Request/response tracing
   - Avatar-specific metrics
   - Custom decorators for operation tracing

3. **Instrumentation Integration**
   - Django framework instrumentation
   - Database query tracing (PostgreSQL/MySQL)
   - HTTP client instrumentation
   - Cache instrumentation (Memcached)
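For orientation, here is a minimal sketch of how the middleware is typically registered in Django settings. The class path shown is hypothetical; check `ivatar/opentelemetry_middleware.py` for the actual name:

```python
# settings.py (sketch): register the tracing middleware early in the chain
# so it can time the full request/response cycle.
MIDDLEWARE = [
    "ivatar.opentelemetry_middleware.OpenTelemetryMiddleware",  # hypothetical name
    "django.middleware.security.SecurityMiddleware",
    "django.contrib.sessions.middleware.SessionMiddleware",
    "django.middleware.common.CommonMiddleware",
    # ... remaining middleware
]
```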
## Configuration

### Environment Variables
| Variable | Description | Default | Required |
|---|---|---|---|
| `OTEL_EXPORT_ENABLED` | Enable OpenTelemetry data export | `false` | No |
| `OTEL_SERVICE_NAME` | Service name identifier | `ivatar` | No |
| `OTEL_ENVIRONMENT` | Environment (production/development) | `development` | No |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP collector endpoint | None | No |
| `OTEL_PROMETHEUS_ENDPOINT` | Local Prometheus server (dev only) | None | No |
| `IVATAR_VERSION` | Application version | `2.0` | No |
| `HOSTNAME` | Instance identifier | `unknown` | No |
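A minimal sketch of the environment-based setup these variables drive; the actual logic lives in `ivatar/opentelemetry_config.py` and may differ in detail:

```python
import os

from opentelemetry.sdk.resources import Resource

# Read the documented variables with their documented defaults.
export_enabled = os.environ.get("OTEL_EXPORT_ENABLED", "false").lower() == "true"
service_name = os.environ.get("OTEL_SERVICE_NAME", "ivatar")
environment = os.environ.get("OTEL_ENVIRONMENT", "development")
otlp_endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")  # None if unset
version = os.environ.get("IVATAR_VERSION", "2.0")
hostname = os.environ.get("HOSTNAME", "unknown")

# Resource creation with service metadata, as listed under Components.
resource = Resource.create(
    {
        "service.name": service_name,
        "service.version": version,
        "deployment.environment": environment,
        "host.name": hostname,
    }
)
```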
### Multi-Instance Configuration

#### Production Environment

```bash
export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-production
export OTEL_ENVIRONMENT=production
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export HOSTNAME=prod-instance-01
```

**Note**: In production, metrics are exported via OTLP to your existing Prometheus server. Do not set `OTEL_PROMETHEUS_ENDPOINT` in production.
#### Development Environment

```bash
export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-development
export OTEL_ENVIRONMENT=development
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9467
export IVATAR_VERSION=2.0-dev
export HOSTNAME=dev-instance-01
```

**Note**: In development, you can optionally set `OTEL_PROMETHEUS_ENDPOINT` to start a local HTTP server for testing metrics.
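Under the hood, the local endpoint corresponds roughly to the following sketch (the exact wiring lives in the configuration module; the host and port are taken from the example above):

```python
from prometheus_client import start_http_server
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

# Serve a local /metrics endpoint for ad-hoc testing, matching
# OTEL_PROMETHEUS_ENDPOINT (0.0.0.0:9467 in the example above).
start_http_server(port=9467, addr="0.0.0.0")

# Attach a Prometheus reader so all instruments are exposed on that endpoint.
reader = PrometheusMetricReader()
provider = MeterProvider(metric_readers=[reader])
```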
## Metrics

### Custom Metrics

#### Avatar Operations
- `ivatar_requests_total`: Total HTTP requests by method, status, path
- `ivatar_request_duration_seconds`: Request duration histogram
- `ivatar_avatar_requests_total`: Avatar requests by status, size, format
- `ivatar_avatar_generation_seconds`: Avatar generation time histogram
- `ivatar_avatars_generated_total`: Avatars generated by size, format, source
- `ivatar_avatar_cache_hits_total`: Cache hits by size, format
- `ivatar_avatar_cache_misses_total`: Cache misses by size, format
- `ivatar_external_avatar_requests_total`: External service requests
- `ivatar_file_uploads_total`: File uploads by content type, success
- `ivatar_file_upload_size_bytes`: File upload size histogram
### Labels/Dimensions
- `method`: HTTP method (GET, POST, etc.)
- `status_code`: HTTP status code
- `path`: Request path
- `size`: Avatar size (80, 128, 256, etc.)
- `format`: Image format (png, jpg, gif, etc.)
- `source`: Avatar source (uploaded, generated, external)
- `service`: External service name (gravatar, bluesky)
- `content_type`: File MIME type
- `success`: Operation success (true/false)
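As a rough sketch, instruments with these names and labels can be created through the standard OpenTelemetry metrics API; the actual wiring in the middleware may differ:

```python
from opentelemetry import metrics

meter = metrics.get_meter("ivatar")

# Counter for generated avatars, dimensioned by the labels listed above.
avatars_generated = meter.create_counter(
    "ivatar_avatars_generated_total",
    description="Avatars generated by size, format, source",
)

# Histogram for avatar generation latency.
generation_seconds = meter.create_histogram(
    "ivatar_avatar_generation_seconds",
    unit="s",
    description="Avatar generation time",
)

# Recording a data point together with its label set:
avatars_generated.add(1, {"size": "80", "format": "png", "source": "generated"})
generation_seconds.record(0.042, {"size": "80", "format": "png"})
```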
### Example Queries

#### Avatar Generation Rate

```promql
rate(ivatar_avatars_generated_total[5m])
```
#### Cache Hit Ratio

```promql
rate(ivatar_avatar_cache_hits_total[5m]) /
(rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))
```
#### 95th Percentile Avatar Generation Time

```promql
histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m]))
```
#### File Upload Success Rate

```promql
rate(ivatar_file_uploads_total{success="true"}[5m]) /
rate(ivatar_file_uploads_total[5m])
```
## Tracing

### Trace Points

#### Request Lifecycle
- HTTP request processing
- Avatar generation pipeline
- File upload and processing
- Authentication flows
- External API calls
#### Custom Spans

- `avatar.generate_png`: PNG image generation
- `avatar.gravatar_proxy`: Gravatar service proxy
- `file_upload.process`: File upload processing
- `auth.login`: User authentication
- `auth.logout`: User logout
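A minimal sketch of opening one of these spans around an operation, using the standard OpenTelemetry API (ivatar's own decorators in `ivatar/opentelemetry_middleware.py` may wrap this differently; `render_avatar` is a placeholder):

```python
from opentelemetry import trace

tracer = trace.get_tracer("ivatar")

def generate_png(email: str, size: int) -> bytes:
    # Open the custom span so the operation shows up in the request trace.
    with tracer.start_as_current_span("avatar.generate_png") as span:
        span.set_attribute("ivatar.avatar_size", size)
        span.set_attribute("ivatar.avatar_format", "png")
        return render_avatar(email, size)  # placeholder for the real pipeline
```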
### Span Attributes

#### HTTP Attributes

- `http.method`: HTTP method
- `http.url`: Full request URL
- `http.status_code`: Response status code
- `http.user_agent`: Client user agent
- `http.remote_addr`: Client IP address
#### Avatar Attributes

- `ivatar.request_type`: Request type (avatar, stats, etc.)
- `ivatar.avatar_size`: Requested avatar size
- `ivatar.avatar_format`: Requested format
- `ivatar.avatar_email`: Email address (if applicable)
#### File Attributes

- `file.name`: Uploaded file name
- `file.size`: File size in bytes
- `file.content_type`: MIME type
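These attributes are attached to the active span. For example, annotating the current span during upload processing might look like this sketch (standard API, not ivatar's exact code):

```python
from opentelemetry import trace

def record_upload_attributes(uploaded_file) -> None:
    # Annotate whatever span is currently active (e.g. file_upload.process).
    span = trace.get_current_span()
    span.set_attribute("file.name", uploaded_file.name)
    span.set_attribute("file.size", uploaded_file.size)
    span.set_attribute("file.content_type", uploaded_file.content_type)
```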
## Infrastructure Requirements

### Option A: Extend Existing Stack (Recommended)

The existing monitoring stack can be extended to support OpenTelemetry:

#### Alloy Configuration
```yaml
# Add to existing Alloy configuration
otelcol.receiver.otlp:
  grpc:
    endpoint: 0.0.0.0:4317
  http:
    endpoint: 0.0.0.0:4318

otelcol.processor.batch:
  timeout: 1s
  send_batch_size: 1024

otelcol.exporter.prometheus:
  endpoint: "0.0.0.0:9464"

otelcol.exporter.jaeger:
  endpoint: "jaeger-collector:14250"

otelcol.pipeline.traces:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.jaeger]

otelcol.pipeline.metrics:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.prometheus]
```
#### Prometheus Configuration

```yaml
scrape_configs:
  - job_name: "ivatar-opentelemetry"
    static_configs:
      - targets: ["ivatar-prod:9464", "ivatar-dev:9464"]
    scrape_interval: 15s
    metrics_path: /metrics
```
### Option B: Dedicated OpenTelemetry Collector

For full OpenTelemetry features, deploy a dedicated collector:

#### Collector Configuration
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        from_attribute: deployment.environment
        action: insert

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  jaeger:
    endpoint: "jaeger-collector:14250"
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus, logging]
```
## Deployment

### Development Setup

1. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```

2. **Configure Environment**

   ```bash
   export OTEL_EXPORT_ENABLED=true
   export OTEL_SERVICE_NAME=ivatar-development
   export OTEL_ENVIRONMENT=development
   export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9467
   ```

3. **Start Development Server**

   ```bash
   ./manage.py runserver 0:8080
   ```

4. **Verify Metrics**

   ```bash
   curl http://localhost:9467/metrics
   ```
### Production Deployment

1. **Update Container Images**
   - Add OpenTelemetry dependencies to `requirements.txt`
   - Update container build process

2. **Configure Environment Variables**
   - Set production-specific OpenTelemetry variables
   - Configure collector endpoints

3. **Update Monitoring Stack**
   - Extend Alloy configuration
   - Update Prometheus scrape configs
   - Configure Grafana dashboards

4. **Verify Deployment**
   - Check metrics endpoint accessibility
   - Verify trace data flow
   - Monitor dashboard updates
## Monitoring and Alerting

### Key Metrics to Monitor

#### Performance
- Request duration percentiles (p50, p95, p99)
- Avatar generation time
- Cache hit ratio
- File upload success rate
#### Business Metrics
- Avatar requests per minute
- Popular avatar sizes
- External service usage
- User authentication success rate
#### Error Rates
- HTTP error rates by endpoint
- File upload failures
- External service failures
- Authentication failures
### Example Alerts

#### High Error Rate

```yaml
alert: HighErrorRate
expr: rate(ivatar_requests_total{status_code=~"5.."}[5m]) > 0.1
for: 2m
labels:
  severity: warning
annotations:
  summary: "High error rate detected"
  description: "Error rate is {{ $value }} errors per second"
```
#### Slow Avatar Generation

```yaml
alert: SlowAvatarGeneration
expr: histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m])) > 2
for: 5m
labels:
  severity: warning
annotations:
  summary: "Slow avatar generation"
  description: "95th percentile avatar generation time is {{ $value }}s"
```
#### Low Cache Hit Ratio

```yaml
alert: LowCacheHitRatio
expr: (rate(ivatar_avatar_cache_hits_total[5m]) / (rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))) < 0.8
for: 10m
labels:
  severity: warning
annotations:
  summary: "Low cache hit ratio"
  description: "Cache hit ratio is {{ $value }}"
```
## Troubleshooting

### Common Issues

#### OpenTelemetry Not Enabled

- Check the `OTEL_EXPORT_ENABLED` environment variable
- Verify OpenTelemetry packages are installed
- Check Django logs for configuration errors
#### Metrics Not Appearing
- Verify Prometheus endpoint is accessible
- Check collector configuration
- Ensure metrics are being generated
#### Traces Not Showing
- Verify OTLP endpoint configuration
- Check collector connectivity
- Ensure tracing is enabled in configuration
#### High Memory Usage
- Adjust batch processor settings
- Reduce trace sampling rate
- Monitor collector resource usage
### Debug Mode

Enable debug logging for OpenTelemetry in Django settings:

```python
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "loggers": {
        "opentelemetry": {
            "level": "DEBUG",
        },
        "ivatar.opentelemetry": {
            "level": "DEBUG",
        },
    },
}
```
## Performance Considerations

- **Sampling**: Implement trace sampling for high-traffic production (see the sketch below)
- **Batch Processing**: Use appropriate batch sizes for your infrastructure
- **Resource Limits**: Monitor collector resource usage
- **Network**: Ensure low-latency connections to collectors
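For example, a parent-based ratio sampler is a minimal way to sample traces under high traffic; this is a sketch using the standard SDK, and the 10% ratio is illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of root traces; child spans follow the parent's decision,
# so sampled requests produce complete traces.
sampler = ParentBased(TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```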
## Security Considerations

- **Data Privacy**: Ensure no sensitive data in trace attributes (see the sketch below)
- **Network Security**: Use TLS for collector communications
- **Access Control**: Restrict access to metrics endpoints
- **Data Retention**: Configure appropriate retention policies
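On the data-privacy point, one option is to record a digest rather than a raw address when populating attributes such as `ivatar.avatar_email`; this is a sketch, not necessarily what ivatar does today:

```python
import hashlib

def email_digest(email: str) -> str:
    # Normalize, then hash, so traces never carry the raw address.
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
```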
## Future Enhancements

- **Custom Dashboards**: Create Grafana dashboards for avatar metrics
- **Advanced Sampling**: Implement intelligent trace sampling
- **Log Correlation**: Correlate traces with application logs
- **Performance Profiling**: Add profiling capabilities
- **Custom Exports**: Export to additional backends (Datadog, New Relic)
## Support
For issues related to OpenTelemetry integration:
- Check application logs for configuration errors
- Verify collector connectivity
- Review Prometheus metrics for data flow
- Consult OpenTelemetry documentation for advanced configuration