# OpenTelemetry Integration for ivatar
This document describes the OpenTelemetry integration implemented in the ivatar project, providing comprehensive observability for avatar generation, file uploads, authentication, and system performance.
## Overview
OpenTelemetry is integrated into ivatar to provide:
- **Distributed Tracing**: Track requests across the entire avatar generation pipeline
- **Custom Metrics**: Monitor avatar-specific operations and performance
- **Multi-Instance Support**: Distinguish between production and development environments
- **Infrastructure Integration**: Works with existing Prometheus/Grafana stack
## Architecture
### Components
1. **OpenTelemetry Configuration** (`ivatar/opentelemetry_config.py`)
   - Centralized configuration management
   - Environment-based setup
   - Resource creation with service metadata
2. **Custom Middleware** (`ivatar/opentelemetry_middleware.py`)
   - Request/response tracing
   - Avatar-specific metrics
   - Custom decorators for operation tracing (see the sketch below)
3. **Instrumentation Integration**
   - Django framework instrumentation
   - Database query tracing (PostgreSQL/MySQL)
   - HTTP client instrumentation
   - Cache instrumentation (Memcached)
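To illustrate the decorator-based tracing, here is a minimal sketch using the standard OpenTelemetry API. The `trace_operation` decorator and the wrapped function are hypothetical stand-ins; the actual helpers in `ivatar/opentelemetry_middleware.py` may be named and structured differently.
```python
# Hypothetical sketch of an operation-tracing decorator (not ivatar's exact code).
from functools import wraps

from opentelemetry import trace

tracer = trace.get_tracer("ivatar")


def trace_operation(span_name):
    """Wrap a function in a custom span with the given name."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(span_name) as span:
                span.set_attribute("ivatar.operation", func.__name__)
                return func(*args, **kwargs)

        return wrapper

    return decorator


@trace_operation("avatar.generate_png")
def generate_png(email_hash, size):
    ...  # avatar generation logic goes here
```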
## Configuration
### Environment Variables
| Variable | Description | Default | Required |
| ----------------------------- | ------------------------------------ | ------------- | -------- |
| `OTEL_EXPORT_ENABLED` | Enable OpenTelemetry data export | `false` | No |
| `OTEL_SERVICE_NAME` | Service name identifier | `ivatar` | No |
| `OTEL_ENVIRONMENT` | Environment (production/development) | `development` | No |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP collector endpoint | None | No |
| `OTEL_PROMETHEUS_ENDPOINT` | Local Prometheus server (dev only) | None | No |
| `IVATAR_VERSION` | Application version | `2.0` | No |
| `HOSTNAME` | Instance identifier | `unknown` | No |
### Multi-Instance Configuration
#### Production Environment
```bash
export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-production
export OTEL_ENVIRONMENT=production
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export HOSTNAME=prod-instance-01
```
**Note**: In production, metrics are exported via OTLP to your existing Prometheus server. Do not set `OTEL_PROMETHEUS_ENDPOINT` in production.
#### Development Environment
```bash
export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-development
export OTEL_ENVIRONMENT=development
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9467
export IVATAR_VERSION=2.0-dev
export HOSTNAME=dev-instance-01
```
**Note**: In development, you can optionally set `OTEL_PROMETHEUS_ENDPOINT` to start a local HTTP server for testing metrics.
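For reference, the following is a minimal sketch of the environment-driven setup described above, using the standard OpenTelemetry SDK. The real logic lives in `ivatar/opentelemetry_config.py` and may differ in detail.
```python
# Minimal sketch only; see ivatar/opentelemetry_config.py for the actual implementation.
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

if os.environ.get("OTEL_EXPORT_ENABLED", "false").lower() == "true":
    resource = Resource.create(
        {
            "service.name": os.environ.get("OTEL_SERVICE_NAME", "ivatar"),
            "deployment.environment": os.environ.get("OTEL_ENVIRONMENT", "development"),
            "service.version": os.environ.get("IVATAR_VERSION", "2.0"),
            "service.instance.id": os.environ.get("HOSTNAME", "unknown"),
        }
    )
    provider = TracerProvider(resource=resource)
    endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if endpoint:
        # Export spans in batches to the configured OTLP collector endpoint.
        provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint)))
    trace.set_tracer_provider(provider)
```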
## Metrics
### Custom Metrics
#### Avatar Operations
- `ivatar_requests_total`: Total HTTP requests by method, status, path
- `ivatar_request_duration_seconds`: Request duration histogram
- `ivatar_avatar_requests_total`: Avatar requests by status, size, format
- `ivatar_avatar_generation_seconds`: Avatar generation time histogram
- `ivatar_avatars_generated_total`: Avatars generated by size, format, source
- `ivatar_avatar_cache_hits_total`: Cache hits by size, format
- `ivatar_avatar_cache_misses_total`: Cache misses by size, format
- `ivatar_external_avatar_requests_total`: External service requests
- `ivatar_file_uploads_total`: File uploads by content type, success
- `ivatar_file_upload_size_bytes`: File upload size histogram
#### Labels/Dimensions
- `method`: HTTP method (GET, POST, etc.)
- `status_code`: HTTP status code
- `path`: Request path
- `size`: Avatar size (80, 128, 256, etc.)
- `format`: Image format (png, jpg, gif, etc.)
- `source`: Avatar source (uploaded, generated, external)
- `service`: External service name (gravatar, bluesky)
- `content_type`: File MIME type
- `success`: Operation success (true/false)
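As a rough illustration (not the exact instrument definitions in `ivatar/opentelemetry_middleware.py`), a counter such as `ivatar_avatars_generated_total` can be recorded with the OpenTelemetry metrics API like this:
```python
# Illustrative sketch only; ivatar's real instruments may be created and named differently.
from opentelemetry import metrics

meter = metrics.get_meter("ivatar")
avatars_generated = meter.create_counter(
    "ivatar_avatars_generated",  # Prometheus exporters typically append the "_total" suffix
    description="Avatars generated by size, format, source",
)

# Somewhere in the avatar generation path:
avatars_generated.add(1, {"size": "80", "format": "png", "source": "generated"})
```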
### Example Queries
#### Avatar Generation Rate
```promql
rate(ivatar_avatars_generated_total[5m])
```
#### Cache Hit Ratio
```promql
rate(ivatar_avatar_cache_hits_total[5m]) /
(rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))
```
#### 95th Percentile Avatar Generation Time
```promql
histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m]))
```
#### File Upload Success Rate
```promql
rate(ivatar_file_uploads_total{success="true"}[5m]) /
rate(ivatar_file_uploads_total[5m])
```
## Tracing
### Trace Points
#### Request Lifecycle
- HTTP request processing
- Avatar generation pipeline
- File upload and processing
- Authentication flows
- External API calls
#### Custom Spans
- `avatar.generate_png`: PNG image generation
- `avatar.gravatar_proxy`: Gravatar service proxy
- `file_upload.process`: File upload processing
- `auth.login`: User authentication
- `auth.logout`: User logout
### Span Attributes
#### HTTP Attributes
- `http.method`: HTTP method
- `http.url`: Full request URL
- `http.status_code`: Response status code
- `http.user_agent`: Client user agent
- `http.remote_addr`: Client IP address
#### Avatar Attributes
- `ivatar.request_type`: Request type (avatar, stats, etc.)
- `ivatar.avatar_size`: Requested avatar size
- `ivatar.avatar_format`: Requested format
- `ivatar.avatar_email`: Email address (if applicable)
#### File Attributes
- `file.name`: Uploaded file name
- `file.size`: File size in bytes
- `file.content_type`: MIME type
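Custom attributes such as these are attached to the currently active span; the snippet below is illustrative rather than ivatar's exact code.
```python
# Illustrative: attach avatar-specific attributes to whatever span is currently active.
from opentelemetry import trace

span = trace.get_current_span()
span.set_attribute("ivatar.avatar_size", 80)
span.set_attribute("ivatar.avatar_format", "png")
```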
## Infrastructure Requirements
### Option A: Extend Existing Stack (Recommended)
The existing monitoring stack can be extended to support OpenTelemetry:
#### Alloy Configuration
```yaml
# Add to existing Alloy configuration
otelcol.receiver.otlp:
  grpc:
    endpoint: 0.0.0.0:4317
  http:
    endpoint: 0.0.0.0:4318
otelcol.processor.batch:
  timeout: 1s
  send_batch_size: 1024
otelcol.exporter.prometheus:
  endpoint: "0.0.0.0:9464"
otelcol.exporter.jaeger:
  endpoint: "jaeger-collector:14250"
otelcol.pipeline.traces:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.jaeger]
otelcol.pipeline.metrics:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.prometheus]
```
#### Prometheus Configuration
```yaml
scrape_configs:
  - job_name: "ivatar-opentelemetry"
    static_configs:
      - targets: ["ivatar-prod:9464", "ivatar-dev:9464"]
    scrape_interval: 15s
    metrics_path: /metrics
```
### Option B: Dedicated OpenTelemetry Collector
For full OpenTelemetry features, deploy a dedicated collector:
#### Collector Configuration
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        from_attribute: deployment.environment
        action: insert
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  jaeger:
    endpoint: "jaeger-collector:14250"
  logging:
    loglevel: debug
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus, logging]
```
## Deployment
### Development Setup
1. **Install Dependencies**
   ```bash
   pip install -r requirements.txt
   ```
2. **Configure Environment**
   ```bash
   export OTEL_EXPORT_ENABLED=true
   export OTEL_SERVICE_NAME=ivatar-development
   export OTEL_ENVIRONMENT=development
   # Optional: expose a local metrics endpoint for step 4
   export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
   ```
3. **Start Development Server**
   ```bash
   ./manage.py runserver 0:8080
   ```
4. **Verify Metrics**
   ```bash
   curl http://localhost:9464/metrics
   ```
### Production Deployment
1. **Update Container Images**
   - Add OpenTelemetry dependencies to `requirements.txt`
   - Update the container build process
2. **Configure Environment Variables**
   - Set production-specific OpenTelemetry variables
   - Configure collector endpoints
3. **Update Monitoring Stack**
   - Extend the Alloy configuration
   - Update Prometheus scrape configs
   - Configure Grafana dashboards
4. **Verify Deployment**
   - Check metrics endpoint accessibility
   - Verify trace data flow
   - Monitor dashboard updates
## Monitoring and Alerting
### Key Metrics to Monitor
#### Performance
- Request duration percentiles (p50, p95, p99)
- Avatar generation time
- Cache hit ratio
- File upload success rate
#### Business Metrics
- Avatar requests per minute
- Popular avatar sizes
- External service usage
- User authentication success rate
#### Error Rates
- HTTP error rates by endpoint
- File upload failures
- External service failures
- Authentication failures
### Example Alerts
#### High Error Rate
```yaml
alert: HighErrorRate
expr: rate(ivatar_requests_total{status_code=~"5.."}[5m]) > 0.1
for: 2m
labels:
  severity: warning
annotations:
  summary: "High error rate detected"
  description: "Error rate is {{ $value }} errors per second"
```
#### Slow Avatar Generation
```yaml
alert: SlowAvatarGeneration
expr: histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m])) > 2
for: 5m
labels:
  severity: warning
annotations:
  summary: "Slow avatar generation"
  description: "95th percentile avatar generation time is {{ $value }}s"
```
#### Low Cache Hit Ratio
```yaml
alert: LowCacheHitRatio
expr: (rate(ivatar_avatar_cache_hits_total[5m]) / (rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))) < 0.8
for: 10m
labels:
  severity: warning
annotations:
  summary: "Low cache hit ratio"
  description: "Cache hit ratio is {{ $value }}"
```
## Troubleshooting
### Common Issues
#### OpenTelemetry Not Enabled
- Check the `OTEL_EXPORT_ENABLED` environment variable
- Verify OpenTelemetry packages are installed
- Check Django logs for configuration errors
#### Metrics Not Appearing
- Verify Prometheus endpoint is accessible
- Check collector configuration
- Ensure metrics are being generated
#### Traces Not Showing
- Verify OTLP endpoint configuration
- Check collector connectivity
- Ensure tracing is enabled in configuration
#### High Memory Usage
- Adjust batch processor settings
- Reduce trace sampling rate
- Monitor collector resource usage
### Debug Mode
Enable debug logging for OpenTelemetry:
```python
LOGGING = {
    "loggers": {
        "opentelemetry": {
            "level": "DEBUG",
        },
        "ivatar.opentelemetry": {
            "level": "DEBUG",
        },
    },
}
```
### Performance Considerations
- **Sampling**: Implement trace sampling for high-traffic production (see the sketch after this list)
- **Batch Processing**: Use appropriate batch sizes for your infrastructure
- **Resource Limits**: Monitor collector resource usage
- **Network**: Ensure low-latency connections to collectors
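As a sketch of how ratio-based sampling can be wired in with the standard OpenTelemetry SDK (the ratio and the provider wiring are illustrative, not ivatar's actual defaults):
```python
# Keep roughly 10% of traces; child spans follow their parent's sampling decision.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
```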
## Security Considerations
- **Data Privacy**: Ensure no sensitive data in trace attributes
- **Network Security**: Use TLS for collector communications
- **Access Control**: Restrict access to metrics endpoints
- **Data Retention**: Configure appropriate retention policies
## Future Enhancements
- **Custom Dashboards**: Create Grafana dashboards for avatar metrics
- **Advanced Sampling**: Implement intelligent trace sampling
- **Log Correlation**: Correlate traces with application logs
- **Performance Profiling**: Add profiling capabilities
- **Custom Exports**: Export to additional backends (Datadog, New Relic)
## Support
For issues related to OpenTelemetry integration:
- Check application logs for configuration errors
- Verify collector connectivity
- Review Prometheus metrics for data flow
- Consult OpenTelemetry documentation for advanced configuration