mirror of
https://git.linux-kernel.at/oliver/ivatar.git
synced 2025-11-11 10:46:24 +00:00
464 lines
12 KiB
Markdown
# OpenTelemetry Integration for ivatar

This document describes the OpenTelemetry integration in the ivatar project, which provides observability for avatar generation, file uploads, authentication, and system performance.

## Overview

OpenTelemetry is integrated into ivatar to provide:

- **Distributed Tracing**: Track requests across the entire avatar generation pipeline
- **Custom Metrics**: Monitor avatar-specific operations and performance
- **Multi-Instance Support**: Distinguish between production and development environments
- **Infrastructure Integration**: Works with the existing Prometheus/Grafana stack

## Architecture

### Components

1. **OpenTelemetry Configuration** (`ivatar/opentelemetry_config.py`)

   - Centralized configuration management
   - Environment-based setup
   - Resource creation with service metadata

2. **Custom Middleware** (`ivatar/opentelemetry_middleware.py`)

   - Request/response tracing
   - Avatar-specific metrics
   - Custom decorators for operation tracing

3. **Instrumentation Integration**
   - Django framework instrumentation
   - Database query tracing (PostgreSQL/MySQL)
   - HTTP client instrumentation
   - Cache instrumentation (Memcached)

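To illustrate the middleware component, here is a minimal sketch of a Django-style middleware that counts and times requests. It is not the code in `ivatar/opentelemetry_middleware.py`: the class name `RequestMetricsMiddleware` and the in-memory `Counter` are hypothetical stand-ins for real OpenTelemetry instruments, chosen so the example runs with the standard library alone.

```python
import time
from collections import Counter
from types import SimpleNamespace


class RequestMetricsMiddleware:
    """Django-style middleware: wraps get_response and records per-request data."""

    def __init__(self, get_response):
        self.get_response = get_response
        self.requests_total = Counter()  # stand-in for ivatar_requests_total
        self.durations = []              # stand-in for ivatar_request_duration_seconds

    def __call__(self, request):
        start = time.perf_counter()
        response = self.get_response(request)
        elapsed = time.perf_counter() - start
        # Label the counter by method, status code, and path.
        key = (request.method, response.status_code, request.path)
        self.requests_total[key] += 1
        self.durations.append(elapsed)
        return response


# Usage with fake request/response objects instead of a running Django app.
def fake_view(request):
    return SimpleNamespace(status_code=200)


middleware = RequestMetricsMiddleware(fake_view)
middleware(SimpleNamespace(method="GET", path="/avatar/abc"))
```

In a real deployment the counter and the duration list would be replaced by instruments obtained from the configured meter, and the class would be listed in Django's `MIDDLEWARE` setting.
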
## Configuration

### Environment Variables

| Variable                      | Description                          | Default       | Required |
| ----------------------------- | ------------------------------------ | ------------- | -------- |
| `OTEL_EXPORT_ENABLED`         | Enable OpenTelemetry data export     | `false`       | No       |
| `OTEL_SERVICE_NAME`           | Service name identifier              | `ivatar`      | No       |
| `OTEL_ENVIRONMENT`            | Environment (production/development) | `development` | No       |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP collector endpoint              | None          | No       |
| `OTEL_PROMETHEUS_ENDPOINT`    | Local Prometheus server (dev only)   | None          | No       |
| `IVATAR_VERSION`              | Application version                  | `2.0`         | No       |
| `HOSTNAME`                    | Instance identifier                  | `unknown`     | No       |

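The table can be read as a small settings loader. The sketch below shows one way to apply the documented defaults; the `OTelSettings` dataclass and `load_settings` function are hypothetical names, and treating only the literal string `true` as "enabled" is an assumption rather than ivatar's confirmed parsing rule.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OTelSettings:
    export_enabled: bool
    service_name: str
    environment: str
    otlp_endpoint: Optional[str]
    prometheus_endpoint: Optional[str]
    version: str
    hostname: str


def load_settings(env: Mapping[str, str] = os.environ) -> OTelSettings:
    """Read the variables from the table, applying the documented defaults."""
    return OTelSettings(
        # Assumption: only "true" (case-insensitive) enables export.
        export_enabled=env.get("OTEL_EXPORT_ENABLED", "false").lower() == "true",
        service_name=env.get("OTEL_SERVICE_NAME", "ivatar"),
        environment=env.get("OTEL_ENVIRONMENT", "development"),
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT"),
        prometheus_endpoint=env.get("OTEL_PROMETHEUS_ENDPOINT"),
        version=env.get("IVATAR_VERSION", "2.0"),
        hostname=env.get("HOSTNAME", "unknown"),
    )


# With an empty environment every field falls back to its default.
defaults = load_settings({})
production = load_settings(
    {"OTEL_EXPORT_ENABLED": "true", "OTEL_SERVICE_NAME": "ivatar-production"}
)
```

Passing the environment as a mapping keeps the loader easy to test without touching the real process environment.
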
### Multi-Instance Configuration

#### Production Environment

```bash
export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-production
export OTEL_ENVIRONMENT=production
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export HOSTNAME=prod-instance-01
```

**Note**: In production, metrics are exported via OTLP to the collector, which makes them available to your existing Prometheus server. Do not set `OTEL_PROMETHEUS_ENDPOINT` in production.

#### Development Environment

```bash
export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-development
export OTEL_ENVIRONMENT=development
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9467
export IVATAR_VERSION=2.0-dev
export HOSTNAME=dev-instance-01
```

**Note**: In development, you can optionally set `OTEL_PROMETHEUS_ENDPOINT` to start a local HTTP server that serves metrics for testing (port 9467 in the example above).

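The instances are told apart by the metadata attached to every span and metric. As a rough sketch, the environment variables could map onto resource attributes as follows. The keys are standard OpenTelemetry semantic-convention names, but the exact mapping ivatar uses is an assumption here, and `resource_attributes` is a hypothetical helper.

```python
from typing import Dict, Mapping


def resource_attributes(env: Mapping[str, str]) -> Dict[str, str]:
    """Map the environment variables onto resource attribute keys."""
    return {
        "service.name": env.get("OTEL_SERVICE_NAME", "ivatar"),
        "service.version": env.get("IVATAR_VERSION", "2.0"),
        "service.instance.id": env.get("HOSTNAME", "unknown"),
        "deployment.environment": env.get("OTEL_ENVIRONMENT", "development"),
    }


prod = resource_attributes({
    "OTEL_SERVICE_NAME": "ivatar-production",
    "OTEL_ENVIRONMENT": "production",
    "HOSTNAME": "prod-instance-01",
})
dev = resource_attributes({
    "OTEL_SERVICE_NAME": "ivatar-development",
    "OTEL_ENVIRONMENT": "development",
    "IVATAR_VERSION": "2.0-dev",
    "HOSTNAME": "dev-instance-01",
})

# Attributes that let a backend tell the two instances apart.
differing = sorted(key for key in prod if prod[key] != dev[key])
```

With the two example environments, all four attributes differ, so dashboards can filter on any of them.
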
## Metrics

### Custom Metrics

#### Avatar Operations

- `ivatar_requests_total`: Total HTTP requests by method, status, path
- `ivatar_request_duration_seconds`: Request duration histogram
- `ivatar_avatar_requests_total`: Avatar requests by status, size, format
- `ivatar_avatar_generation_seconds`: Avatar generation time histogram
- `ivatar_avatars_generated_total`: Avatars generated by size, format, source
- `ivatar_avatar_cache_hits_total`: Cache hits by size, format
- `ivatar_avatar_cache_misses_total`: Cache misses by size, format
- `ivatar_external_avatar_requests_total`: External service requests
- `ivatar_file_uploads_total`: File uploads by content type, success
- `ivatar_file_upload_size_bytes`: File upload size histogram

#### Labels/Dimensions

- `method`: HTTP method (GET, POST, etc.)
- `status_code`: HTTP status code
- `path`: Request path
- `size`: Avatar size (80, 128, 256, etc.)
- `format`: Image format (png, jpg, gif, etc.)
- `source`: Avatar source (uploaded, generated, external)
- `service`: External service name (gravatar, bluesky)
- `content_type`: File MIME type
- `success`: Operation success (true/false)

### Example Queries

#### Avatar Generation Rate

```promql
rate(ivatar_avatars_generated_total[5m])
```

#### Cache Hit Ratio

```promql
rate(ivatar_avatar_cache_hits_total[5m]) /
(rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))
```

#### 95th Percentile Avatar Generation Time

```promql
histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m]))
```

#### File Upload Success Rate

```promql
# Sum both sides so the division is not limited to series whose labels match exactly
sum(rate(ivatar_file_uploads_total{success="true"}[5m])) /
sum(rate(ivatar_file_uploads_total[5m]))
```

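The `histogram_quantile` query above estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that arithmetic in Python. The bucket boundaries and counts are made-up illustration values, not ivatar's actual histogram configuration, and the edge cases Prometheus handles (such as negative bounds) are left out.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical generation-time buckets in seconds: (le, cumulative count).
buckets = [(0.5, 50.0), (1.0, 90.0), (2.0, 98.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.
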
## Tracing

### Trace Points

#### Request Lifecycle

- HTTP request processing
- Avatar generation pipeline
- File upload and processing
- Authentication flows
- External API calls

#### Custom Spans

- `avatar.generate_png`: PNG image generation
- `avatar.gravatar_proxy`: Gravatar service proxy
- `file_upload.process`: File upload processing
- `auth.login`: User authentication
- `auth.logout`: User logout

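Custom spans like the ones listed above are typically created by wrapping a function in a span. The following is a minimal sketch of such a decorator, not ivatar's actual implementation: `traced` and `generate_png` are hypothetical names. It uses the OpenTelemetry API (`trace.get_tracer` and `start_as_current_span`) when the package is installed and falls back to a no-op otherwise.

```python
import functools
from contextlib import nullcontext

try:
    # Real OpenTelemetry API, used when the package is installed.
    from opentelemetry import trace

    _tracer = trace.get_tracer("ivatar.example")
except ImportError:
    # Fallback so the sketch still runs without OpenTelemetry.
    _tracer = None


def traced(span_name):
    """Hypothetical decorator: run the wrapped function inside a named span."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if _tracer is not None:
                span_cm = _tracer.start_as_current_span(span_name)
            else:
                span_cm = nullcontext()
            with span_cm:
                return func(*args, **kwargs)

        return wrapper

    return decorator


@traced("avatar.generate_png")
def generate_png(size):
    # Placeholder body; a real implementation would render the image here.
    return f"png:{size}x{size}"


result = generate_png(80)
```

The fallback keeps the decorated code importable in environments where tracing is not set up, which is convenient for tests.
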
### Span Attributes

#### HTTP Attributes

- `http.method`: HTTP method
- `http.url`: Full request URL
- `http.status_code`: Response status code
- `http.user_agent`: Client user agent
- `http.remote_addr`: Client IP address

#### Avatar Attributes

- `ivatar.request_type`: Request type (avatar, stats, etc.)
- `ivatar.avatar_size`: Requested avatar size
- `ivatar.avatar_format`: Requested format
- `ivatar.avatar_email`: Email address (if applicable)

#### File Attributes

- `file.name`: Uploaded file name
- `file.size`: File size in bytes
- `file.content_type`: MIME type

## Infrastructure Requirements

### Option A: Extend Existing Stack (Recommended)

The existing monitoring stack can be extended to support OpenTelemetry. The outline below lists the Alloy components and pipelines involved. It is schematic, so express it in the configuration syntax of your Alloy version.

#### Alloy Configuration

```yaml
# Add to existing Alloy configuration
otelcol.receiver.otlp:
  grpc:
    endpoint: 0.0.0.0:4317
  http:
    endpoint: 0.0.0.0:4318

otelcol.processor.batch:
  timeout: 1s
  send_batch_size: 1024

otelcol.exporter.prometheus:
  endpoint: "0.0.0.0:9464"

otelcol.exporter.jaeger:
  endpoint: "jaeger-collector:14250"

otelcol.pipeline.traces:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.jaeger]

otelcol.pipeline.metrics:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.prometheus]
```

#### Prometheus Configuration

```yaml
scrape_configs:
  - job_name: "ivatar-opentelemetry"
    static_configs:
      - targets: ["ivatar-prod:9464", "ivatar-dev:9464"]
    scrape_interval: 15s
    metrics_path: /metrics
```

### Option B: Dedicated OpenTelemetry Collector

For full OpenTelemetry features, deploy a dedicated collector:

#### Collector Configuration

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      # Copy deployment.environment into an `environment` attribute
      - key: environment
        from_attribute: deployment.environment
        action: insert

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  # Recent Collector releases dropped the dedicated `jaeger` exporter;
  # on those versions, send traces to Jaeger with the `otlp` exporter instead.
  jaeger:
    endpoint: "jaeger-collector:14250"
  # Newer releases replace `logging` with the `debug` exporter.
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus, logging]
```

## Deployment

### Development Setup

1. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```

2. **Configure Environment**

   ```bash
   export OTEL_EXPORT_ENABLED=true
   export OTEL_SERVICE_NAME=ivatar-development
   export OTEL_ENVIRONMENT=development
   export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9467
   ```

3. **Start Development Server**

   ```bash
   ./manage.py runserver 0:8080
   ```

4. **Verify Metrics**
   ```bash
   curl http://localhost:9467/metrics
   ```

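The metrics endpoint returns Prometheus text format, so the verification step can be scripted. The sketch below extracts metric names from such output and keeps the `ivatar_` ones. The `metric_names` helper is hypothetical, and the sample text is made up for illustration rather than captured from a running instance.

```python
from typing import Set


def metric_names(exposition_text: str) -> Set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


# Made-up sample of what the metrics endpoint might return.
sample = """\
# HELP ivatar_requests_total Total HTTP requests
# TYPE ivatar_requests_total counter
ivatar_requests_total{method="GET",status_code="200"} 42
ivatar_avatar_cache_hits_total{size="80",format="png"} 7
python_gc_objects_collected_total 123
"""

ivatar_metrics = sorted(n for n in metric_names(sample) if n.startswith("ivatar_"))
```

To check a live instance, pass the body returned by the `curl` command to `metric_names` instead of the sample string. An empty result suggests that no requests have been served yet or that the metrics server is not running.
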
### Production Deployment

1. **Update Container Images**

   - Add OpenTelemetry dependencies to requirements.txt
   - Update container build process

2. **Configure Environment Variables**

   - Set production-specific OpenTelemetry variables
   - Configure collector endpoints

3. **Update Monitoring Stack**

   - Extend Alloy configuration
   - Update Prometheus scrape configs
   - Configure Grafana dashboards

4. **Verify Deployment**
   - Check metrics endpoint accessibility
   - Verify trace data flow
   - Monitor dashboard updates

## Monitoring and Alerting

### Key Metrics to Monitor

#### Performance

- Request duration percentiles (p50, p95, p99)
- Avatar generation time
- Cache hit ratio
- File upload success rate

#### Business Metrics

- Avatar requests per minute
- Popular avatar sizes
- External service usage
- User authentication success rate

#### Error Rates

- HTTP error rates by endpoint
- File upload failures
- External service failures
- Authentication failures

### Example Alerts

#### High Error Rate

```yaml
alert: HighErrorRate
expr: rate(ivatar_requests_total{status_code=~"5.."}[5m]) > 0.1
for: 2m
labels:
  severity: warning
annotations:
  summary: "High error rate detected"
  description: "Error rate is {{ $value }} errors per second"
```

#### Slow Avatar Generation

```yaml
alert: SlowAvatarGeneration
expr: histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m])) > 2
for: 5m
labels:
  severity: warning
annotations:
  summary: "Slow avatar generation"
  description: "95th percentile avatar generation time is {{ $value }}s"
```

#### Low Cache Hit Ratio

```yaml
alert: LowCacheHitRatio
expr: (rate(ivatar_avatar_cache_hits_total[5m]) / (rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))) < 0.8
for: 10m
labels:
  severity: warning
annotations:
  summary: "Low cache hit ratio"
  description: "Cache hit ratio is {{ $value }}"
```

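To make the `LowCacheHitRatio` expression concrete, the arithmetic can be worked through by hand. The sketch below uses made-up counter samples and a simplified `per_second_rate` helper (hypothetical). Prometheus's `rate()` additionally handles counter resets and extrapolates to the window edges, which this example ignores.

```python
def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Average per-second increase of a counter over a window (no resets)."""
    return (last - first) / window_seconds


# Made-up counter samples taken 300 s (5 minutes) apart.
hits_rate = per_second_rate(first=1_000, last=1_600, window_seconds=300)
misses_rate = per_second_rate(first=200, last=500, window_seconds=300)

# Same shape as the alert expression: hits / (hits + misses).
hit_ratio = hits_rate / (hits_rate + misses_rate)

# Threshold taken from the LowCacheHitRatio rule above.
alert_condition = hit_ratio < 0.8
```

Here 600 hits and 300 misses over five minutes give a ratio of two thirds, so the condition holds. The alert would only fire once it had stayed true for the full `for: 10m` period.
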
## Troubleshooting

### Common Issues

#### OpenTelemetry Not Enabled

- Check that the `OTEL_EXPORT_ENABLED` environment variable is set to `true`
- Verify OpenTelemetry packages are installed
- Check Django logs for configuration errors

#### Metrics Not Appearing

- Verify Prometheus endpoint is accessible
- Check collector configuration
- Ensure metrics are being generated

#### Traces Not Showing

- Verify OTLP endpoint configuration
- Check collector connectivity
- Ensure tracing is enabled in configuration

#### High Memory Usage

- Adjust batch processor settings
- Reduce trace sampling rate
- Monitor collector resource usage

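Several of the checks above come down to whether the application can reach the collector. A quick TCP probe can rule out basic network problems, as in the sketch below. The helpers `parse_endpoint` and `can_connect` are hypothetical, and a successful connection only shows that the port is open, not that the service behind it speaks OTLP.

```python
import socket
from typing import Tuple
from urllib.parse import urlsplit


def parse_endpoint(endpoint: str) -> Tuple[str, int]:
    """Split an endpoint such as http://host:4317 or host:4317 into (host, port)."""
    parts = urlsplit(endpoint if "://" in endpoint else f"//{endpoint}")
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"endpoint needs an explicit host and port: {endpoint!r}")
    return parts.hostname, parts.port


def can_connect(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds."""
    host, port = parse_endpoint(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Endpoint value taken from the configuration examples in this document.
host_port = parse_endpoint("http://collector.internal:4317")
```

Calling `can_connect(os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"])` from the application host is a reasonable first step before digging into collector configuration.
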
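### Debug Mode

Enable debug logging for OpenTelemetry:

```python
# Merge into the existing LOGGING setting in the Django settings module.
LOGGING = {
    "version": 1,  # required by logging.config.dictConfig
    "disable_existing_loggers": False,
    "handlers": {
        # Without a handler, DEBUG records would be dropped.
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        "opentelemetry": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
        "ivatar.opentelemetry": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}
```
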
### Performance Considerations

- **Sampling**: Implement trace sampling for high-traffic production
- **Batch Processing**: Use appropriate batch sizes for your infrastructure
- **Resource Limits**: Monitor collector resource usage
- **Network**: Ensure low-latency connections to collectors

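One common way to implement the sampling mentioned above is ratio-based sampling keyed on the trace ID, so that every service seeing the same trace makes the same decision. The sketch below shows the idea with the standard library only. `should_sample` is a hypothetical helper, and in practice the OpenTelemetry SDK's `TraceIdRatioBased` sampler (in `opentelemetry.sdk.trace.sampling`) would be used rather than hand-written code.

```python
import random


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * 2**64)
    return (trace_id & (2**64 - 1)) < bound


# Simulate 10,000 random 128-bit trace IDs with a fixed seed for reproducibility.
rng = random.Random(0)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]

# With a 10% ratio, roughly one trace in ten is kept.
kept = sum(should_sample(trace_id, 0.1) for trace_id in trace_ids)
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full, which avoids partial traces.
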
## Security Considerations

- **Data Privacy**: Keep sensitive data out of trace attributes. Of the attributes listed above, `ivatar.avatar_email` and `http.remote_addr` carry personal data and deserve particular care
- **Network Security**: Use TLS for collector communications
- **Access Control**: Restrict access to metrics endpoints
- **Data Retention**: Configure appropriate retention policies

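One way to act on the data privacy point is to record a digest of the email address instead of the address itself. The sketch below is an illustration under assumptions, not ivatar's confirmed behavior: `email_attribute` is a hypothetical helper, and SHA-256 over the trimmed, lowercased address is one reasonable choice among several.

```python
import hashlib


def email_attribute(email: str) -> str:
    """Return a SHA-256 hex digest of the normalized address."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Differences in case and surrounding whitespace do not change the digest.
attr = email_attribute("  User@Example.org ")
```

A digest still identifies the same user consistently across traces and can be matched against guessed addresses, so it reduces exposure without making the attribute anonymous.
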
## Future Enhancements

- **Custom Dashboards**: Create Grafana dashboards for avatar metrics
- **Advanced Sampling**: Implement intelligent trace sampling
- **Log Correlation**: Correlate traces with application logs
- **Performance Profiling**: Add profiling capabilities
- **Custom Exports**: Export to additional backends (Datadog, New Relic)

## Support

For issues related to OpenTelemetry integration:

- Check application logs for configuration errors
- Verify collector connectivity
- Review Prometheus metrics for data flow
- Consult OpenTelemetry documentation for advanced configuration