mirror of https://git.linux-kernel.at/oliver/ivatar.git synced 2025-11-11 10:46:24 +00:00

Oliver Falk 41f8c3c402 🚀 Major Release: ivatar 2.0 - Performance, Security, and Instrumentation Overhaul

2025-11-03 10:18:33 +01:00

OpenTelemetry Integration for ivatar

This document describes the OpenTelemetry integration implemented in the ivatar project, providing comprehensive observability for avatar generation, file uploads, authentication, and system performance.

Overview

OpenTelemetry is integrated into ivatar to provide:

Distributed Tracing: Track requests across the entire avatar generation pipeline
Custom Metrics: Monitor avatar-specific operations and performance
Multi-Instance Support: Distinguish between production and development environments
Infrastructure Integration: Works with existing Prometheus/Grafana stack

Architecture

Components

OpenTelemetry Configuration (ivatar/opentelemetry_config.py)
- Centralized configuration management
- Environment-based setup
- Resource creation with service metadata
Custom Middleware (ivatar/opentelemetry_middleware.py)
- Request/response tracing
- Avatar-specific metrics
- Custom decorators for operation tracing
Instrumentation Integration
- Django framework instrumentation
- Database query tracing (PostgreSQL/MySQL)
- HTTP client instrumentation
- Cache instrumentation (Memcached)
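
The request/response tracing role of the custom middleware can be illustrated with a minimal sketch. It is not the code in ivatar/opentelemetry_middleware.py: the class name `TimingMiddleware`, the `records` list, and the fake request/response objects are assumptions made for illustration, and an in-memory list stands in for the OpenTelemetry exporter. Only the Django middleware calling convention (`get_response` in the constructor, the request in `__call__`) is taken as given.

```python
import time
from types import SimpleNamespace


class TimingMiddleware:
    """Django-style middleware that records one entry per request."""

    def __init__(self, get_response, records=None):
        self.get_response = get_response
        # In-memory stand-in for a metrics/trace exporter (assumption).
        self.records = records if records is not None else []

    def __call__(self, request):
        start = time.perf_counter()
        status_code = 500  # reported if the view raises before responding
        try:
            response = self.get_response(request)
            status_code = response.status_code
            return response
        finally:
            # Runs on success and on error, so every request is recorded.
            self.records.append(
                {
                    "method": request.method,
                    "path": request.path,
                    "status_code": status_code,
                    "duration_seconds": time.perf_counter() - start,
                }
            )


def fake_view(request):
    # Hypothetical view returning an object with a status_code attribute.
    return SimpleNamespace(status_code=200)


middleware = TimingMiddleware(fake_view)
middleware(SimpleNamespace(method="GET", path="/avatar/abc"))
```

Recording in a `finally` block means failed requests are counted as well, which is what an error-rate metric needs.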

Configuration

Environment Variables

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| `OTEL_EXPORT_ENABLED` | Enable OpenTelemetry data export | `false` | No |
| `OTEL_SERVICE_NAME` | Service name identifier | `ivatar` | No |
| `OTEL_ENVIRONMENT` | Environment (production/development) | `development` | No |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP collector endpoint | None | No |
| `OTEL_PROMETHEUS_ENDPOINT` | Local Prometheus server (dev only) | None | No |
| `IVATAR_VERSION` | Application version | `2.0` | No |
| `HOSTNAME` | Instance identifier | `unknown` | No |
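
As a rough sketch of how these variables and defaults could be read, the snippet below uses only the standard library. The function names `load_otel_settings` and `resource_attributes`, the rule that only the string `true` enables export, and the mapping of `HOSTNAME` to `service.instance.id` are assumptions for illustration, not ivatar's actual implementation.

```python
import os


def load_otel_settings(environ=None):
    """Read the documented variables, applying the defaults from the table."""
    env = os.environ if environ is None else environ
    return {
        # Assumption: only the literal string "true" (any case) enables export.
        "export_enabled": env.get("OTEL_EXPORT_ENABLED", "false").lower() == "true",
        "service_name": env.get("OTEL_SERVICE_NAME", "ivatar"),
        "environment": env.get("OTEL_ENVIRONMENT", "development"),
        "otlp_endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT"),
        "prometheus_endpoint": env.get("OTEL_PROMETHEUS_ENDPOINT"),
        "version": env.get("IVATAR_VERSION", "2.0"),
        "hostname": env.get("HOSTNAME", "unknown"),
    }


def resource_attributes(settings):
    """Map settings to resource attribute keys (the mapping is an assumption)."""
    return {
        "service.name": settings["service_name"],
        "service.version": settings["version"],
        "deployment.environment": settings["environment"],
        "service.instance.id": settings["hostname"],
    }


defaults = load_otel_settings({})
production = load_otel_settings(
    {
        "OTEL_EXPORT_ENABLED": "true",
        "OTEL_SERVICE_NAME": "ivatar-production",
        "OTEL_ENVIRONMENT": "production",
        "HOSTNAME": "prod-instance-01",
    }
)
```

Passing the environment in as a dictionary keeps the function easy to test without touching the real process environment.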

Multi-Instance Configuration

Production Environment

export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-production
export OTEL_ENVIRONMENT=production
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export HOSTNAME=prod-instance-01

Note: In production, metrics are exported via OTLP to your existing Prometheus server. Do not set OTEL_PROMETHEUS_ENDPOINT in production.

Development Environment

export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-development
export OTEL_ENVIRONMENT=development
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9467
export IVATAR_VERSION=2.0-dev
export HOSTNAME=dev-instance-01

Note: In development, you can optionally set OTEL_PROMETHEUS_ENDPOINT to start a local HTTP server for testing metrics.

Metrics

Custom Metrics

Avatar Operations

ivatar_requests_total: Total HTTP requests by method, status, path
ivatar_request_duration_seconds: Request duration histogram
ivatar_avatar_requests_total: Avatar requests by status, size, format
ivatar_avatar_generation_seconds: Avatar generation time histogram
ivatar_avatars_generated_total: Avatars generated by size, format, source
ivatar_avatar_cache_hits_total: Cache hits by size, format
ivatar_avatar_cache_misses_total: Cache misses by size, format
ivatar_external_avatar_requests_total: External service requests
ivatar_file_uploads_total: File uploads by content type, success
ivatar_file_upload_size_bytes: File upload size histogram

Labels/Dimensions

method: HTTP method (GET, POST, etc.)
status_code: HTTP status code
path: Request path
size: Avatar size (80, 128, 256, etc.)
format: Image format (png, jpg, gif, etc.)
source: Avatar source (uploaded, generated, external)
service: External service name (gravatar, bluesky)
content_type: File MIME type
success: Operation success (true/false)

Example Queries

Avatar Generation Rate

rate(ivatar_avatars_generated_total[5m])

Cache Hit Ratio

rate(ivatar_avatar_cache_hits_total[5m]) /
(rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))

95th Percentile Avatar Generation Time

histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m]))
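
To make the result of this query easier to interpret, here is a simplified Python sketch of how a quantile is estimated from cumulative histogram buckets by linear interpolation. It follows the general approach of Prometheus' `histogram_quantile` but omits edge cases; the bucket bounds and counts are invented sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since no upper limit is known.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Invented sample: 100 observations in cumulative buckets (seconds).
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample_buckets)
```

Because the estimate is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.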

File Upload Success Rate

rate(ivatar_file_uploads_total{success="true"}[5m]) /
rate(ivatar_file_uploads_total[5m])

Tracing

Trace Points

Request Lifecycle

HTTP request processing
Avatar generation pipeline
File upload and processing
Authentication flows
External API calls

Custom Spans

avatar.generate_png: PNG image generation
avatar.gravatar_proxy: Gravatar service proxy
file_upload.process: File upload processing
auth.login: User authentication
auth.logout: User logout
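
A decorator for custom spans such as these might look roughly like the sketch below. It deliberately avoids the OpenTelemetry SDK and appends plain dictionaries to an in-memory list instead; `trace_operation`, `FINISHED_SPANS`, and `generate_png` are hypothetical names, and the returned payload is a placeholder rather than a real image.

```python
import functools
import time

# In-memory stand-in for a span exporter (assumption for this sketch).
FINISHED_SPANS = []


def trace_operation(name, **attributes):
    """Record a span-like dictionary around each call of the wrapped function."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span = {"name": name, "attributes": dict(attributes), "status": "ok"}
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Mark the span as failed, then let the error propagate.
                span["status"] = "error"
                span["attributes"]["exception.type"] = type(exc).__name__
                raise
            finally:
                span["duration_seconds"] = time.perf_counter() - start
                FINISHED_SPANS.append(span)

        return wrapper

    return decorator


@trace_operation("avatar.generate_png", **{"ivatar.avatar_format": "png"})
def generate_png(size):
    # Placeholder payload, not a real PNG encoder.
    return b"\x89PNG" + bytes(size)


generate_png(4)
```

With the real SDK, the body of `wrapper` would open a span on a tracer instead of building a dictionary; the control flow around the wrapped call stays the same.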

Span Attributes

HTTP Attributes

http.method: HTTP method
http.url: Full request URL
http.status_code: Response status code
http.user_agent: Client user agent
http.remote_addr: Client IP address

Avatar Attributes

ivatar.request_type: Request type (avatar, stats, etc.)
ivatar.avatar_size: Requested avatar size
ivatar.avatar_format: Requested format
ivatar.avatar_email: Email address (if applicable)

File Attributes

file.name: Uploaded file name
file.size: File size in bytes
file.content_type: MIME type

Infrastructure Requirements

Option A: Extend Existing Stack (Recommended)

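The existing monitoring stack can be extended to support OpenTelemetry. The Alloy snippet below outlines the intended pipeline (OTLP receiver, batch processor, exporters) in a YAML-like form; Alloy itself is configured with its own component-block syntax, so treat the snippet as a blueprint to translate rather than a file to paste in verbatim.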

Alloy Configuration

# Add to existing Alloy configuration
otelcol.receiver.otlp:
  grpc:
    endpoint: 0.0.0.0:4317
  http:
    endpoint: 0.0.0.0:4318

otelcol.processor.batch:
  timeout: 1s
  send_batch_size: 1024

otelcol.exporter.prometheus:
  endpoint: "0.0.0.0:9464"

otelcol.exporter.jaeger:
  endpoint: "jaeger-collector:14250"

otelcol.pipeline.traces:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.jaeger]

otelcol.pipeline.metrics:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.prometheus]

Prometheus Configuration

scrape_configs:
  - job_name: "ivatar-opentelemetry"
    static_configs:
      - targets: ["ivatar-prod:9464", "ivatar-dev:9464"]
    scrape_interval: 15s
    metrics_path: /metrics

Option B: Dedicated OpenTelemetry Collector

For full OpenTelemetry features, deploy a dedicated collector:

Collector Configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        from_attribute: deployment.environment
        action: insert

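exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  # Recent Collector releases no longer ship the dedicated jaeger exporter;
  # Jaeger accepts OTLP directly. Port 4317 and plaintext transport are
  # assumptions, adjust them to your Jaeger deployment.
  otlp/jaeger:
    endpoint: "jaeger-collector:4317"
    tls:
      insecure: true
  # The debug exporter replaces the former logging exporter.
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus, debug]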

Deployment

Development Setup

Install Dependencies
```
pip install -r requirements.txt
```

Configure Environment

export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-development
export OTEL_ENVIRONMENT=development
export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9467

Start Development Server
```
./manage.py runserver 0:8080
```
Verify Metrics
```
curl http://localhost:9467/metrics
```

Production Deployment

Update Container Images
- Add OpenTelemetry dependencies to requirements.txt
- Update container build process
Configure Environment Variables
- Set production-specific OpenTelemetry variables
- Configure collector endpoints
Update Monitoring Stack
- Extend Alloy configuration
- Update Prometheus scrape configs
- Configure Grafana dashboards
Verify Deployment
- Check metrics endpoint accessibility
- Verify trace data flow
- Monitor dashboard updates
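
The metrics endpoint check can be partly automated. The sketch below parses text in the Prometheus exposition format and reports which expected metric names are absent. The function names and sample payload are illustrative assumptions; exporters may also rename metrics or add suffixes, so the expected names should be adjusted to what your endpoint actually serves.

```python
def metric_names(exposition_text):
    """Return the set of sample names found in Prometheus exposition text."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is "<name>{labels} <value>" or "<name> <value>".
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition_text, expected):
    """List the expected metric names that do not appear in the text."""
    return sorted(set(expected) - metric_names(exposition_text))


# Invented sample payload for illustration.
SAMPLE = """\
# HELP ivatar_requests_total Total HTTP requests
# TYPE ivatar_requests_total counter
ivatar_requests_total{method="GET",status_code="200"} 42
ivatar_avatar_cache_hits_total 7
"""

missing = missing_metrics(
    SAMPLE, ["ivatar_requests_total", "ivatar_avatar_cache_misses_total"]
)
```

In a real check, the text would come from an HTTP GET against the metrics endpoint rather than from a constant.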

Monitoring and Alerting

Key Metrics to Monitor

Performance

Request duration percentiles (p50, p95, p99)
Avatar generation time
Cache hit ratio
File upload success rate

Business Metrics

Avatar requests per minute
Popular avatar sizes
External service usage
User authentication success rate

Error Rates

HTTP error rates by endpoint
File upload failures
External service failures
Authentication failures

Example Alerts

High Error Rate

alert: HighErrorRate
expr: rate(ivatar_requests_total{status_code=~"5.."}[5m]) > 0.1
for: 2m
labels:
  severity: warning
annotations:
  summary: "High error rate detected"
  description: "Error rate is {{ $value }} errors per second"

Slow Avatar Generation

alert: SlowAvatarGeneration
expr: histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m])) > 2
for: 5m
labels:
  severity: warning
annotations:
  summary: "Slow avatar generation"
  description: "95th percentile avatar generation time is {{ $value }}s"

Low Cache Hit Ratio

alert: LowCacheHitRatio
expr: (rate(ivatar_avatar_cache_hits_total[5m]) / (rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))) < 0.8
for: 10m
labels:
  severity: warning
annotations:
  summary: "Low cache hit ratio"
  description: "Cache hit ratio is {{ $value }}"
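
The arithmetic behind the cache hit ratio alert can be checked with a small Python sketch. The function names are hypothetical and the `for: 10m` hold period is not modeled. With no traffic the ratio is 0/0, which yields NaN; a NaN comparison is false, so the condition does not hold in that case.

```python
import math


def cache_hit_ratio(hit_rate, miss_rate):
    """hits / (hits + misses), or NaN when there is no traffic at all."""
    total = hit_rate + miss_rate
    if total == 0:
        return math.nan
    return hit_rate / total


def low_cache_hit_ratio(hit_rate, miss_rate, threshold=0.8):
    """True when the ratio is below the threshold; NaN never compares below."""
    return cache_hit_ratio(hit_rate, miss_rate) < threshold


# Invented per-second rates: 12 hits/s and 4 misses/s give a ratio of 0.75.
example_ratio = cache_hit_ratio(12.0, 4.0)
```

In this example, `low_cache_hit_ratio(12.0, 4.0)` holds because 0.75 is below the 0.8 threshold, while 9 hits/s against 1 miss/s (ratio 0.9) would not trigger it.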

Troubleshooting

Common Issues

OpenTelemetry Not Enabled

Check that the OTEL_EXPORT_ENABLED environment variable is set to true
Verify OpenTelemetry packages are installed
Check Django logs for configuration errors

Metrics Not Appearing

Verify Prometheus endpoint is accessible
Check collector configuration
Ensure metrics are being generated

Traces Not Showing

Verify OTLP endpoint configuration
Check collector connectivity
Ensure tracing is enabled in configuration

High Memory Usage

Adjust batch processor settings
Reduce trace sampling rate
Monitor collector resource usage

Debug Mode

Enable debug logging for OpenTelemetry:

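LOGGING = {
    "version": 1,  # required by logging.config.dictConfig
    "disable_existing_loggers": False,
    "handlers": {
        # Without a handler, DEBUG records would not be emitted anywhere.
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        "opentelemetry": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
        "ivatar.opentelemetry": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}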

Performance Considerations

Sampling: Implement trace sampling for high-traffic production
Batch Processing: Use appropriate batch sizes for your infrastructure
Resource Limits: Monitor collector resource usage
Network: Ensure low-latency connections to collectors
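
One common way to implement trace sampling is a deterministic decision based on the trace ID, so that every service seeing the same trace reaches the same verdict. The sketch below illustrates that idea in the spirit of OpenTelemetry's trace-ID-ratio sampler; `should_sample` is a hypothetical helper, not an SDK function.

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the lower 64 bits of a trace ID


def should_sample(trace_id, ratio):
    """Keep a trace when its lower 64 bits fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


# With uniformly distributed trace IDs, about 10% of traces pass at ratio 0.1.
keep_low = should_sample(0x0000000000000005, 0.1)
keep_high = should_sample(TRACE_ID_LIMIT, 0.1)
```

Because the decision depends only on the trace ID and the ratio, no coordination between instances is needed to keep traces complete.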

Security Considerations

Data Privacy: Ensure no sensitive data in trace attributes
Network Security: Use TLS for collector communications
Access Control: Restrict access to metrics endpoints
Data Retention: Configure appropriate retention policies
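
For the data privacy point, one option is to record a digest of the email address in span attributes rather than the address itself. The sketch below assumes a hypothetical helper, `email_span_attribute`, and uses SHA-256 over the trimmed, lowercased address. Note that hashing is pseudonymization, not anonymization: digests of guessable addresses can still be matched by brute force.

```python
import hashlib


def email_span_attribute(email):
    """Return a SHA-256 hex digest of the normalized email address."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Different spellings of the same address map to the same digest.
digest_a = email_span_attribute("User@Example.org ")
digest_b = email_span_attribute("user@example.org")
```

Normalizing before hashing keeps the digest stable across spelling variants, so traces for the same user can still be correlated.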

Future Enhancements

Custom Dashboards: Create Grafana dashboards for avatar metrics
Advanced Sampling: Implement intelligent trace sampling
Log Correlation: Correlate traces with application logs
Performance Profiling: Add profiling capabilities
Custom Exports: Export to additional backends (Datadog, New Relic)

Support

For issues related to OpenTelemetry integration:

Check application logs for configuration errors
Verify collector connectivity
Review Prometheus metrics for data flow
Consult OpenTelemetry documentation for advanced configuration
