OpenTelemetry Integration for ivatar

This document describes the OpenTelemetry integration implemented in the ivatar project, providing comprehensive observability for avatar generation, file uploads, authentication, and system performance.

Overview

OpenTelemetry is integrated into ivatar to provide:

  • Distributed Tracing: Track requests across the entire avatar generation pipeline
  • Custom Metrics: Monitor avatar-specific operations and performance
  • Multi-Instance Support: Distinguish between production and development environments
  • Infrastructure Integration: Works with existing Prometheus/Grafana stack

Architecture

Components

  1. OpenTelemetry Configuration (ivatar/opentelemetry_config.py)

    • Centralized configuration management
    • Environment-based setup
    • Resource creation with service metadata
  2. Custom Middleware (ivatar/opentelemetry_middleware.py)

    • Request/response tracing
    • Avatar-specific metrics
    • Custom decorators for operation tracing
  3. Instrumentation Integration

    • Django framework instrumentation
    • Database query tracing (PostgreSQL/MySQL)
    • HTTP client instrumentation
    • Cache instrumentation (Memcached)
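As a rough illustration of what the configuration component does, the sketch below builds resource metadata and installs a tracer provider that batches spans to an OTLP/gRPC collector. It assumes the `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-grpc` packages are installed. The function names `resource_attributes` and `configure_tracing` are hypothetical and are not taken from ivatar/opentelemetry_config.py.

```python
def resource_attributes(service_name, environment, version, instance):
    """Describe this service instance using common resource attribute keys."""
    return {
        "service.name": service_name,
        "service.version": version,
        "deployment.environment": environment,
        "service.instance.id": instance,
    }


def configure_tracing(endpoint, attributes):
    """Install a global tracer provider that exports spans to `endpoint`."""
    # Imported lazily so this module still loads when the SDK is not installed.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider(resource=Resource.create(attributes))
    # Assumption: plain http:// endpoints are reached without TLS.
    exporter = OTLPSpanExporter(
        endpoint=endpoint, insecure=endpoint.startswith("http://")
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    return provider
```

A call such as `configure_tracing("http://collector.internal:4317", resource_attributes("ivatar", "development", "2.0", "unknown"))` would then run once at startup.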

Configuration

Environment Variables

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| OTEL_EXPORT_ENABLED | Enable OpenTelemetry data export | false | No |
| OTEL_SERVICE_NAME | Service name identifier | ivatar | No |
| OTEL_ENVIRONMENT | Environment (production/development) | development | No |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint | None | No |
| OTEL_PROMETHEUS_ENDPOINT | Local Prometheus server (dev only) | None | No |
| IVATAR_VERSION | Application version | 2.0 | No |
| HOSTNAME | Instance identifier | unknown | No |
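One way to read these variables with their documented defaults is sketched below. The `OtelSettings` class and `load_otel_settings` function are hypothetical names for illustration, and treating only the string "true" as enabling export is an assumption, not confirmed behaviour of ivatar.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OtelSettings:
    export_enabled: bool
    service_name: str
    environment: str
    otlp_endpoint: Optional[str]
    prometheus_endpoint: Optional[str]
    version: str
    hostname: str


def load_otel_settings(env: Mapping[str, str] = os.environ) -> OtelSettings:
    """Read the documented variables, falling back to the documented defaults."""
    return OtelSettings(
        # Assumption: only the literal string "true" (any case) enables export.
        export_enabled=env.get("OTEL_EXPORT_ENABLED", "false").lower() == "true",
        service_name=env.get("OTEL_SERVICE_NAME", "ivatar"),
        environment=env.get("OTEL_ENVIRONMENT", "development"),
        # Empty strings are treated the same as unset variables.
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or None,
        prometheus_endpoint=env.get("OTEL_PROMETHEUS_ENDPOINT") or None,
        version=env.get("IVATAR_VERSION", "2.0"),
        hostname=env.get("HOSTNAME", "unknown"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to exercise in tests.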

Multi-Instance Configuration

Production Environment

export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-production
export OTEL_ENVIRONMENT=production
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export HOSTNAME=prod-instance-01

Note: In production, the application exports metrics via OTLP to the collector, which makes them available to your existing Prometheus server. Do not set OTEL_PROMETHEUS_ENDPOINT in production.

Development Environment

export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-development
export OTEL_ENVIRONMENT=development
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9467
export IVATAR_VERSION=2.0-dev
export HOSTNAME=dev-instance-01

Note: In development, you can optionally set OTEL_PROMETHEUS_ENDPOINT to start a local HTTP server for testing metrics.
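The two notes above can be turned into a small startup check. The sketch below is illustrative only: `check_otel_environment` is a hypothetical helper, and the second rule (export enabled with no endpoint configured) is an assumption about what counts as a misconfiguration.

```python
def check_otel_environment(env):
    """Return a list of warnings for a mapping of environment variables."""
    warnings = []
    environment = env.get("OTEL_ENVIRONMENT", "development")
    export_enabled = env.get("OTEL_EXPORT_ENABLED", "false").lower() == "true"

    # Documented rule: the local Prometheus server is for development only.
    if environment == "production" and env.get("OTEL_PROMETHEUS_ENDPOINT"):
        warnings.append("OTEL_PROMETHEUS_ENDPOINT should not be set in production")

    # Assumed rule: enabling export without any endpoint exports nothing.
    has_endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or env.get(
        "OTEL_PROMETHEUS_ENDPOINT"
    )
    if export_enabled and not has_endpoint:
        warnings.append("export is enabled but no endpoint is configured")
    return warnings
```

Calling it with `os.environ` at startup and logging each warning is one possible use.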

Metrics

Custom Metrics

Avatar Operations

  • ivatar_requests_total: Total HTTP requests by method, status, path
  • ivatar_request_duration_seconds: Request duration histogram
  • ivatar_avatar_requests_total: Avatar requests by status, size, format
  • ivatar_avatar_generation_seconds: Avatar generation time histogram
  • ivatar_avatars_generated_total: Avatars generated by size, format, source
  • ivatar_avatar_cache_hits_total: Cache hits by size, format
  • ivatar_avatar_cache_misses_total: Cache misses by size, format
  • ivatar_external_avatar_requests_total: External service requests
  • ivatar_file_uploads_total: File uploads by content type, success
  • ivatar_file_upload_size_bytes: File upload size histogram

Labels/Dimensions

  • method: HTTP method (GET, POST, etc.)
  • status_code: HTTP status code
  • path: Request path
  • size: Avatar size (80, 128, 256, etc.)
  • format: Image format (png, jpg, gif, etc.)
  • source: Avatar source (uploaded, generated, external)
  • service: External service name (gravatar, bluesky)
  • content_type: File MIME type
  • success: Operation success (true/false)
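To show how a metric and its labels fit together, here is a minimal sketch that increments the generated-avatars counter through the OpenTelemetry metrics API. The helper names are hypothetical, the meter name "ivatar" is an assumption, and how the final Prometheus metric name is derived depends on the exporter in use. If the API package is missing, recording becomes a no-op.

```python
try:
    from opentelemetry import metrics
except ImportError:  # API package missing: recording becomes a no-op
    metrics = None

_counter = None


def avatar_labels(size, image_format, source):
    """Build the label set; values are stringified to keep label types uniform."""
    return {"size": str(size), "format": image_format.lower(), "source": source}


def record_avatar_generated(size, image_format, source):
    """Increment the generated-avatars counter and return the labels used."""
    global _counter
    labels = avatar_labels(size, image_format, source)
    if metrics is not None:
        if _counter is None:
            # Create the instrument once and reuse it for every call.
            meter = metrics.get_meter("ivatar")
            _counter = meter.create_counter(
                "ivatar_avatars_generated_total",
                description="Avatars generated by size, format, source",
            )
        _counter.add(1, labels)
    return labels
```

Keeping label values to a small, fixed set (a handful of sizes and formats) avoids creating an unbounded number of time series.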

Example Queries

Avatar Generation Rate

rate(ivatar_avatars_generated_total[5m])

Cache Hit Ratio

rate(ivatar_avatar_cache_hits_total[5m]) /
(rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))
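The arithmetic behind this query is simply hits divided by total lookups. A small sketch, with a hypothetical function name, makes the empty-window case explicit:

```python
def cache_hit_ratio(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None  # the PromQL expression evaluates 0 / 0 as NaN here
    return hits / total
```

For example, 4.5 hits per second against 0.5 misses per second gives a ratio of 0.9.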

95th Percentile Avatar Generation Time

histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m]))
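Because the query works on histogram buckets, the result is an estimate interpolated within a bucket, not an exact measurement. The sketch below illustrates the general idea of linear interpolation over cumulative buckets; it is a simplified illustration with a hypothetical function name and made-up bucket values, not Prometheus's actual implementation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Illustrative bucket counts: 50 samples <= 0.1s, 90 <= 0.5s, 98 <= 1s, 100 total.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, example)  # lands inside the (0.5, 1.0] bucket
```

The practical consequence is that bucket boundaries limit precision: a p95 can never be reported more finely than the bucket it falls into allows.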

File Upload Success Rate

sum(rate(ivatar_file_uploads_total{success="true"}[5m])) /
sum(rate(ivatar_file_uploads_total[5m]))

Tracing

Trace Points

Request Lifecycle

  • HTTP request processing
  • Avatar generation pipeline
  • File upload and processing
  • Authentication flows
  • External API calls

Custom Spans

  • avatar.generate_png: PNG image generation
  • avatar.gravatar_proxy: Gravatar service proxy
  • file_upload.process: File upload processing
  • auth.login: User authentication
  • auth.logout: User logout
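Spans like these are typically created by wrapping the operation in a decorator. The following is a minimal sketch using the OpenTelemetry tracing API; `trace_operation` and the placeholder `generate_png` are hypothetical and do not reproduce the decorators in ivatar/opentelemetry_middleware.py. If the API package is not installed, the decorator simply calls the function.

```python
import functools

try:
    from opentelemetry import trace
except ImportError:  # API package missing: tracing becomes a no-op
    trace = None


def trace_operation(span_name):
    """Run the decorated function inside a span named `span_name`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if trace is None:
                return func(*args, **kwargs)
            tracer = trace.get_tracer(__name__)
            # By default the context manager records an escaping exception
            # on the span and marks the span as failed before re-raising.
            with tracer.start_as_current_span(span_name) as span:
                span.set_attribute("code.function", func.__name__)
                return func(*args, **kwargs)

        return wrapper

    return decorator


@trace_operation("avatar.generate_png")
def generate_png(size):
    # Placeholder body standing in for real image generation.
    return f"png:{size}x{size}"
```

Because the span is opened with `start_as_current_span`, any spans created by instrumented libraries during the call become its children.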

Span Attributes

HTTP Attributes

  • http.method: HTTP method
  • http.url: Full request URL
  • http.status_code: Response status code
  • http.user_agent: Client user agent
  • http.remote_addr: Client IP address

Avatar Attributes

  • ivatar.request_type: Request type (avatar, stats, etc.)
  • ivatar.avatar_size: Requested avatar size
  • ivatar.avatar_format: Requested format
  • ivatar.avatar_email: Email address (if applicable)

File Attributes

  • file.name: Uploaded file name
  • file.size: File size in bytes
  • file.content_type: MIME type

Infrastructure Requirements

Option A: Extend the Existing Stack

The existing monitoring stack can be extended to support OpenTelemetry:

Alloy Configuration

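# Add to existing Alloy configuration (schematic outline; Alloy's native
# syntax uses labelled blocks such as otelcol.receiver.otlp "default" { ... })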
otelcol.receiver.otlp:
  grpc:
    endpoint: 0.0.0.0:4317
  http:
    endpoint: 0.0.0.0:4318

otelcol.processor.batch:
  timeout: 1s
  send_batch_size: 1024

otelcol.exporter.prometheus:
  endpoint: "0.0.0.0:9464"

otelcol.exporter.jaeger:
  endpoint: "jaeger-collector:14250"

otelcol.pipeline.traces:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.jaeger]

otelcol.pipeline.metrics:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.prometheus]

Prometheus Configuration

scrape_configs:
  - job_name: "ivatar-opentelemetry"
    static_configs:
      - targets: ["ivatar-prod:9464", "ivatar-dev:9464"]
    scrape_interval: 15s
    metrics_path: /metrics

Option B: Dedicated OpenTelemetry Collector

For full OpenTelemetry features, deploy a dedicated collector:

Collector Configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        from_attribute: deployment.environment
        action: insert

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  jaeger:
    endpoint: "jaeger-collector:14250"
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus, logging]

Deployment

Development Setup

  1. Install Dependencies

    pip install -r requirements.txt
    
  2. Configure Environment

    export OTEL_EXPORT_ENABLED=true
    export OTEL_SERVICE_NAME=ivatar-development
    export OTEL_ENVIRONMENT=development
    
  3. Start Development Server

    ./manage.py runserver 0:8080
    
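  4. Verify Metrics (requires OTEL_PROMETHEUS_ENDPOINT to be set, e.g. 0.0.0.0:9467 as in the development environment above)

    curl http://localhost:9467/metrics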
    

Production Deployment

  1. Update Container Images

    • Add OpenTelemetry dependencies to requirements.txt
    • Update container build process
  2. Configure Environment Variables

    • Set production-specific OpenTelemetry variables
    • Configure collector endpoints
  3. Update Monitoring Stack

    • Extend Alloy configuration
    • Update Prometheus scrape configs
    • Configure Grafana dashboards
  4. Verify Deployment

    • Check metrics endpoint accessibility
    • Verify trace data flow
    • Monitor dashboard updates

Monitoring and Alerting

Key Metrics to Monitor

Performance

  • Request duration percentiles (p50, p95, p99)
  • Avatar generation time
  • Cache hit ratio
  • File upload success rate

Business Metrics

  • Avatar requests per minute
  • Popular avatar sizes
  • External service usage
  • User authentication success rate

Error Rates

  • HTTP error rates by endpoint
  • File upload failures
  • External service failures
  • Authentication failures

Example Alerts

High Error Rate

alert: HighErrorRate
expr: rate(ivatar_requests_total{status_code=~"5.."}[5m]) > 0.1
for: 2m
labels:
  severity: warning
annotations:
  summary: "High error rate detected"
  description: "Error rate is {{ $value }} errors per second"

Slow Avatar Generation

alert: SlowAvatarGeneration
expr: histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m])) > 2
for: 5m
labels:
  severity: warning
annotations:
  summary: "Slow avatar generation"
  description: "95th percentile avatar generation time is {{ $value }}s"

Low Cache Hit Ratio

alert: LowCacheHitRatio
expr: (rate(ivatar_avatar_cache_hits_total[5m]) / (rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))) < 0.8
for: 10m
labels:
  severity: warning
annotations:
  summary: "Low cache hit ratio"
  description: "Cache hit ratio is {{ $value }}"

Troubleshooting

Common Issues

OpenTelemetry Not Enabled

  • Check the OTEL_EXPORT_ENABLED environment variable
  • Verify OpenTelemetry packages are installed
  • Check Django logs for configuration errors

Metrics Not Appearing

  • Verify Prometheus endpoint is accessible
  • Check collector configuration
  • Ensure metrics are being generated
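To confirm that metrics are actually being generated, it can help to list the ivatar metric names present in a scrape. The sketch below parses Prometheus text exposition output; the function name is hypothetical and the sample text is invented for illustration.

```python
def ivatar_metric_names(exposition_text, prefix="ivatar_"):
    """Return the sorted metric names in a scrape that start with `prefix`."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: name{label="v"} value
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


# Invented sample of what a metrics endpoint might return.
sample = """\
# HELP ivatar_requests_total Total HTTP requests
# TYPE ivatar_requests_total counter
ivatar_requests_total{method="GET",status_code="200"} 12
ivatar_avatar_cache_hits_total{size="80",format="png"} 7
python_gc_objects_collected_total{generation="0"} 100
"""
found = ivatar_metric_names(sample)
```

In practice the text would come from the metrics endpoint, for example via `urllib.request.urlopen(url).read().decode()`. An empty result suggests that no avatar requests have been served yet or that the middleware is not active.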

Traces Not Showing

  • Verify OTLP endpoint configuration
  • Check collector connectivity
  • Ensure tracing is enabled in configuration

High Memory Usage

  • Adjust batch processor settings
  • Reduce trace sampling rate
  • Monitor collector resource usage

Debug Mode

Enable debug logging for OpenTelemetry:

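# Django settings: send OpenTelemetry debug output to the console
LOGGING = {
    "version": 1,  # required by logging.config.dictConfig
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        "opentelemetry": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
        "ivatar.opentelemetry": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}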

Performance Considerations

  • Sampling: Implement trace sampling for high-traffic production deployments
  • Batch Processing: Use appropriate batch sizes for your infrastructure
  • Resource Limits: Monitor collector resource usage
  • Network: Ensure low-latency connections to collectors
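The idea behind ratio-based trace sampling can be sketched in plain Python. This is an illustration of the concept with a hypothetical function name, not the SDK's code; in a real deployment the OpenTelemetry SDK's built-in samplers (for example `TraceIdRatioBased` wrapped in `ParentBased`) would normally be used.

```python
def should_sample(trace_id, ratio):
    """Decide deterministically whether to keep a trace with this 128-bit ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Compare the low 64 bits of the ID against a threshold proportional to ratio.
    threshold = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < threshold
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, so traces are kept or dropped as a whole.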

Security Considerations

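  • Data Privacy: Ensure no sensitive data in trace attributes; the ivatar.avatar_email attribute in particular carries an email address and should be hashed or omitted where privacy requirements apply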
  • Network Security: Use TLS for collector communications
  • Access Control: Restrict access to metrics endpoints
  • Data Retention: Configure appropriate retention policies
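For the data privacy point, one option is to record a digest of an email address instead of the address itself. The sketch below is an assumption about how this could be done, with a hypothetical helper name; note that an unsalted hash is still a stable identifier and can be linked across traces.

```python
import hashlib


def email_digest(email):
    """Return a SHA-256 hex digest of the normalised address."""
    normalised = email.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
```

The digest could then be stored in a span attribute in place of the raw address, so traces remain correlatable per user without exposing the email in the tracing backend.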

Future Enhancements

  • Custom Dashboards: Create Grafana dashboards for avatar metrics
  • Advanced Sampling: Implement intelligent trace sampling
  • Log Correlation: Correlate traces with application logs
  • Performance Profiling: Add profiling capabilities
  • Custom Exports: Export to additional backends (Datadog, New Relic)

Support

For issues related to OpenTelemetry integration:

  • Check application logs for configuration errors
  • Verify collector connectivity
  • Review Prometheus metrics for data flow
  • Consult OpenTelemetry documentation for advanced configuration