# OpenTelemetry Integration for ivatar
This document describes the OpenTelemetry integration implemented in the ivatar project, providing comprehensive observability for avatar generation, file uploads, authentication, and system performance.
## Overview
OpenTelemetry is integrated into ivatar to provide:
- **Distributed Tracing**: Track requests across the entire avatar generation pipeline
- **Custom Metrics**: Monitor avatar-specific operations and performance
- **Multi-Instance Support**: Distinguish between production and development environments
- **Infrastructure Integration**: Works with existing Prometheus/Grafana stack
## Architecture
### Components
1. **OpenTelemetry Configuration** (`ivatar/opentelemetry_config.py`)
   - Centralized configuration management
   - Environment-based setup
   - Resource creation with service metadata
2. **Custom Middleware** (`ivatar/opentelemetry_middleware.py`)
   - Request/response tracing
   - Avatar-specific metrics
   - Custom decorators for operation tracing (see the sketch below)
3. **Instrumentation Integration**
   - Django framework instrumentation
   - Database query tracing (PostgreSQL/MySQL)
   - HTTP client instrumentation
   - Cache instrumentation (Memcached)
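To illustrate the decorator-based tracing, here is a minimal sketch using the standard OpenTelemetry API. The `trace_operation` decorator and the wrapped function are hypothetical stand-ins; the actual helpers in `ivatar/opentelemetry_middleware.py` may be named and structured differently.
```python
# Hypothetical sketch of an operation-tracing decorator (not ivatar's exact code).
from functools import wraps

from opentelemetry import trace

tracer = trace.get_tracer("ivatar")


def trace_operation(span_name):
    """Wrap a function in a custom span with the given name."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(span_name) as span:
                span.set_attribute("ivatar.operation", func.__name__)
                return func(*args, **kwargs)

        return wrapper

    return decorator


@trace_operation("avatar.generate_png")
def generate_png(email_hash, size):
    ...  # avatar generation logic goes here
```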
## Configuration
### Environment Variables
| Variable | Description | Default | Required |
| ----------------------------- | ------------------------------------ | ------------- | -------- |
| `OTEL_EXPORT_ENABLED` | Enable OpenTelemetry data export | `false` | No |
| `OTEL_SERVICE_NAME` | Service name identifier | `ivatar` | No |
| `OTEL_ENVIRONMENT` | Environment (production/development) | `development` | No |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP collector endpoint | None | No |
| `OTEL_PROMETHEUS_ENDPOINT` | Local Prometheus server (dev only) | None | No |
| `IVATAR_VERSION` | Application version | `2.0` | No |
| `HOSTNAME` | Instance identifier | `unknown` | No |
### Multi-Instance Configuration
#### Production Environment
```bash
export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-production
export OTEL_ENVIRONMENT=production
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export HOSTNAME=prod-instance-01
```
**Note**: In production, metrics are exported via OTLP to your existing Prometheus server. Do not set `OTEL_PROMETHEUS_ENDPOINT` in production.
#### Development Environment
```bash
export OTEL_EXPORT_ENABLED=true
export OTEL_SERVICE_NAME=ivatar-development
export OTEL_ENVIRONMENT=development
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4317
export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9467
export IVATAR_VERSION=2.0-dev
export HOSTNAME=dev-instance-01
```
**Note**: In development, you can optionally set `OTEL_PROMETHEUS_ENDPOINT` to start a local HTTP server for testing metrics.
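For reference, the following is a minimal sketch of the environment-driven setup described above, using the standard OpenTelemetry SDK. The real logic lives in `ivatar/opentelemetry_config.py` and may differ in detail.
```python
# Minimal sketch only; see ivatar/opentelemetry_config.py for the actual implementation.
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

if os.environ.get("OTEL_EXPORT_ENABLED", "false").lower() == "true":
    resource = Resource.create(
        {
            "service.name": os.environ.get("OTEL_SERVICE_NAME", "ivatar"),
            "deployment.environment": os.environ.get("OTEL_ENVIRONMENT", "development"),
            "service.version": os.environ.get("IVATAR_VERSION", "2.0"),
            "service.instance.id": os.environ.get("HOSTNAME", "unknown"),
        }
    )
    provider = TracerProvider(resource=resource)
    endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if endpoint:
        # Export spans in batches to the configured OTLP collector endpoint.
        provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint)))
    trace.set_tracer_provider(provider)
```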
## Metrics
### Custom Metrics
#### Avatar Operations
- `ivatar_requests_total`: Total HTTP requests by method, status, path
- `ivatar_request_duration_seconds`: Request duration histogram
- `ivatar_avatar_requests_total`: Avatar requests by status, size, format
- `ivatar_avatar_generation_seconds`: Avatar generation time histogram
- `ivatar_avatars_generated_total`: Avatars generated by size, format, source
- `ivatar_avatar_cache_hits_total`: Cache hits by size, format
- `ivatar_avatar_cache_misses_total`: Cache misses by size, format
- `ivatar_external_avatar_requests_total`: External service requests
- `ivatar_file_uploads_total`: File uploads by content type, success
- `ivatar_file_upload_size_bytes`: File upload size histogram
#### Labels/Dimensions
- `method`: HTTP method (GET, POST, etc.)
- `status_code`: HTTP status code
- `path`: Request path
- `size`: Avatar size (80, 128, 256, etc.)
- `format`: Image format (png, jpg, gif, etc.)
- `source`: Avatar source (uploaded, generated, external)
- `service`: External service name (gravatar, bluesky)
- `content_type`: File MIME type
- `success`: Operation success (true/false)
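As a rough illustration (not the exact instrument definitions in `ivatar/opentelemetry_middleware.py`), a counter such as `ivatar_avatars_generated_total` can be recorded with the OpenTelemetry metrics API like this:
```python
# Illustrative sketch only; ivatar's real instruments may be created and named differently.
from opentelemetry import metrics

meter = metrics.get_meter("ivatar")
avatars_generated = meter.create_counter(
    "ivatar_avatars_generated",  # Prometheus exporters typically append the "_total" suffix
    description="Avatars generated by size, format, source",
)

# Somewhere in the avatar generation path:
avatars_generated.add(1, {"size": "80", "format": "png", "source": "generated"})
```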
### Example Queries
#### Avatar Generation Rate
```promql
rate(ivatar_avatars_generated_total[5m])
```
#### Cache Hit Ratio
```promql
rate(ivatar_avatar_cache_hits_total[5m]) /
(rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))
```
#### 95th Percentile Avatar Generation Time
```promql
histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m]))
```
#### File Upload Success Rate
```promql
rate(ivatar_file_uploads_total{success="true"}[5m]) /
rate(ivatar_file_uploads_total[5m])
```
## Tracing
### Trace Points
#### Request Lifecycle
- HTTP request processing
- Avatar generation pipeline
- File upload and processing
- Authentication flows
- External API calls
#### Custom Spans
- `avatar.generate_png`: PNG image generation
- `avatar.gravatar_proxy`: Gravatar service proxy
- `file_upload.process`: File upload processing
- `auth.login`: User authentication
- `auth.logout`: User logout
### Span Attributes
#### HTTP Attributes
- `http.method`: HTTP method
- `http.url`: Full request URL
- `http.status_code`: Response status code
- `http.user_agent`: Client user agent
- `http.remote_addr`: Client IP address
#### Avatar Attributes
- `ivatar.request_type`: Request type (avatar, stats, etc.)
- `ivatar.avatar_size`: Requested avatar size
- `ivatar.avatar_format`: Requested format
- `ivatar.avatar_email`: Email address (if applicable)
#### File Attributes
- `file.name`: Uploaded file name
- `file.size`: File size in bytes
- `file.content_type`: MIME type
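Custom attributes such as these are attached to the currently active span; the snippet below is illustrative rather than ivatar's exact code.
```python
# Illustrative: attach avatar-specific attributes to whatever span is currently active.
from opentelemetry import trace

span = trace.get_current_span()
span.set_attribute("ivatar.avatar_size", 80)
span.set_attribute("ivatar.avatar_format", "png")
```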
## Infrastructure Requirements
### Option A: Extend Existing Stack (Recommended)
The existing monitoring stack can be extended to support OpenTelemetry:
#### Alloy Configuration
```yaml
# Add to existing Alloy configuration
otelcol.receiver.otlp:
  grpc:
    endpoint: 0.0.0.0:4317
  http:
    endpoint: 0.0.0.0:4318
otelcol.processor.batch:
  timeout: 1s
  send_batch_size: 1024
otelcol.exporter.prometheus:
  endpoint: "0.0.0.0:9464"
otelcol.exporter.jaeger:
  endpoint: "jaeger-collector:14250"
otelcol.pipeline.traces:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.jaeger]
otelcol.pipeline.metrics:
  receivers: [otelcol.receiver.otlp]
  processors: [otelcol.processor.batch]
  exporters: [otelcol.exporter.prometheus]
```
#### Prometheus Configuration
```yaml
scrape_configs:
  - job_name: "ivatar-opentelemetry"
    static_configs:
      - targets: ["ivatar-prod:9464", "ivatar-dev:9464"]
    scrape_interval: 15s
    metrics_path: /metrics
```
### Option B: Dedicated OpenTelemetry Collector
For full OpenTelemetry features, deploy a dedicated collector:
#### Collector Configuration
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        from_attribute: deployment.environment
        action: insert
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
  jaeger:
    endpoint: "jaeger-collector:14250"
  logging:
    loglevel: debug
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus, logging]
```
## Deployment
### Development Setup
1. **Install Dependencies**
   ```bash
   pip install -r requirements.txt
   ```
2. **Configure Environment**
   ```bash
   export OTEL_EXPORT_ENABLED=true
   export OTEL_SERVICE_NAME=ivatar-development
   export OTEL_ENVIRONMENT=development
   # Optional: expose a local metrics endpoint for step 4
   export OTEL_PROMETHEUS_ENDPOINT=0.0.0.0:9464
   ```
3. **Start Development Server**
   ```bash
   ./manage.py runserver 0:8080
   ```
4. **Verify Metrics**
   ```bash
   curl http://localhost:9464/metrics
   ```
### Production Deployment
1. **Update Container Images**
   - Add OpenTelemetry dependencies to `requirements.txt`
   - Update the container build process
2. **Configure Environment Variables**
   - Set production-specific OpenTelemetry variables
   - Configure collector endpoints
3. **Update Monitoring Stack**
   - Extend the Alloy configuration
   - Update Prometheus scrape configs
   - Configure Grafana dashboards
4. **Verify Deployment**
   - Check metrics endpoint accessibility
   - Verify trace data flow
   - Monitor dashboard updates
## Monitoring and Alerting
### Key Metrics to Monitor
#### Performance
- Request duration percentiles (p50, p95, p99)
- Avatar generation time
- Cache hit ratio
- File upload success rate
#### Business Metrics
- Avatar requests per minute
- Popular avatar sizes
- External service usage
- User authentication success rate
#### Error Rates
- HTTP error rates by endpoint
- File upload failures
- External service failures
- Authentication failures
### Example Alerts
#### High Error Rate
```yaml
alert: HighErrorRate
expr: rate(ivatar_requests_total{status_code=~"5.."}[5m]) > 0.1
for: 2m
labels:
  severity: warning
annotations:
  summary: "High error rate detected"
  description: "Error rate is {{ $value }} errors per second"
```
#### Slow Avatar Generation
```yaml
alert: SlowAvatarGeneration
expr: histogram_quantile(0.95, rate(ivatar_avatar_generation_seconds_bucket[5m])) > 2
for: 5m
labels:
  severity: warning
annotations:
  summary: "Slow avatar generation"
  description: "95th percentile avatar generation time is {{ $value }}s"
```
#### Low Cache Hit Ratio
```yaml
alert: LowCacheHitRatio
expr: (rate(ivatar_avatar_cache_hits_total[5m]) / (rate(ivatar_avatar_cache_hits_total[5m]) + rate(ivatar_avatar_cache_misses_total[5m]))) < 0.8
for: 10m
labels:
  severity: warning
annotations:
  summary: "Low cache hit ratio"
  description: "Cache hit ratio is {{ $value }}"
```
## Troubleshooting
### Common Issues
#### OpenTelemetry Not Enabled
- Check the `OTEL_EXPORT_ENABLED` environment variable
- Verify OpenTelemetry packages are installed
- Check Django logs for configuration errors
#### Metrics Not Appearing
- Verify Prometheus endpoint is accessible
- Check collector configuration
- Ensure metrics are being generated
#### Traces Not Showing
- Verify OTLP endpoint configuration
- Check collector connectivity
- Ensure tracing is enabled in configuration
#### High Memory Usage
- Adjust batch processor settings
- Reduce trace sampling rate
- Monitor collector resource usage
### Debug Mode
Enable debug logging for OpenTelemetry:
```python
LOGGING = {
    "loggers": {
        "opentelemetry": {
            "level": "DEBUG",
        },
        "ivatar.opentelemetry": {
            "level": "DEBUG",
        },
    },
}
```
### Performance Considerations
- **Sampling**: Implement trace sampling for high-traffic production (see the sketch after this list)
- **Batch Processing**: Use appropriate batch sizes for your infrastructure
- **Resource Limits**: Monitor collector resource usage
- **Network**: Ensure low-latency connections to collectors
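As a sketch of how ratio-based sampling can be wired in with the standard OpenTelemetry SDK (the ratio and the provider wiring are illustrative, not ivatar's actual defaults):
```python
# Keep roughly 10% of traces; child spans follow their parent's sampling decision.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
```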
## Security Considerations
- **Data Privacy**: Ensure no sensitive data in trace attributes
- **Network Security**: Use TLS for collector communications
- **Access Control**: Restrict access to metrics endpoints
- **Data Retention**: Configure appropriate retention policies
## Future Enhancements
- **Custom Dashboards**: Create Grafana dashboards for avatar metrics
- **Advanced Sampling**: Implement intelligent trace sampling
- **Log Correlation**: Correlate traces with application logs
- **Performance Profiling**: Add profiling capabilities
- **Custom Exports**: Export to additional backends (Datadog, New Relic)
## Support
For issues related to OpenTelemetry integration:
- Check application logs for configuration errors
- Verify collector connectivity
- Review Prometheus metrics for data flow
- Consult OpenTelemetry documentation for advanced configuration