# Monitoring & Observability
This guide covers monitoring online-boutique with Prometheus and Grafana.
## Overview
The Java Golden Path includes comprehensive observability:
- **Metrics**: Prometheus metrics via Spring Boot Actuator
- **Dashboards**: Pre-configured Grafana dashboard
- **Scraping**: Automatic discovery via ServiceMonitor
- **Retention**: 15 days of metrics storage
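The retention period is set on the Prometheus resource itself; a minimal sketch, assuming a prometheus-operator-managed instance named `kube-prometheus` in the `monitoring` namespace (both names are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus
  namespace: monitoring
spec:
  # Duration string: s, m, h, d, w, y
  retention: 15d
```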
## Metrics Endpoint
Spring Boot Actuator exposes Prometheus metrics at:
```
http://<pod-ip>:8080/actuator/prometheus
```
### Verify Metrics Locally
```bash
curl http://localhost:8080/actuator/prometheus
```
### Sample Metrics Output
```
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 5.2428800E7
# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 42.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 0.351234567
```
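Because the exposition format is line-oriented plain text, spot checks are easy to script; for example, pulling the request-count value out of the sample above with awk:

```shell
# Pull a single sample value out of Prometheus text-format output.
# In practice you would pipe `curl -s localhost:8080/actuator/prometheus`
# into awk; the here-doc below just replays the sample output shown above.
awk '$1 ~ /^http_server_requests_seconds_count/ { print $2 }' <<'EOF'
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 5.2428800E7
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 42.0
EOF
# → 42.0
```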
## Available Metrics
### HTTP Metrics
- `http_server_requests_seconds_count`: Total request count
- `http_server_requests_seconds_sum`: Total request duration
- **Labels**: method, status, uri, outcome, exception
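These labels support per-dimension breakdowns directly in PromQL; for example, request rate split by status code and endpoint (the `job` label matches the queries used later in this guide):

```promql
sum by (status, uri) (
  rate(http_server_requests_seconds_count{job="online-boutique"}[5m])
)
```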
### JVM Metrics
#### Memory
- `jvm_memory_used_bytes`: Current memory usage
- `jvm_memory_max_bytes`: Maximum memory available
- `jvm_memory_committed_bytes`: Committed memory
- **Areas**: heap, nonheap
- **Pools**: G1 Eden Space, G1 Old Gen, G1 Survivor Space
#### Garbage Collection
- `jvm_gc_pause_seconds_count`: GC pause count
- `jvm_gc_pause_seconds_sum`: Total GC pause time
- `jvm_gc_memory_allocated_bytes_total`: Total memory allocated
- `jvm_gc_memory_promoted_bytes_total`: Memory promoted to old gen
#### Threads
- `jvm_threads_live_threads`: Current live threads
- `jvm_threads_daemon_threads`: Current daemon threads
- `jvm_threads_peak_threads`: Peak thread count
- `jvm_threads_states_threads`: Threads by state (runnable, blocked, waiting)
#### CPU
- `process_cpu_usage`: Process CPU usage (0-1)
- `system_cpu_usage`: System CPU usage (0-1)
- `system_cpu_count`: Number of CPU cores
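A few illustrative queries over the JVM metrics above (sketches; the `job` label matches the queries used later in this guide):

```promql
# Fraction of wall-clock time spent in GC pauses over the last 5 minutes
rate(jvm_gc_pause_seconds_sum{job="online-boutique"}[5m])

# Live threads broken down by state
sum by (state) (jvm_threads_states_threads{job="online-boutique"})
```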
### Application Metrics
- `application_started_time_seconds`: Time taken to start the application
- `application_ready_time_seconds`: Time taken until ready to serve requests
- `process_uptime_seconds`: Process uptime
- `process_files_open_files`: Open file descriptors
### Custom Metrics
Add custom metrics with Micrometer:
```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.beans.factory.annotation.Autowired;

@Autowired
private MeterRegistry meterRegistry;

// Counter: monotonically increasing count of business events
Counter.builder("business_operations")
    .tag("operation", "checkout")
    .register(meterRegistry)
    .increment();

// Gauge: samples a value on demand
Gauge.builder("active_users", this, obj -> obj.getActiveUsers())
    .register(meterRegistry);

// Timer: records duration and count of an operation
Timer.builder("api_processing_time")
    .tag("endpoint", "/api/process")
    .register(meterRegistry)
    .record(() -> {
        // Timed operation
    });
```
## Prometheus Configuration
### ServiceMonitor
Deployed automatically in `deploy/servicemonitor.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: online-boutique
  namespace:
  labels:
    app: online-boutique
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: online-boutique
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 30s
```
### Verify Scraping
Check Prometheus targets:
1. Access Prometheus: `https://prometheus.kyndemo.live`
2. Navigate to **Status → Targets**
3. Find `online-boutique` in `monitoring/` namespace
4. Status should be **UP**
Or via kubectl:
```bash
# Port-forward Prometheus
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
# Check targets API
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "online-boutique")'
```
### Query Metrics
Access Prometheus UI and run queries:
```promql
# Request rate
rate(http_server_requests_seconds_count{job="online-boutique"}[5m])
# Average request duration
rate(http_server_requests_seconds_sum{job="online-boutique"}[5m])
/ rate(http_server_requests_seconds_count{job="online-boutique"}[5m])
# Error rate
sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
/ sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
# Memory usage
jvm_memory_used_bytes{job="online-boutique",area="heap"}
/ jvm_memory_max_bytes{job="online-boutique",area="heap"}
```
## Grafana Dashboard
### Access Dashboard
1. Open Grafana: `https://grafana.kyndemo.live`
2. Navigate to **Dashboards → Spring Boot Application**
3. Select `online-boutique` from dropdown
### Dashboard Panels
#### HTTP Metrics
- **Request Rate**: Requests per second by endpoint
- **Request Duration**: Average, 95th, 99th percentile latency
- **Status Codes**: Breakdown of 2xx, 4xx, 5xx responses
- **Error Rate**: Percentage of failed requests
#### JVM Metrics
- **Heap Memory**: Used vs. max heap memory over time
- **Non-Heap Memory**: Metaspace, code cache, compressed class space
- **Garbage Collection**: GC pause frequency and duration
- **Thread Count**: Live threads, daemon threads, peak threads
#### System Metrics
- **CPU Usage**: Process and system CPU utilization
- **File Descriptors**: Open file count
- **Uptime**: Application uptime
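Note that percentile panels need histogram buckets, which Spring Boot does not export for `http_server_requests` by default; assuming `management.metrics.distribution.percentiles-histogram.http.server.requests: true` is set in `application.yml`, a 95th-percentile panel query looks like:

```promql
histogram_quantile(0.95,
  sum by (le) (rate(http_server_requests_seconds_bucket{job="online-boutique"}[5m]))
)
```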
### Custom Dashboards
Import dashboard JSON from `/k8s/monitoring/spring-boot-dashboard.json`:
1. Grafana → Dashboards → New → Import
2. Upload `spring-boot-dashboard.json`
3. Select Prometheus data source
4. Click **Import**
## Alerting
### Prometheus Alerting Rules
Create alerting rules in Prometheus:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-alerts
  namespace:
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
            > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on online-boutique"
            description: "Error rate is {{ $value | humanizePercentage }}"
        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_server_requests_seconds_bucket{job="online-boutique"}[5m])) by (le)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on online-boutique"
            description: "95th percentile latency is {{ $value }}s"
        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            jvm_memory_used_bytes{job="online-boutique",area="heap"}
            / jvm_memory_max_bytes{job="online-boutique",area="heap"}
            > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High memory usage on online-boutique"
            description: "Heap usage is {{ $value | humanizePercentage }}"
        # Pod not ready
        - alert: PodNotReady
          expr: |
            kube_pod_status_ready{namespace="",pod=~"online-boutique-.*",condition="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "online-boutique pod not ready"
            description: "Pod {{ $labels.pod }} not ready for 5 minutes"
```
Apply:
```bash
kubectl apply -f prometheus-rules.yaml
```
### Grafana Alerts
Configure alerts in Grafana dashboard panels:
1. Edit panel
2. Click **Alert** tab
3. Set conditions (e.g., "when avg() of query(A) is above 0.8")
4. Configure notification channels (Slack, email, PagerDuty)
### Alert Testing
Trigger test alerts:
```bash
# Generate client errors (note: the HighErrorRate alert counts 5xx, so 404s will not trigger it)
for i in {1..100}; do
  curl -s -o /dev/null http://localhost:8080/api/nonexistent
done
# Trigger high latency
ab -n 10000 -c 100 http://localhost:8080/api/status
# Cause memory pressure (the heap dump endpoint forces a full GC)
curl http://localhost:8080/actuator/heapdump -o heapdump.hprof
```
## Distributed Tracing (Future)
To add tracing with Jaeger/Zipkin:
1. Add dependency:
```xml
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-exporter-zipkin</artifactId>
</dependency>
```
2. Configure in `application.yml`:
```yaml
management:
  tracing:
    sampling:
      probability: 1.0
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans
```
## Log Aggregation
For centralized logging:
1. **Loki**: Add Promtail to collect pod logs
2. **Grafana Logs**: Query logs alongside metrics
3. **Log Correlation**: Link traces to logs via trace ID
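For the log-correlation point, a common pattern is to put the trace ID into every log line so Loki queries can be joined to traces; a sketch of a `logback-spring.xml` console pattern, assuming Micrometer Tracing is on the classpath and populates `traceId`/`spanId` in the MDC:

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <!-- %X{traceId:-} / %X{spanId:-} emit empty strings when no trace is active -->
      <pattern>%d{ISO8601} %-5level [%X{traceId:-},%X{spanId:-}] %logger - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```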
## Best Practices
1. **Metric Cardinality**: Avoid high-cardinality labels (user IDs, timestamps)
2. **Naming**: Follow Prometheus naming conventions (`_total`, `_seconds`, `_bytes`)
3. **Aggregation**: Use recording rules for expensive queries
4. **Retention**: Adjust retention period based on storage capacity
5. **Dashboarding**: Create business-specific dashboards for stakeholders
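For the aggregation point, a recording rule precomputes an expensive expression so dashboards and alerts query the cheap result; a sketch reusing this guide's error-rate query (the rule and metric names are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-recording-rules
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique.rules
      interval: 30s
      rules:
        # Precomputed 5xx ratio; dashboards query this series directly
        - record: job:http_server_requests:error_ratio_5m
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
```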
## Troubleshooting
### Metrics Not Appearing
```bash
# Check if actuator is enabled
kubectl -n <namespace> exec -it deployment/online-boutique -- \
curl http://localhost:8080/actuator
# Check ServiceMonitor
kubectl -n <namespace> get servicemonitor online-boutique -o yaml
# Check Prometheus logs
kubectl -n monitoring logs -l app.kubernetes.io/name=prometheus --tail=100 | grep online-boutique
```
### High Memory Usage
```bash
# Take heap dump (the file is written inside the pod)
kubectl -n <namespace> exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator/heapdump --output heapdump.hprof
# Copy it locally with kubectl cp, then analyze with Eclipse Memory Analyzer (MAT) or VisualVM
```
### Slow Queries
Enable the query log in Prometheus (set `query_log_file` in the Prometheus configuration), then inspect slow queries:
```bash
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
# Access http://localhost:9090/graph
# The UI shows load time and result-series stats for each query
```
## Next Steps
- [Review architecture](architecture.md)
- [Learn about deployment](deployment.md)
- [Return to overview](index.md)