
Monitoring & Observability

This guide covers monitoring online-boutique with Prometheus and Grafana.

Overview

The Java Golden Path includes comprehensive observability:

  • Metrics: Prometheus metrics via Spring Boot Actuator
  • Dashboards: Pre-configured Grafana dashboard
  • Scraping: Automatic discovery via ServiceMonitor
  • Retention: 15 days of metrics storage

Metrics Endpoint

Spring Boot Actuator exposes Prometheus metrics at:

http://<pod-ip>:8080/actuator/prometheus

Verify Metrics Locally

curl http://localhost:8080/actuator/prometheus
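If the service is running in the cluster rather than locally, port-forward to it first. A sketch, assuming the Deployment name used elsewhere in this guide:

```shell
# Forward local port 8080 to the application
# (add -n <namespace> if it is not in the current namespace)
kubectl port-forward deployment/online-boutique 8080:8080 &
sleep 2

# Fetch the endpoint and list a few metric names
curl -s http://localhost:8080/actuator/prometheus | grep -E '^(jvm|http)_' | head
```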

Sample Metrics Output

# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 5.2428800E7

# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 42.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 0.351234567

Available Metrics

HTTP Metrics

  • http_server_requests_seconds_count: Total request count
  • http_server_requests_seconds_sum: Total request duration
  • Labels: method, status, uri, outcome, exception

JVM Metrics

Memory

  • jvm_memory_used_bytes: Current memory usage
  • jvm_memory_max_bytes: Maximum memory available
  • jvm_memory_committed_bytes: Committed memory
  • Areas: heap, nonheap
  • Pools: G1 Eden Space, G1 Old Gen, G1 Survivor Space

Garbage Collection

  • jvm_gc_pause_seconds_count: GC pause count
  • jvm_gc_pause_seconds_sum: Total GC pause time
  • jvm_gc_memory_allocated_bytes_total: Total memory allocated
  • jvm_gc_memory_promoted_bytes_total: Memory promoted to old gen

Threads

  • jvm_threads_live_threads: Current live threads
  • jvm_threads_daemon_threads: Current daemon threads
  • jvm_threads_peak_threads: Peak thread count
  • jvm_threads_states_threads: Threads by state (runnable, blocked, waiting)

CPU

  • process_cpu_usage: Process CPU usage (0-1)
  • system_cpu_usage: System CPU usage (0-1)
  • system_cpu_count: Number of CPU cores

Application Metrics

  • application_started_time_seconds: Application start timestamp
  • application_ready_time_seconds: Application ready timestamp
  • process_uptime_seconds: Process uptime
  • process_files_open_files: Open file descriptors

Custom Metrics

Add custom metrics with Micrometer:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

@Autowired
private MeterRegistry meterRegistry;

// Counter: monotonically increasing event count
// (exported to Prometheus as business_operations_total)
Counter.builder("business_operations")
    .tag("operation", "checkout")
    .register(meterRegistry)
    .increment();

// Gauge: sampled on each scrape; getActiveUsers() must return a number
Gauge.builder("active_users", this, obj -> obj.getActiveUsers())
    .register(meterRegistry);

// Timer: records count and duration of the wrapped operation
Timer.builder("api_processing_time")
    .tag("endpoint", "/api/process")
    .register(meterRegistry)
    .record(() -> {
        // Timed operation
    });

Prometheus Configuration

ServiceMonitor

Defined in deploy/servicemonitor.yaml and deployed automatically:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: online-boutique
  namespace: 
  labels:
    app: online-boutique
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: online-boutique
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 30s

Verify Scraping

Check Prometheus targets:

  1. Access Prometheus: https://prometheus.kyndemo.live
  2. Navigate to Status → Targets
  3. Find online-boutique in monitoring/ namespace
  4. Status should be UP

Or via kubectl:

# Port-forward Prometheus
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090

# Check targets API
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "online-boutique")'

Query Metrics

Access Prometheus UI and run queries:

# Request rate
rate(http_server_requests_seconds_count{job="online-boutique"}[5m])

# Average request duration
rate(http_server_requests_seconds_sum{job="online-boutique"}[5m])
/ rate(http_server_requests_seconds_count{job="online-boutique"}[5m])

# Error rate
sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
/ sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))

# Memory usage
jvm_memory_used_bytes{job="online-boutique",area="heap"}
/ jvm_memory_max_bytes{job="online-boutique",area="heap"}

Grafana Dashboard

Access Dashboard

  1. Open Grafana: https://grafana.kyndemo.live
  2. Navigate to Dashboards → Spring Boot Application
  3. Select online-boutique from dropdown

Dashboard Panels

HTTP Metrics

  • Request Rate: Requests per second by endpoint
  • Request Duration: Average, 95th, 99th percentile latency
  • Status Codes: Breakdown of 2xx, 4xx, 5xx responses
  • Error Rate: Percentage of failed requests

JVM Metrics

  • Heap Memory: Used vs. max heap memory over time
  • Non-Heap Memory: Metaspace, code cache, compressed class space
  • Garbage Collection: GC pause frequency and duration
  • Thread Count: Live threads, daemon threads, peak threads

System Metrics

  • CPU Usage: Process and system CPU utilization
  • File Descriptors: Open file count
  • Uptime: Application uptime

Custom Dashboards

Import dashboard JSON from /k8s/monitoring/spring-boot-dashboard.json:

  1. Grafana → Dashboards → New → Import
  2. Upload spring-boot-dashboard.json
  3. Select Prometheus data source
  4. Click Import
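The same import can be scripted against Grafana's HTTP API. A sketch: GRAFANA_TOKEN is a hypothetical service-account token you would create in Grafana first.

```shell
# Wrap the dashboard JSON in the payload /api/dashboards/db expects,
# then POST it to Grafana
jq '{dashboard: ., overwrite: true}' spring-boot-dashboard.json \
  | curl -s -X POST https://grafana.kyndemo.live/api/dashboards/db \
      -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
      -H "Content-Type: application/json" \
      --data @-
```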

Alerting

Prometheus Alerting Rules

Create alerting rules in Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-alerts
  namespace: 
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
            > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on online-boutique"
            description: "Error rate is {{ $value | humanizePercentage }}"

        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_server_requests_seconds_bucket{job="online-boutique"}[5m])) by (le)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on online-boutique"
            description: "95th percentile latency is {{ $value }}s"

        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            jvm_memory_used_bytes{job="online-boutique",area="heap"}
            / jvm_memory_max_bytes{job="online-boutique",area="heap"}
            > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High memory usage on online-boutique"
            description: "Heap usage is {{ $value | humanizePercentage }}"

        # Pod not ready
        - alert: PodNotReady
          expr: |
            kube_pod_status_ready{namespace="",pod=~"online-boutique-.*",condition="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "online-boutique pod not ready"
            description: "Pod {{ $labels.pod }} not ready for 5 minutes"

Apply:

kubectl apply -f prometheus-rules.yaml
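Note that the HighLatency rule queries http_server_requests_seconds_bucket, which only exists when histogram buckets are published; by default Micrometer exports this timer as a summary (as in the sample metrics output earlier). A minimal application.yml fragment to enable buckets:

```yaml
management:
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true
```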

Grafana Alerts

Configure alerts in Grafana dashboard panels:

  1. Edit panel
  2. Click Alert tab
  3. Set conditions (e.g., "when avg() of query(A) is above 0.8")
  4. Configure notification channels (Slack, email, PagerDuty)

Alert Testing

Trigger test alerts:

# Generate errors
for i in {1..100}; do
  curl http://localhost:8080/api/nonexistent
done

# Trigger high latency
ab -n 10000 -c 100 http://localhost:8080/api/status

# Inspect the heap (the heapdump endpoint responds to GET, not POST)
curl http://localhost:8080/actuator/heapdump --output heapdump.hprof

Distributed Tracing (Future)

To add tracing with Jaeger/Zipkin:

  1. Add dependency:
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-zipkin</artifactId>
</dependency>
  2. Configure in application.yml:
management:
  tracing:
    sampling:
      probability: 1.0
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans

Log Aggregation

For centralized logging:

  1. Loki: Add Promtail to collect pod logs
  2. Grafana Logs: Query logs alongside metrics
  3. Log Correlation: Link traces to logs via trace ID
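A minimal Promtail scrape sketch for pod logs; names and labels here are illustrative, not taken from this repository:

```yaml
# Promtail config fragment: discover pods and carry over the app label
# so logs can be matched to metrics ({app="online-boutique"})
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```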

Best Practices

  1. Metric Cardinality: Avoid high-cardinality labels (user IDs, timestamps)
  2. Naming: Follow Prometheus naming conventions (_total, _seconds, _bytes)
  3. Aggregation: Use recording rules for expensive queries
  4. Retention: Adjust retention period based on storage capacity
  5. Dashboarding: Create business-specific dashboards for stakeholders
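The recording-rule practice (point 3) can be expressed as another PrometheusRule. This sketch precomputes the error-ratio query used earlier in this guide so dashboards and alerts can reuse it cheaply:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-recording
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique-recording
      interval: 30s
      rules:
        # Precomputed 5m error ratio for online-boutique
        - record: job:http_server_requests_error_ratio:rate5m
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
```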

Troubleshooting

Metrics Not Appearing

# Check if actuator is enabled
kubectl -n  exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator

# Check ServiceMonitor
kubectl -n  get servicemonitor online-boutique -o yaml

# Check Prometheus logs
kubectl -n monitoring logs -l app.kubernetes.io/name=prometheus --tail=100 | grep online-boutique

High Memory Usage

# Take heap dump (the heapdump endpoint responds to GET;
# the file is written inside the pod)
kubectl -n  exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator/heapdump --output heapdump.hprof

# Copy the dump out with kubectl cp, then analyze with
# Eclipse Memory Analyzer (MAT) or VisualVM

Slow Queries

Enable query logging in Prometheus:

kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
# Access http://localhost:9090/graph
# Enable query stats in settings

Next Steps