# Monitoring & Observability

This guide covers monitoring online-boutique with Prometheus and Grafana.

## Overview

The Java Golden Path includes comprehensive observability:

- **Metrics**: Prometheus metrics via Spring Boot Actuator
- **Dashboards**: Pre-configured Grafana dashboard
- **Scraping**: Automatic discovery via ServiceMonitor
- **Retention**: 15 days of metrics storage

## Metrics Endpoint

Spring Boot Actuator exposes Prometheus metrics at:

```
http://<host>:8080/actuator/prometheus
```

### Verify Metrics Locally

```bash
curl http://localhost:8080/actuator/prometheus
```

### Sample Metrics Output

```
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 5.2428800E7
# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 42.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 0.351234567
```

## Available Metrics

### HTTP Metrics

- `http_server_requests_seconds_count`: Total request count
- `http_server_requests_seconds_sum`: Total request duration
- **Labels**: method, status, uri, outcome, exception

### JVM Metrics

#### Memory

- `jvm_memory_used_bytes`: Current memory usage
- `jvm_memory_max_bytes`: Maximum memory available
- `jvm_memory_committed_bytes`: Committed memory
- **Areas**: heap, nonheap
- **Pools**: G1 Eden Space, G1 Old Gen, G1 Survivor Space

#### Garbage Collection

- `jvm_gc_pause_seconds_count`: GC pause count
- `jvm_gc_pause_seconds_sum`: Total GC pause time
- `jvm_gc_memory_allocated_bytes_total`: Total memory allocated
- `jvm_gc_memory_promoted_bytes_total`: Memory promoted to old gen

#### Threads

- `jvm_threads_live_threads`: Current live threads
- `jvm_threads_daemon_threads`: Current daemon threads
- `jvm_threads_peak_threads`: Peak thread count
- `jvm_threads_states_threads`: Threads by state (runnable, blocked, waiting)

#### CPU

- `process_cpu_usage`: Process CPU usage (0-1)
- `system_cpu_usage`: System CPU usage (0-1)
- `system_cpu_count`: Number of CPU cores

### Application Metrics

- `application_started_time_seconds`: Application start timestamp
- `application_ready_time_seconds`: Application ready timestamp
- `process_uptime_seconds`: Process uptime
- `process_files_open_files`: Open file descriptors

### Custom Metrics

Add custom metrics with Micrometer:

```java
@Autowired
private MeterRegistry meterRegistry;

// Counter: count business events
Counter.builder("business_operations")
        .tag("operation", "checkout")
        .register(meterRegistry)
        .increment();

// Gauge: sample a value on each scrape
Gauge.builder("active_users", this, obj -> obj.getActiveUsers())
        .register(meterRegistry);

// Timer: record the duration of an operation
Timer.builder("api_processing_time")
        .tag("endpoint", "/api/process")
        .register(meterRegistry)
        .record(() -> {
            // Timed operation
        });
```

## Prometheus Configuration

### ServiceMonitor

Deployed automatically from `deploy/servicemonitor.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: online-boutique
  namespace: <namespace>
  labels:
    app: online-boutique
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: online-boutique
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 30s
```

### Verify Scraping

Check Prometheus targets:

1. Access Prometheus: `https://prometheus.kyndemo.live`
2. Navigate to **Status → Targets**
3. Find `online-boutique` in the `monitoring/` namespace
4. Status should be **UP**
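If the target is missing or **DOWN**, the usual cause is a mismatch between the ServiceMonitor and the application Service: the `matchLabels` must match the Service's labels, and the endpoint `port: http` refers to the Service's *named* port. A minimal matching Service sketch (the port number and names are assumptions based on the manifests above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: online-boutique
  labels:
    app: online-boutique   # must match spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: online-boutique
  ports:
    - name: http           # the ServiceMonitor endpoint references this port name
      port: 8080
      targetPort: 8080
```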
Or via kubectl:

```bash
# Port-forward Prometheus
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090

# Check the targets API
curl http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job == "online-boutique")'
```

### Query Metrics

Access the Prometheus UI and run queries:

```promql
# Request rate
rate(http_server_requests_seconds_count{job="online-boutique"}[5m])

# Average request duration
rate(http_server_requests_seconds_sum{job="online-boutique"}[5m])
  / rate(http_server_requests_seconds_count{job="online-boutique"}[5m])

# Error rate
sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))

# Memory usage
jvm_memory_used_bytes{job="online-boutique",area="heap"}
  / jvm_memory_max_bytes{job="online-boutique",area="heap"}
```

## Grafana Dashboard

### Access Dashboard

1. Open Grafana: `https://grafana.kyndemo.live`
2. Navigate to **Dashboards → Spring Boot Application**
3. Select `online-boutique` from the dropdown

### Dashboard Panels

#### HTTP Metrics

- **Request Rate**: Requests per second by endpoint
- **Request Duration**: Average, 95th, and 99th percentile latency
- **Status Codes**: Breakdown of 2xx, 4xx, 5xx responses
- **Error Rate**: Percentage of failed requests

#### JVM Metrics

- **Heap Memory**: Used vs. max heap memory over time
- **Non-Heap Memory**: Metaspace, code cache, compressed class space
- **Garbage Collection**: GC pause frequency and duration
- **Thread Count**: Live threads, daemon threads, peak threads

#### System Metrics

- **CPU Usage**: Process and system CPU utilization
- **File Descriptors**: Open file count
- **Uptime**: Application uptime

### Custom Dashboards

Import the dashboard JSON from `/k8s/monitoring/spring-boot-dashboard.json`:

1. Grafana → Dashboards → New → Import
2. Upload `spring-boot-dashboard.json`
3. Select the Prometheus data source
4. Click **Import**
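As an alternative to manual import, the Grafana sidecar shipped with kube-prometheus-stack can load dashboards from labeled ConfigMaps. A sketch, assuming that chart's default `grafana_dashboard` label convention (the ConfigMap name and the inline JSON are placeholders):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: online-boutique-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the sidecar watches for ConfigMaps carrying this label
data:
  spring-boot-dashboard.json: |
    { "title": "Spring Boot Application", "panels": [] }
```

In practice you would paste the full contents of `spring-boot-dashboard.json` into the `data` key rather than the stub shown here.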
## Alerting

### Prometheus Alerting Rules

Create alerting rules in Prometheus:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-alerts
  namespace: <namespace>
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
            /
            sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on online-boutique"
            description: "Error rate is {{ $value | humanizePercentage }}"

        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_server_requests_seconds_bucket{job="online-boutique"}[5m])) by (le)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on online-boutique"
            description: "95th percentile latency is {{ $value }}s"

        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            jvm_memory_used_bytes{job="online-boutique",area="heap"}
            /
            jvm_memory_max_bytes{job="online-boutique",area="heap"} > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High memory usage on online-boutique"
            description: "Heap usage is {{ $value | humanizePercentage }}"

        # Pod not ready
        - alert: PodNotReady
          expr: |
            kube_pod_status_ready{namespace="<namespace>",pod=~"online-boutique-.*",condition="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "online-boutique pod not ready"
            description: "Pod {{ $labels.pod }} not ready for 5 minutes"
```

Note: `HighLatency` relies on `http_server_requests_seconds_bucket`, which is only exported when percentile histograms are enabled (`management.metrics.distribution.percentiles-histogram.http.server.requests: true`); the default summary type exposes no `_bucket` series.

Apply:

```bash
kubectl apply -f prometheus-rules.yaml
```

### Grafana Alerts

Configure alerts in Grafana dashboard panels:

1. Edit the panel
2. Click the **Alert** tab
3. Set conditions (e.g., "when avg() of query(A) is above 0.8")
4. Configure notification channels (Slack, email, PagerDuty)
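For PrometheusRule alerts, delivery is handled by Alertmanager rather than Grafana. A minimal routing sketch for `alertmanager.yaml` (the receiver names, Slack webhook URL, and channel are placeholders):

```yaml
route:
  receiver: default
  group_by: [alertname]
  routes:
    - matchers:
        - severity="critical"     # e.g. HighMemoryUsage, PodNotReady from the rules above
      receiver: slack-oncall
receivers:
  - name: default
  - name: slack-oncall
    slack_configs:
      - api_url: https://hooks.slack.com/services/T000/B000/XXXX   # placeholder webhook
        channel: "#alerts"
```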
### Alert Testing

Trigger test alerts:

```bash
# Generate errors
for i in {1..100}; do
  curl http://localhost:8080/api/nonexistent
done

# Trigger high latency
ab -n 10000 -c 100 http://localhost:8080/api/status

# Cause memory pressure (heapdump is a GET endpoint)
curl http://localhost:8080/actuator/heapdump --output /dev/null
```

## Distributed Tracing (Future)

To add tracing with Jaeger/Zipkin:

1. Add the dependencies:

   ```xml
   <dependency>
       <groupId>io.micrometer</groupId>
       <artifactId>micrometer-tracing-bridge-otel</artifactId>
   </dependency>
   <dependency>
       <groupId>io.opentelemetry</groupId>
       <artifactId>opentelemetry-exporter-zipkin</artifactId>
   </dependency>
   ```

2. Configure in `application.yml`:

   ```yaml
   management:
     tracing:
       sampling:
         probability: 1.0
     zipkin:
       tracing:
         endpoint: http://zipkin:9411/api/v2/spans
   ```

## Log Aggregation

For centralized logging:

1. **Loki**: Add Promtail to collect pod logs
2. **Grafana Logs**: Query logs alongside metrics
3. **Log Correlation**: Link traces to logs via trace ID

## Best Practices

1. **Metric Cardinality**: Avoid high-cardinality labels (user IDs, timestamps)
2. **Naming**: Follow Prometheus naming conventions (`_total`, `_seconds`, `_bytes`)
3. **Aggregation**: Use recording rules for expensive queries
4. **Retention**: Adjust the retention period based on storage capacity
5. **Dashboarding**: Create business-specific dashboards for stakeholders
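The recording-rule practice above can be sketched for the error-rate query used earlier in this guide; the rule name follows the common `level:metric:operation` convention and is an assumption:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-recording-rules
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique.rules
      interval: 30s
      rules:
        # Precompute the 5xx error ratio so dashboards and alerts query one cheap series
        - record: job:http_server_requests:error_rate5m
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
            /
            sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
```

Dashboards and alerting rules can then reference `job:http_server_requests:error_rate5m` instead of re-evaluating the full expression on every panel refresh.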
## Troubleshooting

### Metrics Not Appearing

```bash
# Check if actuator is enabled
kubectl -n <namespace> exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator

# Check the ServiceMonitor
kubectl -n <namespace> get servicemonitor online-boutique -o yaml

# Check Prometheus logs
kubectl -n monitoring logs -l app.kubernetes.io/name=prometheus --tail=100 | grep online-boutique
```

### High Memory Usage

```bash
# Take a heap dump (GET endpoint)
kubectl -n <namespace> exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator/heapdump --output heapdump.hprof

# Analyze with jmap/jhat or Eclipse Memory Analyzer
```

### Slow Queries

Enable query logging in Prometheus:

```bash
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
# Access http://localhost:9090/graph
# Enable query stats in settings
```

## Next Steps

- [Review architecture](architecture.md)
- [Learn about deployment](deployment.md)
- [Return to overview](index.md)