# Monitoring & Observability

This guide covers monitoring online-boutique with Prometheus and Grafana.

## Overview

The Java Golden Path includes comprehensive observability:

- **Metrics**: Prometheus metrics via Spring Boot Actuator
- **Dashboards**: Pre-configured Grafana dashboard
- **Scraping**: Automatic discovery via ServiceMonitor
- **Retention**: 15 days of metrics storage

## Metrics Endpoint

Spring Boot Actuator exposes Prometheus metrics at:

```
http://<pod-ip>:8080/actuator/prometheus
```
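
By default Spring Boot exposes only a few Actuator endpoints over HTTP, so the Prometheus endpoint has to be opted in. A minimal `application.yml` fragment, assuming `micrometer-registry-prometheus` is on the classpath (verify the property names against your Spring Boot version):

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
```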

### Verify Metrics Locally

```bash
curl http://localhost:8080/actuator/prometheus
```

### Sample Metrics Output

```
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 5.2428800E7

# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 42.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 0.351234567
```
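
The `_sum` and `_count` series above combine into a mean latency: total seconds divided by request count. A quick shell sanity check using the sample values:

```shell
# Mean request duration = seconds_sum / seconds_count
awk 'BEGIN { printf "%.4f\n", 0.351234567 / 42 }'
# 0.0084 — roughly 8.4 ms per request
```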

## Available Metrics

### HTTP Metrics

- `http_server_requests_seconds_count`: Total request count
- `http_server_requests_seconds_sum`: Total request duration
- **Labels**: method, status, uri, outcome, exception

### JVM Metrics

#### Memory
- `jvm_memory_used_bytes`: Current memory usage
- `jvm_memory_max_bytes`: Maximum memory available
- `jvm_memory_committed_bytes`: Committed memory
- **Areas**: heap, nonheap
- **Pools**: G1 Eden Space, G1 Old Gen, G1 Survivor Space

#### Garbage Collection
- `jvm_gc_pause_seconds_count`: GC pause count
- `jvm_gc_pause_seconds_sum`: Total GC pause time
- `jvm_gc_memory_allocated_bytes_total`: Total memory allocated
- `jvm_gc_memory_promoted_bytes_total`: Memory promoted to old gen

#### Threads
- `jvm_threads_live_threads`: Current live threads
- `jvm_threads_daemon_threads`: Current daemon threads
- `jvm_threads_peak_threads`: Peak thread count
- `jvm_threads_states_threads`: Threads by state (runnable, blocked, waiting)

#### CPU
- `process_cpu_usage`: Process CPU usage (0-1)
- `system_cpu_usage`: System CPU usage (0-1)
- `system_cpu_count`: Number of CPU cores

### Application Metrics

- `application_started_time_seconds`: Application start timestamp
- `application_ready_time_seconds`: Application ready timestamp
- `process_uptime_seconds`: Process uptime
- `process_files_open_files`: Open file descriptors

### Custom Metrics

Add custom metrics with Micrometer:

```java
@Autowired
private MeterRegistry meterRegistry;

// Counter
Counter.builder("business_operations")
    .tag("operation", "checkout")
    .register(meterRegistry)
    .increment();

// Gauge
Gauge.builder("active_users", this, obj -> obj.getActiveUsers())
    .register(meterRegistry);

// Timer
Timer.builder("api_processing_time")
    .tag("endpoint", "/api/process")
    .register(meterRegistry)
    .record(() -> {
        // Timed operation
    });
```

## Prometheus Configuration

### ServiceMonitor

Deployed automatically in `deploy/servicemonitor.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: online-boutique
  namespace: <namespace>
  labels:
    app: online-boutique
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: online-boutique
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 30s
```
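
Note that `port: http` in the endpoint refers to a *named* port on the Service, not a port number, so the Service must declare a port with that name. A sketch of the expected shape (the real Service manifest lives with the deployment, so treat this as illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: online-boutique
  labels:
    app: online-boutique     # must match the ServiceMonitor's matchLabels
spec:
  selector:
    app: online-boutique
  ports:
    - name: http             # referenced by the ServiceMonitor's `port: http`
      port: 8080
      targetPort: 8080
```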

### Verify Scraping

Check Prometheus targets:

1. Access Prometheus: `https://prometheus.kyndemo.live`
2. Navigate to **Status → Targets**
3. Find `online-boutique` in `monitoring/` namespace
4. Status should be **UP**

Or via kubectl:

```bash
# Port-forward Prometheus
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090

# Check targets API
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "online-boutique")'
```

### Query Metrics

Access the Prometheus UI and run queries:

```promql
# Request rate
rate(http_server_requests_seconds_count{job="online-boutique"}[5m])

# Average request duration
rate(http_server_requests_seconds_sum{job="online-boutique"}[5m])
  / rate(http_server_requests_seconds_count{job="online-boutique"}[5m])

# Error rate
sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))

# Memory usage
jvm_memory_used_bytes{job="online-boutique",area="heap"}
  / jvm_memory_max_bytes{job="online-boutique",area="heap"}
```
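
The error-rate expression is simply the 5xx request rate over the total request rate. With illustrative numbers (3 errors/s out of 60 req/s, made up for the example) the arithmetic is:

```shell
# error rate = 5xx rate / total rate, as a percentage
awk 'BEGIN { printf "%.1f%%\n", 100 * 3 / 60 }'
# 5.0%
```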

## Grafana Dashboard

### Access Dashboard

1. Open Grafana: `https://grafana.kyndemo.live`
2. Navigate to **Dashboards → Spring Boot Application**
3. Select `online-boutique` from the dropdown

### Dashboard Panels

#### HTTP Metrics
- **Request Rate**: Requests per second by endpoint
- **Request Duration**: Average, 95th, 99th percentile latency
- **Status Codes**: Breakdown of 2xx, 4xx, 5xx responses
- **Error Rate**: Percentage of failed requests

#### JVM Metrics
- **Heap Memory**: Used vs. max heap memory over time
- **Non-Heap Memory**: Metaspace, code cache, compressed class space
- **Garbage Collection**: GC pause frequency and duration
- **Thread Count**: Live threads, daemon threads, peak threads

#### System Metrics
- **CPU Usage**: Process and system CPU utilization
- **File Descriptors**: Open file count
- **Uptime**: Application uptime

### Custom Dashboards

Import dashboard JSON from `/k8s/monitoring/spring-boot-dashboard.json`:

1. Grafana → Dashboards → New → Import
2. Upload `spring-boot-dashboard.json`
3. Select the Prometheus data source
4. Click **Import**

## Alerting

### Prometheus Alerting Rules

Create alerting rules in Prometheus:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-alerts
  namespace: <namespace>
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
              / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
              > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on online-boutique"
            description: "Error rate is {{ $value | humanizePercentage }}"

        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_server_requests_seconds_bucket{job="online-boutique"}[5m])) by (le)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on online-boutique"
            description: "95th percentile latency is {{ $value }}s"

        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            jvm_memory_used_bytes{job="online-boutique",area="heap"}
              / jvm_memory_max_bytes{job="online-boutique",area="heap"}
              > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High memory usage on online-boutique"
            description: "Heap usage is {{ $value | humanizePercentage }}"

        # Pod not ready
        - alert: PodNotReady
          expr: |
            kube_pod_status_ready{namespace="<namespace>",pod=~"online-boutique-.*",condition="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "online-boutique pod not ready"
            description: "Pod {{ $labels.pod }} not ready for 5 minutes"
```
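
One caveat: the `HighLatency` rule queries `http_server_requests_seconds_bucket`, but the sample output earlier shows this metric published as a summary (`_count`/`_sum` only). Bucket series appear only when histogram buckets are enabled; in Spring Boot that is typically done with the following property (verify the exact path against your Spring Boot/Micrometer version):

```yaml
management:
  metrics:
    distribution:
      percentiles-histogram:
        http:
          server:
            requests: true
```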

Apply:

```bash
kubectl apply -f prometheus-rules.yaml
```

### Grafana Alerts

Configure alerts in Grafana dashboard panels:

1. Edit panel
2. Click **Alert** tab
3. Set conditions (e.g., "when avg() of query(A) is above 0.8")
4. Configure notification channels (Slack, email, PagerDuty)

### Alert Testing

Trigger test alerts:

```bash
# Generate errors (note: these return 404s; the HighErrorRate rule counts 5xx only)
for i in {1..100}; do
  curl http://localhost:8080/api/nonexistent
done

# Trigger high latency
ab -n 10000 -c 100 http://localhost:8080/api/status

# Capture a heap dump for inspection (the actuator heapdump endpoint responds to GET)
curl -o /dev/null http://localhost:8080/actuator/heapdump
```

## Distributed Tracing (Future)

To add tracing with Jaeger/Zipkin:

1. Add dependencies:

   ```xml
   <dependency>
       <groupId>io.micrometer</groupId>
       <artifactId>micrometer-tracing-bridge-otel</artifactId>
   </dependency>
   <dependency>
       <groupId>io.opentelemetry</groupId>
       <artifactId>opentelemetry-exporter-zipkin</artifactId>
   </dependency>
   ```

2. Configure in `application.yml`:

   ```yaml
   management:
     tracing:
       sampling:
         probability: 1.0
     zipkin:
       tracing:
         endpoint: http://zipkin:9411/api/v2/spans
   ```

## Log Aggregation

For centralized logging:

1. **Loki**: Add Promtail to collect pod logs
2. **Grafana Logs**: Query logs alongside metrics
3. **Log Correlation**: Link traces to logs via trace ID

## Best Practices

1. **Metric Cardinality**: Avoid high-cardinality labels (user IDs, timestamps)
2. **Naming**: Follow Prometheus naming conventions (`_total`, `_seconds`, `_bytes`)
3. **Aggregation**: Use recording rules for expensive queries
4. **Retention**: Adjust retention period based on storage capacity
5. **Dashboarding**: Create business-specific dashboards for stakeholders
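
To see why cardinality matters: the series count for a metric multiplies across the distinct values of each label. With made-up numbers for a single metric name:

```shell
# series ≈ product of distinct values per label:
# 1000 user IDs x 10 endpoints x 5 status codes
awk 'BEGIN { print 1000 * 10 * 5 }'
# 50000 time series — enough to strain Prometheus, from one mislabeled metric
```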

## Troubleshooting

### Metrics Not Appearing

```bash
# Check if actuator is enabled
kubectl -n <namespace> exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator

# Check ServiceMonitor
kubectl -n <namespace> get servicemonitor online-boutique -o yaml

# Check Prometheus logs
kubectl -n monitoring logs -l app.kubernetes.io/name=prometheus --tail=100 | grep online-boutique
```

### High Memory Usage

```bash
# Take a heap dump (the heapdump endpoint responds to GET)
kubectl -n <namespace> exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator/heapdump --output heapdump.hprof

# Analyze with jmap, VisualVM, or Eclipse Memory Analyzer
```

### Slow Queries

Enable query logging in Prometheus:

```bash
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
# Access http://localhost:9090/graph
# Enable query stats in settings
```

## Next Steps

- [Review architecture](architecture.md)
- [Learn about deployment](deployment.md)
- [Return to overview](index.md)