# Monitoring & Observability
This guide covers monitoring online-boutique with Prometheus and Grafana.
## Overview

The Java Golden Path includes comprehensive observability:

- **Metrics**: Prometheus metrics via Spring Boot Actuator
- **Dashboards**: Pre-configured Grafana dashboard
- **Scraping**: Automatic discovery via ServiceMonitor
- **Retention**: 15 days of metrics storage
## Metrics Endpoint

Spring Boot Actuator exposes Prometheus metrics at:

```
http://<pod-ip>:8080/actuator/prometheus
```
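The endpoint only exists when Actuator exposes it over the web and the `micrometer-registry-prometheus` dependency is on the classpath. A minimal `application.yml` sketch using the standard Spring Boot 3 property names (adjust to whatever exposure list the app already ships):

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
```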
### Verify Metrics Locally

```shell
curl http://localhost:8080/actuator/prometheus
```
### Sample Metrics Output

```
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 5.2428800E7

# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 42.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 0.351234567
```
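Because the exposition format is one sample per line of plain text, ordinary shell tools are enough for quick inspection. A small sketch, with the sample payload above inlined so it runs without a live endpoint (against the real app, pipe `curl -s http://localhost:8080/actuator/prometheus` instead):

```shell
# Inline a tiny sample of the scrape output shown above
cat <<'EOF' > /tmp/metrics.txt
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 5.2428800E7
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 42.0
EOF

# Print every series of one metric family (lines starting with its name)
grep '^jvm_memory_used_bytes' /tmp/metrics.txt
```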
## Available Metrics

### HTTP Metrics

- `http_server_requests_seconds_count`: Total request count
- `http_server_requests_seconds_sum`: Total request duration
- Labels: `method`, `status`, `uri`, `outcome`, `exception`
### JVM Metrics

#### Memory

- `jvm_memory_used_bytes`: Current memory usage
- `jvm_memory_max_bytes`: Maximum memory available
- `jvm_memory_committed_bytes`: Committed memory
- Areas: `heap`, `nonheap`
- Pools: G1 Eden Space, G1 Old Gen, G1 Survivor Space
#### Garbage Collection

- `jvm_gc_pause_seconds_count`: GC pause count
- `jvm_gc_pause_seconds_sum`: Total GC pause time
- `jvm_gc_memory_allocated_bytes_total`: Total memory allocated
- `jvm_gc_memory_promoted_bytes_total`: Memory promoted to old gen
#### Threads

- `jvm_threads_live_threads`: Current live threads
- `jvm_threads_daemon_threads`: Current daemon threads
- `jvm_threads_peak_threads`: Peak thread count
- `jvm_threads_states_threads`: Threads by state (runnable, blocked, waiting)
#### CPU

- `process_cpu_usage`: Process CPU usage (0-1)
- `system_cpu_usage`: System CPU usage (0-1)
- `system_cpu_count`: Number of CPU cores
### Application Metrics

- `application_started_time_seconds`: Application start timestamp
- `application_ready_time_seconds`: Application ready timestamp
- `process_uptime_seconds`: Process uptime
- `process_files_open_files`: Open file descriptors
## Custom Metrics

Add custom metrics with Micrometer:

```java
@Autowired
private MeterRegistry meterRegistry;

// Counter: monotonically increasing count of discrete events
Counter.builder("business_operations")
    .tag("operation", "checkout")
    .register(meterRegistry)
    .increment();

// Gauge: samples a current value on every scrape
Gauge.builder("active_users", this, obj -> obj.getActiveUsers())
    .register(meterRegistry);

// Timer: records both the duration and the count of an operation
Timer.builder("api_processing_time")
    .tag("endpoint", "/api/process")
    .register(meterRegistry)
    .record(() -> {
        // Timed operation
    });
```
## Prometheus Configuration

### ServiceMonitor

Deployed automatically in `deploy/servicemonitor.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: online-boutique
  namespace: <namespace>
  labels:
    app: online-boutique
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: online-boutique
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 30s
```
### Verify Scraping

Check Prometheus targets:

1. Access Prometheus: https://prometheus.kyndemo.live
2. Navigate to **Status → Targets**
3. Find `online-boutique` in the `monitoring/` namespace
4. Status should be **UP**
Or via kubectl:
```shell
# Port-forward Prometheus
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090

# Check targets API
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "online-boutique")'
```
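The `jq` filter can be tried offline against the shape of the targets API response. The JSON below is a minimal fabricated sample for illustration only, not real API output:

```shell
# Minimal stand-in for the /api/v1/targets response shape
cat <<'EOF' > /tmp/targets.json
{"data":{"activeTargets":[
  {"labels":{"job":"online-boutique"},"health":"up"},
  {"labels":{"job":"other-app"},"health":"down"}
]}}
EOF

# Select only our job's target entry, as in the curl pipeline above
jq '.data.activeTargets[] | select(.labels.job == "online-boutique")' /tmp/targets.json
```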
### Query Metrics

Access the Prometheus UI and run queries:

```promql
# Request rate
rate(http_server_requests_seconds_count{job="online-boutique"}[5m])

# Average request duration
  rate(http_server_requests_seconds_sum{job="online-boutique"}[5m])
/ rate(http_server_requests_seconds_count{job="online-boutique"}[5m])

# Error rate
  sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
/ sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))

# Memory usage (fraction of max heap)
  jvm_memory_used_bytes{job="online-boutique",area="heap"}
/ jvm_memory_max_bytes{job="online-boutique",area="heap"}
```
## Grafana Dashboard

### Access Dashboard

1. Open Grafana: https://grafana.kyndemo.live
2. Navigate to **Dashboards → Spring Boot Application**
3. Select `online-boutique` from the dropdown
### Dashboard Panels

#### HTTP Metrics

- **Request Rate**: Requests per second by endpoint
- **Request Duration**: Average, 95th, 99th percentile latency
- **Status Codes**: Breakdown of 2xx, 4xx, 5xx responses
- **Error Rate**: Percentage of failed requests

#### JVM Metrics

- **Heap Memory**: Used vs. max heap memory over time
- **Non-Heap Memory**: Metaspace, code cache, compressed class space
- **Garbage Collection**: GC pause frequency and duration
- **Thread Count**: Live threads, daemon threads, peak threads

#### System Metrics

- **CPU Usage**: Process and system CPU utilization
- **File Descriptors**: Open file count
- **Uptime**: Application uptime
### Custom Dashboards

Import the dashboard JSON from `/k8s/monitoring/spring-boot-dashboard.json`:

1. Grafana → **Dashboards → New → Import**
2. Upload `spring-boot-dashboard.json`
3. Select the Prometheus data source
4. Click **Import**
## Alerting

### Prometheus Alerting Rules

Create alerting rules in Prometheus:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-alerts
  namespace: <namespace>
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
            > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on online-boutique"
            description: "Error rate is {{ $value | humanizePercentage }}"

        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_server_requests_seconds_bucket{job="online-boutique"}[5m])) by (le)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on online-boutique"
            description: "95th percentile latency is {{ $value }}s"

        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            jvm_memory_used_bytes{job="online-boutique",area="heap"}
            / jvm_memory_max_bytes{job="online-boutique",area="heap"}
            > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High memory usage on online-boutique"
            description: "Heap usage is {{ $value | humanizePercentage }}"

        # Pod not ready
        - alert: PodNotReady
          expr: |
            kube_pod_status_ready{namespace="<namespace>",pod=~"online-boutique-.*",condition="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "online-boutique pod not ready"
            description: "Pod {{ $labels.pod }} not ready for 5 minutes"
```
Apply:

```shell
kubectl apply -f prometheus-rules.yaml
```
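Note that the HighLatency rule queries `http_server_requests_seconds_bucket`, and Micrometer publishes those bucket series only when histogram buckets are enabled for the HTTP server timer. A hedged `application.yml` sketch using the standard Spring Boot 3 property (verify against your Boot version):

```yaml
management:
  metrics:
    distribution:
      percentiles-histogram:
        "[http.server.requests]": true
```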
### Grafana Alerts

Configure alerts on Grafana dashboard panels:

1. Edit the panel
2. Click the **Alert** tab
3. Set conditions (e.g., "when avg() of query(A) is above 0.8")
4. Configure notification channels (Slack, email, PagerDuty)
### Alert Testing

Trigger test alerts:

```shell
# Generate errors
for i in {1..100}; do
  curl http://localhost:8080/api/nonexistent
done

# Trigger high latency
ab -n 10000 -c 100 http://localhost:8080/api/status

# Cause memory pressure (the heapdump endpoint responds to GET;
# producing the dump also forces a full GC pause)
curl http://localhost:8080/actuator/heapdump --output /dev/null
```
## Distributed Tracing (Future)

To add tracing with Jaeger/Zipkin:

1. Add dependencies:

   ```xml
   <dependency>
       <groupId>io.micrometer</groupId>
       <artifactId>micrometer-tracing-bridge-otel</artifactId>
   </dependency>
   <dependency>
       <groupId>io.opentelemetry</groupId>
       <artifactId>opentelemetry-exporter-zipkin</artifactId>
   </dependency>
   ```

2. Configure in `application.yml`:

   ```yaml
   management:
     tracing:
       sampling:
         probability: 1.0
     zipkin:
       tracing:
         endpoint: http://zipkin:9411/api/v2/spans
   ```
## Log Aggregation

For centralized logging:

- **Loki**: Add Promtail to collect pod logs
- **Grafana Logs**: Query logs alongside metrics
- **Log Correlation**: Link traces to logs via trace ID
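A minimal Promtail configuration sketch for the Loki option, assuming a Loki service reachable at `loki:3100` (service name, port, and labels are illustrative, not taken from this repo's manifests):

```yaml
server:
  http_listen_port: 9080
positions:
  filename: /run/promtail/positions.yaml
clients:
  # Push scraped log lines to Loki (assumed service name/port)
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Carry the pod's app label into Loki so logs can be
      # queried alongside metrics by the same label
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
```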
## Best Practices

- **Metric Cardinality**: Avoid high-cardinality labels (user IDs, timestamps)
- **Naming**: Follow Prometheus naming conventions (`_total`, `_seconds`, `_bytes`)
- **Aggregation**: Use recording rules for expensive queries
- **Retention**: Adjust the retention period based on storage capacity
- **Dashboarding**: Create business-specific dashboards for stakeholders
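The recording-rule advice can be sketched with the same PrometheusRule CRD used for alerting; the rule and resource names below are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-recording-rules
  namespace: <namespace>
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique.rules
      interval: 30s
      rules:
        # Precompute the error-rate ratio so dashboards and alerts
        # read one cheap series instead of re-running the division
        - record: job:http_server_requests_error_rate:ratio_rate5m
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
```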
## Troubleshooting

### Metrics Not Appearing

```shell
# Check if actuator is enabled
kubectl -n <namespace> exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator

# Check ServiceMonitor
kubectl -n <namespace> get servicemonitor online-boutique -o yaml

# Check Prometheus logs
kubectl -n monitoring logs -l app.kubernetes.io/name=prometheus --tail=100 | grep online-boutique
```
### High Memory Usage

```shell
# Take a heap dump (the heapdump endpoint responds to GET)
kubectl -n <namespace> exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator/heapdump --output heapdump.hprof

# Analyze with jmap/jhat or Eclipse Memory Analyzer
```
### Slow Queries

Enable query logging in Prometheus:

```shell
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
# Access http://localhost:9090/graph
# Enable query stats in settings
```