# Monitoring & Observability
This guide covers monitoring online-boutique with Prometheus and Grafana.
## Overview
The Java Golden Path includes comprehensive observability:
- **Metrics**: Prometheus metrics via Spring Boot Actuator
- **Dashboards**: Pre-configured Grafana dashboard
- **Scraping**: Automatic discovery via ServiceMonitor
- **Retention**: 15 days of metrics storage
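The retention period is set on the Prometheus resource itself; a minimal sketch, assuming a prometheus-operator-managed instance named `kube-prometheus` in the `monitoring` namespace (both names are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus
  namespace: monitoring
spec:
  # Duration string: s, m, h, d, w, y
  retention: 15d
```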
## Metrics Endpoint
Spring Boot Actuator exposes Prometheus metrics at:
```
http://<pod-ip>:8080/actuator/prometheus
```
### Verify Metrics Locally
```bash
curl http://localhost:8080/actuator/prometheus
```
### Sample Metrics Output
```
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 5.2428800E7
# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 42.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 0.351234567
```
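Because the exposition format is line-oriented plain text, spot checks are easy to script; for example, pulling the request-count value out of the sample above with awk:

```shell
# Pull a single sample value out of Prometheus text-format output.
# In practice you would pipe `curl -s localhost:8080/actuator/prometheus`
# into awk; the here-doc below just replays the sample output shown above.
awk '$1 ~ /^http_server_requests_seconds_count/ { print $2 }' <<'EOF'
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 5.2428800E7
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 42.0
EOF
# → 42.0
```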
## Available Metrics
### HTTP Metrics
- `http_server_requests_seconds_count`: Total request count
- `http_server_requests_seconds_sum`: Total request duration
- **Labels**: method, status, uri, outcome, exception
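These labels support per-dimension breakdowns directly in PromQL; for example, request rate split by status code and endpoint (the `job` label matches the queries used later in this guide):

```promql
sum by (status, uri) (
  rate(http_server_requests_seconds_count{job="online-boutique"}[5m])
)
```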
### JVM Metrics
#### Memory
- `jvm_memory_used_bytes`: Current memory usage
- `jvm_memory_max_bytes`: Maximum memory available
- `jvm_memory_committed_bytes`: Committed memory
- **Areas**: heap, nonheap
- **Pools**: G1 Eden Space, G1 Old Gen, G1 Survivor Space
#### Garbage Collection
- `jvm_gc_pause_seconds_count`: GC pause count
- `jvm_gc_pause_seconds_sum`: Total GC pause time
- `jvm_gc_memory_allocated_bytes_total`: Total memory allocated
- `jvm_gc_memory_promoted_bytes_total`: Memory promoted to old gen
#### Threads
- `jvm_threads_live_threads`: Current live threads
- `jvm_threads_daemon_threads`: Current daemon threads
- `jvm_threads_peak_threads`: Peak thread count
- `jvm_threads_states_threads`: Threads by state (runnable, blocked, waiting)
#### CPU
- `process_cpu_usage`: Process CPU usage (0-1)
- `system_cpu_usage`: System CPU usage (0-1)
- `system_cpu_count`: Number of CPU cores
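A few illustrative queries over the JVM metrics above (sketches; the `job` label matches the queries used later in this guide):

```promql
# Fraction of wall-clock time spent in GC pauses over the last 5 minutes
rate(jvm_gc_pause_seconds_sum{job="online-boutique"}[5m])

# Live threads broken down by state
sum by (state) (jvm_threads_states_threads{job="online-boutique"})
```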
### Application Metrics
- `application_started_time_seconds`: Time taken to start the application
- `application_ready_time_seconds`: Time taken until ready to serve requests
- `process_uptime_seconds`: Process uptime
- `process_files_open_files`: Open file descriptors
### Custom Metrics
Add custom metrics with Micrometer:
```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.beans.factory.annotation.Autowired;

@Autowired
private MeterRegistry meterRegistry;

// Counter: monotonically increasing count of business events
Counter.builder("business_operations")
    .tag("operation", "checkout")
    .register(meterRegistry)
    .increment();

// Gauge: samples a value on demand
Gauge.builder("active_users", this, obj -> obj.getActiveUsers())
    .register(meterRegistry);

// Timer: records duration and count of an operation
Timer.builder("api_processing_time")
    .tag("endpoint", "/api/process")
    .register(meterRegistry)
    .record(() -> {
        // Timed operation
    });
```
## Prometheus Configuration
### ServiceMonitor
Deployed automatically in `deploy/servicemonitor.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: online-boutique
  namespace:
  labels:
    app: online-boutique
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: online-boutique
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 30s
```
### Verify Scraping
Check Prometheus targets:
1. Access Prometheus: `https://prometheus.kyndemo.live`
2. Navigate to **Status → Targets**
3. Find `online-boutique` in `monitoring/` namespace
4. Status should be **UP**
Or via kubectl:
```bash
# Port-forward Prometheus
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
# Check targets API
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "online-boutique")'
```
### Query Metrics
Access Prometheus UI and run queries:
```promql
# Request rate
rate(http_server_requests_seconds_count{job="online-boutique"}[5m])
# Average request duration
rate(http_server_requests_seconds_sum{job="online-boutique"}[5m])
/ rate(http_server_requests_seconds_count{job="online-boutique"}[5m])
# Error rate
sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
/ sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
# Memory usage
jvm_memory_used_bytes{job="online-boutique",area="heap"}
/ jvm_memory_max_bytes{job="online-boutique",area="heap"}
```
## Grafana Dashboard
### Access Dashboard
1. Open Grafana: `https://grafana.kyndemo.live`
2. Navigate to **Dashboards → Spring Boot Application**
3. Select `online-boutique` from dropdown
### Dashboard Panels
#### HTTP Metrics
- **Request Rate**: Requests per second by endpoint
- **Request Duration**: Average, 95th, 99th percentile latency
- **Status Codes**: Breakdown of 2xx, 4xx, 5xx responses
- **Error Rate**: Percentage of failed requests
#### JVM Metrics
- **Heap Memory**: Used vs. max heap memory over time
- **Non-Heap Memory**: Metaspace, code cache, compressed class space
- **Garbage Collection**: GC pause frequency and duration
- **Thread Count**: Live threads, daemon threads, peak threads
#### System Metrics
- **CPU Usage**: Process and system CPU utilization
- **File Descriptors**: Open file count
- **Uptime**: Application uptime
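Note that percentile panels need histogram buckets, which Spring Boot does not export for `http_server_requests` by default; assuming `management.metrics.distribution.percentiles-histogram.http.server.requests: true` is set in `application.yml`, a 95th-percentile panel query looks like:

```promql
histogram_quantile(0.95,
  sum by (le) (rate(http_server_requests_seconds_bucket{job="online-boutique"}[5m]))
)
```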
### Custom Dashboards
Import dashboard JSON from `/k8s/monitoring/spring-boot-dashboard.json`:
1. Grafana → Dashboards → New → Import
2. Upload `spring-boot-dashboard.json`
3. Select Prometheus data source
4. Click **Import**
## Alerting
### Prometheus Alerting Rules
Create alerting rules in Prometheus:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-alerts
  namespace:
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
            > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on online-boutique"
            description: "Error rate is {{ $value | humanizePercentage }}"
        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_server_requests_seconds_bucket{job="online-boutique"}[5m])) by (le)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on online-boutique"
            description: "95th percentile latency is {{ $value }}s"
        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            jvm_memory_used_bytes{job="online-boutique",area="heap"}
            / jvm_memory_max_bytes{job="online-boutique",area="heap"}
            > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High memory usage on online-boutique"
            description: "Heap usage is {{ $value | humanizePercentage }}"
        # Pod not ready
        - alert: PodNotReady
          expr: |
            kube_pod_status_ready{namespace="",pod=~"online-boutique-.*",condition="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "online-boutique pod not ready"
            description: "Pod {{ $labels.pod }} not ready for 5 minutes"
```
Apply:
```bash
kubectl apply -f prometheus-rules.yaml
```
### Grafana Alerts
Configure alerts in Grafana dashboard panels:
1. Edit panel
2. Click **Alert** tab
3. Set conditions (e.g., "when avg() of query(A) is above 0.8")
4. Configure notification channels (Slack, email, PagerDuty)
### Alert Testing
Trigger test alerts:
```bash
# Generate client errors (note: the HighErrorRate alert counts 5xx, so 404s will not trigger it)
for i in {1..100}; do
  curl -s -o /dev/null http://localhost:8080/api/nonexistent
done
# Trigger high latency
ab -n 10000 -c 100 http://localhost:8080/api/status
# Cause memory pressure (the heap dump endpoint forces a full GC)
curl http://localhost:8080/actuator/heapdump -o heapdump.hprof
```
## Distributed Tracing (Future)
To add tracing with Jaeger/Zipkin:
1. Add dependency:
```xml
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-exporter-zipkin</artifactId>
</dependency>
```
2. Configure in `application.yml`:
```yaml
management:
  tracing:
    sampling:
      probability: 1.0
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans
```
## Log Aggregation
For centralized logging:
1. **Loki**: Add Promtail to collect pod logs
2. **Grafana Logs**: Query logs alongside metrics
3. **Log Correlation**: Link traces to logs via trace ID
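For the log-correlation point, a common pattern is to put the trace ID into every log line so Loki queries can be joined to traces; a sketch of a `logback-spring.xml` console pattern, assuming Micrometer Tracing is on the classpath and populates `traceId`/`spanId` in the MDC:

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <!-- %X{traceId:-} / %X{spanId:-} emit empty strings when no trace is active -->
      <pattern>%d{ISO8601} %-5level [%X{traceId:-},%X{spanId:-}] %logger - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```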
## Best Practices
1. **Metric Cardinality**: Avoid high-cardinality labels (user IDs, timestamps)
2. **Naming**: Follow Prometheus naming conventions (`_total`, `_seconds`, `_bytes`)
3. **Aggregation**: Use recording rules for expensive queries
4. **Retention**: Adjust retention period based on storage capacity
5. **Dashboarding**: Create business-specific dashboards for stakeholders
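For the aggregation point, a recording rule precomputes an expensive expression so dashboards and alerts query the cheap result; a sketch reusing this guide's error-rate query (the rule and metric names are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-recording-rules
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique.rules
      interval: 30s
      rules:
        # Precomputed 5xx ratio; dashboards query this series directly
        - record: job:http_server_requests:error_ratio_5m
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
```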
## Troubleshooting
### Metrics Not Appearing
```bash
# Check if actuator is enabled
kubectl -n <namespace> exec -it deployment/online-boutique -- \
curl http://localhost:8080/actuator
# Check ServiceMonitor
kubectl -n <namespace> get servicemonitor online-boutique -o yaml
# Check Prometheus logs
kubectl -n monitoring logs -l app.kubernetes.io/name=prometheus --tail=100 | grep online-boutique
```
### High Memory Usage
```bash
# Take heap dump (the file is written inside the pod)
kubectl -n <namespace> exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator/heapdump --output heapdump.hprof
# Copy it locally with kubectl cp, then analyze with Eclipse Memory Analyzer (MAT) or VisualVM
```
### Slow Queries
Enable the query log in Prometheus (set `query_log_file` in the Prometheus configuration), then inspect slow queries:
```bash
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
# Access http://localhost:9090/graph
# The UI shows load time and result-series stats for each query
```
## Next Steps
- [Review architecture](architecture.md)
- [Learn about deployment](deployment.md)
- [Return to overview](index.md)