# Monitoring & Observability

This guide covers monitoring online-boutique with Prometheus and Grafana.

## Overview

The Java Golden Path includes comprehensive observability:

- **Metrics**: Prometheus metrics via Spring Boot Actuator
- **Dashboards**: Pre-configured Grafana dashboard
- **Scraping**: Automatic discovery via ServiceMonitor
- **Retention**: 15 days of metrics storage

## Metrics Endpoint

Spring Boot Actuator exposes Prometheus metrics at:

```
http://<pod-ip>:8080/actuator/prometheus
```
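
By default Spring Boot exposes only a few Actuator endpoints over HTTP, so the Prometheus endpoint has to be opted in. A minimal `application.yml` fragment, assuming `micrometer-registry-prometheus` is on the classpath (verify the property names against your Spring Boot version):

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
```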

### Verify Metrics Locally

```bash
curl http://localhost:8080/actuator/prometheus
```

### Sample Metrics Output

```
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 5.2428800E7

# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 42.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/api/status",} 0.351234567
```
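
The `_sum` and `_count` series above combine into a mean latency: total seconds divided by request count. A quick shell sanity check using the sample values:

```shell
# Mean request duration = seconds_sum / seconds_count
awk 'BEGIN { printf "%.4f\n", 0.351234567 / 42 }'
# 0.0084 — roughly 8.4 ms per request
```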

## Available Metrics

### HTTP Metrics

- `http_server_requests_seconds_count`: Total request count
- `http_server_requests_seconds_sum`: Total request duration
- **Labels**: method, status, uri, outcome, exception

### JVM Metrics

#### Memory
- `jvm_memory_used_bytes`: Current memory usage
- `jvm_memory_max_bytes`: Maximum memory available
- `jvm_memory_committed_bytes`: Committed memory
- **Areas**: heap, nonheap
- **Pools**: G1 Eden Space, G1 Old Gen, G1 Survivor Space

#### Garbage Collection
- `jvm_gc_pause_seconds_count`: GC pause count
- `jvm_gc_pause_seconds_sum`: Total GC pause time
- `jvm_gc_memory_allocated_bytes_total`: Total memory allocated
- `jvm_gc_memory_promoted_bytes_total`: Memory promoted to old gen

#### Threads
- `jvm_threads_live_threads`: Current live threads
- `jvm_threads_daemon_threads`: Current daemon threads
- `jvm_threads_peak_threads`: Peak thread count
- `jvm_threads_states_threads`: Threads by state (runnable, blocked, waiting)

#### CPU
- `process_cpu_usage`: Process CPU usage (0-1)
- `system_cpu_usage`: System CPU usage (0-1)
- `system_cpu_count`: Number of CPU cores

### Application Metrics

- `application_started_time_seconds`: Application start timestamp
- `application_ready_time_seconds`: Application ready timestamp
- `process_uptime_seconds`: Process uptime
- `process_files_open_files`: Open file descriptors

### Custom Metrics

Add custom metrics with Micrometer:

```java
@Autowired
private MeterRegistry meterRegistry;

// Counter
Counter.builder("business_operations")
    .tag("operation", "checkout")
    .register(meterRegistry)
    .increment();

// Gauge
Gauge.builder("active_users", this, obj -> obj.getActiveUsers())
    .register(meterRegistry);

// Timer
Timer.builder("api_processing_time")
    .tag("endpoint", "/api/process")
    .register(meterRegistry)
    .record(() -> {
        // Timed operation
    });
```

## Prometheus Configuration

### ServiceMonitor

Deployed automatically in `deploy/servicemonitor.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: online-boutique
  namespace: <namespace>
  labels:
    app: online-boutique
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: online-boutique
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 30s
```
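
Note that `port: http` in the endpoint refers to a *named* port on the Service, not a port number, so the Service must declare a port with that name. A sketch of the expected shape (the real Service manifest lives with the deployment, so treat this as illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: online-boutique
  labels:
    app: online-boutique     # must match the ServiceMonitor's matchLabels
spec:
  selector:
    app: online-boutique
  ports:
    - name: http             # referenced by the ServiceMonitor's `port: http`
      port: 8080
      targetPort: 8080
```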

### Verify Scraping

Check Prometheus targets:

1. Access Prometheus: `https://prometheus.kyndemo.live`
2. Navigate to **Status → Targets**
3. Find `online-boutique` in `monitoring/` namespace
4. Status should be **UP**

Or via kubectl:

```bash
# Port-forward Prometheus
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090

# Check targets API
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "online-boutique")'
```

### Query Metrics

Access the Prometheus UI and run queries:

```promql
# Request rate
rate(http_server_requests_seconds_count{job="online-boutique"}[5m])

# Average request duration
rate(http_server_requests_seconds_sum{job="online-boutique"}[5m])
  / rate(http_server_requests_seconds_count{job="online-boutique"}[5m])

# Error rate
sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))

# Memory usage
jvm_memory_used_bytes{job="online-boutique",area="heap"}
  / jvm_memory_max_bytes{job="online-boutique",area="heap"}
```
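
The error-rate expression is simply the 5xx request rate over the total request rate. With illustrative numbers (3 errors/s out of 60 req/s, made up for the example) the arithmetic is:

```shell
# error rate = 5xx rate / total rate, as a percentage
awk 'BEGIN { printf "%.1f%%\n", 100 * 3 / 60 }'
# 5.0%
```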

## Grafana Dashboard

### Access Dashboard

1. Open Grafana: `https://grafana.kyndemo.live`
2. Navigate to **Dashboards → Spring Boot Application**
3. Select `online-boutique` from the dropdown

### Dashboard Panels

#### HTTP Metrics
- **Request Rate**: Requests per second by endpoint
- **Request Duration**: Average, 95th, 99th percentile latency
- **Status Codes**: Breakdown of 2xx, 4xx, 5xx responses
- **Error Rate**: Percentage of failed requests

#### JVM Metrics
- **Heap Memory**: Used vs. max heap memory over time
- **Non-Heap Memory**: Metaspace, code cache, compressed class space
- **Garbage Collection**: GC pause frequency and duration
- **Thread Count**: Live threads, daemon threads, peak threads

#### System Metrics
- **CPU Usage**: Process and system CPU utilization
- **File Descriptors**: Open file count
- **Uptime**: Application uptime

### Custom Dashboards

Import dashboard JSON from `/k8s/monitoring/spring-boot-dashboard.json`:

1. Grafana → Dashboards → New → Import
2. Upload `spring-boot-dashboard.json`
3. Select the Prometheus data source
4. Click **Import**

## Alerting

### Prometheus Alerting Rules

Create alerting rules in Prometheus:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: online-boutique-alerts
  namespace: <namespace>
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: online-boutique
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            sum(rate(http_server_requests_seconds_count{job="online-boutique",status=~"5.."}[5m]))
              / sum(rate(http_server_requests_seconds_count{job="online-boutique"}[5m]))
              > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on online-boutique"
            description: "Error rate is {{ $value | humanizePercentage }}"

        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_server_requests_seconds_bucket{job="online-boutique"}[5m])) by (le)
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency on online-boutique"
            description: "95th percentile latency is {{ $value }}s"

        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            jvm_memory_used_bytes{job="online-boutique",area="heap"}
              / jvm_memory_max_bytes{job="online-boutique",area="heap"}
              > 0.90
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High memory usage on online-boutique"
            description: "Heap usage is {{ $value | humanizePercentage }}"

        # Pod not ready
        - alert: PodNotReady
          expr: |
            kube_pod_status_ready{namespace="<namespace>",pod=~"online-boutique-.*",condition="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "online-boutique pod not ready"
            description: "Pod {{ $labels.pod }} not ready for 5 minutes"
```
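
One caveat: the `HighLatency` rule queries `http_server_requests_seconds_bucket`, but the sample output earlier shows this metric published as a summary (`_count`/`_sum` only). Bucket series appear only when histogram buckets are enabled; in Spring Boot that is typically done with the following property (verify the exact path against your Spring Boot/Micrometer version):

```yaml
management:
  metrics:
    distribution:
      percentiles-histogram:
        http:
          server:
            requests: true
```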

Apply:

```bash
kubectl apply -f prometheus-rules.yaml
```

### Grafana Alerts

Configure alerts in Grafana dashboard panels:

1. Edit panel
2. Click **Alert** tab
3. Set conditions (e.g., "when avg() of query(A) is above 0.8")
4. Configure notification channels (Slack, email, PagerDuty)

### Alert Testing

Trigger test alerts:

```bash
# Generate errors (note: these return 404s; the HighErrorRate rule counts 5xx only)
for i in {1..100}; do
  curl http://localhost:8080/api/nonexistent
done

# Trigger high latency
ab -n 10000 -c 100 http://localhost:8080/api/status

# Capture a heap dump for inspection (the actuator heapdump endpoint responds to GET)
curl -o /dev/null http://localhost:8080/actuator/heapdump
```

## Distributed Tracing (Future)

To add tracing with Jaeger/Zipkin:

1. Add dependencies:

   ```xml
   <dependency>
       <groupId>io.micrometer</groupId>
       <artifactId>micrometer-tracing-bridge-otel</artifactId>
   </dependency>
   <dependency>
       <groupId>io.opentelemetry</groupId>
       <artifactId>opentelemetry-exporter-zipkin</artifactId>
   </dependency>
   ```

2. Configure in `application.yml`:

   ```yaml
   management:
     tracing:
       sampling:
         probability: 1.0
     zipkin:
       tracing:
         endpoint: http://zipkin:9411/api/v2/spans
   ```

## Log Aggregation

For centralized logging:

1. **Loki**: Add Promtail to collect pod logs
2. **Grafana Logs**: Query logs alongside metrics
3. **Log Correlation**: Link traces to logs via trace ID

## Best Practices

1. **Metric Cardinality**: Avoid high-cardinality labels (user IDs, timestamps)
2. **Naming**: Follow Prometheus naming conventions (`_total`, `_seconds`, `_bytes`)
3. **Aggregation**: Use recording rules for expensive queries
4. **Retention**: Adjust retention period based on storage capacity
5. **Dashboarding**: Create business-specific dashboards for stakeholders
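
To see why cardinality matters: the series count for a metric multiplies across the distinct values of each label. With made-up numbers for a single metric name:

```shell
# series ≈ product of distinct values per label:
# 1000 user IDs x 10 endpoints x 5 status codes
awk 'BEGIN { print 1000 * 10 * 5 }'
# 50000 time series — enough to strain Prometheus, from one mislabeled metric
```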

## Troubleshooting

### Metrics Not Appearing

```bash
# Check if actuator is enabled
kubectl -n <namespace> exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator

# Check ServiceMonitor
kubectl -n <namespace> get servicemonitor online-boutique -o yaml

# Check Prometheus logs
kubectl -n monitoring logs -l app.kubernetes.io/name=prometheus --tail=100 | grep online-boutique
```

### High Memory Usage

```bash
# Take a heap dump (the heapdump endpoint responds to GET)
kubectl -n <namespace> exec -it deployment/online-boutique -- \
  curl http://localhost:8080/actuator/heapdump --output heapdump.hprof

# Analyze with jmap, VisualVM, or Eclipse Memory Analyzer
```

### Slow Queries

Enable query logging in Prometheus:

```bash
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
# Access http://localhost:9090/graph
# Enable query stats in settings
```

## Next Steps

- [Review architecture](architecture.md)
- [Learn about deployment](deployment.md)
- [Return to overview](index.md)