Not all metrics are created equal. Focus on metrics that drive decisions, indicate health, and align with business objectives.
## The Four Golden Signals (Google SRE)

### 1. Latency

The time it takes to service a request.
#### What to Measure

- Request duration (p50, p95, p99, p99.9)
- Time to first byte (TTFB)
- Database query time
- API response time
- Cache hit latency vs. miss latency
#### Example: Prometheus

```promql
# Request duration histogram (95th percentile)
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)

# Separate successful vs. failed latency
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])
)
```
#### Alerting Thresholds

```yaml
alert: HighLatency
expr: |
  histogram_quantile(0.95,
    rate(http_request_duration_seconds_bucket[5m])
  ) > 1.0
for: 5m
annotations:
  summary: "95th percentile latency above 1s"
```
### 2. Traffic

The demand placed on your system.

#### What to Measure

- Requests per second (RPS)
- Concurrent connections
- Bandwidth utilization
- Transactions per second
- Active users
#### Example: Prometheus

```promql
# Total request rate
rate(http_requests_total[5m])

# Request rate by endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# Traffic growth (week over week)
rate(http_requests_total[5m]) /
rate(http_requests_total[5m] offset 1w)
```
#### Capacity Planning

```promql
# Current usage vs. capacity
# (assumes a max_capacity series is exported by your own tooling)
(
  rate(http_requests_total[5m])
  /
  on() group_left() max_capacity
) * 100
```
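The `max_capacity` series in that query is not a standard exporter metric, so the arithmetic is worth pinning down. A minimal sketch of the headroom calculation behind it (pure Python; function names and the 80% threshold are illustrative assumptions):

```python
def capacity_usage_percent(current_rps: float, capacity_rps: float) -> float:
    """Current usage as a percentage of provisioned capacity.

    Mirrors the PromQL above: rate(http_requests_total) / max_capacity * 100.
    """
    if capacity_rps <= 0:
        raise ValueError("capacity must be positive")
    return current_rps / capacity_rps * 100


def needs_scale_up(current_rps: float, capacity_rps: float,
                   threshold_percent: float = 80.0) -> bool:
    """Flag when usage crosses a scale-up threshold (80% is an assumption)."""
    return capacity_usage_percent(current_rps, capacity_rps) >= threshold_percent
```

The same threshold would typically live in an alerting rule rather than application code; the function form just makes the math testable.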
### 3. Errors

The rate of failed requests.

#### What to Measure

- HTTP 5xx errors
- HTTP 4xx errors (client errors)
- Failed database transactions
- Exception rate
- Timeout rate
#### Example: Prometheus

```promql
# Error rate (fraction of requests failing)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# SLO check: fires when 30d availability drops below 99.9%
# (99.9% availability = 0.1% error budget)
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) < 0.999
```
#### By Error Type

```promql
# Group errors by status code
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)
```
### 4. Saturation

How "full" your resources are.

#### What to Measure

- CPU utilization
- Memory usage
- Disk I/O
- Network bandwidth
- Queue depth
- Thread pool utilization
#### Example: Prometheus

```promql
# CPU saturation
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory saturation
(
  node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
) / node_memory_MemTotal_bytes * 100

# Disk saturation (space used, %)
100 - (
  node_filesystem_avail_bytes{mountpoint="/"}
  / node_filesystem_size_bytes{mountpoint="/"} * 100
)
```
## USE Method (Brendan Gregg)

For every resource, monitor:

### Utilization

The percentage of time the resource is busy.
```promql
# CPU utilization
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)

# Network utilization (bytes/s in + out)
rate(node_network_receive_bytes_total[5m]) +
rate(node_network_transmit_bytes_total[5m])
```
### Saturation

The degree of queued work.

```promql
# Load average (should stay below the number of CPUs)
node_load1

# Disk I/O wait
rate(node_disk_io_time_seconds_total[5m])
```
### Errors

The count of error events.

```promql
# Disk errors
rate(node_disk_read_errors_total[5m]) +
rate(node_disk_write_errors_total[5m])

# Network errors
rate(node_network_receive_errs_total[5m]) +
rate(node_network_transmit_errs_total[5m])
```
## RED Method (Tom Wilkie)

For services and microservices:

### Rate

Requests per second.

```promql
sum(rate(http_requests_total[5m])) by (service)
```

### Errors

Failed requests per second.

```promql
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
```

### Duration

The distribution of request latency.

```promql
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
```
## Application Metrics

### Business Metrics

Directly measure business value.

#### E-commerce Example

```promql
# Revenue rate in dollars/s (1m window)
sum(rate(order_total_dollars[1m]))

# Successful checkouts
sum(rate(checkout_complete_total[5m]))

# Cart abandonment rate
(
  sum(rate(cart_created_total[5m])) -
  sum(rate(checkout_complete_total[5m]))
) / sum(rate(cart_created_total[5m]))

# Average order value
sum(rate(order_total_dollars[1h])) /
sum(rate(order_count_total[1h]))
```
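The arithmetic behind the abandonment and average-order-value queries is easy to sanity-check offline. A small sketch (pure Python; the function names are illustrative, not part of any library):

```python
def cart_abandonment_rate(carts_created: float, checkouts_complete: float) -> float:
    """(carts - checkouts) / carts, as in the PromQL above."""
    if carts_created == 0:
        return 0.0  # no carts: nothing to abandon
    return (carts_created - checkouts_complete) / carts_created


def average_order_value(revenue_dollars: float, order_count: float) -> float:
    """Total revenue divided by order count over the same window."""
    if order_count == 0:
        return 0.0
    return revenue_dollars / order_count
```

Note that both ratios only make sense when numerator and denominator cover the same time window, which is why the PromQL versions use matching ranges.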
#### SaaS Example

```promql
# Active users
active_users_total

# New signups (per-second rate, 1h window)
rate(user_signups_total[1h])

# Churn rate
rate(user_cancellations_total[30d]) /
active_users_total

# Feature usage
sum(rate(feature_usage_total[1h])) by (feature)
```
### Apdex Score

A user-satisfaction metric (Application Performance Index).

```promql
# Apdex = (Satisfied + Tolerating / 2) / Total
#   Satisfied:  <= 0.5s
#   Tolerating: 0.5s - 2s
#   Frustrated: > 2s
# Histogram buckets are cumulative, so Tolerating = le="2.0" minus le="0.5";
# substituting, the formula simplifies to (le="0.5" + le="2.0") / 2 / total.
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) +
  sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count[5m]))
```
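The same score computed from raw durations, useful for checking the query against a sample (a sketch using the thresholds defined above):

```python
def apdex(durations_seconds, satisfied=0.5, tolerating=2.0):
    """Apdex = (satisfied + tolerating/2) / total, per the thresholds above."""
    if not durations_seconds:
        return 1.0  # no traffic: conventionally treated as fully satisfied
    s = sum(1 for d in durations_seconds if d <= satisfied)
    t = sum(1 for d in durations_seconds if satisfied < d <= tolerating)
    return (s + t / 2) / len(durations_seconds)
```

For example, `[0.1, 0.3, 1.0, 3.0]` has two satisfied, one tolerating, one frustrated request, giving (2 + 0.5) / 4 = 0.625.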
## Infrastructure Metrics

### Compute

```promql
# Top 10 processes by CPU usage
topk(10,
  sum(rate(process_cpu_seconds_total[5m])) by (process)
)

# Memory usage by container (% of limit)
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100

# Context switches (a high rate suggests CPU thrashing)
rate(node_context_switches_total[5m])
```
### Storage

```promql
# Disk usage (%)
100 - (
  node_filesystem_avail_bytes{mountpoint="/"} /
  node_filesystem_size_bytes{mountpoint="/"}
) * 100

# Disk throughput (bytes/s)
rate(node_disk_read_bytes_total[5m]) +
rate(node_disk_write_bytes_total[5m])

# IOPS
rate(node_disk_reads_completed_total[5m]) +
rate(node_disk_writes_completed_total[5m])
```
### Network

```promql
# Bandwidth (Mbps)
rate(node_network_receive_bytes_total[5m]) * 8 / 1000000

# Packet drops
rate(node_network_receive_drop_total[5m]) +
rate(node_network_transmit_drop_total[5m])

# Established TCP connections
node_netstat_Tcp_CurrEstab
```
## Database Metrics

### Query Performance

```promql
# Query duration (95th percentile; assumes a query-duration histogram is
# exported — mysqld_exporter does not provide one out of the box)
histogram_quantile(0.95,
  rate(mysql_global_status_queries_duration_bucket[5m])
)

# Slow queries
rate(mysql_global_status_slow_queries[5m])

# Queries per second
rate(mysql_global_status_queries[5m])
```
### Connection Pool

```promql
# Active connections
mysql_global_status_threads_connected

# Connection pool saturation (%)
mysql_global_status_threads_connected /
mysql_global_variables_max_connections * 100

# Running (non-idle) threads
mysql_global_status_threads_running
```
### Cache Hit Rate

```promql
# Query cache hit rate (%)
(
  rate(mysql_global_status_qcache_hits[5m]) /
  (
    rate(mysql_global_status_qcache_hits[5m]) +
    rate(mysql_global_status_qcache_inserts[5m])
  )
) * 100
```

### Replication Lag

```promql
# Seconds behind master
mysql_slave_status_seconds_behind_master
```
## Cache Metrics

### Redis Example

```promql
# Hit rate (%)
(
  rate(redis_keyspace_hits_total[5m]) /
  (
    rate(redis_keyspace_hits_total[5m]) +
    rate(redis_keyspace_misses_total[5m])
  )
) * 100

# Memory usage (% of maxmemory)
redis_memory_used_bytes / redis_memory_max_bytes * 100

# Evictions (a nonzero rate signals memory pressure)
rate(redis_evicted_keys_total[5m])

# Commands per second
rate(redis_commands_processed_total[5m])
```
## Message Queue Metrics

### RabbitMQ Example

```promql
# Queue depth
rabbitmq_queue_messages_ready

# Publish rate
rate(rabbitmq_queue_messages_published_total[5m])

# Consumers per ready message (rough proxy for consumer capacity)
rabbitmq_queue_consumers / rabbitmq_queue_messages_ready

# Age of the oldest message in the queue
time() - rabbitmq_queue_head_message_timestamp
```

### Kafka Example

```promql
# Consumer lag (exact metric name depends on your lag exporter)
kafka_consumer_lag

# Bytes in/out
rate(kafka_server_brokertopicmetrics_bytesin_total[5m])
rate(kafka_server_brokertopicmetrics_bytesout_total[5m])

# Under-replicated partitions (data at risk)
kafka_server_replicamanager_underreplicatedpartitions
```
## Kubernetes Metrics

### Pod Metrics

```promql
# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Pod memory usage
sum(container_memory_usage_bytes) by (pod)

# Pod restart rate
rate(kube_pod_container_status_restarts_total[1h])

# Pods in a crash loop
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}
```

### Cluster Health

```promql
# Node readiness
kube_node_status_condition{condition="Ready",status="true"}

# Available vs. desired replicas
kube_deployment_status_replicas_available /
kube_deployment_spec_replicas

# Pending pods
kube_pod_status_phase{phase="Pending"}
```
## SLI/SLO/SLA Metrics

### Service Level Indicators (SLIs)

Metrics that matter to users.

```promql
# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[30d])) /
sum(rate(http_requests_total[30d]))

# Latency SLI (% of requests under the 0.5s threshold)
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d])) /
sum(rate(http_request_duration_seconds_count[30d]))
```
### Error Budget

```promql
# Error budget remaining (recording-rule style; the assignment is pseudocode)
# Target: 99.9% availability = 0.1% errors allowed per 30d
error_budget_remaining = 0.001 - (
  sum(rate(http_requests_total{status=~"5.."}[30d])) /
  sum(rate(http_requests_total[30d]))
)

# Error budget burn rate (how fast the budget is being consumed;
# 1 = exactly on budget, 10 = the 30d budget gone in ~3 days)
(
  sum(rate(http_requests_total{status=~"5.."}[1h])) /
  sum(rate(http_requests_total[1h]))
) / 0.001
```
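To make the burn-rate number concrete: a 99.9% SLO over 30 days allows about 43.2 minutes of total unavailability, and a burn rate of 10 exhausts that budget in roughly 3 days. A sketch of the arithmetic (pure Python, illustrative names):

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Minutes of total unavailability the SLO allows over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast the budget is burning: 1.0 means exactly on budget."""
    return observed_error_ratio / (1 - slo)
```

For example, `error_budget_minutes(0.999)` is 43.2 minutes, and an observed 1% error ratio against a 99.9% SLO is a burn rate of 10.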
## Custom Application Metrics

### Instrumentation Example

#### Python (Prometheus Client)

```python
from prometheus_client import Counter, Histogram, Gauge

# Counters
requests_total = Counter('http_requests_total', 'Total requests',
                         ['method', 'endpoint', 'status'])
errors_total = Counter('http_errors_total', 'Total errors', ['type'])

# Histograms
request_duration = Histogram('http_request_duration_seconds',
                             'Request duration', ['endpoint'])

# Gauges
active_users = Gauge('active_users', 'Currently active users')

# Usage: time the handler and count the request
@request_duration.labels(endpoint='/api/users').time()
def get_users():
    requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
    # ... handle request
```
#### Go

```go
import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"endpoint"},
    )
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
    // Observe duration when the handler returns.
    timer := prometheus.NewTimer(requestDuration.WithLabelValues(r.URL.Path))
    defer timer.ObserveDuration()

    // ... handle request

    requestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
}
```
## Metric Naming Conventions

### Best Practices

Use consistent naming:

```
<namespace>_<subsystem>_<name>_<unit>
```

Examples (note the Prometheus convention of base units such as seconds, not milliseconds):

```
http_requests_total
http_request_duration_seconds
database_query_duration_seconds
cache_hits_total
```

Choose appropriate types:

- Counter: Monotonically increasing (requests, errors)
- Gauge: Can go up or down (temperature, memory)
- Histogram: Distribution (latency, request size)
- Summary: Similar to a histogram, with pre-calculated quantiles
## Cardinality Management

### Avoid High Cardinality

Bad (unbounded labels):

```promql
# A user ID label creates millions of time series!
http_requests_total{user_id="12345"}
```

Good (bounded labels):

```promql
# Group by user tier instead
http_requests_total{user_tier="premium"}
```
### Label Guidelines

Good labels (low cardinality):

- Environment (dev, staging, prod)
- Region (us-east-1, eu-west-1)
- Service name
- HTTP method
- Status code range (2xx, 4xx, 5xx)

Bad labels (high cardinality):

- User IDs
- Email addresses
- Full URLs
- Timestamps
- UUIDs
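A common way to keep status-code cardinality bounded is to collapse codes into their class before labeling. A minimal sketch (the helper name is hypothetical):

```python
def status_class(code: int) -> str:
    """Collapse an HTTP status code into its class, e.g. 404 -> "4xx"."""
    if not 100 <= code <= 599:
        return "unknown"  # guard against malformed codes
    return f"{code // 100}xx"
```

At instrumentation time this would be used as something like `requests_total.labels(status=status_class(code)).inc()`, which caps the label at a handful of values instead of one per distinct code.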
## Metrics Storage and Retention

### Retention Strategy

```yaml
# Prometheus example
# (in stock Prometheus, retention is set via the
# --storage.tsdb.retention.time / .size flags; shown inline for brevity)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

storage:
  tsdb:
    retention:
      time: 30d
      size: 50GB

remote_write:
  - url: https://long-term-storage/api/v1/write
    write_relabel_configs:
      # Only keep important metrics long-term
      - source_labels: [__name__]
        regex: '(up|http_requests_total|error_budget_.*)'
        action: keep
```
### Downsampling

```yaml
# Keep raw data for 7d; precompute aggregates via recording rules
groups:
  - name: downsampling
    rules:
      - record: http_requests:rate5m
        expr: rate(http_requests_total[5m])
      - record: http_requests:rate1h
        expr: rate(http_requests_total[1h])
```
## Dashboards

### Key Principles

- Purpose-driven: Different audiences need different dashboards
- Top-down: Start high-level, then drill down
- Signal-to-noise: Show only what matters
- Actionable: Answer "what should I do?"

### Dashboard Examples

#### Executive Dashboard

- SLA compliance
- Revenue metrics
- User growth
- Critical incidents

#### Engineering Dashboard

- Golden signals (latency, traffic, errors, saturation)
- Deployment markers
- Error budget
- Top errors

#### On-call Dashboard

- Active alerts
- Recent deployments
- Service health
- Resource utilization
## Alerting on Metrics

### Good Alerts

```yaml
# Symptom-based (what users experience)
alert: HighErrorRate
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[5m])) /
    sum(rate(http_requests_total[5m]))
  ) > 0.05
for: 5m
annotations:
  summary: "Error rate above 5%"
```

```yaml
# Resource exhaustion (predictive)
alert: DiskWillFillIn4Hours
expr: |
  predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0
annotations:
  summary: "Disk will fill within 4 hours at the current rate"
```

### Bad Alerts

```yaml
# Cause-based (doesn't affect users yet)
alert: HighCPU
expr: node_cpu_usage > 80
# Problem: high CPU doesn't always mean user impact
```

```yaml
# Too sensitive
alert: AnyError
expr: rate(errors_total[1m]) > 0
# Problem: noise and alert fatigue
```
## Metrics Anti-Patterns

- Vanity Metrics: Look good but don't drive decisions
- Too Many Metrics: Can't see the signal in the noise
- No Baselines: Don't know what's normal
- Ignoring Distributions: Averages hide problems
- Alert on Everything: Alert fatigue
- No Context: Metrics without labels
- Stale Metrics: Not updated; outdated thresholds
## Conclusion

Effective metrics enable:

- Observability: Understand system behavior
- Alerting: Know when things break
- Capacity Planning: Predict future needs
- Performance: Identify bottlenecks
- Business Decisions: Data-driven choices

Remember:

- Focus on user-facing metrics
- Use percentiles, not averages
- Monitor rate of change
- Set meaningful SLOs
- Alert on symptoms, not causes
- Keep cardinality manageable
- Dashboard for your audience

The best metric is the one that helps you make better decisions.