Metrics That Matter

November 10, 2025

Essential metrics for monitoring system health, performance, and business outcomes

Not all metrics are created equal. Focus on metrics that drive decisions, indicate health, and align with business objectives.

The Four Golden Signals (Google SRE)

1. Latency

Time to service a request.

What to Measure

  • Request duration (p50, p95, p99, p99.9)
  • Time to first byte (TTFB)
  • Database query time
  • API response time
  • Cache hit latency vs. miss latency

Example: Prometheus

# Request duration histogram
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)

# Latency of successful requests only (5xx excluded)
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])
)

Alerting Thresholds

alert: HighLatency
expr: |
  histogram_quantile(0.95,
    rate(http_request_duration_seconds_bucket[5m])
  ) > 1.0
for: 5m
annotations:
  summary: "95th percentile latency above 1s"

2. Traffic

Demand on your system.

What to Measure

  • Requests per second (RPS)
  • Concurrent connections
  • Bandwidth utilization
  • Transactions per second
  • Active users

Example: Prometheus

# Total request rate
sum(rate(http_requests_total[5m]))

# Request rate by endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# Traffic growth
rate(http_requests_total[5m]) /
rate(http_requests_total[5m] offset 1w)

Capacity Planning

# Current usage vs. capacity
# (max_capacity is a custom gauge you export or record yourself)
(
  rate(http_requests_total[5m])
  /
  on() group_left() max_capacity
) * 100

3. Errors

Rate of failed requests.

What to Measure

  • HTTP 5xx errors
  • HTTP 4xx errors (client errors)
  • Failed database transactions
  • Exception rate
  • Timeout rate

Example: Prometheus

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# SLO check: fires when 30d availability drops below target
# (99.9% availability = 0.1% errors allowed)
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) < 0.999

By Error Type

# Group errors
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)

4. Saturation

Resource utilization.

What to Measure

  • CPU utilization
  • Memory usage
  • Disk I/O
  • Network bandwidth
  • Queue depth
  • Thread pool utilization

Example: Prometheus

# CPU saturation
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory saturation
(
  node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
) / node_memory_MemTotal_bytes * 100

# Disk saturation (% of space used)
100 - (
  node_filesystem_avail_bytes{mountpoint="/"}
  / node_filesystem_size_bytes{mountpoint="/"} * 100
)

USE Method (Brendan Gregg)

For every resource, monitor:

Utilization

Percentage of time resource is busy.

# CPU utilization
avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)

# Network utilization
rate(node_network_receive_bytes_total[5m]) +
rate(node_network_transmit_bytes_total[5m])

Saturation

Degree of queued work.

# Load average (should be < number of CPUs)
node_load1

# Disk busy time (fraction of each second spent on I/O)
rate(node_disk_io_time_seconds_total[5m])
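
The "less than the number of CPUs" rule of thumb is easier to alert on when normalized per core. A minimal sketch, assuming stock node_exporter metrics:

# Load average per CPU (saturated when > 1)
node_load1
/ on (instance)
count by (instance) (node_cpu_seconds_total{mode="idle"})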

Errors

Count of error events.

# Disk errors (exporter-dependent; stock node_exporter does not export these)
rate(node_disk_read_errors_total[5m]) +
rate(node_disk_write_errors_total[5m])

# Network errors
rate(node_network_receive_errs_total[5m]) +
rate(node_network_transmit_errs_total[5m])

RED Method (Tom Wilkie)

For services and microservices:

Rate

Requests per second.

sum(rate(http_requests_total[5m])) by (service)

Errors

Failed requests per second.

sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

Duration

Request latency distribution.

histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

Application Metrics

Business Metrics

Directly measure business value.

E-commerce Example

# Revenue per minute (rate() is per-second, so scale by 60)
sum(rate(order_total_dollars[1m])) * 60

# Successful checkouts
sum(rate(checkout_complete_total[5m]))

# Cart abandonment rate
(
  sum(rate(cart_created_total[5m])) -
  sum(rate(checkout_complete_total[5m]))
) / sum(rate(cart_created_total[5m]))

# Average order value
sum(rate(order_total_dollars[1h])) /
sum(rate(order_count_total[1h]))

SaaS Example

# Active users
active_users

# New signups over the last hour
increase(user_signups_total[1h])

# Churn rate (fraction of active users cancelling over 30d)
increase(user_cancellations_total[30d]) /
active_users

# Feature usage
sum(rate(feature_usage_total[1h])) by (feature)

Apdex Score

User satisfaction metric (Application Performance Index).

# Apdex = (Satisfied + (Tolerating / 2)) / Total
# Satisfied: < 0.5s
# Tolerating: 0.5s - 2s
# Frustrated: > 2s
# Buckets are cumulative, so tolerating = (le="2.0") - (le="0.5")

(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) +
  (
    sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m])) -
    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  ) / 2
) / sum(rate(http_request_duration_seconds_count[5m]))

Infrastructure Metrics

Compute

# CPU usage by process
topk(10,
  sum(rate(process_cpu_seconds_total[5m])) by (process)
)

# Memory usage by container
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100

# Context switches (high = CPU thrashing)
rate(node_context_switches_total[5m])

Storage

# Disk usage
100 - (
  node_filesystem_avail_bytes{mountpoint="/"} /
  node_filesystem_size_bytes{mountpoint="/"}
) * 100

# Disk throughput
rate(node_disk_read_bytes_total[5m]) +
rate(node_disk_write_bytes_total[5m])

# IOPS
rate(node_disk_reads_completed_total[5m]) +
rate(node_disk_writes_completed_total[5m])

Network

# Bandwidth
rate(node_network_receive_bytes_total[5m]) * 8 / 1000000  # Mbps

# Packet loss
rate(node_network_receive_drop_total[5m]) +
rate(node_network_transmit_drop_total[5m])

# TCP connections
node_netstat_Tcp_CurrEstab

Database Metrics

Query Performance

# Query duration (histogram availability depends on your exporter setup)
histogram_quantile(0.95,
  rate(mysql_global_status_queries_duration_bucket[5m])
)

# Slow queries
rate(mysql_global_status_slow_queries[5m])

# Queries per second
rate(mysql_global_status_queries[5m])

Connection Pool

# Active connections
mysql_global_status_threads_connected

# Connection pool saturation
mysql_global_status_threads_connected /
mysql_global_variables_max_connections * 100

# Threads actively executing (spikes indicate contention)
mysql_global_status_threads_running

Cache Hit Rate

# Query cache hit rate
(
  rate(mysql_global_status_qcache_hits[5m]) /
  (
    rate(mysql_global_status_qcache_hits[5m]) +
    rate(mysql_global_status_qcache_inserts[5m])
  )
) * 100

Replication Lag

# Seconds behind master
mysql_slave_status_seconds_behind_master

Cache Metrics

Redis Example

# Hit rate
(
  rate(redis_keyspace_hits_total[5m]) /
  (
    rate(redis_keyspace_hits_total[5m]) +
    rate(redis_keyspace_misses_total[5m])
  )
) * 100

# Memory usage
redis_memory_used_bytes / redis_memory_max_bytes * 100

# Evictions (memory pressure)
rate(redis_evicted_keys_total[5m])

# Commands per second
rate(redis_commands_processed_total[5m])

Message Queue Metrics

RabbitMQ Example

# Queue depth
rabbitmq_queue_messages_ready

# Message rate
rate(rabbitmq_queue_messages_published_total[5m])

# Backlog per consumer (rising = consumers falling behind)
rabbitmq_queue_messages_ready / rabbitmq_queue_consumers

# Message age (oldest message in queue)
time() - rabbitmq_queue_head_message_timestamp

Kafka Example

# Consumer lag
kafka_consumer_lag

# Bytes in/out
rate(kafka_server_brokertopicmetrics_bytesin_total[5m])
rate(kafka_server_brokertopicmetrics_bytesout_total[5m])

# Under-replicated partitions (data at risk)
kafka_server_replicamanager_underreplicatedpartitions
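
The trend in lag often matters more than its absolute value. A minimal sketch, assuming kafka_consumer_lag is exported as a gauge (metric names vary by exporter):

# Sustained positive slope = consumers can't keep up
deriv(kafka_consumer_lag[15m]) > 0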

Kubernetes Metrics

Pod Metrics

# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Pod memory usage
sum(container_memory_usage_bytes) by (pod)

# Pod restart rate
rate(kube_pod_container_status_restarts_total[1h])

# Pods in crash loop
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}

Cluster Health

# Node status
kube_node_status_condition{condition="Ready",status="true"}

# Available vs. desired replicas
kube_deployment_status_replicas_available /
kube_deployment_spec_replicas

# Pending pods
kube_pod_status_phase{phase="Pending"}

SLI/SLO/SLA Metrics

Service Level Indicators (SLIs)

Metrics that matter to users.

# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[30d])) /
sum(rate(http_requests_total[30d]))

# Latency SLI (% of requests under threshold)
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d])) /
sum(rate(http_request_duration_seconds_count[30d]))

Error Budget

# Error budget remaining
# Target: 99.9% (0.1% errors allowed per 30d)
0.001 - (
  sum(rate(http_requests_total{status=~"5.."}[30d])) /
  sum(rate(http_requests_total[30d]))
)

# Error budget burn rate (how fast using budget)
(
  sum(rate(http_requests_total{status=~"5.."}[1h])) /
  sum(rate(http_requests_total[1h]))
) / 0.001
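
Burn rate becomes actionable with thresholds attached. A minimal sketch of a fast-burn check in the style of the Google SRE workbook's multi-window alerts, assuming the same 99.9%/30d target (a burn rate of 14.4 consumes roughly 2% of the monthly budget per hour):

# Fast burn: page when the hourly burn rate exceeds 14.4x
(
  sum(rate(http_requests_total{status=~"5.."}[1h])) /
  sum(rate(http_requests_total[1h]))
) / 0.001 > 14.4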

Custom Application Metrics

Instrumentation Example

Python (Prometheus Client)

from prometheus_client import Counter, Histogram, Gauge

# Counters
requests_total = Counter('http_requests_total', 'Total requests', ['method', 'endpoint', 'status'])
errors_total = Counter('http_errors_total', 'Total errors', ['type'])

# Histograms
request_duration = Histogram('http_request_duration_seconds', 'Request duration', ['endpoint'])

# Gauges
active_users = Gauge('active_users', 'Currently active users')

# Usage
@request_duration.labels(endpoint='/api/users').time()
def get_users():
    requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
    # ... handle request

Go

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"endpoint"},
    )
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
    timer := prometheus.NewTimer(requestDuration.WithLabelValues(r.URL.Path))
    defer timer.ObserveDuration()

    // ... handle request

    requestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
}

Metric Naming Conventions

Best Practices

Use consistent naming

<namespace>_<subsystem>_<name>_<unit>

Examples (prefer base units like seconds and bytes):
http_requests_total
http_request_duration_seconds
database_query_duration_seconds
cache_hits_total

Choose appropriate types

  • Counter: Monotonically increasing (requests, errors)
  • Gauge: Can go up/down (temperature, memory)
  • Histogram: Distribution (latency, request size)
  • Summary: Similar to histogram, pre-calculated quantiles
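
Each type is queried differently. A minimal sketch using metric names from earlier examples (the summary line shows what http_request_duration_seconds would look like if it were exported as a summary instead):

# Counter: always wrap in rate() or increase()
rate(http_requests_total[5m])

# Gauge: query directly, aggregate with avg/max/min
avg(active_users)

# Histogram: derive quantiles from the _bucket series
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# Summary: quantiles are pre-computed and exposed as a label
http_request_duration_seconds{quantile="0.99"}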

Cardinality Management

Avoid High Cardinality

Bad (unbounded labels):

# User ID as label = millions of time series!
http_requests_total{user_id="12345"}

Good (bounded labels):

# Group by user tier
http_requests_total{user_tier="premium"}

Label Guidelines

Good labels (low cardinality):

  • Environment (dev, staging, prod)
  • Region (us-east-1, eu-west-1)
  • Service name
  • HTTP method
  • Status code range (2xx, 4xx, 5xx)

Bad labels (high cardinality):

  • User IDs
  • Email addresses
  • Full URLs
  • Timestamps
  • UUIDs
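
To find offenders in a running Prometheus, count series per metric name. A minimal sketch (it scans every series, so run it ad hoc rather than on a dashboard):

# Top 10 metric names by number of time series
topk(10, count by (__name__) ({__name__=~".+"}))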

Metrics Storage and Retention

Retention Strategy

# Prometheus example (prometheus.yml)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Retention is set via command-line flags, not the config file:
#   --storage.tsdb.retention.time=30d
#   --storage.tsdb.retention.size=50GB

remote_write:
  - url: https://long-term-storage/api/v1/write
    write_relabel_configs:
      # Only keep important metrics long-term
      - source_labels: [__name__]
        regex: '(up|http_requests_total|error_budget_.*)'
        action: keep

Downsampling

# Record pre-aggregated series; long-term storage keeps these
# while raw data ages out (Prometheus rules file format)
groups:
  - name: downsampling
    rules:
      - record: http_requests:rate5m
        expr: rate(http_requests_total[5m])

      - record: http_requests:rate1h
        expr: rate(http_requests_total[1h])

Dashboards

Key Principles

  1. Purpose-driven: Different audiences, different dashboards
  2. Top-down: Start with high-level, drill down
  3. Signal-to-noise: Show what matters
  4. Actionable: What should I do?

Dashboard Examples

Executive Dashboard

  • SLA compliance
  • Revenue metrics
  • User growth
  • Critical incidents

Engineering Dashboard

  • Golden signals (latency, traffic, errors, saturation)
  • Deployment markers
  • Error budget
  • Top errors

On-call Dashboard

  • Active alerts
  • Recent deployments
  • Service health
  • Resource utilization

Alerting on Metrics

Good Alerts

# Symptom-based (what users experience)
alert: HighErrorRate
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[5m])) /
    sum(rate(http_requests_total[5m]))
  ) > 0.05
for: 5m
annotations:
  summary: "Error rate above 5%"

# Resource exhaustion (predictive)
alert: DiskWillFillIn4Hours
expr: |
  predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0
annotations:
  summary: "Disk will fill in 4 hours at current rate"

Bad Alerts

# Cause-based (doesn't affect users yet)
alert: HighCPU
expr: node_cpu_usage > 80
# Problem: High CPU doesn't always mean user impact

# Too sensitive
alert: AnyError
expr: rate(errors_total[1m]) > 0
# Problem: Noise, alert fatigue

Metrics Anti-Patterns

  1. Vanity Metrics: Look good but don’t drive decisions
  2. Too Many Metrics: Can’t see signal in noise
  3. No Baselines: Don’t know what’s normal
  4. Ignoring Distributions: Averages hide problems (see the sketch after this list)
  5. Alert on Everything: Alert fatigue
  6. No Context: Metrics without labels
  7. Stale Metrics: Not updating, outdated thresholds
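
On anti-pattern 4, a quick illustration using the request histogram from earlier examples:

# Mean latency: a slow 1% of requests barely moves it
sum(rate(http_request_duration_seconds_sum[5m])) /
sum(rate(http_request_duration_seconds_count[5m]))

# p99 latency: shows what the slowest 1% actually experience
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)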

Conclusion

Effective metrics enable:

  • Observability: Understand system behavior
  • Alerting: Know when things break
  • Capacity Planning: Predict future needs
  • Performance: Identify bottlenecks
  • Business Decisions: Data-driven choices

Remember:

  • Focus on user-facing metrics
  • Use percentiles, not averages
  • Monitor rate of change
  • Set meaningful SLOs
  • Alert on symptoms, not causes
  • Keep cardinality manageable
  • Dashboard for your audience

The best metric is one that helps you make better decisions.