Good alerting wakes you up only when something is both wrong and actionable. Bad alerting wakes you up all the time, until you start ignoring it.
Principles of Good Alerting
1. Alert on Symptoms, Not Causes
Bad (cause-based):
alert: HighCPU
expr: node_cpu_usage_percent > 80
Problem: High CPU doesn’t necessarily mean user impact.
Good (symptom-based):
alert: HighLatency
expr: |
  histogram_quantile(0.95,
    rate(http_request_duration_seconds_bucket[5m])
  ) > 1.0
for: 5m
Why: Users experience slow responses; latency is what they actually feel.
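As an aside on how such a percentile is estimated: a minimal Python sketch of the linear-interpolation approach that histogram_quantile applies to cumulative bucket counts. The bucket bounds and counts below are made up for illustration.

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count).
    Finds the bucket containing the target rank, then interpolates
    linearly between that bucket's bounds."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 60 under 0.5s, 90 under 1.0s, all under 2.0s.
p95 = histogram_quantile(0.95, [(0.5, 60), (1.0, 90), (2.0, 100)])
# p95 lands halfway into the (1.0, 2.0] bucket: 1.5s
```

The estimate is only as fine-grained as the bucket layout, which is why latency SLO thresholds should sit close to a real bucket boundary.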
2. Every Alert Must Be Actionable
Questions to ask:
- Can I do something about this right now?
- Does this require immediate human intervention?
- What specific action should I take?
Bad Alert:
alert: DiskUsageHigh
expr: disk_usage_percent > 70
annotations:
  summary: "Disk usage is high"
Good Alert:
alert: DiskWillFillIn4Hours
expr: |
  predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0
annotations:
  summary: "Disk {{ $labels.device }} will fill in ~4 hours"
  description: |
    Predicted free space in 4h: {{ $value | humanize }}
    Action: Clean up logs or expand the disk
  runbook_url: "https://runbooks.example.com/disk-full"
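predict_linear is just a least-squares line extrapolated forward. A rough Python model of the same idea, with a hypothetical sample series:

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate seconds_ahead past the last sample, roughly what
    PromQL's predict_linear() does."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / \
            sum((t - mean_t) ** 2 for t, _ in samples)
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept

GIB = 1024 ** 3
# Free space dropping 1 GiB/hour from 10 GiB, sampled hourly for 4 hours.
samples = [(h * 3600, (10 - h) * GIB) for h in range(4)]
in_4h = predict_linear(samples, 4 * 3600)  # ~3 GiB left: no alert yet
in_8h = predict_linear(samples, 8 * 3600)  # below zero: alert would fire
```

Because the fit is linear, a short lookback window on a bursty metric extrapolates noise; the 1h window in the rule above trades responsiveness against that.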
3. Reduce Alert Fatigue
Techniques:
- Alert aggregation
- Smart throttling
- Severity levels
- Scheduled maintenance windows
- Auto-resolution
Example: Alert Grouping
route:
  receiver: 'team-pager'
  group_by: ['alertname', 'cluster']
  group_wait: 30s       # Wait before sending the first alert of a group
  group_interval: 5m    # Wait before sending the next batch
  repeat_interval: 4h   # Don't repeat more than every 4h
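The effect of group_by can be sketched in a few lines of Python: alerts that share the listed label values collapse into a single notification batch. The alert payloads here are illustrative.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Collapse alerts that share the group_by label values into one
    batch, mimicking Alertmanager's group_by behaviour."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "HighLatency", "cluster": "us-east", "pod": "a"}},
    {"labels": {"alertname": "HighLatency", "cluster": "us-east", "pod": "b"}},
    {"labels": {"alertname": "HighLatency", "cluster": "eu-west", "pod": "c"}},
]
groups = group_alerts(alerts, ["alertname", "cluster"])
# Two notifications instead of three: one per (alertname, cluster) pair.
```

Choosing coarse group_by labels means fewer pages but less detail per page; per-pod labels deliberately do not appear in the grouping key above.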
Alert Severity Levels
Critical (Page/Call)
Immediate user impact, requires immediate action.
Criteria:
- User-facing service down
- Data loss occurring
- Security breach detected
- SLA violation imminent
Example:
alert: ServiceDown
expr: up{job="api-server"} == 0
for: 1m
labels:
  severity: critical
  page: "true"
annotations:
  summary: "API server is down"
Warning (Ticket/Email)
Potential future problem, action needed within hours.
Criteria:
- Resource will exhaust soon
- Degraded performance
- Non-critical service issue
- Approaching SLA threshold
Example:
alert: HighMemoryUsage
expr: node_memory_usage_percent > 85
for: 15m
labels:
  severity: warning
annotations:
  summary: "Memory usage approaching capacity"
Info (Log Only)
Informational, no action needed.
Criteria:
- Deployment notifications
- Scaling events
- Routine maintenance
Example:
alert: DeploymentCompleted
expr: kube_deployment_status_replicas_updated > 0
labels:
  severity: info
annotations:
  summary: "Deployment {{ $labels.deployment }} updated"
Alert Design Patterns
1. Service-Level Objectives (SLOs)
Alert when the error budget is being burned too fast.
# SLO: 99.9% availability (0.1% error budget)
# Fast burn: consume 2% of the monthly budget in 1 hour
alert: ErrorBudgetBurnRateFast
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[1h])) /
    sum(rate(http_requests_total[1h]))
  ) > (0.001 * 14.4)  # 14.4x the sustainable burn rate
for: 5m
labels:
  severity: critical
# Slow burn: consume 5% of the monthly budget in 6 hours
alert: ErrorBudgetBurnRateSlow
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[6h])) /
    sum(rate(http_requests_total[6h]))
  ) > (0.001 * 6)  # 6x the sustainable burn rate
for: 30m
labels:
  severity: warning
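The 14.4x and 6x multipliers fall out of simple arithmetic: a burn rate is the budget fraction you are willing to spend, scaled by the ratio of SLO period to detection window. A small Python sketch:

```python
def burn_rate_threshold(budget_fraction, window_hours, period_hours=30 * 24):
    """Burn-rate multiplier that, if sustained for window_hours, consumes
    budget_fraction of the error budget for a period_hours SLO period."""
    return budget_fraction * period_hours / window_hours

fast = burn_rate_threshold(0.02, 1)   # 2% of a 30-day budget in 1h -> 14.4x
slow = burn_rate_threshold(0.05, 6)   # 5% of a 30-day budget in 6h -> 6.0x

slo_error_budget = 0.001                   # 99.9% availability SLO
fast_threshold = slo_error_budget * fast   # error-rate threshold ~0.0144
```

Plugging in a different SLO or period regenerates the thresholds, which keeps the alert rules honest when the SLO changes.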
2. Threshold-Based
Simple value exceeds threshold.
alert: HighErrorRate
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[5m])) /
    sum(rate(http_requests_total[5m]))
  ) > 0.05
for: 5m
Avoid:
- Static thresholds for dynamic systems
- Too tight thresholds (noise)
- Too loose thresholds (miss real issues)
3. Rate of Change
Alert on sudden changes.
alert: TrafficSpike
expr: |
  (
    rate(http_requests_total[5m]) /
    rate(http_requests_total[5m] offset 1h)
  ) > 2
for: 10m
annotations:
  summary: "Traffic more than doubled compared to an hour ago"
4. Absence Detection
Alert when expected data stops.
alert: MetricsNotReported
expr: |
  absent(up{job="critical-service"}) or
  up{job="critical-service"} == 0
for: 5m
5. Composite Alerts
Multiple conditions must be true.
alert: ServiceUnhealthy
expr: |
  (
    rate(http_requests_total{status=~"5.."}[5m]) > 10 and
    rate(http_requests_total[5m]) > 100 and
    up{job="api-server"} == 1
  )
for: 5m
annotations:
  summary: "Service is up but returning errors under load"
Alert Routing
Route by Severity
# Alertmanager config
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pager'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'
    - match:
        severity: info
      receiver: 'log-only'
Route by Team
route:
  receiver: 'default'
  group_by: ['team']
  routes:
    - match:
        team: platform
      receiver: 'platform-team'
    - match:
        team: frontend
      receiver: 'frontend-team'
    - match:
        team: database
      receiver: 'dba-team'
Time-Based Routing
route:
  receiver: 'on-call'
  routes:
    - match:
        severity: critical
      receiver: 'on-call'
      active_time_intervals:
        - business-hours
    - match:
        severity: warning
      receiver: 'slack-warnings'
      active_time_intervals:
        - business-hours
    - match:
        severity: warning
      receiver: 'ticket-system'
      active_time_intervals:
        - after-hours
time_intervals:
  - name: business-hours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '17:00'
        weekdays: ['monday:friday']
  - name: after-hours
    time_intervals:
      - times:
          - start_time: '17:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '09:00'
        weekdays: ['monday:friday']
      - weekdays: ['saturday', 'sunday']
Alert Suppression
Maintenance Windows
# Silence alerts during deployments
inhibit_rules:
  - source_match:
      alertname: DeploymentInProgress
    target_match:
      severity: warning
    equal: ['cluster', 'namespace']
Dependency-Based
# Don't alert on app if database is down
inhibit_rules:
  - source_match:
      alertname: DatabaseDown
    target_match:
      alertname: AppUnhealthy
    equal: ['environment']
Silences (Manual)
# Silence via amtool
amtool silence add \
  alertname=DiskSpaceWarning \
  instance=server-01 \
  --duration=2h \
  --comment="Expanding disk, known issue"
Notification Channels
PagerDuty Integration
receivers:
  - name: 'pager'
    pagerduty_configs:
      - service_key: '<integration-key>'
        severity: '{{ .GroupLabels.severity }}'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'
Slack Integration
receivers:
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
Email Integration
receivers:
  - name: 'email-team'
    email_configs:
      - to: 'team@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@example.com'
        auth_password: '<password>'
        headers:
          Subject: '[{{ .Status }}] {{ .GroupLabels.alertname }}'
Webhook Integration
receivers:
  - name: 'ticket-system'
    webhook_configs:
      - url: 'https://ticketing.example.com/webhook'
        send_resolved: true
Alert Templates
Informative Messages
templates:
  - '/etc/alertmanager/templates/*.tmpl'
# alert.tmpl
{{ define "alert.summary" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ .GroupLabels.alertname }}
{{ end }}
{{ define "alert.description" }}
{{ range .Alerts }}
Status: {{ .Status }}
Starts: {{ .StartsAt }}
{{ if .Annotations.description }}
Description: {{ .Annotations.description }}
{{ end }}
{{ if .Annotations.runbook_url }}
Runbook: {{ .Annotations.runbook_url }}
{{ end }}
Labels:
{{ range .Labels.SortedPairs }} {{ .Name }}: {{ .Value }}
{{ end }}
{{ end }}
{{ end }}
Runbook Links
alert: DatabaseConnectionPoolExhausted
annotations:
  summary: "Connection pool exhausted"
  description: |
    {{ $labels.instance }} has {{ $value }} available connections (< 10%)
  runbook_url: "https://wiki.example.com/runbooks/db-connection-pool"
Alert Testing
Unit Tests
# promtool test rules
rule_files:
  - alerts.yml
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="500"}'
        values: '0+10x10'   # 0, 10, 20, ... 100
      - series: 'http_requests_total{status="200"}'
        values: '0+100x10'  # 0, 100, 200, ... 1000
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: "Error rate above threshold"
Integration Tests
# Send a test alert (v2 API; the v1 API has been removed from Alertmanager)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "critical"
    },
    "annotations": {
      "summary": "This is a test alert"
    }
  }]'
Monitoring the Alerting System
Alertmanager Metrics
# Alertmanager itself is up
up{job="alertmanager"}
# Alerts currently firing
alertmanager_alerts{state="active"}
# Notification failure ratio (alertmanager_notifications_total has no
# "state" label; failures are counted separately)
rate(alertmanager_notifications_failed_total[5m]) /
rate(alertmanager_notifications_total[5m])
# Notification latency (p95)
histogram_quantile(0.95,
  rate(alertmanager_notification_latency_seconds_bucket[5m])
)
Dead Man’s Switch
# An alert that should always fire
alert: DeadMansSwitch
expr: vector(1)
labels:
  severity: info
annotations:
  summary: "Alerting system is working"
# An external monitor checks that this alert keeps arriving.
# If it stops firing, the alerting pipeline itself is broken.
Alert Workflow
1. Detection
Metric → Evaluation → Threshold Exceeded → Pending State
2. Aggregation
Pending → Active → Group by Labels → Batch Similar Alerts
3. Routing
Grouped Alerts → Route by Rules → Select Receiver
4. Notification
Receiver → Format Message → Send to Channel
5. Response
Human Receives → Acknowledges → Investigates → Resolves
6. Resolution
Problem Fixed → Metric Returns to Normal → Alert Resolves
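The lifecycle above can be sketched as a transition table. State and event names here are illustrative; Prometheus itself tracks inactive, pending, and firing for each rule.

```python
# Alert lifecycle as a transition table: detection moves an alert to
# pending, the "for" duration promotes it to firing, recovery resolves it.
TRANSITIONS = {
    "inactive": {"threshold_exceeded": "pending"},
    "pending": {"for_duration_elapsed": "firing", "recovered": "inactive"},
    "firing": {"recovered": "inactive"},
}

def step(state, event):
    # Unknown events leave the state unchanged.
    return TRANSITIONS.get(state, {}).get(event, state)

state = "inactive"
history = [state]
for event in ["threshold_exceeded", "for_duration_elapsed", "recovered"]:
    state = step(state, event)
    history.append(state)
# history: inactive -> pending -> firing -> inactive
```

The pending state is what the "for" clause buys you: a brief threshold breach that recovers never reaches a human.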
On-Call Best Practices
Rotation Schedule
Week 1: Engineer A (Primary), Engineer B (Secondary)
Week 2: Engineer B (Primary), Engineer C (Secondary)
Week 3: Engineer C (Primary), Engineer A (Secondary)
Escalation Policy
Alert Fired
↓
0-5 min: Page Primary
↓
5-10 min: Page Secondary
↓
10-15 min: Page Manager
↓
15-20 min: Page Director
Post-Incident Review
# Incident Report: [INCIDENT-123]
## Summary
Brief description of what happened
## Timeline
- 14:32 UTC: Alert fired
- 14:35 UTC: Engineer acknowledged
- 14:42 UTC: Root cause identified
- 15:10 UTC: Fix deployed
- 15:15 UTC: Incident resolved
## Root Cause
Why did this happen?
## Impact
- Duration: 43 minutes
- Users affected: ~1,000
- Revenue impact: $X
## Actions Taken
1. Restarted failing service
2. Increased connection pool
3. Deployed hotfix
## Follow-up Items
- [ ] Add monitoring for X
- [ ] Update runbook
- [ ] Increase resource limits
Common Anti-Patterns
1. Too Many Alerts
Problem: Alert fatigue; important alerts get missed.
Solution: Review and reduce alert count; raise thresholds.
2. Vague Alerts
Problem: “Something is wrong” doesn’t help.
Solution: Include context, metrics, and runbook links.
3. Alerting on Everything
Problem: Can’t distinguish signal from noise.
Solution: Only alert on user-impacting issues.
4. No Runbooks
Problem: Nobody knows how to respond.
Solution: Document response procedures.
5. Flapping Alerts
Problem: Alert fires/resolves repeatedly
Solution: Increase for duration, add hysteresis
6. Missing Context
Problem: Can’t determine severity or impact.
Solution: Include relevant labels and annotations.
7. No Test Alerts
Problem: Alerting is broken but nobody knows.
Solution: Regular test alerts and a dead man’s switch.
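For flapping alerts specifically, hysteresis means firing and clearing at different thresholds, so a value hovering near one threshold cannot toggle the alert. A minimal Python sketch; the thresholds are illustrative.

```python
class HysteresisAlert:
    """Fires above fire_at but resolves only below clear_at; the gap
    between the two thresholds is what prevents flapping."""
    def __init__(self, fire_at, clear_at):
        assert clear_at < fire_at
        self.fire_at, self.clear_at = fire_at, clear_at
        self.firing = False

    def observe(self, value):
        if not self.firing and value > self.fire_at:
            self.firing = True
        elif self.firing and value < self.clear_at:
            self.firing = False
        return self.firing

alert = HysteresisAlert(fire_at=90, clear_at=80)
states = [alert.observe(v) for v in [85, 91, 89, 91, 79]]
# [False, True, True, True, False]: 89 and the second 91 cause no flap
```

In PromQL the same effect can be approximated by alerting on the high threshold but keeping the alert active while the metric stays above the lower one.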
Alert Quality Metrics
Track These
# Alert noise ratio (alert_was_actionable is a hypothetical metric you
# would record during alert review; Prometheus does not provide it)
(
  count(ALERTS{alertstate="firing"} unless on(alertname) alert_was_actionable)
  /
  count(ALERTS{alertstate="firing"})
) * 100
# Mean time to acknowledge (MTTA), from incident-tracker timestamps
avg(alert_acknowledge_time_seconds - alert_fire_time_seconds)
# Mean time to resolve (MTTR)
avg(alert_resolve_time_seconds - alert_fire_time_seconds)
# False positive rate (alert_required_action is likewise a custom metric)
count(ALERTS unless on(alertname) alert_required_action) /
count(ALERTS) * 100
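MTTA and MTTR are straightforward to compute once you have fire, acknowledge, and resolve timestamps. A small Python sketch over made-up incident records:

```python
from statistics import mean

# Made-up incident records; timestamps are seconds since epoch.
incidents = [
    {"fired": 0,    "acked": 240,  "resolved": 2580},
    {"fired": 1000, "acked": 1120, "resolved": 1900},
]

mtta = mean(i["acked"] - i["fired"] for i in incidents)     # time to acknowledge
mttr = mean(i["resolved"] - i["fired"] for i in incidents)  # time to resolve
print(f"MTTA: {mtta:.0f}s, MTTR: {mttr:.0f}s")  # MTTA: 180s, MTTR: 1740s
```

Means hide outliers, so it is worth tracking the distribution (or at least the p90) alongside the averages.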
Review Regularly
# Weekly Alert Review
## Alert Volume
- Total alerts: 142
- Critical: 3
- Warning: 89
- Info: 50
## Top Noisy Alerts
1. DiskSpaceWarning (42 fires)
2. HighCPU (23 fires)
3. MemoryPressure (18 fires)
## Action Items
- Increase DiskSpaceWarning threshold
- Investigate HighCPU root cause
- Add memory to servers
Advanced Techniques
Anomaly Detection
# Alert on statistically significant deviations
alert: AnomalousTraffic
expr: |
  abs(
    rate(http_requests_total[5m]) -
    avg_over_time(rate(http_requests_total[5m])[1d:5m])
  ) > (3 * stddev_over_time(rate(http_requests_total[5m])[1d:5m]))
for: 10m
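The PromQL above is a three-sigma rule: flag the current value when it deviates from the trailing mean by more than three standard deviations. The same check in a few lines of Python, with an illustrative sample series:

```python
from statistics import mean, stdev

def is_anomalous(history, current, sigmas=3):
    """True if current deviates from the historical mean by more than
    sigmas sample standard deviations."""
    return abs(current - mean(history)) > sigmas * stdev(history)

requests_per_s = [100, 102, 98, 101, 99, 103, 97, 100]  # mean 100, stdev 2
print(is_anomalous(requests_per_s, 150))  # True: a large spike
print(is_anomalous(requests_per_s, 104))  # False: within normal variation
```

Note the rule assumes roughly stationary, normally distributed traffic; seasonal patterns (daily peaks, weekends) need a seasonal baseline or they will trip it constantly.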
Predictive Alerts
# Predict resource exhaustion
alert: DiskWillFillSoon
expr: |
  predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0
annotations:
  summary: "Disk will fill in ~24 hours"
Multi-Window, Multi-Burn-Rate
# Different burn rates, different sensitivities
alert: ErrorBudgetBurn
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[1h])) /
    sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
    and
    sum(rate(http_requests_total{status=~"5.."}[5m])) /
    sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
  )
  or
  (
    sum(rate(http_requests_total{status=~"5.."}[6h])) /
    sum(rate(http_requests_total[6h])) > (6 * 0.001)
    and
    sum(rate(http_requests_total{status=~"5.."}[30m])) /
    sum(rate(http_requests_total[30m])) > (6 * 0.001)
  )
Conclusion
Effective alerting requires:
- Alert on user impact, not internal metrics
- Make alerts actionable with context and runbooks
- Reduce noise through aggregation and throttling
- Route intelligently by severity and team
- Test regularly to ensure alerting works
- Review and refine based on incident data
Remember: The goal isn’t zero alerts. It’s actionable alerts that lead to faster problem resolution and better uptime.
Golden Rule: If you can’t do anything about it right now, don’t alert on it.