Good alerting wakes you up only when something is both wrong and actionable. Bad alerting wakes you up all the time, until you start ignoring it.
Principles of Good Alerting
1. Alert on Symptoms, Not Causes
Bad (cause-based):
alert: HighCPU
expr: node_cpu_usage_percent > 80
Problem: High CPU doesn’t necessarily mean user impact.
Good (symptom-based):
alert: HighLatency
expr: |
  histogram_quantile(0.95,
    rate(http_request_duration_seconds_bucket[5m])
  ) > 1.0
for: 5m
Why: Users experience slow responses; latency is what they actually feel.
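As an aside on how such a percentile is estimated: a minimal Python sketch of the linear-interpolation approach that histogram_quantile applies to cumulative bucket counts. The bucket bounds and counts below are made up for illustration.

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count).
    Finds the bucket containing the target rank, then interpolates
    linearly between that bucket's bounds."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 60 under 0.5s, 90 under 1.0s, all under 2.0s.
p95 = histogram_quantile(0.95, [(0.5, 60), (1.0, 90), (2.0, 100)])
# p95 lands halfway into the (1.0, 2.0] bucket: 1.5s
```

The estimate is only as fine-grained as the bucket layout, which is why latency SLO thresholds should sit close to a real bucket boundary.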
2. Every Alert Must Be Actionable
Questions to ask:
- Can I do something about this right now?
- Does this require immediate human intervention?
- What specific action should I take?
Bad Alert:
alert: DiskUsageHigh
expr: disk_usage_percent > 70
annotations:
  summary: "Disk usage is high"
Good Alert:
alert: DiskWillFillIn4Hours
expr: |
  predict_linear(node_filesystem_avail_bytes[1h], 4*3600) < 0
annotations:
  summary: "Disk {{ $labels.device }} will fill in ~4 hours"
  description: |
    Predicted free space in 4h: {{ $value | humanize }}
    Action: Clean up logs or expand the disk
  runbook_url: "https://runbooks.example.com/disk-full"
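predict_linear is just a least-squares line extrapolated forward. A rough Python model of the same idea, with a hypothetical sample series:

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate seconds_ahead past the last sample, roughly what
    PromQL's predict_linear() does."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / \
            sum((t - mean_t) ** 2 for t, _ in samples)
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept

GIB = 1024 ** 3
# Free space dropping 1 GiB/hour from 10 GiB, sampled hourly for 4 hours.
samples = [(h * 3600, (10 - h) * GIB) for h in range(4)]
in_4h = predict_linear(samples, 4 * 3600)  # ~3 GiB left: no alert yet
in_8h = predict_linear(samples, 8 * 3600)  # below zero: alert would fire
```

Because the fit is linear, a short lookback window on a bursty metric extrapolates noise; the 1h window in the rule above trades responsiveness against that.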
3. Reduce Alert Fatigue
Techniques:
- Alert aggregation
- Smart throttling
- Severity levels
- Scheduled maintenance windows
- Auto-resolution
Example: Alert Grouping
route:
  receiver: 'team-pager'
  group_by: ['alertname', 'cluster']
  group_wait: 30s       # Wait before sending the first alert of a group
  group_interval: 5m    # Wait before sending the next batch
  repeat_interval: 4h   # Don't repeat more than every 4h
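The effect of group_by can be sketched in a few lines of Python: alerts that share the listed label values collapse into a single notification batch. The alert payloads here are illustrative.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Collapse alerts that share the group_by label values into one
    batch, mimicking Alertmanager's group_by behaviour."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "HighLatency", "cluster": "us-east", "pod": "a"}},
    {"labels": {"alertname": "HighLatency", "cluster": "us-east", "pod": "b"}},
    {"labels": {"alertname": "HighLatency", "cluster": "eu-west", "pod": "c"}},
]
groups = group_alerts(alerts, ["alertname", "cluster"])
# Two notifications instead of three: one per (alertname, cluster) pair.
```

Choosing coarse group_by labels means fewer pages but less detail per page; per-pod labels deliberately do not appear in the grouping key above.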
Alert Severity Levels
Critical (Page/Call)
Immediate user impact, requires immediate action.
Criteria:
- User-facing service down
- Data loss occurring
- Security breach detected
- SLA violation imminent
Example:
alert: ServiceDown
expr: up{job="api-server"} == 0
for: 1m
labels:
  severity: critical
  page: "true"
annotations:
  summary: "API server is down"
Warning (Ticket/Email)
Potential future problem, action needed within hours.
Criteria:
- Resource will exhaust soon
- Degraded performance
- Non-critical service issue
- Approaching SLA threshold
Example:
alert: HighMemoryUsage
expr: node_memory_usage_percent > 85
for: 15m
labels:
  severity: warning
annotations:
  summary: "Memory usage approaching capacity"
Info (Log Only)
Informational, no action needed.
Criteria:
- Deployment notifications
- Scaling events
- Routine maintenance
Example:
alert: DeploymentCompleted
expr: kube_deployment_status_replicas_updated > 0
labels:
  severity: info
annotations:
  summary: "Deployment {{ $labels.deployment }} updated"
Alert Design Patterns
1. Service-Level Objectives (SLOs)
Alert when the error budget is being burned too fast.
# SLO: 99.9% availability (0.1% error budget)
# Fast burn: consume 2% of the monthly budget in 1 hour
alert: ErrorBudgetBurnRateFast
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[1h])) /
    sum(rate(http_requests_total[1h]))
  ) > (0.001 * 14.4)  # 14.4x the sustainable burn rate
for: 5m
labels:
  severity: critical
# Slow burn: consume 5% of the monthly budget in 6 hours
alert: ErrorBudgetBurnRateSlow
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[6h])) /
    sum(rate(http_requests_total[6h]))
  ) > (0.001 * 6)  # 6x the sustainable burn rate
for: 30m
labels:
  severity: warning
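The 14.4x and 6x multipliers fall out of simple arithmetic: a burn rate is the budget fraction you are willing to spend, scaled by the ratio of SLO period to detection window. A small Python sketch:

```python
def burn_rate_threshold(budget_fraction, window_hours, period_hours=30 * 24):
    """Burn-rate multiplier that, if sustained for window_hours, consumes
    budget_fraction of the error budget for a period_hours SLO period."""
    return budget_fraction * period_hours / window_hours

fast = burn_rate_threshold(0.02, 1)   # 2% of a 30-day budget in 1h -> 14.4x
slow = burn_rate_threshold(0.05, 6)   # 5% of a 30-day budget in 6h -> 6.0x

slo_error_budget = 0.001                   # 99.9% availability SLO
fast_threshold = slo_error_budget * fast   # error-rate threshold ~0.0144
```

Plugging in a different SLO or period regenerates the thresholds, which keeps the alert rules honest when the SLO changes.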
2. Threshold-Based
Simple value exceeds threshold.
alert: HighErrorRate
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[5m])) /
    sum(rate(http_requests_total[5m]))
  ) > 0.05
for: 5m
Avoid:
- Static thresholds for dynamic systems
- Too tight thresholds (noise)
- Too loose thresholds (miss real issues)
3. Rate of Change
Alert on sudden changes.
alert: TrafficSpike
expr: |
  (
    rate(http_requests_total[5m]) /
    rate(http_requests_total[5m] offset 1h)
  ) > 2
for: 10m
annotations:
  summary: "Traffic more than doubled compared to an hour ago"
4. Absence Detection
Alert when expected data stops.
alert: MetricsNotReported
expr: |
  absent(up{job="critical-service"}) or
  up{job="critical-service"} == 0
for: 5m
5. Composite Alerts
Multiple conditions must be true.
alert: ServiceUnhealthy
expr: |
  (
    rate(http_requests_total{status=~"5.."}[5m]) > 10 and
    rate(http_requests_total[5m]) > 100 and
    up{job="api-server"} == 1
  )
for: 5m
annotations:
  summary: "Service is up but returning errors under load"
Alert Routing
Route by Severity
# Alertmanager config
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pager'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'
    - match:
        severity: info
      receiver: 'log-only'
Route by Team
route:
  receiver: 'default'
  group_by: ['team']
  routes:
    - match:
        team: platform
      receiver: 'platform-team'
    - match:
        team: frontend
      receiver: 'frontend-team'
    - match:
        team: database
      receiver: 'dba-team'
Time-Based Routing
route:
  receiver: 'on-call'
  routes:
    - match:
        severity: critical
      receiver: 'on-call'
      active_time_intervals:
        - business-hours
    - match:
        severity: warning
      receiver: 'slack-warnings'
      active_time_intervals:
        - business-hours
    - match:
        severity: warning
      receiver: 'ticket-system'
      active_time_intervals:
        - after-hours
time_intervals:
  - name: business-hours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '17:00'
        weekdays: ['monday:friday']
  - name: after-hours
    time_intervals:
      - times:
          - start_time: '17:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '09:00'
        weekdays: ['monday:friday']
      - weekdays: ['saturday', 'sunday']
Alert Suppression
Maintenance Windows
# Silence alerts during deployments
inhibit_rules:
  - source_match:
      alertname: DeploymentInProgress
    target_match:
      severity: warning
    equal: ['cluster', 'namespace']
Dependency-Based
# Don't alert on app if database is down
inhibit_rules:
  - source_match:
      alertname: DatabaseDown
    target_match:
      alertname: AppUnhealthy
    equal: ['environment']
Silences (Manual)
# Silence via amtool
amtool silence add \
  alertname=DiskSpaceWarning \
  instance=server-01 \
  --duration=2h \
  --comment="Expanding disk, known issue"
Notification Channels
PagerDuty Integration
receivers:
  - name: 'pager'
    pagerduty_configs:
      - service_key: '<integration-key>'
        severity: '{{ .GroupLabels.severity }}'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'
Slack Integration
receivers:
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
Email Integration
receivers:
  - name: 'email-team'
    email_configs:
      - to: 'team@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@example.com'
        auth_password: '<password>'
        headers:
          Subject: '[{{ .Status }}] {{ .GroupLabels.alertname }}'
Webhook Integration
receivers:
  - name: 'ticket-system'
    webhook_configs:
      - url: 'https://ticketing.example.com/webhook'
        send_resolved: true
Alert Templates
Informative Messages
templates:
  - '/etc/alertmanager/templates/*.tmpl'
# alert.tmpl
{{ define "alert.summary" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ .GroupLabels.alertname }}
{{ end }}
{{ define "alert.description" }}
{{ range .Alerts }}
Status: {{ .Status }}
Starts: {{ .StartsAt }}
{{ if .Annotations.description }}
Description: {{ .Annotations.description }}
{{ end }}
{{ if .Annotations.runbook_url }}
Runbook: {{ .Annotations.runbook_url }}
{{ end }}
Labels:
{{ range .Labels.SortedPairs }} {{ .Name }}: {{ .Value }}
{{ end }}
{{ end }}
{{ end }}
Runbook Links
alert: DatabaseConnectionPoolExhausted
annotations:
  summary: "Connection pool exhausted"
  description: |
    {{ $labels.instance }} has {{ $value }} available connections (< 10%)
  runbook_url: "https://wiki.example.com/runbooks/db-connection-pool"
Alert Testing
Unit Tests
# promtool test rules
rule_files:
  - alerts.yml
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="500"}'
        values: '0+10x10'   # 0, 10, 20, ... 100
      - series: 'http_requests_total{status="200"}'
        values: '0+100x10'  # 0, 100, 200, ... 1000
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: "Error rate above threshold"
Integration Tests
# Send a test alert (v2 API; the v1 API has been removed from Alertmanager)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "critical"
    },
    "annotations": {
      "summary": "This is a test alert"
    }
  }]'
Monitoring the Alerting System
Alertmanager Metrics
# Alertmanager itself is up
up{job="alertmanager"}
# Alerts currently firing
alertmanager_alerts{state="active"}
# Notification failure ratio (alertmanager_notifications_total has no
# "state" label; failures are counted separately)
rate(alertmanager_notifications_failed_total[5m]) /
rate(alertmanager_notifications_total[5m])
# Notification latency (p95)
histogram_quantile(0.95,
  rate(alertmanager_notification_latency_seconds_bucket[5m])
)
Dead Man’s Switch
# An alert that should always fire
alert: DeadMansSwitch
expr: vector(1)
labels:
  severity: info
annotations:
  summary: "Alerting system is working"
# An external monitor checks that this alert keeps arriving.
# If it stops firing, the alerting pipeline itself is broken.
Alert Workflow
1. Detection
Metric → Evaluation → Threshold Exceeded → Pending State
2. Aggregation
Pending → Active → Group by Labels → Batch Similar Alerts
3. Routing
Grouped Alerts → Route by Rules → Select Receiver
4. Notification
Receiver → Format Message → Send to Channel
5. Response
Human Receives → Acknowledges → Investigates → Resolves
6. Resolution
Problem Fixed → Metric Returns to Normal → Alert Resolves
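The lifecycle above can be sketched as a transition table. State and event names here are illustrative; Prometheus itself tracks inactive, pending, and firing for each rule.

```python
# Alert lifecycle as a transition table: detection moves an alert to
# pending, the "for" duration promotes it to firing, recovery resolves it.
TRANSITIONS = {
    "inactive": {"threshold_exceeded": "pending"},
    "pending": {"for_duration_elapsed": "firing", "recovered": "inactive"},
    "firing": {"recovered": "inactive"},
}

def step(state, event):
    # Unknown events leave the state unchanged.
    return TRANSITIONS.get(state, {}).get(event, state)

state = "inactive"
history = [state]
for event in ["threshold_exceeded", "for_duration_elapsed", "recovered"]:
    state = step(state, event)
    history.append(state)
# history: inactive -> pending -> firing -> inactive
```

The pending state is what the "for" clause buys you: a brief threshold breach that recovers never reaches a human.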
On-Call Best Practices
Rotation Schedule
Week 1: Engineer A (Primary), Engineer B (Secondary)
Week 2: Engineer B (Primary), Engineer C (Secondary)
Week 3: Engineer C (Primary), Engineer A (Secondary)
Escalation Policy
Alert Fired
↓
0-5 min: Page Primary
↓
5-10 min: Page Secondary
↓
10-15 min: Page Manager
↓
15-20 min: Page Director
Post-Incident Review
# Incident Report: [INCIDENT-123]
## Summary
Brief description of what happened
## Timeline
- 14:32 UTC: Alert fired
- 14:35 UTC: Engineer acknowledged
- 14:42 UTC: Root cause identified
- 15:10 UTC: Fix deployed
- 15:15 UTC: Incident resolved
## Root Cause
Why did this happen?
## Impact
- Duration: 43 minutes
- Users affected: ~1,000
- Revenue impact: $X
## Actions Taken
1. Restarted failing service
2. Increased connection pool
3. Deployed hotfix
## Follow-up Items
- [ ] Add monitoring for X
- [ ] Update runbook
- [ ] Increase resource limits
Common Anti-Patterns
1. Too Many Alerts
Problem: Alert fatigue; important alerts get missed.
Solution: Review and reduce alert count; raise thresholds.
2. Vague Alerts
Problem: “Something is wrong” doesn’t help.
Solution: Include context, metrics, and runbook links.
3. Alerting on Everything
Problem: Can’t distinguish signal from noise.
Solution: Only alert on user-impacting issues.
4. No Runbooks
Problem: Nobody knows how to respond.
Solution: Document response procedures.
5. Flapping Alerts
Problem: Alert fires/resolves repeatedly
Solution: Increase for duration, add hysteresis
6. Missing Context
Problem: Can’t determine severity or impact.
Solution: Include relevant labels and annotations.
7. No Test Alerts
Problem: Alerting is broken but nobody knows.
Solution: Regular test alerts and a dead man’s switch.
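For flapping alerts specifically, hysteresis means firing and clearing at different thresholds, so a value hovering near one threshold cannot toggle the alert. A minimal Python sketch; the thresholds are illustrative.

```python
class HysteresisAlert:
    """Fires above fire_at but resolves only below clear_at; the gap
    between the two thresholds is what prevents flapping."""
    def __init__(self, fire_at, clear_at):
        assert clear_at < fire_at
        self.fire_at, self.clear_at = fire_at, clear_at
        self.firing = False

    def observe(self, value):
        if not self.firing and value > self.fire_at:
            self.firing = True
        elif self.firing and value < self.clear_at:
            self.firing = False
        return self.firing

alert = HysteresisAlert(fire_at=90, clear_at=80)
states = [alert.observe(v) for v in [85, 91, 89, 91, 79]]
# [False, True, True, True, False]: 89 and the second 91 cause no flap
```

In PromQL the same effect can be approximated by alerting on the high threshold but keeping the alert active while the metric stays above the lower one.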
Alert Quality Metrics
Track These
# Alert noise ratio (alert_was_actionable is a hypothetical metric you
# would record during alert review; Prometheus does not provide it)
(
  count(ALERTS{alertstate="firing"} unless on(alertname) alert_was_actionable)
  /
  count(ALERTS{alertstate="firing"})
) * 100
# Mean time to acknowledge (MTTA), from incident-tracker timestamps
avg(alert_acknowledge_time_seconds - alert_fire_time_seconds)
# Mean time to resolve (MTTR)
avg(alert_resolve_time_seconds - alert_fire_time_seconds)
# False positive rate (alert_required_action is likewise a custom metric)
count(ALERTS unless on(alertname) alert_required_action) /
count(ALERTS) * 100
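MTTA and MTTR are straightforward to compute once you have fire, acknowledge, and resolve timestamps. A small Python sketch over made-up incident records:

```python
from statistics import mean

# Made-up incident records; timestamps are seconds since epoch.
incidents = [
    {"fired": 0,    "acked": 240,  "resolved": 2580},
    {"fired": 1000, "acked": 1120, "resolved": 1900},
]

mtta = mean(i["acked"] - i["fired"] for i in incidents)     # time to acknowledge
mttr = mean(i["resolved"] - i["fired"] for i in incidents)  # time to resolve
print(f"MTTA: {mtta:.0f}s, MTTR: {mttr:.0f}s")  # MTTA: 180s, MTTR: 1740s
```

Means hide outliers, so it is worth tracking the distribution (or at least the p90) alongside the averages.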
Review Regularly
# Weekly Alert Review
## Alert Volume
- Total alerts: 142
- Critical: 3
- Warning: 89
- Info: 50
## Top Noisy Alerts
1. DiskSpaceWarning (42 fires)
2. HighCPU (23 fires)
3. MemoryPressure (18 fires)
## Action Items
- Increase DiskSpaceWarning threshold
- Investigate HighCPU root cause
- Add memory to servers
Advanced Techniques
Anomaly Detection
# Alert on statistically significant deviations
alert: AnomalousTraffic
expr: |
  abs(
    rate(http_requests_total[5m]) -
    avg_over_time(rate(http_requests_total[5m])[1d:5m])
  ) > (3 * stddev_over_time(rate(http_requests_total[5m])[1d:5m]))
for: 10m
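The PromQL above is a three-sigma rule: flag the current value when it deviates from the trailing mean by more than three standard deviations. The same check in a few lines of Python, with an illustrative sample series:

```python
from statistics import mean, stdev

def is_anomalous(history, current, sigmas=3):
    """True if current deviates from the historical mean by more than
    sigmas sample standard deviations."""
    return abs(current - mean(history)) > sigmas * stdev(history)

requests_per_s = [100, 102, 98, 101, 99, 103, 97, 100]  # mean 100, stdev 2
print(is_anomalous(requests_per_s, 150))  # True: a large spike
print(is_anomalous(requests_per_s, 104))  # False: within normal variation
```

Note the rule assumes roughly stationary, normally distributed traffic; seasonal patterns (daily peaks, weekends) need a seasonal baseline or they will trip it constantly.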
Predictive Alerts
# Predict resource exhaustion
alert: DiskWillFillSoon
expr: |
  predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0
annotations:
  summary: "Disk will fill in ~24 hours"
Multi-Window, Multi-Burn-Rate
# Different burn rates, different sensitivities
alert: ErrorBudgetBurn
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[1h])) /
    sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
    and
    sum(rate(http_requests_total{status=~"5.."}[5m])) /
    sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
  )
  or
  (
    sum(rate(http_requests_total{status=~"5.."}[6h])) /
    sum(rate(http_requests_total[6h])) > (6 * 0.001)
    and
    sum(rate(http_requests_total{status=~"5.."}[30m])) /
    sum(rate(http_requests_total[30m])) > (6 * 0.001)
  )
Conclusion
Effective alerting requires:
- Alert on user impact, not internal metrics
- Make alerts actionable with context and runbooks
- Reduce noise through aggregation and throttling
- Route intelligently by severity and team
- Test regularly to ensure alerting works
- Review and refine based on incident data
Remember: The goal isn’t zero alerts. It’s actionable alerts that lead to faster problem resolution and better uptime.
Golden Rule: If you can’t do anything about it right now, don’t alert on it.