Effective troubleshooting is methodical, documented, and reproducible. Random guessing wastes time and can make problems worse.
The Scientific Method
1. Observe
Gather data about the problem.
Questions to ask:
- What is the symptom?
- When did it start?
- What changed recently?
- Is it consistent or intermittent?
- Who/what is affected?
Data to collect:
# System metrics
top, htop, vmstat, iostat
# Network
netstat -an, ss -tulpn, tcpdump
# Logs
journalctl -xe, tail -f /var/log/*
# Application
ps aux, lsof, strace
2. Hypothesize
Form educated guesses about root cause.
Good hypotheses:
- Testable
- Based on evidence
- Specific
Example:
Hypothesis: High latency is caused by database connection pool exhaustion
Evidence: Connection pool metrics show 100% utilization
Test: Monitor connection pool during latency spike
3. Test
Verify or disprove hypothesis.
Testing approaches:
- Check logs for correlation
- Monitor metrics during problem
- Reproduce in controlled environment
- A/B test (compare working vs. broken)
4. Analyze
Interpret test results.
Outcomes:
- Hypothesis confirmed → Proceed to fix
- Hypothesis rejected → Form new hypothesis
- Inconclusive → Gather more data
5. Document
Record findings for future reference.
# Issue: High API Latency
## Timeline
- 14:32 UTC: Latency spike detected
- 14:35 UTC: Investigation started
- 14:42 UTC: Root cause identified
- 15:10 UTC: Fix deployed
- 15:15 UTC: Incident resolved
## Root Cause
Database connection pool exhausted due to slow queries
## Solution
Increased connection pool size and optimized slow query
## Prevention
- Add monitoring for connection pool utilization
- Set query timeout limits
- Regular query performance review
Divide and Conquer
Break complex systems into components.
Layer Isolation
User → Browser → DNS → Load Balancer → Web Server → App Server → Database
Test each layer:
1. Can user access internet? (ping 8.8.8.8)
2. Does DNS resolve? (nslookup example.com)
3. Is load balancer responding? (curl LB-IP)
4. Are web servers healthy? (check health endpoint)
5. Can app connect to DB? (telnet db-host 3306)
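These layer checks can be scripted. A minimal Python sketch using DNS and raw TCP probes; the hostnames and ports below are placeholders for your own stack:

```python
import socket

def check_dns(name):
    """Return True if the hostname resolves."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

def check_tcp(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_layers():
    """Walk the layers outside-in; hostnames are illustrative."""
    return {
        "dns": check_dns("example.com"),
        "load_balancer": check_tcp("lb.example.com", 443),
        "database": check_tcp("db.example.com", 3306),
    }
```

Calling `check_layers()` returns a dict of booleans, so the first `False` tells you which layer to dig into.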
Binary Search
Split problem space in half repeatedly.
Problem: API became slow at some point between 10:00 and 11:00
Check the midpoint, then keep halving the interval that contains the onset:
- Slow at 10:30? Yes → onset at or before 10:30
- Slow at 10:15? Yes → onset at or before 10:15
- Slow at 10:07? Yes → onset at or before 10:07
- Slow at 10:03? No → onset after 10:03
- Slow at 10:05? No → onset after 10:05
- Slow at 10:06? No → onset after 10:06
Conclusion: Problem started at 10:07
Check: What changed at 10:07?
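When you have a per-minute health signal, the bisection can be automated; the `is_slow` predicate below is illustrative:

```python
def find_onset(is_slow, lo, hi):
    """Binary-search the first minute at which is_slow(minute) is True.
    Assumes is_slow is False before the onset and True from then on."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_slow(mid):
            hi = mid     # onset is at mid or earlier
        else:
            lo = mid + 1  # onset is after mid
    return lo

# Illustrative signal: slowness started 7 minutes past 10:00.
onset = find_onset(lambda m: m >= 7, 0, 60)
print(f"Problem started at 10:{onset:02d}")  # Problem started at 10:07
```

Six probes narrow a 60-minute window to a single minute, which is the whole appeal of the technique.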
Component Isolation
# Is it the application or infrastructure?
# Deploy app to different server → Problem persists?
# Is it code or data?
# Run with sample data → Problem persists?
# Is it network or application?
# Run locally → Problem persists?
Common Troubleshooting Patterns
Pattern 1: It Worked Before
When it broke, what changed?
# Check recent deployments
git log --since="2 hours ago"
# Check config changes
diff old-config.yaml new-config.yaml
# Check infrastructure changes
terraform show -json | jq '.values.root_module.resources[] | select(.mode == "managed") | .values.tags.LastModified'
# Check system updates
rpm -qa --last | head -20 # RHEL/CentOS
grep -E " (install|upgrade) " /var/log/dpkg.log | tail -20 # Debian/Ubuntu (dpkg has no --last option)
Pattern 2: It Works Somewhere Else
What’s different between environments?
# Compare versions
diff <(ssh prod 'app --version') <(ssh dev 'app --version')
# Compare configs
diff <(ssh prod 'cat /etc/app/config') <(ssh dev 'cat /etc/app/config')
# Compare dependencies
diff <(ssh prod 'pip freeze') <(ssh dev 'pip freeze')
# Compare environment variables
diff <(ssh prod 'env | sort') <(ssh dev 'env | sort')
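The same comparison works on parsed environments. A small helper for diffing two env dicts; the sample values are made up:

```python
def diff_envs(a, b):
    """Return (only_in_a, only_in_b, changed_keys) for two env dicts."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed

prod = {"APP_ENV": "prod", "DB_POOL_SIZE": "10", "FEATURE_X": "on"}
dev = {"APP_ENV": "dev", "DB_POOL_SIZE": "50"}
print(diff_envs(prod, dev))  # (['FEATURE_X'], [], ['APP_ENV', 'DB_POOL_SIZE'])
```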
Pattern 3: Intermittent Issues
Look for patterns over time.
# Correlate with time of day
grep "ERROR" /var/log/app.log | awk '{print $1}' | sort | uniq -c
# Correlate with load
join <(grep "ERROR" app.log | awk '{print $2}') \
<(sar -q 1 60 | awk '{print $1,$5}')
# Correlate with deployments
git log --all --oneline --since="1 week ago" \
--until="$(grep 'ERROR' app.log | head -1 | cut -d' ' -f1)"
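Bucketing errors by hour is also easy in Python; this assumes log lines that start with a timestamp like `2025-01-15 14:32:01`:

```python
from collections import Counter

def errors_by_hour(lines):
    """Count ERROR lines per hour, assuming lines shaped like
    '2025-01-15 14:32:01 ERROR ...'."""
    counts = Counter()
    for line in lines:
        if "ERROR" in line:
            date, time, *_ = line.split()
            counts[f"{date} {time[:2]}:00"] += 1
    return counts

log = [
    "2025-01-15 14:32:01 ERROR timeout",
    "2025-01-15 14:45:10 ERROR timeout",
    "2025-01-15 15:01:00 INFO ok",
    "2025-01-15 15:02:33 ERROR refused",
]
print(errors_by_hour(log))
```

A spike in one bucket points at a time-of-day correlation (cron jobs, batch traffic, backups) worth investigating.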
Pattern 4: Cascading Failures
Follow the dependency chain.
Service A → Service B → Service C → Database
Database slow
↓
Service C times out
↓
Service B retries
↓
Service A queue fills
↓
User sees errors
Start from the deepest dependency:
# Check database
mysql -e "SHOW PROCESSLIST" | grep -c "Query"
# Check database connections
netstat -an | grep :3306 | wc -l
# Check service logs in reverse order
for svc in database service-c service-b service-a; do
echo "=== $svc ==="
tail -n 50 /var/log/$svc.log | grep ERROR
done
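Unbounded retries are what turn one slow dependency into a cascade: each layer multiplies the load on the layer below. A common mitigation is capped exponential backoff with jitter; a sketch:

```python
import random
import time

def call_with_backoff(fn, retries=4, base=0.1, cap=2.0, sleep=time.sleep):
    """Retry fn with capped exponential backoff plus full jitter,
    so synchronized retry storms don't amplify an upstream outage."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            # Full jitter: uniform delay in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            sleep(delay)
```

The `sleep` parameter is injectable purely so the behavior is testable; in production code you would call it with the defaults.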
Essential Tools
System Performance
Linux
# CPU
top # Real-time process monitor
mpstat -P ALL 1 # Per-CPU statistics
pidstat 1 # Per-process CPU usage
# Memory
free -h # Memory usage summary
vmstat 1 # Virtual memory statistics
slabtop # Kernel slab cache
# Disk
iostat -x 1 # Disk I/O statistics
iotop # Per-process I/O
lsblk # Block devices
# Network
netstat -s # Network statistics
ss -s # Socket statistics
iftop # Network bandwidth by connection
macOS
# System
top -l 1 # Process snapshot
vm_stat 1 # Virtual memory stats
fs_usage # Filesystem activity
# Network
netstat -i # Interface statistics
nettop # Network usage by process
lsof -i # Network connections
Windows
# Performance
Get-Process | Sort-Object CPU -Descending
Get-Counter '\Processor(_Total)\% Processor Time'
# Network
Get-NetTCPConnection
Test-NetConnection
# Disk
Get-PhysicalDisk
Get-Volume
Application Debugging
Process Inspection
# Attach to running process
strace -p <pid> # System calls
ltrace -p <pid> # Library calls
gdb -p <pid> # Debugger
# Thread dumps
jstack <pid> # Java
kill -3 <pid> # Java (dumps to stdout)
py-spy dump -p <pid> # Python
Network Debugging
# Packet capture
tcpdump -i any port 8080 -w capture.pcap
# HTTP debugging
curl -v https://api.example.com
curl -w "@curl-format.txt" https://api.example.com
# DNS
dig example.com +trace
nslookup example.com
host example.com
# Connectivity
telnet host port
nc -zv host port
Log Analysis
# Search logs
grep -r "ERROR" /var/log/
journalctl -u nginx -f
# Parse JSON logs
jq 'select(.level == "error")' app.log
# Count occurrences
grep "ERROR" app.log | sort | uniq -c | sort -rn
# Time-based filtering
awk '/2025-01-15 14:3[0-9]/ {print}' app.log
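The jq filter above has a straightforward Python equivalent, useful when jq is not available; the sample records are illustrative:

```python
import json

def error_events(lines):
    """Yield parsed JSON log records whose level is 'error'
    (the Python equivalent of: jq 'select(.level == "error")')."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines rather than crash mid-scan
        if record.get("level") == "error":
            yield record

log = [
    '{"level": "info",  "msg": "started"}',
    '{"level": "error", "msg": "db timeout"}',
    'not json at all',
    '{"level": "error", "msg": "retry failed"}',
]
print([r["msg"] for r in error_events(log)])  # ['db timeout', 'retry failed']
```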
Performance Troubleshooting
High CPU
Identify culprit:
# Find CPU-intensive processes
top -b -n 1 | head -20
# Profile running process
perf record -g -p <pid>
perf report
# Python profiling
py-spy top -p <pid>
Common causes:
- Infinite loops
- Inefficient algorithms
- Too many threads
- CPU-bound tasks
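For a Python process you control, the stdlib profiler locates hot functions without extra tooling; `busy()` here is a stand-in workload:

```python
import cProfile
import io
import pstats

def busy():
    # Deliberately CPU-heavy: sum of squares.
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

# Print the five most expensive entries by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

For a process that is already running and can't be restarted, `py-spy top` (shown above) is the non-intrusive option.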
High Memory
Identify memory hogs:
# Process memory
ps aux --sort=-%mem | head -20
# Memory map
pmap -x <pid>
# Heap dump (Java)
jmap -dump:live,format=b,file=heap.bin <pid>
Common causes:
- Memory leaks
- Large object creation
- Insufficient garbage collection
- Cache bloat
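For Python services, `tracemalloc` can point at the allocation sites behind memory growth; the unbounded cache below simulates a leak:

```python
import tracemalloc

tracemalloc.start()

# Simulated leak: a module-level cache that grows and is never evicted.
cache = {}
for i in range(10_000):
    cache[i] = "x" * 100

# Show the top allocation sites by line number.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)

current, peak = tracemalloc.get_traced_memory()
print(f"current={current} bytes, peak={peak} bytes")
```

Comparing two snapshots taken minutes apart (`snapshot2.compare_to(snapshot1, "lineno")`) is the usual way to separate steady-state allocation from genuine growth.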
Slow Queries
Identify slow queries:
-- MySQL (requires slow_query_log=ON and log_output='TABLE')
SELECT * FROM mysql.slow_log ORDER BY query_time DESC LIMIT 10;
-- PostgreSQL
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements  -- columns are total_time/mean_time before PostgreSQL 13
ORDER BY mean_exec_time DESC
LIMIT 10;
Analyze queries:
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
Common causes:
- Missing indexes
- Full table scans
- N+1 queries
- Lock contention
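The N+1 pattern is easy to demonstrate, and fix, with an in-memory SQLite database; the schema and data are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'lin');
    INSERT INTO orders VALUES (1, 1, 9.5), (2, 1, 3.0), (3, 2, 7.25);
""")

def totals_n_plus_1():
    """N+1 pattern: one query for users, then one more per user."""
    result = {}
    for uid, name in conn.execute("SELECT id, name FROM users"):
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (uid,),
        ).fetchone()
        result[name] = row[0]
    return result

def totals_joined():
    """Fix: push the aggregation into a single JOIN."""
    return dict(conn.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """))

print(totals_n_plus_1())  # {'ada': 12.5, 'lin': 7.25}
print(totals_joined())    # same result, one round trip instead of N+1
```

With two users the difference is invisible; with ten thousand, the first version issues 10,001 queries and the second still issues one.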
Network Latency
Measure latency:
# ICMP ping
ping -c 10 example.com
# TCP connection time
time nc -zv example.com 80
# HTTP request breakdown
curl -w "@curl-format.txt" -o /dev/null -s https://example.com
curl-format.txt:
time_namelookup: %{time_namelookup}s\n
time_connect: %{time_connect}s\n
time_appconnect: %{time_appconnect}s\n
time_pretransfer: %{time_pretransfer}s\n
time_redirect: %{time_redirect}s\n
time_starttransfer: %{time_starttransfer}s\n
time_total: %{time_total}s\n
Common causes:
- DNS resolution delays
- Network congestion
- Geographic distance
- Firewall rules
Debugging Strategies
Enable Verbose Logging
Temporarily increase log level:
# Application
export LOG_LEVEL=DEBUG
systemctl restart app
# Web server (Nginx; debug level requires a build with --with-debug)
error_log /var/log/nginx/error.log debug;
# Database
SET GLOBAL general_log = 'ON';
Add Instrumentation
import time
import logging

def slow_function():
    start = time.time()
    # ... function code
    duration = time.time() - start
    logging.info(f"slow_function took {duration:.2f}s")
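The same instrumentation can be packaged as a decorator so any function can be timed without repeating the boilerplate; a sketch:

```python
import functools
import logging
import time

def timed(fn):
    """Log how long each call to fn takes, preserving fn's metadata."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            duration = time.perf_counter() - start
            logging.info("%s took %.2fs", fn.__name__, duration)
    return wrapper

@timed
def slow_function():
    time.sleep(0.1)
```

Using `time.perf_counter()` rather than `time.time()` avoids being fooled by wall-clock adjustments, and the `finally` block ensures the timing is logged even when the function raises.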
Use Feature Flags
if feature_flag('debug_mode'):
    log_detailed_info()
    dump_state()
Reproduce in Isolation
# Minimal reproduction
docker run -it ubuntu:latest bash
# Install only necessary dependencies
# Run minimal test case
Root Cause Analysis
The 5 Whys
Keep asking “why” until you reach root cause.
Example:
Why did the service crash? → Out of memory
Why did it run out of memory? → Memory leak in user session cache
Why is there a memory leak? → Sessions never expire
Why don’t sessions expire? → TTL not configured
Why wasn’t TTL configured? → Not in deployment checklist
Root cause: Missing deployment checklist item
Fix: Add session TTL to checklist
Fishbone Diagram
   Methods        People
       \            /
        \          /
  -------+--------+-------> PROBLEM/EFFECT
        /          \
       /            \
  Materials     Environment
Example: High Latency
- Methods: Inefficient algorithm, no caching
- People: New developer unfamiliar with codebase
- Materials: Outdated libraries, missing indexes
- Environment: Network congestion, undersized servers
Prevention Strategies
Chaos Engineering
Deliberately introduce failures to test resilience.
# Kill a random pod (kubectl has no built-in random selector)
kubectl get pods -l app=myapp -o name | shuf -n 1 | xargs kubectl delete
# Inject network latency
tc qdisc add dev eth0 root netem delay 100ms
# Limit CPU
docker run --cpus=".5" myapp
# Fill disk
dd if=/dev/zero of=/tmp/fill bs=1M count=1000
Load Testing
Find breaking points before users do.
# Apache Bench
ab -n 10000 -c 100 http://example.com/
# k6
k6 run --vus 100 --duration 30s script.js
# Locust
locust -f locustfile.py --host=http://example.com
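A closed-loop load generator is only a few lines of Python; `request()` below is a stand-in for a real HTTP call, with a fixed `time.sleep` simulating service latency:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def request():
    """Stand-in for an HTTP call; swap in a real client here."""
    start = time.perf_counter()
    time.sleep(0.01)  # simulated service latency
    return time.perf_counter() - start

def load_test(workers=10, requests_per_worker=20):
    """Run concurrent requests and summarize latency percentiles."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(
            lambda _: request(), range(workers * requests_per_worker)))
    latencies.sort()
    return {
        "count": len(latencies),
        "p50": statistics.median(latencies),
        "p95": latencies[int(0.95 * len(latencies)) - 1],
    }

print(load_test())
```

For anything beyond a smoke test, the dedicated tools above (ab, k6, Locust) handle ramp-up, open-loop arrival rates, and reporting far better.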
Monitoring & Alerting
Detect issues before users report them.
# Latency increasing
rate(http_request_duration_seconds_sum[5m]) /
rate(http_request_duration_seconds_count[5m]) > 1.0
# Error rate spiking
rate(http_requests_total{status=~"5.."}[5m]) /
rate(http_requests_total[5m]) > 0.05
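The alert above is a simple ratio of 5xx responses to all responses. The same check in Python over a window of status codes, with the threshold matching the 5% above:

```python
def error_rate(status_codes):
    """Fraction of responses with a 5xx status over a window."""
    if not status_codes:
        return 0.0
    errors = sum(1 for s in status_codes if 500 <= s < 600)
    return errors / len(status_codes)

window = [200] * 95 + [503] * 5
rate = error_rate(window)
print(f"error rate: {rate:.1%}")          # error rate: 5.0%
print("alert!" if rate > 0.05 else "ok")  # ok (exactly at the threshold)
```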
Communication During Incidents
Incident Timeline
14:32 - Alert: High latency detected
14:35 - Investigation started (Alice)
14:42 - Root cause identified: DB connection pool exhausted
14:45 - Mitigation: Increased connection pool size
15:00 - Fix deployed: Query optimization
15:15 - Incident resolved
15:30 - Post-mortem scheduled
Status Updates
Good update:
[15:00 UPDATE] We've identified the root cause as database connection
pool exhaustion. We've temporarily increased pool size and are deploying
a fix to optimize slow queries. ETA for full resolution: 15 minutes.
Bad update:
[15:00 UPDATE] We're working on it.
Post-Incident Review
# Post-Incident Review
## What Happened
High latency on API from 14:32-15:15 UTC
## Impact
- 43 minutes of degraded performance
- ~1,000 users affected
- 5% error rate
## Root Cause
Slow database queries exhausted connection pool
## What Went Well
- Alert fired within 2 minutes
- Clear runbook for investigation
- Rapid mitigation deployed
## What Went Poorly
- No monitoring for connection pool usage
- Slow query not caught in code review
- Manual scaling required
## Action Items
1. Add connection pool monitoring
2. Implement query performance tests
3. Automate connection pool scaling
4. Update code review checklist
Troubleshooting Checklist
## Initial Assessment
- [ ] What is the user-visible symptom?
- [ ] When did it start?
- [ ] Is it affecting everyone or subset of users?
- [ ] What changed recently?
## Data Collection
- [ ] Check logs for errors
- [ ] Review metrics/graphs
- [ ] Check recent deployments
- [ ] Review monitoring dashboards
## Investigation
- [ ] Form hypothesis
- [ ] Test hypothesis
- [ ] Document findings
- [ ] Identify root cause
## Resolution
- [ ] Implement fix
- [ ] Verify fix works
- [ ] Monitor for regression
- [ ] Update documentation
## Follow-up
- [ ] Write post-mortem
- [ ] Create action items
- [ ] Schedule review
- [ ] Update runbooks
Common Mistakes
- Making changes without hypothesis: Testing random solutions wastes time
- Not documenting steps: Can’t reproduce or learn from investigation
- Fixing symptoms not cause: Problem will return
- Changing multiple things: Can’t identify what fixed it
- Not verifying fix: Assumed resolution, problem continues
- Skipping post-mortem: Miss opportunity to prevent recurrence
Conclusion
Effective troubleshooting requires:
- Systematic approach: Follow methodology, don’t guess randomly
- Data-driven: Collect evidence, form hypotheses, test
- Documentation: Record findings for future reference
- Communication: Keep stakeholders informed
- Learning: Post-mortems prevent recurrence
Remember: The goal isn’t just to fix the immediate problem. It’s to prevent it from happening again.