Effective troubleshooting is methodical, documented, and reproducible. Random guessing wastes time and can make problems worse.
The Scientific Method
1. Observe
Gather data about the problem.
Questions to ask:
- What is the symptom?
- When did it start?
- What changed recently?
- Is it consistent or intermittent?
- Who/what is affected?
Data to collect:
# System metrics
top, htop, vmstat, iostat
# Network
netstat -an, ss -tulpn, tcpdump
# Logs
journalctl -xe, tail -f /var/log/*
# Application
ps aux, lsof, strace
2. Hypothesize
Form educated guesses about root cause.
Good hypotheses:
- Testable
- Based on evidence
- Specific
Example:
Hypothesis: High latency is caused by database connection pool exhaustion
Evidence: Connection pool metrics show 100% utilization
Test: Monitor connection pool during latency spike
3. Test
Verify or disprove hypothesis.
Testing approaches:
- Check logs for correlation
- Monitor metrics during problem
- Reproduce in controlled environment
- A/B test (compare working vs. broken)
4. Analyze
Interpret test results.
Outcomes:
- Hypothesis confirmed → Proceed to fix
- Hypothesis rejected → Form new hypothesis
- Inconclusive → Gather more data
5. Document
Record findings for future reference.
# Issue: High API Latency
## Timeline
- 14:32 UTC: Latency spike detected
- 14:35 UTC: Investigation started
- 14:42 UTC: Root cause identified
- 15:10 UTC: Fix deployed
- 15:15 UTC: Incident resolved
## Root Cause
Database connection pool exhausted due to slow queries
## Solution
Increased connection pool size and optimized slow query
## Prevention
- Add monitoring for connection pool utilization
- Set query timeout limits
- Regular query performance review
Divide and Conquer
Break complex systems into components.
Layer Isolation
User → Browser → DNS → Load Balancer → Web Server → App Server → Database
Test each layer:
1. Can user access internet? (ping 8.8.8.8)
2. Does DNS resolve? (nslookup example.com)
3. Is load balancer responding? (curl LB-IP)
4. Are web servers healthy? (check health endpoint)
5. Can app connect to DB? (telnet db-host 3306)
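These layer checks can be scripted. A minimal Python sketch using DNS and raw TCP probes; the hostnames and ports below are placeholders for your own stack:

```python
import socket

def check_dns(name):
    """Return True if the hostname resolves."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

def check_tcp(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_layers():
    """Walk the layers outside-in; hostnames are illustrative."""
    return {
        "dns": check_dns("example.com"),
        "load_balancer": check_tcp("lb.example.com", 443),
        "database": check_tcp("db.example.com", 3306),
    }
```

Calling `check_layers()` returns a dict of booleans, so the first `False` tells you which layer to dig into.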
Binary Search
Split problem space in half repeatedly.
Problem: API became slow at some point between 10:00 and 11:00
Check the midpoint, then keep halving the interval that contains the onset:
- Slow at 10:30? Yes → onset at or before 10:30
- Slow at 10:15? Yes → onset at or before 10:15
- Slow at 10:07? Yes → onset at or before 10:07
- Slow at 10:03? No → onset after 10:03
- Slow at 10:05? No → onset after 10:05
- Slow at 10:06? No → onset after 10:06
Conclusion: Problem started at 10:07
Check: What changed at 10:07?
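When you have a per-minute health signal, the bisection can be automated; the `is_slow` predicate below is illustrative:

```python
def find_onset(is_slow, lo, hi):
    """Binary-search the first minute at which is_slow(minute) is True.
    Assumes is_slow is False before the onset and True from then on."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_slow(mid):
            hi = mid     # onset is at mid or earlier
        else:
            lo = mid + 1  # onset is after mid
    return lo

# Illustrative signal: slowness started 7 minutes past 10:00.
onset = find_onset(lambda m: m >= 7, 0, 60)
print(f"Problem started at 10:{onset:02d}")  # Problem started at 10:07
```

Six probes narrow a 60-minute window to a single minute, which is the whole appeal of the technique.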
Component Isolation
# Is it the application or infrastructure?
# Deploy app to different server → Problem persists?
# Is it code or data?
# Run with sample data → Problem persists?
# Is it network or application?
# Run locally → Problem persists?
Common Troubleshooting Patterns
Pattern 1: It Worked Before
When it broke, what changed?
# Check recent deployments
git log --since="2 hours ago"
# Check config changes
diff old-config.yaml new-config.yaml
# Check infrastructure changes
terraform show -json | jq '.values.root_module.resources[] | select(.mode == "managed") | .values.tags.LastModified'
# Check system updates
rpm -qa --last | head -20 # RHEL/CentOS
grep -E " (install|upgrade) " /var/log/dpkg.log | tail -20 # Debian/Ubuntu (dpkg has no --last option)
Pattern 2: It Works Somewhere Else
What’s different between environments?
# Compare versions
diff <(ssh prod 'app --version') <(ssh dev 'app --version')
# Compare configs
diff <(ssh prod 'cat /etc/app/config') <(ssh dev 'cat /etc/app/config')
# Compare dependencies
diff <(ssh prod 'pip freeze') <(ssh dev 'pip freeze')
# Compare environment variables
diff <(ssh prod 'env | sort') <(ssh dev 'env | sort')
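The same comparison works on parsed environments. A small helper for diffing two env dicts; the sample values are made up:

```python
def diff_envs(a, b):
    """Return (only_in_a, only_in_b, changed_keys) for two env dicts."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, changed

prod = {"APP_ENV": "prod", "DB_POOL_SIZE": "10", "FEATURE_X": "on"}
dev = {"APP_ENV": "dev", "DB_POOL_SIZE": "50"}
print(diff_envs(prod, dev))  # (['FEATURE_X'], [], ['APP_ENV', 'DB_POOL_SIZE'])
```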
Pattern 3: Intermittent Issues
Look for patterns over time.
# Correlate with time of day
grep "ERROR" /var/log/app.log | awk '{print $1}' | sort | uniq -c
# Correlate with load
join <(grep "ERROR" app.log | awk '{print $2}') \
<(sar -q 1 60 | awk '{print $1,$5}')
# Correlate with deployments
git log --all --oneline --since="1 week ago" \
--until="$(grep 'ERROR' app.log | head -1 | cut -d' ' -f1)"
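Bucketing errors by hour is also easy in Python; this assumes log lines that start with a timestamp like `2025-01-15 14:32:01`:

```python
from collections import Counter

def errors_by_hour(lines):
    """Count ERROR lines per hour, assuming lines shaped like
    '2025-01-15 14:32:01 ERROR ...'."""
    counts = Counter()
    for line in lines:
        if "ERROR" in line:
            date, time, *_ = line.split()
            counts[f"{date} {time[:2]}:00"] += 1
    return counts

log = [
    "2025-01-15 14:32:01 ERROR timeout",
    "2025-01-15 14:45:10 ERROR timeout",
    "2025-01-15 15:01:00 INFO ok",
    "2025-01-15 15:02:33 ERROR refused",
]
print(errors_by_hour(log))
```

A spike in one bucket points at a time-of-day correlation (cron jobs, batch traffic, backups) worth investigating.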
Pattern 4: Cascading Failures
Follow the dependency chain.
Service A → Service B → Service C → Database
Database slow
↓
Service C times out
↓
Service B retries
↓
Service A queue fills
↓
User sees errors
Start from the deepest dependency:
# Check database
mysql -e "SHOW PROCESSLIST" | grep -c "Query"
# Check database connections
netstat -an | grep :3306 | wc -l
# Check service logs in reverse order
for svc in database service-c service-b service-a; do
echo "=== $svc ==="
tail -n 50 /var/log/$svc.log | grep ERROR
done
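Unbounded retries are what turn one slow dependency into a cascade: each layer multiplies the load on the layer below. A common mitigation is capped exponential backoff with jitter; a sketch:

```python
import random
import time

def call_with_backoff(fn, retries=4, base=0.1, cap=2.0, sleep=time.sleep):
    """Retry fn with capped exponential backoff plus full jitter,
    so synchronized retry storms don't amplify an upstream outage."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            # Full jitter: uniform delay in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            sleep(delay)
```

The `sleep` parameter is injectable purely so the behavior is testable; in production code you would call it with the defaults.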
Essential Tools
System Performance
Linux
# CPU
top # Real-time process monitor
mpstat -P ALL 1 # Per-CPU statistics
pidstat 1 # Per-process CPU usage
# Memory
free -h # Memory usage summary
vmstat 1 # Virtual memory statistics
slabtop # Kernel slab cache
# Disk
iostat -x 1 # Disk I/O statistics
iotop # Per-process I/O
lsblk # Block devices
# Network
netstat -s # Network statistics
ss -s # Socket statistics
iftop # Network bandwidth by connection
macOS
# System
top -l 1 # Process snapshot
vm_stat 1 # Virtual memory stats
fs_usage # Filesystem activity
# Network
netstat -i # Interface statistics
nettop # Network usage by process
lsof -i # Network connections
Windows
# Performance
Get-Process | Sort-Object CPU -Descending
Get-Counter '\Processor(_Total)\% Processor Time'
# Network
Get-NetTCPConnection
Test-NetConnection
# Disk
Get-PhysicalDisk
Get-Volume
Application Debugging
Process Inspection
# Attach to running process
strace -p <pid> # System calls
ltrace -p <pid> # Library calls
gdb -p <pid> # Debugger
# Thread dumps
jstack <pid> # Java
kill -3 <pid> # Java (dumps to stdout)
py-spy dump -p <pid> # Python
Network Debugging
# Packet capture
tcpdump -i any port 8080 -w capture.pcap
# HTTP debugging
curl -v https://api.example.com
curl -w "@curl-format.txt" https://api.example.com
# DNS
dig example.com +trace
nslookup example.com
host example.com
# Connectivity
telnet host port
nc -zv host port
Log Analysis
# Search logs
grep -r "ERROR" /var/log/
journalctl -u nginx -f
# Parse JSON logs
jq 'select(.level == "error")' app.log
# Count occurrences
grep "ERROR" app.log | sort | uniq -c | sort -rn
# Time-based filtering
awk '/2025-01-15 14:3[0-9]/ {print}' app.log
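The jq filter above has a straightforward Python equivalent, useful when jq is not available; the sample records are illustrative:

```python
import json

def error_events(lines):
    """Yield parsed JSON log records whose level is 'error'
    (the Python equivalent of: jq 'select(.level == "error")')."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines rather than crash mid-scan
        if record.get("level") == "error":
            yield record

log = [
    '{"level": "info",  "msg": "started"}',
    '{"level": "error", "msg": "db timeout"}',
    'not json at all',
    '{"level": "error", "msg": "retry failed"}',
]
print([r["msg"] for r in error_events(log)])  # ['db timeout', 'retry failed']
```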
Performance Troubleshooting
High CPU
Identify culprit:
# Find CPU-intensive processes
top -b -n 1 | head -20
# Profile running process
perf record -g -p <pid>
perf report
# Python profiling
py-spy top -p <pid>
Common causes:
- Infinite loops
- Inefficient algorithms
- Too many threads
- CPU-bound tasks
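For a Python process you control, the stdlib profiler locates hot functions without extra tooling; `busy()` here is a stand-in workload:

```python
import cProfile
import io
import pstats

def busy():
    # Deliberately CPU-heavy: sum of squares.
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

# Print the five most expensive entries by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

For a process that is already running and can't be restarted, `py-spy top` (shown above) is the non-intrusive option.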
High Memory
Identify memory hogs:
# Process memory
ps aux --sort=-%mem | head -20
# Memory map
pmap -x <pid>
# Heap dump (Java)
jmap -dump:live,format=b,file=heap.bin <pid>
Common causes:
- Memory leaks
- Large object creation
- Insufficient garbage collection
- Cache bloat
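For Python services, `tracemalloc` can point at the allocation sites behind memory growth; the unbounded cache below simulates a leak:

```python
import tracemalloc

tracemalloc.start()

# Simulated leak: a module-level cache that grows and is never evicted.
cache = {}
for i in range(10_000):
    cache[i] = "x" * 100

# Show the top allocation sites by line number.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)

current, peak = tracemalloc.get_traced_memory()
print(f"current={current} bytes, peak={peak} bytes")
```

Comparing two snapshots taken minutes apart (`snapshot2.compare_to(snapshot1, "lineno")`) is the usual way to separate steady-state allocation from genuine growth.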
Slow Queries
Identify slow queries:
-- MySQL (requires slow_query_log=ON and log_output='TABLE')
SELECT * FROM mysql.slow_log ORDER BY query_time DESC LIMIT 10;
-- PostgreSQL
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements  -- columns are total_time/mean_time before PostgreSQL 13
ORDER BY mean_exec_time DESC
LIMIT 10;
Analyze queries:
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
Common causes:
- Missing indexes
- Full table scans
- N+1 queries
- Lock contention
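The N+1 pattern is easy to demonstrate, and fix, with an in-memory SQLite database; the schema and data are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'lin');
    INSERT INTO orders VALUES (1, 1, 9.5), (2, 1, 3.0), (3, 2, 7.25);
""")

def totals_n_plus_1():
    """N+1 pattern: one query for users, then one more per user."""
    result = {}
    for uid, name in conn.execute("SELECT id, name FROM users"):
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (uid,),
        ).fetchone()
        result[name] = row[0]
    return result

def totals_joined():
    """Fix: push the aggregation into a single JOIN."""
    return dict(conn.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """))

print(totals_n_plus_1())  # {'ada': 12.5, 'lin': 7.25}
print(totals_joined())    # same result, one round trip instead of N+1
```

With two users the difference is invisible; with ten thousand, the first version issues 10,001 queries and the second still issues one.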
Network Latency
Measure latency:
# ICMP ping
ping -c 10 example.com
# TCP connection time
time nc -zv example.com 80
# HTTP request breakdown
curl -w "@curl-format.txt" -o /dev/null -s https://example.com
curl-format.txt:
time_namelookup: %{time_namelookup}s\n
time_connect: %{time_connect}s\n
time_appconnect: %{time_appconnect}s\n
time_pretransfer: %{time_pretransfer}s\n
time_redirect: %{time_redirect}s\n
time_starttransfer: %{time_starttransfer}s\n
time_total: %{time_total}s\n
Common causes:
- DNS resolution delays
- Network congestion
- Geographic distance
- Firewall rules
Debugging Strategies
Enable Verbose Logging
Temporarily increase log level:
# Application
export LOG_LEVEL=DEBUG
systemctl restart app
# Web server (Nginx; debug level requires a build with --with-debug)
error_log /var/log/nginx/error.log debug;
# Database
SET GLOBAL general_log = 'ON';
Add Instrumentation
import time
import logging

def slow_function():
    start = time.time()
    # ... function code
    duration = time.time() - start
    logging.info(f"slow_function took {duration:.2f}s")
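The same instrumentation can be packaged as a decorator so any function can be timed without repeating the boilerplate; a sketch:

```python
import functools
import logging
import time

def timed(fn):
    """Log how long each call to fn takes, preserving fn's metadata."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            duration = time.perf_counter() - start
            logging.info("%s took %.2fs", fn.__name__, duration)
    return wrapper

@timed
def slow_function():
    time.sleep(0.1)
```

Using `time.perf_counter()` rather than `time.time()` avoids being fooled by wall-clock adjustments, and the `finally` block ensures the timing is logged even when the function raises.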
Use Feature Flags
if feature_flag('debug_mode'):
    log_detailed_info()
    dump_state()
Reproduce in Isolation
# Minimal reproduction
docker run -it ubuntu:latest bash
# Install only necessary dependencies
# Run minimal test case
Root Cause Analysis
The 5 Whys
Keep asking “why” until you reach root cause.
Example:
Why did the service crash? → Out of memory
Why did it run out of memory? → Memory leak in user session cache
Why is there a memory leak? → Sessions never expire
Why don’t sessions expire? → TTL not configured
Why wasn’t TTL configured? → Not in deployment checklist
Root cause: Missing deployment checklist item
Fix: Add session TTL to checklist
Fishbone Diagram
   Methods        People
       \            /
        \          /
  -------+--------+-------> PROBLEM/EFFECT
        /          \
       /            \
  Materials     Environment
Example: High Latency
- Methods: Inefficient algorithm, no caching
- People: New developer unfamiliar with codebase
- Materials: Outdated libraries, missing indexes
- Environment: Network congestion, undersized servers
Prevention Strategies
Chaos Engineering
Deliberately introduce failures to test resilience.
# Kill a random pod (kubectl has no built-in random selector)
kubectl get pods -l app=myapp -o name | shuf -n 1 | xargs kubectl delete
# Inject network latency
tc qdisc add dev eth0 root netem delay 100ms
# Limit CPU
docker run --cpus=".5" myapp
# Fill disk
dd if=/dev/zero of=/tmp/fill bs=1M count=1000
Load Testing
Find breaking points before users do.
# Apache Bench
ab -n 10000 -c 100 http://example.com/
# k6
k6 run --vus 100 --duration 30s script.js
# Locust
locust -f locustfile.py --host=http://example.com
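A closed-loop load generator is only a few lines of Python; `request()` below is a stand-in for a real HTTP call, with a fixed `time.sleep` simulating service latency:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def request():
    """Stand-in for an HTTP call; swap in a real client here."""
    start = time.perf_counter()
    time.sleep(0.01)  # simulated service latency
    return time.perf_counter() - start

def load_test(workers=10, requests_per_worker=20):
    """Run concurrent requests and summarize latency percentiles."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(
            lambda _: request(), range(workers * requests_per_worker)))
    latencies.sort()
    return {
        "count": len(latencies),
        "p50": statistics.median(latencies),
        "p95": latencies[int(0.95 * len(latencies)) - 1],
    }

print(load_test())
```

For anything beyond a smoke test, the dedicated tools above (ab, k6, Locust) handle ramp-up, open-loop arrival rates, and reporting far better.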
Monitoring & Alerting
Detect issues before users report them.
# Latency increasing
rate(http_request_duration_seconds_sum[5m]) /
rate(http_request_duration_seconds_count[5m]) > 1.0
# Error rate spiking
rate(http_requests_total{status=~"5.."}[5m]) /
rate(http_requests_total[5m]) > 0.05
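The alert above is a simple ratio of 5xx responses to all responses. The same check in Python over a window of status codes, with the threshold matching the 5% above:

```python
def error_rate(status_codes):
    """Fraction of responses with a 5xx status over a window."""
    if not status_codes:
        return 0.0
    errors = sum(1 for s in status_codes if 500 <= s < 600)
    return errors / len(status_codes)

window = [200] * 95 + [503] * 5
rate = error_rate(window)
print(f"error rate: {rate:.1%}")          # error rate: 5.0%
print("alert!" if rate > 0.05 else "ok")  # ok (exactly at the threshold)
```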
Communication During Incidents
Incident Timeline
14:32 - Alert: High latency detected
14:35 - Investigation started (Alice)
14:42 - Root cause identified: DB connection pool exhausted
14:45 - Mitigation: Increased connection pool size
15:00 - Fix deployed: Query optimization
15:15 - Incident resolved
15:30 - Post-mortem scheduled
Status Updates
Good update:
[15:00 UPDATE] We've identified the root cause as database connection
pool exhaustion. We've temporarily increased pool size and are deploying
a fix to optimize slow queries. ETA for full resolution: 15 minutes.
Bad update:
[15:00 UPDATE] We're working on it.
Post-Incident Review
# Post-Incident Review
## What Happened
High latency on API from 14:32-15:15 UTC
## Impact
- 43 minutes of degraded performance
- ~1,000 users affected
- 5% error rate
## Root Cause
Slow database queries exhausted connection pool
## What Went Well
- Alert fired within 2 minutes
- Clear runbook for investigation
- Rapid mitigation deployed
## What Went Poorly
- No monitoring for connection pool usage
- Slow query not caught in code review
- Manual scaling required
## Action Items
1. Add connection pool monitoring
2. Implement query performance tests
3. Automate connection pool scaling
4. Update code review checklist
Troubleshooting Checklist
## Initial Assessment
- [ ] What is the user-visible symptom?
- [ ] When did it start?
- [ ] Is it affecting everyone or subset of users?
- [ ] What changed recently?
## Data Collection
- [ ] Check logs for errors
- [ ] Review metrics/graphs
- [ ] Check recent deployments
- [ ] Review monitoring dashboards
## Investigation
- [ ] Form hypothesis
- [ ] Test hypothesis
- [ ] Document findings
- [ ] Identify root cause
## Resolution
- [ ] Implement fix
- [ ] Verify fix works
- [ ] Monitor for regression
- [ ] Update documentation
## Follow-up
- [ ] Write post-mortem
- [ ] Create action items
- [ ] Schedule review
- [ ] Update runbooks
Common Mistakes
- Making changes without hypothesis: Testing random solutions wastes time
- Not documenting steps: Can’t reproduce or learn from investigation
- Fixing symptoms not cause: Problem will return
- Changing multiple things: Can’t identify what fixed it
- Not verifying fix: Assumed resolution, problem continues
- Skipping post-mortem: Miss opportunity to prevent recurrence
Conclusion
Effective troubleshooting requires:
- Systematic approach: Follow methodology, don’t guess randomly
- Data-driven: Collect evidence, form hypotheses, test
- Documentation: Record findings for future reference
- Communication: Keep stakeholders informed
- Learning: Post-mortems prevent recurrence
Remember: The goal isn’t just to fix the immediate problem. It’s to prevent it from happening again.