High Availability (HA) ensures systems remain operational despite failures. It’s achieved through redundancy, failover, and fault tolerance.
HA Fundamentals
Availability Calculation
Availability = Uptime / (Uptime + Downtime)
99.9% (three nines) = 8.76 hours downtime/year
99.99% (four nines) = 52.56 minutes downtime/year
99.999% (five nines) = 5.26 minutes downtime/year
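These figures fall directly out of the availability formula; a quick sketch of the arithmetic (function name is illustrative):

```python
def allowed_downtime_minutes(availability_pct, period_minutes=525_600):
    """Allowed downtime per period (default: one 365-day year)
    for a given availability percentage."""
    return period_minutes * (1 - availability_pct / 100)

print(allowed_downtime_minutes(99.9) / 60)   # three nines: about 8.76 hours/year
print(allowed_downtime_minutes(99.99))       # four nines: about 52.56 minutes/year
print(allowed_downtime_minutes(99.999))      # five nines: about 5.26 minutes/year
```

Passing `period_minutes=43_200` gives the monthly error budget instead.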
Components of HA
- Redundancy: Multiple instances of components
- Failover: Automatic switching to backup
- Health Checks: Detect failures quickly
- Load Balancing: Distribute traffic
- Data Replication: Prevent data loss
Active-Active Pattern
All instances actively serve traffic.
           Load Balancer
                 |
         --------+--------
         |               |
     Server A        Server B
     (Active)        (Active)
         |               |
         --------+--------
                 |
             Database
           (Replicated)
Benefits:
- No wasted resources
- Better performance
- Seamless failover
Challenges:
- Data synchronization
- Session management
- Consistency
Implementation Example:
# Kubernetes Deployment with multiple replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: web-app:latest
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
Active-Passive Pattern
One instance serves traffic, others on standby.
           Load Balancer
                 |
         --------+--------
         |               |
     Server A        Server B
     (Active)        (Passive)
Benefits:
- Simpler failover
- Easier data consistency
- Lower resource costs
Challenges:
- Wasted capacity
- Failover time
- Detection delay
Implementation Example:
# HAProxy with backup server
backend web_servers
    balance roundrobin
    server web1 10.0.1.10:80 check
    server web2 10.0.1.11:80 check backup
Multi-Region Deployment
Distribute across geographic regions for disaster recovery.
      Global Load Balancer (Route 53)
                   |
           --------+--------
           |               |
       Region A         Region B
      US-East-1        EU-West-1
AWS Route 53 Failover:
{
  "Name": "example.com",
  "Type": "A",
  "SetIdentifier": "Primary",
  "Failover": "PRIMARY",
  "HealthCheckId": "abc123",
  "ResourceRecords": [{"Value": "52.1.1.1"}]
},
{
  "Name": "example.com",
  "Type": "A",
  "SetIdentifier": "Secondary",
  "Failover": "SECONDARY",
  "ResourceRecords": [{"Value": "54.2.2.2"}]
}
Database HA Patterns
Master-Slave Replication
      Master (Write)
            |
       Replication
            |
        ----+----
        |       |
    Slave A   Slave B
    (Read)    (Read)
MySQL Master-Slave:
-- Run on the replica (CHANGE MASTER TO points the slave at the master)
CHANGE MASTER TO
  MASTER_HOST='master.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=107;
START SLAVE;
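The payoff of read replicas is routing reads away from the master. A minimal read/write splitting sketch (the connection objects and class name are placeholders, not a real driver API):

```python
import itertools

class ReplicatedDB:
    """Route writes to the primary and spread reads across replicas.
    Hypothetical wrapper: each connection object just needs an
    execute(sql, params) method."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over the read replicas
        self._replica_cycle = itertools.cycle(replicas)

    def execute(self, sql, params=()):
        # Writes must go to the primary so replication stays consistent
        if sql.lstrip().upper().startswith(('INSERT', 'UPDATE', 'DELETE')):
            return self.primary.execute(sql, params)
        # Reads can use replicas -- beware replication lag for
        # read-after-write workflows
        return next(self._replica_cycle).execute(sql, params)
```

Real connection pools (and ORMs with router support) handle this more robustly, including lag-aware routing and transaction pinning.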
Multi-Master Replication
Master A ←→ Master B
(Write) (Write)
Conflict Resolution Required:
- Last-write-wins
- Application-level resolution
- Manual intervention
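Last-write-wins can be sketched with per-row timestamps; a toy resolver, where the field names (`updated_at`, `node_id`) are illustrative:

```python
def resolve_lww(row_a, row_b):
    """Last-write-wins conflict resolution: keep whichever version
    carries the newer updated_at timestamp. Ties are broken
    deterministically by node id so every master converges on the
    same winner regardless of delivery order."""
    key_a = (row_a['updated_at'], row_a['node_id'])
    key_b = (row_b['updated_at'], row_b['node_id'])
    return row_a if key_a >= key_b else row_b
```

Note that LWW silently discards the losing write, which is exactly why application-level resolution is often preferred for data that must not be lost.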
Clustering
MySQL Group Replication:
Node A ←→ Node B ←→ Node C
(R/W) (R/W) (R/W)
PostgreSQL Patroni:
# patroni.yml
scope: postgres-cluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.10:8008

etcd:
  hosts: 10.0.1.20:2379,10.0.1.21:2379,10.0.1.22:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.10:5432
  data_dir: /var/lib/postgresql/data
Health Checks
HTTP Health Endpoint
import os

from flask import Flask, jsonify
import psycopg2

DATABASE_URL = os.environ['DATABASE_URL']

app = Flask(__name__)

@app.route('/health')
def health_check():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'disk': check_disk_space(),
    }
    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503
    return jsonify({
        'status': 'healthy' if all_healthy else 'unhealthy',
        'checks': checks
    }), status_code

def check_database():
    try:
        conn = psycopg2.connect(DATABASE_URL)
        conn.close()
        return True
    except psycopg2.Error:
        return False

# check_cache() and check_disk_space() follow the same try/except pattern
Load Balancer Health Check
# HAProxy
backend web_servers
    option httpchk GET /health HTTP/1.1\r\nHost:\ example.com
    http-check expect status 200
    server web1 10.0.1.10:80 check inter 5s fall 3 rise 2
Deep vs. Shallow Checks
Shallow (Fast):
GET /health
→ 200 OK (service is running)
Deep (Comprehensive):
GET /health
→ Check database connection
→ Check cache connection
→ Check disk space
→ Check external API
→ 200 OK (all dependencies healthy)
Failover Strategies
DNS Failover
# Route 53 health check
aws route53 create-health-check \
--caller-reference $(date +%s) \
--health-check-config \
IPAddress=52.1.1.1,Port=80,Type=HTTP,ResourcePath=/health
Limitations:
- DNS caching delays (TTL)
- Client-side caching
- Typically 60+ second failover time
Load Balancer Failover
Client → Load Balancer → [Server A | Server B | Server C]
(health checked every 5s)
Benefits:
- Fast failover (seconds)
- No DNS caching issues
- Automatic recovery
Application-Level Failover
def get_data():
    try:
        return primary_db.query()
    except DatabaseError:
        logger.warning("Primary DB failed, using secondary")
        return secondary_db.query()
Circuit Breaker Pattern
Prevent cascading failures by stopping requests to failing services.
import time

class CircuitBreakerOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time > self.timeout:
                # Timeout elapsed: allow one trial request through
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenError()
        try:
            result = func(*args, **kwargs)
            if self.state == 'HALF_OPEN':
                # Trial request succeeded: close the circuit
                self.state = 'CLOSED'
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
            raise

# Usage
breaker = CircuitBreaker()

@app.route('/data')
def get_data():
    return breaker.call(external_api.fetch_data)
Session Management
Sticky Sessions
# HAProxy sticky sessions
backend web_servers
    cookie SERVERID insert indirect nocache
    server web1 10.0.1.10:80 cookie web1 check
    server web2 10.0.1.11:80 cookie web2 check
Drawbacks:
- Uneven load distribution
- Server failure loses sessions
- Harder to scale
Shared Session Store
# Redis-backed sessions
import redis
from flask import Flask, redirect, request, session
from flask_session import Session

app = Flask(__name__)
app.config['SESSION_TYPE'] = 'redis'
app.config['SESSION_REDIS'] = redis.StrictRedis(
    host='redis.example.com',
    port=6379,
    db=0
)
Session(app)

@app.route('/login', methods=['POST'])
def login():
    session['user_id'] = authenticate(request.form)
    return redirect('/dashboard')
Benefits:
- Server failures don’t lose sessions
- Easy horizontal scaling
- Consistent user experience
Stateless Architecture
Design services without server-side state.
# Stateless API with JWT
import jwt
from flask import jsonify, request

@app.route('/api/protected')
def protected():
    token = request.headers.get('Authorization').split()[1]
    payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
    user_id = payload['user_id']
    return jsonify({'data': get_user_data(user_id)})
Benefits:
- Any server can handle any request
- Easy scaling
- Simple failover
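The counterpart at login is issuing the token. A stdlib-only sketch of the principle (real deployments should use a JWT library; `SECRET_KEY` and the claim names are assumptions): the token is self-contained, so any server holding the key can verify it without shared session state.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b'change-me'  # assumed shared signing key

def issue_token(user_id, ttl_seconds=1800):
    """Sign a short-lived, self-contained auth token."""
    payload = base64.urlsafe_b64encode(json.dumps(
        {'user_id': user_id, 'exp': int(time.time()) + ttl_seconds}
    ).encode()).decode()
    sig = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f'{payload}.{sig}'

def verify_token(token):
    """Return the claims if the signature and expiry check out, else None."""
    payload_b64, sig = token.rsplit('.', 1)
    expected = hmac.new(SECRET_KEY, payload_b64.encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    if claims['exp'] < time.time():
        return None  # expired
    return claims
```

Because verification needs only the key, servers can be added, removed, or failed over freely.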
Graceful Degradation
Continue operating with reduced functionality when components fail.
def get_recommendations(user_id):
    try:
        # Try ML recommendation service
        return ml_service.get_recommendations(user_id)
    except ServiceUnavailable:
        # Fall back to simple algorithm
        logger.warning("ML service unavailable, using fallback")
        return get_popular_items()

def get_user_profile(user_id):
    profile = db.get_user(user_id)
    # Try to enrich with external data
    try:
        profile.update(external_api.get_extra_data(user_id))
    except Exception:
        # Continue without enrichment
        logger.warning("External API failed, returning basic profile")
    return profile
Auto-Healing
Automatically recover from failures.
Kubernetes Liveness Probe:
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
    - name: web
      image: web-app:latest
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
Auto-Restart Service:
# systemd service with automatic restart
[Service]
Restart=always
RestartSec=10s
Rate Limiting & Throttling
Protect services from overload.
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
limiter = Limiter(
    app,
    key_func=get_remote_address,
    default_limits=["200 per day", "50 per hour"]
)

@app.route("/api/expensive")
@limiter.limit("10 per minute")
def expensive_operation():
    return perform_operation()
Bulkhead Pattern
Isolate resources to prevent cascading failures.
from concurrent.futures import ThreadPoolExecutor

# Separate thread pools for different operations
critical_pool = ThreadPoolExecutor(max_workers=10)
background_pool = ThreadPoolExecutor(max_workers=5)

@app.route('/critical')
def critical_operation():
    future = critical_pool.submit(process_critical)
    return future.result()

@app.route('/background')
def background_operation():
    future = background_pool.submit(process_background)
    return {'status': 'processing'}
Chaos Engineering
Test HA by intentionally causing failures.
# Chaos Monkey - randomly terminate instances
while true; do
    instance=$(aws ec2 describe-instances \
        --filters "Name=tag:Environment,Values=staging" \
        --query 'Reservations[*].Instances[*].InstanceId' \
        --output text | shuf -n 1)
    echo "Terminating $instance"
    aws ec2 terminate-instances --instance-ids "$instance"
    sleep $((RANDOM % 3600 + 600))  # wait 10-70 minutes
done
Monitoring HA Systems
Key Metrics:
# Service availability
up{job="web-app"}
# Request success rate
rate(http_requests_total{status!~"5.."}[5m]) /
rate(http_requests_total[5m])
# Healthy instance count
count(up{job="web-app"} == 1)
# Failover events
increase(failover_events_total[1h])
Testing HA
Failure Injection
# Network partition
iptables -A INPUT -s 10.0.1.10 -j DROP
# CPU stress
stress --cpu 4 --timeout 60s
# Memory pressure
stress --vm 2 --vm-bytes 1G --timeout 60s
# Disk I/O
stress --io 4 --timeout 60s
Disaster Recovery Drills
## DR Drill Checklist
### Preparation
- [ ] Schedule drill during low-traffic period
- [ ] Notify all stakeholders
- [ ] Prepare rollback plan
- [ ] Set up monitoring
### Execution
- [ ] Simulate primary region failure
- [ ] Verify failover to secondary region
- [ ] Test all critical functionality
- [ ] Measure failover time
### Validation
- [ ] Verify data consistency
- [ ] Check all services operational
- [ ] Review metrics and logs
- [ ] Document issues found
### Cleanup
- [ ] Failback to primary region
- [ ] Verify normal operations
- [ ] Document lessons learned
- [ ] Update runbooks
SLA Considerations
Define SLA Metrics:
Service Level Agreement:
  Availability: 99.95%
  Response Time (p99): < 500ms
  Error Rate: < 0.1%
  Maintenance Window: Sunday 02:00-04:00 UTC
  Support Response: < 1 hour
Calculate Allowed Downtime:
99.9% = 43.2 minutes/month
99.95% = 21.6 minutes/month
99.99% = 4.3 minutes/month
Cost vs. Availability
Availability   Cost Multiple   Use Case
─────────────────────────────────────────
99%            1x              Internal tools
99.9%          2-3x            Standard SaaS
99.95%         5-10x           Critical business
99.99%         10-50x          Financial services
99.999%        50-100x+        Emergency services
Conclusion
High availability requires:
- Redundancy at all levels
- Automated failover mechanisms
- Comprehensive monitoring
- Regular testing
- Graceful degradation
- Documentation and runbooks
Remember: HA is not achieved once. It requires continuous validation and improvement.