High Availability (HA) ensures systems remain operational despite failures. It’s achieved through redundancy, failover, and fault tolerance.
HA Fundamentals
Availability Calculation
Availability = Uptime / (Uptime + Downtime)
99.9% (three nines) = 8.76 hours downtime/year
99.99% (four nines) = 52.56 minutes downtime/year
99.999% (five nines) = 5.26 minutes downtime/year
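These figures fall directly out of the availability formula; a quick sketch of the arithmetic (function name is illustrative):

```python
def allowed_downtime_minutes(availability_pct, period_minutes=525_600):
    """Allowed downtime per period (default: one 365-day year)
    for a given availability percentage."""
    return period_minutes * (1 - availability_pct / 100)

print(allowed_downtime_minutes(99.9) / 60)   # three nines: about 8.76 hours/year
print(allowed_downtime_minutes(99.99))       # four nines: about 52.56 minutes/year
print(allowed_downtime_minutes(99.999))      # five nines: about 5.26 minutes/year
```

Passing `period_minutes=43_200` gives the monthly error budget instead.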
Components of HA
- Redundancy: Multiple instances of components
- Failover: Automatic switching to backup
- Health Checks: Detect failures quickly
- Load Balancing: Distribute traffic
- Data Replication: Prevent data loss
Active-Active Pattern
All instances actively serve traffic.
           Load Balancer
                 |
         --------+--------
         |               |
     Server A        Server B
     (Active)        (Active)
         |               |
         --------+--------
                 |
             Database
           (Replicated)
Benefits:
- No wasted resources
- Better performance
- Seamless failover
Challenges:
- Data synchronization
- Session management
- Consistency
Implementation Example:
# Kubernetes Deployment with multiple replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: web-app:latest
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
Active-Passive Pattern
One instance serves traffic, others on standby.
           Load Balancer
                 |
         --------+--------
         |               |
     Server A        Server B
     (Active)        (Passive)
Benefits:
- Simpler failover
- Easier data consistency
- Lower resource costs
Challenges:
- Wasted capacity
- Failover time
- Detection delay
Implementation Example:
# HAProxy with backup server
backend web_servers
    balance roundrobin
    server web1 10.0.1.10:80 check
    server web2 10.0.1.11:80 check backup
Multi-Region Deployment
Distribute across geographic regions for disaster recovery.
      Global Load Balancer (Route 53)
                   |
           --------+--------
           |               |
       Region A         Region B
      US-East-1        EU-West-1
AWS Route 53 Failover:
{
  "Name": "example.com",
  "Type": "A",
  "SetIdentifier": "Primary",
  "Failover": "PRIMARY",
  "HealthCheckId": "abc123",
  "ResourceRecords": [{"Value": "52.1.1.1"}]
},
{
  "Name": "example.com",
  "Type": "A",
  "SetIdentifier": "Secondary",
  "Failover": "SECONDARY",
  "ResourceRecords": [{"Value": "54.2.2.2"}]
}
Database HA Patterns
Master-Slave Replication
      Master (Write)
            |
       Replication
            |
        ----+----
        |       |
    Slave A   Slave B
    (Read)    (Read)
MySQL Master-Slave:
-- Run on the replica (CHANGE MASTER TO points the slave at the master)
CHANGE MASTER TO
  MASTER_HOST='master.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=107;
START SLAVE;
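The payoff of read replicas is routing reads away from the master. A minimal read/write splitting sketch (the connection objects and class name are placeholders, not a real driver API):

```python
import itertools

class ReplicatedDB:
    """Route writes to the primary and spread reads across replicas.
    Hypothetical wrapper: each connection object just needs an
    execute(sql, params) method."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over the read replicas
        self._replica_cycle = itertools.cycle(replicas)

    def execute(self, sql, params=()):
        # Writes must go to the primary so replication stays consistent
        if sql.lstrip().upper().startswith(('INSERT', 'UPDATE', 'DELETE')):
            return self.primary.execute(sql, params)
        # Reads can use replicas -- beware replication lag for
        # read-after-write workflows
        return next(self._replica_cycle).execute(sql, params)
```

Real connection pools (and ORMs with router support) handle this more robustly, including lag-aware routing and transaction pinning.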
Multi-Master Replication
Master A ←→ Master B
(Write) (Write)
Conflict Resolution Required:
- Last-write-wins
- Application-level resolution
- Manual intervention
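Last-write-wins can be sketched with per-row timestamps; a toy resolver, where the field names (`updated_at`, `node_id`) are illustrative:

```python
def resolve_lww(row_a, row_b):
    """Last-write-wins conflict resolution: keep whichever version
    carries the newer updated_at timestamp. Ties are broken
    deterministically by node id so every master converges on the
    same winner regardless of delivery order."""
    key_a = (row_a['updated_at'], row_a['node_id'])
    key_b = (row_b['updated_at'], row_b['node_id'])
    return row_a if key_a >= key_b else row_b
```

Note that LWW silently discards the losing write, which is exactly why application-level resolution is often preferred for data that must not be lost.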
Clustering
MySQL Group Replication:
Node A ←→ Node B ←→ Node C
(R/W) (R/W) (R/W)
PostgreSQL Patroni:
# patroni.yml
scope: postgres-cluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.10:8008

etcd:
  hosts: 10.0.1.20:2379,10.0.1.21:2379,10.0.1.22:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.10:5432
  data_dir: /var/lib/postgresql/data
Health Checks
HTTP Health Endpoint
import os

from flask import Flask, jsonify
import psycopg2

DATABASE_URL = os.environ['DATABASE_URL']

app = Flask(__name__)

@app.route('/health')
def health_check():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'disk': check_disk_space(),
    }
    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503
    return jsonify({
        'status': 'healthy' if all_healthy else 'unhealthy',
        'checks': checks
    }), status_code

def check_database():
    try:
        conn = psycopg2.connect(DATABASE_URL)
        conn.close()
        return True
    except psycopg2.Error:
        return False

# check_cache() and check_disk_space() follow the same try/except pattern
Load Balancer Health Check
# HAProxy
backend web_servers
    option httpchk GET /health HTTP/1.1\r\nHost:\ example.com
    http-check expect status 200
    server web1 10.0.1.10:80 check inter 5s fall 3 rise 2
Deep vs. Shallow Checks
Shallow (Fast):
GET /health
→ 200 OK (service is running)
Deep (Comprehensive):
GET /health
→ Check database connection
→ Check cache connection
→ Check disk space
→ Check external API
→ 200 OK (all dependencies healthy)
Failover Strategies
DNS Failover
# Route 53 health check
aws route53 create-health-check \
--caller-reference $(date +%s) \
--health-check-config \
IPAddress=52.1.1.1,Port=80,Type=HTTP,ResourcePath=/health
Limitations:
- DNS caching delays (TTL)
- Client-side caching
- Typically 60+ second failover time
Load Balancer Failover
Client → Load Balancer → [Server A | Server B | Server C]
(health checked every 5s)
Benefits:
- Fast failover (seconds)
- No DNS caching issues
- Automatic recovery
Application-Level Failover
def get_data():
    try:
        return primary_db.query()
    except DatabaseError:
        logger.warning("Primary DB failed, using secondary")
        return secondary_db.query()
Circuit Breaker Pattern
Prevent cascading failures by stopping requests to failing services.
import time

class CircuitBreakerOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time > self.timeout:
                # Timeout elapsed: allow one trial request through
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenError()
        try:
            result = func(*args, **kwargs)
            if self.state == 'HALF_OPEN':
                # Trial request succeeded: close the circuit
                self.state = 'CLOSED'
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
            raise

# Usage
breaker = CircuitBreaker()

@app.route('/data')
def get_data():
    return breaker.call(external_api.fetch_data)
Session Management
Sticky Sessions
# HAProxy sticky sessions
backend web_servers
    cookie SERVERID insert indirect nocache
    server web1 10.0.1.10:80 cookie web1 check
    server web2 10.0.1.11:80 cookie web2 check
Drawbacks:
- Uneven load distribution
- Server failure loses sessions
- Harder to scale
Shared Session Store
# Redis-backed sessions
import redis
from flask import Flask, redirect, request, session
from flask_session import Session

app = Flask(__name__)
app.config['SESSION_TYPE'] = 'redis'
app.config['SESSION_REDIS'] = redis.StrictRedis(
    host='redis.example.com',
    port=6379,
    db=0
)
Session(app)

@app.route('/login', methods=['POST'])
def login():
    session['user_id'] = authenticate(request.form)
    return redirect('/dashboard')
Benefits:
- Server failures don’t lose sessions
- Easy horizontal scaling
- Consistent user experience
Stateless Architecture
Design services without server-side state.
# Stateless API with JWT
import jwt
from flask import jsonify, request

@app.route('/api/protected')
def protected():
    token = request.headers.get('Authorization').split()[1]
    payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
    user_id = payload['user_id']
    return jsonify({'data': get_user_data(user_id)})
Benefits:
- Any server can handle any request
- Easy scaling
- Simple failover
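The counterpart at login is issuing the token. A stdlib-only sketch of the principle (real deployments should use a JWT library; `SECRET_KEY` and the claim names are assumptions): the token is self-contained, so any server holding the key can verify it without shared session state.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b'change-me'  # assumed shared signing key

def issue_token(user_id, ttl_seconds=1800):
    """Sign a short-lived, self-contained auth token."""
    payload = base64.urlsafe_b64encode(json.dumps(
        {'user_id': user_id, 'exp': int(time.time()) + ttl_seconds}
    ).encode()).decode()
    sig = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f'{payload}.{sig}'

def verify_token(token):
    """Return the claims if the signature and expiry check out, else None."""
    payload_b64, sig = token.rsplit('.', 1)
    expected = hmac.new(SECRET_KEY, payload_b64.encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    if claims['exp'] < time.time():
        return None  # expired
    return claims
```

Because verification needs only the key, servers can be added, removed, or failed over freely.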
Graceful Degradation
Continue operating with reduced functionality when components fail.
def get_recommendations(user_id):
    try:
        # Try ML recommendation service
        return ml_service.get_recommendations(user_id)
    except ServiceUnavailable:
        # Fall back to simple algorithm
        logger.warning("ML service unavailable, using fallback")
        return get_popular_items()

def get_user_profile(user_id):
    profile = db.get_user(user_id)
    # Try to enrich with external data
    try:
        profile.update(external_api.get_extra_data(user_id))
    except Exception:
        # Continue without enrichment
        logger.warning("External API failed, returning basic profile")
    return profile
Auto-Healing
Automatically recover from failures.
Kubernetes Liveness Probe:
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
    - name: web
      image: web-app:latest
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
Auto-Restart Service:
# systemd service with automatic restart
[Service]
Restart=always
RestartSec=10s
Rate Limiting & Throttling
Protect services from overload.
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
limiter = Limiter(
    app,
    key_func=get_remote_address,
    default_limits=["200 per day", "50 per hour"]
)

@app.route("/api/expensive")
@limiter.limit("10 per minute")
def expensive_operation():
    return perform_operation()
Bulkhead Pattern
Isolate resources to prevent cascading failures.
from concurrent.futures import ThreadPoolExecutor

# Separate thread pools for different operations
critical_pool = ThreadPoolExecutor(max_workers=10)
background_pool = ThreadPoolExecutor(max_workers=5)

@app.route('/critical')
def critical_operation():
    future = critical_pool.submit(process_critical)
    return future.result()

@app.route('/background')
def background_operation():
    future = background_pool.submit(process_background)
    return {'status': 'processing'}
Chaos Engineering
Test HA by intentionally causing failures.
# Chaos Monkey - randomly terminate instances
while true; do
    instance=$(aws ec2 describe-instances \
        --filters "Name=tag:Environment,Values=staging" \
        --query 'Reservations[*].Instances[*].InstanceId' \
        --output text | shuf -n 1)
    echo "Terminating $instance"
    aws ec2 terminate-instances --instance-ids "$instance"
    sleep $((RANDOM % 3600 + 600))  # wait 10-70 minutes
done
Monitoring HA Systems
Key Metrics:
# Service availability
up{job="web-app"}
# Request success rate
rate(http_requests_total{status!~"5.."}[5m]) /
rate(http_requests_total[5m])
# Healthy instance count
count(up{job="web-app"} == 1)
# Failover events
increase(failover_events_total[1h])
Testing HA
Failure Injection
# Network partition
iptables -A INPUT -s 10.0.1.10 -j DROP
# CPU stress
stress --cpu 4 --timeout 60s
# Memory pressure
stress --vm 2 --vm-bytes 1G --timeout 60s
# Disk I/O
stress --io 4 --timeout 60s
Disaster Recovery Drills
## DR Drill Checklist
### Preparation
- [ ] Schedule drill during low-traffic period
- [ ] Notify all stakeholders
- [ ] Prepare rollback plan
- [ ] Set up monitoring
### Execution
- [ ] Simulate primary region failure
- [ ] Verify failover to secondary region
- [ ] Test all critical functionality
- [ ] Measure failover time
### Validation
- [ ] Verify data consistency
- [ ] Check all services operational
- [ ] Review metrics and logs
- [ ] Document issues found
### Cleanup
- [ ] Failback to primary region
- [ ] Verify normal operations
- [ ] Document lessons learned
- [ ] Update runbooks
SLA Considerations
Define SLA Metrics:
Service Level Agreement:
  Availability: 99.95%
  Response Time (p99): < 500ms
  Error Rate: < 0.1%
  Maintenance Window: Sunday 02:00-04:00 UTC
  Support Response: < 1 hour
Calculate Allowed Downtime:
99.9% = 43.2 minutes/month
99.95% = 21.6 minutes/month
99.99% = 4.3 minutes/month
Cost vs. Availability
Availability   Cost Multiple   Use Case
─────────────────────────────────────────
99%            1x              Internal tools
99.9%          2-3x            Standard SaaS
99.95%         5-10x           Critical business
99.99%         10-50x          Financial services
99.999%        50-100x+        Emergency services
Conclusion
High availability requires:
- Redundancy at all levels
- Automated failover mechanisms
- Comprehensive monitoring
- Regular testing
- Graceful degradation
- Documentation and runbooks
Remember: HA is not achieved once. It requires continuous validation and improvement.