
High Availability Patterns

November 10, 2025

Design patterns and strategies for building highly available systems that minimize downtime

High Availability (HA) ensures systems remain operational despite failures. It’s achieved through redundancy, failover, and fault tolerance.

HA Fundamentals

Availability Calculation

Availability = Uptime / (Uptime + Downtime)

99.9% (three nines)   = 8.76 hours downtime/year
99.99% (four nines)   = 52.56 minutes downtime/year
99.999% (five nines)  = 5.26 minutes downtime/year
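
These downtime budgets follow directly from the availability formula; a quick sketch to reproduce them:

```python
# Convert an availability target into an allowed-downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_per_year(availability_pct):
    """Minutes of allowed downtime per year for a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_per_year(pct):.2f} min/year")
```

Each extra nine shrinks the budget by a factor of ten, which is why every additional nine costs disproportionately more to engineer.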

Components of HA

  1. Redundancy: Multiple instances of components
  2. Failover: Automatic switching to backup
  3. Health Checks: Detect failures quickly
  4. Load Balancing: Distribute traffic
  5. Data Replication: Prevent data loss
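
The first three components compose naturally; a minimal sketch, where `servers`, `is_healthy`, and `handle` are hypothetical stand-ins for a real replica list, health probe, and request handler:

```python
# Minimal health-checked failover: route each request to the first
# replica that passes a health check (redundancy + detection + failover).
def route_request(request, servers, is_healthy, handle):
    for server in servers:
        if is_healthy(server):              # health check detects failures
            return handle(server, request)  # failover: next healthy replica
    raise RuntimeError("no healthy replicas available")
```

Real load balancers add caching of health state and traffic distribution on top of this loop, but the core logic is the same.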

Active-Active Pattern

All instances actively serve traffic.

         Load Balancer
              |
      --------+--------
      |               |
   Server A        Server B
   (Active)        (Active)
      |               |
      --------+--------
              |
          Database
       (Replicated)

Benefits:

  • No wasted resources
  • Better performance
  • Seamless failover

Challenges:

  • Data synchronization
  • Session management
  • Consistency

Implementation Example:

# Kubernetes Deployment with multiple replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: web-app:latest
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080

Active-Passive Pattern

One instance serves traffic while the others remain on standby.

         Load Balancer
              |
      --------+--------
      |               |
   Server A        Server B
   (Active)        (Passive)

Benefits:

  • Simpler failover
  • Easier data consistency
  • Lower resource costs

Challenges:

  • Wasted capacity
  • Failover time
  • Detection delay

Implementation Example:

# HAProxy with backup server
backend web_servers
    balance roundrobin
    server web1 10.0.1.10:80 check
    server web2 10.0.1.11:80 check backup

Multi-Region Deployment

Distribute across geographic regions for disaster recovery.

     Global Load Balancer (Route 53)
              |
      --------+--------
      |               |
   Region A        Region B
   US-East-1       EU-West-1

AWS Route 53 Failover:

{
  "Name": "example.com",
  "Type": "A",
  "SetIdentifier": "Primary",
  "Failover": "PRIMARY",
  "HealthCheckId": "abc123",
  "ResourceRecords": [{"Value": "52.1.1.1"}]
},
{
  "Name": "example.com",
  "Type": "A",
  "SetIdentifier": "Secondary",
  "Failover": "SECONDARY",
  "ResourceRecords": [{"Value": "54.2.2.2"}]
}

Database HA Patterns

Master-Slave Replication

   Master (Write)
       |
   Replication
       |
   ----+----
   |       |
Slave A  Slave B
(Read)   (Read)

MySQL Master-Slave:

-- On the replica (slave): point it at the master, then start replication
CHANGE MASTER TO
  MASTER_HOST='master.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=107;

START SLAVE;

Multi-Master Replication

Master A ←→ Master B
(Write)     (Write)

Conflict Resolution Required:

  • Last-write-wins
  • Application-level resolution
  • Manual intervention
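
As a sketch of the simplest strategy, last-write-wins keeps whichever version carries the newer timestamp (the field names here are illustrative):

```python
# Last-write-wins: resolve a multi-master conflict by keeping the
# version with the most recent update timestamp.
def last_write_wins(version_a, version_b):
    """Each version is a dict with 'value' and 'updated_at' (epoch seconds)."""
    return version_a if version_a['updated_at'] >= version_b['updated_at'] else version_b

a = {'value': 'from master A', 'updated_at': 1700000100}
b = {'value': 'from master B', 'updated_at': 1700000200}
print(last_write_wins(a, b)['value'])  # the newer write from master B wins
```

Note that last-write-wins silently discards the losing write and depends on synchronized clocks, which is why application-level resolution is often preferred for business-critical data.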

Clustering

MySQL Group Replication:

Node A ←→ Node B ←→ Node C
(R/W)     (R/W)     (R/W)

PostgreSQL Patroni:

# patroni.yml
scope: postgres-cluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.10:8008

etcd:
  hosts: 10.0.1.20:2379,10.0.1.21:2379,10.0.1.22:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.10:5432
  data_dir: /var/lib/postgresql/data

Health Checks

HTTP Health Endpoint

from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

@app.route('/health')
def health_check():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'disk': check_disk_space(),
    }

    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503

    return jsonify({
        'status': 'healthy' if all_healthy else 'unhealthy',
        'checks': checks
    }), status_code

def check_database():
    try:
        conn = psycopg2.connect(DATABASE_URL)  # DATABASE_URL set in app config
        conn.close()
        return True
    except psycopg2.Error:
        return False

Load Balancer Health Check

# HAProxy
backend web_servers
    option httpchk GET /health HTTP/1.1\r\nHost:\ example.com
    http-check expect status 200
    server web1 10.0.1.10:80 check inter 5s fall 3 rise 2
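
With `inter 5s fall 3`, the worst-case detection delay is roughly the probe interval times the fall count; a quick sketch of that arithmetic (ignoring each probe's own timeout):

```python
# Worst-case failure-detection delay for interval-based health checks:
# a failure can occur just after a successful probe, so it takes up to
# `fall` more probes, each `inter` seconds apart, to mark the server down.
def detection_delay(inter_seconds, fall_count):
    return inter_seconds * fall_count

print(detection_delay(5, 3))  # -> 15 seconds before web1 is marked DOWN
```

Tuning these two knobs trades detection speed against the risk of flapping: shorter intervals and lower fall counts fail over faster but react to transient blips.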

Deep vs. Shallow Checks

Shallow (Fast):

GET /health
→ 200 OK (service is running)

Deep (Comprehensive):

GET /health
→ Check database connection
→ Check cache connection
→ Check disk space
→ Check external API
→ 200 OK (all dependencies healthy)

Failover Strategies

DNS Failover

# Route 53 health check
aws route53 create-health-check \
  --caller-reference $(date +%s) \
  --health-check-config \
    IPAddress=52.1.1.1,Port=80,Type=HTTP,ResourcePath=/health

Limitations:

  • DNS caching delays (TTL)
  • Client-side caching
  • Typically 60+ second failover time

Load Balancer Failover

Client → Load Balancer → [Server A | Server B | Server C]
                          (health checked every 5s)

Benefits:

  • Fast failover (seconds)
  • No DNS caching issues
  • Automatic recovery

Application-Level Failover

def get_data():
    try:
        return primary_db.query()
    except DatabaseError:
        logger.warning("Primary DB failed, using secondary")
        return secondary_db.query()

Circuit Breaker Pattern

Prevent cascading failures by stopping requests to failing services.

import time

class CircuitBreakerOpenError(Exception):
    """Raised when the breaker rejects a call because it is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'
            else:
                raise CircuitBreakerOpenError()

        try:
            result = func(*args, **kwargs)
            if self.state == 'HALF_OPEN':
                self.state = 'CLOSED'
                self.failure_count = 0
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'

            raise e

# Usage
breaker = CircuitBreaker()

@app.route('/data')
def get_data():
    return breaker.call(external_api.fetch_data)

Session Management

Sticky Sessions

# HAProxy sticky sessions
backend web_servers
    cookie SERVERID insert indirect nocache
    server web1 10.0.1.10:80 cookie web1 check
    server web2 10.0.1.11:80 cookie web2 check

Drawbacks:

  • Uneven load distribution
  • Server failure loses sessions
  • Harder to scale

Shared Session Store

# Redis-backed sessions
import redis
from flask import Flask, session, request, redirect
from flask_session import Session

app = Flask(__name__)
app.config['SESSION_TYPE'] = 'redis'
app.config['SESSION_REDIS'] = redis.StrictRedis(
    host='redis.example.com',
    port=6379,
    db=0
)
Session(app)

@app.route('/login', methods=['POST'])
def login():
    session['user_id'] = authenticate(request.form)
    return redirect('/dashboard')

Benefits:

  • Server failures don’t lose sessions
  • Easy horizontal scaling
  • Consistent user experience

Stateless Architecture

Design services without server-side state.

# Stateless API with JWT
import jwt
from flask import request, jsonify

@app.route('/api/protected')
def protected():
    # Expects "Authorization: Bearer <token>"
    token = request.headers.get('Authorization').split()[1]
    payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
    user_id = payload['user_id']
    return jsonify({'data': get_user_data(user_id)})

Benefits:

  • Any server can handle any request
  • Easy scaling
  • Simple failover

Graceful Degradation

Continue operating with reduced functionality when components fail.

def get_recommendations(user_id):
    try:
        # Try ML recommendation service
        return ml_service.get_recommendations(user_id)
    except ServiceUnavailable:
        # Fall back to simple algorithm
        logger.warning("ML service unavailable, using fallback")
        return get_popular_items()

def get_user_profile(user_id):
    profile = db.get_user(user_id)

    # Try to enrich with external data
    try:
        profile.update(external_api.get_extra_data(user_id))
    except Exception:
        # Continue without enrichment
        logger.warning("External API failed, returning basic profile")

    return profile

Auto-Healing

Automatically recover from failures.

Kubernetes Liveness Probe:

apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
  - name: web
    image: web-app:latest
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5

Auto-Restart Service:

# systemd service with automatic restart
[Service]
Restart=always
RestartSec=10s

Rate Limiting & Throttling

Protect services from overload.

from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(
    app,
    key_func=get_remote_address,
    default_limits=["200 per day", "50 per hour"]
)

@app.route("/api/expensive")
@limiter.limit("10 per minute")
def expensive_operation():
    return perform_operation()

Bulkhead Pattern

Isolate resources to prevent cascading failures.

# Separate thread pools for different operations
from concurrent.futures import ThreadPoolExecutor

critical_pool = ThreadPoolExecutor(max_workers=10)
background_pool = ThreadPoolExecutor(max_workers=5)

@app.route('/critical')
def critical_operation():
    future = critical_pool.submit(process_critical)
    return future.result()

@app.route('/background')
def background_operation():
    future = background_pool.submit(process_background)
    return {'status': 'processing'}

Chaos Engineering

Test HA by intentionally causing failures.

# Chaos Monkey - randomly terminate instances
while true; do
    instance=$(aws ec2 describe-instances \
        --filters "Name=tag:Environment,Values=staging" \
        --query 'Reservations[*].Instances[*].InstanceId' \
        --output text | shuf -n 1)

    echo "Terminating $instance"
    aws ec2 terminate-instances --instance-ids $instance

    sleep $((RANDOM % 3000 + 600))  # 10-60 minutes
done

Monitoring HA Systems

Key Metrics:

# Service availability
up{job="web-app"}

# Request success rate
rate(http_requests_total{status!~"5.."}[5m]) /
rate(http_requests_total[5m])

# Healthy instance count
count(up{job="web-app"} == 1)

# Failover events
increase(failover_events_total[1h])

Testing HA

Failure Injection

# Network partition
iptables -A INPUT -s 10.0.1.10 -j DROP

# CPU stress
stress --cpu 4 --timeout 60s

# Memory pressure
stress --vm 2 --vm-bytes 1G --timeout 60s

# Disk I/O
stress --io 4 --timeout 60s

Disaster Recovery Drills

## DR Drill Checklist

### Preparation
- [ ] Schedule drill during low-traffic period
- [ ] Notify all stakeholders
- [ ] Prepare rollback plan
- [ ] Set up monitoring

### Execution
- [ ] Simulate primary region failure
- [ ] Verify failover to secondary region
- [ ] Test all critical functionality
- [ ] Measure failover time

### Validation
- [ ] Verify data consistency
- [ ] Check all services operational
- [ ] Review metrics and logs
- [ ] Document issues found

### Cleanup
- [ ] Failback to primary region
- [ ] Verify normal operations
- [ ] Document lessons learned
- [ ] Update runbooks

SLA Considerations

Define SLA Metrics:

Service Level Agreement:
  Availability: 99.95%
  Response Time (p99): < 500ms
  Error Rate: < 0.1%
  Maintenance Window: Sunday 02:00-04:00 UTC
  Support Response: < 1 hour

Calculate Allowed Downtime:

99.9%  = 43.2 minutes/month
99.95% = 21.6 minutes/month
99.99% = 4.3 minutes/month

Cost vs. Availability

Availability    Cost Multiple    Use Case
─────────────────────────────────────────
99%             1x              Internal tools
99.9%           2-3x            Standard SaaS
99.95%          5-10x           Critical business
99.99%          10-50x          Financial services
99.999%         50-100x+        Emergency services

Conclusion

High availability requires:

  1. Redundancy at all levels
  2. Automated failover mechanisms
  3. Comprehensive monitoring
  4. Regular testing
  5. Graceful degradation
  6. Documentation and runbooks

Remember: HA is not achieved once. It requires continuous validation and improvement.