Disaster Recovery Planning

November 10, 2025

Strategies and best practices for disaster recovery, business continuity, and data protection

Disaster Recovery (DR) ensures business continuity when catastrophic failures occur. It’s about recovering quickly with minimal data loss.

Recovery Objectives

Recovery Time Objective (RTO)

Maximum acceptable downtime.

RTO = How long can we be down before serious business impact?

Examples:
- Critical e-commerce: 1 hour
- Internal tools: 24 hours
- Analytics platform: 72 hours

Recovery Point Objective (RPO)

Maximum acceptable data loss.

RPO = How much data can we afford to lose?

Examples:
- Financial transactions: 0 seconds (no data loss)
- Customer data: 1 hour
- Logs/analytics: 24 hours

Relationship

Cost increases exponentially as RTO/RPO approach zero

RPO/RTO    Cost    Strategy
──────────────────────────────────
Hours      $       Backups
Minutes    $$      Replication
Seconds    $$$     Active-Active
Zero       $$$$    Synchronous replication
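
The table above can be read as a decision ladder: pick the cheapest tier that meets the RPO target. A toy sketch (the thresholds are illustrative, not prescriptive):

```python
def pick_strategy(rpo_seconds: float) -> str:
    """Map an RPO target to the cheapest tier from the table above."""
    if rpo_seconds == 0:
        return "Synchronous replication"
    if rpo_seconds < 60:       # seconds of acceptable loss
        return "Active-Active"
    if rpo_seconds < 3600:     # minutes of acceptable loss
        return "Replication"
    return "Backups"           # hours of acceptable loss
```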

DR Strategies

1. Backup and Restore (Cheapest, Slowest)

RTO: Hours to days
RPO: Hours to days
Cost: $

Implementation:

# Daily full backup, hourly incrementals
0 2 * * * /usr/bin/backup-full.sh
0 * * * * /usr/bin/backup-incremental.sh

#!/bin/bash
# backup-full.sh: nightly full dump, compressed and shipped offsite
set -euo pipefail
DATE=$(date +%Y%m%d)
mysqldump --all-databases | gzip > "/backups/full-$DATE.sql.gz"
aws s3 cp "/backups/full-$DATE.sql.gz" s3://dr-backups/

Best For:

  • Non-critical systems
  • Small datasets
  • Limited budget
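
Backup/restore only meets its RPO if the backups actually run, so backup freshness is worth alerting on. A minimal stdlib check (the helper name and the 25-hour window are illustrative; feed it the newest object's LastModified from your backup bucket):

```python
from datetime import datetime, timedelta, timezone

def backup_is_stale(latest_backup_time: datetime, max_age_hours: float = 25) -> bool:
    """With a daily full at 02:00, anything older than ~25h means last
    night's run failed and should page someone."""
    age = datetime.now(timezone.utc) - latest_backup_time
    return age > timedelta(hours=max_age_hours)
```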

2. Pilot Light (Minimal resources running)

RTO: Hours
RPO: Minutes to hours
Cost: $$

Architecture:

Primary Region (Active)          DR Region (Minimal)
├── App Servers (Running)   →   ├── App Servers (Stopped)
├── Database (Active)        →   ├── Database (Replicating)
└── Data (Live)              →   └── Data (Synced)

Implementation:

# DR region - minimal resources
resource "aws_instance" "dr_app" {
  ami           = data.aws_ami.app.id
  instance_type = "t3.micro"  # Minimal size
  count         = 1           # Single instance

  lifecycle {
    ignore_changes = [instance_state]
  }

  tags = {
    Environment = "DR"
    AutoStart   = "disaster-only"
  }
}

# Database replication
resource "aws_db_instance" "dr_database" {
  replicate_source_db = aws_db_instance.primary.id
  instance_class      = "db.t3.small"
  skip_final_snapshot = true
}

Failover Process:

  1. Detect disaster
  2. Scale up DR instances
  3. Promote read replica to master
  4. Update DNS to point to DR region
  5. Verify functionality

3. Warm Standby (Reduced capacity running)

RTO: Minutes
RPO: Seconds to minutes
Cost: $$$

Architecture:

Primary Region (Full)            DR Region (Scaled Down)
├── App Servers (10)        →   ├── App Servers (2)
├── Database (Multi-AZ)     →   ├── Database (Replicating)
├── Load Balancer           →   ├── Load Balancer
└── Auto Scaling            →   └── Auto Scaling (Scaled Down)

Implementation:

# Auto Scaling in DR region
AutoScalingGroup:
  MinSize: 2
  MaxSize: 10
  DesiredCapacity: 2  # Minimal but ready

# Scale up on failover
ScaleUpPolicy:
  Trigger: DR_ACTIVATED
  Action:
    DesiredCapacity: 10

Best For:

  • Business-critical applications
  • Moderate RTO/RPO requirements
  • Budget allows some redundancy

4. Multi-Site Active-Active (Most Expensive, Fastest)

RTO: Seconds
RPO: Near-zero
Cost: $$$$

Architecture:

Region A (Active)            Region B (Active)
├── App Servers (10)    ↔   ├── App Servers (10)
├── Database (Master)   ↔   ├── Database (Master)
├── Load Balancer       ↔   ├── Load Balancer
└── Global Traffic Mgr  ↔   └── Auto failover

Implementation:

# Route 53 Geolocation Routing
{
  "Name": "example.com",
  "Type": "A",
  "SetIdentifier": "US-Users",
  "GeoLocation": {"ContinentCode": "NA"},
  "ResourceRecords": [{"Value": "us-east-lb.example.com"}]
},
{
  "Name": "example.com",
  "Type": "A",
  "SetIdentifier": "EU-Users",
  "GeoLocation": {"ContinentCode": "EU"},
  "ResourceRecords": [{"Value": "eu-west-lb.example.com"}]
}

Challenges:

  • Data consistency (CAP theorem)
  • Conflict resolution
  • Increased complexity
  • Higher cost
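
Conflict resolution is the hardest item on that list. The simplest (and lossiest) policy is last-writer-wins; a sketch, assuming each record carries a write timestamp:

```python
def lww_merge(a: dict, b: dict) -> dict:
    """Last-writer-wins merge of two replicas' keyed records.
    Each record is {"value": ..., "ts": epoch_seconds}; ties keep replica a.
    Note that LWW silently drops the losing write, which is exactly why
    active-active designs need careful conflict handling."""
    merged = {}
    for key in a.keys() | b.keys():
        ra, rb = a.get(key), b.get(key)
        if ra is None:
            merged[key] = rb
        elif rb is None:
            merged[key] = ra
        else:
            merged[key] = rb if rb["ts"] > ra["ts"] else ra
    return merged
```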

Data Protection

Backup Strategy (3-2-1 Rule)

3 copies of data
2 different media types
1 offsite copy

Implementation:

# Local backup
backup-local.sh → /local/backups/

# Cloud backup (different region)
aws s3 sync /local/backups/ s3://backups-us-west-2/

# Glacier archive (long-term)
aws s3 cp /local/backups/ s3://glacier-vault/ \
  --recursive --storage-class GLACIER

Database Backups

Point-in-Time Recovery (PITR):

# postgresql.conf: continuous WAL archiving (config settings, not SQL)
archive_mode = on
archive_command = 'cp %p /archive/%f'
wal_level = replica

# To restore to a timestamp: restore the base backup, set
# recovery_target_time, then create recovery.signal and start Postgres
recovery_target_time = '2025-01-15 14:30:00'

Automated Snapshots:

# AWS RDS automated backups
import boto3
from datetime import datetime, timezone

rds = boto3.client('rds')

# Create manual snapshot (identifiers allow only letters, digits, hyphens)
timestamp = datetime.now(timezone.utc).strftime('%Y%m%d-%H%M%S')
rds.create_db_snapshot(
    DBSnapshotIdentifier=f'manual-snapshot-{timestamp}',
    DBInstanceIdentifier='production-db'
)

# Configure automatic backups
rds.modify_db_instance(
    DBInstanceIdentifier='production-db',
    BackupRetentionPeriod=30,  # 30 days
    PreferredBackupWindow='03:00-04:00'
)

Application Data Replication

Real-Time Replication:

# rsync sync loop (up to 60s of lag; use lsyncd/inotify for lower lag)
while true; do
  rsync -avz --delete /data/ dr-server:/data/
  sleep 60
done

# S3 cross-region replication (Replication Time Control requires Metrics)
aws s3api put-bucket-replication \
  --bucket source-bucket \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication",
    "Rules": [{
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {
        "Bucket": "arn:aws:s3:::destination-bucket",
        "ReplicationTime": {
          "Status": "Enabled",
          "Time": {"Minutes": 15}
        },
        "Metrics": {
          "Status": "Enabled",
          "EventThreshold": {"Minutes": 15}
        }
      }
    }]
  }'

Failover Procedures

Automated Failover

Health Check Based:

import time

import boto3
import requests

route53 = boto3.client('route53')
DR_IP = '203.0.113.10'  # Static IP of the DR load balancer

def check_primary_health():
    try:
        response = requests.get('https://primary.example.com/health', timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

def failover_to_dr():
    # Update DNS to point to DR region
    route53.change_resource_record_sets(
        HostedZoneId='Z123456',
        ChangeBatch={
            'Changes': [{
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': 'api.example.com',
                    'Type': 'A',
                    'TTL': 60,
                    'ResourceRecords': [{'Value': DR_IP}]
                }
            }]
        }
    )

# Require consecutive failures so a single blip doesn't trigger failover
failures = 0
while True:
    failures = 0 if check_primary_health() else failures + 1
    if failures == 3:
        failover_to_dr()
        alert_team("Failover initiated to DR")  # alert_team: your paging hook
    time.sleep(30)

Manual Failover Runbook

# DR Failover Runbook

## Pre-Checks
- [ ] Verify primary region is truly down
- [ ] Confirm DR region is healthy
- [ ] Notify all stakeholders
- [ ] Document start time

## Database Failover
1. Promote read replica to master

aws rds promote-read-replica \
  --db-instance-identifier dr-database

2. Verify replication stopped
3. Update application connection strings

## Application Failover
1. Scale up DR app servers

aws autoscaling set-desired-capacity \
  --auto-scaling-group-name dr-asg \
  --desired-capacity 10

2. Wait for instances to be healthy
3. Update load balancer targets

## DNS Failover
1. Lower TTL on DNS records (if not already low)
2. Update DNS to point to DR

aws route53 change-resource-record-sets \
  --hosted-zone-id Z123 \
  --change-batch file://dr-dns.json

3. Verify DNS propagation

## Verification
- [ ] Test critical user flows
- [ ] Verify database connectivity
- [ ] Check application logs
- [ ] Monitor error rates
- [ ] Confirm data integrity

## Communication
- [ ] Update status page
- [ ] Notify customers
- [ ] Update internal teams
- [ ] Schedule regular updates
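
The runbook's "verify DNS propagation" step can be scripted rather than eyeballed. A small stdlib sketch (the function name is mine; in practice, check from several network vantage points, since each resolver caches independently):

```python
import socket

def resolves_to(hostname: str, expected_ip: str) -> bool:
    """True once hostname resolves, from this resolver's point of view,
    to the expected DR address."""
    try:
        ips = {info[4][0]
               for info in socket.getaddrinfo(hostname, None, socket.AF_INET)}
    except socket.gaierror:
        return False
    return expected_ip in ips
```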

Testing DR Plans

DR Drill Schedule

## Annual DR Test Plan

### Q1: Backup Restoration Test
- Restore from backup to isolated environment
- Verify data integrity
- Document restoration time

### Q2: Pilot Light Activation
- Start DR instances
- Promote database replica
- Run smoke tests
- Measure activation time

### Q3: Warm Standby Failover
- Full failover to DR region
- Run full test suite
- Serve real traffic (10%)
- Failback to primary

### Q4: Chaos Engineering
- Random component failures
- Verify auto-recovery
- Test monitoring/alerting
- Document issues

Test Checklist

## DR Test Checklist

### Preparation (T-7 days)
- [ ] Schedule test window
- [ ] Notify stakeholders
- [ ] Prepare rollback plan
- [ ] Set up monitoring

### Pre-Test (T-1 hour)
- [ ] Verify DR environment health
- [ ] Confirm backup freshness
- [ ] Alert team on standby
- [ ] Take pre-test snapshots

### Execution
- [ ] Initiate failover procedure
- [ ] Follow runbook precisely
- [ ] Document actual steps
- [ ] Time each phase
- [ ] Note any issues

### Validation
- [ ] Test all critical functions
- [ ] Verify data integrity
- [ ] Check performance
- [ ] Monitor error rates
- [ ] User acceptance testing

### Failback
- [ ] Sync data changes to primary
- [ ] Reverse failover process
- [ ] Verify primary health
- [ ] Monitor after failback

### Post-Test
- [ ] Document lessons learned
- [ ] Update runbooks
- [ ] Create action items
- [ ] Share results with team

Data Integrity Verification

Checksums and Validation

import hashlib

def verify_backup_integrity(source_file, backup_file):
    """Compare file checksums to verify backup integrity"""

    def calculate_checksum(filepath):
        # SHA-256: detects corruption and, unlike MD5, resists tampering
        h = hashlib.sha256()
        with open(filepath, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                h.update(chunk)
        return h.hexdigest()

    source_checksum = calculate_checksum(source_file)
    backup_checksum = calculate_checksum(backup_file)

    if source_checksum == backup_checksum:
        print(f"✓ Backup verified: {backup_file}")
        return True
    else:
        print(f"✗ Backup corrupted: {backup_file}")
        return False

Database Consistency Checks

-- MySQL
CHECK TABLE users;
ANALYZE TABLE users;

-- PostgreSQL
SELECT * FROM pg_stat_database;
VACUUM ANALYZE users;

-- Count verification
SELECT COUNT(*) FROM users;  -- Primary
SELECT COUNT(*) FROM users;  -- DR replica
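
Running those counts by hand on both sides and comparing by eye is error-prone; the comparison can be scripted. A sketch using sqlite3 connections as stand-ins (any DB-API driver pointed at the primary and the DR replica works the same way):

```python
import sqlite3

def counts_match(primary, replica, table: str) -> bool:
    """Compare row counts for one table across two DB connections."""
    q = f"SELECT COUNT(*) FROM {table}"  # table name must come from a trusted list
    (n1,) = primary.execute(q).fetchone()
    (n2,) = replica.execute(q).fetchone()
    return n1 == n2
```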

Compliance and Documentation

Required Documentation

## DR Documentation Index

### 1. DR Policy
- RTO/RPO targets per service
- Roles and responsibilities
- Escalation procedures

### 2. Technical Architecture
- Diagrams of primary and DR sites
- Data flow diagrams
- Network topology

### 3. Runbooks
- Failover procedures
- Failback procedures
- Recovery procedures per component

### 4. Contact Information
- On-call schedule
- Vendor contacts
- Executive contacts

### 5. Test Results
- DR test history
- Issues identified
- Resolution status

### 6. Maintenance Logs
- Configuration changes
- Infrastructure updates
- Capacity planning

Compliance Requirements

SOC 2:

  • Documented DR plan
  • Regular DR testing
  • DR test results
  • Recovery procedures

HIPAA:

  • Data backup procedures
  • Disaster recovery plan
  • Emergency mode operations
  • Testing and revision

PCI-DSS:

  • Business continuity planning
  • Incident response procedures
  • Annual DR testing

Cost Optimization

Balance Cost vs. Requirements

Service         RTO    RPO    Strategy           Cost/Month
─────────────────────────────────────────────────────────────
Critical API    1h     5min   Warm Standby      $5,000
User Database   1h     0      Synchronous Rep   $8,000
Analytics       24h    24h    Backup/Restore    $100
Dev Environment 72h    N/A    Rebuild           $0

Cost Reduction Strategies

  1. Tiered DR: Different strategies per service
  2. Scheduled Scaling: Start DR instances only during business hours
  3. Spot Instances: Use for non-critical DR components
  4. Storage Lifecycle: Move old backups to Glacier
  5. Cross-Region Replication: Only for critical data
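
Strategy 4 (storage lifecycle) is a one-time bucket configuration. A sketch with illustrative bucket name and retention windows:

```shell
# Move backups to Glacier after 30 days, expire them after a year
aws s3api put-bucket-lifecycle-configuration \
  --bucket dr-backups \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-then-expire",
      "Status": "Enabled",
      "Filter": {},
      "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 365}
    }]
  }'
```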

Common Disaster Scenarios

1. Region Failure

Cause: AWS/Azure/GCP region outage

Response:

  1. Verify outage via status page
  2. Activate DR region
  3. Update global traffic routing
  4. Monitor closely
  5. Fail back when primary recovered

2. Data Corruption

Cause: Bad deployment, malicious actor

Response:

  1. Stop replication immediately
  2. Identify corruption scope
  3. Restore from clean backup
  4. Verify data integrity
  5. Resume replication

3. Ransomware

Cause: Malware encrypts data

Response:

  1. Isolate affected systems
  2. Do NOT pay ransom
  3. Restore from offline/immutable backups
  4. Scan for malware
  5. Investigate entry point
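
Step 3 depends on the backups being immutable in the first place; on S3 that means versioning plus Object Lock. A sketch (bucket name and 30-day retention are illustrative; note that Object Lock must be enabled when the bucket is created):

```shell
# Make every new backup object undeletable for 30 days, even by admins
aws s3api put-object-lock-configuration \
  --bucket dr-backups \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}
  }'
```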

4. Human Error

Cause: Accidental deletion, bad configuration

Response:

  1. Assess impact
  2. Restore from most recent backup
  3. Verify restoration
  4. Implement safeguards
  5. Update procedures

Continuous Improvement

Post-Incident Review

# DR Activation Post-Mortem

## Incident Summary
- Date/Time: 2025-01-15 14:32 UTC
- Duration: 2 hours 15 minutes
- Root Cause: Primary region network failure

## Timeline
14:32 - Primary region unresponsive
14:35 - DR activation initiated
14:50 - Database promoted
15:10 - DNS updated
15:30 - Full traffic on DR
16:47 - Failed back to primary

## What Went Well
- Monitoring detected issue quickly
- Runbooks were accurate
- Team responded efficiently

## What Went Poorly
- DNS TTL was too high (5min)
- Some manual steps could be automated
- Missing contact for vendor

## Action Items
1. Reduce DNS TTL to 60s
2. Automate database promotion
3. Update contact list
4. Practice failover more frequently

Metrics to Track

# DR Readiness
time_since_last_dr_test_days

# Backup Health
backup_age_hours
backup_success_rate

# Replication Lag
database_replication_lag_seconds
s3_replication_lag_seconds

# RTO/RPO Actual
actual_recovery_time_minutes
actual_data_loss_minutes
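
A sketch of how the readiness gauges above might be computed from raw timestamps and counters before being emitted to your metrics system (the function and parameter names are mine):

```python
from datetime import datetime, timezone

def dr_metrics(last_test: datetime, last_backup: datetime,
               backups_attempted: int, backups_succeeded: int) -> dict:
    """Derive DR readiness gauges; keys match the metric names above."""
    now = datetime.now(timezone.utc)
    return {
        "time_since_last_dr_test_days": (now - last_test).days,
        "backup_age_hours": (now - last_backup).total_seconds() / 3600,
        "backup_success_rate": (backups_succeeded / backups_attempted
                                if backups_attempted else 0.0),
    }
```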

Conclusion

Effective disaster recovery requires:

  1. Clear objectives: Define RTO/RPO for each service
  2. Appropriate strategy: Match cost to requirements
  3. Regular testing: DR plans untested are DR plans that fail
  4. Documentation: Runbooks must be accurate and accessible
  5. Continuous improvement: Learn from tests and incidents

Remember: Hope is not a DR strategy. Test, verify, and continuously improve your DR capabilities.