Disaster Recovery (DR) ensures business continuity when catastrophic failures occur. It’s about recovering quickly with minimal data loss.
Recovery Objectives
Recovery Time Objective (RTO)
Maximum acceptable downtime.
RTO = How long can we be down before serious business impact?
Examples:
- Critical e-commerce: 1 hour
- Internal tools: 24 hours
- Analytics platform: 72 hours
Recovery Point Objective (RPO)
Maximum acceptable data loss.
RPO = How much data can we afford to lose?
Examples:
- Financial transactions: 0 seconds (no data loss)
- Customer data: 1 hour
- Logs/analytics: 24 hours
Relationship
Cost rises steeply as RTO and RPO approach zero:

RPO/RTO     Cost    Strategy
──────────────────────────────────
Hours       $       Backups
Minutes     $$      Replication
Seconds     $$$     Active-Active
Zero        $$$$    Synchronous replication
DR Strategies
1. Backup and Restore (Cheapest, Slowest)
RTO: Hours to days
RPO: Hours to days
Cost: $
Implementation:
# crontab: daily full backup, hourly incrementals
0 2 * * * /usr/bin/backup-full.sh
0 * * * * /usr/bin/backup-incremental.sh

# /usr/bin/backup-full.sh
#!/bin/bash
set -euo pipefail   # fail the job if any step of the pipeline fails
DATE=$(date +%Y%m%d)
mysqldump --all-databases | gzip > "/backups/full-$DATE.sql.gz"
aws s3 cp "/backups/full-$DATE.sql.gz" s3://dr-backups/
Best For:
- Non-critical systems
- Small datasets
- Limited budget
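The catch with this strategy is that RTO is dominated by restore time, which you only learn by actually restoring. A minimal sketch of a timed restore drill into a scratch database, assuming the backup naming from the script above and a hypothetical dr-restore-test MySQL host (credentials via ~/.my.cnf):

import gzip
import subprocess
import time
from datetime import datetime

BACKUP = f"/backups/full-{datetime.now():%Y%m%d}.sql.gz"   # today's full backup
SCRATCH_HOST = "dr-restore-test.example.com"               # placeholder host

def timed_restore():
    start = time.time()
    with gzip.open(BACKUP, "rb") as dump:
        # Stream the decompressed dump into a throwaway MySQL instance
        subprocess.run(
            ["mysql", "-h", SCRATCH_HOST, "-u", "restore_test"],
            stdin=dump, check=True,
        )
    print(f"Restore took {(time.time() - start) / 60:.1f} minutes")

Recording that number each quarter is what turns "RTO: hours to days" from a guess into a measurement.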
2. Pilot Light (Minimal resources running)
RTO: Hours
RPO: Minutes to hours
Cost: $$
Architecture:
Primary Region (Active) DR Region (Minimal)
├── App Servers (Running) → ├── App Servers (Stopped)
├── Database (Active) → ├── Database (Replicating)
└── Data (Live) → └── Data (Synced)
Implementation:
# DR region - minimal resources
resource "aws_instance" "dr_app" {
  ami           = data.aws_ami.app.id
  instance_type = "t3.micro" # Minimal size
  count         = 1          # Single instance

  lifecycle {
    # The instance's run state (kept stopped until a disaster is declared)
    # is managed outside Terraform, so drift on it is ignored.
    ignore_changes = [instance_state]
  }

  tags = {
    Environment = "DR"
    AutoStart   = "disaster-only"
  }
}

# Database replication (cross-region read replica)
resource "aws_db_instance" "dr_database" {
  # Cross-region replicas reference the source instance by ARN.
  replicate_source_db = aws_db_instance.primary.arn
  instance_class      = "db.t3.small"
  skip_final_snapshot = true
}
Failover Process (see the boto3 sketch after these steps):
- Detect disaster
- Scale up DR instances
- Promote read replica to master
- Update DNS to point to DR region
- Verify functionality
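A minimal boto3 sketch of steps 2–4, assuming placeholder instance IDs, the dr-database replica identifier, hosted zone Z123456, and a hypothetical DR load balancer hostname:

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")      # DR region
rds = boto3.client("rds", region_name="us-west-2")
route53 = boto3.client("route53")

def activate_pilot_light():
    # 2. Start the stopped DR application instances
    instance_ids = ["i-0abc123def456"]                   # placeholder IDs
    ec2.start_instances(InstanceIds=instance_ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)

    # 3. Promote the read replica to a standalone primary
    rds.promote_read_replica(DBInstanceIdentifier="dr-database")
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier="dr-database")

    # 4. Point DNS at the DR region
    route53.change_resource_record_sets(
        HostedZoneId="Z123456",                          # placeholder zone ID
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": "dr-lb.us-west-2.example.com"}],
            },
        }]},
    )
    # 5. Verification (smoke tests, log checks) stays a human step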
3. Warm Standby (Reduced capacity running)
RTO: Minutes
RPO: Seconds to minutes
Cost: $$$
Architecture:
Primary Region (Full) DR Region (Scaled Down)
├── App Servers (10) → ├── App Servers (2)
├── Database (Multi-AZ) → ├── Database (Replicating)
├── Load Balancer → ├── Load Balancer
└── Auto Scaling → └── Auto Scaling (Scaled Down)
Implementation:
# Auto Scaling in DR region (illustrative config)
AutoScalingGroup:
  MinSize: 2
  MaxSize: 10
  DesiredCapacity: 2   # Minimal but ready

# Scale up on failover
ScaleUpPolicy:
  Trigger: DR_ACTIVATED
  Action:
    DesiredCapacity: 10
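The snippet above is pseudo-configuration rather than literal CloudFormation; the scale-up action it describes might look like this with boto3 (dr-asg is a placeholder group name):

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")  # DR region

def scale_up_dr(asg_name="dr-asg", capacity=10):
    # Raise the DR Auto Scaling group to full production capacity
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=capacity,
        HonorCooldown=False,   # failover should not wait for cooldowns
    )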
Best For:
- Business-critical applications
- Moderate RTO/RPO requirements
- Budget allows some redundancy
4. Multi-Site Active-Active (Most Expensive, Fastest)
RTO: Seconds
RPO: Near-zero
Cost: $$$$
Architecture:
Region A (Active) Region B (Active)
├── App Servers (10) ↔ ├── App Servers (10)
├── Database (Master) ↔ ├── Database (Master)
├── Load Balancer ↔ ├── Load Balancer
└── Global Traffic Mgr ↔ └── Auto failover
Implementation:
# Route 53 Geolocation Routing
# (apex A records must be Alias records pointing at each regional load
#  balancer; the AliasTarget hosted zone IDs below are placeholders)
{
  "Name": "example.com",
  "Type": "A",
  "SetIdentifier": "US-Users",
  "GeoLocation": {"ContinentCode": "NA"},
  "AliasTarget": {
    "HostedZoneId": "<us-east-1 ELB hosted zone ID>",
    "DNSName": "us-east-lb.example.com",
    "EvaluateTargetHealth": true
  }
},
{
  "Name": "example.com",
  "Type": "A",
  "SetIdentifier": "EU-Users",
  "GeoLocation": {"ContinentCode": "EU"},
  "AliasTarget": {
    "HostedZoneId": "<eu-west-1 ELB hosted zone ID>",
    "DNSName": "eu-west-lb.example.com",
    "EvaluateTargetHealth": true
  }
}
Challenges:
- Data consistency (CAP theorem)
- Conflict resolution
- Increased complexity
- Higher cost
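Conflict resolution in particular has no free lunch. One common (and lossy) approach is last-writer-wins on a per-record timestamp; a minimal sketch, assuming each region tags writes with an updated_at field:

from datetime import datetime, timezone

def merge_records(record_a, record_b):
    """Last-writer-wins merge: keep whichever copy was updated most recently.
    Concurrent updates within clock skew can silently lose one write."""
    if record_a["updated_at"] >= record_b["updated_at"]:
        return record_a
    return record_b

# Example: the EU region's later write wins
us = {"id": 42, "email": "old@example.com",
      "updated_at": datetime(2025, 1, 15, 14, 30, tzinfo=timezone.utc)}
eu = {"id": 42, "email": "new@example.com",
      "updated_at": datetime(2025, 1, 15, 14, 31, tzinfo=timezone.utc)}
assert merge_records(us, eu)["email"] == "new@example.com"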
Data Protection
Backup Strategy (3-2-1 Rule)
3 copies of data
2 different media types
1 offsite copy
Implementation:
# Local backup
backup-local.sh → /local/backups/

# Cloud backup (different region)
aws s3 sync /local/backups/ s3://backups-us-west-2/

# Glacier archive (long-term); copying a directory needs --recursive
aws s3 cp /local/backups/ s3://glacier-vault/ \
  --recursive --storage-class GLACIER
Database Backups
Point-in-Time Recovery (PITR):
# postgresql.conf - continuous WAL archiving
wal_level = replica
archive_mode = on
archive_command = 'cp %p /archive/%f'

# To restore to a point in time (PostgreSQL 12+): restore a base backup, set
# restore_command = 'cp /archive/%f %p' and
# recovery_target_time = '2025-01-15 14:30:00' in postgresql.conf,
# create recovery.signal, then start the server.
Automated Snapshots:
# AWS RDS automated backups
import boto3
from datetime import datetime

rds = boto3.client('rds')

# Create manual snapshot (identifier may only contain letters, digits, hyphens)
rds.create_db_snapshot(
    DBSnapshotIdentifier=f"manual-snapshot-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
    DBInstanceIdentifier='production-db'
)

# Configure automatic backups
rds.modify_db_instance(
    DBInstanceIdentifier='production-db',
    BackupRetentionPeriod=30,  # 30 days
    PreferredBackupWindow='03:00-04:00'
)
Application Data Replication
Near-Real-Time Replication:
# rsync loop - syncs every 60 seconds
while true; do
  rsync -avz --delete /data/ dr-server:/data/
  sleep 60
done
# S3 cross-region replication (both buckets must have versioning enabled)
aws s3api put-bucket-replication \
  --bucket source-bucket \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication",
    "Rules": [{
      "ID": "dr-replication",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {
        "Bucket": "arn:aws:s3:::destination-bucket",
        "ReplicationTime": {
          "Status": "Enabled",
          "Time": {"Minutes": 15}
        },
        "Metrics": {
          "Status": "Enabled",
          "EventThreshold": {"Minutes": 15}
        }
      }
    }]
  }'
Failover Procedures
Automated Failover
Health Check Based:
import time

import boto3
import requests

route53 = boto3.client('route53')
FAILURE_THRESHOLD = 3          # consecutive failures before failing over
DR_IP = '203.0.113.10'         # placeholder: DR load balancer / elastic IP

def check_primary_health():
    try:
        response = requests.get('https://primary.example.com/health', timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

def failover_to_dr():
    # Update DNS to point to DR region
    route53.change_resource_record_sets(
        HostedZoneId='Z123456',
        ChangeBatch={
            'Changes': [{
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': 'api.example.com',
                    'Type': 'A',
                    'TTL': 60,
                    'ResourceRecords': [{'Value': DR_IP}]
                }
            }]
        }
    )

failures = 0
while True:
    if check_primary_health():
        failures = 0
    else:
        failures += 1
        if failures >= FAILURE_THRESHOLD:
            failover_to_dr()
            alert_team("Failover initiated to DR")  # alert_team defined elsewhere
            break                                   # stop checking once failed over
    time.sleep(30)
Manual Failover Runbook
# DR Failover Runbook
## Pre-Checks
- [ ] Verify primary region is truly down
- [ ] Confirm DR region is healthy
- [ ] Notify all stakeholders
- [ ] Document start time
## Database Failover
1. Promote read replica to master
aws rds promote-read-replica \
  --db-instance-identifier dr-database
2. Verify replication stopped
3. Update application connection strings
## Application Failover
1. Scale up DR app servers
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name dr-asg \
  --desired-capacity 10
2. Wait for instances to be healthy
3. Update load balancer targets
## DNS Failover
1. Lower TTL on DNS records (if not already low)
2. Update DNS to point to DR
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123 \
  --change-batch file://dr-dns.json
3. Verify DNS propagation
## Verification
- [ ] Test critical user flows
- [ ] Verify database connectivity
- [ ] Check application logs
- [ ] Monitor error rates
- [ ] Confirm data integrity
## Communication
- [ ] Update status page
- [ ] Notify customers
- [ ] Update internal teams
- [ ] Schedule regular updates
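The change batch the runbook references (dr-dns.json) is not shown above; a minimal example, assuming api.example.com and a placeholder DR load balancer hostname:

{
  "Comment": "Point api.example.com at the DR region",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{"Value": "dr-lb.us-west-2.example.com"}]
    }
  }]
}

Keeping this file pre-written and version-controlled removes one more manual step during an actual failover.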
Testing DR Plans
DR Drill Schedule
## Annual DR Test Plan
### Q1: Backup Restoration Test
- Restore from backup to isolated environment
- Verify data integrity
- Document restoration time
### Q2: Pilot Light Activation
- Start DR instances
- Promote database replica
- Run smoke tests
- Measure activation time
### Q3: Warm Standby Failover
- Full failover to DR region
- Run full test suite
- Serve real traffic (10%)
- Failback to primary
### Q4: Chaos Engineering
- Random component failures
- Verify auto-recovery
- Test monitoring/alerting
- Document issues
Test Checklist
## DR Test Checklist
### Preparation (T-7 days)
- [ ] Schedule test window
- [ ] Notify stakeholders
- [ ] Prepare rollback plan
- [ ] Set up monitoring
### Pre-Test (T-1 hour)
- [ ] Verify DR environment health
- [ ] Confirm backup freshness
- [ ] Alert team on standby
- [ ] Take pre-test snapshots
### Execution
- [ ] Initiate failover procedure
- [ ] Follow runbook precisely
- [ ] Document actual steps
- [ ] Time each phase
- [ ] Note any issues
### Validation
- [ ] Test all critical functions
- [ ] Verify data integrity
- [ ] Check performance
- [ ] Monitor error rates
- [ ] User acceptance testing
### Failback
- [ ] Sync data changes to primary
- [ ] Reverse failover process
- [ ] Verify primary health
- [ ] Monitor after failback
### Post-Test
- [ ] Document lessons learned
- [ ] Update runbooks
- [ ] Create action items
- [ ] Share results with team
Data Integrity Verification
Checksums and Validation
import hashlib

def verify_backup_integrity(source_file, backup_file):
    """Compare file checksums to verify backup integrity"""
    def calculate_checksum(filepath):
        hash_md5 = hashlib.md5()
        with open(filepath, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
        return hash_md5.hexdigest()

    source_checksum = calculate_checksum(source_file)
    backup_checksum = calculate_checksum(backup_file)
    if source_checksum == backup_checksum:
        print(f"✓ Backup verified: {backup_file}")
        return True
    else:
        print(f"✗ Backup corrupted: {backup_file}")
        return False
Database Consistency Checks
-- MySQL
CHECK TABLE users;
ANALYZE TABLE users;
-- PostgreSQL
SELECT * FROM pg_stat_database;
VACUUM ANALYZE users;
-- Count verification
SELECT COUNT(*) FROM users; -- Primary
SELECT COUNT(*) FROM users; -- DR replica
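Count checks like the ones above are easy to automate. A minimal sketch using psycopg2, assuming hypothetical connection strings for the primary and the DR replica:

import psycopg2

# Placeholder DSNs; substitute your own hosts and credentials
PRIMARY_DSN = "host=primary.example.com dbname=app user=dr_check"
REPLICA_DSN = "host=dr-replica.example.com dbname=app user=dr_check"

def row_count(dsn, table):
    # Table name is a trusted, hard-coded value here, not user input
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(f"SELECT COUNT(*) FROM {table}")
            return cur.fetchone()[0]

def compare_counts(table="users"):
    primary = row_count(PRIMARY_DSN, table)
    replica = row_count(REPLICA_DSN, table)
    drift = abs(primary - replica)
    print(f"{table}: primary={primary} replica={replica} drift={drift}")
    return drift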
Compliance and Documentation
Required Documentation
## DR Documentation Index
### 1. DR Policy
- RTO/RPO targets per service
- Roles and responsibilities
- Escalation procedures
### 2. Technical Architecture
- Diagrams of primary and DR sites
- Data flow diagrams
- Network topology
### 3. Runbooks
- Failover procedures
- Failback procedures
- Recovery procedures per component
### 4. Contact Information
- On-call schedule
- Vendor contacts
- Executive contacts
### 5. Test Results
- DR test history
- Issues identified
- Resolution status
### 6. Maintenance Logs
- Configuration changes
- Infrastructure updates
- Capacity planning
Compliance Requirements
SOC 2:
- Documented DR plan
- Regular DR testing
- DR test results
- Recovery procedures
HIPAA:
- Data backup procedures
- Disaster recovery plan
- Emergency mode operations
- Testing and revision
PCI-DSS:
- Business continuity planning
- Incident response procedures
- Annual DR testing
Cost Optimization
Balance Cost vs. Requirements
Service           RTO     RPO     Strategy           Cost/Month
─────────────────────────────────────────────────────────────
Critical API      1h      5min    Warm Standby       $5,000
User Database     1h      0       Synchronous Rep    $8,000
Analytics         24h     24h     Backup/Restore     $100
Dev Environment   72h     N/A     Rebuild            $0
Cost Reduction Strategies
- Tiered DR: Different strategies per service
- Scheduled Scaling: Start DR instances only during business hours
- Spot Instances: Use for non-critical DR components
- Storage Lifecycle: Move old backups to Glacier
- Cross-Region Replication: Only for critical data
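For the scheduled-scaling idea above, a minimal boto3 sketch, assuming a hypothetical dr-asg group that only needs warm capacity during business hours (times in UTC):

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")  # DR region

def schedule_dr_capacity(asg_name="dr-asg"):
    # Warm capacity during business hours only
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName=asg_name,
        ScheduledActionName="dr-business-hours-up",
        Recurrence="0 8 * * 1-5",    # 08:00 UTC, weekdays
        MinSize=2, DesiredCapacity=2,
    )
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName=asg_name,
        ScheduledActionName="dr-overnight-down",
        Recurrence="0 20 * * 1-5",   # 20:00 UTC, weekdays
        MinSize=0, DesiredCapacity=0,
    )

Only do this for services whose RTO tolerates the extra time it takes to scale the group back up outside those hours.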
Common Disaster Scenarios
1. Region Failure
Cause: AWS/Azure/GCP region outage
Response:
- Verify outage via status page
- Activate DR region
- Update global traffic routing
- Monitor closely
- Fail back when primary recovered
2. Data Corruption
Cause: Bad deployment, malicious actor
Response:
- Stop replication immediately
- Identify corruption scope
- Restore from clean backup
- Verify data integrity
- Resume replication
3. Ransomware
Cause: Malware encrypts data
Response:
- Isolate affected systems
- Do NOT pay ransom
- Restore from offline/immutable backups
- Scan for malware
- Investigate entry point
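Immutable backups are what make the "do not pay" stance viable. One way to get them on AWS is S3 Object Lock; a minimal sketch, noting that Object Lock can only be enabled when the bucket is created and that the bucket name, region, and retention period are placeholders:

import boto3

s3 = boto3.client("s3")

# Object Lock must be enabled at bucket creation time
s3.create_bucket(
    Bucket="dr-immutable-backups",              # placeholder name
    ObjectLockEnabledForBucket=True,
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# Every new backup object is then write-once (WORM) for 30 days
s3.put_object_lock_configuration(
    Bucket="dr-immutable-backups",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)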
4. Human Error
Cause: Accidental deletion, bad configuration
Response:
- Assess impact
- Restore from most recent backup
- Verify restoration
- Implement safeguards
- Update procedures
Continuous Improvement
Post-Incident Review
# DR Activation Post-Mortem
## Incident Summary
- Date/Time: 2025-01-15 14:32 UTC
- Duration: 2 hours 15 minutes
- Root Cause: Primary region network failure
## Timeline
14:32 - Primary region unresponsive
14:35 - DR activation initiated
14:50 - Database promoted
15:10 - DNS updated
15:30 - Full traffic on DR
16:47 - Failed back to primary
## What Went Well
- Monitoring detected issue quickly
- Runbooks were accurate
- Team responded efficiently
## What Went Poorly
- DNS TTL was too high (5min)
- Some manual steps could be automated
- Missing contact for vendor
## Action Items
1. Reduce DNS TTL to 60s
2. Automate database promotion
3. Update contact list
4. Practice failover more frequently
Metrics to Track
# DR Readiness
time_since_last_dr_test_days
# Backup Health
backup_age_hours
backup_success_rate
# Replication Lag
database_replication_lag_seconds
s3_replication_lag_seconds
# RTO/RPO Actual
actual_recovery_time_minutes
actual_data_loss_minutes
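A sketch of collecting two of these metrics with boto3, assuming backups land in the dr-backups bucket from the earlier script and that dr-database is the RDS read replica in the DR region (ReplicaLag is the standard per-replica CloudWatch metric):

from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")  # DR region

def backup_age_hours(bucket="dr-backups", prefix="full-"):
    # Age of the newest backup object in the bucket
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
    if not objects:
        return None
    newest = max(obj["LastModified"] for obj in objects)
    return (datetime.now(timezone.utc) - newest).total_seconds() / 3600

def replica_lag_seconds(replica_id="dr-database"):
    # Average replication lag over the last 5 minutes
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        StartTime=datetime.now(timezone.utc) - timedelta(minutes=5),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return points[0]["Average"] if points else None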
Conclusion
Effective disaster recovery requires:
- Clear objectives: Define RTO/RPO for each service
- Appropriate strategy: Match cost to requirements
- Regular testing: an untested DR plan is a plan that will fail when you need it
- Documentation: Runbooks must be accurate and accessible
- Continuous improvement: Learn from tests and incidents
Remember: Hope is not a DR strategy. Test, verify, and continuously improve your DR capabilities.