DR & Migrations Overview
| Content | Page |
|---|---|
| AWS DMS | AWS DMS |
| AWS Backup | AWS Backup |
| AWS Application Discovery Service | AWS Application Discovery Service |
| AWS Application Migration Service | AWS Application Migration Service |
| VMC on AWS | VMC on AWS |
What Is Disaster Recovery (DR)?
Disaster Recovery (DR) is the practice of planning for and recovering from unexpected events that disrupt business operations, such as hardware failures, cyberattacks, accidental deletions, and natural disasters.
In the cloud, DR focuses on resilience, automation, and cost-effective replication of critical systems and data.
RTO and RPO: The Core Metrics
| Metric | Definition |
|---|---|
| RTO (Recovery Time Objective) | How quickly you must restore service after a disruption |
| RPO (Recovery Point Objective) | How much data loss (time window) you can tolerate |
For example:
- If RTO = 4 hours, the system must be back online within 4 hours.
- If RPO = 15 minutes, you can lose at most 15 minutes of data.
Common RTO/RPO Requirements by Industry
| Industry | RTO | RPO |
|---|---|---|
| Banking | ≤ 1 hour | ≤ 5 minutes |
| Healthcare | 1–2 hours | ≤ 15 minutes |
| SaaS Startups | 4–8 hours | 1–4 hours |
| Non-critical apps | 24–48 hours | 12–24 hours |
Disaster Recovery Strategies in AWS
AWS describes four standard DR strategies:
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | Low | Backup data & config, restore manually after disaster |
| Pilot Light | Tens of minutes | <1 hour | Medium | Core data layer (replicated DB) always running; application servers launched from AMIs on failover |
| Warm Standby | <30 min | <30 min | High | Scaled-down version always running, scaled up when needed |
| Multi-site (Hot) | Seconds | Seconds | Very High | Fully duplicated system in multiple regions or AZs |
Migration vs DR
| Aspect | Migration | Disaster Recovery |
|---|---|---|
| Purpose | Move workload permanently | Restore workload temporarily |
| Involves Cutover | Yes | Only if a disaster occurs |
| Downtime allowed | Often scheduled | Must be minimized |
| Data replication | One-time or phased | Continuous or periodic |
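| Tooling | AWS DMS, AWS Application Migration Service (MGN, which replaced the discontinued SMS) | Snapshots, replication, AWS Elastic Disaster Recovery (successor to CloudEndure Disaster Recovery) |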
AWS Services for DR
| Category | Services | Description |
|---|---|---|
| Compute | EC2 AMIs, Auto Scaling | Pre-baked machine images; scale out after restore |
| Storage | EBS snapshots, S3 versioning | Durable backups |
| Database | RDS Multi-AZ, Aurora Global Database | Automatic failover, cross-region replicas |
| DNS | Route 53 | Health checks, failover routing |
| Replication | AWS DMS, AWS Elastic Disaster Recovery (formerly CloudEndure) | Continuous or scheduled data replication |
| Automation | Lambda, CloudFormation | Automate failover & restore |
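As an illustration of the DNS row above, Route 53 failover routing pairs a health-checked primary record with a secondary record that is served only while the primary is unhealthy. The following is a minimal Terraform sketch, not a complete configuration: the domain names, the `/health` path, the check intervals, and `var.zone_id` are placeholder assumptions.

```hcl
# Hosted zone is assumed to exist already; its ID is passed in.
variable "zone_id" {
  description = "ID of the Route 53 hosted zone (assumed to exist)"
  type        = string
}

# Health check that probes the primary endpoint over HTTPS.
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.example.com" # placeholder endpoint
  type              = "HTTPS"
  port              = 443
  resource_path     = "/health" # assumed health endpoint
  request_interval  = 30
  failure_threshold = 3
}

# Served while the health check passes.
resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "app.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = ["primary.example.com"]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# Served only when the primary record is considered unhealthy.
resource "aws_route53_record" "secondary" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "CNAME"
  ttl            = 60
  records        = ["dr.example.com"] # placeholder DR endpoint
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

A short TTL is used here because resolvers cache the old answer until it expires, which adds directly to the effective RTO.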
Sample DR Workflow: Backup & Restore
- Use AWS Backup to schedule daily EBS and RDS backups
- Replicate backups across regions
- Store application configs (env vars, IAM, templates) in S3 or SSM Parameter Store
- When disaster occurs:
  - Spin up EC2 instances using AMIs
  - Restore RDS from snapshot
  - Repoint DNS via Route 53
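The preparation steps in this workflow (scheduled backups plus cross-region copies) could be expressed in Terraform roughly as follows. This is a sketch under stated assumptions rather than a complete configuration: the regions, vault names, cron schedule, 35-day retention, and the `Backup = "daily"` tag are all illustrative, and `aws_iam_role.backup` is a hypothetical role assumed to be defined elsewhere with the permissions AWS Backup needs.

```hcl
# Assumed regions: primary in us-east-1, DR copies in us-west-2.
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Vault that receives the daily recovery points in the primary region.
resource "aws_backup_vault" "primary" {
  name = "primary-vault"
}

# Vault in the DR region that receives the cross-region copies.
resource "aws_backup_vault" "dr" {
  provider = aws.dr
  name     = "dr-vault"
}

resource "aws_backup_plan" "daily" {
  name = "daily-backups"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 5 * * ? *)" # every day at 05:00 UTC

    lifecycle {
      delete_after = 35 # days; illustrative retention
    }

    # Copy each recovery point to the DR-region vault.
    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn

      lifecycle {
        delete_after = 35
      }
    }
  }
}

# Select every supported resource (for example EBS volumes and RDS
# instances) that carries the assumed tag Backup = "daily".
resource "aws_backup_selection" "tagged" {
  name         = "tagged-resources"
  iam_role_arn = aws_iam_role.backup.arn # hypothetical, defined elsewhere
  plan_id      = aws_backup_plan.daily.id

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "daily"
  }
}
```

With a daily schedule, the worst-case data loss is roughly one day, so this pattern suits workloads whose RPO is measured in hours rather than minutes.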
DR Best Practices & Tips
| Category | Tip |
|---|---|
| Automation | Use CloudFormation + Lambda to automate restoration & infra setup |
| Immutable Infra | Use AMIs and containers for fast deployments |
| Testing | Perform regular DR drills using sandbox accounts or test AZs |
| Backups | Enable versioning for S3, schedule backups for EBS, RDS |
| Encryption | Use KMS to secure backups and replicas |
| Monitoring | Use CloudWatch + SNS to alert on outages and trigger failover |
| Documentation | Maintain clear SOPs (Standard Operating Procedures) for DR |
Tooling for Migrations & DR
| Tool/Service | Use Case |
|---|---|
| AWS DMS | Migrate live databases |
| AWS Application Migration Service (MGN, successor to CloudEndure Migration) | Lift-and-shift of full apps/VMs |
| AWS Backup | Scheduled backup + cross-region copy |
| S3 Replication | DR for object storage |
| RDS Multi-AZ / Read Replica | Hot standby for databases |
| Route 53 Failover | DNS-based failover |
| Step Functions | Orchestrate recovery workflows |
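Terraform Tip for Cross-Region DR (S3 Example)

    # Primary bucket with versioning and cross-region replication.
    # aws_iam_role.replication and aws_s3_bucket.secondary are assumed to be
    # defined elsewhere; the destination bucket must also have versioning on.
    # Note: these inline blocks are deprecated since AWS provider v4 in favor
    # of the separate aws_s3_bucket_versioning and
    # aws_s3_bucket_replication_configuration resources.
    resource "aws_s3_bucket" "primary" {
      bucket = "my-primary-bucket"

      # Replication only works on versioned buckets.
      versioning {
        enabled = true
      }

      replication_configuration {
        role = aws_iam_role.replication.arn

        rules {
          id     = "replication-rule"
          status = "Enabled"

          destination {
            bucket        = aws_s3_bucket.secondary.arn
            storage_class = "STANDARD"
          }
        }
      }
    }
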
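The example above references `aws_s3_bucket.secondary` and `aws_iam_role.replication` without defining them. One way they might look, written in the same inline style as the example, is sketched below. The `aws.dr` provider alias, the region, and all resource names are assumptions, and the permission list follows the commonly documented minimum for S3 replication, so it may need extending (for example for KMS-encrypted objects).

```hcl
# Aliased provider for the DR region (declare once per configuration;
# the default provider for the primary region is assumed to exist).
provider "aws" {
  alias  = "dr"
  region = "us-west-2" # assumed DR region
}

# Destination bucket; replication requires versioning on both buckets.
resource "aws_s3_bucket" "secondary" {
  provider = aws.dr
  bucket   = "my-secondary-bucket"

  versioning {
    enabled = true
  }
}

# Role that the S3 service assumes to replicate objects.
resource "aws_iam_role" "replication" {
  name = "s3-replication-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "s3.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# Read from the source bucket, write replicas to the destination.
resource "aws_iam_role_policy" "replication" {
  name = "s3-replication-policy"
  role = aws_iam_role.replication.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetReplicationConfiguration", "s3:ListBucket"]
        Resource = aws_s3_bucket.primary.arn
      },
      {
        Effect = "Allow"
        Action = [
          "s3:GetObjectVersionForReplication",
          "s3:GetObjectVersionAcl",
          "s3:GetObjectVersionTagging",
        ]
        Resource = "${aws_s3_bucket.primary.arn}/*"
      },
      {
        Effect   = "Allow"
        Action   = ["s3:ReplicateObject", "s3:ReplicateDelete", "s3:ReplicateTags"]
        Resource = "${aws_s3_bucket.secondary.arn}/*"
      },
    ]
  })
}
```

Replication applies to objects written after the rule is enabled; objects that already exist in the primary bucket need a separate copy step.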
TL;DR Summary
| Term | Meaning |
|---|---|
| RTO | Max acceptable downtime (e.g., 1 hr) |
| RPO | Max acceptable data loss (e.g., 15 min of data) |
| DR Plan Types | Backup-Restore, Pilot Light, Warm Standby, Multi-site |
| Tools | DMS, MGN, S3 replication, Route 53, CloudFormation |
| Key Advice | Automate everything, test regularly, document recovery steps |