The difference between a smooth migration and a disaster weekend isn't luck — it's preparation. We've run 30+ migrations in the last three years. Zero unplanned outages. Here's exactly how.
Why Migrations Fail
We've seen the same four root causes take down migrations:
- Incomplete discovery. You find out about "that one system nobody documented" at 2 AM Friday.
- No rollback testing. Your rollback plan exists on a whiteboard. It doesn't work under real load.
- Scope creep during cutover. "While we're moving, let's upgrade the database." No. Stop.
- Poor org communication. Engineering knows the plan. Product doesn't. Support doesn't. Customers definitely don't.
Every disaster migration we've analyzed failed on at least two of these. Usually three.
The Six-Phase Methodology
The Rollback Plan Nobody Talks About
Define your rollback trigger before you start. Not during cutover. Before.
Example trigger: "Error rate exceeds 0.1% for 5 consecutive minutes during the first 2 hours post-cutover, or any critical service fails health checks for more than 60 seconds."
Once your rollback trigger is hit, you execute the plan automatically. No committee meetings. No "let's wait and see if it recovers." Rollback happens.
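Once the trigger is written down, it can be evaluated by a script instead of a tired human. Here's a minimal sketch of that evaluation, assuming you pass in your own `get_error_rate()` and `critical_services_healthy()` metric queries; the function names and one-minute sampling cadence are illustrative, while the thresholds mirror the example trigger above.

```python
import time

# Illustrative thresholds taken from the example trigger above.
ERROR_RATE_LIMIT = 0.001            # 0.1% error rate
ERROR_WINDOW_MINUTES = 5            # breached for 5 consecutive minutes
HEALTH_FAIL_LIMIT_SECONDS = 60      # critical service unhealthy for more than 60s
WATCH_PERIOD_SECONDS = 2 * 60 * 60  # first 2 hours post-cutover


def should_roll_back(get_error_rate, critical_services_healthy) -> bool:
    """Evaluate the pre-agreed trigger once a minute; return True the moment it fires."""
    consecutive_breaches = 0
    unhealthy_since = None
    start = time.monotonic()

    while time.monotonic() - start < WATCH_PERIOD_SECONDS:
        # Condition 1: error rate above the limit for 5 consecutive one-minute samples.
        if get_error_rate() > ERROR_RATE_LIMIT:
            consecutive_breaches += 1
        else:
            consecutive_breaches = 0
        if consecutive_breaches >= ERROR_WINDOW_MINUTES:
            return True

        # Condition 2: any critical service failing health checks for more than 60 seconds.
        if not critical_services_healthy():
            unhealthy_since = unhealthy_since or time.monotonic()
            if time.monotonic() - unhealthy_since >= HEALTH_FAIL_LIMIT_SECONDS:
                return True
        else:
            unhealthy_since = None

        time.sleep(60)
    return False
```

Wire the True result straight into your rollback runbook. The value of doing this in code is precisely that nobody gets to argue with it at 2 AM.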
And critically: you practice the rollback. You don't roll back to production for the first time during an actual outage.
The Scope Discipline Rule
No version upgrades during cutover. No feature deployments. No schema changes. No new dependencies. No changes to DNS routing outside the plan.
Zero. You are in a change freeze from 6 hours before cutover begins until 72 hours after you're live in production. This is non-negotiable.
The urge to "fix one more thing" while you're migrating will be strong. Every team feels it. The teams that give in to it are the ones that end up in the 2 AM war room trying to roll back.
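The freeze is easier to hold when the pipeline enforces it. Here's a minimal pre-deploy gate sketch, assuming a hypothetical `CUTOVER_START` environment variable set in CI as an ISO-8601 timestamp; the variable name is an assumption, while the 6-hour and 72-hour bounds come from the rule above.

```python
import os
import sys
from datetime import datetime, timedelta, timezone

# Freeze window from the rule above: 6 hours before cutover until 72 hours after.
FREEZE_BEFORE = timedelta(hours=6)
FREEZE_AFTER = timedelta(hours=72)


def main() -> int:
    cutover_raw = os.environ.get("CUTOVER_START")  # e.g. "2025-03-01T02:00:00+00:00"
    if not cutover_raw:
        return 0  # no cutover scheduled, deploys allowed

    cutover = datetime.fromisoformat(cutover_raw)
    now = datetime.now(timezone.utc)

    if cutover - FREEZE_BEFORE <= now <= cutover + FREEZE_AFTER:
        freeze_ends = (cutover + FREEZE_AFTER).isoformat()
        print(f"Change freeze in effect: deploys blocked until {freeze_ends}")
        return 1  # non-zero exit fails the pipeline step
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Run it as the first step of every deploy job during migration season. A red pipeline is much easier to argue with than a teammate who "just needs to ship one small fix."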
The Organization Communication Playbook
4 weeks before cutover: Brief product and support. Not a deep technical sync. A "here's the change that's coming, here's why, here's the timeline" conversation. Get their questions answered.
2 weeks before: Email to all customers. Not scary. Professional. "We're upgrading our infrastructure. Nothing about your experience will change. Here's what we'll do if something goes wrong."
24 hours before: Status page goes yellow. Customers know something is happening. This prevents surprise emails.
Cutover window: Every engineer in the war room has a specific role. Someone is watching error rates. Someone is watching customer support tickets. Someone is on call with your biggest customers. Someone is running health checks against the old environment.
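For the engineer on health-check duty, the tooling doesn't need to be elaborate. Here's a minimal poller sketch, with placeholder URLs standing in for your real old and new environment health endpoints.

```python
import time
import urllib.request
import urllib.error

# Placeholder endpoints; substitute your real old/new environment health URLs.
ENDPOINTS = {
    "old": "https://old.internal.example.com/healthz",
    "new": "https://new.internal.example.com/healthz",
}


def check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


if __name__ == "__main__":
    while True:
        stamp = time.strftime("%H:%M:%S")
        for name, url in ENDPOINTS.items():
            status = "OK" if check(url) else "FAIL"
            print(f"{stamp} {name:>3} {status}")
        time.sleep(30)  # poll every 30 seconds during the cutover window
```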
Post-cutover: 72-hour hypercare window. At least one engineer on call at all times. Support tickets get a response within 15 minutes. Not a full resolution, just an immediate acknowledgment.
After Migration: The 30-Day Review
Migration doesn't end at cutover. It ends 30 days later, when you've turned off the old environment.
During those 30 days: you monitor like a hawk, you respond to issues fast, and you resist the urge to optimize. Optimizations come later. Right now, stability is everything.
At day 30: run a full incident postmortem. Not because something went wrong — run it anyway. Document what you learned. Document what you'd do differently next time. Document what went smoother than you expected.
The Infrastructure as Code Requirement
You cannot run a safe migration with manual ops. Your cloud environment must be defined in code. All of it.
This is both a technical requirement and a communication requirement. When your infrastructure is code, your team can review it. They can comment on it. They can understand it before cutover happens.
And critically: when something breaks at 3 AM, you fix it in code, not by hand in the console, because any manual fix gets blown away the next time your infrastructure code is applied. Everything is code. Always.
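As one small illustration of what "everything is code" buys you, here's a drift-check sketch; `load_declared_state()` and `fetch_live_state()` are hypothetical stand-ins for however your IaC tool exposes its declared and live state.

```python
# A minimal drift-check sketch illustrating the "everything is code" rule.
# load_declared_state() and fetch_live_state() are hypothetical stand-ins for
# however your IaC tool exposes its plan and the environment's live state.

def load_declared_state() -> dict:
    """What the code says the environment should look like (hypothetical values)."""
    return {"web-sg-ingress": ["443"], "app-instance-count": 6}


def fetch_live_state() -> dict:
    """What the cloud actually looks like right now (hypothetical values)."""
    return {"web-sg-ingress": ["443", "22"], "app-instance-count": 6}


def detect_drift(declared: dict, live: dict) -> list[str]:
    """Describe every resource whose live value no longer matches the code."""
    drift = []
    for key, want in declared.items():
        have = live.get(key)
        if have != want:
            drift.append(f"{key}: declared {want!r}, live {have!r}")
    return drift


if __name__ == "__main__":
    findings = detect_drift(load_declared_state(), fetch_live_state())
    if findings:
        print("Manual changes that will be overwritten on the next apply:")
        for line in findings:
            print(f"  - {line}")
    else:
        print("No drift: the environment matches the code.")
```

Run a check like this during hypercare so any 3 AM console fix gets folded back into the code before the next apply erases it.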
Planning a migration?
We've built this playbook from 30+ migrations. If you're planning something big, let's talk through the risks specific to your environment.
Request My Free Audit →