The EMC Durham Cloud Data Center Migration: A Long Night of Lessons Learned

Fifth in a series on EMC’s Durham Data Center.

With the migration plan for EMC’s Durham Data Center complete (see part 4), we began the daunting task of the migration itself. We weren’t going to use trucks or airplanes to move the gear. We were going to migrate all the applications and data over the wire. The fact that it really hadn’t been done before was a technical challenge that we would just have to overcome.

In late Q4 2010, as we were completing the Durham Data Center infrastructure build (see part 2), our migration team began experimenting.

The first attempt was a straight virtual-to-virtual (V2V) migration over the WAN. We thought, how cool would that be? No downtime, little risk, and we were already well over 50 percent virtualized. It turns out North Carolina and Massachusetts are too far apart: more than 600 miles, which translated into roughly 25 milliseconds of latency. The V2V experiment failed. It took nearly 30 hours to move a single virtual machine. V2V migration wouldn’t work at that distance, and it wasn’t a viable solution for the hundreds of physical servers we were still running anyway.
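To see just how punishing that latency was, here’s a rough back-of-the-envelope calculation in Python. The TCP window size and VM size are illustrative assumptions, not measurements from our environment; the point is that a single stream can move at most one window of data per round trip, so 25 milliseconds of latency caps throughput at a few megabytes per second.

```python
# Back-of-the-envelope: why ~25 ms of WAN latency crippled the V2V copies.
# The window size and VM size are illustrative assumptions, not actual values.

RTT_SECONDS = 0.025            # ~25 ms round trip between MA and NC
TCP_WINDOW_BYTES = 64 * 1024   # a common default window without scaling tuned
VM_SIZE_GB = 250               # hypothetical mid-sized virtual machine

# A single TCP stream can move at most one window of data per round trip.
throughput_bytes_per_sec = TCP_WINDOW_BYTES / RTT_SECONDS
hours_to_copy = (VM_SIZE_GB * 1024**3) / throughput_bytes_per_sec / 3600

print(f"Throughput: {throughput_bytes_per_sec / 1024**2:.1f} MB/s")    # ~2.5 MB/s
print(f"Time to copy a {VM_SIZE_GB} GB VM: {hours_to_copy:.0f} hours") # ~28 hours
```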

We somehow needed to replicate the applications and databases from Massachusetts to North Carolina non-disruptively so we wouldn’t be affected by the latency. We naturally turned to EMC’s Symmetrix Remote Data Facility (SRDF), the gold standard for remote replication for mission critical environments.  It was still going to be pretty complex due to the many arrays in both the source data center and Durham.

To minimize effort, a small EMC VMAX was set up in each data center, and an asynchronous SRDF/A bridge was put in place to replicate between the two.  For each move event we set up a dedicated one-terabyte logical disk, or LUN, and replicated it to Durham.  A LUN is basically a virtual storage container made from parts of many disks on the array. Prior to each move event, we would use VMware Storage vMotion to non-disruptively move each virtual machine (VM) onto the swing array. SRDF would then replicate the VM to Durham.
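As an illustration of that Storage vMotion step, here is a minimal pyVmomi sketch that relocates one running VM’s storage onto a datastore backed by the swing array. This is not the tooling we actually used; the vCenter host, credentials, VM name, and datastore name are hypothetical placeholders.

```python
# Illustrative only: non-disruptively relocate one VM's disks onto the
# SRDF-replicated swing datastore. All names and credentials are made up.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim


def find_by_name(content, vim_type, name):
    """Return the first inventory object of the given type with this name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim_type], True)
    return next(obj for obj in view.view if obj.name == name)


si = SmartConnect(host="vcenter.example.com",       # hypothetical vCenter
                  user="migration-svc", pwd="***",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    vm = find_by_name(content, vim.VirtualMachine, "app01-web")       # hypothetical VM
    swing_ds = find_by_name(content, vim.Datastore, "swing-vmax-ds")  # LUN on the swing VMAX

    # Storage vMotion: move the VM's disks while the VM keeps running.
    spec = vim.vm.RelocateSpec()
    spec.datastore = swing_ds
    WaitForTask(vm.RelocateVM_Task(spec))
    print(f"{vm.name} now lives on {swing_ds.name}; SRDF/A carries it to Durham.")
finally:
    Disconnect(si)
```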

We would be able to migrate databases over the SRDF bridge as well. Our database administrators would export the database to a migration LUN, allow SRDF to copy the data to Durham, then import the data to a new home in Durham.
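Purely for illustration, the database leg of a move event might look something like the sketch below. It assumes an Oracle database and Data Pump as an example only (these posts don’t name the database engine); the connect strings, schema name, and the directory object pointing at the migration LUN are hypothetical. The export runs on the legacy side, and the import runs in Durham once SRDF has finished copying.

```python
# Hypothetical sketch: dump a schema onto the migration LUN in Massachusetts,
# then import it in Durham after SRDF has replicated the LUN.
import subprocess


def export_to_migration_lun(db_connect: str, schema: str) -> None:
    """Run on the legacy side: write a Data Pump dump onto the migration LUN."""
    subprocess.run(
        ["expdp", db_connect,
         "DIRECTORY=MIGRATION_DIR",      # Oracle directory object on the migration LUN
         f"SCHEMAS={schema}",
         f"DUMPFILE={schema}.dmp",
         f"LOGFILE={schema}_exp.log"],
        check=True)


def import_in_durham(db_connect: str, schema: str) -> None:
    """Run on the Durham side once replication has caught up."""
    subprocess.run(
        ["impdp", db_connect,
         "DIRECTORY=MIGRATION_DIR",
         f"SCHEMAS={schema}",
         f"DUMPFILE={schema}.dmp",
         f"LOGFILE={schema}_imp.log"],
        check=True)
```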

The first production move: If anything can go wrong, it will

Our first production move had only a few applications and a handful of servers.  We went through the migration planning activities and tabletop exercises, creating a playbook.  Everything was going well.  Tracking spreadsheets were being created and updated. There was frenzied activity.  Two weeks prior to the move, one of the application teams bailed out. They supported another, more critical application that was having issues, and they couldn’t work on both the migration and the fix.  If we had been using a truck, that would have doomed the entire move event, but since we were going over the wire we had the flexibility to keep going. The next week, another application team bailed out. Their next release was in trouble, and they asked to reschedule so they could concentrate on it. Reluctantly, we rescheduled them and pressed forward.

The remaining applications were small and not very critical, but they were still production.  Friday night at 8 p.m. the team hunkered down for the move. We opened up the conference call and began executing on the playbook.  As we shut down the applications and databases, I looked out the window and saw the last glow of the sun fade below the horizon.  The first migration applications had just gone dark too.

It was our first move, so there were some missed handoffs and communication issues, but we eventually had the VMs and databases standing up on the target swing array in Durham.  It was about 2 a.m. The application teams began reconfiguring their environments.  Almost immediately they encountered issues, and troubleshooting them took several hours: Domain Name Service (DNS) changes that needed to be redone, firewalls to reconfigure, and server settings to fix. We had worked all night and into the early morning. I looked out the window and saw the sun beginning to rise, but we hadn’t managed to bring any of the first migration applications online yet.

In the next few hours, the issues were ironed out and two of the applications were brought online and functioning. We weren’t so lucky with the third app.  It ran the registration site for one of our customer conferences. The application team had inherited it from someone, who had inherited it from someone else.  They rarely worked on it and didn’t know its inner workings very well.

Then our security team member raised a key question: where was this application’s traffic going? That wasn’t our network!

The application owner admitted they had forgotten that the site had integrations with two third-party companies. I was shocked we had missed this.  We’d been meeting on this nearly every day for weeks!

After another hour or so we were able to get in touch with someone at one of the third parties to figure out what they would need to reconfigure.

Just then my instant messenger lit up. It was my friend Patricia telling me that we needed the event registration site up and running by Monday so customers could sign up for the next conference.

This was another surprise. We had scheduled this app to be moved first because the conference was months away. Now Patricia was telling me that registration was in a few days and that emails with the link had been sent out.  She asked why it was taking so long to move the site.

“It is moved,” I told her. “We were done moving it at 2 a.m.!  It just doesn’t work yet.”

At this point the inevitable struck me: we had zero chance of getting this done over the weekend. The team was exhausted and frustrated, we still hadn’t heard from the other third party, and the site needed to be up by Monday. It was now Saturday afternoon.

I decided we would need to admit defeat and roll it back. I told the team to move the VMs back.

Since we had migrated using VMware vSphere Storage vMotion over SRDF, there was no data to move back; the original copy was still in the legacy data center, and we had only been in the process of configuring the replicas in Durham.  The firewall rules were all still in place, and the third-party vendor hadn’t changed any configurations yet.

We only had to change DNS to point the URL and servers back to the old IPs and reboot them.
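That kind of DNS change can be scripted with something like the dnspython sketch below (an RFC 2136 dynamic update). The zone, hostname, addresses, and DNS server are made-up placeholders; in practice the change would go through whatever DNS management process the team used.

```python
# Illustrative rollback: point the registration site's A record back at the
# legacy data center. Zone, host, IPs, and server below are hypothetical.
import dns.query
import dns.update

LEGACY_IP = "10.1.20.15"    # old address back in Massachusetts (made up)
DNS_SERVER = "10.1.0.53"    # authoritative server accepting dynamic updates (made up)

update = dns.update.Update("corp.example.com")
update.replace("eventreg", 300, "A", LEGACY_IP)   # 5-minute TTL to speed the cutover
response = dns.query.tcp(update, DNS_SERVER, timeout=10)
print(f"DNS update response code: {response.rcode()}")
```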

“Alright guys, let’s do it. We might get done by dinner,” I said.

About half an hour later, the app was rolled back and online: running, tested, and available for customer use.

While the move event wasn’t entirely successful, we learned a lot from it.  First, no matter how much you talk about a migration, there are unknown unknowns.  The application teams are spread thin and don’t have perfect knowledge of every aspect of every application.  Second, moving VMs gives you tremendous flexibility: we were able to alter the content of the move event literally days before the move with no impact, and two applications were migrated relatively easily.  Lastly, we were able to roll back the failed migration nearly instantly.

But we still weren’t 100 percent successful, and we had still worked all night. This was a little application with only a handful of servers.  What about the big, complex applications with huge numbers of servers?  How would we scale this approach for those monsters?

Storage vMotion over SRDF worked much better than using trucks and airplanes, but there were still unknowns and issues. We needed to find an even better way.  We needed to reduce risk and reduce downtime.  In my next blog, “Inventing a Better Way,” I’ll go into detail on the approach we used to migrate EMC’s most mission-critical applications.

(Read part 1 and part 3 of Steve’s blog on the data center migration.)

About the Author: Stephen Doherty