In my last post (Part 6), I explained how we invented a better way to migrate and transform an application, whether across the room or across the country: build a parallel, virtualized environment, pre-configure and pre-test it, and practice the migration. However, nothing is perfect. As we found out, there are still some things you can’t test.
The ESRS 2 migration is probably the pinnacle success story of the entire Durham migration. ESRS 2 connects EMC Customer Service to customers and helps us monitor installed systems, identify problems and connect back to the systems to diagnose and fix problems remotely or through a service request.
The migration team was able to build out a new, entirely virtualized architecture running on Vblock. Performance testing results were outstanding. The new architecture was tested at 4x the current load and ran faster than the pre-migration system. We were also able to test and fully document our disaster recovery plans.
The migration itself wasn’t going to be easy. Customers had firewall rules configured at tens of thousands of sites which had to be changed and tested so we could connect. We also had to move some network configurations to the new data center. Our network team estimated that it would take several hours to get that work done and there was no contingency if it didn’t work. Additionally, production would have to be down the entire time.
This was a non-starter. There was no way we could take that plan to the business, so we began brainstorming ideas.
We concluded that we needed another environment to fail over to and it was suggested we use the new DR (disaster recovery) environment. We even thought we might be able to do it with no downtime.
The team was skeptical, even though I pointed out that we had just tested the new DR environment, that we knew it worked, and that most of the customers already had those firewalls configured. I reasoned that we didn’t need perfect transaction consistency, just connectivity. We could migrate to the new DR environment, export/import the database and then cut over the main URL. The export/import might take an hour or two, but production would still be running. Once the URL was migrated, transactions would move over to the new DR environment. Users might not even notice.
Our DBA lead pointed out that this option would disrupt data consistency because we’d be missing two hours of transactions, but I felt that shouldn’t be a problem. All of those transactions were leaf nodes, at the very end of the database, and we could capture them and insert them back into the tables when we were done.
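To make the capture-and-reinsert idea concrete, here is a minimal sketch in Python using SQLite. It is an illustration only, not what we ran: the `events` table, its columns and the `cutoff` timestamp are all hypothetical stand-ins, and it assumes the missed rows have no dependents and carry a unique id.

```python
import sqlite3


def replay_missing_rows(old_db, new_db, cutoff):
    """Copy rows written to the old database after the export was taken.

    Assumptions (hypothetical, for illustration): both databases have an
    `events` table with an integer primary key `id`, and `created_at` is
    an epoch timestamp. `cutoff` is the time the export started.
    Returns the number of rows captured from the old database.
    """
    rows = old_db.execute(
        "SELECT id, created_at, payload FROM events WHERE created_at > ?",
        (cutoff,),
    ).fetchall()
    # INSERT OR IGNORE skips ids that already exist, so a re-run is harmless.
    new_db.executemany(
        "INSERT OR IGNORE INTO events (id, created_at, payload) VALUES (?, ?, ?)",
        rows,
    )
    new_db.commit()
    return len(rows)
```

This only works because nothing else references the late rows. If they had children in other tables, the replay would need to respect that ordering, which is the kind of risk our DBA lead was worried about.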
After batting around the idea for a few more days, we decided a more conservative approach with a longer downtime was better than potential database corruption. We decided that we’d take production down and then migrate to the new DR environment. Once that was up and running, we’d start moving the network configurations. When that was complete, we would migrate the application again to the new production environment. That way we would always have a proven environment to fail back to, which would minimize overall downtime and radically reduce risk.
It was a bold plan. It was a mission-critical application and we were going to attempt to migrate it twice in the same day!
We practiced it a few times and came up with a great cutover plan. We practiced everything except cutting over the main URL. The migration itself went nearly flawlessly. With about 30 people onsite, we brought down production and began executing the plan. It took about an hour to move the data and make the few remaining configuration changes.
“Ok, the application should be coming up!” exclaimed the development lead.
We were all holding our breath in anticipation.
“We’re getting pings from the field! Looks good, 20 percent have already made the switch. Wow, it’s already at 40 percent. This might take a few minutes for the URL change to be honored by all the sites in the field,” he reported.
A few more minutes passed.
“60 percent! 80 percent! 90 percent! That’s probably all of them that have DR configured!”
We were ecstatic!
But the celebration didn’t last long. We did hit some snags. However, because we had built and tested a new parallel environment, the possible root causes were drastically reduced and we were able to quickly identify and address the problem. A quick reboot later and it was fully functional. The timing was perfect; pizza had just been delivered for lunch.
At 3:00 p.m. we shut down the application and began the migration. All the teams were working in parallel, cutting the application over to the new production environment. This time we verified that DNS had globally propagated before rebooting the servers.
At five minutes, most of the DNS servers had updated, but not all of them. We waited. At seven minutes, all of our global DNS servers had replicated.
We were good to go. We carefully rebooted each server one by one. After another five minutes, the application was back online.
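For anyone who wants to script that "wait until every DNS server agrees" check, here is a rough sketch in Python. It is not the tooling we used that day. The `lookup` function is a hypothetical hook you would supply yourself (for example, a wrapper around a DNS library or `dig`), and the resolver names, timeout and interval are placeholders.

```python
import time


def wait_for_propagation(resolvers, expected, lookup,
                         timeout_s=600, interval_s=30, sleep=time.sleep):
    """Poll each resolver until all of them return the expected answer.

    `lookup(resolver)` is a caller-supplied (hypothetical) function that
    asks one resolver for the record and returns its answer as a string.
    Returns the list of resolvers that are still stale; an empty list
    means the change has propagated everywhere we checked.
    """
    waited = 0
    while True:
        stale = [r for r in resolvers if lookup(r) != expected]
        if not stale or waited >= timeout_s:
            return stale
        sleep(interval_s)
        waited += interval_s
```

The point of returning the stale resolvers rather than a simple yes or no is that, if the timeout is hit, you know exactly which servers to chase before anyone touches a reboot.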
4:15 p.m. We had just migrated a mission-critical application TWICE in the same day, and we would all be home in time for dinner.
The application is so big, complex and mission-critical that there is absolutely no way we could have moved it any other way without a very significant impact to the business. More importantly, we had used the migration as a vehicle to virtualize, upgrade, and scale out the infrastructure. All of the past performance issues were resolved, and now that all components were virtualized, we could easily add resources as required.
In my final post, I’ll discuss the challenges and excitement of decommissioning the legacy data center.