VPLEX is entering its fourth year since its official launch at EMC World 2010. Coming off a great EMC World 2014, one of the most powerful conversations on VPLEX is around VPLEX Metro Continuous Availability bolstered by Cluster Witness: how it operates, and how “kissing” 7-9’s (99.99999%) uptime is possible for the EMC customer base. Continuous availability is rooted in the combination of the GeoSynchrony cache-coherent model and the Cluster Witness. To re-emphasize, from our install base of upwards of 4000 clusters to date, and through TCE (total customer experience), we know that our customers have achieved 7-9’s availability across metro deployments. This essentially means that if an outage should ever occur, it would be unnoticeable at about 2.5 seconds.
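For readers who want to see what a given number of nines means in practice, here is a minimal sketch of the arithmetic. The function name and the 365.25-day year are my own assumptions for illustration, not an EMC tool or an official TCE calculation. Under those assumptions, seven nines works out to roughly three seconds of downtime per year, the same order of magnitude as the figure quoted above.

```python
# Illustrative arithmetic only: converts "N nines" of availability
# into the downtime budget it allows per year.
# Assumption: a year is 365.25 days; the function name is hypothetical.

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 seconds


def downtime_seconds_per_year(nines: int) -> float:
    """Return the allowed downtime per year for `nines` nines of availability."""
    unavailability = 10.0 ** (-nines)  # e.g. 7 nines -> 1e-7
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (5, 6, 7, 8):
        availability_pct = 100.0 * (1.0 - 10.0 ** (-n))
        print(f"{n} nines ({availability_pct:.6f}%): "
              f"{downtime_seconds_per_year(n):.2f} seconds of downtime per year")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why the step from six to seven nines is so demanding at the infrastructure level.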
In the next installment of my blog, I intend to deep-dive into the cache-coherent technology and the importance of the witness. First, though, I want to focus on the Mission Critical Center (MCC) story and how its “Trusted IT” philosophy contributes to the best-of-breed status in enterprise and mid-range systems. MCC is an EMC initiative, driven by the Enterprise and Mid-Range Systems Division (EMSD) and the Data Protection & Availability Division (DPAD), to ensure customers can build a trusted IT infrastructure and accelerate their organization’s ability to capitalize on cloud and big data. MCC demonstrates continuous availability and continuous data protection by deploying a complex customer configuration, from applications, through virtualization layers, down to the storage arrays. Furthermore, MCC test-runs pre-GA beta cycles (releases and service packs) through rigorous failure injection, failover and recovery operations across all system components, while monitoring and reporting on workloads and identifying engineering improvements needed in the product.
Following are excerpts from conversations held with Ramesh Balan [Sr. Manager Enterprise Engineering] and John Wallace [Sr. Director Customer Engineering Performance Operations], who owns Mission Critical Center (MCC) operations.
Ramesh, right off the bat I am asking you a direct question, based on the claim that MCC is able to simulate years of customer production runtime in a matter of months using customized techniques. Can we talk about how that is accomplished with MCC?
Yes, of course, Jen. EMC closely tracks the mean time between part replacements (MTBPR) data for all hardware components in EMC’s install base. Using this data, we identify the critical subsystem component counts in our infrastructure, calculate the expected failure rates, and inject failures in critical paths across all component types. The ratio of the injection rate to the rate implied by MTBPR defines an ‘acceleration’ factor, and this acceleration helps us compress real-world timelines into days and weeks.
In the real world, a customer can expect to run many years without experiencing a hardware failure or invoking a disaster recovery plan. In MCC, so many disasters occur in just a month that we get data equivalent to decades of DR drills, stressing the breadth and depth of the portfolio.
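To make the acceleration idea concrete, here is a minimal sketch of how such a factor could be computed. This is not MCC’s actual tooling: the function names, the component count, the MTBPR value and the injection schedule are all hypothetical, and the sketch assumes a simple constant failure rate (expected replacements per hour = component count / MTBPR).

```python
# Hypothetical sketch of an 'acceleration factor' calculation.
# Assumptions (not from EMC): constant failure rate per component,
# made-up component counts, MTBPR values and injection schedule.

HOURS_PER_YEAR = 365.25 * 24  # 8766 hours


def field_failure_rate(component_count: int, mtbpr_hours: float) -> float:
    """Expected part replacements per hour across a population of components."""
    return component_count / mtbpr_hours


def acceleration_factor(injections_per_hour: float, field_rate_per_hour: float) -> float:
    """How many times faster failures are injected than they occur in the field."""
    return injections_per_hour / field_rate_per_hour


def equivalent_field_hours(lab_hours: float, factor: float) -> float:
    """Field runtime that would see as many failures as the lab run."""
    return lab_hours * factor


if __name__ == "__main__":
    # Hypothetical inputs: 200 critical components, MTBPR of 1,000,000 hours,
    # one injected failure per day, over a 30-day lab run.
    rate = field_failure_rate(component_count=200, mtbpr_hours=1_000_000.0)
    factor = acceleration_factor(injections_per_hour=1 / 24, field_rate_per_hour=rate)
    years = equivalent_field_hours(lab_hours=30 * 24, factor=factor) / HOURS_PER_YEAR
    print(f"Acceleration factor: {factor:.1f}x")
    print(f"A 30-day run is equivalent to about {years:.1f} years in the field")
```

With these made-up inputs, a single month of injected failures corresponds to well over a decade of field exposure, which is the kind of compression described above.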
John, why Mission Critical Center?
We needed a place where we use the portfolio with always-on applications, in a production environment, for EMC customers that need solutions supporting always-on IT with continuous availability. This philosophy and approach builds on the larger EMC investments in TCE. We build on these investments by running data centers and customer workloads with a community of experts across product divisions, professional services and customer support to develop competencies in HA/CA. MCC puts this all together in three data center environments, implemented via best practices for HA/CA/DR, and subjected to accelerated stress and degraded operation. We are then able to understand and translate the application of these practices to real customer problems and solutions.
Product to infrastructure: Although EMC products have exceptional reliability and availability, customers view availability across their infrastructure as one entity. No one in the industry measures this level of availability from the application perspective. With this portfolio of products (VPLEX, VMAX, VNX, RecoverPoint, Connectrix), achieving 8-9’s availability isn’t too far a stretch. However, we need MCC to demonstrate that this is possible and to be the competency center for HA/CA. In addition, as EMC leads the industry through the Platform 2 to Platform 3 transition, the availability of Platform 2 (exceeding six nines) needs to be demonstrated, and these practices then moved to Platform 3.
How Does MCC Measure Success?
Ultimately, success is measured through our customers’ eyes, by the prevention of customer outages. Across our install base we track DU/DL, uptime and run hours, and we apply the same analyses to MCC. Internally, on a daily basis, we monitor the uptime of the MCC data centers, tracking availability and following the same TCE practices as our customers. This way we can feed back any suggested adaptation of practices ahead of release, with a direct line to engineering through the beta process. We affectionately call this “beta the business.”
Ramesh, what practices does MCC deploy in order to be a proxy for the customer?
The MCC data centers strive to emulate every aspect of a true production environment. We follow accepted IT business practices for capacity planning, change control approval and maintenance windows while we administer non-disruptive upgrades, technology refreshes, environment changes and incident management. Enterprise-scale applications run around the clock, with any interruption escalated through EMC’s Customer Support teams following time-honored TCE processes. We ensure that the application data has adequate replication and backup for quicker recovery.
How does MCC influence Engineering?
MCC engages with early code releases as a beta customer and provides early feedback to Engineering before products become generally available to customers. By ‘throwing rocks’ at the code from different angles, MCC functions as one of the ‘worst beta customers’, providing prolific feedback into beta programs. As one of the first customers to upgrade to GA code, MCC also influences product teams during early patch and service pack releases.
How has MCC evolved?
A year ago, MCC put together our data centers in Hopkinton with a few hosts, two back-end arrays and a VPLEX Metro. We started operations with Oracle RAC and VMware ESX stretched clusters spread over three small data centers, with Swingbench and user VMs as the main applications. Our first project, “Apollo”, focused on proving the resiliency of VPLEX and VMAX in stressful BC/DR scenarios. During our next project, “Cassini” (MCC folks are fans of NASA space missions), we moved the entire host infrastructure to a VMware virtualized host platform, live, with no disruption to applications. In the third project, “DeepImpact”, we capitalized on the virtual infrastructure to orchestrate automated BC/DR operations using VMware SRM and RecoverPoint (RP). We also deployed enterprise-class applications using SAP and proved the resiliency of our portfolio running the SAP ERP application. Currently, with “Project Fermi”, we are adding second-site replication using MetroPoint and plan to use VMware SRM+RP for three-site failover/failback DR scenarios. In parallel, we are adding more SAP modules (BI and CRM) to the data center. It is very exciting to see our portfolio scale seamlessly and flawlessly fail over and fail back all these applications!
The MCC data centers currently have the following infrastructure products and their associated management tools:
• EMC Symmetrix® VMAX® 10K, VMAX 20K, and VMAX 40K Series systems, EMC Solutions Enabler and EMC Unisphere™ for VMAX
• EMC RecoverPoint® and EMC Unisphere for RecoverPoint
• EMC VPLEX™ Metro with GeoSynchrony™ and EMC Unisphere for VPLEX
• EMC Connectrix® and Connectrix Manager
• EMC VNX™
Wow Ramesh! That is a lot of work in a short time. In summary, what do you see as the ultimate benefits and results from MCC to date?
MCC was built to emulate a customer’s production environment. Overall results show that the strategies employed by MCC can evolve and influence our product portfolio beyond robust interoperability, to the higher level of hardening, resiliency and availability required for our customers’ most sensitive business-critical applications. By reproducing real-world scenarios, and by generally doing things that would never happen in the real world, MCC drives the approach to 7-9’s availability for all of EMC’s EMSD and DPAD product families.