Service Outage Hits Home for Cloud Provider

Are there blind spots in your service assurance approach?

Netflix, a provider of online streaming media, made news over the holidays when customers experienced a service outage on Christmas Eve. Imagine unwrapping your new mobile device and deciding to try it out by streaming a movie. If you were located in North America, you probably found that the Netflix movie streaming service was down.

This outage was caused by issues within Amazon Web Services, which Netflix employs to support movie streaming. Initially, the Amazon support team pursued API errors before learning that the root cause of the outage was actually a configuration issue caused by human error. This misstep ultimately delayed the restoration of service to Netflix customers. Over the course of that day, the configuration error first manifested itself as performance degradation and then cascaded to a full service outage for many customers. A more system-wide approach to service assurance might have helped avoid a situation like this one.

Service Outages

Although outages in your IT environment might not receive attention in the press, they can still have a significant impact on your customers and, in turn, on your business. Most IT organizations are vulnerable to this kind of service disruption. Had the configuration error been detected and remediated immediately in the situation described above, a few Netflix customers might have noticed some degradation in performance, but it is far less likely that a large number of customers would have experienced an outage.

Through 2015, 80% of the outages impacting mission-critical services are expected to be caused by people and process issues, according to Gartner (Top Seven Considerations for Configuration Management for Virtual and Cloud Infrastructure, October 2010). More than half of these outages will be attributed to change, configuration, and other related issues.  Additionally, many of these outages will be further exacerbated by a dependence on dynamic virtual and cloud technologies. 

Only 22% of organizations surveyed by Gartner, however, have deployed the full complement of fault, performance, and configuration management capabilities necessary to provide a solid foundation for robust monitoring. Over half of these organizations (51.3%) have their network fault management bases covered, but performance and configuration capabilities are expected to lag through 2017, according to a recent Gartner research report (I&O Teams Must Proactively Develop Three Core Network Management Disciplines, December 2012).

Service Assurance

Given that a large proportion of service disruptions originate from configuration- and change-related anomalies, savvy organizations will proactively extend their management capabilities to unify configuration, fault, and performance management. For these organizations, including the ones with virtual and cloud infrastructure, key considerations for effective and efficient service assurance management include:

  1. Integration with root cause analysis:  Infrastructure information needs to be integrated with a central root cause analysis engine, and not delivered in separate silos.  Without integration, the task of interpreting that additional data falls on IT staff.   In the Amazon case, robust root-cause analysis capabilities could have made it possible to identify risk conditions and performance degradations before Netflix customers were impacted.
  2. Domain coverage:  All data center domains (compute, networking, and storage) need to be covered.  Not doing so results in blind spots in the total service availability picture, like those that Amazon, as a service provider, may have experienced. Being able to robustly monitor a service from the user to the application, server, and underlying network and storage is critical to meeting service levels in physical as well as virtual and cloud deployments.
  3. Unified presentation layer:  A presentation layer that consolidates, analyzes, and presents all data in a concise manner focuses staff on the most relevant and actionable information.  This presentation layer needs to be real-time and customizable, and it needs to provide the option to drill down for more detailed information. It should also span multiple data centers across geographies.
  4. Detailed discovery:  A thorough and automated discovery of domains, dependencies, and relationships must be implemented and run on an ongoing basis.  This data is part of the foundation needed for effective and efficient root cause analysis.
  5. Virtual environments:  Virtual and cloud-based technologies add a new dimension of complexity because of their highly dynamic nature.  To address this risk, a complete virtual and physical perspective is needed.  Real-time updates on virtual machine movements and activity, together with an understanding of interdependencies within the physical and the virtual environment, support better-informed decisions for maintaining service levels.
  6. Impact identification:  Efficiency also depends on focusing on those issues that have the most impact on business-critical services.  Being able to quickly identify and act on those issues means that you can direct the right people to the right problems at the right time, and align IT actions with business priorities.
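
To make the discovery, root cause, and impact points above more concrete, here is a minimal sketch in Python. It does not describe how EMC Smarts or any other product works. The topology, the component names, and the business priority weights are all hypothetical, and the rule used (an alarming component is a probable root cause if nothing it depends on is also alarming) is a deliberate simplification.

```python
# Hypothetical discovered topology: component -> components it depends on.
DEPENDS_ON = {
    "streaming": ["app-server"],
    "billing": ["app-server", "db"],
    "app-server": ["load-balancer", "storage"],
    "db": ["storage"],
    "load-balancer": [],
    "storage": [],
}

# Hypothetical business weights for customer-facing or core services.
BUSINESS_PRIORITY = {"streaming": 10, "billing": 7}


def transitive_dependencies(node, graph):
    """Return every component that `node` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def probable_root_causes(alarming, graph):
    """Alarming components whose alarm is not explained by an alarming dependency."""
    alarming = set(alarming)
    return sorted(
        node for node in alarming
        if not (transitive_dependencies(node, graph) & alarming)
    )


def impacted_services(component, graph, priorities):
    """Business services that rely on `component`, highest priority first."""
    hit = [
        service for service in priorities
        if service == component
        or component in transitive_dependencies(service, graph)
    ]
    return sorted(hit, key=lambda service: priorities[service], reverse=True)


if __name__ == "__main__":
    alarms = {"streaming", "app-server", "load-balancer"}
    for cause in probable_root_causes(alarms, DEPENDS_ON):
        print(cause, "->", impacted_services(cause, DEPENDS_ON, BUSINESS_PRIORITY))
```

In this toy data, three alarms collapse into one probable cause, and the services that rely on it are listed in priority order. That is the idea behind directing the right people to the right problem first.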

Getting a complete service assurance picture means having an integrated view of availability, performance, and configuration data.  In the stressful environment caused by service outages where problems seem to come from all directions, the ability to automatically calculate business impact is crucial to IT operations making the right decisions for the business. Accurate and real-time configuration insights facilitate in-context remediation of availability and performance issues, and can prevent potentially serious outages before they even occur. 
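
As a rough illustration of how configuration insight can put a performance problem in context, the sketch below compares a running configuration against an approved baseline and lists the changes made shortly before an incident. The setting names, values, timestamps, and the one-hour window are hypothetical assumptions, not details of the Netflix outage or of any product.

```python
from datetime import datetime, timedelta

MISSING = "<missing>"  # placeholder for a setting absent on one side


def config_drift(baseline, running):
    """Return {setting: (expected, actual)} for every setting that differs."""
    drift = {}
    for key in baseline.keys() | running.keys():
        expected = baseline.get(key, MISSING)
        actual = running.get(key, MISSING)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


def changes_before(changes, incident_time, window=timedelta(hours=1)):
    """Return the change records made within `window` before the incident."""
    return [
        change for change in changes
        if incident_time - window <= change["time"] <= incident_time
    ]


if __name__ == "__main__":
    # Hypothetical approved baseline and currently running configuration.
    baseline = {"lb.max_connections": 10000, "lb.health_check": "on"}
    running = {"lb.max_connections": 100, "lb.health_check": "on", "lb.debug": True}
    print(config_drift(baseline, running))

    # Hypothetical change log and the time a degradation was first observed.
    change_log = [
        {"time": datetime(2024, 1, 1, 9, 0), "item": "dns.ttl"},
        {"time": datetime(2024, 1, 1, 12, 0), "item": "lb.max_connections"},
    ]
    degraded_at = datetime(2024, 1, 1, 12, 30)
    print(changes_before(change_log, degraded_at))
```

Putting drift and recent changes next to a performance alarm is one simple way to point staff at a likely configuration cause before degradation becomes an outage.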

Meet Service Levels Today

CIOs are being asked to decrease the portion of their annual budgets devoted to the basics of running a data center and to invest more in business innovation. Yet they also must consistently maintain high service levels, which makes eliminating blind spots in service assurance more critical than ever.

Solutions like EMC Smarts provide service assurance for critical applications through automated root-cause and business-impact analysis that encompasses fault, performance, and configuration for compute, network, and storage. These critical applications and services vary by organization and may include core processes like accounts receivable and billing, or, as in the case of Netflix, a customer-facing service (on-demand streaming video).

Though the product name came into being in a different era, Smarts has evolved to span physical, virtual, and cloud deployments. The Smarts name rightly evokes the idea that higher intelligence is needed to work across these different environments. It is an intelligence for service assurance that many cloud and service providers, as well as enterprises, probably want and need.

About the Author: Mark Prahl