In rapid-fire succession on 8 July 2015, United Airlines, the New York Stock Exchange (NYSE), and the Wall Street Journal (WSJ) website experienced extremely high-profile outages.
The culprits? For United Airlines and the NYSE, it was the network (a router and gateways, respectively); for the WSJ, it was insufficient capacity for unanticipated demand combined with failing server connections (a 504 Gateway Timeout error).
Key factors in these events were network devices, outages, downtime, service impact, configurations, and capacity – the realm of monitoring and managing the IT infrastructure used to deliver applications and services. Yet most of what I’ve read has sensationalized a nonexistent cybersecurity angle. So here are some key IT operations management insights to take away from incidents such as these:
Any assumption that could impact ongoing business and IT operations should be considered wrong. Felix Unger said it best: Never assume.
The United Airlines router outage impacted applications and services so severely that it ultimately grounded all planes. An IT operator could easily have assumed in such a scenario that a faulty device would generate a relevant alert. But what if the bad router effectively imploded before it had a chance to deliver its “suicide note”? Then the operator has to know to look for events that should have been generated but weren’t. In my experience, the out-of-the-box thinking needed to immediately deduce this type of problem is extremely rare – especially during a major outage, when there is significant pressure to solve the problem ASAP and everyone presumes all the relevant data is there, just not yet properly analyzed into something insightful.
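One way to catch a device that dies before sending its “suicide note” is a dead-man’s-switch check: alert on the *absence* of an expected heartbeat rather than waiting for a failure event. A minimal sketch of the idea (device names, timestamps, and the silence threshold are all hypothetical):

```python
import time

def find_silent_devices(last_seen, now, max_silence=300):
    """Return devices whose most recent heartbeat is older than max_silence.

    last_seen maps device name -> UNIX timestamp of its last event.
    A device that imploded without ever raising an alert still shows up
    here, because the check fires on missing events, not error events.
    """
    return sorted(dev for dev, ts in last_seen.items()
                  if now - ts > max_silence)

# Usage: core-router-1 went quiet 20 minutes ago; edge-sw-2 is healthy.
now = time.time()
last_seen = {"core-router-1": now - 1200, "edge-sw-2": now - 60}
print(find_silent_devices(last_seen, now))  # ['core-router-1']
```

In practice a monitoring system would run a check like this on a schedule against its event store, but the core logic – compare last-seen times against an expected reporting interval – is the same.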
In the case of the NYSE, customer gateways – unlike the NYSE gateways on which the software was tested – did not initially have their configurations properly updated and loaded to support a new software release. The assumption was that once the update was completed, all three “moving parts” (NYSE gateways, customer gateways, and the new software) would behave properly when brought together. That wasn’t the case; a misconfiguration was the likely cause. And I doubt WSJ IT operations showed up for work that day anticipating so much traffic that the website’s home page would crash under a combination of server connectivity issues and demand exceeding supply – a capacity management failure.
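A drift check against the tested baseline could have flagged the NYSE mismatch before the “moving parts” were brought together: compare every gateway’s loaded configuration to the exact settings the release was validated on. A hypothetical sketch (gateway names and configuration keys are illustrative, not the NYSE’s actual settings):

```python
# Settings the new software release was tested against (illustrative).
EXPECTED = {"sw_release": "7.2", "session_timeout": 30, "proto": "v2"}

def config_drift(gateways):
    """Map gateway name -> {setting: (expected, actual)} for mismatches."""
    drift = {}
    for name, cfg in gateways.items():
        bad = {k: (v, cfg.get(k)) for k, v in EXPECTED.items()
               if cfg.get(k) != v}
        if bad:
            drift[name] = bad
    return drift

gateways = {
    "nyse-gw-1": {"sw_release": "7.2", "session_timeout": 30, "proto": "v2"},
    "cust-gw-9": {"sw_release": "7.1", "session_timeout": 30, "proto": "v2"},
}
print(config_drift(gateways))
# {'cust-gw-9': {'sw_release': ('7.2', '7.1')}}
```

Run as a pre-deployment gate, any nonempty result blocks the rollout until the lagging gateways are updated.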
Operations monitoring and network management matter more than ever. IT infrastructure – especially the network – has become the Rodney Dangerfield of IT; there’s seemingly no respect for its critical enabling role in today’s IT-interdependent business.
Without high levels of network availability and performance, things fall apart quickly. What good is a cloud or web-enabled thin app when the network’s down or super sluggish? Think about your everyday work tasks, their related applications and services, and how productive you’d be if you weren’t network-connected. For most (including yours truly), the answer is not very.
According to IT research firm Gartner, 90 percent of its clients have immature infrastructure and operations environments, with an average IT maturity model score of 2.33 on a scale of 1 to 5. Infrastructure and operations – what should be the bedrock of IT – is instead, in many cases, more of a risk and liability to the business than most care to acknowledge.
When IT operations lacks the process maturity and automated tools needed to deliver promised service levels for availability and performance – and to quickly identify and remediate issues when they do occur – then prominent, business-impacting problems, like these three outages, will continue. And the increasing complexity of the IT service delivery infrastructure (multiple dynamic virtual layers on top of a physical layer, either of which could be on site or in external clouds) makes this bad situation even worse.
What could IT operations have done differently? IT operations teams benefit from trusted, automated operations management capabilities and insights; in these cases, specifically:
- Immediate access to accurate, up-to-date delivery infrastructure topology and relationships (so all derived analysis is in turn accurate and up-to-date)
- Alignment of applications and services to that delivery infrastructure (so business impact is readily identified and understood by operations, and easily communicated to the business)
- Ability of monitoring software to not only determine the root cause of a problem, but also to identify (and separate) related symptomatic issues (that is, those dependent on the root-cause problem), as well as identify the business processes, applications, and services impacted by it
- “Sandbox” testing of the impact of updated network configurations with existing or updated software, as well as the ability to roll back configurations, across thousands of devices and within minutes, to the previous (pre-update) version
- Short-term and long-term performance and capacity monitoring alerts (especially large deviations from historical norms), as well as a tested and proven elastic hybrid cloud deployment for immediate capacity needs (by adding resources from the cloud before ongoing business operations are impacted)
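The last capability – alerting on large deviations from historical norms – can be as simple as a z-score test against a baseline window. A minimal sketch (the metric, baseline values, and threshold are hypothetical):

```python
import statistics

def is_anomalous(history, current, z_threshold=3.0):
    """True if `current` is more than z_threshold std devs from the
    historical mean - the kind of early warning that could trigger
    adding elastic cloud capacity before users are impacted."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold

baseline = [100, 105, 98, 102, 101, 99, 103, 97]  # e.g., requests/sec
print(is_anomalous(baseline, 104))   # False: within historical norms
print(is_anomalous(baseline, 400))   # True: demand spike; scale out
```

Production monitoring tools use more robust statistics (seasonal baselines, percentiles), but the principle – compare current load against what history says is normal, and act before capacity is exhausted – is the same.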
As these three outages showed, the criticality of the network, along with the increasing scale and complexity of IT infrastructures, requires proven and trusted automation to ensure required levels of IT availability and performance. No single management tool can ever be a silver bullet for IT operations. However, there is a need for a critical, top-level, go-to operations monitoring and management system that delivers the insights required to ensure the availability and performance of the applications and services delivered over the infrastructure that IT operations oversees. On that front, I think you’ll find that the EMC Service Assurance Suite can’t be beat.