Window To Our Private Cloud: Automating Operations Management for Our Cloud Infrastructure

Today’s dynamic cloud infrastructure requires a modern approach to IT operations management. Traditional tools and processes designed for static physical environments overwhelm you with monitoring data, alerts and open questions. They’re not designed for highly scalable, dynamic and virtualized infrastructure.

As a result, end users are often first to report performance problems, putting more pressure on IT because of a fundamental lack of visibility into the true health and efficiency of both infrastructure and applications running in today’s “cloud” datacenters.

And as these datacenters become more and more virtualized, traditional IT operations management and automation teams face new challenges in effectively managing and protecting the IT infrastructure they supervise.

EMC IT’s Enterprise Management and Automation Services (EMAS) team is implementing the latest EMC and VMware automation tools to not only meet the complex needs of today’s cloud environment but also support EMC IT’s shift to an IT-as-a-Service (ITaaS) delivery model.

Read our recent white paper on the subject: Simplifying and Automating Management Across Virtualized/Cloud-Based Infrastructures

Thinking services, not silos

At the heart of our EMAS efforts is the same structural shift that is central to ITaaS: a move from siloed IT operations to a cross-functional approach based on the IT services we offer to our internal business units. The idea is to create ‘service views’ encompassing the full IT stack used to deliver a particular service offering (servers, storage, databases, security, etc.) and structure our EMAS alerts around that view.

Under the traditional monitoring and control approach, EMAS alerts are as siloed as the physical IT infrastructure itself. That means network personnel see network alerts telling them something is broken or about to break. Security people see security alerts. Storage people see storage alerts and so on.

With this new focus, security and control alerts affecting a particular service would be disseminated across that service, spanning the old silos to cross-functionally share information that might impact a service offering. This new EMAS approach seeks to inform end users about potential issues that may impact the specific service they are using, regardless of whether it’s a security incident or a server glitch.

One of the first examples of this new approach is the way we now monitor and support Messaging, the service that includes email and other capabilities within Microsoft Exchange. For Exchange, one of our more widely used applications is an EMC product called Business Impact Manager. In it, we created a view that our Global IT Command Center uses and helped us build. The view breaks the service infrastructure down into its parts (the application, which is Exchange in this case; database; servers; storage; and network) and then rolls everything up into a single view of the service called Messaging. If any one part fails in a way that impacts Messaging users, we can see the overall impact in one place. We will be applying this service view structure to other mission-critical IT applications as well.
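To make the roll-up idea concrete, here is a minimal sketch in Python. It is not how Business Impact Manager works internally. The class names, component names and the "worst component wins" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Health(IntEnum):
    """Illustrative health levels; higher means worse."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Component:
    name: str    # hypothetical component name
    layer: str   # "application", "database", "server", "storage" or "network"
    health: Health = Health.OK


@dataclass
class ServiceView:
    name: str
    components: List[Component]

    def health(self) -> Health:
        # Assumed roll-up rule: the service is as healthy as its worst component.
        return max((c.health for c in self.components), default=Health.OK)

    def degraded(self) -> List[Component]:
        # Components currently contributing to a non-OK service state.
        return [c for c in self.components if c.health != Health.OK]


# Hypothetical Messaging stack, one component per layer.
messaging = ServiceView("Messaging", [
    Component("exchange", "application"),
    Component("mail-db", "database"),
    Component("mail-server-01", "server"),
    Component("array-a", "storage", Health.WARNING),
    Component("core-switch", "network"),
])

print(messaging.name, messaging.health().name)
print([c.name for c in messaging.degraded()])
```

In this sketch a single storage warning is enough to mark Messaging as degraded, which is the point of the service view: a problem in any layer surfaces in one place instead of staying inside the storage team's silo.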

Beyond keeping end users informed via alerts, this approach also provides the command center with a cross-domain view of problems and incidents. By knowing what services sit on what servers, security personnel can more quickly understand the impacts a problem could have on end users and address it more proactively and efficiently. Therefore we are striving to build enterprise management and automation capabilities into each service offering.

To increase our efficiency in monitoring and managing operations across our service portfolio, EMAS is re-examining our standards for our monitoring and alert operations. One of the biggest challenges we face is determining what to monitor and when to send out alerts. We are currently working with technology owners to determine whether current alert practices are valid. For instance, we are asking technology owners what types of problems are critical enough to make them leave a meeting. What kinds of issues do they want to be notified about? With their feedback, we will standardize and formalize alert standards and continue to mature our event management process.

“Learning” to spot warning signs

EMAS is also deploying new tools to optimize our performance management of EMC’s cloud environment and to provide predictive data to head off problems before they occur.

VMware vCenter Operations Management allows us to integrate the results of monitoring all infrastructure components (storage, compute, datacenter and network performance) into a comprehensive view of our infrastructure performance. It then tracks patterns, establishing “learned behavior” against which we can identify abnormalities that might signal impending problems. This gives us early warning of failures so we can head them off. For instance, a system may be running too hot, which indicates a possible hardware component failure, or capacity on a given system may be filling up quickly.
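The two warning signs mentioned above can be sketched in a few lines of Python. This is only an illustration of the general idea of comparing new readings against a learned baseline. It does not represent the analytics inside vCenter Operations Management, and the function names, the three-standard-deviation threshold and the sample readings are assumptions.

```python
from statistics import mean, stdev


def is_abnormal(history, latest, threshold=3.0):
    """Flag a reading that strays far from the baseline learned from history.

    Assumption: "far" means more than `threshold` standard deviations.
    """
    if len(history) < 2:
        return False  # not enough data to learn a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != baseline
    return abs(latest - baseline) / spread > threshold


def days_until_full(used_by_day, capacity):
    """Project days until capacity is exhausted, assuming linear growth."""
    if len(used_by_day) < 2:
        return None
    growth_per_day = (used_by_day[-1] - used_by_day[0]) / (len(used_by_day) - 1)
    if growth_per_day <= 0:
        return None  # usage is flat or shrinking
    return (capacity - used_by_day[-1]) / growth_per_day


# Hypothetical temperature readings and daily storage usage samples.
temperatures = [60, 61, 59, 60, 62, 60, 61]
print(is_abnormal(temperatures, 75))
print(days_until_full([100, 110, 120, 130], capacity=200))
```

A real system would learn baselines per metric and per time of day, but even this simple version shows how an unusually hot reading or a fast-filling volume can be flagged before it becomes an outage.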

VMware vCenter Operations Management gives us a single pane of glass – “Window to our Cloud” – through which we monitor our IT environment from top to bottom. Having that kind of view means we can meet the service levels users expect from a service – by proactively informing our IT colleagues when things are about to break or are broken so that they can be restored to health more quickly.

Automating for faster service

Increasing automation of processes to improve IT performance is another important focus for EMAS. We recently put the right tools and technology in place, and we are now starting to look at automation options in areas that still involve a lot of manual effort.

A key piece of that is connecting ITaaS to more automated infrastructure in what we call the back-of-the-house operations. So a given IT service that users consume is delivered efficiently by automating the management of the infrastructure that’s behind it. That’s where we see a lot of automation potential.

For example, customers can use Infrastructure-as-a-Service to request a virtual machine environment for testing or other proof-of-concept (POC) work. Previously, it took many manual steps and eight days from the time a virtual machine was requested until it was ready to use. Now, using automation software we recently deployed, we’ve got it down to one day.

Such efficiencies are part of a broader, service-oriented approach to monitoring and managing IT operations to ensure that we meet the needs of our end users in the most efficient way possible.

About the Author: Mike Leach