If you are a data center manager or administrator, chances are good that your IT organization runs a Network Operations Center (NOC) that keeps a finger firmly on the pulse of the networking elements that form the lifeline of any computing environment. Or perhaps your organization has moved to a more geographically distributed approach with a Global Operations Center (GOC). While effective to varying degrees, these operations approaches have their limitations when applied to transformative technologies such as virtualization and cloud computing.
Enterprises, service providers, and other organizations are all moving to virtual data centers, or cloud architectures, including the new software-defined data center, to obtain the well-documented benefits of agility, efficiency, and cost control. But the move to these new architectures also challenges conventional management tools and processes for assuring the effective operations of the data center. The need to improve data center operations is leading to the new concept of the cloud operations center (CLOC).
Operations by Degree
Virtualization and cloud models are characterized by new architectures in which applications run on top of pooled compute and networking resources, and increasingly pooled storage as well (i.e., converged infrastructure). In these dynamic virtual and cloud environments, the operations teams chartered with meeting service-level objectives can no longer see the entire infrastructure. Their existing management tools and processes are not sufficient to monitor, manage, and troubleshoot aggregated resources. When something goes wrong, the inability to pinpoint exactly which resources an application is using leaves the operations team exposed.
How did the operations center get to this point? How do the different operations center models address these challenges?
Let’s examine the evolution of management tools and approaches in the context of the different operations models, along with the relative degree of success each approach brings to these new challenges.
Network Operations Center (NOC)
Network Operations Centers monitor and control complex networks from one or more locations, and have been around since the early 1970s. In the past, network faults were usually the source of infrastructure issues, and for many, the NOC continues to be the place to go when issues occur. However, complex infrastructure issues can also arise outside the network, such as in the server and storage layers, where they are difficult to pinpoint.
In self-defense, some NOC teams have added tools to enable them to examine other infrastructure components, such as servers, databases, storage, and applications. When issues arise, they can now confirm that the problem is not in the network, and provide assistance in identifying the true nature of the problem. While a laudable effort, this approach does not eliminate the battle that often ensues with problem identification and resolution, where each domain owner uses their own tools to proclaim their innocence. This approach makes problem identification and resolution difficult and is certainly no match for the new architectures.
Global Operations Center (GOC)
Some organizations have now moved to the Global Operations Center. This approach combines the NOC, data center, applications, and virtualization operations teams to keep a finger on the pulse of performance, from the applications down to the supporting infrastructure, across all domains. It brings together the existing domain-focused tools to provide an integrated view of the IT environment.
Unfortunately, integrating existing tools typically happens at the event level, with events captured and sent to a single console. The result is an overwhelming number of events, most of which are symptoms of the real problem. Without a way to model and visualize the end-to-end topology, separating these symptoms from the real root cause is difficult.
Further, incorporating real-time performance information to identify a performance degradation (or a service risk condition) prior to an actual service outage has proven to be a challenge. For most operations personnel, these performance indicators just add to the overwhelming number of events to monitor, adding more symptoms to track that make it difficult to identify a true root cause or risk to a service level. While better than the traditional NOC, the typical GOC implementation is still no match for the new IT architectures.
Cloud Operations Center (CLOC)
The challenges of the network and global operations models have led to the new concept of a Cloud Operations Center. An effective CLOC team can meet or exceed service-level agreements for new IT architectures by identifying the root cause of a risk to service, allowing issues to be resolved before they affect the business. This preemptive problem identification and remediation is achieved by applying five key tenets to data center operations:
- Good visibility: Data center operations in the transformed data center require full visibility of the end-to-end topology of the new IT infrastructure, across all domains (compute, network, storage, applications).
- Sound relationships: A good understanding of the tight linkage of the physical and virtual environments is required to always know which applications are running on which infrastructure.
- Current information: Continuous, cross-domain discovery is necessary to stay updated in the ever-changing new environments.
- Integrated management: Integrated availability, performance, and configuration management is required to understand performance degradations and risk conditions, as well as the impact of any changes.
- Modeling capabilities: Employing a dynamic, model-based approach with understanding of relationships and dependencies facilitates separating symptoms from the root causes of issues.
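To make the last tenet concrete, here is a minimal Python sketch of how a dependency model might separate symptoms from a likely root cause. It is an illustration only, not any vendor's implementation: the `TopologyModel` class, its method names, and the component labels (`host:esx-07`, `vm:42`, and so on) are all hypothetical, and it assumes a simple rule that an alarming component is a root-cause candidate only if nothing it depends on is also alarming.

```python
class TopologyModel:
    """Hypothetical cross-domain dependency model (illustrative only)."""

    def __init__(self):
        # Maps each component to the set of components it depends on directly.
        self._depends_on = {}

    def add_dependency(self, component, depends_on):
        """Record that `component` runs on, or relies on, `depends_on`."""
        self._depends_on.setdefault(component, set()).add(depends_on)
        self._depends_on.setdefault(depends_on, set())

    def upstream(self, component):
        """Everything `component` depends on, directly or transitively."""
        seen = set()
        stack = [component]
        while stack:
            current = stack.pop()
            for dep in self._depends_on.get(current, ()):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

    def impacted_by(self, component):
        """Everything that depends on `component`, directly or transitively."""
        return {c for c in self._depends_on if component in self.upstream(c)}

    def root_causes(self, alarming):
        """Alarming components with no alarming dependency beneath them.

        Assumed rule: if something a component depends on is also alarming,
        the component's own alarm is treated as a symptom.
        """
        alarming = set(alarming)
        return {c for c in alarming if not (self.upstream(c) & alarming)}


# Assumed example topology: two applications sharing one physical host.
model = TopologyModel()
model.add_dependency("app:billing", "vm:42")
model.add_dependency("app:crm", "vm:43")
model.add_dependency("vm:42", "host:esx-07")
model.add_dependency("vm:43", "host:esx-07")
model.add_dependency("vm:42", "lun:17")
model.add_dependency("host:esx-07", "switch:tor-3")

# Five events arrive at the console; four of them are symptoms.
events = {"app:billing", "app:crm", "vm:42", "vm:43", "host:esx-07"}

print(sorted(model.root_causes(events)))          # ['host:esx-07']
print(sorted(model.impacted_by("host:esx-07")))   # the affected VMs and apps
```

In this sketch, five console events collapse to a single candidate cause, and the same model answers the impact question (which applications run on the failing host). A real deployment would also need continuous discovery to keep the dependency data current, as the third tenet notes.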
With management technology that delivers against these five tenets, CLOC teams have an opportunity to achieve much higher service levels than old architectures and tools allowed. They can leverage the flexibility of virtualized infrastructures, dynamically moving applications and virtual machines to available resources, while maintaining full visibility and control across all resources and all domains.
Further, improved management technology allows CLOC teams to improve processes, become more efficient, and achieve better results. The greatest challenge in resolving IT issues is identifying the cause. Rather than continuing with the war room approach characteristic of the NOC battlefield, in which different teams compare separate information from disparate tools, the CLOC team is better informed about risk conditions, their likely cause, and their impact. The right people are then able to address the most important issues quickly and efficiently.
Better Data Center Operations Management
As you evaluate the evolution of your data center operations, consider the move to a Cloud Operations Center approach. Ask your IT management tool providers how they address these new IT architectures, and probe against the five tenets above. Improved service levels, greater efficiency, and fewer sleepless nights could be in your future.