The Stability Myth and the Resiliency Mindset

If I could go back in time, I’d persuade early technologists and IT operators to focus their attention on resiliency and recovery, not stabilization. While the pace of technological change and the demands of our IT consumers have only continued to accelerate, the human aspects of IT, or the practices and processes, have evolved at a much slower rate. Sure, we have introduced automation and self-service catalogs for some incremental gains, but how we onboard new technology, patch and upgrade existing technology, and service demand in general has not truly evolved much since the release of ITIL v2. Overcoming the inertia of a stability mindset and embracing a resiliency mindset is the modern-day IT challenge.

What Is a “Resiliency Mindset” and How Is It Different?

To answer that question, we must first define a stability mindset. According to Merriam-Webster, stability is “the property of a body (in this case an IT technology or service) to develop forces or moments that restore the original condition when disturbed from equilibrium or steady motion.” In simpler terms, stability is the quality of being unchanged and enduring. A stability mindset prioritizes the characteristic “enduring.” Evidence of this mindset appears on nearly all IT dashboards in the form of metrics like issue wait time, mean-time-to-failure, and uptime, and it is central to defining service level objectives (SLOs) and agreements (SLAs). Over time, this need to endure, or better phrased, “to provide durable and steady systems, services, and applications,” inadvertently stifled change. Moreover, it created the illusion that IT can actually moderate the pace of change. Rather than embracing change, IT’s unwavering and somewhat paradoxical commitment to stability has earned it the title “Department of No.”

Source: Tech Republic. Capital One has set out to enable metrics for the DevOps pipeline via their open source project Hygieia.

IT Leaders Must Deliver and Innovate in a Constantly Changing Ecosystem

I think we can all agree that it is unlikely that the pace of change or the fickleness of IT consumers will ever return to a pre-iTunes era where technology was locked in a closet and only IT held the key. In fact, we are seeing just the opposite from an architectural, organizational, and behavioral perspective. Architecturally, technology is being pushed out of the data center as more applications are being deployed at the edge or on a device held in a user’s pocket. Organizationally, executive boards are repeatedly pushing IT teams to develop autonomous platforms that unlock technology in order to better enable enterprise innovation. And behaviorally, we are seeing disenfranchised product teams employ “Shadow IT” to get reliable services efficiently from the cloud or other providers. It is no surprise that researchers from OpsRamp uncovered in their “Report on Modern IT Operations” that a critical business expectation for IT leaders is the ability “to deliver new revenue streams, drive faster time-to-market for products and services, and ensure better digital user engagement.”

Yet despite these indicators and trends, few of our IT processes and practices have evolved from a lock-and-key mentality based on overly complicated ITSM processes designed to inhibit (or slow) change. Even with investments in automation, most enterprise IT systems are ineffective at efficiently managing a constantly changing ecosystem of tools and technologies. This is evidenced by continued increases in both IT spend, despite significant efforts to flatline growth and reduce costs, and IT backlogs of work items and projects, despite larger teams, smarter tools, and cloud solutions. Rather than continuing to drive a stability agenda and losing relevance in the enterprise, now is the time for IT to shift towards a resiliency mindset. Now is the time to modernize processes and practices and reclaim your ‘seat at the table’.

Example: Contrasting Stability Mindset with Resiliency Mindset

Unlike stability-based thinking, which actively works to preserve the status quo and minimize change, a resiliency mindset is focused on recovering from or easily adjusting to change. It is all about adaptability and elasticity. This somewhat subtle shift fundamentally alters the approach and planning of IT. Rather than attempting to moderate or prevent change and failure, a resiliency-based approach accepts that change and failure will happen. As such, a resiliency mindset focuses on designing and building fault-tolerant systems, applications, and processes. While rooted in IT operations, this mindset permeates the enterprise, influencing product development, compliance and security, and of course, platform development.

To better explain the difference, let’s look at a simple example: a JVM heap overrun issue that causes an out-of-memory error and requires a restart.
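As one hedged illustration of the resiliency side of this contrast, the sketch below treats an out-of-memory failure as an expected event and recovers from it automatically instead of waiting for a manual restart. It is written in Python rather than Java, and uses Python's built-in MemoryError as a stand-in for the JVM out-of-memory error. The names run_with_recovery, flaky_task, max_restarts, and backoff_seconds are hypothetical and do not come from any particular product; in practice a process supervisor or container orchestrator would typically perform the restart.

```python
import time


def run_with_recovery(task, max_restarts=3, backoff_seconds=0.0):
    """Run task(); on MemoryError, restart it up to max_restarts times.

    Returns (result, restarts) so the number of automatic recoveries
    can be tracked as an operational metric.
    """
    restarts = 0
    while True:
        try:
            return task(), restarts
        except MemoryError:
            if restarts >= max_restarts:
                raise  # recovery budget exhausted: escalate to a human
            restarts += 1
            # Simple linear backoff before the next restart (assumption).
            time.sleep(backoff_seconds * restarts)


# Simulated workload (hypothetical): fails twice, then succeeds.
attempts = {"count": 0}


def flaky_task():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise MemoryError("simulated heap exhaustion")
    return "ok"


result, restarts = run_with_recovery(flaky_task)
print(result, restarts)  # prints: ok 2
```

A stability-oriented response to the same failure would concentrate on preventing the error from ever occurring. The sketch assumes it will occur, limits how often it retries, and records how many recoveries were needed.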

Changing the Metrics for Success

Having a resiliency mindset is more than just applying built-for-change principles and leveraging design patterns like the example above; it also requires a change in how you measure and define success. This is most apparent in the core performance metrics of a resiliency-based system.

Key metrics for a resiliency mindset:

Mean-Time-to-Recovery (MTTR) is the average time it takes you to respond and restore service during outages and issues. This is perhaps the most important metric as it directly correlates to customer satisfaction. Customers/consumers expect services to be up. You will never see an email from a user that says, “Thanks for keeping my underwriting application running all year!” What you will see is hundreds of emails, calls, and tickets appear when service is interrupted. The faster you can restore service, the smaller the impact of any outage.

Change failure rate measures how successfully a change, like a new application feature, security patch, or software upgrade, is deployed (in a single attempt or multiple attempts). High change failure rates are major productivity impediments and are typically representative of inadequate and/or undisciplined development and testing practices across the application stack.

Release frequency measures how often changes are introduced into a Production environment and is a direct measure of agility and adaptability. It measures your ability to respond to demands from customers, users, regulators, executives, etc. If your release frequency is monthly, it can take you up to a month to package, test, deploy, and operationalize a security patch for a newly discovered vulnerability. Sure, you can “fast-track” the patch, but you are then slowing all other changes in your pipeline. In addition, “fast-tracking” typically leads to production issues and conflicts resulting from inadequate testing, driving up your change failure rate. Also, “fast-tracking” is typically documented and communicated poorly, resulting in longer MTTR cycles caused by the newly introduced technical debt.
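The three metrics above come down to simple arithmetic over incident and deployment records. The following is a minimal sketch in Python, assuming you can export outage start/restore timestamps and a pass/fail flag per deployment. The Incident and Deployment record types, their field names, and the sample data are hypothetical, and change failure rate is computed here as the share of deployments flagged as failed, which is one common reading of the metric.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when service was interrupted
    restored: datetime  # when service was restored


@dataclass
class Deployment:
    deployed: datetime
    failed: bool  # True if the change failed or had to be rolled back


def mean_time_to_recovery(incidents):
    """Average outage duration, or None if there were no incidents."""
    if not incidents:
        return None
    total = sum((i.restored - i.started for i in incidents), timedelta())
    return total / len(incidents)


def change_failure_rate(deployments):
    """Fraction of deployments that failed (0.0 to 1.0), or None if empty."""
    if not deployments:
        return None
    return sum(d.failed for d in deployments) / len(deployments)


def release_frequency(deployments, period_days):
    """Average number of production releases per week over the period."""
    return len(deployments) / (period_days / 7)


# Hypothetical sample data covering a 28-day period.
day = datetime(2024, 1, 1)
incidents = [
    Incident(day, day + timedelta(minutes=30)),
    Incident(day + timedelta(days=10), day + timedelta(days=10, minutes=90)),
]
deployments = [
    Deployment(day + timedelta(days=7 * n), failed=(n == 2)) for n in range(4)
]

print(mean_time_to_recovery(incidents))    # prints: 1:00:00
print(change_failure_rate(deployments))    # prints: 0.25
print(release_frequency(deployments, 28))  # prints: 1.0
```

Tracking these values over time, rather than as one-off snapshots, is what shows whether investments in the platform are actually shortening recovery and reducing failed changes.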

A Word on High Performing Resiliency-based Companies

When we look at top performers across industries (and I am not specifically referring to FANG[1]), most companies started this journey years ago and have continually invested in and improved their platforms, processes, architectures, and services.

These IT leaders have stepped outside of the proverbial burning IT house and built a new house, using modern, resilient design principles, sound engineering practices, and lean processes. They are systematically and intelligently migrating workloads and applications to this new house and paying down technical debt in the process. And, they are using operational metrics to drive continued investment and improvement in the platform and services provided by IT.

This ongoing commitment to Kaizen practices and technical debt reduction is a leadership decision. Reducing and eliminating the toil associated with maintaining stability across a fragile and brittle portfolio is a leadership decision. Embracing a resiliency mindset is a leadership decision.

Summary: Transitioning to a Resiliency-based System is No Easy Endeavor

The move to a resiliency-based system is a cultural paradigm shift that is dependent on modern, cloud-native platform architectures and robust automation tooling, but it is only actualized through new processes, practices, skills, and metrics. Most enterprises organize these efforts around tooling and technology. While this approach will typically deliver some short-term wins, it rarely delivers sustainable change because it lacks operational context and true integration in the daily flow of work and decision making.

As Bhanu Singh, a Senior Vice President at OpsRamp, suggests, ‘it is context that enables companies to see the flow of data’ and ‘connect the dots across silos’ that truly delivers value back to the enterprise.

Are you ready to make the change to a resiliency mindset?

Dell Technologies Consulting offers a variety of products and consulting services to help you get started in navigating and accelerating this journey towards building resilient systems and platforms. To learn more, contact your Dell Technologies Services representative today.

[1] Facebook, Amazon, Netflix and Google (now Alphabet, Inc).

About the Author: Bart Driscoll

Bart Driscoll is the Global Innovation Lead for Digital Services at Dell Technologies. This practice delivers a full spectrum of platform, data, application, and operations related services that help our clients navigate through the complexities and challenges of modernizing legacy portfolios, implementing continuous delivery systems, and adopting lean DevOps and agile practices. Bart’s passion for lean, collaborative systems combined with his tactical, action-oriented focus has helped Dell Technologies partner with some of the largest financial services and healthcare companies to begin the journey of digital transformation. Bart has broad experience in IT ranging from network engineering to help desk management to application development and testing. He has spent the last 22 years honing his application development and delivery skills in roles such as Information Architect, Release Manager, Test Manager, Agile Coach, Architect, and Project/Program Manager. Bart has held certifications from PMI, Agile Alliance, Pegasystems, and Six Sigma. Bart earned a bachelor’s degree from the College of the Holy Cross and a master’s degree from the University of Virginia.