Artificial Intelligence (AI) is here! With a rapidly growing number of success stories proving the possibilities and some bloopers too, there is no question that AI and machine learning technology have moved from science fiction to reality.
Why now? In essence, I see it as a confluence of two trends: multi-layered recursive learning technologies inspired by a deeper understanding of how the human brain learns, and exponentially cheaper and more powerful computing. Some of the latest advances made by leveraging these trends are truly amazing: machines that take advantage of their own “bodies” to learn, machines that autonomously learn to assist other machines, and deep learning algorithms that are fundamentally rooted in the simplest of ideas: curiosity. Another reason for the success of AI is that it is being applied to focused problems such as natural language processing, facial recognition, document analysis, medical diagnosis, etc.; a “truly intelligent” digital assistant still seems a long way away.
The kinds of problems that are ripe for AI are those where intelligence is a set of heuristics (fact-based or “intuitive”) that evolve over time as the environment changes. For example:
- Diagnosing a medical condition based on observed symptoms combined with an understanding of human anatomy, physiology, chemistry, etc.
- Predicting a downstream problem in a manufacturing pipeline based on a confluence of events happening right now
The broader the scope of the environment, the more challenging it is for AI to succeed reliably. There’s no doubt AI will be solving increasingly complex problems, but like most things in life, the key is to start with a narrower focus and then expand.
At Dell EMC, we recognized the need/opportunity for applying learning algorithms in the data center and have been developing and perfecting them for years. Over the last five years or so, we have been making our storage systems intelligent: applying machine learning algorithms that let the systems automatically change their behavior in response to changing workloads, without human intervention, to best serve our customers’ mission-critical applications. Side note: beware of vendors that seem to have developed AI capabilities overnight – they’re clearly jumping on the hype bandwagon.
We also recognized that we needed to keep it focused initially and then expand. As mentioned earlier, the problems that are best suited for an AI algorithm are those where the intelligence is a set of heuristics that need to evolve over time as the environment changes. If you’ve built enterprise storage systems, you know they are full of such things – for example:
- Is this a sequential read pattern? If so, how many blocks should I pre-fetch?
- Is that I/O surge a real application or a workload gone rogue?
- Do I have enough cache left to absorb this incoming stream or will I drown if I do that?
In the past, engineers would instinctively codify their heuristics as a set of knobs and dials that they expected “someone” to magically set and tune. So we started looking at how we could put this experience and knowledge into algorithms inside the system, so the system itself can do the “tweaking” and avoid human intervention.
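To make the “knobs and dials” idea concrete, here is a minimal sketch of the pre-fetch heuristic mentioned above. The threshold and depth constants are exactly the kind of hypothetical hand-tuned knobs described; the names and values are illustrative, not taken from any Dell EMC product.

```python
SEQUENTIAL_THRESHOLD = 4   # consecutive adjacent reads before we call the stream sequential
MAX_PREFETCH_BLOCKS = 32   # upper bound on how far ahead we read, to protect cache

def prefetch_depth(recent_reads):
    """Return how many blocks to pre-fetch, given recent block addresses."""
    run = 1
    for prev, curr in zip(recent_reads, recent_reads[1:]):
        # Extend the run if this read is the block right after the last one.
        run = run + 1 if curr == prev + 1 else 1
    if run < SEQUENTIAL_THRESHOLD:
        return 0  # pattern looks random: pre-fetching would waste cache
    # Scale pre-fetch with run length, capped so one stream can't flood the cache.
    return min(2 * run, MAX_PREFETCH_BLOCKS)
```

The point of embedding learning in the system is that constants like these no longer need a human to guess them per workload.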
One of the most complex sub-systems in an enterprise storage system is the allocation of critical shared system resources across workloads. No matter the size, a storage system services more workloads than its dedicated resources can support – in other words, the system has to share its critical resources (such as memory, CPU, expensive non-volatile RAM, back-end I/O bandwidth, etc.) optimally across a set of workloads that are continuously changing. The system has a ton of information about the workloads, but it is non-trivial to decipher what is going on at a macro system level. In the example below, clearly something changed on Day 3 – but is it a permanent change? And should the system change its knobs-and-dials settings as a result?
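At its simplest, the “did something really change on Day 3?” question is a change-detection problem. The sketch below flags a sustained shift when the recent average load departs from the long-run baseline by more than a tolerance; the window and tolerance are hypothetical parameters, and real systems use far richer statistics than a mean comparison.

```python
def sustained_shift(samples, window=3, tolerance=0.25):
    """Return True if the mean of the last `window` samples differs from the
    baseline mean (all earlier samples) by more than `tolerance`, expressed
    as a fraction of the baseline mean."""
    if len(samples) <= window:
        return False  # not enough history to form a baseline
    baseline = samples[:-window]
    recent = samples[-window:]
    base_mean = sum(baseline) / len(baseline)
    recent_mean = sum(recent) / len(recent)
    return abs(recent_mean - base_mean) > tolerance * base_mean
```

Only once the system is confident a shift is sustained (not a transient spike) does it make sense to move the knobs and dials.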
Over the last few years, we have been developing a machine learning capability embedded inside the storage array that allows the system to make these kinds of decisions autonomously (shown in the picture below). Each application has a pre-defined performance requirement specified as a service level. The critical system resources (e.g. CPU, memory, non-volatile media, backend bandwidth) are dynamically reallocated across applications to achieve the service levels. We accomplish that by modeling this as a reinforcement learning problem. The system is taught to take action to maximize a notion of cumulative reward (or minimize regret) to achieve the target application performance.
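The full reinforcement-learning formulation is well beyond a blog snippet, but its core loop – observe performance, act on resource shares, score the outcome as a reward – can be caricatured. This toy controller greedily moves one unit of a shared resource toward the workload missing its service level by the most; the reward it maximizes is the (negative) total service-level miss. It is a simplified greedy stand-in for the learning loop described above, and every name and number is illustrative rather than the PowerMax algorithm.

```python
def reward(latencies, targets):
    """Negative total service-level miss: 0 when every target is met."""
    return -sum(max(0.0, lat - tgt) for lat, tgt in zip(latencies, targets))

def rebalance(shares, latencies, targets, step=1):
    """Move `step` units of resource from the workload with the most
    headroom to the workload missing its target by the most."""
    miss = [lat - tgt for lat, tgt in zip(latencies, targets)]
    worst = miss.index(max(miss))  # most in need of resources
    best = miss.index(min(miss))   # most headroom to give up
    if miss[worst] <= 0 or shares[best] < step:
        return shares              # all targets met, or nothing left to give
    new = list(shares)
    new[worst] += step
    new[best] -= step
    return new
```

Run in a loop against live measurements, a controller like this keeps nudging the allocation in whichever direction improves the cumulative reward – the same intuition, at toy scale, as maximizing reward (or minimizing regret) against service-level targets.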
And here’s an example of this machinery in action in the all-new PowerMax Dell EMC just launched:
Let’s unpack this a bit. The graphs make it look easy – after the fact, it’s tempting to conclude that the answer is to just keep throttling the Test workload until the Production workload meets its service level objective. That’s the sledgehammer approach, and it’s ineffective in all but the most simplistic and/or contrived situations. Instead, the PowerMax targets the response time objective of each workload and manipulates the system (not the workload) at a fundamental and granular level to achieve the least disruptive and most optimal outcome for all workloads in the system. And it does that in real time while serving millions of I/Os per second to mission-critical applications, ensuring it does not induce catastrophic behaviors in the process.
Not easy. However, building on our experience with thousands of systems in the field and customer feedback and insights, we’ve trained the models and applied the technology to more complex problems inside the PowerMax, making it a truly Intelligent Storage system. In part 2 of this blog, I’ll explore how we can take this intelligence to the next level and develop the storage equivalent of a self-driving car – self-driving storage!