By Sudhir Srinivasan, CTO, Dell EMC Storage
Artificial Intelligence (AI) is here! With a rapidly growing number of success stories proving the possibilities (and some bloopers too), there is no question that AI and machine learning have moved from science fiction to reality.
Why now? The confluence of two trends: multi-layered recursive learning technologies inspired by a deeper understanding of how the human brain learns, and exponentially cheaper and more powerful computing. Some of the latest advances made by leveraging these trends are truly amazing: machines that take advantage of their own “bodies” to learn, machines that autonomously learn to assist other machines, and deep learning algorithms rooted in the simplest of ideas: curiosity. Another reason for AI’s success is its focused targeting of well-defined problems such as natural language processing, facial recognition, document analysis, and medical diagnosis. The kinds of problems that are ripe for AI are those where intelligence is a set of heuristics (fact-based or “intuitive”) that evolve over time as the environment changes. For example:
- Diagnosing a medical condition based on observed symptoms combined with an understanding of human anatomy, physiology, chemistry, etc.
- Predicting a downstream problem in a manufacturing pipeline based on a confluence of events happening right now
The broader the scope of the environment, the more challenging it is for AI to succeed reliably. There’s no doubt AI will be solving increasingly complex problems, but like most things in life, the key is to start with a narrower focus and then expand.
There is a clear need for learning algorithms in the data center as well: specifically, making storage systems intelligent by applying machine learning algorithms that let the systems automatically change their behavior in response to changing workloads. Storage systems are well suited to AI because the intelligence is a set of heuristics that must evolve over time as the environment changes. Consider the following situations that require ongoing fine-tuning in a storage system:
- Is this a sequential read pattern? If so, how many blocks should I pre-fetch?
- Is that I/O surge a real application or a workload gone rogue?
- Do I have enough cache left to absorb this incoming stream or will I drown if I do that?
In the past, engineers would instinctively codify their heuristics as a set of knobs and dials that they expected “someone” to magically set and tune. With AI algorithms, the system can do that tweaking itself, without human intervention.
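To make the prefetch question above concrete, here is a minimal sketch of what self-tuning a single knob might look like. The class name, thresholds, and additive-increase/multiplicative-decrease policy are all invented for illustration, not an actual array’s algorithm:

```python
class PrefetchTuner:
    """Auto-tunes the prefetch-depth 'knob' from observed feedback.

    Hypothetical policy: grow the depth slowly while prefetched blocks
    are actually being read, and back off quickly when they are wasted
    (the same additive-increase/multiplicative-decrease idea TCP uses
    for congestion control).
    """

    def __init__(self, min_depth=1, max_depth=64):
        self.depth = min_depth
        self.min_depth = min_depth
        self.max_depth = max_depth

    def record(self, blocks_used, blocks_prefetched):
        """Feed back how many of the last prefetched blocks were read."""
        if blocks_prefetched == 0:
            return
        hit_rate = blocks_used / blocks_prefetched
        if hit_rate > 0.9:        # stream looks sequential: read further ahead
            self.depth = min(self.depth + 1, self.max_depth)
        elif hit_rate < 0.5:      # mostly wasted I/O: back off quickly
            self.depth = max(self.depth // 2, self.min_depth)


tuner = PrefetchTuner()
for _ in range(8):                # sustained sequential reads ramp depth up
    tuner.record(blocks_used=10, blocks_prefetched=10)
```

The point is that no human ever sets “prefetch depth = 16”; the feedback loop converges on it and keeps adjusting as the workload shifts.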
One of the most complex sub-systems in an enterprise storage system is the allocation of critical shared system resources across workloads. Regardless of its size, a storage system invariably services more workloads than it has dedicated resources for. This means the system has to share its critical resources (such as memory, CPU, expensive non-volatile RAM, and back-end I/O bandwidth) optimally across a set of workloads that is continuously changing. The system has a wealth of information about the workloads at a micro level, but it is non-trivial to decipher what is going on at a macro, system-wide level.
Machine learning capability embedded in the storage array allows the system to make these higher-level decisions autonomously. Modeling this as a reinforcement learning problem is the solution: the system is taught to take actions that maximize a notion of cumulative reward (or minimize regret) in order to achieve the target application performance.
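As a toy sketch of that reward-maximizing loop (not any vendor’s actual algorithm), an epsilon-greedy bandit can learn which of several candidate resource splits yields the best observed performance. The candidate splits and the reward function below are hypothetical:

```python
import random

def epsilon_greedy_allocator(actions, reward_fn, rounds=500, epsilon=0.1, seed=0):
    """Toy reinforcement-learning loop (an epsilon-greedy bandit).

    `actions` are candidate resource splits; `reward_fn(action)` is the
    performance observed after applying one. The loop keeps a running
    estimate of each action's reward, mostly exploits the current best,
    and occasionally explores alternatives.
    """
    rng = random.Random(seed)
    estimates = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for _ in range(rounds):
        if rng.random() < epsilon:                      # explore
            action = rng.choice(actions)
        else:                                           # exploit best estimate
            action = max(actions, key=lambda a: estimates[a])
        reward = reward_fn(action)
        counts[action] += 1
        # incremental running-average update of the reward estimate
        estimates[action] += (reward - estimates[action]) / counts[action]
    return max(actions, key=lambda a: estimates[a])

# Hypothetical use: what fraction of cache should the OLTP workload get
# versus a backup stream? Rewards are noisy throughput observations.
splits = [0.25, 0.5, 0.75]
best = epsilon_greedy_allocator(
    splits,
    reward_fn=lambda s: (1.0 - abs(s - 0.75)) + random.gauss(0, 0.05),
)
```

A production system would use far richer state and action spaces, but the shape is the same: act, observe reward, update, repeat.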
Taking this a step further, let’s look at the evolution of intelligent storage systems that are truly autonomous, akin to self-driving cars. If we can build self-driving cars, can we build “self-driving” storage systems?
An autonomous car and a storage system have fundamental similarities. Consider this:
- Both are very complex systems – dealing with a multitude of simultaneously occurring events happening very fast
- Both have a lot riding on them – human lives in one case and mission-critical business operations in the other that often impact human lives as well
While the image most people have in mind when they hear “self-driving car” is a vehicle that drives completely by itself, there are in fact multiple levels of automation, with Level 5 being fully autonomous.
We envision a similar gradation for storage systems and we can look at the journey to a fully autonomous storage system as consisting of four steps:
Level 1: Application-Centric
With a self-driving car, you tell it the destination, not the exact roads, turns, or speed. Similarly, you should interact with the storage system in terms of your goals, i.e. the application you wish to run. You care about the application, not what the storage system needs to do to run it. For example, you state the goal (run a web-based transaction processing application using a relational database) and let the system take care of the rest.
Level 2: Policy-Driven
Next, you tell the car whether to take the fastest direct route or a scenic route. Similarly, you want to set some service level objectives for the application you just told the storage system to run: is it a high-priority production application or a best-effort dev-test instance? Does it need additional data protection via remote copies? How often?
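Levels 1 and 2 together amount to a declarative interface: you describe the application and its objectives, and the array chooses the mechanisms. A hypothetical sketch, with every field name invented for illustration:

```python
# Hypothetical declarative provisioning request. Nothing here says which
# drives, tiers, or RAID layouts to use; that is the system's job.
app_request = {
    "application": "web-oltp",        # Level 1: the goal, not the mechanism
    "database": "relational",
    "service_level": "production",    # Level 2: priority vs. best-effort
    "max_latency_ms": 2,
    "remote_copies": 1,               # additional protection via remote copies
    "copy_interval_minutes": 15,      # ...and how often to refresh them
}
```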
Level 3: Self-Aware
Now the car has all the instructions it needs. To actually drive itself, however, it needs to be “self-aware”: it must analyze all the internal and external data that affect it, such as fuel level and the proximity and speed of surrounding vehicles. The storage system in this analogy needs to know how “close to the edge” it is operating. This is where the industry has been lagging. While a lot of telemetry is available, we typically haven’t been very good at analyzing that data to determine whether we’re about to drive off a cliff; we still rely on humans to figure it out, and typically the humans get involved only after something has gone horribly wrong. The first step is to make the system self-aware: instead of throwing a mountain of data at the human, the system should analyze the data itself and tell the user how close to the edge it is.
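One simple way to picture “close to the edge” is to condense telemetry into a single headroom signal. The metric names and fixed limits below are assumptions for illustration; a real system would learn its limits rather than hard-code them:

```python
def headroom(telemetry):
    """Condense raw telemetry into one 'distance to the edge' signal.

    Hypothetical approach: normalize each metric against its safe ceiling
    and let the worst one dominate, since the tightest bottleneck is what
    pushes the system over the edge.
    """
    limits = {                        # assumed safe operating ceilings
        "cpu_util": 0.9,
        "cache_dirty_ratio": 0.8,
        "backend_iops": 200_000,
    }
    utilizations = {k: telemetry[k] / limit for k, limit in limits.items()}
    worst = max(utilizations, key=utilizations.get)
    # fraction of headroom remaining, plus which metric is the bottleneck
    return 1.0 - utilizations[worst], worst


room, bottleneck = headroom(
    {"cpu_util": 0.45, "cache_dirty_ratio": 0.72, "backend_iops": 120_000}
)
```

Reporting “10% headroom left, and the dirty cache is why” is far more actionable for a human (or for the system itself) than a raw telemetry dump.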
Level 4: Self-Optimizing
Once the storage system knows how close to the edge it is, it needs to be able to adjust its behavior and operation to avoid going over. In the self-driving car world, a simple example is adaptive cruise control, where the car regulates its speed to keep a safe, radar-sensed distance from the vehicle ahead. A smart storage system can do exactly the same when its algorithms are designed to detect changes in the environment and adapt key system behaviors accordingly: meeting the needs of the applications while keeping the system out of catastrophic situations. In other words, you may want or need to drive at 65 MPH, but right now you can’t unless you change lanes.
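The adaptive-cruise-control analogy can be sketched as a throttle on background work driven by the headroom signal. The rate bands and floor below are invented for illustration:

```python
def background_rate(headroom_fraction, max_rate_mbps=500):
    """Adaptive-cruise-control analog for a storage array: modulate
    background work (scrubbing, replication catch-up) by remaining headroom.

    Hypothetical policy: full speed while headroom is ample, a linear
    backoff through a caution band, and a hard floor so background work
    never fully starves.
    """
    if headroom_fraction >= 0.5:      # plenty of room: drive at full speed
        return max_rate_mbps
    if headroom_fraction <= 0.1:      # near the edge: crawl
        return max_rate_mbps * 0.05
    # linear interpolation between the crawl floor and full speed
    span = (headroom_fraction - 0.1) / 0.4
    return max_rate_mbps * (0.05 + 0.95 * span)
```

Just as the car slows down without being told to, the array sheds non-critical load before the foreground applications ever feel it.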
So, who will get there first – a fully autonomous car or a fully autonomous storage system? My bet is on the latter!