The summer is here and I realized that we had not blogged in a couple of months. So here goes. As you know, we in EMC IT continue to have many conversations with customers who are interested in our cloud journey and in what we are doing in our IT as a Service program. More recently than anything else, we hear: “Is Big Data real? And does it really apply to me?” Let me tackle this last one. We typically hear about Big Data being prevalent in large-scale business data scenarios like genomics, space telemetry, drug discovery and synthesis, financial time-series data, etc. But is that all? The answer is definitely no. Look at the picture alongside to illustrate the point. Traditional BI is the usual current model – primarily backward-looking, with lagging indicators. The future is in predictive analytics – predicting potential outcomes by leveraging the history. The longer and more detailed a history you have, the better your ability to predict. This calls for Big Data approaches, where you don’t necessarily summarize and roll up, but keep all the data. All well and good, but does it really apply to me? Our thesis is: absolutely – either enterprises do not know it, or they do not generate the data because they do not know how best to store or process it.
Usually, you hear about business data being analyzed to provide better metrics and KPIs for the business, and we have our own examples of that too. I had referred to this in the prior EMC World post – we had collaborated with the Corporate Quality group at EMC to stand up a Greenplum solution, modify some processes, and go from a 6-day cycle of data load and analysis to a 28-minute cycle. The details of that story are for another post, but we had already seen the power of what Big Data can give us. In this post, I want to focus instead on a couple of non-traditional IT examples to illustrate the point – (a) performance analysis of storage and (b) security event forensics.
The first problem statement we looked at was real-time performance analysis of our storage arrays. At EMC IT, we have hundreds of storage arrays, many SAN switches, and a number of collection points and data points per array. To gain a good understanding of how these storage arrays are performing, we need to collect performance data from the collection points, but there is ‘too much’ data to collect – even at 15-minute collection intervals, we generate about 12 MB/array/hour. So the data collection is typically quite lossy, and we do not record much peak information. The constraining factors have typically been whether we can load the data coming from the sensors into a datastore fast enough, and whether we have the tools to analyze all the data in a predictive, as-it-happens kind of manner. All this changed once we decided to adopt a Big Data approach.
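To put the data volumes in perspective, here is a back-of-the-envelope sketch of how collection frequency drives volume. It assumes (as the text suggests) roughly 12 MB/array/hour at 15-minute intervals and that volume scales linearly with sampling frequency; the fleet size of 300 arrays is purely illustrative.

```python
# Back-of-the-envelope telemetry volumes for storage arrays.
# Assumption: 15-minute sampling yields ~12 MB/array/hour (per the post),
# and volume scales linearly with sampling frequency.

BASELINE_INTERVAL_SEC = 15 * 60      # 15-minute collections
BASELINE_MB_PER_ARRAY_HOUR = 12.0    # figure from the post

def mb_per_array_hour(interval_sec: float) -> float:
    """Estimated MB/array/hour at a given collection interval."""
    return BASELINE_MB_PER_ARRAY_HOUR * (BASELINE_INTERVAL_SEC / interval_sec)

def daily_volume_gb(num_arrays: int, interval_sec: float) -> float:
    """Estimated GB/day across a fleet of arrays."""
    return num_arrays * mb_per_array_hour(interval_sec) * 24 / 1024

for interval in (15 * 60, 60, 10):
    print(f"{interval:>4} s interval: "
          f"{mb_per_array_hour(interval):>6.0f} MB/array/hour, "
          f"{daily_volume_gb(300, interval):>8.1f} GB/day for 300 arrays")
```

At 10-second intervals the same fleet produces roughly 90 times the data of the 15-minute baseline – several TB a day across a few hundred arrays – which is exactly the load-rate problem described above.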
Based on that experience, we decided to attack the performance analysis problem with the same tools we used to address the Corporate Quality issue mentioned above. The fast-load functionality in Greenplum enabled us to ratchet the data collection frequency up to 10-second intervals (from 15-minute intervals), and we were able to use embedded analytics with R to get to some neat visualizations in real time. The picture alongside gives you a sense of the peaks in real time, but one of the places we still needed to go was to be predictive. (Clearly, one of the challenging areas is finding the best visualization model, but ours was functional at this point, even if it needed more work.)
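A small sketch of why the jump from 15-minute to 10-second collection matters (this is an illustration, not our actual pipeline): a short latency burst that is obvious at 10-second resolution vanishes entirely when the same series is rolled up into one 15-minute average. The latency figures are synthetic.

```python
# Illustrative only: a 60-second I/O latency burst visible at 10 s
# resolution disappears in a 15-minute roll-up.
import statistics

# Synthetic latency samples (ms), one per 10 s: mostly ~5 ms, with a
# one-minute burst to 50 ms in the middle. 90 samples = 15 minutes.
samples = [5.0] * 40 + [50.0] * 6 + [5.0] * 44

peak_10s = max(samples)               # what 10 s collection sees
avg_15min = statistics.mean(samples)  # what a single 15 min roll-up sees

print(f"10 s peak:      {peak_10s:.1f} ms")   # the burst is captured
print(f"15 min average: {avg_15min:.1f} ms")  # the burst is invisible
```

The roll-up reports a healthy-looking average while the array actually spent a full minute at 10x its normal latency – the “peak information” that coarse collection loses.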
But wait, there is more… We worked with the VMware vCenter Operations (vCOps) team – the Integrien product before it was acquired by VMware – to implement a POC with a feed from our Greenplum database into the vCOps interface. This starts to provide some of the predictive analytics we were talking about in the prior paragraph. It is slowly starting to come together…
In the case of security forensics, we were faced with a similar problem. There is the typical data overload, where we have to collect, retain, and retrieve many TBs of security-related information – change logs, logins, access control logs, … – over long periods of time. Once an incident occurs, it is very hard to find the root cause – the needle in the haystack, whether it is a single event or a collection of correlated points. Better still, we want to avert security issues in the first place, or detect and respond to attacks while they are still happening. The same Big Data techniques are absolutely needed here. We find that complex queries that took the Critical Incident Response Center (CIRC) many hours to run can be done in a few minutes on Greenplum. We continue to work closely with the RSA team to enable our own capabilities in this area.
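To make the “correlated points” idea concrete, here is a toy sketch of one classic correlation: a cluster of failed logins from one host inside a short window, followed by a success – a brute-force signature. The event schema, hosts, and thresholds are entirely made up for illustration; they are not CIRC’s actual data model.

```python
# Toy correlation sketch (hypothetical schema, not CIRC's): flag hosts
# with >= `threshold` login failures inside `window`, followed by a
# successful login -- a classic brute-force signature.
from collections import defaultdict
from datetime import datetime, timedelta

events = [
    # (timestamp, source_host, event_type) -- all illustrative
    (datetime(2012, 6, 1, 2, 0, 5),  "10.0.0.7",  "login_failure"),
    (datetime(2012, 6, 1, 2, 0, 9),  "10.0.0.7",  "login_failure"),
    (datetime(2012, 6, 1, 2, 0, 14), "10.0.0.7",  "login_failure"),
    (datetime(2012, 6, 1, 2, 0, 21), "10.0.0.7",  "login_success"),
    (datetime(2012, 6, 1, 9, 15, 0), "10.0.0.42", "login_failure"),
]

def suspicious_sources(events, window=timedelta(minutes=1), threshold=3):
    """Return hosts showing `threshold` failures within `window`,
    followed by a later success."""
    by_host = defaultdict(list)
    for ts, host, kind in sorted(events):
        by_host[host].append((ts, kind))
    flagged = []
    for host, evts in by_host.items():
        failures = [ts for ts, kind in evts if kind == "login_failure"]
        successes = [ts for ts, kind in evts if kind == "login_success"]
        for i in range(len(failures) - threshold + 1):
            first, last = failures[i], failures[i + threshold - 1]
            if last - first <= window and any(s > last for s in successes):
                flagged.append(host)
                break
    return flagged

print(suspicious_sources(events))  # → ['10.0.0.7']
```

In-memory this is trivial; the Big Data point is that the same group-by-and-window logic, expressed in SQL over billions of rows, is what drops those multi-hour CIRC queries to minutes on an MPP store like Greenplum.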
So, the Big Data problem truly exists in any data center and any IT operations organization. What if we were to increase the frequency of collection of the different sensor data across the data center? What if we were to add more collection points and sensors? What if the data center operations team could analyze the data from a performance or error perspective and fix problems before they occur? This does start to change the game!