From using analytics to predict how our storage arrays will perform in the field, to engineering product configurations to best meet customers’ future needs, EMC is just beginning to tap into the gold mine of intelligence waiting to be extracted from our new data lake.
In fact, we are currently working on dozens of business use cases that are projected to drive millions in revenue opportunities. And we are just scratching the surface. There’s a lot more data available, more to be harvested, and more analytics to be built out as data scientists and business users hit their stride in exploring a new era of data-driven innovation at EMC.
As I noted in my earlier blog (The Analytics Journey Leading to the Business Data Lake), EMC IT embarked on creating a data lake to transition from traditional business intelligence to advanced analytics more than two years ago. A key focus of this effort was to address the fact that data scientists and business users seeking to leverage our growing amount of data were stifled by the need for such projects to go through IT, which was a costly and slow process that discouraged innovation.
We now have the foundation and tools in place to use data and analytics to create sustainable, long-term competitive differentiation. To get here, we worked closely with EMC affiliate Pivotal Software, Inc., maturing the platform together and leveraging the multi-tenancy capabilities of its Big Data Suite.
Building the Lake
Here are some highlights of our data lake journey.
We began our effort to create a scalable but cost-effective foundation for EMC’s data analytics by building a Hadoop-based data lake to store and process EMC’s growing data footprint. Hadoop runs on commodity hardware and scales linearly, making it less expensive to scale than a traditional database appliance. It also runs on EMC’s industry-leading IT Proven technologies, including XtremIO, ScaleIO, Isilon and Data Domain, to enable enterprise capabilities such as built-in name node fail-over, replication, storage efficiency, disaster recovery, backup and recovery, snapshots, and the ability to scale out compute and storage separately.
Once the lake was operational, we used Pivotal Spring XD, a data integration and pipelining tool, to orchestrate batch and streaming data flows into the data lake. It didn’t take long for the data lake to exceed the size of EMC’s legacy global data warehouse. Today, it is more than 500 terabytes and continues to grow.
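To make the batch and streaming distinction concrete, here is a minimal, hypothetical Python sketch of the two ingestion paths a tool like Spring XD orchestrates. This is an illustration of the pattern, not Spring XD itself; the in-memory `lake` dictionary and the function names are stand-ins invented for this example.

```python
import time
from collections import defaultdict

# Hypothetical in-memory stand-in for the data lake's landing zones.
lake = defaultdict(list)

def ingest_stream(record, zone="streaming"):
    """Streaming path: append a single record as it arrives."""
    record["ingested_at"] = time.time()
    lake[zone].append(record)

def ingest_batch(records, zone="batch"):
    """Batch path: load a whole extract at once, e.g. a nightly pull."""
    for record in records:
        record["ingested_at"] = time.time()
    lake[zone].extend(records)

# Streaming: one event at a time, such as a log event from a storage array.
ingest_stream({"array_id": "A100", "event": "disk_warning"})

# Batch: a bulk extract from a legacy source system.
ingest_batch([{"order_id": 1}, {"order_id": 2}])
```

In a real pipeline the sinks would be HDFS directories or lake tables rather than Python lists, but the orchestration problem is the same: routing both continuous feeds and bulk extracts into one store.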
With the foundation in place, our team then turned our attention to building analytics capabilities. The most important requirement was that whatever tools and technologies we chose must provide self-service capabilities, so users could easily spin up new analytics projects without having to involve IT. We wanted self-service capabilities that allowed users to easily identify the data sets they needed, to integrate new data sources, to create analytical workspaces to blend and interrogate data, and to publish the results of analysis for collaboration with colleagues.
We chose to use a mix of technologies centered around an EMC-developed framework for Data API/Services based on Pivotal Cloud Foundry (PCF) and Big Data Suite (BDS) that enables seamless interface with the data lake.
For the analytics itself, we chose Pivotal Greenplum, a massively parallel processing analytical database that is part of Pivotal’s BDS. Users bring their desired data sets into an analytics workspace powered by Greenplum, where they can run different styles of analytics—including machine learning, geospatial analytics and text analytics. They can visualize the results with the tools of their choice. EMC’s data scientists primarily use MADlib, R and SAS to develop and run algorithms and predictive models inside Greenplum, while business users tend to use business intelligence tools like Tableau and Business Objects.
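To give a flavor of the kind of predictive model a data scientist might build in such a workspace, here is a self-contained Python sketch of a one-feature logistic regression fit with plain gradient descent. It is a toy stand-in for what MADlib or R would do in-database at scale; the feature, labels, and numbers are synthetic and invented for illustration.

```python
import math

# Toy training data: (hours of elevated drive temperature, failed within a day?)
data = [(0.5, 0), (1.0, 0), (1.5, 0), (4.0, 1), (5.0, 1), (6.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit a single-feature logistic model with stochastic gradient descent.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    for x, y in data:
        p = sigmoid(w * x + b)   # predicted failure probability
        w -= lr * (p - y) * x    # gradient step on the weight
        b -= lr * (p - y)        # gradient step on the bias

def predict(hours):
    """Estimated probability of failure given hours of elevated temperature."""
    return sigmoid(w * hours + b)
```

In practice the equivalent model would be trained where the data lives—inside Greenplum via MADlib—so the full data set never has to be moved out of the workspace.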
Finally, users can publish the analytical results via a data hub. This significantly shortens time to insight since users can build on one another’s work rather than constantly starting from scratch.
Self-Service Tools Empower Users
EMC is already reaping the benefits of its new data lake and analytics capabilities.
One such opportunity involves log data created by EMC storage arrays and other products as they operate in the field. With its new Hadoop-based data lake, EMC is now equipped to ingest, store and process this log data, which is then analyzed to predict and prevent problems before they occur and impact the customer.
Analysis of log data might reveal, for example, that a particular component in a customer’s storage array is likely to fail in the next 8 to 12 hours. With that insight in hand, EMC support can reach out to the customer and take steps to prevent the component failure before it disrupts important business processes. Our support and sales folks are at the door before a problem happens.
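A drastically simplified Python sketch of this idea: scan recent log events, count warnings per array over a time window, and flag arrays whose warning rate crosses a threshold. The threshold, field names, and array IDs here are hypothetical; a production model would derive its signals from historical failure data rather than a hand-set rate.

```python
from collections import Counter

# Hypothetical cutoff; a real model would learn this from past failures.
WARNINGS_PER_HOUR_THRESHOLD = 3

def flag_at_risk(events, window_hours):
    """Return the set of array IDs whose warning rate exceeds the threshold."""
    counts = Counter(e["array_id"] for e in events if e["level"] == "warning")
    return {array_id for array_id, n in counts.items()
            if n / window_hours >= WARNINGS_PER_HOUR_THRESHOLD}

# Sample of log events ingested from arrays in the field (invented records).
events = [
    {"array_id": "A100", "level": "warning"},
    {"array_id": "A100", "level": "warning"},
    {"array_id": "A100", "level": "warning"},
    {"array_id": "B200", "level": "info"},
]
at_risk = flag_at_risk(events, window_hours=1)
```

Each flagged array becomes a proactive support case, which is exactly the before-the-failure outreach described above.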
This preventative maintenance capability, which would not be possible without EMC’s new data lake and agile analytics capabilities, results in higher levels of customer satisfaction and customer loyalty, which has a direct impact on EMC’s bottom line. It also helps EMC’s engineers determine the optimal product configurations for various scenarios and use cases, as well as provides valuable insights as they develop new products and services.
Our cutting-edge Big Data analytics should yield many more such opportunities now that the data lake enables our innovators to experiment and explore at their own pace. After all, data scientists and business users no longer need to go through IT when they want to start a new analytics project. Instead, they log into the data hub and use self-service tools to identify potentially valuable data sets for analysis. They can bring in their own data or outside vendor data and mesh it with our enterprise data.
With the IT bottleneck out of the way, our data scientists and business users feel genuinely empowered to tap into and explore new data analytics frontiers.