Why a Data Lake? Keeping Up with the Digital Universe

With the digital universe expected to swell to 44 zettabytes of data by 2020, today’s enterprises need a central data repository that can process growing volumes of all types of data faster, so that business users can make better, real-time decisions. In short, they need a stronger backbone; they need the data lake!

Not only do traditional databases constrain real-time and shared data analytics due to their siloed nature, they also lack the technology to accommodate the skyrocketing level and types of data being created at an increasing rate. After all, according to IDC research, the growing number of smart devices that analyze everything from home heating systems to consumer information will mean that within four years there will be some 7 billion connected people using an estimated 30 billion devices.

Our longstanding traditional database approach simply can’t keep pace with these growing data challenges and demand for flexibility and scalability. Based on evolving Big Data technologies, including Greenplum and Hadoop, however, the data lake can process this onslaught of data—including structured, unstructured and semi-structured—in the volume and at the velocity to keep pace with the exploding data universe.

The data lake allows enterprises to consolidate their data assets into a single central repository for cross-functional business analysis, satisfying the core tenets of security, availability, reliability and scalability. It also lets organizations create one logical data platform with multiple tiers of performance and storage levels to optimally serve various data needs based on service level agreements.
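As a rough illustration of serving data from multiple tiers based on service level agreements, the sketch below assigns a dataset to a storage tier according to how quickly users need to read it. The tier names, latency thresholds, and the `Dataset` and `assign_tier` names are hypothetical, chosen only for this example; they do not describe EMC's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical tiers, ordered fastest (most expensive) to slowest.
# Each threshold is the largest required access latency, in seconds,
# that the tier is meant to serve. These values are assumptions.
TIERS = [
    ("hot", 60),      # data users need within about a minute
    ("warm", 3600),   # data users need within about an hour
]
DEFAULT_TIER = "cold"  # everything else goes to the cheapest tier


@dataclass
class Dataset:
    name: str
    required_latency_s: int  # SLA: how quickly users must be able to read it


def assign_tier(dataset: Dataset) -> str:
    """Return the first tier whose threshold covers the dataset's SLA.

    Datasets with looser requirements fall through to the default tier.
    """
    for tier, threshold_s in TIERS:
        if dataset.required_latency_s <= threshold_s:
            return tier
    return DEFAULT_TIER


if __name__ == "__main__":
    for ds in [
        Dataset("live_dashboard_feed", 30),
        Dataset("weekly_sales_rollup", 600),
        Dataset("archived_logs", 86400),
    ]:
        print(ds.name, assign_tier(ds))
```

In practice the thresholds would come from the agreed service levels rather than hard-coded constants, and tier choice would likely weigh cost and data volume as well.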

At EMC, we launched our data lake in 2014 and have continued to evolve it since. We previously consolidated several islands of data into a corporate data warehouse and built out our Business Analytics as a Service (BaaS) platform, based on Pivotal’s Greenplum MPP database appliance, to empower the analysts in the business. However, we needed a more systematic, enterprise-wide analytics platform in order to become an analytics enterprise. Through the data lake, we enable the business to harness the data, to collaborate and share insights, and to exploit the value of the data to make better decisions, create prediction models, and transform business operations by applying advanced analytics.

The data lake addresses the following business challenges:

1. Allows business users to innovate with data for competitive advantage. Every company is trying to make data their new weapon for competitive advantage—how can we learn more about our customers faster than anybody else? This is where analyzing data from social media and every other data feed on the planet could make a difference to uncovering nuggets of insights. The data lake handles the volume, variety and velocity of data to make that happen.

2. Provides near real-time information to optimize predictions. Through our implementation of data lake technologies, we are able to integrate data faster and to query it more quickly and accurately. For example, with our data lake, the time needed to integrate data from our marketing automation system dropped from 7 to 10 days to 24 hours. And data queries that previously took four hours to execute now take less than one minute. This lets users make faster, better business decisions.

3. Cost-effectively scales to improve query and load performance. Data lake technology, including Hadoop, lets us process and use unstructured and structured data much faster and more cheaply, at scale. Instead of one-size-fits-all, the data lake increases efficiencies by tiering data to meet different business needs. Data that doesn’t need real-time access can be stored on a less expensive tier than data for which users require up-to-the-minute access.

4. Democratizes access to data by placing the data in the hands of the business. The data lake has allowed IT to move from being a gatekeeper of the data to a catalyst for the business to use analytics to improve its success. We can now give our business users an analytics workspace and access to a wide array of data (based on their security authorization) to perform their own reporting and analytics and to apply data science. They are well placed to do so because they are closer to the business and now have access to a rich ecosystem of datasets. It also allows business users to innovate and model the various possibilities. And they can collaborate with other business groups on analytic models and algorithmic efforts, which can be reused and enhanced to improve the speed of development. After all, innovation occurs when you have no boundaries – or, as they say, we don’t always know what we don’t know.

Ultimately, the case for an enterprise to create a data lake is driven by the digital explosion itself. As the volume and types of data created in that world continue to grow, the data lake will let users harvest such information to gain actionable insights and drive intelligent decision making. And companies that aren’t prepared to leverage the onslaught will be left behind.

For more insight from EMC’s own data scientists, be sure to check out these blogs:

Stocking the Data Lake with Smart Data: IT-Business Partnership is Key

The Data Lake From a Data Scientist Perspective

The Analytics Journey Leading to the Data Lake

The Journey Toward a Predictive Enterprise

About the Author: Brahma Tangella