What’s Driving the Data Lake?

EMC’s Federation Business Data Lake (FBDL) announcement has been a long time in the making.  It’s been a perfect storm of industry trends that enable big data and make data lakes a feasible data architecture option.  These trends include:

Data Growth – Web applications, social media, mobile apps, sensors, scanners, wearable computing and the Internet of Things are all generating an avalanche of new, more granular data about customers, channels, products and operations that can now be captured, integrated, mined and acted upon.

Cheap Storage – The cost of storage is plummeting, which enables organizations to think differently about data. Leading organizations are transitioning from viewing data as a cost to be minimized to valuing it as an asset to be hoarded. Even if they don’t yet know how they will use that data, they are transitioning to a “data abundance” mentality.

Limitless Computing – The ability to bring to bear an almost limitless amount of computing power to any business problem allows organizations to process, enrich and analyze this growing wealth of data, uncovering actionable insights about their customers and their business operations.

Real-time Technologies – Low-latency data access and analysis are enabling organizations to identify and monetize “events in the moment” while there is still value in the freshness or recency of the event.

While this list is impressive, it is not complete. There are two other key industry trends that are driving big data and the data lake:

Open Source Software is democratizing software tools like Hadoop, R, Shark, YARN, Mahout, and MADlib, by putting these tools within the reach of any organization.  Open source software is fueling innovation from startups and Fortune 500 organizations to universities and digital media companies; it is liberating organizations from being held captive by the product development cycles of traditional enterprise software vendors.

Many smart people have been working hard to pull the FBDL together, and I am proud to say that I saw the data lake coming as early as May 2012 when I published my “Understanding the Role of Hadoop In Your BI Environment” blog post.  Okay, okay, I originally called it a Hadoop-based “Operational Data Store,” but regardless of missing on the name, I got many of the key benefits right:

Hadoop brings at least two significant advantages to your ETL and data staging processes.  The first is the ability to ingest massive amounts of data as-is. That means that you do not need to pre-define the data schema before loading data into Hadoop. This includes not only traditional transactional data (e.g., point-of-sale transactions, call detail records, general ledger transactions, call center transactions), but also unstructured internal data (like consumer comments, doctor’s notes, insurance claims descriptions, and web logs) and external social media data (from social media sites such as LinkedIn, Pinterest, Facebook and Twitter).  So regardless of the structure of your incoming data, you can rapidly load it all into Hadoop, as-is, where it then becomes available for your downstream ETL, DW, and analytic processes.
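To make the “load it as-is” idea concrete, here is a minimal sketch in plain Python rather than Hadoop itself. The feed names, record layouts, and helper functions are all hypothetical, invented for illustration. The point it tries to show is that raw data lands untouched and a schema is applied only when a downstream process reads it.

```python
import json

# Hypothetical raw feeds, stored exactly as they arrive (no schema defined up front).
raw_landing_zone = [
    ("pos", '{"store": 12, "sku": "A-100", "amount": 19.99}'),
    ("call_center", "2012-05-01|cust-88|billing question"),
    ("comments", "Love the new checkout flow, but shipping was slow."),
]

def read_pos(raw):
    """Apply a point-of-sale schema at read time, not at load time."""
    record = json.loads(raw)
    return {
        "store": int(record["store"]),
        "sku": record["sku"],
        "amount": float(record["amount"]),
    }

def read_call_center(raw):
    """Apply a pipe-delimited schema at read time."""
    date, customer, topic = raw.split("|")
    return {"date": date, "customer": customer, "topic": topic}

# Only feeds that a downstream process needs get a reader; the free-text
# comments stay in the landing zone until someone decides how to use them.
readers = {"pos": read_pos, "call_center": read_call_center}

def read_as(source, landing_zone):
    """Return parsed records for one source, leaving the raw data untouched."""
    reader = readers[source]
    return [reader(raw) for name, raw in landing_zone if name == source]

pos_records = read_as("pos", raw_landing_zone)
```

Note that the comments feed was loaded without anyone knowing how it will be used, which is the “data abundance” mentality in miniature.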

My original “data lake” graphic

The second advantage that Hadoop brings to your BI/DW architecture occurs once the data is in the Hadoop environment.  Once it’s in your Hadoop ODS, you can leverage the inherently parallel nature of Hadoop to perform your traditional ETL work of cleansing, normalizing, aligning, and creating aggregates for your EDW at massive scale.
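The parallel ETL pattern described above can be sketched in a few lines. This is a toy illustration in plain Python, not Hadoop code: the partitions, the region/amount layout, and the function names are assumptions made for the example, and a thread pool stands in for a cluster of worker nodes. Each partition is cleansed, normalized, and pre-aggregated independently, and the partial results are then merged.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical raw sales lines, pre-split into partitions the way a
# distributed file system splits a large file into blocks.
partitions = [
    [" east ,19.99", "WEST,5.00", "bad line"],
    ["East,0.01", "west , 10.00"],
]

def clean_and_aggregate(lines):
    """Map step: cleanse and normalize one partition, then pre-aggregate it."""
    totals = Counter()
    for line in lines:
        parts = line.split(",")
        if len(parts) != 2:
            continue  # cleansing: drop malformed rows
        region = parts[0].strip().lower()      # normalizing: consistent region keys
        cents = round(float(parts[1]) * 100)   # integer cents avoid float drift
        totals[region] += cents
    return totals

def merge(partials):
    """Reduce step: combine per-partition aggregates into one result."""
    combined = Counter()
    for partial in partials:
        combined.update(partial)
    return dict(combined)

# Each partition is processed independently, so the work can be spread
# across as many workers as there are partitions.
with ThreadPoolExecutor() as pool:
    region_totals_cents = merge(pool.map(clean_and_aggregate, partitions))
```

Because no partition depends on another during the map step, adding more data mostly means adding more partitions and workers, which is what makes this style of ETL scale.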

And finally, Data Science, which is the most exciting industry trend for me.  Analytic tools combined with the volume, variety and velocity of data are converging with training and education, business-centric methodologies, and innovative thinking to enable organizations to “weave data hay into business gold” by uncovering customer, product and operational insights from data lakes that can be used to optimize key business processes and uncover new monetization opportunities.

What Does the Future Hold?

EMC’s Federation Business Data Lake takes a big step in the maturation of data lakes by leveraging big data industry trends to create a living “interconnected tissue” entity.  The features outlined in the FBDL will fuel the business transformational processes that we are already seeing underway at many clients.  But there is still a long way to go as tools, training and methodologies continue to evolve, helping organizations think differently about the role of data and analytics in powering their value creation processes.

Bill Schmarzo

About the Author: Bill Schmarzo

Bill Schmarzo is the Customer Advocate for Data Management Innovation at Dell Technologies. He is currently part of Dell Technologies’ core data management leadership team, where he is responsible for spearheading customer co-creation engagement to identify and prioritize customers’ key data management, data science, and data monetization requirements. Bill is the former Chief Innovation Officer at Hitachi Vantara, where he was responsible for driving Hitachi Vantara’s Data Science and “co-creation” efforts. Bill has also served as CTO at Dell EMC, where he formulated the company’s Big Data Practice strategy, identified target markets, developed solution frameworks, and led Analytics client engagements. As the VP of Analytics at Yahoo, Bill delivered the analytics tools and applications that optimized customers’ online marketing spend. Bill is the author of four books and is currently an Adjunct Professor at Menlo College, an Honorary Professor at the University of Ireland – Galway, and an Executive Fellow at the University of San Francisco, School of Management. Bill holds a Master of Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.