Carving Out a New Data Lake

Creating a single data lake to serve a newly merged Dell Inc. and EMC Corp. is a bit like harnessing the tectonic shifts in the Earth’s crust that form the more traditional lakes some of us would rather be fishing on.

Both companies, united last fall as Dell Technologies, the world's largest privately held technology company, have relied on somewhat different technologies for the Big Data analytics that are key to their success. Critical data for each company was housed in multiple legacy systems and platforms. The challenge was how to bring everything together in a central repository: a data lake.

As soon as the groundbreaking merger took place last fall, a newly merged Big Data team, for which I serve as lead architect, began working to develop a world-class data ecosystem that would provide the right data, in the right place, in the right format and at the right time to solve current challenges and position the company for digital transformation.

Seven months later, we have built the foundation for the Dell Data Lake, stood up considerable functionality, and continued to integrate data from legacy systems, harnessing its value to enable Dell to act as a single company.

While our data lake formation continues, here are some insights from our journey so far.

Making the data connection

A first step toward integrated analytics was to design what the Dell Data Lake would look like. Our starting point was two Big Data platforms that were similar but not identical. Dell relied on Cloudera's Apache Hadoop-based software and Teradata, a massively parallel processing (MPP) database, for its analytics, transformed data sets and operational reporting. In contrast, EMC had stood up its Big Data platform to analyze raw and transformed data sets using Greenplum, an MPP platform, and Hortonworks' Hadoop-based software for processing large data sets.

We started by getting access to each other's data platforms. We had to overcome the typical barriers between two large companies trying to work together: separate firewalls and firewall rules on each side, IP address conflicts and network routing. Once we could reach each other's data, we had to address the fact that two different sets of enterprise applications, such as ERP and manufacturing systems, were feeding two different Big Data solutions, each performing operational reporting and analytics on its own data sets. To do meaningful analytics, you can't have processes feeding separate systems and then try to merge the results into one; you have to analyze everything in one place, or you get inaccurate, skewed or misleading results.
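To make the point concrete, here is a minimal sketch in plain Python, using invented customer IDs, of how merging results computed in two separate systems skews even a simple metric such as a distinct customer count:

    # Hypothetical illustration: a customer present in both legacy systems
    # is double-counted unless the raw data is combined before counting.
    dell_customers = {"acme", "globex", "initech"}     # invented sample IDs
    emc_customers = {"globex", "initech", "umbrella"}  # invented sample IDs

    # Merging per-system results: each count is correct locally, but the
    # sum double-counts the two customers the systems share.
    print(len(dell_customers) + len(emc_customers))  # 6 -- skewed

    # Analyzing in one place: union the raw data first, then count.
    print(len(dell_customers | emc_customers))       # 4 -- accurate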

So the challenge was getting two sets of source systems that were already writing to two Big Data solutions to also write to a third Big Data solution, and then integrating those data sets into a common data lake.
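As a rough illustration of what landing data in that third solution can look like, here is a simplified PySpark sketch; Spark is a natural fit since both platforms are Hadoop-based, but the paths, file formats and column names below are hypothetical stand-ins, not our actual jobs:

    # Sketch: land extracts from both legacy ERPs in a common raw zone,
    # untouched but tagged by origin so downstream work can trace lineage.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("raw-ingest").getOrCreate()

    for source, path in [("dell_erp", "/landing/dell_erp/orders"),
                         ("emc_erp", "/landing/emc_erp/orders")]:
        (spark.read.parquet(path)
              .withColumn("source_system", F.lit(source))
              .withColumn("ingested_at", F.current_timestamp())
              .write.mode("append")
              .parquet(f"/datalake/raw/{source}/orders"))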

Since analytics requires raw data, we first had to ingest data from Dell's legacy applications and from EMC's legacy applications, such as the two ERP systems, into the data lake. And while the ERP systems held similar data, they had two different schemas, or database blueprints. That meant that once we moved the data, we had to map what was in one schema to what was in the other, so we could integrate everything into a common data model and run reporting and analytics on a single data set.
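Here is a simplified sketch of that mapping step, again in PySpark; the column names are hypothetical (the real ERP schemas are far larger), but the pattern is the same: project each legacy schema onto one common model, then union into a single data set:

    # Sketch: conform two hypothetical ERP order schemas to one common model.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("conform-orders").getOrCreate()
    dell_raw = spark.read.parquet("/datalake/raw/dell_erp/orders")
    emc_raw = spark.read.parquet("/datalake/raw/emc_erp/orders")

    # Same business meaning, different names and units in each legacy schema.
    dell_orders = dell_raw.select(
        F.col("ord_id").alias("order_id"),
        F.col("cust_no").alias("customer_id"),
        F.col("ord_amt").alias("order_amount_usd"),
        F.col("source_system"))

    emc_orders = emc_raw.select(
        F.col("order_number").alias("order_id"),
        F.col("customer_key").alias("customer_id"),
        (F.col("order_total_cents") / 100).alias("order_amount_usd"),
        F.col("source_system"))

    # One integrated data set for reporting and analytics.
    dell_orders.unionByName(emc_orders).write.mode("overwrite").parquet(
        "/datalake/conformed/orders")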

This clearly is a very long process. We have been working on this integration since last fall and have only scratched the surface of fully stocking the Dell Data Lake. Our priorities thus far have been Sales, Customer and Service data, all of which will give us better insights into our customers and help us function as a more united company.

[Figure: Data Lake Architecture]

Thinking about the future    

Among the key challenges we continue to face is deciding which architecture and tools to use for the Dell Data Lake. The question is how to blend two Big Data solutions into one, using the best parts of both. This has prompted a lot of discussion among enterprise architects, delivery organization architects and administrators.

Ultimately, the challenge is that you have two distinct groups of people who have done things somewhat differently, and the goal is to get them to let go of the past and think about the future.

While it isn’t easy getting everyone to agree, we have made good progress and built a solid foundation.  I expect the Dell Data Lake to be fully built and functional within two years. While the data lake is functioning well already and we have a plan to complete it, two years from now we will still be building it and changing it.  The data lake will continually evolve, as the technology, insights and requirements are constantly changing and giving us newer and better ways to get value out of the data. So we will constantly evolve the lake as needed.

Check out this Dell EMC World '17 TV interview with Ramesh Razdan, SVP, Dell IT, to discover how he's delivering technology services around data analytics and data science to Dell's internal customers.

About the Author: Darryl Smith