I recently got a chance to catch up with Doug Cutting, the founder of Hadoop, Nutch, Lucene, and various other open-source technologies. Doug is currently the Chief Architect at Cloudera, the current leader in the Hadoop marketplace. He spends his time supporting the growth of Big Data and advocating for governmental regulation that ensures such data is used to help the world rather than for mischief. Having sat in meetings with enterprise companies deciding how to implement Big Data technologies, I was surprised by how open-minded Doug is: he believes open-source and closed-source technologies often marry quite well to solve challenges and create opportunities. Since EMC Isilon adds capabilities to Hadoop by using OneFS as the underlying file system and connecting via HDFS as an RPC over-the-wire protocol, questions naturally came up about Doug’s perspective on this as a solution.
Here is the transcript from that discussion:
Ryan: Can you tell us about the origins of the Hadoop Distributed File System (HDFS)?
Doug: It was modelled after Google’s GFS paper. Mike Cafarella & I were working on Nutch, a scalable, open-source web search engine. We knew that to scale to the billions of pages on the web we had to distribute computation to economical PCs. We had a distributed crawler, web analyzer, indexer, and search system working. This ran on five machines, but it was hard to operate even at that scale, involving a lot of manual steps. When the GFS and MapReduce papers from Google were published we immediately saw their relevance to our work.
Algorithmically, MapReduce was nearly identical to what we were already doing in Nutch. But Google showed how these kinds of distributed computations could be automated, so they could scale farther than five machines, to tens, hundreds or even thousands, without requiring much manual operation. GFS’s reliable, distributed storage was a critical sub-component of this, so we reproduced it in Nutch. We called it NDFS at first and renamed it HDFS in 2006 when Hadoop split out of Nutch, separating the distributed computing parts from the web-search ones.
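For readers unfamiliar with the pattern Doug describes, here is a minimal single-machine sketch of the map/shuffle/reduce phases, using word counting as the (hypothetical) workload. The function names and data are illustrative only; the real frameworks run these phases across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for each word in a document."""
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(values) for word, values in grouped.items()}

docs = ["hadoop scales", "hadoop stores data"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
# counts is {"hadoop": 2, "scales": 1, "stores": 1, "data": 1}
```

The framework's contribution, as Doug notes, is not the algorithm itself but automating the distribution of each phase across a cluster.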
Ryan: Many people say that HDFS was purpose-built for Hadoop and that the replication count of three was required for performance reasons and not for protection purposes. We have been able to prove comparable performance without the need for three copies using today’s technologies. What is the real reason for the replication count?
Doug: Google suggested a replication count of three primarily for reliability, not performance. The odds of losing data with a replication count of two are too high, while with three replicas they’re acceptable.
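Doug's point about the odds can be illustrated with a back-of-the-envelope calculation. Assuming (purely for illustration) that each replica of a block has some small independent probability of being lost within a repair window, a block disappears only if every replica is lost:

```python
def block_loss_probability(p, replicas):
    """Probability of losing a block, assuming each replica is lost
    independently with probability p before it can be re-replicated.
    Illustrative model only; real failure modes are correlated."""
    return p ** replicas

p = 0.01  # hypothetical per-replica loss probability
two_copies = block_loss_probability(p, 2)    # 1 in 10,000
three_copies = block_loss_probability(p, 3)  # 1 in 1,000,000
```

Under this toy model, the third copy buys two extra orders of magnitude of durability, which is the reliability argument behind the default of three.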
Ryan: Following up on that question, we hear that open-source HDFS has plans to follow suit with Isilon with respect to erasure coding [where cross-node parity is used for data protection] instead of 3X replication. Why do you believe the community has decided to take that approach?
Doug: Google’s original GFS paper mentioned erasure coding as a possible optimization, so the idea’s not new. Facebook implemented erasure coding for HDFS years ago, but their implementation never got merged back into the Apache version.
The rationale for erasure coding is simply to save more on storage costs. Affordability is a big component of scalability. If a system uses fewer drives per petabyte then folks can afford to store & analyze more petabytes.
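The storage-cost arithmetic behind Doug's rationale is simple to sketch. With n-way replication, every logical byte is stored n times; with a Reed-Solomon style (data, parity) code, overhead is the ratio of total blocks to data blocks. The RS(10,4) layout below is one commonly cited scheme, used here only as an example:

```python
def replication_overhead(copies):
    """Raw bytes stored per logical byte with n-way replication."""
    return float(copies)

def erasure_overhead(data_blocks, parity_blocks):
    """Raw bytes stored per logical byte with a (data, parity)
    erasure code such as Reed-Solomon."""
    return (data_blocks + parity_blocks) / data_blocks

rep = replication_overhead(3)   # 3.0x raw storage (200% overhead)
rs = erasure_overhead(10, 4)    # 1.4x raw storage (40% overhead)
```

For the same petabyte of logical data, the erasure-coded layout needs less than half the raw disk of 3X replication while still tolerating multiple block losses, which is exactly the affordability-drives-scalability argument Doug makes.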
Ryan: Isilon customers worry about the added cost to migrate their content from Isilon to a dedicated Hadoop environment. What is your take on the architecture of Isilon and Hadoop working together on tasks such as ETL, archiving, etc.?
Doug: Each case needs to be evaluated on its own merits. I’m sure there are lots of cases where it makes sense to put a Hadoop cluster next to an Isilon cluster and use it to analyze the data in Isilon.
Ryan: The term Data Lake has become popular in the industry. What is your opinion on the data lake strategy for big data storage?
Doug: We call it the Enterprise Data Hub. A defining factor is that you can bring multiple workloads to a shared dataset repository, providing a wide variety of tools for both exploration and production uses. Instead of designing and building solutions for each data problem, you build a general-purpose data storage and processing facility where your solutions can develop and evolve.
Ryan: You made a comment to me in the past that I’d like to dig deeper into (and I am paraphrasing): “there is no one specific Hadoop stack that must be used as long as it solves the problem a customer has to solve”. Can you elaborate on what you mean by that?
Doug: I will say that not only does the Hadoop ecosystem support evolving, exploratory, agile applications, but the ecosystem itself is designed to evolve. It’s built on a loose confederation of open-source projects, which is a key strength. If some component is superseded by a superior technology, there’s no single organization that can stop its replacement for its own interest at the expense of the ecosystem. This may sound like the crazy wild-west. That’s where vendors come in. A vendor will commit to long-term support of components so that each production system need not evolve at the pace of the ecosystem.
Ryan: Where do you see Hadoop five years from now and ten years from now?
Doug: In five years it will be equal to the RDBMS in adoption. In ten years it will have eclipsed the RDBMS. It will be the center of every major enterprise’s data system.
Ryan: In your opinion, what is needed to see accelerated adoption of Hadoop in the Enterprise?
Doug: Attention to detail. We need security facilities that make adoption easy in each industry, each with their different compliance needs. We need industry-specific applications and tools. We need broader familiarity with the technology stack. The Hadoop stack is still young relative to most enterprise technologies. But it’s growing fast and we’re seeing it meet the needs of more and more applications each quarter.
Ryan: Knowing what you know now versus 2005, would you have done anything differently having some of today’s technologies at your disposal?
Doug: If I had had today’s technologies then it wouldn’t have been 2005! I was attracted to the technologies that Google described because they were clearly useful and there was nothing similar available to developers outside of Google. I knew that open-source would be a great way to make these tools widely available, so I put two and two together and started building Hadoop. If I saw another opportunity, where there was a broadly applicable technology that wasn’t generally available, then I might do the same thing again. Or I might leave it to younger, more energetic folks next time around!
Ryan: So Doug, most people already know that the name Hadoop comes from the name of your son’s stuffed elephant. Do you still have the elephant? Do you think it will end up in the Smithsonian someday?
Doug: Smithsonian? Wow! I don’t think he’s that famous. Computer History Museum, perhaps! I still have him. He lives in my drawer now.
Ryan: Doug, thanks for taking the time to meet with me and discuss Hadoop! I look forward to speaking to you more often about the industry as it grows and matures.
To all you readers out there, if you are interested in learning more about the budding relationship between Cloudera and EMC Isilon, I invite you to join us for the keynote at EMC World.