I recently had a chance to sit down with Hugh Williams, SVP of R&D at Pivotal. Hugh was previously at eBay and Microsoft and brings an impressive Big Data background that includes industry patents and numerous publications. Here is a transcript of our discussion:
Ryan: Hi Hugh, thanks for taking the time to sit down with me. Getting straight to the questions I have for you: How would you define a Data Lake?
Hugh: Great question. The basic premise of a Data Lake is that you have one place where all of your data is stored, allowing everyone in an organization to extract value from it using the tools they want to use. Without a data lake, data is usually siloed in legacy systems where only a handful of applications or a subgroup of users can access it.
Ryan: What would you consider to be the most important attributes of a data lake?
Hugh: Having all of the data in one place. Of course, you need the right tools to accomplish that – ingestion and connection to existing sources are still more challenging than they should be.
Ryan: How do customers build data lakes?
Hugh: Most companies start a data lake with a small group of folks who build out a modest Hadoop capability; they demonstrate that capability, the noise gets louder, and eventually the company says that rather than having all of these solutions scattered throughout the organization, let's collect them in one place.
Ryan: I call those Data Puddles! What have you seen inhibit adoption?
Hugh: A few things come to mind. First, ingestion and egestion are problematic: how am I going to get all of that data from various places into one central place? Second, the Hadoop ecosystem is relatively immature. It is an impressive toolbox, but there is still a barrier to entry in setting up the infrastructure, standing it up, training people, and getting all the right pieces in place. The last thing I'll say is that using Hadoop to extract business value is not easy; you have to employ Data Science folks. Pivotal is making SQL on Hadoop much more mature to help solve this issue.
Ryan: What interests you about the Isilon partnership with Pivotal?
Hugh: Hadoop will rule the world, but its maturity is a problem today. Isilon is mature, and companies bet their businesses on it. If you want one thing to be reliable, it has to be the storage – and so the partnership between Pivotal and Isilon really matters.
Ryan: Customers often lump HAWQ in with Stinger, Impala, and even Hive. How do you differentiate HAWQ from other SQL solutions?
Hugh: Hive is a relatively elementary implementation of SQL access to Hadoop with only the basic features of SQL. It was revolutionary when it appeared, but it doesn't have what a Data Scientist would need. Impala is a nice step forward from Hive. The really interesting thing about HAWQ is that we took 10+ years of SQL experience from the Data Warehouse space and ported it to work with Hadoop. What you get with HAWQ is the Greenplum database heritage adapted to Hadoop. Pivotal has the most advanced solution for SQL access to Hadoop.
Ryan: Can you provide an example of something you can do with HAWQ that cannot be done with the others?
Hugh: There are benchmarks such as TPC-DS that help validate whether typical SQL queries can be evaluated and optimized on different systems. In rough terms, when we used TPC-DS to test SQL compliance, 100% of the queries succeeded with HAWQ, only around 30% with Impala, and around 20% with Hive. We published a peer-reviewed study showing these results at this year's SIGMOD, the leading database conference.
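[Editor's note: to make the gap concrete, here is a minimal sketch, using Python's built-in sqlite3 rather than HAWQ or any actual TPC-DS query, of the kind of analytic SQL (a window function ranking rows within groups) that TPC-DS-style workloads depend on and that early SQL-on-Hadoop layers often could not evaluate. The table and data are invented for illustration.]

```python
import sqlite3

# Illustrative only: a window-function query of the style analytic SQL
# engines must support. Table name, columns, and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 'widget', 100),
        ('east', 'gadget', 250),
        ('west', 'widget', 300),
        ('west', 'gadget', 150);
""")

# Rank products by revenue within each region -- a window function,
# one of the SQL features elementary engines historically lacked.
# (Requires SQLite >= 3.25, bundled with modern Python builds.)
rows = conn.execute("""
    SELECT region, product,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
""").fetchall()

for row in rows:
    print(row)
```

A full-SQL engine plans and evaluates such queries directly; an engine that only supports basic SELECT/GROUP BY would reject the `OVER` clause outright, which is roughly what the compliance percentages above are measuring.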
Ryan: You recently announced GemXD, a new product in the GemFire family. What is an example of a problem that GemXD solves?
Hugh: You can think of it as Cassandra or HBase done really, really well – with a SQL interface, full ACID capabilities, the ability to upgrade with no downtime, the ability to read and write to Hadoop’s HDFS storage layer when there’s too much data for memory, and much more.
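[Editor's note: the memory-overflow idea Hugh describes can be sketched in a few lines. This is a toy illustration and not GemXD's actual design: a least-recently-used in-memory tier that spills cold entries to local disk files, standing in for the way a store can push data down to HDFS when there is too much for memory. All names here are invented.]

```python
import json
import os
import tempfile
from collections import OrderedDict

class OverflowStore:
    """Toy two-tier store: hot entries in memory, cold entries on disk.
    The disk directory stands in for an HDFS tier; this is NOT GemXD."""

    def __init__(self, max_in_memory=2):
        self.max_in_memory = max_in_memory
        self.hot = OrderedDict()            # in-memory tier, LRU order
        self.disk_dir = tempfile.mkdtemp()  # stand-in for the HDFS tier

    def _disk_path(self, key):
        return os.path.join(self.disk_dir, f"{key}.json")

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        # Spill the coldest entries once memory is over budget.
        while len(self.hot) > self.max_in_memory:
            cold_key, cold_value = self.hot.popitem(last=False)
            with open(self._disk_path(cold_key), "w") as f:
                json.dump(cold_value, f)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        # Read-through from the disk tier and promote back to memory.
        with open(self._disk_path(key)) as f:
            value = json.load(f)
        self.put(key, value)
        return value

store = OverflowStore(max_in_memory=2)
store.put("a", 1)
store.put("b", 2)
store.put("c", 3)        # "a" is the coldest entry and spills to disk
print(store.get("a"))    # transparently read back from the disk tier
```

The point of the sketch is the transparency: callers just `put` and `get`, while the store decides which tier each value lives in, which is the behavior Hugh describes for data that no longer fits in memory.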
Ryan: What’s your favorite “Big Data changed the world” story?
Hugh: Here’s a fun story. When I was at Microsoft, I decided to research what drugs caused stomach cramps by looking at what queries customers ran in sessions on Microsoft’s search engine. I reverse engineered a list of drugs that caused stomach cramps, and checked the FDA literature – and, sure enough, it was right.
Ryan: How does Cloud Foundry fit into the Big Data / Hadoop storyline?
Hugh: Today they’re somewhat separate stories, but they won’t be for long. It’s of critical importance to the future of PaaS and the future of Big Data that they converge. In the future, most applications will be data-centric, and great companies will be built on those applications. Developers are demanding the convergence. PaaS and Big Data exist within Pivotal to build the future platform for software.