Strata & Hadoop World 2014 Recap

Preventing terrorist attacks, feeding the hungry, capturing bad guys, and enabling cubic splined segmented continuous polynomial non-linear regression models. I promise to try and explain the last one later in this blog.

This week was Strata + Hadoop World, a fast-growing convention and exposition aimed squarely at Statisticians, Engineers, and Data Scientists. The topics were diverse and ranged from machine learning to the Michael J. Fox Foundation’s use of Hadoop to help detect Parkinson’s disease earlier in its progression.

What is clear from the messaging this year is that Hadoop has become a mainstream technology inside organizations. Customers from all walks of life spoke about their current projects, their planned projects, and how they changed their business, the economy, the world, or even saved lives.

One of our customers described a case in which their software, running with Cloudera Hadoop and Isilon, was used to save a life: “A child contacted a helpline service online, indicating that he had self-harmed and was intending to commit suicide. This was passed on to CEOP who acquired the communications data to reconcile the IP address to an individual. They did so in a very short space of time and passed it on to the local police force. When they got into the address the child had already hanged himself, but was still breathing. If there had been any delay, or if the child had been unlucky enough to be using one of those service providers that do not keep subscriber data relating to IP addresses, that child would now be dead.” – Page 11 at http://www.publications.parliament.uk/pa/jt201213/jtselect/jtdraftcomuni/79/79.pdf

We also saw Mark Grabb of GE explain their use of EMC technology to create the Industrial Internet and what that means to the innovation engine (pardon the pun) at GE.

[youtube_sc url=”http://www.youtube.com/watch?v=vrRvLmIF0nI”]

What we are most excited about this year is a fundamental transition that flips the assumption that data must be moved into a new repository before it can be included in analysis. Don’t get me wrong, data lakes simplify the management and correlation of data by getting as much into one place as possible. With that in mind, there are some fundamental issues we are starting to address. Take this math from a real customer: 130PB of object storage used to house video and images, plus 8PB of file data used for home directories, weblogs, click stream, and more. Add in a desire to run analysis on ALL of that data and you’ll need 3-4x the capacity in a central Hadoop system. Do you want to build a 400-500PB raw capacity hyper-converged Hadoop cluster? What if we can flip the process and offer the right storage solution for each kind of data, at the location where that data needs to live, sized for the primary workload that originally captures and uses it? That changes the conversation to creating a highly capable platform full of all of the ecosystem applications and pointing it at the data. I had the opportunity to discuss this flipping of the process with customers during a session at Strata and it was met with great enthusiasm.
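For anyone who wants to check that capacity math, here is a back-of-the-envelope sketch in Python. The function name is mine, and how to read the "3-4x" multiplier is my assumption: 3 matches the default HDFS replication factor and 4 leaves some headroom. Treat it as a rough illustration, not a sizing tool.

```python
# Source data volumes quoted by the customer, in petabytes.
OBJECT_PB = 130  # video and images in object storage
FILE_PB = 8      # home directories, weblogs, click stream, and more


def central_cluster_raw_pb(source_pb, multiplier):
    """Raw capacity needed to copy source_pb into one central cluster.

    The multiplier is an assumption for illustration: 3 matches the
    default HDFS replication factor, 4 adds headroom on top of that.
    """
    return source_pb * multiplier


total_pb = OBJECT_PB + FILE_PB
low_pb = central_cluster_raw_pb(total_pb, 3)
high_pb = central_cluster_raw_pb(total_pb, 4)

print(f"{total_pb} PB of source data -> {low_pb}-{high_pb} PB raw")
```

That works out to 138PB of source data and somewhere between 414PB and 552PB of raw capacity, in the same ballpark as the 400-500PB cluster mentioned above.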

Mike Olson announced the partnership with EMC and the enablement of Isilon as the first platform to be certified with CDH. See his blog at http://vision.cloudera.com/turn-your-data-lake-into-an-enterprise-data-hub/. It explores the idea of layering an Enterprise Data Hub above all of the data in data lakes to provide a central system for correlating data from many sources. Mike Olson and I discussed our newfound partnership with David Vellante on theCube.

[youtube_sc url=”http://www.youtube.com/watch?v=f3rS1DIRq8A”]

We could not be happier about these announcements and look forward to a long and mutually prosperous relationship. Let me say here that the Cloudera team includes some of the most humble and talented people in the world and they are a joy to work with. Tom Reilly, Cloudera CEO, and Mike Olson both took multiple stage opportunities to talk about the new partnership, from Mike’s keynote to Tom and Mike’s Cloudera Partner Summit discussions.

During the event, I had the great privilege to hold a joint “Jam Session” with Ailey Crow from Pivotal’s Data Science team. The goal of the session was to riff on projects we have worked on that range from Healthcare and Life Sciences to Government, Telco, and Banking. With a packed house, she and I had an incredible time answering questions, discussing use cases around Big Data, and more. Ailey is one of the smartest people I have met and I am truly honored to have shared the stage. One example from the discussion: banks using social sentiment analysis to look at stock trends, giving traders one more data point before investing in particular securities. In another, which Ailey spoke about, air quality information was correlated with patients who experience asthma and who haven’t refilled their prescriptions; the result enables notifications reminding those patients to refill their prescriptions when air quality drops below certain thresholds.

An offshoot conversation with Ailey and B. Scott Cassell (Director, Solutions Architecture for EMC) went into an idea B. Scott has for modeling the performance of storage. As he described what he was doing, Ailey explained that what he wanted to create was a “cubic splined segmented continuous polynomial non-linear regression model”. Roughly, that means building a model of performance from specific measured data points. To keep the model as accurate as possible, you break it into multiple chunks (segmented). To connect those segments smoothly, you use a cubic spline (I have no idea what that is – but they did). The end result is a graph of one continuous polynomial curve. Here is what one looks like:
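To make that a little less abstract, below is a minimal Python sketch of just the cubic spline piece. It is not B. Scott's model: the client counts and latency numbers are made up, the function names are mine, and it passes exactly through the points (a natural cubic spline) rather than fitting a regression to noisy measurements.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function that evaluates the natural cubic spline through
    the points (xs[i], ys[i]). xs must be strictly increasing."""
    xs, ys = list(xs), list(ys)
    n = len(xs) - 1  # number of segments
    if n < 1 or len(ys) != len(xs):
        raise ValueError("need at least two points and matching lengths")
    if any(xs[i] >= xs[i + 1] for i in range(n)):
        raise ValueError("xs must be strictly increasing")

    h = [xs[i + 1] - xs[i] for i in range(n)]  # segment widths

    # Solve the tridiagonal system for the quadratic coefficients c[i].
    # The end conditions c[0] = c[n] = 0 are what make the spline "natural".
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        alpha = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
        pivot = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / pivot
        z[i] = (alpha - h[i - 1] * z[i - 1]) / pivot

    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Pick the segment containing x; clamp to the first or last segment
        # so values outside the data range are extrapolated from the ends.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


# Hypothetical measurements: concurrent clients vs. average latency in ms.
CLIENTS = [1, 2, 4, 8, 16]
LATENCY_MS = [0.5, 0.6, 0.9, 1.9, 5.2]

predict_latency = natural_cubic_spline(CLIENTS, LATENCY_MS)
print(f"Estimated latency at 6 clients: {predict_latency(6):.2f} ms")
```

Each pair of neighboring points gets its own cubic polynomial (the segments), and the spline conditions force those pieces to meet with matching slope and curvature, which is what makes the overall curve continuous and smooth.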

Yep, that hurts my brain. And I bring it all up for good reason. This year at Hadoop World we began to see new products that do all of that for you and put together the neat graph or chart, or even turn it into an application (perhaps an easy-to-use performance predictor is in our future). Hadoop is becoming an underlying toolset that will be the base for the next generation of technology. Much like the RDBMS, Hadoop will soon be more of a general term and less of “the application”.

The EMC Federation was there in full force. The EMC booth displayed Isilon, Elastic Cloud Storage (ECS), and DCA. Experts on each platform manned the booth and hundreds of attendees came through to learn more about our HDFS storage solutions. Sitting just across from VMware and down the hall from Pivotal, I was reminded how strong a force the federation already is in the Big Data space. With the newly announced DCA+Isilon+Pivotal bundle v2, the federation is able to provide the “Data Lake in a box” that so many have been asking for. See the Press Release at http://wallstreetpr.com/emc-corporation-nyseemc-and-pivotal-unveils-data-lake-hadoop-bundle-2-0-34175

Aidan O’Brien (Head of Global Big Data Solutions for the EMC Federation) and Sam Grocott (SVP, Emerging Technologies Marketing & Product Management) discuss the newly formed Emerging Technologies Division and the plans for EVP solutions around Big Data.

[youtube_sc url=”http://www.youtube.com/watch?v=MQ1wXnwgyZo”]

People often ask me what excites me about working for a storage company. I like to answer them with a couple of key points. EMC is no longer a storage company in my mind. We are a data company. And we’re tackling challenges that had previously gone unsolved. EMC stores data, but by offering protocol access to that data, such as HDFS (Hadoop), EMC is able to unlock the potential of that data and allow new, harder questions to be asked. So whether you’re trying to prevent terror, increase food production, return kids to their parents, or answer a complex technology performance question in an easier way, EMC has the tools and rich partnerships to help you do that.

About the Author: Ryan Peterson