Optimizing Hadoop to Turn Big Data into Big Value

Without a doubt, Big Data is the hottest topic in enterprise IT since cloud computing came to prominence five years ago. And the most concrete technology behind the Big Data revolution is Hadoop. The potential for transformative business improvement is real. But just as real is the chance of a “Hadoop hangover” if a project fails to meet expectations and ends in costly failure.

Hadoop, an open source implementation of core Google technologies (the MapReduce programming model and the Google File System), provides an economical way to store and process masses of raw data. At last month’s Hadoop Summit in San Jose, Gartner analyst Merv Adrian spoke about Hadoop’s continuing maturity while reinforcing the need to close existing search, security and compliance gaps.

Most Hadoop pilot projects are still in the initial data capture stage: setting up workflows to capture raw business data, demographics and the “data exhaust” flowing from websites and social media. These data capture projects entail significant risk in their own right. Of course, collecting data is only the beginning. You also need to consider the role of machine learning, which allows algorithms to be “trained” by the data itself. Essentially, the data drives and refines the algorithms.

So, it’s critical to move beyond merely collecting data to producing smart algorithms, and that shift creates its own set of challenges. According to a Big Data survey conducted at the O’Reilly Strata event held in February, 88 percent of those surveyed reported problems with Hadoop, while only 24 percent had Hadoop projects currently in production.

The Googles and Amazons of the world succeeded in their Big Data projects largely because they were able to attract and retain some of the world’s most gifted data scientists. It’s well understood that the base skills required, such as statistics, algorithms and parallel programming, are in short supply. But even if the supply of data scientists increases, we will still face a more fundamental issue: this stuff is hard. It requires the ability to think across at least three complex specializations: competitive business strategy, machine learning algorithms and massively parallel data programming.

Compounding the problem is the lack of suitable tools for the data scientist. Hadoop and other data stores supply a brute force engine for computation and data storage. Hadoop clusters can consist of potentially thousands of commodity servers, each with its own disk storage and CPUs. Data is stored redundantly across nodes in the cluster, and the MapReduce programming model distributes processing across all of those nodes. The result is an amazingly cost-effective way of spreading computation across thousands of CPUs and disks.
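
To make that concrete, here is a minimal sketch of the canonical word-count job, written in Java against Hadoop’s MapReduce API (the input and output paths come from illustrative command-line arguments). The map step runs on whichever nodes hold the input blocks and emits intermediate key/value pairs; the reduce step aggregates them across the cluster.

    // WordCount.java -- a minimal sketch of the canonical Hadoop word-count job.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // The map step runs on the nodes holding the input blocks and emits (word, 1) pairs.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // The reduce step sums the counts for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Even this trivial counting task takes roughly 50 lines of boilerplate, which hints at the tooling problem.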

But programming in MapReduce is akin to programming in assembly language, which is not a practical way to create Big Data algorithms. To turn Big Data into big value, data scientists need tools that support statistical hypothesis testing, the creation and training of predictive models, and reporting and visualization. Open source projects such as Mahout, Weka and R provide a starting point, but none of them is easy to use, and they often don’t scale sufficiently.
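
Mahout illustrates both halves of that claim. A user-based recommender takes only a handful of lines with its Taste API, but this particular implementation runs in the memory of a single machine, so it reaches its limits well before Big Data scale. In the sketch below, the ratings.csv file (userID,itemID,preference triples) and the user ID are illustrative assumptions:

    // RecommenderSketch.java -- a minimal user-based recommender using Mahout's Taste API.
    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderSketch {
      public static void main(String[] args) throws Exception {
        // Each line of the (illustrative) ratings.csv: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top three recommendations for a hypothetical user 42.
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " scored " + item.getValue());
        }
      }
    }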

As it stands, only the largest enterprises can attract the limited supply of “rock star” data scientists. Left unchecked, this trend means the data-driven business models promised by the Big Data revolution will be accessible only to large enterprises, leaving small and medium businesses out in the cold.

Earlier this year, Dell Software enhanced its Kitenga Analytics Suite to help improve the productivity of the sophisticated data scientist while also making data science more accessible to those just getting started. In doing so, we hope to make it possible for a wider range of businesses to realize the transformative potential of data-driven business models.

Kitenga allows drag-and-drop construction of complex Hadoop workflows, minimizing programming and maximizing productivity. It provides rich content mining capabilities for data held in Hadoop, along with features such as sentiment analysis, visualization and job monitoring. The Toad BI suite also supports the data scientist by allowing ad hoc queries of Hadoop data through Hive and HBase.
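
To give a sense of what such an ad hoc query involves, HiveQL statements can be submitted to a cluster through the standard Hive JDBC driver, and Hive compiles them into MapReduce jobs behind the scenes. The host name, credentials, table and column names in this sketch are illustrative assumptions, not part of either product:

    // HiveAdHocQuery.java -- a minimal sketch of an ad hoc HiveQL query over Hadoop data.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveAdHocQuery {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // load the Hive JDBC driver
        // HiveServer2 endpoint, user and table are illustrative.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT referrer, COUNT(*) AS hits "
                     + "FROM web_clicks GROUP BY referrer "
                     + "ORDER BY hits DESC LIMIT 10")) {
          while (rs.next()) {
            System.out.println(rs.getString("referrer") + "\t" + rs.getLong("hits"));
          }
        }
      }
    }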

The Big Data revolution stands to benefit us all through a richer and more personalized digital lifestyle. It’s our mission to make sure that the business benefits of Big Data are available not just to the Fortune 500 but to enterprises at every level.

I’m curious to learn how enterprises are ensuring the success of their early Hadoop Big Data projects.  Drop me a line at guy.harrison@software.dell.com to share your story.

About the Author: Guy Harrison