Over the past few years, usage of the Internet has increased massively. People are now accessing email, using social networking sites, writing blogs, and creating websites. As a result, petabytes of data are generated every moment. Enterprises today are trying to derive meaningful insight from this data and translate it into business value, as well as into features and functionality for their products.
Huge volumes of a great variety of data, both structured and unstructured, are being generated at an unprecedented velocity and in many respects, that is the easy part! It is the “gather, filter, derive and translate” part that has most organizations tied up in knots. This is the genesis of today’s focus on Big Data solutions.
Previously, we used traditional enterprise architectures consisting of one or more servers, storage arrays, and storage area networks connecting the servers to the storage arrays. This architecture was built for compute-intensive applications that require a lot of processing cycles but operate mostly on a small subset of application data. In the era of Big Data, petabytes of data are generated every day, and our customers want to sort and derive business value from them. In fact, the Big Data generated every day must be sorted first in order to be analyzed, which is a massively data-intensive operation. Handling this volume of data in a manner that meets the performance requirements of the business drives us toward a clustered architecture. Clustered architectures typically consist of hundreds or thousands of simple, basic components. The computational capability of each individual component may be modest, but as a scalable group they can perform massive compute- and data-intensive operations efficiently.
The challenge with clustered architectures is that when we are dealing with hundreds or thousands of nodes, some of them can and will fail: the larger the number of nodes, the more often it will happen. So the software managing the cluster, and the applications running on it, must detect and respond to those failures in an automated and efficient manner. Moreover, data needs to be efficiently distributed among the nodes, and in many cases petabytes of data need to be replicated as well.
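One common way cluster software detects failed nodes is with heartbeats: each node periodically reports in, and a monitor marks any node that misses its window as dead so its work can be reassigned. The sketch below is illustrative only, not Hadoop's actual implementation; the class name, timeout value, and node names are all hypothetical.

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds; an illustrative value, not an HDFS default

class ClusterMonitor:
    """Track the last heartbeat per node; flag nodes that miss the timeout."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}  # node name -> timestamp of last heartbeat

    def heartbeat(self, node, now=None):
        # Record that we heard from this node (now= allows deterministic testing).
        self.last_seen[node] = time.time() if now is None else now

    def dead_nodes(self, now=None):
        # Any node silent for longer than the timeout is presumed failed.
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

mon = ClusterMonitor()
mon.heartbeat("node-1", now=0.0)
mon.heartbeat("node-2", now=3.0)
print(mon.dead_nodes(now=8.0))  # node-1 missed its window -> ['node-1']
```

Once a node appears in the dead list, the cluster manager would re-replicate its data from surviving copies and reschedule its tasks elsewhere.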
This has been the driving force behind Hadoop.
Let’s think of a simple problem. People all over the world have written blogs about various EMC products and we have aggregated all of them. The aggregated file is now petabytes in size and we want to find out something simple like how many times each EMC product is mentioned in the blogs. This looks difficult at first blush, but using Hadoop we can get the data we need in a very efficient manner.
A file may be petabytes in size, but at its most basic level it still consists of a number of blocks. Using the Hadoop Distributed File System (HDFS), we store these blocks across a cluster of hundreds of nodes, replicating each block a certain number of times on different nodes. HDFS takes care of fault tolerance: if any node fails, its work can automatically be assigned to another node.
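The split-and-replicate idea can be sketched in a few lines. This is a toy model, not HDFS itself: the block size, replication factor, node names, and hash-based placement below are all illustrative assumptions (HDFS actually uses rack-aware placement and much larger blocks, typically 64 MB or 128 MB).

```python
import hashlib

BLOCK_SIZE = 4    # bytes; tiny for illustration (HDFS blocks are 64/128 MB)
REPLICATION = 3   # HDFS's default replication factor
NODES = ["node-%d" % i for i in range(10)]  # hypothetical cluster

def split_into_blocks(data, size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_block(block_id, nodes=NODES, copies=REPLICATION):
    """Choose `copies` distinct nodes for one block via simple hashing."""
    digest = int(hashlib.md5(str(block_id).encode()).hexdigest(), 16)
    start = digest % len(nodes)
    return [nodes[(start + k) % len(nodes)] for k in range(copies)]

blocks = split_into_blocks(b"petabytes of blog data")
placement = {i: place_block(i) for i in range(len(blocks))}
```

Because every block lives on several nodes, losing one node never loses data, and computation can be sent to whichever replica is least busy.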
Hadoop then uses MapReduce, which consists of two phases: the Map phase and the Reduce phase. In the Map phase, each participating node gets a pointer to the Mapper function and the addresses of the file blocks collocated on that node. In our example, we need to find out how many times each EMC product is mentioned in the file, so the problem is solved on each node against the file blocks residing on that node. Each Mapper emits <key, value> pairs; in our case these are <emc_product, number of times it appears in the file blocks>. In the Reduce phase, we consolidate all those <key, value> pairs and aggregate them to find out how many times each EMC product is mentioned in the file as a whole.
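The two phases above can be sketched in plain Python. This is a single-process illustration of the Map and Reduce logic, not Hadoop's Java API; the product names and sample blog text are invented for the example.

```python
from collections import defaultdict

# Hypothetical set of EMC product names to count; illustrative only.
EMC_PRODUCTS = {"Isilon", "Atmos", "VNX"}

def mapper(block):
    """Map phase: emit a <key, value> pair for each product mention in one block."""
    for word in block.split():
        token = word.strip(".,;!?")
        if token in EMC_PRODUCTS:
            yield (token, 1)

def reducer(pairs):
    """Reduce phase: aggregate the counts for each key across all mapper outputs."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# In Hadoop each block would be mapped on the node holding it; here we
# simulate two blocks on one machine.
blocks = [
    "I benchmarked Isilon against Atmos last week.",
    "Isilon scaled well; the VNX array did too.",
]
pairs = [pair for block in blocks for pair in mapper(block)]
print(reducer(pairs))  # {'Isilon': 2, 'Atmos': 1, 'VNX': 1}
```

The key point is that the Map work is embarrassingly parallel: each node counts only its local blocks, and only the small <key, value> pairs, not the petabytes of raw text, travel across the network to the reducers.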
The same issues that companies like Google and Yahoo faced in the early 2000s are now being faced by many enterprises. According to Yahoo, by the second half of the decade, 50% of enterprise data will be processed and stored using Hadoop1.