95% of clients starting Hadoop projects don’t have an established use case, so selecting the right distribution is often a shot in the dark. You may start off with Hortonworks for a dev/test environment, then realize that Pivotal HD is a better choice for enterprise-class deployment. The good news is that if you start your Hadoop project on EMC Isilon scale-out NAS, there is zero data migration when moving from one Hadoop distribution to another. In fact, you can run multiple Hadoop distributions against the same data – no duplication of data required.
All this makes sense to me. Use Isilon scale-out NAS as the native storage layer for Hadoop, and the entire Hadoop environment becomes more flexible. But wait, there’s more. Using Isilon storage with Hadoop instead of a traditional DAS configuration also makes the environment easier and faster to deploy, more reliable, and in some cases delivers a lower TCO than DAS.
Decoupling the Hadoop compute and storage layers may lead you to believe there is a performance hit. Not true. You can expect up to 100GB/s of concurrent throughput on the Hadoop storage layer with Isilon. Additionally, by off-loading storage-related HDFS overhead to Isilon, Hadoop compute farms can be better utilized for running analysis jobs instead of managing local storage.
You may think I am biased towards Isilon because I do Big Data Marketing for EMC. Not true. I genuinely believe Isilon is a better choice for Hadoop than traditional DAS, for the reasons covered in the Q&A below, based on my interview with Ryan Peterson, Director of Solutions Architecture at Isilon.
1. What Hadoop distributions does Isilon support?
Isilon fully supports all Hadoop distributions – Pivotal HD, Cloudera, and Hortonworks – with portability between distributions. If you start off with Hortonworks and later realize it is not meeting your needs, you simply switch over to another distribution without migrating any data. In fact, you can run several different Hadoop distributions against the same data, or against different data managed in Isilon.
2. How does Isilon make Hadoop easier and faster to deploy?
With a traditional Hadoop environment, it is very common for deployment to take several days. And once you do stand up your Hadoop infrastructure, there is no easy way to get data into the system – you have to go through ingest servers, which causes additional delay.
When using Isilon with Serengeti (VMware’s virtualization solution for Hadoop), you can deploy any Hadoop distribution with a few commands in a few hours. Additionally, you can get data into Hadoop very quickly and start analyzing it through Isilon’s multi-protocol support – HDFS, NFS, CIFS, FTP, and HTTP, to name a few.
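From the Hadoop side, pointing a cluster at Isilon for HDFS is largely a configuration change rather than a data migration. As a rough sketch (the hostname and port below are placeholders, not a vendor-documented setup), the cluster’s core-site.xml would direct HDFS traffic at the Isilon cluster instead of a local NameNode:

```xml
<!-- core-site.xml: route HDFS requests to the Isilon cluster.
     "isilon.example.com" is a placeholder hostname for illustration. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://isilon.example.com:8020</value>
</property>
```

Because the storage layer stays put, switching Hadoop distributions later means redeploying compute, not copying data.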
Most clients don’t know what they are going to do with Hadoop and need to run experiments in a development environment. Isilon with Serengeti is the fastest and cheapest way to spin up your development environment.
3. HDFS is not natively visible to Windows, Unix, Linux, Apple, or any other file system, which makes getting data in and out of Hadoop manual and slow. How does Isilon make it simpler and faster to get data in and out of Hadoop?
With a traditional Hadoop environment, there is a two-step hop to get data in and out of Hadoop, since an ingest or staging server running an application such as Flume is required. This increases the time and resources needed to get data in and out of Hadoop.
With Isilon, all nodes can handle HDFS requests directly, removing the choke point and improving performance, since every node works together to move data in and out of Hadoop. This is possible through open-source-compliant HDFS RPC calls natively built into Isilon. And to support multiple application workflows, Isilon nodes can handle requests for all protocols (HDFS, NFS, CIFS, FTP, HTTP, etc.) simultaneously, while maintaining fast write performance.
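To make the contrast with a single ingest server concrete, here is a minimal Python sketch of the idea: instead of funneling every file through one staging host, writes are spread across all nodes. The node names are hypothetical placeholders; on a real cluster this balancing typically happens via DNS rather than client code.

```python
from itertools import cycle

# Hypothetical node addresses -- placeholders for illustration only.
NODES = ["node-1", "node-2", "node-3", "node-4"]

def assign_ingest_targets(files, nodes=NODES):
    """Round-robin each incoming file to a storage node, mimicking a
    cluster where every node can accept HDFS write requests directly
    (versus funneling all traffic through one ingest server)."""
    node_cycle = cycle(nodes)
    return {f: next(node_cycle) for f in files}

files = [f"part-{i:04d}" for i in range(8)]
for f, node in assign_ingest_targets(files).items():
    print(f, "->", node)
```

With four nodes and eight files, each node receives two files, so no single host becomes the ingest bottleneck.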
4. How does Isilon provide the highly available data environment required for enterprise-class Big Data applications?
In a traditional Hadoop environment, failover is a manual process, which equates to some downtime. There is no such downtime with Isilon: every Isilon node is a NameNode, clustered together so that any node can answer a NameNode request. This is possible because Isilon’s native metadata management presents its metadata as HDFS metadata.
5. Hadoop is already inexpensive. How does Isilon make the system even more inexpensive?
To answer this question, let’s start by addressing the premise of a traditional Hadoop environment, whereby the system requires 3x replication of data for fault tolerance and data locality. If you add backup and disaster recovery to the environment, you may have 8 or more copies of the data. To keep things simple, let’s say a traditional Hadoop environment holds at least 3 copies of the data.
Using Isilon with Hadoop requires only a single copy of the data: instead of replicating blocks, Isilon protects data with Reed-Solomon erasure coding, and its storage nodes have the throughput to respond quickly to data requests. A traditional Hadoop environment therefore needs extra servers simply to manage the 3x copies of the data, so in some cases it will require more hardware than a Hadoop environment with Isilon. And the more hardware you have, the more energy and floor space you need, driving up the cost of your traditional Hadoop environment.
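The capacity arithmetic behind that claim is easy to check. Below is a minimal sketch assuming a hypothetical 16+4 Reed-Solomon layout (16 data stripes plus 4 parity stripes); the actual protection level on a real cluster is configurable, so 16+4 is an illustrative assumption, not a vendor specification.

```python
def raw_capacity_needed(usable_tb, overhead_factor):
    """Raw storage required to hold `usable_tb` of actual data."""
    return usable_tb * overhead_factor

# Traditional HDFS: 3 full copies of every block -> 3.0x raw-to-usable.
hdfs_factor = 3.0

# Reed-Solomon erasure coding with an assumed 16+4 layout:
# (16 data + 4 parity) / 16 data = 1.25x raw-to-usable.
data_stripes, parity_stripes = 16, 4
ec_factor = (data_stripes + parity_stripes) / data_stripes

usable = 100  # TB of actual data
print(f"3x replication:    {raw_capacity_needed(usable, hdfs_factor):.0f} TB raw")
print(f"Reed-Solomon 16+4: {raw_capacity_needed(usable, ec_factor):.0f} TB raw")
```

Under these assumptions, 100 TB of data needs 300 TB of raw disk with 3x replication but only 125 TB with erasure coding, which is where the hardware, power, and floor-space savings come from.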
6. Let’s talk about use cases. Does Isilon work well in all Hadoop use cases?
We know that Hadoop with Isilon performs very well in batch-processing workloads; however, our competitors claim it may not perform well in Cassandra-style real-time analytic workloads. But 99% of Hadoop use cases are batch-processing workloads, so I’m not going to worry about addressing the 1% that use Cassandra. Having said that, we do plan to test Cassandra-like workloads to confirm whether this claim is true.
Are you an Isilon believer now? Click here to learn more about Isilon for Big Data.