Increasingly, oil and gas companies are looking to big data and analytics for a new approach to answering some of their hardest questions. One of the foundational components of this is the Hadoop Distributed File System (HDFS). HDFS acts as a unifying persistence layer for many of the big data and analytical tools on the market (Pivotal’s and other vendors’). Whilst many companies have looked to Hadoop clusters to provide both storage and compute, EMC has recognized that this approach brings a number of challenges, including:
- If storage sits inside a Hadoop cluster, a (potentially time-consuming) ETL task is needed to move data from where it lives into the cluster. And as soon as the ETL process completes, the copied data is already out of sync with the source.
- To increase storage, it is also necessary to increase compute, which can create an imbalance between compute and storage capacity. This is further exacerbated by the need to buy Hadoop distribution licenses for each node.
- Because Hadoop HDFS is designed to run on cheap commodity hardware, it ensures availability by maintaining three (or more) copies of all data. This leads to much greater raw storage requirements than traditional storage environments (under 33% usable capacity).
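To put a rough number on that overhead, here is a minimal sketch; the 300 TB cluster size is a made-up illustration, not a figure from EMC:

```python
# Rough storage arithmetic for HDFS's default block replication.
# The 300 TB raw capacity below is a hypothetical example.

def usable_fraction(replication_factor: int) -> float:
    """Fraction of raw capacity left for unique data under n-way replication."""
    return 1.0 / replication_factor

raw_tb = 300  # hypothetical raw cluster capacity
unique_tb = raw_tb * usable_fraction(3)
print(f"{usable_fraction(3):.1%} usable")    # 33.3% usable
print(f"{unique_tb:.0f} TB of unique data")  # 100 TB of unique data
```

In practice filesystem and operating overheads push the usable figure below the theoretical one third, which is where the "under 33%" comes from.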
- All metadata requests to an HDFS cluster must be directed to a single NameNode. Although it is possible to configure a standby NameNode in Active/Passive mode, failover is fragile and recovery is not straightforward.
To address these challenges, EMC has developed three storage solutions (with a fourth coming soon):
- EMC Isilon provides high-performance HDFS storage as an additional protocol. This means any data copied to the Isilon cluster over CIFS or NFS can also be made available through HDFS. The storage is far more efficient because data protection uses Isilon’s built-in parity scheme, so only one copy of each file (plus parity) is stored. In addition, every Isilon node acts as both a NameNode and a DataNode, giving much higher performance and availability with no single point of failure.
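The efficiency difference is easy to quantify. The sketch below compares three-way replication with a data-plus-parity layout; the 16+2 stripe is a hypothetical illustration, not Isilon’s actual protection setting:

```python
# Compare usable capacity: n-way replication vs. data+parity striping.
# The 16+2 stripe geometry is a hypothetical illustration.

def replication_usable(copies: int) -> float:
    """Usable fraction when every block is stored in full n times."""
    return 1.0 / copies

def parity_usable(data_units: int, parity_units: int) -> float:
    """Usable fraction when each stripe holds data units plus parity units."""
    return data_units / (data_units + parity_units)

print(f"3x replication: {replication_usable(3):.0%} usable")  # 33% usable
print(f"16+2 parity:    {parity_usable(16, 2):.0%} usable")   # 89% usable
```

Under these assumptions, parity protection yields well over twice the usable capacity of three-way replication for the same raw storage.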
- EMC Elastic Cloud Storage (ECS) provides a highly scalable, geo-distributed object store with full HDFS support. ECS is available either as an appliance (on low-cost EMC commodity hardware) or as software (in a ‘bring your own tin’ model). It is highly compelling for companies looking to build vast geo-distributed object stores, and also for archiving workflows (especially seismic acquisition data).
- EMC ViPR Data Services (VDS) exposes commodity and third-party storage systems over the HDFS protocol. For storage systems that do not natively support HDFS, VDS can be layered on top to make the data available via HDFS.
Using these technologies, EMC makes it very easy to deliver on an ‘HDFS Anywhere’ strategy, but what are the compelling reasons for doing this?
- By making the entire multi-vendor storage real estate available through HDFS, big data and analytical tools can be layered on top of the enterprise persistence layer allowing in-place analytics without having to perform any ETL tasks. This capability delivers cost reduction, reduced cycle times and increased productivity.
- As companies seek to deploy the new generation of cloud-native applications, it is essential (particularly in oil and gas) to have an integrated environment where old and new applications sit on top of common persistence layers. This is a defining characteristic of contemporary IT systems as companies embrace bimodal IT strategies.
At EMC we are increasingly hearing from oil and gas companies that, to achieve their efficiency targets and cost reductions, they need a concise roadmap for consolidating their legacy applications into an environment that supports and embraces the next generation of mobile and big data analytical apps. HDFS Anywhere is one element of the strategy to achieve this.
For many oil companies, the ability to run big data analytics against all of their structured, semi-structured and unstructured data is compelling. Removing the need for complex ETL tasks, and the analytical latency they inevitably introduce, unlocks new analytics use cases and gives legacy vendors a straightforward roadmap for migrating their applications to the 3rd Platform.
PS If you’d like to know more, swing by our booth #2511 at SEG in New Orleans (18-21 October 2015).