Sunday, July 14, 2013

What is new in Hadoop 2

Upcoming release of Hadoop, is becoming a major milestone in Hadoop development containing several significant improvements in HDFS and MapReduce(YARN) and also includes a very important new capabilities as well.

Hadoop 2 will be delivering a first release of new features like HDFS improvements including new append-pipeline, federation, wire compatibility, Namenode High Availability, HDFS Snapshots, better storage density and file formats, Caching and hierarchical storage management  and performance improvements. It is covering architectural improvements in High Availability of Namenode, Federation and Snapshots. Apache Hadoop YARN is the new basis for running MapReduce and other applications on a Hadoop cluster. It representing Hadoop as a more generic data integration and processing system. As we already discussed about MapReduce 2 (YARN) providing many more generic functionalities on data processing by simple and efficient ways.


One very good feature I would like to focus more is Namenode High Availability. Earlier versions of Hadoop has a single Nomenode controlling over the cluster, but it becoming a single point of failure(SPOF), if Namenode machine is unavailable, cluster as a whole would be unavailable till it either rebooted or replaced by another machine. Namenode High Availability feature address the same problem by providing option by providing two Namenodes (introduced StandbyNode, a hot backup of HDFS Namenode) sharing a same cluster with active/passive configuration.

Today in the market Hadoop 2.0.5-alpha version is available but still in under development, it includes new developer and user-facing incompatibilities, features, and major improvement. You can find the Hadoop 2.0.5-alpha release notes here. It is not really available for production but we can explore it for learning purpose and developing your POCs.