The Big Object: Hadoop YARN


Sunday, March 12, 2017

My First Publication: YARN Essentials

YARN is the next generation generic resource platform used to manage resources in a typical cluster and is designed to support multi-tenancy in its core architecture. As optimal resource utilization is central to the design of YARN, learning how to fully utilize the available fine-grained resources (RAM, CPU cycles, and so on) in the cluster becomes vital.

This book is an easy-to-follow, self-learning guide to help you start working with YARN. Beginning with an overview of YARN and Hadoop, you will dive into the pitfalls of Hadoop 1.x and how YARN takes us to the next level. You will learn the concepts, terminology, architecture, core components, and key interactions, and cover the installation and administration of a YARN cluster as well as learning about YARN application development with new and emerging data processing frameworks.

Follow the link below for more details: https://www.packtpub.com/big-data-and-business-intelligence/yarn-essentials

Thank you!

Tuesday, May 24, 2016

Hazelcast : In-Memory NoSQL Solution

If you are evaluating high-performance NoSQL solutions such as Redis, Riak, Couchbase, MongoDB, or Cassandra, or, in rarer cases, caching solutions such as Memcached or Ehcache, your best choice may turn out to be Hazelcast. Hazelcast takes a considerably different approach from any of the above projects, and yet for some people looking for a key-value store it may be the best option.

So let's look at Hazelcast and why it can be a better alternative to the systems mentioned above. Strictly speaking, Hazelcast is an in-memory data grid, not a NoSQL database.

First, the advantages and disadvantages of an in-memory solution such as Hazelcast. As a key-value store held in memory, it has some built-in advantages: speed and read efficiency. Storing a map in memory also has natural disadvantages: scalability and volatility. RAM is more expensive than disk, so the RAM in a machine is usually far smaller than the available disk space, and an in-memory store is constrained by that limit. RAM is also volatile storage: its contents are lost when the process restarts, which results in data loss. Hazelcast addresses these shortcomings of in-memory stores with an efficient and convenient solution.

In Hazelcast, the scalability issue is addressed by clustering. By joining hundreds of nodes into a cluster, we can aggregate terabytes of in-memory space to hold Hazelcast maps. Of course, this still does not compare with disk space, which is typically far larger than memory (roughly 100 times, as a rule of thumb), but depending on the use case, a few terabytes is sufficient for in-memory operations; otherwise, Hazelcast can be used together with a backend data store.

In Hazelcast, volatility is handled by peer-to-peer data distribution: every block of data has multiple copies, replicated across the cluster in different locations. If a node or rack goes down, the data can be recovered from the copies held elsewhere. The number of backup copies is configurable, depending on how critical the data is. Too many copies of the data reduce the working memory available for other operations. Hazelcast addresses this tuning problem by providing ways to make the cluster both available and reliable, including control over where backups are placed. For example, servers on one rack can be set to back up to another rack, so the failure of an entire rack can be handled gracefully.
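The trade-off between backup copies and usable memory can be made concrete with a little arithmetic. The sketch below is a simplified model, not Hazelcast code: it assumes every backup copy costs as much memory as the primary copy, and the node count and per-node RAM figures are made up for illustration.

```python
def usable_capacity_gb(nodes: int, ram_per_node_gb: float, backup_count: int) -> float:
    """Approximate space left for primary data when each entry is stored
    once as a primary and `backup_count` more times as a backup."""
    if nodes <= 0 or ram_per_node_gb < 0 or backup_count < 0:
        raise ValueError("nodes must be positive; sizes and counts must be non-negative")
    total_gb = nodes * ram_per_node_gb
    # Every entry occupies (1 + backup_count) slots of cluster memory.
    return total_gb / (1 + backup_count)


# Hypothetical cluster: 100 nodes, each with 64 GB set aside for the data grid.
for backups in (0, 1, 2):
    print(backups, usable_capacity_gb(100, 64, backups))
```

Under these assumptions, each extra backup copy shrinks the space available for distinct data, which is why the backup count is worth tuning per map rather than maximizing everywhere.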

Hazelcast also addresses the rebalancing problem. Whenever a node is added to or removed from the cluster, data may have to move across the cluster to rebalance it. If a node crashes, the primary copies it held have to be re-owned: a node holding the secondary (replica) copy becomes the new primary, and that data is backed up again on another node to make the cluster fail-safe. This process consumes cluster resources such as CPU, RAM, and network, and may add latency while it runs.
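The re-owning step described above can be sketched as a toy model. This is not how Hazelcast is implemented; the `Partition` class, the `handle_node_failure` function, and the node names are hypothetical, and each partition is assumed to have exactly one backup.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Partition:
    primary: str  # node that owns the live copy
    backup: str   # node that holds the replica


def handle_node_failure(partitions: Dict[int, Partition],
                        nodes: List[str], dead: str) -> List[str]:
    """Promote backups and re-create lost replicas after `dead` leaves.
    Returns the list of surviving nodes."""
    alive = [n for n in nodes if n != dead]
    if len(alive) < 2:
        raise RuntimeError("need at least two surviving nodes to keep a backup")
    for pid, part in partitions.items():
        if part.primary == dead:
            # The replica becomes the new primary; its old backup slot is now lost.
            part.primary = part.backup
            part.backup = dead
        if part.backup == dead:
            # Copy a fresh backup to a surviving node other than the primary.
            candidates = [n for n in alive if n != part.primary]
            part.backup = candidates[pid % len(candidates)]
    return alive


nodes = ["node-a", "node-b", "node-c"]
partitions = {
    0: Partition(primary="node-a", backup="node-b"),
    1: Partition(primary="node-b", backup="node-c"),
    2: Partition(primary="node-c", backup="node-a"),
}
survivors = handle_node_failure(partitions, nodes, dead="node-a")
```

Every partition touched by the failure needs a data copy over the network, which is where the extra CPU, RAM, and network load during rebalancing comes from.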

In addition to the above benefits, Hazelcast also makes sure that Java garbage collection does not affect the terabytes of data stored in memory. As the heap gets bigger, garbage collection can delay your application's response time. Hazelcast therefore offers a memory store with native (off-heap) storage support, so the garbage collector does not cause delays in the application, resulting in better efficiency and throughput.

Wednesday, March 16, 2016

YARN Essentials

If you have a working knowledge of Hadoop 1.x but want to start afresh with YARN, this book is ideal for you. You will be able to install and administer a YARN cluster and also discover the configuration settings to fine-tune your cluster both in terms of performance and scalability. This book will help you develop, deploy, and run multiple applications/frameworks on the same shared YARN cluster.

YARN is the next generation generic resource platform used to manage resources in a typical cluster and is designed to support multi-tenancy in its core architecture. As optimal resource utilization is central to the design of YARN, learning how to fully utilize the available fine-grained resources (RAM, CPU cycles, and so on) in the cluster becomes vital.

This book is an easy-to-follow, self-learning guide to help you start working with YARN. Beginning with an overview of YARN and Hadoop, you will dive into the pitfalls of Hadoop 1.x and how YARN takes us to the next level. You will learn the concepts, terminology, architecture, core components, and key interactions, and cover the installation and administration of a YARN cluster as well as learning about YARN application development with new and emerging data processing frameworks.

Follow the link below for more details:
https://www.packtpub.com/big-data-and-business-intelligence/yarn-essentials

Thank you!

Sunday, July 14, 2013

What is new in Hadoop 2

The upcoming release of Hadoop is a major milestone in Hadoop development. It contains several significant improvements to HDFS and MapReduce (YARN) and also includes some very important new capabilities.

Hadoop 2 will deliver the first release of a number of new features. The HDFS improvements include a new append pipeline, federation, wire compatibility, Namenode High Availability, HDFS snapshots, better storage density and file formats, caching and hierarchical storage management, and performance improvements. The architectural changes cover Namenode High Availability, federation, and snapshots. Apache Hadoop YARN is the new basis for running MapReduce and other applications on a Hadoop cluster, and it positions Hadoop as a more generic data integration and processing system. As already discussed, MapReduce 2 (YARN) provides much more generic data processing functionality in a simple and efficient way.

One feature I would like to focus on is Namenode High Availability. Earlier versions of Hadoop have a single Namenode controlling the cluster, which makes it a single point of failure (SPOF): if the Namenode machine is unavailable, the cluster as a whole is unavailable until the Namenode is either rebooted or replaced by another machine. The Namenode High Availability feature addresses this problem by providing two Namenodes in the same cluster in an active/passive configuration; the newly introduced StandbyNode is a hot backup of the HDFS Namenode.
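The active/passive idea can be illustrated with a very small model. This is only a sketch, not the HDFS implementation: the `NameNodePair` class and the names `nn1` and `nn2` are hypothetical. Real HDFS HA also has to keep the standby's namespace in sync and prevent both nodes from acting as active at once; none of that is modelled here.

```python
from typing import Optional


class NameNodePair:
    """Toy model of an active/standby pair of Namenodes."""

    def __init__(self, first: str, second: str) -> None:
        self.state = {first: "active", second: "standby"}
        self.healthy = {first: True, second: True}

    def active(self) -> Optional[str]:
        """Return the Namenode currently serving clients, if any."""
        for name, role in self.state.items():
            if role == "active" and self.healthy[name]:
                return name
        return None

    def fail(self, name: str) -> None:
        """Mark a Namenode as down; promote the standby if the active one failed."""
        self.healthy[name] = False
        if self.state[name] == "active":
            self.state[name] = "down"
            for other, role in self.state.items():
                if role == "standby" and self.healthy[other]:
                    self.state[other] = "active"
                    break


pair = NameNodePair("nn1", "nn2")
before = pair.active()
pair.fail("nn1")
after = pair.active()
print(before, "->", after)
```

With a single Namenode, the equivalent of `fail` would leave no active node at all, which is exactly the SPOF the feature removes.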

Today the Hadoop 2.0.5-alpha version is available, but it is still under development; it includes new developer- and user-facing incompatibilities, features, and major improvements. You can find the Hadoop 2.0.5-alpha release notes here. It is not really ready for production, but you can explore it for learning purposes and for developing your POCs.

Friday, June 28, 2013

Apache Hadoop YARN : Next Generation MapReduce

MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have MapReduce v2, or YARN.

The main idea behind MapReduce v2 (YARN) is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. MapReduce v2 has a global Resource-Manager (RM) and a per-application Application-Master (AM), where an application is either a single client job or a workflow of jobs.

The Resource-Manager (RM) has authority over the Node-Managers (NM), the per-node slaves, and arbitrates resources among all the applications in the system. The per-application Application-Master (AM) is, in effect, a framework-specific library; it is responsible for negotiating resources from the Resource-Manager and for working with the Node-Managers to execute and monitor the tasks.

With resource management and job scheduling/monitoring separated in this way, the Resource-Manager (RM) itself has two core components: the Scheduler and the Applications-Manager.

The Scheduler is responsible for allocating resources to the various running applications, according to their requirements and the cluster configuration. It is a pure scheduler: it does not monitor or track the status of the application. The Scheduler performs its scheduling function based on the resource requirements of the applications, using the abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk, and network.
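To make the Container idea more tangible, here is a minimal sketch of a scheduler that only matches resource requests against free capacity and keeps no application state. It is not the YARN API: `Resource`, `ToyScheduler`, and the node names are hypothetical, only memory and CPU are modelled, and the first-fit policy is an assumption chosen for brevity (real YARN schedulers also apply queue and capacity policies).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int

    def fits_in(self, other: "Resource") -> bool:
        """True if this request can be satisfied from `other`."""
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores


class ToyScheduler:
    """First-fit allocator: hands out containers, tracks no application status."""

    def __init__(self, node_capacity: Dict[str, Resource]) -> None:
        self.free = dict(node_capacity)

    def allocate(self, request: Resource) -> Optional[str]:
        """Return the node that will host the container, or None if nothing fits."""
        for node, free in self.free.items():
            if request.fits_in(free):
                # Deduct the granted container from the node's free resources.
                self.free[node] = Resource(free.memory_mb - request.memory_mb,
                                           free.vcores - request.vcores)
                return node
        return None


sched = ToyScheduler({"nm1": Resource(4096, 4), "nm2": Resource(8192, 8)})
first = sched.allocate(Resource(3072, 2))
second = sched.allocate(Resource(2048, 1))
third = sched.allocate(Resource(16384, 1))
print(first, second, third)
```

Note that the scheduler never asks what the container is for or whether the work inside it succeeded; in YARN that tracking belongs to the per-application Application-Master.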

The Applications-Manager is responsible for accepting job submissions, negotiating the first Container for executing the application-specific Application-Master, and restarting the Application-Master Container on application or hardware failure. The Node-Manager is the per-machine slave agent; it is responsible for Containers, monitoring their resource usage and reporting it to the Resource-Manager. The per-application Application-Master has the responsibility of negotiating appropriate resource Containers from the Scheduler, tracking their status, and monitoring their progress.

MapReduce v2 maintains API compatibility with the previous stable release (hadoop-1.x). This means that existing MapReduce jobs should still run on MapReduce v2 after just a recompile.

Reference:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
