Sunday, February 24, 2013

Starting with Hadoop : Installation

Purpose
This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes.
Prerequisites
1.      Make sure all required software is installed on all nodes in your cluster (i.e. JDK 1.6 or later).
2.      Download the Hadoop software.

Installation Process
1.      Installing JAVA
     a.      Unzip the JDK tar (the archive name should match the JDK version you downloaded; update 34 is used throughout this guide)
              tar -xvf jdk-7u34-linux-x64.tar.gz

     b.      Move the unzipped JDK to /usr/lib/jvm
              sudo mkdir -p /usr/lib/jvm
              sudo mv jdk1.7.0_34 /usr/lib/jvm/

     c.       Update JAVA alternatives
              sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0_34/bin/java" 1
              sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0_34/bin/javac" 1
              sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0_34/bin/javaws" 1
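
The three update-alternatives calls differ only in the tool name, so they can be generated with a small loop. This sketch just prints the commands (drop the echo to execute them, as root) and assumes the JDK install path from step 1b:

```shell
# Print the update-alternatives command for each JDK tool; remove the
# "echo" to actually register the alternatives (requires root).
JDK_HOME=/usr/lib/jvm/jdk1.7.0_34   # install path from step 1b
for tool in java javac javaws; do
  echo sudo update-alternatives --install "/usr/bin/$tool" "$tool" "$JDK_HOME/bin/$tool" 1
done
```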

2.      Installing Hadoop
     a.      Unzip the Hadoop tar inside /home/<user>/hadoop
              tar xzf hadoop-1.0.4.tar.gz

3.      Set the JAVA_HOME and HADOOP_COMMON_HOME environment variables
         export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_34
         export HADOOP_COMMON_HOME="/home/<user>/hadoop/hadoop-1.0.4"
         export PATH=$HADOOP_COMMON_HOME/bin:$PATH
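
Exports typed at a prompt are lost when you log out; to make them permanent, append them to ~/.bashrc. A minimal sketch, using the same example paths as above (with $HOME standing in for /home/<user>):

```shell
# Append the step-3 environment variables to ~/.bashrc so they survive
# logout (the quoted heredoc keeps $HOME and $PATH unexpanded in the file).
cat >> "$HOME/.bashrc" <<'EOF'
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_34
export HADOOP_COMMON_HOME="$HOME/hadoop/hadoop-1.0.4"
export PATH=$HADOOP_COMMON_HOME/bin:$PATH
EOF
tail -n 3 "$HOME/.bashrc"
```

Open a new shell (or run source ~/.bashrc) for the change to take effect.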

Configure Hadoop
1.      conf/hadoop-env.sh
          Add or change these lines to specify JAVA_HOME and the directory where the logs are stored:
          export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_34
          export HADOOP_LOG_DIR=/home/<user>/hadoop/hadoop-1.0.4/hadoop_logs

2.      conf/core-site.xml
         Here the NameNode runs on 172.16.16.60
   
       <configuration>
           <property>
              <name>fs.default.name</name>
              <value>hdfs://172.16.16.60:9000</value>
          </property>
      </configuration>


3.      conf/hdfs-site.xml
     
       <configuration>
                <property>
                          <name>dfs.replication</name>
                          <value>3</value>
                </property>
                <property>
                         <name>dfs.name.dir</name>
                         <value>/lhome/hadoop/data/dfs/name/</value>
               </property>
               <property>
                         <name>dfs.data.dir</name>
                         <value>/lhome/hadoop/data/dfs/data/</value>
               </property>
        </configuration>

dfs.replication is the number of replicas kept for each block. dfs.name.dir is the path on the local filesystem where the NameNode persistently stores the namespace and transaction logs. dfs.data.dir is a comma-separated list of paths on the local filesystem of a DataNode where it stores its blocks.
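
Before the NameNode is formatted, the local directories named in dfs.name.dir and dfs.data.dir must exist and be writable by the user running Hadoop. A sketch using the example paths from the hdfs-site.xml above (run as root, or add sudo and chown to the hadoop user as needed):

```shell
# Create the NameNode and DataNode storage directories from hdfs-site.xml.
mkdir -p /lhome/hadoop/data/dfs/name /lhome/hadoop/data/dfs/data
ls /lhome/hadoop/data/dfs
```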

4.      conf/mapred-site.xml
         Here the JobTracker runs on 172.16.16.60
     
         <configuration>
                <property>
                          <name>mapred.job.tracker</name>
                          <value>172.16.16.60:9001</value>
               </property>
               <property>
                          <name>mapred.system.dir</name>
                          <value>/hadoop/data/mapred/system/</value>
               </property>
               <property>
                         <name>mapred.local.dir</name>
                         <value>/lhome/hadoop/data/mapred/local/</value>
               </property>
       </configuration>

mapred.job.tracker is the host (or IP) and port of the JobTracker. mapred.system.dir is the path on HDFS where the Map/Reduce framework stores system files. mapred.local.dir is a comma-separated list of paths on the local filesystem where temporary MapReduce data is written.

5.      conf/masters
         Delete localhost and add the name of the master node, each on its own line (in Hadoop 1.x this file lists the hosts that run the SecondaryNameNode).
         For Example:
         <IP of Namenode>

6.      conf/slaves
         Delete localhost and add the names of all the slave nodes (the DataNodes/TaskTrackers), each on its own line.
         For Example:
         <IP of Slave 1>
         <IP of Slave 2>
          ….
          ….
        <IP of Slave n>

7.      Configuring SSH
         In fully-distributed mode the start scripts launch daemons on every host in the cluster (the set defined by the slaves file) by SSH-ing to each host and starting the daemon process there. SSH must therefore be installed, and we must be able to SSH to each node, including localhost, and log in without entering a password.

       $ sudo apt-get install ssh

Then to enable password-less login, generate a new SSH key with an empty passphrase:

        $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

        $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Test this with:

         $ ssh localhost

You should be logged in without having to type a password.
Also append the master node's public key (~/.ssh/id_rsa.pub) to ~/.ssh/authorized_keys on every slave node, so that the master has password-less access to them as well.
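
Instead of copying key files by hand, ssh-copy-id (part of OpenSSH) can install the master's public key on each slave. This sketch only prints the commands against a hypothetical slave list; drop the echo and point the loop at your real conf/slaves file to run it:

```shell
# Print one ssh-copy-id command per slave; remove "echo" to execute them.
printf '%s\n' 172.16.16.61 172.16.16.62 > /tmp/slaves.example  # hypothetical slave IPs
while read -r host; do
  echo ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"
done < /tmp/slaves.example
```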

8.      Duplicate the Hadoop configuration files to all nodes
Copy the files under the conf directory to the same location on every node, for example with scp or rsync. At this point the Hadoop software is installed and configured on all nodes. Now let’s have some fun with Hadoop.
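
Step 8 can be scripted with the same slaves file: loop over the hosts and push the conf directory with rsync (or scp). The commands are printed rather than executed here, and the paths are the example ones from this guide:

```shell
# Print one rsync command per node; remove "echo" to really copy conf/.
HADOOP_DIR='/home/<user>/hadoop/hadoop-1.0.4'   # example install path from step 2
printf '%s\n' 172.16.16.61 172.16.16.62 > /tmp/slaves.example  # hypothetical slave IPs
while read -r host; do
  echo rsync -av "$HADOOP_DIR/conf/" "$host:$HADOOP_DIR/conf/"
done < /tmp/slaves.example
```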

Hadoop Startup
To start a Hadoop cluster you need to start both the HDFS and the Map/Reduce daemons. First, format a new distributed filesystem:
       
            $ bin/hadoop namenode -format

Start the HDFS with the following command, run on the designated NameNode:                                            
       
            $ bin/start-dfs.sh

The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and
starts the DataNode daemon on all the listed slaves.
Start Map-Reduce with the following command, run on the designated JobTracker:    
       
            $ bin/start-mapred.sh

The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.
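
A quick way to confirm that startup worked is jps, the JVM process lister shipped with the JDK: on the master you would expect NameNode, SecondaryNameNode and JobTracker in its output, and DataNode plus TaskTracker on each slave. The guard below keeps the snippet harmless on a machine without the JDK on PATH:

```shell
# List the running Java daemons with jps, if the JDK is on PATH.
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found - check that JAVA_HOME/bin is on PATH"
fi
```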

Hadoop Shutdown
Stop HDFS with the following command, run on the designated NameNode:
 
             $ bin/stop-dfs.sh

The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.
Stop Map/Reduce with the following command, run on the designated JobTracker:

            $ bin/stop-mapred.sh

The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.
