Thursday, March 14, 2013

Apache Hadoop HttpFS : A service that provides HTTP access to HDFS.

HttpFS :  Introduction

Apache Hadoop HttpFS is a service that provides HTTP access to HDFS.
HttpFS provides a REST HTTP gateway that supports HDFS operations such as read and write. It can be used to transfer data between clusters running different versions of Hadoop, and to access data in HDFS using HTTP utilities.

HttpFS was inspired by Hadoop HDFS proxy and can be seen as a full rewrite of it.
Hadoop HDFS proxy provides only a subset of file system operations (read only), whereas HttpFS supports all file system operations.

HttpFS uses a clean HTTP REST API making its use with HTTP tools more intuitive.
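
To make the REST style more concrete, here is a small sketch of how a few file system operations map onto HTTP methods in the WebHDFS REST API. The function name webhdfs_method, the host httpfs-host.example.com, the path /tmp/demo and the user someuser are placeholders invented for illustration. The script only prints curl command lines, it does not contact any server.

```shell
#!/bin/sh
# Map a WebHDFS operation name to the HTTP method the REST API uses for it.
# webhdfs_method is a hypothetical helper, not something shipped with HttpFS.
webhdfs_method() {
  case $1 in
    OPEN|LISTSTATUS|GETFILESTATUS|GETHOMEDIRECTORY) echo GET ;;
    CREATE|MKDIRS|RENAME) echo PUT ;;
    APPEND) echo POST ;;
    DELETE) echo DELETE ;;
    *) echo "unknown operation: $1" >&2; return 1 ;;
  esac
}

# Print (not run) a curl command line for each operation on a sample path.
# Host, path and user below are placeholders.
for op in LISTSTATUS OPEN MKDIRS DELETE; do
  method=$(webhdfs_method "$op")
  echo "curl -X $method \"http://httpfs-host.example.com:14000/webhdfs/v1/tmp/demo?op=$op&user.name=someuser\""
done
```

Because every operation is just a method plus a URL, the same requests can be issued from a browser, a script, or any HTTP client library.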

About security, HttpFS supports Hadoop pseudo-authentication, HTTP SPNEGO Kerberos, and additional authentication mechanisms via a plugin API. HttpFS also supports Hadoop proxy user functionality. 

HttpFS :  Installation

Prerequisites for installing HttpFS are:

  • Java 6+
  • Maven 3+

 Installing HttpFS

      HttpFS is distributed in the hadoop-httpfs package. To install it, use your preferred package manager application. Install the package on the system that will run the HttpFS server.

      $ sudo yum install hadoop-httpfs       # on a Red Hat-compatible system
      $ sudo zypper install hadoop-httpfs    # on a SLES system
      $ sudo apt-get install hadoop-httpfs   # on an Ubuntu or Debian system

Alternatively, if you have an HttpFS tarball, you can simply untar it:
      $ tar xzf httpfs-2.0.3-alpha.tar.gz
You are now ready to configure HttpFS.

Configure HttpFS

     HttpFS reads the HDFS configuration from the core-site.xml and hdfs-site.xml files in /etc/hadoop/conf/. If necessary, edit those files to configure the HDFS instance that HttpFS will use. By default, HttpFS assumes that the Hadoop configuration files (core-site.xml and hdfs-site.xml) are in the HttpFS configuration directory.
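
Before starting the server it can help to confirm that both files are actually where HttpFS will look. The following is a minimal sketch of such a check. The function name has_hadoop_conf is made up for this example, and /etc/hadoop/conf is simply the location named above, so adjust it for your layout.

```shell
#!/bin/sh
# Report whether a directory holds both Hadoop configuration files HttpFS reads.
# $1 = directory to check; returns 0 if both files exist, 1 otherwise.
# has_hadoop_conf is a hypothetical helper, not part of HttpFS.
has_hadoop_conf() {
  dir=$1
  missing=0
  for f in core-site.xml hdfs-site.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# /etc/hadoop/conf is the directory mentioned in the text; change as needed.
if has_hadoop_conf /etc/hadoop/conf; then
  echo "Hadoop configuration found"
else
  echo "Hadoop configuration incomplete"
fi
```

If the files live somewhere else, the HttpFS documentation describes an httpfs.hadoop.config.dir property in httpfs-site.xml for pointing HttpFS at that directory. Check the documentation for your release before relying on it.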

Configure Hadoop

Edit Hadoop's core-site.xml and define the Unix user that will run the HttpFS server as a proxyuser. For example:

Note : Replace "myhttpfsuser" with the Unix user that will run the HttpFS server.
IMPORTANT : You need to restart Hadoop for the proxyuser configuration
            to become active.
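
As a hedged sketch of what the proxyuser entries described above can look like, the script below prints two properties that follow Hadoop's hadoop.proxyuser.<user>.hosts and hadoop.proxyuser.<user>.groups naming. The function name proxyuser_properties and the host httpfs-host.example.com are placeholders made up for this example, and the "*" value for groups is an assumption that allows all groups. Tighten it to suit your cluster.

```shell
#!/bin/sh
# Print the two proxyuser properties for a given HttpFS user.
# $1 = Unix user that runs the HttpFS server
# $2 = host the HttpFS server runs on
# proxyuser_properties is a hypothetical helper, not part of Hadoop.
proxyuser_properties() {
  user=$1
  host=$2
  cat <<EOF
<property>
  <name>hadoop.proxyuser.${user}.hosts</name>
  <value>${host}</value>
</property>
<property>
  <name>hadoop.proxyuser.${user}.groups</name>
  <value>*</value>
</property>
EOF
}

# Placeholder values; the output belongs inside <configuration> in core-site.xml.
proxyuser_properties myhttpfsuser httpfs-host.example.com
```

Generating the fragment from a function keeps the user name consistent across both property names, which is an easy thing to get wrong when editing the XML by hand.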

Starting/Stopping the HttpFS Server

 To start or stop HttpFS, use the script in HttpFS's bin/ directory. For example:
       httpfs-2.0.3-alpha $ bin/ start  --> for start
       httpfs-2.0.3-alpha $ bin/ stop   --> for stop 

Test HttpFS is working

Use a tool such as curl to access HDFS via HttpFS. For example, to obtain the home directory of the user ubantu, use a command such as this:
$ curl -i "http://<MyHttpFSHostName>:14000?"
       HTTP/1.1 200 OK
       Content-Type: application/json
       Transfer-Encoding: chunked

$ curl "http://<MyHttpFSHostName>:14000/webhdfs/v1?op=homedir&" 
       HTTP/1.1 200 OK
       Server: Apache-Coyote/1.1
       Set-Cookie: hadoop.auth="u=ubantu&p=ubantu&t=simple =4558977754545&s=wtFFgaGHHJFGffasWXK68rc/0xI=";Version=1; Path=/
       Content-Type: application/json
       Transfer-Encoding: chunked
       Date: Wed, 28 Mar 2012 13:35:55 GMT
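
As a rough sketch of how such a request URL is put together: the WebHDFS REST API names the home directory operation GETHOMEDIRECTORY, and Hadoop pseudo-authentication passes the user in a user.name query parameter. The helper name httpfs_url and the host httpfs-host.example.com are placeholders invented for illustration. Port 14000 and the user ubantu are taken from the example above. Operation names can differ between HttpFS releases, so treat this as an assumption to check against the documentation for your version.

```shell
#!/bin/sh
# Build a WebHDFS-style URL for an HttpFS server listening on port 14000.
# $1 = host, $2 = HDFS path (may be empty), $3 = operation, $4 = user name
# httpfs_url is a hypothetical helper, not part of HttpFS.
httpfs_url() {
  printf 'http://%s:14000/webhdfs/v1%s?op=%s&user.name=%s\n' "$1" "$2" "$3" "$4"
}

# Placeholder host; an empty path asks about the file system as a whole.
url=$(httpfs_url httpfs-host.example.com "" GETHOMEDIRECTORY ubantu)
echo "$url"

# To send the request and show the response headers (needs a running server):
#   curl -i "$url"
```

Note that curl prints response headers like the ones shown above only when the -i option is given.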

 See the WebHDFS REST API web page for complete documentation of the API.