Tuesday, March 5, 2013

Hadoop : Hive In Action

Hive: A Simple Introduction

Hive is built on Hadoop and uses Hadoop’s storage(HDFS) and execution(MapReduce) modules. In terms of storage Hive can use any file system supported by Hadoop, although HDFS is by far the most common. If you’re not going to be using HDFS you probably want a more in-depth introduction than what I’m going to give here. Hive Queries are translated to a graph of Hadoop MapReduce jobs that get executed on your Hadoop grid. As you can see Pig and Hive use the same methods of execution abstraction. The big difference comes in the language used to access the data.
Figure 1 : Hive Structure diagram.

Hive Query Language (HQL) is based on SQL, and you’ll find many of the familiar constructs such as SHOW, DESCRIBE, SELECT, USE and JOIN. You can find the full definition of the language here: Hive Lanuage Manual. Similar to an RDBMS in Hive you will find Databases that contain one or more Tables that contain some Data defined by a Schema.
Hive also supports User Defined Functions (UDFs) and Serialization/Deserialization functions (SerDe’s). UDFs allow programmers to write functions to abstract common tasks in Hive. It also allows you to seamlessly connect a Query Language in Hive with functional, procedural or scripting languages. SerDe’s allow for the management of arbitrarily structured or unstructured data in Hive. One particularly useful and popular SerDe is the now built-in one for Avro.
Hive uses a metadata store to keep the data warehouse information that links Hive and the raw data. Data loaded into Hive can be maintained in an external location or a copy made into the Hive warehouse. In this way you can create a self-contained data warehouse, or share the infrastructure with MapReduce and Pig programmer.
Out of the box Hive uses a Derby database for the metadata store. For scalability and performance you’ll want to switch to use an RDBMS such as MySQL as your Hive system matures.

Getting Started

The first thing you’ll need to do is install and get Hive up and running. The Getting Started Guide is your best resource for this.
Once Hive is installed you’ll want to create and use a database:

1.      Create Database

hive> CREATE DATABASE demo;
hive> use demo;
Next create a table with a schema:

2.      Create Table

hive> CREATE TABLE demoTable (foo INT, bar STRING);
hive> describe demoTable;
Now let’s load some data from HDFS into Hive. Assuming there’s an input file where each row has an integer and a string:

3.      Load data to table

hive> LOAD DATA LOCAL INPATH ‘./resources/input.txt’ OWERWRITE INTO TABLE demoTable;

4.      get data from table

hive> SELECT * FROM demoTable;

5.      Create Table : reference  an external data-store

Note that when data is LOAD’ed like we do above the data is copied and moved into Hive’s warehouse. This is all well and good if you want your Hive queries to be self-contained and not affect the rest of the Hadoop analysts who are using MapReduce and/or Pig. Dropping the table from Hive will not affect the original data.
I find one of the most powerful features of the Hadoop eco-system is allowing programmers to access the same data using different interfaces. If you want Hive to simply reference an external datastore you can create your tables in the following way:

hive> CREATE EXTERNAL TABLE demoTable (foo INT, bar STRING) ROW FORMAT STORED AS TEXTFILE LOCATION ‘./resource/input.txt’;

This links your external file to the Hive meta-store. Manipulations on the data in Hive will be reflected in the shared data, and changes in the data will be reflected in Hive. If you drop the table only the Hive meta-store is affected.

Pretty quickly you’ll get into the situation where you want to define the structure of the data and maintain that schema separately. This is where Avro comes in. Let’s load some data using Avro instead of the default SequenceFile:

Create an Avro schema and save it to the file system (the namespace and name are not important):
{
“namespace”: “org.demo.schema”,
“name”: “demo_schema”,
“type”: “record”,
“fields”: [
   { "name": "foo", "type": "int" },
   { "name": "bar", "type": "string" }
] }
hive> CREATE TABLE demoTableAvro PARTITIONED BY (ds string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.avro.AvroSerDe’
WITH TBLPROPERTIES(‘avro.schema.url’='file:///path/to/schema/demoSchema.avsc’ STORED AS
INPUTFORMAT ‘org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat’
OUTPUTFORMAT ‘org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat’;

Using any of these three approaches you can create tables and load data into it. Now programmers can write queries against this data warehouse similar to their existing workflow and experience.

Conclusion

Hive is a great way to share your Hadoop infrastructure with programmers who are the most comfortable with Query Languages. By combining MapReduce, Pig and Hive you can create a full-featured data warehouse and analytics system. Next up you’ll want to look at workflow automation tools and ways to store and serve your results. Look for this in upcoming posts.

Followers