Friday, May 31, 2013

Big Business with Big Opportunities

Businesses today are struggling with the rapidly growing volume, velocity and variety of the information they generate every day; the complexity of that information is growing just as fast. This is the term we know as 'Big Data'. Many companies are seeking technology that will not only help them store and process big data, but also find many more business insights in it and grow their business strategies with it.

Around 80% of the world's information is unstructured, and many businesses either are not attempting to use that information to their advantage or are not aware of how to use it. Imagine if your business could keep all the data it generates and kept tracking and analyzing it. Imagine if you knew the way to handle that big data.

The data explosion presents a great challenge to businesses: today most lack the technology and the knowledge to deal with big data and get real business value from it. Many companies are now focusing on developing the skills and the insight into business needs required to accelerate the transformation of very large data sets.

What can big data do? Businesses are growing with big data, finding more business insights and the rows that carry value for the business, using the latest big data processing technologies such as the Hadoop framework.
It is now possible to track each individual user through cell phones and wireless sensors, measuring his interest in particular things, where he lives, works and plays, and what his day-to-day schedule is; then to collect that data, analyze this huge data set using big data processing technologies, and find business ways to help each individual user or make his life simpler.
Big data in social networking: millions of Facebook comments, updates and Twitter tweets are generated every day, and much more besides, so we can use big data processing to find current market trends and what people are talking about, their likes and dislikes, and plan our business accordingly.
Big data in healthcare: every hospital or healthcare organization maintains historical patient records, which is a kind of big data, so technology can analyze those past records and predict which patients will come, on what date, with what cause, and what the possible treatments are for a similar cause.
Big data in BFSI: in the BFSI domain, fault tolerance is one of the most important pillars. There are millions of daily banking transactions and we want to find the fraudulent ones among them; big data helps here, and it also plays a major role in product recommendations and transaction analysis.
Big data in e-commerce: somewhere on online shopping sites you might have seen messages like 'you bought this, you may like this'; these are recommendations computed by big data processing technologies.

About 90% of the information we have today was generated in just the last two years, and this trend is continuing. I believe that after 2025 about 70% of businesses in the world will be generated by and oriented around big data. A product will be delivered to the customer when he is just thinking about it, a cab will be waiting for us when we decide to go shopping, and a discount will already be there on the shirt we might think of buying.

Thursday, May 30, 2013

WebHDFS REST API: A Complete FileSystem Interface for HDFS

The WebHDFS REST API supports most of the file system operations of the Hadoop file system, such as reading, writing, opening, modifying and deleting files, using the HTTP GET, POST, PUT and DELETE methods.

HTTP Operations:
1. HTTP GET
    OPEN
    GETFILESTATUS
    LISTSTATUS
    GETCONTENTSUMMARY
    GETFILECHECKSUM
    GETHOMEDIRECTORY
    GETDELEGATIONTOKEN
2. HTTP PUT
    CREATE
    MKDIRS
    RENAME
    SETREPLICATION
    SETOWNER
    SETPERMISSION
    SETTIMES
    RENEWDELEGATIONTOKEN
    CANCELDELEGATIONTOKEN
3. HTTP POST
    APPEND
4. HTTP DELETE
    DELETE

WebHDFS FileSystem URIs:

The FileSystem URI format for WebHDFS is as below.
    webhdfs://<MyHOST>:<HTTP_PORT>/<PATH>

In the WebHDFS REST API, the prefix /webhdfs/v1 is inserted in the path and a query is appended at the end:
    http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=<OPERATION_NAME>
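For example, with the NameNode HTTP port 50070 used in the examples below, listing the home directory of user ubantu looks like this (the host name is a placeholder):

curl -i "http://<MyHOST>:50070/webhdfs/v1/user/ubantu?op=LISTSTATUS"

The response body is a JSON FileStatuses object describing each entry in the directory.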

Configurations:
To enable WebHDFS on your Hadoop cluster you need to add some parameters to the hdfs-site.xml configuration file, which makes HDFS accessible through the WebHDFS REST APIs.

1. dfs.webhdfs.enabled
This is the basic and mandatory property you need to add to hdfs-site.xml to enable WebHDFS access.
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
     </property>

2. dfs.web.authentication.kerberos.principal
The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint. This property is required only if you are using Kerberos authentication.

3. dfs.web.authentication.kerberos.keytab
The Kerberos keytab file with the credentials for the HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint. This property is also required only if you are using Kerberos authentication.
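If you do use Kerberos, the two properties above go into hdfs-site.xml in the same way as the first one; the principal and keytab path below are only placeholder values, not anything prescribed:

    <property>
        <name>dfs.web.authentication.kerberos.principal</name>
        <value>HTTP/_HOST@EXAMPLE.COM</value>
    </property>
    <property>
        <name>dfs.web.authentication.kerberos.keytab</name>
        <value>/etc/security/keytab/http.service.keytab</value>
    </property>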

File System Operations:
1. Create and Write into a File:
The create operation takes two steps in order to prevent clients from sending out data before the redirect.
Step 1: Submit an HTTP PUT request, without sending the file data yet.

curl -i -X PUT "http://<MyHost>:50070/webhdfs/v1/user/ubantu/input?op=CREATE&overwrite=true&blocksize=1234&replication=1&permission=777&buffersize=123"

The request is redirected to a datanode where the file data is to be written, with a message on the console like:

HTTP/1.1 307 TEMPORARY_REDIRECT 
Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
Content-Length: 0

Step 2:
Submit another HTTP PUT request using the URL in the Location header, this time with the file data to be written.

curl -i -X PUT -T /home/ubantu/hadoop/hadoop-1.0.4/input/data0.txt "http://<DATANODE>:50075/webhdfs/v1/user/ubantu?op=CREATE&user.name=ubantu&overwrite=true&blocksize=1234567"
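If the write succeeds, the datanode answers with a 201 response whose Location header names the newly created file, along these lines:

HTTP/1.1 201 Created
Location: webhdfs://<MyHost>:50070/user/ubantu
Content-Length: 0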

2. Open and Read a File:
To open a particular file from HDFS you need to know the path where the file is stored and the name of the file; you can use the HTTP GET request below to open and read a file from HDFS.

curl -i -L "http://<MyHOST>:50070/webhdfs/v1/user/ubantu/input/data0.txt?op=OPEN&user.name=ubantu&offset=12345&length=12345678&buffersize=123123"

3. Delete a File:
To delete a file from HDFS, submit an HTTP DELETE request as below.

curl -i -X DELETE "http://<MyHOST>:50070/webhdfs/v1/user/ubantu/input/data0.txt?op=DELETE&recursive=true"
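If the delete succeeds, the server confirms it with a small JSON body along these lines:

HTTP/1.1 200 OK
Content-Type: application/json

{"boolean": true}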

4. Set the Owner of a File:
Submit an HTTP PUT request with the SETOWNER operation as below.

curl -i -X PUT "http://<MyHOST>:50070/webhdfs/v1/user/ubantu/input/data0.txt?op=SETOWNER&owner=<USER>&group=<GROUP>"
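The SETPERMISSION operation from the HTTP PUT list works the same way; the octal value below is just an example:

curl -i -X PUT "http://<MyHOST>:50070/webhdfs/v1/user/ubantu/input/data0.txt?op=SETPERMISSION&permission=755"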

Error Responses:
When an operation fails, the server returns a specific HTTP error code for the underlying Java exception, as below:

IllegalArgumentException           400 Bad Request
UnsupportedOperationException      400 Bad Request
SecurityException                  401 Unauthorized
IOException                        403 Forbidden
FileNotFoundException              404 Not Found
RuntimeException                   500 Internal Server Error
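The details of the error are carried in the response body as a RemoteException JSON object, along these lines (the message text is just an example):

HTTP/1.1 404 Not Found
Content-Type: application/json

{
  "RemoteException": {
    "exception": "FileNotFoundException",
    "javaClassName": "java.io.FileNotFoundException",
    "message": "File does not exist: /user/ubantu/input/data0.txt"
  }
}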

For more details on the WebHDFS REST APIs, do visit: http://hadoop.apache.org/docs/stable/webhdfs.html

Friday, May 24, 2013

NoSQL brings Hadoop Live

Hadoop is designed for processing large amounts of data, of different varieties and at high processing speed, but not really in real time: there is some latency between an actual real-time application request and Hadoop's response. Integrating Hadoop with a real-time application is a tedious and complex job, and of course the most important one. If we have the capability to store and process large data but we are not able to access it in real time, then it is of no use.
Previously, Apache HttpFS and Hoop were used for this integration, but over time the newer WebHDFS (RESTful) service became the active way to access HDFS (Hadoop Distributed File System) over HTTP and similar protocols.

It is an era of NoSQL databases, which are replacing traditional RDBMS systems because of their many advantages over them. "NoSQL" databases are designed to deal with huge amounts of data, in short "big data", when the data comes in any form, does not require a relational model, and may or may not be structured; but NoSQL is suitable only when data storage and retrieval matter, not the relationships between the elements.

Now think what happens when two big-data-handling giants come together, and what their power will be together. We can use Hadoop with a NoSQL database to respond to a real-time application.


Hadoop-NoSQL Integration with Realtime Application

In the architecture diagram above you can see that the frontend application communicates with the NoSQL database (as we are replacing the RDBMS with a NoSQL DB), and Hadoop integrates with the NoSQL database: Hadoop takes input data from the NoSQL database, does the processing, and stores the output data back into the NoSQL database, so the frontend application can easily access the processed data on the UI. It is as simple as that. The main complex part here is accessing the NoSQL data from within the Hadoop jobs.

Nowadays many NoSQL databases provide connectors for Hadoop (e.g. the MongoDB-Hadoop connector), so we can easily read data from, and store data into, the NoSQL database from Hadoop jobs.
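As an illustration, here is a minimal sketch of a MapReduce job wired to MongoDB through the mongo-hadoop connector. It assumes the connector jar is on the classpath; the database name, the collection names and the "type" field are placeholders of my own, not anything prescribed by the connector.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class MongoHadoopJob {

    // Emits (type, 1) for every MongoDB document; "type" is a placeholder field
    public static class EventMapper
            extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object key, BSONObject doc, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(String.valueOf(doc.get("type"))), ONE);
        }
    }

    // Sums the counts per type
    public static class EventReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read input documents from one collection, write results to another
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/mydb.raw_events");
        MongoConfigUtil.setOutputURI(conf, "mongodb://localhost:27017/mydb.event_counts");

        Job job = new Job(conf, "mongo-hadoop-example");
        job.setJarByClass(MongoHadoopJob.class);
        job.setMapperClass(EventMapper.class);
        job.setReducerClass(EventReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The frontend never talks to Hadoop directly: it only reads the event_counts collection, which is exactly the decoupling the diagram above shows.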

We can even generate BI reports from big data: for example, we can import database tables (structured) and application logs (unstructured) into HDFS as Hive/HBase tables using Sqoop/Flume ETL jobs, and then use the BI connectors available for HDFS/Hive/HBase to generate business reports from big data.
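As a sketch of that import step, a Sqoop command along these lines pulls a relational table straight into a Hive table; the connection string, credentials and table name are placeholders:

sqoop import --connect jdbc:mysql://<DBHOST>/salesdb --username <USER> --password <PASSWORD> --table orders --hive-import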

Sunday, May 12, 2013

Starting with NoSQL Database

We all know that the Hadoop ecosystem is designed for big data processing and analyzing data in batch, not really for real-time purposes; but we can bring Hadoop live using NoSQL databases. Now think what happens when two big giants come together, and what their powers are together.

"NoSQL" some in the industry referred "Not only SQL" describing NoSQL systems do supports SQL like query languages but its not 100% correct. NoSQL databases are high optimized for data retrieving and accessing operations. its because of storage system of NoSQL databases are based upon key-value pair, each data item inside the NoSQL database has a unique key, because of this Runtime query complexity of traditional relational databases are removed; Made a footprint into emerging market of valuable data warehousing by gaining high scalable and optimized performance model of NoSQL database.

"NoSQL" database are designed to deal with huge amount of data in short "Bigdata", when the data is in the any form, doesn't requires a relational model, may or may not be structured, but the NoSQL is used only when there is data storage and retrieval matters not the relationship between the elements, NoSQL can able to store millions of key-value pair and access them faster than any relational database can. This system is very useful for real time statistical analysis of  growing data such as application logs.

There is no guarantee that "NoSQL" supports full ACID operations; perhaps only eventual consistency is possible, or transactions limited to single data items. Although most NoSQL systems have transactions over single data objects, several support transactions over multiple data objects: for example, the eXtreme Scale NoSQL system supports transactions over a single object, while systems such as FoundationDB and OrientDB support transactions over multiple objects like traditional RDBMS databases.

Focusing on storing and accessing data in a NoSQL database: as mentioned above, a NoSQL system uses key-value pairs to store data in the data store. Key-value pairs allow data to be stored in a schemaless way, so data can be stored in the form of programming-language objects, like POJO classes in Java, and accessed the same way; because of this there is no need for a fixed data model to store or access the data.
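As a small sketch of this schemaless style, here is what storing and reading back a document looks like with the 2.x-era MongoDB Java driver; the database, collection and field names are placeholders of my own:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class NoSqlExample {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DB db = mongo.getDB("mydb");                    // placeholder database
        DBCollection users = db.getCollection("users"); // placeholder collection

        // No schema to declare up front -- just store the fields you have
        BasicDBObject doc = new BasicDBObject("name", "amol")
                .append("city", "pune");
        users.insert(doc);

        // Look the document up again by key
        DBObject found = users.findOne(new BasicDBObject("name", "amol"));
        System.out.println(found);

        mongo.close();
    }
}

Note that nothing outside this code defines what a "user" looks like; a second document with completely different fields could go into the same collection.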

Now a revolution in NoSQL databases is taking them to the next step, such as graph databases, in which data is stored in a graph, in relation to other data, so related data can be accessed faster than ever.

This is just an overview of what a NoSQL database is and how it works; for a detailed explanation please see https://en.wikipedia.org/wiki/NoSQL

There are hundreds of NoSQL databases available in the market waiting for you; some of them are:
1. HBase - the Hadoop database
2. MongoDB
3. Cassandra

Wednesday, May 8, 2013

Integration of Hadoop with Business Intelligence (BI) and Data Warehousing (DW)

We know Hadoop's powers: it can store data of large volume and of different varieties. But if we have a large amount of valuable data and we are not able to present it, then it is of no more use than raw data. The real value of data comes only when it is presented in an impressive manner, as tables, charts, graphs and so on.
We can store structured, unstructured and semi-structured data in the Hadoop ecosystem: on HDFS, in the Hive data warehouse, or in the Hadoop database (HBase).

When reports come into the picture, people think about Business Intelligence and Data Warehousing tools, and that's exactly it: many BI and DW tools now provide connectivity to Hadoop, because they know that big data is the need of tomorrow, and I believe that too. BI tool providers like Pentaho have already started working on big data and Hadoop and have created very useful, really impressive tools to deal with big data. Pentaho provides support for HDFS, to access data from the Hadoop file system and create reports from it; it also supports connecting to the Hive data warehouse and the HBase database and generating business value from them. Many more organizations are following suit, and best-practices reports now take a careful look at the benefits, barriers, and emerging best practices for integrating Hadoop into BI and DW.

Below you can see how we are able to build customized reports from the Hadoop (Hive) and Pentaho integration.

We can even develop our own BI/ETL tool by integrating Hadoop with Sqoop/Flume to do ETL (Extract, Transform and Load) processing on the data, and generate reports or meaningful relational data from the raw data. The architecture behind every big data BI or ETL tool is the same; just the frontend differs.

Hadoop promises to assist with the toughest challenges in BI today, including big data analysis and processing, advanced analytics, and handling unstructured and structured data together.

For more details about BI and ETL tools:
http://en.wikipedia.org/wiki/Extract,_transform,_load
http://en.wikipedia.org/wiki/Business_intelligence_tools
http://www.pentahobigdata.com/
