Wednesday, July 17, 2013

Hadoop Ecosystem on Windows Azure

As Microsoft becomes one of the popular vendors in the Bigdata/Hadoop market, it has developed a cloud-based Bigdata solution, "Windows Azure HDInsight", which processes and analyzes data and uncovers new business insights using the power of the Apache Hadoop ecosystem. Windows Azure HDInsight is used to gain valuable business insights by processing and analyzing data, including unstructured data, and helps businesses make real-time decisions; it is a Big Data solution powered by Apache Hadoop.

HDInsight Service makes Apache Hadoop available as a service in the cloud. It lets you provision a Hadoop cluster in minutes and scale it down once your MapReduce jobs have run. It gives you various ways to balance performance and cost, such as choosing a cluster size that optimizes either time-to-insight or expense, all in a very interactive way. HDInsight also supports many programming languages, including Java and .NET technologies.
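Whatever language you pick, a MapReduce job boils down to a map step that emits key-value pairs and a reduce step that aggregates them. As a minimal sketch of that idea (the word-count task and sample text below are my own illustration, written in the style of a Hadoop Streaming job, not HDInsight-specific code):

```python
# Minimal word-count sketch in the MapReduce style used by Hadoop
# Streaming jobs. The sample input text is an illustrative assumption.

from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Reduce step: sum the counts for each word.

    Hadoop sorts map output by key before the reduce phase, so we
    sort here to simulate that shuffle."""
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    text = ["big data needs big tools", "hadoop handles big data"]
    counts = dict(reducer(mapper(text)))
    print(counts["big"])   # "big" appears three times across both lines
```

In a real cluster the framework runs many mappers and reducers in parallel across nodes; the local `sorted` call here just stands in for Hadoop's shuffle phase.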


You can find the core services, data processing frameworks, Microsoft integration points and value-added services, data movement services, and packages exposed by Windows Azure HDInsight in the above diagram. HDInsight makes HDFS and MapReduce, the core components of the Hadoop framework, available in a simpler, more scalable, and cost-efficient Windows Azure environment. It simplifies Hadoop configuration, monitoring, and post-processing of the data analyzed by Hadoop jobs by providing simple JavaScript and Hive consoles. The JavaScript console is unique to HDInsight and handles Pig Latin (ETL) as well as JavaScript and HDFS commands. HDInsight also provides a cost-efficient approach to managing and storing data: it uses Windows Azure Blob storage as a native file system. (A Binary Large Object (Blob) is a file of any type and size that can be stored in Windows Azure.)

One very appreciable thing about HDInsight is its highly interactive JavaScript and Hive consoles for configuring, scheduling, and monitoring jobs.

Sunday, July 14, 2013

What is new in Hadoop 2

The upcoming release of Hadoop is shaping up to be a major milestone in Hadoop's development, containing several significant improvements to HDFS and MapReduce (YARN), and it includes some very important new capabilities as well.

Hadoop 2 will deliver the first release of new HDFS features, including the new append pipeline, federation, wire compatibility, Namenode High Availability, HDFS snapshots, better storage density and file formats, caching, hierarchical storage management, and performance improvements. It covers architectural improvements in Namenode High Availability, federation, and snapshots. Apache Hadoop YARN is the new basis for running MapReduce and other applications on a Hadoop cluster; it turns Hadoop into a more generic data integration and processing system. As we already discussed, MapReduce 2 (YARN) provides many more generic data-processing capabilities in simple and efficient ways.


One very good feature I would like to focus on is Namenode High Availability. Earlier versions of Hadoop had a single Namenode controlling the cluster, which became a single point of failure (SPOF): if the Namenode machine was unavailable, the cluster as a whole would be unavailable until the Namenode was either rebooted or replaced by another machine. The Namenode High Availability feature addresses this problem by providing two Namenodes (introducing the Standby Node, a hot backup of the HDFS Namenode) sharing the same cluster in an active/passive configuration.
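As a rough sketch of what this active/passive setup looks like in Hadoop 2's configuration, HDFS HA is expressed by naming a logical nameservice and listing both Namenodes under it. The nameservice ID, hostnames, and port below are placeholder assumptions, not values from any particular cluster:

```xml
<!-- hdfs-site.xml fragment: two Namenodes behind one logical
     nameservice. "mycluster", the hostnames, and the port are
     illustrative placeholders. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
```

Clients address the logical nameservice rather than a physical host, which is what allows a failover from the active to the standby Namenode to be transparent to them.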

Today the Hadoop 2.0.5-alpha version is available in the market but still under development; it includes new developer- and user-facing incompatibilities, features, and major improvements. You can find the Hadoop 2.0.5-alpha release notes here. It is not really ready for production, but we can explore it for learning purposes and for developing POCs.

Saturday, July 13, 2013

Bigdata in Banking Domain

As financial industries grow with evolving business landscapes and increased information and business demands, finding efficient ways to store, organize, integrate, and analyze the continuously increasing flood of data is a really crucial job. How effectively they can make better business decisions based on this huge amount of data (in short, Bigdata) that they process on a daily or weekly basis will be a hurdle for the industry going forward. Nowadays the banking system has introduced very innovative and productive ideas like mobile banking and SMS banking, so we can carry banks in our pockets and every transaction is at our fingertips. As these ideas multiply, the risks of banking increase in equal proportion: fraud, fake transactions, fake user accounts, and misuse of banking products by thieves and hackers.

Banking industries have been using structured data for many years and finding ways to tackle such situations, but those methods are not very effective or accurate. So banks should focus not just on using more data, but on using more diverse data from the different sources available on the network; this includes not only the bank's internal transaction and profile data but also external information such as social networking data and application logs. Previously such data was considered of no use, but banks should use it for customer analysis and for getting more business insights out of it. Simply put, banks should use not only internal structured data (traditional data) but also external unstructured data, to grow with more accurate results and effective predictions.

Bigdata plays a very important role in protecting and securing end users and their banking activities. There are thousands of ways to protect customers from theft and fraud if you have enough data. Analyzing customer transactions and monitoring regular activities, such as the customer's salary, beneficiary transaction frequency, and the amount of every transaction, helps the banking industry profile its customers and compare customer locations against transaction locations.
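As a toy illustration of the kind of rule such transaction analysis enables (the 3x threshold and the sample history below are invented for the example, not a real banking rule), a transaction can be flagged when it deviates sharply from a customer's historical pattern:

```python
# Toy fraud-screening sketch: flag a transaction when it is far larger
# than the customer's historical average. The 3x threshold and the
# sample amounts are illustrative assumptions, not a production rule.

def is_suspicious(history, amount, threshold=3.0):
    """Return True if `amount` exceeds `threshold` times the
    customer's average historical transaction amount."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    average = sum(history) / len(history)
    return amount > threshold * average

if __name__ == "__main__":
    past_amounts = [120.0, 80.0, 100.0]        # average = 100.0
    print(is_suspicious(past_amounts, 450.0))  # True: 4.5x the average
    print(is_suspicious(past_amounts, 150.0))  # False: within normal range
```

A real system would of course combine many such signals (location, frequency, beneficiary patterns) and learn thresholds from data, but the principle is the same: the more history you keep, the sharper the baseline.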

Today social networking is becoming a very important part of every business network; we can find lots of ways to do customer analysis and sentiment analysis around products, as product reviews are easily available on such networking sites. There are hundreds of Hadoop-based solutions available to replace traditional, crucial banking analytics with new, real-time, less time-consuming solutions for developing true relationship-based analytics and finding out true business value from customers' views.
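To make the sentiment-analysis idea concrete, here is a deliberately naive sketch that scores reviews by counting positive and negative terms. The word lists and sample reviews are invented for illustration; real systems use far richer models:

```python
# Naive keyword-based sentiment scoring of product reviews.
# The word lists and sample reviews are illustrative assumptions.

POSITIVE = {"good", "great", "excellent", "easy", "fast"}
NEGATIVE = {"bad", "slow", "poor", "broken", "difficult"}

def sentiment_score(review):
    """Score = (count of positive words) - (count of negative words)."""
    words = review.lower().split()
    return (sum(w in POSITIVE for w in words)
            - sum(w in NEGATIVE for w in words))

if __name__ == "__main__":
    reviews = [
        "great product and fast delivery",   # score +2
        "slow support and poor packaging",   # score -2
    ]
    for review in reviews:
        print(review, "->", sentiment_score(review))
```

At Bigdata scale, the same scoring function would run as the map step of a MapReduce job over millions of reviews, with the reduce step aggregating scores per product.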

From a growing-business perspective, think again: take a look at what data (internal plus external) we have, how we can use it more effectively, and where we should focus to get more accuracy, so we can fight in a competitive market to survive and grow.

Thursday, July 11, 2013

Apache Hadoop: Solution for Bigdata

Nowadays "Bigdata" is the hottest word all over the business world; people are not just talking about Bigdata but finding business in it. What exactly is Bigdata? The simplest definition of Bigdata is data that comes at high velocity, in different varieties, and in huge volumes. The purpose of publishing this paper is not just to talk about Bigdata but to show how to integrate Bigdata into our current solutions, how to find more business insights around Bigdata, and how to find the hidden Bigdata dimensions around your business.
Apache Hadoop is an open source framework provided by the Apache foundation to deal with Bigdata. The power of Apache Hadoop is to provide a cost-efficient and effective solution so businesses can focus on exactly what matters: extracting business value from Bigdata. In this paper we will address the technical details of the Hadoop ecosystem architecture and its integration with real-time applications to process and analyze Bigdata and find out its various hidden dimensions, which helps our business grow.

Apache Hadoop as a Team:
Consider a regular scenario: you have a project team with one project manager and ten resources under him.
A client comes to your project manager and asks him to sort ten files, each file holding 100 pages of records. What is the best approach for the project manager to follow?
Exactly! What you are thinking is right: the project manager will distribute the ten files among the ten resources and keep only the record of who has what with him. This approach reduces each person's workload to about 1/10th, ultimately increasing speed and efficiency.
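The same divide-and-combine idea can be sketched in a few lines of Python. Here the "files" are just small lists of numbers (an assumption for illustration): each worker sorts its own file independently, and the manager only merges the already-sorted results:

```python
# Sketch of the project-manager analogy: split the work across workers,
# let each sort its own piece, then merge the results at the top. The
# input "files" (small lists of numbers) are illustrative assumptions.

import heapq

def worker_sort(file_records):
    """One team member sorts a single file on their own."""
    return sorted(file_records)

def manager_merge(files):
    """The manager merges the independently sorted files; heapq.merge
    assumes each input is already sorted, which is the workers' job."""
    sorted_files = [worker_sort(f) for f in files]
    return list(heapq.merge(*sorted_files))

if __name__ == "__main__":
    files = [[9, 1], [7, 3], [8, 2]]   # three small "files"
    print(manager_merge(files))        # [1, 2, 3, 7, 8, 9]
```

This is exactly the shape of a MapReduce job: the per-file sorting is the map phase running in parallel, and the final merge is the reduce phase.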

Hadoop Team Structure:
This is what Hadoop is: a data storage and processing team. Hadoop has data storage and data processing components, and it follows a master-slave architecture.

The physical structure of a Hadoop cluster is the same as the project team above: we have a manager, called the Namenode, and team members, called Datanodes. Data storage is the responsibility of the Datanodes (slaves), controlled by the Namenode at the master level; data processing is the responsibility of the TaskTrackers (slaves), controlled by the JobTracker at the master level.
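A toy model of this split of responsibilities, with block and node names invented for the example: the Namenode keeps only the bookkeeping of which Datanode holds which block, while the Datanodes hold the actual data.

```python
# Toy model of Namenode/Datanode roles: the master keeps only the
# bookkeeping (which node holds which block); the slaves hold the
# data itself. Block and node names are illustrative assumptions.

class Namenode:
    """Master: tracks block locations, stores no file data itself."""
    def __init__(self):
        self.block_locations = {}  # block id -> datanode name

    def register_block(self, block_id, datanode_name):
        self.block_locations[block_id] = datanode_name

    def locate(self, block_id):
        return self.block_locations[block_id]

class Datanode:
    """Slave: stores the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data, namenode):
        self.blocks[block_id] = data
        namenode.register_block(block_id, self.name)

if __name__ == "__main__":
    nn = Namenode()
    dn1 = Datanode("datanode-1")
    dn1.store("blk_001", b"hello", nn)
    print(nn.locate("blk_001"))   # datanode-1
```

This is also why the Namenode matters so much: lose the bookkeeping and the data on the Datanodes becomes unreachable, which is the single point of failure that Hadoop 2's Namenode High Availability addresses.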

Look at the diagram, map it onto the project team you already have, and see how interesting it is. Try to map everything onto real-world things; you can find many possible ways and solutions that way.

Thursday, July 4, 2013

Capitalizing Bigdata!

90% of the data created today is unstructured and more difficult to manage, generated from sources like social media (Facebook, Twitter), video (YouTube), text (application logs), audio, email (Gmail), and documents.

Bigdata is much more than data; it is already transforming the way businesses and organizations run. It represents a new way of doing business, creating a bright path for the future business world, one driven by data-oriented decision making and by new types of products and services influenced by data. The rapid explosion of Bigdata, and of ways to handle it, is changing the landscape not only of the IT industry but of all data-oriented systems. This data is becoming powerful and important as a driver for today's businesses, as it contains customer insight and business growth opportunities that have yet to be identified, or that no one has even thought of. But due to its volume, variety, and speed of change, most companies do not have enough resources to address this valuable data and get business out of it.

It's time to get together and find the ways and patterns in Bigdata that can help make our lives even simpler. We have the way (Hadoop), but we need to explore it more, to focus on true growth and to identify money-making opportunities.