Thursday, December 5, 2013

Social Networking Analysis

Around 75% of internet users around the world (1.43 billion social network users in 2012, a 19.2 percent increase over 2011) use online social networking portals like Facebook, Twitter and YouTube to share their experiences and to keep up with what's happening around them. Whenever a new product launches in any industry, we can find real experiences shared by its users on these portals. Nowadays social networking plays a very important role in business analysis and in locating the lagging and growing areas of a business, which helps businesses create a strategy to improve the lagging areas and maintain the quality of the growing ones.

Apache Hadoop plays a very important role in collecting and processing Bigdata and in providing near real-time analytics on top of it.

Sunday, September 29, 2013

Bigdata & TimeMachine

With the power of #Bigdata analytics, we can find out which movie is going to be a blockbuster next year, and not only the movie but the future itself: the TimeMachine. Yesterday I saw the movie Paycheck. Michael Jennings is a reverse engineer; he analyzes his clients' competitors' technology and recreates it, often adding improvements beyond the original specifications. I think this is one of the best real use cases of a Bigdata implementation.

Michael creates a Time Machine with his old college roommate, James Rethrick, the CEO of the successful technology company Allcom. After the TimeMachine is built, James wipes Michael's memory. But before the wipe, Michael sees his future (in the TimeMachine) and sends himself a parcel, which reaches him two years later. Using the things in the parcel, Michael (with his memory lost) is able to work out what he should do after those two years to save himself from James.
Now we can see how this correlates with Bigdata analytics. The Time Machine works on the principle of astrology: the things we did in the past are going to help us survive and find the right direction in the future. Technically, the data that we (and of course the people who have an impact on our lives) generated in the past gets analyzed, and using that analytics we are able to predict the future. Many companies are now analyzing the Bigdata generated by each business vertical and designing recommendation and decision engines to help businesses survive in the market.

Recommendation and decision engines, an area of predictive analytics and decision management, are going to be quite active next year. The pioneer in this area used collaborative filtering to generate “you might also want” or “next best offer” prompts for each product bought or page visited.
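
To make the “you might also want” idea a bit more concrete, here is a minimal sketch of item-to-item collaborative filtering based on simple co-purchase counts. The class name, the baskets and the product names are all made up for illustration, and real recommendation engines use far richer signals than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: recommend the items most often bought together with a
// given item (plain co-occurrence counting, not a production engine).
class CoPurchaseRecommender {

    // item -> (other item -> number of baskets containing both)
    private final Map<String, Map<String, Integer>> coCounts = new HashMap<>();

    // Record one basket: the distinct items bought together by one customer.
    void addBasket(List<String> basket) {
        for (String a : basket) {
            for (String b : basket) {
                if (!a.equals(b)) {
                    coCounts.computeIfAbsent(a, k -> new HashMap<>())
                            .merge(b, 1, Integer::sum);
                }
            }
        }
    }

    // Return up to 'limit' items, most frequently co-purchased first.
    // Ties are broken alphabetically so the result is deterministic.
    List<String> recommend(String item, int limit) {
        Map<String, Integer> counts = coCounts.getOrDefault(item, new HashMap<>());
        List<String> candidates = new ArrayList<>(counts.keySet());
        candidates.sort((x, y) -> {
            int byCount = counts.get(y) - counts.get(x);
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(limit, candidates.size())));
    }

    public static void main(String[] args) {
        CoPurchaseRecommender r = new CoPurchaseRecommender();
        r.addBasket(Arrays.asList("phone", "case", "charger"));
        r.addBasket(Arrays.asList("phone", "case"));
        r.addBasket(Arrays.asList("phone", "headset"));
        // "case" was bought with "phone" twice, the other items once each
        System.out.println(r.recommend("phone", 2));
    }
}
```

On a Hadoop cluster the counting step would be spread over many machines, but the idea stays the same: count what goes together, then rank.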

I really appreciate your valuable comments and suggestions, which guide me and you in directing our own future. Stay tuned for more updates on #TimeMachine.

Friday, September 27, 2013

Bigdata & Natural Language Processing(NLP)

Natural language processing (NLP) is increasingly discussed in social media and other business verticals, but often in reference to different technologies such as speech recognition, computer-assisted coding (CAC), and analytics. NLP is an enabling technology that allows computers to derive meaning from human, or natural, language input.

Media is data intensive from the customer satisfaction, product review and business perspectives. While the industry's transition to electronic data collection and storage has increased significantly in recent years, this has not actually forced physicians to code the majority of meaningful content. Eighty percent of meaningful data remains within unstructured text, as it does in most industries. This means that it remains in a format that cannot be easily searched or accessed electronically.

NLP can be leveraged to drive improvements in the financial, production, and operational aspects of business workflows:

For financial processes, automating data extraction for claims, banking transactions, financial auditing, and revenue cycle analytics can impact the top line. NLP can automatically extract underlying data, making claims more efficient and offering the potential for revenue analytics.
For production processes, NLP can automatically extract key quality measures from existing products and customer reviews for reporting and analytics. NLP can infer whether a product meets a quality measure and gauge the pre-launch response from customers, which helps in deciding a product launch strategy.

For operational processes, descriptive and predictive modeling can support more effective and efficient operations. NLP can extract hundreds of data elements about similar available products rather than covering only 2-4 products, producing better models and supporting business insight.

So, NLP is a powerful enabling technology, but it is not an end user application. It is not speech recognition or revenue cycle management or analytics. It can, however, enable all of these.

There is a battle underway that is increasingly recognized in the business space. Individual business divisions seek turnkey solutions and frequently purchase NLP-enabled products. But at a broader level.

We can use natural language processing for customer sentiment analysis, customer segmentation and many other business cases: to find out customer response to and satisfaction with similar products available in the market, to maintain the quality of an already released product, and to decide a business strategy that stands out in the market.
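
As a rough illustration of customer sentiment analysis, here is a minimal sketch that scores a product review against two small word lists. The class name and the word lists are my own assumptions for the example; real NLP systems rely on much larger lexicons or trained models.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of lexicon-based sentiment scoring for product reviews.
class ReviewSentiment {

    // Tiny illustrative word lists (assumptions, not a real lexicon)
    private static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("good", "great", "love", "excellent", "fast"));
    private static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("bad", "poor", "hate", "slow", "broken"));

    // +1 for every positive word, -1 for every negative word
    static int score(String review) {
        int score = 0;
        for (String token : review.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (POSITIVE.contains(token)) {
                score++;
            } else if (NEGATIVE.contains(token)) {
                score--;
            }
        }
        return score;
    }

    // Map the numeric score to a coarse label
    static String label(String review) {
        int s = score(review);
        return s > 0 ? "positive" : (s < 0 ? "negative" : "neutral");
    }

    public static void main(String[] args) {
        System.out.println(label("Great phone, love the camera"));
        System.out.println(label("Battery is poor and the screen arrived broken"));
        System.out.println(label("It is a phone"));
    }
}
```

Counting labels like these per product gives a first, very coarse picture of how customers respond to it.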

Thursday, August 8, 2013

Friend Recommender In MapReduce

Hello guys, today MapReduce is becoming a very popular framework for designing data processing systems for applications that have a huge amount of data, in short #Bigdata. The main reason behind the popularity of MapReduce is scalability. You can easily carry out very complex data processing over a huge amount of data in a very short span of time (nearly real time), unlike traditional data processing systems, which take hours to process it.

Here I want to discuss a very popular use case of Bigdata processing: friend recommendation, or you may call it artifact recommendation.

Here is the problem.
How do we find the Nth degree mutual friends from a given list of friends? For example,
if A is a direct friend of B and B is a direct friend of C, then C is a 2nd degree mutual friend of A.
Below is the input (user ids and their direct friends' user ids).


In the first phase, MapReduce finds the group of friends for each user: the Map phase produces 2×N mappings (each of the N input pairs is emitted in both directions) and the Reduce phase reduces them to one group of friends per user.

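import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Both classes are nested inside the job's driver class.
public static class Map extends Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
   String line[] = value.toString().split("\\t");
   String fromUser = line[0].trim();

   if (line.length == 2) {
    String toUser = line[1].trim();
    // Emit the friendship in both directions
    context.write(new Text(toUser), new Text(fromUser));
    context.write(new Text(fromUser), new Text(toUser));
   } else {
    // Assumed else branch (the snippet was truncated here): a user without
    // friends is emitted with an empty value instead of null
    context.write(new Text(fromUser), new Text(""));
   }
  }
}

public static class Reduce extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {

   // Collect all friends of this user, skipping the empty placeholder
   List<String> userEntryList = new ArrayList<>();
   for (Text e : values) {
    if (e.getLength() > 0) {
     userEntryList.add(e.toString());
    }
   }
   context.write(key, new Text(userEntryList.toString()));
  }
}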

The generated output is:
5101   [5102, 5106]
5102   [5104, 5105, 5101, 5104]
5103   [5106]
5104   [5102, 5102]
5105   [5102]
5106   [5107, 5103, 5101]
5107   [5106]

Now you need to find the 2nd degree friends, that is, the friends of each friend.
In the Map phase, emit <touser1, r=touser2, m=fromuser>, where touser1 is the current user, r is the recommended friend and m is the mutual friend. If A is a friend of B and B is a friend of C, then we can recommend C to A through the mutual friend B, so the formula becomes <touser1=A, r=touser2=C, m=fromuser=B>. For a user with n friends this emits n(n-1) recommendation records, so on the order of n^2 records are emitted by the map phase. In the reduce phase we just count how many mutual friends there are between the current user (the key) and each recommended user.

As the emitted value is not a primitive type in Hadoop, we create our own datatype:

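import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

static public class FriendCount implements Writable {
  public Long user;
  public Long mutualFriend;

  public FriendCount(Long user, Long mutualFriend) {
   this.user = user;
   this.mutualFriend = mutualFriend;
  }

  // Hadoop needs a no-argument constructor to deserialize the value
  public FriendCount() {
   this(-1L, -1L);
  }

  @Override
  public void write(DataOutput out) throws IOException {
   // Write the fields in the same order readFields() reads them
   out.writeLong(user);
   out.writeLong(mutualFriend);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
   user = in.readLong();
   mutualFriend = in.readLong();
  }

  @Override
  public String toString() {
   return " toUser: "
     + Long.toString(user) + " mutualFriend: " + Long.toString(mutualFriend);
  }
}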

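Map and Reduce can be implemented as follows:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class Map extends Mapper<LongWritable, Text, LongWritable, FriendCount> {

  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
   // Input line: userId <TAB> [friend1, friend2, ...] (output of the first phase)
   String line[] = value.toString().split("\\t");
   Long fromUser = Long.parseLong(line[0].trim());
   List<Long> toUsers = new ArrayList<Long>();

   if (line.length == 2) {
    StringTokenizer tokenizer = new StringTokenizer(line[1], ",");
    while (tokenizer.hasMoreTokens()) {
     String token = tokenizer.nextToken().replace("[", "").replace("]", "").trim();
     if (token.isEmpty()) continue;
     Long toUser = Long.parseLong(token);
     toUsers.add(toUser);
     // -1 marks toUser as already being a direct friend of fromUser
     context.write(new LongWritable(fromUser), new FriendCount(toUser, -1L));
    }

    // Every pair of fromUser's friends has fromUser as a mutual friend
    for (int i = 0; i < toUsers.size(); i++) {
     for (int j = i + 1; j < toUsers.size(); j++) {
      context.write(new LongWritable(toUsers.get(i)), new FriendCount((toUsers.get(j)), fromUser));
      context.write(new LongWritable(toUsers.get(j)), new FriendCount((toUsers.get(i)), fromUser));
     }
    }
   }
  }
}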
 // Needs java.util.Comparator, HashMap, HashSet, Set, TreeMap and
 // org.apache.hadoop.mapreduce.Reducer in addition to the mapper's imports.
 public static class Reduce extends Reducer<LongWritable, FriendCount, LongWritable, Text> {
  public void reduce(LongWritable key, Iterable<FriendCount> values, Context context)
    throws IOException, InterruptedException {

   // recommended user -> set of mutual friends; null means "already a direct friend"
   final java.util.Map<Long, Set<Long>> mutualFriends = new HashMap<Long, Set<Long>>();

   for (FriendCount val : values) {
    final boolean isAlreadyFriend = (val.mutualFriend == -1);
    final Long toUser = val.user;
    final Long mutualFriend = val.mutualFriend;

    if (mutualFriends.containsKey(toUser)) {
     if (isAlreadyFriend) {
      mutualFriends.put(toUser, null);
     } else if (mutualFriends.get(toUser) != null) {
      mutualFriends.get(toUser).add(mutualFriend);
     }
    } else {
     if (!isAlreadyFriend) {
      Set<Long> friends = new HashSet<Long>();
      friends.add(mutualFriend);
      mutualFriends.put(toUser, friends);
     } else {
      mutualFriends.put(toUser, null);
     }
    }
   }

   // Sort candidates by number of mutual friends (descending), then by user id.
   // The comparator never returns 0, which is acceptable for insert-and-iterate use.
   java.util.SortedMap<Long, Set<Long>> sortedMutualFriends = new TreeMap<Long, Set<Long>>(new Comparator<Long>() {
    public int compare(Long key1, Long key2) {
     Integer v1 = mutualFriends.get(key1).size();
     Integer v2 = mutualFriends.get(key2).size();
     if (v1 > v2) {
      return -1;
     } else if (v1.equals(v2) && key1 < key2) {
      return -1;
     } else {
      return 1;
     }
    }
   });

   for (java.util.Map.Entry<Long, Set<Long>> entry : mutualFriends.entrySet()) {
    if (entry.getValue() != null) {
     sortedMutualFriends.put(entry.getKey(), entry.getValue());
    }
   }

   // The original output loop was truncated. This reconstruction is an assumption:
   // it lists each recommended user followed by the mutual friends.
   StringBuilder output = new StringBuilder();
   for (java.util.Map.Entry<Long, Set<Long>> entry : sortedMutualFriends.entrySet()) {
    if (output.length() > 0) {
     output.append(",");
    }
    output.append(entry.getKey()).append(" ").append(entry.getValue());
   }

   context.write(key, new Text(output.toString()));
  }
 }
In the final output, the first entry on each line is the current user id, followed by the direct friends and then the recommended friends:
[5101, 5102, 5106, 5104, 5105, 5107, 5103]
[5102, 5104, 5105, 5101, 5104, 5106]
[5103, 5106, 5107, 5101]
[5104, 5102, 5102, 5105, 5101]
[5105, 5102, 5104, 5101]
[5106, 5107, 5103, 5101, 5102]
[5107, 5106, 5103, 5101]
You can implement the same logic in a simple Java program without using the MapReduce framework. It works well but is not as scalable as MapReduce. Below is the plain Java code to find the recommended friends, which might help you design the MapReduce version.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FriendRecommendationWithoutMapReduce extends TreeMap<String, List<String>> {

 // Overloading put to append friends of the same user
 public void put(String key, String number) {
  List<String> current = get(key);
  if (current == null) {
   current = new ArrayList<String>();
   super.put(key, current);
  }
  current.add(number);
 }

 public static void main(String[] args) {
  FriendRecommendationWithoutMapReduce user = new FriendRecommendationWithoutMapReduce();
  // Putting all values in map
  user.put("5101", "5102");
  user.put("5102", "5104");
  user.put("5102", "5105");
  user.put("5103", "5106");
  user.put("5101", "5106");
  user.put("5106", "5107");
  user.put("5104", "5102");
  // Putting the same values in reverse
  // (the 5102 -> 5101 entry was missing from the listing; it is needed to reproduce the output shown above)
  user.put("5102", "5101");
  user.put("5104", "5102");
  user.put("5105", "5102");
  user.put("5106", "5103");
  user.put("5106", "5101");
  user.put("5107", "5106");
  user.put("5102", "5104");
  System.out.println("\n___________________Group By Friends__________________________\n");
  ArrayList<String> userEntryList = new ArrayList<>();

  // For N=2
  for (Map.Entry<String, List<String>> e : user.entrySet()) {
   System.out.println(e.getKey() + "    " + e.getValue());
   userEntryList.add(e.getKey());
  }

  System.out.println("\n___________________Final Output__________________________\n");
  // For Rest Case
  for (int i = 0; i <= userEntryList.size() - 1; i++) {
   // Output line: the user, then the direct friends, then the recommended friends
   List<String> output = new ArrayList<>();
   String current = userEntryList.get(i);
   output.add(current);

   // Get All 2nd degree Related Friend of User i
   List<String> friends = user.get(current);
   output.addAll(friends);
   for (int j = 0; j < friends.size(); j++) {
    List<String> aList = user.get(friends.get(j));
    for (int k = 0; k < aList.size(); k++) {
     // Skip the user itself and anyone already listed
     if (!output.contains(aList.get(k))) {
      output.add(aList.get(k));
     }
    }
   }
   System.out.println(output);
  }
  System.out.println("\n___________________End Final Output__________________________\n");
 }
}

Wednesday, July 17, 2013

Hadoop Ecosystem on Windows Azure

As Microsoft becomes one of the popular vendors in the Bigdata Hadoop market, it has developed a cloud-based Bigdata solution, "Windows Azure HDInsight", which processes and analyzes Big Data and finds new business insights from it using the power of the Apache Hadoop ecosystem. Windows Azure HDInsight, a Big Data solution powered by Apache Hadoop, is used to gain valuable business insights by processing and analyzing data, including unstructured data, and helps businesses make real-time decisions.

The HDInsight Service makes Apache Hadoop available as a service in the cloud. It lets you provision a Hadoop cluster in minutes and scale it down once you have run your MapReduce jobs. It offers various ways to gain performance and effective output in a very interactive way, such as choosing the cluster size to optimize for time to insight or for cost. HDInsight also supports many programming languages, including Java and the .NET technologies.

Reference :

You can find the core services, data processing frameworks, Microsoft integration points and value-added services, data movement services, and packages exposed by Windows Azure HDInsight in the above diagram. It makes HDFS and MapReduce, the components of the Hadoop framework, available in a simpler, more scalable, and cost-efficient Windows Azure environment. HDInsight simplifies Hadoop configuration, monitoring and post-processing of the data analysed by Hadoop jobs by providing simple JavaScript and Hive consoles. The JavaScript console is unique to HDInsight and handles Pig Latin (ETL) as well as JavaScript and HDFS commands. HDInsight also provides a cost-efficient approach to managing and storing data: it uses Windows Azure Blob Storage as a native file system. (Binary Large Object (Blob): a file of any type and size that can be stored in Windows Azure.)

A really appreciable thing about HDInsight is its very interactive JavaScript and Hive consoles for configuring, scheduling and monitoring jobs.

Sunday, July 14, 2013

What is new in Hadoop 2

The upcoming release of Hadoop is becoming a major milestone in Hadoop development, containing several significant improvements in HDFS and MapReduce (YARN) as well as some very important new capabilities.

Hadoop 2 will deliver the first release of new features such as HDFS improvements including the new append pipeline, federation, wire compatibility, Namenode High Availability, HDFS snapshots, better storage density and file formats, caching and hierarchical storage management, and performance improvements. It covers architectural improvements in Namenode High Availability, federation and snapshots. Apache Hadoop YARN is the new basis for running MapReduce and other applications on a Hadoop cluster. It turns Hadoop into a more generic data integration and processing system. As we already discussed, MapReduce 2 (YARN) provides much more generic data processing functionality in a simple and efficient way.


One very good feature I would like to focus on is Namenode High Availability. Earlier versions of Hadoop have a single Namenode controlling the cluster, which makes it a single point of failure (SPOF): if the Namenode machine is unavailable, the cluster as a whole is unavailable until the Namenode is either rebooted or replaced by another machine. The Namenode High Availability feature addresses this problem by providing the option of running two Namenodes (it introduces the StandbyNode, a hot backup of the HDFS Namenode) in the same cluster in an active/passive configuration.

Today the Hadoop 2.0.5-alpha version is available, but it is still under development; it includes new developer- and user-facing incompatibilities, features, and major improvements. You can find the Hadoop 2.0.5-alpha release notes here. It is not really ready for production, but we can explore it for learning purposes and for developing POCs.

Saturday, July 13, 2013

Bigdata in Banking Domain

As the financial industry grows with evolving business landscapes and increasing information and business demands, finding efficient ways to store, organize, integrate and analyze the continuously increasing mass of data is a really crucial job. How effectively banks can make better business decisions based on this huge amount of data (in short, Bigdata) that they process on a daily or weekly basis will be a hurdle for the industry going forward. Nowadays banking systems have introduced very innovative and productive ideas like mobile banking and SMS banking, so we are able to carry banks in our pockets and every transaction is at our fingertips. As these ideas multiply, the risks of banking increase proportionally too: fraud, fake transactions, fake user accounts, and misuse of banking products by thieves and hackers.

The banking industry has been using structured data for many years to find ways to tackle such situations, but these are not that effective or accurate. So banks should focus not just on using more data but on using a more diverse variety of data from the different data sources available on the network. This includes not only the bank's internal transaction and profile data but also external information such as social networking data and application logs. Previously such data was considered to be of no use, but banks should use it for customer analysis and for getting more business insights. Simply put, banks should use not only internal structured data (traditional data) but also external unstructured data to grow with more accurate results and effective predictions.

Bigdata plays a very important role in protecting end users and their banking activities. There are thousands of ways to protect customers from theft and fraud if you have enough data. We can analyze customer transactions and monitor regular activities like the customer's salary, the frequency of beneficiary transactions and the amount of every transaction, and we can compare the customer's location with the transaction location.
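
Monitoring the amount of every transaction against the customer's regular activity can be sketched as below. This is only a minimal illustration under my own assumptions: the class name, the sample amounts and the threshold of 3 standard deviations are made up, and a real bank would combine many more signals.

```java
// Hypothetical sketch: flag a transaction whose amount is far away from the
// customer's usual amounts, measured in standard deviations (z-score).
class AmountOutlierCheck {

    static boolean isUnusual(double[] history, double amount, double threshold) {
        if (history.length < 2) {
            return false; // not enough history to judge
        }
        double mean = 0;
        for (double h : history) {
            mean += h;
        }
        mean /= history.length;

        double variance = 0;
        for (double h : history) {
            variance += (h - mean) * (h - mean);
        }
        variance /= (history.length - 1); // sample variance
        double std = Math.sqrt(variance);

        if (std == 0) {
            return amount != mean; // all past amounts identical
        }
        return Math.abs(amount - mean) / std > threshold;
    }

    public static void main(String[] args) {
        double[] usual = {100, 120, 90, 110, 105}; // made-up past amounts
        System.out.println(isUnusual(usual, 115, 3.0));
        System.out.println(isUnusual(usual, 5000, 3.0));
    }
}
```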

Today social networking is a very important part of every business network. We can find lots of ways to do customer analysis and sentiment analysis on products, as product reviews are easily available on such networking sites. There are hundreds of Hadoop-based solutions available to replace traditional banking analytics with new real-time, less time-consuming solutions, to develop true relationship-based analytics and to find the true business value from the customers' point of view.

Think again from a growing business perspective: take a look at what data (internal plus external) we have, how we can use it more effectively, and where we should focus to get more accuracy, so that we can survive and grow in a competitive market.

Thursday, July 11, 2013

Apache Hadoop: Solution for Bigdata

Nowadays “Bigdata” is the hottest word all over the business world; people are not just talking about Bigdata but finding business in it. What exactly is Bigdata? The simplest definition: Bigdata is data that comes with high velocity, in different varieties and in huge volumes. The purpose of this paper is not just to talk about Bigdata but to show how to integrate Bigdata into our current solutions and how to find more business insights and hidden Bigdata dimensions around your business.
Apache Hadoop is the open source framework provided by the Apache foundation to deal with Bigdata. The power of Apache Hadoop is that it provides a cost-efficient and effective solution that lets businesses focus on exactly what matters: extracting business value from Bigdata. In this paper we will address the technical details of the Hadoop ecosystem architecture and its integration with real-time applications to process and analyze Bigdata and to find its various hidden dimensions, which help our business grow.

Apache Hadoop as a Team:
Consider a regular scenario: you have a project team with one project manager and ten resources under him.
Suppose a client comes to your project manager and asks him to sort ten files, each containing 100 pages of records. What is the best approach for your project manager to follow?
Exactly! What you are thinking is right: the project manager will distribute the ten files among the ten resources and keep only the tracking record with himself. This approach reduces the workload per person to about 1/10th, which ultimately increases speed and efficiency.
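
The project manager analogy can be sketched on a single machine with a thread pool. This is not Hadoop code, just an illustration of the same divide-the-work idea; the class and method names are my own, and the "files" are small in-memory arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a "project manager" hands each file to a worker thread
// and only keeps track of the pending results.
class TeamSort {

    static List<int[]> sortAll(List<int[]> files, int workers) {
        ExecutorService team = Executors.newFixedThreadPool(workers);
        try {
            // Distribute: one sorting task per file
            List<Future<int[]>> pending = new ArrayList<>();
            for (int[] file : files) {
                pending.add(team.submit(() -> {
                    int[] copy = file.clone();
                    Arrays.sort(copy);
                    return copy;
                }));
            }
            // Track: collect the results in the original order
            List<int[]> sorted = new ArrayList<>();
            for (Future<int[]> result : pending) {
                sorted.add(result.get());
            }
            return sorted;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("a worker failed", e);
        } finally {
            team.shutdown();
        }
    }

    public static void main(String[] args) {
        List<int[]> files = Arrays.asList(new int[] {3, 1, 2}, new int[] {9, 7, 8});
        for (int[] sortedFile : sortAll(files, 2)) {
            System.out.println(Arrays.toString(sortedFile));
        }
    }
}
```

Hadoop applies the same idea across machines instead of threads, and also takes care of storing the files and recovering from failed workers.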

Hadoop Team Structure:
This is what Hadoop is: a data storage and processing team. Hadoop has data storage and data processing components, and it follows a master-slave architecture.

The physical structure of a Hadoop cluster is the same as the project team above: we have a manager called the namenode and team members called datanodes. Data storage is the responsibility of the datanodes (slaves), controlled by the namenode at the master level. Data processing is the responsibility of the task trackers (slaves), and the controller over the task trackers is the job tracker at the master level.

You can look at the diagram, map it to the project team you already have, and see how interesting it is. Try to map everything to real-world things and you will find many possible approaches and solutions.

Thursday, July 4, 2013

Capitalizing Bigdata!

90% of the data created today is unstructured and more difficult to manage. It is generated by data sources like social media (Facebook, Twitter), video (YouTube), text (application logs), audio (Viacom), email (Gmail), and documents.

Bigdata is much more than data and is already transforming the way businesses and organizations are run. It represents a new way of doing business and a bright path for the future business world, one that is driven by data-oriented decision making and new types of products and services influenced by data. The rapid explosion of Bigdata and of the ways to handle it is changing the landscape not only of the IT industry but of all data-oriented systems. This data is becoming so powerful and important for today's businesses because it contains customer insights and business growth opportunities that have yet to be identified, or that no one even had an idea about. But due to its volume, variety and speed of change, most companies don't have enough resources to address this valuable data and get business out of it.

It's time to get together and find the patterns in Bigdata that can help make our lives even simpler. We have the way (Hadoop), but we need to explore it more to focus on true growth and to identify money-making opportunities.

Friday, June 28, 2013

Apache Hadoop YARN : Next Generation MapReduce

MapReduce has undergone a complete transformation in hadoop-0.x, and now we have MapReduce v2, or YARN.

The main inspiration behind the development of MapReduce v2 (YARN) is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. MapReduce v2 has a global Resource-Manager (RM) and a per-application Application-Master (AM), where an application is either a single client job or a workflow of jobs.

The Resource-Manager (RM) has authority over the Node-Managers (NM), the per-node slaves, and arbitrates resources among all the applications in the system. The Application-Master (AM) is a framework-specific component responsible for negotiating resources with the Resource-Manager and for working with the Node-Managers to execute and monitor the tasks.

As MapReduce v2 has two core responsibilities, resource management and job scheduling/monitoring, the Resource-Manager (RM) has two core components: the Scheduler and the Applications-Manager.

The Scheduler is responsible for allocating resources to the various running applications as per their requirements and configuration. It is a pure scheduler: it does not perform monitoring or status tracking of the applications. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so through the notion of a resource Container, which incorporates elements such as memory, CPU, disk, network etc.

The Applications-Manager is responsible for accepting job submissions, negotiating the first Container for executing the application-specific Application-Master, and restarting the Application-Master Container on application or hardware failure. The Node-Manager is the per-machine slave agent that is responsible for Containers, monitoring their resource usage and reporting it to the Resource-Manager. The per-application Application-Master is responsible for negotiating appropriate resource Containers from the Scheduler, tracking their status and monitoring their progress.

MapReduce v2 maintains API compatibility with the previous stable release, which means all previous jobs will run on MapReduce v2; they just need to be recompiled.


Monday, June 24, 2013

Fraud Detection and Risk Prediction in the Era of Bigdata

Fraud detection and risk prediction is a multi-million dollar business and it is growing every year. As mentioned on Wikipedia, the PwC global economic crime survey of 2009 suggests that close to 30% of companies worldwide reported being victims of fraud in the past year.

Traditional methods of data analysis and mining have long been used to detect fraud. They require complex architectures and time-consuming computations that deal with different domains like finance, economics and business practices, and still the results produced are not that accurate. Fraud often consists of many instances or incidents involving repeated offences using the same method. Fraud instances can be similar in content and appearance but are usually not identical.

How exactly does Bigdata help to find fraud or to predict the most likely risk factors?
There are thousands of data sources with very large volumes and varieties that are ignored by traditional fraud analysis techniques and methods. In short, this is Bigdata, and it includes social media, transaction logs, application logs, weblogs, geographical data etc.

For example: a guy has taken a loan of, say, 1,00,000 from a bank with a monthly installment of 10,000. He regularly pays the installments for the first four months as per policy, and after that he is unable to pay the remaining installments because funds are unavailable. But he is posting pics of his new car, new home or foreign trip on Twitter. This guy is already a defaulter in the bank's records because of unavailability of funds, yet he keeps posting photos of his new car on Twitter or Facebook. So bank officials can take immediate action without waiting for the fraud to happen.

The second example: a person who lives in India keeps withdrawing, or trying to withdraw, money from Delhi, New York, London and Paris every day. We can find his geolocation history using Google Maps and compare it with the transaction locations, resulting in immediate action.
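
The second example can be sketched as an "impossible travel" check between two consecutive transactions. This is a minimal sketch under my own assumptions: the class name, the approximate city coordinates and the 900 km/h speed limit are illustrative, and the coordinates are passed in directly instead of being looked up from a location history.

```java
// Hypothetical sketch: flag two transactions when the customer would have had
// to travel faster than a plausible speed to make both of them.
class ImpossibleTravelCheck {

    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two points (haversine formula)
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // True when the implied speed between the two transactions exceeds maxSpeedKmh
    static boolean isSuspicious(double lat1, double lon1, double lat2, double lon2,
                                double hoursBetween, double maxSpeedKmh) {
        if (hoursBetween <= 0) {
            throw new IllegalArgumentException("hoursBetween must be positive");
        }
        return distanceKm(lat1, lon1, lat2, lon2) / hoursBetween > maxSpeedKmh;
    }

    public static void main(String[] args) {
        // Approximate coordinates: Delhi (28.61, 77.21), London (51.51, -0.13)
        System.out.println(distanceKm(28.61, 77.21, 51.51, -0.13));
        // Delhi, then London two hours later
        System.out.println(isSuspicious(28.61, 77.21, 51.51, -0.13, 2.0, 900.0));
        // Two places within Delhi, one hour apart
        System.out.println(isSuspicious(28.61, 77.21, 28.70, 77.10, 1.0, 900.0));
    }
}
```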

There are many more Bigdata use cases for fraud detection and risk analysis. The most important advantage of using Bigdata over traditional systems is the higher accuracy of results and predictions: accuracy and the likelihood of correct predictions are directly proportional to the size and number of sources of data.

Nowadays we have technology that can run Bigdata analytics in nearly real time, without wasting much time on computations and calculations, so action can be taken before the fraud happens. High performance analytics is not just a technology fad. With new distributed computing options like Hadoop and in-memory processing on commodity hardware, insurers can have access to a flexible and scalable real-time big data analytics solution at a reasonable cost.

Saturday, June 15, 2013

What do people really think about Bigdata?

How much do you think people are aware of the Bigdata world and its advantages and disadvantages? Or are they just aware of it and don't know how to use it? Is Bigdata analytics really hell? Is Bigdata playing the role of hero or villain in our day-to-day life?

Yes, these are some of the headlines I found on the internet while I was studying Bigdata analytics. Is Bigdata analysis really a difficult job? In my experience I didn't find such hardness and difficulty while going through it. "If you know how to create Bigdata, then you should know how to bring business value out of it": this is the simple line I'm following.

Just think from the end user's perspective and you will get to know many more dimensions and directions for analysing Bigdata. Do it and be successful in the Bigdata era.

Being a simple end user is not that difficult a task, I think :)

Tuesday, June 4, 2013

Bigdata and Business Verticals

We are an active part of the Bigdata ecosystem: our day-to-day lifestyle and activities are responsible for data generation, and the systems around us can collect the data, analyse it and consume it for their business to help our lifestyle. Nowadays the world is more interconnected than ever before in history because of the internet and mobile devices. Each day we are creating about 2.5 quintillion (2.5×10^18) bytes of data, a huge amount created by different verticals in the industry. These verticals are using this massive amount of information to rise above the business cloud. But before using such a huge amount of information, the industry must be aware of the real-time business scenarios, in short the 'use cases', for which to implement a Bigdata analysis solution.

We'll focus on some key industry verticals/domains which are using or are most likely to use Bigdata analysis. Below are some Bigdata value creation opportunities.

Financial Services:
-Fraud Detection
-Model and manage risk
-Improve debt recovery rates
-Personalized banking and insurance products
-Recommendation of banking products

Retail and Consumer Packaged Goods Industry:
-Customer Care Call Centers
-Customer Sentiment Analysis
-Campaign management and customer loyalty programs
-Supply Chain Management and Logistics
-Window Shoppers
-Location based Marketing
-Predicting Purchases and Recommendations

Manufacturing Industry:
-Design to value
-Consumer Sentiment Analysis
-Supply Chain Management and Logistics
-Preventive Maintenance and Repairs
-Digital factory for lean manufacturing
-Improve service via product sensor data

Healthcare:
-Optimal treatment pathways
-Remote patient monitoring
-Predictive modeling for new drugs
-Personalized medicine
-Patient behavior and sentiment data
-Pharmaceutical R&D data

Web/Social/Mobile Industry:
-Location based marketing
-Social segmentation
-Sentiment analysis
-Price comparison services
-Recommendation engines
-Advertisements/promotions and Web Campaigns

Government:
-Reduce fraud
-Segment population, customize action
-Support open data initiatives
-Automate decision making
-Election Campaigns

Data growth in every section of every vertical is viral and the speed of data generation is tremendous, so a Bigdata capability is needed to address such business problems. Get ready soon and make your business capable of tackling the big elephant of information.

Monday, June 3, 2013

Bigdata : Impact on day to day life

Does Bigdata really have an impact on our day-to-day life? If you had asked this question 10 years ago, the answer might have been no. But nowadays, if you go shopping at a mall, Google Maps tracks you, your home and your route to the mall, and suggests similar malls near you. You reach the mall and go to the mobile store, where the shop cameras watch which section you spend the most time in and suggest similar sections to shop in. When you pick up a gadget, they calculate your interest and recommend gadgets with similar features and functionality, with discounts (as they also want to grow their business :) ). The result: leaving home you had decided on gadget A, and you actually bought gadget B because of the attractive offer on it.

From healthcare to sports, from retail stores to e-banking, from business to social networking, to the way we commute to the office, big data is making big changes to the way we live our lives. In particular, the internet is becoming more and more important in everyone's life every day. Everyone likes to share information on social sites, and social networking sites are becoming very popular in the business world. Businesses are becoming more and more consumer-centric with the help of social networking and easily available information, and they use this information to find customer trends and the business opportunities in them. Think about this and you will see why e-commerce businesses are becoming more and more popular these days, how weather forecasting has become so accurate, why healthcare programs are arranged on particular days of the year, and how fraud is detected among the millions of bank transactions made each day.

This is all about bigdata. We are surrounded by it because we are responsible for generating it, and businesses simply use it for their own purposes to help us, so ultimately both sides benefit. We are happy because we get a better and more convenient solution even when we had not thought about it, and it directly impacts the annual revenue of businesses.

Friday, May 31, 2013

Big Business with Big Opportunities

Nowadays businesses are struggling with the abnormally growing volume, speed and variety of information generated every day, and the complexity of that information is also growing rapidly. This is what has come to be known as 'Bigdata'. Many companies are seeking technology that not only helps them store and process bigdata but also helps them find more business insights in it and build their business strategies on it.

Around 80% of the information in the world is unstructured, and many businesses do not even attempt to use that information to their advantage, or are not aware of how to use it. Imagine if your business could afford to keep all the data it generates and keep tracking and analyzing it. Imagine if you knew how to handle that bigdata.

The data explosion presents a great challenge to businesses. Today most of them lack the technology and the knowledge to deal with bigdata and get real business value from it. Many companies are focusing on developing the skills and the insight into business needs required to accelerate the transformation of larger data sets.

What can bigdata do? Businesses are growing with bigdata, finding more business insights and raw data that carries value for the business, using the latest bigdata processing technologies such as the Hadoop framework.
It is now possible to track each individual user through cell phones and wireless sensors, measuring his interest in particular things, where he lives, works and plays, and what his daily routine is. This data can be collected and analysed with bigdata processing technologies to find ways of doing business with each individual user that help him or make his life simpler.
Bigdata in Social Networking: millions of Facebook comments and updates, Twitter tweets and more are generated every day. Bigdata processing can be used to find current market trends, what people are talking about, and their likes and dislikes, so that we can plan our business accordingly.
Bigdata in Healthcare: every hospital or healthcare organization maintains historical records along with patient records, which can amount to bigdata. Technology can analyse those past records and predict which patients are likely to come in, when and for what cause, and what the possible treatments for similar causes are.
Bigdata in BFSI: in the BFSI domain fault tolerance is one of the most important pillars. Among the millions of daily banking transactions we want to find the fraudulent ones, and bigdata helps here. It also plays a major role in product recommendations and transaction analysis.
Bigdata in E-Commerce: on online shopping sites you might have seen messages like 'you bought this, you may like this'. These are recommendations calculated by bigdata processing technologies.

About 90% of the information we have today was generated in just the last 2 years, and this trend is continuing. I believe that after 2025 about 70% of businesses in the world will be generated by Bigdata or be Bigdata-oriented. A product will be delivered to the customer when he is just thinking about it, a cab will be waiting for us when we decide to go shopping, and discounts will already be there on the shirt we might think of buying.

Thursday, May 30, 2013

WebHDFS REST API : Complete FileSystem interface for HDFS

The HTTP REST API supports most file system operations on the Hadoop file system, such as reading, writing, opening, modifying and deleting files, using HTTP GET, POST, PUT and DELETE operations.

HTTP Operations:

WebHDFS FileSystem URIs:

The FileSystem URI format of WebHDFS is as below.

webhdfs://<HOST>:<HTTP_PORT>/<PATH>

In the WebHDFS REST API, the prefix /webhdfs/v1 is inserted in the path and a query is appended at the end.
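As a small illustration of this URL layout, here is a Python sketch that assembles a WebHDFS REST URL from a host, port, HDFS path and operation. The helper name webhdfs_url and the host namenode.example.com are made up for the example; only the /webhdfs/v1 prefix and the op query parameter come from the description above.

```python
from urllib.parse import quote, urlencode

WEBHDFS_PREFIX = "/webhdfs/v1"  # prefix inserted before the HDFS path


def webhdfs_url(host, port, hdfs_path, op, **params):
    """Build http://<host>:<port>/webhdfs/v1/<path>?op=<OP>&... (hypothetical helper)."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute, e.g. /user/ubantu/input")
    # The operation name always comes first, followed by any optional parameters.
    query = urlencode({"op": op.upper(), **params})
    return f"http://{host}:{port}{WEBHDFS_PREFIX}{quote(hdfs_path)}?{query}"


if __name__ == "__main__":
    # namenode.example.com is a placeholder host, not a real cluster.
    print(webhdfs_url("namenode.example.com", 50070,
                      "/user/ubantu/input/data0.txt", "open"))
```

Keeping the prefix and the query handling in one place avoids hand-typing long URLs for each curl call.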

To enable WebHDFS on your Hadoop cluster, you need to add some properties to the hdfs-site.xml configuration file so that HDFS is accessible through the WebHDFS REST API.

1. dfs.webhdfs.enabled
This is the basic, mandatory property you need to add to hdfs-site.xml (set to true) to enable HDFS access.

2. dfs.web.authentication.kerberos.principal
The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint. This is needed only if you are using Kerberos authentication.

3. dfs.web.authentication.kerberos.keytab
The Kerberos keytab file with the credentials for the HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint. This is needed only if you are using Kerberos authentication.
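To show what these entries look like inside hdfs-site.xml, the Python sketch below renders a dictionary of properties in Hadoop's configuration/property/name/value layout. The function name render_hadoop_config is invented for this example, and the Kerberos principal and keytab values are placeholders, not working settings.

```python
import xml.etree.ElementTree as ET


def render_hadoop_config(properties):
    """Render a dict of settings in Hadoop's <configuration>/<property> XML layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_hadoop_config({
        "dfs.webhdfs.enabled": "true",
        # Placeholder values; only relevant on a Kerberos-secured cluster.
        "dfs.web.authentication.kerberos.principal": "HTTP/_HOST@EXAMPLE.COM",
        "dfs.web.authentication.kerberos.keytab": "/path/to/http.keytab",
    }))
```

The printed property elements can be pasted between the configuration tags of an existing hdfs-site.xml.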

File System Operations:
1. Create and Write into a File:
The create operation takes two steps so that clients do not send out the file data before the redirect.
Step 1: Submit an HTTP PUT request without automatically following redirects and without sending the file data.

curl -i -X PUT "http://<MyHost>:50070/webhdfs/v1/user/ubantu/input/data0.txt?op=CREATE&overwrite=true&blocksize=1234&replication=1&permission=777&buffersize=123"

The request is redirected to a datanode where the file data is to be written, with a message like this on the console:

Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
Content-Length: 0

Step 2:
Submit another HTTP PUT request, using the URL in the Location header, with the file data to be written.

curl -i -X PUT -T /home/ubantu/hadoop/hadoop-1.0.4/input/data0.txt "http://<DATANODE>:50075/webhdfs/v1/user/ubantu/input/data0.txt?op=CREATE&overwrite=true"
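The same two-step exchange can be sketched in Python with the standard http.client module, which never follows redirects on its own. This is only an outline under assumptions: the function names are invented for the example, the default port 50070 is the one used in the curl commands here, and the 307 and 201 status codes are what the WebHDFS documentation describes for the redirect and for a successful create.

```python
import http.client
from urllib.parse import urlencode, urlsplit


def split_location(location):
    """Split a Location header URL into (host, port, path-with-query)."""
    parts = urlsplit(location)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return parts.hostname, parts.port, target


def webhdfs_create(namenode, hdfs_path, data, port=50070):
    """Two-step CREATE: ask the namenode first, then send the bytes to the datanode."""
    # Step 1: PUT with no body, so no file data is sent before the redirect.
    query = urlencode({"op": "CREATE", "overwrite": "true"})
    conn = http.client.HTTPConnection(namenode, port)
    conn.request("PUT", f"/webhdfs/v1{hdfs_path}?{query}")
    first = conn.getresponse()
    first.read()
    location = first.getheader("Location")
    conn.close()
    if first.status != 307 or not location:
        raise RuntimeError(f"expected a 307 redirect, got {first.status}")

    # Step 2: PUT the file data to the URL taken from the Location header.
    host, dn_port, target = split_location(location)
    dn = http.client.HTTPConnection(host, dn_port)
    dn.request("PUT", target, body=data)
    second = dn.getresponse()
    second.read()
    dn.close()
    return second.status  # 201 Created is expected on success
```

A call such as webhdfs_create("<MyHost>", "/user/ubantu/input/data0.txt", b"hello") would need a reachable cluster, so it is not run here.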

2. Open and Read a File:
To open a particular file in HDFS you need to know the path where the file is stored and the name of the file. You can use the HTTP GET request below to open and read a file from HDFS; the -L option makes curl follow the redirect to the datanode.

curl -i -L "http://<MyHOST>:50070/webhdfs/v1/user/ubantu/input/data0.txt?op=OPEN"

3. Delete a File:
To delete a file from HDFS you need to submit an HTTP DELETE request as below:

curl -i -X DELETE "http://<MyHOST>:50070/webhdfs/v1/user/ubantu/input/data0.txt?op=DELETE&recursive=true"

4. Set the Owner of a File:
Submit an HTTP PUT request as below. (To set permissions instead, use op=SETPERMISSION with a permission=<OCTAL> parameter.)

curl -i -X PUT "http://<MyHOST>:50070/webhdfs/v1/user/ubantu/input/data0.txt?op=SETOWNER&owner=<USER>&group=<GROUP>"

Error response:
When an operation fails, the server may return a specific HTTP error code for the exception raised, as below:

IllegalArgumentException          400 Bad Request
UnsupportedOperationException     400 Bad Request
SecurityException                 401 Unauthorized
IOException                       403 Forbidden
FileNotFoundException             404 Not Found
RuntimeException                  500 Internal Server Error
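A client can use this mapping when reporting failures. The Python sketch below assumes the JSON error body described in the WebHDFS documentation (a RemoteException object with exception, javaClassName and message fields); the function name describe_error and the sample message are made up for illustration.

```python
import json

# Exception name -> HTTP status, following the mapping listed above.
EXPECTED_STATUS = {
    "IllegalArgumentException": 400,
    "UnsupportedOperationException": 400,
    "SecurityException": 401,
    "IOException": 403,
    "FileNotFoundException": 404,
    "RuntimeException": 500,
}


def describe_error(status, body):
    """Turn a failed WebHDFS response into a one-line message."""
    try:
        remote = json.loads(body).get("RemoteException")
    except (ValueError, AttributeError, TypeError):
        remote = None
    if not isinstance(remote, dict):
        return f"HTTP {status}: unrecognised error body"
    name = remote.get("exception", "UnknownException")
    # Flag responses whose status does not match the mapping for that exception.
    expected = EXPECTED_STATUS.get(name, status)
    note = "" if expected == status else " (unexpected status)"
    return f"HTTP {status} {name}: {remote.get('message', '')}{note}"


if __name__ == "__main__":
    sample = json.dumps({"RemoteException": {
        "exception": "FileNotFoundException",
        "javaClassName": "java.io.FileNotFoundException",
        "message": "File does not exist: /user/ubantu/missing.txt",  # invented sample
    }})
    print(describe_error(404, sample))
```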

For more details related to webHDFS REST APIs do visit: 

Friday, May 24, 2013

NoSQL brings Hadoop Live

Hadoop is designed for processing large amounts of data of different varieties at high processing speed, but not really in real time: there is some latency between a real-time application's request and Hadoop's response. Integrating Hadoop with a real-time application is a tedious and complex job, and of course a most important one. If we have the capability to store and process large data but cannot access it in real time, that capability is of little use.
Previously, Apache HttpFS and Hoop were used for this integration, but over time the new WebHDFS (RESTful) services have become available to access HDFS (Hadoop Distributed File System) over HTTP or similar protocols.

This is the era of NoSQL databases, which are replacing current, traditional RDBMS systems because of their many advantages over them. NoSQL databases are designed to deal with huge amounts of data, in short "Bigdata", where the data can be in any form, does not require a relational model, and may or may not be structured. However, NoSQL should be used only when data storage and retrieval matter more than the relationships between the elements.

Now think what happens when two bigdata-handling giants come together, and how powerful they are together. We can use Hadoop with a NoSQL database to respond to a real-time application.

Hadoop-NoSQL Integration with Realtime Application

In the architecture diagram above you can see that the frontend application communicates with the NoSQL database (as we are replacing the RDBMS with a NoSQL DB) and Hadoop integrates with the NoSQL database. Hadoop takes input data from the NoSQL database, does the processing and stores the output data back into the NoSQL database, so the frontend application can easily access the processed data in the UI. It is as simple as that. The one complex part here is accessing NoSQL data from within the Hadoop jobs.

Nowadays many NoSQL databases provide connectors for Hadoop (e.g. the MongoDB-Hadoop Connector), so we can easily read data from and store data into the NoSQL database from Hadoop jobs.

We can even generate BI reports from bigdata. For example, we can import database tables (structured) and application logs (unstructured) into HDFS as Hive/HBase tables through ETL jobs using Sqoop/Flume, and then use the available BI connectors for HDFS/Hive/HBase to generate business reports from bigdata.

Sunday, May 12, 2013

Starting with NoSQL Database

All of us know that the Hadoop ecosystem is designed for processing and analyzing bigdata in batch mode, not really for real-time purposes. But we can bring Hadoop live using NoSQL databases. Now think what happens when two big giants come together, and how powerful they are together.

"NoSQL" is referred to by some in the industry as "Not only SQL", implying that NoSQL systems support SQL-like query languages, but that is not 100% correct. NoSQL databases are highly optimized for data retrieval and access operations. This is because the storage system of NoSQL databases is based on key-value pairs: each data item in a NoSQL database has a unique key, which removes the runtime query complexity of traditional relational databases. With this highly scalable and optimized performance model, NoSQL databases have made a footprint in the emerging market of valuable data warehousing.

NoSQL databases are designed to deal with huge amounts of data, in short "Bigdata", where the data can be in any form, does not require a relational model, and may or may not be structured. However, NoSQL should be used only when data storage and retrieval matter more than the relationships between the elements. A NoSQL store can hold millions of key-value pairs and, for simple key lookups, access them faster than a typical relational database can. This makes it very useful for real-time statistical analysis of growing data such as application logs.

There is no guarantee that a NoSQL database supports full ACID operations. Perhaps only eventual consistency is possible, or transactions are limited to single data items. Although most NoSQL systems support transactions over single data objects, several support transactions over multiple data objects. For example, the eXtreme Scale NoSQL system supports transactions over a single object, while systems such as FoundationDB and OrientDB support transactions over multiple objects like traditional RDBMS databases.

Let us focus on storing and accessing data in a NoSQL database. As mentioned above, a NoSQL system uses key-value pairs to store data in the data store. Key-value pairs allow data to be stored in a schemaless way, so data can be stored in the form of programming language objects, such as POJO classes in Java, and read back in the same form. Because of this, there is no need for a fixed data model to store or access the data.

The evolution of NoSQL databases has now brought the next step, the graph database, in which data is stored as a graph of items related to each other, so related data can be accessed faster.
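The schemaless key-value idea can be illustrated with a toy in-memory store in Python. This is not how any particular NoSQL product is implemented; the class names, keys and sample values are invented for the example, with a plain Python object standing in for the POJO.

```python
import copy


class User:
    """Plain object playing the role of the POJO in this sketch."""

    def __init__(self, name, city):
        self.name = name
        self.city = city


class KeyValueStore:
    """Toy store: every item lives under a unique key and no schema is declared."""

    def __init__(self):
        self._data = {}

    def put(self, key, obj):
        # Store a private copy so later changes by the caller do not leak in.
        self._data[key] = copy.deepcopy(obj)

    def get(self, key):
        # Lookup is a single dictionary access by key, with no joins or query planning.
        return copy.deepcopy(self._data[key])


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("user:1", User("Asha", "Pune"))
    store.put("log:42", {"level": "WARN", "msg": "disk 91% full"})  # different shape
    print(store.get("user:1").name, store.get("log:42")["level"])
```

Two values of completely different shapes sit side by side, which is the point of having no fixed data model.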

This is just an overview of what a NoSQL database is and how it works.

There are hundreds of NoSQL databases available in the market waiting for you. Some of them are:
1. HBase - A hadoop Database
2. MongoDB
3. Cassandra
For more related to the NoSQL dbs please do visit:

Wednesday, May 8, 2013

Integration of Hadoop with Business Intelligence(BI) and Data Warehousing(DW)

As we know, Hadoop's power is that it can store data in large volumes and of different varieties. But if we have a large amount of valuable data and are not able to present it, then it is of no more use than raw data. The real value of data emerges only when it is presented in an impressive manner, such as tables, charts and graphs.
We can store structured, unstructured and semi-structured data in the Hadoop ecosystem, for example on HDFS, in the Hive data warehouse or in the Hadoop database (HBase).

When reports come into the picture, people think of Business Intelligence and Data Warehousing tools. Indeed, many BI and DW tools now provide connectivity to Hadoop, because their vendors know that bigdata is the need of tomorrow, and I believe so too. BI tool providers like Pentaho have already started working on bigdata and Hadoop and have created very useful tools for dealing with bigdata; those tools are really impressive. Pentaho provides support for HDFS, to access data from the Hadoop file system and create reports from it. It also supports connecting to the Hive data warehouse and the HBase database to generate business value from them. Many more organizations are taking a careful look at the benefits, barriers and emerging best practices for integrating Hadoop into BI and DW.

Below you can see how we are able to customize reports through the Hadoop (Hive) and Pentaho integration.

We can even develop our own BI/ETL tool by integrating Hadoop with Sqoop/Flume to do ETL (Extract, Transform and Load) processing on the data, and generate reports or meaningful relational data from the raw data. The architecture behind every bigdata BI or ETL tool is the same; only the frontend differs.

Hadoop promises to assist with the toughest challenges in BI today, including big data analysis processing, advanced analytics, and handling unstructured and structured data together.

For more details about BI and ETL tools:

Thursday, April 25, 2013

Job Scheduling for Hadoop.

As we know, Hadoop processes and analyses large amounts of data of different varieties at high processing speed, but to achieve this performance at the maximum level and with higher efficiency, job scheduling is very important.

Hadoop supports three types of scheduling:
1. FIFO Scheduler - First In First Out
2. Fair Scheduler - Each job gets an equal share of processing time.
3. Capacity Scheduler - Queues with guaranteed capacities

FIFO Scheduler :
This is the default scheduler. The original scheduling algorithm that was integrated within the Job Tracker was called FIFO. In FIFO scheduling, a Job Tracker pulled jobs from a work queue, oldest job first. This scheduler had no concept of the priority or size of the job, but the approach was simple to implement and efficient.

Fair Scheduler :
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple users. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job gets.
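The weighting idea can be made concrete with a few lines of Python. This is a simplified sketch of proportional sharing, not the Fair Scheduler's actual code; the function name fair_shares, the job names and the 100-slot cluster are assumptions for the example.

```python
def fair_shares(total_slots, weights):
    """Split task slots across running jobs in proportion to their weights."""
    total_weight = sum(weights.values())
    return {job: total_slots * w / total_weight for job, w in weights.items()}


if __name__ == "__main__":
    print(fair_shares(100, {"job_a": 1}))              # a lone job uses the whole cluster
    print(fair_shares(100, {"job_a": 1, "job_b": 1}))  # two equal jobs split it evenly
    print(fair_shares(100, {"job_a": 3, "job_b": 1}))  # priority acts as a weight
```

With equal weights every job converges on the same share, and raising one job's weight shifts slots to it without starving the others.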

Capacity Scheduler : 
The capacity scheduler shares some of the principles of the fair scheduler but has distinct differences, too. First, capacity scheduling was defined for large clusters, which may have multiple, independent consumers and target applications. For this reason, capacity scheduling provides greater control as well as the ability to provide a minimum capacity guarantee and share excess capacity among users.
In capacity scheduling, instead of pools, several queues are created, each with a configurable number of map and reduce slots. Each queue is also assigned a guaranteed capacity (where the overall capacity of the cluster is the sum of each queue's capacity).
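A rough way to picture guaranteed capacity and the sharing of excess capacity is the Python sketch below. It is a deliberate simplification with invented names (capacity_allocation, the prod and dev queues, a 100-slot cluster) and ignores everything else the real Capacity Scheduler does, such as per-user limits.

```python
def capacity_allocation(total_slots, guaranteed_pct, demand):
    """Give each queue min(demand, guarantee), then lend spare slots to busy queues.

    guaranteed_pct: queue -> guaranteed share of the cluster, in percent
    demand:         queue -> number of slots the queue could use right now
    """
    alloc = {q: min(demand[q], total_slots * pct // 100)
             for q, pct in guaranteed_pct.items()}
    spare = total_slots - sum(alloc.values())
    # Hand out excess capacity one slot at a time to queues that still want more.
    while spare > 0:
        hungry = [q for q in alloc if alloc[q] < demand[q]]
        if not hungry:
            break
        for q in hungry:
            if spare == 0:
                break
            alloc[q] += 1
            spare -= 1
    return alloc


if __name__ == "__main__":
    # prod is guaranteed 60% but is nearly idle, so dev borrows the unused slots.
    print(capacity_allocation(100, {"prod": 60, "dev": 40}, {"prod": 20, "dev": 90}))
```

When both queues are busy, each one falls back to its guaranteed share.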

This scheduler was developed by Yahoo!.


Wednesday, April 24, 2013

Would Hadoop really replace Traditional Data warehousing domains?

If someone asks you 'Will Hadoop be the future of data warehousing? Will it replace the traditional data warehousing systems?', what will your reaction be? You will think that he must be kidding, and that there is no point in discussing such a question because you know how important data warehousing is. But that is only if you are unaware of Hadoop.

Yes, Traditional Data Warehouse can in fact address this specific use case reasonably well from an architectural standpoint. But given that the most cutting-edge cloud analytics is happening in Hadoop clusters, it’s just a matter of time (one to two years, tops) before all data warehouse vendors bring Hadoop into the heart of their architectures. For those vendors who haven’t yet fully committed to full Hadoop integration, the growing real-world adoption of this open-source approach will force their hands.

Where the next-generation Data Warehouse is concerned, the petabyte staging cloud is merely Hadoop’s initial footprint. Enterprises are moving rapidly toward the Data Warehouse as the hub for all future analytics. Again, the impressive growth in MapReduce for predictive modeling, data mining (Mahout), and content analytics will practically compel Data Warehouse vendors to optimize their platforms for MapReduce.

Yes, handling Bigdata is not just a matter of volume; it also means variety of data and velocity.

Friday, March 22, 2013

Recommendations with Apache Mahout


Have you ever been recommended a friend on Facebook? Or visited a shopping portal where you can see items recommended for you, or an item you might be interested in on Amazon? If so, then you've benefited from the value of recommendation systems.
For example, you often see personalized recommendations phrased something like, “If you liked that item, you might also like this one...” These sites use recommendations to help drive users to other things they offer in an intelligent, meaningful way, tailored specifically to the user and the user’s preferences.

Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very managed way.

From an algorithmic standpoint, the recommendation systems we’ll talk about today are considered in the k-nearest neighbor family of problems (another type would be a SVD-based recommender). We want to predict the estimated preference of a user towards an item they have never seen before. We also want to generate a ranked (by preference score) list of items the user might be most interested in. Two well-known styles of recommendation algorithms are item-based recommenders and user-based recommenders. Both types rely on the concept of a similarity function/metric (ex: Euclidean distance, log likelihood), whether it is for users or items.
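To make the notion of a similarity function concrete, here is a small Python sketch based on Euclidean distance. It uses one common convention, 1 / (1 + distance), computed over the items both parties have rated; Mahout's own similarity classes may normalise differently, and the function name and sample ratings are invented for illustration.

```python
from math import sqrt


def euclidean_similarity(ratings_a, ratings_b):
    """Return 1 / (1 + Euclidean distance) over co-rated items, or 0.0 if none are shared."""
    shared = ratings_a.keys() & ratings_b.keys()
    if not shared:
        return 0.0  # nothing in common to compare
    distance = sqrt(sum((ratings_a[i] - ratings_b[i]) ** 2 for i in shared))
    return 1.0 / (1.0 + distance)


if __name__ == "__main__":
    alice = {"m1": 5, "m2": 1, "m3": 4}
    bob = {"m1": 2, "m2": 5}
    print(euclidean_similarity(alice, bob))    # far apart on m1 and m2
    print(euclidean_similarity(alice, alice))  # identical ratings
```

The same function works for users (dictionaries keyed by item) and for items (dictionaries keyed by user), which is why both recommender styles can share it.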

Overview of a recommendation engine

The main purpose of a recommendation engine is to make inferences on existing data to show relationships between objects and entities. Objects can be many things, including users, items, products(in short user related data) and so on. Relationships provide a degree of likeness or belonging between objects. For example, relationships can represent ratings of how much a user likes an item, or indicate if a user bookmarked a particular page.

To make a recommendation, recommendation engines perform several steps to mine the data(Data mining). Initially, you begin with input data that represents the objects as well as their relationships. Input data consists of object identifiers and the relationships to other objects.

Consider the ratings users give to items. Using this input data, a recommendation engine computes a similarity between objects. Computing the similarity between objects (co-similarity) can take a great deal of time depending on the size of the data or the particular algorithm. Distributed frameworks such as Apache Hadoop, with Mahout, can be used to parallelize the computation of the similarities. There are different types of algorithms to compute similarities. Finally, using the similarity information, the recommendation engine can answer recommendation requests based on the parameters requested.
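The steps just described (input ratings, item-to-item similarity, then ranked recommendations) can be sketched on a single machine in Python. This toy version swaps in plain co-occurrence counts as the similarity measure and is not Mahout's distributed job; all names and ratings are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(ratings):
    """Count how often each pair of items was rated by the same user."""
    counts = defaultdict(int)
    for items in ratings.values():
        for a, b in combinations(sorted(items), 2):
            counts[a, b] += 1
            counts[b, a] += 1
    return counts


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by co-occurrence-weighted rating sums."""
    counts = cooccurrence(ratings)
    seen = ratings[user]
    scores = defaultdict(float)
    for item, rating in seen.items():
        for (a, b), n in counts.items():
            if a == item and b not in seen:
                scores[b] += n * rating
    # Highest score first; ties are broken by item name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


if __name__ == "__main__":
    ratings = {
        "u1": {"A": 5, "B": 3, "C": 4},
        "u2": {"A": 4, "B": 2},
        "u3": {"B": 5, "C": 5, "D": 4},
        "u4": {"A": 5, "D": 2},
    }
    print(recommend(ratings, "u2"))
```

The pairwise counting is the expensive part on real data, which is the step Hadoop parallelizes.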

For Example:
GroupLens Movie Data

The input data for this demo is based on 1M anonymous ratings of approximately 4000 movies made by 6,040 MovieLens users, which you can download from the site. The zip file contains four files:

movies.dat (movie ids with title and category)
ratings.dat (ratings of movies)
users.dat (user information)
README (description of the data set)

The ratings file is most interesting to us since it’s the main input to our recommendation job. Each line has the format:
UserID::MovieID::Rating::Timestamp


So let’s adjust our input file to match what we need to run our job. First download the file and unzip it locally from:

Next run the command:
        tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > ratings.csv

This produces the csv output format we’ll use in the next section when we run our “Itembased Collaborative Filtering” job.
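For readers without tr and cut at hand, the same conversion can be sketched in Python. It assumes the MovieLens 1M line layout UserID::MovieID::Rating::Timestamp; the function names are invented for this example, and the file names are the ones used in the command above.

```python
def ratings_line_to_csv(line):
    """Convert 'UserID::MovieID::Rating::Timestamp' to 'UserID,MovieID,Rating'."""
    user_id, movie_id, rating = line.strip().split("::")[:3]
    return f"{user_id},{movie_id},{rating}"


def convert(src_path, dst_path):
    """Rewrite a whole ratings.dat file as a three-column CSV file."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(ratings_line_to_csv(line) + "\n")


if __name__ == "__main__":
    print(ratings_line_to_csv("1::1193::5::978300760"))
    # convert("ratings.dat", "ratings.csv")  # uncomment next to the unzipped data
```

Dropping the timestamp keeps only the user, item and preference columns.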

        hadoop fs -put [my_local_file] [user_file_location_in_hdfs]

This command puts the input file on HDFS.

Create a user.txt file which stores the IDs (userID) of the users to whom we want to show recommendations, and put it on HDFS under the users directory.
With our user list in HDFS we can now run the Mahout recommendation job with a command of the form:
       mahout recommenditembased --input [input-hdfs-path] --output [output-hdfs-path] --tempDir [tmp-hdfs-path] --usersFile [user_file_location_in_hdfs]

This will run for a while (a chain of 10 MapReduce jobs) and then write the item recommendations into HDFS, where we can take a look at them. We can view the output from the RecommenderJob with the command:

         hadoop fs -cat [output-hdfs-path]/part-r-00000

The output will show the users (provided in user.txt) with their recommended items.

For more details: