Sunday, November 12, 2017

Serialization Frameworks(Protocols), When and What?

Technically, there are 2 important comparison differences between below protocols:

- Statically typed or dynamically typed
- Type mapping between language's type system and serializer's type system (Note: these serializers are cross-language)

The most understandable difference is "statically typed" vs "dynamically typed". It affects that how to manage compatibility of data and programs. Statically typed serializers don't store detailed type information of objects into the serialized data, because it is explained in source codes or IDL. Dynamically typed serializers store type information by the side of values.

- Statically typed: Protocol Buffers, Thrift
- Dynamically typed: JSON, Avro, MessagePack, BSON

Generally speaking, statically typed serializers can store objects in fewer bytes. But they they can't detect errors in the IDL (=mismatch of data and IDL). They must believe IDL is correct since data don't include type information. It means statically typed serializers are high-performance but you must strongly care about compatibility of data and programs.
Note that some serializers have original improvements for the problems. Protocol Buffers store some (not detailed) type information into data. Thus it can detect mismatch of IDL and data. MessagePack stores type information in effective format. Thus its data size becomes smaller than Protocol Buffers or Thrift (depends on data).

Type systems are also important difference. Following list compares type systems of Protocol Buffers, Avro and MessagePack:

- Protocol Buffers: int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64, sfixed32, sfixed64, double, float, bool, string, bytes, repeated, message
- Avro: int, long, float, double, boolean, null, float, double, bytes, fixed, string, enum, array, map, record
- MessagePack: Integer, Float, Boolean, Nil, Raw, Array, Map (=same as JSON)

Serializers must map these types into/from language's types to achieve cross-language compatibility. It means that some types supported by your favorite language can't be stored by some serializers. Or too many types may cause interoperability problems. For example, Protocol Buffers doesn't have map (dictionary) type. Avro doesn't tell unsigned integers from signed integers, while Protocol Buffers does. Avro has enum type, while Protocol Buffers and MessagePack don't have.

It was necessary for their designers. Protocol Buffers are initially designed for C++ while Avro for Java. MessagePack aims interoperability with JSON.

Some of the advantages and disadvantages of the Serialization frameworks

1. XML
Advantage of the XML is human  readable/editable, extensibility, interoperability, XML provides a structure to data so that it is richer in information, XML is easily processed because the structure of the data is simple and standard and There is a wide range of reusable software available to programmers to handle XML so they don't have to re-invent code. XML provides Many views of the one data. XML separates the presentation of data from the structure of that data. It is standard for SOAP etc.

XML use Unicode encoding of the data. XML can be parsed without know schema in advance.

JSON is much easier for human to read than XML. It is easier to write, too. It is also easier for machines to read and write. JSON also provides a structure to data so that it is richer in information and easy processing. JSON has better data exchange format, JSON would perform great for correct usecase. JSON schema and structures are based on arrays and records into the JSON Object. JSON has excellent brower support and less verbose than XML.

As same as XML, JSON also use Unicode encoding format of the data. Advantage of the JSON over XML is the size of the message is much smaller than XML and JSON readability/editability. JSON can be parsed without know schema in advance.

JSON is just beginning to become known. Its simplicity and the ease of converting XML to JSON makes JSON ultimately more adoptable.

3. Protocol Buffer
Protocol Buffer has small output size, very dense data, but very fast processing. It is hard to robustly decode without knowing the schema. (data format is internally ambiguous, and needs schema to clarify) Only machine can able to understand is, not intended for the human eys. (dense binary)

Protocol Buffer has full backword compatibilty, and requires less boilerplate code for parsing as compared to JSON or XML.

4. BSON(Binary JSON)
BSON can be compared to binary interchange formats, like Protocol Buffers. BSON is more "schema-less" than Protocol Buffers, which can give it an advantage in flexibility but also a slight disadvantage in space efficiency (BSON has overhead for field names within the serialized data).

BSON is Lightweight, Keeping spatial overhead to a minimum is important for any data representation format, especially when used over the network. Traversable, BSON is designed to be traversed easily. This is a vital property in its role as the primary data representation for MongoDB.

Efficient, Encoding data to BSON and decoding from BSON can be performed very quickly in most languages due to the use of C data types.

5. Apache Thrift
Apache Thrift provides, Cross-language serialization with lower overhead than alternatives such as SOAP due to use of binary format, A lean and clean library. No framework to code. No XML configuration files. The language bindings feel natural. For example Java uses ArrayList<String>. C++ uses std::vector<std::string>. The application-level wire format and the serialization-level wire format are cleanly separated. They can be modified independently. The predefined serialization styles include: binary, HTTP-friendly and compact binary and Doubles as cross-language file serialization.

Thrift does not require a centralized and explicit mechanism like major-version/minor-version. Loosely coupled teams can freely evolve RPC calls. No build dependencies or non-standard software. No mix of incompatible software licenses.

6. Message Pack
The tagline of Message Pack says 'It's like JSON. but fast and small.' Message Pack is not human readable as it stores data into binary format.

7. Apache AVRO
Nowadays Apache AVRO becoming popular into the industry is because of the message size and evolving schemas feature.

Schema evolution – Avro requires schemas when data is written or read. Most interesting is that you can use different schemas for serialization and deserialization, and Avro will handle the missing/extra/modified fields. Untagged data – Providing a schema with binary data allows each datum be written without overhead. The result is more compact data encoding, and faster data processing. Dynamic typing – This refers to serialization and deserialization without code generation. It complements the code generation, which is available in Avro for statically typed languages as an optional optimization.

Tuesday, July 18, 2017

Music Analytics Opportunities

Once upon a time the words music listening habits were private as their bedrooms, music lovers used to buy the CDs, recordings and other physical copies of music and never publicly shared, Record companies were aware which radio station played their songs and where their CDs were popular, but that information painted an incomplete picture at best. Who knew what music people were sharing on tapes and CDs burnt in the privacy of their own bedrooms?

A traditional business metrics like number of CDs were sold and nothing happened after that, who purchased what and whom to assist what to buy, all this was anonymous. Thats all changed after explosion of online music sources like torrenting, music streaming sites and social media platforms, are now playing a very key role for music industry to understand their fans, spot upcoming talents like never before and anyones personal music interest nowadays becoming a public. Music analytics is now worth around $24.35 billion per year.

Image result for Music Analytics Opportunities

At the same time that the internet is taking power away from record labels, it is also giving them the ability to predict future hits.

Sunday, March 12, 2017

My First Publication: YARN Essentials

 Image result for yarn essentials
If you have a working knowledge of Hadoop 1.x but want to start afresh with YARN, this book is ideal for you. You will be able to install and administer a YARN cluster and also discover the configuration settings to fine-tune your cluster both in terms of performance and scalability. This book will help you develop, deploy, and run multiple applications/frameworks on the same shared YARN cluster.

YARN is the next generation generic resource platform used to manage resources in a typical cluster and is designed to support multi-tenancy in its core architecture. As optimal resource utilization is central to the design of YARN, learning how to fully utilize the available fine-grained resources (RAM, CPU cycles, and so on) in the cluster becomes vital.

This book is an easy-to-follow, self-learning guide to help you start working with YARN. Beginning with an overview of YARN and Hadoop, you will dive into the pitfalls of Hadoop 1.x and how YARN takes us to the next level. You will learn the concepts, terminology, architecture, core components, and key interactions, and cover the installation and administration of a YARN cluster as well as learning about YARN application development with new and emerging data processing frameworks.

Thank you!

Big Businesses with Big Opportunities

Nowadays Businesses are struggling with abnormally growing volumes, speed and variety of information that used to generate everyday, everyday the complexity of information generation is also rapidly growing - the term to be known for as 'Bigdata'. Many companies are seeking for the technology to not only help them to bigdata storage and process but also finding many more business insights from bigdata and growing up the business strategies with bigdata.

Arround 80% information in world is unstructured, and many businesses are not even attempting to use  that information for their advantage or not aware how to use that information. Imagine if you and your business keep afford that all data generated by you business and keep tracking and analyzing it, Imagine if know to way the handle that bigdata?

The data explosion presents great challenge to businesses, today most lack the technology and knowledge about bigdata and how to deal with it and get real business values. Many Companies are focusing on the developing skills and insights of business needs to accelerate the path of transforming larger data sets. 
What bigdata can do? Businesses are growing up with bigdata to finding more business insights and row that caries values for business with latest bigdata processing technologies like Hadoop fromework.
Its now possible to track each individual user through cell phones, wireless sensors with measurement of his interest in particular thing, where does he lives, works, plays and what is his day to day program and collect the data, analyse this huge data using bigdata processing technologies and find out the business ways with each individual user to help or make his life simpler.
Bigdata in Social Networking, day to day millions of facebook comments, updates, twitter tweets are generating and many more so using bigdata processing to find out current market trends, what people are talking about, their likes, dislikes accordingly plan our business.
Bigdata in Healthcare, every hospital or healthcare organization maintaining their historical records with patients records which may kind of bigdata so technology can analyse that past records and predict in future which patients, on what date, with the cause and what are the possible treatments for similar cause.
Bigdata in BFSI, In BFSI domain fault tolerance is the one of the most important pillar so, there are millions of daily banking transactions are there we want to find out the fake transactions, bigdata helps us even for product recommendations, transaction analysis bigdata plays a major role.
Bigdata in  ECommerce, somewhere and somehow on online shopping sites you might seen dialogs like 'you bought this you may like this', this is kind of recommendations calculated by bigdata processing technologies.

The information that we have today, about 90% of information is generated in just last 2 years, I believe after 2020 there will about 70% businesses in world would wont be having any other option. Products would be delivered to the customer if he just think buying about it, Cab will arrive when people think of going out and discounts would be already there on Shirt we might think to buy.

Saturday, February 11, 2017

Recommendations with Apache Mahout


Have you ever been recommended a friend on Facebook? Or visited a shopping portal where you can see the recommended items for you, Or an item you might be interested in on Amazon? If so then you've benefited from the value of recommendation systems.
for example, often see personalized recommendations phrased something like, “If you liked that item, you might like also like this one...” These sites use recommendations to help drive users  to other things they offer in an intelligent, meaningful way, tailored specifically to the user and the user’s preferences.

Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very managed way.

From an algorithmic standpoint, the recommendation systems we’ll talk about today are considered in the k-nearest neighbor family of problems (another type would be a SVD-based recommender). We want to predict the estimated preference of a user towards an item they have never seen before. We also want to generate a ranked (by preference score) list of items the user might be most interested in. Two well-known styles of recommendation algorithms are item-based recommenders and user-based recommenders. Both types rely on the concept of a similarity function/metric (ex: Euclidean distance, log likelihood), whether it is for users or items.

Overview of a recommendation engine

The main purpose of a recommendation engine is to make inferences on existing data to show relationships between objects and entities. Objects can be many things, including users, items, products(in short user related data) and so on. Relationships provide a degree of likeness or belonging between objects. For example, relationships can represent ratings of how much a user likes an item, or indicate if a user bookmarked a particular page.

To make a recommendation, recommendation engines perform several steps to mine the data(Data mining). Initially, you begin with input data that represents the objects as well as their relationships. Input data consists of object identifiers and the relationships to other objects.

Consider the ratings users give to items. Using this input data, a recommendation engine computes a similarity between objects. Computing the similarity between objects(co-similarity) can take a great deal of time depending on the size of the data or the particular algorithm. Distributed algorithms such as Apache Hadoop using Mahout can be used to parallelize the computation of the similarities. There are different types of algorithms to compute similarities. Finally, using the similarity information, the recommendation engine can make recommendation requests based on the parameters requested.

For Example:
GroupLens Movie Data

The input data for this demo is based on 1M anonymous ratings of approximately 4000 movies made by 6,040 MovieLens users, which you can download from the site. The zip file contains four files:

movies.dat (movie ids with title and category)
ratings.dat (ratings of movies)
users.dat (user information)

The ratings file is most interesting to us since it’s the main input to our recommendation job. Each line has the format:
Ratings.dat description


So let’s adjust our input file to match what we need to run our job. First download the file and unzip it locally from:

Next run the command:
        tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > ratings.csv

This produces the csv output format we’ll use in the next section when we run our “Itembased Collaborative Filtering” job.

        hadoop fs -put [my_local_file] [user_file_location_in_hdfs]

this command put  input file on HDFS,

create user.txt file which stores the data(userID) of the users to which we want show recommendations.
put it on HDFS under users directory.
With our user list in hdfs we can now run the Mahout  recommendation job with a command in the form of:
       mahout recommenditembased --input [input-hdfs-path] --output [output-hdfs-path] --tempDir [tmp-hdfs-path] --usersFile [user_file_location_in_hdfs]

which will run for a while (a chain of 10 MapReduce jobs) and then write out the item recommendations into HDFS we can now take a look at.  If we tail the output from the RecommenderJob with the command:

         hadoop fs -cat [output-hdfs-path]/part-r-00000

The output will show the user(provided into user.txt) with the recommended items.