Wednesday, July 6, 2016

Social Networking Analysis

Around 85% of internet users worldwide use online social networking portals like Facebook, WhatsApp, Twitter, and YouTube to share their experiences and keep up with what is happening around them (1.59 billion monthly active users in 2016, a 25.2 percent increase over the previous year's figures; eighth-ranked Instagram had over 400 million monthly active accounts). Facebook alone reported 1.55 billion monthly active users. Whenever a new product launches in any industry, the real experiences of its users can be found on these portals. Nowadays social networking plays a very important role in business analysis and in locating lagging and growing business areas, helping businesses create a strategy to improve the lagging areas and maintain the quality of the growing ones.

Apache Hadoop and Spark play a very important role in collecting and processing big data and in providing near real-time analytics on top of it.

Sunday, June 26, 2016

Capitalizing on Bigdata!!!

90% of the data created today is unstructured and more difficult to manage, generated from sources like social media (Facebook, Twitter), video (YouTube), text (application logs), audio, email (Gmail), and documents. As we all know, social media is becoming a revolutionary factor for businesses.

 
Bigdata is much more than data; it is already transforming the way businesses and organizations run. It represents a new way of doing business, creating a bright path for the future business world: one driven by data-oriented decision making and new types of products and services shaped by data. The rapid explosion of big data, and of the ways to handle it, is changing the landscape not only of the IT industry but of data-oriented systems everywhere. This data is becoming powerful and important to today's businesses, because it contains customer insight and business growth opportunities that have yet to be identified, or that no one has even imagined. But due to its volume, variety, and speed of change, most companies do not have enough resources to address this valuable data and extract business value from it.

It's time to get together and find the patterns in big data that can help make our lives even simpler. We already have a solution (Hadoop), but we need to explore it further to focus on true growth and identify business opportunities.

Tuesday, May 24, 2016

Hazelcast : In-Memory NoSQL Solution

If you are evaluating high-performance NoSQL solutions such as Redis, Riak, Couchbase, MongoDB, or Cassandra, or, in rarer cases, caching solutions such as Memcached or Ehcache, it's possible that your best choice is Hazelcast. Hazelcast takes a considerably different approach from any of the above projects, and yet for some classes of users looking for a key-value store, it may be the best option.




So let's look at Hazelcast and why it is a good alternative to the systems mentioned above. Hazelcast is an in-memory data grid, not a NoSQL database.
Consider the advantages and disadvantages of an in-memory solution like Hazelcast. As a key-value store held in memory, it has the natural advantages of speed and read efficiency, but also two natural disadvantages: scalability and volatility. The size of RAM is always much smaller than the total available disk space, because RAM is far more expensive than disk, so space is a constant constraint for an in-memory store. And because RAM is cleared on process restart, a crash or restart results in data loss. Hazelcast addresses these shortcomings of in-memory stores with efficient and convenient solutions.
In Hazelcast, the scalability issue is addressed by clustering: by joining hundreds of nodes into a cluster, we can aggregate terabytes of in-memory space to accommodate Hazelcast maps. Of course, this does not compare with disk space, which can be a hundred times larger, but depending on the use case a few terabytes may be sufficient for in-memory operations; otherwise this solution can be used in front of a backing data store.

Volatility is handled in Hazelcast by peer-to-peer data distribution: every block of data has multiple copies (replicated across the cluster) on different nodes, so if a node or rack goes down, the data can be recovered from the copies on other nodes. The number of backup copies can be configured depending on how critical the data is; too many copies, however, reduce the working memory available for other operations. Hazelcast addresses this tuning problem by providing ways to make your cluster both available and reliable, for example configuring servers on one rack to back up onto another rack, so that the failure of an entire rack is handled gracefully.
Hazelcast also addresses cluster rebalancing: whenever a node is added to or removed from the cluster, data may need to move across the cluster to rebalance it. If a node crashes, the primary copies of data on the dead node have to be re-owned by the nodes holding their replicas; each replica (secondary) becomes a primary and is backed up on yet another node to make the cluster fail-safe again. This process consumes cluster resources such as CPU, RAM, and network bandwidth, and may introduce latency while it runs.
In addition to the benefits above, Hazelcast makes sure that Java garbage collection does not affect the terabytes of data stored in memory. With everything on the heap, garbage collection pauses grow as the heap gets bigger and can delay your application's response time, so Hazelcast offers native (off-heap) memory storage to keep the garbage collector from causing delays, resulting in better efficiency and throughput.
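The partitioning, backup, and failover behaviour described above can be illustrated with a toy model (plain Python, not the Hazelcast API): keys map to partitions, each partition has a primary and a backup owner on different nodes, and when a node dies the surviving backups are promoted.

```python
# Toy model of partition ownership and backup promotion. This is NOT the
# Hazelcast API -- just a sketch of the concept: partitions are spread
# round-robin across the cluster, each with a backup on a different node,
# and when a node fails the backups become primaries.

PARTITIONS = 7

def assign_partitions(nodes):
    """Round-robin primary/backup assignment across the cluster nodes."""
    table = {}
    for p in range(PARTITIONS):
        primary = nodes[p % len(nodes)]
        backup = nodes[(p + 1) % len(nodes)]  # backup lives on a different node
        table[p] = (primary, backup)
    return table

def fail_node(table, dead):
    """Promote backups owned by the dead node's partitions to primary."""
    new_table = {}
    for p, (primary, backup) in table.items():
        if primary == dead:
            new_table[p] = (backup, None)   # backup promoted; needs a new backup
        elif backup == dead:
            new_table[p] = (primary, None)  # backup lost; needs a new backup
        else:
            new_table[p] = (primary, backup)
    return new_table

nodes = ["node-A", "node-B", "node-C"]
table = assign_partitions(nodes)
after = fail_node(table, "node-A")

# No partition is lost: every partition still has a live primary.
assert all(primary != "node-A" for primary, _ in after.values())
```

The re-backup step (filling in the `None` slots) is exactly the rebalancing work that consumes CPU, RAM, and network while it runs.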

Sunday, May 1, 2016

Serialization Frameworks(Protocols), When to use What?

Technically, there are two important differences between the protocols below:

- Statically typed or dynamically typed
- Type mapping between language's type system and serializer's type system (Note: these serializers are cross-language)

The most visible difference is "statically typed" vs "dynamically typed", which affects how the compatibility of data and programs is managed. Statically typed serializers don't store detailed type information about objects in the serialized data, because it is described in source code or an IDL. Dynamically typed serializers store type information alongside the values.

- Statically typed: Protocol Buffers, Thrift
- Dynamically typed: JSON, Avro, MessagePack, BSON

Generally speaking, statically typed serializers can store objects in fewer bytes, but they can't detect errors in the IDL (i.e., a mismatch between data and IDL); they must trust that the IDL is correct, since the data doesn't include type information. This means statically typed serializers are high-performance, but you must take great care with the compatibility of data and programs.
Note that some serializers include original improvements for these problems. Protocol Buffers stores some (not detailed) type information in the data, so it can detect a mismatch between IDL and data. MessagePack stores type information in an efficient format, so its data can be smaller than Protocol Buffers or Thrift (depending on the data).
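To make the distinction concrete, here is a small Python sketch using the standard json and struct modules as stand-ins (not the actual Protocol Buffers or MessagePack libraries): the "static" encoding relies on a schema that lives in the code, while the JSON encoding is self-describing.

```python
import json
import struct

# A toy illustration of the static-vs-dynamic trade-off. The "static"
# encoding carries no field names or type tags -- the schema lives in
# the code. The JSON encoding is self-describing but larger.

record = {"id": 12345, "score": 98.5}

# Static style: the schema "id is a 32-bit int, score is a 64-bit float"
# exists only in this format string, not in the data.
static_bytes = struct.pack("<id", record["id"], record["score"])

# Dynamic style: field names and types travel with the values.
dynamic_bytes = json.dumps(record).encode("utf-8")

print(len(static_bytes), len(dynamic_bytes))
# The schema-dependent bytes are much smaller, but meaningless without
# the schema; the JSON bytes can be decoded by anyone.
assert len(static_bytes) < len(dynamic_bytes)
```

If the reader's format string drifts out of sync with the writer's, the static decode silently produces garbage, which is exactly the compatibility risk described above.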

Type systems are another important difference. The following list compares the type systems of Protocol Buffers, Avro, and MessagePack:

- Protocol Buffers: int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64, sfixed32, sfixed64, double, float, bool, string, bytes, repeated, message
- Avro: int, long, float, double, boolean, null, bytes, fixed, string, enum, array, map, record
- MessagePack: Integer, Float, Boolean, Nil, Raw, Array, Map (=same as JSON)

Serializers must map these types to and from each language's types to achieve cross-language compatibility. This means that some types supported by your favorite language can't be stored by some serializers, while too many types can cause interoperability problems. For example, Protocol Buffers doesn't have a map (dictionary) type, and Avro doesn't distinguish unsigned integers from signed integers, while Protocol Buffers does. Avro has an enum type, while MessagePack doesn't.

These differences reflect the designers' needs: Protocol Buffers was initially designed for C++, Avro for Java, and MessagePack aims for interoperability with JSON.

Some advantages and disadvantages of these serialization frameworks:

1. XML
The advantages of XML are that it is human readable and editable, extensible, and interoperable. XML gives data a structure so that it is richer in information, and because that structure is simple and standard, XML is easily processed. There is a wide range of reusable software available to programmers for handling XML, so they don't have to re-invent code. XML provides many views of the same data, separating the presentation of data from its structure, and it is the standard for SOAP and similar protocols.

XML uses Unicode encoding of the data, and XML can be parsed without knowing the schema in advance.


2. JSON
JSON is much easier for humans to read than XML, and easier to write, too. It is also easier for machines to parse and generate. JSON also gives data a structure, making it richer in information and easy to process. JSON is a good data-exchange format and performs well for the right use cases; its schema and structures are based on arrays and records within JSON objects. JSON has excellent browser support and is less verbose than XML.

Like XML, JSON uses Unicode encoding of the data. Its advantages over XML are a much smaller message size and better readability and editability; JSON, too, can be parsed without knowing the schema in advance.

JSON is only beginning to become widely known, but its simplicity and the ease of converting XML to JSON make JSON ultimately more adoptable.
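A quick sketch of the size claim, encoding the same record both ways with Python's standard libraries (the record and its values are made up for illustration):

```python
import json
import xml.etree.ElementTree as ET

# The same record in XML and JSON, to make the size comparison concrete.
xml_text = "<person><name>Alice</name><age>30</age></person>"
json_text = json.dumps({"name": "Alice", "age": 30})

# Both parse without any schema known in advance...
root = ET.fromstring(xml_text)
data = json.loads(json_text)
assert root.find("name").text == data["name"] == "Alice"

# ...but the JSON form is smaller, because field names appear once
# instead of as paired open/close tags.
assert len(json_text) < len(xml_text)
```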


3. Protocol Buffer
Protocol Buffers produces small output: very dense data with very fast processing. It is hard to decode robustly without knowing the schema (the data format is internally ambiguous and needs the schema to clarify it), and only machines can understand it; the dense binary is not intended for human eyes.

Protocol Buffers offers full backward compatibility and requires less boilerplate code for parsing than JSON or XML.


4. BSON(Binary JSON)
BSON can be compared to binary interchange formats like Protocol Buffers. BSON is more "schema-less" than Protocol Buffers, which gives it an advantage in flexibility but a slight disadvantage in space efficiency (BSON carries field names inside the serialized data).

BSON is lightweight: keeping spatial overhead to a minimum is important for any data representation format, especially when used over the network. It is also traversable: BSON is designed to be traversed easily, a vital property in its role as the primary data representation for MongoDB.

It is efficient: encoding data to BSON and decoding from BSON can be performed very quickly in most languages due to the use of C data types.


5. Apache Thrift
Apache Thrift provides cross-language serialization with lower overhead than alternatives such as SOAP, thanks to its binary format. It is a lean and clean library: there is no framework to code against and no XML configuration files, and the language bindings feel natural (Java uses ArrayList<String>, C++ uses std::vector<std::string>). The application-level wire format and the serialization-level wire format are cleanly separated and can be modified independently. The predefined serialization styles include binary, HTTP-friendly, and compact binary, and Thrift doubles as a cross-language file serialization format.

Thrift does not require a centralized, explicit versioning mechanism like major-version/minor-version, so loosely coupled teams can freely evolve RPC calls. It has no build dependencies or non-standard software, and no mix of incompatible software licenses.


6. MessagePack
The tagline of MessagePack says "It's like JSON. But fast and small." MessagePack is not human readable, as it stores data in a binary format.


7. Apache AVRO
Nowadays Apache Avro is becoming popular in the industry because of its small message size and its schema evolution feature.

Schema evolution – Avro requires schemas when data is written or read. Most interestingly, you can use different schemas for serialization and deserialization, and Avro will handle missing, extra, or modified fields.

Untagged data – Providing a schema with the binary data allows each datum to be written without per-field overhead. The result is a more compact data encoding and faster data processing.

Dynamic typing – This refers to serialization and deserialization without code generation. It complements the code generation that Avro offers for statically typed languages as an optional optimization.
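A toy sketch of the schema-resolution idea, with plain Python dictionaries standing in for Avro schemas (this is not the Avro library; the field names and defaults are invented for illustration):

```python
# Toy sketch of Avro-style schema resolution: the reader's schema lists
# expected fields with defaults, so records written with an older or
# newer schema can still be read -- missing fields get their defaults,
# extra fields are ignored.

READER_SCHEMA = {
    "name": None,          # required field (no default, in this toy model)
    "age": 0,              # added later; old records lack it
    "country": "unknown",  # added later; old records lack it
}

def resolve(record, reader_schema):
    out = {}
    for field, default in reader_schema.items():
        if field in record:
            out[field] = record[field]
        elif default is not None:
            out[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    # any field in the record but not in the reader schema is dropped
    return out

old_record = {"name": "Alice"}                         # written before age/country existed
new_record = {"name": "Bob", "age": 41, "email": "x"}  # carries an extra field

print(resolve(old_record, READER_SCHEMA))
print(resolve(new_record, READER_SCHEMA))
```

Real Avro resolves writer and reader schemas by field name with much richer rules (type promotion, unions, aliases); this only illustrates the missing/extra-field case mentioned above.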

Please let me know your comments and experiences about these frameworks.

Thanks.

Tuesday, April 26, 2016

Functional Programming and Procedural Programming

Procedural Programming
Procedural programming is derived from structured programming and is based on the concept of modular programming and the procedure call.

Procedural programming uses a list of instructions to tell the computer what to do step by step. Procedural programming relies on procedures, also known as routines. A procedure contains a series of computational steps to be carried out. Procedural programming is also referred to as imperative or structured programming.

Procedural programming is intuitive in the sense that it is very similar to how you would expect a program to work: if you want a computer to do something, you provide step-by-step instructions on how to do it. Many early programming languages were procedural; examples include Fortran, COBOL, and C, which have been around since the 1960s and 70s.

A common technique in procedural programming is to repeat a series of steps using iteration. This means you write a series of steps and then tell the program to repeat these steps a certain number of times. This makes it possible to automate repetitive tasks.

·        The output of a routine does not always have a direct correlation with the input.
·        Everything is done in a specific order.
·        Execution of a sub-routine may have side effects.
·        Tends to emphasize implementing solutions in a linear fashion.

For Example: (Perl)

     sub factorial {
       my ($n) = @_;
       my $result = 1;
       for ( ; $n > 0; $n-- ){
            $result *= $n;
       }
       return $result;
     }


Functional Programming
Treats computation as the evaluation of mathematical functions avoiding state and mutable data.

Functional programming is an approach to problem solving that treats every computation as a mathematical function. The outputs of a function rely only on the values that are provided as input to the function and don't depend on a particular series of steps that precede the function.

Functional programming relies heavily on recursion. A recursive function can repeat itself until a particular condition is reached. This is similar to the use of iteration in procedural programming, but now applied to a single function as opposed to a series of steps. Examples of functional programming languages include Erlang, Haskell, Lisp and Scala.

·        Often recursive.
·        Always returns the same output for a given input.
·        Order of evaluation is usually undefined.
·        Must be stateless, i.e. no operation can have side effects.
·        Good fit for parallel execution.
·        Tends to emphasize a divide and conquer approach.
·        May have the feature of Lazy Evaluation.

For Example: (Perl)
    sub factorial {
       my ($n) = @_;
       return 1 unless $n > 0;
       return $n * factorial( $n - 1 );
     }

Wednesday, March 16, 2016

YARN Essentials

If you have a working knowledge of Hadoop 1.x but want to start afresh with YARN, this book is ideal for you. You will be able to install and administer a YARN cluster and also discover the configuration settings to fine-tune your cluster both in terms of performance and scalability. This book will help you develop, deploy, and run multiple applications/frameworks on the same shared YARN cluster.


YARN is the next generation generic resource platform used to manage resources in a typical cluster and is designed to support multi-tenancy in its core architecture. As optimal resource utilization is central to the design of YARN, learning how to fully utilize the available fine-grained resources (RAM, CPU cycles, and so on) in the cluster becomes vital.


This book is an easy-to-follow, self-learning guide to help you start working with YARN. Beginning with an overview of YARN and Hadoop, you will dive into the pitfalls of Hadoop 1.x and how YARN takes us to the next level. You will learn the concepts, terminology, architecture, core components, and key interactions, and cover the installation and administration of a YARN cluster, as well as YARN application development with new and emerging data processing frameworks.

Follow the link below for more details:
https://www.packtpub.com/big-data-and-business-intelligence/yarn-essentials

Thank you!

Thursday, August 27, 2015

What is Hadoop anyway?

Hadoop will change the way businesses think about storage, processing and the value of ‘big’ data.

Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF). Hadoop enables the user to extract valuable business insight from massive amounts of structured and unstructured data quickly and cost-effectively through three main functions:

Processing – MapReduce. Computation in Hadoop is based on the MapReduce paradigm that distributes tasks across a cluster of coordinated “nodes.”

Storage – HDFS. Storage is accomplished with the Hadoop Distributed File System (HDFS) – a reliable file system that allows large volumes of data to be stored and accessed across large clusters of commodity servers.

Resource Management – YARN. Introduced in Hadoop 2.0, YARN performs resource management, further increasing efficiency, and extends MapReduce by supporting non-MapReduce workloads such as graph, streaming, in-memory, MPI processing, and more.

Hadoop is designed to scale up or down without system interruption and runs on commodity hardware making the capture and processing of big data economically viable for the enterprise.
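The MapReduce paradigm mentioned above can be sketched in a few lines of plain Python (a toy, single-process illustration of the idea, not Hadoop itself): map emits key-value pairs, the pairs are grouped by key, and reduce aggregates each group.

```python
from collections import defaultdict

# Toy, single-process word count in the MapReduce style. In Hadoop the
# map and reduce phases run distributed across cluster nodes; here both
# run in one process just to show the data flow.

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum the counts."""
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

lines = ["big data big insight", "data drives insight"]
counts = reduce_phase(map_phase(lines))
print(counts)
```

Because each map call touches only its own input and each reduce call only its own key group, the work parallelizes naturally, which is exactly what Hadoop exploits across a cluster of nodes.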

“By 2017, I believe that 50% of the world’s data will be stored and analyzed by Apache Hadoop.” 

Tuesday, April 14, 2015

Data Analysis from MongoDB using R

Most of us are aware of R, a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and for data analysis. Empowering R with the right datasets and sources is the icing on the cake, so in this post we are going to see how R can be connected to MongoDB and how to apply R's power to datasets from MongoDB.

As a prerequisite for this demo, you should have the MongoDB daemon up and running on a server or on your local machine (a standalone local instance).

Start your R instance and install the "rmongodb" package by issuing the commands below:

        $  install.packages("rmongodb")
        $  library(rmongodb)

Connect R to the MongoDB instance:
   
       $ mongo.create(host = "127.0.0.1", name = "", username = "", password = "", db = "test", timeout = 0L)

You will get a response like the one below. With this connection configuration, you are connecting to the mongo instance on 127.0.0.1 and to the 'test' mongo database, with an empty username and password.

        [1] 0
        attr(,"mongo")
        <pointer: 0x0884f0a8>
        attr(,"class")
        [1] "mongo"
        attr(,"host")
        [1] "127.0.0.1"
        attr(,"name")
        [1] ""
        attr(,"username")
        [1] ""
        attr(,"password")
        [1] ""
        attr(,"db")
        [1] "test"
        attr(,"timeout")
        [1] 0   

You can check whether R is connected to MongoDB by issuing the command below:

        $ mongo.is.connected(mongo)
        [1] TRUE

Now R is successfully connected to the 'test' database on your MongoDB instance, so you can easily fire simple mongo queries and use R's power to compute analytics over MongoDB datasets.

For example, to fetch a single record from Mongo:

        $ mongo.find.one(mongo,"test.zip",list())

We can also use filter queries to fetch records from MongoDB into R datasets:

        $ mongo.find(mongo, "test.zip", list(pop=list('$gt'=21L)))

So, this is just the beginning; stay tuned for the next updates.
Thanks for visiting; I'd appreciate your thoughts and comments.

Saturday, March 28, 2015

Data Scraper in Python

Hello All,

Nowadays we know that data is the most valuable thing in the world: whoever has more data has more power, or command over the market. This market is totally data-driven, and I'm sure that in the next couple of decades data may even decide the future, just kidding :)
But trust me, we can power our recommendation systems to predict much more accurate results with data. Data is directly proportional to value.

As data is important, so is its collection. There are a number of data sources available on the net; one just needs to find them and fetch the required information.

So in this post, we are going to learn one of the most popular data collection methods: data scraping from the world wide web. Today we are going to write a data scraper in Python (3.4.3).

# Import the required libraries
import urllib.request
import re

# Stock symbol list; you may also read it from a file
symbolslist = ["suzlon.bo", "unitech.bo", "spicejet.bo", "idfc6.bo", "powergrid6.bo"]

for symbol in symbolslist:
    # URL of the page to scrape
    urlstr = "https://in.finance.yahoo.com/q?s=" + symbol
    htmfile = urllib.request.urlopen(urlstr)
    htmtext = htmfile.read().decode('utf-8')
    regex = '<span id="yfs_l84_' + symbol + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmtext)
    # Print the scraped data
    print("The price of", symbol, "is", price)

This is just a basic program; you can modify and extend it as per your requirements.

Thanks for visiting, stay tuned for more!!!

Thursday, March 19, 2015

Apache Storm Setup and Deployment


Please follow the steps below for Apache Storm and Zookeeper setup and deployment.


Set up a Zookeeper cluster

Download and extract a Storm package to Nimbus and worker machines
Install dependencies on Nimbus and worker machines
Fill in mandatory configurations into storm.yaml
Launch daemons under supervision using “storm” script and a supervisor of your choice

Overall Zookeeper and Storm cluster components

Setup a Zookeeper cluster

Storm uses Zookeeper for coordinating the cluster. Zookeeper is not used for message passing, so the load Storm places on it is quite low. A single-node Zookeeper cluster should be sufficient for most cases, but if you want failover or are deploying large Storm clusters, you may want a larger Zookeeper ensemble.
Install the Java JDK. You can use the native packaging system for your system, or download the JDK from:

http://java.sun.com/javase/downloads/index.jsp

Set the Java heap size. This is very important to avoid swapping, which will seriously degrade Zookeeper performance. To determine the correct value, use load tests, and make sure you are well below the usage limit that would cause you to swap. Be conservative - use a maximum heap size of 3GB for a 4GB machine.
Install the Zookeeper Server Package. It can be downloaded from:

http://hadoop.apache.org/zookeeper/releases.html

Create a configuration file. This file can be called anything. Use the following settings as a starting point:

tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888

You can find the meanings of these and other configuration settings in the section Configuration Parameters. A word though about a few here:

Every machine that is part of the Zookeeper ensemble should know about every other machine in the ensemble. You accomplish this with the series of lines of the form server.id=host:port:port. The parameters host and port are straightforward. You attribute the server id to each machine by creating a file named myid, one for each server, which resides in that server's data directory, as specified by the configuration file parameter dataDir.

The myid file consists of a single line containing only the text of that machine's id. So myid of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.
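To make the myid convention concrete, here is a small helper (a Python illustration, not part of the Zookeeper distribution; the hostnames mirror the zoo1/zoo2/zoo3 example above) that writes one myid file per server into each server's dataDir:

```python
import os
import tempfile

# Write the per-server myid files described above: one file per server,
# containing only that server's id, placed in the server's dataDir.
# In a real deployment each dataDir lives on its own machine; here we
# use subdirectories of a temp directory just to demonstrate.

servers = {1: "zoo1", 2: "zoo2", 3: "zoo3"}  # id -> hostname, as in zoo.cfg

def write_myid(data_dir, server_id):
    path = os.path.join(data_dir, "myid")
    with open(path, "w") as f:
        f.write(str(server_id))  # the id and nothing else
    return path

base = tempfile.mkdtemp()
for server_id in servers:
    data_dir = os.path.join(base, "server%d" % server_id)
    os.makedirs(data_dir)
    write_myid(data_dir, server_id)
```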

Once your configuration file is set up, you can start a Zookeeper server:

$ java -cp zookeeper.jar:lib/log4j-1.2.15.jar:conf org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg


QuorumPeerMain starts a Zookeeper server, and JMX management beans are registered, which allows management through a JMX console. The ZooKeeper JMX document contains details on managing ZooKeeper with JMX. See the script bin/zkServer.sh, included in the release, for an example of starting server instances.

Test your deployment by connecting to the hosts:

In Java, you can run the following command to execute simple operations:

$ java -cp zookeeper.jar:src/java/lib/log4j-1.2.15.jar:conf:src/java/lib/jline-0.9.94.jar org.apache.zookeeper.ZooKeeperMain -server 127.0.0.1:2181

In C, you can compile either the single-threaded client or the multithreaded client in the c subdirectory of the Zookeeper sources. This compiles the single-threaded client:

$ make cli_st

And this compiles the multithreaded client:

$ make cli_mt

Running either program gives you a shell in which to execute simple file-system-like operations. To connect to Zookeeper with the multithreaded client, for example, you would run:

$ cli_mt 127.0.0.1:2181

Setup a Storm cluster

Environment
* OS: CentOS 6.X
* CPU Arch: x64
* Middleware: Needs JDK 6 or later (Oracle JDK or OpenJDK)

Installing the Storm package
Unzip the downloaded zip archive from:
https://github.com/acromusashi/storm-installer/wiki/Download

Install the ZeroMQ RPMs:
If you encounter failed dependencies on uuid, download it from
http://zid-lux1.uibk.ac.at/linux/rpm2html/centos/6/os/x86_64/Packages/uuid-1.6.1-10.el6.x86_64.html
and install uuid-1.6.1-10.el6.x86_64.rpm.

# su -
# rpm -ivh zeromq-2.1.7-1.el6.x86_64.rpm
# rpm -ivh zeromq-devel-2.1.7-1.el6.x86_64.rpm
# rpm -ivh jzmq-2.1.0-1.el6.x86_64.rpm
# rpm -ivh jzmq-devel-2.1.0-1.el6.x86_64.rpm

Install the Storm RPM:

# su -
# rpm -ivh storm-0.9.0-1.el6.x86_64.rpm
# rpm -ivh storm-service-0.9.0-1.el6.x86_64.rpm

Set the zookeeper host, nimbus host, and other required properties in the storm configuration file.
(Reference: http://nathanmarz.github.com/storm/doc/backtype/storm/Config.html )

* storm.zookeeper.servers (STORM_ZOOKEEPER_SERVERS)
* nimbus.host (NIMBUS_HOST)
# vi /opt/storm/conf/storm.yaml

Settings Example:
Default storm.yaml example.

########### These MUST be filled in for a storm configuration##############
storm.zookeeper.servers:
    - "111.222.333.444"
    - "555.666.777.888" ## zookeeper hosts
storm.zookeeper.port: 2181
nimbus.host: "111.222.333.444" ## nimbus host
storm.local.dir: "/mnt/storm"
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703

Start or stop the storm cluster with the following commands:

Start

# service storm-nimbus start
# service storm-ui start
# service storm-drpc start
# service storm-logviewer start
# service storm-supervisor start

Stop

# service storm-supervisor stop
# service storm-logviewer stop
# service storm-drpc stop
# service storm-ui stop
# service storm-nimbus stop

Storm dependency libraries

Project : Storm
Version : 0.9.0
License : Eclipse Public License 1.0
Source URL : http://storm-project.net/

Project : ZeroMQ
Version : 2.1.7
License : LGPLv3
Source URL : http://www.zeromq.org/

Project : JZMQ
Version : 2.1.0
License : LGPLv3
Source URL : https://github.com/zeromq/jzmq
