Saturday, March 28, 2015

Data Scraper in Python

Hello All,

We all know that data is one of the most valuable things in the world today: whoever has more data has more power over the market. The market is thoroughly data driven, and I'm sure that in the next couple of decades data will decide the future too, just kidding :)
But trust me, with enough data we can power our recommendation systems to predict much more accurate results. The more data you have, the more value you can get out of it.

Since data is important, collecting it is important too. There are a number of data sources available on the net; one just needs to find them and fetch the required information.

So in this post we are going to learn one of the best-known data collection methods: data scraping from the World Wide Web. Today we are going to write a data scraper in Python (3.4.3).

# Import the required libraries
import urllib.request
import re

# Stock symbol list; you could also load it from a file
symbolslist = ["suzlon.bo","unitech.bo","spicejet.bo","idfc6.bo","powergrid6.bo"]

for symbol in symbolslist:
    # URL of the page to scrape
    urlstr = "https://in.finance.yahoo.com/q?s=" + symbol
    htmfile = urllib.request.urlopen(urlstr)
    htmtext = htmfile.read().decode('utf-8')
    # The price sits in a span whose id ends with the symbol;
    # re.escape stops the "." in the symbol from matching any character
    regex = '<span id="yfs_l84_' + re.escape(symbol) + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = pattern.findall(htmtext)
    # Print the scraped data
    print("The price of", symbol, "is", price)

This is just a basic program; you can modify and extend it as per your requirements.
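One possible extension, sketched here under assumptions: pull the parsing step into its own function so it can be tried on a saved page without hitting the network. The function name extract_price and the sample HTML fragment are made up for illustration; only the yfs_l84_ span id comes from the program above.

```python
import re

def extract_price(htmtext, symbol):
    """Return the first price found for symbol, or None if the span is missing."""
    # re.escape keeps the "." in symbols like "suzlon.bo" from matching any character
    pattern = re.compile('<span id="yfs_l84_' + re.escape(symbol) + '">(.+?)</span>')
    match = pattern.search(htmtext)
    return match.group(1) if match else None

# Hypothetical page fragment, only for trying the function offline
sample = '<div><span id="yfs_l84_suzlon.bo">27.35</span></div>'
print(extract_price(sample, "suzlon.bo"))
print(extract_price(sample, "unitech.bo"))
```

Returning None for a missing span makes it easy to notice when the page layout changes and the pattern stops matching.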

Thanks for visiting, stay tuned for more!!!

Thursday, March 19, 2015

Apache Storm Setup and Deployment


Please follow the steps below to set up and deploy Apache Storm and Zookeeper:


Set up a Zookeeper cluster
Download and extract a Storm package to Nimbus and worker machines
Install dependencies on Nimbus and worker machines
Fill in mandatory configurations into storm.yaml
Launch daemons under supervision using “storm” script and a supervisor of your choice

Overall Zookeeper and Storm cluster components

Set up a Zookeeper cluster

Storm uses Zookeeper for coordinating the cluster. Zookeeper is not used for message passing, so the load Storm places on Zookeeper is quite low. Single node Zookeeper clusters should be sufficient for most cases, but if you want failover or are deploying large Storm clusters you may want larger Zookeeper clusters.
Install the Java JDK. You can use the native packaging system for your system, or download the JDK from:

http://java.sun.com/javase/downloads/index.jsp

Set the Java heap size. This is very important to avoid swapping, which will seriously degrade Zookeeper performance. To determine the correct value, use load tests, and make sure you are well below the usage limit that would cause you to swap. Be conservative - use a maximum heap size of 3GB for a 4GB machine.
Install the Zookeeper Server Package. It can be downloaded from:

http://hadoop.apache.org/zookeeper/releases.html

Create a configuration file. This file can be called anything. Use the following settings as a starting point:

tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888

You can find the meanings of these and other configuration settings in the Configuration Parameters section of the ZooKeeper Administrator's Guide. A few words about some of them here:

Every machine that is part of the Zookeeper ensemble should know about every other machine in the ensemble. You accomplish this with the series of lines of the form server.id=host:port:port. The parameters host and port are straightforward. You attribute the server id to each machine by creating a file named myid, one for each server, which resides in that server's data directory, as specified by the configuration file parameter dataDir.

The myid file consists of a single line containing only the text of that machine's id. So myid of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.
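As a rough sketch of the myid step, the snippet below reads the server.id=host:port:port lines and writes a myid file for one host. The helper names server_ids and write_myid are hypothetical, and a temporary directory stands in for the real dataDir so the example can be run anywhere.

```python
import os
import re
import tempfile

def server_ids(cfg_text):
    """Map each host in the server.N=host:port:port lines to its id N."""
    ids = {}
    for line in cfg_text.splitlines():
        m = re.match(r"server\.(\d+)=([^:]+):\d+:\d+$", line.strip())
        if m:
            ids[m.group(2)] = int(m.group(1))
    return ids

def write_myid(data_dir, server_id):
    """Write the myid file: a single line holding only the id."""
    if not 1 <= server_id <= 255:
        raise ValueError("server id must be between 1 and 255")
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, "myid")
    with open(path, "w") as f:
        f.write(str(server_id) + "\n")
    return path

cfg = """server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888"""

ids = server_ids(cfg)
# A temporary directory stands in for dataDir (/var/zookeeper/ in the settings above)
demo_dir = tempfile.mkdtemp()
path = write_myid(demo_dir, ids["zoo2"])
with open(path) as f:
    print(f.read().strip())
```

On a real machine you would pass that server's own dataDir and its own id instead of the temporary directory.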

Once your configuration file is set up, you can start a Zookeeper server:

$ java -cp zookeeper.jar:lib/log4j-1.2.15.jar:conf \
    org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg


QuorumPeerMain starts a Zookeeper server. JMX management beans are also registered, which allows management through a JMX management console. The ZooKeeper JMX document contains details on managing ZooKeeper with JMX. See the script bin/zkServer.sh, which is included in the release, for an example of starting server instances.

Test your deployment by connecting to the hosts:

In Java, you can run the following command to execute simple operations:

$ java -cp zookeeper.jar:src/java/lib/log4j-1.2.15.jar:conf:src/java/lib/jline-0.9.94.jar \
    org.apache.zookeeper.ZooKeeperMain -server 127.0.0.1:2181

In C, you can compile either the single-threaded client or the multithreaded client in the c subdirectory of the Zookeeper sources. This compiles the single-threaded client:

$ make cli_st

And this compiles the multithreaded client:

$ make cli_mt

Running either program gives you a shell in which to execute simple file-system-like operations. To connect to Zookeeper with the multithreaded client, for example, you would run:

$ cli_mt 127.0.0.1:2181
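If you would rather script the check, here is a minimal Python sketch that sends Zookeeper's four-letter command "ruok" to the client port and expects "imok" back. The function name zk_is_ok is hypothetical. Newer Zookeeper releases may restrict four-letter commands, so treat a False result as a hint to look closer rather than proof that the server is down.

```python
import socket

def zk_is_ok(host, port=2181, timeout=5.0):
    """Send the four-letter command "ruok"; a healthy server answers "imok"."""
    reply = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            # Read until we have the 4-byte answer or the server closes the connection
            while len(reply) < 4:
                chunk = sock.recv(16)
                if not chunk:
                    break
                reply += chunk
    except OSError:
        # Connection refused, timeout, unknown host, ...
        return False
    return reply == b"imok"

if __name__ == "__main__":
    print(zk_is_ok("127.0.0.1", 2181))
```

Run it once per host in the ensemble to see which servers respond.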

Set up a Storm cluster

Environment
* OS: CentOS 6.X
* CPU Arch: x64
* Middleware: JDK 6 or later (Oracle JDK or OpenJDK)

Installing the Storm package
Download the zip archive from the page below and unzip it:
https://github.com/acromusashi/storm-installer/wiki/Download

Install the ZeroMQ RPM:
If the installation fails with a missing uuid dependency, download the package from
http://zid-lux1.uibk.ac.at/linux/rpm2html/centos/6/os/x86_64/Packages/uuid-1.6.1-10.el6.x86_64.html
and install uuid-1.6.1-10.el6.x86_64.rpm first.

# su -
# rpm -ivh zeromq-2.1.7-1.el6.x86_64.rpm
# rpm -ivh zeromq-devel-2.1.7-1.el6.x86_64.rpm
# rpm -ivh jzmq-2.1.0-1.el6.x86_64.rpm
# rpm -ivh jzmq-devel-2.1.0-1.el6.x86_64.rpm

Install the Storm RPM:

# su -
# rpm -ivh storm-0.9.0-1.el6.x86_64.rpm
# rpm -ivh storm-service-0.9.0-1.el6.x86_64.rpm

Set the Zookeeper hosts, the Nimbus host and the other required properties in the Storm configuration file.
(Reference: http://nathanmarz.github.com/storm/doc/backtype/storm/Config.html )

* storm.zookeeper.servers (STORM_ZOOKEEPER_SERVERS)
* nimbus.host (NIMBUS_HOST)
# vi /opt/storm/conf/storm.yaml

Settings example (default storm.yaml):

########### These MUST be filled in for a storm configuration##############
storm.zookeeper.servers:
    - "111.222.333.444"
    - "555.666.777.888" ## zookeeper hosts
storm.zookeeper.port: 2181
nimbus.host: "111.222.333.444" ## nimbus host
storm.local.dir: "/mnt/storm"
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
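When the same file has to go onto several machines, it can help to generate it. The sketch below builds the text of a minimal storm.yaml from the mandatory settings shown above. The function name render_storm_yaml and the host names zk1, zk2 and nimbus1 are placeholders, not part of Storm.

```python
def render_storm_yaml(zk_servers, nimbus_host, local_dir="/mnt/storm",
                      zk_port=2181, slot_ports=(6700, 6701, 6702, 6703)):
    """Build the text of a minimal storm.yaml from the mandatory settings."""
    lines = ["storm.zookeeper.servers:"]
    lines += ['    - "%s"' % host for host in zk_servers]
    lines.append("storm.zookeeper.port: %d" % zk_port)
    lines.append('nimbus.host: "%s"' % nimbus_host)
    lines.append('storm.local.dir: "%s"' % local_dir)
    lines.append("supervisor.slots.ports:")
    lines += ["    - %d" % port for port in slot_ports]
    return "\n".join(lines) + "\n"

# zk1, zk2 and nimbus1 are placeholder host names
text = render_storm_yaml(["zk1", "zk2"], "nimbus1")
print(text)
```

The result can be written to /opt/storm/conf/storm.yaml on the Nimbus and worker machines.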

Start or stop the Storm cluster with the following commands:

Start

# service storm-nimbus start
# service storm-ui start
# service storm-drpc start
# service storm-logviewer start
# service storm-supervisor start

Stop

# service storm-supervisor stop
# service storm-logviewer stop
# service storm-drpc stop
# service storm-ui stop
# service storm-nimbus stop
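Note that the stop list above is simply the start list reversed. A small sketch that builds both command lists from one ordering could look like this; service_commands is a hypothetical helper, and it only prints the commands rather than running them.

```python
# Start order taken from the commands above; stopping walks it backwards
START_ORDER = ["storm-nimbus", "storm-ui", "storm-drpc",
               "storm-logviewer", "storm-supervisor"]

def service_commands(action):
    """Return the argument lists for `service <name> <action>` in the order used above."""
    if action == "start":
        names = START_ORDER
    elif action == "stop":
        names = list(reversed(START_ORDER))
    else:
        raise ValueError("action must be 'start' or 'stop'")
    return [["service", name, action] for name in names]

for cmd in service_commands("stop"):
    # Pass each list to subprocess.check_call(cmd) as root to actually run it
    print(" ".join(cmd))
```

Keeping a single ordering means the two lists cannot drift apart when a service is added.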

Storm dependency libraries

Project : Storm
Version : 0.9.0
License : Eclipse Public License 1.0
Source URL : http://storm-project.net/

Project : ZeroMQ
Version : 2.1.7
License : LGPLv3
Source URL : http://www.zeromq.org/

Project : JZMQ
Version : 2.1.0
License : LGPLv3
Source URL : https://github.com/zeromq/jzmq 
