Saturday, February 11, 2017

Recommendations with Apache Mahout


Have you ever been recommended a friend on Facebook? Or visited a shopping portal where you can see the recommended items for you, Or an item you might be interested in on Amazon? If so then you've benefited from the value of recommendation systems.
for example, often see personalized recommendations phrased something like, “If you liked that item, you might like also like this one...” These sites use recommendations to help drive users  to other things they offer in an intelligent, meaningful way, tailored specifically to the user and the user’s preferences.

Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very managed way.

From an algorithmic standpoint, the recommendation systems we’ll talk about today are considered in the k-nearest neighbor family of problems (another type would be a SVD-based recommender). We want to predict the estimated preference of a user towards an item they have never seen before. We also want to generate a ranked (by preference score) list of items the user might be most interested in. Two well-known styles of recommendation algorithms are item-based recommenders and user-based recommenders. Both types rely on the concept of a similarity function/metric (ex: Euclidean distance, log likelihood), whether it is for users or items.

Overview of a recommendation engine

The main purpose of a recommendation engine is to make inferences on existing data to show relationships between objects and entities. Objects can be many things, including users, items, products(in short user related data) and so on. Relationships provide a degree of likeness or belonging between objects. For example, relationships can represent ratings of how much a user likes an item, or indicate if a user bookmarked a particular page.

To make a recommendation, recommendation engines perform several steps to mine the data(Data mining). Initially, you begin with input data that represents the objects as well as their relationships. Input data consists of object identifiers and the relationships to other objects.

Consider the ratings users give to items. Using this input data, a recommendation engine computes a similarity between objects. Computing the similarity between objects(co-similarity) can take a great deal of time depending on the size of the data or the particular algorithm. Distributed algorithms such as Apache Hadoop using Mahout can be used to parallelize the computation of the similarities. There are different types of algorithms to compute similarities. Finally, using the similarity information, the recommendation engine can make recommendation requests based on the parameters requested.

For Example:
GroupLens Movie Data

The input data for this demo is based on 1M anonymous ratings of approximately 4000 movies made by 6,040 MovieLens users, which you can download from the site. The zip file contains four files:

movies.dat (movie ids with title and category)
ratings.dat (ratings of movies)
users.dat (user information)

The ratings file is most interesting to us since it’s the main input to our recommendation job. Each line has the format:
Ratings.dat description


So let’s adjust our input file to match what we need to run our job. First download the file and unzip it locally from:

Next run the command:
        tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > ratings.csv

This produces the csv output format we’ll use in the next section when we run our “Itembased Collaborative Filtering” job.

        hadoop fs -put [my_local_file] [user_file_location_in_hdfs]

this command put  input file on HDFS,

create user.txt file which stores the data(userID) of the users to which we want show recommendations.
put it on HDFS under users directory.
With our user list in hdfs we can now run the Mahout  recommendation job with a command in the form of:
       mahout recommenditembased --input [input-hdfs-path] --output [output-hdfs-path] --tempDir [tmp-hdfs-path] --usersFile [user_file_location_in_hdfs]

which will run for a while (a chain of 10 MapReduce jobs) and then write out the item recommendations into HDFS we can now take a look at.  If we tail the output from the RecommenderJob with the command:

         hadoop fs -cat [output-hdfs-path]/part-r-00000

The output will show the user(provided into user.txt) with the recommended items.