Monday, December 3, 2012

I was reading Edwin Chen's blog (he is a famous data scientist, and my role model!) and came across an interesting post.

The problem was edge prediction in a given social graph. After reading through Edwin's post, I decided to implement it with Map/Reduce since I am not fluent with Scalding and Scala yet (these days I have been busy catching up with the Functional Programming course on Coursera, and it's so much fun).

An excerpt from Edwin's blog:

Briefly, a personalized PageRank is like standard PageRank, except that when randomly teleporting to a new node, the surfer always teleports back to the given source node being personalized (rather than to a node chosen uniformly at random, as in the classic PageRank algorithm).
I was wondering how efficiently personalized PageRank could be computed for all vertices in a graph. Using Map/Reduce with Mahout's matrix library, this computation becomes the following.
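The Map/Reduce + Mahout matrix formulation itself isn't reproduced here; as a stand-in, here is a minimal single-source, in-memory C++ sketch of the iteration described in the excerpt. The teleport probability alpha, the adjacency-list layout, and sending dangling-vertex mass back to the source are all my assumptions.

#include <algorithm>
#include <vector>

// One power iteration per loop: with probability alpha the surfer teleports
// back to the source s; otherwise it follows a uniformly random out-edge.
// Assumes pr always sums to 1 (dangling mass is redirected to the source).
std::vector<double> personalizedPageRank(
        const std::vector<std::vector<int>>& outEdges,  // adjacency lists
        int s, double alpha = 0.15, int iterations = 3) {
    const size_t n = outEdges.size();
    std::vector<double> pr(n, 0.0), next(n, 0.0);
    pr[s] = 1.0;                               // all mass starts at the source
    for (int it = 0; it < iterations; ++it) {
        std::fill(next.begin(), next.end(), 0.0);
        next[s] += alpha;                      // teleport always lands on s
        for (size_t v = 0; v < n; ++v) {
            if (outEdges[v].empty()) {         // dangling: give mass back to s
                next[s] += (1.0 - alpha) * pr[v];
                continue;
            }
            const double share = (1.0 - alpha) * pr[v] / outEdges[v].size();
            for (int u : outEdges[v]) next[u] += share;
        }
        pr.swap(next);
    }
    return pr;                                 // pr[v] ~ PPR of v w.r.t. s
}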

Even though Edwin built many features and trained a random forest to get an awesome model, I think in reality the vertex-vertex similarity computation would become too costly, so here I just focus on scaling personalized PageRank to a large graph.

Note that the maximum degree can be over 3,000,000, which quickly becomes a computational bottleneck. So I pruned vertices with degree over 10,000, which we don't actually care much about since those vertices already have too many links.
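A hypothetical pre-processing pass for that pruning step, reusing the adjacency-list layout from the sketch above (the 10,000 threshold is from the post; dropping edges on both endpoints is my choice):

// Hypothetical pruning pass: drop every edge that touches a vertex whose
// degree exceeds the threshold, so hubs carry no mass during the iterations.
const size_t kMaxDegree = 10000;
std::vector<bool> keep(outEdges.size());
for (size_t v = 0; v < outEdges.size(); ++v)
    keep[v] = outEdges[v].size() <= kMaxDegree;
for (size_t v = 0; v < outEdges.size(); ++v) {
    if (!keep[v]) { outEdges[v].clear(); continue; }
    outEdges[v].erase(
        std::remove_if(outEdges[v].begin(), outEdges[v].end(),
                       [&](int u) { return !keep[u]; }),
        outEdges[v].end());
}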

The result was better than I expected: the Map/Reduce job ran in less than 2 hours for 3 iterations on a 10-datanode cluster (8 cores, 16G each).

Other than performance, I plan to evaluate personalized-PageRank-based link prediction by using the PPR scores as recommendations and the actual training data as the test set. I think calculating MAP on the top K can serve as the evaluation metric. I will post this metric in a few days.

Wednesday, November 7, 2012

I am taking the Probabilistic Graphical Models course on Coursera. Once the class covered inference on Markov random fields using label propagation, I wanted to see how this algorithm works on real data, so I applied label propagation to the Million Song Dataset to get a feel for it.

Here is an example of label propagation.

Simple arithmetic results in the following.

50% from my profile + 40% from X + 10% from Z = 60% rock, 40% jazz.
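To make the arithmetic concrete (the individual profiles are my assumption, since the original example image is not shown here): if my own profile is 100% rock, X's is 100% jazz, and Z's is 100% rock, then 0.5 × (rock: 1.0) + 0.4 × (jazz: 1.0) + 0.1 × (rock: 1.0) = (rock: 0.6, jazz: 0.4), i.e. 60% rock, 40% jazz.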

Generalization

Here is the actual formulation, where the Markov random field is represented by a factor graph.

The above formulation can be implemented as matrix multiplication using Map/Reduce.

Definition

Let's say the graph structure is G, and the set of concepts is C = {c1, c2, ...}, where each ci consists of a set of vertices in G; ci represents the prior for concept i. Then we want to calculate the pair-wise posterior given G and C.
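Assuming the formulation above is the usual row-normalized neighbor average, one propagation step in matrix form would be roughly

C(t+1) = D^-1 * A * C(t)

where A is the (weighted) adjacency matrix of G, D is the diagonal matrix of A's row sums, and C(t) stacks each vertex's concept-prior vector as a row. This per-iteration product is what the Map/Reduce matrix multiplication computes.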

Note that using Map/Reduce for iterative jobs is inefficient, so why not try Giraph?

In a Graph-Parallel environment, the problem becomes the following (see the sketch after this list).

1. Each vertex has its neighbor edges in G.
2. At superstep 0, some vertices have a C vector as their value ([ci:prior, cj:prior, ...]). If a vertex has a C vector, it sends C to all of its neighbors; otherwise it sends nothing.
3. After superstep 0, every vertex gets messages ([vertex_id j, C vector]) from each of its neighbors.
If the current vertex is Vi and the message is [Vj, Cvj], then edge(Vi, Vj) / (sum of Vi's edge weights) * Cvj is added to Vi's value vector Cvi. Merge all concept-prior vectors sent to each vertex and update value(Cvi).
4. If the iteration is not done, send value(Cvi) to all neighbors.
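A minimal in-memory sketch of one such superstep in C++ (Giraph itself is Java; the data layout here is my assumption, chosen to mirror steps 2-4 above, with symmetric edge weights):

#include <map>
#include <utility>
#include <vector>

using ConceptVector = std::map<int, double>;    // concept id -> prior/score

struct Vertex {
    std::vector<std::pair<int, double>> edges;  // (neighbor id, edge weight)
    ConceptVector value;                        // Cvi; empty if no prior yet
};

// One superstep of steps 2-4: every vertex holding a concept vector sends it
// to its neighbors; each receiver Vi adds
// edge(Vi, Vj) / (sum of Vi's edge weights) * Cvj into its own Cvi.
void superstep(std::vector<Vertex>& g) {
    std::vector<ConceptVector> inbox(g.size());
    for (size_t j = 0; j < g.size(); ++j) {
        if (g[j].value.empty()) continue;       // nothing to send yet
        for (const auto& [i, w] : g[j].edges)
            for (const auto& [c, p] : g[j].value)
                inbox[i][c] += w * p;           // raw weighted message to Vi
    }
    for (size_t i = 0; i < g.size(); ++i) {
        if (inbox[i].empty()) continue;
        double total = 0.0;                     // sum of Vi's edge weights
        for (const auto& [j, w] : g[i].edges) total += w;
        for (const auto& [c, s] : inbox[i])
            g[i].value[c] += s / total;         // normalized merge into Cvi
    }
}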

I implemented a demo of label propagation using the open dataset from the Million Song Dataset challenge on Kaggle. For demonstration purposes, this demo loads the taste profile graph data into memory and calculates on the fly rather than using Giraph. Here is the GitHub repo for the Giraph/Mahout code and the demo code.

TODO: since the test set for this data is open (the competition is over), I will measure truncated mean average precision to evaluate label propagation.

Wednesday, October 17, 2012

I have been experimenting with GraphLab and Mahout for matrix factorization these days.

Matrix factorization transforms both items and users into the same latent factor space so they can be compared directly.

Even though Mahout and GraphLab are great tools for matrix factorization, they are designed for batch processing. To get recommendations for a new user who rates existing movies in the rating matrix, the following two steps are necessary (the standard fold-in approach): 1) fold the new user's ratings into the latent factor space to obtain a user factor vector, and 2) score each existing item by the dot product between that user vector and the item's factor vector.
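A hypothetical C++ sketch of those two steps, assuming the item-factor matrix V (items × k) has already been learned offline by Mahout or GraphLab; the regularization constant and the naive small-k Gaussian elimination are my choices:

#include <utility>
#include <vector>

// Hypothetical "fold-in" for a new user: solve the k x k normal equations
// (Vr^T Vr + lambda I) u = Vr^T r over the user's rated items, then score
// every item by the dot product u . V[item].
std::vector<double> foldInAndScore(
        const std::vector<std::vector<double>>& V,          // item factors
        const std::vector<std::pair<int, double>>& ratings, // (item, rating)
        double lambda = 0.05) {
    const size_t k = V[0].size();
    std::vector<std::vector<double>> A(k, std::vector<double>(k, 0.0));
    std::vector<double> b(k, 0.0);
    for (const auto& [item, r] : ratings) {
        for (size_t i = 0; i < k; ++i) {
            b[i] += V[item][i] * r;
            for (size_t j = 0; j < k; ++j) A[i][j] += V[item][i] * V[item][j];
        }
    }
    for (size_t i = 0; i < k; ++i) A[i][i] += lambda;  // regularization
    // Step 1: naive Gaussian elimination (fine for small k).
    std::vector<double> u(b);
    for (size_t p = 0; p < k; ++p) {
        for (size_t i = p + 1; i < k; ++i) {
            const double f = A[i][p] / A[p][p];
            for (size_t j = p; j < k; ++j) A[i][j] -= f * A[p][j];
            u[i] -= f * u[p];
        }
    }
    for (size_t p = k; p-- > 0;) {                     // back substitution
        for (size_t j = p + 1; j < k; ++j) u[p] -= A[p][j] * u[j];
        u[p] /= A[p][p];
    }
    // Step 2: score all items against the folded-in user vector.
    std::vector<double> scores(V.size(), 0.0);
    for (size_t item = 0; item < V.size(); ++item)
        for (size_t j = 0; j < k; ++j) scores[item] += V[item][j] * u[j];
    return scores;
}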

The median of M numbers is defined as the middle number after sorting them, if M is odd, or the average of the middle 2 numbers (again after sorting) if M is even. You start with an empty number list. You can then add numbers to or remove numbers from the list. After each add or remove operation, output the median of the numbers in the list.

Example: For a set of m = 5 numbers { 9, 2, 8, 4, 1 }, the median is the third number in the sorted set { 1, 2, 4, 8, 9 }, which is 4. Similarly, for a set of m = 4 numbers { 5, 2, 10, 4 }, the median is the average of the second and third elements in the sorted set { 2, 4, 5, 10 }, which is (4+5)/2 = 4.5.

Constraints:

0 < n <= 100,000

1. Naive

If n is small enough, the problem is simple: re-sort the entire list on each insert/delete. This requires n * n log n work in total (n operations × the cost of sorting).
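A minimal sketch of the naive version; the "a x" (add) / "r x" (remove) input format and the output formatting for even-sized lists are assumptions, since the exact judge format isn't quoted here:

#include <algorithm>
#include <cstdio>
#include <vector>

// Naive version: keep a plain vector and re-sort after every operation.
// n operations x O(n log n) per sort -- fine only for small n.
int main() {
    int n;
    std::scanf("%d", &n);
    std::vector<long long> xs;
    for (int q = 0; q < n; ++q) {
        char op;
        long long x;
        std::scanf(" %c %lld", &op, &x);  // assumed format: "a 5" or "r 5"
        if (op == 'a') {
            xs.push_back(x);
        } else {
            auto it = std::find(xs.begin(), xs.end(), x);
            if (it != xs.end()) xs.erase(it);  // assume removes are valid
        }
        if (xs.empty()) continue;              // no median to report
        std::sort(xs.begin(), xs.end());
        const size_t m = xs.size();
        if (m % 2) std::printf("%lld\n", xs[m / 2]);
        else std::printf("%.1f\n", (xs[m / 2 - 1] + xs[m / 2]) / 2.0);
    }
    return 0;
}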

2. Using two multisets

The key observation is that only a few elements are affected when an insert/delete operation happens.

Here is a solution using two multisets, s1 and s2.

Invariants

1) s1 maintains the 0 ~ n/2-th order statistics (the smaller half).

2) s2 maintains the n/2 ~ n-th order statistics (the larger half).

3) s1.size() and s2.size() differ by at most 1.

To hold invariants 1 and 2,

we compare x (the value to insert or delete) with the maximum of s1 and the minimum of s2 to decide which set x belongs in.

To hold invariant 3,

we need a re-balance step.

This takes only a constant number of element moves (each an O(log n) multiset operation), either by

1) moving the minimum of s2 to s1, or

2) moving the maximum of s1 to s2.

Solution

Once the 3 invariants are met, calculating the median as described in the problem statement is trivial.
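Putting the pieces together, here is a sketch of the two-multiset approach in C++; removal of a value that is not present is reported via the return value, and median() assumes the list is non-empty:

#include <iterator>
#include <set>

// Two-multiset running median: s1 holds the smaller half, s2 the larger
// half, and their sizes differ by at most 1 (the three invariants above).
std::multiset<long long> s1, s2;

// Restore invariant 3 with a single element move across the boundary.
void rebalance() {
    if (s1.size() > s2.size() + 1) {
        auto it = std::prev(s1.end());          // maximum of s1
        s2.insert(*it);
        s1.erase(it);
    } else if (s2.size() > s1.size() + 1) {
        auto it = s2.begin();                   // minimum of s2
        s1.insert(*it);
        s2.erase(it);
    }
}

void add(long long x) {
    if (!s2.empty() && x >= *s2.begin()) s2.insert(x);
    else s1.insert(x);
    rebalance();
}

bool removeOne(long long x) {                   // false if x is not present
    auto it = s1.find(x);
    if (it != s1.end()) {
        s1.erase(it);
    } else {
        it = s2.find(x);
        if (it == s2.end()) return false;
        s2.erase(it);
    }
    rebalance();
    return true;
}

double median() {                               // assumes list is non-empty
    if (s1.size() == s2.size())
        return (*std::prev(s1.end()) + *s2.begin()) / 2.0;
    return s1.size() > s2.size()
               ? static_cast<double>(*std::prev(s1.end()))
               : static_cast<double>(*s2.begin());
}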