Dubinko et al, WWW 2006

Visualizing Tags Over Time

This paper describes a method for visualizing how tags evolve over time within the Flickr online image-sharing system. The authors try to highlight the "most interesting" tags within a given time interval. An online, Flash-based demo that runs in a web browser is also provided. They offer two visual metaphors: the "river" and the "waterfall".

To define "interestingness", they essentially use the probability that a given tag occurs within the time interval, which resembles tf-idf in information retrieval. At the same time, they penalize rare tags by introducing a normalizing constant, so that the idf-like effect does not let very rare tags dominate the ranking.
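As a rough illustration of the idea above, the following minimal sketch scores a tag by its count inside the interval divided by its overall count plus a constant. The function name `interestingness`, the constant `C`, the input layout, and the exact form of the ratio are assumptions made for illustration; the summary does not give the paper's precise formula.

```python
from collections import Counter

def interestingness(events, interval, C=50.0):
    """Score each tag seen in `interval` (hypothetical scoring form).

    events:   iterable of (day, tag) pairs
    interval: (start_day, end_day), both inclusive
    C:        assumed normalizing constant that damps rare tags
    """
    start, end = interval
    total = Counter()        # occurrences of each tag over all time
    in_interval = Counter()  # occurrences of each tag inside the interval
    for day, tag in events:
        total[tag] += 1
        if start <= day <= end:
            in_interval[tag] += 1
    # Count in the interval relative to the tag's overall (regularized) count.
    return {tag: n / (C + total[tag]) for tag, n in in_interval.items()}

# Toy data: a tag used every day, a burst on day 10, and a one-off rare tag.
events = [(d, "sunset") for d in range(100)] + [(10, "eclipse")] * 40 + [(10, "typo")]
scores = interestingness(events, (10, 10), C=5.0)
unregularized = interestingness(events, (10, 10), C=0.0)
```

With `C=0` the one-off tag "typo" ties with the bursty tag "eclipse" because both occur only on day 10. With `C=5` the rare tag drops well below the burst, while the everyday tag "sunset" stays at the bottom in both cases.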

Given (1) the huge volume of tags (over 1 million tags per week) and (2) the goal of responding quickly for *any* time interval, efficiency is a major concern in the paper. First, they pre-compute interestingness for a carefully chosen set of time intervals (serving as "bases") so that any given time interval I can be represented as a linear combination of these bases (interval covering). Then they use a fast algorithm, either the threshold algorithm of Fagin et al. or one of its variants, to aggregate the scores quickly. The paper also discusses a fast algorithm for sliding intervals.
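The two steps above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes the base intervals are aligned power-of-two blocks of days, that a tag's score over an interval is the plain sum of its scores over the covering blocks, and that all scores are nonnegative. The names `dyadic_cover`, `block_scores`, `threshold_top_k`, and `top_tags` are hypothetical, and the block scores are computed on demand here where a real system would pre-compute and pre-sort them.

```python
import heapq
from collections import defaultdict

def dyadic_cover(start, end):
    """Split [start, end) into aligned power-of-two blocks (block_start, size)."""
    blocks = []
    while start < end:
        # Largest power of two that divides `start` (any size if start == 0).
        size = start & -start if start else 1 << end.bit_length()
        while size > end - start:   # shrink until the block fits
            size >>= 1
        blocks.append((start, size))
        start += size
    return blocks

def block_scores(day_scores, start, size):
    """Sum per-day tag scores over one block (stand-in for a pre-computed table)."""
    total = defaultdict(float)
    for day in range(start, start + size):
        for tag, score in day_scores.get(day, {}).items():
            total[tag] += score
    return dict(total)

def threshold_top_k(lists, k):
    """Threshold-algorithm-style top-k over dicts of tag -> nonnegative score."""
    ranked = [sorted(d.items(), key=lambda kv: -kv[1]) for d in lists]
    seen = {}
    max_depth = max((len(r) for r in ranked), default=0)
    for depth in range(max_depth):
        threshold = 0.0  # upper bound on the total score of any unseen tag
        for r in ranked:
            if depth < len(r):
                tag, score = r[depth]
                threshold += score
                if tag not in seen:  # random access into every list
                    seen[tag] = sum(d.get(tag, 0.0) for d in lists)
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            break  # no unseen tag can enter the top k
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

def top_tags(day_scores, start, end, k):
    """Top-k tags for days in [start, end) via interval cover plus aggregation."""
    blocks = [block_scores(day_scores, s, n) for s, n in dyadic_cover(start, end)]
    return threshold_top_k(blocks, k)

# Toy per-day scores (made-up values chosen to be exact in binary floating point).
day_scores = {
    3: {"eclipse": 0.75, "sunset": 0.125},
    4: {"sunset": 0.125, "pope": 0.25},
    5: {"pope": 0.25},
    9: {"sunset": 0.125},
}
cover = dyadic_cover(3, 11)
result = top_tags(day_scores, 3, 11, 2)
```

In this toy run the interval [3, 11) is covered by four blocks, and the aggregation stops after two rounds of sorted access because the second-best total already exceeds the bound on any unseen tag. The number of blocks grows only logarithmically with the interval length, which is what makes answering arbitrary intervals cheap under these assumptions.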

Each record in the experimental dataset consists of a date, photo id, tagger id, and tag, collected over a period of 472 days. There were around 1.26 million unique tags, each repeated about 70 times on average. They performed basic tag normalization without using any stemming or spelling correction.

In the experiments, they compare the performance of three ways of computing the most interesting tags: a naive scheme that uses no pre-processing, a scheme that uses interval covering but no score aggregation, and the full scheme that uses both interval covering and score aggregation. As expected, the full scheme improves on the naive scheme by 3-4 orders of magnitude.

They provide a few interesting qualitative observations about the results of applying the algorithms to the data. They point out that the interesting tags fall under categories such as events (e.g., Lunar eclipse, Oct 28th), personalities (e.g., Pope, Apr 26th - his birthday), and social media tags (e.g., Badge, Jul 1 - images of badges). They also point out the effect of the time window on which tags are interesting: for example, lunar eclipse is interesting at the daily, weekly, and monthly levels but not at the quarterly level.