Introduction

The emergence of Web 2.0 and the consequent success of social network websites such as del.icio.us and Flickr introduce us to a new concept called social bookmarking, or tagging in short. Tagging can be seen as the action of connecting a relevant user-defined keyword to a document, image or video, which helps user to better organize and share their collections of interesting stuff. With therapid growth of Web 2.0, tagged data is becoming more and more abundant on the social network websites. An interesting problem is how to automate the process of making tag recommendations to users when a new resource becomes available.

Ideas

Real-time Automatic Tag Recommendation

We propose a highly-automated novel framework for real-time tag recommendation. The tagged training documents are treated as triplets of (words, docs, tags), and represented in two bipartite graphs, which are partitioned into clusters by Spectral Recursive Embedding (SRE). Tags in each topical cluster are ranked by our novel ranking algorithm. A two-way Poisson Mixture Model (PMM) is proposed to model the document distribution into mixture components within each cluster and aggregate words into word clusters simultaneously. A new document is classified by the mixture model based on its posterior probabilities so that tags are recommended according to their ranks. Experiments on large-scale tagging datasets of scientific documents (CiteULike) and web pages (del.icio.us) indicate that our framework is capable of making tag recommendation efficiently and effectively. The average tagging time for testing a document is around 1 second, with over 88% test documents correctly labeled with the top nine tags we suggested.

Sparse Gaussian Processes for Fast Tag Recommendation

We address the issue of efficient tag suggestion. We first propose a multi-class sparse Gaussian process classification framework (SGPS) which is capable of classifying data with very few training instances. We suggest a novel prototype selection algorithm to select the best subset of points for model learning. The framework is then extended to a novel multi-class multi-label classification algorithm (MMSG) that transforms tag suggestion into the problem of multi-label ranking. Experiments on bench-mark data sets and real-world data from Del.icio.us and BibSonomy suggest that our model can greatly improve the performance of tag suggestions when compared to the state-of-the-art. Overall, our model requires linear time to train and constant time to predict per case. The memory consumption is also significantly less than traditional batch learning algorithms such as SVMs. In addition, results on tagging digital data also demonstrate that our model is capable of recommending relevant tags to images and videos by using their surrounding textual information.

Hierarchical Tag Visualization and Recommendations

While tag clouds effectively show the relative frequency and thus popularity of tags, they fail to convey two aspects to the users: (1) the similarity between tags, and (2) the abstractness of tags. We suggest an alternative to tag clouds known as tag hierarchies. Tag hierarchies are based on a minimum evolution-based greedy algorithm for tag hierarchy construction, which iteratively includes optimal tags into the tree that introduce minimum changes to the existing taxonomy. Our algorithm also uses a global tag ranking method to order tags according to their levels of abstractness as well as popularity such that more abstract tags will appear at higher levels in the taxonomy. Based on the tag hierarchy, we derive a new tag recommendation algorithm, which is a structure-based approach that does not require heavily trained models and thus is highly efficient. User studies and quantitative analysis suggest that (1) the tag hierarchy can potentially reduce the user’s tagging time in comparison to tag clouds and other tag tree structures, and (2) the tag recommendation algorithm significantly outperforms existing content-based methods in quality.

Metrics for Evaluating User Tagging Behavior

We analyze over two years of data from CiteULike, a social bookmarking system for tagging academic papers. We propose six tag metrics-tag growth, tag reuse, tag non-obviousness, tag discrimination, tag frequency, and tag patterns-to understand the characteristics of a social bookmarking system. Using these metrics, we suggest possible design heuristics to implement a social bookmarking system for CiteSeer, a popular online scholarly digital library for computer science. We believe that these metrics and design heuristics can be applied to social bookmarking systems in other domains.

Data Sets and Code

Since I graduated, most of the data used in the paper has been lost since they were stored in my school's machine. I made my best effort to retain just one data set that I used in my CIKM '08 paper.
Please download from here (10MB).