Wednesday, March 23, 2011

In my previous post I indicated that I was faced with a variety of semi-supervised problems, and that I was hoping to utilize LDA on the social graph in order to build a feature representation that would improve my performance on various classification tasks. Now that I've actually done it, I thought I'd share some results.

LDA on Graphs

The strategy is to treat the edge set at each vertex of the social graph as a document and then apply LDA to the resulting document corpus, similar to Zhang et al. Since I'm considering Twitter's social graph, the latent factors might represent interests or communities, but I don't actually care as long as the resulting features improve my supervised classifiers.
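To make the "edge set as document" idea concrete, here is a minimal sketch of the corpus construction step. It is not the code I actually ran: the function name edge_sets_to_documents, the (follower, followee) pair format, and the toy account names are assumptions made up for illustration. Each vertex becomes one document whose tokens are the accounts it follows, which could then be handed to an off-the-shelf LDA implementation.

```python
from collections import defaultdict

def edge_sets_to_documents(follow_edges):
    """Group directed (follower, followee) pairs into one 'document' per follower.

    Hypothetical helper: each document is a space-separated string of the
    accounts that the follower follows, i.e. the outgoing edge set.
    """
    edge_sets = defaultdict(set)
    for follower, followee in follow_edges:
        # A set keeps each edge at most once, matching the graph semantics.
        edge_sets[follower].add(followee)
    # Sort so the output is deterministic.
    return {
        vertex: " ".join(sorted(targets))
        for vertex, targets in sorted(edge_sets.items())
    }

# Toy, invented edge list (note the duplicate alice -> cnn edge).
follow_edges = [
    ("alice", "cnn"),
    ("alice", "nytimes"),
    ("bob", "cnn"),
    ("bob", "espn"),
    ("alice", "cnn"),
]
documents = edge_sets_to_documents(follow_edges)
for vertex, text in documents.items():
    print(vertex, "->", text)
```

Each value in documents plays the role of one text document, and each followed account plays the role of a word.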

When LDA was first applied in computer vision, it was used essentially without modification, with some success. Then the generative model was adapted to the problem domain to improve performance (e.g., in the case of computer vision, by incorporating spatial structure). Things are done in this order for a very practical reason: when you apply the standard generative model, you get to leverage someone else's optimized and correct implementation! For the same reasons I'm sticking with the original LDA here, but I've noticed some aspects that are not a perfect fit.

On directed social graphs (such as Twitter) there are two kinds of edges, incoming and outgoing, which is analogous to having two different kinds of tokens in a document. LDA only has one token type. Possibly this can be worked around by prefixing every edge with a '+' or '-' indicating direction. In practice I sidestep this problem by only modeling the outgoing edges (i.e., the set of people that someone follows).
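The prefixing workaround might look like the sketch below. I have not tried this, and the convention that '+' marks an outgoing edge and '-' an incoming one is an arbitrary assumption, as are the function name signed_edge_documents and the toy edges.

```python
from collections import defaultdict

def signed_edge_documents(follow_edges):
    """Build one token list per vertex, encoding edge direction in a prefix.

    Assumed convention (hypothetical): '+x' means "this vertex follows x",
    '-x' means "this vertex is followed by x".
    """
    docs = defaultdict(set)
    for follower, followee in follow_edges:
        docs[follower].add("+" + followee)  # outgoing edge of the follower
        docs[followee].add("-" + follower)  # incoming edge of the followee
    return {vertex: sorted(tokens) for vertex, tokens in docs.items()}

# Toy, invented edges: alice and bob follow each other, carol follows alice.
edges = [("alice", "bob"), ("bob", "alice"), ("carol", "alice")]
signed = signed_edge_documents(edges)
for vertex in sorted(signed):
    print(vertex, signed[vertex])
```

With this encoding a single-token-type LDA sees '+bob' and '-bob' as unrelated vocabulary items, so the same account can contribute differently as a followee and as a follower.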

An edge can only exist once in an edge set, whereas with vanilla LDA a token can occur multiple times in a text document. Taking into account this negative correlation between edge emission probabilities might improve results.

Broad Social Topics

Even though I don't actually care about understanding the latent factors, it makes for entertaining blog fodder. So now for the fun. I ran a 10 topic LDA model over the edge sets from a random sample of Twitter users, in order to get a broad overview of the graph structure. Here are the top 10 most likely Twitter accounts for each topic:

And yes, this data was collected prior to Charlie Sheen's meteoric rise.

shitmydadsays is a News Site

Actually topic 6 is truly fascinating. Perhaps it is best called "Stuff News Junkies like". There is no doubt that news interest and comedy interest intersect, but the causality is unclear: does one need to watch the news to understand the jokes, or does one need the jokes to avoid severe depression after watching the news?

The Cultural Polysemy of Justin Bieber

When using LDA to analyze text, tokens that have high emission probability for multiple topics often have multiple meanings. Here we see justinbieber has high emission probability for topics 4 and 7, which are otherwise mostly of Asian and North American focus respectively. One interpretation is that the appeal of justinbieber cuts across both cultures.
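For the curious, spotting such cross-topic tokens can be automated. The following is only a sketch under assumptions: the nested-dict layout of topic_probs, the function names, and the toy probabilities and account names are all invented for illustration and are not the numbers from my model.

```python
def top_tokens(topic_probs, k):
    """Return the k highest-probability tokens for each topic.

    topic_probs: {topic_id: {token: emission probability}} (assumed layout).
    Ties are broken alphabetically so the result is deterministic.
    """
    return {
        topic: [
            tok
            for tok, _ in sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        ]
        for topic, probs in topic_probs.items()
    }

def shared_tokens(topic_probs, k):
    """Return tokens that appear in the top-k list of more than one topic."""
    seen = {}
    for topic, tokens in top_tokens(topic_probs, k).items():
        for tok in tokens:
            seen.setdefault(tok, []).append(topic)
    return {tok: topics for tok, topics in seen.items() if len(topics) > 1}

# Toy, invented emission probabilities for two topics.
toy = {
    4: {"acct_a": 0.30, "acct_b": 0.20, "acct_shared": 0.25, "acct_c": 0.01},
    7: {"acct_d": 0.40, "acct_shared": 0.15, "acct_e": 0.10, "acct_a": 0.001},
}
shared = shared_tokens(toy, 3)
print(shared)
```

In the toy data only acct_shared makes the top 3 of both topics, which is the same pattern justinbieber shows for topics 4 and 7.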