Topic Models to explore and compare communities

Recently I’ve been playing with an R wrapper for a machine learning library called Mallet to generate lists of topics from a series of text documents. The technique is called Topic Modelling, and I got to grips with it through Ben Marwick’s readings of archaeology papers, which come with some excellent reusable code. A topic in my model is simply a collection of words that make up the topic. Mallet can do all sorts of fancy things with the words and topics: it can tell me how likely a word is to appear in a topic, and it can analyse a text and tell me how much of that text belongs to which topics.
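To make that idea concrete, here is a toy Python sketch — emphatically not Mallet’s actual inference, and the topics and probabilities are invented — showing that a ‘topic’ is just a probability distribution over words, and a text can be crudely scored by how much of it each topic claims:

```python
# Toy illustration (not Mallet's real algorithm): a "topic" is a
# probability distribution over words, and a document can be scored by
# how many of its words each topic accounts for.
from collections import Counter

# Two hypothetical topics as word -> probability maps (made-up numbers).
topics = {
    "digs":  {"site": 0.4, "pottery": 0.35, "trench": 0.25},
    "dates": {"carbon": 0.5, "sample": 0.3, "site": 0.2},
}

def topic_proportions(text):
    """Crudely attribute each word to the topic that likes it most,
    then report the share of attributed words claimed by each topic."""
    claimed = Counter()
    for w in text.lower().split():
        scores = {name: dist.get(w, 0.0) for name, dist in topics.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            claimed[best] += 1
    total = sum(claimed.values()) or 1
    return {name: claimed[name] / total for name in topics}

print(topic_proportions("the site trench held pottery and a carbon sample"))
```

Real LDA does something far subtler (every topic assigns some probability to every word, and assignments are sampled rather than hard-picked), but the shape of the output — per-document topic proportions — is the same.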

The reason I like it is that the algorithm implemented in Mallet that I use to generate the topics is probabilistic; this leads to some crazy and funny topics. There are also lots of parameters, such as the number of words in a topic, the number of topics and the number of runs of the algorithm. Changes to these parameters seem to have a big effect on what comes out the other end. I guess the process has lots of scope to crash and burn, and I like that. Remember: Losing is fun.

This morning I had a go at using topic modelling techniques to write a script that could tell me what people in similar communities talk about. I thought this would be a useful tool for asking things like ‘What MOOC should I take next?’. I think my experiment failed, but in the spirit of Losing is Fun I thought I’d post it anyway.

The Problem

I thought I’d start with Reddit communities since I’m familiar with them. The idea of a Reddit community (called a subreddit) is simple: post content from elsewhere on the web and then have a good natter about it.

The website Reddit has lots of subreddits; often they overlap, or a new community rises out of discontent with an existing one. I decided to use my script to generate a bunch of topics over a few related subreddits, then work out which subreddit talked about which topic most and where the related subreddits overlapped. Since I am interested in what people are talking about, I decided to use the comments themselves as the basis for the topics. I mined these using a script provided by user Snotaphilious. The comments were taken from the top 10 posts in each subreddit from the last week. I haven’t checked how many comments there are for each post, but the top posts typically have quite a few.

I started by comparing the comments from four subreddits: r/politics, a subreddit for U.S. political news and information; r/ukpolitics, a subreddit for U.K. political news; and r/shitpoliticssays, which describes itself as a ‘subreddit dedicated to pointing out the hypocrisy, arrogance, and bias of /r/politics‘. Finally I threw in r/conspiracy, which is where the conspiracy theorists hang out. I set my script to pull back 30 topics with 5 keywords each; here they are:

These topics sound right when you think of the context of the subreddits described above. Casting an eye over them, you would also think you could attribute some of these topics to particular communities, for example topic [30] “conspiracy story evidence view remember” belonging to /r/conspiracy. To see if I was placing the topics correctly, I plotted the topics along with the average proportions of each topic across all comments for each subreddit:

average proportions of each topic across all comments for each subreddit
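The chart is just the per-comment topic proportions averaged within each subreddit. A minimal Python sketch of that averaging step — the data, subreddit names and numbers here are made up; the post itself does this in R on Mallet’s output:

```python
# Sketch of the averaging behind the chart: each comment has a vector of
# topic proportions (summing to 1), and each subreddit's bars are the
# mean of the vectors for its comments. Data is invented for illustration.
from collections import defaultdict

# (subreddit, [proportion of topic 1, topic 2, topic 3]) per comment
doc_topics = [
    ("politics",   [0.7, 0.2, 0.1]),
    ("politics",   [0.5, 0.4, 0.1]),
    ("conspiracy", [0.1, 0.1, 0.8]),
]

def average_by_community(rows):
    sums, counts = {}, defaultdict(int)
    for sub, vec in rows:
        if sub not in sums:
            sums[sub] = [0.0] * len(vec)
        sums[sub] = [s + v for s, v in zip(sums[sub], vec)]
        counts[sub] += 1
    return {sub: [s / counts[sub] for s in sums[sub]] for sub in sums}

# politics averages to roughly [0.6, 0.3, 0.1]; conspiracy keeps its
# single comment's vector [0.1, 0.1, 0.8]
print(average_by_community(doc_topics))
```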

This is really odd and doesn’t really back up what my initial glance told me. In fact topic 30 was seen least in r/conspiracy and most in r/ukpolitics (I’ll try not to think why). r/conspiracy’s biggest topic was [22] “system voter vote voting political”, which was also pretty big in r/politics and r/shitpoliticssays. I guess this makes sense, with conspiracy theories about rigged elections, r/politics talking about elections, and r/shitpoliticssays mocking conspiracy theorists. The biggest topic belonging to one subreddit but not the others was [13] “male issued vote student texas”, belonging to r/shitpoliticssays. From what I gather, with the Reddit audience being young leftish adults there is often hive-mind bashing of Southern states in r/politics, and this could be ridicule of that.

I don’t know why topic [12] “attack book thousands site ddos” seems to be the topic that appears most consistently across all subreddits; perhaps it was about a site outage that affected them all.

Attempt 2: r/gaming and r/truegaming

At some point the users of the subreddit r/gaming got annoyed that the most upvoted content was consistently pictures of N64 cartridges with the caption ‘look what I found in the loft’, and broke off to create a subreddit for gaming discussion only, r/truegaming. Again I ran the script over r/gaming and r/truegaming to see what the users were talking about in each forum and how it differed.

average proportions of each topic across all comments for each subreddit

These are the topics generated from comments in the two gaming subreddits. I think this time the two are more closely aligned, and it’s quite hard to single out topics belonging to one community and not the other. Two do stand out: [9] “people they’re diablo worst evil” seems to be big in r/truegaming but not r/gaming, and vice versa for [17] “pokemon type types water dark”. Does one talk about Diablo (and how they hate it?) and the other about Pokemon?

Why it doesn’t work so well

Some of the topics cropping up in subreddit comments were ones I’d expect, but a lot seemed random, and all topics seemed to appear in all subreddits.

To be honest I think there is a problem with the approach. Each of my topics has 5 words in it, but each comment on Reddit is very short; quite often there are fewer than 5 words in the whole comment, and the technique does not seem great over small bits of text. Perhaps analysing the content being posted, instead of the comments around the posts, would be more effective. Still, I’ll have another play with this code, changing some of the variables, like using fewer keywords per topic, and checking things like the number of comments I’m analysing per subreddit; perhaps there are simply more comments in some subreddits and that is making my results a bit squiffy. The number of comments might also mean I need more or fewer topics. I’d usually do this sort of thing with a few large documents talking about a few related things, whereas here I have lots of comments about lots of different things.
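One thing worth trying against the short-comment problem is to glue comments together into bigger documents before modelling, for example one document per post. A hedged Python sketch, with made-up field names:

```python
# Possible fix for very short comments: concatenate all comments under
# the same post into one larger document before topic modelling.
# The dict field names ("post", "body") are invented for illustration.
from collections import defaultdict

comments = [
    {"post": "p1", "body": "rigged"},
    {"post": "p1", "body": "the voting machines though"},
    {"post": "p2", "body": "nice find"},
]

def aggregate_by_post(comments):
    """Return one space-joined document per post id."""
    docs = defaultdict(list)
    for c in comments:
        docs[c["post"]].append(c["body"])
    return {post: " ".join(bodies) for post, bodies in docs.items()}

print(aggregate_by_post(comments))
```

The same few lines would aggregate by user or by day instead; which grouping gives sensible topics is something you would have to try out.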

I’m still really interested in mining the comments, so if you have any ideas please fire away.

5 thoughts on “Topic Models to explore and compare communities”

Thanks for the feedback. You are right; it also doesn’t help that the label on the right is upside down, if that makes sense. I’m going to have a play and see what I can do to make it more readable. I’m interested in ways of visualising topic models but finding it difficult, so I’d be interested in any other ideas you have.

Looks like you had fun playing with the data. The color doesn’t really seem to add anything extra, unless maybe you sort the columns by high to low frequency. Mallet looks really cool too, I’m definitely going to check that out! Thanks for posting

I’m using Mallet on Twitter to model topics. Tweets are in many senses similar to the comments you mine, so I might have some useful info. Each topic you have doesn’t consist of only 5 words; it’s a distribution over all the words used in your data. In other words, if some of the 5-word topics don’t make sense, try 10, 15 or even 20 words per topic. I’m not sure how to set that in R, but it’s possible. As for the shortness of the comments: aggregate the comments into larger documents. There are numerous articles about Mallet’s topic modelling algorithm, called LDA, performing better on larger documents. How you aggregate your documents — by user, day or subreddit — is something of an art. You have to try. Finally, about the number of topics: try from 10 to 200. You have to run your data for some different topic numbers, then take a look at the topics and see if they make sense. If single topics seem to mix very different themes, increase the number of topics; if your topics correspond to narrow objects and one theme is spread thinly over a few topics, try decreasing the topic number. Hope this helps, don’t give up!

Thanks for the useful and very in-depth feedback! That is how I understood the topics; have you seen this introduction: http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf ? I think it’s really good. If you are interested in how I set the number of topics in my example, I’ve done it here:

library(mallet)  # R wrapper around the Mallet Java library
n.topics <- 30   # how many topics to fit
topic.model <- MalletLDA(n.topics)

I think you are right that part of the art is working out how you organise documents and set parameters.
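Picking the number of topics by trial runs, as suggested above, amounts to a loop over candidate counts. A Python sketch of that loop, where score_model is a hypothetical stand-in for refitting the model and judging the topics by eye (or with a coherence measure):

```python
# Sketch of choosing a topic count by trying several and keeping the best.
# score_model is a made-up placeholder: in practice you would refit the
# topic model at each count and rate how sensible the topics look.
def score_model(n_topics):
    # pretend quality peaks at 30 topics for this imaginary corpus
    return -abs(n_topics - 30)

def pick_topic_count(candidates):
    """Return the candidate topic count with the highest score."""
    return max(candidates, key=score_model)

print(pick_topic_count([10, 20, 30, 50, 100, 200]))
```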

I still don't understand 'Hyperparameter Optimization'. Do you understand what that does?