“Panama Papers” News Story

The “Panama Papers” were about 12 million leaked documents that shed light on people around the world who were hiding money in offshore entities through the Panamanian company Mossack Fonseca. News coverage in the immediate aftermath of the leak covered a wide range of issues, from the technical aspects of a hack to the implications for the world leaders and celebrities who were exposed. In the weeks following the leak, I selected 34 articles about the issue and extracted their content into text files (see articles below).

Manual Topic Modeling Activity

Topic modeling is an algorithmic technique that finds “a recurring pattern of co-occurring words” in a corpus of text. It takes a word’s overall usage across the entire corpus into consideration, so words used often in many documents will not appear in every topic. For example, “Panama” is almost certainly used at least once in every article (the average is about 6.3 times per article). However, “Panama” will only be included in an article’s keywords if it’s used a significant number of times and has close usage relationships with other words.

Today you’ll skim a few articles and try your best to imitate topic modeling algorithms. Your assigned articles can be found below.

Before you start reading your articles, take a look at the total word usage for all 34 articles (stop words removed). Once you have glanced at the list, read your first article. Write (or type) out the words that seem unique to the article, meaning words that are used disproportionately more often in your article than in the rest of the corpus. Remember, you’re only looking at the words used, not at concepts or themes that you’re able to discern from them. You’re imitating a computer algorithm, and algorithms are not particularly smart. Once you’ve finished your first article, do the same for your second.
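If it helps to see the manual exercise as a procedure, here is a minimal Python sketch of the same idea: rank the words in one article by how much more often they occur there than in the corpus as a whole. This is only an imitation of the hand exercise, not what MALLET or TMT actually computes. The function names, the sample texts, and the min_count threshold are all made up for illustration, and stop-word removal is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def distinctive_words(article, corpus, top_n=5, min_count=2):
    """Return the words used disproportionately often in `article`.

    `corpus` is a list of article texts (assumed to include `article`).
    Each word is scored by its rate in the article divided by its rate
    in the whole corpus. Words used fewer than `min_count` times in the
    article are skipped.
    """
    article_counts = Counter(tokenize(article))
    corpus_counts = Counter()
    for doc in corpus:
        corpus_counts.update(tokenize(doc))

    article_total = sum(article_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for word, count in article_counts.items():
        if count < min_count:
            continue
        article_rate = count / article_total
        # Add-one smoothing avoids dividing by zero if a word is
        # somehow missing from the corpus counts.
        corpus_rate = (corpus_counts[word] + 1) / (corpus_total + 1)
        scores[word] = article_rate / corpus_rate

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Tiny invented "articles" standing in for the real text files.
sample_corpus = [
    "panama leak hack server hack email server",
    "panama leak iceland minister resign iceland minister",
    "panama leak putin cellist putin",
]

print(distinctive_words(sample_corpus[1], sample_corpus))
```

Notice that “panama” and “leak” never make the list in this toy corpus: they appear in every article, but only once each, so they are not distinctive for any single one.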

Do your words seem to make sense as one or more topics, or do they seem to be random?

Comparing and Analyzing Results

I used Topic Modeling Tool (TMT), an easy-to-use graphical interface for running MALLET topic models. TMT produces a set of CSV files and a set of HTML files containing the output. Take a look at these results with 20 topics. (Remember, these aren’t labeled topics; they’re clusters of words that likely represent a topic.)

Are there any identifiable “topics” here? Are there any “topics” that don’t seem to make sense?

If you click on one of the topics, you’ll see a list of documents ordered by how closely each document corresponds with the topic. The number in parentheses is the number of times words in the topic appear in the document. Now click on one of the text files. This will show you the full text of the file, along with the topics that align most closely with that document.
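To make the number in parentheses concrete, here is a rough Python sketch that tallies how often a topic’s words appear in each document and orders the documents by that tally. It follows the description above literally. TMT’s own figure comes from its model and may differ from a plain count like this one. The function name, file names, and texts are invented for illustration.

```python
import re
from collections import Counter


def rank_documents(topic_words, documents):
    """Order documents by how many times the topic's words occur in them.

    `topic_words` is a list of words standing in for one topic.
    `documents` maps a file name to its text.
    Returns a list of (file name, count) pairs, highest count first.
    """
    topic = {word.lower() for word in topic_words}
    ranked = []
    for name, text in documents.items():
        counts = Counter(re.findall(r"[a-z']+", text.lower()))
        hits = sum(counts[word] for word in topic)
        ranked.append((name, hits))
    # Highest count first; ties are broken by file name.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Invented stand-ins for the article text files.
sample_documents = {
    "a.txt": "iceland minister resign iceland",
    "b.txt": "hack server",
    "c.txt": "minister",
}

for name, hits in rank_documents(["iceland", "minister"], sample_documents):
    print(f"{name} ({hits})")
```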

Take a few minutes and explore these results. Click through the network of topics and documents and see if you can find any patterns. Look at the articles you read and see how closely your results matched the TMT results.