Other sites

Topic Modeling in R

As a part of Twitter Data Analysis, So far I have completed Movie review using R & Document Classification using R. Today we will be dealing with discovering topics in Tweets, i.e. to mine the tweets data to discover underlying topics– approach known as Topic Modeling.

What is Topic Modeling?

A statistical approach for discovering “abstracts/topics” from a collection of text documents based on statistics of each word. In simple terms, the process of looking into a large collection of documents, identifying clusters of words and grouping them together based on similarity and identifying patterns in the clusters appearing in multitude.

Consider the below Statements:

I love playing cricket.

Sachin is my favorite cricketer.

Titanic is heart touching movie.

Data Analytics is next Future in IT.

Data Analytics & Big Data complements each other.

When we apply Topic Modeling to the above statements, we will be able to group statement 1&2 as Topic-1 (later we can identify that the topic is Sport), statement 3 as Topic-2 (topic is Movies), statement 4&5 as Topic-3 (topic is data Analytics).

Topic Modeling can be achieved by using Latent Dirichlet Allocation algorithm. Not going into the nuts & bolts of the Algorithm, LDA automatically learns itself to assign probabilities to each & every word in the Corpus and classify into Topics. A simple explanation for LDA could be found here:

Topic modeling using LDA is a very good method of discovering topics underlying. The analysis will give good results if and only if we have large set of Corpus.In the above analysis using tweets from top 5 Airlines, I could find that one of the topics which people are talking about is about FOOD being served. We can Sentiment Analysis techniques to mine what people thinks about, talks about products/companies etc.