
Let’s start talking about Data Mining! In today’s post, we are going to dive into Topic Modeling, a technique that extracts the topics from a text. It is a really impressive technique that has many applications in the world of Data Science. The post is organized as follows. First, I am going to give some basic definitions and explain what Topic Modeling is. Then, I will briefly touch on preprocessing, since I am going to dedicate a whole post to it. After that, I will present a Python algorithm, and I will conclude with a visualization process. For this post, I am going to use a well-known dataset called reuters, which comes with the lda Python library, rather than my previous blog posts, since there are not that many of them. Let’s begin!

The code of this project will be uploaded soon and it will contain the preprocessing step too!

1. Topic Modeling, Definitions

A topic model is a type of statistical model for discovering the abstract "topics"
that occur in a collection of documents.

As we can see, topic modeling is a method for extracting topics from documents. For a human, finding a text’s topic is really easy. Even if the text is barely readable, a few specific words are enough for a person to understand the topic. For a computer, this is not trivial, because a computer cannot understand the meaning of words. If we pick two random words from a physical book, for example the and Juliette, and give them to a computer, the computer cannot comprehend the difference between them. The computer must have some previous knowledge about the book, or be able to scrape/search/crawl/etc. the internet or any other source of information, and even then it will only produce an analysis.

With topic modeling, a computer performs a statistical analysis of a document and outputs a series of words that are relevant to that document (a very rough explanation). Let’s take a closer look.

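There are several methods for performing Topic Modeling. Some of them are:

LSA (Latent Semantic Analysis)

NMF (Non-negative Matrix Factorization)

pLSA (Probabilistic Latent Semantic Analysis)

LDA (Latent Dirichlet Allocation)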

In this post, we are going to look at a well-known and very flexible algorithm: LDA, or Latent Dirichlet Allocation. A very good explanation is given by Christine Doig.

Check her out! She is amazing!
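
To give a rough feel for what LDA does under the hood, here is a minimal from-scratch sketch that fits LDA with collapsed Gibbs sampling, which is one common way to do it. It is not the implementation of any particular library. The function name lda_gibbs, the values of the hyperparameters alpha and eta, and the four toy documents are all assumptions made up for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, eta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents.

    Returns (vocab, topic_word_counts, doc_topic_counts).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables used by the sampler.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(n_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Weight of each topic for this token, proportional to
                # (doc-topic count + alpha) * (topic-word count + eta)
                # / (topic total + vocabulary size * eta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + eta)
                    / (topic_total[t] + n_words * eta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, topic_word, doc_topic


# Toy corpus, made up for illustration only.
docs = [
    "pope vatican church mass".split(),
    "church pope catholic vatican".split(),
    "film music elvis fans".split(),
    "elvis music film concert".split(),
]
vocab, topic_word, doc_topic = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda v: -counts[v])[:3]
    print(f"Topic {k}:", " ".join(vocab[v] for v in top))
```

With so little data the result depends heavily on the random seed, so the printed topics should be treated as an illustration only.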

2. Preprocessing the Data

To perform LDA or any other Topic Modeling algorithm, you will need a nice text corpus. The corpus that you need depends on the application. If you need to perform topic modeling on articles from CNN, BBC, or any other news website, you will need a good, broad corpus like Wikipedia, because you will have to deal with many different categories (sports, politics, food, movies, …). The bottom line is that a good corpus will give you better results. I am not going to go into details here because, as I said before, I will cover preprocessing in a separate post. Here, for example, we can do the following:

Get the title and content of all wiki pages

Get rid of short articles

Tokenize the remaining articles

Sort the words according to tf-idf

Perform stemming

Remove a percentage of the words from the top and from the bottom of the sorted list

Remove stopwords

Keep the top % of the remaining list

These are some basic steps for preprocessing a text corpus. We will discuss more of them, in more depth, in a later post.
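
A simplified version of these steps can be sketched with the Python standard library alone. This is only a rough illustration, not the preprocessing code of this project: the tiny stopword list, the naive_stem helper (a crude stand-in for a real stemmer), the thresholds min_tokens and keep_fraction, and the order of the steps are all assumptions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "was", "to", "on"}


def naive_stem(word):
    """Very rough stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(docs, min_tokens=5, keep_fraction=0.5):
    """Return (filtered_docs, vocabulary) for a list of raw text documents."""
    tokenized = []
    for text in docs:
        raw = re.findall(r"[a-z]+", text.lower())
        # Get rid of short articles.
        if len(raw) < min_tokens:
            continue
        # Remove stopwords, then stem what is left.
        tokenized.append([naive_stem(t) for t in raw if t not in STOPWORDS])

    # Score every word by its highest tf-idf value over the corpus.
    n_docs = len(tokenized)
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = {}
    for tokens in tokenized:
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] = max(scores.get(word, 0.0), tf * idf)

    # Sort by score (ties broken alphabetically) and keep the top fraction.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    vocab = set(ranked[: max(1, int(len(ranked) * keep_fraction))])
    filtered = [[t for t in tokens if t in vocab] for tokens in tokenized]
    return filtered, vocab


filtered, vocab = preprocess([
    "The cat sat on the mat and the cat purred loudly",
    "Too short",
    "Dogs barked at the cats in the garden all day",
])
print(sorted(vocab))
```

Words that appear in every remaining document get a tf-idf score of zero here, so they drop to the bottom of the ranking and are the first to be cut.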

Topic 0: police church catholic women
Topic 1: elvis film music fans
Topic 2: yeltsin president political russian
Topic 3: city million century art
Topic 4: charles prince king diana
Topic 5: germany against french german
Topic 6: church people years first
Topic 7: pope mother teresa vatican
Topic 8: harriman u.s clinton churchill
Topic 9: died former life funeral

On their own, the topics are just lists of words rather than meaningful sentences, but we can clearly get a sense of what the documents are talking about! With this kind of information we can manipulate and analyze the documents. For example, we can cluster the documents for a recommendation system.

Furthermore, we can see the first ten documents and the assigned topics:
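
One common way to assign a topic to a document is to take the most probable topic in that document's topic distribution. The sketch below shows the idea. The probabilities and titles are made up for illustration and are not results from the reuters dataset, and dominant_topic is a hypothetical helper name.

```python
# Hypothetical document-topic probabilities: one row per document,
# one column per topic. These numbers are made up for illustration.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.60, 0.10],
]
titles = ["doc A", "doc B", "doc C"]  # placeholder titles


def dominant_topic(row):
    """Index of the most probable topic in one document's distribution."""
    return max(range(len(row)), key=lambda k: row[k])


for title, row in zip(titles, doc_topic):
    print(f"{title} (top topic: {dominant_topic(row)})")
```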

4. Conclusion

We can clearly see that the topics were to the point. For the evaluation process, we can use several methods according to the needs of our application. For example, we can compute the distance between documents, which translates to the similarity between them. We can use cosine similarity or Jensen-Shannon distance to cluster the documents, or use perplexity to see whether the model is representative of the documents we are scoring.
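
As a rough sketch of the two similarity measures mentioned above, the functions below compare two documents through their topic distributions. The function names are my own, and the example vectors are made up. Base-2 logarithms are used so that the Jensen-Shannon distance falls between 0 and 1.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence of two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


# Made-up topic distributions for two documents.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]
print(cosine_similarity(doc_a, doc_b))
print(jensen_shannon_distance(doc_a, doc_b))
```

Note the opposite directions: a higher cosine similarity means more similar documents, while a lower Jensen-Shannon distance does.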

That’s all for today’s post! Please let me know if you have any questions in the comments section below! Till next time, take care and bye bye!