TFIDF explained using Apache Mahout

This article explains TFIDF (term frequency-inverse document frequency) in the machine learning/natural language processing context. We will explain what the term means, then create a sample project using Java and Apache Mahout where we can observe how the weight is calculated. The project also helps to better understand Apache Mahout's input and output folder structure. Basic Java programming knowledge is required.

TFIDF short description

TFIDF (term frequency-inverse document frequency) is an important weighting scheme used in fields like machine learning, natural language processing, search engines and text mining. The metric measures the relative importance of a word within a collection of documents. If a term occurs frequently in one document but not so frequently in the entire set of documents, it is more relevant to a search than a word that appears frequently across all the documents. By calculating TFIDF for all terms that appear in a set of documents, we can filter away the less relevant words. As an example, a word which appears only twice in a single document is more relevant to someone searching that document, compared to words which appear many times in all the documents, like: the, is, at, and, or, on, etc. Using TFIDF, the latter words can be ignored and the relevant ones retained.

In the following, we will gain better insight into how the TFIDF weight is calculated by working through a practical example. We will use the Apache Mahout library to create two simple documents and calculate the TFIDF weight of each word in these documents.

TFIDF Calculation

We use the two sample documents as input. Mahout starts by counting the occurrences of every word in the document collection. Next, a dictionary entry is created for each distinct word. From this point on, the dictionary value (an integer id) is used in calculations instead of the actual word.
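To make the dictionary step concrete, here is a minimal sketch of what Mahout does conceptually: each distinct word in the collection is mapped to an integer id. The two documents below are hypothetical stand-ins for the sample documents (the exact ids Mahout assigns depend on its term ordering, so the ids here are only illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionaryExample {

    // Assign each distinct word an integer id, in order of first appearance
    static Map<String, Integer> buildDictionary(List<String> documents) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String word : doc.split("\\s+")) {
                dictionary.putIfAbsent(word, dictionary.size());
            }
        }
        return dictionary;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the two sample documents
        List<String> documents = List.of(
            "i saw a red car and a blue car",
            "i saw a car");
        System.out.println(buildDictionary(documents));
    }
}
```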

The Term Frequency (TF) is calculated by counting the occurrences of every word in a given document. You can see that the word car (the value 0, according to the dictionary) has the frequency 2.0 (0:2.0) in the first document and 1.0 (0:1.0) for the second one.
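The raw count behind the TF values above can be sketched in a few lines of plain Java. The document text is a hypothetical stand-in in which "car" appears twice, matching the 0:2.0 entry described in the article:

```java
import java.util.HashMap;
import java.util.Map;

public class TermFrequencyExample {

    // Count how many times each word occurs in a single document
    static Map<String, Integer> termFrequencies(String document) {
        Map<String, Integer> tf = new HashMap<>();
        for (String word : document.split("\\s+")) {
            tf.merge(word, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        // Hypothetical first document: "car" occurs twice, "red" once
        String doc1 = "i saw a red car and a blue car";
        Map<String, Integer> tf = termFrequencies(doc1);
        System.out.println(tf.get("car")); // 2
        System.out.println(tf.get("red")); // 1
    }
}
```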

The Document Frequency (DF) is calculated by counting, for each word, the number of documents in which it appears. You can see that the words car and saw have the count 2 as they appear in both documents. All other words have the count 1. The first entry, -1, holds the total document count.
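The DF step can be sketched similarly: each document contributes at most 1 to a word's count, no matter how often the word occurs in it. Again the two documents are hypothetical stand-ins chosen so that "car" and "saw" appear in both:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class DocumentFrequencyExample {

    // Count, for each word, the number of documents containing it
    static Map<String, Integer> documentFrequencies(List<String> documents) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : documents) {
            // Deduplicate within a document so each document counts at most once
            for (String word : new HashSet<>(Arrays.asList(doc.split("\\s+")))) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<String> documents = List.of(
            "i saw a red car and a blue car", // hypothetical document 1
            "i saw a car");                   // hypothetical document 2
        Map<String, Integer> df = documentFrequencies(documents);
        System.out.println(df.get("car")); // 2 (appears in both documents)
        System.out.println(df.get("red")); // 1 (appears only in document 1)
    }
}
```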

Finally, the TFIDF is calculated by multiplying TF and IDF (Inverse Document Frequency):

TFIDF = TF x IDF

If we take the word red (3 according to the dictionary), TF is pretty straightforward: the term appears once in the first document, hence the TF is 1.0.

IDF is obtained using the following natural logarithm formula:

IDF = log (Total document count / (document frequency + 1) ) + 1

Mahout delegates the actual calculation of the IDF to the Apache Lucene library, more precisely to the DefaultSimilarity class. Note also that TF is actually calculated as sqrt(tf); see here: Lucene TF.
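Putting the two formulas together, here is a small sketch that reproduces the weights discussed in the article and the comments below, using tf = sqrt(raw count) and idf = log(N / (df + 1)) + 1 (natural logarithm). The raw counts and document frequencies are taken from the example above:

```java
public class TfIdfExample {

    // Lucene DefaultSimilarity-style term frequency: square root of the raw count
    static double tf(int rawCount) {
        return Math.sqrt(rawCount);
    }

    // IDF as given in the article: log(totalDocs / (docFreq + 1)) + 1
    static double idf(int totalDocs, int docFreq) {
        return Math.log((double) totalDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int totalDocs = 2;

        // "car": raw count 2 in the first document, document frequency 2
        double tfidfCar = tf(2) * idf(totalDocs, 2);
        System.out.println(tfidfCar); // ~0.8407992720603...

        // "red": raw count 1 in the first document, document frequency 1
        double tfidfRed = tf(1) * idf(totalDocs, 1);
        System.out.println(tfidfRed); // 1.0
    }
}
```

Note that idf(2, 2) = log(2/3) + 1 ≈ 0.5945, which matches the IDF value quoted in the comments, and sqrt(2) × 0.5945… ≈ 0.8408, matching the TFIDF weight of car in the first document.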

Conclusion

TFIDF is a fundamental concept in the context of machine learning, natural language processing, search engines and text mining. A solid understanding of it is needed to implement machine learning or natural language processing projects efficiently.


leo

Thanks for a great article. If I already have a list of TFIDF for each document, how would I go about implementing it in this code? Rather than creating test documents, and then performing word count, and then calculate TFIDF on those. Thanks in advance.

Hi, thanks for the explanation. I have a doubt about how the value 0.8407992720603943 was calculated. It is in the first document, for the word "car": the TF is 2 and the IDF is 0.5945348739624023, but when I calculate TFIDF = TF * IDF the result is different. Could you please help me?

The fact is that Mahout delegates the calculation of TF (Term Frequency) to Lucene, which computes it as sqrt(TF). See here: Lucene TF. If you multiply your IDF (0.5945348739624023) by sqrt(2), you will get the correct TFIDF. I have also updated the article to clearly specify the TF calculation.

The tf-idf vectors are used to train a model using TrainNaiveBayesJob from Java code, or trainnb from the command line. You can also see this tutorial which uses Mahout naive Bayes: Sentiment analysis using Mahout naive Bayes