Document Classification Using Multinomial Naive Bayes Classifier

Document classification is a classical machine learning problem. Given a set of documents that are already categorized/labeled into existing categories, the task is to automatically categorize a new document into one of those categories. In this blog, I will elaborate upon a machine learning technique for doing this.

We have an existing set of documents (D1-D5) that are categorized into Auto, Sports, and Computer.

Document #   Content               Category
D1           Saturn Dealer's Car   Auto
D2           Toyota Car Tercel     Auto
D3           Baseball Game Play    Sports
D4           Pulled Muscle Game    Sports
D5           Colored GIFs Root     Computer

Now the task is to categorize the new documents D6 and D7 into Auto, Sports, or Computer.

Document #   Content             Category
D6           Home Runs Game      ?
D7           Car Engine Noises   ?

In machine learning, the given set of documents used to train the probabilistic model is called the training set.

The problem can be solved with the classification techniques of machine learning. There are several algorithms (all available in scikit-learn) that can be tried out, including:

- BernoulliNB
- MultinomialNB
- NearestCentroid
- SGDClassifier
- LinearSVC
- RandomForestClassifier
- KNeighborsClassifier
- PassiveAggressiveClassifier
- Perceptron
- RidgeClassifier

These can be combined with feature extraction steps using scikit-learn's Pipeline utility.

Feel free to try out these algorithms for yourself; I found Multinomial Naive Bayes to be one of the most effective algorithms for this purpose.
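As a quick illustration, here is a minimal scikit-learn sketch (the vectorizer and pipeline choices are mine, not prescribed anywhere) that trains MultinomialNB on the five toy documents above and then classifies D6 and D7:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training set: documents D1-D5 with their categories.
train_docs = [
    "Saturn Dealer's Car",   # D1
    "Toyota Car Tercel",     # D2
    "Baseball Game Play",    # D3
    "Pulled Muscle Game",    # D4
    "Colored GIFs Root",     # D5
]
train_labels = ["Auto", "Auto", "Sports", "Sports", "Computer"]

# Bag-of-words counts feeding a Multinomial Naive Bayes classifier
# (alpha=1.0, the default, is Laplace add-one smoothing).
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["Home Runs Game", "Car Engine Noises"]))
# -> ['Sports' 'Auto']
```

"Game" appears twice in the Sports documents and "Car" twice in the Auto documents, so D6 lands in Sports and D7 in Auto.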

In this blog, I will also provide an application of Multinomial Naive Bayes. I recommend working through the following steps to build a strong foundation for this concept.

Calculate the likelihood. The likelihood is the conditional probability of a word occurring in a document, given that the document belongs to a particular category. With Laplace (add-one) smoothing:

P(Word | Category) = (number of occurrences of the word across all documents in the category + 1) divided by (total number of words across all documents in the category + total number of unique words across all documents)

For example:

P(Saturn | Auto) = (occurrences of "Saturn" in the "Auto" documents + 1) divided by (total words in the "Auto" documents + total unique words in all documents)

= (1 + 1) / (6 + 13) = 2/19 ≈ 0.105263158
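As a sanity check, this arithmetic can be reproduced in a few lines of plain Python; the documents and counts below come straight from the training table above:

```python
# Training documents per category (from the table above), lowercased.
docs = {
    "Auto":     ["saturn dealers car", "toyota car tercel"],
    "Sports":   ["baseball game play", "pulled muscle game"],
    "Computer": ["colored gifs root"],
}

# Vocabulary: every unique word across all training documents.
vocab = {w for texts in docs.values() for t in texts for w in t.split()}

def likelihood(word, category):
    """Laplace-smoothed P(word | category)."""
    cat_words = [w for t in docs[category] for w in t.split()]
    return (cat_words.count(word) + 1) / (len(cat_words) + len(vocab))

print(len(vocab))                    # 13 unique words
print(likelihood("saturn", "Auto"))  # 2/19 = 0.10526315789473684
```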

The tables below provide conditional probabilities for each word in Auto, Sports, and Computer.
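Those tables, and the final category assignments for D6 and D7, can be regenerated with a short script. The sketch below (my own helper names, not from any library) multiplies the class prior by the smoothed likelihood of each known word; words not seen in training, such as "home" and "runs", are simply ignored:

```python
# Training documents per category (from the table above), lowercased.
docs = {
    "Auto":     ["saturn dealers car", "toyota car tercel"],
    "Sports":   ["baseball game play", "pulled muscle game"],
    "Computer": ["colored gifs root"],
}
vocab = {w for texts in docs.values() for t in texts for w in t.split()}

def likelihood(word, category):
    """Laplace-smoothed P(word | category)."""
    cat_words = [w for t in docs[category] for w in t.split()]
    return (cat_words.count(word) + 1) / (len(cat_words) + len(vocab))

# Likelihood table: P(word | category) for every word and category.
for word in sorted(vocab):
    row = {c: round(likelihood(word, c), 4) for c in docs}
    print(word, row)

def classify(text):
    """Score each category as prior * product of word likelihoods."""
    total_docs = sum(len(texts) for texts in docs.values())
    scores = {}
    for category, texts in docs.items():
        score = len(texts) / total_docs          # prior P(category)
        for w in text.lower().split():
            if w in vocab:                       # skip unseen words
                score *= likelihood(w, category)
        scores[category] = score
    return max(scores, key=scores.get)

print(classify("Home Runs Game"))     # Sports (via "game")
print(classify("Car Engine Noises"))  # Auto (via "car")
```

For D6, only "game" is in the vocabulary: Sports scores 2/5 × 3/19 ≈ 0.063, beating Auto (2/5 × 1/19 ≈ 0.021) and Computer (1/5 × 1/16 ≈ 0.013), so D6 is classified as Sports; D7 goes to Auto by the same reasoning applied to "car".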

Before concluding, I would recommend exploring Python packages such as scikit-learn, which provide great resources for learning classification techniques along with implementations of several classification algorithms.

I hope you enjoyed reading this. If you have any questions or queries, please leave a comment below. I greatly appreciate your feedback!

Manoj Bisht

Senior Architect

Manoj Bisht is a Senior Architect at 3Pillar Global, working out of our office in Noida, India. He has expertise in building and working with high-performance teams delivering cutting-edge enterprise products. He is also a keen researcher who dives deep into trending technologies. His current areas of interest are data science, cloud services, and microservice/serverless design and architecture. He loves to spend his spare time playing games and also enjoys traveling to new places with family and friends.
