Classifying documents using Naive Bayes on Apache Spark / MLlib

In recent years, Apache Spark has gained popularity as a faster alternative to Hadoop, and it reached a major milestone last month with the release of the production-ready version 1.0.0. It claims to be up to 100 times faster by leveraging the distributed memory of the cluster and by not being tied to the multi-stage execution model of MapReduce. Like Hadoop, it offers an ecosystem of tools built on top of the core engine: a SQL layer (Shark), a machine learning library (MLlib), a graph library (GraphX), and more. Finally, Spark integrates well with Scala: one can manipulate distributed collections just like regular Scala collections, and Spark takes care of distributing the processing to the different workers.

In this post, we describe how we used Spark / MLlib to classify HTML documents, using as a training set the popular Reuters-21578 collection of documents that appeared on the Reuters newswire in 1987.

Spark takes care of distributing the work to the different workers, and collect() gathers the results from the workers back to the driver.
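As a minimal sketch of this pattern (the application name, master URL, and tokenization are illustrative, not taken from the post):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process using all cores; on a cluster this
    // would point at the cluster manager instead.
    val conf = new SparkConf().setAppName("reuters-naive-bayes").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // parallelize() turns a local collection into a distributed one (an RDD).
    val docs = sc.parallelize(Seq("Oil prices rose sharply", "Grain exports fell"))

    // The map runs on the workers, partition by partition.
    val tokenized = docs.map(_.toLowerCase.split("\\s+").toSeq)

    // collect() ships the per-partition results back to the driver
    // as a regular local array.
    val local: Array[Seq[String]] = tokenized.collect()

    sc.stop()
  }
}
```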

Based on the dictionary, we compute the IDF score of each term. There are different formulas for the IDF score; a common one is:
idf(term, docs) = log[(number of documents) / (number of documents containing term)]

However, since the Naive Bayes implementation in MLlib already applies a log, we can drop it from the formula:
idf(term, docs) = (number of documents) / (number of documents containing term)
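To make the two variants concrete, here is a small sketch computing both side by side (the document counts are made up for illustration):

```scala
// Hypothetical corpus statistics: 1000 documents total, and the number
// of documents each term appears in.
val numDocs = 1000
val docFreq = Map("oil" -> 80, "the" -> 990, "barrel" -> 25)

// Standard IDF: log(N / df(term)). Rare terms get higher weights.
val idfLog = docFreq.map { case (term, df) =>
  term -> math.log(numDocs.toDouble / df)
}

// Simplified IDF without the log: N / df(term). The ordering of the
// weights is the same; only the scale differs.
val idfSimple = docFreq.map { case (term, df) =>
  term -> numDocs.toDouble / df
}
```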

We also exclude words that appear in fewer than 3 documents (an arbitrary threshold) to remove overly specific terms.
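The steps above can be sketched end to end as follows. This is a hypothetical reconstruction, not the post's actual code: `termDocs` is assumed to be an RDD of `(docId, terms)` pairs produced by the earlier tokenization step, and `labels` a local map from document id to a numeric topic label.

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val numDocs = termDocs.count()

// Document frequency per term, keeping only terms present in at least
// 3 documents (the arbitrary threshold mentioned above).
val docFreq = termDocs
  .flatMap { case (_, terms) => terms.distinct.map(t => (t, 1)) }
  .reduceByKey(_ + _)
  .filter { case (_, df) => df >= 3 }
  .collectAsMap()

// Dictionary: term -> index in the feature vector.
val dict = docFreq.keys.zipWithIndex.toMap

// Simplified IDF without the log: N / df(term).
val idf = docFreq.map { case (t, df) => t -> numDocs.toDouble / df }

// Build sparse TF-IDF vectors and train the Naive Bayes model.
val points = termDocs.map { case (docId, terms) =>
  val tf = terms.groupBy(identity).mapValues(_.size)
  val features = tf.collect {
    case (t, f) if dict.contains(t) => (dict(t), f * idf(t))
  }.toSeq
  LabeledPoint(labels(docId), Vectors.sparse(dict.size, features))
}
val model = NaiveBayes.train(points)
```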

Conclusion

In this post, we walked through a simple example of using Spark to classify documents with Naive Bayes. There are many other interesting aspects of Spark: the ability to broadcast variables to workers, cache intermediate results, ingest data streams, and more.

Even though MLlib is still very young and offers far fewer algorithm implementations than Mahout, it is faster, and its team is working on adding more algorithms. On the other hand, Mahout is moving to Spark to offer better performance, so it will be interesting to see how things evolve over the next few months.
