Natural Language Processing with Apache Hadoop and Python

This blog post was co-written by Nitin Madnani and Jimmy Lin, both researchers at the University of Maryland, who are sharing their thoughts and experiences with Apache Hadoop and Python for improving Natural Language Processing techniques.

If you listen to analysts talk about complex data, they all agree: it is growing, and faster than anything before it. Complex data can mean a lot of things, but to our research group, ever-increasing volumes of naturally occurring human text and speech, from blogs to YouTube videos, enable new questions for Natural Language Processing (NLP). The dominant characteristic of these new questions is that they involve making sense of lots of data in different forms and extracting useful insights.

NLP is hot and getting hotter

NLP is a highly interdisciplinary field of study comprising concepts and ideas from Mathematics, Computer Science and Linguistics. Naturally occurring instances of human language, be it text or speech, are growing at an exponential rate given the popularity of the Web and social media. In addition, people are becoming more and more reliant on internet services to search, filter, process and, in some cases, even understand the subset of such instances they encounter in their daily lives. Whether you think about it or not, the services that let you do so much with language every day are generally trying to solve well-defined NLP problems that are still under active research. To put this into context, let us show you some examples. Let’s say that a blogger is trying to gather the latest information on the earthquake in Chile. Her workflow might consist of the following sequence of web-based tasks. With each task, we include the name of the specific NLP problem being solved by the service performing the task:

• “Show me the 10 most relevant documents on the web about the earthquake in Chile” (Information Retrieval)
• “Show me a useful summary of these 200 news articles about the earthquake in Chile” (Automatic Document Summarization)
• “Translate this Spanish blog into English so I can get the latest information about the earthquake in Chile” (Machine Translation)

We believe that NLP as an area of scholarly exploration has never been more relevant than it is today. One of the most successful trends in NLP has been that of using methods that are driven by naturally occurring language data instead of using purely knowledge-based or rule-based methods that are generally not as robust and are expensive to build. As this trend has continued, data-driven NLP techniques have become more and more sophisticated by borrowing heavily from the fields of Statistics and Machine Learning. These more sophisticated techniques now require large amounts of data in order to build a reasonably good model of what human language looks like.

NLP needs lots of data to shine. Therefore, we use Hadoop.

In order to do effective NLP research, we use very large bodies of text (or corpora) that are now becoming available to us. Examples include:

It should be obvious by now why this post belongs on the Cloudera blog. In order to process language data sets of such sizes efficiently, we have turned to Hadoop and Cloudera.

The Python and The Elephant

However, one issue that we have encountered is that Hadoop is written entirely in Java and we regularly use Python in our research, particularly the excellent and open-source Natural Language Toolkit (NLTK). NLTK is a great tool in that it tries to espouse the same “batteries included” philosophy that makes Python a useful programming language. NLTK ships with real-world data in the form of more than 50 raw as well as annotated corpora. It also includes useful language processing tools like tokenizers, part-of-speech taggers and parsers, as well as interfaces to machine learning libraries. NLTK is impressive enough that it now commands its own animal in the O’Reilly collection. In short, we would like to be able to continue using Python and NLTK for our research but also leverage the fantastic distributed processing capabilities of Hadoop. What’s a Pythonista to do?

The Hadoop Streaming interface is the solution to our problem. From the webpage: “Hadoop streaming is a utility that comes with the Hadoop distribution and allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer”. However, wouldn’t it be nice if someone did the hard work of wrapping the streaming interface into a nice Pythonic API? Indeed, it would be and it is.
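To make the streaming contract concrete, here is a minimal word-count sketch in plain Python. It assumes the usual streaming convention of tab-separated key/value lines on standard input and output, with the reducer receiving its input sorted by key. The function names and the `map`/`reduce` command-line argument are our own illustrative choices, not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input is sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: the same script acts as mapper or reducer
# depending on its single argument ("map" or "reduce").
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

A script like this would be handed to the streaming jar through its `-mapper` and `-reducer` options; the location of the jar depends on the installation. All the plumbing (argument handling, parsing and formatting tab-separated lines) is exactly the kind of boilerplate a Pythonic wrapper can hide.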

Dumbo is a fledgling, yet already very capable, open source project that strives to do exactly this. In fact, Dumbo is so nice that we recently used it to do the NLP task of automatic word association with a very large corpus by using Hadoop on Amazon EC2. Word association is a very common task in psycholinguistics where the subject is asked to say the word X that immediately comes to mind when hearing word Y. The results of having a computer do it were both entertaining and informative, and we presented them to the Python community at PyCon 2010 in Atlanta. The PyCon folks recorded the talk and have made it available here.
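To give a flavor of what such a job can look like, here is a small sketch of word association by sentence-level co-occurrence counting. It is an illustration only and not the code from our talk: the co-occurrence heuristic, the `run_locally` helper (which stands in for Hadoop's shuffle phase) and the `associate` function are all assumptions made for this example. The `mapper` and `reducer` are written as generators over `(key, value)` pairs, which is the shape Dumbo expects; under Dumbo they would be passed to `dumbo.run(mapper, reducer)` instead of the local helper.

```python
from collections import defaultdict
from itertools import combinations


def mapper(key, value):
    """key is ignored; value is one sentence. Emit each co-occurring pair."""
    words = sorted(set(value.lower().split()))
    for a, b in combinations(words, 2):
        # Emit both directions so either word can serve as the cue.
        yield (a, b), 1
        yield (b, a), 1


def reducer(key, values):
    """Sum the co-occurrence counts for one (cue, candidate) pair."""
    yield key, sum(values)


def run_locally(sentences):
    """Hypothetical stand-in for the shuffle: group mapper output by key."""
    grouped = defaultdict(list)
    for i, sentence in enumerate(sentences):
        for k, v in mapper(i, sentence):
            grouped[k].append(v)
    counts = {}
    for k in sorted(grouped):
        for pair, total in reducer(k, grouped[k]):
            counts[pair] = total
    return counts


def associate(counts, cue):
    """Return the word that co-occurs most often with cue, or None."""
    candidates = [(-n, w) for (c, w), n in counts.items() if c == cue]
    return min(candidates)[1] if candidates else None
```

For example, `associate(run_locally(["bread butter", "bread butter knife", "bread jam"]), "bread")` picks `butter`, because that pair co-occurs in two sentences. Raw counts favor frequent words, so a real system would likely normalize them in some way.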

The tools and resources that Cloudera provides were a really big help to us in doing this task. We used the Cloudera 0.20.1+152 Hadoop distribution and the stock EC2 shell scripts that come bundled with that distribution. Note that Cloudera has now made available better and more robust Python versions of the EC2 scripts that are even easier to use. The folks at Cloudera were also very helpful and gracious with their time when we had a question or two.

Bottom line: If you are interested in NLP and, like us, think that Python and NLTK are useful tools, the world is a better place these days thanks to Dumbo.

Some additional resources

As long as we are talking about Hadoop and NLP together, it is worth mentioning that over the last three years or so, one of us has worked really hard, with help from other smart people, to come up with Hadoopified versions of the most commonly used algorithms for many NLP problems and collected them into a book. It is an excellent resource for anyone who is interested in doing large-scale text processing with MapReduce.

Great post Ed, thanks. This is the second time this week that I have heard about using NLTK on Hadoop.
Just wanted to mention Behemoth (http://code.google.com/p/behemoth-pebble/), which is closely related to the subject as it allows deploying Apache UIMA and GATE-based NLP applications over Hadoop. Still early days, but there is already some code working. Comments / thoughts are welcome.

I was thinking about using Streaming or Pipes for embedding non-Java resources in Behemoth; NLTK could be a good example.

GATE is a very good tool for bringing out the different annotations. However, is there a way to use GATE to list the related actors and to distinguish the primary actor from a subject being discussed?

So here is what I want:

Actors (primary vs. secondary) and the relationship between them.
What is being talked about.
When (GATE does it quite nicely)
Where (GATE can provide)

Mostly a summarization. How do we use GATE to achieve it? Do we need to write custom JAPE rules? If yes, is there a sample of this? Is there a Java-based algorithm for the IR/IE referred to above?