Pre-processing text: R/tm vs. python/NLTK

Let’s say that you want to take a set of documents and apply a computational linguistic technique. If your method is based on the bag-of-words model, you probably need to pre-process these documents first by segmenting, tokenizing, stripping, stopwording, and stemming each one (phew, that’s a lot of -ing’s).
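To make the steps concrete, here's a minimal pure-Python sketch of that pipeline. The stopword set and suffix-stripping "stemmer" are toy stand-ins I made up for illustration; a real pipeline would use NLTK's full stopword lists and a Porter/Snowball stemmer.

```python
import re

# Toy stopword list and naive suffix-stripping "stemmer", for illustration only;
# NLTK and tm ship full stopword lists and proper Porter/Snowball stemmers.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "were"}

def naive_stem(token):
    # Crude suffix stripping; a real stemmer handles far more cases.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(document):
    # Strip case and punctuation, tokenize, drop stopwords, stem.
    tokens = re.findall(r"[a-z']+", document.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The protesters were marching in the streets"))
# → ['protester', 'march', 'street']
```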

In the past, I’ve relied on NLTK to perform these tasks. Python is my strongest language and NLTK is mature, fast, and well-documented. Lately, however, I’ve been trying to perform tasks entirely within R, and so I’ve been giving the tm package a chance. So far, I’ve been disappointed with its speed (at least in a relative sense).

Here’s a simple example that hopefully some better R programmer out there can help me with. I have been tracking tweets on the #25bahman hashtag (97k of them so far). The total size of the dataset is only about 18 MB, so even fairly inefficient code should have no problem with it. I want to build a corpus of documents where each tweet is a document and each document is a bag of stemmed, stripped tokens. You can see the code in my repository here, and I’ve embedded two of the three examples below.

Big N.B.: The R code still hadn’t finished after 15 minutes. My laptop has an i7 640M and 8 GB of RAM, so the machine isn’t the issue. The timing below therefore excludes the last line of the embedded R example.

Here are the timing results, with three runs per example. As you can see, R/tm takes more than twice as long to build the unprocessed corpus as Python/NLTK takes to build the fully processed one. The comparison only gets worse when the Python code is parallelized.

Python, unparallelized: ~1:05

Python, parallelized (8 runners): ~0:45

R, unparallelized: ~2:15
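The parallel Python timing comes from fanning the per-tweet work out across worker processes. A minimal sketch with `multiprocessing.Pool` (the `tokenize` function here is a stand-in for the full tokenize/stopword/stem pipeline, and the tweet list is made up):

```python
from multiprocessing import Pool
import re

def tokenize(text):
    # Stand-in for the full preprocessing pipeline; must be a top-level
    # function so worker processes can pickle it.
    return re.findall(r"[a-z#']+", text.lower())

if __name__ == "__main__":
    # Hypothetical sample data; the real run maps over ~97k tweets.
    tweets = ["Crowds gathering downtown", "Reports of arrests near the square"] * 4
    # 8 worker processes, mirroring the "8 runners" timing above.
    with Pool(processes=8) as pool:
        corpus = pool.map(tokenize, tweets)
```

Because each tweet is preprocessed independently, the work is embarrassingly parallel and `pool.map` splits it across cores with no coordination needed.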

So, dear Internet friends, what can I do? Am I using tm properly? Is this just a testament to the quality of NLTK? Am I cursed to forever write pre-processing code in one language and perform analysis in another?