NLP Processing in R

Wouter van Atteveldt

February 13, 2018

For text analysis it is often useful to part-of-speech (POS) tag and lemmatize your text, especially with non-English data. R has no built-in functions for this, but there are packages that connect to external tools to do it. This handout reviews two common tools (spacy and coreNLP) and two tools developed by us: frogr, which is useful for Dutch, and nlpipe, which supports distributed processing.

Spacyr

Spacy is a Python package with processing models for 6 different languages, which makes it attractive if you need, for example, French or German lemmatization.

To use it, you first need to install the spacy module in Python and download the appropriate language model. See https://spacy.io/usage/

After that, you can install the spacyr package in R and use it to tag, lemmatize, and/or parse text.
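As a minimal sketch of that workflow, the helper below wraps the three spacyr calls involved: starting the Python backend, parsing the text, and shutting the backend down again. The function name `tag_with_spacy` is made up for this illustration, and the code assumes that spacyr, the Python spacy module, and a language model are already installed.

```r
# Minimal sketch: tag, lemmatize, and dependency-parse text with spacyr.
# `tag_with_spacy` is a hypothetical helper name, not part of spacyr.
# Assumes install.packages("spacyr") has been run and that the Python
# spacy module plus a language model are installed.
tag_with_spacy <- function(texts) {
  # Start the Python backend; without arguments spacyr uses its default model
  spacyr::spacy_initialize()
  # Shut the backend down when the function exits, even after an error
  on.exit(spacyr::spacy_finalize(), add = TRUE)
  # Returns a data frame with one row per token, including lemma,
  # POS tag, and dependency relation columns
  spacyr::spacy_parse(texts, lemma = TRUE, pos = TRUE, dependency = TRUE)
}

# Example call (needs a working spacy installation):
# tokens <- tag_with_spacy("The quick brown foxes were jumping over the lazy dogs.")
# head(tokens)
```

To process another language, a different model can probably be selected through the `model` argument of `spacy_initialize()`; which model names are available depends on what you installed in Python.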

Frog

Unfortunately, while spacy has Dutch models, the lemmatizer doesn’t seem to work. Tilburg University developed the Frog program, which performs lemmatization quite fast. You need to install and run it via docker: install docker first, then run Frog in a container.
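The exact docker command is not reproduced here, so the following is only a sketch of what starting Frog from R could look like. The image name `proycon/lamachine`, the port 9887, and both helper function names are assumptions made for this illustration; check the Frog documentation for the image and options that apply to your setup. The one flag that comes from Frog itself is `-S`, which runs it as a server on the given port.

```r
# Sketch under assumptions: start a Frog server in a docker container from R.
# The image name and port below are ASSUMPTIONS, not taken from this handout;
# both helper names are hypothetical.

# Build the argument vector for `docker run`
frog_docker_args <- function(port = 9887, image = "proycon/lamachine") {
  c("run", "-d",                           # run detached, in the background
    "-p", sprintf("%d:%d", port, port),    # expose the Frog port on the host
    image,
    "frog", "-S", as.character(port))      # start Frog in server mode
}

# Call docker with those arguments (requires docker to be installed)
start_frog_server <- function(port = 9887, image = "proycon/lamachine") {
  system2("docker", frog_docker_args(port, image))
}

# Example call:
# start_frog_server()
```

Keeping the argument vector in its own function makes it easy to inspect the command before running it, for example with `paste(c("docker", frog_docker_args()), collapse = " ")`.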

NLPipe

NLPipe is a platform developed at the VU that lets you use a separate server (or multiple servers) for the processing, which is useful if you have many documents to process. Also, since it runs outside of R, it saves you the hassle of tying R to Java or Python, as coreNLP and spacy require.