In my previous post, I showed how to parse sentences using OpenNLP. Another useful feature supported by OpenNLP is “chunking”. That is the subject of today’s article.

Chunking sits between part-of-speech (POS) tagging and full parsing in terms of the information it captures. POS tagging assigns a part of speech to each individual token in a sentence. So, in the sentence “Peter likes sweets”, the POS tags (assuming the Penn Treebank tag set) would be NNP for “Peter”, VBZ for “likes”, and NNS for “sweets”.

The constituency parser operates at the other extreme. It tries to assign a structure to the complete sentence by recursively assigning a structure to each of its constituent parts. We saw this in the last article.

A full parse is significantly more expensive than POS tagging, because it has to build a nested structure for the whole sentence rather than label each token. Sometimes we are interested only in the smaller structures contained in the larger parse tree, for example, Verb Phrase, Adjective Phrase, Noun Phrase, and so on. The classic example is NER (Named Entity Recognition), where we are interested in specific Noun Phrases. Identifying such phrases, which usually (but not always) span more than one token, is called “chunking”.

OK. Let us see how to use the chunker in OpenNLP. I have written a simple class called “OpenNLPChunkerExample” to illustrate the essential features (you can download the source from here).

The code fragment below gets the chunked tags and prints them along with the corresponding word.

Printing Chunked Tags

The output from the program is:

Output

The tagging produced by the chunker follows the “IOB” tagging scheme. Here,

B = Beginning of chunk

I = In a chunk

O = Outside any chunk

From the above scheme, we can easily see that the words “The pretty cat” form a single NP chunk, the word “chased” forms a VP chunk all by itself, and the words “the ugly rat” constitute an NP chunk again. The final “.” is not part of any chunk.
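To make the scheme concrete, here is a minimal sketch in plain Java (no OpenNLP dependency) that splits each tag into its position marker and chunk type. It assumes the tags have the form “B-NP”, “I-NP”, or “O”, and the tag sequence is hard-coded from the description above rather than produced by the chunker. The class name IobTagReader and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class IobTagReader {

    /** Returns the position marker of a tag: 'B', 'I', or 'O'. */
    static char marker(String tag) {
        return tag.charAt(0);
    }

    /** Returns the chunk type of a tag ("NP" for "B-NP"), or "" when there is none. */
    static String chunkType(String tag) {
        int dash = tag.indexOf('-');
        return dash < 0 ? "" : tag.substring(dash + 1);
    }

    /** Builds one tab-separated line per token: word, tag, and the tag's meaning. */
    static List<String> describe(String[] tokens, String[] tags) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String role;
            switch (marker(tags[i])) {
                case 'B': role = "begins " + chunkType(tags[i]); break;
                case 'I': role = "continues " + chunkType(tags[i]); break;
                default:  role = "outside any chunk"; // 'O' (or anything unexpected)
            }
            lines.add(tokens[i] + "\t" + tags[i] + "\t" + role);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed input: tokens and tags matching the example discussed above.
        String[] tokens = {"The", "pretty", "cat", "chased", "the", "ugly", "rat", "."};
        String[] tags = {"B-NP", "I-NP", "I-NP", "B-VP", "B-NP", "I-NP", "I-NP", "O"};
        describe(tokens, tags).forEach(System.out::println);
    }
}
```

Every “B” line marks the point where a new chunk starts, which is all the information needed to recover the chunk boundaries.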

To improve readability, we can write a convenience function that groups the words belonging to each chunk. Here is the code:

Function to Group Words in Chunk

The function returns a Span[]. The updated “main” that uses this function and prints the chunks is:
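The grouping idea can be sketched without OpenNLP as follows. This is not the original function: it returns a list of a hypothetical ChunkSpan class (a stand-in for OpenNLP's Span) holding a half-open token range and a chunk type, and it assumes tags of the form “B-NP”, “I-NP”, or “O”. As a lenient design choice, a stray “I-” tag that does not continue the current chunk is treated as the start of a new one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkGrouper {

    /** Hypothetical span type: tokens [start, end) that form one chunk of the given type. */
    static final class ChunkSpan {
        final int start;
        final int end;
        final String type;

        ChunkSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        /** Joins the covered tokens with single spaces. */
        String text(String[] tokens) {
            return String.join(" ", Arrays.copyOfRange(tokens, start, end));
        }
    }

    /** Groups a sequence of IOB tags into chunk spans; "O" tokens are skipped. */
    static List<ChunkSpan> group(String[] tags) {
        List<ChunkSpan> spans = new ArrayList<>();
        int start = -1;      // index where the open chunk began, or -1 if none is open
        String type = null;  // type of the open chunk
        for (int i = 0; i < tags.length; i++) {
            String tag = tags[i];
            boolean continues = start >= 0
                    && tag.startsWith("I-")
                    && tag.substring(2).equals(type);
            if (continues) {
                continue; // still inside the open chunk
            }
            if (start >= 0) {
                spans.add(new ChunkSpan(start, i, type)); // close the open chunk
                start = -1;
                type = null;
            }
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                start = i; // open a new chunk at this token
                type = tag.substring(2);
            }
        }
        if (start >= 0) {
            spans.add(new ChunkSpan(start, tags.length, type)); // chunk runs to the end
        }
        return spans;
    }

    public static void main(String[] args) {
        // Assumed input: the example sentence and tags discussed earlier.
        String[] tokens = {"The", "pretty", "cat", "chased", "the", "ugly", "rat", "."};
        String[] tags = {"B-NP", "I-NP", "I-NP", "B-VP", "B-NP", "I-NP", "I-NP", "O"};
        for (ChunkSpan span : group(tags)) {
            System.out.println(span.type + ": " + span.text(tokens));
        }
    }
}
```

For the example sentence this yields three spans: an NP covering tokens 0 to 2, a VP covering token 3, and an NP covering tokens 4 to 6. The final “.” belongs to no span.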

Printing Grouped Chunks

The corresponding output is:

Grouped Chunks

We can even get the probability associated with each chunked tag. Here is the final version that prints this information:

Printing Chunked Tags with Probability

Here is the corresponding output:

Tags with Probability

Before concluding, let us print the chunks for another sentence: “It is very beautiful.”

Another Example

You can see that we now have an Adjective Phrase (ADJP): “very beautiful”.

Python’s NLTK, another popular NLP toolkit, also supports chunking. What I like about NLTK is that it allows us to define a “chunking grammar” to customize our chunking logic. This can prove useful in some cases. Take a look at NLTK when you get time.