Extracting noun phrases with contextual relevance in .NET using OpenNLP

A few months ago I was working on a project that had a word cloud-like feature. A word cloud is an interesting way to visually represent a popular theme or topic. I had a dataset of user reviews from another project that we wanted to parse and use. This was my first exposure to Natural Language Processing (NLP) and other advanced text analytics tools.

Notes from an NLP spike

I started by extracting nouns from our dataset and calculating their frequency. This produced a list of the most frequently used terms, but unlike a tag cloud for a blog, the results were not always relevant. I needed a way to capture the theme of a sentence, count the reviews that share that theme, and then present the top themes to the user. Think of a Yelp user review and its core positive or negative theme, then count how many other reviews share it, e.g. reviews for a pub that often mention its "great patio". It wasn't long before I started reading about NLP tools and, in particular, Part of Speech (PoS) analyzers. I started learning the vocabulary of the NLP world, such as N-grams, sentence chunking, and most importantly, noun phrases. The noun phrases I was interested in contain two or more words (including a noun), which provide some contextual relevance to the theme of the sentence.

Below is a more formal definition of a noun phrase with an example.

A word group with a noun or pronoun as its head. The noun head can be accompanied by modifiers, determiners (such as the, a, her), and/or complements.

A noun phrase (often abbreviated as NP) most commonly functions as a subject, object, or complement.

"The wells and water table had been polluted by chemical pesticides and fertilizers that leached into the earth and were washed by rain into the creeks, where the stunned fish were scavenged by the ospreys."
(Peter Matthiessen, Men's Lives, 1986)

Identifying noun phrases is not a trivial task. I started reading up on big open source projects in the NLP game like OpenNLP (Java), NLTK (Python), and LingPipe (Java). I also found a great number of smaller analytics tools and parsers, but none seemed advanced enough to really capture the essence of a noun phrase or the theme of a sentence. It was then that a colleague pointed me in the direction of the SQL Server Integration Services (SSIS) text analytics transformations, most notably the Term Extraction and Term Lookup transformations. A PoC quickly demonstrated that these transformations were an efficient and scalable way to extract noun phrases. The package was very simple to configure and get up and running (if you don't mind using BIDS, *shudder*), and I was able to extract meaningful noun phrases with a high degree of accuracy. However, it had a number of limitations:

It's an SSIS package, great for parsing text after the fact, but not in real time.

It requires a SQL Server Enterprise license ($$).

It only supports English with no plans of supporting other languages.

Ideally, I wanted to replace the SSIS package with an in-process solution, but unfortunately there are limited text analytics tools available to the .NET community. There are a few options. SharpNLP is a C# port of OpenNLP. It had a brief flurry of activity in 2006, but not much since then. Here are some notes from someone who attempted to integrate with NLTK in a .NET implementation using IronPython: Open Source NLP in C# 3.5 using NLTK.

A viable .NET implementation

Eventually I came across a wiki article entitled "A quick guide to using OpenNLP from .NET" that introduced me to a remarkable project called IKVM.NET. After generating a shiny new .NET OpenNLP assembly with the steps provided, I was able to use the OpenNLP namespaces with ease in my project.

The first step in using the parsers in OpenNLP was to instantiate a model using Java streams. I created a base class for my NounPhraseParser with a utility method to help load these models.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace OpenNLP.NET.PoC
{
    public class AbstractNounPhraseAdapter
    {
        protected readonly string ModelsPath;

        /// <summary>
        /// A path to the directory where the OpenNLP models are located.
        /// </summary>
        protected AbstractNounPhraseAdapter(string modelsPath)
        {
            ModelsPath = modelsPath;
        }

        /// <summary>
        /// Return the OpenNLP analyzer given its model type (M), the type of the analyzer (T), the filename
        /// of the model (e.g. en-maxent.bin) and a path to where the models are located (ModelsPath).
        /// </summary>
        public T ResolveOpenNlpTool<M, T>(string modelPath)
            where M : class
            where T : class
        {
            var modelStream = new java.io.FileInputStream(Path.Combine(ModelsPath, modelPath));
            M model;
            try
            {
                // the model constructor reads the whole stream, so it can be closed afterwards
                model = (M)Activator.CreateInstance(typeof(M), modelStream);
            }
            finally
            {
                if (modelStream != null)
                {
                    modelStream.close();
                }
            }
            return (T)Activator.CreateInstance(typeof(T), model);
        }

        /// <summary>
        /// Functions to run after PoS parsing to determine if the noun phrase should be returned.
        /// </summary>
        public IEnumerable<Func<string, bool>> PostProcessingFilters { get; set; }

        protected bool ValidNounPhrase(string nounPhrase)
        {
            return PostProcessingFilters == null
                || PostProcessingFilters.Aggregate(true, (current, filter) => current && filter.Invoke(nounPhrase));
        }
    }
}
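The PostProcessingFilters property is simply a list of predicates, and ValidNounPhrase keeps a phrase only when every predicate accepts it. As a minimal standalone sketch of that idea (the class FilterChainSketch and the method PassesAll are hypothetical names, not part of the classes in this post), the same check can be written with LINQ's All, which stops at the first rejecting filter:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FilterChainSketch
{
    // Hypothetical helper: a phrase is kept only if every filter accepts it.
    // A null filter list accepts everything, mirroring ValidNounPhrase.
    public static bool PassesAll(IEnumerable<Func<string, bool>> filters, string phrase)
    {
        return filters == null || filters.All(filter => filter(phrase));
    }

    public static void Main()
    {
        var filters = new List<Func<string, bool>>
        {
            phrase => phrase.Split(' ').Length > 1,  // at least two words
            phrase => !phrase.Contains(",")          // no embedded comma
        };

        Console.WriteLine(PassesAll(filters, "great patio"));  // True
        Console.WriteLine(PassesAll(filters, "patio"));        // False
        Console.WriteLine(PassesAll(null, "patio"));           // True
    }
}
```

Keeping the filters as caller-supplied functions means the parser itself stays free of project-specific rules such as stop lists.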

using System;
using System.Collections.Generic;
using opennlp.tools.chunker;
using opennlp.tools.postag;
using opennlp.tools.sentdetect;
using opennlp.tools.tokenize;

namespace OpenNLP.NET.PoC
{
    /// <summary>
    /// Ported from Java implementation by Sujit Pal
    /// http://sujitpal.blogspot.ca/2011/08/uima-noun-phrase-pos-annotator-using.html
    /// </summary>
    public class PosNounPhraseParser : AbstractNounPhraseAdapter, INounPhraseParser
    {
        public PosNounPhraseParser(string modelsPath) : base(modelsPath) { }

        // the OpenNLP tools are cached in static fields and loaded lazily on first use
        private static SentenceDetector _sentenceDetector;
        private SentenceDetector GetSentenceDetector()
        {
            return _sentenceDetector ?? (_sentenceDetector = ResolveOpenNlpTool<SentenceModel, SentenceDetectorME>("en-sent.bin"));
        }

        private static POSTagger _posTagger;
        private POSTagger GetPosTagger()
        {
            return _posTagger ?? (_posTagger = ResolveOpenNlpTool<POSModel, POSTaggerME>("en-pos-maxent.bin"));
        }

        private static Tokenizer _tokenizer;
        private Tokenizer GetTokenizer()
        {
            return _tokenizer ?? (_tokenizer = ResolveOpenNlpTool<TokenizerModel, TokenizerME>("en-token.bin"));
        }

        private static Chunker _chunker;
        private Chunker GetChunker()
        {
            return _chunker ?? (_chunker = ResolveOpenNlpTool<ChunkerModel, ChunkerME>("en-chunker.bin"));
        }

        public void WarmUpModels()
        {
            GetSentenceDetector();
            GetPosTagger();
            GetTokenizer();
            GetChunker();
        }

        public IList<string> GetNounPhrases(string sourceText)
        {
            if (string.IsNullOrWhiteSpace(sourceText))
                throw new ArgumentNullException("sourceText");

            var nounPhrases = new List<string>();

            // return an array of start and end indexes that identify sentences
            var sentenceSpans = GetSentenceDetector().sentPosDetect(sourceText);
            foreach (var sentenceSpan in sentenceSpans)
            {
                // retrieve the actual sentence from the source text
                var sentence = sentenceSpan.getCoveredText(sourceText).toString();
                var start = sentenceSpan.getStart();

                // return an array of start and end indexes that identify the tokens
                // in the sentence, then tag each token with its part of speech
                var tokenSpans = GetTokenizer().tokenizePos(sentence);
                var tokens = new string[tokenSpans.Length];
                for (var i = 0; i < tokens.Length; i++)
                {
                    tokens[i] = tokenSpans[i].getCoveredText(sentence).toString();
                }
                var tags = GetPosTagger().tag(tokens);

                // return an array of chunks that contain the chunk type (i.e. noun phrase,
                // verb phrase, etc) and the start/end token indexes of the chunk
                var chunks = GetChunker().chunkAsSpans(tokens, tags);
                foreach (var chunk in chunks)
                {
                    // filter out everything but noun phrases
                    if (chunk.getType() != "NP")
                        continue;

                    // convert token indexes into character offsets in the source text
                    var chunkStart = start + tokenSpans[chunk.getStart()].getStart();
                    var chunkEnd = start + tokenSpans[chunk.getEnd() - 1].getEnd();

                    // extract the noun phrase
                    var nounPhrase = sourceText.Substring(chunkStart, chunkEnd - chunkStart);

                    // run post processing functions to determine if this noun phrase
                    // is suitable for our purposes (defined by caller)
                    if (!ValidNounPhrase(nounPhrase))
                        continue;

                    nounPhrases.Add(nounPhrase);
                }
            }
            return nounPhrases;
        }
    }
}
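The least obvious step in GetNounPhrases is the offset arithmetic: chunk spans are expressed in token indexes, token spans in character offsets relative to the sentence, and the sentence span in character offsets relative to the source text. The following standalone sketch walks through that conversion with hand-written offsets instead of real OpenNLP output. The Range struct and CoveredText method are hypothetical stand-ins, and the offsets are assumed rather than produced by a tokenizer.

```csharp
using System;

public static class ChunkOffsetSketch
{
    // Hypothetical stand-in for a span: [Start, End) with an exclusive end.
    public struct Range
    {
        public readonly int Start;
        public readonly int End;
        public Range(int start, int end) { Start = start; End = end; }
    }

    // sentenceStart: character offset of the sentence within sourceText.
    // tokenSpans:    character offsets of each token, relative to the sentence.
    // chunk:         token indexes covered by the chunk (End is exclusive).
    public static string CoveredText(string sourceText, int sentenceStart, Range[] tokenSpans, Range chunk)
    {
        var chunkStart = sentenceStart + tokenSpans[chunk.Start].Start;
        var chunkEnd = sentenceStart + tokenSpans[chunk.End - 1].End;
        return sourceText.Substring(chunkStart, chunkEnd - chunkStart);
    }

    public static void Main()
    {
        var sourceText = "Nice pub. It has a great patio.";
        var sentenceStart = 10; // "It has a great patio." begins at index 10

        // Assumed token offsets for: It | has | a | great | patio | .
        var tokenSpans = new[]
        {
            new Range(0, 2), new Range(3, 6), new Range(7, 8),
            new Range(9, 14), new Range(15, 20), new Range(20, 21)
        };

        // Assumed noun phrase chunk covering tokens 2..4 ("a great patio")
        var chunk = new Range(2, 5);

        Console.WriteLine(CoveredText(sourceText, sentenceStart, tokenSpans, chunk)); // a great patio
    }
}
```

Slicing the original text, rather than joining the tokens with spaces, preserves the source's own spacing and punctuation inside the phrase.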

And finally, a test that demonstrates the setup of my PosNounPhraseParser over the example sentence mentioned earlier in the definition of a noun phrase.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using NUnit.Framework;

namespace OpenNLP.NET.PoC
{
    [TestFixture]
    public class PosNounPhraseParserTests
    {
        [Test]
        public void PosNounPhraseParser_GetNounPhrases_Extract_Noun_Phrases_From_Sentence()
        {
            string _modelPath = @"C:\Development\NLPForDotNET\lib\opennlp-models-1.5\";

            // arrange
            var nounPhraseAdapter = new PosNounPhraseParser(_modelPath)
            {
                PostProcessingFilters = new List<Func<string, bool>>
                {
                    // two or more words
                    (nounPhrase => nounPhrase.Split(" ".ToCharArray()).Count() > 1),
                    // character stop list
                    (nounPhrase => !(nounPhrase.Contains(".") || nounPhrase.Contains("\"") || nounPhrase.Contains(",")
                        || nounPhrase.Contains("”") || nounPhrase.Contains("“") || nounPhrase.Contains(";")))
                }
            };

            // load the models up front so they are not included in the timing
            nounPhraseAdapter.WarmUpModels();
            var stopwatch = new Stopwatch();
            stopwatch.Start();

            // act
            var actualNounPhrases = nounPhraseAdapter.GetNounPhrases("The wells and water table had been polluted by chemical pesticides and fertilizers that leached into the earth and were washed by rain into the creeks, where the stunned fish were scavenged by the ospreys.").ToArray();
            stopwatch.Stop();
            Debug.WriteLine("Total time: {0}", stopwatch.Elapsed);

            // assert
            Assert.Contains("The wells and water table", actualNounPhrases);
            Assert.Contains("chemical pesticides and fertilizers", actualNounPhrases);
            Assert.Contains("the earth", actualNounPhrases);
            Assert.Contains("the creeks", actualNounPhrases);
            Assert.Contains("the stunned fish", actualNounPhrases);
            Assert.Contains("the ospreys", actualNounPhrases);
        }
    }
}
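With noun phrases extracted per review, the original goal of surfacing the top themes across all reviews comes down to a counting exercise. The sketch below is one possible approach, not the code used in the project: ThemeCounterSketch and TopThemes are hypothetical names, and treating phrases as equal after trimming and lowercasing is an assumption that a real implementation may want to refine (stemming, dropping determiners, and so on).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ThemeCounterSketch
{
    // Count how many reviews mention each noun phrase (case-insensitive),
    // counting a phrase at most once per review, and return the top N.
    public static IList<KeyValuePair<string, int>> TopThemes(
        IEnumerable<IEnumerable<string>> phrasesPerReview, int topN)
    {
        return phrasesPerReview
            .SelectMany(phrases => phrases
                .Select(p => p.Trim().ToLowerInvariant())
                .Distinct())
            .GroupBy(phrase => phrase)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.Ordinal) // stable order for ties
            .Take(topN)
            .ToList();
    }

    public static void Main()
    {
        // Each inner array stands for the output of GetNounPhrases on one review.
        var phrasesPerReview = new List<IEnumerable<string>>
        {
            new[] { "great patio", "cold beer" },
            new[] { "Great patio", "slow service" },
            new[] { "great patio", "great patio", "cold beer" }
        };

        foreach (var theme in TopThemes(phrasesPerReview, 2))
        {
            Console.WriteLine("{0}: {1}", theme.Key, theme.Value);
        }
        // Prints:
        // great patio: 3
        // cold beer: 2
    }
}
```

Counting each phrase once per review keeps a single enthusiastic reviewer from dominating the word cloud.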

Conclusion

I think this project worked out remarkably well. I don't know if I'll attempt to use something like this in a production environment, but if nothing else it was a very enlightening foray into the interesting world of Natural Language Processing. There are many other subjects in this area that I would like to explore, such as Sentiment Analysis and ways to identify subjects of significance in large bodies of text. As the IBM Watson project demonstrated to us not too long ago, this is a young field with staggering potential. The current trajectory of research, along with significant advances in computational capability, suggests it won't be long before we can communicate with computers and information systems as easily as we talk to a best friend.

If you wish to use the solution I've demonstrated in this post, please make your own determination on whether it's acceptable for your project. I'm no expert in licensing, but I've cited all my sources where available so that readers can do their own due diligence.