One of the technologies to solve this problem is OpenNLP. We ran a demo of OpenNLP during our Activate presentation, and you can find the commands under our Github account. In this blog post, we’ll run through that same demo and give some more details on the thinking behind it.

Why OpenNLP

OpenNLP is, to quote the website, a machine learning based toolkit for the processing of natural language text. It provides lots of functionality, like tokenization, lemmatization, and part-of-speech (PoS) tagging. Of this functionality, named entity recognition (NER) is the part that can help us with query understanding.

Setup and basic usage

Once you download and extract OpenNLP, you can go ahead and use the command line tool (bin/opennlp) to test and build models. You won’t use this tool in production though, for two reasons:

if you’re running a Java application (which includes Solr/Elasticsearch), you will likely prefer the Name Finder Java API. It has more options than the command line tool.

running bin/opennlp loads the model every time, which adds latency. If you expose NER functionality through a REST API, you only need to load the model on startup. This is what the current Solr/Elasticsearch implementations do.

We’ll still use the command-line tool here, because it makes it easier to explore OpenNLP’s functionality. You can build models with bin/opennlp and use them with the Java API as well.

To get started, we’ll pass a string to bin/opennlp’s standard input. We’ll then provide the class name (TokenNameFinder for NER) and the model file as parameters:
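As a sketch, assuming you've downloaded the pre-built English person-name model (en-ner-person.bin, from the OpenNLP models page) into the directory where you extracted OpenNLP, the command looks like this:

```shell
# Run NER over a query read from standard input;
# en-ner-person.bin is the pre-built English person-name model.
echo "introduction to Solr by John Smith" \
  | bin/opennlp TokenNameFinder en-ner-person.bin
```

If the model picks up the name, the output echoes the input with the entity wrapped in tags, along the lines of introduction to Solr by &lt;START:person&gt; John Smith &lt;END&gt;.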

For anything more sophisticated, you’ll likely need your own model. For example, say we want “youtube” to come back labeled as a URL. We can try the pre-built Organization model, but it won’t get us anything:
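For instance (a sketch, assuming the pre-built English organization model en-ner-organization.bin sits next to the OpenNLP distribution):

```shell
# Try the pre-built organization model on a query mentioning youtube.
echo "how to upload videos to youtube" \
  | bin/opennlp TokenNameFinder en-ner-organization.bin
```

The output is just the input echoed back, with no START/END tags around “youtube”.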

To build our own model, we first need labeled training data that follows a few rules:

entities need to be surrounded by tags. Here, we want to identify youtube as a url

add spaces between tags (START/END) and labeled data

if possible, use one label per model (here, url). Multiple labels are possible, but not recommended

have lots of data. The documentation recommends a minimum of 15,000 sentences

each line is a “sentence”. Some features (we’ll touch on them below) look at the position of the entity in the sentence. Does it tend to be at the beginning or the end? If you do entity extraction on queries (like we do here), the query is usually one sentence. For index-time entity extraction, you could have multiple sentences in a document

empty lines delimit documents. This is more relevant for index-time entity extraction, where there’s a difference between documents and sentences. Document boundaries are relevant for document-level feature generators (like DocumentBegin) and those influenced by previous outcomes within the document (usually, feature generators extending AdaptiveFeatureGenerator)
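Putting those rules together, a toy-sized, hypothetical training file for a url label could look like the one below. A real dataset would need the 15,000-odd sentences mentioned above; the file name url-train.txt is just an assumption carried through the later commands.

```shell
# Write a tiny sample of labeled training data (hypothetical queries).
# Each line is one "sentence"; entities are wrapped in <START:url> ... <END>,
# with spaces separating the tags from the labeled tokens.
cat > url-train.txt <<'EOF'
how to upload videos to <START:url> youtube <END>
watch free movies on <START:url> youtube <END>
<START:url> vimeo <END> versus <START:url> youtube <END>
EOF
```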

Feature generation

The training tool runs through the data set, extracts some features and feeds them to the machine learning algorithm. A feature could be whether a token is a number or a string. Or whether the previous tokens were numbers or strings. In OpenNLP, such features are generated by feature generators. You can find all options here. That said, you can always implement your own feature generators.

Once you’ve identified the feature generators to use and their parameters, put them in an XML file. Check out our GitHub account for a feature generation example close to the default one.
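For reference, the default-like descriptor from the OpenNLP manual looks roughly like this; treat it as a sketch to adapt by swapping generators in and out, not as a drop-in file:

```xml
<generators>
  <cache>
    <generators>
      <!-- token class (capitalization, digits, ...) in a +/-2 token window -->
      <window prevLength="2" nextLength="2">
        <tokenclass/>
      </window>
      <!-- the tokens themselves in a +/-2 token window -->
      <window prevLength="2" nextLength="2">
        <token/>
      </window>
      <definition/>
      <prevmap/>
      <bigram/>
      <!-- sentence position: flag tokens at the beginning of the sentence -->
      <sentence begin="true" end="false"/>
    </generators>
  </cache>
</generators>
```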

Algorithm selection and tuning

OpenNLP comes out of the box with classifiers based on maximum entropy (the default), the perceptron, and naive Bayes. To choose between them, you’d provide a parameters file. There are examples for all supported algorithms here.

In the parameters file, there are at least three important aspects to look at:

algorithm choice. Naive Bayes will train the fastest, but it assumes the provided features are independent, which might or might not be the case. The maximum entropy and perceptron-based classifiers are more expensive to run, but tend to give better results, especially when features depend on each other

number of iterations. The more times you go through the training data, the more influence the provided features will have on the output. This is a trade-off between how much is learned on one hand and overfitting on the other. And of course, training takes longer with more iterations.

cutoff. Features that are encountered less than N times are ignored, to reduce noise.
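A minimal params file, sketched after the examples shipped with OpenNLP, could look like this (the values are illustrative, not tuned):

```
Algorithm=MAXENT
Iterations=100
Cutoff=5
```

PERCEPTRON and NAIVEBAYES are the other accepted Algorithm values; Iterations and Cutoff map to the trade-offs described above.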

Training and testing the model

Now we can put everything together and build our model. We’ll use the TokenNameFinderTrainer class this time:
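With hypothetical file names for the labeled data, the params file, and the feature generation descriptor, the training command is along these lines:

```shell
# Train a name finder model for the custom "url" entity type.
# url-train.txt, params.txt and featuregen.xml are assumed to exist
# (labeled data, algorithm parameters, feature generator descriptor).
bin/opennlp TokenNameFinderTrainer -model url.bin -lang en \
  -params params.txt -featuregen featuregen.xml \
  -data url-train.txt -encoding UTF-8
```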

To properly test the model, we can use the Evaluation Tool on another labeled dataset (written in the same format as the training dataset). We’ll use the TokenNameFinderEvaluator class, with parameters similar to the TokenNameFinderTrainer command (provide the model, dataset and encoding):
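Assuming a held-out labeled file url-test.txt, the evaluation command mirrors the training one:

```shell
# Score the trained model against a separate labeled test set.
bin/opennlp TokenNameFinderEvaluator -model url.bin \
  -data url-test.txt -encoding UTF-8
```

The tool reports precision, recall, and F-measure for the extracted entities.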

Conclusion

OpenNLP is a versatile tool for entity extraction. Default options and built-in feature generators work well for natural language, like picking up entities from books or articles at index time. That’s why current OpenNLP integrations for Solr and Elasticsearch sit on the indexing side rather than the query side. For query understanding, it usually takes more work to build a model that can accurately extract entities from such a small context. But it can definitely be done, given enough data and the right features and algorithm for the use case.

If you find this stuff exciting, please join us: we’re hiring worldwide. If you need entity extraction, relevancy tuning, or any other help with your search infrastructure, please reach out.