Day 14: Stanford NER - How to Set Up Your Own Named Entity Recognition Server in the Cloud

I am not a huge fan of machine learning or natural language processing (NLP), but I always have ideas in mind which require them. The idea that I will explore in this post is building a real-time job search engine using Twitter data. Tweets will contain the name of the company which is offering a job, the location of the job, and the name of the contact person at the company. This requires us to parse the tweet for Person, Location, and Organisation. This type of problem falls under Named Entity Recognition.

According to Wikipedia,

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organisations, locations, expressions of times, quantities, monetary values, percentages, etc.

To make this clearer, let us take an example. Suppose we have the following tweet:

Microsoft SCCM Windows Server 2012 Web Development Expert (SME3) at PSI Pax (Baltimore, MD)

A human can easily figure out that an organisation named PSI Pax has an opening in Baltimore. But how can we do this programmatically? The easiest way is to maintain a list of all organisations and locations and search the text for them. However, this solution does not scale: the lists can never be complete, and new company names appear all the time.
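To see why the list-based approach falls short, here is a minimal sketch of it in plain Java. The class name GazetteerLookup and the two tiny lists are hypothetical and exist only for illustration; this is not code from the demo application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Naive entity lookup: report every known name that occurs in the text. */
class GazetteerLookup {

    // Hypothetical, hand-maintained lists; a real system would need far more entries.
    static final List<String> ORGANISATIONS = Arrays.asList("PSI Pax", "Red Hat");
    static final List<String> LOCATIONS = Arrays.asList("Baltimore", "London");

    /** Returns the known names that appear in the text, ignoring case. */
    static List<String> findKnown(String text, List<String> knownNames) {
        String haystack = text.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<String>();
        for (String name : knownNames) {
            if (haystack.contains(name.toLowerCase(Locale.ROOT))) {
                found.add(name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String tweet = "Microsoft SCCM Windows Server 2012 Web Development Expert (SME3) "
                + "at PSI Pax (Baltimore, MD)";
        System.out.println("Organisations: " + findKnown(tweet, ORGANISATIONS));
        System.out.println("Locations: " + findKnown(tweet, LOCATIONS));
    }
}
```

The lookup finds PSI Pax and Baltimore only because they happen to be in the lists. Microsoft is missed because nobody added it, and every new name needs a manual update, which is the scaling problem in a nutshell.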

Today, in this blog post, I will cover how we can set up our own NER server using the Stanford NER package.

What is Stanford NER?

Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in text which are the names of things, such as person and company names, or gene and protein names.
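To make "labels sequences of words" concrete: a tagger of this kind assigns a label to every token, and consecutive tokens with the same label form one entity. The sketch below assumes tagged text in the word/LABEL form, with O marking tokens outside any entity, which is the shape of Stanford NER's slash-tag output. The class name TaggedTextParser and the sample string are hypothetical and hand-written for illustration; the string is not real classifier output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Groups word/LABEL tokens into entities, keyed by label. */
class TaggedTextParser {

    // Label assumed to mark tokens that are not part of any entity.
    static final String OUTSIDE = "O";

    static Map<String, List<String>> groupEntities(String tagged) {
        Map<String, List<String>> entities = new LinkedHashMap<String, List<String>>();
        StringBuilder current = new StringBuilder();
        String currentLabel = OUTSIDE;
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // no word/LABEL structure, skip the token
            }
            String word = token.substring(0, slash);
            String label = token.substring(slash + 1);
            if (label.equals(currentLabel) && !OUTSIDE.equals(label)) {
                // Same label as the previous token: extend the current entity.
                current.append(' ').append(word);
            } else {
                // Label changed: store the finished entity and start over.
                flush(entities, currentLabel, current);
                currentLabel = label;
                if (!OUTSIDE.equals(label)) {
                    current.append(word);
                }
            }
        }
        flush(entities, currentLabel, current);
        return entities;
    }

    private static void flush(Map<String, List<String>> entities, String label,
                              StringBuilder current) {
        if (!OUTSIDE.equals(label) && current.length() > 0) {
            List<String> names = entities.get(label);
            if (names == null) {
                names = new ArrayList<String>();
                entities.put(label, names);
            }
            names.add(current.toString());
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        // Hand-written sample in word/LABEL form.
        String tagged = "Expert/O at/O PSI/ORGANIZATION Pax/ORGANIZATION in/O Baltimore/LOCATION";
        System.out.println(groupEntities(tagged));
    }
}
```

With the sample string, PSI and Pax are merged into the single organisation "PSI Pax" because they are adjacent and share a label, while Baltimore becomes a separate location.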

Prerequisite

Basic Java knowledge is required. Install the latest Java Development Kit (JDK) on your operating system. You can install either OpenJDK 7 or Oracle JDK 7. OpenShift supports OpenJDK 6 and 7.

Sign up for an OpenShift account. It is completely free, and Red Hat gives every user three free Gears on which to run applications. At the time of this writing, the combined resources allocated for each user are 1.5 GB of memory and 3 GB of disk space.

Install the rhc client tool on your machine. rhc is a Ruby gem, so you need to have Ruby 1.8.7 or above on your machine. To install rhc, just type:

sudo gem install rhc

If you already have rhc installed, make sure it is the latest version. To update rhc, run sudo gem update rhc.

Set up your OpenShift account using the rhc setup command. This command will help you create a namespace and upload your SSH keys to the OpenShift server.

Step 1 : Create a JBoss EAP application

We will start with creating the demo application. The name of the application is nerdemo.

$ rhc create-app nerdemo jbosseap

If you have access to medium gears, you can use the following command instead.

$ rhc create-app nerdemo jbosseap -g medium

This will create an application container for us, called a gear, and set up all of the required SELinux policies and cgroup configuration. OpenShift will also set up a private git repository for us and clone the repository to the local system. Finally, OpenShift will propagate the DNS to the outside world. The application will be accessible at http://nerdemo-{domain-name}.rhcloud.com/. Replace domain-name with your own unique OpenShift domain name (also sometimes called a namespace).

Step 3 : Enable CDI

We will be using CDI for dependency injection. CDI, or Contexts and Dependency Injection, is a Java EE 6 specification which defines a type-safe dependency injection mechanism for Java EE. Almost any POJO can be injected as a CDI bean.

Create a new xml file named beans.xml in the src/main/webapp/WEB-INF folder. Replace the content of beans.xml with the following:
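The original listing did not survive in this copy of the post. As a stand-in, a minimal sketch of an empty CDI 1.0 descriptor (the version that belongs to Java EE 6) is shown below. An empty beans.xml is enough to turn on CDI for the archive; if your application needs interceptors, decorators, or alternatives, they would be declared inside the beans element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Empty marker file: its presence in WEB-INF enables CDI for this web application. -->
<beans xmlns="http://java.sun.com/xml/ns/javaee"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/beans_1_0.xsd">
</beans>
```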

Deploy to OpenShift

Finally, deploy the changes to OpenShift

$ git add .
$ git commit -am "NER demo app"
$ git push

After the code is pushed and the WAR is successfully deployed, we can view the application running at http://nerdemo-{domain-name}.rhcloud.com. My sample application is running at http://nerdemo-t20.rhcloud.com.

Now make a request to http://nerdemo-t20.rhcloud.com/api/v1/classify/Microsoft%20SCCM%20Windows%20Server%202012%20Web%20Development%20Expert%20(SME3)%20at%20PSI%20Pax%20(Baltimore,%20MD)
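The text to classify travels in the URL path, so it has to be percent-encoded before the request is made. Here is a minimal sketch of building such a URL in plain Java. The class name ClassifyUrlBuilder is hypothetical, and the base URL is simply the sample application address used in this post; replace it with your own.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a request URL for the classify endpoint from raw text. */
class ClassifyUrlBuilder {

    // Base URL of the sample application from this post; substitute your own domain.
    static final String BASE = "http://nerdemo-t20.rhcloud.com/api/v1/classify/";

    static String buildUrl(String text) {
        try {
            // URLEncoder produces form encoding, where a space becomes '+'.
            // Inside a URL path a space should be %20, so convert it.
            // A literal '+' in the input is already encoded as %2B at this point.
            String encoded = URLEncoder.encode(text, "UTF-8").replace("+", "%20");
            return BASE + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("Expert (SME3) at PSI Pax (Baltimore, MD)"));
    }
}
```

Note that this sketch also encodes parentheses and commas (as %28, %29, and %2C), which the URL above leaves as-is. Both forms should decode to the same text on a server that decodes path parameters.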