How to Build a Natural Language Processing App

Dear friends, with this article we are continuing our collaboration with Toptal. Toptal is an exclusive network that aims to connect the top freelance software developers, designers, and finance experts in the world to top companies for their most important projects. The article is authored by Shanglun Wang and was originally published in Toptal’s blog.

Natural language processing — a technology that allows software applications to process human language — has become fairly ubiquitous over the last few years.

Google search is increasingly capable of answering natural-sounding questions, Apple’s Siri is able to understand a wide variety of questions, and more and more companies are using (reasonably) intelligent chat and phone bots to communicate with customers. But how does this seemingly “smart” software really work?

In this article, you will learn about the technology that makes these applications tick, and you will learn how to develop natural language processing software of your own.

The article will walk you through the example process of building a news relevance analyzer. Imagine you have a stock portfolio, and you would like an app to automatically crawl through popular news websites and identify articles that are relevant to your portfolio. For example, if your stock portfolio includes companies like Microsoft, Blackstone, and Luxottica, you would want to see articles that mention these three companies.

Getting Started with the Stanford NLP Library

Natural language processing apps, like any other machine learning apps, are built on a number of relatively small, simple, intuitive algorithms working in tandem. It often makes sense to use an external library where all of these algorithms are already implemented and integrated.

For our example, we will use the Stanford NLP library, a powerful Java-based natural-language processing library that comes with support for many languages.

One particular algorithm from this library that we are interested in is the part-of-speech (POS) tagger. A POS tagger automatically assigns a part of speech to every word in a piece of text, classifying words based on their lexical features and analyzing them in relation to the surrounding words.

The exact mechanics of the POS tagger algorithm are beyond the scope of this article, but you can learn more about it here.

To begin, we’ll create a new Java project (you can use your favorite IDE) and add the Stanford NLP library to the list of dependencies. If you are using Maven, simply add it to your pom.xml file:
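The declaration looks roughly like the following sketch, which assumes the stanford-corenlp artifact on Maven Central. The version number here is illustrative, so check Maven Central for the release you want; note that the trained models used by the POS tagger ship as a separate artifact with the models classifier:

```xml
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.6.0</version>
</dependency>
<!-- The pre-trained models (including the POS tagger model) are packaged separately -->
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.6.0</version>
    <classifier>models</classifier>
</dependency>
```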

Scraping and Cleaning Articles

The first part of our analyzer will involve retrieving articles and extracting their content from web pages.

When retrieving articles from news sources, the pages are usually riddled with extraneous information (embedded videos, outbound links, advertisements, etc.) that is irrelevant to the article itself. This is where Boilerpipe comes into play.

Boilerpipe is an extremely robust and efficient algorithm for removing this “clutter”: it identifies the main content of a news article by analyzing the different content blocks, using features such as average sentence length, the types of tags used in each block, and the density of links. The boilerpipe algorithm has proven competitive with other, much more computationally expensive algorithms, such as those based on machine vision. You can learn more at its project site.

The Boilerpipe library comes with built-in support for scraping web pages. It can fetch the HTML from the web, extract text from HTML, and clean the extracted text. You can define a function, extractFromURL, that will take a URL and use Boilerpipe to return the most relevant text as a string using ArticleExtractor for this task:
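A minimal sketch of extractFromURL, assuming Boilerpipe's standard API (HTMLFetcher, BoilerpipeSAXInput, and CommonExtractors):

```java
import java.net.URL;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class PortfolioNewsAnalyzer {

    // Fetches a web page and returns the main article text, stripped of clutter.
    // Throwing Exception keeps the sketch short; real code should handle
    // IOException, SAXException, and BoilerpipeProcessingException separately.
    public static String extractFromURL(String userUrl) throws Exception {
        URL url = new URL(userUrl);
        // Fetch the raw HTML document from the web
        HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
        // Parse the HTML into a boilerpipe text document
        TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
        // Extract only the main article content
        return CommonExtractors.ARTICLE_EXTRACTOR.getText(doc);
    }
}
```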

The Boilerpipe library provides different extractors based on the boilerpipe algorithm, with ArticleExtractor being specifically optimized for HTML-formatted news articles. ArticleExtractor focuses specifically on HTML tags used in each content block and outbound link density. This is better suited to our task than the faster-but-simpler DefaultExtractor.

The built-in functions take care of everything for us:

HTMLFetcher.fetch gets the HTML document

getTextDocument extracts the text document

CommonExtractors.ARTICLE_EXTRACTOR.getText extracts the relevant text from the article using the boilerpipe algorithm

Now you can try it out with an example article regarding the merger of optical giants Essilor and Luxottica, which you can find here. You can feed this URL to the function and see what comes out.

Your output should contain the main body of the article, without the ads, HTML tags, and outbound links. Here is the beginning snippet of what I got when I ran this:

MILAN/PARIS Italy’s Luxottica (LUX.MI) and France’s Essilor (ESSI.PA) have agreed a 46 billion euro ($49 billion) merger to create a global eyewear powerhouse with annual revenue of more than 15 billion euros.
The all-share deal is one of Europe’s largest cross-border tie-ups and brings together Luxottica, the world’s top spectacles maker with brands such as Ray-Ban and Oakley, with leading lens manufacturer Essilor.
“Finally … two products which are naturally complementary — namely frames and lenses — will be designed, manufactured and distributed under the same roof,” Luxottica’s 81-year-old founder Leonardo Del Vecchio said in a statement on Monday.
Shares in Luxottica were up by 8.6 percent at 53.80 euros by 1405 GMT (9:05 a.m. ET), with Essilor up 12.2 percent at 114.60 euros.
The merger between the top players in the 95 billion eyewear market is aimed at helping the businesses to take full advantage of expected strong demand for prescription spectacles and sunglasses due to an aging global population and increasing awareness about eye care.
Jefferies analysts estimate that the market is growing at between…

And that is indeed the main body of the article. It is hard to imagine this being much simpler to implement.

Tagging Parts of Speech

Now that you have successfully extracted the main article body, you can work on determining if the article mentions companies that are of interest to the user.

You may be tempted to simply do a string or regular expression search, but there are several disadvantages to this approach.

First of all, a string search may be prone to false positives. An article that mentions Microsoft Excel may be tagged as mentioning Microsoft, for instance.

Secondly, depending on the construction of the regular expression, a regular expression search can lead to false negatives. For example, an article that contains the phrase “Luxottica’s quarterly earnings exceeded expectations” may be missed by a regular expression search that searches for “Luxottica” surrounded by white spaces.

Finally, if you are interested in a large number of companies and are processing a large number of articles, searching through the entire body of the text for every company in the user’s portfolio may prove extremely time-consuming, yielding unacceptable performance.

For our analyzer, we will use the part-of-speech (POS) tagger. In particular, we can use it to find all the proper nouns in the article and compare them to our portfolio of interesting stocks.

By incorporating NLP technology, we not only improve the accuracy of our tagger and minimize the false positives and negatives mentioned above, but we also dramatically reduce the amount of text we need to compare to our portfolio of stocks, since proper nouns comprise only a small subset of the full text of the article.

By pre-processing our portfolio into a data structure that has low membership query cost, we can dramatically reduce the time needed to analyze an article.

Stanford CoreNLP includes a very convenient tagger called MaxentTagger that can provide POS tagging in just a few lines of code.

The tagger function, tagPos, takes a string as an input and outputs a string that contains the words of the original string along with the corresponding parts of speech. In your main function, instantiate a PortfolioNewsAnalyzer and feed the output of the scraper into the tagger function. Each word in the result is suffixed with its part-of-speech tag (for example, Luxottica_NNP, where NNP marks a proper noun).
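A sketch of the analyzer class with the tagger wired in. The model file path below is an assumption: it points at the English left3words model as packaged inside the CoreNLP models jar, and you should adjust it to wherever the .tagger file lives in your setup:

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class PortfolioNewsAnalyzer {

    private final MaxentTagger tagger;

    public PortfolioNewsAnalyzer() {
        // Assumed path to the pre-trained English POS model shipped with the
        // CoreNLP models artifact; adjust for your own installation
        this.tagger = new MaxentTagger(
                "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    }

    // Tags each word in the input with its part of speech,
    // producing tokens like "Luxottica_NNP" or "agreed_VBD"
    public String tagPos(String input) {
        return tagger.tagString(input);
    }
}
```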

Processing the Tagged Output into a Set

So far, we’ve built functions to download, clean, and tag a news article. But we still need to determine if the article mentions any of the companies of interest to the user.

To do this, we need to collect all the proper nouns and check if stocks from our portfolio are included in those proper nouns.

To find all the proper nouns, we will first want to split the tagged string output into tokens (using spaces as the delimiters), then split each of the tokens on the underscore (_) and check if the part of speech is a proper noun.

Once we have all the proper nouns, we will want to store them in a data structure that is better optimized for our purpose. For our example, we’ll use a HashSet. In exchange for disallowing duplicate entries and not keeping track of the order of the entries, HashSet allows very fast membership queries. Since we are only interested in querying for membership, the HashSet is perfect for our purposes.

Below is the function that implements the splitting and storing of proper nouns. Place this function in your PortfolioNewsAnalyzer class:
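A sketch of that function, assuming MaxentTagger's default word_TAG output format:

```java
import java.util.HashSet;
import java.util.Set;

public class PortfolioNewsAnalyzer {

    // Collects every word tagged as a proper noun (NNP) into a set
    public static Set<String> extractProperNouns(String taggedOutput) {
        Set<String> propNounSet = new HashSet<>();
        for (String token : taggedOutput.split(" ")) {
            // Each token looks like "word_TAG", e.g. "Luxottica_NNP"
            String[] split = token.split("_");
            if (split.length >= 2 && split[1].equals("NNP")) {
                propNounSet.add(split[0]);
            }
        }
        return propNounSet;
    }
}
```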

There is an issue with this implementation, though. If a company’s name consists of multiple words (e.g., Carl Zeiss in the Luxottica example), this implementation will be unable to catch it. In the case of Carl Zeiss, “Carl” and “Zeiss” will be inserted into the set separately, so the set will never contain the single string “Carl Zeiss.”

To solve this problem, we can collect all the consecutive proper nouns and join them with spaces. Here is the updated implementation that accomplishes this:
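One way to do this is to buffer consecutive NNP tokens in a list and flush the joined run into the set whenever the run ends (a sketch, with the same word_TAG format assumed):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PortfolioNewsAnalyzer {

    // Collects proper nouns, joining runs of consecutive NNP tokens with spaces
    public static Set<String> extractProperNouns(String taggedOutput) {
        Set<String> propNounSet = new HashSet<>();
        List<String> currentRun = new ArrayList<>();
        for (String token : taggedOutput.split(" ")) {
            String[] split = token.split("_");
            if (split.length >= 2 && split[1].equals("NNP")) {
                // Extend the current run of consecutive proper nouns
                currentRun.add(split[0]);
            } else if (!currentRun.isEmpty()) {
                // A non-proper-noun token ends the run; store it joined by spaces
                propNounSet.add(String.join(" ", currentRun));
                currentRun.clear();
            }
        }
        // Flush a run that reaches the end of the text
        if (!currentRun.isEmpty()) {
            propNounSet.add(String.join(" ", currentRun));
        }
        return propNounSet;
    }
}
```

A run of length one still lands in the set on its own, so single-word names like Luxottica are caught exactly as before.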

Now the function should return a set containing both the individual proper nouns and the consecutive proper nouns (i.e., joined by spaces). If you print the propNounSet, you should see proper nouns from the article such as Luxottica, Essilor, and Carl Zeiss.

Comparing the Portfolio against the PropNouns Set

We are almost done!

In the previous sections, we built a scraper that can download and extract the body of an article, a tagger that can parse the article body and identify proper nouns, and a processor that takes the tagged output and collects the proper nouns into a HashSet. Now all that’s left to do is to take the HashSet and compare it with the list of companies that we’re interested in.

The implementation is very simple. Add the following code in your PortfolioNewsAnalyzer class:
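A sketch of the comparison, using retainAll to intersect the proper-noun set with the portfolio (the method name isArticleRelevant is my own, not from the original listing):

```java
import java.util.HashSet;
import java.util.Set;

public class PortfolioNewsAnalyzer {

    // Returns true if any proper noun found in the article matches a portfolio company
    public static boolean isArticleRelevant(Set<String> propNounSet, Set<String> portfolio) {
        // Copy first so we don't destroy the caller's set, then intersect
        Set<String> intersection = new HashSet<>(propNounSet);
        intersection.retainAll(portfolio);
        return !intersection.isEmpty();
    }
}
```

Because both collections are HashSets, the copy and the retainAll intersection are both cheap, constant-time-per-element operations.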

Putting it All Together

Now we can run the entire application: scraping, cleaning, tagging, collecting, and comparing. Here is the function that runs through the entire pipeline. Add this function to your PortfolioNewsAnalyzer class:
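A sketch of the full pipeline, assuming the helper methods built in the previous sections (extractFromURL, tagPos, extractProperNouns) are present in the class, along with a portfolio field holding the user's company names; the comparison is inlined here with retainAll:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PortfolioNewsAnalyzer {

    // Companies the user cares about; populate however you like
    private final Set<String> portfolio = new HashSet<>(Arrays.asList("Luxottica"));

    // extractFromURL, tagPos, and extractProperNouns from the previous
    // sections are assumed to be defined in this class.

    // Runs the whole pipeline: scrape + clean, tag, collect, and compare
    public boolean analyzeArticle(String urlString) throws Exception {
        String articleText = extractFromURL(urlString);              // scrape + clean
        String taggedArticle = tagPos(articleText);                  // POS-tag
        Set<String> propNounSet = extractProperNouns(taggedArticle); // collect proper nouns
        propNounSet.retainAll(portfolio);                            // intersect with portfolio
        return !propNounSet.isEmpty();
    }

    public static void main(String[] args) throws Exception {
        PortfolioNewsAnalyzer analyzer = new PortfolioNewsAnalyzer();
        if (analyzer.analyzeArticle(args[0])) {
            System.out.println("Article mentions portfolio companies");
        } else {
            System.out.println("Article does not mention portfolio companies");
        }
    }
}
```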

Run this, and the app should print “Article mentions portfolio companies.”

Change the portfolio company from Luxottica to a company not mentioned in the article (such as “Microsoft”), and the app should print “Article does not mention portfolio companies.”

Building an NLP App Doesn’t Need to Be Hard

In this article, we stepped through the process of building an application that downloads an article from a URL, cleans it using Boilerpipe, processes it using Stanford NLP, and checks whether the article mentions specific items of interest (in our case, companies in our portfolio). As demonstrated, leveraging this array of technologies turns what would otherwise be a daunting task into one that is relatively straightforward.

I hope this article introduced you to useful concepts and techniques in natural language processing and that it inspired you to write natural language applications of your own.

[Note: You can find a copy of the code referenced in this article here.]