Highlighting PDF for CoreNLP

Prior to the release of 3.2, using PDF Highlighter with NLP (natural language processing) tools was inefficient and couldn’t guarantee 100% precision. You had to parse the NLP output to extract key terms and phrases, then feed Highlighter a query file containing potentially hundreds of long phrases (e.g. whole sentences) to be marked in the PDF. As a result, highlighting a single document of a few hundred pages could take minutes, which is far from ideal.

From its early days, aside from highlighting for query terms, Highlighter supported document highlighting for Adobe’s PDF Highlight File Format. The highlight file contains character offsets and the length of terms to be marked in PDF. Because output from NLP tools contains text offsets, it seems reasonable to use this offset information to create a highlight file; unfortunately, this wasn’t possible for several reasons, most notably:

There are quirks in what the Adobe highlight file format considers a character and how many positions a character takes.

Highlight files require a page index and text position on the page. However, generally, you feed NLP with text extracted from a multi-page document, and NLP output doesn’t contain page information.

Highlighter 3.2 goes beyond Adobe’s highlight file format with an extended syntax that allows absolute text positions (offsets in the document instead of in the page) and a highlighting color specified for each term. This is exactly what we needed for the efficient use of Highlighter with NLP tools!
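To make the difference between page-relative and absolute positions concrete, here is a minimal Python sketch of the bookkeeping that the page-based format would otherwise force on you. It assumes the document text is the plain concatenation of the page texts with no separators; the function name and the page lengths are hypothetical and not part of Highlighter.

```python
from bisect import bisect_right
from itertools import accumulate


def to_page_position(abs_offset, page_lengths):
    """Map an absolute document offset to (page_index, offset_in_page).

    Assumes the document text is the page texts joined with no separator.
    """
    total = sum(page_lengths)
    if not 0 <= abs_offset < total:
        raise ValueError("offset outside document")
    # starts[i] is the absolute offset at which page i begins.
    starts = [0] + list(accumulate(page_lengths))[:-1]
    page = bisect_right(starts, abs_offset) - 1
    return page, abs_offset - starts[page]


# Hypothetical three-page document: 100, 250 and 80 characters long.
page_lengths = [100, 250, 80]
print(to_page_position(0, page_lengths))    # (0, 0)
print(to_page_position(100, page_lengths))  # (1, 0)
print(to_page_position(349, page_lengths))  # (1, 249)
print(to_page_position(350, page_lengths))  # (2, 0)
```

With absolute positions in the highlight file, this mapping is no longer needed on the client side: the offsets reported by the NLP tool can be used directly.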

The new PDF highlighting workflow for NLP output looks like this:

Get the text from the PDF document using the /extract service provided by Highlighter. This ensures that the text sent to the NLP tool is exactly the text Highlighter sees, so the text positions match.

Pass the text through the NLP tool.

Transform the NLP output to the highlights XML file. You will need to write some custom code for this. It’s simple enough, and we give an example for CoreNLP below.

Send the PDF and XML files to Highlighter and, depending on the request settings, receive the viewing URL or the new PDF with highlighted text.

Now, going from NLP output to the highlighted PDF takes fractions of a second, not minutes!

In the exercise below, we will show you how to set up Highlighter and highlight PDF documents for CoreNLP output files.

Setting Up PDF Highlighter

First, we’ll need to install PDF Highlighter.

Get the PDF Highlighter installer from its download page and install it. Highlighter needs a license key to run, so make sure to request one while on the download page.

In the configuration used for this exercise, the hit navigation strategy in the Highlighting PDF Viewer is changed to page-to-page. Because the number of highlighted terms for NLP is generally high, hit-to-hit navigation in the viewer would be cumbersome.

After you have installed the key and configuration file, restart Highlighter (“Highlighter Service” in Windows Services or “highlighter-service” service in Linux).
To ensure everything is running properly, open http://localhost:8998/status to access your local Highlighter installation—you should see “status: OK” in the results.

PDF Highlighter exposes an /extract method that extracts the text of a referenced document. We will use this service to get the PDF text to be analyzed with CoreNLP. For the complete list of available services, open http://localhost:8998/apidocs/ for the live web service API documentation.

We use an absolute path to the PDF document (a URL works as well) because the Highlighter server must be able to open it. Depending on the document’s size, text extraction may take several seconds to complete. (Note that Highlighter can be set up to index document folders in the background. In that case, getting the extracted text of an already indexed document is an instant operation.)
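As an illustration of the extraction step, the following Python sketch builds the request URL and saves the returned text. It rests on several assumptions: that /extract accepts a GET request, that the document is passed in a query parameter named “uri” (borrowed from the /highlight-for-xml form), and that the response body is the plain text. The file paths are hypothetical; check the live API documentation for the actual contract.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

HIGHLIGHTER = "http://localhost:8998"  # local installation used in this exercise


def extract_url(pdf_path, base=HIGHLIGHTER):
    """Build the /extract URL for a document the Highlighter server can open."""
    # ASSUMPTION: the parameter is called "uri"; verify it in /apidocs/.
    return f"{base}/extract?{urlencode({'uri': pdf_path})}"


def extract_text(pdf_path, out_path):
    """Fetch the extracted text and store it for the NLP tool."""
    # ASSUMPTION: a plain GET returns the document text in the response body.
    with urlopen(extract_url(pdf_path)) as resp, open(out_path, "wb") as out:
        out.write(resp.read())


# Hypothetical absolute path; the server, not the client, must be able to read it.
print(extract_url("/data/GlobalEconomicProspects.pdf"))
```

Calling extract_text("/data/GlobalEconomicProspects.pdf", "GlobalEconomicProspects.txt") against a running Highlighter would then produce the text file to pass to CoreNLP.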

Next, run CoreNLP on the extracted text. Depending on the document’s size and the enabled annotators, this can take minutes to complete.

We told CoreNLP to use the current folder as the output directory, so it will create GlobalEconomicProspects.txt.xml there.
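For reference, a CoreNLP run of this kind can be scripted as in the Python sketch below. The flags are those of the standard CoreNLP command line; the annotator list, the heap size, and the CoreNLP directory are assumptions chosen to cover named entities and sentiment, not necessarily the settings used for this exercise.

```python
import subprocess

# ASSUMPTION: annotators sufficient for named entities and sentiment.
ANNOTATORS = ["tokenize", "ssplit", "pos", "lemma", "ner", "parse", "sentiment"]


def corenlp_command(text_file, annotators=ANNOTATORS, output_dir="."):
    """Build the argument list for a CoreNLP run that writes XML output."""
    return [
        "java", "-cp", "*", "-Xmx4g",          # heap size is an assumption
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", ",".join(annotators),
        "-file", text_file,
        "-outputFormat", "xml",
        "-outputDirectory", output_dir,
    ]


def run_corenlp(text_file, corenlp_dir):
    """Run CoreNLP from its installation folder so the '*' classpath resolves."""
    subprocess.run(corenlp_command(text_file), cwd=corenlp_dir, check=True)


print(" ".join(corenlp_command("GlobalEconomicProspects.txt")))
```

Because run_corenlp changes the working directory to the CoreNLP folder, pass an absolute path for the text file, or adjust the output directory to suit your layout.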

Transforming CoreNLP Output to Highlight File

In general, you will need to write a custom program or script, tailored to your needs, that reads the CoreNLP output XML and creates a highlights XML file that PDF Highlighter can digest.

In this exercise, we created a Java tool that can create a highlight file for named entities and sentiment data. You can download this tool (for simplicity, choose jar with dependencies included) from its project page.

Note that the highlight location attributes “word”, “nlp-pos”, “nlp-ner”, and “nlp-sentiment” are not actually used by PDF Highlighter. We add them in CoreNlpToHighlightsXml only for reference and result verification. They can be safely removed to make the XML file smaller.
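If you would rather write your own converter, the Python sketch below shows the general shape of such a transformation for named entities. It is not the CoreNlpToHighlightsXml tool. The input element names (token, word, CharacterOffsetBegin, CharacterOffsetEnd, NER) follow CoreNLP’s XML output. The output layout (XML, Body, Highlight, and loc elements with “pos” and “len”) is modeled on Adobe’s highlight file format, and the “color” attribute and the absence of a page index are assumptions about the extended syntax; consult the Highlighter documentation for the exact form.

```python
import xml.etree.ElementTree as ET


def corenlp_to_highlights(corenlp_xml, colors):
    """Create highlights XML for every token whose NER label has a color."""
    root = ET.fromstring(corenlp_xml)
    wrapper = ET.Element("XML")
    body = ET.SubElement(wrapper, "Body", {"units": "characters"})
    highlight = ET.SubElement(body, "Highlight")
    for token in root.iter("token"):
        ner = token.findtext("NER", default="O")
        if ner not in colors:
            continue  # skip tokens outside the entity types we want marked
        begin = int(token.findtext("CharacterOffsetBegin"))
        end = int(token.findtext("CharacterOffsetEnd"))
        ET.SubElement(highlight, "loc", {
            "pos": str(begin),            # absolute offset in the document text
            "len": str(end - begin),
            "color": colors[ner],         # ASSUMED attribute name
            "word": token.findtext("word", default=""),  # reference only
            "nlp-ner": ner,                               # reference only
        })
    return ET.tostring(wrapper, encoding="unicode")


# Tiny hand-written sample in the shape of CoreNLP XML output.
SAMPLE = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>World</word><CharacterOffsetBegin>4</CharacterOffsetBegin>
<CharacterOffsetEnd>9</CharacterOffsetEnd><POS>NNP</POS><NER>ORGANIZATION</NER></token>
<token id="2"><word>grows</word><CharacterOffsetBegin>10</CharacterOffsetBegin>
<CharacterOffsetEnd>15</CharacterOffsetEnd><POS>VBZ</POS><NER>O</NER></token>
</tokens></sentence></sentences></document></root>"""

print(corenlp_to_highlights(SAMPLE, {"ORGANIZATION": "#ffd54f"}))
```

This sketch emits one location per token. A fuller converter might merge adjacent tokens that share an entity label into a single location and handle sentence-level sentiment as well.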

At this point, we are ready to highlight our PDF…

Highlighting PDF

The easiest way to do ad-hoc highlighting is to use the live API documentation provided with Highlighter. Open http://localhost:8998/apidocs/ in your browser, click on the “/highlight-for-xml” section, then click the “Try it out” button.

In the “uri” field, enter the absolute path to the PDF file, and in the “xml” field, the absolute path to the highlights XML file that was generated in the previous step. Click “Execute”.

If successful, you should see a JSON response containing, among other elements, a field named documentUrl. This is the URL of the result document in the PDF Highlighting Viewer.
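When calling the service from code instead of the browser, the viewer URL can be read from the response as in this small Python sketch. Only the documentUrl field name comes from the response described above; the sample response body and its URL are made up for illustration.

```python
import json


def viewer_url(response_body):
    """Return the documentUrl field from a /highlight-for-xml JSON response."""
    data = json.loads(response_body)
    if "documentUrl" not in data:
        raise ValueError("response has no documentUrl field")
    return data["documentUrl"]


# Hypothetical response body; a real response contains other elements too.
sample = '{"documentUrl": "http://localhost:8998/viewer/example", "other": 1}'
print(viewer_url(sample))
```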

Copy the URL and open it in a web browser to see the PDF with the marked entities.

That’s it!

If you prefer having a copy of the PDF document with highlights added to it (so it can be opened in any PDF viewer), change the desired response content type to application/pdf. (Unfortunately, the live API documentation is unable to receive and save the generated document, so you will need a different REST client, curl, or your own client application.)
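A client application for this can be quite small. The Python sketch below requests the highlighted PDF and saves it to disk. It assumes that the service accepts a GET request with the “uri” and “xml” values as query parameters and that the response content type is selected through the Accept header; the file paths are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def highlighted_pdf_request(pdf_path, xml_path, base="http://localhost:8998"):
    """Build a /highlight-for-xml request that asks for a PDF response."""
    # ASSUMPTION: parameters travel in the query string of a GET request.
    query = urlencode({"uri": pdf_path, "xml": xml_path})
    return Request(
        f"{base}/highlight-for-xml?{query}",
        headers={"Accept": "application/pdf"},  # ASSUMED way to pick the type
    )


def save_highlighted_pdf(pdf_path, xml_path, out_path):
    """Send the request and write the returned PDF bytes to out_path."""
    request = highlighted_pdf_request(pdf_path, xml_path)
    with urlopen(request) as resp, open(out_path, "wb") as out:
        out.write(resp.read())


req = highlighted_pdf_request("/data/report.pdf", "/data/report-highlights.xml")
print(req.full_url)
```

Running save_highlighted_pdf with paths that the Highlighter server can read should then leave a standalone highlighted copy at out_path.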

We hope you find this useful. If you have any questions or need assistance with PDF Highlighter integration, feel free to get in touch with us.