Enrichment: Starting at the beginning with Text Extraction

By Paul Maker

If you recall the first blog in this series, I introduced something that we call 'information enrichment'. Just to recap, this is the process of taking a piece of information – it could be data or a document – and then adding additional business context to it. This context is often labels, metadata, classifications and other such things that we can use to better structure, navigate and use that information.

This blog is going to talk about the first step in the process of document enrichment – text extraction.

Managing multiple document formats

On the face of it, this sounds simple – a document contains text, let’s just extract it. However, the reality behind the scenes is far from simple. Whilst documents contain text, that text is wrapped up in a whole host of structures that control format, rich media and so on.

Our challenge in Aiimi Labs was to find an efficient way for our product, InsightMaker, to handle the myriad of document formats out there. Even looking at Microsoft Office documents alone, there are numerous format versions dating back decades – consider Word, for instance: the 32-bit version of Office was released back in 1995!

We tested a series of different approaches, from third-party vendors' tools, such as Microsoft's IFilter interface (originally developed to support the Windows Indexing Service), to Open Source offerings, such as Apache Tika.

After several rounds of functionality and performance testing, we settled on Apache Tika. Its huge breadth of supported document formats and the fact that it was Open Source gave us a scalable approach to document conversion, as well as the flexibility and control we needed – something that would pay back in spades later in our journey.
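To make the Tika approach concrete, here is a minimal sketch of extracting plain text from a document via a running Tika Server instance. It assumes a server listening on Tika's default port (9998), whose /tika endpoint accepts a document body via PUT and, given an "Accept: text/plain" header, returns the extracted text; the file name is hypothetical.

```python
# Sketch: plain-text extraction via Apache Tika Server (assumed to be
# running locally on the default port, 9998). Standard library only.
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # default Tika Server endpoint

def extract_text(path: str, timeout_s: float = 60.0) -> str:
    """PUT a document to Tika Server and return its plain-text content."""
    with open(path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        TIKA_URL,
        data=body,
        headers={"Accept": "text/plain"},  # ask Tika for plain-text output
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")

# extract_text("quarterly-report.docx")
```

Because Tika Server exposes conversion over HTTP, the heavy lifting can be scaled out behind a load balancer independently of the calling application.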

Time(out) trials...

After deploying this at numerous customer sites, we found a problem. Some document formats (especially those containing embedded documents, something Tika recursively handles) could cause the Java processes' memory consumption and CPU cycles to spiral out of control. Now, in an enterprise processing half a billion documents, this very quickly becomes a problem.

To solve this challenge, we decided to fork the Apache Tika code and make some internal modifications to the REST endpoints. Specifically, we added a timeout parameter. This allows us to control the maximum time that is spent trying to convert a document. If the conversion exceeds this time limit, we internally terminate the conversion process. We also have plans in our InsightMaker product roadmap to add memory utilisation monitoring. This will further enhance how we can reliably convert masses of documents, since long execution time alone is not always an indicator of a problem!

Unpacking complex CAD drawings

Whilst we found Apache Tika great for Office documents, PDF files, ZIP files and numerous other unusual formats, it did not handle things like CAD drawings.

CAD is an important format for us to be able to unpack and extract text from, because we have several customers in the Asset Management and Engineering sector. Unless you’re a domain expert, CAD drawings are often some of the most difficult files to find in an organisation. CAD files also contain a wealth of information which we can convert to metadata and entities, allowing us to begin joining up drawings with their structured SAP data.

For CAD, we decided to use ASPOSE for conversion to text but, as with Tika, we hit similar problems with certain large drawings taking an excessive amount of time to process. To resolve this, all we needed to do was carry over the same timeout approach we used with Tika. We also invested a lot of time in making sure that we only extract useful text from CAD drawings – not the thousands of numerical measurement values.
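To illustrate the "keep only useful text" step, here is a hypothetical sketch of filtering CAD output: tokens that are purely numeric measurements (e.g. "1250", "35.5mm", "10x20") are dropped, so that only meaningful labels and annotations survive. The pattern and the sample drawing text are illustrative, not our production rules.

```python
# Sketch: strip purely numeric measurement tokens from extracted CAD text,
# keeping labels and annotations. The pattern here is illustrative only.
import re

# Matches bare numbers, dimension pairs like "10x20", and units like "35.5mm".
MEASUREMENT = re.compile(
    r"^[0-9]+(\.[0-9]+)?(x[0-9]+(\.[0-9]+)?)?(mm|cm|m)?$",
    re.IGNORECASE,
)

def useful_tokens(raw_text: str) -> list[str]:
    """Filter extracted CAD text down to non-measurement tokens."""
    return [t for t in raw_text.split() if not MEASUREMENT.match(t)]

# useful_tokens("PUMP-STATION-3 1250 35.5mm ISOLATION VALVE 10x20")
# -> ['PUMP-STATION-3', 'ISOLATION', 'VALVE']
```

Filtering at this stage keeps the downstream search index free of millions of meaningless numeric terms, which matters at the document volumes described above.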

So, that’s the first step of enrichment covered – text extraction. Tomorrow, I will talk about how we extract business entities from information and start to put them to work in the business.