Disclaimer

The opinions shared here are those of the contributor alone, and not those of their employer or of Big Men On Content as a whole.

One of the things that got me going again about Text Analytics was the claim by many vendors at the AIIM Conference 2016 that they support Text Analytics. Most didn’t, and here’s why.

Zonal OCR and Smart Capture are Not Text Analytics

In automated indexing of captured (or imaged) documents, there are two options: Zonal OCR and Smart Capture.

Zonal OCR starts during job configuration by identifying a rectangular zone on a document where particular information is usually located. For instance, the top one inch of the page is identified as the location for a vendor name. During scanning, the text in that zone is read and considered for use as a keyword. The system might do some extra recognition there, like looking for two lines that include a comma and a five-digit number to indicate an address. Or the system may look for “P.O.” and then capture any numbers next to it. What distinguishes Zonal OCR is that the areas are identified in advance.
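To make the zone idea concrete, here is a minimal sketch in Python. It assumes the OCR engine has already returned each word with a position on the page; the Word and Zone types, the inch-based coordinates, and the sample page are all hypothetical and not taken from any particular capture product.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge, in inches from the left of the page (assumed unit)
    y: float  # top edge, in inches from the top of the page (assumed unit)


@dataclass
class Zone:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return self.left <= word.x < self.right and self.top <= word.y < self.bottom


def read_zone(words, zone):
    """Return the text of every word inside the zone, in reading order."""
    inside = sorted((w for w in words if zone.contains(w)), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in inside)


# Zones are fixed at job configuration time, before any page is scanned.
VENDOR_ZONE = Zone(left=0.0, top=0.0, right=8.5, bottom=1.0)  # top one inch
PO_ZONE = Zone(left=0.0, top=2.5, right=8.5, bottom=3.5)      # hypothetical

# Hypothetical OCR output for one page.
page = [
    Word("Acme", 1.0, 0.5),
    Word("Supply", 1.6, 0.5),
    Word("P.O.", 1.0, 3.0),
    Word("48213", 1.5, 3.0),
]

vendor = read_zone(page, VENDOR_ZONE)

# Extra recognition inside a zone: find "P.O." and take the digits next to it.
po_match = re.search(r"P\.O\.\s*(\d+)", read_zone(page, PO_ZONE))
po_number = po_match.group(1) if po_match else None
```

If the vendor name on a given page sits outside the configured zone, this sketch simply misses it, which is the limitation described above.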

Smart Capture, on the other hand, does not start with any identified areas. Instead, Smart Capture relies upon the proximity of words or specifically formatted characters (wildcards). With Smart Capture, the document layout does not usually need to be pre-configured. Instead, Smart Capture relies on a document training set to ensure that it’s finding the right data. During operation, once the document has been scanned and OCRed, the system looks for word proximity. For instance, it looks for combinations of “purchase order” and then searches the nearby text for a numeric sequence. While this may sound similar to Zonal OCR, it differs in that the text combinations may not always physically fall in the same area.
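The proximity search can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the OCRed page is a plain string, "nearby" means within a fixed number of characters after the label, and the function name find_near, the 40-character window, and the four-or-more-digit pattern are all made up for the example. A real product would also use its training set to tune these.

```python
import re


def find_near(text, label, value_pattern=r"\d{4,}", window=40):
    """Return the first value found within `window` characters after `label`."""
    label_re = re.compile(re.escape(label), re.IGNORECASE)
    value_re = re.compile(value_pattern)
    for match in label_re.finditer(text):
        nearby = text[match.end():match.end() + window]
        hit = value_re.search(nearby)
        if hit:
            return hit.group(0)
    return None


# Two hypothetical layouts: the label appears in different places on the page.
doc_a = "Invoice 17\nPurchase Order No: 48213\nTotal due: 120.00"
doc_b = "Total due: 120.00. Please reference purchase order 77120 when paying."

po_a = find_near(doc_a, "purchase order")
po_b = find_near(doc_b, "purchase order")
```

Unlike the zonal approach, nothing here depends on where on the page the label falls, only on what text sits next to it.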

Both of these require semi-structured content, as you would see in a form or a common layout like an invoice. Neither of them can truly take an arbitrary document and extract keywords for classification.

Entity Extraction is Text Analytics

Entity extraction does not require documents to have any structure. It works by performing fact extraction that looks at the individual words, their meanings, and their use in sentences. It builds maps of these words in a multi-dimensional vector space. Text Analytics practitioners call this concept doc2vec (documents to vectors). These collections of related words are used to highlight key facts. Any arbitrary document can be run through this process.
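To give a feel for what it means to treat documents as vectors, here is a deliberately simple Python sketch. It is not doc2vec: doc2vec learns dense vectors from a training corpus, while this stand-in just counts words and compares the counts with cosine similarity. The function names and sample sentences are invented for the illustration.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Represent a document as word counts (one dimension per distinct word)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors; 0.0 if either is empty."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


invoice_1 = to_vector("invoice for purchase order payment due")
invoice_2 = to_vector("payment due on invoice and purchase order")
memo = to_vector("team lunch is moved to friday")

similar = cosine(invoice_1, invoice_2)
unrelated = cosine(invoice_1, memo)
```

The two invoice-like sentences end up close together and the memo ends up far away, even though neither layout nor word order was used.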

The extracted facts could then be used to create keywords by identifying the most common entities. An entity is not identified by an individual word, but by all the words that represent that entity.
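The point that one entity covers many surface words can be shown with a small Python sketch. It assumes a hand-written alias table, which is a big simplification: a real text analytics engine would discover these groupings from the language itself rather than from a lookup. The ALIASES table, the top_entities function, and the sample text are hypothetical.

```python
import re
from collections import Counter

# Hypothetical alias table: every phrase on the left refers to the entity on the right.
ALIASES = {
    "international business machines": "IBM",
    "ibm": "IBM",
    "big blue": "IBM",
    "microsoft": "Microsoft",
    "msft": "Microsoft",
}


def top_entities(text, aliases, n=3):
    """Count mentions per entity across all its aliases; return the most common."""
    lowered = text.lower()
    counts = Counter()
    for alias, entity in aliases.items():
        pattern = r"\b" + re.escape(alias) + r"\b"
        counts[entity] += len(re.findall(pattern, lowered))
    return [(entity, count) for entity, count in counts.most_common(n) if count > 0]


sample = (
    "IBM announced a new service. Analysts say Big Blue is catching up "
    "to Microsoft, but International Business Machines has more patents."
)

keywords = top_entities(sample, ALIASES)
```

Counting by word alone would score "IBM" once; counting by entity scores it three times, which is what makes it the leading keyword candidate for this text.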

Summary

It’s best to watch out for buzzwords. It seems that everyone wants to jump on the text analytics story. So if you see “OCR” or “ICR” with “Text Analytics”, make sure that they’re not selling Zonal OCR or Smart Capture. Then again, the one vendor at the conference that really did text analytics didn’t have one person at their booth who could explain it without making it sound like Zonal OCR.

Is your company looking for Content Categorization or something more from Text Analytics?