Every technology market is guilty of coining at least one or two confusing terms. In the document imaging world, the culprits are terms with very similar-sounding names: technically similar, but strictly different.

One of the most confusing distinctions in the imaging world is the difference between image capture software, often just called "capture," and data capture software. Not only are the names confusing, but technically there is a lot of overlap: all data capture products have imaging capabilities, and all capture products have basic data capture. The risk of the confusion is mistaking one product for the other. For example, organizations that try to use the basic data capture functionality built into a capture application for a full-blown data capture project end up with little success and a lot of frustration. Let me explain where each fits.

Capture products have the primary function of delivering quality images in a proper document structure. They often feature image clean-up, review, and page-splitting tools that are more advanced than the scanning found in data capture applications. Most offer what is called rubber-band OCR: the reading of a specific coordinate region on a page. Some go as far as creating templates where coordinate zones are saved. This is where the solutions get confused with data capture. Without document registration and proper forms-processing techniques, however, it is not data capture. The risk of such basic templates is low accuracy and zones that do not always collect the intended data.
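To make the rubber-band idea concrete, here is a toy sketch in Python. The page contents, zone names, and coordinates are all invented for illustration; real capture tools store pixel coordinates against a scanned image, not character positions in text, but the failure mode is the same.

```python
# Toy model of "rubber-band" zone reading. A saved template maps a field
# name to a fixed coordinate zone; reading is just lifting whatever sits
# inside that zone, with no registration of the document first.

page = [
    "INVOICE",
    "No: 10452   Date: 2010-03-01",
    "Total: $87.50",
]

# A saved template: zone name -> (row, first column, last column).
template = {
    "invoice_no": (1, 4, 9),
    "total":      (2, 7, 13),
}

def read_zones(image, zones):
    """Read the characters inside each saved coordinate zone."""
    return {name: image[row][start:end].strip()
            for name, (row, start, end) in zones.items()}

print(read_zones(page, template))
# -> {'invoice_no': '10452', 'total': '$87.50'}

# The low-accuracy risk: if the scan drifts even one column, the same
# template now collects truncated data.
shifted = [" " + row for row in page]
print(read_zones(shifted, template))
# -> {'invoice_no': '1045', 'total': '$87.5'}
```

Without registering the page back to the template, the zones silently return wrong values, which is exactly why this falls short of true data capture.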

Data capture products need images to function, so adding scanning to these solutions was an obvious choice. These solutions, however, are better fed by a full capture application that has the performance and the additional features, such as batch naming, annotations, and page splitting, that the organization may require in the resulting image files. In data capture, image capture exists only to feed data extraction, and such solutions sometimes neglect the features that are important for image storage and archival.

In the end, both kinds of solutions are improving in the other's territory. Eventually the lines will blur to the point where, feature-wise, they are identical, and the benefit of one over the other will be rooted in the vendor's expertise, either capture or data capture. If your primary requirement is quality images, a capture vendor's solution is the better choice; if it is data extraction, then solutions rooted in data capture are better.

Our culture is built on the idea that newer means better. With advanced technologies this is true for the most part, but people are always surprised when I tell them that disabling some of the newer technology can actually produce a better result. Here are three examples where technology demands time travel back to older approaches for higher accuracy.

In data capture and OCR, there is a component of the technology called document analysis. Before any data is collected, document analysis determines the structure of a page, including columns, rows, tables, pictures, paragraphs, and lines. It is the biggest contributor to modern-day OCR accuracy. Document analysis is really designed for more traditional documents such as an article, a book page, or a letter. It does not excel (although specialized versions exist) at form-type documents. One of the most difficult documents in the world is an Explanation of Benefits (EOB), which typically has its own structure per variant. Surprisingly, the best way to process such a document is to turn off document analysis and default to a basic full-page read of the text. The reason is that document analysis carries an overwhelming bias toward tables that no EOB will match.

It is the same case when reading text from photographs. Reading text from license plates and product plates (serial-number plates welded or stuck to many products) during assembly is best done with engines that do not apply document analysis. In this case, document analysis tries too hard to find structure: because of the nature of these images, characters in the photo end up split into multiple lines and blocks. Without document analysis, the engine sees the whole image as one text block and simply reads it, producing better results. The license-plate readers that snap pictures of your plate at toll booths all use older, antiquated OCR technology; by turning off document analysis, they could use the newer engines.
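The effect described above can be faked in a few lines of Python. This is a deliberately simplistic toy model, not a real engine: the "analysis" here just splits on gaps, standing in for the way real layout analysis fragments sparse plate images, while real products expose the same choice as a configuration switch (for example, a page-segmentation mode).

```python
# A "photographed" serial plate: one short line of text in empty space.
plate = [
    "          ",
    "  AB-1234 ",
    "          ",
]

def read_with_analysis(image):
    """Toy layout analysis: hunts for gaps and splits the page into
    blocks. On a sparse plate image the gap hunting is overeager, so
    one serial number comes back as disconnected fragments."""
    blocks = []
    for row in image:
        for fragment in row.split():
            blocks.extend(fragment.split("-"))
    return blocks

def read_full_page(image):
    """Analysis off: treat the whole image as one text block and
    simply read it left to right, top to bottom."""
    return " ".join(row.strip() for row in image if row.strip())

print(read_with_analysis(plate))  # -> ['AB', '1234']  (fragmented)
print(read_full_page(plate))      # -> 'AB-1234'       (intact)
```

The single-block read keeps the serial number whole, which mirrors why plate-style images do better with document analysis switched off.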

Finally, there is mobility. This one makes a lot of people uncomfortable. Our society wants to believe their cell phone can do anything; just today I was wondering why my cell phone did not brush my teeth for me. You can have your cell phone do OCR, sure, but it requires older, smaller, limited OCR engines to do so. I prefer to send the image to a server and use more advanced OCR, but many demand OCR on the phone, even though in practice it is usually slower. The reason is that OCR requires specific processing power and specific types of processing. The chips in phones today, and likely for a very long time to come, will not compete with the power of a computer, nor, most importantly, will they include the proper math operations it takes for efficient, math-heavy modern OCR. Cell phones cannot adopt such chips because we demand long-lasting batteries, small size, and low cost. Intense math is simply not important for 99.9% of mobile applications.

There you have it: modern OCR taken down a few notches to solve current-day problems. The best engines that exist today let you turn all of this functionality on and off, making it possible to purchase the latest OCR technology and limit it however you need. Most organizations don't understand why anyone would want to turn off the new, but as these examples show, new is not always better!

Document imaging and scanning are facilitated in large part by various software applications. Often, some of the greatest appeal for those not too familiar with document imaging is the functionality contained within the software bundled with a document scanner. Many vendors, while they are selling document scanners, put all the focus on the applications that are married to the scanner and on how those applications handle the images.

Recently at MacWorld 2010, this was proven true by the various scanner vendors, who had more to say about their personal content management applications than about their actual scanners. What surprised me is how little end users were concerned about where and how their images are stored.

Knowing how your personal content management application stores images is critical for your future retention and use of those images. To give you an example: if you are scanning into an application that converts images to a proprietary format and saves them in an SQL Express database you have no direct access to, migrating away from that application will be as difficult as re-scanning each and every piece of paper. And what if you no longer have the originals?

Many of the sexy software applications out there make it very difficult to get at your data files directly, whether for use in other applications or for the purpose of migration. I would have expected this to be a common question asked of vendors, but it was not. Only once did I see a vendor explain how you can still get to the files contained in their application. Indeed you could, by following some non-obvious steps. And once you found the image files, they were bizarrely named, not with the names assigned within the software. It is good to know they are there and accessible, but what a tremendous amount of work to get to them.

You own the information, so make sure you know where the images go, how they are stored, and how, if at all, you can get to them. If a particular solution is locked down or requires some hacking, it is not a personal content management system for you.

Often, organizations have no control over the images they receive. Images can come via fax, which has a varied range of resolutions, or they can arrive as poor scans, none of which are any good for data capture and OCR processes. Fortunately, there are a lot of imaging tools and tricks out there to help. None of these tools replaces a good scan, but some get close. One tool not often thought about is up-sampling.

Up-sampling is the process of taking an image at a lower resolution and increasing it to a higher resolution. The technology basically increases the resolution of the image and then fills in the new, empty pixels with values predicted from the original image. For data capture and OCR, up-sampling is usually done from 150 DPI or 200 DPI to 300 DPI. Up-sampling technologies have become very impressive and useful; often I will recommend up-sampling over working with a lower-resolution source. But let's talk about the facts, and about how and when you should consider up-sampling.
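A minimal sketch of the idea, assuming a grayscale image stored as a 2-D list of pixel values and a fixed 2x scale (say, 150 DPI to 300 DPI). The neighbour averaging below is the crudest possible "prediction"; commercial up-sampling tools use far more sophisticated models, but the principle of filling new pixels from the originals is the same.

```python
def upsample_2x(img):
    """Double the resolution of a grayscale image (2-D list of ints),
    filling each new pixel with the average of its nearest source
    pixels. Original pixel values are preserved exactly."""
    h, w = len(img), len(img[0])
    out = [[0] * (2 * w) for _ in range(2 * h)]
    for y in range(2 * h):
        for x in range(2 * w):
            # Map the new pixel back onto the source grid...
            sy, sx = y // 2, x // 2
            # ...and pick the neighbouring source pixel in each axis
            # (clamped at the image edge).
            ny, nx = min(sy + y % 2, h - 1), min(sx + x % 2, w - 1)
            out[y][x] = (img[sy][sx] + img[ny][sx]
                         + img[sy][nx] + img[ny][nx]) // 4
    return out

low = [[0, 4],
       [8, 12]]
print(upsample_2x(low))
# -> [[0, 2, 4, 4], [4, 6, 8, 8], [8, 10, 12, 12], [8, 10, 12, 12]]
```

Note that the predicted in-between pixels are guesses, not recovered detail, which is exactly why the warnings below about noise, crowding, and repeated conversions matter.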

Up-sampling should be considered for documents that have little noise such as watermarks, spills, stains, stamps, or speckling; essentially, documents that are a good-quality scan but at low resolution. You should also avoid up-sampling documents with closely spaced elements and crowded text. In those two scenarios, it is better to work with the source image as-is and work around the problems.

The bigger the gap between the source resolution and the desired resolution, the greater the risk of artifacts after up-sampling. For example, 150 DPI to 300 DPI will not yield the quality that 150 DPI to 200 DPI will. This is why going crazy and up-sampling to the highest possible resolution is not a good idea; it is like taking a very small image and zooming in as far as you can, hoping to find details that probably are not there. Trying to trick the system will only hurt you. Up-sampling from 150 DPI to 200 DPI and then again to 300 DPI is not better than converting directly from 150 DPI to 300 DPI; in fact, it is a pretty big mistake. Essentially, you are magnifying the mistakes created during up-sampling, because they now get propagated twice. This will likely decrease your quality and can result in such things as bloated characters, fuzzy characters, or an abundance of speckling. The goal is to do as few conversions on the document as possible.

I will always defer to a proper scan over any imaging technique, but when you do not have control of the scan, up-sampling is one of the image tools to consider. Uneducated use of the technology is unsafe, as is true of all advanced technologies, but if you stick to the facts and pick a great technology, you will be successful.