It is not often that companies using Data Capture technology have the chance to change their forms design or even to create new forms. If you have this ability, USE IT! A properly designed form is the first step to success in automating that form. There are many things you can do to make sure your form is as machine readable as possible. Typically the forms we are talking about are hand-written, but they are occasionally machine filled as well. I will highlight the major points.

Corner stones. Make sure your form has a corner stone in each corner of the page. The corner stones should be at 90-degree angles to their neighbors, and the ideal type is a black 5 mm square.

Form title. A clear title in 24 pt or larger print, and no stylized font.

Completion guide. This is optional, but it is sometimes useful to print a guide at the top of the form on how best to fill in the types of fields you use.

Mono-spaced fields. For the fields to be completed, it's best to use field types with character-by-character separation. Each character block should be 4 mm x 5 mm, and blocks should be separated by 2 mm or more. The best types of fields to use, in order, are: letters separated by a dotted frame, letters separated by a drop-out color frame, and letters separated by complete square frames.

Segmented fields by data type. For certain fields, it is important to segment the field into portions to enhance ICR accuracy. The best example is a date: instead of having one field for the complete date, split it into 3 separate parts, with the first being a month field, the next a day field, and the last a year field. The same is often done for numbers, codes, and phone numbers.
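Segment-level validation is one reason this helps: each piece can be checked against a tight pattern on its own. A minimal sketch in Python (the range checks are illustrative, not a full calendar validation):

```python
import re

def validate_date_segments(month: str, day: str, year: str) -> bool:
    """Check each separately recognised date segment against a tight pattern."""
    # Because the field is split, each part is validated independently.
    if not (re.fullmatch(r"\d{2}", month)
            and re.fullmatch(r"\d{2}", day)
            and re.fullmatch(r"\d{4}", year)):
        return False
    # Simple range checks per segment.
    return 1 <= int(month) <= 12 and 1 <= int(day) <= 31
```

A value like "13" in the month segment is caught immediately, where a single combined date field would leave the operator hunting for which characters are wrong.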

Separate fields. Separate each field by 3 mm or more.

Consistent fields. Make sure the form uses the field types described under mono-spaced fields consistently throughout.

Form breaks. It’s OK to break the form up into sections and separate those sections with solid lines. This often helps template matching.

Placement of field text. For the text that indicates what a field is ("first name", "last name"), it is best to place it left justified, to the left of the field, at a distance of 5 mm or more. DO NOT put the field descriptor in drop-out ink inside the field itself.

Barcode. Barcode form identifiers are useful for form identification. Use a unique ID per form page, and place the barcode at the bottom of the page at least 10 mm from any field.
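The measurements above lend themselves to an automated check of a form template before it goes to print. A sketch of such a lint, where the template representation and names are hypothetical:

```python
from dataclasses import dataclass
from typing import List

MM_BOX_W, MM_BOX_H = 4.0, 5.0   # character block size from the rules above
MM_CHAR_GAP = 2.0               # minimum gap between character blocks

@dataclass
class CharBox:
    x: float  # left edge, in mm
    w: float  # width, in mm
    h: float  # height, in mm

def lint_field(name: str, boxes: List[CharBox]) -> List[str]:
    """Return a list of layout-rule violations for one field's boxes."""
    problems = []
    for b in boxes:
        if (b.w, b.h) != (MM_BOX_W, MM_BOX_H):
            problems.append(f"{name}: block is {b.w} x {b.h} mm, want 4 x 5 mm")
    for left, right in zip(boxes, boxes[1:]):
        gap = right.x - (left.x + left.w)
        if gap < MM_CHAR_GAP:
            problems.append(f"{name}: character gap {gap} mm is under 2 mm")
    return problems
```

Running this over every field in a template catches spacing mistakes before a single page is printed, which is far cheaper than discovering them in recognition results.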

Just this morning, I was reminded of why market education is so important. I received an email from a customer who has been exposed to data capture technology for many years. This customer owns a semi-structured data capture solution that is capable of locating fields on forms that change from variation to variation. In an attempt to help my understanding, we started a conversation about their expectations. Very wisely, the customer broke down their expectations into three categories: OCR accuracy (field level), field location accuracy, and amount of time to process per document. This is a step more advanced than a typical user, who will clump all of this into one category. In addition to these, there should be a minimum template matching accuracy. In any case, they expect an OCR accuracy of 90%, which is reasonable considering the documents they are working with are pixel perfect. They expect a 20-page document to be processed in 4 minutes, which is also reasonable and right on the line. Finally, they expect field location to be 100%. RED FLAG!
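For a sense of scale, the stated expectations work out as follows. The fields-per-page figure is my own assumption for illustration, not a number from the customer:

```python
# Throughput: a 20-page document in 4 minutes.
pages = 20
seconds_per_page = 4 * 60 / pages          # 12.0 seconds per page

# QA load implied by 90% field-level accuracy.
fields_per_page = 25                       # assumed for illustration
field_accuracy = 0.90
expected_bad_fields = pages * fields_per_page * (1 - field_accuracy)
# Roughly 50 fields per document would still need operator attention,
# which is why the QA step, not recognition, dominates the real cost.
```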

This is not the first time I have seen the assumption that you can locate fields on a semi-structured form with 100% accuracy, 100% of the time. To my dismay, as people learn more about the technology, this is becoming the next class of common fallacy. And because the organization did not specify template matching accuracy, they must also be assuming templates match 100% of the time in order to get 100% field location accuracy. Trouble.

It's clear why 100% field location accuracy is important to them: their basic QA processes are capable of checking only recognition results (OCR accuracy), not the locations of fields. Instead of modifying the QA processes, the organization's first thought was how to eliminate the problems that QA might face. 100% accuracy is not possible no matter what is done, including straight text parsing. In this case, the reason it's not possible is that even in a pixel-perfect document, there are situations where a field might be located partially, located in excess, or not located at all. What most often occurs in pixel-perfect documents is that text may sometimes be seen as a graphic because it's so clean, and text that is too close to lines is ignored. So in these types of documents, a field error is typically a partially located field. Most QA systems can be set up so that rules check the data structure of fields, and if the data contained in them is faulty, an operator can check the field and expand it if necessary. But this is only possible if the QA system is tied to the data capture system.

After further conversation, it became clear that the data capture solution is being forced to fit into a QA model. There are various reasons why this may happen: license cost, a pre-existing QA process, or a misunderstanding of what QA can do. This is very common for organizations and very often problematic. Quality assurance is a far simpler process to implement than data capture. When it comes to data capture, it is more important to focus on the functionality of the data capture system and then develop a QA process that makes its output most efficient.

Exceptions happen! When working with advanced technologies in Data Capture and forms processing, you will always have exceptions. It's how companies choose to deal with those exceptions that often makes or breaks an integration. Too often exception handling is not considered for data capture projects, but it's important. Exceptions help organizations find areas for improvement, increase the accuracy of the overall process, and, when properly prepared for, keep return on investment (ROI) stable.

There are two phases of exceptions: those that make it to the operator-driven quality assurance step, and those that are thrown out of the system. It would take some time to list all the possible causes of these exceptions, but that is not the point here; the point is how best to manage them.

Exceptions that make it to the quality assurance (QA) process have a manual labor cost associated with them, so the goal is to make the checking as fast as possible. The best first step is to use database look-up for fields. If you have pre-existing data in a database, link your fields to this data as a first round of checking and verification. Next is to choose proper data types. Data types are formatting rules for fields. For example, a date in numbers will only contain digits and forward slashes, in the format NN"/"NN"/"NNNN. By allowing only these characters, you make sure you catch exceptions, and you can either give the data capture software enough information to correct them (if you see a g, it's probably a 6) or show the verification operator exactly where the problem is. The majority of your exceptions will fall into the quality assurance phase. There are some exception documents, however, that the software is not confident about at all; these end up in an exception bucket.
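A sketch of such a data-type check in Python. The confusion pairs listed are illustrative examples, not an exhaustive table from any particular engine:

```python
import re

# Common ICR/OCR confusions: if a letter appears where only digits are
# allowed, substitute the likely intended digit before rejecting.
CONFUSIONS = {"g": "6", "O": "0", "l": "1", "S": "5"}
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")  # the NN/NN/NNNN data type

def normalise_date(raw: str):
    """Auto-correct a recognised date field, or flag it for an operator."""
    cleaned = "".join(CONFUSIONS.get(ch, ch) for ch in raw)
    if DATE_RE.fullmatch(cleaned):
        return cleaned          # corrected automatically, no operator needed
    return None                 # still invalid: route to verification
```

The point is that the data type does double duty: it lets the software fix the easy confusions on its own, and for everything else it pinpoints the exact field the operator must look at.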

Whole exception documents that are kicked out of the system are the most costly and, if not planned for, can be the killer of ROI. The most common cause of these exceptions is a document type or variation that the system has not been set up for. It's not the fault of the technology; as a matter of fact, because the software kicked the document out rather than trying to process it incorrectly, it's doing a great job! The mistake companies make is giving every document in this category the same attention, and thus additional fine-tuning cost. But what happens if that document type never appears again? The company has just reduced its ROI for nothing. The key to these exceptions, whether they are whole document types or just portions of one particular document type, is to set a standard: an exact problem has to repeat X times (based on volume) before it's given any sort of fine-tuning effort.
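The repeat-X-times rule can be as simple as a counter keyed on an exception signature. A sketch, where the threshold of 5 is an arbitrary stand-in for whatever your volume justifies:

```python
from collections import Counter

FINE_TUNE_THRESHOLD = 5   # set X based on your document volume
_seen = Counter()

def should_fine_tune(signature: str) -> bool:
    """Escalate only after the same exact problem has repeated enough times."""
    _seen[signature] += 1
    return _seen[signature] >= FINE_TUNE_THRESHOLD
```

A one-off document variation never reaches the threshold, so it never consumes fine-tuning budget; a recurring one surfaces on its own.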

Only with an exceptional exception handling process will you have an exceptional data capture system and ROI.

When it comes to forms processing and data capture, working with documents that contain hand-print vs. handwriting makes a huge difference in accuracy and validity. Sometimes the difference between the two is not so clear. So how do you tell if your form is hand-print, handwriting, or, better yet, both?

ICR (Intelligent Character Recognition) is the algorithm used in place of OCR for characters generated by a human hand. The algorithm is more dynamic, as a person's hand-print changes slightly by the minute. It's possible to be very accurate when processing hand-print forms when the form is designed correctly. When doing this type of forms processing you will always have quality assurance steps, but you can get close to the accuracy of an OCR process. Very often, forms that were not created with data capture or automatic extraction in mind will contain handwriting. The reason is that hand-print is usually guided by the form itself. Forms without hand-print cannot be expected to be processed at high accuracy. So what makes hand-print, hand-print?

Mono-spaced text: This means that each character, as it is filled out, is the same distance from all the other characters. In handwriting you will very often have characters that connect; in the extreme form, this is cursive. When characters touch or are not spread out evenly, you get improper segmentation, and characters are clumped together as one or split in half during recognition. Mono-spaced text is usually achieved using boxes on the form that guide the user to fill within them.
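With mono-spaced boxes, segmentation reduces to slicing the field image at a fixed pitch. A sketch using the 4 mm block and 2 mm gap figures from the form-design rules above; the coordinate scheme is an assumption for illustration:

```python
def segment_positions(field_x: float, n_chars: int,
                      box_w: float = 4.0, gap: float = 2.0):
    """Left/right edges (in mm) of each character box along a field."""
    pitch = box_w + gap  # one character repeats every 6 mm
    return [(field_x + i * pitch, field_x + i * pitch + box_w)
            for i in range(n_chars)]
```

For example, `segment_positions(0, 3)` yields `[(0.0, 4.0), (6.0, 10.0), (12.0, 16.0)]`: three clean, non-overlapping slices, with no guesswork about where one character ends and the next begins.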

Uniform height and width: Similar to mono-spaced text, the text as it is filled in should have a more or less uniform height and width. This forces the writer not to introduce as many variable elements as they would in straight handwriting, and increases accuracy. This is also accomplished using boxes on the form to keep users within boundaries.

Stable base-line: This aspect of hand-print is the least thought about, but it is very important. Text must always sit on the same horizontal base-line. What typically happens in handwriting is that the writer drifts up and down on an invisible base-line. You may have noticed that sometimes when you write, the end of a line is lower than the beginning. Base-lines are important for OCR and ICR to get proper character segmentation and to recognize a few key characters such as "q" and "p", the "tail" characters.

Sans-serif: The last element is keeping characters sans-serif. The reason is that the extra tails on characters can cause confusion between certain characters, like "o" vs. "q" and "c" vs. "e". The way to achieve this is less obvious: put a guide at the top of the form that shows a good character and a bad character.

ICR is a technology for hand-print recognition and can be very accurate given the proper guides. Today, handwriting and cursive automation is not complete and is usually only successful when augmented with other technologies such as database look-up and CAR and LAR. Sometimes the difference between the two is unclear, but the above 4 elements provide a clear definition of hand-print. The best hand-print to be found is that of the highly trained creators of engineering drawings, whose print is so perfect it very closely resembles typographic text.

It's hard for people to accept the possibility of over-cleaning a scanned image. I myself would love to believe that you can clean up an image so much that it does not matter what OCR technology you use, it will always be 100% accurate. The fact is, however, that OCR engines don't work this way. There are particular ways to improve the quality of a document, and there are ways that image clean-up hurts your OCR accuracy. I am going to talk about two such phenomena: fuzzy characters, and characters with legs.

In data capture, a commonly sought-after imaging technique is line-removal. Line-removal attempts to find all the lines in a form and make them disappear. It is especially tempting on forms where text is filled into fields in which each character, and the field itself, is bounded by lines. Most forms processing tools, however, have advanced to the point where they incorporate the lines into the algorithm and anticipate them being there; they can recognize the characters even with the lines present. What often happens when a line-removal algorithm is used is that you get characters with legs. As the name suggests, these are characters where a portion of the line remains on the top and/or bottom where it touches the character. The result is that the character no longer looks like its original self. Most characters become unrecognizable; others become another character entirely, for example an H becomes an A and an I becomes a T. For this reason, line-removal is no longer a recommended image clean-up tool for data capture.

The next imaging technique is either extremely beneficial to data capture or detrimental; it all depends on the form itself. I'm talking about despeckle. Despeckle is the algorithm that removes annoying dots on the document, both enhancing the reading of characters and removing garbage that might otherwise be recognized as characters. Despeckle is usually beneficial to data capture, especially on hand-print forms where the dots can interfere with the ICR algorithm. Where despeckle hurts data capture and forms processing is when the dots touch characters. Similar to line-removal, if a dot is touching a character, the segmentation tool believes it's part of the character and leaves it. Thus you get fuzzy characters, which are very difficult for OCR engines to read. There is a simple test: look at your form and notice whether or not the dots touch the characters. If they do, you are better off working with the dots.
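The touch-the-character behaviour falls out of how a typical despeckle works: it removes small connected components of ink, and a dot that touches a stroke is no longer its own component. A minimal sketch on a binary raster, assuming 1 means ink and 4-connectivity:

```python
from collections import deque

def despeckle(image, min_size=3):
    """Remove connected ink components smaller than min_size pixels.

    A speck touching a character merges into the character's component,
    so it survives -- exactly the case where despeckle stops helping.
    """
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if image[y][x] and not seen[y][x]:
                # Flood-fill one connected component of ink pixels.
                comp, queue = [], deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)):
                        if 0 <= ny < h and 0 <= nx < w and image[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                # Keep only components large enough to be real marks.
                if len(comp) >= min_size:
                    for cy, cx in comp:
                        out[cy][cx] = 1
    return out
```

An isolated dot is dropped, a character stroke is kept, and a dot fused to a stroke is kept along with it, leaving the fuzzy character described above.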

These two examples demonstrate huge differences in OCR accuracy, and they come down to choices made on the image itself, independent of setup or the software you use.