Invoices are one of the most requested document types to automate. Let's talk a little about what it takes to be successful in invoice processing. Data capture is the technology used for invoices: it extracts, field by field, the information you want from the invoice. To automate invoices with high accuracy and make use of a boxed invoice solution, you need to do some preparation. Here are four must-have steps:

1.) Separate your commercial invoices from specialized invoice types such as legal, manufacturing, telecommunication, etc. The reason is that commercial invoices are the low-hanging fruit of invoice automation: software packages have put the most effort into these documents. By working with them first, you ensure success on a large portion of your invoices and can then tackle the remainder.

2.) Know how many vendors you have. Understanding the makeup of your invoices is very important. Your focus should be determined by the invoices that are easiest to automate and make up the greatest portion of your total volume. So make a list of all your vendors and the percentage of overall paper volume each one represents.

3.) Know whether you want to collect line-item data. At first glance, the majority of companies say they want line items, only to change their minds later. Find the business process that mandates collecting line items. In your current process, are line items being entered? What database of existing information will you use to support line-item extraction? In the end, most companies decide against line items, or extract them only for a limited set of critical vendors.

4.) Know how you are going to check the quality of extraction. Quality assurance happens through human review and business rules, so know beforehand how you want those to work. For example, a simple business rule could be that all line items must add up to the total amount; if they don't, someone reviews the entire invoice.
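The line-item business rule above can be sketched in a few lines. This is a minimal illustration, not any product's API; the field names (`total`, `line_items`, `amount`) are assumptions made for the example.

```python
# Hypothetical sketch of a line-item business rule: if the extracted
# line-item amounts do not sum to the extracted invoice total, flag
# the invoice for human review. Field names are invented for this example.

def needs_review(invoice, tolerance=0.005):
    """Return True when line items fail to add up to the total."""
    line_sum = sum(item["amount"] for item in invoice["line_items"])
    return abs(line_sum - invoice["total"]) > tolerance

invoice = {
    "total": 150.00,
    "line_items": [{"amount": 100.00}, {"amount": 45.00}],
}
flagged = needs_review(invoice)  # 145.00 != 150.00, so a human reviews it
```

In a real deployment, the rule would run at export time and route failing invoices into the human verification queue.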

These four steps are not the end-all of improving your invoice processing accuracy, but they are necessary, and all are worth considering before you look at purchasing a boxed invoice processing solution.

All technology markets are guilty of coining at least one or two confusing terms. In the document imaging world, it's terms with very similar-sounding names: technically similar, but strictly different.

One of the most confusing distinctions in the imaging world is the difference between image capture software, often just called capture, and data capture software. Not only are the names confusing, but technically there is a lot of overlap: all data capture products have imaging capabilities, and all capture products have basic data capture. The risk of the confusion is substituting one product for the other. For example, organizations that try to use the data capture functionality built into a capture application for a full-blown project end up with little success and a lot of frustration. Let me explain where each fits.

Capture products have the primary function of delivering quality images in a proper document structure. They often feature image clean-up, review, and page-splitting tools that are more advanced than the scanning found in data capture applications. Most offer what is called rubber-band OCR: reading a specific coordinate region on a page. Some go as far as creating templates where coordinate zones are saved. This is where the solutions get confused with data capture. Until there is registration of documents and a proper forms processing approach, it is not data capture. The risk of such basic templates is low accuracy and zones that do not always capture the data.
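A saved coordinate zone of the kind described above can be sketched as follows. This is a toy illustration of the idea, assuming words have already been recognized with their positions; the template name and coordinates are invented, not any product's format.

```python
# Illustrative sketch of a saved template zone: a fixed rectangle on the
# page, where extraction keeps whatever recognized words fall inside it.
# Words are (x, y, text) tuples from a prior OCR pass; all names and
# coordinates here are invented for illustration.

TEMPLATE = {"invoice_number": (400, 0, 600, 40)}  # left, top, right, bottom

def read_zone(words, zone):
    left, top, right, bottom = zone
    hits = [t for (x, y, t) in words if left <= x <= right and top <= y <= bottom]
    return " ".join(hits)

words = [(420, 12, "INV-1042"), (50, 12, "ACME"), (430, 300, "Total")]
value = read_zone(words, TEMPLATE["invoice_number"])  # "INV-1042"
```

The fragility is visible in the sketch: if the document shifts or a vendor prints the number elsewhere, the fixed rectangle simply misses the data, which is exactly why this is not true data capture.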

Data capture products need images to function, so adding scanning to those solutions was an obvious choice. These solutions, however, are better fed by a full capture application that has the performance and additional features, such as batch naming, annotations, and page splitting, that the organization may require in the resulting image files. In data capture, the purpose of image capture is to obtain data only, and such products sometimes neglect the features that matter for image storage and archival.

In the end, both kinds of solutions are improving in the other's territory. Eventually the lines will blur to the point where, feature-wise, they are identical, and the benefit of one over the other will be rooted in the vendor's expertise, either capture or data capture. If your primary requirement is quality images, a capture vendor's solution is the better choice; if it's data extraction, then solutions rooted in data capture are better.

Documents containing tables carry the majority of their printed information in those tables, so the demand to collect this data is very high. In data capture, organizations choose among three scenarios for collecting data from these documents: ignore the table, get the header and footer plus just a portion of the table, or get it all. Ideally, organizations prefer the last option, but some strategic decisions have to be made before any integration involving tables. One of those decisions is whether to capture the data in the table as a large body of individual fields or as a single table block. Let's explore the benefits and downsides of both.

Why would you ever capture a table as a large collection of individual fields when you can collect it as a single table field? Accuracy. In theory, it will always be more accurate to collect every cell of a table as its own individual field. The reason is that you precisely locate each field, remove the risk of partially collected cells or cells where the baseline is cut, and exclude white space and lines from fields. In some data capture solutions this is your only choice, so many have made it very easy to duplicate fields and make small changes, reducing the time it takes to create so many fields. This is a great tool, because the downside of treating tables as collections of individual fields is the time it takes to create all those fields, and that time may be too great to justify the increase in accuracy.

If your data capture application can collect data as a single table block, you can very quickly do the setup for any one document type. Table blocks require document analysis that can identify table structures in a document. The table block relies heavily on the identified tables and then applies column names per the logic in your definition. This is what creates its simplicity, but also its problems. Sometimes document analysis finds tables incorrectly, more often partially. This can cause missing columns, missing rows, and, in the worst case, rows where the text is split vertically between two cells or columns cut in half horizontally.

Tables vary in complexity, and that is most often the deciding factor in which approach to take. The required accuracy, and the amount of integration time needed to reach it, also frequently determine the approach. For organizations that want line items but do not strictly require them, table blocks are ideal. For organizations needing high accuracy and processing high volume, individual fields are ideal. In any case, it's something that needs to be decided before any integration work.
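A table block captured as a single field must be parsed downstream into rows and columns. Here is a minimal sketch of that step, assuming the block is exported as plain text with columns separated by runs of two or more spaces; the column names and export format are assumptions for illustration.

```python
# Minimal sketch of parsing a captured "table block" into row dicts.
# The export format (text lines, columns split on 2+ spaces) and the
# column names are invented example assumptions.

import re

COLUMNS = ["description", "qty", "amount"]

def parse_table_block(block):
    rows = []
    for line in block.strip().splitlines():
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) == len(COLUMNS):  # skip partially detected rows
            rows.append(dict(zip(COLUMNS, cells)))
    return rows

block = "Widget A   2   20.00\nWidget B   1   15.50"
items = parse_table_block(block)  # two row dicts keyed by column name
```

Note how the partial-detection problem described above surfaces here: a row split vertically between two cells would yield the wrong cell count and be silently dropped, which is why table blocks trade setup speed for accuracy.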

Barcodes are a great technology. You can fit a lot of information in a barcode, they can be read at any angle, and they are very accurate: you have to degrade roughly 30% of a barcode before it becomes unreadable. In data capture, barcodes are commonly used for batch cover sheets, for document separation, or printed on the documents themselves. They have proven to be a time saver, both in quality and because they can be read very quickly by software-based and hardware-based solutions alike. What organizations often don't think about is the additional time and cost that barcodes add to the capture process.

Organizations usually don't connect document creation and prep time with data capture time. The total time and cost associated with capturing documents is not just from the point of scan to export; it includes all the additional steps leading up to the scan that get the document into the state it needs to be in before scanning. If an organization uses barcode pages to separate documents, it's the time it takes for an operator to generate the pages and manually place them between documents. If an organization uses barcode pages for batch separation, it's the time it takes to create the unique barcode for each batch and place it on top of the batch before scanning. These are just the two most common examples, but there are many more. The misconception persists because the person creating barcodes and separating documents is not the same person doing the scanning, or the barcodes were created in advance and the time they took is forgotten.

Because organizations are not counting this in the total capture process, they are missing the real data capture time and cost. It's no surprise, then, when they maintain high paper costs and fail to reach the ROI they expected. Barcodes are a great tool, but they should be used when their benefit is greater than their time cost. Benefits can include accuracy and process molding. Very seldom are barcodes alone responsible for substantial cost savings. Very often organizations don't realize that they could in fact do away with barcodes by using advanced data capture. Accuracy may suffer slightly, but the time savings are substantially greater.
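A back-of-the-envelope calculation makes the point about hidden prep time. All the per-step timings below are invented example numbers, not measured figures.

```python
# Rough sketch of total capture time including barcode prep, to show
# why prep steps belong in the ROI math. The per-document timings are
# invented example numbers.

def total_capture_seconds(docs, prep_per_doc, scan_per_doc, export_per_doc):
    return docs * (prep_per_doc + scan_per_doc + export_per_doc)

# 1,000 documents: 20 s to print and insert a separator sheet, 5 s to
# scan, 2 s to export. Prep dwarfs the visible scan-to-export time.
with_barcodes = total_capture_seconds(1000, 20, 5, 2)  # 27,000 s total
without_prep = total_capture_seconds(1000, 0, 5, 2)    # 7,000 s total
```

Counting only scan-to-export would report about a quarter of the real time in this example, which is exactly the gap that makes projected ROI miss.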

I find myself educating even industry peers on the topic of document type structure more and more often. The conversation usually starts with one of them telling me that unstructured document processing exists, or that a particular form is fixed when it is not. Understanding what is meant when talking about document structure is very important.

First, let's start by defining a document. A document is a collection of one or many pages that has a business process associated with it. Documents of a single type can vary in length, but the content contained within, or the possibility of it existing, is constrained. Data capture technology works on pages, so each page of a document is processed as a separate entity, and this, it seems, is the heart of the confusion.

Often someone will say a document is unstructured. What they mean is that the order of pages is unstructured, which is more or less accurate; however, the pages within this "unstructured" document are themselves either fixed or semi-structured. The only truly unstructured documents are contracts and agreements. Here is the test: if at any moment you can pull a page from the document and state what that page is and what information it will contain, then it is NOT unstructured.

The ability to process agreements and contracts is limited to very specific scenarios where contract variants are non-existent, which essentially makes those documents structured. In general, the ability to process unstructured documents does not exist. Now to explore the difference between semi-structured and fixed.

It's actually very easy, because 80% of the documents that exist are semi-structured. Even if a field appears in the same general location on every page of a particular type, that does not make it fixed. For example, a tax form always has the same general location for printing the company name. The printer has to print within a specified range, but it can print more to the left or more to the top, and the length will vary with every input name. This makes it semi-structured, and additionally, when this document is scanned it will shift left, right, up, and down by small amounts. A document is ONLY truly a fixed form when it has registration marks and fields of fixed location and length. Registration marks are how the software matches every image to the same set of coordinates, making it more or less identical to the template.

Here again the confusion is exposed. When having conversations about data capture, it's very important to understand the true definitions of the lingo being used. I task you: if you catch someone using the lingo incorrectly, correcting it will help both you and them.

When it comes to the input path that documents follow, for many it's as simple as scan, convert, save, but others require more complex workflows. The good news is that there are tools out there to perform even the most advanced workflows you could imagine. The bad news: they are expensive. I'm here to tell you about a way of combining your scanner with data capture, OCR, and document conversion software to build more complex workflows without the premium.

By using settings that come with most document scanners and the ability of most data capture, OCR, and document conversion products to use hot folders (watch folders), you can create robust multi-step workflows out of the box. What you need is a scanner that supports multiple destinations, usually 9 or more. This is indicated by an LED on your document scanner which, at the point of a batch scan, allows you to pick a destination number. Second, you will need all the software required to perform the conversions for the final result. In our example we want to be able to OCR, data capture, compress, and archive.

Basically, the task is to create a funnel for your documents whose end result is saved wherever you want the final destination to be. If your scanner supports what is called dual-stream scanning, you can work with two funnels simultaneously, making your workflow all the more robust. The first part of the funnel is identifying the document type. Each of the 9 destinations on your scanner should be configured for one document type (you may want one destination per business process instead). The configuration includes the scan settings, 300 DPI of course, and the folder the document will go into. This is just the staging folder for the next step. Let's assume that we set up destination 1 for invoices and that our scanner supports dual-stream. When it's all said and done, we want one copy of each invoice saved in a searchable directory, with the file both compressed and in PDF/A format. Then we want another copy of the same invoice to be data captured and put in a working directory for someone to review. Let's put it all together.

Destination 1 on the scanner is configured for invoices. The first copy of any invoice is saved to a hot folder that the PDF conversion utility is watching; the second copy is scanned into a hot folder that the data capture product is watching. Because these are hot folders, both copies are picked up instantly and processed by each application. Our requirement for the second copy was only to be data captured and exported to a working directory, so its task is now complete. For the first copy we have more conversions to do. The PDF conversion utility saves the OCRed searchable PDF to a hot folder for the compression utility, the compression utility compresses the PDF and saves it to a hot folder for the archive utility, and FINALLY the archive utility saves the result in our final destination for all invoices. Below is a basic diagram of the workflow we created for invoices (destination 1).

Although it may have been slightly difficult to follow, hopefully it's clear that the above is just one workflow getting the most out of the tools offered by both the document scanner and the conversion software packages. Now you can proceed to program each of the other destinations with different document types and their associated workflows. Programmers and tech-savvy individuals will easily envision ways to add scripts that make the process even more robust, with email notifications and so on. This approach is not a replacement for advanced workflow products, but a middle ground between no workflow and very pricey workflows.

Because of the way the market has come to understand OCR (typographic recognition) and ICR (hand-print recognition), it is no surprise that some of the most common questions and expectations about the technology sound like they came from a tarot card reading. Earlier I talked about one of these questions, "How accurate is it?", and how the premise of that question is completely off and can come to no good. Here is a similar one, "It learns, right?", which is quite a loaded question, so let's explore.

Learning is the process of retaining knowledge for subsequent use. Learning is based in the realm of fact: following the same exact steps creates the same exact results. OCR and ICR arguably learn every time they are used; for example, engines will do one read and then go back and re-read characters with low confidence values using patterns and similarities they identified on that single page. This happens at the page level, and after the page is processed, that knowledge is gone. This is where the common question comes in. What people expect is that when the OCR engine makes an error on a degraded character that is later corrected, that character will never produce an error again; if this were true, you would believe that at some point the solution would be 100% accurate, once all possible errors had been seen.

WRONG! The technology does not remember sessions, and this is also the reason it works so well. Imagine a forms processing system processing all the surveys generated by a single individual (this holds for OCR as well); the processing happened enough that it learned all possible errors and reached 100%. Then you start processing a form generated by a new individual: your results on both the first form type and the new one will likely be horrendous, not because of the recognition capability, but because of the supposed "learning". In this case, learning killed your accuracy as soon as any variation was introduced.

What most people don't realize is that characters change: they change based on paper, printer, humidity, handling conditions, etc. In ICR the effect is exaggerated, as the characters of a single individual change by the minute, based on mood and fatigue. So learning is a misnomer, because what you are learning is only one page, one printer, one time, one paper that will likely never repeat again. A successful production environment allows as much variation as possible at the highest accuracy, and this is not achieved with this type of learning.

Things that can be learned: as I said before, a single pass of a page can be followed by a second pass over low-confidence characters using patterns learned on that page. In the world of data capture, field locations and field types can be learned. In the world of classification, documents are learned based on content; this, in fact, is what classification is.
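The page-level second pass can be sketched as a toy example. Everything here is invented to illustrate the idea, not a real engine's API: a character is a `(glyph_id, text, confidence)` tuple, and low-confidence reads are replaced by a majority vote over confident reads of the same glyph shape on the same page.

```python
# Toy sketch of page-level "learning": re-read low-confidence characters
# using confident reads of the same glyph shape found elsewhere on the
# page. All names and values are invented for illustration.

def second_pass(chars, threshold=0.80):
    votes = {}
    for glyph, text, conf in chars:
        if conf >= threshold:
            votes.setdefault(glyph, {}).setdefault(text, 0.0)
            votes[glyph][text] += conf
    out = []
    for glyph, text, conf in chars:
        if conf < threshold and glyph in votes:
            text = max(votes[glyph], key=votes[glyph].get)  # majority vote
        out.append(text)
    return "".join(out)  # the page-level knowledge is discarded after this

page = [(1, "O", 0.95), (2, "K", 0.90), (1, "0", 0.40)]
result = second_pass(page)  # "OKO": the shaky "0" is re-read as "O"
```

The key point from the article is the last comment: the vote table lives only for this one page, which is exactly why the "it learns, right?" expectation does not hold across documents.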

While the idea of errors never repeating is attractive, people need to understand that this technology is so powerful because of the huge range of document types and text it can process, and this is only possible by allowing variance.

When considering the ROI of a data capture integration, setup time is one of the most important and most often miscalculated factors. Not just the setup time for the initial integration: the setup time used for fine-tuning and optimization may sometimes postpone production.

The difference in setup time between a fixed data capture environment, where coordinate-based fields are used, and a rules-based semi-structured environment is substantial. It's not usually the fixed environments that pose the biggest challenge in calculating or predicting ROI: it takes an administrator, on average, between 15 and 45 seconds to create and fine-tune a fixed-form field. In semi-structured processing, field setup can take anywhere from 60 seconds to hours, depending on the complexity of the document and the logic being deployed. It's this large gap that throws a wrench into some ROI calculations.
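The size of that gap is easy to see with a quick calculation. The field counts and per-field times below are invented example figures chosen within the ranges mentioned above.

```python
# Rough sketch of the setup-time gap between fixed and semi-structured
# environments. Field counts and per-field seconds are invented example
# numbers within the ranges discussed in the text.

def setup_hours(fields, seconds_per_field):
    return fields * seconds_per_field / 3600

fixed = setup_hours(40, 30)   # 40 fixed-form fields at ~30 s each
semi = setup_hours(40, 600)   # the same 40 fields at ~10 min each
# fixed is about a third of an hour; semi is nearly a full workday,
# a 20x difference from the same field count.
```

An ROI model that assumes fixed-form setup times for a semi-structured project will underestimate integration effort by an order of magnitude or more.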

For experienced integrators, putting a document and its associated fields into complexity classes is usually pretty easy. After doing so, gauging the average amount of time to set up each field, and thus all fields, should be accurate; there is always a field or two that requires extra fine-tuning. The key is a complete understanding of the document. Sometimes document variations are obvious; other times they sneak up on you, and you have no idea a variation exists until you start working with it. Knowing all variations is the easiest way to understand the additional time any field will take to set up, and variants are the biggest contributor to time in semi-structured data capture setup. Second is odd field types, such as fields that span one to many lines or continue across two separate lines, and finally tables. The third-largest contributor to setup time is poor document quality. This means the administrator has to be more general when creating fields and likely has to deploy multiple pieces of logic per field to locate information in several possible ways.

When calculating the ROI of your data capture project, be aware of these sometimes sneaky factors that can eat into integration time. Bottom line: know your documents and know the technology before any work is done. If you are unsure, seek professional assistance.

If you are thinking about using data capture to read text from tax returns, now is the time to start thinking about the steps to accomplish it. Reading typographic tax returns from current and previous years has proven to be very accurate and a great use of data capture and OCR technology. Tax returns fall into the medium-complexity category for automation. A few things make tax returns unique.

Checkmarks: Tax returns have two types of checkmarks. The first are standard checkmarks printed in the body of the document; these can be handled like all other common checkmark types. The other type is unique to tax forms and typically appears on the right side of the document: boxes that can be filled with a character or a checkmark symbol. With these, the best approach is to create a field covering the entire area where the checkmark can be printed and set the checkmark type to "white field". In this case the software expects only white space, and the presence of enough black pixels will mark it as checked.

Tabular data: Much of the data in a tax form is presented as a table. When capturing data from a table, organizations have to decide whether they want to capture each cell of the table as its own field, or capture the data as a table field that must be parsed later. This can dramatically affect the exported results, so knowing beforehand is very important.

Delivery type: Tax forms usually come either as eFile, a pixel-perfect document that is never printed and never scanned, or as a scanned document, received first as paper and then scanned. For the most part, the eFile version of the tax form will be more accurate; however, it has the non-traditional checkmarks that can cause problems. Organizations need to decide whether to process all delivery types together as a single type or separate them. There are advantages to both: combining them reduces integration time, while separating them increases accuracy.
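The "white field" checkmark test described above can be sketched with a binary pixel grid. This is a toy illustration, not any product's implementation; the page is a 2D list of 0 (white) and 1 (black) values, and the 5% threshold is an invented example value.

```python
# Sketch of a "white field" checkmark: the zone is expected to be blank,
# so enough black pixels means "checked". Pixels are 0 (white) or
# 1 (black); the threshold is an invented example value.

def white_field_checked(pixels, zone, threshold=0.05):
    left, top, right, bottom = zone
    region = [row[left:right] for row in pixels[top:bottom]]
    black = sum(sum(row) for row in region)
    area = max(1, (right - left) * (bottom - top))
    return black / area >= threshold

page = [[0] * 10 for _ in range(10)]
page[2][3] = page[2][4] = page[3][3] = 1  # a small mark inside the zone
checked = white_field_checked(page, (2, 1, 6, 5))  # 3 of 16 pixels black
```

Because the test only measures blackness, it works regardless of whether the filer wrote an "X", a checkmark symbol, or a character, which is exactly why this approach suits the non-standard tax-form boxes.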

I would much rather OCR a tax return than file one. Because of this, the skills I've developed in processing tax returns are better than my skills in creating them, and I hope today I imparted some of that knowledge.

It is not too often that companies using data capture technology have the chance to change their form designs or even create new ones. If you have this ability, USE IT! A properly designed form is the first step to success in automating that form. There are many things you can do to make sure your form is as machine readable as possible. Typically the forms in question are hand-written, but occasionally they are machine filled. I will highlight the major points.

Cornerstones. Make sure your form has cornerstones in each corner of the page. The cornerstones should be at 90-degree angles to their neighbors, and the ideal type is a black 5 mm square.

Form title. A clear title in 24 pt or larger print, with no stylized font.

Completion guide. This is optional, but it is sometimes useful to print a guide at the top of the form on how best to fill in fields of the types you use.

Mono-spaced fields. For the fields to be completed, it's best to use field types with character-by-character separation. Each character block should be 4 mm x 5 mm, and blocks should be separated by 2 mm or more. The best field types to use, in order, are letters separated by a dotted frame, letters separated by a drop-out color frame, and letters separated by complete square frames.

Segmented fields by data type. For certain fields, it is important to segment the field into portions to enhance ICR accuracy. The best example is a date: instead of having one field for the complete date, split it into 3 separate parts, with the first being a month field, the next a day field, and the last a year field. The same is often done for numbers, codes, and phone numbers.

Separate fields. Separate each field by 3 mm or more.

Consistent fields. Make sure the form consistently uses the field types described under mono-spaced fields above.

Form breaks. It’s OK to break the form up into sections and separate those sections with solid lines. This often helps template matching.

Placement of field text. For the text that indicates what a field is ("first name", "last name"), it is best to left-justify it to the left of the field at a distance of 5 mm or more. DO NOT put the field descriptor in drop-out ink inside the field itself.

Barcode. Barcode form identifiers are useful for form identification. Use a unique ID per form page, and place the barcode at the bottom of the page at least 10 mm from any field.
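The design rules above can be collected into a simple machine-readable form definition, useful as a checklist when laying out a new form. All measurements are in millimetres; the structure and field names are an invented sketch, not any product's template format.

```python
# Invented sketch of a form-design template capturing the rules above.
# Not a real product format; measurements are in millimetres.

FORM_TEMPLATE = {
    "cornerstones": {"shape": "square", "size_mm": 5, "color": "black"},
    "title": {"min_point_size": 24, "stylized": False},
    "fields": [
        {
            "name": "first_name",
            "type": "mono_spaced",     # character-by-character boxes
            "box_mm": (4, 5),          # width x height per character
            "box_gap_mm": 2,
            "label_offset_mm": 5,      # label left of field, never inside it
        },
        {
            "name": "date",            # segmented by data type for ICR
            "segments": ["month", "day", "year"],
            "type": "mono_spaced",
            "box_mm": (4, 5),
            "box_gap_mm": 2,
        },
    ],
    "field_spacing_mm": 3,             # minimum distance between fields
    "barcode": {"position": "bottom", "clearance_mm": 10},
}
```

A definition like this could drive both form generation and a design-review script that flags fields violating the spacing or sizing rules.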