Let’s face it: OCR is not 100% accurate 100% of the time. Accuracy depends heavily on document type, scan quality, and document makeup. The good news is that OCR doesn’t have to be perfect to be powerful. So how do we give OCR the best chance to succeed? There are many ways; the one I would like to talk about now is quality assurance.

Quality assurance is usually the final step in any OCR process: a human reviews uncertainties and business-rule flags in the OCR result. An uncertainty is a character the software flags because it did not satisfy a confidence threshold during recognition. This process is a balancing act between the desire to limit human time as much as possible and the need to see every possible error, but no more.

Let’s start with the review of uncertainties. Here an operator looks at just those characters, words, or sentences that are uncertain, as determined by the OCR product, which will have some indicator of what they are. In full-page OCR, spell checking is often used. In Data Capture, a field is usually reviewed character by character and you don’t see the rest of the results. Some organizations set critical fields to always be reviewed, no matter the accuracy. Others may decide that a field is useful but does not need to be 100%. Each package has its own variation of “verification mode”. It’s important to know its settings and the levels of uncertainty your documents are showing in order to plan your quality assurance.
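As a sketch, the flagging step looks something like this. The (character, confidence) pair structure and the 0.85 threshold are illustrative stand-ins for whatever your engine actually exposes:

```python
def chars_for_review(ocr_chars, threshold=0.85):
    """ocr_chars: list of (character, confidence) pairs, as most engines
    can expose through their API. Returns the positions needing review.
    The 0.85 threshold is illustrative; tune it to your documents."""
    return [(i, ch) for i, (ch, conf) in enumerate(ocr_chars) if conf < threshold]
```

Everything this function returns goes in front of an operator; everything else passes straight through, which is where the time savings come from.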

After the characters and words have been checked in Data Capture, there is an additional quality assurance step: business rules. In this process, the software applies rules the organization creates and checks them against the fields. A good example might be “don’t enter anyone into the system whose birth year is earlier than 1984”. If such a document is found, it is flagged for an operator to check. These rules can be endless, and today’s packages make it very easy to create custom ones. The goal should be to first deploy the business rules you already have in place in your manual operation, then augment them with rules that enhance accuracy based on the raw OCR results you are seeing.
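A minimal sketch of such a rule check, using the birth-year example above (the field names are hypothetical):

```python
def birth_year_rule(fields):
    """The example rule from above: flag birth years earlier than 1984.
    Returns True when the record passes."""
    return fields.get("birth_year", 0) >= 1984

def passes_all(record, rules):
    """Run every rule against a record; any failure flags it for an operator."""
    return all(rule(record) for rule in rules)
```

New rules are just more functions in the list, which is why packages can make custom rules so easy to add.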

In some more advanced integrations, a database or body of knowledge is deployed as a first, still-automated round of quality assurance.

These two quality assurance steps combined should give any company a chance to achieve the accuracy it is seeking. Companies that fail to recognize or plan for this step are usually the ones that have the biggest challenges using OCR and Data Capture technology.

Often the purpose of doing Optical Character Recognition (OCR), for individuals and companies alike, is to get a digital version of a document that they intend to edit and/or re-purpose. This is not the most common use of the technology, but it is one that requires specific attention.

In order to convert a document so that it is printable later on, it’s important to capture not only the text of the document but also its formatting. This includes layout as well as things such as graphics and font colors. To do this, the OCR product must be able to recognize colors (which requires color scanning), recognize font styles, and, very importantly, recognize document structure.

Engines that support advanced document analysis can do this. Document analysis (DA) is the process that happens before any text is read on a page. It makes sense of a document in order to improve recognition and to capture the formatting required for a formatted export. First, document analysis finds the document structure: columns, tables, text blocks, paragraphs, and lines. Once this is done, it identifies colors in text and graphics. After document analysis has done its job, recognition can begin. During recognition, font styles are detected: bold, italic, underlined. All of this is put together into a result formatted as closely as possible to the input document.
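The order of operations can be sketched as a small pipeline; the `engine` methods below are hypothetical names, not any product’s real API:

```python
def convert_with_formatting(page, engine):
    """Sketch of the DA-then-recognition order described above. The
    `engine` methods are hypothetical names, not a vendor's real API."""
    layout = engine.analyze_layout(page)         # columns, tables, paragraphs, lines
    colors = engine.detect_colors(page, layout)  # colors in text and graphics
    text = engine.recognize(page, layout)        # font styles detected during OCR
    return engine.export_formatted(text, layout, colors)
```

The point of the sketch is the ordering: structure first, then colors, then recognition, then a formatted export that reassembles all three.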

For those individuals who are concerned about re-purposing their documents, a straight text OCR engine will not work. Basic OCR engines get the text of the document in digital form and nothing more. For these users, it’s important to find a solution with good document analysis.

Occasionally the need to convert large documents such as maps and engineering drawings comes along. Many times the OCR requirement is limited to a small, clearly defined subset of fields, but when it comes to converting the entire document to get as much text as possible, there are many things you need to consider.

First: if you already have the ability to scan or are receiving images of large-format drawings, congratulations, as this can be one of the biggest challenges. Scanning large-format documents requires either a large-format scanner or stitching of partial scans (less preferred). Because these documents have small fonts, it’s important to scan at 300 to 400 DPI. For maps, because of the amount of graphics, dropping out all colors would be ideal, or else a thresholded black-and-white scan that leaves mostly text in the image.

The purpose of OCR for most of these documents is indexing and searchability, so the goal is to get as much text as you can. For maps with a good scan, you should be able to get the majority of the text, except for names printed on a curve. Running line straightening on these might work, but it is more likely to hurt recognition of the rest of the map, so I would recommend avoiding it. Prior to OCR, disable auto-rotate in your OCR engine, because there is a lot on these documents that can cause a mis-rotation, namely text printed in every direction.

Now to the secret: it has to do with rotation. Depending on the setup of the drawing or map, if you OCR the document at every 90 degrees, by the time you complete a full 360 degrees you will have the majority of the text. That’s right: I’m suggesting that you OCR the document 4 times, hopefully in an automated fashion. This might leave you thinking you will end up with a lot of garbage, and you are right. But you can simply run the final OCR result through a dictionary to remove the garbage text.
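The whole trick fits in a few lines; `ocr` and `rotate` below are placeholders for your engine and imaging library, and the dictionary filter is the cleanup step just described:

```python
def ocr_all_rotations(image, ocr, rotate, dictionary):
    """OCR the page at 0, 90, 180 and 270 degrees, pool every word seen,
    then keep only dictionary hits to discard the garbage reads.
    `ocr` and `rotate` stand in for your engine and imaging library."""
    words = set()
    for angle in (0, 90, 180, 270):
        words.update(ocr(rotate(image, angle)).split())
    return {w for w in words if w.lower() in dictionary}
```

Text printed upside down or sideways produces nonsense at the wrong angles and real words at the right one; the dictionary keeps the latter.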

The end result is a map or drawing with the largest amount of index-level text possible. I admit that I made it sound a little easier than it is, and most likely you will need an API to get the full job done, but the possibility exists and it has been proven successful.

There are several document types out there, both for full-page OCR and for Data Capture, that require special attention and configuration. For full-page OCR (extraction of all the text on a document), newspapers are one of these and pose some interesting challenges. When considering the OCR of these documents, you need to change how you think about the document itself.

When you open a page of any newspaper, you likely consider the document as a whole while your brain picks apart the pieces. That is the key to OCRing newspapers. The biggest challenge facing companies wanting to convert newspapers to text using OCR is their layout. Even though the font in newspapers is usually pretty small, they can be scanned at a quality where the raw OCR read is very high. Newspapers have their own structure: page headings, section headings, article titles, article sub-titles, bylines, articles, and footers. Not only that, but articles can span pages.

When converting a newspaper, the most effort should be spent on proper zoning. The document analysis tools built into OCR engines are tuned to the average document (which newspapers are not), so while they will accurately find columns and paragraphs, the key is to find the titles and bylines and to separate articles. Most large service bureaus processing newspapers at high volume have a manual zoning process followed by a single OCR read, which produces very accurate results, all because the zoning was done properly. Others have devised a two-pass OCR system that essentially zones documents twice, narrowing the focus at each step and increasing zoning accuracy and thus OCR accuracy. This solves read accuracy, but not page continuations.

Page continuations are most often handled post-OCR, with a business rule applied to the OCR result. Metadata from the OCR results should indicate which page the text came from; by finding the words “continues on” at the bottom of any given article, you can concatenate its continuation to it for final presentation. Part of this rule is an article count and an article-portion count: by the end you should have 0 portions and only articles. If you have low confidence in the merging of articles, you can simply merge the result, review the remaining portions, and your accuracy will increase.
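A rough sketch of that merging rule, assuming a hypothetical portion structure of page numbers and text in reading order (real OCR metadata would supply the page numbers):

```python
import re

CONT = re.compile(r"continues on page (\d+)", re.IGNORECASE)

def merge_continuations(portions):
    """Merge article portions using 'continues on page N' markers.
    `portions` is a hypothetical structure: dicts with 'page' and
    'text' keys, in reading order."""
    by_page = {}
    for p in portions:
        by_page.setdefault(p["page"], []).append(p)
    consumed, articles = set(), []
    for p in portions:
        if id(p) in consumed:
            continue
        text = p["text"]
        match = CONT.search(text)
        while match:
            candidates = [q for q in by_page.get(int(match.group(1)), [])
                          if id(q) not in consumed and q is not p]
            if not candidates:
                break  # continuation not found; leave as a portion for review
            nxt = candidates[0]  # naive: first unconsumed portion on that page
            consumed.add(id(nxt))
            text = CONT.sub("", text).rstrip() + "\n" + nxt["text"]
            match = CONT.search(text)
        articles.append(text)
    return articles
```

In production you would pick the continuation by position and title matching rather than “first portion on the page”, but the counting logic is the same: portions go in, articles come out.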

OCRing newspapers has its challenges, not to mention the difficulty of scanning them, but it’s possible and can be very accurate if you are in the right state of mind and using the right approaches.

One of the most popular questions organizations ask when purchasing Data Capture or OCR software is “what accuracy can you guarantee?”. If you have ever asked a vendor this question, you got one of two responses: either a percentage, or a long explanation of why they can’t guarantee anything. If the vendor gave you a percentage, you should probably run, because it’s the start of a bad relationship.

Why? It’s not really possible for a vendor to tell you how accurate recognition will be on your documents. Vendors can estimate accuracy based on samples, and they can give you an idea of the range, but because of the nature of the technology there is no way to guarantee anything. The first fact of OCR is that you can ALWAYS find a document that breaks the norms of recognition and accuracy. Because of this possibility, it’s hard to know how exception documents will affect the accuracy of the entire system. So let’s talk about what is reasonable.

It is reasonable to provide a sample set of documents and expect an average accuracy level, as a percentage, on those samples. Because they are a discrete subset of documents, this is something that can actually be measured. It is the organization’s job to pick samples that most closely represent production. It would be wise to include bad, average, and good documents in the sample set so as to cover the entire range of possibilities.

What organizations often forget is that even if only 50% of the documents are automated, there is a cost savings compared to manual entry. The industry standard for accuracy is 85%, though this changes heavily based on document type and the organization’s perception of accuracy. The ideal way to measure accuracy is to compare recognition results to truth data. If truth data is not available, the next best thing is to count not accuracy but the level of uncertainty on the document. If a document is 5% uncertain according to the OCR engine, then it is 95% certain, and this should be your measure.
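Both measures are easy to compute. Here is a sketch using Python’s standard library; the uncertain-character count would come from your engine’s flags:

```python
from difflib import SequenceMatcher

def accuracy_vs_truth(ocr_text, truth):
    """Character-level accuracy (0.0-1.0) against 100% correct truth data."""
    return SequenceMatcher(None, ocr_text, truth).ratio()

def certainty(total_chars, uncertain_chars):
    """Fallback when no truth data exists: 5% uncertain -> 95% certain."""
    return 1.0 - uncertain_chars / total_chars
```

Note that the two numbers answer different questions: the first measures how right the engine was, the second only how confident it claims to be.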

Next time a vendor is faced with the question “how accurate are you?” or “what accuracy do you guarantee?”, I hope they give the proper response: “how accurate will your process allow us to be?”. It’s a fair question to ask when you are not familiar with the technology, but hopefully the above gives you the proper approach to measuring a solution.

As fonts get smaller, the challenge of reading them with OCR software increases. However, there are some key things organizations should be aware of when reading the fine print.

OCR technology today is capable of reading fonts as small as 8 pt or even 6 pt very accurately. It used to be that unless you had a 12 pt font, you stood no chance. Thanks to higher-quality scans and more advanced OCR engines, reading small fonts will not be a problem if the right approaches are used.

Small fonts are more sensitive to image quality and document degradation. For this reason, source images scanned at 300 DPI or higher are necessary. For normal fonts there is seldom reason to scan above 300 DPI, but for small fonts the goal is to make them appear more or less the same as regular fonts, so scanning at 400 to 600 DPI is useful. Clean documents are also very important: a smudge or spill impacts smaller fonts many times more than larger ones because of the closeness of the lines. Once you have good image quality, you can start the conversion.

The next biggest benefit for small fonts is to zone them separately. Zoning is the process of rubber-banding the region where the text exists. When small fonts are grouped in the same zone as normal-sized fonts, the OCR software assumes they should be the same size, and confidence and accuracy go down. If you zone the small fonts separately, you increase the OCR engine’s ability to use experts just for small fonts and increase accuracy on them.

Next time someone tells you to read the small print, tell them you won’t read it; you will scan and OCR it.

The search for greater accuracy in document automation never stops. It’s true that OCR technology has become so advanced that with every new release, the jumps in accuracy are not what they were 10 years ago. New versions of OCR engines contain enhancements for low-quality documents and vertical document types, but general OCR can’t get much better. Because of this, modern integrations need to find new tricks. This blog is full of them, but I’m about to explain just one more: OCRing inverted text.

OCRing inverted text is nothing new. Many document types have regions where white text is printed on a black background, and modern engines can read this text. Typically it’s not as accurate as OCR of black text on a white background, but it has its unique benefits, especially with complex document types such as EOBs and drivers’ licenses.

There is a trick to using inverted-text OCR to increase overall OCR accuracy. The method is to first OCR a document normally, then use imaging technology to invert the image: black text on a white background becomes white text on a black background. Once the inversion is done, run OCR again. By comparing the two OCR results, you have essentially set up voting with the same engine, with little effort.
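A minimal sketch of the invert-and-vote idea. The raw pixel representation and the weighting scheme are my own assumptions for illustration, not a vendor API; a real integration would invert with an imaging library:

```python
def invert(pixels):
    """Invert an 8-bit grayscale image given as a list of pixel rows."""
    return [[255 - p for p in row] for row in pixels]

def vote(normal_result, inverted_result, weight_normal=0.6):
    """Combine two (char, confidence) streams from the same engine,
    weighting the normal read more heavily (the weights are assumptions)."""
    out = []
    for (c1, conf1), (c2, conf2) in zip(normal_result, inverted_result):
        out.append(c1 if conf1 * weight_normal >= conf2 * (1 - weight_normal) else c2)
    return "".join(out)
```

The inverted pass only wins a character when its confidence clearly beats the weighted normal read, which matches the advice below about favoring the original.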

Large-volume processing environments can deploy this trick without loading a new OCR engine or applying different settings. It’s important to note that when using this technique, how you compare the two results is as important as the process itself. Typically you will assign more weight to the original version of the document than to the inverted one. There you have it: one more tool for increasing the OCR accuracy of the engine you already use.

The difference between easy Data Capture projects and more complex ones often comes down to the type of data being collected. For both hand-print and machine-print forms, certain fields are easy to capture while others pose challenges. This post discusses those “problem fields” and how to address them.

In general, fields that are not easily constrained and don’t have a limited character set are problem fields. Fields that are usually very accurate and easy to configure are number fields, dates, phone numbers, and so on. Then there are middle-ground fields, such as dollar amounts and invoice numbers. The problem fields are addresses, proper names, and items.

Address fields are, for most people, surprisingly complex. Many would like to believe that address fields are easy. The only way to capture address fields very easily would be to have, for example in the US, the entire USPS database of addresses that they themselves use in their data capture. It is possible to buy this database. If you don’t have it, the key to addresses is less constraint. Many think you should specify a data type for address fields that starts with numbers and ends with text. While this might be great for 60% of the addresses out there, by doing so you have made all exception addresses 0%. It’s best to let the engine read what it’s going to read, and only support it with an existing database of addresses if you have one.
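A sketch of the “support, don’t constrain” approach, using fuzzy lookup against a tiny stand-in address list (a real deployment would use the purchased database):

```python
from difflib import get_close_matches

# Hypothetical address list standing in for a purchased USPS-style database.
KNOWN_ADDRESSES = ["123 Main St", "45 Oak Ave", "PO Box 9"]

def confirm_address(ocr_value, known=KNOWN_ADDRESSES):
    """Support the raw OCR read with a database lookup instead of a rigid
    'number then text' data type, which fails on PO boxes and the like."""
    hits = get_close_matches(ocr_value, known, n=1, cutoff=0.8)
    return hits[0] if hits else ocr_value  # fall back to the raw read
```

Note the fallback: an address with no database match is kept as read, not zeroed out the way a rigid data type would do.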

Proper names are next in complexity after addresses. A proper name can be a person’s name or a company name. It is possible to constrain the number of characters and, for the most part, eliminate numbers, but the structure of many names makes recognizing them complex. If you have an existing database of the names that would appear on the form, you will excel at this field. As with addresses, it would not be prudent to create a data type constraining the structure of a name.

Items consist of inventory items, item descriptions, and item codes. Items can either be a breeze or very difficult, and it comes down to the organization’s understanding of their structure and whether it has supporting data. For example, if a company knows exactly how item codes are formed, then it’s very easy to process them accurately with an associated data type. The best trick for items is, again, a database with supporting data.

As you can see, the common trend is finding a database with existing supporting data. Knowing the problem fields focuses companies and helps them form a plan of attack for creating very accurate data capture.

Organizations seeking full-page OCR or Data Capture technology have a serious need to estimate accuracy before they deploy, as this is a primary variable in determining the range of return on investment they can expect to achieve. When organizations try to understand accuracy by asking the vendor “How accurate are you?”, they have gone down a path that may be hard to undo.

Accuracy is tied very closely to your document types and business process. While it is fair to ask about accuracy on a document similar to yours, the answer should not carry much weight. An organization’s business process dramatically impacts OCR accuracy as well. Instead of asking “How accurate are you?”, you should be asking “Can I test your software on my documents?”.

A properly established test bed of documents is the ideal way to evaluate the accuracy of a product. You want to know the worst-case scenario. Build a set of sample production documents and make sure the collection is proportional to the volume you intend to process and the number of variations. Of that set, 25% should be your “pretty” documents, 50% your typical documents, and 25% your worst documents. Use this sample set on every product you test. If you are able to compile truth data (100% accurate manual results for these documents), you are even better off in your analysis.
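Assembling such a set amounts to stratified sampling; here is a sketch following the 25/50/25 split above, assuming you have already sorted your scans into three pools:

```python
import random

def build_test_bed(pretty, typical, worst, size=100, seed=42):
    """Stratified sample in the 25/50/25 split described above. The pools
    are assumed to hold representative production documents; the fixed
    seed just makes the set reproducible across vendor tests."""
    rng = random.Random(seed)
    return (rng.sample(pretty, size // 4)
            + rng.sample(typical, size // 2)
            + rng.sample(worst, size // 4))
```

Reusing the identical set for every product you evaluate is what makes the vendor comparison fair.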

While I would hope no vendor answers the accuracy question directly, asking it means you don’t yet understand the problem you are trying to solve. Today the ability to test is essential, and the vendor should grant you that right. Taking the time to test will save you much pain and time later.

Exceptions happen! When working with advanced Data Capture and forms-processing technologies, you will always have exceptions. How companies choose to deal with those exceptions often makes or breaks an integration. Too often, exception handling is not considered in data capture projects, but it’s important. Exceptions help organizations find areas for improvement, increase the accuracy of the overall process, and, when properly prepared for, keep return on investment (ROI) stable.

There are two classes of exceptions: those that make it to the operator-driven quality assurance step, and those that are thrown out of the system. It would take some time to list all the possible causes of these exceptions, but that is not the point here; the point is how best to manage them.

Exceptions that make it to the quality assurance (QA) process have a manual labor cost associated with them, so the goal is to make the checking as fast as possible. The best first step is to use database lookups for fields. If you have pre-existing data in a database, link your fields to this data as a first round of checking and verification. Next, choose proper data types. Data types are formats for fields. For example, a numeric date will only contain numbers and forward slashes, in the format NN”/”NN”/”NNNN. By allowing only these characters, you make sure you catch exceptions and can either give the data capture software enough information to correct them (if you see a “g” it’s probably a “6”) or show the verification operator exactly where the problem is. The majority of your exceptions will fall into the quality assurance phase; but there are some exception documents the software is not confident about at all, and these end up in an exception bucket.
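The date data type above can be sketched like this; the confusion table is a hypothetical example of the kind of correction the software can apply before bothering an operator:

```python
import re

# Hypothetical substitution table for common OCR confusions in numeric fields.
CONFUSIONS = {"g": "6", "O": "0", "l": "1", "S": "5", "B": "8"}
DATE = re.compile(r"^\d{2}/\d{2}/\d{4}$")

def check_date_field(value):
    """Apply the NN"/"NN"/"NNNN data type: auto-correct known confusions,
    then return None (send to an operator) if it still doesn't conform."""
    corrected = "".join(CONFUSIONS.get(c, c) for c in value)
    return corrected if DATE.match(corrected) else None
```

Anything the table can fix passes through untouched by a human; anything it can’t fix arrives at QA already pinpointed.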

Whole exception documents that are kicked out of the system are the most costly and, if not planned for, can be the killer of ROI. The most frequent cause of these exceptions is a document type or variation that has not been set up. It’s not the fault of the technology; in fact, because the software kicked the document out rather than processing it incorrectly, it’s doing a great job! The mistake companies make is giving every document in this category the same attention, and thus additional fine-tuning cost. But what if that document type never appears again? Then the company has reduced its ROI for nothing. The key to these exceptions, whether they are whole document types or just portions of one particular type, is to set a standard: an exact problem has to repeat X times (based on volume) before it gets any fine-tuning effort.
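That repeat-before-tuning standard takes only a few lines to implement; the signature strings here are hypothetical labels for "the same exact problem":

```python
from collections import Counter

class ExceptionTracker:
    """Count repeats of the same exact problem and only flag it for
    fine-tuning once it has occurred `threshold` times (a threshold
    you would set based on your volume)."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, problem_signature):
        """Log one occurrence; True means it is now worth tuning effort."""
        self.counts[problem_signature] += 1
        return self.counts[problem_signature] >= self.threshold
```

One-off documents never cross the threshold and never trigger fine-tuning spend, which is exactly the ROI protection described above.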

Only with an exceptional exception handling process will you have an exceptional data capture system and ROI.