My office is a paper monster. Paper comes in and never leaves intact. The scary part is how fast this happens: paper in hand, review its contents and assess its value, scan it, shred it, usually within minutes of its existence. The value of set-it-and-forget-it OCR is tremendous, but you have to be comfortable with it.

Set-it-and-forget-it OCR is where you configure your OCR product to automatically process any images that appear in a certain folder. For my office, I scan to an “input” folder, and all the resulting compressed and OCR’ed PDF files end up in the “File Cabinet” folder. My strategy will not work for the timid, because I rely solely on the power of OCR text and search to retrieve documents when I need them. Most would rather configure their ADF scanner with a setting or folder for each particular class of documents. Most document scanners today have anywhere from 9 to 99 destinations you can program, and each destination can be its own input folder with its own OCR settings and its own output folder.

I know I can do this because I know what settings it takes to get the quality of OCR I need to have at least one or more usable keywords on the document for search. And after all, I’m an expert in OCR, so not using it every day would be crazy in its own right. I’ve yet to be proven wrong: my “File Cabinet” abyss has always given me the information I need at the time I ask for it, and sometimes even new information I did not realize I had.

Now, for you records management folks shaking your heads, I understand your complaint. It should not be about my approach, but about what I do with the final paper product. Items that, for legal or business reasons, are deemed records by your taxonomy should be filed as such, perhaps scanned again as records, and for heaven’s sake, if you are not supposed to destroy them, don’t!

The purpose of my madness is to touch paper as little as possible and to get information only when I need it. I am an extremist, but I assure you there is serious value, and a little fun, in the set-it-and-forget-it OCR technique.

There are a lot of technologists out there who believe that optical character recognition’s days are numbered and that it is an aging technology. The belief is that paper will soon go away. This post is for those who believe OCR technology is going away.

The reality is that paper consumption has not really decreased. In some areas paper has been replaced with electronic data interchange (EDI), but in others it has actually increased. Studies have also shown that because documents are being scanned more often, there is an increase in printing when the documents need to be shared or repurposed. But I’m not here to argue that paper is not going away and that document conversion technologies are required to convert it. I’m here to point out a few futuristic uses of the technology that technologists already like to talk about, all of which involve OCR.

Data Security

The first futuristic use I would like to discuss is OCR in data security. Text strings sent over the Internet are far easier to sniff and unlock than a compressed JPEG image. What if you converted the text into a JPEG for transmission, and the person on the receiving end OCR’ed it to get the data back? The data would be masked in a more efficient and secretive way. For added security, proprietary image formats could be devised.

File Compression

Storing ASCII text takes up far less space than an image or video file. As part of the future of compression technologies, expect that OCR will be used to extract the text from an image and save it as an ASCII file. Viewers will convert the text back to an image during viewing. This removes the image portion of the text and significantly reduces file size.
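A rough back-of-the-envelope sketch shows how large the gap is. The page size, scan resolution, and character count below are assumptions chosen for illustration, not measurements:

```python
# Rough size comparison: a scanned page image vs. the ASCII text on it.
# Assumptions (illustrative only): a US-letter page scanned at 300 DPI as an
# uncompressed 1-bit black-and-white bitmap, holding about 4,000 characters.

DPI = 300
WIDTH_IN, HEIGHT_IN = 8.5, 11.0
CHARS_PER_PAGE = 4000

# Uncompressed 1-bit bitmap: one bit per pixel.
pixels = int(WIDTH_IN * DPI) * int(HEIGHT_IN * DPI)
image_bytes = pixels // 8

# Plain ASCII: one byte per character.
text_bytes = CHARS_PER_PAGE

print(f"image: {image_bytes / 1024:.0f} KiB")  # ~1027 KiB
print(f"text:  {text_bytes / 1024:.1f} KiB")   # ~3.9 KiB
print(f"ratio: {image_bytes / text_bytes:.0f}x")
```

Even before conventional compression, the text form is a couple of hundred times smaller than the raw bitmap on these assumptions.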

Robots

How else do you expect future robots to read text? OCR, of course. The eyes of a robot are essentially a camera that takes pictures rapidly. When the robot is faced with comprehending text, the image will be converted using OCR and fed through an engine to extract meaning from the text and act on it.

So there you have it: three really cool and cutting-edge ways OCR is and will be used in the future. Paper is not going away, but even if it were, just look at the other cool uses of OCR technology.

What is the greatest difference between the most accurate Optical Character Recognition (OCR) products and the least accurate? It might not be what you think. The greatest improvements in OCR in the last 10 years have not been so much in character-level recognition; they have been in how the engines understand the structure of documents. This is called document analysis. Theoretically, if you were to compare two engines with identical character recognition, but engine A had document analysis and engine B did not, engine A would win.

Document analysis is, first, how the engine breaks a document into components such as paragraphs, lines, columns, and graphics. Without this, the engine is OCRing blind, and its assumption is that every object it encounters is text. This sometimes leads to clumping of lines or to OCR of graphics. The second aspect of document analysis is delivering formatting in the export that matches the formatting in the document. This can also include font style and color.

With traditional documents you can expect that products with document analysis will get the formatting spot on. This is very important, not only for editing and repurposing, but also for keeping a document readable. Another aspect of document analysis is determining reading order. For example, on a multi-column, multi-paragraph page, the software has to decide in what order the paragraphs are read. This is useful during recognition, but also in case a formatted document is converted to a flatter structure such as a TXT file, where the order stands a chance of being confused.
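The reading-order step can be sketched in a few lines. This is a minimal illustration, not a real engine: the block format and the single fixed column boundary are assumptions, and real document analysis infers columns rather than being told where they are:

```python
# Minimal sketch of reading-order determination on a two-column page.
# Each block is (x, y, text), with x/y the block's top-left corner in pixels.
# Assumption (hypothetical): the column boundary is known in advance, and
# reading order is top-to-bottom within a column, left column first.

def reading_order(blocks, column_boundary):
    """Sort layout blocks left-column-first, then top-to-bottom."""
    def key(block):
        x, y, _ = block
        column = 0 if x < column_boundary else 1
        return (column, y)
    return [text for _, _, text in sorted(blocks, key=key)]

page = [
    (900, 100, "Right column, first paragraph"),
    (100, 400, "Left column, second paragraph"),
    (100, 100, "Left column, first paragraph"),
    (900, 400, "Right column, second paragraph"),
]
print(reading_order(page, column_boundary=800))
```

Flattening this page to TXT without such a sort would interleave the two columns by vertical position, which is exactly the confusion described above.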

The reality is that for clean documents, character-level recognition is not getting much better; it’s amazingly accurate today. The opportunity to improve is in document analysis and language morphology, but that is another post.

The two most common questions organizations ask when they are seeking document automation technology are “how fast is it?” and “how accurate is it?”. Many don’t realize that the two are in opposition most of the time: the more accurate a system, the slower it is, and the faster it is, the less accurate. But there is one fatal mistake in all these calculations, and that mistake is how efficiency is calculated.

Most companies that trial data capture calculate performance on the slowest step, which is optical character recognition (OCR). Literally, companies will hit the “read” button and start timing until the read is complete. This is what they consider the speed of the document automation system. This is incorrect.

There is no question that OCR can be a tremendous bottleneck in the entry process, but poor OCR can create an even greater one. Imagine an OCR engine that reads a document with 100 characters in 1 second, compared to an engine that reads the same 100 characters in 3 seconds. Your initial thought is that the first engine is better, but consider that the first engine may be 60% accurate, leaving 40 characters to be manually entered, while the other engine is 98% accurate, leaving 2 characters to be manually entered or corrected. If you assume an average entry speed of 1.6 characters per second, the 40 characters take an additional 25 seconds to enter, for a total entry time of 26 seconds for the faster engine. The slower engine takes an additional 1.25 seconds to enter or edit its 2 wrong characters, for a total entry time of 4.25 seconds. This means that end-to-end, the slower engine is about 6 times faster in the document automation process than the faster engine.
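The arithmetic above is simple enough to capture in one function, using the same numbers as the example:

```python
# End-to-end entry time for an OCR engine: read time plus the time to
# manually key the characters the engine got wrong. The figures below are
# the ones used in the text (1.6 characters/second average keying speed).

def total_entry_time(read_seconds, accuracy, chars, keying_cps=1.6):
    """Read time + time to hand-key the characters OCR missed."""
    wrong = chars * (1 - accuracy)
    return read_seconds + wrong / keying_cps

fast_engine = total_entry_time(read_seconds=1, accuracy=0.60, chars=100)  # ~26.0 s
slow_engine = total_entry_time(read_seconds=3, accuracy=0.98, chars=100)  # ~4.25 s
print(fast_engine, slow_engine, fast_engine / slow_engine)
```

The ratio works out to roughly 6.1, which is where the “about 6 times faster” figure comes from.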

This simple calculation illustrates the folly of assuming that a slower OCR time makes for a slower overall process. Focusing on accuracy usually has the greatest benefit for an organization, unless you are improving the speed of a slower engine with hardware, or the two engines are too close for a benefit to show.

I’ve faced unique projects in the last four years, and in a few, the best approach seemed to contradict my better judgment. The projects I’m talking about are ones where the data we were working with was already in a digital format, namely PDF files that were created digitally. This meant that all the text in the PDF was available and 100% accurate. So why then, to accomplish the projects’ goals, did we use OCR to read the already digital files as images?

I had intended, for all these projects, to do a logical parsing of the already digital content to get what I wanted. The problem is that even though the internal structure of PDF has a logical standard, it’s not used logically 90% of the time by most PDF-generating applications. PDF has a tolerance for mistakes that allows organizations to deviate quite drastically from the standard. This means the content is not only unique to each company that generates the PDFs, it’s unique to each application able to create them. Variations on top of variations make logical parsing very difficult. This becomes most obvious when the documents contain tables. The only way to text-parse the PDF properly would be to flatten the internal logic so that it consists of nothing but text, but by doing so you lose some of the information about where the tables are and their structure.

You may have guessed by now that all my projects were to parse tables from PDFs: not just any tables, but specific tables, each in a unique format. As I said before, my preference would have been to use the 100% accurate data already in the PDF. In the end I OCR’ed the PDFs: because they were what is called “pixel perfect,” the accuracy was very high. With OCR, I could first recognize an entire document and remove everything that was not a table, as determined by the OCR’s document analysis. Then I used keywords to find the specific table I wanted. Each project took me about 3 weeks of work, and the result was higher accuracy in table finding, and only slightly lower accuracy in the text values than logical parsing would have given.
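The filter-then-search step can be sketched as follows. The block structure here is hypothetical; a real OCR engine’s document analysis output looks different, but the idea is the same: drop everything that isn’t a table, then match tables by keyword:

```python
# Sketch of the approach described above: take document analysis output,
# keep only the table blocks, then pick the table whose cells contain a
# keyword. The block format is invented for illustration.

def find_table(blocks, keyword):
    """Return the first table block whose cell text contains the keyword."""
    tables = [b for b in blocks if b["type"] == "table"]
    for table in tables:
        cells = [cell for row in table["rows"] for cell in row]
        if any(keyword.lower() in cell.lower() for cell in cells):
            return table
    return None

ocr_output = [
    {"type": "paragraph", "text": "Quarterly report introduction ..."},
    {"type": "table", "rows": [["Region", "Sales"], ["East", "120"]]},
    {"type": "table", "rows": [["Item", "Unit Cost"], ["Widget", "4.50"]]},
]
match = find_table(ocr_output, keyword="unit cost")
print(match["rows"][0])  # ['Item', 'Unit Cost']
```

Because document analysis already decided what is and isn’t a table, the keyword search only has to discriminate between tables, not between tables and body text.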

While it seemed most logical to do the parsing, in the end I saved over 5 man-months of work by using OCR.

Often the purpose of doing Optical Character Recognition (OCR), for individuals and companies, is to get a digital version of a document that they intend to edit or repurpose. This is not the most common use of the technology, but it is one that requires specific attention.

To convert a document so that it is printable later on, it’s important to get not only the text of the document but also its formatting. This includes layout as well as graphics and font colors. To do this, the OCR product must be able to recognize colors (which requires color scanning), recognize font styles, and, very importantly, recognize document structure.

Engines that support advanced document analysis have this. Document analysis (DA) is the process that happens before any text is read on a page. It makes sense of a document in order to improve recognition and to capture the formatting required for a formatted export. First, document analysis finds document structures, i.e., columns, tables, text, paragraphs, and lines. Once this is done, it identifies colors in text and graphics. After document analysis has done its job, recognition can begin. During recognition, font styles are detected: bold, italic, underlined. All of this is put together into a result formatted as closely as possible to the input document.

For individuals concerned with repurposing their documents, a straight text OCR engine will not work. Basic OCR engines get the text of the document in digital form and nothing more. For these individuals, it’s important to find a solution with good document analysis.

It’s common for organizations to outsource their scanning and document conversion. Organizations sometimes find that the skill required, the convenience factor, and the liability are worth the additional cost. Other organizations with one-time backlog conversions save money by using an outsourcing company versus bringing the software in-house. In recent years, service bureaus and business process outsourcing companies have dramatically improved their use of recognition technology, and prices have dropped substantially. But as an organization that chooses to outsource, you are handing off the responsibility of picking document conversion technology. Shouldn’t you want to know what technology your service bureau is using?

YOU SHOULD! Absolutely you should be concerned about the OCR and data capture technology your outsourcing company is using. It’s just as important as if you were bringing the technology in-house. It’s your job to make sure your vendor is using not only the best technology but also using it in the best way. The education level varies between outsourcing companies, and each often specializes in one document type or one type of processing. Proper evaluation of a service bureau includes reviewing sample results. Have your prospective service bureau or BPO run a good number of your production documents and provide you with the results. Make sure the technology they use to produce the results is the same that will be used in production. Don’t be afraid to ask the vendor what engine or engines are being used, and even what version. Make sure you understand how your vendor handles exceptions.

While it’s easy to overlook these items when you are looking at a service instead of a technology, it’s still important to be educated. Service bureaus make money based on how much they save. This can occasionally create a motive to use poor technology to gain greater margins. Some outsourcing companies put customers into categories by volume, and those with the greatest volume get the best technology. Most outsourcing companies are very good at ensuring document quality, and many will even go as far as to give you a guarantee on quality. But the nature of production environments is such that you cannot check everything all the time. It’s about the relationship. Sometimes paying a higher price per page for a better solution is worth it!

You probably use your computer’s copy-and-paste functionality daily. I too use copy and paste on a regular basis, but I also use “OCR and paste” nearly as much. OCR and paste is what I call the process of selecting a region of your computer screen, using OCR to read that region as a screenshot, and converting it to text. Even to my surprise, it has become quite the habit and one of my favorite ways to move data from one location on my computer to another. Many wonder why, as most information on the screen is available as text anyway. The reasons: it’s more efficient than copying and pasting into a program; it maintains the structure of the information using document analysis; and there are times when the information I want is not in text form but in an image only.

I have actually taken it one step further and used the technology to automate the extraction of data from web pages that are scroll-heavy. Instead of scrolling forever through a web page, I can use the tool to take a screenshot of the entire page and convert it to text for me. You can imagine how the technology could be used maliciously, but in this case, it’s just to get information.

The ability of OCR to read screenshots is quite impressive. Though screenshots usually come out at a low DPI, which is traditionally not optimal for OCR, the text in them is what is called pixel perfect, so they make excellent candidates for conversion. Also, leveraging the document analysis built into OCR, I can grab a table and have it export as a table, versus having to copy and paste text and manipulate it back into its original form later.

When you become an expert in OCR, you find yourself using the technology in the oddest places, but this is one case where my productivity has increased because of the tool, and I think it’s worth sharing. I suspect that OCR of screenshots is only going to increase in the future, and because some of that use will be malicious, counter-malware technologies will grow alongside it. It’s also a very easy way to move data from a locked-down legacy system to a new one.

I have touched on this topic a little in one of my previous posts, but because of eDiscovery’s popularity I thought it was fitting to look at OCR’s role in eDiscovery preparedness. Organizations that are not ready for audits and court orders to deliver documents are spending tremendous amounts of money to undo bad document processes. Because of this, preparing yourself for possible future legal events is critical and a long-term cost saver.

The purpose of OCR technology in eDiscovery readiness rests on the principle of having as much data at your fingertips as possible. Proper readiness leans heavily on records management policies and a good taxonomy that is strictly followed, so OCR is sometimes overlooked as a tool. With the above practices in place, it should be possible to pull up any document at any time. However, OCR should be viewed as an insurance policy: OCRing every document you have gives you even more information than you would have otherwise, and information is the key to success in these situations.

eDiscovery also includes other types of data, email being one of the most popular. But what about the data contained in email attachments that are PDFs, TIFFs, or JPEGs? OCR is the only tool that can extract the data from the images in these formats. Surprisingly, products that provide eDiscovery tools just for email still do not heavily deploy OCR technology, even though the information contained in these attachments is often as valuable as the emails themselves.

In addition to all the traditional records management practices and eDiscovery tools, OCR should be considered a must-have for organizations preparing themselves for audits or court orders, and, sometimes even more importantly, for knowing what to omit.

I’ve covered various interesting and unconventional uses of OCR. Now I would like to talk about a new one: OCR to speech. The blind community is familiar with this technology, which assists them in their everyday lives. The key to OCR to speech is simplicity. When the concept was first developed, it required a very elaborate combination of software and hardware; now it’s possible to take the latest and greatest OCR technology and make it talk for you with a simple configuration.

It requires a document scanner with an easy physical button interface, programmed to scan images at 300 DPI to a folder on a machine. Traditional documents work very well for OCR to speech, whereas documents with a lot of graphics and untraditional formats may be more challenging. It’s important that the technology is able to omit garbage. To do this, the OCR process should be driven by a dictionary: words recognized must be in this dictionary or they will not show up in the final results. The reason is that a lot of time can be wasted if bad recognition results are spoken.
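The dictionary-driven filtering can be sketched in a few lines. The tiny word list here is a stand-in; a real system would load a full lexicon, and a real OCR engine applies this kind of check internally:

```python
# Sketch of dictionary-driven garbage filtering for OCR-to-speech: only
# tokens found in a dictionary survive into the text that will be spoken.
# The word list is a stand-in for a full lexicon.

DICTIONARY = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def filter_garbage(ocr_text, dictionary=DICTIONARY):
    """Drop OCR tokens that are not dictionary words (likely misreads)."""
    kept = [w for w in ocr_text.split() if w.lower().strip(".,;:!?") in dictionary]
    return " ".join(kept)

# "qu1ck" and "f0x." are typical OCR misreads and get dropped.
print(filter_garbage("The qu1ck brown f0x. jumps over the lazy dog"))
# prints: The brown jumps over the lazy dog
```

Dropping words outright is a blunt instrument; it trades completeness for not wasting the listener’s time on spoken gibberish, which is the stated goal here.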

Once the OCR engine has done its job of accurately and automatically converting an image to text, the ASCII text results are saved into a directory. Now it’s time to automatically turn the text into speech. There are many text-to-speech applications out there, some free, some paid. The goal is to find one that reads results from a directory and automatically speaks the text over the computer’s speakers.

It can be that easy! Some users of these technologies spend more time trying to find an acceptable digital voice than actually configuring the solution. I assure you the packages exist, and when configured correctly they are very accurate. One scanner, one hot-folder-driven OCR application, and one hot-folder-driven text-to-speech application will give you a robust OCR-to-speech solution that can be set up in minutes.