Often provided in tandem with other services, OCR delivers clean, searchable text for all printed words in a document. Using the most powerful character-recognition algorithms available, Valora routinely OCRs well over 200,000 pages per day.

Because OCR is often near the start of a long chain of automated and hybrid document processing activities, Valora has ensured that we use the highest-quality, most powerful recognition algorithms available and has fully integrated them into our critically acclaimed PowerHouse™ processing platform. This allows us to run OCR simultaneously with other processing efforts, offering a completely scalable, parallel-processing workflow model for high-volume, time-sensitive projects.

With more than a decade of OCR, text extraction and text manipulation experience, Valora delivers expert text searchability, analysis and reporting to some of the world’s largest corporations, law firms, federal agencies and consultancies. Contact us for more information.

“We’ve always found Valora’s OCR to be excellent. It has to be since they use it themselves for AutoCoding.”

Senior Operations Manager, US federal contracting agency

What is Optical Character Recognition (OCR)?

OCR stands for Optical Character Recognition. OCR software scans a digital image for letters and words. It then attempts to interpret what it sees, that is, to recognize the characters and assemble them into logical, most-likely word sequences.

OCR is used to convert an otherwise static digital image (essentially a digital “picture” of a document) into a searchable text file. Once OCR is run on a digital image, there is a corresponding text file for that image. These two are typically linked together by naming conventions and database load/cross-reference files.
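As a rough illustration of the naming-convention link described above, the following Python sketch pairs each image with the text file that shares its base name and reports any image that has no text. The file names and the build_cross_reference helper are hypothetical; a real load file would follow whatever format the review or database platform expects.

```python
from pathlib import PurePosixPath

def build_cross_reference(image_names, text_names):
    """Pair each image with the text file sharing its base name.

    Returns (pairs, missing): a dict of image -> text file, and a list
    of images for which no matching text file was found.
    """
    text_by_stem = {PurePosixPath(t).stem: t for t in text_names}
    pairs = {}
    missing = []
    for image in image_names:
        stem = PurePosixPath(image).stem
        if stem in text_by_stem:
            pairs[image] = text_by_stem[stem]
        else:
            missing.append(image)
    return pairs, missing

# Hypothetical file names, for illustration only.
images = ["ABC0000001.tif", "ABC0000002.tif", "ABC0000003.tif"]
texts = ["ABC0000001.txt", "ABC0000002.txt"]
pairs, missing = build_cross_reference(images, texts)
```

Checking the missing list before loading is a simple way to catch pages that were imaged but never OCR'd.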

OCR is sometimes used on electronic files to provide extracted text which is unobtainable by other means.

It is very common to supplement scanning with OCR, so that scanned images can be read and searched by people and sophisticated analysis programs.

Because OCR is attempting automated recognition of characters, it can sometimes be fooled into mis-recognizing characters or missing their presence entirely. Because of this, OCR text can have imperfections and automated workflows should make accommodations for such conditions.
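One way a workflow might accommodate such imperfections is to search with a tolerance for small character differences instead of requiring exact matches. The sketch below uses Python's standard difflib module; the fuzzy_find helper, the 0.8 threshold and the sample text are assumptions chosen for illustration, not a description of any particular product.

```python
from difflib import SequenceMatcher

def fuzzy_find(term, text, threshold=0.8):
    """Return words in text whose similarity to term meets the threshold."""
    hits = []
    for word in text.split():
        # Strip surrounding punctuation so "binding." compares as "binding".
        cleaned = word.strip(".,;:!?\"'()").lower()
        if not cleaned:
            continue
        score = SequenceMatcher(None, term.lower(), cleaned).ratio()
        if score >= threshold:
            hits.append(cleaned)
    return hits

# Hypothetical OCR output with two mis-read characters ("0" for "o", "e" for "c").
ocr_text = "The c0ntract was signed by both parties. The contraet is binding."
matches = fuzzy_find("contract", ocr_text)
```

An exact search for "contract" finds nothing in this text, while the tolerant search still surfaces both mis-read words. Lowering the threshold finds more variants at the cost of more false hits.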

Ultimately, the goal of any OCR effort is to quickly and accurately create a searchable working-content version of otherwise pictorial images.

Example Usage Scenarios for OCR

Paper documents have been scanned to digital image, but are not searchable in any way.

Documents are being prepared for further automated analysis, such as Near Duplicate Detection, Clustering, AutoCoding, AutoIndexing, Predictive Coding or Data Mining.

Historical paper records are being integrated with more current electronic ones. This is common for medical records (EMRs), financial/mortgage/banking records, insurance records, employment files and government forms.

Paper documents have historically been stored at a warehouse and the organization is looking to reduce storage & retrieval costs, data footprints and complexity.

Physical or scanned copies of documents have been provided by opposing or third parties.

Documents are printouts from obsolete, bankrupt/defunct or otherwise inaccessible sources.

Documents are a paper “backup system” for electronic files and need to be brought forward into electronic file storage systems.

Backfile conversion or Day Forward Scanning is needed to supplement ongoing business processes.

Important Things to Know & Ask About OCR

OCR is a computerized process that makes a “best guess” at character recognition from digital images. As such, it can have the occasional mis-read. This means that some of the resulting text can be incorrect, missing or incomplete. Such errors can be difficult to spot, and they present a much larger problem when the text is being used for search and analysis. OCR quality is a function of how clean the input images are. In general, the cleaner and clearer the original document, the better the OCR.
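When a hand-verified transcription of a sample page is available, OCR quality can be expressed as a character error rate: the number of single-character edits needed to turn the OCR output into the correct text, divided by the length of the correct text. The Python sketch below is one common way to compute it; the sample strings are invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, ocr_output):
    """Edits needed to correct the OCR output, per reference character."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)

# Hypothetical sample: "I" read as "l" and "l" read as "1".
reference = "Invoice total: 1,250.00"
ocr_output = "lnvoice tota1: 1,250.00"
cer = character_error_rate(reference, ocr_output)
```

Measuring a small sample this way, before and after image clean-up, gives a concrete sense of how much input quality affects the result.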

As noted above, some text does not convert well under OCR. The most common failure occurs with handwriting. Nearly all handwriting appears as “squiggles” to an OCR engine. When a document population has a high density of handwriting, OCR will be of limited use.

Many people erroneously believe that electronic files (electronically stored information, or ESI) always contain good-quality text and never require OCR. This is not true. Sometimes ESI processing and storage systems fail to render full or adequate text. Other times ESI is simply a digital image without any text available, such as an image-only PDF file. In both cases, these ESI files must be run through OCR in order to create complete, high-quality text for searching and analysis.
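A simple screening step can help identify which electronic files fall into this category. The Python sketch below assumes the text for each page has already been extracted by some other tool and flags pages whose extracted text is empty or too sparse to trust. The needs_ocr helper and the 25-character threshold are illustrative assumptions that would need tuning for a real document population.

```python
def needs_ocr(page_texts, min_chars_per_page=25):
    """Return 1-based page numbers whose extracted text is too sparse to trust."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Ignore whitespace so blank-looking pages count as empty.
        visible = "".join(text.split())
        if len(visible) < min_chars_per_page:
            flagged.append(number)
    return flagged

# Hypothetical extracted text for a four-page file.
extracted = [
    "This Agreement is entered into as of the date set forth below.",
    "",
    "   \n  ",
    "Pg 3",
]
pages_to_ocr = needs_ocr(extracted)
```

Here pages 2 through 4 are flagged: two have no text at all and one has only a stray page label, which suggests the page body is an image.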

Not everyone OCRs documents and files the same way. Make sure you tell your OCR services provider exactly what you want. Example specifications: Single-page OCR file, matched to image, with each page named with a sequential Bates number, sorted into file folders by custodian.
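To make the example specification concrete, the following Python sketch plans output names for single-page image and text files, numbered sequentially and grouped into custodian folders. The "ABC" prefix, seven-digit width, .tif/.txt extensions and custodian names are all hypothetical; the actual convention is whatever you and your provider agree on.

```python
from pathlib import PurePosixPath

def bates_paths(pages, prefix="ABC", start=1, width=7):
    """Assign sequential Bates numbers and build custodian-sorted output paths.

    pages: list of (custodian, source_page_id) tuples in production order.
    """
    plan = []
    for offset, (custodian, source_id) in enumerate(pages):
        bates = f"{prefix}{start + offset:0{width}d}"
        folder = PurePosixPath(custodian)
        plan.append({
            "source": source_id,
            "image": str(folder / f"{bates}.tif"),
            "text": str(folder / f"{bates}.txt"),  # same base name as the image
        })
    return plan

# Hypothetical custodians and source page identifiers.
pages = [
    ("Smith", "scan_001_p1"),
    ("Smith", "scan_001_p2"),
    ("Jones", "scan_002_p1"),
]
plan = bates_paths(pages)
```

Because the image and text entries share a Bates number, each OCR file stays matched to its page image regardless of which folder it lands in.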

Since OCR efforts often occur on a page-by-page basis, make sure you understand how the charges work. For example: are the OCR fees assessed per page, per file/document or per hour? Do they include roll-up to document-level, if required?
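For readers unfamiliar with roll-up, it simply means combining page-level OCR text into one text per document. A minimal Python sketch, assuming the document boundaries (which pages belong to which document) are already known, might look like the following. The roll_up helper, the identifiers and the form-feed page separator are assumptions for illustration.

```python
def roll_up(page_texts, documents, separator="\f"):
    """Combine page-level OCR text into one string per document.

    page_texts: dict mapping page id -> OCR text for that page.
    documents: dict mapping document id -> ordered list of page ids.
    Pages with no OCR text contribute an empty string.
    """
    return {
        doc_id: separator.join(page_texts.get(page, "") for page in pages)
        for doc_id, pages in documents.items()
    }

# Hypothetical page-level OCR and document boundaries.
page_texts = {
    "ABC0000001": "Page one text",
    "ABC0000002": "Page two text",
    "ABC0000003": "Cover letter",
}
documents = {
    "DOC1": ["ABC0000001", "ABC0000002"],
    "DOC2": ["ABC0000003"],
}
rolled = roll_up(page_texts, documents)
```

Whether this step is included in the quoted price, and who supplies the document boundaries, are worth confirming up front.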

Most OCR services do not include clean-up of the resultant OCR, and thus the delivered text will contain imperfections. If you require 100% perfect OCR text, you will need to order OCR text cleanup, which is almost universally an additional charge.

How will you transmit images to your OCR services provider? Large volumes of images will take significant time to copy, download, etc. Make sure you leave ample transmission time and utilize zip compression capabilities wherever possible.
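A back-of-the-envelope estimate can help with planning transmission time. The Python sketch below converts a total data size and a link speed into hours. The 50 KB average page size, the 100 Mbps link and the 80% efficiency factor are assumptions for illustration only; actual image sizes and throughput vary widely.

```python
def transfer_hours(total_bytes, megabits_per_second, efficiency=0.8):
    """Estimate hours needed to move total_bytes over a network link.

    efficiency is the assumed fraction of the nominal link speed
    that is actually achieved in practice.
    """
    if megabits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("speed must be positive and efficiency in (0, 1]")
    bits = total_bytes * 8
    effective_bits_per_second = megabits_per_second * 1_000_000 * efficiency
    return bits / effective_bits_per_second / 3600

# Assumed figures: 200,000 pages averaging 50,000 bytes each.
total_bytes = 200_000 * 50_000
hours = transfer_hours(total_bytes, megabits_per_second=100)
```

Under these assumptions the transfer takes roughly a quarter of an hour; the same volume over a 10 Mbps link would take about ten times as long. Compression reduces total_bytes, though image formats that are already compressed tend to shrink less.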