OCR Shop XTR Tips for Better Recognition and OCR Processing
From working with customers over the past several years, we have identified
the most common issues that arise when processing images with OCR Shop XTR,
and have outlined them below along with methods for improving results and
processing times.
===============================================================================
Contents:
* Input Image Resolution
* Improving Results with Non-binary Input Images
* Automatic Processing and Filtering
* Output filesize
* PDF and PS Input: Bit-depth, Memory Usage, and Processing Speed
* Non-square Fax Images
* TIFF Fill-order Bit
* OCR Processing Speed
* Understanding OCR Processing Using the PDF "normal" Output Format
* Output of Non-Latin1 Character Sets
===============================================================================
* Input Image Resolution
Make sure the resolution of the input image, as well as the font size with
respect to that resolution, are within normal limits.
OCR Shop XTR accepts:
- Image resolutions from 72dpi to 900dpi
- Fonts from 5 to 72 points
The resolution of the input image determines what one "point" means in the
font point size. The resolution of the input image is specified in the
input image file, or, when not specified, is assumed to be 300 dpi by
default.
- There are 72 points per inch.
- The point size of a font is measured from the top of the highest
ascender to the bottom of the lowest descender.
- The dpi specifies the number of pixels per inch.
If the type in your image is particularly large or small, it might fall
outside the accepted font point sizes, depending on the image resolution.
OCR Shop XTR allows you to adjust how the OCR engine interprets the font
size through the "in_res" option.
For example, if your font size is 15 pixels high and the image resolution is
300dpi, then the font point size is approximately 3.6 points, too small for
the engine to recognize well. In this case, we recommend setting the
"in_res" option to 200dpi or 100dpi so the font will be interpreted as
approximately 5.4 or 10.8 points in size, respectively.
Similarly, if your font size is 80 pixels high and the image resolution is
72dpi, then the font point size is 80 points, too large for the engine to
recognize well. In this case, we recommend trying an "in_res" of 100, for
instance, to have the font interpreted at a point size of approximately
57.6 points.
You may approximate the point size of your font with the equation:
[height of font in pixels] * 72 points/inch / [image dpi] = [point size of font]
Remember the height of the font is measured from the top of the highest
ascender to the bottom of the lowest descender. If you count the pixels,
make sure you view that portion of your image at full resolution on your
screen, sometimes referred to as "actual pixels".
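The equation above is easy to script if you need to check many images. The
following is a minimal Python sketch; the function names are our own
illustrations and are not part of OCR Shop XTR. The second function simply
rearranges the same equation to suggest an "in_res" value for a desired
point size.

```python
MIN_POINTS = 5    # smallest font size accepted, per the limits above
MAX_POINTS = 72   # largest font size accepted, per the limits above


def point_size(height_px, dpi):
    """Approximate font point size: pixels * 72 points/inch / dpi."""
    return height_px * 72.0 / dpi


def in_res_for(height_px, target_points):
    """Resolution at which a font height_px tall is read as target_points."""
    return height_px * 72.0 / target_points


def within_limits(height_px, dpi):
    """True if the font falls inside the accepted 5 to 72 point range."""
    return MIN_POINTS <= point_size(height_px, dpi) <= MAX_POINTS
```

For instance, within_limits(15, 300) is False, while within_limits(15, 200)
is True, which matches the first example above.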
* Improving Results with Non-binary Input Images
When a grayscale or color image is sent to OCR Shop XTR as input, the OCR
engine converts it to 1-bit black and white image data before processing.
You can control this transformation using the "black_threshold" option.
The default conversion to 1-bit image data is optimized for black text on a
white background. If your input image is low in contrast, adjusting the
black_threshold can often improve the results dramatically.
The default value of black_threshold is 60. Its normal range is 0-100; the
additional values 101 and 102 select special algorithms: random threshold
and Floyd-Steinberg, respectively.
A good way to understand the effect of the black_threshold option is to
generate a debug output file that shows you the 1-bit black and white image
data that is sent to the OCR engine for processing. Try running this
command-line:
ocrxtr -out_debug_files=y image.tif
An image file called "converted_input_file" will be created. This is the
image data after it has been converted to 1-bit black and white image data;
it is the data that will be recognized by the engine. View
"converted_input_file" in an image viewer to see what the OCR engine is
attempting to recognize. Try adjusting the black_threshold and regenerating
this file; notice how "converted_input_file" looks different depending on how
you set the black_threshold option.
Note that you should remove "converted_input_file" before generating it again,
because ocrxtr will append to it and not overwrite it.
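If you want to compare several threshold values in one pass, the steps above
can be scripted. The following Python sketch rests on two assumptions that
you should verify for your installation: that the option is passed as
"-black_threshold=N" (by analogy with "-out_debug_files=y"), and that the
debug files are written to the current working directory. The function
names are illustrative only.

```python
import os
import subprocess

DEBUG_FILE = "converted_input_file"  # name given in the text above


def threshold_command(image, threshold):
    """Build an ocrxtr command line for one black_threshold value.

    Assumption: the option is spelled -black_threshold=N on the
    command line, like the other options shown in this document.
    """
    if not 0 <= threshold <= 102:
        raise ValueError("black_threshold must be in the range 0-102")
    return ["ocrxtr", "-out_debug_files=y",
            "-black_threshold=%d" % threshold, image]


def run_sweep(image, thresholds, workdir="."):
    """Run ocrxtr once per threshold, keeping one debug image per value.

    `image` is interpreted relative to `workdir`.
    """
    debug_path = os.path.join(workdir, DEBUG_FILE)
    for t in thresholds:
        if os.path.exists(debug_path):
            os.remove(debug_path)      # ocrxtr appends, so start clean
        subprocess.run(threshold_command(image, t), cwd=workdir, check=True)
        if os.path.exists(debug_path):
            # Keep a copy named after the threshold, e.g. ..._file_60
            os.rename(debug_path, "%s_%d" % (debug_path, t))
```

Calling run_sweep("image.tif", [40, 60, 80]) would then leave one converted
image per threshold for side-by-side viewing.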
There are a few different approaches to handling the conversion to 1-bit over
a large number of images:
First, if all of your images come from the same source material and were
scanned on the same scanner, then you can simply adjust the black_threshold
to the best value for one of the pages, then use that value when recognizing
the entire set of documents.
However, if you are planning to recognize a large number of documents from a
variety of sources, you might want to take a different approach. If you will
be calling ocrxtr from within another program, then you could, for instance,
adjust the black_threshold programmatically for each logical set of similar
documents: recognize the first page, evaluate the quality of the results,
adjust the black_threshold if needed and recognize again. OCR Shop XTR does
provide an output format called "XDOC" which can include confidence values
for each word or character, which could help you make this judgement
programmatically.
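The recognize, evaluate, and adjust loop described above might be sketched
as follows in Python. How you score a page is up to you; here "score" is a
hypothetical caller-supplied function that recognizes a sample page at the
given black_threshold and returns a quality number (for example, an average
confidence taken from XDOC output). Nothing in this sketch is part of OCR
Shop XTR itself.

```python
def pick_threshold(candidates, score, good_enough=None):
    """Return (best_threshold, best_score) over the candidate values.

    candidates   thresholds to try, in the order they should be tried
    score        function(threshold) -> number, higher is better
    good_enough  optional score at which the search stops early
    """
    best = None
    for threshold in candidates:
        quality = score(threshold)
        if best is None or quality > best[1]:
            best = (threshold, quality)
        if good_enough is not None and quality >= good_enough:
            break                      # no need to recognize again
    if best is None:
        raise ValueError("no candidate thresholds given")
    return best
```

Listing the default value of 60 first means that documents which already
recognize well are processed only once.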
Alternatively, you could convert the input images to 1-bit prior to
submitting them to ocrxtr. This would give you more control over the
conversion and then you would know exactly what image data the OCR engine is
operating on.
One other method you can use to try to improve recognition results with a
multi-bit input image is increasing the image resolution before using it
with OCR Shop XTR.
Be aware that increasing image resolution will increase the image's file
size, resulting in longer OCR processing times and increased memory usage.
You may reduce this side effect by converting the input image to 1-bit
depth after increasing its resolution and before using it with OCR Shop
XTR.
* Automatic Processing and Filtering
Turn on the options "auto_process" and "auto_filter" in order to have the
OCR engine determine which filters will provide the best results for your
input images. Both of these options are on by default.
When you first try OCR Shop XTR, it is best to try it with all default
options to observe the basic behavior and results. Then, if you turn off
auto_process and auto_filter, you can try the different filter options to
determine if any will improve your results: fax_filter, newspaper_filter,
and dotmatrix_filter. You may also leave auto_process and auto_filter on,
and turn off the specific processing and filter options individually.
If you turn off auto_process, you should be careful to turn auto_orient
to "correct" and auto_flip to "Y" if any image might need to have its
orientation automatically detected and corrected. Similarly, setting
"photometric_interp" to "correct" is important if some areas of the input
image are black on white and other areas are white on black.
* Output filesize
To control the filesize of an output format that contains image data (PDF,
HTML, and graphics output), set the bit-depth for the output image data
using the "out_depth" parameter. For instance, to create a 1-bit output
PDF file, run the command-line:
ocrxtr -out_depth=1 -out_text_format=pdf image.tif
By default, the bit-depth of the output matches the bit-depth of the input.
For PDF and PS input, this default is 1 bit per pixel, because the bit-depth
of an input PDF or PS file is unknown.
The "out_depth" option may have these values:
input Bit depth of the input image (default)
1 1 bit per pixel
8 8 bits per pixel
24 24 bits per pixel
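If you call ocrxtr from another program, it can help to check the
"out_depth" value before launching the engine. A minimal Python sketch
follows; the function name is illustrative, and the option spelling is taken
from the command-line shown above.

```python
VALID_OUT_DEPTHS = ("input", "1", "8", "24")  # values listed above


def out_depth_command(image, out_depth="input", text_format="pdf"):
    """Build an ocrxtr command line with a checked out_depth value."""
    value = str(out_depth)
    if value not in VALID_OUT_DEPTHS:
        raise ValueError(
            "out_depth must be one of: " + ", ".join(VALID_OUT_DEPTHS))
    return ["ocrxtr", "-out_depth=" + value,
            "-out_text_format=" + text_format, image]
```

The resulting list can be passed directly to subprocess.run().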
* PDF and PS Input: Bit-depth, Memory Usage, and Processing Speed
PDF and PS input files are rendered by default to 1-bit image data prior to
OCR processing.
When using a PDF or PS input file, if you need to retain 8 or 24-bit image
data in PDF, HTML, or graphics output, set the "out_depth" parameter to 8 or
24 bits per pixel.
WARNING: If your input PDF or PS file contains a large number of pages,
setting "out_depth" to 8 or 24 may result in excessive memory usage, slow
processing times, and potential swap errors if memory is exceeded. For
large input PDF and PS files, we recommend using the default "out_depth" of
1 bit per pixel.
Details:
OCR Shop XTR treats PDF and PS input files differently from other input
formats such as JPEG and TIFF, because PDF and PS input files must be
rendered into image data. For maximum efficiency, OCR Shop XTR renders PDF
and PS input files as 1-bit image data by default. However, in cases where
the user sets the "out_depth" option, OCR Shop XTR must render the input PDF
or PS file at the bit-depth specified so that level of graphical information
is maintained through to the PDF, HTML, or graphics output.
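To get a feel for why the bit-depth matters, you can estimate the amount of
uncompressed image data a rendered page represents. The Python sketch below
is a rough estimate only. It assumes letter-size pages rendered at 300 dpi
in the usage note; the resolution at which OCR Shop XTR actually renders PDF
and PS input is not stated here, so substitute your own figures.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel, pages=1):
    """Uncompressed size in bytes of `pages` rendered pages.

    width_in, height_in  page size in inches
    dpi                  rendering resolution (an assumption; see above)
    bits_per_pixel       1, 8, or 24
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bits_per_row = width_px * bits_per_pixel
    bytes_per_row = (bits_per_row + 7) // 8   # pad each row to whole bytes
    return bytes_per_row * height_px * pages
```

Under those assumptions, one 8.5 x 11 inch page is about 1 MB at 1 bit per
pixel, about 8.4 MB at 8 bits, and about 25 MB at 24 bits, so a 100-page
file at 24 bits represents roughly 2.5 GB of raw image data.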
* Non-square Fax Images
Some fax images have resolutions that are not square. If your input image
is a fax and you suspect this is the case, try setting the option
"-double_dimension=y".
When "double_dimension" is set, the engine internally doubles either the
columns or the rows to make the image closer to square and improve
recognition, if the image's pixels are rectangular rather than square.
Turning on this flag does not guarantee that the image will be doubled.
Image output, either in an embedded document or with plain graphics output,
is not affected by image doubling.
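If you know the horizontal and vertical resolutions of your fax images, a
quick check can tell you whether they are non-square. The Python sketch
below is our own illustration and is not the rule the engine applies
internally; the 1.5 ratio is an arbitrary cut-off chosen for the example.
Standard-resolution faxes are commonly around 204 x 98 dpi, and
fine-resolution faxes around 204 x 196 dpi.

```python
def doubling_hint(x_dpi, y_dpi, min_ratio=1.5):
    """Say which axis, if any, would need doubling to look roughly square.

    Returns "rows" when the horizontal resolution is much higher than the
    vertical one, "columns" for the opposite case, and None when the
    pixels are already close to square.
    """
    if x_dpi <= 0 or y_dpi <= 0:
        raise ValueError("resolutions must be positive")
    if x_dpi / y_dpi >= min_ratio:
        return "rows"
    if y_dpi / x_dpi >= min_ratio:
        return "columns"
    return None
```

A result other than None suggests the image is a candidate for
"-double_dimension=y".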
* TIFF Fill-order Bit
If your input image is a TIFF file and your results are much worse than you
expect, given the quality and properties of the input image, it is possible
that the "fill-order" bit is set in the input image file.
Most TIFF images do not use the fill-order bit; in fact, many programs that
create TIFF files write the fill-order bit incorrectly. As a result, by
default, OCR Shop XTR ignores the TIFF fill-order bit. In the rare case
where an image has the fill-order bit set correctly, you will need to
instruct OCR Shop XTR to obey it.
To determine if OCR Shop XTR should obey the fill-order bit for your input
TIFF image, run this command-line with your image:
ocrxtr -out_debug_files=y image.tif
An image file called "converted_input_file" will be created. This is the
actual image data that will be recognized by the engine. View the file in an
image viewer. Does it look odd, as though each byte of image data has the
bits reversed? If so, then the fill-order bit in the image is probably
set and should be obeyed.
To have OCR Shop XTR obey the TIFF fill-order bit, set the command-line
option, "-ignore_tiff_fillorder=n". Alternatively, you may set an
environment variable, VV_IGNORE_FILLORDER, to "n".
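For readers unsure what "bits reversed" means in practice, the following
Python sketch shows the transformation on raw bytes. It is an illustration
of the concept only and does not read or write TIFF files; the function
names are our own.

```python
def reverse_bits(byte):
    """Return the 8-bit value with its bit order reversed."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (byte & 1)   # move lowest bit to the top
        byte >>= 1
    return result


def reverse_fill_order(data):
    """Reverse the bit order within every byte of a bytes object."""
    table = bytes(reverse_bits(b) for b in range(256))
    return data.translate(table)
```

In a 1-bit image each byte holds eight horizontal pixels, so reversing the
bits mirrors every group of eight pixels in place. This is why text in an
affected image looks scrambled rather than simply flipped.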
Note on the "out_debug_files" flag:
This flag instructs OCR Shop XTR to create two debug TIFF files:
"converted_input_file" and "unconv_input_file". OCR Shop XTR creates
"unconv_input_file" immediately after reading in the input image, and it
should be an exact copy of the input image in TIFF format. OCR Shop XTR
creates "converted_input_file" after converting the input image data to
1-bit prior to OCR processing.
OCR Shop XTR appends to these debug files, instead of overwriting them.
As a result, with "out_debug_files" set, running OCR Shop XTR multiple
times, passing multiple images on the command-line, or passing a multipage
input file will result in a new page of image data appended to
"converted_input_file" and "unconv_input_file" for each page of input
image data. We recommend that you delete "converted_input_file" and
"unconv_input_file" between each run, and/or view them with an application
designed to handle multipage TIFF images.
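Deleting the debug files between runs can be automated. The Python sketch
below assumes the debug files are written to the directory you pass in
(the current working directory by default); please confirm where your
installation writes them. The function name is illustrative.

```python
import os

# Debug file names as given in the text above.
DEBUG_FILES = ("converted_input_file", "unconv_input_file")


def remove_debug_files(directory="."):
    """Delete leftover debug files; return the names actually removed."""
    removed = []
    for name in DEBUG_FILES:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            os.remove(path)
            removed.append(name)
    return removed
```

Call remove_debug_files() immediately before each ocrxtr run that uses
"-out_debug_files=y".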
* OCR Processing Speed
The main variables that affect how fast OCR Shop XTR will process an image
are:
- Filesize of the input image
Large input files require more memory and can result in longer
processing times.
PDF and PS input files may require a larger amount of memory than
anticipated at first glance, because PDF and PS input files are rendered
into image data before being loaded into the OCR engine. By default, PDF
and PS input files are rendered into 1-bit image data, which is small. If
the user specifies an "out_depth" of 8 or 24 bits, however, the input PDF
and PS input files will be rendered at 8 or 24 bits per pixel, and the
amount of image data may be large. This typically only presents
complications if the input PDF or PS file has a large number of pages; see
the section "PDF and PS Input: Bit-depth, Memory Usage, and Processing
Speed".
Be aware that setting the "out_depth" to a value lower than the input
image's bit-depth by definition reduces the colorspace in PDF, HTML, or
graphics output.
- Quality of the input image
Lower quality input images always take longer to process. In the
preprocessing step of OCR, cleaning up and deskewing lower quality images
can be time consuming. In the recognition step, the engine simply takes
longer to recognize less clear text. Similarly, extraneous marks on the
input images, such as handwriting, stamps, or scribbles, will cause the
engine to take longer; note that distinct image regions are much easier
for the engine to understand than amorphous marks.
If you can ensure your input images will be high quality, with clear text,
no image skew, and no extraneous marks, OCR Shop XTR will run fastest.
- Command-line options
During the preprocessing step of OCR, certain options detect image
properties such as orientation, fax images, or skew automatically. If you
know, for example, that your images will always be oriented correctly, that
they aren't faxes, and that they aren't skewed, you can turn these options
off and improve processing time.
- Your machine (CPU speed, RAM)
A faster processor will result in faster OCR processing. Sufficient RAM
will help you achieve the fastest results, and is especially important for
complicated images, large input files, and large combined output
documents. If you notice your machine thrashing, where it spends
excessive time reading and writing to swap, then more memory is
likely to improve performance.
OCR Shop XTR is not multithreaded, so multiple CPUs will not significantly
improve performance unless you are running multiple instances of OCR Shop
XTR concurrently. Multiple instances of OCR Shop XTR may run concurrently
if you purchase multiple OCR Shop XTR licenses for the same machine.
Trial users may request a demo license key that permits multiple
instances.
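If your license permits concurrent instances, a small driver script can
keep several CPUs busy. The Python sketch below is one possible approach,
not a supported tool: it runs ocrxtr with default options on each image and
limits the number of simultaneous processes. The function names are
illustrative, and "instances" should not exceed the number of instances
your license allows.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(image):
    """Run one ocrxtr instance with default options; return its exit code."""
    return subprocess.run(["ocrxtr", image]).returncode


def run_batch(images, instances=2, runner=run_one):
    """Process images with at most `instances` ocrxtr processes at a time.

    Returns a dict mapping each image name to the value returned by
    `runner` (the exit code, for the default runner).
    """
    images = list(images)
    with ThreadPoolExecutor(max_workers=instances) as pool:
        # Each worker thread blocks on its own ocrxtr process.
        return dict(zip(images, pool.map(runner, images)))
```

Threads are sufficient here because the work is done in the separate ocrxtr
processes. Avoid "-out_debug_files=y" in concurrent runs, since every
instance would append to the same debug files.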
* Understanding OCR Processing Using the PDF "normal" Output Format
When you first use OCR Shop XTR, the PDF output format "normal" can be
helpful in understanding how OCR Shop XTR recognizes the text and
reconstructs the formatting, as well as in providing visual feedback on how
the command-line options you choose affect the processing of your image:
ocrxtr -out_text_format=pdf -pdf_format=normal image.tif
While this format is typically not as useful for archiving images as
"-pdf_format=img_txt", it makes a good experimentation and debugging tool.
The PDF "normal" format contains the recognized text, reconstructions of
tables and lineart, plus small images that correspond to the image regions
identified by the OCR engine. This information is laid out in the output
PDF document in an attempt to approximate the original image's formatting as
closely as possible.
See how changing different command-line parameters affects the appearance of
an output PDF "normal" document. Consider using this format to find the
optimal settings for a particular set of your input images, then create
final output in the format you wish.
* Output of Non-Latin1 Character Sets
If you generate output using a character set other than Latin1, be careful
which output format you choose because not all output formats support
non-Latin1 characters. For example, Russian cannot be represented by ASCII
text ("iso"), but can be represented by Unicode ("unicode").
Also be aware that the viewer with which you open output files must support
the character set and format generated.
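The limitation is a property of the character encodings themselves, which
the following Python sketch demonstrates. Python's codec names "ascii",
"latin-1", and "utf-8" are used here as stand-ins; they are not OCR Shop XTR
option values.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


russian = "Привет"                       # Cyrillic sample text
print(can_encode(russian, "ascii"))      # Cyrillic is outside ASCII
print(can_encode(russian, "latin-1"))    # and outside Latin1
print(can_encode(russian, "utf-8"))      # a Unicode encoding covers it
```

The first two checks fail and the third succeeds, which is why a Unicode
output format is needed for Russian text.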