History

The Tesseract engine was originally developed as proprietary software at Hewlett Packard labs in Bristol, England, and Greeley, Colorado, between 1985 and 1994. Further changes were made in 1996 to port it to Windows, and part of the code was migrated from C to C++ in 1998. Much of the code was originally written in C, with later parts written in C++; since then, all of the code has been converted to at least compile with a C++ compiler.[3] Very little work was done in the following decade. Hewlett Packard and the University of Nevada, Las Vegas (UNLV) released it as open source in 2005, and Google has sponsored its development since 2006.[7]

Features

Tesseract was among the top three OCR engines in terms of character accuracy in 1995.[8] It is available for Linux, Windows and Mac OS X; however, due to limited resources, only Windows and Ubuntu are rigorously tested by the developers.[3][4][9]

Tesseract up to and including version 2 could accept only TIFF images of simple single-column text as input. These early versions did not include layout analysis, so inputting multi-column text, images, or equations produced garbled output. Since version 3.00, Tesseract has supported output text formatting, hOCR positional information and page layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportional.[4]
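
As an illustration of what hOCR positional information looks like to a consumer, the following is a minimal sketch that extracts word bounding boxes from hOCR markup using only the Python standard library. hOCR encodes each recognized element as an HTML element whose title attribute carries a "bbox x0 y0 x1 y1" entry. The SAMPLE string is hand-written for this example and is not actual Tesseract output, and the names HocrWordParser and parse_hocr_words are hypothetical helpers, not part of Tesseract. The sketch assumes that word spans (class ocrx_word) contain no nested span elements.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []      # finished (text, (x0, y0, x1, y1)) pairs
        self._bbox = None    # bounding box of the word span being read, if any
        self._text = []      # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(group) for group in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans contain no nested <span> elements.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (word, (x0, y0, x1, y1)) tuples found in hocr_text."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hand-written sample markup for illustration only.
SAMPLE = (
    "<span class='ocr_line' title='bbox 10 20 300 48'>"
    "<span class='ocrx_word' title='bbox 10 20 90 48; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' title='bbox 100 20 180 48; x_wconf 91'>world</span>"
    "</span>"
)

if __name__ == "__main__":
    for word, bbox in parse_hocr_words(SAMPLE):
        print(word, bbox)
```

Each returned tuple pairs the recognized word with its pixel coordinates on the page, which is the information a frontend needs to overlay text on the original image.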

If Tesseract is used to process right-to-left text such as Arabic or Hebrew, the results are ordered as though it were left-to-right text.[10]

Tesseract is suitable for use as a backend, and can be used for more complicated OCR tasks including layout analysis by using a frontend such as OCRopus.[11]

Tesseract's output will be of very poor quality if the input images are not preprocessed to suit it. Images (especially screenshots) must be scaled up so that the text x-height is at least 20 pixels;[12] any rotation or skew must be corrected, or no text will be recognized; low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page; and dark borders must be removed manually, or they will be misinterpreted as characters.[13]
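
Three of these preprocessing steps can be sketched in plain Python on a grayscale image stored as a list of rows of 0–255 brightness values. This is a minimal illustration under stated assumptions, not the procedure Tesseract or any particular tool uses: the function names, the darkness threshold of 64, the box-blur radius and the re-centering at mid-gray 128 are all choices made for this example. Skew correction is omitted, and the x-height is assumed to have been measured beforehand.

```python
def scale_factor_for_x_height(measured_x_height_px, target_px=20):
    """Return the upscaling factor needed to reach the target x-height (never below 1)."""
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1.0, target_px / measured_x_height_px)


def crop_dark_borders(image, dark_threshold=64):
    """Crop outer rows and columns whose mean brightness is below dark_threshold.

    The threshold is an assumption for this sketch, not a Tesseract setting.
    """
    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    top, bottom = 0, len(image)
    while top < bottom and mean(image[top]) < dark_threshold:
        top += 1
    while bottom > top and mean(image[bottom - 1]) < dark_threshold:
        bottom -= 1
    rows = image[top:bottom]
    if not rows:
        return []

    left, right = 0, len(rows[0])
    while left < right and mean(row[left] for row in rows) < dark_threshold:
        left += 1
    while right > left and mean(row[right - 1] for row in rows) < dark_threshold:
        right -= 1
    return [row[left:right] for row in rows]


def flatten_illumination(image, radius=1):
    """Crude high-pass filter: subtract a box-blur local mean, re-center at 128."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            value = image[y][x] - local_mean + 128
            out_row.append(int(min(255, max(0, round(value)))))
        result.append(out_row)
    return result


if __name__ == "__main__":
    bordered = [
        [0, 0, 0, 0],
        [0, 200, 200, 0],
        [0, 200, 200, 0],
        [0, 0, 0, 0],
    ]
    print(scale_factor_for_x_height(8))
    print(crop_dark_borders(bordered))
```

In practice the radius would be much larger than the stroke width of the text, so that slow brightness gradients are removed while the characters themselves survive; the small default here only keeps the example quick to run.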

User interfaces

Several separate projects provide a GUI for Tesseract:[15]

FreeOCR – a Windows Tesseract GUI. However, it has been widely reported to install malware along with the OCR program.[16][17]

gImageReader – a GTK GUI frontend for Tesseract that supports selecting columns and parts of the document. It can open multipage PDF files or images, supports all formats, can transmit a selected area to Tesseract for recognition, and can spell-check the output.[18]

VietOCR – a Java-based cross-platform GUI that includes a language pack and special post-processing tools for Vietnamese. It can recognize text in any language supported by Tesseract once the appropriate language data files have been downloaded.[25]

Reception

In a July 2007 article on Tesseract, Anthony Kay of Linux Journal termed it "a quirky command-line tool that does an outstanding job". At that time he noted "Tesseract is a bare-bones OCR engine. The build process is a little quirky, and the engine needs some additional features (such as layout detection), but the core feature, text recognition, is drastically better than anything else I've tried from the Open Source community. It is reasonably easy to get excellent recognition rates using nothing more than a scanner and some image tools, such as The GIMP and Netpbm."[2]