Studies in Recognition of Telugu Document Images

Abstract

The rapid evolution of information technology (IT) has prompted massive growth in the digitization of
books. Accessing these huge digital collections requires solutions that make the archived
materials searchable. Such solutions can only come from research in document image
understanding. In the last three decades, many significant developments have been made in the
recognition of Latin-based scripts. Recognition systems for Indian languages lag far behind
recognizers for European languages such as English. The diversity of archived printed documents poses an
additional challenge to document analysis and understanding. In this work, we explore the recognition of
printed text in Telugu, a south Indian language.
We begin our work by building the Telugu script model for recognition and adopting an existing
optical character recognition system for the same. A comprehensive study on all the modules of the
optical recognizer is done, with the focus mainly on the recognition module. We then evaluate the
recognition module by testing it on synthetic and real datasets. We achieved an accuracy of 98% on the
synthetic dataset, but the accuracy drops to 91% on 200 pages from scanned books (the real dataset).
To analyze the drop in accuracy and the modules propagating errors, we create datasets of different
quality, namely a laser print dataset, a good real dataset and a challenging real dataset. Analysis of these
experiments revealed the major problems in the character recognition module. We observed that the
recognizer is not robust enough to tackle the multifont problem. The classifier’s component accuracy
varied significantly on pages from different books. Also, there was a large gap between the component
and word accuracies: even with a component accuracy of 91%, the word accuracy was just 62%. This
motivated us to solve the multifont problem and improve the word accuracies. Solving these problems
would improve OCR accuracy for any language.
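
The gap between component and word accuracy is roughly what one would expect if a word counts as correct only when every one of its components is correct. The following back-of-the-envelope sketch makes that arithmetic concrete. It assumes independent component errors and a hypothetical average of five components per word; neither assumption comes from the experiments, and the function name is illustrative.

```python
def word_accuracy(component_accuracy: float, components_per_word: int) -> float:
    # Assumption: each component is recognized independently, and a word is
    # correct only if all of its components are correct.
    return component_accuracy ** components_per_word


# Hypothetical word length of 5 components, chosen only for illustration.
print(round(word_accuracy(0.91, 5), 3))  # 0.624
```

Under these assumptions a 91% component accuracy already implies a word accuracy near 62%, which suggests why small gains at the component level can matter a great deal at the word level.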

A major requirement in the design of robust OCRs is the invariance of the feature extraction scheme
to the popular fonts used in print. Many statistical and structural features have been tried for
character classification in the past. In this work, motivated by recent successes in the object
category recognition literature, we use a spatial extension of the histogram of oriented gradients (HOG)
for character classification. We conducted experiments on 1.46 million Telugu character samples in
359 classes and 15 fonts. On this dataset, we obtain an accuracy of 96-98% with an SVM classifier.
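
To give a feel for the kind of feature involved, here is a minimal sketch of a gradient orientation histogram computed over a grid of cells. It is not the descriptor used in this work: the cell grid, the number of bins, the global L2 normalization and the function name `hog_features` are all assumptions made for illustration. The resulting vector is the sort of input one would pass to an SVM, which is not shown.

```python
import math


def hog_features(img, cells=(2, 2), bins=9):
    """Concatenate per-cell histograms of unsigned gradient orientation.

    img is a 2-D list of grayscale values; cells is the (rows, cols) grid.
    """
    h, w = len(img), len(img[0])
    rows, cols = cells
    hist = [[0.0] * bins for _ in range(rows * cols)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central differences approximate the image gradient.
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            # Unsigned orientation in [0, pi), quantized into `bins` bins.
            angle = math.atan2(gy, gx) % math.pi
            b = min(int(angle / math.pi * bins), bins - 1)
            # The cell index keeps coarse spatial layout in the descriptor.
            cell = (y * rows // h) * cols + (x * cols // w)
            hist[cell][b] += mag
    feat = [v for cell_hist in hist for v in cell_hist]
    norm = math.sqrt(sum(v * v for v in feat)) or 1.0
    return [v / norm for v in feat]
```

Because the histograms record stroke direction per region rather than exact pixel values, features of this kind tend to change less across fonts than raw pixels do, which is the property the paragraph above asks for.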
A typical optical character recognizer (OCR) uses only local information about a particular character
or word to recognize it. In this thesis, we also propose a document-level OCR which exploits the fact
that multiple occurrences of the same word image should be recognized as the same word. Whenever the
OCR output differs for the same word, it must be due to recognition errors. We propose a method
to identify such recognition errors and automatically correct them. First, multiple instances of the same
word image are clustered using a fast clustering algorithm based on locality sensitive hashing. Three
different techniques are proposed to correct the OCR errors by looking at differences in the OCR output
for the words in the cluster: character majority voting, an alignment technique based on
dynamic time warping, and one based on progressive alignment of multiple sequences. In this work,
we demonstrate the approach over hundreds of document images from English and Telugu books by
correcting the output of the best performing OCRs for English and Telugu. The recognition accuracy at
word level is improved from 93% to 97% for English and from 58% to 66% for Telugu. Our approach
is applicable to documents in any language or script.
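
The cluster-then-correct idea can be sketched in a few lines. The code below is an illustration under stated assumptions, not the method evaluated here: `lsh_clusters` uses random-hyperplane hashing over hypothetical word-image feature vectors, and `majority_vote` only votes over OCR outputs of the most common length. The alignment-based variants (dynamic time warping, progressive alignment) exist precisely to handle outputs of differing length and are not shown.

```python
import random
from collections import Counter, defaultdict


def lsh_clusters(features, n_bits=8, seed=0):
    """Group vectors whose random-hyperplane signatures are identical."""
    rng = random.Random(seed)
    dim = len(features[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    buckets = defaultdict(list)
    for idx, vec in enumerate(features):
        # One bit per hyperplane: which side of the plane the vector falls on.
        sig = tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes
        )
        buckets[sig].append(idx)
    return list(buckets.values())


def majority_vote(outputs):
    """Character-wise majority vote over strings of the most common length."""
    length = Counter(len(s) for s in outputs).most_common(1)[0][0]
    same_len = [s for s in outputs if len(s) == length]
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*same_len)
    )
```

For example, `majority_vote(["hcuse", "house", "houze"])` returns `"house"`: each position is decided by the character most instances agree on, so an error that appears in only one instance of the word is outvoted.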