Abstract

Holistic word recognition treats an entire word image as a single pattern. For a known, fixed, and small lexicon, it generally performs better than segmentation-based word recognition. The present work addresses holistic recognition of handwritten Hindi words. Features such as area, aspect ratio, density, pixel ratio, longest run, centroid, and projection length are extracted either from the entire word image or from hypothetically generated sub-images of it. An 89-element feature vector represents each word in the feature space, and five different classifiers are used to measure recognition performance. Considering the complexity of Hindi characters, the technique yields impressive results with a Multilayer Perceptron (MLP) based classifier. Moreover, the technique is scale and rotation invariant to a significant extent.
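As an illustration of the kind of global shape features listed above, the following is a minimal Python sketch, assuming a binarized word image in which nonzero pixels are ink. The function names (`longest_run`, `holistic_features`) and the exact feature definitions (for example, pixel ratio taken as foreground over background pixels inside the bounding box, and projection length taken as the number of columns or rows containing ink) are assumptions made for this example; they are not the definitions used in the present work, and the sketch covers only the whole-image case, not the sub-images.

```python
import numpy as np


def longest_run(line):
    """Length of the longest run of consecutive foreground pixels in a 1-D array."""
    best = current = 0
    for value in line:
        current = current + 1 if value else 0
        best = max(best, current)
    return best


def holistic_features(img):
    """Compute a few global shape features from a binary word image.

    img: 2-D array, nonzero = ink. All definitions are illustrative assumptions.
    """
    ink = np.asarray(img) > 0
    rows, cols = np.nonzero(ink)
    if rows.size == 0:
        raise ValueError("image contains no foreground pixels")

    # Crop to the tight bounding box of the ink.
    top, bottom = rows.min(), rows.max()
    left, right = cols.min(), cols.max()
    box = ink[top:bottom + 1, left:right + 1]
    height, width = box.shape

    area = int(box.sum())              # number of foreground pixels
    background = height * width - area

    return {
        "area": area,
        "aspect_ratio": width / height,
        "density": area / (height * width),
        # Assumed definition: foreground / background inside the box.
        "pixel_ratio": area / background if background else float("inf"),
        "longest_horizontal_run": max(longest_run(r) for r in box),
        "longest_vertical_run": max(longest_run(c) for c in box.T),
        # Centroid normalized by the box size.
        "centroid_x": float((cols.mean() - left) / width),
        "centroid_y": float((rows.mean() - top) / height),
        # Assumed definition: count of columns / rows that contain ink.
        "horizontal_projection_length": int(box.any(axis=0).sum()),
        "vertical_projection_length": int(box.any(axis=1).sum()),
    }
```

Ratio-type features such as aspect ratio, density, and the normalized centroid do not depend on the absolute image size, which is one plausible reason features of this kind tolerate changes in scale. The sketch produces only a handful of values and is not the 89-element vector described above.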

Introduction

Technological advancements have opened the door to digital media for communication, leading toward a paperless society. The first and foremost requirement for achieving this goal is to digitize the huge volume of existing paper documents, whether printed or handwritten. Optical Character Recognition (OCR) is a good choice for this task, as OCRed data can be manipulated and maintained very easily by digital devices. The OCR systems described by Chaudhuri & Pal (1998) and Natarajan et al. (2001) perform well for printed documents. However, no comparably good OCR system for handwritten documents has been reported in the literature to date. The poor performance on handwritten documents is mainly due to the lack of an efficient word extraction module, as described in Sarkar et al. (2011), and of an efficient word recognition module, as described in Basu et al. (2009) and Bhowmik et al. (2015). Two major approaches to word recognition are found in the literature, viz., segmentation-based word recognition (Basu et al., 2009) and holistic word recognition (Bhowmik et al., 2015).

In the first category of approaches, word images are first segmented into their constituent characters, and each character is then recognized and represented with a standard code so that the text can be stored in a digital format editable by standard word processing software. The main advantage of the segmentation-based approach is that it is lexicon free. In most cases, however, it performs poorly on handwritten documents because of the ambiguity of character segmentation (Edelman et al., 1990). Low segmentation accuracy means that the segmentation algorithm fails to find the ideal segmentation points on handwritten word images, resulting in under- or over-segmentation (Malakar et al., 2011). As a result, OCR systems are forced to work with various combinations of erroneous characters or their components. Deciding the possible character subparts from over-segmented characters, together with all possible combinations of valid characters or character subparts, leads to a large set of patterns to recognize. Thus, designing such systems to recognize words from an unbounded lexicon is perhaps impossible. The second approach, the holistic approach, has been introduced by researchers as an alternative Handwritten Word Recognition (HWR) mechanism. It tries to recognize a whole word as a single pattern and provides acceptable recognition accuracy for small, pre-defined lexicons. In other words, this approach is not suitable for unknown or large lexicons.

It is therefore wise to apply the holistic word recognition methodology when the words to be recognized belong to a small, pre-defined lexicon. For example, when recognizing the names of months, weekdays, cities, countries, states, or capitals, or user-defined keywords and key phrases, the holistic approach would be the better option.