Abstract

A major problem faced by Gujarati optical character recognition (OCR) systems is the presence of broken characters in machine-printed Gujarati document images. Such characters can cause errors in the character segmentation process. Broken characters arise from scanning noise, older documents with low-quality printing, and thresholding errors, so they must be identified and segmented properly. This paper presents a mean-based thresholding technique for broken character segmentation in printed Gujarati documents. Line segmentation is used to extract text lines from the Gujarati document image, and individual characters are extracted using the vertical projection profile method. Broken characters are then identified using the mean-based thresholding (MBT) algorithm, and heuristic information is used to merge the identified broken characters. The main purpose of this paper is to merge vertically and naturally broken Gujarati characters into a single glyph in the document image. Experiments were carried out on various types of Gujarati documents (A, B, C, and D), achieving an accuracy of 79.93%.
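
The pipeline summarized above (vertical projection profile, a mean-based threshold, then a merge step) can be illustrated with a minimal Python sketch. The abstract does not give the exact MBT rule or the merging heuristics, so the sketch makes two loudly labeled assumptions: a segment is flagged as a possible fragment when its width is below the mean segment width of the line, and adjacent flagged segments are merged into one glyph. All function names are hypothetical and the input is a toy binary text-line image (1 = ink), not real Gujarati data.

```python
from statistics import mean

def vertical_projection(line_img):
    """Count ink pixels (value 1) in each column of a binary text-line image."""
    return [sum(col) for col in zip(*line_img)]

def segment_columns(profile):
    """Return (start, end) column spans, end exclusive, for runs of non-zero ink."""
    spans, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a new segment begins
        elif count == 0 and start is not None:
            spans.append((start, x))       # a blank column closes the segment
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans

def flag_broken(spans):
    """ASSUMED rule (not from the paper): a span narrower than the mean
    span width of the line is a candidate broken fragment."""
    if not spans:
        return []
    threshold = mean(end - start for start, end in spans)
    return [(end - start) < threshold for start, end in spans]

def merge_fragments(spans, flags):
    """ASSUMED heuristic (not from the paper): merge each run of adjacent
    flagged spans into a single span."""
    merged, prev_flagged = [], False
    for span, flagged in zip(spans, flags):
        if flagged and prev_flagged:
            merged[-1] = (merged[-1][0], span[1])  # extend the previous fragment
        else:
            merged.append(span)
        prev_flagged = flagged
    return merged

# Toy line: a 4-column glyph, two narrow fragments, then another 4-column glyph.
row = [int(c) for c in "11110101101111"]
line_img = [list(row) for _ in range(3)]

profile = vertical_projection(line_img)
spans = segment_columns(profile)
flags = flag_broken(spans)
glyphs = merge_fragments(spans, flags)
```

In this toy case the two narrow segments fall below the mean width and are joined into one span, while the two full-width glyphs are left untouched. A practical system would need additional cues (for example, gap size or vertical overlap) to avoid merging characters that are legitimately narrow.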