Automatic Machine Written Document Reader

Project Summary

The aim of this project is to automatically read printed characters and identify them. The identified text is then passed to a synthetic speech generator, which reads it aloud. The system we implemented takes as input an image with characters printed on it. The image has a white background and the characters are written in black. As of now, it can detect and identify only capital letters (A-Z) and digits (0-9). A sample image is shown in the figure below. We have used a pattern recognition based approach for classification. Simple schemes for segmentation and spatial ordering of characters have been devised.

Acknowledgements

This project was done as part of the Digital Image Processing course at LNMIIT during my 5th semester (Aug-Dec 2010). The project partners were Manohar Kuse (myself) and Sunil Jaiswal. Our course instructor was Mr. Sudhir Gupta.

Brief About Pattern Recognition

As per the definition from Wikipedia, Pattern Recognition (hereafter referred to as PR) is the assignment of a symbol (or value) to a given input. In our case, the inputs are the features of the object. In our character recognition system, simple features like area, perimeter and centroid have been used. The output symbol is determined by machine learning algorithms. This process is also referred to as classification. These algorithms fall into two basic categories.

1. Supervised Classification.

In this type of PR algorithm, there are two stages involved: a learning stage and a classification stage. During the learning stage, inputs and their corresponding correct outputs (also referred to as `Ground Truth Labels`) are given. The algorithm processes this data and produces an internal representation of it in the form of an equation. This representation (also called the model) is then used during the classification stage to predict the outputs for given input features. Examples of supervised algorithms are Support Vector Machines (SVM), Neural Networks, Decision Trees etc.

2. Un-supervised Classification.

In this category of classification algorithms, there is only one stage involved. Classification is mainly achieved by grouping (clustering), which means that similar samples are grouped together. This grouping in turn produces a classification. Well-known clustering algorithms include K-means, Hierarchical Clustering etc.

In our implementation of the character recognition system, we have used a supervised classification scheme, namely multi-category Support Vector Machines (SVM).

Scheme Used

The following steps are involved:

1. Thresholding

2. Connected Component Labelling

3. Normalize Single Character Image

4. Extract Features

5. Learn / Classify

6. Post-Processing

7. Speak Out

Thresholding

This is performed to binarize the input image. At the end of this step, the image contains only white (255) or black (0) pixel values. The motive is to facilitate extraction of each of the characters.

Connected Component Labelling

This is used to identify the connected regions in a binary image. We have used the 8-neighbourhood for connectedness. Refer to the mentioned link for algorithmic details. After applying this algorithm we get a labelled image, in which each label denotes the ID of the blob to which the pixel belongs.

The figure shows that the binary image above has two blobs. Note that all pixels in a blob have the same label. The objective of this labelling is to separate the characters. Please note that we have assumed, for the sake of simplicity, that a single character is one connected component. However, this is not true for characters like “i” and “j”, which have two connected components each. Touching characters may also come out as one component, so that two characters together are treated as a single character. The figure on the right illustrates the touching characters problem. Trier and Taxt (“Evaluation of binarization methods for document images”) have attempted to resolve this issue of touching characters. Our system does not implement the method they suggest and is thus naive in this respect.
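The thresholding and labelling steps described above can be sketched roughly as follows. This is only a minimal illustration, not the project's actual code: the fixed threshold of 128 and the function names `binarize` and `label_components` are assumptions, and the labelling uses a simple breadth-first flood fill over the 8-neighbourhood rather than any particular published algorithm.

```python
from collections import deque

import numpy as np


def binarize(gray, threshold=128):
    """Map every pixel to black (0) or white (255) using a fixed global threshold.

    The value 128 is an assumed default, not taken from the project.
    """
    gray = np.asarray(gray)
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)


def label_components(binary):
    """Label 8-connected regions of black (0) pixels.

    Returns (labels, count). Background pixels keep label 0 and
    blobs are numbered 1..count.
    """
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=np.int32)
    count = 0
    for r in range(h):
        for c in range(w):
            if binary[r, c] != 0 or labels[r, c] != 0:
                continue  # white pixel, or already part of a labelled blob
            count += 1
            labels[r, c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                # Visit all 8 neighbours (the centre pixel is already labelled).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] == 0
                                and labels[ny, nx] == 0):
                            labels[ny, nx] = count
                            queue.append((ny, nx))
    return labels, count
```

With 8-connectivity, two black pixels that touch only diagonally end up in the same blob, which is why thin diagonal strokes stay in one piece.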

Normalize Single Character Image

Each character obtained by connected component labelling needs to be size-normalized for standardization. This standardization is required for the feature extraction step. Since the features should not depend on the size of the characters, we have cropped and scaled each of the characters to 100×100. A few of the extracted normalized characters are shown below.
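One possible way to do this normalization is sketched below. It assumes the labelled image is an integer array in which each blob carries its own ID, and it uses nearest-neighbour sampling for the rescaling; the interpolation method and the name `normalize_character` are assumptions made for illustration only.

```python
import numpy as np


def normalize_character(labels, blob_id, size=100):
    """Crop one labelled blob to its bounding box and rescale it to size x size.

    The result is a 0/1 array in which 1 marks character pixels.
    """
    mask = labels == blob_id
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        raise ValueError(f"blob {blob_id} not found in label image")
    crop = mask[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # Nearest-neighbour sampling: choose a source row and column
    # for every output row and column.
    src_rows = np.arange(size) * crop.shape[0] // size
    src_cols = np.arange(size) * crop.shape[1] // size
    return crop[np.ix_(src_rows, src_cols)].astype(np.uint8)
```

Note that this stretches every character to a square, so the original width/height ratio is lost after normalization.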

Feature Extraction

Since we are using a pattern recognition approach to identify these characters, we need a few numbers that are representative of the normalized character images. These representative numbers are hereafter referred to as features. A detailed survey of the various feature sets used for character recognition is given in the paper titled “Feature extraction methods for character recognition-a survey”.

In this work we have used a rather simple feature set called ‘Zonal Features’, which means that the characters are divided into zones. We have used a total of 9 zones: 4 horizontal zones, 4 vertical zones and the full image. That is, the normalized 100×100 image is divided into 4 sub-images vertically (each of size 100×25) and 4 sub-images horizontally (each of size 25×100). The following regional features have been evaluated for each of the 9 zones: Area, Perimeter, Centroid X, Centroid Y. This gives a total of 36 features (9×4). We also plan to include the aspect ratio (width/height) as a feature, which would give 37 features for each character image.
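The 36 zonal features could be computed along the following lines. This is a sketch under assumptions: the character image is a 0/1 array with 1 for character pixels, the perimeter is approximated by counting character pixels that have at least one background (or out-of-zone) 4-neighbour, and centroids are measured in zone-local coordinates. The source does not state how these quantities were actually defined, and the function names are hypothetical.

```python
import numpy as np


def region_features(zone):
    """Return [area, perimeter, centroid_x, centroid_y] for one 0/1 zone."""
    zone = zone.astype(bool)
    area = int(zone.sum())
    if area == 0:
        return [0.0, 0.0, 0.0, 0.0]
    # A pixel is interior when all four of its 4-neighbours are foreground.
    # Padding with background makes pixels on the zone edge count as boundary.
    padded = np.pad(zone, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = int((zone & ~interior).sum())
    ys, xs = np.nonzero(zone)
    return [float(area), float(perimeter), float(xs.mean()), float(ys.mean())]


def zonal_features(char_img):
    """36 features: 4 regional features for each of the 9 zones."""
    h, w = char_img.shape
    zones = [char_img]  # the full image
    # Four vertical strips (100 rows x 25 columns for a 100x100 input).
    zones += [char_img[:, i * w // 4:(i + 1) * w // 4] for i in range(4)]
    # Four horizontal strips (25 rows x 100 columns for a 100x100 input).
    zones += [char_img[i * h // 4:(i + 1) * h // 4, :] for i in range(4)]
    features = []
    for zone in zones:
        features.extend(region_features(zone))
    return features
```

Empty zones are given all-zero features here so that every character always yields a vector of the same length.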

Learning Phase

Since we have used a supervised model for classification, manual annotation is required for the training data. As of now, we have used about 4500 samples for training, which include characters of various sizes and font types. We have also made a web-based annotation tool, backed by a relational database (MySQL), for easy and speedy manual annotation. The labels produced by this web-based tool are fed into the Support Vector Machine. We have used the SVM-light multi-class implementation, which is available at http://svmlight.joachims.org/svm_multiclass.html

The training has been done with the following parameters:

Maximum training Error : 1.00

1-slack structural algorithm.

Linear Kernel.

The learning phase produces a model file, which is then used for classification. Please note that learning has to be done just once on the training data.
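Before training, the annotated feature vectors have to be written in the sparse text format that SVM-light reads, where each line is a target followed by `index:value` pairs with indices starting at 1. The sketch below shows one way to produce such a file. The helper names and the particular mapping of characters to class numbers 1..36 are assumptions for illustration and are not taken from the project.

```python
# Assumed class ordering: digits first, then capital letters.
CLASSES = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def class_id(char):
    """Map a character to an integer class id in 1..36 (hypothetical mapping)."""
    return CLASSES.index(char) + 1


def to_svmlight_line(label, features):
    """Format one sample as '<target> 1:<v1> 2:<v2> ...'."""
    pairs = " ".join(f"{i}:{value:g}" for i, value in enumerate(features, start=1))
    return f"{label} {pairs}"


def write_training_file(path, samples):
    """Write (character, feature_vector) pairs as one training file."""
    with open(path, "w") as handle:
        for char, features in samples:
            handle.write(to_svmlight_line(class_id(char), features) + "\n")
```

The same line format would be used for test samples in the classification phase; the target value there only serves for computing accuracy.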

A support vector machine basically constructs hyperplanes in a higher-dimensional space. The figure alongside shows a plane (red) which separates the two categories (black and white dots). Here X1 and X2 are the features of the categories. The details of SVM are beyond the scope of this project; please refer to the Wikipedia page on SVM for more details.

Classification Phase

Illustration 3: Support Vector Machines (SVM) with two features separable by a line

This phase comes into play when characters are to be identified from their features. On a test image, the features of each individual character are extracted and SVM-light classifies the characters based on the file produced by the learning process (the model file). At this stage, we have each character classified. However, these characters may be out of order with respect to their spatial positions (positions in the image).

Post Processing

After the characters are recognised, they need to be printed in the correct order of occurrence, and the spaces between characters also need to be identified. Thus, the post-processing consists of the following two stages:

1. Sorting Characters by Spatial Position

2. Identifying Spaces

1. Sorting Characters by Spatial Position

We have used a quick sort algorithm to sort the characters by their coordinate position in the image. The coordinates of the top-left corner of the bounding box were used by the comparison function of the sorting algorithm. This solution has been quite successful in arranging the incoming characters in the order in which they occur.
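A rough sketch of such an ordering step is given below. It does not reproduce the project's comparison function, which is not described in detail. Instead it assumes that characters whose top edges lie within `line_tolerance` pixels of the first character on a line belong to that line, and it uses Python's built-in `sorted` (not quick sort) for brevity. The tolerance value and the function name are hypothetical.

```python
def sort_reading_order(boxes, line_tolerance=20):
    """Order characters top-to-bottom by line, then left-to-right.

    boxes: list of (x, y, char), where (x, y) is the top-left corner
    of the character's bounding box.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):  # scan from top to bottom
        # Compare with the top edge of the first box on the current line.
        if lines and box[1] - lines[-1][0][1] <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within each line, order by the x coordinate of the top-left corner.
    return [box for line in lines for box in sorted(line, key=lambda b: b[0])]
```

Grouping into lines first avoids the problem that characters on the same line rarely share exactly the same top coordinate, so a plain sort on (y, x) would scramble them.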

2. Identifying Spaces

This is a challenge because connected component labelling identifies only the characters, not the spaces between them. Spaces have been identified by setting a cut-off distance between the bounding boxes of adjacent characters. This cut-off distance is calculated adaptively. The adaptive cut-off scheme has been devised by us and is as follows:

a) A frequency plot (histogram) of the distances between the bounding boxes was evaluated.

b) Then the most frequently occurring distance was found. Assume that distance ‘d’ occurs the maximum number of times.

c) The first zero frequency occurring after ‘d’ gives the threshold distance for identifying spaces between characters. That is, if the distance between two characters is more than this threshold, there is a space between them. Note that a zero frequency might not occur after the maximum; in such a case, the distance with the minimum frequency is taken to be the threshold.
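Steps a) to c) can be sketched as follows, assuming the gaps between adjacent bounding boxes are integer pixel distances. The function names, the tie-breaking rules and the handling of the case where no gap exceeds ‘d’ are assumptions added to make the sketch complete.

```python
from collections import Counter


def space_threshold(gaps):
    """Adaptive cut-off computed from the histogram of gaps between boxes."""
    counts = Counter(gaps)  # a) frequency of every gap
    # b) most frequent gap 'd' (ties resolved towards the smaller gap)
    mode_gap = max(counts, key=lambda g: (counts[g], -g))
    largest = max(counts)
    # c) first empty histogram bin after 'd'
    for gap in range(mode_gap + 1, largest):
        if counts[gap] == 0:
            return gap
    # Fallback: no empty bin after 'd', so use the least frequent gap beyond it.
    later = [g for g in counts if g > mode_gap]
    if not later:
        return largest  # assumed behaviour: no gap exceeds this, so no spaces
    return min(later, key=lambda g: (counts[g], g))


def insert_spaces(chars, gaps):
    """chars: characters in reading order; gaps[i] separates chars[i] and chars[i + 1]."""
    if not chars or not gaps:
        return "".join(chars)
    threshold = space_threshold(gaps)
    text = chars[0]
    for char, gap in zip(chars[1:], gaps):
        text += (" " if gap > threshold else "") + char
    return text
```

For example, with gaps `[3, 3, 4, 3, 15, 3, 4, 3, 3]` the most frequent gap is 3 and the first empty bin after it is 5, so only the gap of 15 is treated as a space.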

This scheme seems to work fine when the characters are nearly the same size. For characters of different sizes, a new scheme has to be devised which takes into account the variability in font sizes.

At the end of this step, the characters in the image are available as ASCII text. This text is then passed to a speech synthesizer, which converts the ASCII text into sound. We have used the Linux tool “espeak” as the speech synthesizer. The details of the synthesis process are beyond the scope of this project.
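Handing the recognised text to espeak can be as simple as invoking it with the text as its argument. The wrapper below is a hypothetical sketch of that call and is not the project's code. It assumes the text contains only the characters the system recognises (A-Z, 0-9 and spaces) and does nothing when espeak is not installed.

```python
import shutil
import subprocess


def espeak_command(text):
    """Build the argument list for the espeak command-line tool."""
    return ["espeak", text]


def speak(text):
    """Read text aloud with espeak; return False if espeak is not installed."""
    if shutil.which("espeak") is None:
        return False
    subprocess.run(espeak_command(text), check=True)
    return True
```

Passing the arguments as a list avoids going through a shell, so spaces in the text need no quoting.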

General Comments About Scheme Used

It is assumed that the letters are always written in black on a white background. This basic assumption is used for binarisation of the image with thresholding.

Another assumption is that each character is a connected component.

Only capital letters and digits are used in this case, for the sake of simplicity. However, using small letters would make no difference, except that “i” and “j” could not be identified, since they consist of two connected components each.

Work Still Felt to be Done

Character aspect ratio (Width/height) to be added as a feature.

Larger training set. For reference, a training set available online had about 50,000 annotated samples. We plan to have about 10,000 samples with about 10 different font types.

If time permits, take input from a web-cam. The major challenge is the high prevalence of noise.

As a heads-up in 2016: one can use neural networks (deep learning) to solve this problem much more elegantly and robustly than with hand-crafted features. The MNIST dataset (http://yann.lecun.com/exdb/mnist/) is a dataset of handwritten digits; one can use existing tools to solve it and possibly extend that approach (which I am pretty sure has been done before) to characters.