Introduction

One of the first steps in developing an OCR system is line detection. Farsi/Arabic text has some properties that make it difficult to recognize. For example, some Farsi characters, like "i" in English, consist of two parts but must be recognized as one character. The following code handles this problem.

Background

The reader is assumed to have basic GDI skills and knowledge of elementary concepts of image processing.

Using the code

First of all, note that this algorithm does not detect lines of characters that are covered vertically by another line, as in the image below:

The algorithm is simple:

1. Threshold the image.
2. Compute the horizontal projection of the image; each line of characters appears in it as a continuous vertical run of black points.
3. Scan the projection from top to bottom and find the top and bottom of each vertical run from the previous step.
4. Because two-part characters (like "i" in English) are identified as two lines, merge those lines whose distance to the next line is only a fraction of their height.
5. Save the lines in the output directory.
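The merging step can be illustrated with a minimal Python sketch. This is not the article's C# code: the function name `merge_close_lines`, the `(top, bottom)` representation with an exclusive bottom, the comparison against the taller of the two neighbours, and the default `ratio` are all assumptions made for illustration.

```python
def merge_close_lines(lines, ratio=0.25):
    """Merge vertically adjacent (top, bottom) segments separated by a small gap.

    `lines` is sorted top to bottom and `bottom` is exclusive.
    `ratio` is an assumed tuning value, not one taken from the article.
    """
    merged = []
    for top, bottom in lines:
        if merged:
            prev_top, prev_bottom = merged[-1]
            gap = top - prev_bottom  # number of blank rows between the segments
            tallest = max(prev_bottom - prev_top, bottom - top)
            if gap < ratio * tallest:
                # Treat the small segment (for example a row of dots)
                # as part of the neighbouring line.
                merged[-1] = (prev_top, bottom)
                continue
        merged.append((top, bottom))
    return merged


# A 3-row strip of dots sits 1 row above a 26-row line body;
# the next real line starts 20 rows further down.
print(merge_close_lines([(10, 13), (14, 40), (60, 90)]))
```

Comparing the gap with the taller neighbour is one possible reading of "a fraction of their height"; it lets dots above or below the body be merged with the same rule.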

First, we threshold the image. I used a trivial thresholding algorithm, but a more robust method such as the well-known Otsu thresholding will give a better binary image.
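As a rough illustration, the following Python sketch shows a fixed-cutoff threshold next to a plain implementation of Otsu's method. It is not the article's C# code. It assumes an 8-bit grayscale image stored as a list of rows of integers from 0 to 255, and the names `threshold` and `otsu_cutoff` and the default cutoff of 128 are hypothetical.

```python
def threshold(gray, cutoff=128):
    """Return a binary image in which True marks a black (ink) pixel."""
    return [[pixel < cutoff for pixel in row] for row in gray]


def otsu_cutoff(gray):
    """Pick the cutoff that maximizes the between-class variance (Otsu's method)."""
    histogram = [0] * 256
    for row in gray:
        for pixel in row:
            histogram[pixel] += 1

    total = sum(histogram)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_cutoff, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        # The "dark" class holds every pixel with a value <= level.
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue
        dark_mean = dark_sum / dark_count
        light_mean = (total_sum - dark_sum) / light_count
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_variance, best_cutoff = variance, level + 1
    return best_cutoff


gray = [
    [230, 230, 230],
    [20, 230, 20],
    [230, 230, 230],
]
binary = threshold(gray, otsu_cutoff(gray))
```

The cutoff is returned as `level + 1` so that `pixel < cutoff` selects exactly the dark class found by the search.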

In the second step, we project all black pixels horizontally to obtain the horizontal projection of the image. This yields a discontinuous collection of black runs; we take the top and bottom of each run as the top and bottom of a line of text.
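A horizontal projection can be sketched in a few lines of Python. This is an illustration rather than the article's C# code; it assumes the binary image is a list of rows of booleans in which True marks a black pixel, and the name `horizontal_projection` is hypothetical.

```python
def horizontal_projection(binary):
    """Count the black pixels in each row of a binary image."""
    return [sum(1 for is_black in row if is_black) for row in binary]


binary = [
    [False, False, False, False],
    [True, True, False, True],
    [False, True, False, False],
    [False, False, False, False],
    [False, False, True, False],
]
profile = horizontal_projection(binary)
print(profile)  # prints [0, 3, 1, 0, 1]
```

Rows with a count of zero are the white gaps between lines; consecutive non-zero rows form one black run.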

To find the top and bottom of each line, I developed two functions: FindNextLine, which finds the first black pixel of the next run in the horizontal projection, and FindBottomOfLine, which looks for the first white pixel whose Y coordinate is greater than the top of the line.
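The two functions might look roughly like the Python sketch below. It only mirrors the description above and is not the author's C# implementation. The signatures, the use of a per-row count of black pixels as input, and the `detect_lines` driver loop are assumptions.

```python
def find_next_line(profile, start):
    """Return the first row at or after `start` that contains ink, or None."""
    for y in range(start, len(profile)):
        if profile[y] > 0:
            return y
    return None


def find_bottom_of_line(profile, top):
    """Return the first blank row below `top`, or len(profile) if there is none."""
    for y in range(top + 1, len(profile)):
        if profile[y] == 0:
            return y
    return len(profile)


def detect_lines(profile):
    """Scan top to bottom and collect (top, bottom) pairs; bottom is exclusive."""
    lines = []
    top = find_next_line(profile, 0)
    while top is not None:
        bottom = find_bottom_of_line(profile, top)
        lines.append((top, bottom))
        top = find_next_line(profile, bottom)
    return lines


# Per-row black pixel counts: ink on rows 1-2 and on row 4.
print(detect_lines([0, 3, 1, 0, 1]))  # prints [(1, 3), (4, 5)]
```

Returning an exclusive bottom keeps the pairs directly usable as slice bounds on the image rows.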

Finally, we save the images of the lines in the output directory.
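The saving step can be sketched with the Python standard library alone. This is an assumption-laden illustration, not the article's C# code: the plain PBM format is chosen here only to avoid an imaging library, and the function name `save_lines`, the file naming pattern, and the `(top, bottom)` row ranges with an exclusive bottom are hypothetical.

```python
import os


def save_lines(binary, lines, output_dir):
    """Write each (top, bottom) row range as a plain PBM file; return the paths."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for index, (top, bottom) in enumerate(lines):
        rows = binary[top:bottom]  # bottom is exclusive
        path = os.path.join(output_dir, f"line_{index:03d}.pbm")
        with open(path, "w", encoding="ascii") as handle:
            # Plain PBM header: magic number, then width and height.
            handle.write(f"P1\n{len(rows[0])} {len(rows)}\n")
            for row in rows:
                # In PBM, 1 is a black pixel and 0 is a white pixel.
                handle.write(" ".join("1" if is_black else "0" for is_black in row) + "\n")
        paths.append(path)
    return paths
```

A call such as `save_lines(binary, [(1, 3)], "output")` would write rows 1 and 2 of the image to `output/line_000.pbm`.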

Experimental results

I tested this algorithm with different fonts and sizes, including Mitra, Times New Roman, Arial, and Zar. For samples without any noise, it works in 96% of cases, but for noisy samples the results vary with the noise ratio and are not acceptable.

History

I have spent two years of my life developing an open source Farsi/Arabic OCR, and now I want to share some of my experiences here. If you are interested in developing Farsi/Arabic OCR, you can join the following group: farsi_arabic_OCR@groups.yahoo.com.

About the Author

Hands-on .NET developer with 8 years of working experience as a C# developer, software designer, test developer, and architect, including 3 years of part-time and project-based work and nearly 5 years of full-time work, contributing to and leading all phases of the software development life cycle (SDLC) for a wide variety of enterprise systems and Web-based applications, particularly within the automation, data mining, and insurance sectors. Highly skilled in application design, architecture, and development, with strong expertise in server-side programming as well as in the complete range of .NET technologies.

I ranked 301st among 500,000 in the math & physics university entrance exam of Iran in 2003, and I was a member of the national elites of Iran for one year. I received my BS in Information Technology from Tehran University in 2009. I am now in the last semester of my Master's degree in Software Engineering at Shahid Beheshti University.

Comments and Discussions

I am working on an optical character recognition project, and I would like you to explain in detail how English optical character recognition works and how the letters are read from the image. Could you also share the program code, if possible?

Hi,
It is a hard task. It is hard to recognize a word in the presence of noise, skew, illumination change, and translation. The above code is just a simple answer for the case where the picture is completely clear. It detects the lines of a text. If you apply the same rules in the vertical direction (after detecting lines), you can detect the words. My Skype id is mehran.ghainian and my Gmail is ghainian@gmail.com. Please feel free to make an appointment and talk about the whole project later.

Hi,
Good job, thank you for your code.
Some people are now working on another project that supports Arabic; its core was started by HP and Google. You can find more information here, and the Persian project is here.

I have tried lots of algorithms for segmenting characters,
which would take a lot of time to explain. This is my Gmail id.
You can send me an email to make an internet appointment
for sharing knowledge via Google Talk.

Man, I really appreciate your help.
I live in Lebanon and I want to test your application. I have Visual Studio 2008.
I tried to open the solution, but it gives me the error "the selected file is a solution but was created by newer version of this application and cannot be opened".

I really appreciate your time and your help, and also that you share your knowledge with people.