Some time ago I started a new job at a big corporation. My first task was to re-implement their C# TCP client in Java. The existing converters were poor, so I did it manually. After a week or so, a fresh Java TCP client and server simulator were written and waiting for further use. After reviewing the client’s requirements, we found that the Java implementation lacked important features such as fail-over and auto-reconnection. Adding that functionality would have required untested code and might have introduced flaws in the business logic. Then one of our guys said: “Aha, what if…? We could replace the Java implementation with something else, for instance Spring Integration.” The rest of us smiled, thinking: what the heck? Anyway, my boss is a good champ who tries to pick the best available technologies, and we got a green light to do the research and learn something exciting. To keep the requirements simple, I am going to show a simulator (aka server) and a client.

Before delving deeper, let me explain what Spring Integration is intended for. As its site suggests: “it provides an extension of the Spring programming model to support the well-known Enterprise Integration Patterns”. Rephrasing: to design a good enterprise application, one can use messaging, more precisely asynchronous messaging, which lets diverse applications be integrated with each other without nightmares or pain. The famous book “Enterprise Integration Patterns” was written by Gregor Hohpe and Bobby Woolf (with a foreword by Martin Fowler). The folks at Spring apparently decided one day to turn the theory into practice. A very pragmatic approach, isn’t it? Later you will see how wonderfully it fits regular tasks. The main concepts of SI are: Endpoint, Channel & Message.

An endpoint is a component which actually does something with a message. A message is a container consisting of a header and a payload. The header contains data relevant to the messaging system, while the payload contains the actual data. A channel connects two or more endpoints; it is similar to Unix pipes. Two endpoints can exchange messages if and only if they are connected through a channel. Pretty easy, isn’t it? The following diagram shows this.
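The header/payload split can be sketched as a toy model in plain Java. This is only an illustration of the concept, not Spring Integration’s actual API (the real message interface lives in the Spring jars as Message&lt;T&gt; with a framework-provided headers class):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy model of SI's Message concept: immutable headers plus a payload.
final class ToyMessage<T> {
    private final Map<String, Object> headers;
    private final T payload;

    ToyMessage(Map<String, Object> headers, T payload) {
        // Defensive copy so the header map cannot change after construction.
        this.headers = Collections.unmodifiableMap(new HashMap<>(headers));
        this.payload = payload;
    }

    Map<String, Object> getHeaders() { return headers; }
    T getPayload() { return payload; }
}
```

For example, `new ToyMessage<>(Map.of("correlationId", "42"), "hello")` models a message whose payload is the string "hello" and whose header carries routing metadata.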

The next step in our crash course is defining the requirements. I would say we need a TCP server and a TCP client. We will write a simple application in which the two exchange a couple of messages.

An important thing when using SI is the configuration file, which contains all the components we are going to use. Here is the “server” part of the configuration. Simplifying the model and the SI lifecycle: Spring creates the objects defined in the configuration XML. More generally, this concept is called declarative programming. You define a business object in the XML, and the framework generates the appropriate classes for you, then injects and initializes the dependencies. The mantra says: you should concentrate only on the business and not on the implementation.

The important things are: i. A factory (tcp-connection-factory) creates the TCP server using a byte-array length serializer. A serializer is needed to “package” our message, i.e. encode it so it can be transmitted over the wire. Conversely, a deserializer is needed to “unpackage”, i.e. decode, our message. Spring Integration has two kinds of factories, one for the client and another for the server; the difference is the type attribute [server or client]. The port is the one to listen on for incoming messages. No IP address is given here because the server runs on localhost.

We also defined two channels: serverIn (for incoming messages) and serverOut (for outgoing messages). For our server to send and receive messages, we define inbound and outbound adapters associated with the factory and the channels; in our case these are the endpoints. So, when a message comes in, something should take care of it. This responsibility falls to a service, i.e. the file sender service. If it accepts a message, it then sends a file, line by line, to the client in the background. Basically, when the server starts, it listens for incoming messages, but only a specific message will be accepted; once that message is received, the server sends the file line by line. If an error occurs, it is routed to the error channel; this is done using an interceptor.
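A minimal sketch of such a server configuration might look as follows. The port, bean ids and the service method name are my assumptions, not taken from the original project, and the int/int-ip prefixes assume the standard Spring Integration namespace declarations:

```xml
<int-ip:tcp-connection-factory id="serverFactory"
    type="server"
    port="5555"
    serializer="lengthSerializer"
    deserializer="lengthSerializer"/>

<bean id="lengthSerializer"
    class="org.springframework.integration.ip.tcp.serializer.ByteArrayLengthHeaderSerializer"/>

<int:channel id="serverIn"/>
<int:channel id="serverOut"/>

<int-ip:tcp-inbound-channel-adapter channel="serverIn"
    connection-factory="serverFactory"/>
<int-ip:tcp-outbound-channel-adapter channel="serverOut"
    connection-factory="serverFactory"/>

<int:service-activator input-channel="serverIn"
    ref="fileSenderService" method="accept"/>
```

The factory owns the socket, the two adapters bind the channels to it, and the service activator hands each incoming message to the file sender service.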

I would like to say a couple of words about the Spring lifecycle. The Spring framework has two “main” packages, org.springframework.beans and org.springframework.context, which build up the core of the dependency injection machinery. The org.springframework.beans.factory.BeanFactory interface provides basic bean creation and configuration, with lifecycle callbacks for bean initialization and destruction. The org.springframework.context.ApplicationContext adds AOP integration, message resource handling and more.

Our server is ready, I mean, completely ready. To run the example, follow the steps below:

/* Creates an input stream over the file to be read. */
fstream = new FileInputStream(getPath2File());
/* Wraps the input stream so that whole lines can be read. */
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
while ((line = br.readLine()) != null) {
    command = line;
    sendAndLog(timeToWait);
}
br.close();

I’m pleased to announce that the LGPL-licensed Tika chm extractor was released yesterday. Honestly, it’s not pure LGPL; only the libraries it depends on are, while the rest of the code is under the Apache License, version 2.0.

All relevant information can be found here.
To download the sources, go to GitHub.

Why should it live?
Well, the “original” Tika extraction algorithm works pretty well in most cases but has “difficulties” in rare ones. The inventors of compressed HTML files, for unknown reasons, never published their specification, so the extraction algorithm in Tika’s chm parser is not perfect, though quite good.
A possible solution that crossed everybody’s mind was to use native libraries. Fair enough. The only question is how to make that work on multiple platforms. Aha! Having checked the available options, I found a stable Java library called sevenzipjbind.

The extractor is designed as a stand-alone program, i.e. a Jetty-based server which listens for HTTP requests. It currently has three options: i. extract a single file, including metadata; ii. extract content & metadata from all files in a provided directory; iii. extract only the metadata from a single chm.
In addition, it saves the extracted content & its metadata in a special folder following the pattern: ../extracted_files/folder_name_as_file_name/extracted html files. The metadata goes under ../extracted_files/file_name.json
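The layout described above can be sketched with a small helper. The class and method names here are hypothetical, for illustration only; they are not the extractor’s actual code:

```java
import java.nio.file.Path;

// Hypothetical sketch of the output layout: extracted html goes under
// extracted_files/<chm base name>/, metadata under
// extracted_files/<chm base name>.json.
final class OutputLayout {
    private final Path root;

    OutputLayout(Path extractedFilesRoot) {
        this.root = extractedFilesRoot;
    }

    // Directory that receives the extracted html files of one chm.
    Path htmlDirFor(String chmFileName) {
        return root.resolve(stripExtension(chmFileName));
    }

    // JSON file that receives the metadata of one chm.
    Path metadataFor(String chmFileName) {
        return root.resolve(stripExtension(chmFileName) + ".json");
    }

    private static String stripExtension(String name) {
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }
}
```

So for "guide.chm" the html lands under extracted_files/guide/ and the metadata in extracted_files/guide.json.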

While many applications today use direct data entry via keyboard, more and more of them will return to automated data entry. The reasons include the increased incidence of operator wrist problems from constant keying and the potential hazards of video display terminal emissions. Therefore almost any application imaginable is a candidate for OCR.

What are its Applications?

Automatic number plate recognition is used by various police forces, as a method of electronic toll collection on pay-per-use roads, in parking and car-washing stations, and for cataloguing the movements of traffic or individuals (quite popular in Central London).

Book scanning – digital books can be easily distributed, reproduced and read on-screen. Projects like Project Gutenberg and Google Book Search scan books on a large scale.

CAPTCHA – a type of challenge-response test used in computing as an attempt to ensure that the response is not generated by a computer. It stands for Completely Automated Public Turing test to tell Computers and Humans Apart.

Computational linguistics – machine translation

Digital pen as well as digital paper

Digital mail room is an automation of incoming mail processes for classification and distribution of mail.

Handwriting – a person’s particular and individual style of writing with pen or pencil. Every literate human has his or her own manner of writing. Graphology is the controversial study and analysis of handwriting, especially in relation to human psychology. It is sometimes part of the hiring process: the candidate is asked to write by hand about a familiar topic, and the sample is then sent to the authorities for a psychological analysis of the person.

Music OCR – intended to interpret sheet music or printed scores into editable and playable form.

Optical Mark Recognition – is a process of capturing human marked data from document forms such as surveys and tests.

Optical Character Recognition (OCR) systems recognize machine print. Using pattern-matching technology, OCR translates the shapes and patterns of machine-made characters into the corresponding computer codes. Though the most advanced systems are able to recognize multiple fonts, most can process only standard fonts such as Times Roman and Arial. Once all characters in a given word are recognized, the word is compared against a vocabulary of potential answers for the final result.

Character recognition then segments lines of text or words into separate characters, which are recognized by the makeup of their component shapes. Machine-printed letters are evenly spaced across, and up and down, a given page, allowing the OCR system to read the text one character at a time. Segmentation into single characters represents a critical failure point for forms-processing organizations, because OCR technology requires high-quality images with excellent contrast and character clarity. Any text that is less than perfect will cause even the most sophisticated OCR systems to suffer significant reductions in accuracy.

How to choose an optimal product?

When discussing which OCR product to choose, a number of criteria should be considered. What price are you ready to pay? What is the quality of the product? How well is it supported? And so on, and so on. Fortunately for us, such a product exists. It is open source, of very good quality, pretty well supported and still alive. It is called tesseract-ocr. Why tesseract? Because it is open source, licensed under ASFv2, one of the best, supported pretty well via its mailing list, runs on multiple platforms, has a wide range of built-in languages, is stable, and integrates easily with other systems.

This tutorial is divided into:

Introduction to tesseract-ocr

Installation of tesseract 3.0.1 for Windows.

Extracting the text

Writing simple tesseract function using baseapi

Writing Java function that extracts text from given image using ProcessBuilder and tesseract.exe

Introduction to ImageMagick

Installation of ImageMagick 6.6.9-8 for Windows

Checking the installation

A brief description of what’s under the hood, plus useful command-line utilities.
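One of the steps above, calling tesseract.exe from Java via ProcessBuilder, can be sketched as follows. The executable path and file names are assumptions; adjust them to your installation. Note that tesseract writes its result to &lt;outputBase&gt;.txt:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Sketch: invoke the tesseract executable on an image and read back
// the text file it produces.
final class TesseractRunner {

    // tesseract's basic invocation is: tesseract <image> <outputBase>
    static List<String> buildCommand(String exePath, String imagePath, String outputBase) {
        return Arrays.asList(exePath, imagePath, outputBase);
    }

    static String extractText(String exePath, String imagePath, String outputBase)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(exePath, imagePath, outputBase))
                .redirectErrorStream(true)
                .start();
        p.waitFor();
        // tesseract appends ".txt" to the output base name.
        Path out = Paths.get(outputBase + ".txt");
        return new String(Files.readAllBytes(out));
    }
}
```

A call such as `TesseractRunner.extractText("C:/tesseract/tesseract.exe", "scan.tif", "out")` would return the recognized text, assuming tesseract is installed at that path.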

As WIKI suggests, in geometry, the tesseract, also called an 8-cell or regular octachoron or cubic prism, is the four-dimensional analog of the cube. The tesseract is to the cube as the cube is to the square. Just as the surface of the cube consists of 6 square faces, the hypersurface of the tesseract consists of 8 cubical cells. The tesseract is one of the six convex regular 4-polytopes.

In our case, the Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV accuracy test. It was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley, Colorado. Between 1995 and 2006 little work was done on it, but it is probably one of the most accurate open source OCR engines available. The code reads a binary, grey or color image and outputs text. Google now takes care of it.

Tesseract Installation

Throughout this tutorial we will use a Windows box with Microsoft Visual Studio 2008 Express installed.

The installation is very simple and takes about an hour. You can use the provided Ant script to run particular tasks, or do it yourself.

cntraining – generates a normproto and a pffmtable. It reads in a text file consisting of feature samples from a training page in the following format: FontName CharName NumberOfFeatureTypes(N). It then appends these samples into a separate file for each character. The file name is: DirectoryName/FontName/CharName.FeatureTypeName. The DirectoryName can be specified via a command-line argument; if not specified, it defaults to the current directory.

combine_tessdata – creates a unified traineddata file from the different files produced by the training process.

-o Overwrites individual components of the given lang_code.traineddata file. Example:

combine_tessdata -o tessdata/ell.traineddata

-u Unpacks all the components to the specified path. For instance:

combine_tessdata -u tessdata/ell.traineddata /home/$USER/temp/ell

mftraining – separates training pages into files for each character, stripping from the files only the features and their parameters of feature type mf. It reads in a text file consisting of feature samples from a training page in the following format: FontName CharName NumberOfFeatureTypes(N). The result is a binary file used by the OCR engine.

unicharset_extractor – extracts a character/ligature set. Given a list of box files on the command line, it generates a file containing a unicharset, a list of all the characters. The file contains the size of the set on the first line, and then one unichar per line. Usage: unicharset_extractor [-D DIRECTORY] FILE…

For a human being it is easy to recognize what is written (rondity describe.), however, look at the output:

rmdwdescrbe.

It could not recognize the first word, nor the white space; only the second word was recognized perfectly. You can train your OCR to handle words like the first one, but whoever produces such gotchas will change their algorithm, and you will fail again. In such a case, don’t try harder.

In addition to batch processing, tesseract-ocr makes it possible to integrate its capabilities into your program/product through a basic C++ API. It is well documented and easy to use. All the baseapi sources are located in the ../api folder.

When working with OCR, you will often want to prepare your data (images) before throwing them at the OCR engine. That could mean converting the image format, increasing/decreasing the image resolution, or reducing image noise. There are a lot of options to achieve that; GIMP (http://www.gimp.org/) is free, mature and cute! If you are trying to automate the data preparation process, look at ImageMagick (http://www.imagemagick.technocozy.com/).

ImageMagick

It can detect edges, add noise, capture the screen, and much more. I cannot cover everything here, so I am going to cover only the parts relevant to our OCR processing:

Format conversion;

Transformations;

Compositing – overlaying one image on top of another;

Image identification

MSL – the Magick Scripting Language.

Installation of ImageMagick

IM supports a wide range of platforms, from *nix to Windows. I assume that throughout this tutorial you have been using Windows, so let it be so.

If you use the Ant script provided, run:

ant im.http

This command downloads the Windows installer. The Ant properties are in the build.properties file; change them according to your set-up.

Moreover, the MAGICK_HOME environment variable should be set to the path where you previously extracted the ImageMagick files.

Verifying installation

convert logo: logo.miff
imdisplay logo.miff

ImageMagick core utilities

display – intended to view an image and manage it, including load, print, write to file, zoom, copy a region, paste a region, crop, show histogram and more.

convert – converts between image formats. Can also be used for making thumbnails, charcoal drawing, oil painting, morphing.

import – captures the screen and writes it to a file. A single window, the entire screen, or any portion of the screen can be specified.

animate – shows animated formats or a sequence of images. Has a capability for color reduction to match the color resolution of the display.

/**
 * Creates an IMOperation, adds an image to it, and runs the identify
 * and verbose commands.
 *
 * @param searchPath  where the ImageMagick executables are placed
 * @param sourceImage a source image
 * @param destImage   a destination image to be converted
 *
 * @throws IOException
 * @throws InterruptedException
 * @throws IM4JavaException
 */
public static void tryExample(String searchPath, String sourceImage,
                              String destImage) throws IOException,
                                                       InterruptedException,
                                                       IM4JavaException {
    // Convert the source image into the destination image.
    ConvertCmd convertCmd = new ConvertCmd();
    convertCmd.setSearchPath(searchPath);
    convertCmd.setCommand(sourceImage, destImage);
    convertCmd.run(new IMOperation());

    // Run identify with verbose output on the converted image.
    IMOperation op = new IMOperation();
    op.addImage(destImage);
    IMOps ops = op.identify().verbose();
    convertCmd.run(ops);
}

There is another cool thing called MSL, which stands for Magick Scripting Language. It is basically an XML language intended for those who want to accomplish custom image processing tasks without programming. The interpreter is called conjure. A script looks like a typical XML file with specialized tags in it and the file extension msl.
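A tiny MSL script might look as follows. This is a sketch from memory of the MSL tag set, and the file names are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<image>
  <read filename="input.png"/>
  <resize geometry="50%"/>
  <write filename="output.png"/>
</image>
```

Saved as resize.msl, it would be run with: conjure resize.msl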

For example, a trim operation removes any edges that are exactly the same color as the corner pixels.

In this short tutorial I could not include all the ImageMagick utilities, so you are welcome to check them out by yourself.

Tips

While using tesseract I made some wrong decisions. One of them was using Cygwin. I wasted about two days trying to compile the sources, adding more and more missing libraries and recompiling again. Finally, I got my executables; however, some features still did not work. Removing all the Cygwin “mess” and installing Microsoft Visual Studio 2008 Express solved my troubles. It took only about two hours, including the installation of MS VS and compiling the entire solution. It worked like a charm!

The next challenge was to install ImageMagick. I thought that, having MS VS 2008 installed, I would have no problems compiling the sources. I was wrong again. ImageMagick depends on the MFC library and cannot be compiled with MS VS 2008 Express, because that library is not included. The other option was to install the ImageMagick binary distro, and that worked. Or you can install Visual Studio 6. It’s up to you.

Before using tesseract, I encourage you to read its FAQ and wiki. If you have questions, subscribe to the tesseract mailing list; there are excellent people there who can help you. Unlike opening support tickets and waiting days or even months, here help comes very quickly.

Conclusion

OCR usually consists of two stages. In the first stage we prepare our data for processing: some images are noisy, others are poorly scanned, or their format does not fit our purposes. ImageMagick helps us perform this kind of preparation, and lets us create scripts that automate the process. In the second stage we actually do the OCR. Tesseract has a baseapi that makes it easy to integrate its capabilities into your environment.

Building your system, keep in mind OCR’s limitations.

If you have comments or suggestions, please share them with me and other people.