Going paperless with Tesseract OCR

The New Year period always prompts me to look for improvements I could make in my life. After some searching, I came across the article "Three Steps Toward a Paperless Culture". A quick glance at my immediate surroundings confirmed that going paperless would be a great improvement to my lifestyle. Imagine all those paper records gone!

In New Zealand, the law requires you to keep business-related records for at least seven years. Over time this can grow into a mountain of paper that is a nightmare to sort, store and retrieve.

(A very small sample shown above, trying to flatten with weights...)

My scripting mind immediately took control, seeking a solution to this problem. Assuming I had a folder full of scanned documents, how hard would it be to sort them?

As it turns out, very easy.

Tesseract

Tesseract is my OCR library of choice. Originally developed by HP, Tesseract was later improved and maintained by Google.

tesseract-ocr is a .NET wrapper for Tesseract by Charles Weld. We will be using this library with PowerShell to perform our OCR tasks.

Environment

If you want to proceed through this step quickly, I would suggest downloading and running the Initialize-Environment.ps1 script from my GitHub repo.

If you prefer to set everything up manually, create the following directory structure:

{Base Directory}/
- Input/
- Lib/
  - tessdata/
  - x86/
  - x64/
- Output/
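
If you would rather script the manual setup, here is a minimal sketch of how the folders could be created from PowerShell. It is not the Initialize-Environment.ps1 script from the repo. The function name New-OcrDirectoryLayout and its BaseDirectory parameter are made up for illustration, and it assumes that tessdata, x86 and x64 sit inside Lib.

```powershell
# Hypothetical helper, not part of the original repo: creates the folder
# layout described above under a base directory of your choice.
function New-OcrDirectoryLayout {
    param(
        [Parameter(Mandatory)]
        [string]$BaseDirectory
    )

    # Assumption: tessdata, x86 and x64 are subfolders of Lib.
    $relativePaths = @(
        'Input'
        'Lib'
        (Join-Path -Path 'Lib' -ChildPath 'tessdata')
        (Join-Path -Path 'Lib' -ChildPath 'x86')
        (Join-Path -Path 'Lib' -ChildPath 'x64')
        'Output'
    )

    foreach ($relativePath in $relativePaths) {
        $fullPath = Join-Path -Path $BaseDirectory -ChildPath $relativePath
        # -Force creates any missing parent folders and does not fail
        # when the folder already exists, so the function can be re-run.
        New-Item -ItemType Directory -Path $fullPath -Force | Out-Null
    }
}
```

Calling New-OcrDirectoryLayout -BaseDirectory 'C:\Paperless' (an example path only) should leave you with empty Input, Lib and Output folders ready for the downloads described next.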

You will need to download the Tesseract NuGet package and copy its files to your Lib folder. Then download the Tesseract libraries and take just the tessdata folder for the language of your choice (I chose English). Place this folder in the Lib directory as well.

Reading text from an image

Reading text from an image is as simple as loading an image, passing it to Tesseract and receiving the output. For example: