Note: Please click "Run Program" to open ABBYY (we use a leased software license). You will also need to go to Tools > Options, select the Scan/Open tab, then under General select "Do not read and analyze acquired page images automatically", and click "OK".

Introduction to Optical Character Recognition (OCR) with ABBYY FineReader 11.

It seems simple, really: scan an image of a page of text, and let the computer turn the image of each letter into the correct letter. Hence the name: "optical" for the image of the page, and "character recognition" for recognizing the characters. If it were that easy, we wouldn't need this tutorial.
"Recognition" to me implies a certain intelligence, an intelligence that computers don't have. OCM would be a better term, because what the computer is doing is character "matching", not character recognition.

The process covered here involves the scanning of archival typewritten documents, created with manual typewriters between 1952 and 1958. First, a set of image files is created, at 300 dpi, grayscale, with LZW lossless compression. Archival master TIF (Tagged Image File Format) files are saved as separate images, and the ABBYY project file is saved as well. Once the ABBYY project file has been saved the first time, ABBYY saves it automatically from that point forward.

The files are then backed up to a Network Attached Storage (NAS) device, which students access via a symbolic link in their home directories on the iSchool server. The TIF image files can be transferred using the Secure Shell File Transfer client, but the ABBYY project file MUST be compressed as a .zip file before transferring. Saving the files to the NAS will allow students to continue working on these files from computers outside the 1.210A Computer Classroom.

ABBYY FineReader 11 is fully capable of scanning, analyzing, and reading a document in a single step, but for this project, we needed to scan and return a large quantity of archival documents, so the setting in ABBYY to scan and read was turned off.

After the project is backed up to the NAS, it is then returned to a computer running ABBYY, and the pages are "read". Here is where the fun begins! Computers can be really dumb: they are incredible at quickly matching patterns, but cannot really think for themselves. Here is one example. Manual typewriters are precision mechanical instruments that transmit the mechanical force applied by a human finger through a series of levers and hinges, so that a metal arm with a letter stamped on the end of it strikes an ink ribbon, then a piece of paper, and then a rubber roller. With prolonged use, the mechanical components can wear, changing the spacing between letters, changing alignment, etc. In fact, nearly every typewriter has a unique "fingerprint" of sorts, and it is often possible to trace typewritten documents back to the typewriter that created them because of these differences.

Here, these differences give ABBYY fits. ABBYY hates the number 4 on these pages, and wear on the typewriters creates several letter combinations that are simply too close together for ABBYY to read correctly. These result in "uncertain characters", where ABBYY is just not sure whether it got things correct or not. So you have to decide during the correction process.

Then, there are style conventions used in the transcripts. Two are particularly troubling to ABBYY: the O.- sequence to indicate which person is speaking, and the use of -- to indicate a pause. By altering the settings in ABBYY, you can have it either stop for you to check each uncertain character, or just pass them by. Our goal here is to harvest the PAGE NUMBERS and text from these documents. The PAGE NUMBERS are critical, because they tie back to the full index of terms. The two .- and -- combinations can mostly be ignored.

The OCR process then involves letting ABBYY read the document, adjusting the options in both ABBYY and the spellchecker to correct each page, and then double-checking the page numbers. Once the OCR and spell-check process is complete (and please do correct any mistakes in the transcript), you will save a text-only exact copy of the document, with correct PAGE NUMBERS!!! This text will then be used inside of Glifos Social-Media (GSM) to synchronize with the audio file, and it forms the basis of what GSM uses to search the transcripts.

Transcript

This tutorial is going to guide you through using Abbyy FineReader 11 to scan and OCR some text documents.

The first thing we're going to do is go to Start > All Programs > Abbyy FineReader 11 and then select Abbyy FineReader 11 here.

There are a number of procedures you can use to scan and OCR using Abbyy. What we're going to do is actually scan all of our pages to image files
first, before we do the OCR.

So I'm going to click scan and save image here, and that's going to bring up the Abbyy FineReader dialog box.

We're going to check and make sure Abbyy is set for 300 dpi using grayscale. Our pages are 11 x 8.5 here, and we're going to leave the image preprocessing set
at the defaults.

I'm going to select this multipage scanning setting here. This is going to help speed things up tremendously. I'm going to have it pause for 10
seconds after each page.

That's going to allow us to use both hands to feed the scanner, and try to keep up with Abbyy, and get a lot of these pages taken care of.

Now I'm going to insert my first page into the scanner, and I'm going to click Preview here. You'll see Abbyy create a preview scan, and you can see the
scan area extends down to the bottom of the scanner.

We need to make sure this is pulled up to the bottom of the page, but also make sure that we don't cut off this bottom line. That's one of the reasons that
we'll always put our pages at the very top of the scanner as we scan them.

Now that our preview scan is done and our first page is in all the way at the top, we can go ahead and click scan here to start the scanning process.

Abbyy will go through and scan the first page, and when it gets to the bottom and returns, you can pull the first page out, put the second page gently on the
scanner bed, push it to the top, and Abbyy will begin to scan the second page.

And you can continue this process until you're done with all the pages.

Once you get to the last page, you can wait until Abbyy is done scanning the page, and then you can click Stop to stop the scanning process.

And now it's a good time to step back through each of the pages, and make sure you scanned all the lines of text on each of the
pages.

Now that we've finished our scans, we are definitely going to want to save our pages.

I'm going to go here to File, Save FineReader document, and here I'm going to go create a folder on the desktop, and here's where naming conventions
become real important.

I'm going to create a new folder here. Each of these tapes has a 3 digit number associated with it, and we're going to use the naming convention e_toi_ and then a
4 digit number here, usually starting with 0.

So this one is a little bit different, because it's tape 205a, so I'm going to have 0205a here.

Yours probably will not have that a. So I'm going to click okay here, and that's going to make my folder, and then I'm going to open up the folder, and name
this document e_toi_0205a.
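The naming convention just described can be sketched in a few lines of Python. This is only an illustration, not part of the class workflow: the function names `tape_base_name` and `archival_folder_name` are made up here, and the rule that a tape id is up to 4 digits with an optional single trailing letter is an assumption based on the one example shown (tape 205a).

```python
import re


def tape_base_name(tape_id: str, prefix: str = "e_toi_") -> str:
    """Build a base name such as 'e_toi_0205a' from a tape id such as '205a'.

    Assumption: a tape id is 1-4 digits, optionally followed by one letter.
    """
    match = re.fullmatch(r"(\d{1,4})([a-z]?)", tape_id.strip().lower())
    if match is None:
        raise ValueError(f"unexpected tape id: {tape_id!r}")
    number, suffix = match.groups()
    # Pad the numeric part to 4 digits, then re-attach the optional letter.
    return f"{prefix}{int(number):04d}{suffix}"


def archival_folder_name(tape_id: str) -> str:
    """Name of the folder that holds the archival master page images."""
    return tape_base_name(tape_id) + "_archival_masters"


if __name__ == "__main__":
    print(tape_base_name("205a"))
    print(archival_folder_name("205a"))
```

Generating the names this way mostly guards against typing slips such as a missing leading zero.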

And I'm going to click save, and this is going to save my Abbyy FineReader 11 project file here.

Now because we're going to need the individual page scans for another project, I'm actually going to go back through here and save another set.
This time I'm going to save images.

And I'm also going to save these on the desktop, and I'm also going to create a folder to put them in. And so I'm going to go up here to new folder,
and this one is going to be e_toi_0205a_ and I'm going to call these archival_masters.

These are going to be the archival master images of these page scans.

So I'm going to double-click this, so I'm saving inside this folder, and I'm going to use the same file name e_toi_0205a.

And here we need to select our file type. If you'll remember, we actually scanned these in grayscale, yet Abbyy is trying to save these images as color, and
that's going to take up some excess disk space.

If we look here, there's lots and lots and lots of choices, but we're just going to want to move up a few slots here and save to tiff, gray, LZW
compression.

And that's going to save our images pretty much exactly the way they came off the scanner.
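If you want to confirm that a saved page image really is gray, LZW-compressed, and at the resolution you scanned at, the TIFF header can be inspected directly. The sketch below uses only the Python standard library and reads just the first image directory of a file. The tag numbers come from the TIFF 6.0 specification; the function names, and the assumption that Abbyy writes 8-bit gray with the resolution stored in inches, are mine and should be checked against a real file.

```python
import struct

# Tag numbers from the TIFF 6.0 specification.
BITS_PER_SAMPLE, COMPRESSION, PHOTOMETRIC = 258, 259, 262
X_RESOLUTION, RESOLUTION_UNIT = 282, 296


def read_tiff_tags(data: bytes) -> dict:
    """Return single-valued SHORT, LONG and RATIONAL tags from the first IFD."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or struct.unpack(order + "H", data[2:4])[0] != 42:
        raise ValueError("not a TIFF file")
    (ifd_offset,) = struct.unpack(order + "I", data[4:8])
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i
        tag, ftype, n = struct.unpack(order + "HHI", data[start:start + 8])
        value_field = data[start + 8:start + 12]
        if ftype == 3 and n == 1:      # SHORT, stored inline
            tags[tag] = struct.unpack(order + "H", value_field[:2])[0]
        elif ftype == 4 and n == 1:    # LONG, stored inline
            tags[tag] = struct.unpack(order + "I", value_field)[0]
        elif ftype == 5 and n == 1:    # RATIONAL, stored at an offset
            (offset,) = struct.unpack(order + "I", value_field)
            num, den = struct.unpack(order + "II", data[offset:offset + 8])
            tags[tag] = num / den if den else None
    return tags


def is_archival_master(tags: dict, dpi: int = 300) -> bool:
    """Check for 8-bit gray, LZW compression, and the expected dpi (assumed rules)."""
    return (
        tags.get(COMPRESSION) == 5                # 5 = LZW
        and tags.get(PHOTOMETRIC) in (0, 1)       # 0 or 1 = gray, not color
        and tags.get(BITS_PER_SAMPLE) == 8        # 8 bits per pixel
        and tags.get(RESOLUTION_UNIT, 2) == 2     # 2 = inches (the default)
        and tags.get(X_RESOLUTION) == dpi
    )

# Example use on one page image (the path is hypothetical):
# with open("page.tif", "rb") as handle:
#     print(is_archival_master(read_tiff_tags(handle.read())))
```

Pass `dpi=600` for pages that were scanned at the higher resolution.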

So now I'm going to click save here, to save these page images.

So now if we look on our desktop, I should have 2 folders here. Here's my project folder, excuse me, this is my archival master folder, that has all the
documents in here.

And you'll see, Abbyy has gone through and added a page number to each of these. So this is the tape name, and this is the first page, second page, third page, fourth
page, etc.

And now if I look in the other folder here, what I'll see is my Abbyy project folder. This is an ABBYY FineReader icon here, associated with this
project folder.

So now I essentially have this project saved in 2 places, and we'll be using one for one purpose, and one for the other.

For the time being, we can leave these 2 folders on the desktop of the computer that you're logged into in the computer classroom, but we'll soon be moving
these to a network attached storage space.

And for that, you will need an iSchool username and password.

Now that we've created our archival master scans and our Abbyy project file, let's look at what we need to do to start saving them, forever.

The first thing is, here are our archival masters, and if you look closely these are TIF files, Tagged Image File Format files.

These files should transfer without a lot of problems.

Now on the other hand, the Abbyy project file itself, this particular ABBYY file icon and the file associated with it, if you will look here, I'm going
to right-click on it and look at properties.

You'll see that this is a pretty big file already, and inside of this is pretty much all of the materials and the scans that we have.

As I've stated before, we are saving this in 2 places for safety.

The trick is that this is a proprietary file format that does not survive file transfer very nicely at all.

In order to transfer this file to our network attached storage device and move it to other computers, we need to do something special with it.

We actually need to compress it. What we're going to do here is right-click, and send to a compressed or zipped folder.

And what that's going to do is go through and actually compress this file and all of the custom pieces and parts of it into a format that we can move
around without a lot of problems.

So what you should end up with is another folder right here that looks like this. And if you mouse over it, you should see this is a compressed zip
folder.

Notice it's a little bit smaller, sometimes considerably smaller, but in this case, well, oops wrong file here, this is 42 megs, and this compressed it down
to 38 megs.

Not all that much, but the most important thing is this makes the Abbyy project folder movable across a UNIX network attached storage system.

And that's a real important thing for us to do here, and it will also allow you to move your files outside of the computer classroom and into the main lab or
other places to work on them successfully.
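The right-click "send to compressed (zipped) folder" step can also be scripted. The following is a minimal sketch using Python's standard `shutil.make_archive`; it assumes the Abbyy project is an ordinary folder on disk, and the function name `zip_project` is made up for this example.

```python
import shutil
from pathlib import Path


def zip_project(project_dir: str) -> Path:
    """Zip a project folder next to itself and return the path of the .zip file."""
    source = Path(project_dir).resolve()
    if not source.is_dir():
        raise NotADirectoryError(source)
    archive = shutil.make_archive(
        base_name=str(source),          # creates <folder>.zip beside the folder
        format="zip",
        root_dir=str(source.parent),
        base_dir=source.name,           # keep the folder itself inside the zip
    )
    return Path(archive)

# Example use (the path is hypothetical):
# print(zip_project(r"C:\Users\student\Desktop\e_toi_0205a"))
```

Keeping the folder name inside the archive means that unzipping on another computer recreates the same project folder rather than spilling its contents loose.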

Now we have some files and folders that we need to save from our computer in the computer classroom to the network attached storage space that's
attached to our iSchool account.

So, if you've got an iSchool username and password, what we'll do is go to Start, All Programs, SSH Secure Shell, and the Secure Shell file transfer client
here.

Once the program is open, we'll need to hit Enter or the space bar in order to connect.

Our hostname here is ftp.ischool.utexas.edu, and now we're going to enter our username here. And we are going to click Connect.

And if this is the first time we've connected, we'll select Yes here. And we'll put in our password, the password to your iSchool account, and then we're going to
click Okay.

Now my account here is a big mess, but what I'm looking for, and what you should see, is this link right here to sod_fall_2012.

I'm going to double-click on this, and inside this folder, what you have just traversed is a symbolic link into 10 GB of network attached storage space to
support this class.

Please just use this space for the SOD class, and not for your online movie collection.

The first thing we will need to do is set up a directory here. I'm going to click New Folder, and I'm going to call this folder text, because that's what we're
working with here.

Once I create that folder I'm going to double-click on it, open it up, and then I'm going to go over and drag this zipped ABBYY folder over into this
space.

And it will be moved up, and then I'm going to grab this archival_masters folder here, and pull it over here. I can also do the same thing right here as
well.

And what that's doing is backing up all the files and folders that we created with our scanning to the network attached storage space.

You'll repeat this process throughout the course to back up your files. Just keep in mind that if you pull this zip file down to another computer and work on
it, it's going to be this zip file that you need to keep up with and move back to your computer in the computer classroom to continue working on it.

Because the version in the classroom will be an older version.
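One simple way to keep track of which copy is current is to compare modification times. The sketch below is an illustration only: the function name `newest_copy` is made up, and it assumes the file transfer preserved each file's modification time, which not every transfer client does, so treat the answer as a hint rather than proof.

```python
import os


def newest_copy(paths):
    """Return the existing path with the most recent modification time."""
    existing = [p for p in paths if os.path.exists(p)]
    if not existing:
        raise FileNotFoundError("none of the given copies exist")
    return max(existing, key=os.path.getmtime)

# Example use (both paths are hypothetical):
# print(newest_copy([r"C:\Users\student\Desktop\e_toi_0205a.zip",
#                    r"D:\downloads\e_toi_0205a.zip"]))
```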

The network attached storage space is also where we'll be collecting your files for grading.

Now let's scan some documents that are going to be a little more difficult to OCR.

These are the documents from the Hoccleve archive, and there are 2 types of these.

We'll go ahead and click scan here. The first document is going to be the collation tables that were used in the transcription of this piece.

And for these, we're going to bump this up to 600 dpi for the smaller text, which should hopefully make this work a little better.

And we're going to leave this on grayscale here, because these are black, white, and pencil markings that are gray.

And to make sure that we have all of these other elements, we'll let it enhance the images for OCR, and see how that works out as well.

And for these, I would really encourage you to maybe set your pause time a bit longer than this. I'm only going to do a single document here, and see how it
works.

I'm going to click Preview here, to preview it. And as you can see, these documents are edge to edge, and margin to margin.

What we're going to be concerned with here is making sure we get all the information.

We don't know how well Abbyy is going to be able to do with this, but right now we are just scanning for the page images and trying to get a good scan
of each of these.

We'll also be paying close attention to this number here, which is a unique identifier. We'll be assigning you a naming convention in class, but for now I'm
going to just use this page number when we get to that point.

So once I've got my margins set, we can go ahead and click scan here.

Once it's done, I'm going to go ahead and close this window.

If you end up with something that looks like this, Abbyy has gone ahead and tried to OCR this document.

That wasn't what I asked Abbyy to do.

I can go up here to Tools > Options, Scan/Open, and I'm going to select "Do not read and analyze acquired page images automatically".

And then I'm going to click okay. And I'm going to go back up here and click scan again.

And I'm actually going to go right-click over here, and have it delete that page.

So now it shouldn't try to do that for us, and I'm going to go ahead and scan this again.

You'll notice that scanning at 600 dpi takes a little bit longer.

Once Abbyy has completed the scan right here, we can close this, and this is our resulting scanned image.

I'm going to click down here to actually view this zoomed image, because what I'm going to be concerned about as part of the filename is going to be this
page number right here, 3627.

It may not actually be a page number, but it's one of the unique identifiers that will be used in the naming convention you will be given in class.

So the next thing I want to do here is save this. The first thing I'm going to do is go up here and save this FineReader document. Again,
you will be given this naming convention in class, but this document is going to contain all the pages.

So I'm going to name this collation_pages, and then click okay here. And then inside this folder, I'm going to put this 3627, and save that as well.

So that saved my Abbyy FineReader document. The next thing I'm going to do is actually save the page scans.

So I'm going to select save images here, and this is going to put it on the desktop, but I would like to have a folder to put it in, so I'm going to create
another folder here.

And I'm just going to call this collation_pages_archival_masters.

And I'm going to double-click and open that up, and then put in this 3627 here to indicate this particular page I'm working with, and our
compression on this is going to be tiff, gray, LZW compression right here, and I'm going to click save, to save this page.

There's another type of document we're going to be working with as well. So I'm going to go up here and start a new task, and it's also going to be
scan and save image.

That's going to bring up my scanning window here again, but this time we're going to be scanning the actual manuscripts themselves, which were printed in a
dot matrix format, line by line.

So this, because it's in pretty good shape, we're going to go back down to 300 dpi grayscale on these particular pages.

And here again, you might increase your pause between pages, so you can handle these a little more carefully, but they're actually in pretty good
shape.

Then I'm going to click scan here, and since I'm only scanning this one page as an example, I'm just going to go ahead and close this, and it's going
to ask me if I want to save this file.

And this is a little bit different, actually it's not, it's trying to save this image that I just scanned, but I don't want to save it in the same folder
because it's a different thing.

So I'm going to go up here, create a new folder called manuscript, and double-click on it. Again, we'll be giving you the file naming convention for this;
I'm going to call this temporarily just GreetX2.

And this we're going to be saving as tif, gray, LZW compression. I'm going to click save here.

And you can see the page image we're going to be working with here as well. Now in addition, I also need to save my FineReader document.

ECO will, it's not recognizing because they're so close together there, but that's fine.

ATI ON, ATI, all is fine, there's the I, and that's fine too.

Fits town again, now that we've seen this a time or 2, we can say ignore all on fits town.

Here we have a compound word that it doesn't understand. Because I know we're going into a place where we're not going to need hyphenated compound
words, I'm going to go ahead and replace this with one word right there.

Here's our "S.", we can ignore that.

A comma, we can ignore that.

And there's the I, we can ignore it, and there's an apostrophe, we can ignore that.

The W looks fine, the "EE-" looks fine, and I really can't tell where this is located, so I'm going to ignore it.

Now you can see we're starting on page 2. This "I KE" is really a capital I, and we're going to confirm that.

It's a little uncertain about the I, we'll ignore that. DS was recognized correctly, so was the U, so was the "T-", so was the S.

The "we" is correct. If you'll notice here, and we're watching closely, this didn't say "hove", this said "have", so that's where the OCR just plain failed.

Here we have an E, we're going to confirm. We have an EL, we're going to ignore that. We have a ".-", we'll ignore that. ".-" again, we'll ignore that.

And we'll ignore those 2.

Abbyy seems to be doing a pretty good job with these uncertain characters. Let's take a look at these options right here.

What we're going to do is have Abbyy quit stopping at words with uncertain characters, and see how that works for a little while.

I'm going to click okay here. Now we should see more red than blue, so here's an EE, an H, that looks good.

Here's our capital E again, H, W, Delaney. We know Delaney's okay, we've seen it before, so let's do an ignore all here.

"Lots of man hours to do the work thought", I think there's a good chance that it's "that", so we'll replace.

"Attention", one word, hyphenated, but we'll go ahead and replace that.

Ahh here's a good one, this happens a lot, this 4 here is tough.

Let's see if we can add this 4 here into the custom dictionary.

Okay, men, that got men again, ignore that.

And, let's see here, it's got 2 dashes, I think we're going to ignore that one as well.

Bessemers, Bessemer is correct, we will ignore all those,

We have an EH, we're going to ignore.

Here's the H, the E, H, we're going to start ignoring all on that one, and we're going to start ignoring all on that one.

"TH a N", look down here, we're going to replace, and the spell check is complete.

Notice this wasn't called verification, but a spell check; that's just a change between Abbyy 10 and Abbyy 11.

Either way, we've verified and checked spelling on Abbyy's OCR work.

So I'm going to click okay here, and now if you'll notice each of these pages now has an icon with a check on it, that means it's been recognized and
verified, or spell checked.

Now let's double-check Abbyy's OCR for something that's really important to these pages, which is the page number.

The page number of these is tied to the cumulative index, so we need to make sure that this page number stays with the text that was recognized.

So I'm going to look back through here, and make sure that each of these pages has the page number.

But this looks a little strange right there. I don't know what Abbyy is recognizing that as, but if we go ahead and increase the size of this right here so I
can actually see what's going on,

I should be able to see what Abbyy put there, and I don't think it's page K.

What I'm going to do now is see if Abbyy will reread it properly. I'm going to have it read that area, and it still got it wrong.

So I'm going to go back up here, and see if I can correct this to page 4.

And then I'm going to step back to the other pages like page 3, and see if I can zoom in a little bit here, move over, and let's just check our page
numbers.

That's 3, and let's zoom in here again, that's page 2, it's looking okay. There won't be one on page 1.

Let's check 5, page 5, but again we need to see what the OCR version looks like. It got 5 right.

6, let's zoom in, 6, and then we'll check 7, and it looks like it got page 7 okay.

So again, please pay special attention to the page numbers of these, and make sure Abbyy gets the page numbers correct.

Now that we've checked all of our pages for page numbers, the next thing we need to do is harvest this text, so we can match it up with the audio
files.

And the way we're going to do that is we're going to select one of these pages over here,

and we're going to go to edit, select all, and then we're going to go here to file, and we're going to save this document as a text document.

And what we're going to want to do is make this the simplest document we possibly can.

So I'm going to put this on the desktop, this is what the filename is going to be called, and we're going to create a single file for all of the pages, excuse
me.

And then we're going to go under options right here, and we're going to maintain our line breaks here.

And then once we do that, we're going to click okay. And now we're going to click save, to save the OCR text as a text document.

You should end up with a document that looks like this.

The main thing we're concerned about here is these headings, Pioneers in Texas Oil, and P2 or page 2.

We need to make sure that we can tell the difference between the pages of this, and that it's in plain text.
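A quick script can help with this page-number check on the saved text file. The sketch below is only a rough aid and rests on assumptions that should be checked against a real export: it assumes each page marker sits on its own line and looks like "P2", "P. 2", or "Page 2", and the function names are made up. Markers whose number is not made of digits (the "page K" problem) are reported as suspect rather than guessed at.

```python
import re

# Assumed marker shapes: "P2", "P. 2", "Page 2" alone on a line.
MARKER = re.compile(r"(?:page|p\.?)\s*(\S{1,4})", re.IGNORECASE)


def scan_page_markers(text):
    """Return (page_numbers, suspect_lines) found in the exported text."""
    pages, suspect = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        match = MARKER.fullmatch(stripped)
        if match is None:
            continue
        token = match.group(1)
        if re.fullmatch(r"[0-9]+", token):
            pages.append(int(token))
        else:
            # Looks like a marker, but the number was misread (e.g. "page K").
            suspect.append((line_no, stripped))
    return pages, suspect


def missing_pages(pages, first=2):
    """List page numbers absent between `first` and the highest one found.

    Page 1 carries no number in these transcripts, so counting starts at 2.
    """
    if not pages:
        return []
    found = set(pages)
    return [n for n in range(first, max(pages) + 1) if n not in found]

# Example use (the file name is hypothetical):
# with open("e_toi_0205a.txt", encoding="utf-8", errors="replace") as handle:
#     pages, suspect = scan_page_markers(handle.read())
#     print(missing_pages(pages), suspect)
```

A short one-word line that happens to begin with "p" can be reported as suspect by mistake, so every flagged line still needs to be checked against the page image.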

And so this document looks pretty good, and we should be able to use it to paste into GLIFOS for the next part of this project.

The last thing you need to do here, after you close this document and go ahead and close ABBYY FineReader,
is to make sure that you take this text document that you just created and put it on your server space on the NAS so we can access it outside of the
classroom for grading.