
mu:zines Blog

From Print To Screen

Part 6 - OCR Part 1a - Contents & Metadata

by Ben | 24th Apr 2020

A Word On OCR

OCR (Optical Character Recognition) is, for those who don't know, the process of converting an image of text into actual, editable text. Right from the beginning of this project I didn't just want to scan the magazines and put them up as PDF images - I wanted to be able to search the content, let search engines index it, and do all the other good stuff that having the text allows, including updating errata and fixing article errors. Reading PDFs on a computer or iPad is okay to a point, but on a phone, for example, it's a pretty lousy experience.

The very first thing I did back in 2006, at the dawn of mu:zines, to decide whether getting a magazine issue into a web site was viable, was to investigate the available OCR tools and run some tests to see how much time and effort would be involved. Back then I was using a Mac PowerBook G4, which was rather slow (I also had a regular PC available, but being Mac-based, a Mac solution was preferable for me).

I explored the range of OCR solutions out there at the time - pretty much everything aside from large-scale commercially licensed systems - and I certainly can't recall all the systems I looked at. But the one that gave the best results at the time, and was widely regarded as "one of the better systems", was TextBridge Pro.

Now, like many OCR systems, this one had come from the PC/Windows world, and the Mac version was a bit clunky in terms of the interface, but of all the systems I tried, it gave the best output, and that was what I chose to initially work with.

Reading was slow, tedious, and produced many errors. As each article was read, I had to go through a proofing/fixing process which involved stepping through the words TextBridge wasn't sure about and either fixing them or OK-ing them, in a rather small window, before eventually saving the text file, which I could then work on in my text editor of choice. From there, I would have to do a lot more work to bring the text up to the required standard, often retyping whole paragraphs. It was, shall we say, "sub-optimal" - but it was the best I could do at the time.

Looking at my files, it seems I did about 17 issues of MT using this method, but I think the tedium of working this way contributed to not making much progress back then.

Thankfully, today, things are *much* better in terms of the available tools and technology. When I came back to this project, I did the same thing as before: took a survey of the available tools, ran some tests, and saw what would work best for the task at hand. Again, I tried pretty much everything out there on the Mac platform, and also some PC solutions for comparison.

To keep the story short, ABBYY FineReader was the winner, by some margin. I'd nearly go so far as to say that without FineReader (referred to as "FR" from now on), this project would probably not be practical.

Gone was the tedious error-correcting phase. Recognition was better than anything else I tried (I ran the same test scans through the different systems so I could compare results on identical content). On good scans of simple articles (say, a modern Sound On Sound article printed on white paper) it's not unusual for the resulting text to have *no errors at all*. The interface was OK - not amazing, but a *lot* better than the old TextBridge, that's for sure!

Back to Work

So, OCR software chosen - let's get back to the task at hand. I will go into some specifics on how I use FineReader, and its strengths and weaknesses, as we go.

If you're following along, you'll hopefully recall that in the previous blog entry we output the desired scanned images as full-size, high-quality JPEGs. In the first stage of the OCR process, we need to create an OCR document for the issue containing those pages, and we want to OCR the contents page/s of the magazine so we can create the necessary article entries in the CMS (eg, "On pages 27-29, it's a Korg M1 review", and so on).
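Conceptually, each article entry is just a small record of where a piece sits in the issue. Here's a minimal Python sketch of that idea - the field names are my own invention for illustration, not the actual mu:zines schema:

```python
# Hypothetical shape of a CMS article entry - field names are made up
# for illustration, not the real mu:zines schema.
article = {
    "issue": "mt_94_02_feb",
    "title": "Korg M1 review",
    "page_start": 27,
    "page_end": 29,
}

# The contents page gives us exactly this information per article.
print(f"On pages {article['page_start']}-{article['page_end']}, "
      f"it's a {article['title']}")
```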

OK - in FR, we create a new document, navigate to our scans folder, and import the scan images. FR has options to automatically pre-process and recognise documents, which I *don't* use for this task (although I do use them in other situations). Importing the large images is fairly slow (over a minute), and adding additional processing at this stage makes it even slower, and has some gotchas that I prefer to avoid (see panel below).

Once the images are imported, I'll go to the contents pages - typically a single or double page (and sometimes there may be another mid-magazine contents page where that section's contents are broken down individually).

I'll manually draw a recognition area over the contents text and export it as raw text to *muzines*/processing/mt/mt_94_02_feb/00 contents.txt, then save the FR document as *muzines*/processing/mt/mt_94_02_feb/mt_94_02_feb.frdoc. FR documents are self-contained, with the images inside them, so the temporary scan files have served their purpose and can be deleted.

The "00" in the filename is a naming convention I use to keep exported articles in magazine order - articles will end up named like this:00 contents.txt01 editorial.txt02 shapeofthings.txt03 ...

At this stage, I officially designate the issue as "In Processing", in that there is an FR doc of the issue (editorial pages only), and a contents text file, and the website has the page scans - so I go back to the CMS and mark the issue as such - synchronising the status change to be visible on the live site.

It takes probably ten minutes or so to create the FR document, read and output the contents file, and update the CMS - most of which is the time it takes FR to import and save the document, which is typically around 1GB in size. It's pretty slow, but given that site donations aren't helping me get an iMac Pro any time soon, I have to settle for slow...

OK - so for all the initial talk about OCR, we didn't actually *do* much OCR in this part. That's fine - we're not quite ready for the bulk of the OCR work yet. To recap: we imported the scans into a FineReader document, saved it, OCR'd the contents page out to a text file, and marked the issue as "In Processing" on the website.

The next step is to edit and format the contents file we just saved, so we can create the necessary articles and meta-data in the CMS. So this is what we'll cover in the next part...

Essential Tools: ABBYY FineReader

The FineReader OCR engine is phenomenally good - I'm often surprised at what it will read, and how well. There are pages that just won't be possible to OCR, though - ones where there are high-contrast images behind the text, or where the text colour is very close to the background colour. I had an article once where the text was in light yellow on a white background! Barely readable to the eye, let alone anything else!

Sometimes I can pre-process these difficult pages in Photoshop to bring out the text and make it more readable. I'd say that probably 97% of pages are readable without problems, about 2.5% need a bit of help to bring out the text, and a small handful are just about impossible to read without some extra effort. In those cases, I'll dictate the affected passages as necessary and skip the OCR completely. The mid-90s era, when desktop publishing had begun to get more advanced, is when magazine layouts started to get a bit crazy, and these are often the worst offenders.
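The kind of help those faint pages need is usually a contrast stretch. Here's a conceptual pure-Python sketch of the idea - linearly remapping grayscale values so the darkest pixel becomes 0 and the lightest 255. This is just to illustrate what the Photoshop adjustment is doing, not my actual workflow:

```python
# Conceptual contrast stretch: remap grayscale values so the darkest
# pixel becomes 0 and the lightest 255, making faint text (eg light
# yellow on white) stand out. Illustration only - in practice this
# is done in Photoshop, not Python.
def stretch_contrast(pixels):
    lo, hi = min(pixels), max(pixels)
    if lo == hi:                      # flat image - nothing to stretch
        return pixels[:]
    scale = 255 / (hi - lo)
    return [round((p - lo) * scale) for p in pixels]

# Faint text occupies a narrow band of values near white; stretching
# spreads that band across the full 0-255 range.
faint = [230, 240, 250, 245, 235]
print(stretch_contrast(faint))
```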

In the main text I mentioned that I turn image pre-processing off on import. Image pre-processing is meant to adjust the scanned images for optimal text recognition, and will do things like correct a slightly skewed angle. I find this feature unreliable, though - it will mostly work, but sometimes it will see some text splash at a jaunty angle and rotate and squish the entire page accordingly. That wouldn't be a huge problem in itself, but in FR you can't undo the image processing and revert to the original image - so anything bad that happens here means more work to fix it: delete the affected pages, reimport them (they end up at the end of the document), and then move them back to their correct position. Any improvement in text recognition doesn't seem worth this workflow headache, so I don't use the feature.

You can open the images in FR's image editor and perform these operations manually - occasionally the odd contrast boost might be necessary - but other than that I don't generally require any additional image processing to get good results. I do wish it were possible to revert to the original image before any adjustments, though.

The other gotcha I've come across is that certain types of image can cause reading to go *very* slowly, or effectively hang - it looks like FR interprets a grainy, noisy image as lots of small text, which takes reading a page from 5-10 secs to *hours*, and it's often quite difficult to abort the operation. This is another reason I don't generally let FR automatically read an entire issue - I keep control over the process and avoid situations that are annoying and disruptive to the workflow.

There are some areas I'd like to see improved - for instance, the area re-ordering interface, which I use a lot, is rather clunky - and I'd love some additional features that would improve my workflow considerably. Most of these things aren't deal-breakers, though, and I only mention them because I'm so hot on workflow efficiency - anything that messes with this stands out as an inconvenience.

There will be more on using FR for OCR later in future parts of this series.

• ABBYY FineReader is available from www.abbyy.com for both PC and Mac platforms (though the file formats are not compatible), and also from the Mac App Store.