For most books (lots of pages of nicely printed English text), the main bottleneck for me in OCRing them is the time the scanner head takes to return to its start position. If my scanner were smart enough to scan bidirectionally, the bottleneck would become the time it takes to lift the book, turn the page, and plonk it back down on the platen.

A guy I know at one of the universities in Portugal mentioned that, to archive vast quantities of printed media, they're using 15- or 22-megapixel digital backs attached to medium-format cameras to record each page, then doing severe hoo-haa to the resulting TIFFs.

Is TIFF->plain-text OCR any harder than TIFF->PDF or other binary formats? I imagine that with sane contrast and some knowledge of the kerning, it couldn't be too hard.
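For what it's worth, with free tools the plain-text half really is the easy part. Here's a minimal sketch, assuming the Tesseract engine plus the pytesseract and Pillow Python packages are installed; "page.tiff" is a hypothetical filename, not anything from the Portuguese archive project:

    # Minimal TIFF -> plain text OCR sketch using Tesseract.
    # Assumes: tesseract binary on PATH, plus `pip install pytesseract pillow`.
    # "page.tiff" is a placeholder for a single scanned page.
    from PIL import Image
    import pytesseract

    page = Image.open("page.tiff")
    # Grayscale conversion helps when the scan already has sane contrast.
    text = pytesseract.image_to_string(page.convert("L"))
    print(text)

The hard part in practice is everything upstream: deskewing, dewarping the page curvature, and cleaning up the image before the recognizer ever sees it.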

Perhaps the trick is having the money to afford medium-format backs, plus the endless man-hours found at a university.

That's beautiful. *) Thank you. I'm so tempted to forward it to the professor who once dissed me for naming variables "Fred" and "Wilma" and never leaving comments to explain what they did (I didn't care, because it was a school project and no one else was ever going to see the code): "See just how bad it could have been if I had left comments? Aren't you glad I didn't?"

I'm still having mixed feelings about switching to "iter" for my iterator variables instead of "i", especially because I then end up with silly things like "jter" and "kter" in the inner loops (which I should probably be avoiding with ... but I digress).
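To make the "jter"/"kter" silliness concrete, here's a made-up Python sketch (the matrix and all the names are invented, not from any real code):

    import itertools

    # The naming problem: once the outer index is "iter", the inner
    # loops degenerate into "jter" and "kter". (In Python, "iter" also
    # shadows a builtin, which is another strike against it.)
    matrix = [[1, 2, 3], [4, 5, 6]]

    total = 0
    for jter in range(len(matrix)):
        for kter in range(len(matrix[0])):
            total += matrix[jter][kter]

    # Naming indices after what they index, or flattening the nesting
    # with itertools.product, avoids the silly names entirely.
    total = sum(matrix[row][col]
                for row, col in itertools.product(range(len(matrix)),
                                                  range(len(matrix[0]))))
    print(total)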

We can't blame the lawyers for that one -- those doing the sanitization really went overboard. Most of them tended to rip out anything that even remotely suggested the code was less than perfect.