Branko of Teleread came up with some interesting statistics suggesting that, unlike distributed proofreaders, many of us would love to digitize our own personal libraries.

If you've ever tried to scan a full-length book without having access to a high-end $150k+ scanner, you'll understand why professional proofreaders who deal with books every day are not so fond of the idea of scanning their own content. Manual scanning and OCR'ing are a pain, since both tasks are time-consuming and prone to errors. Now, as many of you know, Google is working with various major libraries to digitally scan books from their collections so that users worldwide can search them online. But don't expect some poor first-year student to sit all day and night in front of a low-cost scanner flipping pages. These libraries have access to fully automated page-turning and scanning devices that produce high-quality digital images of bound materials nondestructively, at throughput rates as high as 2400 pages per hour.

It'd be great if one day you could just visit a Kinko's outlet and rent a Kirtas scanning device for a short period of time. Only then would I be willing to turn my dusty library into a bunch of e-books.

- OpticBook 3600 scanner (~$250) with decent OCR included (ABBYY)
- I scan double pages at 300 dpi, b&w, to TIFF or PBM (mostly TIFF, but PBM is sometimes easier to manipulate)
- I do 10 pages (5 double-page scans) per minute for hardcovers and trade paperbacks, 14 pages per minute for paperbacks, and just watch a movie on my portable DVD player while scanning
- the PC does the OCR in about 20-30 minutes per book, and I just export to Word and then to text, since that eliminates most strange characters
- since everything is for my personal use, I do not bother correcting; the software is good enough for the results to be nicely readable (once you get used to a few quirks, like "die" instead of "the" sometimes)
I have done maybe 20 books and read about 5 of them fully, plus parts of others, on my Nokia 770 or eBookwise 1150. You can also do picture books by embedding the scanned pages (maybe converted to JPG) in HTML to read on a PC or tablet with uBook, though I rarely do it since I do not like reading fiction on a PC, tablet, or laptop.
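
To put the scanning rates above in perspective, here is a rough Python sketch of how long one book would take. The 400-page book length is an assumption for illustration only; the rates are the ones quoted in this thread (10 and 14 pages per minute on the flatbed, 2400 pages per hour for the automated machines).

```python
def scan_minutes(pages: int, pages_per_minute: float) -> float:
    """Return the minutes needed to scan `pages` at a steady rate."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return pages / pages_per_minute


BOOK_PAGES = 400  # assumed book length, not a figure from the thread

rates = [
    ("hardcover/trade paperback on a flatbed", 10),
    ("paperback on a flatbed", 14),
    ("automated page-turner (2400 pages/hour)", 2400 / 60),
]

for label, pages_per_minute in rates:
    minutes = scan_minutes(BOOK_PAGES, pages_per_minute)
    print(f"{label}: {minutes:.1f} min")
```

This counts scanning time only; the 20-30 minutes of OCR per book mentioned above would come on top.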

Eyesight problems are my main reason for doing this. I can create an eBook with a larger font size which makes for easier reading on my eBookwise than from the original paper book.

I use an OpticBook 3600 or a Canon LiDE 60 to scan two pages at a time into ABBYY FineReader. After editing with FineReader for spelling and scanning errors, I send the pages to Word. It is in Word that I set a larger font size, apply other special formatting for chapter headings etc., and remove the page numbers. I save the file as an .rtf file and then convert this to the .imp format required by the eBookwise.
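
For anyone working with plain OCR text instead of Word, the page-number removal step could be roughly sketched in Python as below. This is only an illustration, not the workflow described above: the function name and the pattern are made up for the example, and it assumes page numbers sit alone on their own line.

```python
import re

# A line holding only a 1-4 digit number, optionally wrapped in hyphens,
# e.g. "12" or "- 13 -". This pattern is an assumption about the layout.
PAGE_NUMBER_LINE = re.compile(r"^\s*-?\s*\d{1,4}\s*-?\s*$")


def strip_page_numbers(text: str) -> str:
    """Drop lines that look like bare page numbers and keep everything else."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER_LINE.match(line)]
    return "\n".join(kept)


sample = "Chapter 1\nIt was 1984 and cold.\n12\n- 13 -\nThe end."
print(strip_page_numbers(sample))
```

Note that a line containing nothing but a year or another short number would be dropped too, so the output still deserves a quick look.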

I never save in .txt format because all formatting, such as bold and italics, is lost. Italics in particular are necessary to follow the storyline in some novels because they often represent thought, telepathy, etc. Project Gutenberg overcomes this by using all upper case letters for emphasis, but I find this distracting.

So, time-consuming, yes, but I can usually manage to produce a finished ebook in less than a day, and I also find the work very therapeutic and rewarding. This process means that when browsing in my local bookstore I don't have to put most of what interests me back on the shelf because I can't read the text.

I just scan the images into a PDF. It takes a lot less time, and OCR errors really bug me for some reason.

I was wondering myself about how many people just scan the pages and don't bother with OCR because of mistakes. I've tried reading some OCR'd books, though, and it didn't really cause me any trouble because once you get used to the types of mistakes it's pretty easy to figure out what the text was supposed to have been... it's the same kinds of letters and letter pairs that get confused all the time.
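
Since the same letters and letter pairs get confused over and over, a small script can suggest fixes for words that are not in a word list. Here is a minimal Python sketch; the confusion pairs and the tiny vocabulary are hypothetical examples, and a real list would depend on the font, the scanner and the OCR software.

```python
# Hypothetical (seen_in_ocr, intended) pairs; tune these to your own scans.
CONFUSIONS = [("rn", "m"), ("m", "rn"), ("1", "l"), ("0", "o"), ("vv", "w")]


def suggest(word: str, vocabulary: set) -> list:
    """Return known words reachable by undoing one confusion at one position."""
    if word in vocabulary:
        return []  # already a known word, nothing to suggest
    found = set()
    for seen, intended in CONFUSIONS:
        start = word.find(seen)
        while start != -1:
            candidate = word[:start] + intended + word[start + len(seen):]
            if candidate in vocabulary:
                found.add(candidate)
            start = word.find(seen, start + 1)
    return sorted(found)


vocabulary = {"time", "little", "book", "modern"}  # stand-in for a real word list
for token in ["tirne", "1ittle", "b0ok", "book"]:
    print(token, suggest(token, vocabulary))
```

The limit of this approach is that a misread which happens to be a real word (like "die" for "the") passes the word-list check and is never flagged.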

The main problem I see with scanning to pdf without OCR is if you want to read on a small screen device or if you need small file sizes. It just wouldn't seem to be useful for mobile reading unless you are using a laptop. Even the new UMPCs might be too small for a scanned book, wouldn't they?

"In a regime of superabundant free copies, copies lose value. They are no longer the basis of wealth. Now relationships, links, connection and sharing are. Value has shifted away from a copy toward the many ways to recall, annotate, personalize, edit, authenticate, display, mark, transfer and engage a work. Authors and artists can make (and have made) their livings selling aspects of their works other than inexpensive copies of them. They can sell performances, access to the creator, personalization, add-on information, the scarcity of attention (via ads), sponsorship, periodic subscriptions -- in short, all the many values that cannot be copied."

If you do not OCR (or cannot, due to formulas/diagrams), you can read the images directly with your favourite slideshow software, or embed them in a blank HTML page and use uBook or your favourite PC reader software.
PDFs take less space, true, but until we get a portable reader that can display them properly (one PDF page per screen, with no scrolling or zooming, on a device I can use one-handed and without a mouse or pen), size does not really matter. In all of the above methods you read one image at a time, so speed is not an issue, and it actually uses less memory than reading a PDF; you just need enough hard drive space for the images.
This is how I read selected PDFs on my Nokia 770: I cut the pages (through djvudigital and ddjvu) in half (portrait) or in 4 (landscape double-page scan), make sure that each image is 800x480, and use lower-quality pnmtojpeg output to get a manageable size (~40 kB/image or 80 kB/page), since the Nokia screen is good enough. The result is very nicely readable and very fast, since FBReader loads one image at a time, though I lose all navigation except page by page. But it is worth it, since even with Evince PDFs are slow and you need scrolling and so on...
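
The cutting step comes down to computing crop rectangles. Below is a minimal Python sketch of just that arithmetic; it does not call djvudigital, ddjvu or pnmtojpeg, the function name is made up, and the 3300x2550 scan size (an 11 x 8.5 inch double page at 300 dpi) is an assumption for the example.

```python
def tile_boxes(width: int, height: int, cols: int, rows: int) -> list:
    """Split a width x height image into cols x rows crop boxes.

    Each box is (left, upper, right, lower) in pixels. Boxes are listed
    column by column, so a double-page scan gives left page top, left page
    bottom, right page top, right page bottom.
    """
    boxes = []
    for c in range(cols):
        for r in range(rows):
            left = c * width // cols
            right = (c + 1) * width // cols
            upper = r * height // rows
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


# Landscape double-page scan cut in 4 (assumed size 3300x2550).
print(tile_boxes(3300, 2550, cols=2, rows=2))
# Portrait single page cut in half (assumed size 1650x2550).
print(tile_boxes(1650, 2550, cols=1, rows=2))
```

At the quoted ~80 kB per page, a 300-page book (an assumed length) would come to roughly 24 MB of images.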
Whenever you have a fast HTML reader that handles embedded images, and enough storage, this method works nicely as long as you cut to screen size and the result is readable (even on the eBookwise it works for most scans with cutting in half and resizing to 318x448). Of course, I would rather read the PDF directly and not have to write the scripts to cut and so on...
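
For the resizing step, one way to pick the target size is to scale each cut piece so it fits the screen without distortion. The Python sketch below assumes the aspect ratio should be preserved, which may differ from forcing every image to exactly the screen size; the function name and the 1650x1275 half-page size are assumptions for the example.

```python
def fit_size(src_w: int, src_h: int, screen_w: int, screen_h: int) -> tuple:
    """Return the largest (width, height) that fits the screen at the same aspect ratio."""
    scale = min(screen_w / src_w, screen_h / src_h)
    return max(1, round(src_w * scale)), max(1, round(src_h * scale))


half_page = (1650, 1275)  # assumed size of one half-page piece at 300 dpi

print(fit_size(*half_page, 800, 480))  # Nokia 770 screen
print(fit_size(*half_page, 318, 448))  # eBookwise target size quoted above
```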
We will have to see, but I think that the iLiad may be able to display a portrait PDF scan nicely, though not a landscape scan, while the Sony Reader will not be able to do that due to its lower resolution. It may read "reflowable" PDFs, but not scans.

Liviu

Quote:

Originally Posted by Bob Russell

The main problem I see with scanning to pdf without OCR is if you want to read on a small screen device or if you need small file sizes. It just wouldn't seem to be useful for mobile reading unless you are using a laptop. Even the new UMPCs might be too small for a scanned book, wouldn't they?

I've wondered myself if anyone else has tried to improve OCR by using a two-step scanning process... that is, photocopy-enlarging the pages to letter size, then doing the scan and OCR. This has worked for me on small article scans, but I've never gone through the trouble for an entire book.
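
The reason enlarging can help is that each letter ends up covering more scanner pixels. A rough Python sketch of the arithmetic follows; the 8 pt type size and the 1.5x enlargement are assumptions for illustration, and the point size is the nominal size of the type (1 pt = 1/72 inch), not the exact height of any one letter.

```python
def glyph_height_px(point_size: float, dpi: float, enlargement: float = 1.0) -> float:
    """Approximate pixel height of type at `point_size` after enlarging and scanning."""
    return point_size / 72.0 * dpi * enlargement


direct = glyph_height_px(8, 300)          # small print scanned directly
enlarged = glyph_height_px(8, 300, 1.5)   # same print photocopy-enlarged 1.5x first
print(f"direct: {direct:.1f} px, enlarged: {enlarged:.1f} px")
print(f"equivalent direct scan: {300 * 1.5:.0f} dpi")
```

The photocopier cannot add detail it did not capture, so the real gain is probably smaller than the pixel count suggests.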

(Frankly, my head would blow up if I considered digitizing my entire library, and it's not that big!)

That, I think, depends on what quality you get 'raw' from the scanner. If the scanner is clunky and produces uneven results in low resolution, it probably would work. I've done it for books printed on bad paper or with uneven press-work.

However, with a reasonably modern scanner, capable of real 300 dpi resolution, and OCR software with the functionality of, say, FineReader 8, you don't need it. You'll need to check thresholding levels (unless you go for greyscale) before you start working, and you may have to check for light levels drifting as the scanner gets warm, but apart from that it's rather plain sailing.
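
To make "checking thresholding levels" concrete, here is a minimal pure-Python sketch of Otsu's method, a standard way to pick a black/white cut-off from greyscale pixel values. It is a swapped-in illustration, not a description of how FineReader or any scanner driver actually chooses its threshold, and the sample pixel values are invented.

```python
def otsu_threshold(pixels: list) -> int:
    """Return t in 0..255; pixels with value <= t are treated as dark (ink)."""
    if not pixels:
        raise ValueError("no pixels")
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    dark_count = 0  # pixels with value <= t
    dark_sum = 0    # sum of those pixel values
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        if dark_count == 0:
            continue
        light_count = total - dark_count
        if light_count == 0:
            break
        mean_dark = dark_sum / dark_count
        mean_light = (sum_all - dark_sum) / light_count
        # Between-class variance (up to a constant factor); larger is better.
        between = dark_count * light_count * (mean_dark - mean_light) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


# Invented sample: some dark ink pixels and more light paper pixels.
page = [20, 25, 30] * 10 + [220, 230, 240] * 30
print(otsu_threshold(page))
```

Running something like this on the first and last pages of a session would be one way to notice the light level drifting as the scanner warms up.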

At higher resolution and with good print work, the problem more or less goes away. I've done 600 dpi work and had something like one misread per two pages, with only one or two pages of training beforehand.