I have a question that MIGHT be big, or I just missed it in searching.

Can anyone point me towards any website, tutorial or software about converting physical books to an electronic format? I have searched widely and it seems unnecessarily difficult. The difficulty seems mostly connected to preserving hyperlinks.

I plan to hire someone with VERY limited computer skills to do the job (I have 500 books!), and the best I have been able to find is this set of procedures:

1. Scan each page using OCR
2. Copy the text, one page at a time, until a chapter is scanned.
3. Save each chapter separately and save as either RTF or TXT.
4. Save TOC separately.
5. Open a new DOC (or DOCX) and insert the text files one at a time.
6. Begin with TOC, and bookmark each chapter heading (first step in creating hyperlinks).
7. With each new chapter inserted, create the hyperlink.
8. Convert the whole document to an ebook format.

This seems WAY more tedious than I thought it might be. Is there a magic bit of software that I am missing?
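The bookmarking and hyperlinking in steps 5 to 7 is the sort of thing a short script could take over. Here is a minimal Python sketch, not a finished tool. It assumes each chapter is already available as a heading plus plain text with blank lines between paragraphs. The function name `build_book_html` and the `ch1`, `ch2`, ... anchor names are made up for illustration.

```python
from html import escape


def build_book_html(title, chapters):
    """Join chapters into one HTML page with a linked table of contents.

    `chapters` is a list of (heading, text) pairs. Paragraphs inside
    `text` are assumed to be separated by blank lines.
    """
    toc_items = []
    body_parts = []
    for number, (heading, text) in enumerate(chapters, start=1):
        anchor = f"ch{number}"  # hypothetical anchor naming scheme
        # TOC entry that jumps to the chapter heading
        toc_items.append(f'<li><a href="#{anchor}">{escape(heading)}</a></li>')
        # Chapter heading carrying the matching id, followed by its paragraphs
        body_parts.append(f'<h2 id="{anchor}">{escape(heading)}</h2>')
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        body_parts.extend(f"<p>{escape(p)}</p>" for p in paragraphs)
    return "\n".join(
        [
            "<html><head>",
            '<meta charset="utf-8">',
            f"<title>{escape(title)}</title>",
            "</head><body>",
            f"<h1>{escape(title)}</h1>",
            "<ul>",
            *toc_items,
            "</ul>",
            *body_parts,
            "</body></html>",
        ]
    )
```

Calling `build_book_html("My Book", [("Chapter One", text_one), ("Chapter Two", text_two)])` returns a single HTML string that can be saved to a file and handed to a converter. It does nothing about OCR errors, so the proofreading still has to happen first.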

What will I do, then, when I get to the academic books, with footnotes, bibliography and all that?

I use a somewhat similar method, but I have an ancient copy of Dreamweaver, the wysiwyg web page editor, so what I do is this:

Scan pages, save as RTF
(Usually ten or so pages at a time to prevent screaming boredom)
Start a new book RTF file and copy the pages onto it in turn, correct scan errors, fix up format. When I have the entire book scanned and all pages loaded into the book RTF file, I go through, check and proofread and generally make a presentable layout, with TOC etc. and with a 500 x 800 cover pasted at the top.

Then save as web page, filtered (ex MS Word). Reason is MS's own unfiltered version of a web page has a whole lot of junk which causes my Dreamweaver to have hysterics.

Now I open Dreamweaver, pick up the RTF book file, and in Dreamweaver I add the TOC links and add the cover, a 500 x 800 jpeg image, at the very top. (For some odd reason, when you convert an RTF to Web Page in Word, it reduces the size of any image and makes it into a GIF, which is bad news. So I replace the small GIF with another full-size copy of the jpeg cover.)

Once I'm happy all is clean, I then go to Calibre, and "add book", picking up the web page version of the file.

Then I edit metadata, save, and then convert into Mobi or epub.

Laborious, but I have done at least 20 books that way so far. For an example, see "The Green Hat" by Michael Arlen, in the Mobi section of the Library. Or an Edgar Wallace such as "Kate, Plus Ten". Or poetry--archie and mehitabel (lower case please) by don marquis. All done the hard way.

I only make mobi, because I have a kindle, which reads mobi. If I had another reader which used epub, I'd convert to that.

If any of your books are out of copyright, you can sometimes find a pdf or rough scan somewhere on the net, which simply needs cleaning up.

There are faster ways, but they require special equipment. One is an industrial-strength scanner/OCR unit: you dismantle the book into its individual pages and it automatically feeds them through like a photocopier, scanning both sides and stitching the text automatically. Another method uses two cameras at right angles and a frame which supports a book open at 90 degrees, with the cameras photographing the pages, and the inevitable software to stitch it all together.

If you are going to do 500 books, all I can say is: better you than me!

My process is scanning, OCR, then creating a Djvu file (I find it's easier to work with when proofing) as well as a .doc. Import the .doc into Atlantis (a very nice word processor) and do as much editing and cleaning up there as possible; Atlantis makes it easier to clean up the 100s of styles that OCR usually leaves you with. Export to .html, import that into Sigil and do final cleanups. Sigil will create a very nice TOC. Atlantis can also export to .epub, which you can import into Sigil as well; I found that nice to work with too, and Atlantis did a pretty good job of the export.

The more time you spend on the initial scanning and OCR, the easier the final steps will be. With ABBYY, run the deskewing and clean up the images, removing the black artifacts that sometimes appear around the edges, etc. Train the OCR for a few minutes at least; I find about 5 minutes is usually enough for a good OCR. I scan in png format at 600 dpi because I want really clear copies for the proofing process.

Footnotes are always difficult in ebooks. You can either put the footnote immediately following the paragraph, or put them all at the end of the book. Link to the footnote, and then have a link back to where you were reading from the footnote. Same with the bibliography, I would just add it to the back of the book. I haven't had to deal with footnotes yet, but I'd probably do it keeping them all together at the end of the book with links back to the text position.
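For the "all at the end, with links both ways" approach, here is a rough Python sketch of the linking. It assumes the footnote positions in the text have been marked by hand as `[^1]`, `[^2]` and so on. That marker syntax, the `ref`/`note` id names and the function name `link_footnotes` are my own assumptions for the example, not something OCR software or Sigil produces.

```python
import re
from html import escape


def link_footnotes(body_html, notes):
    """Replace [^n] markers with links and build an endnotes section.

    `notes` maps the marker number (as a string) to the note text.
    The [^n] marker syntax is an assumption made for this sketch.
    """
    def replace_marker(match):
        n = match.group(1)
        # Superscript link to the note; its own id is the return target
        return f'<sup><a id="ref{n}" href="#note{n}">{n}</a></sup>'

    linked_body = re.sub(r"\[\^(\d+)\]", replace_marker, body_html)
    # One paragraph per note, each ending with a link back to the text
    items = [
        f'<p id="note{n}">{n}. {escape(text)} <a href="#ref{n}">Back</a></p>'
        for n, text in notes.items()
    ]
    endnotes = "<h2>Notes</h2>\n" + "\n".join(items)
    return linked_body, endnotes
```

The function returns the body with linked markers plus a separate "Notes" block to append at the back of the book. If the notes end up in a different file from the text inside the epub, the `href` values would also need the file name in front of the `#`.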

The most time-consuming part of the process is proofreading the final epub and making corrections, the step which people tend to totally ignore or do minimally. OCR is never perfect. I like reading on my reader where I can enlarge the text, making it easier to pick up common errors like "comer" for "corner" for instance, and then editing in Sigil when I find an error. This is the most important step that will make the book an enjoyable read or not.

When done, if you need another format after this point, I just convert with Calibre.
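Calibre also ships a command-line tool, `ebook-convert`, which picks the output format from the file extension you give it, so this last step can be scripted when there are many books. A small Python sketch, assuming Calibre is installed and `ebook-convert` is on the PATH; the helper names `build_convert_command` and `convert` are made up for the example.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, target_format, title=None, authors=None):
    """Build the ebook-convert argument list for one book."""
    source = Path(source)
    # Same file name, new extension; ebook-convert infers the format from it
    target = source.with_suffix("." + target_format)
    command = ["ebook-convert", str(source), str(target)]
    if title:
        command += ["--title", title]
    if authors:
        command += ["--authors", authors]
    return command


def convert(source, target_format, **metadata):
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    # Assumes Calibre is installed and ebook-convert is on the PATH
    subprocess.run(build_convert_command(source, target_format, **metadata), check=True)
```

For example, `convert("book.epub", "mobi", title="The Green Hat", authors="Michael Arlen")` would write `book.mobi` next to the source file.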

Edit: Just a note on the scanning process. This is the easiest part of it all. I spend maybe a little over an hour scanning a book with around 250 pages on a flatbed scanner, I just park me and the scanner in front of the TV while doing it... time passes pretty quickly.

Also you missed out the most important step at the end, proof reading. OCR, even when done by someone with computer skills, introduces errors. Wrong words, random added characters, etc.

True. Even the best OCR software in the world on the most crisp and clean scan ever will still throw up oddities like /' instead of ," and other misreads.
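Some of these misreads are predictable enough to hunt for with a script before the real proofreading starts. Below is a rough Python sketch that flags words like "comer" and stray sequences like /' so you know where to look. The word list and the punctuation patterns are only a hypothetical starter set and would need extending with whatever your own scans throw up. It will also flag legitimate uses of those words, so it is an aid to proofreading, not a replacement.

```python
import re

# Hypothetical starter list; extend it with confusions seen in your own scans.
CONFUSABLE_WORDS = {"comer": "corner", "modem": "modern"}
# Character runs that rarely occur in ordinary prose (also an assumption).
ODD_PUNCTUATION = re.compile(r"/'|\\'|,,|\|")


def find_suspects(text):
    """Return (line_number, token, reason) for each likely OCR misread."""
    suspects = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Words that are valid English but often stand in for another word
        for word in re.findall(r"[A-Za-z]+", line):
            hint = CONFUSABLE_WORDS.get(word.lower())
            if hint:
                suspects.append((line_number, word, f"possibly '{hint}'"))
        # Punctuation sequences that usually mean a misread quote or comma
        for match in ODD_PUNCTUATION.finditer(line):
            suspects.append((line_number, match.group(), "odd punctuation"))
    return suspects
```

Each result gives the line number, the suspect token and a hint, which you can then check against the scan.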

I have scanned a lot of old books and found that the most direct route is:

- Get a scanner that will let you scan the open, face-down book and get both pages at once. Go through the entire book this way from cover to cover and you only have half as many scans as pages.

- Use Scan Tailor to process the images. It will rotate, split, deskew, clean and output nice clear TIFF images. At this point, depending on how much further you want to go, you can stop: if you are willing to have just a copy of the book as-is (no reflow or text adjusting), assemble the TIFF files into a PDF worth storing and reading on an ereader.

- Process the cleaned images through an OCR program. There are many out there and none of them are 100% on the conversion. Depending on the program and the output file available, I would choose something that retains the original formatting as much as possible while still allowing ease of editing. My personal preference is to output html code.

- Proofread the output file. Use your favorite editor to read through the file and cross check against the original book or scan. I find it easier to open the scanned images in one window and the OCR edit screen in another side-by-side and then just browse through it. Double-check any strange output against the scan, correcting as you go.

- Assemble your final ebook. I prefer epub and use Sigil to create the final ebook. I add the cover scan, break out chapters, do final formatting and run epub checks before outputting the finished ebook. I have also taken the final html code and processed it through the Amazon system (email the single html file to your kindle) to get a mobi file. This was really just to see what happened and not really my routine.

One thing to bear in mind is that creating your own ebook from a printed book is very time consuming. Now, I am retired, but I do casual wedding limousine driving (R-R and Bentley, thank you very much!; no stretch Hummers for me!) as a sort of income producing hobby, and get around $40 AU an hour.

I converted one of my own books, and it took many days, as it had something like 100 illustrations, each one of which had to be scanned, cleaned, retouched, resized etc.

If I had been out driving all the hours it took me to convert the book, I would have made something like 1,000 AUD or more. That's a hell of an expensive ebook when the paper book sold for around $20. So you have to ask yourself, is it better just to buy the ones you really want? And then only convert those which are, for some reason, unavailable in digital format yet.

Assuming a minimum of 4 hours labour per book (and if it is a text book with lots of graphs, charts etc, it will be much, much more), multiply by the number of books, and you will see why you should think hard about buying. Easier to make the dough and buy, than do it yourself.
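To put numbers on that multiplication, here is a back-of-envelope sketch in Python using figures already mentioned in this thread: 500 books, a minimum of 4 hours each, and labour valued at $40 an hour. The $20 purchase price per title is an assumption, borrowed from the paper-book price above; actual ebook prices will vary.

```python
def conversion_cost(books, hours_per_book, hourly_rate):
    """Labour cost of converting every book yourself."""
    return books * hours_per_book * hourly_rate


def purchase_cost(books, price_per_book):
    """Cost of simply buying every book as an ebook."""
    return books * price_per_book


# Figures from the thread; the $20 price per title is an assumption.
diy = conversion_cost(books=500, hours_per_book=4, hourly_rate=40)
buy = purchase_cost(books=500, price_per_book=20)
print(f"Convert yourself: ${diy:,}")  # Convert yourself: $80,000
print(f"Buy commercially: ${buy:,}")  # Buy commercially: $10,000
```

Even at the 4-hour minimum, that is 2,000 hours of work, and the labour comes to several times the cost of buying.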

Oh, and one last thought. By the time you have finished the whole job, scanning, formatting, proofing, and so on, you will have read the damn book about 4 or 5 times and will be sick of it. Once it's on your device, you won't actually want to sit back and read it again.

Hmmm. Would you hire someone with no decorating skills to paint your house, or someone with no knowledge of gardening to look after your garden? Why are you even considering hiring someone with no computer skills to scan your books?

My point, no doubt foggily expressed, was that even on the lowest costing of your own labour, converting hundreds of books just for yourself works out to be incredibly expensive in relative terms.

Buy the buggers commercially, is my view. If they're not available commercially as ebooks, and you must have one, do it yourself by all means. I've done a few myself simply because they're not in existence as ebooks and the paper ones I own are falling to bits (and very hard to find second hand, at that).

I'm sure there's an economic principle I was taught many years ago about this, but I've forgotten the jargon now.

For instance, if for some mad reason I wanted e-editions of Ian Fleming's complete works, what makes more sense? Take on an extra wedding limo job--five hours or so at rate, punting a bridal party around in a Rolls-Royce--and then blow the proceeds on the commercial editions already available? Or spend three weeks sweltering in a computer room scanning and etc etc them myself?

My post was responding to the original poster's saying that he planned to hire a computer illiterate person to scan his books, not to your post. Sorry for the confusion. As an ebook creator myself, I am very well aware of the economics of it, and I agree with you wholeheartedly about that.

My best time for a scanned book (an old out of print paperback that I really wanted to read again) was 4.5 days, working at it most of the time during that period. That included heavy editing of the cover, which was in really bad shape, and a complete read through (but only 1 in this case). How well the programs do their job in scanning, OCR and output to epub will determine how much time you spend finding and correcting errors.

In comparison, my very first old book took me months because I started out with MS Word that made a royal mess of formatting, plus I'm sure I didn't use proper settings in ABBYY at the time either. I now use Atlantis as my word processor which is much better for this purpose.

Btw, ABBYY 11 now has image editing tools included, which eliminated the use of ScanTailor for me. It does a very nice job.

Putting this much work into any book (without getting paid for it) pretty much requires a love of that particular book. Someone hired from outside the business won't have that passion, and you'll probably end up with a very poor quality epub at the end of it.