how to convert a scanned page from a book (looks like photo of page) to clean text?

ok, fairly simple question - i have a book in pdf format, that is scanned with the page and all (looks a bit like a photo of the book almost).
this of course makes for awkward reading, and obviously awkward printing.

how could i take this roughly scanned book, and convert the text into nice clean, legible text?

for instance, how to change this page:

into something more like this?

is it an easy process? several steps involved, great patience, etc? someone fill me in

some of my old books are quite treasured and i'd love to see them get a bit of a second wind by being digitized, even if it does take some work on my part

Sure. Lots of times. Not for a book, but for all sorts of other documents (PDFs of journal articles, etc). OCR isn't perfect - you'll still need to proof-read the book to get it perfect - but Abbyy does an excellent job.

hmm, well this particular book (as per first photo) is an old anthology of horror stories about 300 pages hehe..
i have the book scanned in, but in order to read it comfortably i'd need to convert it into text somewhat like what we here are typing in.

are there any guides that you know of for this? i guess it is a long drawn out process

Thanks Harry - I imported the book in Abbyy as 'image/file -> pdf'
it's analyzing now, or as you guys say, analysing. I guess one good thing to do would be to keep the mode in black & white for this type of operation.

I suppose creating a table of contents is a manual job? I can see no way that it could be otherwise.

If the book is already scanned and you have a PDF file, then why can't you open it in Acrobat Reader and save it as txt?
(File --> Save as txt)

George

Because they are basically JPG images. While some scanners may apply some half-assed OCR underneath those images ("positional OCR"), it's far inferior to ABBYY FineReader. Adobe Acrobat can OCR it as well, but the engine behind it is very poor.

Also, saving it as plain text is just awful for e-books, because there's absolutely no formatting at all (italics, bolds, chapter titles, etc). Italics are the soul of a book, and it's what makes the reading experience enjoyable - especially if used right. Trying to manually spot them in the scans, and then manually re-add them is pure madness. You're bound to miss a few, unless you spend a SIGNIFICANT amount of mental effort and you go over them at least twice.

My advice to the original poster of this thread is to watch a few tutorials on how to use Word, Acrobat, maybe InDesign and Illustrator, as well. The "Essential" libraries from lynda.com are excellent and I highly recommend them. Of course, there are also open-source alternatives (Adobe software is very expensive), such as LibreOffice, Inkscape, Scribus, GIMP, etc., which can do the job very well. I'm a fan of open-source software (Arch Linux is my primary OS), but I have to admit that Adobe has the flagship software for this industry.

My current workflow is to scan text in grayscale, at 300 dpi, JPG (~90% quality setting), and images (like the covers, pictures or other graphics) in colour, at 600 dpi, TIFF.
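To give a feel for what those resolutions mean in practice, here is a rough bit of arithmetic in Python. The 6 x 9 inch page size is purely an assumption for illustration (measure your own book), and the sizes are for uncompressed data at 8 bits per channel, so real JPG or compressed TIFF files will be smaller.

```python
def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_mb(width_in, height_in, dpi, channels):
    """Uncompressed size in MB (10**6 bytes), assuming 8 bits per channel."""
    width_px, height_px = scan_pixels(width_in, height_in, dpi)
    return width_px * height_px * channels / 1_000_000


# Hypothetical 6 x 9 inch page; the page size is an assumption.
text_page = scan_pixels(6, 9, 300)   # grayscale text page at 300 dpi
cover_page = scan_pixels(6, 9, 600)  # colour cover at 600 dpi

print(text_page, raw_size_mb(6, 9, 300, channels=1))
print(cover_page, raw_size_mb(6, 9, 600, channels=3))
```

Doubling the dpi quadruples the pixel count, which is one reason to keep 600 dpi colour for the handful of graphics rather than for every text page.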

First I start with the graphics, using Photoshop to get the most out of them, and vectorize what I can by manually tracing them in Illustrator (again, the "Essential" training library from lynda.com should be enough). Then I run the scans through ABBYY FineReader, proofread it once, export as RTF, run a custom macro in Word 2010 SP1 that keeps only the bolds, italics, subscripts, superscripts and inserts the footnotes as in-line text (separated with tags, so I know where to place them later). This macro outputs a squeaky clean RTF, which I import into InDesign CS5.5 and start redoing the layout, based on styles. Here I sometimes use Scan Tailor, but just for 1:1 comparison when redoing the layout. Usually I batch rename the images from Scan Tailor to match the page number, so that it's easier to go back and forth.
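The batch rename step could be sketched in Python along these lines. This is not the exact script used in the workflow above, just a minimal illustration: the file names, the `page_` prefix and the `first_page` offset are all made up, and it assumes the scan file names are zero-padded so that sorting them alphabetically gives the page order.

```python
from pathlib import Path


def plan_renames(filenames, first_page=1, prefix="page_"):
    """Map each scan file, in sorted order, to a name carrying its page number."""
    plan = {}
    for offset, name in enumerate(sorted(filenames)):
        suffix = Path(name).suffix
        plan[name] = f"{prefix}{first_page + offset:04d}{suffix}"
    return plan


def apply_renames(folder, plan):
    """Rename the files on disk; refuses to overwrite an existing file."""
    folder = Path(folder)
    for old, new in plan.items():
        target = folder / new
        if target.exists():
            raise FileExistsError(target)
        (folder / old).rename(target)


# Dry run on made-up file names: the first scan is printed page 5.
plan = plan_renames(
    ["scan_0002.tif", "scan_0001.tif", "scan_0003.tif"], first_page=5
)
for old, new in sorted(plan.items()):
    print(old, "->", new)
```

Printing the plan before calling `apply_renames` makes it easy to check the page offset against the book before anything is touched on disk.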

And finally, I proofread the final e-book again, but this time on my e-reader, highlighting the parts that I may have missed or that just don't look right. As you can imagine, the quality is very high, but only after putting a lot of time into it. An e-book could take up to a month.

abbyy finereader did a good job of reading the original scan of the book.. one problem though: on pages that were left intentionally blank in the book (separators of chapters), abbyy has tried to read the other side of those pages, and came out with gibberish.

once the scanning was done and i had the page by page preview, i looked for some sort of option like "save as blank page" or similar, but saw nothing like this. is this more the realm of a pdf editor?

if i could blank out those pages which were originally just blank pages to begin with, i'd have a pretty good pdf made from the original scanned book. abbyy seems to have picked up everything properly; the original scan was quite good.
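One way to at least find those pages, if the recognised text is exported page by page, is a rough heuristic like the Python sketch below. It is not a FineReader feature, and the thresholds (`min_words`, `min_wordlike_ratio`) are guesses that would need tuning. It only flags candidates to check by hand, since a genuinely short page (a chapter title, say) gets flagged too.

```python
def looks_like_noise(page_text, min_words=20, min_wordlike_ratio=0.6):
    """Guess whether the OCR text of a page is bleed-through noise.

    A token counts as word-like if, after stripping surrounding punctuation
    and ignoring apostrophes and hyphens, only letters remain.
    """
    tokens = page_text.split()
    if len(tokens) < min_words:
        return True  # nearly empty page: flag it for a manual look
    wordlike = 0
    for token in tokens:
        core = token.strip(".,;:!?\"'()[]")
        letters = core.replace("'", "").replace("-", "")
        if letters.isalpha():
            wordlike += 1
    return wordlike / len(tokens) < min_wordlike_ratio


# Made-up sample pages: one real sentence, one page of OCR garbage.
pages = {
    1: "It was a dark and stormy night; the rain fell in torrents, except "
       "at occasional intervals, when it was checked by a violent gust of wind.",
    2: " ".join(["|l,_", "~.;", "1i^", ":'`"] * 6),
}
suspects = [number for number, text in pages.items() if looks_like_noise(text)]
print(suspects)
```

The flagged page numbers can then be looked at in the page preview and cleared or deleted there, instead of hunting through all 300 pages.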

i saved the scan as an rtf file through abbyy and was surprised to see an almost perfect table of contents when it opened in microsoft word, how did it do this? and why did it not when saved as a pdf?
much to learn, i guess

ok, so, i ran the original scans through abbyy finereader. deskewed all pages, then saved the pdf.

is there anything else i should run the pdf through to optimize the text, or any other tips on optimizing text? i know there is an optimizer in acrobat, but i don't know if it would be useful in this instance