Does anyone have experience with scanning a book and then optimizing it as a pdf file?

Since I'm fortunate to have an undergraduate schlepper at work who does photocopies for us underpaid professors, I decided to try an ebook experiment today.

I had her xerox an entire book. Our Xerox machine's sheet feeder will then allow you to email an entire set of pages to yourself in various formats. The book was 51 (double-column) pages. I find that selecting "compact PDF" results in a file that's not too large but fully readable.

So the resultant document is 2MB. I decided to run it through OCR software (Nitro 7) so I could have a document with searchable text.
There are few images in the book and none on the pages that contain text.

Here's where it starts to get confusing.

I used the default setting, "searchable text image". I ended up with a 60 MB file, and I don't understand why. Why is it 30x larger than the original?
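For a rough sense of how scanned pages can balloon like that, here's a back-of-the-envelope sketch. The 300 dpi and letter-size page dimensions are my assumptions, not anything stated about this scan:

```python
# Back-of-the-envelope: raw (pre-compression) image data for a 51-page scan.
# Assumptions: 8.5 x 11 in pages at 300 dpi (not stated in the thread).
DPI = 300
WIDTH_PX = int(8.5 * DPI)    # 2550
HEIGHT_PX = int(11 * DPI)    # 3300
PAGES = 51

pixels_per_page = WIDTH_PX * HEIGHT_PX

# 8-bit grayscale: one byte per pixel, before any compression.
gray_mb = pixels_per_page * PAGES / 1e6
print(f"raw grayscale: {gray_mb:.0f} MB")   # ~429 MB

# 1-bit bilevel: 8 pixels per byte.
bilevel_mb = gray_mb / 8
print(f"raw 1-bit: {bilevel_mb:.0f} MB")    # ~54 MB
```

If the OCR step decompressed the copier's aggressively compressed images and re-saved them with a much milder setting, a jump from 2 MB to 60 MB would be plausible.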

I then tried the alternative setting - "editable text". The resulting document looked the same except the few images and some artifacts were removed. But the file was still 7MB, considerably larger than the original.

The only other thing I played around with was the "optimize PDF" feature, using the 7 MB file. I removed the embedded fonts and ended up with a 460 KB file that, as near as I can tell, looks the same.

I understand in principle what embedding fonts means - so that the doc will look exactly the same on all machines - but the book has few distinct fonts in it.

So I'm a bit perplexed at how best to optimize a pdf. I want to keep the file sizes small, but I don't want to lose legibility. I see myself doing this a great deal in the future with books I get from Inter-library loan. It's far easier for my research to have them electronically, to read and annotate on my iPad. And having the text searchable is a major asset.

The Nitro user guide is less than helpful.

These aren't chemistry or economics textbooks, so there aren't flowcharts, pie graphs and what have you. They're mostly text.

My advice is NOT to scan them directly as PDF, but as images (preferably TIFF or PNG). Then run the images through Scan Tailor and then through ABBYY FineReader if you want some search functionality. This is the "quick and dirty" way.

The "slow and of-an-exceptionally-high-quality way" would be to OCR it with ABBYY FineReader Professional, proofread the entire thing, process the graphics, track down the fonts, redo the layout in Word or InDesign, export as PDF (and maybe tweak a few things in Acrobat), and proofread again the final product. Not many people are willing to spend the time and effort for this, but the result is of very high quality. It's always a pleasure to read such a book. But first make sure that it's worth it, and that it's not already available as an e-book.


It's for my own usage, so I don't really care if it doesn't look pretty or if the scan is, let's say, 95-98% accurate. I could deal with the occasional typo, and I won't throw away the Xerox copy in case there is something I need to check it against when reading.

Ultimately, my purpose here is convenience and time saving [You can skip this part as it's not about pdf optimization]:
They are academic (humanities) books I use in my research.
My usual process for using a book in research is:
-Read the book and make little annotations near the relevant parts
-Xerox only those pages that contain what I may need to quote and cite when writing
-Scan them in to the PC as jpegs (or as a pdf)
-Take notes in MS Word on the book including brief summaries about the specific passages I may need to cite and where to find them in the book.
-If I saved them as jpegs then each jpeg will bear the name of its page number.
I'm sure this sounds tedious to you, but trust me, when it came time to write my dissertation (2006), having all my material scanned into the computer (and having two monitors) made life considerably easier. No stacks of papers spread out all over my floor; no serious time wasted transcribing hundreds of quotes, half of which I didn't end up using; and all my material stored on a flash drive so I could write wherever I was.
Obviously reading ebooks (as pdfs) on my iPad eliminates many of these steps. And it is so with the books I am able to find as ebooks.

So it occurred to me to experiment by scanning one in its entirety.

Some things to note:
-It's the Xerox machine that sends it as a "compact PDF". It's one of the settings. What it does exactly I have no idea, but an otherwise 10-15 MB file becomes less than 2 MB if I select compact. I can see no difference in the results, and I had no trouble running the compact PDF through OCR.

So my goal here is to (1) save time and make things more convenient, (2) not end up with massive files, and (3) not sacrifice (or rather risk) reliability.

As near as I can tell, it's the embedded fonts from Nitro that are adding the bloat; how else to explain 500 KB instead of 7 MB?
500 KB sounds like a normal size for an ebook.

But can someone explain the differences between "searchable text image," and "editable text" and what is at stake between choosing one over the other? And whether removing embedded fonts matters or not?

Again, this is all for myself; I'm not trying to create a pirated ebook to circulate. But I do need to be confident that it will look OK on multiple PCs and on future versions of Windows and iOS and what have you. I know a JPEG will never be an issue, but with these PDFs, I have no idea.
It's about saving me time without wasting too much HDD space.

You don't want to save scanned documents as jpg. JPG is a lossy format, and is pretty atrocious on text documents. Since the Xerox already outputs as PDF, I would recommend that. Other formats that can be used for the original scans are any of the lossless image formats such as PNG or TIFF.

Quote:

Originally Posted by shmendrapolk

I'm sure this sounds tedious to you, but trust me, when it came time to write my dissertation (2006), having all my material scanned into the computer (and having two monitors) made life considerably easier.

No way! I understand completely. Digital files that are properly OCRed are much easier to use than physical books. Searching through documents/entire books is a breeze! So many times with the physical book I got stuck on "well I remember him mentioning something about topic X... now which page was that in the book?"

Quote:

Originally Posted by shmendrapolk

no serious time wasted transcribing hundreds of quotes, half of which I didn't end up using; and all my material stored on a flash drive so I could write wherever I was.

Being able to copy and paste alone probably saves massive amounts of time. It's so boring having to type out a paragraph or two from a physical book!

Quote:

Originally Posted by shmendrapolk

Some things to note:
-It's the xerox machine that sends it as a "compact pdf". It's one of the settings. What it does exactly I have no idea, but an otherwise 10-15mb file becomes less than 2mb if I select compact. i can see no difference in the results. And I had no trouble running the compact pdf through OCR.

What "compact PDF" most likely does is just run some lossless compression on the scans resulting in no loss in quality. While just exporting as a "normal PDF" would be exporting the uncompressed image files.

Quote:

Originally Posted by shmendrapolk

But can someone explain the differences between "searchable text image," and "editable text" and what is at stake between choosing one over the other?

I cannot make one bit of sense out of the documentation (I see what you mean by "not being very helpful").

Thanks.
The jpegs are actually quite useful once I've gone through the process of naming them according to the page number. It's very easy to track down a quote because my notes reference a page number.
And not having them OCRed in such cases isn't the biggest deal. The number of quotes I end up using is far fewer than the number I highlight while writing.
But I would never want to have to read a whole book in such a manner.
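For what it's worth, that page-number naming step can be automated. A small sketch; the `scan_*.jpg` input pattern and `p###.jpg` output names are just a hypothetical convention of mine:

```python
from pathlib import Path

def page_name(page: int, ext: str = ".jpg") -> str:
    """Zero-padded name so files sort in page order (p007.jpg before p010.jpg)."""
    return f"p{page:03d}{ext}"

def rename_scans(folder: Path, first_page: int = 1) -> None:
    """Rename scanner output (scan_001.jpg, scan_002.jpg, ...) to page-number names."""
    for offset, scan in enumerate(sorted(folder.glob("scan_*.jpg"))):
        scan.rename(scan.with_name(page_name(first_page + offset)))

print(page_name(7))  # p007.jpg
```

The `first_page` offset matters because scanner counters rarely match the book's own pagination (front matter, skipped blanks, and so on).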

As for the embedded fonts and their removal: I'm just imagining a hypothetical situation where, many years down the road, I'm on a different operating system, there have been some major technological changes, and I open up this PDF and the document won't render because some of the fonts or whatever no longer exist, and I can't read it.
The few times I've opened up a word doc on Pages on my iPad I've seen problems.
And every time I try to open up something I wrote as an undergrad back in the early 90s (using MS Write or WordPerfect) the files are messed up.

So a jpeg may be lossy and hard on the eyes, but I know it will always look exactly the same regardless of the environment.

If you're going the "image-only" route, at least process the images with Scan Tailor and archive them as ZIP or RAR. Another format you can count on for long-term compatibility is HTML, which is easy to edit and easy to convert, but you will need to proofread the whole thing (the OCR output) against the scanned images at least once.
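To sketch the ZIP half of that advice in Python (the folder layout and the `*.png` glob are assumptions on my part):

```python
import zipfile
from pathlib import Path

def archive_pages(folder: Path, out: Path) -> int:
    """Pack page images into one ZIP. ZIP_STORED is used because PNG (and
    TIFF-G4) data is already compressed; deflating it again gains little."""
    pages = sorted(folder.glob("*.png"))
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as zf:
        for page in pages:
            zf.write(page, arcname=page.name)
    return len(pages)
```

A side benefit: a ZIP of page images renamed to `.cbz` opens directly in most comic-reader apps, so the pages stay viewable even if a particular PDF tool ever becomes a problem.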

Instead of Nitro, try PDF-XChange Viewer. From the menu choose Document -> OCR Pages, and when the dialog appears, make sure to set "PDF output type" to "Preserve original content & add text layer". After the job is done, just save the file and you will have a slightly bigger PDF. Note that if you choose the other option for "PDF output type", the file size increases significantly.

What "compact PDF" most likely does is just run some lossless compression on the scans resulting in no loss in quality. While just exporting as a "normal PDF" would be exporting the uncompressed image files.

I was amazed at how small the "Compact PDF" files come out on Konica-Minolta copiers, so I checked them out a little. A normal PDF output on Konica-Minoltas uses a JPEG-embedded (I think) PDF with typical quality settings for readability and several bits per color component, but "Compact PDF" compresses even further by dropping the bits-per-pixel considerably. I noticed with one document where I had red markings on an otherwise black-and-white document, "Compact PDF" stored the PDF in two layers: a black-and-white layer and a red layer, each with very few bits per color.
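The savings from that layering trick are easy to estimate. A rough sketch, assuming a 300 dpi letter-size page and ignoring whatever entropy coding is applied on top:

```python
# Raw data for one 300-dpi letter page, before any further compression.
pixels = int(8.5 * 300) * int(11 * 300)

full_color_bytes = pixels * 3          # 24-bit RGB: 3 bytes per pixel
two_layer_bytes = 2 * (pixels // 8)    # two 1-bit layers: 1 bit per pixel each

print(full_color_bytes // two_layer_bytes)  # 12x smaller before compression even starts
```

And 1-bit layers then compress extremely well, since scanned text is mostly long runs of identical pixels.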

Quote:

Originally Posted by shmendrapolk

Does anyone have experience with scanning a book and then optimizing it as a pdf file?

...

If you could provide me a link to your 2 MB book, I could run OCR in Adobe Acrobat, ABBYY FineReader, etc., and tell you the difference between "editable text" and "searchable text image" in the applications I usually use for PDF optimization.

In my testing between Adobe and FineReader, FineReader produces much smaller file sizes AND has more accurate OCR.

In the original poster's case, I would still stick with my usual recommendation of keeping the original scan as the frontend and the OCRed text in the backend.

Quote:

Originally Posted by willus

I noticed with one document where I had red markings on an otherwise black-and-white document, "Compact PDF" stored the PDF in two layers: a black-and-white layer and a red layer, each with very few bits per color.

That sounds like they do a fantastic job at making PDFs much smaller. I assume all of these scanners have their own little tiny proprietary tweaks to try to get their scanned PDFs smaller. Chopping out unused colors is one way to get the filesize way down. The book doesn't have all of the colors in the rainbow!

I personally just work with already scanned (mostly black and white) non-fiction books. Since there are only two colors, black, and white, you can imagine that they compress quite well.

Back to the OCR of documents: the auto-OCR on these scanners is OK (from what I have seen, many are based on some sort of Adobe program), but if you look at the text, you can always see the typical OCR errors.

I feel that an outside program (I use FineReader) will give you a much more accurate OCR than the ones bundled with the scanner. In my mind, more accurate OCR = closer to the original book = a much more enjoyable reading experience.

My work is to convert the books into digital form (EPUB), so I need a nearly 100% correct conversion... and while I am at it, I can toss out that auto-OCRed stuff, and make a nearly 100% accurate PDF text backend as well.

On top of that, Finereader seems to have even better ways of making the PDFs smaller than those scanners. So I just see it as win-win-win-win-win.

Quote:

I personally just work with already scanned (mostly black and white) non-fiction books. Since there are only two colors, black, and white, you can imagine that they compress quite well.

"Black and white" can be different for many people... Some refer to it this way, even though it's actually grayscale, while other refer to it this way, even though it's actually a 1 bit image (black)! 1 bit images compress better. It's the same output that you would get from processing them with Scan Tailor.