DIY Book Scanner

Daniel Reetz, the founder of the DIY Book Scanner community, has recently started making videos of prototyping and shop tips. If you are tinkering with a book scanner (or any other project) in your home shop, these tips will come in handy. https://www.youtube.com/channel/UCn0gq8 ... g_8K1nfInQ

There was a bug in the version of ImageMagick I was using (6.5.x) that caused it to fail with the -compress option (a "No space to read TIFF" message). I'm running version 6.6.1-2 now, and for many of the images, -compress Group4 produces PDFs about one third the size of those using the default JPEG compression. Now I'm noticing the pages on which text is erroneously tagged as image: those can't use Group4 compression, so ImageMagick falls back to the default. Instead of specifying everything as mixed-mode output in Scan Tailor, I may need to take the extra time to pick and choose the pages that actually contain images, to avoid this problem and get the benefit of G4 compression in my PDFs.

Thanks to everyone who took the time to help me over this little hump.

spamsickle wrote:
I assume that PDF does not have the option of using different compression methods to encode binary text and greyscale/color images on the same page, as DJVU can, so if a page has both text and an image, G4 is probably not an option.

In fact, a PDF page can contain multiple images using different compression methods. It is also possible to overlay images as layers with transparency.
I made a simple sample file. The attached file consists of two layers: the bitonal text layer is G4-compressed, while the figure of the Hatter is DCT (JPEG).

However, since ST currently creates a single-layered tiff file from a mixed-mode page, there is no easy way to convert the tiff file to a PDF page of multiple images/layers. Possible code modifications might be: (1) to output the picture zones and the text content separately, then combine them; or (2) to create the PDF directly, instead of tiffs.

kitashi wrote:However, since ST currently creates a single-layered tiff file from a mixed-mode page, there is no easy way to convert the tiff file to a PDF page of multiple images/layers.

There is an easy way to separate ST's output tiffs into Text / Pictures pairs. ST makes sure pure black and pure white are reserved for text areas, which makes such separation an easy thing to do. There exists a utility for this purpose called "ST Separator". It's currently in Russian only, though I think its author would be willing to make an English version.

The problem with generating PDFs is that no good open-source encoder exists. For example, PDF supports the JBIG2 compression method for B/W content, which is a close relative of DJVU's JB2. Unfortunately, as far as I am aware, no open-source encoder can merge nearly identical (as opposed to pixel-by-pixel identical) characters. For DJVU, there is an open-source encoder that can do that. It's called minidjvu, and I am trying to get it relicensed from "GPL2 only" to "GPL2 or later" so I can incorporate parts of it in ST.

Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.

I've recently used Scan Tailor on multiple documents with only a few (~10) pages each. Each batch of output tiffs is then used to make an OCRed djvu file.

I speed up the work process by using two instances of Scan Tailor simultaneously, each processing a separate document. Here are some changes that could increase the speed further by reducing the number of manual actions. Note: I get that Tulon isn't focusing on these kinds of small GUI tweaks ATM. I'm posting anyway in case anyone else has the capacity and interest to contribute code for these things.

1. allow drag and drop of files into the "files in project" box
2. option to immediately go to the next step after such drag&drop (i.e. no manual press of OK needed. Compare to the five manual steps currently needed: copy folder path, paste path in ST, click select all, click arrow, click ok)
3. hotkeys for batch operations. Suggestion: F1 to F6 to start batch operation on steps 1 to 6.
4. option to begin batch operation (at some user-set step) immediately when the main window opens, without any click/keypress. Example: on window open, run steps 1-4 on all files.
5. option to disable warning on "new project" (ctrl+N or ctrl+w)
6. option to disable warning on "remove from project..." (in thumbnail right click menu)
7. let "delete" key execute "remove from project..." on selected thumbnails
8. allow changing default value for thinner/thicker setting in output step (maybe through ctrl+drag of the slider?)
9. let ctrl+click on "apply to..." do the apply to all action immediately.

edit:
10. option to autorun a command line after processing on the last step ends (plus a parameter for the "out" folder for the currently processed files)

Tulon wrote:ST makes sure pure black and pure white are reserved for text areas, which makes such separation an easy thing to do. There exists a utility for this purpose called "ST Separator". It's currently in Russian only, though I think its author would be willing to make an English version.

Tulon, thanks for the information. Reading back through this forum, I see that you had already mentioned that (31 Mar 2010, 05:12). Sorry to have overlooked it.

ST Separator looks interesting. I'm looking forward to the English version (or a Japanese one!).
ST Separator works only on Windows, though, and I mainly use *nix these days, so I tried with ImageMagick and pdftk instead. It works fine for me.

In most cases, Zip compression makes the files smaller than LZW does. I suppose that is because the files tend to contain vast continuous runs of transparent pixels; I read somewhere that Zip works better than LZW in such situations, IIRC.
Alternatively, we can "-trim" the margins of the images; if we can then place the trimmed image at the correct position on the PDF page, that should be smarter still. I'm trying this approach now.

Tulon, may I ask you a question about DJVU's JB2 compression? The "merge nearly identical characters" method -- that is only for lossy compression, right? If not, I wonder how it would be possible to compress losslessly that way...

# I've tried the command-line tools of jbig2enc. Although the resulting pdf was quite small, its page order was messed up. I have no idea why.
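One guess (and it is only a guess) about the scrambled page order: shell globs sort lexically, so page10.tif comes before page2.tif, and an encoder that takes pages in argument order will emit them that way. Zero-padding the filenames, or sorting them in natural order before passing them to the encoder, would rule this out:

```shell
# Lexical order puts page10 before page2 ...
printf 'page10.tif\npage2.tif\npage1.tif\n' | sort

# ... natural ("version") order keeps the pages in reading order
printf 'page10.tif\npage2.tif\npage1.tif\n' | sort -V
```

(`sort -V` is a GNU coreutils extension; on other systems, zero-padded names like page001.tif are the portable fix.)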

kitashi wrote:Tulon, may I ask you a question about DJVU's JB2 compression? The "merge nearly identical characters" method -- Is that only for lossy compression, right? If not, I wonder how does it possible to compress losslessly with that...

That's lossy by definition. In lossless mode it would only merge pixel-by-pixel identical characters.

I'm pretty late to the party here, but I got the chance to try out the version of Scan Tailor with Rob's dewarping algorithm. It seems to have somewhat... wacky results, at least with the book I tested. Could this be the result of something I did wrong? It's set to output DPI 600, black and white, default thickness. I know the algorithm isn't complete yet, so I may just be running into its current limitations.

Original:

[attachment: 2010DW001.131 normal.png]

Dewarped:

[attachment: 2010DW001.131 dewarped.png]

The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.

I tried the Mac OS version but couldn't get it to run. I'm running Scan Tailor in Crossover on the Mac, which uses WINE to run Windows apps.

Anyway, I've set out to take some books out of DJVU and into PDF, and while doing so, I like to clean up the pages. I didn't scan them myself, so they are often spreads, with nasty edges and so on.

I ran two books with no problems. Now a third, which was particularly nasty, has proven difficult. It has a high degree of gray in the background and really shouldn't be used, but it's what I have.

I've run this twice and both times hit the same dead end. I go through, set up the splits (many manually), tweak the content selection, and all is well so far. Then I get to the page-layout step, where about half the images come out properly, like this:

but in the rest, the pages shrink to miniature versions in a massive field of white.

The second time through, I ran "fix dpis." All the manual work means this took an hour while watching TV.

Would love to have your input on how to make this work next time.

BTW, regardless of why these images look like they are different sizes, they are actually the same size.

Tulon wrote:The second time through, I ran "fix dpis." All the manual work means this took an hour while watching TV.

Would love to have your input on how to make this work next time.

Wrong DPIs would be my guess as well. Did fixing them help or did it not?
If not, then chances are you didn't fix all of them. Don't count on all wrong DPIs appearing in the "Needs Fixing" tab. In most cases all pages have the same DPI, so go to the "All Pages" tab, select the "All Pages" node and apply the correct DPI there. Don't try to guess the correct DPI - estimate it instead.
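Estimating the DPI is simple arithmetic once you know (or measure) the physical page size: divide the pixel dimensions by the size in inches. With hypothetical numbers:

```shell
# A scan 2480 px wide of a page 8.27 in (A4) wide:
awk 'BEGIN { printf "%.0f\n", 2480 / 8.27 }'   # ~300 DPI

# To read the pixel dimensions of a real scan (ImageMagick):
# identify -format "%wx%h" scan.tif
```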
