I don't know if anyone else has been as frustrated as I have by the lack of easy-to-use software focused on document scanning on Linux. I'm not talking about OCR, just scanning documents into a portable multi-page image format. In fact, the only software I've found that does exactly what I wanted was Adobe Acrobat on Windows. But with that I had to go through the Windows TWAIN driver interface for my scanner, which seems designed to make the process as slow and clumsy as possible. Still, that was the method I used for the last couple of years on the occasions when it was useful.
Finally, after I bought a new printer/scanner a couple of months ago (Epson CX3200, it's nice), I decided it was time to figure out how to do the job on Linux. By then I knew all the command-line tools to do what I wanted were available; it was "just" a matter of writing a little script to tie it all together. Being the amateur I am, it took me a couple of days to figure it all out, but I got it working. This was back in January.
The other day I was playing around with controlling it with kdialog, and I thought maybe someone else would find it useful (the non-KDE-dependent version, that is), so I figured why not post it. I did try to make it a little more user friendly. It's still an ugly hack, but it works for me. Not only that, but once it's set up it's better at its specific purpose than anything else I've tried.

The script depends on the following packages.

sane-frontends
imagemagick
netpbm
ghostscript

If your scanner has a decent 1-bit (Lineart) mode (or if you can actually get convert's threshold function to work for you), then you can modify the script slightly and get rid of the netpbm dependency.
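
For the convert route, a minimal sketch might look like the following. It assumes ImageMagick's -threshold operator with a percentage value; the helper names threshold_percent and to_lineart are made up for illustration and are not part of the script, and whether the result is usable depends on your scans.

```bash
#!/bin/bash
# Hypothetical helper: turn a 0..1 threshold (like THRESHOLD=0.55 in the
# script) into the percentage form that convert's -threshold accepts.
threshold_percent() {
    awk -v t="$1" 'BEGIN { printf "%.0f%%\n", t * 100 }'
}

# Hypothetical helper: convert a greyscale scan to a 1-bit image.
# $1 = input file, $2 = output file, $3 = threshold between 0 and 1
to_lineart() {
    convert "$1" -threshold "$(threshold_percent "$3")" "$2"
}

# Example (file names are placeholders):
#   to_lineart page1.pgm page1.pbm 0.55
```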

You need to know how to use the scanimage program with your scanner, as you will need to modify the SCANDEVICE and SCANCMD variables to fit. The rest of the configuration is pretty self-explanatory, I think.
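
As a rough illustration of what those settings involve, something like the sketch below may help. The device string is a placeholder, scan_cmd is a hypothetical helper rather than the script's actual SCANCMD, and --mode, --resolution, -x and -y are supplied by the scanner backend, so check "scanimage --help -d DEVICE" for what your scanner really accepts.

```bash
#!/bin/bash
# List detected scanners first and copy the device name from the output:
#   scanimage -L

# Placeholder device name, not a real one; use what "scanimage -L" prints.
SCANDEVICE="backend:libusb:001:002"
RES=300
X=212.5
Y=275

# Hypothetical helper that prints the scanimage command line for one page.
# The mode name ("Gray" here) is an assumption and differs between backends.
scan_cmd() {
    echo "scanimage -d $SCANDEVICE --mode Gray --resolution $RES -x $X -y $Y"
}

scan_cmd
```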

To use it, you just put your first page in the scanner, then run the script with the name of the file to save as the argument. It will immediately scan the first page, then prompt you for more. The rest is gravy.

It's not very robust. If your scanner has a warm-up period, make sure it's finished before you start; otherwise scanimage may time out, and the script gets a little confused. It's also not designed to work with scanners that have an ADF.
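
One way to soften the warm-up problem is to retry a failed scan a few times. This is only a sketch and not something the script does: retry is a hypothetical helper, and it assumes scanimage exits with a non-zero status when it times out.

```bash
#!/bin/bash
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between failed attempts. Returns 0 as soon as one attempt succeeds.
retry() {
    local tries=$1 delay=$2 n=1
    shift 2
    while ! "$@"; do
        if [ "$n" -ge "$tries" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Example (options depend on your scanner):
#   retry 3 10 sh -c 'scanimage --resolution 300 > page1.pnm'
```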

Anyway, hope someone can use it, even if just for inspiration. EDIT: Later versions are posted further down the thread. Chrwei posted one that should work with an ADF (I don't have the hardware to try it), and I've posted the Python version I've been using for a while. This version is left here mainly for reference.

Code:

#!/bin/bash
#
# scan-pdf version 2
# April 27, 2004
# Copyright 2004 Zacchaeus Pearsall (zap4260 at yahoo.com)
# Distributed under the terms of the GNU General Public License v2

# Set this to a value between 0 and 1
# You may have to play with it some to get good scans
# of colored paper
THRESHOLD=0.55

# Scan resolution
RES=300

# Set these for the size paper you are scanning
# defaults are X=212.5, Y=275; for US letter paper
X=212.5
Y=275

summary:
- Added command line options with defaults
- Added ADF support with a command line toggle to use the flatbed. Can be set to use the flatbed by default with a command line toggle to use the ADF.
- Changed to use scanimage's batch mode and prompt so that timeouts shouldn't be an issue. The ADF doesn't use the prompt.
- Made the scanner device name optional, as scanimage will normally detect your scanner automatically.
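
For reference, the batch mode and prompt mentioned above correspond to scanimage's --batch and --batch-prompt options. The sketch below only prints the command lines; batch_cmd is a hypothetical helper, and the --source option and its "ADF" value come from the backend, so treat them as assumptions.

```bash
#!/bin/bash
# Hypothetical helper: print a scanimage batch command.
# $1 = Y to use the ADF (no prompt), N to use the flatbed (prompt per page)
batch_cmd() {
    local cmd="scanimage --batch=page%04d.pnm --resolution 300"
    if [ "$1" = "Y" ]; then
        # --source is a backend option; the value "ADF" is an assumption.
        cmd="$cmd --source ADF"
    else
        # Wait for Enter before each page so the paper can be swapped.
        cmd="$cmd --batch-prompt"
    fi
    echo "$cmd"
}

batch_cmd N
batch_cmd Y
```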

scanners tried:
- HP Officejet 6110

TODO:
- add more paper size options
- NetPBM says pgmtopbm is deprecated as of 7/2004 and to use pamditherbw instead. I plan on only doing color or full greyscale documents, so I'm not touching this.

bugs:
- "mode" seems to be scanner specific; some want "Grey", others want "Greyscale". Needs testing.
- There might be an issue with providing -x and -y when using the ADF; I need to test more.

usage() {
cat<<EOF
$myname scans documents from your flatbed or ADF scanner and stores them in a multi page pdf.

Usage: $myname [Options] filename.pdf
Options:
-page "size" Page size for the PDF. See "man convert" for possibilities
-mode "mode" lineart, greyscale, or color.
-1bit [Y/N] If your scanner has a good 1-bit mode and you want lineart, use Y here.
-adf [Y/N] Use ADF in no-prompt batch mode (Y/N) - edit this script and set your scanner's options
-res dpi Resolution to scan at in DPI
-opts "options" Additional options to pass to 'scanimage' program
-threshold 0.55 Value between 0 and 1 to pass to 'pgmtopbm'
-h Help This info.

convert -page letter converts the original image to a blank postscript file. Converting without the -page option works, but then the pdf document is not in the correct format. Is this happening to anybody else? Any solutions?

I have been using the above method successfully and conveniently for the last few months now. One problem that I have found is that when I try to convert multiple pbm files into one multi page pdf, I can quickly run out of memory if I get above 6 pages or so. Does anyone have any suggestions as to how to get around this?

My xsane version has an option to scan "pages" to a pdf. Works pretty well AFAIR. I don't see a big difference.
Xsane also has other nice features like just "copying stuff" and "emailing stuff". Anyway, your script sounds nice too.

It seems I have been lax in keeping up with this. Better late than never I guess.

zatalian wrote:

this script used to work for me but now convert gives me trouble...

convert -page letter converts the original image to a blank postscript file. Converting without the -page option works, but then the pdf document is not in the correct format. Is this happening to anybody else? Any solutions?

I've run into this several times. There seems to be some interdependency between imagemagick and ghostscript that causes a problem when I upgrade one or the other. Usually recompiling imagemagick after a ghostscript upgrade fixes it.

bludger wrote:

I have been using the above method successfully and conveniently for the last few months now. One problem that I have found is that when I try to convert multiple pbm files into one multi page pdf, I can quickly run out of memory if I get above 6 pages or so. Does anyone have any suggestions as to how to get around this?

I ran into this a while back too. You need to use the "-limit Memory" and possibly the "-limit Map" options for convert to limit its RAM usage. Usually 1/4 of my physical RAM seems to work well enough. It does take a long time to convert though.
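
A sketch of how that might be put together is below. It assumes a Linux /proc/meminfo and a convert that accepts a MiB suffix for -limit; quarter_ram_mib is a hypothetical helper, the file names are placeholders, and the command is only printed, not run.

```bash
#!/bin/bash
# Hypothetical helper: a quarter of a RAM size given in kB, expressed in MiB.
quarter_ram_mib() {
    echo $(( $1 / 1024 / 4 ))
}

# Total physical RAM in kB as reported by Linux; falls back to 1 GiB
# (an arbitrary assumption) if /proc/meminfo cannot be read.
total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo 2>/dev/null || true)
limit="$(quarter_ram_mib "${total_kb:-1048576}")MiB"

# Print the command rather than run it; the glob stays unexpanded here.
echo "convert -limit memory $limit -limit map $limit page*.pbm out.pdf"
```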

martoss wrote:

Isn't xsane doing the same? ...

It is now, and I'm glad to see it. It didn't have those options 3 years ago when I posted this, though. It still looks like this script might be more convenient for some tasks. Xsane is probably less buggy though.

I've also made some changes since that original version. I converted it to Python and added some ncurses "eyecandy" using dialog. I also got rid of the netpbm dependency. I had intended to rewrite it as a "proper" modular program with a separate config file and such, but never got very far with it. It's not something I use very often anyway.

Anyhow, here's my latest working version. Plenty of bugs I'm sure but it mostly works when I need it.

I ran into this a while back too. You need to use the "-limit Memory" and possibly the "-limit Map" options for convert to limit its RAM usage. Usually 1/4 of my physical RAM seems to work well enough. It does take a long time to convert though.

Thanks for this. I just found this out independently today and was returning to the thread to post my results, but it appears that you beat me to it.

I have just one question though. I used only the memory limit option. What does the map limit option actually do? The documentation seems rather sparse.

I think there's probably no need to use the Map limit on most systems. I think I just put it in mine because I had no clue how mmapping worked back then. It doesn't seem to make any performance difference when I remove it.

I was googling for Zacchaeus Pearsall's original version of this script, when I found this page.
I too used his script as a starting point when writing a shell script for batch document scanning using scanadf.

It uses a configuration file, ~/.bscanrc
where one can list all one's scanners in a bash array,
with device names as shown by "scanimage -L"
and the default scanner being SCANDEVICE="${scanners[0]}"

Importantly, specifying the scanner names in ~/.bscanrc saves time
since the script then skips finding the scanners using "scanimage -L"
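
A minimal sketch of what such a ~/.bscanrc might contain is below. Only the scanners array and SCANDEVICE come from the description; the device names are placeholders, and the real file may well hold more settings.

```bash
#!/bin/bash
# Sketch of a ~/.bscanrc; the device names are placeholders, use the
# names printed by "scanimage -L" on your system.
scanners=(
    "backend1:libusb:001:004"
    "backend2:/dev/sg2"
)

# The first entry is the default scanner.
SCANDEVICE="${scanners[0]}"

echo "default scanner: $SCANDEVICE"
```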

One can also specify which scanners are true duplex,
so the script will scan in fake duplex mode when true duplex is not available.
One can also specify lp printer instances so one can scan directly to printer;
e.g. if you scan a document in duplex mode on letter-sized paper,
it will be printed in duplex from the appropriate tray holding letter-sized paper.
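
For anyone unfamiliar with fake duplex, the usual idea is to scan all the front sides, flip the whole stack, scan the back sides (which then arrive in reverse order), and interleave the two sets. The sketch below shows only the reordering; it assumes that workflow, which may not be exactly what bscan does, and the interleave helper and file names are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print page files in reading order.
# $1 = number of sheets. Fronts are front.0001.pnm ... in scan order;
# backs are back.0001.pnm ... scanned after flipping the whole stack,
# so back.0001.pnm belongs to the LAST sheet.
interleave() {
    local n=$1 i=1
    while [ "$i" -le "$n" ]; do
        printf 'front.%04d.pnm\n' "$i"
        printf 'back.%04d.pnm\n' "$((n - i + 1))"
        i=$((i + 1))
    done
}

interleave 3
```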

By default the script scans from the ADF in grayscale @300dpi and saves to format PDF.
So to scan a letter-sized document from the ADF @300dpi grayscale,
then compress using lzw, binarize using djvu and save to OUTFILE.pdf
one would use:

To save to another format, use --format={pnm,tif,pdf,ps,djv} or alternatively,
-pnm <equivalent to --format=pnm>
-tif <equivalent to --format=tif>,
and similarly for the other output options:
-pdf, -ps, -djv

Shortcut options, like the above switches, take a single '-',
and arguments requiring a value have the form '--option=value'
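
The two option styles can be handled with a plain case statement, roughly as sketched here. This is not bscan's actual parser: parse_args and the variable names are illustrative, and only a few of the options are covered.

```bash
#!/bin/bash
# Hypothetical parser for the two option styles: "-tif" style switches
# and "--option=value" arguments.
parse_args() {
    FORMAT=pdf
    ROT=0
    OUTFILE=
    local arg
    for arg in "$@"; do
        case "$arg" in
            --format=*) FORMAT=${arg#--format=} ;;
            --rot=*)    ROT=${arg#--rot=} ;;
            -pnm|-tif|-pdf|-ps|-djv) FORMAT=${arg#-} ;;
            -*) echo "unknown option: $arg" >&2; return 1 ;;
            *)  OUTFILE=$arg ;;
        esac
    done
}

parse_args -tif --rot=180 OUTFILE
echo "$FORMAT $ROT $OUTFILE"
```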

One can specify various binarization algorithms,
such as those from Fred Weinhaus http://www.fmwconcepts.com/imagemagick/index.html
using the option --thresh={bw, constant, 2color, fuzzy, isodata, kmeans, sahoo, triangle, }
where the various binarization scripts must be in your $PATH.

If you use xsane or gscan2pdf to scan some images because, e.g. you need to crop the image
or tweak the contrast/brightness/gamma settings,
you can save the images as OUTFILE.%d.pnm
e.g. OUTFILE.0001.pnm, OUTFILE.0002.pnm, ...
Then you can use bscan with the option "-noscan" to skip the scanning,
and instead just process the images:
e.g., to rotate the images 180 degrees and binarize using djvu compression:
B -noscan -BW --rot=180 OUTFILE
which would process the series of images and create one multipage OUTFILE.pdf

One can also deskew images using unpaper from http://unpaper.berlios.de/
The options to "unpaper" are hardwired into bscan because they are just too numerous
to specify on the command line,
so it might be best to just make a local copy of bscan
and modify the line which runs unpaper using whatever unpaper options you need.
Alternatively, you could add an option for unpaper settings
so that you could scan, e.g. B --unpaper=setting1 -BW OUTFILE
where setting1 would be specified in ~/.bscanrc or hardwired into bscan.

To photocopy, i.e. scan then print to printer:
For letter printed to PRINTERLETTER
B -prn --n=<number of copies>

You just need to define the lp printer instances in /etc/cups/lpoptions or ~/.cups/lpoptions
However, I find that KDE keeps modifying/deleting any printer instances in ~/.cups/lpoptions
so I've given up and just use /etc/cups/lpoptions, which KDE leaves untouched.

You can define lp printer instances using lpoptions,
but I find it easier to just directly edit /etc/cups/lpoptions
e.g. for my Xerox Phaser8860 print queue

I was googling for Zacchaeus Pearsall's original version of this script, when I found this page.
I too used his script as a starting point when writing a shell script for batch document scanning using scanadf.

My version "bscan" is available at...

Thank you for this! Brother had provided some scripts, but they used a tool that no longer works. I will have to see if I can use this.