Forming a "scanner club"

I've accumulated tons of paper, and automated scanner technology keeps getting better and better. I'm thinking about creating a "scanner club." This club would purchase a high-end document scanner, ideally used on eBay. This would be combined with other needed tools such as a paper cutter able to cut the spines off bound documents (and even less-loved books) and possibly a dedicated computer. Then members of the club would each get a week with the scanner to do their documents, and at the end of that period, it would be re-sold on eBay, i.e. a "ReBay." The cost, divided up among members, should be modest. Alternatively, the scanner could be kept and time-shared among members from then on.

A number of people I have spoken to are interested, so recruiting enough members is no issue. The question is, what scanner to get? Document scanners can range from $500 for a "workgroup" scanner to anywhere from $1,500 to $10,000 for a "production" scanner. (There are also $100,000 scanning-house scanners that are beyond the budget.) The $500 units are more modest in ability and are not worth sharing.

As you go up in price, the main thing that changes is speed in pages per minute. That's useful, but for private users not the most important attribute. (What may make it important is needing to monitor the scanning job to fix jams or re-feed. Then speed makes a big difference.)

To my mind the most important feature is how automatic the process is -- can you put in a big stack of papers and come back later? This means a scanner which is very good at not jamming or double-feeding, and which handles papers of different sizes and thicknesses, and can tolerate papers that have been folded. My readings of reviews and spec sheets show many scanners that are good at detecting double feeds (the scanner grabs two sheets) as well as detecting staples, but the result is to stop and fix by hand. But what scanners require the least fixing-by-hand in the first place?

All the higher end units scan both sides in the same pass. Older ones may not do colour. Other things you get as you pay more will be:

Bigger input hoppers -- up to around 500 sheets at a time. This seems very useful.

Better, fancier OCR (generating searchable PDFs) including OCR right in the hardware.

Automatic orientation detection

Ability to handle business cards. Stack up all those old business cards!

The VRS software system, a high end tool which figures out if the document needs colour, grayscale or threshold, discards blank pages or blank backs and so on.

In a few cases, a CD burner, so the scanner can be used without a computer.

Buttons to label "who" a document is being scanned for (can double as classification buttons.)

Ability to scan larger documents. (Most high-end units seem to handle 11" wide, which is enough for me.)

One thing I haven't seen a lot of talk about is easy tools to classify documents, notably if you put several documents in a stack. At a minimum it would be nice if the units recognized a "divider page," which could be a piece of coloured paper or a piece of paper with a special symbol on it which means "start new document." One could then handwrite text on this page to have it as a cover page for later classification at the computer, or if neatly printed, OCR is not out of the question. But even just a sure-fire way to divide up the documents makes sense here. Comments suggest such tools are common.
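The divider-page idea is easy to express in software once the pages are scanned. Here is a minimal sketch, assuming some detector already exists that can tell a divider sheet from an ordinary page (a colour check, a printed symbol, a barcode). The names split_at_dividers and is_divider are made up for illustration and do not come from any scanner's software.

```python
from typing import Callable, Iterable, List, TypeVar

Page = TypeVar("Page")


def split_at_dividers(
    pages: Iterable[Page],
    is_divider: Callable[[Page], bool],
    keep_divider: bool = True,
) -> List[List[Page]]:
    """Split one scanned stack into documents at each divider page.

    is_divider is a hypothetical detector supplied by the caller.
    With keep_divider=True the divider stays as the cover page of the
    document it starts, so it can be read or OCR'd later.
    """
    documents: List[List[Page]] = []
    current: List[Page] = []
    for page in pages:
        if is_divider(page):
            # A divider closes the document in progress, if any.
            if current:
                documents.append(current)
            current = [page] if keep_divider else []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


if __name__ == "__main__":
    stack = ["DIV", "bill p1", "bill p2", "DIV", "letter p1"]
    for doc in split_at_dividers(stack, lambda p: p == "DIV"):
        print(doc)
```

Pages that come before the first divider are kept as their own document rather than dropped, on the theory that losing pages is worse than a mislabelled file.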

It may be that the most workable solution is to hire teen-agers or similar to operate the scanner, fix jams and feed and classify documents. At the speeds of these scanners (as much as 100 pages/minute for the higher end) it seems there will be something to do very often.

Anyway, anybody have experience with some of the major models and comments on which are best? The major vendors include Canon, Xerox Documate/Visioneer, Fujitsu, Kodak, Bell and Howell and Panasonic.

As I said, that is my #1 requirement already. Most of the scanners get faster as you pay more; what I'm looking for is reports from people who have used various models on just how robust the feeders are, because that's the only way you'll be able to put in a stack of household papers and leave it to run.

The divider page sounds like a great idea, and it ought to be completely separable from the scanner choice. I could see people wanting to extend the 'divider page' concept into 'label stickers' that can be put directly on many pages. Barcodes may have all the desired properties already (easy to print, easy to find in a scan, easy to read, etc).

I've been thinking about my own paper scanning problem, and I was leaning towards building a copy stand that would let me use a digital camera as the scanner. That would be roughly no-cost, and it would gracefully handle odd/delicate pages such as newspapers. Leasing time on a higher-end scanner would probably be better for me, though.

These scanning systems are all combinations of hardware and software. Since the real thing you are buying here is convenience and less effort on the part of the person scanning, you want it all in one place. You certainly don't want to have to build any features yourself. That said, many of these scanners do have bar code processing in them, so a sheet of bar code stickers might well work.

The digital camera approach is interesting, but harder than it seems. The resolution is sufficient, and it's fast and in colour, but you need to get the page completely flat for a good quality scan, and lit in a way that doesn't have glare. One approach is to put glass or plexi over the page, but that's work and glare can be a problem -- you need a dark room or a nice arrangement of lights.

A more interesting approach I thought of would be to make a suction table, sort of like an air hockey table in reverse, with lots of small holes in a flat surface, and a mild suction that is keyed to your shooting process -- ie. it goes off and on in a smooth way perhaps with a foot pedal or automatic controls to make it easy to slap down pages.

I considered that having $10/hour teens with cameras might be an effective method, in that they should be able to do a page every few seconds easily. For the best quality you want the camera dead on, which means mounted above the shooting area, but for the highest speed, bringing the camera to the documents (hand-held) might be sufficient.

It's possible that cheap workers in India with digital cameras could beat the price of the automatic scanner. Doing duplex with cameras takes twice as much scanning though.

You have identified many of the key issues in document scanning. Slipsheets and barcode stickers are widely used in the industry. Digital cameras with grunt labor turning pages is reportedly the way Google is scanning books. Your best bet is probably to buy a $400 personal desktop duplex scanner at 20 ppm and feed it patiently. This will take a long time. But I think that for production-scale scanning of mixed material you are in way over your head. If you expect to simply drop a stack of papers from your file drawers into a high-speed scanner and walk away you will be very disappointed. This is why there are document scanning services in most metropolitan markets. Your desire indicates a latent market for a roving document scanning service that will come to your home or office and do it all in one day with highly engineered equipment and processes. The question is, what's the value to you, and so how much could they charge? The pricing would probably have to be above the value to consumers (who value personal time cheaply) but to small businesses going digital it could be worth it. I'd look for this service coming into existence in the next five years or so.
-Eric Saund
Area Manager
Perceptual Document Analysis
(Xerox) PARC

My current understanding of the services has them not at the price point that people would like for this. I think many people would like to get rid of their stack of papers, and would spend a day of their time on it, but would not pay a scanning service. I don't expect to just throw in a stack of papers of course, but what I do wonder about is taking a "stack" that I have done a bit of work to order and straighten. Some will scan no problem. Others were once folded (like bills etc.) and I would expect are more of a challenge.

If jams and bad feeds are going to be common, a fast scanner makes more sense because you would expect to sit there, feeding and unjamming, and if it's 60 ppm, you will spend little time just starting and most time in feeding and unjamming. Which is actually good.

I would expect decent clean feeding for things like well stacked documents pulled from binders, magazines and books with the spine chopped off etc. (Google of course can't usually chop spines, which is part of why the digital camera approach they and the Internet Archive use makes the most sense.) I would expect more trouble with the contents of my old filing cabinets. Business cards, if supported, should do well.

(For business cards, another alternative would be for somebody to program a sheetfed scanner which is the width of a page to let you feed in business cards in parallel, zipping from left to right as you feed them in. That would let you do your cards much more quickly than any business-card-only scanner unless it has a perfect autofeeder.)

Over the years I have accumulated a fair bit of documentation in the form of photographed pages. I have a habit of photographing stuff people try to hand me to carry about, then giving it back. A while ago I set one of the OCR programs to simply scan every photo I've stored and extract only the text, while building a matching tree in another directory. Then I deleted all the docs less than 1kB and merged down to one directory (smart rename). That gave me about 300 documents after a week. Most were surprisingly accurate - for a page of text in good light I was getting ~5 errors per page. For the stuff that counts (my handwritten additions to printed docs) the error rate was worse than a person trying to read my writing. But that's expected, and I will eventually have to type things in off the images if I want that.
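The batch OCR pass described above could look roughly like the sketch below. The comment does not say which OCR program was used, so this assumes the pytesseract and Pillow packages for the OCR step, and the names ocr_tree, tesseract_ocr and IMAGE_SUFFIXES are made up for illustration. Results under 1 kB are skipped up front rather than written and deleted afterwards, and the merge into a single directory is left out.

```python
from pathlib import Path
from typing import Callable, List

# Assumption: which file types count as photos worth OCRing.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def tesseract_ocr(image_path: Path) -> str:
    """Default OCR step. Assumes pytesseract and Pillow are installed."""
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def ocr_tree(
    src: Path,
    dst: Path,
    ocr: Callable[[Path], str] = tesseract_ocr,
    min_bytes: int = 1024,
) -> List[Path]:
    """OCR every image under src into a matching tree of .txt files under dst.

    Text shorter than min_bytes is treated as a failed or empty page
    and is not written at all.
    """
    written: List[Path] = []
    for image in sorted(src.rglob("*")):
        if not image.is_file() or image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text = ocr(image)
        if len(text.encode("utf-8")) < min_bytes:
            continue
        # Same relative path as the photo, with a .txt extension.
        out = dst / image.relative_to(src).with_suffix(".txt")
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(text, encoding="utf-8")
        written.append(out)
    return written
```

Passing the OCR step in as a function means a different engine can be swapped in without touching the directory walk.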

example of stuff that I want OCR'd but don't think it'll happen this decade: http://www.mozbike.com/build/long-2/one-less-ute-01.jpg

I suspect that you could get 90% accuracy just doing photo+scan. So it depends a lot on what your documents are - if they're publications I think just waiting until the googleborg sucks everything up then doing a get on your partial matches would work. For stuff that is not going to be borged it's harder, you want more accuracy. But that might be surprisingly little material for most people. I'm thinking of all the government RFC/RFD blurf I accumulate with the matching submissions, for instance, and those will likely be put on the net at some stage without my help. It's personal bills and so on that won't, but for many of those you explicitly want an image not a scan.

We have problems throwing stuff away, even stuff that should clearly be valueless. I would not get a scanner just for that, but having the fancy scanner would mean it might as well be done -- and then can be tossed. Old magazine collections. Books you know you never want again, nor plan to recommend to folks but can't bring yourself to toss. And yes, old records and files.

You can't wait for Google because the goal is to throw it away, which you won't do on an "it's likely."

In some cases, like a collection of Consumer Reports, you could just buy a subscription there I suppose.

But mostly for me it's 9 file drawers of stuff, and some of my father's papers. But with the fancy scanner I would do everything it had an easy time of scanning, and selected parts of things it doesn't handle so well.

I just got reminded that I'm hosting this ... www.ihpva.org recently gained a scanned and indexed copy of their Human Power journal thanks to a bunch of people that wanted it. Likewise www.ata.org.au have been persuaded to offer their magazine on CD updated every year or two. We're hopefully moving to offering online subscriptions (ideally the way www.homepower.com do). So with formal publications, no matter how small, it's likely that if you scan and OCR it people will love you for it.

> Al, your name came up in a discussion at the computer history
> museum. I read the bitsavers page but wonder if you might have
> some more advice on the question about production scanners for
> digitizing one's archives that I pose here

My current preference is the Kodak 2500D / Panasonic KV-S6xxx series SCSI
scanner for the bulk of my scanning. They can be had used for $500 or so and
will handle 11x17 documents. 400dpi monochrome is what I do the bulk of my
scanning in.

The main question is: are you scanning this for preservation, or access? If the latter,
xerox-quality (400dpi bitonal) is adequate. Preservation (one-of-a-kind things
of substantial value) has different requirements. I think a week of someone's
time on a scanner is pretty optimistic. On a good day, I can get through a book
box worth of paper (roughly 5000 pages) without any postprocessing. The volume
of paper that I have (millions of pages) is one of the things that prompted my creation
of bitsavers.org over five years ago. Since that time, I've scanned several million pages
and post-processed (basic cleanup and pdf-ing) about 1/3 of it.

My advice is to come up with a moderately priced duplex scanner, and set up a workflow
for dealing with the digitization. You are unlikely to get agreement on what to do for
postprocessing, so scan at as high a resolution as you can afford (time, storage, etc.) and
do minimal postprocessing to make sure the scans you have are good (no missing pages, etc.).

As far as a 'club' goes, check if the others have the same material that they want to scan
to try to minimize duplication.

But I had had the impression that somehow there would be some reason the $5,000 scanners would cost so much -- what is it that they deliver to justify that cost, if you think a $500 unit is the unit of choice? That's the reason for a club, especially one that will ReBay when done, because a scanner bought for $5K and sold for $4,500 when done, split among 10 people, is very low cost, effectively putting us in a "price is no object" position in choosing the scanner (to a degree.) With a $500 scanner it's more tempting to keep the scanner when done for future stuff, though none of these scanners have Linux drivers and I've been trying to be rid of Windows for regular use. The more expensive ones do seem to have much higher duty cycles if you want to scan all weekend.
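To put a number on the ReBay arithmetic, here is a small sketch. The purchase price, resale price and member count are the ones from the paragraph above. The extra_costs parameter (shipping, selling fees and the like) is my own addition and defaults to zero because I have no real figures for it.

```python
def cost_per_member(
    purchase_price: float,
    resale_price: float,
    members: int,
    extra_costs: float = 0.0,
) -> float:
    """Net cost per member of buying a scanner and reselling it afterwards.

    extra_costs is a hypothetical catch-all for shipping and selling fees.
    """
    if members <= 0:
        raise ValueError("a club needs at least one member")
    return (purchase_price - resale_price + extra_costs) / members


if __name__ == "__main__":
    # Bought for $5,000, resold for $4,500, split ten ways.
    print(cost_per_member(5000, 4500, 10))
```

With those numbers each member is out $50, which is why the sticker price matters much less than the resale value.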

I think people want a mix of uses. For old files and records, magazines, it's just access (and destruction.) For things truly needing preservation, they won't be thrown out at all, though they might be moved into storage boxes. They might still be scanned for access, and for distribution. (For example, if scanning the papers of a dead parent, one might make copies for the family.) Then there is a middle class, documents that will be discarded but for which bitmaps are not enough.

But one unit might not handle all this. A club might even consider getting two units, both to allow two people to scan and to handle different needs (business cards, photos and other colour items might do better on another scanner.)

A club would also purchase a professional paper cutter that can cut the spine off a magazine or book. Those are at least $200 to $300, so it's better when shared. For magazines grayscale is probably desired.

However, there is an argument that at today's disk prices -- 20 cents/gigabyte for hard disk and 5 cents/GB for DVD-ROM -- there is no reason to throw away information at all. A full colour 300dpi scan of a page is only 22 MB uncompressed, and would usually compress to quite a bit less even losslessly, and even less in an appropriate lossy format. (If it has not been done already, I can imagine the design of a lossy format for scanning which works hard to preserve edges for OCR but is happy to throw away gradual shading and noise etc.)
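The page-size figure is easy to check. The sketch below assumes a full US Letter sheet (8.5 x 11 inches) and 24-bit colour, neither of which is stated above. For that case it comes out a little over 25 million bytes, the same ballpark as the figure above, and a 400dpi bitonal page is under 2 MB. The function names are made up for illustration.

```python
def uncompressed_page_bytes(
    width_in: float, height_in: float, dpi: int, bits_per_pixel: int
) -> int:
    """Raw size of one scanned page side, before any compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def storage_cost_dollars(num_bytes: int, dollars_per_gb: float) -> float:
    """Cost of storing num_bytes, taking 1 GB as 10**9 bytes."""
    return num_bytes / 1e9 * dollars_per_gb


if __name__ == "__main__":
    colour = uncompressed_page_bytes(8.5, 11, 300, 24)   # full colour
    bitonal = uncompressed_page_bytes(8.5, 11, 400, 1)   # black and white
    print(colour, bitonal)
    # Hard disk price quoted above: 20 cents per gigabyte.
    print(storage_cost_dollars(colour, 0.20))
```

At 20 cents per gigabyte an uncompressed colour page costs about half a cent to keep, before any compression.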

Also, on your page you say the Kodak 2500D is the Panasonic S2055, not an S6 series -- which are you recommending? There sure are a lot of choices.

> But I had had the impression that somehow there would be some reason the $5,000 scanners would cost so much
> what is it that they deliver to justify that cost, if you think a $500 unit is the unit of choice?

The 2500D was a $5000 unit. They are about five years old and SCSI, which makes them less interesting in a USB
world.

http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=140178758338

for example.

> the Kodak 2500D is the Panasonic S2055, not an S6 series -- which are you recommending?

Especially with a scanner to be moved around people. Actually, for that something like the Canon CD 4050 which just has its own CD burner and ethernet has a lot of attraction.

Still, while the 2500D is down to $500 there are other older scanners (not all USB) still fetching several thousand dollars. I will check more into its features -- the spec sheets are vague about its grayscale abilities, saying "capable of grayscale output up to 600dpi" which makes no sense for a 400dpi optical scanner.

I saw the one with the door missing. While I presume a door is not hard to get ahold of, buying as-is stuff on eBay is a tricky business. Anyway, some sneaky guy went in and bought it -- does seem worth it for you, even if just for parts.

Great topic. I have been using the Fujitsu 4120C for document scanning for the last couple of years and love it. Scans very quickly... but, like most scanners you have to monitor the feed. I paid $600 in 2004. Looks like the price hasn't budged much.... http://www.dealtime.com/xPO-Fujitsu-fi-4120C

Last week I had an opportunity to demo the Kodak 1220 for photo scanning. It was incredible. I scanned 3,000 old images and unlike most of the document scanners I have used, the feeder on the 1220 for photos worked great. The 1220 also comes with Kodak image software. I scanned family images at 1200 dpi and other images at 300 dpi; I could have dumped the latter, but opted to scan everything since storage is so cheap. Cost for the 1220 about $900 after rebate. http://www.kodak.com/US/en/dpq/site/TKX/name/s1220Scanner_product

The idea of a club is interesting. Even though I had intended to keep up with it, I certainly don't scan on a weekly or even monthly basis. Maybe every 4-6 months. If there were 12 people in the club, each person could get it for one week every quarter. That would also be a good forcing function. That said, I am not sure that I would need a faster scanner. The real time required is all the prep work. Removing staples from documents, and images from old albums, is the time-consuming part of going digital.
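A one-week-each rotation for 12 people takes 12 weeks, which fits inside a 13-week quarter with a week to spare. Here is a rough sketch of such a schedule. The member names and start date are placeholders and rotation is a made-up helper.

```python
from datetime import date, timedelta
from typing import List, Sequence, Tuple


def rotation(
    members: Sequence[str], start: date, weeks_each: int = 1
) -> List[Tuple[str, date, date]]:
    """Give each member a consecutive slot: (name, pickup date, handoff date)."""
    schedule = []
    slot_start = start
    for name in members:
        slot_end = slot_start + timedelta(weeks=weeks_each)
        schedule.append((name, slot_start, slot_end))
        slot_start = slot_end
    return schedule


if __name__ == "__main__":
    club = [f"member {i}" for i in range(1, 13)]  # placeholder names
    for name, pickup, handoff in rotation(club, date(2024, 1, 1)):
        print(name, pickup, handoff)
```

Repeating the same call with the next quarter's start date gives everyone their week again.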

I support a web application and we want to scan docs from the web application. Basically we want to select the scan icon from within the web application, activate the scanner (scan the document), and upload the document within the web application. Would you know how to do this or know someone that does?

Hi, nice post Brad. I am really researching about the features about Fujitsu 4120C. I think, it is really a good product for a scanner, it has a lot of features, especially the quality of picture being scanned to. It is really a great product to use.