Image Format Comparison

For Archiving Documents

by

2 April 2005

Introduction

Recommendations by situation

The table below shows the effects of scanning a printed word into several different formats.
It is aimed particularly at people who need to archive printed or written documents, especially
old ones, in digital form. The information may be useful to others as well. For a more general
introduction to image formats, especially as they apply to the web, I suggest Daniel Beardsmore's article Making good use of Web image formats.

The original word was printed in 12-point Palatino. Except for the large GIF image, all scans were done
at 300dpi. The scanner was an HP Scanjet 4400c -- a low-end scanner -- using the supplied HP Precisionscan Pro software.

The images shown here are not the original scans, for two reasons. First, I wanted to display them
several times normal size so that you have half a chance of seeing the artefacts. Second, I needed a
display format which would not modify the images any further in the process of displaying them, and this
dictated using PNG format for the display images.

There is no standard definition of "high, medium, low" for JPEG image quality, nor any standard
measure of JPEG quality. The high, medium and low quality settings in the table are what the HP software offered. Some software
allows you to pick the quality as a number, but that number is no more standardized than "high, medium, low".

Note that I do not discuss TIFF. TIFF is a container file format, rather than a single image format:
various different image formats can be stored within a TIFF file. The acronym TIFF has become
associated with lossless formats and especially with uncompressed formats, but that's only by convention and not
by definition. Due to the variety of contained formats possible in TIFF, compatibility varies greatly. Now (April 2005)
that PNG is widely recognized in current graphics software, there's little reason to consider TIFF.

I only consider open formats, not proprietary formats. For example, PSD (Photoshop) is an excellent format
for preserving documents, but it's a proprietary format.

The file sizes are approximate, because the effectiveness of compression varies depending on the data.
The "full page" is for an 8.5"x11.7" image, thus larger than either US letter or A4 paper.
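To put the full-page figures in context, here is a minimal sketch of the arithmetic behind an uncompressed scan. The function names are my own for illustration, and it assumes 24 bits per pixel with no padding or file headers.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes, rounded up to a whole byte."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8


# The "full page" from the text: 8.5" x 11.7" at 300dpi, full (24-bit) color.
width_px, height_px = page_pixels(8.5, 11.7, 300)
raw_bytes = raw_size_bytes(width_px, height_px, 24)

print(f"{width_px} x {height_px} pixels")                       # 2550 x 3510 pixels
print(f"{raw_bytes / 1_000_000:.1f} million bytes uncompressed")  # 26.9 million bytes
```

Against roughly 26.9 million bytes of raw data, the 15.2MB full-page PNG in the table is somewhat more than half the uncompressed size.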

Recommendations by situation

Photographs for archival storage or later editing: PNG, second choice high quality JPEG.

If your hard disk is too small: Buy a large external disk and a DVD writer!
External hard disks in the 150GB-250GB range, in USB or Firewire, now (2005) cost well under US$1/GB,
even less for internals. You need backups; many new computers now include DVD writers, and a separate
DVD writer is under $100 (internal). A full-page PNG image at 300dpi will use about 20MB of disk, but
a 200GB disk can hold about 10,000 of these!

If your software can't read PNG files: Most likely you are running old software.
I sympathize with the desire to keep running old systems which are working perfectly well,
but some of the newer releases have brought some real improvements. If you are serious about
archiving images of documents, this is a good reason to upgrade.

Half-tone images, such as magazine and newspaper photographs: these require
specialized scanning techniques which I do not cover here.

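Each entry in the table gives a description of the format, the format and scan settings, the file size
(sample word / full page), the recommendations, and a sample image (magnified 2x).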

PNG is a lossless format which captures full color. It's compressed, but what you see is exactly what the scanner originally
saw. The compression is only mildly effective, as you can see by comparing the file size with even the
highest quality JPEG file. The image looks slightly fuzzy in the magnification, but this is only because
when the scanner scanned a pixel which was half white and half black, it averaged to gray. So the
gray edges in fact are real data, not an artefact.
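The averaging described above can be sketched with a simplified model of the scanner. It assumes each scanner pixel reports the plain average of the paper beneath it; the 16 fine samples per pixel and the position of the ink stroke are made-up values for illustration, not measurements from the HP scanner.

```python
def scan_row(fine_samples: list[int], samples_per_pixel: int) -> list[int]:
    """Average each run of fine samples into one scanner pixel (0=black, 255=white)."""
    pixels = []
    for start in range(0, len(fine_samples), samples_per_pixel):
        chunk = fine_samples[start:start + samples_per_pixel]
        pixels.append(sum(chunk) // len(chunk))
    return pixels


WHITE, BLACK = 255, 0
SAMPLES_PER_PIXEL = 16  # hypothetical: 16 fine samples under each scanner pixel

# A strip of paper five scanner pixels wide, with an ink stroke that starts
# and ends halfway through a scanner pixel.
paper = [WHITE] * 80
for i in range(24, 56):
    paper[i] = BLACK

scanned = scan_row(paper, SAMPLES_PER_PIXEL)
print(scanned)  # [255, 127, 0, 127, 255]
```

The two 127 values are the half-covered pixels: the gray is a faithful record of where the ink edge fell, which is why a lossless format keeps it.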

PNG
300dpi

29KB/15.2MB

Because PNG is lossless and full color, it is the largest of
the files. However, it has no compression artefacts and can be edited
repeatedly without loss of quality. Best for any image (other than
black and white) which may need to be edited in the future. Best for
archival storage. Good for all full color images, including pictures of
documents when the actual appearance is important.
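The claim that a lossless format survives repeated saving can be checked with a small sketch. PNG compresses its pixel data with DEFLATE, which Python exposes through the standard zlib module; this example compresses raw bytes directly and leaves out PNG's row filters and file structure, so it illustrates the principle rather than producing a PNG file. The pixel values are invented.

```python
import zlib

# A tiny synthetic "scan": 8-bit gray rows with a dark stroke and gray edges.
row = bytes([255, 255, 127, 0, 0, 127, 255, 255])
pixels = row * 64  # 64 identical rows, 512 bytes in total

data = pixels
for generation in range(5):
    compressed = zlib.compress(data, level=9)  # "save" the image
    data = zlib.decompress(compressed)         # "reopen" the saved file

print(len(pixels), "bytes raw,", len(compressed), "bytes compressed")
print("identical after 5 save/load cycles:", data == pixels)
```

However many times the data goes through the cycle, the bytes that come back are the bytes that went in.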

GIF is a lossless format which captures very limited color (at most 256 colors per image). In my scans, I specified
that the GIF images use only black and white. It's compressed, but what you see is exactly what the scanner originally
saw. The compression is very effective on most images. The edge looks grainy because there's no gray
to fill in the part between full black and full white. For clean original documents, a B&W format such
as this is ideal for OCR.
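The grainy edge comes from forcing every gray pixel to one side or the other. A minimal sketch, assuming a simple fixed threshold of 128 (scanner software may choose its threshold differently) and invented gray values along the slanted edge of a stroke:

```python
def to_black_and_white(gray_rows, threshold=128):
    """Map each gray value (0-255) to '#' (black) or '.' (white)."""
    return ["".join("#" if value < threshold else "." for value in row)
            for row in gray_rows]


# Hypothetical gray values along the slanted edge of a letter stroke.
gray_rows = [
    [255, 255, 200, 40, 0],
    [255, 255, 140, 10, 0],
    [255, 230, 90, 0, 0],
    [255, 170, 30, 0, 0],
    [255, 110, 0, 0, 0],
]

for line in to_black_and_white(gray_rows):
    print(line)
# ...##
# ...##
# ..###
# ..###
# .####
```

The grays changed smoothly from row to row, but the black and white result moves in whole-pixel steps, which is the graininess visible in the magnified sample.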

GIF
300dpi

1KB/88KB

Because the image is B&W (and thus one bit per pixel compared with 24 bits per pixel for full
color), and because the compression is so effective, this is the smallest of the files. However, it does
not look good at this size due to the graininess. Good for OCR on clean printed or typed documents.
This is shown at many times the original size; when printed at 300dpi it will still look a lot better
than a fax.
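The 1-bit versus 24-bit difference can be made concrete by packing pixels into bytes. This is only a sketch of the storage arithmetic before any compression; the packing layout here (most significant bit first, zero padding at the end) is an assumption for illustration and not a description of GIF's internal format.

```python
def pack_bits(bw_pixels):
    """Pack black-and-white pixels (0 or 1) into bytes, eight pixels per byte."""
    packed = bytearray()
    for start in range(0, len(bw_pixels), 8):
        chunk = bw_pixels[start:start + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        byte <<= 8 - len(chunk)  # pad a short final chunk with zero bits
        packed.append(byte)
    return bytes(packed)


row = [1, 1, 1, 0, 0, 0, 1, 1] * 2  # 16 pixels, 1 = white, 0 = black

one_bit = pack_bits(row)
full_color = b"".join(b"\xff\xff\xff" if p else b"\x00\x00\x00" for p in row)

print(len(one_bit), "bytes at 1 bit per pixel")       # 2
print(len(full_color), "bytes at 24 bits per pixel")  # 48
```

Scaled up to the full page (2550 x 3510 pixels at 300dpi), that is roughly 1.1 million bytes against roughly 26.9 million before any compression, so the 88KB GIF in the table is about a twelvefold reduction on top of that.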

This image is also GIF but is four times the resolution of the previous GIF.
(The image here is only twice the size because I didn't double this one.) I scanned this one at 1200dpi
instead of the 300dpi I used for the others. Despite having 16 times as many pixels, it's still a very
small file.

GIF
1200dpi

5KB/649KB

Best for OCR on clean printed or typed documents, though 1200dpi is usually overkill. Use 600dpi for
best OCR speed and accuracy on most documents.

JPEG uses lossy compression: once it's been compressed, you cannot get an exact original
back. The compression uses a model of human vision to discard information from the image which the
human eye won't miss. Quality can be set anywhere within a range; lower quality means more
information is discarded.

JPEG is very good for continuous tone images, the kind that photographs are mostly composed of.
It falls down badly on images with many high-contrast sharp edges -- the kind of thing that documents
are almost entirely composed of. Look at all the noise around and within the letters in the low quality JPEG image. It's
a low-quality JPEG -- meaning a lot of information was discarded -- but it shows what happens in all
JPEG images with high contrast edges, just more dramatically. You see less noise in the medium quality
image, but it's still easy to see. Artefacts in the high quality image are very hard to see even if you magnify
it two or three times. (Some web browsers, such as Opera, make magnifying the page easy.) You can still detect them
in some documents though, and these artefacts can confuse OCR software.
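Why sharp edges suffer can be shown with a toy version of the transform step. JPEG applies a discrete cosine transform (DCT) to 8x8 blocks and rounds the resulting coefficients; the sketch below simplifies this to a single 8-sample row and one uniform rounding step for every coefficient. Real JPEG uses per-frequency quantization tables, so the step values 10 and 50 are purely illustrative assumptions, as are the sample values.

```python
import math

N = 8  # JPEG works on 8x8 blocks; this sketch uses one 8-sample row


def dct(block):
    """Orthonormal 1-D DCT-II of an 8-sample block."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * sum(x * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                               for n, x in enumerate(block)))
    return out


def idct(coeffs):
    """Inverse of dct() above (DCT-III with matching scaling)."""
    out = []
    for n in range(N):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
            total += scale * c * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
        out.append(total)
    return out


def lossy_roundtrip(block, step):
    """Round every DCT coefficient to a multiple of `step`, then invert."""
    quantized = [round(c / step) * step for c in dct(block)]
    return idct(quantized)


def rms_error(original, restored):
    """Root-mean-square difference between two 8-sample blocks."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, restored)) / N)


smooth = [100, 102, 104, 106, 108, 110, 112, 114]  # gentle gradient, as in a photo
edge = [255, 255, 255, 255, 0, 0, 0, 0]            # white paper meeting black ink

for name, block in (("smooth", smooth), ("edge", edge)):
    for step in (10, 50):  # hypothetical steps: small = high quality, large = low
        restored = lossy_roundtrip(block, step)
        print(f"{name:6} step={step:2}  rms error={rms_error(block, restored):5.1f}  "
              f"restored={[round(v) for v in restored]}")
```

In this toy case the edge block comes back with a noticeably larger error than the gradient at the same step, and the error grows as the step gets coarser. The restored edge values no longer sit cleanly at black and white, which corresponds to the noise around the letters in the samples.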

Avoid repeatedly editing JPEG images at any quality, because you lose additional information each time the
image is recompressed to save it.

JPEG
300dpi
low quality

2KB/313KB

Good for photographs on the web because it does a reasonably good job
(especially for small images) and minimizes download times. Never use for archival copies or for
copies which may need editing again. Never use for documents.

JPEG
300dpi
medium quality

3KB/371KB

Best for photographs on the web. Download times are only a little
more than for low quality, and the appearance is often considerably better, especially for medium to
large sized images. Never use for archival copies or for
copies which may need editing again. Never use for documents.

JPEG
300dpi
high quality

14KB/4.40MB

Very good for archival storage of photographs which will not need editing again, or only
minor editing. Poor for images on the web due to large size -- convert to medium
quality for the web. Marginally OK for archiving documents in a pinch, if black and white is inadequate and you
don't have space for PNG images.