One of the problems we've faced on Groklaw is how incredibly difficult it is to get plain text from PDF files that come from scanned paper legal documents. No doubt the makers of products like OmniPage Pro would love it if thousands of us bought their product, but it isn't available for GNU/Linux systems, it is closed and proprietary, and at $499 it's very expensive for mere mortals such as Groklaw volunteers. When Google released Tesseract, that was great, except it's a bit hard to use. So here are some instructions to make it a bit easier.

******************************
Check requirements

You need to have a working Ghostscript installation with TIFF support. You need
to have the following libraries installed: libtiff, libjpeg, and zlib.
Most distros will have these installed "out of the box".

You also need the corresponding header files. These are usually packaged in the
corresponding "-devel" packages. So, using your preferred package management
software, make sure the following (or equivalent) are installed:

ghostscript

(ghostscript may be divided into smaller sub-packages in your distro)

libtiff
libtiff-devel
libjpeg
libjpeg-devel
zlib
zlib-devel
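
As a quick check that your Ghostscript build really has TIFF support, you can look for the tiffg3 device (the one the pdf2tif command line further down asks for) in the device list that "gs -h" prints. This is only a minimal sketch; the function name has_device is made up for this example and is not part of Ghostscript.

```shell
#!/bin/sh
# has_device NAME: read a Ghostscript device listing on stdin (for example
# the output of `gs -h`) and succeed if NAME appears in it as a whole word.
# The function name is hypothetical, chosen for this example.
has_device() {
    # Put every whitespace-separated word on its own line, then match exactly.
    tr -s ' \t' '\n\n' | grep -qx -- "$1"
}

# Only query Ghostscript if it is actually on the PATH.
if command -v gs >/dev/null 2>&1; then
    if gs -h | has_device tiffg3; then
        echo "Ghostscript has the tiffg3 device"
    else
        echo "Ghostscript lacks the tiffg3 device"
    fi
else
    echo "gs not found on PATH"
fi
```

If the device is missing, your distro has probably split Ghostscript into sub-packages and you need to install another one of them.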

Check Ghostscript installation

Fetch a PDF (I'm using "Interesting.pdf" as an example) into your current
directory and run the pdf2tif helper script (given further down) on it:

./pdf2tif Interesting.pdf

You should get a set of TIFF files, one per page of the PDF. Use an image viewer
to check they are OK.

Get Tesseract

The home page of Tesseract is http://sourceforge.net/projects/tesseract-ocr. The
current version at the time of writing is 1.02. Download the tarball (from here
on I assume it is tesseract-1.02.tar.gz) and untar it somewhere convenient, creating
a directory "tesseract-1.02". Now put the helper scripts in that same directory
and change into it.
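
The unpacking step can be sketched roughly as follows. The function name unpack_with_helpers is hypothetical, and the sketch assumes the tarball is named NAME.tar.gz and unpacks into a directory called NAME, as tesseract-1.02.tar.gz is described as doing above.

```shell
#!/bin/sh
# unpack_with_helpers TARBALL SCRIPT...
# Untar TARBALL in the current directory (assumed to be NAME.tar.gz that
# unpacks into NAME/), copy the given helper scripts into NAME/, make them
# executable, and print the directory name.
# The function name is hypothetical, chosen for this example.
unpack_with_helpers() {
    tarball=$1
    shift
    dir=$(basename "$tarball" .tar.gz)
    tar xzf "$tarball" || return 1
    for s in "$@"
    do
        cp "$s" "$dir/" || return 1
        chmod +x "$dir/$(basename "$s")" || return 1
    done
    echo "$dir"
}
```

With the tarball and pdf2tif in the current directory you might then run: cd "$(unpack_with_helpers tesseract-1.02.tar.gz pdf2tif)"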

# Final lines of pdf2tif. $OPTIONS and $outfile are expected to be set
# earlier in the script; that part is not shown here.
# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
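
The gs command line above uses $OPTIONS and $outfile without showing where they come from. Here is one way the missing part might look, as a sketch only: the option-collecting loop, the usage message, the GS override variable and above all the output file name pattern are my assumptions, not the original script. It is wrapped in a function so you can try it from an interactive shell.

```shell
#!/bin/sh
# Sketch of a complete pdf2tif, wrapped in a function. Everything except the
# final gs command line is an assumption made for this example.
pdf2tif() {
    # Collect leading "-..." options to pass through to Ghostscript.
    OPTIONS=""
    while true
    do
        case "${1-}" in
        -?*) OPTIONS="$OPTIONS $1" ;;
        *)   break ;;
        esac
        shift
    done

    if [ $# -ne 1 ]
    then
        echo "Usage: pdf2tif [gs options] input.pdf" 1>&2
        return 1
    fi

    # ASSUMPTION: output name pattern. Ghostscript replaces %03d with the
    # page number, which gives one TIFF file per page.
    outfile=$(basename "$1" .pdf)_%03d.tif

    # ASSUMPTION: GS may be set in the environment to pick another gs binary.
    # Doing an initial 'save' helps keep fonts from being flushed between pages.
    # The options appear twice because -I only takes effect if it appears
    # before other options.
    "${GS:-gs}" $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
}
```

Under these assumptions, "pdf2tif Interesting.pdf" would write Interesting_001.tif, Interesting_002.tif and so on into the current directory. The zero-padded page number keeps the files in page order when a shell glob such as *.tif lists them.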

#!/bin/sh

# Takes one parameter, the path to a PDF file to be processed.
# Uses the custom script 'pdf2tif' to generate the TIFF files
# at 300x300 dpi and drops them in the current directory,
# then runs $progdir/tesseract on each one, deleting the .raw
# and .map files that tesseract leaves behind.

# Quote the path so file names with spaces work; stop if conversion fails.
./pdf2tif "$1" || exit 1

# Edit this to point to wherever you've got your tesseract binary.
progdir=.

for j in *.tif
do
    x=`basename "$j" .tif`
    "${progdir}/tesseract" "${j}" "${x}"
    # -f keeps rm quiet if tesseract did not produce these files.
    rm -f "${x}.raw" "${x}.map"
    # Un-comment the next line if you want to remove the .tif files when done.
    # rm "${j}"
done
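
The loop above leaves one .txt file per page, since tesseract writes its output to the output base name plus ".txt". If you would rather have a single text file for the whole document, something along these lines should do it. This is a sketch: the function name join_pages and the form feed between pages are my own choices, not part of the original scripts.

```shell
#!/bin/sh
# join_pages OUTFILE PAGE.txt...
# Concatenate per-page text files, in the order given, into OUTFILE,
# ending each page with a form feed character as a page break.
# The function name is hypothetical, chosen for this example.
join_pages() {
    out=$1
    shift
    # Start with an empty output file.
    : > "$out"
    for p in "$@"
    do
        cat "$p" >> "$out" || return 1
        printf '\f' >> "$out"
    done
}
```

For example, "join_pages Interesting.out Interesting_*.txt" would collect the pages in glob order, which only matches page order if the page numbers in the file names are zero-padded. Give the output file a name the glob does not match, or it will be picked up as an input.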