Cameron Laird's personal notes on PDF conversion utilities

Multitudes of FAQs and similar references for PDF information have
been published in the past. As of 2003, I've found none that I regard
as convenient and well-maintained in regard to the "filters" that transform
files to and from PDF, not even the Conversion tools page of PDFZone or
PlanetPDF's Extraction page--so I'll start my own.

The focus of this page (anyone think I should re-do it as a Wiki?)
is on the products available to convert to and from PDF images.
IDR Solutions explains the challenge.

David Boddie's pdftools and David Leonard's PDFFile provide interesting
Python-coded raw materials for those unafraid of dirtying their hands
with programming. Early in 2005, one appreciated correspondent wrote me
that the latter "handles things like decryption better." From what I can
tell, PDFFile and python-pdftools do not write; they only read.

My clients often need to build reports which simply sequence existing
(or generated) .pdf and/or .ps source. That's a far bigger undertaking
than you might think, as Matthew Skala documents (in fact, I disagree
with a few of his details, but he certainly gets the frustration right).
Here are a few of the alternatives with which I've spent time:

Aladdin Ghostscript works on many, many of the documents
I've encountered. There appear still to be problems. I'll
eventually report details of those.

Ghostscript 7.07 gives up on some colorspaces
(perhaps specifically those indexed on an ICC basis with
RGB as an alternate color space?), and simply discards
corresponding images. Ghostscript before 7.0x can't handle
Adobe 6 output. A typical invocation is
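
As a hedged illustration (not necessarily the invocation these notes
had in mind), a widely used Ghostscript command line for joining PDF
and PostScript inputs relies on the pdfwrite device. The Python sketch
below only assembles the argument list. The helper name
ghostscript_concat_command and the file names are hypothetical, and the
binary is assumed to be installed as "gs".

```python
import shlex
from typing import List, Sequence


def ghostscript_concat_command(inputs: Sequence[str], output: str,
                               gs_binary: str = "gs") -> List[str]:
    """Build (but do not run) a Ghostscript command that joins PDF/PS files.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    if not inputs:
        raise ValueError("at least one input file is required")
    return [
        gs_binary,
        "-dBATCH",                 # exit after the last input file
        "-dNOPAUSE",               # do not pause between pages
        "-q",                      # suppress the startup banner
        "-sDEVICE=pdfwrite",       # produce PDF instead of rendering to a display
        f"-sOutputFile={output}",  # where the combined document is written
        *inputs,                   # inputs are processed in the order given
    ]


if __name__ == "__main__":
    # Placeholder file names; this prints the command rather than running it.
    cmd = ghostscript_concat_command(["cover.pdf", "body.ps"], "report.pdf")
    print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) would
execute it, provided Ghostscript is on the PATH.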

Java-based iText is a very widely-used library for PDF management.
[Explain mailing list, 1t3xt, and such.]
Be aware that, as maintainer Paulo Soares has written [find reference
in difficult mailing list], "If you're using it
in an intranet you don't have to do anything. If you're exposing the
service to the exterior either you provide the source code of your
application or you buy a commercial license." He was writing about
iText 5. Earlier releases could be freely embedded in Web
applications.

The iText creators (hope to?) receive significant income from
the book. While they generously make a wealth of information
available on-line, I don't find it organized for my convenience.
Among the highlights are:

JoinPDF appears to be a retail-oriented utility based on iText. It
does not bookmark.

While I have yet to test Tom Phelps' Multivalent, I'm looking forward
to it.

PageCatcher ...

Perl ...

pdcat was my favorite concatenator for many years. It's available for
many platforms, fast, and, most crucially to me, handles a wider range
of inputs than any of the other utilities I've tested. Also, its
bookmarking is convenient and correct [explain how far ahead of all
others this is]. Still, I have identified a few (obscure?) errors in
its operation [explain]. Worse, the old release 2.36 on which I long
relied couldn't keep up as 2009 progressed, and I couldn't justify the
licensing expense for my applications. In mid-2009, I moved many of my
operations to iText. I remain fond of vendor PDF Tools, though.

In January 2010, we suddenly switched many of our operations over to
pdfjoin, a member of the TeX-based pdfjam suite. pdfjoin handles
instances that cause pdftk, pyPdf, and iText to stumble. Phaseit is
likely to continue to invest in at least a couple of these different
open-source projects.

I haven't yet exercised DocuCom PDF Online, pdfmeld, or pdfpages.
pdfmeld's price is modest, and it's documented to be quite flexible,
with good capabilities to bookmark, watermark, highlight, underline,
and so on.

Pdftk is a GPLed "stand-alone, command-line tool that does lots of
things, including PDF concatenation," according to one enthusiast.
He provided this example usage:

pdftk A=in1.pdf B=in2.pdf cat B1 A1-7even A1-7odd \
output out1.pdf

pdftk works well, in my limited testing. It does not bookmark. It does
manage background watermarks and foreground stamps. Bruno Lowagie tells
me that, while Sid Steward has left computing for a family business,
iText Software Corporation "has plans to set up support for PdfTk".
Here, incidentally, is an interview with Bruno.
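
For scripted use, the pdftk operations mentioned above (plain
concatenation, background watermarks, foreground stamps) can be wrapped
in small command builders. This is a minimal sketch that assumes pdftk
is on the PATH. The function names pdftk_cat and pdftk_overlay and all
file names are hypothetical, while "cat", "background", "stamp", and
"output" are pdftk's own keywords.

```python
import shlex
from typing import List, Sequence


def pdftk_cat(inputs: Sequence[str], output: str) -> List[str]:
    """Command that concatenates whole input files in the order given."""
    if not inputs:
        raise ValueError("at least one input file is required")
    return ["pdftk", *inputs, "cat", "output", output]


def pdftk_overlay(source: str, overlay: str, output: str,
                  foreground: bool = False) -> List[str]:
    """Command that applies a one-page overlay PDF to every page of source.

    'background' places the overlay behind the page content (a watermark);
    'stamp' places it on top of the page content.
    """
    operation = "stamp" if foreground else "background"
    return ["pdftk", source, operation, overlay, "output", output]


if __name__ == "__main__":
    # Placeholder file names; the commands are printed, not executed.
    print(shlex.join(pdftk_cat(["in1.pdf", "in2.pdf"], "joined.pdf")))
    print(shlex.join(pdftk_overlay("joined.pdf", "draft.pdf", "marked.pdf")))
```

Building the argument list separately from running it keeps the
commands easy to log and to check before handing them to
subprocess.run.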

PStill seems to have a good record at concatenating. I want to work
more with it.

At least, that's my usual first response,
although, as 2004 begins, a couple of products are making me
soften that stance. I understand all the situations that make
text-extraction appear to be desirable; I've lived through
most of them myself. As several sages have counseled, however,
from a programmatic standpoint, "think of PDF as paper", by which they
mean you could use scissors and glue on it, but there's almost
certainly a better way. Almost always, you're--we're--better
off going upstream to the data where the PDFs originated. I'm happy
to help analyze specific situations on a consulting basis to determine
whether there's an appropriate alternative to text-extraction, and
also to help your organization implement the text-extraction method
that's best for it. For more on the subject, and especially the
possibilities for tabulated data, see this page focused exclusively
on content extraction.

If you insist on extracting text from PDF, and choose not to engage
our consultancy, you're likely to find your answer
from the following list. This list remains partial; you're welcome
to write me to ask that I unpack more of my notes, if you have
specific requirements none of these meet.

Adobe has a no-charge, online service that transforms PDF to text.
This came about, according to the unverified story that reached me,
as the result of activism by advocates of the blind, and was part of
the price of Adobe's big government contracts.

The most common legitimate reason to render PDF to text is in combination
with some sort of search; that's certainly the application of this sort
I most often automate. Search and "content management"
specialists are generally aware of the issues involved, and often offer
their own PDF extractors as plug-ins or add-ons.

For immediate results, Zamzar is a Web application that quickly
converts one or a small number of PDF-defined pages [also mention
YouConvertIt, Neevia]. Even quicker, for those running Mac OS, is
simply to open Preview and SaveAs JPG.

An abundance of installable desktop applications include the
capability to visualize a PDF page as, for example, JPG.
Among them are:

I often field questions such as, "I need to programmatically
convert Office files to PDF. Is that possible / easy? How is that done?"
I'll start with a few personal comments.

Adobe certainly wants people--especially those who control budget
decisions--to think of it as the vendor-of-preference
for all such needs. I respect Adobe for their business success and
technical achievements. My experience as a front-line customer of
theirs is ... mixed. My first instinct is to look for alternatives.

The dominant producers of PDF documents in the current market
are Acrobat and Word. I suspect someone has reasonably accurate
measurements of the share each holds; my rough impression is that
the latter dominates. It certainly is feasible to automate Word
in principle. While most Word scripters use VBA, I rely most on
Tcl or Python ... There should be no effective barriers
to full automation using Word's built-in facilities.

Word, however, emits bad PDF, and is often slow and unreliable,
at least for the tasks that matter to me. Adobe frustrates me; I
have a terrible history at trying to find out the simplest product
information from the company. When I want "industrial-strength"
automation, I turn to Antiword or OpenOffice. The latter produces
higher-quality PDF than Word, and is more open about its scripting
capabilities, at least on an ideological level.
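
One way such OpenOffice-based automation is often scripted is a
headless command-line conversion. The sketch below is an assumption
rather than a description of the setup used here: it presumes a
"soffice" binary that accepts the LibreOffice-style --headless,
--convert-to, and --outdir options, which older OpenOffice releases may
lack (those typically need UNO scripting instead). The function names
and paths are hypothetical.

```python
import shlex
from pathlib import Path
from typing import List


def soffice_to_pdf_command(document: str, outdir: str,
                           soffice_binary: str = "soffice") -> List[str]:
    """Build (but do not run) a headless office-to-PDF conversion command.

    Assumes LibreOffice-style command-line options; verify them against
    the installed office suite before relying on this.
    """
    return [
        soffice_binary,
        "--headless",            # no GUI, suitable for servers and batch jobs
        "--convert-to", "pdf",   # target format
        "--outdir", outdir,      # directory that receives the converted file
        document,
    ]


def expected_pdf_path(document: str, outdir: str) -> Path:
    """Where the converted file is expected: same base name, .pdf suffix."""
    return Path(outdir) / (Path(document).stem + ".pdf")


if __name__ == "__main__":
    # Placeholder paths; the command is printed, not executed.
    print(shlex.join(soffice_to_pdf_command("reports/q1.doc", "out")))
    print(expected_pdf_path("reports/q1.doc", "out"))
```

Checking afterwards that the expected output file exists is a cheap
guard against conversions that fail silently.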

For special purposes, I've built even more involved "production
lines" involving intermediate steps with PS, TeX, and other formats
and technologies.