Tuesday, July 06, 2010

In defense of PDF

Can you identify the living fossil?

There seems to be a growing perception, not unlike the controversies surrounding Flash, that Adobe's Portable Document Format is (in the modern world of the Web) a legacy format, something of a living fossil, a technological Coelacanth that refuses to become extinct. Some would go so far as to say PDF doesn't belong on the Web. Some would be wrong, however.

My current employer, Day Software, has (how shall I say this in politically correct form?) a strong prejudice in favor of HTML as the one true and proper Web format for documents. This is reflected in the fact that all of Day's product documentation (like just about everything else Day produces, information-wise) is available online as HTML. There are those at Day, I'm sure, who would like to see PDF disappear from Web sites, if not from planet earth. HTML is a bit of a religion around Day.

And yet, when I joined Day as an employee (two weeks ago), not one document in the half-inch-thick stack of new-employee paperwork that I was asked to fill out was based on an HTML form. Every single document was based on a PDF.

So what is this walking fossil called PDF and why is it still so pervasive?

In the beginning, there was Postscript. Far and away the most successful Adobe technology ever created, Postscript was the first commercially successful page-description language based on vector graphics. It's a Turing-complete language with subroutines, looping, branching, and all the rest. Amazingly, it continues to inhabit printers worldwide. (You probably rely on a Postscript driver to get your printer to work.) It's a brilliant bit of technology, describing, as it does, fonts and shapes and whole pages in resolution-independent terms, in a plain-text (non-binary) interpreted language. Write once, rasterize anywhere.

Since fonts themselves can be described in Postscript, and since Postscript is just text, you'd think PS would be the ideal self-contained document format. Alas, it is not. It's far too verbose, and laborious to render onscreen. (This is what killed Display Postscript.) Still, its inherent portability made Postscript a compelling basis for a document format. So Adobe went on a quest to make Postscript smaller and more amenable to quick screen rendering. They reduced the number of operators (and made their names smaller), and cut out subroutines, and eliminated loops, and did a bunch of other things designed to make Postscript small and screen-friendly, yet without sacrificing portability. The result was PDF.

The first generation of PDF was ASCII-based, and if you looked inside it you basically saw thinly disguised Postscript commands, with loops unrolled and major page elements described as objects. References to objects were maintained in offset tables. Fonts could be embedded, or not. It was still a somewhat verbose format, but at least it could be interpreted and rendered quickly, onscreen or to a printer. (The translation from PDF to Postscript is extremely straightforward.) Adobe came up with a free Reader program and put PDF files out there for anyone who wanted to give them a try. Lo and behold, the format took off.
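To make the "objects plus offset table" idea concrete, here is a minimal sketch in Python that assembles a one-page, plain-ASCII PDF by hand. It is an illustration only, not anything Adobe ships: the function name `build_minimal_pdf`, the `%PDF-1.4` header version, the US Letter page size, and the Helvetica font are all assumptions chosen for the example. The content stream shows the abbreviated PostScript-like operators (`BT`, `Tf`, `Td`, `Tj`, `ET`), and the `xref` section records the byte offset of every object.

```python
def build_minimal_pdf(text: str = "Hello, PDF") -> tuple[bytes, list[int]]:
    """Assemble a one-page PDF; return (file bytes, byte offset of each object)."""
    # Page content: begin text, select font F1 at 24 pt, move to (72, 720),
    # show the string, end text. Parentheses in `text` would need escaping.
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # remember where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # The offset table: one fixed-width, 20-byte entry per object.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out), offsets
```

Writing the returned bytes to a file (for example `open("hello.pdf", "wb").write(build_minimal_pdf()[0])`) and opening the result in a text editor shows the same structure described above: numbered objects, then the table of offsets a reader uses to jump straight to any object.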

Why? Why does the world need something like PDF? The (sad, to some) answer is that there is still a need in the world for electronic documents that mimic paper documents. There are industries (such as insurance) in which the physical size and placement of certain pieces of text are regulated by law, and many forms have to fit on a certain size piece of paper when printed out -- there's zero tolerance for text autowrap variations. Form 1040 from the IRS has to look like Form 1040, every time. (Imagine the pandemonium at the IRS if every tax form that arrived in the mail looked different because it was printed out on a different printer at a different resolution, with text wrapping every which way, all manner of font substitutions, etc.) Like it or not, certain documents have to look a certain way every time, without fail. This is where PDF shines.

Is PDF right for every occasion? No. It's not. No more than HTML is.

Is PDF going to become obsolete? Not any time soon. Not any more than Postscript.

Can/should PDF coexist with markup languages in the world of the Web? I think the answer is yes. Adobe has done a good job of making online PDF forms (for example) REST-friendly and user-friendly. Having to load Reader (or a Reader plug-in for your browser) is a bit of a hassle, but you do get a lot of bang for the buck. The advantages of PDF tend to balance out the disadvantages -- for certain users, in certain situations.

And that's the key. It's not about religion -- it's not about whose document format is inherently better or worse. It's about diversity and choice: choosing the right tool for the job and letting the user (or customer) choose what's right for her. This is the part that the HTML zealots don't get. Some customers want PDF. Some users demand to have documents in a format that looks nice onscreen and prints out nicely (and predictably) on a printer. By not providing those users with that choice, we're (in essence) forcing a technology decision on people. We're forcing our religion on non-converts. And historically, that's always been a dangerous thing to do.

14 comments:

What I as a programmer don't like about PDF is that it's complicated and slow to extract the text from a PDF file. From what I've heard (I haven't tried it myself yet), you basically have to form the letters into words, based on the gaps between them. Extracting text from a PDF file uses a lot of CPU power, more than converting an MP3 file into music (and I thought that was CPU-intensive).
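A rough sketch of the gap-based idea this comment describes, in Python. Everything here is hypothetical: the `Glyph` record, the `group_into_words` function, and the 1.5-point threshold are invented for illustration, and the sketch assumes the glyphs all sit on one line of left-to-right text. Real extractors also have to deal with multiple lines, rotation, kerning, and font encodings.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph, in points
    width: float  # advance width of the glyph, in points


def group_into_words(glyphs, gap_threshold=1.5):
    """Split one line of positioned glyphs into words wherever the
    horizontal gap between neighbours exceeds gap_threshold (assumed value)."""
    words, current, prev_right = [], "", None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev_right is not None and g.x - prev_right > gap_threshold:
            words.append(current)  # gap is wide enough: close the word
            current = ""
        current += g.char
        prev_right = g.x + g.width
    if current:
        words.append(current)
    return words
```

The per-glyph work is a sort and a subtraction, so whatever cost text extraction carries in practice most likely comes from parsing and decoding the file rather than from this grouping step.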

I guess one important difference is that you can easily address a specific part of a PDF document using spoken or written "links" like "third paragraph of page 4". HTML is much better suited for Web addressing using URLs.

A big part of the value of a corpus of content lies in its internal links. So I guess, if your corpus of content lives in paper form, PDF is better. If it lives in Web form, HTML is better.

What I was complaining about in my recent blog post on the subject [1] is people using the paper model to distribute information on the Web - that sounds wrong, or at least suboptimal.

Another thing that people love about PDF, in my opinion, is that they tend to think their paper document is the same as yours if it has the same number of pages and looks the same, so they're more comfortable with a PDF that cannot be reformatted easily. That's an illusion though - I'd much rather sign the SHA digest of a contract (after generating it myself) than its paper form, as the latter is much easier to tamper with.
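For what it's worth, generating such a digest takes only a few lines. The sketch below assumes SHA-256 (the comment only says "SHA") and uses Python's standard `hashlib` module; the function name `contract_digest` is made up for illustration. It computes the digest only and does not sign anything.

```python
import hashlib


def contract_digest(data: bytes) -> str:
    """Return the SHA-256 digest of a document's raw bytes as hex.
    SHA-256 is an assumption; any agreed-upon SHA variant would do."""
    return hashlib.sha256(data).hexdigest()
```

Two parties who each run this over their own copy (for example `contract_digest(open(path, "rb").read())`) can compare the resulting hex strings; a change of even one byte in the file yields a different digest.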

I am in total agreement! From a designer's perspective, HTML just doesn't allow me the same control for layout and use of high resolution elements. As for the comment above about extracting text; I purposely use PDF to protect the text from hacking.

In essence, you're definitely right; PDF is necessary when you need a document to look a certain way when printed. If that document has a form in it, PDF forms are a nice-to-have.

However, content managers get itchy about PDF because plenty of organizations take PDF as a quick way to get all their documents online. Instead of spending a little time entering the documents into the CMS, they hit the PDF button in Word and just upload the document. Before long, instead of a website with links and rich media, you have a very hard to navigate set of file downloads that periodically crash users' computers when Acrobat has trouble starting (which is not infrequently where I work). So, if you see an anti-PDF bias in your colleagues, it may be in response to a "Just PDF it" bias in customers that they feel they need to offset.

This won't actually work. Text extraction isn't easy, but it's not impossible. If you have Acrobat Pro (which "hackers" definitely will) all you have to do is export as text or XML. I think Adobe also has an e-mail service that will convert it for you.

I'm confused by your argument, Kas. You say "PDF has a place on the web" but the reasons you cite all involve print and IRS forms. There is no reason to ever have PDF pages in your website's navigation scheme, and you should never have to rely on in-browser PDF reading, because PDF was not meant for the web. This is why most websites that regularly publish PDFs (scientific journals) include a PDF for download or HTML for inline reading. This is the proper way.

@Anonymous - In my opinion, if you are trying to treat a PDF document as a database record and extract information from it, then there is something else inherently flawed in the process. PDF is a display format only, a way to put data out in a specific layout. If it's being used to transfer data for later extraction then it's being used incorrectly. I understand there are cases where libraries of PDFs need to be imported but again, while this is programmatically a pain, they should never have been used as an information store in the first place. That's just not what it was designed for, and as such I don't see the difficulty as being a shortcoming in the format. Just my 2c.

If I had to put a number on it, I'd say that 95% of the objections to PDF on the web come purely as a result of Adobe's crummy and bloated Reader. If Reader didn't suck so much, PDF wouldn't garner nearly the opposition and hatred that it does.

When PDF was a proprietary technology this was entirely Adobe's fault, but now that it's a more open standard, I think it's time for browser developers to build PDF rendering in. There's no reason why a user should have to load a crummy plug-in or download a file and launch an external application just to read a PDF; they should be rendered (progressively) by the browser.

Safari does this, I believe, and it's supposedly on the menu for Chrome as well. (IE will probably never have it but that's to be expected.) I fully expect that as users begin to adopt browsers that have PDF "baked-in" that opposition to PDF will start to decrease.

HTML is great for a lot of things, but it's much easier to produce a precisely formatted document, particularly if you are a non-technical user coming from a DTP background and not a programming one, with PDF as your target format rather than HTML+CSS. It's not going to go anywhere, and I fully expect that it'll probably get more popular rather than less in the future.

It's been done. Under Linux, the KDE browser, Konqueror, renders PDFs in a tab, or window, transparently. (Firefox can pop a reader, not Adobe's).

Apple's Safari uses WebKit (also used by Google Chrome), which is derived from KDE code, so they could probably do the same. And there was talk of bringing KDE applications to MS Windows - the underlying Qt code is multi-platform.

But what other medium can really capture a highly complex designed document from something like InDesign and deliver that 'view' file to so many (Adobe Acrobat) readers across the world, without those readers needing any additional application software?

It's been around for many years and I think it will be around for quite a few more.

I don't know a lot of the technical details. But I get the feeling that the PDF is going to display and print out in a very predictable way. HTML on the other hand, feels like it might depend on your browser or screen size.

You know, when I read this article, it really pissed me off... This is the only article I have ever seen that has to defend PDF. I have never read any articles saying PDFs are on their way out or are bad for the web. You can google "the death of PDF" or "the end of PDF" and, guess what? You will get zero results... The reason this article pissed me off so badly is that it only focuses on one aspect of PDF. Tell me, what other format could replace it? What other format can you archive or store offline for reading later? What other formats can embed interactive content and JavaScript, and be viewed offline?

I think this article is a complete waste of time to even read, and I think the author wasted his time even writing it...

I find it hard to believe that there are so many people out there talking negatively about PDF... I can't find them!

Tell me this: if there are so many people out there talking badly about PDF that the author had to write this article to defend it, then why was the number one request for improving the iPad a better PDF reader? Guess PDF is doing OK. I don't think you need to defend it.
