Open Government and PDF

The premise of these articles is that the world would be a much better place if all governments would publish all of their information on the web in HTML or XML. You know what else would make the world a much better place? Unicorns.

The issue at hand is not whether governments should pick HTML or PDF. The issue at hand is whether governments are capable of publishing information at all. Show me an HTML creation tool that creates high-quality, standards-conformant markup from a Word document or any of the zillions of editing tools that government employees use. Now add in all the tools used by people who submit documents to the government. And all the versions of those tools released in the last 20 years. Now make sure that the HTML/XML works correctly even when the user doesn’t have the right browser or the right fonts installed. Guess what? There are no such conversion tools available. And who would pay to make sure that everyone had access to these tools and knew how to use them correctly?

Creating PDFs is trivially easy for all these cases, whether you use Acrobat or something built into your OS. It works regardless of what tool was used to create the content in the first place, and it works on pretty much every OS out there and on mobile.

HTML/XML Standards nerds should get a grip. They aren’t even in the running to compete with PDF here.


29 Responses to “Open Government and PDF”

I think these articles go too far; in the words of Sunlight’s update-under-duress, “any time Government decides to release data to the public, we’re glad that government has taken a step forward.” My take is that the government should start by releasing data in whatever form takes the least effort, and then they (and organizations like Sunlight, and random hackers) can go about transforming it.

But it’s frustrating when, as they note, documents *are* stored internally in XML, but are published only in PDF. It’s not that I object to the PDF! But why not release the XML, too? Or if they have an internal database for something, why build a Flash UI for it and not a data API?

Like I say, I think the Sunlight post overstates its case. But I get the author’s frustration.

I didn’t comment on the whole Flash UI thing, because I agree Flash probably isn’t a good choice for government data.

As for the data in XML, it’s true that the government employee could publish the data in XML without doing any extra work. But does that really save him/her time? What about the time it will take to deal with people who aren’t capable of coping with XML? It isn’t just a publishing problem, it’s also a consumption problem.

I should say, I’m not really here to defend the “standards nerds” – which I think is the perspective Daring Fireball is posting from and maybe Ars Technica as well. There are certainly people who go around looking for purity violations to complain about. But I don’t think Clay Johnson is in that camp. He and his employer are actively trying to help the government open up; they’re doing good work, and I think he’s just venting about something that makes that work harder.

Yes, I generally support the work the Sunlight folks are doing. And I’m sorry that the use of PDF means more work for Sunlight, but they aren’t the target consumers I really care about at the end of the day.

Let’s say that we did somehow manage to get all local, state, and federal governments to produce all documents in XML the way Clay wants.

The end result would be that most people would only be able to view government documentation through the website of some sort of “trusted” intermediary who could turn it into something human readable, instead of being able to go directly to the government source. Having to go to Google or Sunlight to view government docs does not make things MORE open imho.

It’s not that I don’t think PDF is a useful tool—and I never said it wasn’t in my piece on Ars—it’s that it belongs in a toolbox that should be filled with a variety of open standards. (And yes, I’m aware that PDF is an ISO standard—but Adobe continues to add proprietary extensions to that standard.)

To take the example of the HR 3200 bill that Sunlight Labs specifically refers to, the PDF was generated from data originally in XML format. Why couldn’t a link to that original data be included IN ADDITION TO a nice looking, paginated PDF?

Yes, Adobe does extend the ISO PDF standard in various ways. But last I heard all of those extensions are being done in a manner compatible with the letter and spirit of the ISO standards process (but I admit I don’t follow it too closely these days). Unlike OOXML, the ISO PDF standards really are controlled by ISO and no astroturfing is being done. Building extensions and then submitting them to ISO for approval isn’t a bad thing, is it?

Wow. Because they use one bad tool now, we should replace it with 2 bad tools (the existing one PLUS PDF)? Now instead of maintaining one document they have to maintain two — the original plus the PDF version.

“The end result would be that most people would only be able to view government documentation through the website of some sort of “trusted” intermediary who could turn it into something human readable”

How is this true? My Word 2007 uses an XML format for its documents already. On my Mac, Pages documents are also XML files already. I can open both without a network connection or going to Google. In fact I can open the raw XML in ANY program capable of processing text (which every OS ships with), not necessarily formatted properly but WAAAY more useful than the PDF version with no PDF reader.

Your basic premise is that XML/HTML shouldn’t be selected because Word can’t produce good HTML docs (the fact that Word 2007 uses XML as its native format would counter the argument it can’t produce XML). I guarantee you that if you mandate clean HTML or a particular XML version Word will be very quickly updated to produce that format. Everybody wants to be able to sell their software to the HUGE government market. I bet excellent converters would be available within months of the format being finalized.

Er, what? Because both Word and Pages use XML as an internal format, XML is somehow a human-readable format? Uh-huh. Try feeding either Word or Pages some random XML file, and see what happens?

XML is not a universal *format*. XML is a universal *language* that can be used to define data formats – but you need to know how to interpret it. Average humans are generally very bad at interpreting raw XML. Programs that use XML generally only know how to interpret the formats they define, and maybe a few other publicly-defined formats; they can’t interpret the language well enough to understand every random format. Heck, Filemaker Pro deals with data in a far broader way than Word or Pages, and imports/exports XML – but try sending it a random XML file without an XSLT stylesheet to translate it, and see how far you get!
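The distinction is easy to demonstrate in code: a generic parser can read any well-formed XML into a tree, but the tree carries no meaning until a program knows the specific format. A minimal sketch in Python (the element names and documents here are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Two well-formed XML documents that both "contain a date" --
# a generic parser reads either one, but cannot know they mean the same thing.
doc_a = '<bill><introduced date="2009-11-02"/></bill>'
doc_b = '<record><meta><filed>2009-11-02</filed></meta></record>'

tree_a = ET.fromstring(doc_a)  # parsing succeeds: XML-the-language is universal
tree_b = ET.fromstring(doc_b)

# But extracting "the date" requires format-specific knowledge,
# written separately for each vocabulary:
date_a = tree_a.find("introduced").get("date")
date_b = tree_b.find("meta/filed").text
print(date_a == date_b)  # True -- only because we wrote per-format code
```

Hand either document to a program that only knows the other vocabulary and the parse still succeeds, but the extraction fails; that is the sense in which XML is a language rather than a format.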

This falls prey to the same fallacy that many open source advocates have – as long as the code is open (document is in XML) everything is somehow OK. It’s only OK for those who have the skill and knowledge to work with the raw data, and that is a tiny tiny TINY fraction of the population. Documents in “XML” are utterly useless to 99% of the public; to be actually useful, they need to be in a defined XML format that most people can read. ODF sort of counts, in that anyone skilled enough to download, install and use Open Office can read it; but a fair number of people simply aren’t. (And I, and probably a fair number of other people, view forcing someone to download and use the entire OO suite just to read government documents to be an unfair imposition; it’s a lot more to download and a lot more work to set up and use than even Acrobat, let alone the simple PDF readers like Preview.)

As for the fantasy that if you mandate it, they will come… that worked real well for ODF, didn’t it? Oh, wait. Microsoft leapt to the call – no, they put on a crash program to develop their own incompatible XML format in OOXML, which last I heard still contains big binary lumps that make it just as hard for other programs to interpret as the old binary Office formats were. *And* started a massive lobbying/astroturfing campaign to both reverse the ODF decision and establish their own format instead, turning it into a nasty political fight. “[If] you mandate clean HTML or a particular XML version Word will be very quickly updated to produce that format”? Uh-huh. Riiiiiiight.

I’m more of an Adobe-loather than an Adobe-lover these days, but I have to admit PDF is the best tool for distributing human-readable documents. I’ve spent my time down in the trenches too; in my case, it was managing the library of Material Safety Data Sheets for all the products our company distributed. I was there when we still got mostly paper copies from the manufacturers, and we had to scan them to get them in digital format; I’ve dealt with manufacturers that sent them in ASCII text or Word documents; I’ve dealt with them in PDF. As a tool for storing them and forwarding them on to customers, who would either print them out or read them on the computer, PDF was by far the best tool, bar none.

I think you miss the general point: PDFs often give the illusion of making a document available electronically without the reality.

There *are* solutions that already exist. Unformatted ASCII would be better than some PDFs I’ve tried to deal with, where you can’t even extract the text. Non-compliant HTML would be better. Wordperfect documents would be better.

I just tried copying the text out of a ballot initiative today, and when I copied it to a text editor, it was gobbledygook.

There have been countless times I’ve gotten “PDFs” that were just stitched-together scans of paper documents (or even directly-created images made right from a word processor).

Plus, there are easy ways to batch strip the metadata off Word docs. And there’s ODF. All available on any system from the last 10 years or so.

Finally, creating PDFs is hardly easy for all users. Ever try explaining the difference between PDFs that have embedded fonts and those that don’t? Microsoft’s built-in PDF software does *not* embed fonts. I’ve gotten plenty of PDFs with messed up text as a result.

I tried giving someone at a publishing house a PDF made with ghostscript and they wouldn’t take it, fearing it would somehow not be compatible with their Microsoft and Adobe-based workflow.

I’ve also seen many PDFs that were supposed to be letter but were instead A4, because they used a free Windows utility with that as a default.

PDFs are a good way to transmit documents for printing out somewhere else. They are a good substitute for printing out volumes of dry text that no one will read. But they don’t have the virtues that a proper electronic document format should have.

I’m not saying PDFs are perfect, quite the contrary. (In fact you can make accessible PDFs quite easily using later versions of Acrobat, but the open source alternatives haven’t caught up here, in no small part because the accessibility stuff is quite difficult to implement.)

I think it’s you who is missing the point. The reality is that producing documents in the format you want is completely beyond the technical abilities of most government employees, as is consuming XML documents. Which format is technically better DOESN’T MATTER. What matters is that the government publish the data in a manner that anyone can read, and that they do so without imposing tremendous costs (both in terms of $$$$, time, and training) on the producer or the non-technical consumer. When you figure out how they can do that for HTML or XML or JSON, let me know. Meanwhile PDF solves the problem today, in a practical way, which is why it is being used by governments in the first place.

As I mentioned in my original post, the problem with just publishing in native format is that there are many different apps out there, and many versions of those apps. There are plenty of bureaucracies out there with people plugging away on WordPerfect. Heck, there are probably folks out there still using WordStar.

The reason people want PDF (and HTML/XML, for that matter) is to make sure that the documents are published in a format that:

Andrew, thanks for the quote on the original article – I missed that in my original reading, and agree that Clay Johnson asking to not use PDF -at all- is overstepping.

So while I do disagree with Clay on that specific point, at the same time I don’t believe that he’s addressing the same issue that you are – while your focus is the ease of creation and basic readability for gov’t documents, Clay’s focus is on increasing the utility of those documents (incorporating into databases, conversion into other formats/media, etc.).

Given that machine-readability is pretty key to Clay’s general goals, publishing workflows focused solely on PDF are clearly not a viable long-term solution, and I agree that the gov’t needs to start pushing some alternatives – otherwise the inertia of the current workflows (PDF and Word) will prevent any progress in this arena.

Because I’m a mischievous SOB, I can’t resist pointing out that PDF is an ISO standard. In fact, it can be argued that PDF’s standards process is quite a bit more open than that of HTML5, since there is no Hixie equivalent. 🙂

Actually the question is “why is government dealing in anything other than plain text” with an appendix of images, if necessary. Word and that group of other tools you allude to are not necessary; 7-bit ASCII plus images on the side do fine.

PDF? meh. I can use it without being locked into Adobe, but most government-created PDF files I’ve read appear to have sacrificed content and meaning in favor of glitter and presentation. I long for the days of xeroxed typewritten pages…

I think you might be missing the point. PDF is a dead end; it is a presentation-focussed universal container format that has a lot of support, but the data is locked in when the PDF is baked. Yes, there are ways to make PDFs accessible and ways to extract information from scanned PDFs, but the broad base means it is a stupid format and should be treated as such. As citizens of a democracy we own ‘our’ government’s data and should be able to easily analyse and manipulate it. As the UK Government MPs’ expenses scandal has recently illustrated, pretending to be ‘open’ by publishing scanned redacted PDFs is not ‘open’ at all, and is an exercise in obfuscation. It isn’t about the standards geeks, it is about whether you want to be able to easily examine and extract this information in future years. HTML is a subset of SGML. PDF is a derivative of PostScript. One approach values information, the other presentation. What do you think is more important to capture in a government archive? What it says or how it looked?

I have nothing against HTML and readily admit that it is technically superior in many aspects as a choice for government data. I would love it if governments chose to publish their documents in multiple formats (PDF, HTML, XML, etc.). But I’m also painfully aware of the needs of the non-technical users (the producers and consumers), and what is being proposed (publishing solely in XML or HTML) is completely and utterly impractical in that context.

Interesting piece. While I’m not invested in either side of the debate I would like to point out that many PDFs are simply page images. These cannot be indexed in search engines and are thus harder to find. You are right that documents should be easy to read and create by humans, but it is equally important that the documents be findable using tools that people already know. This is where PDF struggles, IMHO.

ok, first, the “standards” perspective is a non-starter.
.pdf is a standard, and so is .xml, and they both suck.

but .pdf sucks for a worse reason. .pdf sucks because
— even though it’s easy to put content _into_ a .pdf —
it’s very difficult to scrape it back out in a systematic way.
that’s why .pdf is called “the roach motel format”, because
content can go in, but it cannot come out. and that sucks.
because it’s important that government data can come out.

.xml sucks too. and forget what i said above, because
.xml might even suck worse than .pdf. that’s because
it’s difficult to apply the .xml markup in the first place,
and then it’s also often difficult to scrape it back out…
and even if you _can_ scrape the content back out again,
you have to repeat the difficult step of reapplying .xml
if you want to make the content useful in its next round.

what’s needed is a simple straight-out plain-text format
which can serve as a “master format” that can generate
.html (for the web) _and_ .pdf (for those who prefer it).
(with extra credit if a straight copy from either of the
output formats gives you the same plain-text “master”,
such that you could do infinite “round-tripping” of it.)
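as a toy illustration of that round-tripping idea (this is not z.m.l. or markdown, just a two-rule format invented for the sketch: a line starting with “= ” is a heading, and blank lines separate paragraphs):

```python
import re

def light_to_html(text):
    """Convert a toy light-markup master to HTML.

    Rules (invented for this sketch): a line starting with '= ' is a
    heading; blank-line-separated runs of lines are paragraphs.
    """
    html = []
    for block in text.strip().split("\n\n"):
        block = block.strip()
        if block.startswith("= "):
            html.append("<h1>%s</h1>" % block[2:])
        else:
            html.append("<p>%s</p>" % block.replace("\n", " "))
    return "\n".join(html)

def html_to_light(html):
    """Round-trip: recover the plain-text master from the HTML output."""
    out = []
    for tag, body in re.findall(r"<(h1|p)>(.*?)</\1>", html, re.S):
        out.append("= " + body if tag == "h1" else body)
    return "\n\n".join(out)

master = "= a title\n\nfirst paragraph.\n\nsecond paragraph."
assert html_to_light(light_to_html(master)) == master  # lossless round-trip
```

a real system needs many more rules (lists, emphasis, escaping), but the shape is the same: because the master is plain text, both the converter and its inverse stay small.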

it’s not hard to invent a light-markup format to do this…

i’ve done it myself — something called “z.m.l.”, short for
“zen markup language” — and it works astonishingly well.
(you don’t just get “serviceable” output in .html and .pdf,
you get _powerful_ documents that have lots of features.)

or — ironic, being that you’re riffing off daring fireball —
you could use gruber’s “markdown”, a fairly similar beast.

the funny thing is that — since the format is “plain-text”
in nature — it’s dirt-simple for people to learn and use it,
and surprisingly easy to code apps like authoring-tools…
and, for conversion routines, you don’t have .xslt difficulty.

-bowerbird

p.s. plus end-users can still keep using their old tools,
because all that old software can produce plain-text files.

oh, and by the way, it’s not just the government sector
which is using .pdf in a careless and stupid manner…

the academic world — another place where reusability
should triumph — has been equally and sadly myopic.

the business segment has likewise been thoughtless…

even, believe it or not, the book-scanning operations
like the internet archive haven’t been smart enough to
see the advantages of flexible, powerful light markup.
(amazingly, peter brantley, the director now, often puts
out their position papers in a very sloppy form of .pdf.)

so this failure to adopt an intelligent archival format has
been very broad, reaching across a variety of segments.


I don’t know anyone who’s proposing to publish *only* in XML or HTML; rather, that an open, machine-readable format be the baseline, rather than a closed, impenetrable format like PDF. Centralizing around releasing large data sets in PDF format makes it vastly harder for anyone to access or analyze the data. (For example, I’d rather they released an Excel spreadsheet than a PDF table, even though Excel is proprietary and PDF is ISO. Converting a PDF table into something I can use in any analytical software is generally a manual process.) We shouldn’t have to settle for our government obscuring data from us behind an opaque format.
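The gap is easy to illustrate: consuming the same table from a machine-readable release is a few lines of code, while a PDF table typically needs specialized extraction tools and manual cleanup. A minimal sketch with invented figures, using a CSV release as the machine-readable stand-in:

```python
import csv
import io

# The same table, released in a machine-readable format (CSV here).
# The agencies and budget figures are invented for illustration.
released = "agency,budget\nParks,1200000\nTransit,3400000\n"

rows = list(csv.DictReader(io.StringIO(released)))
total = sum(int(r["budget"]) for r in rows)
print(total)  # 4600000 -- analysis is a one-liner once the data is structured
```

The equivalent starting from a PDF table has no stdlib answer at all: you are into OCR or layout-reconstruction tooling before you can even begin the analysis step above.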
