PDF, A broken Spec!

August 14, 2010

(Or why I can’t parse a PDF)

This post is about the difficulties I ran into when trying to write a PDF parser. It’s my opinion that the PDF specification is broken because it permits the token “endstream” inside a stream!

Summary:

There are 4 ways of deciding the size of a PDF stream:

[1] Scanning for the endstream token
[2] Get the size from the direct /Length entry
[3] Get the indirect /Length using the normal xref
[4] Calculate the size from the starting offsets pointed to by the normal cross-reference

What happens in actual PDF implementations if:

[+] The cross-reference is broken?
[+] The cross-reference points to overlapping objects?
[+] A stream contains the endstream token?
[+] A stream contains some evil endstream/endobj token combination?
[+] All 4 (or more) ways of sizing a PDF stream are present? Should they all be consistent?

And finally, is this file PDF compliant? I bet someone could construct an obfuscation method based on these “issues”.

If you still think this is worth reading, check out the following details, and please comment if you find a bug or have a solution for the problems I state here.

The problem…

A PDF stream must be an indirect object. An indirect object is a PDF object enclosed between the keywords obj and endobj. If the following indirect object happens to be in your PDF:

obj 100 0
123456789
endobj

then any reference of the form “R 100 0” appearing in the PDF will reference the number 123456789. Everything seems clean for indirect numbers and the other basic types like strings, arrays and even dictionaries. The problem arises with PDF streams.

A stream object, like a string object, is a sequence of bytes. A stream shall consist of a dictionary followed by zero or more bytes bracketed between the keywords stream (followed by newline) and endstream.

A stream will look like this…

<< /Length 100 >>
stream
AAAAAA ... AAAAAA
endstream

[1] Scan for the next endstream

First approach: GO UNTIL "endstream" KEYWORD
pros: clean; scan the whole file first, then parse in order.
cons: slow, and broken if endstream appears inside the stream

The first naive approach when parsing PDF stream is to consume the dictionary …

<< /Length 100 >> ----> { "Length": 100 }

… then check if you have a stream keyword and scan until you get an endstream

stream
AAAAAA ... AAAAAA
endstream ----> '''AAAAA ... AAAAA'''

But what if, for some reason, you want to have the string “endstream” inside the PDF stream? Well, something will obviously go wrong. Just try to naive-parse the following stream (which contains the endstream string inside its payload):

Interesting, but that’s not going to fix the problem, because we can also put the endobj keyword inside the binary stream. In fact, we can simulate a complete trailing PDF structure inside the stream. Try to parse this by hand (ignore the /Length for now)…

NOTE: poppler, xpdf, and Adobe parse it correctly despite the embedded “endstream”.

Yeah, right. The only thing that becomes clear here is that we cannot rely “only” on the appearance of the stream, endstream, obj and endobj keywords. We need something else.
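To make the failure concrete, here is a minimal Python sketch of the naive "scan to the next endstream" approach. The function name naive_stream_payload and the sample bytes are made up for illustration; this is not how any particular reader does it.

```python
def naive_stream_payload(data: bytes) -> bytes:
    """Return the bytes between 'stream' + EOL and the FIRST 'endstream'."""
    start = data.index(b"stream") + len(b"stream")
    # Skip the end-of-line marker that follows the 'stream' keyword.
    if data[start:start + 2] == b"\r\n":
        start += 2
    elif data[start:start + 1] == b"\n":
        start += 1
    end = data.index(b"endstream", start)
    return data[start:end]


# Hypothetical stream whose payload happens to contain the keyword.
payload = b"AAAA endstream BBBB"
obj = b"<< /Length 19 >>\nstream\n" + payload + b"\nendstream\n"

naive = naive_stream_payload(obj)
print(naive)  # the scan stops at the embedded keyword, so the payload is truncated
```

The scan has no way to tell the embedded keyword from the real terminator, which is exactly the ambiguity described above.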

[2] The mandatory /Length keyword.

Second approach: GET THE /Length ENTRY
pros: fast and deterministic.
cons: Length could be an indirect object and depend on xref

Each stream object MUST have a /Length keyword in its dictionary to resolve the ambiguities and speed up the scanning process. The /Length value must be a number indicating the number of bytes in the stream. If we know the length, we can “seek” to the end of the stream payload and just check for the existence of the endstream keyword.
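A rough Python sketch of that "seek, then check" idea follows. It assumes a direct /Length that is the last entry of the dictionary, which is a simplification so that one regular expression can stand in for a real dictionary parser; read_stream_by_length is a hypothetical name.

```python
import re


def read_stream_by_length(data: bytes) -> bytes:
    """Read a stream payload using a direct /Length, then verify 'endstream'."""
    # Simplification: /Length is assumed to be the last entry before '>>'.
    m = re.search(rb"/Length\s+(\d+)\s*>>\s*stream(\r\n|\n)", data)
    if m is None:
        raise ValueError("no direct /Length followed by a stream keyword")
    length = int(m.group(1))
    start = m.end()
    end = start + length
    payload = data[start:end]
    # An optional end-of-line may sit between the payload and 'endstream'.
    tail = data[end:end + 11]
    if not tail.lstrip(b"\r\n").startswith(b"endstream"):
        raise ValueError("endstream not found where /Length says it should be")
    return payload


payload = b"AAAA endstream BBBB"
good = b"<< /Length 19 >>\nstream\n" + payload + b"\nendstream\n"
bad = b"<< /Length 3 >>\nstream\n" + payload + b"\nendstream\n"

print(read_stream_by_length(good) == payload)
```

With the correct length the embedded keyword is harmless; with a wrong length (the `bad` sample) the check fails, and the parser has to decide what to do next.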

Caveat 1: What happens when there is no endstream keyword where there is supposed to be one?

Caveat 2: To facilitate the production of PDF files, the spec lets the Length value be an indirect reference to a number. That’s very useful when producing a PDF stream: you can postpone setting the length until you have already put the (potentially compressed) stream of bytes in place, and then write out the size.

[+] Put a reference to a not yet defined length in the dictionary
[+] Put the dictionary
[+] Produce the stream
[+] Set the length in the referenced indirect object
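
The producer steps above can be sketched in Python as follows. The function name and the object numbers 7 and 8 are invented for the example, and it writes objects in the spec's "N G obj" order; it is an illustration of the deferred-length trick, not a complete PDF writer.

```python
import io
import zlib


def write_stream_with_deferred_length(out, stream_num, length_num, raw):
    """Write a stream whose /Length is an indirect reference defined afterwards.

    Returns a dict mapping object number to byte offset.
    """
    offsets = {stream_num: out.tell()}
    # Steps 1 and 2: dictionary with a reference to a not-yet-defined length.
    out.write(b"%d 0 obj\n<< /Length %d 0 R /Filter /FlateDecode >>\nstream\n"
              % (stream_num, length_num))
    # Step 3: produce the (compressed) stream and measure what was written.
    start = out.tell()
    out.write(zlib.compress(raw))
    written = out.tell() - start
    out.write(b"\nendstream\nendobj\n")
    # Step 4: only now define the indirect object holding the length.
    offsets[length_num] = out.tell()
    out.write(b"%d 0 obj\n%d\nendobj\n" % (length_num, written))
    return offsets


buf = io.BytesIO()
offs = write_stream_with_deferred_length(buf, 7, 8, b"A" * 100)
data = buf.getvalue()
print(data[offs[8]:])
```

Note that the length object lands after the stream, so a reader that wants the length has to find object 8 before it can skip object 7.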

So, to parse a stream object we need to get another indirect object. Indirect objects are defined with the obj and endobj keywords. But obj and endobj could appear inside a stream too. Deadlock? Or is there another hidden card in the spec?

[3] The Normal Cross reference.

Third approach: CALCULATE THE SIZES OF THE OBJECTS USING THE XREF
pros: super fast.
cons: overlapping objects

The PDF cross reference is the fastest way to know where a certain indirect PDF object starts! It comes in two flavours: the normal XREF and the stream XREF.

But first we need to find where the XREF is placed. That is done with the help of the startxref keyword. This keyword must appear almost at the end of the file and point to the byte position of the trailer and cross reference. Check out section 7.5.5 of the spec (PDF3200::7.5.5) for more detail. A PDF should end like this.

The spec suggests that conforming readers should read a PDF file from its end. Once you have the cross reference you know where the different indirect objects start. Also if you assume every cross-referenced position points only to one well defined object, you may after some calculation determine the size of every object. This will be the third way of determining a pdf stream length. What happens if this way doesn’t match the others?
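
The "some calculation" can be as simple as sorting the offsets and taking differences. Below is a small Python sketch under exactly the assumption stated above (every offset points to one well-defined object and objects do not overlap); the offsets and the body_end value are made-up numbers.

```python
def sizes_from_offsets(offsets, body_end):
    """Estimate each object's size as the gap to the next cross-referenced offset.

    offsets: dict mapping object number to byte offset.
    body_end: byte offset where the last object must have ended
              (for example, the position of the xref section).
    """
    ordered = sorted(offsets.items(), key=lambda item: item[1])
    sizes = {}
    for i, (num, off) in enumerate(ordered):
        next_off = ordered[i + 1][1] if i + 1 < len(ordered) else body_end
        sizes[num] = next_off - off
    return sizes


# Hypothetical offsets, as they might be read from a classic xref table.
sizes = sizes_from_offsets({1: 15, 2: 74, 3: 120}, body_end=300)
print(sizes)
```

These are only upper bounds: the gap also counts any whitespace, comments or garbage between objects, and the estimate is meaningless if two entries point into the same object.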

[4] The Cross Reference Stream.

There are also cross-reference streams. Cross-reference streams are stream objects, and contain a dictionary and a data stream. Each cross-reference stream contains the information equivalent to the cross-reference table and trailer for one cross-reference section.

The value following the startxref keyword shall be the offset of the cross-reference stream rather than the xref keyword. For files that use cross-reference streams entirely, the keywords xref and trailer shall no longer be used. Therefore, with the exception of the startxref address %%EOF segment and comments, a file may be entirely a sequence of objects.

So there is a way, the modern way, to hold cross references in potentially compressed PDF streams in the middle of the file. How do we parse this PDF stream? We don’t have the cross reference trick for getting the length of this stream. So we could use the buggy scan-to-the-next-endstream way or the /Length way. But is the /Length entry in the cross reference stream indirect? The spec requires some of the entries in the XStream dictionary not to be indirect, but not the /Length. OK, timeout. Head about to explode alert, hurn hurn!!

The Linearized hell.

More research needs to be done on this one. We’ll just quote a bit of the spec on this matter…

“For pedagogical reasons the linearized PDF is considered to be composed from 11 parts…”


23 Responses to “PDF, A broken Spec!”

A related piece of info that I thought was interesting when playing with the PDF specification is the ability to embed objects within streams of other objects (with a valid xref table) or even to have an object span two other objects (again, with a valid xref table.) I was surprised to see Adobe Reader happily accept this. I noticed you’ve hinted at it with the xref section, but I just wanted to make it explicit.

Good post. I agree parsing PDFs is a nightmare and the specification is largely to blame.

I know exactly what you mean. Implementations seem to pay blind attention to the xref. Here, for example, the “correct” xref points to objects inside an unused PDFStream (even in xpdf). Again, I wonder if this is PDF spec compliant.

Note also that when the xref is corrupted, most implementations try to scan/lex for the objects, getting radically different results. As there are chained, filtered XRef Streams, it may be possible to change the parsing mode in the middle of the parsing, maybe by feeding it a wrongly encoded XRefStream (not tested). Hmm.. Well, it’s insane.

First, I’d like to remind you that PDF is NOT owned by Adobe and instead has been an OPEN ISO Standard (32000-1) for over three years. As such, if you have concerns about the format, PLEASE GET INVOLVED! Participation in the ISO committee (TC171SC2WG8) is open to ANYONE and we’re always looking for additional participants. Contact your country’s national standards body.

Second, as you sort of get to (finally) in your post, PDF is a structured binary format based on a cross-reference (either classic or stream format). A “conforming reader” MUST (shall in ISO terms) use that information to find the objects in the file wherever they exist. That defines the objects used to render/process the PDF. Any other method used to find the objects may produce a different result – as you’ve noted – which is why the parsing methodology is mandated in the spec.

Third, however, it is certainly a feature of many parsers to offer a “rebuild” or “recover” mode for damaged/broken PDF files. In those cases, the reader is operating “out of scope” of the standard and may do whatever it wishes to rebuild that PDF.

Finally, I should point out that a few of your examples (especially the first two) are incorrect. You have the order of the obj keyword and the numbering backwards (aka “100 0 obj” and “100 0 R” vs. what you have above).

Great to see more and more people learning about PDF! If there is anything I can do to help you better understand PDF, please don’t hesitate to ask.

1st) PDF is not ADOBE, noted. (Adobe references in post: 1 +stolen logo) And thank you for the invitation, I’ll think about it.

2nd) It’s clear that a conforming reader shall use the xref. (Though I haven’t found it on the spec, which section?)
What is not clear is that:
[-] a PDF can contain garbage between the indirect objects. Is it possible?
[-] Xrefs can point to the middle of other objects. Is it possible?
Here I paste some spec quotes that I think support my point http://pastebin.com/nzQPj0iD

3rd) I agree that what an implementation does when something is broken is out of spec. Nevertheless there is “Implementation limits. C.1”

Finally) You are right! The obj numbering is wrong. I wasn’t paying attention to that, thanks! My idea was to show that xref entries can point to the middle of other (unused) objects.

I leave you another PDF here. Is it ISO compliant? If yes, ISO specifies that a PDF file can be almost completely a random sequence of bytes.

2 – From what I read, the xrefs are just for improving access speed, deleting objects and managing OBJStreams’ internal references.
But in any case you’ll have the last word on this. It may help me understand if you could point me to exactly where the spec contemplates the possibility of random uninterpreted data.
I get this from a quick check…

7.2.1:: At the most fundamental level, a PDF file is a sequence of bytes. These bytes can be grouped into tokens according to the syntax rules described in this sub-clause.

7.5.3:: File Body: The body of a PDF file shall consist of a sequence of indirect objects representing the contents of a document.

Is [‘A’, ‘sklerwthn’, ‘A’, ‘$#@!$%’, ‘A’] a sequence of ‘A’ ?
If NO, either 7.5.3 is wrong or the PDF I showed you is not compliant.
If YES, viva la pepa!

The PDF standard doesn’t talk about “random uninterpreted data” at all – in fact, I don’t think I’ve seen any file format spec that talks about such things. You only focus on the parts that are relevant in handling the format and if there happens to be a way to insert “other stuff” inside w/o impacting the format – so be it.

Consider any number of other formats that are based on “catalogs” or “cross references” – TIFF, PNG, ZIP, etc. In each case, there is nothing that prevents the insertion of random information between the “blocks” – since a proper parser would never see that stuff.

Of course, having that freedom does lead to the possibility of a form of “data hiding” that in certain circles would not be acceptable or could potentially lead to misinterpretation.

That’s one reason why the more restrictive subsets of PDF, especially PDF/A, do not allow any such data in between the objects (though you can, of course, have unreferenced objects).

So the spec doesn’t talk about accepting garbage. It does talk about being “a sequence of indirect objects representing the contents of a document” and “a sequence of bytes that can be grouped into tokens”, BUT it accepts garbage. Plop!

I’m confused. I’m certainly not convinced that garbage between objects is ISO accepted. For that I would need a convincing bit of the spec text.

Also, the spec suggests there shall not be un-xreferenced indirect object: “The table shall contain a one-line entry for each indirect object, specifying the byte offset of that object within the body of the file.” keyword: “for each”

Just to complement, YES I do think PDF has its flaws. Some of them are minor and not very important. But some of them make me want to break into Adobe screaming “you might have released it openly to the world, but it is your baby and you shall pay!!!”

Jokes aside, the Adobe guys are always there to help. Leonard is one of them. Just post on Adobe forums and you’ll see.

Anyway, back to the problems. Ah, just a warning, I’ll write many things straight from my head, so there might be errors here and I hope you guys are kind enough to point them out.

1 – PDFs cannot be sequentially read.

They could be with a slight change on the specs, regard item 2 below

2- The stream Length is allowed to be an indirect reference!

That is the biggest “problem” of all. There is (there should be) just one way to read a stream (especially when you don’t know what it means during the parsing), and it is through its Length. Have you found a stream? Do you understand it? No? You did not implement that filter? Or you don’t wanna interpret the data right now? All right, get the length and skip it. The length will give the number of bytes of the stream. Just skip ahead that amount of bytes.

But some documents use an indirect reference for the stream length, since it is allowed by the current specs. That alone would not be a problem if, and only if, that indirect reference was only allowed to be “defined” somewhere BEFORE the stream object. Although that would totally kill the only purpose I can see of allowing the length to be an indirect reference (which would be to write a stream whose size would only be known after its last byte is written), it would allow us to parse a PDF document sequentially, without worrying about xref tables. So far, that is the only thing that has stopped me from doing it.

I’m not saying parsing a PDF sequentially is the best way to go. Clearly xref tables + /Root is the correct, fastest and least memory-intensive way, but parsing a document sequentially is something I really wanted to be able to do if I wanted. Especially when facing a damaged document.

3 – Having to go to the end of the file in order to find the startxref keyword

That makes it impossible to parse a PDF document without previously knowing its size. If the data is somehow corrupted and has not fully arrived for whatever reason, and you don’t know how many bytes were supposed to have arrived, then there is no way to reliably know that the startxref keyword you found is the correct one. The document might have been updated and the latest startxref might not have arrived with the data you got. That might make a person think he has a perfect copy of a document when in truth all he has is an outdated xref table, which, in the best (or worst – who knows) case scenario, might parse correctly but will render an older version of the document, which might lack a crucial piece of information.
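
A short Python sketch of the scenario, assuming the usual "startxref, offset, %%EOF" ending and a document that was extended by an incremental update. find_startxref is a hypothetical helper, and the byte strings are placeholders rather than real PDF bodies.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the offset given by the LAST startxref found in the data."""
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    m = re.match(rb"startxref\s+(\d+)\s+%%EOF", data[pos:])
    if m is None:
        raise ValueError("startxref is not followed by an offset and %%EOF")
    return int(m.group(1))


# Placeholder content: an original revision plus one appended update.
original = b"%PDF-1.7\n...original body...\nstartxref\n120\n%%EOF\n"
updated = original + b"...appended update...\nstartxref\n480\n%%EOF\n"
truncated = updated[:len(original)]  # the update never arrived

print(find_startxref(updated), find_startxref(truncated))
```

Both calls succeed. Nothing in the truncated data says that a newer cross-reference section was supposed to follow, so the reader happily uses the old one.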

One solution for this, in my opinion, would be having the latest xref offset keyword moved right after the %PDF-x.x marker always. Have you updated the document? Good, rewrite that offset. I mean, you are trying to know if you have a valid PDF in your hands. You read the %PDF-x.x? Good. The “magic value” is OK, now read the xref tables (+ trailers) and you are ready to go. But this whole “go to the end of the data and search backwards” is not, in my opinion, the most reliable way to go. If the startxref was always the very first thing inside the document, the problem would be solved. It would have an offset and all we would have to do was to check if that offset exists. No? Does not exist? “ERROR: invalid PDF document. Data is missing!”

4 – Document Trailer’s /Prev value MUST BE an indirect reference (and so must the XRefStm’s)

What? This is the most bizarre one yet. Must be (shall, in the dreaded ISO terms)? I would understand (but still would complain) if it was allowed to be an indirect reference. But MUST? Why? The whole parsing depends on the xref table (because of streams, see item 2), and now, in order to parse an xref table I might need another one??? I did not see it in the specs, but I truly hope that the indirect reference must at least have already appeared in that very xref table, or at least in a previously read one. Otherwise we are kind of deadlocked.

And I think I am not the only one that finds that strange. I have several PDFs here that simply ignore that requirement and write the offset as a direct object. Maybe they are non-conformant. Maybe they were written using an older spec, but still, that seems to be the logical way to go.

Any thoughts about my complaints? Have I interpreted something wrong? Please share your insights.

Assuming the spec doesn’t allow garbage between the indirect objects (my current understanding of it), you would need a direct /Length OR escaped “endstream” keywords inside streams to parse it the right way. I chose to go against the “endstream” thing (instead of the /Length) because I think this is a tiny bit cleaner, in the sense that you would be able to scan it and then parse it. (The /Length would be faster though.)
If you just do the /Length improvement, you still need to parse the preceding dictionary (check that it is composed of (name, something) pairs, check the somethings) to be able to “lex” the stream data. But YES, we basically agree.🙂

About point 4.. The spec seems to force a scheme (I haven’t checked it thoroughly) in which any objects needed for decompressing the indirectly referenced and needed objects are already cross-referenced. Xrefs are a pain. From my point of view they should just be a complement to enable rapid access (and delete/update) of objects.. not something you need to even read the file.

Ad 4) About the trailer dictionary the reference actually says: “Prev […] must not be an indirect reference […]” So if you just read the “not” correctly, you realize that the implementation and your expectations are correct…

Not sure what are you referring to, but at Table 15 (Entries in the file trailer dictionary). The Prev key entry says
(Present only if the file has more than one cross-reference section; shall be an indirect reference).

My initial idea was not to talk about any implementation, just that the spec is ambiguous (which is pretty much the same as saying it is *wrong* when talking about specs, right?)

So, maybe that error was corrected in the updated specs, since I cannot find it in the specs of PDF 1.7 (from Adobe, didn’t check up the specs from ISO). Table 3.15, p. 108, is about the entries of the cross-reference stream dictionary, and it doesn’t say anything about “shall” or “must” of the Prev key. But the explanations on p. 107 says: “The value of all entries shown in Table 3.15 must be direct objects; indirect references are not permitted.” And Table 3.13, p. 97, talks about the good old trailer dictionary, where it says about the Prev key: “[…] must not be an indirect reference.” So I think it is pretty clear, at least in the current specs.

Summing up, the spec
* allows several possibly differing ways to determine indirect object’s size
* is not clear about the “endstream” string appearing in binary pdf streams
* is not clear about allowing overlapped objects
* is not clear about allowing noise between the objects

7.5.8.1 General
Therefore, with the exception of the startxref address %%EOF segment and comments, a file may be entirely a sequence of objects.

You are right. We can conclude two things:
1. The PDF format is not very practical: you cannot read a PDF file sequentially, you need to consult the xref e.g. to locate objects defining lengths of streams.
2. The ISO translation of the Adobe specification of PDF is buggy. For example table 3.13 in the Adobe spec, p. 97, says: “[…] must not be an indirect reference”, while the translation in table 15 of the ISO specs, p. 43, says: “[…] shall be an indirect reference”, which is just the opposite.
So you just need to follow the Adobe spec, since the ISO spec is meaningless.
My respect for the PDF format has definitely decreased since I’ve learned the technical details…

In my opinion, it is just too dangerous to rely on endstream. By doing that you are actually limiting the stream contents, imposing a new constraint.

The idea behind streams was to provide a way to store anything you want. And we are not talking about raw binary data. You can place anything in there. You could, for instance, place this very blog entry there. The source of this HTML document. But if you create a stream and place it there, without any sort of “filtering” (compression), your parser would fail right at the start, since you use the keyword “endstream” right on the fourth line.

And even if you compressed the data: would you be able to prove that a certain algorithm would never generate a sequence of bytes that happens to be the ASCII codes for “endstream”? Well, one could encode the stream bytes again using those ASCIIHex filters, so you would be sure that only 0-9 and A-F would appear, but that would make the stream twice as big, so that is a big no no (IMHO).
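
The size cost is easy to see with a tiny Python sketch. ascii_hex_encode is a made-up helper that mimics what an ASCIIHex-style encoder produces (two hex digits per byte, then ">" as the end-of-data marker); it is not taken from any PDF library.

```python
def ascii_hex_encode(payload: bytes) -> bytes:
    """Encode bytes as uppercase hex digits followed by the '>' end marker."""
    return payload.hex().upper().encode("ascii") + b">"


payload = b"\x00endstream\xff"
encoded = ascii_hex_encode(payload)

print(encoded)
print(len(payload), len(encoded))
```

The keyword can no longer appear in the encoded bytes, but every payload byte now costs two bytes plus the marker.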

Parsing streams the way you want is just unreliable. Perhaps doing like that would be great for your needs, but it would impose one great constraint on that sequence of bytes which is a stream.

You could use the knowledge about stream-endstream pairs when trying to recover a corrupted document, but even that would be unreliable. If you are lucky and the keywords never show up inside streams, then you’ll probably be able to recover; otherwise you will fail badly.

Ah, just one more thing I’d like to add. When dealing with filters, the specs mention the FlateDecode, which is based on ZLIB/Deflate. But it is not made clear that the stream is encoded (at least the ones of the spec itself are) using ZLIB, and not the raw deflate algorithm.

If you are using the ZLIB library, simply throw the stream at it; OK, no problem, you are good to go. But if you have your own implementation of Inflate (the decompressor of “Deflated” data), it will fail. That is because the stream contains ZLIB headers before the actual deflate data.

Perhaps it would be a great idea having this explicitly written on the specs.
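
For anyone wanting to see the difference, here is a minimal Python sketch using the standard zlib module, where a negative wbits value selects raw deflate without the zlib header and checksum. inflate_either is a hypothetical helper that accepts both forms; it takes no position on which one the spec requires.

```python
import zlib


def inflate_either(data: bytes) -> bytes:
    """Try zlib-wrapped data first, then fall back to raw deflate."""
    try:
        return zlib.decompress(data)       # expects zlib header + Adler-32 trailer
    except zlib.error:
        return zlib.decompress(data, -15)  # raw deflate, no header


original = b"hello world"
wrapped = zlib.compress(original)            # zlib-wrapped deflate
co = zlib.compressobj(6, zlib.DEFLATED, -15)
raw = co.compress(original) + co.flush()     # raw deflate

print(inflate_either(wrapped) == original, inflate_either(raw) == original)
```

Feeding the wrapped form to a raw-only inflater, `zlib.decompress(wrapped, -15)`, raises zlib.error for this sample, which matches the failure described above.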

By the way, I’m still not convinced that allowing cross-reference streams to be compressed was such a great idea. The drawbacks overshadow all benefits I could think of. Especially nowadays, when bandwidth, storage and memory are more than enough to handle almost any real PDF document out there. If you really wanted to save space, compress the whole (raw) document. It will get way smaller.

I agree that having the xref section inside streams has brought a great deal in terms of flexibility. But still, as discussed above, xrefs are essential to parse PDF documents, and allowing them to be compressed at will only imposes even more dependencies on a parser.

IMHO, the only streams that should be allowed to be compressed are external objects (like images). Content? Maybe. But cross-reference data? Never.

Flate streams are raw encoded blocks of data – there are no “headers” included on them. It may be that some readers (such as Adobe’s) can handle it both ways, but the spec is clear – it’s raw encoded data.

Cross-reference streams have been around for almost a decade now (Acrobat 6/PDF 1.5) and the world was VERY different in terms of memory & network bandwidth. Sure, perhaps there are other alternatives we might use today…but this was then.

Compressing the entire document doesn’t allow for web-based streaming (aka linearized PDF) nor does it allow for any type of random access to the document. You’d need to decompress the entire PDF before you could start on it – since the xref is at the end. And then you’d need to keep the entire decompressed version around. Not a really good idea on any level.

Agree with you there, feliam, although, using the stream /Length, I’ve never had problems with EOL chars, since they shouldn’t be included in the stream size.

Hi, Leonard,
– flate streams:
are you sure about that? Have you ever used an in-house inflate implementation, or are you using zlib? Because zlib will handle it just fine, while my own “raw” inflate implementation will fail unless I first check if the stream has zlib headers and handle them accordingly. The solution I found, since I do agree the specs mention the use of deflate and not a zlib-wrapped deflated stream, was to try both ways: first I search for zlib headers, since it is possible to do some integrity checks using them, and then, if they are not found, I treat it as if it was a raw deflated stream. And why am I going through all this trouble? Because the reference PDF I use for tests, the very latest spec itself, encodes the cross-reference stream like that, using zlib headers. I might be wrong though (which happens a lot), so I’d like to hear it from you.

– about compression:
yes, I agree that compressing only the document streams might have its advantages and yes, like many other things, it was appropriate when it was designed. BUT unless the specs were changed to something like I mentioned before (I’m not demanding it or anything like that) or to something like feliam mentioned (with which I simply disagree: streams must allow any sequence of bytes inside them), it simply is not possible to sequentially read a PDF stream (I mentioned that on the Adobe forums). Even linearization would not work. If the linearization dictionary points you to data that has not arrived yet, you are doomed. So, unless the PDF doc is specially prepared for sequential parsing, there is nothing you can do. And, if the PDF was prepared like that, the whole linearization thingy would be meaningless.

By the way, the reason I do not post things like this on the ADOBE forums, where I think a healthy discussion would be easier to follow, is that the ADOBE site does not allow me to log in anymore. Two BIG problems there:

– it knows about 325 different web browsers BUT DOESN’T KNOW OPERA? What? My browser “is not supported”? If there is anything that Opera claims, it is that they follow the W3C standards as closely as possible. Those acid tests showed that over and over. Of course there might be a specific problem I’m not aware of, and I bet the Opera devs would love to hear about it.

– it REQUIRES that I install flash? Ha! I simply refuse to install it, for personal reasons. IMHO, unless you youtube a lot, a thing that I do not do at all, it is simply not worth installing the plugin. All I get is “heavier” ads, instead of those plain gifs or jpegs. And when you are on limited bandwidth, that matters, a lot.

Best regards to all, and, while ADOBE forums keep refusing to accept me, we can keep the discussion here.

By the way, next time we (I) will complain a little about predictors. Mostly PNG ones🙂. No problems with the spec there, but a few “unpredicting” examples would not hurt at all. At least it would not have made me spend hours trying to figure it out.

[…] a pdf, parsing a pdf and also the one discussing the caveats in the actual PDF ISO standard here.. Besides the straight forward natural parsing algorithm the lib also tries a brute force algorithm […]