August 14, 2010

(Or why I can’t parse a PDF)

This post is about the difficulties I ran into when trying to write a PDF parser. It’s my opinion that

PDF specification is broken because it permits the token “endstream” inside a stream!

Summary:

There are 4 ways of deciding the size of a PDF stream:

[+] Scanning for the “endstream” token
[1] Scanning for the endstream token
[2] Get the size from the direct \Length entry
[3] Get the indirect \Length using the normal xref
[4] Calculate the size from the starting marks pointed from the Normal cross-reference

What happens in actual PDF implementations if:

[+] Cross-reference is broken?
[+] Cross-reference point to overlapped objects
[+] Streams contains the endstream token
[+] Streams contains some evil endstream/endobj token combination
[+] If all the 4(or more) ways of parsing a PDF stream are present, should they be all consistent?

And finally, is this file PDF compliant? I bet someone may construct an obfuscation method based in this “issues”.

If you still think this is worth reading check out the following details and please comment if you find bug if you have a solution for the problems I stated here.