Mark Stephens has been working with Java and PDF since 1999 and has diversified into HTML5, SVG and JavaFX.
He also enjoys speaking at conferences and has been a Speaker at user groups, Business of Software, Seybold and JavaOne conferences. He has a very dry sense of humor and an MA in Medieval History for which he has not yet found a practical use.

Why writing a PDF parser is such a ‘challenging’ task (part 234)

July 26, 2011 1 min read

In theory, the PDF file format is specified in precise detail. In practice, you meet all sorts of ‘interesting problems’. The trick is to make your code robust enough to handle them all without making it slow or complex. Here is an interesting example I have been working on today…

In theory, every object starts with a header of the form objectNumber generationNumber obj (the generation number is usually 0) and ends with endobj. Except, of course, object 938, which skips the endobj and is followed immediately by a startxref pointer and a spurious %%EOF (End of File) marker, even though the file carries on after it. So you cannot assume that there will be an endobj marker at the end of each object. Acrobat does not! My code assumed there would always be an endobj, and hung when it was missing.
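As a rough illustration of the more tolerant approach, here is a minimal Java sketch. It is not the code from our parser: the class and method names (`TolerantObjectScanner`, `readObjectBody`) are made up for this example, and it assumes the file has already been decoded into a String (ISO-8859-1 keeps one character per byte). The idea is simply to treat the next object header, a startxref keyword or a %%EOF marker as an end of object when endobj is missing, and to stop at the end of the data rather than loop forever.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; names are not from any real library.
final class TolerantObjectScanner {

    // Anything that can end an object: the expected endobj or, in damaged
    // files, the next object header, a startxref keyword or a %%EOF marker.
    private static final Pattern TERMINATOR = Pattern.compile(
            "\\bendobj\\b|\\b\\d+\\s+\\d+\\s+obj\\b|\\bstartxref\\b|%%EOF");

    /**
     * Returns the trimmed body of the given object, or null if its header
     * cannot be found. Never assumes that endobj is present.
     */
    static String readObjectBody(String data, int objectNumber, int generation) {
        Pattern header = Pattern.compile(
                "\\b" + objectNumber + "\\s+" + generation + "\\s+obj\\b");
        Matcher start = header.matcher(data);
        if (!start.find()) {
            return null; // object header not present in the data
        }
        int bodyStart = start.end();

        // Look for the first terminator after the header; fall back to the
        // end of the data so the scan always finishes.
        Matcher end = TERMINATOR.matcher(data);
        int bodyEnd = end.find(bodyStart) ? end.start() : data.length();

        return data.substring(bodyStart, bodyEnd).trim();
    }
}
```

A real parser also has to cope with stream data, which can legitimately contain any of these byte sequences, so a scan like this only makes sense as a fallback once the normal route (the cross-reference table and the stream's /Length value) has failed.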

Imagine if XML markup behaved like this! And that is why it is ‘challenging’ to write a decent PDF parser…

This post is part of our “Understanding the PDF File Format” series. In each article, we aim to take a specific PDF feature and explain it in simple terms. If you wish to learn more about PDF, we have 13 years’ worth of PDF knowledge and tips, so click here to visit our series index!
