Opaf!

August 23, 2010

It’s an Open PDF Analysis Framework!

A PDF file relies on a complex file structure built from a set of tokens and grammar rules. On top of that, each token may be compressed, encrypted or even obfuscated. Open PDF Analysis Framework will understand, decompress and de-obfuscate these basic PDF elements and present the resulting soup as a clean XML tree (done!). From there the idea is to compile a set of rules that can be used to decide what to keep, what to cut out and ultimately whether it is safe to open the resulting PDF projection (todo!).
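To give a feel for the "de-obfuscate, decompress, dump as XML" step, here is a minimal sketch in Python. It is not OPAF! code: the `decode_name` and `stream_to_xml` helpers and the `<stream>` element layout are made up for illustration. The only PDF facts it leans on are that names may hide characters behind `#xx` hex escapes and that `FlateDecode` streams are zlib-compressed.

```python
import re
import zlib
import xml.etree.ElementTree as ET


def decode_name(raw: bytes) -> str:
    """Undo #xx hex escapes in a PDF name, e.g. b'/Fl#61teDecode'."""
    body = raw[1:] if raw.startswith(b"/") else raw
    decoded = re.sub(rb"#([0-9A-Fa-f]{2})",
                     lambda m: bytes([int(m.group(1), 16)]), body)
    return decoded.decode("latin-1")


def stream_to_xml(filter_name: bytes, data: bytes) -> ET.Element:
    """Build a hypothetical <stream> element holding the decoded payload."""
    name = decode_name(filter_name)
    if name == "FlateDecode":
        # FlateDecode payloads are plain zlib streams.
        data = zlib.decompress(data)
    node = ET.Element("stream", {"filter": name})
    node.text = data.decode("latin-1")
    return node


# A compressed content stream hidden behind an obfuscated filter name.
payload = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET")
element = stream_to_xml(b"/Fl#61teDecode", payload)
print(ET.tostring(element, encoding="unicode"))
```

A real file stacks several filters, adds encryption and nests objects inside object streams, so take this as the idea only.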

It's written in Python using the PLY parser generator. The project page is here and you can get the code from here:

Keep reading for a test run…
Most of the work OPAF! will hide from you is outlined in our earlier posts about scanning a PDF, parsing a PDF and also the one discussing the caveats in the actual PDF ISO standard here. Besides the straightforward natural parsing algorithm, the lib also tries a brute-force algorithm based on just a few tokens. Let’s take a look at what it can already do…
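A brute-force pass of that kind can be pictured roughly as below. This is only a sketch under my own assumptions, not the algorithm the lib actually implements: it ignores the xref table entirely and just hunts for the `N G obj ... endobj` token pairs with a regular expression. The `brute_force_objects` name is hypothetical.

```python
import re

# "N G obj ... endobj" is how the PDF format delimits indirect objects.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def brute_force_objects(data: bytes):
    """Return (object number, generation, raw body) for each match."""
    return [(int(m.group(1)), int(m.group(2)), m.group(3).strip())
            for m in OBJ_RE.finditer(data)]


sample = (b"%PDF-1.4\n"
          b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
          b"garbage that a strict parser would choke on\n"
          b"2 0 obj\n<< /Type /Pages /Count 0 >>\nendobj\n")

found = brute_force_objects(sample)
for number, generation, body in found:
    print(number, generation, body)
```

Note that an `endobj` sitting inside a stream or a string will fool a scan this naive, which is one reason to try the grammar-driven parser first.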

Well, you first need a shady PDF like this one. This is not some alien PDF and there’s nothing really malicious about it. It even looks plain…

… but if you try to open it with a text/hex editor it stops being so friendly…

Here is where you get to try the OPAF! thing. Get the code and the PDF, sort out the dependencies and run it like this…

python opaf.py textg.pdf

It will generate a graph like the following for your amusement…

That shows the minimalistic logical structure of this PDF. Note that you may get really big graphs here with other PDF samples. I have tried up to 3k nodes. That’s fun! But sadly not very useful. But that’s not all! It also gets you an XML representation of the PDF. This XML will look like this…

After this step, well, you pretty much bring every known XML technology into the game, XPath being the most notable one when searching for specific things. In the project (the small, young, unfinished, work-in-progress-flagged, not really well coded project) there are some examples of what you can do once you have the PDF in its XML form. Use it, ignore it, patch it (lots of basic things still to be done). It’s open source!!! f/
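As a taste of the XPath part, here is a small sketch using the limited XPath support in Python's standard `xml.etree.ElementTree`. The XML layout below (`<pdf>`, `<object>`, `<entry key=...>` and friends) is invented for the example and is not the format OPAF! emits, so adapt the paths to the real output.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML projection of a tiny PDF (layout assumed, not OPAF!'s).
XML = """
<pdf>
  <object id="1" generation="0">
    <dictionary>
      <entry key="Type"><name>Catalog</name></entry>
      <entry key="OpenAction"><ref id="3" generation="0"/></entry>
    </dictionary>
  </object>
  <object id="3" generation="0">
    <dictionary>
      <entry key="S"><name>JavaScript</name></entry>
    </dictionary>
  </object>
</pdf>
"""

root = ET.fromstring(XML)
objects = root.findall(".//object")

# Objects whose dictionary carries an /OpenAction entry.
auto_run = [obj.get("id") for obj in objects
            if obj.find("./dictionary/entry[@key='OpenAction']") is not None]

# Objects that declare a JavaScript action.
scripts = [obj.get("id") for obj in objects
           if any(n.text == "JavaScript"
                  for n in obj.findall(".//entry[@key='S']/name"))]

print("auto-run objects:", auto_run)
print("script objects:", scripts)
```

With a full XPath engine the same questions fit in one expression each, and a "what to keep, what to cut" rule could be little more than a list of such queries.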

–update–
Made a snapshot for you; download it here. Also in the news, the main tool now accepts some basic arguments…

It just points to the reference, so if there is a loop it’s not really a problem.
All this is at an early stage and I’m not much of an XML guy, but hopefully the framework will have an XML schema/DTD and all. We’ll see…