PDF file check question

Hi,
I was just wondering: if someone asked you to make sure that a large PDF file is clean of any evil things, how would you do it?
(Assume the file has passed a virus scanner and legitimately contains some JS content, so scanning the source for the presence of JS is not enough.)
This is a curiosity/non-urgent question for those with time on their hands to share their most secret white hat ways.

Sin-cerely,
Trol (trolling the Forum since OMG ago)

There's a GPO template for AD that disables JS on all clients of AD. So far, I haven't found a single PDF that's actually needed JS.

A third party security audit is the IT equivalent of a colonoscopy. It's long, intrusive, very uncomfortable, and when it's done, you'll have seen things you really didn't want to see, and you'll never forget that you've had one.

Nice. After reading this I looked it up and learned a cool new thing, thank you.

But that protects you from a bad PDF; it doesn't tell you IF the PDF was bad.
Also, if you then forward such a PDF into the wild, it could contaminate others.

Soooo, all you great white hats with too much time on your hands: how do you (if you do) ensure that a PDF does not contain a cool new exploit?

Sin-cerely,
Trol

I know this may not be the answer you want to hear, but my personal opinion is that someone else's network, and whether it is vulnerable to an exploit, is not really my concern. If it's a 0-day, it's a 0-day; chances are that nothing I have available is going to detect it. Eventually it will be detected as updated definitions are deployed and the network is scanned during its normal cycle.

The only real solution would be to quarantine all attachments until definitions are available to scan them. That is of course disruptive to business workflow, so it's not a really good solution. You could always manually look at every PDF that comes in, if you have nothing else to do with your time. I really don't have the time to do that myself.

JS is not the only "bad" thing in a PDF. The most dangerous ones out there actually exploit the reader, and you can't make sure a file is clean unless you open it in a hex editor... and even then, if it's large, you're probably screwed. So no, you can't fully protect yourself. UPDATE your reader software and hope for the best!!!!

I use the tools pdfid and pdf-parser from here. In the past I have also used pdftk, but I'm finding that less useful recently.

The process:

Use pdfid to analyse the pdf document. pdfid can tell you if a pdf has Javascript included as well as autorun functionality and how many pages it has. A one page document with Javascript and autorun functionality is suspicious.
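To make the triage step concrete, here is a rough Python sketch of the kind of keyword counting described above. It is not pdfid itself, only a simplified approximation: the keyword list, the `count_keywords` and `looks_suspicious` names, and the "one page plus JS plus autorun" rule are assumptions made for illustration. It only sees names that appear in the raw file, so anything hidden inside compressed object streams will be missed.

```python
import re

# Assumed subset of the PDF names that a pdfid-style triage counts.
KEYWORDS = [b"/Page", b"/JS", b"/JavaScript", b"/OpenAction", b"/AA"]


def decode_name_escapes(data: bytes) -> bytes:
    """Undo #xx hex escapes in PDF names, e.g. /J#61vaScript -> /JavaScript."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def count_keywords(data: bytes) -> dict:
    """Count occurrences of each keyword in the raw bytes of a PDF."""
    data = decode_name_escapes(data)
    counts = {}
    for kw in KEYWORDS:
        # The lookahead stops /Page from also matching /Pages.
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        counts[kw.decode()] = len(re.findall(pattern, data))
    return counts


def looks_suspicious(counts: dict) -> bool:
    """Flag a document of at most one page that has both JS and autorun entries."""
    has_js = counts["/JS"] + counts["/JavaScript"] > 0
    has_autorun = counts["/OpenAction"] + counts["/AA"] > 0
    return counts["/Page"] <= 1 and has_js and has_autorun
```

Typical use would be `looks_suspicious(count_keywords(open(path, "rb").read()))`. A hit is only a reason to look closer, not proof of anything.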

If Javascript is present, extract it from the document to determine its purpose. Sometimes the Javascript is included in plain text, in which case you can just use the strings utility to extract it. Otherwise, you can use pdf-parser to extract certain types of encoded Javascript.
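As a rough illustration of the extraction step, the Python sketch below pulls printable strings out of a file (similar in spirit to the strings utility) and tries to inflate stream bodies with zlib. It does not use pdf-parser. It assumes the streams are either plain or Flate-compressed, ignores other filters and filter chains, and the function names `inflate_streams` and `printable_strings` are made up for this example.

```python
import re
import zlib

# Body of every "stream ... endstream" section, taken from the raw file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data: bytes) -> list:
    """Return each stream body, zlib-inflated when that works, raw otherwise."""
    bodies = []
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            bodies.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            bodies.append(raw)
    return bodies


def printable_strings(blob: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII of at least min_len characters."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [s.decode("ascii") for s in re.findall(pattern, blob)]
```

Running `printable_strings` over each result of `inflate_streams` gives a quick look at any script text that was hidden behind compression.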

Malicious Javascript often contains obfuscation to disguise its true purpose. To remove this obfuscation I modify the script a little to allow easier debugging (e.g. assign the code from eval statements to a variable instead) and use the Rhino Javascript debugger to show me how the code is transformed as it runs.
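The "assign the code from eval to a variable" trick can be sketched as a small Python rewrite pass over the extracted script. This is only an illustration, not the exact modification described above: the `__capture` helper, the `__captured` array, and the regex are all hypothetical, and a plain text substitution like this will miss indirect calls such as `window.eval(...)` or aliased references to eval.

```python
import re

# JavaScript prelude (hypothetical helper names): collect code instead of running it.
CAPTURE_PRELUDE = (
    "var __captured = [];\n"
    "function __capture(code) { __captured.push(String(code)); }\n"
)

# Direct eval( calls only; skips member calls like obj.eval( and names like myeval(.
EVAL_CALL = re.compile(r"(?<![\w.$])eval\s*\(")


def defang_evals(js_source: str) -> str:
    """Rewrite direct eval(...) calls so the argument is stored, not executed."""
    return CAPTURE_PRELUDE + EVAL_CALL.sub("__capture(", js_source)
```

The rewritten script can then be stepped through in a debugger, with `__captured` inspected after each layer to see what the code would have executed.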

Many of the JavaScript-based PDF exploits involve buffer overflows, and the shellcode is often in Unicode format. I have a Perl script that I wrote to convert this type of shellcode to a C program (really just C-style shellcode with some wrapper code), which can then be compiled and analysed further using standard binary analysis techniques. I can post the script if anyone wants it.
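For readers who want to see the shape of that conversion, here is a minimal Python sketch. It is not the Perl script mentioned above. It assumes the shellcode is written as JavaScript `%uXXXX` and `%XX` escapes (the form normally passed to `unescape`), that each `%uXXXX` unit becomes two bytes in little-endian order as it would sit in memory on x86, and that a wrapper which only holds the bytes (and never runs them) is enough for static analysis. The function names are made up for the example.

```python
import re

ESCAPE_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def unescape_to_bytes(js_shellcode: str) -> bytes:
    """Convert %uXXXX / %XX escapes to raw bytes; other characters are ignored."""
    out = bytearray()
    for match in ESCAPE_RE.finditer(js_shellcode):
        if match.group(1):
            # One UTF-16 code unit, stored low byte first.
            out += int(match.group(1), 16).to_bytes(2, "little")
        else:
            out.append(int(match.group(2), 16))
    return bytes(out)


def to_c_program(shellcode: bytes) -> str:
    """Emit a C source file that holds the bytes in an array without executing them."""
    rows = []
    for i in range(0, len(shellcode), 12):
        chunk = shellcode[i:i + 12]
        rows.append('    "' + "".join("\\x%02x" % b for b in chunk) + '"')
    body = "\n".join(rows) if rows else '    ""'
    return (
        "unsigned char shellcode[] =\n" + body + ";\n\n"
        "int main(void)\n{\n"
        "    /* The bytes are never executed; compile and disassemble instead. */\n"
        "    return 0;\n"
        "}\n"
    )
```

Usage would be along the lines of `print(to_c_program(unescape_to_bytes("%u9090%ucc90")))`, redirecting the output to a `.c` file and inspecting the compiled object with a disassembler.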

I will note that PDF exploits are possible without Javascript, but in practice most of the ones out in the wild seem to use it. Certainly the ones I have seen have it.

Capitalisation is important. It's the difference between "Helping your brother Jack off a horse" and "Helping your brother jack off a horse".

Thank you, Lupin, for the reply. Between your information and that from streaker (my thanks to xorred as well), my escapade into PDF documents might end up being successful (since the goal/subject was picked for fun, I also get to define success, which is handy).

I knocked this together in the middle of an incident and haven't had a chance to tidy it up, so be warned: it's pretty rough. You basically just run it at the command line with the JS shellcode as a parameter, and it spits out a C program that you can compile.