Sunday, January 10, 2010

On a number of occasions I have had to analyse PDF documents to determine whether they were malicious, and in this post I am going to share the process I follow in performing this analysis.

For demonstration purposes, I will generate an example malicious PDF document using Metasploit, featuring the "use-after-free" media.newPlayer vulnerability. This is the very same exploit that became public knowledge on 15 December 2009, and we are still waiting for Adobe to release a patch, which is due 12 January 2010.

I will use a download exec payload for the PDF that tries to download an executable from a given URL then execute it. This payload was deliberately chosen to make this PDF similar to the malicious examples found in the wild. These use the malicious PDF, usually as part of a drive by download, as the first stage in an attack whose eventual aim is to install malware on the victim system.

I will be using a URL of hxxp://www.badhacker.com/evil.exe as the download and run target for my malicious PDF (which I deliberately broke just there by using hxxp instead of http so you don't click on it). At the time of writing a request to this URL will result in a 404 error.

Now I have chosen this particular URL for two reasons:

For purposes of illustration I wanted something that stands out and whose nefarious purpose is clear from the name of the URL.

None of my machines that this bad PDF gets anywhere near will be able to get far enough to actually access this URL (either because they are not Windows based or because they will be sandboxed from the Internet).

YOU may want to replace this URL with some other, definitely not dangerous URL like http://127.0.0.1/doesnotexist.blah if you are not as confident in your ability to keep things contained. Just remember that if you are doing this on a real malicious PDF document you need to be very careful. This is one of the reasons I do most of my analysis work of various Windows based evil stuff on Linux.

Requirements

Now, if you want to follow along with me, you will need the following software installed on your system:

Metasploit. Make sure you update so you have the adobe_media_newplayer exploit to create the example bad PDF.

Python interpreter.

Perl interpreter.

pdfid.py and pdf-parser.py. Get them from from Didier Stevens PDF Tools page.

pdftk. Use 'apt-get install pdftk' to install on Debian/Ubuntu/BackTrack 4, or grab the install from here for other systems.

Wget. You could also use some other equivalent HTTP downloading tool to transfer the bad PDF from the Metasploit web server. I don't recommend doing this using a web browser, especially if you are running a vulnerable system.

A Windows C compiler. I use the MinGW compiler which you can install and run on Linux like so.

unishellcodetoc.pl. Download it from my own little repository of small security tools here.

For this particular PDF document, we don't need to remove Javascript obfuscation, but at the end of the post I will briefly go over how this can be done using the following tools. You will definitely need something like the following in your arsenal if you intend to be analysing malicious PDF files on a regular basis.

A Java Interpreter. Like the Sun Java Runtime Engine, which you probably already have on your machine.

If you are running on Windows I believe there is an equivalent for the file program provided with CygWin or you could also try the TriD software to confirm the file you downloaded was a PDF file.

You can also open the file in a Hex editor or in Notepad, or use the strings utility and look for the identifying characters shown in the command output below at the start of the file.

%PDF-1.5

Now that we have the PDF file, we can commence analysing it.

Is this a malicious PDF?

Before we really dive into the guts of the PDF, its a good idea to first do a quick high level analysis of the file to see if it meets the general characteristics of a malicious PDF. Based on the results, we can then decide if the file merits further attention.

So what are some of the characteristics that indicate that a PDF may be malicious?

Its only one page long. A malicious PDF usually has what it needs to attack your machine as soon as you open it, and it doesn't need to include a lot of text content to keep your attention. Now there is nothing to say that a malicious PDF HAS to be only one page long, but many of them are.

It contains Javascript. Most of the malicious PDFs around exploit your system by taking advantage of vulnerabilities in PDF reader programs such as Adobe Acrobat Reader. And to actually trigger these exploits, the use of JavaScript is often required. While there have been workable PDF exploits that don't require Javascript, many of the PDF exploits in the wild seem to use it. Now there are (apparently) some legitimate uses for Javascript in PDF documents, so the mere presence of Javascript in a PDF doesn't definitely mean its malicious, however it should be enough to raise your suspicions.

The PDF document has an automatic launch action. This is used in malicious PDF documents to run the javascript code automatically when the document is opened.

If you see a PDF will all three of these characteristics, then it is probably deserving of further analysis. Keep in mind however, that at a minimum, only the automatic open action and Javascript will be required for most exploits to function.

The other things that will point to a PDF being malicious will be contextual. The most obvious thing to consider is how was the PDF obtained? Was it sent as an attachment to an email that demonstrates that the author had only the most basic grasp of the English language? Was it attached to an email from someone you don't know, or was it perhaps attached to an email from someone you do know but the email seemed oddly out of character? Did the PDF get downloaded in the background from some third party web site while you were browsing the web? All of these delivery methods should earn the PDF in question a very close examination.

While you will have to examine emails and Internet access records to pick up on some of the contextual queues, to check the parameters of the PDF document itself, we can use the pdfid.py Python program by Didier Stevens.

In this output /Page shows the number of pages in the document (1), /JS or /Javascript indicate the presence of Javascript in the document (yes there is Javascript) and /AA or /OpenAction determine whether there is an automatic open action (there is one).

We have hit the malicious pdf trifecta here, and if I didn't already know that the PDF had been generated by Metasploit at this point I would be very suspicious.

Wheres the Javascript?

Now that we have decided that this PDF is worthy of further attention, the first thing we want to do is get at that Javascript that we detected in the document to see what it does.

Javascript can be stored in a PDF document as plain text, so the first thing we can try is to extract it using the strings utility.

There's nothing that looks like JavaScript there. That's good, we didn't want this to be too easy.

The fact that no JavaScript appears in clear text in the document means that it has been encoded within the document, probably in an attempt to make analysis of the document more difficult. Remember we do know that JavaScript does exist in the document, because pdfid told us Javascript was present. We just need to get hold of it.

There are two methods we can use to remove this encoding and get at the actual Javascript. The first method I usually use is to run the PDF through the filter option in pdf-parser.py, again by malicious PDF expert Didier Stevens.

user@bt4pf:~$ pdf-parser.py -f evil.pdf
PDF Comment '%PDF-1.5\r\n'

PDF Comment '%\xdc\xe8\xbf\x82\r\n'

...

PDF Comment '%%EOF\r\n'

(Output truncated for readability)

OK there is nothing that looks like JavaScript here either.

Now lets try using pdftk. We can use this to attempt to remove all encoding from evil.pdf and write it to a new file evil2.pdf.

user@bt4pf:~$ pdftk evil.pdf output evil2.pdf uncompress

Now we can try using the strings utility on uncompressed file evil2.pdf:

Now that we have access to the decoded JavaScript we can attempt to determine definitively whether this is a bad PDF file.

Now if you are at all familiar with what the code of a software exploit looks like (check out my exploit tutorials if you aren't) then this Javascript above should be looking pretty suspicious to you right now.

Looking at all of the instructions used in the script, we see a reference to one method ".media.newPlayer" that appears at the end of the script after a number of other commands that appear to be filling memory with large numbers of binary characters.

If we Google for ".media.newPlayer exploit" we then find a number of articles relating to the Adobe Reader media.newplayer user after free exploit.

So we have determined that our PDF file most likely includes an exploit. But what does it do exactly?

OK that's what it is. But what does it do?

Many software exploits rely on inserting machine language code referred to as shellcode into memory to act as the payload to be run when the exploit has control of CPU execution. This exploit is no different.

Many different examples of Windows shellcode tend to weigh in at around the 300 to 400 byte mark, and if you look at the various buffers of characters that are fed into memory in this script you will notice that the first variable appears to be around this size.

Now when you want to analyse shellcode you can actually compile it into a executable using a c compiler by interpreting it as raw machine language instructions. However this particular shellcode is actually listed in Unicode format, so we need to convert it to a normal single byte "c style" format first.

To this end I have written a simple perl script that will do the conversion for you and spit out a c program that you can compile for analysis. Lets run this and write the file to code.c

Note: The line above has been line wrapped automatically for display purposes in this blog. Make sure when you enter the command everything goes on the same line to ensure it works.

Now we compile the file into a Windows executable.

user@bt4pf:~$ wine .wine/drive_c/MinGW/bin/gcc.exe code.c -o code.exe

We now have our shellcode in an executable file "code.exe" that we can analyse. The most basic type of static analysis we can do on an executable is to search for hard coded strings within it. This can reveal, for example, any URLs that the executable might try and contact when it is run.

Lets do this now, having a look to see if any URLs are within the file.

user@bt4pf:~$ strings code.exe | grep http

No output. This is strange because we specifically create the PDF using Metasploit to download and execute a file from a URL.

There is one of two reasons why we would be seeing this behaviour:

That block of characters we selected and compiled into our executable wasn't shellcode, or

The shellcode is encoded.

While strings analysis is the simplest executable analysis technique available, it is also the easiest for an attacker to hide from, and various encoding methods for both shellcode and for normal executables can be used to foil this analysis method.

To determine if this is actually our shellcode, we will need to turn to some dynamic methods of analysis.

How badly do you want to know what this does?

The methods that I have used in the past to analyse binary files are as follows:

Each of these method have positive and negative factors that make them suitable for particular situations. Before we proceed further its probably worth examining exactly what we want to get out of proceeding further, given that by this stage it definitely appears that this PDF file is malicious.

When I am analysing malicious PDF files I am usually doing it as part of an incident response activity, and the questions I want answered are "Did this infect the target machine?" and "How can I stop this particular exploit from happening again?". You may have different requirements which might affect the method you choose.

In my case, a general summary of the shellcodes purpose is all I need, so I will usually submit samples to Anubis, ThreatExpert and CWSandBox. This involves a minimum of work for me, and I usually get a usable answer back within minutes.

Or at least I can for CWSandbox and ThreatExpert. Anubis can take considerably longer depending on queue length.

Analysis using a debugger, or dynamic analysis tools and a fake Internet involves a lot more equipment and time, but you can sometimes get a lot more information because you can tailor the work done to your own requirements. These methods also have the drawback of having to actually run malicious code on your own machines however, so you have to be very careful in how you manage this process to prevent an infection from spreading.

Zerowine is (in theory) a good middle ground, however I have never had the best of luck with using it to analyse these compiled shellcode executables. Your mileage may differ.

Lets have a look at the results I got from CWSandBox.

Of particular interest is this section on Network Activity, which shows an attempt to download evil.exe from www.badhacker.com. Of course this file didn't actually exist, so we don't see much more of use in this report, however knowing the URL that is used gives me enough information to take some action, by blocking access to that particular web site.

So that's the analysis of this PDF completed.

Obfuscated JavaScript

In the requirements section of the document I mentioned the Rhino JavaScript implementation as a useful tool for analysing malicious PDF files. Now while this particular example did not make use of it, a number of other malicious PDFs I have seen have made use of JavaScript obfuscation techniques to hide the true purpose of the embedded scripts. Being able to remove this obfuscation is essential in being able to confirm that a PDF is malicious, and considering that obfuscated Javascript also appears in a number of web based exploits, Javascript "defuscation" is a valuable skill to have.

Now the way to remove the obfuscation in Javascript is to reorganise the code slightly so that the final instructions are displayed to screen instead of executed and then to run this new code in a JavaScript interpreter.

To remove obfuscation you essentially modify the JavaScript to ensure the routines within the code that remove obfuscation assign the unobscured code to a variable instead of running it. You then run this code inside the debugger, stepping through the code until the variable is assigned, then you can display the value of the variable by entering its name in the Evaluate section of the debugger in the bottom right hand pane. This value can then be copied to the clipboard and then written to a new file, and it can then be run in the debugger once more if a second layer of obfuscation needs to be removed.

If there is enough interest, I can demonstrate an example of how to go about this process of removing Javascript obfuscation in a future post.

References

Didier Stevens has a list of malicious PDF obfuscation methods here, for those interested in the topic.

5 comments:

hi,need help to analyse the util.printf() format string bug.I am using the following code in pdf javascript.pdf if crashing but it's exiting in ollydbg.Can u tell me how i sould place breakpoint in code by using something like "CCCCC"

Its hard for me to do that without analysing that vulnerability in a debugger myself. Id suggest you try and reverse engineer an existing public exploit for that vulnerability, like the Metasploit one.