PDF Stream Filter – Part 1

One of the challenges in analyzing malicious PDF documents is stream filtering. Malicious content in a PDF file is usually compressed or encoded with stream filters, which makes analysis a bit more complicated.

In a PDF document, a stream object consists of a stream dictionary, the stream keyword, a sequence of bytes, and the endstream keyword. Malicious content inside a PDF file typically resides between the stream and endstream keywords, and it is usually compressed or encoded with a filter, such as:

ASCII85Decode

ASCIIHexDecode

FlateDecode

JBIG2Decode

LZWDecode

RunLengthDecode

and others
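To make the stream object layout described above concrete, here is a minimal Python sketch that splits a raw object into its dictionary and its byte sequence. The sample object, the regular expression, and the name split_stream_object are my own illustration, not taken from any particular tool. It also assumes the stream is well formed and simply searches for the endstream keyword, whereas a stricter parser would honor /Length.

```python
import re

# Hypothetical object, shaped like the stream object described above.
SAMPLE_OBJECT = (
    b"5 0 obj\n"
    b"<< /Length 11 /Filter /ASCIIHexDecode >>\n"
    b"stream\n"
    b"48656c6c6f>\n"
    b"endstream\n"
    b"endobj\n"
)

# Dictionary, then the stream keyword, then the bytes up to endstream.
# Assumption: we trust the endstream keyword instead of reading /Length bytes.
STREAM_RE = re.compile(
    rb"(?P<dictionary><<.*?>>)\s*stream\r?\n(?P<data>.*?)\r?\nendstream",
    re.DOTALL,
)


def split_stream_object(raw):
    """Return (dictionary, data) for the first stream found, or None."""
    match = STREAM_RE.search(raw)
    if match is None:
        return None
    return match.group("dictionary"), match.group("data")


dictionary, data = split_stream_object(SAMPLE_OBJECT)
print(dictionary)
print(data)
```

The data part is what still has to be passed through the filter named in the dictionary before it can be read.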

Basically, there are two techniques used in stream filtering: single filtering and cascaded filtering. Single filtering means that just one compression scheme is applied to the stream, while cascaded filtering means that more than one compression scheme is applied to the stream, one after another.
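One way to tell the two cases apart is to read the /Filter entry of the stream dictionary: a single name means single filtering, an array of names means cascaded filtering. The sketch below is a simplified illustration with hypothetical names (filter_chain, FILTER_RE). It assumes the filter names are written literally, so names obfuscated with #xx hex escapes, which malicious PDFs sometimes use, would need to be normalized first.

```python
import re

# A PDF name: a slash followed by regular (non-delimiter) characters.
NAME = rb"/[^\s/<>\[\]()]+"

# /Filter is followed either by one name or by an array of names.
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|" + NAME + rb")")


def filter_chain(dictionary):
    """Return the filter names of a stream dictionary, in listed order."""
    match = FILTER_RE.search(dictionary)
    if match is None:
        return []
    names = re.findall(NAME, match.group(1))
    return [name[1:].decode("ascii") for name in names]


# Hypothetical dictionaries for illustration.
single = b"<< /Length 120 /Filter /FlateDecode >>"
cascaded = b"<< /Length 4387 /Filter [/FlateDecode /ASCIIHexDecode] >>"

print(filter_chain(single))
print(filter_chain(cascaded))
```

A result with one entry is single filtering; more than one entry is cascaded filtering.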

The most common compression schemes used are FlateDecode, ASCIIHexDecode, and ASCII85Decode. However, some of the latest malicious PDF samples show a trend toward other compression schemes such as JBIG2Decode, LZWDecode, and RunLengthDecode. This is because most PDF analysis tools (at least at the time of this writing) do not yet have features to decompress those schemes.
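When a tool lacks support for one of these filters, the simpler ones can be decoded by hand. As an illustration, here is a small sketch of a RunLengthDecode decoder, following the encoding as defined in the PDF specification: a length byte of 0 to 127 is followed by that many plus one literal bytes, a length byte of 129 to 255 means the next byte is repeated 257 minus length times, and 128 marks the end of data. The function name is my own, and the sketch does no error handling for truncated input.

```python
def run_length_decode(data):
    """Decode a RunLengthDecode stream (sketch, no error handling)."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:
            # End-of-data marker.
            break
        if length < 128:
            # Literal run: copy the next length + 1 bytes unchanged.
            out += data[i:i + length + 1]
            i += length + 1
        else:
            # Repeated run: the next byte occurs 257 - length times.
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


# Hypothetical input: literal "abc", then "x" repeated 3 times, then EOD.
sample = bytes([2]) + b"abc" + bytes([254]) + b"x" + bytes([128])
print(run_length_decode(sample))
```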

From the above screenshot, we can see the components of the stream object that I mentioned earlier. By looking at the stream dictionary, we can identify the length (/Length) of the byte sequence in the stream, which is 4387, and the compression schemes used (from /Filter), which are FlateDecode and ASCIIHexDecode.

Decompressing a stream with single filtering is straightforward since we only need to reverse one compression scheme. Cascaded filtering, on the other hand, needs multiple decompression operations. If you look at the screenshot above, you’ll notice that the malicious content was encoded with ASCIIHexDecode and then compressed with FlateDecode. Therefore, we need to follow the filter sequence as listed in /Filter: FlateDecode is decompressed first, and then ASCIIHexDecode is decoded, to get the final analyzable content.

To name some of the useful PDF analysis tools available, pdf-parser and pyew allow us to decompress stream objects that use single or cascaded filtering.
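The decoding order can be sketched in a few lines of Python using only the standard library. This is a simplified illustration rather than how any particular tool does it: the names decode_stream and DECODERS are hypothetical, only two filters are handled, and the test stream is built in the script itself (ASCIIHex first, then Flate) to mimic the cascaded case described above.

```python
import binascii
import zlib


def ascii_hex_decode(data):
    """Decode ASCIIHexDecode data: ignore whitespace, stop at '>'."""
    digits = b"".join(data.split(b">", 1)[0].split())
    if len(digits) % 2:
        # An odd number of digits is padded with a trailing zero.
        digits += b"0"
    return binascii.unhexlify(digits)


def flate_decode(data):
    """Decode FlateDecode data (zlib/deflate), assuming no predictor."""
    return zlib.decompress(data)


# Only the two filters from the example are supported in this sketch.
DECODERS = {
    "FlateDecode": flate_decode,
    "ASCIIHexDecode": ascii_hex_decode,
}


def decode_stream(data, filters):
    """Apply each decoder in the order the names appear in /Filter."""
    for name in filters:
        if name not in DECODERS:
            raise ValueError("unsupported filter: " + name)
        data = DECODERS[name](data)
    return data


# Hypothetical payload, encoded with ASCIIHex first and Flate second.
payload = b"app.alert(1);"
encoded = zlib.compress(binascii.hexlify(payload) + b">")

# Decoding follows the /Filter order: FlateDecode, then ASCIIHexDecode.
decoded = decode_stream(encoded, ["FlateDecode", "ASCIIHexDecode"])
print(decoded)
```

Keeping the decoders in a dictionary keyed by filter name means the same loop handles single and cascaded filtering alike; single filtering is just a list with one entry.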