Monday, July 16, 2012

Detecting File Hash Collisions

By Pär Österberg Medina.

When investigating a computer that is suspected of being involved in a crime, or that might be infected with malware, it is important to try to remove as many known files as possible in order to keep the focus of the analysis on the files you have not seen or analyzed before. This is particularly useful in malware forensics, when you are looking for something out of the ordinary, something you might not have seen before. To remove files from the analysis, a cryptographic checksum is generated for each file and matched against a hash database. These hash databases are divided into categories of files that are known to be either good, used, forbidden or bad.
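The reduction step can be sketched with standard Unix tools. The file names, contents and hash list below are made up for illustration and are not from a real hash set:

```shell
# Filter out files whose MD5 appears in a "known good" hash list.
# Everything here is fabricated for the sake of the example.
printf 'known system file\n' > notepad.exe
printf 'never seen before\n' > dropper.exe

# Our tiny stand-in hash database: one MD5 per known file.
md5sum notepad.exe | awk '{print $1}' > known.md5

# Print only the files whose MD5 is NOT in the database.
for f in notepad.exe dropper.exe; do
    h=$(md5sum "$f" | awk '{print $1}')
    grep -qx "$h" known.md5 || echo "needs analysis: $f"
done
```

Only `dropper.exe` survives the filter; the known file is silently dropped from the analysis.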

Another common use of hash functions in computer forensics is to verify the integrity of an acquired hard drive image or any other piece of data. A cryptographic hash of the data is generated when the evidence is acquired and can later be used to verify that the content has not changed. Even though newer hash functions exist, in computer forensics we still rely on MD5 and SHA-1. These hash functions were designed a long time ago and have flaws that can be exploited to generate what are called hash collisions. In this blog post I will show how these collisions can be detected, so that we can keep using our old, trustworthy hash functions - even though they are broken.
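A minimal sketch of this acquire-then-verify workflow, using a small stand-in file in place of a real evidence image:

```shell
# Record checksums at acquisition time; verify them later.
# disk.img is a stand-in for a real acquired image.
printf 'raw image data' > disk.img
md5sum  disk.img > disk.img.md5
sha1sum disk.img > disk.img.sha1

# Later: both checks must pass for the evidence to be considered intact.
md5sum -c disk.img.md5 && sha1sum -c disk.img.sha1
```

If even a single byte of the image changes, both `-c` checks fail and the chain of custody is in question.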

Hash Collision

You have most likely heard about hash collisions before and how they can be used for malicious intent. Briefly explained, a collision is when two files or messages produce the same hash even though their contents are different. To illustrate how this can look, I have generated what on the surface seems to be two identical programs.

Detecting Hash Collisions

In the example above I showed two files that had the same size and shared the same MD5 checksum. Even though there have been documented hash collisions in both the MD5 and SHA-1 hash functions, it is highly unlikely that a collision will occur in multiple hash functions at the same time. As you can see below, the files do not produce the same SHA-1 checksum and are therefore considered to be unique.
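The effect can be reproduced with the well-known MD5 collision pair published by Wang et al. in 2004: two 128-byte messages that differ in exactly six bytes, yet share an MD5 checksum while their SHA-1 checksums differ:

```shell
# Two 128-byte colliding messages from Wang et al. (2004).
# They differ in six bytes but produce the same MD5 checksum.
xxd -r -p > msg1.bin <<'EOF'
d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89
55ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b
d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0
e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70
EOF
xxd -r -p > msg2.bin <<'EOF'
d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89
55ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b
d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0
e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70
EOF

echo "MD5:   $(md5sum  msg1.bin | awk '{print $1}')"   # identical for both
echo "MD5:   $(md5sum  msg2.bin | awk '{print $1}')"
echo "SHA-1: $(sha1sum msg1.bin | awk '{print $1}')"   # different for each
echo "SHA-1: $(sha1sum msg2.bin | awk '{print $1}')"
```

The files are demonstrably different (`cmp` reports a mismatch), yet MD5 alone cannot tell them apart - the second hash function is what exposes the collision.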

The danger of whitelisting files and removing them from a forensic investigation is that you have to be absolutely sure that the files you are excluding from analysis are exactly the files you want to remove. Even though there has been no public demonstration of a targeted collision - an attack where a new file is generated to produce the same hash as an existing file, also known as a second-preimage attack - I always like to be extra careful when removing files that generate matches in my hash databases. Verifying that the files are indeed the same can be done by mapping one hash to another hash - a technique I call "hashmap".
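A minimal sketch of such a double check, with a hypothetical whitelist format of one `<md5> <sha1>` pair per line - a file is only excluded when both hashes match:

```shell
# Sketch: only whitelist a file when BOTH stored hashes match it.
# whitelist.txt format (hypothetical): "<md5> <sha1>" per known file.
printf 'benign content\n' > tool.exe
echo "$(md5sum tool.exe | awk '{print $1}') $(sha1sum tool.exe | awk '{print $1}')" > whitelist.txt

is_known() {
    md5=$(md5sum "$1" | awk '{print $1}')
    sha1=$(sha1sum "$1" | awk '{print $1}')
    grep -q "^$md5 $sha1$" whitelist.txt
}

is_known tool.exe && echo "tool.exe: safe to exclude"
```

A file that collides with a whitelist entry under one hash function would still fail the second check and stay in the analysis.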

Hashmap'ing a hash database

In order for us to be able to hashmap a hash database, the database needs to include at least two hashes created with two different hash functions. Fortunately for us, both the databases we generated in the RDS format and the NSRL databases we downloaded from NIST list the MD5 and SHA-1 hash in each entry. I have previously shown how the program ‘hfind’ can be used both to create an index from a database and to use that index to search the database. When ‘hfind’ finds a match for a hash we search for, the program prints the filename of the file that matched the hash.
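An RDS entry is a quoted CSV line whose first two fields are the SHA-1 and MD5 checksums and whose fourth field is the file name. The sketch below builds a one-line database in that layout from a real file and pulls fields out with ‘awk’ (the CRC32 field is approximated here with ‘cksum’, purely for illustration):

```shell
# Build a one-line database in the NSRL RDS column layout:
# "SHA-1","MD5","CRC32","FileName","FileSize",...
printf 'hello\n' > sample.txt
sha1=$(sha1sum sample.txt | awk '{print $1}')
md5=$(md5sum  sample.txt | awk '{print $1}')
crc=$(cksum sample.txt | awk '{print $1}')     # stand-in for a real CRC32
size=$(wc -c < sample.txt | tr -d ' ')
echo "\"$sha1\",\"$md5\",\"$crc\",\"sample.txt\",\"$size\",\"1\",\"WIN\",\"\"" > rds.txt

# Field 2 is the MD5 hash, field 4 the file name.
awk -F'","' '{print "md5=" $2, "name=" $4}' rds.txt
```

Since every entry carries both checksums side by side, either one can serve as the lookup key while the other is used for verification.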

This functionality of ‘hfind’ is something we can use when we want to hashmap an MD5 hash to the SHA-1 hash of the same file. In order for this to work, we need to replace the value in the database that holds the filename with the SHA-1 checksum of the file. This can be done in many ways, but to demonstrate it I will use the Unix command ‘awk’ on one of the hash databases I have generated before.
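The post's original listing is not reproduced here, so the following is a reconstruction of the idea rather than the author's exact command: build a one-line RDS-style database, then let ‘awk’ copy the SHA-1 field over the FileName field:

```shell
# Build a one-line RDS-style database, then replace the FileName
# field (4) with a copy of the SHA-1 field (1), keeping the quoted
# CSV layout so 'hfind' can still index and search the result.
printf 'hello\n' > sample.txt
sha1=$(sha1sum sample.txt | awk '{print $1}')
md5=$(md5sum  sample.txt | awk '{print $1}')
echo "\"$sha1\",\"$md5\",\"0\",\"sample.txt\",\"6\",\"1\",\"WIN\",\"\"" > rds.txt

# substr($1, 2) drops the leading quote that -F leaves on field 1;
# -v OFS restores the separators when the record is rebuilt.
awk -F'","' -v OFS='","' '{ $4 = substr($1, 2); print }' rds.txt > hashmap.txt
```

Using field 2 instead of field 1 in the assignment gives the reverse, SHA-1-to-MD5, mapping.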

As you can see, the field that used to hold the filename now holds the SHA-1 checksum of the file instead. Since the offsets of the file entries in our hashmap database are not the same as in the original database, we also need to create a new index.

Searching our new database for the MD5 checksum now prints the SHA-1 hash of the file instead of the file name. We can verify this hash by generating a SHA-1 checksum of the file we are investigating.

Now we know for sure that the file we have on disk is exactly the same as the one in our database. We can of course also reverse this process to create a hashmap database that prints the MD5 hash when we query the database with a SHA-1 hash.

Modifying ‘hfind’ to hashmap automatically

Generating a new hash database that holds the hash value you want to map to in the field for the file name is a solution that works. It is not very flexible, however, and requires a lot of extra disk space, since the hashmap database will be at least the same size as the original database, and its index approximately a third of the database size. Instead of working around the fact that ‘hfind’ only prints the field that holds the file name, a much better solution to our problem is to patch the program so it presents us with the corresponding MD5/SHA-1 hash instead.

To do so, the first thing we need to do is download The Sleuth Kit, verify the downloaded file and extract the contents. At the time of this writing, the latest stable version of TSK is 3.2.3.

The source code that handles how ‘hfind’ prints its results when processing a database in the NSRL2 format is located in the file ‘tsk3/hashdb/nsrl_index.c’. This is the file we need to patch so that ‘hfind’ prints the SHA-1 or MD5 checksum instead of the filename.

We also need to make sure that an error is returned if our patched ‘hfind’ binary is used on any database type other than the one using the NSRL2 format. This is done by patching the file ‘tsk3/hashdb/tm_lookup.c’.

As you can see above, instead of returning the name of the file that we find in our hash database, we are given the other half of the MD5-to-SHA-1 or SHA-1-to-MD5 hashmap pair. With this solution, we do not need to create an additional database and can use the existing index we have already created.

All content provided here is purely for educational purposes only. Review state and local laws before partaking in any activity. The views and statements here have not been reviewed, approved, or endorsed by Foundstone, McAfee, or Intel.