Fuzzy Hashing

The distribution and use of electronic devices has increased in recent years. Traditional books, photos, letters and LPs have become e-books, digital photos, email and MP3s. This transformation has also influenced the capacity of today's storage media, which has grown from a few megabytes to terabytes. Users can archive all their information on a single hard disk instead of in several cardboard boxes in the attic. This convenience for consumers complicates computer forensic investigations (e.g., by the Federal Bureau of Investigation), because the investigator has to cope with an information overload: a search for relevant files no longer resembles finding a needle in a haystack, but rather a needle in a hay-hill.

The crucial task in coping with this data overload is to distinguish relevant from non-relevant information. In most cases an automated preprocessing step is used, which filters out irrelevant data and reduces the amount of material an investigator has to inspect by hand. As of today, the best practice for this preprocessing is quite simple: hash each file on the evidence storage medium, compare the resulting hash value (also called a fingerprint or signature) against a given set of fingerprints, and assign the file to one of three categories: known-to-be-good, known-to-be-bad and unknown. For instance, unmodified files of a common operating system (e.g., Windows, Linux) or binaries of a widespread application like a browser are considered known-to-be-good and need not be inspected during an investigation. The most common set/database of such non-relevant files is the Reference Data Set (RDS) within the National Software Reference Library (NSRL) maintained by the US National Institute of Standards and Technology (NIST).
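The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual NSRL tooling: the hash sets `known_good` and `known_bad` are hypothetical stand-ins for a real fingerprint database such as the RDS.

```python
import hashlib

def sha256_file(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def categorize(path, known_good, known_bad):
    """Assign a file to one of the three categories via an exact hash lookup.

    known_good / known_bad are hypothetical sets of hex fingerprints,
    e.g. loaded from a reference database like the NSRL RDS.
    """
    fingerprint = sha256_file(path)
    if fingerprint in known_good:
        return "known-to-be-good"
    if fingerprint in known_bad:
        return "known-to-be-bad"
    return "unknown"
```

Because the lookup is an exact set-membership test, it is fast, but a single changed bit in a file produces a completely different fingerprint and the file falls into the ‘unknown’ category.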

As of today, cryptographic hash functions are used to handle this data overload. They have one key property: regardless of how many bits change between two inputs (i.e., 1 bit or 100 bits), the output behaves pseudorandomly. However, in the area of computer forensics it is also convenient to find similar files (e.g., different versions of a file), which is why we need a similarity-preserving hash function, also called a fuzzy hash function. Besides generating similarity hashes, the table lookup is also very important. This is why we roughly divide our work into ‘generate a similarity hash’ and ‘look up hash values in a database’.
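The pseudorandom behaviour of a cryptographic hash (the avalanche effect) is easy to observe: flipping a single input bit changes, on average, half of the output bits. A small sketch with SHA-256:

```python
import hashlib

def hamming_distance_bits(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

data = b"The quick brown fox jumps over the lazy dog"
flipped = bytes([data[0] ^ 0x01]) + data[1:]   # flip a single input bit

d1 = hashlib.sha256(data).digest()
d2 = hashlib.sha256(flipped).digest()

# Of the 256 output bits, roughly half differ even though only one
# input bit changed -- similar inputs yield unrelated digests.
print(hamming_distance_bits(d1, d2))
```

This is exactly why cryptographic hashes cannot detect similar files: the digests of two near-identical inputs are as unrelated as those of two random inputs, which motivates similarity-preserving (fuzzy) hash functions.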

In general we consider two different levels for generating similarity hashes. On the one hand there is the byte level, which works independently of the file system and is very fast, as we need not interpret the input. On the other hand there is the semantic level, which tries to interpret a file and is mostly used for multimedia content, i.e., images, videos and audio. Although this approach is slower, it is necessary because two files can have perceptually identical content while their byte structures differ completely, e.g., the same image stored as BMP vs. JPG. These multimedia hashing schemes are often known as ‘perceptual hashing’ or ‘robust hashing’.
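As a toy illustration of the byte level (this is not ssdeep, sdhash or any production scheme, just a sketch of the underlying idea), one can extract byte n-grams as features and compare two inputs by the Jaccard similarity of their feature sets:

```python
def ngrams(data: bytes, n: int = 7) -> set:
    """All byte n-grams of the input -- a crude byte-level feature set."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def similarity(a: bytes, b: bytes, n: int = 7) -> float:
    """Jaccard similarity of the two n-gram sets, in [0, 1].

    Identical inputs score 1.0; inputs sharing no n-gram score 0.0;
    small edits remove only the n-grams overlapping the edited region,
    so similar inputs keep a high score.
    """
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

Real byte-level fuzzy hashing schemes compress such features into a short signature so that the comparison works on fingerprints rather than on the raw files, but the principle is the same: small changes to the input cause only small changes to the score.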

Forensics of Mobile Devices

Mobile devices like smartphones or tablet PCs are becoming more and more popular and are therefore increasingly the objects of forensic investigations. However, in contrast to classical IT systems like PCs or laptops, the forensic process is different. Our main goal is to investigate the legal implications of data modification during the acquisition phase.

Application Forensics

In many forensic investigations we have to deal with different application-specific files, e.g., log files of browsers, mail clients or instant messaging clients. Currently we provide a Python-based open-source tool to investigate MS Word doc-files. This script does not rely on any third-party Word-specific interfaces. Please feel free to test wordmetadata. Additionally, we provide an analysis tool to investigate Skype binary log files in the dat-format. It is an improved version of the Skype Chatsync Reader. The tools were developed by Achim Brand during his Master's thesis abroad in Norway (see his Master's thesis and the WDFIA 2012 paper for details).