Searching and extracting data from PST files

Keyword searches can be a significant aspect of an investigation and given the prevalence of Microsoft Outlook you’ll most likely find yourself needing to search through PST files for data, be it a simple keyword or more complex pattern. Even though you can use Outlook to open up a PST file, my personal preference is not to do the search within Outlook itself for two primary reasons:

Outlook will change data within the PST file; of course, you’re working on a copy – but I prefer to not have dynamically changing data (i.e. unread/read status, etc) when I’m doing my analysis.

If you’re wanting to find data matching a certain pattern (i.e. Regular Expressions) or data that is not within the message body (i.e. message header data), Outlook does not really have the facilities to support these kinds of searches.

Of course, there are several commercial investigative tools that will parse through and allow you to search PST files (FTK and Encase come to mind) but in this post I’m going to focus on performing the extraction and search with only free tools in a Linux environment.

What you’ll need:

A relatively up-to-date Linux system (be it physical or VM).

Readpst compiled/installed (in Ubuntu: apt-get install readpst) – readpst is a utility included with libpst which can be found here.

Also, I’m going to begin by assuming that you’ve acquired the PST file in a forensically sound fashion and that a copy of the file is accessible on your Linux system. Let’s get started…

Extracting data from a PST file using readpst

Run readpst on the PST file to extract all objects within the PST (i.e. messages/attachments, calendar entries, contacts, etc). By default, readpst exports data in mbox format – this ends up placing all of the extracted objects into a set of mbox files (one per subfolder), which can make extracting objects that match a search criteria a bit tedious. Instead, we’re going to tell readpst to write each object into its own file, the command looks like:

readpst -S -o out/ outlook.pst

Where out/ is the directory where you’d like readpst to output the files and outlook.pst is the PST file that you’re extracting data from. The -S flag indicates that you’d like readpst to extract each object separately, rather than in mbox format.

Once readpst has finished, in your output directory you’ll find a directory structure that matches the folder structure of the PST (generally starting with a base directory of Outlook). Within each of these folders you’ll find numerically named files that contain plain text representing the exported object (i.e. for a email message you’ll find the message body, headers, etc).

Working with the extracted data

Thanks to readpst, it is quite trivial to extract all data within a PST file into a nicely organized (and basically human readable) set of files and at this point you can begin processing these files as you would any other text file. For example, a commonly seen forensic task would be to search all objects within a PST for certain keywords or perhaps a pattern. As an example of pattern matching, let’s say you were investigating a PII incident and you wanted to see whether a subject had utilized email to send or receive emails that appear to contain social security numbers. You could use grep to search the files within the directory structure that readpst created with the following command:

This is telling grep to run a recursive search using a regular expression which will match numbers that look like SSNs in the readpst output directory. From there, you could even automate this process using a script to automatically move matching messages to a target folder that you could manually validate (or whatever the next step of you given workflow is).

As you can see, forensically analyzing PST files using freely available software is quite easy and can be a very powerful method for efficiently extracting case-pertinent data. Give it a try sometime…

On a side note, I’ve added a new Resources section to my blog and one of the pages contained within this section is dedicated to listing useful regular expressions (such as the SSN matching regular expression I used above). Right now, that is the only one I have up there, but I’ll keep adding to this page as I think of other useful regular expressions, so check back regularly.