Disk Images

The Real Data Corpus.

Between 1998 and 2006, Garfinkel acquired more than 1,250 hard drives on the secondary market. Images of these drives have proven invaluable in a range of studies, such as the development of new forensic techniques and the examination of the sanitization practices of computer users.

In 2001 the Honeynet Project distributed a set of disk images and asked participants to conduct a forensic analysis of a compromised computer. Entries were judged and posted for all to see. The drive images and write-ups are still available online.

The Honeynet Project provided network scans in the majority of its Scan of the Month challenges. Some of the challenges provided disk images instead. The Sleuth Kit's Wiki lists Brian Carrier's responses to those challenges.

Memory Images

Network Packets and Traces

DARPA ID Eval

The DARPA Intrusion Detection Evaluation. In 1998, 1999, and 2000 the Information Systems Technology Group at MIT Lincoln Laboratory created a test network complete with simulated servers, clients, clerical workers, programmers, and system managers. Baseline traffic was collected. The systems on the network were then “attacked” by simulated hackers. Some of the attacks were well known at the time, while others were developed for the purpose of the evaluation.

Text for Text Retrieval

American National Corpus

The American National Corpus (ANC) project is creating a massive collection of American English from 1990 onward. The goal is to create a corpus of at least 100 million words that is comparable to the British National Corpus.

IEEE VAST Challenges

Images

A set of freely redistributable images from all over the world, used for content-based image retrieval.

Voice

CALLFRIEND

CALLFRIEND is a database of recorded English conversations. A total of 60 recorded conversations are available from the University of Pennsylvania at a cost of $600.

TalkBank

TalkBank is an online database of spoken language. The project was originally funded between 1999 and 2004 by two National Science Foundation grants; ongoing support is provided by two NSF grants and one NIH grant.

Augmented Multi-Party Interaction Corpus

Other Corpora

Under an NSF grant, Kam Woods and Simson Garfinkel created a website for digital corpora [2]. The site includes a complete training scenario, including disk images, packet captures and exercises.

The Canterbury Corpus is a set of files used for testing lossless compression algorithms. The corpus consists of 11 natural files, 4 artificial files, 3 large files, and a file with the first million digits of pi. The website also hosts a copy of the Calgary Corpus, which was the de facto standard for testing lossless compression algorithms in the 1990s.
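As a rough illustration of how a compression corpus like this is typically used, the sketch below computes a per-file compression ratio with Python's standard `zlib` module. It is only a minimal example, not the corpus maintainers' own benchmark method: the function names and the assumption that the corpus files have been unpacked into a single local directory are hypothetical.

```python
import zlib
from pathlib import Path


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Return compressed size divided by original size (lower is better)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(zlib.compress(data, level)) / len(data)


def corpus_ratios(corpus_dir: str) -> dict[str, float]:
    """Map each non-empty regular file directly under corpus_dir to its ratio.

    corpus_dir is assumed to be a local directory holding the unpacked
    corpus files; the layout is an assumption made for this sketch.
    """
    ratios = {}
    for path in sorted(Path(corpus_dir).iterdir()):
        if path.is_file() and path.stat().st_size > 0:
            ratios[path.name] = compression_ratio(path.read_bytes())
    return ratios
```

Calling `corpus_ratios("canterbury")` on a hypothetical local copy would return one ratio per file, which makes it straightforward to compare algorithms or settings on the same inputs by swapping out the `zlib.compress` call.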

The UMass Trace Repository provides network, storage, and other traces to the research community for analysis. It is supported by grant #CNS-323597 from the National Science Foundation.