The '''Graphics Interchange Format''' ('''GIF''') is a bitmap [[image format]] created by CompuServe. GIF images use lossless [[LZW]] compression to reduce file size. Each image can use up to 256 different colors selected from the 24-bit RGB color space. GIF also supports animation by storing multiple images in a single file and displaying them in sequence.

This page describes large-scale corpora of forensically interesting information that are available for those involved in forensic research.

"The Graphics Interchange Format(c) is the Copyright property of CompuServe Incorporated. GIF(sm) is a Service Mark property of CompuServe Incorporated."

= Disk Images =

;''The Harvard/MIT Drive Image Corpus.'' Between 1998 and 2006, [[Simson Garfinkel|Garfinkel]] acquired 1250+ hard drives on the secondary market. These hard drive images have proven invaluable in a range of studies, such as the development of new forensic techniques and research into the sanitization practices of computer users.

GIF files consist of a [[header]], image data, optional [[metadata]], and a [[footer]]. The header consists of a signature and a version, each 3 bytes long. The signature is <tt>47 49 46</tt> (hex) / <tt>GIF</tt> (text). The version is either <tt>38 37 61</tt> (hex) / <tt>87a</tt> (text) or <tt>38 39 61</tt> (hex) / <tt>89a</tt> (text). The footer, which the format specification calls the trailer, is the single byte <tt>3B</tt> (hex).
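The header and trailer values above can be checked with a few lines of Python. This is only a minimal sketch: the function name <tt>inspect_gif</tt> and the keys of the returned dictionary are made up for illustration, and a real examiner's tool would also have to cope with data appended after the trailer.

```python
GIF_SIGNATURE = b"GIF"             # 47 49 46 (hex)
GIF_VERSIONS = (b"87a", b"89a")    # 38 37 61 / 38 39 61 (hex)
GIF_TRAILER = 0x3B                 # trailer byte


def inspect_gif(data: bytes) -> dict:
    """Report the header fields and whether the last byte is the trailer.

    Hypothetical helper: it looks only at the first six bytes and the
    final byte, and does not validate anything in between.
    """
    signature, version = data[:3], data[3:6]
    return {
        "is_gif": signature == GIF_SIGNATURE and version in GIF_VERSIONS,
        "version": version.decode("ascii", errors="replace"),
        "ends_with_trailer": len(data) > 6 and data[-1] == GIF_TRAILER,
    }
```

For a file read into memory with <tt>open(path, "rb").read()</tt>, <tt>inspect_gif(data)["is_gif"]</tt> is true only when both the signature and one of the two known versions are present.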

;''The Honeynet Project Forensic Challenge.'' In 2001 the Honeynet Project distributed a set of disk images and asked participants to conduct a forensic analysis of a compromised computer. Entries were judged and posted for all to see. The drive and writeups are still available online.

;''The [http://www.cfreds.nist.gov/ Computer Forensic Reference Data Sets]'' project from [[National Institute of Standards and Technology|NIST]] hosts a few sample cases that may be useful for examiners to practice with.

GIF89a files can contain [[metadata]] in [[text]] format. GIF metadata is contained in sections identified as a Comment Extension, a Plain Text Extension, and an Application Extension. All extension sections begin with the Extension Introducer <tt>21</tt> (hex).

* http://www.cfreds.nist.gov/Hacking_Case.html

Comment Extensions are optional and more than one may be present. They were designed to allow including comments about the graphic, credits, descriptions or other types of non-control/non-graphic data. The beginning of this block has the Extension Introducer and a Comment Label <tt>FE</tt> (hex). Comment data has a sequence of sub-blocks between 1 and 255 bytes in length, with the size in a byte before the data. Comment Extensions should appear either before or after the control and graphic data blocks.
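The sub-block layout described above can be walked with a short Python routine. This is a sketch under stated assumptions: the function names are hypothetical, and the zero-length sub-block that ends the chain is taken from the GIF89a specification (the block terminator) rather than from the text above.

```python
from typing import Tuple


def read_sub_blocks(data: bytes, offset: int) -> Tuple[bytes, int]:
    """Concatenate size-prefixed sub-blocks starting at offset.

    Returns (payload, offset of the byte after the terminator).
    Assumes the chain ends with a zero-length sub-block.
    """
    payload = bytearray()
    while True:
        if offset >= len(data):
            raise ValueError("truncated sub-block chain")
        size = data[offset]            # one size byte precedes the data
        offset += 1
        if size == 0:                  # block terminator
            return bytes(payload), offset
        chunk = data[offset:offset + size]
        if len(chunk) < size:
            raise ValueError("truncated sub-block")
        payload += chunk
        offset += size


def read_comment_extension(data: bytes, offset: int) -> Tuple[bytes, int]:
    """Read one Comment Extension that starts at offset (21 FE hex)."""
    if data[offset:offset + 2] != b"\x21\xFE":
        raise ValueError("not a Comment Extension")
    return read_sub_blocks(data, offset + 2)
```

The returned offset points at whatever follows the extension, so a caller can keep scanning for further extension blocks or for the trailer.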

Plain Text Extensions are optional and more than one may be present. They were designed to allow rendering of textual data as a graphic. The beginning of this block has the Extension Introducer and a Plain Text Label <tt>01</tt> (hex). Plain text data has a sequence of sub-blocks between 1 and 255 bytes in length, with the size in a byte before the data.

= Network Packets and Traces =

Application Extensions are optional. They were designed to allow applications to insert application specific data inside a GIF. The beginning of this block has the Extension Introducer and an Application Extension Label <tt>FF</tt> (hex).

== DARPA ID Eval ==

== External Links ==

''The DARPA Intrusion Detection Evaluation.'' In 1998, 1999 and 2000 the Information Systems Technology Group at MIT Lincoln Laboratory created a test network complete with simulated servers, clients, clerical workers, programmers, and system managers. Baseline traffic was collected. The systems on the network were then “attacked” by simulated hackers. Some of the attacks were well-known at the time, while others were developed for the purpose of the evaluation.

''The [http://www.wide.ad.jp/project/wg/mawi.html MAWI Working Group] of the [http://www.wide.ad.jp/ WIDE Project]'' maintains a [http://tracer.csl.sony.co.jp/mawi/ Traffic Archive]. In it you will find:

* daily trace of a trans-Pacific T1 line;
* daily trace at an IPv6 line connected to 6Bone;
* daily trace at another trans-Pacific line (100Mbps link) in operation since 2006/07/01.

Traffic traces are captured with tcpdump, and the IP addresses in the traces are then scrambled by a modified version of [[tcpdpriv]].

==Wireshark==

The open source Wireshark project (formerly known as Ethereal) has a website with many network packet captures:

* http://wiki.wireshark.org/SampleCaptures

==NFS Packets==

The Storage Networking Industry Association has a set of network file system traces that can be downloaded from:

* http://iotta.snia.org/traces
* http://tesla.hpl.hp.com/public_software/

=Text Files=

==Email messages==

''The Enron Corpus'' is a collection of email messages that were seized by the Federal Energy Regulatory Commission during its investigation of Enron.

* http://www.cs.cmu.edu/~enron
* http://www.enronemail.com/

==Log files==

[http://crawdad.cs.dartmouth.edu/index.php CRAWDAD] is a community archive for wireless data.

The [http://trec.nist.gov Text REtrieval Conference (TREC)] has made available a series of [http://trec.nist.gov/data.html text collections].

==American National Corpus==

The [http://www.americannationalcorpus.org/ American National Corpus (ANC) project] is creating a massive collection of American English from 1990 onward. The goal is to create a corpus of at least 100 million words that is comparable to the British National Corpus.

==British National Corpus==

The [http://www.natcorp.ox.ac.uk/ British National Corpus (100)] is a 100-million-word collection of written and spoken English from a variety of sources.

=Voice=

==CALLFRIEND==

CALLFRIEND is a database of recorded English conversations. A total of 60 recorded conversations are available from the University of Pennsylvania at a cost of $600.

==TalkBank==

TalkBank is an online database of spoken language. The project was originally funded between 1999 and 2004 by two National Science Foundation grants; ongoing support is provided by two NSF grants and one NIH grant.

The [http://corpus.canterbury.ac.nz/ Canterbury Corpus] is a set of files used for testing lossless compression algorithms. The corpus consists of 11 natural files, 4 artificial files, 3 large files, and a file with the first million digits of pi. The website also hosts a copy of the Calgary Corpus, which was the de facto standard for testing lossless compression algorithms in the 1990s.

The [http://traces.cs.umass.edu/index.php/Main/HomePage UMass Trace Repository] provides network, storage, and other traces to the research community for analysis. The UMass Trace Repository is supported by grant #CNS-323597 from the National Science Foundation.

Revision as of 22:46, 12 July 2008

This page describes large-scale corpora of forensically interesting information that are available for those involved in forensic research.

Disk Images

The Harvard/MIT Drive Image Corpus. Between 1998 and 2006, Garfinkel acquired 1250+ hard drives on the secondary market. These hard drive images have proven invaluable in a range of studies, such as the development of new forensic techniques and research into the sanitization practices of computer users.

The Honeynet Project Forensic Challenge. In 2001 the Honeynet Project distributed a set of disk images and asked participants to conduct a forensic analysis of a compromised computer. Entries were judged and posted for all to see. The drive and writeups are still available online.

Network Packets and Traces

DARPA ID Eval

The DARPA Intrusion Detection Evaluation. In 1998, 1999 and 2000 the Information Systems Technology Group at MIT Lincoln Laboratory created a test network complete with simulated servers, clients, clerical workers, programmers, and system managers. Baseline traffic was collected. The systems on the network were then “attacked” by simulated hackers. Some of the attacks were well-known at the time, while others were developed for the purpose of the evaluation.

Text for Text Retrieval

American National Corpus

The American National Corpus (ANC) project is creating a massive collection of American English from 1990 onward. The goal is to create a corpus of at least 100 million words that is comparable to the British National Corpus.

British National Corpus

Voice

CALLFRIEND

CALLFRIEND is a database of recorded English conversations. A total of 60 recorded conversations are available from the University of Pennsylvania at a cost of $600.

TalkBank

TalkBank is an online database of spoken language. The project was originally funded between 1999 and 2004 by two National Science Foundation grants; ongoing support is provided by two NSF grants and one NIH grant.

Augmented Multi-Party Interaction Corpus

Other Corpora

The Canterbury Corpus is a set of files used for testing lossless compression algorithms. The corpus consists of 11 natural files, 4 artificial files, 3 large files, and a file with the first million digits of pi. The website also hosts a copy of the Calgary Corpus, which was the de facto standard for testing lossless compression algorithms in the 1990s.

The UMass Trace Repository provides network, storage, and other traces to the research community for analysis. The UMass Trace Repository is supported by grant #CNS-323597 from the National Science Foundation.