The ClueWeb09 Dataset

The ClueWeb09 dataset was created to support research on information
retrieval and related human language technologies. It consists of
about 1 billion web pages in ten languages that were collected in January
and February 2009. The dataset is used by several tracks of the
TREC conference.

The web graph for both the entire dataset and for the TREC Category B dataset (first 50 million English pages) is complete. We are in the process
of retrieving the data and performing the final formatting of the web graph.

The ClueWeb09 dataset and subsets are distributed in several different
ways.

Full, 4 x 1.5TB:
The full dataset is distributed as tarred/gzipped files on four 1.5
terabyte (TB) hard disk drives, in Linux ext3 format. The physical
drives are standard SATA 3 Gbit/sec (SATA/300)
3.5" drives that should be compatible with any SATA/300
interface, including external USB to SATA/300 enclosures.

Full, 2 x 3.0TB:
The full dataset is also available as tarred/gzipped files on two 3.0
terabyte (TB) hard disk drives, in Linux ext3 format. The physical
drives are standard SATA 6 Gbit/sec (SATA/600) 3.5" drives.
NOTE:
Check that your operating system and hardware
are compatible with 3 TB drives (not all SATA external
enclosures are compatible with these drives).

CatB, 1 x 500GB:
The TREC "Category B" subset of the full dataset is distributed
as tarred/gzipped files on one 500
gigabyte (GB) hard disk drive, in Linux ext3 format. The physical
drive is a SATA 3 Gbit/sec (SATA/300)
3.5" drive that should be compatible with any SATA/300
interface, including external USB to SATA/300 enclosures.

JA, 1 x 500GB:
The Japanese subset of the full dataset is distributed
as tarred/gzipped files on one 500
gigabyte (GB) hard disk drive, in Linux ext3 format. The physical
drive is a SATA 3 Gbit/sec (SATA/300)
3.5" drive that should be compatible with any SATA/300
interface, including external USB to SATA/300 enclosures.

T11Crowd, web:
The subset used by the TREC 2011 Crowdsourcing track is
downloaded from the web.

File Formats and Sample Data:

Web pages are in the WARC file format. Each WARC file is
about 1 gigabyte, uncompressed. Each WARC file contains
several tens of thousands of web pages (e.g., 40,000). Each WARC
file is compressed by gzip.
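
As a rough illustration of working with these files, the sketch below counts the records in one gzip-compressed WARC file using only the Python standard library. It assumes each record begins with a "WARC/<version>" line, followed by "Name: value" header lines, a blank line, and exactly Content-Length bytes of payload. The function names (iter_warc_records, count_records) are hypothetical, and real files may deviate from these assumptions, so treat it as a starting point rather than a validated reader.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream.

    Assumption: a record is a "WARC/<version>" line, "Name: value" header
    lines, a blank line, then Content-Length bytes of payload.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip blank separator lines between records
        headers = {"version": line.strip().decode("ascii", "replace")}
        for raw in iter(stream.readline, b""):
            if raw in (b"\r\n", b"\n"):
                break  # a blank line ends the header block
            name, _, value = raw.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read the payload by length so that payload lines which happen to
        # start with "WARC/" are not mistaken for a new record.
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


def count_records(path):
    """Count the records in one gzip-compressed WARC file."""
    with gzip.open(path, "rb") as stream:
        return sum(1 for _ in iter_warc_records(stream))
```

Streaming through gzip.open avoids writing the roughly 1 gigabyte of uncompressed data to disk. As a back-of-the-envelope check on scale, about 1 billion pages at roughly 40,000 pages per file implies on the order of 25,000 WARC files for the full dataset; this is an estimate from the figures above, not an official count.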

The ClueWeb09 dataset is available on several 'cloud computing'
services (e.g., Open Cloud, the Pittsburgh Supercomputing Center).
It can also be accessed using search interfaces provided by the
Lemur Project.

A ClueWeb09 dataset license is required before you can begin using
a hosted copy of the dataset. There is no cost for a dataset license.

The process for obtaining a ClueWeb09 dataset license is
described below.
This process takes two weeks. Please allow enough time.
We do not have the power to hurry the university administrators
who must approve your license.

Sign an Organizational Agreement.
This agreement must be signed by
a person with the authority to sign agreements on behalf of your
organization. The person signing must also initial each page
of the agreement on the bottom right corner.

The organizational data license typically applies to a single research
group or unit within a larger legal entity. For example, in a
university, the organizational license might apply to a research group
consisting of a few professors, and the students and staff doing
research with them. In this case, the organization would be
the name of the research group (e.g., the Information Retrieval
Laboratory), and the Corporation/Legal Entity would be the name
of the university.

Send a complete copy (all six pages) of the signed
organizational agreement to Christina Melucci at the Language Technologies
Institute. The preferred method is to scan the signed agreement
into a PDF and email it to Christina (cmelucci at cs dot cmu dot edu).
If you cannot send a PDF, you can fax the agreement. The fax number
is +1 412-268-6298. If you choose to fax the organizational agreement, please
notify Christina by email (cmelucci at cs dot cmu dot edu)
so that we know to look for your fax.

Each individual who will use or have access to the dataset
must sign an Individual Agreement.
You must retain these signed individual agreements within your
organization.

The ClueWeb09 datasets are distributed by Carnegie Mellon University for
research purposes only. A dataset may be obtained from Carnegie Mellon
by signing a data license agreement with Carnegie Mellon University, and
paying a fee that covers the cost of distributing the dataset.

The process for obtaining a ClueWeb09 dataset is described below.
This process takes two weeks. Please allow enough time.
We do not have the power to hurry the university administrators
who must approve your license.

Sign an Organizational Agreement.
This agreement must be signed by
a person with the authority to sign agreements on behalf of your
organization. The person signing must also initial each page
of the agreement on the bottom right corner.

The organizational data license typically applies to a single research
group or unit within a larger legal entity. For example, in a
university, the organizational license might apply to a research group
consisting of a few professors, and the students and staff doing
research with them. In this case, the organization would be
the name of the research group (e.g., the Information Retrieval
Laboratory), and the Corporation/Legal Entity would be the name
of the university.

Send a complete copy (all six pages) of the signed
organizational agreement to Christina Melucci at the Language Technologies
Institute. The preferred method is to scan the signed agreement
into a PDF and email it to Christina (cmelucci at cs dot cmu dot edu).
If you cannot send a PDF, you can fax the agreement. The fax number
is +1 412-268-6298. If you choose to fax the organizational agreement, please
notify Christina by email (cmelucci at cs dot cmu dot edu)
so that we know to look for your fax.

Provide order information to Christina Melucci by email or fax:

Which version of the dataset you want:
Category A, Category B, or Japanese;

Which type of invoice you want: pdf (fast) or paper (slow);

Mail and email address to which the invoice should be sent;

Mailing address and telephone number to which the datasets should be sent (FedEx requires both the shipping address and telephone number);

Preferred shipping method (1 day, 3 day, 1 week); and

Method of payment (wire transfer, check, credit card).

We will send you an email confirmation that we have received
your order.

We will send you an invoice for payment, by mail and/or email.
The costs of each dataset are shown below.

Includes two 3.0 terabyte hard disks (check compatibility with your operating system and hardware)

ClueWeb09-T09B, $165:
A subset of about 50 million English pages
(TREC 2009 "Category B" dataset).
Includes one 500 gigabyte hard disk.

ClueWeb09-JA, $165:
A subset of about 67 million Japanese pages
(NTCIR-9 Intent Task dataset).
Includes one 500 gigabyte hard disk.

ClueWeb09-T11Crowd, $0:
The TREC-2011 Crowdsourcing dataset.
Web download only.

Shipping, cost varies:
US options: 1 day, 2 day, 7 day.
International options: 1 week.

Payment information will be included on the invoice. Payment must be made in U.S. dollars,
by check, wire transfer, or credit card. The dataset will not be shipped until your payment is confirmed.

If you are in a hurry, credit card is the fastest payment method.
Please be aware that we are not automatically notified when funds
arrive in CMU's bank account. After you make your payment, please
notify Christina Melucci by email (cmelucci at cs dot cmu dot edu)
so that we know to watch for it.

We ship the dataset to the mailing address that you specified.

Each individual who will use or have access to the dataset
must sign an Individual Agreement.
You must retain these signed individual agreements within your
organization.

Additional information about the ClueWeb09 Dataset is available
on the ClueWeb09 Information Page.
New information is typically posted there.

We also maintain a ClueWeb09 Mailing List, although it is used very rarely.
Please note that when you browse to this page, you may receive a warning
stating that the security certificate for the domain is invalid. The
certificate is not invalid; it is self-signed by the list maintainers at
Carnegie Mellon University. It is safe to accept the certificate.