The Common Crawl dataset

Common Crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010.
As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is
on github.

1. Getting the data

The first thing was to get the data into a Hadoop cluster.
The crawl is made up of 300,000 gzipped ARC files, each around 100MB, stored in S3.
I wrote a dead simple
distributed copy to do this.

Only a few things of note about this job...

The data in S3 is marked as requester-pays, which means the "x-amz-request-payer" header needs to be set
on every request, even though the charge is effectively a no-op if you're accessing the data from EC2.
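As a rough sketch of what that header looks like on the wire (the bucket and key names here are placeholders, and a real request would additionally need AWS authentication signing, which is omitted):

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of a GET against a requester-pays S3 object. Without the
// "x-amz-request-payer" header, S3 answers 403 for such buckets, even
// though the transfer itself is free from inside EC2.
public class RequesterPaysRequest {
    public static HttpRequest build(String bucket, String key) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://" + bucket + ".s3.amazonaws.com/" + key))
                .header("x-amz-request-payer", "requester")  // acknowledge the charge
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = build("a-bucket", "path/to/file.arc.gz");
        System.out.println(req.headers().firstValue("x-amz-request-payer").orElse("missing"));
    }
}
```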

Pulling from S3 to EC2 is network-bound, so I ran using the
MultithreadedMapRunner to ensure I could get as much throughput as possible.

The code includes lots of retry logic but also sets
mapred.max.map.failures.percent=100 to
allow tasks to fail without killing the entire job (e.g. there was one S3 object with bad ACLs that simply couldn't be read; no amount of retries would have helped).
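In the job configuration that setting looks something like this (using the old mapred-era property name the post mentions):

```xml
<!-- Let up to 100% of map tasks fail without failing the whole job,
     so one unreadable S3 object doesn't kill a 300,000-file copy. -->
<property>
  <name>mapred.max.map.failures.percent</name>
  <value>100</value>
</property>
```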

2. Filtering text/html

The next step was to filter out everything that didn't have a mime type of 'text/html'. This is pretty straightforward since the ARC file format specifies the mime type in a header.
I used the ArcInputFormat from
Apache Nutch to actually generate the hadoop map input records.
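To give a flavour of why this is cheap: an ARC v1 record header is (to my understanding) a single space-separated line of url, IP address, archive date, content type, and record length, so the filter only has to look at one field and never touches the payload. A toy version of that check:

```java
// Toy mime-type filter over an ARC v1 record header line, assumed to be
// "url ip-address archive-date content-type archive-length".
// The real job uses Nutch's ArcInputFormat; this just shows the field lookup.
public class ArcHeaderFilter {
    public static String mimeType(String headerLine) {
        String[] fields = headerLine.split(" ");
        return fields.length >= 5 ? fields[3] : "";  // content-type is field 4
    }

    public static boolean isHtml(String headerLine) {
        return mimeType(headerLine).equals("text/html");
    }

    public static void main(String[] args) {
        String header = "http://example.com/ 93.184.216.34 20100901123000 text/html 12345";
        System.out.println(isHtml(header)); // true
    }
}
```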

Across the 3,000,000,000 objects in the crawl there turned out to be 2,000 distinct mime types, 700 of which occurred only once, with about 90% of them being nonsense.

The top six mime types were:

rank  mime type                 freq           overall %
1     text/html                 2,970,000,000  91%
2     text/plain                79,000,000     2%
3     text/xml                  52,000,000     1%
4     application/pdf           48,000,000     1%
5     application/x-javascript  26,000,000     <1%
6     text/css                  25,000,000     <1%

Even though there's probably interesting content in the non-text/html types, it seemed that just handling text/html would give me the biggest bang for my buck.

Initially I had some problems with encodings. Although HTTP response headers include a charset
field that is meant to indicate the payload's encoding, I found it to be wrong about 30% of the time :( So I just ignored what the header said and
used the CharsetDetector
provided in Apache Tika. CharsetDetector inspects a chunk of bytes, uses heuristics to guess the encoding, then decodes and re-encodes as UTF-8.
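The "trust the bytes, not the header" idea can be sketched with just the JDK. This is far cruder than Tika's CharsetDetector (which runs statistical heuristics over many encodings); it only tries a strict UTF-8 decode and falls back to ISO-8859-1, which can decode any byte sequence:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Crude stand-in for Tika's CharsetDetector: ignore the declared charset,
// attempt a strict UTF-8 decode of the payload bytes, and fall back to
// ISO-8859-1 (which never fails) when the bytes aren't valid UTF-8.
public class NaiveCharsetSniffer {
    public static String toUtf8String(byte[] payload) {
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(payload)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8; decode byte-for-byte as Latin-1 instead.
            return new String(payload, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // 0xE9 is a UTF-8 lead byte with no continuation: invalid UTF-8,
        // but a lone Latin-1 e-acute.
        System.out.println(toUtf8String(new byte[]{(byte) 0xE9, 0x78}));
    }
}
```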

3. Extracting visible text

Next was to extract the visible text from this raw HTML. After playing with a few libraries I ended up going with
boilerpipe, in particular the
KeepEverythingWithMinKWordsExtractor.
Boilerpipe, roughly, returns a single line of text per block element of the HTML.
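The idea behind that extractor, as I understand it, is to keep every text block containing at least k words. A toy stand-in with naive tag splitting (real boilerpipe segments on block-level elements and is far more robust) might look like:

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for boilerpipe's KeepEverythingWithMinKWordsExtractor:
// naively treat every run of text between tags as a block, and keep
// only the blocks that contain at least k words.
public class MinKWordsExtractor {
    public static List<String> extract(String html, int k) {
        List<String> kept = new ArrayList<>();
        for (String block : html.split("<[^>]*>")) {  // strip/split on tags
            String text = block.trim();
            if (text.isEmpty()) continue;
            if (text.split("\\s+").length >= k) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(extract("<p>one two three</p><p>hi</p>", 3));
    }
}
```

With k = 3, the short "hi" block above is dropped as boilerplate while the longer paragraph survives.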

4. Filtering for English content

I then used
LanguageIdentifier, again a part of Tika, to filter out everything but English text.
It didn't seem to have any false positives, but looking at the top five languages something seems amiss...

rank  language         freq
1     English (en)     1,600,000,000
2     Lithuanian (lt)  270,000,000
3     Norwegian (no)   150,000,000
4     Estonian (et)    140,000,000
5     French (fr)      140,000,000

I never got around to sampling some of the Lithuanian ones to see what was actually going on, but I'm a bit suspicious. I might have actually lost a bit of content in this step...
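One cheap way to spot-check a suspicious bucket like the "Lithuanian" one would be a stopword-ratio test. This is emphatically not what Tika's LanguageIdentifier does (it matches character n-gram profiles per language); it's just a hypothetical sanity check one could run over a sample, with a made-up 10% threshold:

```java
import java.util.Set;

// Hypothetical sanity check, not Tika's algorithm: call a document
// English if at least 10% of its tokens are very common English
// stopwords. Good enough to eyeball whether a "Lithuanian" sample
// is actually misclassified English.
public class EnglishSpotCheck {
    private static final Set<String> STOPWORDS = Set.of(
            "the", "of", "and", "to", "a", "in", "is", "it", "that", "was");

    public static boolean looksEnglish(String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        int hits = 0;
        for (String t : tokens) {
            if (STOPWORDS.contains(t)) hits++;
        }
        return tokens.length > 0 && (double) hits / tokens.length >= 0.1;
    }

    public static void main(String[] args) {
        System.out.println(looksEnglish("the cat sat on the mat and it was happy"));
    }
}
```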