The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013 (see previous blog post for more detail on that dataset). The new dataset was collected at the end of 2013, contains approximately 2.3 billion webpages and is 148TB in size. The new data is located in the commoncrawl bucket at /crawl-data/CC-MAIN-2013-48/
In 2013, we made changes to our crawling and post-processing systems. As detailed in the previous blog post, we switched file formats to the international standard WARC and WAT files. We also began using Apache Nutch to crawl – stay tuned for an upcoming blog post on our use of Nutch. The new crawling method relies heavily on the generous data donations from blekko and we are extremely grateful for blekko’s ongoing support!
In 2014 we plan to crawl much more frequently and publish fresh datasets at least once a month.