June 2016 Crawl Archive Now Available

The crawl archive for June 2016 is now available! The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/ contains more than 1.23 billion web pages.

To assist with exploring and using the dataset, we provide gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.
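As a minimal sketch of the prefixing step described above: the two prefixes are taken from the text, while the function name `to_full_paths` and the example listing line are hypothetical and only serve as illustration.

```python
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def to_full_paths(listing_lines):
    """Return (s3_url, http_url) pairs for each non-empty line of a path listing."""
    pairs = []
    for line in listing_lines:
        path = line.strip()
        if not path:
            continue  # skip blank lines
        pairs.append((S3_PREFIX + path, HTTP_PREFIX + path))
    return pairs


# Hypothetical listing line; real listings contain one relative path per line.
example_lines = ["crawl-data/CC-MAIN-2016-26/EXAMPLE/file.warc.gz\n"]
for s3_url, http_url in to_full_paths(example_lines):
    print(s3_url)
    print(http_url)
```

The lines of a gzipped listing could be read with `gzip.open(path, "rt")` from the Python standard library and passed straight to this function.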

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-26/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.
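A query against the index server might be built roughly as follows. This is a sketch under assumptions: the `-index` endpoint suffix and the `url` and `output` query parameters are not stated in this post and should be checked against the Index Server API documentation. The helper name `build_index_query` is hypothetical.

```python
from urllib.parse import urlencode


def build_index_query(crawl_id, url_pattern, output="json"):
    """Build a query URL for the index of one crawl.

    Assumption: the server exposes '<crawl_id>-index' and accepts
    'url' and 'output' parameters.
    """
    base = "http://index.commoncrawl.org/{}-index".format(crawl_id)
    return base + "?" + urlencode({"url": url_pattern, "output": output})


print(build_index_query("CC-MAIN-2016-26", "commoncrawl.org/*"))
```

The sketch only builds the URL; fetching it, for example with `urllib.request.urlopen`, is left out so that the example does not depend on network access.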

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

May 2016 Crawl Archive Now Available

The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-22/.

To assist with exploring and using the dataset, we provide gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-22/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

We are grateful to our friends at Moz for donating a seed list of 400 million URLs to enhance the Common Crawl. The seeds from Moz were used for the May crawl in addition to the seeds from the preceding crawl. Moz URL data will be incorporated into future crawls as well.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

April 2016 Crawl Archive Now Available

The crawl archive for April 2016 is now available! More than 1.33 billion webpages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-18/.

To assist with exploring and using the dataset, we provide gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-18/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

Note that the April crawl is based on the same URL seed list as the preceding crawl of February 2016. However, the manner in which the crawler follows redirects is changed: redirects are not followed immediately; instead redirect targets from the current crawl are recorded and followed by the subsequent crawl. This approach serves to avoid duplicates with exactly the same URL contained in multiple segments (e.g., one of the commoncrawl.org pages). The February crawl contains almost 10% such duplicates.
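The deferred handling of redirects can be illustrated with a toy model. This is not the crawler's actual implementation: the function `crawl_round`, the in-memory `TOY_WEB`, and the example URLs are all hypothetical. The point is only that two URLs redirecting to the same target yield a single recorded target, which is then fetched once in the following round.

```python
def crawl_round(urls, fetch):
    """Fetch each URL once; record redirect targets instead of following them."""
    fetched = {}
    deferred_targets = set()
    for url in urls:
        if url in fetched:
            continue
        status, value = fetch(url)
        if status == "redirect":
            deferred_targets.add(value)  # followed by the next crawl, not now
        else:
            fetched[url] = value
    return fetched, deferred_targets


# Toy stand-in for the web: two URLs redirect to the same target.
TOY_WEB = {
    "http://a.example/": ("redirect", "http://c.example/"),
    "http://b.example/": ("redirect", "http://c.example/"),
    "http://c.example/": ("ok", "content C"),
}


def toy_fetch(url):
    return TOY_WEB[url]


first_pages, next_seeds = crawl_round(["http://a.example/", "http://b.example/"], toy_fetch)
second_pages, _ = crawl_round(sorted(next_seeds), toy_fetch)
print(first_pages, next_seeds, second_pages)
```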

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

Welcome, Sebastian!

It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April. Sebastian brings to Common Crawl a unique blend of experience, skills, knowledge (and enthusiasm!) to complement his role and the organization.

Sebastian has a PhD in Computational Linguistics and several years of experience as a programmer working in search and data. In addition to hands-on experience maintaining and improving a Nutch-based crawler like that of Common Crawl, Sebastian is a core committer to and current chair of the open-source Apache Nutch project. Sebastian’s knowledge of machine learning techniques and natural language processing components of web crawling will help Common Crawl continually improve on and optimize the crawl process and its results.

With Sebastian on board, we have both the competence and momentum to take Common Crawl to the next level.

February 2016 Crawl Archive Now Available

As an interim crawl engineer for CommonCrawl, I am pleased to announce that the crawl archive for February 2016 is now available! This crawl archive holds more than 1.73 billion urls. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2016-07/

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The CommonCrawl Url Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-07/.

For more information on working with the url index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data. Please contact [email protected] for sponsorship information and packages.

November 2015 Crawl Archive Now Available

As an interim crawl engineer for CommonCrawl, I am pleased to announce that the crawl archive for November 2015 is now available! This crawl archive is over 151TB in size and holds more than 1.82 billion urls. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-48/

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The CommonCrawl Url Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2015-48/

For more information on working with the url index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

September 2015 Crawl Archive Now Available

As an interim crawl engineer for CommonCrawl, I am pleased to announce that the crawl archive for September 2015 is now available! This crawl archive is over 106TB in size and holds more than 1.32 billion urls. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-40/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The CommonCrawl Url Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2015-40/

For more information on working with the url index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

August 2015 Crawl Archive Available

The crawl archive for August 2015 is now available! This crawl archive is over 149TB in size and holds more than 1.84 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-35/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The release also includes the August 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

Web image size prediction for efficient focused image crawling

This is a guest blog post by Katerina Andreadou.
Katerina is a research assistant at CERTH, where she specializes in multimedia analysis and web crawling.


In the context of using Web image content for analysis and retrieval, it is typically necessary to perform large-scale image crawling. In our web image crawler setup, we noticed that a serious bottleneck pertains to the fetching of image content, since for each web page a large number of HTTP requests need to be issued to download all included image elements. In practice, however, only the relatively big images (e.g., larger than 400 pixels in width and height) are potentially of interest, since most of the smaller ones are irrelevant to the main subject or correspond to decorative elements (e.g., icons, buttons). Given that there is often no dimension information in the HTML img tag of images, to filter out small images, an image crawler would still need to issue a GET request and download the respective files before deciding whether to index them.

To address this limitation, we decided to explore the challenge of predicting the size of images on the Web based only on their URL and information extracted from the surrounding HTML code. To do so, we needed a large number of images accompanied by their HTML metadata for training and testing our image size prediction system. To this end, we decided to use a sample of the data from the July 2014 Common Crawl set, which is over 266TB in size and contains approximately 3.6 billion web pages. Since downloading the whole dataset was impractical for technical and financial reasons, and also unnecessary, we created a MapReduce job to download and parse the necessary information using Amazon Elastic MapReduce (EMR). The setup is based on a blog post by Steve Salevan. The data of interest include all images and videos from all web pages and metadata extracted from the surrounding HTML elements.
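The per-page parsing step can be sketched with the Python standard library. This is only an illustration of collecting img elements and their attributes from one HTML document; it is not the MapReduce job used in this work, and the names `ImgCollector` and `extract_images` are hypothetical.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the attributes of every img element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; self-closing tags
        # are routed here as well by the default handle_startendtag.
        if tag == "img":
            self.images.append(dict(attrs))


def extract_images(html):
    collector = ImgCollector()
    collector.feed(html)
    collector.close()
    return collector.images


page = '<p><IMG src="a.jpg" alt="beach" width="500"><img src="b.gif"/></p>'
for image in extract_images(page):
    print(image)
```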

To complete the task, we used 50 Amazon EMR medium instances, resulting in 951GB of data in gzip format. The following statistics were extracted from the corpus:

  • 3.6 billion unique images
  • 78.5 million unique domains
  • ≈8% of the images are big (width and height bigger than 400 pixels)
  • ≈40% of the images are small (width and height smaller than 200 pixels)
  • ≈20% of the images have no dimension information
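The size classes used in these statistics can be written as a small function. The 400 and 200 pixel thresholds come from the list above; the labels "OTHER" and "UNKNOWN" are hypothetical names for the cases the list does not name.

```python
def size_class(width, height):
    """Classify an image by its pixel dimensions."""
    if width is None or height is None:
        return "UNKNOWN"  # no dimension information available
    if width > 400 and height > 400:
        return "BIG"
    if width < 200 and height < 200:
        return "SMALL"
    return "OTHER"  # hypothetical label for everything in between


print(size_class(640, 480))
print(size_class(32, 32))
print(size_class(300, 250))
print(size_class(None, 100))
```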

To predict the size of Web images, we came up with three different methodologies, which are analyzed in the rest of this post. This work is described in detail in a paper presented at CBMI 2015 (13th International Workshop on Content-Based Multimedia Indexing). The paper is available online (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7153609).

Textual Features approach

An n-gram in our case is a contiguous sequence of n characters from the given image URL. The main hypothesis we make is that URLs which correspond to small and big images differ substantially in terms of wording. For instance, URLs from small images tend to contain words such as logo, avatar, small, thumb, up, down, pixels. On the other hand, URLs from big images tend to lack these words and typically contain others. If the assumption is correct, it should be possible for a supervised machine learning method to separate items from the two distinct classes.
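Character n-gram extraction from a URL can be sketched as follows. Lower-casing the URL and the choice of n = 3 in the example are assumptions made for illustration; the function names are hypothetical.

```python
from collections import Counter


def char_ngrams(url, n):
    """Return all contiguous character sequences of length n in the URL."""
    text = url.lower()  # assumption: case is ignored
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(url, n=3):
    """Count how often each n-gram occurs in the URL."""
    return Counter(char_ngrams(url, n))


print(char_ngrams("thumb", 3))
print(ngram_counts("http://example.com/thumb/thumb_01.jpg").most_common(3))
```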

The disadvantage of this approach is that, although the frequencies of the n-grams are taken into account, the correlation of the n-grams with the two classes, BIG and SMALL, is not considered. For instance, if an n-gram is very frequent in both classes, it makes sense to discard it and not consider it as a feature. On the other hand, if an n-gram is not very frequent but is very characteristic of a specific class, we should include it in the feature vector. To this end, we performed feature selection by taking into account the relative frequency of occurrence of the n-gram in the two classes, BIG and SMALL. We refer to this method as NG-trf, standing for term relative frequency.
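One way to turn this idea into a selection rule is sketched below. The scoring formula is an assumption, since the exact definition is given in the paper rather than in this post: each term is scored by the absolute difference of its relative frequencies in the two classes, normalized by their sum. A term that is equally frequent in both classes scores 0, and a term that occurs in only one class scores 1. The function name `select_features` is hypothetical.

```python
from collections import Counter


def select_features(big_docs, small_docs, k):
    """Pick the k terms whose relative frequency differs most between classes.

    big_docs / small_docs: non-empty lists of term collections, one per image URL.
    """
    big_df = Counter(term for doc in big_docs for term in set(doc))
    small_df = Counter(term for doc in small_docs for term in set(doc))
    scores = {}
    for term in set(big_df) | set(small_df):
        rel_big = big_df[term] / len(big_docs)
        rel_small = small_df[term] / len(small_docs)
        # Assumed score: 0 for equal frequencies, 1 for class-exclusive terms.
        scores[term] = abs(rel_big - rel_small) / (rel_big + rel_small)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


big = [{"photo", "img"}, {"photo", "img"}]
small = [{"thumb", "img"}, {"logo", "img"}]
print(select_features(big, small, 3))
```

In this toy input, "img" occurs in every URL of both classes and is therefore ranked last.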

In a variation of the aforementioned approach, we decided to replace n-grams with the tokens produced by splitting the image URLs at all non-alphanumeric characters. The regular expression employed in Java is \W+, and the feature extraction process is the same as described above, but with the produced tokens instead of n-grams. We will refer to this method as TOKENS-trf.
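The tokenization step might look as follows in Python, assuming the intended pattern is `\W+` (one or more non-word characters). Lower-casing is an assumption, and the function name `url_tokens` is hypothetical. Note that the underscore counts as a word character, so it does not split tokens.

```python
import re

TOKEN_SPLIT = re.compile(r"\W+")


def url_tokens(url):
    """Split a URL at runs of non-word characters and drop empty tokens."""
    return [token for token in TOKEN_SPLIT.split(url.lower()) if token]


print(url_tokens("http://example.com/img/logo-small_2.png"))
```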

Non-textual features approach

Our alternative non-textual approach does not rely on the image URL text, but rather on the metadata that can be extracted from the image HTML element. The features were chosen because they can reveal cues regarding the image dimensions. For instance, the first five features correspond to different image suffixes, and they were selected because most real-world photos are in JPG or PNG format, whereas BMP and GIF formats usually point to icons and graphics. Additionally, a real-world photo is more likely to have an alternate or parent text than a background graphic or a banner.
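A few features of this kind can be sketched as below. The full feature list is not reproduced in this post, so the suffix set, the feature names, and the function `non_textual_features` are hypothetical and only illustrate the idea of suffix flags and text-presence flags.

```python
import os
from urllib.parse import urlparse

# Assumption: an illustrative suffix set, not the feature list from the paper.
SUFFIXES = ("jpg", "jpeg", "png", "bmp", "gif")


def non_textual_features(img_attrs, parent_text=""):
    """Derive simple binary features from an img element's attributes."""
    path = urlparse(img_attrs.get("src", "")).path
    extension = os.path.splitext(path)[1].lower().lstrip(".")
    features = {"suffix_" + s: int(extension == s) for s in SUFFIXES}
    features["has_alt"] = int(bool(img_attrs.get("alt", "").strip()))
    features["has_parent_text"] = int(bool(parent_text.strip()))
    return features


attrs = {"src": "http://example.com/a/photo.JPG?x=1", "alt": "A beach"}
print(non_textual_features(attrs, parent_text="Holiday pictures"))
```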

Hybrid approach

The goal of the hybrid approach is to achieve higher performance by taking into account both textual and non-textual features. Our hypothesis is that the two methods will complement each other when aggregating their results as they rely on different kinds of features: the n-gram classifier might be best at classifying a certain kind of images with specific image URL wording, while the non-textual features classifier might be best at classifying a different kind of images with more informative HTML metadata.
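The aggregation step is not specified in this post. One simple possibility, shown here purely as an assumption, is a weighted average of the probability that each classifier assigns to the BIG class. The function name, the weight, and the threshold are hypothetical.

```python
def hybrid_predict(p_big_textual, p_big_non_textual, weight=0.5, threshold=0.5):
    """Combine two BIG-class probabilities by weighted averaging (an assumption)."""
    combined = weight * p_big_textual + (1 - weight) * p_big_non_textual
    label = "BIG" if combined >= threshold else "SMALL"
    return label, combined


print(hybrid_predict(0.9, 0.4))
print(hybrid_predict(0.2, 0.3))
```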

Evaluation

For training we used one million images (500K small and 500K big) and for testing 200 thousand (100K small and 100K big). The described supervised learning approaches were implemented based on the Weka library. We performed numerous preliminary experiments with different classifiers (LibSVM, Random Tree, Random Forest), and Random Forest (RF) was found to strike the best trade-off between good performance and acceptable training times. The main parameter of RF is the number of trees. Typical values are 10, 30 and 100, while very few problems would demand more than 300 trees. The rule of thumb is that more trees lead to better performance; however, they also considerably increase the training time.

The comparative results for different numbers of trees for the Random Forest algorithm are displayed in Table 1. The first column contains the method name, the second one the number of trees used in the RF classifier, the third one the number of features used, and the remaining columns contain the F-measures for the SMALL class, the BIG class and the average. The reported results lead to several interesting conclusions.

  • Doubling the number of n-gram features improves the performance in all cases.
  • So does adding more trees to the Random Forest classifier.
  • The hybrid method outperforms all standalone methods, its F-score of 0.935 being 0.04 (four percentage points) higher than the best textual features score of 0.895.
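For readers unfamiliar with the metric, the per-class F-measure and its average can be computed as in the sketch below. Taking the average as the arithmetic mean of the two class scores is an assumption, and the function names are hypothetical.

```python
def f1_score(true_labels, predicted_labels, positive):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_report(true_labels, predicted_labels):
    f_small = f1_score(true_labels, predicted_labels, "SMALL")
    f_big = f1_score(true_labels, predicted_labels, "BIG")
    # Assumption: the average is the arithmetic mean of the two class scores.
    return {"F1small": f_small, "F1big": f_big, "F1avg": (f_small + f_big) / 2}


truth = ["BIG", "BIG", "SMALL", "SMALL"]
predicted = ["BIG", "SMALL", "SMALL", "SMALL"]
print(f1_report(truth, predicted))
```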

Table 1: Comparative results (F-measure)

Method        RF trees   Features   F1small   F1big   F1avg
TOKENS-trf    10         1000       0.876     0.867   0.871
TOKENS-trf    30         1000       0.887     0.883   0.885
TOKENS-trf    100        1000       0.894     0.891   0.893
TOKENS-trf    10         2000       0.875     0.864   0.870
TOKENS-trf    30         2000       0.888     0.828   0.885
TOKENS-trf    100        2000       0.897     0.892   0.895
NG-tsrf-idf   10         1000       0.876     0.872   0.874
NG-tsrf-idf   30         1000       0.883     0.881   0.882
NG-tsrf-idf   100        1000       0.886     0.884   0.885
NG-tsrf-idf   10         2000       0.883     0.878   0.881
NG-tsrf-idf   30         2000       0.891     0.888   0.890
NG-tsrf-idf   100        2000       0.894     0.891   0.892
features      10         23         0.848     0.846   0.847
features      30         23         0.852     0.852   0.852
features      100        23         0.853     0.853   0.853
hybrid        -          -          0.935     0.935   0.935

Acknowledgement

This work was carried out at the Multimedia Knowledge and Social Media Analytics Lab in collaboration with Symeon Papadopoulos in the context of the REVEAL FP7 project.

July 2015 Crawl Archive Available

The crawl archive for July 2015 is now available! This crawl archive is over 145TB in size and holds more than 1.81 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-32/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The release also includes the July 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.