Web image size prediction for efficient focused image crawling

Katerina Andreadou
This is a guest blog post by Katerina Andreadou.
Katerina is a research assistant at CERTH, where she specializes in multimedia analysis and web crawling.

In the context of using Web image content for analysis and retrieval, it is typically necessary to perform large-scale image crawling. In our web image crawler setup, we noticed that a serious bottleneck pertains to the fetching of image content, since for each web page a large number of HTTP requests need to be issued to download all included image elements. In practice, however, only the relatively big images (e.g., larger than 400 pixels in width and height) are potentially of interest, since most of the smaller ones are irrelevant to the main subject or correspond to decorative elements (e.g., icons, buttons). Given that there is often no dimension information in the HTML img tag of images, to filter out small images, an image crawler would still need to issue a GET request and download the respective files before deciding whether to index them.

To address this limitation, we decided to explore the challenge of predicting the size of images on the Web based only on their URL and information extracted from the surrounding HTML code. To do so, we needed a large number of images accompanied by their HTML metadata for training and testing our image size prediction system. To this end, we decided to use a sample of the data from the July 2014 Common Crawl set, which is over 266TB in size and contains approximately 3.6 billion web pages. Since it was impractical and unnecessary, for technical and financial reasons, to download the whole dataset, we created a MapReduce job to download and parse the necessary information using Amazon Elastic MapReduce (EMR). The setup is based on a blog post by Steve Salevan. The data of interest include all images and videos from all web pages and metadata extracted from the surrounding HTML elements.

To complete the task, we used 50 Amazon EMR medium instances, resulting in 951GB of data in gzip format. The following statistics were extracted from the corpus:

  • 3.6 billion unique images
  • 78.5 million unique domains
  • ≈8% of the images are big (width and height bigger than 400 pixels)
  • ≈40% of the images are small (width and height smaller than 200 pixels)
  • ≈20% of the images have no dimension information

To predict the size of Web images, we came up with three different methodologies, which are analyzed in the rest of this post. This work is described in detail in a paper presented at CBMI 2015 (13th International Workshop on Content-Based Multimedia Indexing). The paper is available online (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7153609).

Textual Features approach

An n-gram in our case is a contiguous sequence of n characters from the given image URL. The main hypothesis we make is that URLs which correspond to small and big images differ substantially in terms of wording. For instance, URLs from small images tend to contain words such as logo, avatar, small, thumb, up, down, pixels. On the other hand, URLs from big images tend to lack these words and typically contain others. If the assumption is correct, it should be possible for a supervised machine learning method to separate items from the two distinct classes.
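
As a minimal sketch of the n-gram extraction described above, the following Python snippet counts the character n-grams of an image URL. The function name char_ngrams, the lower-casing step and the example URL are illustrative assumptions, not the code used for the paper.

```python
from collections import Counter

def char_ngrams(url, n=3):
    """Count every contiguous run of n characters in the (lower-cased) URL."""
    text = url.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Hypothetical URL used only for illustration.
counts = char_ngrams("http://example.com/img/logo_small.png", n=5)
print(counts["small"])   # number of times the 5-gram "small" occurs
```

The resulting counts can serve as the raw frequency features that a classifier is trained on.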

The disadvantage of this approach is that, although the frequencies of the n-grams are taken into account, what is not considered is the correlation of the n-grams to the two classes, BIG and SMALL. For instance, if an n-gram is very frequent in both classes, it makes sense to get rid of it and not consider it as a feature. On the other hand, if an n-gram is not very frequent, but it is very characteristic of a specific class, we should include it in the feature vector. To this end, we performed feature selection by taking into account the relative frequency of occurrence of the n-gram in the two classes, BIG and SMALL. We refer to this method as NG-trf, standing for term relative frequency.
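
The selection idea can be sketched as follows. The exact scoring formula is not given in this post, so the score used here (the absolute difference between the fraction of BIG URLs and the fraction of SMALL URLs containing a feature) is an assumption, as are the function name select_features and the toy data.

```python
from collections import Counter

def select_features(big_docs, small_docs, k):
    """Rank features by how differently they occur in the BIG and SMALL classes.

    big_docs / small_docs: lists of feature collections, one per image URL.
    The score |rel_big - rel_small| is an assumed stand-in for the paper's formula.
    """
    big = Counter(f for doc in big_docs for f in set(doc))
    small = Counter(f for doc in small_docs for f in set(doc))
    scores = {
        f: abs(big[f] / len(big_docs) - small[f] / len(small_docs))
        for f in set(big) | set(small)
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]

# Toy data: "img" is frequent in both classes, "thumb" is characteristic of SMALL.
big_docs = [["img", "photo"], ["img", "large"], ["img", "photo"]]
small_docs = [["img", "thumb"], ["img", "thumb"], ["img", "icon"]]
print(select_features(big_docs, small_docs, k=2))
```

In this toy example "img" scores zero because it appears equally often in both classes, which is exactly the kind of feature the selection step is meant to discard.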

In a variation of the aforementioned approach, we decided to replace n-grams with the tokens produced by splitting the image URLs by all non-alphanumeric characters. The employed regular expression in Java is \W+ and the feature extraction process is the same as described above, but with the produced tokens instead of n-grams. We will refer to this method as TOKENS-trf.
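
A rough Python equivalent of the tokenization step is shown below, assuming the pattern meant is \W+ (one or more non-word characters). Note that \W treats the underscore as a word character, so tokens such as logo_small are not split. The function name and the example URL are illustrative.

```python
import re

def url_tokens(url):
    """Split a URL on runs of non-word characters, dropping empty tokens."""
    return [t for t in re.split(r"\W+", url.lower()) if t]

print(url_tokens("http://example.com/images/avatar-small.png"))
```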

Non-textual features approach

Our alternative non-textual approach does not rely on the image URL text, but rather on the metadata that can be extracted from the image HTML element. The features were chosen because they may reveal cues regarding the image dimensions. For instance, the first five features correspond to different image suffixes and they were selected due to the fact that most real-world photos are in JPG or PNG format, whereas BMP and GIF formats usually point to icons and graphics. Additionally, there is a greater likelihood that a real-world photo has an alternate or parent text than a background graphic or a banner.

Hybrid approach

The goal of the hybrid approach is to achieve higher performance by taking into account both textual and non-textual features. Our hypothesis is that the two methods will complement each other when aggregating their results as they rely on different kinds of features: the n-gram classifier might be best at classifying a certain kind of images with specific image URL wording, while the non-textual features classifier might be best at classifying a different kind of images with more informative HTML metadata.


For training we used one million images (500K small and 500K big) and for testing 200 thousand (100K small and 100K big). The described supervised learning approaches were implemented based on the Weka library. We performed numerous preliminary experiments with different classifiers (LibSVM, Random Tree, Random Forest), and Random Forest (RF) was found to be the one striking the best trade-off between good performance and acceptable training times. The main parameter of RF is the number of trees. Some typical values for this are 10, 30 and 100, while very few problems would demand more than 300 trees. The rule of thumb is that more trees lead to better performance; however, they also considerably increase the training time.

The comparative results for different numbers of trees for the Random Forest algorithm are displayed in Table 1. The first column contains the method name, the second one the number of trees used in the RF classifier, the third one the number of features used, and the remaining columns contain the F-measures for the SMALL class, the BIG class and the average. The reported results lead to several interesting conclusions.

  • Doubling the number of n-gram features improves the performance in all cases.
  • So does adding more trees to the Random Forest classifier.
  • The hybrid method outperforms all standalone methods, its best F-score being 4% higher than the best textual features score.

Table 1: Comparative results (F-measure)

Columns: method, RF trees, number of features, F-measure (SMALL), F-measure (BIG), F-measure (average).

This work was carried out at the Multimedia Knowledge and Social Media Analytics Lab in collaboration with Symeon Papadopoulos in the context of the REVEAL FP7 project.

Announcing the Common Crawl Index!

This is a guest post by Ilya Kreymer.
Ilya is a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl. He previously worked at the Internet Archive and led the Wayback Machine development, which included building large indexes of WARC files. Ilya is currently developing a new set of archive replay and access tools and an impressive new on-demand archiving service, webrecorder.io, that allows anyone to create a high-fidelity web archive of their own. Check out his exciting projects, including our new index and query api in the post below.

We are pleased to announce a new index and query api system for Common Crawl.

The raw index data is available, per crawl, at:

There is now an index for the Jan 2015 and Feb 2015 crawls. Going forward, a new index will be available at the same time as each new crawl.

To make working with the index a bit simpler, an api and service for querying the index is available at: http://index.commoncrawl.org. The index is fully functional but we are looking for feedback to improve the usefulness and usability of the index for future updates.

Index Format
The index format is relatively simple: It consists of a plaintext index (with one line for each entry) compressed into gzipped chunks, and a secondary index of the compressed chunks. This index is often called the ‘ZipNum’ CDX format and it is the same format that is used by the Wayback Machine at the Internet Archive.
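
To illustrate how a secondary index makes lookups cheap, here is a simplified sketch of the binary search step. The tuple layout (first key of the block, byte offset, length), the example keys and the function name find_block are assumptions made for illustration; the real secondary index lines carry more detail.

```python
import bisect

# Hypothetical secondary index: the first key of each compressed block, in sorted
# order, with the byte offset and length of that block in the main index file.
secondary = [
    ("com,example)/", 0, 1200),
    ("org,apache)/", 1200, 900),
    ("org,wikipedia)/", 2100, 1500),
]

def find_block(secondary, key):
    """Return the (first_key, offset, length) entry of the block that may hold key."""
    first_keys = [entry[0] for entry in secondary]
    # Rightmost block whose first key is <= the search key.
    pos = bisect.bisect_right(first_keys, key) - 1
    return secondary[max(pos, 0)]

print(find_block(secondary, "org,mozilla)/"))
```

Only the block returned by the search needs to be fetched and decompressed, which is why the secondary index can be kept locally while the bulk of the index stays remote.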

Index Query API
To make working with the index a bit easier, the main index site (http://index.commoncrawl.org) contains a readily accessible api for querying the index.

The api is a variation of the ‘cdx server api’ or ‘capture index server api’ that was originally built for the Wayback Machine.

The site is built using pywb (https://github.com/ikreymer/pywb), a new collection of web archive replay and query tools, including the index query component.

The index can be queried by making a request to a specific collection.

For example, the following query looks up “wikipedia.org” in the CC-MAIN-2015-11 (Feb 2015) crawl:


The above query will only retrieve captures from the exact url “wikipedia.org/”, but a frequent use case may be to retrieve all urls from a path or all subdomains.

This can be done by using wildcard queries:


For most prefix or domain prefix queries such as these, it is not feasible to retrieve all the results at once, and only the first page of results (by default, up to 15000) is returned.

The total number of pages can be retrieved with the showNumPages query:


This query returns:

{"blocks": 4942, "pages": 989, "pageSize": 5}

This indicates that there are 989 total pages, at 5 compressed index blocks per page!

Thus, to get all of *.wikipedia.org, one would need to perform the query for each page:



This allows for the query process to be performed in parallel. For example, it should be possible to run a MapReduce job which computes the number of pages, creates a list of urls, and then runs the query in parallel.
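
The list of per-page query urls could be generated along these lines. The endpoint path (the collection name followed by -index), the zero-based page parameter and the helper name page_urls are assumptions based on the pywb cdx server api, since the original example queries are not reproduced here.

```python
from urllib.parse import urlencode

def page_urls(base, url_pattern, num_pages, **params):
    """Build one query URL per result page so the pages can be fetched in parallel."""
    return [
        base + "?" + urlencode({"url": url_pattern, "page": page, **params})
        for page in range(num_pages)
    ]

# Assumed endpoint layout; check the index site for the real collection path.
base = "http://index.commoncrawl.org/CC-MAIN-2015-11-index"
urls = page_urls(base, "*.wikipedia.org", 989, output="json")
print(len(urls), urls[0])
```

Each url in the list is independent of the others, so the list can be split across workers in whatever way suits the job.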

Command-Line Client
For smaller use cases, a simple client side library is available to simplify this process: https://github.com/ikreymer/cdx-index-client. This is a simple Python script which uses the pagination api to perform a parallel query on a local machine.

First, a good idea is to verify the number of pages:
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org --show-num-pages

To perform the query, simply run:
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org -z -d ./wikipedia-index

This query will fetch all pages of the *.wikipedia.org index into a ./wikipedia-index directory and keep the data compressed (-z flag). For a full set of options, you may run
./cdx-index-client.py --help

The script will print out an update of the progress:

2015-04-07 08:35:18,686: [INFO]: Fetching 989 pages of *.wikipedia.org
2015-04-07 08:35:45,734: [INFO]: 1 page(s) of 989 finished
2015-04-07 08:35:46,577: [INFO]: 2 page(s) of 989 finished
2015-04-07 08:35:46,579: [INFO]: 3 page(s) of 989 finished

Adjusting Page Size:
It is also possible to adjust the page size to increase or decrease the number of "blocks" in the page. (Each block is a compressed chunk and cannot be split further.)
The pageSize query param can be used to set the page size in blocks (the default is 5 blocks per page). For example:

{"blocks": 4942, "pages": 989, "pageSize": 5}

{"blocks": 4942, "pages": 4942, "pageSize": 1}

In general, pages = ceil(blocks / pageSize), that is, blocks / pageSize rounded up to the next whole number. Adjusting the page size can help adjust the parallelization and load of the query as needed.
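
Both responses quoted above are consistent with rounding blocks / pageSize up to the next whole number. A small sketch of that arithmetic (the function name num_pages is illustrative):

```python
import math

def num_pages(blocks, page_size=5):
    """Number of result pages when each page holds up to page_size blocks."""
    return math.ceil(blocks / page_size)

print(num_pages(4942, 5))  # 989
print(num_pages(4942, 1))  # 4942
```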

Capture Index JSON (CDXJ) Line Format
The raw index format (stored and returned from the query api) is as follows:

org,wikipedia)/ 20150227035757 {"url": "http://www.wikipedia.org/", "digest": "PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ", "length": "11996", "offset": "853671193", "filename": "crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz"}

This format consists of a ‘url<space>timestamp<space>’ header followed by a json dictionary. The header is used to ensure the lines are sorted by url key and timestamp.
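
A line in this format can be parsed with a few lines of Python. This is a minimal sketch: the function name parse_cdxj is illustrative, and the sample line is a shortened version of the one above with fewer fields.

```python
import json

def parse_cdxj(line):
    """Split a CDXJ line into (urlkey, timestamp, fields)."""
    # The JSON part may contain spaces, so split on the first two spaces only.
    urlkey, timestamp, payload = line.split(" ", 2)
    return urlkey, timestamp, json.loads(payload)

line = ('org,wikipedia)/ 20150227035757 '
        '{"url": "http://www.wikipedia.org/", "length": "11996", "offset": "853671193"}')
urlkey, timestamp, fields = parse_cdxj(line)
print(urlkey, timestamp, int(fields["offset"]), int(fields["length"]))
```

Note that offset and length are stored as strings in the json dictionary, so they need to be converted before being used as byte positions.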

Adding the output=json option to the query will ensure the full line is json. Example:

{"urlkey": "org,wikipedia)/", "timestamp": "20150227035757", "url": "http://www.wikipedia.org/", "length": "11996", "filename": "crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz", "digest": "PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ", "offset": "853671193"}

Currently, the index contains the urlkey (a canonicalized, reverse-order form of the url), the timestamp, the url, and the WARC filename, offset and length, as well as a checksum (digest) of the content. The digest can be used to identify duplicate captures, but also adds significantly to the index size and may be removed in future versions. Other fields may be added to the json dictionary as needed also.

It is possible to only select certain fields from the query with the fl field. For example, the following query will return only the url:


or via command-line tool:

./cdx-index-client.py -c CC-MAIN-2015-11 http://wikipedia.org --fl url

Multiple fields can also be specified, e.g. fl=url,length to return only the url and the WARC record length.

For a full reference of available query params, consult the latest CDX Server API reference

Additional Java Tools
For Java users wishing to access the raw index, the IIPC webarchive-commons has support for reading the ZipNum format. Additionally, the openwayback-cdx-server provides the Java implementation of the original cdx server api. However, some modifications would be required to that codebase to support the cdx json format and it has not been tested with this index.

Building the Index / Running CDX Index Server
All the tools for building the index are also available at: https://github.com/ikreymer/webarchive-indexing

The index was built using EMR and the mrjob Python library, and the indexing tools from the pywb project. This should enable others to build the index in the future, or create customized versions of the index as needed. Please refer to the project for additional reference and do not hesitate to get in touch with any specific questions.

The service running at http://index.commoncrawl.org is also available at:


To run locally, the secondary index (for binary search) for each collection will need to be fetched locally, while most of the index will be read from S3.

Request for Feedback and Future Plans
Although the index format is pretty well-tested, there is lots of room for customization, especially of the index query api, as well as what fields to include in the index. Feedback in the form of bug reports/feature requests/questions/suggestions about any aspect of the index is definitely welcome to make the index even easier to use. Please do not hesitate to get back with any feedback about the index.

After some additional testing of the newly released indexes, we plan to build an index for older crawls as well. A cumulative index of all data ever crawled by CommonCrawl is also under consideration if there is enough interest. We look forward to hearing about any use cases or other feedback that you may have about the indexing project.

Please help us continue our support of great efforts like this by making a donation to the Common Crawl Foundation and follow us @CommonCrawl on Twitter for the latest in Big Open Data.

WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

This is a guest blog post by Ross Fairbanks.

Ross Fairbanks is a software developer based in Barcelona. He mainly develops in Ruby and is interested in open data and cloud computing. This guest post describes his open data project wikireverse.org and why he built it.

What is WikiReverse?

WikiReverse [1] is an application that highlights web pages and the Wikipedia articles they link to. The project is based on Common Crawl’s July 2014 web crawl, which contains 3.6 billion pages. The results produced 36 million links to 4 million Wikipedia articles. Most of the results are from English Wikipedia (which had 32 million links) followed by Spanish, Indonesian and German. In total there are results for 283 languages.

I first heard about Common Crawl in a blog post by Steve Salevan, MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl [2]. Running Steve’s code deepened my interest in the project. What I like most is the efficiency savings of a large web scale crawl that anyone can access. Attempting to crawl the same volume of web pages myself would have been vastly more expensive and time consuming.

I found that the data can be processed relatively cheaply, as it cost just $64 to process the metadata for 3.6 billion pages. This was achieved by using spot instances, which are spare server capacity that Amazon Web Services auctions off when demand is low. This saved $115 compared to using full price instances.

There is great value in the Common Crawl archive; however, it is difficult to see with no interface to the data. It can be hard to visualize the possibilities and what can be done with the data. For this reason, my project runs an analysis over an entire crawl with a resulting site that allows the findings to be viewed and searched.

I chose to look at reverse links because, despite its relatively simple approach, it exposes interesting data that is normally deeply hidden. Wikipedia articles are often cited on the web and they appear highly in search results. I was interested in seeing how many links these articles have and what types of sites are linking to them.

A great benefit of working with an open dataset like Common Crawl’s is that WikiReverse results can be released very quickly to the public. Already, Gianluca Demartini from the University of Sheffield has released Who links to Wikipedia? [3] on the Wikimedia blog. This is an analysis of which top-level domains appear in the results. It is encouraging to see the interest in open data projects and hopefully more analyses of these types will be done.

Choosing Wikipedia also means the project can continue to benefit from the wide range of open data they release. The DBpedia [4] project uses raw data dumps released by Wikipedia and creates structured datasets for many aspects of data, including categories, images and geographic locations. I plan on using DBpedia to categorize articles in WikiReverse.

The code developed to analyze the data is available on Github. I’ve written a more detailed post on my blog on the data pipeline [5] that was developed to generate the data. The full dataset can be downloaded using BitTorrent. The data is 1.1 GB when compressed and 5.4 GB when extracted. Hopefully this will help others build their own projects using the Common Crawl data.

[1] https://wikireverse.org/
[2] https://commoncrawl.org/2011/12/mapreduce-for-the-masses/
[3] http://blog.wikimedia.org/2015/02/03/who-links-to-wikipedia/
[4] http://dbpedia.org/About
[5] https://rossfairbanks.com/2015/01/23/wikireverse-data-pipeline.html

Lexalytics Text Analysis Work with Common Crawl Data

Oskar Singer

This is a guest blog post by Oskar Singer

Oskar Singer is a Software Developer and Computer Science student at the University of Massachusetts Amherst. He recently did some very interesting text analytics work during his internship at Lexalytics. The post below describes the work, how Common Crawl data was used, and includes a link to code.

At Lexalytics, I have been working with our head of software engineering, Paul Barba, on improving our accuracy with Twitter data for POS-tagging, entity extraction, parsing and ultimately sentiment analysis through building an interesting model-based approach for handling misspelled words.

Our approach involves a spell checker that automatically corrects the input text internally for the benefit of the engine and outputs the original text for the benefit of the engine user, so this must be a different kind of automated spell-correction.

The First Attempt:

Our first attempt was to take the top scoring word from the list of unranked correction suggestions provided by Hunspell, an open-source spell checking library. We calculated each suggestion’s score as word frequency from Common Crawl data divided by string edit distance with consideration for keyboard distance.
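
The scoring rule can be sketched as follows. This is a simplification under stated assumptions: it uses plain Levenshtein distance and leaves out the keyboard-distance adjustment, and the function names and the frequencies are made up for illustration rather than taken from Hunspell or from Common Crawl.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def best_suggestion(word, suggestions, freq):
    """Pick the suggestion maximizing word frequency / edit distance."""
    def score(s):
        # Guard against division by zero when a suggestion equals the word.
        return freq.get(s, 0) / max(edit_distance(word, s), 1)
    return max(suggestions, key=score)

# Made-up frequencies for illustration; real counts would come from the corpus.
freq = {"the": 1000, "tea": 50, "ten": 80}
print(best_suggestion("teh", ["the", "tea", "ten"], freq))
```

Because the score looks only at the word itself, it has no way to tell which of several valid words fits the sentence, which is the weakness described next.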

The resulting corrections were scored against hand-corrected tweets by counting the number of tokens that differed. Hunspell scored worse than the original tweets. It corrected usernames and hashtags and gave totally unreasonable suggestions. My favorite Hunspell correction was the mapping from “ur” (as in the short-form for “your” or “you’re”) to “Ur” (as in the ancient Mesopotamian city-state).

Hunspell also missed mistakes like misused homophones, which did not count as a misspelling when considered in isolation. This last issue seemed to be the primary issue with our data, so the problem required a method with the ability to consider context.

The Second (and final) Attempt:

We title the next attempt “the Switchabalizer”, and it can be summarized as a multinomial, sliding-window, Naive-Bayes word classifier. On a high level, we classify each of the target words in a piece of text, based on the preceding and succeeding words, as itself or one of its homophones.

The training process starts with a list of bigrams from the Common Crawl data paired with their occurrence counts. We use this data to calculate P(w_{i-1} | w_i) = #(w_{i-1} w_i) / #(w_{i-1}) and P(w_{i+1} | w_i) = #(w_i w_{i+1}) / #(w_{i+1}), where w_i is the current word, w_{i-1} is the preceding word and w_{i+1} is the succeeding word. These probabilities are serialized and archived so they can be deserialized into C++ data structures instead of recalculated for each instantiation of the spell check object. In other words, we’re building a set of probabilities that each switchable “generated” the words preceding and succeeding w_i.

The inference process starts with a set S of sets and an inverted index. Each s ∈ S represents a group of commonly confused homophones (e.g. two, too, 2, to), and no word is a member of multiple s ∈ S. The inverted index maps each word w in the union of all s ∈ S to the s in which w holds membership. Each word w_i in the ordered sequence of words in a document is checked for an entry in the inverted index. If an entry V is found, the algorithm replaces w_i with argmax_{v ∈ V} P(v), where P(v) = P(w_{i-1} | v) + P(w_{i+1} | v).
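
The training and inference steps can be sketched in Python as below. This is not the Lexalytics implementation (which is in C++): it follows the formulas as written in this post, approximates the word counts by summing bigram counts, and uses hypothetical function names and toy bigram counts.

```python
from collections import defaultdict

def train(bigram_counts):
    """Build the two tables from {(left, right): count}, using the formulas as written."""
    left_total = defaultdict(int)   # #(w) when w is the first word of a bigram
    right_total = defaultdict(int)  # #(w) when w is the second word of a bigram
    for (left, right), n in bigram_counts.items():
        left_total[left] += n
        right_total[right] += n
    # p_prev[(prev, w)] = #(prev w) / #(prev); p_next[(nxt, w)] = #(w nxt) / #(nxt)
    p_prev = {(l, r): n / left_total[l] for (l, r), n in bigram_counts.items()}
    p_next = {(r, l): n / right_total[r] for (l, r), n in bigram_counts.items()}
    return p_prev, p_next

def correct(tokens, groups, p_prev, p_next):
    """Replace each switchable word with the best-scoring member of its group."""
    index = {w: g for g in groups for w in g}  # inverted index: word -> its group
    out = []
    for i, w in enumerate(tokens):
        if w not in index:
            out.append(w)
            continue
        prev = tokens[i - 1] if i > 0 else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # List the original word first so that ties keep it unchanged.
        candidates = [w] + [v for v in index[w] if v != w]
        out.append(max(candidates, key=lambda v: p_prev.get((prev, v), 0.0)
                                                + p_next.get((nxt, v), 0.0)))
    return out

# Toy bigram counts, invented for illustration.
bigrams = {("going", "to"): 50, ("going", "too"): 1, ("to", "the"): 40,
           ("is", "too"): 20, ("too", "much"): 30,
           ("have", "two"): 10, ("two", "dogs"): 10}
groups = [("to", "too", "two", "2")]
p_prev, p_next = train(bigrams)
print(" ".join(correct("i am going too the store".split(), groups, p_prev, p_next)))
```

The sketch scores each candidate from the original neighbouring words only, so a wrong neighbour can still mislead it; higher-order context would be needed to reduce that effect.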


As a matter of efficiency, we assumed that Wikipedia articles have perfect use of the target homophones. I wrote a Python script that took in text, randomly replaced target homophones with members of their switchable set, then output the result.
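
A test generator of this kind might look like the sketch below. It is not the original script: the function name corrupt, the replacement rate and the fixed seed are assumptions, each group is assumed to have at least two members, and the sketch always picks a different member of the group so that every replacement introduces an error.

```python
import random

def corrupt(tokens, groups, rate=0.5, seed=0):
    """Randomly swap switchable words for another member of their group."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    index = {w: g for g in groups for w in g}
    out, changed = [], []
    for i, w in enumerate(tokens):
        if w in index and rng.random() < rate:
            out.append(rng.choice([v for v in index[w] if v != w]))
            changed.append(i)  # remember the position for later scoring
        else:
            out.append(w)
    return out, changed

groups = [("to", "too", "two"), ("their", "there")]
tokens = "they went to their house".split()
noisy, changed = corrupt(tokens, groups, rate=1.0)
print(noisy, changed)
```

Keeping the list of changed positions makes it possible to compare a checker's corrections against exactly the words the generator altered.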

We ran the Switchabalizer on this data and compared to the original Wikipedia data. Comparing the corrections to the words changed by our test generator, Hunspell, even when forced to ignore usernames, had a 216% error rate (i.e. it made false corrections), and the Switchabalizer had a 20% error rate. Although the test data does not match the target data, the massive and varied data set provided by Common Crawl should ensure good results from the Switchabalizer on many types of data, hopefully even the near-nonsense from the bowels of Twitter.


The Switchabalizer approach is clearly superior to a traditional spell checker for our targeted issues, but still requires significant testing, tuning and improvement. The following section provides some possibilities for improvement and expansion. We hope this approach can be of use to other people with the same problem, and we would like to thank Common Crawl for the fantastic resource that they provide!

Future Work:

Possible future experiments include further testing on different types of data, integration of higher-order n-gram features, implementation of a discriminative model, implementation for other languages, and corrections of common misspellings like “ur”, which cannot be included in sets of switchables without risking the model mapping words to non-words.

The commented Python scripts that generate the testing data and perform feature extraction/training/feature selection can be found on my github account at https://github.com/oskarsinger/PythonScriptsFromLexalytics/tree/master/AutomatedSpellCheck/

Hyperlink Graph from Web Data Commons

The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus.

Altogether, they found 128 billion hyperlinks connecting 3.5 billion pages.

They have published the resulting graph today together with some results from the analysis of the graph.


To the best of our knowledge, this graph is the largest hyperlink graph that is available to the public!

Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data

Sebastian Spiegler is the head of the data team at SwiftKey and a volunteer at Common Crawl. Yesterday we posted Sebastian’s statistical analysis of the 2012 Common Crawl corpus. Today we are following it up with a great video featuring Sebastian talking about why crawl data is valuable, his research, and why open data is important.

The video is an excellent illustration of how startups can benefit from Common Crawl data and we hope that it inspires other startups to use our data!



A Look Inside Our 210TB 2012 Web Corpus

Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler!

Sebastian is a highly talented data scientist who works at the London based startup SwiftKey and volunteers at Common Crawl. He did an exploratory analysis of the 2012 Common Crawl data and produced an excellent summary paper on exactly what kind of data it contains: Statistics of the Common Crawl Corpus 2012.

From the conclusion section of the paper:

The 2012 Common Crawl corpus is an excellent opportunity for individuals or businesses to cost-effectively access a large portion of the internet: 210 terabytes of raw data corresponding to 3.83 billion documents or 41.4 million distinct second-level domains. Twelve of the top-level domains have a representation of above 1% whereas documents from .com account to more than 55% of the corpus. The corpus contains a large amount of sites from youtube.com, blog publishing services like blogspot.com and wordpress.com as well as online shopping sites such as amazon.com. These sites are good sources for comments and reviews. Almost half of all web documents are utf-8 encoded whereas the encoding of the 43% is unknown. The corpus contains 92% HTML documents and 2.4% PDF files. The remainder are images, XML or code like JavaScript and cascading style sheets.

View or download a pdf of Sebastian’s paper here. If you want to dive deeper you can find the non-aggregated data at s3://commoncrawl/index2012 and the code on GitHub.

Common Crawl Discussion List

We have started a Common Crawl discussion list to enable discussions and encourage collaboration between the community of coders, hackers, data scientists, developers and organizations interested in working with open web crawl data. Please join our discussion mailing list to:

  • Discuss challenges
  • Share ideas for projects and products
  • Look for collaborators and partners
  • Offer advice and share methods
  • Ask questions and get advice from others
  • Show off cool stuff you build
  • Keep up to date on the latest news from Common Crawl

The Common Crawl discussion list uses Google Groups and you can sign up here.

Answers to Recent Community Questions

It was wonderful to see our first blog post and the great piece by Marshall Kirkpatrick on ReadWriteWeb generate so much interest in Common Crawl last week! There were many questions raised on Twitter and in the comment sections of our blog, RWW and Hacker News. In this post we respond to the most common questions. Because it is a long blog post, we have provided a navigation list of questions below. Thanks for all the support and please keep the questions coming!

*Is there a sample dataset or sample .arc file?
*Is it possible to get a list of domain names?
*Is the code open source?
*Where can people obtain access to the Hadoop classes and other code?
*Where can people learn more about the stack and the processing architecture?
*How do you deal with spam and deduping?
*Why should anyone care about five billion pages when Google has so many more?
*How frequently is the crawl data updated?
*How is the metadata organized and stored?
*What is the cost for a simple Hadoop job over the entire corpus?
*Is the data available by torrent?

Is there a sample dataset or sample .arc file?
We are currently working to create a sample dataset so people can consume and experiment with a small segment of the data before dealing with the entire corpus. One commenter suggested that we create a focused crawl of blogs and RSS feeds, and I am happy to say that is just what we had in mind. Stay tuned: We will be announcing the sample dataset soon and posting a sample .arc file on our website even sooner!

Is it possible to get a list of domain names?
We plan to make the domain name list available in a separate metadata bucket in the near future.

Is your code open source?
Anything required to access the buckets or the Common Crawl data that we publish is open source, and any utility code that we develop as part of the crawl is also going to be made open source. However, the crawl infrastructure depends on our internal MapReduce and HDFS file system, and it is not yet in a state that would be useful to third parties. In the future, we plan to break more parts of the internal source code into self-contained pieces to be released as open source.

Where can people access the Hadoop classes and other code?
We have a GitHub repository that was temporarily down due to some accidental check-ins. It is now back up and can be found here and on the Accessing the Data page of our website.

Where can people learn more about the stack and the processing architecture?
We plan to make the details of our internal infrastructure available in a detailed blog post as soon as time allows. We are using all of our engineering brainpower to optimize the crawler, but we expect to have the bandwidth for additional technical documentation and communication soon. Meanwhile, you can check out a presentation given at a Hadoop user group by Ahad Rana on SlideShare.

How do you deal with spam and deduping?
We use shingling and simhash to do fuzzy deduping of the content we download. The corpus in S3 has not been filtered for spam, because it is not clear whether we should really remove spammy content from the crawl. For example, individuals who want to build a spam filter need access to a crawl with spam. This might be an area in which we can work with the open-source community to develop spam lists/filters.
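
For readers unfamiliar with these techniques, here is a minimal sketch of shingling and simhash. It is not the Common Crawl implementation: the shingle size of three words, the use of MD5 as the underlying hash and the 64-bit fingerprint are all assumptions chosen for illustration.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def simhash(text, k=3, bits=64):
    """Compute a simhash fingerprint over the text's shingles."""
    votes = [0] * bits
    for sh in shingles(text, k):
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            # Each shingle votes +1 or -1 on every bit position.
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumps over the lazy cat")
print(hamming(a, b))
```

Documents whose fingerprints differ in only a few bits are treated as near-duplicates; the threshold is a tuning choice that this sketch does not fix.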

In addition, we do not have the resources necessary to police the accuracy of any spam filters we develop and currently can only rely on algorithmic means of determining spam, which can sometimes produce false positives.

Why should anyone care about five billion pages when Google has so many more?
Although this question was not common like the others addressed in this post, I would like to respond to a comment on our blog:

“If 5 bln. is just the total number of different URLs you’ve downloaded, then it ain’t much. Google’s index was 1 billion way back in 2000, They’ve downloaded a trillion URLs by 2008. And they say most of is junk, that is simply not worth indexing.”

We are not trying to replace Google; our goal is to provide a high-quality, open corpus of web crawl data.

We agree that many of the pages on the web are junk, and we have no inclination to crawl a larger number of pages just for the sake of having a larger number. Five billion pages is a substantial corpus and, though we may expand the size in the near future, we are focused on quality over quantity.

Also, when Google announced they had a trillion URLs, that was the number of URLs they were aware of, not the number of pages they had downloaded. We have 15 billion URLs in our database, but we don’t currently download them all because those additional 10 billion are—in our judgment—not nearly as important as the five billion we do download. One outcome from our focus on the crawl’s quality is our system of ranking pages, which allows us to determine how important a page is and which of the five billion pages that make up our corpus are among the most important.

How frequently is the crawl data updated?
We spent most of 2011 tweaking the algorithms to improve the freshness and quality of the crawl. We will soon start the improved crawler. In 2012 there will be fresher and more consistent updates – we expect to crawl continuously and update the S3 buckets once a month.

We hope to work with the community to determine what additional metadata and focused crawls would be most valuable and what subsets of web pages should be crawled with the highest frequency.

How is the metadata organized and stored?
The page rank and other metadata we compute are not part of the S3 corpus, but we do collect this information and expect to make it available in a separate S3 bucket in Hadoop SequenceFile format. On the subject of page ranking, please be aware that the page rank we compute for pages may not correlate closely with Google’s PageRank, since we do not use their PageRank algorithm.

What is the cost for a simple Hadoop job over the entire corpus?
A rough estimate yields an answer of ~$150. Here is how we arrived at that estimate:

  • The Common Crawl corpus is approximately 40TB.
    • Crawl data is stored on S3 in the form of 100MB compressed archives.
    • There are between 400K and 500K such files in the corpus.
  • If you open multiple S3 streams in parallel, maintain an average throughput of 1MB/sec per S3 stream and start 10 parallel streams per Mapper, you should sustain a throughput of 10MB/sec per Mapper.
  • If you run one Mapper per EC2 small instance and start 100 instances, you would have an aggregate throughput of ~3TB/hour.
  • At that rate you would need 16 hours to scan 50TB of data – a total of 1600 machine hours.
  • 1600 machine hours at $0.085 per hour will cost ~$136.
  • The cost of any subsequent aggregation/data consolidation jobs and the cost of storing your final data on S3 brings you to a total cost of approximately $150.
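
The arithmetic behind this estimate can be reproduced with a short Python sketch. The function and parameter names are hypothetical, and decimal units (1 TB = 1,000,000 MB) are assumed; the input figures are the ones quoted in the list above.

```python
def aggregate_throughput_tb_per_hour(mb_per_sec_per_stream, streams_per_mapper, mappers):
    """Aggregate read throughput in TB/hour (assumes 1 TB = 1,000,000 MB)."""
    mb_per_sec = mb_per_sec_per_stream * streams_per_mapper * mappers
    return mb_per_sec * 3600 / 1_000_000


def scan_cost(data_tb, throughput_tb_per_hour, instances, usd_per_instance_hour):
    """Return (wall-clock hours, machine hours, cost in USD) for one full scan."""
    hours = data_tb / throughput_tb_per_hour
    machine_hours = hours * instances
    return hours, machine_hours, machine_hours * usd_per_instance_hour


if __name__ == "__main__":
    # 1MB/sec per stream, 10 streams per Mapper, one Mapper on each of 100 instances.
    peak = aggregate_throughput_tb_per_hour(1, 10, 100)
    print(f"theoretical aggregate throughput: {peak:.1f} TB/hour")

    # The estimate above uses a more conservative ~3TB/hour over 50TB of data.
    hours, machine_hours, usd = scan_cost(50, 3, 100, 0.085)
    print(f"{hours:.1f} hours, {machine_hours:.0f} machine hours, ${usd:.2f}")
```

Without intermediate rounding, the scan takes slightly longer than 16 hours and the compute cost comes out a little higher than the figure for 1600 machine hours, but it stays within the rough total of approximately $150.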

Is the data available by torrent?
Do you mean the distribution of a subset of the data via torrents, or do you mean the distribution of updates to the crawl via torrents? The current data set is 40+ TB in size, and it seems to us to be too big to be distributed via this mechanism, but perhaps we are wrong. If you have some ideas about how we could go about doing this, and whether or not it would require significant bandwidth resources on our part, we would love to hear from you.




Common Crawl Enters A New Phase

A little under four years ago, Gil Elbaz formed the Common Crawl Foundation. He was driven by a desire to ensure a truly open web. He knew that decreasing storage and bandwidth costs, along with the increasing ease of crunching big data, made building and maintaining an open repository of web crawl data feasible. More important than the fact that it could be built was his powerful belief that it should be built. The web is the largest collection of information in human history, and web crawl data provides an immensely rich corpus for scientific research, technological advancement, and business innovation. Gil started the Common Crawl Foundation to take action on the belief that it is crucial to our information-based society that web crawl data be open and accessible to anyone who desires to utilize it.

That was the inspiration phase of Common Crawl – one person with a passion for openness forming a new foundation to work towards democratizing access to web information, thereby driving a new wave of innovation. Common Crawl quickly moved into the building phase, as Gil found others who shared his belief in the open web. In 2008, Carl Malamud and Nova Spivack joined Gil to form the Common Crawl board of directors. Talented engineer Ahad Rana began developing the technology for our crawler and processing pipeline. Today, thanks to the robust system that Ahad has built, we have an open repository of crawl data that covers approximately 5 billion pages and includes valuable metadata, such as page rank and link graphs. All of our data is stored on Amazon’s S3 and is accessible to anyone via EC2.

Common Crawl is now entering the next phase – spreading the word about the open system we have built and how people can use it. We are actively seeking partners who share our vision of the open web. We want to collaborate with individuals, academic groups, small start-ups, big companies, governments and nonprofits.

Over the next several months, we will be expanding our website and using this blog to describe our technology and data, communicate our philosophy, share ideas, and report on the products of our collaborations. We will also be working to build up a GitHub repository of code that has been and can be used to work with Common Crawl data. Most important, we will be talking with the community of people who share our interests. Thinking about an application you’d like to see built on Common Crawl data? Have Hadoop scripts that could be adapted to find insightful information in the crawl data? Know of a stimulating meetup, conference or hackathon we should attend? We want to hear from you!

This is the phase where the original vision truly comes to life, and the ideas Gil Elbaz had years ago will be converted to new products and insights. To say it is an exciting time is a tremendous understatement.