Common Crawl Enters A New Phase

A little under four years ago, Gil Elbaz formed the Common Crawl Foundation. He was driven by a desire to ensure a truly open web. He knew that decreasing storage and bandwidth costs, along with the increasing ease of crunching big data, made building and maintaining an open repository of web crawl data feasible. More important than the fact that it could be built was his powerful belief that it should be built. The web is the largest collection of information in human history, and web crawl data provides an immensely rich corpus for scientific research, technological advancement, and business innovation. Gil started the Common Crawl Foundation to take action on the belief that it is crucial to our information-based society that web crawl data be open and accessible to anyone who desires to utilize it.

That was the inspiration phase of Common Crawl – one person with a passion for openness forming a new foundation to work towards democratizing access to web information, thereby driving a new wave of innovation. Common Crawl quickly moved into the building phase, as Gil found others who shared his belief in the open web. In 2008, Carl Malamud and Nova Spivack joined Gil to form the Common Crawl board of directors. Talented engineer Ahad Rana began developing the technology for our crawler and processing pipeline. Today, thanks to the robust system that Ahad has built, we have an open repository of crawl data that covers approximately 5 billion pages and includes valuable metadata, such as page rank and link graphs. All of our data is stored on Amazon’s S3 and is accessible to anyone via EC2.

Common Crawl is now entering the next phase – spreading the word about the open system we have built and how people can use it. We are actively seeking partners who share our vision of the open web. We want to collaborate with individuals, academic groups, small start-ups, big companies, governments and nonprofits.

Over the next several months, we will be expanding our website and using this blog to describe our technology and data, communicate our philosophy, share ideas, and report on the products of our collaborations. We will also be working to build up a GitHub repository of code that has been and can be used to work with Common Crawl data. Most important, we will be talking with the community of people who share our interests. Thinking about an application you’d like to see built on Common Crawl data? Have Hadoop scripts that could be adapted to find insightful information in the crawl data? Know of a stimulating meetup, conference or hackathon we should attend? We want to hear from you!

This is the phase where the original vision truly comes to life, and the ideas Gil Elbaz had years ago will be converted to new products and insights. To say it is an exciting time is a tremendous understatement.

  • John Hawkins

    I am opposed to free data. Down with open platforms and the creativity they enable! Down, I say!

  • Chris

    Amazing project!

    I’m curious what the expected EC2 costs would be for a simple, non-time-sensitive Hadoop run over the current 5B documents? For the sake of example, let’s say I wanted to count the number of pages that contained the word “Chris”. Would you expect it to cost me on the order of $10, $100, or $1000 on EC2?

    • http://www.commoncrawl.org Gil Elbaz

      I asked Ahad this same question a while back and I recall it was on the order of $100 of EC2 time to run a hadoop job that scans 5B documents. Hopefully we’ll get you a more confident answer soon.

      Of course, that number comes down by several orders of magnitude if a full text index is built on top of this raw data. Any volunteers to help build that?

    • http://www.commoncrawl.org Ahad Rana

      Hi, I work at commoncrawl, so I will try to answer your question. We store our crawl data on S3 in the form of 100MB compressed archives and there are between 40,000 and 50,000 such files in commoncrawl’s bucket today. The key to scanning such a large set of files efficiently on EC2 is to have each of your Mappers (assuming you are running Hadoop) open multiple S3 streams in parallel to maintain some desired level of throughput. For example, assuming that you can maintain on average a 1MByte/sec throughput per S3 stream, and you start 10 parallel streams per Mapper, you should be able to sustain a throughput of 80 Mbits/sec or 10 MBytes/sec. If you were to run one Mapper per EC2 small instance, and start 100 such instances, this would yield an aggregate throughput of close to 3TB/hour. At that rate, you would need 16 hours to scan 50TB of data, or a total of 1600 machine hours at $0.085 per hour, costing you somewhere in the neighborhood of $130.00. Of course, you would then need to add in the cost of running any subsequent aggregation / data consolidation jobs and the cost of storing your final data on S3. So, the $100.00 number is generally in the ballpark but final numbers may vary :-) Hope that answers your question.
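
The idea of keeping several S3 streams open at once can be sketched in a few lines of Python. This is only an illustration of the pattern, not Common Crawl's actual Mapper code: `scan_in_parallel`, `fetch`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real S3 read so the example runs without network access. Real throughput would depend on S3 and on the instance type.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_in_parallel(keys, fetch, streams=10):
    """Fetch every key using up to `streams` concurrent downloads.

    `fetch` is a caller-supplied function (hypothetical here) that opens
    one stream for a key and returns the number of bytes it read.
    """
    with ThreadPoolExecutor(max_workers=streams) as pool:
        # pool.map keeps up to `streams` fetches in flight at a time.
        return sum(pool.map(fetch, keys))


def fake_fetch(key):
    # Stand-in for a real S3 read: pretend every archive is 100 MB.
    return 100 * 1024 * 1024


total_bytes = scan_in_parallel([f"archive-{i}" for i in range(25)], fake_fetch)
print(total_bytes // (1024 * 1024), "MB read")
```

Threads suit this workload because each stream spends most of its time waiting on the network rather than using the CPU.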

      • http://www.commoncrawl.org Ahad Rana

        A couple of corrections: There are 323694 files in our current bucket, which will grow to 455827 files once we have merged in some data from an earlier bucket. Also, the EC2 small instances are pretty underpowered, so processing 10MB/sec of compressed content (which is probably around 30-40MB uncompressed) might be an optimistic expectation depending on what exactly you are trying to do.

        Part of our goal for 2012 will be to produce smaller and fresher crawl collections by focusing more on sub-segments of the Web, such as the Blogosphere etc. Hopefully this will help increase the signal to noise ratio a bit and also make the crawl more approachable to a wider audience.
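
The back-of-envelope estimate in this thread can be reproduced with a short calculation. The function name `scan_cost` is made up for illustration, and the inputs are the assumptions stated above (50TB of data, 10 MBytes/sec per instance, 100 small instances, $0.085 per instance hour). With decimal units the unrounded aggregate rate comes out at 3.6 TB/hour rather than the rounded "close to 3TB/hour", so the result lands a little under the $130 quoted, which is still in the same ballpark.

```python
def scan_cost(data_tb, mb_per_sec_per_instance, instances, dollars_per_hour):
    """Return (aggregate TB/hour, wall-clock hours, total dollars) for one pass."""
    # MB/sec per instance -> TB/hour across the cluster (decimal units).
    tb_per_hour = mb_per_sec_per_instance * instances * 3600 / 1_000_000
    hours = data_tb / tb_per_hour
    # Every instance is billed for the whole run.
    dollars = hours * instances * dollars_per_hour
    return tb_per_hour, hours, dollars


rate, hours, dollars = scan_cost(50, 10, 100, 0.085)
print(f"{rate:.1f} TB/hour, {hours:.1f} hours, ${dollars:.2f}")
```

As the correction above notes, small instances may not sustain 10 MBytes/sec of compressed input, so lowering `mb_per_sec_per_instance` is an easy way to see how the bill grows.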

  • Beth

    This is really exciting! You could build your own search engine, you could build a meme-tracker, you could pull out sentiment surrounding political issues… heaps of possibilities.

  • http://blogs.fluidinfo.com/terry Terry Jones (@terrycojones)

    OMG, Gil, there you are again! Nice job! :-)

    Terry

  • Konstantin Lopukhin

    Great news! Is there a list of domain names you reached during this crawl? Have you crawled any regional domain names? I am particularly interested in the “.ru” domain.

  • http://www.markosweb.com/ Deep

    Great job!
    Can you please provide some examples of data you have?

  • http://www.koch.ro Thomas Koch

    Do you open source your code so that we can contribute our ideas and improvements? Where’s the development mailing list?

    Thank you very much so far!

  • @virtualCableTV

    I encourage Mr. Elbaz to support crawling RSS web feeds, using the .rss and .xml file types that are identified as resources in sitemap.xml files, and then to update the FAQ to let people know it is possible. Finally, I am grateful that the CommonCrawl initiative is making this all possible…

    @virtualCableTV

  • http://www.dannywhitehouse.com Danny Whitehouse

    Anything is better than Google. I will not be part of the Google bureaucracy!

  • John

    I’m trying to access this using the S3 perl wrapper around curl (http://aws.amazon.com/code/128). All I can get is access denied errors. Is this available for use now or am I doing something wrong accessing it?

  • Avi

    Looks great. There are many possibilities for this.

    I have a couple of questions.
    What kind of documents are being stored? pdfs? images? word documents?
    Is there a way to get a sample ARC file?

    Thanks,
    Avi

  • http://www.rc-helicopters.org/blade-400-reviews/ Guide

    Great! Thanks for the share!
    Arron

  • Pingback: The Common Crawl Foundation | halfblog.net

  • http://www2.parc.com/rzhou Rong

    This is fantastic. Thanks very much, CC!

    How big is the link graph? Does the metadata map each web page to a unique (hopefully integer) id?

  • ik

    How do you deal with spam and (semi-)duplicates? I think it is the second most important question that must be answered, right after the number of pages, yet I don’t see the answer to it anywhere on your site! I don’t know what to think of your data…

    If 5 bln. is just the total number of different URLs you’ve downloaded, then it ain’t much. Google’s index was 1 billion way back in 2000. They had downloaded a trillion URLs by 2008, and they say most of it is junk that is simply not worth indexing. [1]

    • http://matpalm.com Mat Kelcey

      This is a public crawl; why on earth would they remove spam or semi duplicate pages?

      If you don’t want that content then it’s _your_ job to remove it. The web is as it is.

      A billion isn’t much? Well, when you’ve downloaded more and put it up for public access let us know hey?

      Mat

      • ik

        > This is a public crawl; why on earth would they remove spam or semi duplicate pages?

        To increase usefulness of the crawl to users. Because this is a big issue on the web.

        Would you prefer to pay $130 (estimate from an above post) for mapreduce and have to deal with all spam and dups on top of that, or rather pay $13 for a mapreduce over maybe just 100 mln. despammed documents?

        • http://matpalm.com Mat Kelcey

          Point taken; the cost of a full pass is not cheap.

        • MA

          They actually need to leave spam and duplicates in the archive. For anyone doing work to eliminate spam having a record of all the ways spam happens is very important to develop new algorithms to detect it.

  • http://sesyndicate.com/ Vanya Helsing

    This is very good! S. E. Syndicate and Onotolee endorses open web.

    We love you!

  • http://ce.sharif.edu/~a_rahimi afshin r

    Great work. Would you open up the architecture of the crawler?

  • http://www.commoncrawl.org Ahad Rana

    Hi, you need to sign every request, and also use a tool that supports the Amazon requester pays feature, as documented here: http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBuckets.html. With the python boto library, as an example, you have to add the following HTTP header, x-amz-request-payer:requester, before letting the library sign the request.
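
As a rough sketch of what a requester-pays read could look like today, the example below uses boto3, the successor to the boto library mentioned above (swapping in boto3 is an assumption of this sketch, not something the comment prescribes). Its `get_object` call accepts a `RequestPayer="requester"` argument, which is sent as the `x-amz-request-payer` header. The helper names, the bucket name, and the object key are placeholders, not the real Common Crawl locations.

```python
def requester_pays_request(bucket, key):
    """Build get_object arguments that opt in to requester-pays billing."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download(bucket, key):
    # boto3 is imported lazily so the helper above works without it installed.
    import boto3

    client = boto3.client("s3")  # signs each request with your AWS credentials
    response = client.get_object(**requester_pays_request(bucket, key))
    return response["Body"].read()


# Placeholder names; substitute the real bucket and object key.
args = requester_pays_request("example-bucket", "path/to/archive.arc.gz")
print(args)
```

Calling `download` needs valid AWS credentials, and the transfer is billed to the account that signs the request.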

  • http://www.covario.com/ Tom Davidson

    I am not having any luck accessing the S3 bucket. I keep getting access denied. I have an EC2 account and I am running my code on an EC2 instance, but no luck.

    BTW, this effort is very exciting. Keep up the good work.

  • Serge

    Great Job! Successfully accessed the data using s3cmd. One question about Hadoop processing. You mentioned that:
    > This involves setting up a custom hadoop jar that utilizes our custom InputFormat class to pull data from the individual ARC files in our S3 bucket
    Where can one find this hadoop JAR?

  • http://www.link-assistant.com Viktar Khamianok

    It would be nice to have any sample code to see how to handle the crawled data. For example:
    1) Find how many websites/pages are indexed right now.
    2) Find how many pages a particular website has.
    3) Find the Top100 keywords used in titles for a website or for the whole Internet :)
    4) How to build a backlinks’ graph for a whole Internet :-D
    5) How to build a “Google” :-D
    6) How to conquer the Universe :-D
    7) …

  • http://www.cpan.org JAPH

    It would be really nice if you could please put a small example ARC file (say only 5Mb) somewhere FTP/HTTP accessible so people could download that. Then they would be able to develop their analysing programs at home before setting up Amazon EC2 instances to process over the large datasets.
    For example I would like to analyse the HTML tags, DOCTYPEs, id attributes, name attributes etc. Back in 2005 Google did an analysis of over 1 billion pages. See http://code.google.com/webstats/ and I would like to do an updated analysis.

  • Jiten

    Is it possible to access the data without an amazon customer id ?

  • Jerzy Kaltenberg

    nice. now make it free. I shouldn’t have to pay for Amazon computing cycles.

  • Pingback: Newsfoo 2011 – the rise of the fact-checkers « Sameer Padania

  • Robert Butler

    This is a fantastic idea!!! Have you heard of ScraperWiki?