It was wonderful to see our first blog post and the great piece by Marshall Kirkpatrick on ReadWriteWeb generate so much interest in Common Crawl last week! There were many questions raised on Twitter and in the comment sections of our blog, RWW and Hacker News. In this post we respond to the most common questions. Because it is a long blog post, we have provided a navigation list of questions below. Thanks for all the support and please keep the questions coming!
*Is there a sample dataset or sample .arc file?
*Is it possible to get a list of domain names?
*Is your code open source?
*Where can people access the Hadoop classes and other code?
*Where can people learn more about the stack and the processing architecture?
*How do you deal with spam and deduping?
*Why should anyone care about five billion pages when Google has so many more?
*How frequently is the crawl data updated?
*How is the metadata organized and stored?
*What is the cost for a simple Hadoop job over the entire corpus?
*Is the data available by torrent?
Is there a sample dataset or sample .arc file?
We are currently working to create a sample dataset so people can consume and experiment with a small segment of the data before dealing with the entire corpus. One commenter suggested that we create a focused crawl of blogs and RSS feeds, and I am happy to say that is just what we had in mind. Stay tuned: We will be announcing the sample dataset soon and posting a sample .arc file on our website even sooner!
Is your code open source?
Anything required to access the buckets or the Common Crawl data that we publish is open source, and any utility code that we develop as part of the crawl is also going to be made open source. However, the crawl infrastructure depends on our internal MapReduce and HDFS file system, and it is not yet in a state that would be useful to third parties. In the future, we plan to break more parts of the internal source code into self-contained pieces to be released as open source.
Where can people access the Hadoop classes and other code?
We have a GitHub repository that was temporarily down due to some accidental check-ins. It is now back up and can be found here and on the Accessing the Data page of our website.
Where can people learn more about the stack and the processing architecture?
We plan to make the details of our internal infrastructure available in a detailed blog post as soon as time allows. We are using all of our engineering brainpower to optimize the crawler, but we expect to have the bandwidth for additional technical documentation and communication soon. Meanwhile, you can check out a presentation given at a Hadoop user group by Ahad Rana on SlideShare.
How do you deal with spam and deduping?
We use shingling and simhash to do fuzzy deduping of the content we download. The corpus in S3 has not been filtered for spam, because it is not clear whether we should really remove spammy content from the crawl. For example, individuals who want to build a spam filter need access to a crawl with spam. This might be an area in which we can work with the open-source community to develop spam lists/filters.
In addition, we do not have the resources necessary to police the accuracy of any spam filters we develop and currently can only rely on algorithmic means of determining spam, which can sometimes produce false positives.
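To illustrate the fuzzy deduping mentioned above, here is a minimal sketch of shingling combined with simhash. It is not our production code: the function names, the three-word shingle size, the 64-bit fingerprint width, the use of SHA-256 as the per-shingle hash, and the distance threshold of 3 are all assumptions chosen for the example.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width for this sketch


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(text, k=3):
    """Compute a 64-bit simhash fingerprint from the text's shingles."""
    counts = [0] * BITS
    for shingle in shingles(text, k):
        digest = hashlib.sha256(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:BITS // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3):
    """Treat two documents as near-duplicates if their fingerprints are close."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

Documents that share most of their shingles tend to produce fingerprints that differ in only a few bits, so comparing Hamming distances is a cheap stand-in for comparing full page contents. The right threshold depends on the corpus and would need tuning.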
Why should anyone care about five billion pages when Google has so many more?
Although this question was not as common as the others addressed in this post, I would like to respond to a comment on our blog:
“If 5 bln. is just the total number of different URLs you’ve downloaded, then it ain’t much. Google’s index was 1 billion way back in 2000, They’ve downloaded a trillion URLs by 2008. And they say most of is junk, that is simply not worth indexing.”
We are not trying to replace Google; our goal is to provide a high-quality, open corpus of web crawl data.
We agree that many of the pages on the web are junk, and we have no inclination to crawl a larger number of pages just for the sake of having a larger number. Five billion pages is a substantial corpus and, though we may expand the size in the near future, we are focused on quality over quantity.
Also, when Google announced they had a trillion URLs, that was the number of URLs they were aware of, not the number of pages they had downloaded. We have 15 billion URLs in our database, but we don’t currently download them all because those additional 10 billion are—in our judgment—not nearly as important as the five billion we do download. One outcome from our focus on the crawl’s quality is our system of ranking pages, which allows us to determine how important a page is and which of the five billion pages that make up our corpus are among the most important.
How frequently is the crawl data updated?
We spent most of 2011 tweaking the algorithms to improve the freshness and quality of the crawl. We will soon start the improved crawler. In 2012 there will be fresher and more consistent updates – we expect to crawl continuously and update the S3 buckets once a month.
We hope to work with the community to determine what additional metadata and focused crawls would be most valuable and what subsets of web pages should be crawled with the highest frequency.
How is the metadata organized and stored?
The page rank and other metadata we compute are not part of the S3 corpus, but we do collect this information and expect to make it available in a separate S3 bucket in Hadoop SequenceFile format. On the subject of page ranking, please be aware that the page rank we compute for pages may not correlate closely with Google’s PageRank, since we do not use their PageRank algorithm.
What is the cost for a simple Hadoop job over the entire corpus?
- The Common Crawl corpus is approximately 40TB.
- Crawl data is stored on S3 in the form of 100MB compressed archives.
- There are between 400K and 500K such files in the corpus.
- If you open multiple S3 streams in parallel, maintain an average throughput of 1MB/sec per S3 stream, and start 10 parallel streams per Mapper, you should sustain a throughput of 10MB/sec per Mapper.
- If you run one Mapper per EC2 small instance and start 100 instances, you would have an aggregate throughput of ~3TB/hour.
- At that rate you would need 16 hours to scan 50TB of data – a total of 1600 machine hours.
- 1600 machine hours at $0.085 per hour will cost ~$136.
- The cost of any subsequent aggregation/data consolidation jobs and the cost of storing your final data on S3 brings you to a total cost of approximately $150.
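The estimate above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures from the list; the function names are hypothetical, and it assumes decimal units (1TB = 1,000,000MB). Note that 100 Mappers at 10MB/sec work out to 3.6TB/hour at best, so the ~3TB/hour used in the list is a rounded-down figure.

```python
MB_PER_TB = 1_000_000  # assumption: decimal units


def aggregate_tb_per_hour(mb_per_sec_per_stream, streams_per_mapper, mappers):
    """Peak aggregate throughput in TB/hour across all Mappers."""
    mb_per_sec = mb_per_sec_per_stream * streams_per_mapper * mappers
    return mb_per_sec * 3600 / MB_PER_TB


def scan_cost(data_tb, tb_per_hour, instances, price_per_hour):
    """Return (wall-clock hours, machine hours, cost in USD) for one scan."""
    hours = data_tb / tb_per_hour
    machine_hours = hours * instances
    return hours, machine_hours, machine_hours * price_per_hour


# Number of 100MB archives in a 40TB corpus.
file_count = 40 * MB_PER_TB // 100

# Peak throughput: 1MB/sec per stream, 10 streams per Mapper, 100 Mappers.
peak = aggregate_tb_per_hour(1, 10, 100)

# Scan of 50TB at the rounded-down 3TB/hour on 100 small instances.
hours, machine_hours, cost = scan_cost(50, 3.0, 100, 0.085)

print(f"files: {file_count}, peak throughput: {peak:.1f} TB/hour")
print(f"{hours:.1f} hours, {machine_hours:.0f} machine hours, ${cost:.2f}")
```

At exactly 3TB/hour the scan takes a little under 17 hours and costs roughly $142; rounding down to 16 hours gives the 1600 machine hours used in the list. Either way, the scan itself stays well under the $150 total once aggregation and storage are added.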
Is the data available by torrent?
Do you mean the distribution of a subset of the data via torrents, or do you mean the distribution of updates to the crawl via torrents? The current data set is 40+ TB in size, and it seems to us to be too big to be distributed via this mechanism, but perhaps we are wrong. If you have some ideas about how we could go about doing this, and whether or not it would require significant bandwidth resources on our part, we would love to hear from you.