Common Crawl is my go-to data set. It's a huge collection of pages crawled from the internet and made available completely unfettered. Their choice to largely leave the data alone and make it available “as is” is brilliant.
It's almost like I did the crawling myself, minus the hassle of creating a crawling infrastructure, renting space in a data center, and dealing with spinning platters covered in rust that freeze up on you when you least want them to. I exaggerate. In this day and age I would spend hours, days, maybe weeks agonizing over cloud infrastructure choices and worrying about my credit card bills if I wanted to create something on that scale.
If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, it all starts when you think "if only I had the entire web on my hard drive." Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster, would agree.
Which is great news! However, if you want to extract only a small subset, say every page from Wikipedia, you still have to pay that few hundred dollars: the individual pages are randomly distributed across more than 200,000 archive files, each of which you must download and unzip to find all the Wikipedia pages. Well, you did, until now.
I'm happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).
Keeping with Common Crawl tradition, we're making the entire index available as a giant download. Fear not, there's no need to rack up bandwidth bills downloading the whole thing. We've implemented it as a prefixed b-tree, so you can access parts of it randomly from S3 using byte-range requests. At the same time, you're free to download the entire beast and work with it directly if you desire.
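To make the byte-range idea concrete, here's a minimal sketch (not the official sample code) of reading just the first piece of the index straight from S3 using boto3, assuming anonymous read access. The bucket and key come from the location given below; the 64 KiB range is an arbitrary choice for illustration, not the real block size, so consult the format notes on GitHub before relying on it.

```python
# A minimal sketch of a byte-range read against the index file on S3.
# Assumes anonymous read access; the 64 KiB range is arbitrary, not the
# real block size documented in the format notes on GitHub.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.get_object(
    Bucket="commoncrawl",
    Key="projects/url-index/url-index.1356128792",
    Range="bytes=0-65535",  # fetch only the first 64 KiB of the index
)
chunk = resp["Body"].read()
print(f"fetched {len(chunk)} bytes without downloading the whole index")
```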
Information about the format and samples of accessing it using Python are available on GitHub. Feel free to post questions in the issue tracker and wiki there.
The index itself is located in the public data sets bucket at s3://commoncrawl/projects/url-index/url-index.1356128792.
This is the first release of the index. The main goals of the design are to allow querying of the index via byte-range requests and to make it easy to implement in any language. We hope you, dear reader, will be encouraged to jump in and contribute code to access the index in your favorite language.
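As an illustration of what querying via byte-range requests can look like, here's a hypothetical sketch: it assumes you've already obtained a small directory of (first key, offset, length) entries for the index blocks, binary-searches that directory for the block that could contain a given prefix, and fetches only that block. The directory values and the plain-text matching are invented for illustration; the real prefixed b-tree layout and encoding are described in the GitHub repository.

```python
# A hypothetical sketch of a prefix lookup driven by byte-range requests.
# The block directory and its values are invented for illustration only;
# the real prefixed b-tree layout is documented in the GitHub repository.
import bisect
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
BUCKET = "commoncrawl"
KEY = "projects/url-index/url-index.1356128792"

# (first key in block, byte offset, block length) -- made-up example values.
block_directory = [
    (b"com.example", 0, 65536),
    (b"org.wikipedia", 65536, 65536),
]

def lookup(prefix: bytes):
    # Pick the last block whose first key is <= prefix.
    keys = [k for k, _, _ in block_directory]
    i = max(bisect.bisect_right(keys, prefix) - 1, 0)
    _, offset, length = block_directory[i]
    # Fetch just that block from S3 with a byte-range request.
    resp = s3.get_object(
        Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{offset + length - 1}"
    )
    block = resp["Body"].read()
    # Real entries are binary-encoded; newline-split matching here is only
    # a stand-in to show where the decoding step would go.
    return [line for line in block.splitlines() if line.startswith(prefix)]

# e.g. lookup(b"org.wikipedia") touches a single block, not the whole file.
```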
For now we've avoided clever encoding schemes and compression. We expect that to change as the community has a chance to work with the data and contribute their expertise. Join the discussion; we're happy to have you.