Search results
Hyperlink Graph from Web Data Commons. The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus. Altogether, they found 128 billion hyperlinks connecting 3.5 billion pages.…
Web Data Commons. For the last few months, we have been talking with Chris Bizer and Hannes Mühleisen at the Freie Universität Berlin about their work and we have been greatly looking forward the announcement of the Web Data Commons.…
Web Data Commons Extraction Framework for the Distributed Processing of CC Data.…
Mining a Large Web Corpus. Robert Meusel, Christian Bizer. Introduction of the distributed, parallel extraction framework provided by the Web Data Commons project. Introduction to Common Crawl. Dave Lester.…
Common Crawl's First In-House Web Graph. We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges. Sebastian Nagel.…
Double Checking in Web Data Commons. While the Common Crawl URL index is useful if you need the whole page, in many cases just extracted embedded semantic markup may be enough. The.…
The data made available by. Common Crawl. and. Web Data Commons. provide an excellent first opportunity for these researchers to understand the performance of graph processing systems at scales that justify their complexity.…
February 6, 2015. 5 Good Reads in Big Open Data: Feb 6 2015.…
He is a PhD student of computer science at Johns Hopkins University, focusing on developing frameworks for large-scale data analysis, particularly for massive graph analysis and data mining. Da Zheng.…
Allison Domicone was formerly a Program and Policy Consultant to Common Crawl and previously worked for Creative Commons. We’re very excited to announce the winners of the First Ever Common Crawl Code Contest!…
Common Crawl maintains a. free, open repository. of web crawl data that can be used by anyone. Common Crawl is a 501(c)(3) non–profit founded in 2007. We make wholesale extraction, transformation and analysis of open web data accessible to researchers.…
The Norvig Web Data Science Award. We are very excited to announce the Norvig Web Data Science Award! Common Crawl and SARA created the award to encourage research in web data science. Common Crawl Foundation.…
Web Graphs. Choose a Web Graph. Common Crawl regularly releases host- and domain-level graphs, for visualising the crawl data.…
The Winners of The Norvig Web Data Science Award. We are very excited to announce that the winners of the Norvig Web Data Science Award Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente! Common Crawl Foundation.…
Data 2.0 Summit. Next week a few members of the Common Crawl team are going the Data 2.0 Summit in San Francisco. Common Crawl Foundation. Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.…
Common Crawl on AWS Public Data Sets. Common Crawl is thrilled to announce that our data is now hosted on Amazon Web Services' Public Data Sets. Common Crawl Foundation.…
Introducing CloudFront as a new way to access Common Crawl data as part of Amazon Web Services’ registry of open data. Ten years ago(!) Common Crawl joined AWS’s Open Data Sponsorships program, hosted on S3, with free access to everyone.…
December 17, 2012. blekko donates search data to Common Crawl. We are very excited to announce that blekko is donating search data to Common Crawl!…
Web Archiving File Formats Explained. In the ever–evolving landscape of digital archiving and data analysis, it is helpful to understand the various file formats used for web crawling.…
Lexalytics Text Analysis Work with Common Crawl Data. This is a guest blog post by Oskar Singer, a Software Developer and Computer Science student at University of Massachusetts Amherst.…
New Crawl Data Available! We are very please to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed). Common Crawl Foundation.…
Introducing the Common Crawl Errata Page for Data Transparency. As part of our commitment to accuracy and transparency, we are pleased to introduce a new Errata page on our website. Thom Vaughan.…
Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data. Sebastian Spiegler is the head of the data team and SwiftKey and a volunteer at Common Crawl.…
April 2014 Crawl Data Available. The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. Stephen Merity.…
August 2014 Crawl Data Available. The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. Stephen Merity.…
July 16, 2012. 2012 Crawl Data Now Available. I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages. Common Crawl Foundation.…
July 2014 Crawl Data Available. The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 3.6 billion webpages. Stephen Merity.…
IIPC General Assembly & Web Archiving Conference 2025. The Common Crawl team attended the 2025 IIPC General Assembly and Web Archiving Conference in Oslo, presenting recent work and participating in discussions on web preservation. Thom Vaughan.…
March 2014 Crawl Data Now Available. The March crawl of 2014 is now available! The new dataset contains approximately 2.8 billion webpages and is about 223TB in size. Common Crawl Foundation.…
Winter 2013 Crawl Data Now Available. The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013.…
Now Available: Host- and Domain-Level Web Graphs. We are pleased to announce the release of host-level and domain-level web graphs based on the published crawls of May, June, and July 2017.…
A Look Inside Our 210TB 2012 Web Corpus. Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler! Common Crawl Foundation.…
Video: Gil Elbaz at Web 2.0 Summit 2011. Hear Common Crawl founder discuss how data accessibility is crucial to increasing rates of innovation as well as give ideas on how to facilitate increased access to data. Common Crawl Foundation.…
SlideShare: Building a Scalable Web Crawler with Hadoop. Common Crawl on building an open Web-Scale crawl using Hadoop. Common Crawl Foundation.…
Common Crawl's Brand Spanking New Video and First Ever Code Contest! At Common Crawl we've been busy recently!…
We’re happy to share our January/February 2025 newsletter with updates and insights from the world of open data and web archiving. Jen English.…
Common Crawl's Advisory Board. As part of our ongoing effort to grow Common Crawl into a truly useful and innovative tool, we recently formed an Advisory Board to guide us in our efforts.…
Amazon Web Services sponsoring $50 in credit to all contest entrants! Did you know that every entry to the First Ever Common Crawl Code Contest gets $50 in Amazon Web Services (AWS) credits?…