Search results

Common Crawl - Blog - Hyperlink Graph from Web Data Commons

Hyperlink Graph from Web Data Commons. The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus. Altogether, they found 128 billion hyperlinks connecting 3.5 billion pages.…

Common Crawl - Blog - Web Data Commons

Web Data Commons. For the last few months, we have been talking with Chris Bizer and Hannes Mühleisen at the Freie Universität Berlin about their work and we have been greatly looking forward the announcement of the Web Data Commons.…

Common Crawl - Blog - Web Data Commons Extraction Framework for the Distributed Processing of CC Data

Web Data Commons Extraction Framework for the Distributed Processing of CC Data.…

Common Crawl - Use Cases

Mining a Large Web Corpus. Robert Meusel, Christian Bizer. Introduction of the distributed, parallel extraction framework provided by the Web Data Commons project. Introduction to Common Crawl. Dave Lester.…

Common Crawl - Blog - Common Crawl's First In-House Web Graph

Common Crawl's First In-House Web Graph. We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges. Sebastian Nagel.…

Common Crawl - Blog - Analysis of the NCSU Library URLs in the Common Crawl Index

Double Checking in Web Data Commons. While the Common Crawl URL index is useful if you need the whole page, in many cases just extracted embedded semantic markup may be enough. The.…

Common Crawl - Blog - Evaluating graph computation systems

The data made available by. Common Crawl. and. Web Data Commons. provide an excellent first opportunity for these researchers to understand the performance of graph processing systems at scales that justify their complexity.…

Common Crawl - Blog - 5 Good Reads in Big Open Data: Feb 6 2015

February 6, 2015. 5 Good Reads in Big Open Data: Feb 6 2015.…

Common Crawl - Blog - Analyzing a Web graph with 129 billion edges using FlashGraph

He is a PhD student of computer science at Johns Hopkins University, focusing on developing frameworks for large-scale data analysis, particularly for massive graph analysis and data mining. Da Zheng.…

Common Crawl - Blog - Winners of the Code Contest!

Allison Domicone was formerly a Program and Policy Consultant to Common Crawl and previously worked for Creative Commons. We’re very excited to announce the winners of the First Ever Common Crawl Code Contest!…

Common Crawl - Open Repository of Web Crawl Data

Common Crawl maintains a. free, open repository. of web crawl data that can be used by anyone. Common Crawl is a 501(c)(3) non–profit founded in 2007. We make wholesale extraction, transformation and analysis of open web data accessible to researchers.…

Common Crawl - Blog - The Norvig Web Data Science Award

The Norvig Web Data Science Award. We are very excited to announce the Norvig Web Data Science Award! Common Crawl and SARA created the award to encourage research in web data science. Common Crawl Foundation.…

Common Crawl - Web Graphs

Web Graphs. Choose a Web Graph. Common Crawl regularly releases host- and domain-level graphs, for visualising the crawl data.…

Common Crawl - Blog - The Winners of The Norvig Web Data Science Award

The Winners of The Norvig Web Data Science Award. We are very excited to announce that the winners of the Norvig Web Data Science Award Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente! Common Crawl Foundation.…

Common Crawl - Blog - Data 2.0 Summit

Data 2.0 Summit. Next week a few members of the Common Crawl team are going the Data 2.0 Summit in San Francisco. Common Crawl Foundation. Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.…

Common Crawl - Blog - Common Crawl on AWS Public Data Sets

Common Crawl on AWS Public Data Sets. Common Crawl is thrilled to announce that our data is now hosted on Amazon Web Services' Public Data Sets. Common Crawl Foundation.…

Common Crawl - Blog - blekko donates search data to Common Crawl

December 17, 2012. blekko donates search data to Common Crawl. We are very excited to announce that blekko is donating search data to Common Crawl!…

Common Crawl - Blog - Introducing CloudFront as a new way to access Common Crawl data as part of Amazon Web Services’ registry of open data

Introducing CloudFront as a new way to access Common Crawl data as part of Amazon Web Services’ registry of open data. Ten years ago(!) Common Crawl joined AWS’s Open Data Sponsorships program, hosted on S3, with free access to everyone.…

Common Crawl - Blog - Web Archiving File Formats Explained

Web Archiving File Formats Explained. In the ever–evolving landscape of digital archiving and data analysis, it is helpful to understand the various file formats used for web crawling.…

Common Crawl - Blog - Lexalytics Text Analysis Work with Common Crawl Data

Lexalytics Text Analysis Work with Common Crawl Data. This is a guest blog post by Oskar Singer, a Software Developer and Computer Science student at University of Massachusetts Amherst.…

Common Crawl - Blog - New Crawl Data Available!

New Crawl Data Available! We are very please to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed). Common Crawl Foundation.…

Common Crawl - Blog - Introducing the Common Crawl Errata Page for Data Transparency

Introducing the Common Crawl Errata Page for Data Transparency. As part of our commitment to accuracy and transparency, we are pleased to introduce a new Errata page on our website. Thom Vaughan.…

Common Crawl - Blog - Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data

Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data. Sebastian Spiegler is the head of the data team and SwiftKey and a volunteer at Common Crawl.…

Common Crawl - Blog - April 2014 Crawl Data Available

April 2014 Crawl Data Available. The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. Stephen Merity.…

Common Crawl - Blog - 2012 Crawl Data Now Available

July 16, 2012. 2012 Crawl Data Now Available. I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages. Common Crawl Foundation.…

Common Crawl - Blog - August 2014 Crawl Data Available

August 2014 Crawl Data Available. The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. Stephen Merity.…

Common Crawl - Blog - July 2014 Crawl Data Available

July 2014 Crawl Data Available. The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 3.6 billion webpages. Stephen Merity.…

Common Crawl - Blog - IIPC General Assembly & Web Archiving Conference 2025

IIPC General Assembly & Web Archiving Conference 2025. The Common Crawl team attended the 2025 IIPC General Assembly and Web Archiving Conference in Oslo, presenting recent work and participating in discussions on web preservation. Thom Vaughan.…

Common Crawl - Blog - Web Languages Needing Review by Native Speakers

Web Languages Needing Review by Native Speakers. Common Crawl’s Web Languages initiative has had many contributions since its introduction.…

Common Crawl - Blog - March 2014 Crawl Data Now Available

March 2014 Crawl Data Now Available. The March crawl of 2014 is now available! The new dataset contains approximately 2.8 billion webpages and is about 223TB in size. Common Crawl Foundation.…

Common Crawl - Blog - Winter 2013 Crawl Data Now Available

Winter 2013 Crawl Data Now Available. The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013.…

Common Crawl - Blog - Now Available: Host- and Domain-Level Web Graphs

Now Available: Host- and Domain-Level Web Graphs. We are pleased to announce the release of host-level and domain-level web graphs based on the published crawls of May, June, and July 2017.…

Common Crawl - Blog - SlideShare: Building a Scalable Web Crawler with Hadoop

SlideShare: Building a Scalable Web Crawler with Hadoop. Common Crawl on building an open Web-Scale crawl using Hadoop. Common Crawl Foundation.…

Common Crawl - Blog - Video: Gil Elbaz at Web 2.0 Summit 2011

Video: Gil Elbaz at Web 2.0 Summit 2011. Hear Common Crawl founder discuss how data accessibility is crucial to increasing rates of innovation as well as give ideas on how to facilitate increased access to data. Common Crawl Foundation.…

Common Crawl - Blog - A Look Inside Our 210TB 2012 Web Corpus

A Look Inside Our 210TB 2012 Web Corpus. Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler! Common Crawl Foundation.…

Common Crawl - Blog - Common Crawl's Brand Spanking New Video and First Ever Code Contest!

Common Crawl's Brand Spanking New Video and First Ever Code Contest! At Common Crawl we've been busy recently!…

Common Crawl - Blog - Common Crawl Foundation at Stanford HAI: A Shared Legacy of Data and Innovation

Common Crawl Foundation at Stanford HAI: A Shared Legacy of Data and Innovation. Stanford HAI and Common Crawl are joining forces to explore how open data can shape the future of AI.…

Common Crawl - Blog - Common Crawl's Advisory Board

Common Crawl's Advisory Board. As part of our ongoing effort to grow Common Crawl into a truly useful and innovative tool, we recently formed an Advisory Board to guide us in our efforts.…

Search results

The Data

Resources

Community

About