Search results

Common Crawl - Blog - Common Crawl Discussion List

Common Crawl Discussion List.…

Common Crawl - Blog - Common Crawl Foundation Opt-Out Registry

Common Crawl Foundation Opt-Out Registry. Publishers have been sending Common Crawl legal opt-out requests.…

Common Crawl - Blog - Please Donate To Common Crawl!

Please Donate To Common Crawl! Big data has the potential to change the world. The talent exists and the tools are already there. What’s lacking is access to data.…

Common Crawl - Blog - Common Crawl Celebrates World Digital Preservation Day

Common Crawl Celebrates World Digital Preservation Day. Common Crawl celebrates World Digital Preservation Day Nov. 6, which invites the community to unite in answering a powerful question: Why Preserve? Common Crawl Foundation.…

Common Crawl - Blog - Welcome, Sebastian!

It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April. Sebastian brings to Common Crawl a unique blend of experience, skills, knowledge (and enthusiasm!) to complement his role and the organization.…

Common Crawl - Blog - Common Crawl Enters A New Phase

Common Crawl Enters A New Phase. A little under four years ago, Gil Elbaz formed the Common Crawl Foundation. He was driven by a desire to ensure a truly open web.…

Common Crawl - Blog - Mat Kelcey Joins The Common Crawl Advisory Board

Mat Kelcey Joins The Common Crawl Advisory Board. We are excited to announce that Mat Kelcey has joined the Common Crawl Board of Advisors!…

Common Crawl - Blog - blekko donates search data to Common Crawl

December 17, 2012. blekko donates search data to Common Crawl. We are very excited to announce that blekko is donating search data to Common Crawl!…

Common Crawl - FAQ

Common Crawl. General Questions. What is Common Crawl?…

Common Crawl - Blog - Common Crawl Foundation at COLM 2025

Common Crawl Foundation at COLM 2025. The Common Crawl team attended the 2nd Conference on Language Modeling in Montréal, organizing a workshop, giving invited talks, and strengthening links with the research community. Malte Ostendorff.…

Common Crawl - Blog - Common Crawl on AWS Public Data Sets

Common Crawl on AWS Public Data Sets. Common Crawl is thrilled to announce that our data is now hosted on Amazon Web Services' Public Data Sets. Common Crawl Foundation.…

Common Crawl - Blog - January/February 2025 Newsletter

Annotation for Language Identification. cc-downloader Command Line Tool. Citations Updates. Common Crawl at SXSW 2025. Software Heritage Symposium at UNESCO. NeurIPS 2024 Social with Wikimedia.…

Common Crawl - Blog - Common Crawl Foundation at Stanford HAI

Common Crawl Foundation at Stanford HAI. The Common Crawl team presented a seminar at Stanford HAI entitled “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”. Common Crawl Foundation.…

Common Crawl - Blog - Still time to participate in the Common Crawl code contest

Still time to participate in the Common Crawl code contest. There is still plenty of time left to participate in the Common Crawl code contest! …

Common Crawl - Blog - The Increase of Common Crawl Citations in Academic Research

The Increase of Common Crawl Citations in Academic Research. Common Crawl's impact on research has grown substantially since its beginning.…

Common Crawl - Blog - A Sampling of 2025 Research Referencing Common Crawl

A Sampling of 2025 Research Referencing Common Crawl.…

Common Crawl - Blog - May/June 2024 Newsletter

Greg is Chief Technology Officer at the Common Crawl Foundation. Table of Contents: Common Crawl Celebrates Our 100th Crawl Since 2008! AI and the Right to Learn on an Open Internet. Recent Research Using Common Crawl Data.…

Common Crawl - Blog - Common Crawl Code Contest Extended Through the Holiday Weekend

Common Crawl Code Contest Extended Through the Holiday Weekend. Do you have a project that you are working on for the Common Crawl Code Contest that is not quite ready? If so, you are not the only one.…

Common Crawl - Blog - Common Crawl at the United Nations Open Source Week, June 2025

Common Crawl at the United Nations Open Source Week, June 2025.…

Common Crawl - Blog - Professor Jim Hendler Joins the Common Crawl Advisory Board!

Professor Jim Hendler Joins the Common Crawl Advisory Board! We are extremely happy to announce that Professor Jim Hendler has joined the Common Crawl Advisory Board.…

Common Crawl - Blog - October/November 2025 Newsletter

Common Crawl Foundation. Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone. Table of Contents. Event Highlights. Web Languages. GneissWeb Annotations. SEO to AIO.…

Common Crawl - Blog - Common Crawl Foundation at ACL 2025

Common Crawl Foundation at ACL 2025. The Common Crawl team attended the 63rd Annual Meeting of the Association for Computational Linguistics in Vienna, presenting recently published work and strengthening links with the research community. Laurie Burchell.…

Common Crawl - Blog - Common Crawl Foundation at Stanford HAI: A Shared Legacy of Data and Innovation

Common Crawl Foundation at Stanford HAI: A Shared Legacy of Data and Innovation. Stanford HAI and Common Crawl are joining forces to explore how open data can shape the future of AI.…

Common Crawl - Blog - The Norvig Web Data Science Award

Common Crawl and SARA created the award to encourage research in web data science. Common Crawl Foundation.…

Common Crawl - Blog - Analysis of the NCSU Library URLs in the Common Crawl Index

Analysis of the NCSU Library URLs in the Common Crawl Index. Last week we announced the Common Crawl URL Index.…

Common Crawl - Blog - Providing Authenticity & Data Provenance for Common Crawl Using Blockchain: Our Work with Constellation Network

Providing Authenticity & Data Provenance for Common Crawl Using Blockchain: Our Work with Constellation Network.…

Common Crawl - Blog - Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data

Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data. Sebastian Spiegler is the head of the data team at SwiftKey and a volunteer at Common Crawl.…

Common Crawl - Blog - Common Crawl URL Index

Common Crawl URL Index. We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io, graciously donated his time and skills to creating this valuable tool. Scott Robertson.…

Common Crawl - Blog - Gil Elbaz and Nova Spivack on This Week in Startups

Underlying their conversation is an exploration of how Common Crawl's open crawl of the web is a powerful asset for educators, researchers, and entrepreneurs. Allison Domicone.…

Common Crawl - Contact Us

To communicate with the Common Crawl team and the larger community, please see the Common Crawl Discussion Group and Mailing List. For physical mail correspondence: Common Crawl Foundation. 9663 Santa Monica Blvd. #425. Beverly Hills, CA 90210.…

Common Crawl - Blog - A Look Inside Our 210TB 2012 Web Corpus

Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler! Common Crawl Foundation.…

Common Crawl - Blog - July/August 2025 Newsletter

The Common Crawl engineering team’s weekly meeting. Stanford HAI Seminar in October. Common Crawl Foundation is thrilled to present at an upcoming Stanford Institute for Human-Centered Artificial Intelligence (HAI) Seminar entitled.…

Common Crawl - Blog - 2012 Crawl Data Now Available

July 16, 2012. 2012 Crawl Data Now Available. I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages. Common Crawl Foundation.…

Common Crawl - Blog - Submission to the UK’s Copyright and AI Consultation

Common Crawl Foundation. Common Crawl started long before generative AI was front page news.…

Common Crawl - Blog - URL Search Tool!

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index.…

Common Crawl - Impact

Common Crawl has revolutionized access to web data, providing an open repository that anyone can use.…

Common Crawl - Blog - August 2015 Crawl Archive Available

August 2015 Crawl Archive Available. The crawl archive for August 2015 is now available! This crawl archive is over 149TB in size and holds more than 1.84 billion webpages. Stephen Merity.…

Common Crawl - Overview

The Common Crawl corpus contains petabytes of data, regularly collected since 2008. Choose a crawl. The corpus contains raw web page data, metadata extracts, and text extracts.…

Common Crawl - Blog - Hyperlink Graph from Web Data Commons

Hyperlink Graph from Web Data Commons. The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus. Altogether, they found 128 billion hyperlinks connecting 3.5 billion pages.…

Common Crawl - Blog - October/November 2024 Newsletter

NeurIPS Social with Common Crawl and Wikimedia. Event Updates. Open Job Positions. Web Languages Project.…

Common Crawl - Blog - Introducing Common Crawl AI Agent by ReadyAI

Introducing Common Crawl AI Agent by ReadyAI. We are pleased to announce the launch of an experimental AI Agent, developed by our friends at ReadyAI.…

Common Crawl - Blog - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good

Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good.…

Common Crawl - Blog - Video Tutorial: MapReduce for the Masses

Learn how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents. Common Crawl Foundation.…

Common Crawl - Blog - Common Crawl Foundation at NeurIPS 2024: Expanding Horizons and Building Connections

Common Crawl Foundation at NeurIPS 2024: Expanding Horizons and Building Connections.…

Common Crawl - Blog - March 2015 Crawl Archive Available

March 2015 Crawl Archive Available. The crawl archive for March 2015 is now available! This crawl archive is over 124TB in size and holds more than 1.64 billion webpages. Stephen Merity.…

Common Crawl - Blog - June 2015 Crawl Archive Available

June 2015 Crawl Archive Available. The crawl archive for June 2015 is now available! This crawl archive is over 131TB in size and holds more than 1.67 billion webpages. Stephen Merity.…

Common Crawl - Blog - May 2015 Crawl Archive Available

May 2015 Crawl Archive Available. The crawl archive for May 2015 is now available! This crawl archive is over 159TB in size and holds more than 2.05 billion webpages. Stephen Merity.…
