Search results

Common Crawl - Use Cases

Introduction of the distributed, parallel extraction framework provided by the Web Data Commons project. Introduction to Common Crawl. Dave Lester. Overview of Common Crawl with some example use cases.

Common Crawl - Blog - Please Donate To Common Crawl!

Please Donate To Common Crawl! Big data has the potential to change the world. The talent exists and the tools are already there. What’s lacking is access to data.

Common Crawl - Blog - Common Crawl Discussion List

Common Crawl Discussion List.

Common Crawl - Blog - Common Crawl Enters A New Phase

Common Crawl Enters A New Phase. A little under four years ago, Gil Elbaz formed the Common Crawl Foundation. He was driven by a desire to ensure a truly open web.

Common Crawl - Blog - Welcome, Sebastian!

It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April. Sebastian brings to Common Crawl a unique blend of experience, skills, knowledge (and enthusiasm!) to complement his role and the organization.

Common Crawl - Blog - Mat Kelcey Joins The Common Crawl Advisory Board

Mat Kelcey Joins The Common Crawl Advisory Board. We are excited to announce that Mat Kelcey has joined the Common Crawl Board of Advisors!

Common Crawl - FAQ

Common Crawl. General Questions. What is Common Crawl?

Common Crawl - Blog - blekko donates search data to Common Crawl

December 17, 2012. blekko donates search data to Common Crawl. We are very excited to announce that blekko is donating search data to Common Crawl!

Common Crawl - Blog - Common Crawl on AWS Public Data Sets

Common Crawl on AWS Public Data Sets. Common Crawl is thrilled to announce that our data is now hosted on Amazon Web Services' Public Data Sets. Common Crawl Foundation.

Common Crawl - Blog - The Increase of Common Crawl Citations in Academic Research

The Increase of Common Crawl Citations in Academic Research. Common Crawl's impact on research has grown substantially since its beginning.

Common Crawl - Blog - January/February 2025 Newsletter

Annotation for Language Identification. cc-downloader Command Line Tool. Citations Updates. Common Crawl at SXSW 2025. Software Heritage Symposium at UNESCO. NeurIPS 2024 Social with Wikimedia.

Common Crawl - Contact Us

To communicate with the Common Crawl team and the larger community, please see the Common Crawl Discussion Group and Mailing List. For physical mail correspondence: Common Crawl Foundation, 9663 Santa Monica Blvd. #425, Beverly Hills, CA 90210.

Common Crawl - Blog - Still time to participate in the Common Crawl code contest

Still time to participate in the Common Crawl code contest. There is still plenty of time left to participate in the Common Crawl code contest! 

Common Crawl - Blog - Common Crawl Code Contest Extended Through the Holiday Weekend

Common Crawl Code Contest Extended Through the Holiday Weekend. Do you have a project that you are working on for the Common Crawl Code Contest that is not quite ready? If so, you are not the only one.

Common Crawl - Blog - May/June 2024 Newsletter

Greg is the Chief Technology Officer at the Common Crawl Foundation. Table of Contents: Common Crawl Celebrates Our 100th Crawl Since 2008! AI and the Right to Learn on an Open Internet. Recent Research Using Common Crawl Data.

Common Crawl - Blog - Analysis of the NCSU Library URLs in the Common Crawl Index

Analysis of the NCSU Library URLs in the Common Crawl Index. Note: this post has been marked as obsolete. Last week we announced the Common Crawl URL Index.

Common Crawl - Blog - 2012 Crawl Data Now Available

July 16, 2012. 2012 Crawl Data Now Available. I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages. Common Crawl Foundation.

Common Crawl - Blog - Professor Jim Hendler Joins the Common Crawl Advisory Board!

Professor Jim Hendler Joins the Common Crawl Advisory Board! We are extremely happy to announce that Professor Jim Hendler has joined the Common Crawl Advisory Board.

Common Crawl - Blog - Providing Authenticity & Data Provenance for Common Crawl Using Blockchain: Our Work with Constellation Network

Providing Authenticity & Data Provenance for Common Crawl Using Blockchain: Our Work with Constellation Network.

Common Crawl - Blog - Gil Elbaz and Nova Spivack on This Week in Startups

Underlying their conversation is an exploration of how Common Crawl's open crawl of the web is a powerful asset for educators, researchers, and entrepreneurs. Allison Domicone.

Common Crawl - Overview

The Common Crawl corpus contains petabytes of data, regularly collected since 2008. Choose a crawl. The corpus contains raw web page data, metadata extracts, and text extracts.

Common Crawl - Blog - Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data

Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data. Sebastian Spiegler is the head of the data team at SwiftKey and a volunteer at Common Crawl.

Common Crawl - Blog - The Norvig Web Data Science Award

Common Crawl and SARA created the award to encourage research in web data science. Common Crawl Foundation. Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Common Crawl - Blog - A Look Inside Our 210TB 2012 Web Corpus

Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler! Common Crawl Foundation.

Common Crawl - Blog - URL Search Tool!

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index.

Common Crawl - Blog - Common Crawl URL Index

Common Crawl URL Index. Note: this post has been marked as obsolete. We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io graciously donated his time and skills to creating this valuable tool.

Common Crawl - Impact

Common Crawl has revolutionized access to web data, providing an open repository that anyone can use.

Common Crawl - Blog - October/November 2024 Newsletter

NeurIPS Social with Common Crawl and Wikimedia. Event Updates. Open Job Positions. Web Languages Project.

Common Crawl - Blog - Submission to the UK’s Copyright and AI Consultation

Common Crawl Foundation. Common Crawl started long before generative AI was front page news.

Common Crawl - Blog - Video Tutorial: MapReduce for the Masses

Learn how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents. Common Crawl Foundation.

Common Crawl - Blog - Hyperlink Graph from Web Data Commons

Hyperlink Graph from Web Data Commons. The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus. Altogether, they found 128 billion hyperlinks connecting 3.5 billion pages.

Common Crawl - Blog - May 2016 Crawl Archive Now Available

May 2016 Crawl Archive Now Available. The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - Video: Gil Elbaz at Web 2.0 Summit 2011

Hear Common Crawl founder discuss how data accessibility is crucial to increasing rates of innovation as well as give ideas on how to facilitate increased access to data. Common Crawl Foundation.

Common Crawl - Blog - Bridging Digital Exploration and Scientific Frontiers

This month Common Crawl Foundation members had the privilege of attending the 5th International Open Search Symposium at CERN in Geneva, Switzerland. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Common Crawl Foundation at NeurIPS 2024: Expanding Horizons and Building Connections

Common Crawl Foundation at NeurIPS 2024: Expanding Horizons and Building Connections.

Common Crawl - Blog

Common Crawl Blog. The latest news, interviews, technologies, and resources.

Common Crawl - Blog - Introducing Common Crawl AI Agent by ReadyAI

Introducing Common Crawl AI Agent by ReadyAI. We are pleased to announce the launch of an experimental AI Agent, developed by our friends at ReadyAI.

Common Crawl - Blog - December 2017 Crawl Archive Now Available

December 2017 Crawl Archive Now Available. The crawl archive for December 2017 is now available! The archive contains 2.9 billion web pages and over 240 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - Common Crawl's Advisory Board

Common Crawl's Advisory Board. As part of our ongoing effort to grow Common Crawl into a truly useful and innovative tool, we recently formed an Advisory Board to guide us in our efforts.

Common Crawl - Blog - Introducing CloudFront as a new way to access Common Crawl data as part of Amazon Web Services’ registry of open data

Introducing CloudFront as a new way to access Common Crawl data as part of Amazon Web Services’ registry of open data. Ten years ago(!) Common Crawl joined AWS’s Open Data Sponsorships program, hosted on S3, with free access to everyone.

Common Crawl - Blog - May 2015 Crawl Archive Available

May 2015 Crawl Archive Available. The crawl archive for May 2015 is now available! This crawl archive is over 159TB in size and holds more than 2.05 billion webpages. Stephen Merity.

Common Crawl - Blog - June 2015 Crawl Archive Available

June 2015 Crawl Archive Available. The crawl archive for June 2015 is now available! This crawl archive is over 131TB in size and holds more than 1.67 billion webpages. Stephen Merity.

Common Crawl - Blog - March 2015 Crawl Archive Available

March 2015 Crawl Archive Available. The crawl archive for March 2015 is now available! This crawl archive is over 124TB in size and holds more than 1.64 billion webpages. Stephen Merity.

Common Crawl - Blog - July 2016 Crawl Archive Now Available

July 2016 Crawl Archive Now Available. The crawl archive for July 2016 is now available! The archive contains more than 1.73 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - December 2019 crawl archive now available

December 2019 crawl archive now available. The crawl archive for December 2019 is now available! It contains 2.45 billion web pages or 234 TiB of uncompressed content, crawled between December 5th and 16th.

Common Crawl - Blog - SlideShare: Building a Scalable Web Crawler with Hadoop

Common Crawl on building an open Web-Scale crawl using Hadoop. Common Crawl Foundation.

Common Crawl - Blog - Dialog and Discovery at AI_dev 2024

This month members from the Common Crawl Foundation attended the AI_dev: Open Source GenAI & ML Summit in Paris, where discussions focused on AI advancements, ethics, and Open Source solutions. Common Crawl Foundation.

Common Crawl - Blog - August 2015 Crawl Archive Available

August 2015 Crawl Archive Available. The crawl archive for August 2015 is now available! This crawl archive is over 149TB in size and holds more than 1.84 billion webpages. Stephen Merity.

Common Crawl - Blog - Learn Hadoop and get a paper published

Allison Domicone was formerly a Program and Policy Consultant to Common Crawl and previously worked for Creative Commons. We're looking for students who want to try out the Hadoop platform and get a technical report published.

Common Crawl - Blog - June 2016 Crawl Archive Now Available

June 2016 Crawl Archive Now Available. The crawl archive for June 2016 is now available! The archive contains more than 1.23 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - IIPC General Assembly & Web Archiving Conference 2025

The Common Crawl team attended the 2025 IIPC General Assembly and Web Archiving Conference in Oslo, presenting recent work and participating in discussions on web preservation. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - March 2018 Crawl Archive Now Available

Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for March 2018 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-13/.

Common Crawl - Blog - TalentBin Adds Prizes To The Code Contest

The prize package for the Common Crawl Code Contest now includes three Nexus 7 tablets thanks to TalentBin! Common Crawl Foundation.

Common Crawl - Blog - August 2020 crawl archive now available

August 2020 crawl archive now available. The crawl archive for August 2020 is now available! It contains 2.45 billion web pages or 235 TiB of uncompressed content, crawled between August 2nd and 15th.

Common Crawl - Blog - October 2020 crawl archive now available

October 2020 crawl archive now available. The crawl archive for October 2020 is now available! The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - January 2022 crawl archive now available

January 2022 crawl archive now available. The crawl archive for January 2022 is now available! The data was crawled January 16 – 29 and contains 2.95 billion web pages or 320 TiB of uncompressed content.

Common Crawl - Blog - August 2022 crawl archive now available

August 2022 crawl archive now available. The crawl archive for August 2022 is now available! The data was crawled August 7 – 20 and contains 2.55 billion web pages or 295 TiB of uncompressed content.

Common Crawl - Blog - October 2021 crawl archive now available

October 2021 crawl archive now available. The crawl archive for October 2021 is now available! The data was crawled Oct 15 – 28 and contains 3.3 billion web pages or 360 TiB of uncompressed content.

Common Crawl - Blog - September 2020 crawl archive now available

September 2020 crawl archive now available. The crawl archive for September 2020 is now available! The data was crawled between September 18th and October 2nd and contains 3.45 billion web pages or 345 TiB of uncompressed content.

Common Crawl - Blog - September 2017 Crawl Archive Now Available

Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for September 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-39/.