Search results

Common Crawl - Blog - URL Search Tool!

A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index.…

Common Crawl - Blog - WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

This is a guest blog post by Ross Fairbanks, a software developer based in Barcelona. He mainly develops in Ruby and is interested in open data and cloud computing. This guest post describes his open data project and why he built it.…

Common Crawl - Blog - Announcing the Common Crawl Index!

This is a guest post by Ilya Kreymer, a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl.…

Common Crawl - Blog - Towards Social Discovery - New Content Models; New Data; New Toolsets

This is a guest blog post by Matthew Berk, Founder of Lucky Oyster. Matthew has been on the front lines of search technology for the past decade. Matthew Berk is a founder at Bean Box and Open List, worked at Jupiter Research and Marchex.…

Common Crawl - Blog - Lexalytics Text Analysis Work with Common Crawl Data

This is a guest blog post by Oskar Singer, a Software Developer and Computer Science student at University of Massachusetts Amherst. He recently did some very interesting text analytics work during his internship at Lexalytics.…

Common Crawl - Blog - Web Data Commons Extraction Framework for the Distributed Processing of CC Data

This is a guest blog post by Robert Meusel, a researcher at the University of Mannheim in the Data and Web Science Research Group and a key member of the Web Data Commons project.…

Common Crawl - Blog - Analyzing a Web graph with 129 billion edges using FlashGraph

This is a guest blog post by Da Zheng, the architect and main developer of the FlashGraph project.…

Common Crawl - Blog - Evaluating graph computation systems

This is a guest blog post by Frank McSherry, a computer science researcher active in the area of large scale data analysis. While at Microsoft Research he co-invented differential privacy, and led the Naiad streaming dataflow project.…

Common Crawl - Blog - April 2015 Crawl Archive Available

For full details, refer to Ilya's guest blog post. Please donate to Common Crawl if you appreciate our free datasets! We're also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data!…

Common Crawl - Blog - July 2015 Crawl Archive Available

Common Crawl - Blog - Twelve steps to running your Ruby code across five billion web pages

The following is a guest blog post by Pete Warden, a member of the Common Crawl Advisory Board. Pete is a British-born programmer living in San Francisco.…

Common Crawl - Blog - August 2015 Crawl Archive Available

Common Crawl - Blog - June 2015 Crawl Archive Available

Common Crawl - Blog - March 2015 Crawl Archive Available

Common Crawl - Blog - May 2015 Crawl Archive Available

Common Crawl - Blog - Web Image Size Prediction for Efficient Focused Image Crawling

This is a guest blog post by Katerina Andreadou, a research assistant at CERTH, specializing in multimedia analysis and web crawling.…

Common Crawl - Blog - Navigating the WARC file format

This is a guest blog post by Stephen Merity, a Computational Science and Engineering master's candidate at Harvard University. His graduate work centers around machine learning and data analysis on large data sets.…

Common Crawl - Blog - Answers to Recent Community Questions

In this post we respond to the most common questions. Thanks for all the support and please keep the questions coming! Common Crawl Foundation.…

Common Crawl - Blog - Oct/Nov 2023 Performance Issues

This post details some steps to take if you are impacted by performance issues. Greg Lindahl. Greg is Chief Technology Officer at the Common Crawl Foundation. Introduction.…

Common Crawl - Blog - Winter 2013 Crawl Data Now Available

In late November, we published the data from the first crawl of 2013 (see previous blog post for more detail on that dataset). The new dataset was collected at the end of 2013, contains approximately 2.3 billion webpages and is 148TB in size.…

Common Crawl - Blog - October/November 2025 Newsletter

For highlights and slides, see our blog post. The WMDQS team at COLM: Sebastian Nagel, Pedro Ortiz Suarez, Laurie Burchell, Thom Vaughan, and Malte Ostendorff.…

Common Crawl - Blog - July/August 2025 Newsletter

More details about the event and links to papers with and about Common Crawl can be found in our recent blog post. In July we had the happy opportunity to attend IETF 123, held at the Meliã Castilla in Madrid.…

Common Crawl - Blog - March/April 2025 Newsletter

Also in February, we attended the AI Action Summit (see separate post below).…

Common Crawl - Blog - Analysis of the NCSU Library URLs in the Common Crawl Index

The index has already proven useful to many people and we would like to share an interesting use of the index that was very well described in a great blog post by Jason Ronallo.…

Common Crawl - Blog - April 2026 Common Crawl Newsletter

More details in our blog post. Introducing the New Examples & Resources Browser. We've replaced our old Examples and Use Cases pages with a single searchable, filterable browser. 119 resources from 115 contributors, all in one place.…

Common Crawl - Blog - May/June 2025 Newsletter

For more details about the public test of this dataset and how to give feedback, see our blog post. Refreshed Version of Our Whirlwind Tour.…

Common Crawl - Blog - A Further Look Into the Prevalence of Various ML Opt–Out Protocols

This post details some experiments that we have done regarding Machine Learning Opt–Out protocols.…

Common Crawl - Blog - January/February 2025 Newsletter

Web Languages project, see our related blog post. cc-downloader Command Line Tool.…

Common Crawl - Blog - October/November 2024 Newsletter

For more details, see our blog post. We attended the IETF 121 meeting in Dublin, where there was further discussion on the initial results from the recent AI CONTROL workshop. Here are some notes from the chairs Mark Nottingham and Suresh Krishnan.…

Common Crawl - Blog - August 2016 Crawl Archive Now Available

More information can be found in a separate blog post. To assist with exploring and using the dataset, we provide gzipped files that list: all segments (CC-MAIN-2016-36/segment.paths.gz), all WARC files (CC-MAIN-2016-36/warc.paths.gz), all WAT files.…

Common Crawl - Open Repository of Web Crawl Data


Common Crawl - Blog - Video Tutorial: MapReduce for the Masses

Check out the full blog post where this video originally appeared.…

Common Crawl - Erratum - Missing content_truncated flag in URL indexes

For more information, please refer to the blog post announcing the November 2019 crawl. The reason for the truncation is given only for truncated records following the WARC header field "WARC-Truncated". Affected Crawls.…

Common Crawl - Blog - Web Archiving File Formats Explained

In this post, we explain these formats, exploring their unique features, applications, and the enhancements they offer. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation. The Capabilities of ARC, WARC, WET, and WAT Formats.…

Common Crawl - Erratum - ARC Format (Legacy) Crawls

More information about these formats can be found in our blog post Web Archiving Formats Explained. Affected Crawls.…

Common Crawl - Blog - August/September 2024 Newsletter

We recently published a blog post on this, and plan to further investigate the connections in this network. Common Crawl Statistics on Hugging Face. We're excited to announce that Common Crawl’s statistics are now available on Hugging Face!…

Common Crawl - Blog - February/March 2024 Crawl Archive Now Available

If you're interested, we have recently published a blog post with further details on these formats here.…

Common Crawl - Blog - July 2016 Crawl Archive Now Available

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.…

Common Crawl - Blog - June 2016 Crawl Archive Now Available

Common Crawl - Blog - February 2015 Crawl Archive Available

Whilst full details will be released in an upcoming blog post, we're telling you about it now as we're interested in hearing feedback from the community! Please donate to Common Crawl if you appreciate our free datasets!…

Common Crawl - Blog - January 2017 Crawl Archive Now Available

Common Crawl - Blog - Still time to participate in the Common Crawl code contest

If you are looking for help with your work or a collaborator, you can post on the Discussion Group. We are looking forward to seeing what you come up with!…

Common Crawl - Blog - February 2016 Crawl Archive Now Available

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the URL index.…

Common Crawl - Blog - March 2017 Crawl Archive Now Available

Common Crawl - Blog - November 2015 Crawl Archive Now Available

Common Crawl - Blog - September 2015 Crawl Archive Now Available

Common Crawl - Blog - June 2017 Crawl Archive Now Available

Common Crawl - Blog - July 2017 Crawl Archive Now Available

Common Crawl - Blog - May 2016 Crawl Archive Now Available

Common Crawl - Blog - April 2017 Crawl Archive Now Available

Common Crawl - Blog - April 2016 Crawl Archive Now Available

Common Crawl - Blog - Web Languages Needing Review by Native Speakers

(introduced in this blog post in December of last year), and so far we’ve had 266 contributions from 67 people, thanks to whom we’ve added over 4,700 LOTE URLs to our seed list so far.…

Common Crawl - Blog - September 2017 Crawl Archive Now Available

Common Crawl - Blog - December 2017 Crawl Archive Now Available

Common Crawl - Blog - February 2018 Crawl Archive Now Available

Common Crawl - Blog - October 2017 Crawl Archive Now Available

Common Crawl - Blog - August 2017 Crawl Archive Now Available

Common Crawl - Blog - March 2026 Crawl Archive Now Available

We also recently ran an experiment to measure the adoption of IPv6 across the top 100k web hosts, about which you can read in our recent blog post, and see the corresponding data and code in its GitHub repository. The crawl.…

Common Crawl - Blog - October 2016 Crawl Archive Now Available

Common Crawl - Blog - December 2016 Crawl Archive Now Available
