Search results

Common Crawl - Blog - WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

This is a guest blog post by Ross Fairbanks, a software developer based in Barcelona. He mainly develops in Ruby and is interested in open data and cloud computing. This guest post describes his open data project and why he built it. Ross Fairbanks.

Common Crawl - Blog - Analyzing a Web graph with 129 billion edges using FlashGraph

This is a guest blog post by Da Zheng, the architect and main developer of the FlashGraph project.

Common Crawl - Blog - Announcing the Common Crawl Index!

This is a guest post by Ilya Kreymer, a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl.

Common Crawl - Blog - Towards Social Discovery - New Content Models; New Data; New Toolsets

This is a guest blog post by Matthew Berk, Founder of Lucky Oyster. Matthew has been on the front lines of search technology for the past decade. Matthew Berk. Matthew Berk is a founder at Bean Box and Open List, worked at Jupiter Research and Marchex.

Common Crawl - Blog - Navigating the WARC file format

April 2, 2014. Navigating the WARC file format. Wait, what's WAT, WET and WARC? Recently CommonCrawl has switched to the Web ARChive (WARC) format.

Common Crawl - Blog - URL Search Tool!

Note: this post has been marked as obsolete. A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index.

Common Crawl - Blog - Twelve steps to running your Ruby code across five billion web pages

Twelve steps to running your Ruby code across five billion web pages. The following is a guest blog post by Pete Warden, a member of the Common Crawl Advisory Board. Pete is a British-born programmer living in San Francisco.

Common Crawl - Blog - August Crawl Archive Introduces Language Annotations

It contains 2.65 billion web pages and 220 TiB of uncompressed content, crawled between August 14th and 22nd.

Common Crawl - Blog - Answers to Recent Community Questions

In this post we respond to the most common questions. Thanks for all the support and please keep the questions coming! Common Crawl Foundation.

Common Crawl - Blog - Lexalytics Text Analysis Work with Common Crawl Data

This is a guest blog post by Oskar Singer, a Software Developer and Computer Science student at University of Massachusetts Amherst. He recently did some very interesting text analytics work during his internship at Lexalytics.

Common Crawl - Blog - Web Image Size Prediction for Efficient Focused Image Crawling

This is a guest blog post by Katerina Andreadou, a research assistant at CERTH, specializing in multimedia analysis and web crawling.

Common Crawl - Blog - Analysis of the NCSU Library URLs in the Common Crawl Index

Note: this post has been marked as obsolete. Last week we announced the Common Crawl URL Index.

Common Crawl - Blog - Web Data Commons Extraction Framework for the Distributed Processing of CC Data

This is a guest blog post by Robert Meusel, a researcher at the University of Mannheim in the Data and Web Science Research Group and a key member of the Web Data Commons project.

Common Crawl - Blog - October 2016 Crawl Archive Now Available

The archive contains more than 3.25 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for October 2016 is now available!

Common Crawl - Blog - March/April 2023 crawl archive now available

The data was crawled March 20 – April 2 and contains 3.1 billion web pages or 400 TiB of uncompressed content.

Common Crawl - Blog - February/March 2024 Crawl Archive Now Available

The data was crawled between February 20th and March 5th, and contains 3.16 billion web pages (or 424.7 TiB of uncompressed content). Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - July 2024 Crawl Archive Now Available

We are pleased to announce that the crawl archive for July 2024 is now available, containing 2.5 billion web pages, or 360 TiB of uncompressed content. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Evaluating graph computation systems

This is a guest blog post by Frank McSherry, a computer science researcher active in the area of large-scale data analysis. While at Microsoft Research he co-invented differential privacy, and led the Naiad streaming dataflow project.

Common Crawl - Blog - August 2015 Crawl Archive Available

For full details, refer to Ilya's guest blog post. Please donate to Common Crawl if you appreciate our free datasets! We're also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data!

Common Crawl - Blog - July 2015 Crawl Archive Available

For full details, refer to Ilya's guest blog post. Please donate to Common Crawl if you appreciate our free datasets! We're also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data!

Common Crawl - Blog - April 2015 Crawl Archive Available

For full details, refer to Ilya's guest blog post. Please donate to Common Crawl if you appreciate our free datasets! We're also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data!

Common Crawl - Blog - May 2015 Crawl Archive Available

For full details, refer to Ilya's guest blog post. Please donate to Common Crawl if you appreciate our free datasets! We're also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data!

Common Crawl - Blog - June 2015 Crawl Archive Available

For full details, refer to Ilya's guest blog post. Please donate to Common Crawl if you appreciate our free datasets! We're also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data!

Common Crawl - Blog - March 2015 Crawl Archive Available

For full details, refer to Ilya's guest blog post. Please donate to Common Crawl if you appreciate our free datasets! We're also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data!

Common Crawl - Blog - January 2021 crawl archive now available

February 2, 2021. January 2021 crawl archive now available. The crawl archive for January 2021 is now available! The data was crawled between January 15th and 28th and contains 3.4 billion web pages or 350 TiB of uncompressed content.

Common Crawl - Erratum - Some 2–Level CCTLDs Excluded

Some 2–Level CCTLDs Excluded. A bad configuration was checked into our exclusion list on Sep 22, 2022 and was fixed on Oct 27, 2023. The configuration blocked a number of 2–level domains, meaning they were not included in certain crawls.

Common Crawl - Blog - Common Crawl URL Index

Note: this post has been marked as obsolete. We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io graciously donated his time and skills to creating this valuable tool. Scott Robertson.

Common Crawl - Blog - January 2022 crawl archive now available

February 2, 2022. January 2022 crawl archive now available. The crawl archive for January 2022 is now available! The data was crawled January 16 – 29 and contains 2.95 billion web pages or 320 TiB of uncompressed content.

Common Crawl - Blog - May 2022 crawl archive now available

June 2, 2022. May 2022 crawl archive now available. The crawl archive for May 2022 is now available! The data was crawled May 16 – 29 and contains 3.45 billion web pages or 420 TiB of uncompressed content.

Common Crawl - Blog - Common Crawl Statistics Now Available on Hugging Face

The table shows the percentage of how character sets have been used to encode HTML pages crawled by the latest monthly crawls. The character set or encoding of HTML pages only is identified by Apache Tika™'s AutoDetectReader. Crawl Metrics.

Common Crawl - Blog - September 2016 Crawl Archive Now Available

The archive contains more than 1.72 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for September 2016 is now available!

Common Crawl - Blog - July 2019 crawl archive now available

It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between July 15th and 24th. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for July 2019 is now available!

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017

As compared to prior web graphs, two changes are caused by the large size of this host-level graph (5.1 billion hosts): The text dump of the graph is split into multiple files; there is no page rank calculation at this time.

Common Crawl - Blog - Introducing the Common Crawl Errata Page for Data Transparency

Introducing the Common Crawl Errata Page for Data Transparency. As part of our commitment to accuracy and transparency, we are pleased to introduce a new Errata page on our website. Thom Vaughan.

Common Crawl - Blog - May/June 2020 crawl archive now available

It contains 2.75 billion web pages or 255 TiB of uncompressed content, crawled between May 24th and June 7th. It includes page captures of 1.2 billion URLs unknown in any of our prior crawl archives. Sebastian Nagel.

Common Crawl - Blog - March 2017 Crawl Archive Now Available

The archive contains 3.07 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for March 2017 is now available!

Common Crawl - Blog - September 2019 crawl archive now available

It contains 2.55 billion web pages or 240 TiB of uncompressed content, crawled between September 15th and 24th. It includes page captures of 1.0 billion URLs not contained in any crawl archive before.

Common Crawl - Blog - Learn Hadoop and get a paper published

Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data - about 5 billion web pages - and documentation to help you learn these tools.

Common Crawl - Blog - April 2017 Crawl Archive Now Available

The archive contains 2.94 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for April 2017 is now available!

Common Crawl - Blog - New Crawl Data Available!

The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed). Common Crawl Foundation. Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Common Crawl - Blog - October 2019 crawl archive now available

It contains 3.0 billion web pages or 280 TiB of uncompressed content, crawled between October 13th and 24th. It includes page captures of 1.1 billion URLs not contained in any crawl archive before. Sebastian Nagel.

Common Crawl - Blog - August 2019 crawl archive now available

It contains 2.95 billion web pages or 260 TiB of uncompressed content, crawled between August 17th and 26th. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for August 2019 is now available!

Common Crawl - Blog - August 2020 crawl archive now available

It contains 2.45 billion web pages or 235 TiB of uncompressed content, crawled between August 2nd and 15th. It includes page captures of 940 million URLs unknown in any of our prior crawl archives. Sebastian Nagel.

Common Crawl - Blog - December 2019 crawl archive now available

It contains 2.45 billion web pages or 234 TiB of uncompressed content, crawled between December 5th and 16th. It includes page captures of 850 million URLs not contained in any crawl archive before. Sebastian Nagel.

Common Crawl - Blog - January 2020 crawl archive now available

It contains 3.1 billion web pages or 300 TiB of uncompressed content, crawled between January 17th and 29th. It includes page captures of 960 million URLs not contained in any crawl archive before. Sebastian Nagel.

Common Crawl - Blog - February 2018 Crawl Archive Now Available

March 2, 2018. February 2018 Crawl Archive Now Available. The crawl archive for February 2018 is now available! The archive contains 3.4 billion web pages and 270+ TiB of uncompressed content, crawled between February 17th and Feb 26th. Sebastian Nagel.

Common Crawl - Blog - February 2017 Crawl Archive Now Available

The archive contains 3.08 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for February 2017 is now available!

Common Crawl - Blog - September 2021 crawl archive now available

The data was crawled Sept 16 – 29 and contains 2.95 billion web pages or 310 TiB of uncompressed content. It includes page captures of 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - January 2017 Crawl Archive Now Available

The archive contains more than 3.14 billion web pages and about 250 TiB of uncompressed content. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl. The crawl archive for January 2017 is now available!

Common Crawl - Blog - July 2020 crawl archive now available

It contains 3.14 billion web pages or 300 TiB of uncompressed content, crawled between July 2nd and 16th. It includes page captures of 1.1 billion URLs unknown in any of our prior crawl archives. Sebastian Nagel.

Common Crawl - Blog - February 2020 crawl archive now available

It contains 2.6 billion web pages or 240 TiB of uncompressed content, crawled between February 16th and 29th. It includes page captures of 1 billion URLs unknown in any of our prior crawl archives. Sebastian Nagel.

Common Crawl - Blog - June 2019 crawl archive now available

July 2, 2019. June 2019 crawl archive now available. The crawl archive for June 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between June 16th and 27th with an operational break from 21st to 24th.

Common Crawl - Blog - October 2021 crawl archive now available

The data was crawled Oct 15 – 28 and contains 3.3 billion web pages or 360 TiB of uncompressed content. It includes page captures of 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - March/April 2020 crawl archive now available

It contains 2.85 billion web pages or 280 TiB of uncompressed content, crawled between March 28th and April 10th. It includes page captures of 1 billion URLs unknown in any of our prior crawl archives. Sebastian Nagel.

Common Crawl - Blog - May 2021 crawl archive now available

The data was crawled May 5 – 19 and contains 2.6 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.28 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - April 2025 Crawl Archive Now Available

The data was crawled between April 17th and May 1st, and contains 2.74 billion web pages (or 468 TiB of uncompressed content).

Common Crawl - Blog - April 2021 crawl archive now available

The data was crawled April 10 – 23 and contains 3.1 billion web pages or 320 TiB of uncompressed content. It includes page captures of 1.35 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - June 2021 crawl archive now available

The data was crawled June 12 – 25 and contains 2.45 billion web pages or 260 TiB of uncompressed content. It includes page captures of 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - November/December 2021 crawl archive now available

The data was crawled Nov 26 – Dec 9 and contains 2.5 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.2 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information in the Amazon cloud, a net total of 5 billion crawled pages. Common Crawl Foundation.