Search results

Common Crawl - Blog - The Promise of Open Government Data & Where We Go Next

In May 2013, the White House released its Open Data Policy and announced the launch of Project Open Data, a repository of tools and information--which anyone is free to contribute to--that help government agencies release data that is “available, discoverable

Common Crawl - Mission

Open Data derived from web crawls can contribute to informed decision-making at both individual and governmental levels.

Common Crawl - Use Cases

Mapping French Open Data Actors on the Web with Common Crawl. Gulliame LeBourgeois. The Switchabalizer – Our Journey from Spell-Checker to Homophone Corrector. Oskar Singer.

Common Crawl - Blog - 5 Good Reads in Big Open Data: Feb 20 2015

February 20, 2015. 5 Good Reads in Big Open Data: Feb 20 2015. A thriving ecosystem is the key for real viability of any technology.

Common Crawl - Blog - Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data

Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data. Sebastian Spiegler is the head of the data team at SwiftKey and a volunteer at Common Crawl.

Common Crawl - Blog - 5 Good Reads in Big Open Data: March 20 2015

March 20, 2015. 5 Good Reads in Big Open Data: March 20 2015.

Common Crawl - Blog - 5 Good Reads in Big Open Data: Feb 6 2015

February 6, 2015. 5 Good Reads in Big Open Data: Feb 6 2015.

Common Crawl - Blog - 5 Good Reads in Big Open Data: Feb 13 2015

February 13, 2015. 5 Good Reads in Big Open Data: Feb 13 2015. What does it mean for the Open Web if users don't know they're on the internet? Via QUARTZ: “This is more than a matter of semantics.

Common Crawl - Blog - Winners of the Code Contest!

People’s Choice: French Open Data. Another very popular entry, this work maps the ecosphere of French open data in order to identify the players, their importance, and their relationship.

Common Crawl - Blog - Introducing CloudFront as a new way to access Common Crawl data as part of Amazon Web Services’ registry of open data

Introducing CloudFront as a new way to access Common Crawl data as part of Amazon Web Services’ registry of open data. Ten years ago(!) Common Crawl joined AWS’s Open Data Sponsorships program, hosted on S3, with free access to everyone.

Common Crawl - CCBot

Common Crawl is a non-profit foundation founded with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable by anyone.

Common Crawl - Blog - 5 Good Reads in Big Open Data: March 6 2015

March 6, 2015. 5 Good Reads in Big Open Data: March 6 2015. 2015: What do you think about Machines that think?

Common Crawl - Blog - WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

Ross Fairbanks is a software developer based in Barcelona. He mainly develops in Ruby and is interested in open data and cloud computing. This guest post describes his open data project and why he built it. What is WikiReverse?

Common Crawl - Blog - January 2015 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for January 2015 is now available!

Common Crawl - Blog - 5 Good Reads in Big Open Data: March 26 2015

March 26, 2015. 5 Good Reads in Big Open Data: March 26 2015.

Common Crawl - Blog - April 2015 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for April 2015 is now available!

Common Crawl - Blog - July 2015 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for July 2015 is now available!

Common Crawl - Blog - August 2015 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for August 2015 is now available!

Common Crawl - Blog - February 2015 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for February 2015 is now available!

Common Crawl - Blog - March 2015 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for March 2015 is now available!

Common Crawl - Blog - June 2015 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for June 2015 is now available!

Common Crawl - Blog - May 2015 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for May 2015 is now available!

Common Crawl - Team - Stephen Merity

Stephen Merity is an independent AI researcher, who is passionate about machine learning, Open Data, and teaching computer science.

Common Crawl - Blog - 5 Good Reads in Big Open Data: February 27 2015

February 27, 2015. 5 Good Reads in Big Open Data: February 27 2015.

Common Crawl - Blog - 5 Good Reads in Big Open Data: March 13 2015

March 13, 2015. 5 Good Reads in Big Open Data: March 13 2015. Jürgen Schmidhuber- Ask Me Anything - via Reddit: Jürgen has pioneered self-improving general problem solvers and Deep Learning Neural Networks for decades.

Common Crawl - Blog - November 2014 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for November 2014 is now available!

Common Crawl - Blog - October 2014 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for October 2014 is now available!

Common Crawl - Blog - September 2014 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for September 2014 is now available!

Common Crawl - Blog - April 2014 Crawl Data Available

April 2014 Crawl Data Available. The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. Stephen Merity.

Common Crawl - Blog - August 2014 Crawl Data Available

August 2014 Crawl Data Available. The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. Stephen Merity.

Common Crawl - Blog - December 2014 Crawl Archive Available

Stephen Merity is an independent AI researcher, who is passionate about machine learning, open data, and teaching computer science. The crawl archive for December 2014 is now available!

Common Crawl - Blog - July 2014 Crawl Data Available

July 2014 Crawl Data Available. The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 3.6 billion webpages. Stephen Merity.

Common Crawl - Impact

In the realm of communication and content creation, LLMs based on Open Data have revolutionized the way information is disseminated.

Common Crawl - Blog - May 2021 crawl archive now available

The data was crawled May 5 – 19 and contains 2.6 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.28 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - October 2021 crawl archive now available

The data was crawled Oct 15 – 28 and contains 3.3 billion web pages or 360 TiB of uncompressed content. It includes page captures of 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - March/April 2020 crawl archive now available

The March/April crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-16/. To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.

Common Crawl - Blog - July 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-30/ contains more than 1.73 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - December 2019 crawl archive now available

The December crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-51/. To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.

Common Crawl - Blog - September 2021 crawl archive now available

The data was crawled Sept 16 – 29 and contains 2.95 billion web pages or 310 TiB of uncompressed content. It includes page captures of 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - October 2019 crawl archive now available

The October crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-43/. To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.

Common Crawl - Blog - June 2021 crawl archive now available

The data was crawled June 12 – 25 and contains 2.45 billion web pages or 260 TiB of uncompressed content. It includes page captures of 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - April 2021 crawl archive now available

The data was crawled April 10 – 23 and contains 3.1 billion web pages or 320 TiB of uncompressed content. It includes page captures of 1.35 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - June 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/ contains more than 1.23 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - May 2022 crawl archive now available

The data was crawled May 16 – 29 and contains 3.45 billion web pages or 420 TiB of uncompressed content. Page captures are from 45 million hosts or 36 million registered domains and include 1.4 billion new URLs, not visited in any of our prior crawls.

Common Crawl - Blog - January/February 2023 crawl archive now available

The data was crawled January 26 – February 9 and contains 3.15 billion web pages or 400 TiB of uncompressed content.

Common Crawl - Blog - September 2019 crawl archive now available

The September crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-39/. To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.

Common Crawl - Blog - November 2015 Crawl Archive Now Available

crawl-data/CC-MAIN-2015-48/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-48/segment.paths.gz), all WARC files (CC-MAIN-2015-48/warc.paths.gz), all WAT files.
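
The listing files named in this snippet follow a regular pattern, which can be sketched in a few lines of Python. This is a minimal illustration, not Common Crawl's own tooling: the function name `listing_keys` is hypothetical, and it assumes the listing files sit under the same `crawl-data/` prefix as the archive itself. Only the two file names shown in the snippet are used; the WAT listing's name is cut off in the snippet and is left out.

```python
# Hypothetical helper: builds bucket-relative keys for the listing files
# of one crawl. Only the file names visible in the snippet are included.
LISTING_FILES = {
    "segments": "segment.paths.gz",
    "warc": "warc.paths.gz",
}


def listing_keys(crawl_id: str) -> dict[str, str]:
    """Return a mapping from listing kind to its key for the given crawl ID."""
    # Assumption: listings live directly under crawl-data/<crawl_id>/.
    prefix = f"crawl-data/{crawl_id}/"
    return {kind: prefix + name for kind, name in LISTING_FILES.items()}


if __name__ == "__main__":
    for kind, key in listing_keys("CC-MAIN-2015-48").items():
        print(kind, key)
```

Passing another crawl ID from these results, such as `CC-MAIN-2015-40`, yields the corresponding keys for that archive.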

Common Crawl - Blog - September 2015 Crawl Archive Now Available

crawl-data/CC-MAIN-2015-40/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-40/segment.paths.gz), all WARC files (CC-MAIN-2015-40/warc.paths.gz), all WAT files.

Common Crawl - Blog - January 2022 crawl archive now available

The data was crawled January 16 – 29 and contains 2.95 billion web pages or 320 TiB of uncompressed content. It includes page captures of 1.35 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - June/July 2022 crawl archive now available

The data was crawled June 24 – July 7 and contains 3.1 billion web pages or 370 TiB of uncompressed content. Page captures are from 44 million hosts or 35 million registered domains and include 1.4 billion new URLs, not visited in any of our prior crawls.

Common Crawl - Blog - September 2020 crawl archive now available

The data was crawled between September 18th and October 2nd and contains 3.45 billion web pages or 345 TiB of uncompressed content. It includes page captures of 1.5 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - November/December 2020 crawl archive now available

The data was crawled between November 23 and December 6 and contains 2.64 billion web pages or 270 TiB of uncompressed content. It includes page captures of 1.4 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - August 2020 crawl archive now available

The August crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-34/. To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.

Common Crawl - Blog - February 2016 Crawl Archive Now Available

crawl-data/CC-MAIN-2016-07/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2016-07/segment.paths.gz), all WARC files (CC-MAIN-2016-07/warc.paths.gz), all WAT files.

Common Crawl - Blog - November/December 2022 crawl archive now available

The data was crawled November 26 – December 10 and contains 3.35 billion web pages or 420 TiB of uncompressed content.

Common Crawl - Blog - January 2021 crawl archive now available

The data was crawled between January 15th and 28th and contains 3.4 billion web pages or 350 TiB of uncompressed content. It includes page captures of 1.15 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - August 2022 crawl archive now available

The data was crawled August 7 – 20 and contains 2.55 billion web pages or 295 TiB of uncompressed content. Page captures are from 46 million hosts or 37 million registered domains and include 1.3 billion new URLs, not visited in any of our prior crawls.

Common Crawl - Blog - October 2020 crawl archive now available

The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.5 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - February/March 2021 crawl archive now available

The data was crawled between February 24th and March 9th and contains 2.7 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.2 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - January 2020 crawl archive now available

Improvements and Fixes. Date-time values in the column "fetch_time" of the columnar index are now stored using the "int64" data type. For details and compatibility issues, please see cc-index-table#7.
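
For readers consuming raw "fetch_time" values, the following is a rough sketch of turning an int64 value back into a timestamp. The snippet does not state the unit, so the code assumes milliseconds since the Unix epoch in UTC and exposes that assumption as a parameter; `decode_fetch_time` is a hypothetical helper name, and cc-index-table#7 remains the place to check the actual encoding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_fetch_time(raw: int, units_per_second: int = 1000) -> datetime:
    """Convert an integer timestamp to a timezone-aware UTC datetime.

    Assumption: `raw` counts units since the Unix epoch; the default of
    1000 units per second (milliseconds) is a guess, not taken from the source.
    """
    seconds, remainder = divmod(raw, units_per_second)
    # Integer arithmetic avoids floating-point rounding on large values.
    microseconds = remainder * 1_000_000 // units_per_second
    return EPOCH + timedelta(seconds=seconds, microseconds=microseconds)


if __name__ == "__main__":
    print(decode_fetch_time(1579046400000).isoformat())
    # prints 2020-01-15T00:00:00+00:00
```

If the stored unit turns out to be microseconds, calling the function with `units_per_second=1_000_000` is the only change needed.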