Search results
Common Crawl Blog. The latest news, interviews, technologies, and resources.
May 13, 2016. Welcome, Sebastian! It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April.…
June 19, 2012. OSCON 2012. We're just one month away from one of the biggest and most exciting events of the year, O'Reilly's Open Source Convention (OSCON). This year's conference will be held July 16th-20th in Portland, Oregon.…
October 4, 2016. News Dataset Available. We are pleased to announce the release of a new dataset containing news articles from news sites all over the world. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.…
March 28, 2012. Data 2.0 Summit. Next week a few members of the Common Crawl team are going to the Data 2.0 Summit in San Francisco. Common Crawl Foundation.…
March 22, 2012. Web Data Commons.…
March 5, 2013. URL Search Tool! Note: this post has been marked as obsolete.…
January 21, 2025. Introducing cc-downloader. Introducing a command-line tool written in Rust for downloading data from Common Crawl. Pedro Ortiz Suarez. Pedro is a French-Colombian mathematician, computer scientist, and researcher.…
November 29, 2011. Common Crawl Discussion List.…
November 25, 2024. October/November 2024 Newsletter. We’re pleased to announce this month's newsletter, featuring key updates, upcoming events, and community highlights. Jen English.…
February 15, 2012. Common Crawl's Advisory Board. As part of our ongoing effort to grow Common Crawl into a truly useful and innovative tool, we recently formed an Advisory Board to guide us in our efforts.…
April 1, 2015. Evaluating graph computation systems. This is a guest blog post by Frank McSherry, a computer science researcher active in the area of large scale data analysis.…
November 27, 2013. New Crawl Data Available! We are very pleased to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed).…
September 10, 2024. August/September 2024 Newsletter. We're pleased to announce our newsletter for August and September 2024. Jen English.…
February 3, 2025. January/February 2025 Newsletter. We’re happy to share our January/February 2025 newsletter with updates and insights from the world of open data and web archiving. Jen English.…
June 25, 2024. May/June 2024 Newsletter. We’re pleased to share our newsletter for May/June 2024, featuring the latest updates, events, and highlights from our community. Greg Lindahl.…
March 26, 2024. March/April 2024 Newsletter. We're excited to share an update on some of our recent projects and initiatives in this newsletter! Common Crawl Foundation.…
August 3, 2012. Strata Conference + Hadoop World. This year's Strata Conference teams up with Hadoop World for what promises to be a powerhouse convening in NYC from October 23-25.…
January 8, 2013. Common Crawl URL Index. Note: this post has been marked as obsolete. We are thrilled to announce that Common Crawl now has a URL index!…
March 4, 2015. January 2015 Crawl Archive Available. The crawl archive for January 2015 is now available! This crawl archive is over 139TB in size and contains 1.82 billion webpages. Stephen Merity.…
April 8, 2015. Announcing the Common Crawl Index! This is a guest post by Ilya Kreymer, a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl.…
November 16, 2011. Answers to Recent Community Questions. In this post we respond to the most common questions. Thanks for all the support and please keep the questions coming! Common Crawl Foundation.…
October 28, 2020. Interactive Webgraph Statistics Notebook Released.…
April 2, 2014. Navigating the WARC file format. Wait, what's WAT, WET and WARC? Recently CommonCrawl has switched to the Web ARChive (WARC) format.…
May 28, 2015. April 2015 Crawl Archive Available. The crawl archive for April 2015 is now available! This crawl archive is over 168TB in size and holds more than 2.11 billion webpages. Stephen Merity.…
August 15, 2015. July 2015 Crawl Archive Available. The crawl archive for July 2015 is now available! This crawl archive is over 145TB in size and holds more than 1.81 billion webpages. Stephen Merity.…
December 24, 2014. November 2014 Crawl Archive Available. The crawl archive for November 2014 is now available! This crawl archive is over 135TB in size and contains 1.95 billion webpages. Stephen Merity.…
July 8, 2015. May 2015 Crawl Archive Available. The crawl archive for May 2015 is now available! This crawl archive is over 159TB in size and holds more than 2.05 billion webpages. Stephen Merity.…
September 30, 2024. IAB Workshop on AI-CONTROL. Earlier this month, the Common Crawl Foundation had the privilege of participating in a groundbreaking workshop hosted by the Internet Architecture Board (IAB) in Washington DC.…
May 20, 2015. March 2015 Crawl Archive Available. The crawl archive for March 2015 is now available! This crawl archive is over 124TB in size and holds more than 1.64 billion webpages. Stephen Merity.…
September 22, 2014. August 2014 Crawl Data Available. The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. Stephen Merity.…
December 10, 2014. Please Donate To Common Crawl! Big data has the potential to change the world. The talent exists and the tools are already there. What’s lacking is access to data.…
January 9, 2015. December 2014 Crawl Archive Available. The crawl archive for December 2014 is now available! This crawl archive is over 160TB in size and contains 2.08 billion webpages. Stephen Merity.…
July 16, 2012. 2012 Crawl Data Now Available. I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages. Common Crawl Foundation.…
July 23, 2015. June 2015 Crawl Archive Available. The crawl archive for June 2015 is now available! This crawl archive is over 131TB in size and holds more than 1.67 billion webpages. Stephen Merity.…
July 16, 2014. April 2014 Crawl Data Available. The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. Stephen Merity.…
September 18, 2012. Winners of the Code Contest! We’re very excited to announce the winners of the First Ever Common Crawl Code Contest! We were thrilled by the response to the contest and the many great entries.…
October 10, 2015. August 2015 Crawl Archive Available. The crawl archive for August 2015 is now available! This crawl archive is over 149TB in size and holds more than 1.84 billion webpages. Stephen Merity.…
November 15, 2023. Oct/Nov 2023 Performance Issues. Our datasets have become very popular over time, with downloads doubling every 6 months for several years in a row.…
March 1, 2024. Web Archiving File Formats Explained. In the ever-evolving landscape of digital archiving and data analysis, it is helpful to understand the various file formats used for web crawling.…
November 20, 2014. October 2014 Crawl Archive Available. The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. Stephen Merity.…
March 31, 2015. February 2015 Crawl Archive Available. The crawl archive for February 2015 is now available! This crawl archive is over 145TB in size and holds over 1.9 billion webpages. Stephen Merity.…
November 12, 2014. September 2014 Crawl Archive Available. The crawl archive for September 2014 is now available! This crawl archive is over 220TB in size and contains 2.98 billion webpages. Stephen Merity.…
August 7, 2014. July 2014 Crawl Data Available. The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 3.6 billion webpages. Stephen Merity.…
February 20, 2014. Common Crawl's Move to Nutch. Last year we transitioned from our custom crawler to the Apache Nutch crawler to run our 2013 crawls as part of our migration from our old data center to the cloud.…
October 29, 2019. October 2019 crawl archive now available. The crawl archive for October 2019 is now available! It contains 3.0 billion web pages or 280 TiB of uncompressed content, crawled between October 13th and 24th.…
July 30, 2019. July 2019 crawl archive now available. The crawl archive for July 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between July 15th and 24th. Sebastian Nagel.…
April 4, 2017. March 2017 Crawl Archive Now Available. The crawl archive for March 2017 is now available! The archive contains 3.07 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel.…
February 29, 2016. February 2016 Crawl Archive Now Available. As an interim crawl engineer for CommonCrawl, I am pleased to announce that the crawl archive for February 2016 is now available!…
May 9, 2017. April 2017 Crawl Archive Now Available. The crawl archive for April 2017 is now available! The archive contains 2.94 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel.…
August 28, 2017. August 2017 Crawl Archive Now Available. The crawl archive for August 2017 is now available! The archive contains 3.28 billion+ web pages and over 280 TiB of uncompressed content. Sebastian Nagel.…
October 29, 2017. October 2017 Crawl Archive Now Available. The crawl archive for October 2017 is now available! The archive contains 3.65 billion web pages and over 300 TiB of uncompressed content. Sebastian Nagel.…
March 29, 2018. March 2018 Crawl Archive Now Available. The crawl archive for March 2018 is now available! The archive contains 3.2 billion web pages and 250+ TiB of uncompressed content, crawled between March 17th and 25th.…
March 2, 2018. February 2018 Crawl Archive Now Available. The crawl archive for February 2018 is now available! The archive contains 3.4 billion web pages and 270+ TiB of uncompressed content, crawled between February 17th and February 26th.…
May 24, 2016. April 2016 Crawl Archive Now Available. The crawl archive for April 2016 is now available! More than 1.33 billion webpages are in the archive. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.…
October 4, 2021. September 2021 crawl archive now available. The crawl archive for September 2021 is now available! The data was crawled Sept 16 – 29 and contains 2.95 billion web pages or 310 TiB of uncompressed content.…
May 1, 2024. April 2024 Crawl Archive Now Available. We are pleased to announce that the crawl archive for April 2024 is now available.…
September 24, 2024. September 2024 Crawl Archive Now Available. The crawl archive for September 2024 is now available.…
July 28, 2024. July 2024 Crawl Archive Now Available. We are pleased to announce that the crawl archive for July 2024 is now available, containing 2.5 billion web pages, or 360 TiB of uncompressed content. Thom Vaughan.…
November 13, 2013. Hyperlink Graph from Web Data Commons. The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus.…