Search results
June 19, 2012. OSCON 2012. We're just one month away from one of the biggest and most exciting events of the year, O'Reilly's Open Source Convention (OSCON). This year's conference will be held July 16th-20th in Portland, Oregon. Allison Domicone.…
July 16, 2012. 2012 Crawl Data Now Available. I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages. Common Crawl Foundation.…
August 13, 2013. A Look Inside Our 210TB 2012 Web Corpus. Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler! Common Crawl Foundation.…
BDT204 Awesome Applications of Open Data – AWS re: Invent 2012. Amazon Web Services. Discussion of how open, public datasets can be harnessed using the AWS cloud.…
November 13, 2013. Hyperlink Graph from Web Data Commons. The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus.…
March 28, 2012. Data 2.0 Summit. Next week a few members of the Common Crawl team are going the Data 2.0 Summit in San Francisco. Common Crawl Foundation.…
March 22, 2012. Web Data Commons. For the last few months, we have been talking with Chris Bizer and Hannes Mühleisen at the Freie Universität Berlin about their work and we have been greatly looking forward the announcement of the Web Data Commons.…
August 14, 2013. Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data. Sebastian Spiegler is the head of the data team and SwiftKey and a volunteer at Common Crawl.…
November 15, 2012. The Norvig Web Data Science Award. We are very excited to announce the Norvig Web Data Science Award! Common Crawl and SARA created the award to encourage research in web data science. Common Crawl Foundation.…
January 10, 2012. Video: This Week in Startups - Gil Elbaz and Nova Spivack. Nova and Gil, in discussion with host Jason Calacanis, explore in depth what Common Crawl is all about and how it fits into the larger picture of online search and indexing.…
July 18, 2012. Common Crawl's Brand Spanking New Video and First Ever Code Contest! At Common Crawl we've been busy recently!…
Here's a look at how our presence in academic citations has evolved: This graph shows our citation count in Google Scholar from 2012 to 2023. More information on how this is collected can be found in. this GitHub repository.…
August 3, 2012. Strata Conference + Hadoop World. This year's Strata Conference teams up with Hadoop World for what promises to be a powerhouse convening in NYC from October 23-25. Check out their full announcement below and secure your spot today.…
January 12, 2012. Gil Elbaz and Nova Spivack on This Week in Startups. Nova and Gil, in discussion with host Jason Calacanis, explore in depth what Common Crawl is all about and how it fits into the larger picture of online search and indexing.…
July 2, 2018. June 2018 Crawl Archive Now Available. The crawl archive for June 2018 is now available! The archive contains 3.05 billion web pages and 235 TiB of uncompressed content, crawled between June 18th and 25th. Sebastian Nagel.…
Eva has been our advisor since January 2012! The Data. Overview. Web Graphs. Latest Crawl. Crawl Stats. Graph Stats. Errata. Resources. Get Started. AI Agent. Blog. Examples. Use Cases. CCBot. Infra Status. FAQ. Community. Research Papers.…
January 19, 2012. Video Tutorial: MapReduce for the Masses. Learn how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.…
Greg Lindahl is an astrophysicist and technologist who has advised Common Crawl since 2012. He was previously the Founder and CTO of Blekko, an internet search engine.…
July 3, 2012. The Open Cloud Consortium’s Open Science Data Cloud. Common Crawl has started talking with the Open Cloud Consortium (OCC) about working together.…
August 1, 2012. Mat Kelcey Joins The Common Crawl Advisory Board. We are excited to announce that Mat Kelcey has joined the Common Crawl Board of Advisors!…
August 10, 2012. Amazon Web Services sponsoring $50 in credit to all contest entrants! Did you know that every entry to the First Ever Common Crawl Code Contest gets $50 in Amazon Web Services (AWS) credits?…
August 7, 2012. Still time to participate in the Common Crawl code contest. There is still plenty of time left to participate in the Common Crawl code contest! …
August 30, 2012. Common Crawl Code Contest Extended Through the Holiday Weekend. Do you have a project that you are working on for the Common Crawl Code Contest that is not quite ready? If so, you are not the only one.…
August 29, 2014. Web Data Commons Extraction Framework for the Distributed Processing of CC Data.…
January 23, 2012. Common Crawl on AWS Public Data Sets. Common Crawl is thrilled to announce that our data is now hosted on Amazon Web Services' Public Data Sets. Common Crawl Foundation.…
August 14, 2012. TalentBin Adds Prizes To The Code Contest. The prize package for the Common Crawl Code Contest now includes three Nexus 7 tablets thanks to TalentBin! Common Crawl Foundation.…
Topology of the 2012 WDC Hyperlink Graph document. Further details of WebGraph tool installation/usage, and the data visualization may be found in the. cc-notebooks. repository.…
February 8, 2018. Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018.…
February 20, 2019. Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2018 and January 2019.…
April 13, 2012. Big Data Week: meetups in SF and around the world. Big Data Week aims to connect data enthusiasts, technologists, and professionals across the globe through a series of meet-ups.…
December 17, 2012. blekko donates search data to Common Crawl. We are very excited to announce that blekko is donating search data to Common Crawl!…
Eva has been our advisor since January 2012! Discord Server. We’re excited to announce our new. Discord Server. , to augment our online discussions in our. Mailing List on Google Groups. , and elsewhere.…
September 18, 2012. Winners of the Code Contest! We’re very excited to announce the winners of the First Ever Common Crawl Code Contest! We were thrilled by the response to the contest and the many great entries.…
CC-MAIN-2016-36. to. CC-MAIN-2016-50. , and. CC-MAIN-2018-34. to. CC-MAIN-2019-47. the fetch_time metadata for. robots.txt. might be incorrect. The correct times can be found in. collinfo.json.…
February 15, 2012. Common Crawl's Advisory Board. As part of our ongoing effort to grow Common Crawl into a truly useful and innovative tool, we recently formed an Advisory Board to guide us in our efforts.…
May 9, 2012. Learn Hadoop and get a paper published. We're looking for students who want to try out the Apache Hadoop platform and get a technical report published. Allison Domicone.…
This graph shows our citation count in Google Scholar from 2012 to 2025. Overall, our archive has been cited in over 10,000 academic papers, spanning natural language processing, machine learning, digital humanities, and many other fields.…
May 22, 2017. Common Crawl's First In-House Web Graph. We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges. Sebastian Nagel.…
November 12, 2019. Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2019.…
October 30, 2012. Towards Social Discovery - New Content Models; New Data; New Toolsets. This is a guest blog post by Matthew Berk, Founder of Lucky Oyster. Matthew has been on the front lines of search technology for the past decade. Matthew Berk.…
November 16, 2011. Answers to Recent Community Questions. In this post we respond to the most common questions. Thanks for all the support and please keep the questions coming! Common Crawl Foundation.…
Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2019 and January 2020.…
August 8, 2019. Host- and Domain-Level Web Graphs May/June/July 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2019.…
May 9, 2019. Host- and Domain-Level Web Graphs Feb/Mar/Apr 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2019.…
January 8, 2014. Winter 2013 Crawl Data Now Available. The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013.…
September 22, 2014. August 2014 Crawl Data Available. The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. Stephen Merity.…
January 9, 2015. December 2014 Crawl Archive Available. The crawl archive for December 2014 is now available! This crawl archive is over 160TB in size and contains 2.08 billion webpages. Stephen Merity.…
July 23, 2015. June 2015 Crawl Archive Available. The crawl archive for June 2015 is now available! This crawl archive is over 131TB in size and holds more than 1.67 billion webpages. Stephen Merity.…
May 20, 2015. March 2015 Crawl Archive Available. The crawl archive for March 2015 is now available! This crawl archive is over 124TB in size and holds more than 1.64 billion webpages. Stephen Merity.…
July 8, 2015. May 2015 Crawl Archive Available. The crawl archive for May 2015 is now available! This crawl archive is over 159TB in size and holds more than 2.05 billion webpages. Stephen Merity.…
March 4, 2015. January 2015 Crawl Archive Available. The crawl archive for January 2015 is now available! This crawl archive is over 139TB in size and contains 1.82 billion webpages. Stephen Merity.…
December 24, 2014. November 2014 Crawl Archive Available. The crawl archive for November 2014 is now available! This crawl archive is over 135TB in size and contains 1.95 billion webpages. Stephen Merity.…
August 15, 2015. July 2015 Crawl Archive Available. The crawl archive for June 2015 is now available! This crawl archive is over 145TB in size and holds more than 1.81 billion webpages. Stephen Merity.…
May 28, 2015. April 2015 Crawl Archive Available. The crawl archive for April 2015 is now available! This crawl archive is over 168TB in size and holds more than 2.11 billion webpages. Stephen Merity.…
July 16, 2014. April 2014 Crawl Data Available. The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. Stephen Merity.…
October 10, 2015. August 2015 Crawl Archive Available. The crawl archive for August 2015 is now available! This crawl archive is over 149TB in size and holds more than 1.84 billion webpages. Stephen Merity.…
March 31, 2015. February 2015 Crawl Archive Available. The crawl archive for February 2015 is now available! This crawl archive is over 145TB in size and over 1.9 billion webpages. Stephen Merity.…
August 7, 2014. July 2014 Crawl Data Available. The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 3.6 billion webpages. Stephen Merity.…
November 12, 2014. September 2014 Crawl Archive Available. The crawl archive for September 2014 is now available! This crawl archive is over 220TB in size and contains 2.98 billion webpages. Stephen Merity.…
November 20, 2014. October 2014 Crawl Archive Available. The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. Stephen Merity.…