Search results
July 18, 2012. Common Crawl's Brand Spanking New Video and First Ever Code Contest! At Common Crawl we've been busy recently!…
Overview of the original Common Crawl crawler (in use 2008-2013) discussing the Hadoop data processing pipeline, PageRank implementation, and the techniques used to optimize Hadoop. The Web of Data and Web Data Commons.…
March 28, 2012. Data 2.0 Summit. Next week a few members of the Common Crawl team are going to the Data 2.0 Summit in San Francisco. Common Crawl Foundation.…
July 16, 2012. 2012 Crawl Data Now Available. I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages. Common Crawl Foundation.…
July 16, 2014. April 2014 Crawl Data Available. The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. Stephen Merity.…
September 22, 2014. August 2014 Crawl Data Available. The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. Stephen Merity.…
August 7, 2014. July 2014 Crawl Data Available. The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 3.6 billion webpages. Stephen Merity.…
March 26, 2014. March 2014 Crawl Data Now Available. The March crawl of 2014 is now available! The new dataset contains approximately 2.8 billion webpages and is about 223TB in size. Common Crawl Foundation.…
January 8, 2014. Winter 2013 Crawl Data Now Available. The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013.…
May 22, 2017. Common Crawl's First In-House Web Graph. We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges. Sebastian Nagel.…
CC-MAIN-2016-36 to CC-MAIN-2016-50, and CC-MAIN-2018-34 to CC-MAIN-2019-47, the fetch_time metadata for robots.txt might be incorrect. The correct times can be found in collinfo.json.…
June 19, 2012. OSCON 2012. We're just one month away from one of the biggest and most exciting events of the year, O'Reilly's Open Source Convention (OSCON). This year's conference will be held July 16th-20th in Portland, Oregon. Allison Domicone.…
August 13, 2013. A Look Inside Our 210TB 2012 Web Corpus. Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler! Common Crawl Foundation.…
November 12, 2019. Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2019.…
February 20, 2015. 5 Good Reads in Big Open Data: Feb 20 2015. A thriving ecosystem is the key for real viability of any technology.…
February 6, 2015. 5 Good Reads in Big Open Data: Feb 6 2015.…
March 13, 2015. 5 Good Reads in Big Open Data: March 13 2015. Jürgen Schmidhuber- Ask Me Anything - via Reddit: Jürgen has pioneered self-improving general problem solvers and Deep Learning Neural Networks for decades.…
March 26, 2015. 5 Good Reads in Big Open Data: March 26 2015.…
February 13, 2015. 5 Good Reads in Big Open Data: Feb 13 2015. What does it mean for the Open Web if users don't know they're on the internet? Via QUARTZ: “This is more than a matter of semantics.…
March 20, 2015. 5 Good Reads in Big Open Data: March 20 2015.…
February 27, 2015. 5 Good Reads in Big Open Data: February 27 2015.…
March 6, 2015. 5 Good Reads in Big Open Data: March 6 2015. 2015: What do you think about Machines that think?…
Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2019 and January 2020.…
August 8, 2019. Host- and Domain-Level Web Graphs May/June/July 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2019.…
February 8, 2018. Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018.…
March 22, 2012. Web Data Commons. For the last few months, we have been talking with Chris Bizer and Hannes Mühleisen at the Freie Universität Berlin about their work, and we have been greatly looking forward to the announcement of the Web Data Commons.…
February 20, 2019. Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2018 and January 2019.…
November 15, 2012. The Norvig Web Data Science Award. We are very excited to announce the Norvig Web Data Science Award! Common Crawl and SARA created the award to encourage research in web data science. Common Crawl Foundation.…
May 9, 2019. Host- and Domain-Level Web Graphs Feb/Mar/Apr 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2019.…
November 13, 2013. Hyperlink Graph from Web Data Commons. The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus.…
August 14, 2013. Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data. Sebastian Spiegler is the head of the data team at SwiftKey and a volunteer at Common Crawl.…
August 12, 2018. Host- and Domain-Level Web Graphs May/June/July 2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2018.…
July 2, 2018. June 2018 Crawl Archive Now Available. The crawl archive for June 2018 is now available! The archive contains 3.05 billion web pages and 235 TiB of uncompressed content, crawled between June 18th and 25th. Sebastian Nagel.…
January 9, 2015. December 2014 Crawl Archive Available. The crawl archive for December 2014 is now available! This crawl archive is over 160TB in size and contains 2.08 billion webpages. Stephen Merity.…
April 1, 2019. March 2019 crawl archive now available. The crawl archive for March 2019 is now available! It contains 2.55 billion web pages or 210 TiB of uncompressed content, crawled between March 18th and 27th. Sebastian Nagel.…
March 29, 2018. March 2018 Crawl Archive Now Available. The crawl archive for March 2018 is now available! The archive contains 3.2 billion web pages and 250+ TiB of uncompressed content, crawled between March 17th and 25th. Sebastian Nagel.…
January 28, 2019. January 2019 crawl archive now available. The crawl archive for January 2019 is now available! It contains 2.85 billion web pages or 240 TiB of uncompressed content, crawled between January 15th and 24th. Sebastian Nagel.…
May 2, 2018. April 2018 Crawl Archive Now Available. The crawl archive for April 2018 is now available! The archive contains 3.1 billion web pages and 230 TiB of uncompressed content, crawled between April 19th and 27th. Sebastian Nagel.…
January 29, 2018. January 2018 Crawl Archive Now Available. The crawl archive for January 2018 is now available! The archive contains 3.4 billion web pages and 270 TiB of uncompressed content, crawled between January 16th and Jan 24th. Sebastian Nagel.…
November 13, 2018. Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2018.…
May 7, 2018. Host- and Domain-Level Web Graphs Feb/Mar/Apr 2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2018.…
March 1, 2019. February 2019 crawl archive now available. The crawl archive for February 2019 is now available! It contains 2.9 billion web pages or 225 TiB of uncompressed content, crawled between February 15th and 24th. Sebastian Nagel.…
She has worked in the areas of Open Access publishing, Open Science, Open Data, copyright, digital rights and policy. Lisa was Chief of Staff at Creative Commons and served as the director of Common Crawl from 2011 to 2015.…
December 10, 2014. Please Donate To Common Crawl! Big data has the potential to change the world. The talent exists and the tools are already there. What’s lacking is access to data.…
July 3, 2012. The Open Cloud Consortium’s Open Science Data Cloud. Common Crawl has started talking with the Open Cloud Consortium (OCC) about working together.…
November 27, 2017. Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September, and October 2017.…
Apache Parquet™ for efficient indexing and data analysis, offering insights into how these technologies can refine the process of web data management.…
August 29, 2014. Web Data Commons Extraction Framework for the Distributed Processing of CC Data.…
February 18, 2015. WikiReverse- Visualizing Reverse Links with the Common Crawl Archive. This is a guest blog post by Ross Fairbanks, a software developer based in Barcelona. He mainly develops in Ruby and is interested in open data and cloud computing.…
January 29, 2015. The Promise of Open Government Data & Where We Go Next.…
November 27, 2013. New Crawl Data Available! We are very pleased to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed). Common Crawl Foundation.…
February 20, 2014. Common Crawl's Move to Nutch. Last year we transitioned from our custom crawler to the Apache Nutch crawler to run our 2013 crawls as part of our migration from our old data center to the cloud.…
In 2007, Gil founded Factual to democratize access to quality data. In 2020, Factual merged with Foursquare and today Gil is Co-Chairman of the board of a combined entity which generated $150m in combined revenue at the time of the merger.…
January 10, 2012. Video: This Week in Startups - Gil Elbaz and Nova Spivack. Nova and Gil, in discussion with host Jason Calacanis, explore in depth what Common Crawl is all about and how it fits into the larger picture of online search and indexing.…
WARC files of a specific segment of the April 2018 crawl:
> aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/
2018-04-20 10:27:49 931210633 CC-MAIN-20180420081400-20180420101400-00000.warc.gz
2018-04-20 10:28:32 935833042…
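Each object line in the listing above has four whitespace-separated fields: date, time, size in bytes, and file name. A minimal sketch of reading one such line in Python follows; the `S3Entry` type and the `parse_s3_ls_line` helper are hypothetical names introduced here for illustration and are not part of any Common Crawl or AWS tooling.

```python
from datetime import datetime
from typing import NamedTuple


class S3Entry(NamedTuple):
    """One object line from an `aws s3 ls` listing (hypothetical helper type)."""
    modified: datetime
    size_bytes: int
    name: str


def parse_s3_ls_line(line: str) -> S3Entry:
    """Split a listing line into its date, time, size and name fields."""
    # Split on whitespace at most three times so the name stays in one piece.
    date, time, size, name = line.split(None, 3)
    modified = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return S3Entry(modified, int(size), name.strip())


# The first complete line of the listing shown in the snippet above.
line = "2018-04-20 10:27:49 931210633 CC-MAIN-20180420081400-20180420101400-00000.warc.gz"
entry = parse_s3_ls_line(line)
print(entry.name, entry.size_bytes)
```

The sketch only handles object lines; prefix lines that `aws s3 ls` prints for sub-directories use a different layout and would need separate handling.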
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone. Common Crawl is a 501(c)(3) non-profit founded in 2007. We make wholesale extraction, transformation and analysis of open web data accessible to researchers.…
October 10, 2015. August 2015 Crawl Archive Available. The crawl archive for August 2015 is now available! This crawl archive is over 149TB in size and holds more than 1.84 billion webpages. Stephen Merity.…
July 23, 2015. June 2015 Crawl Archive Available. The crawl archive for June 2015 is now available! This crawl archive is over 131TB in size and holds more than 1.67 billion webpages. Stephen Merity.…
May 20, 2015. March 2015 Crawl Archive Available. The crawl archive for March 2015 is now available! This crawl archive is over 124TB in size and holds more than 1.64 billion webpages. Stephen Merity.…
July 8, 2015. May 2015 Crawl Archive Available. The crawl archive for May 2015 is now available! This crawl archive is over 159TB in size and holds more than 2.05 billion webpages. Stephen Merity.…