Articles
- CC Catalog: Leveraging Open Data and Open APIs
- 87 million domains pagerank
- Big changes for CC Search beta: updates released today!
- Common Crawl And Unlocking Web Archives For Research
- Need Billions of Web Pages? Don’t bother Crawling
Slide Presentations
- AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS, AWS re:Invent 2018
- Mining Public Datasets using Apache Zeppelin (incubating), Apache Spark and Juju
- Mining a Large Web Corpus
  An introduction to the distributed, parallel extraction framework provided by the Web Data Commons project.
- Introduction to Common Crawl
  An overview of Common Crawl with some example use cases.
- Mapping French open data actors on the web with Common Crawl
- The Switchabalizer – our journey from spell checker to homophone corrector
  A description of using Common Crawl data and NLP techniques to improve grammar and spelling correction, specifically for homophones.
- Building a Scalable Web Crawler with Hadoop
  An overview of the original Common Crawl crawler (in use 2008–2013), covering the Hadoop data processing pipeline, the PageRank implementation, and the techniques used to optimize Hadoop.
- The Web of Data and Web Data Commons
  An overview of Web Science, including a basic Semantic Web and Linked Open Data primer, followed by DBpedia, the Linked Data Integration Framework (LDIF), the Common Crawl corpus, and Web Data Commons.
- Measuring the Impact of Google Analytics
  A description of using Common Crawl data to perform large-scale analysis over billions of web pages, investigating the impact of Google Analytics and what it means for privacy on the web at large.
- BDT204 Awesome Applications of Open Data – AWS re:Invent 2012
  A discussion of how open, public datasets can be harnessed using the AWS cloud. Covers large data collections (such as the 1000 Genomes Project and Common Crawl) and explains how you can process billions of web pages and trillions of genes to find new insights into society.
- Centipede: Analyzing Web Crawl
  Analyzing web crawl data to derive the context of a location.
- 2013 Open Analytics Meetup – Mortar
  A tutorial on democratizing data development; references Common Crawl.
- London Hug: Common Crawl, an Open Repository of Web Data
- Scaling Credible Content
  How iAcquire scaled the identification of credible content producers, with credibility based on authorship proliferation; Common Crawl was used as a seed source.
- Large-Scale Analysis of Web Pages – on a Startup Budget?
  An AWS Summit Berlin 2012 talk on Web Data Commons, showing that large-scale web analysis is now possible with Common Crawl datasets.
- Graph Structure in the Web – Revisited
  A large focus on the Common Crawl corpus and the Web Data Commons project.
- Applications of Tree Automata Theory Lecture VI: Back to Machine Translation
  References the Common Crawl corpus.
Videos
- MapReduce for the Masses: Zero to Hadoop in Five Minutes with CommonCrawl
  In this screencast, we’ll show you how to go from no prior experience with large-scale data analysis to playing with 40TB of web crawl data, all in five minutes.
- C205: Efficiently Tackling Common Crawl Using MapReduce & Amazon EC2
- SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data
  Sebastian Spiegler, leader of the data team at SwiftKey, talks about the value of web crawl data, his research, and why open data is important.
- Digital Preservation for Machine-Scale Access and Analysis
  A talk by Lisa Green on digital preservation for machine-scale access and analysis.
- Data Days 2012 – Lisa Green – Data Track Keynote
  The “Data Track” keynote at Data Days 2012 by Lisa Green of the Common Crawl Foundation, recorded in Berlin on October 1st, 2012.
- Data Days 2012 – Data Track Panel
  The “Data Track” panel at Data Days 2012 with Stephan Baumann (German Research Center for Artificial Intelligence), Daniel Dietrich (Open Data Foundation), Lisa Green (Common Crawl Foundation, San Francisco), Christopher Steiner (best-selling author, Chicago), and Matt Turck (Bloomberg Ventures, NYC).
- Spark Demo
  A demo of processing big data in a Spark shell, using the 6-gram (N=6) data of the Common Crawl corpus and showing some interesting query possibilities.
- #bbuzz: Jordan Mendelson “Keynote: Big Data for Cheapskates”
  On utilizing open data and cloud computing resources so that everyone can benefit from modern big data methods.
- Common Crawl meets MIA – Gathering and Crunching Open Web Data
  Lisa Green and Jordan Mendelson present Common Crawl, a web crawl made publicly accessible for further research and dissemination. In a second talk, Peter Adolphs introduces MIA, a cloud-based platform for analyzing web-scale data sets with a toolbox of natural language processing algorithms.