Articles
- CC Catalog: Leveraging Open Data and Open APIs
- 87 million domains pagerank
- Big changes for CC Search beta: updates released today!
- Common Crawl And Unlocking Web Archives For Research
- Need Billions of Web Pages? Don’t bother Crawling
Slide Presentations
- AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS, AWS re:Invent 2018
- Mining Public Datasets using Apache Zeppelin (incubating), Apache Spark and Juju
- Mining a Large Web Corpus
  An introduction to the distributed, parallel extraction framework provided by the Web Data Commons project.
- Introduction to Common Crawl
  An overview of Common Crawl with some example use cases.
- Mapping French open data actors on the web with Common Crawl
- The Switchabalizer – our journey from spell checker to homophone corrector
  A description of using Common Crawl data and NLP techniques to improve grammar and spelling correction, specifically for homophones.
- Building a Scalable Web Crawler with Hadoop
  An overview of the original Common Crawl crawler (in use 2008–2013), covering the Hadoop data processing pipeline, the PageRank implementation, and the techniques used to optimize Hadoop.
- The Web of Data and Web Data Commons
  An overview of Web Science, including a basic Semantic Web and Linked Open Data primer, followed by DBpedia, the Linked Data Integration Framework (LDIF), the Common Crawl corpus, and Web Data Commons.
- Measuring the Impact of Google Analytics
  A description of using Common Crawl data to perform large-scale analysis over billions of web pages, investigating the impact of Google Analytics and what it means for privacy on the web at large.
- BDT204 Awesome Applications of Open Data – AWS re:Invent 2012
  A discussion of how open, public datasets can be harnessed using the AWS cloud. Covers large data collections (such as the 1000 Genomes Project and Common Crawl) and explains how you can process billions of web pages and trillions of genes to find new insights into society.
- Centipede: Analyzing Web Crawl
  Analyzing web crawl data to derive the context of a location.
- 2013 Open Analytics Meetup – Mortar
  A tutorial on democratizing data development; references Common Crawl.
- London Hug: Common Crawl, an Open Repository of Web Data
- Scaling Credible Content
  How iAcquire scaled the identification of credible content producers, with credibility based on authorship proliferation; Common Crawl was used as a seed source.
- Large-Scale Analysis of Web Pages – on a Startup Budget?
  An AWS Summit Berlin 2012 talk on Web Data Commons, showing that large-scale web analysis is now possible with Common Crawl datasets.
- Graph Structure in the Web – Revisited
  A large focus on the Common Crawl corpus and the Web Data Commons project.
- Applications of Tree Automata Theory Lecture VI: Back to Machine Translation
  References the Common Crawl corpus.
Videos
- MapReduce for the Masses: Zero to Hadoop in Five Minutes with CommonCrawl
  In this screencast, we’ll show you how to go from no prior experience with large-scale data analysis to playing with 40TB of web crawl data, all in five minutes.
- C205: Efficiently Tackling Common Crawl Using MapReduce & Amazon EC2
- SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data
  Sebastian Spiegler, leader of the data team at SwiftKey, talks about the value of web crawl data, his research, and why open data is important.
- Digital Preservation for Machine-Scale Access and Analysis
  A talk by Lisa Green on digital preservation for machine-scale access and analysis.
- Data Days 2012 – Lisa Green – Data Track Keynote
  The “Data Track” keynote at Data Days 2012 by Lisa Green of the Common Crawl Foundation, recorded in Berlin on October 1st, 2012.
- Data Days 2012 – Data Track Panel
  The “Data Track” panel at Data Days 2012 with Stephan Baumann (German Research Center for Artificial Intelligence), Daniel Dietrich (Open Data Foundation), Lisa Green (Common Crawl Foundation, San Francisco), Christopher Steiner (best-selling author, Chicago), and Matt Turck (Bloomberg Ventures, NYC).
- Spark Demo
  A demo of processing big data in a Spark shell, using the 6-gram (N=6) data of the Common Crawl corpus and showing some interesting query possibilities.
- #bbuzz: Jordan Mendelson “Keynote: Big Data for Cheapskates”
  On utilizing open data and cloud computing resources so that everyone can benefit from modern big data methods.
- Common Crawl meets MIA – Gathering and Crunching Open Web Data
  Lisa Green and Jordan Mendelson present Common Crawl, a web crawl made publicly accessible for further research and dissemination. In a second talk, Peter Adolphs introduces MIA, a cloud-based platform for analyzing web-scale data sets with a toolbox of natural language processing algorithms.