May 9, 2012

Learn Hadoop and get a paper published

We're looking for students who want to try out the Apache Hadoop platform and get a technical report published.

Allison Domicone was formerly a Program and Policy Consultant to Common Crawl and previously worked for Creative Commons.

We're looking for students who want to try out the Hadoop platform and get a technical report published. (If you're looking for inspiration, we have some paper ideas below. Keep reading.) Hadoop's version of MapReduce will undoubtedly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data - about 5 billion web pages - and documentation to help you learn these tools. So why not knock out a quick technical report on Hadoop and Common Crawl? Every grad student could use an extra item in the Publications section of his or her CV.

As an added bonus, you would be helping us out. We're trying to encourage researchers to use the Common Crawl corpus. Your technical report could inspire others and provide a citable paper for them to reference.

Leave a comment now if you're interested! Then, once you've talked with your advisor, follow up on your comment, and we'll be available to help point you in the right direction technically.

Step 1:

Learn Hadoop

MapReduce for the Masses: Zero to Hadoop in 5 Minutes with Common Crawl
Jakob Homan's LinkedIn Tech Talk on Hadoop
Big Data University offers several free courses
Getting Started with Elastic MapReduce
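
If you have never seen the MapReduce model before, the idea fits in a few lines. The sketch below is plain Python, not Hadoop code: it runs the map, shuffle, and reduce phases in memory to count words. The function names (`map_page`, `reduce_counts`, `run_job`) and the sample pages are made up for illustration; a real Hadoop job would distribute the same three phases across a cluster.

```python
from collections import defaultdict


def map_page(url, text):
    """Map step: emit a (word, 1) pair for every word on a page."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(pages):
    """Run map, shuffle (group values by key), and reduce in memory."""
    grouped = defaultdict(list)
    for url, text in pages.items():
        for key, value in map_page(url, text):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Made-up sample pages, not real Common Crawl records.
sample_pages = {
    "http://example.com/a": "open data open web",
    "http://example.com/b": "open crawl",
}
word_counts = run_job(sample_pages)
```

The map and reduce functions never share state, which is what lets a framework like Hadoop run them on many machines at once.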

Step 2:

Turn your new skills on the Common Crawl corpus, available on Amazon Web Services.

Step 3:

Reflect on the process and what you find. Compile these valuable insights into a publication. The possibilities are limitless; here are some fun titles we'd love to see come to life:

"Identifying the most used Wikipedia articles with Hadoop and the Common Crawl corpus"
"Six degrees of Kevin Bacon: an exploration of open web data"
"A Hip-Hop family tree: From Akon to Jay-Z with the Common Crawl data"

Here are some other interesting topics you could explore:

Using this data, can we ask "How many Jack Blacks are there in the world?"
What is the average price for a camera?
How much can you trust HTTP headers? It's extremely common for the response headers served with a web page to contradict the page itself -- on things like what language it's in or its byte encoding. Browsers use these headers as hints but need to examine the actual content to decide what that content is. It would be interesting to measure how often the two contradict each other.
How much is enough? Some questions we ask of data -- such as "What's the most common word in the English language?" -- actually don't need much data at all to answer. So what is the point of a dataset of this size? What value can someone extract from the full dataset? How does this value change with a 50% sample, a 10% sample, a 1% sample? For a particular problem, how should this sampling be done?
Train a text classifier to identify topicality. Extract meta keywords from Common Crawl HTML data, then construct a training corpus of topically-tagged documents to train a text classifier for a news application.
Identify political sites and their leanings. Cluster and visualize their networks of links. (You could use Blekko's /conservative and /liberal tag lists as a starting point.)
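
To make the HTTP headers question above concrete, here is a minimal sketch of one possible check: compare the charset declared in the `Content-Type` response header with the charset declared in the page's own `<meta>` tag. The helper names (`header_charset`, `meta_charset`, `charsets_conflict`) and the 4096-byte scan window are our assumptions for illustration, not part of any Common Crawl tooling, and the sketch does not normalize aliases such as `utf8` versus `utf-8`.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE
)


def header_charset(content_type):
    """Return the charset parameter of a Content-Type header, or None."""
    if not content_type:
        return None
    for part in content_type.split(";")[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip("\"'").lower() or None
    return None


def meta_charset(body, scan_bytes=4096):
    """Return the charset declared in a <meta> tag near the top of the page."""
    match = META_CHARSET.search(body[:scan_bytes])
    return match.group(1).decode("ascii").lower() if match else None


def charsets_conflict(content_type, body):
    """True only when both sources declare a charset and they differ."""
    declared = header_charset(content_type)
    in_page = meta_charset(body)
    return declared is not None and in_page is not None and declared != in_page
```

Run as the map step of a job over the corpus, a function like `charsets_conflict` would yield one boolean per page, and the reduce step would only need to count them.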

So, again -- if you think this might be fun, leave a comment now to mark your interest. Talk with your advisor, post a follow-up to your comment, and we'll be in touch!
