We're looking for students who want to try out the Hadoop platform and get a technical report published. (If you're looking for inspiration, we have some paper ideas below. Keep reading.)

Hadoop's version of MapReduce will undoubtedly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data - about 5 billion web pages - and documentation to help you learn these tools. So why not knock out a quick technical report on Hadoop and Common Crawl? Every grad student could use an extra item in the Publications section of his or her CV.

As an added bonus, you would be helping us out. We're trying to encourage researchers to use the Common Crawl corpus, and your technical report could inspire others and provide a citable paper for them to reference.

Leave a comment now if you're interested! Then, once you've talked with your advisor, post a follow-up to your comment and we'll be available to help point you in the right direction technically.
Step 1:
Learn Hadoop
- MapReduce for the Masses: Zero to Hadoop in 5 Minutes with Common Crawl
- Jakob Homan's LinkedIn Tech Talk on Hadoop
- Big Data University offers several free courses
- Getting Started with Elastic MapReduce
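To give you a feel for what you'll be able to write after working through these, here is a minimal word-count job in Python. It uses the mrjob library, which is not part of the tutorials above - it's just one convenient way to run Hadoop jobs from Python, locally or on Elastic MapReduce - so treat this as an illustrative sketch rather than a required setup.

    # word_count.py -- a minimal Hadoop-style word count using mrjob.
    # Run locally:   python word_count.py input.txt
    # Run on EMR:    python word_count.py -r emr input.txt  (AWS credentials required)
    import re
    from mrjob.job import MRJob

    WORD_RE = re.compile(r"[\w']+")

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Emit (word, 1) for every word on the line.
            for word in WORD_RE.findall(line):
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Sum the partial counts for each word.
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()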
Step 2:
Turn your new skills on the Common Crawl corpus, available on Amazon Web Services.
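To make this concrete, here's a sketch of how you might poke at a single crawl archive file before scaling up to a full Hadoop job. It assumes you have one web archive (WARC) file on your local machine - the filename below is a hypothetical placeholder - and that you have the warcio library installed. The same per-record logic later becomes the body of your map function.

    # Sketch: iterate over one crawl archive file and print the URL and
    # declared content type of every HTTP response record.
    # 'example-segment.warc.gz' is a hypothetical placeholder -- point it
    # at any single archive file you've fetched from the corpus on S3.
    from warcio.archiveiterator import ArchiveIterator

    with open('example-segment.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            ctype = record.http_headers.get_header('Content-Type')
            print(url, ctype)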
- "Identifying the most used Wikipedia articles with Hadoop and the Common Crawl corpus"
- "Six degrees of Kevin Bacon: an exploration of open web data"
- "A Hip-Hop family tree: From Akon to Jay-Z with the Common Crawl data"
Step 3:
Reflect on the process and what you find. Compile these valuable insights into a publication. The possibilities are limitless; here are some fun titles we'd love to see come to life:
Here are some other interesting topics you could explore:
- Using this data, can we answer the question "how many Jack Blacks are there in the world?"
- What is the average price for a camera?
- How much can you trust HTTP headers? It's extremely common for the response headers served with a web page to contradict the page itself -- things like what language it's in or what character encoding it uses. Browsers use these headers as hints but examine the actual content to decide what it really is. It would be interesting to measure how often the two disagree (a starting-point sketch follows this list).
- How much is enough? Some questions we ask of data -- such as "what's the most common word in the English language?" -- don't need much data at all to answer. So what is the point of a dataset of this size? What value can someone extract from the full dataset, and how does that value change with a 50% sample, a 10% sample, or a 1% sample? For a particular problem, how should that sample be drawn? (A small sampling experiment is sketched after this list.)
- Train a text classifier to identify topicality. Extract meta keywords from Common Crawl HTML data, then construct a training corpus of topically tagged documents and train a text classifier for a news application (a baseline sketch follows this list).
- Identify political sites and their leanings, then cluster and visualize their networks of links (you could use Blekko's /conservative and /liberal tag lists as a starting point; a small sketch follows this list).
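For the HTTP-headers question, one way to start measuring disagreement is to compare the charset declared in the Content-Type response header against the charset declared inside the HTML itself. This is a local, single-file sketch (warcio again, with a hypothetical filename); a real study would run the same logic as a Hadoop job over many archive files.

    # Tally how often the charset in the Content-Type header disagrees with
    # the charset declared in the page's own <meta> tag.
    import re
    from collections import Counter
    from warcio.archiveiterator import ArchiveIterator

    HEADER_CHARSET = re.compile(r'charset\s*=\s*([\w-]+)', re.IGNORECASE)
    HTML_CHARSET = re.compile(rb'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

    tally = Counter()
    with open('example-segment.warc.gz', 'rb') as stream:  # hypothetical file
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            ctype = record.http_headers.get_header('Content-Type') or ''
            declared = HEADER_CHARSET.search(ctype)
            body = record.content_stream().read()
            observed = HTML_CHARSET.search(body[:4096])  # meta tags sit near the top
            if declared and observed:
                same = (declared.group(1).lower() ==
                        observed.group(1).decode('ascii', 'ignore').lower())
                tally['agree' if same else 'disagree'] += 1
    print(tally)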
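For the "how much is enough?" question, the core experiment is easy to prototype before you ever touch the full corpus: compute an answer (say, the top 20 words) on the full input and on progressively smaller random samples, and see how quickly the answers converge. In the sketch below, 'corpus.txt' is a hypothetical stand-in for text you've extracted from the crawl.

    # Compare the top-20 word list from the full input against random samples.
    import random
    import re
    from collections import Counter

    WORD_RE = re.compile(r"[a-z']+")

    def top_words(lines, sample_rate, k=20, seed=0):
        rng = random.Random(seed)
        counts = Counter()
        for line in lines:
            if rng.random() < sample_rate:
                counts.update(WORD_RE.findall(line.lower()))
        return {w for w, _ in counts.most_common(k)}

    with open('corpus.txt', encoding='utf-8', errors='ignore') as f:
        lines = f.readlines()

    full = top_words(lines, 1.0)
    for rate in (0.5, 0.1, 0.01):
        overlap = len(full & top_words(lines, rate)) / len(full)
        print('{:>4.0%} sample: {:.0%} overlap with the full top-20'.format(rate, overlap))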
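For the topical-classifier idea, once you've extracted page text and a topic label derived from each page's meta keywords, a reasonable first baseline is TF-IDF features with a linear model. The sketch below uses scikit-learn; the documents and labels are tiny hypothetical placeholders standing in for your extracted training corpus.

    # Baseline topical classifier: TF-IDF features + logistic regression.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Placeholders: replace with (page text, keyword-derived topic) pairs
    # extracted from the Common Crawl HTML data.
    documents = ['team wins the championship game',
                 'coach announces the starting lineup',
                 'parliament passes the new budget bill',
                 'senator proposes an election reform']
    labels = ['sports', 'sports', 'politics', 'politics']

    X_train, X_test, y_train, y_test = train_test_split(
        documents, labels, test_size=0.5, random_state=0, stratify=labels)

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print('held-out accuracy:', model.score(X_test, y_test))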
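For the political-sites idea, the shape of the computation is: extract hyperlinks from crawl HTML, collapse them to a domain-to-domain edge list, seed some domains with known labels (for example from Blekko's /conservative and /liberal tag lists), and then look at where the unlabeled domains point. Here is a tiny sketch with networkx; the edges and seed labels are hypothetical placeholders.

    # Label unknown domains by the leanings of the seed domains they link to.
    import networkx as nx

    # Placeholders: in practice, edges come from links extracted out of the
    # crawl, and seeds come from a hand-curated list such as Blekko's tags.
    edges = [('blog-a.example', 'news-right.example'),
             ('blog-a.example', 'pundit-right.example'),
             ('blog-b.example', 'news-left.example'),
             ('blog-b.example', 'mag-left.example')]
    seeds = {'news-right.example': 'conservative',
             'pundit-right.example': 'conservative',
             'news-left.example': 'liberal',
             'mag-left.example': 'liberal'}

    graph = nx.DiGraph()
    graph.add_edges_from(edges)

    for node in graph.nodes:
        if node in seeds:
            continue
        votes = [seeds[nbr] for nbr in graph.successors(node) if nbr in seeds]
        if votes:
            guess = max(set(votes), key=votes.count)
            print('%s: links mostly to %s sites (%d labeled links)'
                  % (node, guess, len(votes)))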
So, again -- if you think this might be fun, leave a comment now to mark your interest. Talk with your advisor, post a follow-up to your comment, and we'll be in touch!