Learn Hadoop and get a paper published

We’re looking for students who want to try out the Hadoop platform and get a technical report published.

(If you’re looking for inspiration, we have some paper ideas below. Keep reading.)

Hadoop’s version of MapReduce will undoubtedly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data – about 5 billion web pages – and documentation to help you learn these tools.

So why not knock out a quick technical report on Hadoop and Common Crawl? Every grad student could use an extra item in the Publications section of his or her CV.

As an added bonus, you would be helping us out. We’re trying to encourage researchers to use the Common Crawl corpus. Your technical report could inspire others and provide a citable paper for them to reference.

Leave a comment now if you’re interested! Then once you’ve talked with your advisor, post a follow-up to your comment, and we’ll be available to help point you in the right direction technically.

Step 1: Learn Hadoop
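
If you have never written a MapReduce job, a word count is the traditional first exercise. Here’s a minimal sketch in plain Python that imitates the map, sort, and reduce phases in a single process. It doesn’t touch Hadoop itself, and the names mapper, reducer, and run_local are our own placeholders, not part of any Hadoop API.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(pairs):
    """Sum the counts per word; pairs must arrive sorted by word."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Imitate map -> sort (shuffle) -> reduce in one process (illustrative only)."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["the web is big", "The crawl is open"]
    for word, count in sorted(run_local(sample).items()):
        print(word, count)
```

With Hadoop Streaming, the map and reduce steps would typically live in separate scripts that read from standard input and write to standard output, and the framework takes care of the sort in between.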

Step 2:
Turn your new skills on the Common Crawl corpus, available on Amazon Web Services.

Step 3:
Reflect on the process and what you find. Compile these valuable insights into a publication. The possibilities are limitless; here are some fun titles we’d love to see come to life:

  • “Identifying the most used Wikipedia articles with Hadoop and the Common Crawl corpus”
  • “Six degrees of Kevin Bacon: an exploration of open web data”
  • “A Hip-Hop family tree: From Akon to Jay-Z with the Common Crawl data”

Here are some other interesting topics you could explore:

  • Using this data can we ask “how many Jack Blacks are there in the world?”
  • What is the average price for a camera?
  • How much can you trust HTTP headers? It’s extremely common for the response headers served with a web page to contradict the actual page, for example on what language it’s in or its byte encoding. Browsers use these headers as hints but need to examine the actual content to decide what that content is. It would be interesting to measure how often the two contradict each other.
  • How much is enough? Some questions we ask of data, such as “what’s the most common word in the English language”, actually don’t need much data at all to answer. So what is the point of a dataset of this size? What value can someone extract from the full dataset? How does this value change with a 50% sample, a 10% sample, a 1% sample? For a particular problem, how should this sampling be done?
  • Train a text classifier to identify topicality. Extract meta keywords from Common Crawl HTML data, then construct a training corpus of topically-tagged documents to train a text classifier for a news application.
  • Identify political sites and their leanings. Cluster and visualize their networks of links (you could use Blekko’s /conservative and /liberal tag lists as a starting point).
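
To make the HTTP headers idea above a little more concrete, here’s a rough sketch of how you might flag one kind of contradiction: a Content-Type header that declares one character encoding while the page’s own meta tag declares another. The function names and the regular expressions are our own assumptions for illustration. A real study would also need to handle encoding aliases (such as utf8 versus utf-8), byte order marks, and pages that declare nothing at all.

```python
import re

# Hypothetical patterns: they cover the common cases, not every legal syntax.
HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE
)


def header_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    match = HEADER_CHARSET.search(content_type or "")
    return match.group(1).lower() if match else None


def meta_charset(body):
    """Return the charset named in an HTML meta tag near the top of the page."""
    match = META_CHARSET.search(body[:4096])  # only scan the first 4 KB
    return match.group(1).decode("ascii").lower() if match else None


def charsets_disagree(content_type, body):
    """True only when both sources name a charset and the names differ."""
    declared = header_charset(content_type)
    embedded = meta_charset(body)
    return declared is not None and embedded is not None and declared != embedded


if __name__ == "__main__":
    page = b'<html><head><meta charset="iso-8859-1"></head></html>'
    print(charsets_disagree("text/html; charset=UTF-8", page))
```

Run as a map step over the crawl, a check like this would give you a count of disagreeing pages that you could then break down by domain or by server.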

So, again — if you think this might be fun, leave a comment now to mark your interest. Talk with your advisor, post a follow up to your comment, and we’ll be in touch!


  1. Weiyi Ng says:

    I am absolutely interested! This sounds quite exciting… would love to hear from you guys :)

    • Sksbhanda says:

      I am in too :)

      • Lisa Green says:

        The more the merrier :) Please feel free to email us if you have any questions.

  2. qqz says:

    I’m interested!! 

    • Lisa Green says:

      I love the enthusiasm expressed by double exclamation marks :) Please email us if you want any advice – or to tell us what you decide to work on. 

  3. MyNameIsKris says:

    I’ve also been looking for a grad topic, I’d be very interested in this

    • Lisa Green says:

      Email us if there is anything we can do to help. Looking forward to seeing what you come up with!

  4. noobcode says:

    I am interested. :)

    • Lisa Green says:

      Cute handle! But won’t you eventually have to change it? You are probably already well beyond noob :) 

  5. Albert Myrchiang says:

    I am really Interested

    • Lisa Green says:

      Excellent! Let us know if there is anything we can do to assist you. The discussion group can be very helpful. http://bit.ly/J1B06q

  6. Cypanic says:

    I am in

    • Lisa Green says:

      Cypanic – great! Any idea what topic you want to tackle?

  7. PeterisP says:

    I’d be interested in running some named entity recognition experiments on common crawl data if I can figure out on how can I filter out the pages in Latvian language without racking up an huge Amazon bill.

    • Allison Domicone says:

       PeterisP – that sounds like a great idea. If you’re looking for technical advice, you might consider joining the Common Crawl discussion group. Someone may be able to help you figure out a solution: http://groups.google.com/group/common-crawl/topics?pli=1

  8. William Cheung says:

    I’m ready to code. Let me know how to access the Common Crawl datasets.

    • Lisa says:

      Great! You can access the datasets on AWS Public Data Sets http://aws.amazon.com/datasets/41740. If you have any questions, feel free to email us at info@commoncrawl.org


  9. Scott says:

    Sounds like fun. 

  10. Sheriffo Ceesay says:

    I am interested. 

    • Lisa Green says:

      Excellent! Please email or Twitter DM us if you have any questions.

  11. Navisam says:

    I am interested

    • Lisa Green says:

      That’s great! Let us know if there is anything we can do to support you.

  12. Harmeet Singh says:

    I’m glad that I found this opportunity. I recently got an internship in hadoop and I think this opportunity will be a cherry on my cake.

    • Allison Domicone says:

      Harmeet – Great! We’re excited to see what you come up with. To get started, check out this blog post for information on accessing and using the data: http://commoncrawl.org/mapreduce-for-the-masses/

      If you have any questions or need any support along the way, please email us at info@commoncrawl.org

  13. Ryan Schuetzler says:

    Sounds fun. We’ll see what this summer brings.

    • Allison Domicone says:

       Ryan – Great! We hope you take up the challenge.

  14. AJ Bahnken says:

    I am definitely interested. Would be fun to use Python-Streaming.

    • Lisa Green says:

      AJ – we would love to see something with Python-Streaming! That would be inspiring to the many people who rate Python as their favorite language.  Would you use dumbo? 

      • AJ Bahnken says:

        Possibly. I was just at the LA HUG @ Shopzilla where one of their Data Scientists talked about straight up Python-Streaming, which does not require Dumbo, but rather uses Numpy. There is also PyDoop, which is a really interesting project. But more than anything, I would love to try and do Python-Streaming using PyPy.

        If I have questions, who should I email?

        • Lisa Green says:

          You can email me! lisa@commoncrawl.org If you have seriously technical questions I can connect you with the right person.

          • AJ Bahnken says:

            Sounds great! I might shoot you an email here soon. Is there a time limit for this?

  15. Luis says:

    Definitely interested. 

    • Lisa Green says:

      That’s great! Please don’t hesitate to email us if you have any questions. 

  16. Kartik Talwar says:

    Definitely interested.

    • Allison Domicone says:

      Awesome – let us know if you need any help from us along the way. We’re happy to point you to resources.

  17. vishal srivastava says:

    I did like to get involved!!

    • Allison Domicone says:

      That’s great! We hope you do. Don’t hesitate to get in touch if you need any help or guidance along the way. Excited to see what you come up with.

  18. ganesh says:

    i’m interested

    • Allison Domicone says:

      Awesome! Please be in touch if we can help or provide resources.

  19. Ajit Deshpande says:

    Very interested.

    • Allison Domicone says:

      Ajit – yes, that sounds like a great idea. We’d love to see what you come up with. Contact us at info@commoncrawl.org and we can point you to specific folks who can answer technical questions and offer help. Thanks for your interest!

  20. kumar vaibhav says:

    Extremely interested !  I am a .Net professional with good hands-on programming experience. Now I am looking to change my technology stack to Hadoop. Please keep me posted. 

    Thanks a bunch !

    • Allison Domicone says:

      Hi Kumar – great to hear you’re interested in taking up Hadoop. This is a good starting point: http://commoncrawl.org/mapreduce-for-the-masses/

      Don’t hesitate to be in touch if any questions come up!

      • kumar vaibhav says:

        Hi Allison – At this point of time I am looking to get some good suggestions on what to do with Common Crawl data. Can you help out with that?


  21. madhavi says:

    I am interested in writing technical paper writing and even doing PhD on this topic in India. Published some paper on hadoop. Request you to keep me posted. Thanks in advance

  22. kamaldeep randhawa says:

    I am doing M.Tech and just now i have completed my Operating Systems research project related to integration of virtualization with Hadoop tools. I am quite interested in working more on hadoop. Please keep me posted. 

    Thanks in advance.

  23. LMShadoop says:

    Currently I’m not a student in any academic institution however would like to explore for new ideas as I cross Hadoop with my traditional data modeling and RDBMS knowledge. Let me know if you are limiting this to “students”. Thanks for the idea.

  24. rayyan388 says:

    to get the sample paper of IAS exam go to http://www.ias-sample-papers.blogspot.com

  25. Gaurav says:

    I am a graduate student in NYU and will like to work on Hadoop and contribute as much as possible.
    I am planning to take course on Hadoop next semester but before that I will like to learn and implement it on a 150 TB size of data.

  26. Tapan K Avasthi says:

    I am very much interested in a technical publication on hadoop. I have around 1 year experience with hadoop/pig and mapreduce. I would like to learn more and get involved.

    • Gaurav Ashara says:

      Read book called “hadoop definitive guide”.

  27. Neha says:

    I am Java/J2EE professional with 1.5+ years of experience, extremely interested in learning Hadoop.

    • Nani says:

      hI Neha, I am also 2+ j2ee professional, extremely interested in hadoop.. did you get anything regarding hadoop..

      Please provide me details

      plz mail me nanielancer@gmail.com

    • Gaurav Ashara says:

      Read book called “hadoop definitive guide”. Its very good book in which you can find all details.

  28. Shruti says:

    Hi, I have done a project on scalability of web applications with mysql and Hadoop’s hbase backends for a social networking site. I am very much interested in continuing my involvement with hadoop. I am looking for good research topics to start with in hadoop. 

  29. mrunali says:

    Hi, I am interested in working with HADOOP . I would like to publish research paper in any of the topic you would provide. Kindly help me with this. I read about HADOOP on internet , the topic is very interesting and I would like to put all my efforts for it

  30. Mohammed Amine Mouhoub says:

    Hi, I’m a student at Paris Dauphine University, and I’m actually doing an internship at Data Publica. In the actual stage of my work, I have to use CommonCrawl to identify/classify the french websites. I am so grateful to have this opportunity to work on CommonCrawl.

  31. Prkrishna says:

    Hi, I’m a working guy in a reputed product development company and I’m very much interested in learning and contribute some thing in this area. I have linux machine with good configuration to try out stuff.

  32. Vaibhav Agarwal says:

    Hi, I am a Digital Analytics professional and would love to pursue this opportunity to learn and contribute. 

  33. Rosyfulla says:

    I’m interested 

  34. Ananda Prakash Verma says:

    I am currently a student at IIITB doing research on Hadoop and Search Engines with my friend. I am looking forward to use this opportunity.

    • Anand Karasi says:

      Hi Ananda … I am an IIT M alum. Would love to connect with you. Ping me when you get a chance.

      • Ananda Prakash Verma says:

        hey Anand Karasi, give your details to connect with you.

  35. Surya Prakash says:

    hi, send me some problems to help you out….

  36. Jaipal says:

    Hello I am Jaipal Currently working for a small scale company, we want to implement Crawler on hadoop… If you are aware of configuration side.. Please let me know it would be great helpful for me..


    Jaipal R

  37. School Management System says:

    The blog was absolutely fantastic! Lot of great information which can be helpful in some or the other way. Keep updating the blog, looking forward for more contents.

  38. Jyoti says:

    im post grad. and finding out a research topic for doctorate. i have implemented hadoop on single node. and going implement the same on multinode. and i wish my doctorate topic must be related to hadoop itself. because its very interesting topic for research. and for coming years it would be shining research area.

  39. Gobinda Paul says:

    I’m interested and recently , I am also working with HDFS too.

  40. Mitesh Mangaonkar says:

    Hey, My name is Mitesh Mangaonkar. Currently I am pursuing my masters in Management Information Systems in Texas Tech University. I am interested in Learning HADOOP technologies and have an interest in Business Intelligence. I would like the opportunity that you are providing.

    Please contact me @ miteshmangaonkar@gmail.com
    OR call @ 806-620-9944

  41. sana says:

    i m interested in doing my research in HADOOP but i have no idea how to continue my research plz help me…

  42. Sathyan says:

    I am working on Distributed Data Mining framework on Cloud for Mobile Business Intelligence. Any inputs in this research direction is much appreciated.!!

  43. Kashif says:

    I am research student. I have setup hadoop cluster of 10 virtual machines. I have analyze smart grid massive amount of data using hadoop and soon will submit the paper for publication. In the future i was thinking about the load balancing in Hadoop but I came to know that lot of work has been done in this area. Could you please let me help on what area of hadoop I should work to get the publication.

  44. Puneet Arora says:

    I am master students working on Big data and hadoop. I have done single node clustering hadoop manually on Ubuntu.

    I am so grateful to work with this….

    contact– arorapuneet2424@gmail.com

  45. AkhilAnil says:

    Hi, is the opportunity still available? I’m very much interested. The article was very helpful by the way.

  46. Nitin says:

    I am really interested in big data analysis,i have worked a bit in map/reduce in hadoop,I need some guidance in what direction i have to go further,I am a student in NITSurathkal,INDIA

  47. Lija Mohan says:

    I am a full time research scholar at CUSAT. Could u please guide me on how to make use of this Crawl Dataset…

  48. SRIRAMAN says:

    I’m a student of Anna university. And I’m currently working on hadoop. Would appreciate guidance!


  1. Learn Hadoop and get a paper published « Another Word For It - [...] Learn Hadoop and get a paper published by Allison Domicone. [...]