Transcript of "Hadoop @ SARA & BiG Grid"

2.
First off...
About me: consultant for SARA's eScience & Cloud Services; technical lead for LifeWatch Netherlands; lead of the Hadoop infrastructure.
About you: who uses large-scale computing as a supporting tool? For whom is large-scale computing core business?

14.
How are these observations addressed? We collect data, we store data, and we have the knowledge to interpret data. What tools do we have that bring these together?
Pioneers: HPC centers, universities, and in recent years, Internet companies. (Lots of knowledge exchange, by the way.)

20.
SARA, the national center for scientific computing: facilitating science in the Netherlands with equipment for, and expertise on, large-scale computing, large-scale data storage, high-performance networking, eScience, and visualization.

23.
Case Study: Virtual Knowledge Studio
How do categories in Wikipedia evolve over time? (And how do they relate to internal links?)
2.7 TB of raw text in a single file. A Java application searches for categories in wiki markup, like [[Category:NAME]]. Executed on the Grid.
http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
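The search the Java application performs boils down to pattern matching on the markup. Here is a minimal Java sketch of that kind of extraction, assuming a simple regular expression; the CategoryScan class and the regex are illustrative, not the project's actual code:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Illustrative only: pull [[Category:NAME]] tags out of wiki markup. */
    public class CategoryScan {
        // Matches [[Category:NAME]] and the [[Category:NAME|sortkey]] variant.
        private static final Pattern CATEGORY =
            Pattern.compile("\\[\\[Category:([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

        public static void main(String[] args) {
            String markup = "Text [[Category:Dutch history]] more text "
                          + "[[Category:Computing|Hadoop]]";
            Matcher m = CATEGORY.matcher(markup);
            while (m.find()) {
                System.out.println(m.group(1)); // Dutch history, Computing
            }
        }
    }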

24.
Case Study: Virtual Knowledge Studio
Method:
1) Take an article, including its history, as input
2) Extract categories and links for each revision
3) Output all links for each category, per revision
4) Aggregate all links for each category, per revision
5) Generate a graph linking all categories on links, per revision

25.
Case Study: Virtual Knowledge Studio
1.1) Copy file from local machine to Grid storage
2.1) Stream file from Grid storage to a single machine
2.2) Cut into pieces of 10 GB
2.3) Stream back to Grid storage
3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back
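Step 2.2 is plain file chunking. A minimal sketch of that step, assuming a byte-oriented split (in practice the cuts would have to fall on article boundaries so no revision is torn in half); the Chunker class and the ".partN" naming are hypothetical:

    import java.io.*;

    /** Hypothetical helper: split a huge dump into pieces of roughly 10 GB. */
    public class Chunker {
        static final long CHUNK = 10L * 1024 * 1024 * 1024; // ~10 GB per piece

        public static void main(String[] args) throws IOException {
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
                byte[] buf = new byte[1 << 20]; // 1 MiB copy buffer
                int part = 0;
                int n = in.read(buf);
                while (n != -1) {
                    // Open the next piece and fill it up to the chunk size.
                    try (OutputStream out = new BufferedOutputStream(
                            new FileOutputStream(args[0] + ".part" + part++))) {
                        long written = 0;
                        while (written < CHUNK && n != -1) {
                            out.write(buf, 0, n);
                            written += n;
                            n = in.read(buf);
                        }
                    }
                }
            }
        }
    }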

34.
Case Study: Virtual Knowledge Studio
This is how it would be done with Hadoop:
1) Load file into HDFS
2) Submit code to MR
Automatic distribution of data, parallelism based on data, automatic ordering of intermediate results.
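A minimal sketch of what those two steps could look like as MapReduce code, assuming each input record is the wiki markup of one revision: the mapper emits a (category, link) pair for every combination it finds, and the reducer collects all links per category. The class names and the regex-based parsing are illustrative, not the Virtual Knowledge Studio's actual implementation:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /** Sketch: emit (category, link) per revision, then collect all links per category. */
    public class CategoryLinks {

        // Illustrative parsers; real markup handling is more involved.
        static final Pattern CAT = Pattern.compile("\\[\\[Category:([^\\]|]+)");
        static final Pattern LINK = Pattern.compile("\\[\\[(?!Category:)([^\\]|]+)");

        static List<String> matches(Pattern p, String s) {
            List<String> out = new ArrayList<>();
            Matcher m = p.matcher(s);
            while (m.find()) out.add(m.group(1));
            return out;
        }

        /** Assumes each input value holds the markup of a single revision. */
        public static class ExtractMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text revision, Context ctx)
                    throws IOException, InterruptedException {
                String markup = revision.toString();
                for (String category : matches(CAT, markup))
                    for (String link : matches(LINK, markup))
                        ctx.write(new Text(category), new Text(link));
            }
        }

        /** Hadoop groups the pairs by category; concatenate each group's links. */
        public static class AggregateReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text category, Iterable<Text> links, Context ctx)
                    throws IOException, InterruptedException {
                StringBuilder all = new StringBuilder();
                for (Text link : links) all.append(link).append(' ');
                ctx.write(category, new Text(all.toString().trim()));
            }
        }
    }

The "automatic ordering of intermediate results" on the slide is the shuffle between these two classes: Hadoop groups and sorts all (category, link) pairs by key before the reducer runs, which is why the method's aggregation step comes largely for free.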

46.
Experience: How we embrace Hadoop
Parallelism has never been easy... so we teach!
December 2010: hackathon (~50 participants, full)
April 2011: workshop for bioinformaticians
November 2011: 2-day PhD course (~60 participants, full)
June 2012: 1-day PhD course
The data scientist is still in school... so we fill the gap! Devops maintain the system, fix bugs, and develop new functionality. Technical consultants learn how to efficiently implement algorithms.

48.
Final thoughts
Hadoop is the first to provide commodity computing. Hadoop is not the only one. Hadoop is probably not the best. Hadoop has momentum. What degree of diversification of infrastructure should we embrace?
MapReduce fits surprisingly well as a programming model for data parallelism.
Where is the data scientist? Teach. A lot. And work together.