Apache Hama Introduction


Transcript of Apache Hama Introduction

Programming in Hama

Introduction to BSP
- Introduced by Leslie Valiant
- Introduced message passing and global synchronization to tackle shared memory contention
- Hama lets you express solutions using this model on HDFS (Hadoop Distributed File System)
- The Google MapReduce paper mentions the BSP model but explains why they didn't pursue BSP
- They did introduce Pregel though, which is inspired by BSP

Apache Hama Introduction
- Programming Model
- Architecture
- Current Challenges
- Future Plans
- Working Together

Architecture

Current Development for Hama 0.7

Future Directions
- Industrial-strength fault tolerance
- Hierarchical BSP
- Oblivious synchronization
- Integrate with HBase, Accumulo, etc.
- Work on Mesos
- Experiment with new models for computation over Hama Matrix Multiplications
- Excited to see Tez in incubation

Working Together
We are looking for more contributors!

Useful links:
- http://hama.apache.org/
- http://wiki.apache.org/hama/GettingStarted
- https://issues.apache.org/jira/browse/HAMA
- http://hama.apache.org/mail-lists.html
- Follow @ApacheHama on Twitter

Bulk Synchronous Parallel solutions over Hadoop
Suraj Menon (PMC, Committer)

The existing API (as of 0.6)

Groom Server

- Each unit task executed in parallel by a peer process is called a superstep.
- In a superstep, the peers exchange messages with each other.
- Then all the peers enter a synchronization barrier.
- After coming out of the barrier, the peers work on the messages sent to them in the previous superstep.

When a peer enters the barrier synchronization mode, it is implied that:
- The peer has completely executed the superstep.
- All the messages for each of the other peers have been sent out (reliably, most of the time).
- The peer is waiting for all other peers to enter the barrier.

When a peer leaves a barrier, it is implied that:
- The peer is about to start working on the next superstep.
- No other peers in the system are still working on the previous superstep.
- The peer gets all the messages sent to it by other peers in the previous superstep as input for the new superstep.

Pregel
- Workers send messages to each other and to the master node.
- Used for graph algorithms where each worker is responsible for the state changes in the vertices it holds. The vertices exchange messages with each other based on the existing adjacency information.
- The Hama Graph module and Apache Giraph use the above model for implementing graph algorithms.

Master Task
- Receives messages from all peers, at a frequency that depends on how the supersteps are designed.
- Aggregates the data received from all the workers and maintains a global state.
- Broadcasts the global state to all the workers.

Superstep API

Job Submission

public class MyClass extends BSP<K1, V1, K2, V2, M extends Writable> {
  @Override
  public void setup(..){ }
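The superstep and barrier semantics described above can be tried out without a cluster. What follows is a minimal single-JVM sketch, not Hama's API: the class `BspSimulation`, its `run` method, and the ring-style message pattern are hypothetical names and choices made for illustration, and `java.util.concurrent.CyclicBarrier` stands in for Hama's distributed barrier. Each peer sends one value to its neighbor in superstep 0, waits at the barrier, and only then reads its inbox in superstep 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

// Hypothetical single-JVM illustration of BSP supersteps; this is not Hama's API.
final class BspSimulation {

    // Runs numPeers threads through two supersteps and returns the sum each peer received.
    static int[] run(int numPeers) {
        // One inbox per peer; messages sent in superstep s are read in superstep s + 1.
        List<ConcurrentLinkedQueue<Integer>> inboxes = new ArrayList<>();
        for (int i = 0; i < numPeers; i++) {
            inboxes.add(new ConcurrentLinkedQueue<>());
        }
        CyclicBarrier barrier = new CyclicBarrier(numPeers);
        int[] received = new int[numPeers];
        Thread[] threads = new Thread[numPeers];

        for (int p = 0; p < numPeers; p++) {
            final int peer = p;
            threads[p] = new Thread(() -> {
                try {
                    // Superstep 0: local work, then send a message to the next peer.
                    int localValue = peer * 10;
                    inboxes.get((peer + 1) % numPeers).add(localValue);

                    // Barrier: no peer proceeds until every peer has finished sending.
                    barrier.await();

                    // Superstep 1: consume the messages sent during superstep 0.
                    int sum = 0;
                    Integer msg;
                    while ((msg = inboxes.get(peer).poll()) != null) {
                        sum += msg;
                    }
                    received[peer] = sum;
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[p].start();
        }
        try {
            for (Thread t : threads) {
                t.join(); // join makes each peer's write to received[] visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for peers", e);
        }
        return received;
    }

    public static void main(String[] args) {
        int[] result = run(4);
        for (int i = 0; i < result.length; i++) {
            System.out.println("peer " + i + " received " + result[i]);
        }
    }
}
```

Because every peer waits at the barrier before polling its inbox, no peer can observe a half-delivered set of messages, which is the guarantee the "leaves a barrier" bullets describe.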

Partitioning - A better partitioning scheme can help reduce message exchange during program execution.
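To make the partitioning point concrete, the sketch below counts how many edges of a small graph cross peer boundaries under two schemes; each crossing edge implies a message between peers when vertices talk to their neighbors. The class `PartitionDemo`, the helper names, and the 8-vertex chain graph are hypothetical and chosen only for illustration; neither partitioner is claimed to be what Hama ships.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical illustration: how the partitioning scheme changes cross-peer traffic.
final class PartitionDemo {

    // Counts edges whose two endpoints are assigned to different peers.
    static int crossPeerEdges(int[][] edges, IntUnaryOperator partitioner) {
        int count = 0;
        for (int[] e : edges) {
            if (partitioner.applyAsInt(e[0]) != partitioner.applyAsInt(e[1])) {
                count++;
            }
        }
        return count;
    }

    // Builds a chain graph 0 -> 1 -> ... -> n-1 as {from, to} pairs.
    static int[][] chain(int n) {
        int[][] edges = new int[n - 1][];
        for (int i = 0; i < n - 1; i++) {
            edges[i] = new int[] {i, i + 1};
        }
        return edges;
    }

    public static void main(String[] args) {
        int[][] edges = chain(8);
        // Modulo partitioning over 2 peers: neighbors always land on different peers.
        System.out.println("modulo: " + crossPeerEdges(edges, v -> v % 2));
        // Range partitioning over 2 peers: only the edge 3 -> 4 crosses the boundary.
        System.out.println("range:  " + crossPeerEdges(edges, v -> v / 4));
    }
}
```

On this chain, modulo partitioning cuts all 7 edges while range partitioning cuts 1, so a locality-aware scheme would exchange far fewer messages per superstep for neighbor-to-neighbor algorithms.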

Finalizing Superstep API - A less restrictive model implies that users take on more responsibility. We have to hit a sweet spot!

Asynchronous Messaging - Currently messages are sent at the end of the superstep; asynchronous messaging during the superstep should give us a more concurrent design.

Current Focus

YARN
- Hama has been YARN-aware for some time now
- The 0.7 release is planned to run Hama with the YARN scheduler
- The implementation is not full-fledged yet
- Tested with a few job submissions
- Much more code refactoring to come

Machine Learning module
Today contains:
- K-Means
- Linear, Logistic Regression implemented on BSP
At ApacheCon 2012, Tommaso made an interesting point on the suitability of the BSP model for iterative machine learning algorithms.

Today contains examples implemented:
- PageRank
- SSSP
- Bipartite Matching
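As an illustration of how one of the listed graph examples maps onto supersteps, here is a single-threaded sketch of single-source shortest paths (SSSP) in the vertex-centric, Pregel-like style. It does not use Hama's graph API: the class `SsspSupersteps`, the `shortestPaths` method, and the sample graph are hypothetical, and non-negative edge weights are assumed. Each loop iteration plays the role of one superstep, and swapping the message maps stands in for the barrier.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-threaded sketch of vertex-centric SSSP; not Hama's graph API.
final class SsspSupersteps {

    // edges[i] = {from, to, weight}, weights assumed non-negative.
    // Returns the shortest distance from source to every vertex
    // (Integer.MAX_VALUE for unreachable vertices).
    static int[] shortestPaths(int numVertices, int[][] edges, int source) {
        int[] dist = new int[numVertices];
        Arrays.fill(dist, Integer.MAX_VALUE);

        // Messages visible in the current superstep: vertex -> best candidate distance.
        Map<Integer, Integer> inbox = new HashMap<>();
        inbox.put(source, 0);

        while (!inbox.isEmpty()) {
            Map<Integer, Integer> outbox = new HashMap<>();
            for (Map.Entry<Integer, Integer> msg : inbox.entrySet()) {
                int v = msg.getKey();
                int candidate = msg.getValue();
                if (candidate < dist[v]) {
                    // The vertex improves its state and notifies its out-neighbors.
                    dist[v] = candidate;
                    for (int[] e : edges) {
                        if (e[0] == v) {
                            // Keep only the smallest candidate per target vertex.
                            outbox.merge(e[1], candidate + e[2], Integer::min);
                        }
                    }
                }
            }
            // "Barrier": messages sent in this superstep become next superstep's input.
            inbox = outbox;
        }
        return dist;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1, 4}, {0, 2, 1}, {2, 1, 2}, {1, 3, 1}};
        // prints [0, 3, 1, 4]
        System.out.println(Arrays.toString(shortestPaths(4, edges, 0)));
    }
}
```

The computation halts when a superstep produces no messages, which mirrors the usual vertex-centric termination rule of all vertices being idle with no messages in flight.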