A FRAMEWORK FOR THE DYNAMIC RECONFIGURATION OF SCIENTIFIC APPLICATIONS IN GRID ENVIRONMENTS

By
Kaoutar El Maghraoui

A Thesis Submitted to the Graduate Faculty of Rensselaer Polytechnic Institute in Partial Fulfillment of the Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: Computer Science

Approved by the Examining Committee:
Dr. Carlos A. Varela, Thesis Adviser
Dr. Joseph E. Flaherty, Member
Dr. Ian Foster, Member
Dr. Franklin Luk, Member
Dr. Boleslaw K. Szymanski, Member
Dr. James D. Teresco, Member

Rensselaer Polytechnic Institute
Troy, New York
April 2007 (For Graduation May 2007)

The original of the complete thesis is on file in the Rensselaer Polytechnic Institute Library.

© Copyright 2007 by Kaoutar El Maghraoui. All Rights Reserved.

CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABSTRACT
1. Introduction
   1.1 Motivation and Research Challenges
       1.1.1 Mobility and Malleability for Fine-grained Reconfiguration
       1.1.2 Middleware-driven Dynamic Application Reconfiguration
   1.2 Problem Statement and Methodology
   1.3 Thesis Contributions
   1.4 Thesis Roadmap
2. Background and Related Work
   2.1 Grid Middleware Systems
   2.2 Resource Management in Grid Systems
       2.2.1 Resource Management in Globus
       2.2.2 Resource Management in Condor
       2.2.3 Resource Management in Legion
       2.2.4 Other Grid Resource Management Systems
   2.3 Adaptive Execution in Grid Systems
   2.4 Grid Programming Models
   2.5 Peer-to-Peer Systems and the Emerging Grid
   2.6 Worldwide Computing
       2.6.1 The Actor Model
       2.6.2 The SALSA Programming Language
       2.6.3 Theaters and Run-Time Components
3. A Middleware Framework for Adaptive Distributed Computing
   3.1 Design Goals
       3.1.1 Middleware-level Issues
       3.1.2 Application-level Issues
   3.2 A Model for Reconfigurable Grid-Aware Applications
       3.2.1 Characteristics of Grid Environments
       3.2.2 A Grid Application Model
   3.3 IOS Middleware Architecture
       3.3.1 The Profiling Module
       3.3.2 The Decision Module
       3.3.3 The Protocol Module
   3.4 The Reconfiguration and Profiling Interfaces
       3.4.1 The Profiling API
       3.4.2 The Reconfiguration Decision API
   3.5 Case Study: Reconfigurable SALSA Actors
   3.6 Chapter Summary
4. Reconfiguration Protocols and Policies
   4.1 Network-Sensitive Virtual Topologies
       4.1.1 The Peer-to-Peer Topology
       4.1.2 The Cluster-to-Cluster Topology
       4.1.3 Presence Management
   4.2 Autonomous Load Balancing Strategies
       4.2.1 Information Policy
       4.2.2 Transfer Policy
       4.2.3 Peer Location Policy
       4.2.4 Load Balancing and the Virtual Topology
   4.3 The Selection Policy
       4.3.1 The Resource Sensitive Model
       4.3.2 Migration Granularity
   4.4 Split and Merge Policies
       4.4.1 The Split Policy
       4.4.2 The Merge Policy
   4.5 Related Work
   4.6 Chapter Summary
5. Reconfiguring MPI Applications
   5.1 Motivation
   5.2 Approach to Reconfigurable MPI Applications
       5.2.1 The Computational Model
       5.2.2 Process Migration
       5.2.3 Process Malleability
   5.3 The Process Checkpointing Migration and Malleability Library
       5.3.1 The PCM API
       5.3.2 Instrumenting an MPI Program with PCM
   5.4 The Runtime Architecture
       5.4.1 The PCMD Runtime System
       5.4.2 The Profiling Architecture
       5.4.3 A Simple Scenario for Adaptation
   5.5 The Middleware Interaction Protocol
       5.5.1 Actor Emulation
       5.5.2 The Proxy Architecture
       5.5.3 Protocol Messages
   5.6 Related Work
       5.6.1 MPI Reconfiguration
       5.6.2 Malleability
   5.7 Summary and Discussion
6. Performance Evaluation
   6.1 Experimental Testbed
   6.2 Applications Case Studies
       6.2.1 Heat Diffusion Problem
       6.2.2 Search for the Galactic Structure
       6.2.3 SALSA Benchmarks
   6.3 Middleware Evaluation
       6.3.1 Application-sensitive Reconfiguration Results
       6.3.2 Experiments with Dynamic Networks
       6.3.3 Experiments with Virtual Topologies
       6.3.4 Single vs. Group Migration
       6.3.5 Overhead Evaluation
   6.4 Adaptation Experiments with Iterative MPI Applications
       6.4.1 Profiling Overhead
       6.4.2 Reconfiguration Overhead
       6.4.3 Performance Evaluation of MPI/IOS Reconfiguration
             6.4.3.1 Migration Experiments
             6.4.3.2 Split and Merge Experiments
   6.5 Summary and Discussion
7. Conclusions and Future Work
   7.1 Other Application Models
   7.2 Large-scale Deployment and Security
   7.3 Replication as Another Reconfiguration Strategy
   7.4 Scalable Profiling and Measurements
   7.5 Clustering Techniques for Resource Optimization
   7.6 Automatic Programming
   7.7 Self-reconfigurable Middleware
References

LIST OF FIGURES

1.1 Execution time without and with autonomous migration in a dynamic run-time environment.
1.2 Execution time with different entity granularities in a static run-time environment.
1.3 Throughput as the process data granularity decreases on a dedicated node.
2.1 A layered grid architecture and components (Adapted from [14]).
2.2 Sample peer-to-peer topologies: centralized, decentralized and hybrid topologies.
2.3 A Model for Worldwide Computing. Applications run on a virtual network (a middleware infrastructure) which maps actors to locations in the physical layer (the hardware infrastructure).
2.4 The primitive operations of an actor. In response to a message, an actor can: a)
change its internal state by invoking one of its internal methods, b) send a message to another actor, or c) create a new actor.
2.5 Skeleton of a SALSA program. The skeleton shows simple examples of actor creation, message sending, coordination, and migration.
3.1 Interactions between reconfigurable applications, the middleware services, and the grid resources.
3.2 A state machine showing the configuration states of an application at reconfiguration points.
3.3 Model for a grid-aware application.
3.4 The possible states of a reconfigurable entity.
3.5 IOS Agents consist of three modules: a profiling module, a protocol module and a decision module. The profiling module gathers performance profiles about the entities executing locally, as well as the underlying hardware. The protocol module gathers information from other agents. The decision module takes local and remote information and uses it to decide how the application entities should be reconfigured.
3.6 Architecture of the profiling module: this module interfaces with high-level applications and with local resources and generates application performance profiles and machine performance profiles.
3.7 Interactions between the profiling module and the Network Weather Service (NWS) components.
3.8 Interactions between a reconfigurable application and the local IOS agent.
3.9 IOS Profiling API.
3.10 IOS Reconfiguration API.
3.11 A UML class diagram of the main SALSA/IOS Actor classes and behaviors. The diagram shows the relationships between the Actor, UniversalActor, AutonomousActor, and MalleableActor classes.
4.1 The peer-to-peer virtual network topology. Middleware agents represent heterogeneous nodes, and communicate with groups or peer agents. Information is propagated through the virtual network via these communication links.
4.2 The cluster-to-cluster virtual network topology. Homogeneous agents elect a cluster manager to perform intra and inter cluster load balancing. Clusters are dynamically created and readjusted as agents join and leave the virtual network.
4.3 Scenarios of joining and leaving the IOS virtual network: (a) A node joins the virtual network through a peer server, (b) A node joins the virtual network through an existing peer, (c) A node leaves the virtual network.
4.4 Algorithm for joining an existing virtual network and finding peer nodes.
4.5 An example that shows the propagation of work-stealing packets across the peers until an overloaded node is reached. The example shows the request starting with a time-to-live (TTL) of 5. The TTL is decremented by each forwarding node until it reaches the value of 0, then the packet is no longer propagated.
4.6 Information exchange between peer agents using work-stealing request messages.
4.7 Plots of the expected gain decision function versus the process remaining lifetime with different values of the number of migrations in the past. The remaining lifetime is assumed to have a half-life time expectancy.
5.1 Steps involved in communicator handling to achieve MPI process migration.
5.2 Example M to N split operations.
5.3 Example M to N merge operations.
5.4 Examples of domain decomposition strategies showing block, column, and diagonal decompositions for a 3D data-parallel problem.
5.5 Skeleton of the original MPI code of an MPI application.
5.6 Skeleton of the malleable MPI code with PCM calls: initialization phase.
5.7 Skeleton of the malleable MPI code with PCM calls: iteration phase.
5.8 The layered design of MPI/IOS which includes the MPI wrapper, the PCM runtime layer, and the IOS runtime layer.
5.9 Architecture of a node running MPI/IOS enabled applications.
5.10 Library and executable structure of an MPI/IOS application.
5.11 A reconfiguration scenario of an MPI/IOS application.
5.12 IOS/MPI proxy software architecture.
5.13 The packet format of MPI/IOS proxy control and profiling messages.
6.1 The two-dimensional heat diffusion problem.
6.2 Parallel decomposition of the 2D heat diffusion problem.
6.3 Performance of the massively parallel unconnected benchmark.
6.4 Performance of the massively parallel sparse benchmark.
6.5 Performance of the highly synchronized tree benchmark.
6.6 Performance of the highly synchronized hypercube benchmark.
6.7 The tree topology on a dynamic environment using ARS and RS.
6.8 The unconnected topology on a dynamic environment using ARS and RS.
6.9 The hypercube application topology on Internet-like environments.
6.10 The hypercube application topology on Grid-like environments.
6.11 The tree application topology on Internet-like environments.
6.12 The tree application topology on Grid-like environments.
6.13 The sparse application topology on Internet-like environments.
6.14 The sparse application topology on Grid-like environments.
6.15 The unconnected application topology on Internet-like environments.
6.16 The unconnected application topology on Grid-like environments.
6.17 Single vs. group migration for the unconnected application topology.
6.18 Single vs. group migration for the sparse application topology.
6.19 Single vs. group migration for the tree application topology.
6.20 Single vs. group migration for the hypercube application topology.
6.21 Overhead of using SALSA/IOS on a massively parallel astronomic data-modeling application with various degrees of parallelism on a static environment.
6.22 Overhead of using SALSA/IOS on a tightly synchronized two-dimensional heat diffusion application with various degrees of parallelism on a static environment.
6.23 Overhead of the PCM library.
6.24 Total running time of reconfigurable and non-reconfigurable execution scenarios for different problem data sizes for the heat diffusion application.
6.25 Breakdown of the reconfiguration overhead for the experiment of Figure 6.24.
6.26 Performance of the heat diffusion application using MPICH2 and MPICH2 with IOS.
6.27 The expansion and shrinkage capability of the PCM library.
6.28 Adaptation using malleability and migration as resources leave and join.
6.29 Dynamic reconfiguration using malleability and migration compared to dynamic reconfiguration using migration alone in a dynamic virtual network of IOS agents. The virtual network was varied from 8 to 12 to 16 to 15 to 10 to 8 processors. Malleable entities outperformed solely migratable entities on average by 5%.

LIST OF TABLES

2.1 Layers of the grid architecture.
2.2 Characteristics of some grid resource management systems.
2.3 Examples of a Universal Actor Name (UAN) and a Universal Actor Locator (UAL).
2.4 Some Java concepts and analogous SALSA concepts.
4.1 The range of communication latencies to group the list of peer hosts.
5.1 The PCM API.
5.2 The structure of the assigned UAN/UAL pair for MPI processes at the MPI/IOS proxy.
5.3 Proxy control message types.
5.4 Proxy profiling message types.

ACKNOWLEDGMENTS

It is with great pleasure that I wish to acknowledge several people that have helped me tremendously during the difficult, challenging, yet rewarding and exciting path towards a Ph.D. Without
their help and support, none of this work could have been possible.

First and foremost, I am greatly indebted to my advisor Dr. Carlos A. Varela for his guidance, encouragement, motivation, and continued support throughout my academic years at RPI. Carlos has allowed me to pursue my research interests with sufficient freedom, while always being there to guide me. Working with him has been one of the most rewarding experiences of my professional life.

I am also deeply indebted to Professor Boleslaw K. Szymanski, my committee member, for supporting my work. He has been very instrumental to the realization of this work through his keen guidance and encouragement. Working with him has been a great pleasure.

I am also very thankful to the rest of my committee members, Dr. Joseph E. Flaherty, Dr. Ian Foster, Dr. Franklin Luk, and Dr. James D. Teresco. I am grateful to them for agreeing to serve on my committee and for their valuable suggestions and comments.

Special thanks go to my colleague, Travis Desell, for his key contributions to the design and development of the Internet Operating System middleware. Many thanks go also to the rest of the Worldwide Computing laboratory members, Wei-Jen Wang, Jason LaPorte, Jiao Tao, and Brian Boodman, for their valuable comments and constructive criticism. My fruitful discussions and interactions with them helped me grow professionally.

I am grateful to the administrative staff of the Computer Science department, who have spared no efforts helping me in various aspects of my academic life at RPI. They were some of the best people I have ever worked with. In particular, I would like to thank Pamela Paslow for her constant help and for also being a true friend. She was always there for me in easy and difficult times. She has kept me on top of all the necessary paperwork. I would like also to thank Jacky Carley, Shannon Carrothers, Chris Coonrad, and Steven Lindsey.

I have been fortunate to have met great friends throughout my Ph.D. journey. They have bestowed so much love on me. I am forever grateful for their moral support, encouragement, and true friendship. In particular I would like to thank Bouchra Bouqata, Houda Lamehamedi, Fatima Boussetta, and Khadija Omo-meriem. Special thanks go to Rebiha Chekired for caring for my baby, Zayneb, during the last year of my Ph.D. She acted as a loving and caring second mother to my baby during times I could not be around. I am forever grateful to her.

Last but not least, I am forever indebted to my husband, Bouchaib Cherif, my parents, my sisters Hajar and Meriem, my brother Ahmed, and the rest of my family. My husband has been a great source of inspiration to me. None of this would have been possible without his love, support, and continuous encouragement. My parents' prayers have always accompanied me. Their love keeps me going. My daughter Zayneb has been the greatest source of motivation and inspiration during the last year of my Ph.D. I am very lucky to have been blessed with her. I am grateful to all of them. This work is dedicated to my family.

ABSTRACT

Advances in hardware technologies are constantly pushing the limits of processing, networking, and storage resources, yet there are always applications whose computational demands exceed even the fastest technologies available. It has become critical to look into ways to efficiently aggregate distributed resources to benefit a single application. Achieving this vision requires the ability to run applications on dynamic and heterogeneous environments such as grids and shared clusters. New challenges emerge in such environments, where performance variability is the rule and not the exception, and where the availability of the resources can change anytime. Therefore, applications require the ability to dynamically reconfigure to adjust to the dynamics of the underlying resources.

To realize this vision, we have developed the Internet Operating System (IOS), a framework for middleware-driven application reconfiguration in dynamic execution environments. Its goal is to provide high performance to individual applications in dynamic settings and to provide the necessary tools to facilitate the way in which scientific and engineering applications interact with dynamic environments and reconfigure themselves as needed. Reconfiguration in IOS is triggered by a set of decentralized agents that form a virtual network topology. IOS is built modularly to allow the use of different algorithms for agents' coordination, resource profiling, and reconfiguration. IOS exposes generic APIs to high-level applications to allow for interoperability with a wide range of applications. We investigated two representative virtual topologies for inter-agent coordination: a peer-to-peer and a cluster-to-cluster topology. As opposed to existing approaches, where application reconfiguration has mainly been done at a coarse granularity (e.g., application-level), IOS focuses on migration at a fine granularity (e.g., process-level) and introduces a novel reconfiguration paradigm, malleability, to dynamically change the granularity of an application's entities. Combining migration and malleability enables more effective, flexible, and scalable reconfiguration.

IOS has been used to reconfigure actor-oriented applications implemented using the SALSA programming language and iterative process-oriented applications that follow the Message Passing Interface (MPI) model. To benefit from IOS reconfiguration capabilities, applications need to be amenable to entity migration or malleability. This issue has been addressed in iterative MPI applications by designing and building a library for process checkpointing, migration, and malleability (PCM) and integrating it with IOS. Performance results show that adaptive middleware can be an effective approach to reconfiguring distributed applications with various ratios of communication to computation in order to improve their performance, and more effectively utilize dynamic resources. We have measured the middleware overhead in static environments, demonstrating that it is less than 7% on average, yet reconfiguration on dynamic environments can lead to significant improvement in an application's execution time. Performance results also show that taking into consideration the application's communication topology in the reconfiguration decision improves throughput by almost an order of magnitude in benchmark applications with sparse inter-process connectivity.

CHAPTER 1
Introduction

Computing environments have evolved from single-user environments, to Massively Parallel Processors (MPPs), to clusters of workstations, to distributed systems, and recently to grid computing systems. Every transition has been a revolution, allowing scientists and engineers to solve complex problems and sophisticated applications that could not be solved before. However, every transition has brought along new challenges, new problems, and also the need for technical innovations.

The evolution of computing systems has led to the current situation, where millions of machines are interconnected via the Internet with various hardware and software configurations, capabilities, connection topologies, access policies, etc. The formidable mix of hardware and software resources in the Internet has fueled researchers' interest to start investigating novel ways to exploit this abundant pool of resources in an economic and efficient manner and also to aggregate these distributed resources to benefit a single application. Grid computing has emerged as an ambitious research area to address the problem of efficiently using multi-institutional pools of resources. Its goal is to allow coordinated and collaborative resource sharing and problem solving across several institutions to solve large scientific problems that could not be easily solved within the boundaries of a single institution. The concept of a computational grid first appeared in the mid-1990's, proposed as an infrastructure for advanced science and engineering. This concept has evolved extensively since then and has encompassed a wide range of applications in both the scientific and commercial fields [46]. Computing
power is expected to become in the future a purchasable commodity, like electrical power. Hence the analogy often made between the electrical power grid and the conceptual computational grid.

1.1 Motivation and Research Challenges

New challenges emerge in grid environments, where the computational, storage, and network resources are inherently heterogeneous, often shared, and have a highly dynamic nature. Consequently, observed application performance can vary widely and in unexpected ways. This renders the maintenance of a desired level of application performance a hard problem. Adapting applications to the changing behavior of the underlying resources becomes critical to the creation of robust grid applications. Dynamic application reconfiguration is a mechanism to realize this goal.

We denote by an application's entity a self-contained part of a distributed or parallel application that is running in a given runtime system. Examples include processes in the case of parallel applications, software agents, web services, virtual machines, or actors in the case of actor-based applications. An application's entities could be running in the same runtime environment or on different distributed runtime environments connected through the network. They could be tightly coupled, exchanging many messages, or loosely coupled, with no or few messages exchanged. Dynamic reconfiguration implies the ability to modify the mapping between an application's entities and physical resources and/or modify the granularity of the application's entities while the application continues to run without any disruption of service. Applications should be able to scale up to exploit more resources as they become available or gracefully shrink down as some resources leave or experience failures. It is impractical to expect application developers to handle reconfiguration issues given the sheer size of grid environments and the highly dynamic nature of the resources. Adopting a middleware-driven approach is imperative to achieving efficient deployment of applications in a dynamic grid setting.

Application adaptation has been addressed in previous work in a fairly ad-hoc manner. Usually the code that deals with adaptation is embedded within the application or within libraries that are highly application-model dependent. Most of these strategies require having a good knowledge of the application model and a good knowledge of the execution environments. While these strategies may work for dedicated and fairly static environments, they do not scale up to grid environments that exhibit larger degrees of heterogeneity, dynamic behavior, and a much larger number of resources. Recent work has addressed adaptive execution in grids. Most of the mechanisms proposed have adopted the application stop and restart mechanism; i.e., the entire application is stopped, checkpointed, migrated, and restarted in another hardware configuration (e.g., see [72, 110]). Although this strategy can result in improved performance in some scenarios, more effective adaptivity can be achieved if migration is supported at a finer granularity.

1.1.1 Mobility and Malleability for Fine-grained Reconfiguration

Reconfigurable distributed applications can opportunistically react to the dynamics of their execution environment by migrating data and computation away from unresponsive or slow resources, or into newly available resources. Application stop-and-restart can be thought of as application mobility. However, application entity mobility allows applications to be reconfigured with finer granularity. Migrating entities can thus be easier and more concurrent than migrating a full application. Additionally, concurrent entity migration is less intrusive.

To illustrate the usefulness of such dynamic application entity mobility, consider an iterative application computing heat diffusion over a two-dimensional surface. At each iteration, each cell recomputes its value by applying a function of its current value and its neighbors' values. Therefore, processors need to synchronize at
everyiteration with their neighbors before they can proceed on to a subsequent iteration.Consequently,the simulation runs as slow as the slowest processor in the distributedcomputation,assuming a uniform distribution of data.Clearly,data distributionplays an important role in the eﬃciency of the simulation.Unfortunately,in sharedenvironments,the load of involved processors is unpredictable,ﬂuctuating as newjobs enter the system or old jobs complete.We evaluated the execution time of the heat simulation with and without thecapability of application reconﬁguration under a dynamic run-time environment:theapplication was run on a cluster and soon after the application started,artiﬁcial load4Non-reconfigurable Execution Time

Reconfigurable Execution Time

4006008001,0001,2001,4003.812.441.370.95Time(s)Data Size (Megabytes)2000Figure 1.1:Execution time without and with autonomous migration in adynamic run-time environment.was introduced in one of the cluster machines.Figure 1.1 shows the speedup obtainedby migrating the slowest process to an available node in a diﬀerent cluster.While entity migration can provide signiﬁcant beneﬁts in performance to dis-tributed applications over shared and dynamic environments,it is limited by thegranularity of the application’s entities.To illustrate this limitation,we use anotheriterative application.This application is run on a dynamic cluster consisting of ﬁveprocessors (see Figure 1.2).In order to use all the processors,at least one entityper processor is required.When a processor becomes unavailable,the entity on thatprocessor can migrate to a diﬀerent processor.With ﬁve entities,regardless of howmigration is done,there will be imbalance of work on the processors,so each iterationneeds to wait for the pair of entities running slower because they share a processor.In the example,5 entities running on 4 processors was 45% slower than 4 entitiesrunning on 4 processors,with otherwise identical parameters.One alternative to ﬁxthis load imbalance is to increase the number of entities to enable a more even dis-tribution of entities no matter how many processors are available.In the example ofFigure 1.2,60 entities were used since 60 is divisible by 5,4,3,2 and 1.Unfortunately,the increased number of entities introduces overhead which causes the application torun slower,approximately 7.6% slower.Additionally,this approach is not scalable,as the number of entities required for this scheme is the least common multiple ofdiﬀerent combinations of processor availability.In many cases,the availability of5N=5

N=P (SM)N=PN=60

8010012014016018054321Iteration Time(sec)Number of Processors6040200Figure 1.2:Execution time with diﬀerent entity granularities in a staticrun-time environment.resources is unknown at the application’s startup so an eﬀective number of entitiescannot be statically determined.Figure 1.2 shows these two approaches comparedto a good distribution of work,one entity per processor.N is the number of entitiesand P is the number of processors.N=P with split and merge (SM) capabilitiesuses entities with various granularities,while N=P shows the optimal conﬁgurationfor this example (with no dynamic reconﬁguration and middleware overhead).N=60and N=5 show the best conﬁguration possible using migration with a ﬁxed numberof entities.In this example,if a ﬁxed number of entities is used,averaged over allprocessor conﬁgurations,using ﬁve entities is 13.2% slower,and using sixty entitiesis 7.6% slower.To illustrate further how process’s granularity impacts the node-level perfor-mance,we run an iterative application with diﬀerent numbers of processes on thesame dedicated node.The larger the number of processes,the smaller the data gran-ularity of each process.Figure 1.3 shows an experiment where the parallelism ofan iterative application was varied on a dual-processor node.In this example,hav-ing one process per processor did not give the best performance,but increasing theparallelism beyond a certain point also introduces a performance penalty.We introduce the concept of mobile malleable entities to solve the problem ofappropriately using resources in the face of a dynamic execution environment wherethe available resources may not be known.Instead of having a static number ofentities,malleable entities can split,creating more entities,and merge,reducing the6Figure 1.3:Throughput as the process data granularity decreases on adedicated node.number of entities,redistributing data based on these operations.With malleableentities,the application’s granularity,and as a consequence,the 
number of entities, can also be changed dynamically. Applications define how entities split and merge, while the middleware determines when, based on resource availability information, and what entities to split or merge, depending on their communication topologies. As the dynamic environment of an application changes, the granularity and data distribution of that application can be changed in response, to utilize its environment most efficiently.

1.1.2 Middleware-driven Dynamic Application Reconfiguration

There are a number of challenges to enabling dynamic reconfiguration in distributed applications. We divide them into middleware challenges, programming technology challenges, and challenges at the interface between the middleware and the applications.

Middleware-level Challenges

Middleware challenges include the need for continuous and non-intrusive profiling of the run-time environment's resources, and the need to determine when an application reconfiguration is expected to lead to performance improvements or better resource utilization. A middleware layer needs to accomplish this in a decentralized way, so as to be scalable. The meta-level information that the middleware manages must include information on the communication topology of the application entities, to co-locate those that communicate extensively whenever possible, avoiding high-latency communication. A good compromise must also be found between highly accurate meta-level information (potentially very expensive to obtain, at a cost of intrusiveness to running applications) and partial, inaccurate meta-level information (cheap to obtain in non-intrusive ways, but possibly leading to far-from-optimal reconfiguration decisions). Since no single policy fits all, modularity is needed to be able to plug in and fine-tune different resource profiling and management policies embedded in the middleware.

Application-level Challenges

The middleware can only trigger reconfiguration requests to applications that support them. Programming models advocating for a clean
encapsulation of state inside the application entities and asynchronous communication among entities make the process of reconfiguring applications dynamically much more manageable. This is because there is no need for replicated shared-memory consistency protocols or for the preservation of method invocation stacks upon entity migration. While entity mobility is relatively easy and transparent for these programming models, entity malleability requires more cooperation from application developers, as it is highly application-dependent. For programming models where shared memory or synchronous communication are used, application programming interfaces need to be defined to enable developers to specify how application entity mobility and malleability are supported by specific applications. These models make the reconfiguration process less transparent and sometimes limit the applicability of the approach to specific classes of applications, e.g., massively parallel or iterative applications.

Cross-cutting Challenges

Finally, applications need to collaborate with the middleware layer by exporting meta-level information on entity interconnectivity and resource usage and by providing operations to support potential reconfiguration requests from the middleware layer. This interface needs to be as generic as possible to accommodate a wide variety of programming models and application classes.

1.2 Problem Statement and Methodology

The focus of this research is to build a modular framework for middleware-driven application reconfiguration in dynamic execution environments such as grids and shared clusters. The main objectives of this framework are to provide high performance to individual applications in dynamic settings and to provide the necessary tools to facilitate the way in which scientific and engineering applications interact with dynamic execution environments and reconfigure themselves as needed. Hence, such applications will be allowed to benefit from these rapidly evolving systems and from the wide
spectrum of resources available in them.

This research addresses most of the issues described in the previous section through the following methodology.

A Modular Middleware for Adaptive Execution

The Internet Operating System (IOS) is a middleware framework that has been built with the goal of addressing the problem of reconfiguring long-running applications in large-scale dynamic settings. Our approach to dynamic reconfiguration is twofold. On the one hand, the middleware layer is responsible for resource discovery, monitoring of application-level and resource-level performance, and deciding when, what, and where to reconfigure applications. On the other hand, the application layer is responsible for dealing with the operational issues of migration and malleability and with the profiling of application communication and computational patterns. IOS is built with modularity in mind to allow the use of different modules for agent coordination, resource profiling, and reconfiguration algorithms in a plug-and-play fashion. This feature is very important since there is no "one size fits all" method for performing reconfiguration for a wide range of applications and in highly heterogeneous and dynamic environments. IOS is implemented in Java and SALSA [114], an actor-oriented programming language. IOS agents leverage the autonomous nature of the actor model and use several coordination constructs and the asynchronous message passing provided by the SALSA language.

Decentralized Coordination of Middleware Agents

IOS embodies resource profiling and reconfiguration decisions in software agents. IOS agents are capable of organizing themselves into various virtual topologies. Decentralized coordination is used to allow for scalable reconfiguration. This research investigates two representative virtual topologies for inter-agent coordination: a peer-to-peer and a cluster-to-cluster coordination topology [73]. The coordination topology of IOS agents has a great impact on how quickly reconfiguration decisions are made. In a more
structured environment such as a grid of homogeneous clusters, a hierarchical topology generally performs better than a purely peer-to-peer topology [73].

Generic Interfaces for Portable Interoperability with Applications

IOS exposes several APIs for profiling applications' communication patterns and for triggering reconfiguration actions such as migration and split and merge. These interfaces shield many of the intrinsic details of reconfiguration from application developers and provide a unified and clean way of interaction between applications and the middleware. Any application or programming model that implements the IOS generic interfaces becomes reconfigurable.

A Generic Process Checkpointing, Migration and Malleability Scheme for Message Passing Applications

Migration and malleability capabilities are highly dependent on the application's programming model. Therefore, there has to be built-in library or language-level support for migration and malleability to allow applications to be reconfigurable. Part of this research consists of building the necessary tools to allow message passing applications to become reconfigurable with IOS. A library for process checkpointing, migration, and malleability (PCM) has been designed and developed for MPI iterative applications. PCM is a user-level library that provides checkpointing, profiling, migration, split, and merge capabilities for a large class of iterative applications. Programmers need to specify the data structures that must be saved and restored to allow process migration and to instrument their applications with a few PCM calls. PCM also provides process split and merge functionality to MPI programs. Common data distributions are supported, such as block, cyclic, and block-cyclic. PCM implements the IOS generic profiling and reconfiguration interfaces, and therefore enables MPI applications to benefit from IOS reconfiguration policies.

The PCM API is simple to use and hides many of the intrinsic details of how to perform reconfiguration through migration, split
and merge. Hence, with minimal code modification, a PCM-instrumented MPI application becomes malleable and ready to be reconfigured transparently by the IOS middleware. In addition, legacy MPI applications can benefit tremendously from the reconfiguration features of IOS by simply inserting a few calls to the PCM API.

1.3 Thesis Contributions

This research has generated a number of original contributions. They are summarized as follows:

1. A modular middleware for application reconfiguration with the goal of maintaining reasonable performance in a dynamic environment. The modularity of our middleware is demonstrated through the use of several reconfiguration and coordination algorithms.

2. Fine-grained reconfiguration that enables reasoning about application entities rather than the entire application, and therefore provides more concurrent and efficient adaptation of the application.

3. Decentralized and scalable coordination strategies for middleware agents that are based on partial knowledge.

4. Generic reconfiguration interfaces for application-level profiling and reconfiguration decisions to allow increased and portable adoption by several programming models and languages. This has been demonstrated through the successful reconfiguration of actor-oriented programs in SALSA and process-oriented programs using MPI.

5. A portable protocol for interoperation and interaction between applications and the middleware to ease the transition to reconfigurable execution in grid environments.

6. A user-level checkpointing and migration library for MPI applications to help develop reconfigurable message passing applications.

7. The use of split and merge, or malleability, as another reconfiguration mechanism to complement and enhance application adaptation through migration. The support of malleability in MPI applications developed by this research is the first of its kind in terms of splitting and merging MPI processes.

8. A resource-sensitive model for deciding when to migrate, split, or merge an application's entities. This
model enables reasoning about computational resources in a unified manner.

1.4 Thesis Roadmap

The remainder of this thesis is organized as follows:

• Chapter 2 discusses background and related work in the context of dynamic reconfiguration of applications in grid environments. It starts by giving a literature review of emerging grid middleware systems and how they address resource management. It then reviews existing efforts for application adaptation in dynamic grid environments. An overview of programming models that are suitable for grid environments is given. Finally, background information about worldwide computing, the actor model of computation, and the SALSA programming language is presented.

• Chapter 3 starts by presenting key design goals that have been fundamental to the implementation of the Internet Operating System (IOS) middleware. These include operational and architectural goals at both the middleware level and the application level. Then the architecture of IOS is explained. This chapter also explains in detail the various modules of IOS and its generic interfaces for profiling and reconfiguration.

• Chapter 4 presents the different protocols and policies implemented as part of the middleware infrastructure. The protocols deal with coordinating the activities of middleware agents and forwarding work-stealing requests in a peer-to-peer fashion. At the core of the adaptation strategies, a resource-sensitive model is used to decide when, what, and where to reconfigure application entities.

• Chapter 5 explains how iterative MPI-based applications are reconfigured using IOS. First, the checkpointing, migration, and malleability (PCM) library is presented. The chapter then proceeds by showing how the PCM library is integrated with IOS and the protocol used to achieve this integration.

• Chapter 6 presents the various kinds of experiments conducted in this research and the results obtained. In the first section, the performance evaluation of the
protocols, scalability, and overhead. The second section presents various experiments that evaluate the reconfiguration functionalities of IOS using migration, split and merge, and a combination of them.

• Chapter 7 concludes this thesis with a discussion of future directions.

CHAPTER 2
Background and Related Work

The deployment of grid infrastructures is a challenging task that goes beyond the capabilities of application developers. Specialized grid middleware is needed to mitigate the complex issues of integrating a large number of dynamic, heterogeneous, and widely distributed resources. Institutions need sophisticated mechanisms to leverage and share their existing information infrastructures in order to be part of public collaborations. Grid middleware should address the following issues:

• Security. The absence of central management and the open nature of grid resources result in having to deal with several administrative domains, each of which brings along different resource access policies and security requirements. Being able to access externally-owned resources requires having the credentials required by the external organizations. Users should be able to log on once and execute applications across various domains (this capability of logging on once, instead of logging in to every machine used, is referred to in the computational grid community as the single sign-on policy). Furthermore, common methods for negotiating authorization and authentication are also needed.

• Resource management. Resource management is a fundamental issue for enabling grid applications. It deals with job submission, scheduling, resource allocation, resource monitoring, and load balancing. Resources in the grid have a very transient nature: they can experience constantly changing loads and availability because of shared access and the absence of tight user control. Reliable and efficient execution of applications on such platforms requires application adaptation to the dynamic nature of the underlying grid
environments. Adaptive scheduling and load balancing are necessary to achieve high performance of grid applications and high utilization of systems' resources.

• Data management. Since grid infrastructures involve widely distributed resources, data and processing might not necessarily be collocated. Concerns in data management arise, such as how to efficiently distribute, replicate, and access potentially massive amounts of data.

• Information management. Making informed decisions requires resource managers to be able to discover available resources and learn about their characteristics (capacities, availability, current utilization, access policies, etc.). Grid information services should allow the monitoring and discovery of resources and should make this information available when necessary to grid resource managers.

This chapter focuses mainly on existing research in the area of grid resource management. The chapter is organized as follows. Section 2.1 surveys existing grid middleware systems. Section 2.2 discusses related work in resource management in grid systems. Section 2.3 discusses existing work in adaptive execution of grid applications. Section 2.4 presents various programming models that are good candidates for developing grid applications. In Section 2.5, we review basic peer-to-peer concepts and how they have been used in grid systems. Finally, Section 2.6 gives an overview of the worldwide computing project and presents several key concepts that have been fundamental to this dissertation.

2.1 Grid Middleware Systems

Over the last few years, several efforts have focused on building the basic software tools to enable resource sharing within scientific collaborations. Among these efforts, the most successful have been Globus [45], Condor [105], and Legion [29].

Figure 2.1: A layered grid architecture and components (adapted from [14]).

The Globus toolkit has emerged as the de facto standard middleware infrastructure for grid computing. Globus defines several protocols, APIs, and services that
provide solutions to common grid deployment problems such as authentication, remote job submission, resource discovery, resource access, and the transfer of data and executables. Globus adopts a layered service model that is analogous to the layered network model. Figure 2.1 shows the layered grid architecture that the Globus project adopts. The different layers of this architecture are briefly described in Table 2.1.

Table 2.1: Layers of the grid architecture.

Grid Fabric: Distributed resources such as clusters, machines, supercomputers, storage devices, scientific instruments, etc.
Core Middleware: A bag of services that offer remote process management, allocation of resources from different sites to be used by the same application, storage management, information registration and discovery, security, and Quality of Service (QoS).
User Level Middleware: A set of interfaces to core middleware services that provide higher levels of abstraction to end applications. These include resource brokers, programming tools, and development environments.
Grid Applications: Applications developed using grid-enabled programming models such as MPI.

Condor is a distributed resource management system that is designed to support high-throughput computing by harvesting idle resource cycles. Condor discovers idle resources in a network and allocates them to application tasks [104]. Fault tolerance is also supported through checkpointing mechanisms.

Legion specifies an object-based virtual machine environment that transparently integrates grid system components into a single address space and file system. Legion plays the role of a grid operating system by addressing issues such as process management, input-output operations, inter-process communication, and security [80].

Condor-G [49] combines both Condor and Globus technologies. This merging combines Condor's mechanisms for intra-domain resource management and fault tolerance with Globus protocols for security and for inter-domain resource discovery, access, and management. Entropia [30] is
another popular system that utilizes cycle-harvesting mechanisms. Entropia adopts mechanisms similar to Condor's for resource allocation, scheduling, and job migration. However, it is tailored only for Microsoft Windows 2000 machines, while Condor is tailored for both Unix and Windows platforms. WebOS [111] is another research effort with the goal of providing operating system services to wide-area applications, such as resource discovery, remote process execution, resource management, authentication, security, and a global namespace.

2.2 Resource Management in Grid Systems

The nature of grid systems has dictated the need to come up with new models and protocols for grid resource management. Grid resource management differs from conventional cluster resource management in several aspects. In contrast to cluster systems, grid systems are inherently more complex, dynamic, heterogeneous, autonomous, unreliable, and large-scale. Several requirements need to be met to achieve efficient resource management in grid systems:

• Site autonomy. Traditional resource management systems assume tight control over the resources. These assumptions make it easier to design efficient policies for scheduling and load balancing. Such assumptions disappear in grid systems, where resources are dispersed across several administrative domains with different scheduling policies, security mechanisms, and usage patterns. Additionally, resources in grid systems have a non-deterministic nature: they might join or leave at any time. It is therefore critical for a grid resource management system to take all these issues into account and preserve the autonomy of each participating site.

• Interoperability. Several sites use different local resource management systems such as the Portable Batch System (PBS), Load Sharing Facility (LSF), Condor, etc. Meta-schedulers need to be built that are able to interface and inter-operate with all the different local resource managers.

• Flexibility and extensibility. As systems evolve, new policies get implemented and
adopted. The resource management system should be extensible and flexible to accommodate newer systems.

• Support of negotiation. QoS is an important requirement to meet several application requirements and guarantees. Negotiation between the different participating sites is needed to ensure that the local policies will not be broken and that the running applications will satisfy their requirements with certain guarantees.

• Fault tolerance. As systems grow in size, the chance of failures becomes non-negligible. Replication, checkpointing, job restart, and other forms of fault tolerance have become a necessity in grid environments.

• Scalability. All resource management algorithms should avoid centralized protocols to achieve more scalability. Peer-to-peer and hierarchical approaches are good candidate protocols.

Table 2.2: Characteristics of some grid resource management systems.

Globus: hierarchical architecture; decentralized query discovery with periodic push dissemination; partial support of QoS (state estimation relies on external tools such as NWS for profiling and forecasting of resources' performance); decentralized scheduling that uses external schedulers for intra-domain scheduling.
Condor: flat architecture; centralized query discovery with periodic push dissemination; no support of QoS; centralized scheduling with matchmaking between client requests and resources' capabilities.
Legion: hierarchical architecture; decentralized query discovery with periodic pull dissemination; partial support of QoS; hierarchical scheduling in which several schedules can be generated and the best is selected by the scheduler.

The subsequent section surveys the main grid resource management systems, how they have addressed some of the issues discussed above, and their limitations. For each system, we will discuss its mechanisms for resource dissemination, discovery, scheduling, and profiling. Resource dissemination protocols can be classified as using either push or pull models. In the push model, information about resources is periodically pushed to a database. The
opposite is done in a pull model, where resources are periodically probed to collect information about them. Table 2.2 provides a summary of some characteristics of the resource management features of the surveyed systems.

2.2.1 Resource Management in Globus

A Globus resource management system consists of resource brokers, resource co-allocators, and resource managers, also referred to as Globus Resource Allocation Managers (GRAM). The task of co-allocation refers to allocating resources from different sites or administrative domains to be used by the same application. Dissemination of resource information is done through an information service called the Grid Information Service (GIS), also known as the Monitoring and Discovery Service (MDS) [36]. MDS uses the Lightweight Directory Access Protocol (LDAP) [34] to interface with the gathered information about resources. MDS stores information about resources such as the number of CPUs, the operating systems used, the CPU speeds, the network latencies, etc. MDS consists of a Grid Index Information Service (GIIS) and a Grid Resource Information Service (GRIS). GRIS provides resource discovery services such as gathering, generating, and publishing data about resource characteristics in an MDS directory. GIIS tries to provide a global view of the different information gathered from the various GRIS services. The aim is to make it easy for grid applications to look up desired resources and match them to their requirements. GIIS indexes the resources in a hierarchical name space organization. Resource information is updated in GIIS by push strategies. Resource brokers discover resources by querying MDS.

Globus relies on local schedulers that implement Globus interfaces to resource brokers. These schedulers could be application-level schedulers (e.g., AppLeS [16]), batch systems (e.g., PBS), etc. Local schedulers translate application requirements into a common language, called the Resource Specification Language (RSL). RSL is a set of expressions that specify the jobs and the
characteristics of the resources required to run them. Resource brokers are responsible for taking high-level descriptions of resource requests and translating them into more specialized and concrete specifications. The transformed request should contain concrete resources and their actual locations. This process is referred to as specialization.

Specialized resource requests are passed to co-allocators, who are responsible for allocating requests at multiple sites to be used simultaneously by the same application. The actual scheduling and execution of submitted jobs is done by the local schedulers. GRAM authenticates the resource requests and schedules them using the local resource managers. GRAM tries to simplify the development and deployment of grid applications by providing common APIs that hide the details of local schedulers, queuing systems, interfaces, etc. Grid users and developers do not need to know all the details of other systems. GRAM acts as an entry point to various implementations of local resource management. It uses the concept of the hourglass, where GRAM is the neck of the hourglass, with applications and higher-level services (such as resource brokers or meta-schedulers) above it and local control and access mechanisms below it.

To sum up, Globus provides a bag of services to simplify resource management at a meta-level. The actual scheduling still needs to be done by the individual resource brokers. Ensuring that an application efficiently uses resources from various sites is still a complex task. Developers still need to bear the burden of understanding the requirements of the application, the characteristics of the grid resources, and the optimal ways of scheduling the application using dynamic grid resources to achieve high performance.

2.2.2 Resource Management in Condor

The philosophy of Condor [105] is to maximize the utilization of machines by harvesting idle cycles. A group of machines managed by a Condor scheduler is called a Condor pool. Condor uses a centralized scheduling
management scheme: a machine in the pool is dedicated to scheduling and information management. Submitted jobs are queued and transparently scheduled to run on the available machines of the pool. Job resource requests are communicated to the manager using the Classified Advertisements (ClassAds) [35] resource specification language. (The resource specification language used by Globus follows Condor's model for ClassAds; however, Globus's language is more flexible and expressive.) Attributes such as the processor's type, operating system, and available memory and disk space are used to indicate jobs' resource requirements.

Resource dissemination is done through periodic push mechanisms. Machines periodically advertise their capabilities and their job preferences in advertisements that also use the ClassAds specification language. When a job is submitted to the Condor scheduler by a client machine, matchmaking is used to find the jobs and machines that best suit each other. Information about the chosen resources is then returned to the client machine. A shadow process is forked on the client machine to take care of transferring executables and I/O redirection.

Flocking [39] is an enhancement of Condor to share idle cycles across several administrative domains. Flocking allows several pool managers to communicate with one another and to submit jobs across pools. To overcome the problem of not having shared file systems, a split-execution model is used: I/O commands generated by a job are captured and redirected to the shadow process running on the client machine. This technique avoids transferring files or mounting foreign file systems.

Condor supports job preemption and checkpointing to preserve machine autonomy. Jobs can be preempted and migrated to other machines if their current machines decide to withdraw from the computation.

Condor adopts the philosophy of high-throughput computing (HTC) as opposed to high-performance computing (HPC). In HTC systems, the objective is to maximize the throughput of the entire
system, as opposed to maximizing individual application response time in HPC systems. A combination of both paradigms should exist in grids to achieve efficient execution of multi-scale applications. Improving utilization and improving overall response and running time for large multi-scale applications are both important to justify the applicability of grid environments. On the one hand, application users will not be willing to use grids unless they expect to improve their performance dramatically. On the other hand, the grid computing vision tries to minimize idle resources by allowing resource sharing at a large scale.

2.2.3 Resource Management in Legion

Legion [29] is an object-based system that provides an abstraction over wide-area resources as a worldwide virtual computer by playing the role of a grid operating system. It provides, in a grid setting, some of the traditional features that an operating system provides, such as a global namespace, a shared file system, security, process creation and management, I/O, resource management, and accounting [80]. Everything in Legion is an object: an active process that reacts to function invocations from other objects in the system. Legion provides high-level specifications and protocols for object interaction; the implementations still have to be provided by the users. Legion objects are managed by their own class object instances. The class object is responsible for creating new instances, activating or deactivating them, and scheduling them. Legion defines core objects that implement system-level mechanisms: host objects represent compute resources, while vault objects represent persistent storage resources.

The resource management system in Legion consists of four components: a scheduler, a schedule implementor, a resource database, and the pool of resources. Resource dissemination is done through a push model. Resources interact with the resource database, also called the collection. Users or schedulers obtain information about resources by querying the collection. For
scalability purposes, there might be more than one collection object; these objects are capable of exchanging information about resources. Scheduling in Legion has a hierarchical structure: higher-level schedulers schedule resources on clusters, while lower-level schedulers schedule jobs on the local resources. When a job is submitted, an appropriate scheduler (an application-specific scheduler or a default scheduler) is selected from the framework. The scheduler, also called the enactor object, is responsible for enforcing the generated schedule. There might be more than one schedule generated; the best schedule is selected, and when it fails the next best is tried, until all the schedules are exhausted.

Similar to Globus, Legion provides a framework for creating and managing processes in a grid setting. However, making efficient use of the overall framework to achieve high performance is still a job that needs to be done by the application developers.

2.2.4 Other Grid Resource Management Systems

Several resource management systems for grid environments exist besides the systems discussed above. 2K [65] is a distributed operating system that is based on CORBA. It addresses the problems of resource management in heterogeneous networks, dynamic adaptability, and configuration of entity-based distributed applications [64]. Bond [60] is a Java distributed agents system. The European DataGrid [56] is a Globus-based system for the storage and management of data-intensive applications. Nimrod [5] provides a distributed computing system for parametric modeling that supports a large number of computational experiments. Nimrod-G [25] is an extension of Nimrod that uses the Globus services and that follows a computational economic model for scheduling. NetSolve [27] is a system that has been designed to solve computational science problems through a client-agent-server architecture. WebOS seeks to provide operating system services, such as client authentication, naming, and persistent storage, to wide-area applications
[111]. There is a large body of research into computational grids and grid-based middleware; hence, this section attempts to discuss only a selection of this research area. The reader is referred to [66] and [115] for a more comprehensive survey of systems geared toward distributed computing on a large scale.

2.3 Adaptive Execution in Grid Systems

Globus middleware provides the services needed for secure multi-site execution of large-scale applications, gluing together different resource management systems and access policies. The dynamic and transient nature of grid systems necessitates adaptive models to enable a running application to adapt itself to rapidly changing resource conditions. Adaptivity in grid computing has been mainly addressed by adaptive application-level scheduling and dynamic load balancing. Several projects have developed application-oriented adaptive execution mechanisms over Globus to achieve an efficient exploitation of grid resources. Examples include AppLeS [17], Cactus-G [12], GrADS [40], and GridWay [59]. These systems share many features, with differences in the ways they are implemented.

AppLeS [17] applications rely on structural performance models that allow prediction of application performance. The approach incorporates static and dynamic resource information, performance predictions, application- and user-specific information, and scheduling techniques to adapt the application's execution "on the fly". To make this approach more generic and reusable, a set of template-based software for collections of structurally similar applications has been developed. After performing the resource selection, the scheduler determines a set of candidate schedules based on the performance model of the application. The best schedule is selected based on the user's performance criteria, such as execution time and turn-around time. The schedule generated can be adapted and refined to cope with the changing behavior of resources. Jacobi2D [18], Complib [94], and Mcell [28] are examples of applications
benefited from the application-level adaptive scheduling of AppleS.

Adaptive grid execution has also been explored in the Cactus project through support of migration [69]. Cactus is an open-source problem-solving environment designed for solving partial differential equations. Cactus incorporates, through special components referred to as grid-aware thorns [11], adaptive resource selection mechanisms that allow applications to change their resource allocations via migration. Cactus also uses the concept of contract violation: application migration is triggered whenever a contract violation is detected and the resource selector has identified alternative resources. Checkpointing, staging of executables, allocation of resources, and application restart are then performed. Some application-specific techniques were used to adapt large applications to run on the grid, such as adaptive compression, overlapping computation with communication, and redundant computation [12].

The GrADS project [40] has also investigated adaptive execution in the context of grid application development. The goal of the GrADS project is to simplify distributed heterogeneous computing in the same way that the World Wide Web simplified information sharing. Grid application development in GrADS involves the following components: 1) resource selection, performed by accessing the Globus MDS and the Network Weather Service [117] to get information about the available machines; 2) an application-specific performance modeler, used to determine a good initial matching list for the application; and 3) a contract monitor, which detects performance degradation and accordingly reschedules to re-map the application to better resources. The main components involved in application adaptation are the contract monitor, the migrator, and the rescheduler, which decides when to migrate. The migrator component relies on application support to enable migration. The Stop Restart Software (SRS) [110] is a user-level checkpointing library used to equip
the application with the ability to be stopped, checkpointed, and restarted with a different configuration. The Rescheduler component allows migration on request and opportunistic migration. Migration cost is evaluated by considering the predicted remaining execution time in the new configuration, the current remaining execution time, and the cost of rescheduling. Migration happens only if the gain is greater than a 30% threshold [109].

In the GridWay [59] project, the application specifies its resource requirements and ranks the needed resources in terms of their importance. A submission agent automates the entire process of submitting the application and monitoring its performance. Application performance is evaluated periodically by running a performance degradation evaluator program and by evaluating the accumulated suspension time. The application has a job template that contains all the necessary parameters for its execution. In case of a rescheduling event, the framework evaluates whether migration is worthwhile. The submission manager is responsible for the execution of a job during its lifetime. It is also responsible for performing job migration to a new resource. The framework is responsible for submitting jobs, preparing RSL files, performing resource selection, preparing the remote system, and canceling the job in case of a kill, stop, or migration event. When a performance slowdown is detected, rescheduling actions are initiated to find better resources. The resource selector tries to find resources that minimize total response time (file transfer and job execution). Application-level schedulers are used to promote the performance of each individual application without considering the rest of the applications. Migration and rescheduling can be user-initiated or application-initiated.

2.4 Grid Programming Models

To achieve application adaptation to grid environments, it is not enough that the middleware provide the necessary means for state estimation and reconfiguration of resources. The application's
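The threshold-based migration criterion described above can be sketched as follows. The function name and the exact form of the gain formula are illustrative assumptions, not the actual GrADS implementation; the sketch only captures the stated inputs (predicted remaining time in the new configuration, current remaining time, rescheduling cost) and the 30% gain threshold.

```python
def should_migrate(t_remaining_current, t_remaining_new, t_reschedule,
                   threshold=0.30):
    """Decide whether migrating is predicted to pay off.

    The gain is computed here as the fraction of the current remaining
    execution time that would be saved, net of the rescheduling cost.
    Illustrative formula only, not the actual GrADS criterion.
    """
    saved = t_remaining_current - (t_remaining_new + t_reschedule)
    gain = saved / t_remaining_current
    return gain > threshold

# 1000s left locally, 500s on the candidate resource, 100s to reschedule:
# the predicted gain is 40%, above the 30% threshold, so migration proceeds.
migrate = should_migrate(1000.0, 500.0, 100.0)
```

With a smaller improvement (say 800s remaining on the new resource), the net gain drops to 10% and the job stays put, which is the point of the threshold: migration is worthwhile only when the saving clearly outweighs the disruption.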
programming model should also allow the application to react to the different reconfiguration requests from the underlying environment. This functionality can take different forms, such as application migration, process migration, checkpointing, replication, partitioning, or change of application granularity. We describe in what follows existing programming models that appear to be relevant to grid environments and that provide some partial support for reconfiguration.

• Remote procedure calls (RPC) [20]. RPC uses the client-server model to implement concurrent applications that communicate synchronously. The RPC mechanism has traditionally been tailored to single-processor systems and tightly coupled homogeneous systems. GridRPC is a collaborative effort to extend the RPC model to support grid computing and to standardize its interfaces and protocols. The extensions consist basically of providing support for coarse-grained asynchronous systems. NetSolve [27] is a current implementation of GridRPC [92] based on a client-agent-server system. The role of the agent is to locate suitable resources and select the best ones. Load balancing policies are used to attempt a fair allocation of resources. Ninf [91] is another implementation built on top of Globus services.

• Java-based models. Java is a powerful programming environment for developing platform-independent distributed systems. It was originally designed with distributed computing in mind; the applet and Remote Method Invocation (RMI) models are features of this design. The use of Java in grid computing has gained even more interest since the introduction of web services in the OGSI model. Several APIs and interfaces are being developed and integrated with Java. The Java Grande project is a large collaborative effort to bring the Java platform up to speed for high-performance applications. The Java Commodity Grid (CoG) toolkit [70] is another effort to provide Java-based services for grid computing. CoG provides an object-oriented interface to standard Globus toolkit services.

• Message passing. This model is the most widely used programming model for parallel computing. It provides application developers with a set of primitive tools that allow communication between the different tasks, collective operations like broadcasts and reductions, and synchronization mechanisms. However, message passing is still a low-level paradigm that does not provide high-level abstractions for task parallelism, and it requires a lot of expertise from developers to achieve high performance. Popular message passing libraries include MPI and the Parallel Virtual Machine (PVM) [41]. MPI has been implemented successfully on massively parallel processors (MPPs) and supports a wide range of platforms. However, existing portable implementations target homogeneous systems and have very limited support for heterogeneity. PVM provides support for dynamic addition of nodes and host failures; however, its limited ability to deliver the required high performance on tightly coupled homogeneous systems did not encourage wide adoption. Extensions to MPI to meet grid requirements have been actively pursued recently. MPICH-G2 is a grid-enabled implementation of MPI based on MPICH, a portable implementation of MPI. MPICH-G2 is built upon the Globus toolkit. It allows the use of multiple heterogeneous machines to execute MPI applications. It automatically converts data in messages sent between machines of different architectures and supports multi-protocol process communication through automatic selection of TCP for inter-machine messaging and more highly optimized vendor-supplied MPI implementations (whenever available) for intra-machine messaging.

• Actor model [7, 54]. An actor is an autonomous entity that encapsulates state and processing. Actors are concurrent entities that communicate asynchronously. Processing in actors is triggered solely by message reception. In response to a message, an actor can change its current state, create a new actor, or send a message to other actors. The
anatomy of actors facilitates autonomy, mobility, and asynchronous communication, and makes this model attractive for open distributed systems. Several languages and frameworks have implemented the Actor model (e.g., SALSA [114], Actor Foundry [81], Actalk [23], THAL [63], and Broadway [97]). A more detailed discussion of the Actor model and the SALSA language is given in Section 2.6.

• Parallel programming models. Several models have emerged to abstract application parallelism on distributed resources. The Master-Worker (MW) model is a traditional parallel scheme whereby a master task dynamically defines the tasks that must be executed and the data on which they operate. Workers execute the assigned tasks and return the results to the master. This model exhibits a very large degree of parallelism because it generates a dynamic number of independent tasks, which makes it very well suited for grids. The AppleS Master-Worker Application Template (AMWT) provides adaptive scheduling policies for MW applications; the goal is to select the best placement of the master and workers on grid resources to optimize the overall performance of the application. The fork-join model is another model where the degree of parallelism is dynamically determined. In this model, tasks are dynamically spawned and data is dynamically agglomerated based on system characteristics such as the amount of workload or the availability of resources. This model employs a two-level scheduling mechanism: first, a number of virtual processors, which represent kernel-level threads, are scheduled on a pool of physical processors; then, user-level threads are spawned to execute tasks from a shared queue. The forking and joining are done in user-level space because user-level threads are much faster than kernel-level threads. Several systems have implemented this model, such as Cray Multitasking [88], Process Control [51], and Minor [76]. All the aforementioned implementations have been targeted mainly at shared-memory and tightly coupled
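The actor semantics summarized in the Actor model bullet above (encapsulated state, asynchronous sends, processing triggered solely by message reception) can be sketched in a few lines. The class names and the round-robin dispatcher below are illustrative stand-ins for a real actor runtime such as SALSA, not part of any of the systems described here:

```python
from collections import deque

class Actor:
    """Minimal actor: private state plus a mailbox; behavior runs
    only in response to received messages."""

    def __init__(self, runtime):
        self.mailbox = deque()
        runtime.actors.append(self)

    def send(self, msg):
        # Asynchronous send: enqueue and return immediately.
        self.mailbox.append(msg)

    def receive(self, msg):
        raise NotImplementedError  # concrete actors define behavior

class Counter(Actor):
    def __init__(self, runtime):
        super().__init__(runtime)
        self.count = 0             # encapsulated state

    def receive(self, msg):
        if msg == "inc":
            self.count += 1        # change state in response to a message
        elif isinstance(msg, tuple) and msg[0] == "forward":
            msg[1].send("inc")     # send a message to another actor

class Runtime:
    """Round-robin dispatcher standing in for a real actor scheduler."""
    def __init__(self):
        self.actors = []

    def run(self):
        while any(a.mailbox for a in self.actors):
            for a in list(self.actors):
                if a.mailbox:
                    a.receive(a.mailbox.popleft())

rt = Runtime()
a, b = Counter(rt), Counter(rt)
a.send("inc")
a.send(("forward", b))   # a asks b to increment itself
rt.run()                 # afterwards both counters read 1
```

Note how the sender never blocks and never reads another actor's state directly; all interaction flows through mailboxes, which is what makes migration of an individual actor transparent to its peers.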
systems. Other effective parallel programming models have been studied, such as divide-and-conquer and branch-and-bound. The Satin [112] system is an example of a hierarchical implementation of the divide-and-conquer paradigm targeted at grid environments.

2.5 Peer-to-Peer Systems and the Emerging Grid

Grid and peer-to-peer systems share a common goal: sharing and harnessing resources across various administrative domains. However, they evolved from different communities and provide different services. Grid systems focus on providing a collaborative platform that interconnects clusters, supercomputers, storage systems, and scientific instruments from trusted communities to serve computationally intensive scientific applications. Grid systems are of moderate size and are centrally or hierarchically administered. Peer-to-peer (P2P) systems support intermittent participation for significantly larger untrusted communities; the most common P2P applications are file sharing and search applications. It has been argued that grid and P2P systems will eventually converge [48, 99]. This convergence will likely happen when participation in grids increases to the scale of P2P systems, when P2P systems provide more sophisticated services, and when the stringent QoS requirements of grid systems are loosened as grids host more popular user applications. In what follows, we give an overview of P2P systems and then survey research efforts that have tried to utilize P2P techniques to serve grid computing.

The peer-to-peer paradigm is a successful model that has proved to achieve scalability in large-scale distributed systems. As opposed to traditional client-server models, every component in a P2P system assumes the same responsibilities, acting as both a client and a server. The P2P approach is intriguing because it has managed to circumvent many problems of the client/server model with very simple protocols. There are two categories of P2P systems based on the way peers are
organized and on the protocol used: unstructured and structured. Unstructured systems impose no structure on the peers; every peer in an unstructured system is randomly connected to a number of other peers (examples include Napster [3], Gnutella [33], and KaZaA [2]). Structured P2P systems adopt a well-determined structure for interconnecting peers. Popular structured systems include Chord [79], Pastry [74], Tapestry [119], and CAN [98].

In a P2P system, peers can be organized in various topologies, which can be classified into centralized, decentralized, and hybrid. Several P2P applications have a centralized component. For instance, Napster, the first file sharing application that popularized the P2P model, has a centralized search architecture, although its file sharing architecture is not centralized. The SETI@Home [13] project has a fully centralized architecture. SETI@Home is a project that harnesses free CPU cycles across the Internet (SETI is an acronym for Search for Extra-Terrestrial Intelligence); its purpose is to analyze radio telescope data for signals from extra-terrestrial intelligence. One advantage of the centralized topology is the high performance of the search, because all the needed information is stored in one central location. However, this centralized architecture creates a bottleneck and cannot scale to a large number of peers. In the decentralized topology, peers have equal responsibilities. Gnutella [33] is among the few pure decentralized systems; it has only an initial centralized bootstrapping mechanism by which new peers learn about existing peers and join the system, while its search protocol is completely decentralized. Freenet is another application with a pure decentralized topology.

Figure 2.2: Sample peer-to-peer topologies: centralized, decentralized, and hybrid topologies.

With decentralization comes the cost of a more complex and more expensive search mechanism. Hybrid approaches emerged with the
goal of addressing the weaknesses of centralized and decentralized topologies while benefiting from their advantages. In a hybrid topology, peers have various responsibilities depending on how important they are in the search process. An example of a hybrid system is KaZaA [2], a hybrid of the centralized Napster and the decentralized Gnutella. It introduced a very powerful concept: super peers. Super peers act as local search hubs; each super peer is responsible for a small portion of the network and acts like a Napster server. These special peers are chosen automatically by the system depending on their performance (storage capacity, CPU speed, network bandwidth, etc.) and their availability. Figure 2.2 shows example centralized, decentralized, and hybrid topologies.

P2P approaches have mainly been used to address resource discovery and presence management in grid systems. Most current resource discovery mechanisms are based on hierarchical or centralized schemes; they also do not address large-scale dynamic environments where nodes join and leave at any time. Existing research efforts have borrowed several P2P dynamic peer management and decentralized protocols to provide more dynamic and scalable resource discovery techniques in grid systems. In [48] and [100], a flat P2P overlay is used to organize the peers. Every virtual organization (VO) has one or more peers, and peers provide information about one or more resources. In [48], different strategies, such as random walk, learning-based, and best-neighbor, are used to forward queries about resource characteristics. A modified version of the Gnutella protocol is used in [100] to route query messages across the overlay of peers. Other projects [75, 83] have adopted the notion of super peers to organize information about grid resources in a hierarchical manner. Structured P2P concepts have also been adopted for resource discovery in grids. An example is the MAAN [26] project, which proposes an extension of the Chord protocol to handle complex search
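The core idea behind Chord-style structured overlays such as the one MAAN extends is to hash both node addresses and resource keys onto the same identifier ring and to assign each key to its successor node on that ring. A minimal sketch follows; the node names, the key string, and the 16-bit ring size are illustrative assumptions, not details of Chord or MAAN:

```python
import hashlib
from bisect import bisect_left

def ring_id(name, m=16):
    """Hash a node address or resource key onto a 2^m identifier ring.
    Chord uses SHA-1 as its base hash; m=16 here is for illustration."""
    digest = hashlib.sha1(name.encode()).hexdigest()
    return int(digest, 16) % (2 ** m)

def successor(key_id, node_ids):
    """A key is owned by the first node whose id is >= the key id,
    wrapping around the ring past the largest node id."""
    node_ids = sorted(node_ids)
    i = bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]   # wrap around the ring

# Hypothetical peers and a hypothetical resource key:
nodes = [ring_id(n) for n in ("peerA:4000", "peerB:4000", "peerC:4000")]
home = successor(ring_id("resource:cpu/linux"), nodes)
```

Because every peer can compute the same mapping locally, a query for a key is routed directly toward its successor instead of being flooded, which is what gives structured overlays their logarithmic lookup cost and makes them attractive for grid resource discovery.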
queries.

2.6 Worldwide Computing

Varela et al. [10] introduced the notion and vision of the World-Wide Computer (WWC), which aims at turning the widespread resources of the Web into a virtual mega-computer with a unified, dependable, and distributed infrastructure. The WWC provides naming, mobility, and coordination constructs to facilitate building widely distributed computing systems over the Internet. The architecture of the WWC consists of three software components: 1) SALSA, a programming language for application development; 2) a distributed runtime environment with support for naming and message sending; and 3) a middleware layer with a set of services for reconfiguration, resource monitoring, and load balancing. Figure 2.3 shows the layered architecture of the World-Wide Computer.

The WWC views all software components as a collection of actors. The Actor model has been fundamental to the design and implementation of the WWC architecture. We discuss in what follows concepts related to the Actor model of computation and the SALSA programming language.

2.6.1 The Actor Model

The concept of actors was first introduced by Carl Hewitt at MIT in the 1970s to refer to autonomous reasoning agents [54]. The concept evolved further with the work of Agha and others to refer to a formal model of concurrent computation [9]. This model contrasts with (and complements) the object-oriented model by emphasizing concurrency and communication between the different components.

Figure 2.3: A Model for Worldwide Computing. Applications run on a virtual network (a middleware infrastructure) which maps actors to locations in the physical layer (the hardware infrastructure).

Actors are inherently concurrent and distributed objects that communicate with each other via asynchronous message passing. An actor is an object because it encapsulates a state and a set of methods. It is autonomous because it is controlled by a