
>> Sumit Gulwani: It is my great pleasure to welcome Swarat Chaudhuri, who is an assistant professor at Penn State University. Swarat graduated in 2007 from UPENN, and he won the ACM SIGPLAN Doctoral Dissertation Award. He has a wide variety of interests: he works in program verification, parallel programming, logic and automata theory. And today he's going to tell us about parallel programming.

>> Swarat Chaudhuri: Thank you, Sumit. It's a great pleasure to be here. So today I will tell you about this parallel programming language and programming model that I have been working on for almost a year now.

So the focus of this model, which is called Chorus, is data parallelism. So in the space of data parallelism, the main success stories have been either when the granularity of parallelism is coarse-grained -- an example is MapReduce -- or when we are working with dense data structures with highly fine-grained computations.

So an example of that is these numeric computations that are common in scientific computing on dense arrays. And in addition, there are some problem-specific methods. For example, people have worked a lot on parallelizing model simulation and other certain important problems like that.

So here our goal is not to address just one or two specific applications, but a range of data parallel computations over large unstructured shared-memory graphs. And here the granularity of parallelism is not known in advance. In fact, it's not predictable by static analysis. So in the average case there is a lot of data parallelism in these problems. But in the worst case there is no parallelism at all.

And as we will see, there are lots and lots of applications which fall into this category. And our goal, when we offer this model, will be high-level correctness as well as efficiency. So, for example, we would aspire to achieve race freedom at the language level, or deadlock freedom.

Also we would like to express the essence of this kind of parallelism so our compiler and runtime system can take advantage of that.

Now, a classic example of an application of this sort is Delaunay mesh refinement. So here the problem instance is a triangulation of a set of points. You're given a set of points and you're building triangles with these points as nodes.

Now what you want to achieve are these two properties. So the first property is known as the Delaunay property, which is that no point is contained within the circumcircle of a triangle. As you see, that's the case here.

But the second property is what's known as the quality property, which is that all the triangles that you get in this mesh satisfy a certain goodness constraint -- in this case, that there is no triangle with an angle greater than 120 degrees.

So typically you get meshes where this quality constraint may not be satisfied. So what you want to do is refine the mesh: you want to add new points to it and retriangulate it so that there are no bad triangles left in the mesh. So this is the problem of Delaunay mesh refinement.

Now, let's look at this a little bit more closely. So one classic algorithm for this problem works as follows: you collect, in a data structure known as a cavity, all triangles whose circumcircle contains a new point -- you've added a new point to refine the mesh. Now what you do is you collect all these triangles, and as you're seeing here in this second picture I have colored them in a different color, and then you retriangulate them, like I've done in the picture on the right.

So actually it so happens that even after you do this, some new triangles may again violate this quality constraint, so you have to continue. But there is a guarantee that this algorithm will terminate. That requires some reasoning.

So let's look at this algorithm in pseudocode. So I have a mesh, and then I have a worklist in which I'm going to collect all these bad triangles. And then I'm going to pick a triangle from this worklist. I'm going to create a cavity. I'm going to expand the cavity. Then I'm going to retriangulate it.

And, finally, I'm going to update the mesh, and I'm going to continue until there's nothing left in the worklist.
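A minimal sketch of the sequential worklist loop just described. This is not code from the talk: the geometric details (cavity construction, retriangulation) are elided, and `refine` is a hypothetical stand-in for "fix one bad triangle and return any newly created bad triangles".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.IntFunction;

public class WorklistSketch {
    // Process bad elements until the worklist is empty; fixing one element
    // may enqueue newly created bad elements. Returns the number of steps.
    static int run(Deque<Integer> worklist, IntFunction<List<Integer>> refine) {
        int steps = 0;
        while (!worklist.isEmpty()) {
            int bad = worklist.pop();            // pick a bad triangle
            worklist.addAll(refine.apply(bad));  // retriangulate; may create new bad ones
            steps++;
        }
        return steps;  // Delaunay refinement guarantees this terminates
    }

    // Toy stand-in for refinement: element n produces one new bad element n-1.
    static List<Integer> toyRefine(int n) {
        return n > 1 ? List.of(n - 1) : List.of();
    }
}
```

The real algorithm's termination argument is what makes this loop well-defined; the toy `refine` simply produces a strictly smaller element so the loop visibly drains.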

So observation number one -- these cavities we have here are contiguous regions within the mesh. But the second observation, which is not obvious: in the worst case the cavities can encompass the whole mesh; you can create such an instance.

Now, what do we do if we want to parallelize this algorithm? So the first observation is that here we are working on a complex unstructured graph, which is a mesh. And we do not really have properties that we can necessarily use for parallelism. For example, the size of a cavity, as we have noted, can encompass the entire mesh.

So what we will do here is view this problem more abstractly as a computation over a graph. And in this case we can view the nodes in this graph as triangles, and the edges as representing adjacency between triangles.

And so by definition, a cavity corresponds to a contiguous region -- a neighborhood -- in this graph, and we are going to use this observation later. But let's continue.

So now, what's the property we want? We want cavities to be retriangulated atomically. And non-overlapping cavities may be processed in parallel. However, this question of what can be done in parallel seems impossible to handle with static analysis. The reason is that the shape of the data structure changes greatly over time. Moreover, it's extremely input-dependent: it's easy to create an instance where you get no parallelism at all. There's been a lot of work on this lately by [inaudible] and his group at UT Austin.

So just to note that this is not just one isolated application; there are lots and lots of similar problems. So in the space of meshing, there is mesh refinement, there's triangulation; problems in clustering, ray tracing, maintenance of social networks; graph problems such as spanning tree and flow computation; [inaudible] simulation; sparse problems in scientific computing, such as sparse matrix-vector multiplication; and even some program analysis problems. So I added this for effect, because I primarily work on program analysis -- iterative data flow analysis and [inaudible]. And these represent a class of applications which is very large, but where the traditional static techniques of extracting parallelism don't really work. So what can we do about it?

So what I'm going to do here is that first I'm going to present our programming model more abstractly and show how it can be used to encode the application I showed, Delaunay mesh refinement. Then I'll present a more concrete language, JChorus, an embedding of the model on top of Java, and I'll talk about implementation. And finally I'm going to talk about some ongoing work on embedding this model into Habanero Java, which is an offshoot of IBM's X10 language.

>>: With these kinds of applications, [inaudible] permutations, how are they performed?

>> Swarat Chaudhuri: So as you will see, our implementation has some stuff related to [inaudible], so we are using some of the ideas there. But I do not think that that alone will suffice. We'll come to that. Maybe at the end we can discuss this in more detail.

So one of the interesting observations here is that in all these applications there is a common property of locality, which is that even though you are doing computations that in the worst case can require global access to a gigantic data structure, in the typical scenario that is not at all so. So, for example, we ran some experiments on a mesh of about 100,000 triangles from these benchmarks from [inaudible], and we found that the average size of a cavity is actually only 3.75 triangles, and the maximum cavity size was 12 triangles. So in that case there is a lot of parallelism in this problem, and the essence of that is perhaps this average-case locality.

But, again, there is no way to know ahead of time what the extent of locality is for a particular application. So the main idea of Chorus is that since we are working on these local regions in the data structure, why not let these regions be the drivers of the parallel computation -- just as objects are the drivers of computation in object-oriented programming, or concurrent objects are the drivers of computation in the actor model. We are going to make this more clear soon.

So let us first phrase the problem a little bit more abstractly. So I'm going to view these problems from now on as problems of parallel computation on graphs. So here I have a giant graph, and this represents the heap.

And the objects here correspond to the nodes. The edges here correspond to the pointers. And I'm going to refer to an induced subgraph of this heap as a region.

And so an assembly, in our terminology, is a region in this heap that is equipped with a thread of control. So the typical situation would be that these assemblies are short-lived and speculative. And we are going to see what this means in the context of some applications.

So one point I would like to make here is that this might remind some of you of the actor model of computation. As we'll see, there are some differences. For one, here we have this property that all the regions that are in the various assemblies in the heap at one time form a partition of the heap, and they're all isolated. Objects in them are completely isolated.

And second is that there is no a priori bound on how big a region might be. It might consist of one object. It might consist of the entire heap.

Okay. So in a situation like ours we need to talk about not just static data partitions, but partitions that actually change over time, because of the very dynamic, adaptive nature of these applications.

So our primitive for synchronization -- in fact, this is the only primitive for synchronization here -- is merging of assemblies. Here we have assembly I and assembly J, and there's an edge from I to J. And now I can merge with J, and the result is the bigger assembly we see at the bottom.

This assembly J, which previously had ownership of that region, now dies. So in order to rule out races, we require that J must be in a so-called ready state -- that is, it should not be in the middle of doing some sort of imperative update while this merge happens. I'll clarify what that means.

So then merging is a way in which we coarsen the granularity of the parallelism in the heap. How do we refine the granularity of parallelism? Well, it's splitting. What we have here is assembly I. Now I has split into these assemblies I1 through I6. And then there's this tau.

So let me actually get to that in a little bit. For now let's just get this picture: assembly I has split into these constituent assemblies. Now one of the things to note here is that this J is completely oblivious to what has gone on inside I here. So I can split; J doesn't have to know. In other words, the splitting operation is not actually a synchronization construct.

Whereas, in the previous case, for merging, there is actually synchronization involved, because we are saying that J must be in a ready state when the merge happens.
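A toy model of the two operations just contrasted (illustrative code, not JChorus): merging transfers the other assembly's objects into this one and kills it, while splitting partitions an assembly into singletons with no synchronization at all.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RegionSketch {
    Set<Integer> objects = new HashSet<>();
    boolean alive = true;

    RegionSketch(Integer... os) { objects.addAll(List.of(os)); }

    // Merge: absorb `other`; the assembly on the receiving end dies.
    void merge(RegionSketch other) {
        objects.addAll(other.objects);
        other.objects.clear();
        other.alive = false;
    }

    // Split: this assembly dies and each owned object becomes its own assembly.
    List<RegionSketch> split() {
        List<RegionSketch> parts = new ArrayList<>();
        for (int o : objects) parts.add(new RegionSketch(o));
        objects.clear();
        alive = false;
        return parts;
    }
}
```

Note that `split` touches no other assembly, which is why it needs no synchronization, whereas `merge` mutates its argument, which is why the real model requires the victim to be in a ready state first.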

Okay. And finally, an assembly, having an imperative computation associated with it, can actually modify its heap as well as its local variables. The restriction, though, is that objects inside an assembly are isolated, which means that you're not allowed to follow a pointer out of the assembly, read some data, and do some modification of it. That's not permitted. Everything else is permitted.
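A sketch of the isolation rule just stated: an update may only touch objects owned by its assembly, and anything else is an error. This is an illustrative model, not the actual JChorus runtime check.

```java
import java.util.HashSet;
import java.util.Set;

public class OwnershipSketch {
    private final Set<Object> owned = new HashSet<>();

    void own(Object o) { owned.add(o); }

    // Every read or write inside an update goes through this check.
    <T> T access(T o) {
        if (!owned.contains(o))
            throw new IllegalStateException("object not owned by this assembly");
        return o;
    }
}
```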

So okay, that sounds nice. But what do programs here look like? So what we do here is that we just generalize -- we just seek inspiration from object-oriented programming, and we define the notion of an assembly class. An assembly class consists of a set of local variables, a set of guarded updates -- so the computations here are in the guarded-command style -- and there's a constructor which creates an instance of an assembly from an assembly class. And then there are some things called public variables, which I'm going to explain soon.

And a program is simply a set of classes. And each assembly we are going to view as this sort of very simple state machine. So each assembly can be in a busy state, which means that it's now executing an update. So remember that we have these guarded updates, so the pattern is: evaluate the guard, execute the update, and then come back and do this again and again, right?

So this is the state when an assembly is executing an update. And this ready state, which is where the assembly starts -- this is the default state, and it goes back to it after finishing an update -- is a state where an assembly can actually be preempted by another assembly.

So an assembly can now be merged by a neighboring assembly and die as a result. When an assembly dies, it just goes to this terminated state. So let's now look at the structure of computations here a little bit more closely. So, first of all, in these guarded updates there's a guard and an update.

And the rule is that the guard is executed atomically. After that, the update is only going to refer to objects that are owned by the assembly. Therefore, there is no further need to try to acquire locks on individual objects. So that's the idea. So the guard is executed atomically; the update refers to objects owned by the assembly. Now, the construct for merge is going to appear within guards. So this is a slight difference from standard guarded-command languages: our guards can actually have these effects of merging two assemblies. However, they cannot modify the heap. So they cannot have effects that are imperative in that sense -- they cannot touch the heap. However, they can change the structure of concurrency in the heap.
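The execution pattern described above can be sketched as a tiny state machine (names are illustrative, not the JChorus API): an assembly sits in READY, atomically evaluates a guard, and only if the guard holds does it run the update in BUSY, during which it cannot be preempted; a neighbor may absorb it only while it is READY.

```java
import java.util.function.BooleanSupplier;

public class GuardedStep {
    enum State { READY, BUSY, TERMINATED }
    private State state = State.READY;

    // A neighbour may absorb (merge away) this assembly only while READY.
    boolean tryAbsorb() {
        if (state != State.READY) return false;
        state = State.TERMINATED;
        return true;
    }

    // Evaluate the guard atomically; on success run the update in BUSY,
    // then return to READY. A terminated assembly does nothing.
    boolean step(BooleanSupplier guard, Runnable update) {
        if (state != State.READY || !guard.getAsBoolean()) return false;
        state = State.BUSY;
        update.run();        // touches only owned objects; no locks needed
        state = State.READY;
        return true;
    }
}
```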

So here at the top I have this guarded command, merge U.F, followed by S. So this says: try to do this merge. So I, when it executes this, tries to do this merge along U.F -- so U is that object at the top, and U.F is the reference which points outside of I. And it says: try to merge with this. If this merge goes through, in that case the guard evaluates to true, and in that case you execute the command S on the new region that you acquire as a result.

>>: For verification, U and F -- I think of those as two arguments you're passing, both the U and the U.F.

>> Swarat Chaudhuri: You should see U.F as -- as in Java, it's a reference.

>>: But in that case how do you know what you're merging to? What's the capital S?

>> Swarat Chaudhuri: S is the statement that you're going to execute after that.

>>: You're saying that the parameter to merge, there's just one parameter.

>> Swarat Chaudhuri: Right.

>>: It's going to be whatever the [inaudible] points to, but how do you know what you're going to join that to?

>> Swarat Chaudhuri: You have no control over that, aside from one thing that I've not shown here: you can specify the type of assembly you're going to merge with. But aside from that there is no control over who you will merge with.

>>: In particular, it might not merge with the assembly that contains U.

>> Swarat Chaudhuri: No, U is what I contains. U is within I; I is trying to execute this. So U.F is the reference that points outside of I.

>>: So I refers to the whole assembly.

>> Swarat Chaudhuri: Yeah, yeah. Sorry. It may look like it's just that one object. No, it's the whole assembly. And then there's a slight refinement of that: you can also guard this with an additional guard G, which is a constraint that's evaluated on this assembly I. And when this property G is true, at that point you try to execute this merge; if it goes through, then you execute S. And for guards, that's really what we have. So we do not really need anything further. And the split -- as for splits, what we have is just a command that's executed within that S thing that I showed: there is a guard and there is an update, and the split is executed within that update. When it splits, it produces something like this, and the syntax is like this: you can split over tau, where tau is the class of the assemblies being produced as a result of the split. This P1 through PN are the parameters that you pass to the constructors of these new assemblies, and that's the way you pass local state from the parent assembly to the children assemblies.

>>: If you can imagine that a split fails --

>> Swarat Chaudhuri: A split cannot fail, because it doesn't depend on anybody else, right? It's a completely local operation. Whereas merge is a synchronization construct, and as a result it might potentially fail. And finally, local updates are as I said before. So this is just like in Java, except, again, if you try to access something that's outside your assembly, then you are going to get an exception.

And there are some refinements of that. So one version of merge, for example, is that you can create a new assembly of a new type when you perform the merge -- because remember that previously, when I merged with J, it was just I. So now you can also create a new thing in the process.

And likewise there is a refinement of split. Earlier we had to split into all these individual assemblies that contained only one object each. However, you can also basically release just one of the objects in the assembly and create one assembly out of that. And that would be a split-one.

But the most interesting constructs are really the ones I showed earlier. Okay. So enough of this sort of abstract discussion of the language. Let's look at an application. So here let's say we try to implement this Delaunay mesh refinement application in this language. How will we do this? We will use two kinds of assembly classes. One will be these triangles and another will be these cavities. And what does each triangle do? Each triangle determines if it's bad or not. If it is bad, then it merges with a neighboring triangle to form a cavity. Okay. What about a cavity? A cavity can determine whether it has all the triangles needed to finish the retriangulation. Right? And that's a local operation.

So if it finds that it needs more, then it tries to merge with a neighbor. If it finds that it has enough, then it's going to just retriangulate locally and then split back into these triangles, because now you're sort of back in the default state -- all these triangles are back -- and this process is going to continue until there is no further activity in this solution, if you will: a solution of triangles which are just merging and splitting with each other.

So the code then looks like this: I have one assembly class called triangle, and the action within is: merge with V.F -- let's say V.F corresponds to a neighbor -- and become a cavity, when this property is true, meaning this triangle is bad. That's the only thing that a triangle does. Otherwise it just stays inactive.

What about a cavity? The cavity determines if it's not complete; if it's not complete, then it's going to merge with -- again, let's say this points to an arbitrary neighbor -- and it's going to form a bigger cavity.

And if it's complete, then it's going to retriangulate and then it's going to split into the individual triangles that form it.

So what happens then if there is a conflict? Suppose you're trying to grow a cavity -- you're trying to merge with more and more triangles -- and in parallel there's another cavity being formed, and these two things collide. What happens then? Remember what we said: when one assembly merges with another, the assembly that is on the receiving end of the merge dies as a result.

So what that means here is that there is one cavity that's being formed and another cavity that's being formed, and then one cavity gets absorbed by the other. So what this will lead to is that all the work that one of the cavities did up to this point will have to go. Right? In a sense, the computation that you're trying to achieve is rolled back.

However, there are some subtleties here. So in some sense the work that you set out to do will not be finished; you're back at the original state. However, at the end of all this, the cavity that is the killer, if you will -- the cavity that does the absorption -- will at some point have done its job, and then it's going to split again.

Then the original thing can start all over again from scratch. Okay. Are there questions? Yes?

>>: Every time [inaudible] -- are there cases when you merge two cavities?

>> Swarat Chaudhuri: So I'm going to address the [inaudible] question. No, you cannot merge with two cavities at one time. You're going to merge with one assembly, which is going to be your neighbor. But when you do the merge, then the cavity that's on the receiving end of it, that has to go. And as a result, all the work that it has done until this point is basically rendered worthless, because it's not able to finish what it started.

However, the point is that eventually this cavity that killed the other cavity will have finished its work. And then it's going to split off again. And then, presumably, if that work still needs to be done, that work will again start from the beginning, and it's going to reach its final point if it's not interrupted by anybody else, and then it's going to do the retriangulation and split back, and then it will all be over.

So let's now look at some of the other approaches. I'm going to give a few other applications that we can encode in this style later, but let's look at some competing approaches. So the first question is, of course: what about threads and explicit locking? So here perhaps the main objection is that the heap in shared-memory languages is a completely global entity. And furthermore you have arbitrary aliasing, and as a result, if you are trying to ensure that there are not multiple cavities trying to work on the same object at the same time, that's very hard to do. And it's also low level and error prone.

What about software transactions? So here what happens is that this burden of reasoning is really passed on to the transaction manager. And in most implementations of software transactions, the conflicts between two activities that are happening in parallel but happen to conflict on some data item are detected by monitoring reads and writes to memory.

And as a result, this either leads to you doing some sort of conservative analysis and getting conservative results, or it is very expensive, because dynamically you have to really search the entire memory and try to see if there is a conflict or not. So what about static data partitioning, as there is, for example, in languages like X10? But here the problem is the unpredictable nature of the partition: you may have started with a data partition that's very nice, but after a while, after all these changes, it may quickly become bad. So you need some sort of an adaptive mechanism, and really what our constructs for merges and splits offer is a way to implement this adaptive notion of isolation using a few simple constructs.

So finally, what about actors? So the actor model is perhaps the most well known and the oldest type of data oriented parallel programming. So if you were going to model this problem of mesh refinement using actors, what you would do is give each actor a set of cavities. Right? And then what would expansion mean? It would mean that one actor passes around some triangles to another actor. Now there are two ways of doing this. One is that you would copy the triangles and send them to the other actor. But this presumably would be too expensive. The other option is to just pass around references, but on the other hand this introduces problems like aliasing and races. Note, on the other hand, that in our model there is no aliasing of objects. As a result of merges and splits, objects get transferred from one entity to another.

An entity does not have aliases to objects that are owned by other entities. So that's an important point. So finally, if I'm talking about irregular applications, I have to talk about Galois, which is [inaudible]'s system for irregular data parallel applications.

So in this system what they do is annotate data structures with information about commutativity and associativity, and have iterators not just over lists but over sets and posets. It's a different style of programming; we see ours as an alternative that's trying to address the same problem.

Okay. So let's now look at a few other applications. So another application that has been looked at a lot in the setting of this sort of parallelism is Boruvka's algorithm for minimum spanning trees. So the parallel version of this is as follows: you have this graph, and in the beginning you start with all these small spanning trees, which correspond to just one or a few nodes each. And then these small spanning trees merge to form bigger and bigger spanning trees, until finally you get the spanning tree that covers the entire graph. So it's pretty obvious how to model this in our language: you simply model these spanning trees as assemblies, and merging is just a direct construct in this system.
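A minimal sequential sketch of Boruvka's algorithm, for reference: components (here, union-find sets) repeatedly "merge" along their cheapest outgoing edge, mirroring the assembly merges described above. Edges are `{u, v, weight}` triples; this is textbook code, not the talk's parallel encoding.

```java
import java.util.Arrays;

public class BoruvkaSketch {
    static int[] parent;
    static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }

    // Returns the total weight of the MST of a connected graph with n nodes.
    // Assumes distinct edge weights, which makes the cheapest-edge choice safe.
    static long mst(int n, int[][] edges) {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        long total = 0;
        int components = n;
        while (components > 1) {
            int[] cheapest = new int[n];            // cheapest outgoing edge per component
            Arrays.fill(cheapest, -1);
            for (int i = 0; i < edges.length; i++) {
                int ru = find(edges[i][0]), rv = find(edges[i][1]);
                if (ru == rv) continue;
                if (cheapest[ru] == -1 || edges[i][2] < edges[cheapest[ru]][2]) cheapest[ru] = i;
                if (cheapest[rv] == -1 || edges[i][2] < edges[cheapest[rv]][2]) cheapest[rv] = i;
            }
            boolean merged = false;
            for (int c = 0; c < n; c++) {
                if (cheapest[c] == -1) continue;
                int[] e = edges[cheapest[c]];
                int ru = find(e[0]), rv = find(e[1]);
                if (ru != rv) {                      // "merge" two spanning trees
                    parent[ru] = rv;
                    total += e[2];
                    components--;
                    merged = true;
                }
            }
            if (!merged) break;                      // disconnected graph: stop
        }
        return total;
    }
}
```

In the Chorus encoding, each union here would instead be an assembly-level merge, so the phases can proceed concurrently wherever the chosen edges don't collide.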

Another application, from the [inaudible] benchmarks, is that of focused communities. So the goal is: you have a giant social network where there are updates being made continuously. And what you want to do is maintain communities in this network, where a community is defined as a subgraph such that the measure of closeness within the subgraph is bigger than the measure of closeness between any node in the subgraph and anything outside.

So basically you're in a community if you're more tightly connected with people in the community than with people outside the community. So, of course, this graph is being updated constantly, and as a result the structure of these communities changes all the time.

And now a sequential algorithm for this is as follows. Suppose you want to determine the community of one person. So you do this sort of greedy fixed-point computation: you keep on adding nodes if you find that the addition of a node increases some objective function, and otherwise you also shed some nodes.

So again it's pretty obvious how to do this in our setting. You use an assembly to represent such a community, and then the shedding and growing -- the shrinking and growing -- of communities is captured using the merge and split primitives.
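The greedy fixed-point loop just described can be sketched as follows. The graph and the objective function here are toy stand-ins (the real objective compares closeness inside versus outside the community); only the growing direction is shown, not the shedding.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CommunitySketch {
    // Starting from a seed node, keep adding a neighbouring node as long as
    // doing so strictly improves the objective score.
    static Set<Integer> grow(Map<Integer, List<Integer>> adj, int seed,
                             Function<Set<Integer>, Integer> score) {
        Set<Integer> community = new HashSet<>(Set.of(seed));
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int member : List.copyOf(community)) {
                for (int nbr : adj.getOrDefault(member, List.of())) {
                    if (community.contains(nbr)) continue;
                    Set<Integer> candidate = new HashSet<>(community);
                    candidate.add(nbr);
                    if (score.apply(candidate) > score.apply(community)) {
                        community = candidate;   // adding nbr improves the objective
                        changed = true;
                    }
                }
            }
        }
        return community;  // fixed point: no neighbour improves the score
    }
}
```

In the Chorus version, each accepted addition would be an assembly merge and each shed node a split-off, so neighboring communities grow and shrink concurrently.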

So a few notes. First of all, note that in the worst case there's no parallelism here at all: the entire heap merges into one neighborhood. Second, note that the merges and the splits here are unordered. So, for example, there is no real ordering between a merge that happens between, say, this and some assembly here, and the split of that assembly. So it's a completely true-concurrency semantics.

So what about data races? So there are no data races, because each assembly only modifies locally isolated objects, and merges happen only when the recipient of the merge is not in the middle of an update. And of course the definition of data-race freedom is that multiple imperative computations do not touch the same object at the same time, whether for reads or writes.

So what about deadlocks? So imagine a situation like this, where there are two assemblies, and this assembly is trying to merge with that assembly, and that assembly is trying to merge with this assembly. So in other words, I is waiting for J and J is waiting for I. So this is perhaps the closest to what we can define as deadlock in a situation like this.

But note that merges here are unordered. So, for example, if you have a situation like this, then the runtime system can really execute one of the two merges, and when that happens the other assembly is dead. There is no further request and therefore there is no further waiting. Progress happens, because a bigger assembly is formed, and that goes on to do whatever it needs to do.

So the key to this is the preemptive nature of assemblies: when an assembly is in the ready state, it can be killed by anybody. There is no protection whatsoever.

And furthermore, at the end of every update you have to come back to this top-level ready state. So, assuming updates terminate, you are going to be able to resolve such a situation eventually.

>>: So when I wants to merge, it says to the system: I want to merge with the assembly that owns U.F, the object U.F?

>> Swarat Chaudhuri: Absolutely, yes.

>>: So in the process, if I is splitting, it could actually merge with some assembly, some child of I?

>> Swarat Chaudhuri: Absolutely.

>>: J cannot name I specifically.

>> Swarat Chaudhuri: Absolutely. That's key to the race freedom and deadlock freedom, yes. Of course, I said all this, but this assumes the existence of some kind of omniscient runtime system which is going to look at all of these requests and just order them and resolve them somehow.

But if you are going to write an implementation of this system on, say, a multi-core machine, then there's not going to be just one scheduler that is centralized somewhere. So we'll have to talk about a more distributed runtime system, and I'm going to talk about that in a little bit. So then we'll have to reason about deadlock at a slightly lower level, and we will do so.

Okay. But before I go there, let me just present the language for which that runtime system was built. So this language is known as JChorus. This is simply Chorus on top of sequential Java. So in fact these abstractions of assembly classes can very naturally be integrated with an object-oriented language, because we already have notions of object classes. So what we do is that, in addition to those, we also now have these classes of assemblies.

And we have method calls from within the assembly class bodies. We saw those guarded commands, right? So inside the updates we can now make method calls, and these calls can be on any of the object classes defined in the sequential part of the program.

And there are a few extra features that I did not show in the previous core version of the model. So we now have objects which are read-only or mutable, and this is specified as part of the types. So, for example, a read-only data item can be shared ubiquitously by assemblies all over the system. And mutable objects are like what I explained earlier.

Okay. So in that case the code for mesh refinement would look something like this. I'm not going to give you the details, but let's just look at this little piece here. So I have the declaration of the cavity assembly, and inside that I have this declaration of actions, and inside the actions I have calls to these methods, which are just Java methods.

And similarly we have these other assemblies. And in fact we found that when we took this mesh refinement application and tried to write it in Chorus, most of the code was actually just sequential Java code. We only needed to add about 15 lines of code which corresponded to Chorus. So we really could just take the sequential version of the program and then add these 15 lines of code on top of that, and we got a Chorus version of the program, which I thought was quite nice.

So this is the language aspect of the problem. What about the system aspect of the problem? That's what I'm going to discuss now. So one thing -- one problem that you hit as soon as you try to implement something like this is that the parallelism that you express in the language is potentially huge.

So in any realistic system, which perhaps has eight cores or 16 cores, you would have to have some mechanism for mapping these hundreds of thousands of possible assemblies to the cores in the system.

So how to manage this mapping is a tricky question. And the questions that arise here are: what are the right mapping strategies, what are the right scheduling strategies -- because a core now contains maybe tens of thousands of these assemblies, and you have to execute all of them in some order.

So that raises the question: how do we schedule things? How do we balance loads, right? This graph that we have here, that's a completely unstructured thing. So it's conceivable that after a while some of the cores will be doing all the work. So we need some sort of load balancing strategy. And finally, what are the right data structures for maintaining the relationships, say, between the objects and their assemblies? How do we determine which assembly an object belongs to?

So all of these questions need to be addressed, and this also needs to be hidden from the programmer and shoved into the body of the runtime system. So designing this runtime system is quite a challenging task.

So the main lower level abstraction that we have is known as a division. A division is a set of assemblies that's mapped to a core. So here I see these two divisions. This division contains these assemblies, and likewise. So now note that some of the merges are now actually local within divisions, because a division now contains many assemblies, right?

So if one assembly within a division wants to merge with an assembly within the same division, that's like a local operation. It's not really parallelism. However, there are these remote merges, which is what if this assembly wants to merge with this assembly. In that case there needs to be an actual request sent from one division to another, and that would require some real synchronization.
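The local-versus-remote distinction might be modeled like this. This is an assumed sketch of the design, not the actual runtime code: a merge between assemblies in the same division is a cheap local operation, while a cross-division merge is deferred as a queued request.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Each assembly records which division it currently lives in.
class Assembly {
    Division home;
    Assembly(Division d) { home = d; }
}

// A division owns a set of assemblies mapped to one core.
class Division {
    final Queue<Assembly[]> remoteRequests = new ArrayDeque<>();
    int localMerges = 0;

    // Returns true if the merge could be done locally, false if it was
    // deferred as a remote request (needing cross-division synchronization).
    boolean merge(Assembly a, Assembly b) {
        if (a.home == b.home) {
            localMerges++;                             // cheap: no synchronization
            return true;
        }
        remoteRequests.add(new Assembly[]{a, b});      // batched for later
        return false;
    }
}
```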

So this is the high level structure of the problem. Now, what we want to do is we want to minimize the amount of synchronization that happens across divisions. We want to make most of these merges local. So how do we do that?

Okay. So the main strategies here are, first, that the divisions are adaptive, which is our heuristic for reducing the number of remote merges. So the idea here is as follows: when, say, this assembly tries to merge with this assembly, you are going to migrate this assembly to this division. Right? And along with that you're also going to migrate some of the nearby assemblies to that division.

The result of that is that since there is this locality property in these applications, typically if you want to merge with an assembly you are perhaps going to want to merge with some nearby assemblies as well, if at all.

So that's why this heuristic of migrating some chunk, some local collection of assemblies from one division to another, actually turns out to help quite a bit. The second thing is the heuristic for load balancing, which is done by a simple modification of this migration strategy: when you are doing the migration, you can also do some computation and make sure that the migration is happening from the division which has more assemblies to the division which has fewer assemblies. So that way you're just making sure that assemblies are always migrated from the core with more assemblies to the one with fewer assemblies.
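A minimal sketch of that direction rule, heavily simplified from the real runtime: the names, and the modeling of "nearby assemblies" as the front of a list, are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: on a remote merge, migrate the requested assembly plus a few
// neighbors, always from the larger division to the smaller one, so future
// merges become local and load stays roughly balanced.
class Migration {
    // convenience factory so a division is just a named list of assemblies
    static List<String> division(String... assemblies) {
        return new ArrayList<>(List.of(assemblies));
    }

    // move up to `chunk` assemblies (the requested one and its "neighbors",
    // modeled here as the front of the list) from `from` to `to`
    static void migrate(List<String> from, List<String> to, int chunk) {
        for (int i = 0; i < chunk && !from.isEmpty(); i++) {
            to.add(from.remove(0));
        }
    }

    // pick the direction: the division with fewer assemblies receives
    static void balanceMerge(List<String> d1, List<String> d2, int chunk) {
        if (d1.size() >= d2.size()) migrate(d1, d2, chunk);
        else migrate(d2, d1, chunk);
    }
}
```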

>>: So how does that handle the case when you have an assembly that has to merge through the whole space?

>> Swarat Chaudhuri: That would require a series of merges, right? So that gives

--

>>: At one point where it's merging from a place that's smaller than

--

>> Swarat Chaudhuri: Right. So of course these are just heuristics; you cannot just apply them completely blindly. You have to also put in other heuristics.

>>: A decision

--

>> Swarat Chaudhuri: So in all these applications that we considered it's not really that

--

with the exception of the spanning tree algorithm, it doesn't really happen that you need to merge with everything to form a big heap. If it turns out that you are forming bigger and bigger assemblies, and one division is going to have lots of stuff and this cannot be avoided, then what can you do? You'll just have to deal with that.

And the third heuristic is to use the union-find structure. The reason for that is that these assemblies here are disjoint sets, right? So in order to maintain the relationship between objects and the assemblies they belong to, we can use a union-find data structure. So basically, when there is a merge, what we do is that we use the union part of union-find. And when we want to find out what assembly an object belongs to, we use the find part of that.
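The union-find structure itself is standard; a compact version with path compression and union by rank might look like this (a textbook sketch, not the runtime's actual code). Here each object is an integer id, and `find` answers "which assembly does this object belong to?" while `union` implements a merge.

```java
// Minimal union-find over n objects: merge = union, ownership query = find.
class UnionFind {
    private final int[] parent;
    private final int[] rank;

    UnionFind(int n) {
        parent = new int[n];
        rank = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;  // each object starts alone
    }

    // find with path compression: the representative is the assembly's id
    int find(int x) {
        if (parent[x] != x) parent[x] = find(parent[x]);
        return parent[x];
    }

    // union by rank: executed when two assemblies merge
    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;                         // already one assembly
        if (rank[ra] < rank[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;
        if (rank[ra] == rank[rb]) rank[ra]++;
    }
}
```

Both operations are near-constant amortized time, which is why this mapping is cheap enough to consult on every ownership check.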

This also seems to give quite a bit of benefit. And finally, we have this token passing strategy to prevent deadlocks in this system. So as I mentioned earlier, there is no runtime system component that's completely centralized, right, that's just going to resolve all these different merge requests that are being made simultaneously.

So what we do is we have this token passing strategy, which is used in many other distributed programs as well, to handle the remote merges. The local merges you can do at any time, at will. But for a remote merge, you need access to the global token to do that. Therefore it introduces serialization, and that prevents deadlocks.

There's a little bit more subtlety to the deadlock freedom argument; I will skip that, but I can answer questions about it if there are any.

>>: [inaudible].

>> Swarat Chaudhuri: Right.

>>: [inaudible].

>> Swarat Chaudhuri: So what happens is that you batch all the remote requests. So, for example, you have a division, right, which is executing all its assemblies following some scheduling policy.

Now you see that there's a remote merge that's been requested in some assembly. So what you do is you batch all these requests into a queue. When the token comes to you, you try to execute all of these things at one time.

And this means, yes, that at one time only one division can do these sorts of remote merges, but since you're batching, and since there are so many assemblies typically per core, it seems to work out reasonably.
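The token-ring batching idea can be sketched as follows. This is a single-threaded model for clarity; the real runtime passes the token among concurrently running divisions, and the class names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch: divisions queue remote merges; only the current token holder
// drains its batch, so remote merges are globally serialized and a circular
// wait (deadlock) cannot arise.
class TokenRing {
    private final List<Queue<Runnable>> batches = new ArrayList<>();

    // register a division; returns its index in the ring
    int addDivision() {
        batches.add(new ArrayDeque<>());
        return batches.size() - 1;
    }

    // a division queues a remote merge instead of performing it immediately
    void request(int division, Runnable remoteMerge) {
        batches.get(division).add(remoteMerge);
    }

    // one trip of the token around the ring; returns how many merges ran
    int executeRound() {
        int executed = 0;
        for (Queue<Runnable> batch : batches) {   // token visits each division
            while (!batch.isEmpty()) {
                batch.poll().run();               // merge runs only with the token
                executed++;
            }
        }
        return executed;
    }
}
```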

>>: Wouldn't that be more efficient to have [inaudible].

>> Swarat Chaudhuri: A what?

>>: [inaudible] a single scheduler?

>> Swarat Chaudhuri: I do not know what that

--

>>: Having single scheduler.

>> Swarat Chaudhuri: But you need to distribute that, right? How are we going to have

--

>>: It's not distributed.

>> Swarat Chaudhuri: Sure, sure. But if you have these multiple cores, so where will this master scheduler run, and how will you find out

--

>>: With a global view of the whole thing, maybe the load balancing would be much more efficient.

>> Swarat Chaudhuri: Sure, but when would the scheduler run?

>>: On the core.

>> Swarat Chaudhuri: That means in order to do the scheduling, you have to communicate from that core to the core which possesses the assemblies. So you will need to basically communicate for every assembly that's going to be executed. And that will require a lot of synchronization.

>>: When you talked about batching with this token: you have two assemblies that are trying to merge at the same point. These are remote merges, they're batched up; now I want to merge with him and he wants to merge with me. How do you prevent a deadly embrace type

--

>> Swarat Chaudhuri: When the token comes to one of the divisions, it's going to try to execute this thing, right. Once that merge is executed, then that other request is voided, because that assembly is dead. It's just not there any longer.

>>: So the other one is

--

>> Swarat Chaudhuri: Yeah, it's just dead. So this is the structure of the system. You take a program and then you pass it to the compiler front end, which creates a Java translation, and finally we have this runtime system, which is layered over the Java virtual machine. The code for this is available. This Web page should be up in a few days; it's not up now, but if you want to look at the code, send me an e-mail and I'll send it to you.

So let's look at some experiments. We encoded the Delaunay mesh refinement problem from the Lonestar benchmarks in this language. Here the dataset had about 100,000 triangles, almost all of them initially bad. We ran with 1 to 8 threads.

So this is what the mesh looked like after a lot of retriangulation. Initially we started with a partitioning of the mesh which was fairly nice; it's this grid, right. And we note that even after many thousands of retriangulations, the mesh that results still kind of looks like a grid, which means that there is a lot of locality in this computation. It's not that the structure completely dissolves after all these operations.

And so that is the property we tried to exploit. Now, as for our numbers: we have a lot of overheads, actually, because of the fact that we are running on top of the Java virtual machine and we are making expensive method calls for pretty much every single thing -- even when we try to check if an assembly contains a certain object or not, we have to make a method call. And it turned out that these method calls eat up a lot of time.

However, I must say that this is DSTM2, which is Maurice Herlihy's software transactional memory library. And we are only beating the sequential version at eight cores, but our overheads are much better than theirs, at least in this application.

And this is perhaps a more accurate measure of performance. This is called self-relative speedup, which is speedup over the sequential version of the same system. So this is an attempt to rule out some of the effects caused by these overheads.

So here we noted that we are pretty much close to DSTM, but we are perhaps still a little better. I would say we pretty much have the same curve as DSTM, and it's interesting why these two are very similar. That's something to observe.

>>: You're relatively flat from 4 to 7 CPUs; did you investigate what's going on there?

>> Swarat Chaudhuri: We tried to investigate, but we did not have any answers. The problem with this Java-based system is that there's no underlying performance model. I'll show you a graph in a little bit that's really mindboggling. We had absolutely no clue why this happened, but the performance started to drop when you moved from 7 to 8 threads. Not in our model -- when you tried to use fine-grained locking. It suddenly collapsed when you went from 7 to 8 threads. And we figured it was because of the garbage collector, but we have no clue what is happening there.

So what about the number of conflicts? We define conflicts in DSTM2 in the usual way: how many transactions are rolled back. We also wrote a fine-grained locking implementation, which you saw in the slides as well. Here a conflict meant you tried to build a cavity but you basically had to stop doing it and get back to the original state. Our fine-grained locks are not blocking locks. Basically you try to acquire a triangle and expand your cavity. If you cannot, you jump ship, go back to the original state and start again.
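That non-blocking acquire-or-restart scheme can be sketched like this, with each triangle's lock modeled as a simple atomic flag. This is an assumption for illustration; the actual fine-grained locking implementation differs.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: try to lock each triangle as the cavity grows; on any failure,
// release everything acquired so far and report a conflict, rather than
// blocking (blocking could deadlock).
class CavityBuilder {
    // each triangle carries a try-lock flag (true = held by some thread)
    static AtomicBoolean[] newTriangles(int n) {
        AtomicBoolean[] a = new AtomicBoolean[n];
        for (int i = 0; i < n; i++) a[i] = new AtomicBoolean(false);
        return a;
    }

    // Returns true if the whole cavity was acquired and processed,
    // false on a conflict (caller restarts from the beginning).
    static boolean tryBuildCavity(AtomicBoolean[] cavity) {
        int got = 0;
        for (; got < cavity.length; got++) {
            if (!cavity[got].compareAndSet(false, true)) {
                // jump ship: release what we hold and count a conflict
                for (int i = 0; i < got; i++) cavity[i].set(false);
                return false;
            }
        }
        // ... retriangulate the cavity here ...
        for (AtomicBoolean t : cavity) t.set(false);
        return true;
    }
}
```

Because a thread never waits while holding locks, a circular wait between two cavity builders cannot form.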

We found the number of conflicts in our system is actually very, very low, which gives us hope that if we had a more efficient implementation, we would be able to get better results.

>>: What about load balancing?

>> Swarat Chaudhuri: No, we do not have real numbers for load balancing, primarily because the load balancing heuristic we gave is, we think, a pretty simple thing that can be improved; we'd like to investigate it.

>>: The balancing [inaudible] try to recraft. The synchronize the wake as a division. Assembly, probably wait until later.

>>: Right.

>> Swarat Chaudhuri: It's more like you are making a request from a division which has lots of stuff to a division which has very little stuff. Now, just the normal migration would bring stuff from the division which has less to the division which has more. But that doesn't seem right. So what we do is we exploit the property that makes migration symmetric, and we basically make the migration in the other direction.

>>: You use the number of assemblies.

>> Swarat Chaudhuri: Right.

>>: But not size.

>> Swarat Chaudhuri: The reason is that the sizes are hard to track. One could imagine you could improve this by having some sort of a heuristic based on the number of objects in the division currently. So as for the [inaudible] algorithm: this was the bizarre graph I was telling you about, which is that in this fine-grained locking implementation, [inaudible] suddenly dipped when we went from seven threads to eight threads, and we looked very carefully to see if there were any bugs; none we could find. So we suspect that this is because of the garbage collector. But again, we don't really know why.

>>: Were you running with eight cores?

>> Swarat Chaudhuri: We were running with eight cores, yes.

>>: Had overheads, other stuff going on.

>> Swarat Chaudhuri: Yes. An interesting thing to note is that our speedup on this spanning tree application is quite different. And the reason there's no line corresponding to DSTM is that we couldn't run DSTM on this application: it's so big it ran out of memory immediately.

Here's the far part of the application. Speedups were rather miserable, even though everything else was good. But we're hoping for a better implementation, and maybe this can be improved. I'm not claiming this is fundamentally because of implementation overheads, though; it may be that our approach also has some fundamental problems. But we do not know, basically.

>>: [inaudible] what is the raw speedup?

>> Swarat Chaudhuri: For raw speedup, you run the sequential version of the code and you plot the speedup over the sequential version, versus your performance on one thread. So basically, the sequential version of your system is the baseline in the self-relative speedup, and for raw speedup it's the best sequential implementation, which is just written in Java.

>>: [inaudible] affect 1-1.

>> Swarat Chaudhuri: Yes, exactly.

>>: How are you going to make it ten times slower?

>> Swarat Chaudhuri: You'll be surprised. So initially it was about 25 times. And then we found that if we replaced these exceptions by just simple checks, where you did not have to prepare a stack trace each time, then it immediately jumped from 25 times to ten times. So our understanding of the overheads is not that deep. It would be useful to talk to somebody who knows the details of the Java runtime system, but this is the best we could do, at least for now.

And in ongoing work, I'm working with Professor Vivek Sarkar at Rice University to embed this Chorus model into Habanero Java, which is an offshoot of X10, whose development [inaudible] spearheaded.

The point here is that Habanero Java already has these abstractions for isolation; however, these are coarse-grained and static. What we want to have is this more fine-grained, adaptive construct for isolation, as you saw here. And then there are some other things that are being considered in this setting.

So in particular, in the model that I showed you, the objects were either completely in the ownership of one assembly or not owned at all. However, we are considering abstractions for fractional ownership, where multiple assemblies can own an object but aliasing is very tightly controlled.

And other ongoing work includes a more optimized implementation. My student is working hard at it as we speak. And we are also looking at a foundational process calculus for these problems; actually, that's where we started with all of this. I'm primarily a theoretician, and our goal was to develop foundational calculi for local computation in graphs. Then it became a more realistic thing, but we still want to address the question of what a pi calculus with splits and ownership would look like. We're investigating that. Finally, as somebody who is very interested in program verification, I'm also interested in questions about type systems for Chorus: how do we reason about the type state of these assemblies, and how can we do some static analysis. So these are all questions that we want to address.

And then finally, one other interesting application of Chorus seems to be in modular robotics, where there are these self-assembling robots which are running around in parallel, get together to do some stuff, and break off again to do some other stuff. That seems quite similar, at least at a superficial level, to our assemblies and their merging and splitting. We're looking into those applications and trying to see if we can program some of them.

So for more information you can go to this Web page. I don't think it's up yet, but it will be in a few days. If you want the code, ask me. The paper is published in OOPSLA; if you want to see that, e-mail me too. I'll end here and take questions.

[applause]

>>: So you say you're going back to start working on type systems and static analysis?

>> Swarat Chaudhuri: So we have a preliminary implementation of this language, which you can download and use, but we want to integrate it with Habanero. And then there are questions of what the types are, what types mean for this, and what kind of compiler optimizations we can do. So I want to address these questions.

>>: Almost sounds like on a linear type setup.

>> Swarat Chaudhuri: Yes.

>>: But the pitfalls are always fun.

>> Swarat Chaudhuri: Right.

>>: So he asked my question. So if I wanted to use

--

if I wanted to use [inaudible] to parallelize this, the notion of a job -- a job would correspond to, I guess, triangles which are bad, that you can extract.

>> Swarat Chaudhuri: Right.

>>: Then push them into a queue and have threads take jobs from the queue, right?

>> Swarat Chaudhuri: Right. The question is what kind of heuristics you would need to maintain the job queue. It can't be completely naive. What you would presumably do is use the same kind of locality strategy: if you're stealing a triangle, then perhaps you're always going to steal some local triangles too, because you will use them afterwards. So in some sense you can see our division runtime system as an implementation of what you said: each division has a queue, and stealing work from another division is basically about migrating these assemblies from one division to another.

And there is load balancing, just as you're going to have load balancing in Cilk. But the point is that all of those things are hidden from the programmer, because we also want to guarantee these high level properties: race freedom, deadlock freedom, and so on.

>>: You would compile your language to

--

>> Swarat Chaudhuri: Yes, that's actually a part I did not have. That seems like a possibility, yes.

I mean, okay. So I would be -- yeah, I would be hesitant to say yes completely, and the reason is these questions like deadlock freedom: we had to do some special things to ensure that. So I do not know if, with just a naive translation into Cilk, we would get all those properties automatically; that would require some thought.

>>: We can talk about it.

>>: Sure. Yes.

>>: [inaudible].

>> Swarat Chaudhuri: The only way you do any kind of passing of objects from one assembly to another is via this merge or split. So when you do a merge, you are giving the entire region to another assembly. So you are not

--

>>: So basically [inaudible] you're passing ownership and stuff, and everything goes through the local part of the heap?

>> Swarat Chaudhuri: Yes, that is right. That is right. Yeah. So that's where pointseparation comes in.

>>: A measure of the

--

[inaudible] intercept and goes

--

>> Swarat Chaudhuri: Right. So not every one of them

--

>>: Ten percent slow down.

>> Swarat Chaudhuri: I agree that is perhaps -- yeah, so I agree that is perhaps the biggest bottleneck. But one clarification, though, is that we do not need to monitor every single write. We can store some notion of a boundary of these objects, so you can do some optimizations which will guarantee that certain accesses are local. But sometimes you're not sure, and then, yes, you will have to do this sort of check.

>>: This is like transactional memory.

>> Swarat Chaudhuri: One difference is that you're doing this only at a completely local level. In the case of transactional memory, if you have at least something that's global, your conflict detection routine is going to have to go through the entire memory. Whereas here, this sort of check -- whether you are going outside or staying inside -- is something you are doing using that find operation in union-find, right? And it's also a local thing: as you saw in that mesh example, an assembly contains, what, ten triangles. It's about checking the 9 or 10 edges that go out of an assembly and trying to see if

--

>>: It depends on the size of the

--

>> Swarat Chaudhuri: Absolutely.

>>: The size of the entire part of this assembly.

>> Swarat Chaudhuri: Sure. But what I'm saying is that the property we're trying to exploit is that there is a lot of parallelism at a low level, at a fine-grained level. So in cases where there's no such parallelism, then, yes, it will just reduce to doing these sorts of global checks.

>>: The overhead of these checks

--

>> Swarat Chaudhuri: Sure.

>>: Essentially the same as transactional memory. You can have [inaudible] implementation

--

>> Swarat Chaudhuri: But the basic point I want to make is that the only difference between the sort of checking that goes on in transactional memory and here is that we are using this locality property: we're operating on local regions. So if you said that these assemblies contain the entire heap, then in that case, yes, there's no difference at all in terms of overheads.

>>: With multiple transactions you can access something outside the transaction, so you need to monitor.

>> Swarat Chaudhuri: But remember, in transactions there is no annotation of the region that a transaction belongs to. So transactions refer to [inaudible] of control in some sense, and the data is one global blob; you go do whatever you want to do with it.

>>: [inaudible]

>> Swarat Chaudhuri: The what?

>>: The writes itself.

>> Swarat Chaudhuri: Sure, but if you have heap allocated data structures and you have field expressions, then you will have to basically do either some kind of analysis, some kind of shape analysis, to see how far you can go, or you'll have to do it dynamically by doing a check. So here all we're exploiting is the locality property: that in these applications the parallel computations are restricted to these local regions in the heap, and that's what we're exploiting, yes.

>>: Any further questions?

>>: I have a question. So something about the ease of programming with respect to fine-grained locking and transactions

--

>> Swarat Chaudhuri: Transactional memory is easy to program in, but fine-grained locking is pretty hard. Even in these applications that are not complicated, just trying to make sure there are no deadlocks and no aliases going from here to there -- that's a huge pain.

So I think that in terms of ease of programming, you saw in that application ten lines of code over the sequential version. So I think it seems very natural, at least for these classes of applications.