Orleans and Midori

Reading Joe Duffy’s epic 15 Years of Concurrency post brought back some old memories from the early days of Orleans. It even compelled me to dig up the code from 2009 and try to compile it. It was an entertaining exercise.

When we were just starting the Orleans project, we would meet and talk with Midori people on a regular basis. That was natural not only because of some obvious overlap of the problem spaces, but also because Jim Larus, who conceived Orleans, was one of the creators of Singularity, the base from which Midori started. We immediately borrowed the promises library of Midori because we wanted to use promise-based concurrency for safe execution and efficient RPC. We didn’t bother to try to integrate the code, and simply grabbed the binaries and checked them into our source tree. We were at an early prototyping stage, and didn’t have to worry about the long term yet.

At the time, grain interfaces looked like this:

[Eventual]
public interface ISimpleGrain : IEventual
{
    [Eventual]
    PVoid SetA(int a);

    [Eventual]
    PVoid SetB(int b);

    [Eventual]
    PInt32 GetAxB();
}

PVoid and PInt32 were moral equivalents of Task and Task<int> in TPL. Unlike Tasks, they had a bunch of static methods, with one of the simpler overloads taking two lambdas: one for the success case and one to handle a thrown exception:

The nested Whens were necessary to organize a data flow execution pipeline. Runner was an instance of ForeignTodoRunner, which was one of the ways of injecting asynchronous tasks (ToDos) into a TodoManager. TodoManager was a single-threaded execution manager, a.k.a. a vat, a notion that came from the E language. Initialization of the vat-based execution system was a few lines of code:

todoManager = new TodoManager();
Thread t = new Thread(todoManager.Run);
t.Name = "Unit test TodoManager";
t.Start();
runner = new ForeignTodoRunner(todoManager);

Within a silo, we also used vats for managing single-threaded execution of grain turns. As part of silo startup we set up N of them to match the number of available CPU cores:

We argued with Dean Tribble at the time that using static methods on promises in our view was too inconvenient for most developers. We wanted them to be instance methods instead. A few months later we introduced our own promises, AsyncCompletion and AsyncValue<T>. They were wrappers around Task and Task<T> of TPL and had instance methods. This compressed the code by quite a bit:

Initially, we allowed grain methods to be synchronous, and had grain references be their asynchronous proxies.

public class SimpleGrain : GrainBase
{
    public void SetA(int a)
    public void SetB(int b)
    public int GetAxB()
}

public class SimpleGrainReference : GrainReference
{
    public AsyncCompletion SetA(int a)
    public AsyncCompletion SetB(int b)
    public AsyncValue<int> GetAxB()
}

We quickly realized that was a bad idea, and switched to grain methods returning AsyncCompletion/AsyncValue<T>. We went through and eventually discarded a number of other bad ideas. We supported properties on grain classes, but async setters were a problem, and in general async properties were rather misleading and provided no benefit over explicit getter methods. We initially supported .NET events on grains, but had to scrap them because of the fundamentally synchronous nature of the += and -= operations in .NET.

Why didn’t we simply use Task/Task<T> instead of AsyncCompletion/AsyncValue<T>?

We needed to intercept every scheduling and continuation call in order to guarantee single-threaded execution. Task was a sealed class, and hence we couldn’t subclass it to override the key methods we needed. We didn’t have a custom TPL scheduler yet either.
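For readers who have not seen one, here is a minimal sketch of what a custom TPL scheduler that guarantees single-threaded execution can look like. It is not the Orleans scheduler, which this post does not show. SingleThreadTaskScheduler, its ThreadId property, and SchedulerDemo are hypothetical names made up for illustration. Only the TaskScheduler base class, TaskFactory, and BlockingCollection are real .NET APIs.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: runs every task queued to it on one dedicated thread.
sealed class SingleThreadTaskScheduler : TaskScheduler, IDisposable
{
    private readonly BlockingCollection<Task> _queue = new BlockingCollection<Task>();
    private readonly Thread _thread;

    public SingleThreadTaskScheduler()
    {
        _thread = new Thread(() =>
        {
            // Each dequeued task is one "turn"; turns never overlap.
            foreach (Task task in _queue.GetConsumingEnumerable())
                TryExecuteTask(task);
        });
        _thread.IsBackground = true;
        _thread.Start();
    }

    public int ThreadId => _thread.ManagedThreadId;

    public override int MaximumConcurrencyLevel => 1;

    // Called by the TPL for every task and continuation targeting this scheduler.
    protected override void QueueTask(Task task) => _queue.Add(task);

    // Allow inlining only on our own thread, and never for already-queued tasks.
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
        => Thread.CurrentThread == _thread && !taskWasPreviouslyQueued && TryExecuteTask(task);

    protected override IEnumerable<Task> GetScheduledTasks() => _queue.ToArray();

    // Lets the worker thread exit once the queue drains.
    public void Dispose() => _queue.CompleteAdding();
}

static class SchedulerDemo
{
    // Returns true if the code before and after the await ran on the scheduler's thread.
    public static bool Run()
    {
        using var scheduler = new SingleThreadTaskScheduler();
        var factory = new TaskFactory(scheduler);

        Task<bool> work = factory.StartNew(async () =>
        {
            int before = Thread.CurrentThread.ManagedThreadId;
            await Task.Yield(); // the continuation is queued back to TaskScheduler.Current
            int after = Thread.CurrentThread.ManagedThreadId;
            return before == scheduler.ThreadId && after == scheduler.ThreadId;
        }).Unwrap();

        return work.GetAwaiter().GetResult();
    }

    static void Main() => Console.WriteLine(Run());
}
```

The point of the sketch is that the scheduler sees every QueueTask call, which is the kind of interception described above. It works without subclassing Task.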

After we switched to using our own promises, we lost the opportunity to use some of the advanced features that Midori had for theirs. For example, they supported a three-party promise handoff protocol. If node A called node B and held a promise for that call, but B as part of processing the request made a call to C for the final value, B could hand off a reference to the promise held by A, so that C could reply directly to A instead of making an extra hop back to B. In this tradeoff between performance and complexity we chose to prioritize simplicity.

Another lesson we learned from talking to Midori people was that the source of some of the hardest to track down bugs in their codebase was interleaving of execution turns. Even though a vat had a single thread to execute all turns (synchronous pieces of code between yield points), it was totally legal for it to execute turns belonging to different requests in an arbitrary order.

Imagine your component is processing a request and needs to call another component, for example, make an IO call in the middle of it. You make that IO call, receive a promise for its completion or its return value, and schedule a continuation with a When or ContinueWith call. The trap here is that when the IO call completes and the scheduled continuation starts executing, it is too easy to assume that the state of the component hasn’t changed since the IO call was issued. In fact, the component might have received and processed a number of other requests while asynchronously waiting for that IO call, and processing of those requests could have mutated the state of the component in a non-obvious way. The Midori team was very senior. At the time, the majority of them were principal and partner level engineers and architects. We figured that if interleaving was so perilous to people of that caliber and experience, it must be even worse for mere mortals like us. That led to the later decision to make grains in Orleans non-reentrant by default.
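The trap can be reproduced in a few lines of today’s Task-based C#. This is a minimal sketch, not Midori or Orleans code. Counter, IncrementAfterIoAsync, Add, and InterleavingDemo are hypothetical names made up for illustration, and a TaskCompletionSource stands in for the IO call so that the order of events is deterministic.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical component, invented for illustration.
class Counter
{
    private int _value;

    public int Value => _value;

    // Request 1: reads state, waits for "IO", then writes based on the earlier read.
    public async Task IncrementAfterIoAsync(Task io)
    {
        int snapshot = _value;   // first turn: read the state
        await io;                // yield point: other requests may run here
        _value = snapshot + 1;   // second turn: wrongly assumes nothing changed
    }

    // Request 2: mutates the same state in a single turn.
    public void Add(int amount) => _value += amount;
}

static class InterleavingDemo
{
    public static int Run()
    {
        var counter = new Counter();
        var io = new TaskCompletionSource<bool>();

        // Request 1 starts and parks at the await.
        Task first = counter.IncrementAfterIoAsync(io.Task);

        // Request 2 is processed while request 1 is still waiting.
        counter.Add(10);

        // The "IO call" completes and request 1 resumes with its stale snapshot.
        io.SetResult(true);
        first.GetAwaiter().GetResult();

        return counter.Value;
    }

    static void Main() => Console.WriteLine(Run()); // prints 1, not 11
}
```

No two turns ever ran at the same time, yet the update made by the second request is lost. With a non-reentrant component, the second request would not start until the first one had finished, so the snapshot could not go stale.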

At around the same time, Niklas Gustafsson worked on the Maestro project, which was later renamed and released as Axum. In the spring of 2009, we had an intern prototype one of the early Orleans applications on Axum to compare the programming experience with the promise-based one. We concluded that the promises model was more attainable for developers. In parallel, Niklas created a proposal and a prototype of what eventually, after he convinced Anders Hejlsberg and others, became the async/await keywords in C#. By now the idea has propagated to many more languages.

After .NET 4.5 with async and await was released, we finally abandoned AsyncCompletion/AsyncValue<T> in favor of Task/Task<T> to leverage the power of await. It was another tradeoff that made us rewrite our scheduler a couple of times (not a trivial task) and give up some of the nice features we had in our promises. For example, we could previously detect when grain code tried to block the thread by calling Result or Wait() on an unresolved promise, and throw an InvalidOperationException to indicate that this was not allowed in the cooperative multi-tasking environment of a silo. We couldn’t do that anymore. But we gained the cleaner programming model that we have today:

public interface ISimpleGrain : IGrainWithIntegerKey
{
    Task SetA(int a);
    Task SetB(int b);
    Task<int> GetAxB();
}

[Fact, TestCategory("BVT"), TestCategory("Functional")]
public async Task SimpleGrainDataFlow()
{
    var grain = GrainFactory.GetGrain<ISimpleGrain>(GetRandomGrainId());

    Task setAPromise = grain.SetA(3);
    Task setBPromise = grain.SetB(4);
    await Task.WhenAll(setAPromise, setBPromise);

    var x = await grain.GetAxB();
    Assert.Equal(12, x);
}

Midori was an interesting experiment of a significant scale, to try to build a ‘safe by construction’ OS with asynchrony and isolation top to bottom. It is always difficult to judge such efforts in terms of successes, failures, and missed opportunities. One thing is clear – Midori did influence early thinking and design about asynchrony and concurrency in Orleans, and helped bootstrap its initial prototypes.

And people still argue for correlation IDs and explicit responses, and think that is better because humans don’t hold promises when they send messages to each other. Well, humans don’t do many of the other things that computers do, either.
This is the best approach for 99% of the cases.

That said, I am sometimes picky and hold promises when I send a message to someone.

Very good post, thank you Sergey! This brings great memories.
A couple of additional points:

1) Another reason for not going with Midori promises is that they did not have (at least as far as I remember, at that time) any support for distribution or remoting. It was a completely single box solution. Therefore, even if we used it, we would need to pretty much re-implement the whole stack underneath.

2) That also goes for handoff – it is much harder to do a distributed handoff between 3 remote parties than between non-remote parties.

3) Another reason, in addition to what you wrote, why we initially picked AsyncCompletion/AsyncValue instead of Tasks (Tasks before async/await) was error propagation. With Tasks, if you use ContinueWith you have to re-throw the exception manually from the CW lambda (if you have Task1.CW and Task1 is broken, you have to re-throw its exception from the CW lambda). If you forget to, the exception is swallowed. We thought that manual error propagation was an anti-pattern, and with our AC/AV we propagated exceptions automatically. Task with async/await propagates automatically too. Interestingly, and intriguingly to me, the Go language does not support exceptions at all and uses manual error propagation. I personally believe this to be a language mistake, although I would admit I don’t know the (presumably good) reasons why they chose not to support exceptions.

4) Another reason why we had good collaboration with Midori was Ravi Pandya, who worked on the Midori team prior to joining the Orleans team and had a profound impact on the early Orleans programming model. For example (as far as I remember, although that was quite a while ago), Ravi brought the idea of a strict separation between the concept of a Grain and an Activation, in such a way that an Activation is never even exposed to the programming model. This was the foundation of what we later called a Virtual Actor.

I don’t remember now what the remoting story was with Midori promises. You are probably right.
Good point on error propagation. That was indeed a big one. I think error propagation deserves a separate post.
Indeed, Ravi ‘switched sides’ by joining us and brought a number of ideas and lessons with him.