Microsoft Orleans Cluster Management

A few weeks ago, we saw how to create a silo and how to implement grains. We also saw how to create an ASP.NET Core web app that talks to the silo, and we discovered the benefits of the single-threaded model in grains. Today we will look at another major feature of Orleans: cluster management.

Build a silo

Form a cluster with multiple silos

Cluster management with membership

1. Build a silo

Let’s start by implementing a silo. We will reuse the example from the previous post.
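As a reminder, the silo host can be sketched as follows. This is a sketch against the Orleans 2.x hosting API (SiloHostBuilder / UseLocalhostClustering), which may differ from the exact version used in the previous posts; the ClusterId and ServiceId values are assumptions for illustration:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Orleans.Configuration;
using Orleans.Hosting;

public class Program
{
    public static async Task Main(string[] args)
    {
        // args: [0] silo port, [1] silo gateway port, [2] primary (seed) silo port.
        var siloPort = int.Parse(args[0]);
        var gatewayPort = int.Parse(args[1]);
        var primaryPort = int.Parse(args[2]);

        var host = new SiloHostBuilder()
            // Localhost clustering uses a grain-based membership table
            // hosted on the primary silo.
            .UseLocalhostClustering(
                siloPort: siloPort,
                gatewayPort: gatewayPort,
                primarySiloEndpoint: new IPEndPoint(IPAddress.Loopback, primaryPort))
            .Configure<ClusterOptions>(options =>
            {
                options.ClusterId = "main";
                options.ServiceId = "HelloWorldApp";
            })
            .Build();

        await host.StartAsync();
        Console.WriteLine($"Silo started on port {siloPort}. Press Enter to stop.");
        Console.ReadLine();
        await host.StopAsync();
    }
}
```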

Here we take the endpoints as arguments so that we can boot multiple silos under the same “main” cluster.
The first argument is the silo port, the second the silo gateway port, and the last one the port of the primary node (which also acts as the seed node). We can then open two command prompts and run the two following commands to start a cluster with a primary node on port 30001:
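For example, assuming a silo project that takes the three ports as arguments (a sketch — the gateway ports 30011 and 30021 are arbitrary choices, and the project layout is assumed):

```shell
# Primary silo: silo port 30001, gateway port 30011, seed node 30001 (itself).
dotnet run -- 30001 30011 30001

# In a second prompt, a secondary silo: silo port 30020, gateway port 30021,
# pointing at the primary on 30001.
dotnet run -- 30020 30021 30001
```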

2. Form a cluster with multiple silos

Doing so allows us to boot multiple servers under the same deployment, a so-called cluster.
Next, we change the client configuration so that the client can choose between the two available gateways.
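A sketch of such a client configuration, using the Orleans 2.x ClientBuilder with a static list of gateways; the gateway ports 30011 and 30021 and the ClusterId/ServiceId values are assumptions for this example:

```csharp
using System.Net;
using System.Threading.Tasks;
using Orleans;
using Orleans.Configuration;
using Orleans.Hosting;

public static class ClientStartup
{
    public static async Task<IClusterClient> ConnectAsync()
    {
        var client = new ClientBuilder()
            // Give the client a static list of the two silo gateways:
            // it can pick either one to reach the cluster.
            .UseStaticClustering(
                new IPEndPoint(IPAddress.Loopback, 30011),
                new IPEndPoint(IPAddress.Loopback, 30021))
            .Configure<ClusterOptions>(options =>
            {
                options.ClusterId = "main";
                options.ServiceId = "HelloWorldApp";
            })
            .Build();

        await client.Connect();
        return client;
    }
}
```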

We now end up with two silos forming a cluster and a client with access to that cluster. With this setup, we can scale by booting more silos, and grains will be spread among them by the Orleans runtime.
The management of the silos is left to the runtime, which uses a membership table.

3. Cluster management with membership

The cluster management in Orleans works via a membership table. Its main purpose is to answer the following questions:

How can a silo join the cluster?

How do other silos get notified that another silo went down?

How do clients know which gateway is available?

Joining a cluster

When using a MembershipTableGrain, the primary node is the one holding the membership, and it must be the node that starts first.
When a silo tries to join the cluster, it looks for the primary node and pings the nodes stated as alive in order to join. This can be seen in the logs of the silo trying to join the cluster.
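Under the hood, localhost clustering relies on this grain-based membership, and which silo is primary is determined by the primary endpoint passed at configuration time. The same can be made explicit with development clustering; a sketch, assuming the Orleans 2.x UseDevelopmentClustering and ConfigureEndpoints extensions (the ports are illustrative):

```csharp
using System.Net;
using Orleans.Configuration;
using Orleans.Hosting;

// A sketch: grain-based (development) clustering. The silo whose own endpoint
// equals primaryEndpoint hosts the MembershipTableGrain, so it must start first.
var primaryEndpoint = new IPEndPoint(IPAddress.Loopback, 30001);

var host = new SiloHostBuilder()
    .UseDevelopmentClustering(primaryEndpoint)
    .ConfigureEndpoints(siloPort: 30020, gatewayPort: 30021)
    .Configure<ClusterOptions>(options =>
    {
        options.ClusterId = "main";
        options.ServiceId = "HelloWorldApp";
    })
    .Build();
```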

Once communication is established, the silo marks itself as alive in the membership table and reads the alive nodes in the cluster. For example, let’s assume we have been running multiple silos, some of which have already gone down, and we are now booting a silo on 30020.

The number suffixed to the silo address is a timestamp, which prevents two silos from being written to the table under the same key.
There are 3 Active silos: one of them being us (30020), the other active silos being 30001 and 30030.
After this point, the silo starts to monitor its connection with the alive silos by actively pinging them. This behavior can be seen in the silo logs.

This concludes the process of the silo joining the cluster, known as the activation of a silo. It is delimited by the following logs:

-BecomeActive
...
-Finished BecomeActive.

Exiting a cluster

All silos actively ping each other and will therefore know when other silos are down. When one of the silos goes down, it will eventually be marked as dead in the membership table. This can be seen in the logs of the silos actively pinging the dead silo.

Let’s simulate a silo failure in a cluster composed of 3 silos by booting three silos (30001, 30020, 30030) and then shutting down 30030. As we saw earlier, as soon as a silo joins, it starts to actively ping other silos, and other silos start to actively ping it.
A failure therefore results in ping failures from its sibling silos.

Notice the timing: as soon as 30020 marks 30030 as dead, it pings 30001, and then 30001 updates its watch and reads back the table. This is how the cluster is managed together with the membership table: silos know when other silos join the cluster and when other silos exit it.
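The probing and voting behavior described above can be tuned on the silo host. A sketch, assuming the Orleans 2.x ClusterMembershipOptions; the values shown are illustrative choices, not the defaults:

```csharp
using System;
using Orleans.Configuration;
using Orleans.Hosting;

// A sketch: tuning failure detection via ClusterMembershipOptions.
var host = new SiloHostBuilder()
    .UseLocalhostClustering()
    .Configure<ClusterMembershipOptions>(options =>
    {
        options.ProbeTimeout = TimeSpan.FromSeconds(5);   // how long to wait for a ping reply
        options.NumMissedProbesLimit = 3;                 // missed pings before suspecting a silo
        options.NumVotesForDeathDeclaration = 2;          // votes required to declare a silo dead
    })
    .Build();
```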

Conclusion

Today we dived into how Orleans manages a cluster of multiple silos: how silos join and leave a cluster and how their siblings are notified. The process is handled by the membership oracle, which keeps its data in the membership table. We saw what sort of logs indicate the current status of the system and help us understand what is happening. We also saw the configuration needed to boot multiple silos from the same console app. Cluster management is one of the best features of Orleans, as it abstracts away the concerns of distributed system management. The membership table grain is not recommended for a production environment: if the primary node goes down, the whole system will not survive. Next week we will look into moving the table to a fault-tolerant storage like SQL Server. I hope you liked this post! If you have any questions, you know what to do. See you next time!