snaplock

Introduction

The snaplock project is part of the snapwebsites environment. It allows for locking various resources on the entire cluster for a small amount of time. A resource has a URL which is what we use to create the lock (we call it the Object Name.) The URL can be global to the whole cluster, specific to a website, or even specific to one special variable in a specific page of a specific website (very small granularity is possible.) At this time only exclusive locks are available. It is planned to add a semaphore capability which will offer a Read/Write (exclusive) or Read-Only (shared) lock mechanism.

The snaplock tool runs on any system that may require a lock. More or less, all the nodes in a Snap Cluster.

The locking is obtained using the Lamport's Bakery Algorithm ameliorated to support a non-complete network (i.e. using a praxos consensus instead of a complete consensus.) However, before we can have valid locks, we first have to make sure that the cluster is considered up. This includes several premises:

snapcommunicator sent us a CLUSTERUP message with the total number of nodes in the cluster

The number of known snaplock represents a quorum

The known nodes is at least 3 (it can be less if the total number is only 1 or 2)

There are enough computers participating in the elections to become leaders to run said election

The election ran succesfully

The snaplock daemon is connected to all the other leaders

When starting the elections, we must have a quorum of them (see below for computation of the quorum parameter) otherwise snaplock could not guarantee that the locks would actually be locking anything exclusively. That is, each snaplock on its own could be asked to "LOCK" the same URL and return "LOCKED"... and clearly all the nodes would start working on the exact same resource. Instead, by having a quorum we make sure that if anyone asks for something to be locked, it for sure gets locked exclusively.

Further snaplock makes sure that the locking is fast by using at most three nodes to perform the actual lock (the Lamport's Bakery Algorithm). Many lock systems use just one node, often called a Leader, but this can cause one big problem: it is not failsafe in case of a crash. i.e. on a crash, the information about all the existing locks is lost. Although it would be possible to then wait for a little while before accepting new locks (i.e. blindly timing out old locks), but offering new locks too quickly could create weird bugs and some of our locks last a very long time (hours.) Really, without any information about those existing locks we can't even cancel them properly.

This is why we want three nodes to handle our lock mechanism.

Initialization

First snaplock starts and initializes various connections. Especially, it connects to snapcommunicator so it can learn about the network. However, other services cannot start making use of it until the cluster is up and leaders were elected. Note that in most cases leaders get elected as soon as the cluster of snaplock is considered up.

In the initialization step, the snaplock service generates a random identifier to be used by the election process. This allows for varying nodes to end up being leaders instead of always the exact same (assuming you have more than three nodes, of course.)

States

The snaplock has a status. It can be queried by sending a LOCKSTATUS message. The reply is either:

NOLOCK

This means that the snaplock service is not currently offering locks. It is not yet ready. It may be because it needs to have confirmation from its peers that they are up and running or there is a problem and the elections can't be conducted (i.e. you marked too many computers as being Off—not a candidate in the election.)

LOCKREADY

If you receive this reply to your LOCKSTATUS request, then the lock system is ready and a LOCK will quickly get a reply with a LOCKED message.

Note that LOCK messages sent to snaplock before the LOCKREADY status are cached by snaplock until they timeout. So you do not have to listen to the NOLOCK and LOCKREADY status, although many daemons do so as to mark themselves as ready or not.

The snaplock is in the NOLOCK state until valid elections happen and it enters the LOCKREADY state. If you shutdown many nodes all at once, then the state may return to NOLOCK. On a large enough live system, though, once the state is LOCKREADY, it should remain such forever except on computers where you are ready to do some maintenance (i.e. rebooting, upgrading, etc.)

Election Algorithm

I have a great book called Distributed Systems: An Algorithm Approach (pic. on right.) There are a few pages about Election Algorithms. How to determine which node should be the leader. However, there are two drawbacks with those algorithms:

The cluster needs to be complete to proceed with the election, problematic if you want to allow for a fully dynamic cluster (possibly one which creates new nodes and destroys old nodes, but never just reboots a node!)

The algorithms only elects one leader and we definitely want three leaders.

So we based our algorithm on the one presented in Sukumar's book, but we changed a couple of things to allow for the algorithm to work properly in a non-complete cluster and to elect three leaders instead of just one.

Messages

The algorithm I wrote is still fairly simple. There are three important messages:

CLUSTERUP

The CLUSTERUP message is sent by the snapcommunicator daemon once it knows of a little over 50% of the nodes in the cluster. The number of nodes needed to consider the cluster as being up is called the quorum. It is calculated as:

n
quorum = - + 1
2

where n is the total number of computers. Once your cluster is properly installed, we know n since we have the complete list of all the neighbors. That number will fluctuate as you add and remove nodes, although once you reach 5 nodes, a fluctuation of +1 or -1 can easily be ignored.

Since snapcommunicator only sends the CLUSTERUP message when it goes from being down to being up, snaplock sends a CLUSTERSTATUS to make sure it will receive the message at least once.

Note that snaplock ignores the CLUSTERCOMPLETE message that snapcommunicator sends too. Knowing that the cluster is complete is not useful to ensure proper mutual exclusion within the quorum of connected nodes.

LOCKSTARTED

The LOCKSTARTED message is used to connect all the snaplock daemons between each others. When a snaplock daemon starts, it broadcasts that message as soon as it receives the CLUSTERUP message from snapcommunicator. It also sends that message whenever a new computer is added to the cluster.

At first, a node doesn't know about any other node so it broadcasts the LOCKSTARTED to all the others. Whenever a node receives the LOCKSTARTED, it replies with a LOCKSTARTED directly to the sender if it did not yet know that sender. It does not reply to said sender if it already knows the sender. That way we avoid infinite loops.

The current implementation waits for a quorum of snaplock instances to be known before to proceed with the leaders elections. The quorum is calculated the same way as for the CLUSTERUP message above. Note that our algorithm would work within a quorum of nodes without the need for all the snaplock to know each others, but that require a more complicated algorithm with many more messages to ensure the election was run properly.

Once the quorum is reached, the election happens (see below for details about that.)

LOCKLEADERS

The LOCKLEADERS message is sent once the election results were determined.

The snaplock daemon with the smallest IP address found in the quorum of snaplock daemons proceeds with the election and determines the three leaders.

This message is considered authoritative. In other words, it requests for all the previously defined leaders to be overwriten with this new list of leaders.

Each snaplock still takes the election date in account (i.e. the most recent election results are the ones kept.) This assumes that our network has clocks that are properly synchronized.

Restrictions

The election happens if and only if all of the following tests are true:

Cluster Up --- until the cluster is up, snaplock has no clue to the total number of nodes are on the network, so it can't calculate a quorum and thus can't know whether to proceed with the elections

List of Leaders Incomplete --- if the daemon already has leaders, do not attempt another election; however, if a leader was lost, re-process the election to replace the leader that was lost

Quorum Reached --- we have two possibilities here, if the total number of nodes is 3 or less, then we do not want a quorum, instead we wait for a complete network; if there are 4 or more nodes, then we wait for at least quorum snaplock daemons to know of each others before proceeding

Smallest IP Address --- the node in charge of the election is the one with the smallest IP address; we do that because at the time CLUSTERUP happens many of the nodes may be connected to many different other nodes making it difficult to coordinate otherwise; the time it takes to process an election is so short, though, that using the smallest IP address node is pretty safe

Three Candidates --- among the nodes that form a cluster we need to have at least three of them that can be candidates; some nodes can request to never be part of the elections (for example, you may want to avoid elected snaplock instances on Cassandra nodes); so the cluster may be considered up, but you may not yet have three nodes that are not turned off in terms of candidacy

Once all of these are true, the node with the smallest IP address at the time the cluster is formed elects three leaders and broadcast the results with the LOCKLEADERS message. At this point, further LOCKSTARTED messages include the leaders so new nodes coming up will automatically be told of the results and they do not take the initiative to handle a new election.

It's at this point that we enter the state LOCKREADY.

Sorting snaplock Daemons to Find the Leaders

The way the election works is by sorting the candidates by ID. The ID is formed by concatenation of the following five elements:

Priority --- as defined by the candidate_priority=... parameter found in the snaplock.conf configuration file; the special value "off" can be used to remove this node from the candidacy; the smaller the priority, the higher the chance for this node to get elected, however, which nodes become leaders very much depend on factors such as the order in which the nodes are booted up

Random Number --- a random number as generated in the snaplock initialization step; after that initialization step this random number never changes until snaplock is restarted; this is used to make sure that the snaplock can travel between the various nodes and not be forced to stick on a specific set of nodes

IP:port --- the IP address and port of the concerned node; the port is likely to always be 4040; the IP being unique between each nodes we can make use of it to make sure this identifier is unique; also the participating node with the smallest IP:port wins the election if the Priority and Random Number are both equal, somehow

PID -- the process identifier of the snaplock instance generating this identifier; again this is used to make the identifier unique, but in this case that's for developers, on one system (i.e. while debugging we often run many instances of snaplock on a single system) each snaplock instance will have a different process identifier

Server Name --- whenever we get the list of leaders, we need to assign them to their corresponding node and that requires us having the name of the server on which these leaders run; it is also another unique identifier which makes sure that the sort when the elections happen is always clearly predictable; it is also important when receiving a list of leaders before we already registered one of those leaders through the LOCKREADY message

IMPORTANT NOTE: This algorithm does not currently work 100% correctly in a dynamic environment (i.e. a cluster that gows and shrinks.) The algorithm fails if the number of computers is very close the the quorum and new nodes get added at that time. This will change the quorum and it may cause problems. As long as new nodes get added one at a time and when the network is CLUSTERCOMPLETE, there should be no concerned.

Elected

The voting process was a success and the cluster knows of the three leaders.

Each leader changes its priority to 0. This way on the next elections they will always be the winners and if one of them is lost (a reboot happens or the node crashes) then the other two will remain leaders and one other node is elected as a replacement of the lost leader.

Note that the priority going to zero is virtual. It's done at the time the elections happen. The computer taking care of sorting the node's IDs makes sure to insert "00" as the priority of existing leaders.

Loss of Leader

As soon as one of the three leaders known to be lost, a rerun happens from the snaplock with the lowest IP address. Since the two leaders left are going to use a priority of "00", they will remain leaders (they will automatically be re-elected.) However, the order may still change so leader #2 may become leader #1, for example.

Nodes Hopping In

Up until the election happens, new nodes simply become potential candidates (unless their priority is "off," although such are still added to the list, but ignored for the election itself.)

After the elections, new nodes are sent the results along the LOCKSTARTED replies. Nodes that a node receiving a LOCKSTARTED message with a list of leaders immediately enters the LOCKREADY state.

Obtaining a Lock

As far as the client is concerned

With three leaders the snaplock system is ready to accept LOCK requests. These include a URL and the time by which we want to obtain the lock and the duration of the lock, in case we do obtain it.

If the lock can't be obtained for whatever reason, then a LOCKFAILED message is sent to the lock requester.

If the lock is obtained, then the current plus lock duration are used to calculate the exact date when the lock will time out. That information is sent along the LOCKED message.

The client can send the UNLOCK message once done with the resource and it is expected to happen that way. However, if the UNLOCK does not arrive on time, the snaplock system sends an UNLOCKED message to the client which should react by releasing the resource as soon as possible and then send an UNLOCK message.

The snap_lock object in the snapwebsites library offers all that functionality in an object one can use to obtain a lock. Very easy to use for the purpose. For locks used to do work that take a long time (like with backends), the get_timeout_date() can be used to know when the lock is going to timeout. The lock_timedout() function returns true once the lock has timed out. That second function actually reacts to the UNLOCKED message and tells you whether UNLOCKED was already received or not.

As far as the server is concerned

On the server side many things happen to know whether a lock is available or not.

First, if the LOCK message gets sent to a snaplock which is not a leader, that snaplock will proxy the LOCK message to one of the leaders. The first time, it selects a leader randomly. After that it uses a round robin mechanism so each leader will receive LOCK and be the master.

When a leader receives a LOCK message, it becomes the master for that one LOCK request. If it dies before the lock is obtained, then another leader will take over and restart the process.

From there on, the lock process goes like this (which is an advanced implementation of the Lamport's Bakery Algorithm):

client: LOCK

proxy: foward LOCK (only if client does not talk directly to a leader)

master: LOCKENTERING

A/B: LOCKENTERED

master: GETMAXTICKET

A/B: MAXTICKET

master ADDTICKET

A/B: TICKETADDED

master: LOCKEXITING

A/B: ACTIVATELOCK

master: LOCKED

master: (on timeout) UNLOCKED error="timedout"

A/B are the other two leaders.

The master sends its message to A and B.

A and B reply to the master.

In all cases, when the master receives at least one reply that confirms that a message was received, it has a quorum and can consider that there is consensus. With such, it can move forward to the next step.

At any stage, many tests are done to make sure that the algorithm is working as expected. If any error is detected, a LOCKFAILED happens with the "error" parameter likely set to "invalid".

If the messages take too long, somehow, then the process stops with a LOCKFAILED and the "error" parameter set to "timedout".

After you received the LOCKED, you shuold not receive a LOCKFAILED. Instead, you will get an UNLOCKED message. The UNLOCKED message may include an "error" parameter set to "timedout" in which case the lock will remain in place for at least as long as your lock was or 1 minute, whatever is the longest. That additional time must be used to stop using the resource that was locked or bugs may ensue as the locked data will be shared with others and if the resource needs to be edited by at most one process at a time, we're in big trouble!