Replica Set Internals Bootcamp: Part I – Elections

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

I’m going to do one subject per post, we’ll see how many I can get through.

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

The most obvious feature of replica sets is their ability to elect a new primary, so the first thing we’ll cover is this election process.

Replica Set Elections

Let’s say we have a replica set with 3 members: X, Y, and Z. Every two seconds, each server sends out a heartbeat request to the other members of the set. So, if we wait a few seconds, X sends out heartbeats to Y and Z. They respond with information about their current situation: the state they’re in (primary/secondary), if they are eligible to become primary, their current clock time, etc.

X receives this info and updates its “map” of the set: if members have come up or gone down, changed state, and how long the roundtrip took.

At this point, if X map changed, X will check a couple of things: if X is primary and a member went down, it will make sure it can still reach a majority of the set. If it cannot, it’ll demote itself to a secondary.

Demotions

There is one wrinkle with X demoting itself: in MongoDB, writes default to fire-and-forget. Thus, if people are doing fire-and-forget writes on the primary and it steps down, they might not realize X is no longer primary and keep sending writes to it. The secondary-formerly-known-as-primary will be like, “I’m a secondary, I can’t write that!” But because the writes don’t get a response on the client, the client wouldn’t know.

Technically, we could say, “well, they should use safe writes if they care,” but that seems dickish. So, when a primary is demoted, it also closes all connections to clients so that they will get a socket error when they send the next message. All of the client libraries know to re-check who is primary if they get an error. Thus, they’ll be able to find who the new primary is and not accidentally send an endless stream of writes to a secondary.

Elections

Anyway, getting back to the heartbeats: if X is a secondary, it’ll occasionally check if it should elect itself, even if its map hasn’t changed. First, it’ll do a sanity check: does another member think it’s primary? Does X think it’s already primary? Is X ineligible for election? If it fails any of the basic questions, it’ll continue puttering along as is.

If it seems as though a new primary is needed, X will proceed to the first step in election: it sends a message to Y and Z, telling them “I am considering running for primary, can you advise me on this matter?”

When Y and Z get this message, they quickly check their world view. Do they already know of a primary? Do they have more recent data than X? Does anyone they know of have more recent data than X? They run through a huge list of sanity checks and, if everything seems satisfactory, they tentatively reply “go ahead.” If they find a reason that X cannot be elected, they’ll reply “stop the election!”

If X receives any “stop the election!” messages, it cancels the election and goes back to life as a secondary.

If everyone says “go ahead,” X continues with the second (and final) phase of the election process.

For the second phase, X sends out a second message that is basically, “I am formally announcing my candidacy.” At this point, Y and Z make a final check: do all of the conditions that held true before still hold? If so, they allow X to take their election lock and send back a vote. The election lock prevents them from voting for another candidate for 30 seconds.

If one of the checks doesn’t pass the second time around (fairly unusual, at least in 2.0), they send back a veto. If anyone vetos, the election fails.

Suppose that Y votes for X and ZvetosX. At that point, Y‘s election lock is taken, it cannot vote in another election for 30 seconds. That means that, if Z wants to run for primary, it had better be able to get X‘s vote. That said, it should be able to if Z is a viable candidate: it’s not like the members hold grudges (except for Y, for 30 seconds).

If no one vetos and the candidate member receives votes from a majority of the set, the candidate becomes primary.

Confused?

Feel free to ask questions in the comments below. This is a loving, caring bootcamp (as bootcamps go).

Join the Conversation

Any reason Z suddenly knew X was a bad choice for primary. For example, if there was a network partition and X, Y, and Z couldn’t reach W. Z suddenly can reach W and finds out that it’s primary already. Or W has a higher priority than X. Or if someone does a forced reconfig on Z, so Z has a newer version of the config than X. You can look at the code (https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/consensus.cpp#L192-222) to see all the possible cases, but those are a couple.

Arg, I was hoping no one would ask about Paxos 🙂 I think that comparing/contrasting MongoDB’s election protocol to Paxos will take more than I want to put in a comment (lots of diagrams!) so I’m going to put this in the blog post queue.

Hi,
I have question about arbiters. I’m testing replica sets on my computers and run into this problem:
I know that if I’m having replica set from 2 nodes, I need to run arbiter as well.
But what if I have replica set compound from 3 or 4 or more nodes and during some failure on nodes I end up with two nodes. Do I need arbiter running in this case?

What if I’m having arbiter in two nodes configuration and PC with arbiter fails? What happend to replica set than?

Thanks for answers, and really looking forward to next bootcamp blogpost 🙂

You do not need an arbiter if the total # of nodes is odd (# of available nodes doesn’t matter). So: 3 nodes=no arbiter, 4 nodes=add an arbiter.
> What if I’m having arbiter in two nodes configuration and PC with arbiter fails? What happend to replica set than?Are you talking about having 1 “normal” node and one arbiter? If so, you shouldn’t do that 🙂 If the arbiter goes down, the primary (“normal” node) will only have 1 out of 2 votes (not a majority) and so become a secondary.

If you’re talking about having 2 normal nodes & one arbiter, if everyone can reach each other, everything will be fine if the arbiter goes down (the primary still has 2 out of 3 votes). If the primary cannot reach the other “normal” node for some reason & the arbiter goes down, the primary will only have 1 out of 3 votes and so step down.

Thank you for this writeup. This blog post is very informative. I realize this blog was written a while ago, but I have a specific question about the implementation (I have been studying the code). In the second phase, where X formally announces its candidacy, I assume one of the questions Y and Z ask is “do I have more recent data than X?”. How is this question answered in a thread-safe manner? Or, is this not done in a thread-safe manner and the system depends on rollback should any other machine be further along?

In the election code (consensus.cpp, as best as I can tell), I don’t see any lock that prevents the BackgroundSync or SyncTail from applying more data while an election is running. What is to stop Y from having data yet to be applied that is further along than X, report to X that X can become a primary, and then have that data later be applied that puts it further along than X?

Suppose Primary is down, and now I have 3 Secondaries, each of them with priority 1. If they take a voting round, making a tie, for instance: A votes in B, B votes in C, C votes in A. What is the mecanism currently implemented to change that aparant deadlock? Does mongo takes measures to increase the priority of a given member? Cheers.

MongoDB does not change priorities dynamically, but it does wait a random amount of time on each member before voting again. Theoretically it could keep tying, but in practice I’ve never seen it. If all of the priorities are the same, the algorithm basically prefers the order the members are specified in the config, so the situation is unlikely. The plan is to implement Paxos at some point, which will eliminate the possibility of a tie.