MongoDB replica set mechanism

Last Updated: Jan 29, 2018

Replica set overview

MongoDB replica sets are composed of a group of mongod instances (processes) and include a single primary node and multiple secondary nodes. All data written by the MongoDB Driver (client) are written to the primary node, and, from there, the data is synced to the secondary nodes. This guarantees the datasets of all members in the replica set remain identical and therefore provides high data availability.

The following figure (from the official MongoDB documentation) shows a typical MongoDB replica set, including one primary and two secondary nodes.

Primary node election (1)

The replica set is initialized using the replSetInitiate command (or Mongo Shell rs.initiate() command). After initialization, the members begin to send heartbeat messages between each other and the primary node election operation is initiated. The node that receives votes from the majority of members becomes the primary node, while the other nodes become secondary nodes.

Initialize the replica set

config ={

_id :"my_replica_set",

members :[

{_id :0, host :"rs1.example.net:27017"},

{_id :1, host :"rs2.example.net:27017"},

{_id :2, host :"rs3.example.net:27017"},

]

}

rs.initiate(config)

Definition of majority

Assume that the replica set has N voting members (as explained following). In this case, a majority is N/2+1. When the number of active members in the replica set is insufficient to provide a majority, no primary node can be elected for the replica set and it cannot provide write services. In this case, it is read-only.

Number of voting members

Majority

Failure tolerance

1

1

0

2

2

0

3

2

1

4

3

1

5

3

2

6

4

2

7

4

3

It is generally recommended that the replica set members be set to an odd number. At the first glance, the preceding table shows that replica sets with both 3 or 4 nodes can only tolerate 1 node failure. From the perspective of service availability, they provide the same results. Still, 4-node replica sets certainly can provide more reliable data storage.

Special secondary nodes

Under normal conditions, the replica set’s secondary nodes participate in primary node election (and can themselves be elected as the primary node) and sync the latest data written to the primary node to guarantee they store the same data as the primary node.

Secondary nodes can provide read services, so adding secondary nodes increase the replica set’s read service capabilities, while also improving the set’s availability. In addition, MongoDB supports flexible configurations for secondary nodes of a replica set, to meet the needs of different scenarios.

Arbiter

Arbiter nodes only participate in voting and cannot be elected as the primary node or sync data from the primary node.

For example, if you deploy a 2-node replica set with 1 primary and 1 secondary node, the replica set cannot provide services (as it cannot elect a primary node) if any node goes offline. In such a case, you can add an Arbiter node to the replica set, so that a primary node can still be elected even if one node is not working.

The Arbiter itself does not store data, so it is an extremely lightweight service. When there is an even number of replica set members, it is best to add an Arbiter node to increase the availability of the replica set.

Priority0

A Priority0 node has an election priority of 0, so it cannot be elected as the primary node.

For example, if you deploy a replica set across two data centers, A and B, and want to ensure the primary node is in data center A, you can set the priorities for the data center B replica set members to 0. This means that the primary node must be selected from the members in data center A. (Note: If you deploy the replica set in this manner, it is best to place a majority of nodes in data center A. Otherwise, it may be impossible to elect a primary node in the case of a network partition. )

Vote0

In MongoDB 3.0, replica sets may have up to 50 members, but only a maximum of 7 members can participate in primary selection. The other members (Vote0 nodes) must have a vote attribute of 0, meaning they do not participate in voting.

Hidden

Hidden nodes cannot be elected as the primary node (priority: 0) and are invisible to the Driver.

Because Hidden nodes cannot receive Driver requests, you can use them to provide data backup and perform offline computing tasks, so as not to affect the replica cluster’s service.

Delayed

Delayed nodes must be hidden nodes that have a certain data update lag compared to the primary node (this delay can be set to 1 hour, for example).

Because the data of Delayed nodes have a certain lag with respect to the primary node, when incorrect or invalid data are written to the primary node, you can use a Delayed node to restore the data to a previous time point.

Primary node election (2)

In addition to replica set initialization, primary node election also occurs in the following scenarios:

Replica set reconfiguration

When secondary nodes detect that the primary node is offline, this triggers a new primary node election.

When a primary node voluntarily performs stepDown (voluntarily downgrades itself to a secondary node), this also triggers a new primary node election.

The primary node election is affected by inter-node heartbeats, priorities, latest oplog times, and various other factors.

Node priority

Each node tends to vote for the node with the highest priority. Nodes with a priority of 0 cannot initiate a primary node election.

When the primary node discovers a secondary node with a higher priority and this secondary node has a data lag of no more than 10s, the primary node voluntarily steps down to give this secondary node the chance to be elected as the primary node.

Optime

A node must have the latest optime (timestamp of the most recent oplog) to be able to become the primary node.

Network partition

Only nodes that maintain a network connection with the majority of voting nodes can be elected as the primary node. If a primary node loses connection with most other nodes, the primary node voluntarily steps down. When a network partition occurs, multiple primary nodes may exist for a short time. In this case, if the Driver writes data, it is best to set a majority success policy. This way, even if there are multiple primary nodes, a single primary node still can write data to the majority of nodes.

Data synchronization

Oplogs are used to sync data between primary and secondary nodes. After a write operation is completed on the primary node, it writes an oplog to the special local.oplog.rs collection. Secondary nodes constantly take and apply new oplogs from the primary node.

Because oplog data is constantly added, local.oplog.rs is configured as a capped collection. When its capacity reaches the set limit, the oldest data is deleted. Also, considering that an oplog can be reapplied by secondary nodes, it must be idempotent, so that repeated application has the same result.

The oplog format, containing ts, h, op, ns, o, and other fields, is shown as follows:

ts: the operation time; current timestamp + counter, the counter is reset every second

h: the globally unique operation identifier

v: the oplog version information

op: the operation type

i: insert

u: update

d: delete

c: execute command (such as createDatabase, dropDatabase)

n: null operation, special purpose

ns: collection for operation

o: operation content, if an update operation

o2: operation query conditions; only update operations have this field

The first time a secondary node syncs data, it first executes init sync to sync all data from the primary node (or another secondary node with up-to-date data). Subsequently, it continuously executes tailable cursor to query the latest oplogs in the primary node’s local.oplog.rs collection and apply them to itself.

Init sync process

The init sync process includes the following steps:

At the time T1, all database data (excluding local data) is synced from the primary node using the listDatabases + listCollections + cloneCollection command combination. We assume that this operation is completed at time T2.

The node applies all oplogs for the period from T1 to T2 from the primary node. It is possible that this performs some operations already contained in step 1, but because of the idempotence of the oplogs, they can be reapplied.

Based on the index settings for each primary node collection, indexes for the corresponding collections are created on the secondary node. (The index for each collection_id was completed in step 1.)

Note: The oplog collection size is configured based on the size of the DB and application writing requirements. If the size is too big, this results in a waste of storage space. If the size is too small, secondary nodes may not be able to complete the init sync operation. For example, in step 1, if the DB is too large and the oplog is too small, the oplog collection cannot store all the oplogs for the period from T1 to T2. Therefore, the secondary node cannot sync the entire dataset on the primary node.

Modify the replica set configuration

When you need to modify the replica set, such as to add/delete members or change member configurations (priority, vote, hidden, delayed, and other attributes), you can use the replSetReconfig command (rs.reconfig()) to reconfigure the replica set.

For example, you can set the priority of the second replica set member to 2 by executing the following command:

cfg = rs.conf();

cfg.members[1].priority =2;

rs.reconfig(cfg);

Rollback

If the primary nodes goes offline before the data can be completely synced to secondary nodes and a new primary node is elected to which data is written, the old primary node must rollback some operations, to ensure its dataset is consistent with that of the new primary node.

The old primary node writes rollback data to the independent rollback directory. This allows database administrators to use the mongorestore command to restore data as needed.

Replica set read/write settings

Read Preference

Under normal conditions, all the replica set’s read requests are sent to the primary node. However, the Driver can set Read Preference to route read requests to another node.

primary: The default rule, by which all read requests are sent to the primary node.

primaryPreferred: Priority is given to the primary node, but if this node cannot be reached, the requests are sent to secondary nodes.

secondary: All read requests are sent to secondary nodes.

secondaryPreferred: Priority is given to the secondary nodes, but if these nodes cannot be reached, the requests are sent to the primary node.

nearest: Read requests are sent to the nearest reachable node (ping to find the nearest node).

Write Concern

Under normal conditions, the primary node completes write operations and returns the result. However, the Driver can set Write Concern (for more information, click here) to set the rules for write operation success.

The following Write Concern rule configuration indicates that data must be written to a majority of nodes before the operation is successful, which times out in 5s.

db.products.insert(

{ item:"envelopes", qty :100, type:"Clasp"},

{ writeConcern:{ w: majority, wtimeout:5000}}

)

The preceding settings apply to individual requests. You can also modify the replica set’s default Write Concern to avoid having to configure each request individually.