I am unclear on how ONoSQL should behave in the event of a network partition.
Suppose a replica group is split between two data centers that have ceased to communicate. Does the master replica stay where it is (even if it is in a network partition with only a minority of the replicas)? How does the "other" network partition "know" that it shouldn't elect a new master replica?

user12003335 wrote:
I am unclear on how ONoSQL should behave in the event of a network partition.
Suppose a replica group is split between two data centers that have ceased to communicate. Does the master replica stay where it is (even if it is in a network partition with only a minority of the replicas)? How does the "other" network partition "know" that it shouldn't elect a new master replica?

When a network partition occurs, a new election is held. A master is elected only if a majority of nodes is available to elect one. Hence, on the minority side of a network partition, no master will be elected. The replicas on the minority side may continue to service read requests as long as the consistency properties passed into the requests by the client can be satisfied. For example, Consistency.NONE_REQUIRED requests could be satisfied, but Consistency.ABSOLUTE requests could not.
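The read rule described above can be sketched roughly as follows. This is only an illustration of the intended behavior, not ONoSQL code: the `Replica` class, its fields, and `can_serve_read` are hypothetical names, and the strings "none" and "absolute" stand in for the two consistency levels mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    is_master: bool    # does this node currently hold mastership?
    in_majority: bool  # can it reach a simple majority of the group?


def can_serve_read(replica: Replica, consistency: str) -> bool:
    """Hypothetical check: may this node answer a read at this consistency?"""
    if consistency == "none":
        # No consistency requirement: any reachable node may answer,
        # even with possibly stale data.
        return True
    if consistency == "absolute":
        # Only a master that is in touch with a majority is authoritative.
        return replica.is_master and replica.in_majority
    raise ValueError(f"unknown consistency: {consistency}")


# A plain replica stranded on the minority side of the partition.
minority_replica = Replica(is_master=False, in_majority=False)
print(can_serve_read(minority_replica, "none"))      # True
print(can_serve_read(minority_replica, "absolute"))  # False
```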

"If as a result of a network partition the "minority node partition" had a node in the master state, it will continue to remain in the master state but not be able to process durable writes (since it's not in communication with a simple majority) until the partition is resolved and it notices the presence of the new master. The master on the minority side does not call for an election. This is because the master cannot distinguish between a temporary disconnect of a node and a true network partition.

A downside of having the node on the minority side continue to think that it's in the master state is that it thinks it's absolutely consistent and, as a result, may respond incorrectly to read requests with time-based or absolute consistency requirements.

So it's desirable for the master to relinquish mastership and call for an election, when it's not in touch with a simple majority, that is, it's not authoritative, but we want to avoid doing so on temporary network disconnects, since there is a cost associated with holding an election.

A solution we have discussed in the past is for a non-authoritative master to call for an election when it has been continuously in this state for some configurable amount of time."
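To make the proposed fix concrete, here is a minimal sketch of a timeout-based trigger. It is an assumption about how such a check could look, not the actual design: `MasterMonitor`, `election_timeout_s`, and `observe` are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
class MasterMonitor:
    """Hypothetical watchdog for a master that may have lost its majority."""

    def __init__(self, election_timeout_s: float):
        # Assumed user-configurable setting (name is made up for this sketch).
        self.election_timeout_s = election_timeout_s
        # Time at which the master first noticed it had no majority.
        self.non_authoritative_since = None

    def observe(self, now: float, has_majority: bool) -> bool:
        """Record one observation; return True if an election should be called."""
        if has_majority:
            # A temporary disconnect has healed: reset, no election needed.
            self.non_authoritative_since = None
            return False
        if self.non_authoritative_since is None:
            self.non_authoritative_since = now
        # Call for an election only after being continuously
        # non-authoritative for the whole timeout.
        return now - self.non_authoritative_since >= self.election_timeout_s


monitor = MasterMonitor(election_timeout_s=30.0)
print(monitor.observe(0.0, has_majority=True))    # False
print(monitor.observe(10.0, has_majority=False))  # False
print(monitor.observe(40.0, has_majority=False))  # True
```

Resetting the timer whenever a majority is seen again is what keeps short network blips from paying the cost of an election.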

If I have a 7 node replication group, and 4 crash, the remaining nodes will not elect a new master and will not be able to accept writes until I bring the crashed nodes back?

That's correct. Break the above into two scenarios: (1) the master was one of the 4 nodes that crashed, and (2) the master was one of the 3 surviving nodes. In case (1), there is no surviving master, so there is no node to accept writes. Further, since there is not a majority, no master can be elected. In case (2), a master survives, but it will not be able to commit any write requests sent to it, assuming that all write requests specify a durability with a replica ack policy of Simple Majority or All. The above post is pointing out that in case (2), an election will not be held and mastership will remain on the minority side.

If I have a 7 node replication group, and 4 crash, the remaining nodes will not elect a new master and will not be able to accept writes until I bring the crashed nodes back?

Or, bring back at least one of the crashed nodes. As long as the total number of nodes that are up and communicating forms a majority (assuming the usual default "majority" configuration), you're back in business.
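The arithmetic for the 7-node example can be checked with a few lines. This is just a worked example; `simple_majority` and `has_quorum` are names made up for it.

```python
def simple_majority(group_size: int) -> int:
    """Smallest number of nodes that is more than half of the group."""
    return group_size // 2 + 1


def has_quorum(nodes_up: int, group_size: int) -> bool:
    """True if the communicating nodes form a simple majority."""
    return nodes_up >= simple_majority(group_size)


print(simple_majority(7))  # 4
print(has_quorum(3, 7))    # False: 4 of 7 nodes crashed
print(has_quorum(4, 7))    # True: one crashed node brought back
```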

In such a case, what do all the other nodes and clients know about the master for that RG?
Do they have the same master for that RG?
Can the client know that the master on the minority is not the true master?
I am wondering whether a timeout alone is sufficient to trigger a new election.
An election should be held when the other partition is already online or when all communicating nodes are in the majority. I think defining just a time will not be sufficient.
Can you clarify?
Thanks

893771 wrote:
Does that mean that, for this release, no election will be held for a non-authoritative master?

Correct. We have an SR open to provide for an election when the nodes notice they are in this state after some configurable amount of time.

In such a case, what do all the other nodes and clients know about the master for that RG?

They know which node was, and still is, the master. If the master is a node in the minority group, they know which one is the master. Assuming clients can reach that node (i.e. they are not on the wrong side of the network split), then they will continue to send write requests to that node. Assuming the clients specify a durability of simple majority, and assuming that the master can still not reach a majority of the nodes, these write requests will be rejected (because the durability constraints can't be satisfied).

Do they have the same master for that RG?

There is only one master at any given time.

Can the client know that the master on the minority is not the true master?

Assuming the client specifies a durability of simple majority, and assuming that the master can not reach a majority of the nodes to commit the transaction, the write requests will be rejected. The client will still think that the node is a master because, after all, at some point the other nodes in the group may reappear and write requests will then succeed. Until there is an election, the node that is the master is the only master in the system that any other nodes know about.
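As a rough sketch of why those writes are rejected, the check below compares the nodes a master can reach with what the ack policy demands. It is a simplified model, not the ONoSQL implementation: `write_commits` is a hypothetical function, the policy names are plain strings, and `reachable` is assumed to count the master itself plus the replicas it can contact.

```python
def write_commits(policy: str, group_size: int, reachable: int) -> bool:
    """Hypothetical durability check for a write sent to the master.

    reachable counts the master plus the replicas it can still contact.
    """
    needed = {
        "NONE": 1,                               # master alone is enough
        "SIMPLE_MAJORITY": group_size // 2 + 1,  # more than half the group
        "ALL": group_size,                       # every node must ack
    }[policy]
    return reachable >= needed


# Master stranded with 2 replicas in a 7-node group (3 nodes reachable).
print(write_commits("SIMPLE_MAJORITY", 7, 3))  # False: write is rejected
print(write_commits("SIMPLE_MAJORITY", 7, 4))  # True: majority is back
```

In this model a policy that requires no replica acks would still commit on the minority master, which is why the answers above are careful to assume simple majority.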

I am wondering whether a timeout alone is sufficient to trigger a new election.
An election should be held when the other partition is already online or when all communicating nodes are in the majority. I think defining just a time will not be sufficient.

If other nodes come on line, and if those nodes coming on line form a majority, then writes will proceed and there will be no need for an election. The "fix" we are thinking of doing is that when the system notices that there is not a majority, then after some user-configurable time, an election will be called for. This is different from the case you mention where some of the missing nodes come back up. In that case, no election is necessary.

I am missing something. I don't understand why there will be no elections.
The scenario is the following:
The master is in the minority partition. Why can't nodes in the majority partition ask for an election?
They will see the absence of a heartbeat. So why can't we have one master on each partition?
And if it is possible to have a master on each partition, can you please re-answer the preceding questions?
Thanks for your answer.

Yes, the nodes in the majority partition will hold an election. Yes, there would be two masters. However, assuming the clients request simple_majority for all requests, the master in the minority partition will not ack commits because the rep ack policy can't be met.

The issue we mentioned earlier is that the nodes in the minority partition should call for an election, but do not. We believe that our fix would be to have them call for an election when they notice that they are in this state after a configurable amount of time.

That means you will have to deal with conflict resolution in some cases (depending on configuration).
From your point of view, will the client know the new master on the majority partition or the previous master on the minority partition?