Recover from a Partially Connected Cache Cluster Without Any Downtime

Partially connected cluster

Partial connectivity means that two or more cache servers are connected to each other, but the cluster as a whole is not fully connected. For example, the active partition on one cache server may no longer be connected to its replica on another cache server, even though the active partition on that other server is still connected to its replica on the original server. Alternatively, one of the cache servers may be completely disconnected from the other servers in the cluster.

Additionally, in a Partition-Replica Cache, each cache server contains one active partition and one replica partition. The replica is passive and is accessed only by its active partition. At the cache cluster layer, however, the active partition and the replica are each seen as independent “nodes”. So, a 3-server Partition-Replica Cache forms a “6 node” cluster.
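The node arithmetic above can be sketched in a few lines of Python. This is only an illustration and not part of NCache: the function names are hypothetical, and the per-node connection count assumes a healthy cluster in which every node is connected to every other node.

```python
def expected_node_count(servers: int) -> int:
    """Each server hosts one active and one replica partition, so two nodes."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return 2 * servers


def expected_connections_per_node(servers: int) -> int:
    """In a fully connected cluster, a node is connected to all other nodes."""
    return expected_node_count(servers) - 1


print(expected_node_count(3))            # 6
print(expected_connections_per_node(3))  # 5
```

A node that lists fewer connections than this expected value is a hint that the cluster is only partially connected.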

How to detect partial connectivity

Use the View Cluster Connectivity tab in NCache Manager

Right-click your cache name in NCache Manager and then choose the View cluster connectivity option.

This will open another window with cluster connectivity status. You can use this tab to verify if your cache cluster is fully connected or partially connected.

Fully connected cache cluster:

The example below shows a fully connected (healthy) cache cluster. There are 3 servers in the cluster and 6 “nodes”, so each “node” should be connected to the 5 other “nodes”, as shown in the “Connected to Nodes” column.

Partially connected cache cluster

The example below shows a partially connected cache cluster in which 20.200.20.101 has lost connectivity with its replica on 20.200.20.102 and is therefore missing a connection to the 20.200.20.102 node. Hence, fewer nodes are listed for it in the “Connected to Nodes” column.

Partially connected cluster with split brain

The example below shows another partially connected cache, this time with a split brain: 20.200.20.102 has completely lost connectivity to the other two nodes and therefore shows the Single Node cache Cluster status. In addition, 20.200.20.100 and 20.200.20.101 show the Partially Connected status and are missing 20.200.20.102 in the “Connected to Nodes” column.

| Node Address | Connected to Nodes | Status |
| --- | --- | --- |
| 20.200.20.100 | 20.200.20.100, 20.200.20.101, 20.200.20.101 | Partially Connected |
| 20.200.20.101 | 20.200.20.101, 20.200.20.100, 20.200.20.100 | Partially Connected |
| 20.200.20.102 | --- | Single Node cache Cluster |

Figure 3: Split brain in partially connected cache cluster

How to fix partial connectivity

You have to restart one or more cache servers to fix partial connectivity. In a 2-server cluster, you only need to restart one of the cache servers. In a 3-server cluster, you may have to restart 2 cache servers.

Identify problem node

If the cache cluster nodes are in a partially connected state, pick the cache server whose status is Single Node cache Cluster as the problem node. This is the split-brain scenario shown above in Figure 3.

OR

If no server shows the Single Node cache Cluster status, pick the node that has the fewest IP addresses listed in its Connected to Nodes column in the cluster connectivity window. This is the partially connected cache scenario shown above in Figure 2.

AND/OR

Open the cluster health window in the NCache Monitor tool and pick the node that has the fewest clients in the Clients column.

AND/OR

Pick the node whose Request/sec counter value is lower than that of the other nodes.
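The selection rules above can be expressed as a small Python sketch. It is a hedged illustration only: `NodeRow` and `pick_problem_node` are hypothetical names, the values are assumed to be copied by hand from NCache Manager and NCache Monitor, and the order in which the AND/OR criteria break ties (connected addresses, then clients, then request rate) is an assumption rather than something NCache prescribes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeRow:
    """One row of the cluster connectivity window (hypothetical model)."""
    address: str
    connected_to: List[str] = field(default_factory=list)
    status: str = ""
    clients: int = 0               # Clients column in NCache Monitor
    requests_per_sec: float = 0.0  # Request/sec counter value


def pick_problem_node(rows: List[NodeRow]) -> str:
    """Return the address of the node to treat as the problem node."""
    if not rows:
        raise ValueError("no nodes to inspect")
    # Split brain: a node reporting a single-node cluster is the problem node.
    for row in rows:
        if "single node" in row.status.lower():
            return row.address
    # Otherwise: fewest connected addresses, then fewest clients,
    # then lowest request rate (assumed tie-break order).
    worst = min(
        rows,
        key=lambda r: (len(r.connected_to), r.clients, r.requests_per_sec),
    )
    return worst.address


# Rows copied from Figure 3.
figure3 = [
    NodeRow("20.200.20.100",
            ["20.200.20.100", "20.200.20.101", "20.200.20.101"],
            "Partially Connected"),
    NodeRow("20.200.20.101",
            ["20.200.20.101", "20.200.20.100", "20.200.20.100"],
            "Partially Connected"),
    NodeRow("20.200.20.102", [], "Single Node cache Cluster"),
]
print(pick_problem_node(figure3))  # 20.200.20.102
```

If several nodes tie on every criterion, `min` returns the first one in the list, so in that case the choice should still be confirmed manually.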

Stopping cache on that node only

Once a cache cluster is in a partially connected state, it requires manual intervention to recover. Here are the steps to resolve this problem:

Once the problem node is identified, right-click that node’s IP address in NCache Manager under your cache name and choose Stop. This stops the cache on this node only.

You can also do the same with our command-line tool stopcache, using the node's IP address.

Start your cache again. You can do this in NCache Manager by right-clicking the node's IP address under your cache name and choosing the Start option.
You can also use our command-line tool startcache, using the node's IP address.
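
If you script the recovery, the order of the two steps matters: the cache must be stopped on the problem node before it is started again. The Python sketch below illustrates only that ordering. The helper name `restart_cache_on_node` is hypothetical, and the exact arguments of stopcache and startcache are not shown in this article, so the caller supplies each complete command line. It also assumes that the tools exit with a non-zero code on failure.

```python
import subprocess
from typing import Callable, Sequence


def restart_cache_on_node(
    stop_command: Sequence[str],
    start_command: Sequence[str],
    run: Callable[..., object] = subprocess.run,
) -> None:
    """Stop the cache on the problem node, then start it again."""
    # check=True makes subprocess.run raise CalledProcessError on a non-zero
    # exit code, so the start step is skipped if the stop step fails.
    run(list(stop_command), check=True)
    run(list(start_command), check=True)
```

The `run` parameter exists so the sequence can be rehearsed with a stand-in function before it is pointed at a real cluster.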