Distributed Deadlock when restarting the GemfireXD Servers

Applies to

Purpose

This document describes workarounds and a solution for a distributed deadlock that occurs during AsyncEventListener queue recovery when GemfireXD servers are restarting.

Symptom

GemfireXD servers hang and fail to restart with the following logs:

Server1 log snippet:

[info 2015/05/26 18:22:56.977 CST <CacheServerLauncher#serverConnector> tid=0xd] Region /AsyncEventQueue_Listener2_SERIAL_GATEWAY_SENDER_QUEUE has potentially stale data. It is waiting for another member to recover the latest data.
My persistent id:
DiskStore ID: 33fe3f15-dae9-4433-8297-b2993b492914
Name:
Location: /172.16.43.142:/home/gpadmin/gemfirexdpnp/nodeserver01/./Listener2_DS
Members with potentially new data:
[
DiskStore ID: 3bfecd04-1cad-43a9-9187-68179bd6307e
Name:
Location: /172.16.43.142:/home/gpadmin/gemfirexdpnp/nodeserver02/./Listener2_DS
]
Use the "gfxd list-missing-disk-stores" command to see all disk stores that are being waited on by other members.
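
When several servers log this message at once, it can help to extract who is waiting on whom so that a circular wait becomes visible. The following is a minimal sketch, assuming the message layout matches the Server1 snippet above. The function name parse_stale_region_message and the dictionary keys are hypothetical and are not part of GemfireXD.

```python
import re

# Sample text copied from the Server1 log snippet in this article.
SAMPLE = """\
[info 2015/05/26 18:22:56.977 CST <CacheServerLauncher#serverConnector> tid=0xd] Region /AsyncEventQueue_Listener2_SERIAL_GATEWAY_SENDER_QUEUE has potentially stale data. It is waiting for another member to recover the latest data.
My persistent id:
DiskStore ID: 33fe3f15-dae9-4433-8297-b2993b492914
Name:
Location: /172.16.43.142:/home/gpadmin/gemfirexdpnp/nodeserver01/./Listener2_DS
Members with potentially new data:
[
DiskStore ID: 3bfecd04-1cad-43a9-9187-68179bd6307e
Name:
Location: /172.16.43.142:/home/gpadmin/gemfirexdpnp/nodeserver02/./Listener2_DS
]
"""

REGION_RE = re.compile(r"Region (\S+) has potentially stale data")
ID_RE = re.compile(r"DiskStore ID:\s*([0-9a-fA-F-]+)")
LOC_RE = re.compile(r"Location:\s*(\S+)")


def _stores(block):
    """Pair each disk store ID in the block with its location, in order."""
    return list(zip(ID_RE.findall(block), LOC_RE.findall(block)))


def parse_stale_region_message(text):
    """Return the region, the waiting disk store, and the disk stores it waits on.

    Returns None when the text does not contain a stale-data message.
    """
    region = REGION_RE.search(text)
    if region is None:
        return None
    # Everything before the marker describes this member; everything after
    # it lists the members that may hold newer data.
    own, _, others = text.partition("Members with potentially new data:")
    return {
        "region": region.group(1),
        "waiting": _stores(own),
        "newer": _stores(others),
    }


if __name__ == "__main__":
    info = parse_stale_region_message(SAMPLE)
    print(info["region"])
    for store_id, location in info["waiting"]:
        print("waiting:", store_id, location)
    for store_id, location in info["newer"]:
        print("waits on:", store_id, location)
```

Running the same function over each server's log and comparing the "waiting" and "newer" entries shows whether two members are each waiting for the other's disk store.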