On 15/04/14 23:09, Matt Pietrek wrote:
> This is rabbitmq 3.2.4, running in a 2 node cluster with all queues in ha.
> At some point we saw a network partition (see below). It appears that
> Autoheal eventually worked, but afterwards the cmcmd queue wasn't on the
> broker.
> =ERROR REPORT==== 14-Apr-2014::18:02:30 ===
> ** Generic server <0.204.0> terminating
> ** Last message in was {mnesia_locker,rabbit@sea5m1mq1,granted}
> ** When Server state == {state,2,{from,<0.302.0>,#Ref<0.0.1372.163190>}}
> ** Reason for termination ==
> ** {unexpected_info,{mnesia_locker,rabbit@sea5m1mq1,granted}}
We've seen this before with short-lived partitions: after the
partition, something in Mnesia sends a stray {mnesia_locker, ..., ...}
message to a process that isn't expecting it, and that process then
terminates with unexpected_info.
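
To illustrate the failure mode (a sketch only, not RabbitMQ or Mnesia
source): a gen_server whose handle_info/2 treats every plain message as
an error will stop with a reason of this shape. The module name
strict_server is made up, and nothing in this thread says which process
<0.204.0> is or how it is actually written.

```erlang
%% Hypothetical module for illustration; not taken from RabbitMQ or Mnesia.
-module(strict_server).
-behaviour(gen_server).

-export([start/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Start unlinked so that the crash does not take the caller down with it.
start() ->
    gen_server:start(?MODULE, [], []).

init([]) ->
    {ok, no_state}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Any message that is not a gen_server call or cast ends up here.
%% This server expects none, so it stops and names the offending message.
handle_info(Msg, State) ->
    {stop, {unexpected_info, Msg}, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```

After {ok, Pid} = strict_server:start(), evaluating
Pid ! {mnesia_locker, node(), granted} in the shell should stop the
server with reason {unexpected_info, {mnesia_locker, ..., granted}},
which is the same shape as the report quoted above.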
The release notes for Erlang 17.0 contain:
    OTP-11497  To prevent a race condition if there is a short
               communication problem when node-down and node-up events
               are received. They are now stored and later checked if
               the node came up just before mnesia flagged the node as
               down. (Thanks to Jonas Falkevik)
which sounds like the same thing.
So it is quite possible that this is fixed in Erlang 17.0.
Cheers, Simon
--
Simon MacMullen
RabbitMQ, Pivotal