There are a number of states that a RabbitMQ node can be in when in a cluster, and some of these states are known to cause upgrades to fail with a "device busy" error.

Resolution

The first thing you will want to do when considering an upgrade of a RabbitMQ deployment is to ascertain the state of each RabbitMQ node in the cluster. To successfully upgrade a RabbitMQ cluster, the operator should ensure that each node in the cluster is in an upgradeable state. So there are two areas to examine:

1. How to identify which state a given node is in

There are two tools which are invaluable in discerning the state of a given RabbitMQ server node. Both require that you bosh ssh onto the node in question and sudo su - to get system privileges. The incantations are:

watch monit status

PATH=$PATH:/var/vcap/packages/erlang/bin /var/vcap/packages/rabbitmq-server/bin/rabbitmqctl status

To determine the state of a given RabbitMQ server node, run the incantations above and cross-reference the results below. The following two states are the known states from which an upgrade will always succeed. If you see one of these states and the upgrade still fails, please let us know by opening a support request, and collect logs from all RabbitMQ node VMs.

To maximise the chances of a successful upgrade, all rabbitmq-server nodes in a given cluster should be in either state Running/Up or state Running/Clusterer. It is perfectly OK for a cluster to contain some nodes in state Running/Up and other nodes in state Running/Clusterer. The important thing is that no node in the cluster should be in any other state. If you do find that another state is possible, please record it so that engineering can investigate, and continue working to get that node into a known good state.

2. For each possible state, how to move into an upgradeable state

2.1 Monit state: Trying - RabbitMQ: Up

In this state, Rabbit is running in an Erlang VM, but monit has lost track of it. Monit knows that Rabbit should be running, so it runs Rabbit's start script. This fails, because Rabbit has already started. Monit tries again, and keeps trying forever.

To move from this state to state Running/Clusterer, all we have to do is stop Rabbit. This puts the node in a state from which monit is able to successfully bring Rabbit back up and resume control. To achieve this, run:

/var/vcap/jobs/rabbitmq-server/bin/rabbitmq-server.init stop

This should exit 0, and the node should enter the state Running/Clusterer.
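The stop step above can be sketched as a small wrapper that checks the exit status before handing control back to monit. This is a minimal sketch: the init-script path is the one quoted in this section, while the wrapper function name and the overridable path argument are our own additions for illustration.

```shell
# Hedged sketch of the 2.1 remediation: stop Rabbit so that monit can
# successfully restart it. The default init-script path comes from this
# article; the first argument lets you point at a different script.
stop_rabbit() {
  local init_script="${1:-/var/vcap/jobs/rabbitmq-server/bin/rabbitmq-server.init}"

  if "$init_script" stop; then
    # Exit 0: monit should now bring Rabbit back up (Running/Clusterer).
    echo "rabbit stopped; monit should now restart it (Running/Clusterer)"
  else
    # Non-zero exit: do not assume the node is in a good state.
    echo "stop failed (exit $?); investigate before retrying" >&2
    return 1
  fi
}
```

On a real node you would simply call `stop_rabbit` with no arguments and then confirm the transition with `watch monit status`.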

2.2 Monit state: Stopped - RabbitMQ: Down

In this state, monit believes Rabbit to be deliberately down, and Rabbit is, in fact, down. From here it is possible to move to Running/Clusterer (which is a good state) by running the following from any machine on which the BOSH director is targeted:

bosh start $JOB_NAME

where $JOB_NAME is the name of the BOSH job that corresponds to this Rabbit server node, for example rmq_z0/0.

2.3 Monit state: BoshStopped - RabbitMQ: Down

This state means that BOSH has attempted to stop all the services and has removed all the monit spec files. The rabbitmq-server service will not be displayed in monit status output. The Erlang VM is running but does not report the clusterer or Rabbit as running. From here we can get to Running/Clusterer (which is a good state) by running the following on a machine that is targeting your BOSH director:

bosh start $JOB_NAME

where $JOB_NAME is the name of the BOSH job that corresponds to this Rabbit server node, for example rmq_z0/0.

3. How to determine the state of the RabbitMQ system

3.1 Monit states

To determine the monit state of a node, the command to invoke is:

watch monit status

For those who have used monit before: notice that we require the full information from monit status; a monit summary is not enough. The key here is that we need both the status and the monitoring status of rabbitmq-server. Sometimes PID-related information can also be reassuring, but it is possible to distinguish between all of our states without it. For ease of reference, these key lines are highlighted in bold in the output below.

3.1.1 Running

This state means that monit (and BOSH) believes that it is monitoring rabbitmq-server and that it is running.

3.1.2 Trying

This state means that monit is attempting to monitor rabbitmq-server but believes that rabbitmq-server is not running. It is attempting to launch rabbitmq-server, failing to do so, and cycling between these outputs roughly every 30 seconds.

3.1.3 BoshStopped

This is the state when the node has been stopped via bosh stop. This removes all traces of the monit spec files related to the rabbitmq-server service, so monit does not even know of its existence: the service is not monitored, and monit has no knowledge of the state of rabbitmq-server.
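The two key lines of the monit output can be pulled out with a small filter, which may help when comparing many nodes. This is a minimal sketch: the grep pattern targets the "status" and "monitoring status" lines described above, and the sample file below is illustrative (our own fabrication), not output captured from a real node.

```shell
# Hedged sketch: extract the two key lines ("status" and
# "monitoring status") from a captured "monit status" run.
monit_key_lines() {
  grep -E '^[[:space:]]*(status|monitoring status)[[:space:]]' "$1"
}

# Illustrative sample of what "monit status" might print for the
# rabbitmq-server process on a node in the Running state:
cat > /tmp/monit-status.txt <<'EOF'
Process 'rabbitmq-server'
  status                            running
  monitoring status                 monitored
EOF

monit_key_lines /tmp/monit-status.txt
```

On a real node you would capture the output first, e.g. `monit status > /tmp/monit-status.txt`, then filter it.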

3.2 Rabbit states

This is separate from the monit state because the Erlang VM has its own opinion of what it means for RabbitMQ to be running. To determine the Rabbit state of a node, the command to invoke is:

PATH=$PATH:/var/vcap/packages/erlang/bin /var/vcap/packages/rabbitmq-server/bin/rabbitmqctl status

There are two key lines we are looking for in all of these command outputs:

{rabbit,"RabbitMQ","3.6.2"},

[{rabbitmq_clusterer,"Declarative RabbitMQ clustering",[]},

It is never the case that Rabbit runs without the clusterer. So there are three possible states: both Rabbit and the clusterer are running; only the clusterer is running; neither Rabbit nor the clusterer are running.
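The three-way decision above can be sketched as a small classifier that greps captured rabbitmqctl status output for the two key lines quoted earlier. The classification order relies on the fact stated above (Rabbit never runs without the clusterer); the function name and the sample file below are our own illustrative additions, not real captured output.

```shell
# Hedged sketch: classify the Rabbit state (Up / Clusterer / Down) from
# a file containing captured "rabbitmqctl status" output.
rabbit_state() {
  local out="$1"
  if grep -q '{rabbit,"RabbitMQ"' "$out"; then
    echo Up          # both Rabbit and the clusterer are running
  elif grep -q 'rabbitmq_clusterer' "$out"; then
    echo Clusterer   # only the clusterer is running
  else
    echo Down        # neither Rabbit nor the clusterer is running
  fi
}

# Illustrative sample where only the clusterer appears in
# running_applications (i.e. the Clusterer state):
cat > /tmp/rabbitmqctl-status.txt <<'EOF'
 {running_applications,
     [{rabbitmq_clusterer,"Declarative RabbitMQ clustering",[]}]},
EOF

rabbit_state /tmp/rabbitmqctl-status.txt
```

On a real node you would capture the output of the rabbitmqctl status incantation from section 3.2 and feed the file to the classifier.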

3.2.1 Up

This state means that this node is running both the "RabbitMQ" application and the clusterer. This can be seen in the running_applications section in the following output.

3.2.3 Clusterer

This state means that the RabbitMQ clusterer plugin is active and waiting for the other nodes of the cluster to come online. This node would be considered non-functioning because the "RabbitMQ" application is not listed in the running_applications output. Notice that while the clusterer is clearly running, the RabbitMQ application is entirely missing from the output.