About this guide

Developing a robust application, be it a message publisher or a message
consumer, involves dealing with multiple kinds of failures: protocol
exceptions, network failures, broker failures and so on. Correct error
handling and recovery is not easy. This guide explains how the amqp gem
helps you deal with issues such as initial broker connection failures,
network connection interruptions, connection-level and channel-level
exceptions, and broker failures.

Covered versions

Code examples

There are several
examples
in the git repository dedicated to the topic of error handling and
recovery. Feel free to contribute new examples.

Initial broker connection failures

When applications connect to the broker, they need to handle connection
failures. Networks are not 100% reliable. Even with modern system
configuration tools like Chef or Puppet, misconfigurations happen, and the
broker might also be down. Error detection should happen as early as
possible. There are two ways of detecting TCP connection failure. The first
one is to catch an exception:

    begin
      AMQP.start(connection_settings) do |connection, open_ok|
        raise "This should not be reachable"
      end
    rescue AMQP::TCPConnectionFailed => e
      puts "Caught AMQP::TCPConnectionFailed => TCP connection failed, as expected."
    end

AMQP.connect (and AMQP.start) will raise
AMQP::TCPConnectionFailed if a connection fails. Code that catches
it can write to a log about the issue or use retry to execute the
begin block one more time. Because initial connection failures are often
due to misconfiguration or a network outage, reconnecting to the same
endpoint (hostname, port, vhost combination) is likely to result in the
same issue over and over. TBD: failover, connection to the cluster.
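Since retrying forever against a broken endpoint is rarely useful, the retry approach can be bounded. The following is a minimal sketch, not part of the amqp gem: `with_connection_retries`, `max_attempts` and `base_delay` are hypothetical names, and the exception class is passed in so the helper itself does not depend on the gem.

```ruby
# Hypothetical helper (not part of the amqp gem): runs the block and retries
# when `error_class` is raised, sleeping a little longer before each new attempt.
def with_connection_retries(error_class, max_attempts: 3, base_delay: 1.0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue error_class => e
    # Give up and let the exception propagate once attempts are exhausted.
    raise if attempts >= max_attempts
    warn "Attempt #{attempts} failed (#{e.class}), retrying..."
    sleep(base_delay * attempts)
    retry
  end
end

# Intended usage with the gem (not executed here):
#
#   with_connection_retries(AMQP::TCPConnectionFailed) do
#     AMQP.start(connection_settings) { |connection, open_ok| ... }
#   end
```

After the last attempt the original exception is re-raised, so the application can still log it or exit.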

An alternative way of handling connection failure is with an errback
(a callback for a specific kind of error):

    handler = Proc.new { |settings| puts "Failed to connect, as expected"; EventMachine.stop }
    connection_settings = {
      :port     => 9689,
      :vhost    => "/amq_client_testbed",
      :user     => "amq_client_gem",
      :password => "amq_client_gem_password",
      :timeout  => 0.3,
      :on_tcp_connection_failure => handler
    }

Full example:

    require "rubygems"
    require "amqp"

    puts "=> TCP connection failure handling with a callback"
    puts

    handler = Proc.new { |settings| puts "Failed to connect, as expected"; EM.stop }
    connection_settings = {
      :port     => 9689,
      :vhost    => "/amq_client_testbed",
      :user     => "amq_client_gem",
      :password => "amq_client_gem_password",
      :timeout  => 0.3,
      :on_tcp_connection_failure => handler
    }

    AMQP.start(connection_settings) do |connection, open_ok|
      raise "This should not be reachable"
    end

The :on_tcp_connection_failure option accepts any object that responds to
#call.

If you connect to the broker from code in a class (as opposed to
top-level scope in a script), Object#method can be used to pass an object
method as a handler instead of a Proc.
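As an illustration of the Object#method approach, here is a small sketch. `BrokerConnector` and its method names are hypothetical; only the :on_tcp_connection_failure, :host and :port options come from the examples in this guide.

```ruby
# Hypothetical class: keeps connection settings and the failure handler together.
class BrokerConnector
  attr_reader :failed_settings

  def initialize(host, port)
    @host = host
    @port = port
  end

  # Object#method turns #tcp_connection_failed into a Method object.
  # Method objects respond to #call, which is all the option requires.
  def connection_settings
    {
      :host => @host,
      :port => @port,
      :on_tcp_connection_failure => method(:tcp_connection_failed)
    }
  end

  # Invoked with the settings of the attempt that failed.
  def tcp_connection_failed(settings)
    @failed_settings = settings
    warn "Failed to connect to #{settings[:host]}:#{settings[:port]}"
  end
end

# Intended usage with the gem (not executed here):
#
#   connector = BrokerConnector.new("localhost", 5672)
#   AMQP.start(connector.connection_settings) { |connection, open_ok| ... }
```

Keeping the handler on the same object as the settings means the failure path can reach any state the object holds, such as a logger or a list of alternative endpoints.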

Authentication failures

Another reason why a connection may fail is authentication failure.
Handling authentication failure is very similar to handling initial TCP
connection failure:

    require "rubygems"
    require "amqp"

    puts "=> Authentication failure handling with a callback"
    puts

    handler = Proc.new { |settings| puts "Failed to connect, as expected"; EM.stop }
    connection_settings = {
      :port     => 5672,
      :vhost    => "/amq_client_testbed",
      :user     => "amq_client_gem",
      :password => "amq_client_gem_password_that_is_incorrect #{Time.now.to_i}",
      :timeout  => 0.3,
      :on_tcp_connection_failure => handler,
      :on_possible_authentication_failure => Proc.new { |settings|
        puts "Authentication failed, as expected, settings are: #{settings.inspect}"

        EM.stop
      }
    }

    AMQP.start(connection_settings) do |connection, open_ok|
      raise "This should not be reachable"
    end

Default handler

The default handler raises AMQP::PossibleAuthenticationFailureError:

    #!/usr/bin/env ruby
    # encoding: utf-8

    require "rubygems"
    require "amqp"

    puts "=> Authentication failure handling with a rescue block"
    puts

    handler = Proc.new { |settings| puts "Failed to connect, as expected"; EM.stop }
    connection_settings = {
      :port     => 5672,
      :vhost    => "/amq_client_testbed",
      :user     => "amq_client_gem",
      :password => "amq_client_gem_password_that_is_incorrect #{Time.now.to_i}",
      :timeout  => 0.3,
      :on_tcp_connection_failure => handler
    }

    begin
      AMQP.start(connection_settings) do |connection, open_ok|
        raise "This should not be reachable"
      end
    rescue AMQP::PossibleAuthenticationFailureError => afe
      puts "Authentication failed, as expected, caught #{afe.inspect}"
      EventMachine.stop if EventMachine.reactor_running?
    end

In case you are wondering why the callback name has "possible" in it: the
AMQP 0.9.1 spec requires broker implementations to simply close the TCP
connection without sending any more data when an exception (such as
authentication failure) occurs before the AMQP connection is open. In
practice, however, when the broker closes the TCP connection after it was
successfully established but before the AMQP connection is open, it means
that authentication has failed.

Handling network connection interruptions

Network connectivity issues are a sad fact of life in modern software
systems. Even small products and projects these days consist of multiple
applications, often running on more than one machine. The Ruby amqp gem
detects TCP connection failures and lets you handle them by defining a
callback using
AMQP::Session#on_tcp_connection_loss.
That callback will be run when the TCP connection fails, and will be passed
two parameters: the connection object and the settings of the last
successful connection.

Sometimes it is necessary for other entities in an application to
react to network failures. amqp gem 0.8.0 and later provides a number
of event handlers to make this task easier for developers. This set of
features is known as the "shutdown protocol" (the word "protocol" here
means "API interface" or "behavior", not network protocol).

AMQP::Session, AMQP::Channel, AMQP::Exchange, AMQP::Queue and
AMQP::Consumer all implement the shutdown protocol, and thus the
error handling API is consistent for all classes, although AMQP::Session
and AMQP::Channel have a few methods that other entities do not
have.

The Shutdown protocol revolves around two events:

Network connection fails

Broker closes AMQP connection (or channel)

In this section, we will concentrate on the former. When a network
connection fails, the underlying networking library detects it and runs
a piece of code on AMQP::Session to handle it. That, in
turn, propagates this event to channels, channels propagate it to
exchanges and queues, queues propagate it to their consumers (if any).
Each of these entities in the object graph can react to network
interruption by executing application-defined callbacks.

Shutdown Protocol methods on AMQP::Session

AMQP::Session#on_tcp_connection_loss

AMQP::Session#on_connection_interruption

The difference between these methods is that
AMQP::Session#on_tcp_connection_loss
is used to define a callback that will be executed once when TCP
connection fails. It is possible that reconnection attempts will not
succeed immediately, so there will be subsequent failures. To react to
those, AMQP::Session#on_connection_interruption method is used.

The first argument that both of these methods yield to the handler that
your application defines is the connection itself. This is done to make
sure that you can register Ruby objects as handlers, and they do not
have to keep any state around (for example, connection instances).
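A handler object of that kind might look like the sketch below. `NetworkFailureLogger` and its method names are hypothetical. The sketch assumes that the TCP connection loss callback receives the connection and the settings, as described above, and it ignores any extra arguments passed to the interruption callback.

```ruby
# Hypothetical handler object: it holds no reference to the connection,
# because the connection is handed to it as the first argument.
class NetworkFailureLogger
  attr_reader :events

  def initialize
    @events = []
  end

  def tcp_connection_lost(connection, settings)
    @events << [:tcp_connection_loss, connection]
  end

  # Extra arguments, if the gem passes any, are ignored.
  def connection_interrupted(connection, *_rest)
    @events << [:connection_interruption, connection]
  end

  # `connection` is expected to be an AMQP::Session.
  def attach_to(connection)
    connection.on_tcp_connection_loss(&method(:tcp_connection_lost))
    connection.on_connection_interruption(&method(:connection_interrupted))
    self
  end
end
```

One logger instance can be attached to several connections, since each event carries the connection it belongs to.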

Note that AMQP::Session#on_connection_interruption
callback is called before this event is propagated to channels,
queues and so on.

Different applications handle connection failures differently. It is
very common to use
AMQP::Session#reconnect method to
schedule a reconnection to the same host, or use
AMQP::Session#reconnect_to to connect to a different
one.
For some applications it is OK to simply exit and wait to be restarted
at a later point in time, for example, by a process monitoring system
like Nagios or Monit.
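The two reactions described above can be sketched as follows. `handle_connection_loss_with` and its `strategy` argument are hypothetical. The `reconnect(false, delay)` call mirrors the one in the automatic recovery example later in this guide, and the reading of its arguments (do not force, retry after `delay` seconds) is an assumption.

```ruby
# Hypothetical helper: picks a reaction to TCP connection loss.
# `connection` is expected to be an AMQP::Session.
def handle_connection_loss_with(connection, strategy, delay = 2)
  connection.on_tcp_connection_loss do |conn, settings|
    case strategy
    when :reconnect
      # Assumed meaning of the arguments: do not force, retry after `delay` seconds.
      conn.reconnect(false, delay)
    when :exit
      # Leave restarting to an external process monitor.
      warn "Connection to #{settings[:host]} lost, exiting"
      exit(1)
    end
  end
end
```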

Shutdown Protocol methods on AMQP::Channel

AMQP::Channel provides only one method,
AMQP::Channel#on_connection_interruption,
which registers a callback similar to the one seen in the previous
section.

Note that the
AMQP::Channel#on_connection_interruption
callback is called after this event is propagated to exchanges,
queues and so on. Right after that, channel state is reset, except for
error handling/recovery-related callbacks.

Many applications do not need per-channel network
failure handling.
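For applications that do need it, a per-channel handler could be sketched like this. `track_interrupted_channels` is a hypothetical helper, and the sketch assumes the callback receives the channel as its first argument.

```ruby
# Hypothetical helper: tracks which channels are currently unusable.
# Each `channel` is expected to be an AMQP::Channel.
def track_interrupted_channels(channels, interrupted = [])
  channels.each do |channel|
    channel.on_connection_interruption do |ch, *_rest|
      # By the time this runs, exchanges and queues on the channel
      # have already been notified.
      interrupted << ch.id
    end
  end
  interrupted
end
```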

Shutdown Protocol methods on AMQP::Exchange

AMQP::Exchange provides only one method,
AMQP::Exchange#on_connection_interruption,
which registers a callback similar to the one seen in the previous
section.

Recovering from network connection failures

Detecting network connection failures is nearly useless if an AMQP-based
application cannot recover from them. Recovery is the hard part in
"error handling and recovery". Fortunately, the recovery process for
many applications follows one simple scheme that the amqp gem can
perform automatically for you.

The recovery process, both manual and automatic,
always begins with re-opening an AMQP connection and then all the
channels on that connection.

Manual recovery

Similarly to the Shutdown Protocol, the amqp gem entities implement a
Recovery Protocol. The Recovery Protocol consists of three methods that
connections, channels, queues, consumers and exchanges all implement:

AMQP::Session#before_recovery

AMQP::Session#auto_recover

AMQP::Session#after_recovery

AMQP::Session#before_recovery lets application developers register a
callback that will be executed after the TCP connection is re-established
but before the AMQP connection is reopened.
AMQP::Session#after_recovery is similar except that the callback is
run after the AMQP connection is reopened.
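As a rough sketch of how these two callbacks could be used together, the hypothetical helper below records when each stage is reached. The arguments the gem passes to the callbacks are not described in this guide, so the blocks accept and ignore any arguments.

```ruby
# Hypothetical helper: records recovery progress for a connection.
# `connection` is expected to be an AMQP::Session.
def trace_recovery(connection, log = [])
  # Runs after TCP is re-established but before the AMQP connection is reopened.
  connection.before_recovery { |*_args| log << :tcp_reestablished }
  # Runs once the AMQP connection is open again.
  connection.after_recovery { |*_args| log << :amqp_connection_reopened }
  log
end
```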

Recovery process for AMQP applications usually involves the following
steps:

Re-open AMQP connection.

Once connection is open again, re-open all AMQP channels on that
connection.

For each channel, re-declare all exchanges.

For each channel, re-declare all queues.

Once queue is declared, for each queue, re-register all bindings.

Once queue is declared, for each queue, re-register all consumers.

Automatic recovery

Many applications use the same recovery strategy that consists of the
following steps:

Re-open channels.

For each channel, re-declare exchanges (except for predefined ones).

For each channel, re-declare queues.

For each queue, recover all bindings.

For each queue, recover all consumers.

The amqp gem provides a feature known as "automatic recovery" that is
opt-in (not opt-out, not used by default) and lets application
developers get the aforementioned recovery strategy by setting one
additional attribute on an AMQP::Channel instance.
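A minimal sketch of setting that attribute is shown here. `enable_auto_recovery` is a hypothetical helper. The `auto_recovery=`, `auto_recovering?` and `id` calls are the ones used in the full example in this section.

```ruby
# Hypothetical helper: turns automatic recovery on for the given channels
# and returns the ids of the channels that report it as enabled.
# Each channel is expected to be an AMQP::Channel.
def enable_auto_recovery(channels)
  channels.each { |channel| channel.auto_recovery = true }
  channels.select(&:auto_recovering?).map(&:id)
end

# Intended usage with the gem (not executed here):
#
#   ch = AMQP::Channel.new(connection)
#   enable_auto_recovery([ch])
```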

Note that if you do not want to pass any options, the second argument
can be left out as well, then it will default to
AMQP::Channel.next_channel_id.

To find out whether a channel uses automatic recovery mode or not, use
AMQP::Channel#auto_recovering?.

Auto recovery mode can be turned on and off any number of times during
channel life cycle, although a very small percentage of applications
actually do this. Typically you decide what channels should be using
automatic recovery during the application design stage.

Full example (run it, then shut down AMQP broker running on localhost,
then bring it back up and use management tools such as rabbitmqctl to
see that queues, bindings and consumers have all recovered):

    require "rubygems"
    require "amqp" # requires version >= 0.8.0.RC14

    puts "=> Example of automatic AMQP channel and queues recovery"
    puts

    AMQP.start(:host => "localhost") do |connection, open_ok|
      connection.on_error do |ch, connection_close|
        raise connection_close.reply_text
      end

      ch1 = AMQP::Channel.new(connection)
      ch1.auto_recovery = true
      ch1.on_error do |ch, channel_close|
        raise channel_close.reply_text
      end

      if ch1.auto_recovering?
        puts "Channel #{ch1.id} IS auto-recovering"
      end

      connection.on_tcp_connection_loss do |conn, settings|
        puts "[network failure] Trying to reconnect..."
        conn.reconnect(false, 2)
      end

      ch1.queue("amqpgem.examples.queue1", :auto_delete => true).bind("amq.fanout")
      ch1.queue("amqpgem.examples.queue2", :auto_delete => true).bind("amq.fanout")
      ch1.queue("amqpgem.examples.queue3", :auto_delete => true).bind("amq.fanout").subscribe do |metadata, payload|
      end

      show_stopper = Proc.new {
        connection.disconnect { puts "Disconnected. Exiting…"; EventMachine.stop }
      }

      Signal.trap "TERM", show_stopper
      Signal.trap "INT",  show_stopper
      EM.add_timer(30, show_stopper)

      puts "Connected, authenticated. To really exercise this example, shut AMQP broker down for a few seconds. If you don't it will exit gracefully in 30 seconds."
    end

Server-named queues, when recovered automatically, will get new
server-generated names to guarantee there are no name collisions.

When in doubt, try using automatic recovery first. If
it is not sufficient for your application, switch to manual recovery
using events and callbacks introduced in the "Manual recovery"
section.

Detecting broker failures

AMQP applications see broker failure as TCP connection loss. There is no
reliable way to know whether there is a network problem or a network
peer is down.

AMQP connection-level exceptions

Handling connection-level exceptions

Connection-level exceptions are rare and may indicate a serious issue
with a client library or in-flight data corruption. The AMQP 0.9.1
specification mandates that a connection that has errored cannot be used
any more and must be closed. In any case, your application should be
prepared to handle this kind of error. To define a handler, use the
AMQP::Session#on_error method. It
takes a callback and yields two arguments to it when a connection-level
exception happens: the connection and the connection.close data sent by
the broker (class id, method id, reply code and reply text).

Only one connection-level exception handler can be
defined per connection instance (the one added last replaces previously
added ones).

Full example:

    #!/usr/bin/env ruby
    # encoding: utf-8

    require "bundler"
    Bundler.setup

    $:.unshift(File.expand_path("../../../lib", __FILE__))

    require 'amqp'

    EventMachine.run do
      AMQP.connect(:host => '127.0.0.1', :port => 5672) do |connection|
        puts "Connected to AMQP broker. Running #{AMQP::VERSION} version of the gem..."

        connection.on_error do |conn, connection_close|
          puts <<-ERR
          Handling a connection-level exception.

          AMQP class id : #{connection_close.class_id},
          AMQP method id: #{connection_close.method_id},
          Status code   : #{connection_close.reply_code}
          Error message : #{connection_close.reply_text}
          ERR

          EventMachine.stop
        end

        # send_frame is NOT part of the public API, but it is public for entities like AMQ::Client::Channel
        # and we use it here to trigger a connection-level exception. MK.
        connection.send_frame(AMQ::Protocol::Connection::TuneOk.encode(1000, 1024 * 128 * 1024, 10))
      end
    end

Handling graceful broker shutdown

When an AMQP broker is shut down, it properly closes connections first.
To do so, it uses the connection.close AMQP method. AMQP clients then
need to check whether the reply code is equal to 320 (CONNECTION_FORCED)
to distinguish a graceful shutdown. With RabbitMQ, when the broker is stopped
using

Each application chooses how to handle graceful broker shutdowns
individually, so the amqp gem’s automatic reconnection does not cover
graceful broker shutdowns. Applications that want to reconnect when the
broker is stopped can use
AMQP::Session#periodically_reconnect.
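One way to combine the reply code check with periodic reconnection is sketched below. `reconnect_after_graceful_shutdown` is a hypothetical helper, and the 30 second period is an arbitrary choice. Because only one connection-level error handler can be defined per connection, this handler also has to deal with every other connection-level exception. Here it simply raises.

```ruby
# Hypothetical helper. `connection` is expected to be an AMQP::Session;
# `period` is the reconnection interval in seconds.
def reconnect_after_graceful_shutdown(connection, period = 30)
  connection.on_error do |conn, connection_close|
    if connection_close.reply_code == 320 # CONNECTION_FORCED
      warn "Broker was shut down (#{connection_close.reply_text}), will keep trying to reconnect"
      conn.periodically_reconnect(period)
    else
      # Any other connection-level exception is treated as a real error.
      raise connection_close.reply_text
    end
  end
end
```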

Error handling can be easily integrated into object-oriented Ruby code
(in fact, this is highly encouraged). A common technique is to combine
Object#method
and
Method#to_proc
and use object methods as error handlers:

    #!/usr/bin/env ruby
    # encoding: utf-8

    require "bundler"
    Bundler.setup

    $:.unshift(File.expand_path("../../../lib", __FILE__))

    require 'amqp'

    class ConnectionManager

      #
      # API
      #

      def connect(*args, &block)
        @connection = AMQP.connect(*args, &block)

        # combines Object#method and Method#to_proc to use object
        # method as a callback
        @connection.on_error(&method(:on_error))
      end # connect(*args, &block)

      def on_error(connection, connection_close)
        puts "Handling a connection-level exception."
        puts
        puts "AMQP class id : #{connection_close.class_id}"
        puts "AMQP method id: #{connection_close.method_id}"
        puts "Status code   : #{connection_close.reply_code}"
        puts "Error message : #{connection_close.reply_text}"
      end # on_error(connection, connection_close)
    end

    EventMachine.run do
      manager = ConnectionManager.new
      manager.connect(:host => '127.0.0.1', :port => 5672) do |connection|
        puts "Connected to AMQP broker. Running #{AMQP::VERSION} version of the gem..."

        # send_frame is NOT part of the public API, but it is public for entities like AMQ::Client::Channel
        # and we use it here to trigger a connection-level exception. MK.
        connection.send_frame(AMQ::Protocol::Connection::TuneOk.encode(1000, 1024 * 128 * 1024, 10))
      end

      # shut down after 2 seconds
      EventMachine.add_timer(2) { EventMachine.stop }
    end

AMQP channel-level exceptions

Handling channel-level exceptions

Channel-level exceptions are more common than connection-level ones.
They are handled in a similar manner, by defining a callback with the
AMQP::Channel#on_error method. It
takes a callback and yields two arguments to it when a channel-level
exception happens: the channel and the channel.close data sent by the
broker (class id, method id, reply code and reply text).

Only one channel-level exception handler can be
defined per channel instance (the one added last replaces previously
added ones).

Full example:

    #!/usr/bin/env ruby
    # encoding: utf-8

    require "bundler"
    Bundler.setup

    $:.unshift(File.expand_path("../../../lib", __FILE__))

    require 'amqp'

    puts "=> Queue redeclaration with different attributes results in a channel exception that is handled"
    puts

    AMQP.start("amqp://guest:guest@dev.rabbitmq.com:5672") do |connection, open_ok|
      AMQP::Channel.new do |channel, open_ok|
        puts "Channel ##{channel.id} is now open!"

        channel.on_error do |ch, channel_close|
          puts <<-ERR
          Handling a channel-level exception.

          AMQP class id : #{channel_close.class_id},
          AMQP method id: #{channel_close.method_id},
          Status code   : #{channel_close.reply_code}
          Error message : #{channel_close.reply_text}
          ERR
        end

        EventMachine.add_timer(0.4) do
          # these two definitions result in a race condition. For sake of this example,
          # however, it does not matter. Whatever definition succeeds first, 2nd one will
          # cause a channel-level exception (because attributes are not identical)
          AMQP::Queue.new(channel, "amqpgem.examples.channel_exception", :auto_delete => true, :durable => false) do |queue|
            puts "#{queue.name} is ready to go"
          end

          AMQP::Queue.new(channel, "amqpgem.examples.channel_exception", :auto_delete => true, :durable => true) do |queue|
            puts "#{queue.name} is ready to go"
          end
        end
      end

      show_stopper = Proc.new do
        $stdout.puts "Stopping..."
        connection.close { EventMachine.stop }
      end

      Signal.trap "INT", show_stopper
      EM.add_timer(2, show_stopper)
    end

Error handling can be easily integrated into object-oriented Ruby code
(in fact, this is highly encouraged). A common technique is to combine
Object#method
and
Method#to_proc
and use object methods as error handlers. For example of this, see
section on connection-level exceptions above.

Because channel-level exceptions may be raised
for multiple unrelated reasons and often indicate
misconfigurations, how they are handled is very specific to particular
applications. A common strategy is to log an error and then open and use
another channel.
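The "log, then switch to another channel" strategy could be sketched like this. `replace_channel_on_error`, `open_channel` and the `on_replace` block are hypothetical names. The sketch leaves re-declaring queues, exchanges and consumers on the new channel to the application.

```ruby
# Hypothetical helper: wraps the "log, then switch to a fresh channel" strategy.
# `open_channel` is any callable returning a new channel, for example
#   lambda { AMQP::Channel.new(connection) }
# `on_replace` receives the replacement so the application can start using it.
def replace_channel_on_error(channel, open_channel, &on_replace)
  channel.on_error do |ch, channel_close|
    warn "Channel #{ch.id} closed: #{channel_close.reply_code} #{channel_close.reply_text}"
    replacement = open_channel.call
    # Protect the replacement the same way, then hand it to the application.
    replace_channel_on_error(replacement, open_channel, &on_replace)
    on_replace.call(replacement) if on_replace
  end
end
```

If the exception is caused by a misconfiguration, the same operation will fail again on the new channel, so it is worth fixing the cause rather than relying on replacement alone.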

Common channel-level exceptions and what they mean

A few channel-level exceptions are common and deserve more attention.

406 Precondition Failed

Description

The client requested a method that was not allowed because some
precondition failed.

What might cause it

AMQP entity (a queue or exchange) was re-declared with attributes
different from original declaration. Maybe two applications or pieces of
code declare the same entity with different attributes. Note that
different AMQP client libraries historically use slightly different
defaults for entities and this may cause attribute mismatches.

`AMQP::Channel#tx_commit` or
`AMQP::Channel#tx_rollback` might be run on a channel
that wasn’t previously made transactional with
`AMQP::Channel#tx_select`.
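To avoid the second cause, the code that commits can also be the code that makes the channel transactional. The sketch below uses a hypothetical helper, `publish_in_transaction`, and calls only the three transaction methods named above. It assumes that all three can be called without arguments.

```ruby
# Hypothetical helper: makes the channel transactional before committing,
# so tx_commit is never sent on a channel that skipped tx_select.
# `channel` is expected to be an AMQP::Channel.
def publish_in_transaction(channel)
  channel.tx_select
  # The block does the publishing on the now transactional channel.
  yield channel
  channel.tx_commit
end
```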

Conclusion

Distributed applications introduce a whole new class of failures
developers need to be aware of. Many of them stem from unreliable
networks. The famous Fallacies of Distributed
Computing
list common assumptions software engineers must not make:

The network is reliable.

Latency is zero.

Bandwidth is infinite.

The network is secure.

Topology doesn’t change.

There is one administrator.

Transport cost is zero.

The network is homogeneous.

Unfortunately, applications that use Ruby and AMQP are not immune to
these problems and developers need to always keep that in mind. This
list is just as relevant today as it was in the 90s.

Ruby amqp gem 0.8.x and later lets applications define handlers that
handle connection-level exceptions, channel-level exceptions and TCP
connection failures. Handling AMQP exceptions and network connection
failures is relatively easy. Re-declaring AMQP instances that the
application works with is where most of the complexity comes from. By
using Ruby objects as error handlers, the declaration of AMQP entities
can be done in one place, making code much easier to understand and
maintain.

amqp gem error and interruption handling is not a copy of RabbitMQ Java
client’s Shutdown
Protocol, but they
turn out to be similar with respect to network failures and
connection-level exceptions.