Monday, June 27, 2011

If you are a regular reader of my blog, you’ll know that I’m currently working on EasyNetQ, a .NET-friendly API for RabbitMQ. EasyNetQ is opinionated software. It takes away much of the complexity of AMQP and replaces it with a simple interface that relies on the .NET type system for routing messages.

One of the things that I want to remove from the ‘application space’ and push down into the API is all the plumbing for reporting and handling error conditions. One side of this is to provide infrastructure to record and handle exceptions thrown by applications that use EasyNetQ. I’ll be covering this in a future post. The other consideration, and the one I want to address in this post, is how EasyNetQ should gracefully handle network connection or server failure.

The Fallacies of Distributed Computing tell us that, no matter how reliable RabbitMQ and the Erlang platform might be, there will still be times when a RabbitMQ server will go away for whatever reason.

One of the challenges of programming against a messaging system, as compared with a relational database, is the length of time that the application holds connections open. A typical database connection is opened, some operation is run over it (select, insert, update, etc.), and then it’s closed. Messaging system subscriptions, however, require that the client, or subscriber, holds an open connection for the lifetime of the application.

If you simply program against the low-level C# AMQP API provided by RabbitMQ to create a simple subscription, you’ll notice that after a RabbitMQ server bounce, the subscription no longer works. This is because the channel you opened to subscribe to the queue, and the consumption loop attached to it, are no longer valid. You need to detect the closed channel and then attempt to rebuild the subscription once the server is available again.

The excellent RabbitMQ in Action by Videla and Williams describes how to do this in chapter 6, ‘Writing code that survives failure’. Here’s their Python code example:

EasyNetQ needs to do something similar, but as a generic solution so that all subscribers automatically get re-subscribed after a server bounce.

The connection.AddSubscriptionAction(subscribeAction) line passes the closure to a PersistentConnection class that wraps an AMQP connection and provides all the disconnect detection and re-subscription code. Here’s AddSubscriptionAction:

This spins up a thread that simply loops trying to connect back to the server. Once the connection is established, it runs all the stored subscribe closures (subscribeActions).
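As a rough illustration of the reconnect loop described above, here is a minimal, self-contained sketch. It is not EasyNetQ's actual source: the class name ReconnectingConnectionSketch, the tryConnect delegate, and the StartReconnecting method are all assumptions made for the example, and the "server" is simulated by a delegate that succeeds on the third attempt.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch: names and structure are assumptions, not EasyNetQ's source.
public class ReconnectingConnectionSketch
{
    private readonly Func<bool> tryConnect;   // returns true once the broker is reachable
    private readonly List<Action> subscribeActions = new List<Action>();
    private readonly object sync = new object();

    public ReconnectingConnectionSketch(Func<bool> tryConnect)
    {
        this.tryConnect = tryConnect;
    }

    // Remember the closure so it can be replayed after every reconnect.
    public void AddSubscriptionAction(Action subscribeAction)
    {
        lock (sync) { subscribeActions.Add(subscribeAction); }
    }

    // Call this when a disconnect is detected. Returns the retry thread so callers can Join it.
    public Thread StartReconnecting(int retryDelayMilliseconds)
    {
        var thread = new Thread(() =>
        {
            while (!tryConnect())
            {
                Thread.Sleep(retryDelayMilliseconds);   // dedicated thread, not a pool thread
            }

            Action[] actions;
            lock (sync) { actions = subscribeActions.ToArray(); }
            foreach (var action in actions) action();   // rebuild every stored subscription
        });
        thread.IsBackground = true;
        thread.Start();
        return thread;
    }
}

public static class ReconnectDemo
{
    public static void Main()
    {
        var attempts = 0;
        // Simulated server: comes back on the third connection attempt.
        var connection = new ReconnectingConnectionSketch(() => ++attempts >= 3);
        connection.AddSubscriptionAction(() => Console.WriteLine("re-subscribed to queue A"));
        connection.AddSubscriptionAction(() => Console.WriteLine("re-subscribed to queue B"));

        connection.StartReconnecting(10).Join();
        Console.WriteLine("connected after {0} attempts", attempts);
    }
}
```

The key design point is that subscribers hand over a closure rather than subscribing directly, so the connection wrapper can replay every subscription without the application knowing a reconnect happened.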

In my tests, this solution has worked very nicely. My clients automatically re-subscribe to the same queues and continue to receive messages. One of the main motivations for writing this post, however, was to try to elicit feedback, so if you’ve used RabbitMQ with .NET, I’d love to hear about your experiences and especially any comments about my code or how you solved this problem.

3 comments:

I am looking to make use of RabbitMQ and your EasyNetQ library looks very interesting.

One comment on your reconnection code concerns the use of Thread.Sleep(). If you are using a ThreadPool thread, you should avoid calling Thread.Sleep, as it can lead to thread pool starvation when your process is busy. A better approach would be to use a Timer object that fires after 100 ms and calls your TryToConnect() method. The thread goes back to the pool until the Timer's Elapsed event is raised, whereupon a thread pool thread (possibly the same one) is picked to execute the event handler.
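A minimal sketch of the timer-based retry suggested here, under some assumptions: it uses System.Threading.Timer with a one-shot callback rather than the Elapsed event of System.Timers.Timer, and the TimerReconnector class and tryConnect delegate are hypothetical names standing in for the real connection code.

```csharp
using System;
using System.Threading;

// Hypothetical sketch: TimerReconnector and tryConnect are illustrative names.
public class TimerReconnector : IDisposable
{
    private readonly Func<bool> tryConnect;
    private readonly int retryDelayMilliseconds;
    private readonly Timer timer;
    private readonly ManualResetEvent connected = new ManualResetEvent(false);

    public TimerReconnector(Func<bool> tryConnect, int retryDelayMilliseconds)
    {
        this.tryConnect = tryConnect;
        this.retryDelayMilliseconds = retryDelayMilliseconds;
        // Created disarmed; Start() schedules the first attempt.
        timer = new Timer(OnTimer, null, Timeout.Infinite, Timeout.Infinite);
    }

    // Signalled once a connection attempt succeeds.
    public WaitHandle Connected { get { return connected; } }

    public void Start()
    {
        timer.Change(retryDelayMilliseconds, Timeout.Infinite);   // one-shot, no period
    }

    private void OnTimer(object state)
    {
        if (tryConnect())
        {
            connected.Set();
        }
        else
        {
            // Re-arm the timer; no thread is held while waiting for the next attempt.
            timer.Change(retryDelayMilliseconds, Timeout.Infinite);
        }
    }

    public void Dispose()
    {
        timer.Dispose();
        connected.Dispose();
    }
}

public static class TimerDemo
{
    public static void Main()
    {
        var attempts = 0;
        // Simulated server: comes back on the third connection attempt.
        using (var reconnector = new TimerReconnector(() => ++attempts >= 3, 100))
        {
            reconnector.Start();
            reconnector.Connected.WaitOne();
            Console.WriteLine("connected after {0} attempts", attempts);
        }
    }
}
```

Because the timer is re-armed only after a failed attempt, callbacks never overlap, and between attempts no pool thread is blocked.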

Also, would there be many subscriptions per connection? Would you want to start only one 'attempt to reconnect' loop per connection, as opposed to one per subscription? Maybe use the Interlocked class so that the first caller wins.
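The "first come wins" idea can be sketched with Interlocked.CompareExchange. The SingleReconnectGate class below is a hypothetical helper written for this example, not part of EasyNetQ: several threads notice the disconnect at once, but only one is allowed to start the reconnect loop.

```csharp
using System;
using System.Threading;

// Hypothetical helper: guards the start of a reconnect loop.
public class SingleReconnectGate
{
    private int reconnecting;   // 0 = idle, 1 = a reconnect loop is running

    // Atomically flips 0 to 1. Returns true only for the caller that made the change.
    public bool TryBegin()
    {
        return Interlocked.CompareExchange(ref reconnecting, 1, 0) == 0;
    }

    // Call once the connection is re-established so a later failure can start a new loop.
    public void End()
    {
        Interlocked.Exchange(ref reconnecting, 0);
    }
}

public static class GateDemo
{
    public static void Main()
    {
        var gate = new SingleReconnectGate();
        var winners = 0;
        var threads = new Thread[8];

        // Eight subscribers detect the failure at roughly the same time.
        for (var i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() =>
            {
                if (gate.TryBegin()) Interlocked.Increment(ref winners);
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();

        Console.WriteLine("reconnect loops started: {0}", winners);   // prints 1
    }
}
```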

Thanks for pointing out my poor thread-pool usage ;) You are right of course. In this case, I'm not too bothered by it because while there's no connection, not much is going to be happening anyway. But I'll put it on my todo list. If you want to send me a pull request in the meantime, I'd be very happy.

There can be any number of subscriptions per connection. There is a single connection per RabbitBus instance, and normally I would expect an application to have a single instance and thus a single connection. So a single application should only have a single reconnect loop. Multiple channels can be multiplexed over a single connection.

All the subscriptions are handled from a single event loop that takes messages off an in-memory queue. The main thing to remember as an EasyNetQ user is to write non-blocking subscription handlers.
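To see why blocking handlers matter with a single event loop, here is a small sketch. The SingleThreadDispatcher class is an assumption made for illustration (built on BlockingCollection from .NET 4) and does not reflect how EasyNetQ is actually implemented: every handler runs on one worker thread, so a slow handler delays everything queued behind it.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical sketch of a single event loop fed by an in-memory queue.
public class SingleThreadDispatcher : IDisposable
{
    private readonly BlockingCollection<Action> queue = new BlockingCollection<Action>();
    private readonly Thread worker;

    public SingleThreadDispatcher()
    {
        worker = new Thread(() =>
        {
            // Blocks until an item arrives; the loop ends after CompleteAdding() once drained.
            foreach (var handler in queue.GetConsumingEnumerable())
            {
                handler();
            }
        });
        worker.IsBackground = true;
        worker.Start();
    }

    public void Enqueue(Action handler)
    {
        queue.Add(handler);
    }

    public void Dispose()
    {
        queue.CompleteAdding();   // no more work will be added
        worker.Join();            // wait for queued handlers to finish
        queue.Dispose();
    }
}

public static class DispatcherDemo
{
    public static void Main()
    {
        using (var dispatcher = new SingleThreadDispatcher())
        {
            // The slow handler holds up the fast one queued behind it.
            dispatcher.Enqueue(() => { Thread.Sleep(200); Console.WriteLine("slow handler done"); });
            dispatcher.Enqueue(() => Console.WriteLine("fast handler done"));
        }
    }
}
```

The fast handler cannot run until the slow one returns, which is the reason to keep subscription handlers short or hand long-running work off to another thread.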