I am trying to learn more about the various ways that the problem of a Distributed Queue may be solved. So I would like know what products, services, implementations and research papers that are already out there.

An implementation will face many challenges and will be forced to make tradeoffs:

Does it have strong or loose ordering?

Does it have idempotent put?

Can we have more queues than what can fit on a single machine?

Can we have more data in a queue than what can fit on a single machine?

How many machines can crash before we potentially lose data?

Can it tolerate net-splits?

Can it automatically reconcile data when a net-split is fixed?

Can it guarantee delivery when clients can crash?

Can it guarantee that the same message is not delivered more than once?

Can a node crash at any given point, come back up, and not send out junk?

Can you add nodes to, or remove nodes from, a running cluster without down time?

Can you upgrade nodes in a running cluster without down time?

Can it run without problems on heterogeneous servers?

Can you “stick” queues to a group of servers? (example: “these queues are only allowed in the european datacenter”)

Can it make sure to put data replicas in at least two datacenters, if so available?

I have no illusion that any implementation will be able to say “yes” to all of that. I am merely interested in hearing about the various implementations; how they work, what tradeoffs they have made and perhaps why they decided on their particular set of tradeoffs.

Also if there are any challenges that I may have missed in the above list.

1 Answer
1

Writing a basic queuing system is fairly simple, but as you noted above with all of the challenges, doing it right is another matter. I've used home grown systems for which I wrote the source code, 3rd party systems, and various JMS providers. JMS (Java Messaging Service) by far is the most complete solution I've encountered thus far. Much of what you ask is available in JMS. My favorite JMS provider is ActiveMQ. Free, performant, easy to install, and more importantly easy to embed in my app with Spring. JMS providers don't provide everything you asked for out of the box, but they provide a set of tools to handle much of what you asked about should your application need it. I haven't found lots of applications need everything you listed. Ordering might not be important (it's best if it isn't), durable topics might not be important, guaranteed delivery, etc. You just have to stick to the problem and use what it demands.

Does it have idempotent put? No, but this is trivial to implement in your application layer should you need that.

Can we have more queues than what can fit on a single machine? Yes. You can have clustered servers, and if you wanted to setup multiple machines with different queues you could, and pull from either.

Can we have more data in a queue than what can fit on a single machine? Yes most JMS providers have to use some sort of DB/persistent storage to ensure messages aren't dropped or lost if the JMS provider goes down.

How many machines can crash before we potentially lose data? This is a little harder to answer because it's timing related. However, you can crash a JMS provider and provided the disk isn't corrupt it will come back up and start where it received the last commit. This means messages could be delivered twice, but if you code your app to handle this it's not a problem. As long as you have at least one of each type (producers, consumers, or JMS servers) it will complete. You can also have load/balance/failover for redundancy should a disk go out on you.

Can it tollerate net-splits? I think I understand what you mean by "net-split", but I'm not entirely sure. I guess you mean if the JMS servers are clustered, and we loose connection with one of the servers will it jump to another server and pickup where it left off. Yes, but again these types of situations can lead to duplicate messages depending on at what point the client lost connection.

Can it automatically reconcile data when a net-split is fixed? If you are using transacted sessions it will only redeliver any message that has had a commit called on it to existing clients that are up.

Can it guarantee delivery when clients can crash? Yes this is one of the main goals of JMS. Guaranteed delivery means that if a message is queued it's guaranteed to be handled by a client.

Can it guarantee that the same message is not delivered more than once? Yes if the transacted sessions are being used. That means a client has accepted the message and called commit/rollback. Once the commit is called it won't redeliver the message.

Can a node crash at any given point, come back up, and not send out junk? In the case where you have durable clustered queues. Yes it won't spew "junk" if the other node in the cluster has delivered the message. It can still redeliver anything that hasn't been acknowledged.

Can you add nodes to, or remove nodes from, a running cluster without down time? Yes.

Can you upgrade nodes in a running cluster without down time? This is a little trickier for me to answer, but I believe that yes you can do this.

Can it run without problems on heterogeneous servers? What does this mean exactly? I've found most JMS providers are very easy to run in environments using different hardware, OS, etc. Although, if you mean performance, that's a whole another thing. Any distributed processing system can be negatively impacted by a slow node. I had 2 8 Core Intel servers running the queue and the consumers. That's 16 cores together, and I got better performance from using only those two boxes, than when I added a single core machine as a consumer. That single core machine was so much slower it slowed down the entire grid by a factor of 2x. This had nothing to do with JMS per se.

Can you “stick” queues to a group of servers? Short answer yes. I can think of a way where you can run a cluster that's only in the european data center, and configure the queue there. Then in your spring config setup your consumers to consume that queue as well as other queues on other clusters. You might want to consult the docs:

Can it make sure to put data replicas in at least two datacenters, if so available? Again I believe so, but it's best to consult the clustering docs.

Again JMS has lots of options you can tweak as your need dictates. Using transacted sessions and durable queues comes with a performance cost. I've seen turning on all the bells and whistles impact performance as much as 10x. When I used JBossMQ if we turned off some of these features we could get around 10,000 messages/s, but turning them on brought us down to 1000 messages/s. Big drop.

Thank you for taking the time with this answer. A net-split is when some nodes in a cluster can no longer communicate with the rest. By heterogeneous servers I mostly mean different amounts of RAM - some distributed systems prefer it when servers look alike.
–
Chris VestNov 7 '10 at 11:55

Then for sure yes on netsplits. If a consumer goes down or can't communicate it will keep trying to connect. Jobs that were given to it that didn't receive a commit will later be redelivered to other consumers. If a JMS provider goes down and you have other members of the cluster messages can be duplicated across the cluster to keep from losing messages.
–
chubbsondubsNov 9 '10 at 15:20

There are no requirements about having machines be identical whether than be RAM, Hardware, or OS. You can run a mixed bag of machines if you need to. The only concern is the one I noted which is performance related in that machines that aren't the same will process messages at different rates which can lead to lower throughput. However, JMS model somewhat mitigates this by the fact that it's pull instead of push model. Push models are much more sensitive to these types of issues.
–
chubbsondubsNov 9 '10 at 15:22