Prefetch state can be incorrect when transacted redelivery of duplicates occurs, causing stalled queue

Description

In ActiveMQMessageConsumer, delivery acks are generated by receive() calls and by the dispatch() method when transacted redelivery of duplicates occurs. These delivery acks are consolidated by calling ackLater which batches them up using first/last message id and sends the acks as appropriate w.r.t. the prefetch size.

On the broker, the prefetch window is extended by checking the last message id, finding where it is in the dispatched queue and incrementing the prefetchExtension accordingly. This algorithm depends on the consumer maintaining the dispatch order in its consolidated delivery acks.
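The broker-side accounting described above can be pictured with a small model. This is only a sketch under assumptions: `PrefetchAccountingSketch` and `deliveredCount` are hypothetical names, the dispatched queue is modelled as a plain list of message ids, and none of this is the actual broker code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the broker-side accounting; not ActiveMQ code.
final class PrefetchAccountingSketch {

    // Returns how many dispatched messages are treated as delivered when a
    // consolidated delivery ack arrives. Only the LAST message id of the ack is
    // consulted: everything up to and including its position in the dispatched
    // list is counted. An unknown id counts as zero.
    static int deliveredCount(List<String> dispatched, String lastMessageId) {
        int index = dispatched.indexOf(lastMessageId);
        return index < 0 ? 0 : index + 1;
    }

    public static void main(String[] args) {
        List<String> dispatched = Arrays.asList("68:1:1:1", "68:1:1:2", "68:1:1:3");
        System.out.println(deliveredCount(dispatched, "68:1:1:2"));
    }
}
```

The count is only meaningful if the last id of each ack really is the most recently dispatched message that the ack covers, which is why the consumer has to preserve dispatch order.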

When the transacted redelivery occurs, it happens in a different thread from the receive and operates on the latest delivered message. The delivery acks from the receive thread are arbitrarily delayed (but stay in dispatch order), depending on client action. Mixing the two can result in an out-of-order consolidated delivery ack.

Real example (the client and broker logs are mixed to make it easier to follow; the dispatch logs come from my own custom logging plugin):

In the first ack received by the broker, you can see that message 68:1:1:2 is the first id and 68:1:1:1 is the last id. The broker never looks at the first id and will consider this a delivery ack of everything up to 68:1:1:1 (which was the first message dispatched). Thus this mixing results in an incorrect delivery count on the part of the broker.
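To make the inverted range concrete, here is a minimal sketch of a first/last consolidation, assuming a batch that simply records the first id it sees and overwrites the last id on every call. `AckRangeSketch` is a hypothetical class, not the consumer's real ack batching, and the two-thread interleaving is replayed sequentially.

```java
// Hypothetical model of a consolidated delivery ack; not ActiveMQ code.
final class AckRangeSketch {
    private String firstMessageId;
    private String lastMessageId;
    private int count;

    // Adds one delivery ack to the batch in arrival order.
    void add(String messageId) {
        if (firstMessageId == null) {
            firstMessageId = messageId;
        }
        lastMessageId = messageId;
        count++;
    }

    String first() { return firstMessageId; }
    String last() { return lastMessageId; }
    int count() { return count; }

    public static void main(String[] args) {
        AckRangeSketch batch = new AckRangeSketch();
        batch.add("68:1:1:2"); // redelivery ack from the dispatch thread arrives first
        batch.add("68:1:1:1"); // delayed ack from the receive thread arrives second
        System.out.println(batch.first() + " .. " + batch.last() + " (" + batch.count() + ")");
    }
}
```

The batch covers two messages, but its last id points at the first message dispatched, so a broker that only looks up the last id accounts for one message instead of two.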

An easy fix, which would sometimes prematurely extend the prefetch window, would be to always send transacted redelivery acks immediately (or to consolidate them separately from receive-originated delivery acks). Since transacted redelivery acks are always triggered on messages delivered later than the receive acks, this would cause the broker to think that all the earlier messages had been delivered as well. This might be inappropriate with really large prefetch sizes, although this is tempered by the fact that it only occurs in failover situations.
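A rough sketch of this first option, assuming two separate ack paths: receive-originated acks are batched, while redelivery acks bypass the batch and go out at once. `SeparateAckPathsSketch` and its method names are hypothetical and single-threaded for illustration; this is not the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of keeping redelivery acks out of the receive batch.
final class SeparateAckPathsSketch {
    // Acks "sent" to the broker, recorded as "first..last" for inspection.
    final List<String> sentToBroker = new ArrayList<>();
    private String pendingFirst;
    private String pendingLast;

    // Receive-originated delivery ack: consolidated until flush().
    void batchReceiveAck(String messageId) {
        if (pendingFirst == null) {
            pendingFirst = messageId;
        }
        pendingLast = messageId;
    }

    // Transacted redelivery ack: sent immediately, never merged into the batch.
    void sendRedeliveryAckNow(String messageId) {
        sentToBroker.add(messageId + ".." + messageId);
    }

    // Sends the pending receive batch, if there is one.
    void flush() {
        if (pendingFirst != null) {
            sentToBroker.add(pendingFirst + ".." + pendingLast);
            pendingFirst = null;
            pendingLast = null;
        }
    }
}
```

Each ack that reaches the broker now has a range that is internally in dispatch order; the cost is that the immediate redelivery ack makes the broker treat earlier, not yet received messages as delivered.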

Another fix might be to enqueue the transacted redelivered messages and filter out these messages in the dequeue method, which would result in proper ordering of the delivery acks.
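The dequeue-filtering idea could look roughly like the following. This is a sketch under assumptions: `DequeueFilterSketch`, its fields, and the notion of a set of ids already delivered in the transaction are all hypothetical, and the class is not thread-safe, so a real consumer would need synchronization between the dispatch and receive threads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of filtering redelivered duplicates at dequeue time.
final class DequeueFilterSketch {
    private final Deque<String> unconsumed = new ArrayDeque<>();
    private final Set<String> previouslyDelivered = new HashSet<>();
    // Delivery acks in the order they were generated.
    final List<String> deliveryAcks = new ArrayList<>();

    // Records an id that was already delivered in the current transaction.
    void markPreviouslyDelivered(String messageId) {
        previouslyDelivered.add(messageId);
    }

    // Dispatch side: enqueue every message, duplicates included, and ack nothing.
    void dispatch(String messageId) {
        unconsumed.addLast(messageId);
    }

    // Receive side: ack each message as it leaves the queue, skip duplicates,
    // and return the next message for the application (or null if none is left).
    String dequeue() {
        String messageId;
        while ((messageId = unconsumed.pollFirst()) != null) {
            deliveryAcks.add(messageId);
            if (!previouslyDelivered.contains(messageId)) {
                return messageId;
            }
        }
        return null;
    }
}
```

Because every delivery ack is produced on the dequeue path, acks are generated in dispatch order even when a duplicate sits between two fresh messages.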

Anything else would seem to require explicit broker accounting of each delivered message. I'm willing to try to implement one of these fixes (I'm leaning toward the dequeue filtering) but would like some guidance.

Martin Serrano
added a comment - 19/Sep/12 14:18
We have been using this patch successfully since reporting the issue. It definitely fixes the issue. Unfortunately I have not had the time to reproduce this in a simple unit test outside of our environment. It is easy to reproduce in our environment: we have several failover stress tests with heavy message traffic that we run in a loop.

Gary Tully
added a comment - 30/Jan/13 12:14
Fix looks good (there is no harm in premature extension of prefetch, as it is capped by the broker) and there is no regression in org.apache.activemq.transport.failover.FailoverRedeliveryTransactionTest.
Applied with thanks in http://svn.apache.org/viewvc?rev=1440366&view=rev