So... something weird happened on our production server today. All of a sudden, HornetQ stopped delivering messages on one of the queues (the others were fine). Unfortunately, it was our most important queue. Restarting the consumers, pausing and resuming the queue, and even restarting the server had no effect.

I also noticed that the journal directory was very large (1.5 GB), and while restarting HornetQ I got some IllegalArgumentExceptions saying that the state of the journal was wrong:

Also, the whole journal would be loaded into memory (the heap would grow to over 4 GB).

So I stopped HornetQ, cleaned up the journal directory, and everything started working fine (except that I probably lost those messages, but I cannot tell). I can probably provide the HornetQ data files, but it will not be easy (they are 300 MB gzipped).

Starting HornetQ with those files leads to the same state: no messages get delivered to that queue, and inspecting the queue through JMX (using jconsole) shows something very weird: it says I have 130k+ messages, but no method can look at them (or move / delete / expire them), and I see no way to at least remove them from the heap.
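For anyone who wants to script the same check instead of clicking through jconsole, here is a minimal sketch using only the JDK's JMX API. It assumes the queue's MBean is registered in the platform MBeanServer of the same JVM (as it could be with an embedded server), that the ObjectName follows the pattern `org.hornetq:module=JMS,type=Queue,name="..."`, and that the count is exposed as a `MessageCount` attribute. The class name `QueueInspector` and the default queue name `exampleQueue` are made up for illustration, so check the real names in jconsole first.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

final class QueueInspector {

    // Assumed naming pattern for a JMS queue MBean; verify it in jconsole.
    static ObjectName queueObjectName(String queueName) {
        try {
            return new ObjectName(
                "org.hornetq:module=JMS,type=Queue,name=" + ObjectName.quote(queueName));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Bad queue name: " + queueName, e);
        }
    }

    // Reads the (assumed) MessageCount attribute of the queue MBean.
    static long messageCount(MBeanServerConnection mbs, String queueName) throws Exception {
        Object value = mbs.getAttribute(queueObjectName(queueName), "MessageCount");
        return ((Number) value).longValue();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical default queue name; pass the real one as the first argument.
        String queue = args.length > 0 ? args[0] : "exampleQueue";
        MBeanServerConnection mbs = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = queueObjectName(queue);
        if (!mbs.isRegistered(name)) {
            System.out.println("No MBean registered as " + name);
            return;
        }
        System.out.println(queue + " MessageCount=" + messageCount(mbs, queue));
    }
}
```

Run from inside the application's JVM, this only reports what the queue metadata claims; it does not show whether the messages are actually reachable, which is the mismatch described above.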

Can anybody help me or give me any directions? Any help at all will be much appreciated - we're getting to the point where we will have to switch to another JMS provider if this happens again.

Oh, and I'm using HornetQ 2.2.21.Final, but the same thing happens if I use the same data files with HornetQ 2.3.0.Beta1 (except that new messages do not get stuck and are delivered normally).

It seems it would be a good idea to move to the latest available version, even if you use git to get it (in case you don't have an EAP subscription).

It seems that the page file was deleted, maybe while the journal was still showing some acks in a previous page? I would need your data to know what's going on (maybe you could send me the print-data and print-pages outputs). Take a look at the properties before you send them to me, to make sure you're not sending anything you shouldn't... although I never look at any of that data.

Hi Clebert, thanks for all the attention. The previous NPE was a result of my moving files around, I think, so I guess we can safely ignore it (although an NPE is, IMO, always a bug).

I'm back to square one: messages are stuck in the queue and are never delivered (I also cannot move/expire them using JMX, so something is absolutely wrong). It happened again on Monday on a non-critical queue, so it was not a big deal.

@Marcelo: Why don't you come online on IRC and we can talk about this.

You should definitely move to the newer version. But I'm interested in learning what happened, to make sure it won't happen again. You are probably using some use case that wasn't planned for. (Are you using filters on paging, for instance? If you are, you should move even sooner.)

OK, so I'm back. I could join IRC, but I don't want to bother you so much.

At any rate, I did some experiments and here are the results (still using 2.2.21.Final):

- the queue getting stuck certainly has something to do with paging, because if I clear that, everything almost works again, except that messages that are journaled are not consumed by the message listeners

- the journal files are loaded into memory but are never collected (I suspect this may cause OOMs down the line, but I am unsure); like I said above, the messages are never consumed and never leave the heap, even after a full GC

Also, I've tried the latest version from the 2.2.EAP branch - namely 2.2.24.EAP.snapshot - and the "queue stuck issue" is definitely gone, but the other issue (a queue with X messages in its metadata and using a lot of heap) is still there, and it may be related to the exception below:

And I forgot to mention, but I don't use any fancy features and my use case should be pretty trivial:

I have a small cluster (2 nodes at the moment) with 6 queues (including DLQ) and 1 topic (just for statistics broadcasting) and all of them are clustered.

Each queue (except DLQ) has a number of consumers (64 currently). The producers put an average of 300 messages per second (rarely over 600) across all the queues, each with an almost identical body of at most 500 bytes. Sends are transacted (so it's all or nothing when putting messages on the queues). The consumers either process the message or put other messages on the queue.
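A rough back-of-envelope check on these numbers, as a sketch only: it takes the 130k+ stuck messages reported earlier, the 500-byte body cap, and the 300 messages/second average, and it ignores headers, properties, and any per-message broker overhead (which are unknown here). `BacklogEstimate` and its method names are made up for illustration.

```java
final class BacklogEstimate {

    // Upper bound on raw body bytes for a given number of messages.
    static long payloadBytes(long messages, long maxBodyBytes) {
        return messages * maxBodyBytes;
    }

    // How many seconds of traffic at the given rate the backlog corresponds to.
    static double secondsOfTraffic(long messages, long messagesPerSecond) {
        return (double) messages / messagesPerSecond;
    }

    public static void main(String[] args) {
        long stuckMessages = 130_000; // "130k+" reported through JMX
        long maxBodyBytes = 500;      // "at most 500 bytes"
        long averageRate = 300;       // "an average of 300 messages per second"

        System.out.printf("body bytes (upper bound): %d%n",
                payloadBytes(stuckMessages, maxBodyBytes));
        System.out.printf("equivalent traffic: %.0f seconds%n",
                secondsOfTraffic(stuckMessages, averageRate));
    }
}
```

Under those assumptions, the bodies of 130k messages come to roughly 65 MB, or about seven minutes of average traffic, so body size on its own would not account for a 1.5 GB journal or a heap over 4 GB.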

I don't use any filters or selectors, and messages are rarely paged (or at least the queue size doesn't go up so much that it has to be paged).

Oh, and HornetQ is running embedded in my application, so local message consumption (is this even a word?) is done using in-vm connection factories, and the netty connectors are used only to bridge messages between the cluster nodes. (I also had some questions about how such routing is done, but I'll look at the code first.)

I'm using Tomcat 7, and when I say embedded it really is just invoking EmbeddedJMS's start method, sniffing it to get the queue connection factory, and registering a TopologyListener. I don't think that has anything to do with it, but I will try the standalone server to see what I get. I'm mimicking the startup script, but maybe my configuration is just funky. I will clean it up and attach it too.

I'm deploying the latest version now on our testing server, and everything seems to be working great.