Friday, July 27, 2007

Do you know a very simple rule of thumb to verify the health of your integration flows? Well, everything can be considered reasonably fine if the queues in your ESB are, on average, empty or very close to empty. I have heard the same complaint many times in different projects: "this JMS server doesn't work well, I have about 10,000 messages in a queue and everything looks so slow...". 10,000 messages parked in a queue? You have a problem, and it is not that the JMS server is bad at dealing with that: your flows are unbalanced and your overall design is broken! Basically your consumers are much slower than your producers, so messages quickly accumulate in the system. But a JMS server is not a database; it is definitely not meant for long-term storage. You can't easily query messages in your queues as you can in a database, and you can't easily delete or edit them. Additionally, a good JMS server like the SeeBeyond IQ Manager, which is the default JMS implementation in JCAPS, activates message server throttling by default. This means that if the persistent messages in the server go beyond a certain threshold, the single producer, or even the entire JMS server, is stopped until a proper consumer lag is reached. For the SeeBeyond JMS the default values are 1,000 messages per single queue, after which that queue's producers are frozen, and 100,000 messages for the whole server, after which all the producers connected to that particular JMS server are stopped until a certain amount of messages has been properly consumed. This is a safety net to prevent producers from flooding the JMS server.
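Leaving the vendor-specific details aside, the throttling behaviour described above is essentially back-pressure on a bounded buffer: once the queue hits its threshold, producers are frozen until consumers drain some backlog. A minimal sketch of those semantics, using `java.util.concurrent.ArrayBlockingQueue` purely as a stand-in for the IQ Manager (the tiny capacity of 3 plays the role of the 1,000-message default):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only, not the IQ Manager's actual implementation:
// a bounded queue gives the same producer-freezing semantics as the
// per-queue throttling threshold.
public class ThrottlingSketch {
    public static void main(String[] args) throws InterruptedException {
        // Tiny capacity so the effect is visible; think "1000" in stcms.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(3);

        // A fast producer fills the queue immediately...
        for (int i = 0; i < 3; i++) {
            queue.offer("msg-" + i);
        }

        // ...and the next send cannot proceed: offer() with a timeout
        // returns false, just as a throttled producer is frozen.
        boolean accepted = queue.offer("msg-3", 100, TimeUnit.MILLISECONDS);
        System.out.println("producer frozen: " + !accepted);

        // Once a consumer takes a message, the producer is released.
        queue.take();
        accepted = queue.offer("msg-3", 100, TimeUnit.MILLISECONDS);
        System.out.println("producer released: " + accepted);
    }
}
```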

Probably the guy above complaining about the 10,000 messages knew about this throttling feature and decided to simply increase the default threshold. This is definitely a bad idea: he needs to fix the balance of his flows and really understand what is happening in his system, instead of looking for easy workarounds. The default limit of 1,000 messages per queue is there for a good reason, and the reason is that a JMS server is not a storage device! It is a fast asynchronous delivery mechanism, where messages should stay in a queue for the minimum possible time. Message persistence exists for a completely different reason: it is a way to avoid message loss in case of a temporary hardware or software failure of the messaging system, that's it (oh well, you need a highly available filesystem for that, otherwise the filesystem becomes your new single point of failure...). If too many messages usually stay in the system for a long time, you'll notice a proliferation of .dbs files under your stcms folder. Briefly, these files are where the IQ Manager stores persistent messages; when there are too many of them the whole system becomes inefficient because of file segmentation and reallocation (yes, there is a kind of garbage collection in action, and you want to avoid too much of it).

Then you can say: "but in my flows consumers are slower than producers by nature". This can easily be true because, for example, the consumers are performing slower I/O operations with external systems (quite the norm in EAI). So, even if with JCAPS you can easily deploy a set of distributed consumers, that would solve the problem only if the processing is CPU-bound, not if it is I/O-bound: scaling horizontally does not help very much when external systems are inherently slow. So what? Well, your flows look to me not nearly real-time, requiring asynchronous delivery, but nearly batch. You need to take control of this and design a proper solution, instead of complaining about the technology you are using (you might point out that some other JMS servers can be configured to persist messages into a regular relational database. The bad news is that this does not solve your design issue at all: you are using a messaging solution the wrong way, regardless of the underlying persistence mechanism of your particular JMS vendor).
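The CPU-bound versus I/O-bound point is easy to check with back-of-the-envelope arithmetic. A small sketch, with invented numbers: if each consumer spends half a second per message waiting on an external system, and that external system itself accepts at most 10 requests per second, then adding consumers only helps until you hit the external cap.

```java
// Back-of-the-envelope check (hypothetical numbers): why adding
// consumers stops helping once the flow is I/O-bound on an external system.
public class ThroughputSketch {
    // Each consumer spends ioSeconds of external I/O per message; the
    // external system accepts at most maxExternalRps requests per second.
    static double throughput(int consumers, double ioSeconds, double maxExternalRps) {
        double parallel = consumers / ioSeconds;   // what N consumers could do
        return Math.min(parallel, maxExternalRps); // capped by the slow endpoint
    }

    public static void main(String[] args) {
        // One consumer, 0.5 s of I/O per message -> 2 msg/s.
        System.out.println(throughput(1, 0.5, 10));  // 2.0
        // Five consumers -> 10 msg/s, the external cap.
        System.out.println(throughput(5, 0.5, 10));  // 10.0
        // Fifty consumers -> still 10 msg/s: scaling out no longer helps.
        System.out.println(throughput(50, 0.5, 10)); // 10.0
    }
}
```

Past five consumers, every extra instance just adds idle threads waiting on the same slow endpoint.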

The solution? You should apply "store and forward". Store your incoming messages into a regular database table, then forward them into the destination queues using a more controlled process: move batches of messages with a scheduled procedure, keeping your consumers busy but without flooding the queues. You might then need to implement a reconciliation service. This depends on the context, but you probably want to know how many messages entered and exited your system, and you probably want the possibility to re-submit or delete messages. A database table is a good fit for this: you can store some additional information in the table's fields and then run queries on it. Your reconciliation service will expose counters, so that you know how many messages traveled through your EAI flow, for each single processor (a JCD in the JCAPS jargon). You can then decide to delete messages from the database when they are picked up by the first consumer, or you might prefer to do so only at the very end, when the whole process has completed successfully. This second option could allow you to remove the persistent flag from all the intermediate queues in the processing pipeline, to further speed up the JMS server, but this is a design decision quite dependent on the application context. For example, in some scenarios it could be simpler and cheaper to repeat the whole process from the beginning in case of failure, instead of maintaining an intermediate state; in other scenarios it could be too expensive to repeat, so it is mandatory to store intermediate processing results within the message itself, which must then be stored in a persistent queue.
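The store-and-forward idea above can be sketched in a few lines. This is only an illustration of the control flow, not JCAPS code: a plain list stands in for the database table, a deque stands in for the JMS queue, and the batch size is an invented parameter. The key point is that producers never touch the queue directly, and the scheduled forwarder only releases a new batch once the previous one has been drained.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal store-and-forward sketch (names and sizes are invented):
// incoming messages land in a "table" first; a scheduled forwarder then
// moves small batches to the destination queue, so the queue depth stays
// bounded no matter how fast producers arrive.
public class StoreAndForward {
    private final List<String> table = new ArrayList<>();   // stands in for a DB table
    private final Deque<String> queue = new ArrayDeque<>(); // stands in for the JMS queue
    private final int batchSize;

    StoreAndForward(int batchSize) { this.batchSize = batchSize; }

    // Producers always write to the store, never directly to the queue.
    void store(String msg) { table.add(msg); }

    // Run by a scheduler (e.g. every few seconds): forward at most one
    // batch, and only when consumers have drained the previous one.
    void forwardBatch() {
        if (!queue.isEmpty()) return; // previous batch not drained: back off
        int n = Math.min(batchSize, table.size());
        for (int i = 0; i < n; i++) queue.addLast(table.remove(0));
    }

    // Consumer side; in the "delete at the very end" variant the row would
    // be removed from the table only after the whole pipeline succeeds.
    String consume() { return queue.pollFirst(); }

    int pending() { return table.size(); }
    int queued()  { return queue.size(); }
}
```

In a real flow the table rows would also carry the extra fields the reconciliation service queries (arrival time, source JCD, processing status), which is exactly what a queue cannot offer.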