We are having the same issue. What is interesting is that we did not have this problem until I started using Empathy. The rest of the staff use PSI or Pidgin. I was the only one using Empathy, which comes with Ubuntu 9.10.

It sounds like you may be encountering the issue where Openfire just completely ignores some IQ packets (in clear violation of the XMPP specs). There are some older tickets relating to that (I don't have the #s handy).

Please file a new issue for this and attach the Help->Debug Window output from a computer that gets disconnected from Openfire.

I wonder if Empathy/Telepathy has the same issue? I wonder if it is worth reporting to their issue trackers as well?

We have 3GB dedicated to Openfire and the memory just slowly leaks away. Right now Openfire is using 1.7GB and the cache shows that it has cached about 400MB. There are only 2400 people online (relatively quiet).

We are running on Ubuntu server. In the evenings when the site is busy (around 7000 people online) the memory is totally used up and the GC is running continuously. It will crash after a few days as there won't be enough RAM. If I restart openfire the memory usage will drop right back to a few hundred MB like it should be.

Can anyone advise a good way to hunt down what is going on with all our memory?

A guy in the team thinks he has found the problem from analyzing a dump - we'll know a bit more tomorrow so I'll let you know if anything comes up.

I've read in a few places that Smack has a few memory leaks, and there are some patches floating around JIRA to fix them. In our most recent load tests the load test agents are crashing before the server really gets any good load, possibly due to this Smack bug. Obviously it's a client problem not a server problem but we need to fix that before we can do any more investigation. And hope XIFF doesn't have similar problems because that's what our real clients will be using...

We're using Grinder to load test, but people using Tsung seem to be pretty happy with it from what I've read.

- Openfire uses MINA, which handles a session per socket. Each session has a queue of outgoing messages with no size limit. Every time you write a message to a session, it gets appended to that queue while waiting for the client to process it.

- If the client hangs and can't process the message (i.e. doesn't send a TCP ACK back), the queue grows and it looks like an Openfire leak. Openfire eventually runs out of memory.

This would mean that any client machine, using any XMPP implementation (XIFF/Smack), could potentially bring down Openfire if it doesn't send that ACK.
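
To make the theory concrete, here is a rough sketch of the kind of defensive check that would avoid it - not Openfire's actual code; the class name and the 1 MB limit are made up, and I'm assuming the MINA 1.x API (org.apache.mina.common) that Openfire bundles:

import org.apache.mina.common.IoSession;

/**
 * Hypothetical guard (sketch only, not Openfire code). Called wherever the
 * server pushes a packet to a client session: if the outbound queue has
 * grown past a limit, the client is assumed to be stalled and is dropped
 * instead of letting the queue (and the heap) grow without bound.
 */
public final class StalledSessionGuard {

    // Arbitrary example limit; a real value would need tuning.
    private static final long MAX_PENDING_WRITE_BYTES = 1024 * 1024; // 1 MB

    private StalledSessionGuard() {
    }

    public static void writeOrDrop(IoSession session, Object packet) {
        // getScheduledWriteBytes() reports data queued but not yet flushed to
        // the socket - exactly what piles up when the peer stops ACKing.
        if (session.getScheduledWriteBytes() > MAX_PENDING_WRITE_BYTES) {
            session.close(); // give up on the stalled client
            return;
        }
        session.write(packet); // queue the packet for delivery as usual
    }
}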

I'd like to hear an Openfire developer's take on this to see if they agree.

That is very interesting. I assume the connection would time out though if it doesn't get a response?

Nothing to do with the ACK, but is anyone using the User Service Plugin? I just restarted it and our memory usage surprisingly dropped by about 300MB. I will have a look at its source code as well to see if I can spot any potential issues.

My patch disconnected all idle clients (clients that had not been sent any data for a while). As you've found, that's not the best of solutions. Instead, the code should detect write timeouts. Luckily, MINA appears to offer that functionality.

I've modified the patch to detect write-timeouts. I haven't been able to test this yet, but could one of you give it a try?
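
For reference, this is roughly the shape of what the updated patch attempts - a simplified sketch, not the actual patch, and I'm assuming the MINA 1.x API that Openfire bundles, where a write timeout is set on the session and a stalled write surfaces as a WriteTimeoutException in exceptionCaught():

import org.apache.mina.common.IoHandlerAdapter;
import org.apache.mina.common.IoSession;
import org.apache.mina.common.WriteTimeoutException;

/**
 * Simplified sketch of the write-timeout approach (not the actual patch).
 */
public class WriteTimeoutHandler extends IoHandlerAdapter {

    @Override
    public void sessionCreated(IoSession session) throws Exception {
        // If a queued write cannot be flushed to the client within this many
        // seconds, MINA raises a WriteTimeoutException for the session.
        session.setWriteTimeout(30); // example value, in seconds
    }

    @Override
    public void exceptionCaught(IoSession session, Throwable cause) throws Exception {
        if (cause instanceof WriteTimeoutException) {
            // The client has stopped reading; close the session so its
            // outbound queue can be garbage collected.
            session.close();
        }
    }
}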

Just as an aside, what we're planning to do at some point is pull a network cable on our load test clients and see how Openfire deals with the sudden disconnection. If the patch works presumably it shouldn't run out of memory.

Speaking of sudden disconnections: we get more complaints about those than about memory leaks. Such broken sessions stay online, somehow keep updating their last activity, and confuse users a lot: their contacts are shown as online all day but never reply, because their laptops haven't been at the desk for hours.

I got the file, thanks. I tried it on a virtual machine with a freshly installed server, but I don't see any difference; I can't reproduce the memory leak there for the moment... Only our real server seems to have problems. I don't want to test it with too many people connected, so if this VM can't help, I will try the patch one of these evenings.

I was! I know it's the problem. But there's no way to reproduce the leak on this new server. Yet as soon as I connect an Empathy client to the main server and disconnect it, it happens. I will try again tomorrow on the VM with more clients.

- We have a problem when we connect an Empathy client to our production server. 60 users, mainly Pidgin + Spark. Once we connect/disconnect an Empathy client, the server becomes unusable. It's not clear yet if it's a memory leak or something else; the processor goes to 100% and the machine is unusable.

- I have installed a new server on a VM to test the patch, but I can't reproduce the problem there. It works fine with the fresh install and an Empathy client, so I don't know if this patch solves my problem. The only way to test is on the production server, but I won't do that during the day to avoid disturbing the users.

At first glance, your problem doesn't look like a memory leak. These problems usually take some time to develop into a real issue - they don't usually pop up immediately after one client connects or disconnects.

Perhaps you could install the monitoring plugin described in the blog post New Openfire monitoring plugin. This will give you a bit more detail on the overall health of your environment.

Are there any entries in the logs around the time that the client causing the instability connects?

I have moved the software to another machine with more memory. No more CPU at 100% now, but I see the Java heap memory growing slowly. It's a standard installation; I haven't tried the patched file yet. There's nothing really interesting in the logs, and the new monitoring doesn't tell me much, but mainly because I don't know how to read it.

Sorry to hear the patch didn't solve your issue. The patch was based on the description that (the guy in the team of) mikeycmccartarthy gave. Either I got the fix wrong, or you're suffering from another problem.

I've got the feeling that several issues are being discussed in this thread. Things appear to get mixed up a bit. Perhaps we should discuss the issue that you're experiencing one-on-one. Could you send me a message or chat to me offline? You'll find my contact details in my profile.

We started our load test at 2pm yesterday using the patched version of Openfire. I was going to try the disconnect when I got in this morning but Openfire died at about 2am, 12 hours later.

The test gradually ramps up to 4000 people chatting on the server, across 5 rooms, with a throughput of about 3.6 chats/second. Heap memory rises slowly until it's just over 1 GB at 1:45, then suddenly it spikes up and Openfire dies.

I have lots of graphs, all logs etc, would really appreciate it if we could go through these together. I'm a bit worried that the Openfire load test stats on this website only show a short period of testing. Do you find you need to restart Openfire often for Nimbuzz and what is the typical load?

You can send me all of the raw data (graphs, logs, etc) to my private address (see my profile). I'm also interested in your test environment. Could you send me the details of that too, please?

Non-disclosure forbids me from being exact, but the number of users that Nimbuzz processes is a lot, lot higher than what you're processing. They do have occasional problems, but restarts were typically required after a few weeks, not hours.

Official JiveSoftware developers don't work actively on this project anymore, so there is only a group of users who try to fix issues and answer basic questions. It seems that only a small number of users are doing load tests, so there is not much information about this.

We're now using Tsung to load test rather than Grinder and we're getting the same issues.

Does anyone know of an Openfire installation being used in any large-ish scale production environment? We're at the stage where I think we're going to have to abandon Openfire and use something like ejabberd : (

I just noticed the same problem with 3.6.4. I was using 3.6.3 before this until the pubsub problem (I've started a thread on this) surfaced. I disabled pubsub in 3.6.4 and gave it a 1024M max heap. With ~200 sessions, the heap was up to 1000M within a few hours. None of the users are on Empathy though.

Thank god I found this thread - ever since I moved to Empathy in Ubuntu 9.10, my server has been crashing with 100% CPU usage. I changed the server-to-server setting and increased the Java memory size, but no good. I just switched back to Spark on Linux.

Most of the people responding to this thread are reporting that Openfire runs out of memory (causing OutOfMemoryErrors) after at least one user started to use the Empathy client. I've created a new bug report for the Empathy issue specifically: OF-82

So far, I've not been able to identify where things go wrong. We need your help!

I would like to receive a couple of thread dumps (or possibly, a memory dump) of an instance of Openfire that is about to run out of memory. If you can provide these dumps, please contact me (contact details can be found in my profile).
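
(If you're not sure how to get these: on a Sun JVM, jstack <pid> produces a thread dump and jmap -dump:format=b,file=openfire.hprof <pid> a heap dump. Alternatively, a small hypothetical helper like the one below can trigger a heap dump from inside the JVM - it assumes a HotSpot/Sun JVM, since HotSpotDiagnosticMXBean is a com.sun.* extension rather than standard API.)

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

/**
 * Hypothetical helper (not part of Openfire): writes a heap dump of the
 * running JVM to the given file. Requires a HotSpot/Sun JVM, because
 * HotSpotDiagnosticMXBean is a com.sun.* extension.
 */
public final class HeapDumper {

    private static final String HOTSPOT_BEAN_NAME = "com.sun.management:type=HotSpotDiagnostic";

    private HeapDumper() {
    }

    public static void dumpHeap(String fileName) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                server, HOTSPOT_BEAN_NAME, HotSpotDiagnosticMXBean.class);
        // 'true' limits the dump to live objects, which is what we want
        // when hunting a leak.
        bean.dumpHeap(fileName, true);
    }
}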

Most likely, the memory leak is linked to specific functionality. Are there any clues as to what functionality causes this problem?

I'm noting now that a number of the users who are experiencing this problem are using the monitoring plugin. This could be coincidence, of course, but let's check, to be sure. Can you guys reproduce the problem without the monitoring plugin? Can you stop the problem by unloading the monitoring plugin (keep an eye on java-monitor's memory usage graphs after you do this!)?

I think I may have started some of this - I'm on holiday at the moment so this may be the last post for a while but just to sum up some of our findings when running quite an intensive load test on Openfire:

- We needed to add the stalled session property that Guus has mentioned. When our clients dropped off for various reasons (out-of-memory exceptions on clients, network errors, etc.), Openfire did not deal with it that efficiently.

- We set our heap size wrong on the Openfire server itself: we were under the impression the machine had more memory than it did, so we set our heap size too high.

- The default settings for logging room conversations don't really lend themselves to heavy MUC usage (they're fine for IM). We were generating pretty high traffic (2000 users, something like 3 chats per second). If you look at the source code, I believe the default is to log in batches of 50; I can't remember the interval off the top of my head, but basically our backlog of messages held in memory grew far too large. We've now upped the batch size and shortened the interval and it's behaving much better, logging 250 messages per minute, although as it's not a real SQL batch statement (it's individual inserts) I would imagine the database may be getting a bit of a hammering (see the sketch below for the general pattern).
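
To illustrate why the defaults hurt at our traffic levels, here is a simplified sketch of the pattern involved - this is not Openfire's actual logging code, just the general shape: messages accumulate in an in-memory queue and a scheduled task flushes at most a batch-size's worth of them each interval. At ~3 chats per second we add roughly 180 messages a minute, so a flush of 50 messages has to run at least four times a minute just to keep up; anything slower and the backlog (and the heap) keeps growing.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Simplified sketch of batched conversation logging (not Openfire's code).
 * Incoming room messages are queued in memory, and a scheduled task writes
 * at most 'batchSize' of them to the database every 'intervalMs'. If the
 * flush rate is lower than the incoming message rate, the queue grows
 * without bound and the heap fills up.
 */
public class ConversationLogger {

    private final Queue<String> pending = new ConcurrentLinkedQueue<String>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final int batchSize;

    public ConversationLogger(int batchSize, long intervalMs) {
        this.batchSize = batchSize;
        scheduler.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                flushBatch();
            }
        }, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    }

    /** Called for every message sent to a logged room. */
    public void queueMessage(String message) {
        pending.add(message);
    }

    private void flushBatch() {
        // Write at most batchSize messages per run; the rest stay queued
        // (and keep consuming memory) until the next run.
        for (int i = 0; i < batchSize; i++) {
            String message = pending.poll();
            if (message == null) {
                return;
            }
            insertIntoDatabase(message); // individual INSERTs, not a real SQL batch
        }
    }

    private void insertIntoDatabase(String message) {
        // Placeholder for the per-message INSERT statement.
    }
}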

Anyway, I believe our Openfire is up for the time being - thanks for all your help all : )

Everything appears fine on the nascent test system I set up (above)... so out of curiosity I logged in to our production server using Empathy 2.28.1.1 from an Ubuntu 9.10 virtual machine. And bingo - it's all on. Empathy indeed appears to be the kiss of death for OpenFire.

I can confirm that what I'm seeing is without the monitoring plugin installed.

We have a very modest Openfire installation, just 46 users, which ran untended and uninterrupted for hundreds of days until November 5th, when we had the first "out of Java memory" crash. We took the enforced downtime as an opportunity to upgrade to 3.6.4.

It too crashed seven days later - out of Java resources.

Today (five days later) Java memory usage was at nearly 90% (of 1GB) so we restarted Openfire to preempt another crash.

Most of our clients are Pidgin (on Windows & Ubuntu), plus a smattering of OS X iChat. A couple of the Ubuntu users switched to Empathy on release of Ubuntu 9.10 (29 October). This does seem to have coincided with the onset of our problems.

We have requested Empathy users to switch back to Pidgin and are monitoring closely.

In this case I was able to exhaust JVM memory completely in about five minutes from bringing the OF server up - simply by repeatedly exiting and relaunching Empathy rapidly. This was with one user (me), with the remaining three or so users (at this late time of the day hehe) using Spark or Pidgin. If I am reading this right, then I can easily see how several Empathy users in a large deployment would create headaches.

Not necessarily the same problem then - there seems to be a growing body of evidence implicating Empathy in this particular issue, and in my case it appears I have a solid test case with which I can reproduce this consistently.

I agree with Dave here. Given the amount of noise surrounding the Empathy client, something must be up there. You are probably not running into the same problem, but into something different with similar effects.

I think there are at least 3-4 different issues discussed on this thread, and it gets very confusing - for Guus too, I think. It would be better to discuss the Empathy issue in the Empathy thread. Separate threads should be started for the random out-of-memory issues too.

I started having this same exact issue after I set up a laptop with Ubuntu Desktop 9.10 and the Empathy IM client. Once I saw this post I turned off the Ubuntu laptop and rebooted the Ubuntu Server. The problem has not come back since.

I have noticed those connections too. Empathy queries possible proxy servers, which is why you see a lot of server-to-server connections being created. Although not very nice, it should be of no consequence. I'm running some quick tests to make sure.

It'll stop the random servers being called from Empathy, although it doesn't seem related to Guus's PEP theory.

To enable Proposed updates in Ubuntu, go to "Software Sources" in Administration, go to the Updates tab, and select "Proposed". Then reload your sources. telepathy-gabble should then appear in update-manager.

A couple of Empathy clients are connected and there's no problem so far; the memory use doesn't show any sudden increase like before. I guess you did it, Guus - great job! I will try to add some more Empathy clients to see if it's stable.

I've been trying to identify the cause of the problem, but so far, have been unable to do so. I've asked other developers to have a look too, but none of them have been successful either. Sadly, my attempts are severely hindered by lack of time - I'm doing this in my spare time, which is limited.

I'm not comfortable at all releasing the next version of Openfire with this bug in it, but there's going to be a point in the near future where I feel we should be pragmatic and move forward. The release has been postponed for too long now.

In the meantime, the lead developer of the Empathy client has confirmed that updates have been released that should dramatically reduce the impact of the bug. Thanks, Sjoerd! I haven't tested the new client yet though.

Although I'm having trouble identifying the exact cause, I did manage to make some progress. While investigating, I've discovered a number of smaller and bigger issues in the PEP / pubsub routines of Openfire. There's no obvious direct link between these issues and the memory leak that we've been discussing here, but the issues that I'm addressing now are likely candidates, in the sense that these kind of (concurrency-related) bugs are known to cause these kind of problems. I'm currently busy rewriting parts of the PEP routines. I'm hoping that my general improvements make this problem go away (or at least help me to identify the cause). This borders on the edges of educated guessing and wishful thinking, but hey, it's the holiday season.
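
To give an idea of what I mean by these kind of concurrency-related bugs (a generic illustration, not actual Openfire code): a plain HashMap used as a cache from multiple threads can silently lose or retain entries forever, and under the right interleaving it can even corrupt its internal structure so that readers spin at 100% CPU.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Generic illustration of a concurrency bug pattern (not Openfire code):
 * a cache keyed by user that is populated from several threads.
 */
public class NodeCache {

    // Broken variant: java.util.HashMap is not thread-safe. Concurrent puts
    // can lose entries or corrupt the map, and entries that are never
    // removed on disconnect simply accumulate - i.e. a leak.
    //
    //   private final Map<String, Object> nodes = new HashMap<String, Object>();

    // Safer variant: a concurrent map, plus explicit cleanup on disconnect.
    private final Map<String, Object> nodes = new ConcurrentHashMap<String, Object>();

    public void onPublish(String user, Object node) {
        nodes.put(user, node);
    }

    public void onDisconnect(String user) {
        // Without this, the cache grows for as long as the server runs.
        nodes.remove(user);
    }
}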

My environment:

Server (current): Openfire 3.7.0 release (started with the Openfire 3.6.4 release and moved through some nightly builds and a beta).

Clients (current): Miranda IM 0.9.19.0 (started with Miranda IM 0.9.10.0 and moved through the stable releases).

LDAP (AD) users and groups (around 120 users), rosters distributed via shared groups, around six assorted plugins (I've tried disabling all of them while digging into this problem, so the list is not important).

I started suffering from daily memory exhaustion after putting 3.6.4 into production; the Java memory configuration was the default, min=max=64 MB. I then tried increasing it to min=64 MB, max=256 MB, but that only stretched the time to exhaustion (to about 2 days). My temporary solution was to schedule a nightly server restart - and even if that was acceptable in my (corporate) environment, it is not a good solution for any server.

I read about Empathy/PEP/Java memory some time ago, but: firstly, we don't use Empathy at all; and secondly, it was not clear to me what PEP actually is. So I waited for the Openfire 3.7.0 release, hoping it would solve my memory problem - and when 3.7.0 was finally released and the problem was not solved, I started analyzing and digging actively. I must say that Java Monitor is really a great tool for troubleshooting Java apps and servers (thanks to Guus for telling us about it); it helped me see what's going on inside my server (and keeps telling me).

Once the influence of all of my plugins had been ruled out, I recalled this PEP thing and read about it carefully. I found out that it actually overlaps with my other issue, but no one (not even the gurus) had pointed that out to me. I can now confirm that disabling PEP solved my long-suffered memory exhaustion (leak) problem, and here are a couple of related findings:

1. I think this problem is not related to Empathy only; I suspect it affects any client that supports PEP (i.e. sending/receiving Mood/Activity/Tunes information or other "personal events") - and this should be pointed out here or somewhere else, so that people are clearly aware of it. If "personal events" are not actually being sent by clients (i.e. the clients are not capable of doing so), it doesn't matter whether PEP is enabled or not; the problem appears only when PEP is actually used. And, while the PEP implementation in Openfire remains a known memory hog (until the bugs are found and fixed), wouldn't disabling it by default (out of the box) be a good idea? I guess there are very few folks who use it consciously (and very few clients that support it).

2. As far as I understand, setting the property xmpp.pep.enabled to false doesn't actually disable PEP capability advertising by the server; it simply disables processing of events of this type - because after setting this property and restarting the server, I still get the following while connecting:

So, is there a way to disable PEP advertising (i.e. completely disable PEP)? If there is no such way, it would be a desirable addition.

The point is that Miranda IM clients (I don't know about other clients, including Empathy) show the "personal events" menus only when this capability is advertised by the server; if it's not, there are simply no such menus (I found this out long ago by connecting to some public non-PEP, non-Openfire servers). Thus, it's not obvious to people why they have these menus but can't actually set "personal events".
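
For anyone who wants to try the same workaround: the property can be set under System Properties in the admin console, or from plugin code via JiveGlobals (Openfire's system property store); a minimal sketch:

import org.jivesoftware.util.JiveGlobals;

/**
 * Minimal sketch: disabling PEP processing by setting the xmpp.pep.enabled
 * system property to false - the same thing the admin console's System
 * Properties page does (a server restart was done afterwards, as described
 * above).
 */
public class DisablePep {

    public static void disablePepProcessing() {
        JiveGlobals.setProperty("xmpp.pep.enabled", "false");
        // As noted above, this stops Openfire from processing PEP events,
        // but the PEP capability may still be advertised to clients.
    }
}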

I can confirm that setting the property xmpp.pep.enabled to false prevents a memory leak in openfire 3.7.

We had some folks using a few different clients (Adium, Pidgin, etc.), and at least one of them was definitely causing a memory leak; we'd have to restart openfire at least once a week with a heap size of around 1 GB.

Since we put that property in place, it's been smooth sailing. BTW, we have 3 other servers where we tightly controlled which clients were allowed to connect, and they did not suffer from the memory leak, presumably because the clients in use didn't use PEP features.

In my opinion, this is a MAJOR bug with openfire - a client program should not be able to take down your server by generating a memory leak. The server should correctly handle whatever the client does to prevent this.

I'm rather surprised that this issue has persisted for so long and that it hasn't been addressed...

Not sure if this will help anyone, but I was able to fix the memory issues I was having by pointing openfire to a different, 64-bit JRE. The OS is RHEL 6.1. Here is my config file from /etc/sysconfig/openfire.

I run openfire 3.7.0. Other than a few administrative changes yesterday, I haven't had to restart Openfire in many, many months since using that JRE. I keep the GC log enabled since it barely takes up any space (now that there are no memory errors). That was how I was able to figure out where the problem was originally.