We have been running Zimbra for over a year, and it has been in production since the beginning of this year. We have been slowly migrating users over from a simple Linux/Postfix/POP3 setup and currently have about 220 active users. Nearly all of them use the Zimbra Web Client, with maybe a couple still using Outlook in IMAP mode. We are running Zimbra version 4.5.5_GA_838.RHEL4_20070503175122 on CentOS 4.

This morning I was greeted with a flood of calls because no one was able to log in to the Web Client. When I ran 'zmcontrol status' on the server I saw "mailbox Stopped ... tomcat is not running". I stopped and started the Zimbra services using zmcontrol, and that got everybody going. Then I started to troubleshoot. I traced the problem back to yesterday afternoon around 4:00 PM, when "java.lang.OutOfMemoryError: Java heap space" errors began appearing in mailbox.log. About a dozen of these errors were logged over the next hour or so, and after that the logs show the mailbox service stopping. While I was looking into the problem, after a couple of hours of running fine, I started seeing the "Java heap space" errors again and users started reporting problems. At that point I rebooted the server; it had been up for 71 days. I've been monitoring the server since the reboot and so far it seems OK, but it's still too early to tell.
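For anyone wanting to reproduce this kind of log trace, here is a minimal Python sketch that counts the OutOfMemoryError lines per hour. It assumes each log entry in mailbox.log starts with a timestamp of the form `YYYY-MM-DD HH:MM:SS,mmm`, which is an assumption about the log layout and may need adjusting. The function name `count_oom_by_hour` is made up for illustration, and the log path is passed on the command line rather than hard-coded.

```python
import re
import sys
from collections import Counter

# Assumed entry prefix: "YYYY-MM-DD HH:MM:SS,mmm ..." (adjust if your log differs).
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}")
NEEDLE = "java.lang.OutOfMemoryError"


def count_oom_by_hour(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH' to the number of OOM lines.

    Stack-trace lines usually carry no timestamp, so each match is attributed
    to the most recent timestamped line seen before it.
    """
    counts = Counter()
    current_hour = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            current_hour = match.group(1)
        if NEEDLE in line and current_hour is not None:
            counts[current_hour] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_oom.py /path/to/mailbox.log
    with open(sys.argv[1], errors="replace") as handle:
        for hour, total in sorted(count_oom_by_hour(handle).items()):
            print(hour, total)
```

Grouping by hour makes it easy to see when the errors started and whether they cluster before each service stop.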

So, my first question is what would have caused this error to start happening, basically out of the blue, after we have been running for so long without any problems? Have we run into an issue with migrating users where we have reached some type of resource limit? The server has 2 GB of memory. Lastly, I found some information in the forums related to the error I am seeing and the "tomcat_java_heap_memory_percent" variable. Is this something I should be looking to adjust?
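To make the question about "tomcat_java_heap_memory_percent" concrete, here is a rough sketch of the arithmetic. It assumes the setting is read as a percentage of physical RAM when the Java heap is sized, which I have not confirmed. The percentages 30 and 40 are purely illustrative, neither a known default nor a recommendation, and `heap_mb` is a made-up helper name.

```python
def heap_mb(total_ram_mb, percent):
    """Heap size in MB if `percent` of physical RAM is given to the JVM.

    Assumption: the setting is applied to total system memory.
    """
    return total_ram_mb * percent // 100


if __name__ == "__main__":
    total_ram_mb = 2048  # the server in this post has 2 GB
    for percent in (30, 40):  # illustrative values only
        print(f"{percent}% of {total_ram_mb} MB = {heap_mb(total_ram_mb, percent)} MB heap")
```

Whatever the real numbers are, any increase in the heap comes out of the same 2 GB that the rest of the services on the box need, so the setting and the total memory have to be considered together.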

Any help would be greatly appreciated.

TIA,
John

06-06-2007, 08:21 AM

soxfan

Bump... Well, 24 hours later there are no signs of any problems. On the one hand, one issue causing downtime in 15 months isn't bad; on the other hand, some would argue any significant downtime is unacceptable. Our migration is only partway done; I expect to have close to 400 users on the Zimbra server when all is said and done. If we are at the point where we should start thinking about adding more memory to the server, I'd like to get the ball rolling sooner rather than later. Also, should I be thinking about rebooting the server on a regular basis to start fresh, so to speak? I'm used to doing this type of thing on Windows servers, but I usually just let Linux servers run for months on end.

I'd love to get some feedback on what would have caused this problem to crop up, and what I can do to prevent it from happening in the future.