System-Load with M4

Hi,

I've been looking at Zimbra again after being sidetracked for the past few months. I wrote a small Perl script to simulate sending messages to about a dozen users on the system, with a message size of 5 - 10k and a message sent every 2-3 seconds (sometimes that means two messages at once, sometimes none for 6 - 10 seconds).
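
For anyone who wants to reproduce this kind of test, here is a rough Python sketch of a paced sender. It is not the original Perl script, only an approximation of what is described above. The host, port, and addresses are placeholders, and the fixed 2-3 second pause is an assumption about the pacing.

```python
import random
import smtplib
import time
from email.message import EmailMessage


def build_body(rng):
    """Return filler text of roughly 5-10 KB, wrapped into 70-character lines."""
    size = rng.randint(5 * 1024, 10 * 1024)
    line = "x" * 70 + "\n"
    return line * (size // len(line))


def build_message(sender, recipient, rng):
    """Build one test message addressed to a single recipient."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "load test"
    msg.set_content(build_body(rng))
    return msg


def next_delay(rng):
    """Seconds to pause before the next send (assumed 2-3 s pacing)."""
    return rng.uniform(2.0, 3.0)


def run(host, port, sender, recipients, count, seed=None):
    """Send `count` messages to randomly chosen recipients, one connection each."""
    rng = random.Random(seed)
    for _ in range(count):
        msg = build_message(sender, rng.choice(recipients), rng)
        with smtplib.SMTP(host, port) as smtp:
            smtp.send_message(msg)
        time.sleep(next_delay(rng))
```

Something like run("localhost", 25, "tester@example.com", ["user1@example.com", "user2@example.com"], 100) would drive it. All of those values are made up for illustration.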

Anyway - the system load became so high (30+) that I could barely run "uptime" or do anything else on the system. The test machine is an Athlon64 with 1gig of RAM running 32-bit CentOS4.2 (~RHEL4).

Is this remotely normal? This is with no-one actually reading their mail at the same time.

Java & amavisd seem to be the two processes I see at the top of "top" most often. I've tried with swap completely disabled and still get the same kind of load.

system load

This is on M4 (41, I assume?)

I've run similar tests, and your system load shouldn't be that high. I'm not sure how much this is related to architecture, but some things to further investigate:

When you're injecting the messages, how large do the postfix queues get? How many lmtp processes (in postfix) seem busy? Are the messages backing up before amavis, or after? Try tailing /opt/zimbra/log/zimbra.log and look for the message deliveries - are they coming fast, or slow?

Is /opt/zimbra/amavisd-new-2.3.3/tmp mounted as tmpfs? That speeds up amavis quite a bit.
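
If you're not sure how to check, one option is to look the directory up in /proc/mounts. Below is a small Python sketch of that lookup. The function name is made up, and it assumes the usual "device mountpoint fstype options" layout of /proc/mounts on Linux.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the deepest mount point containing path."""
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or (path + "/").startswith(prefix):
            # Prefer the longest match; on ties the later mount wins.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type
```

Calling fs_type_for("/opt/zimbra/amavisd-new-2.3.3/tmp", open("/proc/mounts").read()) should return "tmpfs" if the directory is mounted that way.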

What's the IOWAIT on the box like? What kind of disk are you using? (local, san, nfs, usb drive...)
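
If top is too sluggish to read under load, the same number can be worked out from two readings of /proc/stat. Here is a hedged Python sketch. The function names are made up, and it assumes the standard field order on the aggregate "cpu" line (user, nice, system, idle, iowait, ...).

```python
def cpu_fields(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            # Keep at most the first 8 counters; later guest fields are
            # already included in user/nice and would be double counted.
            return [int(v) for v in parts[1:9]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two /proc/stat readings."""
    old, new = cpu_fields(before), cpu_fields(after)
    deltas = [n - o for o, n in zip(old, new)]
    total = sum(deltas)
    if total == 0 or len(deltas) < 5:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter
```

Read /proc/stat, sleep a few seconds, read it again, and pass both strings in.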

Is it possible to write a script that dumps Zimbra-related performance stats, so that a user can run it every 5 minutes while load testing or troubleshooting?
This way the data collected by the script can be analyzed or posted here for expert opinion on how to optimize the Zimbra setup.
And yes, some things the admin needs to know on their own, but those will be global settings of the server, which are easy to give out.

If anyone at Zimbra thinks it's easy to write this script, it would be a great help.
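
As a starting point, here is a minimal Python sketch of what such a snapshot script might look like. It is not an official Zimbra tool. It only records the load average and the number of files in the Postfix queue directories, and the spool directory has to be passed in because its location on a Zimbra install is not given in this thread.

```python
import os
import time


def count_queue_files(spool_dir, queues=("incoming", "active", "deferred")):
    """Count files under each Postfix queue directory (walks hashed subdirs too)."""
    counts = {}
    for name in queues:
        total = 0
        for _root, _dirs, files in os.walk(os.path.join(spool_dir, name)):
            total += len(files)
        counts[name] = total  # stays 0 if the directory is missing or unreadable
    return counts


def snapshot_line(spool_dir, now=None):
    """Format one line with a timestamp, the load averages and the queue sizes."""
    load1, load5, load15 = os.getloadavg()
    counts = count_queue_files(spool_dir)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    queue_part = " ".join("%s=%d" % (k, counts[k]) for k in sorted(counts))
    return "%s load=%.2f/%.2f/%.2f %s" % (stamp, load1, load5, load15, queue_part)
```

Printing snapshot_line(...) from cron every 5 minutes and appending it to a file would give something that can be posted here. Reading the spool directories normally needs root or the postfix user.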

When you're injecting the messages, how large do the postfix queues get?

After running for 10 minutes (system load up from 0.3 at idle to 20 now), it's sitting at about 15 messages in incoming & active.

How many lmtp processes (in postfix) seem busy? Are the messages backing up before amavis, or after? Try tailing /opt/zimbra/log/zimbra.log and look for the message deliveries - are they coming fast, or slow?

There only seem to be a few (6 or so) lmtp processes running - none seem to be using much system time.
I'm not sure how to test whether they are backing up before amavis?
zimbra.log shows a delivery - on average - 3 seconds or so.

Is /opt/zimbra/amavisd-new-2.3.3/tmp mounted as tmpfs? That speeds up amavis quite a bit.

Yup

What's the IOWAIT on the box like? What kind of disk are you using? (local, san, nfs, usb drive...)

Performance

I see 3 mysqld groupings - are you running another DB besides the message store and logger instances?

I think I asked the IOWAIT question incorrectly - what I was after is what percentage of processes (as reported by top) are in the iowait state?

It's hard to tell if they're backing up before or after amavis - but if amavis is eating a large amount of CPU, they're probably backing up before it.

Couple of things to try to narrow this down:

1 - Run a test that just blasts mail into the system - no delay between sends, multiple processes sending mail to multiple accounts at the same time. Load up the queues, stop sending, and watch what the server does as the queues drain - this will eliminate the variables in the send delay, etc.

2 - check the postfix logs, look for delay=<num> - this will tell you how long it takes postfix to deliver a message. Each message takes 2 hops: postfix->amavis->postfix (via smtp) and postfix->mailstore (via lmtp). Which delay is higher?

3 - turn off amavis - you can do this in the admin console, which should cut the amavis step out of the delivery process. This will tell us where things are slowing down.
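
For step 2, a quick way to compare the two hops is to average the delay= values per Postfix delivery agent. A rough Python sketch follows. It assumes the log lines carry the usual postfix/<service>[pid]: tag, so that smtp lines are the hand-off to amavis and lmtp lines are the delivery to the mailstore. The function name is made up.

```python
import re
from collections import defaultdict

# Captures the delivery agent (smtp, lmtp, ...) and the delay= value in seconds.
DELAY_RE = re.compile(r"postfix/(\w+)\[\d+\]:.*?\bdelay=(\d+(?:\.\d+)?)")


def average_delays(lines):
    """Return the mean delay= value per Postfix service seen in the log lines."""
    totals = defaultdict(lambda: [0.0, 0])  # service -> [sum of delays, count]
    for line in lines:
        match = DELAY_RE.search(line)
        if match:
            entry = totals[match.group(1)]
            entry[0] += float(match.group(2))
            entry[1] += 1
    return {service: total / count for service, (total, count) in totals.items()}
```

Feed it the lines of the mail log from the test window. Whichever of smtp or lmtp shows the higher average is the slower hop.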

Sorry for not replying to this - I didn't get a mail saying a new post had been added. I took zimbra off of the machine it was on and put it on a more standard P4 box. System load is fine running the same test.