I have a custom-built Ubuntu 11.04 server with a 6-disk software RAID 10 as its primary drive. On it I'm primarily running PostgreSQL and a few other utilities that stream data from the web. After a few hours of uptime, I often find that all kinds of processes start to lag. For example, it may take 10-15 seconds after log-in to get a shell prompt. It might take 5-10 seconds for top to come up. An ls might take a second or two.

When I look at top there is almost no CPU usage. There's a fair amount of memory used by the PostgreSQL server but not enough to bleed into swap.

I have no idea where to go from here, other than to suspect the RAID10 (I've only ever had software RAID 1's before).

5 Answers

You're actually using both a physical storage controller that exposes your disks to the Linux kernel (whether or not you're using its built-in RAID capabilities) and software RAID. You can't rule out the possibility that your storage controller is poorly-supported. Use the output of hdparm -t /dev/sd{a,b,c,d,e,f} to diagnose the issue (this command will take a while).
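With six disks it is easier to compare the numbers side by side. A minimal sketch, assuming the usual `hdparm -t` output (a `/dev/sdX:` line followed by a `Timing buffered disk reads: ... = N MB/sec` line) and that every result is reported in MB/sec; the helper name `summarise_hdparm` is made up for illustration:

```shell
# Hypothetical helper: reads `hdparm -t` output on stdin and prints
# "<rate> <device>" pairs, slowest disk first.
summarise_hdparm() {
    awk '
        # Remember the device named on the "/dev/sdX:" line.
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }
        # The rate is the field just before the trailing unit.
        /Timing buffered disk reads/ { print $(NF-1), dev }
    ' | sort -n
}

# Usage (needs root; device names taken from the answer above):
#   for d in /dev/sd{a,b,c,d,e,f}; do sudo hdparm -t "$d"; done | summarise_hdparm
```

A disk that sits far below the others at the top of the list is the one to look at first.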

Since you see some inordinate slowness on /dev/sda, I'd suspect disk failure or controller failure. Double-check that your storage controller is well-supported and try to replace /dev/sda as quickly as possible.

I've got an idea. The hdparm output you posted shows that the sda drive is very slow. It could be because:

a) You have your / and (part of) your RAID 10 on the same disk, or...

b) There's a problem with some driver.

If you upgraded the kernel, try booting the default kernel that ships with Ubuntu.

As @Oneiroi pointed out, you should try iotop while running programs in the background. First run ls only where the RAID is mounted; then run ls on / and on the RAID at the same time. If the second case slows down, then reason (a) could be the cause.

Try using grep to search in /var/log/dmesg, syslog and messages for words like hdd, kernel, raid or postgresql.
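Something along these lines would do the search in one go. This is only a sketch: the helper name `scan_logs` is made up, the keywords are the ones suggested above, and not every log file exists on every system (so unreadable files are skipped). Note that `kernel` matches a lot of lines in syslog, so you may want to narrow the pattern:

```shell
# Hypothetical helper: search the given log files for the keywords above.
scan_logs() {
    pattern='hdd|kernel|raid|postgresql'
    for f in "$@"; do
        # Skip files that do not exist or are not readable on this system.
        if [ -r "$f" ]; then
            # -i ignore case, -H print file name, -n print line number, -E extended regex
            grep -iHnE "$pattern" "$f"
        fi
    done
    return 0
}

# Usage:
#   scan_logs /var/log/dmesg /var/log/syslog /var/log/messages | less
```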

Plus, I would try marking sda as failed and removing it from the RAID. Then try hdparm again. If the array performs well without it, then the problem is the sda disk.
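For reference, the sequence with mdadm would look roughly like the output of the dry-run helper below. It only prints the commands, it does not run them. The array name `/dev/md0` and the member `/dev/sda1` are assumptions; check `/proc/mdstat` for the real names. Keep in mind that removing a member leaves the array degraded and that adding it back triggers a resync, so only do this if you have backups and the other disks are healthy:

```shell
# Hypothetical dry-run helper: prints the mdadm commands instead of running them.
# $1 = array (e.g. /dev/md0), $2 = member device (e.g. /dev/sda1); both are assumptions.
plan_member_removal() {
    echo "cat /proc/mdstat"               # confirm array and member names first
    echo "mdadm --manage $1 --fail $2"    # mark the member as failed
    echo "mdadm --manage $1 --remove $2"  # take it out of the array
    echo "hdparm -t $2"                   # re-test the device on its own
    echo "mdadm --manage $1 --add $2"     # put it back (starts a resync)
}

# Usage:
#   plan_member_removal /dev/md0 /dev/sda1
```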

Another possibility is that the problem is PostgreSQL. If possible, start the server without PostgreSQL and see if the problem is still there. If it is, shut down any other services you may have. You could also try shutting down everything but PostgreSQL. If you can do this, you should be able to tell whether the problem is caused by:

a) PostgreSQL

b) Other Services

c) Manipulation of large amounts of data

d) System itself.

Depending on what you have tried so far, you may be able to say which problem (a, b, c or d) you have, and then get better help.

Plus, if @SilverbackNet has the opportunity, he could tell us about his server, so we know what is similar between both servers and can work towards a solution.

PS: Sorry for my bad English. Feel free to edit and correct errors; there must be a lot.

PS2: I hope this is helpful, but it's just a bunch of theories I thought could help :)

Way too much logging going on, or log files not getting cleaned up properly. If they have gotten very large, it is taking time to load / save them during regular operations.
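A quick way to check is to list the biggest files under the log directory. A minimal sketch, assuming GNU find (as shipped with Ubuntu) and that the logs live under /var/log; the helper name `largest_logs` is made up:

```shell
# Hypothetical helper: print the N largest files under a directory as "<bytes> <path>".
# $1 = directory (default /var/log), $2 = how many to show (default 10).
largest_logs() {
    dir=${1:-/var/log}
    count=${2:-10}
    # -printf is a GNU find extension; errors for unreadable dirs are hidden.
    find "$dir" -type f -printf '%s %p\n' 2>/dev/null | sort -rn | head -n "$count"
}

# Usage:
#   sudo sh -c '. ./largest_logs.sh; largest_logs /var/log 10'
# or simply paste the function into a root shell and call it.
```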

Network connection or SSH issue. I have had similar symptoms with Ubuntu 11, where when I SSH into the machine the SSH connection seems to hang or respond very slowly after even a short period of time. Directly hooking a monitor up seems to be as fast as ever, however. With Ubuntu 12 server, the problems disappeared.

There's one possible thing completely left out from the details as I type this.

Perhaps something triggers lots and lots of context switches and/or interrupts? That would probably show up as a high system% in top, but in any case, take a look at vmstat 1 and watch the in (interrupts) and cs (context switches) columns. And paste the result into your question, too.
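If the full vmstat output is too noisy, the two columns can be pulled out by name. A small sketch, assuming the standard vmstat layout where the header row starts with `r` and contains `in` and `cs`; the helper name `in_cs` is made up:

```shell
# Hypothetical helper: reads vmstat output on stdin and prints only the
# interrupt (in) and context-switch (cs) values from each sample row.
in_cs() {
    awk '
        # Header row: find which columns hold "in" and "cs".
        $1 == "r" {
            for (i = 1; i <= NF; i++) {
                if ($i == "in") in_col = i
                if ($i == "cs") cs_col = i
            }
            next
        }
        # Sample rows start with a number; banner rows are skipped.
        in_col && $1 ~ /^[0-9]+$/ { print "in=" $in_col, "cs=" $cs_col }
    '
}

# Usage (ten one-second samples):
#   vmstat 1 10 | in_cs
```

Run it once while the box is responsive and once while it is lagging, and compare.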

Which I/O scheduler are you using? Assuming you don't have a hardware RAID controller with lots of memory, deadline is probably the most sensible option, but try CFQ if you've currently got deadline configured.
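The current scheduler is exposed per disk under sysfs, with the active one shown in square brackets (for example `noop deadline [cfq]`). A minimal sketch for reading it; the helper name `active_scheduler` is made up and `sda` is just an example device:

```shell
# Hypothetical helper: print the active I/O scheduler (the bracketed name)
# from a sysfs scheduler file such as /sys/block/sda/queue/scheduler.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Usage:
#   for d in /sys/block/sd?/queue/scheduler; do echo "$d: $(active_scheduler "$d")"; done
#
# Switching at runtime (not persistent across reboots):
#   echo deadline | sudo tee /sys/block/sda/queue/scheduler
```

Since the array is built from six disks, each member disk has its own setting, so check all of them.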

What are the filesystems and mount options? How have you configured the disks (for IDE/SATA, what does hdparm tell you? Check the acoustic settings, DMA, readahead and cache)?
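To gather that information for the question, something like the following could be used. It is a sketch only: the helper name `list_mounts` is made up, it assumes the usual /proc/mounts layout (device, mount point, filesystem type, options), and `/dev/sda` is an example device. Some of the hdparm settings may not be supported by a given drive or driver, in which case hdparm reports an error for that flag:

```shell
# Hypothetical helper: print device, mount point, filesystem type and mount
# options for block-device mounts. $1 = mounts file (default /proc/mounts).
list_mounts() {
    awk '$1 ~ /^\/dev\// { print $1, $2, $3, $4 }' "${1:-/proc/mounts}"
}

# Usage:
#   list_mounts
#
# Disk settings mentioned above, read-only queries (needs root):
#   sudo hdparm -a /dev/sda   # readahead
#   sudo hdparm -d /dev/sda   # DMA
#   sudo hdparm -M /dev/sda   # acoustic management
#   sudo hdparm -W /dev/sda   # write cache
```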