The Linux Administration group is for the discussion of technical issues that arise during the administration of Linux systems, including maintaining the operating system and supporting end-user applications.

Have you checked your HDD cache and buffer timings?
If not, use the following commands
#hdparm -t /dev/sda
#hdparm -T /dev/sda
and post the results.
Also check that your RAM modules match in specification.

There are three things affecting performance: processor, memory, and I/O. Memory does not seem to be the problem, as swap looks OK. Where are you getting the information from? Is it local disk, network, or tape? There are a lot of monitoring commands that could help identify where the bottleneck is, and this is where to start. Use at least two different monitoring commands and identify where the wait states are. ps is the first one to use, with different options to help narrow things down. From there, select several of the plethora of commands to identify the subsystem causing the problem. I cannot be more specific without more information.

Try checking your CPU and I/O performance. CPU can be checked with sar or vmstat; if the CPU is heavily utilized, check the number of processes. Use ps, top, and lsof for examining processes; iostat is a handy utility to report the I/O on a system. Also check for network errors; netstat -i should help.

Check your I/O wait. I have seen machines where the memory is hardly touched and the CPU is idle, yet the load average shows 100+ ... all because the machine is I/O bound. It appears that most of your memory is cached, so it's not really being used actively.
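For a quick look at whether a box is I/O bound without installing anything, the load average and the kernel's iowait counter are both available in /proc. A minimal sketch, assuming a Linux /proc filesystem with the field layout described in proc(5):

```shell
#!/bin/sh
# Sketch: spot an I/O-bound box from /proc alone (Linux-only, illustrative).

# Load averages: tasks running or waiting, including tasks waiting on I/O.
read one five fifteen rest < /proc/loadavg
echo "load averages: 1min=$one 5min=$five 15min=$fifteen"

# First line of /proc/stat: cpu user nice system idle iowait irq softirq ...
# A large iowait relative to user+system suggests the load is disk wait.
awk '/^cpu /{printf "user=%s nice=%s system=%s idle=%s iowait=%s\n",
             $2, $3, $4, $5, $6}' /proc/stat
```

A high load average combined with idle CPU and high iowait is the signature described above.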

Another thought just crossed my mind. Kernel parameters can make a HUGE difference in performance! It is impractical to go into all the different permutations here but the number and size of buffers can and will alter a lot.

Many thanks for your replies. I will check the commands you mentioned and get back with the results. Just to mention that the OS version is "Red Hat Enterprise Linux WS Version 4 Update 6". I already used the "vmstat" and "top" commands, but the results seem OK: "top" shows only 9% of CPU and 1% of RAM engaged. I will use the other commands you mentioned and reply with the results.

It is very simple to create a RAM disk, but you have to be aware that all changes made to the RAM disk are lost after a reboot or power loss (this is why SSDs are so cool, and expensive: changes made to those are not lost).

If you wanted to use a RAM disk, I would say that you could create a boot-from-USB or HDD system that creates the RAM disk during boot-up and copies frequently accessed files and binaries to it.

If you are running a web server, then all the pages for the web service could be loaded into the RAM disk; a mail server could likewise have all its application pages loaded onto the RAM disk during boot-up, so that when you run the apps they are accessed directly from memory.

Make sure that you use cron to replicate any changes you make in the RAM disk to the HDD/USB equivalent files so that your changes persist after a reboot.
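The replication step might be sketched as follows; the paths and schedule are hypothetical, and rsync -a --delete is the usual alternative to cp when it is available:

```shell
#!/bin/sh
# Sketch of the cron-driven persistence step: mirror the RAM disk back to
# durable storage. RAMDISK and BACKUP are hypothetical example paths.
RAMDISK=${RAMDISK:-/mnt/ramdisk}
BACKUP=${BACKUP:-/var/ramdisk-backup}

mkdir -p "$BACKUP"
cp -a "$RAMDISK/." "$BACKUP/"   # -a preserves permissions and timestamps

# A crontab entry to run this every five minutes might look like:
#   */5 * * * * /usr/local/sbin/sync-ramdisk.sh
```

How often to sync is a judgment call: more frequent syncs lose less work on power failure but generate more disk I/O, which is the very thing the RAM disk was meant to avoid.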

Directories that are good to run from RAM are /var and /opt

If your machine is a database server, then leave /var etc. alone and just run the DB from RAM, though you might also be able to do this programmatically within your DB's settings by allocating more memory for it to work with.

I was re-reading the original post and noticed something odd
MemTotal: 16402384 KB
MemFree: 46252 KB
To put that another way: 16,402,384 KB total and 46,252 KB free, or 15.64 GB total and 45 MB free. 45 MB is not a lot of free memory.
So, looking at inactive memory
Inactive: 15591240 KB
"Inactive" is memory reserved for various processes but is not currently in use. So, you have some processes reserving but not using RAM. Since this memory is reserved to one or more processes, the rest of your apps only have 45MB of headroom to operate in. These inactive pages are marked in use but haven't been touched for some time.
Off the top of my head I'm gonna guess this box has Apache and PHP installed.
I noticed that you have ample swap space but zero swap usage. Swap space temporarily holds memory pages that are inactive. Swap space is used when your system decides that it needs physical memory for active processes and there is insufficient unused physical memory available.
Yet, you have lots of inactive pages and no swap usage.
Take a look at your swappiness and consider increasing it.
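For reference, swappiness can be inspected and changed on the fly; the value 80 below is purely illustrative, not a recommendation:

```shell
# Sketch: inspect and (as root) raise vm.swappiness. The range is 0-100;
# higher values make the kernel push inactive pages to swap sooner.
cat /proc/sys/vm/swappiness           # current value, commonly 60
# sysctl -w vm.swappiness=80          # temporary change, as root
# echo 'vm.swappiness = 80' >> /etc/sysctl.conf   # persist across reboots
```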

What do you expect the memory to be used for at this stage?
You could help us with assessing the (perceived) problem if
you sorted the top output by largest RAM user rather than
CPU, but I suspect (given the load and CPU usage) that
there's really not much happening, and all the "unused"
RAM is perfectly adequate, and to be expected.
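One quick way to produce that memory-sorted view on a procps-based system (inside top itself, pressing Shift+M re-sorts by resident memory):

```shell
# Sketch: show the ten biggest resident-memory users instead of top CPU
# users. --sort=-rss orders by resident set size, largest first.
ps aux --sort=-rss | head -11   # header line plus ten processes
```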

Actually, I didn't run any software at the time I was trying the commands (top, vmstat, etc.). That's why I'm worried about this machine: even when I don't run any software or application, it is too slow. From start-up, when the computer is booting, and after I log in to the graphical desktop (KDE), this machine is slow. This is the main thing, and I don't know why it is happening, because most of the CPU and RAM are free.

The other thing is that when I was installing the OS on this system, it took around an hour and a half for the installation to finish. The HDDs I'm using on this computer are one 1TB SATA 10K and one 320GB SATA. At first I guessed that maybe it took a long time because of the 1TB SATA HDD, but after the installation finished I found that the speed of the OS is slow too.

I'll give you an example. When I right-click on the desktop and want to open a new terminal, it takes up to 20 seconds to open. When I open the System Control Center, it takes 10 seconds to open the toolbar. My specific software, which normally (on a system with a low specification) takes 5 seconds to load, takes 15 seconds on this machine. When I restart the computer, it's slow while booting up too. The problem is that I have a very good workstation with high hardware specifications; I don't expect this computer to perform like a Pentium 3 with 512MB of RAM.

Maybe I'm not being clear. Your amount of SWAP is fine. However, if you look, no SWAP is being USED.
SwapTotal: 22531120 KB
SwapFree: 22531120 KB
Since you have decided to set up this server with swap, I am simply concerned that it is not in use.

As everyone has mentioned, this system, per the supplied stats, appears fine (i.e. no issues). Can you take a look at /var/log/messages and see if there is anything "out of the ordinary"? Based on your recent post, I'm getting the impression that it is hardware related, possibly your disk drive(s). Would it be possible to run "iostat" for a few 2-second intervals when you start your "specific software"?

You might want to run top and see what is going on. Based on your description we can only guess. You need to figure out what is causing the slow performance. I would first run top and see what applications are using the most CPU time. I would then look at the process table and see how much memory is being used by the various applications.

Hi Danny,
His meminfo shows swap available, so swapon must have been run.

I'm leaning towards Steve's point of view now. Those last couple of posts describing a 1.5-hour installation sound troubling. Of course, depending on the packages selected, this may be normal. Taking a close look at the logs and running iostat is very good advice.

Unfortunately, because of some company policy issues, I don't have access to this workstation until Saturday, so I will have to check the commands you mentioned then.
I believe what you have said and am serious about testing it.

Hi Danny,
After thinking about it I realized you may be right. I had never actually tested meminfo with swapoff and swapon for myself. I had just been trusting what I had read. So, I thought maybe I should test it out.

It looks like the documentation on meminfo and swapon/swapoff is correct. If your swap is not enabled with swapon, meminfo shows zero for total and free. After running swapon, meminfo shows total and free space.
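The observation is easy to repeat; the swap lines of /proc/meminfo drop to zero with swap off and return with swap on. The state-changing commands need root, so they are left commented out here:

```shell
# Sketch: repeat the swapoff/swapon test against /proc/meminfo.
grep '^Swap' /proc/meminfo                 # SwapTotal, SwapFree, etc.
# swapoff -a && grep '^Swap' /proc/meminfo # totals now show 0 KB (as root)
# swapon -a                                # restore swap
```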

A couple of issues/questions I have with your generalization. What's the I/O? What OS are you using? You don't have that much "free memory": you only have 400MB of free RAM. What's the CPU at? How are your HDDs configured? RAID? How many spindles? What are you running that is "making it seem slow"?

I was looking over the specs on the Z600 workstation. One thing that put up a red flag for me was the fact that the machine uses up to 6 channel (3 per CPU) DDR3 1333. But some of the processor options don't actually support it. The E5504 for example is an option for the Z600 but only supports DDR3 800. And with 6 DIMM slots, mismatching channels (you need 3 per CPU) is easy to do.
I can see a situation where 16GB is achieved using 1 or 2 channels per CPU, but how do you get 16GB of RAM divided among 6 slots? Answer - you can't and still use 3-channel RAM, which is expensive. No, my guess is that this is a budget machine with a lesser processor and/or RAM config.
You could do 4x4 & 4x4, giving each CPU 2 channels to 8GB. But if the 3 sticks of RAM are not in the correct slots (such as 4 & 4x4x4), you would get horrible performance and an unreliable system. The performance hit on such a setup would first be felt during OS install. Or you could have 1 channel per CPU and 8GB sticks. Of course, you might make the mistake of putting both 8GB sticks on the same CPU channel. Another thing is that the OS would have no idea about the misconfiguration, so tracking it down via the OS may be impossible.
All of this is just guesswork but it fits the description of the problem. Hopefully we will hear more details about the actual setup of this workstation.

If I understand correctly, the HP Z600 series workstation doesn't have a Parallel ATA port, yet your output shows a PATA drive connected to your system, which may be an external drive.
I think there must be a problem with your drive's interface, or it may not be configured properly. Please check the connections of your drive, update your Linux system with the latest updates, and then try hdparm again.
As a rule of thumb, normal hardware should give at least 40MB/sec read speed for your system to run normally.

I just spotted this and the reference to collectl.
The whole reason I wrote it was that some tools showed disk stats, some cpu stats. Tools like sar had output that was too difficult to read. AND none of them could plot anything.

The problem with a lot of the responses is that they talk about a point in time or a very short look at things. With collectl, you install the rpm, start it, and let it go. After a while you can drill into the data, which has 10-second sample rates and 60-second rates for processes. If there's too much to look at, install collectl-utils and run the web-based tool colplot to plot the data.

If you like vmstat, you can actually play back the collectl data with --vmstat and see it in that format. If you want to look at I/O by process, collectl will show you that too.

Lots of options, but the key point is once you collect the data you can play it back as often as you like, examining it from multiple angles until something jumps out. Trying to do this interactively with any of the 'stat' tools is just going to frustrate you because the state of the system is continually changing.

Having thought about this a little more, I think there is a lot of confusion in this thread and we're not starting at the beginning. The author is stating the machine is slow, but not what is slow - I know, if he knew he could fix it. Maybe...

But let's back up for a second - someone suggested it could only be 3 things: cpu, memory or disk. I'll say he forgot a 4th - the network. If something is waiting on network that too will slow things down.

But more importantly, the other thing missing from the problem statement is time! Is the system always slow or just sometimes slow? If always slow, this should be relatively easy to track down. If occasionally slow, you need to write down the exact times of the slowdown and drill into the collectl data to see what was going on. Alternatively, you could run collectl continually in its own window and, when things slow down, see what it's telling you.

Once we've figured out whether it's CPU, disk, network, or memory, we can then try to figure out who's consuming it.

As a side story, I once worked with a customer who said his system was too slow. We installed collectl and watched everything. All resources were running at reasonable rates, but the system was so slow that if you typed a single character into vi it took 2-4 seconds to echo!

This gave me a clue, since vi needs to write a backup to disk. But the disk wasn't busy, so how could that be a problem? Drilling deeper and looking at the disk wait times, I found that what is normally a few msecs was several seconds!!! Why? He was running mysql and blasting data to disk. Since this is random I/O, it doesn't take a lot to saturate things. As soon as mysql was shut down, things ran just fine.
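If collectl isn't available, per-disk wait times can be approximated straight from /proc/diskstats; this sketch assumes the documented field layout (the kernel's Documentation/iostats.txt) and plain sdX/hdX device names:

```shell
# Sketch: average disk wait times from /proc/diskstats. Per iostats.txt:
# $4 = reads completed, $7 = ms spent reading, $8 = writes completed,
# $11 = ms spent writing, so $7/$4 approximates average read wait in ms.
awk '$3 ~ /^(sd|hd)[a-z]$/ {
    printf "%s: avg read wait %.1fms, avg write wait %.1fms\n",
           $3, ($4 ? $7/$4 : 0), ($8 ? $11/$8 : 0)
}' /proc/diskstats
```

Averages of a few milliseconds are normal for spinning disks; averages in the hundreds of milliseconds or more point at the kind of saturation described in the story above.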

But as I said, you first need to see what is slow. Then look for the cause.

Thanks for your advice. Actually, the system is standalone and is not connected to any network. Moreover, the system is slow all the time; as I mentioned in my last comments, even when I was installing the OS the system was too slow. I wasn't running any applications at the time, but the performance is really bad. For example, when I try to restart the machine, right-click on the desktop, or select tools, it is slow and doesn't work properly.

The BuffType=unknown is normal if writeback is disabled. After seeing this, I wonder about your Writeback, Dirty, and such.
When you posted the output of
cat /proc/meminfo you left out everything after swap.
Please post the entire output.

Easy answers - in this case - may not be the solution to the question. Might be time to change your focus, and batten down for the long, investigative solution.

With the limits you have defined, we seem to be in a loop, chasing symptoms rather than investigating the cause.

1. What is the primary application on the server?
2. Go through logs, run debug, and see if the application is throwing any errors.
3. I'm sure you checked, but it might be worthwhile to run several (pattern-updated) AV or rootkit apps again!
4. As one respondent noted, check network bandwidth/throughput, disk space usage, and configuration changes.

I would pay particular attention to applications (first ten) and investigate the heck out of all facets of each. If the system is not assigning memory, it doesn't necessarily mean it's a memory issue - the applications may not be asking for it.

OK, no network, that makes things a little simpler, but you still haven't said exactly what is slow. Everything can't be. If your disks are slow, then CPU-only tasks should run just fine. If you do suspect the disk, then before playing with anything else, measure it. That's real easy to do...

You can use dd or something else to generate a load and then see what happens. I prefer Robin Miller's dt program, which you can get at http://www.scsifaq.org/RMiller_Tools/dt.html. I tend to be biased and always run collectl in a separate window while the test is running, but you should see something on the order of 30-60MB/sec depending on the speed of your disk. For example, here's what collectl tells me on my older workstation, noting the samples are being generated every second:

1 - it's averaging ~38MB/sec.
2 - there are about 75 writes/sec, so we know they're about 500KB writes.

I'd expect you to see something of a similar nature. If your writes are at this rate, your disk is fine and you need to look elsewhere. If your rates are a lot lower, the next step is to drill deeper and maybe look at the I/Os themselves and/or your hardware or its configuration.
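For anyone without dt, a roughly equivalent dd write test might look like this; the target path is just an example, and conv=fdatasync forces a flush before dd reports, so the rate reflects the disk rather than the page cache:

```shell
# Sketch: a dd stand-in for the dt write test. Point the output at the
# filesystem under test; GNU dd prints the MB/s figure when it finishes.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync
rm -f /tmp/ddtest
```

As with dt, the numbers only mean something if a monitor (collectl, iostat) is watching the disk at the same time.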

From what I have read so far, there could be a couple of things. You state that the server is not on the network, but are the network services disabled? If not, they will be trying to connect and dragging performance down significantly. Also check the clock speed of your memory against that of the bus; if the two do not align, you will have this problem as well.

Before you start reformatting your disk, let's first figure out if that's the problem. I'd still like to see some steady-state measurements that are more than a single instant in time like you showed with vmstat and top. If you run collectl on an idle system, what do you see? Your CPU should be ~0%, your network should be 0 (since you said you're not connected to one), and your disk should be pretty quiet, though it will do an occasional read/write as part of normal housekeeping. If you don't want to run collectl, that's fine, but run something to verify your system really is quiet.
Sorry for being so persistent, but before you try a bunch of random things, get some data. I agree with zenner: before you start looking for solutions, make sure you understand the problem.
I also reread an earlier post in which someone said inactive memory is memory allocated to processes and not available. Sorry, but that's not right. Inactive memory is memory that IS available for use but given an 'inactive' status in case the data it contains is needed by a process in the future - just another cache, thereby saving the need to go to disk if it is.
As another thing, I only see a partial description of how memory is allocated. I don't see anything about mapped or slab memory; these are both additional types of physical memory that ARE unavailable for use by others.

Sorry, but that technique will only work IF a process is to blame. If it's a system issue any process could be affected and you'd end up chasing your tail, just like this whole conversation has been doing. As I said before you need to run some controlled experiments. You need to verify your disks are functioning correctly independent of any process and I already explained how to do that.

If there were a network problem, which mohama said this isn't, I'd suggest running netperf to measure the network throughput. If the network is not running at ~110-120MB/sec, then you'd know you have a network problem.

If the top output posted is typical of running it for a while, we can probably rule out the CPU.

As for memory, I still haven't seen any real data other than a simple snapshot. But if there is a problem and you're swapping, that's certainly a concern, and then that drags the disk back into the equation.

Repeat after me - do the basics before you complicate it by running any non-system code. Please do us all a favor and measure your disk speed.

Linux is very robust. Reformatting is a poor choice for this sort of problem.
Mark Seger is right.

Before you start reformatting your disk, let's first figure out if that's the problem.

If the problem is with the install disc or the hardware, reformatting and re-installing the OS will solve nothing.
With that said, if this is not a production server, and you have the time to start over, you haven't harmed anything.

At the start of the User's Manual for dt - Data Test Program this is mentioned:

EXTREME WARNING!!!
Use of this program is almost guaranteed to find problems and cause your schedules to slip. If you are afraid to find bugs or otherwise break your system,
then please do not use this program for testing.
You can pay now or pay later, but you’ve been warned.

Do you think it will cause any problems for my system? If yes, is it essential to use it in this case?

I have been following this thread since it started and there are a few steps that have not been done.
1. What is the disk usage? Is there a disk that is almost full?
2. What is the system time and date? Is it way off?
3. What network services are enabled? Could one be holding the CPU waiting for a timeout?
4. Are you using NIS or NFS?
5. How many buffers, and of what size, are in the kernel parameters?
6. Is your network set up for the correct network?
7. Are the drivers in the kernel correct?
8. Are there any errors in ANY of the error logs?

Start with this and see if there are any indicators. You stated that startup was slow and any one of these could cause this type of problem.
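A few of these checks can be scripted in one pass; the log path is the RHEL default, and the commented commands are suggestions that may differ on other distros:

```shell
#!/bin/sh
# Sketch: quick pass over some of the checklist items above.
df -h                                   # 1. any filesystem nearly full?
date                                    # 2. clock sanity
# chkconfig --list | grep ':on'         # 3. services enabled (RHEL-style)
# grep -iE 'error|fail' /var/log/messages | tail -20   # 8. recent errors
```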

Sorry, I've been flying all day and am just reading this. The comment in dt is just Robin's way of being amusing. His point is so many people never really look at the way their disk subsystems are performing that when they do, they may end up spending a lot of time trying to make them better. Not to worry.

My starting dt command (if you haven't tried it by now) is:

dt of=/tmp/test limit=5g bs=1m disable=compare,verify

I just run collectl with no switches to get started. If you want to add memory, just include -s+m. If you want timestamps, add -oT. It's that simple. It's not until you want to get real fancy that things get more complex.

b) You find a disk bottleneck using the sar command: "sar -d 5 5".
If avwait > avserv, then you have a disk bottleneck.

c) You will always have a disk bottleneck on the OS disks

d) To find the biggest memory hogs on a box, use the ps and ptree commands.
ps -ef -o vsz,rss,pid,ppid,state,wchan,user,comm | sort -rn | head -10
vsz: Virtual memory consumption in KB
rss: Resident set size - usually a more accurate figure
state: State of the process, i.e. wait, blocked, etc.
wchan: What your process is waiting on

e) Unless you have an OS problem, you can ignore all the root-owned processes, because applications should not be owned by root.

f) Linux memory includes disk cache. You want 100% read cache and 60% write cache, in general, so it's important to review the caching or buffering metrics. In your reports you're doing hardly any reading and mostly writing. This suggests to me that your buffer size needs to be increased.

g) Buffer size is supposed to be automatically adjusted in Linux. However, in the above report from Mohama, you see the description "...buffered disks...", and those transfer numbers appear really slow to me.

Michael - I do agree with what you're saying about taking measurements, and I'm hoping we'll soon see some results posted. While one is certainly welcome to run sar, I find its output far too complex to parse in real time. Do people really care about all those numbers in hundredths at the cost of screen real estate? That's one of the many reasons behind my writing collectl, but the point is that any technique will work, since all the tools, be they sar, iostat, vmstat, or collectl, report the same information. In fact, when writing collectl I relied heavily on all the tools I was trying to replace.

However, I still stand by my statement that before doing ANYTHING you should characterize the disk performance under a known load like dt. Here again, some people use dd, others iozone, and still others simply copy a very large file. That too doesn't matter, as long as the load is consistent during testing. I'd also ask why you're only recommending looking at 5 5-second samples with sar. Why not do 1-second samples for the duration of the testing, say for minutes? Or is the output too difficult to read at that frequency? ;(

Btw - with collectl you can run it at 0.1 or even 0.01 second samples and its output remains very easy to read. The cool thing about that rate is you can even see the I/O rates change as the disk's hardware cache fills. ;)

Sorry for the delay, but unfortunately, for some policy reasons, I'm not able to reach the server all the time; I have to request and wait for company confirmation. I'll update you as soon as I test the procedure you explained. Many thanks for your follow-up and help in this regard.

I just saw this. Too bad there aren't notifications when this note is updated.
As Steve said, it looks like dd wasn't running! Are you sure you're writing to disk?

The very first step is to get a reliable load-generator test going. I don't know enough about dd to help, so maybe someone else can (maybe post your command?). OR you can tell me what went wrong with getting dt to run, and I can try to help you get that going.

This system is not idle, since you're showing about 5% cpu load. What IS it doing?

If you're really running dd and collectl isn't showing disk activity, you're not doing any I/O - it's that simple. Also, if the CPU load isn't changing, you're probably not doing anything. You might want to include memory along with CPU, disk, and network with the switch -s+m. You can also include timestamps to make the durations easier to see with -oT.

It would also make it easier to reference specific lines of output if you ever get the test going.

I downloaded the Data Test program version 14.1 .tar.gz file. When I copied the file to the machine in question (Linux), I used "gunzip" and "tar" to decompress it, then I "cd"ed into the extracted directory and tried "make" and "make install" to install the program, but unfortunately neither of them worked. I tried reading the "README" file, but it didn't contain any installation instructions. I don't have any idea how to get started with the "dt" program, since this is the first time I'm using it. Would you please help me in this regard, Mark?

Hmm, I also just noticed the dt command you're using is not what I recommended. No need to play with async I/O and such. And why are you reading? And why only 50MB? Remember, I said for any testing to be meaningful it has to run for a while, and 50MB is only going to be a 1- or 2-second test.

please try the following:

./dt of=/tmp/test limit=1G bs=1m disable=compare,verify

I only specified creating a 1GB file, which is smaller than I typically use; if you want to try reading it back with 'if', it will almost certainly be read from cache and not disk - that's why I use larger files that are too big to cache.

How about you just send me your email address and we work this off-line and maybe just report the results here? This method is much too inefficient...

I'm really confused about this problem. I really don't know what to do! Maybe I should use another hard disk and install a separate RHEL OS on it to find out whether it makes any difference or not? What do you suggest now?

Hmmm ... even without the collectl output, your SATA drive seems slow. It would still be interesting to have a good collectl run (it appears that collectl was run after dt, as opposed to at the same time).

A couple of things:
- running collectl after the dt command runs is pointless; the whole idea is to run it at the same time!
- 3MB/sec as reported by dt sucks! All the more reason to run collectl and see what's happening.

In fact, you might want to run collectl to show the lower-level disk stats by running "collectl -sD" (note the uppercase). But if you do this, only a few seconds' worth of samples are really necessary, BUT only if done at the same time as dt.

First of all, pasting in hundreds of lines of 0s is not particularly helpful; a small number of lines is sufficient to make your point ;)

But seriously, if this is what collectl is saying, I'd assert your data isn't going to the disk! If you don't believe that, you could always run iostat at the same time, and I'll bet it too reports all zeros.

That being said, perhaps /tmp is linked to some other, slower destination? At this point I'm clueless.

Exactly which distro/version of Linux are you running? What does 'uname -a' report?

Well, at least the results of dt are consistent (i.e. still showing the slow/bad throughput numbers). If you hadn't mentioned it, my impression would be that you were still running the commands sequentially ... it doesn't make sense. I would also run "read" tests (from both disks to /dev/null), although the collectl output (run simultaneously) would still be relevant.

IMPO, with what data is available, it's still pointing at the hardware.

Yes, Mark actually brought up a good point regarding /tmp. Does "df -k" indicate that it is tmpfs (which it should be, considering the amount of real memory)? You'll have to point your dt output to one of the disks, or do a read test and point the output to /dev/null.

The exact version of Linux I'm using is Red Hat Linux Version 4 Update 6 (64-bit). The point is that /tmp isn't linked to any other location. The thing I do not understand is how I can be copying data using "dt" or even "dd" while the system doesn't show any activity from the hard disk. Do you think it may be a hardware problem? This machine has two hard disks (one 160GB SATA 7.2K and the other 500GB SATA 7.2K). I could completely remove both hard disks and install a new one to find out whether the new one has the same problem or not. What is your opinion of this approach? Could it be useful?

So, are you saying that /tmp is not allocated as tmpfs (which it usually is by default)? That would explain the "null" collectl results (assuming it was run simultaneously with dt), although it doesn't explain the low dt throughput rate, which is suspicious.

So, to rule out a disk hardware-related issue, I would run a "read" test - say from /usr or some large filesystem (to /dev/null) - in addition to the "write" test to a location on either of your two SATA drives.

I'm still focused on /tmp - I'm not sure I even believe it's pointing to an HD, since collectl is showing no activity. That's why I asked about df and /proc/diskstats. df will show where things are mounted, and /proc/diskstats is THE place where all disk activity is recorded. I'm still betting /tmp is not on a disk.

But in either case, I'd still like to see something generate some collectl output and therefore suggested a home directory.

It may very well be a hardware problem, but I'd still like to see what the disks are doing - it may not be a disk problem at all. I'm not ruling out hardware, but until someone can demonstrate what is happening on the disks, I'm clueless.

btw - this is the main reason why it's so dangerous to only look at how long a test takes to write. dt is showing 3MB/sec, but to where?

Why is everyone focused on a hardware problem? I would expect to see a lot of errors reported in syslog. If that is clean, I would look elsewhere in either drivers or kernel parameters. Hardware errors are logged unless there is a completely broken disk and then there would be other indicators.

So clearly from 'df' everything is mounted on sdc, and collectl is only showing sda and sdb, but you never showed me the contents of /proc/diskstats, where all the disk performance metrics for all utilities are found. Can you please report that? I can't do a thing without that information.

As for collectl not reporting sdc, I don't see how it can when sdc isn't in /proc/diskstats; if it really isn't there, that could indicate some misconfiguration problem. I'm only guessing.

You could always run the command "collectl --showheader", but I'd expect that to only show 2 disks. Or I can have you run "collectl -sd -c1 -d4", which would show what disk data collectl is reading from /proc/diskstats, but having the actual contents of /proc/diskstats is clearly the starting point.

I completely forgot about the 2-disk system comment. So what gives with the mounts for /dev/hdc? Something is not right, which is why I want to see /proc/diskstats. So if we try to reverse engineer this, lacking further info, I'd have to guess /dev/hdc is not a disk drive, and if it is, it's somehow mounted in some strange way - so what could it be? Since the owner of this note mentioned something about it being a production system that he doesn't always have access to, it makes one wonder how this system is even running anything useful. btw - are you a collectl user? ;)

sosoback1 - how about posting the contents of /etc/fstab so we can see the actual mount commands?

Yes, very strange. He also mentioned slow boot times, so it would also be interesting to see what's happening at boot (maybe something in /var/log/messages or dmesg). Along with /etc/fstab, I would also like to see the output of "fdisk -l", which will show all the disks that the system recognizes (I hope :). Sorry, it's only through this list that I discovered collectl and dt :( ... I took a look at collectl, and it looks like a really neat tool, which I've added to my kit; I was a backup admin and mainly used Solaris's iostat and dd.

With the provided information, that's where the issue appeared to be pointing :) Several folks had requested looking at the system logs (yes, configuration-type issues, in my experience, tend to show themselves at initialization time).

All I can say wrt collectl is that I'm always worrying it might be doing something wrong, and every time it has been vindicated - after all, all it does is take periodic samples of /proc/diskstats (as well as other /proc data structures for other data). If collectl reports it, you can take it to the bank that so will iostat, sar, mpstat, vmstat, and all the other tools people like to use. No magic, just a bunch of reads/writes.

What IS magic, in my opinion, is the way collectl displays its output. There are multiple options as I've found no one way keeps everyone happy. The one thing collectl doesn't do is report fractions - with the exception of process cpu stats. I've never seen much point in it and it just puts a bunch of additional useless digits in front of the poor user who is just trying to see what their system is doing.

collectl has been around for many years and is used to monitor many of the fastest clusters in the world! Ever hear of the Top 500? A lot of them run it.