I have a Sun (5.10 Generic_144488-12 sun4v sparc SUNW,SPARC-Enterprise-T5220) RAC cluster running 11.2.0.3 GI & RDBMS, connected to a storage array through Fibre Channel. But for some reason we see horrible I/O response times at the DB level, which unfortunately don't sync up with the host or storage logs; in other words, the systems and storage admins don't see any waits.

So my question is: is there any tool we can use to trace each and every event at the OS/kernel level? Something like strace, and also, how do I interpret the resulting log file?

What are the specifics of your file system(s) that you're seeing poor performance on?

In my experience, you most likely have some storage configuration issue. The question is, where's the misconfiguration?

Some common problems on the array side are things like 20 or more disk-wide RAID-5/6 arrays with tens if not hundreds of LUNs shared off that one array. Ouch. Other issues are mismatched file system block sizes and RAID-5/6 array stripe widths.

It's not that hard to take a SAN system that can deliver gigabytes a second of data and drop it into the kilobyte per second range.

Truss is mostly used to see what the app is doing. If you want to measure performance, read up on DTrace and use that instead. Yeah, I know you can time things with truss, but DTrace was created with exactly that in mind.
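To give you a concrete starting point: a minimal DTrace sketch for measuring block-I/O latency per device might look like the following. It uses the io provider's `io:::start`/`io:::done` probes and keys on `arg0` (the `struct buf` pointer), which is a standard io-provider idiom; save it as, say, `iolat.d` (a hypothetical name) and run it as root on the Solaris host with `dtrace -s iolat.d`. Treat this as a rough sketch, not a tuned script.

```d
/* time each block I/O from start to done, per device */
io:::start
{
        /* arg0 is the struct buf pointer for this I/O */
        start[arg0] = timestamp;
}

io:::done
/start[arg0]/
{
        /* nanosecond latency histogram, keyed by device name */
        @lat[args[1]->dev_statname] = quantize(timestamp - start[arg0]);
        start[arg0] = 0;
}
```

If you'd rather not write D yourself, the DTraceToolkit ships scripts like iosnoop and iotop that wrap similar probes.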

> It's not that hard to take a SAN system that can deliver gigabytes a second of data and drop it into the kilobyte per second range.

That might be the closest to my scenario... can you advise on what to look for, or which commands to run?

When you're seeing bad I/O performance, what's the output from "iostat -sndzx 2"? That will print stats every two seconds until you kill it. Post a few intervals of results and you'll probably get a good number of comments.

LUN layout on disk arrays is often a problem. Lots of times storage admins just dole out storage without knowing, or even thinking about, how the LUNs are laid out on the disks. KISS works best: one LUN per RAID array, filling the entire array. When there are multiple LUNs on an array, it's way too easy to get severe contention, especially if someone did something stupid like putting 50 busy LUNs on a single 24-disk RAID-6 array where each disk has a segment size of 1 MB.

Yes, I said STUPID. Any storage admin who does something like that is overpaid - even if they're working for free - because they have absolutely no understanding of what they're working on. Just Google "read-modify-write". What would you think of a C programmer who can't figure out how an infinite loop hurts performance?
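Some rough arithmetic shows why that 24-disk example hurts. Assuming 22 data disks plus 2 parity (my assumption for how the 24 disks split) with 1 MiB segments, the full stripe is enormous compared to a typical file system write, so nearly every write is smaller than a stripe and triggers a read-modify-write: the controller must read the old data and parity back before it can write the new data and recomputed parity.

```shell
# hypothetical 24-disk RAID-6: 22 data + 2 parity, 1 MiB segment per disk
data_disks=22
segment_kb=1024
stripe_kb=$((data_disks * segment_kb))   # full-stripe size in KiB
write_kb=128                             # a typical file system block write
echo "full stripe: ${stripe_kb} KiB; typical write: ${write_kb} KiB"
```

A 128 KiB write against a 22 MiB stripe can never be a full-stripe write, so every one of those 50 busy LUNs is paying the read-modify-write penalty.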

RAID-5/6 array layout is often done in ways that hurt performance, too. Almost all file systems do I/O in power-of-two chunks, and it's usually pretty small - 4K is typical, ZFS does 128K, and if you're using SAM/QFS you can set your file system block size to just about anything up to 1 MB. Yet all too many RAID-5/6 arrays are built as something like 9+2 (9 data, 2 parity) with 1 MB segment sizes per disk. That makes the natural RAID array stripe 9 MB. What on God's good Earth does I/O in 9 MB chunks? (See above about overpaid storage admins....)

Then, when system admins get that LUN, they label/partition it in a way that forces misaligned I/O - because they don't understand, either. Use an EFI label. (Note how ZFS always uses an EFI label and starts its data partition at sector 256? Hmmm - 256 x 512 bytes = 128 KB. Imagine that....) You should do the same: figure out your file system block size and ALWAYS start a partition an exact multiple of that many bytes from the start of the LUN. (And since you shouldn't use segment 0, that means the beginning of each LUN is lost space. Get over it.)

Oh, and the RAID-5/6 array stripe width should be no larger than the file system block size. The stripe width can be smaller, but then the file system block size must be an EXACT multiple of the stripe width. Put another way: the file system block size should ALWAYS be an exact multiple of the underlying RAID-5/6 stripe width - it's just fine for that multiple to be one.
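The ZFS alignment trick mentioned above is just arithmetic, and it's worth checking for your own labels; here's the sector-256 case worked through (assuming 512-byte sectors, as in the post):

```shell
# ZFS-style EFI label: data partition starts at sector 256
sector_bytes=512
start_sector=256
offset_kb=$((start_sector * sector_bytes / 1024))
echo "partition offset: ${offset_kb} KiB"
```

A 128 KiB offset means every 128 KiB ZFS record lands on a 128 KiB boundary of the LUN; the same check applies to whatever your file system block size happens to be.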

What this means in practice is that RAID-5/6 arrays should almost always be built with a power-of-two number of data disks - almost always 4 or 8 data disks plus 1 or 2 parity - with a SMALLER per-disk segment size, chosen so that the segment size times the number of data disks (the stripe size) matches the intended file system block size. A 17+2 RAID-6 array with a 1 MB segment size per disk is, again, downright STUPID.
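As a worked example of that sizing rule (using ZFS's 128 KiB record size and an 8+2 RAID-6 as my assumed targets):

```shell
# size the per-disk segment so the full stripe matches the FS block size
fs_block_kb=128     # e.g. ZFS recordsize
data_disks=8        # 8+2 RAID-6
segment_kb=$((fs_block_kb / data_disks))
stripe_kb=$((segment_kb * data_disks))
echo "segment per disk: ${segment_kb} KiB; full stripe: ${stripe_kb} KiB"
```

With 16 KiB segments, each 128 KiB file system write is exactly one full stripe, so parity is computed from data already in hand and the read-modify-write penalty disappears.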

I've seen way too many times where someone spends hundreds of thousands if not millions of dollars on high-performance disk systems, and then they toss away performance with bad admin practices that ignore the underlying design. You can't run any system near its design specifications if you ignore the design.

Depending on your disk storage system(s), I/O alignment can be a problem. Larger, higher-performance systems can cope with unaligned I/O better than smaller, lower-performance ones (as long as the LUN layout isn't too bad, anyway). An STK6780 can handle misaligned I/O a lot better than an STK6140 can, for example.

RAID-0 and RAID-1 arrays (and combinations thereof) are a lot more forgiving, performance-wise, of ignoring all that than RAID-5/6 arrays are.

@alan - Thanks, Alan. Looks like truss is better for me... I couldn't really make much out of DTrace.

Thanks
aBBy.

Truss isn't going to do you much good - it will just show you "hey, the read/write operations are slow". And that's unless your app is using async I/O calls via lio_listio() or aio_read()/aio_write(), in which case truss won't even tell you that.