November 14, 2007

Back again!

Sorry folks! My day job and family effectively chewed up all my spare time for the past week, hence the delay in responding to the excellent suggestions in the comments on the last post.

Inch observed correctly that Open Source workload generators may end up writing to the buffer cache instead of to the storage. True. Reminds me of an old story...

Many years ago, when I was still doing REAL work, a customer called me and said they were running Oracle RAC on a Symmetrix 8830. We had told them they should get about 240 MB/s large block throughput from a well-partitioned workload across the two-node cluster.

The customer measured about 160 MB/s and was understandably concerned. So I went over and looked at the configuration: 2-node Solaris 8 VCS DBED/AC with Veritas' cluster file system. Enough HBAs, no architectural bottlenecks on the SAN. So my interest was piqued. I tuned up maxphys in /etc/system to 1 MB and that improved stuff a bit, to 200 MB/s. Still short.

I used one of our internal workload generators, and measured the throughput for the vxfs filesystems. Sure enough, I got 200 MB/s too. So I went to raw devices, and voilà, 240 MB/s. Turned out that going through vxfs and CFS, even with Direct IO turned on, I still saw a lot of overhead from operating system meta-structures like file systems.

Moral: Be very careful how the operating system is configured. Many unintended effects may rear their heads if this is not done carefully. Raw devices are the best way to get a true measure of storage performance. (I still maintain that the DS8300 would have posted better results on SPC-1 if only the queue_depth had been increased - one man's opinion!)
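For anyone who wants to repeat the file system versus raw device comparison, here is a minimal Python sketch of a large-block sequential read test. The function name and the 1 MB default block size are assumptions for illustration; this is not the internal workload generator mentioned above. Keep in mind that a plain read of a file goes through the buffer cache, which is exactly the effect to watch for, and that whether a device node bypasses the cache depends on the operating system.

```python
import os
import time


def measure_read_throughput(path, block_size=1024 * 1024, max_bytes=None):
    """Read `path` sequentially in large blocks; return (bytes_read, MB/s)."""
    fd = os.open(path, os.O_RDONLY)
    total = 0
    start = time.perf_counter()
    try:
        # Stop at end of file/device, or once max_bytes have been read.
        while max_bytes is None or total < max_bytes:
            chunk = os.read(fd, block_size)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    mb_per_s = (total / (1024 * 1024)) / elapsed if elapsed > 0 else 0.0
    return total, mb_per_s
```

Run it once against a large file on the file system and once against the underlying raw device, then compare the two MB/s figures. A single-threaded reader like this will not saturate an array; it only shows the relative overhead.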

So if one is careful, maybe open source workload generators could be used, by stipulating only raw devices as targets. It's the multi-threading and building correlated IO threads that is daunting to me - any ideas? And then moving to multi-host benchmarks after that...

Constraints!

Overwhelmingly, the sentiment echoed by people seemed to be: if someone wants to short-stroke, let 'em!

OSSG and TechEnki said just disclose it clearly; for all you know, for a fixed price, that may be the best configuration to achieve a certain metric. The open user community will decide if it is relevant or useful. The Anarchist suggested a minimum limit on utilization - spindles, ports, etc.

Here's my suggestion. I like Anarchist's idea of setting a limit, but tend to agree with what OSSG and TechEnki suggest as well. So why not have a default benchmark report with a utilization limit (such as 70% for component utilization), which I hope will be the majority, and have the tester explicitly call out any result which had been optimized for a specific metric like response time or back-end throughput?
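As a rough sketch of how that rule could be applied mechanically, the Python below sorts a result into the default report or an explicitly called-out class. The names (`Result`, `report_class`, `UTILIZATION_FLOOR`) are hypothetical, and it assumes the 70% figure is a minimum that every component must reach, in line with Anarchist's suggestion; a real benchmark specification would have to pin both of those down.

```python
from dataclasses import dataclass

UTILIZATION_FLOOR = 0.70  # the 70% example figure, treated here as a minimum


@dataclass
class Result:
    name: str
    utilization: dict        # component -> fraction used, e.g. {"spindles": 0.8}
    optimized_for: str = ""  # metric the tester declares the config was tuned for


def report_class(result, floor=UTILIZATION_FLOOR):
    """Return the reporting class for one benchmark result."""
    under = sorted(c for c, u in result.utilization.items() if u < floor)
    if not under:
        return "default"
    detail = f"(under {floor:.0%}: {', '.join(under)})"
    if result.optimized_for:
        return f"optimized for {result.optimized_for} {detail}"
    return f"needs disclosure {detail}"
```

A short-stroked configuration with 30% spindle utilization and a declared goal of "response time" would be reported as optimized for response time, with the under-utilized components listed alongside it.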

The issue is one of constraints, IMHO. One can optimize a configuration for cost, throughput, power consumption (as Anarchist suggested as well) or other metrics. This should be highlighted in the results. So I may have Copan giving me results that suck from an IOPS perspective, but which rock from a power consumption standpoint because of the MAID technology it incorporates. Inquiring minds should want to know!

So should we have a loose framework of constraints, and have classes of results optimizing for the maximum bang for that constraint? Or have the constraints (spindles, total capacity, cache, $$, power, etc.) for a given configuration declared prominently?

This way, if someone wants to short-stroke, or have a much greater number of spindles than needed for the storage, so be it. But it's optimized for response time and throughput, and we know it explicitly because it's posted in that class. BarryW echoed this sentiment as well - constraining # spindles may not make sense as spindle sizes grow.

BarryW also called for requirements on replication "overlay" tests. I agree - I'm only wondering if we should dive into this now, or hash out the pure performance benchmark first, and then move to overlays.

So my modified Postulate #2 could be:

Postulate #2: Don't overconfigure! (unless you have to)

and

Postulate #3: If you DO overconfigure, publish all constraints.

On a tangent...

I was tickled by the financial analyst reaction to Oracle VM - suddenly a bunch of them were going gaga over how this was going to compete with VMWare! Huh? Didn't anyone see that it was Xen based? Xen is nice, but it's still got a ways to go.... even funnier is how both VMWare and Citrix tanked! Priceless!

Comments

Concerning virtualization, Xen (and now Oracle) have to catch up to a 3 year lead VMWare has on partner and interoperability development. VMWare's biggest competitive advantage (and the reason I don't think hypervisors will become a commodity any time soon) is that for 3 years while they were the only game in town and were sitting primarily in test centers, they were refining their ability to work with the majority of commercial server software. Sure, the other guys can do the easy stuff, but as soon as someone wants to put something besides a file or email server into Xen, they will run up against the lack of proper development for interoperability.

Concerning benchmarking, I think that an open sourced load generator would be ideal: if anyone finds a characteristic of the workload that's unfairly benefiting one type of device, they can fix it or fork it, depending on whether it would be useful to show both.

I don't know if this is feasible, but couldn't we just record IOs from a production environment and play them back to the devices being benchmarked?
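A bare-bones Python sketch of what such a playback could look like, assuming a trace has already been captured as a list of (time, operation, offset, length) records. The record layout and the names `IORecord` and `replay` are made up for illustration, and the sketch is single-threaded, so it does not preserve the concurrency of the original workload.

```python
import os
import time
from typing import NamedTuple


class IORecord(NamedTuple):
    t: float      # seconds since the start of the recorded trace
    op: str       # "R" for read, "W" for write
    offset: int   # byte offset on the target
    length: int   # transfer size in bytes


def replay(path, trace, speedup=1.0):
    """Replay `trace` against `path`, keeping the recorded inter-arrival times."""
    fd = os.open(path, os.O_RDWR)
    start = time.perf_counter()
    done = 0
    try:
        for rec in trace:
            # Wait until this record's (scaled) timestamp is due.
            delay = rec.t / speedup - (time.perf_counter() - start)
            if delay > 0:
                time.sleep(delay)
            if rec.op == "R":
                os.pread(fd, rec.length, rec.offset)
            elif rec.op == "W":
                os.pwrite(fd, b"\0" * rec.length, rec.offset)  # payload is zeros
            else:
                raise ValueError(f"unknown operation: {rec.op!r}")
            done += 1
    finally:
        os.close(fd)
    return done
```

Writes overwrite the target with zeros, so this should only ever be pointed at scratch devices or files.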

Hi OSSG!
On the Xen/Viridian coming of age - in complete agreement. Seeing where VMWare is today, it's a clear 2-3 year lead. No doubt the competition will make inroads, especially, IMHO, Microsoft, but not anytime soon.

On benchmarks: The critical piece is that it be representative of real workloads, which is why I am in favor of what I espoused in Postulate #1. A single measurement will never reveal the true nature (the strengths and weaknesses) of a given platform configuration. So multiple different workloads should be run against the same configuration - that way, if the configuration has been tweaked or shows a benefit for one kind of workload, that will wash out in the composite picture. Say small block/large block, cache-friendly/cache-hostile, sequential/random. Hard to simultaneously optimize for all of these in the same configuration.
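The three axes above multiply out to eight workloads. A small Python sketch of enumerating them follows; the block sizes (8 KB and 1 MB) and all names are assumptions for illustration, not part of any agreed specification.

```python
from itertools import product

BLOCK_SIZES = {"small": 8 * 1024, "large": 1024 * 1024}  # bytes, assumed values
CACHE_BEHAVIOR = ("cache-friendly", "cache-hostile")
ACCESS_PATTERN = ("sequential", "random")


def workload_matrix():
    """Return one workload description per combination of the three axes."""
    return [
        {"block": name, "block_bytes": size, "cache": cache, "access": access}
        for (name, size), cache, access in product(
            BLOCK_SIZES.items(), CACHE_BEHAVIOR, ACCESS_PATTERN
        )
    ]
```

Running every entry against the same configuration and publishing all eight results side by side is what makes a tweak for any single workload visible in the composite picture.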

Pat Artis has tools that can characterize real workloads on the mainframe by capturing and modeling real life IO streams - haven't seen any similar tool for open systems. Has anyone?