I think “launched” is a good description of a product that represents a company’s first release from a product acquisition that was already somewhat mature. No surprising new features, no trend-setting advances in interface or integration – just a solid, usable “pane of glass” to improve “visibility” into an existing product set. That’s how I’d describe VMware’s “new” vCenter Operations appliance for vSphere.

The product launches initially as a virtual appliance (similar to VMDR, vMA, vCMA, etc.) that enhances vCenter’s ability to track performance, capacity and changes in the vSphere environment. This initial offering is called VMware vCenter Operations Standard and is priced per-VM (I’ll get to those details later.) vCOPS Standard will be available for download and trial beginning March 14, 2011. Here’s how VMware describes it:

Proactively ensure service levels, optimum resource usage and configuration compliance in dynamic virtual and cloud environments with VMware vCenter Operations. Through automated operations management and patented analytics, you benefit from an integrated approach to performance, capacity and configuration management. You’ll gain the intelligence and visibility needed to

Get actionable intelligence to automate manual operations processes

Gain visibility across infrastructure and applications for rapid problem resolution

Get ‘at-a-glance’ views of operational and regulatory compliance across physical and virtual infrastructure.

If you’re like me, that description won’t make you find a place in your strained IT budget for VMware’s new plug-in. Eventually, VMware will find the right messaging to sell this add-on, but let’s see if it can sell itself, shall we? Located deep within a “related whitepaper” there is an indication of how vCOPS differentiates itself from the crowd of “pretty statistics loggers” and delivers some real tasty goodness. I believe this is the real reason why VMware shelled out $100M for the technology.

Dynamic Thresholds

Yeah, I thought that too. What the heck is a “dynamic threshold” and why do I care? For one thing, it takes VMware two pages of white paper just to describe what a “dynamic threshold” is, let alone describe how it adds value to vCenter. In short, VMware’s statistics logger applies eight proprietary algorithms to live and historical data to “predict” what “normal” operating parameters are for a specific VM, host, cluster, etc. and then make decisions as to whether or not anomalous conditions exist in the present operating state.

Effectively, VMware’s dynamic threshold takes a sophisticated look at the current trend data just like a seasoned IT admin would – except it does it across your entire virtual enterprise every 12 hours and predicts what the next 12 hours should look like. This “prediction” becomes the performance envelope, hour by hour, for the next 12 hours of operation. So long as your virtual object’s performance stays within the envelope, the likelihood of anomalous behaviour is low; however, when it is operating outside the envelope, outliers are likely to trigger performance alarms.
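To make the “envelope” idea concrete, here is a minimal sketch in Python. It is not VMware’s method: the eight algorithms are proprietary, and this stand-in uses a plain mean and standard-deviation band per hour of day, which is exactly the sort of bell-shape approach VMware says its analytics outperform. The function names, the band width and the sample data are all made up for illustration.

```python
from statistics import mean, stdev

def hourly_envelope(history, width=2.0):
    """Build a per-hour (low, high) band from past samples.

    history: list of (hour_of_day, value) pairs from earlier days.
    width: half-width of the band in standard deviations (arbitrary choice).
    """
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    envelope = {}
    for hour, values in by_hour.items():
        centre = mean(values)
        spread = stdev(values) if len(values) > 1 else 0.0
        envelope[hour] = (centre - width * spread, centre + width * spread)
    return envelope

def anomalies(envelope, samples):
    """Return the (hour, value) samples that fall outside their hour's band."""
    flagged = []
    for hour, value in samples:
        low, high = envelope[hour]
        if not low <= value <= high:
            flagged.append((hour, value))
    return flagged

# Made-up CPU utilisation (%) at 09:00 and 10:00 over three earlier days.
history = [(9, 40), (9, 42), (9, 44), (10, 70), (10, 72), (10, 74)]
envelope = hourly_envelope(history)

# The same reading (43%) is normal at 09:00 but an outlier at 10:00.
print(anomalies(envelope, [(9, 43), (10, 43)]))
```

The point of the sketch is only that “normal” depends on the hour: a fixed threshold would treat both 43% readings the same, while an hour-by-hour envelope flags only the second.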

The following transforms are applied to statistical data every 12 hours:

• An algorithm that can detect linear behaviour patterns (e.g., disk utilization, etc.).
• An algorithm that can detect metrics that have only two states (e.g., availability measurements).
• An algorithm that can detect metrics that have a discrete set of values, not a “range” of values (e.g., “Number of DB User Connections,” “Number of Active JVMs,” etc.).
• Two different algorithms that can detect cyclical behaviour patterns that are tied to calendar cycles (e.g., weekly, monthly, etc.)
• Two different algorithms that can detect general non-calendar patterns (e.g., multi-modal)
• An algorithm that works, not with time-series or frequently measured values, but with sparse data (e.g., daily, weekly, monthly batch data)

VMware claims this approach – to borrow a recently over-used term – “wins” versus typical bell-shape algorithm approaches many times over. In statistical analysis against real-world VM metrics, VMware says typical bell-shape analysis “barely shows up” and, in the few cases where it “wins,” the bell-shape approach does so only slightly. In enterprise applications, being able to present the “anomalous behaviour” of related systems side by side can lead more quickly to root-cause identification. Here, VMware demonstrates how anomaly counts for separate, related application tiers can be compared and correlated visually:

Do Fries Come with That?

From some of the back-pedaling overheard in the vExpert pre-launch conference, VMware’s testing the waters on where the product fits at the low end. Essentially, this is an enterprise-class product offering that’s been pared down to fit into a smaller IT budget. Like most VMware products, a generous “free” trial period will be granted to allow you to try before you buy. However, the introductory price (official pricing is not posted on VMware’s site) is set at $50/VM (hence Kendrick’s quandary) for up to 500 VMs (about $25K).

Since VMware intends to offer an inclusive pricing scheme, all registered VMs will need to be licensed into the Standard Edition’s footprint. In the vExpert call, there was “talk” about extending analysis only to specific VMs (and allowing for a pared-down licensing footprint) but that is conjecture today. In a typical enterprise where 70-80% of workloads are non-mission critical, the cost and license model for vCOPS could be an obstacle for some – or at least force the use of a separate vCenter and cluster arrangement. Let’s hope VMware comes up with a mission-critical license model quickly.
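For a rough sense of what is at stake, the arithmetic can be sketched in a few lines of Python. The $50/VM price and the 500-VM ceiling come from the introductory pricing above; the “mission-critical only” case is purely hypothetical, since no such license exists today, and it assumes the per-VM price would stay the same.

```python
PRICE_PER_VM = 50   # introductory price quoted above, in USD
TOTAL_VMS = 500     # ceiling of the Standard edition footprint

def license_cost(vm_count, price_per_vm=PRICE_PER_VM):
    """Cost of licensing vm_count VMs at a flat per-VM price."""
    return vm_count * price_per_vm

# Inclusive model: every registered VM is licensed.
all_vms = license_cost(TOTAL_VMS)

# Hypothetical model: only the mission-critical 20-30% of workloads.
critical_low = license_cost(TOTAL_VMS * 20 // 100)
critical_high = license_cost(TOTAL_VMS * 30 // 100)

print(all_vms, critical_low, critical_high)
```

Under those assumptions a full 500-VM shop pays $25,000 where a mission-critical-only license would cost $5,000 to $7,500.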

Anyone who’s discussed storage with me knows that I “hate” desktop drives in storage arrays. When using SAS disks as a standard, that’s typically a non-issue because there’s not typically a distinction between “desktop” and “server” disks in the SAS world. Therefore, you know I’m talking about the other “S” word – SATA. Here’s a tale of SATA woe that I’ve seen repeatedly cause problems for inexperienced ZFS’ers out there…

When volumes fail in ZFS, the “final” indicator is data corruption. Fortunately, ZFS checksums recognize corrupted data and can take action to correct and report the problem. But that’s like treating cancer only after you’ve experienced the symptoms. In fact, the failing disk will likely begin to “under-perform” well before actual “hard” errors show up as read, write or checksum errors in the ZFS pool. Depending on the reason for “under-performing” this can affect the performance of any controller, pool or enclosure that contains the disk.

Wait – did he say enclosure? Sure. Just like a bad NIC chattering on a loaded network, a bad SATA device can occupy enough of the available service time for a controller or SAS bus (i.e. JBOD enclosure) to make a noticeable performance drop in otherwise “unrelated” ZFS pools. Hence, detection of such events is an important thing. Here’s an example of an old WD SATA disk failing as viewed from the NexentaStor “Data Sets” GUI:

Something is wrong with device c5t84d0...

Device c5t84d0 is having some serious problems. Busy time is 7x higher than counterparts, and its average service time is 14x higher. As a member of a RAIDz group, the entire group is being held back by this “under-performing” member. From this snapshot, it appears that NexentaStor is giving us some good information about the disk from the “web GUI” but this assumption would not be correct. In fact, the “web GUI” is only reporting “real time” data so long as the disk is under load. In the case of a lightly loaded zpool, the statistics may not even be reported.
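The “compare it to its counterparts” check can be sketched in Python as follows. This is only an illustration: the function name, the 5x cut-off and the peer device names are hypothetical, and the numbers are invented so that the ratios match the 7x and 14x figures described above.

```python
from statistics import median

def slow_members(stats, factor=5.0):
    """Flag devices far slower than the other members of their group.

    stats: {device: (busy_percent, avg_service_time_ms)}
    factor: how many times the peer median counts as "slow" (arbitrary).
    Returns {device: (busy_ratio, service_time_ratio)} for flagged devices.
    """
    flagged = {}
    for name, (busy_pct, svc_ms) in stats.items():
        peers = [v for k, v in stats.items() if k != name]
        peer_busy = median(b for b, _ in peers)
        peer_svc = median(s for _, s in peers)
        # Idle peers give no baseline, so treat the ratio as 0 in that case.
        busy_ratio = busy_pct / peer_busy if peer_busy else 0.0
        svc_ratio = svc_ms / peer_svc if peer_svc else 0.0
        if busy_ratio >= factor or svc_ratio >= factor:
            flagged[name] = (round(busy_ratio, 1), round(svc_ratio, 1))
    return flagged

# Invented figures; only c5t84d0 is a device name from this post.
raidz_group = {
    "c5t84d0": (70, 70.0),
    "c5t85d0": (10, 5.0),
    "c5t86d0": (10, 5.0),
    "c5t87d0": (10, 5.0),
}
print(slow_members(raidz_group))
```

Using the median of the other members, rather than the group average, keeps one bad disk from dragging the baseline up and hiding itself.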

However, from the command shell, historic and real-time access to per-device performance is available. The output of “iostat -exn” shows the count of all errors for devices since the last time counters were reset, and average I/O loads for each:

Device statistics from 'iostat' show error and I/O history.
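If you would rather not eyeball that output, a small Python sketch can pick out the devices with hardware or transport errors. It assumes the column header row that “iostat -exn” normally prints on Solaris-derived systems (ending in “s/w h/w trn tot device”); the parser reads column names from that row instead of hard-coding positions. The sample text is invented for illustration, not the output captured above, and the function names are my own.

```python
def parse_iostat_exn(text):
    """Parse `iostat -exn` text into {device: {column: value}}.

    Column names come from the header row (the one ending in "device"),
    so nothing is assumed about column order.
    """
    columns = None
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[-1] == "device":
            columns = fields[:-1]
            continue
        if columns and len(fields) == len(columns) + 1:
            try:
                values = [float(f) for f in fields[:-1]]
            except ValueError:
                continue  # banner or other non-data line
            devices[fields[-1]] = dict(zip(columns, values))
    return devices

def suspects(devices):
    """Devices reporting any hardware (h/w) or transport (trn) errors."""
    return sorted(name for name, row in devices.items()
                  if row.get("h/w", 0) > 0 or row.get("trn", 0) > 0)

# Invented sample in the usual layout; the numbers are not real measurements.
SAMPLE = """\
                            extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    4.1   12.3  210.5  640.2  0.0  0.9    0.0   71.3   0  68   0  12  45  57 c5t84d0
    4.0   12.1  208.9  638.7  0.0  0.1    0.0    5.1   0  10   0   0   0   0 c5t85d0
"""

stats = parse_iostat_exn(SAMPLE)
print(suspects(stats))
print(stats["c5t84d0"]["asvc_t"], stats["c5t85d0"]["asvc_t"])
```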

The output of iostat clearly shows this disk has serious hardware problems. It indicates hardware errors as well as transmission errors for the device recognized as ‘c5t84d0’ and the I/O statistics – chiefly read, write and average service time – implicate this disk as a performance problem for the associated RAIDz group. So, if the device is really failing, shouldn’t there be a log report of such an event? Yes, and here’s a snip from the message log showing the error:

SCSI error with ioc_status=0x8048 reported in /var/log/messages for failing device.

However, in this case, the log is not “full” of messages of this sort. In fact, the error only showed up under the stress of an iozone benchmark (run from the NexentaStor ‘nmc’ console). I can (somewhat safely) conclude this to be a device failure since at least one other disk in this group is of the same make, model and firmware revision as the culprit. The interesting aspect of this “failure” is that it does not result in a read, write or checksum error for the associated zpool. Why? Because the device is only loosely coupled to the zpool as a constituent leaf device. It also implies that the device errors were recoverable by either the drive or the device driver (mapping around a bad/hard error).

Since these problems are being resolved at the device layer, the ZFS pool is “unaware” of the problem as you can see from the output of ‘zpool status’ for this volume:

Problems with disk device as yet undetected at the zpool layer.

This doesn’t mean that the “consumers” of the zpool’s resources are “unaware” of the problem, as the disk error has manifested itself in the zpool as higher delays, lower I/O throughput and subsequently less pool bandwidth. In short, if the error is persistent under load, the drive has a correctable but catastrophic (to performance) problem and will need to be replaced. If, however, the error goes away, it is possible that the device driver has suitably corrected for the problem and the drive can stay in place.

SOLORI’s Take: How do we know if the drive needs to be replaced? Time will establish an error rate. In short, running the benchmark again and watching the error counters for the device will determine if the problem persists. Eventually, the errors will either go away or they won’t. For me, I’m hoping the disk fails so I have an excuse to replace the whole pool with a new set of SATA “eco/green” disks for more lab play. Stay tuned…
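Watching the counters between benchmark runs boils down to diffing two snapshots. Here is a minimal Python sketch of that bookkeeping; the snapshot structure, the function name and the counts are hypothetical, with the counter names borrowed from the error columns iostat reports (s/w, h/w, trn).

```python
def error_deltas(before, after):
    """Per-device growth in error counters between two snapshots.

    before/after: {device: {"s/w": n, "h/w": n, "trn": n}}
    Returns only the devices and counters that increased.
    """
    deltas = {}
    for device, counts in after.items():
        earlier = before.get(device, {})
        grown = {}
        for counter, value in counts.items():
            change = value - earlier.get(counter, 0)
            if change > 0:
                grown[counter] = change
        if grown:
            deltas[device] = grown
    return deltas

# Invented counts taken before and after a second benchmark run.
run_1 = {"c5t84d0": {"s/w": 0, "h/w": 12, "trn": 45},
         "c5t85d0": {"s/w": 0, "h/w": 0, "trn": 0}}
run_2 = {"c5t84d0": {"s/w": 0, "h/w": 15, "trn": 45},
         "c5t85d0": {"s/w": 0, "h/w": 0, "trn": 0}}

print(error_deltas(run_1, run_2))
```

An empty result after repeated runs would suggest the errors have stopped; counters that keep growing under load point toward replacing the disk.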

SOLORI’s Take: In all of its flavors, 1.5Gbps, 3Gbps and 6Gbps, I find SATA drives inferior to “similarly” spec’d SAS for just about everything. In my experience, the worst SAS drives I’ve ever used have been more reliable than most of the SATA drives I’ve used. That doesn’t mean there are “no” good SATA drives, but it means that you really need to work within tighter boundaries when mixing vendors and models in SATA arrays. On top of that, the additional drive port and better typical sustained performance make SAS a clear winner over SATA (IMHO). The big exception to the rule is economy – especially where disk arrays are used for on-line backup – but that’s another discussion…

In Medio Stat Veritas

SOLORI's Take and Quick Take posts express my personal opinion unless explicitly attributed to other sources. Where possible, supporting facts are presented to properly frame and ground these opinions, however they are presented "AS-IS" without regard to warranty or promise: expressed or implied.