Hardware is Boring–The HCL Corollary

I think I’ve been pretty clear that I’m a fan of boring hardware. I’ve said it publicly, I’ve said it in presentations, I’ve said it on stage: no one cares about hardware, they care that the applications and workloads are available and performing appropriately.

Of course, being boring doesn’t make it easy, or unrelated to the success of the platform that IT is supposed to be. There’s not a piece of software that has ever been written that makes shitty hardware magically better.

What are the takeaways here? Is this a VSAN issue? Is it a PERC H310 issue? Is there blame that needs to be placed? All of these are good questions, and from what I’ve seen so far, VMware has responded quickly and well to the situation, and I expect that they will get to the bottom of things with the customer. I’m still a fan of the tech, regardless of how this early-adopter issue turns out. Besides, if you have been around long enough you know that every vendor has an outage at some point, so while it’s VMware’s turn in the spotlight today, this definitely isn’t meant as a slight to the company or the VSAN team, many of whom I’ve shared fantastically good scotch with.

The bigger issue for me is the fact that a customer purchased something that was certified and put on an HCL, and that hardware appears to have been completely unsuitable for the job. The best hardware may be boring, but don’t underestimate the amount of time, expertise and investment that goes into the creation of that kind of gear, especially in a multi-vendor, multi-vector stack (compute and storage, for example). If you are not going to include hardware with a software solution, whether that hardware is purpose-built or commodity, you have to invest even more in making sure that customers get the experience you want when they bring whatever hardware they have to the table.

In this case, VMware doesn’t want to include a hardware component with VSAN, although they do work with partners to deliver “Virtual SAN™ Ready Nodes” that can be purchased from the likes of Dell, Cisco, Fujitsu and others. The nodes presumably go through some sort of testing process before VMware stamps them and puts them on the HCL, and then customers are given some level of reassurance that the hardware will indeed run the software.

Best case, that’s the end of it, and the intersection between hardware and software is done at arm’s length. Ah, but this wasn’t a best case scenario, was it?

My guess is, given all of the information that has come out about the PERC H310 in the last 24 hours, that the customer would like to go back and ask a few more questions rather than just relying on the HCL. First on that list, I’d imagine, is probably: “Sure, the hardware will work, but is it appropriate for my use case?” From the looks of it, the answer may have been no, but there’s no way to tell. It works, and that’s all the HCL is good for. There’s no real indication of the workloads a particular configuration is suitable for.

The other method of delivering this is to include hardware, tightly coupled to the software, in order to provide a consistent experience for customers. Even in the storage space, there are a number of companies who do this today. Nutanix and Nimble use SuperMicro servers, Simplivity and SolidFire use Dell, Pure uses Xyratex. In many of these cases, the actual software that is the core of the IP these companies produce will work just fine on many different kinds of hardware (or in public clouds like AWS), but in order to provide the best user experience possible, hardware is included and often required. This isn’t a start-up company thing either! One could argue that even the mighty Vblock is a prescriptive, included, standardized hardware platform tightly coupled to (most of) the VMware platform.

Overall, I don’t know that there’s a right answer, since both approaches have their merits. In one, the HCL is a public facing document, and inclusion on it becomes something that is largely driven by demand from partners and customers. In the other, the HCL is an internal document and is there so that the development and support teams can have a firm foundation to work with.

I’m sure there are LOTS of people who have been following the Reddit thread who look at the VMware HCL in a whole new light. When 50% (4/8) of the Dell “Virtual SAN™ Ready Nodes” that are listed include the same controller that seems to have contributed to the issue in question, maybe it’s not the reassurance they were looking for after all.