Tag: Hardware

To say my last blog post was “a while ago” would be a grave understatement. Unfortunately, I’ve mainly been busy with something that was entirely new for Nutanix, and with the amount of work involved and the sensitive nature of what I was working on, there was relatively little room left to blog. Especially since I usually ended up blogging about stuff I stumbled upon while doing my job or researching.

This all started with me changing from my presales-focused role to our internal “Solutions & Performance Engineering” team, which focuses on the business-critical applications running on the Nutanix platform. In essence, these are the applications that are the lifeblood of a company: applications which, if they are unavailable, will cost the company significant amounts of money.

One of those applications is SAP, or more specifically the SAP HANA in-memory database. My colleagues (mainly Kasim Hansia, Rainer Wacker and Alexander Thoma) had already been doing a great job, and all of the SAP applications were certified to run on the Nutanix platform in November 2016. The big question was always “When can we run SAP HANA on Nutanix?”.

Working on the answer to this question is what I’ve been busy with for the last year or so. While I won’t bother you with the specific details on the business side of things, I do want to take a bit of time to show what it’s like to go through the process of validating a new application.

First off, the SAP HANA in-memory database is an application that scales to levels that many people won’t ever see in action. You can run HANA in two ways. You either scale up the resources of a single server, for example running with up to 20TB of memory, or you can scale out by adding multiple servers and distributing the load across all servers.

Now, SAP has given the customer two options to select hardware to run SAP HANA. One is an “appliance model” where you choose a configuration as a whole, and everything will run in a pre-tested and validated fashion. You are assured of a specific behavior of the whole system while running your application. The other option is something called “Tailored Data Center Integration”, or TDI for short, where in essence you select your hardware from a hardware compatibility list and have the freedom to mix and match.

What we have done is work with SAP to introduce a new third category called “Hyperconverged Infrastructure” or HCI. The HCI category assumes that we are running SAP HANA in a virtualized fashion, and that we “collapse” several infrastructure components such as compute and storage into an integrated system.

The maximum sizes for this category are smaller than for the other two categories, but the requirements that are in place for this certification are hardly any more lenient. For example, there is a storage test to ensure storage performance, in which log overwrite operations initially needed to have a latency of <= 400 microseconds (this changed later on). Another example is a test suite of close to 700 tests that emulate real-world issues. The results are compared to those of a bare-metal installation, and you only get a passing grade if the performance delta between the two stays within a specific maximum.
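To make those two kinds of checks a bit more tangible, here is a minimal Python sketch of what such a pass/fail evaluation could look like. This is purely my own illustration and not the actual SAP test tooling: the names `RunResult`, `latency_ok`, `performance_delta` and `failing_tests` are made up, and the 10% maximum delta is a placeholder assumption, since the real threshold isn’t something I’m stating here. Only the 400 microsecond figure comes from the initial requirement mentioned above.

```python
from dataclasses import dataclass

# Initial limit for log overwrite latency mentioned in the text (changed later on).
LOG_OVERWRITE_LATENCY_LIMIT_US = 400.0

# Hypothetical placeholder: the real maximum delta is not given in this post.
MAX_DELTA = 0.10


@dataclass
class RunResult:
    """Runtime of one test on bare metal and on the virtualized system."""
    name: str
    bare_metal_seconds: float
    virtualized_seconds: float


def latency_ok(latency_us: float,
               limit_us: float = LOG_OVERWRITE_LATENCY_LIMIT_US) -> bool:
    """Return True if the measured latency is within the limit."""
    return latency_us <= limit_us


def performance_delta(result: RunResult) -> float:
    """Relative slowdown of the virtualized run compared to bare metal."""
    return ((result.virtualized_seconds - result.bare_metal_seconds)
            / result.bare_metal_seconds)


def failing_tests(results: list[RunResult],
                  max_delta: float = MAX_DELTA) -> list[str]:
    """Names of all tests whose slowdown exceeds the allowed maximum."""
    return [r.name for r in results if performance_delta(r) > max_delta]


if __name__ == "__main__":
    runs = [
        RunResult("query_a", bare_metal_seconds=10.0, virtualized_seconds=10.5),
        RunResult("query_b", bare_metal_seconds=10.0, virtualized_seconds=12.0),
    ]
    print("Latency within limit:", latency_ok(380.0))
    print("Failing tests:", failing_tests(runs))
```

The point of the sketch is simply that every single test is judged relative to its bare-metal counterpart, so one slow outlier among hundreds of tests is enough to send you back to tuning.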

All this meant that I had my work cut out for me from the start. We started off working with a server model that hadn’t been qualified before, and then switched to the validation hardware, namely a Lenovo SR950. A big four-socket server with the fastest CPUs we could use, namely the Intel Xeon Platinum 8180M processors, 3072 GB of RAM, 3.84 TB SSDs and 1.6 TB NVMe drives.

Now, as much as Nutanix is a software company, we do strictly check that hardware meets specific prerequisites to ensure a smooth user experience and to make sure that certain performance metrics are a given. The issue is that all of the checks and functionality in place didn’t work for this new hardware. Simple things like making the status indicator LED for the NVMe light up, or mapping the physical drive locations back to the diagram view in Prism. It meant modifying Python files that handle how hardware is accessed, packaging everything back up into Python egg files, restarting services and then magically seeing drives that the system was able to access. Or passing through NICs so that we could test with “RDMA over Converged Ethernet” or RoCE, and changing BIOS settings to ensure maximum performance.

And while pushing the underlying hardware to its limits, it also meant we had to dive deep, and I mean very deep, into the software side of things. From things like experimenting with C-states on the CPU and NIC multiqueueing in the virtual machines, down to changing parameters in our product, ensuring that specific CPU features are passed through, pinning virtual CPUs to their physical location, or making changes to how often a VM exit is triggered.

I can’t go into all of the specific details since some of it is Nutanix’s intellectual property, but I’ll check what I can share in future posts, and if you have any specific questions, please ask them, and I’ll try to answer as best I can. What I can say is that we pushed the limits of our platform and quite a few of the things we found are going to be implemented back into the product, and I expect a lot of those changes to show up in the release notes of an upcoming release.

Fact is, I learned a ton of new things, and this all culminated in our validation for pre-production use of SAP HANA on Nutanix AHV as announced in https://www.nutanix.com/2018/06/05/finally-can-talk-sap-hana-nutanix/, and we are working full steam ahead on the last steps to get production support. It was and continues to be one hell of a journey, and I just wanted to give you guys a bit of insight into what it is like working on the engineering side of the Nutanix platform, and what a project like this entails.

I want to finish with a special thank you to (in no particular order) Rainer, Alexander, Kasim, Malcolm, Jonathan, and the extended team. It’s been one heck of an experience! 🙂

When you come to think about it, people who work in the IT sector are all slightly nuts. We all work in an area that is notorious for trying to make itself not needed. When we find repetitive tasks, we try to automate them. When we have a feeling that we can improve something, we do just that. And by doing that, we try to remove ourselves from the equation where we possibly can. In a sense, we try to make ourselves invisible to the people working with our infrastructure, because a happy customer is one that doesn’t even notice that we are there or did something to allow him to work.

Traditional IT shops were loaded with departments that were responsible for storage, for networking, for operating systems and loads more. The one thing that each department had in common? They tried to make everything as easy and smooth as possible. Usually you will find loads of scripts that perform common tasks, automated installations and processes that are intended to take effort off the admins’ hands.

In comes a new technology that allows me to automate even more, that removes the hassle of choosing the right hardware. That helps me reduce downtimes because of (un)planned maintenance. It also helps me reduce worrying about operating system drivers and stuff like that. It’s a new technology that people refer to as server virtualization. It’s wonderful and helps me automate yet another layer.

All of the people who are into tech will now say “cool! This can help me make my life easier”, and your customer will thank you because it’s an additional service you can offer, and it helps your customer work. But the next question your customer is going to ask you is probably going to be something along the lines of “Why can’t I virtualize the rest?”, or perhaps even “Why can’t I virtualize my application?”. And you know what? Your customer is absolutely right. Companies like VMware are already sensing this, as can be read in an interview at GigaOM.

The real question your customer is asking is more along the lines of “Who cares about your hardware or operating system?!”. And as much as it pains me to say it (being a person who loves technology), it’s a valid question. When it comes to true virtualization, why should it bother me if I am running on Windows, Unix, Mac or Linux? Who cares if there is an array in the background that uses “one point twenty-one jiggawatts” to transport my synchronously mirrored historic data back to the future?

In the long run, I as a customer don’t really care about either software or hardware. As a customer I only care about getting the job done, in the way that I expect, and preferably as cheaply as possible with the availability I need. In an ideal world, the people and the infrastructure in the back are invisible, because that means they did a good job, and I’m not stuck wondering what my application runs on.

This is the direction we are working towards in IT. It’s nothing new, and the concept of doing this in a centralized/decentralized fashion seems to change from decade to decade, but the only thing that remained a constant was that your customer only cared about getting the job done. So, it’s up to us. Let’s get the job done and try to automate the heck out of it. Let’s remove ourselves from the equation, because time that your customer spends talking to you is time spent not using his application.