Squashing assumptions with Data Science

Earlier this year Nimble Storage announced their all-flash array called the Predictive Flash Platform; you can read my thoughts on the launch over here. InfoSight is one of the core components of that announcement, which is why we had the opportunity for a fireside chat with the Nimble Storage data science team. We discussed the workings of InfoSight & VMVision and how this relates to actual benefits for an owner of a Nimble Storage array. This post will also touch on some of the key points discussed during the later Storage Field Day 10.

InfoSight & VMVision

To recap: InfoSight collects statistics from all Nimble Storage arrays. VMVision is an extra feature of InfoSight that extends this monitoring to the hypervisor level, letting you look at a (performance) problem from two angles: the storage system and the server using the storage. Why? Because analytics show that fewer than half of the problems experienced as a “storage problem” actually originate in the storage environment. More than half of them thus originate in other layers further up the stack. I’ve written an extensive post about InfoSight VMVision not too long ago, which you can find here.

VMVision uses a collector that runs on the Nimble Storage array itself and connects to the VMware stack via API calls. Nimble Storage has expanded the monitoring even further: not just storage & hypervisor, but also the application itself. When you create a volume, you can classify it as belonging to an application. This is extremely useful for workload correlation and predicting workload spikes. The latter could be especially useful on hybrid arrays that mix flash and slower drives, where data could be promoted to flash ahead of time.

Some of the questions asked in the early parts of the fireside chat:

Will you extend InfoSight and VMVision to also monitor 3rd party arrays?

While not ruled out, adding application monitoring to InfoSight has higher priority.

Do the Nimble Storage arrays support containers (like Docker)?

Nimble itself does use containers for internal software development, but doesn’t see too many containers in use at customers. The systems themselves do not support containers just yet, but containers are a point of focus for Nimble. So we’ll see them when we see them…

Data Science

InfoSight data is logged at second or minute granularity, with most of the sensors collecting on a per-minute basis. Data is kept for 6 months in the InfoSight cloud. Good news for admins who actually need to use this data: the 1-minute granularity is maintained for the entire 6 months, which allows for very detailed troubleshooting, even if you need to look back 5 months! So no 1-hour averages several months down the line that are next to useless for analyzing intermittent, brief peaks.
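To illustrate why the retained 1-minute granularity matters, here is a minimal sketch in Python. The latency numbers are made up for the illustration and have nothing to do with real InfoSight data; the `downsample` helper is hypothetical as well.

```python
from statistics import mean

# Hypothetical latency samples in milliseconds, one per minute for one hour:
# 58 quiet minutes at 2 ms plus a brief two-minute spike at 80 ms.
samples_ms = [2.0] * 60
samples_ms[30] = 80.0
samples_ms[31] = 80.0


def downsample(values, bucket_size):
    """Average consecutive groups of bucket_size samples."""
    return [
        mean(values[i:i + bucket_size])
        for i in range(0, len(values), bucket_size)
    ]


per_minute_peak = max(samples_ms)
hourly_average = downsample(samples_ms, 60)[0]

print(f"peak at 1-minute granularity: {per_minute_peak:.1f} ms")
print(f"same hour as a 1-hour average: {hourly_average:.1f} ms")
```

The two-minute spike is obvious in the per-minute data, but all but disappears once the hour is flattened into a single average.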

So what is this data used for?

Non-stop availability (99.9997% measured so far in the field)

The data allows upgrade paths to be blacklisted: arrays are checked for conditions that could cause an issue, and an affected upgrade path is then blocked.

It also helps identify the onset of rare hardware conditions, like a disk that goes bad but not bad enough to be marked as failed. One of the examples Nimble gave was a drive that would show dramatically low write performance. InfoSight was used to respond to these corner cases by proactively replacing the affected drives, until engineering created a code improvement that automatically failed those drives.

The Nimble arrays use Triple+ RAID. This means they can tolerate 3 simultaneous SSD failures, with the + meaning there’s intra-drive parity in every SSD. It might sound like overkill, but according to Nimble it’s necessary: while entire SSDs fail less frequently than spinning disks, they have a higher partial failure rate.
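To put the 99.9997% availability figure quoted above into perspective, a quick back-of-the-envelope calculation helps. This is plain arithmetic on the percentage, assuming a 365-day year; the function name is my own.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_seconds_per_year(availability_percent):
    """Convert an availability percentage into expected downtime per year."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


measured = downtime_seconds_per_year(99.9997)
five_nines = downtime_seconds_per_year(99.999)

print(f"99.9997%: about {measured:.0f} seconds of downtime per year")
print(f"99.999%:  about {five_nines:.0f} seconds of downtime per year")
```

That works out to roughly a minute and a half of downtime per year, compared to a bit over five minutes for the classic "five nines".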

Cross-Stack Analysis

Take seemingly random latency spikes, for example on all C-drives of all Windows machines. After diagnosing with InfoSight and VMVision, it turned out to be related to a virus scanning schedule. After adjusting the update/scanning schedule to spread the updates and scans out over the day, the latency spikes disappeared.
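Spikes like these look random until you line them up against the clock. The sketch below is not how InfoSight does it; it's just a simple way to show the idea, using invented samples and a hypothetical `spiky_hours` helper that flags hours of the day whose average latency sits well above the overall median.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, median


def spiky_hours(samples, factor=3.0):
    """Return the hours of the day whose mean latency exceeds
    factor times the overall median latency."""
    by_hour = defaultdict(list)
    for timestamp, latency_ms in samples:
        by_hour[timestamp.hour].append(latency_ms)
    baseline = median(latency_ms for _, latency_ms in samples)
    return sorted(
        hour for hour, values in by_hour.items()
        if mean(values) > factor * baseline
    )


# Hypothetical data: three days of hourly samples, 2 ms normally,
# 40 ms whenever the (imaginary) virus scan kicks in at 02:00.
start = datetime(2016, 1, 1)
samples = []
for step in range(72):
    ts = start + timedelta(hours=step)
    samples.append((ts, 40.0 if ts.hour == 2 else 2.0))

print(spiky_hours(samples))
```

A spike that recurs at the same hour every day is a strong hint that a schedule is involved rather than the storage itself.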

Automatic Diagnostics and Predictions

This basically determines when the activity of a VM, virtual disk or volume is competing for a shared resource and impeding activity on a neighboring one. It allows you to pinpoint the active VM that’s using a lot of resources and absorbing all the headroom.
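As a rough illustration of what pinpointing a noisy neighbor boils down to, here is a small sketch. The VM names, the IOPS numbers and the 50% threshold are all assumptions on my part; the real analysis in InfoSight is undoubtedly more sophisticated.

```python
def noisy_neighbors(iops_by_vm, share_threshold=0.5):
    """Return VMs that consume at least share_threshold of the total IOPS,
    busiest first. The threshold is an arbitrary choice for this sketch."""
    total = sum(iops_by_vm.values())
    if total == 0:
        return []
    heavy = [
        vm for vm, iops in iops_by_vm.items()
        if iops / total >= share_threshold
    ]
    return sorted(heavy, key=lambda vm: iops_by_vm[vm], reverse=True)


# Hypothetical per-VM IOPS on one shared datastore.
observed = {"web01": 500, "db01": 9000, "file01": 300, "vdi07": 200}

print(noisy_neighbors(observed))
```

Here a single VM is responsible for 90% of the IO on the datastore, which makes it the obvious candidate to look at first.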

The Nimble Storage arrays unfortunately currently do not support QoS, but it’s firmly on the roadmap and specifically aimed at noisy neighbors. The heavy user would pay a penalty, just enough to allow the rest of the machines to continue as smoothly as possible.

Nimble demonstrated a troubleshooting case where the array showed high latency. The graphs showed that this was primarily attributable to an undersized SSD cache. After expanding the cache, the throughput of the array (IOPS) increased substantially, while maintaining relatively low latency. The interface then showed that CPU saturation was holding back even lower latencies: so if the latency was still not sufficient for the application, the next best upgrade would be faster CPUs.

Hardware sizing

If you know the cache is a problem in your system, the next step is figuring out how much cache you need to add. InfoSight will help you with that by analyzing the working set and giving you a recommended cache size.
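Nimble didn't go into the details of how InfoSight analyzes the working set, so the following is only a naive sketch of the concept: count the distinct blocks touched in an access trace and compare that to the cache you have. The trace, the 4 KiB block size and the function names are all assumptions for the sake of the example.

```python
def working_set_bytes(block_accesses, block_size_bytes=4096):
    """Naive working set estimate: distinct blocks touched times block size."""
    return len(set(block_accesses)) * block_size_bytes


def cache_shortfall_bytes(block_accesses, cache_bytes, block_size_bytes=4096):
    """How much cache would have to be added to hold the whole working set."""
    needed = working_set_bytes(block_accesses, block_size_bytes)
    return max(0, needed - cache_bytes)


# Hypothetical trace of block numbers accessed during a measurement window.
trace = [1, 2, 3, 1, 2, 4, 5, 1]

print(working_set_bytes(trace))            # distinct blocks * 4096 bytes
print(cache_shortfall_bytes(trace, 8192))  # extra cache needed, in bytes
```

A real recommendation would also weigh how often each block is accessed, since caching rarely touched data buys you very little.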

It will also show you an estimate of the available headroom on a storage array, allowing you to move workloads from overloaded arrays to underutilized systems.

Greenfields (where you’re building a new environment without historical information) can also be sized with a tool. Currently a customer can use three predefined workloads (Exchange, VDI and SQL), but there’s also the option to define a custom workload and see how an array would cope with it.

And finally, some feedback for Nimble itself. InfoSight shows how a Nimble OS software upgrade affects array performance: newer releases generally show lower overhead and better performance, purely from code optimizations on the same hardware. This is usually not visible to customers, but it still influences system and software design.

My thoughts

In half a quarter of shipping the AFA (between the launch and SFD10), Nimble managed to service 55 customers, half of them being new customers. 12% of array revenue was already coming out of the AFAs, with 64% of the deployments on Unified arrays. Pretty good numbers for a new product.

No worries; the whitepaper explains how to read these density maps 😉

With regard to InfoSight: it’s still awesome. Other vendors recognize this as well, as we saw several of them putting effort into making a version of their own. If you doubt the usefulness of InfoSight and data science: Nimble released a white paper about the most common IO sizes per application, which you can find here. It’s well worth the read and could help you tune your next IOmeter benchmark!

Disclaimer: A large part of this post is based on the Predictive Flash Platform launch, for which Nimble Storage invited me to San Francisco. They paid for the flights, hotel and various expenses such as food. Some food during this trip was bought by either myself or Dan Frith. Additionally, GestaltIT paid for the flight, hotel and various other expenses to make it possible for me to attend SFD10. On both trips I was not compensated for my time and there is no requirement to blog or tweet about any of the presentations. Everything I post is of my own accord, as always.