Review: 2.5 years with Nimble Storage

Disclaimer: I’m not getting paid for this review, nor have I been asked to do this by anyone. These views are my own, and not my employer’s, and they’re opinions, not facts.

Intro:

To begin with, as you can tell, I’ve been running Nimble Storage for a few years at this point, and I felt like it was time to provide a review of both the good and bad. When I was looking at storage a few years ago, it was hard to find useful reviews of vendors. The ones I found were very short, uninformative, clearly paid for, or posts by obvious fanboys.

Ultimately Nimble won us over against the various storage lines listed below. It’s not a super huge list, as there was only so much time and budget that I had to work with. There were other vendors I was interested in but the cost would have been prohibitive, or the solution would have been too complex. At the time, Tintri and Tegile never showed up in my search results, but ultimately Tintri wouldn’t have worked (and still doesn’t) and Tegile is just not something I’m super impressed with.

NetApp

X-IO

Equallogic

Compellent

Nutanix

After a lot of discussions and research, it basically boiled down to NetApp vs. Nimble Storage, with Nimble obviously winning us over. While I made the recommendation with a high degree of trepidation, and even after a month with the storage wondered if I had made an expensive mistake, I’m happy to say it was and still is a great storage decision. I’m not going into why I chose Nimble over NetApp, perhaps some other time. For now this post is about Nimble, so let’s dig into it.

When I’m thinking about storage, the following are the high level areas that I’m concerned about. This is going to be the basic outline of the review.

Performance / Capacity ratios

Ease of use

Reliability

Customer support

Scaling

Value

Design

Continued innovation

Finally, for your reference, we’re running 5 of their 460’s, which is between their cs300 and cs500 platforms and these are hybrid arrays.

Performance / Capacity Ratios

Good performance, like a lot of things, is in the eye of the beholder. When I think of what defines storage as being fast, it’s IOPS, throughput and latency. Depending on your workload, one may be more important to you than the others, or maybe you just need something that can do OK with all of those factors, but not awesome in any one area. To me, Nimble falls in the general purpose array category: it doesn’t do any one thing great, but it does a lot of things very well.

Below you’ll find a break down of our workloads and capacity consumers.

IO breakdown (estimates):

MS SQL (50% of our total IO)

75% OLTP

25% OLAP

MS Exchange (30% of total IO)

Generic servers (15% of total IO)

VDI (5% of total IO)

Capacity consuming apps:

SQL (40TB after compression)

File server (35TB after compression)

Generic VM’s (16TB after compression)

Exchange (8TB after compression)

Compression? yeah, Nimble’s got compression…

Nimble’s probably telling you that compression is better than dedupe, and they even have all kinds of great marketing literature to back it up. The reality, like anything, is that it all depends. I will start by saying if you need a general purpose array, and can only get one or the other, there’s only one case where I would choose dedupe over compression, which is data sets mostly consisting of operating system and application installer data. The biggest example of that would be VDI, but basically it’s wherever you find your data consisting mostly of the same data over and over. Dedupe will always reduce better than compression in these cases. For everything else, you’re likely better off with compression. At this point, compression is pretty much a commodity, but if you’re still not a believer, below you can see my numbers. Basically, Nimble (and everyone else using compression) delivers on what they promise.

SQL: Compresses very well, right now I’m averaging 3x. That said, there is a TON of white space in some of my SQL volumes. The reality is, I normally get a minimum of 1.5x and usually end up more along the 2x range.

Exchange 2007: Well this isn’t quite as impressive, but anything is better than nothing, 1.3x is about what we’re looking at. Still not bad…

Generic VM’s: We’re getting about 1.6x, so again, pretty darn good.

Windows File Servers: For us it’s not entirely fair to just use the general average, as we have a TON of media files that are pre-compressed. What I’ll say is our generic user / department file server gets about a 1.6x – 1.8x reduction.
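To put those ratios in perspective, here is a minimal Python sketch that turns a compression ratio into the fraction of space saved and back into a logical (uncompressed) size. The function names are made up for illustration, and the ratios are just the rough averages quoted above, not measured output from the array.

```python
# Rough compression ratios quoted in this review (labels are illustrative).
RATIOS = {
    "SQL": 3.0,
    "Exchange 2007": 1.3,
    "Generic VMs": 1.6,
    "File servers (low end)": 1.6,
}


def space_saved(ratio: float) -> float:
    """Fraction of the logical data that no longer needs physical space."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 - 1.0 / ratio


def logical_tb(stored_tb: float, ratio: float) -> float:
    """Logical (uncompressed) size implied by a stored size and a ratio."""
    return stored_tb * ratio


if __name__ == "__main__":
    for name, ratio in RATIOS.items():
        print(f"{name}: {ratio}x saves about {space_saved(ratio):.0%}")
```

For example, 40TB stored at a 3x average would correspond to roughly 120TB of logical data, while a 1.3x ratio only saves a bit under a quarter of the space.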

Show me the performance…

Ok, so great, we can store a lot of data, but how fast can we access it? In general, pretty darn fast…

The first thing I did when we got the arrays was fire up IOMeter, and tried thrashing the array with a 100% random read 8k IO profile (500GB file), and you know what, the array sucked. I mean I was getting like 1,200 IOPS and really high latency, and was utterly disappointed almost instantly. In hindsight, that test was unrealistic and unfair to some extent. Nimble’s caching algorithm is based on random in, random out, and IOmeter was sequential in (ignored) and then attempting random out. For me, what was more bothersome at the time, and still is to some degree, is that it took FOREVER before the cache hit ratio got high enough that I was starting to get killer performance. It’s actually pretty simple to figure out how long it would take a cold dataset like that to completely heat up: divide 524,288,000 KB by 9,600 KB/s, which works out to about 15 hours. The 524,288,000 is 500GB converted to KB. The 9,600 is 8k * 1,200 IOPS, which gives the approximate throughput in KB/s at 8k.

So you’re probably thinking all kinds of doom and gloom, and how could I recommend Nimble with such a long theoretical warm up time? Well, let’s dig into why:

That’s a synthetic test and a worst case test. That’s 500GBs of 100% random, non-compressed data. If that data was compressed for example to 250GB, it would “only” take 7.5 hours to copy into cache.

On average only 10% – 20% of your total dataset is actually hot. If that file was compressed to 250GB, worst case you’re probably looking at 50GB that’s hot, and more realistically 25GB.

That was data that was written 100% sequential and then read 100% random. It’s not a normal data pattern.

That time is how long it takes for 100% of the data to get a 100% cache hit. The reality is, it’s not too long before you’re starting to get cache hits and that 1,200 IOPS starts looking a lot higher (depending on your model).
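The warm up arithmetic above is easy to sketch in Python. This is only a back-of-the-envelope model: it assumes every cold read is a fixed-size IO served at a constant cache-miss rate, and the function name and parameters are mine, not anything from Nimble.

```python
KB_PER_GB = 1024 * 1024


def warmup_hours(dataset_gb: float, io_size_kb: float, miss_iops: float) -> float:
    """Hours to read a cold dataset once at a fixed cache-miss IOPS rate."""
    throughput_kb_per_s = io_size_kb * miss_iops      # 8k * 1,200 = 9,600 KB/s
    seconds = dataset_gb * KB_PER_GB / throughput_kb_per_s
    return seconds / 3600


if __name__ == "__main__":
    cases = [
        ("500GB uncompressed (the IOmeter test)", 500),
        ("250GB after 2x compression", 250),
        ("50GB hot set (worst case)", 50),
        ("25GB hot set (more realistic)", 25),
    ]
    for label, size_gb in cases:
        hours = warmup_hours(size_gb, io_size_kb=8, miss_iops=1200)
        print(f"{label}: about {hours:.1f} hours")
```

The hot set cases are the interesting ones: under the same assumptions, a 25GB to 50GB hot set warms up in well under two hours rather than fifteen.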

There are a few example cases where that IO pattern is realistic:

TempDB: When we were looking at Fusion IO cards, the primary workload that folks used them for in SQL was TempDB. TempDB can be such a varied workload that it’s really tough to tune for, unless you know your app. Having a sequential in, random out in TempDB is a very realistic scenario.

Storage Migrations: Whether you use Hyper-V or VMware, when you migrate storage, that storage is going to be cold all over again with Nimble. Storage migrations tend to be sequential write.

Restoring backup data: Most restores tend to be sequential in nature. With SQL, if you’re restoring a DB, that DB is going to be cold.

If you recall, I highlighted that my IOmeter test was unrealistic except in a few circumstances, and one of those realistic circumstances can be TempDB, and that’s a big “it depends”. But what if you did have such a circumstance? Well, any good array should have some knobs to turn, and Nimble is no different. Nimble now has two ways to solve this:

Cache Pinning: This feature was released in NOS 2.3, basically volumes that are pinned run out of flash. You’ll never have a cache miss.

Aggressive caching: Nimble had this from day one, and it was reserved for cases like this. Basically when this is turned on (volume or performance policy granularity TMK), Nimble caches any IO coming in or going out. While it doesn’t guarantee 100% cache hit ratios, in the case of TempDB, it’s highly likely the data will have a very high cache hit ratio.

Performance woes:

That said, Nimble suffers the same issue that any hybrid array does, which is that a cache miss will make it fall on its face, and that’s further amplified in Nimble’s case by having a weak disk subsystem IMO. If you’re not seeing at least a 90% cache hit ratio, you’re going to start noticing pretty high latency. While their SW can do a lot to defy physics, random reads from disk are one area where they can’t cheat. When they reassure you that you’ll be just fine with 12 7k drives, they’re mostly right, but make sure you don’t skimp on your cache. When they size your array, they’ll likely suggest a cache size anywhere between 10% and 20% of your total data set size. Go with 20% of your data set size or higher, you’ll thank me. Also, if you plan to do pinning or anything like that, account for that on top of the 20%. When in doubt, add cache. Yes, it’s more expensive, but it’s also still cheaper than buying NetApp, EMC, or any other overpriced dinosaur of an array.

The only other area where I don’t see screaming performance is situations where 50% sequential read + 50% sequential write is going on. Think of something like copying a table from one DB to another. I’m not saying it’s slow, in fact, it’s probably faster than most, but it’s not going to hit the numbers you see when it’s closer to 100% in either direction. Again, I suspect part of this has to do with the NL-SAS drives and only having 12 of them. Even with coalesced writes, they still have to commit at some point, which means you have to stop reading data for that to happen, and since sequential data comes off disk by design, you end up with disk contention.
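As a quick illustration of that sizing rule of thumb, here is a small Python sketch. It is my own shorthand rather than Nimble’s sizing tool: the function name, the default 20% fraction and the idea of simply adding pinned capacity on top are all assumptions taken from the paragraph above, and whether you size against compressed or uncompressed data is something to confirm with whoever sizes your array.

```python
def recommended_cache_tb(dataset_tb: float,
                         pinned_tb: float = 0.0,
                         fraction: float = 0.20) -> float:
    """Cache size as a fraction of the data set, plus any pinned volumes."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if dataset_tb < 0 or pinned_tb < 0:
        raise ValueError("sizes must not be negative")
    return dataset_tb * fraction + pinned_tb


if __name__ == "__main__":
    # 99TB is the sum of the capacity consumers listed earlier (40 + 35 + 16 + 8).
    dataset = 99.0
    print(f"10% rule: {recommended_cache_tb(dataset, fraction=0.10):.1f} TB")
    print(f"20% rule: {recommended_cache_tb(dataset):.1f} TB")
    # Hypothetical: 2TB of volumes you intend to pin, on top of the 20%.
    print(f"20% + 2TB pinned: {recommended_cache_tb(dataset, pinned_tb=2.0):.1f} TB")
```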

Performance, the numbers…

I touched on it above, but I’ll basically summarize what Nimble’s IO performance specs look like in my shop. Again, remember I’m running their slightly older cs460’s. If these were cs500’s or cs700’s, all these numbers (except cache misses) would be much higher.

Random Read:

Cache hit: Smoking fast (60k IOPS)

Cache miss: dog slow (1.2k IOPS)

Random Write: fast (36k IOPS)

Sequential

100% read: smoking fast (2GBps)

100% write: fast (800MBps – 1GBps)

50%/50%: not bad, not great (500MBps)

Again, these are rough numbers. I’ve seen higher numbers in all the categories, and I’ve seen lower, but these are very realistic numbers for what I see.
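To show why the cache hit ratio matters so much with random read numbers like these, here is a rough Python sketch that blends the hit and miss rates. It assumes a deliberately simple model where the average time per IO is the weighted mix of the two rates; a real array under concurrent load will not behave this neatly, so treat the output as an illustration only.

```python
def blended_iops(hit_ratio: float,
                 hit_iops: float = 60_000,
                 miss_iops: float = 1_200) -> float:
    """Effective random read IOPS for a given cache hit ratio (0.0 to 1.0)."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    # Average seconds per IO is the weighted mix of hit and miss service times.
    avg_seconds_per_io = hit_ratio / hit_iops + (1.0 - hit_ratio) / miss_iops
    return 1.0 / avg_seconds_per_io


if __name__ == "__main__":
    for ratio in (0.50, 0.80, 0.90, 0.95, 0.99, 1.00):
        print(f"{ratio:.0%} cache hits: about {blended_iops(ratio):,.0f} IOPS")
```

Under this model even a 90% hit ratio lands at roughly a sixth of the all-hit number, which lines up with the advice not to skimp on cache.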

Ease of use:

Honestly the simplest SAN I’ve ever used, or at least mostly. Carving up volumes and setting up snapshots and replication have all been super easy and intuitive. While Nimble provided training, I would contend it’s easy enough that you likely don’t need it. I’d even go so far as saying you’ll probably think you’re missing something.

Also, growing the HW has been simple as well. Adding a data shelf or cache shelf has been as simple as a few cables and clicking “activate” in the GUI.

Why do I say mostly? Well, if you care about not wasting cache and about optimizing performance, you do need to adapt your environment a bit. Things like transaction logs vs DB, SQL vs Exchange, they all should have separate volume types. Depending on your SAN, this is either commonplace or completely new. I came from an Equallogic shop, where all you did was carve up volumes. With Nimble you can do that too, but you’re not maximizing your investment, nor would you be maximizing your performance.

Troubleshooting performance can take a bit of storage knowledge in general (can’t fault Nimble for that per se) and also a good understanding of Nimble itself. That being said, I don’t think they do as good of a job as they could in presenting performance data in a way that would make it easier to pin down the problem. From the time I purchased Nimble till now, everything I’ve been requesting is being siloed in this tool they call “InfoSight”, and the important data that you need to troubleshoot performance is in many ways still kept under lock and key by them, or is buried in a CLI. Yeah, you can see IOPS, latency, throughput and cache hits, but you need to do a lot of correlations. For example, they have a line graph showing total read / write IOPS, but they don’t tell you in the line graph whether it was random or sequential. So when you see high latency, you now need to correlate that with the cache hits and throughput to make a guess as to whether the latency was due to a cache miss, or if it was a high queue depth sequential workload. Add to that, you get no view of the CPU, average IO size, or other things that are helpful for troubleshooting performance. Finally, they roll up the performance data so fast that if you’re out to lunch and there was a performance problem, it’s hard to find, because the data is averaged way too quickly.

Reliability:

Besides disk failures (commonplace) we’ve had two controller failures. Probably not super normal, but nonetheless, not a big deal. Nimble failed over seamlessly, and replacing them was super simple.

Customer Support:

I find their claim of having engineers staffing support to be mostly true. By and large, their support is responsive and very knowledgeable, and if they don’t know the answer, they find it out. It’s not always perfect, but certainly better than other vendors I’ve worked with.

Scaling:

I think Nimble scales fantastically so long as you have the budget. At first, when they didn’t have data or cache shelves, I would have said they have some limits, but nowadays, with their scale in any direction approach, it’s hard to say that they can’t adapt to your needs.

That said, there is one area where I’m personally very disappointed in their scaling, which is going up from an older generation to a newer generation of controllers. In our case, running the cs460’s requires a disruptive upgrade to go to the cs500’s or cs700’s. They’ll tell me it’s non-disruptive if I move my volumes to a new pool, but that first assumes I have SAN groups, and second assumes I have the performance and capacity to do that. So I would say this is mostly true, but not always.

Value / Design:

The hard parts of Nimble…

If we just take face value, and compare them based on performance and capacity to their competitors, they’re a great value. If you open up the black box though and start really looking at the HW you’re getting, you start to realize Nimble’s margins are made up in their HW. A few examples…

Using Intel S3500’s (or comparable) with SAS interposers instead of something like an STEC or HGST SAS based SSD.

Supermicro HW instead of something rebranded from Dell or HP. The build quality of Supermicro just doesn’t compare to the others. Again, I’ve had two controller failures in 2 years.

Crappy rail system. I know it’s kind of petty, but honestly they have some of the worst rails I’ve seen next to maybe Dell’s EQL 6550 series. Toolless kits have kind of been a thing for many years now, and it would be nice to see Nimble work on this.

Lack of cable management, seriously, they have nothing…

Other things that bug me about their HW design…

It’s tough to understand how to power off / on certain controllers without looking in the manual. Again, not something you’re going to be doing a lot, but still it could be better. Their indicator lights are also slightly misleading, with a continually blinking amberish orangeish light on their chassis. The color initially suggests that perhaps an issue is occurring.

While I like the convenience of the twin controller chassis, and understand why they and many other vendors use it, I’d really like to see a full sized dual 2U rack mount server chassis. Not because I like wasting space, but because I suspect it would actually allow them to build a faster array. It’s only slightly more work to unrack a full sized server, and the reality is I’d trade that any day for better performance and scalability (more IO slots).

I would like to see a more space conscious JBOD. Given that they oversubscribe the SAS backplane anyway, they might as well do it while saving space. Unlike my controller argument, where more space would equal more performance, they’re offering a configuration that chews up more space, with no other value add, except maybe having front facing HDD’s. I have 60 bay JBODs for backup that fit in 4U. Would love to see that option for Nimble, that would be 4 times the amount of storage in about the same amount of space.

It’s time to talk about the softer side of Nimble….

The web console, to be blunt, is a POS. It’s slow, buggy, unstable, and really, I hate using it. To be fair, I’m bigoted against web consoles in general, but if they’re done well, I can live with them. Is it usable? Sure, but I certainly don’t like living in it. If I had a magic wand, I would actually do away with the web console on the SAN itself and instead produce two things:

A C# client that mimics the architecture of VMware. VMware honestly had the best management architecture I’ve seen (until they shoved the web console down my throat). There really is no need for a web site running on the SAN. The SAN should be locked down to CLI only, with the only web traffic being API calls. Give me a C# client that I can install on my desktop, and that can connect directly to the SAN or to my next idea below. I suspect that Nimble could ultimately display a lot more useful information if this was the case, and things would work much faster.

Give me a central console (like vCenter) to centrally manage my arrays. I get that you want us to use InfoSight, and while it’s gotten better, it’s still not good enough. I’m not saying do away with InfoSight, but let me have a central, local, fast solution for my arrays. Heck, if you still want to do a web console option, this would be the perfect place to run it.

The other area I’m not a fan of right now is their intelligent MPIO. I mean I like it, but I find it’s too restrictive. Being enabled on the entire array or nothing is just too extreme. I’d much rather see it at the volume level.

Finally, while I love the Windows connection manager, it still needs a lot of work.

NCM should be forwards and backwards compatible, at least to some reasonable degree. Right now it’s expected that it matches the SAN’s FW version, and that’s not realistic.

NCM should be able to kick off on demand snaps (in guest) and offer a snapshot browser (meaning show me all snaps of the volume).

If Nimble truly wants to say they can replace my backup with their snapshots, then make accessing the data off them easier. For example, if I have a snap of a DB, I should be able to right click that DB and say “mount a snapshot copy of this DB, with this name”, and the Nimble goes off and runs some sort of workflow to make that happen. Or just let us browse the snap’s data almost like a UNC share.

The backup replacement myth…

Nimble will tell you in some cases that they have a combined backup and primary storage solution. IMO, that’s a load of crap. Just because you take a snapshot, doesn’t mean you’ve backed up the data. Even if you replicate that data, it’s still not counting as a backup. To me, Nimble can say they’ve solved the backup dilemma with their solution when they can do the following:

Replicate your data to more than one location

Replicate your data to tape every day and send it offsite.

Provide an easy straight forward way to restore data out of the snapshots.

Truncate transaction logs after a successful backup.

Provide a way of replicating the data to non-Nimble solution, so the data can be restored anywhere. Or provide something like a “Nimble backup / recovery in the cloud” product.

Continued Innovation:

I find Nimble’s innovation to be on the slow side, but steady, which is a good thing. I’d much rather have a vendor be slow to release something because they’re working on perfecting it. In the time I’ve been a customer, they’ve released the following features post purchase:

Scale out

Scale deep

External flash expansion

Cache Pinning

Virtual Machine IOPS break down per volume

Intelligent MPIO

QOS

RestAPI

RBAC

Refreshed generation of SANS (faster)

Larger and larger cache and disk shelves

It’s not a huge list, but I also know what they’re currently working on, and all I can say is, yeah, they’re pretty darn innovative.

Conclusion and final thoughts:

Nimble is honestly my favorite general purpose array right now. Coming from Equallogic, and having looked at much bigger / badder arrays, I honestly find them to be the best bang for the buck out there. They’re not without faults, but I don’t know an array out there that’s perfect. If you’re worried they’re not “mature enough”, I’ll tell you, you have nothing to fear.

That said, it’s almost 2016, and with flash prices being where they are now, I personally don’t see a very long life for hybrid arrays going forward, at least not as high performance mid-size to enterprise storage arrays. Flash is getting so cheap, it’s practically not worth the savings you get from a hybrid, compared to the guaranteed performance you get from an all flash. Hybrids were really filling a niche until all flash became more attainable, and that attainable day is here IMO. Nimble thankfully has announced that an AFA is in the works, and I think that’s a wise move on their part. If you have the time, I would honestly wait out your next SAN purchase until their AFA’s are out. I suspect they’ll be worth the wait.

44 thoughts on “Review: 2.5 years with Nimble Storage”

Eric,
Great writeup. Currently I have a 3 year old Pure Storage array in my data center and a 9 year old Netapp array in my DR site. If we were to ever have a situation where my DR site had to become my production site I would be in a world of hurt.

We are looking at three different solutions to this situation:
1. Purchase another Pure array for the DR site and replicate between them. Love the Pure box, great performance and a great web GUI. This option might be cost prohibitive.

2. Purchase two Nimble arrays, probably CS300 or CS500. One would replace the Pure in my data center and the other one would replace the Netapp in DR. We would replicate between them. My only hesitation here is performance. My Pure array can put out 120,000 IOPS all the time and read/write latency is always below 2ms and much of the time below 1ms. We have 40 servers (mostly Windows but a few Linux) and 380 VMWare View desktops running on the Pure and it just works. We only run about 15,000 IOPS max and about 260MB/s throughput on the Pure. I worry that the Nimble array will too often drop to disk and have poor performance with a hit on IOPS, latency, and/or throughput.

3. Purchase one Nimble array for our DR site. I would then use either VMWare’s replication or a product like Zerto to replicate between the two different manufacturers arrays. I still get the killer performance of the Pure but have a DR site that is usable if I needed it. The downside is I have two different GUIs/products to work with.

I agree that in the long run nearly all primary data center storage is going to be flash based. This is why we purchased the Pure in the first place.

1. Do you use Pure for all of your prod data, or do you have something else for bulk data (file servers)? It’s hard to imagine a company using an AFA for everything (even with today’s prices) but I could be wrong. If you’re in the process of a potential storage refresh, I would look long term at a vendor strategy. Is Pure going to have a solution that’s cost effective for bulk data in addition to their current high performance niche? If you’re looking at an array that can pretty much do it all, Nimble is a tough vendor to beat. They have announced the release of an AFA this year. What you can draw from that is with Nimble, you’ll have a vendor that now can provide a solution for not only your ultra high performance data needs, but also your more generic needs like file servers, or other less critical data. Now you’re looking at a more complete solution that works together. Same GUI, same replication, an ability to move volumes from AFA to hybrid and vice versa.

2. Depending on your time frame, I would honestly wait it out for Nimble’s AFA release, but if you had to pick now I would say it depends. What drove you to Pure in the first place? If you can afford AFA now, why sacrifice that performance to go back to hybrid (it is a step back)? I think your concerns with hot vs. cold data are very fair, and I can’t say what your environment would be like, but I can say in general, I average less than 5ms latency, and depending on how critical your workload is, I could guarantee you all flash performance with cache pinning. You could basically purchase a Nimble array with a cache shelf. I think at the moment you can stuff up to 32TB of flash in a single array (and that’s growing every year). So that could give you a ton of flexibility with pinning critical volumes to cache, and letting some volumes simply auto manage based on activity. I personally don’t use cache pinning and Nimble has performed fine for us. It’s not perfect, but it’s a balance between disk capacity, performance and cost, and I think Nimble does it very well. To give you an example, we have media files (pictures, videos, etc.) commingling with high performance SQL and Exchange data. Nimble’s disk is cost effective enough for us to do that, and the cache is mostly used by SQL. It’s a great blend.

3. I think my answer in point 2 kind of answers question 1 and 2. Nimble can do not only 125k IOPS (cs700) but when you scale them out, you can go up to 500k IOPS (hybrid). But it doesn’t sound like you’re really IOPS bound, so a cs300 + extra cache vs. a cs500 and less budget for cache would make more sense IMO. The only disadvantage of the cs300 (besides IOPS) is you might be limited in the amount of disk capacity you can add. Then again, it’s only a controller swap away so likely not a big deal.

4. I don’t know how much volume cloning you do, but I can say the one thing I saw that Pure did which I loved was truly independent clones. Nimble’s clones always link back to a parent volume, which could get messy if you depend on them a lot. That’s a big if…

At the end of the day, I’ve looked at Pure, Nimble, Kaminario, SolidFire, Tegile, and Tintri when it comes to AFA. My take away is…

Pure is great, but their usable cost (before dedupe / compression) is way too expensive IMO. They’re not a solution I would want because I would be at the mercy of deduplication, and I have a ton of data that wouldn’t dedupe well.

All I can speak about for Nimble is their hybrid, but if their AFA ends up with similar costs per usable GB (as Pure), it may be a deal breaker for us. Otherwise, I’d much rather have a vendor with a solution that’s great for my performance data and the rest, and Nimble seems on track to deliver that.

Kaminario has great performance, and an even better usable cost per GB. Their biggest problem is they do a lousy job of marketing. I suspect they actually have a really great product, but no one can pronounce their name, let alone knows they exist. They lack a disk tier, but I honestly find that their flash price is so good, I’d probably move a lot more data on to them than I would with Pure, and end up with a lesser array for the other stuff.

Tegile and Tintri are a whole lot of meh. Not even worth broaching.

SolidFire basically just got purchased by NetApp. They were already expensive to begin with, and I can’t see it getting any cheaper now. Either way, they’re a niche play, and they really only make sense for large service providers. Plus, single volume performance is way better with all the other vendors above.

I haven’t looked at EMC XtremIO (very rocky start IMO), but if I had to pick one more vendor to seriously consider, it would be HP 3PAR. Again, a vendor that could offer you a complete solution.

Thanks for getting back to me so quickly. We do have all of our data on the Pure array. Certainly our VDI setup and SQL databases get a huge benefit from the AFA. The decision was made to have one storage vendor for our data center so we would not have to deal with two companies/interfaces. I am guessing we were one of the first 100 or so clients for Pure. Now they are huge, selling a lot of boxes, and doing very well.

I actually looked at XtremIO at that time, this was even before they were purchased by EMC. We also looked at another 6-8 vendors (Texas Memory, Hitachi, EMC, Tegile, Netapp, Kaminario, etc.) but decided on Pure. I am thinking that flash prices will keep falling to the point where even bulk data like files/folders will also be a great candidate for flash. In other words, I think that spinning disk’s days are numbered, we just jumped on the bandwagon early. Our driver was all the research that said the number one determinant of a good VDI experience is storage.

I do like your idea of purchasing the CS300 (plenty of IOPS) and some extra cache to try and better guarantee very high cache hit rates or even pinning some volumes. Our overall dedup/compress rate for the Pure is 3:1. Our VDI volume is much higher than that but our volumes for files/folders is much lower.

Two other things I really like about the Pure. I already mentioned the web GUI, but how they handle volumes is the other one. I am hoping you might give me some insight here. All of our 40 Windows/Linux servers live on a single volume on the Pure. All 380 desktops live on a single volume. Now I do have separate volumes for file shares. For example, for my two file servers, their C: drives live on the same volume that all of my servers reside in, but I have two separate volumes (E: and F:) on each server that live in a file shares volume on the Pure. I do have a separate volume for SQL Databases and Logs. What I am getting at is with a real traditional spinning disk array, it was common to have a large number of volumes. Perhaps 380 VDI desktops would live on 10 separate volumes. This was a management nightmare. The Pure box being all flash does away with all of that, I just have a handful of volumes and the performance is still great.

The question becomes if I move to a Nimble will I need to split some of the single volumes into multiple volumes?

In answer to your question about volume, you honestly should be doing it with anything that’s block based IMO. Perhaps I’m old school in that sense, but even with VAAI offload, there still is some form of metadata locking that’s occurring, and the more VM’s, the more hosts you have sharing a single volume, the worse your potential latency / IOPS are going to be. I was reading some where on VMware forum, that some dude had something like 250 VM’s in a single datastore on an AFA, and they were wondering why their write latency was so high under heavy load. After breaking their single volume up into something like 10 volume, they said there was a crazy drop in the latency and a huge spike in performance. No clue offhand whether the array supported VAAI at the time, or even which array it was. Point being, unless Pure is telling you to keep all your VM’s in a single LUN, I would strongly suggest breaking them up anyway. In the case of pure, where you don’t care about things like block size, and other volume level settings, I would simply suggest using VMwares SDRS (requires Enterprise plus last time I checked). It will balance VM’s across luns for you automatically. So you get the benefit of not needing to micromanage at a lun level, and a better VM to datastore ratio. I have few SDRS pools, full auto, and it works great. Add to that, there’s a queue depth limit of 128 or 256 outstanding IOS IIRC within a single lun, so yet another good reason to break them apart. Honestly, there’s way too many good reasons not to use singular, really large volume. If you honestly want that ability, file protocols like NFS are really your best best. Even there, you’re looking at multiple shares just to balance the network IO. The only 100% singular volume solution I’ve ever heard about, that’s by design is vSAN. So regardless of which direction you go with, I highly suggest re-evaluating your LUN design.

All that said, you absolutely need to micromanage LUNs with Nimble, and at times it’s to the point of frustration. However, it’s for the greater good, and if you compare to some other arrays like Oracle’s ZFS, you can see why. Basically it’s about not caching more than you need. If SQL writes 8k, but your cache is set to 32k, you might be caching 3x more data than you really need to. In turn, if you set it too low (say 4k) then that’s twice as many cache lookups as you’d need to do. I personally like to break my volumes up by IO size and application. For SQL I have a volume for OS, DB, Log, Index, TempDB, TempLog. I like to keep things highly organized. In many ways it’s a bit more work up front, but you get a lot more insight into your storage later on when everything is really granular.
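The 8k vs. 32k vs. 4k trade-off can be sketched in a few lines of Python. This is a deliberately simplified model (whole cache blocks, aligned IO), not a description of how Nimble’s cache actually works internally, and the function name is hypothetical.

```python
import math


def cache_cost(io_size_kb: int, cache_block_kb: int) -> tuple[int, int, int]:
    """Return (lookups, cached_kb, wasted_kb) for one aligned IO.

    Simplified model: the cache stores whole blocks of cache_block_kb,
    so an IO touches ceil(io / block) blocks and caches all of each one.
    """
    lookups = math.ceil(io_size_kb / cache_block_kb)
    cached_kb = lookups * cache_block_kb
    return lookups, cached_kb, cached_kb - io_size_kb


if __name__ == "__main__":
    for block in (4, 8, 32):
        lookups, cached, wasted = cache_cost(8, block)
        print(f"8k IO, {block}k cache block: {lookups} lookup(s), "
              f"{cached}k cached, {wasted}k wasted")
```

For an 8k SQL write, a 32k block caches 24k more than was needed (3x the useful data), while a 4k block wastes nothing but doubles the lookups. Matching the block to the application’s IO size avoids both.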

Hi Eric, can you explain why Tintri is meh to you? I respect your write up, but why are you against Tintri? Given that the majority of data is cold for most general purpose apps, I think Tintri is a great option. It doesn’t scale out at the moment, but a future version might have this feature.

I think Tintri has a few slick features like per VM monitoring / policies, but that’s pretty much where it ends. When I first looked at them, it was NFS only and VMware only. I realize they now support Hyper-V / KVM and SMB, but being a solution that only serves file protocols is just not enterprise enough for me. I don’t have anything against file protocols, I would use them in many cases, but there’s more things that support block protocol than file, which made Tintri a less flexible storage solution. Take a traditional SQL failover cluster, it needs shared storage, and up until very recently, SMB was not a supported option, let alone NFS. Even still, there are a number of caveats around file protocols with SQL DB’s. So again, nothing against file, but block is more enterprise friendly.

They were also missing simple things like replication at the time when I looked at them. Ironically not the only array, which is kind of disturbing, but none the less. Replication is kind of a need in most enterprises. I haven’t looked, but I suspect by now, they have it.

Additionally, I personally hate tiering instead of caching (except for backup and file shares). It’s why I ran away from Compellent. I get that Tintri has dedupe in their cache tier, and I’m sure that does make a larger difference, but at the end of the day, if my data is cold, caching will respond quicker than tiering, and I care more about that, plus I can always make a larger cache. Adding to this, Tintri is not a scalable solution. With Nimble or other arrays, I can add more data storage, more cache, and even upgrade the CPU (to a limited extent). I’ve added both cache and disk to my arrays over the years, so I would contend it’s a real need. Rather than purchase more SAN’s, I incrementally upgraded the ones I had.

I agree with your point that the Tintri appliances don’t have scale-up or scale-out capabilities. That said, Tintri is only for virtualized environments and supports OpenStack, Hyper-V and VMware. So in a scenario where your VMware environment is larger than 100TB of effective capacity (that’s a lot of VMs), you might need to buy another appliance, but for virtualized workloads this isn’t a problem. At the end of the day all you are doing is adding another NFS datastore and more rack space in the data centre. The benefit is that it is just one more NFS datastore. Yes, you are right, if you wanted to have physical servers connected to a Tintri appliance, that is not the right use case, but if you want SQL clustering you can use SQL AlwaysOn technology in a virtualized environment. Tintri also provides an AFA solution, which means you have the option to have something that doesn’t do tiering. So overall, for a virtualized environment, Tintri is a much better option and provides better ROI and TCO, as the overhead of managing traditional SAN based environments like Nimble is massive. Tintri pretty much eliminates storage management. But for people who like to tinker and keep their jobs, this might not be a good thing. I find this type of discussion needs to be had at the line manager / CEO level, because it is a threat to the storage admin teams. Storage admins love to tinker and argue about which SSD architecture is fastest, but unless you are running high frequency trading applications or something similar requiring the lowest latency, you are really wasting company time and money tinkering with SANs, LUNs and VVOLs in a 100% virtualized environment. I highly recommend people look at Tintri. Tintri is just an Intel x86 server, so the software will also be upgradable. Debating NFS vs iSCSI vs VVOLs is really a waste of time; it’s about reducing the overall operational overhead of managing a virtualized environment, and Tintri can eliminate this.

You’re welcome to your views and I love debating, but I don’t see us going back and forth being a productive thing. I will say, its very clear based on the number of customers and arrays that Nimble has in the field, that folks prefer them to Tintri. I know I do, and for the reasons I originally mentioned. I think if Tintri were to make a number of architecture changes they’d probably get a lot more traction. The VM stuff is very cool, but its overshadowed by a lack of flexibility. No different than I passed on Rubrik in favor of CommVault. One vendor can do it all, the other can’t. And the same is true for Nimble vs. Tintri. In an enterprise like mine, I want to keep the number of point solutions to a minimum. Nimble allows me to do that, and Tintri does not.

With regards to storage, when I think of a complete solution, I mean a vendor that can offer you everything you would need with storage. For example, while the trend is leaning towards AFA to run everything, its still not there yet, which means we need multiple storage tiers. When I look at a vendor like Kaminario or PureStorage, they only have one tier, which is flash. With Nimble, you now have a spinning disk tier and an all flash tier, which is a more complete solution from a storage perspective. if you really break it down, there are the following tiers, archive (cheap and deep), regular work loads (decent performance to capacity ratio), high performance (great performance, smaller capacity, and expensive), and now there’s what I’ll just call insane speed (ultra low latency, mega iops / throughput, small, and typically non redundant). I’m not intimately familiar with 3Par, but I know they offer(ed) a ton of configurations consisting of all those tiers above with maybe the exception of the insane speed. Maybe their solution was never “cheap” and deep, but you can basically build an array(s) that serves almost every purpose. Some of these newer players just can’t do that. All that being said, like I said, all flash is getting to a point where I think you’re going to see those two middle tiers merge into one. There will still be a need for cheap and deep, and insane speed, but now the AFA will cover the two most frequently used needs.

I have read that article a bunch of times over the past few years. I am going to revisit this issue of how many VMs per datastore. Not sure how all flash arrays change this equation but my 3 years of experience with a Pure array have led me to believe that things have changed. I did adjust my queue depth per Pure best practice. I have 340 VMWare view sessions running on a single Lun on the Pure. I have 40 Window Server VMs running a single Lun on the same Pure array. The Pure array can do about 150,000 IOPS, I am only pushing about 22,000. The latency is almost always below 2ms and most of the time below 1ms. If I head to a Nimble array I may need to rethink this and have VMs on a few more luns

Curious as to why you passed on Compellent. Looking at moving from EQL PS6000s with ~40TB to a Dell Storage SC4020 and shelves for 96TB. Thanks for a great read. We looked at Nimble but concerned about long term pricing and scale. Compellent’s licensing is a little goofy, but we are starting big enough, I don’t think we will run into it often. We are also going to leverage Pernix with 1TB NVMe cards in each host. Using that now with 480GB SAS SSDs and it works great to smooth out the bumps.

We’re a huge Dell shop, and I personally have a ton of experience with Equallogic (long before they were Dell and while they were Dell). Let’s start with Compellent, and why I passed on them almost 3 years ago, and would probably still do it again today.

– I hate tiering for primary storage, unless it’s going from really expensive SSD to really cheap SSD. It’s one of the reasons I passed on Tintri too. Tiering from flash to disk is less ideal compared to caching. Caching responds much faster to changing working sets, and cache is generally cheaper / larger than a tier, so more of your data fits in said cache. Add to that, Dell always writes to the highest tier first, which means that tier needs to be sized appropriately or you’re not only going to have a lot of read performance issues, but also write performance issues.

– I find that Dell is like hospice for tech companies. Look at EQL, it went completely stagnant post Dell acquisition. Sure there were little things they did, but nothing at the rate of the original EQL. Now let’s rewind to Compellent, what big things have they done lately? Not much IMO. Sure they added flash tiers and dedupe, but really, it’s too little too late IMO. If you want more proof of this trend, look at what little they’ve done with Ocarina. The DR series appliance can’t hold a candle to DataDomain. Even more proof, look at NetVault, Quest, etc. Dell doesn’t innovate, and here’s my opinion as to why: I find that going public is the worst thing a company can do for its customers. Now who knows, Michael may turn things around. Maybe going back to private will whip them back into shape. I also want to be clear with this, I like Dell, they have great customer service and I like their servers / desktops, it’s just everything else that I have an issue with.

– This reason isn’t historical, but it’s why I wouldn’t purchase them now. I’m sure you’ve heard about the EMC deal? I can’t see how Compellent is going to be a long term solution in their portfolio if they have VNX’s, VMAX’s and XtremIO’s at their disposal.

– I read a ton of horror stories about under sized tiers, data taking forever to move up, etc. Again, not to belabor the point, but tiering is something that would have been “the next big thing” like 16 years ago. Today in 2016, not only is it a hack, its the worst hack of the hacks available. I DO see value in tiering for traditional file servers, or backup data, but that’s not what you’re buying this array for.

– No compression or dedupe. Just on principle, a company of this size and with this level of cost should have dedupe or compression, and they had neither.

I also see you went with Pernix, which to me really just shows how weak Compellent is, if you needed to bring in a 3rd party caching solution to get the performance you needed. I’m not a big fan of Pernix either, but I’m not going to dig into that in the comments here (another blog post at some point). That said, I think if you purchased Dell, you left a HUGE opportunity in Nimble on the table.

As for scale and price being your concerns, I think neither should be a worry to you.

– Scale: Nimble now has both hybrids and AFA’s that scale to incredible levels of capacity and IOPS. 500k IOPS for hybrid, and 1.4 Million for AFA (don’t have the specs in front of me, but that’s what I recall). If I had one concern with scale / performance, it would just be making sure I have enough cache in the hybrid arrays, which would be pretty hard not to if that’s what you need. Capacity wise, again, its huge, as in PB’s.

– Price: They can be a tad pricey, but compared to Compellent, not really. I’m sure Dell will drop their pants to seal a deal (that’s what you do when you know your solution is inferior), and they probably will continue to do so. The question you need to ask is do you want the better product or the cheaper one? Its just like Veeam vs. CommVault. Yeah, Veeam is way cheaper, but it can’t hold a candle to CommVault. Anyway, back on topic, I was able to secure a pricing agreement with Nimble. I suspect if you were / are interested, you could probably do the same.

Like I’ve said both at the end of my post and in the comments, in 2016 I can’t say I would really recommend hybrid arrays in general for anything but chum workloads. AFA’s cost per GB / IO is just too good nowadays. That being said, Nimble chews through sequential IO like I do through cake, which is pretty darn fast. The biggest issue Nimble has when it comes to sequential IO, and frankly any disk array will have, is that when you’re running an operation of both heavy write + heavy read, they tend to slow down to about 400MBps – 600MBps. Bear in mind too though, that’s my cs460. If you had, say, a cs700, I suspect that number would be a bit higher. Taking the long way around to your answer, for backups, I haven’t had anything pull the data fast enough to really stress Nimble all that much. For SQL, for example, we do compressed backups. The bottleneck there is SQL, not Nimble. With that, I regularly see 400MBps+ for any one SQL server. In real life work such as a huge table scan, I’ve seen Nimble hit 1.5GBps sustained, and that’s a pretty old VM, meaning lots of opportunity for fragmentation.

Ahh… don’t compare WAFL / ZFS (COW) to Nimble 🙂 Nimble has a sweeping process (think of it as full time defrag) that takes care of the swiss cheese free space issue that your other file systems might leave you helpless with. I’ve never had a sequential IO issue with Nimble related to fragmentation stuff. Now, I’m not promising you you’ll never run into any fragmentation issues, but Nimble does a good job of keeping its file system in tip top shape.

True, but having a system with a good prefetch cache and good sequential IO management (like Nimble) makes most of those issues go away. Sure, no single backup is going to get 2GBps, but you’ll still get very good throughput for each stream. Your bigger issue would be if you had heavy write operations going on while trying to do a bunch of sequential reads, OR if you had a bunch of cache misses.

Again, I think I sort of answered this, but if not, let me say there are a number of factors that will play in.

1. The application that’s doing the backup
2. The size of the IO being performed
3. Is it truly sequential or is it also random.

For example, with SQL I get awesome throughput. I haven’t tested writing to null with no compression, but 400MBps – 500MBps is very doable with little work. I’ve seen Veeam during an active full pull 700MBps – 800MBps for a general server. I’ve also seen Veeam fall on its face and only pull 35MBps, but that has little to do with the SAN and more to do with the data being heavy random IO.

Really if sequential IO is your biggest concern, I’d say you’ll be not just fine, but quite happy with Nimble.

I have been running a Pure array in production for over 3 years. Unfortunately the array in my DR site is a 9 year old Netapp. This led us to look at one of two options. Purchase a second Pure for our DR site to replace the Netapp or purchase two Nimble arrays and replace the Pure and Netapp

We decided to go the Nimble route. One of the clinchers for us was the release of the Nimble all flash array. We are getting the AFA3000 for production and a hybrid CS300 for our DR site. This will give us great read and write performance in production including backup.

I forgot to mention in my last post but using Commvault we are pulling 550GB/hr on our backups with the target being a 10TB Exagrid box. If I did my math right, that is about 150MB/s. I am assuming that I can get similar performance in backup from the Nimble AFA3000 as I am getting from the Pure. I am making the jump to Veeam so there will be another variable thrown into the backup equation. I will not have all of this in place for a few months but I will report back on the backup speeds I get on the new setup
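The conversion checks out. A quick Python sketch of the arithmetic follows. Whether a “GB” here means 1000 MB or 1024 MB is an assumption either way, so both are shown.

```python
def gb_per_hour_to_mb_per_s(gb_per_hour: float, mb_per_gb: int = 1000) -> float:
    """Convert a backup rate in GB/hr to MB/s (3600 seconds per hour)."""
    return gb_per_hour * mb_per_gb / 3600


if __name__ == "__main__":
    print(round(gb_per_hour_to_mb_per_s(550), 1))        # decimal GB
    print(round(gb_per_hour_to_mb_per_s(550, 1024), 1))  # binary GB
```

Either way it lands in the 150 to 160 MB/s range, so “about 150MB/s” is a fair summary.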

Oh no! You’re going to Veeam? Ouch, you’re going to miss CV IMO. I went from CV to Veeam and am now going back to CV. We just purchased a DotHill array for our backup target. I haven’t had a chance to test it yet, but I plan to do a full review after some real world time with it. That being said though, seriously, if you haven’t purchased Veeam yet, I would go read my review of Veeam. I’ve not had a very positive experience with them at all, and CV has always been rock solid with awesome support.

Just read your review. A few comments
1.What you said about support is worrisome. Commvault support has always been great. Very knowledgeable and able to solve problems quickly. Given that, over the past three years I have only called Commvault support once, the product has run well. My hope is Veeam runs as well
2. We have not used tape in nearly a decade. We have two Exagrid boxes that work great
3. We do not use Commvault’s dedup since our target was the Exagrid. Things change a bit with two Nimble arrays. Replication will happen between the arrays using Veeam. Backup will happen from Nimble to Nimble and then a backup copy job to the Exagrid which is being moved offsite to our ISP
4. We do not have any clustered applications
5. We ditched our three Exchange servers last year and jumped to Gmail
6. I have been told that in July or August Nimble is being added as a supported vendor under Veeam

I will let you know how things go with Veeam and Nimble. I did talk to a lot of happy Veeam customers before making the decision to jump. One thing I can say is Commvault pricing has come down a lot over the last year. My guess is price competition from vendors like Veeam forced the issue.

I had high hopes for Veeam, but ultimately it didn’t live up to much of its hype.

1. I think the fact that you only had to call CV once should tell you all you need to know about how solid it is as a product. You probably will not have that same level of stability with Veeam, we didn’t. And when you do need support, best I can say is good luck.
2. I get it, but still, not having tape doesn’t mean CV isn’t a good fit. CV can do everything that Veeam can do, Veeam can’t say the same.
3. Neither do we, in-line compression works fine, and disk is cheap. If replication is your reason for dedupe, I suspect for what you spent on ExaGrid you would have been far better off with CV dedupe + generic storage.
4. Lucky you 🙂
5. Sorry to hear that, hate GMail, but that’s just my opinion.
6. Ha! Hope that’s true in your case. Even still, CV can already do it. That should tell you something again. In fact CV has a TON of supported storage.

In all seriousness, good luck. We made the move to save $$$ on the capex side, and all that we ended up with was an inferior product that required more care and feeding.

I looked at Nexenta a while back for backup storage. I was on a huge ZFS kick back then, and they were the product that kept coming up when it came to ZFS. Here are my opinions

1. They were stupid expensive compared to just buying a solution outright from NetApp, EMC, or other vendors. Once you took their licensing into account + the HW, it was practically a wash. They charge you based on raw storage, not usable.
2. If I were going to do Nexenta, I would not home brew it and instead I would work with someone like RackTop Systems. John is the one guy I spoke to over there, and those guys know their stuff when it comes to Nexenta + ZFS in general.
3. If you’re on a ZFS kick and stuck on that kick I would go Oracle. I know that’s a bad word in most shops but seriously, its not really that expensive to just get an appliance from them and now you have one neck to choke. While its true the ZFS originators aren’t there, I have to imagine you’ll be better off with them over Nexenta. Add to that, the whole solution just has a more polished look and feel. Everything is designed correctly, etc. I almost chose them over storage spaces (wish I would have).
4. In general I find ZFS to be HIGHLY overrated. Don’t get me wrong, its a great FS, but as far as comparing it to CASL, I think Nimble’s FS is better.
4a. ZFS fragments like you wouldn’t believe, and snapshots will just accelerate it. There’s no sweeping process like Nimble’s, so it only gets worse and worse. Before you know it, even sequential IO starts performing like random IO. Add to that, mirroring is a must for decent performance.
4b. ZFS doesn’t rebalance data as you add storage, just like WAFL. So you’ll need to migrate your data every time you add storage. What a PITA that sounds like.
4c. Write IOPS seem to be limited to single queue depth performance when hitting the ZIL. So when you test a flash drive, test it with a single queue depth and that’s what you can expect the ZIL to give you for random IOPS. That sucks to say the least.
4d. Cache is lost every time you fail over, which if you’re really relying on that for performance means you’re going to take a nose dive in performance when it does happen.
4e. ZFS is stupid RAM heavy. I get that it’s used for cache, but it’s also used for metadata too. I think the rule of thumb is something like 1GB of RAM per 1TB of disk. I’m not sure if that’s raw storage or usable storage.
5. Specific to Nexenta, I’ve not read great things about them. For example, I read on one site where a company ended up hiring an ex Solaris / ZFS dev to basically write a hot fix to work around an issue that Nexenta couldn’t figure out. So just think about that if you’re thinking of primary storage.

All that being said, they do seem like they’re working on some pretty cool stuff. so who knows. I guess I would say approach with caution.

This blog post was very helpful to me. I’m a part-time IT guy for my church (10k members). Our pastor wants to do more outreach via the internet which means more video that needs to be captured, edited (collaboratively), stored and archived. I’m no expert in SANs so this article gave me a number of things to research and understand when assessing solutions. Avid and some of the other major providers of solutions geared toward media are going to be cost prohibitive. I was told Nimble may be a good solution. If you have any other suggestions to research or things I should take into consideration I’d be glad to hear it.

It’s tough to say without knowing more of your requirements. Nimble’s a great SAN, but it’s also not the cheapest solution out there. If you’re dealing with media files, it’s likely that things like in-line compression are not going to be of value to you.

Some questions you need to ask yourself, and the church:

1. How much data do you need to store now, and what is your projected growth. Add to that, can any of that data be archived to the cloud instead of being local?
2. What are your performance needs? For you to answer this, if you have any existing systems, polling them with perfmon every 20 seconds for a couple of days would be a good start. Look at the logical disk stats. You want to look for things like IOPS and throughput, queue depth and latency. Output the perfmon to a CSV so you can create a pivot chart to analyze the data in excel.
3. What kind of SLA’s are you expecting to provide for this storage. Meaning, how much down time can you tolerate. Do you need a local redundant copy? If you do need a local copy, how much data are you willing to lose. For example, do you need to replicate the data every 5 minutes, or is once an hour good enough?
4. Do you have any clustering requirements (shared storage)?
5. Will the storage be used for anything other than media?
6. Do you need a vendor that can handle supporting the whole solution, or are you ok with getting some things back online yourself?
7. Would a capex or an opex model work better for you?
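For question 2, if Excel pivot charts aren’t your thing, a short script can summarize the perfmon CSV instead. The following is a minimal Python sketch: it assumes the usual perfmon CSV layout (a timestamp column followed by one column per counter, with blank cells for missed samples), and the host name, counters, and sample values below are made up for illustration.

```python
import csv
import io
import math


def summarize(csv_text: str, needle: str = "LogicalDisk") -> dict:
    """Return avg / 95th percentile / max for every counter column whose
    header contains `needle`. Non-numeric cells are skipped."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = {i: name for i, name in enumerate(header) if needle in name}
    samples = {name: [] for name in cols.values()}
    for row in reader:
        for i, name in cols.items():
            try:
                samples[name].append(float(row[i]))
            except (ValueError, IndexError):
                continue  # blank or missing sample
    summary = {}
    for name, values in samples.items():
        if not values:
            continue
        values.sort()
        # Nearest-rank 95th percentile.
        p95 = values[max(0, math.ceil(0.95 * len(values)) - 1)]
        summary[name] = {"avg": sum(values) / len(values),
                         "p95": p95,
                         "max": values[-1]}
    return summary


# Hypothetical sample data in perfmon's CSV shape (20 second interval).
SAMPLE = r'''"(PDH-CSV 4.0)","\\FS01\LogicalDisk(D:)\Disk Transfers/sec","\\FS01\LogicalDisk(D:)\Avg. Disk sec/Transfer"
"01/01/2016 00:00:00","100","0.002"
"01/01/2016 00:00:20","300","0.004"
"01/01/2016 00:00:40"," ","0.003"
"01/01/2016 00:01:00","200","0.010"
'''

if __name__ == "__main__":
    for counter, stats in summarize(SAMPLE).items():
        print(counter, stats)
```

To use it on a real capture, read the exported file into a string and pass it to `summarize`. The 95th percentile is usually a more useful sizing number than the average or the single worst spike.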

Those are just a few questions you need to figure out. Like I said, I don’t know your requirements. So on one hand you might be better off with a Dell server + some DAS; on the other hand, you might need the features a SAN offers. DAS will be much cheaper, but less feature packed, and more work on your part. It’s also less resilient than a SAN, but that may not be a big deal if you can tolerate disruptions once in a while.

As far as a reasonably priced SAN, Quantum QXS (DotHill) may be a better fit (call West Coast Technologies if you’re interested), but that depends. I would certainly call Nimble and get some quotes if you think you need a SAN, and then compare that to something like a Quantum. I will say Nimble is a better SAN hands down, but it comes at a premium.

Have you reached out to Nimble and spoken with their SE? I can try to answer what you’re asking, but they’re really the best person to contact.

– Replication: Define low protection space? Nimble uses a small block size for differencing, or at least the block size is based on what you configure for the volume. For example, by default, Nimble recommends 4k for generic ESXi / Windows servers and 8k for SQL databases. That means the smallest changed block will be 4k and 8k respectively. Nimble does NOT require reserve space for snapshots, but ultimately you’ll be limited by the total capacity of your array. On top of that, if you have compression enabled, the snapshots are compressed, so if that 4k block compresses to 2k, then you’re only storing 2k for a snapshot (as an example). Dedupe is NOT something I have experience with, so I can’t speak to that, but I suspect that’s yet another space savings measure. Obviously the variables in all of this are your change rate and how many snapshots you’re keeping around. Those variables, however, would be a constraint on any array and are not unique to Nimble.
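As a rough way to reason about snapshot space, here is a back-of-the-envelope Python sketch. It assumes every retained snapshot holds its own set of changed blocks, that compression is uniform, and it ignores dedupe and metadata. The change rate in the example is invented, so plug in your own.

```python
def snapshot_space_gb(changed_blocks_per_snapshot: int,
                      block_size_kb: int,
                      compression_ratio: float = 1.0,
                      snapshots_retained: int = 1) -> float:
    """Rough space consumed by retained snapshots, in GB (1 GB = 1024 * 1024 KB).

    Assumptions: each snapshot keeps `changed_blocks_per_snapshot` unique
    blocks, compression is uniform, and dedupe / metadata are ignored.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    per_snapshot_kb = changed_blocks_per_snapshot * block_size_kb / compression_ratio
    return per_snapshot_kb * snapshots_retained / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical: 1,000,000 changed 4k blocks per snapshot, 2:1 compression,
    # 24 snapshots kept.
    print(round(snapshot_space_gb(1_000_000, 4, 2.0, 24), 2))
```

The point is simply that block size, compression, change rate, and retention multiply together, so halving any one of them halves the space.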

– Will support storage and VMware replication and supports VMware SRM with an RPO of 2 hrs: Nimble replicates VM’s, so I’m not sure if that’s all you’re asking for in the first section. You set up a “protection” policy, which tells Nimble what volumes to snapshot, and whether to have VMware also snapshot the VM’s first. If you have Nimble tell the VM’s to snap, Nimble waits until all VM’s in a given protection policy have completed their snapshots, then it snapshots the LUN, then tells VMware to delete the VM snaps, and then finally replicates wherever you want (one copy only, though). You can choose how many snaps to keep locally and remotely as well. As far as SRM, I know they support it, but that’s all I can speak to, as I’ve never used SRM. That said, your RPO is again going to be based on a number of factors, most of them outside the SAN. Nimble can snapshot as often as every minute (although every 5 to 15 minutes is really the most aggressive they recommend).

– Data at rest encryption: Yes they support this, it was added in NOS 2.3

– File level backup: Nimble isn’t a backup solution, as much as their marketing might try to convey that. They integrate with CommVault and other backup platforms, and if your file server is mounting Nimble raw, instead of via a VMDK/VHDX, then you can pretty easily mount a snapshot, recover what you need and unmount the snap. If you’re asking if they provide native NAS functionality, the answer is no, they’re block only.

Hope that helps, and again, I suggest reaching out to a Nimble SE. If you want a customer’s viewpoint on what they’re saying, feel free to continue asking questions in the comments and I’ll answer them the best I can.

I have spent quite a bit of time on your blog reading about Nimble storage and your point of view on Nimble and the rest of the storage vendors you’ve talked to and researched.

We are currently using NetApp FAS 3400 storage arrays for the production and the DR site. We are a VMware shop and use VMware SRM with NetApp SnapMirror technology. For the most part, the arrays are working as designed. Our storage arrays are 4 years old and going on their 5th year. The support has been okay, but the support contract is super expensive. Therefore, I have invested in another product, Nimble. We purchased the 5000 series: all flash for the primary site and hybrid for the DR site.

As I’ve indicated above, we’re a VMware shop with a 15 node cluster. We are also an MS SQL shop. I have separate LUNs for the OS, SQL DB, Logs, Binaries, TempDB and the backup volume. I went as far as using a different virtual disk controller when I created these volumes at the VM level. Even with separate LUNs and volumes, I still occasionally see queue length and IOPS screaming at me through the vmturbo tool.

This brings me to my question: what would you recommend for the block size when creating these volumes for the SQL HA cluster (AlwaysOn)? The only thing that I’ve done differently is use 64KB for the TempDB volume per my SQL DBA, but everything else is set to the default when creating a new volume in Windows. We are mostly using Windows 2012 R2 for the MS SQL HA cluster and some are Windows 2008 R2 64-bit.

Thank you for your time and I’m looking for your feedback on the volume blocksize as well.

Hi Harry,
Thanks for your post.
To answer your actual question, when you say you’re using a 64k “block size” are you talking about the NTFS allocation unit size or Nimble’s performance policy? I’m pretty sure Nimble’s performance policy maxes out at 32k, so I’ll assume it’s the NTFS cluster size. The cluster size will make zero tangible difference on an AFA from a performance perspective, but it can affect (slightly) hybrid arrays. Ultimately the cluster size is about mitigating fragmentation, enabling you to run larger volumes, and optimizing space utilization. Using a 64k cluster size is fine, it’s not hurting anything, and it’s technically a best practice for ALL SQL files. Using a smaller cluster size than 64k would have a negative impact if you were running off of DAS, but I personally suspect the impact would only be during large sequential actions like table scans and backups. Running a 4k sector size on an AFA (as an example) is probably not a performance killer.

That said, clearly your concern is that vmturbo is telling you something is wrong, or at least indicating that it thinks it found something for you to inspect. The fact that you have a high queue depth or even high IOPS is not necessarily a problem. It’s merely an indication that your SQL server (or some other system) is hitting the disk hard. What you want to focus on isn’t so much the queue depth or IOPS (per se) but rather how well your SQL server (or other system) is responding. For example, if you’re seeing really high queue depth during a SQL job, but your throughput is kicking ass, then that’s not a problem per se, it’s just a symptom of pounding the disk. In fact, most storage (NVMe if you want another example) doesn’t even start to show its throughput potential until your queue depth starts getting into the 16 to 64 range (or higher). If your queue depth was 1 all day long, that would indicate that you probably have a single threaded application, or you’re not stressing the disk. What I ask my DBA’s when we make changes is something like “hey, how long did it take you to run one of your bigger jobs after the move compared to before?” In my case, there was a 30% – 40% reduction in run-time moving from DAS to a SAN that was already pretty active. Ultimately THAT is what you care about.

Is your DBA saying the SAN is slow? Or is your concern simply that VMTurbo is saying there’s an issue? IMO, unless you’re seeing a performance impact, I wouldn’t sweat what VMTurbo is saying. High queue depth isn’t a problem, UNLESS you have high queue depth *and* terrible throughput. To throw some arbitrary numbers out there, if you had a queue depth of 100 and your throughput was 100MBps (on an AFA with 10g links) that’s probably a problem. If your queue depth is 100 and your throughput is more like 800MBps or greater (assuming a mix of read / write) then you’re probably fine. If you want to inspect WHY there is so much queue depth, that’s a question you and your DBA (mostly on them, it’s their server causing it after all) are going to need to work out (presuming you have good throughput).
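One way to see why the same queue depth can be fine or terrible is Little’s Law (outstanding IOs = IOPS x latency), which is a general queueing identity and not anything Nimble or VMTurbo specific. The sketch below applies it in Python to the arbitrary numbers above. The 64KB average IO size is purely an assumption to make the math concrete.

```python
def avg_latency_ms(queue_depth: float, throughput_mb_s: float, io_size_kb: float) -> float:
    """Average per-IO latency implied by Little's Law.

    IOPS = throughput / IO size, and latency = queue depth / IOPS.
    Assumes a steady state and a single average IO size.
    """
    if throughput_mb_s <= 0 or io_size_kb <= 0:
        raise ValueError("throughput and IO size must be positive")
    iops = throughput_mb_s * 1024 / io_size_kb
    return queue_depth / iops * 1000


if __name__ == "__main__":
    # Same queue depth of 100, two different throughputs, assumed 64KB IOs.
    print(avg_latency_ms(100, 100, 64))
    print(avg_latency_ms(100, 800, 64))
```

At 100MBps those 100 outstanding IOs imply roughly 62 ms per IO, which is awful for flash. At 800MBps the same queue depth works out to under 8 ms. Same queue depth, very different experience for the DBA.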

If you’re seeing bad throughput + high queue depth, then the high queue depth is probably a symptom of something messed up in your environment. I presume you went 10G (or FC) with your SAN (I sure hope so if you went all flash)? Do you have jumbo frames enabled (presumes iSCSI)? If so, have you confirmed end to end that you can ping with a jumbo-sized packet (an 8972-byte payload for a 9000 MTU) with the no-fragment parameter enabled? If you don’t have jumbo frames, that would be a simple first step towards squeezing more throughput out of your SAN. Do you have the Nimble Connection Manager installed in VMware (presumes you’re at v6.0 of VMware, or that you’re running Enterprise Plus)? That will optimize the path selection policy for Nimble. If you DON’T have either of those, have you confirmed that you at least enabled round robin on your volumes? Since you’re using an AFA, I would NOT recommend dedupe for SQL at all, except the OS and binaries. If you have the data files in a dedupe policy, I would take them out. Ultimately, if you’re at this stage though, I would highly recommend reaching out to your Nimble SE. I’m flying blind over here, and they can probably track down the issue in no time *if* it’s a situation where you really think it’s the SAN or even something in your environment. Nimble’s SEs rock and they don’t want the SAN getting blamed for poor performance, so they’ll find your issue if you really think it’s the SAN. For all you know, you have the wrong switch for the job (small buffers). The whole path needs to be tier 1 for tier 1 performance. Also, if you have VMware Enterprise Plus, I would recommend enabling SIOC, as that helps with noisy neighbors. Finally, VMware side, make sure the controllers you’re using are PVSCSI; don’t waste your time with LSI SAS unless you’re doing RDM + clustering.
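For the jumbo frame check, the one gotcha is that the ping size is the payload, not the MTU: you subtract the 20-byte IPv4 header and the 8-byte ICMP header. Here is a small Python sketch that does the subtraction and builds the usual don't-fragment ping commands. The helper names and the target address (192.0.2.10, a documentation-only IP) are made up for the example, so swap in your array's data IP.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def ping_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def jumbo_ping_commands(target: str, mtu: int = 9000) -> dict:
    """Build don't-fragment ping commands for common platforms."""
    size = ping_payload_for_mtu(mtu)
    return {
        "linux": f"ping -M do -s {size} {target}",     # -M do sets don't-fragment
        "windows": f"ping -f -l {size} {target}",      # -f sets don't-fragment
        "esxi": f"vmkping -d -s {size} {target}",      # -d sets don't-fragment
    }


# 192.0.2.10 is a placeholder address for illustration only.
for platform, command in jumbo_ping_commands("192.0.2.10").items():
    print(f"{platform}: {command}")
```

If the ping fails at that size but works at 1472 bytes, something in the path (vSwitch, vmkernel port, physical switch or array port) is probably still at a 1500 MTU.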

BTW, I saw your “backup” volume. Can we both safely assume you’re not backing up locally? Please say yes… that would be very bad from a backup / recovery plan, a huge waste of AFA capacity, and ultimately beating up your SAN twice.

Hi Eric, hope you’re doing well! I enjoyed reading your blog about SAN storage, and Nimble Storage in particular.

My company had been using IBM SAN Storage DS5000 Series for years, and when they decided to upgrade the SAN storage we went with NetApp FAS 5000’s series. The NetApp has been great now for 5 years. We also purchased another cheaper version of NetApp (E-Series) for our D2D backup target. Again, it has been really nice compared to the tape-to-tape solution we had for years.

My point is that it has been 5 years on the NetApp FAS 3240’s and now we’ve decided it’s time to buy another set of SAN storage arrays. We looked at NetApp, Tegile, and Nimble. We decided to go with Nimble for the support and the all-up-front cost on the software etc. Also we were sold on the features, with all-flash arrays for production and hybrid for the DR solution. We are getting the storage arrays sometime next week. Now, the reseller that we went through doesn’t tell us how they are going to migrate the data from the NetApp storage arrays to Nimble. I asked the local Nimble tech, and they told me to ask the reseller that we’re buying the H/W from. Is there migration software that can migrate data off the NetApp arrays to the Nimble arrays? I was not involved, but the NetApp engineer migrated the data off our IBM arrays onto the NetApp once the arrays were all zoned through the fabric mesh.

What were you using before you went to Nimble, and how did you migrate the data off your previous arrays to the Nimble arrays? We are a VMware shop, so I was thinking of doing Storage vMotion once the Nimble LUNs (datastores) are presented to the VMware cluster, I guess.

Thank you in advance, and I hope you continue with your blog, as I really enjoy reading it and I think I can speak for all those who are reading your blog.

If your data is all in VMware + vDisks (not RDMs) I would totally just svmotion it. You pay good money for VMware, and it’s one of the many things they offer to make life easier. The only thing I would caution you on is moving clusters (Exchange DAGs, SQL AAG, or similar). It’s not so much that I wouldn’t svmotion them, but I would do the passive node first, then fail over, and move the final node.

If you’re dealing with physical servers or VM’s with RDMs/direct connect storage, it’s a little more involved, but I could provide a few pointers if you need that.

When I went Nimble, we were 100% DAS and mostly physical. We virtualized our environment + implemented a new SAN + implemented 10g networking all in one shot. Lots and lots of p2v’s. But moving to new SANs, like I said, storage vmotion for the win.

Thanks for responding to my questions.
Our Exchange cluster is on physical servers using DAS, so I don’t have to worry about it. The virtualized MSSQL servers are HA clustered, 2 nodes at HQ and 1 node at the DR site. The professional group that we had engaged to set up the Nimble arrays recommended that we put all of our 115 VMs into 4 large datastores. Personally, I don’t think that’s correct. I know these are all-flash arrays, but I still want to keep each SQL cluster on its own datastore, down to separate datastores for each volume. This is how I have set things up on our current storage arrays. Our current storage arrays do not have any SSDs, mostly SAS and SATA. Am I too anal about this, or is it that they surely know storage, but not data and VMware?

I presume your professional services group is really just trying to make your life easier, but I’m not sure, without some extra tweaks, that I would recommend that many VMs in a single datastore. I would suggest reaching out to your Nimble SE (bypass the professional services if it’s through a VAR) and asking them specifically what settings they tune and optimize. I actually don’t know this. The “theory” with spreading the VMs out is to avoid metadata locking, but it’s also to increase the queue depth via scaling out VMs across multiple LUNs. I suspect if NCM increases the default queue depth per volume that you may not actually see a negative performance impact by stacking a lot of VMs. However, if NCM doesn’t do this for you, then it ends up being something you have to remember each time you provision a volume (presumes you don’t automate the process).
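The queue depth part of that theory is simple arithmetic: each LUN gets its own queue on the host, so the ceiling on outstanding I/Os is roughly the per-LUN queue depth times the number of LUNs. A rough Python sketch is below. The per-LUN depth of 32 is just an illustrative value (the real one depends on your HBA/driver and whatever NCM or your SE tunes), and it deliberately ignores multiple hosts, multipathing and array-side limits.

```python
def max_outstanding_ios(datastores: int, per_lun_queue_depth: int) -> int:
    """Upper bound on in-flight I/Os one host can issue across all datastores."""
    return datastores * per_lun_queue_depth


def queue_slots_per_vm(vms: int, datastores: int, per_lun_queue_depth: int) -> float:
    """Average queue slots available per VM if VMs are spread evenly."""
    return max_outstanding_ios(datastores, per_lun_queue_depth) / vms


ASSUMED_DEPTH = 32  # illustrative per-LUN queue depth, not a Nimble or VMware figure

for datastore_count in (4, 16):
    slots = queue_slots_per_vm(115, datastore_count, ASSUMED_DEPTH)
    print(f"115 VMs on {datastore_count} datastores: "
          f"{max_outstanding_ios(datastore_count, ASSUMED_DEPTH)} slots total, "
          f"{slots:.2f} per VM")
```

Raising the per-volume queue depth gets you to the same total with fewer datastores, which is why it matters whether NCM does that for you.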

That said, there are other good reasons to use different volumes:
• Nimble should be recommending that you first segregate your data based on its performance policy. SQL logs should be on a different LUN compared to SQL DB. Generic file servers and VMs should be on a different volume than the above. Maybe there are VMs you want to pin to cache and maybe there are VMs (or disks, more specifically) where you want to disable compression.
• Your replication and snapshot policies should also determine how you carve up your volumes. Maybe some VMs you don’t want to snapshot and some you do.
• Volumes CAN get corrupted. It doesn’t happen often, but it can happen (I’ve got one right now). When you have a ton of data in a really large volume, it means you need an equal amount of free space to vacate that volume. Vacating a bunch of 2TB disks is a lot easier than vacating a 20TB volume.
o To be fair, unmap can help with this issue, but I still contend it’s easier to vacate with a larger number of smaller volumes.
• When your SAN is getting pounded, it’s easier to figure out the offending VM. Not that you can’t do it on larger volumes, but it gets really noisy in the VMware performance overview charts. Nimble does have a good tool to help with this, but it’s more long term trends IIRC. It’s been a while since I’ve looked at my VM stats.
• SDRS (if you have Enterprise Plus) makes managing volume space a breeze. I use it to manage about 50TB or so of VMs (before compression). I have it auto move all day long, no issues. Add to that, I don’t even have to pick which SAN I want (I have 5 SANs to manage).
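To make the first two bullets concrete, here is a tiny Python sketch of how I’d think about carving volumes: group disks by the combination of performance policy and protection (snapshot / replication) schedule, and give each combination its own volume. Every name in it (the VM disks, the policy labels, the schedules) is made up for illustration; use whatever policies your Nimble SE recommends.

```python
from collections import defaultdict


def plan_volumes(disks):
    """Group disks into volumes keyed by (performance policy, protection schedule)."""
    volumes = defaultdict(list)
    for name, perf_policy, protection in disks:
        volumes[(perf_policy, protection)].append(name)
    return dict(volumes)


# Hypothetical inventory: (disk, performance policy, protection schedule)
example = [
    ("sql01-data", "sql-db", "hourly-replicated"),
    ("sql01-logs", "sql-logs", "hourly-replicated"),
    ("file01", "general-vm", "daily"),
    ("web01", "general-vm", "daily"),
    ("scratch01", "general-vm", "none"),
]

for (perf_policy, protection), members in plan_volumes(example).items():
    print(f"{perf_policy} / {protection}: {members}")
```

From there you can split any group that gets too big to vacate comfortably into several smaller volumes.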

vVOLs are going to change some of this though, and with vVOLs you likely would make one big volume (I forget the proper term) per performance policy per SAN. Honestly if you’re starting with a new SAN + you’re on vSphere 6 or greater, I would really suggest looking into vVOLs. You may find that you want to use that for a good majority of your data. I haven’t used it yet, but I have a feeling it’s going to be a big win.

Ultimately though, big volumes can make your life easier. I guess it’s more of a call that you’ll need to make based on how granular you want to get. A balance would be ideal.