Cloud capable (the storage OS can run in the cloud) – with replication to and from cloud instances of ONTAP

QoS and SLO provisioning

Inline Foreign LUN Import (to make migrations from third-party arrays less disruptive to users)

The point is that competitor All-Flash offerings typically focus on a few of these points (for example, performance, ease of use and maybe inline storage efficiencies) but are severely lacking in the rest. This limited-vision approach creates inflexibility and silos, and neither inflexibility nor silos are particularly desirable elements in Enterprise IT.

But what if you could have Flash without compromises? That’s what we are offering with the new AFF systems. A high performance offering with all the “coolness” but also all the “seriousness” and features Enterprise Grade storage demands.

It’s a bit like this Venn diagram:

AFF simply offers far more flexibility than the All-Flash competitors. There may be a few things some competitors do that AFF doesn’t, but those pale next to what AFF does and the competitors cannot.

And even if you don’t need all the features – at least they’re there waiting for you in case you do need them in the future (for instance, you may not need to replicate to the cloud and back today, but knowing that you have the option is reassuring in case your IT strategy changes).

Architecture

It’s far harder to add serious enterprise data management features to newly built architectures than it is to add new architecture benefits to a platform that already has the Enterprise Grade stuff down pat.

WAFL (the block layout engine of ONTAP) is already naturally well suited to working with SSDs:

Avoids modifying data in place

Writes to free space

Performs I/O coalescing in order to lump many operations into a single large I/O (a toy sketch after this list illustrates the effect)

Preserves the temporal locality of user data with metadata to further reduce I/O

Achieves a naturally low Write Amplification Factor (it’s worth noting that in all the years we’ve been selling Flash and after hundreds of PB, we have had exactly zero worn out SSDs – they’re not even close to wearing out).
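To make the I/O coalescing point a bit more concrete, here is a deliberately simplified toy model (plain Python, purely illustrative – not ONTAP code, and the 64-block flush size is an arbitrary assumption) showing why batching many small random writes into large writes to free space means far fewer device-level I/Os, which in turn is one ingredient of a low Write Amplification Factor:

```python
# Toy model only -- NOT how WAFL is implemented. It just illustrates why
# coalescing many small random writes into large writes to free space
# results in far fewer, far friendlier device-level I/Os.

STRIPE_BLOCKS = 64            # hypothetical flush size: 64 x 4KB = 256KB

def device_ios_in_place(num_logical_writes: int) -> int:
    """Write-in-place: every logical 4KB write becomes one small device I/O."""
    return num_logical_writes

def device_ios_coalesced(num_logical_writes: int) -> int:
    """Buffer logical writes, then flush them to free space as large stripes."""
    full, rest = divmod(num_logical_writes, STRIPE_BLOCKS)
    return full + (1 if rest else 0)

n = 100_000                   # e.g. 100k random 4KB logical writes
print("in-place device I/Os: ", device_ios_in_place(n))    # 100000
print("coalesced device I/Os:", device_ios_coalesced(n))   # 1563
```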

So we optimized ONTAP where it mattered, while keeping the existing codebase where needed. It helps that the ONTAP architecture is already modular – it was fairly straightforward to enhance the parts of the code that dealt with storage media, for instance, or the parts that dealt with the I/O path. This significant optimization started with 8.3.0 and continued with 8.3.1.

The overall effect has been dramatically reduced latencies, enhanced code parallelism and increased resiliency, while at the same time enabling inline storage efficiencies. For customers still on 8.2.x and prior, or on 7-mode ONTAP, the differences with 8.3.1 will be pretty extreme…

Ease of Use

New with ONTAP 8.3.1 is a completely redesigned administrative GUI, plus a SAN-optimized configuration that gets the system from unpacking to serving I/O in 15 minutes.

In addition, wizards allow the easy creation of LUNs for databases by answering just 3 questions, and ONTAP can now be upgraded from the GUI.

Storage Efficiencies

ONTAP has had various flavors of storage efficiencies for a while. Those efficiencies typically had to be turned on manually, and often affected performance.

With ONTAP 8.3.1, Inline Compression is on by default, as is Inline Zero Deduplication (very helpful in VM deployments that use Eager Zeroed Thick disks). In addition, Always On Deduplication is also available – a deduplication process that runs very frequently (every 5 minutes).

In conjunction with already excellent thin provisioning plus state-of-the-art cloning and snapshot capabilities, some excellent efficiency ratios are possible. We can show up to 30:1 for certain kinds of VDI deployments, while things like databases cannot be squeezed nearly as much (especially if DB-side compression is already active). The overall efficiency ratio will vary depending on how the system is used.

Ultimately, Storage Efficiencies aim to reduce overall cost. Focus not so much on the actual efficiency ratio; instead, look at the effective price/TB. The efficiency differences between most vendors are probably smaller than most vendors want you to think they are – in real terms you will not save more than a few SSDs’ worth of capacity. The real value lies elsewhere.
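To make “effective price/TB” concrete with purely hypothetical numbers: a $500K system expected to deliver 100TB of effective capacity works out to $5,000 per effective TB – and that figure, not the ratio that produced it, is the one worth comparing across vendors.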

Performance

The AFF systems are fast. A maximum-size AFF cluster will do about 4 million IOPS at 1ms latency (8K random, with inline efficiencies). Throughput-wise, 100GB/s is possible.

We wanted to have performance stop being a discussion point except in the most extreme of situations. The reality is that any top-tier Flash solution will be fast. Once again – the real value lies elsewhere.

For the curious, here’s an IOPS vs latency chart for a 2-node AFF8080 system – this level of performance is more than enough for the vast majority of customers out there (the max speed of a full cluster is 12x what’s shown in this graph):

Many optimizations were implemented in ONTAP 8.3.1 – certain operations are over 4x faster than with ONTAP 8.2.x, and almost 50% faster than with 8.3.0. SSDs, being as fast as they are, benefit from those optimizations the most.

If you want more performance proof points, we have previously shown SSD performance for a very difficult workload (over 60% writes, combination of block sizes, random and sequential I/O) with 8.3.0 in our SPC-1 benchmarks (needs to be updated for 8.3.1 but even the 8.3.0 result is solid). We also have SQL, Oracle and VDI Technical Reports.

We are also always happy to demonstrate these systems for you.

Pricing

There are some significant pricing changes, leading to an overall far more cost-effective solution. For instance, the cost delta between different controller models with the same amount of storage is far smaller than in the past. The scalability of all the systems is the same (240 SSDs * 1.6TB max currently per 2 nodes). The only differences are performance and the amount of connectivity possible.

Warranty pricing is now stable even if you buy 3 years up front and extend later on.

Oh – and all AFF models now include all NetApp FAS software: All the protocols, all the SnapManager application integration modules, replication, the works.

Final Words

The pace of innovation at NetApp is accelerating dramatically. We had to work hard to bring Clustered ONTAP to feature parity with the older 7-mode, which delayed things. Now with 8.3.x dropping 7-mode altogether, we have many more developers to focus on improving Clustered ONTAP. The big enhancements in 8.3.1 came very rapidly after 8.3.0 became GA… and there’s a lot more to come soon.

Make no mistake: NetApp is a storage giant and has an Engineering organization not to be trifled with.

In this post I will try to help you understand how to objectively calculate the cost of space-efficient storage solutions – there’s just too much misinformation out there and it’s getting irritating since certain vendors aren’t exactly honest with how they do certain calculations…

A brief history lesson:

The faster a storage device, the smaller and more expensive it usually is. Flash was initially insanely expensive relative to spinning disk, so it was used in small amounts, typically as a tier and/or cache augmentation.

And so it came to be that flash-based storage systems started implementing some of the more interesting space efficiency techniques around. Interesting because it’s algorithmically easy to reduce data dramatically, but hard to do under high load while maintaining impressive IOPS and low latency.

But how does one figure out the best deal?

There are some factors I won’t get into in this article. Company size and viability, support staff strength, maturity of the code, automation, overall features etc. all may play a huge role depending on the environment and requirements (and, indeed, will often eliminate several of the players from further consideration). However, I want to focus on the basics.

Recommended metric: Cost per effective TB

It’s easy to get lost in the hype. One company says they reduce by 3:1, another might say 5:1, yet another 10:1, etc. The higher efficiency ratios certainly seem attractive, right?

Well – you’re not paying for a high efficiency ratio. What you are paying for is usable capacity.

If all solutions cost the same, the systems with high efficiency ratios would win this battle every day of the week and twice on Sundays.

However, solutions don’t all cost the same. Ask your vendor what the projected effective capacity will be for each specific configuration, and the Cost/Effective TB is a trivial calculation.
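As a purely hypothetical illustration: System A claims 4:1, costs $800K and has 100TB usable, so 400TB effective at $2,000 per effective TB. System B claims only 3:1, costs $450K for the same 100TB usable, so 300TB effective at $1,500 per effective TB. Despite the lower ratio, System B is the better deal per effective TB.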

But there’s one more thing to do in order for the calculation to be correct:

Insist on calculating the efficiency ratio yourself.

Most storage systems will show a nice picture in the GUI with an overall efficiency ratio. Looks nice and easy. Well – the devil is in the details.

If a vendor is upfront about how they measure efficiency, your numbers might make sense.

This is where you trust but verify. Some pointers:

Take note of the initial usable space before putting anything on the system.

If you store a 1TB DB and do nothing else to the data, what’s the efficiency?

Does the number make sense given the size of the data you just put on the system and how much usable space is left now?

If you take 10 snapshots of the data, what’s the efficiency? How about if you delete the snaps, does the efficiency change?

If you take a clone of the DB, what’s the efficiency?

If you delete the clone you just took, what’s the efficiency?

Create a large LUN (10TB for example) and only store 1TB of data in it. What’s the efficiency? Do you count thin provisioning as data reduction?

Does this all add up if you do the math manually instead of the GUI doing it for you?

Does it all meet your expectations? For example, if a vendor is claiming 5:1 reduction, can you actually store 5 different DBs in the space of one? Or do they really mean something else? That’s a pretty easy test…

You see, most vendors count savings a bit differently. In the examples above, that 1TB DB, if stored in a 10TB LUN, and cloned 10 times, will probably result in a very high efficiency number. It doesn’t mean however that 10 different DBs of the same size would have nearly the same efficiency ratio.

If you don’t have time to do a test in-house, have the vendor prove their claims and show how they do their math in their labs while you watch. You will typically find that each data type has a wildly different space efficiency ratio.
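If you want to keep yourself honest while running through the pointers above, a back-of-the-envelope check like the following is all it takes (a minimal Python sketch with made-up numbers – substitute your own readings from the array before and after each step):

```python
# Hypothetical numbers purely for illustration -- replace with your own
# before/after readings taken directly from the array, not from a GUI summary.

def efficiency_ratio(logical_data_tb, usable_before_tb, usable_after_tb):
    """Ratio of the data you actually stored vs the usable space it consumed."""
    consumed_tb = usable_before_tb - usable_after_tb
    if consumed_tb <= 0:
        raise ValueError("usable space did not decrease; re-check the readings")
    return logical_data_tb / consumed_tb

# Example: a 1TB database was loaded and usable space dropped from 50.0TB
# to 49.6TB, i.e. the data really consumed 0.4TB on the array.
ratio = efficiency_ratio(logical_data_tb=1.0,
                         usable_before_tb=50.0,
                         usable_after_tb=49.6)
print(f"measured efficiency ratio: {ratio:.1f}:1")   # 2.5:1

# A GUI that also counts thin provisioning, snapshots or clones of the same
# data as "savings" will happily report a much higher number than this.
```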

The bottom line

It’s pretty easy. Figure out the efficiency ratio on your own based on how you expect to use the system, then plug that ratio into the Price/Effective TB formula like so:
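In plain text, that formula is simply:

Price per Effective TB = Total System Price ÷ Effective Capacity, where Effective Capacity = Usable Capacity × the efficiency ratio you measured yourself.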

And, finally, a word on capacity guarantees:

Some vendors will guarantee capacity efficiencies. Always, always demand to see the fine print. If a vendor insists they will guarantee x:1 efficiency, have them sign an official legally binding agreement that has the backing of the vendor’s HQ (and isn’t some desperate local sales office ploy that might not be worth the paper it’s printed on).

Insist the guarantee states you will get that claimed efficiency no matter what you’re storing on the box.

It’s been a while since our last SPC-1 benchmark submission with high-end systems in 2012. Since then we launched all new systems, and went from ONTAP 8.1 to ONTAP 8.3, big jumps in both hardware and software.

In 2012 we posted an SPC-1 result with a 6-node FAS6240 cluster – not our biggest system at the time but we felt it was more representative of a realistic solution and used a hybrid configuration (spinning disks boosted by flash caching technology). It still got the best overall balance of low latency (Average Response Time or ART in SPC-1 parlance, to be used from now on), high SPC-1 IOPS, price, scalability, data resiliency and functionality compared to all other spinning disk systems at the time.

Today (April 22, 2015) we published SPC-1 results with an 8-node all-flash high-end FAS8080 cluster to illustrate the performance of the largest current NetApp FAS systems in this industry-standard benchmark.

It ranks #5 in the Top Ten SPC-1 performance list as submitted, and #3 if you look at performance at load points around 1ms Average Response Time (ART).

The NetApp system uses RAID-DP, similar to RAID-6, whereas the other entries use RAID-10 (typically, RAID-6 is considered slower than RAID-10).

In addition, the FAS8080 shows the best storage efficiency, by far, of any Top Ten SPC-1 submission (and without using compression or deduplication).

The FAS8080 offers far more functionality than any other system in the list.

We also recently posted results with the NetApp EF560 – the other major hardware platform NetApp offers. See my post here and the official results here. Different value proposition for that platform – fewer features but very low ART and great cost effectiveness are the key themes for the EF560.

In this post I want to explain the current Clustered Data ONTAP results and why they are important.

Flash performance without compromise

Solid state storage technologies are becoming increasingly popular.

The challenge with flash offerings from most vendors is that customers typically either have to give up a lot in order to get the high performance of flash, or have to combine 4-5 different products into a complex “solution” in order to satisfy different requirements.

For instance, dedicated all-flash offerings may not be able to natively replicate to less expensive, spinning-drive solutions.

Or, a flash system may offer high performance but not the functionality, scalability, reliability and data integrity of more mature solutions.

But what if you could have it all? Performance and reliability and functionality and scalability and maturity? That’s exactly what Clustered Data ONTAP 8.3 provides.

Here are some Clustered Data ONTAP 8.3 running on FAS8080 highlights:

All the NetApp signature ultra-tight application integration and automation for replication, Snapshots and clones

Over 460TB (yes, terabytes) of usable cache after all overheads are accounted for (and without accounting for cache amplification through deduplication and clones) in an 8-node cluster. Makes competitor maximum cache amounts seem like rounding errors – indeed, the actual figure might be 465TB or more, but it’s OK… (and 3x that number in a 24-node cluster – over 1.3PB of cache!)

The ability to virtualize other storage arrays behind it

The ability to have a cluster with dissimilar size and type nodes – no need to keep all engines the same (unlike monolithic offerings). Why pay the same for all nodes when some nodes may not need all the performance? Why be forced to keep all nodes in the same hardware family? What if you don’t want to buy all at once? Maybe you want to upgrade part of the cluster with a newer-gen system?

The ability to evacuate part of a cluster and build that part as a different cluster elsewhere

The ability to have multiple disk types in a cluster and, indeed, dedicate nodes to functions (for instance, have a few nodes all-flash, some nodes with flash-accelerated SAS and a couple with very dense yet flash-accelerated NL-SAS, with full online data mobility between nodes)

That last bullet deserves a picture:

“SVM” stands for Storage Virtual Machine – it means a logical storage partition that can span one or more cluster nodes and have parts of the underlying capacity (performance and space) available to it, with its own users, capacity and performance limits etc.

In essence, Clustered Data ONTAP offers the best combination of performance, scalability, reliability, maturity and features of any storage system extant as of this writing. Indeed – look at some of the capabilities like maximum cache and number of LUNs. This is designed to be the cornerstone of a datacenter.

It makes most other systems seem like toys in comparison…

FUD buster

Another reason we wanted to show this result was FUD from competitors struggling to find an angle to fight NetApp. It goes a bit like this: “NetApp FAS systems aren’t real SAN, it’s all simulated and performance will be slow!”

Right…

Well – for a “simulated” SAN (whatever that means), the performance is pretty amazing given the level of protection used (RAID6-equivalent – far more resilient and capacity-efficient for large pooled deployments than the RAID10 the other submissions use) and all the insane scalability, reliability and functionality on tap.

Another piece of FUD has been that ONTAP isn’t “flash-optimized” since it’s a very mature storage OS and wasn’t written “from the ground up for flash”. We’ll let the numbers speak for themselves. It’s worth noting that we have been incorporating a lot of flash-related innovations into FAS systems well before any other competitor did so, something conveniently ignored by the FUD-mongers. In addition, ONTAP 8.3 has a plethora of flash optimizations and path length improvements that helped with the excellent response time results. And lots more is coming.

The final piece of FUD we made sure was addressed was system fullness – last time we ran the test we didn’t fill up as much as we could have, which prompted the FUD-mongers to say that FAS systems need gigantic amounts of free space to perform. Let’s see what they’ll come up with this time 😉

On to the numbers!

As a refresher, you may want to read past SPC-1 posts here and here, and my performance primer here.

Important note: SPC-1 is a 100% block-based benchmark with its own I/O blend and, as such, the results from any vendor SPC-1 submission should not be compared to marketing IOPS numbers of all reads or metadata-heavy NAS benchmarks like SPEC SFS (which are far easier on systems than the 60% write blend of the SPC-1 workload). Indeed, the tested configuration might perform in the millions of “marketing” IOPS – but that’s decidedly not the point of this benchmark.

The SPC-1 Result links if you want the detail are here (summary) and here (full disclosure). In addition, here’s the link to the “Top 10 Performance” systems page so you can compare other submissions that are in the upper performance echelon (unfortunately, SPC-1 results are normally just alphabetically listed, making it time-consuming to compare systems unless you’re looking at the already sorted Top 10 list).

I recommend you look beyond the initial table in each submission showing the performance and $/SPC-1 IOPS and at least go to the price table to see the detail. The submissions calculate $/SPC-1 IOPS based on submitted price but not all vendors use discounted pricing. You may want to do your own price/performance calculations.
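For reference, the headline $/SPC-1 IOPS figure is simply the total submitted price divided by the SPC-1 IOPS of the result; if you want to normalize a discounted submission back to list price, divide the submitted price by one minus the discount first – a quick adjustment the reports won’t do for you.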

The things to look for in SPC-1 submissions

Typically you’re looking for the following things to make sense of an SPC-1 submission:

ART vs IOPS – many submissions will show high IOPS at huge ART, which would be rather useless when it comes to Flash storage

Sustainability – was performance even or are there constant huge spikes?

RAID level – most submissions use RAID10 for speed, what would happen with RAID6?

Application Utilization. This one is important yet glossed over. It signifies how much capacity the benchmark consumed vs the overall raw capacity of the system, before RAID, spares etc.

Let’s go over these one by one.

ART vs IOPS

Our ART was 1.23ms at 685,281.71 SPC-1 IOPS, and pretty flat over time during the test:

Sustainability

The SPC-1 rules state the minimum runtime should be 8 hours. We ran the test for 18 hours to observe if there would be variation in the performance. There was no significant variation:

RAID level

RAID-DP was used for all FAS8080EX testing. This is mathematically analogous in protection to RAID-6. Given that these systems are typically deployed in very large pooled configurations, we elected long ago to not recommend single parity RAID since it’s simply not safe enough. RAID-10 is fast and fine for smaller capacity SSD systems but, at scale, it gets too expensive for anything but a lab queen (a system that nobody in their right mind will ever buy but which benchmarks well).

Application Utilization

Our Application Utilization was a very high 61.92% – unheard of among the other vendors posting SPC-1 results, since they use RAID10 which, by definition, wastes half the capacity (plus spares and other overheads to worry about on top of that).

Some vendors using RAID10 will fill up the resulting space after RAID, spares etc. to a very high degree, and call out the “Protected Application Utilization” as being the key thing to focus on.

This could not be further from the truth – Application Utilization is the only metric that really shows how much of the total possible raw capacity the benchmark actually used, and it signifies how space-efficient the storage was.
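Expressed as a simple formula (using the column names from the chart key further down): Application Utilization = Total ASU Capacity ÷ Physical Storage Capacity.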

Otherwise, someone could do quadruple mirroring of 100TB, fill up the resulting 25TB to 100%, and call that 100% efficient… when in fact it only consumed 25% of the raw capacity.

It is important to note there was no compression or deduplication enabled by any vendor since it is not allowed by the current version of the benchmark.

Compared to other vendors

I wanted to show a comparison between the Top Ten Performance results both in absolute terms and also normalized around 1ms ART.

Here are the Top Ten highest performing systems as of April 22, 2015, with vendor results links if you want to look at things in detail:

FYI, the HP XP 9500 and the Hitachi system above it in the list are the exact same system – HP resells the HDS array as their high-end offering.

I will show columns that explain the results of each vendor around 1ms. Why 1ms and not more or less? Because in the Top Ten SPC-1 performance list, most results show fairly low ART, but some have very high ART, and it’s useful to show performance at that lower ART load point, which is becoming the ART standard for All-Flash systems. 1ms seems to be a good point for multi-function SSD systems (vs simpler, smaller but more speed-optimized architectures like the NetApp EF560).

The way you determine the 1ms ART load point is by looking at the table that shows ART vs SPC-1 IOPS. Let’s pick IBM’s 780 since it has a very interesting curve, so you can learn what to look for.

IBM’s submitted SPC-1 IOPS are high but at a huge ART number for an all-SSD solution (18.90ms). Not very useful for customers picking an all-SSD system. Even the next load point, with an average ART of 6.41ms, is high for an all-flash solution.

To more accurately compare this to the rest of the vendors with decent ART, you need to look at the table to find the closest load point around 1ms (which, in this case, is the 10% load point at 0.71ms – the next one up is much higher at 2.65ms).

You can do a similar exercise for the rest, it’s worth a look – I don’t want to paste all these tables and graphs since this post will get too big. But it’s interesting to see how SPC-1 IOPS vs ART are related and translate that to your business requirements for application latency.
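For anyone who would rather script that exercise than eyeball each table, here’s a minimal sketch (the load points below are made up purely to show the mechanics – pull the real pairs out of each report):

```python
# Minimal sketch: given (ART in ms, SPC-1 IOPS) pairs read off a report's
# load-point table, pick the point whose ART is closest to a target (e.g. 1ms).
# The pairs below are invented sample data, not any vendor's actual numbers.

load_points = [
    (0.45,  50_000),
    (0.71, 100_000),
    (2.65, 500_000),
    (6.41, 800_000),
]

def nearest_load_point(points, target_art_ms=1.0):
    """Return the (ART, IOPS) pair whose ART is closest to the target."""
    # If you prefer to be conservative, filter to points at or below the
    # target first and take the highest-IOPS one instead.
    return min(points, key=lambda p: abs(p[0] - target_art_ms))

art, iops = nearest_load_point(load_points)
print(f"load point nearest 1ms: {iops:,} SPC-1 IOPS at {art}ms ART")
```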

Here’s the table with the current Top Ten SPC-1 Performance results as of 4/22/2015. Click on it for a clearer picture, there’s a lot going on.

Key for the chart (the non-obvious parts anyway):

The “SPC-1 Load Level near 1ms” is the load point in each SPC-1 Report that corresponds to the SPC-1 IOPS achieved near 1ms. This is not how busy each array was (I see this misinterpreted all the time).

The “Total ASU Capacity” is the amount of capacity the test consumed.

The “Physical Storage Capacity” is the total amount of capacity in the array before RAID etc.

What do the results show?

Predictably, all-flash systems trump disk-based and hybrid systems for performance and can offer very nice $/SPC-1 IOPS numbers. That is the major allure of flash – high performance density.

Some takeaways from the comparison:

Based on SPC-1 IOPS around 1ms Average Response Time load points, the FAS8080 EX shifts from 5th place to 3rd

The other vendors used RAID10 – NetApp used RAID-DP (similar to RAID6 in protection). What would happen to their results if they switched to RAID6 to provide a similar level of protection and efficiency?

Aside from the NetApp FAS result, the rest of the Top Ten Performance submissions offer vastly lower Application Utilization – about half! Which means NetApp is able to put about twice as much of the raw capacity to use compared to the other submissions. And that’s before starting to count the possible storage efficiencies we can turn on, like dedupe and compression.

How does one pick a flash array?

It depends. What are you trying to do? Solve a tactical problem? Just need a lot of extra speed and far lower latency for some workloads? No need for the array to have a ton of functionality? A lot of the data management happens in the application? Need something cost-effective, simple yet reliable? Then an all-flash system like the NetApp EF560 is a solid answer, and it can still be front-ended by a Clustered Data ONTAP system to provide more functionality if the need arises in the future (we are firm believers in hardware reuse and investment protection – you see, some companies talk about Software Defined Storage, we do Software Defined Storage).

On the other hand, if you would prefer an Enterprise architecture that can serve as the cornerstone of your datacenter for almost any workload and protocol, offers rich data management functionality and tight application integration, insane scalability, non-disruptive everything and offers the most features (reliably) compared to any other platform – then the FAS line running Clustered Data ONTAP is the only possible answer.

In summary – the all-flash FAS8080EX gets a pretty amazing performance and efficiency SPC-1 result, especially given the extensive portfolio of functionality it offers. In my opinion, no competitor system offers the sheer functionality the FAS8080 does – not even close. Additionally, I believe that certain competitors have very questionable viability and/or tiny market penetration, making them a risky proposition for a high end system purchase.

Each of those systems can do 650,000 random 4K reads at a stable 800 microseconds (since I like defining my performance stats), 600,000 random 8K reads at under 1ms, and over 300,000 random 32KB reads at under 1ms. Also each system can do 12GB/s of large block sequential reads. This is sustained I/O straight from the SSDs and not RAM cache (the I/O from cache can of course be higher but let’s not count that).

<edit: updated with the changes in the SPC-1 price/performance lineup as of 3/27/2015, fixed some typos>

I’m happy to report that today we announced the new, third-gen EF560 all-flash array, and also posted SPC-1 results showing the impressive performance it is capable of in this extremely difficult benchmark.

If you have no time to read further – the EF560 achieves, by far, the absolute best price/performance at very low latencies in the SPC-1 benchmark.

The EF line has been enjoying great success for some time now with huge installations in some of the biggest companies in the world with the highest profile applications (as in, things most of us use daily).

The EF560 is the latest all-flash variant of the E-Series family, optimized for very low latency and high performance workloads while ensuring high reliability, cost effectiveness and simplicity.

EF560 highlights

The EF560 runs SANtricity – a lean, heavily optimized storage OS with an impressively short path length (the overhead imposed by the storage OS itself to all data going through the system). In the case of the EF the path length is tiny, around 30 microseconds. Most other storage arrays have a much longer path length as a result of more features and/or coding inefficiencies.

Keeping the path length this impressively short is one of the reasons the EF does away with fashionable All-Flash features like compression and deduplication – make no mistake, no array that performs those functions is able to sustain that impressively short a path length. There’s just too much in the way. If you really want data reduction and an incredible number of features, we offer that in the FAS line – but the path length naturally isn’t as short as the EF560’s.

A result of the short path length is impressively low latency while maintaining high IOPS with a very reasonable configuration, as you will see further in the article.

Some other EF560 features:

No write cliff due to SSD aging or fullness

No performance impact due to SSD garbage collection

Enterprise components – including SSDs

Six-nines (99.9999%) availability

Up to 120x 1.6TB SSDs per system (135TB usable with DDP protection, even more with RAID5/6)

High throughput – 12GB/s reads, 8GB/s writes per system (many people forget that DB workloads need not just low latency and high IOPS but also high throughput for certain operations).

All software is included in the system price, apart from encryption

The system can do snaps and replication, including fully synchronous replication

Consistency Group support

Several application plug-ins

There are no NAS capabilities but instead there is a plethora of block connectivity options: FC, iSCSI, SAS, InfiniBand

The usual suspects of RAID types – 5, 10, 6 plus…

DDP – Dynamic Disk Pools, a type of declustered RAID6 implementation that performs RAID at the sub-disk level – very handy for large pools, rapid disk rebuilds with minimal performance impact and overall increased flexibility (for instance, you could add a single disk to the system instead of entire RAID groups’ worth)

T10-PI to help protect against insidious data corruption that might bypass RAID and normal checksums, and provide end-to-end protection, from the application all the way to the storage device

Can also be part of a Clustered Data ONTAP system using the FlexArray license on FAS.

The point of All-Flash Arrays

Going back to the short path length and low latency discussion…

Flash has been a disruptive technology because, if used properly, it allows an unprecedented performance density, at increasingly reasonable costs.

The users of All-Flash Arrays typically fall in two camps:

Users that want lots of features, data reduction algorithms, good but not deterministic performance and not crazy low latencies – 1-2ms is considered sufficient for this use case (with the occasional latency spike), as it is better than hybrid arrays and way better than all-disk systems.

Users that need the absolute lowest possible latency (starting in the microseconds – and definitely less than 1ms worst-case) while maintaining uncompromising reliability for their applications, and are willing to give up certain features to get that kind of performance. The performance for this type of user needs to be deterministic, without weird latency spikes, ever.

The low latency camp typically uses certain applications that need very low latency to generate more revenue. Every microsecond counts, while failures would typically mean significant revenue loss (to the point of making the cost of the storage seem like pocket change).

Some of you may be reading this and be thinking “so what, 1ms to 2ms is a tiny difference, it’s all awesome”. Well – at that level of the game, 2ms is twice the latency of 1ms, and it is a very big deal indeed. For the people that need low latency, a 1ms latency array is half the speed of a 500 microsecond array, even if both do the same IOPS.
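To put numbers on that: for a single stream of serialized, dependent operations, throughput is roughly the inverse of latency – at 500 microseconds per operation you complete about 2,000 operations per second on that stream, while at 1ms you only get about 1,000, regardless of how many IOPS the array can do in aggregate.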

You may also be thinking “SSDs that fit in a server’s PCI slot have low latency, right?”

The answer is yes, but what’s missing is the reliability a full-fledged array brings. If the server dies, access is lost. If the card dies, all is lost.

So, when looking for an All-Flash Array, think about what type of flash user you are. What your business actually needs. That will help shape your decisions.

All-Flash Array background operations can affect latency

The more complex All-Flash Arrays have additional capabilities compared to the ultra-low-latency gang, but also have a higher likelihood of producing relatively uneven latency under heavy load while full, and even latency spikes (besides their naturally higher latency due to the longer path length).

For instance, things like cleanup operations, various kinds of background processing that kicks off at different times, and different ways of dealing with I/O depending on how full the array is, can all cause undesirable latency spikes and overall uneven latency. It’s normal for such architectures, but may be unacceptable for certain applications.

Notably, the EF560 doesn’t suffer from such issues. We have been beating competitors in difficult performance situations with the slower predecessors of the EF560, and we will keep doing it with the new, faster system.

Enough already, show me the numbers!

As a refresher, you may want to read past SPC-1 posts here and here, and my performance primer here.

Important note: SPC-1 is a block-based benchmark with its own I/O blend and, as such, the results from any vendor’s SPC-1 Result should not be compared to marketing IOPS numbers of all reads or metadata-heavy NAS benchmarks like SPEC SFS (which are far easier on systems than the 60% write blend and hotspots of the SPC-1 workload). Indeed, the tested configuration could perform way more “marketing” IOPS – but that’s decidedly not the point of this benchmark.

The EF560 SPC-1 Result links if you want the detail are here (summary) and here (full disclosure). In addition, here’s the link to the “Top 10 by Price-Performance” systems page so you can compare to other submissions (unfortunately, SPC-1 results are normally just alphabetically listed, making it time-consuming to compare systems unless you’re looking at the already sorted Top 10 lists).

The things to look for in SPC-1 submissions

Typically you’re looking for the following things to make sense of an SPC-1 submission:

Latency vs IOPS – many submissions will show high IOPS at huge latency, which would be rather useless for the low-latency crowd

Sustainability – was performance even or are there constant huge spikes?

RAID level – most submissions use RAID10 for speed, what would happen with RAID6?

Application Utilization. This one is important yet glossed over. It signifies how much capacity the benchmark consumed vs the overall raw capacity of the system, before RAID, spares etc.

Price – discounted or list?

Let’s go over these one by one.

Latency vs IOPS

Our average latency was 0.93ms at 245,011.76 SPC-1 IOPS, and extremely flat during the test:

Sustainability

The SPC-1 rules state the minimum runtime should be 8 hours. There was no significant variation in performance during the test:

RAID level

RAID-10 was used for all testing, with T10-PI Data Assurance enabled (which has a performance penalty, but the applications these systems are used for typically need paranoid data integrity). This system would perform slower with RAID5 or RAID6. But for applications where the absolute lowest latency is important, RAID10 is a safe bet, especially on systems that aren’t write-optimized for RAID6 writes the way Data ONTAP is. Not to fret though – the price/performance remained stellar as you will see.

Application Utilization

Our Application Utilization was a very high 46.90% – among the highest of any submission with RAID10 (and among the highest overall, only Data ONTAP submissions can go higher due to RAID-DP).

We did almost completely fill up the resulting RAID10 space, to show that the system’s performance is unaffected when very full. However, Application Utilization is the only metric that really shows how much of the total possible raw capacity the benchmark actually used and signifies how space-efficient the storage was.

Otherwise, someone could do quadruple mirroring of 100TB, fill up the resulting 25TB to 100%, and call that 100% efficient… when in fact it only consumed 25% of the raw capacity.

It is important to note there was no compression or deduplication enabled by any vendor since it is not allowed by the current version of the benchmark.

Compared to other vendors

I wanted to show a comparison between the SPC-1 Top Ten Price-Performance results both in absolute terms and also normalized around 500 microsecond latency to illustrate the fact that very low latency with great performance is still possible at a compelling price point with this solution.

Why 500 microseconds you might ask? Because that’s a good place for very low latency flash storage systems. Why not 1 millisecond you might also ask? Well, 1ms is more commonly found on systems that have more features and don’t concentrate on low latency as much (1ms is half the speed of 500 microseconds).

Here are the Top Ten Price-Performance systems as of March 27, 2015, with SPC-1 Results links if you want to look at things in detail:

I will show columns that explain the results of each vendor around 500 microseconds, plus how changing the latency target affects SPC-1 IOPS and also how it affects $/SPC-1 IOPS.

The way you determine that lower-latency load point is by looking at the graph that shows latency (what SPC calls “Average Response Time”) vs SPC-1 IOPS and finding the load point closest to 500 microseconds. Let’s pick Kaminario’s K2 so you can learn what to look for:

Notice how the SPC-1 IOPS around half a millisecond is about 10x lower than the performance around 3ms latency. The system picks up after that very rapidly, but if your requirements are for latency to not exceed 500 microseconds, you will be better off spending your money elsewhere (indeed, a very high profile client asked us for 400 microsecond max response at the host level from the first-gen EF systems for their Oracle DBs – this is actually very realistic for many market segments).

Here’s the table with all this analysis done for you. BTW, the “adjusted latency” $/SPC-1 IOPS is not something in the SPC-1 Reports but simply calculated for our example by dividing system price by the SPC-1 IOPS found at the 500 microsecond point in all the reports.
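If you want to reproduce that adjustment yourself, the arithmetic is trivial – here’s a small sketch with placeholder numbers (pull the real price, discount and load-point IOPS out of each report):

```python
# Sketch of the "adjusted" price/performance calculation described above.
# All figures below are placeholders, not values from any actual submission.

def adjusted_dollars_per_iops(submitted_price_usd, iops_near_target_latency,
                              discount=0.0):
    """Price per SPC-1 IOPS at the load point nearest the target latency.

    If the submitted price already includes a discount and you want to compare
    at list price, pass the discount fraction (e.g. 0.45 for 45% off).
    """
    price = submitted_price_usd / (1.0 - discount) if discount else submitted_price_usd
    return price / iops_near_target_latency

# Hypothetical submission: $1.2M at 45% off list, 90,000 SPC-1 IOPS near 0.5ms.
print(f"${adjusted_dollars_per_iops(1_200_000, 90_000, discount=0.45):.2f} per SPC-1 IOPS")
```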

What do the results show?

As submitted, the EF560 is #3 in the absolute Price-Performance ranking. Interestingly, once adjusted for latency around 500 microseconds at list prices (to keep a level playing field), the price/performance of the EF560 is far better than anything else on the chart.

Regarding pricing: Note that some vendors have discounted pricing and some not, always check the SPC-1 report for the prices and don’t just read the summary at the beginning (for example, Fujitsu has 30% discounts showing in the reports, Dell, X-IO and HP all at 45% off – the rest aren’t discounted).

Our price-performance is even better once you adjust for discounts in some of the other results. Update: In this edited version of the chart I show the list price calculations as well. We are #1 in price/performance when adjusted for list pricing even at the higher submitted latencies for all vendors…

Another interesting observation is the effects of longer path length on some platforms – for instance, Dell’s lowest reported latency is 0.70ms at a mere 11,249.97 SPC-1 IOPS. Clearly, that is not a system geared towards high performance at very low latency. In addition, the response time for the submitted max SPC-1 IOPS for the Dell system is 4.83ms, firmly in the “nobody cares” category for all-flash systems (sorry guys).

Conversely… The LRT (Least Response Time) we submitted for the EF560 was a tiny 0.18ms (180 microseconds) at 24,501.04 SPC-1 IOPS. This is the lowest LRT anyone has ever posted on any array for the SPC-1 benchmark.

Clearly we are doing something right.

Final thoughts

If your storage needs require very low latency coupled with very high reliability, the EF560 would be an ideal candidate. In addition, the footprint of the system is extremely compact – the SPC-1 results shown are with just a 2U EF560 with 24x 400GB SSDs.