Over the years, substantial marketing FUD has been thrown at WAFL performance, claiming that the performance of NetApp arrays degrades over time due to the nature of WAFL. For example, this performance analysis from HP was posted on Wikibon today by Calvin Zito.

In response to claims that WAFL performance degrades over time, NetApp has published this summary of a 48-hour sustainability test under the SPC benchmark.

Thanks to Val Bercovici for providing these additional useful links in response to Wikibon's request to help shed light on this issue:

These links provide insight into the challenges of managing log-structured file system (LSF) performance in general and WAFL specifically. The first analysis points out some of the nuances of managing WAFL in a world where applications assume data must be laid out sequentially on disk, the predominant approach for the vast majority of disk arrays.

Essentially, NetApp uses historical increases in processing power, intelligent algorithms, and its fifteen years of experience managing this issue to minimize the problem and turn it into an advantage, allowing NetApp, for example, to enable thin provisioning during benchmarks, something vendors rarely do.

The second link provides insight into how WAFL minimizes the need for write cache in RAID operations. Importantly, it also confirms that WAFL, like other LSF systems, suffers performance degradation as the array's storage utilization increases. The following statement applies to WAFL and, to Wikibon's knowledge, to every LSF system:

...WAFL will do very little work when the array has plenty of free space and therefore perform much better. A side effect, though, is that as the array's free space shrinks, WAFL will have to do more work to find free space, and that extra work will translate into lower performance.

Wikibon's Take

It seems conceptually clear that WAFL, which writes to any available free space, must at some point perform so-called garbage collection to reorganize and reallocate free space. When an array that uses this technique is relatively empty, users will see no performance degradation because there is plenty of free space available for writes. As the array's capacity becomes increasingly utilized, the array must work harder to find free space, and this will negatively impact performance, as NetApp itself admits.

Our understanding is that NetApp performs free space collection as a background task in an effort to minimize performance degradation. However, as the array becomes more full, the window for free space management shrinks, which can negatively impact performance.
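To illustrate the concept, the toy simulation below (our illustrative sketch, not NetApp's actual allocation algorithm) models a write-anywhere file system that scans a block-allocation bitmap for the next free block. It shows how the average amount of searching per write grows as the array fills.

import random

# Toy model of a write-anywhere allocator (hypothetical, not NetApp's actual
# algorithm): each write scans a block-allocation bitmap for the next free
# block, so the average scan length grows as free space shrinks.

TOTAL_BLOCKS = 100_000

def blocks_scanned_per_write(utilization, samples=20_000):
    """Average bitmap entries examined to allocate one block at a given fill level."""
    free = [False] * TOTAL_BLOCKS
    for i in random.sample(range(TOTAL_BLOCKS), int(TOTAL_BLOCKS * (1 - utilization))):
        free[i] = True
    cursor, scanned = 0, 0
    for _ in range(samples):
        # Scan forward (wrapping) until a free block is found, then allocate it.
        while not free[cursor % TOTAL_BLOCKS]:
            cursor += 1
            scanned += 1
        free[cursor % TOTAL_BLOCKS] = False
        cursor += 1
        scanned += 1
        # Free one currently allocated block at random so utilization stays constant.
        while True:
            victim = random.randrange(TOTAL_BLOCKS)
            if not free[victim]:
                free[victim] = True
                break
    return scanned / samples

for util in (0.10, 0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{util:4.0%} full -> ~{blocks_scanned_per_write(util):6.1f} blocks scanned per write")

Under this toy model the scan cost is roughly proportional to one over the free-space fraction, which is why a lightly used array shows essentially no penalty while a nearly full one does far more work per write.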

Competitive vendor analyses and subsequent NetApp responses (e.g., its 48-hour sustainability test) attempt to highlight or refute the claim that WAFL has a fragmentation issue. Perhaps a more interesting benchmark would show the performance of a WAFL-based array at utilization levels ranging from very low to very high, to assess the performance impact.

The bottom line is that there are always trade-offs in any technology choice. NetApp's WAFL innovation brings dramatic simplicity and, very often, efficiency benefits to users. Ironically, however, in certain workloads and environments, some of these benefits will be negated by the fact that as a WAFL array fills up, users will need to allocate more headroom to maintain consistent performance.

The reality is that many NetApp customers will never see this problem because they are not running high-volume transaction environments where performance is king. Nonetheless, users should be aware of the inverse relationship between capacity utilization and performance for any storage array, not just WAFL-based arrays. By understanding this relationship and the value of the applications running on the arrays, trade-offs can be assessed and ROI maximized.

Comments on 'WAFL Performance'

It's interesting that HP is trying to take their clearly biased anti-NetApp WAFL smear campaign into an objective forum like yours. Let's see how that works out for them :)

Kostadis Roussos has done an excellent job covering the unique characteristics of WAFL in a narrative style over on his blog: http://blogs.netapp.com/extensible_netapp/wafl

Two of the more relevant entries related to your question are:
1. http://blogs.netapp.com/extensible_netapp/2009/04/understanding-wafl-performance-the-f-word.html
2. http://blogs.netapp.com/extensible_netapp/2009/03/understanding-wafl-performance-how-raid-changes-the-performance-game.html

Here is his engineering perception of the nature of this FUD:
http://blogs.netapp.com/extensible_netapp/2008/11/progress-of-a-k.html

I would call particular attention to the SPC-1 references. Unlike an opaque test conducted in the bowels of a vendor's competitive labs, SPC-1 is a fully transparent benchmark with formal auditing and member rights to revoke invalid (competitive) publications. More details on all of this here: http://blogs.netapp.com/shadeofblue/2009/09/random-rocks-and-benchmarks.html?cid=6a00d8341ca27e53ef0120a5f87586970c#comment-6a00d8341ca27e53ef0120a5f87586970c

In light of that, why didn't SPC member HP simply challenge this report, which demonstrated the final 48 hours of a four-week continuous WAFL run under the SPC-1 workload?

This is definitive, objective proof that WAFL's advanced techniques don't suffer from the common issues found in log-structured file systems, whether historically in STK's Iceberg or, more interestingly today, in the Flash Translation Layers ( http://en.wikipedia.org/wiki/Flash_file_system ) found in all the Flash SSD controllers used by NetApp competitors like HP & EMC! :)

For a full perspective, it's important to emphasize to your readers that ALL storage systems (log-structured or not) suffer performance degradation when they reach their absolute capacity limits. Ironically, it is precisely because of the deep integration between WAFL's semantic layer (block, file, or soon object) and its high-performance RAID-6 layer that NetApp offers the HIGHEST usable capacity at maximum performance levels of ANY storage array in the industry!

This creates a tangible "glass ceiling" of 50% usable capacity for any non-NetApp storage system configured with maximum performance in mind. More details here:
http://blogs.netapp.com/exposed/2008/06/performance-as.html

It's hard to argue, Val, with the fact that WAFL is one of the storage industry's highest-impact innovations. Thanks for providing all these great resources so readers can get the facts, do their own research, and reach their own conclusions.

In the NetApp test we ran, we specifically tried to choose a test that had been run and described in detail by NetApp. That way, the (rather endless) questions we typically get about whether best practices were followed become a non-issue. I have found that no matter what test I choose, no matter how fair I try to be, it's seen as a biased test. So it's best to use a test that the vendor has already blessed, when that is possible. I have very limited time for testing, so I have to go for the biggest bang for the buck.

Concerning capacity fill and its effect on NetApp, the ESRP test we ran did not exhibit that effect at all. There was plenty of spare capacity on the array; in fact, NetApp made sure of it in how they set the test up. What we saw was purely a fragmentation effect. When ESRP (Jetstress) runs, it allocates and writes all blocks to all files up front. There is no file growth. In this test, snapshots were disabled. The array's space allocation is rock solid even as the throughput degrades.

Fragmentation causes a giant drop in throughput for NetApp, on the order of 35% for an Exchange database load. That effect is caused purely by starting with pristine, sequentially ordered stripes and having random small-block writes destroy that arrangement over time (WAFL is fragmentation by design). If NetApp had reported its ESRP IOPS after this burn-in had occurred, I would have had nothing much to write about. But they didn't do that; they ran their numbers while fragmentation was in its early stages. I think that's pretty deceptive, given that other vendors don't have WAFL and can't inflate their numbers the same way. Everyone wants to publish ESRP, but when NetApp is posting numbers that others can't meet, it becomes an incentive not to publish.
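To make the mechanism concrete, here is a toy model (illustrative only, not NetApp's actual allocator) showing how random small-block overwrites, redirected write-anywhere style, turn a pristine sequential layout into a fragmented one. Fragmentation is measured as the number of seeks a full sequential read of the file would incur.

import random

# Toy model: a file starts contiguous on disk; each random overwrite is
# redirected to a new physical location (write-anywhere behavior). We count
# discontinuities in the physical layout of logically adjacent blocks as a
# rough proxy for the seeks a sequential read would incur.

FILE_BLOCKS = 100_000

def seeks_after(overwrites):
    physical = list(range(FILE_BLOCKS))   # pristine sequential layout
    frontier = FILE_BLOCKS                # next free location for redirected writes
    for _ in range(overwrites):
        physical[random.randrange(FILE_BLOCKS)] = frontier
        frontier += 1
    return sum(1 for i in range(1, FILE_BLOCKS) if physical[i] != physical[i - 1] + 1)

for n in (0, 10_000, 50_000, 100_000, 500_000):
    print(f"{n:7,d} random overwrites -> {seeks_after(n):7,d} seeks to read {FILE_BLOCKS:,d} blocks sequentially")

With zero overwrites the read is one long sequential pass; after enough random overwrites nearly every pair of logically adjacent blocks ends up in a different place on disk, which is the effect Jetstress surfaces as falling throughput.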
Karl

Hi Karl... I don't think you'll see NetApp run a benchmark like the one I'm suggesting because, by their own admission, there's an inverse relationship between performance and capacity utilization. NetApp's point is that EVERY array suffers this trade-off, and it would be interesting to see a benchmark that provides such comparisons across arrays over a wide range of utilization scenarios.

But at the end of the day I'm not sure customers care. They want efficiency and predictable performance...don't you agree?

Hi Dave,
I think customers want what you suggest: efficiency and predictable performance. But there is an argument that NetApp FAS delivers neither very well.
First, let's touch on predictable performance. I've run the NetApp tests enough times to know that throughput shows an exponential decay (and response time a corresponding rise) as random writes erode the stripe ordering. If you start with a FAS LUN in some unknown state (i.e., you don't know the LUN's write history), you will not be able to predict the throughput that LUN will achieve on the next test.
If I run the same loads on an HP EVA or an EMC CX4, the results are steady; nothing close to a multi-day burn-in is needed to achieve steady state. Within seconds these boxes are straight-lining. The history of the load does not affect current throughput to the LUN. I'm not saying the CX4 is a great box, but at least it doesn't have this problem. BTW, EMC is no less formidable a competitor of ours than NetApp. I'm just stating the facts.
There are other factors that influence the predictability of the FAS. For example, while a LUN is being reallocated, throughput takes a nosedive. NetApp recommends running reallocate on a schedule for "chaotic" workloads like MS Exchange, so reallocate is actually a common recommendation. This degradation is on top of that caused by fragmentation. I observed a near-50% throughput impact lasting about 8 hours with a 500 GB LUN. Another influence is deduplication, which also carries a hefty throughput price tag during the initial dedupe phase, then an ongoing tax while I/Os are directed to the deduplicated volume.
So… if you are running some steady or cyclic load, you let it run for a few days before drawing conclusions, nothing is allowed to kick off in the background, and you are not using deduplication, then perhaps there is an argument that FAS throughput is predictable. I have not had the patience to get to that point, but let's assume it's possible.
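For those who want to reproduce the shape of the curve rather than the absolute numbers, here is a minimal sketch of how such a sustained random-write burn-in could be instrumented against a pre-allocated test file. The path, sizes, and durations are placeholders, not the parameters of the tests described above.

import os
import random
import time

# Placeholder parameters -- point these at a pre-allocated file on the LUN under test.
TEST_FILE = "/mnt/test_lun/burnin.dat"   # hypothetical path
FILE_SIZE = 4 * 1024**3                  # 4 GiB working set
BLOCK_SIZE = 4096                        # 4 KiB random writes
INTERVAL = 60                            # seconds per throughput sample
DURATION = 8 * 3600                      # total burn-in length

def burn_in():
    """Issue synchronous random 4 KiB overwrites and print per-interval throughput.

    A steadily declining MB/s column over many hours suggests fragmentation-driven
    decay; a flat line suggests the array has reached steady state."""
    blocks = FILE_SIZE // BLOCK_SIZE
    payload = os.urandom(BLOCK_SIZE)
    fd = os.open(TEST_FILE, os.O_WRONLY | os.O_DSYNC)  # O_DSYNC: write through to stable storage
    start = time.time()
    try:
        while time.time() - start < DURATION:
            interval_start, written = time.time(), 0
            while time.time() - interval_start < INTERVAL:
                os.pwrite(fd, payload, random.randrange(blocks) * BLOCK_SIZE)
                written += BLOCK_SIZE
            elapsed = time.time() - interval_start
            print(f"{time.time() - start:8.0f}s  {written / elapsed / 1e6:8.1f} MB/s")
    finally:
        os.close(fd)

if __name__ == "__main__":
    burn_in()

Run it long enough to cover several reallocate or background-task windows before drawing conclusions.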
From an efficiency perspective, the point is pretty simple. The FAS is slower than its competitors. Not by a little, but by a lot. No matter what test I choose (random, reads/writes, large/small block, sequential), the FAS is smoked by a similarly configured, similar-class EVA or CX4 after the FAS does its burn-in. The difference varies depending on the test, of course, but it's not uncommon to see a test where the FAS achieves less than half the IOPS given the same array spindle count, LUN size, and average I/O latency to the LUN.
There is a question to be addressed, one that often gets raised by the pro-NetApp crowd. More than once I’ve heard something like “If the FAS is so slow, why are there so few NetApp customers complaining about performance?”
Now, I'm not in a position to know how many complain, and by the way there are some rather nasty complaints on the web, but for the moment let's assume the question has some merit. I suspect a reasonable explanation lies in the basis of comparison. Very few storage customers have the money and time to buy similarly configured arrays from multiple vendors and put them through identical, apples-to-apples, head-to-head tests. (Sometimes I wonder if I'm the only one on the planet really doing this.)
Think about the last thing you thought was working fine until, after seeing how well an alternative worked, you realized the original was in fact not acceptable. I had this realization with standard TV versus HDTV not all that long ago. Standard TV was just fine, meeting my needs, until I realized what the alternative really meant. My guess is that most of the NetApp customers who find their array's throughput efficiency acceptable are in the same camp: not quite understanding the alternatives.

Are you really going to permit vendor propaganda such as Karl's and Calvin's from HP to pollute the integrity of this site? They continue to spew this FUD, based on their own unverified opinions, all over the blogosphere. Compared with the independently audited facts offered by NetApp, HP's diatribes contribute nothing to an intelligent discussion on this site or others. I've been lurking here for quite some time, but felt compelled to sign up and share my thoughts.

It's highly ironic and perhaps prophetic that Karl's incoherent comment appeared the same day as this piece from Chris Mellor:
http://www.channelregister.co.uk/2009/11/24/hp_storage_slow/

Against that backdrop, the motivations behind the shameful online tactics used by HP are blindingly obvious, or as Val would say "exposed".

Based on their recent financial results, I would counsel HP to focus on the future of their storage products instead of making pathetic attempts to disparage others'.

We've tried to summarize this issue, and cut through the FUD, in the article above. Do you feel the article unfairly represents WAFL performance? If so, please hit the edit tab and improve it.

Thanks for the response. I think your parent article itself is quite well balanced, despite HP goading you to write it in the first place.

My objection lies with the ridiculous comments by Karl Dohm of HP above. They are insulting to your readers with their lack of credibility. I have no loyalty to any specific vendor, but over the years I have used all of the systems Karl mentions above.

Bottom-line, they all have their strengths and weaknesses, but NONE of them exhibit the outrageous behavior Karl describes regarding NetApp.

We have a multi-vendor policy and are currently purchasing more NetApp than others due to a superior balance of cost, functionality and performance for OUR needs (VMware, MS BackOffice & LAMP).

Having said that, I cannot generalize a NetApp recommendation for others - and I can also unequivocally state that if some other vendor better satisfies our needs during 2010, they WILL be getting our business.

In the meantime, our HP sales team will have a LOT of explaining to do if they ever consider wasting our time with Karl's garbage in a sales call with us.

It would be nice if someone contested my data above with facts instead of emotion.

If you sincerely believe that the FAS does not exhibit an exponential drop in throughput as random writes occur, show us some data to counter the claim. Saying the data is "garbage" is itself garbage unless you have really had hands-on time with the equipment and run some tests. Have you? If so, what have you done?

The usual response is "this must be wrong, look at ESRP and SPC". Well, I did. I recreated their FAS2050 ESRP results and exposed issues with how they conducted the test and inflated the numbers.

The following post also presents real test data showing the exponential decay. Various experts not at HP have agreed that this exponential decline on the FAS is real, for example Chuck Hollis at EMC and Patrick Cimprich at Avanade.

Karl, please look up the definition of "facts", or more specifically "evidence". Independently audited results are required for proper use of either term.

All you have are unverified claims from your so-called internal testing. That equates to nothing more than speculation and conjecture.

More curiously, since you question the process and results, you are either ignorant or disrespectful of the work your own HP performance engineering colleagues have done regarding ESRP, SPC, TPC, etc…

Very Machiavellian of you, but back to my original point, this desperate attempt to disparage a competitor taking market share from you does nothing to further a technical discussion about proper storage architecture and operations.

The intent here was to address for users: 1) Does WAFL have unique performance issues? 2) Are they any different than other arrays? 3) Do the benefits of WAFL outweigh the drawbacks?

I believe we've done a reasonable job answering these questions for customers without going too deep into the details of the technology. There is certainly a place for that too.

There are several good links in here and in the article itself from which reasonable people should be able to draw their own conclusions.

The bottom line for me is that WAFL is a unique animal around which NetApp essentially built a tremendous company. Does it have trade-offs? Absolutely. Should users understand these trade-offs? Most definitely. Are there a zillion use cases where WAFL's benefits outweigh its drawbacks? Absolutely.