an insider's perspective, technical tips n' tricks in the era of the IT Revolution

December 16, 2009

What’s what in VMware View and VDI Land…

I saw this post by my respected colleague Duncan Epping at Yellow-Bricks, and it prompted a comment. Three paragraphs into writing it, I figured "this really should be a blog post".

Within the next 24 hours, I got pinged by trusted sources on three continents with the same question – IO scaling and VDI. Over the last week I've been deeply engaged with three massive VDI deployment projects that are struggling with exactly this point.

I’ve got an interesting (and I think insanely fortunate!) perspective/visibility into what’s going on in VDI. I’m not claiming that it’s better/worse than anyone else – but I get to see VDI projects around the globe, in all sorts of verticals, and at all scales (100’s of desktops up to 10’s of thousands and even hundreds of thousands) as my team of people partner with customers.

Duncan/Richard G asked a good question – why isn’t there more View-oriented blogging? IMHO, View blogging is rare because VDI architectures in general are complex (not that any individual element is complex, but rather the end-to-end picture and client lifecycle of patching and A/V can be complex) and very, very variable from customer to customer (workload type, client type, connectivity type, connection management type, virtualization layer config, storage type/config all vary wildly). This means it takes a bit more time to get your head wrapped around it all – but also suggests that a ton of value could come from more open dialog via blogs.

So… Let’s do some View posts :-) Read on….

The first whitepaper Duncan pointed out (here) nailed a core issue that I see (there are many that are in a VDI project design) – the IOPs challenge.

This is the third most common cause I see stopping VDI projects from getting started. The most common, traditionally, is "client experience" – which is a function of network, remote graphics protocol, and use cases. The second is "TCO" – rooted in the fact that people are used to capex-oriented TCO models with VMware server virtualization, while client virtualization is neutral on capex at best, and only shines once opex and information security/availability are factored in.

That said, scaling IO effectively is the single most common cause I see (admittedly, I'm sure I have a "storage-centric" world-view) derailing VDI projects as they scale up into the "thousands of users" range – not in PoC, but in production.

Q: How many hard drives do 1000 desktops have? A: 1000.

Q: When you virtualize those 1000 desktops, will you use a shared storage (FC/iSCSI/FCoE SAN or NAS) config with 1000 drives? A: Of course not.

In a nutshell, that’s the core question.

That trivializes the question, of course – there are loads of things that help – but it paints the problem in a way people seem to understand. Why does it trivialize it? Well – the math isn't that simple.

Things that help with this IOps scaling problem:

- the duty cycle of the desktops (not everyone does the same thing at the same time) – though it's VERY important to look at the things you do that DO affect every desktop at the same time, and consider changing them (patching/AV are the big ones).

- The disks in laptops/desktops are generally 2-3x slower on random IO than those used in enterprise arrays.

- the cache architecture of the shared storage array buffers burst writes and does read caching (every array does this differently, but they all do it to varying extents) – but remember that in the end, you need to commit the write volume fast enough, otherwise the write cache is guaranteed to overflow, no matter what.

- the VMware and storage layers do some minimal coalescing of IO.

- you can mitigate the "simultaneous boot" effect through a variety of means at the connection manager – e.g., VMware View can stagger client boot/logon behavior.

The author of the first whitepaper was also bang on – initially we expected 90/10 read/write ratios at customers, but in practice we're seeing more like 50/50 read/write. We're also seeing IO per client that ranges towards the high rather than the low end of the band. Ergo, if I had a penny for every customer who said "design it for 2-5 IOps per user", and then complained when it turned out to be 8-20 IOps per user… well, I'd be a rich man.
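To see why underestimating per-user IOps hurts so much, here's a back-of-envelope sizing sketch. The ~180 IOps per 15K spindle figure is my illustrative assumption (not from the whitepaper), and it ignores RAID write penalty entirely – real designs need both:

```python
# Rough VDI spindle-count sizing sketch. Illustrative assumptions only:
# a 15K RPM drive is assumed to deliver ~180 random IOPS, and RAID
# write penalty is ignored for simplicity.
def spindles_needed(users, iops_per_user, iops_per_spindle=180):
    total_iops = users * iops_per_user
    # Round up: you can't deploy a fraction of a disk.
    return total_iops, -(-total_iops // iops_per_spindle)

for per_user in (2, 5, 8, 20):
    total, disks = spindles_needed(1000, per_user)
    print(f"{per_user:>2} IOPS/user -> {total:>6} IOPS total -> ~{disks} spindles")
```

The gap between "designed for 5" and "observed 20" is roughly 4x the spindle count – which is exactly the budget surprise described above.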

Are you using VDI in production – if so, I’d love to see your comments on what you’re seeing in practice for your clients re: client IO profile…. please comment!

Array caching models help and are part of the picture. It’s important to understand where and how.

Read cache, once loaded and in steady state, is always full – bigger is generally better, and it helps with cacheable reads. Read cache degrades gracefully: as the fall-through time (how long data stays in cache) shrinks due to memory pressure, performance drops gradually.

Conversely, write cache starts empty, fills as I/O comes into the array, and is drained (destaged) by the backend spindles. It helps by absorbing bursts and letting the array try to coalesce write I/Os. The two caches also differ in what happens when they fill: when write cache fills, there is a big and instantaneous performance drop, as all of a sudden the host is directly exposed to the latency and performance envelope of the backend spindles.

Remember that write cache protects you against bursts – but every storage array needs to be able to "sink" (drain the write cache) at a rate greater than the sustained write IO workload. Otherwise cache destage (i.e., write commits on the backend spindles as if the array had no cache) becomes the gating performance envelope.
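A minimal sketch of that overflow arithmetic, with entirely hypothetical cache size and throughput numbers, makes the "guaranteed to overflow" point concrete:

```python
# Back-of-envelope: how long until write cache fills if sustained inbound
# writes exceed the backend destage rate? All numbers are hypothetical.
def seconds_until_cache_full(cache_gb, inbound_mb_s, destage_mb_s):
    """Return seconds until write cache overflows, or None if it never does."""
    net_fill = inbound_mb_s - destage_mb_s  # MB/s accumulating in cache
    if net_fill <= 0:
        return None  # backend keeps up; cache never fills
    return (cache_gb * 1024) / net_fill

# 8 GB of write cache, 500 MB/s sustained writes, backend drains 400 MB/s:
print(seconds_until_cache_full(8, 500, 400))  # ~82 seconds, then the cliff
```

No amount of cache changes the conclusion – it only changes how long the burst can last before the host hits the backend spindles directly.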

I hate customer PoCs (by EMC or anyone) that are structured to avoid this point – because they prove little to nothing IMO. The challenge is that generating big, and realistic client workloads in a customer PoC (and in our reference architecture work) is exceptionally difficult.

There are several very important things you can do to mitigate extra guest IO:

- avoid vswap at all costs in this use case. People think memory density will be the bottleneck in the economic model; it can be, but just as often it's the storage. In general, configure guest memory = reserved memory.

- any time you can move the user data ("my documents") out of the guest and onto a NAS device, it's a huge win in many dimensions (capacity efficiency, minimizing VM size). This isn't an option for every use case (for example, it doesn't work easily with "check in/out" use cases).

- Disable automated AV updates

- Disable boot optimization

- Disable system restore

In my experience (and my team's experience) working on many VDI projects with all sorts of configs – at larger scales, the problem of economic IO scaling becomes very hard. I'm seeing some customers deploy Atlantis (and similar approaches) to increase scaling and decrease IO density (through distributed caching on the ESX hosts) – though these often break the encapsulation model (everything is a trade-off).

Other things to think of…

VDI can generate BIG workloads fast.

Let's say once again that your peak workload is 12 IOps per client, and you have 15,000 desktops you want to virtualize. That's a total of 180,000 IOps, which is a very, very large workload for common storage configurations. It would hammer a large CX4, for example. You would need to carefully scale out all the aspects of the design, and consider it just as you would consider the system design for a MASSIVE database. Can it be done? Of course – but there's a reason why the "what's the single ESX host maximum IOps" test at the vSphere 4 launch (365,000 IOps) was backed by 3 CX4-960s with 30 solid-state disks. That's a whackload of IO.
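The aggregate arithmetic is trivial, but worth making explicit – small per-client numbers multiply into enormous backend workloads. The ~180 IOPS per 15K spindle figure is my illustrative assumption, and RAID write penalty would push the real drive count higher still:

```python
# Aggregate VDI workload arithmetic from the example above.
clients = 15_000
peak_iops_per_client = 12
iops_per_spindle = 180  # assumed for a 15K drive; illustration only

total_iops = clients * peak_iops_per_client
print(total_iops)                      # 180000
print(total_iops // iops_per_spindle)  # ~1000 spindles before RAID penalty
```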

Protocol factors to consider…

If you're using Fibre Channel – every FC target port on an array has a practical limit (measured in I/Os), usually around the 3000 IOps mark. You need to make sure that you have enough front-end port connectivity. Does this sound weird? It's not. Let's say you have a massive V-Max able to chew up and spit out IOps out the ying-yang, but the vSphere cluster is using 4 FA ports (front-end array interfaces). Well, you have a total "ingest" of around 12,000 IOps. If your users generate 12 IOps each, then once you're at 1000 clients, you're going to saturate the front-end ports from an IOps standpoint. You'd better make sure you're using lots of front-end ports, and load-balancing across all of them.
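A hedged sketch of that front-end port math – the ~3000 IOps per-port figure is the rough rule of thumb used above, and varies by array and firmware, so check your vendor's specs:

```python
# Front-end FC port saturation sketch. The per-port IOPS limit is an
# assumption for illustration; real limits vary by array and firmware.
def max_clients(ports, iops_per_port=3000, iops_per_client=12):
    total_ingest = ports * iops_per_port
    return total_ingest, total_ingest // iops_per_client

print(max_clients(4))   # (12000, 1000): 4 FA ports saturate at ~1000 clients
print(max_clients(16))  # (48000, 4000): more ports, more headroom
```

The punchline: the array's aggregate capability doesn't matter if the cluster funnels everything through a handful of front-end ports.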

NFS has some advantages here, but also some disadvantages that need consideration. Whereas steady-state IO is mostly IOps-gated, the periods of peak IO – patching/AV – are high-bandwidth (MBps-gated) workloads. You need to remember the "one active vmkernel interface per NFS datastore" rule of the NFSv3 support in vSphere 4. You can of course use 10GbE with NFS datastores today (which we support). Remember, though, that NFS servers usually have backend interfaces (which also have IOps limits) connecting the NFS server to the disks that support the filesystems. As an FYI, an updated Celerra vSphere NFS best practices doc is coming soon. The Celerra does fine with 4K I/Os, but since it uses an internal 8K allocation size on its UxFS filesystem, performance is even better if the NTFS guest volume uses an 8K allocation size.

So… What are we doing about it?

Clearly caches can help a lot with the read portion of the workload, and write caches can absorb spikes and help a little on the backend (decoupling the host from the backend disk IO). Remember to assume that cache is non-existent (or negligible) for your write workloads (you must avoid the write-cache-full effect). And larger cache is generally better (though not a panacea).

On the EMC side, we think that EFD (enterprise-class solid state) is an important part of the answer. The author of the whitepaper is right about consumer SSD life span, but there are enterprise SSDs which can sustain the same duty cycle and lifetime as any enterprise magnetic media. Over time, this will apply even to consumer SSDs.

BUT – they are not currently a simple answer. With View 4, you currently can't put a base replica on one datastore and the linked clones on another. This means you would need to put the entire datastore on EFDs, which – at the current $/GB of EFDs – makes them less viable (though they can still help). Likewise, if you're not using View Composer, you can leverage them, but again only if the entire datastore is on EFD.

We’ve used all the methods above in the current View 4 VMware/Cisco/EMC reference architecture to get to a $750/client end-to-end cost (everything – including Microsoft VECD licensing, but it doesn’t include the client hardware).

For those interested, of the $750/client end-to-end cost:

- it assumed 2000 users on the total config

- 4GB DIMMs were superior at the economic breakpoint when compared with 8GB DIMMs on the UCS blades

The breakdown was:

- 26% on the storage

- 49% on the servers/network

- 21% on the VMware software (vSphere 4 Enterprise Plus, the full View package)

- 3% on VECD and incidentals

The initial doc was worked on prior to the View 4 GA (those close to View and VMware know that right before GA, it stretched out a couple of days) – which meant it had to use pre-GA versions of View 4 on pre-release versions of vSphere 4 Update 1. Between now and the end of December we're working on an update based on all the GA elements, which will include many more detailed findings.

To try to make the economics even better, we're working on a couple of things. This was an extensive topic of discussion with the View team when I was at VMware two weeks ago for our EMC/VMware QBR and QTR.

- vStorage APIs for Array Integration (VAAI) – "hardware-accelerated locking" will help with VM density per VMFS datastore (to match the "VMs per datastore" model of NFS for customers using block models), as will the VAAI fast/full copy hardware-offloaded copy/move. On the NAS side of things, over time we're working on pNFS support in vSphere and in our GA NFS platforms for scale-out NAS.

- also, the vStorage APIs for write-same/write-zero (we demoed this at VMworld) will reduce some of the I/O from the host to the array by eliminating some duplicate I/Os and zeroes.

- FAST v2 (block-level auto-tiering) will help in the sense that a datastore (on NFS or VMFS) will be able to be "blended" – with EFDs supporting "hot" blocks/portions of files, and large, slow SATA/SAS being used for "cold" ones. This combines the best $/IOps of EFDs with the best $/GB of large magnetic media.

- Future versions of View and View Composer have some things explicitly targeted at the issues I noted (decoupling base replicas and linked clones, creative stuff around vswap/guest swap handling).

- Working on tools to more easily capture a given client workload (think of a "Capacity Planner for desktops").

- Working on tools to more accurately model client workloads at large scale.

@Andre - I TOTALLY agree. In the View 3 timeframe, I personally was thinking 90:10 read/write. I can't speak beyond the projects I'm seeing personally, which are around 50:50. I think one of the most important things we could deliver to help our customers would be tools to quantify exactly what they have. Every workload is also wildly divergent.

@Marco - 8Gbps and 10GbE are more than enough - people often confuse high IOps (throughput) and high MBps (bandwidth). High bandwidth usually occurs during periods with large IO sizes and lower IOps.

- We're using folder redirection for user data to CIFS shares.
- We're never stingy with RAM. Windows caching helps save a few IOps. (And with RAM getting cheaper, I suspect this'll be the case in more setups.)
- Even an idle desktop will tend to page stuff out rather than read stuff in.

I suspect Chad's right - either SSD or cache will help a lot. Unfortunately, no one has published any numbers, so real-world sizing is a black art.

@Marco, supposing the peak is 12 IOps per user, the bandwidth is 0.75 MB/s (if using 64K blocks). For 10,000 users, that's 7.5 GB/s. You only need that bandwidth at the storage side, as a single host will never support all 10,000 users. In any case, it's better to have everything connected to your core switches if using NFS.
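Checking that arithmetic (the 64K I/O size is the assumption under debate here; a later comment measures mostly 4K I/Os in practice, which changes the answer dramatically):

```python
# Bandwidth arithmetic from the comment above. The I/O size is the key
# assumption: 64K is XP's published block size, but measured workloads
# in a later comment are mostly 4K.
def bandwidth_mb_s(iops_per_user, io_size_kb, users=1):
    return users * iops_per_user * io_size_kb / 1024  # MB/s

print(bandwidth_mb_s(12, 64))          # 0.75 MB/s per user
print(bandwidth_mb_s(12, 64, 10_000))  # 7500.0 MB/s = 7.5 GB/s aggregate
print(bandwidth_mb_s(12, 4, 10_000))   # 468.75 MB/s if I/Os are really 4K
```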

I just did a performance analysis for a customer with ~400 VDI users attached to a DMX (so we could get great performance data), who wanted to "take VDI to the next level" while redeploying on an NS platform to free up Tier 1 capacity. Their users are happy with performance today, and the team wanted to keep it that way. In the morning, these users were each driving more like 20-25 IO/s as they all came online, for an aggregate of about 10K IO/s. This dropped to steady morning work activity of about 5K IO/s. In the afternoon they dropped down to 1000 IO/s. The read/write % was always around 80/20.

This was a rather naive configuration - fully provisioned desktops dedicated to each user, apps and data stored directly in the guests, etc. We figured they needed about 100 x 15K FC drives if they wanted to handle the morning boot storm as effectively as on the DMX. We recommended changing the way they deploy desktops: using View Composer, linked clones from a master image, and removing the apps, profiles, and user data from the guests and putting them on file shares. That way, they could get their entire storage footprint down to about 600GB.

The trick is, this mere 600GB of capacity still has to perform the same as 100 x 15K FC drives. The need for a single RAID group of EFDs was obvious.

@Marco on the 64K block comment. We too saw that XP's published block size is 64K. Based on this, we looked at sizing an array for 3000 users at 10 IOps each, and it came out to some insane numbers - like two fully populated NS960s to serve the throughput (not the IOps). After looking at what block size XP was actually using (on a test harness running common office apps), we saw that the majority was 4K, with some 8K and 16K and very little 64K. We generally use a 4K block size when sizing the storage.

This also needs to be taken into account when looking at deduplication. I've seen multiple vendors offer it with VDI, since most of these images are similar. But when you calculate all the IOps, you might end up with the same number of disks - so what's the point?

Great post! When taking a chance on VDI or IO, give a second and third thought to the efficiency, quality, and flexibility of those protocols, and don't ever be afraid of spending some extra money to "save your neck"!


Disclaimer

The opinions expressed here are my personal opinions. Content published here is not read or approved in advance by Dell Technologies and does not necessarily reflect the views and opinions of Dell Technologies or any part of Dell Technologies. This is my blog; it is not a Dell Technologies blog.