I am an avid supporter of virtualizing Provisioning Server. Servers today are simply too powerful, and running things on bare metal wastes resources. Let’s face it, the average enterprise 1U rack or blade server has at least 2 sockets, 8+ cores and tons of RAM. Running a single instance of Windows on one of these servers is a complete waste of resources. I have often heard people say that you can only get 300–500 targets on a virtual PVS. I have also seen customers who think they have to place a virtual PVS on each hypervisor host along with the target devices, so that the number of targets per PVS is limited and all traffic stays on the physical host and virtual switch. I would like to finally debunk these myths: PVS virtualizes just fine, even in large environments, and you do not have to treat it any differently than other infrastructure servers that run as virtual machines. To back that up, I will walk through a real-world customer example showing that Provisioning Server is an excellent candidate for virtualization in all environments, even large ones.

Real World Example

First, for the sake of privacy, I will not disclose the name or any other identifying information about the customer, but I will provide some basic technical details as they relate to the virtual PVS deployment, as well as some data showing how well virtual PVS is scaling.

Environment basics

The hypervisor is VMware vSphere 4.1 for both the server VMs and the Windows 7 desktops

PVS 5.6 SP1 is virtualized on the same hosts as the other supporting server VMs

There are 5000+ concurrent Windows 7 virtual machines being delivered by virtual PVS

All virtual machines (both Windows 7 and PVS) have one NIC. PVS traffic and production Windows traffic traverse the same network path

Each virtual PVS server was configured as a Windows Server 2008 R2 VM with 4 vCPUs and 40 GB RAM

The PVS Store is a local disk (VMDK) unique to each PVS server

Each Windows 7 VM has a unique hard disk (VMDK) that hosts the PVS write cache

So, how many target devices do you think we could successfully get on a single virtual PVS: 300, 500, 1000? Well, check out the screen shot below, which was taken in the middle of the afternoon during peak workload time:

As you can see, each of the first three PVS servers is running almost 1500 concurrent target devices. How is performance holding up from a network perspective? The console screen shot was taken from PVS 01, so the Task Manager data represents 1482 connected target devices. The Task Manager graph shows an average of 7% network utilization with occasional spikes to 10%. Since this is a 10 Gb interface, network traffic for roughly 1500 Windows 7 target devices averages about 700 Mb/s and peaks near 1000 Mb/s. In theory, a single 1 Gb interface could support this load, although the spikes would take it right to its limit.
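The conversion from Task Manager percentages to bandwidth is simple arithmetic. The sketch below assumes the utilization figure is relative to the 10 Gb link speed; the function name is illustrative and not part of any PVS tooling.

```python
def nic_throughput_mbps(utilization_pct: float, link_speed_gbps: float) -> float:
    """Convert a NIC utilization percentage into megabits per second."""
    # Multiply before dividing so round inputs stay exact in floating point.
    return utilization_pct * link_speed_gbps * 1000 / 100


# Figures read from the PVS 01 Task Manager graph (10 Gb interface).
targets = 1482
sustained = nic_throughput_mbps(7, 10)  # average utilization
peak = nic_throughput_mbps(10, 10)      # occasional spikes

print(f"sustained {sustained:.0f} Mb/s, peak {peak:.0f} Mb/s")
print(f"per target: {sustained / targets:.2f} to {peak / targets:.2f} Mb/s")
```

Dividing by the target count gives roughly 0.5 to 0.7 Mb/s per target, a useful starting point for your own capacity planning, though the real figure will vary with workload and target RAM.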

How about memory and CPU usage? Check out the Task Manager screen shot below, taken from PVS 01 at the same time as the previous screen shot:

From a CPU perspective, you can see that we are averaging 13% CPU utilization with 1482 concurrently connected target devices. Memory usage shows only 6.74 GB committed; however, take note of the Cached memory (a.k.a. System Cache or File Cache). The PVS server is using just under 34 GB of RAM for file caching. This heavy use of file cache is due to the fact that multiple different Windows 7 VHD files are hosted on the PVS server. Windows uses all available free memory to cache the blocks of data being requested from these VHD files, greatly reducing, and almost eliminating, disk I/O on the virtual PVS servers.
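Because each vDisk that targets are actively reading keeps its own hot blocks in the System Cache, PVS memory needs grow with the number of distinct vDisks rather than with the number of targets. A rough sizing sketch follows; the baseline, the per-vDisk cached working set, and the 15% buffer are all hypothetical planning inputs, not figures from this customer, so measure your own.

```python
def estimate_pvs_ram_gb(baseline_gb: float, vdisk_working_sets_gb: list[float],
                        buffer_pct: float = 15.0) -> float:
    """Rough PVS RAM estimate: baseline + cached working set of each vDisk + buffer.

    All inputs are assumptions to be measured or estimated for your environment.
    """
    cache_gb = sum(vdisk_working_sets_gb)
    return (baseline_gb + cache_gb) * (1 + buffer_pct / 100)


# Hypothetical example: 4 GB baseline, six Windows 7 vDisks caching ~5 GB each.
print(f"{estimate_pvs_ram_gb(4, [5] * 6):.1f} GB")
```

The point of the model is the shape of the sum: adding another vDisk adds another working set to cache, while adding targets that boot from an already-cached vDisk adds very little.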

At 1500 active targets, these virtual PVS servers are not even breaking a sweat. So how many target devices could one of these virtual PVS servers support? My customer has told me that they have seen one comfortably support 2000+ with plenty of headroom still available. It will obviously take more real-world testing to find the true limit, but I would be very comfortable saying that each of these virtual PVS servers could support 3000 active targets.
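As a sanity check on the 3000-target estimate, you can scale the PVS 01 measurements linearly. This is a naive model that assumes CPU and network load grow in proportion to target count, which real servers will only approximate.

```python
def scale_linearly(observed_value: float, observed_targets: int, planned_targets: int) -> float:
    """Naively project a load metric to a new target count, assuming linear growth."""
    return observed_value * planned_targets / observed_targets


OBSERVED_TARGETS = 1482  # PVS 01 at peak
for planned in (2000, 3000):
    cpu_pct = scale_linearly(13, OBSERVED_TARGETS, planned)
    net_avg = scale_linearly(700, OBSERVED_TARGETS, planned)    # Mb/s
    net_peak = scale_linearly(1000, OBSERVED_TARGETS, planned)  # Mb/s
    print(f"{planned} targets: ~{cpu_pct:.0f}% CPU, "
          f"{net_avg:.0f} to {net_peak:.0f} Mb/s")
```

At 3000 targets this projects to roughly 26% CPU and about 1.4 to 2.0 Gb/s of streaming traffic, still comfortably inside a 4 vCPU VM on a 10 Gb link.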

It is important to note that this customer is very proficient in all aspects of infrastructure and virtualization. In fact, in my 13+ years of helping customers deploy Citrix solutions, the team at this customer is by far the most proficient I have ever worked with. They properly designed and optimized their network, storage and VMware environment to get the best performance possible. While I cannot go into deep detail about their configuration, I will share some of the specific Citrix/PVS optimizations that have been implemented.

There are Advanced PVS Stream Service settings that can be configured on the PVS server. These settings control the number of threads and ports available to service target devices. For optimal performance, it is recommended that there be at least one thread per active target device. For more information on these settings, refer to Thomas Berger’s blog post: http://blogs.citrix.com/2011/07/11/pvs-secrets-part-3-ports-threads/

For this customer, we increased the port range so that 58 UDP ports were used, with 48 threads per port, for a total of 2784 threads (58 × 48). Below is a screen shot of the settings that were implemented:
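The port and thread values follow from simple division. A minimal sketch, assuming the goal is at least one thread per active target; the first port of 6890 is an assumption about a typical PVS install, so check your own Stream Service settings.

```python
import math

ASSUMED_FIRST_PORT = 6890  # typical PVS default; verify in your console


def threads_per_port_needed(active_targets: int, ports: int) -> int:
    """Smallest threads-per-port value giving at least one thread per target."""
    return math.ceil(active_targets / ports)


def port_range(ports: int, first_port: int = ASSUMED_FIRST_PORT) -> tuple[int, int]:
    """First and last UDP port when `ports` consecutive ports are used."""
    return first_port, first_port + ports - 1


ports, threads = 58, 48
print("total threads:", ports * threads)
print("needed for 3000 targets:", threads_per_port_needed(3000, ports), "per port")
print("port range:", port_range(ports))
```

With 58 ports, 48 threads per port covers up to 2784 active targets; supporting 3000 at one thread per target would mean raising it to 52 (or adding ports).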

It is also important to note that we gave 3 GB of RAM to each Windows 7 32-bit VM. Make sure that you do not starve your target devices of memory. In the same way that the PVS server uses its System Cache RAM so that it does not have to keep reading VHD blocks from disk, the Windows target devices use System Cache RAM so that they do not have to keep requesting the same blocks of data from the PVS server. Too little RAM in the target means more network load on the PVS server. For more detailed information on how System Cache memory on PVS and target devices affects performance, I highly recommend my white paper Advanced Memory and Storage Considerations for Provisioning Services: http://support.citrix.com/article/ctx125126
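Why target RAM matters for server load can be illustrated with a toy cache model. The sketch below uses a plain LRU cache as a stand-in for the Windows System Cache (the real cache manager is more sophisticated) and counts misses, each of which represents a block the target must fetch from the PVS server again. The workload is hypothetical and deliberately repetitive.

```python
from collections import OrderedDict


def count_cache_misses(requests: list[int], cache_blocks: int) -> int:
    """Count block reads that miss an LRU cache holding `cache_blocks` entries."""
    cache = OrderedDict()
    misses = 0
    for block in requests:
        if block in cache:
            cache.move_to_end(block)  # mark as most recently used
            continue
        misses += 1  # this block would be streamed from the PVS server
        cache[block] = None
        if len(cache) > cache_blocks:
            cache.popitem(last=False)  # evict the least recently used block
    return misses


# Hypothetical workload: a 100-block working set read five times.
workload = list(range(100)) * 5
print("cache fits working set:", count_cache_misses(workload, cache_blocks=200))
print("cache too small:", count_cache_misses(workload, cache_blocks=50))
```

With enough cache the target reads each block from the server once; when the cache cannot hold the working set, every read in this cyclic pattern goes back over the network. Cyclic access is the worst case for LRU, so real-world differences are smaller, but the direction is the same.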

Conclusion

Based on this real-world example, you should not be afraid to virtualize Provisioning Server. If you do virtualize Provisioning Server, make sure you take the following into consideration:

Give plenty of RAM to both PVS and your target devices

Give the proper number of vCPUs to the PVS VM and tune the ports and threads

Plan on supporting about 1000 active targets per 1 Gig of network throughput
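These rules of thumb can be combined into a quick capacity plan. A minimal sketch, assuming 1000 targets per Gb (from the list above), a per-server target count you choose yourself (1500 here, matching what this customer ran at peak), and one spare PVS server for failover; all three are planning assumptions, not hard limits.

```python
import math


def plan_pvs(total_targets: int, targets_per_server: int = 1500,
             targets_per_gbps: int = 1000, spare_servers: int = 1) -> dict:
    """Rough PVS farm sizing: server count, aggregate bandwidth, failover load."""
    active = math.ceil(total_targets / targets_per_server)
    # With `spare_servers` servers down, the remaining `active` servers
    # share all targets evenly.
    failover_targets = math.ceil(total_targets / active)
    return {
        "servers": active + spare_servers,
        "aggregate_gbps": total_targets / targets_per_gbps,
        "failover_targets_per_server": failover_targets,
        "failover_gbps_per_server": failover_targets / targets_per_gbps,
    }


print(plan_pvs(5000))
```

For 5000 targets this suggests five PVS servers, about 5 Gb/s of aggregate streaming traffic, and roughly 1.25 Gb/s per server if one fails. That is more than a single 1 Gb NIC can carry, which is why the bandwidth behind each PVS VM matters.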

It is also important not to overlook our other best practices for PVS and VDI. In this real-world example, we also followed and implemented the applicable best practices as defined in the two links below:

As a final note before I wrap up, I would like to address XenServer, as I know I will get countless questions since this real-world example used VMware. There have been discussions in the past suggesting that XenServer does not virtualize PVS very well. However, XenServer has made some significant improvements over the last year, and it now virtualizes PVS just fine. If you are using XenServer, make sure you do the following:

Comments

Good question. Bonding NICs within the Hypervisor is still something that should be done to provide higher availability and throughput. VMware supports LACP, so a single PVS VM can send traffic simultaneously over 2 NICs. At this point in time XenServer supports bonding to provide greater overall throughput and availability for the XenServer host, but a single VM can only have its traffic transmitted over a single NIC at any moment in time.

Also keep in mind, Jay, that Dan’s environment was 10 Gb. Assuming the networking infrastructure is truly 10 Gb across the board (i.e. on the switch side as well), NIC teaming/bonding isn’t really an issue, as you said. But if this were a 1 Gb environment (and I find that most still are today, though that’s changing quickly…), NIC teaming/bonding all of a sudden becomes critically important, because we’ll start hitting the 1 Gb bottleneck with anywhere from 500 to 1000 target devices. That’s when it would have been critical for Dan (in this vSphere environment) to enable static LACP and make sure he had 2+ Gb of effective throughput for the stream traffic. The lack of LACP on the XS side is what makes virtualizing PVS “tough” in a 1 Gb environment if you’re trying to scale to 1000+ targets on each box.

Great information, Dan. Virtualizing Provisioning Server and using CIFS for the vDisk Store is something we have long avoided, but the more data we see, the more our minds are put at ease. I notice this example is not using CIFS for the vDisk store; it would be interesting to see performance data from a real-world example that uses CIFS vDisk store(s) at large scale…

Another design element I noticed in this example is a single NIC/network being used for PvS Streaming and Production VM traffic. In the past I have seen recommendations to multi-home the PvS Targets and use separate networks to isolate PvS vDisk Streaming traffic from Production traffic in order to provide better scalability and maximum performance. Have you seen any data that proves or disproves this theory?

Great question about multi-homing PVS and targets. I have seen those recommendations as well. While there is nothing technically wrong with multi-homing and isolating the PVS traffic, in most situations it is overkill and not required. With XenServer and VMware, PVS targets support the optimized network drivers that are installed with the hypervisor guest tools. These are fast, efficient drivers that have no trouble handling production Windows and PVS storage traffic over the same network path. In my experience, the added complexity of creating a multi-homed target and managing a separate network for streaming traffic is just not worth it.
Cheers,
Dan

Hi Dan, great post, and thank you for your answers to all the questions so far. Could you share some information about the network interfaces used for the VMs (PVS server and targets)? We found that we get the best performance using VMXNet, but I think we all know about the problems PVS had with VMXNet 3 in the past. And what about CPU overcommitment on the PVS server hosts? Do you have that?

CPUs on the hosts running PVS server VMs are technically overcommitted, as there can be more VMs and active vCPUs than physical CPUs, but this customer has a well-architected hypervisor solution, and total host CPU utilization is monitored to keep it within a normal range. And of course there are anti-affinity rules to prevent PVS VMs from running on the same host.
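The two conditions described here, a monitored vCPU-to-core ratio and keeping PVS VMs on separate hosts, can be expressed as simple checks. In vSphere the second is normally enforced by a DRS anti-affinity rule; the sketch below only audits an inventory, and the VM and host names are hypothetical.

```python
from collections import Counter


def pvs_hosts_are_separate(placement: dict[str, str], pvs_vms: list[str]) -> bool:
    """True if no two PVS VMs are placed on the same host (VM name -> host name)."""
    per_host = Counter(placement[vm] for vm in pvs_vms)
    return all(count == 1 for count in per_host.values())


def vcpu_to_core_ratio(vcpus_on_host: list[int], physical_cores: int) -> float:
    """Overcommit ratio: total vCPUs on a host divided by its physical cores."""
    return sum(vcpus_on_host) / physical_cores


# Hypothetical inventory.
placement = {"pvs01": "esx01", "pvs02": "esx02", "pvs03": "esx03", "sql01": "esx01"}
print(pvs_hosts_are_separate(placement, ["pvs01", "pvs02", "pvs03"]))
print(vcpu_to_core_ratio([4, 4, 2, 2, 2, 2], physical_cores=12))
```

A ratio above 1.0 means the host is overcommitted; as described above, that is acceptable as long as actual host CPU utilization is monitored and stays in a normal range.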

Any particular reason why you wouldn’t want to run VMs on the same host that PVS is virtualized on? We want to maximize our hosts, and with 96 GB of RAM and 12 cores (24 with HT) we would prefer to use some of the available resources for provisioned XenApp servers. Thoughts?

I don’t see this article mentioning the amount of RAM on the Win7 desktops or the size of the persistent disk allocated to each. Since proper sizing of the client helps attain maximum throughput from PVS, I would think those details are important.

Strange. When I click the link as you reposted it in your comment, and in the body of my blog, it works fine for me. Also, if I google "CTX127459" it comes up as the first hit for me. Can you try it again?

Great article. I did this very thing last year, albeit for a much smaller environment (~300 XenDesktops). I’m curious about the VMware host configuration: CPU type and RAM? I’m currently designing an environment roughly the same size as your customer’s.

Dan, what kind of disk did you use for Windows 7 to handle boot and logon storms? Also, how did you move all the log files to the cache drive, and do the logs delete themselves, or will the drive fill up if we don’t delete them?

We set the event logs to overwrite events as needed and gave them a fixed size on the write cache disk (drive D:). You can do this with a GPO. The write cache disks are on an EMC SAN connected to the VMware hosts via FC.

Hi Dan,
Provisioning Server best practice (CTX117374) says to disable TCP task offload; was this done in this environment? I’m curious about the CPU usage. It’s always higher in our environments, even with far fewer clients, and I always figured it was because TCP offload was disabled.
Regards, Dan.

CIFS is a terrible choice for storing VHDs on PVS! Windows does not cache a network share the way it caches a local disk. If you want a single repository for all PVS servers, you need a clustered file system like Sanbolic Melio FS!

Lucab,
Did you even read the article that I linked to? If you actually read the article, you will understand that the registry changes I detail allow Windows to cache the network share data. That being said, there is nothing wrong with using a clustered file system like Melio.
Cheers,
Dan

Dan, how exactly did you come up with the thread and port numbers? The blog post from Thomas suggests using the number of CPU cores. Just wondering if you did some testing with different numbers to reach this conclusion.

Actually, Thomas suggested increasing it so that when you multiply the threads per port by the total number of ports, you end up with one thread per active target. He then said that Citrix lab testing suggested the Stream Service performs best when the number of cores is equal to or greater than the number of threads per port. However, if you are going to go large like we did at the customer highlighted in my article, you need to go past that threads-per-core ratio. No worries, it scales just fine, as you can see from my customer’s results. For large environments, you definitely want a value much higher than the default of 8!
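The two guidelines discussed here can be checked side by side. A minimal sketch: the 4 cores and 48 threads per port match this customer’s PVS VMs and settings, the default of 8 threads per port comes from the reply above, and the 20-port default range is an assumption about a stock install.

```python
def check_stream_settings(ports: int, threads_per_port: int,
                          active_targets: int, cores: int) -> dict:
    """Evaluate Stream Service thread settings against two guidelines."""
    total_threads = ports * threads_per_port
    return {
        "total_threads": total_threads,
        # Guideline 1: at least one thread per active target.
        "one_thread_per_target": total_threads >= active_targets,
        # Guideline 2: threads per port should not exceed the core count.
        "threads_within_cores": threads_per_port <= cores,
    }


# Assumed stock settings vs. this customer's settings, at 1482 targets on 4 vCPUs.
print(check_stream_settings(20, 8, 1482, 4))
print(check_stream_settings(58, 48, 1482, 4))
```

At this scale the two guidelines conflict, and the customer deliberately favored one thread per target over the threads-per-core ratio.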

It depends on how many you are running and how many local disks you have in the host. Typically it is fine, but you may run into IOPS issues if you have too many on there; that can be solved by adding a few SSDs if you are set on using local storage.

This was an outstanding post and very helpful for me. I have a question about the storage of the vDisk images. The other blog and other white papers say that we should use block-level storage for the PVS disk that houses the VHD files. I am using NFS for my storage repositories and was wondering whether that counts as block-level storage, since it is a file on the NFS SR and not a true LUN, and whether it would make a difference in a small environment. I am trying to shoehorn this into environments of fewer than 100 users, and making the numbers work has been hard. I like VDI-in-a-Box, but I LOVE PVS 6, and image management is so much easier than with VDI-in-a-Box. I am about to deploy an 8-user XenDesktop on a single host and am planning to virtualize all of it. Exchange is in the cloud, so I feel comfortable with that. The 8 users only use IE, so the load will be very light. So should I set up an iSCSI LUN for the PVS vDisk store, or just use a thin-provisioned NFS disk?

I also have a question about vDisk storage: if you do not want to use local storage or NFS, can the vDisks be placed on VMFS or an iSCSI or FC LUN, with one or more PVS servers accessing it for HA? How would this work, or is NFS the way to go?

Hello, I hope this thread has not gotten too old; it’s a great post. If this were a forum rather than a blog, it would be “pinned”.

I have a question on the configuration above, on the hosts that have 1482 machines on them, and that you say you think would go to 2000 or 3000, what are you using for your IP subnets?

I am assuming you are using something like a /21 (2048 addresses, 2046 usable hosts); for 3000 you would need to go to something like a /20 (4096 addresses).
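The subnet arithmetic can be checked with Python’s standard `ipaddress` module. A minimal sketch that finds the smallest IPv4 subnet with enough usable addresses (total minus the network and broadcast addresses) for a given number of targets; it ignores the gateway, the PVS servers, and any other infrastructure that also needs addresses, and the 10.0.0.0 network is just an example.

```python
import ipaddress


def smallest_prefix_for(hosts: int) -> int:
    """Longest IPv4 prefix whose usable address count covers `hosts`."""
    for prefix in range(30, 0, -1):
        usable = 2 ** (32 - prefix) - 2  # minus network and broadcast addresses
        if usable >= hosts:
            return prefix
    raise ValueError("too many hosts for a single IPv4 subnet")


for targets in (254, 255, 1482, 3000):
    net = ipaddress.ip_network(f"10.0.0.0/{smallest_prefix_for(targets)}")
    print(f"{targets} targets -> {net} ({net.num_addresses - 2} usable)")
```

A /24 tops out at 254 usable addresses, 1482 targets fit in a /21, and 3000 need a /20.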

I ask because I am in an environment where we successfully deployed several hundred physical CAD workstations using PVS. I am using a pair of physical blade servers on 10Gig, and 10Gig or multiple-1Gig links to the edge switches in the closets and of course Gig to the desktops.

Now we would like to expand this environment, but of course if I go beyond the 254 usable hosts in a /24 subnet, I have some decisions to make. I don’t know if our network group will like it.

We currently still have NetBIOS and WINS active, which I think we can eliminate, but I would be worried about broadcasts in general on such large subnets. Was this something the team in question considered?

To my knowledge, you still can’t easily get a Provisioning Server working with multiple NICs (each in a different VLAN) because of limited multi-NIC support in the PXE portion of the solution, and I want to avoid complex or problematic setups. But I would be interested to hear if this has been addressed.

Aside from that, if I can efficiently run multiple virtual PVS servers across a few physical hosts so that I have, say, one pair per subnet, I gain a little more flexibility. I am getting some new HP Gen8 blades that support SR-IOV and 256 GB of RAM or more, so I could give 30–50 GB of RAM to each virtual PVS server.

To avoid the overhead of copying lots of images whenever I need to make an update, I was going to look into the Melio product so that I could have, say, 10 PVS servers that all “see” the same storage.