
After a few weeks and for several reasons (professional and non-professional), I have finally resumed writing my vSphere Replication 6.0 series. Part 2 focuses on some network design options and how they might impact the SLAs defined for the Recovery Point Objective (RPO).

-> As expected, the throughput dropped by nearly 50%, to around 440 Mbit/s.

I know that these two results depend on the specific characteristics of my homelab environment. The reason I wrote them down was to create awareness that the network design decision has an impact on replication performance, and therefore possibly on whether you can meet an SLA or not.

Let’s make a short calculation within a small scenario.

RPO – Recovery Point Objective: how much data may be lost during a failure. This value is configured during the setup of a replication job and defines the time interval within which each replication run must be started.

Number of machines: 15

VM size: 100 GB = 102,400 MB

Max average daily disk-change rate: 5%

Max replication transfer rate, option 1: 901 Mbit/s = 112.625 MB/s

Max replication transfer rate, option 2: 440 Mbit/s = 55 MB/s

The initial replication time can be calculated with the following formula:

initial replication time = (number of VMs × VM size) / max replication transfer rate

and will take the following amount of time in our scenario:

Option 1: 15 × 102,400 MB / 112.625 MB/s ≈ 13,638 s ≈ 3.8 hours

Option 2: 15 × 102,400 MB / 55 MB/s ≈ 27,927 s ≈ 7.8 hours

To meet an SLA, we are in most cases more interested in how long the ongoing (delta) replication will take. With a 5% daily change rate, the changed data amounts to 76,800 MB per day:

Option 1: 76,800 MB / 112.625 MB/s ≈ 682 s ≈ 11.4 minutes

Option 2: 76,800 MB / 55 MB/s ≈ 1,396 s ≈ 23.3 minutes

So if you have an RPO defined as 15 minutes, there is a risk of not meeting the SLA with option 2.
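The calculation above can be sketched as a short script. This is my own illustration, not part of any VMware tooling; the numbers are taken from the scenario table.

```python
# Rough replication-time model for the scenario above.

def replication_time_seconds(data_mb: float, rate_mb_per_s: float) -> float:
    """Time to push data_mb through a link with rate_mb_per_s."""
    return data_mb / rate_mb_per_s

VMS = 15
VM_SIZE_MB = 100 * 1024          # 100 GB per VM
CHANGE_RATE = 0.05               # 5% daily disk-change rate
RATE_OPTION_1 = 901 / 8          # 901 Mbit/s ≈ 112.625 MB/s
RATE_OPTION_2 = 440 / 8          # 440 Mbit/s = 55 MB/s

total_mb = VMS * VM_SIZE_MB              # 1,536,000 MB
daily_delta_mb = total_mb * CHANGE_RATE  # 76,800 MB changed per day

for label, rate in [("Option 1", RATE_OPTION_1), ("Option 2", RATE_OPTION_2)]:
    initial_h = replication_time_seconds(total_mb, rate) / 3600
    delta_min = replication_time_seconds(daily_delta_mb, rate) / 60
    print(f"{label}: initial ≈ {initial_h:.1f} h, daily delta ≈ {delta_min:.1f} min")
```

Plug in your own change rate and link speed to see whether a given RPO is realistic at all.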

Maybe I am repeating myself, but this is just an example calculation (and depending on the use case, the limiting factor will be the link between the protected and the recovery site). Nevertheless, you need to be aware of the following relevant metrics when you design replication:

replication-throughput

change-rate

number and size of your VMs.

In production we don’t want to receive an RPO violation alarm (technically or from the service manager ;-)). If you can’t meet the requirements in a theoretical calculation, you will not be able to meet them during daily operations.

Which tools can we use to gather the above metrics? Replication throughput via ESXTOP (network view: n); number and size of your VMs via PowerCLI (if you haven’t worked with PowerCLI so far, this is a great starting task for it ;-).

For gathering data about the data change rate within a VM, I refer to a PowerCLI script created a few years ago by Scott Herold (his name was in the comments) that uses the change-block-tracking mechanism. I found the script via Google and you can download it here (Download: CBT_Tracker – Howto). Needless to say, you should understand it (and its influence on your system – it uses CBT and snapshots – see the comments within the script) and test it first before you use it for your analysis.

Compression – The X-Files continues

As I have already said, VMware has included a new compression mechanism in 6.0 to speed up the initial copy job. During my first tests (setup 1 with compression enabled) I saw higher CPU utilization (which is expected on my vSphere Replication appliance), but also lower throughput of the replication data. I am totally unaware of what went wrong here. I will try to figure out more about this effect and keep you informed ;-). If you have any ideas/hints about what went wrong in my setup, please comment or contact me via Twitter (@lenzker).

vSphere Replication is a really cool application that helps us replicate our virtual machines at the VM level without the need for dedicated, replicating storage. Besides the classic replication scenarios, it can also be used to replicate to a public cloud provider (call it DaaS or whatever 😉) like VMware vCloud Air (I am still lacking deep technical knowledge of VMware’s hybrid cloud approach; it will be a separate blog post once I know more ;-).

In the following I want to give an architectural overview of the new version 6.0 of vSphere Replication. I realized during some of my Site Recovery Manager classes that people might get confused by some terminology and maximums mentioned in KB articles, so I wanted to create something that clarifies all of those things.

This article is the first part of the vSphere replication 6.0 series (not sure if series is the right word if only 3 episodes are planned 😉 )

Components required for vSphere replication

The general task of the vSphere replication appliance is to get data (VM files and changes) from the vSphere agent of a protected ESXi host and transfer it to the configured recovery ESXi host (via a mechanism called Network File Copy – NFC).

Now it might get a little bit confusing. The appliance we download is in fact two different appliances: two different OVF files pointing to the same disk file.

1x OVF (vSphere_Replication_OVF10.ovf), which is the virtual machine descriptor for the vSphere Replication (manager) appliance – used for SRM or standalone vSphere Replication – 2 or 4 vCPUs / 4 GB RAM

1x OVF (vSphere_Replication_AddOn_OVF10.ovf), which is the virtual machine descriptor for the vSphere Replication server – can be used to balance the load and increase the maximum number of replicated VMs – 2 vCPUs / 512 MB RAM

vSphere replication (manager) appliance

The vSphere replication (manager) appliance is the ‘brain’ of the vSphere replication process and is registered with vCenter so that the vSphere Web Client is aware of the new functionality. It stores the configuration data in the embedded PostgreSQL database or in an externally added SQL database. The VMware documentation typically talks about the vSphere replication appliance; to make sure not to mix it up with the replication server, I put ‘(manager)’ in the terminology. The vSphere replication (manager) appliance also includes the first vSphere replication server. Only one vSphere replication appliance can be registered with a vCenter, and it theoretically supports up to 2000 replications if we have 10 vSphere replication servers in total. Please be aware of the following KB if you want to replicate more than 500 VMs, since minor changes to the appliance are mandatory.

vSphere replication server

The vSphere replication server is responsible for the actual replication job (gathering data from the source ESXi host and transferring it to the target ESXi host). It is included within the vSphere replication appliance and can effectively handle 200 replication jobs. Even though I have read in some blogs that it is only possible to spread the replication load over several vSphere replication servers in conjunction with Site Recovery Manager, it works out of the box without the need for SRM.
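Given those maximums, a quick back-of-the-envelope check tells you how many replication servers a deployment needs. The helper below is my own sketch based on the numbers above (≈200 replications per server, 2000 per vCenter):

```python
import math

MAX_PER_SERVER = 200      # replications one vSphere replication server handles
MAX_PER_APPLIANCE = 2000  # supported maximum per vCenter (10 servers in total)

def servers_needed(replicated_vms: int) -> int:
    """How many vSphere replication servers (including the embedded one)."""
    if replicated_vms > MAX_PER_APPLIANCE:
        raise ValueError("exceeds the supported maximum per vCenter")
    return math.ceil(replicated_vms / MAX_PER_SERVER)

print(servers_needed(450))   # 3: the embedded server plus two add-on appliances
```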

Sample Architecture

The following picture should illustrate the components and traffic flow during the vSphere replication process.

The following diagrams show two sample architectures regarding the network.

In the first diagram the vSphere replication network MUST be routed/switched at layer 3, while in the second example we are able to stay in a single network segment with our replication traffic (thanks to the new VMkernel functionalities for NFC/replication traffic in vSphere 6).

Option 1: External routing/switch mandatory (would be a good use case for the internal routing of NSX ;-)):

Option 2: No routing mandatory & switching occurs within the ESXi

Of course, those are only two simple configuration samples, but I want to make you aware that the (virtual) network design has an impact on the replication performance in the end.

I will focus on the performance difference between those two options (and the usage of the compression mode within an LAN) in part 2. Stay tuned 😉

When I first installed vSphere 6.0 I was pretty impressed by the performance gain of the vSphere Web Client. Finally the Web Client is a tool I can work with productively without being afraid of being seen as unproductive by my customers (it’s tough to justify a higher hourly rate if I spend 20% of my time waiting for the UI 😉 ).

So my homelab was installed with vSphere 6.0 and I tried to connect to it via VPN from my hotel wifi. Since the wifi was blocking my VPN attempts I was forced to tether/share the internet via my smartphone.

Sharing internet on… starting OpenVPN against my homelab… opening Chrome to use the Web Client and… the usability of the Web Client 6.0 was really, really good.

After a few minutes I received a warning from my provider T-Mobile that my data plan had reached the 80% threshold. I know 500 MB included in my data plan is not that much, but still I was really surprised to see the OpenVPN statistics after just a few minutes.

Since I haven’t used any other services than the vSphere Web Client I wanted to know how much bandwidth working in a local Browser via the Web Client really needs.

I created a test-case (it’s Sunday, weather is bad, Bundesliga is pausing) which should take around 3-4 minutes:

Login to the vCenter via the Web Client

Navigate around in the home menu

Select Hosts and Clusters

Choose an ESXi Host and navigate to Manage / Settings / Networking

View the vSwitch, virtual adapter and physical NIC settings

Go to related Objects and select a datastore

Browse the content of the datastore

Create a new VM with default settings

Power on the VM

I did this test with two browsers, Chrome and Firefox (to make sure the results are nearly identical), and observed the results via the Activity Monitor of macOS. As a third alternative I chose a remote connection via Microsoft Remote Desktop (native resolution, 16-bit color) and performed the same test-case steps mentioned above.

Here are the results:

Chrome – duration: <4 minutes, bandwidth: ca. 21 MB

Firefox – duration: < 4 minutes, bandwidth: ca. 26 MB

RDP – duration: < 3.5 minutes, bandwidth: ca. 2 MB

Of course there are a lot of factors not considered (high activity on the vCenter would most certainly increase the data volume), but the numbers should give you a feeling that the better performance of the Web Client seems to come hand in hand with pretty bandwidth-intensive client-side caching. So if you work with limited bandwidth or any kind of throughput limitation, use an RDP connection to a control-center VM within your environment and run the Web Client there for your daily vSphere operations.
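To put the measurements into perspective against a capped data plan, here is a tiny back-of-the-envelope script (my own illustration; the per-session values are the rough averages measured above):

```python
# How long does a 500 MB data plan last with the measured traffic rates?
# Rates derived from the test case: ~21 MB per 4-minute Web Client session,
# ~2 MB per 3.5-minute RDP session.

def minutes_until_cap(cap_mb: float, session_mb: float, session_min: float) -> float:
    """Minutes of continuous use until the data cap is reached."""
    return cap_mb / (session_mb / session_min)

print(f"Web Client: {minutes_until_cap(500, 21, 4):.0f} min")
print(f"RDP:        {minutes_until_cap(500, 2, 3.5):.0f} min")
```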

Sometimes I love being an instructor. 2 weeks of Optimize and Scale and finally I have more valid and realistic values from 2 participants of mine regarding performance vs. power usage.

First of all, thanks to Thomas Bröcker and Alexander Ganser, who not only discussed this topic with me but also ran this experiment in their environment. I am proud that I seem to have motivated Alexander to blog about his findings in English :-). While he focuses in his post on hosting server applications on Dell/Fujitsu hardware (-> please have a look at it), I will extend this information with data from an HP-based VDI environment, where the impact on performance, power usage and costs was much higher than I had expected.

The trend of green IT has not only had an effect on more efficient consumer CPUs, it is also becoming more and more of a trend in modern datacenters. Hosts are powered down and on automatically (DPM – productive users of this feature, please contact me 😉 ), CPU frequencies are changed dynamically, or cores are disabled on demand (core parking). Since I always recommend NOT using any power management features in a server environment, I am now following up on this topic with suitable and realistic numbers from a production environment.

A few details about the setup and the scenario I am going to talk about. For my calculations later on I selected a common VDI size of around 1000 Windows 7 virtual machines.

VDI: 1000

Number of ESXi (NoESXi): 20 (vSphere 5.5 U2)

CPU-type: 2x Intel Xeon E5-2665 (8 cores, 2.4–3.1 GHz, TDP 115 W)

vCPU per VM: 4 (pretty high for regular VDI, but multimedia/video capability was a requirement; on average 80% of the VDIs have active users)

vCPU / Core rate: 12.5

A few comments on the data. Intranet video quality was miserable with the initial VM sizing (1 vCPU). We took a known and approved methodology along the two performance-affecting dimensions:

1st dimension: sizing of a virtual machine (is the virtual hardware sufficient for the proposed workload?) – verified by checking whether the end user is satisfied with the performance (embedded videos play fluently).

As a baseline approach we defined that an intranet video needs to run fluently and ESXTOP metrics %RDY (per vCPU – to determine a general scheduling contention) and %CO-STOP (to determine a scheduling difficulty because of the 4vCPU SMP) were not reaching a specific threshold (3% Ready / 0% CO-STOP) during working hours. *

// * Of course we would run into resource contention once every user on this ESXi host watches a video within the virtual desktop, resulting in a much higher %RDY value.

So far so good. The following parameters describe dependent variables for the power costs of such an environment. Of course the metrics used can differ between countries (price of energy) and datacenter types (cooling).

Power usage per host: This data was taken in real time via the iLO of an HP DL380 G8 and describes the current power usage of the server. We tested the following energy-saver settings (they can be changed during runtime and take effect immediately):

HP Dynamic Power Savings

Static High Performance Mode

Climate factor: A metric defining how much additional power is required to cool down the IT systems within a datacenter. This varies a lot between datacenters; as a source I refer to Centron, who did an analysis (in German) concluding that the factor is between 1.3 and 1.5, which means that for every 100 watts used by a component we need 30–50 watts of cooling energy. The value I will use is 1.5, and it can differ a lot in each datacenter.

Power price: This price will differ the most between countries, depending on regulations. The price is normed per kilowatt-hour (kWh), i.e. how much you pay for 1000 watts of power usage over 1 hour. Small companies in Germany pay around 25 cents per kWh, while large enterprises with a huge power demand pay much less (around 10 cents per kWh).

Data was collected during a workday at around 11 AM – Friday. We assume that the data is taken during a regular office-hour workload.

The effect of the power-saving mode is very pronounced in a VDI environment and far smaller when the ESXi host is used for server virtualization (I refer back to Alexander Ganser’s blog post, since we observed nearly the same numbers for our servers). Server virtualization has a higher constant CPU load, while the VDI workload pattern is much more bursty and gives a CPU more chances to quiesce down a little. We observed around 10% power savings in the server field.
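To turn the parameters above into money, here is a hedged cost sketch for the 20-host cluster. The per-host wattages are placeholders of mine, NOT the measured iLO values; plug in your own readings:

```python
# Yearly power cost of the cluster, including cooling overhead.
HOSTS = 20
CLIMATE_FACTOR = 1.5      # 50% extra power for cooling (see above)
PRICE_PER_KWH = 0.25      # EUR, small-company rate in Germany
HOURS_PER_YEAR = 24 * 365

def yearly_cost_eur(watts_per_host: float) -> float:
    kw_total = HOSTS * watts_per_host * CLIMATE_FACTOR / 1000
    return kw_total * HOURS_PER_YEAR * PRICE_PER_KWH

# Assumed example wattages: 330 W in static high performance mode,
# 280 W with dynamic power savings.
saving = yearly_cost_eur(330) - yearly_cost_eur(280)
print(f"Yearly saving for 50 W less per host: {saving:.0f} EUR")
```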

So now let’s go a step further and compare the influence of the energy-saving mode on performance.

As you can see, the power mode has a direct impact on the ready values of our virtual machines’ vCPUs. At the end of the day the power savings have little financial impact in the VDI field; still, I always recommend deactivating ALL power-saving mechanisms, since I always try to ensure the highest performance.

Especially in the VDI field, with its irregular, sudden CPU spikes, the wake-up / clock increase of a core takes too much time, and if you read the VMware community forums on a regular basis you will see that a lot of strange symptoms are very often resolved by disabling energy-saving mechanisms.

Please be aware that those numbers may differ in your environment depending on your server, climate-factor, consolidation-rate, etc.

I have been working on a project including vCloud Director as well as most other parts of VMware’s cloud stack for a while now. Until a couple of days ago, everything was running fine regarding the deployment process of vApps from the vCloud Director UI or through vCenter Orchestrator. Then we noticed that starting and stopping vApps takes way too long: powering on a single-VM vApp directly connected to an external network takes three steps in vCenter:

Reconfigure virtual machine

Reconfigure virtual machine (again)

Power On virtual machine

The first reconfigure task showed up in vCenter right after we triggered the vApp power-on in vCloud Director. From then on it took around 5 minutes to reach step two. Once step two was completed, the stack paused for another 10 minutes before the VM was actually powered on. This even seemed to have implications for vCenter Orchestrator, including timeouts and failed workflows.

We spent an entire day trying to track the problem down and concluded that it had to be inside vCloud Director. But before we dug into log files, message queues, etc., we decided to simply reboot the entire stack: BINGO! After the reboot the problem vanished.

Shutdown Process:

vCO

vCD

vCD NFS

VSM

vCenter

SSO

DB

Then boot the stack in reverse order and watch vCloud Director powering on VMs within seconds 😉

I remember reading the paper “Virtualizing Performance Counters” by Benjamin Serebrin and Daniel Hecht some time ago. I guess that was when I was researching for the other article about selecting the MMU virtualization mode. At the time, I was wondering “when will virtual hardware performance counters be implemented?” and now here we are: 5.1 supports it. In this post, I would like to give you some information on how to use virtual hardware performance counters and how you can use them to more easily make the decision on the VMM mode.

(Virtual) Hardware Performance Counters

Hardware performance counters – often referred to as PMCs (performance monitoring counters) – are a set of special-purpose registers a physical CPU provides in order to facilitate the counting of low-level, hardware-related events that occur inside the CPU. Giving deep insight into CPU activity, PMCs are usually utilized by profilers. Until now, the VMM in VMware virtual machines did not emulate PMCs for vCPUs, rendering profilers in VMs useless in a lot of scenarios. With ESXi 5.1, VMware eradicated this flaw by introducing vPMC. For the technical folks out there, I recommend reading the paper mentioned above. It will give you good insight into the implications and challenges that go along with the technology.

Requirements and Prerequisites

VMware’s KB article 2030221 gives the full list of prerequisites to meet before you can make use of vPMC. Let me summarize this for you:

ESXi 5.1 or later

VM with HW9 or later

Intel Nehalem or AMD Opteron Gen 3 (Greyhound) or later

enable vPMC in vSphere Web Client (it does not show in vSphere Client!).

The ESXi host must not be in an EVC-enabled DRS cluster

Enabling vPMC for a VM adds further sanity checks before moving that VM with vMotion: the target host must support the same PMCs, too; otherwise, vMotion will not be possible.

Freeze Mode

The freeze mode is an advanced VM setting that allows you to specify when PMC events should be accounted to the vPMC value. During the runtime of a VM, instructions can be executed …

directly by the guest or

by the hypervisor on behalf of the guest.

The freeze mode defines which hardware events should be accounted for in which of the above situations.

guest: Only instructions executed directly by the guest are accounted for; the counters are frozen while the hypervisor executes on behalf of the VM.

vcpu: Instructions the hypervisor executes on behalf of the VM are accounted for, too.

hybrid: Events counting the number of instructions and branches are handled as if in guest freeze mode. Otherwise, vcpu freeze mode behavior is used.

If you wish to change that behavior, set the vpmc.freezeMode option to “hybrid”, “guest” or “vcpu” (hybrid is the default).
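For example, to count only instructions executed directly by the guest, the advanced parameter in the VM’s configuration (.vmx) would look like this (a sketch; verify the exact option and values against KB 2030221 for your version):

```
vpmc.freezeMode = "guest"
```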

Reading Metrics in Linux

Counting hardware performance metrics with linux is quite easy. On Debian Squeeze install the “perf” tool by executing the following line:

apt-get install linux-base linux-tools-`uname -r`

Perf is quite a powerful tool for profiling, and such tools are usually just as complex to use. A good overview of the command’s usage can be found here. To get a full list of PMCs supported by perf, type

perf list

Not all counters listed might be supported by HW9 though. To analyze a certain process for a specific event use

perf stat -e EVENT COMMAND

where EVENT is the name of the event as listed in “perf list” and COMMAND is the command to execute to create the process to monitor.

Using vPMC for Selecting the MMU Virtualization Mode

Recently, I posted an article about selecting the proper MMU virtualization mode for a certain application. It suggested to conduct an isolated experiment with the VM in question and to use vmkperf to monitor the number of TLB misses. This procedure is quite cumbersome as “isolated experiment” means to set up a dedicated ESXi host with a single VM and the application to monitor just for the testing, which might not even be possible if the application is already running productively.

Luckily, vPMC allows access to the TLB miss event! That means we can now perform the testing of a VM sharing an ESXi host with other VMs! That gives us a much easier way to find out whether we need to change the MMU virtualization mode back to shadow page tables:

The script executed is the same one already used in the other article, but this time we measure the TLB miss count from within the VM using perf. The results show the same: a much higher number of TLB misses using nested paging. The consequences of this were already explained in the other article.

Unfortunately, I could not quickly find an easy way to monitor the same counters on Windows systems. I will keep looking and keep you posted, as I understand Windows is the more commonly used platform out there – as much as I hate admitting it.

Acknowledgements

Thank you, Thomas Glanzmann, for providing recent hardware to test this new feature!

This information is not really new and is also well documented by VMware, but nevertheless I meet a lot of people with a VMware View environment who are not aware of the following. Since it took me some time to find the specific part in the documentation, I decided to write a short article here.

Important: Changing the registry values for the jvm heap size is NOT recommended.

Since Connection Server 5.1 is only supported on Windows Server 2008 R2, which in turn is only available as a 64-bit version, this is something administrators need to take care of once they decide to raise the amount of connection server memory.

Reinstalling the connection server is especially important when crossing the 10 GB memory threshold. This will increase the JVM heap size to 1024 MB, which allows the connection server to handle up to 2000 connections.

Hi folks! In this article, I would like to show you how to optimize the VMM mode for certain memory workloads. We will be using the widely unknown command line tool vmkperf to make an appropriate decision on the VMM mode for a specific VM. I assume you already have a basic knowledge about the two techniques for MMU (memory management unit) virtualization named shadow page tables and nested paging (aka Intel EPT or AMD RVI). If so, you are probably aware that you have the opportunity to intervene in the automatic mode selection the VMM performs upon VM start-up using vSphere Client. Set to “Automatic”, the VMM by default chooses the appropriate combination of CPU and MMU virtualization modes, usually trying to use hardware acceleration if available.

Selecting nested paging for the MMU is not always the best choice, depending on the memory workload inside the guest OS. A TLB miss (translation look-aside buffer) causes a double page walk to perform the necessary address translation (virtual to guest-physical, then guest-physical to host-physical), whereas without nested paging only a single page walk is necessary. As a result, frequent TLB misses – in either mode – have a negative impact on performance. The frequency at which TLB misses occur depends on the size of the TLB (the TLB caches address translations), the use of nested paging (with nested paging, twice the number of translations has to be stored) and the memory workload of the application. So, in order to tell whether an application would be better off using shadow page tables, you have to monitor the number of TLB misses with and without nested paging and compare the results.

Hardware Performance Counters

Most modern CPUs provide special-purpose registers used for monitoring hardware events. These counters have the advantage of providing insights into events that could not be gained using ordinary profiling. The list of events that can be monitored depends on the CPU model and vendor; events are identified by an event select and a unit mask value. Those values can be taken from the chip manufacturer’s documentation. So far, I have only used this way of monitoring with Intel CPUs, so I can only provide this link right now:

You will probably easily be able to find the equivalent document for AMD.

To monitor the number of TLB misses caused by memory load instructions you need the MEM_LOAD_RETIRED.DTLB_MISS event. The corresponding event select value is CB, the unit mask is 10 (both hexadecimal numbers).

vmkperf

ESXi provides a command line tool that allows us to monitor such events. It is the equivalent to the perf command on Linux systems used for exactly the same purpose. The vmkperf command is only accessible through a SSH connection to the ESXi host directly, not via vCLI or vMA. The tool allows you to monitor some pre-configured events, too, but TLB misses are not among them, so you have to manually specify the event select and unit mask for that event (taken from the documentation mentioned above).

~ # vmkperf start tlb_misses -e 0xcb -u 0x10

This triggers the system to start monitoring the event referred to by 0xcb, 0x10 and will make it available under the name “tlb_misses”. That name can be anything you like; it is used only for identification. The system will now count the number of occurring TLB miss events until you stop the process:

~ # vmkperf stop tlb_misses

After that, the CPU will stop counting events and all gathered data will be lost, so be sure to dump all information you need to a file prior to stopping the monitor.

Vmkperf has two sub-commands to read the actual values: read gives you the current values at that moment, while poll shows the same information but recurs at a given interval:

~ # vmkperf read tlb_misses

~ # vmkperf poll tlb_misses -i 1 -n 5

The second command starts vmkperf polling the tlb_misses event at an interval of 1 second, 5 times. Polling an event looks like this:

As you can see, the TLB miss counts are displayed for each pCPU including hyper-threads (2 sockets * 6 cores * 2 hyper-threads = 24 pCPUs). I admit this is not the most convenient format for further analysis, but it is fairly easy to write a Perl script to do the conversion. Here is something that converts the output to CSV, which allows you to open the file in the spreadsheet software of your choice. [Download]
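If Perl is not your thing, the same conversion can be sketched in a few lines of Python. This is my own illustration, not the downloadable script; it assumes each poll iteration prints one line of whitespace-separated per-pCPU counts, so adjust the parsing to the exact vmkperf output of your build:

```python
import csv
import io
import re

def vmkperf_to_csv(raw: str, pcpus: int = 24) -> str:
    """Convert captured `vmkperf poll` output to CSV, one column per pCPU."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([f"pcpu{i}" for i in range(pcpus)])
    for line in raw.splitlines():
        fields = re.findall(r"\d+", line)
        # Keep only lines that contain exactly one count per pCPU;
        # header and banner lines are dropped.
        if len(fields) == pcpus:
            writer.writerow(fields)
    return out.getvalue()
```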

Isolated Experiment

In the section above, I showed you that you get the number of TLB miss events per pCPU, not per VM. Therefore, the values comprise the events generated by multiple VMs, as virtual machines have to share physical CPUs. In a perfect world, I would set up a completely isolated host with a single VM running on it. In that case, you would get very good results for that VM, but as you know, the world is not perfect. The best I could do with the hardware available was to free that host from VMs as much as possible and configure CPU affinity for the VM of interest. This is to avoid the scheduler’s load-balancing mechanism kicking in and potentially spreading the TLB misses across multiple CPUs. At least this way I can be sure to have all TLB misses created by that VM accounted to the same pCPU. I pinned the VM to pCPU 23 (the last pCPU).

The workload inside the VM is not a real-world workload. Instead, I used a Linux VM with a Perl script to put the TLB under stress: tlbstress.pl_1

According to the values collected, the tlbstress.pl script creates around twice as many TLB misses with nested paging compared to shadow page tables. With around 5,000,000 TLB misses per second, the performance loss has to be tremendous!

In the Real World …

In the real world, you might not see values that conclusive. The number of TLB misses will most likely be lower and the difference between the two might not be that high. Further, the way I performed this experiment (CPU affinity) is not very applicable to the real world for several reasons:

In a DRS cluster, you cannot use CPU affinity at all.

The VM to monitor will likely have more than 1 vCPU which means TLB misses are spread across at least 2 pCPUs.

The difference between the measured values will not be as high, so other VMs will have a greater influence on the results.

So I suggest an isolated experiment conducted for each VM under inspection! A big problem is the effort it takes to perform the monitoring and analysis. You can hardly do that for every single VM in your environment. You have to filter which VMs are worth taking a close look at. Very generally speaking, applications that use a large amount of memory evenly (a similar number of accesses to all addresses) will likely cause many TLB misses (but the number might still be so small that changing the mode hardly makes a difference). Java-based web applications and databases are possible candidates for further inspection.

Last week I was asked about the estimated bandwidth requirement for a VMRC based console connection through a vCloud Director cell. Well, I did not know at the time, so I set up a little test environment. The results I want to share with you now.

In vSphere the question for the bandwidth consumption of a vSphere Client console window is rather pointless. Unless we are talking about a ROBO (remote offices and branch offices) installation, console connections are made from within the company LAN where bandwidth is much less of an issue.

Figure 1: Remote console connection with vSphere Client.

The fat dashed line indicates the connection made from vSphere Client directly to ESXi in order to pick up the image of a virtual machine’s console.

With vCloud Director things are a bit different: Customers have access to a web portal and create console connections to their VMs through the VMRC (virtual machine remote console) plug-in. Though the plug-in displays the console image, the connection to ESXi is not initiated by it. Instead the VMRC connects to the vCD cell’s console proxy interface. vCD then connects to ESXi. This means a vCD cell acts as a proxy for the VMRC plug-in.

Figure 2: Remote console through the vCloud Director web portal.

Of course, the browser could be located inside the LAN, but especially in public cloud environments this traffic will flow through your WAN connection.

Testing the Bandwidth

The bandwidth consumed by a single remote console connection depends on what is being done inside the VM. So, in my tests I monitored bandwidth in three different situations:

writing some text in notepad

browsing the internet

watching a video

Of course, the configured screen resolution and color depth has to be considered, too. But this is not going to be a big evaluation of the performance but rather an attempt to give you – and myself – an impression and rough values to work with.

To get the actual values, I used the Advanced Performance Charts to monitor the console proxy NIC of my vCloud Director cell:

Figure 3: Network performance of a vCD cell during a VMRC connection.

I started the testing after 8 PM, so please ignore the spikes on the left. The first block of peaks after 8 PM is the result of writing text in notepad. I did not use a script to simulate this workload, which is probably the reason why the values are not very constant. Towards the end, I reached a fairly high number of keystrokes per second – probably higher than the average. The estimated average bandwidth is around 1400 KBps. After that, I started a YouTube video. The video was high resolution, but the player window remained small. Still, I reached an average of maybe 3000 KBps! Browsing a few web sites and scrolling the browser window seems to create a slightly lower amount of network I/O. Most likely, a realistic workload includes reading sections before scrolling, so the bandwidth consumption would be even lower than the measured average of – let’s say – 1600 KBps.

As we have seen, the protocol used for the VMRC connection is not a low-bandwidth implementation. When implementing your cloud, you should definitely keep that in mind. A single VMRC connection does not harm anyone, but several tens of concurrent connections might congest your WAN connection, depending on what you have there. A single customer could even influence the performance of another!
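To get a feeling for the numbers, here is a quick back-of-the-envelope sketch. The per-connection values are the rough averages measured above; the connection mix and the WAN capacity are made-up example figures:

```python
# Rough per-connection averages measured above (KBps)
KBPS_PER_CONNECTION = {
    "typing": 1400,    # writing text in notepad
    "browsing": 1600,  # web browsing with scrolling
    "video": 3000,     # small video player window
}

def aggregate_mbps(connections):
    """Sum the per-workload bandwidth (KBps) and convert to Mbit/s."""
    total_kbps = sum(KBPS_PER_CONNECTION[w] * n for w, n in connections.items())
    return total_kbps * 8 / 1000  # KBps -> Mbit/s

# Hypothetical mix: 10 consoles typing, 8 browsing, 2 watching video
demand = aggregate_mbps({"typing": 10, "browsing": 8, "video": 2})
print(f"{demand:.1f} Mbit/s")  # compare against e.g. a 100 Mbit/s WAN uplink
```

With this mix, 20 concurrent consoles already demand more than 260 Mbit/s – well beyond a typical 100 Mbit/s WAN uplink.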

How do we solve this? Well, if you have a problem with VMRC bandwidth, it is a limitation of your WAN connection. All you can do from vCloud Director’s side is set a limit on the maximum number of concurrent connections per VM:

Figure 4: Customer Policies: Limits

But this works only for connections to the same VM! A more professional solution would be an appliance placed in front of the vCD cells that performs traffic shaping per VMRC connection. Maybe your load balancer can do this!

Sitting on my sofa this morning watching Scrubs, I was thinking about the NUMA related considerations in vSphere – yes, I am a nerd. I read about this for the first time back in the days of vSphere 4.0, but it probably existed for much longer. Then it came to my mind that since vSphere 5.0 VMware supports configuring the number of sockets and cores per socket for a virtual machine, along with the 5.0 feature called vNUMA. I googled the topic for a while and found a bit of information here and there. I figured it was time to write a single article to completely cover the topic.

What is NUMA?

Let’s start with a quick review of NUMA. This is taken from Wikipedia:

Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.

This means that in a physical server with two or more sockets on an Intel Nehalem or AMD Opteron platform, we very often find memory that is local to one socket and memory that is local to the other. A socket, its local memory and the bus connecting the two components is called a NUMA node. Each socket is also connected to the other socket’s memory, allowing remote access.

Please be aware that an additional socket in a system does NOT necessarily mean an additional NUMA node! Two or more sockets can be connected to memory with no distinction between local and remote. In this case, and in the case where we have only a single socket, we have a UMA (uniform memory access) architecture.

UMA system: one or more sockets connected to the same RAM.

Scheduling – The Complete Picture

Whenever we virtualize complete operating systems, scheduling takes place at two levels: a VM is provided with vCPUs (virtual CPUs) for execution, and the hypervisor has to schedule those vCPUs across pCPUs (physical CPUs). On top of this, the guest scheduler distributes execution time on the vCPUs to processes and threads.
Figure: The two levels of scheduling.

So, we have to take a look at scheduling at two different levels to understand what is going on there. But before we go into more detail we have to take a look at a problem that might arise in NUMA systems.

The Locality Problem

Each NUMA node has its own computing power (the cores on the socket) and a dedicated amount of memory assigned to that node. Very often you can even see this by looking at your mainboard: you will find two sockets and two separate groups of memory slots.
Figure: A mainboard with two sockets and two separate groups of memory slots.

Those two sockets are connected to their local memory through a memory bus, but they can also access the other socket’s memory via an interconnect. AMD calls that interconnect HyperTransport, the equivalent of Intel’s QPI (QuickPath Interconnect) technology. Both names suggest very high throughput and low latency. Well, that’s true, but compared to the local memory bus they still lag far behind.

What does this mean to us? A process or virtual machine that was started on either of the two nodes should not be moved to a different node by the scheduler. If that happened – and it can happen if the scheduler is NUMA-unaware – the process or VM would have to access its memory through the NUMA node interconnect, resulting in higher memory latency. For memory-intensive workloads, this can seriously influence application performance! This is referred to as “NUMA locality”.

Small VMs on ESXi

ESX and ESXi servers have been NUMA-aware for a while now – to be exact, since version 3.5.

NUMA-awareness means the scheduler knows the NUMA topology: the number of NUMA nodes, the number of sockets per node, the number of cores per socket and the amount of memory local to a single NUMA node. The scheduler will try to avoid NUMA locality issues: ESXi makes an initial placement decision that assigns a starting VM to a NUMA node. From then on, the VM’s vCPUs are load balanced dynamically across the cores of that same socket.
Figure: NUMA scheduling – each VM is assigned to a home node.

In this example, VMs A and B were assigned to NUMA node 1, having to share cores on that socket. VM C is scheduled on a different node, so VMs A and B will not have to share cores with VM C. In the case of very high load on either socket, ESXi can decide to migrate a VM from one NUMA node to another. But that is not going to happen recklessly, as the price is very high: to avoid NUMA locality problems after the migration, ESXi will migrate the VM’s memory image, too. That puts high load on the memory bus and the interconnect and could influence the overall performance of the host. But if the perceived benefits outweigh the costs, it will happen.

In the figure above, the VMs are “small”, meaning they have fewer vCPUs than the number of cores per NUMA node and less memory than what is local to a single NUMA node.

Large VMs on ESXi prior to vSphere 4.1

Things start to become interesting for VMs with more vCPUs than the number of cores on a single socket. The hypervisor scheduler would have to let such a VM span multiple NUMA nodes. A VM like this is no longer handled by the NUMA scheduler – so no home node will be assigned. As a result, the VM’s vCPUs are not restricted to one or two NUMA nodes but can be scheduled anywhere on the system. Memory will be allocated from all NUMA nodes in a round-robin fashion, so memory access latencies will dramatically increase.

Figure 5: A large VM spanning two NUMA nodes.

To avoid this, it is the administrator’s job to make sure every VM fits into a single NUMA node. This includes both the number of vCPUs and the amount of memory allocated to the VM.
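A simple sanity check along those lines could look like this. The node sizes are made-up example values, and a real sizing exercise would also have to leave headroom for the hypervisor itself:

```python
def fits_numa_node(vm_vcpus, vm_mem_gb, cores_per_node, mem_per_node_gb):
    """True if the VM fits into a single NUMA node,
    both in vCPU count and in local memory."""
    return vm_vcpus <= cores_per_node and vm_mem_gb <= mem_per_node_gb

# Hypothetical host: 2 NUMA nodes, each with 6 cores and 48 GB local memory
print(fits_numa_node(4, 32, cores_per_node=6, mem_per_node_gb=48))  # True
print(fits_numa_node(8, 32, cores_per_node=6, mem_per_node_gb=48))  # False: spans nodes
```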

Wide-VMs since vSphere 4.1

Introduced in vSphere 4.1, the concept of a “Wide-VM” addresses the issue of memory locality for virtual machines larger than a single NUMA node. The VM is split into two or more NUMA clients which are then treated as if they were separate VMs handled by the NUMA scheduler. That means each NUMA client is assigned its own home node and limited to the pCPUs on that node. Memory will be allocated from the NUMA nodes the VM’s NUMA clients are assigned to. This improves the locality issue and enhances performance for Wide-VMs. A technical white paper provided by VMware goes into more detail on how big the performance impact really is.

As a result, chances of remote access and high latencies are decreased. But this is not the final solution because operating systems are still unaware of what is happening down there.

Scheduling in the Guest OS

Before vSphere 5.0, the NUMA topology was unknown to the guest OS. The scheduler inside the guest was not aware of the number of NUMA nodes, their associated local memory or the number of cores contained in a socket. From the OS’s perspective, every available vCPU was seen as its own socket, and all memory could be accessed from all sockets at the same speed. Due to this unawareness, a scheduling decision made by the OS could suddenly leave a well-performing process suffering from bad memory locality after it was moved from one vCPU to another.

In figure 5, the VM spans two NUMA nodes with 4 vCPUs on one node and 2 vCPUs on the other. The OS sees 6 single-core sockets and treats them all as scheduling targets of equal quality for any running process. But actually, moving a process from the leftmost vCPU to the rightmost vCPU migrates it from one physical NUMA node to another.
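The locality trap can be sketched in a few lines. The 4+2 vCPU split is the example from figure 5; the mapping table is what the hypervisor knows but the guest does not:

```python
# vCPU -> physical NUMA node for the figure-5 VM
# (4 vCPUs on node 0, 2 vCPUs on node 1)
VCPU_TO_NODE = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}

def crosses_node(old_vcpu, new_vcpu):
    """A NUMA-unaware guest scheduler treats all vCPUs as equal targets,
    but some moves silently cross a physical NUMA node boundary."""
    return VCPU_TO_NODE[old_vcpu] != VCPU_TO_NODE[new_vcpu]

print(crosses_node(0, 3))  # False: the process stays on node 0
print(crosses_node(0, 5))  # True: its memory is now remote
```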

vNUMA since vSphere 5.0

vNUMA exposes the NUMA topology to the guest OS, allowing for better scheduling decisions in the operating system. ESXi creates virtual sockets visible to the OS, each with an equal number of vCPUs visible as cores. Memory is evenly split across the sockets, creating multiple NUMA nodes from the OS’s perspective. With hardware version 8 for your VMs, you can use the vSphere Client to configure vNUMA per VM:

This results in two lines in the VM’s .vmx configuration file:

numvcpus = "8"
cpuid.coresPerSocket = "4"

Well, this is not the end of the story. I read this in the Resource Management Guide:

If the number of cores per socket (cpuid.coresPerSocket) is greater than one, and the number of virtual cores in the virtual machine is greater than 8, the virtual NUMA node size matches the virtual socket size.

The best way to understand this is to look into a Linux OS and investigate the CPU from there: I configured a Debian Squeeze 64-bit VM with 2 virtual sockets and 2 cores per socket using the vSphere Client, then used the /proc/cpuinfo file and a tool called numactl to gather the following info:

The numactl tool shows only a single NUMA node – I configured 2 virtual sockets in the vSphere Client, remember? Well, a socket does not necessarily mean a NUMA node (see above). From the OS’s perspective, this is a UMA system with 2 sockets.

Next, I configured the VM for 2 virtual sockets, 6 cores per socket. This time, we exceed 8 vCPUs, so Linux should see a NUMA system now. And it does:

As explained above, vNUMA kicks in from 9 vCPUs. To lower that threshold, configure the numa.vcpu.maxPerVirtualNode advanced setting for that VM. It defines the maximum number of vCPUs per virtual NUMA node and defaults to 8 – which is exactly why the guest sees NUMA only beyond 8 vCPUs.
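Based on the Resource Management Guide quote and the two experiments above, the resulting guest topology can be sketched roughly like this. This is a simplification of the actual ESXi logic, for illustration only:

```python
def guest_numa_nodes(vcpus, cores_per_socket, max_per_virtual_node=8):
    """Rough sketch: up to the threshold the guest sees a UMA system;
    above it, with cores_per_socket > 1, the virtual NUMA node size
    matches the virtual socket size (per the Resource Management Guide)."""
    if vcpus <= max_per_virtual_node:
        return 1  # UMA from the guest's perspective
    if cores_per_socket > 1:
        return vcpus // cores_per_socket  # node size = socket size
    return -(-vcpus // max_per_virtual_node)  # ceiling division

print(guest_numa_nodes(4, 2))   # 1 -> the 2x2 Debian test: UMA
print(guest_numa_nodes(12, 6))  # 2 -> the 2x6 test: NUMA visible
```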

Bottom Lines for Administrators

vSphere 4.0 and before:

Configure a VM with fewer vCPUs than the number of physical cores per socket.
Configure a VM with less memory than what is local to a single physical NUMA node.

vSphere 4.1:

Configuring a VM with more vCPUs than the number of physical cores per socket is a bit less of a problem, but there is still a chance of remote memory accesses.

vSphere 5.0:

Configuring 8 or fewer vCPUs for a VM does not change much compared to vSphere 4.1.
Assigning more than 8 vCPUs to a VM spread across multiple sockets creates virtual NUMA nodes inside the guest, allowing for better scheduling decisions in the guest.

For every version of vSphere, please note that the whole issue of memory latency might not even apply to your VM! For VMs with low memory workloads the question may be irrelevant, as the performance loss is minimal.