For this I take advantage of the built-in resource metering functionality that is part of Hyper-V.

Looking at this report – I realize I should probably filter out the replicated virtual machines (those are all the entries with zero data). I guess I will have to fix that at some point in the future. Regardless – here is the code that I use today:

$metricsData contains the output of “Get-VM | Measure-VM” (this is mentioned in my first post in this series). I do this because Measure-VM is a heavy command (it uses a chunk of CPU and disk), so I only want to run it once.

Once again – I use raw HTML to set the color of the table headers.

Again – I run the output of these commands through Select-Object with the use of the “Expression” option to set column labels appropriately.

Again – I use ConvertTo-Html -Fragment to output a nice HTML table.

At the end of this code I reset the counters, and enable metering on all virtual machines. I do this so that if I add any new virtual machines, they get picked up automatically.
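
The steps described above can be sketched roughly as follows. This is a minimal illustration rather than the exact script: the column labels and the header color are arbitrary choices, and it assumes the VMName, AvgCPU, AvgRAM and TotalDisk properties that Measure-VM reports.

```powershell
# Sketch only: run the expensive Measure-VM once and reuse the result.
# -ErrorAction SilentlyContinue skips VMs that do not have metering enabled yet.
$metricsData = Get-VM | Measure-VM -ErrorAction SilentlyContinue

# Raw HTML that colors the table headers (the color is an arbitrary choice).
$style = '<style>TH { background-color: #6495ED; }</style>'

# Calculated properties ("Expression") give the columns readable labels.
$table = $metricsData |
    Select-Object @{ Label = 'Virtual Machine';   Expression = { $_.VMName } },
                  @{ Label = 'Average CPU (MHz)'; Expression = { $_.AvgCPU } },
                  @{ Label = 'Average RAM (MB)';  Expression = { $_.AvgRAM } },
                  @{ Label = 'Total Disk (MB)';   Expression = { $_.TotalDisk } } |
    ConvertTo-Html -Fragment

$report = $style + ($table -join "`n")

# Reset the counters, then enable metering everywhere so that newly added
# VMs are picked up. VMs that were never metered have nothing to reset yet,
# so errors from the reset are suppressed.
Get-VM | Reset-VMResourceMetering -ErrorAction SilentlyContinue
Get-VM | Enable-VMResourceMetering
```

Keeping the Measure-VM result in a variable means the same data can feed several tables without paying the CPU and disk cost again.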

In my experience – the most common failure for my servers is a failed hard disk. Now, as I have multiple levels of redundancy configured in my storage configuration, it is not always obvious that a disk has failed. Luckily, it is very easy to get this information with PowerShell.

In fact, this is one of the primary reasons why I like using Storage Spaces: the great integration with PowerShell. Here is the code that I use to generate this table:

I am using “Get-PhysicalDisk”, “Get-StoragePool” and “Get-VirtualDisk” to gather the raw data.

Once again – I use raw HTML to set the color of the table headers.

Again – I run the output of these commands through Select-Object with the use of the “Expression” option to set column labels appropriately.

Again – I use ConvertTo-Html -Fragment to output a nice HTML table.

Again – I implement color coding for individual entries in the table. I set each table cell to be “red” by default. I then do some string parsing to see if the health is good – and switch the background color if I get a positive result.
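
As a rough sketch of the color coding for the physical disks (not the exact script): the helper name Add-HealthColor, the colors, and the choice to color whole rows rather than single cells are assumptions made for illustration.

```powershell
# Sketch only: color every data cell red by default, then switch rows that
# report a healthy state to green. Add-HealthColor is a hypothetical helper.
function Add-HealthColor {
    param([string[]]$HtmlLines)
    foreach ($line in $HtmlLines) {
        # Header rows and the table tags contain no <td> and pass through as-is.
        if ($line -notmatch '<td>') { $line; continue }
        $color = 'red'
        if ($line -match '<td>Healthy</td>') { $color = 'lightgreen' }
        $line -replace '<td>', "<td style='background-color:$color'>"
    }
}

# Gather the raw data and label the columns with calculated properties.
$disks = Get-PhysicalDisk |
    Select-Object @{ Label = 'Physical Disk'; Expression = { $_.FriendlyName } },
                  @{ Label = 'Health';        Expression = { $_.HealthStatus } },
                  @{ Label = 'Status';        Expression = { $_.OperationalStatus } } |
    ConvertTo-Html -Fragment

$diskTable = Add-HealthColor -HtmlLines $disks
```

The same pattern should work for the output of Get-StoragePool and Get-VirtualDisk, since both also expose FriendlyName, HealthStatus and OperationalStatus.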

I was tinkering around with my lab setup which consists of a domain, proxy server, primary and replica servers. There are some gotchas when it comes to Hyper-V Replica and proxy servers and I realized that we did not have any posts around this. So here goes.

If the primary server is behind a proxy server (forward proxy) and if Kerberos based authentication is used to establish a connection between the primary and replica server, you might encounter an error: Hyper-V cannot connect to the specified Replica server due to connection timed out. Verify if a network connection exists to the Replica server or if the proxy settings have been configured appropriately to allow replication traffic.

I have a Forefront TMG 2010 acting as a proxy server and the logs in the proxy server

I also had netmon running on my primary server, and the logs didn’t indicate much other than the fact that the connection never made it to the replica server – something happened between the primary and replica server which caused the connection to be terminated. The primary server name in this deployment is prb8.hvrlab.com and the proxy server is w2k8r2proxy1.hvrlab.com.

If a successful connection goes through, you will see a spew of messages on netmon

When I first observed the issue while building the product, I reached out to the Forefront folks at Microsoft to understand this behavior. I came to understand that the Forefront TMG proxy server terminates any outbound (or upload) connection whose content length (request header) is > 4GB.

Hyper-V Replica sets a high content length as we expect to transfer large files (VHDs), and it saves us the effort of re-establishing the connection each time. A closer inspection of a POST request shows the content length which is being set by Hyper-V Replica (ahem, ~500GB)

The proxy server returns a what-uh? response in the form of a bad-request

That isn’t super helpful by any means, and the error message unfortunately isn’t too specific either. But now you know the reason for the failure – the proxy server terminates the connection request and it never reaches the replica server.

So how do we work around it? There are two ways: (1) bypass the proxy server, or (2) use certificate-based authentication (another blog for some other day).

The ability to bypass the proxy server is provided only in PowerShell, in the BypassProxyServer parameter of the Enable-VMReplication cmdlet – http://technet.microsoft.com/en-us/library/jj136049.aspx. When the flag is enabled, the request (for lack of a better word) bypasses the proxy server. Eg:
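
A minimal sketch of such a call, under stated assumptions: 'NewVM' and 'replica01.contoso.com' are hypothetical placeholder names, and port 80 assumes the Replica server listens for Kerberos (HTTP) replication on the default port.

```powershell
# Sketch only: the VM name and Replica server name are placeholders.
$replicationArgs = @{
    VMName             = 'NewVM'
    ReplicaServerName  = 'replica01.contoso.com'
    ReplicaServerPort  = 80           # assumed default HTTP listener port
    AuthenticationType = 'Kerberos'
    BypassProxyServer  = $true        # send replication traffic around the proxy
}
Enable-VMReplication @replicationArgs

# Start the initial replication for the same VM.
Start-VMInitialReplication -VMName 'NewVM'
```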

This is not available in the Hyper-V Manager or Failover Cluster Manager UI. It’s supported only in PowerShell (and WMI). Running the above cmdlets will create the replication request and start the initial replication.

When trying to enable replication on one of my VMs in my lab setup, I encountered the following error – Hyper-V failed to authenticate the Replica server using Kerberos authentication. Error: The specified target is unknown or unreachable (0x80090303).

Needless to say, I was able to reach the replica server (prb2.hvrlab.com in my case), firewall settings in the replica server looked ok and I was able to TS and login to the replica server as well. As the error message indicated that the failure was encountered when authenticating the replica server, I decided to check the event viewer logs on the replica server. A couple of errors caught my eye:

(1) SPN registration failures

(2) This was followed by an error message which indicated that the authentication had failed

I was getting somewhere, so I ran the “setspn -l” command to list the currently registered SPNs for the computer, and the Hyper-V Replica entry was conspicuously absent.

I restarted the vmms service and when I re-ran the command, I could see the following (set of correct) entries
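
The check-and-restart sequence can be sketched like this. It assumes the commands are run with sufficient rights on the replica server (prb2 in this post), and it filters on the word "Replica" instead of relying on the exact SPN text.

```powershell
# Sketch only: list the SPNs registered for the replica server's computer
# account and keep only the lines that mention Hyper-V Replica.
setspn -L prb2 | Select-String -Pattern 'Replica'

# If nothing is listed, restart the Virtual Machine Management service (vmms).
# In the case described above, this was enough for the entries to appear.
Restart-Service -Name vmms

# Re-check the registered SPNs.
setspn -L prb2 | Select-String -Pattern 'Replica'
```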

Quite a few customers have reached out to us with this question, and you can even see a few posts around this on the TechNet Forums. The query comes in various forms:

“My log file size was in the MBs and sometime at night it went into the GBs – what happened?”

“I have huge amounts of data to sync across once a day when no data is being changed in the guest”

“The size of the log file (the .hrl) is growing 10X…”

The problem here is not just the sudden increase in the .hrl file size, but also the fact that the network impact of this churn was not accounted for during the planning stages of the datacenter fabric. Thus there isn’t adequate network bandwidth between the primary and the replica to transfer the huge log files being generated.

As a first step, the question that customers want answered is: What is causing this churn inside the guest?

I started the debugging process by using the script on a SQL Server virtual machine of my own. I copied the script into the VM and ran it in an elevated PowerShell window. You might run into PowerShell script execution policy restrictions, in which case you might need to set the execution policy to Unrestricted (http://technet.microsoft.com/en-us/library/ee176961.aspx).
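
A small sketch of checking and relaxing the execution policy. Limiting the change to the current PowerShell session with -Scope Process is a choice made here for illustration; the setting then goes away when the window is closed.

```powershell
# Show the execution policy that applies at each scope.
Get-ExecutionPolicy -List

# Sketch only: relax the policy for this PowerShell session alone,
# leaving the machine-wide and per-user settings untouched.
Set-ExecutionPolicy -ExecutionPolicy Unrestricted -Scope Process -Force
```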

At the same time, I was monitoring the VM using Perfmon from the host and checking to see if there is any burst of disk activity seen. The blue line in the Perfmon graph is something I was not expecting to see, and it is significantly higher than the rest of the data – the scale for the blue line is 10X that of the red and green lines. (Side note: I was also monitoring the writes from within the guest using Perfmon… to see if there was any mismatch. As you can see from the screenshot below, the two performance monitors are rather in sync :))

At this point, I have no clue what in the guest is causing this sort of churn to show up. Fortunately I have the script collecting data inside the guest that I will use for further analysis.

Pull out the two files from the guest VM for analysis in Excel – ProcStats-2.csv and HVRStats-2.csv. Before starting the analysis, I added one extra column in Excel called Hour-Minute: it pulls out only the hour and minute from the timestamp (ignoring the seconds) and is used as a field in the PivotTable analysis. I use the following formula in the cell: =TIME(HOUR(A2), MINUTE(A2), 0) where A2 is the timestamp cell for that row. Copy it down and Excel will adjust the formula appropriately.

Overall write statistics (HVR Stats)

Let’s first look at the file HVRStats-2.csv in Excel. Use the data to create a PivotTable and a PivotChart – this gives a summarized view of the writes happening. What we see is that an excessive amount of data gets written at 4:57 AM and 4:58 AM: more than 30X the data written at other times.
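
If you would rather stay in PowerShell than use Excel, the same per-minute summary can be sketched as below. The column names Timestamp and BytesWritten are assumptions about the CSV layout, and Get-WritesPerMinute is a hypothetical helper, so adjust both to match the actual file.

```powershell
# Sketch only: sum the bytes written per hour-and-minute bucket, mirroring
# the Hour-Minute column used in the PivotTable (seconds are dropped).
# 'Timestamp' and 'BytesWritten' are assumed column names.
function Get-WritesPerMinute {
    param([object[]]$Rows)
    $Rows |
        Group-Object -Property { ([datetime]$_.Timestamp).ToString('HH:mm') } |
        ForEach-Object {
            [pscustomobject]@{
                HourMinute   = $_.Name
                BytesWritten = ($_.Group | Measure-Object -Property BytesWritten -Sum).Sum
            }
        } |
        Sort-Object -Property HourMinute
}

# Usage: show the five busiest minutes.
# Get-WritesPerMinute -Rows (Import-Csv .\HVRStats-2.csv) |
#     Sort-Object -Property BytesWritten -Descending |
#     Select-Object -First 5
```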

Per process write statistics

Now let’s look at ProcStats-2.csv in Excel. Use the data to create a PivotTable and PivotChart – this should give us a per-process view of what is happening. With the per-process information, we can easily plot the data written by each process and identify the culprit. In this case, SQL Server itself caused a spike in the data written (highlighted in red).
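
The per-process view can be approximated in PowerShell as well. As before, this is only a sketch: ProcessName, ProcessId and BytesWritten are assumed column names, and Get-WritesPerProcess is a hypothetical helper.

```powershell
# Sketch only: total the bytes written by each process (name + PID) and
# list the heaviest writers first. Column names are assumptions.
function Get-WritesPerProcess {
    param([object[]]$Rows)
    $Rows |
        Group-Object -Property ProcessName, ProcessId |
        ForEach-Object {
            [pscustomobject]@{
                Process      = $_.Name
                BytesWritten = ($_.Group | Measure-Object -Property BytesWritten -Sum).Sum
            }
        } |
        Sort-Object -Property BytesWritten -Descending
}

# Usage:
# Get-WritesPerProcess -Rows (Import-Csv .\ProcStats-2.csv) | Select-Object -First 5
```

Grouping on both the name and the PID keeps separate instances of the same executable apart, which matters when several processes share a name.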

This is what the graph looks like for a large data copy operation (~1.5 GB). There is a burst of writes between 1:52PM and 1:53PM in Explorer.exe – and this corresponds to the copy operation that was initiated.

What next?

At this point, you should be able to differentiate between the following process classes using the process name and PID:

Step 2: Which files are being modified?

Isolating the file sometimes helps in identifying the underlying operation. Once you know which process is causing the churn and at approximately what time, you can use the inbox tool Resource Monitor (resmon.exe) to track the Disk Activity. You can filter Resource Monitor to show only the details of the processes that you are interested in.

From the previous step you will get the details of the process causing the churn – for example, System (PID 4). Using the Resource Monitor you can then find the file being modified – for example, C:\pagefile.sys. This would lead you to the conclusion that it is the pagefile that is being churned.