We are running a Microsoft Failover Cluster on Server 2008 R2 with an EqualLogic PS4000 SAN. Our main configuration has 2 Dell PowerEdge T710 servers in the cluster. We have a CSV and quorum set up. The servers each have 10 Broadcom 1Gb NICs. Right now 4 of the NICs are on the iSCSI network for accessing the SAN. They use MPIO and the Dell HIT Kit.

We have 5 VMs running on each node and everything runs smoothly, with no noticeable performance issues. From the SAN I can see the 4 iSCSI connections from each server to each volume (CSV and quorum). Again, it seems to perform great.

The problem I am running into is with backups. I have tried a few backup programs, including BackupChain and Veeam, and both of them are very slow to back up the VMs. For instance, I have a 500GB (fixed disk) VHD that’s running on the cluster. It takes over 18 hours to back up that VHD, and that’s with compression and deduplication turned off, which is supposed to be the fastest.
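
To put the backup time in the same units as the file-copy tests, here is a minimal sketch of the arithmetic. It assumes decimal units (1 GB = 1000 MB) and treats "over 18 hours" as exactly 18, so the result is an upper bound on the average rate; the function name is made up for illustration.

```python
def throughput_mb_per_s(size_gb: float, hours: float) -> float:
    """Average transfer rate in MB/s, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / (hours * 3600)

# 500 GB VHD backed up in (at least) 18 hours
rate = throughput_mb_per_s(500, 18)
print(f"{rate:.1f} MB/s")
```

That works out to under 8 MB/s on average, which is a small fraction of what a single 1Gb link can carry.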

We also have a separate server that is just for backups. It has a lot of direct-attached storage. As part of the troubleshooting I decided to bring that server into the cluster as a node. It now has access to the CSV and can read from C:\clusterstorage\volume1, which is where our VHDs live. This backup server only has 2 NICs: 1 NIC goes to the iSCSI network and the other is on the main network. It has Intel NICs without any sort of MPIO or teaming.

So with the 3rd server now in the cluster I started doing some benchmarking. I have a test VHD of about 7 GB that’s stored in the CSV. I have tested copying that VHD from all 3 servers to direct-attached storage in the respective server. The 2 Dell servers that are the main nodes in the cluster (they house the VMs) are reading that file at about 20 MB/s, which is way too slow for the backups. The other server, which only has 1 NIC to the SAN, is reading at around 100 MB/s.
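
For anyone wanting to repeat the comparison without a backup product in the way, a timed sequential read is enough. The sketch below is only an illustration, not the tool used in the tests above: `measure_read` is a made-up helper, the 1 MiB chunk size is an arbitrary choice, and the example file name is hypothetical.

```python
import time

def measure_read(path: str, chunk_size: int = 1024 * 1024) -> float:
    """Read a file sequentially and return throughput in MB/s (1 MB = 10**6 bytes)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python's own buffer layer; the OS file cache still applies
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / 1e6 / elapsed if elapsed > 0 else float("inf")

# Hypothetical file name under the CSV path mentioned above:
# print(f"{measure_read(r'C:\clusterstorage\volume1\test.vhd'):.1f} MB/s")
```

Run it once per node against the same file. A second run on the same node may be served from the OS file cache and look faster than the storage path really is.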

I spent a few hours on the phone with Dell today about this. We went through all kinds of tests and the support rep was pretty dumbfounded. He really has no idea why the server with only 1 NIC is reading about 5 times as fast as the servers with 4 NICs and MPIO.

We looked at the network utilization of the NICs while the file copy was going on. The servers with the 4 NICs had a small increase of activity during the file copy but they only went up to around 8-10% on all 4 NICs. The other server with the 1 NIC jumped up to over 80% during the file copy.
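
The utilization percentages can be turned into a rough data rate to compare against the copy speeds. This is a back-of-the-envelope sketch that assumes 1 Gb/s links, ignores iSCSI/TCP overhead, and uses a made-up helper name.

```python
def nic_throughput_mb_per_s(utilization_pct: float,
                            link_gbps: float = 1.0,
                            nics: int = 1) -> float:
    """Approximate aggregate rate in MB/s implied by per-NIC utilization."""
    bytes_per_s = link_gbps * 1e9 / 8  # link capacity in bytes per second
    return nics * bytes_per_s * utilization_pct / 100 / 1e6

print(nic_throughput_mb_per_s(80))           # single-NIC backup server
print(nic_throughput_mb_per_s(8, nics=4))    # four-NIC nodes, low end
print(nic_throughput_mb_per_s(10, nics=4))   # four-NIC nodes, high end
```

On those assumptions the single NIC at 80% is moving roughly 100 MB/s, while four NICs at 8-10% each add up to roughly 40-50 MB/s at most, so the four links together are carrying less than the one link on the backup server.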

I plan on doing some more testing after hours and calling Dell back tomorrow but I really am confused (and so is Dell’s support rep) why I cannot get faster file copy access to the CSV on those servers.

Anyone have any input on this?
Any feedback would be greatly appreciated.

2 Answers

With the information you provided, it seems your backup process is putting the CSV into Redirected Access mode. It could be that your backup software is not CSV-aware and is trying to access the VHD files via a server that does not own the resources.

You should be able to verify this by viewing the CSVs in Failover Cluster Manager under Storage.

If this is the case, I would contact Veeam to see how they recommend performing clustered Hyper-V backups.

This sounds like a misconfigured MPIO setup to me. It's impossible to pinpoint the exact problem without spending hours at your site, but here are a few pointers to check out:

How is the EqualLogic configured to present the LUN(s)? Is it doing active/passive or active/active? Is it using ALUA? If it's not ALUA then you may be experiencing path thrashing, which will bring a SAN to its knees during heavy I/O.

Are you using jumbo frames? If yes (or if you don't know), check the SAN, the switch(es) and the NIC(s) on ALL devices to make sure that the MTU setting is identical everywhere.
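
One quick way to test the MTU end to end is a ping with the Don't Fragment flag set and a payload sized to exactly fill the frame. The sketch below only builds the Windows command line; it assumes a 9000-byte MTU, a 20-byte IPv4 header and an 8-byte ICMP header, and the target address is a placeholder, not your SAN's real IP.

```python
def ping_payload_size(mtu: int, ip_header: int = 20, icmp_header: int = 8) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - ip_header - icmp_header

def windows_df_ping(target: str, mtu: int = 9000) -> str:
    """Build a Windows ping command: -f sets Don't Fragment, -l sets payload size."""
    return f"ping -f -l {ping_payload_size(mtu)} {target}"

# 10.0.0.10 is a placeholder for an iSCSI interface on the SAN
print(windows_df_ping("10.0.0.10"))  # ping -f -l 8972 10.0.0.10
```

If that ping fails with a fragmentation error from one node but works from another, some device on the failing path is not passing jumbo frames.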

Every respected SAN vendor provides best practices for different usage scenarios. You should be able to find one for MPIO on Windows with iSCSI.