I would recommend upgrading to 10.9.2. The update is supposed to fix some network issues in OS X, which seem to have caused the "random" failovers. Basically, it seems like OS X stopped being able to open new TCP sockets.

After these messages, error messages like these start appearing, which means cvadmin doesn't work and that things will quickly go bad. Thus I chose to do a manual failover of all the volumes on that MDC, before it would happen automatically.

I've been running OS X 10.9.2 for a day now in our Xsan lab environment without issues, and I just upgraded two MDCs (controlling three volumes) from 10.9.1 to 10.9.2 tonight. I'm upgrading another Xsan environment with two MDCs and two volumes tomorrow, and over the weekend I will upgrade our big MultiSAN with 7 MDCs and 16 volumes from 10.9.1 to 10.9.2.
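The symptom mentioned in this post, OS X no longer being able to open new TCP sockets, could in principle be watched for with a small probe. The following is only a rough sketch: the function name `can_open_tcp` and the host and port you point it at are assumptions for illustration, not something taken from the setup described here.

```python
import socket


def can_open_tcp(host, port, timeout=2.0):
    """Return True if a new outbound TCP connection to (host, port) succeeds.

    Any OSError (connection refused, timeout, no free sockets, and so on)
    is treated as a failure and reported as False.
    """
    try:
        # create_connection opens a fresh socket each time it is called,
        # which is the operation that reportedly stopped working.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run periodically from the MDC against a service that is known to be listening (for example, `can_open_tcp("other-mdc.example.com", 22)`, a hypothetical host), a run of `False` results might give some warning before a failover. A `False` only says the connection failed, not why.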

I scheduled some time later this week to do the upgrade here but after reading your notes, I think I may wait for 10.9.3. My MDCs have been rock-solid for me on 10.8.5 so I'm nervous about messing with them in the first place. Reading about persistent issues in 10.9.2 definitely doesn't give me the warm fuzzies about upgrading and I will continue to put it off as long as I can.

Thanks again for posting about your experiences. Have you tried to contact Apple about the problem through their forums or an official channel? Did you work with a reseller to build your Xsan who might be able to help contact Apple?

Also, what hardware are your MDCs on? I don't know if that would make a difference or not, just trying to feel this out and help if I can.

For those of you that have upgraded, here is my advice until Apple manages to find and fix the issues.

1. Check the logs asap after the failover, and make note of any suspicious log messages

/private/var/log/system.log*
/Library/Logs/Xsan/debug/nssdbg.out*

2. Always reboot the MDC that ran the volumes prior to the failover

3. Always check after a reboot that both MDCs are listed when running "sudo cvadmin"
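Step 1 above could be partly automated with something like the following sketch. The two glob patterns come from the log locations listed in step 1 (with the Xsan debug file assumed to be named `nssdbg.out`). The keyword list is purely an assumption for illustration and does not represent actual Xsan or OS X log strings, so adjust it to whatever you consider suspicious.

```python
import bz2
import glob
import gzip

# Log locations from step 1; rotated copies may be compressed.
LOG_GLOBS = [
    "/private/var/log/system.log*",
    "/Library/Logs/Xsan/debug/nssdbg.out*",  # file name assumed
]

# Illustrative search terms only (an assumption, not real log messages).
KEYWORDS = ["error", "fail", "timed out"]


def open_log(path):
    """Open a plain, gzip- or bzip2-compressed log file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def scan_logs(patterns=LOG_GLOBS, keywords=KEYWORDS):
    """Return (path, line_number, line) for every line containing a keyword.

    Matching is case-insensitive. Files that cannot be read are skipped.
    """
    lowered = [k.lower() for k in keywords]
    hits = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open_log(path) as handle:
                    for number, line in enumerate(handle, start=1):
                        if any(k in line.lower() for k in lowered):
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                continue  # unreadable file (permissions, corrupt archive)
    return hits
```

Reading these logs will likely require running the script with sudo. Calling `scan_logs()` right after a failover and saving the result gives you something concrete to attach to a bug report.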

We have a case with Apple about these random failovers. We actually saw the issues start back with OS X 10.8.4, which is the first time we bug-reported it to Apple. It's the same case we're still tracking with Apple in 10.9.2.

We have five separate Xsan environments in production, and one in our lab environment, and all of them have the same issues. The largest environment consists of 7 MDCs and 16 volumes, with 100 clients. There doesn't seem to be any difference in the frequency of failovers between the larger environment and the smaller ones.

The failovers happen about every two weeks in our production environments, sometimes more often. The problem seems to be OS X network related, as there are no signs in logs related to Xsan. We have had full debugging enabled for several of our Xsan volumes, without any additional clues as to why the failovers occur.

I'm also seeing something similar to this on 10.9.2, though I'm pretty sure our volumes have been running fine on 10.9.2 for several weeks. This issue cropped up on the primary MDC yesterday. If PortMapper reported a hiccup from FSM on the primary MDC, the volume would fail over to the secondary MDC. Both my volumes run fine on the secondary MDC, so I've kept them like that until I can sort out what's wrong with the primary. I watched the primary and saw a few more of these hiccups throughout the day yesterday, so on a hunch, I rebooted it. I haven't seen any more yet, and if it holds steady for another day or so, I'll fail one of the volumes back to it and see how it goes.

You're right about the secondary MDC. I've been rebooting them whenever one acts up, and that seems to stabilize it for a bit.

You mentioned that you thought it was a network issue in OS X. Just out of curiosity, what ethernet switches are you using? We recently had some OS X network issues that traced back to our Cisco gear, which apparently doesn't play nice like all the other switches on the block. This was just with a PEG6 card that we were trying to bond, though--completely unrelated to Xsan. I can't say that I've had any other issues with our Ciscos, but I'm not willing to rule them out completely. I haven't been able to find any telltale log messages about networking either.

Another possibility might be our ethernet configuration on the MDCs. What hardware are you using? Xserves? Minis? Pros? I've got two Mac Minis, each with a Promise SANLink adapter and a Thunderbolt to GigE adapter. The GigE adapters daisychain through the Promise adapter back to the single Thunderbolt port on the Minis. We're currently using the GigE adapters for our metadata network, but I'm considering flipping them around. If anyone is having this issue with non-Mini MDCs, then I'd scratch that idea.

Our network is all enterprise Cisco based. I'm not managing the network, so I can't give you too many details there. However, we didn't have failover issues this frequently (every 1-2 weeks) before upgrading to OS X 10.9, so I can't see how this could be an external network issue.

Considering we see the same issues in all five of our separate Xsan environments, and the fact that everyone I have talked to who is running Xsan on OS X 10.9 has these issues, I can't see how this can be anything other than a bug in OS X.

Our MDCs consist of Xserves, Mac Pros and Mac minis. The hardware for the MDC doesn't matter; they all have failover issues.

The Mac mini MDCs are running in a lab environment with a simple HP ProCurve switch for the Metadata Network. We have failovers happen there too, just not as frequently, which I guess is because of the limited activity in our lab environment.

I have been running our Xsan lab environment on an early seed of 10.9.4 for 18 days, and there have been no failovers. Looks promising, but I'm still waiting a couple more days before starting to upgrade some of our smaller environments.