You can split the nodes in two ways: with or without the use of ISLs (Inter-Switch Links). Both deployment methods are covered in detail in this document.

Deployment without ISL
Nodes are directly connected to the FC switches in both the local and remote site, without traversing an ISL. Passive WDM devices (red line) can be used to reduce the number of links. You’ll need to equip the nodes with “colored” long-distance SFPs.

The failure domains for the project I worked on were approximately 55 km (total dark fiber length) apart, so we had to use the following method.

Deployment with ISL
Nodes are connected to the local FC switches, ISLs are configured between the sites, and all traffic traverses the ISLs. You are required to configure a so-called private SAN for node-to-node communication and a public SAN for host and storage array communication. You can separate the SANs by using dedicated switches or by using Virtual Fabrics.

N.B. The private-public separation isn’t always strictly enforced. In case of a failure (say all public SAN ISLs fail), the SVC can route public traffic over the private SAN (and vice versa).

The implementation I worked on consisted of 2x Brocade B6510 in each site (each B6510 containing two Virtual Fabrics). We used an MRV LambdaDriver 1600 (DWDM mux/demux) to create 4x 8 Gb FC ISLs over two dark fibers between site 1 and site 2, and 1x 2 Gb FC link to the quorum site.

Make sure all dark fibers between the three sites use different physical paths from at least two providers. Keep in mind, however, that fiber providers sometimes swap fibers among themselves, so despite using different providers your fibers may end up in the same duct.

Also, keep in mind the attenuation of the optical signal for all (short wave) patches. Full 8 Gb FC can run up to 150 meters over OM3 cabling. If your FC switch is located inside a rack and the multiplexer in the Main Equipment Room (MER) or Satellite Equipment Room (SER), you might be covering more meters than you think.
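A simple sanity check is to add up the lengths of every short-wave patch segment between the switch port and the multiplexer, and compare the total against the OM3 distance limit. A minimal sketch (the segment names and lengths are made-up examples, not from my implementation):

```python
# Sanity check: total short-wave patch length vs. the OM3 limit at 8 Gb FC.
# Segment names and lengths are hypothetical; replace with your own measurements.
OM3_LIMIT_8GFC_M = 150  # maximum OM3 distance for 8 Gb FC, in meters

segments_m = {
    "rack patch panel -> SER": 40,
    "SER -> MER": 70,
    "MER patch panel -> multiplexer": 25,
}

total_m = sum(segments_m.values())
print(f"Total patch length: {total_m} m (limit {OM3_LIMIT_8GFC_M} m)")
if total_m > OM3_LIMIT_8GFC_M:
    print("Over budget: consider OM4 or verify the link with a light meter")
```

Even when the total stays under the limit, every extra connector adds insertion loss, so it pays to measure the link rather than trust the datasheet distance alone.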

Quorum site
As you may have already noticed in the pictures above, a third (independent) site is required for the (active) quorum. Because the nodes are always divided 50:50 between the two main sites, the cluster can never form a majority on its own, so a tie-breaker is needed. Each main site contains a candidate quorum disk. The active quorum decides which site stays up in case of a split-brain.

For the active quorum disk, you’ll need a storage array that supports (what IBM calls) extended quorum. We used a V3700 in the quorum site. The quorum disk only takes up around 256 MB.

Configuration node
I haven’t discussed the term configuration node yet. There’s one configuration node in the SVC cluster that manages the configuration (what’s in a name). The configuration node is elected by the system; you cannot change the configuration node manually. If the configuration node fails, another node takes over its role.

Another task of the configuration node is to closely monitor the active quorum. This can have an impact on the split-brain remediation, which I’ll discuss shortly.

Cluster status
In the table below, you’ll find some failure scenarios and the resulting cluster status. Please note there’s a small error in the 4th entry; write cache is enabled when only the quorum fails!

There’s no way to be sure which node accesses the active quorum first. However, we do know the configuration node already accesses the active quorum frequently, so it has a high chance of winning. In my validation testing, the site that contained the configuration node always won, despite the fact that this site was located 40 km farther from the quorum than the other site! Based on this, we can make the following assumptions about who has the highest chance of accessing the active quorum first:

Configuration node

Node in the site physically closest to the active quorum

Node in the site physically farther from the active quorum
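The ordering above can be expressed as a simple priority sort: being the configuration node trumps distance, and otherwise the closer site wins. A minimal sketch (the node names, site numbers, and distances are hypothetical examples):

```python
# Rank nodes by their assumed chance of winning the race to the active quorum.
# Node names, sites, and distances are made-up examples.
nodes = [
    {"name": "NODE1", "site": 1, "config_node": False},
    {"name": "NODE2", "site": 2, "config_node": True},
    {"name": "NODE3", "site": 1, "config_node": False},
    {"name": "NODE4", "site": 2, "config_node": False},
]
distance_to_quorum_km = {1: 15, 2: 55}  # site -> dark fiber length to quorum site

def race_priority(node):
    # Lower tuple sorts first: the config node beats distance,
    # then the site closest to the quorum wins.
    return (not node["config_node"], distance_to_quorum_km[node["site"]])

ranking = sorted(nodes, key=race_priority)
print([n["name"] for n in ranking])  # → ['NODE2', 'NODE1', 'NODE3', 'NODE4']
```

Note how NODE2 ranks first even though its site has the longer fiber run to the quorum, matching what I observed in the validation tests.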

Voting set
A voting set is formed by all nodes that are fully connected to each other. A site has quorum (stays online) if one of the following is true:

the site contains more than half of all nodes in the voting set

the site contains half of all nodes in the voting set AND has access to the active quorum

if there’s no active quorum: the site contains half of the nodes, and that half includes the founding node (usually the configuration node)
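Those three rules can be captured in a small function. This is only a sketch of the decision logic: the function name and parameters are my own, and the founding-node rule is reduced to a boolean flag.

```python
def site_stays_online(nodes_in_set, total_nodes, has_active_quorum,
                      has_founding_node, active_quorum_exists=True):
    """Sketch of the three voting-set rules (names are hypothetical)."""
    if nodes_in_set * 2 > total_nodes:
        return True                      # rule 1: more than half of all nodes
    if nodes_in_set * 2 == total_nodes:  # exactly half (the 50:50 split)
        if active_quorum_exists:
            return has_active_quorum     # rule 2: half + the active quorum
        return has_founding_node         # rule 3: no active quorum, founding node decides
    return False

# 4-node cluster, all ISLs down: the half that reaches the active quorum survives.
print(site_stays_online(2, 4, has_active_quorum=True,  has_founding_node=False))  # → True
print(site_stays_online(2, 4, has_active_quorum=False, has_founding_node=True))   # → False
```

With the active quorum gone as well (`active_quorum_exists=False`), the same 50:50 split is instead decided by which half holds the founding node.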

This shows a 4-node cluster in normal operation. The voting set consists of all nodes that are fully connected with one another.

In this scenario, all ISLs have failed. There’s no node majority in the voting set. The configuration node (NODE 2) is the first to access the quorum, so site 2 stays online.

I know you’ve been waiting for this scenario! Although highly unlikely, I did test it 😉

The active quorum fails; at this point there’s no impact. The quorum disks in site 1 and site 2 remain in a candidate state.

All ISLs fail; a new active quorum is elected. The configuration node is located in site 2, so the quorum disk there switches from candidate to active and site 2 stays online.

Now we know how a split I/O group cluster behaves. In PART 3 we’ll see how this all interacts with VMware HA, and we’ll take a closer look at, for instance, APD (All Paths Down), PDL (Permanent Device Loss), and some advanced DRS (Distributed Resource Scheduler) settings.
