Availability Engineering

The Sun Cluster "Top Five Strengths List"

We all have our favorites when it comes to Sun Cluster strengths. My top five list keeps changing a little, depending on feedback we get from customers and our field engineers, as well as discussions within our own organization. So, instead of struggling over finalizing a top five list, I will discuss five items that come to my mind first, as I type this blurb - they are likely the biggest strengths of Sun Cluster. In future blogs, we will go into details of one or more aspects of each of these (and other) items. If there is anything in particular that you'd like to hear about, please let us know.

1. Integration with Solaris

This is a big one. Sun Cluster (SC) is tightly integrated with Solaris, with parts of SC being in the kernel. There are many advantages of being integrated with your own company's operating system, including timely exploitation of its new features. The lower levels of the SC stack, including cluster membership, heartbeats, and the inter-node communication layers, are in the kernel. This makes the detection of a certain set of faults in the system (such as missed heartbeats) predictably faster than corresponding user-space software. Faster detection translates to faster recovery and higher availability.

2. Robust High Availability (HA) Infrastructure

While one expects an HA product to provide HA, there are varying degrees of HA, both in terms of the level of availability offered and the variation therein. SC provides predictable levels of HA on well-defined stacks, and this is based on rigorous mathematical analysis and modeling. More on this under item 5 below.

As expected, SC is resilient to single points of failure. Did you know that it also tolerates multiple points of failure in various scenarios? For instance, in an N-node cluster where each node is connected to an N-ported quorum device, the cluster can tolerate the failure of (N-1) nodes.
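The (N-1) figure follows from simple vote counting. The sketch below is only an illustration, not SC code: it assumes each node carries one vote, the fully connected quorum device carries N-1 votes, and the cluster needs a strict majority of all configured votes to stay up. The function names are made up for this example.

```python
def vote_summary(n_nodes: int) -> dict:
    """Vote arithmetic for an N-node cluster with one quorum device
    connected to every node (assumed: 1 vote per node, N-1 for the device)."""
    device_votes = n_nodes - 1
    total_votes = n_nodes + device_votes      # 2N - 1
    majority = total_votes // 2 + 1           # strict majority, i.e. N
    return {"device_votes": device_votes,
            "total_votes": total_votes,
            "majority": majority}


def cluster_survives(n_nodes: int, failed_nodes: int, device_ok: bool = True) -> bool:
    """True if the surviving nodes (plus the device, if reachable) hold a majority."""
    votes = vote_summary(n_nodes)
    live_votes = (n_nodes - failed_nodes) + (votes["device_votes"] if device_ok else 0)
    return live_votes >= votes["majority"]


if __name__ == "__main__":
    # Four nodes: 4 node votes + 3 device votes = 7 total, majority = 4.
    for failed in range(5):
        print(failed, "failed ->", cluster_survives(4, failed))
```

Under these assumptions, a single surviving node plus the quorum device holds 1 + (N-1) = N votes, which is exactly the majority of the 2N-1 configured votes.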

3. "No Data Corruption" Guarantee

The SC infrastructure has many protection mechanisms that enable it to guarantee that there cannot be any data corruption in an SC configuration, irrespective of the number of faults in the system. In the worst case, the cluster will go down. However, it will do so to prevent data corruption.

The core protection mechanisms are provided by cluster membership, quorum and fencing. Cluster membership and quorum ensure that all nodes of the cluster have a consistent view of who the cluster members are, and prevent cluster partitions such as split brain and amnesia. Both of these types of partitions can lead to data corruption. Fencing ensures that a non-cluster node cannot access cluster resources, and that a node is evicted from the cluster when it attempts to do so.
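To see why a quorum rule rules out split brain, here is a minimal sketch that decides which side of a partitioned cluster may keep running. It is not the SC membership algorithm. It assumes one vote per node, N-1 votes for a single quorum device that exactly one partition manages to reserve, and a strict-majority requirement; `surviving_partition` is a hypothetical helper name.

```python
def surviving_partition(partitions, device_holder):
    """Return the index of the partition that keeps quorum, or None.

    partitions    -- list of sets of node names that together cover the cluster
    device_holder -- index of the partition that reserved the quorum device,
                     or None if no partition can reach it
    """
    n_nodes = sum(len(nodes) for nodes in partitions)
    device_votes = n_nodes - 1                 # assumed vote count for the device
    majority = (n_nodes + device_votes) // 2 + 1

    for index, nodes in enumerate(partitions):
        votes = len(nodes) + (device_votes if index == device_holder else 0)
        if votes >= majority:
            # At most one partition can reach a strict majority,
            # so two sides can never both stay up.
            return index
    return None  # nobody has quorum: the cluster goes down rather than risk corruption


if __name__ == "__main__":
    halves = [{"node1", "node2"}, {"node3", "node4"}]
    print(surviving_partition(halves, device_holder=0))
    print(surviving_partition(halves, device_holder=None))
```

In the even split, whichever half reserves the quorum device continues and the other half stops. If neither half can reach the device, both stop, which matches the "worst case, the cluster will go down" behavior described above.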

4. Flexibility across the Stack

SC offers a flexible platform for developing HA applications, and we continue to improve even further in this area.

SC supports both traditional failover HA (where there is exactly one service primary) and scalable HA (where there are multiple service primaries), examples being HA-NFS and Oracle RAC respectively. The Agent Builder can be used to quickly HA-ize applications. The Resource Group Manager (RGM) framework supports a rich set of application dependency models that suit the entire spectrum of application needs.

In addition to a very flexible computing model, the cluster configuration choices are also broad. A wide range of servers can be used as cluster nodes, as can a wide range of storage units for cluster storage. The list goes on.

5. Data Driven Availability Prediction

We have built an industry-leading Markov availability model of SC. This model helps us understand the criticality of different parameters governing the availability of a system, and formulate best practices for customer deployments. It also provides the mathematical basis for offering SLAs to our customers. A paper describing this model can be found here.
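For readers new to Markov availability modeling, the simplest possible case gives the flavor. The sketch below is not the SC model, which is far richer. It assumes a two-state chain (up and down) with a constant failure rate of 1/MTTF and a constant repair rate of 1/MTTR, and the numbers in the usage lines are made up.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time spent in the 'up' state of a two-state
    Markov chain: A = mu / (lambda + mu) = MTTF / (MTTF + MTTR)."""
    failure_rate = 1.0 / mttf_hours   # lambda: up -> down transitions per hour
    repair_rate = 1.0 / mttr_hours    # mu: down -> up transitions per hour
    return repair_rate / (failure_rate + repair_rate)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in a 365-day year for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    # Hypothetical numbers: one failure per 1000 hours, recovery in 1 hour vs 30 minutes.
    for mttr in (1.0, 0.5):
        a = steady_state_availability(1000.0, mttr)
        print(f"MTTR={mttr}h availability={a:.6f} "
              f"downtime={downtime_minutes_per_year(a):.1f} min/year")
```

Even this toy model shows how a parameter such as recovery time feeds directly into availability: shortening recovery raises the steady-state figure, which is the sense in which faster fault detection translates to higher availability.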

Additionally, we continually analyze, measure and improve the availability of a set of well-defined Sun Cluster stacks in our development lab. We also mine large amounts of data from customer deployments. This data closes the feedback loop for our internal modeling and analysis, and helps the model converge to an accurate representation of real-life SC deployments.

Having typed the above, I am already feeling that the list needs a-changin' ... I did not mention key strengths such as a rich portfolio of supported applications, a simple but powerful disaster recovery solution, a sophisticated industry-leading test framework that is used both inside Sun Cluster and by partners (including those external to Sun), the rigorous testing that the product is put through, an excellent worldwide support team that is integrated into the product team, ...

OK, the list goes on. You get the point. Sun Cluster has many strengths and we will talk about each of these in the coming blogs ... please stay tuned.