Server, Storage, Virtualization, OS news from Vienna

Clustering Basics and Challenges

For upcoming posts it seemed to be a good idea to dedicate some time for cluster basic concepts and theory. This post misses a lot of details that would explode the articlesize, should you have questions, do not hesitate to ask them in the comments.

The goal here is to get some concepts straight. I can't promise to give you an overall complete definitions of cluster, cluster agent, quorum, voting, fencing, split brain condition, so the following is more of an explanation.

Here we go.

-------- Cluster, HA, failover, switchover, scalability --------

An attempted definition of a Cluster: A cluster is a set (2+) server nodes dedicated to keep application services alive, communicating through the cluster software/framework with eachother, test and probe health status of servernodes/services and with quorum based decisions and with switchover/failover techniques keep the application services running on them available. That is, should a node that runs a service unexpectedly lose functionality/connection, the other ones would take over the and run the services, so that availability is guaranteed. To provide availability while strictly sticking to a consistent clusterconfiguration is the main goal of a cluster.

At this point we have to add that this defines a HA-cluster, a High-Availability cluster, where the clusternodes are planned to run the services in an active-standby, or failover fashion. An example could be a single instance database. Some applications can be run in a distributed or scalablefashion. In the latter case instances of the application run actively on separate clusternodes serving servicerequests simultaneously. An example for this version could be a webserver that forwards connection requests to many backend servers in a round-robin way. Or a database running in active-active RAC setup.

-------- Cluster arhitecture, interconnect, topologies --------

Now, what is a cluster made of? Servers, right. These servers (the clusternodes) need to communicate. This of course happens over the network, usually over dedicated network interfaces interconnecting all the clusternodes. These connection are called interconnects.

How many clusternodes are in a cluster? There are different cluster topologies. The most simple one is a clustered pair topology, involving only two clusternodes:

There are several more topologies, clicking the image above will take you to the relevant documentation. Also, to answer the question Solaris Cluster allows you to run up to 16 servers in a cluster.

Where shall these clusternodes be placed? A very important question. The right answer is: It depends on what you plan to achieve with the cluster. Do you plan to avoid only a server outage? Then you can place them right next to eachother in the datacenter. Do you need to avoid DataCenter outage? In that case of course you should place them at least in different fire zones. Or in two geographically distant DataCenters to avoid disasters like floods, large-scale fires or power outages. We call this a stretched- or campus cluster, the clusternodes being several kilometers away from eachother. To cover really large distances, you probably need to move to a GeoCluster, which is a different kind of animal.

What is a geocluster? A Geographic Cluster in Solaris Cluster terms is actually a metacluster between two, separate (locally-HA) clusters.

So how does the cluster manage my applications? The cluster needs to start, stop and probe your applications. If you application runs, the cluster needs to check regularly if the application state is healthy, does it respond over the network, does it have all the processes running, etc. This is called probing. If the cluster deems the application is in a faulty state, then it can try to restart it locally or decide to switch (stop on node A, start on node B) the service. Starting, stopping and probing are the three actions that a cluster agentdoes. There are many different kinds of agents included in Solaris Cluster, but you can build your own too. Examples are an agent that manages (mounts, moves) ZFS filesystems, or the Oracle DB HA agent that cares about the database, or an agent that moves a floating IP address between nodes. There are lots of other agents included for Apache, Tomcat, MySQL, Oracle DB, Oracle Weblogic, Zones, LDoms, NFS, DNS, etc.

We also need to clarify the difference between a cluster resource and the cluster resource group.A cluster resource is something that is managed by a cluster agent. Cluster resource types are included in Solaris cluster (see above, e.g. HAStoragePlus, HA-Oracle, LogicalHost). You can group cluster resources into cluster resourcegroups, and switch these groups together from one node to another. To stick to the example above, to move an Oracle DB service from one node to another, you have to switch the group between nodes, and the agents of the cluster resources in the group will do the following:

How do the clusternodes agree upon their action? How do they decide which node runs what services? Another important question. Running a cluster is a strictly democratic thing.Every node has votes, and you need the majority of votes to have the deciding power. Now, this is usually no problem, clusternodes think very much all alike. Still, every action needs to be governed upon in a productive system, and has to be agreed upon. Agreeing is easy as long as the clusternodes all behave and talk to eachother over the interconnect. But if the interconnect is gone/down, this all gets tricky and confusing. Clusternodes think like this: "My job is to run these services. The other node does not answer my interconnect communication, it must be down. I'd better take control and run the services!". The problem is, as I have already mentioned, clusternodes very much think alike. If the interconnect is gone, they all assume the other node is down, and they all want to mount the data backend, enable the IP and run the database. Double IPs, double mounts, double DB instances - now that is trouble. Also, in a 2-node cluster they both have only 50% of the votes, that is, they themselves alone are not allowed to run a cluster.

This is where you need a quorum device. According to Wikipedia, the "requirement for a quorum is protection against totally unrepresentative action in the name of the body by an unduly small number of persons.". They need additional votes to run the cluster. For this requirement a 2-node cluster needs a quorum device or a quorum server. If the interconnect is gone, (this is what we call a split brain condition) both nodes start to race and try to reserve the quorum device to themselves. They do this, because the quorum device bears an additional vote, that could ensure majority (50% +1). The one that manages to lock the quorum device (e.g. if it's an FC LUN, it SCSI reserves it) wins the right to build/run a cluster, the other one - realizing he was late - panics/reboots to ensure the cluster config stays consistent.

Losing the interconnect isn't only endangering the availability of services, but it also endangers the cluster configuration consistence. Just imagine node A being down and during that the cluster configuration changes. Now node B goes down, and node A comes up. It isn't uptodate about the cluster configuration's changes so it will refuse to start a cluster, since that would lead to cluster amnesia, that is the cluster had some changes, but now runs with an older cluster configuration repository state, that is it's like it forgot about the changes.

Also, to ensure application data consistence, the clusternode that wins the racemakes sure that a server that isn't part of or can't currently join the cluster can access the devices. This procedure is called fencing. This usually happens to storage LUNs via SCSI reservation.

Now, another important question: Where do I place the quorum disk? Imagine having two sites, two separate datacenters, one in the north of the city and the other one in the south part of it. You run a stretched cluster in the clustered pair topology. Where do you place the quorum disk/server? If you put it into the north DC, and that gets hit by a meteor, you lose one clusternode, which isn't a problem, but you also lose your quorum, and the south clusternode can't keep the cluster running lacking the votes. This problem can't be solved with two sites and a campus cluster. You will need a third site to either place the quorum server to, or a third clusternode. Otherwise, lacking majority, if you lose the site that had your quorum, you lose the cluster.

Okay, we covered the very basics. We haven't talked about virtualization support, CCR, ClusterFilesystems, DID devices, affinities, storage-replication, management tools, upgrade procedures - should those be interesting for you, let me know in the comments, along with any other questions. Given enough demand I'd be glad to write a followup post too.

Now I really want to move on to the second part in the series: ClusterInstallation.

Fencing is the mechanism that ensures that only healthy clusternodes with uptodate cluster configuration repositories are active in a cluster. In other words if the clusternodes are not completely okay, they are going to be fenced, kept outside of the cluster, not allowed to join or even thrown out of the running cluster. This usually happens with I/O fencing, for example with SCSI reservation of the shared disks the faulty clusternodes will get access to shared disks denied to keep data consistence up at all costs.

About

This is the Technology Blog of the Oracle Hardware Presales Consultant Team in Vienna, Austria. We post about our technology fields: server- and storage hardware, operating system technologies, virtualization, clustering, datacenter management and performance tuning possibilities.