Clustering XEN with Heartbeat and Advanced HASI

About this guide

This document focuses on a design where XEN virtual machines (domU) are highly available and centrally managed according to a set policy. What is described here is the real thing, not a demo, and it has been working in production for over 18 months. We use this technology to host many UNIX services, including DNS, web proxy, SMTP, NFS, VPN and CUPS, as well as traditional NetWare services with OES2, although in this document I only present a solution for hosting home directories for UNIX users over NFS (HASI).

Please read the guides above to gain a better understanding of this technology. This guide assumes that you already have hands-on experience with SLES10, Heartbeat, EVMS and the other components, and it is intended for experienced system administrators.

Overview

We have a 2 node XEN cluster (HP DL360G5). The nodes run from small local SAS storage but also have connectivity to our fiber channel SAN, where all our resources (XEN virtual machines) reside. We run SLES10 SP2 on both cluster member nodes and configure Heartbeat (high availability) services on top of them (the XEN hosts). We set a policy which treats each virtual machine as an individual primitive resource, monitors each virtual machine along with its network connectivity, and acts according to events.

Point

This is a 2 node cluster, which isn't a big deal to manage, but HA (Heartbeat) supports up to 16 nodes in one cluster, and sharing storage between these nodes is very dangerous. Because HA is configured on all cluster member nodes, it tracks each resource (XEN domU) and, through its monitoring operations, knows where a given resource is running, what state it is in, etc.

It protects you from your own mistakes, for instance starting a second instance of an already running virtual machine, which would corrupt its storage instantly.

HA is your control center: you must do everything there, including starting, stopping and migrating your virtual machines, otherwise HA gets confused. It's not just the safe way of managing virtual machines, it's also very good for DR and business continuity.

Should you lose one of your virtual machines, should one of them crash or freeze your host, should your LAN switch go faulty, HA will restart (stop/start) or migrate (even live if possible) your XEN domU resource to another healthy host, together with all its dependent services.

Remember that you have to eliminate single points of failure and build in as much redundancy as possible for everything else in this scenario. Resources can only be highly available if you also have redundancy in your storage, servers, switches, power supply (UPS and emergency generator), HA communication paths, etc.

Storage

I have LUN2 (currently 100GB) for virtual machines. I split this up with EVMS, and through a CSM container both XEN hosts (HP DL360G5) share the same block based storage layer.

It's somewhat simpler than a file image based setup, where you'd need an extra layer to mount the images. This setup is also well tested and works seamlessly even for live migration, not to mention the unbeatable I/O performance.

LUN1 (currently 2T) is for user data (home directories), which I do NOT manage on the XEN hosts. I actually forced EVMS to manage (be able to see) only the 100GB LUN2 because I want to manage the 2T array within my NFS virtual machine. We just map LUN1 to our NFS virtual machine.

What makes this advanced compared to the original Novell demo is as follows:

Configuration

I installed a fairly cut down copy of SLES10 SP2 on my XEN hosts, and the LUNs are presented to both nodes. I configured my first NIC (eth0) with a class C private IP address; this is the main connection back to our private LAN. The other NIC is configured with a class A private IP address, which I use solely for HA communication. I simply connected these to multiple switches (for redundancy), but if your hosts are close to each other you could use a crossover cable as well. I already had several virtual machines, so creating them is not covered in this guide; there are plenty of notes about that already. I mainly run SLES10 SP2 XEN virtual machines where possible, although I have some Debian Linux 4.0 virtual machines as well.

NTP Setup

The time on the two physical machines needs to be synchronized; several components in the HASF stack require this. I have configured both nodes to use our 3 internal NTP servers in addition to the other node, which gives us fairly decent redundancy.

Of course, this needs to be done the same way on the other node as well.

Multipathing

I have redundant SAN switches and controllers, hence for proper redundancy we need to configure this service; left unconfigured, the duplicate paths could confuse EVMS as well. There's a guide from HP but it requires the HP drivers to be installed. I prefer using the SuSE stock kernel drivers because they are maintained by Novell and work pretty much out of the box. Using the HP drivers may require you to reinstall or update them whenever you receive a new kernel update. Tools we need:

We need as quick a response as possible in an emergency, hence we instruct the stock driver to disable the HBA's built-in failover and propagate path failures up to the dm I/O layer. The stock driver does support this, as shown by the list above… Activate it:

Your SAN devices should be visible by now (2 for each LUN), in my case /dev/sda, sdb, sdc and sdd. Note: these names change dynamically when you add LUNs to the machine or put any other disk under dm management.

The problem I found is that dm tries to take over and manage pretty much every block device (after the SP2 update), therefore we need to blacklist everything except the SAN LUNs, including CD/DVD drives, local cciss (HP SmartArray) devices, etc.

Configure the multipath service according to your WWID numbers. Remember that each LUN shows up twice because we have two paths to it:
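The original listing is not reproduced here, so the following is only a minimal sketch of what such a configuration might look like. The WWIDs are placeholders (take the real ones from multipath -ll), the blacklist patterns are common defaults rather than the author's exact list, and the file is staged as multipath.conf.new (a hypothetical name) so it can be reviewed before being copied to /etc/multipath.conf. The aliases follow the naming used later in this guide, where the 2T LUN appears as mpath1.

```shell
# Stage a multipath.conf for review; nothing under /etc is touched here.
cat > multipath.conf.new <<'EOF'
blacklist {
        # keep dm away from non-disk devices, IDE disks and the local SmartArray
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z][0-9]*"
        devnode "^cciss!c[0-9]d[0-9]*"
}

multipaths {
        # placeholder WWID for the 2T user data LUN (LUN1)
        multipath {
                wwid   3600508b4000000000000000000000001
                alias  mpath1
        }
        # placeholder WWID for the 100GB virtual machine LUN (LUN2)
        multipath {
                wwid   3600508b4000000000000000000000002
                alias  mpath2
        }
}
EOF
```

Giving each LUN an alias keeps the device name stable under /dev/mapper even when the sdX names shift.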

For further information on HP technology please refer to the original HP guide:
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00814876/c00814876.pdf?HPBCMETA::doctype=file
Do exactly the same on the other node as well. I copied the multipath.conf over to the other host followed by setting up the services as shown above.

Multipath for root device:

If you want to boot your XEN host off the SAN, you would have an extra LUN for the host's OS. In that case, install the system onto one of the disks (paths) of your OS LUN. Once you have a running system you can build multipath on top of that; the only difference is that you will not create an alias for your host's OS LUN!

Heartbeat

Note: HA has been going through a transformation lately in order to support both the OpenAIS and Heartbeat cluster stacks equally. The resource manager (crm) was extracted from the HA package and became an individual project named Pacemaker.

What you see in SLES10 at the time of writing (version 2.1.4) is a special Novell port for SLES10 customers only, bundling new features and bug fixes made since the change in the project. This will change in SLES11: HA will be replaced with OpenAIS and will follow the packaging and naming conventions of the recent project changes.

Heartbeat (referred to as HA) is a very powerful, versatile, open source clustering solution for Linux. SuSE and IBM are big contributors to this project (code), for which I am personally thankful.

HA will be our central database and control center; it will manage our resources (domU) and their dependencies according to a set policy. Some services can be configured via LSB scripts for certain runlevels, but HA will take over most services necessary for domU management. Note that EVMS, which we will configure in a minute, doesn't maintain cluster memberships. We need HA to maintain memberships and activate EVMS volumes on our nodes at startup.

Change the filter to Patterns, then select the entire High Availability group, since we will need its other components later on. Ignore the disk section; the picture was not taken on the actual server…

HA at present still supports the v1 configuration (haresources file), but we use the new v2 style (crm with XML files), which is known to be better and more powerful.

In this section I only show the initial setup of HA; resources and the complete cluster setup will be discussed later on. Since evmsd is started by HA, I have to present this part before I can discuss EVMS volumes and disk configurations.

The bold lines tell Heartbeat to bring up evmsd at startup. evmsd is a remote extension of the EVMS engine without the brain. I configured unicast simply because I prefer it over broadcast and multicast. Last but not least, I configured one ping node because I only care about one connection back to my private LAN; the other NIC is reserved for HA communication only.
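For orientation, here is a minimal sketch of an ha.cf covering the points just described. It is not the author's file: the IP addresses are placeholders, only the directives discussed in the text are included, and the result is staged as ha.cf.new (a hypothetical name) for review before being copied to /etc/ha.d/ha.cf.

```shell
# Stage a minimal ha.cf; adjust addresses and node names before use.
cat > ha.cf.new <<'EOF'
# enable the v2 cluster resource manager
crm yes
node host1
node host2
# unicast heartbeats over the dedicated HA NIC; the entry matching the
# local address is ignored, so the same file can be used on both nodes
ucast eth1 10.0.0.1
ucast eth1 10.0.0.2
# single ping node: the gateway of the private LAN, reached via eth0
ping 192.168.1.1
# bring up evmsd together with Heartbeat
respawn root /sbin/evmsd
apiauth evms uid=hacluster,root
EOF
```

The last two lines are the ones that matter for EVMS: respawn starts evmsd and restarts it if it dies, and apiauth lets it talk to the cluster.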

I modified the global options as shown above for performance reasons and added the second part to the bottom of the configuration file. I used my server's DNS name and excluded debug info to suit my own taste; you may consider doing the same. When done, rebuild syslog-ng's configuration file:

host1:~ # SuSEconfig --module syslog-ng

Do not forget to prepare your remote syslog server for these log entries! Related reading (a bit outdated, and it doesn't include performance settings):

EVMS

EVMS is a great, open source, enterprise class volume manager, again with significant support from IBM and SuSE. It includes a feature called CSM (Cluster Segment Manager), which we use to manage shared LUNs and to present the block devices (partitions) and the complete storage arrangement identically on all dom0 nodes. On top of CSM we use LVM2 volume management to create, resize and extend logical volumes.

Since you could have a different LVM2 or EVMS arrangement within your domU, I include in the evms.conf file only the device which I want to manage on the XEN host. This is the 100GB LUN for the virtual machines. I want to hide the rest from the host system; I don't want EVMS to discover or interfere with disks I am not planning to use or manage from the XEN host.

The multipath -ll command (earlier) tells you the device-mapper (referred to as dm) device number you need for this.

Note: this numbering changes as you add or remove disks, LUNs, etc. When you do so, it's advisable to reboot the host to see what the new dm layout is like, then update the EVMS configuration accordingly.
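As a rough illustration of the two evms.conf changes discussed in this section (restricting device discovery and enabling metadata backup), the fragment below is a sketch under stated assumptions: dm-1 is a placeholder for whatever dm device multipath -ll reports for the 100GB LUN, the option names are the ones I believe the stock evms.conf uses, and the output file name evms.conf.fragment is hypothetical. Compare it against your own /etc/evms.conf rather than pasting it in.

```shell
# Write the two relevant sections to a scratch file for comparison.
cat > evms.conf.fragment <<'EOF'
engine {
        # shipped commented out; uncommenting enables automatic metadata backups
        auto_metadata_backup = yes
}

sysfs_devices {
        # dm-1 is a placeholder for the dm device of the 100GB virtual machine LUN
        include = [ dm-1 ]
}
EOF
```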

It's a very good idea to set up automatic metadata backup; just remove the comment from the corresponding line as shown above.

Remember that when you save the configuration you update the metadata; the utilities read the metadata from the disk. The evms.conf file really only controls the global behavior of the engine.

LVM2 versus EVMS:
Remember that I have another 2T LUN which I want to manage solely by LVM2 within the NFS domU, hence I also disabled LVM2 on the XEN host to avoid interfering with EVMS managed disks and with the domU's LVM2 configuration later on.

The reason is that when you map a block device, file image, CDROM, etc. to a XEN virtual machine, it appears on the XEN host just like it does within the domU. You can see pretty much the same thing from both (you can mount it, partition it, etc.), hence I always make sure that on the XEN host I only manage the disks I need to manage on the host. In a complex environment this could otherwise create confusion…

So why the big fuss? What is wrong with the original HASI design, and why don't we use file image based virtual machines?

Well, it's a long story that started more than 2 years ago, when I first began playing with this technology. At that time loop mounted file images were slow and we simply couldn't afford that. Today we have the blktap driver shipped with SLES10 SP2, providing nearly native disk I/O performance on file images out of the box.

You could say, then, that this EVMS complexity is unnecessary and makes the domU less portable, which I would disagree with. I think having file images adds one extra layer of complexity to our storage and involves OCFS2. I often had issues with it, especially when new versions came along, hence I only use it for my configstore.

The other reason is that block devices are just like file images: virtual or special files. I can create a copy of one at any time with dd, redirecting the output to a file to get an identical copy of my block device (partition). Last but not least, this has been working for nearly 2 years in production without one glitch.

Now, we can create the volumes. Note: I am going to present my configuration here just for reference. If you need a step by step guide please read this document:

After you have created your EVMS volumes (with the evmsn or evmsgui utilities), save the configuration. To activate the changes (create the devices on the file system) on all XEN hosts immediately, we need to run evms_activate for every other node, simply because the default behavior of EVMS is to apply changes on the local node only.

I have 2 nodes at this stage and I only need to activate the other node:

host1:~ # evms_activate -n host2

This is where the evmsd process we started with HA becomes important. It's our engine handler on the remote node; without it we wouldn't be able to create the devices on the remote hosts' file systems.

What if I had 16 nodes? It would be a bit overwhelming, so here is a quick solution to do this on all nodes (there are many other ways of doing this):
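One possible way, sketched here rather than taken from the original, is a small shell function that loops over a list of node names and calls evms_activate -n for each remote one. The function name activate_all_nodes and the idea of passing the node list as arguments are assumptions made for this example.

```shell
# Run evms_activate for every node given as an argument, skipping the local node.
activate_all_nodes() {
    local node
    for node in "$@"; do
        # the local node was already activated when the configuration was saved
        [ "$node" = "$(uname -n)" ] && continue
        evms_activate -n "$node" || echo "activation failed on $node" >&2
    done
}

# Example (not executed here): activate_all_nodes host1 host2 host3
```

A failure on one node is reported but does not stop the loop, so the remaining nodes are still activated.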

EVMS will disappear in SLES11. Novell will support it for the lifetime of SLES10, and probably the same goes for OES2, since at this stage NSS volumes can only be created with EVMS. I did ask Novell about the transition from EVMS to cLVM (when you upgrade from SLES10 to SLES11) and, as usual, there will be tools, procedures and support for this.

It's your call whether you want to use a technology today that will be discontinued or unsupported in future releases, but as mentioned on the page below, customers shouldn't be put off by this decision and are still encouraged to use it.

OCFS2 cluster file system for XEN domU configurations

We need a fairly small volume for the XEN virtual machine configurations. The best choice for this is OCFS2, Oracle's cluster file system. We will mount it under the default /etc/xen/vm directory on all member nodes. Both XEN nodes will see the same files, and both will be able to change them or create new ones on the same file system.

I will not provide a step-by-step solution for this; it's been discussed many times and there's a lot written about it already.

In SLES10 SP1 there was a timing issue with Heartbeat managed OCFS2 on XEN hosts. In SLES10 SP2 the defaults changed to fix this, but it may be worthwhile to mention the solution in this guide (you probably don't need to do this).

Sometimes the networking on a XEN host takes more time to come up (remember that xend modifies the network configuration, creates virtual bridges, etc.); maybe your switches are busy or STP is enabled, or perhaps something else is causing a slight delay. Either way, OCFS2 is very sensitive to this. We need to ensure that OCFS2 gives the members enough time to handshake if there's a delay on the network:
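The exact values used originally are not preserved, so the following is only an assumption-laden sketch of the kind of tuning involved. The o2cb timeouts live in /etc/sysconfig/o2cb; the two variables shown exist there, but the numbers are illustrative, the choice of which timeout to raise is mine, and the file name o2cb.sysconfig.fragment is hypothetical. Whatever you choose must be identical on every node.

```shell
# Illustrative values only, staged in a scratch file for review.
hb_seconds=60
# the disk heartbeat threshold counts 2 second iterations: seconds / 2 + 1
hb_threshold=$(( hb_seconds / 2 + 1 ))
cat > o2cb.sysconfig.fragment <<EOF
O2CB_HEARTBEAT_THRESHOLD=$hb_threshold
O2CB_IDLE_TIMEOUT_MS=60000
EOF
```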

Heartbeat Cluster Configuration

HA is blank at this stage. We have configured it to bring up the evmsd process when it starts and told it which other cluster members it needs to talk to; the nodes should already be in sync. This chapter explains the cluster's operation, the policy, which resources we want HA to manage and how they should react to certain events. I shall try to explain everything as clearly as possible, but there will be details not covered in this guide.

One of the biggest issues with HA is the documentation. There is some, but it's usually outdated and hard to find. I understand that the project is trying hard to improve it, but it's still far from good, as the product is complex and changing rapidly. The best documentation I found was the Novell one; I encourage everybody to read it to gain a decent understanding of the product: http://www.novell.com/it-it/documentation/sles10/heartbeat/data/b3ih73g.html

The best possible source of information is still the mailing list, though; you will want to join it, or at least browse the archives, if you are serious about HA: http://wiki.linux-ha.org/ContactUs

As mentioned previously, we will use the new v2 (crm) type configuration with XML files. It's not as pleasant to read as a plain text file, but it's easy to get used to, and any decent text editor nowadays can recognize XML and help you with syntax highlighting, etc.

Outline:

create and save XML entry for each resource or policy

load them into the cluster one by one

monitor the cluster for reaction

backup the final (complete) configuration

Note: HA does have a GUI (hb_gui), but as of today it's really just for basic operations and still not useful for complex configurations. I only use it for monitoring, or perhaps to start/stop a resource or put a node into standby. Therefore the configuration method presented in this guide is mainly CLI (command line) based.

The cluster configuration is replicated amongst all member nodes, therefore it only has to be done once and can be done from any node, although my preference is always the DC (designated controller) node. You can find out which node that is from the monitoring commands (hb_gui, crm_mon) or alternatively:

host2:~ # crmadmin -D
Designated Controller is: host1

Global settings:

HA has a very good default configuration/behavior, therefore we have very little to change here:
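A minimal sketch of such global settings follows. It assumes the three options discussed in this section (STONITH enabled, reboot as the STONITH action, and a default resource stickiness); the ids, the file name cibbootstrap.xml and the INFINITY stickiness value are illustrative choices, not the author's exact listing.

```shell
# Write the global cluster options to an XML file in the current directory.
cat > cibbootstrap.xml <<'EOF'
<cluster_property_set id="cibbootstrap">
  <attributes>
    <nvpair id="cibbootstrap-01" name="stonith-enabled" value="true"/>
    <nvpair id="cibbootstrap-02" name="stonith-action" value="reboot"/>
    <nvpair id="cibbootstrap-03" name="default-resource-stickiness" value="INFINITY"/>
  </attributes>
</cluster_property_set>
EOF
# It could then be loaded with: cibadmin -C -o crm_config -x cibbootstrap.xml
```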

Usually id="something" is a name given by you; it can be anything, although you should keep it meaningful and easy to read, without symbols, etc. (and keep the indents in the XML).

What is worth mentioning here is that we enable STONITH with a default action of reboot. It's our power switch: it ensures that in case of a node failure such as a reboot, freeze, network issue, or any occasion when Heartbeat stops receiving signals from the other node, the misbehaving node is rebooted. It is therefore extremely important that you have multiple communication paths (multiple NICs for example) between your nodes to avoid serious problems.

The resource-stickiness setting ensures that if a failed node comes back online after a reboot, the resource (virtual machine) which was moved will NOT move back to its original location (where it was started). It's a safety feature that saves you from resources bouncing between failing nodes.

Generally a resource stays where it was started unless its node was rebooted or put into standby, or the resource was forced to move by the administrator. HA will always try to keep the balance and harmony in your cluster. This means that if, for example, you have 10 domUs to load into the cluster or to start up in a brand new one, HA will balance them amongst all nodes (2 node cluster = 5 each) unless you configure the policy with preferred locations for certain domUs. That sort of thing is out of the scope of this guide because it's not very useful for XEN clustering, or at least it wasn't for me.

So far we have only enabled STONITH globally in the cluster; we don't have power switches yet. These are like daemons running on each member node which execute the reboot, depending on the STONITH resource type and the global action set. Remember that we are talking about rebooting XEN cluster member nodes, not resources (virtual machines).

HA ships with a test STONITH agent which executes the reboot via ssh. It's not for production use, but I configured it because I'd rather have more than one. It can do the job as long as the failing node is responding (not frozen).

Failing as a term can mean anything in HA. You may have a perfectly working XEN cluster member node, but if, for example, HA cannot start a new domU (resource) on a node (because you made a typo in the XEN configuration file), that is severe from HA's point of view, because by default its job is to keep all your resources up and running. As a result it would migrate (or stop and start) all existing resources from the node where the startup failed to another node and reboot the misbehaving node. Of course it would then try starting that resource on the other node and it would fail there too. Once the rebooted node came back online, HA would migrate all resources away from the other node and try the new resource with the typo again. Of course it would fail again, and from this point HA would wait for admin interaction, keep running what is healthy and mark the misbehaving resource as failed. This can take a significant amount of time if you have more nodes, as it tries the same thing on all nodes, one by one. It may not be an issue in a test environment but can be severe in production.

perhaps set the resource to not managed so HA will not bother if it fails to start

put the resource back to managed mode if everything is working as expected

ssh test STONITH agent:

We need passwordless login (with ssh keys) for the root user between all nodes. Doing this is out of the scope of this document (the original HASI guide discusses it); before you configure the agent, ensure that you can log into all nodes as root without a password from all nodes.

It's a clone resource, meaning that we will have a running copy on each member node with identical characteristics. As shown, I configured the maximum number of clones (the number of nodes I have) and the maximum copies per node (usually one on each node). We set up monitoring as well: should something happen to my ssh daemon so that this resource cannot log into one of my nodes, I shall be notified about it.
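As a sketch of what such a clone might look like (not the author's exact XML): the ids, timeouts and the file name stonith-ssh.xml are made up for this example, and the clone limits are written in the instance_attributes style used by the original HASI guide.

```shell
# Write an ssh STONITH clone definition to an XML file in the current directory.
cat > stonith-ssh.xml <<'EOF'
<clone id="STONITH-ssh" globally_unique="false">
  <instance_attributes id="STONITH-ssh-ia">
    <attributes>
      <nvpair id="STONITH-ssh-01" name="clone_max" value="2"/>
      <nvpair id="STONITH-ssh-02" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive id="STONITH-ssh-child" class="stonith" type="external/ssh" provider="heartbeat">
    <operations>
      <op id="STONITH-ssh-op-01" name="monitor" interval="30s" timeout="20s" prereq="nothing"/>
      <op id="STONITH-ssh-op-02" name="start" timeout="20s" prereq="nothing"/>
    </operations>
    <instance_attributes id="STONITH-ssh-child-ia">
      <attributes>
        <nvpair id="STONITH-ssh-child-01" name="hostlist" value="host1 host2"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>
EOF
# It could then be loaded with: cibadmin -C -o resources -x stonith-ssh.xml
```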

The last thing is to enable the at daemon. The way the ssh STONITH agent works is that in case of a node failure, an intact node logs into the failing one (assuming that's possible) and schedules a reboot via the at daemon:

host1:~ # insserv atd && rcatd start

riloe STONITH agent

The next thing is to configure a far more production ready STONITH agent for my servers. The best is to use something which is independent of the operating system, like iLO for HP. At the time of writing HA comes with agents for most hardware vendors like HP, IBM, etc.

So which STONITH agent will actually act and execute the reboot when disaster strikes if I have more than one? The answer is any: HA will pick one randomly.

By design, the common approach is to configure STONITH agents as clone resources. For clusters with many nodes this actually causes minor issues because:

The iLO resource is configured on all nodes (even the one where the iLO is physically installed). It makes sense to log into the failing node from another node and execute the reboot, right? (Suicide is not actually allowed by default.) So when the monitoring operation is due (every 30 seconds or whatever you set), the agents on all nodes will try logging into the iLO device.

Occasionally a race condition arises when 2 nodes try to log into the same iLO device at once, which they can't, causing weird behavior, errors, etc.

According to a recent discussion on the linux-ha mailing list, this should be fixed by now regardless of which method you use, but I just couldn't see any point in having a copy on the node where the iLO device is physically installed, even if suicide is not allowed (safe).

I think it's nonsense to run the iLO STONITH agent on all nodes, regardless of how many nodes you have, because we only need one on a healthy node: the DC (it is always the stonithd on the DC that receives the fencing request) will instruct the node where the agent is running to execute the reboot on the corresponding iLO device (installed on the failing node). Hence, due to the nature of the iLO device, I took a different approach and:

created one primitive iLO STONITH resource

configured the cluster to run this anywhere but on the node where it is installed

On a 2 node cluster it will obviously be the other node, but on a cluster with many nodes it could run anywhere depending on node availability, cluster load, etc. This solution has been working seamlessly for me for quite some time.

As usual we create a unique id for a rule, then tell the cluster that the STONITH-iLO-host1 resource (which doesn't exist yet) has a score of -INFINITY. In HA a preference is always expressed by a score, and -INFINITY is the lowest possible one, meaning that the location is not just the least preferred: the resource can never run there. Within the rule we then create an expression (also with a unique id) telling the cluster where the rule applies: on the node where the iLO interface is installed.
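The rule described above can be sketched as a location constraint like the one below. The ids and the file name stonith-ilo-host1-location.xml are assumptions; the structure (a rule with score -INFINITY and an expression matching the node name) follows the description in the text.

```shell
# Write a location constraint that keeps STONITH-iLO-host1 off host1.
cat > stonith-ilo-host1-location.xml <<'EOF'
<rsc_location id="STONITH-iLO-host1-placement" rsc="STONITH-iLO-host1">
  <rule id="STONITH-iLO-host1-placement-rule" score="-INFINITY">
    <expression id="STONITH-iLO-host1-placement-expr" attribute="#uname" operation="eq" value="host1"/>
  </rule>
</rsc_location>
EOF
# It could then be loaded with: cibadmin -C -o constraints -x stonith-ilo-host1-location.xml
```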

Along with the normal operations we configure instance_attributes as well, which describe the details of our iLO device. This is for iLOv2; if you happen to need it for the older iLOv1, the difference would be:
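Since the original listings are not reproduced here, the following is a hedged sketch of an iLO primitive for host1 using the external/riloe agent. The iLO hostname, user and password are placeholders, the ids and file name are invented, and the iLOv1 values mentioned in the XML comment are my understanding of the agent's parameters rather than the author's text.

```shell
# Write an iLO STONITH primitive for host1 to an XML file in the current directory.
cat > stonith-ilo-host1.xml <<'EOF'
<primitive id="STONITH-iLO-host1" class="stonith" type="external/riloe" provider="heartbeat">
  <operations>
    <op id="STONITH-iLO-host1-op-01" name="monitor" interval="30s" timeout="20s" prereq="nothing"/>
    <op id="STONITH-iLO-host1-op-02" name="start" timeout="60s" prereq="nothing"/>
  </operations>
  <instance_attributes id="STONITH-iLO-host1-ia">
    <attributes>
      <nvpair id="STONITH-iLO-host1-01" name="hostlist" value="host1"/>
      <nvpair id="STONITH-iLO-host1-02" name="ilo_hostname" value="ilo-host1.example.com"/>
      <nvpair id="STONITH-iLO-host1-03" name="ilo_user" value="Administrator"/>
      <nvpair id="STONITH-iLO-host1-04" name="ilo_password" value="CHANGEME"/>
      <!-- assumed iLOv1 variation: ilo_can_reset 0 and ilo_protocol 1.2 -->
      <nvpair id="STONITH-iLO-host1-05" name="ilo_can_reset" value="1"/>
      <nvpair id="STONITH-iLO-host1-06" name="ilo_protocol" value="2.0"/>
      <nvpair id="STONITH-iLO-host1-07" name="ilo_powerdown_method" value="power"/>
    </attributes>
  </instance_attributes>
</primitive>
EOF
# It could then be loaded with: cibadmin -C -o resources -x stonith-ilo-host1.xml
```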

This can take a few moments to come up green; be patient and monitor the cluster.

Hint:

If it didn't come up for some reason, or stopped itself after a while (this rarely happens), select “stop” in the GUI, wait a few seconds, then select “clean the resource on all nodes”. A few seconds later select the “default” option, which should start it up fine. I have slow hubs connecting my iLO network; perhaps that's what causes this minor issue occasionally.

At this point we are just configuring our cluster for general behavior, setting up rescue and safety tools and resources. The last one on the list is the ping daemon. We already specified the gateway IP address of my private LAN (in ha.cf), which is reached via the eth0 interface. The eth1 interface is purely for HA communication in my setup, so if I can't ping my gateway IP via eth0, something is wrong. Since all my domU resources will share the eth0 interface, it's crucial to monitor it and make sure that it's working, otherwise all my resources (virtual machines) could become unreachable.

Outline:

create a clone resource for pingd

configure the cluster to score ping (network) connectivity

configure the cluster to run resources only where ping connectivity is defined

Each resource (domU) in the cluster has to be individually configured for pingd connectivity, hence I will leave that for later; right now I just configure the clone resource:
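A pingd clone along these lines might look like the sketch below. The dampen and multiplier values, the ids and the file name pingd.xml are illustrative assumptions, not the author's settings.

```shell
# Write a pingd clone definition to an XML file in the current directory.
cat > pingd.xml <<'EOF'
<clone id="pingd" globally_unique="false">
  <instance_attributes id="pingd-ia">
    <attributes>
      <nvpair id="pingd-01" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive id="pingd-child" class="ocf" type="pingd" provider="heartbeat">
    <operations>
      <op id="pingd-child-op-01" name="monitor" interval="20s" timeout="40s" prereq="nothing"/>
      <op id="pingd-child-op-02" name="start" prereq="nothing"/>
    </operations>
    <instance_attributes id="pingd-child-ia">
      <attributes>
        <nvpair id="pingd-child-01" name="dampen" value="5s"/>
        <nvpair id="pingd-child-02" name="multiplier" value="100"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>
EOF
# It could then be loaded with: cibadmin -C -o resources -x pingd.xml
```

With a multiplier of 100 and a single ping node, a node that can reach the gateway gets a pingd score of 100 and one that cannot gets 0, which is what later location rules can test against.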

crm_mon is a great utility: it displays information about the cluster in various ways and can even provide output for the nagios monitoring system. See the man page for further information.

EVMS resource:

On SLES10 the /dev directory is actually tmpfs, meaning that it's a temporary file system populated by udev every time the system starts. It also means that the next time we boot the servers our EVMS devices will not be available under /dev/evms/…

Remember that when we saved the EVMS disk configuration we had to run evms_activate for the other node to make the devices available (create them). This is exactly what we need to do at boot, and luckily HA ships with an OCF resource agent that does exactly this for us. The other good thing about handling EVMS this way is that we can make all the other resources dependent on it.

The benefit of this is that HA will ensure that evms_activate has been run and the devices are in place before starting dependent resources (domU).

All it does is run evms_activate on all nodes when the resource starts up.
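For reference, a clone of the EvmsSCC resource agent in the style of the original HASI guide could be written as below. Treat it as a sketch: the ids are illustrative, and only the file name evmscloneset.xml is taken from the load command in this guide.

```shell
# Write the EVMS clone definition to evmscloneset.xml in the current directory.
cat > evmscloneset.xml <<'EOF'
<clone id="evmscloneset" notify="true" globally_unique="false">
  <instance_attributes id="evmscloneset-ia">
    <attributes>
      <nvpair id="evmscloneset-01" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive id="evmsclone" class="ocf" type="EvmsSCC" provider="heartbeat"/>
</clone>
EOF
```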

Load the resource into the cluster:

host1:~ # cibadmin -C -o resources -x evmscloneset.xml

Note: SLES10 ships with LSB scripts (found in /etc/init.d) that do the same thing, creating EVMS devices during the boot process, but they will not work with CSM containers. They run fine without errors but your devices won't be created, perhaps because evmsd is not running yet. Whatever causes this, according to the project's page the designed way at the time of writing is to manage cluster memberships with HA. Even if you don't plan to deploy clustered XEN domUs and just want to share storage between 2 bare metal servers with EVMS and CSM, you need the same setup up to this point, except for the ping daemon.

OCFS2 cluster file system resource:

By now we should have our OCFS2 cluster up and running, the file system created and ready to mount, services set for boot, etc. The last step is to mount the actual device, which we will do with HA and its file system resource agent. The reason is that the mount needs to be done on more than one node and doing it with HA makes it:

cluster aware (every node will know when a node leaves, joins the cluster)

simple (single configuration for multiple mounts)

This again needs to be a clone resource, one that mounts the /dev/evms/san2/cfgpool volume to /etc/xen/vm on each node in the cluster:

Again we are configuring an anonymous cloneset, so the globally_unique parameter is (again) set to false. Since this time the cloneset contains an OCFS2 file system resource agent, we want to enable notify for it, so that the clones (each agent on each node) receive notifications from the cluster and are informed about the cluster membership status. To enable notifications, set notify to true for the cloneset. We also configure the monitor operation, so that the cluster checks every 20 seconds whether the mount is still there.

The most important part of the XML blob is the attributes section of the configpool primitive part. Set the device parameter to the OCFS2 file system that needs to be mounted, directory to the directory on which this file system must get mounted, and fstype to ocfs2 for obvious reasons. In a clone file system RA (resource agent), any other value is forbidden because OCFS2 is the only supported cluster aware file system at this stage.
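Putting the description above together, the cloneset might look roughly like this sketch. The device, directory, fstype, notify setting and 20 second monitor interval come from the text; the ids and the monitor timeout are assumptions.

```shell
# Write the OCFS2 mount cloneset to configpoolcloneset.xml in the current directory.
cat > configpoolcloneset.xml <<'EOF'
<clone id="configpoolcloneset" notify="true" globally_unique="false">
  <instance_attributes id="configpoolcloneset-ia">
    <attributes>
      <nvpair id="configpoolcloneset-01" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive id="configpool" class="ocf" type="Filesystem" provider="heartbeat">
    <operations>
      <op id="configpool-op-01" name="monitor" interval="20s" timeout="60s" prereq="nothing"/>
    </operations>
    <instance_attributes id="configpool-ia">
      <attributes>
        <nvpair id="configpool-01" name="device" value="/dev/evms/san2/cfgpool"/>
        <nvpair id="configpool-02" name="directory" value="/etc/xen/vm"/>
        <nvpair id="configpool-03" name="fstype" value="ocfs2"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>
EOF
```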

Load the resource into the cluster:

host1:~ # cibadmin -C -o resources -x configpoolcloneset.xml

Without EVMS volumes this resource wouldn't be able to start, hence we have to make sure that EVMS starts first by making the OCFS2 resource dependent on the EVMS resource:
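Such a dependency is expressed as an order constraint. The sketch below shows only the basic form, with an invented id and file name; the score attribute that the original constraint carried is deliberately not reproduced here because its value is not preserved in this copy of the guide.

```shell
# Write an order constraint: configpoolcloneset starts after evmscloneset.
cat > evms-before-configpool.xml <<'EOF'
<rsc_order id="evmscloneset-before-configpoolcloneset" from="configpoolcloneset" type="after" to="evmscloneset"/>
EOF
# It could then be loaded with: cibadmin -C -o constraints -x evms-before-configpool.xml
```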

Something to mention here about the score at the end… It has been very important since version 2.1.3. Without it, domUs can randomly restart after a successful live migration. Reading the documentation, I realized that it's harmless and should be in the CIB anyway.

This is the stage where I had issues with file image based domUs. I basically had the 100GB LUN formatted as one big OCFS2 file system, mounted with HA in a similar fashion. It was a while ago and I must admit that perhaps it is fixed by now, but when domUs started migrating to the other node things went wrong.

According to my observations and logs, it looked like a locking issue with live migration: OCFS2 couldn't hand over the lock to the other node or release the image file when XEN finished the handover and writing to the image file was about to continue on the other node.

Of course it was a severe error and the nodes kept receiving STONITH actions until things came right. Without live migration, HA would have stopped the domUs first and then started them up at the new location, which would have worked perfectly; the original HASI was based on this idea. I just couldn't afford this for storage which holds user data and is mounted (multiple times) all the time, and last but not least it's not what I want for a production environment.

FYI:

Another interesting issue we found recently is that the findutils-locate package, if installed, does not work very well with OCFS2. It has a cron job running every day which builds a database of all files found on the system, but when it reaches OCFS2 volumes it hangs. We made a support call about this; no updates yet.

It's a great file system and I like its features and the support Novell builds into its products for it, but I am not yet convinced that it's suitable for XEN clustering and live migration in a production environment.

NFS virtual machine (domU) resource:

Finally it is time to play with virtual machines and HA. As mentioned earlier in this guide, I will only present the solution for an NFS domU sharing a large disk with UNIX users, but along the same lines I run around 15 other domUs on 2 clusters hosting various UNIX services.

Creating a virtual machine is out of the scope of this guide, there’s plenty on the net about it. In fact I don’t install domU anymore, I maintain, run a plain copy on one of my clusters and clone that when I need a new one.

Note: the XEN domU configuration is still text based, not xenstore. It’s the only way of doing XEN clustering according to my knowledge because it’s tricky to sync the xenstore database amongst all nodes at this time of writing.

I assigned a fairly small (5GB) EVMS volume to this domU and added the 2T LUN as well, as is, without any modification as it comes off the dm layer with its alias name mpath1. The 5GB might look a bit tight but my domUs are crafted for their purpose, run purely in runlevel 3 (no GUI) and really just a cut-down SLES copy. At a time when I built this cluster, there was no JEOS release available, today it may be good choice for domU:http://www.novell.com/it-it/linux/appliance

The second domU above has 10 seconds start_delay set for all its operations. When HA starts up it starts all resources in the cluster according to ordering and only wait for dependencies to complete. If I had 20 domUs in my cluster it would hammer the system and its hypervisor and could lead some domUs to crash. I understand that SLES and xend has protection against this sort of problem however I had issues when some of my heavily loaded domUs started migrating all at the same time, some crashed occasionally.

This little time will delay the second domU operations and mostly 10 seconds is enough for start, stop, migration to complete unless it’s heavily loaded or has large RAM allocated. Mind you 10 seconds for each resource could cause significant time delay for a cluster loaded with 20 domUs, the last resource would have more than 3 minutes delay, therefore it’s your call to make how you configure or adjust these delays. You have to craft it for your environment, I cannot give you “one fits all” solution.

You could though:

reduce the time and only delay each domU for 5 seconds if you have many

delay a pair of domUs (or more) with either similar purpose or characteristics

delay just the ones you know being resource intensive or busy

don’t use timing at all if you have tested your cluster well and had no issues
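To get a feel for the numbers before choosing a step, the cumulative delay can be worked out quickly. The sketch below is plain arithmetic and assumes that each domU is delayed by one more step than the previous one and that the first starts without delay; the function name is made up for illustration.

```shell
#!/bin/sh
# Delay in seconds of the Nth domU when each one is staggered by STEP
# seconds and the first one starts immediately (hypothetical helper).
stagger_delay() {
    n=$1
    step=$2
    echo $(( (n - 1) * step ))
}

# 20 domUs, 10 seconds apart: the last one waits 190 s (over 3 minutes)
stagger_delay 20 10
# the same 20 domUs, 5 seconds apart: the last one waits 95 s
stagger_delay 20 5
```

Halving the step halves the worst case, which is why the shorter delay is worth considering once the number of domUs grows.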

You have to make sure that the domU runs, the configuration is typo free and the disk descriptions are correct before loading it into the cluster.

You could simply test the domU outside of the cluster (start it up with the traditional XEN utilities on one of the member nodes) or, better yet, test it in a test environment if you can afford one. If you load it into your HA cluster and it doesn’t start or crashes after a while, HA will issue STONITH for that node (reboot), which could affect other, already running, working services. It may not be something you want on a production system.
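A minimal sketch of such a test outside the cluster, to be run only while the resource is NOT yet loaded into HA. The configuration path and the domU name are assumptions for illustration; the xm subcommands are the standard ones.

```shell
#!/bin/sh
# Hypothetical config path and domU name; adjust to your setup.
DOMU_CFG=${DOMU_CFG:-/etc/xen/vm/nfs}
DOMU_NAME=${DOMU_NAME:-nfs}

test_domu_outside_cluster() {
    # 1. parse the configuration without creating the domain (dry run)
    xm create -n "$DOMU_CFG" || return 1
    # 2. start it for real and confirm that it shows up
    xm create "$DOMU_CFG" || return 1
    xm list "$DOMU_NAME" || return 1
    # 3. shut it down again before loading the resource into HA
    xm shutdown "$DOMU_NAME"
}
```

Calling test_domu_outside_cluster on one member node catches most typos early; it says nothing about how the domU behaves under load.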

If you are ready, load it into the cluster:

host1:~ # cibadmin -C -o resources -x xenvmnfs.xm

Now we create a policy to make this domU resource dependent on EVMS and configpool. Since they already depend on each other, it makes sense to make the domU dependent on the configpool resource:

In our scenario live migration is enabled, hence the cluster will not just stop the resources as mentioned on the page above: it will either start them on another node (assuming there’s network connectivity) or just live migrate them if there’s some other Ethernet connectivity between the cluster member nodes (there should be!).

It is not the best way of handling network connectivity scoring, as written on the page above, and I’m well aware of that, but the other method (preference scoring for better connectivity) could cause domUs to move between nodes, perhaps often, depending on load. I don’t want that; I want my resources to stay where they are as long as there’s networking available.

Should you lose networking on all your nodes, HA will shut down all domUs and keep scoring continuously in the background. Once networking comes back, it should start them up again, although I haven’t tested this behavior.

Remember that these policies will need to be individually configured for each domU resource you plan to run within an HA cluster.

Operating Hints

Caveats:

HA is our central database, it tracks and monitors each resource in the cluster, informs all member nodes about changes and synchronizes the CIB (cluster information base) amongst all nodes.

You must stop using any traditional XEN domU management utility including:

virsh (libvirt)

virtmanager (libvirt GUI)

xm

Anything you do with the domUs must be done through HA. If you stop one of your domUs with a traditional utility, HA does not know what happened to it and starts the resource up again, but what if you start it yourself at the same moment? Yes, corrupted storage. XEN will happily start multiple instances of the same domU without warning or complaint; that is just what it does, but files written from several instances at the same time without a cluster file system will be corrupted.

You can only use the utilities above for monitoring, gathering information and perhaps testing domUs outside of the cluster, nothing else. The cluster’s job is to keep the domUs running; should you want to change that status, tell the cluster using its built-in, HA aware utilities.

The other common mistake is a typo in the configuration files, particularly when you don’t install the domU but just clone it. Some typos are harmless but some can be quite destructive, hence:

You have to make sure that the XEN domU configuration has the correct disk descriptions and that they point to the right device. It’s a must regardless of whether you use an EVMS setup or file images on OCFS2.

It is easy to forget to change the disk line of the configuration after cloning, and when you start the clone up, there’s a very good chance of corrupting the disk of an already running production domU…

Crafting is a word that appears in this guide quite often. This is what makes the difference: create harmony in your cluster, and you don’t need to deviate much from the defaults to do so.

Basic principles: turn off unused services, no firewall, no AppArmor, patch regularly, install only the software packages you need, keep an eye on RAM and CPU usage, no unnecessary accounts, use runlevel 3, etc.
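A quick way to catch a forgotten disk line after cloning is to look for the same device appearing in more than one domU configuration. This is only a sketch: the directory holding the configurations is an assumption, and a device that is deliberately shared (a read-only ISO image, for instance) will be reported as well.

```shell
#!/bin/sh
# Print every disk source (phy:, file: or tap:aio:) that is referenced by
# more than one domU configuration file in the given directory.
find_shared_disks() {
    dir=${1:-/etc/xen/vm}   # hypothetical location of the config files
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # ignore commented lines, list each device once per file
        grep -v '^[[:space:]]*#' "$f" \
            | grep -oE "(phy|file|tap:aio):[^,'\"]+" \
            | sort -u
    done | sort | uniq -d
}
```

An empty result from find_shared_disks means no device is claimed twice; any line printed deserves a look before the clone is loaded into the cluster.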

Heartbeat GUI:

You should see lots of nice little green lamps in your cluster now, so let’s talk a bit about the GUI. It’s very basic, can do certain things and gets better with every release, but I use it mainly to get an overview of the services or for basic operations, and I strongly recommend you do the same.

Note: this is something you will need to do on all cluster member nodes unless you are using centralized user management (LDAP). The GUI can be started from any member node within the cluster and will display, work and behave the same way.

host1:~ # hb_gui &
or
host2:~ # hb_gui &

Now you should be able to authenticate with your credentials, learn and get used to the interface:

You can configure the status of a resource in the XML file before loading it in, for example:

<nvpair id="xen-sles-op-05" name="target_role" value="stopped"/>

or

<nvpair id="xen-sles-op-05" name="target_role" value="started"/>

In the CIB it becomes the default for that particular resource. Frankly, I don’t see any point in loading something into the CIB with stopped status, and doing it with started is pointless because that’s the HA default action anyway.

But when we stop or start a resource, either with the GUI or with my script, we in fact insert an interim attribute with a generated id into the CIB without making it the default.

It’s important because if you want to start the stopped resource again you could select start:

But then you just replace the stopped interim attribute with the started one, and you still leave interim bits in your CIB. To do it properly, select default, which actually removes the interim attributes and applies the HA default status, which is started, to the resource:

Deleting the target_role attribute inserted by the GUI from the right panel has the same effect as default. I personally don’t like interim entries in my CIB; I like to keep it nice and clean. Note: at this stage my script doesn’t have an option for default!

Safe way of testing new resource configurations:

Essential for production environment even if you are certain that things would work. The best way is to tell the cluster not to manage the resource in the initial XML file:

<nvpair id="xen-sles-op-05" name="is_managed" value="false"/>

Should the new resource fail to start or crash after a while, the cluster will ignore the failure and will not issue fencing for the node where the failure occurred. Once it has become stable, you can remove this attribute with the GUI or edit the CIB with cibadmin (covered later on).

Disable monitoring operation:

You may need this at some stage, I haven’t used it so far. It’s just an extra attribute:

Notice that I removed the monitor delay from the previous example just to avoid breaking the line and to maintain readability; otherwise it would be there. You can remove this attribute the same way as the previous one.

Editing the CIB:

You can only use the GUI or the cibadmin utility. You must not edit the CIB by hand even if you know where it is located on the file system.

The GUI is simple: you just edit or delete the particular part you need. The command line works differently. Assuming you saved all the XML files you loaded into the CIB, make the change you want in the file and load it back into the cluster, but use the replace option (-R) instead of create (-C), for example:

host1:~ # cibadmin -R -o resources -x xenvmnfs.xm

Backup, Restore the CIB:

You can backup the entire CIB:

host1:~ # cibadmin -Q > cib.bak.xml

It includes all the LRM parts, which you wouldn’t normally need. LRM stands for local resource manager, basically the part which handles everything locally on the corresponding node. The CRM is replicated across all nodes, whereas the LRM is the component which performs the local actions on each node. Hence I prefer to back up my CIB by object type instead:

Migration is a little bit different, therefore I would like to talk about it. In HA, when you tell your cluster to do something, you either set target_role (stop, start) or set a certain preference (migration). Preference is managed by scoring, as you may have realized already: when you migrate a resource, you actually instruct the cluster either NOT to prefer a certain resource on a particular node (score -INFINITY) or to prefer it there the MOST (score +INFINITY).

Using my script to migrate and right-clicking on a resource in the GUI and selecting the “migrate resource” option are the same: they apply interim scoring and insert a rule into the cluster. This nature of HA applies to all resource types, not just to virtual machines!

Ultimately you are messing up the CIB, which you will need to clean up at some point. As you can see, the cleanup is already built into the interface (the option further down) and my script also includes a subcommand for it, but migration is still not the preferred way of moving resources, at least not for me.
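A minimal sketch of a backup by object type, assuming the usual configuration sections (nodes, resources, constraints, crm_config) and leaving out the status section where the LRM data lives. The target directory is a made-up default.

```shell
#!/bin/sh
# Dump each configuration section of the CIB into its own file.
backup_cib_by_type() {
    dir=${1:-/root/cib-backup}   # hypothetical backup location
    mkdir -p "$dir" || return 1
    for obj in nodes resources constraints crm_config; do
        # -Q queries the CIB, -o restricts the query to one object type
        cibadmin -Q -o "$obj" > "$dir/$obj.xml" || return 1
    done
}
```

A single section can later be loaded back with the replace option shown earlier, pointing -o at the object type and -x at the matching file from the backup directory.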
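For reference, the same interim rule can be inserted and removed from the command line with crm_resource. This is a sketch, not the script mentioned above; the resource and node names passed to the functions are placeholders.

```shell
#!/bin/sh
# $1 = resource id, $2 = target node (both hypothetical names)
migrate_domu() {
    # inserts the interim location rule that moves the resource
    crm_resource -M -r "$1" -H "$2"
}

unmigrate_domu() {
    # removes the interim rule again so the CIB stays clean
    crm_resource -U -r "$1"
}
```

For example, migrate_domu nfs host2 followed later by unmigrate_domu nfs leaves no migration rule behind.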

Standby is the way to go:

In the last 2 years of running these clusters I have only had to migrate resources when I wanted to patch the servers or do maintenance on them. The best way is to put the node into standby. It’s designed for this: it migrates all resources which are migration capable (and need to be running), stops and starts the others which are not, then stops the remaining resources which don’t need to be running.

As usual my script includes a built in option to do this or you could use the GUI:

It takes some time depending on your setup, the time delays you set, the number of resources and their load, so just be patient. Once HA reports that the node is in running-standby, its resources are stopped and the domUs are running on another node, you can basically do whatever you like: patch the XEN host, upgrade it, shut it down for maintenance, upgrade firmware and so forth.

When you have finished, just make it an active node again, with the GUI or the script, it’s your choice:

How do I actually back up these complex systems? What would the restore be like? We use CommVault7 to back up the data partitions (within the domU) but I use tar for everything else. The dom0 is simple: it logs to a remote server as well and barely changes, hence I back it up (full) once a week to a remote backup server via NFS, where it gets stored onto tape.

The restore would be simple too. Should the system fail to boot, I would boot from the original SLES install DVD, select rescue mode, configure networking and then restore from the remote NFS server. We actually keep lots of system backups online in the remote backup server’s disk cache.

For the domU I run tar based daily differential backups and a full one once a week. You would expect the restore to be more challenging due to the nature of the setup, but it’s actually fairly easy. You can fix, restore, back up, modify or do whatever you need on any domU disk from the XEN dom0 host, regardless of whether you use an EVMS, partition or file image based setup. I already published a guide on how to access domU disks from the host, which is the key to this solution:
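Outside the GUI and the script, the standby attribute can also be handled with the crm_standby utility. A minimal sketch, with the node name passed as an argument:

```shell
#!/bin/sh
# $1 = node name as known to the cluster, for example host1
standby_on() {
    crm_standby -U "$1" -v true    # put the node into standby
}

standby_query() {
    crm_standby -G -U "$1"         # show the current standby value
}

standby_off() {
    crm_standby -D -U "$1"         # delete the attribute: node is active again
}
```

A typical maintenance round would be standby_on host1, waiting until the resources have moved, doing the work, then standby_off host1.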

I tested this solution many times when I accidentally deleted domU disks or corrupted them during development. It actually takes less than 10 minutes to fully recover a domU this way which is pretty good.

If you can afford the domU being offline, you can clone any domU disk with the dd utility to create an image backup, a daily snapshot or whatever you want.
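A small sketch of such an image backup with a verification step. It assumes the domU has been stopped through HA first; the device and image paths in the usage note are placeholders.

```shell
#!/bin/sh
# Copy a domU disk (block device or image file) to an image file and
# verify the copy byte by byte. Only safe while the domU is stopped.
image_backup() {
    src=$1
    dst=$2
    # 1 MiB blocks keep the copy reasonably fast
    dd if="$src" of="$dst" bs=1048576 2>/dev/null || return 1
    # cmp -s is silent and returns non-zero if the two differ
    cmp -s "$src" "$dst"
}
```

For example, image_backup /dev/evms/nfs /backup/nfs.img would clone the 5GB system volume; the same function restores when the two arguments are swapped.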

Update regime:

Due to the nature of the system, it makes sense not to be a version freak who updates every time a patch is released. Simply follow Murphy: don’t try to fix a non-existent problem, although it’s good to patch occasionally to save yourself from software bugs.

I recommend doing the following, keep the common sense amongst them:

avoid upgrading during working hours, the load needs to be as small as possible

sign up for patch notification emails

always read them thoroughly, understand what is fixed, see if you could be affected

update regularly but not too often (I usually update every 8-12 weeks), unless I am affected or the patch is really important for a healthy environment

I’m not too concerned about security fixes since my systems are protected by various levels of tools, firewalling, etc., but you may need them urgently if a package is affected by a critical bug and it could remotely affect your systems providing public services over the INTERNET.

The updating as a process is more challenging in this environment especially when software components receive feature updates, newer versions, etc.

For an HA managed XEN cluster, the standard procedure would be as follows:

put host1 node back into active mode (no domU should be running there)

observe the system for a while, monitor logs, ensure full functionality

update the least important domU first then stop it (running on host2)

start the least important domU up immediately (it should start up on host1)

monitor behavior for a while, may be a day or so depending on your requirements

if things look good put host2 into standby BUT don’t patch just yet

all domUs will move to the newly patched host1 (should be no issues)

host2 remains in standby until all domUs are fully patched and restarted

proceed with the patching of the rest of the domUs running on host1

once all domUs are fully patched, restarted on host1 and fully functional for a while proceed with host2 patching

reboot host2 then put it back to active mode if behaves well for a while

This is the safest method I have figured out over the years and it has always worked except on one occasion: when SUSE updated HA from 2.0.7 to 2.1.3 and my configuration at that time didn’t have certain scoring settings. This was already discussed briefly earlier in this guide.

I had odd OCFS2 issues as well when a new version got released (1.4.x). For clustering solutions it’s quite common that nodes cannot establish connection with other nodes with different software versions, it’s just the way it is.

You have to pick the right time for upgrades and maintenance. The load needs to be as small as possible to avoid issues. For example, I once updated one of my cluster nodes during working hours, and when I put the node back into active mode, the EVMS resource failed due to a timeout. There must have been a heavy load either on the SAN or on the volumes presented to the nodes. As a result STONITH actions were fired, nodes rebooted, etc.

RAM usage generally:

This idea assumes that you count the amount of RAM you consume, that you limit the usage of your dom0 and all your domUs to a certain amount, and that you never run more resources than one node can handle. It’s one of the reasons why a 2 node cluster is inefficient: one node is pretty much wasted because you can only run as much as a single node can handle.

Of course if you had more nodes, it’s a lot easier to dump some domUs onto many, alternatively you could set up some policies to shut down certain resources to accommodate the extra load but it’s out of the scope of this guide.
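The counting itself is simple enough to script. The sketch below only adds up the figures you give it (all in MB) and compares them with the RAM of a single node; the function name and the example numbers are made up.

```shell
#!/bin/sh
# $1 = RAM of one node, $2 = dom0 allocation, remaining arguments = the
# memory of every domU in the cluster. All values in MB.
fits_on_one_node() {
    node_ram=$1
    total=$2
    shift 2
    for mem in "$@"; do
        total=$(( total + mem ))
    done
    if [ "$total" -le "$node_ram" ]; then
        echo "ok: $total of $node_ram MB"
    else
        echo "over: $total of $node_ram MB"
        return 1
    fi
}
```

With a 16384 MB node, a 1024 MB dom0 and four domUs of 1024, 512, 512 and 2048 MB, fits_on_one_node 16384 1024 1024 512 512 2048 reports 5120 MB in use, so the surviving node can carry everything after a failover.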

Persistent device names:

It’s a bit off topic, so take it as optional reading; by the way, it should not affect users who run SLES10 SP1 or SP2 installations. I have upgraded mine since the GA release, and at that time this wasn’t the default during installation. I made some notes on how to do this by hand.

Setting the maximum memory limit for the dom0 is not enough by itself; I recommend setting the minimum limit as well. This way your dom0 (your controller) always gets what it needs, and according to my tests 1G is enough for even the busiest environments. Smaller systems could try 512M for a start:

It’s an article I published recently and it could be very useful for virtual environments. The GUI components take a lot of resources and the server doesn’t actually need to run them, so we can turn them off and still take advantage of the GUI tools developed for SuSE.

Networking is a very important and sensitive topic, so pay attention to it when you design your cluster. I personally prefer bridges since bridging is a layer 2 operation in the OSI model, is easy to set up and doesn’t cause too much overhead on the host even though it’s software based, but it may not suit your environment, in which case you will need routing.
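As an illustration, and assuming that the two limits meant here are the dom0_mem option on the hypervisor (xen.gz) line of the boot loader configuration and the dom0-min-mem setting of xend, the helper below prints what is currently configured. The file locations are the usual SLES defaults and can be overridden.

```shell
#!/bin/sh
# Print the dom0 memory settings found in the boot loader and xend
# configuration. $1 and $2 override the assumed default locations.
show_dom0_limits() {
    grub_cfg=${1:-/boot/grub/menu.lst}
    xend_cfg=${2:-/etc/xen/xend-config.sxp}
    # hypervisor option on the xen.gz line, e.g. dom0_mem=1024M
    grep -o 'dom0_mem=[^ ]*' "$grub_cfg"
    # lower bound xend keeps for dom0 when ballooning, e.g. (dom0-min-mem 512)
    grep -o '^(dom0-min-mem [0-9]*)' "$xend_cfg"
}
```

Both settings need to be identical on every dom0, and the hypervisor option only takes effect after a reboot of the host.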

If you decided to use bridging and want to use multiple ones for more than one NIC, you can create a small wrapper script to manage them when xend starts up:

It will basically run the standard bridge script once for every NIC specified. It needs to be set up the same way on all dom0 hosts. We use this to separate private LAN and DMZ traffic for infrastructure domUs requiring access to both.

hb_gui is great, but what if you don’t have it at hand? The cluster status can also be viewed over HTTP in a web browser, because the crm_mon command line utility can produce its output in HTML format. A web server could run on the nodes themselves but that’s pointless; my preference is to run the web server somewhere else, equipped with a small cgi script that retrieves the output from the nodes remotely. You can ask any node, they all return the same output, but your cgi script must ask another node if one of them is down for maintenance. This solution is designed for a 2 node cluster and might not be the best, but it works. It assumes that you have already created a user, for example “monitor”, on both hosts with your favorite tool.
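A sketch of such a wrapper, assuming two NICs (eth0, eth1), two bridges (xenbr0, xenbr1) and the stock network-bridge script in /etc/xen/scripts. The wrapper's own file name is up to you; it is what the network-script entry in xend-config.sxp has to point at.

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. /etc/xen/scripts/multi-network-bridge.
# xend calls it with start, stop or status; each call is passed on to
# the standard network-bridge script once per NIC.
SCRIPT_DIR=${SCRIPT_DIR:-/etc/xen/scripts}

run_bridges() {
    "$SCRIPT_DIR/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=xenbr0
    "$SCRIPT_DIR/network-bridge" "$@" vifnum=1 netdev=eth1 bridge=xenbr1
}

# only act when xend (or you) passed an action on the command line
if [ $# -gt 0 ]; then
    run_bridges "$@"
fi
```

Adding a third NIC is one more line in run_bridges with the next vifnum, netdev and bridge name.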

Both nodes should be ready now for remote passwordless login. (monitor user only)

My webserver is not a SLES box at this stage, but the procedure should be very similar; the difference is just the running user and maybe the locations. On the webserver we have to unlock the account of the running user by giving it a valid shell. When that is done, we copy the private key across and put it into the default $HOME/.ssh location:

The webserver configuration is outside the scope of this document. Remember that it may take a few seconds to load the page; this delay is caused by the host checking (ping) method built into the cgi script.
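The core of such a cgi script can be sketched as below. It is an illustration under assumptions, not the script used here: the node names are host1 and host2, the “monitor” user can log in with the key and is allowed to run crm_mon, and crm_mon is asked for one-shot output in its web mode.

```shell
#!/bin/sh
# Ask the first reachable node for the cluster status.
NODES=${NODES:-"host1 host2"}
MONITOR_USER=${MONITOR_USER:-monitor}

# one ping with a 2 second deadline decides whether a node is worth asking
node_is_up() {
    ping -c 1 -w 2 "$1" >/dev/null 2>&1
}

cluster_status() {
    for node in $NODES; do
        if node_is_up "$node"; then
            # -1: print the status once and exit, -w: output suitable for CGI
            ssh -o BatchMode=yes "$MONITOR_USER@$node" crm_mon -1 -w && return 0
        fi
    done
    printf 'Content-Type: text/plain\n\nNo cluster node is reachable\n'
    return 1
}

# as a CGI script, finish with a call to: cluster_status
```

The ping check is where the few seconds of delay come from when the first node in the list is down.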

Testing the cluster

Testing is just as important as any other safety feature we built into the cluster. I hope you have read the links provided in this guide by now and that they were all working. Jo’s original HASI discussed testing in some ways and I’m not planning to duplicate it, but you should:

Test multipath:

It depends on your hardware, hence I cannot give you a solution. In my case I had some unplanned outages and controller failures, which allowed me to test the service in real life. At some point one of my servers lost both paths to the SAN while the other, for some reason, didn’t. Interestingly, HA must have noticed something, and the following day I found all my domUs running on the healthy node. Nothing in my cluster monitors the SAN, but something obviously reacted and saved all my resources from becoming unavailable, which I didn’t notice at the time since it was after hours…

Test STONITH:

This is the most important part, ensure it’s operating 100%. You can kill the HA process and see if resources move to the other node and the one you killed the process on actually reboots (after a while depending on your check interval, deadtime, etc.):

host2:~ # pkill heartbeat

It could take some time if ssh agent was chosen so be patient, monitor logs in the meantime:

It’s a one line command; you can use the parameters hardcoded into your XML file. It should cold reboot the node within a few seconds.

Note: a STONITH action is considered an emergency, hence the iLO agent will just pull the cord; your journaling filesystem should take care of the unclean shutdown. The ssh agent is different: it executes a reboot command within the OS, which does a clean shutdown and therefore requires a significant amount of time to complete, depending on your setup.

Remember: if you have implemented timing delays as explained in this guide, with many domUs, starting the heartbeat process at boot and stopping it at shutdown will take some time. Don’t force it, it will complete; it’s just the nature of the cluster.

I do recommend stopping the ssh STONITH agent and doing the proper test (killing the HA process as above) to ensure that iLO STONITH works as well.

Resources:

Stop some domUs randomly with xm or virt-manager, see if HA brings them back up after a while, and monitor the logs.

Network failover:

Test the cluster against network issues. The best and easiest way is to pull the cord out of one of your member nodes (eth0 only, we only monitor the NIC going back to our private LAN). Resources should start restarting or migrating onto the other node after a few moments. If you don’t have access to the hardware or don’t want to pull the cord, you can block the returning ping packets to your node, which should have the same effect:

The -s parameter is my gateway, the source of the returning packets; -d is the server itself, the destination of the packets. The rule matches only the icmp protocol (-p) and is simply inserted (-I) into the standard INPUT chain. You could add more options to match these packets more closely, but I think this does the job safely without the risk of blocking something you didn’t mean to. The server should not reboot; your domU resources should move to another node. To get rid of this interim firewall rule, you could restart the XEN host or issue the following command:
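A minimal sketch of both commands, following the parameters described above. The addresses are placeholders from the documentation range, not real ones; -j DROP is assumed as the target.

```shell
#!/bin/sh
# $1 = gateway (source of the ping replies), $2 = this node (destination)
block_ping_replies() {
    # insert a rule at the top of INPUT that drops icmp from the gateway
    iptables -I INPUT -p icmp -s "$1" -d "$2" -j DROP
}

unblock_ping_replies() {
    # delete exactly the same rule again
    iptables -D INPUT -p icmp -s "$1" -d "$2" -j DROP
}
```

For example, block_ping_replies 192.0.2.1 192.0.2.11 on the node under test, then unblock_ping_replies 192.0.2.1 192.0.2.11 once the resources have moved.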

Test the cluster against standby mode and watch the resources moving. Reboot the nodes, do some maintenance, etc. Once finished, put the node back into active mode; resources should stay where they are and the logs should contain no errors.

After a few reboots on both nodes, ensure that the DC (designated controller) role does change over in the HA cluster and that EVMS volumes get discovered and activated properly on all nodes.

Check the configuration from time to time:

It’s a good idea to do ad-hoc configuration checks, particularly after the configuration has changed:

host1:~ # crm_verify -LV
host1:~ #

An empty prompt is a good sign; anything else is displayed…

Proof of concept

The NFS domU hosts the user home directories, exported by the NFS service to store user data; for testing it’s set up with 512MB of RAM. The domU is running on host1, and everything is as presented earlier in this document.

It wasn’t very fast because my uplink was limited to 100Mbit/s at that time, but that’s not what we are concerned about right now. Redo the test, but this time migrate the domain (put host1 into standby) while writing the same file to the NFS export:

For the record:

As of today the NFS domU uses 1G of RAM and shares the home space amongst many users. The disk still uses the XFS filesystem because of the CommVault7 backup software we use for the data partition; by the way, it’s the best for this purpose. Apart from minor adjustments (more RAM, a higher number of nfsd processes as we added more users) the system uses nearly default settings.

We needed to tune the mount options for the users’ NFS clients too. These were necessary to ensure data safety and instant reconnection if the NFS server rebooted for some reason, such as a kernel update. The mount command (one line) would look like this:

All our UNIX systems are nowadays configured with autofs, which retrieves a special record about the user’s home location from our LDAP and then mounts it instantly. It has been working absolutely hassle free; the domU was migrated amongst the nodes many times under heavy access without any issues. The filesystem is clean, the data is safe and the shared storage is considerably efficient, although NFS is not my favorite protocol.

Conclusion

Many will argue about XEN versus VMware; no doubt both have their strengths and their place in the market. XEN is the child of Linux, with no question about its strengths, efficiency and cost effectiveness, but unfortunately it lacks all the fancy features, tools and utilities that VMware offers.

XEN tools are getting better and evolve fast, but it will take developers a significant amount of time to catch up with others who have been doing this for a long time. Novell’s efforts to improve on this are quite clear, the support we get is very good and I am very happy with what I have managed to work out over the last 2 years.

For us, the deciding factor was efficiency and stability, not particularly the cost. At that time we had a verbal agreement to purchase ESX for some infrastructure developments; I believe we still own one copy but it’s not being used as far as I know.

I don’t have VirtualCenter, P2V, dynamic provisioning and many other great features, but the reality is that I don’t need them. The main components are built into this solution:

live migration without service interruption

clustering and high availability

auto failover

efficient, centralized resource management, etc.

We needed static virtualization, domUs with a dedicated purpose, and that’s what most small and mid-sized businesses would want, at least for a start. XEN can serve these well in a very good, very efficient, cost effective way.

I have to admit that I didn’t plan on using SLES for this project in the early days, and this is not marketing for any flavor, just an individual opinion.

I tested Debian Linux, Fedora Core 5 and NetBSD3 for both dom0 and domU, but SLES turned out to be the best for the dom0. That’s what matters after all: the dom0 has to be rock solid and perfect. The components used in this guide were also developed by SuSE people as individual projects, hence no doubt it’s really the best you can use for this sort of thing.

Unlike for the dom0, I still prefer INTERNET facing core systems on Debian Linux and I wouldn’t use anything else. You wonder why? Cleanness of code, community support, a vast amount of packages, discipline and, last but not least, their standard policy of no feature upgrades within a release.

For example, our SMTP gateway domU consumes 387M of storage space (the installed system without logs), keeps away an average of 20K SPAM messages daily (most are rejected on the spot, meaning that they don’t waste my CPU time), exchanges zillions, checks for viruses and, last but not least, has a very low false positive ratio. Half of the filters would not be available for SLES, and maintaining them could become an unnecessary hassle or overhead after a while.

Disclaimer: As with everything else at SUSE Conversations, this content is definitely not supported by SUSE (so don't even think of calling Support if you try something and it blows up). It was contributed by a community member and is published "as is." It seems to have worked for at least one person, and might work for you. But please be sure to test, test, test before you do anything drastic with it.

Great article, BTW. Another question: what about NCS rather than HA? It is supported, but it seems you have to use OCFS? Can you use NCS with raw block disks and use OCFS just for your config files?

This guide has become pretty outdated by now, but when this cluster was built the o2cb HA resource was experimental and unreliable. The init services indeed need to be ON because the only thing this cluster managed on SLES10SP2 was the OCFS2 FS resource, which does nothing other than mount the cluster filesystem. It fails unless the underlying OCFS2 daemons on all nodes are in sync…

It’s probably a lot better now on SLES11, unfortunately I have not had a chance to give that a crack

I was wondering if EVMS is really strictly necessary for offering block devices to the Xen hosts. What if you just use LVM, activate the logical volumes on all nodes and use hardware fencing to avoid the same Xen guest being loaded on multiple nodes?
This would avoid the complexity of EVMS and protect against data corruption.
Or am I wrong?