Caché can be configured as an application controlled by Veritas Cluster Server (VCS) on Linux. This appendix highlights the key portions of the configuration of VCS including how to incorporate the Caché high availability agent into the controlled service. Refer to your Veritas documentation and consult with your hardware and operating system vendor(s) on all cluster configurations.

When using Caché in a high availability environment controlled by Veritas Cluster Server:

Install the hardware and operating system according to your vendor recommendations for high availability, scalability and performance; see Hardware Configuration.

Configure VCS with shared disks and a virtual IP (VIP). Verify that common failures are detected and the cluster continues operating; see Linux and Veritas Cluster Server.

Configure the hardware according to best practices for your application. In addition to adhering to the recommendations of your hardware vendor, consider the following:

Disk and Storage

Create LUNs/partitions, as required, for performance, scalability, availability and reliability. This includes using appropriate RAID levels, battery-backed and mirrored disk controller cache, multiple paths to the disk from each node of the cluster, and a partition on fast shared storage for the cluster quorum disk.

Networks/IP Addresses

Where possible, use bonded multi-NIC connections through redundant switches/routers to reduce single points of failure.
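For example, on a Red Hat-style system an active-backup bond might be declared as follows. The device names, bonding mode, and addresses are illustrative assumptions; follow your operating system and network vendor's guidance for your environment:

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0  (illustrative values)
DEVICE=bond0
BONDING_OPTS="mode=active-backup miimon=100"
IPADDR=10.0.0.11
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0  (repeat for each member NIC)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
```

Each member NIC should be cabled to a different switch so that the bond survives the loss of either switch.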

Linux and Veritas Cluster Server

Prior to installing Caché and your Caché-based application, follow the recommendations described below when configuring Linux and VCS. These recommendations assume a two-node cluster where both nodes are identical. Other configurations are possible; consult with your hardware vendor and the InterSystems Worldwide Response Center (WRC) for guidance.

Linux

When configuring Linux on the nodes in the cluster, use the following guidelines:

All nodes in the cluster must have identical user IDs and group IDs (that is, both the name and the numeric ID must match on all nodes); this is required by Caché.

These two users and two groups must be added and synchronized across all cluster members:

Users

Owner(s) of the instance(s) of Caché

Effective user(s) assigned to each instance’s Caché jobs

Groups

Effective group(s) to which each instance’s Caché processes belong

Group(s) allowed to start and stop the instance(s)
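Once the users and groups are created, the IDs can be verified with a small script run on each node, diffing the output between nodes. This is only a sketch; the user and group names below are assumptions — substitute your instance owner, effective user, and the two Caché groups:

```shell
# Print numeric IDs for the Caché-related users and groups so the
# output can be compared across cluster nodes.
uid_of() { id -u "$1" 2>/dev/null || echo missing; }
gid_of() { getent group "$1" 2>/dev/null | cut -d: -f3; }

# Names below are assumptions; replace with your site's accounts.
for u in cacheowner cacheusr; do echo "$u uid=$(uid_of "$u")"; done
for g in cacheusr cachegrp; do echo "$g gid=$(gid_of "$g")"; done
```

Run the script on every node and confirm the lines are identical everywhere.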

Ensure that all volume groups required for Caché and the application are available to all nodes.

Include all fully qualified public and private domain names in the hosts file on each node.
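For example, the hosts file might include entries like the following (host names and addresses are illustrative):

```shell
# /etc/hosts on each node (illustrative addresses and names)
10.0.0.11    nodeA.example.com       nodeA        # public
10.0.0.12    nodeB.example.com       nodeB        # public
192.168.1.11 nodeA-priv.example.com  nodeA-priv   # private interconnect
192.168.1.12 nodeB-priv.example.com  nodeB-priv   # private interconnect
10.0.0.100   cachevip.example.com    cachevip     # cluster VIP
```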

Veritas Cluster Server

This document assumes Veritas Cluster Server (VCS) version 5.1 or newer. Other versions may work as well, but they likely have different configuration options. Consult Symantec/Veritas and the InterSystems Worldwide Response Center (WRC) for guidance.

In general you will follow these steps:

Install and cable all hardware, disk and network.

Create a cluster service group that includes the network paths and volume groups of the shared disk.

Be sure to include the entire set of volume groups, logical volumes and mount points required for Caché and the application to run. These include those mount points required for the main Caché installation location, your data files, journal files, and any other disk required for the application in use.
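As a sketch, the service group and its storage and network resources might look like the following main.cf fragment. The group, resource, device, and address names are illustrative assumptions, and a production group typically needs one Mount resource per filesystem (install location, data, journals):

```text
group cachesg (
    SystemList = { nodeA = 0, nodeB = 1 }
    AutoStartList = { nodeA }
    )

DiskGroup cache_dg (
    DiskGroup = cachedg
    )

Mount cache_mnt (
    MountPoint = "/cacheprod"
    BlockDevice = "/dev/vx/dsk/cachedg/cachevol"
    FSType = vxfs
    FsckOpt = "-y"
    )

NIC cache_nic (
    Device = bond0
    )

IP cache_vip (
    Device = bond0
    Address = "10.0.0.100"
    NetMask = "255.255.255.0"
    )

cache_mnt requires cache_dg
cache_vip requires cache_nic
```

DiskGroup, Mount, NIC, and IP are standard VCS bundled agent types; see the bundled agents reference for the full attribute lists.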

Installing the VCS Caché Agent

The Caché VCS agent consists of five files and one soft link that must be installed on all servers in the cluster.

Sample Caché VCS agent scripts and a type definition are included in a development install. These samples suffice for most two-node cluster installations. Follow the instructions provided for copying the files to their proper locations in the cluster.

A development install is not required in the cluster itself; the files listed below can be copied from a development install outside the cluster to the cluster nodes.

There are different procedures depending on whether you are installing only one instance of Caché or multiple instances of Caché. Installing a single instance of Caché in the cluster is common in production clusters. In development and test clusters it is common to have multiple instances of Caché controlled by the cluster software. If it is possible that you will install multiple instances of Caché in the future, follow the procedure for multiple instances.

Use the following procedure to install and configure a single instance of Caché in the VCS cluster.

Note:

If any Caché instance that is part of a failover cluster is to be added to a Caché mirror, you must use the procedure described in Installing Multiple Instances of Caché, rather than the procedure in this section.

Bring the service group online on one node. This should mount all required disks and allow for the proper installation of Caché.

Check the file and directory ownerships and permissions on all mount points and subdirectories.

Create a link from /usr/local/etc/cachesys to the shared disk. This forces the Caché registry and all supporting files to be stored on the shared disk resource you have configured as part of the service group.
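A sketch of that relocation, wrapped in a function for illustration; the shared-disk mount point /cacheprod is an assumption:

```shell
# link_cachesys: point /usr/local/etc/cachesys at a directory on shared disk,
# so the Caché registry travels with the service group.
link_cachesys() {
    shared=$1   # shared-disk mount point, e.g. /cacheprod (assumed)
    link=$2     # link path, normally /usr/local/etc/cachesys
    mkdir -p "$shared/cachesys"
    ln -sfn "$shared/cachesys" "$link"
}

# On the node with the shared disks mounted:
# link_cachesys /cacheprod /usr/local/etc/cachesys
```

Because the link target lives on the shared disk, whichever node currently owns the service group sees the same registry.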

Run Caché cinstall on the node with the mounted disks. Be sure the users and groups (either default or custom) have already been created on all nodes in the cluster, and that they all have the same UIDs and GIDs.

Stop Caché and relocate the service group to the other node. Note that the service group does not yet control Caché.

Manually start Caché using ccontrol start. Test connectivity to the cluster through the virtual IP address (VIP). Be sure the application, all interfaces, any ECP clients, and so on connect to Caché using the VIP as configured here.

The Caché resource must be configured to require the disk resource and optionally the IP resource.

Start VCS and verify that Caché starts on the primary node.
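The dependency requirement from the step above can be expressed in main.cf. This is a sketch only: the resource type name and its attributes come from the type definition shipped with the Caché agent (the Instance attribute shown here is an assumption), and the cache_mnt and cache_vip resource names are carried over from a hypothetical service group:

```text
Cache cache_prod (
    Instance = cacheprod
    )

cache_prod requires cache_mnt
cache_prod requires cache_vip
```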

Installing Multiple Instances of Caché

To install multiple instances of Caché, use the following procedure.

Note:

If any Caché instance that is part of a failover cluster is to be added to a Caché mirror, the ISCAgent (which is installed with Caché) must be properly configured; see Configuring the ISCAgent in the Mirroring chapter of this guide for more information.

Bring the service group online on one node. This should mount all required disks and allow for the proper installation of Caché.

Check the file and directory ownerships and permissions on all mount points and subdirectories.

Run Caché cinstall on the node with the mounted disks. Be sure the users and groups (either default or custom) are already created on all nodes in the cluster, and that they all have the same UIDs and GIDs.

The /usr/local/etc/cachesys directory and all its files must be available to all nodes at all times. To enable this, copy /usr/local/etc/cachesys from the first node you install to each node in the cluster. The following method preserves symbolic links during the copy process:

cd /usr/local/
rsync -av -e ssh etc root@node2:/usr/local/

Verify that ownerships and permissions on the cachesys directory and its files are identical on all nodes.

Note:

In the future, keep the Caché registries on all nodes in sync using ccontrol create or ccontrol update, or by copying the directory again; for example:

ccontrol create CSHAD directory=/myshadow/ versionid=2013.1.475

Stop Caché and relocate the service group to the other node. Note that the service group does not yet control Caché.

Manually start Caché using ccontrol start. Test connectivity to the cluster through the VIP. Be sure the application, all interfaces, any ECP clients, and so on connect to Caché using the VIP as configured here.

The agent's method of stopping Caché depends on the CleanStop setting:

CleanStop = 1 (ccontrol stop)

Waits for processes to end cleanly, potentially delaying the stop, especially when some processes are unresponsive due to a hardware failure or fault. This setting can significantly lengthen time-to-recovery.

CleanStop = 0 (ccontrol force)

Because it does not wait for processes to end, dramatically decreases time-to-recovery in most cases of failover due to hardware failure or fault. However, while ccontrol force fully protects the structural integrity of the databases, it may result in transaction rollbacks at startup; this may lengthen the time required to restart Caché, especially if long transactions are involved.

If a controlled failover is to occur, such as during routine maintenance, stop Caché cleanly from the command line before relocating the service group. Even if CleanStop is set to 0, the ccontrol force command issued during the stop of the cluster service has no effect, since Caché is already cleanly stopped, with all transactions rolled back by the command-line ccontrol stop before processes are halted.
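Such a controlled failover can be sketched as a short shell function. The service-group, node, and instance names are assumptions, and the function attempts the switch only when the VCS command-line tools are present:

```shell
# controlled_failover: cleanly stop Caché, then ask VCS to relocate the
# service group. Default names below are assumptions; pass your own.
controlled_failover() {
    sg=${1:-cachesg}; node=${2:-nodeB}; inst=${3:-cacheprod}
    if command -v hagrp >/dev/null 2>&1; then
        ccontrol stop "$inst" quietly      # clean stop; transactions rolled back
        hagrp -switch "$sg" -to "$node"    # VCS relocates the service group
    else
        echo "VCS command-line tools not found; run this on a cluster node"
    fi
}

# On the active node, during a maintenance window:
# controlled_failover cachesg nodeB cacheprod
```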

Application Considerations

Consider the following for your applications:

Ensure that all network ports required for interfaces, user connectivity and monitoring are open on all nodes in the cluster.

Connect all interfaces, web servers, ECP clients and users to the database using the VIP over the public network as configured in the main.cf file.

Ensure that application daemons, Ensemble productions, and so on are set to autostart so the application is fully available to users after unscheduled failovers.

Consider carefully any code that is part of %ZSTART or that otherwise runs during Caché startup. To minimize recovery time, do not place heavy cleanup or query code in startup routines; otherwise VCS may time out before the custom code completes.

Other applications or web servers and so on can also be configured in the cluster, but these examples assume only Caché is installed under cluster control. Contact the InterSystems Worldwide Response Center (WRC) to consult about customizing your cluster.

Testing and Maintenance

Upon first setting up the cluster, be sure to test that failover works as planned. Repeat this testing any time changes are made to the operating system, its installed packages, the disk, the network, Caché, or your application.

In addition to the topics described in this section, you should contact the InterSystems Worldwide Response Center (WRC) for assistance when planning and configuring your Veritas Cluster Server resource to control Caché. The WRC can check for updates to the Caché agent and discuss failover and HA strategies with you.

Failure Testing

Typical full scale testing must go beyond a controlled service relocation. While service relocation testing is necessary to validate that the package configuration and the service scripts are all functioning properly, you should also test responses to simulated failures. Be sure to test failures such as:

Loss of public and private network connectivity to the active node

Loss of disk connectivity

Hard crash of active node

Testing should include a simulated or real application load. Testing with an application load builds confidence that the application will recover in the event of an actual failure.

If possible, test with a heavy disk write load; during heavy disk writes the database is at its most vulnerable. Caché handles all recovery automatically using its CACHE.WIJ and journal files but testing a crash during an active disk write ensures that all file system and disk devices are properly failing over.

Software and Firmware Updates

Keep software patches and firmware revisions up to date. Avoid known problems by adhering to a patch and update schedule.

Monitor Logs

Monitor the VCS logs in /var/VRTSvcs/log/. The Caché agent logs time-stamped information to the engine log during cluster events. To troubleshoot problems, search for the Caché agent error code 60022.
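A simple helper for that search might look like the following; the engine log is typically named engine_A.log, but verify the path on your installation:

```shell
# find_agent_errors: scan a VCS log file for Caché agent error code 60022,
# printing matching lines with their line numbers.
find_agent_errors() { grep -n "60022" "$1"; }

# Typical usage on a cluster node (path is an assumption; adjust as needed):
# find_agent_errors /var/VRTSvcs/log/engine_A.log
```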