
Tivoli System Automation

Getting Started Guide

Frank Goytisolo, published on May 17, 2012

The purpose of this guide is to introduce Tivoli® System Automation
for Multiplatforms (TSA) and provide a quick-start, purpose-driven approach
for users who need to use the software but have little or no prior
experience with it.

This guide describes the role that TSA plays within IBM’s Smart Analytics
System solution and the commands that can be used to manipulate the
application. Further, some basic problem diagnosis techniques will be
discussed, which may help with minor issues that could be experienced
during regular use.

When the Smart Analytics system is built with High Availability, TSA is
automatically installed and configured by the ATK. Therefore, this guide
will not describe how to install or configure a TSA cluster (domain) from
scratch, but rather how to manipulate and work with an existing
environment. To learn how to define a cluster of servers, please refer to
the References appendix for available IBM courses.

Terminology

It is advisable to become familiar with the following terms, since they
are used throughout this guide. They will also help you understand the
scopes of the different components within TSA.

Table 1. Terminology

Peer Domain: A cluster of servers, or nodes, for which TSA is responsible.

Resource: Hardware or software that can be monitored or controlled.
Resources can be fixed or floating; floating resources can move between
nodes.

Resource group: A virtual group, or collection, of resources.

Relationships: Describe how resources work together. A start-stop
relationship creates a dependency (see below) on another resource. A
location relationship applies when resources should be started on the
same or different nodes.

Dependency: A limitation on a resource that restricts its operation. For
example, if resource A depends on resource B, then resource B must be
online before resource A can be started.

Equivalency: A set of fixed resources of the same resource class that
provide the same functionality.

Quorum: A cluster is said to have quorum when it has the capability to
form a majority among its nodes. The cluster can lose quorum when there
is a communication failure and sub-clusters form with an even number of
nodes.

Nominal State: The desired state of a resource, either Online or Offline.
It can be changed so that TSA will bring a resource online or shut it
down.

Tie Breaker: Used to maintain quorum, even in a split-brain situation (as
mentioned in the definition of quorum). A tie breaker allows sub-clusters
to determine which set of nodes will take control of the domain.

Failover: When a failure (typically hardware) causes resources to be
moved from one machine to another, the resources are said to have
"failed over".

Getting Started

The purpose of TSA in the Smart Analytics System is to manage software and
hardware resources so that, in the event of a failure, they can be
restarted or moved to a backup system. TSA uses background scripts to
check the status of processes and ensure that everything is working
correctly. It also uses "heartbeating" between all the nodes in the domain
to ensure that every server is reachable. Should a process fail its status
check, or a node fail to respond to a heartbeat, TSA will take the
appropriate action to bring the system back to its nominal state.

Let’s start with the basics. In a Smart Analytics System, the TSA domain
includes the DB2 Admin node, the Data nodes, and any Standby/backup nodes.
The management server is not part of the domain and TSA commands will not
work there. Further, all TSA commands are run as the root user.

The first thing you want to do is check the status of the domain, and start
it if required:
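The captured output was not preserved in this copy. As an illustrative sketch (the column values shown are typical of a healthy domain, not taken from a real system), checking the domain with lsrpdomain might look like:

```
# lsrpdomain
Name      OpState RSCTActiveVersion MixedVersions TSPort GSPort
bcudomain Online  2.5.4.0           No            12347  12348
```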

In this case it's already started, but if the OpState column showed
"Offline", then the command to start the domain would be,

startrpdomain bcudomain

Notice that the domain name is bcudomain, and it is required for the start
command. Likewise, if you want to stop the domain, the command is,

stoprpdomain bcudomain

If TSA is in an unstable state, you can also forcefully shut down the
domain using the -f parameter in the stoprpdomain command. However, this
is typically not recommended:

stoprpdomain -f bcudomain

You should not stop a domain until all your resources have been properly
shut down. If your system uses GPFS to manage the /db2home mount, you must
manually unmount the GPFS filesystems with the following command before
you can stop the TSA domain,

/usr/lpp/mmfs/bin/mmunmount /db2home

Next, you’ll want to check the status of the nodes in the domain. The
following command will do this:
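The command is lsrpnode. Its output was not preserved here; a hedged sketch consistent with the three nodes discussed below (the RSCTVersion values are illustrative) would be:

```
# lsrpnode
Name      OpState RSCTVersion
beluga006 Online  2.5.4.0
beluga008 Online  2.5.4.0
beluga007 Online  2.5.4.0
```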

You can see that we have three nodes in this domain: beluga006, beluga007,
and beluga008. The output also shows their state. If they are Online, TSA
can work with them. If they are Offline, they are either turned off or TSA
cannot communicate with them (and they are thus unavailable). Nodes don't
always appear in the order you would expect, so be sure to scan the whole
output (in this case, beluga008 shows up before beluga007).

Resource Groups

After you have verified that the Domain is started, and all your nodes are
Online, you will want to check the status of your resources. TSA manages
all resources through resource groups. You cannot start a resource
individually through TSA. When you start a resource group however, it will
start all resources that belong to that group.

To check the status of your DB2 resources, use the hals command. This gives
you a summary of all nodes in the peer domain, including their primary and
backup locations, current location, and failover state.
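The original hals output was not preserved in this copy. As a rough, hypothetical sketch of a summary matching the description below (the column layout and the dwdatp* hostnames are invented for illustration; only dwadmp1x and dwhap1x come from the text):

```
# hals
Partition  Primary   Standby  Current   State
0          dwadmp1x  dwhap1x  dwadmp1x  Normal
1          dwdatp1x  dwhap1x  dwdatp1x  Normal
2          dwdatp2x  dwhap1x  dwdatp2x  Normal
3          dwdatp3x  dwhap1x  dwhap1x   Failover
4          dwdatp4x  dwhap1x  dwdatp4x  Normal
```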

In this example, we see that the admin node is dwadmp1x since it holds
partition 0. There are 4 data nodes in this system, and all are in Normal
state except for data node 3. We can see that data node 3 is in Failover
state and its current location is dwhap1x, the backup server.

The hals command is actually a summary of the complete output. For more
detailed information about each resource, use the lssam command. The
following output is an example of a cluster with the following nodes:
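The original node list and lssam output were not preserved in this copy. The filtering trick described below can be demonstrated on canned sample text; the resource and host names here are illustrative stand-ins for real lssam output:

```shell
# The "grep Nominal" trick, demonstrated on canned sample output.
# On a live cluster you would pipe the real command instead:
#   lssam | grep Nominal
grep Nominal <<'EOF'
Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online
        |- Online IBM.Application:SA-nfsserver-rs
Online IBM.ResourceGroup:db2_bculinux_NLG_beluga006-rg Nominal=Online
        |- Online IBM.ResourceGroup:db2_bculinux_0-rg Nominal=Online
EOF
```

Only the lines carrying a Nominal state survive the filter, which is what keeps the output short.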

Notice that the full output was filtered with grep for "Nominal". This is
a trick to shorten the output so that we only see the nominal states; as
you will soon see, it can get quite long otherwise.

Let’s step through the above output:

Online IBM.ResourceGroup:SA-nfsserver-rg Nominal=Online

This first line tells us that we have a resource group named
SA-nfsserver-rg and it is Online. The Nominal state is also Online, so it
is working as expected. By the name, we can tell that this resource group
manages the NFS server resources. Typically, this should always be
online.

Next we have a resource group called db2_bculinux_NLG_beluga006-rg. This is
the resource group belonging to the Admin node. We know that because
beluga006 is the hostname for the Admin node. Here, we have 1 DB2
partition (the coordinator partition). For every partition, we define a
resource group. You’ll see why shortly. The resource group for the admin
partition, partition 0, is called db2_bculinux_0-rg.
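The detailed lssam output was not preserved in this copy. A hypothetical fragment consistent with the description below (the indentation follows the usual lssam layout, but the resource names are illustrative) might look like:

```
Online IBM.ResourceGroup:db2_bculinux_0-rg Nominal=Online
        |- Online IBM.AgFileSystem:db2home_beluga006-rs
                |- Online IBM.AgFileSystem:db2home_beluga006-rs:beluga006
                '- Offline IBM.AgFileSystem:db2home_beluga006-rs:beluga008
```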

The first line was what we had seen before (lssam | grep Nom). Now, we can
see what resources actually form the resource group. This first resource
is of type AgFileSystem and represents the db2home mount. We can see that
it can exist on beluga006 and beluga008, and that it is Online in
beluga006 and Offline in beluga008.

Similarly, for the admin node, we can now see the individual resources:
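The output was not preserved here; a hypothetical fragment in the style of lssam (resource names are illustrative) would be:

```
Online IBM.ResourceGroup:db2_bculinux_0-rg Nominal=Online
        |- Online IBM.Application:db2_bculinux_0-rs
                |- Online IBM.Application:db2_bculinux_0-rs:beluga006
                '- Offline IBM.Application:db2_bculinux_0-rs:beluga008
```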

The first two lines were part of the previous grepped output, but now we
can also see an Application resource. You will see similar results for the
data node and each of its 4 data partitions. The reason each of these
resources exists on two nodes (beluga006 and beluga008) is high
availability. If beluga006 were to fail, TSA would move all the resources
currently Online there to beluga008. You would then see that they are
Offline on beluga006 and Online on beluga008. You can see how this output
is useful for determining on which nodes the resources exist.

The lssam command also shows Equivalencies as part of the output. I will
include it for the sake of completeness, but we will discuss this later
on:
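The equivalency section was not preserved in this copy. As a hypothetical sketch only (the equivalency and interface names are invented for illustration), it typically resembles:

```
Online IBM.Equivalency:db2_public_network_db2inst1
        |- Online IBM.NetworkInterface:eth0:beluga006
        '- Online IBM.NetworkInterface:eth0:beluga008
```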

With the Smart Analytics System, some new commands were introduced to make
it easier to monitor and use TSA with DB2:

Table 2. Useful Commands

hals: shows an HA status summary for all DB2 partitions

hachknode: shows the status of a node in the domain and details about the
private and public networks

hastartdb2: starts the DB2 partition resources

hastopdb2: stops the DB2 partition resources

hafailback: moves partitions back to the primary machine specified in the
primary_machine argument

hafailover: moves partitions off the primary machine specified in the
primary_machine argument to its standby

hareset: attempts to reset pending, failed, or stuck resource states

Stopping and Starting Resources

If you want to stop or start the DB2 service, you need to stop the
respective DB2 resource groups using TSA commands. TSA will then start or
stop DB2.

The command to do this is chrg. To stop a resource group named
db2_bculinux_NLG_beluga007, issue the command,

chrg -o offline -s "Name == 'db2_bculinux_NLG_beluga007'"

Similarly, to start the resource group

chrg -o online -s "Name == 'db2_bculinux_NLG_beluga007'"

You can also stop/start all resources at the same time:

chrg -o online -s "1=1"
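To confirm that a nominal-state change took effect, you can list the group's persistent attributes (which include the nominal state) with lsrg; the group name here matches the example above:

```
lsrg -g db2_bculinux_NLG_beluga007
```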

The Smart Analytics System also has some pre-configured commands:

hastartdb2 and hastopdb2

These two commands, however, are specific to DB2; if TSA has been
customized, they may not stop/start all resources.

If TSA has pre-configured rules/dependencies, they will ensure that
resources are stopped and started in the correct order. For example, DB2
resources that depend on NFS will not start if the NFS share is
Offline.

TSA Components

Now that you understand the basics of Tivoli System Automation, we can
discuss some of the other components that it can manage.

Service IP

A service IP is a virtual, floating resource attached to a network device.
Essentially, it is an IP address that can move from one machine to
another, in the event of a failover. Service IPs play a key role in a
highly available environment. Because they move from a failed machine to a
standby, they allow an application to reconnect to the new machine using
the same IP address – as if the original server had simply restarted.

The following command will allow you to view what service IPs have been
configured for your system.
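The command itself was not preserved in this copy. One way to list service IPs is with lsrsrc against the IBM.ServiceIP resource class (the resource-name pattern is illustrative; -Ab requests both persistent and dynamic attributes, so OpState is included):

```
# lsrsrc -Ab -s "Name like 'db2ip%'" IBM.ServiceIP
```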

The above example shows three resources with the same name,
db2ip_10_160_20_210-rs. The NodeNameList parameter tells us
which node(s) each resource refers to. The first resource has OpState
set to 2, which tells us that this is where the service IP currently
points (it is also the primary location of the resource). The second
resource has OpState 1, which tells us that this is the backup/standby
node. The third resource contains both nodes in its NodeNameList
parameter; this tells TSA that it is a floating resource between those
two nodes.

Application Resources

TSA manages resources using scripts. Some scripts are built in (and part of
TSA), such as those for controlling DB2. These scripts are responsible for
starting, stopping and monitoring the application. Sometimes it can be
useful to understand these scripts, or even edit them for problem
diagnosis. To find out where they are located, we use the lsrsrc command,
which provides us with the complete configuration of a particular
resource.
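For example, a hedged sketch of such a query (the selection string and resource-name pattern are illustrative; StartCommand, StopCommand, and MonitorCommand are the attributes that hold the script paths):

```
# lsrsrc -s "Name like 'db2_%'" IBM.Application Name StartCommand StopCommand MonitorCommand
```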

Log Files

It is important to be aware of the log files that TSA actively writes to:

History file – this logs the commands that were sent to TSA

/var/ct/IBM.RecoveryRM.log2

Error and monitor logs – these logs are simply the AIX and Linux
system logs. They will show you the output of the start, stop, and
monitor scripts as well as any diagnostic information coming from TSA.
Although the system administrator can configure the location for these
logs, they are typically located in the following locations,

AIX: /tmp/syslog.out
Linux: /var/log/messages

Command Reference

Table 4 describes the most common commands that a TSA administrator will
use.

Table 4. Common TSA Commands

hals: Displays an HA configuration summary

hastopdb2: Stops DB2 using TSA

hastartdb2: Starts DB2 using TSA

mkequ: Makes an equivalency resource

chequ: Changes a resource equivalency

lsequ: Lists equivalencies and their attributes

rmequ: Removes one or more resource equivalencies

mkrg: Makes a resource group

chrg: Changes persistent attribute values of a resource group (including
starting and stopping a resource group)

lsrg: Lists persistent attribute values of a resource group or its
resource group members

rmrg: Removes a resource group

mkrel: Makes a managed relationship between resources

chrel: Changes one or more managed relationships between resources

lsrel: Lists managed relationships

rmrel: Removes a managed relationship between resources

samctrl: Sets the IBM TSA control parameters

lssamctrl: Lists the IBM TSA controls

addrgmbr: Adds one or more resources to a resource group

chrgmbr: Changes the persistent attribute value(s) of a managed resource
in a resource group

Troubleshooting

This section describes methods that can be used to determine the cause of a
particular problem or failure. Though techniques vary depending on the
type of problem, the following should be a good starting point for most
issues.

Cheat Sheet for Resolving Common Problems

Courtesy of Larry Pay lpay@ca.ibm.com

Resolving FAILED OFFLINE status

A Failed Offline status will prevent you from setting the nominal state
to ONLINE, so these resources must be resolved first and changed to
OFFLINE before turning the nominal state back to ONLINE. Make sure that
the Nominal status is showing OFFLINE before resolving it.

Recovery from a failed failover attempt

Take all TSA resources offline. The lssam output should reflect "Offline"
for all resources before you attempt to bring them back online. To reset
NFS resources, use:
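The specific command was not preserved in this copy. resetrsrc (from RSCT) is the usual tool for this; a hypothetical form, with an illustrative selection string, would be:

```
# resetrsrc -s "Name like 'SA-nfsserver%'" IBM.Application
```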

When testing goes wrong, you are often left with resources in various
states, such as online, offline, and unknown. When the state of a resource
is unknown, you must issue resetrsrc for that particular resource before
attempting to restart it.

When you are restarting DB2, you must verify that all the resources are
offline before attempting to bring them online again. You must also
correct the db2nodes.cfg file. Make sure you have backup copies of
db2nodes.cfg and db2ha.sys.

NFS mounts stop functioning

In testing the NFS failover, we were able to move the server over
successfully, but the existing NFS client mounts stopped functioning. We
solved this problem by unmounting and remounting the NFS volume.

Resolving Binding=Sacrificed

To resolve this problem, you have to look at the overall cluster and how
it is set up and defined. The causes are typically issues that have a
cluster-wide impact rather than ones that affect a single resource.

Check for failed relationships by listing the relationships with the
command "lsrel -Ab", and then determine whether one or more of the
relationships relating to the failed resource group have not been
satisfied.

Check for failed equivalencies by listing them with the command
"lsequ -Ab", and then determine whether one or more of the equivalencies
have not been satisfied.

Check your resource group attributes and look for anything that may be
set incorrectly; some of the commands to use are listed as follows:
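The exact commands were not preserved in this copy. One reasonable starting point is lsrg from the command reference, which lists a group's persistent attributes (the group name here is illustrative):

```
# lsrg -g db2_bculinux_0-rg
```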

Recycling the automation manager

If the problem is most likely related to the automation manager, you
should try recycling the automation manager (IBM.RecoveryRM) before
contacting IBM support. This can be done using the following commands:

Find out on which node the RecoveryRM master daemon is running:

# lssrc -ls IBM.RecoveryRM | grep Master

On the node running the master, retrieve the PID and kill the automation
manager:

# lssrc -ls IBM.RecoveryRM | grep PID
# kill -9 <PID>

As a result, an automation manager on another node in the domain will take
over the master role and proceed with making automation decisions. The
subsystem will restart the killed automation manager immediately.

Move to another node in the same HA group and see if you can run the lssam
command. If you can, go back to the original node to see if you can now do
the lssam command. If this still does not work, then run the following
commands:
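The commands themselves were not preserved in this copy. As an assumption, mirroring the master-daemon check used above, the checks would be along these lines:

```
# lssrc -ls IBM.RecoveryRM | grep Master
# lsrpnode
```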

Make sure that neither of the above commands returns the "hanging" node
in its output; if one does, reboot just that node and see whether the
issue is resolved.

AVOID the following (DON’Ts)

Do not use rpower -a, or rpower on more than one node in the same HA
group, when SAMP HA is up and running.

Do not offline HA-NFS using a sudo command while logged in as the
instance owner and while in the /db2home directory. HA-NFS will get
stuck online, and the RecoveryRM daemon has to be killed on the
master. If RecoveryRM will not start, reboot may be required.

Do not use ifdown to bring down a network interface. This will result
in the eth (or en) device being deleted from the equivalency membership
and will require you to add the "eth" device (in Linux) or "en" device
(in AIX) back into the network equivalency using the chequ command.

Do not manipulate any BW resources that are under active SAMP control.
Turn automation off (samctrl -M T) before manipulating these BW
resources.

Do not implement changes to the SA MP policy unless exhaustive testing
of the HA test cases is completed.

Check the following frequently (DOs)

Ensure the /home and /db2home directories are always mounted before
starting up a node.

Check for process ids that may be blocking stop, start and monitor
commands.

Save backup copies of the db2nodes.cfg and db2ha.sys files.

Save the backup copies of the current SAMP policy before and after
every SAMP change. Compare the current SAMP policy to the backup SAMP
policy every time there is an HA incident.

Save backup copies of db2pd -ha output before and after every SAMP
change. Compare the current db2pd outputs to the backup db2pd outputs
every time there is an HA incident.