This document provides a general overview and sample configurations of
the Redundant Link Manager (RLM) used in the Cisco PGW 2200 for signaling mode.
Information is also provided on troubleshooting the RLM signaling and ISDN
signaling between the network access server (NAS) gateway and Cisco PGW 2200.

The RLM provides virtual link management over multiple IP networks so
that the Cisco Q.931+ signaling protocol can be transported on top of multiple
redundant links between Cisco PGW 2200 and Cisco NAS.

RLM provides:

A client/server relationship—NAS RLM is always the
client and switches a link when a failure is detected.

The information in this document is based on Cisco PGW 2200 Software
Release 9.x.

Note: The RLM details are part of Cisco PGW 2200 version 7.4(11) and
7.4(12). However, this document only provides guidelines for Cisco PGW 2200
Release 9.x.

The information in this document was created from the devices in a
specific lab environment. All of the devices used in this document started with
a cleared (default) configuration. If your network is live, make sure that you
understand the potential impact of any command.

One RLM group is configured on a gateway and two Cisco PGW 2200s are
configured within the RLM group. One has the IP address and UDP port for the
active Cisco PGW 2200 and the other has the IP address and UDP port of the
standby Cisco PGW 2200 (see Figure 2).

Each server in the RLM group is supported by two UDP channels on
different UDP ports. One UDP channel (port 3000) transports the RLM protocol
and the other UDP channel (port 3001) transports the Q.921 protocol.

The objective of RLM is to insulate the call signaling layers from
the indeterminate nature of network behavior typically associated with IP-based
networks. The RLM maintains various virtual links between the Cisco PGW 2200
and the remote NAS and continuously monitors the link state to determine if the
outgoing frames should assume an alternative path.

Because each RLM group must be bound to a Cisco PGW 2200 I/O Channel
Controller (IOCC), and each IOCC requires its own UDP port, multiple IOCCs
are required to support this configuration. The Cisco PGW 2200 can support
up to eight Primary Rate Interface Internet Protocol (PRIIP) IOCCs, and each
PRIIP IOCC supports up to 32 gateways (RLM). This means that on the Cisco
PGW 2200, you have the eight odd-numbered ports 3001, 3003, 3005, and so on
through 3015. Use the UNIX command
netstat -a | grep 30 to verify this on the Cisco
PGW 2200.

Information from the XECfgParm.dat file under directory
/opt/CiscoMGC/etc:

*.maxNumLinks = 32

*.maxNumRLMPorts = 8 # Maximum number of unique RLM ports

The PGW 2200 supports a maximum of eight PRI channel controller
processes. These processes are created when you configure the PGW 2200. For
example, you use port 3000 and 3001 in your Cisco IOS® / PGW 2200
configuration, for RLM and ISDN. This creates one IOCC for PRI(NI+). Therefore,
every time you use a different port another process is created.

Each process supports up to 32 gateways. If you use one RLM per
gateway, then you can have 256 gateways. But when you have four RLMs per
gateway for traffic routing, then you are left with a capacity of 64 physical
gateways.
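The capacity arithmetic above can be sketched in a few lines of Python. This is only an illustration: the constant and function names are hypothetical, and mapping the two XECfgParm.dat values to "processes" and "gateways per process" is an assumption based on the surrounding text.

```python
# Assumption: these mirror *.maxNumRLMPorts and *.maxNumLinks as described above.
MAX_PRIIP_PROCESSES = 8      # PRI channel controller processes per PGW 2200
GATEWAYS_PER_PROCESS = 32    # gateways (RLM) handled by each process


def max_gateways(rlm_groups_per_gateway: int) -> int:
    """Return how many physical gateways fit for a given number of RLMs per gateway."""
    if rlm_groups_per_gateway < 1:
        raise ValueError("each gateway needs at least one RLM group")
    total_rlm_slots = MAX_PRIIP_PROCESSES * GATEWAYS_PER_PROCESS  # 8 * 32 = 256
    return total_rlm_slots // rlm_groups_per_gateway


if __name__ == "__main__":
    for groups in (1, 2, 4):
        print(groups, "RLM(s) per gateway ->", max_gateways(groups), "gateways")
```

With one, two, and four RLMs per gateway, this gives 256, 128, and 64 gateways, which matches the figures quoted in the text.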

Note: IUA use is supported from Cisco PGW 2200 release 9.4 or later. The
support for IUA with SCTP is limited because RLM has limitations in terms of
scaling to support large numbers of NFAS groups per media gateway. Refer to
Support for IUA with SCTP for further information.

Note: Do not change this value. Also, be aware that the more RLM sessions
you use per gateway, the fewer total gateways each Cisco PGW 2200 can support.
For example, with one RLM per gateway a Cisco PGW 2200 supports a total of 256
gateways, with two RLMs per gateway it supports 128 gateways, and so forth.

The gateways are considered the client side and are responsible for the
instigation of a switchover to a lower weight standby RLM link in the event of
a failure.

The default UDP port for the RLM data link is one plus the value of
the RLM management link UDP port value (for example, 3001).

Figure 3: RLM Configuration Information

The IOS commands show rlm group x and
show ip sockets display the UDP ports in use on the
IOS NAS.

The nfas_int in the E1/T1 controller
must match the spanID in the Cisco PGW 2200 bearer channel
configuration. This is a key point in the channel mapping. It is transported in
the ChannelD IE of the Q.931 setup message together with the timeslot.

The RLM link management packet consists of six bytes as this diagram
shows.

The only version of RLM currently supported in the PGW 2200 is version
2.0.

The control field provides the command to the peer. These are valid
control values:

RLM_START_REQ (0x01)—Used to initiate an RLM link.
Only generated by the NAS.

RLM_START_ACK (0x02)—Generated by the PGW 2200 to
acknowledge the start of an RLM link.

RLM_STOP_REQ (0x03)—Generated by either the PGW 2200
or the NAS to stop a link.

RLM_STOP_ACK (0x04)—Acknowledgement to a stop
request.

RLM_ECHO_REQ (0x05)—Used by the NAS only to
periodically ping the PGW 2200 in order to verify
link integrity. Used on both an active link and all standby links.

RLM_ECHO_ACK (0x06)—Acknowledgement of an echo
request.

RLM_SWITCH_REQ (0x07)—Used to switch from a lower
weighted active RLM link to a higher weighted available link.

RLM_SWITCH_ACK (0x08)—Acknowledgement of a switch
request.

The packet length is the length of the RLM management packet (UDP
payload). For RLM version 1.0, this value is always 6. For RLM version 2, this
value is 8.

The sequence number is a unique value used to correlate a specific
command request and acknowledgement.
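As a reference aid, the control values listed above can be captured in a small Python enumeration. This is a minimal sketch: the class and helper names are hypothetical, and no byte layout of the packet is assumed here. It only encodes the listed codes and the rule that a sequence number ties a request to its acknowledgement.

```python
from enum import IntEnum


class RlmControl(IntEnum):
    """RLM control field values as listed in this document."""
    START_REQ = 0x01
    START_ACK = 0x02
    STOP_REQ = 0x03
    STOP_ACK = 0x04
    ECHO_REQ = 0x05
    ECHO_ACK = 0x06
    SWITCH_REQ = 0x07
    SWITCH_ACK = 0x08


def expected_ack(request: RlmControl) -> RlmControl:
    """Return the acknowledgement code paired with a request code."""
    if request.value % 2 == 0:
        raise ValueError(f"{request.name} is already an acknowledgement")
    # In the list above, each ACK code is the REQ code plus one.
    return RlmControl(request.value + 1)


def is_matching_reply(req_code: RlmControl, req_seq: int,
                      ack_code: RlmControl, ack_seq: int) -> bool:
    """A reply matches when it is the paired ACK and carries the same sequence number."""
    return ack_code == expected_ack(req_code) and ack_seq == req_seq


if __name__ == "__main__":
    print(expected_ack(RlmControl.ECHO_REQ).name)
    print(is_matching_reply(RlmControl.START_REQ, 7, RlmControl.START_ACK, 7))
```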

Figure 4: RLM Message Flow for Link Recovery

In Figure 4, the client RLM on the NAS initiates a request to the Cisco
PGW 2200 to start an RLM session. Assume the NAS is configured to give the
first link a higher priority. After the Cisco PGW 2200 acknowledges the start
request, the link is considered available and data packets can be sent on the
data UDP port. The second link is placed in a standby mode. The RLM
periodically sends the echo requests to all configured RLM links in a given RLM
group. The default interval is 1 second.

With regard to the TIMEOUT issues in Figure 4, if the active link does
not receive a response to one of the RLM echo requests, it attempts to retry
the request (default value is three attempts). Upon failure to receive an
acknowledgement, the client RLM initiates a link recovery by sending a start
request to the next highest weighted standby link available. The client RLM
continues to poll the previously active link. If a response is eventually
received, it performs a link switchover back to the higher weighted link. If
the link weights are identical, the RLM client selects the link where the start
acknowledge is first received. For the standby Cisco PGW 2200, the RLM server
does not acknowledge the echo requests from the NAS while in the standby state.
Once the standby becomes the active server and all call states are restored,
the RLM starts to acknowledge the requests from the NAS.
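The link selection rule described above (prefer the highest weighted link that responds, and break ties by whichever start acknowledgement arrives first) can be modeled roughly as follows. This is a simplified sketch, not the RLM implementation: the `Link` type, its fields, and `select_active` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Link:
    name: str
    weight: int          # higher weight means more preferred
    responding: bool     # True if the server acknowledges requests on this link
    ack_order: int = 0   # arrival order of the start acknowledgement (lower is earlier)


def select_active(links: Sequence[Link]) -> Optional[Link]:
    """Pick the responding link with the highest weight; ties go to the earliest ACK."""
    candidates = [link for link in links if link.responding]
    if not candidates:
        return None  # no link available: loss of RLM connectivity
    return min(candidates, key=lambda link: (-link.weight, link.ack_order))


if __name__ == "__main__":
    primary = Link("link-1", weight=2, responding=True)
    standby = Link("link-2", weight=1, responding=True, ack_order=1)
    print(select_active([primary, standby]).name)   # normal operation
    primary.responding = False                      # echo requests time out
    print(select_active([primary, standby]).name)   # recovery to the standby link
    primary.responding = True                       # the polled link answers again
    print(select_active([primary, standby]).name)   # switch back
```

The model re-evaluates the choice each time a link's state changes, which mirrors the switch back to the higher weighted link once it responds again.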

The behavior of RLM is such that RLM keepalives are only transmitted
when signaling traffic has not been transmitted for some time. For instance,
the receipt of a signaling message (for example, Q.921) has the effect of
resetting the RLM keepalive timer. Note also that RLM keepalives are only
transmitted by the NAS. The Cisco PGW 2200 only responds to RLM keepalive
requests. However, if the RLM keepalive timer expires on the Cisco PGW 2200, it
brings down the link. Increasing the RLM keepalive timer values on both sides
(PGW 2200 and NAS) ensures that the RLM link is not reset during transient
conditions in the IP network during which the default RLM keepalive timer value
may be too stringent. For a single Cisco PGW 2200, there is no penalty for
doing this. With two Cisco PGW 2200s in a failover configuration, there is a
trade-off between avoiding flaps in the RLM link and quickly detecting a link
failure. With RLM, keepalive timers and Q.921/Q.931 timers increased.

When you look at the control RLM information messages (see Figure 5),
the control field provides the command to the peer. The values in Figure 5 are
valid control values:

This section is designed to preserve stable calls during Cisco PGW 2200
failover or under conditions of transient IP network instability. These changes
ensure that calls are retained unless there is prolonged loss of RLM
connectivity. Loss of RLM connectivity means there are no available links to
carry signaling traffic between the NAS and the active Cisco PGW 2200. Loss of
a single link is handled by the RLM layer transparently to the ISDN stack.

With the show rlm group <x> command on
the IOS NAS, you can check the timers of the RLM.

Table 1: RLM Default Timer Values on the Cisco IOS NAS

Timer         Duration
Open Wait     3 seconds
Recovery      12 seconds
Minimum-up    60 seconds
Keepalive     1 second
Force-down    30 seconds
Switch-link   5 seconds
Retransmit    1 second

The force-down time needs to be longer than the total keepalive time
(keepalive period * retries) plus the recovery time:

force-down > (keepalive * retries) + recovery

With the default values and three retries, this is 30 > (1 * 3) + 12 = 15
seconds.

If the force-down and keepalive timers have the same value, then the
IOS NAS cannot recognize that the link is reset because the keepalive is
greater than or equal to the force-down time.
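The force-down rule can be checked before timer values are changed. The following Python sketch uses hypothetical function names and takes the retry count of three from the default mentioned earlier in this document. It is a planning aid only and does not read any device configuration.

```python
def min_force_down(keepalive_s: float, retries: int, recovery_s: float) -> float:
    """Total keepalive time plus recovery time; force-down must exceed this."""
    return keepalive_s * retries + recovery_s


def force_down_ok(force_down_s: float, keepalive_s: float,
                  retries: int, recovery_s: float) -> bool:
    """True when force-down is strictly longer than keepalive * retries + recovery."""
    return force_down_s > min_force_down(keepalive_s, retries, recovery_s)


if __name__ == "__main__":
    # Defaults: keepalive 1 s, 3 retries, recovery 12 s, force-down 30 s.
    print(min_force_down(1, 3, 12), force_down_ok(30, 1, 3, 12))
    # Keepalive raised to 10 s while force-down stays at its default.
    print(min_force_down(10, 3, 12), force_down_ok(30, 10, 3, 12))
```

With the defaults the threshold is 15 seconds, so the 30 second force-down timer satisfies the rule. If the keepalive is raised to 10 seconds, the threshold becomes 42 seconds and the default force-down timer no longer does.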

Keepalive Timer—The IOS NAS sends ECHO_REQ every 1
second. After three lost ECHO_REQ, NAS thinks the link might be down and it
starts a recovery timer (12 seconds). However, it continues to send ECHO_REQ
expecting that the link might come back up. Note that in older Cisco IOS
versions the default recovery timer values are too long, and there were
instances where the RLM link could be taken down. The best approach is
to check these timers on both systems. During the startup/shutdown of the
standby Cisco PGW 2200, the active Cisco PGW 2200 is delayed in its response to
the ECHO_REQ from the IOS NAS. After three tries from the IOS NAS, each with a
default of one second timeout, the IOS NAS brings down the RLM link. By
increasing the keepalive timer from 1 second to 10 seconds, it is possible to
keep the active RLM up. This way, the IOS NAS waits longer after each ECHO_REQ
before timing out and trying again. With a 10 second keepalive, the IOS NAS can
wait 30 seconds before timing out and bringing down the RLM link. However, if
you change the keepalive timer in this way, you must also pay attention to
the force-down timer.

Recovery Timer—If you want to reduce the recovery
timer, bring down the active RLM link quickly before the Cisco PGW 2200
restarts. This is done by configuring both the keepalive timer and force-down
timer with the same value. As a result, when the IOS NAS is reloaded and comes
back, the remote IOS NAS cannot recognize that the link is reset because the
keepalive is greater than or equal to the force-down time. The correction is
that the force-down timer must be greater than the total keepalive time (three
times the keepalive period) plus the recovery timer.

Force-down Timer—According to the specification, RLM
remains in the Recovery state for about 15 seconds (three ECHO_REQ retries at 1
second each plus the 12 second recovery timer). If the link does not come back within
that time frame, the RLM state goes to the DOWN state and is forced to stay
down for 30 seconds as a default to avoid the ping-pong effect. After that, it
begins to send out keepalives. Both the client and server go through this cycle
at about the same time. When the RLM state goes from IDLE to DOWN, there is no
need to force the state down since it is already in the DOWN state. This means
that when the Ethernet/Fast Ethernet links are disconnected, the RLM client at
IOS NAS tries to restore the link for a period defined by the recovery timer
(default value equals 12 seconds). If it is not successful, there is a
force-down timer (default value equals 30 seconds) that prevents the RLM client
from responding even if the Ethernet links are up. Only after the force-down
timer expires does the RLM client begin to establish the links with the Cisco
PGW 2200. In this case you can have a delay of 42 seconds (the combination of
the recovery and force-down timers [12 + 30 = 42 seconds]).
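The durations quoted in the three timer descriptions above can be reproduced with a short Python sketch. The `RlmTimers` class and its method names are hypothetical. The defaults come from Table 1 and from the retry count of three, and each method simply mirrors the arithmetic stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RlmTimers:
    keepalive_s: int = 1     # interval between ECHO_REQ messages
    retries: int = 3         # lost ECHO_REQ before the link is suspected down
    recovery_s: int = 12     # recovery timer
    force_down_s: int = 30   # forced DOWN period that avoids the ping-pong effect

    def detection_window_s(self) -> int:
        """Time the NAS waits before it considers the link down."""
        return self.keepalive_s * self.retries

    def recovery_state_s(self) -> int:
        """Approximate time spent in the Recovery state (echo retries plus recovery)."""
        return self.detection_window_s() + self.recovery_s

    def reconnect_delay_s(self) -> int:
        """Delay before the client re-establishes links (recovery plus force-down)."""
        return self.recovery_s + self.force_down_s


if __name__ == "__main__":
    defaults = RlmTimers()
    print(defaults.detection_window_s(), defaults.recovery_state_s(),
          defaults.reconnect_delay_s())
    print(RlmTimers(keepalive_s=10).detection_window_s())
```

With the defaults this yields 3, 15, and 42 seconds, and a 10 second keepalive stretches the detection window to 30 seconds, as described in the Keepalive Timer text.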

Description: The maximum number of link recoveries within a 10 minute
period before the path to the destination is alarmed as unstable.

Default: 10

Type: int

Range: 1-100

Note: When you modify timers, the mismatched timers between the Cisco PGW
2200 and the NAS can be difficult to diagnose. Therefore, as an operational
matter, it is recommended that the default settings be used unless there is a
compelling reason to change them.

The PGW 2200 is required to provide ISDN Q.921 and NI-2 Q.931
connections over redundant IP links to various remote Cisco NAS gateways. These
redundant IP links are maintained by the RLM. Thus, all the timeslots on the
time-division multiplexing (TDM) interfaces (IMT trunks) that run into the NAS
contain only bearer channels. ISDN signaling is carried across the IP links
from the PGW 2200 to the NAS gateways. Each signaling connection consists of a
pair of redundant IP links between the PGW 2200 and the NAS. There can be one
or more signaling connections on each NAS. Each signaling connection
exclusively controls a set of NAS TDM interfaces as a Non-Facility Associated
Signaling (NFAS) group.

With traditional ISDN signaling, each ISDN PRI circuit has a timeslot
(D-channel) used to carry the signaling. However, with ISDN NFAS PRI, the
signaling is carried on a single D-channel for all PRI interfaces in the NFAS
group. This reduces the number of signaling links needed for the PRI lines and
yields extra bearer channels to be used for data, voice, or video. It is
optional to have a backup D-channel on another interface should the primary
interface go out of service. In Cisco's SS7 Interconnection solution for access
server and voice gateway, the ISDN NFAS feature is used. However, with the SS7
implementation, the ISDN signaling channel (D-channel) is freed up from the PRI
interface and redirected to another port (Ethernet, Fast Ethernet or serial).
Therefore, all the PRI timeslots contain only bearer channels and no signaling.

Single Channel Service Message—Reports the service
state (IS or OOS) for a single bearer channel.

Group Service Message—Reports the service state for
all bearer channels for one or more T1/E1 interfaces.

Sync and Re-sync—Checkpoints the call states between
the PGW 2200 and the NAS gateways. These messages are typically generated after
a switch over event to determine if any discrepancies occurred in the call
states.

Configuration on the NAS gateway is simple. Every NAS gateway has one
or more RLM groups defined. Within the RLM group, and if the PGW 2200 is in
redundant mode, there are two server link groups (one for the Active PGW 2200
and another one for the Standby PGW 2200). Each server link group can have one
or two links that connect to each of the PGW 2200 Ethernet (E0 and/or E1)
interfaces. The NAS gateway can use any of its interfaces (loopback,
Ethernet, or Fast Ethernet) as the source address to create the links to the
PGW 2200. For full redundancy, the NAS gateway connects two Ethernet interfaces
to both PGW 2200s. One Ethernet connects to both PGW 2200 hme0 interfaces in
one VLAN. The other Ethernet interface connects to both PGW 2200 hme1
interfaces in another VLAN. See this diagram for a full redundancy setup.

You can also verify this same information in the .dat files located in
the /opt/CiscoMGC/etc directory. The .dat files are the information gathered
from configuring and provisioning the PGW 2200. The sigChanDevIp.dat file
contains all the information on the IP link to the PGW 2200 from both the NAS
gateway and SLT.

Use this information to make sure that the IP addresses configured in
sigChanDevIp.dat are correct.

00100001 IP_Addr1 3001 172.16.13.141 3001 0.0.0.0 255.255.255.255
00100001 = Signalling Channel Component ID as defined for the engine.
!--- Must match what is configured in the components.dat file.
IP_Addr1 = Symbolic link to the name defined within XECfgParm.dat
!--- *.IP_Addr1 = 172.16.13.132 # Address of interface on motherboard.
3001 = UDP port defined for receive side of ISDN messages.
!--- RLM manager runs on the - 1 value, or 3000 in this example.
172.16.13.141 = IP address of the NAS gateway.
!--- Must match the IP address defined in the RLM group on the NAS gateway.
3001 = UDP port defined for transmit side of ISDN messages for the NAS gateway
!--- RLM manager runs on the - 1 value, or 3000 in this example.
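To cross-check an entry against the NAS configuration, the annotated fields can be pulled apart with a short script. This Python sketch is an illustration only: the class and field names are hypothetical, it assumes whitespace-separated fields in the order annotated above, and it keeps the last two columns as raw text because this document does not describe them.

```python
from dataclasses import dataclass
import ipaddress
from typing import Tuple


@dataclass
class SigChanDevIpEntry:
    component_id: str                  # signaling channel component ID
    local_addr_name: str               # symbolic name resolved in XECfgParm.dat
    local_port: int                    # UDP port for the receive side of ISDN messages
    peer_ip: ipaddress.IPv4Address     # IP address of the NAS gateway
    peer_port: int                     # UDP port for the transmit side on the NAS
    extra: Tuple[str, ...]             # remaining columns, not interpreted here

    @property
    def local_rlm_port(self) -> int:
        """The RLM manager runs on the ISDN port minus one."""
        return self.local_port - 1


def parse_entry(line: str) -> SigChanDevIpEntry:
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    return SigChanDevIpEntry(
        component_id=fields[0],
        local_addr_name=fields[1],
        local_port=int(fields[2]),
        peer_ip=ipaddress.IPv4Address(fields[3]),
        peer_port=int(fields[4]),
        extra=tuple(fields[5:]),
    )


if __name__ == "__main__":
    entry = parse_entry(
        "00100001 IP_Addr1 3001 172.16.13.141 3001 0.0.0.0 255.255.255.255"
    )
    print(entry.component_id, entry.peer_ip, entry.local_rlm_port)
```

The parsed peer address and ports can then be compared by hand with the RLM group definition on the NAS gateway.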

Make sure that the correct ISDN protocol is configured to run on the
ISDN/IP connection.

Get the PGW 2200 component ID (00100001) information within the
sigChanDevIp.dat file for the IP link. Then, go to the sigChanDev.dat file and
get the component ID for Signaling Path component ID (00140001) on the fourth
column. With this Signaling Path component ID, use the sigPath.dat file to find
the ISDN protocol used (ISDNPRI BELL_1268_C3 ).

Note: In PGW release 9.3(2) and later, the BELL_1268_C3 variant is changed to
BELL_1268_C2.

On the NAS gateway, run the debug rlm group
x command to look at the keepalive and packet flow between the
PGW 2200 and NAS gateway.

This output shows some example command output from the NAS gateway. In
normal operation, there are constant keepalives (ECHO_REQ and ECHO_ACK)
exchanged between the NAS gateway and PGW 2200 every 1 second. If this does not
occur, figure out who is not responding or sending the keepalives.

Note: The TID (transaction ID) is the same in an echo request and its echo
acknowledgement. Even though the other PGW 2200 (172.16.13.134) is in standby
mode, it constantly communicates with the NAS gateway.

Review the syslog file (/opt/CiscoMGC/var/log/platform.log) for clues
to the problem.

Turn on debug mode on the PGW 2200 for
certain processes (such as engine or ISDN PRI over IP [PRIIP]).

Use the Snooper Tool to sniff the IP packets between the PGW 2200
and the NAS gateway.

Use the MML command rtrv-alms to view any
alarms the system experiences. A more helpful command is
rtrv-alms::cont, which continuously listens for any
current alarms that are reported. The most useful information is in the
platform.log file under the /opt/CiscoMGC/var/log/ directory. This file
contains all the information from the system. Since this file might be very
large, use the UNIX command grep to search and parse
through the file.

The keyword to search for when troubleshooting ISDN and RLM is IOCC-PRIIP,
which is the I/O Channel Controller for PRIIP. Another method is to use
tail -f platform.log under the /opt/CiscoMGC/var/log/
directory to continuously monitor in real time any error message that appears.
You can set the PGW 2200 in debugging mode. Set the PRIIP process into
debugging mode and look deeper into the packet flows within the PGW 2200.
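When grep is not convenient, for example when a copy of the log is analyzed off the box, the same keyword filter can be written in a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the sample lines are placeholders rather than real platform.log output.

```python
from typing import Iterable, Iterator


def filter_log(lines: Iterable[str], keyword: str = "IOCC-PRIIP") -> Iterator[str]:
    """Yield only the lines that mention the keyword, without the trailing newline."""
    for line in lines:
        if keyword in line:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Placeholder text only; real platform.log entries look different.
    sample = [
        "placeholder entry mentioning IOCC-PRIIP\n",
        "placeholder entry about something else\n",
    ]
    for match in filter_log(sample):
        print(match)
```

To scan a real file, pass an open file object, for example `filter_log(open("platform.log", errors="replace"))`, because the function accepts any iterable of lines.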

The other tool you can use is the Cisco Snooper. It can monitor (in
real-time) different types of protocols (for example, RLM, SS7, ISDN, and
H.225) that run over IP. It is like a sniffer connected off the Ethernet
segment to monitor all types of traffic. This paper does not cover the
troubleshooting procedure using the Cisco Snooper tool.

This is some example output from the PGW 2200. In normal operation,
there is constant communication between the NAS gateway and the PGW 2200. The
keepalive messages can be monitored on the PGW 2200. Enable the PGW 2200 to
have the PRIIP process in debugging mode with the MML command
set-log:priip-01:debug,confirm.

The RESYNC_REQ/RESYNC_RESP messages are used to checkpoint the call
states between the PGW 2200 and the NASes. These messages are typically
generated after a switch-over event to determine if any discrepancies occurred
in the call states. These messages are used to re-establish a consistent view
of the channel call states on both the PGW 2200 and NAS gateway to prevent any
possible hung CICs.

Group Service Message

Similar to the RESYNC message, the Group Service messages use a single
message per D-channel to indicate the service state (IS/OOS) of all of the
associated B-channels. The NAS initiates the Group Service operation. Actions
are taken on the PGW 2200 side to maintain consistency of the channel states
based on the result of comparing the state of each channel. When the PGW 2200
receives this message, it sends out SS7 ISUP circuit group block (CGB/CGBA) and
circuit group unblock (CGU/CGUA) to correspond to the B-channel service
indications from the group service messages. In addition, the acknowledgement
to the group service message from the NAS does not occur until the signaling
gateway receives a CGBA or CGUA from the PSTN switch.

In the Cisco SS7 Interconnection voice gateway solutions configuration,
bearer channels from a NAS are mated (nailed up) to SS7 bearers. Previously, the
PGW 2200 engine handled each individual NAS service message by setting bearer
channel service states. When many channels on a NAS change state
simultaneously, the resulting service messages can flood the switch if they are
sent individually. A group service message sent from the NAS efficiently
informs the engine of the state of all bearer channels. The engine must decode
this message, change the state of each NI-2 bearer channel, and propagate the
changes to the SS7 side, from which corresponding block and unblock channel
management messages (CGB/CGBA and CGU/CGUA) must be sent. This allows for
maximum efficiency. This Group Service Message (GSM) helps minimize the number
of SERVICE/SERVICE ACK message transactions when more than one channel (or
interface) is taken out of service or placed in service. Group
Service messages can handle up to thirty interfaces at a time.

You can issue the snoop command on all
Solaris platforms. Log in as superuser and issue this command to collect UNIX
snoop information:

snoop -o snoop.log <IP address>
!--- Press Ctrl-C to exit snoop.

Upload the snoop.log file to the case notes.

Note: Explain in the case notes that this file was captured through use
of the UNIX snoop command.

Run the Cisco snooper application. Log in as a superuser and issue
the ./snooper int INTERFACE PARMS LIST command or
run ./snooper to collect Cisco snooper information,
which gives you a full description.

./snooper int hme'x' ni2+ rlm ss7 > snooper_int1
!--- Where 'x' is the interface number, which you can also find
!--- by issuing the ifconfig -a command.

Issue the MML command rtrv-alms on the Cisco
PGW 2200 to find out the reason for the failure. In this scenario, both Ethernet
and FastEthernet are down on the NAS with hostname v5300-2. This results in
'signas1' being unreachable.