* To achieve active-active redundancy, a minimum of two contexts and two FT groups are required on each ACE.


* When you configure redundancy, the ACE keeps all interfaces that do not have an IP address in the Down state. The IP address and the peer IP address that you assign to a VLAN interface should be in the same subnet but should be different IP addresses. For more information about configuring VLAN interfaces, see the [http://www.cisco.com/en/US/docs/interfaces_modules/services_modules/ace/v3.00_A2/configuration/rtg_brdg/guide/rtbrgdgd.html ''Cisco Application Control Engine Module Routing and Bridging Configuration Guide.'']


* FT interfaces are automatically placed in trunk mode, so the switch side must be configured to trunk the specific VLAN that you are using for the FT interface.

===Example of a Redundancy Configuration===


Revision as of 15:04, 20 January 2010

This article describes the procedures for troubleshooting redundancy issues with your ACE.

Overview of ACE Redundancy

Redundancy (or fault tolerance) allows your network to remain operational even if one of the ACEs becomes unresponsive. Redundancy ensures that your network services and applications are always available.

Redundancy provides seamless switchover of flows if an ACE becomes unresponsive or a critical host, interface, or HSRP group fails. Redundancy supports the following network applications that require fault tolerance:

Mission-critical enterprise applications

Banking and financial services

E-commerce

Long-lived flows such as FTP and HTTP file transfers

Redundancy Protocol

You can configure a maximum of two ACEs (peers) in the same Catalyst 6500 series switch or in different chassis for redundancy. Each peer module can contain one or more fault-tolerant (FT) groups. Each FT group consists of two members: one active context and one standby context. For more information about contexts, see the Cisco Application Control Engine Module Virtualization Configuration Guide. An FT group has a unique group ID that you assign.

Both ACE modules can be active at the same time, processing traffic for distinct virtual devices and backing up each other (stateful redundancy). See Figure 1.

Figure 1. Example of an Active-Active Configuration

The ACE uses the redundancy protocol to communicate between the redundant peers. The election of the active member within each FT group is based on a priority scheme. The member configured with the higher priority is elected as the active member. If a member with a higher priority is found after the other member becomes active, the new member becomes active because it has a higher priority. This behavior is known as preemption and is enabled by default.
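The election rule above can be illustrated with a minimal Python sketch. It is not ACE code: the `Member` class, the `elect_active` function, and the tie-handling rule (equal priorities leave the current active member in place) are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    """Hypothetical model of one FT group member (not an ACE data structure)."""
    name: str
    priority: int


def elect_active(current_active: Optional[Member], candidate: Member,
                 preempt: bool = True) -> Member:
    """Return the member that should be active after `candidate` appears."""
    if current_active is None:
        # No active member yet, so the candidate takes the active role.
        return candidate
    if preempt and candidate.priority > current_active.priority:
        # Preemption: a strictly higher priority takes over the active role.
        return candidate
    # Assumption: ties and lower priorities leave the current active member alone.
    return current_active
```

With preemption enabled (the default described above), a late-arriving member with a higher priority replaces the current active member; with `preempt=False` the current active member keeps its role.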

One virtual MAC address (VMAC) is associated with each FT group. The format of the VMAC is: 00-0b-fc-fe-1b-groupID. Because a VMAC does not change upon a switchover, the client and server ARP tables do not require updating. The ACE selects a VMAC from a pool of virtual MACs available to it. You can specify the pool of MAC addresses that the local ACE and the peer ACE use by configuring the shared-vlan-hostid command and the peer shared-vlan-hostid command, respectively. To avoid MAC address conflicts, be sure that the two pools are different on the two ACEs. For more information about VMACs and MAC address pools, see the Cisco Application Control Engine Module Routing and Bridging Configuration Guide.
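As a rough illustration of the VMAC format, the following Python sketch builds the address string for a given FT group ID. The function name `ft_group_vmac` is hypothetical, and the sketch assumes that the group ID fits in one octet and is written as two lowercase hexadecimal digits in the last position; the text above only gives the pattern 00-0b-fc-fe-1b-groupID.

```python
def ft_group_vmac(group_id: int) -> str:
    """Build the VMAC string for an FT group (illustrative helper only)."""
    # Assumption: the group ID occupies the final octet of the address.
    if not 0 <= group_id <= 0xFF:
        raise ValueError("group_id must fit in one octet")
    # Assumption: the octet is rendered as two lowercase hex digits.
    return f"00-0b-fc-fe-1b-{group_id:02x}"
```

For example, under these assumptions `ft_group_vmac(10)` would yield an address ending in `0a`. Because the address depends only on the group ID, two modules that reuse the same FT group number would derive the same VMAC.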

Each FT group acts as an independent redundancy instance. When a switchover occurs, the active member in the FT group becomes the standby member and the original standby member becomes the active member. A switchover can occur for the following reasons:

The active member becomes unresponsive.

A tracked host, interface, or HSRP group fails.

You enter the ft switchover command to force a switchover.

FT VLAN

Redundancy uses a dedicated FT VLAN between redundant ACEs to transmit flow-state information and the redundancy heartbeat. You must configure this same VLAN on both peer modules. You also must configure a different IP address within the same subnet on each module for the FT VLAN. Cisco recommends two port-channeled 1-Gigabit Ethernet links for the FT VLAN.
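The addressing rule for the FT VLAN (different IP addresses, same subnet) can be checked offline before the addresses are configured. The following Python sketch uses only the standard `ipaddress` module; the function name `check_ft_addresses` and the sample addresses in the usage note are hypothetical.

```python
import ipaddress


def check_ft_addresses(local: str, peer: str) -> list:
    """Return a list of problems with a local/peer FT VLAN address pair.

    Both arguments use address/prefix notation, for example "192.168.1.1/24".
    """
    local_if = ipaddress.ip_interface(local)
    peer_if = ipaddress.ip_interface(peer)
    problems = []
    if local_if.network != peer_if.network:
        # The two FT VLAN addresses must be in the same subnet.
        problems.append("local and peer addresses are in different subnets")
    if local_if.ip == peer_if.ip:
        # The two modules must not share one address.
        problems.append("local and peer addresses are identical")
    return problems
```

An empty list means the pair satisfies both rules; for instance, `check_ft_addresses("192.168.1.1/24", "192.168.1.2/24")` reports no problems.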

Note:

Do not use the FT VLAN for any other network traffic, including HSRP traffic and data.

The two redundant modules constantly communicate over the FT VLAN to determine the operating status of each module. The standby member uses the heartbeat packet to monitor the health of the active member. The active member uses the heartbeat packet to monitor the health of the standby member. Communications over the switchover link include the following data:

Redundancy protocol packets

State information replication data

Configuration synchronization information

Heartbeat packets

For multiple contexts, the FT VLAN resides in the system configuration file. Each FT VLAN on the ACE has one unique MAC address associated with it. The ACE uses these device MAC addresses as the source or destination MACs for sending or receiving redundancy protocol state and configuration replication packets.

Note:

The IP address and the MAC address of the FT VLAN do not change at switchover.

Configuration Requirements and Restrictions

Follow these requirements and restrictions when configuring the redundancy feature:

Redundancy is not supported between an ACE module and an ACE appliance operating as peers. Both peers must be of the same ACE device type and run the same software release.

In bridged mode (Layer 2), two contexts cannot share the same VLAN.

To achieve active-active redundancy, a minimum of two contexts and two FT groups are required on each ACE.

FT interfaces are automatically placed in trunk mode, so the switch side must be configured to trunk the specific VLAN that you are using for the FT interface.

Example of a Redundancy Configuration

The following example shows a running-configuration file that defines fault tolerance (FT) for a single ACE module operating in a redundancy configuration. You can configure a maximum of two ACE modules (peers) for redundancy, so that traffic can fail over from the active module to the standby module.

Note:

All FT parameters are configured in the Admin context.

This configuration addresses the following redundancy components:

A dedicated FT VLAN for communication between the members of an FT group. You must configure this same VLAN on both peer modules.

If the software or license is incompatible, install the appropriate software image or license in the peer to correct the problem.

2. Ensure that any SSL certificates (certs) and keys that exist in the active ACE are also configured in the standby ACE. SSL certs and keys are not synchronized automatically from the active to the standby. Use the crypto export and crypto import commands to accomplish this task. This requirement also applies to scripts and scripted probes. Failure to keep the active and standby configurations identical will cause configuration synchronization to fail and may cause the standby ACE to enter the STANDBY-COLD state.

The ACE sends heartbeat packets via UDP over the FT VLAN between peers. When heartbeats are not received during the specified interval (the interval and count are configurable), the ACE notifies the HA processor on the CP by sending a Peer_Down interprocess communication protocol (IPCP) message. If a peer is down or unreachable, you may receive one of the following syslog messages:

3. Verify connectivity between the peers over the FT VLAN. If a peer device is physically up but connectivity is the problem, you may end up with two active devices. If connectivity is lost due to the peer going down, reboot the peer to restore redundancy between the two devices.

4. Display heartbeat statistics, including missed heartbeats, by entering the following command:

If the query interface is configured, upon receiving a PEER_DOWN message from the heartbeat process, the ACE data plane attempts to ping the peer using the Query VLAN. If the ping fails, the standby transitions to the ACTIVE state. If the ping is successful, the standby transitions to the STANDBY_COLD state.
To recover from the STANDBY_COLD state, reboot the standby.
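The query-interface decision described above can be summarized in a small Python sketch. It only models the logic in this paragraph and assumes that a query interface is configured; the function name and the injected `ping_peer_over_query_vlan` callable are hypothetical stand-ins for the data plane's ping.

```python
from typing import Callable


def standby_state_after_peer_down(ping_peer_over_query_vlan: Callable[[], bool]) -> str:
    """Decide the standby's next state after a PEER_DOWN message.

    Assumes a query interface is configured. The callable returns True
    when the peer answers a ping sent over the query VLAN.
    """
    if ping_peer_over_query_vlan():
        # The peer is alive but unreachable over the FT VLAN, so the
        # standby must not take over: it moves to STANDBY_COLD.
        return "STANDBY_COLD"
    # The peer does not answer on the query VLAN either: take over.
    return "ACTIVE"
```

Passing the ping as a callable keeps the sketch testable without network access, for example `standby_state_after_peer_down(lambda: False)`.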

Each peer uses a VMAC that is dependent on the FT group number. If you are using multiple ACEs in the same chassis, be careful when using the same FT groups in more than one module.

6. Display the VMAC for an FT group by entering the following command:

FT Peer and Group Status Details

This section describes how to diagnose unexpected status conditions for the FT group and FT peer. This information may enable you to troubleshoot an issue directly or help you to provide additional information to your Cisco support representative.

FT Group Status Conditions

This section describes how to diagnose and troubleshoot unexpected status conditions applicable to the FT group status.

STANDBY_COLD

An FT group status of STANDBY_COLD may appear when:

Config sync fails (including incr-sync and bulk-sync), or

FT VLAN is down while the query interface is up

Config Sync Failure

In config sync failure, the peers are not correctly exchanging configuration information. This failure can be identified as follows:

Output of the show ft peer detail command shows that the peer state is "Compatible".

Entering show ft group detail shows that the FT group is in "Standby Cold" mode, and the running cfg sync status in the same output shows the reason for the failure. For an incr-sync failure, the output shows exactly which command resulted in an execution error on the standby. For a bulk-sync failure, the reason is "Error on Standby device when applying configuration file replicated from active".

To further investigate bulk-sync failure, perform these steps on the standby device:

For software version A2(2.0) and earlier and version A2(1.3) and earlier releases, from the Admin context, enter sh ft history cfg_cntlr and grep for "error:" to find any CLI commands that caused execution errors.

To work around a bulk sync failure, perform these steps to remove the CLI commands that triggered the error (as identified from the preceding analysis) and then retrigger the bulk sync operation, as follows:

Grep for the keywords MTS_OPC_REQ_CFG_DNLD_STATUS and MTS_OPC_CFG_DNLD_STATUS.

If one or both of the messages are missing, an error occurred in the synchronization exchange process.
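If the history output has been saved to a text file, the check for the two keywords can be scripted. The following Python sketch is an offline helper, not an ACE feature: the function name `missing_sync_messages` is hypothetical, and the sketch assumes only that the keywords appear as whole tokens somewhere in the captured text.

```python
import re

# The two message names to look for, as given in the text above.
REQUIRED_MESSAGES = (
    "MTS_OPC_REQ_CFG_DNLD_STATUS",
    "MTS_OPC_CFG_DNLD_STATUS",
)


def missing_sync_messages(history_text: str) -> list:
    """Return the required message names that do not appear in the text."""
    # Split the capture into identifier-like tokens so that one keyword
    # is never matched as part of a longer, different identifier.
    tokens = set(re.findall(r"[A-Za-z0-9_]+", history_text))
    return [name for name in REQUIRED_MESSAGES if name not in tokens]
```

An empty result means both messages were found; any name in the returned list points to the part of the synchronization exchange that did not complete.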

Note that while the standby is stuck in the STANDBY_CONFIG state, configuration mode is disabled on both the active and standby devices. The devices can remain in this condition for up to 4 hours, after which the timeout period expires.

FT Peer Status Conditions

This section describes how to diagnose and troubleshoot unexpected status conditions applicable to the FT peer.

PEER_DOWN

If the peer status shows PEER_DOWN:

Check whether IP addresses for the local and peer device are configured correctly on both.

Verify that pinging or telnetting to the peer IP address works. If ping fails, check whether the interface is up (sh int). If so, the interface VLAN is probably not allocated to the ACE module on the supervisor (which suggests a configuration issue on the supervisor).

Enter show arp to see if the FT peer IP address is resolved. (If the ARP entry is not resolved and ping or telnet also fails, it might be an encapsulation issue that requires Cisco support.)

Enter show conn on both sides to see if HA connections have been set up. If connections have not been set up, check the HA DP manager log (sh ft history ha_dp_mgr). Setup may have failed for various reasons. If this is the case, contact Cisco support.

Enter sh ft stats on both devices to see if heartbeats are being sent or received. If the "heartbeats missed" counter is incrementing, the heartbeat packets could be getting dropped. Run sh np 1 me-stats -sfp to see if heartbeat packets are being received and forwarded to X-Scale. If the following counter is not incrementing, provide the information to your Cisco support representative:

TL_ERROR

This state may occur when the telnet connection used to exchange configuration information between the peers cannot be established but heartbeat packets are exchanged successfully. To identify this issue:

Verify that heartbeats are flowing by checking the statistics, sh ft stats.

Attempt to connect by telnet or to ping the FT peer. The telnet connection attempt will likely fail.

Run show arp to see if the FT peer IP address can be resolved.

If show arp indicates that the address is not resolvable and the ping or telnet connect attempts fail, it is likely an encapsulation issue on the ACE.

FT_VLAN_DOWN

This state typically occurs when the FT VLAN goes down while the query interface is up. If the heartbeat exchange fails and the query interface is determined to be up based on an ICMP message check, the status is FT_VLAN_DOWN.

To verify, attempt to connect to the FT VLAN Peer IP address by ping or telnet.

If running show ft stats indicates that heartbeats are being missed, it is likely a physical connectivity issue, such as a physical port or cable failure.

FSM_PEER_STATE_ERROR

This indicates a Software Relationship Graph (SRG) version inconsistency between the peers. See the relationship graph table in the following section.

About WARM_COMPATIBLE and STANDBY_WARM

While peer modules in a redundant configuration are designed to operate with identical versions of the software, when you are applying a version upgrade or downgrade to the modules, it's possible for the peers to temporarily employ different software versions. The WARM_COMPATIBLE and STANDBY_WARM redundancy states help minimize the operational impact of CLI compatibility issues between the peers, and allow fail-overs to occur on a best-effort basis during such transitions.

When you upgrade or downgrade the ACE software in a redundant configuration in which the peers run different software versions, the STANDBY_WARM and WARM_COMPATIBLE states allow the configuration and state synchronization process between the peers to continue on a best-effort basis. The active ACE can therefore synchronize configuration and state information to the standby even though the standby may not recognize or understand some of the CLI commands or state information.

In the STANDBY_WARM state, as with the STANDBY_HOT state, configuration mode is disabled on the standby ACE and configuration and state synchronization continues. A failover from the active to the standby based on priorities and preempt can still occur while the standby is in the STANDBY_WARM state. However, while stateful failover is possible for a WARM standby, it is not guaranteed. In general, modules should be allowed to remain in this state only for a short period of time.

When redundancy peers run different software versions, the SRG compatibility field shown by the show ft peer detail command output is WARM_COMPATIBLE instead of COMPATIBLE. When the peer is in the WARM_COMPATIBLE state, the FT groups on standby go to the STANDBY_WARM state instead of the STANDBY_HOT state.

The following software version combinations indicate whether the SRG compatibility field displays WARM_COMPATIBLE (WC) or COMPATIBLE (C):