The complexity inherent in today's campus networks necessitates a design process capable of separating solutions into basic elements. The Cisco hierarchical network model achieves this goal by dividing the network infrastructure into modular components. Each module represents a functional service layer within the campus hierarchy.

Designing High Availability in the Enterprise Campus

The Cisco hierarchical network model enables the design of high-availability modular topologies. Through the use of scalable building blocks, the network can support evolving business needs. The modular approach makes the network easier to scale, troubleshoot, and understand. It also promotes deterministic traffic patterns.

This section reviews design models, recommended practices, and methodologies for high availability in the Cisco Enterprise Campus Architecture infrastructure.

Enterprise Campus Infrastructure Review

The building blocks of the enterprise campus infrastructure are the access layer, the distribution layer, and the core layer. The principal features associated with each layer are hierarchical design and modularity. A hierarchical design avoids the need for a fully meshed network in which all nodes are interconnected. A modular design enables a component to be placed in service or taken out of service with little or no impact on the rest of the network. This methodology also facilitates troubleshooting, problem isolation, and network management.

Access Layer

The access layer is the point of entry into the network for end devices, as illustrated in Figure 2-1.

The campus access layer aggregates end users and provides uplinks to the distribution layer. The access layer can support multiple features:

High availability: At the access layer, high availability is supported through various hardware and software attributes. At the hardware level, system-level redundancy can be provided using redundant supervisor engines and redundant power supplies, and link-level redundancy can be provided using dual connections from access switches to redundant distribution layer switches. In software, default gateway redundancy is supported through the use of first-hop redundancy protocols (FHRP), such as the Hot Standby Router Protocol (HSRP), Virtual Router Redundancy Protocol (VRRP), and Gateway Load Balancing Protocol (GLBP).

NOTE

Cisco offers a unique high-availability feature called StackWise on its Catalyst 3750 Workgroup Switch and EtherSwitch Services Module. StackWise technology enables switches to be interconnected to create a single logical unit through the use of special stack cables. The cables create a bidirectional path that behaves as a switch fabric for all the interconnected switches. The stack is managed as a single unit, eliminating the need for spanning tree between stack members and streamlining administration to a single management session for all devices. For more information about StackWise, refer to Cisco.com.

NOTE

Cisco IOS Release 12.2(18)SXD extended high availability to the Catalyst 6500/7600 series switches. It added features such as Control Plane Policing (CoPP), Nonstop Forwarding (NSF), Stateful Switchover (SSO), and Gateway Load Balancing Protocol (GLBP), which are discussed later in this chapter.
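As a rough sketch of the FHRP-based default gateway redundancy described above, HSRP might be configured on one of a pair of distribution switches along these lines (the VLAN, addresses, group number, and priority are hypothetical values chosen for illustration):

```
! Hypothetical HSRP configuration on the primary distribution switch.
! The peer switch would use its own real address and a lower priority.
interface Vlan10
 ip address 10.1.10.2 255.255.255.0
 standby 10 ip 10.1.10.1      ! virtual gateway address shared with the peer
 standby 10 priority 110      ! higher priority makes this switch the active gateway
 standby 10 preempt           ! reclaim the active role after recovering from a failure
```

End devices use the virtual address 10.1.10.1 as their default gateway; if the active switch fails, the standby peer takes over the virtual address without any reconfiguration on the hosts.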

Security: The access layer provides services for additional security against unauthorized access to the network through the use of tools such as IEEE 802.1x, port security, DHCP snooping, Dynamic ARP Inspection (DAI), and IP Source Guard.
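The access layer security tools listed above might be combined on a user-facing port roughly as follows (the VLAN, interface, and limits are illustrative, and command syntax varies by platform and release):

```
! Hypothetical access-port hardening sketch
ip dhcp snooping
ip dhcp snooping vlan 10
ip arp inspection vlan 10            ! DAI validates ARP against the DHCP snooping binding table
!
interface GigabitEthernet1/0/1
 switchport mode access
 switchport access vlan 10
 switchport port-security             ! limit the MAC addresses allowed on the port
 switchport port-security maximum 2
 switchport port-security violation restrict
 ip verify source                     ! IP Source Guard drops traffic with spoofed source addresses
 authentication port-control auto     ! IEEE 802.1x (dot1x port-control auto on older releases)
```

Each feature addresses a different attack surface: port security limits MAC flooding, DHCP snooping blocks rogue DHCP servers, and DAI and IP Source Guard build on the snooping binding table to stop ARP and IP spoofing.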

Quality of service (QoS): The access layer allows prioritization of mission-critical network traffic using traffic classification and queuing as close to the ingress of the network as possible. It supports the use of the QoS trust boundary.
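A trust boundary at the access port might be sketched as follows on a Catalyst platform (the interface is hypothetical; `mls qos` syntax applies to older Catalyst software, while newer releases use MQC-based commands):

```
! Hypothetical QoS trust boundary: trust markings only from an IP phone,
! leaving a plain user port untrusted
mls qos
interface GigabitEthernet1/0/2
 mls qos trust device cisco-phone   ! extend trust only if a Cisco phone is detected via CDP
 mls qos trust dscp                 ! accept DSCP markings from the trusted device
```

Ports without a trust statement remain untrusted, so markings set by end-user PCs are ignored and can be re-marked at ingress.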

IP multicast: The access layer supports efficient network and bandwidth management using software features such as Internet Group Management Protocol (IGMP) snooping.
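IGMP snooping is enabled globally by default on most Catalyst switches; the following sketch shows it stated explicitly for a hypothetical VLAN, along with a verification command:

```
! IGMP snooping constrains multicast flooding to ports with interested receivers
ip igmp snooping
ip igmp snooping vlan 10
!
! Verify which ports joined which groups:
! show ip igmp snooping groups
```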

Distribution Layer

The distribution layer aggregates traffic from all nodes and uplinks from the access layer and provides policy-based connectivity, as illustrated in Figure 2-2.

Availability, load balancing, QoS, and provisioning are the important considerations at this layer. High availability is typically provided through dual paths from the distribution layer to the core and from the access layer to the distribution layer. Layer 3 equal-cost load sharing allows both uplinks from the distribution to the core layer to be used.

The distribution layer is the place where routing and packet manipulation are performed and can be a routing boundary between the access and core layers. The distribution layer represents a redistribution point between routing domains or the demarcation between static and dynamic routing protocols. The distribution layer performs tasks such as controlled routing and filtering to implement policy-based connectivity and QoS. To further improve routing protocol performance, the distribution layer summarizes routes from the access layer. For some networks, the distribution layer offers a default route to access layer routers and runs dynamic routing protocols when communicating with core routers.
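The summarization role described above can be sketched for EIGRP as follows (the autonomous system number, addresses, and interface are hypothetical; OSPF would summarize with `area range` on an Area Border Router instead):

```
! Hypothetical sketch: a distribution switch advertises one summary
! for all access-layer subnets (10.1.0.0/16) toward the core
interface TenGigabitEthernet1/1
 ip summary-address eigrp 100 10.1.0.0 255.255.0.0
!
router eigrp 100
 network 10.0.0.0
 no auto-summary
```

With the summary in place, a flapping access-layer subnet does not trigger route recomputation in the core, because the core only ever sees the stable summary prefix.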

The distribution layer uses a combination of Layer 2 and multilayer switching to segment workgroups and isolate network problems, preventing them from impacting the core layer. The distribution layer may be used to terminate VLANs from access layer switches. The distribution layer connects network services to the access layer and implements QoS, security, traffic loading, and routing policies. The distribution layer provides default gateway redundancy using an FHRP, such as HSRP, GLBP, or VRRP, to allow for the failure or removal of one of the distribution nodes without affecting endpoint connectivity to the default gateway.

NOTE

Cisco has introduced the Virtual Switching System (VSS), which can reduce or eliminate the need for FHRPs at the distribution layer. For more information about VSS, visit http://www.cisco.com/go/vss.

Core Layer

The core layer provides scalability, high availability, and fast convergence to the network, as illustrated in Figure 2-3. The core layer is the backbone for campus connectivity, and is the aggregation point for the other layers and modules in the Cisco Enterprise Campus Architecture. The core provides a high level of redundancy and can adapt to changes quickly. Core devices are most reliable when they can accommodate failures by rerouting traffic and can respond quickly to changes in the network topology. The core devices implement scalable protocols and technologies, alternate paths, and load balancing. The core layer helps in scalability during future growth.

The core is a high-speed, Layer 3 switching environment using hardware-accelerated services. For fast convergence around a link or node failure, the core uses redundant point-to-point Layer 3 interconnections because this design yields the fastest and most deterministic convergence results. The core layer is designed to avoid any packet manipulation, such as checking access lists and filtering, which would slow down the switching of packets.

Not all campus implementations require a campus core. The core and distribution layer functions can be combined at the distribution layer for a smaller campus.

Without a core layer, the distribution layer switches need to be fully meshed, as illustrated in Figure 2-4. This design can be difficult to scale, and increases the cabling requirements, because each new building distribution switch needs full-mesh connectivity to all the distribution switches. The routing complexity of a full-mesh design increases as new neighbors are added.

Note that combining distribution and core layer functionality (collapsed core) requires a great deal of port density on the distribution layer switches. An alternative solution is a Layer 2 core with discrete VLANs on each core switch. This scenario requires only two ports per distribution layer switch—regardless of the number of buildings (switch blocks)—and so you can avoid the expense of multilayer core switches.

In Figure 2-4, a distribution module in the second building of two interconnected switches requires four additional links for full-mesh connectivity to the first module. A third distribution module to support the third building would require 8 additional links to support connections to all the distribution switches, or a total of 12 links. A fourth module supporting the fourth building would require 12 new links, for a total of 24 links between the distribution switches. Four distribution modules impose eight Interior Gateway Protocol (IGP) neighbors on each distribution switch.

As a recommended practice, deploy a dedicated campus core layer to connect three or more buildings in the enterprise campus, or four or more pairs of building distribution switches in a very large campus. The campus core helps make scaling the network easier by addressing the requirements for the following:

Gigabit density

Data and voice integration

LAN, WAN, and MAN convergence

High-Availability Considerations

In the campus, high availability is concerned with minimizing link and node failures and optimizing recovery times to minimize convergence and downtime.

Implement Optimal Redundancy

The recommended design is redundant distribution layer switches and redundant connections to the core with a Layer 3 link between the distribution switches. Access switches should have redundant connections to redundant distribution switches, as illustrated in Figure 2-5.

As a recommended practice, the core and distribution layers are built with redundant switches and fully meshed links to provide maximum redundancy and optimal convergence. Access switches should have redundant connections to redundant distribution switches. The network bandwidth and capacity are engineered to withstand a switch or link failure, allowing the network to converge around most failure events in 120 to 200 ms. Open Shortest Path First (OSPF) and Enhanced Interior Gateway Routing Protocol (EIGRP) timer manipulation attempts to quickly redirect the flow of traffic away from a router that has experienced a failure toward an alternate path.
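As one illustration of such timer tuning, OSPF supports subsecond failure detection on point-to-point links through the `minimal` dead interval with a hello multiplier (the interface is hypothetical, and aggressive timers trade stability for speed, so values should be validated in each environment):

```
! Hypothetical OSPF fast-hello tuning on a point-to-point core uplink:
! dead interval of 1 second, with 4 hellos per second (250-ms hellos)
interface TenGigabitEthernet1/1
 ip ospf dead-interval minimal hello-multiplier 4
```

Both ends of the link must agree on the timers, or the adjacency will not form.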

In a fully redundant topology with tuned IGP timers, adding redundant supervisors with Cisco NSF and SSO may cause longer convergence times than single supervisors with tuned IGP timers. NSF attempts to maintain the flow of traffic through a router that has experienced a failure. NSF with SSO is designed to keep links up and maintain Layer 3 forwarding state during a routing convergence event. However, because an interaction occurs between the IGP timers and the NSF timers, the tuned IGP timers can cause NSF-aware neighbors to reset the neighbor relationships.

NOTE

Combining OSPF and EIGRP timer manipulation with Cisco NSF might not be the most common deployment environment. OSPF and EIGRP timer manipulation is designed to improve convergence time in a multiaccess network (where several IGP routing peers share a common broadcast media, such as Ethernet). The primary deployment scenario for Cisco NSF with SSO is in the enterprise network edge. Here, the data link layer generally consists of point-to-point links either to service providers or redundant Gigabit Ethernet point-to-point links to the campus infrastructure.

In nonredundant topologies, using Cisco NSF with SSO and redundant supervisors can provide significant resiliency improvements.

Provide Alternate Paths

The recommended distribution layer design is redundant distribution layer switches and redundant connections to the core with a Layer 3 link between the distribution switches, as illustrated in Figure 2-6.

Although dual distribution switches connected individually to separate core switches will reduce peer relationships and port counts in the core layer, this design does not provide sufficient redundancy. In the event of a link or core switch failure, traffic will be dropped.

An additional link providing an alternate path to a second core switch from each distribution switch offers redundancy to support a single link or node failure. A link between the two distribution switches is needed to support summarization of routing information from the distribution layer to the core.

Avoid Single Points of Failure

Cisco NSF with SSO and redundant supervisors has the most impact in the campus at the access layer. An access switch failure is a single point of failure that causes an outage for the end devices connected to it. You can reduce an access switch outage to one to three seconds, as shown in Figure 2-7, by using SSO in a Layer 2 environment or Cisco NSF with SSO in a Layer 3 environment.

Cisco NSF with SSO

SSO allows the standby route processor (RP) to take control of the device after a hardware or software fault on the active RP. SSO synchronizes the startup configuration, startup variables, and running configuration, as well as dynamic runtime data, including Layer 2 protocol states for trunks and ports; hardware Layer 2 and Layer 3 tables (MAC, Forwarding Information Base [FIB], and adjacency tables); and access control list (ACL) and QoS tables.

Cisco NSF is a Layer 3 function that works with SSO to minimize the amount of time a network is unavailable to its users following a switchover. The main objective of Cisco NSF is to continue forwarding IP packets following an RP switchover. Cisco NSF is supported by the EIGRP, OSPF, Intermediate System-to-Intermediate System (IS-IS), and Border Gateway Protocol (BGP) for routing. A router running these protocols can detect an internal switchover and take the necessary actions to continue forwarding network traffic using Cisco Express Forwarding while recovering route information from the peer devices. With Cisco NSF, peer networking devices continue to forward packets while route convergence completes and do not experience routing flaps.
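The combination described above might be enabled roughly as follows on a platform with redundant supervisors (the OSPF process number is hypothetical, and the exact NSF keywords vary by release and platform, with some supporting `nsf cisco` versus IETF graceful restart):

```
! Hypothetical sketch: enable SSO supervisor redundancy and NSF for OSPF
redundancy
 mode sso
!
router ospf 100
 nsf        ! allow forwarding to continue through an RP switchover
```

SSO handles the supervisor failover itself, while the `nsf` statement tells the routing protocol to preserve forwarding and rebuild its adjacencies gracefully rather than flapping them.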

Routing Protocol Requirements for Cisco NSF

Usually, when a router restarts, all its routing peers detect that routing adjacency went down and then came back up. This transition is called a routing flap, and the protocol state is not maintained. Routing flaps create routing instabilities, which are detrimental to overall network performance. Cisco NSF helps to suppress routing flaps.

Cisco NSF allows for the continued forwarding of data packets along known routes while the routing protocol information is being restored following a switchover. With Cisco NSF, peer Cisco NSF devices do not experience routing flaps because the interfaces remain up during a switchover and adjacencies are not reset. Data traffic is forwarded while the standby RP assumes control from the failed active RP during a switchover. User sessions established before the switchover are maintained.

The ability of the intelligent line cards to remain up through a switchover and to be kept current with the FIB on the active RP is crucial to Cisco NSF operation. While the control plane builds a new routing protocol database and restarts peering agreements, the data plane relies on pre-switchover forwarding-table synchronization to continue forwarding traffic. After the routing protocols have converged, Cisco Express Forwarding updates the FIB table and removes stale route entries, and then it updates the line cards with the refreshed FIB information.

NOTE

Transient routing loops or black holes may be introduced if the network topology changes before the FIB is updated.

The switchover must be completed before the Cisco NSF dead and hold timers expire; otherwise, the peers will reset the adjacency and reroute the traffic.

A device is said to be Cisco NSF aware if it runs Cisco NSF-compatible software. A device is said to be Cisco NSF capable if it has been configured to support Cisco NSF. A Cisco NSF-capable device rebuilds routing information from Cisco NSF-aware or Cisco NSF-capable neighbors.

A Cisco NSF-aware neighbor is needed so that Cisco NSF-capable systems can rebuild their databases and maintain their neighbor adjacencies across a switchover.

Following a switchover, the Cisco NSF-capable device requests that the Cisco NSF-aware neighbor devices send state information to help it rebuild its routing tables without a full adjacency reset.

The Cisco NSF protocol enhancements allow a Cisco NSF-capable router to signal neighboring Cisco NSF-aware devices. The signal asks that the neighbor relationship not be reset. As the Cisco NSF-capable router receives responses from and communicates with other routers on the network, it can begin to rebuild its neighbor list. After neighbor relationships are reestablished, the Cisco NSF-capable router begins to resynchronize its database with all of its Cisco NSF-aware neighbors.

Based on platform and Cisco IOS Software release, Cisco NSF with SSO support is available for many routing protocols:

EIGRP

OSPF

BGP

IS-IS

Cisco IOS Software Modularity Architecture

The Cisco Catalyst 6500 series with Cisco IOS Software Modularity supports high availability in the enterprise. Figure 2-8 illustrates the key elements and components of the Cisco Software Modularity Architecture.

When Cisco IOS Software patches are needed on systems without Cisco IOS Software Modularity, the new image must be loaded on the active and redundant supervisors, and the supervisor must be reloaded, or a switchover to the standby completed, for the patch to take effect.

The control plane functions (that manage routing protocol updates and management traffic) on the Catalyst 6500 series run on dedicated CPUs on the Multilayer Switch Feature Card (MSFC) complex. A completely separate data plane is responsible for traffic forwarding. When the hardware is programmed for nonstop operation, the data plane continues forwarding traffic even if there is a disruption in the control plane. The Catalyst 6500 series switches benefit from the more resilient control plane offered by Cisco IOS Software Modularity.

NOTE

Catalyst switch forwarding fabrics are broken down into three planes or functional areas, as follows:

Control plane: The control plane is a logical interface that connects physical chassis components and software functions into a unified logical unit. The control plane connects the system controller functionality on the RP to the service processor (SP) module used to control each card and module in the chassis.

Data plane: The data plane is where packet forwarding takes place. It is the path that packets take through the routing system from the physical layer interface module (PLIM) to the modular services card (MSC) to the switch fabric. On the 6500 series platforms, this would include the policy feature card (PFC) used for high-performance packet processing, and the distributed forwarding card (DFC), which provides local packet forwarding on select line cards.

Management plane: The management plane is where control/configuration of the platform takes place.

Cisco IOS Software Modularity also enables process-level, automated policy control by integrating Cisco IOS Embedded Event Manager (EEM), offloading time-consuming tasks to the network and accelerating the resolution of network issues. EEM is a combination of processes designed to monitor key system parameters such as CPU utilization, interface counters, Simple Network Management Protocol (SNMP) events, and syslog events. It acts on specific events or when threshold counters are exceeded.
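An EEM policy of the kind described above might be sketched as a simple applet (the applet name, SNMP object identifier, threshold, and message text are all illustrative; the OID shown is a commonly cited CPU-utilization object, but the correct object should be confirmed against the platform's MIB documentation):

```
! Hypothetical EEM applet: raise a syslog message when polled CPU
! utilization reaches or exceeds 80 percent
event manager applet HIGH-CPU
 event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.3.1 get-type exact entry-op ge entry-val 80 poll-interval 60
 action 1.0 syslog msg "CPU utilization at or above 80 percent"
```

More elaborate policies can collect `show` command output, send SNMP traps, or trigger corrective actions instead of simply logging.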

Example: Software Modularity Benefits

Cisco IOS Software Modularity on the Cisco Catalyst 6500 series provides these benefits:

Operational consistency: Cisco IOS Software Modularity does not change the operational point of view. Command-line interfaces (CLI) and management interfaces such as SNMP or syslog are the same as before. New EXEC and configuration mode commands, along with new show commands, have been added to support the new functionality.

Protected memory: Cisco IOS Software Modularity enables a memory architecture where processes make use of a protected address space. Each process and its associated subsystems live in an individual memory space. Using this model, memory corruption across process boundaries becomes nearly impossible.

Fault containment: The benefit of protected memory space is increased availability because problems occurring in one process cannot affect other parts of the system. For example, if a less-critical system process fails or is not operating as expected, critical functions required to maintain packet forwarding are not affected.

Process restartability: Building on the protected memory space and fault containment, the modular processes are now individually restartable. For test purposes or nonresponding processes, the process restart process-name command is provided to manually restart processes. Restarting a process allows fast recovery from transient errors without the need to disrupt forwarding. Integrated high-availability infrastructure constantly checks the state of processes and keeps track of how many times a process restarted in a defined time interval. If a process restart does not restore the system, the high-availability infrastructure will take more drastic actions, such as initiating a supervisor engine switchover or a system restart.

NOTE

Although a process restart can be initiated by the user, it should be done with caution.
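A manual restart of this kind might look as follows at the EXEC prompt (the process name shown is illustrative; the actual names available on a given system can be listed with commands such as show processes cpu detailed on Software Modularity images):

```
! Hypothetical EXEC session on a system running Cisco IOS Software Modularity
Router# process restart tcp.proc
```

Because forwarding continues in the data plane, a restart of a single control plane process is far less disruptive than a full system reload, but as the note above states, it should still be done with caution.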

Modularized processes: Several control plane functions have been modularized to cover the most commonly used features. Examples of modular processes include but are not limited to these:

Routing process

Internet daemon

Raw IP processing

TCP process

User Datagram Protocol (UDP) process

Cisco Discovery Protocol process

Syslog daemon

Any EEM components

File systems

Media drivers

Install manager

Subsystem In-Service Software Upgrade (ISSU): Cisco IOS Software Modularity allows selective system maintenance during runtime through individual patches. By providing versioning and patch-management capabilities, Cisco IOS Software Modularity allows patches to be downloaded, verified, installed, and activated without the need to restart the system. Because data plane packet forwarding is not affected during the patch process, the network operator now has the flexibility to introduce software changes at any time through ISSU. A patch affects only the software components associated with the update.