Availability has always been an important design goal for network
architectures. As enterprise customers increasingly deploy mission-critical
web-based services, they require a deeper understanding of designing optimal
network availability solutions. There are several approaches to implementing
high-availability network solutions. This article provides an overview of the
various approaches and describes where each one makes sense to apply.

FIGURE 1 provides a high-level overview of a typical corporate
customer's network. This integrated network can be divided into the
following sectors to create a logical partitioning, which helps in
understanding the motivation behind the protocols that provide resiliency.

Access networkThis sector connects the enterprise's private
network to the service provider. This network is generally controlled by a
network service provider, which is usually called an Internet service provider
(ISP) because that provider provides connectivity to the Internet. The term
access network is used by carriers because this is the point where end
users and enterprises access the carrier networks. Depending on the
configuration, there may be a static route from the enterprise to the ISP, or
there may be an exterior routing protocol such as Border Gateway Protocol4
(BGP4). BGP4 is more resilient, if a particular route is down, an alternate
route may be available.

Enterprise networkThis network is the enterprise's internal
network, which is always partitioned and segregated from the external network
primarily for security reasons. This network is the focus of our paper. Several
methods provide network resiliency which we investigate further in this
article.

Corporate WANThese networks provide the connectivity over long
distances to the remote enterprise sites. There are varying degrees of
connectivity, which include campus networks, that interconnect enterprise
buildings within a certain distance, metropolitan area networks (MANs) that
interconnect enterprise offices located within one local providers MAN network,
and wide area networks (WANs) that connect enterprise branch offices that may
span thousands of miles. WAN connectivity generally requires the services of a
Tier 1 service provider. Modern WAN providers may provide an IP tunnel for that
enterprise to connect remote offices over a shared network.

In this article, we briefly discuss how Multiprotocol Label Switching (MPLS)
can be used for resiliency. MPLS has gained wide industry acceptance in core
networks.

The scope of this article is limited to interior routing protocols and
enterprise network technologies for availability purposes.

Physical Network Topology and Availability

One of the first items to consider for network availability is the physical
topology from an implementation perspective. In general, the topology has
a direct impact on the mean time between failures (MTBF) calculation. Serial
components reduce availability and parallel components increase
availability.
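
The effect of serial and parallel components can be illustrated with a short calculation. The following is a minimal sketch, not a method taken from this article: it assumes component failures are independent, uses the standard steady-state definition of availability (working time divided by working time plus repair time), and the MTBF and repair figures are hypothetical values chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: average uptime / (average uptime + average downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial(*parts: float) -> float:
    """Every part must work, so the availabilities multiply."""
    result = 1.0
    for part in parts:
        result *= part
    return result


def parallel(*parts: float) -> float:
    """At least one part must work: 1 minus the chance that all parts are down."""
    all_down = 1.0
    for part in parts:
        all_down *= 1.0 - part
    return 1.0 - all_down


# Hypothetical device: 10,000 hours between failures, 4 hours to repair.
device = availability(10_000, 4)

print(f"single device:   {device:.8f}")
print(f"two in series:   {serial(device, device):.8f}")
print(f"two in parallel: {parallel(device, device):.8f}")
```

Under these assumptions, chaining two devices in series always yields a lower figure than a single device, while placing them in parallel always yields a higher one, which is the behavior the paragraph above describes.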

There are three topology aspects impacting network availability:

Component failureThis aspect is the probability of the device
failing. It is measured using statistics averaging the amount of time the device
works divided by the average time the device works plus the failed time. This
value is called the MTBF. In calculating the MTBF, components that are connected
serially drastically reduce the MTBF, while components that are in parallel,
increase the MTBF.

In FIGURE 2, Design A shows a flat architecture, often seen with multilayer
chassis-based switches such as Extreme Networks Black Diamond®, Foundry
Networks BigIron®, or Cisco® switches. The switch can be partitioned into
VLANs, which isolates traffic between segments while providing a better
overall solution. In this approach, the availability is relatively high
because there are two parallel paths from the ingress to each server and only
two serial components that a packet must traverse to reach the target
server.

In Design B, the architecture provides the same functionality, but across
many small switches. From an availability perspective, this solution has a
relatively lower MTBF because there are more serial components that
a packet must traverse to reach a target server. Other disadvantages of
this approach include reduced manageability, scalability, and performance.
However, one can argue that this approach may offer increased security, which,
for some customers, outweighs all other factors. In Design B, multiple
switches must be compromised to control the network, whereas in Design A, only
one switch must be compromised to bring down the entire network.

System failureThis aspect captures failures that are caused by
external factors, such as a technician accidentally pulling out a cable. The
more components that are potential candidates for failure are directly
proportional to the complexity, and thus, result in a higher system failure
probability. So Design B, in FIGURE 2, has more components that can go
wrong, which contributes to the increased probability of failure.

Single points of failureThis aspect captures the number of devices
that can fail and still have the system functioning. Both approaches have no
single points of failure, and are equal in this regard. However, Design B is
somewhat more resilient because if a network interface card (NIC) fails, that
failure is isolated by the Layer 2 switch, and does not impact the rest of the
architecture. This issue is a trade-off to consider, where availability is
sacrificed for increased resiliency and isolation of failures.
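
To make the comparison between Design A and Design B more concrete, the following sketch applies a simplified model to both. Everything numeric here is an assumption for illustration: every device has the same availability, each design has exactly two independent parallel paths, a packet crosses two serial components in Design A and four in Design B (the article states only that Design B has more), and each component has the same small chance of an externally caused fault.

```python
def path_availability(device_availability: float, hops: int) -> float:
    """One path works only if every serial component on it works."""
    return device_availability ** hops


def design_availability(device_availability: float, hops: int, paths: int = 2) -> float:
    """The design works if at least one of its parallel paths works."""
    one_path = path_availability(device_availability, hops)
    return 1.0 - (1.0 - one_path) ** paths


def any_external_fault(fault_probability: float, components: int) -> float:
    """Chance that at least one component suffers an externally caused fault."""
    return 1.0 - (1.0 - fault_probability) ** components


DEVICE_AVAILABILITY = 0.999   # assumed, identical for every device
FAULT_PROBABILITY = 0.01      # assumed per-component chance of an external fault
DESIGNS = {"Design A": 2, "Design B": 4}   # serial hops per path (B is assumed)

for name, hops in DESIGNS.items():
    components = hops * 2     # two parallel paths in this simplified model
    avail = design_availability(DEVICE_AVAILABILITY, hops)
    risk = any_external_fault(FAULT_PROBABILITY, components)
    print(f"{name}: availability={avail:.8f}, external-fault risk={risk:.4f}")
```

In this model, Design B comes out with lower availability and a higher chance of an externally caused fault simply because it has more components. The model does not capture the failure isolation that Design B gains from its Layer 2 switches, which is the other side of the trade-off.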