Any organization that relies on data networks for its core operations needs to ensure the continued availability of its network infrastructure, which includes LAN devices (switches, routers, firewalls, etc.), WAN links, Internet and cloud connections, and the supporting facilities (power, air conditioning, etc.). The network operator may achieve the desired level of LAN availability using several approaches. Availability of services obtained from service providers or carriers is often defined by an SLA (service level agreement) that states, among other things, the percentage of time during which the service is expected to be up and running (uptime). It is common for the uptime to range from 99.95% to 99.9999%, depending on the type of service. Five nines availability, 99.999% uptime or a little over 5 minutes of downtime a year, is considered the norm for telecommunication carriers [1], but as networks become vital to business continuity for many organizations, the need for five nines availability will spread.

The percentage of uptime alone does not provide sufficient information about the availability of the network. For instance, 99.99% availability translates to about 52 minutes of downtime/year. This outage can occur on a single occasion, in periods of four minutes every month, or one minute every week. Therefore, availability is better measured and controlled using the metrics MTBF (mean time between failures) and MTTR (mean time to repair). The two metrics are related to the percentage of availability by the equation shown below, but they provide a better estimate of how long the network is expected to remain operational and how fast it can recover from an unexpected failure.

\[ \text{Availability} = \frac{MTBF}{MTBF + MTTR} \qquad (1) \]
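As a quick check of Equation (1) and the downtime figures quoted above, the following Python sketch (with illustrative values only) converts an availability percentage into expected downtime per year and computes availability from assumed MTBF and MTTR values.

```python
# Illustrative sketch: relate availability, downtime, MTBF and MTTR (Equation 1).
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime in minutes/year for a given availability (0..1)."""
    return (1 - avail) * HOURS_PER_YEAR * 60

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Equation (1): availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(round(downtime_minutes_per_year(0.9999), 1))   # ~52.6 minutes/year
print(round(downtime_minutes_per_year(0.99999), 1))  # ~5.3 minutes/year
# Assumed example: one failure a year (MTBF ~ 8760 h) repaired in 1 hour
print(round(availability(8760, 1), 5))               # ~0.99989, about 99.99%
```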

Reliability

Reliability of a given system or component refers to the probability (likelihood) that it is operational at any given time. A network component that remains operational, on average, for 364 days/year is said to have a reliability of 99.73%. Equivalently, the component can be described as being out of service for 1 day/year. Components with high reliability are expected to have a long MTBF and a low frequency of failure.

Networks consist of many interdependent systems and components such as internetworking devices (routers and switches), facilities (power, air-conditioning, rack space, etc.), physical and data security, management and configuration controls, and others. If the relationship among these systems is viewed as a connected chain of functions, the network can be operational only if all of these systems are functioning properly. Then the reliability, R, of the network is the product of the reliabilities of all the individual systems.

\[ R_{network} = R_1 \times R_2 \times \cdots \times R_n = \prod_{i=1}^{n} R_i \qquad (2) \]

According to Equation (2), to end up with a network of 99.999% reliability, individual components and systems must have higher reliability. As the number of components in the network increases, the required individual reliability increases as well. The reliability of hardware components and of services acquired from providers (e.g. WAN connections) is often beyond the network operator’s control. Instead, the effort is concentrated on eliminating single points of failure (SPOF) in the network to reduce the chance that the failure of a single component takes down the entire network [2].
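To make this concrete, the following sketch (the component counts are arbitrary assumptions) uses Equation (2) to compute the per-component reliability required for a chain of identical, independent components to reach an overall target of 99.999%.

```python
# Illustrative sketch: per-component reliability implied by Equation (2)
# for a chain of n identical, independent components.
def required_component_reliability(target: float, n: int) -> float:
    """Solve R_component ** n == target for R_component."""
    return target ** (1.0 / n)

for n in (5, 10, 20):
    print(n, f"{required_component_reliability(0.99999, n):.7f}")
# 5  -> 0.9999980
# 10 -> 0.9999990
# 20 -> 0.9999995
```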

Redundancy

Redundancies in the network infrastructure eliminate SPOFs by adding components and other resources (e.g. memory, bandwidth or power) beyond those needed for the normal operation of the network. The goal is to make these resources available in the event of a loss of the main resources due to a failure. Complete duplication of components, known as 2N redundancy, is quite expensive considering the excess resources that remain unused. Alternatively, N+1 or N+M redundancy may be more cost effective by relaxing the requirement to duplicate every component and providing one or a few standby components instead.

Let’s consider a simple example to demonstrate the effect of redundancy. A router serves as an Internet gateway for an organization’s LAN. If the router fails unexpectedly, purchasing a replacement router and waiting for its arrival may extend the time to recover from the failure to days or even weeks. If the LAN availability is measured within a year (to comply with an SLA, for example), then it is estimated to be around 95% based on Equation (1). A vendor’s service contract may guarantee replacing the failed router within a predefined time (2 to 24 hours) for an annual fee. In this case, the repair time will include the delivery time plus the time to bring the router online and the availability approaches 99.95%. To reduce the repair time further to an hour or less and increase availability to 99.99%, a spare router can be kept in storage.
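The availability figures in this example can be reproduced with Equation (1). The sketch below assumes a single failure per year (an MTBF of roughly one year); the repair times are illustrative assumptions chosen to match the scenarios above.

```python
# Illustrative sketch: how repair time drives availability, assuming one failure/year.
MTBF_HOURS = 365 * 24  # roughly one failure per year

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

scenarios = {
    "order and wait for a replacement (~18 days)": 18 * 24,
    "vendor service contract (replacement in ~4 hours)": 4,
    "cold spare on site (online within ~1 hour)": 1,
}
for label, mttr_hours in scenarios.items():
    print(f"{label}: {availability(MTBF_HOURS, mttr_hours):.5f}")
# ~0.95299 (about 95%), ~0.99954 (about 99.95%), ~0.99989 (about 99.99%)
```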

The example is an oversimplification because a single instance of failure is not sufficient to measure the availability of a network; the MTTR is the statistical average of repair times from multiple failures occurring over a long period of time. The example shows, however, that redundancy is an attractive approach to improving network reliability. As the following equation shows, a highly available system of 99.9999% reliability can be constructed from two redundant components of 99.9% reliability or three redundant components of 99% reliability.

\[ R_{system} = 1 - (1 - R)^n \qquad (3) \]

where R is the reliability of each redundant component and n is the number of redundant components.
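A short numerical check of Equation (3): a system of n redundant components fails only if all of them fail at the same time, and the figures quoted above follow directly.

```python
# Illustrative sketch: reliability of n redundant (parallel) components, Equation (3).
def parallel_reliability(r: float, n: int) -> float:
    """The system works unless all n components fail simultaneously."""
    return 1 - (1 - r) ** n

print(f"{parallel_reliability(0.999, 2):.6f}")  # 0.999999: two components at 99.9%
print(f"{parallel_reliability(0.99, 3):.6f}")   # 0.999999: three components at 99%
```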

Failover switching mechanisms

Redundancy requires a mechanism to detect the failure and initiate the failover process. Successful failover to standby components requires timely detection of the fault and successful transfer of functions to the standby components. It is also required that any activated component and other backup resources are capable of performing the same functions and carrying the same workload as the failed component.

Network management systems (NMS) can detect faults and alert the network operators, who can replace the faulty components and restore configurations manually as in the previous example. However, as MTTR requirements approach a few minutes, timely response by a human operator becomes impossible and automatic detection and switching mechanisms are needed.

For our router example, this means that the backup router must be connected to the network, powered on, and ready to take over as soon as the main router fails. Protocols such as VRRP (Virtual Router Redundancy Protocol) or HSRP (Cisco’s counterpart) provide the necessary detection and activation mechanisms, but the network operator must ensure the synchronization of the configuration in both routers, manually or using automated tools. Considering the various delays associated with operating VRRP [3] and the convergence of other protocols such as BGP after the fault [4], the network may return to full operation within a few minutes, which is sufficient to push the availability to 99.999%.
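The sketch below is a deliberately simplified, toy illustration of the failure-detection idea behind gateway redundancy protocols such as VRRP: the backup tracks periodic advertisements from the master and promotes itself after a hold time expires. It is not an implementation of VRRP, and the timer values are assumptions chosen only for illustration.

```python
# Toy sketch of master/backup failover driven by missed advertisements.
# Not VRRP itself; the intervals and hold time are illustrative assumptions.
import time

ADVERT_INTERVAL = 1.0             # seconds between master advertisements
HOLD_TIME = 3 * ADVERT_INTERVAL   # backup takes over after this much silence

class BackupGateway:
    def __init__(self) -> None:
        self.last_advert = time.monotonic()
        self.active = False

    def on_advertisement(self) -> None:
        """Called whenever an advertisement from the master is received."""
        self.last_advert = time.monotonic()
        self.active = False  # master is alive; remain in the backup role

    def tick(self) -> None:
        """Periodic check: promote to active if the master has gone silent."""
        if not self.active and time.monotonic() - self.last_advert > HOLD_TIME:
            self.active = True
            self.take_over_virtual_ip()

    def take_over_virtual_ip(self) -> None:
        # In a real deployment this step would claim the shared (virtual) address.
        print("Backup promoted: now forwarding traffic for the virtual gateway")
```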

It is easy to overlook the fact that failover switching mechanisms can also be subject to failure. The NMS may fail to detect the fault or to send the alert message to the human operator. VRRP may not function properly because of misconfiguration or because of another fault in the network. These failures may go unnoticed during normal network operation because they do not affect network performance, but they can have severe consequences when faults occur.

Stateful recovery

Redundancy may reduce the network recovery time, but it does not recover the data in transit. If failover is to be unnoticeable by protocols and applications such as VoIP, then stateful recovery is required. In a stateless failover, all active connections going through our example router can time out and sessions are dropped. The backup router needs to establish all routing adjacencies and rebuild the routing, NAT, and ARP tables. Applications also need to re-establish connections when the backup router takes over.

If stateful failover is supported, the active router must continuously pass information to the backup router, such as device status, TCP connection states, NAT and ARP tables, etc. When the failover occurs, the same information is immediately available to the backup router, and the applications running on the network do not need to re-establish communication sessions. In addition to its advantages for certain applications (e.g. no dropped VoIP conversations), stateful recovery reduces the recovery time to seconds or a sub-second interval and improves the availability to the range of 99.9999%.

Complexity

Equation (3) justifies the use of redundancy as a means to build a highly available network infrastructure from less reliable, inexpensive components. However, Equation (2) suggests that the same result can be achieved by simplifying the architecture (fewer stages) and/or using highly available components in each stage. Estimating the reliability of the network by statistical means is not a trivial task because of the complexities that result from adding redundancies and inter-dependencies.

The multitude of protocols and levels of redundancy may cause multiple protocols to react to a failure and attempt recovery simultaneously. For instance, when a link fails, the network will initiate recovery by re-configuring the spanning tree and activating another link. A backup router may also attempt to take over the routing function when it fails to receive notifications from the main router as a result of the link loss. Such conditions are avoided by introducing artificial delays in reacting to these events, at the expense of a longer repair time.

Redundancy may provide a false sense of reliability and scalability. A network of many redundant components can experience multiple failures before it suffers performance degradation or an all-out outage. Once an outage occurs, the network operator is faced with the task of repairing multiple failures. Redundant components may suffer cascading failures if they are subjected to the same external events that caused the original failure, such as a capacity overload or the exploitation of a software bug. Also, performance degradation may occur when one component within a group of load-balancing components fails, if the remaining components do not have enough capacity to handle the entire workload.

Conclusions

Network availability can be improved considerably by implementing multiple types of redundancy, but achieving the coveted five nines requires paying attention to issues beyond simple redundancy. Decisions about levels of complexity and scalability can be made during the design stage by choosing between a few highly available components and a larger number of less reliable components. Failover mechanisms can themselves fail, yet redundancy in these mechanisms is not always available. The role of network management systems and practices is significant not only in detecting faults and critical conditions, such as exceeding safe capacity levels, but also in ensuring fast recovery. Standard network protocols have inherent limitations with respect to reacting to faults and recovering from them. These limitations have to be understood, and proprietary solutions can be sought, if available, to achieve the desired availability.

Network service providers (a.k.a. communication carriers) offer multiple services suitable for MAN (metropolitan area network) and WAN (wide area network) connectivity. This article introduces two of these services that offer unique benefits to organizations seeking this type of network connectivity.

Dark Fibre Service

Dark fibre refers to optical fibre cables that have been installed underground or over utility poles but are not yet being utilized. Since these cables offer a medium for light to transfer data, dark fibre refers to the ‘unlit’ strands of fibre. Network service providers offer dark fibre service by leasing these strands of fibre to organizations seeking a point-to-point connectivity solution that provides secure, high-speed data transport.

The organization that acquires Dark Fibre service can take full control of the connection, including the choice of the transmission technology (Ethernet, ATM, Fibre Channel or any other widely adopted protocol). Dark Fibre also provides a virtually unlimited amount of bandwidth, since a single fibre strand can carry up to 100 Gbps of data, and with the use of WDM (wavelength division multiplexing) technologies the amount of available bandwidth can be multiplied.

Dark Fibre circuits are not routed or tunnelled across the public Internet or any other infrastructure. And, as only one customer is utilizing the Dark Fibre circuit, it is a very secure way to connect mission-critical sites. Such service is often backed by a comprehensive SLA that guarantees service uptime and timely service restoration. The service provider may also allow the organization to select the route used to connect its sites, as well as to add physically diverse optical paths for added reliability.

Almost every aspect of performance can be controlled by the organization that uses the Dark Fibre service. Since no electronics exist on the light path, latency is limited only by the propagation delay of light in the fibre. Jitter and processing delays are also controlled by the organization by choosing the proper active network components at the fibre terminating ends.
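As a rough illustration of that propagation-delay floor, light travels through fibre at roughly the speed of light in vacuum divided by the refractive index of the glass (about 1.47 is an assumed typical value), which works out to roughly 5 microseconds per kilometre one way.

```python
# Illustrative sketch: one-way propagation delay over a dark fibre path.
C_VACUUM_KM_PER_S = 299_792.458   # speed of light in vacuum, km/s
REFRACTIVE_INDEX = 1.47           # assumed typical value for silica fibre

def propagation_delay_ms(distance_km: float) -> float:
    """One-way delay in milliseconds, ignoring equipment at the ends."""
    speed_km_per_s = C_VACUUM_KM_PER_S / REFRACTIVE_INDEX
    return distance_km / speed_km_per_s * 1000

print(round(propagation_delay_ms(1), 4))    # ~0.0049 ms, i.e. about 5 us per km
print(round(propagation_delay_ms(100), 2))  # ~0.49 ms over a 100 km metro path
```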

Dark Fibre service does not have to be limited to point-to-point connectivity. Multiple services can be acquired from the service provider to build any desired topology enabling the organization to create and manage its network without the need to invest in building its own fibre infrastructure.

Carrier Ethernet Service

Carrier Ethernet refers to extensions to the Ethernet technology that allow a service provider (Carrier) to offer customers point-to-point or multipoint-to-multipoint Ethernet Virtual Connections (EVCs) over the provider’s backbone network. Defined by standards from organizations such as the IEEE and the MEF, this technology allows an organization to connect two or more sites transparently over the Carrier’s network without being exposed to other customers on the network. The technology also ensures the Carrier’s network is isolated from the customer’s network and not affected by the common characteristics of Ethernet LANs, such as frame broadcasts and spanning-tree convergence.

The physical demarcation point between the customer and Carrier networks is known as the User Network Interface (UNI). The Carrier Ethernet specifications define services that carry data from UNI to UNI. The latest Carrier Ethernet specification, CE 2.0, includes four service types (E-Line, E-LAN, E-Tree, and E-Access). Furthermore, each service type can be either port based or VLAN based, resulting in a total of eight services. CE 2.0 dedicates the E-Access service type to interconnections between service providers at the External Network-Network Interface (ENNI).

Offered under different marketing names, Carrier Ethernet services enable the customer organization to connect multiple locations using point-to-point (Line), multipoint-to-multipoint (LAN) or point-to-multipoint (Tree) topologies with virtually no additional equipment. Thus, no capital costs are incurred to build the inter-site network connectivity. From the customer’s network point of view, the EVCs are transparent links, which can transport data traffic from multiple VLANs or even Layer 2 protocol frames.

Since the services utilize the carrier’s backbone, the organization’s topology may extend to any point in the carrier’s network regardless of physical distances, making the services ideal for wide geographical reach. Service parameters such as bandwidth, uptime, jitter, and latency are determined by the service options and the provider’s SLA, with the ability to upgrade to a higher performance level if the need arises.

Conclusion

We described briefly two network services commonly available from major carriers. An organization can acquire either one of these services to fulfill its WAN connectivity requirements and other business requirements. We conclude this article with a summary of the main features of each service in Table 1.

Table 1 Features of Dark Fibre and Carrier Ethernet services

Dark Fibre | Carrier Ethernet
Dedicated physical circuit (light path) | Virtual packet tunnel(s)
Protocol transparency | Ethernet port or VLAN transparency
Customer control of bandwidth, jitter, delay, and packet loss | Provider control of bandwidth, jitter, delay, and packet loss
A light path can be divided into multiple channels using WDM to increase bandwidth or carry different protocols | VLANs may be used to carry traffic from different applications or user groups
Nearly unlimited bandwidth | Limited by the provider backbone; bandwidth up to 10G is common
Latency is subject to propagation delay | Latency and jitter are subject to the provider’s network topology and level of service
Multipoint network topologies can be built using several light paths and customer active components | Multipoint network topology is available from the provider
Customer is responsible for managing active components | Carrier is responsible for managing the service up to the demarcation point

Software Defined Networking (SDN) is a new technology with a lot of potential and a healthy dose of hype. The main premise of SDN is moving the intelligence of the network from distributed network nodes to a centralized location to enable programmability and flexible configuration through software applications.

Software Defined Networks

Each router in today’s communication networks is capable of making decisions on its own regarding how to forward data packets to their final destination. The router gathers information about available paths to other networks and builds a view of the entire network topology independently of all other routers. This view allows the router to decide along which path a packet should be forwarded to reach its destination according to predetermined criteria. The distributed routing decision mechanism creates resiliency: if a path fails, the routers will find another path to deliver packets to their destination with minimal interruption. To provide this level of survivability, each router has to process every received data packet, decide where it should go, and forward it, all while communicating with other routers to maintain an up-to-date topology view.

SDN proposes to separate the packet forwarding function from the routing decision function in all network devices, not just routers, and move all control to a central device. This simplifies the design of network devices and reduces their cost. Removing all control from devices also means eliminating the distinction between switches, routers, and firewalls, as they can all be combined into one device that forwards (or drops) packets according to instructions received from a central controller. The result is simplified, inexpensive hardware and a significant reduction in energy consumption by eliminating the redundant computation needed for topology discovery.
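The toy sketch below illustrates this separation of concerns; it is not OpenFlow or any real controller API, and the rule format is an assumption made only for illustration. A “controller” function computes match-action rules centrally, and a “switch” function does nothing but look up packets against the table it was given.

```python
# Toy sketch of SDN's split between central control and simple forwarding.
# Not OpenFlow or any real controller API; the rule format is an assumption.
import ipaddress

def controller_build_flow_table():
    """The controller computes policy centrally and pushes match -> action rules."""
    return [
        (ipaddress.ip_network("10.1.0.0/16"), "forward:port2"),
        (ipaddress.ip_network("10.2.0.0/16"), "forward:port3"),
        (ipaddress.ip_network("0.0.0.0/0"), "drop"),  # default rule
    ]

def switch_forward(flow_table, dst_ip):
    """The 'switch' holds no routing intelligence: it only matches the
    destination against the installed rules and applies the action."""
    addr = ipaddress.ip_address(dst_ip)
    for network, action in flow_table:
        if addr in network:
            return action
    return "drop"

table = controller_build_flow_table()
print(switch_forward(table, "10.1.4.7"))   # forward:port2
print(switch_forward(table, "192.0.2.9"))  # drop
```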

Advances in general purpose microprocessors make it possible to use an off-the-shelf server as a central controller, thus eliminating any need for special hardware. The SDN controller offers many functions that are difficult to perform with current network management tools. For instance, routing and other configuration policies can be pushed from a central location and changed dynamically as needed.

In virtualized environments, where a virtual machine (VM) may move from one physical host to another, even across data centers, the network must be reconfigured accordingly to maintain VM connectivity without human intervention. Carriers and infrastructure providers may use the central configuration ability of SDN to create virtualized, independent networks to deliver services or rent directly to customers. Organizations, such as universities, may use the technology to run research experiments on the same hardware as production networks without affecting the latter. Some of these abilities exist today using various technologies and standards. SDN brings dramatic simplification to routing functions by centralizing the control. Also, allowing user applications to control routing means that network users can write their own routing protocols to handle data packets in the networks under their control.

SDN’s potential to turn networking equipment into commodity products, maximize network utilization, and meet the dynamic demands of cloud environments has attracted the support of cloud and network service providers such as Deutsche Telekom, Facebook, Google, Microsoft, Verizon, Yahoo, and NTT. Yet, there are many challenges to overcome for the technology to be widely adopted. Among these challenges, fault tolerance must be achieved by replicating the controller and maintaining synchronization among the replicas. Performance bottlenecks may arise in large networks when all decisions need to be taken by a single controller. Also, vendor support and standardization remain a major challenge at this early stage of the technology’s development.

SDN can be disruptive because of the fundamental way it changes network design, operation, configuration, and management. The ability to provide X-as-a-Service (XaaS) over virtualized networks may depend on it. However, its widespread adoption will require resolving outstanding issues in areas of performance, scalability, security, and interoperability.

Connectivity to the Internet through more than one upstream ISP (Internet Service Provider) is referred to as multi-homing (or dual-homing in the case of two ISPs). Multi-homing is generally used to increase the reliability of the Internet connection by reducing the reliance on a single provider and eliminating single points of failure in the IP network. Dual- or multi-homing can also be used to load-balance the Internet traffic and improve performance.

While there are some techniques that can be used to achieve dual-homing for special applications, the use of BGP routing to connect to multiple providers is the only effective technique for achieving general dual-homing for IPv4 networks. This report will focus exclusively on the use of BGP to connect to multiple providers.

BGP provides the ability for network traffic going to or coming from the Internet to be forwarded through any of the available ISPs. Unlike internal routing, BGP does not select routes based on the shortest path to the destination but on the number of ASs (Autonomous Systems) that represent the networks between source and destination. BGP may also be configured to implement other routing policies, for example, to prefer some routes over others.
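The sketch below is a heavily simplified illustration of that selection behaviour, not an implementation of BGP: among candidate routes for a destination, a higher local preference (a common policy knob) wins first, and a shorter AS path breaks the tie. The route data is made up for illustration.

```python
# Simplified sketch of BGP-style route selection (not real BGP):
# prefer the highest local preference, then the shortest AS path.
routes_to_destination = [
    {"via": "ISP-A", "as_path": [64510, 64620, 64700], "local_pref": 100},
    {"via": "ISP-B", "as_path": [64530, 64700],        "local_pref": 100},
]

def best_route(routes):
    # Rank by (local_pref, -AS-path length): policy first, then path length.
    return max(routes, key=lambda r: (r["local_pref"], -len(r["as_path"])))

print(best_route(routes_to_destination)["via"])  # ISP-B (shorter AS path)
```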

To improve the reliability of the Internet connection, an organization may choose to connect to two or more ISPs and split the Internet traffic equally among them. In the case where one provider’s link fails, outgoing traffic will automatically be routed to the remaining link(s). Other networks will be notified, through BGP updates, of the failed link, and incoming traffic will be routed through another ISP link as well. In this architecture, there must be enough capacity in the remaining active links to carry all the traffic from the failed link without causing congestion, which results in dropped packets and degradation of service. This means that in a dual-homing scenario, each link must be able to carry the organization’s entire Internet traffic volume.

The organization may find an advantage in connecting to two ISPs of unequal bandwidth. BGP may be configured to use one ISP as the main route where all outgoing and incoming traffic is directed. The backup ISP link of smaller bandwidth will be activated only in the case of the main ISP’s failure, and only selected traffic is routed through this link while the main link is being repaired. The advantage of this approach is reducing the expense of establishing a second full-capacity link.

Dual- or multi-homing can also be used to improve the performance of the Internet connectivity through the careful choice of ISPs and the proper configuration of BGP. For an organization that serves customers in diverse geographic locations, or has branches both locally and abroad, BGP peering with multiple ISPs can ensure that traffic to each geographic location goes through the best route. This configuration will reduce the latency experienced by the users in each geographic region.

To enable multi-homing using BGP, an organization must have its own public IP address block and a public Autonomous System (AS) number before connections to two or more separate ISPs are established. Generally, ISPs do not accept or announce IPv4 address blocks smaller than /24 (256 addresses) through BGP. The organization must receive its public ASN from its Regional Internet Registry (ARIN in North America). The IPv4 address block can be obtained directly from the regional registry or from one of the ISPs; in the latter case, the other ISPs must agree to announce the IPv4 block in BGP.

A key problem to avoid in multi-homing is creating two apparently independent links from completely different ISPs using a common infrastructure, such as a link or a router in the organization’s network. This actually forms a single point of failure and considerably reduces the reliability benefits of multi-homing. Another problem to watch for is connecting to two ISPs which in turn connect to a third, common ISP. The failure of the distant ISP may result in a simultaneous outage or degradation of service on both links.