In this chapter we will look at causes of fiber cable failures, identify the impacts of outage, and relate these to the goals
for restoration speed. We then provide an overview of the different basic principles and techniques for network survivability.
This gives a first appreciation of the basic span-, path-, and p-cycle-based approaches to survivability, which we treat in depth in later chapters. The survey of basic mesh-oriented schemes in this chapter
also lets the reader see these schemes in contrast to ring-based schemes that are 100% or more redundant, and which we do
not consider further in the book. The chapter concludes with a look at the quantitative measures of network survivability,
and the relationships between availability, reliability and survivability.

3.1 Transport Network Failures and Their Impacts

3.1.1 Causes of Failure

It is reasonable to ask why fiber optic cables get cut at all, given the widespread appreciation of how important it is to
physically protect such cables. Isn't it enough to just bury the cables suitably deep or put them in conduits and stress that
everyone should be careful when digging? In practice what seems so simple is actually not. Despite best efforts at physical
protection, it seems to be one of those large-scale statistical certainties that a fairly high rate of cable cuts is inevitable.
This is not unique to our industry. Philosophically, the problem of fiber cable cuts is similar to other problems of operating
many large-scale systems. To a lay person it may seem baffling when planes crash, or nuclear reactors fail, or water sources
are contaminated, and so on, while experts in the respective technical communities are sometimes amazed it doesn't happen
more often! The insider knows of so many things that can go wrong [Vau96]. Indeed some have gone as far as to say that the most fundamental engineering activity is the study of why things fail [Ada91] [Petr85].

And so it is with today's widespread fiber networks: it doesn't matter how advanced the optical technology is, it is in a
cable. When you deploy 100,000 miles of any kind of cable, even with the best physical protection measures, it will be damaged. And with surprising frequency. One estimate is that any given mile of cable will operate about 228 years before
it is damaged (4.39 cuts/year/1000 sheath-miles) [ToNe94]. At first that sounds reassuring, but on 100,000 installed route miles it implies more than one cut per day on average. To the extent that construction activities correlate with the working week, such failures may also tend to cluster,
producing some single days over the course of a year in which perhaps two or three cuts occur. In 2002 the FCC also published
findings that metro networks annually experience 13 cuts for every 1000 miles of fiber, and long haul networks experience
3 cuts for 1000 miles of fiber [VePo02]. Even the lower rate for long haul implies a cable cut every four days on average in a not atypical network with 30,000
route-miles of fiber. These frequencies of cable cut events are hundreds to thousands of times higher than corresponding reports
of transport layer node failures, which helps explain why network survivability design is primarily focused on recovery from
span or link failures arising from cable cuts.
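
The arithmetic behind these frequencies is easy to check. The following is a minimal sketch in Python that uses only the rates quoted above; the function names are illustrative and are not taken from the cited reports.

```python
# Back-of-envelope check of the cable-cut frequencies quoted in the text.
# The rates are those cited above; the helper names are illustrative only.

DAYS_PER_YEAR = 365.0


def years_per_cut_per_mile(cuts_per_year_per_1000_miles: float) -> float:
    """Mean years of operation before any one given mile of cable is cut."""
    return 1000.0 / cuts_per_year_per_1000_miles


def cuts_per_year(cuts_per_year_per_1000_miles: float, route_miles: float) -> float:
    """Expected number of cuts per year on a network of the given size."""
    return cuts_per_year_per_1000_miles * route_miles / 1000.0


def mean_days_between_cuts(cuts_per_year_per_1000_miles: float, route_miles: float) -> float:
    """Average spacing between cuts, in days."""
    return DAYS_PER_YEAR / cuts_per_year(cuts_per_year_per_1000_miles, route_miles)


if __name__ == "__main__":
    # 4.39 cuts/year/1000 sheath-miles: about 228 years per mile
    print(f"{years_per_cut_per_mile(4.39):.0f} years per mile")
    # ... but on 100,000 route-miles that is more than one cut per day
    print(f"{cuts_per_year(4.39, 100_000) / DAYS_PER_YEAR:.2f} cuts per day")
    # Long-haul rate of 3 cuts/year/1000 miles on 30,000 route-miles
    print(f"{mean_days_between_cuts(3.0, 30_000):.1f} days between cuts")
```

The same helpers can be reused with the metro rate of 13 cuts per 1000 miles to see how quickly cut frequency scales with installed mileage.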

3.1.2 Crawford's Study

After several serious cable-related network outages in the 1990s, a comprehensive survey on the frequency and causes of fiber
optic cable failures was commissioned by regulatory bodies in the United States [Craw93]. Figure 3-1 presents data from that report on the causes of fiber failure. As the euphemism of a "backhoe fade" suggests, almost 60%
of all cuts were caused by cable dig-ups. Two-thirds of those occurred even though the contractor had notified the facility
owner before digging. Vehicle damage was most often suffered by aerial cables from collision with poles, but also from tall
vehicles snagging the cables directly or colliding with highway overpasses where cable ducts are present. Human error is typified
by a craftsperson cutting the wrong cables during maintenance or during copper cable salvage activities ("copper mining")
in a manhole. Power line damage refers to metallic contact of the strain-bearing "messenger cable" in aerial installations
with power lines. The resulting i²R (heat dissipation) melts the fiber cable. Rodents (mice, rats, gophers, beavers) seem to be fond of the taste and texture
of the cable jackets and gnaw on them in both aerial and underground installations. The resulting cable failures are usually
partial (not all fibers are severed). It seems reasonable that by partial gnawing at cable sheaths, rodents must also compromise
a number of cables which then ultimately fail at a later time. Sabotage failures were typically the result of deliberate actions
by disgruntled employees, or vandalism when facility huts or enclosures are broken into. Today, terrorist attacks on fiber
optic cables must also be considered.

Floods caused failures by taking out bridge crossings or by water permeation of cables resulting in optical loss increases
in the fiber from hydrogen infiltration. Excavation damage reports are distinct from dig-ups in that these were cases of failure
due to rockfalls and heavy vehicle bearing loads associated with excavation activities. Treefalls were not a large contributor
in this U.S. survey but in some areas where ice storms are more seasonal, tree falls and ice loads can be a major hazard to
aerial cables. Conduits are expensive to install, and in much of the country cable burial is also a major capital expense.
In parts of Canada (notably the Canadian shield), trenching can be almost infeasible as bedrock lies right at the surface.
Consequently, much fiber cable mileage remains on aerial pole-lines and is subject to weather-related hazards such as ice,
tree falls, and lightning strikes.

Figure 3-2 shows the statistics of the related service outage and physical cable repair times. Physical repair took a mean time of 14 hours but had a high variance, with some individual repair times reaching 100 hours. The average
service outage time over the 160 reported cable cuts was 5.2 hours. As far as can be determined from the report, all 160 of the cable
failures reported were single-failure events. This is quite relevant to the applicability and economic feasibility of later
methods in the book for optimal spare capacity design.

Figure 3-2. Histogram of service restoration and cable repair times (data from [Craw93]).

In 1997 another interesting report came out on the causes of failure in the overall public switched telephone network (PSTN) [Kuhn97]. Its data on cable-related outages due to component flaws, acts of nature, cable cutting, cable maintenance errors and power
supply failures affecting transmission again add up to form the single largest source of outages. Interestingly, Kuhn concludes
that human intervention and automatic rerouting in the call-handling switches were the key factors in the system's overall
reliability. This is quite relevant as we aim in this book to reduce the dependence on human intervention wherever possible
in real-time and effectively to achieve the adaptive routing benefits of the PSTN down in the transport layer itself. Also of interest to readers is [Zorp89] which includes details of the famous Hinsdale central-office fire from which many lessons were learned and subsequently
applied to physical node protection.

3.1.3 Effects of Outage Duration

There are a variety of user impacts from fiber optic cable failures. Revenue loss and business disruption are often first in
mind. As mentioned in the introduction, the Gartner research group attributes up to $500 million in business losses to network
failures by the year 2004. Direct voice-calling revenue loss from failure of major trunk groups is frequently quoted at $100,000/minute
or more. But other revenue losses may arise from default on service level agreements (SLAs) for private line or virtual network
services, or even bankruptcies of businesses that are critically dependent on 1-800 or web-page services. Many businesses are
completely dependent on web-based transaction systems or 1-800 service for their order intakes and there are reports of bankruptcies
from an hour or more of outage. (Such businesses run with a very finely balanced cash-flow.) Growing web-based e-commerce
transactions only increase this exposure. Protection of 1-800 services was one of the first economically warranted applications
for centralized automated mesh restoration with AT&T's FASTAR system [ChDo91]. It was the first time 1-800 services could be assured of five minute restoration times. More recently one can easily imagine
the direct revenue loss and impact on the reputation of "dot-com" businesses if there is any outage of more than a few minutes.

When the outage times are in the region of a few seconds or below, it is not revenue and business disruptions that are of
primary concern, but harmful complications from a number of network dynamic effects that have to be considered. A study by
Sosnosky provides the most often cited summary of effects, based on a detailed technical analysis of various services and
signal types [Sosn94]. Table 3-1 is a summary of these effects, based on Sosnosky, with some updating to include effects on Internet protocols.

The first and most desirable goal is to keep any interruption of carrier signal flows to 50 ms or less. 50 ms is the characteristic
specification for dedicated 1+1 automatic protection switching (APS) systems. An interruption of 50 ms or less in a transmission signal causes only a "hit" that is perceived by higher layers
as a transmission error. At most one or two error-seconds are logged on performance monitoring equipment and data packet units
for most over-riding TCP/IP sessions will not be affected at all. No alarms are activated in higher layers. The effect is a "click" on voice, a streak
on a fax machine, possibly several lost frames in video, and on data services it may cause a packet retransmission but is
well within the capabilities of data protocols including TCP/IP to handle. An important debate exists in the industry surrounding 50 ms as a requirement for automated restoration schemes.
One view holds that the target for any restoration scheme must be 50 ms. Section 3.1.4 is devoted to a further discussion of this particular issue.

Table 3-1. Classification of Outage Time Impacts

Target Range | Duration | Main Effects / Characteristics
Protection Switching | < 50 ms | No outage logged: system reframes, service "hit", 1 or 2 error-seconds (traditional performance spec for APS systems), TCP recovers after one errored frame, no TCP fallback. Most TCP sessions see no impact at all.

As one moves up from 50 ms outage time the chance that a given TCP/IP session loses a packet increases but remains well within the capability for ACK/NACK retransmission to recover without a
backoff in the transmission rate and window size. Between 150-200 ms when a DS-1 level reframe time is added, there is a possibility
(<5% at 200 ms) of exceeding the "carrier group alarm" (CGA) times of some older channel bank equipment, at which time the associated switching machine will busy out the affected trunks, disconnecting any calls in progress. With DS1 interfaces on modern digital switches, however, this does not occur until 2.5 +/- 0.5 seconds. Some other minor network dynamics begin in the range from 150-200 ms. In Switched Multi-megabit Data Service (SMDS), cell rerouting processes would usually begin by 200 milliseconds. The recovery of any lost data is, however, still
handled through higher layer data protocols. The SS7 common channel signaling (CCS) network (which controls circuit-switched connection establishment) may also react to an outage of 100 ms at the SONET level (~150 ms after reframing at the DS-1 level). The CCS network uses DS-0 circuits for its signaling links and will initiate a switchover to its designated backup links if no DS-0
level synch flags are seen for 146 ms. Calls in the process of being set up at the time may be abandoned. Some video codecs
using high compression techniques can also require a reframing process in response to a 100 ms outage that can be quite noticeable
to users.

In the time frame from 200 ms to two seconds no new effects on switched voiceband services emerge other than those due to
the extension of the actual signal lapse period itself. By two seconds the roughly 12% of DS0 circuits that are carried on
older analog channel banks (at the time of Sosnosky's study) will definitely be disconnected. In the range from two to 10
seconds the effects become far more serious and visible to users. A quantum change arises in terms of the service-level impact
in that virtually all voice connections and data sessions are disconnected. This is the first abrupt perception by users and
service level applications of outage as opposed to a momentary hit or retransmission-related throughput drop. At 2.5 +/- 0.5 seconds, digital switches react to
the failure states on their transmission interfaces and begin "trunk conditioning"; DS-0, (n)xDS-0 (i.e., "fractional T1"), DS-1 and private line disconnects ("call-dropping") occur. Voiceband data modems typically
also time out two to three seconds after detecting a loss of carrier. Session dependent applications such as file transfer
using IBM SNA or TCP/IP may begin timing out in this region, although time-outs are user programmable up to higher values (up to 255 seconds for
SNA). X.25 packet network time-outs are typically from one to 30 seconds with a suggested time of 5 seconds. When these timers
expire, disconnection of all virtual calls on those links occurs. B-ISDN ATM connections typically have alarm thresholds of about five seconds.

In contrast to the 50 ms view for restoration requirements, this region of 1 to 2 second restoration is the main objective
that is accepted by many as the most reasonable target, based largely on the cost associated with 1+1 capacity duplication
to meet 50 ms, and in recognition that up until about 1 or 2 seconds, there really is very little effect on services. However,
two seconds is really the "last chance" to stop serious network and service implications from arising. It is interesting that
some simple experiments can dramatically illustrate the network dynamics involved in comparing restoration above and below
a 2 second target (whereas there really are no such abrupt or quantum changes in effects anywhere from zero up to the 2
second call-dropping threshold).

Figure 3-3 shows results from a simple teletraffic simulation of a group of 50 servers. The servers can be considered circuits in a
trunk group or processors serving web pages. The result shown is based on telephony traffic with a 3 minute holding time.
The 50 servers are initially in statistical equilibrium with their offered load at 1% connection blocking. If a call request
is blocked, the offering source reattempts according to a uniform random distribution of delay over the 30 seconds following
the blocked attempt. Figure 3-3(a) shows the instantaneous connection attempt rate when the 50-trunk group is severed and all calls are dropped, followed
by restoration to only an 80% level. Figure 3-3(b) shows the corresponding dynamics of the same total failure, also followed by only 80% restoral, but with restoration occurring before the onset of call dropping. Figure 3-3(c) shows how the overall transient effect is yet further mitigated by adaptive routing in the circuit-switched service layer
to further reduce ongoing congestion. This dramatically illustrates how beneficial it is in general to achieve a restoration
response before connection or session dropping, even if the final restoral level is not 100%.
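
The starting point of such an experiment, 50 servers in equilibrium at 1% blocking, can be reproduced with the classical Erlang B formula. The sketch below is not the simulation behind Figure 3-3. It assumes the standard Erlang B (blocked-calls-cleared) model, which ignores the reattempt behavior described above, and the function names are illustrative. It finds the offered load that gives 1% blocking on 50 servers and then evaluates the blocking that the same load would see on the 40 servers remaining after 80% restoral.

```python
# Steady-state sizing for the 50-server example, assuming the Erlang B
# (blocked-calls-cleared) model. Reattempts are NOT modeled here.


def erlang_b(servers: int, offered_load: float) -> float:
    """Blocking probability for `servers` circuits offered `offered_load` erlangs.

    Uses the numerically stable recursion
    B(0) = 1, B(n) = A*B(n-1) / (n + A*B(n-1)).
    """
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def load_for_blocking(servers: int, target_blocking: float) -> float:
    """Bisect for the offered load (erlangs) that yields the target blocking."""
    low, high = 0.0, 2.0 * servers  # assumed search bracket
    for _ in range(60):
        mid = (low + high) / 2.0
        if erlang_b(servers, mid) < target_blocking:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    load = load_for_blocking(50, 0.01)
    holding_time_min = 3.0  # holding time quoted in the text
    print(f"offered load at 1% blocking: {load:.1f} erlangs")
    print(f"implied attempt rate: {load / holding_time_min:.1f} calls/minute")
    print(f"blocking on 40 servers at same load: {erlang_b(40, load):.3f}")
```

Even before any reattempt dynamics are considered, the steady-state blocking on the reduced group is several times the original 1%, which is why the transient after a drop-and-restore event is so much more severe than the restoral level alone would suggest.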

The seriousness of an outage that extends beyond several seconds, into the tens of seconds, grows progressively worse: IP networks begin discovering "Hello" protocol failures and attempt to reconverge their routing tables via LSA flooding. In
circuit-switched service layers, massive connection and session dropping starts occurring and goes on for the next several
minutes. Even if restoration occurred at, say, 10 seconds, there would by then be millions of users and applications that
begin a semi-synchronized process of attempting to re-establish their connections. There are numerous reports of large digital
switching systems suffering software crashes and cold reboots in the time frame of 10 seconds to a few minutes following a
cable cut, due to such effects. The cut itself might not have affected the basic switch stability, but the mass re-attempt
overwhelms and crashes the switch. Similar dynamics apply for large IP routers forwarding packets for millions of TCP/IP sessions that similarly undergo an unwittingly synchronized TCP/IP backoff and restart. (TCP/IP involves a rate backoff algorithm called "slow start" for response to congestion. Once it senses rising throughput the transmit
rate and window size are multiplied in a run up to the maximum throughput. Self-synchronized dynamics among disparate groups
of TCP/IP sessions can therefore occur following the failure or during the time routing tables are being updated.) The same kind of
dynamic hazards can be expected in MPLS-based networks as label edge routers (LERs) get busy (following OSPF-TE resynchronization) with CR-LDP signaling for re-establishment of possibly thousands of LSPs simultaneously through the core network of LSRs. Protocols such
as CR-LDP for MPLS (or GMPLS) path establishment were not intended for, nor have they ever been tested in an environment of mass simultaneous signaling
attempts for new path establishment. The overall result is highly unpredictable transient signaling congestion and capacity
seizure and contention dynamics. If failure effects are allowed to even enter this domain we are ripe for "no dial tone" and
Internet "brown outs" as switch or router O/S software succumbs to overwhelming real-time processing loads. Such congestion
effects are also known to propagate widely in both the telephone network and Internet. Neighboring switches cannot complete
calls to the affected destination, blocking calls coming into themselves, and so on. If anything, however, the Internet is
even more vulnerable than the circuit switched layer to virtual collapse in these circumstances.
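
The self-synchronization effect can be illustrated with a toy model. The sketch below assumes a deliberately simplified form of slow start: each session's window starts at one segment, doubles every round trip, and is capped at a maximum. The cap of 64 segments, the function names, and the session counts are all assumptions made for illustration, not measurements. It compares the aggregate load when all sessions restart on the same round trip with the load when the same sessions restart on different round trips.

```python
# Toy model of aggregate load from many sessions in slow start.
# Simplifying assumptions: window starts at 1 segment, doubles each
# round trip, and is capped at max_window. No loss or timeouts modeled.
from typing import List


def slow_start_window(rtts_since_restart: int, max_window: int = 64) -> int:
    """Congestion window (segments) after a given number of round trips."""
    return min(2 ** rtts_since_restart, max_window)


def aggregate_load(restart_ticks: List[int], horizon: int,
                   max_window: int = 64) -> List[int]:
    """Total window across all sessions at each round-trip tick.

    restart_ticks[i] is the tick at which session i restarts; a session
    contributes nothing before its restart tick.
    """
    totals = []
    for tick in range(horizon):
        total = 0
        for start in restart_ticks:
            if tick >= start:
                total += slow_start_window(tick - start, max_window)
        totals.append(total)
    return totals


if __name__ == "__main__":
    sessions = 1000
    synchronized = aggregate_load([0] * sessions, horizon=10)
    staggered = aggregate_load([i % 10 for i in range(sessions)], horizon=10)
    print("synchronized:", synchronized)
    print("staggered:   ", staggered)
```

In the synchronized case every session doubles its window on the same tick, so the aggregate load doubles in lockstep, whereas staggered restarts spread the ramp over time.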

Beyond 30 minutes the outage effects are generally considered so severe that the outage is reportable to regulatory agencies and the
general societal and business impacts are considered to be of major significance. If communications to or between police,
ambulance, medical, flight traffic control, industrial process control or many other such crucial services break down for
this long it becomes a matter of health and safety, not just business impact. In the United States any outage affecting 30,000
or more users for over 30 minutes is reportable to the FCC.
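
The thresholds discussed in this section can be collected into a small lookup. The sketch below is one reading of the narrative above, not a reproduction of Table 3-1: the band boundaries (50 ms, 200 ms, 2 s, 10 s, 30 minutes) and the one-line summaries are taken from the text, while the function and variable names are hypothetical and the treatment of exact boundary values is an assumption.

```python
# Rough classification of outage duration, based on the thresholds
# discussed in this section. Boundary handling (<=) is an assumption.

# (upper bound in seconds, summary of the effects described in the text)
OUTAGE_BANDS = [
    (0.050, "hit: transmission errors only, no alarms in higher layers"),
    (0.200, "minor dynamics: possible CGA on older channel banks, "
            "SS7 link switchover, video codec reframing"),
    (2.0, "extended signal lapse: older analog channel banks disconnect"),
    (10.0, "call dropping: trunk conditioning, voice and data sessions "
           "disconnected"),
    (30 * 60.0, "mass reattempts: routing reconvergence, signaling "
                "congestion, risk of switch or router overload"),
]
BEYOND_ALL_BANDS = "major outage: societal impact, reportable to regulators"


def classify_outage(duration_s: float) -> str:
    """Return the summary for the first band whose upper bound covers duration_s."""
    for upper_bound_s, summary in OUTAGE_BANDS:
        if duration_s <= upper_bound_s:
            return summary
    return BEYOND_ALL_BANDS


def fcc_reportable(users_affected: int, duration_s: float) -> bool:
    """U.S. rule as stated in the text: 30,000 or more users for over 30 minutes."""
    return users_affected >= 30_000 and duration_s > 30 * 60


if __name__ == "__main__":
    for seconds in (0.03, 0.15, 1.0, 5.0, 60.0, 3600.0):
        print(f"{seconds:>8.2f} s -> {classify_outage(seconds)}")
    print(fcc_reportable(45_000, 45 * 60))
```

A lookup like this makes the key point of the section visible at a glance: the effects change little between 50 ms and 2 seconds, and then change abruptly.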

3.1.4 Is 50 ms Restoration Necessary?

Any newcomer to the field of network survivability will inevitably encounter the "50 ms debate." It is well to be aware that
this is a topic that has already been argued without resolution for over a decade and will probably continue. The debate persists
because it is not entirely based on technical considerations which could resolve it, but has roots in historical practices
and past capabilities and has been a tool of certain marketing strategies.

History of the 50 ms Figure

The 50 ms figure historically originated from the specifications of APS subsystems in early digital transmission systems and was not actually based on any particular service requirement. Early
digital transmission systems embodied 1:N APS that required typically about 20 ms for fault detection, 10 ms for signaling, and 10 ms for operation of the tail-end transfer
relay, so the specification for APS switching times was reasonably set at 50 ms, allowing a 10 ms margin. Early generations of DS1 channel banks (1970s era)
also had a Carrier Group Alarm (CGA) threshold of about 230 ms. The CGA is a time threshold for persistence of any alarm state on the transmission line side (such as loss of signal or frame synch
loss) after which all trunk channels would be busied out. The 230 ms CGA threshold reinforced the need for 50 ms APS switches at the DS3 transmission level to allow for worst-case reframe times all the way down the DS3, DS2, DS1 hierarchy
with suitable margin against the 230 ms CGA deadline. It has long since been realized, however, that a 230 ms CGA time was far too short. Many minor line interruptions would trigger an associated switching machine into mass call-dropping
because of spurious CGA activations. The persistence time before call dropping was raised to 2.5 +/- 0.5 s by ITU recommendations in the 1980s as
a result. But the requirement for 50 ms APS switching stayed in place, mainly because this was still technically quite feasible at no extra cost in the design of APS subsystems. The apparent sanctity of 50 ms was further entrenched in the 1990s by vendors who promoted only ring-based transport
solutions and found it advantageous to insist on 50 ms as the requirement, effectively precluding distributed mesh restoration
alternatives which were under equal consideration at the start of the SONET era. As a marketing strategy the 50 ms issue thus served as the "mesh killer" for the 1990s as more and more traditional
telcos bought into this as dogma.

On the other hand, there was also real urgency in the early 1990s to deploy some kind of fast automated restoration method
as soon as possible. This led to the quick adoption of ring-based solutions which had only incremental development requirements
over 1+1 APS transmission systems. However, once rings were deployed, the effect was to only further reinforce the cultural assumption
of 50 ms as the standard. Thus, as sometimes happens in engineering, what was initially a performance capability in one specific context (APS switching time) evolved into a perceived requirement in all other contexts.

But the "50 ms requirement" is undergoing serious challenges to its validity as a ubiquitous requirement, even being referred
to as the "50 ms myth" by data-centric entrants to the field who see little actual need for such fast restoration from an
IP services standpoint. Faster restoration is by itself always desirable as a goal, but restoration goals must be carefully
set in light of corresponding costs that may be paid in terms of limiting the available choices of network architecture. In
practice, insistence on "50 ms" means 1+1 dedicated APS or UPSR rings (to follow) are almost the only choices left for the operator to consider. But if something more like 200 ms is allowed,
the entire scope of efficient shared-mesh architectures becomes available. So it is an issue of real importance as to whether
there are any services that truly require 50 ms.

Sosnosky's original study found no applications that require 50 ms restoration. However, the 50 ms requirement was still being
debated in 2001 when Schallenburg [Schal01], understanding the potential costs involved to his company, undertook a series of experimental trials with varying interruption
times and measured various service degradations on voice circuits, SNA, ATM, X.25, SS7, DS1, 56 kb/s data, NTC digital video, SONET OC-12 access services, and OC-48. He tested with controlled-duration outages and found that 200 ms outages would not jeopardize
any of these services and that, except for SS7 signaling links, all other services would in fact withstand outages of two
to five seconds.

Thus, the supposed requirement for 50 ms restoration seems to be more of a techno-cultural myth than a real requirement, and there
are quite practical reasons to consider 2 seconds as an alternate goal for network restoration. This avoids the regime of
connection and session time-outs and IP/MPLS layer reactions, while still giving a green light to the full consideration of far more efficient mesh-based survivable architectures.