Why CenturyLink's Network Suffered a Christmas Hangover

As 2018 was winding down, CenturyLink experienced what it calls a "network event," an outage that interrupted or, in some cases, impaired services all over the US. The carrier tells Light Reading that the culprit was an electronic (not virtual) network element in its transport network, where a third-party network management card began creating and spreading "invalid frame packets," congesting the controller card CPUs across its network and locking them up.

"This event was not caused by a source external to the CenturyLink network, such as a security incident," CenturyLink said, in an email to Light Reading.

The outage impacted "voice, IP, and transport services for some of our customers," CenturyLink said in its email. [Ed. note: So, everything, pretty much.] "The event also impacted CenturyLink's visibility into our network management system, impairing our ability to troubleshoot and prolonging the duration of the outage."

That included interruptions to at least some wireless services and 9-1-1 emergency services in several states. Verizon, for instance, told the Associated Press it had service interruptions in Albuquerque, New Mexico and parts of Montana as a result of issues with CenturyLink.

From behind his giant, clownlike coffee mug on December 28, Federal Communications Commission Chairman Ajit Pai announced that the FCC would investigate the CenturyLink outage because it interrupted 9-1-1 services "across the country."

"I've directed the Public Safety and Homeland Security Bureau to immediately launch an investigation into the cause and impact of this outage," the FCC head said in a statement, several days after the government was shut down over an omnibus funding bill. "This inquiry will include an examination of the effect that CenturyLink's outage appears to have had on other providers' 911 services."


Who is to blame?
CenturyLink told Light Reading that "a faulty network management card from a third-party equipment vendor" caused the outage. Light Reading pressed for more details. We first thought the gear at fault might have been a virtualized network function running on a commercial, off-the-shelf platform. But CenturyLink explained otherwise, saying that the "source was an electronic network element within the transport layer of the CenturyLink network driven by a card supplied by a third-party equipment vendor."

CenturyLink engineers have identified a network element that was impacting customer services and are addressing the issue in order to fully restore services. We estimate services will be fully restored within 4 hours. We apologize for any inconvenience this caused our customers.

What happened with the network management card? It went a bit bonkers. [Ed. note: And that's us editorializing, not CenturyLink.]

The problem originated in Denver, CenturyLink said in its email to Light Reading. That's where the network card in question began "propagating invalid frame packets that were encapsulated and then sent over the network via secondary communication channels. Once on the secondary communication channel, the invalid frame packets multiplied, forming loops and replicating high volumes of traffic across the network." In turn, this "congested controller card CPUs (central processing units) network-wide, causing functionality issues and rendering many nodes unreachable," CenturyLink explained.

With the network management card acting up, CenturyLink faced two problems: it had to find the faulty card, and then it had to figure out how to clear out the traffic that card had created. From CenturyLink's description, this involved undoing stuff that had been replicated because, we assume, the network management card was part of the transport network, which is subject to 1-to-1 redundancy.

We discovered some additional technical problems as our service restoration efforts were underway. We continue to make good progress with our recovery efforts and we are working tirelessly until restoration is complete. We apologize for the disruption.

Not an easy fix
"Locating the network management card that was sending invalid frame packets across the network took significant analysis and packet captures to be identified as the source as the card was not indicating a malfunction," CenturyLink told Light Reading. "Even after the network management card was removed, the CenturyLink network continued to rebroadcast the invalid packets through the redundant (secondary) communication routes. These invalid frame packets did not have a source, destination, or expiration and had to be cleared out of the network via the application of the polling filters and removal of the secondary communication paths between specific nodes to fully restore service."

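To see why frames with no expiration are so hard to flush out of a network with redundant paths, here is a toy flooding model in Python. It is only a sketch under our own assumptions: the four-node ring with one extra "secondary" link, the node names, and the rule that every node re-sends a frame to all neighbors except the one it came from are all hypothetical, and none of it reflects CenturyLink's actual topology or the vendor's firmware.

```python
def flood(links, origin, ttl=None, max_steps=10):
    """Inject ONE frame at `origin` and return the number of frames
    in flight after each step. `ttl=None` models a frame that never expires."""
    # Build an undirected adjacency map from the (hypothetical) link list.
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each in-flight frame is (current_node, node_it_came_from, remaining_ttl).
    in_flight = [(peer, origin, ttl) for peer in neighbors[origin]]
    history = [len(in_flight)]

    for _ in range(max_steps):
        next_flight = []
        for node, came_from, remaining in in_flight:
            if remaining is not None:
                remaining -= 1
                if remaining <= 0:
                    continue  # the frame expires and is dropped
            # Re-send to every neighbor except the one the frame arrived from.
            for peer in neighbors[node]:
                if peer != came_from:
                    next_flight.append((peer, node, remaining))
        in_flight = next_flight
        history.append(len(in_flight))
        if not in_flight:
            break  # nothing left circulating
    return history


# Hypothetical topology: a ring A-B-C-D plus one redundant A-C link.
ring_with_secondary = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")]
# The same nodes with the redundant paths removed, leaving no loops.
no_loops = [("A", "B"), ("B", "C"), ("C", "D")]

print("no expiration, loops present:", flood(ring_with_secondary, "A"))
print("expiration after 3 hops:     ", flood(ring_with_secondary, "A", ttl=3))
print("no expiration, loops removed:", flood(no_loops, "A"))
```

In this model the origin sends a single frame and then goes quiet, yet on the looped topology the frame count keeps climbing, which mirrors the carrier's account of the network rebroadcasting after the card was pulled. The count only falls to zero when the frame carries an expiration or the redundant links are taken away, the second of which resembles the "removal of the secondary communication paths" CenturyLink describes.
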
As it went along, the repairs got more complicated. "In addition, as repair actions were underway, it became apparent that additional restoration steps were required for certain nodes, which included either line card resets or field operations dispatches for local equipment login," CenturyLink said, adding that its teams "worked around the clock until the issue was resolved."

Even as services were being restored, problems lingered. As is the case with most telco networks, CenturyLink's has varying generations of equipment with diverse operational processes that all somehow work in harmony (most times) to provide what looks like, to the consumer, a single, homogeneous service. When stuff goes wrong, of course, you need just as many fixes as you have different ways of doing the same thing. "Lingering outages for a small subset of clients were experienced following that time," CenturyLink said. "The remaining impacts were investigated at the individual circuit level and resolved on a case-by-case basis to restore all services to a stable state."

We are aware of some 911 service disruptions affecting various areas through the United States. In case of an emergency, customers should use their wireless phones to call 911 or drive to the nearest fire station or emergency facility. Technicians are working to restore services.

The fix has been ongoing, and CenturyLink has had to come up with a plan to spot the issue more quickly, should it happen again.

"Secondary communication channels that enabled invalid traffic replication have been disabled networkwide," the carrier told Light Reading. "CenturyLink has established a network monitoring plan for key parameters that can cause this type of outage, based on advice from the third-party equipment vendor. Improvements to the existing monitoring and audits of memory and CPU utilization for this type of issue have been put into place.

"Enhanced visibility processes will quickly identify and terminate invalid packets from propagating the network. This will be jointly and regularly evaluated by the third-party equipment vendor in conjunction with CenturyLink network engineering to ensure the health of the affected nodes," the carrier said, acknowledging that its vendor is actively involved in fixing the problem caused by its gear.

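CenturyLink did not share what its CPU and memory audits look like, so the following is only our guess at the general shape of such a check: flag any node whose CPU utilization stays above a threshold for several consecutive polls. The function name, node names, 90% threshold, and three-sample window are all assumptions made for illustration.

```python
def sustained_high_cpu(samples, threshold=90.0, window=3):
    """Return the nodes whose CPU utilization (percent) stayed at or above
    `threshold` for at least `window` consecutive samples."""
    flagged = []
    for node, readings in samples.items():
        streak = 0
        for value in readings:
            # Extend the streak on a high reading, reset it otherwise.
            streak = streak + 1 if value >= threshold else 0
            if streak >= window:
                flagged.append(node)
                break
    return sorted(flagged)


# Hypothetical polling data: one list of CPU percentages per node.
polls = {
    "node-a": [20, 95, 96, 97],  # sustained congestion
    "node-b": [95, 20, 95, 20],  # brief spikes only
    "node-c": [10, 12, 11, 9],   # healthy
}
print(sustained_high_cpu(polls))
```

Requiring a run of high readings, rather than a single spike, is one simple way to avoid alarms on momentary load while still catching the kind of network-wide CPU congestion described in this outage.
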
The network event experienced by CenturyLink Thursday has been resolved. Services for business and residential customers affected by the event have been restored.

Re: Reminiscent of TARP storms of old
Like many bugs in complex software, it is hard to test for and you don't know it's there until it's too late. Especially when it's triggered by a hardware failure. There's a hardware failure mode in some routers wherein the card stops relaying MPLS and other traffic (its job) but does maintain physical connectivity and its IS-IS daemon dutifully reports that the link is up. In that case the black hole it creates is relatively easy to locate. GMPLS, like MPLS and IP, depends on some underlying route determination code. And apparently the bad packets knocked out circuits previously set up. Here, it seems as if a packet of death, with no address, was allowed to propagate. The vendor probably didn't test for that since it's not supposed to happen. I'm guessing that a misdesign in the vendor code somewhere relayed, rather than forwarded, that packet, perhaps treating its non-address as a broadcast address. That would really be, uh, hilarious.

Your explanation seems the most reasonable of the ones I've seen so far - i.e., the explanation does fit the cryptic CTL commentary. IF this is the explanation, one thing that puzzles me is that GMPLS is well-entrenched in networks and has been for more than a decade. So why has this never happened before? (I'm not aware of a GMPLS breakdown like this at any time in the past.)

Re: Reminiscent of TARP storms of old
GMPLS comes to mind, because they were losing optical streams, and that's the usual approach to treating optical streams like BESQR cat videos. I am guessing that "third party" simply means the hardware vendor, not a real third party. There's an interesting alleged "outage report" on comp.dcom.telecom.

Re: Reminiscent of TARP storms of old
The outage impacted optical services, not just packet service, but was caused by packets. That points to a control plane on an optical network that has something more elaborate than simple management commands. The packet of death didn't just knock out change-control, it knocked out circuits. What could that be?

I'm guessing that this was a GMPLS failure. That takes all of the brokenness of the IP protocol suite and puts it in charge of the underlying optical layer. It's what comes from folks who don't come from telecom, and who think the whole world will forever be IP and that IP is somehow infallible heavenly writ. The higher layer routing protocol's job is to route around physical failures. Put the physical layer under its control and of course hilarity ensues.