This appears to be an outage report from CenturyLink, but I can't
verify its authenticity. I had to substitute ASCII for some
multi-byte characters.

Bill Horne

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

* Event Conclusion Summary *

Outage Start: December 27, 2018 08:40 GMT
Outage Stop: December 29, 2018 10:12 GMT

Root Cause: A CenturyLink network management card in Denver, CO was
propagating invalid frame packets across devices.

Fix Action: To restore services the card in Denver was removed from
the equipment, secondary communication channel tunnels between
specific devices were removed across the network, and a polling
filter was applied to adjust the way the packets were received in the
equipment. As repair actions were underway, it became apparent that
additional restoration steps were required for certain nodes, which
included either line card resets or Field Operations dispatches for
local equipment login. Once completed, all services restored.

RFO Summary: On December 27, 2018
at 08:40 GMT, CenturyLink identified an initial service impact in
New Orleans, LA. The NOC was engaged to investigate the cause, and
Field Operations were dispatched for assistance onsite. Tier IV
Equipment Vendor Support was engaged as it was determined that the
issue was larger than a single site. During cooperative
troubleshooting between the Equipment Vendor and CenturyLink, a
decision was made to isolate a device in San Antonio, TX from the
network as it seemed to be broadcasting traffic and consuming
capacity. This action did alleviate impact; however, investigations
remained ongoing. Focus shifted to additional sites where network
teams were unable to remotely troubleshoot equipment. Field
Operations were dispatched to sites in Kansas City, MO, Atlanta, GA,
New Orleans, LA and Chicago, IL for onsite support. As visibility to
equipment was regained, Tier IV Equipment Vendor Support evaluated
the logs to further assist with isolation. Additionally, a polling
filter was applied to the equipment in Kansas City, MO and New
Orleans, LA to prevent any additional effects. All necessary
troubleshooting teams, in cooperation with Tier IV Equipment Vendor
Support, were working to restore remote visibility to the remaining
sites. The issue had CenturyLink Executive level awareness for the
duration. A plan was formed to remove secondary communication
channels between select network devices until visibility could be
restored, which was undertaken by the Tier IV Equipment Vendor
Technical Support team in conjunction with CenturyLink Field
Operations and NOC engineers. While that effort continued,
investigations into the logs, including packet captures, were
occurring in tandem, which ultimately identified a suspected card
issue in Denver, CO. Field Operations were dispatched to remove the
card. Once removed, it did not appear there had been significant
improvement; however, the logs were further scrutinized by the
Vendor's Advanced Support team and CenturyLink Network Operations to
identify that the source packet did originate from this
card. CenturyLink Tier III Technical Support shifted focus to the
application of strategic polling filters along with the continued
efforts to remove the secondary communication channels between
select nodes. Services began incrementally restoring. An estimated
restoral time of 09:00 GMT was provided; however, as repair efforts
steadily progressed, additional steps were identified for certain
nodes that impeded the restoration process. This included either
line card resets or Field Operations dispatches for local equipment
login. Various repair teams worked in tandem on these actions to
ensure that services were restored in the most expeditious method
available. By 2:30 GMT on December 29, it was confirmed that the
impacted IP, Voice, and Ethernet Access services were once again
operational. Point-to-point Transport Waves as well as Ethernet
Private Lines were still experiencing issues as multiple Optical
Carrier Groups (OCG) were still out of service. The Transport NOC
continued to work with the Tier IV Equipment Vendor Support and
CenturyLink Field Operations to replace additional line cards to
resolve the OCG issues. Several cards had to be ordered from the
nearest sparing depot. Once the remaining cards were replaced it was
confirmed that all services except a very small set of circuits had
restored, and the Transport NOC will continue to troubleshoot the
remaining impacted services under a separate Network Event. Services
were confirmed restored at 10:12 GMT. Please contact the Repair
Center to address any lingering service issues.

Additional Information: Please note that as formal post incident
investigations and analysis occur the details relayed here may
evolve. Locating the
management card in Denver, CO that was sending invalid frame packets
across the network took significant analysis and packet captures to
be identified as a source as it was not in an alarm status. The
CenturyLink network continued to rebroadcast the invalid packets
through the redundant (secondary) communication routes. CenturyLink
will review troubleshooting steps to ensure that any areas of
opportunity regarding potential for restoral acceleration are
addressed. These invalid frame packets did not have a source,
destination, or expiration and were cleared out of the network via
the application of the polling filters and removal of the secondary
communication paths between specific nodes. The management card has
been sent to the equipment vendor where extensive forensic analysis
will occur regarding the underlying cause, how the packets were
introduced in this particular manner. The card has not been replaced
and will not be until the vendor review is supplied. There is no
increased network risk with leaving it unseated. At this time, there
is no indication that there was maintenance work on the card,
software, or adjacent equipment. The CenturyLink network is not at
risk of reoccurrence due to the placement of the polling filters and
the removal of the secondary communication routes between select
nodes.
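
As a rough illustration of why frames with no source, destination, or
expiration keep circulating while a redundant path exists, and why
removing the secondary path or applying a filter clears them, here is
a minimal sketch. It is a toy model only: the four-node topology, the
function name `flood`, and all of its parameters are hypothetical and
do not describe CenturyLink's equipment or the vendor's filter.

```python
from collections import deque

# Hypothetical topologies. LINE is a primary path only; RING adds a
# secondary (redundant) link between D and A, which closes a loop.
LINE = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
RING = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}


def flood(links, origin, ttl=None, filtered=frozenset(), max_forwards=1000):
    """Flood one frame from `origin` and count how often it is forwarded.

    links        -- dict mapping each node to its neighbours
    ttl          -- hop limit, or None for a frame with no expiration
    filtered     -- nodes that drop the frame on receipt
    max_forwards -- safety cap so the simulation itself always stops
    """
    queue = deque([(origin, None, ttl)])  # (node, arrived_from, hops left)
    forwards = 0
    while queue and forwards < max_forwards:
        node, came_from, hops = queue.popleft()
        if came_from is not None and node in filtered:
            continue  # the filter discards the frame instead of relaying it
        if hops is not None and hops <= 0:
            continue  # an expiring frame would die here
        for neighbour in links[node]:
            if neighbour == came_from:
                continue  # never send back out the link it arrived on
            forwards += 1
            next_hops = None if hops is None else hops - 1
            queue.append((neighbour, node, next_hops))
    return forwards


if __name__ == "__main__":
    print(flood(RING, "A"))                  # loop, no expiration: hits the cap
    print(flood(LINE, "A"))                  # secondary link removed
    print(flood(RING, "A", filtered={"C"}))  # loop kept, filter at one node
```

In this toy model the frame on the ring is forwarded until the safety
cap stops the simulation, while removing the redundant link or
filtering at a single node lets it die out after a handful of hops.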

* 2018-12-29 12:48:18 GMT - The Transport NOC continues to monitor the
network to ensure impacted services have remained restored and
stable. If additional issues are experienced, please contact the
CenturyLink Repair Center. A final notification will be provided
momentarily.

* 2018-12-29 10:48:39 GMT - The Transport NOC advises Field Operations
has replaced the impacted cards and the replacement cards have
booted up and are continuing to stabilize. The Transport NOC is
monitoring to confirm impacted services have restored.

* 2018-12-29 09:40:22 GMT - The Transport NOC advises Field Operations
has received the line cards and, in cooperation with the equipment
vendor, is commencing with replacements.

* 2018-12-29 08:33:07 GMT - The Transport NOC has provided updated
estimated time of arrivals for the replacement cards of 08:30 GMT and
09:00 GMT. Field Operations are on site and will replace the
affected cards immediately upon receiving the replacement cards. The
Transport NOC and Field Operations are continuing with
troubleshooting efforts for the remaining impacted sites.

* 2018-12-29 07:21:10 GMT - The Transport NOC reports continued repair
progress as multiple Optical Channel Groups have restored.
Replacement line cards have been ordered for impacted sites with an ETA
of 07:45 GMT and 08:30 GMT. Troubleshooting efforts remain ongoing
at the remaining impacted sites by Field Operations and an equipment
vendor.

* 2018-12-29 05:40:47 GMT - The Transport NOC has advised that
additional Optical Carrier Groups have restored; however,
collaborative troubleshooting continues at the necessary locations,
as multiple out-of-service Optical Carrier Groups remain.

* 2018-12-29 05:15:24 GMT - The Transport NOC has advised that
additional Optical Carrier Groups have restored; however,
collaborative troubleshooting continues at the necessary locations,
as multiple out-of-service Optical Carrier Groups remain.

* 2018-12-29 03:52:30 GMT - The Transport NOC advises that Field
Operations personnel are at the final two sites and are currently
troubleshooting with the assistance from the equipment vendor.

* 2018-12-29 00:31:23 GMT - Field Operations in cooperation with the
Engineering teams have repaired the span traversing the western
United States through loop testing. Once the equipment was restored,
additional capacity was in turn available to the span on the
CenturyLink Network. IP, Voice, and Ethernet Access services are
expected to have restored with the now available
capacity. Point-to-Point Transport Waves as well as Ethernet Private
Lines may still experience issues while the remainder of the final
card issues are resolved. Lingering latency may be present, which is
anticipated to subside as routing continues to normalize. If issues
are still being experienced with your IP, Voice, and Ethernet Access
services please contact the CenturyLink Repair Center.

* 2018-12-28 23:02:29 GMT - As the Equipment Vendor and CenturyLink
Engineering teams continue to work to clear the lingering card
issues it has been confirmed that alarms continue to clear, and network
capacity is being restored. Efforts will remain ongoing to continue
to resolve any further issues identified.

* 2018-12-28 21:42:05 GMT - The Transport NOC has confirmed that
visibility has been restored to all nodes, allowing triage of the
additional cards to be completed. Engineering continues to review
the network to identify, review, and clear the remaining alarms and
issues observed. Field Operations continue to remain on standby and
dispatch to sites as necessary to assist with isolation and
resolution.

* 2018-12-28 20:31:40 GMT - Efforts to complete the line card resets
remain ongoing, while additional support teams continue to triage
chassis within a smaller set of nodes that did not have full
visibility restored as well as additional line cards within the
network. The highest level of Engineering support from both the
Equipment Vendor as well as CenturyLink continue to diligently work
to restore services.

* 2018-12-28 18:23:33 GMT - The Transport NOC has confirmed that
visibility has been restored to the majority of the network outside
of a few remaining nodes that are in various states of
recovery. Engineering has identified the line cards that will need
to be reset and are working diligently to perform the necessary
actions to bring all cards back online.

* 2018-12-28 17:15:20 GMT - It has been confirmed that visibility has
been restored to the majority of the nodes across the network.
Field Operations have been dispatched to assist with recovering
visibility to the few remaining nodes. Engineering is working to
systematically review the network alarms on the other nodes and are
then performing remote manual resets to individual cards that remain
in alarm. Reinstate times for each card may vary significantly, as
such an estimated completion time is not yet available. If cards do
not automatically reinstate after remote resets complete, Field
Operations are standing by to dispatch as needed. The Equipment
Vendor's Tier IV team continues to assist with the resolution
efforts.

* 2018-12-28 13:35:00 GMT - Efforts by the Equipment Vendor and
CenturyLink engineers to apply the filters and remove the secondary
communication channels in the network continue. The previously
provided ETR of 09:00 GMT remains.

* 2018-12-28 13:27:30 GMT - The Equipment Vendor and CenturyLink
engineers continue work to apply the filters and remove the
secondary communication channels. Field Operations and Equipment
Vendor dispatches to recover nodes locally remain underway. Services
continue to restore in a steady manner as troubleshooting progresses
following the recovery of nodes. CenturyLink NOC management remains
in contact with the equipment vendor to obtain updates as
restoration efforts continue.

* 2018-12-28 11:04:24 GMT - CenturyLink continues to work with the
Equipment Vendor to apply the filters and remove the secondary
communication channels. Field Operations and Equipment Vendor dispatches
to recover nodes locally remain underway. Client services continue
to restore in a steady manner as troubleshooting progresses
following the recovery of nodes.

* 2018-12-28 08:51:29 GMT - CenturyLink NOC Management has advised
that repair efforts are steadily progressing, and services are
incrementally restoring. The Equipment Vendor and CenturyLink engineers
continue work to apply the filters and remove the secondary
communication channels at this time. There have been additional
restoration steps identified for certain nodes, which includes
either line card resets or Field Operations dispatches for local
equipment login, that have impeded the restoration process. Various
repair teams are working in tandem on these actions to ensure that
services are restored in the most expeditious method
available. Restoration efforts are ongoing.

* 2018-12-28 07:12:32 GMT - Efforts by the Equipment Vendor and
CenturyLink engineers to apply the filters and remove the secondary
communication channels in the network continue. Additional
information on repair progress will be available from the Equipment
Vendor by 07:30 GMT. Information will be relayed as soon as it is
obtained.

* 2018-12-28 06:00:01 GMT - Efforts by the Equipment Vendor and
CenturyLink engineers to apply the filters and remove the secondary
communication channels in the network continue. The previously
provided ETR of 09:00 GMT remains.

* 2018-12-28 04:58:44 GMT - CenturyLink engineers in conjunction with
the Equipment Vendor's Tier IV Technical Support team have
identified the elements causing the impact to customer
services. Through the filters being applied and the removal of the
secondary communication channels, it is anticipated services will be
fully restored within four hours. We apologize for any
inconvenience this caused our customers. Additional details
regarding details of the underlying cause will be relayed as
available.

* 2018-12-28 04:09:31 GMT - The Equipment Vendor's Tier IV
Technical Support team in conjunction with CenturyLink Tier III
Technical Support continues to remotely work to remove the
secondary communication channel tunnels across the network until
full visibility can be restored, as well as applying the necessary
polling filter to each of the reachable nodes.

* 2018-12-28 02:53:38 GMT - The Transport NOC has confirmed that
cooperative efforts remain ongoing to remove the secondary
communication channel tunnel across the network until full
visibility can be restored, as well as applying the necessary filter
to each of the reachable nodes. It has been confirmed that both of
these actions are being performed remotely, but an estimated time to
complete the activities is not available at this time.

* 2018-12-28 01:58:56 GMT - Once the card was removed in Denver, CO it
was confirmed that there was no significant improvement. Additional
packet captures and logs will be pulled from the device with the
card removed to further isolate the root cause. The Equipment vendor
continues to work with CenturyLink Field Operations at multiple
sites to remove the secondary communication channel tunnel across
the network until full visibility can be restored. The equipment
vendor has identified a number of additional nodes that visibility
has been restored to, and their engineers are currently working to
apply the necessary filter to each of the reachable nodes.

* 2018-12-28 00:59:04 GMT - Following the review of the logs and
packet captures, the Equipment Vendor's Tier IV Support team has
identified a suspected card issue in Denver, CO. Field Operations
has arrived on site and are working in cooperation with the
Equipment Vendor to remove the card.

* 2018-12-27 23:57:16 GMT - The Equipment Vendor is currently
reviewing the logs and packet captures from devices that have been
completed, while logs and packet captures continue to be pulled
from additional devices. The necessary teams continue to remove a
secondary communication channel tunnel across the network until
visibility can be restored. All technical teams continue to
diligently work to review the information obtained in an effort to
isolate the root cause.

* 2018-12-27 22:52:43 GMT - Multiple teams continue work to pull
additional logs and packet captures on devices that have had
visibility restored, which will be scrutinized during root cause
analysis. The Tier IV Equipment Vendor Technical Support team in
conjunction with Field Operations are working to remove a secondary
communication channel tunnel across the network until visibility can
be restored. The Equipment Vendor Support team has dispatched their
Field Operations team to the site in Chicago, IL and has been
obtaining data directly from the equipment.

* 2018-12-27 21:35:55 GMT - It has been advised that visibility has
been restored to both the Chicago, IL and Atlanta, GA sites.
Engineering and Tier IV Equipment Vendor Technical Support are currently
working to obtain additional logs from devices across multiple sites
including Chicago and Atlanta to further isolate the root cause.

* 2018-12-27 21:01:26 GMT - On December 27, 2018 at 02:40 GMT,
CenturyLink identified a service impact in New Orleans, LA. The NOC
was engaged and investigating in order to isolate the cause. Field
Operations were engaged and dispatched for additional
investigations. Tier IV Equipment Vendor Support was later
engaged. During cooperative troubleshooting a device in San Antonio,
TX was isolated from the network as it was seeming to broadcast
traffic consuming capacity, which seemed to alleviate some
impact. Investigations remained ongoing. Following the isolation of
the San Antonio, TX device troubleshooting efforts focused on
additional sites that teams were remotely unable to
troubleshoot. Field Operations were dispatched to sites in Kansas
City, MO, Atlanta, GA, New Orleans, LA and Chicago, IL. Tier IV
Equipment Vendor Support continued to investigate the equipment logs
to further assist with isolation. Once visibility was restored to
the site in Kansas City, MO and New Orleans, LA a filter was applied
to the equipment to further alleviate the impact observed. All of
the necessary troubleshooting teams in cooperation with Tier IV
Equipment Vendor Support are working to restore remote visibility to
the remaining sites at this time. Tier IV Equipment Vendor Technical
Support continues to review equipment logs from the sites where
visibility was previously restored. We understand how important
these services are to our clients and the issue has been escalated
to the highest levels within CenturyLink Service Assurance
Leadership.

https://fuckingcenturylink.com/

***** Moderator's Note *****

This notice doesn't mention 911. That's puzzling: there were outages
of 911 service in many areas, although they are reported as being
limited to cellular users.

The report implies that a fault occurred in several high-capacity
MUXes, which IIRC wouldn't usually be used to carry 911 traffic. My
experience was all in wireline, so I'll ask those of you who work in
the mobile world if CenturyLink is allowed to have mobile switches
carry traffic across LATA boundaries.