BGP Routing Table Size Limit Blamed for Tuesday’s Outages

Many websites, including Data Center Knowledge, responded sporadically from certain locations Tuesday, but the outages did not result from loss of power at a hosting company’s or a cloud provider’s data center, a flood or a network cable severed by a squirrel. The culprit was attributed to a structural problem in the way the Internet is built.

That issue is the capacity of a certain type of memory chip on older-generation router hardware used in many service providers’ infrastructure. Ternary Content-Addressable Memory (TCAM) is the memory routers use to store the Internet’s routing table. In very simple terms, it is sort of a combination of an address book and a map for the routes Internet traffic travels on.
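The "address book and map" idea boils down to longest-prefix matching: each stored prefix occupies one TCAM slot, and the most specific prefix covering a destination wins. A minimal Python sketch of that lookup, with made-up prefixes and next-hop names for illustration:

```python
import ipaddress

# Toy routing table: each prefix maps to a next hop. On real hardware
# every one of these entries occupies a TCAM slot, which is why the
# total size of the table matters.
routes = {
    ipaddress.ip_network("10.0.0.0/8"): "next-hop-A",
    ipaddress.ip_network("10.1.0.0/16"): "next-hop-B",
    ipaddress.ip_network("0.0.0.0/0"): "default-gateway",
}

def lookup(addr: str) -> str:
    """Longest-prefix match: the most specific covering route wins."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in routes if ip in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]

print(lookup("10.1.2.3"))   # the /16 beats the /8
print(lookup("192.0.2.1"))  # only the default route covers this
```

TCAM exists precisely to do this match in a single hardware pass rather than in a loop like the one above, which is why its slot count is a hard ceiling.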

The number of routes a TCAM can store is finite, as a post on The IPv4 Depletion Site blog, run by a group of network and IT experts, explains. While workarounds have been developed to deal with this limit, not all routing equipment (especially older routing equipment) has been upgraded to use them. On Tuesday morning, the Internet felt a very distinct tremor that resulted from the size of the routing table reaching that magic number of 512,000 BGP routes. BGP, the Border Gateway Protocol, is the protocol used to communicate routing information between networks.
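The "magic number" can be illustrated with a rough capacity check. Note the figure is hedged: whether the ceiling is exactly 512,000 entries or 512 × 1024 = 524,288 depends on how a particular router's TCAM is partitioned; the constant below simply uses the limit as reported here.

```python
# Illustrative capacity check; 512,000 is the limit as reported in this
# article, not a universal figure (512K could also mean 512 * 1024 =
# 524,288 slots, depending on the hardware's TCAM partitioning).
TCAM_ROUTE_SLOTS = 512_000

def tcam_status(table_size: int) -> str:
    if table_size <= TCAM_ROUTE_SLOTS:
        return "fits in TCAM"
    return "overflow: excess routes spill to slower memory"

for size in (490_000, 512_000, 515_000):
    print(f"{size:>7} routes -> {tcam_status(size)}")
```

The "overflow" branch is the failure mode described below: routes that no longer fit in TCAM are handled in slower software paths, so the router keeps working but degrades.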

Representatives of the hosting company Liquid Web (which hosts Data Center Knowledge, among many others) indicated on the company’s Twitter feed that the issue had been attributed to the table size hitting the TCAM limit.

Since the issue affected numerous network operators, it was not easy to send traffic around affected areas of the Internet. “Generally, we would reroute traffic, but this is being hindered by the amount of providers experiencing outages,” the Liquid Web team tweeted.

According to downdetector.com, service providers that had network issues Tuesday morning included Comcast, Level 3, AT&T, Cogent, Verizon, Time Warner and possibly others. Outage start times, courtesy of downdetector:

Comcast: issues since 8:30 AM EDT
Level 3: issues since 9:55 AM EDT
AT&T: issues since 9:35 AM EDT
Cogent Communications: issues since 10:10 AM EDT
Verizon Communications: issues since 10:41 AM EDT
Time Warner Cable: issues since 10:01 AM EDT

Things began looking up in the afternoon, when Liquid Web tweeted, “As ISP’s have recovered from #512k active bgp routes being reached, many of our customers affected by these carrier issues have regained ability to reach their sites.”

The hosting company updated its Twitter feed around 3 pm Pacific, saying all of its customers had regained connectivity from all locations.

Data disruption forecast as routers hit memory limits

In the last few days the number of routes in the global table breached that upper limit, which could mean affected routers start to struggle.

The disruption could grow over the next few weeks as older hardware is identified, said analysts at internet monitoring firm Renesys.

Any router that hits its upper memory limit could slow down, lose data or become unstable, wrote Omar Santos from Cisco in a blog post.

The problem has emerged as the number of connections between the different networks that make up the internet has continued to grow.

Routers, which send data around the net, keep track of all the ways data can travel via an internal log known as a routing table. This list is constantly updated according to the ebb and flow of internet traffic.
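That constant updating is driven by BGP UPDATE messages, which announce new routes or withdraw old ones. A toy sketch of the mechanism, with entirely made-up prefixes and peer names:

```python
# Toy model of a routing table growing and shrinking as BGP UPDATE
# messages arrive. Real BGP tracks far more state (AS paths, policies,
# multiple candidate routes); this only shows the add/withdraw churn.
routing_table = {}

def handle_update(prefix, next_hop=None, withdraw=False):
    if withdraw:
        routing_table.pop(prefix, None)   # peer withdrew the route
    else:
        routing_table[prefix] = next_hop  # new or replacement route

handle_update("203.0.113.0/24", "peer-1")
handle_update("198.51.100.0/24", "peer-2")
handle_update("203.0.113.0/24", withdraw=True)
print(len(routing_table))
```

It is this churn, summed across every network announcing prefixes, that pushed the global table's entry count up against hardware limits.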

This week the number of entries on that global routing table peaked at 524,000. That represents a growing problem, said Mr Santos, because five separate devices Cisco makes can only handle a routing table of 512,000 entries.

As more and more routers around the world have to support 512K entries and beyond, the potential problems will grow, said Jim Cowie from internet monitoring firm Renesys.

"512K is right around the corner for everyone on Earth, as early as next week," he wrote in a blog post, adding, "this situation is more of an annoyance than a real internet-wide threat."

So far, he said, there was no evidence that the 512K problem was bringing about any more disruption than Renesys normally sees.

Hosting firm Liquid Web blamed the 512K bug for service disruption that hit it on Tuesday, and it is also thought to be instrumental in causing problems for eBay, Comcast and Time Warner.

Paul Lettington, network architect at UK ISP Andrews and Arnold, said workarounds did exist for the bug that should help older kit cope. Cisco has also published advice for owners of vulnerable hardware.

Andrews and Arnold had only seen indirect evidence of the 512K bug starting to bite, said Mr Lettington.

"We have seen anomalies with other networks on the internet which could have been caused by it, and these may have had an effect on our customers accessing those other networks," he said.

He added: "It is unlikely that any network operators will step forward and say that they were affected by it, as it would require admitting that they are running older, less capable hardware and are not on top of managing the maintenance of it."