Kubernetes bug ate my banking app! How code flaw crashed Brit upstart

Monzo engineering chief details exact cause of outage

Monzo, a UK online banking startup, suffered an outage on Friday for over an hour due to a four-month-old Kubernetes bug.

The Fatal Flaw, as the event might be titled by author Lemony Snicket, took down a complete production cluster, according to Oliver Beattie, head of engineering for Monzo, "through a very unfortunate series of events."

Customers saw incoming payments delayed during this period and outgoing payments failed. Monzo essentially operates as an internet-based bank, accessible through its smartphone app, that offers current accounts, budgeting tools, spending warnings, and so on.

Two weeks prior to the outage, Monzo's platform team upgraded its etcd cluster to a new version and expanded it from three nodes to nine. In so doing, they set the stage for the outage. On Thursday, an engineering team deployed a new feature for account holders, but started seeing issues and scaled the service down to zero replicas, though it remained registered as a Kubernetes service.
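In Kubernetes terms, scaling a workload to zero removes its running pods, but the Service object, and therefore its entry in service discovery, sticks around. A minimal Python sketch of that state (names and structure are illustrative bookkeeping, not the Kubernetes API):

```python
# Toy model of a cluster's service registry: a Service entry can
# outlive its pods, leaving service discovery with zero endpoints.
# Service names and IPs here are invented for illustration.

registry = {
    "payments":    {"replicas": 3, "endpoints": ["10.0.0.4", "10.0.0.5", "10.0.0.6"]},
    "new-feature": {"replicas": 1, "endpoints": ["10.0.0.9"]},
}

def scale_down(service: str, replicas: int) -> None:
    """Scale a service down; at zero replicas its endpoints vanish,
    but the Service entry itself remains in the registry."""
    entry = registry[service]
    entry["replicas"] = replicas
    entry["endpoints"] = entry["endpoints"][:replicas]

scale_down("new-feature", 0)

assert "new-feature" in registry                    # Service object still exists...
assert registry["new-feature"]["endpoints"] == []   # ...with no pods behind it
```

That leftover empty entry is the detail that mattered on Friday: anything consuming the discovery data had to cope with a service that resolved to nothing.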

On Friday, around 14:10 BST, a change was made to a service used for processing payments. At that point, customers began experiencing payment failures. Two minutes later, the change was rolled back but the problems persisted.

By 14:18, Monzo's engineers traced the problem to linkerd. The software wasn't receiving updates from Kubernetes about where new pods were running on the network and was routing requests to IP addresses that were no longer valid.

At 14:26, they decided to restart the several hundred linkerd instances running on the backend in the belief that doing so would fix the issue across the board. But they couldn't, because the kubelets running on the cluster's nodes were unable to fetch configuration data from the Kubernetes apiservers.


Suspecting additional issues affecting either Kubernetes or etcd, they restarted three apiserver processes. Come 15:13 and all the linkerd pods had restarted. But the banking app's services were not receiving any requests. It was, by this point, a full platform outage.

At 15:27, the engineers noticed linkerd logging a NullPointerException while trying to read the service discovery response from the apiservers. They realized the failure to parse empty responses was due to an incompatibility between the versions of Kubernetes and linkerd being run.
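linkerd itself runs on the JVM, but the class of bug translates readily: a parser that assumes a discovery response always carries addresses will blow up on an empty one. A rough Python analogue (the response shape is invented for illustration; where the JVM threw NullPointerException, Python raises TypeError):

```python
def parse_addresses(response: dict) -> list[str]:
    # Assumes the response always carries an address list -- the analogue
    # of the unguarded read that threw NullPointerException in linkerd.
    return [addr.strip() for addr in response["addresses"]]

def parse_addresses_guarded(response: dict) -> list[str]:
    # Defensive version: treat a missing or None address list as "no endpoints".
    addresses = response.get("addresses") or []
    return [addr.strip() for addr in addresses]

# A service scaled to zero replicas yields an empty discovery response.
empty_response = {"service": "new-feature", "addresses": None}

try:
    parse_addresses(empty_response)
except TypeError:
    pass  # iterating None crashes, stalling service discovery

assert parse_addresses_guarded(empty_response) == []
```

The guarded variant is the spirit of the fix: an empty response is a legitimate answer, not an error.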

To restore service, they turned to an updated version of linkerd being tested in the company's staging environment. After deploying the necessary version upgrade, they recognized that they could avoid the parse error triggered by services with no replicas by deleting those services. That allowed linkerd to resume its service discovery, and the platform started to recover.
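The workaround amounts to removing the entries the parser chokes on, so that discovery can proceed for everything else. Sketched in toy Python terms (names hypothetical; in real life this meant deleting the empty Kubernetes Service objects):

```python
# Toy registry: one service has pods behind it, one was scaled to zero
# replicas and its Service object left behind. Names are invented.
registry = {
    "payments":    ["10.0.0.4", "10.0.0.5"],
    "new-feature": None,  # empty service, no endpoints
}

def discover(registry: dict) -> dict:
    # Naive discovery pass: assumes every entry has addresses, so a
    # single empty service aborts the whole pass (as the linkerd bug did).
    return {name: list(addrs) for name, addrs in registry.items()}

try:
    discover(registry)
except TypeError:
    pass  # one empty entry poisons discovery for every service

# Workaround: delete the empty services, then rerun discovery.
for name in [n for n, addrs in registry.items() if not addrs]:
    del registry[name]

routes = discover(registry)
assert "new-feature" not in routes
assert routes["payments"] == ["10.0.0.4", "10.0.0.5"]
```

Deleting the empty entries doesn't fix the parser, but it removes the inputs that trigger the bug, which is exactly the shape of Monzo's stopgap.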

Beattie said his team "found a bug in Kubernetes and the etcd client that can cause requests to timeout after cluster reconfiguration of the kind we performed the week prior. Because of these timeouts, when the service was deployed, linkerd failed to receive updates from Kubernetes about where it could be found on the network."

Restarting the linkerd instances compounded the problem, he said, because it revealed an incompatibility between specific versions of linkerd and Kubernetes.

"I want to reassure everyone that we take this incident very seriously; it’s among the worst technical incidents that have happened in our history, and our aim is to run a bank that our customers can always depend on," Beattie concluded. "We know we let you down, and we’re really sorry for that."

The frank mea culpa appears to have been well-received by customers, with a number of them voicing appreciation for the detailed disclosure and explanation. ®