Question

Can't schedule workloads after upgrade of Kubernetes from 1.12.7 to 1.12.8

Fortunately I don’t have mission-critical services running on my cluster, and now I likely never will :)

I started the automatic upgrade of my DO-managed Kubernetes cluster from 1.12.7 to 1.12.8. After the upgrade of the master control plane went through, I expected the worker nodes to be upgraded as well. However, the process got stuck and the nodes are not being upgraded.

So currently I can no longer schedule new workloads as all new pods are going into the state ContainerCreating and are stuck there.
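
To see which pods are affected and on which nodes they sit, the stuck pods can be filtered out of the pod list. The following is a minimal sketch, not an official tool: it assumes the output of kubectl get pods --all-namespaces -o json has already been parsed into a dict, and the function name stuck_pods is made up for illustration.

```python
def stuck_pods(pod_list, reason="ContainerCreating"):
    """Return (namespace, name, node) for every pod that has at least one
    container waiting with the given reason.

    pod_list is the parsed JSON of `kubectl get pods --all-namespaces -o json`.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for container in statuses:
            # A container's state holds exactly one of: waiting, running, terminated.
            waiting = container.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == reason:
                meta = pod["metadata"]
                # nodeName is absent while a pod has not been scheduled yet.
                node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
                stuck.append((meta["namespace"], meta["name"], node))
                break  # one waiting container is enough to report the pod
    return stuck
```

One way to use it is to pipe the kubectl output into a small script that calls json.load(sys.stdin) and passes the result to stuck_pods. If all reported pods sit on the old nodes, that points at the nodes rather than at the workloads.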

I tried to resize the node pool, which caused a new droplet to be spun up with the up-to-date Kubernetes version from DO (Debian do-kube-1.12.8-do.4). Using kubectl get nodes I can see the old nodes still in status Ready, while the new node (even after around 30 minutes) still shows up as NotReady and its latest event is Kubelet starting.
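
The same comparison of readiness and version per node can be done programmatically. This is a rough sketch under the assumption that the output of kubectl get nodes -o json has been parsed into a dict; node_summary is a hypothetical helper name, not part of any DO or Kubernetes tooling.

```python
def node_summary(node_list):
    """Return (name, is_ready, kubelet_version) for each node.

    node_list is the parsed JSON of `kubectl get nodes -o json`.
    A node counts as ready only if its "Ready" condition has status "True";
    "False" and "Unknown" are both shown as NotReady by kubectl.
    """
    rows = []
    for node in node_list.get("items", []):
        status = node.get("status", {})
        is_ready = False
        for condition in status.get("conditions", []):
            if condition.get("type") == "Ready":
                is_ready = condition.get("status") == "True"
        version = status.get("nodeInfo", {}).get("kubeletVersion", "unknown")
        rows.append((node["metadata"]["name"], is_ready, version))
    return rows
```

In a situation like the one described here, the old nodes would be expected to show up as ready with a 1.12.7 kubelet and the new node as not ready with a 1.12.8 kubelet, which makes the version split across the pool easy to see.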

In addition, all the old but still running droplets in the node pool no longer report metrics to the DO web interface.

Any idea what I can do? If nothing works I will probably have to tear down the whole cluster and set it up from scratch again. To be honest, this is very worrying to me.

I’m the Engineering Manager on the Kubernetes team at DO and I wanted to follow up on Ethan’s previous posts as we’ve continued to work through the resolution of this issue.

A recent update to our auto-upgrade process introduced drift in our upgrade logic that resulted in the failed upgrades we’ve been discussing here. Our team has identified the problem and is in the process of rolling out a fix. The data and workloads in your clusters will remain intact. Affected clusters that are within their maintenance window will resume and complete the upgrade process when the fix is rolled out. If your cluster is now outside of its scheduled maintenance window, you can recycle your worker nodes in the cloud panel to trigger and resume their upgrade.

Unfortunately, our testing pipeline did not catch this issue. As is common procedure at DO, we will be following up on this incident with an internal retrospective that will help us evolve the system and testing pipeline to be more resilient going forward. We understand that you place your trust in our platform and sincerely apologize for any trouble this has caused. Our system and processes will get better as a result of this.