Gracefully Restarting a Go Program Without Downtime

Jun 8, 2018, by Russell Jones

Introduction to Graceful Restarts

Being able to deploy a new version of your application or change its configuration in place, without downtime, has become table stakes for modern systems software. This post discusses the approaches that can be taken to gracefully restart an application, along with a functional standalone sample to dig into the details. For those not familiar with Teleport, it’s our SSH and Kubernetes privileged access management solution, written in Go and designed for elastic infrastructure. This post should be interesting to developers and SREs who build and maintain services written in Go.

Background on SO_REUSEPORT vs Duplicating Sockets

Continuing our work on making Teleport highly available, we recently spent some time investigating how to gracefully restart Teleport’s TLS and SSH listeners in GitHub issue #1679. Our goal was to be able to upgrade a Teleport binary without having to take an instance out of service.

There are two common ways to do this. First, you can set SO_REUSEPORT on the socket to allow multiple processes to bind to the same port. With this approach you have multiple accept queues feeding multiple processes.

Second, you can duplicate the socket, pass it as a file to a child process, and re-create the socket in the new process. With this approach you have a single accept queue feeding multiple processes.
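To make the single accept queue concrete, here is a minimal sketch that stays inside one process: it duplicates a listener with (*net.TCPListener).File, re-creates a listener from the duplicate with net.FileListener, and accepts on the duplicate a connection that was dialed to the original address. The helper names duplicateListener and acceptViaDuplicate are ours for illustration and are not taken from Teleport.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// duplicateListener returns a second listener backed by a duplicate of the
// original listener's file descriptor.
func duplicateListener(ln *net.TCPListener) (net.Listener, error) {
	f, err := ln.File() // duplicate descriptor for the same listening socket
	if err != nil {
		return nil, err
	}
	defer f.Close() // FileListener keeps its own copy, so ours can be closed
	return net.FileListener(f)
}

// acceptViaDuplicate dials the original listener's address but accepts the
// connection on the duplicate, then returns what the client sent.
func acceptViaDuplicate() (string, error) {
	orig, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer orig.Close()

	dup, err := duplicateListener(orig.(*net.TCPListener))
	if err != nil {
		return "", err
	}
	defer dup.Close()

	client, err := net.Dial("tcp", orig.Addr().String())
	if err != nil {
		return "", err
	}
	defer client.Close()
	if _, err := client.Write([]byte("hello")); err != nil {
		return "", err
	}

	// Nothing is accepting on orig; the connection is taken from the shared
	// accept queue through the duplicate.
	server, err := dup.Accept()
	if err != nil {
		return "", err
	}
	defer server.Close()

	buf := make([]byte, len("hello"))
	if _, err := io.ReadFull(server, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	msg, err := acceptViaDuplicate()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("duplicate listener received:", msg)
}
```

In a real restart the duplicate would live in a different process, but the socket behaves the same way: whichever holder calls Accept takes the next connection from the one queue.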

During our initial discussions, several negatives came up about SO_REUSEPORT. One of our engineers had previously used that approach and noticed that, because it uses multiple accept queues, pending TCP connections could sometimes be dropped. In addition, when we were having these discussions, Go did not make it easy to set SO_REUSEPORT on a net.Listener. However, it should be noted that within the past few days there has been movement on this issue, and it looks like Go will support setting socket options soon.
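For comparison, this is roughly what the SO_REUSEPORT approach looks like on a Go release that has net.ListenConfig (Go 1.11 and later). It is a sketch under assumptions rather than anything Teleport ships: the helper listenReusePort is hypothetical, and the option value is hard-coded to the number used on most Linux architectures so that the example needs only the standard library (golang.org/x/sys/unix exports the constant by name).

```go
package main

import (
	"context"
	"fmt"
	"net"
	"syscall"
)

// soReusePort is the value of SO_REUSEPORT on most Linux architectures.
// Assumption: this sketch targets Linux only.
const soReusePort = 0xf

// listenReusePort opens a TCP listener with SO_REUSEPORT set before bind.
func listenReusePort(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		// Control runs after the socket is created and before it is bound.
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				sockErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, soReusePort, 1)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	first, err := listenReusePort("127.0.0.1:0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer first.Close()

	// A second, independent socket bound to the same address and port.
	second, err := listenReusePort(first.Addr().String())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer second.Close()

	fmt.Println("two listeners bound to", second.Addr())
}
```

Each listener here is a separate socket with its own accept queue, which is the property that led to the dropped-connection concern described above.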

The second approach was also appealing because of its simplicity and because it follows the traditional Unix fork/exec spawning model that most developers are familiar with, where by convention all open files are passed to the child process. One thing to note: the os/exec package does not do this by default. Most likely for safety reasons, it passes only stdin, stdout, and stderr to the child process unless additional files are listed explicitly in Cmd.ExtraFiles. The os package also has lower-level primitives that can be used to pass a file to a child process, and that’s what we use.
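A minimal sketch of that lower-level primitive, assuming a Unix-like system: os.StartProcess takes an os.ProcAttr whose Files slice becomes the child's descriptor table, so the fourth entry arrives in the child as descriptor 3. The helper spawnWithListener is hypothetical, and the child is /bin/sh only so the example can confirm that descriptor 3 is a socket; a real restart would start the new binary instead.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// spawnWithListener starts the program at path and hands it a duplicate of
// the listening socket as file descriptor 3.
func spawnWithListener(path string, argv []string, ln *net.TCPListener) (*os.Process, error) {
	lf, err := ln.File() // duplicate of the listening socket
	if err != nil {
		return nil, err
	}
	defer lf.Close() // the parent's extra copy is not needed after the spawn

	return os.StartProcess(path, argv, &os.ProcAttr{
		// The index in Files is the descriptor number in the child.
		Files: []*os.File{os.Stdin, os.Stdout, os.Stderr, lf},
	})
}

// childSeesSocket reports whether a child process finds a socket on fd 3.
// Assumption: /bin/sh and /dev/fd are available on the host.
func childSeesSocket() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()

	argv := []string{"sh", "-c", "test -S /dev/fd/3"}
	proc, err := spawnWithListener("/bin/sh", argv, ln.(*net.TCPListener))
	if err != nil {
		return false, err
	}
	state, err := proc.Wait()
	if err != nil {
		return false, err
	}
	return state.Success(), nil
}

func main() {
	ok, err := childSeesSocket()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("child received a socket on fd 3:", ok)
}
```

Because the Files slice is explicit, the child receives only the descriptors the parent chooses to list, rather than everything the parent has open.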

Using Signals to Switch Socket Process Owner

Before getting into the source, it’s worthwhile providing some more details on how this approach works.

Starting a fresh Teleport process causes it to create a listener socket that receives all inbound network traffic on the assigned ports. For Teleport this is TLS and SSH traffic. We added a signal handler for SIGUSR2 that causes Teleport to duplicate the listener socket and then spawn a new process, which is passed both the listener socket (as a file) and metadata about the socket in its environment variables. Once the new process starts, it re-creates the listener socket from the passed-in file and metadata, and starts processing traffic as it arrives.
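The metadata half of that handoff can be sketched as follows. Everything named here is an assumption made for illustration: the variable name GRACEFUL_LISTENER, the JSON encoding, and the listenerMeta fields are ours, and Teleport's actual format may differ. The parent would add the encoded pair to the child's environment; the child checks for it on startup and either re-creates the inherited listener or binds a fresh one.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"os"
)

// listenerEnv is a hypothetical variable name, not Teleport's.
const listenerEnv = "GRACEFUL_LISTENER"

// listenerMeta describes one inherited socket.
type listenerMeta struct {
	Addr string `json:"addr"` // address the socket is bound to
	FD   int    `json:"fd"`   // descriptor number inside the child
}

// encodeMeta renders the KEY=value pair the parent adds to the child's
// environment.
func encodeMeta(m listenerMeta) (string, error) {
	b, err := json.Marshal(m)
	if err != nil {
		return "", err
	}
	return listenerEnv + "=" + string(b), nil
}

// decodeMeta parses the variable's value in the child.
func decodeMeta(value string) (listenerMeta, error) {
	var m listenerMeta
	err := json.Unmarshal([]byte(value), &m)
	return m, err
}

// importOrListen re-creates the inherited listener when metadata is present
// and otherwise binds a fresh socket on addr.
func importOrListen(addr string) (net.Listener, error) {
	value, ok := os.LookupEnv(listenerEnv)
	if !ok {
		return net.Listen("tcp", addr) // first start: nothing inherited
	}
	m, err := decodeMeta(value)
	if err != nil {
		return nil, err
	}
	f := os.NewFile(uintptr(m.FD), m.Addr)
	if f == nil {
		return nil, fmt.Errorf("invalid inherited descriptor %d", m.FD)
	}
	defer f.Close() // FileListener keeps its own duplicate
	return net.FileListener(f)
}

func main() {
	ln, err := importOrListen("127.0.0.1:0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer ln.Close()
	fmt.Println("listening on", ln.Addr())
}
```

Run on its own, with the variable unset, the program takes the fresh-listen path; the inherited path is exercised only when a parent has placed the socket on the advertised descriptor.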

It should be noted that when a socket is duplicated, inbound traffic is load balanced round-robin between the two sockets. As can be seen in Figure 1, this means that for a period of time both Teleport processes will be accepting new connections.

Figure 1: Teleport can duplicate itself and share traffic with the duplicate process.

Shutdown of the parent process is the same affair in reverse. Once a Teleport process receives SIGQUIT, it begins shutting down: it stops accepting new connections, then waits for existing connections to drain or for a timeout to elapse. Once traffic has cleared, the dying parent closes its copy of the listener socket and exits. From then on, the kernel sends traffic only to the new process.
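The drain step can be sketched with a sync.WaitGroup that counts in-flight connections and a timeout that bounds the wait. This is an illustrative outline, not Teleport's shutdown code: the server type, drainDemo, and the two durations are hypothetical, and main sends SIGQUIT to its own process only so the example ends without outside help.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// Hypothetical durations used by the demo.
const (
	shortSession = 10 * time.Millisecond
	longSession  = 2 * time.Second
)

// server tracks in-flight connections so shutdown can wait for them.
type server struct {
	ln      net.Listener
	conns   sync.WaitGroup
	stopped chan struct{} // closed when the accept loop exits
}

// serve accepts connections until the listener is closed.
func (s *server) serve(handle func(net.Conn)) {
	defer close(s.stopped)
	for {
		c, err := s.ln.Accept()
		if err != nil {
			return // listener closed: stop accepting
		}
		s.conns.Add(1)
		go func() {
			defer s.conns.Done()
			defer c.Close()
			handle(c)
		}()
	}
}

// shutdown stops accepting, then waits for in-flight connections to finish
// or for the timeout to elapse. It reports whether every connection drained.
func (s *server) shutdown(timeout time.Duration) bool {
	s.ln.Close()
	<-s.stopped // no further Add calls can happen after this point
	done := make(chan struct{})
	go func() {
		s.conns.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

// drainDemo serves one connection that stays open for hold, waits for a
// signal on quit when quit is non-nil, then shuts down with the timeout.
func drainDemo(hold, timeout time.Duration, quit <-chan os.Signal) (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	s := &server{ln: ln, stopped: make(chan struct{})}
	accepted := make(chan struct{})
	go s.serve(func(net.Conn) {
		close(accepted)
		time.Sleep(hold) // stand-in for a long-lived session
	})
	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		s.shutdown(timeout)
		return false, err
	}
	defer client.Close()
	<-accepted
	if quit != nil {
		<-quit
	}
	return s.shutdown(timeout), nil
}

func main() {
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGQUIT)
	// Demo only: the process signals itself so the example ends on its own.
	syscall.Kill(os.Getpid(), syscall.SIGQUIT)
	drained, err := drainDemo(shortSession, longSession, quit)
	fmt.Println("drained:", drained, "err:", err)
}
```

When the session outlasts the timeout, shutdown returns false and the process can exit anyway, which corresponds to the "or for a timeout to elapse" branch described above.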

Figure 2: Once the first process shuts down, traffic is no longer shared between two processes.

Sample Graceful Restart Walk-through

We wrote a small application that uses this approach. The source code is at the bottom of this post, and you can try it out with the following commands: