Using CRIU to upgrade a VPN server's kernel without dropping connections

It must have been in late 2017 when Red Hat’s VPN servers were rebooted for a kernel upgrade. A few days later I was contacted by someone who knew that I work on Checkpoint/Restore in Userspace (CRIU) and who asked whether CRIU could be used to avoid VPN connections being terminated by a reboot of the VPN server. Always happy to hear about interesting use cases for CRIU, I thought this sounded like a great idea to try out and answered that it should be theoretically possible.

For those not familiar with CRIU: its goal is to “freeze” an application or process, save its state to a set of image files, and later restore it from those files so that it continues to run as it was when frozen. This enables things like container migration and application snapshots, or avoiding terminated VPN connections!

Over the next few months I thought about this use case from time to time, but never actually tried it out, until now. It took me one afternoon to set everything up, but now it actually works, and this recording of my terminal session shows the result:

Checkpointing OpenVPN

Now that we know that it is actually possible to update the kernel on an OpenVPN server without terminating the connections between clients and server, I want to go into the details of what I did to set this up.

My setup consists of two virtual machines running Red Hat Enterprise Linux 7.5 with OpenVPN installed from EPEL. CRIU has been available as a tech preview in Red Hat Enterprise Linux 7 since 7.2. One VM is running locally on my computer (client) and the other VM is running on a system about 10 kilometers away (server). After configuring OpenVPN I started it from the command line with:

$ openvpn --config server.conf

Without much preparation I tried to checkpoint the OpenVPN process using CRIU:

$ criu dump -t `pidof openvpn` -D /tmp/1

This failed pretty fast with CRIU complaining:

Error (criu/tun.c:276): Net namespace is required to dump tun link

It would have been nice if it just worked without any additional setup, but this sounds solvable. So let’s start OpenVPN in a network namespace:
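A minimal sketch of such a setup could look like the following. The interface names veth0 and veth1 and the bridge br0 are taken from the restore command used later; the namespace name vpn, the address 192.0.2.2/24 and the gateway 192.0.2.1 are assumptions made up for illustration, as are the helper function names. By default the script only prints the commands; with DRY_RUN=0 it executes them, which requires root and an existing bridge br0.

```shell
#!/bin/sh
# Sketch: run OpenVPN inside a network namespace connected to bridge br0.
# Assumed values: namespace "vpn", address 192.0.2.2/24, gateway 192.0.2.1.
NETNS="${NETNS:-vpn}"
ADDR="${ADDR:-192.0.2.2/24}"
GATEWAY="${GATEWAY:-192.0.2.1}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_vpn_netns() {
    run ip netns add "$NETNS"
    # veth0 stays on the host, veth1 is moved into the namespace.
    run ip link add veth0 type veth peer name veth1
    run ip link set veth1 netns "$NETNS"
    # Attach the host side to the bridge and bring it up.
    run ip link set veth0 master br0
    run ip link set veth0 up
    # Configure the namespace side.
    run ip netns exec "$NETNS" ip link set lo up
    run ip netns exec "$NETNS" ip addr add "$ADDR" dev veth1
    run ip netns exec "$NETNS" ip link set veth1 up
    run ip netns exec "$NETNS" ip route add default via "$GATEWAY"
}

start_openvpn() {
    # Same OpenVPN invocation as before, but inside the namespace.
    run ip netns exec "$NETNS" openvpn --config server.conf
}

setup_vpn_netns
start_openvpn
```

Running the script unchanged shows the planned commands, which makes it easy to adapt names and addresses before executing anything as root.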

Up to this point I have configured a regular OpenVPN-based VPN tunnel, with the only exception that the server part of the connection is running in a network namespace. Now that all this works I can finally checkpoint and restore my OpenVPN server process.

To checkpoint the OpenVPN process with criu dump, the following parameters are used:

-t `pidof openvpn` - this gives CRIU the process ID (PID) of the process that should be checkpointed

-D /tmp/1 - the directory CRIU should use to write the checkpoint images to

--ext-mount-map auto --enable-external-sharing --enable-external-masters - these options help CRIU correctly handle checkpointing a process in a network namespace. If CRIU is used as part of a container runtime to checkpoint and restore a container, this happens automatically.
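Putting the parameters described above together gives a dump invocation along the following lines. This is a sketch assembled from the options listed in the text and may not match the original command exactly; the helper names and the OPENVPN_PID variable are hypothetical. By default the script only prints the commands; with DRY_RUN=0 it executes them, which requires root.

```shell
#!/bin/sh
# Sketch: assemble the "criu dump" call from the options described above.
IMAGES_DIR="${IMAGES_DIR:-/tmp/1}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

checkpoint_openvpn() {
    pid="$1"
    # Make sure the directory for the checkpoint images exists.
    run mkdir -p "$IMAGES_DIR"
    run criu dump -t "$pid" -D "$IMAGES_DIR" \
        --ext-mount-map auto \
        --enable-external-sharing --enable-external-masters
}

# Only act when a PID was passed in, for example:
#   OPENVPN_PID=$(pidof openvpn) sh dump.sh
if [ -n "${OPENVPN_PID:-}" ]; then
    checkpoint_openvpn "$OPENVPN_PID"
fi
```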

CRIU will complain if something did not work, but if the checkpointing was successful it will write the checkpoint image to the specified directory. At this point the OpenVPN process will be gone and the ping command on the client side will stop receiving replies. The client side still thinks that the VPN tunnel is alive but the server process is gone.

The next step is to restore the checkpointed OpenVPN process:

$ criu restore -D /tmp/1 --external veth[veth1]:veth0@br0 -d

To restore the OpenVPN process, the following parameters are used:

restore - this tells CRIU to restore a process

-D /tmp/1 - the checkpoint image of the process to restore can be found in the directory /tmp/1

--external veth[veth1]:veth0@br0 - this option tells CRIU how to wire the network interface in the network namespace to the host’s network setup. This is basically the same as the manual setup of the network namespace when the VPN server process was initially started.

-d - this tells CRIU to run the restored process in the background

If CRIU succeeds in restoring the process, the OpenVPN server process will continue to run and the ping command on the client side should receive replies again.

Now the basic functionality, checkpointing and restoring the OpenVPN server process, is working. All that is left to do is to combine the checkpointing and restoring with a reboot. I am using kexec to reboot because it should be faster than a full reboot, which should reduce the downtime of the OpenVPN server process.

This tells the kernel which kernel image and which ramdisk it should use for the kexec reboot:
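A minimal sketch of such a call is shown below. It assumes the Red Hat Enterprise Linux naming scheme /boot/vmlinuz-<version> and /boot/initramfs-<version>.img, and it reuses the command line of the running kernel via --reuse-cmdline, which is an assumption; an explicit --command-line=... could be passed instead. KVER is a hypothetical variable holding the version string of the newly installed kernel. By default the script only prints the command; with DRY_RUN=0 it executes it, which requires root.

```shell
#!/bin/sh
# Sketch: load a new kernel and ramdisk for a later kexec reboot.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

load_kernel() {
    kver="$1"
    # -l only loads the kernel; it does not reboot yet.
    run kexec -l "/boot/vmlinuz-$kver" \
        --initrd="/boot/initramfs-$kver.img" \
        --reuse-cmdline
}

# Only act when a kernel version was passed in, for example:
#   KVER=<new-kernel-version> sh load-kernel.sh
if [ -n "${KVER:-}" ]; then
    load_kernel "$KVER"
fi
```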