Cohesive Networks


Speeding up client failover on VNS3 encrypted networks

Modified on: Fri, 16 Jun, 2017 at 5:08 PM

Using VNS3 to its fullest capabilities means running a TLS tunneling agent on your cloud hosts. In our case we use the OpenVPN agent, the most-used, most auditably secure choice available in the industry.

Using this agent allows you to have an "over the top" network you control, independent of the underlying infrastructure, whether AWS VPC, Azure VNET, colo subnet, or corporate network. This provides (to name a few):

- End-to-end encrypted data in motion
- In-cloud and x-cloud UDP Multicast support
- Separation of network location from network identity for long-distance "motion" operations

In topologies using multiple VNS3 Controllers across multiple clouds, regions, and zones, there are often two controllers in a given location for failover.

Failover is always an interesting situation: how quickly should a device do it? You don't want to overreact to a small network interruption when the corrective action causes more outage than "waiting out" the interruption would.

In VNS3 you can modify the client configuration file to provide the cloud host with a primary VNS3 controller, and one or more backup controllers. This is done by adding multiple "remote" lines as documented in the VNS3 Configuration Guide pages 48 and 51.
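As a minimal sketch of such a fragment, assuming one primary and one backup controller: the addresses below are hypothetical placeholders, and the port and protocol shown (1194, udp) are assumptions, so keep whatever values your own clientpack already uses. OpenVPN tries the "remote" entries in the order they are listed.

```
# Primary VNS3 controller (hypothetical address)
remote 10.0.1.10 1194 udp
# Backup VNS3 controller, tried when the primary cannot be reached (hypothetical address)
remote 10.0.2.10 1194 udp
```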

The default failover time is approximately two minutes before a cloud host will give up on its primary controller and connect to the next backup controller in the list.

These defaults have proven reliable for cloud hosts connecting to a controller in the same VPC, in a nearby VPC in another availability zone of the same cloud, from another cloud, or as a "road warrior" VPN use case.

For cloud hosts in the same VPC and/or a nearby VPC in the same cloud and same cloud region, it may be desirable to shrink the failover window to under 30 seconds. Any less than this could create "flapping" issues, with a client stuck in an unstable state, jumping back and forth between two VNS3 controllers.

There are three variables for controlling the failover window: "ping", "ping-restart" and "hand-window".

"ping" is how often, in seconds, the client will ping the server. The ping is encrypted, and its purpose is to keep intervening firewalls from clearing the UDP connection state between the cloud host and the VNS3 controller. It does not actually affect failover.

"ping-restart" is how many seconds the client should wait without receiving a ping response OR any tunnel traffic before attempting a re-connect.

"hand-window" is short for "TLS Handshake Window (or timeout)". The default is 60 seconds, which becomes the slowest part of failover even if you reduce the ping-restart time. HOWEVER, if your client is not in the same VPC, or at least the same region, as the VNS3 controller, you don't want to shrink this too much. In the same VPC we are comfortable with 10 seconds.

The other "pause" in a failover is that the underlying OpenVPN agent has a hard-coded 10 second delay called the "restart delay". There is no variable to customize this, so without a private build of OpenVPN it cannot be reduced.

TYING IT ALL TOGETHER

To reduce failover to just about 30 seconds, for a client running in the same VPC (ideally) or at least the same cloud region/zone - try the following parameters in your "ovpn" (Windows) clientpack file, or your "conf" (Linux) clientpack file.

ping 5
ping-restart 15
hand-window 10

This gives a failover of about 30 seconds, measured from the last time the client received tunnel traffic or a ping response.

The client waits 15 seconds (ping-restart) before triggering the restart, then 10 seconds for the restart delay, then takes less than 10 seconds for the TLS handshake.

YOUR MILEAGE MAY VARY. Again, our warning is that much of HA configuration and implementation introduces MORE work than running in a highly recoverable mode. That said, there are well-defined, single-cloud-subnet use cases where shrinking the failover to under 30 seconds can be achieved.