Taking time out to fix Azure TCP KeepAlives

Martin Rhodes, Developer, shares an IRC Azure issue fix.

I recently fixed an issue that had plagued a personal project of mine for some time. A while back, I started work on an idea based around an Azure worker role that connects to an IRC (Internet Relay Chat) server as a chat bot, and sits in chat channels recording the words in messages to a backend database. For those who don't know what IRC is: essentially, it's a place on the internet where people go and chat about things.

The basic idea was to be able to track hot topics of discussion in real time and, in effect, see what's trending on IRC. The more ambitious idea was to have a variety of flashy ways of displaying, filtering and comparing the live and historical data on a website frontend, as well as a secret admin area to control and configure the IRC bot, etc. But I started with the basic idea.

After deploying the worker role that connected the IRC bot to a server, the IRC chat data started streaming in, but I soon came up against a recurring issue. Sometimes, the bot would run for hours on end, capturing data without a problem. Other times, it would last a few minutes, and then stop capturing any further data, until I restarted it - at which point the cycle would continue. After a few days of this, I'd captured enough data to allow for a proof-of-concept site - so I was happy that the project had at least not been a total failure, and didn't really spend too much time trying to get to the bottom of it.

Recently, I returned to look at this issue. I fired up an Azure VM and installed VS 2017 on it so I could debug the project in situ on the machine, in the hope that it would get me closer to the answer. Fortunately, Microsoft have just started a promo on Dv2 VMs as they move to Hyper-Threaded virtual cores (with reductions of up to 28% over the old pricing), which allowed me to quickly and cheaply spin up a fast virtual machine with plenty of RAM for all my debugging needs (ProTip: D-Series VMs have CPUs that are up to 60% faster and have double the RAM of a similarly priced A-Series VM – so definitely a good option for the price).

After fixing a few unrelated bugs and getting the bot stable, I took a closer look and realised that the IRC bot would disconnect when there was no activity on the IRC server for some magical number of minutes. Contrary to the belief made popular by the hip hop group De La Soul in the 90s, it turns out that magic number was actually 4, not 3.

After around 4 minutes of complete inactivity, the IRC bot would be disconnected from the IRC server by some external force. How long the bot would last before this happened was entirely dependent on the amount of activity on the channels the bot was listening in on, which is what made the initial issue so seemingly random.

This pointed to some sort of timeout, either in the client, the IRC server, or Azure. As IRC is designed to stay connected for long periods of time without the guarantee of traffic across the network, this seemed to rule out the client and the server. Generally, if your client app does not make special provisions to periodically send data, a TCP connection with no data passing through it is kept alive using TCP KeepAlives: small packets which are sent periodically over an idle connection so that the client and server know they are still connected, even though they are not exchanging any real data.

IRC does have a mechanism for this built into the protocol - PING, which can also be used to measure lag between client and server – but if you do not have control over the IRC client’s behaviour, and there is no guarantee on how often the server will PING you to see if you’re still there, then you cannot rely solely on this.
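For readers who haven't seen the PING mechanism, here is a minimal sketch of the client side of it, written in Python for brevity rather than in the .NET the bot uses. The helper name `pong_reply` is hypothetical and is not part of SmartIrc4net; it only shows the usual exchange, where the server sends `PING` with a token and the client echoes that token back in a `PONG`.

```python
def pong_reply(line):
    """Return the PONG reply for an IRC PING line, or None for any other line."""
    # Drop the CR/LF that terminates every IRC message.
    text = line.rstrip("\r\n")
    # A message may start with an optional ":prefix " naming its origin; skip it.
    if text.startswith(":"):
        _, _, text = text.partition(" ")
    command, _, params = text.partition(" ")
    if command.upper() != "PING":
        return None
    # Echo the server's parameters back unchanged.
    return f"PONG {params}\r\n"


if __name__ == "__main__":
    print(repr(pong_reply("PING :irc.example.net\r\n")))
```

Note that this only produces traffic when the server chooses to send a PING, which is exactly why it cannot be relied on to keep an otherwise idle connection alive.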

Further investigation into the source code of the IRC NuGet package I was using (SmartIrc4net) revealed that the IRC client was indeed switching on Windows OS-level TCP KeepAlives as expected - but for some reason, it was not working.

I started to dig around for any other reports of TCP connection issues that people were having with Azure VMs, looking for any mention of the magic number 4. Countless MSDN forum and StackOverflow posts later, I finally stumbled across a trustworthy-looking link to a document on Microsoft's GitHub entitled "TCP settings for Azure VMs".

It turns out that the idle timeout for outgoing TCP connections on an Azure VM (or Worker Role) is 4 minutes. It is enforced by Azure’s networking infrastructure, and this cannot be changed. It also turns out that the default OS-level KeepAlive time for Windows Server is 2 hours. This means that, when TCP KeepAlive is turned on for a socket at the OS level, Windows will only send the first KeepAlive packet after 2 hours of inactivity, but Azure will disconnect you after only 4 minutes. This is a problem.
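The mismatch is easy to see as plain arithmetic. The following is only an illustrative sketch in Python; the constant and function names are my own, and the two durations are simply the figures quoted above.

```python
AZURE_IDLE_TIMEOUT_S = 4 * 60               # 4-minute idle timeout quoted above
WINDOWS_DEFAULT_KEEPALIVE_S = 2 * 60 * 60   # 2-hour default KeepAlive time quoted above


def survives_idle_timeout(keepalive_time_s, idle_timeout_s=AZURE_IDLE_TIMEOUT_S):
    """True if the first KeepAlive packet goes out before the idle timeout expires."""
    return keepalive_time_s < idle_timeout_s


if __name__ == "__main__":
    # The default waits 7200 seconds, long after the 240-second timeout has fired.
    print(survives_idle_timeout(WINDOWS_DEFAULT_KEEPALIVE_S))
    # A 120-second KeepAlive time fits comfortably inside the 240-second window.
    print(survives_idle_timeout(120))
```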

Luckily, the OS-level KeepAlive settings can be modified in the Windows Registry. Setting the KeepAlive time to something below 4 minutes is all that is needed to thwart Azure's inbuilt connection culling, and requires the addition of 3 new registry values under Services\Tcpip\Parameters:

KeepAliveTime – the amount of time a connection is idle for before the OS will start transmitting KeepAlive packets. Set this to 120 seconds (120000 milliseconds) – the default is 2 hours.

KeepAliveInterval – the interval between successive KeepAlive packets sent while awaiting an acknowledgement from the server. Set this to 30 seconds (30000 milliseconds).

TcpMaxDataRetransmissions – the number of retransmissions of a KeepAlive before giving up on the connection. Set this to 8.
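One way to capture these three settings in a repeatable form is a .reg file. The sketch below, in Python, only builds the text of such a file; it is not necessarily how I applied the change. The full key path (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet prepended to the path above) and the choice of DWORD values are assumptions based on standard Windows conventions, and `build_reg_file` is a hypothetical helper name.

```python
# Assumed full path of the key referred to above as Services\Tcpip\Parameters.
TCPIP_PARAMETERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
)

# The three settings described above; the first two are in milliseconds.
KEEPALIVE_VALUES = {
    "KeepAliveTime": 120_000,
    "KeepAliveInterval": 30_000,
    "TcpMaxDataRetransmissions": 8,
}


def build_reg_file(key, values):
    """Return .reg file text that sets each name to a DWORD under the given key."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, number in values.items():
        # .reg files write DWORDs as eight lowercase hexadecimal digits.
        lines.append(f'"{name}"=dword:{number:08x}')
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_reg_file(TCPIP_PARAMETERS_KEY, KEEPALIVE_VALUES))
```

Whichever way the values are written, the machine still needs a restart before they take effect.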

I made these changes in the registry of my Windows Server VM (I was using Windows Server 2016).

Once I had done this and restarted the VM, the connection no longer disconnected after 4 minutes of inactivity. Woot!

[Image: KeepAlive settings for an Azure VM]

The same fix can be applied to Worker Roles, by calling the ServicePointManager.SetTcpKeepAlive(bool enabled, int keepAliveTime, int keepAliveInterval) method in the OnStart() method of your worker role.
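The same idea of setting the KeepAlive timings from code rather than machine-wide can also be applied to a single socket. The following is a rough Python sketch of that per-socket approach, not the .NET call named above; `enable_keepalive` is a hypothetical helper, and which options are available depends on the platform, so each one is checked before use.

```python
import socket


def enable_keepalive(sock, idle_s=120, interval_s=30, probes=8):
    """Turn on TCP keepalive for one socket and tighten its timings where possible."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (enabled, idle time in ms, probe interval in ms).
        sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, idle_s * 1000, interval_s * 1000))
        return
    # Other platforms expose some or all of these options, with values in seconds.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

The defaults mirror the registry values above: a first probe after 120 seconds of idle time, then one every 30 seconds. This only helps if you control the code that creates the socket, which is not the case when a library opens the connection for you.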

Why Microsoft decided to ship their Azure VMs without TCP KeepAlive settings that are compatible with their own internal network timeouts is anyone's guess, but at least now the bot can happily remain connected to the server without Azure silently killing the connection every time there is a 4-minute lull in network activity.

Hopefully this helps someone who may be having similar issues with IRC or any other outgoing TCP connection from an Azure VM or worker role.