Thursday, 24 August 2006

Generic components can only get you so far

We had a strange issue with Meteor Server about two years back. Under stress, the Mem Size column (working set size, in fact) in Task Manager would be up and down like a yo-yo. I initially wondered whether the OS was trimming the working set too aggressively, and tried using the SetProcessWorkingSetSize function to increase the quota. Result: no improvement, it was still happening. The time spent in the memory allocator was causing the server to slow down significantly, and as it started to slow down, the problem would get worse, and worse, eventually virtually grinding to a halt.

To prevent overhead of context switching between multiple runnable worker processes, we moved a long time ago (before I started working on it) from a model where each client would have a dedicated worker process, to a much smaller pool of worker threads (the old mode can still be enabled for compatibility with older applications that don’t store their shared state in our session object or otherwise manage their state, but it is highly discouraged for new code). This does mean that there will be times where a client request cannot be handled because there is no worker process to handle it.

After some thought and experimentation, it became clear that what was happening was that when the server started to slow down, the incoming packets were building up in, of all things, the windows message queue. I should say at this point that we were using the Winsock ActiveX control supplied with Visual Basic 6 for all network communications. We already had a heuristic that would enable a shortcut path if the average time to handle a request exceeded a certain threshold. This shortcut path simply wasn’t fast enough.

To work around the problem, I added code that would actually close the socket when either of these conditions held. This was pretty tricky to get right as we had to reopen the socket in order to send a response out of it, and we would then need to close again if the average time still exceeded the threshold. There was at least one server release where the socket would not be reopened under certain conditions (if I recall, when both the time threshold was exceeded and a worker process became available). The memory allocation issue still occurred, but it was contained. I added an extra condition that would also close the socket if no worker process was available (this would prevent some retries from lost responses and some requests for additional blocks, both handled in the server process without using a worker, from being handled).

Then, recently, we discovered a problem with the code used to send subsequent packets of large responses, too large to fit into a single packet (the application server protocol is UDP-based). We weren’t setting the destination (RemoteHost and RemotePort properties) for these packets, assuming that this wouldn’t change. Wrong! If another packet from another client arrives (or is already queued) between calling GetData and SendData, the properties change to the source of the new packet. This sometimes meant that a client would receive half of its own response and half of a different one, which when reassembled would be gibberish (typically this would cause the client to try to allocate some enormous amount of memory, which would fail). I corrected that, but found that the log in which we (optionally) record all incoming and outgoing packets still had some blanks in it where the destination IP and port were supposed to be – these values retrieved from the RemoteHostIP and RemotePort properties. Where were these packets going? Who knows! Perhaps they were (eek!) being broadcast?

The WinSock control really isn’t designed to be a server component. Frankly it was amazing we were getting around 2,400 transactions per minute (peak) out of it. It was time to go back to the drawing board. Clearly I was going to need an asynchronous way of receiving packets, and the Windows Sockets API really isn’t conducive to use from VB6, so it was going to be a C++ component. Since string manipulation and callbacks were involved, I went with a COM object written with ATL.

I surmise that the WinSock control uses the WSAAsyncSelect API to receive notifications of new packets, and that’s why we were seeing the message queue grow with each packet received. The new component uses WSAEventSelect and has a worker thread which waits on the event for a new packet to arrive. When a packet arrives it synchronously fires an event, which has the effect of waiting until the server finishes processing the packet – either discarding it (as a duplicate, otherwise malformed, or due to excessive load), sending the next block in a multi-block response, or handing the request off to a worker process.

This does mean that there could be long delays between checking for packets. Doesn’t that cause a problem? Not really. The TCP/IP stack buffers incoming packets on a UDP socket in a small First-In-First-Out buffer. If the buffer doesn’t have enough space for an incoming packet, the oldest one in the buffer is discarded. That behaviour is perfect for our situation. You can vary the buffer size (warning, it’s in kernel mode and taken from non-paged pool, IIRC) by calling setsockopt with the SO_RCVBUF parameter.

For added performance the socket is in non-blocking mode, so on sending a packet, it simply gets buffered and the OS sends the data asynchronously.

Net result? No more problems with misdirected packets (my new API requires you to pass the destination in at the same time as the data), a step on the road to IPv6 support (the WinSock control will not be updated for that) – and a substantial performance improvement. My work computer now does 7,000 transactions per minute (peak) on the same application – and the bottleneck has moved somewhere else, because that figure was achieved with only three worker processes while the earlier one was with eight. (Hyperthreaded P4 3.0GHz). We saw much less difference in a VM (on the same machine) I’d been using to get performance baselines, but what we did see was that the performance was much more consistent with the new socket component.

The sizing for this application was previously around 1,500 transactions per minute per server, so this really gives a substantial amount of headroom.

My component would be terrible for general use – but it’s just right for this one.