Saturday, November 29, 2008

Working on NNRP proxies, or writing threaded C code..

So it turns out that I'm working on some closed-source NNRP proxy code. It sits between clients / readers and backend spool servers and directs/balances requests to the correct backend servers as required.

News is a kind of interesting setup. There are servers with just lots of articles, indexed via message ID (or a hash thereof.) There are servers with the overview databases, which keep track of article ids, message ids, group names, and all that junk. The client reader interface has a series of commands which may invoke a combination of access to both the overview databases and the article spools.

I'm working on a bit of code which began life as a reader -> spool load balancer; I'm turning it into a general reader and client facing bit of software which speaks enough NNRP to route connections to the relevant backend servers. The architecture is pretty simplistic - one thread per backend connection, one thread per client connection, "message queues" sit between all of these and use pthread primitives to serialise access. For the number of requests and concurrency, it scales quite well. It won't scale to 100,000 connections by any means but considering the article sizes (megabytes at a time) a 10GE pipe will be filled far, far before that sort of connection and request rate limit is reached.

So, today's post is going to cover a few things I've learnt whilst writing this bit of code. Its in C, so by its very nature its going to be horrible. The question is whether I can make it less horrible to work with.

Each client thread sits in a loop reading requests, parsing them, figuring out what needs to happen, then queuing messages to the relevant spool or reader server queue to be handled by one of the connection threads. Its relatively straightforward. The trick is to figure out how to keep connections around long enough so the thing you've sent the request to is still there when you reply.

There's a couple of options which are used in the codebase.

The first is what the previous authors did - they would parse an article request (ARTICLE, HEAD, BODY), create a request, push it onto the queue, and wait 1 second for a reply. If the reply didn't occur in that time they would push another request to another server. The idea is to minimise latency on the article fetches - instead of waiting around for a potentially overloaded server, they just queue requests to the other servers which may have the article and then stop queuing requests when one issues a reply. The rest of the replies then have to be dequeued and tossed away.

The second is what I did for the reader/overview side - I would parse a client request, (GROUP, STAT, XOVER, etc), create a request to the backend, push it onto the queue, and wait for the reply. The backend code took care of trying the set of commands required to handle that client request (eg a STAT would require a GROUP, then a STAT; but a STAT would only require a STAT on the backend), with explicit timeouts. If the request didn't happen by then, the backend reader thread would send a "timeout" reply to the client thread, and then attempt to complete the transaction before dequeuing the next.

There are some implications from the above!

The first method is easier to code and easier to understand conceptually - the client handles timeouts and throws away unwanted responses. The backend server code is easy - dequeue, attempt the request until completion or error, return. The problem is that there is no guaranteed time in which the client will be notified of the completion of the request.

The second method is trickier. The backend thread handles timeouts and sends them to the client thread. The backend then needs to track the NNTP transaction state so it can resume it and run the request to completion, tossing away whatever data was being returned. The benefit is that the client -will- get a message from the backend in the specified time period.

These approaches aren't mutually exclusive either. The first works better for article fetches where the isn't any code to try and monitor server performance and issue requests to servers that are responding quickly. I'm going to add that code in soon anyway. The second approach works great for the reader commands because they're either nice and quick, or they're extremely long-lived. Article replies are generally maxing out at a few megabytes. Overview commands can server back tens or hundreds of megabytes of database information and this can take time.

One of the important implications is when the client thread can be freed. In the first method, the client thread MUST stay around until all the pending article requests have been replied to in some fashion. In the second method, the client thread waits for a response to its message immediately after queuing it, so it doesn't have to reap queue events on connection shutdown.

The current crash bugs I've seen seem to be related to message queuing. I'm seeing both junk being dequeued from the client reader queue (when there should be NO messages pending in that queue once a command has been fully processed!) and I'm seeing article responses being sent to queues for clients which have been destroyed for one reason or another. I'm going to spend some time over the next few hours putting in assert()ions to track these conditions down and naff them on the head before the stack gets scrambled and I end up with a 4 gigabyte core which gives me absolutely no useful traceback. :P

Oh look, the application cored again, this time in the GNU malloc code! Time to figure out what is going on again..