Writing a reverse proxy/loadbalancer from the ground up in C, part 2: handling multiple connections with epoll

This is the second step along my road to building a simple C-based reverse proxy/loadbalancer so that I can understand how nginx/OpenResty works — more background here. Here’s a link to the first part, where I showed the basic networking code required to write a proxy that could handle one incoming connection at a time and connect it with a single backend.

This (rather long) post describes a version that uses Linux’s epoll API to handle multiple simultaneous connections — but it still just sends all of them down to the same backend server. I’ve tested it using the Apache ab server benchmarking tool, and over a million requests, 100 running concurrently, it adds about 0.1ms to the average request time as compared to a direct connection to the web server, which is pretty good going at this early stage. It also doesn’t appear to leak memory, which is doubly good going for someone who’s not coded in C since the late 90s. I’m pretty sure it’s not totally stupid code, though obviously comments and corrections would be much appreciated!

[UPDATE: there’s definitely one bug in this version: it doesn’t gracefully handle cases when we can’t send data to the client as fast as we’re receiving it from the backend. More info here.]

Just like before, the code that I’ll be describing is hosted on GitHub as a project called rsp, for “Really Simple Proxy”. It’s MIT licensed, and the version of it I’ll be walking through in this blog post is as of commit f51950b213. I’ll copy and paste the code that I’m describing into this post anyway, so if you’re following along there’s no need to do any kind of complicated checkout.

Before we dive into the code, though, it’s worth talking about epoll a bit.

You’ll remember that the code for the server in my last post went something like this:

while True:
    wait for a new incoming connection from a client
    handle the client connection

…where the code to handle the client connection was basically:

connect to the backend
read a block's worth of stuff from the client
send it to the backend
while there's still stuff to be read from the backend:
    send it to the client

Now, all of those steps to read stuff, or to wait for incoming connections, were blocking calls — we made the call, and when there was data for us to process, the call returned. So handling multiple connections would have been impossible, as (say) while we were waiting for data from one backend we would also have to be waiting for new incoming connections, and perhaps reading from other incoming connections or backends. We’d be trying to do several things at once.

That sounds like the kind of problem threads, or even cooperating processes, were made for. That’s a valid solution, and was the normal way of doing it for a long time. But that’s changed (at least in part). To see why, think about what would happen on a very busy server, handling hundreds or thousands of concurrent connections. You’d have hundreds or thousands of threads or processes — which isn’t a huge deal in and of itself, but they’d all be spending most of their time just sitting there using up memory while they were waiting for data to come in. Processes, or even threads, consume a non-trivial amount of machine resources at that kind of scale, and while there’s still a place for them, they’re an inefficient way to do this kind of work.

A very popular metaphor for network servers like this these days is to use non-blocking IO. It’s a bit of a misnomer, but there’s logic behind it. The theory is that instead of having your “read from the backend server” or “wait for an incoming connection” call just sit there and not return until something’s available, it doesn’t block at all — if there’s nothing there to return, it just returns a code saying “nothing for you right now”.
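To make that concrete, here’s a minimal sketch of what “nothing for you right now” looks like at the system-call level. It isn’t code from rsp; the helper name is made up for illustration, and it uses a pipe rather than a socket just to keep it short. With O_NONBLOCK set, a read on an FD with no data returns -1 straight away with errno set to EAGAIN (or EWOULDBLOCK) instead of sitting there waiting.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not part of rsp): returns 1 if a non-blocking read on
 * an empty pipe fails immediately with EAGAIN/EWOULDBLOCK, 0 if it does
 * something else, and -1 if the setup itself fails. */
int empty_nonblocking_read_would_block(void)
{
    int fds[2];
    if (pipe(fds) == -1) {
        return -1;
    }

    /* Switch the read end of the pipe to non-blocking mode. */
    int flags = fcntl(fds[0], F_GETFL, 0);
    if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Nothing has been written, so a blocking read would hang here. */
    char buffer[16];
    ssize_t bytes_read = read(fds[0], buffer, sizeof(buffer));
    int saved_errno = errno;

    close(fds[0]);
    close(fds[1]);

    return bytes_read == -1 &&
           (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK);
}
```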

Now obviously, you can’t write your code so that it’s constantly running through a list of the things you’re waiting for saying “anything there for me yet?” because that would suck up CPU cycles to no real benefit. So what the non-blocking model does in practice is provide you with a way to register a whole bunch of things you’re interested in, and then give you a blocking (told you it was a misnomer) function that basically says “let me know as soon as there’s anything interesting happening on any of these”. The “things” that you’re waiting for stuff on are file descriptors. So the previous loop could be rewritten using this model to look something like this:

add the "incoming client connection" file descriptor to the list of things I'm interested in
while True:
    wait for an event on the list of things I'm interested in
    if it's an incoming client connection:
        get the file descriptor for the client connection, add it to the list
        connect to the backend
        add the backend's file descriptor to the list
    else if it's data from a client connection:
        send it to the associated backend
    else if it's data from a backend:
        send it to the associated client connection

…with a bit of extra logic to handle closing connections.

It’s worth noting that this updated version can not only process multiple connections with just a single thread — it can also handle bidirectional communication between the client and the backend. The previous version read once from the client, sent the result of that read down to the backend, and from then on only sent data from the backend to the client. The code above keeps sending stuff in both directions, so if the client sends something after the initial block of data, while the backend is already replying, then it gets sent to the backend. This isn’t super-useful for simple HTTP requests, but for persistent connections (like WebSockets) it’s essential.

There have been many system calls that have served as the “wait for an event on the list of things I’m interested in” call in Unix/POSIX-like environments over the years (select and poll, for example) but they had poor performance as the number of file descriptors got large.

The popular solution, in Linux at least, is epoll. It can handle huge numbers of file descriptors with minimal reduction in performance. (The equivalent in FreeBSD and Mac OS X is kqueue, and according to Wikipedia there’s something similar in Windows and Solaris called “I/O Completion Ports”.)

The rest of this post shows the code I wrote to use epoll in a way that (a) makes sense to me and feels like it will keep making sense as I add more stuff to rsp, and (b) works pretty efficiently.

Just as before, I’ll explain it by working through the code. There are a bunch of different files now, but the main one is still rsp.c, which now has a main routine that starts like this:

So, some pretty normal initialisation stuff to check the command-line parameters and put them into meaningfully-named variables. (Sharp-eyed readers will have noticed that I’ve updated my code formatting — I’m now putting the * to represent a pointer next to the type to which it points, which makes more sense to me than splitting the type definition with a space, and I’ve also discovered that the C99 standard allows you to declare variables anywhere inside a function, which I think makes the code much more readable.)

epoll not only allows you to wait for stuff to happen on multiple file descriptors at a time — it’s also controlled by its own special type of file descriptor. You can have multiple epoll FDs in a program, each of which gives you the ability to wait for changes on a different set of normal FDs. A specific normal FD could be in several different epoll FDs’ sets of things to listen to. You can even add one epoll FD to the list of FDs another epoll FD is watching if you’re so inclined. But we’re not doing anything quite that complicated here.

You create a special epoll FD using either epoll_create or epoll_create1. epoll_create is pretty much deprecated now (see the link for details) so we just use epoll_create1 in its simplest form, and bomb out if it returns an error value.
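As a rough sketch of that step (the function name and error message here are illustrative, not necessarily what’s in rsp.c), creating the epoll FD and bombing out on failure looks something like this. Passing 0 as the flags argument is the “simplest form” of epoll_create1.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

/* Hypothetical helper: create an epoll instance, or exit with an error
 * message if the kernel refuses. */
int create_epoll_fd_or_die(void)
{
    int epoll_fd = epoll_create1(0);   /* 0 means no special flags */
    if (epoll_fd == -1) {
        perror("Couldn't create epoll FD");
        exit(1);
    }
    return epoll_fd;
}
```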

So now we have an epoll instance ready to go, and we need to register some file descriptors with it to listen to. The first one we need is the one that will wait for incoming connections from clients. Here’s the code that does that in rsp.c

This is all code that’s using an abstraction I’ve built on top of epoll that makes it easy to provide callback functions that are called when an event happens on a file descriptor, so it’s worth explaining that now. Let’s switch to the file epollinterface.c. This defines a function add_epoll_handler that looks like this:

The important system call in there is epoll_ctl. This is the function that allows you to add, modify and delete file descriptors from the list that a particular epoll file descriptor is watching. You give it the epoll file descriptor, an operation (EPOLL_CTL_ADD, EPOLL_CTL_MOD or EPOLL_CTL_DEL), the normal file descriptor you’re interested in events for, and a pointer to a struct epoll_event. The event you pass in has two fields: an event mask saying which events on the file descriptor you’re interested in, and some data.
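Here’s a minimal sketch of an epoll_ctl registration, under the assumption that we want to stash an arbitrary pointer in the event’s data. The name watch_fd and its parameters are made up for this example; only epoll_ctl, EPOLL_CTL_ADD and struct epoll_event come from the real API.

```c
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical helper: ask epoll_fd to start watching fd for the events in
 * event_mask, remembering user_data so we can get it back when an event
 * fires. Returns 0 on success, or -1 with errno set by epoll_ctl. */
int watch_fd(int epoll_fd, int fd, uint32_t event_mask, void *user_data)
{
    struct epoll_event event;
    event.events = event_mask;     /* which events we care about */
    event.data.ptr = user_data;    /* handed back to us by epoll_wait */
    return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event);
}
```

Registering the same FD twice with EPOLL_CTL_ADD fails (you’d use EPOLL_CTL_MOD to change an existing registration), which is why the helper just passes epoll_ctl’s return value back to the caller.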

The data is interesting. When you do the “block until something interesting has happened on one or more of this epoll FD’s file descriptors” call, it returns a list of results. Obviously, you want to be able to work out for each event where it came from so that you can work out what to do with it. Now, this could have been done by simply returning the file descriptor for each. But epoll’s designers were a bit cleverer than that.

The thing is, if all epoll gave you was a set of file descriptors that have had something happen to them, then you would need to maintain some kind of control logic saying “this file descriptor needs to be handled by that bit of code over there, and this one by that code”, and so on. That could get complicated quickly. You only need to look at the code of some of the epoll examples on the net to see that while it might make sample code easier to understand at a glance, it won’t scale. (I should make it clear that this isn’t a criticism of the examples, especially the one I linked to, which is excellent — just my opinion that non-trivial non-sample code needs a different pattern.)

So, when epoll tells you that something’s happened on one of the file descriptors you’re interested in, it gives you an epoll event just like the one you registered the FD with, with the events field set to the bitmask of the events you’ve received (rather than the set of the events you’re interested in) and the data field set to whatever it was you gave it at registration time.
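For reference, the epoll man pages declare the event type along the lines shown in the comment below. The little function after it is just an illustrative sketch (the name is mine) of the one rule that matters: whichever member of the union you write at registration time is the one you should read back later.

```c
#include <stdint.h>
#include <sys/epoll.h>

/* As documented in the epoll_ctl man page (already declared for us by
 * <sys/epoll.h>, so it is only repeated here as a comment):
 *
 *     typedef union epoll_data {
 *         void     *ptr;
 *         int       fd;
 *         uint32_t  u32;
 *         uint64_t  u64;
 *     } epoll_data_t;
 *
 *     struct epoll_event {
 *         uint32_t     events;   // bitmask of epoll events
 *         epoll_data_t data;     // user data
 *     };
 */

/* Hypothetical helper: store a pointer in the union and read it back
 * through the same member. */
void *round_trip_through_epoll_data(void *pointer)
{
    struct epoll_event event;
    event.events = EPOLLIN;
    event.data.ptr = pointer;   /* written via .ptr ... */
    return event.data.ptr;      /* ... so read via .ptr too */
}
```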

Aside for people newish to C — this was something I had to refresh myself on — a C union is a type that allows you to put any value from a set of types into a variable. So in a variable with the type specification above, you can store either a pointer to something (void*), an integer, or one of two different types of specifically-sized integers. When you retrieve the value, you have to use the field name appropriate to the type of thing you put in there — for example, if you were to store a 32-bit integer in the data using the u32 name and then retrieve it using the ptr variable, the result would be undefined. (Well, on a 32-bit machine it would probably be a pointer to whatever memory address was represented by that 32-bit integer, but that’s unlikely to be what you wanted.)

In this case, we’re using the data pointer inside the union, and we’re setting it to a pointer to a struct epoll_event_handler. This is a structure I’ve created to provide callback-like functionality from epoll. Let’s take a look — it’s in epollinterface.h:

A callback function to handle an epoll event which takes a pointer to a epoll_event_handler structure, and a uint32_t which will hold the bitmask representing the events that need to be handled

And a pointer to something called closure; basically, a place to store any data the callback function needs to do its job.
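Putting that description together, a structure of this shape would do the job. Treat it as a sketch reconstructed from the description rather than a copy of epollinterface.h: the field name handle in particular is my assumption here, and record_events is a toy callback that only exists to show how the closure gets used.

```c
#include <stdint.h>

/* Sketch of the handler structure: an FD, a callback, and a closure. */
struct epoll_event_handler {
    int fd;
    void (*handle)(struct epoll_event_handler *self, uint32_t events);
    void *closure;
};

/* Toy callback (hypothetical): treats the closure as a uint32_t and
 * records the event mask it was called with. */
void record_events(struct epoll_event_handler *self, uint32_t events)
{
    uint32_t *last_events = self->closure;
    *last_events = events;
}
```

A caller would fill in the three fields and then invoke the callback as handler.handle(&handler, events), which is all the event loop needs to know how to do.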

Right. Now we have a function called add_epoll_handler that knows how to add a file descriptor and an associated structure to an epoll FD’s list of things it’s interested in so that it’s possible to do a callback with data when an event happens on the epoll FD. Let’s go back to the code in rsp.c that was calling this. Here it is again:

This presumably now makes sense — we’re creating a special handler to handle events on the server socket (that is, the thing that listens for incoming client connections) and we’re then adding that to our epoll FD, with an event mask that says that we’re interested in hearing from it when there’s something to read on it — that is, a client connection has come in.

Let’s put aside how that server socket handler works for a moment, and finish with rsp.c. The next lines look like this:

Pretty simple. We print out our status, then call this do_reactor_loop function, then return. do_reactor_loop is obviously the interesting bit; it’s another part of the epoll abstraction layer, and it basically does the “while True” loop in the pseudocode above — it waits for incoming events on the epoll FD, and when they arrive it extracts the appropriate handler, and calls its callback with its closure data. Let’s take a look, back in epollinterface.c:

It’s simple enough. We go into a never-ending loop, and each time around we call epoll_wait, which, as you’d expect, is the magic function that blocks until events are available on any one of the file descriptors that our epoll FD is interested in. It takes an epoll FD to wait on, a place to store incoming events, a maximum number of events to receive right now, and a timeout. As we’re saying “no timeout” with that -1 as the last parameter, then when it returns, we know we have an event — so we extract its handler, and call it with the appropriate data. And back around the loop again.
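A sketch of that loop, under some assumptions: the handler structure is the shape described earlier (with handle as my guess at the callback field’s name), and I’ve split the body of the loop out into a hypothetical dispatch_one_event function with a timeout parameter so that a single pass can be exercised on its own. The read_one_byte callback is a toy that exists only for demonstration.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

struct epoll_event_handler {
    int fd;
    void (*handle)(struct epoll_event_handler *self, uint32_t events);
    void *closure;
};

/* One pass of the loop body: wait for at most one event and dispatch it.
 * Returns 1 if an event was handled, 0 on timeout, -1 on error. */
int dispatch_one_event(int epoll_fd, int timeout_ms)
{
    struct epoll_event current_event;
    int count = epoll_wait(epoll_fd, &current_event, 1, timeout_ms);
    if (count <= 0) {
        return count;
    }
    /* The data pointer is whatever we registered with epoll_ctl. */
    struct epoll_event_handler *handler = current_event.data.ptr;
    handler->handle(handler, current_event.events);
    return 1;
}

/* The never-ending version: -1 means "no timeout". An interrupted wait
 * (EINTR) is retried; any other error ends the loop. */
void reactor_loop_sketch(int epoll_fd)
{
    for (;;) {
        if (dispatch_one_event(epoll_fd, -1) == -1 && errno != EINTR) {
            return;
        }
    }
}

/* Toy callback: read one byte from the handler's FD into the closure. */
void read_one_byte(struct epoll_event_handler *self, uint32_t events)
{
    (void)events;
    char *destination = self->closure;
    if (read(self->fd, destination, 1) != 1) {
        *destination = '\0';
    }
}
```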

One interesting thing here: as you’d expect from the parameters, epoll_wait can actually get multiple events at once; the 1 we’re passing in as the penultimate parameter is to say “just give us one”, and we’re passing in a pointer to a single struct epoll_event. If we wanted more than one then we’d pass in an array of struct epoll_events, with a penultimate parameter saying how long it is so that epoll_wait knew the maximum number to get in this batch. When you call epoll_wait with a smaller “maximum events to get” parameter than the number that are actually pending, it will return the maximum number you asked for, and then the next time you call it will give you the next ones in its queue immediately, so the only reason to get lots of them in one go is efficiency.

But I’ve noticed no performance improvements from getting multiple epoll events in one go, and only accepting one event at a time has one advantage. Imagine that you’re processing an event on a backend socket’s FD, which tells you that the backend has closed the connection. You then close the associated client socket, and you free up the memory for both the backend and the client socket’s handlers and closures. Closing the client socket means that you’ll never get any more events on that client socket (it automatically removes it from any epoll FDs’ lists that it’s on). But what if there was already an event for the client socket in the event array that was returned from your last call to epoll_wait, and you’d just not got to it yet? If that happened, then when you did try to process it, you’d try to get its handler and closure data, which had already been freed. This would almost certainly cause the server to crash. Handling this kind of situation would make the code significantly more complicated, so I’ve dodged it for now, especially given that it doesn’t seem to harm the proxy’s speed.

So that’s our reactor loop (the name “reactor” comes from Twisted and I’ve doubtless completely misused the word). The code that remains unexplained is in the event-handlers. Let’s start off by looking at the one we skipped over earlier — the server_socket_event_handler that we created back in rsp.c to listen for incoming connections. It’s in server_socket.c, and the create_server_socket_handler function called from rsp.c looks like this:

So, we create and bind to a server socket, to get a file descriptor for it. You’ll remember that terminology from the last post, and in fact the create_and_bind function (also defined in server_socket.c) is exactly the same code as we had to do the same job in the original single-connection server.

Now, we do our first new thing — we tell the system to make our server socket non-blocking, which is obviously important if we don’t want calls to get data from it to block:

make_socket_non_blocking(server_socket_fd);

This isn’t a system call, unfortunately; it’s a utility function defined in netutils.c, so let’s jump over and take a look:

Each socket has a number of flags associated with it that control various aspects of it. One of these is whether or not it’s non-blocking. So this code simply gets the bitmask that represents the current set of flags associated with a socket, ORs in the “non-blocking” bit, and then applies the new bitmask to the socket. Easy enough, and thanks to Banu Systems for a neatly-encapsulated function for that in their excellent epoll example.
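In outline, the get-flags, OR, set-flags dance looks like the sketch below. This is my own minimal version, using names I’ve made up (set_non_blocking and the demo function), not the exact function from netutils.c; the fcntl calls with F_GETFL, F_SETFL and O_NONBLOCK are the standard way to do it.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: OR the non-blocking bit into fd's status flags.
 * Returns 0 on success, -1 on failure. */
int set_non_blocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);          /* current flag bitmask */
    if (flags == -1) {
        return -1;
    }
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
        return -1;
    }
    return 0;
}

/* Demo: create a TCP socket, make it non-blocking, and report whether the
 * flag is now set (1), not set (0), or something failed (-1). */
int demo_non_blocking_socket(void)
{
    int socket_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (socket_fd == -1) {
        return -1;
    }
    int result = -1;
    if (set_non_blocking(socket_fd) == 0) {
        int flags = fcntl(socket_fd, F_GETFL, 0);
        if (flags != -1) {
            result = (flags & O_NONBLOCK) ? 1 : 0;
        }
    }
    close(socket_fd);
    return result;
}
```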

Let’s get back to the create_server_socket_handler function in server_socket.c.

listen(server_socket_fd, MAX_LISTEN_BACKLOG);

You’ll remember this line from the last post, too. One slight difference: in the first example, we had MAX_LISTEN_BACKLOG set to 1. Now it’s much higher, at 4096. This came out of the ApacheBench tests I was doing with large numbers of simultaneous connections. If you’re running a server, and it gets lots of incoming connections and goes significantly past its backlog, then the OS can assume someone’s running a SYN flood denial of service attack against you. You’ll see stuff like this in syslog:

Thanks to Erik Dubbelboer for an excellent writeup on how this happens and why. A value of 4096 for the maximum backlog seems to be fine in terms of memory usage and allows this proxy to work well enough for the number of connections I’ve tested it with so far.

We create a closure of a special structure type that will contain all of the information that our “there’s an incoming client connection” callback will need to do its job, and fill it in appropriately.

…then we create a struct epoll_event_handler with the FD, the handler function, and the closure, and return it.

That’s how we create a server socket that can listen for incoming client connections, which when added to the event loop that the code in epollinterface.c defined, will call an appropriate function with the appropriate data.

Next, let’s look at that callback. It’s called handle_server_socket_event, and it’s also in server_socket.c.

We need to be able to extract information from the closure we set up originally for this handler, so we start off by casting it to the appropriate type. Next, we need to accept any incoming connections. We loop through all of them, accepting them one at a time; we don’t know up front how many there will be to accept so we just do an infinite loop that we can break out of:

There are two conditions under which the call to accept will fail, and in both of them it returns -1:

if (client_socket_fd == -1) {

Firstly if there’s nothing left to accept. If that’s the case, we break out of our loop:

    if ((errno == EAGAIN) || (errno == EWOULDBLOCK)) {
        break;

Secondly, if there’s some kind of weird internal error. For now, this means that we just exit the program with an appropriate error message.

    } else {
        perror("Could not accept");
        exit(1);
    }
}

If we were able to accept an incoming client connection, we need to create a handler to look after it, which we’ll have to add to our central epoll handler. This is done by a new function, handle_client_connection.

So, in summary — when we get a message from the server socket file descriptor saying that there are one or more incoming connections, we call handle_server_socket_event, which accepts all of them, calling handle_client_connection for each one. We have to make sure that we accept them all, as we won’t be told about them again. (This is actually slightly surprising, for reasons I’ll go into later.)
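Pulled together into one piece, the accept-everything loop looks roughly like the sketch below. It’s a simplified stand-in rather than the real handle_server_socket_event: the function name, the callback parameter and the close_client helper are all made up for illustration, and it returns -1 on an unexpected error where the real code exits.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical sketch: accept every connection currently pending on a
 * non-blocking listening socket, passing each new FD to on_client.
 * Returns the number accepted, or -1 on an unexpected accept error. */
int accept_all_pending(int server_socket_fd, void (*on_client)(int client_fd))
{
    int accepted = 0;
    for (;;) {
        int client_socket_fd = accept(server_socket_fd, NULL, NULL);
        if (client_socket_fd == -1) {
            if ((errno == EAGAIN) || (errno == EWOULDBLOCK)) {
                break;          /* nothing left to accept */
            }
            return -1;          /* some other, unexpected failure */
        }
        on_client(client_socket_fd);
        accepted++;
    }
    return accepted;
}

/* Toy callback for demonstration: just hang up on the client. */
void close_client(int client_fd)
{
    close(client_fd);
}
```

The loop only terminates cleanly because the server socket is non-blocking; on a blocking socket the final accept would simply wait for the next client instead of returning EAGAIN.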

All this means that our remaining unexplained code is what happens from handle_client_connection onwards. This is also in server_socket.c, and is really simple:

We just create a new kind of handler, one for handling events on client sockets, and register it with our epoll loop saying that we’re interested in events when data comes in, and when the remote end of the connection is closed.

Onward to the client connection handler code, then! create_client_socket_handler is defined in client_socket.c, and looks like this:

This code should be pretty clear by now. We make the client socket non-blocking, create a closure to store data for callbacks relating to it (in this case, the client handler needs to know about the backend so that it can send data to it), then we set up the event handler object, create the backend connection, and return the handler. There are two new functions being used here — handle_client_socket_event and connect_to_backend, both of which do exactly what they say they do.

Let’s consider connect_to_backend first. It’s also in client_socket.c, and I won’t copy it all in here, because it’s essentially exactly the same code as was used in the last post to connect to a backend. Once it’s done all of the messing around to get the addrinfo, connect to the backend, and get an FD that refers to that backend connection, it does a few things that should be pretty clear:

The same pattern as before — create a handler to look after that FD, passing in information for the closure (in this case, just as the client handler needed to know about the backend, the backend needs to know about this client), then we add the handler to the epoll event loop, once again saying that we’re interested in knowing about incoming data and when the remote end closes the connection.

The only remaining client connection code that we’ve not gone over is handle_client_socket_event. Here it is:

Secondly, perhaps the remote end has closed the connection. We don’t always get an official “remote hung up” if this happens, so we explicitly close the connection if that happens, also closing our connection to the backend.

…or in other words, if there’s been some kind of error or the remote end hung up, we unceremoniously close the connection to the backend and the client connection itself.

}

And that’s the sum of our event handling from client connections.

There’s one interesting but perhaps non-obvious thing happening in that code. You’ll remember that when we were handling the “incoming client connection” event, we had to carefully accept every incoming connection because we weren’t going to be informed about it again. In this handler, however, we read a maximum of BUFFER_SIZE bytes (currently 4096). What if there were more than 4096 bytes to read?

Explaining this requires a little more background on epoll. Epoll can operate in two different modes — edge-triggered and level-triggered. Level-triggered is the default, so it’s what we’re using here. In level-triggered mode, if you receive an epoll notification that there’s data waiting for you, and only read some of it, then epoll notes that there’s still unhandled data waiting, and schedules another event to be delivered later. By contrast, edge-triggered mode only informs you once about incoming data. If you don’t process it all, it won’t tell you again.
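Here’s a small self-contained experiment that shows the difference; it’s an illustrative sketch, not code from rsp, and the function name is made up. It writes 10 bytes into a pipe and then reads them 4 at a time, asking epoll before each read whether there’s anything to read and counting how many times epoll says yes. Passing 0 as extra_flags gives the default level-triggered behaviour, and passing EPOLLET switches to edge-triggered.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical experiment: returns how many times epoll reports the pipe as
 * readable while we drain 10 bytes in 4-byte reads, or -1 on setup failure. */
int count_wakeups(uint32_t extra_flags)
{
    int fds[2];
    if (pipe(fds) == -1) {
        return -1;
    }
    int epoll_fd = epoll_create1(0);
    int wakeups = -1;

    struct epoll_event event;
    event.events = EPOLLIN | extra_flags;
    event.data.fd = fds[0];

    if (epoll_fd != -1 &&
        epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fds[0], &event) == 0 &&
        write(fds[1], "0123456789", 10) == 10) {
        wakeups = 0;
        struct epoll_event ready;
        /* A timeout of 0 means "tell me what's ready now, don't wait". */
        while (epoll_wait(epoll_fd, &ready, 1, 0) == 1) {
            char buffer[4];
            if (read(fds[0], buffer, sizeof(buffer)) <= 0) {
                break;
            }
            wakeups++;
        }
    }

    if (epoll_fd != -1) {
        close(epoll_fd);
    }
    close(fds[0]);
    close(fds[1]);
    return wakeups;
}
```

In level-triggered mode epoll keeps reporting the FD until the pipe is empty, so all three reads (4, 4 and 2 bytes) get their own notification. In edge-triggered mode only the first one does, and the remaining 6 bytes would sit there unnoticed until more data arrived.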

So because we’re using level-triggered epoll, we don’t need to make sure we read everything — we know that epoll will tell us later if there’s some stuff we didn’t read. And doing things this way gives us a nice way to make sure that when we are handling lots of connections, we time-slice between them reasonably well. After all, if every time we got data from a client, we processed it all in the handler, then if a client sent us lots of data in one go, we’d sit there processing it and ignoring other clients in the meantime. Remember, we’re not multi-threading.

That’s all very well, but if it’s the case and we can use it when processing data from a client, why did we have to be careful to accept all incoming client connections? Surely we could only accept the first one, then rely on epoll to tell us again later that there were still more to handle?

To be honest, I don’t know. It seems really odd to me. But I tried changing the accept code to only accept one client connection, and it didn’t work: we never got informed about the ones we didn’t accept. Someone else got the same behaviour and reported it as a bug in the kernel back in 2006. But it’s super-unlikely that something like this is a kernel bug, especially after seven years, so it must be something odd about my code, or deliberate defined behaviour that I’ve just not found the documentation for. Either way, the thread continuing from that bug report has comments from people saying that regardless of whether you’re running edge- or level-triggered, if you want to handle lots of connections then accepting them all in one go is a good idea. So I’ll stick with that for the time being, and if anyone more knowledgeable than me wants to clarify things in the comments then I’d love to hear more!

So, what’s left? Well, there’s the code to close the client socket handler:

Simple enough — we close the socket, then free the memory associated with the closure and the handler.

There’s a little bit of extra complexity here, in how we call this close function from the handle_client_socket_event function. It’s all to do with memory management, like most nasty things in C programs. But it’s worth having a quick look at the backend-handling code first. As you’d expect, it’s in backend_socket.c, and it probably looks rather familiar. We have a function to create a backend handler:

There’s a lot of duplication there. Normally I’d refactor to make as much common code as possible between client and backend connections. But the next steps into making this a useful proxy are likely to change the structure enough that it’s not worth doing that right now, only to undo it a few commits later. So there it remains, for now.

That’s all of the code! The only thing remaining to explain is the memory management weirdness I mentioned in the close handling.

Here’s the problem: when a connection is closed, we need to free the memory allocated to the epoll_event_handler structure and its associated closure. So our handle_client_socket_event function, which is the one notified when the remote end is closed, needs to have access to the handler structure in order to close it. If you were wondering why the epoll interface abstraction passes the handler structure into the callback function (instead of just the closure, which would be more traditional for a callback interface like this) then there’s the explanation — so that it can be freed when the connection closes.

But, you might ask, why don’t we just put the memory management for the handler structure in the epoll event loop, do_reactor_loop? When an event comes in, we could handle it as normal and then if the event said that the connection had closed, we could free the handler’s memory. Indeed, we could even handle more obscure cases: perhaps the handler could return a value saying “I’m done, you can free my handler”.

But it doesn’t work, because we’re not only closing the connection for the FD the handler is handling. When a client connection closes, we need to close the backend, and vice versa. Now, when the remote end closes a connection, we get an event from epoll. But if we close it ourselves, then we don’t. For most normal use, that doesn’t matter — after all, we just closed it, so we should know that we’ve done so and tidy up appropriately.

But when in a client connection handler we’re told that the remote end has disconnected, we need to not only free the client connection (and thus free its handler and its closure), we also need to close the backend and free its stuff up. Which means that the client connection needs to have a reference not just to the backend connection’s FD to send events — it also needs to know about the backend connection’s handler and closure structures because it needs to free them up too.
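To illustrate the shape of the problem, here’s a stripped-down sketch of two handlers that each hold a reference to the other, so that whichever side notices the hang-up can tear down both. Everything in it apart from the general handler layout is hypothetical: the connection_closure type, the peer field and the function names are mine, and the real closures in rsp carry different data.

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

struct epoll_event_handler {
    int fd;
    void (*handle)(struct epoll_event_handler *self, uint32_t events);
    void *closure;
};

/* Hypothetical closure: all it knows is who is on the other side. */
struct connection_closure {
    struct epoll_event_handler *peer;
};

struct epoll_event_handler *create_connection_handler(int fd)
{
    struct connection_closure *closure = malloc(sizeof(*closure));
    struct epoll_event_handler *handler = malloc(sizeof(*handler));
    if (closure == NULL || handler == NULL) {
        free(closure);
        free(handler);
        return NULL;
    }
    closure->peer = NULL;
    handler->fd = fd;
    handler->handle = NULL;     /* no callback needed for this sketch */
    handler->closure = closure;
    return handler;
}

/* Make each side's closure point at the other side's handler. */
void link_peers(struct epoll_event_handler *a, struct epoll_event_handler *b)
{
    ((struct connection_closure *)a->closure)->peer = b;
    ((struct connection_closure *)b->closure)->peer = a;
}

/* Close one FD and free its handler and closure. */
static void close_one(struct epoll_event_handler *handler)
{
    close(handler->fd);
    free(handler->closure);
    free(handler);
}

/* Called by whichever side sees the hang-up: it has to free the other
 * side's handler and closure too, because epoll won't send us an event
 * for an FD that we closed ourselves. */
void close_both_ends(struct epoll_event_handler *handler)
{
    struct connection_closure *closure = handler->closure;
    if (closure->peer != NULL) {
        close_one(closure->peer);
    }
    close_one(handler);
}
```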

So there’s the explanation. There are other ways we could do this kind of thing — I’ve tried a bunch — but they all require non-trivial amounts of accounting code to keep track of things. As the system itself is pretty simple right now (notwithstanding the length of this blog post) then I think it would be an unnecessary complication. But it is something that will almost certainly require revisiting later.

So, on that note — that’s it! That’s the code for a trivial epoll-based proxy that connects all incoming connections to a backend. It can handle hundreds of simultaneous connections — indeed, with appropriate ulimit changes to increase the maximum number of open file descriptors, it can handle thousands — and it adds very little overhead.

In the next step, I’m going to integrate Lua scripting. This is how the proxy will ultimately handle backend selection (so that client connections can be delegated to appropriate backends based on the hostname they’re for) but initially I just want to get it integrated for something much simpler. Here’s a link to the post.