USING EPOLL

2014-10-10

Introduction

I heard about epoll not so long ago and wanted to give it a try. epoll is the Linux-specific way
to wait for events on many non-blocking sockets (or any file descriptors). It is a more scalable
alternative to poll() and select().

epoll is easy enough to use. The good thing about it is that, unlike select(), you don't need
to rebuild the fd_set each time you call epoll_wait(). Here is a typical usage:

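    efd = epoll_create1(0)
    ev.events = EPOLLIN; ev.data.fd = listeningSocket
    epoll_ctl(efd, EPOLL_CTL_ADD, listeningSocket, &ev)
    loop:
        n = epoll_wait(efd, events, MAX_EVENTS, -1)
        for i = 0 to n-1:
            if (events[i].data.fd == listeningSocket)
                newsock = accept(listeningSocket)
                ev.events = EPOLLIN; ev.data.fd = newsock
                epoll_ctl(efd, EPOLL_CTL_ADD, newsock, &ev)
            else
                read from events[i].data.fd
    close(efd)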
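
The pseudocode above can be fleshed out roughly as follows. This is only a sketch, assuming Linux and a listener on the loopback interface; make_listener, watch_for_read, poll_once and demo_accept are hypothetical helper names, and error handling is kept to a minimum.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <vector>

// Creates a non-blocking TCP listener on 127.0.0.1. Port 0 lets the kernel
// pick a free port. Returns the socket, or -1 on error.
int make_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd < 0) return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Registers fd for read-readiness (level-triggered, the default).
bool watch_for_read(int efd, int fd) {
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// One iteration of the loop: waits up to timeout_ms, accepts every pending
// connection and registers it. Returns the number of clients accepted.
int poll_once(int efd, int listener, int timeout_ms, std::vector<int>& clients) {
    epoll_event events[16];
    int n = epoll_wait(efd, events, 16, timeout_ms);
    int accepted = 0;
    for (int i = 0; i < n; ++i) {
        if (events[i].data.fd == listener) {
            int client;
            while ((client = accept4(listener, nullptr, nullptr, SOCK_NONBLOCK)) >= 0) {
                watch_for_read(efd, client);
                clients.push_back(client);
                ++accepted;
            }
        } else {
            // A client socket is readable: read from events[i].data.fd here.
        }
    }
    return accepted;
}

// Usage: connect one client to the listener and run one loop iteration.
int demo_accept() {
    int efd = epoll_create1(0);
    int listener = make_listener(0);
    assert(efd >= 0 && listener >= 0);
    bool added = watch_for_read(efd, listener);
    assert(added);

    sockaddr_in addr{};
    socklen_t len = sizeof addr;
    getsockname(listener, reinterpret_cast<sockaddr*>(&addr), &len);
    int client = socket(AF_INET, SOCK_STREAM, 0);
    int rc = connect(client, reinterpret_cast<sockaddr*>(&addr), sizeof addr);
    assert(rc == 0);

    std::vector<int> clients;
    int accepted = poll_once(efd, listener, 1000, clients);
    for (int fd : clients) close(fd);
    close(client);
    close(listener);
    close(efd);
    return accepted;
}
```

A real server would call poll_once in a loop with a timeout of -1 and read from the client sockets in the else branch.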

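Another nice thing about epoll is that you don't need to worry about removing a socket from the list.
You can call epoll_ctl(efd, EPOLL_CTL_DEL, sock) if you want, but when the socket is closed, it will
be removed automatically. One caveat: this only happens once every file descriptor referring to that
socket is closed, so if you used dup() or fork(), the registration stays until the last copy is closed.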
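
A small experiment shows the automatic removal, and the case where it does not happen because a duplicate descriptor is still open. This is a sketch assuming Linux; ready_after_close is a hypothetical name, and a pipe stands in for a socket to keep it short.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>

// Registers the read end of a pipe, makes it readable, closes it, and returns
// how many descriptors epoll still reports as ready afterwards.
// keep_duplicate: dup() the read end first, so the close is not the last one.
int ready_after_close(bool keep_duplicate) {
    int efd = epoll_create1(0);
    int p[2];
    int rc = pipe(p);
    assert(efd >= 0 && rc == 0);

    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    rc = epoll_ctl(efd, EPOLL_CTL_ADD, p[0], &ev);
    assert(rc == 0);

    ssize_t w = write(p[1], "x", 1);          // the read end is now readable
    assert(w == 1);

    int dup_fd = keep_duplicate ? dup(p[0]) : -1;
    close(p[0]);                              // no EPOLL_CTL_DEL beforehand

    epoll_event out{};
    int ready = epoll_wait(efd, &out, 1, 0);  // timeout 0: just poll

    if (dup_fd >= 0) close(dup_fd);
    close(p[1]);
    close(efd);
    return ready;
}
```

ready_after_close(false) reports 0 ready descriptors, while ready_after_close(true) still reports 1, because the duplicate keeps the registration alive.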

One thread

Using epoll, I can do all my sending and receiving in one thread. Some people may suggest
sending data from another thread, but what if you get EAGAIN or a partial write? Assume that thread A
is the epoll thread. Thread B attempts to send 100 bytes but could only send 50. Then thread C attempts
to write 100 bytes on the same socket. By the time thread C makes its call, the socket is ready again,
so its 100 bytes go out in the middle of thread B's message. The remote socket will have received
corrupted data. For that reason, threads B and C add their data to a queue so that each message is
guaranteed to be sent completely. The queue is emptied in thread A.

Since epoll might potentially handle thousands of connections, Thread A must do minimal work.
It must only do this:

1. epoll_wait

2. if a socket is read-ready, read all from that socket and dispatch to consumers.
   Consumers should work fast; otherwise we should "post" the data to the consumer in another thread.

3. send all from the send queue
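
The "send all from the send queue" step has to cope with partial writes. Here is a minimal sketch of how that could look, assuming a non-blocking socket and one queue per connection; SendQueue, flush_queue and demo_partial_writes are hypothetical names, not the framework's.

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical per-connection transmit queue; each entry is one whole message.
struct SendQueue {
    std::deque<std::string> messages;
    size_t offset = 0;  // bytes of messages.front() that were already sent
};

// Writes as much as the socket accepts, always finishing one message before
// starting the next. Returns true when the queue is empty. Returns false when
// the socket is full (errno is EAGAIN/EWOULDBLOCK) or on a real error.
bool flush_queue(int fd, SendQueue& q) {
    while (!q.messages.empty()) {
        const std::string& msg = q.messages.front();
        ssize_t n = send(fd, msg.data() + q.offset, msg.size() - q.offset, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        q.offset += static_cast<size_t>(n);
        if (q.offset == msg.size()) {  // message fully sent, move to the next
            q.messages.pop_front();
            q.offset = 0;
        }
    }
    return true;
}

// Usage: queue more than the socket buffer can hold, then alternate between
// draining the peer and flushing. Returns the number of bytes received.
size_t demo_partial_writes() {
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0, sv);
    assert(rc == 0);
    SendQueue q;
    q.messages.push_back(std::string(1 << 22, 'a'));
    q.messages.push_back("tail");

    bool done = flush_queue(sv[0], q);
    assert(!done);  // 4 MiB does not fit in the socket buffer at once

    std::string received;
    char buf[65536];
    for (;;) {
        ssize_t n = recv(sv[1], buf, sizeof buf, 0);
        if (n > 0) { received.append(buf, static_cast<size_t>(n)); continue; }
        if (done) break;                   // queue flushed and peer drained
        done = flush_queue(sv[0], q);      // room again: resume where we stopped
    }
    assert(received.compare(received.size() - 4, 4, "tail") == 0);
    close(sv[0]);
    close(sv[1]);
    return received.size();
}
```

Because only one thread ever calls flush_queue for a given socket, the second message cannot start before the first one is completely written.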

Edge-triggered vs level-triggered

epoll offers two ways of working: edge-triggered and level-triggered. The default is level-triggered,
so you might want to change this to edge-triggered (the EPOLLET flag) in your application.

In level-triggered mode, epoll_wait will return as long as a socket is read-ready or write-ready.
So if data is ready to be received on the socket and you only read part of it, the next epoll_wait call
will return immediately because there is still data available. Once the internal receive buffer is empty,
epoll_wait stops returning for read-readiness. But a socket is write-ready almost all of the time, so if
you registered for EPOLLOUT, epoll_wait will keep returning immediately.
This is not something you would typically want. My guess is that if you want to use level-triggered mode,
you should not register to get EPOLLOUT events unless you have something to send. So while your application's
TX buffer is empty, you don't register for EPOLLOUT. As soon as data is added to the TX buffer, you register
for EPOLLOUT, epoll_wait will return and you write data out on the socket. If EAGAIN was returned, you will block
on the next epoll_wait() and can send the rest of the buffer when it unblocks. Once the application's TX buffer
is empty, you unregister for EPOLLOUT.
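
Registering and unregistering for EPOLLOUT comes down to one EPOLL_CTL_MOD call. The following is a sketch assuming Linux and a descriptor that is already in the epoll set; set_want_write, poll_now and demo_toggle are hypothetical names.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>

// Re-registers fd (already added to efd) with or without EPOLLOUT.
// Call with true when the TX buffer becomes non-empty, false once it is empty.
bool set_want_write(int efd, int fd, bool want_write) {
    epoll_event ev{};
    ev.events = EPOLLIN | (want_write ? EPOLLOUT : 0);
    ev.data.fd = fd;
    return epoll_ctl(efd, EPOLL_CTL_MOD, fd, &ev) == 0;
}

// Returns the event mask reported right now for one ready fd, or 0 if none.
uint32_t poll_now(int efd) {
    epoll_event ev{};
    return epoll_wait(efd, &ev, 1, 0) == 1 ? ev.events : 0;
}

// Usage: an idle socket is always writable, so in level-triggered mode
// EPOLLOUT is reported exactly while we ask for it.
bool demo_toggle() {
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    int efd = epoll_create1(0);
    assert(rc == 0 && efd >= 0);

    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = sv[0];
    rc = epoll_ctl(efd, EPOLL_CTL_ADD, sv[0], &ev);
    assert(rc == 0);

    bool idle_before = poll_now(efd) == 0;             // nothing to read
    set_want_write(efd, sv[0], true);
    bool writable = (poll_now(efd) & EPOLLOUT) != 0;   // reported at once
    set_want_write(efd, sv[0], false);
    bool idle_after = poll_now(efd) == 0;              // quiet again

    close(sv[0]);
    close(sv[1]);
    close(efd);
    return idle_before && writable && idle_after;
}
```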

With edge-triggered mode, things are different. You will only receive an event when the status changes from not-ready to ready.
So if you have incoming data and you only read part of it, the socket will still be ready when you call epoll_wait again.
epoll_wait will then block, because the state did not change from not-ready to ready (you will only hear about that socket again
once more data arrives). So if you are using that mode, you must make sure to read ALL the data on the socket until you get EAGAIN.
If you want to send something, apart from the event reported when the socket is first registered for EPOLLOUT, epoll_wait will only
give you a write-ready event if you previously got EAGAIN while attempting to write out.
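
To make the "read until EAGAIN" rule concrete, here is a sketch assuming Linux and a non-blocking descriptor registered with EPOLLET. read_all, ready_now and demo_edge_triggered are hypothetical names, and a socketpair stands in for a TCP connection.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <string>

// Reads from a non-blocking fd until EAGAIN, appending to out.
// Returns false if the peer closed the connection or a real error occurred.
bool read_all(int fd, std::string& out) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) { out.append(buf, static_cast<size_t>(n)); continue; }
        if (n == 0) return false;                        // peer closed
        if (errno == EINTR) continue;
        return errno == EAGAIN || errno == EWOULDBLOCK;  // drained, or error
    }
}

// Counts ready descriptors without blocking.
int ready_now(int efd) {
    epoll_event ev{};
    return epoll_wait(efd, &ev, 1, 0);
}

// Usage: shows that a partial read leaves data behind without a new event.
// Returns the bytes that read_all picked up afterwards.
std::string demo_edge_triggered() {
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0, sv);
    int efd = epoll_create1(0);
    assert(rc == 0 && efd >= 0);

    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLET;   // EPOLLET selects edge-triggered mode
    ev.data.fd = sv[0];
    rc = epoll_ctl(efd, EPOLL_CTL_ADD, sv[0], &ev);
    assert(rc == 0);

    ssize_t w = write(sv[1], "hello world", 11);
    assert(w == 11);
    int first = ready_now(efd);      // edge: not-ready -> ready
    assert(first == 1);

    char part[5];
    ssize_t r = read(sv[0], part, sizeof part);   // partial read only
    assert(r == 5);
    int second = ready_now(efd);     // 6 bytes still buffered, but no new edge
    assert(second == 0);

    std::string rest;
    bool ok = read_all(sv[0], rest); // what the handler should have done
    assert(ok);
    close(sv[0]);
    close(sv[1]);
    close(efd);
    return rest;
}
```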

Let's say you have an internal transmit queue in which you add data when you want to send it out. The epoll thread would need
to read from that queue, write the data on the socket and handle EAGAIN appropriately.

1. epoll_wait()

2. if a socket is read-ready, readAllFromThatSocket(socket);

3. sendAllData() // sends all that is contained in the internal queue.

4. goto step 1

But if epoll_wait() is blocked in edge-triggered mode and you add data to the queue (from another thread), sendAllData() will not
be called until epoll_wait returns because data is ready to be received. It won't return because the socket is ready to write:
for that, you need to write first and get EAGAIN. To solve this problem, I created an eventfd (see sys/eventfd.h). I add the
eventfd to the epoll set and whenever I add data to the application's TX queue, I write 1 on the eventfd so that epoll_wait
will break out. I could have used a timeout in epoll_wait, but that would add unnecessary latency to my reactor. This way of
doing things is similar to what is called the "self-pipe trick".
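
The eventfd wake-up can be sketched as below, assuming Linux. wake_reactor, clear_wakeup and demo_wakeup are hypothetical names and not necessarily what the framework uses; in the real reactor, wake_reactor would be called from the producer thread.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>

// Producer side: call after adding data to the TX queue.
bool wake_reactor(int wake_fd) {
    uint64_t one = 1;
    return write(wake_fd, &one, sizeof one) == static_cast<ssize_t>(sizeof one);
}

// Epoll thread: call when wake_fd is reported readable. Reading resets the
// counter so the next epoll_wait can block again. Returns the counter value,
// i.e. how many wake-ups were merged into this one event.
uint64_t clear_wakeup(int wake_fd) {
    uint64_t count = 0;
    if (read(wake_fd, &count, sizeof count) != static_cast<ssize_t>(sizeof count))
        return 0;
    return count;
}

// Usage: two wake-ups before the reactor runs produce a single event.
uint64_t demo_wakeup() {
    int efd = epoll_create1(0);
    int wake_fd = eventfd(0, EFD_NONBLOCK);
    assert(efd >= 0 && wake_fd >= 0);

    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = wake_fd;
    int rc = epoll_ctl(efd, EPOLL_CTL_ADD, wake_fd, &ev);
    assert(rc == 0);

    epoll_event out{};
    int n = epoll_wait(efd, &out, 1, 0);
    assert(n == 0);                     // nothing queued: would block

    wake_reactor(wake_fd);
    wake_reactor(wake_fd);
    n = epoll_wait(efd, &out, 1, -1);   // returns immediately now
    assert(n == 1 && out.data.fd == wake_fd);

    uint64_t merged = clear_wakeup(wake_fd);
    n = epoll_wait(efd, &out, 1, 0);
    assert(n == 0);                     // counter is back to zero

    close(wake_fd);
    close(efd);
    return merged;
}
```

After clear_wakeup, the epoll thread would call sendAllData() as in step 3 of the loop.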

My framework

To play with epoll, I wrote a small reactor framework that exposes an interface that's easy to use.
All the gory details are hidden inside the framework. To use it, you need to define two classes:

    class OwnProtocolFramingStrategy: public IFramingStrategy
    {
    protected:
        virtual int processData(char* buffer, size_t size, FullMessageInfo& info)
        {
            // info.buffer will first be zero. You need to allocate memory for that buffer
            // and copy the data from "buffer" to "info.buffer". If you determine that the buffer
            // does not contain a full message, return the number of bytes you read from the buffer.
            // In this case, it should be "return size;". The next time some data is received,
            // this function will be called with your buffer still in info.buffer. You can update
            // other fields in "info" to help you resume next time.
            //
            // If you determine that a full message was received, set "info.complete = true" and return
            // the number of bytes that you read from the buffer. After returning from this function, the
            // buffer you created in "info.buffer" will be considered as containing a full message and will be
            // passed to the TcpEngine. The next time this function is called, info.buffer will be back to
            // zero.
            //
            // If you determine that a full message was received but there is still data left in the buffer,
            // it means that you probably have 2 messages in the buffer. As mentioned above, you will
            // return the number of bytes that you read from the buffer. This function will be called
            // again with "buffer" pointing to the index after the last byte you have read. So you
            // only need to try to parse one message each time this function is called.
            //
            // If you find that the data is not well-formed, does not respect your protocol or has any other
            // error that prevents you from reliably building a full message, then return -1 and the client's
            // connection will be aborted.
        }
    };

An IFramingStrategy is a class that will process the TCP stream into full messages. As you know, when
reading data from a socket, you may receive more than one message in a single read and you could also
only receive half of a message. So the IFramingStrategy is what parses the stream and builds messages.
For example, if implementing an HTTP server, the strategy would have to work out where each message ends
(from the Content-Length header, or by decoding chunked transfer encoding), so the logic to calculate the
size would be done here.
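
To illustrate the idea outside of the framework, here is a sketch of a framer for a made-up protocol in which each message is a 4-byte big-endian length followed by the payload. LengthPrefixFramer, feed and demo_framing are hypothetical and do not use the IFramingStrategy interface; they only show how partial and back-to-back messages are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-alone framer (not the framework's IFramingStrategy): splits a byte
// stream into messages of the form [4-byte big-endian length][payload].
class LengthPrefixFramer {
public:
    static constexpr uint32_t max_size = 1u << 20;  // arbitrary sanity limit

    // Feeds raw bytes from the socket and appends every completed message to
    // out. Returns false if a declared length exceeds max_size, in which case
    // the caller should drop the connection.
    bool feed(const char* data, size_t size, std::vector<std::string>& out) {
        pending_.append(data, size);
        size_t pos = 0;
        while (pending_.size() - pos >= 4) {
            uint32_t len = 0;
            for (int i = 0; i < 4; ++i)
                len = (len << 8) | static_cast<unsigned char>(pending_[pos + i]);
            if (len > max_size) return false;             // malformed stream
            if (pending_.size() - pos - 4 < len) break;   // message incomplete
            out.emplace_back(pending_, pos + 4, len);     // one full message
            pos += 4 + len;
        }
        pending_.erase(0, pos);  // keep only the unfinished tail
        return true;
    }

private:
    std::string pending_;  // bytes received but not yet part of a full message
};

// Usage: the first read carries "hi" plus part of "world", the second read
// carries the rest. Returns the number of messages produced.
size_t demo_framing() {
    LengthPrefixFramer framer;
    std::vector<std::string> msgs;
    const char part1[] = {0, 0, 0, 2, 'h', 'i', 0, 0, 0, 5, 'w', 'o'};
    const char part2[] = {'r', 'l', 'd'};

    bool ok = framer.feed(part1, sizeof part1, msgs);
    assert(ok && msgs.size() == 1 && msgs[0] == "hi");
    ok = framer.feed(part2, sizeof part2, msgs);
    assert(ok && msgs.size() == 2 && msgs[1] == "world");
    return msgs.size();
}
```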

    class OwnProtocolServerEngine: public TcpEngine
    {
    public:
        OwnProtocolServerEngine():TcpEngine(new ClientFactory(this))
        {
        }
    protected:
        virtual void onClientConnected(TcpClient* client)
        {
            // Maybe add the client to some list so we can use it later?
        }
        virtual void onClientDisconnected(TcpClient* client)
        {
            // If the client was added to a list, we should remove it because
            // its instance will be deleted after returning from this method.
        }
        virtual void onClientMessage(TcpClient* client, char* buffer, size_t size)
        {
            // The buffer you get here is the one that was created in the strategy. You
            // own this buffer in this method, so you must free it from here
            // (or remember to free it at a later time).
            //
            // In this method, you are guaranteed to get 1 and only 1 full message, assuming
            // that the strategy was coded correctly.
            //
            // Note that everything is happening in 1 thread. So in the onClientConnected()
            // method, you might have saved a TcpClient pointer in a list. It is perfectly
            // safe to use it in here (e.g. for forwarding the message to someone else).
        }
    };

Thread-safe queue

I said earlier that everything happens in one single thread, so everything *should* be
safe. But what if I want to send data to a client from another thread? That should be possible.
The transmit queue inside the TcpClient is thread-safe, so TcpClient::sendData() is safe
to call from another thread. Accessing the client is another story though: the client instance
could get deleted at any time. There are ways around that, but they are beyond the scope of this framework.

Since the TX queue is a Multi-Producer, Single-Consumer FIFO, creating a thread-safe implementation
is very easy. Of course, I didn't want to use any of the STL containers.

My implementation uses __sync_bool_compare_and_swap, which on x86 compiles to the LOCK CMPXCHG instruction. I described
that instruction here.
I didn't fully test that code though. I adapted it from a version I had written in ASM, so there might be bugs
in the C++ version.
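
For readers who want to see the general shape of such a queue, here is one possible sketch. It is not the framework's implementation: MpscQueue, push, pop_all and demo_queue are hypothetical names. It assumes GCC or Clang for the __sync builtin, and it returns a std::vector for brevity even though the real queue avoids STL containers. Producers push onto a lock-free list with a compare-and-swap loop, and the single consumer detaches the whole list at once and reverses it to restore FIFO order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

template <typename T>
class MpscQueue {
    struct Node { T value; Node* next; };
    Node* volatile head_ = nullptr;  // most recently pushed node

public:
    ~MpscQueue() { pop_all(); }      // frees any nodes still queued

    // Any thread: link the new node in front of the current head. The CAS
    // fails, and the loop retries, if another producer got there first.
    void push(const T& value) {
        Node* node = new Node{value, nullptr};
        Node* old_head;
        do {
            old_head = head_;
            node->next = old_head;
        } while (!__sync_bool_compare_and_swap(&head_, old_head, node));
    }

    // Consumer thread only: atomically take the whole list, then walk it.
    // The list is newest-first, so reverse it to get push order back.
    std::vector<T> pop_all() {
        Node* list;
        do {
            list = head_;
        } while (list != nullptr &&
                 !__sync_bool_compare_and_swap(&head_, list, static_cast<Node*>(nullptr)));

        std::vector<T> items;
        while (list != nullptr) {
            items.push_back(list->value);
            Node* next = list->next;
            delete list;
            list = next;
        }
        std::reverse(items.begin(), items.end());
        return items;
    }
};

// Usage: items come out in the order they were pushed.
std::vector<int> demo_queue() {
    MpscQueue<int> q;
    q.push(1);
    q.push(2);
    q.push(3);
    std::vector<int> first = q.pop_all();
    std::vector<int> second = q.pop_all();
    assert(second.empty());          // everything was taken by the first call
    return first;
}
```

Taking the whole list in one swap keeps the consumer simple, because it never has to remove a single node while producers are still pushing.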