Introduction

The Win32 SDK provides a set of APIs (Application Programming Interfaces) to manage the creation, destruction, and synchronization of threads. Frequent creation and destruction of threads is time- and resource-consuming, and therefore impacts the performance of server applications. Windows 2000 was designed to target the server market, and to capture it, a number of scalability features such as thread pooling and I/O Completion Ports were introduced. A thread pool is a pool of worker threads maintained by the system, which frees the server developer from the hassle of creating and destroying threads. To access the thread pool, a developer uses the thread pooling APIs, which make creation, destruction, and general management easier.

Server applications make extensive use of thread pooling to delegate client requests to worker threads and thereby achieve better throughput. After serving a client request, a thread goes back into the pool, ready to serve the next pending request. Throughput improves because less time is wasted creating and destroying threads: the application already has an idle pool of them. The number of threads in the pool can vary depending on the load on the server.

The purpose of this article is to cover a design that can be used to develop scalable network servers without sacrificing performance. A basic technique widely used in designing a scalable server is an asynchronous mechanism through which the server's main thread defers/delegates a client's request to a pool of worker threads for further processing. In Windows 2000, the thread pool consists of four components, accessed through the thread pooling APIs, each backed by its own kind of thread:

Non-I/O threads: These threads invoke the work items queued to them. Such work items should not issue asynchronous I/O calls, because these threads do not wait for I/O completion, so the completion notification would be lost.

I/O threads: These threads are meant for work items that issue asynchronous I/O requests. An I/O thread never dies while it has a pending I/O request, i.e., it can wait on the signal posted when the I/O request completes.

Timer thread: This thread invokes callback functions when a specified time expires; the timer can also fire periodically. By default, such work items are queued to the non-I/O component's threads. The WT_EXECUTEINTIMERTHREAD flag instead runs the work item on the timer component's thread itself, which sits in an alertable wait for the waitable timer to queue an APC to it. Such a work item must execute quickly, otherwise it blocks the timer thread and degrades the whole timer component.

Wait threads: These threads invoke a queued work item when a kernel object gets signaled. By default, once the object becomes signaled, the work item is queued to the non-I/O component's threads; the WT_EXECUTEINWAITTHREAD flag ensures the work item runs on the wait thread itself.
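The four components map directly onto a small set of thread pooling APIs. Below is a minimal sketch (error handling omitted; the callback and variable names are illustrative, not taken from the article's download):

```cpp
#include <windows.h>
#include <stdio.h>

// Work item executed by a non-I/O pool thread.
DWORD WINAPI WorkCallback(PVOID pvContext)
{
    printf("work item: %s\n", (const char*)pvContext);
    return 0;
}

// Invoked by the timer component when the due time expires.
VOID CALLBACK TimerCallback(PVOID pvContext, BOOLEAN bTimedOut)
{
    printf("timer fired\n");
}

// Invoked by the wait component when the event becomes signaled.
VOID CALLBACK WaitCallback(PVOID pvContext, BOOLEAN bTimedOut)
{
    printf("object signaled\n");
}

int main()
{
    // 1. Non-I/O thread: queue a work item with the default flag.
    QueueUserWorkItem(WorkCallback, (PVOID)"hello", WT_EXECUTEDEFAULT);

    // 2. Timer thread: fire after 1 s, then every 5 s.
    //    WT_EXECUTEINTIMERTHREAD runs the (short!) callback on the
    //    timer thread itself instead of a non-I/O thread.
    HANDLE hTimer;
    CreateTimerQueueTimer(&hTimer, NULL, TimerCallback, NULL,
                          1000, 5000, WT_EXECUTEINTIMERTHREAD);

    // 3. Wait thread: invoke the callback when the event is signaled.
    HANDLE hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
    HANDLE hWait;
    RegisterWaitForSingleObject(&hWait, hEvent, WaitCallback, NULL,
                                INFINITE, WT_EXECUTEINWAITTHREAD);

    SetEvent(hEvent);   // triggers WaitCallback on a wait thread
    Sleep(2000);        // give the pool threads a chance to run

    DeleteTimerQueueTimer(NULL, hTimer, INVALID_HANDLE_VALUE);
    UnregisterWaitEx(hWait, INVALID_HANDLE_VALUE);
    CloseHandle(hEvent);
    return 0;
}
```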

The Devil in the Thread Pooling Environment: TLS

A developer has to be careful when working with thread local storage (TLS) in a thread pooling environment. The purpose of TLS is to provide private storage for each thread in a multi-threaded program. Consider an example where a server process's main thread invokes the QueueUserWorkItem() function on receiving a client request, passing a context structure to the thread function.
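The exact context type lives in the article's download and is not reproduced here; based on the fields referenced in the discussion that follows (szModuleName, szFunctionName), a sketch might look like this (buffer sizes and the socket field are assumptions):

```cpp
#include <winsock2.h>
#include <windows.h>
#include <tchar.h>

// Hypothetical per-request context handed to the pool thread.
typedef struct _REQUEST_CONTEXT
{
    TCHAR  szModuleName[MAX_PATH];  // DLL exposing the handler, or
                                    // NULL/empty when the handler is
                                    // in the executable itself
    TCHAR  szFunctionName[64];      // handler exported by the module
    SOCKET clientSocket;            // connected client (assumed field)
} REQUEST_CONTEXT, *PREQUEST_CONTEXT;
```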

The server has a default handler, which is a callback function invoked by a pool of worker threads.

DWORD WINAPI defaultHandler(PVOID pvContext);

The purpose of this handler is to invoke a function being exposed by specified modules for further processing, depending on the type of image being requested from a client. Each module provides a set of helper functions that perform different operations on the image being requested.

The defaultHandler function checks szModuleName: if it is not NULL, the handler loads that module into the process's address space and invokes the function named by szFunctionName for further processing.

Consider a scenario where the first request comes to the server and the thread pool manager hands it to thread-1. For this request, the callback function is exposed by the executable module itself, so no DLL needs to be loaded and szModuleName is NULL. Now suppose a second request arrives while thread-1 is busy; it is given to thread-2, which must load a DLL because the handler function is exposed by the module named in szModuleName. Thread-2 calls LoadLibrary(), which invokes the DLL's entry point with DLL_PROCESS_ATTACH as an argument; the DLL_PROCESS_ATTACH handler is responsible for allocating the TLS slot and initializing it for the loading thread. The callback function szFunctionName exposed by the DLL then uses the TLS value via TlsGetValue().

Meanwhile, if a third request arrives and thread-1 has already completed its work, the thread pool manager hands the request to thread-1. Because this thread existed before the DLL was loaded, the DLL entry point is never invoked in its context, so there is no way its TLS slot gets initialized. Accessing TLS in the callback function then causes an access violation. These scenarios need to be considered while working with TLS in a thread pooling environment.
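A sketch of the mechanism, assuming the DLL keeps its slot index in a global (names and the 256-byte allocation are illustrative). DllMain only initializes the slot for threads that receive an attach notification, so the exported handler must treat an empty slot as "not initialized" instead of dereferencing it; the lazy check at the bottom is one defensive fix for pool threads that predate LoadLibrary():

```cpp
#include <windows.h>

static DWORD g_tlsIndex = TLS_OUT_OF_INDEXES;

BOOL WINAPI DllMain(HINSTANCE hinst, DWORD dwReason, LPVOID lpvReserved)
{
    switch (dwReason)
    {
    case DLL_PROCESS_ATTACH:
        g_tlsIndex = TlsAlloc();
        // fall through: initialize the slot for the loading thread too
    case DLL_THREAD_ATTACH:
        // Only threads created AFTER LoadLibrary() get here; pool
        // threads that already existed never do.
        TlsSetValue(g_tlsIndex,
                    HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, 256));
        break;
    case DLL_THREAD_DETACH:
        HeapFree(GetProcessHeap(), 0, TlsGetValue(g_tlsIndex));
        break;
    case DLL_PROCESS_DETACH:
        TlsFree(g_tlsIndex);
        break;
    }
    return TRUE;
}

// Exported handler: lazily initialize TLS so a pre-existing pool
// thread does not crash in TlsGetValue()-based code.
extern "C" __declspec(dllexport) DWORD HandleRequest(PVOID pvContext)
{
    LPVOID pv = TlsGetValue(g_tlsIndex);
    if (pv == NULL)   // this thread existed before LoadLibrary()
    {
        pv = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, 256);
        TlsSetValue(g_tlsIndex, pv);
    }
    // ... use pv safely here ...
    return 0;
}
```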

Thread-1 / defaultHandler (DLL attached: No, thread attached: No)
As defaultHandler is present in the executable module, no DLL loading is required.

Thread-2 / DLL::FunctionName (DLL attached: Yes, thread attached: No)
As the function is present in the DLL, the DLL is loaded by this thread, which causes the entry point to be invoked with DLL_PROCESS_ATTACH. Here we should call TlsAlloc()/TlsSetValue() to initialize the TLS slot for the loading thread.

Thread-1 / DLL::FunctionName (DLL attached: No, thread attached: No)
As this thread existed before the DLL was loaded, neither DLL_PROCESS_ATTACH nor DLL_THREAD_ATTACH runs in its context. This causes a crash when the function exposed by the DLL accesses TLS via TlsGetValue().

Scalable Server Based on IO Completion Port

Windows provides a combination of overlapped I/O and I/O Completion Ports for designing scalable servers. Threads consume resources (each has its own stack), so a one-thread-per-client design is not advisable for large servers. A scalable server should be able to handle many clients with a handful of threads constituting a thread pool.

The select() API provides a way to deal with multiple I/O streams in a single thread, but it is still not an ideal solution for a scalable server. With overlapped I/O, operations are initiated asynchronously and the operating system notifies the application on their completion. A Completion Port is a queue maintained by the operating system onto which the notifications of completed I/O operations are posted. Worker threads polling the I/O Completion Port are responsible for processing the posted completion notifications. An I/O Completion Port is associated with non-I/O threads by default.
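The port-plus-worker pattern can be sketched as follows (error handling trimmed; the thread count of four and the function names are illustrative, not from the article's download):

```cpp
#include <winsock2.h>
#include <windows.h>

// Completion port shared by all worker threads.
static HANDLE g_hIOCP;

DWORD WINAPI WorkerThread(PVOID)
{
    DWORD        dwBytes;
    ULONG_PTR    completionKey;   // per-handle data
    LPOVERLAPPED lpOverlapped;    // per-operation data

    for (;;)
    {
        // Blocks until the system posts an I/O completion packet.
        if (!GetQueuedCompletionStatus(g_hIOCP, &dwBytes,
                                       &completionKey, &lpOverlapped,
                                       INFINITE))
            continue;   // failed/aborted I/O; real code inspects it

        // Dispatch on the per-handle key and per-operation
        // OVERLAPPED pointer here.
    }
    return 0;
}

// Create the port, register a socket with it (the completion key
// travels back with every packet for that handle), and start the pool.
void SetupIOCP(SOCKET s)
{
    g_hIOCP = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    CreateIoCompletionPort((HANDLE)s, g_hIOCP, (ULONG_PTR)s, 0);
    for (int i = 0; i < 4; ++i)
        CreateThread(NULL, 0, WorkerThread, NULL, 0, NULL);
}
```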

The server's main thread is responsible for creating a listening socket that accepts connections from clients. The listening socket is associated with a network event object (WSAEVENT) registered for the FD_ACCEPT notification; the association is established with the WSAEventSelect() API from the Winsock library (Ws2_32.lib). This synchronizes a pending connection request on the listening socket with AcceptEx(), which accepts pending connections on a socket: AcceptEx() is invoked when the event associated with the listening socket gets signaled.

// The event object is non-signaled by default.
WSAEVENT wsaEvent = WSACreateEvent();
// Associate the event object with the FD_ACCEPT network event
// on the listening socket.
WSAEventSelect(g_ServerSocket, wsaEvent, FD_ACCEPT);
// Wait for a client to connect to the listening socket.
// As soon as a client connects, the associated event object
// gets signaled and we post a work item to a worker thread
// in the thread pool.
WSAWaitForMultipleEvents(1, &wsaEvent, TRUE, WSA_INFINITE, FALSE);

The thread function associated with a worker thread is responsible for calling AcceptEx(). The event object associated with FD_ACCEPT gets signaled when a connection request is pending, waiting to be accepted on the server socket. AcceptEx() gives better performance for scalable servers because the accept socket can be created before the connection occurs, speeding up connection establishment. AcceptEx() posts an accept request on the listening socket, and once the request completes, an I/O completion packet is posted to the port associated with the listening socket.
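A hedged sketch of posting such an accept request (the address-buffer sizes follow the documented AcceptEx() requirement of sizeof(SOCKADDR_IN) + 16 bytes per address; the ACCEPT_CONTEXT structure and PostAccept name are illustrative, not the article's):

```cpp
#include <winsock2.h>
#include <mswsock.h>
#include <new>
#pragma comment(lib, "mswsock.lib")

// Per-accept context: OVERLAPPED must be the first member so the
// pointer returned by GetQueuedCompletionStatus() can be cast back.
struct ACCEPT_CONTEXT
{
    OVERLAPPED overlapped;
    SOCKET     acceptSocket;
    // Room for local + remote addresses, as AcceptEx() requires.
    char       addrBuf[2 * (sizeof(SOCKADDR_IN) + 16)];
};

void PostAccept(SOCKET listenSocket)
{
    ACCEPT_CONTEXT* ctx = new (std::nothrow) ACCEPT_CONTEXT();
    if (!ctx)
        return;
    ZeroMemory(&ctx->overlapped, sizeof(ctx->overlapped));

    // Create the accept socket up front: this is the speed-up the
    // article mentions, since no socket is created at connect time.
    ctx->acceptSocket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    DWORD dwBytes = 0;
    // dwReceiveDataLength = 0: complete as soon as the connection
    // arrives, without also waiting for the first data.
    AcceptEx(listenSocket, ctx->acceptSocket, ctx->addrBuf,
             0,                         // receive data length
             sizeof(SOCKADDR_IN) + 16,  // local address length
             sizeof(SOCKADDR_IN) + 16,  // remote address length
             &dwBytes, &ctx->overlapped);
}
```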

Because AcceptEx() uses overlapped I/O, it is well suited to scalable servers where many clients are served by a small pool of threads. On completion of an accept request, the completion packet is dispatched to the port, where a pool thread processes it and registers the newly accepted socket with the I/O Completion Port as well. This design provides scalability at two stages: different I/O operations may be in flight on the multiple handles associated with the IOCP, and multiple I/O operations may be in flight on a single handle. The processing threads must be able to distinguish these operations; per-handle completion data is supplied as the completion key when a handle is registered with the IOCP.

This sample shows how multiple clients can be served by a limited number of worker threads constituting a thread pool. This design is scalable compared to one thread per client, where scalability is restricted.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

But there's a little bug in ya code. It's in the AllocateIOData function. You're trying to allocate a new PER_IO_DATA item, and then checking the validity of the lpPerIOContext pointer. But normally operator new throws a std::bad_alloc exception rather than returning a NULL pointer; that's the standard. There's an alternative version of the new operator which does exactly what you want: just use the new (std::nothrow) version.

First off, thanks for the awesome sample code, this is exactly what I was looking for to understand IOCP. However, in playing with your code I have found a serious flaw, and I know what needs to be done to fix it, but I'm not sure how to fix it.

Basically, to illustrate the problem, run the server, and then execute the following 20 times:

start /MIN telnet localhost 5001

if you don't type anything in any of the telnet sessions, then only the first one ever gets connected, and if you run it enough times, eventually telnet fails to even connect. The code is getting hung on this line:

I'm confused whether this is an article on thread pooling or async socket handling. (In fact it is both.) I'd recommend discussing this separately before bringing together.

The previous contributor described why IOCP is used - the article itself does not. A novice would be left wondering why they should use IOCP over, say, some form of APC callback mechanism.

The point is to reduce thread context switches - pertinent to the design of any thread pool, no matter what it is used for. That's why I think it would have been less confusing to discuss the thread pool design separately.

You shouldn't need this at all... at the start of every WorkerThread, just do a couple of queueAccept() calls to make sure there are a few pending for every IOCP thread waiting on a GQCS() call. All the processing should be done within the worker threads, and there should be no need at all to switch out to another thread for any processing (this is the real beauty of IOCP... zero thread swaps).

Also... you should think about using AcceptEx to its fullest... you are not currently doing a data receive within AcceptEx. This can be a huge performance benefit as it saves you an entire call (and all the ensuing processor transitions).

Otherwise nice code... this is the most 'straightforward' example of an IOCP server on codeproject so far (many other authors are just downright confused as to what is going on).

> Why the WSAEventSelect()?
I am not aware of queueAccept() and tried to search for this API but could not find anything. Is it a part of Winsock? The main purpose of using the WSAEventSelect() API was to show how the thread pooling functions can be easily used to delegate work. I fully agree that zero thread swaps is a beauty of IOCP. The purpose of the article was to cover a little bit of thread pooling, and that's why I deliberately introduced a pool thread that comes into the picture once a connection is pending to be accepted.

> you should think about using AcceptEx to its fullest... you are not currently doing a data receive within AcceptEx.
To be frank, I was trying the same thing you have mentioned, but I was getting a crash while receiving data into the buffer.

I am working on an HTTP proxy server (in Win32) for wireless mobiles and laptops; it contains two executables. One runs at the ISP (the proxy), which can handle 5000 clients, and the other runs on the client system (the Agent).
Whenever a request comes from the browser, it is collected by the Agent and sent to the proxy.
My question is: what is the best protocol to use so that my proxy and agent can communicate fast? We already implemented TCP/IP but the speed is not good, so I am thinking of using UDT or RBUDP. Is that a correct idea? Or is there any other protocol?

It depends upon what type of communication you need, i.e., what your priorities are. It could be speed or reliability. TCP/IP is slower but reliable; any protocol based on UDP will be faster but less reliable. So you have to make a trade-off.

If we look at today's trend, most web services use SOAP for communication. I do not know about the performance of SOAP; if the performance is good, then stand-alone SOAP servers can be written. I have not worked on mobile and laptop applications, so I may not be in a position to help you better.