Tuesday, July 05, 2011

I’ve recently been introduced to a code base that illustrates a very common threading anti-pattern. Say you’ve got a batch of data that you need to process, but processing each item takes a significant amount of time. Doing each item sequentially means that the entire batch takes an unacceptably long time. A naive approach to solving this problem is to create a new thread to process each item. Something like this:

The problem with this is that each thread takes significant resources to setup and maintain. If there are hundreds of items in the batch we could find ourselves short of memory.

It’s worth considering why ProcessItem takes so long. Most business applications don’t do processor intensive work. If you’re not protein folding, the reason your process is talking a long time is usually because it’s waiting on IO – communicating with the database or web services somewhere, or reading and writing files. Remember, IO operations aren’t somewhat slower than processor bound ones, they are many many orders of magnitude slower. As Gustavo Duarte says in his excellent post What Your Computer Does While You Wait:

Reading from L1 cache is like grabbing a piece of paper from your desk (3 seconds), L2 cache is picking up a book from a nearby shelf (14 seconds), and main system memory is taking a 4-minute walk down the hall to buy a Twix bar. Keeping with the office analogy, waiting for a hard drive seek is like leaving the building to roam the earth for one year and three months.

You don’t need to keep a thread around while you’re waiting for an IO operation to complete. Windows will look after the IO operation for you, so long as you use the correct API. If you are writing these kinds of batch operations, you should always favour asynchronous IO over spawning threads. Most (but not all unfortunately) IO operations in the Base Class Library (BCL) have asynchronous versions based on the Asynchronous Programming Model (APM). So, for example:

Your main thread doesn’t block when you call BeginMyIoOperation, so you can run hundreds of them in short order. Eventually your IO operations will complete and the callback you defined will be run on a worker thread in the CLR’s thread pool. Profiling your application will show that only a handful of threads are used while your hundreds of IO operations happily run in parallel. Much nicer!

Of course all this will become much easier with the async features of C# 5, but that’s no excuse not to do the right thing today with the APM.

Like C# programmers, Java programmers tend to be excessively thread-happy, so I'm glad to see you highlighting the (ab)use of threads for I/O concurrency in cases where async I/O is more than good enough.

Code Rant

Notepad, thoughts out loud, learning in public, misunderstandings, mistakes. undiluted opinions. I'm Mike Hadlow, an itinerant developer. I live (and try to work in) Brighton on the south coast of England. Please don't mistake me for an expert in anything. I love technology and programming, but make no claims to be any good at it. Much of what you read here may be poorly thought out, wrong, or just plain dangerous.