So I've been adding some multithreading to my game demo. It all works fine until I create more threads than I have cores on my CPU.

I've only got a dual core, so I have the main thread and one extra helper thread. If for whatever reason I create a third thread, performance drops considerably - like going from a supercar to your dad's first Ford.

With a single thread I get 50 fps, with two threads I get 70 fps, but with three threads I get 8 fps.

I used Very Sleepy to see what's going on. When I add the third thread, a lot of time is spent in WaitForSingleObject and ReleaseSemaphore. There's only one place I use those, and the locks will be taken the same number of times regardless of the number of threads, because there's only a limited amount of data.

I create the threads at startup using CreateThread, and they then wait for an event to signal that there is work for them to do. Once the work is done, they wait on the event again.

3 solutions

Solution 2

Having more threads than cores will slow you down if you are already running at 100% CPU. However, I have written programs that run dozens of threads on a dual-core CPU without a problem. You just need to avoid deadlocks and race conditions.

Solution 1

It's hard to analyze the actual problem when no code example is given, but based on your description I come to the following:
because you have more threads than cores in a situation with a limited amount of data, you have a lot of overhead that causes the processor to stall.

This is clearer in a real-life example:
When doing dishes by hand, you have one person washing and one drying. When an extra person (thread) has to compete for a lock on the only brush (and sink) available, it is clear that this will get messy without any actual performance gain. Doing dishes this way certainly won't get any faster.

But how to speed it up? Well, if after drying an item you had to walk from the kitchen to the living room to put it away, it could help to leave the drying cloth at the sink and let another person (thread) take that resource and use it while you are putting the item away (meaning the lock on the drying cloth is released).

The conclusion: only use more threads than cores when I/O operations are involved, because that time would otherwise be spent waiting. When threads have to compete for resources that are already available, extra threads won't speed anything up. In those cases you are thrashing the cache: thread state is constantly suspended to and restored from memory, and the latency of that is enormous, because the performance of a program is bounded by the total latency the system incurs while executing it. By adding a thread you have just added more latency-intensive operations, which drops performance drastically.

It's not that the third thread is using a huge amount of RAM. At first, threads 1 and 2 are working and thread 3 is waiting for a core to become available. Thread 1 finishes and tries to acquire the lock. With one thread per core, the kernel doesn't need to interrupt the thread, and it can resume once the lock is acquired. But with the extra thread, the kernel switches context instead, scheduling thread 3 to execute. The processor cache is effectively invalidated while the code and data of thread 3 are loaded. The cache controller didn't (and couldn't) anticipate this, so it falls far behind in getting the necessary data to the processor, which runs very inefficiently in the meantime because the code and data it needs are simply missing. This happens for each thread on every iteration: each time, the context is switched with the waiting thread, and that takes a huge amount of time, because even a quite intelligent cache controller could not foresee this cache-thrashing tragedy.

What might be happening in your case is context switching/time-slicing rather than parallel execution, since you have more threads than processor cores.
I suggest using the Task Parallel Library (TPL), which uses your system's processor capabilities much more efficiently: it knows which processor/core is idle or available when assigning tasks, provides built-in thread pooling, and manages memory use as well.
Thanks

Your comment comes very late in the discussion, as this question/answer goes back to 2010. Using the TPL might be a good way to go, but this question was more about using more threads than cores when the actual tasks are processor-intensive rather than I/O-intensive. Even then, someone could make the mistake of forcing the TPL to use three threads and still lose performance. But the TPL can certainly help a great deal.