Introduction

In a previous article, I presented a series of brief tests on the calculation efficiency of a piece of plain CUDA code and another piece employing the Thrust library. In the tests, a summation of squares of an integer array (random numbers from 0 to 9) was performed and the execution times were recorded. The conclusion, briefly, is that parallelism on a GPU chip shows great potential, especially for chips containing a whopping number of parallel elements.

However, the computer on which I performed the tests also has an Intel Q6600 quad-core CPU, and obviously a single-threaded plain loop cannot utilize the full potential of a multi-core CPU. I therefore wrote some pieces of C# code to re-perform the summation of squares, and by comparing the efficiency of the different parallel strategies, I reached some conclusions which might interest developers working on similar topics.

System.Threading.Tasks.Parallel in .NET 4.0

I chose C# because of my familiarity with the language. Another important reason is the new support for parallel programming in the recently released Visual Studio 2010 and .NET 4.0. Since plenty of information about the feature is already available, such as the article "Introducing .NET 4.0 Parallel Programming" on CodeProject, I won't write redundant words here and will only give the code used for my calculation.

With the help of the new System.Threading.Tasks namespace provided by .NET 4.0, the for loop is replaced with a call to the static method System.Threading.Tasks.Parallel.For. An anonymous delegate defines the real work, which corresponds to the code block inside the previous plain for loop. A thread pool is commonly used behind the scenes to provide the worker threads; fortunately, we don't need to know any of those details, as the Tasks namespace looks after everything. The practical code is:
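What follows is a minimal, self-contained sketch of that call; the Random seed and the console output are illustrative choices of mine, while DATA_SIZE, data and final_sum follow the naming used in this article:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ParallelSum
{
    const int DATA_SIZE = 1 * 1024 * 1024;   // 1 M integers

    static void Main()
    {
        // Fill the array with random digits from 0 to 9,
        // matching the test data described above.
        var rand = new Random(0);
        int[] data = new int[DATA_SIZE];
        for (int i = 0; i < DATA_SIZE; i++)
            data[i] = rand.Next(10);

        int final_sum = 0;

        // The delegate body corresponds to the block inside the
        // previous plain for loop; Interlocked.Add makes the shared
        // accumulation thread-safe.
        Parallel.For(0, DATA_SIZE, i =>
        {
            Interlocked.Add(ref final_sum, data[i] * data[i]);
        });

        Console.WriteLine(final_sum);
    }
}
```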

The add work is thus parallelised. Another point to note is that the Interlocked.Add method is used to make the add operation, previously final_sum += data[i] * data[i];, thread-safe.

That's all. It is simple, isn't it? However, on my quad-core CPU the parallel code didn't show any advantage. For example, when handling 1 M (1 M equals 1 * 1024 * 1024) integers, the serial code took 7 ms but the parallel code consumed 113 ms, about 16 times as long.

Why? Here comes the interesting point.

Points of Interest

Remember that we used the Interlocked.Add method to lock the shared resource final_sum among the allocated threads. (data[i] doesn't matter, because each iteration accesses only one element of the array and the elements differ from one iteration to another.) This is critical for execution efficiency, since all threads have to queue for access to final_sum. In other words, access to final_sum is still serial even though the code is parallelised. That is basically why the execution time was not reduced; and because of the overhead of allocating and managing the threads, it actually became much longer.

Recalling the parallel algorithm used for the same purpose in my CUDA program, we don't actually need to parallelise the summation down to each single element of the array. A better way is to divide the array into a certain number of pieces, each containing an equal number of elements, with no element shared between pieces. Theoretically, we can employ a number of threads, set up a one-to-one correspondence between the threads and the pieces, and then:

Calculate the sub-summation in each piece by its corresponding thread and record the result

Do a summation of all the sub-summation results

If we split the array into 1 k (1 k equals 1 * 1024) pieces, i.e. the thread number is also 1 k, the practical code might be:
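The following is a minimal sketch of that grouped approach; GROUP_NUM, groupSize and subSums are illustrative names of mine, and the two steps match the list above:

```csharp
using System;
using System.Threading.Tasks;

class GroupedParallelSum
{
    const int DATA_SIZE = 1 * 1024 * 1024;   // 1 M integers
    const int GROUP_NUM = 1 * 1024;          // 1 k pieces, one "thread" each

    static void Main()
    {
        var rand = new Random(0);
        int[] data = new int[DATA_SIZE];
        for (int i = 0; i < DATA_SIZE; i++)
            data[i] = rand.Next(10);

        int groupSize = DATA_SIZE / GROUP_NUM;
        long[] subSums = new long[GROUP_NUM];

        // Step 1: each parallel iteration computes the sub-summation
        // of its own non-overlapping piece with a plain inner loop.
        Parallel.For(0, GROUP_NUM, g =>
        {
            long sum = 0;
            int start = g * groupSize;
            for (int i = start; i < start + groupSize; i++)
                sum += data[i] * data[i];
            subSums[g] = sum;   // one slot per group, so no lock is needed
        });

        // Step 2: sum up the sub-summation results serially.
        long final_sum = 0;
        for (int g = 0; g < GROUP_NUM; g++)
            final_sum += subSums[g];

        Console.WriteLine(final_sum);
    }
}
```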

In the CUDA version, I used 16 k threads to do the calculation. How do we determine a proper thread number for the quad-core CPU to obtain the best performance? I did a sensitivity study on this: I set TN, the thread number, to 1 k, 16 k and 256 k respectively, and tested the code execution times; the recorded values (unit: seconds) are listed in the table below:

The table also lists the times consumed by the serial code and by the parallel code without splitting the array. Briefly, the items are:

Single core: Execution time of the serially executed plain loop

Multi core: Execution time if using the method System.Threading.Tasks.Parallel.For without splitting the array into pieces

TN-1k: Execution time if splitting the array data into 1 k groups and using the Parallel.For method to manage them. In each group, an internal plain loop computes the sub-summation

TN-16k: Same as the former, but splitting the array into 16 k groups

TN-256k: Same as the former, but splitting the array into 256 k groups

The corresponding trends can be summarized in the chart below:

or compared in the histogram:

After grouping the data into a certain number of parallel units, the execution performance was dramatically improved, and it is now better than that of the serially executed loop, although the speed-up did not reach the factor of 4 suggested by my quad-core processor.

The figures also reveal that using only 1 k threads is not ideal. 16 k is better in most scenarios except when DATA_SIZE = 32 M, which probably implies that larger amounts of data need more threads to handle them.

Conclusion

My original purpose in performing the tests was to compare CUDA calculation efficiency with multi-core CPU efficiency. However, in the process, I realised that porting a serial loop to parallel code is not as simple as replacing the for loop with a Parallel.For call, although that was what I expected. In practice, just as when designing the CUDA program, thread numbers have to be considered carefully.

Based on the tests, I found that, for 1 M to 32 M integers, using 16 k threads achieves a well-balanced performance. Although for DATA_SIZE = 32 M the efficiency of 16 k threads is lower than that of 32 k threads, the loss is only around 5%.

Moreover, by comparing these C# results with the CUDA result from the previous article, I also found that:

With respect to the serial code, i.e. the plain loop code, the C++ code performed better than the C# code; both were run on the same Q6600 CPU.

The CUDA code performed better than the parallel C# code, even though both employed 16 k threads. The CUDA code ran on a 9800 GTX+ chip and the C# code on the Q6600 processor.

We know that, when using CUDA, the memory transfer between the CPU and the GPU is very costly. Yet even with that memory transfer overhead, the CUDA code still performs better than a piece of parallel C# code utilizing the 4 cores of a CPU at 2.40 GHz. At least for this summation-of-squares calculation, this is true. Does this imply that CUDA does bring great potential to us?

Code Instructions

The code file ParallelExample.cs, contained in the zip package, includes the test code for the serial and parallel methods with the different policies mentioned in the present article. Note that, for a practical benchmark, the calculation has to be repeated enough times to extract average values; for clarity and simplicity I didn't include this feature in the attached code, but it is easy to add.
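For reference, a hypothetical averaging helper might look like the sketch below; Stopwatch is the standard timing tool, while the warm-up run and the repeats parameter are my own illustrative choices:

```csharp
using System;
using System.Diagnostics;

static class Benchmark
{
    // Hypothetical helper: run the work once to warm up (JIT, caches),
    // then time several repeats and return the average in seconds.
    public static double AverageSeconds(Action work, int repeats)
    {
        work();                               // warm-up run
        var sw = Stopwatch.StartNew();
        for (int r = 0; r < repeats; r++)
            work();
        sw.Stop();
        return sw.Elapsed.TotalSeconds / repeats;
    }
}
```

Each value in the results table would then be the average over, say, ten repeats of the measured loop.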

The code was built and tested on 32-bit Windows 7 with Visual Studio 2010. The test computer is basically an Intel Q6600 quad-core CPU with 3 GB of DDR2-800 memory. Although the hard drive is not fast (rated 5.1 by Windows 7), I don't think it plays an important role in this test.

Actually, I knew that .NET 4 worked to simplify parallel computing code, especially to ease the porting of a serial piece of code to a parallel one. I meant to test whether that is true.

Daniel Grunwald proposed some other methods to improve the efficiency. I would like to incorporate these interesting approaches into the present article as well; I believe the article will then be more complete.

BTW, I just found out that Thrust has an OpenMP backend, so you can compile it for multi-core as well as GPU.

Testing the scalability of very minor functions probably won't give you a feel for the speed/efficiency of the architecture itself, though it will give you an idea of how easy each architecture is to work with.

You do want to make sure you block the data going to your threads to reduce your calls. I just finished an OpenMP version of a mass line-of-sight calculation. Out of 200 to 1600 calculations, I sent the data in 16-unit blocks to each thread; with 8 cores this meant I was running 128-unit chunks across the 8 cores at a time. Alternatively, I could send one unit at a time to the threads, but that means my threads barely start working before they ask for another job. You know you have done your job right when all 8 cores are maxed out for 51 seconds of calculation instead of one core for 5 minutes.


It is really interesting to know that Thrust has an OpenMP backend; that would be pretty useful. I will find some time to try and test it.

Yes, you are definitely right. Although my calculation seems quite trivial, I wanted to get a general idea from the testing work. I do appreciate your experience showing that blocking data is also important in an OpenMP calculation. That sounds interesting.

This is a good article, but it's misleading to say that you are creating 16k / 32k / 256k threads, because that's not what is actually happening in your test code. What you are doing is dividing the input data array into partitions, so that executing threads will have an isolated slot to work with instead of incrementing a shared counter.

The number of threads created to execute the parallel loop is determined internally, and is usually just one thread per logical core, so on a quad-core machine, you will likely only get 4-8 threads executing work items at any point, regardless of how many iterations you have in the loop. Each thread will process a distribution of the data groups.
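That behaviour can be observed directly; the small demo below (my own illustration, not from the article's code) prints the managed thread IDs actually used and shows that the worker count can also be capped explicitly with ParallelOptions:

```csharp
using System;
using System.Threading.Tasks;

class DegreeOfParallelismDemo
{
    static void Main()
    {
        // Parallel.For decides its worker count internally, usually
        // around one thread per logical core; it can be capped via
        // ParallelOptions if needed.
        Console.WriteLine("Logical cores: " + Environment.ProcessorCount);

        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.For(0, 16, options, i =>
        {
            // Only a handful of distinct thread IDs will appear,
            // regardless of the number of loop iterations.
            Console.WriteLine("iteration " + i + " on thread " +
                System.Threading.Thread.CurrentThread.ManagedThreadId);
        });
    }
}
```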

I also agree that using a shared array to collect results is likely to be causing false sharing, which is a major performance drain - basically you aren't getting cache reuse because the same lines of cache that store the global array reference keep getting invalidated by hits from threads. The more cores you have, the bigger the problem, as each core has to reload an invalidated line from system memory before it can keep processing.
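As one hedged illustration of a mitigation, step 1 of the grouped sketch earlier in the article could space the result slots a cache line apart; the padding factor assumes a 64-byte cache line and is not from the original discussion:

```csharp
// Hypothetical false-sharing mitigation: keep each group's result
// slot on its own 64-byte cache line (8 longs), so one core's write
// doesn't invalidate the lines being used by the other cores.
const int PAD = 8;                     // 8 longs = 64 bytes
long[] subSums = new long[GROUP_NUM * PAD];

Parallel.For(0, GROUP_NUM, g =>
{
    long sum = 0;
    int start = g * (DATA_SIZE / GROUP_NUM);
    for (int i = start; i < start + DATA_SIZE / GROUP_NUM; i++)
        sum += data[i] * data[i];
    subSums[g * PAD] = sum;            // each group owns a whole cache line
});
```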

The great thing about this article is that it clearly shows that Parallel Programming is NOT a panacea - you can't just switch it on and get immediate improvements for all sorts of code.

Actually, I didn't mean to say that the number of threads in execution is 16k / 32k / 256k - probably I have to modify the language. On the other hand, I think it is not wrong either to say the thread number is #k, because these threads are managed by an underlying thread pool, and we could say some of the #k threads are in fact re-used due to the cost of creating and destroying threads.

You're right, using a global counter means there's too much synchronization. Ideally, each thread needs its own counter.
However, there's no need to distribute the work yourself (picking those arbitrary 1024 groups). Instead, use one of the more advanced Parallel.For overloads:
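Daniel's original snippet isn't shown above; a sketch of the overload he means, reusing the article's data array and DATA_SIZE, could read:

```csharp
// Parallel.For overload with thread-local state: each worker thread
// accumulates into its own localSum without any locking, and
// synchronizes only once, in the localFinally delegate.
long final_sum = 0;
Parallel.For(0, DATA_SIZE,
    () => 0L,                               // localInit: per-thread sum
    (i, loopState, localSum) =>             // body: runs lock-free
        localSum + (long)data[i] * data[i],
    localSum =>                             // localFinally: once per thread
        Interlocked.Add(ref final_sum, localSum));
```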

However, both approaches suffer from another problem: the overhead of all the delegate invocations is much larger than the cost of the multiplication. The overhead of delegates and lambda expressions is very noticeable when the individual calculations are very cheap.
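A standard way to amortize that per-element delegate cost is range partitioning with Partitioner.Create from System.Collections.Concurrent; the sketch below is my illustration along those lines, again reusing the data array:

```csharp
// Requires: using System.Collections.Concurrent;
//           using System.Threading; using System.Threading.Tasks;
long final_sum = 0;
// Each delegate invocation now handles a whole [from, to) range,
// so the delegate cost is paid once per chunk, not once per element.
Parallel.ForEach(
    Partitioner.Create(0, DATA_SIZE),
    () => 0L,
    (range, loopState, localSum) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
            localSum += (long)data[i] * data[i];
        return localSum;
    },
    localSum => Interlocked.Add(ref final_sum, localSum));
```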

Also, for some strange reason the performance is now consistently different from what I got half an hour ago, with your TN-#k versions easily beating the single-threaded version. Half an hour ago I also did multiple runs and consistently got single-threaded timings that were close to your TN-#k timings. I don't have any explanation for that.

In fact, the new timings for the single-threaded approach seem too slow, as that would mean I get a 10x speedup by using 4 cores. Even taking the old measurement for single-threaded execution (around 0.07), my new version seems to have super-linear speedup.

Comparing these results with their corresponding ones shown in the article, the TN-1k value changed to 0.234 s from 0.170 s. Other timings didn't change much.

From this test on a data size of 32 M, your Parallel.For with localSum and groups version did give the best performance. Actually, I found this is also true whenever the data size >= 2 M. The results for a data size of 1 M are:

I believe it would be interesting to incorporate these new results, with the help of the code pieces you contributed, into the present article. Do you mind if I incorporate them? Would you prefer to be a co-author? If you can share the complete test results on the Core i5 as well, I would like to tidy all the values up together.

This article gives real perspective on the Parallel.For method, and also some disappointment. The whole purpose of Parallel.For, as Anders himself said when he presented it, was to bring multi-threading to C# with minimal impact on the way we code, and it seems that doesn't actually happen: to get Parallel.For to really outperform serial execution, you have to break your work into pieces, as you've shown, so that is a failure for Parallel.For.
Sure, it's easier compared to managing the threading ourselves, much easier, but there's still a long way to go for C# multi-threading.

I think that is absolutely right. Parallel.For essentially simplifies the code that produces and manages threads, but it is still an encapsulation of the thread model. When using threads, access to shared resources will always be a critical overhead to take care of carefully.

Thus, we can say System.Threading.Tasks is a convenient tool, but the underlying mechanism still needs big improvements.