This second look at CUDA has been very interesting. Nvidia has improved performance, flexibility, and robustness, and although the list of functions still to be added and small bugs still to be fixed will probably keep Nvidia's engineers busy for some time, CUDA is genuinely usable.

We were able to evaluate the performance of Nvidia's GeForce 8 class GPUs in a practical application, and we saw a huge performance advantage for these GPUs over CPUs with an algorithm that was initially intended for CPUs. This is enough to open new perspectives for this type of application. These results differ from our previous conclusion on CUDA, in which we said that the power of the GeForce 8800 wasn't yet sufficient to justify new development compared to multi-CPU systems. It appears we underestimated two important points.

The first is that a high-end GPU comes with a memory bandwidth of roughly 100 GB/s, whereas the four cores of a CPU have to share about 10 GB/s. This is a significant difference that limits CPUs in certain cases. The second point is that a GPU is designed to maximize throughput: it keeps a very large number of threads in flight so that while some are waiting on memory, others can execute, whereas a CPU will more often end up stalled for many cycles. These two reasons give GPUs an enormous advantage over CPUs in certain applications, such as the one we tested.
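As an illustration (our own sketch, not code from the application we tested), a memory-bound kernel such as SAXPY shows why bandwidth dominates: each thread performs only two arithmetic operations for every twelve bytes it moves, so throughput is set almost entirely by how fast memory can be streamed.

```cuda
// Sketch of a memory-bound SAXPY kernel (y = a*x + y).
// Each thread touches 12 bytes (read x[i], read y[i], write y[i]) for only
// two arithmetic operations, so performance is dictated by memory bandwidth:
// ~100 GB/s on a high-end GPU versus ~10 GB/s shared by four CPU cores.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```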

Using a GPU with CUDA may seem very difficult, even adventurous, at first, but it is much simpler than most people think. The reason is that a GPU isn't meant to replace the CPU, but rather to assist it with certain specific tasks. This isn't about parallelizing a task so it can use several cores (as is currently the case with CPUs), but about implementing a task that is naturally massively parallel, and such tasks are numerous. A race car isn't used to transport cattle, and we don't drive a tractor in an F1 race. The same goes for GPUs. It is therefore all about making a massively parallel algorithm efficient on a given architecture.

It would be a mistake to view a GPU such as the GeForce 8800 merely as a cluster of 128 cores to be exploited by segmenting an algorithm. A GeForce 8800 isn't just 128 processors; it is first of all up to 25,000 threads in flight! We therefore have to provide the GPU with an enormous number of threads, keep them within the hardware limits for maximum productivity, and let the GPU execute them efficiently.
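To make this concrete, here is a minimal sketch (the kernel name and sizes are our own, purely illustrative): rather than creating one thread per "core," we launch one thread per data element, far more threads than the hardware runs at once, and let the GPU's scheduler keep its multiprocessors busy by switching among them.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: scale every element of a vector.
__global__ void scale(int n, float a, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] *= a;
}

int main()
{
    int n = 1 << 20;                 // one million elements: one thread each
    float *d_y;
    cudaMalloc(&d_y, n * sizeof(float));

    int threadsPerBlock = 256;       // a multiple of the 32-thread warp size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 4096 blocks

    // Over a million threads are requested; the GPU keeps thousands in
    // flight at any moment and switches among them to hide memory latency.
    scale<<<blocks, threadsPerBlock>>>(n, 2.0f, d_y);
    cudaDeviceSynchronize();

    cudaFree(d_y);
    return 0;
}
```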

Some very important points we have to worry about when coding for a CPU must be set aside so we can focus on others. This is a change of habits that is unfortunately not often taught in universities. Nvidia is aware of this problem and knows it is a key element in the successful use of its chips as compute units. For this reason, David Kirk, Chief Scientist at Nvidia, has taken on the responsibility of teaching a class on massively parallel programming at the University of Illinois at Urbana-Champaign, and Nvidia has supported and sponsored other similar courses.

David Kirk's very interesting class is available online as a free-to-use learning kit. While the course uses the GeForce 8800 as its example, the concepts presented aren't specific to any given architecture, with the exception, of course, of the optimizations. It can be found here:

Moreover, if the subject interests you, we recommend reading the interview transcripts published by Beyond3D with David Kirk, Andy Keane, and Ian Buck: Nvidia's Chief Scientist, General Manager of the GPU Computing group, and CUDA Software Manager, respectively.