I know very little about programming and CUDA's limitations. I hear that GPUs and CPUs are different beasts, with different purposes and different inner workings.

So my question to everyone: is it remotely possible to use CUDA to compile binaries on our systems? Something similar to DistCC, maybe? It's a shot in the dark, but I would like to know.

Oh yeah, I once found a nice project where GPUs are used for live video recognition; the program was able to track thousands of points at a time. They even had a small game where they could control Darth Vader as he ran around on their desk. I can't find it for the life of me; does anyone know it?

EDIT: I found the project from my second question: OpenVIDIA. It was on the CUDA page =).
<http://openvidia.sourceforge.net/index.php/OpenVIDIA>

Well, I doubt that you would gain much performance with the current GPU architecture, because compilers usually don't work in parallel. Of course, when you have a multi-core platform, you can simply instruct make to launch multiple compiler calls in parallel, and you get a speed benefit. But the compiler itself does not, as far as I know, work in parallel.

In general, there are two types of parallelization possible:
1. Compiling multiple individual source files in parallel,
2. Compiling one source file in parallel.

The first approach would require you to port the compiler to the GPU; the second would require developing a completely new compiler algorithm.

For a GPU, you also have to take into account that its architecture is completely different from a CPU's. First, you have a very limited amount of cache. Where a modern CPU has megabytes of L2 cache, a GPU multiprocessor has only 16 kB of shared memory (the CUDA name for its software-managed cache) and 8 kB of texture cache, and that is shared among all of its threads (usually 256 or more). Second, inter-thread communication is limited: only the threads executed on the same multiprocessor can communicate directly with each other. All other communication has to go through global memory, which is very slow compared to the cache.
Another downside is the host <-> device memory transfer, which has to be kept very efficient.
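To make those two communication levels concrete, here is a sketch (all names are made up) of a per-block sum reduction. Threads within one block cooperate through shared memory and `__syncthreads()`; partial results from different blocks can only meet in global memory:

```cuda
// One partial sum per block: fast shared memory inside a block,
// slow global memory between blocks.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float cache[256];        // lives in the multiprocessor's shared memory
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // barrier: works only within this block

    // Tree reduction inside the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    // Inter-block "communication" has to go through global memory.
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}
```

Combining the per-block partial sums then requires either a second kernel launch or a copy back to the host, which is exactly where the slow global memory and host <-> device transfers show up.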

In short, the GPU is good if your algorithm performs many calculations on independent pieces of data, has no need to synchronize or communicate intermediate results, and can use a lot of threads. That's usually the case in image processing, where you can process each pixel individually.
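As a sketch of that "one thread per pixel" pattern (kernel and parameter names are hypothetical): every pixel is independent, so no synchronization or shared state is needed at all.

```cuda
// Brighten an 8-bit grayscale image: each thread handles exactly one pixel.
__global__ void brighten(unsigned char *img, int width, int height, int delta)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // guard threads past the edge

    int v = img[y * width + x] + delta;
    img[y * width + x] = (v > 255) ? 255 : (unsigned char)v;   // clamp
}

// Launch with enough blocks to cover the whole image, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   brighten<<<grid, block>>>(d_img, width, height, 40);
```

A compiler is almost the opposite workload: lots of pointer-chasing over a shared symbol table and syntax tree, with heavy data dependencies between steps, which is why it maps so poorly onto this model.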