I’m looking for a way to distribute some number crunching across multiple PCs in a network (we’re talking about 10 PCs). I primarily want to utilize their GPUs but using their CPUs might still give some extra speed.

I’ve seen that Microsoft’s C++ AMP seems promising, since it runs C++ code on the CPU and, with the help of DirectCompute, on the GPU. Do any of you have experience with it? How flexible is it really? Can I share resources between AMP and DirectCompute or CUDA? How well does it map code to the GPU? Perhaps some hand-optimized code can be faster… Is it possible to add hand-optimized code for certain tasks?

I noticed Microsoft’s implementation of the Message Passing Interface. What else does it offer aside from the message passing? I’ve seen some basic scheduling stuff and read about a monitoring tool. What are your experiences with this software? Do you know of better implementations of the Message Passing Interface? Is it easy to add new scheduling strategies? Does this run entirely on Windows 7? Or what is the current state of the art in distributed computing in local networks?

Generally, I’m wondering whether C++ AMP and Microsoft’s MPI are a good combination. Any thoughts on that? What would your choice be if you wanted to use all the GPUs in a network?

It has been a while since I last thought about switching from CUDA or DirectCompute to OpenCL. Back in the day, OpenCL was too slow. I looked around for benchmarks and actually found one that compared CUDA with OpenCL and AMP. AMP is the slowest, so this kind of rules the whole thing out. I guess it will take more time until AMP catches up. (It is still in beta, so I’ll look again when it actually ships.) OpenCL seems to be quite close to the performance of CUDA now, which is very nice actually. I think I’ll look into it (if only for the experience).

This leaves the question about Microsoft’s Message Passing Interface open. Is it a good choice?

Have you found the optimal algorithm yet for such a distributed environment?

Which API is the best fit for it?

This is perhaps the most important question.

You've got a network of tightly coupled computing nodes connected through slower, loosely coupled computing nodes. You'll need to design your algorithm accordingly, to favor communication between the nodes on a single machine and minimize communication between machines.
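To make the point about favoring on-machine communication a bit more concrete, here is a rough Python sketch. It is not MPI code, just a counting model, and the names `flat_sum`, `hierarchical_sum` and the `partials_by_machine` layout are made up for illustration. It assumes every worker (GPU or CPU core) holds one partial result and that only messages between machines are expensive.

```python
def flat_sum(partials_by_machine, root_machine):
    """Every worker sends its partial result straight to the root machine."""
    total = 0
    network_messages = 0
    for machine, partials in partials_by_machine.items():
        for value in partials:
            total += value
            if machine != root_machine:
                network_messages += 1  # this partial crosses the network
    return total, network_messages


def hierarchical_sum(partials_by_machine, root_machine):
    """Combine on each machine first, then send one value per machine."""
    total = 0
    network_messages = 0
    for machine, partials in partials_by_machine.items():
        local_total = sum(partials)  # stays inside one machine
        total += local_total
        if machine != root_machine:
            network_messages += 1  # only the combined value crosses the network
    return total, network_messages
```

With three machines holding four partial results each, both functions return the same sum, but the flat version counts 8 network messages and the hierarchical one counts 2. In a real MPI program the same idea would usually be expressed with a per-machine communicator and a two-stage reduction, but that part is left out here.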

Oh, yes indeed. I found a paper that compared Microsoft’s MPI to a Unix implementation and it turned out Microsoft isn’t too far behind, which would be okay for me. I definitely have to do further research, though.

The research for an optimal (or at least a useful) algorithm is the one and only purpose of the project. And yeah, that’s why I’d like to find an API that has good tools for workload monitoring and for experimenting with scheduling strategies. That’s where I hope you have some experience to share.

When you're writing your algorithm, make sure that you're measuring the performance of your distributed work. I once wrote an algorithm which distributed work out to multiple CPUs and broke the work down into very small increments. It turned out that while each piece of work was done quickly, the overhead of the network latency pushed the overall time to complete the job up to about 1 minute. When I made the jobs bigger, the computers could spend more time doing work on the CPU rather than sending and receiving network packets, which brought the total time down to about 15 seconds. There's probably a sweet spot for every algorithm where you maximize the work and minimize the time taken.
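The sweet spot described above can be illustrated with a very simple cost model. This is only a sketch under assumptions that are not from the thread: every chunk pays a fixed overhead (network round trip), every item costs the same to compute, and chunks are handed out evenly in rounds. The function names and all numbers are hypothetical. Real measurements should replace the model as soon as they are available.

```python
import math


def estimated_time(n_items, chunk_size, workers, cost_per_item, overhead_per_chunk):
    """Rough wall-clock estimate for processing n_items in fixed-size chunks.

    Treats the last chunk as full size, so it slightly overestimates.
    """
    n_chunks = math.ceil(n_items / chunk_size)
    rounds = math.ceil(n_chunks / workers)  # chunks each worker handles in sequence
    return rounds * (overhead_per_chunk + chunk_size * cost_per_item)


def best_chunk_size(n_items, workers, cost_per_item, overhead_per_chunk, candidates):
    """Pick the candidate chunk size with the lowest estimated time."""
    return min(
        candidates,
        key=lambda size: estimated_time(
            n_items, size, workers, cost_per_item, overhead_per_chunk
        ),
    )
```

For example, with 10,000 items, 10 workers, 1 ms of compute per item and 50 ms of overhead per chunk, a chunk size of 1 is dominated by overhead, a chunk size of 10,000 leaves nine workers idle, and a chunk size around 1,000 comes out best among the powers of ten.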

For what it's worth, the university I was at purchased and used an NVIDIA card which had a ton of processing power. I think it was the Tesla? I don't know what the price tag looks like or how to use it, so I can't make any recommendations from experience.

Thanks for the tip! I’ll watch out for it and make sure that the workload size is scalable, perhaps somehow adaptive.

I had the pleasure of working with a Tesla for a few months. My colleagues could hear when my fillrate increased. The performance was awesome. The 10 desktops I’m working with have GTX460s, which is good enough I hope.
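One possible way to make the workload size adaptive, sketched in Python. This is just one heuristic and not something from the thread: measure how long the last chunk took to compute and how much overhead it cost, then resize the next chunk so that overhead stays at a chosen fraction of the total. The function name, parameters and default limits are all hypothetical.

```python
def next_chunk_size(current_size, compute_s, overhead_s,
                    target_overhead_fraction=0.05, min_size=1, max_size=1_000_000):
    """Suggest the next chunk size from timings measured on the last chunk.

    Aims for overhead_s / (overhead_s + compute time) == target_overhead_fraction,
    assuming compute time grows linearly with chunk size.
    """
    if not 0.0 < target_overhead_fraction < 1.0:
        raise ValueError("target_overhead_fraction must be between 0 and 1")
    if compute_s <= 0.0 or current_size <= 0:
        return current_size  # no usable measurement, keep the current size

    cost_per_item = compute_s / current_size
    # Compute time per chunk that would put the overhead exactly at the target.
    wanted_compute_s = (
        overhead_s * (1.0 - target_overhead_fraction) / target_overhead_fraction
    )
    suggested = round(wanted_compute_s / cost_per_item)
    return max(min_size, min(max_size, suggested))
```

Calling this after every completed chunk lets each machine drift toward its own chunk size, which may help when the PCs or their network links are not equally fast. The upper limit matters too, since very large chunks bring back the load-imbalance problem.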

I wrote the app in Java and used MPI. I don't have the source code available with me so I can't go into specifics on how I did it. If you're interested, I can get back to you in a few months when I get home.

Alright, then! It looks like I’ll go with MPI. (I’ll first have a look at Microsoft’s implementation.)

That’s a very nice offer, thanks. If I have trouble setting the MPI up, I’ll be back.