What really sucks about OpenCL is that you need to specifically tune for all kinds of different cards, even though your code will run on any OpenCL device. The problem is that it requires hardcore GPU understanding. For example, the bank size and the terminology are different between nVidia and ATi. Imagine programming soundcards >.<

Does the current IR successfully work as a GPU design abstraction with Clover, so that Clover converts OpenCL into general code that works just as well on nVidia as on ATi? That would be a massive win all over the place.

As I understand it, if you write code in OpenCL then it will work fine on ATI, Nvidia, multicore CPUs, etc. But if you want the code to be super fast then you need to pay close attention to things like memory layout, shared caches, and other hardware-dependent stuff, because memory bandwidth and cache misses can be significant. I think that tweaking would be very hard to automate.

The warp/work group sizes can drastically vary between hardware, and the ideal code can as well (vector programming vs other methods). During program startup, it is possible to compile the OpenCL kernels and run quick performance tests to pick an ideal method, but that assumes that you are willing to write the auto-tuning code and also to write multiple codepaths.
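The startup auto-tuning idea above can be sketched in a few lines. This is a minimal Python illustration, not real OpenCL: the two "codepaths" here are hypothetical stand-ins, whereas a real program would enqueue compiled kernels and time them with profiling events.

```python
import timeit

def autotune(variants, runs=5):
    """Time each candidate codepath and return (best_name, timings).

    `variants` maps a name to a zero-argument callable; in a real
    OpenCL program each callable would launch a compiled kernel
    variant (e.g. scalar vs vectorized) on the target device.
    """
    timings = {}
    for name, fn in variants.items():
        # Take the best of several runs to reduce timing noise.
        timings[name] = min(timeit.repeat(fn, number=1, repeat=runs))
    best = min(timings, key=timings.get)
    return best, timings

# Hypothetical stand-ins for two kernel codepaths.
def scalar_path():
    return sum(i * i for i in range(10000))

def vector_path():
    data = list(range(10000))
    return sum(x * x for x in data)

best, timings = autotune({"scalar": scalar_path, "vector": vector_path})
print("fastest codepath:", best)
```

The win is that the selection happens once per device at startup, so the per-kernel cost is paid only when the hardware actually changes.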

But you are right. If you write code that works on one OpenCL device (e.g. Nvidia), it should work on another device (CPU, DSP, AMD card, etc). There are extensions that can come into play, but as long as the device you are trying to execute on supports what you need, it should at least execute and produce results.
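That extension check can be sketched like this. The extension names are just examples; a real program would query the device's space-separated extension string via clGetDeviceInfo with CL_DEVICE_EXTENSIONS, which is the format mimicked below.

```python
def missing_extensions(required, supported):
    """Return the required extensions the device does not report."""
    return sorted(set(required) - set(supported))

# Example extension string in the CL_DEVICE_EXTENSIONS format
# (space-separated extension names).
device_exts = "cl_khr_fp64 cl_khr_global_int32_base_atomics".split()

print(missing_extensions(["cl_khr_fp64", "cl_khr_fp16"], device_exts))
# → ['cl_khr_fp16']
```

If the list comes back non-empty, the program can bail out early or fall back to a codepath that doesn't need those extensions, instead of failing at kernel compile time.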

Performance tuning of OpenCL code is affected by the specific hardware you're running on, but the code should at least execute properly on other devices.

Am I right in thinking that even between different models from the same manufacturer you might need to do different optimisations?

Yeah. Case in point would be AMD. Their r600/r700 chips were mostly 5-wide vector units, but the Cayman chips have moved to 4-wide vector units. The next architecture is supposedly going to be SIMD-based, which will lead to entirely different optimization strategies (possibly similar to Fermi, but we'll see).

One point I don't see mentioned much - current architectures are VLIW *and* SIMD.

The SIMDs are 16-wide on high end chips and 4- or 8-wide on lower end chips.

It uses LLVM and Clang, and they've got future plans for x86 support. It builds on x86_64, but I haven't gotten more than a simple hello world program to link, and the hello world program explicitly tells me that my CPU model is currently unsupported (Phenom II X6 1055T).

I'm not saying that it's feature complete or that it's perfect, but I've heard from people who've used it on ARM and it does the trick. Given that it's LGPL, I don't see any license issues with using it on Linux.

I don't see it going into the kernel (it is something that should probably remain in user-space as a library), but it might be something that could be included in distributions in the future after some further testing/porting.