The Khronos Group - a non-profit industry consortium to develop, publish and promote open standard, royalty-free media authoring and acceleration standards for desktop and handheld devices, combined with conformance qualification programs for platform and device interoperability.


Profiling Code

Ok, so I have my code running, but I have to say I'm disappointed with the performance.

It's a particle system and I get the following approximate performance statistics:

Scalar version on the CPU: 1 Million particles per second.
GPGPU version using GLSL: 55 Million particles per second.
OpenCL version on CPU: 5 Million particles per second.
OpenCL version on GPU: 4 Million particles per second.

OpenCL on the CPU seems about right. I'm doing calculations on 3-component float vectors (packed in float4s) and I'm on a Core 2 Duo, so two cores. Two cores times three useful SIMD lanes out of four gives a theoretical maximum of a six-times speed-up, and that's not counting some unavoidable scalar calculation. I'm happy with that result.

The problem is obviously the GPU-based OpenCL. It's about 12x slower than my GPGPU implementation, and it's even slower than the CPU OpenCL version. Clearly something is going very wrong. I suspect it's down to memory access, but I don't know for sure.

How can I find out what is making my code slow?
What profiling tools are there?

I'm currently on Snow Leopard, but could probably get my code to Linux if there were better tools there.

Re: Profiling Code

Work-group size is a really important factor. If you're setting it to 1 you will get terrible utilization on GPUs! You also need to make sure that the data transfer is not killing you. On the CPU the data doesn't have to move over the PCI bus, but it does on the GPU. This means you want to move as little data as possible and do a lot of computation. Given that your GLSL performance is high I doubt this is the issue, though.
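As a sketch of what "setting the work-group size" looks like on the host side (the queue, kernel, and size variables here are illustrative names, not from the thread):

```c
/* Hedged sketch: pass an explicit local work-group size instead of NULL.
 * "queue" and "integrateKernel" are hypothetical names. */
size_t global = 1000000;  /* total work-items; must be a multiple of local */
size_t local  = 64;       /* try 32/64/128 and measure on your device */
cl_int err = clEnqueueNDRangeKernel(queue, integrateKernel,
                                    1,        /* work_dim */
                                    NULL,     /* global offset */
                                    &global,
                                    &local,   /* NULL here = driver decides */
                                    0, NULL, NULL);
```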

Each time I enqueue it I use a 1-dimensional, million-item global work size. I've been leaving the local work size as NULL to let the driver decide an optimal value.

- starPosition/Velocities are read/write buffer objects, but used for input only.

- newStarPosition/Velocities are read/write buffer objects, but used for output only.

- After each iteration the buffer objects are swapped new<->old, avoiding any copies.

- One call to enqueueReadBuffer is performed on newStarPositions after an iteration to get the locations back for display. It's non-blocking, but I do wait for the event before the display. Removing this read and the display routines doesn't make a huge difference.

- The main load appears to be the loop. I would have liked to put the galaxyPositions/Masses into constant space, but making that change (i.e. changing global to constant) crashes the compiler.

- numberOfGalaxies = 20.

- The speeds I gave before were for 3-element vectors. I changed it to 4 as an experiment and got a speed increase of about 1 million particles per second on both the CPU and the GPU.

Re: Profiling Code

Hmm, the code looks quite sensible, and there are no barriers/fences in there that might mess things up. My best guess would be that the reads from the non-const input arrays are causing something bad (and unneeded) to happen with the memory/cache on the device. That would hurt quite a lot, especially given that every work-item reads every single galaxy.

Does changing "global float4 *" to "global const float4 *" change the timings at all? It might let the compiler optimize the loads better. You mentioned as well that the buffers are both read/write. Did you do a comparison at all with marking them either read or write exclusively and doing copies? It might be interesting to see if that makes any difference.
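For reference, the change being suggested is just a qualifier on the kernel signature; the parameter names below are guesses reconstructed from the buffer names mentioned in this thread:

```c
/* Hypothetical kernel signature based on the discussion, with the
 * input-only buffers marked const so the compiler can cache the loads. */
kernel void integrate(global const float4 *starPositions,    /* input */
                      global const float4 *starVelocities,   /* input */
                      global float4 *newStarPositions,       /* output */
                      global float4 *newStarVelocities,      /* output */
                      global const float4 *galaxyPositions,
                      global const float  *galaxyMasses,
                      const int numberOfGalaxies);
```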

Re: Profiling Code

It brings the GPU to approximate parity with the CPU (6M each). Worthwhile, but it's not the order of magnitude I'm looking for.

You mentioned as well that the buffers are both read/write. Did you do a comparison at all with marking them either read or write exclusively and doing copies? Might be interesting to see if that makes any difference.

I just gave it a go, and it cost me a little (about 300k particles a second on the GPU - about 800k on the CPU). I have also tried just having a single buffer that's read and written to, and that didn't seem to be a win or a loss. I was wondering if I might benefit from a smaller cache footprint.

Re: Profiling Code

Paul,
Use async_work_group_copy to copy your 20 galaxies into local memory. That should give you a tremendous speed boost. Currently you are reading the same data from global memory for every work-item, which is probably causing a huge slowdown. (On the CPU this data ends up in the cache, so you don't see the hit as much.)
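A minimal sketch of that pattern (the kernel and variable names are assumptions, not from the thread; per the OpenCL spec, the event returned by the copy must be waited on with wait_group_events before the local data is used):

```c
#define MAX_GALAXIES 20  /* numberOfGalaxies from earlier in the thread */

kernel void integrate(global const float4 *galaxyPositions,
                      const int numberOfGalaxies
                      /* ...other arguments elided... */)
{
    local float4 localGalaxies[MAX_GALAXIES];

    /* Copy the shared galaxy data into local memory once per work-group. */
    event_t evt = async_work_group_copy(localGalaxies, galaxyPositions,
                                        numberOfGalaxies, 0);
    wait_group_events(1, &evt);

    /* ...inner loop now reads localGalaxies instead of galaxyPositions... */
}
```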

Re: Profiling Code

I've now taken that improvement over to a version that shares an OpenGL VBO as the starPosition memory buffer object. This eliminates the read-back and re-submission of all the position data for display, which was becoming significant.

We're now up to 30 Million particles per second, which is within a factor of 2 of my best GPGPU results on this machine, and 7.5 times what we started with.

I expect some of the difference vs my GLSL kernel is that the galaxyPositions/Masses were defined as uniforms there, so they could have been loaded once into fast memory and left alone. This kernel has to copy them into local memory for each work-group. Am I right in saying that loading the values into constant memory would have the same effect?
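For comparison, the constant-space variant being asked about is just a different address-space qualifier on the same arguments (the same change that was crashing the compiler earlier in the thread; names are again hypothetical):

```c
/* Hypothetical: constant address space instead of global + local copy.
 * No async_work_group_copy needed; the device can keep these in its
 * constant cache across work-groups. */
kernel void integrate(constant float4 *galaxyPositions,
                      constant float  *galaxyMasses,
                      const int numberOfGalaxies
                      /* ...other arguments elided... */);
```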

Not sure how much more there is to squeeze out of this, but it's been an interesting experiment. Hope it's been useful to others too.

I could still do with some form of profiling; I know it's not that easy, though.

Re: Profiling Code

Paul,

You should put in a memory barrier after your async_work_group_copy calls to make sure all outstanding memory accesses across the work-group are done before your kernel continues. (This shouldn't matter on current hardware, but may be needed in the future.)

The other thing you should do is enable MADs. Take a look at the OpenCL documentation for the build option that enables use of the mad instruction (it's -cl-mad-enable, passed to clBuildProgram). This is off by default in CL, but on by default in GLSL, so you may be able to get a boost out of that.
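On the host side that's a one-line change at program build time (sketch; the program and device variables are illustrative names):

```c
/* -cl-mad-enable lets the compiler fuse a*b+c into a single mad
 * instruction, trading a little accuracy for speed. */
cl_int err = clBuildProgram(program, 1, &device,
                            "-cl-mad-enable", NULL, NULL);
```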