The Khronos Group - a non-profit industry consortium to develop, publish and promote open standard, royalty-free media authoring and acceleration standards for desktop and handheld devices, combined with conformance qualification programs for platform and device interoperability.


The identifiers in caps are defined in a separate source (generated before compilation) as

Code :

#define REAL_TYPE double
#define NUM_SAMPLES 4096 // may be defined as anything from 4096 to 65536
#define NUM_CELL_VAR 64 // may be defined as 64, 216 or 512

It seems to me that accessing both global memory arrays in the loop may cause a bottleneck, so I've been experimenting with precopying parts of the global data into local and private memory. However, the current code still runs the fastest, perhaps because sizeof(REAL_TYPE)*NUM_SAMPLES is too large to fit into local memory all at once on my device.
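For reference, the chunked staging I experimented with looks roughly like this, as a plain-C stand-in for the kernel code (TILE and the function names are made up for illustration; in the real kernel each tile would be copied into __local memory cooperatively by the work-group, with a barrier between the copy and the compute phase):

```c
#include <assert.h>
#include <stddef.h>

#define TILE 256  /* illustrative chunk size; small enough to fit in local memory */

/* Accumulate dot(rowA, rowB) in TILE-sized chunks, the way one would stage
 * slices of a long row through local memory when the whole row
 * (sizeof(double)*NUM_SAMPLES) does not fit at once. */
double dot_tiled(const double *rowA, const double *rowB, size_t n) {
    double sum = 0.0;
    for (size_t base = 0; base < n; base += TILE) {
        double tileA[TILE], tileB[TILE];           /* stands in for __local buffers */
        size_t len = (n - base < TILE) ? (n - base) : TILE;
        for (size_t s = 0; s < len; ++s) {         /* the "copy" phase */
            tileA[s] = rowA[base + s];
            tileB[s] = rowB[base + s];
        }
        for (size_t s = 0; s < len; ++s)           /* the "compute" phase */
            sum += tileA[s] * tileB[s];
    }
    return sum;
}
```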

As you may see, this is part of a numerical integration, and the use of double precision is necessary. Anyway, clever optimization tips would be greatly appreciated.

To me it looks like this is essentially normal matrix multiplication, no? I hate to not actually give you any explicit help, but googling (or searching these forums) for "OpenCL local memory matrix multiplication" gives a number of results; unfortunately I'm not sure which are the best. You could also take a look at Nvidia's CUDA documentation, since your approach will be very similar to theirs.

Thanks, yes, it's a sum of outer products, essentially equivalent to multiplication of two rectangular matrices. I've found a few examples (maybe my google-fu is no good), but they all seem to suggest copying a buffer into local or private memory, which, in my case, causes a significant slowdown.
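To spell out the equivalence: summing one rank-1 outer product per sample gives exactly the same result as the matrix product A·Bᵀ, just with the loops reordered. A toy check in plain C (sizes and names are made up, not the real kernel's):

```c
#include <assert.h>

enum { ROWS_A = 2, ROWS_B = 3, SAMPLES = 4 };  /* illustrative sizes */

/* C as a sum of outer products: one rank-1 update per sample. */
void by_outer_products(const double A[ROWS_A][SAMPLES],
                       const double B[ROWS_B][SAMPLES],
                       double C[ROWS_A][ROWS_B]) {
    for (int i = 0; i < ROWS_A; ++i)
        for (int j = 0; j < ROWS_B; ++j)
            C[i][j] = 0.0;
    for (int s = 0; s < SAMPLES; ++s)          /* outer product of sample s */
        for (int i = 0; i < ROWS_A; ++i)
            for (int j = 0; j < ROWS_B; ++j)
                C[i][j] += A[i][s] * B[j][s];
}

/* C as a matrix product A * B^T: dot of row i of A with row j of B. */
void by_matrix_product(const double A[ROWS_A][SAMPLES],
                       const double B[ROWS_B][SAMPLES],
                       double C[ROWS_A][ROWS_B]) {
    for (int i = 0; i < ROWS_A; ++i)
        for (int j = 0; j < ROWS_B; ++j) {
            double sum = 0.0;
            for (int s = 0; s < SAMPLES; ++s)
                sum += A[i][s] * B[j][s];
            C[i][j] = sum;
        }
}
```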

I've also tried casting *gA and *gB to double4 and summing up dot(.., ..) products, which gave a slight improvement.
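In plain C, the double4 variant amounts to processing the inner loop four samples at a time, mirroring one dot() per double4 load (names here are made up; the remainder loop is only needed if n isn't a multiple of 4, and all the allowed NUM_SAMPLES values are):

```c
#include <assert.h>

/* Sketch of the double4 + dot() idea: each iteration of the main loop
 * consumes four samples, like dot(vload4(..., a), vload4(..., b)). */
double dot_by_4(const double *a, const double *b, int n) {
    double sum = 0.0;
    int s = 0;
    for (; s + 4 <= n; s += 4)   /* one "double4 dot()" per iteration */
        sum += a[s] * b[s] + a[s+1] * b[s+1]
             + a[s+2] * b[s+2] + a[s+3] * b[s+3];
    for (; s < n; ++s)           /* remainder, if n % 4 != 0 */
        sum += a[s] * b[s];
    return sum;
}
```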

The only significant improvement I've found is precomputing the starting memory addresses in the source data (the pointers &glo_A[row*NUM_SAMPLES] and &glo_B[row*NUM_SAMPLES]). That gave about a 15% improvement.
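For anyone curious, the change is essentially this (plain C rather than kernel code, with illustrative names): hoist the base address out of the loop instead of re-deriving row*NUM_SAMPLES + s on every access.

```c
#include <assert.h>

/* Precompute &glo_A[row*num_samples] once, then index relative to it. */
double row_dot_hoisted(const double *glo_A, const double *glo_B,
                       int row, int num_samples) {
    const double *gA = &glo_A[row * num_samples];  /* computed once, outside the loop */
    const double *gB = &glo_B[row * num_samples];
    double sum = 0.0;
    for (int s = 0; s < num_samples; ++s)
        sum += gA[s] * gB[s];                      /* no per-iteration row*N+s math */
    return sum;
}
```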

Currently this kernel achieves about 51 GFLOPS. Compared to the rated 152 GFLOPS of my device (for double precision), this seems a bit low, no?

I haven't had a chance to try it out yet, but you might try transposing gA and gB so that the reads from global memory are coalesced. That is to say, if NUM_SAMPLES were 8 and you had 8 threads, have the memory layout be something like:

As it is now, you read from A[0], A[NUM_SAMPLES], ... , A[(global_size(0) - 1) * NUM_SAMPLES], etc., all at the same time, which I doubt is very efficient in terms of memory bandwidth. You could also rewrite the kernel to process each line from A/B at the same time instead of transposing the data; either would probably work. I forgot to mention: if you have a profiler for your device, make sure you use it, since I am only guessing at what the bottleneck might be here.
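To make the indexing concrete, here is a plain-C sketch of the transpose (sizes and names made up; in the kernel, `row` would be get_global_id(0)). In the row-major layout, work-item `row` reads A[row*num_samples + s], so at a given s the work-items touch addresses num_samples elements apart. In the transposed layout A_t[s*num_rows + row], adjacent work-items touch consecutive addresses at each s, which the hardware can coalesce into wide memory transactions.

```c
#include <assert.h>

/* Rearrange A[row][s] (row-major) into A_t[s][row]. */
void transpose(const double *A, double *A_t, int num_rows, int num_samples) {
    for (int row = 0; row < num_rows; ++row)
        for (int s = 0; s < num_samples; ++s)
            A_t[s * num_rows + row] = A[row * num_samples + s];
}

/* Same dot product as before, but reading the transposed layout:
 * at each s, neighbouring rows read neighbouring addresses. */
double row_dot_transposed(const double *A_t, const double *B_t,
                          int row, int num_rows, int num_samples) {
    double sum = 0.0;
    for (int s = 0; s < num_samples; ++s)
        sum += A_t[s * num_rows + row] * B_t[s * num_rows + row];
    return sum;
}
```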


Thanks a lot for your input. I've just tried the CodeXL profiler from AMD. Here is what it has to say about the kernel

I haven't profiled a GPU kernel before, so I'm reading up to see what to make of these numbers. But right away I notice the VALUBusy at 56.12% (time spent on vector instructions) and SALUBusy at 1.09% (time spent on scalar instructions), which are supposedly bad. But I guess SALUBusy being low is only due to most instructions being vector type.

I'm tempted to say that 50% isn't terrible, all things considered, but theoretically your code could be twice as fast (if you were aiming to be compute-bound, I guess?). I tried running a test version of your kernel with similar parameters here:

Interestingly, the mad() version performs slightly worse, seemingly because of more cache misses? Honestly, I'm not really sure what's going on there. The only profile values that differed on my end are as follows:

Code :

Time 48.70119
VALUInsts 6542
VALUBusy 1.12
CacheHit 47

Anyway, that's all beside the point. The fact that MemUnitBusy is 90%+ while VALUBusy is small (and in the case of my device, extremely so) means that, as far as I can tell, your program is memory bound. We can tell the reads are not as good as they could be by looking at the FetchSize:

Originally Posted by AMD

FetchSize: The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.

For my parameters, glo_A and glo_B each had a size of 216 * 6144 * (8 bytes), which comes out to about 21.23 MB total, but as you can see from my profile results the total fetch size is 295.929 MB, over 10x more. Luckily the cache does help us some, so it's not as big as it could be, but I think it could be improved by trying to coalesce the global reads and/or explicitly using local memory to store reused data. Obviously we won't be able to get it down to just 21.23 MB, but I think it could be cut down some, which should raise VALUBusy.
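For the record, the ideal traffic works out like this (plain C; MB here is decimal, matching how the profiler reports FetchSize):

```c
#include <assert.h>

/* Ideal bytes fetched: rows * samples * element size, per buffer.
 * For 216 x 6144 doubles in two buffers: 21,233,664 bytes, i.e. ~21.23 MB,
 * versus the 295.929 MB the profiler actually measured. */
long total_bytes(long rows, long samples, long elem_bytes, long num_buffers) {
    return rows * samples * elem_bytes * num_buffers;
}
```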

Just an observation: The GlobalWorkSize is not an integer multiple of the WorkGroupSize (216 is not evenly divisible by 16). In OpenCL 1.x, if you specify the work group size then the global size must be a multiple of it. Then you can pass the real size as a parameter and your kernel can check to see if the global_id is within the valid size before doing work.
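A minimal sketch of that padding approach (hypothetical names; the kernel-side guard would just be `if (get_global_id(0) >= real_size) return;` with real_size passed as a kernel argument):

```c
#include <assert.h>
#include <stddef.h>

/* Round the global work size up to the next multiple of the work-group
 * size, as OpenCL 1.x requires when a local size is specified. The extra
 * padding work-items are filtered out inside the kernel by a bounds check. */
size_t round_up(size_t global_size, size_t local_size) {
    return ((global_size + local_size - 1) / local_size) * local_size;
}
```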

So I transposed the global memory buffers, and this gave some improvement. Now VALUBusy is typically 70% - 80% (SALUBusy is 10%), and overall the kernel achieves about 60 GFLOPS.

Fetch size still ranges from 100 MB to 200 MB. The fact that this varies a lot from one run to another is a bit strange. Maybe it's because the GPU is also connected to a screen and renders stuff for other applications?

Originally Posted by Dithermaster

Just an observation: The GlobalWorkSize is not an integer multiple of the WorkGroupSize (216 is not evenly divisible by 16). In OpenCL 1.x, if you specify the work group size then the global size must be a multiple of it. Then you can pass the real size as a parameter and your kernel can check to see if the global_id is within the valid size before doing work.

Hmm, you're right. It hasn't been a problem, but I've changed the work-group size to 8x8 to be sure.

BTW, I've been using dot(A,B) instead of A*B. Even when A and B are not vectors, dot() seems slightly faster than *. I guess this is highly system dependent, though.