

Writing a Kernel for an Explicit-Method Heat Equation

Hello all,

The greatest difficulty I encounter is kernel writing. I am learning OpenCL for my job, and as an exercise I am attempting to parallelize the explicit method for a 1D heat equation. The serial C implementation is actually very simple:

So I am trying to turn the meat of this program, the double for loop, into my kernel. As it stands, each layer of the array v[] is calculated at a given time step t, so that stepping in the t direction builds up the grid layer by layer. However, calculating a layer requires the previously calculated layer. I had two thoughts about this and wasn't sure how to go about either:

1. Put the time-step loop inside the kernel. I understand that I can index the arrays through get_global_id(0) and have each work-item compute one point in the layer, then wrap that in the for loop to step through time, but I just don't understand what would happen here. Otherwise, my second thought was to:

2. Have the kernel only handle a single layer, then enqueue it in a for loop for the time steps.

I am having difficulty even formulating the problem space. In MPI there is the problem that processes have to communicate continuously to compute the end points of their respective segments of the layer; I don't know if the same issue arises in OpenCL. OpenCL programmers, I need your help! Any advice is welcome, as I am lost!

Re: Writing a Kernel for an Explicit-Method Heat Equation

Well, how you solve this really depends on the problem size. If you're only doing a pile of 9x10 problems you might approach it differently than if you were solving 100x1000, or 1000x100 ...

But there are a couple of general approaches:
a) convert inner loops to multiple work-items within a work-group
b) convert outer loops to individual work-items
c) leave the code more or less the same and get parallelism by doing many at once in a batch

I'll just consider a) here, and assume that the problem size will be much bigger than 9x10 in practice.

For a, the cross work-item communication (which is required here) can be performed in two ways:
a) using local memory. Everything stays on the individual compute unit, which has its own local memory. This is extremely fast. One kernel invocation can do any number of cross-work-item communications, so, for example, you could put the time-step loop inside the kernel. The problem is that local memory is restricted in size, and so is the size of the work-group.
b) using global memory. Here data has to go to/from global memory. If the information must be communicated outside of the work-group, then the only (practical) way to synchronise the data is with separate kernel calls. In this case the time-step loop is on the CPU host. If there is no CPU-device synchronisation within each iteration, the overhead of the calls isn't too bad.

(this is why the size of the problem matters).

I'll only look at b) here since it's simpler and more general, and it equates to your case 2. If you want to do a), then you're limited to the number of work-items that can run in a single work-group on a single compute unit, so you will not get effective parallelism unless you need to do many such problems separately. (There are other ways to communicate integer/float values *between* work-groups for the overlapping cases, but they can be very expensive, will complicate the code, and will probably be much slower.)

Basically you convert the body of the inner loop into a single piece of code which is run concurrently by many work-items.
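A sketch of what such a kernel might look like, assuming fixed (zero) boundary conditions and single precision; the argument names and signature are illustrative, not a definitive implementation:

```c
/* OpenCL C: each work-item computes one point of the new layer.
   v = previous layer, w = next layer, n = number of interior points,
   r = k*dt/dx^2.  Names here are illustrative. */
__kernel void heat_step(__global const float *v,
                        __global float *w,
                        const int n,
                        const float r)
{
    int i = get_global_id(0);

    if (i >= n + 2)          /* range check: the global size is padded */
        return;              /* up to a multiple of the device width   */

    if (i == 0 || i == n + 1)
        w[i] = 0.0f;         /* boundary conditions handled in the kernel */
    else
        w[i] = v[i] + r * (v[i - 1] - 2.0f * v[i] + v[i + 1]);
}
```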

So most of this is a fairly straightforward realisation of the inner loop, but a few specific points might not be clear:
1) The range is checked because, when you launch on a GPU, you will get optimum parallelism if you give it a global work size which is a multiple of the device width (this should be at least 64 for GPUs). You then have to do your own range checking.
2) I include the boundary conditions 'in the loop'.
3) I assume the buffer pointers are swapped/double-buffered in the host code, so there's no need to copy the results back to w (due to the concurrency this won't work anyway, not with one kernel).

Having that, you then write host code that calls the kernel repeatedly for the outer loop, and just reads the result at the end. It will go something like this (pseudocode):
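A sketch under the assumptions above (the buffer names and the round-up step are illustrative):

```
create buffers a, b; write the initial layer into a
for t in 0 .. steps-1:
    set kernel args: in = a, out = b, n, r
    enqueue the kernel with global size = (n+2) rounded up to a multiple of 64
    swap(a, b)                  # double-buffer: no copy needed
read the final layer back from a
```

Note there is no read-back inside the loop, so the queue never has to synchronise with the host until the end.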

Re: Writing a Kernel for an Explicit-Method Heat Equation

Perhaps I should have been a little more confident, because I had actually set this up almost exactly as prescribed; my problem was with the NDRange. I had previously attempted to implement almost exactly the same kernel as yours, and again hit the NDRange problem. Now it is all fixed thanks to your excellent explanation of approach b), using global memory.

What I would like to do now is attempt it using local memory (approach a), after I get the global-memory implementation timed for comparison with the serial implementation. I will keep you posted, notzed, if you don't mind helping some more later on. Thanks again; you are good at what you do.

Re: Writing a Kernel for an Explicit-Method Heat Equation

I haven't touched this code in quite a while, but after coming back to it two months later, I just noticed that while the code looks and acts how it seems it should, it is actually not as precise in its calculation as the serial version. I mean, it's really, really close, but the OpenCL code is ultimately not as precise.

All my types are correct and the programs are running with the exact same parameters. What could be going on?

Re: Writing a Kernel for an Explicit-Method Heat Equation

Double precision vs Single Precision? What GPU are you using?

You can also test this in two ways:
1) Write a version of your CPU code to use floats instead of doubles. Compare your differences then.
2) Run your OpenCL code on your CPU using the Intel OpenCL SDK or the AMD SDK. Make sure you enable double precision in your kernel, since it requires an extension pragma (cl_khr_fp64).

Compare your results. If they are still different by a noticeable margin, we can dig deeper.