I have a PCI data acquisition card that supports P2P. It will be capturing records one after the other at a very rapid rate, and the plan is to write each record to the GPU using DirectGMA, where a kernel will process the data. I can't handle the records sequentially as there is no time between records for the kernel to run. Instead, I was thinking of having two device buffers that the PCI card would alternately write to. After a record arrives in the first buffer I would run the kernel on this data, while at the same time start a wait for data to arrive in the second buffer. Once this data has arrived, I would run the kernel on this, start a wait on the first buffer, and so on.

I'm not familiar with out-of-order queues and CL events, so I'm looking for suggestions on how I could achieve this. This is my attempt so far (I haven't included reading the results back to the host, as that part should be trivial):

1. clEnqueueWaitSignalAMD (buffer A)

2. clWaitForEvents (step #1 to complete)

3. clEnqueueNDRangeKernel (arg = buffer A)

4. clEnqueueWaitSignalAMD (buffer B)

5. clWaitForEvents (step #4 to complete)

6. clEnqueueNDRangeKernel (arg = buffer B)

7. Go back to #1

With an out-of-order queue I'm assuming 3 & 4 will happen "at the same time" (ditto for 6 & 1). However, I think this is flawed: at step 5 the host waits for the buffer B wait-signal to complete before proceeding, so there is an implicit assumption that this will always take longer than the kernel run in step 3. But what if it doesn't? Do I need to cater for this in some way, and if so, how? Or does this whole pattern rely on the kernel taking less time than a record write, since otherwise the program couldn't keep up with the data throughput anyway?

Am I on the right lines here? Any pointers would be greatly appreciated.

Currently, AMD's OpenCL runtime does not support out-of-order queues on the host side; each host-side queue executes commands in an in-order fashion (check the "queue properties" field in the clinfo output). If out-of-order command execution is required, use two or more queues to submit the commands.
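As a side note, this can also be checked programmatically rather than via clinfo, using the standard clGetDeviceInfo query. A sketch in the same loose style as the example further down (`device` is assumed to be your cl_device_id):

```
cl_command_queue_properties props = 0;
clGetDeviceInfo(device, CL_DEVICE_QUEUE_PROPERTIES, sizeof(props), &props, NULL);
if (props & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE)
    // the runtime reports out-of-order host queues
else
    // in-order only: use multiple queues to overlap work
```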

I assume that by using two queues (one per buffer), I would just run the commands in sequence on each queue: clEnqueueWaitSignalAMD(), clEnqueueNDRangeKernel(), clEnqueueReadBuffer(). When the latter completes I just enqueue the same three commands again.

Does this mean I have to manage each queue in its own host thread? I can't see how else to independently wait for each queue's clEnqueueReadBuffer() command to complete.

Using a separate thread for each queue (or buffer) might be a good option for processing each buffer independently. If a single thread is used, there will be some ordering between the two dependency chains. For example, a typical single-threaded approach may look like the one below.

while (1)
{
    // Set up a dependency chain for bufferA so that processing of bufferA
    // can start once the GMA write to bufferA completes.
    clEnqueueWaitSignalAMD(queueA, bufferA, marker++, e1)
    clEnqueueNDRangeKernel(queueA, bufferA, 1, e1, e2)
    clEnqueueReadBuffer(queueA, bufferA, hostBufferA, 1, e2, e3)
    clFlush(queueA)              // submit all the commands without blocking the host thread

    // Once the GMA write to bufferA completes, bufferB can be used for writing.
    // Now set up a similar dependency chain for bufferB.
    clEnqueueWaitSignalAMD(queueB, bufferB, marker++, e4)
    clEnqueueNDRangeKernel(queueB, bufferB, 1, e4, e5)
    clEnqueueReadBuffer(queueB, bufferB, hostBufferB, 1, e5, e6)
    clFlush(queueB)              // submit all the commands without blocking the host thread

    clWaitForEvents(e3)          // wait for the bufferA chain to complete (blocking call)
    readyBufferA = true          // bufferA is now ready for a GMA write again; signal the GMA writer
                                 // [note: the GMA writer must not use bufferA until it gets this signal]
    host_Process(hostBufferA)    // alternatively, launch a new thread to process the host buffer in parallel

    clWaitForEvents(e6)          // wait for the bufferB chain to complete (blocking call)
    readyBufferB = true          // bufferB is now ready for a GMA write again; signal the GMA writer
                                 // [note: the GMA writer must not use bufferB until it gets this signal]
    host_Process(hostBufferB)    // alternatively, launch a new thread to process the host buffer in parallel
}

Note: some of the events can be omitted when an in-order queue is used, since an in-order queue already serializes the commands submitted to it.
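In the multi-threaded variant mentioned at the top of this reply, each host thread would own one queue/buffer pair and run the same chain independently, so a blocking wait on one buffer never delays the other. A sketch in the same pseudocode style (marker handling is simplified; names are illustrative):

```
// Each host thread owns one queue/buffer pair and loops independently:
threadFunc(queue, buffer, hostBuffer, readyFlag)
{
    while (1)
    {
        clEnqueueWaitSignalAMD(queue, buffer, marker++, e1)
        clEnqueueNDRangeKernel(queue, buffer, 1, e1, e2)
        clEnqueueReadBuffer(queue, buffer, hostBuffer, 1, e2, e3)
        clFlush(queue)
        clWaitForEvents(e3)     // blocks only this thread
        readyFlag = true        // signal the GMA writer that this buffer is free again
        host_Process(hostBuffer)
    }
}

// Host: launch threadFunc(queueA, bufferA, hostBufferA, readyBufferA)
//       and    threadFunc(queueB, bufferB, hostBufferB, readyBufferB)
```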

My only comment is that the other PCI card acquires data at a fixed frequency, and there is no mechanism to send a signal telling it that the buffer is ready to write to. I guess I just have to hope that the GPU can "keep up" with the data throughput, and complete the command chain before the buffer is written to again.

No separate signalling mechanism is needed. I just meant that a new DirectGMA write should start only once the GPU has finished its work on the same buffer. For example, if there is an API for the DirectGMA write, check a corresponding flag (or similar) before calling that API to initiate the DirectGMA write to a buffer. Please note: a buffer must not be used for a DirectGMA write while the GPU is still reading/processing that same buffer.
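To make the writer side concrete, the DirectGMA producer would gate each transfer on the per-buffer flags set in the processing loop above. A sketch in the same pseudocode style (waitUntil() and startDirectGMAWrite() are hypothetical stand-ins for whatever mechanism/API your capture card's SDK actually provides):

```
// DirectGMA writer side, alternating between the two buffers:
while (1)
{
    waitUntil(readyBufferA)        // never write while the GPU still owns bufferA
    readyBufferA = false
    startDirectGMAWrite(bufferA)   // hypothetical call that triggers the P2P transfer
    waitUntil(readyBufferB)
    readyBufferB = false
    startDirectGMAWrite(bufferB)
}
```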

In general, clEnqueue<> calls do not block, except those that take an extra input argument to explicitly request a blocking operation. Regarding clEnqueueWaitSignalAMD(), however, I'm a little bit confused here, because this page says: "This command instructs the OpenCL to wait until <value> is written to <buffer> before issuing the next command". If it's a blocking call, then I think the multi-threaded approach would be a better solution than the single-threaded one I described earlier. I will check with the appropriate team and get back to you.

It looks like two different implementations of clEnqueueWaitSignalAMD exist. It's an asynchronous call in the newer PAL OpenCL stack, whereas in the ORCA OpenCL stack it is a blocking call because it waits on the CPU. So I would suggest trying a simple test before finalizing any design.

This behavior mainly depends on the hardware and the driver; it has nothing to do with the SDK. I have already asked the team about the expected behavior on Windows. Once I get their reply, I'll share it with you.

From what I've learned, clEnqueueWaitSignalAMD() is a blocking call on Windows with WDDM1. Otherwise, it is expected to be an asynchronous call; for example, on Windows 10, which includes WDDM2, it is asynchronous.

In any case, to make sure, you can always instrument the code to see whether the call blocks or not.

Is there a particular technique for checking whether or not the command is async? Could I measure the time taken by the clEnqueueWaitSignalAMD() call? If it's non-blocking then I guess it should return within a few microseconds; if it's blocking then the elapsed time should roughly match the period of the records being written over P2P.

Yes, an async call returns just after enqueuing the command to the command queue; it does not wait for that command to complete, whereas a blocking call returns only once the command has completed. In other words, if clGetEventInfo is used to check the status of the associated event, an async command may return any state from CL_QUEUED to CL_COMPLETE, while for a blocking call the same query is expected to return CL_COMPLETE only.