

No speed up from using 2 GPUs

I just got 2 new GPUs yesterday; they are both NVIDIA C2070s. I wrote a simple program to compare the runtime of using 1 GPU versus 2 GPUs. Surprisingly, 2 GPUs don't give me any speedup. Basically, I have 2 kernels, each with its own independent inputs and outputs. I ran different variations of the number of contexts and command queues, with the command queues always set to in-order execution. This is the result:

2 command queues on 2 contexts on 2 devices
(run kernel A on command queue A, which is on context A and includes only device A; run kernel B on command queue B, which is on context B and includes only device B)
total time: 519,748 microseconds

Running 1 kernel by itself takes 198,018 microseconds (measured from when the kernel starts running on the GPU until it finishes; no CPU-side time is included).

Can anyone explain what's going on? I expected to get some speedup from using 2 GPUs, but apparently I don't.

Re: No speed up from using 2 GPUs

How are you measuring the time? Are you executing the same amount of work on the 1-device case and the 2-device case? I.e. if you are running 100 work-items for the 1-device example, are you then running 50 work-items per device in the two-device example? Is it possible that your execution time is bandwidth bound rather than ALU bound?

Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

Originally Posted by david.garcia

Are you executing the same amount of work on the 1-device case and the 2-device case?

I execute 2 kernels, each with 100 work-items. When I use 1 device, I run both kernels on that device (100 + 100 work-items). When I use 2 devices, I run one kernel on each device (100 work-items on one and 100 on the other).

Originally Posted by david.garcia

Is it possible that your execution time is bandwidth bound rather than ALU bound?

I expect the buffer reads and writes to be bandwidth bound, but the kernel execution shouldn't be. These are the runtimes of the buffer write, kernel execution, and buffer read, measured with clGetEventProfilingInfo:
write: 1 microsecond
run: 198,026 microseconds
read: 80 microseconds
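
For reference, this is roughly how I get each of those numbers from the command's event; elapsed_us is just a helper of mine, and it assumes the queues were created with CL_QUEUE_PROFILING_ENABLE:

#include <CL/cl.h>

/* Device-side duration of a completed command, in microseconds.
 * The queue that produced the event must have been created with
 * CL_QUEUE_PROFILING_ENABLE. */
static cl_ulong elapsed_us(cl_event ev)
{
    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    return (end - start) / 1000; /* timestamps are in nanoseconds */
}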

These numbers are for kernel A; kernel B takes about the same time, and these results hold across all the variations. You can see that I spend most of the time running the kernel. In the last variation, where I have 2 command queues on 2 different contexts associated with 2 different devices, the 2 kernels should run concurrently, since I call clEnqueueNDRangeKernel for one right after the other and clFlush both queues, so I expect the 198,026-microsecond runtimes of the 2 kernels to overlap.
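
In outline, that last variation looks like this; queueA/queueB and kernelA/kernelB stand for the objects created earlier on the two contexts, and error checking is omitted:

size_t global = 100;  /* 100 work-items per kernel */
cl_event evA, evB;

/* Each kernel goes to its own in-order queue on its own context/device. */
clEnqueueNDRangeKernel(queueA, kernelA, 1, NULL, &global, NULL, 0, NULL, &evA);
clEnqueueNDRangeKernel(queueB, kernelB, 1, NULL, &global, NULL, 0, NULL, &evB);

/* Flush both queues so both commands are submitted before waiting. */
clFlush(queueA);
clFlush(queueB);

/* Wait for both devices to finish before stopping the wall-clock timer. */
clFinish(queueA);
clFinish(queueB);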

Re: No speed up from using 2 GPUs

These are the runtimes of the buffer write, kernel execution, and buffer read, measured with clGetEventProfilingInfo:
write: 1 microsecond
run: 198,026 microseconds
read: 80 microseconds

You can see that I spend most of the time running the kernel.

That's not so clear to me. You are apparently measuring the time it takes to execute clEnqueueReadBuffer()/clEnqueueWriteBuffer(), which is not the same as the time it takes to actually read or write a buffer. If I may use an analogy, it's the difference between the time it takes to order a pizza and the time it takes to actually bake the pizza.

clGetEventProfilingInfo() is the right way to do all time measurements.
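
To make the distinction concrete, here is a sketch of the two kinds of measurement; time_read and its parameters are placeholder names, and the event timestamps require CL_QUEUE_PROFILING_ENABLE:

#include <stdio.h>
#include <sys/time.h>  /* gettimeofday */
#include <CL/cl.h>

static void time_read(cl_command_queue queue, cl_mem buf,
                      size_t nbytes, void *host_ptr)
{
    struct timeval t0, t1;
    cl_event ev;
    cl_ulong start, end;

    /* Wall-clock time around the call: how long it takes to *order*
     * the pizza. A non-blocking enqueue returns almost immediately. */
    gettimeofday(&t0, NULL);
    clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, nbytes, host_ptr,
                        0, NULL, &ev);
    gettimeofday(&t1, NULL);

    clFinish(queue); /* make sure the read has actually completed */

    /* Event timestamps: how long the transfer itself took -- the time
     * it takes to *bake* the pizza. */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    printf("enqueue call: %ld us\n",
           (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));
    printf("actual read:  %llu us\n",
           (unsigned long long)((end - start) / 1000));
}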

In the last variation, where I have 2 command queues on 2 different contexts associated with 2 different devices, the 2 kernels should run concurrently, since I call clEnqueueNDRangeKernel for one right after the other and clFlush both queues

It would be great if you could show us the whole source code to understand what's going on.

Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

Re: No speed up from using 2 GPUs

Originally Posted by david.garcia

That's not so clear to me. You are apparently measuring the time it takes to execute clEnqueueReadBuffer()/clEnqueueWriteBuffer(), which is not the same as the time it takes to actually read or write a buffer. If I may use an analogy, it's the difference between the time it takes to order a pizza and the time it takes to actually bake the pizza.

clGetEventProfilingInfo() is the right way to do all time measurements.

I did use clGetEventProfilingInfo() to measure those times; it was only for the total time that I used gettimeofday().

Re: No speed up from using 2 GPUs

Can I suggest doing all the clCreateXxx() calls as well as clBuildProgram() at the beginning of the code, and only then doing the actual clEnqueueXxx() calls? clBuildProgram() in particular is notoriously expensive, and executing it between your first and second calls to clEnqueueNDRangeKernel() may be eliminating any possibility of concurrency between the two devices.
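
Schematically, something like this, where all identifiers and the kernel name are placeholders, error checking is omitted, and the same setup is repeated for context/device B:

/* --- setup phase: pay all the expensive one-time costs up front --- */
cl_program programA = clCreateProgramWithSource(ctxA, 1, &srcA, NULL, &err);
clBuildProgram(programA, 1, &devA, NULL, NULL, NULL);  /* notoriously slow */
cl_kernel kernelA = clCreateKernel(programA, "my_kernel", &err);
/* ...create programB/kernelB on context B the same way... */

/* --- timed phase: nothing but enqueues and flushes --- */
clEnqueueNDRangeKernel(queueA, kernelA, 1, NULL, &global, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queueB, kernelB, 1, NULL, &global, NULL, 0, NULL, NULL);
clFlush(queueA);
clFlush(queueB);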

Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.