The Khronos Group - a non-profit industry consortium to develop, publish and promote open standard, royalty-free media authoring and acceleration standards for desktop and handheld devices, combined with conformance qualification programs for platform and device interoperability.



native_ functions

Hi there,

I tried to compare the performance of the BlackScholes implementations in CUDA and OpenCL found in the NVIDIA SDK.
The kernels are identical except that the CUDA version uses native functions (__expf and __logf), whereas the OpenCL version uses the standard exp and log. With these implementations, the OpenCL kernel execution takes about twice as long as the CUDA one.
Using native functions in OpenCL (native_exp and native_log) makes the OpenCL kernel as fast as the CUDA one; however, the results are noticeably less accurate...
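For reference, here is a minimal sketch of the kind of test kernel I mean (the kernel name and buffer names are just placeholders, not from the SDK sample); swapping exp for native_exp is the only change between the two variants:

```c
// OpenCL C kernel: standard-precision variant.
__kernel void exp_std(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = exp(in[i]);          // full-precision exp
}

// Fast variant: identical except for the native_ prefix.
__kernel void exp_native(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = native_exp(in[i]);   // maps to ex2.approx.f32 on NVIDIA
}
```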

I wrote a small kernel that only computes the exponential of its input and compiled it to PTX, once with and once without the native_ prefix. With native_exp, the call is translated to a single ex2.approx.f32 instruction, whereas the non-native exp is translated into a long sequence of instructions (which, strangely, also includes ex2.approx.f32).
The same can be observed when CUDA is compiled to PTX. However, the CUDA implementation using __expf is as accurate as the one using expf, whereas in OpenCL the native function is far less accurate than the non-native one...

Any ideas why that is the case? How well are the native_ functions supported in the NVIDIA OpenCL SDK?