Performance: Boost.Compute vs. OpenCL C++ wrapper

The following code adds two vectors using Boost.Compute and the OpenCL C++ wrapper, respectively. The results show that Boost.Compute is almost 20 times slower than the OpenCL C++ wrapper. Am I misusing Boost.Compute, or is it really that slow?
Platform: Windows 7, Visual Studio 2013, Boost 1.55, Boost.Compute 0.2, ATI Radeon HD 4600

The kernel code generated by the transform() function in Boost.Compute should be almost identical to the kernel code you use in the C++ wrapper version (though Boost.Compute will do some unrolling).

The reason you see a difference in timings is that in the C++ wrapper version you are only measuring the time it takes to enqueue the kernel and map the results back to the host. In the Boost.Compute version you are also measuring the time it takes to create the transform() kernel, compile it, and then execute it. For a more realistic comparison, measure the total execution time of the C++ wrapper version, including the time it takes to set up and compile the OpenCL program.
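To make the comparison fair, it helps to time the first (cold) call separately from later (warm) calls. Below is a minimal sketch using only the standard library; the names time_ms, ColdWarm and measure_cold_and_warm are hypothetical helpers for illustration, not part of Boost.Compute or OpenCL.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: wall-clock time of one call to `work`, in milliseconds.
double time_ms(const std::function<void()>& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical result type: first-call time and average time of later calls.
struct ColdWarm {
    double cold_ms;
    double warm_ms;
};

// Runs `work` once to capture the cold cost (which would include any
// run-time kernel compilation), then `warm_runs` more times and averages.
ColdWarm measure_cold_and_warm(const std::function<void()>& work, int warm_runs) {
    ColdWarm result;
    result.cold_ms = time_ms(work);

    double total = 0.0;
    for (int i = 0; i < warm_runs; ++i) {
        total += time_ms(work);
    }
    result.warm_ms = warm_runs > 0 ? total / warm_runs : 0.0;
    return result;
}
```

For the Boost.Compute version, `work` would be a lambda that calls transform() and then waits for the queue to finish (for example with command_queue::finish()), so that the measurement covers the actual execution rather than just the enqueue. For the C++ wrapper version, `work` should include the program build step if you want the cold numbers to be comparable.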

This initialization penalty (which is inherent in OpenCL's run-time compilation model) is somewhat mitigated in Boost.Compute by automatically caching compiled kernels at run time (and, optionally, caching them offline for reuse the next time the program is run). After the first invocation, subsequent calls to transform() will be much faster.
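The idea behind the run-time cache can be illustrated with a small stand-alone sketch. This is not Boost.Compute's actual implementation: KernelCache is a hypothetical class, the int handle stands in for a compiled kernel, and the compile callback stands in for the expensive OpenCL build step.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical illustration of a compile-once cache keyed by kernel source.
class KernelCache {
public:
    // `compile` stands in for the expensive run-time compilation step.
    explicit KernelCache(std::function<int(const std::string&)> compile)
        : compile_(std::move(compile)) {}

    // Returns the cached handle if this source was seen before;
    // otherwise compiles it once and remembers the result.
    int get(const std::string& source) {
        auto it = cache_.find(source);
        if (it != cache_.end()) {
            return it->second;  // cache hit: no compilation
        }
        int handle = compile_(source);  // cache miss: pay the cost once
        cache_.emplace(source, handle);
        return handle;
    }

private:
    std::function<int(const std::string&)> compile_;
    std::map<std::string, int> cache_;
};
```

Only the first request for a given source pays the compilation cost, which is why repeated transform() calls with the same operation become cheap. As far as I know, the offline cache is opt-in and is enabled by defining BOOST_COMPUTE_USE_OFFLINE_CACHE before including the Boost.Compute headers; check the documentation for your version.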

P.S. You can also just use the core wrapper classes in Boost.Compute (like device and context) along with the container classes (like vector<T>) and still run your own custom kernels.