You cannot call the CUDA modules from within Python. The Python bindings only allow you to call the standard OpenCV routines.

If you have built OpenCV with CUDA support, then to use those libraries and/or redistribute applications built with them on machines without the CUDA toolkit installed, you will need to redistribute the following DLLs from your

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin

directory to those machines:

cudart64_80.dll

nppc64_80.dll

nppi64_80.dll

npps64_80.dll

cublas64_80.dll

cufft64_80.dll
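Before shipping to a machine without the toolkit, it can help to check that every DLL in the list above is actually sitting next to your application. The following is a minimal sketch, assuming the DLLs are copied into a single deployment folder; the function name missing_cuda_dlls is my own and not part of OpenCV or CUDA.

```python
import os

# DLL names exactly as listed above for CUDA 8.0 (64-bit).
REQUIRED_CUDA_DLLS = [
    "cudart64_80.dll",
    "nppc64_80.dll",
    "nppi64_80.dll",
    "npps64_80.dll",
    "cublas64_80.dll",
    "cufft64_80.dll",
]


def missing_cuda_dlls(directory):
    """Return the required DLL names that are not present in `directory`.

    The comparison ignores case because Windows file names are
    case-insensitive.
    """
    present = {name.lower() for name in os.listdir(directory)}
    return [dll for dll in REQUIRED_CUDA_DLLS if dll.lower() not in present]


# Example (hypothetical deployment folder):
#   for dll in missing_cuda_dlls(r"C:\my_app\bin"):
#       print("missing:", dll)
```

An empty list means all six DLLs were found; it does not prove that they are the right versions.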

The latest version of Intel TBB uses a shared library, therefore if you build with Intel TBB you need to add

Optional – Install both the Intel MKL and TBB by registering for community licensing and downloading them for free. MKL version 2018.0.124 and TBB version 2018.0.124 are used in this guide; I cannot guarantee that other versions will work correctly.

Generating OpenCV Visual Studio solution files with CMake

In the next section we are going to generate the Visual Studio solution files with CMake. There are two ways to do this: from the command prompt or with the CMake GUI. Generating solution files from the command prompt is both quicker and easier; however, using the GUI lets you see and change the available configuration options more easily. My advice would be to use the command prompt if you just want to compile OpenCV with CUDA, and the GUI if you want to add extra configuration options to your build. Once you have decided, proceed with the guide that applies to you:

Building OpenCV 3.3 with CUDA 8.0 from the command prompt (cmd)

to temporarily set the environmental variables for locating your TBB installation.

Then choose your configuration from below and copy it to the command prompt, where PATH_TO_BUILD_DIR is the location where you wish to build OpenCV and PATH_TO_SOURCE_DIR is the location of the OpenCV source files. To build with Visual Studio 2015 instead of 2013, replace -G"Visual Studio 12 2013 Win64" with -G"Visual Studio 14 2015 Win64":
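As an illustration only, and not the configurations this guide originally listed, here is a minimal sketch of how such a command line could be assembled from the options discussed in this guide (BUILD_opencv_world, CUDA_ARCH_BIN, CUDA_FAST_MATH, WITH_TBB) plus WITH_CUDA. The helper names configure_args and build_args are hypothetical, and the sketch assumes cmake is run from inside PATH_TO_BUILD_DIR with the source directory as the last argument.

```python
def configure_args(source_dir, generator="Visual Studio 12 2013 Win64",
                   arch_bin=None, fast_math=False, with_tbb=False):
    """Return a CMake configure command (as a list) for a CUDA-enabled build.

    Intended to be run with the build directory as the working directory,
    e.g. subprocess.check_call(args, cwd=build_dir).
    """
    args = [
        "cmake",
        "-G", generator,
        "-DWITH_CUDA=ON",            # enable the CUDA modules
        "-DBUILD_opencv_world=ON",   # build to a single DLL
    ]
    if arch_bin:
        # Restrict the build to specific compute capabilities, e.g. "6.1".
        args.append("-DCUDA_ARCH_BIN=" + arch_bin)
    if fast_math:
        args.append("-DCUDA_FAST_MATH=ON")
    if with_tbb:
        args.append("-DWITH_TBB=ON")
    args.append(source_dir)
    return args


def build_args(config="Release"):
    """Return the command that builds the INSTALL target of the solution."""
    return ["cmake", "--build", ".", "--target", "INSTALL", "--config", config]


# Example: inspect the command before running it.
#   print(configure_args("PATH_TO_SOURCE_DIR", arch_bin="6.1"))
```

Returning a list rather than a single string avoids quoting problems with the spaces in the generator name.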

This will both build the library and copy the necessary redistributable parts to the install directory, PATH_TO_BUILD_DIR/install in this example. Additionally, if you build the Python bindings, the cv2.pyd and/or cv2.cp36-win_amd64.pyd shared libs will have been copied to your Python Anaconda2[3]\Lib\site-packages\ directory. All that is then required is to add the directory containing opencv_world330.dll (and tbb.dll if you have built with Intel TBB) to your path environmental variable.

If everything was successful, congratulations, you now have OpenCV v3.3 built with CUDA 8.0.

Building OpenCV 3.3 with CUDA 8.0 with the CMake GUI

Fire up CMake. If you want OpenCV to use TBB, then open up the command prompt (Windows key + R, then type cmd and press Enter) and enter

to temporarily set the environmental variables for locating your TBB installation, and

"C:\Program Files\CMake\bin\cmake-gui"

to launch CMake using those variables. Otherwise you can just start CMake normally.

Making sure that the Grouped checkbox is ticked, select the location of the source files, downloaded from GitHub, and the location where the build will take place, E:/opencv/ and E:/build/opencv/vs2013/x64/cuda_mkl/ in this example.

Skip if you are not building with MKL. We want MKL to use TBB but unfortunately the CMake script does not correctly locate the Intel MKL and TBB libraries when using the GUI. The following is an inelegant hack of the script to get MKL to use TBB.

Open up OPENCV_SOURCE/cmake/OpenCVFindMKL.cmake (where OPENCV_SOURCE is E:/opencv/ in this example) in your favorite text editor and amend line 44 to activate MKL_WITH_TBB as

Click the Configure button and select Visual Studio 2013 Win64 (32 bit CUDA support is limited). This may take a while as CMake will download ffmpeg and the Intel Integrated Performance Primitives for Image processing and Computer Vision (IPP-ICV).

Skip if you are not building with MKL. If MKL and TBB are installed correctly, and you have modified the OpenCVFindMKL.cmake as above, the path to these should have been picked up in CMake, and MKL_WITH_TBB should have been selected, as below.

Verify your output resembles that shown below.

Skip this if you are not building with TBB. Expand the WITH group and tick WITH_TBB,

then press configure and confirm that CMake has picked up the locations of your TBB installation

and shows the correct parallel framework.

Expand the BUILD group and tick BUILD_opencv_world (builds to a single dll).

Expand the CUDA tab. The CUDA_TOOLKIT_ROOT_DIR should point to your CUDA 8.0 toolkit installation. If you have more than one version of the toolkit installed and CMake has picked a different one, simply change the path to point to CUDA 8.0.

The default CUDA_ARCH_BIN option is to build binary code for all architectures from 2.0 to 6.1 (Fermi to Pascal). This setting results in a long build time (~3.5 hours on an i7), but the binaries produced will run on all supported devices. If you only want to execute OpenCV on a specific device, enter only the compute capability of that device here. Remember that the libraries produced are not guaranteed to run on devices of a different major compute version from the ones entered; see the CUDA C Programming Guide for details.
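To make the compatibility rule concrete: according to the CUDA C Programming Guide, binary code built for compute capability X.y runs only on devices of compute capability X.z where z ≥ y. The sketch below applies that rule to a CUDA_ARCH_BIN-style string. The function names are my own, and the sketch deliberately ignores PTX (CUDA_ARCH_PTX), which can be JIT-compiled for newer devices.

```python
import re


def parse_arch_bin(arch_bin):
    """Parse a string such as "3.0 5.2;6.1" into (major, minor) tuples.

    Spaces, commas and semicolons are all accepted as separators.
    """
    parts = [p for p in re.split(r"[ ,;]+", arch_bin.strip()) if p]
    result = []
    for part in parts:
        major, minor = part.split(".")
        result.append((int(major), int(minor)))
    return result


def binary_runs_on(arch_bin, device_cc):
    """Return True if any listed binary can execute on the given device.

    `device_cc` is the device compute capability as (major, minor).
    A binary runs only when the major versions match and the binary's
    minor version does not exceed the device's.
    """
    dev_major, dev_minor = device_cc
    return any(major == dev_major and minor <= dev_minor
               for major, minor in parse_arch_bin(arch_bin))


# Example: a build with CUDA_ARCH_BIN="6.1" on a compute capability 5.2 device.
#   binary_runs_on("6.1", (5, 2))
```

For example, a build restricted to "6.1" is rejected for a 5.2 device, whereas a "5.0" binary is accepted for that same device.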

If you are comfortable with the implications, you can also enable CUDA_FAST_MATH, which will enable the --use_fast_math compiler option; again, see the CUDA C Programming Guide for details.

Skip if you are not including the Python bindings. If you have installed only one version of Anaconda, then CMake should pick up its location (as long as you ticked “Register Anaconda as my default Python” on installation) and already ticked the correct build option (BUILD_opencv_python2[3]). However, if you are building for both Python 2 and 3, you may have to manually enter in the locations for Anaconda3 as below.

Then once you press configure again, both build options will be selected.

Press Configure again, your CUDA options should resemble the below.

There should be no warning messages in red displayed in the configuration window. If there are, the Visual Studio solution may be generated but it will probably fail to build.

Note: Versions of CMake more recent than v3.7.1 may give warnings resembling the below:

These can be safely ignored.

Press Generate and wait until the bottom of the window indicates success.

Press Open Project (not available in older versions of CMake, for those just locate and open the Visual Studio solution file) to open up the solution in Visual Studio.

Note: If you are building with python bindings then you will need to build in Release mode unless you have the python debug libraries.

This will both build the library and copy the necessary redistributable parts to the install directory, E:/build/opencv/vs2013/x64/cuda_mkl/install in this example. Additionally, if you build the Python bindings, the cv2.pyd and/or cv2.cp36-win_amd64.pyd shared libs will have been copied to your Python Anaconda2[3]\Lib\site-packages\ directory. All that is then required is to add the directory containing opencv_world330.dll (and tbb.dll if you have built with Intel TBB) to your path environmental variable.

If everything was successful, congratulations, you now have OpenCV v3.3 built with CUDA 8.0.

NOTE: If you change or remove any options after pressing Configure a second time, the build may fail; it is best to remove the build directory and start again. This may seem overcautious, but it is preferable to waiting an hour for the build to fail and then starting again.

If OpenCV has been built with the Python bindings, then on your build machine the cv2.pyd and/or cv2.cp36-win_amd64.pyd shared libs should have been copied to the Anaconda2[3]\Lib\site-packages\ directory. If not, you need to copy them to that directory on the machine you are using. They should be located in the build\lib directory, e.g. E:/build/opencv/vs2013/x64/cuda_mkl/lib/.
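If you need to do that copy by hand on several machines, it can be scripted. This is a minimal sketch, assuming the .pyd files sit directly in the build's lib directory as described above; copy_bindings is a hypothetical helper name, and both directory arguments are whatever applies on your machine.

```python
import glob
import os
import shutil


def copy_bindings(build_lib_dir, site_packages_dir):
    """Copy cv2*.pyd files from the build's lib directory into site-packages.

    Returns the sorted list of file names that were copied, so an empty
    list signals that no bindings were found in `build_lib_dir`.
    """
    copied = []
    pattern = os.path.join(build_lib_dir, "cv2*.pyd")
    for path in sorted(glob.glob(pattern)):
        shutil.copy2(path, site_packages_dir)
        copied.append(os.path.basename(path))
    return copied


# Example (paths taken from this guide's example layout):
#   copy_bindings("E:/build/opencv/vs2013/x64/cuda_mkl/lib/",
#                 r"C:\Anaconda3\Lib\site-packages")
```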

Therefore, to use OpenCV with Python, just fire up Anaconda Prompt and navigate to the directory containing opencv_world330.dll, e.g. E:/build/opencv/vs2013/x64/cuda_mkl/install/x64/vc12/bin. Start the Python interpreter (type: python), then in the interpreter type import cv2. If this succeeds, you can use Python’s OpenCV bindings, and you can then add the location of opencv_world330.dll to your system path.
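The same check can be run as a script instead of by hand. The sketch below is only an approximation of the manual steps: it prepends the DLL directory to PATH for the current process and then attempts the import. It assumes the Python versions used in this guide (2.7 and 3.6); newer interpreters on Windows (3.8 onwards) no longer search PATH for an extension module's dependent DLLs and need os.add_dll_directory instead. try_import_cv2 is a hypothetical helper name.

```python
import importlib
import os


def try_import_cv2(dll_dir):
    """Prepend `dll_dir` to PATH for this process, then try `import cv2`.

    Returns (True, version string) on success, or (False, error message)
    if the import fails, e.g. because opencv_world330.dll was not found.
    """
    os.environ["PATH"] = dll_dir + os.pathsep + os.environ.get("PATH", "")
    try:
        cv2 = importlib.import_module("cv2")
    except ImportError as exc:
        return False, str(exc)
    return True, cv2.__version__


# Example (directory taken from this guide's example layout):
#   ok, info = try_import_cv2(
#       "E:/build/opencv/vs2013/x64/cuda_mkl/install/x64/vc12/bin")
#   print("import cv2:", "OK, version " + info if ok else "FAILED, " + info)
```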

Thanks for the detailed reply. I will try it out.
But I am having some problems with the build phase in Visual Studio. It has been going on for hours and it’s stuck at 22%. It’s working with all the CUDA libraries (matmul.h, add.h). And I also get a lot of warnings about deprecated architectures (sm-20). Any chance I can speed up the build? For now, I deleted the directory and am starting again from the CMake step.
P.S. I am new to OpenCV and CUDA .

It takes approximately 3.5 hours on a modern Intel i7. The CUDA compiler performs a significant amount of optimization while compiling, hence the wait. Warnings regarding sm-20 are fine; as long as you are not getting any errors, I would keep waiting.

If the errors are just in the performance tests, then the OpenCV libs should have already compiled correctly. Check the bin folder; if opencv_world330.dll is present, then you should be able to ignore the errors.
Are you certain that you have checked out the 3.3.0 tag and you are not building an earlier version of OpenCV?

I downloaded the 3.3.0 executable from GitHub. The project is still building, and I can’t find opencv_world330.dll in the path. If I build again without the performance tests, would the errors go away? Is there any catch if I build without performance tests?

If opencv_world330.dll is missing from your bin\Release folder, do you have any executables in there? Is opencv_world330.lib in your lib\Release folder?
I cannot comment on removing the performance tests, because I am unable to recreate your issue on either of the two machines I have tried a fresh build on. If you can successfully build the OpenCV world lib and DLL, then I would expect that you can ignore the errors with the performance tests; however, without recreating the issue on my machine I cannot test this to make sure.

Hey, when compiling with Anaconda, Visual Studio looks for python35_d.lib (the debugging library). To the best of my knowledge, the debugging lib is either incredibly hard to build or it just doesn’t exist. What am I doing wrong? Can I point it to the “official” python35_d.lib and go without issue?

Hi, from memory I did not have any problems building in Debug or Release with Python bindings; however, I am unable to check at the moment because I don’t have access to my build machine. In CMake, under the PYTHON3 drop-down, was the location C:\Anaconda3\libs\python35.lib or equivalent? The OpenCV CUDA module is not supported in Python; are you sure you need to build with Python bindings?

I will have a look later on. Is Visual Studio still looking for python35_d.lib when you build in Release mode?
Which OpenCV CUDA routines are you looking to use? If it is mainly matrix operations, filtering etc., then you could use PyTorch. If it is HOG, GMM, Haar cascades etc., then OpenCV is probably the way to go.

Hi, what are you comparing your compiled build with, the default binaries from OpenCV? I noticed a significant speed-up in matrix operations when I built with MKL. I will see if I can find the results of the performance tests and let you know what to expect.

Thank you for your response.
I was comparing them with my own build without MKL. The reason for no speed-up was that for matrices with a size smaller than HAL_GEMM_SMALL_MATRIX_THRESH (=100), OpenCV uses its own gemm function, and my test Mat size was 50*100000 (the size I work with).
Now I have a speed-up with matrices bigger than 100*100, but it is still 2~3 times slower than numpy. I looked at Task Manager and found that numpy is using all CPU threads. Then I enabled MKL_WITH_TBB, but there is again no change and it is using one thread of my CPU. Should I enable multi-threading of MKL explicitly?

Hi, I can confirm I am experiencing the same slowdown as you are when using cv2.gemm() instead of np.dot(). I work in C++ and had not previously compared with Python. I am not sure what is causing this, but from your observation, and as both implementations (numpy and OpenCV) should be using Intel MKL, it would point to a threading issue. I will investigate; if you find a solution, please let me know.

I don’t think TBB will have any effect because from what I have read MKL uses OpenMP for multi threading. From the documentation

AFAIK numpy is using OpenBLAS, which I couldn’t compile OpenCV with. OpenBLAS uses multithreading for matrix calculations, so it is much faster than MKL without multithreading.
If MKL is using OpenMP for multithreading, what is the reason we use TBB instead? I am wondering whether ticking WITH_OPENMP solves the problem.

Hi, ignore my previous comment regarding OpenMP, I had misinterpreted the Intel documentation. To get the MKL libs to use TBB you need to make additional modifications to the OpenCVFindMKL.cmake script before you press configure for the first time. I have updated the instructions, let me know if this solves your issue.

On testing in Python I now get almost identical results from cv2.gemm() and np.dot().

My version of numpy installed through conda is using MKL; you can check yours by running np.__config__.show().
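For anyone wanting to repeat the comparison, here is a minimal timing sketch. It is not the test I ran: the matrix size, repeat count and function names are arbitrary choices for illustration, and it assumes numpy and the cv2 bindings are both importable.

```python
import time


def best_time(fn, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare_gemm(size=512, repeats=5):
    """Time np.dot against cv2.gemm on two random size x size matrices.

    Imports are local so that best_time stays usable without numpy or cv2.
    Returns (numpy seconds, OpenCV seconds).
    """
    import numpy as np
    import cv2

    a = np.random.rand(size, size)  # float64, i.e. CV_64F on the OpenCV side
    b = np.random.rand(size, size)
    t_numpy = best_time(lambda: np.dot(a, b), repeats)
    # alpha=1, no third matrix, beta=0 computes a plain product a * b.
    t_opencv = best_time(lambda: cv2.gemm(a, b, 1.0, None, 0.0), repeats)
    return t_numpy, t_opencv


# Example:
#   t_np, t_cv = compare_gemm(size=1000)
#   print("np.dot: %.4fs  cv2.gemm: %.4fs" % (t_np, t_cv))
```

Taking the best of several runs reduces noise from other processes. Watching Task Manager during the run shows whether all CPU threads are in use.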

Hi James,
Thanks for the tutorial!
I have a question concerning TBB: I’ve installed MKL (in the default path), and also decompressed TBB (it’s just an archive, not an installer) in another folder. I’ve adapted OpenCVFindMKL.cmake as instructed and ticked MKL -> MKL_WITH_TBB.
Do I also need to tick WITH -> WITH_TBB then specify TBB_ENV_INCLUDE, TBB_ENV_LIB and TBB_ENV_LIB_DEBUG according to where I decompressed TBB, or is what comes with MKL sufficient?

Hi, I installed the Intel TBB binaries from the Intel website, not from https://www.threadingbuildingblocks.org/. I am pretty sure that you only tick WITH_TBB if you want to build TBB from source, which I have not done. I will try to dig out the OpenCV performance test results from including TBB in this way to see what the benefit is.

Thank you! I didn’t think of getting the TBB binaries from the Intel website. I’ll do another pass with CMake once I’ve installed them to see what CMake reports. As for WITH_TBB, I remember that when I compiled OpenCV 3.2 many months ago, I ticked WITH_TBB but didn’t build TBB from source. Instead I pointed CMake to where TBB (from threadingbuildingblocks, not from Intel) was decompressed, and I was able to get everything to work (I didn’t do any speed tests though).
But if you have some performance test results on hand, I’d be happy to know. 🙂
In the meantime, I’ve launched an OpenCV 3.3 build MKL_WITH_TBB + WITH_TBB (decompressed from threadingbuildingblocks) to see what happens.
It’s still not clear to me if MKL_WITH_TBB impacts only the MKL part or also other parts of OpenCV that might benefit from TBB.

Hi, I had not noticed that Intel had changed their TBB installation. I have amended the instructions above to allow OpenCV to be built with the 2018 version of Intel TBB. I will share some performance comparison results when I have them.

I was incorrect in what I previously told you. Enabling:
MKL_WITH_TBB (if you amend the CMake script as I mention above) will only impact MKL.
WITH_TBB should (I am still testing) enable the multi-threaded parts of OpenCV to run, and it should work with the 2018 libraries downloaded from Intel.

Thank you very much!
The new modification to OpenCVFindMKL.cmake worked for me, and now numpy and OpenCV have identical performance!
I have another request. MKL and OpenBLAS have similar performance, but OpenBLAS is free and MKL is not for commercial use (am I right?). If you could write similar instructions for building OpenCV + OpenBLAS I would be very thankful. I have similar issues compiling OpenCV + OpenBLAS; it seems that OpenCVFindOpenBLAS is not working either.

Hi James, thanks for this comprehensive guide. I have yet to be able to build it, though, and keep getting this error for opencv_world:
CMake Error at cmake/OpenCVUtils.cmake:945 (target_compile_definitions):
Cannot specify compile definitions for target “opencv_world” which is not
built by this project.
I was compiling using CUDA 9.0 and VS2013 (VS2015 didn’t work). I tried using VS2017 and CUDA 8.0 too (various combinations), but the same error occurs. Do you know how I can rectify this problem, or whether it’s fine not to compile opencv_world? (The error doesn’t occur if that is unchecked.) I’ll be using this for my Python programme (Anaconda3 used). Thanks a lot again!!

CUDA 9.0 and/or VS 2017 are not supported by OpenCV 3.3. Even if you can get it to compile, none of the features of CUDA 9.0 (cooperative groups etc.) will have been implemented, so I doubt there is any advantage over CUDA 8.0. If you want to use Python, then CUDA is also not supported, so it would be best to disable the CUDA modules.