A couple of people have been asking for help with compiling OpenVDB on Windows, so here is a quick solution I’ve put together. It contains four projects in the openvdb_build directory, to be compiled under /MD in VS2010 (pretty sure it’ll work for 2012 as well). I’ve included the binaries for the dll and the viewer application, though you will need the VS2010 CRT libs. The few changes made to the source (mostly for the viewer components) I explain here, for compiling with /MT.

– A dynamic library of the main OpenVDB toolkit excluding the viewer components
– A static library of the main OpenVDB toolkit excluding the viewer components
– A static library of the main OpenVDB toolkit with the viewer components (I was unable to build the viewer by linking to a dll)
– The OpenVDB viewer application

The prerequisites are still required, so download and build the following; all you then need to do is add the additional include & lib directories in the projects:

– OpenEXR’s Half.h and its corresponding lib/dll
– TBB
– Zlib

And for the static viewer library:

– glew
– GLFW legacy version (anything below version 3)

Nothing fancy here, just download them and compile under /MD for your required build as usual. If you want to build OpenVDB’s viewer application, you’ll need to build the openvdb_2.3.0_viewer_static project and link to that. Everything is built into /bin/<project name>/<arch>/<config>/

In this project I wanted to investigate and demonstrate the effectiveness of using CUDA (Compute Unified Device Architecture) code to implement graphical simulations and update visual representations, with effective optimization and an understanding of graphics memory architecture. The focus is on fluid simulations, specifically the implementation of a simple implicit fluid solver for the Navier-Stokes equations for incompressible flow, based on the semi-Lagrangian simulation method introduced in Jos Stam’s Stable Fluids. The project code uses a sparse linear solver to support arbitrary user-input collision boundaries, with planned extensions into more detailed PIC and FLIP solvers.
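To give a flavour of the semi-Lagrangian method at the heart of a Stam-style solver, here is a minimal CPU sketch of the advection step on a 2D grid. The names (Grid, advect) and the layout are illustrative assumptions, not the project’s actual code; the real solver runs on the GPU.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Minimal sketch of Stam-style semi-Lagrangian advection on a 2D grid.
// Grid and advect are hypothetical names, not from the project source.
struct Grid {
    int n;                      // grid is n x n
    std::vector<float> v;       // row-major scalar field
    float at(int x, int y) const {
        x = std::clamp(x, 0, n - 1);        // clamped boundary lookup
        y = std::clamp(y, 0, n - 1);
        return v[y * n + x];
    }
};

// Trace each cell centre backwards through the velocity field (u, w)
// by dt, and bilinearly sample the source field at that position.
Grid advect(const Grid& src, const std::vector<float>& u,
            const std::vector<float>& w, float dt) {
    Grid dst{src.n, std::vector<float>(src.v.size())};
    for (int y = 0; y < src.n; ++y)
        for (int x = 0; x < src.n; ++x) {
            float px = x - dt * u[y * src.n + x];   // backtraced position
            float py = y - dt * w[y * src.n + x];
            int x0 = (int)std::floor(px), y0 = (int)std::floor(py);
            float fx = px - x0, fy = py - y0;       // bilinear weights
            dst.v[y * src.n + x] =
                (1 - fx) * ((1 - fy) * src.at(x0, y0)     + fy * src.at(x0, y0 + 1)) +
                fx       * ((1 - fy) * src.at(x0 + 1, y0) + fy * src.at(x0 + 1, y0 + 1));
        }
    return dst;
}
```

Because the backtrace interpolates values that already exist, the scheme is unconditionally stable: it can never produce values outside the range of the source field, which is what lets Stam’s method take large time steps.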

The video above was simulated and filmed using an NVIDIA GeForce GTX 570 with a compute capability of 2.0. Frame playback rates for the following solver grids are listed below:

I’m planning on going through the fundamentals of the implementation and the core principles behind the solver soon. Below are some quick screen caps showing some of the features I have working, in the context of the video above.

A single density field with multiple input values on a 2D grid of size 1024×1024 at 25fps

5 different fluid densities symbolised by different colours mixing. The colours and density values can be assigned by the user. A 2D grid of size 1024×1024 at 20fps

A velocity visualisation of a 2D grid of size 1024×1024 at 25fps

Fluid colliding with an arbitrary barrier. The collision barriers can be of any shape and can be drawn on interactively whilst the application is running. 128×128 grid at 60fps sync’d.

A velocity visualisation of a voxelized 3D grid of size 64x64x64 at 17fps

When using the CUDA libraries, a separate compilation and linking process is required for the device-specific portions of code. The easiest way to do this is to use nvcc (the NVIDIA CUDA Compiler). CUDA sources can contain both C/C++ host code and device functions; nvcc will separate these components and hand the host side to a general-purpose compiler determined by the platform. All device methods and attributes must be compiled using NVIDIA compilers/assemblers, which embed the compiled GPU functions as load images in the host object file.
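As a rough illustration of that split, the manual pipeline looks something like this (file names are placeholders, and the host compiler would be cl rather than g++ on Windows):

```shell
# Device code goes through nvcc, host code through the platform compiler,
# then the two object files are linked against the CUDA runtime.
nvcc -c kernels.cu -o kernels.o          # device code + embedded GPU images
g++  -c main.cpp   -o main.o             # plain host code
g++  main.o kernels.o -lcudart -o app    # final link
```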

Setting this up in VS is pretty easy thanks to the template project support from the CUDA samples; however, to allow me to work on Unix-based systems I tend to work in Qt. It wasn’t too difficult to find out how to add extra compiler commands to QMAKE, but it took a while to set up nicely. It’s also nice to have a .pro which can deal with both Linux and Windows. Below are the essential extracts from my .pro, just for configuring nvcc. Flags such as SYSTEM_TYPE and NVCC_OPTIONS I set under their respective architecture/platform identities using QMAKE’s parsing and built-in flags when setting the main variables for the general host compilation (such as checking QMAKE_TARGET.arch with :contains, and generally sub-dividing platforms and compilers with :{} ). I’ve provided an example of this for use on Windows with MSVC, as it requires some extra settings.
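For anyone who hasn’t used QMAKE’s extra-compiler mechanism before, the general shape of such a rule looks like the sketch below. The paths, flags and variable names here are placeholder assumptions, not my actual .pro.

```qmake
# Sketch of a QMAKE extra-compiler rule for nvcc (paths/flags are placeholders)
CUDA_SOURCES += src/solver.cu
CUDA_DIR      = /usr/local/cuda
NVCC_OPTIONS  = -arch=sm_20 --use_fast_math

cuda.input           = CUDA_SOURCES
cuda.output          = ${QMAKE_FILE_BASE}_cuda.o
cuda.commands        = $$CUDA_DIR/bin/nvcc $$NVCC_OPTIONS \
                       -c ${QMAKE_FILE_NAME} -o ${QMAKE_FILE_OUT}
cuda.dependency_type = TYPE_C
QMAKE_EXTRA_COMPILERS += cuda
```

The object files this rule produces are then picked up by the normal link step alongside the host objects.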

Note that as nvcc practically mimics gcc, it requires its own set of defines, includes and library paths for the compilation and linking processes in the .pro, and for the most part cannot deal with spaces in file paths and names!

There are plenty of different methods which allow a programmer to grab a bunch of pseudo-random numbers, trading off the accuracy of truly randomised values against speed. CUDA and thrust parallel primitives offer a variety of host and device API methods to generate random numbers, and also provide a good insight into the processing-speed comparison vs. the CPU.

So, as more of a technical experiment to start observing the real performance differences, I started running some basic tests on arbitrarily sized data containers using four different methods (the first three use host API calls). I’ll go through each one and show their respective performance differences:

thrust::generate is a very simple example of using the CPU to quickly fill a container with random numbers, and can practically be summarised in a few lines, where _rand is a host function which simply sets the limits on the calculation:
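The gist of it can be sketched with std::generate on a std::vector, which is what thrust::generate reduces to on a host container; _rand here just sets the limits on the calculation, as described above. This is a stand-in for the original thrust snippet, not a copy of it.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Host function setting the limits on the calculation: 0.0f .. 1.0f.
float _rand() {
    return static_cast<float>(std::rand()) / static_cast<float>(RAND_MAX);
}

// CPU fill, analogous to thrust::generate on a thrust::host_vector.
std::vector<float> cpu_fill(std::size_t n) {
    std::vector<float> v(n);
    std::generate(v.begin(), v.end(), _rand);
    return v;
}
```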

Pretty simple, the point of showing this is to be able to compare the speed of this method to the other three GPU based implementations.

The second is slightly more complicated, but is, from what I can see, the easiest way of creating a device vector of random numbers using the GPU without using curand. Here, we use a transformation of counting iterators to calculate a value per element of the rng sequence. Basically, each counting iterator is fed into an op which discards the iterator amount from a created thrust::default_random_engine to avoid correlation, and applies a distribution (if needed). I’ve found the need to explicitly set the execution policy with the first parameter thrust::device for the transform, else you tend to get some crazy results.
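The discard-per-index trick is easiest to see stripped of the thrust machinery. The sketch below is a plain CPU analogue of the counting-iterator transform: each index plays the role of one counting iterator fed to the op, and the function names are my own, not thrust’s.

```cpp
#include <random>
#include <vector>

// The "op": for element n, discard n values from a freshly seeded engine
// so consecutive elements are uncorrelated, then apply a distribution.
float rand_at(unsigned seed, unsigned long long n) {
    std::default_random_engine rng(seed);
    rng.discard(n);                                  // jump to this element
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    return dist(rng);
}

// The "transform": one op application per index, serial here where the
// thrust version would run one per GPU thread.
std::vector<float> fill_by_index(unsigned seed, std::size_t count) {
    std::vector<float> v(count);
    for (std::size_t i = 0; i < count; ++i)
        v[i] = rand_at(seed, i);
    return v;
}
```

Because each element depends only on the seed and its own index, the loop body is trivially parallel, which is exactly why the thrust transform version works on the device.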

This method gives the ability to easily change the limits and the way the random numbers are created; however, as you will see below, a simple curand implementation will do the same.

With the third method, just as before, we create a pseudo-random generator (this time using curand), pass it a randomly generated seed and leave it to the basic wrapped API methods to handle the generation. The function names are pretty self-explanatory, with curandGenerateUniform’s default limits being between 0.0f and 1.0f. checkCudaErrors & CURAND_CALL are basic checks on the status returns.

The last way I’m going to show is using the curand device API in kernels. Here, we need to allocate space for each individual kernel’s curandState as well as the device container for the random results. We can then initialize the kernels we are going to use to a particular curandState (the state of the random number generator) by calling curand_init with a given seed and a different subsequence value, to create a different starting position per kernel. You also have the ability here to generate the same sequence for each thread, by giving curand_init the same subsequence and offset values!

It’s then a simple case of launching the initialized kernels again (to avoid calling curand_init) and using the wrapper functions to generate either pseudo-random or quasi-random numbers. I’m using curand_uniform to generate floats between 0.0f and 1.0f as before. You can copy the generator state into local memory for fast generation, and store it back in global memory between kernel launches.

Now the interesting part: how long does each of these methods take for a variety of random numbers? The test cases I have run always apply a uniform distribution and generate numbers between 0.0f and 1.0f using the corresponding API methods, for different container sizes (10, 10000, 10000000), taking an average time over 100 calls to each case. Bear in mind that the data in all these GPU examples is being left on the device, and the timings do not account for device->host overhead if it is needed. There is an issue I’m facing for larger amounts using the device API, I think due to the naive way I’m launching the kernels, so I’ve only managed to provide results for n = 10 and 10000 for this method. I’m using the timer code I previously uploaded to monitor these.

Random Number Speed Tests

The allocation and processing speed of the CPU will pretty much always be faster for smaller container sizes, bearing in mind this could be parallelized even further (by using OpenMP, for example), but the main limitation is that it requires a halt to the translation unit. I’m uncertain how thrust deals with the device generation, however for small and decently sized containers it seems to perform really well. cuRand’s host API will almost always guarantee a result dependent on your hardware and generator type; the container size does not affect the processing speed much, due to the way it optimizes the generation on the device, though you can get better performance still by using the device API. In general, you will get the best performance from cuRand by generating containers of random numbers that are as large as possible. The docs on the cuRand library are really good, and the examples are worth looking at for more info.

Recently I’ve been getting deeper into OpenMP & CUDA parallelism, and it’s been interesting to see the running time of different optimized functions. There have been times in the past where I’ve glazed over speed optimizations for bits of code (mainly due to time constraints – how ironic), going on the basis of “this looks fast, it’s fast enough”. Obviously this is bad. And there are plenty of profilers out there to test the timings of different methods! However, for convenience, I use the below to give me a somewhat accurate representation of the speed of different functions.

There doesn’t seem to be a particularly reliable timer on Windows, as QueryPerformanceCounter will return different results from a multithreaded program running on a multicore computer. timeGetTime() will only give you results accurate to a millisecond; however, for single executing threads QueryPerformanceCounter should be fine, and will return double-precision ms. This just makes it quick and easy to see what different implementations yield!

This page is dedicated to a project I undertook, set by MPC, which focuses on integrating OpenVDB support into my own Windows-based application, capable of drawing OpenVDB data in OpenGL. The app also presents a variety of memory preferences and data controls to help understand the information in a VDB file. The result is a fairly concise UI allowing for different representations of the data via voxel trees. Check out the video below, which demos some of the key features I have working. I plan to keep improving and updating the app when I can!

The project can currently be compiled on Windows using VC++10/11 for MSVC2010/2012 respectively (check out my blog post here for some fun OpenVDB building for Windows), or on Linux with gcc and clang, with CUDA and muParser functionality as optional. I plan to make the code available as soon as I’ve tidied it up a bit and perhaps built a VS project for support, as it has been created within the Qt IDE using its windows API.

[Current Version V0.46]

Changes:

Optional CUDA implementation of kernel filtering with thrust parallel primitives. This allows for cumulative total feedback from clipping, filtering and expressions.

Built-in muParser (http://muparser.beltoforion.de/) expression evaluation. Although evaluation is processed on the host, fast memory and visualization updates can be achieved with CUDA.

Hello all! I thought I’d post my notes I made a while back which helped me setup OpenVDB on Windows based systems using VS in case anyone finds it useful.

Step by step: compiling OpenVDB and the OpenVDB Viewer (v2.1) on Windows 7 under x86/64 for MT/MTd using VC++10/11 for VS2010/2012. There’s not much that differs from Joe Kider’s post for OpenVDB 0.96.0 here: https://groups.google.com/forum/#!topic/openvdb-forum/ZTEbIAk9kdc. However, I’ve quickly expanded below on the setup of the prerequisites for anyone that needs it.

Prerequisites for core OpenVDB

We’ll be building all these libraries from source using the pre-set project folders for VS included with the downloads. You’ll need the following for the main source. Keep in mind we’re building for static linking:

– 1b) Copy the folder called contrib extracted from the assembly source. Paste this into the extracted zlib source zlib-1.2.5 and confirm all merge prompts with the existing folders. Alternatively, manually copy the obj files: navigate to the extracted folder zlib124_masm_obj\contrib\masmx86 for 32-bit or masmx64 for 64-bit, copy both obj files, then paste them into the zlib source under zlib-1.2.5\contrib\masm x86/x64

– 1c) Go to the VS proj folder zlib-1.2.5\contrib\vstudio\vc10 and open up the solution. The proj you are after is the static lib (zlibstat); build with your required architecture and config (release/debug).

– 2b) Navigate to the extracted folder\vc\vc9 and open the solution. Follow the prompts to convert the project (select No when asked to create a backup)

– 2c) The post-build events for these projects assume you’re building the complete OpenEXR package, and will create the deploy folder two directories up from your extracted location. You could just remove this step; however, the folder structure it creates makes the include/lib paths much easier later on. If you do wish to remove it, just right-click on each project in turn, > Properties > Build Events > Post-Build Event, and remove the install command (install<projName>.cmd $(IntDir)).

– 2d) Build the Half project with your required architecture and config. If building in debug you may have to initiate the build twice for symbols.

– 3b) Navigate to the extracted folder\build\vs2010, open the solution and build with your required arch. NOTE: I seem to get some redefinition errors for malloc and free later on, even after switching the code generation to MT and removing the cmd-line override; for now I’m just forcing multiple links later, or choosing to remove the default lib msvcrt.

Prerequisites for OpenVDB Viewer

For the viewer you’ll be needing glew and GLFW (glu as well, but this should be standard). If you’re not actually looking to include the viewer in the static lib, just skip these.

– 4b) Navigate to build\vc10, open the solution and build the static and shared libs.

– 5a) OpenVDB uses an older version of GLFW; as far as I know any 2.x version will work, and it is much easier to include that than to try and update to 3.x. Download from https://github.com/glfw/glfw-legacy

– 5b) Navigate to the extracted folder\support\msvc100, open the solution and build with your required architecture and config

Building

Now for building the libs. You can choose to build the viewer and main source separately if you really want, and just force the linker to link to both (as we are going to have to do this anyway), but it’s nicer not to. For JUST OpenVDB it follows the same steps as posted here https://groups.google.com/forum/#!topic/openvdb-forum/ZTEbIAk9kdc so I’ll only touch on this briefly.

– 1a) Open a new instance of VS2010/12 and select a new project. Select a Win32 Console App (as it becomes easier to tweak), deselect ‘Create directory for solution’ and give it a name. Click OK, then Next.

Now for the viewer. I spent a while trying to get it compiled on both 32 and 64 bit systems, but it seems all the type problems related to rendermodules have been fixed in the latest release (v2.1), making things much easier. Please post if there are any errors for either, I’ve tested a fair amount for the new version and everything seems to work nicely.

For a joint library, follow all the above steps for the main source up to 4 and add the viewer files into the project. If you want a separate library, open a new instance of VS2010/12 and follow the steps above, but only add the files in the viewer folder (and skip steps 2a and 4 above).

– 1a) Remove all #include <GL/gl.h> and #include <GL/glu.h> found in ClipBox.h, Font.h, RenderModules.h, Camera.cc and Viewer.cc, and replace them with #include <GL/glew.h>

– 1b) There are two places glew needs to be initialized (due to the threading?): one is when the viewer gets initialized in Viewer.cc, the other is when the shaders are first compiled in RenderModules’ setshader(), although I just opted to add it to the constructor of the shader program. Add glewInit(); at line 180 in RenderModules.cc and at line 165 in Viewer.cc, when init is called.

– 1d) Linker options: add the following library locations under Linker > General > Additional Library Directories. Also add the libraries themselves under Linker > Input > Additional Dependencies, in this order (this is an example of linking to the debug libs):

glew32d.lib
glu32.lib
opengl32.lib
GLFW.lib
zlibstat.lib
Half.lib
At the end, include the viewer library and, if you compiled separately, the main lib.

– 1g) If they aren’t in your environment path, you’ll want to copy the glew, TBB and Half shared dlls from their respective bin directories to the executable directory.

– 1h) Download one of the example .vdb files from the OpenVDB site (one of the smaller ones to test) and add the path location to Properties > Debugging > Command Arguments of the solution. You should be able to run and use the viewer! For 32-bit compilers you will get an assertion failure for a subscript range, but you can ignore this.