Recent status updates

Thomas Whelan and John McDonald, two of our collaborators from the National University of Ireland in Maynooth, have been feverishly working away on extensions and improvements to PCL’s implementation of KinectFusion - KinFu.

Two of the major limitations of KinectFusion have been overcome by the algorithm, which Tom calls Kintinuous:

By constructing a 3D cyclic buffer, which is continually emptied as the camera translates, the original restriction to a single cube (e.g. 5x5x5m) has been removed

Operation in locations where there are insignificant constraints to reliably carry out ICP (KinectFusion’s initial step) is now possible by using visual odometry (FOVIS).

Tom has created some amazing mesh reconstructions of apartments, corridors, lecture theaters and even outdoors (at night) - which can be produced real-time. Below is a video overview showing some of these reconstructions.

We are hoping to integrate Kintinuous into a full SLAM system (iSAM) with loop-closures to create fully consistent 3D meshes of entire buildings. More details including the output PCD files and a technical paper are available here.

Just a quick note that the visual odometry library we have been using within our particle filter - called FoVis (Fast Visual Odometry) - has been released. It supports both stereo cameras (including Pt Grey Bumblebee) and depth cameras (MS Kinect, Asus Pro Live) and achieves 30Hz using a single CPU core. I haven’t used another VO system but its highly recommended.

FoVis was developed by Nick Roy’s group here in MIT. More details can be found in this paper:

So I’ve been doing some analysis to determine the overall benefit of the improvements mentioned below within our application. It will give a feel for the impact of various improvements in future.

High Level Timing

First lets look at the high level computing usage. We did several tests and looked at elapsed time for different numbers of particles and also looked at (1) doing the likelihood evaluation on the GPU (using the shader) or (2) on the CPU (as previously).

The data in this case was at 10Hz so real-time performance amounts to the sum of each set of bars being less than 0.1 seconds. For low number of particles, the visual odometry library, FoVis, represents a significant amount of computation - more than 50 percent for 100 particles. However as the number of particles increases the balance of computation shifts to the likelihood function.

The visual odometry and likelihood function can be shifted to different threads and overlapped to avoid latency. We haven’t explored it yet.

Other components such as particle propogation and resamping are very small fractions of the overall computation and are practically negiligible.

In particular you can see that the likelihood increases much more quickly for the CPU evaluation.

Low Level Timing of Likelihood

The major elements in the likelihood are the rendering of the model view and the scoring of the simulated depths - again using the results of these same runs.

Basically the cost of rendering is fixed - a direct function of the (building) model complexity and the number of particles. We insert the ENTIRE model into OpenGL for each iteration - so there will be a good saving to be had by determining the possibly visible set of polygons to optimize this. This requires clustering the particles and extra caching at launch time but could result in an order of magnitude improvement.

It’s very interesting to see that the cost of doing the scoring is so significant here. Hordur put a lot of work into adding OpenGL Shader Language (OpenGLSL) support. The effect of it is that for large numbers of particles e.g. 900 scoring is only 5% of available time (5% of 0.1 seconds). Doing this on the CPU would be 30%.

For the simplified model, this would be very important as the scoring will remain as is, but the render time will fall a lot.

NOTE: I think that some of these runs are slightly unreliable. For portions of operation there was a charasteristic chirp in computation during operation. I think this was the GPU being clocked down or perhaps my OpenGL-based viewer soaking up the GPU. I’ll need to re-run to see.

In this post I’m going to talk about some GPU optimizations which have increased the speed of the likelihood function evaluation. Hordur Johannsson was the one who did most of this OpenGL magic.

Evaluating Individual Measurement to Model Associations

This is the primary approach of this method. Essentially by using the Z-buffer and an assumption of a generative ray-based cost function, we can evaluate the likelihood along the ray rather than the distance in Euclidean space. This is as was previously discussed below, I just mention it here for context.

Computing Per-pixel Likelihood Functions on the GPU using a Shader

The likelihood function we are currently evaluating is a normalized Gaussian with an added noise floor:

Previously this was computed on the CPU by transferring the depth buffer back to the CPU from the GPU. We have instead implemented a GPU shader to compute this on the GPU.

Currently we look-up this function from a pre-computed look-up table. Next we want to evaluate this functionally, to explore the optimization of function’s shape as well as things like image decimation. (Currently we decimate the depth image to 20x15 pixels).

Summing the Per-pixel Likelihood on the GPU

Having evaluted the log likelihood per pixel, the next step is to combine them into a single log likelihood per particle by log summation:

This can be optimized by parallel summation of the pixel images: e.g. from 32x32 to 16x16 and so on to 1x1 (the final particle likelihood). In addition there may be some accuracy improvement by summing simularly sized values and avoiding floating point rounding errors: e.g. (a+b) + (c+d) instead of ((a+b)+c) +d

We’re currently working on this but the speed up is unlikely to be as substantial as the improvement in the previous section.

Single Transfer of Model to GPU

An addition to the above, previously we used the mixed polygon model. We have now transferred to using only a model made up of only triangles. This allows us to buffer the model on the GPU and instead we transmit the indices of the model triangles which should be rendered in a particular iteration. (Which is currently all the triangles).

Future work will look at determining the set of potentially visible polygons - perhaps using a voxel-based indice grid. This is more complex as it requires either implicit or explicit clustering of the particle set.

In addition to the above, we are now using an off screen buffer which allows us to renderer virtual images without being limited by the resolution of the machine’s specific resolution.

Putting it all together

We’ve benchmarked the improvements using a single threaded application which carries out particle propogration (including Visual Odometry), renderering, likelihood scoring and resampling. The test log file was 94 seconds long at about 10Hz (about 950 frames in total). We are going to focus on the increased number of particles for real-time operation with all modules being fully computed.

The biggest improvement we found was using the triangle model. This resulted in about a 4 times increase in processing speed. Yikes!

Using the triangle model, we then tested with 100, 400 and 900 particles the effect of computing the likelihood on the GPU using a shader:

This results in a 20-40% improvement, although we would like to carry out more sample points to verify this. The result of this is that we can now achieve real-time performance with 1000 particles. For the log file we didn’t observe significant improvement in accuracy beyond about 200 particles - so the next step is to start working with more aggressive motion and data with motion blur. There, being able to support extra particles will become important.

In my next post I’ll talk about how that computation is distributed across the various elements of the application.