Introduction

I've been working with the Microsoft Kinect for Xbox 360 on my PC for a few months now, and overall I find it fantastic! However, one thing that has continued to bug me is the seemingly poor quality of rendered Depth Frame images. There is a lot of noise in a Depth Frame, with missing bits and a pretty serious flickering issue. The Kinect's frame rate isn't bad, with a maximum of around 30 fps; however, the random noise in the data draws your eye to every refresh. In this article I am going to show you my solution to this problem. I will be smoothing Depth Frames in real time as they come from the Kinect and are rendered to the screen. This is accomplished through two combined methods: pixel filtering and a weighted moving average.

Background

Some Information on the Kinect

By now, I would assume that everyone has at least heard of the Kinect and understands the basic premise. It's a specialized sensor built by Microsoft that is capable of recognizing and tracking humans in 3D space. How is it able to do that? While it's true that the Kinect has two cameras in it, it does not accomplish 3D sensing through stereo optics. A technology called Light Coding makes the 3D sensing possible.

On the Kinect, there is an Infrared (IR) Projector, a Color (RGB) Camera, and an Infrared (IR) Sensor. For purposes of 3D sensing, the IR Projector emits a pattern of IR light in front of it. This light reflects off objects in its path and back to the IR Sensor. The pattern received by the IR Sensor is then decoded in the Kinect to determine the depth information, which is sent via USB to another device for further processing. This depth information is incredibly useful in computer vision applications. As a part of the Kinect Beta SDK, this depth information is used to determine joint locations on the human body, thereby allowing developers like us to come up with all sorts of useful applications and functionality.

Important Setup Information

Before you download either the demo application source or the demo application executable, you need to prepare your development environment. To use this application, you need to have the Kinect Beta 2 SDK installed on your machine: http://www.microsoft.com/en-us/kinectforwindows/download/

At the time of this posting, the commercial SDK has not been released. Please be sure to use only the Kinect Beta 2 SDK for this article’s downloads. Also, the SDK installs with a couple of demo applications; please be sure that these run on your machine before you download the files for this article.

The Problem of Depth Data

Before I dive into the solution, let me better express the problem. Below is a screen shot of raw depth data rendered to an image for reference. Objects that are closer to the Kinect are lighter in color and objects that are further away are darker.

What you're looking at is an image of me sitting at my desk. I'm sitting in the middle; there is a bookcase to the left and a fake Christmas tree to the right. As you can already tell, even without the flickering of a video feed, the quality is pretty low. The maximum resolution that you can get for depth data from the Kinect is 320x240, but even for this resolution the quality looks poor indeed. The noise in the data manifests itself as white spots continuously popping in and out of the picture. Some of the noise comes from the IR light being scattered by the object it’s hitting, and some comes from shadows cast by objects closer to the Kinect. I wear glasses and often have noise where my glasses should be due to the IR light scattering.

Another limitation of the depth data is how far the sensor can see. The current limit is about 8 meters. Do you see that giant white square behind me in the picture? That's not an object close to the Kinect; the room I'm in actually extends about another meter beyond that white square. This is how the Kinect handles objects that it can't see with depth sensing: it returns a depth of Zero.

The Solution

As I mentioned briefly, the solution I have developed uses two different methods of smoothing the depth data: pixel filtering and a weighted moving average. The two methods can be used either separately or in series to produce a smoothed output. While the solution doesn't completely remove all noise, it does make an appreciable difference. The methods I have used do not degrade the frame rate and are capable of producing real-time results for output to a screen or recording.

Pixel Filtering

The first step in the pixel filtering process is to transform the depth data from the Kinect into something that is a bit easier to process.

This method creates a simple short[] into which a depth value for each pixel is placed. The depth value is calculated from the byte[] of an ImageFrame that is sent every time the Kinect pushes a new frame. For each pixel, the byte[] of the ImageFrame has two values.

private short CalculateDistanceFromDepth(byte first, byte second)
{
    // Please note that this would be different if you
    // use Depth and User tracking rather than just depth
    return (short)(first | (second << 8));
}
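
To make the conversion concrete, here is a minimal sketch of how the two bytes per pixel might be folded into a short[]. The class name DepthConversion, the method name CreateDepthArray, and its parameters are illustrative assumptions, as is the idea that the frame bytes are available as a plain byte[] with the low byte first; this is not the demo application's actual code.

```csharp
using System;

public static class DepthConversion
{
    // Combine the two bytes of one pixel (low byte first) into a depth value.
    // Assumes a depth-only stream, as in the helper shown above.
    public static short CalculateDistanceFromDepth(byte first, byte second)
    {
        return (short)(first | (second << 8));
    }

    // Hypothetical helper: walks the raw frame bytes two at a time and
    // writes one depth value per pixel into a short[].
    public static short[] CreateDepthArray(byte[] rawBits, int width, int height)
    {
        if (rawBits.Length != width * height * 2)
            throw new ArgumentException("Expected two bytes per pixel.", "rawBits");

        var depthArray = new short[width * height];
        for (int pixel = 0; pixel < depthArray.Length; pixel++)
        {
            int byteIndex = pixel * 2;
            depthArray[pixel] = CalculateDistanceFromDepth(
                rawBits[byteIndex], rawBits[byteIndex + 1]);
        }
        return depthArray;
    }
}
```

For a 320x240 frame this would be called as DepthConversion.CreateDepthArray(bits, 320, 240), where bits stands for the 153,600-byte buffer of the frame.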

Now that we have an array that is a bit easier to process, we can begin applying the actual filter to it. We scan through the entire array, pixel by pixel, looking for Zero values. These are the values that the Kinect couldn't process properly. We want to remove as many of these as realistically possible without degrading performance or reducing other features of the data (more on that later).

When we find a Zero value in the array, it is considered a candidate for filtering, and we must take a closer look. In particular, we want to look at the neighboring pixels. The filter effectively has two "bands" around the candidate pixel, and is used to search for non-Zero values in other pixels. The filter will sum all of these values, and take note of how many were found in each band. It will then compare these counts to an arbitrary threshold value for each band to determine if the candidate should be filtered. If the threshold for either band is broken, then the average of all the non-Zero values will be applied to the candidate; otherwise it is left alone.

The biggest consideration for this method is ensuring that the bands for the filter actually surround the pixel as it would be displayed in the rendered image, and are not just values next to each other in the depth array. The code to apply this filter is as follows:
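
As a rough illustration of the two-band idea (a sketch under stated assumptions, not the demo application's listing), the filter might be written like this. It assumes the inner band is the 8 pixels directly around the candidate and the outer band is the 16 pixels one step further out, and it treats a threshold as "broken" when the count of non-Zero pixels in that band reaches it. The names PixelFilter, FillZeroPixels, innerThreshold, and outerThreshold are hypothetical.

```csharp
using System;

public static class PixelFilter
{
    public static short[] FillZeroPixels(short[] depth, int width, int height,
                                         int innerThreshold, int outerThreshold)
    {
        // Read from the original array and write to a copy, so a filled
        // pixel never influences the decision for a neighboring candidate.
        var result = (short[])depth.Clone();

        for (int row = 0; row < height; row++)
        {
            for (int col = 0; col < width; col++)
            {
                int index = col + row * width;
                if (depth[index] != 0) continue; // only Zero pixels are candidates

                int innerCount = 0, outerCount = 0;
                long sum = 0;

                // Visit the 5x5 neighborhood in image (row, column) space,
                // not just adjacent positions in the flat array.
                for (int dy = -2; dy <= 2; dy++)
                {
                    for (int dx = -2; dx <= 2; dx++)
                    {
                        if (dx == 0 && dy == 0) continue;
                        int x = col + dx, y = row + dy;
                        if (x < 0 || x >= width || y < 0 || y >= height) continue;

                        short neighbor = depth[x + y * width];
                        if (neighbor == 0) continue;

                        sum += neighbor;
                        if (Math.Abs(dx) <= 1 && Math.Abs(dy) <= 1) innerCount++;
                        else outerCount++;
                    }
                }

                int total = innerCount + outerCount;
                if (total > 0 &&
                    (innerCount >= innerThreshold || outerCount >= outerThreshold))
                {
                    result[index] = (short)(sum / total);
                }
            }
        }
        return result;
    }
}
```

Checking the image bounds on x and y separately is what keeps a pixel at the right edge of one row from being treated as a neighbor of the pixel at the left edge of the next row.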

Weighted Moving Average

Now that we have a filtered depth array on our hands, we can move on to the process of calculating a weighted moving average of an arbitrary number of previous depth arrays. The reason we do this is to reduce the flickering effect produced by the random noise still left in the depth array. At 30 fps, you're really going to notice the flicker. I had previously tried an interlacing technique to reduce the flicker, but it never really looked as smooth as I would like. After experimenting with a couple other methods, I settled on the weighted moving average.

What we do is set up a Queue to store our most recent X number of depth arrays. Since Queues are a FIFO (First In, First Out) collection object, they have excellent methods to handle discrete sets of time series data. We then give the most recent depth array the highest weight and the oldest the lowest. A new depth array is created from the weighted average of the depth frames in the Queue.

This weighting method was chosen due to the blurring effect that averaging motion data can have on the final rendering. If you were to stand still, a straight average would work fine with a small number of items in your Queue. However, once you start moving around, you will have a noticeable trail behind you anywhere you go. You can still get this with a weighted moving average, but the effects are less noticeable. The code for this is as follows:

averageQueue.Enqueue(depthArray);
CheckForDequeue();

int[] sumDepthArray = new int[depthArray.Length];
short[] averagedDepthArray = new short[depthArray.Length];
int Denominator = 0;
int Count = 1;

// REMEMBER!!! Queues are FIFO (first in, first out).
// This means that when you iterate over them, you will
// encounter the oldest frame first.
// We first create a single array, summing all of the pixels
// of each frame on a weighted basis and determining the denominator
// that we will be using later.
foreach (var item in averageQueue)
{
    // Process each row in parallel
    Parallel.For(0, 240, depthArrayRowIndex =>
    {
        // Process each pixel in the row
        for (int depthArrayColumnIndex = 0; depthArrayColumnIndex < 320; depthArrayColumnIndex++)
        {
            var index = depthArrayColumnIndex + (depthArrayRowIndex * 320);
            sumDepthArray[index] += item[index] * Count;
        }
    });
    Denominator += Count;
    Count++;
}

// Once we have summed all of the information on a weighted basis,
// we can divide each pixel by our denominator to get a weighted average.
Parallel.For(0, depthArray.Length, i =>
{
    averagedDepthArray[i] = (short)(sumDepthArray[i] / Denominator);
});
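
To see the weighting on a small scale, here is a single-threaded sketch of the same idea that can be run on tiny frames. The class name WeightedAverageExample and the TrimQueue method are hypothetical; TrimQueue is only a guess at what a helper like CheckForDequeue needs to do (drop the oldest frames once the Queue is over its limit), not the demo application's implementation.

```csharp
using System;
using System.Collections.Generic;

public static class WeightedAverageExample
{
    // Hypothetical stand-in for CheckForDequeue: drop the oldest frames
    // until at most maxFrames remain.
    public static void TrimQueue(Queue<short[]> frames, int maxFrames)
    {
        while (frames.Count > maxFrames)
            frames.Dequeue();
    }

    // Weighted average with weights 1, 2, ..., N from oldest to newest.
    public static short[] Average(Queue<short[]> frames, int pixelCount)
    {
        var sums = new int[pixelCount];
        int denominator = 0;
        int weight = 1;

        foreach (short[] frame in frames) // iteration yields the oldest frame first
        {
            for (int i = 0; i < pixelCount; i++)
                sums[i] += frame[i] * weight;
            denominator += weight;
            weight++;
        }

        var averaged = new short[pixelCount];
        if (denominator == 0) return averaged; // empty queue: nothing to average

        for (int i = 0; i < pixelCount; i++)
            averaged[i] = (short)(sums[i] / denominator);
        return averaged;
    }
}
```

For a single pixel whose last three depths were 1000, 1000, and 1600 (oldest to newest), the weights 1, 2, and 3 give (1000 + 2000 + 4800) / 6 = 1300, whereas a straight average gives 1200. The weighted result sits closer to the newest frame, which is why the trail behind a moving object is less noticeable.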

Render the Image

Now that we have applied both of our smoothing techniques to the depth data, we can render the image to a Bitmap:

// We multiply the product of width and height by 4 because each byte
// will represent a different color channel per pixel in the final image.
byte[] colorFrame = new byte[width * height * 4];

// Process each row in parallel
Parallel.For(0, 240, depthArrayRowIndex =>
{
    // Process each pixel in the row
    for (int depthArrayColumnIndex = 0; depthArrayColumnIndex < 320; depthArrayColumnIndex++)
    {
        var distanceIndex = depthArrayColumnIndex + (depthArrayRowIndex * 320);
        // Because the colorFrame we are creating has four times as many bytes representing
        // a pixel in the final image, we set the index to be the depth index * 4.
        var index = distanceIndex * 4;
        // Map the distance to an intensity that can be represented in RGB
        var intensity = CalculateIntensityFromDistance(depthArray[distanceIndex]);
        // Apply the intensity to the color channels
        colorFrame[index + BlueIndex] = intensity;
        colorFrame[index + GreenIndex] = intensity;
        colorFrame[index + RedIndex] = intensity;
    }
});
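
The listing above relies on CalculateIntensityFromDistance, which is not shown. A minimal sketch of one way such a mapping could work follows. The class name DepthRendering and the MinDepth and MaxDepth constants are assumptions chosen purely for illustration, with depths taken to be in millimeters; the demo application may use different limits or a different formula.

```csharp
using System;

public static class DepthRendering
{
    // Assumed working range in millimeters; illustrative values only.
    public const int MinDepth = 800;
    public const int MaxDepth = 4000;

    // Maps a depth to a gray level so that nearer objects are lighter.
    // Zero (unknown) depths come out white, matching the white patches
    // described for pixels the sensor could not read.
    public static byte CalculateIntensityFromDistance(int distance)
    {
        if (distance <= MinDepth) return 255; // includes the Zero "unknown" value
        if (distance >= MaxDepth) return 0;   // at or beyond the assumed far limit

        // Linear interpolation between white (near) and black (far).
        return (byte)(255 - 255 * (distance - MinDepth) / (MaxDepth - MinDepth));
    }
}
```

With these assumed limits, a depth halfway through the range (2400 mm) maps to a mid gray, and anything closer than 800 mm is indistinguishable from an unknown pixel.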

All Together Now

Now that I have shown you some of the code and theory behind the smoothing process, let’s look at it in terms of using the demo application provided in the links above.

As you can see, the demo application does a side-by-side comparison of the Raw Depth Image and the Smoothed Depth Image. You can experiment with the smoothing settings in the application as well. The settings that you will find when you first run the application are what I recommend for general-purpose use. They provide a good mix of smoothing for stationary and moving objects without trying to "fill in" too much from the filtering method.

For example: You can turn both band filters down to 1, and turn the weighted moving average up to 10, and you'll have the lowest flicker and noise for stationary blunt objects. However, once you move, you will have a very noticeable trail, and your fingers will all look like they are webbed if you don't have a wall close behind you.

Points of Interest

I have really enjoyed playing around with these smoothing techniques and learning that there is probably no 'one-size-fits-all' solution for it. Even with the same hardware, your physical environment and intentions will drive your choice for smoothing more than anything. I would like to encourage you to open the code and take a look for yourself, and share your ideas for improvement! At the very least, go borrow your neighbor kid’s Kinect for a day and give the demo application a whirl.

I'll leave you with a brief video demonstration of the demo application. In it, I pretty much just sit and wave my arms around, but it gives you a good idea of what these techniques are capable of doing. I run through all the combinations of features in 70 seconds, and there is no audio.

Please keep in mind that it is almost impossible to see a change in the flicker when I turn off the weighted moving average, due to the low frame rate of YouTube. You'll just have to trust me, or download the code; it's like night and day.

Further Reading

If this topic interests you, I would highly recommend reading the Microsoft Research paper on KinectFusion: Real-Time Dense Surface Mapping and Tracking. They have done some amazing work in this particular area. However, I don’t think you would ever be able to achieve these results with .NET: http://research.microsoft.com/pubs/155378/ismar2011.pdf

History

January 21st, 2012 - First Version

January 22nd, 2012 - Updated the article and downloads to include a suggestion from jpmik in the comments