Face Tracking SDK in Kinect For Windows 1.5

After a long journey, my team at Microsoft shipped the Face Tracking SDK as part of Kinect for Windows 1.5! I worked on the 3D face tracking technology (starting from the time when it was part of Avatar Kinect), so I’d like to describe its capabilities and limitations in this post. First of all, here is the demo:

You can use the Face Tracking SDK in your program if you install the Kinect for Windows Developer Toolkit 1.5. After you install it, go to the provided samples and run or build the “Face Tracking Visualization” C++ sample or the “Face Tracking Basics-WPF” C# sample. Of course, you need a Kinect camera attached to your PC 😉 The face tracking engine tracks at 4-8 ms per frame, depending on how powerful your PC is. It does its computations on the CPU only (it does not use the GPU, since that may be needed to render graphics).

If you look at the 2 code samples mentioned above, you can see that it is relatively easy to add face tracking capabilities to your application. You need to link with a provided lib, place 2 DLLs in the global path or in the working directory of your executable (so they can be found) and add something like this to your code (this is in C++; you can also do it in C#, see the code samples):

The code calls the face tracker by using either the StartTracking() or ContinueTracking() function. StartTracking() is the more expensive function, since it searches for a face on the passed RGB frame. ContinueTracking() uses the previous face location to resume tracking. StartTracking() is more stable when you have big breaks between frames, since it is stateless.

There are 2 modes in which the face tracker operates – with skeleton-based information and without it. In the 1st mode you pass an array with 2 head points to the StartTracking/ContinueTracking methods. These head points are the endpoints of the head bone contained in the NUI_SKELETON_DATA structure returned by the Kinect API. This head bone is indexed by the NUI_SKELETON_POSITION_HEAD member of the NUI_SKELETON_POSITION_INDEX enumeration. The 1st head point is the neck position and the 2nd head point is the head position. These points allow the face tracker to find a face faster and more easily, so this mode is cheaper in terms of computing resources (and sometimes more reliable at big head rotations). The 2nd mode only requires a color frame and a depth frame to be passed, with an optional region-of-interest parameter that tells the face tracker where on the RGB frame to search for a user’s face. If the region of interest is not passed (passed as NULL), then the face tracker will try to find a face on the full RGB frame, which is the slowest mode of operation of StartTracking(). ContinueTracking() will use a previously found face and so is much faster.

Camera configuration structure – it is very important to pass correct parameters in it, such as the frame width, height and the corresponding camera focal length in pixels. We don’t read these automatically from the Kinect camera in order to give advanced users more flexibility. If you don’t initialize them to the correct values (which can be read from the Kinect APIs), the tracking accuracy will suffer or the tracking will fail entirely.
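If you cannot read the focal length from the Kinect APIs (for example, when feeding frames from another camera), you can estimate it from the camera’s field of view with the usual pinhole model. A minimal sketch (the FOV values in the note below are approximate, not authoritative numbers):

```cpp
#include <cmath>

// Estimate a pinhole-camera focal length in pixels from the horizontal
// field of view (in degrees) and the image width in pixels:
//   f = (width / 2) / tan(fov / 2)
float FocalLengthFromFov(float imageWidthPixels, float horizontalFovDegrees)
{
    const float fovRadians = horizontalFovDegrees * 3.14159265f / 180.0f;
    return (imageWidthPixels * 0.5f) / std::tan(fovRadians * 0.5f);
}
```

For a 640-wide color frame and a roughly 62° horizontal FOV this gives about 532 pixels, close to the SDK’s nominal color focal length; for a 320-wide depth frame and a roughly 58.5° FOV it lands near the nominal 285.63 depth value.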

Frame of reference for 3D results – the face tracking SDK uses both depth and color data, so we had to pick which camera space (video or depth) to compute 3D tracking results in. Due to some technical advantages we decided to do it in the color camera space. So the resulting frame of reference for 3D face tracking results is the video camera space. It is a right-handed system with the Z axis pointing toward the tracked person and Y pointing up. The measurement units are meters. So it is very similar to Kinect’s skeleton coordinate frame except for the origin and the orientation of its optical axis (the skeleton frame of reference is in the depth camera space). The online documentation has a sample that describes how to convert from the color camera space to the depth camera space.

Also, here are several things that will affect tracking accuracy:

1) Light – a face should be well lit without too many harsh shadows on it. Bright backlight or sidelight may make tracking worse.

2) Distance to the Kinect camera – the closer you are to the camera the better it will track. The tracking quality is best when you are closer than 1.5 meters (4.9 feet) to the camera. At closer range Kinect’s depth data is more precise and so the face tracking engine can compute face 3D points more accurately.

3) Occlusions – if you have thick glasses or a Lincoln-like beard, you may have issues with the face tracking. This is still an open area for improvement :-) Face color is NOT an issue, as can be seen in this video.

Here are some technical details for more technologically/math-minded people: we used the Active Appearance Model as the foundation for our 2D feature tracker. Then we extended our computation engine to use Kinect’s depth data so it can track faces/heads in 3D. This made it much more robust than 2D feature point trackers, since the Active Appearance Model alone is not robust enough to handle all real-world scenarios. Of course, we also used lots of secret sauce to make things work well together :-) You can read about some of these algorithms here, here and here.

Have fun with the face tracking SDK!

We published this paper that describes the face tracking algorithm in detail.

Credits – Many people worked on this project or helped with their expertise:

Very good work. I’m trying to use it in C# to cut out the head like the purple rect in the C++ sample, but there’s no GetFaceRect() magic method in C#.
Is there anywhere we can find a map to see which positions map to each FeaturePoint (maybe not all, but the most useful ones)?

Thanks! You can see definitions of the 2D points in the photo at http://msdn.microsoft.com/en-us/library/jj130970.aspx#ID4EUF. The API returns 87 2D points that you can use to get various face features. The rectangle is easy to get based on those. In addition, you get the 3D head pose and the 3D animation and shape units that are parameters of the 3D model. You can get the 3D model vertices if you feed the tracking data into the IFTModel interface. Unfortunately, the C# API is a sample only and is pretty basic. You can get the full API if you call directly into the interop. See FTInterop.cs in the C# sample – it has all the method bindings to the native COM APIs. The 3D vertices returned by the IFTModel interface are also semantic and stable (if vertex N is the corner of the left eye, it will stay that way for other tracked frames).
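As a small illustration of deriving the rectangle from the points, here is a bounding-box sketch (plain structs here, not the SDK’s types):

```cpp
#include <algorithm>
#include <vector>

struct Point2D { float x, y; };
struct Rect { float left, top, right, bottom; };

// Compute the axis-aligned bounding rectangle of tracked 2D face points
// (e.g. the 87 points the face tracker returns in color image space).
// Assumes a non-empty point list.
Rect BoundingRectOfFacePoints(const std::vector<Point2D>& points)
{
    Rect r = { points[0].x, points[0].y, points[0].x, points[0].y };
    for (const Point2D& p : points)
    {
        r.left   = std::min(r.left, p.x);
        r.top    = std::min(r.top, p.y);
        r.right  = std::max(r.right, p.x);
        r.bottom = std::max(r.bottom, p.y);
    }
    return r;
}
```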

I’m trying to use the orientation of the face (i.e., the tracked face’s “look direction”) and the orientation of either arm (i.e., where the arms are pointing), so that when a user directs their eye/face gaze and points at a real-world object in front of a projection screen, I can work out the real-world 3D point where these two vectors/rays meet and have the avatar look out at that point from its virtual world. I’m using the XNA Avateering sample as the starting block for this experiment.

Question 1: How do I get the look direction of the tracked face? Is it simply derived from the rotation data for the tracked face?
Question 2: Does the face-tracking just rely on the RGB camera, or does it also use the depth camera to figure out the position of the head?

Sometimes the triangle mesh appears totally in the wrong place on the screen, nowhere near my head. Also, lighting conditions play a huge role in successful face-tracking. It can be difficult to work with at night. Thank god for Kinect Studio – one good recording and you never have to dance around the room again.

Answers:
1) To get the vector that points to where the face looks, you can either derive it from the rotation angles or call the IFTModel interface and pass it the computed IFTResult object returned by the face tracker. IFTModel returns a list of 3D points on the face (feature points) that you can use to compute this vector. Alternatively, you can derive it from the Euler rotation angles that are part of the IFTResult (call GetHeadPose()). The angles are in degrees, and you should take into account that they are computed in the right-handed coordinate system of the color camera. The 0,0,0 angles correspond to the face looking straight at the camera, aligned with its optical axis.
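To sketch the angles-based option: one way to turn the head pose angles into a look vector (the rotation order and the base direction here are assumptions of this sketch – verify against your data):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Rotate the "looking straight at the camera" direction (0, 0, -1) by the
// head pose Euler angles (in degrees). Right-handed frame: X right, Y up,
// Z toward the tracked person. The rotation order assumed here is pitch
// about X applied first, then yaw about Y (roll does not change the look
// direction, so it is omitted).
Vec3 LookDirectionFromHeadPose(float pitchDeg, float yawDeg)
{
    const float d2r = 3.14159265f / 180.0f;
    const float pitch = pitchDeg * d2r;
    const float yaw = yawDeg * d2r;
    Vec3 v;
    v.x = -std::sin(yaw) * std::cos(pitch);
    v.y = std::sin(pitch);
    v.z = -std::cos(yaw) * std::cos(pitch);
    return v;
}
```

With 0,0 angles this returns (0, 0, -1), i.e. looking straight back at the camera along the optical axis, matching the convention above.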

2) We use both the RGB and the depth camera. We rely on both of them equally up to 1.5 meters away from the camera, and beyond that distance the tracker starts “trusting” RGB data more and relies less on depth data, since it becomes noisy. Light plays a role in the tracking – harsh shadows or occlusions may affect it. At closer range (<1.5 meters) it affects it less, since there is enough data from the camera for OK face tracking.

The face triangles should be on your face. If they are really off then we might want to investigate. You can capture a frame and contact me to look at it.

Is the depth data necessary? The official documentation says depth is optional for face tracking.
However, if NULL is passed as the second argument (pDepthCameraConfig) of IFTFaceTracker::Initialize(), IFTFaceTracker cannot be initialized properly… Is it really optional, since it is said that the depth information is required for tracking and detection?
Also, if the depth data pointer is not passed in the FT_SENSOR_DATA, it always fails to locate the face.
Thanks in advance,

Depth data is optional, but you must have a Kinect connected to your PC for the SDK to work. If you don’t use depth, then make sure that you call IFTFaceTracker::Initialize() correctly (with no depth camera).

Thanks for your reply! With the Kinect connected, I tried to replace the Kinect video buffer with video captured from my webcam. When the depth camera points to a scene far away, like 1.5 meters away as you said, the tracking is acceptable.
If I don’t use the depth information, I have a problem calling IFTFaceTracker::Initialize() correctly. Using ” FT_CAMERA_CONFIG myCameraConfig = {640, 480, 1.0}; HRESULT hr = pFT->Initialize(&myCameraConfig, NULL, NULL, NULL); ” from the sample code at http://msdn.microsoft.com/en-us/library/jj130970, it kept returning E_POINTER. And this made CreateFTResult fail as well.
Thanks!!

Sorry, I misled you in my previous comments — the Kinect for Windows Face Tracking SDK requires you to pass a depth frame and initialize it to be used with the depth camera. You can use your HD color camera to augment face tracking with higher quality video feed, but then you must implement your own Depth to Color UV conversion function (and pass it to the face tracker initialization).

You can read 2D facial points, the 3D head position and 3D animation coefficients from the IFTResult interface (in C++). There is a corresponding thing in the C# interop API. Then you can call IFTModel and pass it that data to create a 3D mask with 121 3D vertices located on your face. The 2D and 3D points stay on your face even when it moves – if the left eye corner corresponds to 3D vertex N, then N will always track that point.

Very nice work. The tracker seems to generalize to arbitrary faces very well. I was kinda surprised by this, as poor generalization is a known problem of AAMs. Are you planning to publish the details of the algorithm at ICCV, CVPR or somewhere else? It would be nice if you’d uncover at least some of the secret sauce 😉.

Hi! Great work! I am currently doing research in computer vision and trying to develop a head pose estimation system that works in real time. Now, I have a question. Assume that when the user looks straight at the camera (this being the 0,0,0 origin), he has the option to turn his face 90 degrees to the right or the left. Up to what angle can the SDK track the head and give its direction in terms of roll, pitch and yaw? For example, if the user turns his head 90 degrees left from the origin, will the SDK be able to track the face movement all the way, or is there some limitation? Also, regarding getting the head movement in real time as roll, pitch and yaw, what manipulations/algorithms does one have to apply, or is there a pre-built function in the SDK for that?

I was looking at your sample FaceTracking3D-WPF, and as far as I understand, by using the GetTriangles method I can get vertices and use them to create a 3D model. Now what I would like to do is map the result to some 3D face model (an avatar like in the documentation http://msdn.microsoft.com/en-us/library/jj130970.aspx#ID4EUF) designed in Maya or 3ds Max. Any suggestions how to do this, and what should my CG artist supply to me?

You can either map FT SDK animation units to some predetermined bone movements (based on some mapping logic), or you can use an algorithm similar to ICP – FT SDK model vertices can be mapped to your rendered model’s vertices, and to deduce how to move your model (to find the deformation parameters) you can solve a least squares problem (find the model parameters that minimize the squared distance between the 2 models).

Hi, very good job!
I found that if my face gets close to the Kinect, it does not work. It seems that I have to keep my face at least 80 cm – 1 m away from the Kinect. Can I change this distance? I mean, is it possible to make it work when my face is 50-60 cm from the Kinect? Thanks a lot.

You need to switch the Kinect to “near mode”, in which depth data is available at short range (~40 cm). We use depth data, so when you are too close there is a black hole on the face and the tracker stops working. Near mode pushes the range closer to the camera. It can be switched on when you initialize the Kinect camera via its API.

Hello, it seems like you’ve got lots of undeclared variables (or a missing #include of FaceTrackLib.h). Please see the C++ code sample provided with the Face Tracking SDK for a compilable and working sample. The code from my page is only an example; it will not compile as is, since it is not a finished project.

Yes, the camera code is just pseudocode and the code in general on this page is NOT intended for compilation! It is just a “text” sample. Please use C++ or C# samples provided with the Face tracking SDK – they have fully operational code that you can build and modify.

3. The sample visualization code provided in C++ in the developer toolkit uses WPF, and hence it is very difficult to display the yaw, pitch and roll angles on screen. Also, there are very few resources online that mention how to enable a console in WPF in Visual Studio using C++. Can you please help out here?

The provided sample on my page is just a “text sample”; it is not intended for compilation (it has pseudocode in it). You can build the code samples included in the face tracking SDK. They should build, and there are samples for both C++ and C#. The API itself is a native COM object and easy to consume in C++. For C#, look at the C# code sample – it has an assembly with the interop that makes it simple to call the FT COM API.

I just tested the Face API and displayed the head orientation angles. I have one more question regarding that. All the functions in the Face Tracking API give output/results in reference to the Kinect coordinate frame. So the results would make sense as long as the coordinate frame of the user’s head (the coordinate system where the user’s head is at the origin) is completely parallel to the Kinect coordinate system. But this will not always be the case, as the user will keep moving. So, is there a hidden transformation of coordinate systems that takes place inside the Kinect SDK (meaning the data calculated in the Kinect coordinate frame is converted to the user coordinate frame)? Please do advise regarding this.

The Face Tracking SDK coordinate system is just like Kinect’s skeleton coordinate frame – right-handed, Z pointing toward the user, Y up – but the origin is shifted to the video camera’s optical center. Kinect’s skeletal tracking coordinate system has its center in the depth (IR) camera’s optical center. This is because the face tracking API uses RGB + depth data heavily and does its computations in the visual camera frame of reference, while Kinect’s skeletal tracking works purely in the IR camera space. You can convert from the FT API coordinate frame to Kinect’s skeletal coordinate frame by following this sample (not sure if there is one on MSDN):
This function just approximates the real frame-to-frame transform. Future versions of the Kinect API may contain an API for this. Use this code as is, no warranties or anything 🙂

/*
This function demonstrates a simplified (and approximate) way of converting from the color camera space to the depth camera space.
It takes a 3D point in the color camera space and returns its coordinates in the depth camera space.
The algorithm is as follows:
1) take a point in the depth camera space that is near the resulting converted 3D point. As a “good enough approximation”
we take the coordinates of the original color camera space point.
2) Project the depth camera space point to (u,v) depth image space
3) Convert depth image (u,v) coordinates to (u’,v’) color image coordinates with Kinect API
4) Un-project the converted (u’,v’) color image point to the 3D color camera space (uses the known Z from the depth space)
5) Find the translation vector between two spaces as translation = colorCameraSpacePoint – depthCameraSpacePoint
6) Translate the original passed color camera space 3D point by the inverse of the computed translation vector.

This algorithm is only a rough approximation and assumes that the transformation between camera spaces is roughly the same in
a small neighbourhood of a given point.
*/
HRESULT ConvertFromColorCameraSpaceToDepthCameraSpace(const XMFLOAT3* pPointInColorCameraSpace, XMFLOAT3* pPointInDepthCameraSpace)
{
// Camera settings – these should be changed according to the camera mode
float depthImageWidth = 320.0f;
float depthImageHeight = 240.0f;
float depthCameraFocalLengthInPixels = NUI_CAMERA_DEPTH_NOMINAL_FOCAL_LENGTH_IN_PIXELS;
float colorImageWidth = 640.0f;
float colorImageHeight = 480.0f;
float colorCameraFocalLengthInPixels = NUI_CAMERA_COLOR_NOMINAL_FOCAL_LENGTH_IN_PIXELS;

// 1) Take a point in the depth camera space near the expected resulting point. Here we use the passed
// color camera space 3D point as a "good enough" starting approximation
XMFLOAT3 depthCameraSpace3DPoint = *pPointInColorCameraSpace;

// 2) Project this point to the depth image (u,v) space (pinhole camera model)
LONG depthU = LONG(depthImageWidth * 0.5f + depthCameraSpace3DPoint.x * depthCameraFocalLengthInPixels / depthCameraSpace3DPoint.z);
LONG depthV = LONG(depthImageHeight * 0.5f - depthCameraSpace3DPoint.y * depthCameraFocalLengthInPixels / depthCameraSpace3DPoint.z);

// 3) Convert the depth image (u,v) coordinates to color image (u',v') coordinates with the Kinect API.
// The depth value is passed in Kinect's packed format - millimeters shifted left by the 3 player-index bits
USHORT packedDepth = USHORT(depthCameraSpace3DPoint.z * 1000.0f) << 3;
LONG colorU = 0, colorV = 0;
HRESULT hr = NuiImageGetColorPixelCoordinatesFromDepthPixel(NUI_IMAGE_RESOLUTION_640x480, NULL, depthU, depthV, packedDepth, &colorU, &colorV);
if (FAILED(hr)) return hr;

// 4) Un-project the converted (u',v') color image point back to the 3D color camera space (known Z)
XMFLOAT3 colorCameraSpace3DPoint;
colorCameraSpace3DPoint.z = depthCameraSpace3DPoint.z;
colorCameraSpace3DPoint.x = (float(colorU) - colorImageWidth * 0.5f) * colorCameraSpace3DPoint.z / colorCameraFocalLengthInPixels;
colorCameraSpace3DPoint.y = (colorImageHeight * 0.5f - float(colorV)) * colorCameraSpace3DPoint.z / colorCameraFocalLengthInPixels;

// 5) Find the translation between the two camera spaces in the neighbourhood of this point
XMVECTOR vTranslationFromColorToDepthCameraSpace = XMLoadFloat3(&colorCameraSpace3DPoint) - XMLoadFloat3(&depthCameraSpace3DPoint);

// 6) Transform the original color camera 3D point to the depth camera space by using the inverse of the computed shift vector
XMVECTOR v3DPointInKinectSkeletonSpace = XMLoadFloat3(pPointInColorCameraSpace) - vTranslationFromColorToDepthCameraSpace;
XMStoreFloat3(pPointInDepthCameraSpace, v3DPointInKinectSkeletonSpace);

return S_OK;
}

Hello, Nikolai
I want to use the Microsoft Face Tracking SDK to track some 3D model sequences which are not from the Kinect camera. Basically, I can give the rendered 3D textured model image (640×480, R8G8B8) and the corresponding depth image (640×480, UINT16) to the IFTFaceTracker. Although I did not give the head and neck positions to the StartTracking function, I hope it is still able to track the model.

Currently, I found that in order to activate the face tracking engine, I need to call NuiInitialize(), although I do not actually use data from the Kinect. The problem is that the StartTracking method always returns E_POINTER. I know that the IFTFaceTracker class can accept color images in R8G8B8 format and depth images in D16 format, which I currently use. However, in the FaceTracker demo code in the SDK, the color image format is BGRX and the depth format is D13P3. Does the data have to be in this format?

Another Question, since the camera parameter is significant for the tracking result, how can I get the focal length in pixels for a virtual camera in DirectX?

Hi Todd, yes, you must have a Kinect attached and call NuiInitialize() to use the face tracking API (for commercial reasons). But with those in place you can still feed the API from other sources, like your rendered 3D faces, as long as the format of the RGB and depth images is correct. The API can accept several formats for RGB frames:
FTIMAGEFORMAT_UINT8_GR8, FTIMAGEFORMAT_UINT8_R8G8B8, FTIMAGEFORMAT_UINT8_A8R8G8B8, FTIMAGEFORMAT_UINT8_X8R8G8B8, FTIMAGEFORMAT_UINT8_B8G8R8X8, FTIMAGEFORMAT_UINT8_B8G8R8A8; for depth frames it must be FTIMAGEFORMAT_UINT16_D13P3 (the last 3 bits are reserved for Kinect’s player ID!). If you pass something wrong, it should return FT_ERROR_INVALID_INPUT_IMAGE.

The focal length is tricky to compute in pixels in the case of DirectX rendering – your view/projection matrix has the focal length as part of it (mixed with other values). The projection matrix is still perspective, and so works the same way as a real camera projection. How to turn that into a pixel value, I am not sure at the moment. For a real camera you can get its focal length in millimeters and then estimate it in pixels based on the FOV or the sensor size/resolution. In the rendered world you need to know its physical dimensions and somehow estimate the focal length. You can try various values and see if the tracker succeeds. The average value for a real webcam, or for tracking videos that worked OK for us in the past, is ~513-560 pixels.
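One way to recover a pixel focal length from a standard perspective projection matrix: for a D3D-style projection built from a vertical FOV, the (2,2) element equals cot(fovY/2), so the vertical focal length in pixels is (viewportHeight/2)·m22. A sketch under that assumption (square pixels, symmetric frustum):

```cpp
#include <cmath>

// For a standard D3D-style perspective projection built from a vertical
// field of view (e.g. D3DXMatrixPerspectiveFovLH), the (2,2) element is
// cot(fovY / 2). The vertical focal length in pixels is then
//   f = (viewportHeight / 2) * m22
// This assumes square pixels and a symmetric frustum.
float FocalLengthFromProjection(float m22, float viewportHeightPixels)
{
    return 0.5f * viewportHeightPixels * m22;
}

// Helper: the m22 element of such a projection for a vertical FOV in radians.
float ProjectionM22FromFovY(float fovYRadians)
{
    return 1.0f / std::tan(fovYRadians * 0.5f);
}
```

For a 480-pixel-high viewport and a vertical FOV of roughly 48.6 degrees (an illustrative value), this lands near the ~531-pixel nominal color focal length.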

Yeah, I noticed that my depth format wasn’t correct, so I will change it and see if I can get the StartTracking function to succeed. As for the focal length, I found that if you define it wrong, the tracked face result may be off the face in the color frame. I hope a wrong focal length does not lead to outright tracking failure.

Is the face data accurate enough to be used to distinguish different people? In other words, 3D facial recognition? Looking at http://msdn.microsoft.com/en-us/library/jj130970.aspx I see that different shape units are available, such as “Eye separation distance”. I would use these measurements to build a profile for each person during a registration process. Then later, I could compare these profiles against an unknown in order to determine who the person is. This would seem possible now – or am I missing something?

Yes, you can use the shape units returned by the API as a help to recognize people. However, the underlying 3D model is still pretty basic (a low-def Candide model), and so the computed shape coefficients are not sufficient for strong classification. You can combine them with other image-based classifiers to get to a good level of recognition (use them as a weak classifier in combination with other classification methods). I would not recommend building your recognition system solely on the current shape units – you will have too many false positives (they will get much better some time in the future though :-))
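To illustrate the weak-classifier idea, a minimal profile-matching sketch (the distance metric and the threshold here are my assumptions for illustration, not part of the SDK):

```cpp
#include <cmath>
#include <vector>

// Compare a candidate's shape-unit vector against an enrolled profile by
// Euclidean distance. The threshold is illustrative; a real system should
// treat this as only one weak signal among several classifiers.
float ShapeUnitDistance(const std::vector<float>& a, const std::vector<float>& b)
{
    float sum = 0.0f;
    for (size_t i = 0; i < a.size(); ++i)
    {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

bool MatchesProfile(const std::vector<float>& candidate, const std::vector<float>& profile, float threshold)
{
    return ShapeUnitDistance(candidate, profile) < threshold;
}
```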

The ‘Get2DShapePoints’ function in the Kinect Face Tracking SDK gives the 87 feature points in 2D. How do I get the Z coordinate for these points? Can I use the ‘NuiTransformSkeletonToDepthImage’ function for this? Also, the Face Tracking API follows the Kinect coordinate frame (with the Kinect camera at 0,0,0). Does the skeletal tracking follow the same coordinate frame or a different one?

Really nice work there, Nikolai.
But I am running into trouble trying to track the face without a skeleton in C#.
My sample is very similar to the C# example included in the SDK.
I modified it so I don’t need to rely on skeletons. My problem is that the frame.TrackSuccessful property is always false. The call is this.faceTracker.Track(colorImageFormat, colorImage, depthImageFormat, depthImage); The reference mentions that there are 3 possible calls: color image + depth image, color image + depth image + region of interest, or color image + depth image + skeleton. The calls without a skeleton never work.
What also bugs me is that CPU usage is just about 10%, whereas the C++ sample produces 30% on my machine.

The C# sample was intended to be used with the skeleton, so it may have some shortcomings when you run it without the skeleton. The underlying C++ API can track with and without the skeleton info available. Most likely you have some issue in your C# code somewhere. 30% CPU utilization on the C++ sample looks right; 10% is too little, I think.

This is the student who is trying to use the Kinect Face Tracking SDK to track 3D model sequences which are not from the Kinect camera. Currently, the program still fails at the FT_ERROR_HEAD_SEARCH_FAILED stage. I feel it is close to success but just don’t know what the problem is, so I came here to ask for your kind help.

Basically, I described my problem in the Microsoft forum for Kinect SDK 1.5. The link is below:

Besides that, I want to know if the SDK has specific requirement to track a face. For example:

1. Is the initial 3D head position or the skeleton information necessary for the tracker to start, since in my case there is no body?

2. Does the SDK have specific requirements about the pixel range for the face model in the depth image? For example, the depth image I get from the SingleFace demo can track a face whose depth is in the range [129, 186], where the nose tip has the lowest value and the contour the highest. This means the head spans a depth range of about 186-129=57. Is that necessary to keep for 3D data from other sources too?

3. The only part which may be different from the Kinect camera is the mapping function between the texture and the depth. Currently I define them like this:

The FTRegDepthToColor is self-defined function to map my own depth info to color info, which is:

Hi Todd, what you are trying to do is possible. You have to provide your own mapping function from depth to color pixels, and you have to set the focal lengths for the video and depth cameras to the right values (this is important). The error that you are having, FT_ERROR_HEAD_SEARCH_FAILED, typically happens when the API cannot find a face on a color frame, or when the depth area that corresponds to the face area on the color frame is invalid (for instance, all values are too far from reality). Also, you need to provide the depth frame in Kinect’s format, which has the 3 lowest bits reserved as a user ID. You can set this user ID to 1 for all pixels.

Replies to other questions:
1) Kinect’s skeleton is not needed for the FT API to work. If you don’t provide a head orientation vector or a region of interest, then the API will do face detection over the full color frame and will track a found face.
2) The FT API is able to track faces when the face area width/height is bigger than ~25 pixels (the actual number varies). In depth it is less important, but tracking may fail when you’ve got too few pixels.
3) Your mapping function is correct. Also, you need to set the focal length of the depth camera to some realistic value (in pixels). You can estimate it if you know the field of view of your camera. The FT API uses a simplistic pinhole camera model and assumes the same focal length in X and Y.
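The packed depth layout mentioned above (FTIMAGEFORMAT_UINT16_D13P3: 13 depth bits sitting above 3 player-index bits) can be sketched like this:

```cpp
#include <cstdint>

// Pack a depth value (millimeters) and a player index into Kinect's
// 16-bit depth format: 13 bits of depth above 3 player-index bits.
uint16_t PackDepth(uint16_t depthMillimeters, uint16_t playerIndex)
{
    return uint16_t((depthMillimeters << 3) | (playerIndex & 0x7));
}

uint16_t UnpackDepthMillimeters(uint16_t packed)
{
    return uint16_t(packed >> 3);
}

uint16_t UnpackPlayerIndex(uint16_t packed)
{
    return uint16_t(packed & 0x7);
}
```

So for synthetic depth data you would shift your millimeter values left by 3 and OR in a player index of 1, as suggested in the reply above.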

I am also trying to use the Face Tracker of the Microsoft Kinect SDK with inputs different from the Kinect camera (by the way, thanks for this great code). I have nearly succeeded in using it. However, I still have a rotation misalignment between my 3D inputs and the result of the face tracker (about 6 degrees on the pitch angle).

I suspect that there is an issue with the focal length of the depth camera. I generate my own depth map from a 3D model, so I can tune the focal length parameter used for the generation. However, when I modify the focal length used at the initialization of the face tracker, it has no effect on the results. I have tried a lot of different values (500, 571, 531, 200, 800, 1, 10000), but the result of the face tracker is always the same… Is it a bug? Can you at least provide me the default value used for the focal length of the depth camera?

The default depth camera focal length in Kinect is defined by the NUI_CAMERA_DEPTH_NOMINAL_FOCAL_LENGTH_IN_PIXELS define and is set to 285.63 for 320×240 resolution, so you should probably try that value. I’m not sure how the 3D data is generated in your case, so a different focal length could be better for you… The Kinect values are in NuiAPI.h and NuiImageCamera.h in the Kinect SDK.
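Since a focal length expressed in pixels scales linearly with image resolution, you can derive the value for other modes from the nominal 320×240 number; a quick sketch:

```cpp
// A focal length expressed in pixels scales linearly with image resolution:
// if 285.63 px corresponds to a 320-wide depth image, a 640-wide image from
// the same camera has twice that value.
float ScaleFocalLength(float focalLengthPixels, float originalWidth, float newWidth)
{
    return focalLengthPixels * (newWidth / originalWidth);
}
```

For example, scaling 285.63 from a 320-wide to a 640-wide frame gives about 571.26 pixels.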

I’m new to Kinect programming. I’m trying to use Get2DShapePoints() to get the absolute points in 2D. The sample code only uses GetProjected3DShape, which only gives mapped points. I want the raw points. Also, the 87 points mentioned on http://msdn.microsoft.com/en-us/library/jj130970.aspx can’t be mapped to the 122 points in GetProjected3DShape. Please help!

Thank you very much for your work and demo; I learned a lot. I have a question: can the face tracking engine use normal video as the input data? That is, asynchronous tracking: we first use the camera to record the face as a video file (like .avi), and then use this video file as the input data to track the face feature points. Can the face tracking SDK do this?
Thank you very much and I am eager for your reply.

You can do this as long as you have a Kinect camera attached to your computer. Then you need to feed your video frames to the face tracking API and provide the correct camera configuration, i.e. resolution and focal length. It is hard to estimate a focal length for a video if you don’t know the characteristics of the camera it was made with. You can estimate it from the field of view, though. The approximate focal length for most consumer webcams is about 500-600 pixels. The focal length greatly affects the quality of tracking.

How do I cut out this portion? I tried to use the coordinates used by the SDK to draw the pink rectangle to save only the face into a bitmap, but my saved picture doesn’t contain the face. Can someone show me how to cut out this portion?

Hi. I’m trying to make a simple face tracking program by following your code. Initialization is OK, and both the color image and depth image come in nicely, but I have no idea why StartTracking returns -1884553214 (= 0x8fac0002). In the case of the FaceTrackingVisualization sample, StartTracking returns S_OK and the program works fine.

You can read the face rectangle from IFTResult and then use it to cut that RECT out of the input color frame that you passed to the engine to get this IFTResult. The rectangle is in color frame image coordinates.

Yes, it’s true there is an offset between the cameras.
If we look at the default example of face tracking in C#, this.facePoints = frame.GetProjected3DShape(); returns a feature point collection. Using this, I can only get the X, Y of a feature point. I need depth information as well. How can we achieve this?

Does CoordinateMapper.MapColorFrameToDepthFrame return depth points? I am not getting any clue how I can use these depth points in correspondence with the feature points.

The depth and color camera 3D spaces are not the same, and there is an offset between them (strictly speaking, they are also rotated relative to each other, since the cameras’ optical axes are not perfectly parallel). The FT SDK returns 3D results in the color camera 3D space and 2D points in the color camera’s uv space. To convert back to the depth camera space (where the Kinect skeleton exists) you need to apply an extrinsic transform (rotation, translation). It is not easy to get it from the Kinect APIs (no way, in fact), but see this post to get an approximation of it: https://nsmoly.wordpress.com/2012/08/03/how-to-convert-from-the-color-camera-space-to-the-depth-camera-space-in-kinect-for-windows/

The FTRegDepthToColor is a self-defined function that maps my own depth info to color info. In the C++ version, if you use the Kinect defaults, you leave the 3rd parameter of Initialize as NULL, so the face tracker uses the Kinect mapping function. Have you tried looking in this direction? I do not use the Kinect texture and depth mapping yet, so I am not sure of the answer to your question. But I hope this reply helps!

OK, now I see. So actually what you are trying to do is a reversed mapping of projected 2D coordinates back to the original 3D coordinates in world space. You need the world matrix, view matrix and projection matrix. Since the class also outputs the translation, rotation and scale parameters, the world matrix should be easy to build. The other two I am not sure about.

Again, a misunderstanding!
Depth information is separate and feature point information is separate.
Todd, do you have any means of communication so that I can discuss this with you more easily? I am on Skype. Please share some means of communication ASAP.

Hello, Cluster
The function you use, frame.GetProjected3DShape(), is what is used to draw the yellow lines of the 3D face model on the IFTImage object. If you have studied computer graphics before, it is straightforward to see that this 2D coordinate actually comes from a 3D coordinate after projection. In my opinion, it is not correct to directly use the 2D (x, y) as the 3D (x, y, z). What makes sense is to unproject the 2D coordinate back to 3D, but why?

If you go to the Microsoft Kinect Face Tracking SDK web page, you can see that the authors use a simple standard 3D face model adapted to the Kinect-captured information. In the current face tracking domain, the idea of adapting a generic 3D face model to 2D video or texture has been discussed a lot. So, if you really use frame.GetProjected3DShape(), you might need to consider unprojecting to get the coordinates right.

However, if you just need the 2D coordinate of a pixel and the depth value it represents, you need to know the mapping function between the two cameras. Just google “Kinect two camera calibration” for some useful information.

IFTResult has 2D points that are feature points on the face in color (u,v) space. It also has so-called “animation units”, the 3D head pose, and “shape units”, which you can use to reconstruct the 3D points on the face (121 of them) by calling the IFTModel interface. There are 2 methods – one to get the 3D points given the 3D pose, animation units, and shape units (from the IFTResult interface), and a second to get those 3D points projected onto the UV color image. The 2D points from IFTResult are a bit different and exist only in the 2D UV space.

Sorry for this duality. We will fix it in future releases. Also, the returned 3D points are in the color camera's 3D space, not in the Kinect depth camera's space (like the skeleton)! See my other post about how to convert from color camera space to depth camera space. This release is still a bit experimental, so the API is not fully baked (which is why it is part of the optional dev kit). Future releases will be 100% integrated with the skeleton API.

Unfortunately, the Kinect API does not have a function to map color UV coordinates to depth UV coordinates. You can create an approximation of it (inverting the provided depth-UV-to-color-UV function) if you do the following:
1) If the depth frame you use is 2 times smaller than the color frame, scale down your color UV by 2 (divide by 2) to get a first approximation of the depth UV coordinates.
2) Sample Z in the neighbourhood of the converted depth UV point (you may have a hole there, but you need to find at least 1 closest point with a valid Z). Then apply the forward transform from the Kinect API to jump back from depth UV to color UV.
3) Calculate the vector between your original color UV point and the newly computed color UV point. Scale it down by 2 and apply this “shift vector” to your depth point (the one from which you got the last color UV point) to get a new, “better” depth UV estimate.
4) Repeat steps 2 and 3 until convergence (until your point in color stops moving, or the vector becomes very small).

Unfortunately I have never used the Kinect SDK, and this is more of a proof of concept for an ASM algorithm I was working on. Can you supply a code snippet that takes an image.jpg path and returns a list of coordinates?
I have no idea how to simulate a “depth frame”🙂 BTW, is it possible to simulate a Kinect camera connection (maybe through a driver)?

I am trying to record the values of the Action Unit (AU) points to a file, but the AU values change continuously for a static face (tested with a dummy face, with no change in the position of the Kinect, the lighting, the dummy face, or the background). How can I stop that from happening?

What is the rotation order of the rotation result of IFTResult::Get3DPose? Is it XYZ, ZXY, etc.? There seems to be no official documentation on it. Also does the Face Tracking API filter its input, or would it benefit from the depth input being prefiltered by me to remove flickering and fill in holes?

I am printing the values of the Action Units, but they are nowhere close to the expected values. Is there any mistake on my part? Is there any way to train the Kinect SDK on a new face and use that model to get AUs (using the SDK tracking function)?

I am using the Face Tracking SDK to do some face tracking, and there are some tracking failures. I output the tracking status using the IFTResult::GetStatus method. The return values fall into several situations:

First situation: the return value is -1884553213, which means FT_ERROR_FACE_DETECTOR_FAILED; the SDK provides the meaning of this value.

Second situation: the return value is -1884553208, which means FT_ERROR_EVAL_FAILED; the SDK also provides the meaning.

Third: the return value is -1884553204; the SDK does not provide the meaning of this value.

Fourth: the return value is -1884553205; the SDK does not provide the meaning of this value either.

So, I would like to know the meanings of the latter 2 return values.

Also, could you give me any suggestions/ideas/information about all 4 kinds of tracking failure?

About the latter 2 values, I found their meanings: the 3rd means FT_ERROR_USER_LOST, and the 4th means FT_ERROR_HEAD_SEARCH_FAILED. So, about these 4 kinds of failure, could you give me any suggestions/ideas/information?
Thank you very much!

Dear sir,
As you are the main developer of the Kinect Face Tracking SDK: I have found that the SDK works fine with the Kinect as well as with a USB camera while the system is still connected to the Kinect camera. However, I would like to know whether it is possible to use the Kinect Face Tracking SDK with a USB camera without connecting a Kinect camera at all. If so, could you give us some clues to follow? If it is not possible, then please suggest an alternative route.

nsmoly,
You have said that “you can use an external RGB HD camera to improve face tracking at a distance, BUT you must have a Kinect camera attached at all times”. This implies that I need to find an alternative route for face tracking from a USB camera without connecting a Kinect camera. Any suggestions for alternative routes?

Hi,
In the color-to-depth conversion (see the blog post of 8/3/12), I was wondering how you get a 3D color point as the input to your function? Also, how do you pick a depth pixel as a “good enough approximation”?

1. From my understanding, it seems that you do face detection in color and fit a 3D model to the face.
2. Then you use skeletal tracking to determine the location of the head in 3D and get its XYZ.
3. You center the 3D face from #1 at the location of the tracked 3D head from #2.
4. You can now deduce 3D points for the RGB pixels in the face area?

You can do what you described, or you can convert color to depth coordinates like this:
1) Scale the color XY coordinates to the depth size to get a first approximation of the depth XY.
2) Sample the depth value at that depth XY.
3) Convert back to color by calling the Kinect API and get new color XY coordinates.
4) Find the error between the original color XY and the new color XY.
5) Scale the error vector to the depth size and find a new depth point and its XY by applying the scaled vector to the previous depth point.
6) Sample depth at the new depth point and convert it back to color…
7) Repeat until the color XY is close to the original color XY (the last depth point will be the corresponding depth point).

The angles should stay the same since R and T are computed separately, but in practice, the further your head is from the optical axis, the more error you will get. The angular errors should be within ±3 degrees (but this also depends on the lighting, how far you are from the camera, how much facial hair you have, etc.).

Hi!
I am currently working on a project to build an audio-visual speech recognizer. For the visual part I would like to use your face tracking algorithm (especially for the mouth region).
I have a database of .xed files recorded with a Kinect. Can you explain how I can extract the tracked face points from these files (to later be used in my MATLAB code)?

Hi, you can read RGBD frames from .xed files and then push them into the Face Tracking API to let it align faces (the Face Tracking SDK for Kinect for Windows 1.0, or the HD Face API for Kinect for Windows 2). The API will return a set of facial 3D points (some are enumerated by name in the API) which you can then use. XED files by themselves contain only raw data from the sensor and (I think) skeletal data as well. They don’t contain face points, since this API is of the “pay for play” type.