Free-D

3D Head Tracking in Video: How a simple model can do big things.

Written By: urbeller, May 14, 2013

In this post, I will describe my own implementation of a head tracker. 3D Head Tracking (HT) consists of inferring the 3D orientation and displacement of the head, often from a (single) video source. Here, the video source will be a Logitech C910 webcam. Of course, any webcam will do. Video grabbing and image processing will be done using the OpenCV library.

The outline of the algorithm is as follows:

1. Grab a frame and detect 2D features.
2. Initialize the head pose.
3. Compute 3D features → FTold.
4. Grab a frame and detect 2D features.
5. Compute 3D features → FTnew.
6. Compute the motion that registers FTnew → FTold.
7. Update the head pose.
8. Set FTold = FTnew and go to step 4.
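The control flow of this outline can be sketched as a small Python loop. This is only a skeleton under stated assumptions: the four callables (`detect_2d`, `lift_to_3d`, `register`, `update_pose`) are hypothetical placeholders for the steps above, not functions from OpenCV or from the tracker's actual code.

```python
def track_head(frames, detect_2d, lift_to_3d, register, update_pose, initial_pose):
    """Run the outline above over an iterable of frames; return the pose history.

    All four callables are hypothetical stand-ins for the real steps:
      detect_2d(frame)           -> 2D features
      lift_to_3d(features, pose) -> 3D features
      register(ft_new, ft_old)   -> incremental motion
      update_pose(pose, motion)  -> new pose
    """
    frames = iter(frames)
    pose = initial_pose                                   # step 2
    ft_old = lift_to_3d(detect_2d(next(frames)), pose)    # steps 1 and 3
    poses = [pose]
    for frame in frames:                                  # step 4
        ft_new = lift_to_3d(detect_2d(frame), pose)       # step 5
        motion = register(ft_new, ft_old)                 # step 6
        pose = update_pose(pose, motion)                  # step 7
        ft_old = ft_new                                   # step 8, then loop
        poses.append(pose)
    return poses
```

Passing the steps in as functions keeps the loop independent of how features are detected or registered, so each step can be swapped without touching the loop.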

At first glance, the toughest step in this outline seems to be the 2D→3D feature conversion. It turns out this is among the easiest tasks thanks to a simple idea: a cylindrical head model. In a nutshell, each 2D feature is unprojected from the camera reference frame as a ray, and that ray is intersected with a virtual cylinder. This intersection provides the sought 3D position of the image feature. But first things first…
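As an illustration of the unprojection step, here is a minimal sketch that turns a pixel into a viewing ray. It assumes an ideal pinhole camera without lens distortion; the intrinsics `fx`, `fy`, `cx`, `cy` (focal lengths and principal point, in pixels) are assumed to come from a calibration and are not values given in this post.

```python
import math


def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Return the unit direction of the camera ray through pixel (u, v).

    Assumes a pinhole camera at the origin looking down +Z, no distortion.
    """
    x = (u - cx) / fx          # normalized image coordinates
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)
```

A pixel at the principal point maps to the optical axis (0, 0, 1); pixels further from the center produce rays that lean further away from it.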

Grabbing an image is easy using OpenCV. The boilerplate code for that is a simple capture loop.
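A minimal sketch of such a capture loop, written in Python rather than taken from the tracker's code. The generator relies only on the `read()` contract of OpenCV's `VideoCapture` (it returns a success flag and a frame); the helper names `grab_frames` and `open_webcam` are made up for this example, and camera index 0 is an assumption.

```python
def grab_frames(capture, max_frames=None):
    """Yield frames from any object with a VideoCapture-style read() method."""
    count = 0
    while max_frames is None or count < max_frames:
        ok, frame = capture.read()   # (success flag, image)
        if not ok:                   # camera unplugged or stream ended
            break
        yield frame
        count += 1


def open_webcam(index=0):
    """Open a webcam with OpenCV (imported lazily; index 0 is an assumption)."""
    import cv2
    return cv2.VideoCapture(index)
```

With a real camera this would be used as `for frame in grab_frames(open_webcam()): ...`. Keeping the loop separate from the device makes it easy to feed the tracker from a recorded video instead.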

In each input frame, 2D features are detected. Among the myriad of feature types, KLT features are probably the best suited to our real-time needs. Indeed, they are easy and fast to compute because there is no descriptor computation and no scale-space analysis involved (at least not as in SIFT). Using OpenCV, KLT features are retrieved with its good-features-to-track corner detector, followed by a sub-pixel refinement (cornerSubPix).
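To show why these features are cheap, here is a pure-Python sketch of the Shi-Tomasi "good features to track" score: the smaller eigenvalue of the 2x2 gradient matrix summed over a small window. It is an illustration only, not the OpenCV call used in the tracker; the window size and the list-of-rows image format are assumptions, and the caller must keep (x, y) at least `half + 1` pixels away from the image border.

```python
import math


def shi_tomasi_score(img, x, y, half=1):
    """Smaller eigenvalue of the structure tensor in a window around (x, y).

    img is a list of rows of grayscale values; gradients use central differences.
    """
    sxx = sxy = syy = 0.0
    for v in range(y - half, y + half + 1):
        for u in range(x - half, x + half + 1):
            ix = (img[v][u + 1] - img[v][u - 1]) / 2.0   # horizontal gradient
            iy = (img[v + 1][u] - img[v - 1][u]) / 2.0   # vertical gradient
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    # Closed-form eigenvalues of [[sxx, sxy], [sxy, syy]]; keep the smaller one.
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy * sxy)
    return half_trace - root
```

The score is high only where the image varies in two directions (a corner); along a straight edge or in a flat region it drops to zero, which is why such points are rejected as unreliable to track.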

Now that features are detected, they are unprojected and intersected with the virtual cylinder. The exact solution to this ray-cylinder intersection can easily be found on the net. Now that we have the 3D positions of the features at time t-1, the same features are tracked in the upcoming frame using OpenCV's optical flow routine.
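The ray-cylinder intersection mentioned above reduces to a quadratic equation. The sketch below makes simplifying assumptions that are not from the original code: the camera sits at the origin, the cylinder axis is parallel to the camera's Y axis (the fronto-parallel starting pose), and the end caps are ignored. For a rotated cylinder, the ray would first be transformed into the cylinder's own frame.

```python
import math


def intersect_ray_cylinder(direction, center, radius, height):
    """Nearest hit of a ray from the origin with a Y-aligned cylinder, or None."""
    dx, dy, dz = direction
    cx, cy, cz = center
    # Substitute the ray p = t * d into (px - cx)^2 + (pz - cz)^2 = radius^2.
    a = dx * dx + dz * dz
    if a == 0.0:
        return None                      # ray runs parallel to the axis
    b = -2.0 * (dx * cx + dz * cz)
    c = cx * cx + cz * cz - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None                      # ray misses the cylinder
    root = math.sqrt(disc)
    t_near = (-b - root) / (2.0 * a)
    t_far = (-b + root) / (2.0 * a)
    t = t_near if t_near > 0.0 else t_far
    if t <= 0.0:
        return None                      # cylinder is behind the camera
    point = (t * dx, t * dy, t * dz)
    if abs(point[1] - cy) > height / 2.0:
        return None                      # above or below the head
    return point
```

For example, a ray along the optical axis hitting a cylinder of radius 1 centered 5 units away returns the point 4 units in front of the camera, on the visible side of the head.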

The result of this tracking is a set of features at time t. To get the change in head pose, we register the 3D features at time t-1 with the 2D features at time t. This is performed using a PnP algorithm. Because the virtual cylinder represents the head (a rough estimate!), it must be updated with the incremental pose just computed. In a sense, the cylinder is a state object of the tracked head.
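Updating the cylinder amounts to composing two rigid transforms. The sketch below assumes a convention that the post does not spell out: the cylinder pose (R, t) maps cylinder coordinates to camera coordinates, and the incremental pose maps camera-frame points at time t-1 to camera-frame points at time t. In OpenCV, solvePnP returns a rotation vector, so it would first be converted to a 3x3 matrix with Rodrigues.

```python
def compose_pose(increment, state):
    """Apply an incremental rigid motion to the current cylinder pose.

    Both arguments are (R, t) pairs: R a 3x3 nested list, t a length-3 list.
    Returns (R_inc * R_old, R_inc * t_old + t_inc).
    """
    r_inc, t_inc = increment
    r_old, t_old = state
    r_new = [
        [sum(r_inc[i][k] * r_old[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]
    t_new = [
        sum(r_inc[i][k] * t_old[k] for k in range(3)) + t_inc[i]
        for i in range(3)
    ]
    return r_new, t_new
```

Because the state is carried from frame to frame, the cylinder always reflects the accumulated head motion since initialization.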

The head pose algorithm runs comfortably on a 2.4 GHz laptop using a Logitech C910 webcam, as the following video depicts:

Nice demonstration, Jamil. Have you tried other tracking methods? With LK tracking, points drift over time and it can't handle occlusion, as you are no doubt aware. I would be interested in trying other tracking techniques based on your code if you are willing to share.

Hey there ! thank you for stopping by 🙂
Actually, my strategy was to redetect points “on-demand”. When a face
turns left for example, I redetect points on the opposite side (the one
that is not occluded). The beauty of the cylindrical model is the
fact that it holds the “state” of the head at any time. A drift is
less likely to happen in this case.

Jamil, Thanks. Not sure how many times I watched your video on youtube :). Have some doubts.

1. I use Posit instead of solvePnp, as in the ehci project, which uses a sinusoidal head model. If the head is not placed centrally in the video, will I need to translate the image points to the origin? How do I do that? Ehci subtracts 160 (x) / 120 (y) from every image point for a 320*240 resolution. I am getting a wrong rotation whenever the head is not placed centrally in the video. Kindly suggest.

2. Your code shows 100 corner points to detect, but your youtube video didn’t contain as many points. Surprisingly, there are no tracking points on the mouth corners. However many times I run cvGoodFeaturesToTrack, with whatever parameters, I get features at the mouth corners. You don’t have them!!

3. Is cornerSubPix providing any improvement for face ?

4. Can you elaborate a little more on detecting points on non occluded side of face 🙂 ? I didn’t see that trick in the youtube video.

Thank you for your interest in my little project. Before going any further, I am working on
a second version that will include 2 major additions.

1/ I am not familiar with ehci, though I saw a video demo of it. In my case, I assume
that the initial face position is fronto-parallel (basically rotation is identity).
Also, for efficiency, the face must be in a region of interest (it could be a rectangle).
Then, the result of tracking will determine the rotation and translation of the face at the
same time. I noticed that PnP gave better results. Posit assumes an orthographic or affine
projection…I think !

2/ The 100 points in my code is the maximum number of feature points. In practice, fewer than that are
found. Of course, I am only interested in reliable features. I do get some features at mouth
corners when I pretend speaking 🙂
3/ Haven’t tested the improvement. Since computation time didn’t suffer from cornerSubPix, I kept
it.
4/ This is a work in progress. Once the points are detected and tracked, their normals can be estimated
(they lie on a cylinder). I use the normal direction to weigh the feature’s contribution. I haven’t
talked about it in my blog because it’s not finished yet. Stay tuned !!!
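
One way such a normal-based weight could look, as a rough sketch and not necessarily the finished method: the weight is the cosine of the angle between the cylinder's outward normal at the feature and the direction back toward the camera. It assumes the camera is at the origin and the cylinder axis is parallel to the Y axis; the function name is made up for this example.

```python
import math


def visibility_weight(point, center):
    """Weight in [0, 1] for a 3D feature lying on a Y-aligned cylinder.

    1 when the surface faces the camera head-on, 0 when it faces away.
    """
    # Outward normal of the cylinder: radial direction in the XZ plane.
    nx = point[0] - center[0]
    nz = point[2] - center[2]
    n_len = math.hypot(nx, nz)
    p_len = math.sqrt(sum(c * c for c in point))
    if n_len == 0.0 or p_len == 0.0:
        return 0.0
    # Cosine between the normal and the direction from the point to the camera.
    cos_angle = -(nx * point[0] + nz * point[2]) / (n_len * p_len)
    return max(0.0, cos_angle)
```

Features near the silhouette of the head get a weight close to zero, so they contribute little to the pose estimate, while features facing the camera count fully.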