Thursday, August 28, 2014

Real-Time Face Pose Estimation

I just posted the next version of dlib, v18.10, and it includes a number of new minor features. The main addition in this release is an implementation of an excellent paper from this year's Computer Vision and Pattern Recognition Conference:

One Millisecond Face Alignment with an Ensemble of Regression Trees by Vahid Kazemi and Josephine Sullivan

As the name suggests, it allows you to perform face pose estimation very quickly. In particular, this means that if you give it an image of someone's face it will add this kind of annotation:

In fact, this is the output of dlib's new face landmarking example program on one of the images from the HELEN dataset. To get an even better idea of how well this pose estimator works, take a look at this video where it has been applied to each frame:

It doesn't just stop there though. You can use this technique to make your own custom pose estimation models. To see how, take a look at the example program for training these pose estimation models.

In the paper "One Millisecond Face Alignment ..." they output 194 landmark points on the face; however, the implementation provided in dlib only outputs 68 points. Is there a way to easily produce the 194 points using the code provided in dlib?

I only included the 68 point style model used by the iBUG 300-W dataset in this dlib release. However, if you want to train a 194 point model you can do so pretty easily by following the example here: http://dlib.net/train_shape_predictor_ex.cpp.html

You can get the training data from the HELEN dataset webpage http://www.ifp.illinois.edu/~vuongle2/helen/.
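
In outline, the training in that example boils down to something like the following sketch (the XML file name is a placeholder, the HELEN annotations have to be converted into dlib's imglab XML format first, and the trainer settings shown are just illustrative):

    #include <dlib/image_processing.h>
    #include <dlib/data_io.h>
    #include <vector>
    using namespace dlib;

    int main()
    {
        // Load a dataset annotated in dlib's imglab XML format (placeholder file name).
        dlib::array<array2d<unsigned char> > images;
        std::vector<std::vector<full_object_detection> > shapes;
        load_image_dataset(images, shapes, "helen_194_points.xml");

        // Configure the trainer; the number of landmarks is taken from the dataset itself.
        shape_predictor_trainer trainer;
        trainer.set_oversampling_amount(300);
        trainer.be_verbose();

        // Train and save the model.
        shape_predictor sp = trainer.train(images, shapes);
        serialize("shape_predictor_194_landmarks.dat") << sp;
        return 0;
    }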

Answering myself: if I leave JPEG_LIBRARY and JPEG_INCLUDE_DIR empty in the CMake GUI, then dlib is still compiled with JPEG support, despite CMake telling me: Could NOT find JPEG (missing: JPEG_LIBRARY JPEG_INCLUDE_DIR). Not sure what is going on, but it works...

CMake will try to find a version of libjpeg that is installed on your system and use that. If it can't find a system version of libjpeg it prints out that it didn't find it. I then have CMake set up to statically compile the copy in the dlib/external/libjpeg folder when a system install of libjpeg is not found. So that's why you get that message.

More importantly, I want to make sure dlib always compiles cleanly with CMake. So can you post the exact commands you typed that produce the C2371: 'INT32' : redefinition; different basic types error in jmorecfg.h?

I don't get this on any of the systems I have. The string INT32 doesn't even appear in any code in the dlib folder so I'm not sure how this happened.

That explains a lot... As for the commands, I use the CMake GUI, so I just throw the CMakeLists.txt in there and everything works fine, except for that message about JPEG (Could NOT find JPEG (missing: JPEG_LIBRARY JPEG_INCLUDE_DIR)).

If I try to fix it (now I understand that I don't need to) and fill in JPEG_INCLUDE_DIR and JPEG_LIBRARY in the CMake GUI, for example using the libjpeg that comes with OpenCV, then I get this C2371: 'INT32' error when compiling (with Visual Studio 2012).

In that case I generated it based on the landmark positions. However, I made sure the box was sized and positioned the same way the box the dlib detector would have output if it had detected the face (e.g. centered on the nose and at a certain scale relative to the whole face).

You have to give a reasonable bounding box, but you can get the box any way you like. However, when you use this thing in a real application you will pair it with some object detector that outputs bounding boxes for your objects prior to pose estimation. So it's a very good idea to use that same object detector to generate your bounding boxes for pose estimation training.
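
For example (just a sketch, not the exact pipeline used for the released model), you could run dlib's frontal face detector over each training image and record its box next to that image's landmark annotations:

    #include <dlib/image_processing/frontal_face_detector.h>
    #include <dlib/image_io.h>
    #include <vector>
    using namespace dlib;

    int main()
    {
        // Sketch: generate training boxes with the same detector you will use at runtime.
        frontal_face_detector detector = get_frontal_face_detector();
        array2d<unsigned char> img;
        load_image(img, "training_image.jpg");   // placeholder file name
        std::vector<rectangle> dets = detector(img);
        // If dets is non-empty, dets[0] is the box to record next to this image's landmarks.
        return 0;
    }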

The input is real-time video from the camera. I open the camera using OpenCV functions and set the frame size as follows: cap.set(CV_CAP_PROP_FRAME_WIDTH, 80); cap.set(CV_CAP_PROP_FRAME_HEIGHT, 45);

Then it should be very fast. You must be either timing the wrong thing or you haven't actually compiled with optimizations. Is the executable you are running output to a folder called Debug or Release? How are you timing it?

Take the output of get_face_chip_details() and give it to get_mapping_to_chip(). That will return an object that maps from the original image into the image chip, which you can use to map the landmarks into the chip.
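
A rough sketch, assuming shape is the full_object_detection returned by the shape predictor and the chip was extracted with the default get_face_chip_details() settings:

    #include <dlib/image_processing.h>
    #include <dlib/image_transforms.h>
    #include <vector>
    using namespace dlib;

    // Sketch: map detected landmarks into the coordinate system of an extracted face chip.
    std::vector<dlib::vector<double,2> > landmarks_in_chip(const full_object_detection& shape)
    {
        std::vector<full_object_detection> dets(1, shape);
        std::vector<chip_details> chips = get_face_chip_details(dets);  // same details used to extract the chip
        point_transform_affine tform = get_mapping_to_chip(chips[0]);
        std::vector<dlib::vector<double,2> > pts;
        for (unsigned long i = 0; i < shape.num_parts(); ++i)
            pts.push_back(tform(shape.part(i)));                        // landmark position inside the chip
        return pts;
    }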

I've been trying to apply this to a different shape, training 4 images (640x480) with 180 landmarks each and the default parameters of train_shape_predictor_ex. I turned on the SSE2 optimizations in the CMakeCache file and compiled in Release mode on Ubuntu.

It's been 3 hours and it keeps saying "Fitting trees..."

I don't know what's wrong. I already tried this shape before with fewer landmarks (68) and bigger images, and it seemed to work reasonably well. But now, even with the optimizations, it hangs or something.

I was wondering if you have any suggestions to overcome this problem. I'm thinking about reducing the oversampling amount from 300 to 100 to see how it goes...

First I ran the example unmodified, and it worked. Then I changed the code a little bit (basically I just removed the interocular distance part because I don't need it for this object), loaded four images of random sizes with 68 landmarks, and it worked. Now I feed it four 640x480 images with 180 landmarks and it gets stuck...

Try running it in CMake's Debug mode for a few minutes. That will turn on a lot of checks that may tell you what you did wrong. E.g. maybe your objects don't all have the same number of points in them.

Can you please share the configuration you used for training the 68 landmark model? I see that the number of cascades as well as the trees and their depths are different from the default settings. It would be great to know your experience in choosing these settings, and also the amount of padding and how it affects the results. Thank you!

The model file is about 100MB. Dlib comes with an example program that runs this algorithm, so you can run that program to see exactly what kind of computational resources it consumes.

As for training parameters, I believe I used the defaults except that I changed the cascade depth to 15. If you want insight into how the parameters affect training then I would suggest playing with the example program and reading the original paper, as it contains a detailed analysis of their effects.
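
In code, the configuration just described would look something like this (a sketch; the helper function name is made up, and everything not shown stays at its default):

    #include <dlib/image_processing.h>
    using namespace dlib;

    // Sketch of the setup described above: default trainer settings except for the cascade depth.
    shape_predictor_trainer make_deeper_cascade_trainer()
    {
        shape_predictor_trainer trainer;
        trainer.set_cascade_depth(15);   // the one non-default parameter mentioned above
        return trainer;
    }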

Yes, Davis, I read the documentation for image_window, but I just cannot find a function to display a string/text on the image window... Could you do me a favor? Badly needed. Thanks so much!

Hi Davis. Thanks for your fantastic work and continued support. As with Chris Collins' comment, I'm looking to calculate the yaw/pitch/roll based on the landmarks. Do you have any advice on how to go about this given that, as you say, dlib doesn't handle this? Any help you could give would be much appreciated.

It's really great work. I have one question: where can I find the corresponding positions for the 68 landmarks? I may want to select a few key landmark points based on where they lie on the face. Thank you.

Which face detector did you use to train your model (i.e. to get the initial bounding box for the faces)? Do you use the bounding boxes provided by the 300-W dataset, or do you run the dlib/OpenCV/... face detector on the images?

Thank you for your response. So, by training with a higher nu and decreasing the tree depth, the predictor file will be smaller? What values do you suggest for a good trade-off between detection performance and file size? Which parameters did you use to train the shape predictor file you're making available?

I've trained the predictor on the HELEN data set and am not getting good results.

I obtained both the training landmarks and the bounding box initializations from the i-bug site (http://ibug.doc.ic.ac.uk/resources/300-W/), and it describes them in the format Xmin, Ymin, Xmax, Ymax. However, in the sample training XML file, these are defined as top, left, width and height. I assume this means 'distance from top, distance from left, width of box, height of box' - as such my bounding boxes are defined as: Ymin, Xmin, (Xmax-Xmin), (Ymax-Ymin).

To test this, I'm using one of the training images as input (in order to minimize inaccuracies which may occur as a result of a small training set), and the resulting landmarks are largely misaligned.

So my question is: are my assumptions regarding the bounding box correct? Or should I not have manipulated the BB initialization data?

Does it make any difference if, instead of using the imglab tool to annotate the parts of an object, I manually add the pixel coordinates by directly editing an XML file? Because I feed this into the shape predictor trainer, and when I run the landmark detector the rendering is awful.

I opened my XML file in imglab and the annotations are correct. There were a couple of little mistakes that I fixed, and I confirmed they were OK in imglab again, but still the rendering keeps drawing crossing lines that it should not. I verified with zoom in imglab that all the landmarks are in the right sequential order, so I don't understand what is going on. That's why I was thinking that maybe the shape predictor trainer is not reading my XML file as it should, because it is not a direct output of annotating with imglab; I just modified the training_with_face_landmarks.xml file that was in the "examples" folder.

Of course, render_face_detections.h was modified as well for the new number of landmarks, keeping only this loop:

for (unsigned long i = 1; i <= 179; ++i)
    lines.push_back(image_window::overlay_line(d.part(i), d.part(i-1), color));

Also, I would like to make the rendering lines thicker. I've been looking through several files (widgets_abstract, metadata_editor, base_widgets_abstract, drawable_abstract, canvas_drawing_abstract), but I can't seem to find the line where I can change that parameter. Any idea?

I'm currently applying this to a video, just like the video demo you have posted. I'd love to use the previously predicted shape to initialize the shape_predictor rather than the rect, so that we don't have to detect faces or initialize with the standard shape for every video frame. Is there a way to do this, or do I have to modify the code myself?

However, I'm not seeing anywhere close to one millisecond performance. I'm compiling the example program with g++ on Linux with a Core i5 processor. If I run it on some smaller images, it takes about 5-10 seconds. If I run it on the larger HELEN images, some of them take over 1 minute. I thought maybe I'm misunderstanding which part only takes a millisecond, or what kind of hardware the test was done on.

But from this blog post above: "As the name suggests, it allows you to perform face pose estimation very quickly. In particular, this means that if you give it an image of someone's face it will add this kind of annotation:"

And from the paper:

"In practice with a single CPU our algorithm takes about an hour to train on the HELEN dataset and at runtime it only takes about one millisecond per image."

No, that's not part of the current implementation. There are a variety of ways you could estimate this though; the easiest is probably to train some kind of HOG based classifier that looks at each landmark and classifies it as occluded or not occluded.

I'm about to apply this technology to my project, which needs to detect 4 features of a face (eyes, nose, mouth). What I'm trying to do is to re-train a 4-feature detector to reduce memory footprint. I understand the quality of the annotation is key to performance. Can I have your annotation file so that I can have a solid foundation to start with?

I want the algorithm to also extract features from outside the shape, especially in the y-direction, above and below the shape. Is this possible directly, or do I need some code adjustments? Can I do this using the set_feature_pool_region_padding function?

I'm trying to extract facial landmarks from an image on iOS. For that I followed the face_landmark_detection_ex.cpp example, and I used the default shape_predictor_68_face_landmarks.dat. It is recognising the face in the image successfully, but the facial landmark points which I'm getting are not correct and always form a straight diagonal line, no matter which facial image I use. I also tried using cv_image instead of array2d but had no luck. Can you point me towards what I need to do in order to get the facial landmarks of a frontal face image?

Thanks for the quick reply. The only thing I modified in the example code is that instead of the dlib GUI component to display the image, I'm displaying the image in an iOS UIImageView, and I'm storing all the 68 shape positions by creating CGPoints from them and displaying a UIView at those points. Here's the result I get: http://imgur.com/gallery/QgRbXm9/new. The image has dimensions of 201 x 250 pixels. I tried several images of various dimensions and sizes, but the output is always the same. It successfully detects whether the image contains a face or not.

Yes, it matters a lot. You need to use the same sort of box that you use when you run the learned algorithm. Presumably you will get your boxes from a face detector so you should use whatever boxes that thing produces. Also, the training xml files I used can be downloaded here: http://dlib.net/files/data/

Hi Davis, I have 16 GB of memory on my system and training on the full dataset at http://dlib.net/files/data is overflowing memory, and that's using only the default cascade_depth of 10. Can you mention the amount of memory on the system you used for training? Can the trees be trained incrementally, or is there another way to reduce the memory footprint?

Sir, is it possible to make detection faster in Python? Currently I am getting a frame rate below 10 on video. I just require the eye corners and nose tip; is there a way to detect only those points (do I have to perform my own training)? Will that speed it up?

Make sure you compiled dlib with AVX instructions enabled (see http://dlib.net/faq.html#Whyisdlibslow). That makes it faster. Other than that the only thing you could do is try to make your own version that is faster. The training data I used is available here: http://dlib.net/files/data/

Has anyone ported this to mobile? If so, what is the frame rate? Is there any slimmed-down version of the library that can be used for mobile integration? JITEN devlani, it looks like you did the iOS integration; can you mail me at smear1@gmail.com please? Or anyone interested in doing contractor work based on this?

In my experience it runs in milliseconds on mobile. If you want to reduce the size I would recommend storing all values as float16 and reducing the number of landmarks. Depending on your actual application you might not need all 68 landmarks. You can also trade some accuracy for size by setting the maximum tree depth to 4.

What is the total size involved in porting the feature point code to mobile? Also, any comments on how robust the algorithm is to pose changes? If I use the face detection box only for the first frame and after that seed the algorithm with the location from the previous frame, the algorithm won't be reliant on the face detector, just on how good the feature point detection is, and I'm assuming at some pose it becomes unstable. Has anyone done the implementation like this? Does the algorithm give a flag if it fails to find good points?

I guess it's best if you look into the code yourself. The main algorithm consists of just 2-3 header files... There are dependencies on the serialization and matrix multiply code though. If you want to port/rewrite the code you might want to use Armadillo + OpenBLAS or Eigen for the matrix stuff, both of which accelerate BLAS operations with NEON instructions.

@Confidence estimate: Nope, you either go for something like a joint cascade or run a detector in some background thread which tries to re-find the face.

Lots of people use dlib on mobile platforms so you shouldn't need to port the code. Also, dlib's linear algebra library will use OpenBLAS or any other BLAS just like Armadillo, so there isn't any point in switching.

Thanks a lot for your help last time around. I made my own XML file for the dataset you provided at http://dlib.net/files/data/

However, I think I keep running out of memory and my PC crashes. Could you please generate the .dat file using my XML? Link to my XML: https://drive.google.com/file/d/0B5dMexTHKn6PT1RyeHdWdy1UMHM/view?usp=sharing

Hi Davis, can you please provide a link to the iBUG dataset that contains all the images you trained on? From the current link available on your blog, http://ibug.doc.ic.ac.uk/resources/300-W/, I could only get about a hundred-odd images. Is the iBUG data you trained on, which had a few thousand images, a collection of other databases (LFPW, HELEN, XM2VTS)? Thanks!

Yes, in the provided example program you can see where it calls the shape predictor with the bounding box. You can change the example program to pass in some other bounding box generated however you like.
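
A minimal sketch of that change (the box argument is assumed to come from your own detector or annotations rather than dlib's face detector):

    #include <dlib/image_processing.h>
    #include <dlib/array2d.h>
    #include <dlib/pixel.h>
    using namespace dlib;

    // Sketch: run the shape predictor with a bounding box obtained from any source.
    full_object_detection predict_with_custom_box(const shape_predictor& sp,
                                                  const array2d<rgb_pixel>& img,
                                                  const rectangle& box)
    {
        // box just has to be the same style of box the predictor was trained with.
        return sp(img, box);
    }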

When I tried to compile face_landmark_detection_ex.cpp with the code I wrote, it produced errors. Error messages:

error: missing template arguments before 'v'
error: expected ';' before 'v'
error: 'v' was not declared in this scope

I think there is something wrong with the declaration of v. Is there anything wrong with it?

First of all, congratulations on your fantastic work! I have a question related to landmark detection using hog_object_detector. Is there a way, while or after evaluating each one of the regression_tree elements in the forests, of getting some kind of confidence factor? I'm currently trying to find some kind of quality metric for the detected landmarks and your help would be highly appreciated.

Thank you for your prompt response! What I meant is if there is a way of getting a metric from the actual evaluation result, like some kind of quality measure. I understand what you mean by training with a different confidence value, but that will be a one shot operation. During runtime I won't be able to infer the actual quality of the fitting/matching result, right?

Dear Davis, congrats on your great work. I have several questions about tracking from a webcam.

Qst 1) I'm running the webcam tracking example on a MacBook Pro with 4 cores and 16 GB of RAM. When I run the release build, performance is not so good, even with SSE/AVX enabled.

Qst 2) This may be related to Qst 1. I've changed the example source code to run face detection only once to get the face position. After that I use shapes[0].get_rect() to get the new bounding box. It seems to be OK, but if I make a sweeping motion with my head, even a slow one, the tracker fails. Is it a performance problem?

Qst 3) Compared to the performance I'm seeing, the video showing the actor speaking seems to be post-produced. Is it real time?

Qst 4) My last question is about tolerance to head yaw and pitch. It seems that the algorithm fails for very small yaw and pitch angles. Does it depend on the training? Can it be improved?

Hi Davis King, I'm running webcam_face_landmark_detection_ex.cpp using Visual Studio 2012 and I encounter some problems as follows:

\dlib-18.10\dlib\opencv/cv_image.h(126): see reference to class template instantiation 'dlib::cv_image' being compiled
\dlib-18.10\dlib\opencv/cv_image.h(24): error C2653: 'cv' : is not a class or namespace name
\dlib-18.10\dlib\opencv/to_open_cv.h(18): error C2653: 'cv' : is not a class or namespace name
\dlib-18.10\dlib\opencv/to_open_cv.h(19): error C2065: 'image_type' : undeclared identifier

Any suggestion?

Open one of the training xml files accompanying the example programs using dlib's imglab tool. It shows the labels of the points on the screen. Or you could just plot the output on the screen and see where each point falls on a face.
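
For example, one quick way to do the latter is to print each index next to its detected position and compare against the rendered face (a sketch; shape is the shape_predictor output):

    #include <dlib/image_processing.h>
    #include <iostream>
    using namespace dlib;

    // Sketch: print every landmark index with its position so you can see which point is which.
    void print_landmarks(const full_object_detection& shape)
    {
        for (unsigned long i = 0; i < shape.num_parts(); ++i)
            std::cout << "landmark " << i << ": " << shape.part(i) << std::endl;
    }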

Hi Davis King, I want to cross compile with CMake. The compiler is arm-xilinx-linux-gnueabi. I use the cmake-gui command in Linux and create a new file, toolchain.cmake. The file contents are as follows:

set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR arm)
set(CMAKE_C_COMPILER arm-xilinx-linux-gnueabi-gcc)
set(CMAKE_CXX_COMPILER arm-xilinx-linux-gnueabi-g++)

But it does not work. Do I need to configure other parameters? I still have one question: I want to generate a shared library (.so, NOT .a). How do I set the CMake parameter? Could you give me an example or some information?

You could retrain it using the same dataset but exclude most of the landmarks. That would lower the size. This is the dataset the reference model was trained from: http://dlib.net/files/data/ibug_300W_large_face_landmark_dataset.tar.gz

Are you running the example program that comes with dlib (http://dlib.net/face_detection_ex.cpp.html), on the provided images? If so then it should run faster than that and you must not have compiled it with optimizations enabled.

Davis, thank you so much for your work on this incredible library and the help you've provided in these comments.

I have a few questions that I'm pretty sure haven't been addressed yet. I'd really appreciate some input, if anyone has the chance.

1) Training: Davis, I see in your training set XML file that for the training images in which the face detector did not detect a face, you guessed at the face bounding box. Did you just pick a box that fit the landmarks? I'm guessing you probably did something a little more clever than that. Could you explain your guessing process?

2) Training again: I see that you've mirrored all of the images to build a bigger training set. This seems pretty clever. Could similar gains be achieved by adding random noise into the images? Also, to save memory, can't I modify extract_feature_pixel_values() to simulate the mirrored image, and then build in calls to the training procedure to extract_feature_pixel_values_mirrored() with the mirrored shape? Or do you see an inherent problem in that?

3) Shape prediction: In video, is there a way to use the preceding frame to aid the prediction for the current frame? I'm guessing the answer here is no. If I understand the algorithm correctly, the decision tree needs to be traversed from the beginning for each frame, meaning that each frame must be visited as if it's brand new.

For the images where the detector didn't find a face, I tried to use the bounding box that it would have output if it had worked. So I trained a simple regression model to map from the landmarks to the output box.

Adding random noise could be useful. How useful will depend on your application and the kind of noise. You could certainly flip images on the fly. However, you have to map the landmarks to a flipped image in some sensible way rather than just mirroring them since a simple mirroring would do things like conflate the left ear with the right ear. Different applications will demand different ways of performing this mapping so it's best to let the user do this themselves rather than have dlib try to guess it automatically.
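
As a sketch of what such a user-supplied mapping might look like (the flip_map table here is hypothetical and has to be filled in for your particular landmark scheme):

    #include <dlib/image_processing.h>
    #include <dlib/image_transforms.h>
    #include <dlib/array2d.h>
    #include <vector>
    using namespace dlib;

    // Sketch: mirror an image and remap its landmarks, swapping left/right point indices
    // according to a user-supplied table (flip_map[i] gives the index point i becomes after flipping).
    void flip_example(const array2d<unsigned char>& img,
                      const std::vector<point>& parts,
                      const std::vector<unsigned long>& flip_map,
                      array2d<unsigned char>& flipped_img,
                      std::vector<point>& flipped_parts)
    {
        flip_image_left_right(img, flipped_img);
        flipped_parts.resize(parts.size());
        for (unsigned long i = 0; i < parts.size(); ++i)
        {
            point p(img.nc() - 1 - parts[i].x(), parts[i].y());   // mirror the x coordinate
            flipped_parts[flip_map[i]] = p;                       // and swap left/right indices
        }
    }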

The algorithm is inherently based on one frame. I can imagine multiple ways to design a new but similar algorithm that assumed sequential frames should have similar outputs. However, such a thing is not in dlib.

Hi Davis, concerning your last post, how would you go about video shape prediction? Learning a regressor that takes the landmarks from the previous frame and learning the possible displacements? That would require sequentially labeled data, which would be a pain to produce... or were you thinking of another way to do this without further data annotation? Thanks

I used CMake to generate a Makefile for the dlib examples, but while I was compiling I got the errors below:

/Users/ymi8/Downloads/dlib-18.10/dlib/../dlib/gui_core/gui_core_kernel_2.h:11:2: error: "DLIB_NO_GUI_SUPPORT is defined so you can't use the GUI code. Turn DLIB_NO_GUI_SUPPORT off if you want to use it."
/Users/ymi8/Downloads/dlib-18.10/dlib/../dlib/gui_core/gui_core_kernel_2.h:12:2: error: "Also make sure you have libx11-dev installed on your system"
2 errors generated.

I am trying to detect facial landmarks from a UIImage using dlib C++, but I am unable to compile dlib in Xcode for iOS. Can anyone help me or guide me through the steps required for installing dlib C++ in Xcode for iOS?

I tested your landmark detection on my 64-bit desktop and got an average speed of at least 5 milliseconds per face for 68 landmarks, with AVX enabled. But according to the paper, it's just one millisecond per image for 194 landmarks.