Saturday, September 23, 2017

Fast Multiclass Object Detection in Dlib 19.7

The new version of dlib is out and the biggest new feature is the ability to train multiclass object detectors with dlib's convolutional neural network tooling. The previous version only allowed you to train single-class detectors, but this release adds the option to create single CNN models that output multiple labels. As an example, I created a small 894 image dataset where I annotated the fronts and rears of cars and used it to train a 2-class detector. You can see the resulting detector running in this video:

If you want to run the car detector from this video on your own images you can check out this example program.

I've also improved the detector speed in dlib 19.7 by pushing more of the processing to the GPU. This makes the detector 2.5x faster. For example, on the 928x478 image used in this example program, the detector ran at 39fps in the previous version of dlib, but now runs at 98fps (when run on an NVIDIA 1080ti).

This release also includes a new 5-point face landmarking model that finds the corners of the eyes and the bottom of the nose:

This model is over 10x smaller than the 68-point landmarking model included with dlib: 8.8MB compared to the 68-point model's 96MB. It also runs faster, and even more importantly, works with the state-of-the-art CNN face detector in dlib as well as the older HOG face detector in dlib. The central use-case of the 5-point model is to perform 2D face alignment for applications like face recognition. In any of the dlib code that does face alignment, the new 5-point model is a drop-in replacement for the 68-point model and in fact is the new recommended model to use with dlib's face recognition tooling.
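To make "2D face alignment" concrete: given a few landmark positions and the positions you would like them to have in the aligned image, alignment amounts to fitting a rotation, uniform scale, and translation between the two point sets. The sketch below fits such a similarity transform by least squares using complex arithmetic. It is a standalone illustration under those assumptions, not dlib's implementation, and the names `point2d`, `similarity_2d`, and `fit_similarity` are hypothetical.

```cpp
#include <complex>
#include <cstddef>
#include <stdexcept>
#include <vector>

// A 2D point stored as a complex number: x is the real part, y the imaginary part.
using point2d = std::complex<double>;

// Similarity transform z -> a*z + b. The complex factor a encodes rotation
// and uniform scale; b is the translation.
struct similarity_2d
{
    point2d a;
    point2d b;
    point2d operator()(const point2d& z) const { return a * z + b; }
};

// Least-squares fit of a similarity transform that maps src[i] onto dst[i].
similarity_2d fit_similarity(const std::vector<point2d>& src,
                             const std::vector<point2d>& dst)
{
    if (src.size() != dst.size() || src.size() < 2)
        throw std::invalid_argument("need two or more matching point pairs");

    // Centroids of both point sets.
    const double n = static_cast<double>(src.size());
    point2d src_mean(0.0, 0.0), dst_mean(0.0, 0.0);
    for (std::size_t i = 0; i < src.size(); ++i)
    {
        src_mean += src[i];
        dst_mean += dst[i];
    }
    src_mean /= n;
    dst_mean /= n;

    // Closed-form solution of the complex linear regression dst = a*src + b.
    point2d num(0.0, 0.0);
    double den = 0.0;
    for (std::size_t i = 0; i < src.size(); ++i)
    {
        const point2d zc = src[i] - src_mean;
        const point2d wc = dst[i] - dst_mean;
        num += wc * std::conj(zc);
        den += std::norm(zc);  // squared magnitude of the centered source point
    }
    if (den == 0.0)
        throw std::invalid_argument("source points must not all coincide");

    similarity_2d t;
    t.a = num / den;
    t.b = dst_mean - t.a * src_mean;
    return t;
}
```

For the 5-point model, `src` would hold the five detected landmarks and `dst` the positions those landmarks should occupy in the aligned face crop.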

71 comments:

Great new stuff. You say that the "new 5-point model is a drop-in replacement for the 68-point model and in fact is the new recommended model to use with dlib's face recognition tooling." However, two questions:

- Is it recommended because the results are better or just because it's faster/lightweight?

- I know you say that it is a drop-in replacement, but does that mean that a face aligned with the 68-point model can be compared directly (distance between descriptors) to a face aligned with the 5-point model without fear of any issues?

The results should in general be the same, but it's faster and smaller. The alignment should actually be slightly more accurate in general, but not by a lot. The real benefit is speed, size, and ability to use it with the CNN face detector in addition to the HOG detector.

Yes, you can just replace the old shape model with the new model in any face recognition code that used the old one and it will work. I specifically made this new model to be a replacement for the old one. It will create the same kind of alignment as the old model and work with the previously trained face recognition model.
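Comparing faces by "distance between descriptors" means taking the Euclidean distance between two descriptor vectors and thresholding it. A minimal sketch follows. The 0.6 default threshold is the value suggested in dlib's face recognition example rather than something stated in this post, and the function names are made up for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Euclidean distance between two face descriptors of equal length.
double descriptor_distance(const std::vector<float>& a, const std::vector<float>& b)
{
    if (a.size() != b.size())
        throw std::invalid_argument("descriptor lengths differ");

    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Treat two descriptors as the same person when they are closer than the
// threshold. The default of 0.6 is an assumption taken from dlib's face
// recognition example, not from this post.
bool same_person(const std::vector<float>& a, const std::vector<float>& b,
                 double threshold = 0.6)
{
    return descriptor_distance(a, b) < threshold;
}
```

Because both landmarking models are meant to produce the same kind of alignment, the same comparison applies regardless of which model aligned each face.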

I was trying to compile the new release of dlib and I ran into some problems that I want to share with you.

Compiling on Windows

I used "dnn_face_recognition_ex.cpp" as test code. I had no problem compiling it using dlib-19.3 and dlib-19.4 in Visual Studio 2015 with cuda 8, but with dlib-19.7 I had the following errors:

I tried using cudnn5 and 7 (no difference), and I also tried using the CMakeLists.txt from the dlib folder of an older version that had worked correctly for me (other errors appeared).

I was wondering if maybe we have to follow different steps in order to compile this new version, or maybe the minimum requirements of the required software have changed or maybe something happens with Policy CMP0007, because I had a warning that said it was not set.

Compiling on Linux

On Linux I had no problem compiling and running dlib-19.3 and 19.4 in the past. Now with dlib-19.7 the old problem with #define DLIB_JPEG_SUPPORT has reappeared. cmake runs successfully: I checked that DLIB_JPEG_SUPPORT was ON, that the JPEG FOUND branch in CMakeLists was entered, and that the libjpeg library was found, and everything was right. The build in Release mode also completes correctly. But when I ran the code it was unable to load jpeg images because of DLIB_JPEG_SUPPORT :( The only way I can solve this is to put a #define DLIB_JPEG_SUPPORT at the top of the cpp code. I was wondering if something changed compared to previous releases; this is a bit strange to me because I had no problem with them.

Sorry for this long and boring text and thank you very much for your time and effort :)

Nothing has changed in how dlib is built. You must just be making some kind of mistake. Follow the instructions at the top of this page to compile the example programs: http://dlib.net/compile.html. Read the example cmakelists.txt file.

Davis,

Long time user, first time writer. Thanks very much for your code. We have built and used dlib in many situations (CPU and GPU) on many systems. We are running your classifier as serialized in the code, but on one particular Windows box, when we run face_detection (close enough to dnn_mmod_face_detection_ex), we get the following error:

Error detected at line 682.
Error detected in file e:\src\9.0-2017\_extrnheaders\dlib\dnn/loss.h.
Error detected in function void __cdecl dlib::loss_mmod_::to_label,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::input_rgb_image_pyramid >,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,1,void>,class std::vector >*>(const class dlib::tensor &,const class dlib::dimpl::subnet_wrapper,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::add_layer,class dlib::input_rgb_image_pyramid >,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,void>,1,void> &,class std::vector > *,double) const.

This should definitely not happen and there shouldn't be anything machine specific in the code. If I had to guess I would check if there is something wrong with the GPU that is causing it to output empty tensors, which itself shouldn't happen, but maybe something is horribly wrong with CUDA on that machine.

Thanks very much for the quick response. To help others: I got this message when somebody moved the training file away from the filename we were expecting, so we were trying to classify with an unloaded classifier. dlib was not at fault in any way.
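One way to catch this situation early is to check that the model file can actually be opened before handing it to the detector. The sketch below does that with the standard library only; `model_file_readable` is a hypothetical helper, not a dlib function.

```cpp
#include <fstream>
#include <string>

// Returns true if the file at the given path can be opened for reading and
// contains one or more bytes. Hypothetical helper, not part of dlib.
bool model_file_readable(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return false;
    // peek() yields EOF immediately when the file is empty.
    return in.peek() != std::ifstream::traits_type::eof();
}
```

Calling this before loading the network and reporting the missing path turns a confusing failure deep inside the detector into a clear message at startup.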

Regarding the previous query by Mr. Amritanshu Sinha: it seems strange that dlib cannot detect faces in a batch of more than 3 frames (6 megapixels each) at one time on a GPU with a whopping 16GB of memory. Please clarify.

Thanks for the reply. We are trying to do face recognition on an outdoor 6MP camera with a 12mm lens. More megapixels means clearer faces, and we need to process at least 15 fps. Can you suggest any workaround?

@Davis, thanks for the pointer to the training params for the 5 point model. I notice the image flipping you're doing in here as well. Is this trained with the same dataset as the 100MB model from the original implementation?

I have one question: I tried to train a dnn_mmod using another dataset with more than 2 classes but the training fails completely (1 0 0). I have a static camera and objects moving away from and towards the camera, so the scale of the objects changes (I thought your pyramid input should help here), and so do the aspect ratios. The trainer complains a bit about the aspect ratios... Nevertheless, do you have a quick tip for how to tackle the problem?

Have you ever heard of FindFace? It's a system made by a Russian company, NTechLab, which allows users to instantly find people on the Russian social network VK. They have a database of 500 million photos and they can return extremely accurate results within 2-3 seconds.

Here is how they work: they have trained a neural network to detect and output 300 facial features. They say that they have 1.5 kilobytes of data per face. Do you think DLib could do this some time in the future?

Once they have the facial features they store them in a database which can then be easily accessed through indexes.

As a follow up I commented out these lines:

//upsample_image_dataset>(images_train, boxes_train, 1800*1800);
//upsample_image_dataset>(images_test, boxes_test, 1800*1800);

and the CUDA memory error below was solved:

Error while calling cudaMalloc(&data, n) in file C:\dlib-19.7\dlib\dnn\cuda_data_ptr.cpp:28. code: 2, reason: out of memory
PS C:\dlib-19.7\examples>

FYI my largest image size is 1400x1600, training on a 1080 ti. So I guess 1800x1800 is still too high for the limit.
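A rough way to see why this runs out of memory: if the third argument is a pixel-count limit below which images still get upsampled (my reading of dlib's documentation for upsample_image_dataset, so treat it as an assumption), then a 1400x1600 image is under the 1800*1800 limit and is enlarged rather than left alone. The sketch below works through that arithmetic. The upsampling factor of 2 is also an assumption, since the template argument is missing from the lines quoted above, and `image_size`, `size_after_upsample`, and `rgb_float_bytes` are hypothetical names.

```cpp
#include <cstddef>

// Image dimensions in pixels.
struct image_size
{
    std::size_t rows;
    std::size_t cols;
};

// Assumed behaviour: an image is enlarged by `factor` in each dimension
// unless it already has more than max_pixels pixels.
image_size size_after_upsample(image_size in, std::size_t max_pixels, std::size_t factor)
{
    if (in.rows * in.cols > max_pixels)
        return in;
    return image_size{in.rows * factor, in.cols * factor};
}

// Bytes needed to hold one image as three float32 channels.
std::size_t rgb_float_bytes(image_size s)
{
    return s.rows * s.cols * 3 * sizeof(float);
}
```

Under these assumptions a 1400x1600 image (2.24 million pixels) becomes 2800x3200, which is about 108 MB as float RGB for the input alone, before any layer activations are counted.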

Thank you for your library. I am trying to apply the multiclass CNN detector for OCR purposes. I've found that in some cases the orientation of the detector windows was changed. I suspect that the reason is a bug in the file loss.h, lines 432-433 (442-443):

if (detector_width < min_target_size) { detector_width = min_target_size; detector_height = min_target_size/ratio; }

I am using dlib's "dlib_face_recognition_resnet_model_v1.dat" for feature extraction. We want to further train the network with another data set. Can you suggest a way to load the weights of the model "dlib_face_recognition_resnet_model_v1.dat", so that we can continue training from those initial weights?

Hi Davis, I want to train an object keypoint detector instead of face landmarks. Accuracy is low for the moment. The bounding box of the detected object is not a square, and you say to use the find_affine_transform function in the shape_predictor.h file, but how can I do this with the Python module? Thanks.

Hi Davis,

Thank you for your great library. My question is: is there a way to use your "Fast Multiclass Object Detection" in Python? That is, can I load the model file "mmod_front_and_rear_end_vehicle_detector.dat" in Python to predict the vehicles' locations? If yes, is it possible to run it on the GPU?

and mingw530_32 gives 440.683 milliseconds and the Visual Studio compiler gives 543.65 milliseconds for face detection, but when I built the Visual Studio version cmake said dlib will use cuda! I don't think it did use cuda because it is even slower.

That part of dlib doesn't use cuda. But in general, the output of cmake will tell you in very explicit terms if dlib is compiled to use cuda. If it's compiled to use cuda then it uses cuda as much as it uses it. There is no configuration beyond "built with cuda" or "not built with cuda". So if cmake says it's using cuda then it's using cuda and you are getting whatever cuda acceleration you are going to get.

I am trying to train a landmark model using the Helen 194 points dataset. This dataset is annotated by the authors, so I simply generated the XML files using the original annotations. I have used the default configuration of the algorithm, which is taken from the original paper. However, the results are not accurate. I have modified some parameters of the algorithm such as nu, oversampling and tree_depth, but the results are still not accurate. Any advice for improving my results? Thank you in advance.

Currently looking at train_face_5point_model and the associated data-set dlib_faces_5points.tar... I notice that each file entry has two bounding boxes specified. Am I right in thinking that these are simply the two different bounding boxes detected by the CNN and the HOG detector? Otherwise, what do the two boxes represent? Thanks!

I am trying to do something similar to the sample program: dnn_mmod_face_detection_ex.cpp

In the sample, the input to the CNN is a matrix object that is allocated on the host. In my code, prior to calling the net, I have some CUDA kernels that preprocess the image, so the image data is already on the GPU. Is there a way to invoke the CNN on the image data without first copying the image data back to the host?

Also, is there a way to run the CNN in a specified CUDA stream (i.e. the stream I used to run my preprocessing kernels)?

You could write a custom input layer that takes input from your other source, which shouldn't be a big deal. You can also just call one of the network's member functions that takes a tensor as input rather than a matrix.

All the network computations run on the default CUDA stream. But you can just use per-thread default streams. Read the CUDA docs for details.

Thank you for your reply. Regarding your suggestion of passing a tensor to the network, my understanding is that the network is really an object of the dlib::loss_mmod templated class. dlib::loss_mmod is itself an alias of an instantiation of the dlib::add_loss_layer class. The dlib::add_loss_layer class has an operator() that takes a dlib::tensor as input, and that is the function you are referring to. Is my analysis correct?

The new code builds and runs, but does not seem to produce any face bounding box (i.e. mmod_rectangles.size() == 1, mmod_rectangles.at(0).size() == 0).

Is there anything that stands out as incorrect in what I am doing? I am uncertain about converting from uchar8 to float32, since in the prototype code, I did not perform any explicit conversion. I only did this conversion because it seems that dlib::resizable_tensor only supports float32 numerical format.
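On the uchar8 to float32 question, one thing worth checking is the layout and scaling of the data, not just its type. dlib tensors hold float values with each channel stored as a separate plane, while camera buffers are usually interleaved 8-bit RGB. The sketch below shows that conversion on the host with the standard library only. The per-channel `mean` and the `scale` are placeholders: whatever preprocessing the network's own input layer applies (including any image pyramid it builds) would have to be reproduced, and the exact values should be taken from dlib's input layer source rather than from this example. `interleaved_u8_to_planar_f32` is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Converts interleaved 8-bit RGB (R,G,B,R,G,B,... in row-major order) into
// planar float32 (all R values, then all G, then all B). Each value has the
// channel mean subtracted and is then divided by scale. The mean and scale
// are placeholders to be matched to the network's input layer.
std::vector<float> interleaved_u8_to_planar_f32(
    const std::vector<unsigned char>& rgb,
    std::size_t rows,
    std::size_t cols,
    const std::array<float, 3>& mean,
    float scale)
{
    if (rgb.size() != rows * cols * 3)
        throw std::invalid_argument("buffer size does not match rows*cols*3");
    if (scale == 0.0f)
        throw std::invalid_argument("scale must be non-zero");

    const std::size_t plane = rows * cols;
    std::vector<float> out(plane * 3);
    for (std::size_t r = 0; r < rows; ++r)
    {
        for (std::size_t c = 0; c < cols; ++c)
        {
            const std::size_t pixel = r * cols + c;
            for (std::size_t k = 0; k < 3; ++k)
            {
                const float v = static_cast<float>(rgb[pixel * 3 + k]);
                out[k * plane + pixel] = (v - mean[k]) / scale;
            }
        }
    }
    return out;
}
```

The same indexing carries over to a CUDA kernel if the data has to stay on the GPU. A network that sees unnormalized or interleaved data can run without errors and still return no detections.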