Tuesday, October 11, 2016

Easily Create High Quality Object Detectors with Deep Learning

A few years ago I added an implementation of the max-margin object-detection algorithm (MMOD) to dlib. This tool has since become quite popular as it frees the user from tedious tasks like hard negative mining. You simply label things in images and it learns to detect them. It also produces high quality detectors from relatively small amounts of training data. For instance, one of dlib's example programs shows MMOD learning a serviceable face detector from only 4 images.

However, the MMOD implementation in dlib used HOG feature extraction followed by a single linear filter. This means it's incapable of learning to detect objects that exhibit complex pose variation or have a lot of other variability in how they appear. To get around this, users typically train multiple detectors, one for each pose. That works OK in many cases but isn't a really good general solution. Fortunately, over the last few years convolutional neural networks have proven themselves to be capable of dealing with all these issues within a single model.

So the obvious thing to do was to add an implementation of MMOD with the HOG feature extraction replaced with a convolutional neural network. The new version of dlib, v19.2, contains just such a thing. On this page you can see a short tutorial showing how to train a convolutional neural network using the MMOD loss function. It uses dlib's new deep learning API to train the detector end-to-end on the very same 4 image dataset used in the HOG version of the example program. Happily, and very much to the surprise of myself and my colleagues, it learns a working face detector from this tiny dataset. Here is the detector run over an image not in the training data:

I expected the CNN version of MMOD to inherit the low training data requirements of the HOG version of MMOD, but working with only 4 training images is very surprising considering other deep learning methods typically require many thousands of images to produce any kind of sensible results.

The detector is also reasonably fast for a CNN. On the CPU, it takes about 370ms to process a 640x480 image. On my NVIDIA Titan X GPU (the Maxwell version, not the newer Pascal version) it takes 45ms to process an image when images are processed one at a time. If I group the images into batches then it takes about 18ms per image.
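To put those latencies in terms of throughput, here is a minimal sketch that converts a per-image time into images per second. The function name is made up for illustration, and the only inputs are the timings quoted above.

```cpp
#include <cassert>

// Convert a per-image latency in milliseconds into images per second.
// The name images_per_second is hypothetical, not part of dlib.
double images_per_second(double ms_per_image) {
    assert(ms_per_image > 0.0);
    return 1000.0 / ms_per_image;
}
```

With the numbers from the paragraph above, 370ms works out to roughly 2.7 images per second on the CPU, 45ms to roughly 22 on the GPU one image at a time, and 18ms to roughly 55 when batching.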

To really test the new CNN version of MMOD, I ran it through the leading face detection benchmark, FDDB. This benchmark has two modes, 10-fold cross-validation and unrestricted. Both test on the same dataset, but in the 10-fold cross-validation mode you are only allowed to train on data in the FDDB dataset. In the unrestricted mode you can train on any data you like so long as it doesn't include images from FDDB. I ran the 10-fold cross-validation version of the FDDB challenge. This means I trained 10 CNN face detectors, each on 9 folds and tested on the held out 10th. I did not perform any hyperparameter tuning. Then I ran the results through the FDDB evaluation software and got this plot:

The X axis is the number of false alarms produced over the entire 2845 image dataset. The Y axis is recall, i.e. the fraction of faces found by the detector. The green curve is the new dlib detector, which in this mode only gets about 4600 faces to train on. The red curve is the old Viola Jones detector which is still popular (although it shouldn't be, obviously). Most interestingly, the blue curve is a state-of-the-art result from the paper Face Detection with the Faster R-CNN, published only 4 months ago. In that paper, they train their detector on the very large WIDER dataset, which consists of 159,424 faces, and arguably get worse results on FDDB than the dlib detector trained on only 4600 faces.
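For readers unfamiliar with this kind of plot, the following is a rough sketch of how a recall versus false alarm curve can be traced from scored detections. It is not the FDDB evaluation software: it assumes each detection has already been marked as matching a labeled face or not, and all type and function names are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One detector output: its confidence score and whether it matched a labeled face.
// How a match is decided (e.g. an overlap test) is outside this sketch.
struct Detection {
    double score;
    bool is_true_face;
};

// One point on the curve: false alarms on the X axis, recall on the Y axis.
struct CurvePoint {
    std::size_t false_alarms;
    double recall;
};

// Sweep the score threshold from high to low and record one point per detection.
std::vector<CurvePoint> recall_vs_false_alarms(std::vector<Detection> dets,
                                               std::size_t total_faces) {
    assert(total_faces > 0);
    std::sort(dets.begin(), dets.end(),
              [](const Detection& a, const Detection& b) { return a.score > b.score; });

    std::vector<CurvePoint> curve;
    std::size_t hits = 0;
    std::size_t false_alarms = 0;
    for (const Detection& d : dets) {
        if (d.is_true_face) ++hits; else ++false_alarms;
        curve.push_back({false_alarms, static_cast<double>(hits) / total_faces});
    }
    return curve;
}
```

Lowering the threshold moves right along the curve: every extra accepted detection either raises recall or adds a false alarm.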

As another test, I created the dog hipsterizer, which I made a post about a few days ago. The hipsterizer used the exact same code and parameter settings to train a dog head detector. The only difference was that the training data consisted of 9240 dog heads instead of human faces. That produced the very high quality models used in the hipsterizer. So now we can automatically create fantastic images such as this one :)

As one last test of the new CNN MMOD tool I made a dataset of 6975 faces. This dataset is a collection of face images selected from many publicly available datasets (excluding the FDDB dataset). In particular, there are images from ImageNet, AFLW, Pascal VOC, the VGG dataset, WIDER, and face scrub. Unlike FDDB, this new dataset contains faces in a wide range of poses rather than consisting of mostly front facing shots. To give you an idea of what it looks like, here are all the faces in the dataset tightly cropped and tiled into one big image:

Using the new dlib tooling I trained a CNN on this dataset using the same exact code and parameter settings as used by the dog hipsterizer and previous FDDB experiment. If you want to run that CNN on your own images you can use this example program. I tested this CNN on FDDB's unrestricted protocol and found that it has a recall of 0.879134, which is quite good. However, it produced 90 false alarms. Which sounds bad, until you look at them and find that it's finding labeling errors in FDDB. The following image shows all the "false alarms" it outputs on FDDB. All but one of them are actually faces.

Finally, to give you a more visceral idea of the difference in capability between the new CNN detector and the old HOG detector, here are a few images where I ran dlib's default HOG face detector (which is actually 5 HOG models) and the new CNN face detector. The red boxes are CNN detections and blue boxes are from the older HOG detector. While the HOG detector does an excellent job on easy faces looking at the camera, you can see that the CNN is way better at handling not just the easy cases but all faces in general. And yes, I ran the HOG detector on all the images, it's just that it fails to find any faces in some of them.

Thanks, it is awesome!!! This is a good deep learning library for C++ programmers. It is easy to use (nice, clean API), supports GPGPU, provides nice examples, and is easy to build. It is like a dream come true.

I think I see two non-faces, the "4" jersey, but also the square just above it.

I was about to try this out, but I'm having some growing pains building dlib on my OS X 10.11 laptop (1. using Anaconda creates X11 problems, I think I figured that out 2. I got CUDA 8.0 built with Xcode 7.3 and I've been using it from Python; dlib finds CUDA, but says it can't compile anything and can't find cuDNN). If I have time to dig into these issues more and can provide more precise info I'll post an issue on GitHub.

I have a question about the receptive field you mention in the code comment. You mention it is a little above 50x50, but I can't seem to reproduce this size. I end up with a receptive field of 117x117 pixels.

Well, the first few layers downsample the image by a factor of 8 so you can think of it as turning it into 8x8 cells. I realize the cells have receptive fields that are larger than 8x8, but most of the action of their focus is in a smaller area. Then the detector pulls a 6x6 grid of them at the end, after some filtering.

I should probably change the wording of the example to be a little less confusing :)
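To make the receptive field arithmetic in this exchange concrete, here is a small sketch using the usual recurrence for stacked convolutions. The layer list is an assumption rather than something stated in the post: three 5x5 stride-2 convolutions (which gives the factor of 8 downsampling mentioned above), three 3x3 stride-1 convolutions, and a final 6x6 filter. All names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Kernel size and stride of one convolution layer (square kernels assumed).
struct ConvLayer {
    int kernel;
    int stride;
};

// Receptive field, in input pixels, of one output of the last layer.
int receptive_field(const std::vector<ConvLayer>& layers) {
    int field = 1;  // a single input pixel
    int jump = 1;   // input-pixel distance between neighbouring outputs
    for (const ConvLayer& l : layers) {
        assert(l.kernel >= 1 && l.stride >= 1);
        field += (l.kernel - 1) * jump;
        jump *= l.stride;
    }
    return field;
}

// ASSUMED layer stack for illustration only; check the example program
// for the real architecture before relying on these numbers.
std::vector<ConvLayer> assumed_detector_layers() {
    return {{5, 2}, {5, 2}, {5, 2}, {3, 1}, {3, 1}, {3, 1}, {6, 1}};
}
```

Under that assumed stack the full receptive field comes out at 117 pixels, the figure the commenter reports, while a 6x6 grid of 8x8 cells covers 48 pixels, which is close to the "a little above 50x50" reading.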

Thank you for the explanation, it helps me understand how to change the network for other objects. The results of the experiments I have done so far with the face model are remarkable. Keep up the good work ;).

I have another question though. I am trying to train a pedestrian model of 22x50. I have altered all of the annotations to more or less this aspect ratio (rounded dimension values). When training with these annotations, I get the error "Encountered a ground truth rectangle with a width and height of 71 and 90...aspect ratio is too different from the detection window", although these dimensions are not present in the annotation file. Are the ground truth annotations cropped somehow when they overlap with other annotations or the image boundary, maybe? And will this problem still appear for 'ignored' annotations?

I just finished reading your MMOD paper and it is a brilliant concept. I am curious how the margin based optimization of MMOD is combined with gradient descent based optimization of neural nets. Is there more material about the convnet version I can read somewhere ?

@florisdesmedt -- I saw the same issue when training an object detector that is rectangular (2x1 aspect ratio). I believe the issue is the random crop rotation. After rotation (e.g., at a 45 degree angle) the region is no longer rectangular. It's basically at a 1x1 aspect ratio, and the training fails. This isn't really a problem for square things, since even after a 45 degree rotation the box is still within the tolerance. I reduced the random crop rotation to 5 degrees (from default 30) and it worked: cropper.set_max_rotation_degrees(0);
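The effect described in this comment can be checked with a little geometry. The sketch below assumes the augmented box is the axis-aligned bounding box of the rotated rectangle, which is an assumption about how the cropper behaves and not something the post confirms. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Width and height of an axis-aligned box.
struct Box {
    double width;
    double height;
};

// Axis-aligned bounding box of a box rotated about its center.
// ASSUMPTION: this models what rotation augmentation does to a truth box.
Box rotated_bounding_box(Box b, double degrees) {
    const double pi = std::acos(-1.0);
    const double rad = degrees * pi / 180.0;
    const double c = std::abs(std::cos(rad));
    const double s = std::abs(std::sin(rad));
    return {b.width * c + b.height * s, b.width * s + b.height * c};
}

// Width divided by height.
double aspect_ratio(Box b) {
    assert(b.height > 0.0);
    return b.width / b.height;
}
```

Under that assumption a 2x1 box rotated by 45 degrees ends up with an aspect ratio of 1, while a 5 degree rotation only moves it from 2 to roughly 1.78. A square box keeps a ratio of 1 at any angle.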

@ Davis, great job. I see that you use "num of false alarms v.s. recall" figure to compare fasterRCNN, dlib MMOD CNN, and Violajones. May I know if you have similar figures to compare fasterRCNN, dlib MMOD CNN and dlib HOG?

Hi, how could we reuse a trained network to extract features from images and then train our own standard machine learning classifier (such as softmax or SVM)? Like the overfeat example of sklearn_theano. Thanks

Yes, you can get the output of the network's last layer, or any other layer. Read the documentation to see how. In particular, there are two introduction examples that are important to read to understand the API.

I just used dlib to train a neural network for the first time using this library. It was super easy. A lot of the nonsense you have to deal with in other libraries is taken care of for you automatically. Plus the code has basically 0 dependencies, I can deploy it in an embedded device (No Python, lua, lmdb, or a whole menu of other stuff). I've already said this once before, but really *great* job on this library. Dlib deserves a lot more attention as a first-class neural net library.

Yeah, I wanted to make a library that professionals would want to use. Most of the other tools are obviously written by grad students who don't know how to code well. Dlib will also automatically adjust the learning rate as well as detect convergence, and so knows when to stop. So there shouldn't be much fiddling with learning rates. Just run it and wait. Although you do need to set the "steps without progress threshold", but setting it to something big like 10,000 or 8,000 should always be fine. I have it set to something smaller in the examples just so they run faster.
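To illustrate the general idea of a "steps without progress" threshold, here is a deliberately simplified stand-in. It is not dlib's actual progress test, which this post does not describe; it just shrinks the learning rate whenever the loss has failed to improve for a set number of steps, and signals a stop once the rate falls below a floor. The class name and all parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Simplified plateau-based schedule. NOT dlib's algorithm, only an
// illustration of shrinking the learning rate when progress stalls.
class PlateauSchedule {
public:
    PlateauSchedule(double learning_rate, double shrink_factor,
                    double min_learning_rate, std::size_t threshold)
        : lr_(learning_rate), shrink_(shrink_factor),
          min_lr_(min_learning_rate), threshold_(threshold) {
        assert(learning_rate > 0.0 && min_learning_rate > 0.0);
        assert(shrink_factor > 0.0 && shrink_factor < 1.0);
        assert(threshold > 0);
    }

    // Feed the loss of one training step. Returns false once the learning
    // rate has dropped below the floor, i.e. training should stop.
    bool step(double loss) {
        if (loss < best_loss_) {
            best_loss_ = loss;
            steps_without_progress_ = 0;
        } else if (++steps_without_progress_ >= threshold_) {
            lr_ *= shrink_;
            steps_without_progress_ = 0;
        }
        return lr_ >= min_lr_;
    }

    double learning_rate() const { return lr_; }

private:
    double lr_;
    double shrink_;
    double min_lr_;
    std::size_t threshold_;
    std::size_t steps_without_progress_ = 0;
    double best_loss_ = std::numeric_limits<double>::infinity();
};
```

A larger threshold makes the schedule more patient, which matches the advice above that a big value is the safe choice and a small one mainly makes examples finish sooner.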

Size of the mnist dataset is 28*28. I check the source codes of max_pool, by default it prefer zero padding.
first max_pool 28*28 become 14*14
second max_pool 14*14 become 7*7
final size should be 7*7*16 = 784
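The arithmetic in this comment can be written out as a short sketch. It assumes stride-2 pooling with zero padding so that the output length is the input length divided by the stride, rounded up, and it takes the 16 channels from the comment. The function names are invented for the example.

```cpp
#include <cassert>

// Output length of a pooling layer along one dimension, ASSUMING the input is
// zero padded so the result is ceil(input / stride).
long pooled_size(long input, long stride) {
    assert(input > 0 && stride > 0);
    return (input + stride - 1) / stride;
}

// Number of values left after applying `pools` pooling layers to a square
// `side` x `side` input with `channels` feature maps.
long flattened_size(long side, long pools, long stride, long channels) {
    assert(pools >= 0 && channels > 0);
    for (long i = 0; i < pools; ++i)
        side = pooled_size(side, stride);
    return side * side * channels;
}
```

For a 28x28 input, two stride-2 pools, and 16 channels this gives 7*7*16 = 784, the same figure the commenter arrives at.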

Now I get the same error when I swap out a training image for a testing image. That is, if the testing image and result were faulty, I should not get a result of (1,0,0) when I swap out the faulty testing image with a training image...

How is test_object_detection_function() supposed to test an image if you don't give it any truth data? If you looked at the documentation for is_learning_problem() (or test_object_detection_function) you would find out it's complaining because you haven't given any truth detections.

Hi, I'm experimenting with batch processing of images. I have found out that this can easily be done by giving a vector of matrix elements to the functor net (assuming net is the network). But I also want to alter the adjustment_threshold.

For a single image I can alter it using:
---
// temporarily create a vector to work with iterators
std::vector<matrix<rgb_pixel>> images;
images.push_back(img);

// create a tensor with all the data to process (single image)
resizable_tensor temp_tensor;
net.to_tensor(&images[0], &images[0]+1, temp_tensor);

// run the network
net.subnet().forward(temp_tensor);

// convert the output of the network to detections (adjusting the threshold with -0.6)
std::vector<mmod_rect> dets;
net.loss_details().to_label(temp_tensor, net.subnet(), &dets, -0.6);
---
For batch processing, I assume the vector "images" should contain all the images, such that a tensor is created that contains all the data of the whole batch. But how can I perform the last step (...to_label())? I assume I have to iterate through the samples of the tensor "temp_tensor" to obtain the detections for the respective image, so call the to_label function multiple times? Is there an iterator that easily loops over the samples that I can use?

I did see that 'is_learning_problem(images,truth_dets)' is coming out false. But I don't see why. I generated the testing.xml and training.xml the same way using the imglab tool. That is, I picked the 'truth' box the same way for both training and testing.

Here is what the testing.xml contains:

imglab dataset
Created by imglab tool.

Did it need anything other than the Box definition. Note that when I swap out the two lines of image file and box top from the testing to the training xml, it is ok..

I know the program has found the testing.xml because it knows that there is only 1 image in the testing. That is the initial comments in the code is here:

The xml statements didn't come through in the last post, so I'm replacing < with [ in the xml statements. They have:
[images]
  [image file='/Volumes/hdname/third.jpg']
    [box top='271' left='234' width='182' height='262'/]
  [/image]
[/images]

>I would use a square box right now. Ok, I will give this a shot, if the results are good I will tell you.

>Although, in the near future I'm going to add support for multiple aspect ratios into the MMOD tool.

Looking forward to that. It would be another big surprise(present) if we could use MMOD to detect objects of different aspect ratio

I checked the file loss.h, and there are some loss classes that are not used in the examples, like loss_binary_hinge, loss_binary_log_, and loss_metric_. What are they for? Under what situations should we try to use them? Thanks

I used the code to train on my dataset, which contains 1000 images, and found that the program took 5G of GPU memory, 6G of main memory, and 40% CPU across 8 cores. Is that normal? How can I adjust the code to use fewer resources, for example 10% CPU? Thanks.

Some parameters that could be changed to use fewer resources (which I have used to keep the training process from swapping) are the cropping size (make it smaller) and the number of crops in each iteration. Both will however have an influence on the resulting detection accuracy, I presume.

Thank you very much for the nice post! I got this error when I was running the training. Could anyone help me?

>>Encountered a truth rectangle with a width and height of 19 and 40.The image pyramid and sliding window can't output a rectangle of this shape. This is because the rectangle is smaller than the detection window which has a widthand height of 30 and 54.

I had a similar problem before. When you try to train a model that is not square shaped, you have to limit the amount of rotation in annotation augmentation. The command for this can be found in an earlier post.

I have some problems with getting the training working properly. I almost always get a test result of 1 0 0 (so no detections are found, with an obvious full precision), also on the training set. Does the training data have to be in some specified format? I have tried different datasets, model sizes, ... Now I'm experimenting with the head annotations of the Town Centre dataset (more or less square annotations). I started with using the full images and using all annotations per image (no luck there). I saw that the training data for the face detector uses all square patches (250x250 or 350x350) with the annotation positioned in the center, taking between 23% and 45% of the patches. Is this required? (I also tried this format, without success so far.) I also saw that in most of these patches the other annotations are set to ignore, with a few exceptions (which lead to a non-zero NMS threshold). Is there some logic to which ones are ignored?

Are the model size, cropped patch size, ... related to the size the annotations have in the training data?

Do you have any plans to support GPU for face/object detection? Speeding up the training as step 1 is pretty fantastic. Just curious if you have plans to support OpenCL or something similar for detecting faces in images and reduce the need for clusters and put the power in the hands of the average guys! I plan to look into dlib's core and see how feasible it is to do myself

By the way, today I deployed a small example on a Windows laptop without CUDA support, and it crashed suddenly (I put cublas64_80.dll, cudart64_80.dll, cudnn64_5.dll, and curand64_80.dll in the folder of the exe). Is this normal? Thanks

Hi Guys, Didn't have a chance to thank you for this work. This is amazing.

Sorry for a rookie question but I struggle with something I don't understand in your example. Using the imglab tool I defined my train/test object xmls but dnn_mmod_ex keeps telling me: "Encountered a truth rectangle located at [(10, 51) (23, 70)] that is too close to the edge of the image to be captured by the CNN features."

Depending on the cropper and mmod_options settings the values in brackets change. It seems a pretty self-explanatory message, so I wanted to know exactly which region causes the error. So I ended up with one image in the train xml that has one region in the center of it. It is definitely not too close to the edge of the image. I noticed that when the regions are too small this error can appear.

@Davis King Thanks for doing such an amazing job! This can detect human faces perfectly. However, I'd like to do further operations with the detected object (say, recognition), but I don't quite understand what I can do with it other than showing an overlay. Sorry for this kind of stupid question. Is there any reference you would suggest?

What changes have to be made to process a bunch of images as described in the comment? Please help

matrix<rgb_pixel> img;
load_image(img, argv[i]);

// Upsampling the image will allow us to detect smaller faces but will cause the
// program to use more RAM and run longer.
while(img.size() < 1800*1800)
    pyramid_up(img);

// Note that you can process a bunch of images in a std::vector at once and it runs
// much faster, since this will form mini-batches of images and therefore get
// better parallelism out of your GPU hardware. However, all the images must be
// the same size. To avoid this requirement on images being the same size we
// process them individually in this example.
auto dets = net(img);

I think you are talking about compile time? The problem I met is at runtime: if the target host does not support CUDA, the program will crash if I compiled the app with CUDA. Is it possible to switch from GPU mode to CPU mode if the target host does not support CUDA? Thanks

Thank you for such a nice tool. Of all the CNN tools I have explored so far, this is the easiest one.

I built dlib 19.2 with cuda 8.0, cuDNN 5.1, visual studio 2015 update 3 and without openCV. Then I tried "dnn_mmod_face_detection_ex" which is working perfectly. But getting error while running "dnn_mmod_ex". For training and testing I use the face dataset provided with dlib.

Reducing the number of crops resolved the memory issue. I tried 80 and was able to train and test, but the test result is very poor. No face detection at all. I am confused whether it is because of the number of crops (mini_batch_size) or some other issue.

Hello Davis, I want to use max margin object detection for face detection. For this, I tried modifying dnn_mmod_ex.cpp file that is provided in the examples. However I would like the detector to detect more faces, even at the cost of some false positives. I understand that I have to change the adjust_threshold parameter, but where do I change it from? I tried to read the documentation but could not find anything related to this. Any help would be much appreciated!

I really need to train my own DNN model for this, as the faces I'm detecting are from grayscale low resolution images and the pretrained model does not work well for this. I tried going through example programs you suggested but could not find any settings for the threshold that lets me have more detections. Any particular example you are talking about?

Thanks for your suggestions. I did not increase the number of iterations without progress. I am running the training again and this time I am going to use the network configuration and all the parameters describe in pretrained face detectors example.

I'm processing the images via random_cropper_ex. Actually some images are red in same random crops and not in some other crops. I have resized the images and made the width 600px by keeping the aspect ratio.Trying to figure out which images is better for dnn training, but still couldnt figure out yet.

Hi Davis, I'd like to use tracking after an object is detected in a video, for which the position of the object is necessary. I wonder how to find attributes like position or size etc. Thank you for your contribution.

I have a very serious issue with the code. I have made a Python wrapper of the code using the Boost library so that I can access the C++ code from Python. From Python I send a frame as an ndarray, and my C++ program converts the ndarray to a Mat and subsequently to the dlib image format. The code runs perfectly on the CPU, but on a system with a GPU it gets a segmentation fault when executed from Python. I tested the code on a system without a GPU and it runs fine. When I compiled my code on the other system with a GPU, the line "auto dets=net(img)" or "auto dets=net(imgs)" gives a segmentation fault. I am not sure, but there is some issue coming from dlib/dnn.h.

Post a github issue with a minimal but complete C++ program that reproduces the crash. Include everything necessary to run it. Don't post a python example since it's very likely the bug is in something you did wrong in the python C API rather than anything to do with dlib. So if you can't reproduce the crash in a pure C++ program then the bug isn't in dlib.

Hi, I am playing a little bit with this library, great work! I have a question.Can the dnn_mmod_ex example classify more than one class? In other words, if the training set contains images with different labels, is it able to classify them? Or, do I have to implement a specific loss layer for my classes?Thanks.

loss_mmod only outputs one class. However, the extension to support multiple labels isn't very difficult. I'm going to update it in an upcoming release to add such support. But right now it's just one label at a time.

Earlier versions of cuDNN have a different and incompatible API and are also missing important features. Moreover, from looking at nvidia's web page, it sure looks like current versions of cuDNN can be used with the jetson tk1, as is claimed here for instance: https://developer.nvidia.com/embedded/jetpack

Is there any way I can give a cv::Mat image directly to the network, i.e. dets(net)? It takes some dlib format. Is there any way I can give a cv::Mat image to the network without conversion, as I am taking the data from an RTSP or video stream?

Yes, you can use any inputs you like. Just create a new input layer that takes the input object you want to use. It's not complex and only a few lines of code. The documentation explains how to do it in detail.

I studied the file dnn_mmod_ex.cpp. In the comments you mention the training stage and the xml files.

1) Why have you also created an xml file for the test images?
2) Does the program train itself again if we run it more than once?
3) How can I feed in real time live video (such as a webcam or USB3 camera) instead of test images as input?
4) GPU cards are expensive, so I plan to use FPGAs. Do you have any suggestions for this?

I gave square boxes a try, and the results are far from good: training and testing results are 1 0 0. I guess it is because square boxes include a lot of negative background, which makes it hard for the trainer to differentiate positive and negative objects.

To find out whether my suspicion is true or false, I trained the detector with a different aspect ratio. This time the training progress and results became much better.

Would next version of dlib(19.3) support multi aspect ratio, multi label, and allowed us to use pre-trained model to train object detector?

The shape of the box shouldn't matter since the detector we are talking about is a point detector. That is, the area of the image it looks at doesn't have anything to do with the shape of the box. The detector just scans a DNN over the image and finds the center of an object. The area of the image the DNN looks at is entirely determined by the architecture of the DNN, not the shape of the boxes in the training data. So what you are saying about looking at negative background isn't right. If you get bad results when changing box sizes then it's probably just some mistake in your data or setup or how you apply data augmentation. Who knows.

I will be upgrading the detector to support multiple aspect ratios and object categories. This will probably happen in dlib 19.4.

> That is, the area of the image it looks at doesn't have anything to do with the shape of the box.

Maybe we are not talking about the same thing?

What I mean is that the square box contains too much negative background. Assume the aspect ratio of the object is 1 : 4. To make the dlib trainer work, I draw the bounding box with an aspect ratio of 4 : 4 == 1 : 1, which means 3/4 of the area of this bounding box is negative background.

It is like drawing the bounding boxes of faces 4 times larger, so the boxes contain 3/4 negative background compared with the original bounding boxes. Now only 1/4 of the box is the object we want to detect (the face).

>I will be upgrading the detector to support multiple aspect ratios and object categories. This will probably happen in dlib 19.4.

I could not succeed in installing dlib for Python on Windows 10. I get the error messages below. Can anybody help me?

D:\Yazilim\dlib\dlib-19.2>python setup.py install --yes DPYTHON3
running install
running bdist_egg
running build
Detected Python architecture: 32bit
Detected platform: win32
Removing build directory D:\Yazilim\dlib\dlib-19.2\./tools/python/build
Configuring cmake ...
-- Building for: NMake Makefiles
-- The C compiler identification is unknown
-- The CXX compiler identification is unknown
CMake Error in CMakeLists.txt:
  The CMAKE_C_COMPILER: cl is not a full path and was not found in the PATH.
  To use the NMake generator with Visual C++, cmake must be run from a shell that can use the compiler cl from the command line. This environment is unable to invoke the cl compiler. To fix this problem, run cmake from the Visual Studio Command Prompt (vcvarsall.bat).
  Tell CMake where to find the compiler by setting either the environment variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to the compiler, or to the compiler name if it is in the PATH.
CMake Error in CMakeLists.txt:
  The CMAKE_CXX_COMPILER: cl is not a full path and was not found in the PATH.
  To use the NMake generator with Visual C++, cmake must be run from a shell that can use the compiler cl from the command line. This environment is unable to invoke the cl compiler. To fix this problem, run cmake from the Visual Studio Command Prompt (vcvarsall.bat).
  Tell CMake where to find the compiler by setting either the environment variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path to the compiler, or to the compiler name if it is in the PATH.
-- Configuring incomplete, errors occurred!
See also "D:/Yazilim/dlib/dlib-19.2/tools/python/build/CMakeFiles/CMakeOutput.log".
See also "D:/Yazilim/dlib/dlib-19.2/tools/python/build/CMakeFiles/CMakeError.log".
error: cmake configuration failed!

I'm using your pretrained model to detect faces in ~1920x1080 images in a C++ project.

I've found the memory usage of this model to be extraordinarily high, especially compared to other neural networks.

Previously I had multi-threaded the FHOG detector and it would use <1GB with 4-5 threads each with their own instance of the FHOG detector.

With the new DNN-based mmod detector I'm finding that the memory usage approaches 4GB with no multithreading and just a single instance of the network. The network is running on CPU, no AVX, but SSE2 is enabled.

I'm using dnn_mmod_face_detection_ex.cpp and the provided mmod_human_face_detector.dat model to replicate this issue, and Visual Studio 2015's diagnostic tools to measure the process memory.

Considering the model file is <1MB, and the image I'm using is <1MB, I can't fathom why it would be using this much memory. Do you have any ideas on what is going on and how I might fix this or reduce the memory usage?

That can happen. The CPU code does the convolutions by building a toeplitz matrix and shoving it through BLAS. That will result in a bigger amount of RAM usage. There is definitely room to make the CPU side faster, like by using the SIMD enabled convolution routines the HOG code uses, but I haven't gotten around to it since everything I'm doing is on the GPU.
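To give a feel for why unrolling a convolution into a matrix costs so much RAM, here is a back-of-the-envelope sketch. It assumes an im2col style layout in which each output position becomes a row holding the full kernel window across all input channels, stored as float. The 5x5 kernel and stride of 2 used in the note after the code are assumptions for illustration, not values taken from this post, and the names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Shape of the unrolled (im2col style) matrix for one convolution layer.
struct Im2colShape {
    std::size_t rows;  // one row per output position
    std::size_t cols;  // kernel * kernel * channels values per row
};

// Shape for an unpadded convolution with a square kernel.
Im2colShape im2col_shape(std::size_t height, std::size_t width,
                         std::size_t channels, std::size_t kernel,
                         std::size_t stride) {
    assert(kernel >= 1 && stride >= 1 && channels >= 1);
    assert(height >= kernel && width >= kernel);
    const std::size_t out_h = (height - kernel) / stride + 1;
    const std::size_t out_w = (width - kernel) / stride + 1;
    return {out_h * out_w, kernel * kernel * channels};
}

// Bytes needed to hold the unrolled matrix as floats.
std::size_t im2col_bytes(Im2colShape s) {
    return s.rows * s.cols * sizeof(float);
}
```

For a 1920x1080 RGB image with the assumed 5x5 stride-2 first layer, the unrolled matrix has 515,404 rows of 75 values, which is about 155 MB with 4-byte floats, for that one layer alone and before any upsampling or additional pyramid levels are taken into account.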

Hi Davis, thanks for the help, I solved the issues with it. My concern now is how to improve the performance. I have configured the code to accept frames from a video stream using OpenCV and process them on the GPU. Can you suggest which parameters can be tuned for performance improvement? For example, pyramid size is one such parameter.

Hi Davis, I am using NVIDIA GeForce GTX 970, and configured the settings. i.e, using CUDA with cuDNN and AVX_INSTRUCTIONS=ON. So, when I am executing the program with Upsampling the images, it is giving 100% accuracy but taking a lot of time (per image). Whereas when I am executing it without Upsampling, it is consuming less time approx 50ms but accuracy is very less.

What configurations should I do in order to improve the accuracy?

I want one more clarification that, whatever results you have got were using Upsampling or without Upsampling?

Davis, I have a requirement to train custom object detectors. It used to take less than a min to train HOG based detectors earlier, but though the DNN detectors are superior, they take about 3+hrs for me, on even a GPU nVidia 1080. I am training with about 35-40 boxes as training data as an experiment. Is this normal?

Secondly, with the earlier HOG based detectors, I could stack multiple detectors and fire them all at once on the same image and I would get matches based on which ones matched. Is there any documentation on how do achieve this using DNN based method?

Say I'd like to train for about 20-25 different object types, and be able to detect them on my dataset, what would be the most optimal way to use dLIB to achieve this? Also what kind of training time and test (fps) performance can I expect?

Yes, the CNN training will take several hours. Possibly more depending on the network architecture and other settings.

There is no function in dlib to quickly run different CNN detectors as a group. Each CNN must be run on its own.

For really small datasets you are still probably better off with the HOG detector. Although it's possible that setting up a special CNN that uses HOG for the first few layers might be better, both in terms of speed and training data requirements, than either the regular HOG detector or CNN detector discussed in this blog post. I haven't tried such a thing, but it's definitely something I would consider if dlib's regular HOG detector isn't powerful enough but the full CNN is too slow.

:) True, but the issue is reported only after 990 iterations. I am taking the frames from OpenCV, and when the frame count reaches 990 this error is generated. Why does it happen only after 990 iterations? If I were consuming too much memory, it should have been reported at the start.

I'm having some trouble getting this example to run on my machine: I keep getting a 'Bad allocation' error. I resorted to commenting out lines one by one, and I managed to track the error to the line 'auto dets = net(img);'.

Now, I'm a complete C++/CMake novice, so it's quite possible that I did something wrong. However, I haven't got a clue how to investigate this further. Can I toggle something to give me more verbose error messages? Or did I miss something while building? Any nudge in the right direction would be much appreciated.

>I will be upgrading the detector to support multiple aspect ratios and object categories. This will probably happen in dlib 19.4.

Do you plan on making this detector achieve state-of-the-art results on VOC2007+VOC2012 and the COCO object detection challenge too? Sorry if I am asking too much. The dlib detector already gives us a decent API and rich documentation, can train a reasonable detector from only a few training images, and is easy to build and cross-platform. Thanks for your hard work.

I have a few questions after reading your paper about MMOD.

- Are you still using a sliding window classifier now that you have switched from HOG to CNN? If yes, this means you use y* = argmax sum(f(x,r)) as the scoring function, right?
- I was wondering how you go from this scoring function of a window to the coordinates of the bounding box. Does your window change in size (using an image pyramid), with the highest scoring window giving the bounding box coordinates, or do you somehow look inside the highest scoring window for the right coordinates of the bounding box?

I am trying to run dlib's dnn_mmod_ex.cpp to train a face detector with just the 5 images provided with it. The example mentions that the learning rate will get smaller as time passes. My code has already been running for more than 24 hours (with SSE4 instructions enabled and in release mode) and is past 450 steps, but the learning rate is still not getting smaller and training is not converging. What should I do?

Thank you very much for your great work and thank you for sharing it :)

I am running the face MMOD example on a Linux machine with an NVIDIA Titan X GPU (Maxwell version) and the code is taking over 600ms to process a single image. I have read that in your case this code takes 45ms when you process the images one at a time.

I am using CUDA 8 with cuDNN 5.1 and nothing else (no OpenCV, no sqlite or anything similar). The compilation was fine (it found CUDA and cuDNN correctly), and I checked with the nvidia-smi command that the example was using the GPU while it was running. So, my question is: are you using any extra library or procedure in order to get better times? The time I am getting is 10 times slower than yours and I don't understand why this could be happening.

I start timing on the line after the deserialization of "mmod_human_face_detector.dat" and stop just before the line that prints "Hit enter to process the next image.", and I have disabled all the windows and image displays.

Hello Davis, is there any way to increase the speed of the current CNN based dlib face detection? I tried reducing the pyramid size from 6 to 4. That comes at the cost of losing some faces, but I am fine with it. Please suggest something. I am working on an NVIDIA Jetson TX1 board and also have an NVIDIA GTX 1080. I am trying to process images of size 4MP.

I ran the test on Ubuntu 16.04, with dlib compiled with CUDA. The GPU is a GTX 1060. I use the image in the faces folder, and I disabled pyramid_up. I record the execution time in the following way:

cout << "start" << endl;
int start = getCurrentTime();
auto dets = net(img);
int end = getCurrentTime();
cout << "finish" << end - start << endl;

The execution time is about 1~2 seconds, which is much slower than you mentioned, and I did measure many calls. I am so confused and have no idea why the test is slow.
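A timing measurement like the one above can be made a little more robust by timing only the detector call, discarding a first warm-up call, and averaging over several runs. The following is a minimal C++ sketch using std::chrono. `mean_milliseconds` is a hypothetical helper name, and the idea that the first call carries one-time setup cost (for example GPU initialization) is an assumption, not something measured in the post.

```cpp
#include <cassert>
#include <chrono>

// Returns the mean wall-clock time, in milliseconds, of `runs` calls to `work`.
// One extra call is made first and not timed, so that any one-time setup cost
// does not distort the average.
template <typename Work>
double mean_milliseconds(Work&& work, int runs)
{
    assert(runs > 0);
    work();  // warm-up call, deliberately excluded from the measurement

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / runs;
}
```

In the example program this might be called as mean_milliseconds([&] { auto dets = net(img); }, 20), which keeps file loading, deserialization, and display code out of the measured region.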

I realized that I had not compiled with GPU support, because I used a plain g++ command to compile the example. After recompiling the example with cmake, the execution time is less than 100ms. Do you know of any way I can compile with GPU support using the g++ command?

@Davis, great work with dlib! I have researched the various libraries available online, and dlib is one of the best. I am training a new object detector to be used in a smartphone app. Your comment about performance in dnn_mmod_face_detection_ex.cpp caught my attention: "[CNN model] takes much more computational power to run, and is meant to be executed on a GPU to attain reasonable speed". Does this mean I should not use the CNN model for a smartphone app and should instead use the HOG based model shown in face_detection_ex.cpp?

I have a problem compiling dlib using the variant "Compiling C++ Examples Without CMake". In detail: I have my own OpenCV project in which I include source.cpp. Everything seems to work fine as long as I do not build with CUDA support. If I define DLIB_USE_CUDA, I get lots of undefined references to some "cuda::functions". I also saw that in the dnn folder there are sources which define the namespace "cuda::"; that is where I can find the undefined function references.

Should I modify source.cpp to include those files from dnn? Could you give me some hints on how to build?

Note: If I build dlib standalone with CMAKE I do not have errors in dlib, but I need to use it in my opencv project.

I found a solution (it was cumbersome). In case someone would like to have a standalone build, I will detail the steps. Some steps might be redundant, but I am listing everything I did to get it to work:

1) Made a correction inside a CUDA header, /usr/include/cudnn.h: commented out a header line and inserted the one below it.

//#include "driver_types.h"
#include

4) Build the rest of the cpp files (note that I put both USE_CUDA defines; you can check in the sources which of them is correct, or simply leave them both in):

nvcc -ccbin g++ -std=c++11 -O3 -DDLIB_USE_CUDA -DLIB_USE_CUDA -I. -I../../ -c ./*.cpp -I/usr/local/cuda/include -lcudnn -lcublas -lcudart -lpthread -lX11 cuda_dlib.o

5) cp *.o

6) In your project, include source.cpp, then link with the object files you have copied.

7) Build the application. All the dnn functions will call CUDA primitives, so you will run on the GPU. I could verify that from the speedup of the training. At the beginning, running on 8 CPU cores only, I waited 40 minutes and there was no progress past the first step. Now I can see the fast processing on the GPU, and multiple iterations were saved.

I don't recall. But the speed difference between the two example programs isn't very large. You are seeing this speed problem because either you haven't enabled compiler optimizations, or you aren't using a BLAS library like the Intel MKL, or you are timing the whole program rather than the detector. Also: http://dlib.net/faq.html#Whyisdlibslow

Davis, the thing is that the upsampling in the example is what causes the sharp increase in running time for the net, which probably makes sense. Keeping the original image sizes, I get close to your timings even with OpenBLAS.
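The cost of upsampling can be estimated with simple arithmetic. The sketch below assumes that each upsampling step doubles both the width and the height of the image, which is how I understand dlib's default pyramid_up to behave, so the number of pixels the network has to scan grows by a factor of 4 per step. `pixels_after_upsampling` is a hypothetical helper written for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of pixels after `levels` rounds of upsampling, assuming every round
// doubles both the width and the height (a 4x increase in pixels per round).
std::uint64_t pixels_after_upsampling(std::uint64_t width,
                                      std::uint64_t height,
                                      int levels)
{
    assert(levels >= 0);
    std::uint64_t w = width;
    std::uint64_t h = height;
    for (int i = 0; i < levels; ++i) {
        w *= 2;
        h *= 2;
    }
    return w * h;
}
```

For a 640x480 image this gives 307,200 pixels without upsampling, 1,228,800 after one round, and 4,915,200 after two. If running time scales roughly with pixel count, a single round of upsampling would already account for a slowdown of about 4x.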

This detector at this stage is even better than most cloud-based detectors out there. Excellent work.

I've got a question about run time initialization. Let's say we have a network such that:

using net_type = loss_multiclass_log<fc<10,

...

function_x() {

net_type net2(num_fc_outputs(15));

...}

So I can change the number of outputs of the net dynamically. But as you can see, the variable net is local to the function scope.

So let's say I have a class, and in that class I have a method that trains the model. (The function "train" will retrain the model if new subjects were added, which means the net's output layer will be changed. It is also possible that other layers change.) I want the "net" variable to be a member of that class and not locally defined in the train function. In the train function, the net variable will have its output layer changed and then the network will be trained.

In another method I will use the "net" for prediction.

So I want something like this "net2.num_fc_outputs(15)" in a desired function

How can I update a global "net" variable if the templates are statically defined?

As I said, the model changes: today I have x objects, and after some days I could have x + N.

That's too bad. I am trying to increase face detection performance to 30 fps on 1280 x 700 video. Is it possible to use some kind of mask with the dlib face detector? For example, run a skin detector first and then process only part of the image?

Hello Davis. I want to create a detector which finds all trucks in an image using your dlib library, so I am trying to use the http://dlib.net/dnn_mmod_ex.cpp.html example. When I train my detector, should I put all possible trucks in all possible poses into the training dataset? Could you please give me a hint about which direction I should move in? Thanks

I am trying to train the CNN. I tried to edit the dnn_mmod_ex1.cpp program; the CNN architecture code is as follows:

template using con5d = con;
template using con5 = con;
template using downsampler = relu>>>>>>>>;
template using rcon5 = relu>>;
using net_type = loss_mmod>>>>>>>;

Also, mmod_options options(face_boxes_train, 40, 40)

Training completes successfully. But at test time, it shows the following error:

I'm facing a similar problem, but I'm using the CUDA code. I am using the provided dlib dataset and XML files. The only change I made, besides the changes stated in the face detector documentation, was to set the batch size to 50 instead of 150, as my CUDA memory (4GB) is limited:

cropper(50, images_train, face_boxes_train, mini_batch_samples, mini_batch_labels);

The generated dat file is 24MB, and running detection with it results in the error below. It seems that an average loss of 1.3 is the best I can get. Is this good enough?