Sunday, February 12, 2017

High Quality Face Recognition with Deep Metric Learning

Since the last dlib release, I've been working on adding easy-to-use deep metric learning tooling to dlib. Deep metric learning is useful for a lot of things, but the most popular application is face recognition. So obviously I had to add a face recognition example program to dlib. The new example comes with pictures of bald Hollywood action heroes and uses the provided deep metric model to identify how many different people there are and which faces belong to each person. The input images are shown below along with the four automatically identified face clusters:

Just like all the other example dlib models, the pretrained model used by this example program is in the public domain. So you can use it for anything you want. Also, the model has an accuracy of 99.38% on the standard Labeled Faces in the Wild benchmark. This is comparable to other state-of-the-art models and means that, given two face images, it correctly predicts if the images are of the same person 99.38% of the time.

For those interested in the model details, this model is a ResNet network with 29 conv layers. It's essentially a version of the ResNet-34 network from the paper Deep Residual Learning for Image Recognition by He, Zhang, Ren, and Sun with a few layers removed and the number of filters per layer reduced by half.

The network was trained from scratch on a dataset of about 3 million faces. This dataset is derived from a number of datasets: the face scrub dataset [2], the VGG dataset [1], and a large number of images I personally scraped from the internet. I tried as best I could to clean up the combined dataset by removing labeling errors, which meant filtering out a lot of stuff from VGG. I did this by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review to clean up the dataset. In the end, about half the images are from VGG and face scrub. Also, the total number of individual identities in the dataset is 7485. I made sure to avoid overlap with identities in LFW so the LFW evaluation would be valid.

The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all details on training and model specifics by reading the example program and consulting the referenced parts of dlib. There is also a Python API for accessing the face recognition model.
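To make the loss description above more concrete, here is a minimal Python sketch of a pair-wise hinge loss with hard-negative mining inside a mini-batch. It is an illustration under assumptions, not dlib's actual loss layer: the function name, the margin of 0.04, and the rule of keeping only as many of the hardest negative pairs as there are positive pairs are choices made for this example. Only the 0.6 distance threshold comes from the post.

```python
import math

def pairwise_hinge_loss(embeddings, labels, dist_thresh=0.6, margin=0.04):
    """Toy pair-wise hinge loss over every pair in a mini-batch.

    dist_thresh follows the 0.6 radius from the post; margin is an assumed value.
    """
    pos_losses, neg_losses = [], []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                # Same identity: penalize pairs farther apart than thresh - margin.
                pos_losses.append(max(0.0, d - (dist_thresh - margin)))
            else:
                # Different identities: penalize pairs closer than thresh + margin.
                neg_losses.append(max(0.0, (dist_thresh + margin) - d))
    # Hard-negative mining (assumed rule): keep only the worst negative pairs,
    # at most as many as there are positive pairs, so that the many easy
    # negatives do not dominate the batch.
    neg_losses.sort(reverse=True)
    hard_negs = neg_losses[:max(1, len(pos_losses))]
    count = len(pos_losses) + len(hard_negs)
    if count == 0:
        return 0.0
    return (sum(pos_losses) + sum(hard_negs)) / count
```

A batch whose identities are already well separated gives a loss of zero, while a same-identity pair that sits 1.0 apart contributes 1.0 - (0.6 - 0.04) = 0.44.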

Could you say a little more about what "graph clustering methods" you used here? I'm interested in using this on a dataset to cluster unknown identities. Right now I have a few ideas: 1.) just do k-means, 2.) do the n^2 comparisons, then do k-means on those rows, 3.) take each face and compare it to the n-1 others, assign it to the best match, and then at the end group all the faces that are part of the same set (don't know if there's a name for #2 or #3...)

The one you probably want to use is the one in the example program, the "Chinese Whispers" algorithm. The paper describing the method is referenced in the dlib documentation. It's a really simple iterative graph neighbor relabeling algorithm that gives surprisingly good results. It's what made the 4 clusters in this example. You don't even tell it how many clusters there are.
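For readers who want to see the idea, here is a small pure-Python sketch of Chinese Whispers style label propagation. It is a simplified illustration, not dlib's implementation. The function names, the unweighted edges, the fixed iteration count, and the smallest-label tie-break are assumptions made to keep the example short and deterministic. The graph is built by linking every pair of descriptors closer than 0.6, the radius mentioned in the post.

```python
import math
import random
from collections import defaultdict

def build_edges(descriptors, threshold=0.6):
    """Link every pair of descriptors whose Euclidean distance is below threshold."""
    edges = []
    for i in range(len(descriptors)):
        for j in range(i + 1, len(descriptors)):
            if math.dist(descriptors[i], descriptors[j]) < threshold:
                edges.append((i, j))
    return edges

def chinese_whispers(num_nodes, edges, iterations=20, seed=0):
    """Return one cluster label per node; the number of clusters is not given."""
    rng = random.Random(seed)
    neighbors = defaultdict(list)
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    labels = list(range(num_nodes))  # every node starts in its own cluster
    order = list(range(num_nodes))
    for _ in range(iterations):
        rng.shuffle(order)  # visit nodes in a random order each pass
        for node in order:
            if not neighbors[node]:
                continue  # isolated nodes keep their own label
            votes = defaultdict(int)
            for nb in neighbors[node]:
                votes[labels[nb]] += 1
            best = max(votes.values())
            # Adopt the most common neighbor label (smallest label on ties).
            labels[node] = min(l for l, v in votes.items() if v == best)
    return labels
```

With toy 2D "descriptors" such as three points near (0, 0) and three near (5, 5), `chinese_whispers(6, build_edges(points))` assigns one label to the first three points and a different label to the last three.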

There are also graph clustering methods like modularity clustering, which is also in dlib, but I've found on many problems that a simple method like Chinese whispers gives better results. Which is surprising considering how theoretically well motivated modularity clustering is.

As for what else I did to clean up the data: I would sort pairs of identities by their average similarity. That helped find cases where the same person appeared under two names. Then I would also sort all the images for a given person by how close they were to the centroid of their class. If you then look at that sorted list you can see obvious labeling errors accumulate at the end and remove them. There were a bunch of other minor variations on that kind of theme with a bunch of manual review. A LOT of manual review.
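The centroid-sorting trick can be sketched in a few lines of Python. This is only an illustration of the idea, with a hypothetical function name and plain tuples standing in for face descriptors. It does not reproduce the tooling actually used to clean the dataset.

```python
import math

def sort_by_centroid_distance(descriptors):
    """Return (index, distance) pairs, closest to the class centroid first."""
    dim = len(descriptors[0])
    centroid = [sum(d[k] for d in descriptors) / len(descriptors) for k in range(dim)]
    scored = [(i, math.dist(d, centroid)) for i, d in enumerate(descriptors)]
    # Likely labeling errors end up at the tail of this list.
    return sorted(scored, key=lambda pair: pair[1])
```

The entries at the end of the returned list are the ones to review by hand. The sort only ranks the candidates and does not decide which images are mislabeled.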

Thanks! I just looked into the Chinese whispers algorithm. It feels like a graphical version of the k-medoids algorithm, except you're changing the assignments of each item instead of changing the medoid assignment. It makes sense to me that it would converge on something useful if the initialization is good, but I would expect it to have similar problems as k-means where bad initialization can cause degenerate assignments. I'll run it a few times and look for the best results :)

You will be surprised. It's very good considering it's a really simple method. I'm still slightly mystified that it's better than modularity clustering but that's always been my experience.

I've also found that the random initialization is irrelevant. It always seems to converge to something pretty sensible. The only thing I can say that's bad, aside from the name being maybe slightly racist, is that sometimes I've found it useful to do some kind of post processing to clean up the results, e.g. looking at clusters, checking if any of them have a lot of edges between them, and merging them after the fact. But usually it's pretty good.

I used the code in python_examples/face_recognition.py to get descriptors for two given face images and then calculate the cosine similarity between these two 128D descriptors so as to verify whether these two face images are from the same person. However, I found that although the input images are not from the same person, the similarity will be very high (greater than 0.9). Actually, I used the images from LFW to verify the code.

Another great extension of the dlib library! Is there a reason the CPU HOG-based frontal face detector is used instead of the (more accurate) dnn version (except training a model for only frontal faces)?

I'm probably not going to post the data as it's a big dataset and I don't want to deal with hosting it. Also, the Microsoft celeb-1M dataset is out now which is bigger than mine anyway. So you might as well get that dataset instead.

I have tried to use dlib to detect anime faces but it only works less than 50% of the time. Is there any way I can tweak the code to do so without going through manual labeling and retraining models? Thanks!

I would like to play around with this Face Recognition network in combination with the OpenCV VideoCapture. The images from OpenCV (dlib::cv_image) are however in bgr pixel format and I am assuming that the face network is trained with rgb images. Would it make a big difference if I feed the network bgr images? Or does dlib have an efficient routine to convert from bgr to rgb?

You can also make a new input layer that reads directly from an OpenCV image if you feel the need. It's easy to do since the input layer interface you have to implement is fully documented: http://dlib.net/dlib/dnn/input_abstract.h.html#EXAMPLE_INPUT_LAYER

No, it's not required to retrain. The model posted wasn't trained on any of the faces/identities in LFW for example. The whole point of this type of model is that you don't need to do that kind of target specific training, which is why metric learning style algorithms are so popular for face recognition and verification right now. That's not to say that you don't, as a post processing step, combine some kind of target specific SVM or something that operates on top of the metric learning algorithm. People sometimes do that and it can improve verification. But you can also just do k-nearest-neighbors as your verification algorithm and that is pretty good too. Many things are possible. But in any case, no, you don't retrain the metric learning part.

Although, if you want to retrain or fine tune or do anything like that the API is fully documented. There are introduction examples to the DNN API as well as a full API reference. http://dlib.net/faq.html#Whereisthedocumentationforobjectfunction

As for training data, as I said before: I'm probably not going to post the data as it's a big dataset and I don't want to deal with hosting it. Also, the Microsoft celeb-1M dataset is out now which is bigger than mine anyway. So you might as well get that dataset instead.

Ah, I see. Thank you so much for your comprehensive reply. I will try it out for other image sets.

I tried it for the example file given in faces/2007_007763.jpg in the examples folder of the dlib Github repository, but the clustering didn't quite turn out correct. Is there any kind of preprocessing required for this to work out? Also, is there any necessity for more images of the same identity to be present for the clustering to work?

Nothing is perfect. The examples are what they are. What is best for any application depends on the details, and computer vision and machine learning are complex. I can always find some additional thing to do or change to some standard technique that makes something more or less applicable to any given problem.

I'm wondering if I did something wrong when compiling the dnn_face_recognition_ex.cpp since it appears to be very slow (it runs about 7 mins). Does it make use of the GPU? Do I have to enable something for it to do so?

Yeah, there is definitely some dataset bias. The training data I have, along with LFW, is definitely biased towards white guys in the sense that they are overrepresented in the data. I spent a while trying to gather non-white people for the training dataset to improve it but it's still somewhat biased.

David, thank you very much for this great work. Just a simple but intriguing question: Have you used a person of a different gender for hard-negative mining at the mini-batch level? Meaning a = female, p = female, n = male, or vice versa?

Sir, you are awesome, and the dlib library too. I really like your dlib library; it helped me a lot.

I am working in an image & video analytics team as a researcher in a company. I have around 2 years of programming experience in C++. Sir, how do I start writing code like dlib? I really like your C++ code and it does wonders. I sometimes find it difficult to write classes that are usable in C++. I really need your guidance on where to start and how to improve my code on a daily basis. Thanks in advance.


Thanks for this great work! I tried other pictures with non-bald faces and found that all of the non-bald faces are assigned to the same person, while the bald faces are correctly assigned to the right people. Is this a training problem? Or should I use a special way to run non-bald faces?

I'd like to use dnn for detection instead of the fhog detector used in this example, but it seems that the shape predictor cannot directly take the result that the net provides as input. How do I convert the net's output to something that can be used in this example?

Hello, I have a strange problem compiling dnn_face_recognition_ex.cpp. It just freezes: the CPU usage is at max, but nothing happens (I waited for 30 minutes). I figured out that it was caused by very long type names generated by templates. If I use the alevel4>>>>> type, it's OK, but alevel3>>>>>> causes the problem. The compiler is supposed to raise warning 4503 (unless disabled), not freeze. I tried installing the latest VS 2017 Enterprise, but it didn't help. What would you advise to work around the problem?

This only happens in visual studio since it has terrible C++11 support. You can make it work in visual studio 2015, but visual studio 2017 has even worse C++11 support than 2015 apparently (a lot of users who are trying VC2017 have been complaining to me).

I had to switch to Clang. In cmake you can still have the generator set to VS2017 but set your toolset to v141_clang_c2. Since then I have actually started to use a direct install of LLVM and I use the LLVM-vs2014 toolset (even though I use VS2017). I have altered the dlib cmake files a bunch to tell it not to disable features on MSVC if you are using clang, but I think you can still get what you want with the cmake files that come with dlib.

I am using your dnn_metric_learning_on_images_ex.cpp to train on images that are roughly twice as wide as they are tall. I am using the example code (dnn_face_recognition_ex.cpp) to evaluate the trained net. The random_cropper appears to transpose the rows/cols (line 219: get_rect(img)), returning incorrect crops. I swapped the rows/cols and now get better crops, but there still appears to be an error in cropper when handling non-square crops. I can share some images with you if you would like.

I also noticed that the cropper.set_randomly_flip() is set to true, which will feed mirrored faces back to the net. This seems incorrect, but you may have a good reason for doing it.

You have to set up the cropping in a way that's appropriate for your problem. There probably isn't "One True Random Cropper" that everyone can always use. Although I'm sure there might be usability improvements to the dlib random cropper object used in that example. But at the end of the day it's up to you to decide how to build the mini-batches.

I have two computers: one Mac and one Windows machine. When I use dlib's dnn to compute an embedding vector, embedding 1 image takes 0.05s on the Mac while on Windows it takes 0.3s. Why is there such a big difference (since the two machines are on par with each other in terms of specs)? My Mac config: Core i5, CPU 2.6 GHz, RAM 8 GB. My Windows config: Core i7, CPU 2.4 GHz, RAM 32 GB.

Hi Davis King, from your response to Kyle McDonald's comment, I learned more about how you clean the dataset. Thanks for sharing your experience.

However, I still don't understand how you used the graph clustering method. Is it used for either one of the purposes below? 1) Automatically merging same identities from different datasets (vgg/facescrub)? Or 2) clustering similar faces within an identity's folder so that you can more easily pick out the outliers manually?

Both. It's still a very manual process. You have to do a lot of review to make sure the labeling is going to be improved. These automated tricks are just to help you review the data and find labeling mistakes. They aren't going to create a cleaned dataset for you.

Hi Davis, first of all, great work (y). I just wanted to ask if there is a Python implementation of the recognition model that you described above. I found a recognition example, i.e. http://dlib.net/face_recognition.py.html, but this is very limited compared to your C++ example, so I was hoping you could provide an example in Python in which you draw the histogram and then apply the Chinese whispers algorithm! It would be a great help if you could do such a thing. Regards,

Hello Davis, could you please explain the threshold value (0.6)? Is it there by design? Can it be set lower/higher, and by how much? In 128D even the slightest increase/decrease of it should mean a HUGE difference in volume, am I right?

Thank you for your answer. Just one more question - only RGB images should be used for this particular pretrained DNN as an input? Could I use grayscale ones instead? Dlib's face detection eats them perfectly, but here I see no options. Though I'm quite a nub in face recognition, but for me it seems obvious, that such unreliable thing as color information (different light conditions, changed skin tint/make up, etc.) should offer not much of a real help for the process, am I right? Anyway - is there any reason for me to try to define something like input_grayscale_image, so the data could be transferred into tensors in a proper way?

You don't have to do anything. The existing code will load a jpeg or png or whatever and process it just the same regardless of it being color or gray. As for how well it will work without color, I have no idea. It's probably alright, but maybe not quite as good.

Thank you once again. Actually, I just try to deal with this DNN directly from my application, where I produce only grayscale images. Last question: what R,G,B average values (122.782, 117.001, 104.298) are there for? I cloned your input_rgb_image_sized to the new input type (all the same, only luminance will go to R,G,B), but have no idea about these offsets. So, in order to achieve best results, should it be put there in the same asymmetric way, or just be in [-0.5, +0.5] bounds, simply copied for all color channels?

Hi, in dlib I find many examples that can differentiate faces. Can it differentiate different objects like bottles or bags based on their color? And which algorithm do you think can help? Any link or suggestion will be very helpful. Thanks in advance.

If you get a big training dataset I'm sure you could make something that does that. There are links to the example programs that show how to train new models referenced from this post. In particular, read the C++ face recognition examples.

I'm using openface: http://cmusatyalab.github.io/openface/demo-3-classifier/ for training a classifier. I have 1,500 people with around 50 images per person, which makes 80k images and generates a huge classifier that takes ~35 seconds to predict a person.

Now I want to scale that up, but if I increase the number of people to 10k it will take forever. My current machine is what I have. Will your approach have better performance? I mean in time, not in accuracy. Or could you give me some advice about what to improve/change/use?

You aren't going to make a very good model with a dataset that small. Fortunately, you can use the free model that comes with dlib. It is trained on millions of faces and gets state-of-the-art accuracy on the standard LFW benchmark for face recognition.

I think I didn't explain myself properly. I don't want to generate a new model. From what I got from your post, you have an image of different actors and it can cluster the faces by actor. I guess you just put the names in the picture yourself. But I have 80,000 images right now and I'm planning to expand to 800,000 if I keep having 50 images per person. How can I take a new picture that isn't in this set and predict which person it is most similar to, as openface is doing?

Just briefly, congratulations on the dlib library. It's fantastic and I'm only just getting to know it, as it's helping both speed up my code and make it work with higher accuracy!

I have a question regarding face descriptors. I am tracking unique faces seen in video clips in a real-time application. So with a stream of frames containing faces, I already compute the euclidean distance between faces to ensure I haven't seen a new one. If I do see a new face, I collate descriptors into an inventory for that face.

I'm wondering, for the inventory (cluster) of descriptors that I've already gathered for each 'unique face', would there be any benefit in computing an 'averaged descriptor' for each person - say over 50 frames?

My thinking is that this might help identify that face more accurately (or in 'fringe cases'), because it could account for a moving mouth (while speaking) and blinking eyes and angles of the face nodding a head etc

I only used the phrase 'averaged descriptor' in an abstract sense as I have no idea mathematically how best I would do this if it were deemed a good idea - would I literally do an index-wise average over 50 vectors to produce a single vector?

Trying to make an average isn't usually going to work very well, due to the unintuitive geometry of high dimensional spaces. To be specific, suppose you have two sets of points in 128 dimensions, call them A and B, such that all the points in A are within 0.6 distance of each other, and similarly all the points in B are within 0.6 distance of each other. Moreover, suppose that none of the points in A are within 0.6 distance of any point in B.

It is surprising, but true, that it's quite likely that the centroid (i.e. the average of all the points) of A and the centroid of B are less than 0.6 apart. This kind of thing can happen in low dimensions as well, but it becomes increasingly likely as the dimension goes up.

So using an average is generally not a good idea. You should instead use a k-nearest-neighbor type of algorithm to do classification.
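A k-nearest-neighbor classifier over stored descriptors can be sketched as follows. This is a minimal pure-Python illustration with assumed names and parameters (`knn_identify`, `k=3`, and the convention of returning `None` for an unknown face), not code from dlib. The 0.6 cutoff is the distance threshold discussed in this thread.

```python
import math
from collections import Counter

def knn_identify(query, gallery, k=3, max_dist=0.6):
    """gallery is a list of (name, descriptor) pairs. Returns a name or None."""
    # Sort all stored descriptors by Euclidean distance to the query.
    dists = sorted((math.dist(query, desc), name) for name, desc in gallery)
    # Keep the k nearest neighbors that are also within the distance cutoff.
    close = [name for d, name in dists[:k] if d < max_dist]
    if not close:
        return None  # nobody in the gallery is close enough
    # Majority vote among the remaining neighbors.
    return Counter(close).most_common(1)[0][0]
```

Each identity keeps every descriptor collected for it instead of a single average, so a query only has to land near some of the stored samples of the right person.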

Very simple question on the use of dlib in this example: I am interested in comparing an unknown face one by one with a large number of known faces. I can see in the example code that: face_descriptor = facerec.compute_face_descriptor(img, shape) will give me a 128D vector for my unknown face.

If I have a database of all the 128D vectors for all the known faces, how can I compare two 128D vectors to get the distance between them (i.e. the similarity of the faces)?
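One way to do this, going by the post's description of the model (identities projected into balls of radius 0.6), is a plain Euclidean distance between the two vectors. The sketch below assumes the descriptors have been converted to ordinary Python lists or tuples of floats. The function names are hypothetical and nothing here is a dlib API.

```python
import math

def same_person(desc_a, desc_b, threshold=0.6):
    """Treat two descriptors as the same person if they are closer than threshold."""
    return math.dist(desc_a, desc_b) < threshold

def closest_match(unknown, known):
    """Return (index, distance) of the known descriptor nearest to unknown."""
    distances = [math.dist(unknown, desc) for desc in known]
    best = min(range(len(known)), key=lambda i: distances[i])
    return best, distances[best]
```

For a large database the linear scan in `closest_match` is the simplest option. Whether the best match is accepted still depends on comparing its distance against the threshold.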

Hi Davis, very nice work with dlib! I'm a PhD student working in Face Recognition and I have used dlib a lot for face detection, landmark localization, tracking, etc. with remarkable results. Now, I'm trying to replicate your results following the LFW protocol. Doing so, a first question arises: which images did you use? As you know, there are different sets of the LFW database according to the alignment method used, i.e., the original aligned, funneled, deep funneled or lfw-a images. Or did you perform a different alignment/preprocessing on the images?

Fantastic stuff. Thanks for all you've done! I am doing face detection / recognition on IR images. This means I cannot use the standard features for detection or recognition. I am trying to build my own detector using your "train_object_detector.py" and it is working really well, mostly. I have a training set of faces that are all one size, and the detector is finding faces of similar sizes but completely missing smaller face images.

So my question is how does the system work to detect faces of different sizes. Do you need to have training samples of all the sizes that you expect to be finding? Or does the system take in the training images and resize them?

If you could clarify how this process works and what kind of training set I need and how it works to find faces of different sizes, I would really appreciate it. I have the recognizer working well, I just need to find the faces.

You can't control it through python. But more importantly, you should read about image pyramids if you want to understand what it's doing. Wikipedia explains them well. And more to your question, the detector finds all objects in an image bigger than the "detection window" which is user specified. If you want to find smaller objects either train with a smaller window or resize your images so they are bigger. Usually resizing is the best strategy.

The face recognition tool uses the DNN distance metric, and I was trying to train it on a GPU with some data. The GPU utilisation is under 7%, drops to 0 after a while, and the whole process is very slow. Is there anything I can do to make the whole thing faster?

Hi. Great stuff, thanks for this. I have a question. I have been using the code in face_recognition.py to try out the LFW protocol myself. As it says, I should get 99.13 without the third parameter, but somehow I only get 99.02. Could it be that the file face_recognition.py is missing something that you used while testing?

Also the default dlib face detector sometimes misses out some face detections while testing over LFW.

LFW isn't about testing detection, just recognition. So you have to measure the accuracy of only the recognition component. This means you don't throw away any images in the LFW set when you do the evaluation.

I have trained my own IR face detector using train_object_detector.py.
I have 2 sizes of faces: a.) 362x292 face and b.) 108x82 face inside a 362x292 image.
I set options.upsample_limit = 4; and options.detection_window_size = 80*80;
And I trained only on the a.) faces.
When I run the detector it finds the a=362x292 sized faces but not the b=108x82.

When I manually resize the b. by 2x, i.e. 724x584, it finds the resized 108x82 faces.

Shouldn't the upsample_limit be doing this 2x resizing and finding the b. images? Otherwise it would be upsampling the 724x584 by 2, 3, & 4 and this would overlap the 2, 3, 4 upsampling of the 362x292 images: 4x(362x292) = 2x(724x584). So, if the upsampling was working it should get the b. images.

Thanks, Jon

upsample_limit applies only during training. When you use the detector it's up to you to prepare your image by upsampling, downsampling, cropping, or whatever else you think is appropriate before you run the detector.

Hi, I have my own database of 3 persons and want to use them as references to check if they appear in a picture. I tried to train a model using dnn_metric_learning_on_images_ex but each time it gives me an error in dlib/dnn/loss.h. Can you help me use my database for face recognition here? Thanks in advance.

Thanks again for the help. However, what documentation are you referring to? I've read the docs for loss_metric and many others, but haven't seen explicit mention of that... Perhaps that's because a number of classes are involved and I'm not looking at the right one. I will continue reading up.

Also, am I right in understanding that this all happens in a single thread? Looking at the docs/headers I'm trying to figure out if it's already parallelized, but can't figure it out.

BTW, even aside from your great support, I am surprised at how easy it was to get everything working and how well it works. Amazing stuff.

Practically all of the documentation. For instance, didn't you notice you can call get_output() on different layers on the net to see the output of that layer? How could that work if the output wasn't stored in the net?

I have been toying around with converting the model to Tensorflow, which I'm more familiar with than C++. Is there any preprocessing on the input images before they are fed to the network? It looks like the class input_rgb_image_sized does some subtraction of RGB-mean values and divides by 256, is that also performed on the input faces for this network?

Also, I see the face landmarks are passed to the facerec model. Is this to do some fancy face alignment before feeding to the network?

Thanks for the swift reply! I'm not fluent in C++ so this might be a stupid question, but I see input_rgb_image_sized<150> is the first "layer" of the network. Does this mean it automatically does the mean subtraction on input images (based on the to_tensor function in class input_rgb_image_sized from dlib/dnn/input.h).

I'm asking because I get different results on the same image in my Tensorflow version of the network and the dlib implementation, and I'm wondering whether it's some bug of my own making or simply differences in preprocessing.
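For anyone comparing preprocessing between implementations, the normalization described in these comments (subtract a per-channel average, then divide by 256) can be written out as below. This is a sketch based only on what the commenters report about input_rgb_image_sized, using the average values quoted earlier in the thread. Whether it matches dlib's code exactly should be checked against dlib/dnn/input.h. The function names are hypothetical.

```python
# Per-channel averages as quoted in the comments above (R, G, B).
AVG_RGB = (122.782, 117.001, 104.298)

def normalize_pixel(r, g, b):
    """Subtract the per-channel average and scale by 1/256."""
    return tuple((c - avg) / 256.0 for c, avg in zip((r, g, b), AVG_RGB))

def normalize_image(image):
    """image is a list of rows, each row a list of (r, g, b) values in 0..255."""
    return [[normalize_pixel(*px) for px in row] for row in image]
```

A port that feeds raw 0..255 values, or values scaled to 0..1 without the offsets, would hand the network different inputs than this, which is one possible source of mismatched outputs.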

Hello Davis, I admire your work very much, and wish to congratulate you on it! I want to build a gender recognition algorithm with age classes (baby, child, teen, young adult, adult, senior) with the help of dlib. What are your thoughts about that? How can I train a classifier (like an SVM) to do that? I've already found about 3000 images of people and spread them across the above classes. Thanks a lot, George!

I tried using dlib for face recognition. I tried it with both C++ and Python, but the problem is that I am getting a different face descriptor vector from each. Do you know any reason why this is happening?

Yeah, my bad, it gives the same output. But the problem is that C++ gives it in float format and Python gives it in long double format, which leads to approximation in C++. I tried to define the matrix as long double in C++ but it gives an error that net(faces) is of matrix type and can't be converted to matrix. Can you suggest what is to be done? I am a novice in this so please don't mind all the beginner questions... :)

Dear Davis, in the process of building a classification program, I am trying to use FHOG (extract HOG features from images) and feed those features to a one_vs_one trainer. But I can see in some of your examples that it takes as input matrix samples, while FHOG gives array2d >! Should I try to convert those, or should I use a different trainer? I'm a little confused because there are so many examples about classification... so maybe I should try another (easier) project instead! Thanks!!

But it's very sad to see that the software has a huge racial bias (like the one Google had): it can differentiate "white people" well, but it does not differentiate "black people", so it sorts all black men together into one group and all black women together (with one mismatch where a woman is sorted with the men). This scenario was not "specially constructed"; it was simply a first try at testing the algorithm on a "wild scenario". The image I used was from "Heart to Heart International: Our People of the Year - Ebola fighters" (www.hearttoheart.org/our-people-of-the-year/, the big poster at the top of the site). Is it possible to avoid this?

And I have three more interesting questions, as follows. 1. Assume the DNN is loaded with "pretrained" weights. 2. After that it will recognize/compare the face of a new (unknown/unseen) person with some probability Pn. For recognition I will compare some number (ideally one) of nearly the same pictures Px (from a video stream) of the person with one or more "template" picture(s) Pt of the same person (which in turn are also nearly identical, and ideally one picture). But Px is not necessarily "nearly identical" to Pt.

Question nr. 1: to recognize a person as a specific known person, is it better to compare with one template picture or with several (nearly identical) pictures?

Question nr. 2 is more complicated. If it is also known that some pictures Pt (nearly identical to each other) are all of the same person, is it possible and reasonable to "continue training" the DNN weights, and how, so that the new recognition rate would be Pn+1 > Pn?

Question nr. 3 is a really experimental one: has anybody tried to combine this DNN method with the Eigenfaces/Fisherfaces method so that the DNN recognizes using "back projected" faces? This of course assumes that many more (50 or more) variations of the unknown "template" person's face were recorded. Or would it lower the recognition rate and raise the false-positive rate?

My question is about using a single face descriptor computed from several images of the same person. In a previous post you said it's not a good idea to average all the descriptors. However, as far as I understand, you do exactly this when using jittering (matrix v1 = mean(mat(net(crops)));).

Can you tell me which would be the best option? How about computing all distances between each other and taking the mean distance?

Hello! Thank you very much for your wonderful library, it's a great job!!

I probably have a similar question to several previous ones. I hope I will not be too brazen.

I'm looking for faces with video, that is, I can get many vectors of the same person (by tracking the face with the help of correlation_tracker). My task is to determine how many different people passed the camera. If a person left the frame and lost the track, and then came in, I must count it 1 time.

Often the photos are of poor quality (I try to find the best photos using the definition of the head tilt and blur, but this does not give good results), besides they are made from a great distance.

Now I take one photo and compare its vector with the saved ones, one for each person. Then I look for the minimum Euclidean distance. If none is less than 0.6, I add this vector as a new person. It works badly because of the poor quality of the photos, especially if bad photos are used as the originals that other photos are compared against. They give false positive results with others that are not even similar.

I looked at the K-Nearest Neighbors algorithm, but it requires a trained classifier model. Initially, I do not have a model. I do not know who these people are.

My idea is this: if I have, for example, 10 photos of the first person, I add them as a separate class. For the second person I also have 10 vectors. I can compare each vector of the second person with the model and get 10 results (some may be assigned to the first class, some not). Then some algorithm (I do not know what to apply here) has to decide whether or not to add a new class to the model, built from the 10 vectors of the second person. Am I thinking about this correctly, or can I use some other classifier or clustering method here?

I understand how to use K-Nearest Neighbors if I had 10 photos of each person in advance, with their vectors stored in the model, and then compared a single photo against the model. But in my case the situation is different.
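One way to picture the scheme described in this comment is a greedy, online grouping of whole tracks rather than single photos. The sketch below is only an illustration under stated assumptions: every name in it is hypothetical, the median is an arbitrary choice made so that a few bad frames do not dominate, and 0.6 is simply the threshold mentioned in the post. It is not the clustering used by the dlib example program.

```python
import math
from statistics import median

THRESHOLD = 0.6  # threshold mentioned in the post; likely needs tuning per dataset


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def track_to_identity_distance(track, identity):
    """Median of all pairwise distances between a track and a stored identity."""
    return median(euclidean(t, s) for t in track for s in identity)


def assign_track(track, identities, threshold=THRESHOLD):
    """Merge the track into the closest identity, or start a new identity.

    `identities` is a list of lists of descriptors and is modified in place.
    Returns the index of the identity the track ended up in.
    """
    best_index, best_distance = None, float("inf")
    for index, identity in enumerate(identities):
        distance = track_to_identity_distance(track, identity)
        if distance < best_distance:
            best_index, best_distance = index, distance
    if best_index is not None and best_distance < threshold:
        identities[best_index].extend(track)
        return best_index
    identities.append(list(track))
    return len(identities) - 1


# Toy 2-D stand-ins for the 128-D descriptors (hypothetical values).
identities = []
first = assign_track([[0.0, 0.0], [0.1, 0.0]], identities)
second = assign_track([[2.0, 2.0], [2.1, 2.0]], identities)
returning = assign_track([[0.05, 0.05]], identities)
print(first, second, returning, len(identities))
```

The number of entries in `identities` at the end is the count of distinct people under these assumptions. Because assignment is greedy, an early bad track can still pollute an identity, so filtering low-quality frames first remains worthwhile.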

I've done some experiments on face recognition and it is really fun :) One example was really surprising: I have two men with three different photos of each. The two men look very different, and moreover one of them wears glasses in all three photos. For some reason both persons, i.e. all 6 photos, were grouped as the same person. I simply don't understand why. Are you interested in seeing this result (I can't make it public)?

You also wrote that the total number of individual identities in the dataset is 7485 and that the whole dataset is a mix of different datasets and other people from the internet. Can you please explain which datasets were used?

I also asked earlier whether it is possible to continue training a network as new data arrives, without retraining the whole network from the beginning. If that were possible it would be a great addition, because the network could get better continuously over time without too much time spent on retraining.

I also have another very interesting question: your network computes a 128D vector from a face, and 128D is a widely used dimension for such a task. Why 128? Can we get better results with a higher dimension, up to 256, and how much would the training/prediction time grow?

And another question: why was 0.6 used as the distance threshold for training/prediction? What happens (and how can we do it without your unpublished database) if we want to make it smaller, say 0.1?

Hello again Davis,

One more question: is it possible to feed imglab with rectangles of a fixed size? I mean, instead of dragging to set the rectangle's size, just choose a point that denotes the rectangle's center, with the rectangle's size fixed (NxN). I'm asking because I'm finding it hard to annotate rectangles of the same aspect ratio, since the objects I'm trying to detect can be rotated... Thanks a lot!

Hello Davis. I find many pairs of people who have a Euclidean distance of less than 0.6 and yet are different people. And such cases are much more frequent than 0.62%. When working with video, over a few frames, in most cases I also get this situation for all the frames.

In these images the distance is about 0.53:
https://drive.google.com/open?id=0B3_jMty60ScdXy1aVGt4UlEyRjQ
https://drive.google.com/open?id=0B3_jMty60ScdM3VZZGRqaUIzNFk

If I use a value of 0.53, it looks better...

(For video I try taking 6 photos, compute the distances from all the old vectors to all the new ones (the ones to be inserted), select the 24-36 smallest distances (if there are any), and look for the dominant class. If more than 70% of those combinations fall in that class, I treat it as the same person. Otherwise I add all 6 new vectors as a new class. It helps a little. Mostly the error carries through all the photos...)

You can't fix the size of rectangles in imglab. I would recommend drawing accurate bounding boxes in the tool. Then if you want to force them to all have some property, like a certain aspect ratio or some other thing then write a script that reads in the xml file and applies whatever kind of transformation you want. It's easy to do. The routines dlib uses to read and write those xml files are part of dlib's documented API: http://dlib.net/dlib/data_io/image_dataset_metadata.h.html
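As a rough illustration of the kind of script meant here, the sketch below forces every box to a fixed square size around its original center. It uses Python's standard xml.etree.ElementTree module rather than the dlib C++ routines linked above, assumes the usual imglab layout of `<image>` elements containing `<box top left width height>` entries, and the function name and sample data are made up for this example.

```python
import xml.etree.ElementTree as ET


def make_boxes_square(root, side):
    """Resize every <box> to side x side pixels, keeping its center fixed."""
    for box in root.iter("box"):
        left, top = int(box.get("left")), int(box.get("top"))
        width, height = int(box.get("width")), int(box.get("height"))
        center_x = left + width // 2
        center_y = top + height // 2
        box.set("left", str(center_x - side // 2))
        box.set("top", str(center_y - side // 2))
        box.set("width", str(side))
        box.set("height", str(side))


# Hypothetical sample in the imglab layout assumed above.
SAMPLE = """<dataset>
  <images>
    <image file='example.jpg'>
      <box top='10' left='20' width='40' height='60'/>
    </image>
  </images>
</dataset>"""

root = ET.fromstring(SAMPLE)
make_boxes_square(root, side=50)
box = root.find("./images/image/box")
print(box.attrib)
```

For a real annotation file you would load it with `ET.parse(path)`, call the function on `tree.getroot()`, and save the result with `tree.write(new_path)`. Boxes near the image border may end up partly outside the image, so clipping may be needed depending on the application.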

Alexander, it does not seem that the mouth and eyes affect the result. In general, as far as I can see, the results are quite accurate regardless of facial expression, which surprises me. I used 2 frames with eyes open and the mouth in different positions, and they were also grouped into one cluster: https://drive.google.com/open?id=0B3_jMty60ScdTlVXeDkycTVXTlE
Maybe this is the fault of makeup...

There is a certain error rate, especially in the clustering algorithm used in the example program. Sometimes you might have to adjust the threshold or do some other thing. The point of the example program we are talking about here is to educate and be an introductory document that helps the reader begin to understand and work with face recognition systems. It's not a turn-key problem solver that you just run and it does all things for all applications. It's just a jumping off point into the deeper world of face recognition.

I don't think that this is just an issue with the clustering algorithm, but of the false-positive rate when comparing any two given faces. I've noticed in my tests that there is some fraction of faces of different people that will compare with a distance lower than 0.6.

"given two face images, it correctly predicts if the images are of the same person 99.38% of the time."

I think this is the trip-up. I think 99+% is optimistic, but I haven't done systematic testing to get the error rate that I'm seeing (yet)...

1. Is there more detailed documentation (or can you explain here) of what the 3rd argument means, i.e. the -1? Specifically, how does it affect the output, and how do I know how to set it?
2. Also, what does the idx value mean? --> face_type:x.x Is there a description of what each face_type value means?
3. How is the score calculated? Is there a reference for this?
4. Does the 2nd argument (1 above) upsample the image if set > 1?

The code is well written and documented, I just need more information about these values.
Thank you, Jon

There are 5 sub-detectors in the default face detector model, each matching a different face orientation. If you run it on some images and plot the outputs you will see the variation in detection pattern as a function of sub-detector.

As for questions about accuracy, the pretrained model gets 99.38% accuracy on the LFW benchmark. It should be noted that LFW is heavily biased towards white adult American public figures. The model was also trained with a large dataset with a similar bias, so this creates an obvious problem for images of people that don't naturally match that distribution. It should still work well in many cases, but likely would require a different threshold. It should also be noted that usually when you make a detector for a specific person you will train something like a linear SVM on the 128D face vectors. The example program in this blog post uses something more akin to a k-nearest-neighbor method because it makes for a fun example, not because it's the best thing to do in all cases.

Moreover, even in the good case of data like LFW, a 99.38% accuracy doesn't mean there won't be mistakes. For instance, if you have 30 images and you want to compare them all to each other, you have to make 435 comparisons because that's how many pairs there are if you have 30 things. Therefore, 1-0.9938^435 is the probability that at least one of those comparisons makes a mistake. This is 93.33%, so very likely.
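The arithmetic in this paragraph can be reproduced with a few lines of Python. This is only a sketch of the same calculation: the function name is made up, and it assumes, as the paragraph implicitly does, that each pairwise comparison is an independent trial with the same accuracy.

```python
from math import comb


def prob_at_least_one_error(n_images, accuracy):
    """Return (number of pairs, probability that at least one comparison is wrong).

    Assumes every pairwise comparison is independent and has the same accuracy.
    """
    pairs = comb(n_images, 2)  # n choose 2 distinct pairs
    return pairs, 1.0 - accuracy ** pairs


pairs, probability = prob_at_least_one_error(30, 0.9938)
print(f"{pairs} comparisons, P(at least one mistake) = {probability:.2%}")
```

Calling the function with larger image counts shows how quickly the chance of at least one mistake approaches certainty, since the number of pairs grows quadratically.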

So you have to think carefully about how to use a system like this if you want to get good results. Be aware of the details.

As you can see, your work is of huge interest to the community :) Because of that, more questions will keep coming, but we can't really do the research and answer these questions ourselves because we lack a training database. You have already invested a huge amount of time in creating that database. Maybe this is the time for all of us to cooperate, since many additional databases have been published, but we don't know (I have already asked this) whether some of them are already integrated into your training database.

At the beginning of this blog you wrote "The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6."

Then ARBaboon wrote: "I find tweaking the threshold does wonders. I understand the net was trained to 0.6 but I have better results at 0.45 . This is only an observation."

Lastly, Lubyagov Nickolay said: "But basically, all the photos I tried have a good result, with a value of 0.53 (If the value is greater, I have errors)."

At the same time you wrote "Training took about a day on an older titan x."

So my question is whether it makes sense, and is possible for you, to simply provide us with trained network weights for additional projection radii of 0.45 and 0.53? And maybe for a higher-dimensional output than 128D?

And of course it would be great if we could integrate our experience into the already trained network, for example by providing "negative" examples where the classification output fails. Is it possible at all to supply a "negative match" to the training algorithm in order to improve the robustness of the result?

The Microsoft celeb-1M database is available online and is larger than the dataset I used for training. So nothing is stopping anyone from doing experiments themselves. There is a whole example program that explains how to train new models in the examples folder.

Also, I'm too busy doing other projects to retrain the network. And retraining with different thresholds is not an interesting experiment in any case.

Davis, in regards to the face_detector.py example, are you saying that dlib.get_frontal_face_detector was not trained with an SVM classifier but rather a K-NN type classifier?

Also, is there a listing of the dlib.get_frontal_face_detector code and class dlib.fhog_object_detector? I cannot find it in your github. I did find this at: http://dlib.net/python/index.html#dlib.get_frontal_face_detector
But all that says is:

dlib.get_frontal_face_detector() → fhog_object_detector
Returns the default face detector

And that does not tell me much about how it works. Same for class dlib.fhog_object_detector.
Thanks, Jon

http://dlib.net/python/index.html has documentation for fhog_object_detector. It lists the methods available and what they do. The code for all this is on github, the python bindings are in the tools/python folder.

First of all, thanks for sharing this really great work! I'm using another face detector before using your face recognition API. However, some of the detected faces are not found by the dlib detector. Therefore, I've tried using the dlib CNN directly on all crops, without the 68-point shape alignment (I made sure to rescale them to 150x150). From the tests I've made, it seems the recognition accuracy drops very strongly when the faces are not aligned. Is this to be expected? Was the face CNN trained only on aligned faces detected by the dlib detector?
Thanks, Ran

I’ve cropped all the images from Microsoft’s dataset and began training. Yesterday I tried jittered (50 copies) for each image and it took QUITE A WHILE and made the 200 GB dir into a whopping 10 TB one. load_objects_list alone took more than 5 hours. So I instead put jittering (1 copy) into the dnn training code and pushed the jittered image into the images vector and it started training.

dlib compilation was able to recognize cuDNN:

-- Found Intel MKL BLAS/LAPACK library
-- Found CUDA: /usr/local/cuda/8.0.44 (found suitable version "8.0", minimum required is "7.5")
-- Looking for cuDNN install...
-- Building a CUDA test project to see if your compiler is compatible with CUDA...
-- Checking if you have the right version of cuDNN installed.
-- Found cuDNN: /usr/local/cuda/8.0.44/lib64/libcudnn.so

But apparently dnn_metric_learning_on_images_ex is not using GPU as I watched the output of nvidia-smi, which shows no job and memory consumption.

I’ve set the number of data_loaders to 1 and monitored the CPU consumption. When I pin the program to one CPU with `taskset -c 0`, one CPU is used at 100%, and when I do not pin it, all available CPUs are used, which would mean the CPUs are being used for training (instead of the GPU). Interestingly, I print a line in each trainer.train_one_step step, and using 1 or all CPUs gave me about 1 line of output per second. Why do more CPUs (28) not lead to a big performance boost? Is it because the mini batch is too small?

If I could get the NVIDIA P100 (16 GB) to work, what numbers would you recommend for load_mini_batch(5, 5, rnd, objs, images, labels)? Does it depend only on how much 16 GB can hold?

I got it working. It was a remote submission system where the worker nodes share the same disk space and CUDA libs as the terminal nodes but I compiled it on a terminal node without the actual GPU. Maybe that's the reason? But anyway it works.

The training has finished. Just to provide some reference, it took ~5 hrs on a P100 with 4 CPUs to make sure the queue is always full. Increasing the thread count at that N_CPU from 5 (default in program) to 20 helped filling the queue, but I set the thread count to 30, which still did not saturate the CPU usage (250% instead of 400%). I suspect there are other bottlenecks, because the filesystem program mmfsd is running crazy, maybe taking care of the file loading.

I set the steps without apparent progress to 10000 and here are the last few lines of training:

The success rate is 92.3%. Are this and the average loss about the same as in your experiments? To get a higher value, should I decrease the learning rate threshold to a smaller number, or increase the steps without apparent progress?

I don't remember what the output was. It doesn't really matter though. You need to evaluate against some benchmark and follow their protocol to see how well you are doing. Only experimenting will determine what works and what doesn't.

I noticed that dnn_metric_learning_on_images_ex.cpp uses input_rgb_image as the input layer, while the python binding and the LFW test suite use input_rgb_image_sized. This leads to serialization problems. Since I already have a couple of trained models, is there a way to convert models with input_rgb_image to models with input_rgb_image_sized?

I am using this on real-time video, and when tested on a video sequence, the line std::vector<matrix<float,0,1>> face_descriptors = net(faces); takes about 700 ticks (0.7 sec), which is acceptable, but is it possible to make face recognition quicker by turning some knobs?

I still want to use the pre-trained model (dlib_face_recognition_resnet_model_v1.dat), so to my understanding I can't change anything in the layers and pooling of the loss-metric net. By the way, I am already using Release mode with AVX instructions.

You should compile against an optimized BLAS library like the Intel MKL if you want to run on the CPU, or even better, run on the GPU by installing CUDA and cuDNN. If you install either of these things dlib's cmake scripts should automatically find them and build against them.

The python API doesn't support doing that. You can do it via the C++ API though.

It should be emphasized that the network expects a certain kind of cropping and alignment. So if you aren't cropping the faces the way the dlib code does it then the face recognition accuracy will suffer.

I already have cropped and aligned faces stored in a directory. We are processing video streams and we don't want to run face detection on the same frame again and again. We are experimenting with different classifier parameters to see whether accuracy improves or not. I wish the python API had that feature.