This project was commissioned by Bluusun LLC, a company serving the Cupid Media network. The customer wanted a rating system that could score each user's face photo using its own internal algorithms, without involving other users' input. At the time they were starting a pilot project to introduce a new matching system and use it with the existing databases of their network of *cupid websites. The new project is supposed to appear at mightydating.com. So I decided to develop a human facial beauty estimator based on one of the existing facial feature extraction libraries.

Model search

The first attempt was to use the cmusatyalab OpenFace library for face feature extraction. The library is aimed at face recognition tasks and extracts features related to face geometry. This apparently means losing some information useful for a beauty estimator, such as skin texture and hairstyle. However, I gave it a try, given that face geometry makes the greatest impact on overall facial beauty perception. The library uses dlib for face extraction/alignment and Torch as its deep learning framework.

Initially I used the database of the thaicupid.com website, which was provided in JSON format. I wrote a Python script to scrape about 4000 images from this dating website. I decided to start with a binary classifier able to separate the top 10% of images from all the others. I labeled the images manually with a 1/0 rating.

OpenFace is distributed as a Docker container, so I had to push my code into the container each time to run it. The following script runs a Python script, provided as an input argument, in the openface directory inside the container. It also mounts the current directory so that the script can read the inputs it needs.

I then used the sklearn framework to train an SVM classifier. I actually tested all the provided classifiers, but the others performed worse than SVM. First of all, the data was scaled with StandardScaler.

For skewed data one has to use the "balanced" class weighting of the SVM classifier (class_weight="balanced" in sklearn's SVC). This gives more weight to the less frequent labels ("1", or "beautiful", in this case); otherwise they would have almost no effect compared to the "0" samples, given the summation in the SVM loss formula.
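To make the effect of the "balanced" setting concrete, here is a small sketch of the weighting heuristic that scikit-learn documents for it, n_samples / (n_classes * class_count). The exact 400/3600 split is an assumption derived from the "about 4000 images, top 10%" figures above, and balanced_class_weights is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Assumed split: 4000 labeled images, 10% of them rated "1".
labels = [1] * 400 + [0] * 3600
weights = balanced_class_weights(labels)
print(weights)
```

With this split every "1" sample counts nine times as much as a "0" sample, so both classes contribute equally to the loss in total.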

Then I used GridSearchCV to search the hyperparameters C and gamma for different SVM kernels. Since the data is skewed, I used F1 scoring. The matplotlib library was used to visualize the results as a heatmap (many more values were actually searched):
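A minimal sketch of such a search, not the original script: the data is a synthetic stand-in generated with make_classification (128 columns and roughly 10% positives are assumptions), and the grid values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the extracted face features (assumed shape and skew).
X, y = make_classification(n_samples=400, n_features=128, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)

# Scaling lives inside the pipeline so it is re-fitted on every CV training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC(class_weight="balanced")),
])

# Illustrative grid; gamma is ignored by the linear kernel.
param_grid = {
    "svm__kernel": ["rbf", "linear"],
    "svm__C": [0.1, 1, 10],
    "svm__gamma": [1e-3, 1e-2, 1e-1],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(X, y)

# One mean F1 value per grid point; a heatmap can be drawn from this array.
mean_f1 = search.cv_results_["mean_test_score"]
print(search.best_params_, search.best_score_)
```

Reshaping mean_f1 by kernel, C and gamma gives the matrices that can be passed to matplotlib for the heatmap.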

Later I discovered that the model was suffering from a high bias problem. At that moment, however, due to lack of experience and some ambiguity in the curves, I diagnosed a high variance problem. I therefore decided to perform some feature selection based on finding correlations in density histograms, but to no avail (some transformations were actually applied to the graphs, but I omit them).

The above model did not prove effective, which I suppose is due to the feature extraction engine, which is tailored to face recognition. It looks like geometric traits alone are not sufficient for beauty estimation.

Model

Then I decided to try more feature extraction algorithms/models. After some unsuccessful attempts I found the following working solution. In short, I use the VGG-Very-Deep-16 convolutional neural network to extract face features. On top of those features a Support Vector Regression with a linear kernel is trained.

First the face is found and extracted from the image, reusing OpenFace code. I have modified the scaling part, which uses OpenCV. Now it not only extracts the face from an arbitrary image but also scales and rotates it so that the eyes land at a predefined position. The described estimator is trained on the SCUT-FBP dataset [1], so the code places a face from an arbitrary image at the same position in the resulting 224x224 snippet as in the training images.

    # Find landmarks on the face image in source image coordinates
    landmarks = self.findLandmarks(rgbImg, bb)
    landmarks_np = np.float32(landmarks)

    # Choose eyes and mouth landmarks from template landmarks
    landmark_indices_np = np.array(chosen_landmarks_)

    # For the chosen landmarks: get template landmark positions in 0..1
    # template coordinates and multiply by the image height/width (the target
    # image is square). This gives the positions of the template landmarks
    # in target image coordinates.
    tgt_landmarks = image_side * MINMAX_TEMPLATE[landmark_indices_np]

    # Shrink the landmark positions by 0.6 and move them downwards by 18%.
    # This transforms the template landmark positions into the approximate
    # positions inferred from the train database images.
    tgt_landmarks = scale_transform(np.array(tgt_landmarks),
                                    np.float32([image_side / 2, image_side / 2]),
                                    0.6)
    tgt_landmarks = move_transform(tgt_landmarks, np.float32([0, image_side * 0.18]))

    # Create a transformation so that the chosen landmark positions on the
    # source image (taken from landmarks_np) are mapped to the corresponding
    # scaled points on the target image.
    H = cv2.getAffineTransform(landmarks_np[landmark_indices_np], tgt_landmarks)

    # Apply the transformation. The extracted face now has these properties:
    # - square shape with width 224
    # - the image is rotated and scaled so that the face sits exactly where
    #   the target estimator expects it to be
    # A white background is added where the image borders are exceeded.
    result_img = cv2.warpAffine(rgbImg, H, (image_side, image_side),
                                borderValue=(255, 255, 255))
    return result_img

Image affine transformation to align with train dataset images
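The helpers scale_transform and move_transform are not shown in the snippet above. The sketch below is my guess at their behaviour (scaling about a center point and a plain shift), together with a NumPy stand-in for what cv2.getAffineTransform computes from three point pairs. The template and source landmark coordinates are made up for illustration.

```python
import numpy as np


def scale_transform(points, center, factor):
    """Scale points about `center` by `factor` (assumed behaviour)."""
    return center + (points - center) * factor


def move_transform(points, offset):
    """Shift every point by `offset` (assumed behaviour)."""
    return points + offset


def affine_from_three_points(src, dst):
    """2x3 matrix M such that M @ [x, y, 1] equals the matching dst point."""
    A = np.hstack([src, np.ones((3, 1))])   # one row [x, y, 1] per source point
    return np.linalg.solve(A, dst).T        # solve A @ M.T = dst


image_side = 224
# Made-up template landmarks in 0..1 coordinates: left eye, right eye, mouth.
template = np.array([[0.2, 0.3], [0.8, 0.3], [0.5, 0.8]])
tgt = image_side * template
tgt = scale_transform(tgt, np.array([image_side / 2, image_side / 2]), 0.6)
tgt = move_transform(tgt, np.array([0, image_side * 0.18]))

# Made-up landmark positions detected in a source photo.
src = np.array([[310.0, 220.0], [420.0, 240.0], [350.0, 360.0]])
M = affine_from_three_points(src, tgt)

# Sanity check: the source landmarks land on the target positions.
mapped = np.hstack([src, np.ones((3, 1))]) @ M.T
print(np.round(tgt, 2))
```

Three non-collinear point pairs determine the affine map completely, which is why three landmarks are enough to fix rotation, scale and position at once.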


Face features are then extracted by the VGG-Very-Deep-16 convolutional neural network. FC layers 6 and 7 output 2622 floats. I use the FC layer 7 output, so in my case the network extracts 2622 features. In the original work [2] this output is further passed to a Rectified Linear Unit, dropout and one more FC layer, finally reaching the softmax classifier. Since we need only feature extraction, not classification, the last 4 layers are not used. Below is the overall structure of the CNN being used.

It has the following peculiarities:
- 13 convolution layers, hence it is considered a "very deep" network. Similar results can also be achieved with a CNN of 5 convolution layers.
- 3 fully connected layers. They are actually the same as the convolution layers, but each filter size matches the size of the input data and the number of filters is the desired output size. The last FC layer performs classification according to the number of persons being recognized (not used in this project).
- To add regularization, dropout takes place after the relu6 and relu7 layers (not shown in the table).

The features were extracted from the SCUT-FBP dataset, resulting in a 500x2622 csv file.

Examples of SCUT-FBP samples

That is seemingly too many features, so I apply dimensionality reduction with sklearn.decomposition.PCA. With a 99% threshold on the retained variance I have to keep the first 114 components:

Then I use pickle to save the Ureduce matrix as part of the PCA class, and pandas to export the resulting reduced-dimensionality dataset to a .csv file.
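To show what choosing the number of components by a variance threshold amounts to, here is a NumPy sketch that builds a Ureduce-style projection matrix directly from the SVD. It is an illustration rather than the code I ran: the data is a synthetic low-rank stand-in, and fit_pca is a hypothetical helper. sklearn.decomposition.PCA performs an equivalent selection when n_components is given as a float between 0 and 1.

```python
import numpy as np


def fit_pca(X, variance_to_keep=0.99):
    """Return the feature means and a projection matrix (n_features x k)."""
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of Vt are the principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    # Smallest k whose cumulative explained variance reaches the threshold.
    k = int(np.searchsorted(cumulative, variance_to_keep)) + 1
    return mean, Vt[:k].T


rng = np.random.default_rng(0)
# Synthetic stand-in: 500 samples, 50 features, about 5 real degrees of freedom.
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(500, 50))

mean, u_reduce = fit_pca(X)
Z = (X - mean) @ u_reduce   # reduced design matrix
print(X.shape, "->", Z.shape)
```

New samples must be centered with the stored mean and multiplied by the same u_reduce, which is why the fitted transform has to be saved along with the reduced dataset.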

Now we have a ready dataset, produced by our CNN feature extraction, with dimensionality reduced to a moderate value (still high-dimensional, though). The dataset is labeled, and the labels are stored in an .xlsx spreadsheet. Since we need continuous rating output values in the range 1.0-5.0, I use Support Vector Regression as the estimator. I perform an extensive hyperparameter search and get heatmaps similar to the picture above for the previous model. Finally I come up with the following values:

    # Extract image (actually repeated for a set of images)
    im = extract_image(img_path)
    ...
    im = np.asarray(im, dtype='float64') / 256

    # Prepare input image
    MEAN_RGB = np.array([129.1863, 104.7624, 93.5940])
    # Change axes so that dimensions correspond to those of the CNN input
    # and subtract MEAN_RGB.
    # Results in 1 x channels x height x width
    im = prepare_image(im)

    # Get fully-connected layer 7 output
    # Apply Rectified Linear Unit to layer 6 output and combine outputs
    out = net_caffe.forward(data=floatX(image_list), end='fc7')

    # Multiply resulting vector by pre-calculated Ureduce matrix
    img_features = out['fc7'].reshape((out['fc7'].shape[0], np.prod(out['fc7'].shape[1:])))
    ...
    # Final design matrix for a set of images
    feat_mx = np.concatenate((feat_mx, img_features.copy()), axis=0)
    ...
    # Make predictions for a set of images
    pred = clf.predict(feat_mx)

The Pearson correlation coefficient was chosen as the metric of beauty ranking accuracy. The SVR estimator was trained on images from the SCUT-FBP dataset. When it is assessed on the same dataset, cross-validation with random shuffling is used. Predictably, when assessed on the special Test 200 dataset (described in the next section), the scores are lower. But only this last case can be considered meaningful, because it is obtained on real data. The pool5+fc6 layer outputs were also tested, as in [3]. The resulting accuracies are shown in the table below; only an extract is shown.

An interesting observation is that data standardization yields a lower score than merely centering the data.
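For reference, the Pearson correlation used as the metric above can be computed as in the sketch below. The rating vectors are made-up numbers, chosen so that the "model" differs from the "human" scores only by a constant offset.

```python
import numpy as np


def pearson_r(predicted, actual):
    """Pearson correlation coefficient between two rating vectors."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    p = predicted - predicted.mean()
    a = actual - actual.mean()
    return float((p @ a) / np.sqrt((p @ p) * (a @ a)))


# Made-up ratings: the model is off by exactly +1 on every image.
human = [1.5, 2.0, 3.2, 4.1, 4.8]
model = [2.5, 3.0, 4.2, 5.1, 5.8]

r = pearson_r(model, human)
print(round(r, 6))
```

The coefficient ignores constant offsets and positive rescaling, so it measures how well the estimator follows the human ordering and spacing rather than how close the absolute values are.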

The estimator demo

Test 200 dataset

For testing the estimator a special dataset was crafted. The pictures were taken from the thaicupid.com database. The manual rating was performed on a small set of 20 pictures. Those samples are not used for training, only for an approximate assessment of the estimator's effectiveness. To ease the rating process and reduce errors I used a pairwise comparison technique similar to the one described in section 2.2 of [4]. A Python script was developed to collect pairwise comparisons on a set of pictures. A pair of pictures is chosen randomly and shown to the user. The user is supposed to press either "z" or "m" on the keyboard to choose the left or the right picture, whichever looks better. 200 comparisons are enough to proceed with absolute rating:

"To convert the scores from pairwise to absolute, we minimize a cost function defined such that as many of the pairwise preferences as possible are enforced and the scores lie within a specified range. Let \$s = \{s_1, s_2, \ldots, s_N\}\$ be the set of all scores assigned to images 1 to N. We formulate the problem into minimizing the cost function:

$$J(s) = \displaystyle\sum_{i=1}^{M} \phi(s_i^+ - s_i^-) + \lambda s^T s $$
where \$(s_i^+ / s_i^-)\$ denotes the current scores of the i-th comparison and \$\phi(d)\$ is some cost function which penalizes images that have scores which disagree with one of M pairwise preferences and \$\lambda\$ is a regularization constant that controls the range of final scores. We define \$\phi(d)\$ as an exponential cost function \$\phi(d) = e^{-d}\$ " [4]

Below is the derivative of the cost w.r.t. one of the variables, \$s_1\$:
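Written out from the cost function \$J(s)\$ above, with \$\phi(d) = e^{-d}\$ differentiated term by term, it takes the form:

$$\frac{\partial J}{\partial s_1} = -e^{-s_1}\displaystyle\sum_{i} e^{s_i} + e^{s_1}\displaystyle\sum_{j} e^{-s_j} + 2\lambda s_1$$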

where \$s_i\$ is taken from pairs where image 1 compares positive (better) and \$s_j\$ is taken from pairs where image 1 compares negative (worse). With the derivative formula I can write an Octave/MATLAB function returning the cost and gradient for a given vector \$s\$:

    function [J, grad] = computeCost(s, id_list, pairs, lambda)
      J = 0;
      % compute cost J
      for p = pairs
        larger_id = p(1);
        s_i_plus = s * (id_list == larger_id);
        smaller_id = p(2);
        s_i_minus = s * (id_list == smaller_id);
        J += exp(-(s_i_plus - s_i_minus));
      end
      J += lambda * s * s';

      p_first_row = pairs(1, :);
      p_second_row = pairs(2, :);
      N = length(s);
      grad = zeros(N, 1);
      % compute gradient w.r.t. each s
      for i = 1:length(s)
        id = id_list(i);
        % id pairs, where id(i) (corresponding to s_i) compares +
        s_more_pairs = pairs(:, find(p_first_row == id));
        % ids, where id(i) compares + to this id
        s_more_ids = unique(s_more_pairs(2, :));
        % s_x corresponding to those ids
        s_more = s(:, find(ismember(id_list, s_more_ids)));
        % d/ds_i of exp(-(s_i - s_x)) is -exp(-s_i) * exp(s_x)
        grad(i) -= exp(-s(i)) * sum(exp(s_more));
        % same but comparing -
        s_less_pairs = pairs(:, find(p_second_row == id));
        s_less_ids = unique(s_less_pairs(1, :));
        s_less = s(:, find(ismember(id_list, s_less_ids)));
        grad(i) += exp(s(i)) * sum(exp(-s_less));
        % derivative of the regularization term lambda * s * s'
        grad(i) += 2 * lambda * s(i);
      end
    end

With the above cost/gradient function the minimization problem is solved in one call:

The resulting \$s\$ contains absolute ratings according to the input pairwise comparisons. It may require some scaling to fit into the scale used by the estimator.
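For readers who prefer Python, the same conversion can be sketched with NumPy alone. This is not the Octave script I used. Plain gradient descent with a backtracking line search stands in for the library optimizer, lam=0.1 is an arbitrary choice, and the comparisons are simulated with a rater who always prefers the image with the higher index. The names cost_and_grad, fit_scores and rescale are hypothetical.

```python
import numpy as np


def cost_and_grad(s, winners, losers, lam):
    """J(s) and its gradient; winners[k] was preferred over losers[k]."""
    e = np.exp(-(s[winners] - s[losers]))   # phi(d) = exp(-d) per comparison
    cost = e.sum() + lam * (s @ s)
    grad = 2.0 * lam * s
    np.add.at(grad, winners, -e)            # derivative w.r.t. the preferred score
    np.add.at(grad, losers, e)              # derivative w.r.t. the rejected score
    return cost, grad


def fit_scores(winners, losers, n_items, lam=0.1, steps=2000, tol=1e-10):
    """Minimize J(s) by gradient descent with a backtracking line search."""
    s = np.zeros(n_items)
    cost, grad = cost_and_grad(s, winners, losers, lam)
    for _ in range(steps):
        step = 1.0
        candidate = s - step * grad
        new_cost, new_grad = cost_and_grad(candidate, winners, losers, lam)
        # Halve the step until the cost decreases sufficiently (Armijo rule).
        while new_cost > cost - 0.5 * step * (grad @ grad) and step > 1e-12:
            step *= 0.5
            candidate = s - step * grad
            new_cost, new_grad = cost_and_grad(candidate, winners, losers, lam)
        if cost - new_cost < tol:
            break
        s, cost, grad = candidate, new_cost, new_grad
    return s


def rescale(s, low=1.0, high=5.0):
    """Map scores linearly onto the rating scale used by the estimator."""
    return low + (s - s.min()) * (high - low) / (s.max() - s.min())


# Simulated session: 20 images, 200 random comparisons.
rng = np.random.default_rng(0)
n_items = 20
pairs = np.array([rng.choice(n_items, size=2, replace=False) for _ in range(200)])
winners, losers = pairs.max(axis=1), pairs.min(axis=1)

scores = fit_scores(winners, losers, n_items)
ratings = rescale(scores)
print(np.round(ratings, 2))
```

The last step maps the raw scores linearly onto the 1.0-5.0 range, which is one possible way to do the scaling mentioned above.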

Conclusion

The developed estimator enables facial beauty estimation for an arbitrary image (URL). The underlying convolutional NN is a very deep network developed by VGG. The scikit-learn library was used on top of its outputs. The Test 200 labeled dataset was composed from pairwise ratings obtained with a purpose-built Python utility and an Octave script. Different CNN architectures/hyperparameters/preprocessing options were explored for maximum correlation between the estimator and human-produced ratings. The project was successfully shipped and deployed on the customer site.