Video Face Recognition

Introduction

Interest in a movie is largely correlated with its actors, which makes annotating every occurrence of cast members within a movie essential. This work addresses the
difficult problem of identifying a video face track using a dictionary of still face images of many people, while rejecting unknown individuals. We employ a large
database of still images from the Internet to perform complete video face recognition, from face tracking to face track identification.

Face Tracking

Our method performs the difficult task of face tracking based on face detections extracted using the high-performance
SHORE face detector. We generate tracks using two metrics: one spatial and the other
appearance-based. The spatial metric computes the percent overlap of the current bounding box with the previous one. The appearance metric computes a histogram intersection of
the local bounding box, which can handle abrupt changes in the scene and the face. We compare each new face detection to existing tracks; if the location and
appearance metrics both indicate a match, the face is added to the track; otherwise, a new track is created. Finally, we use a global histogram for the entire frame, encoding
scene information, to detect scene boundaries, and we end a track once it has gone 20 frames without a detection.
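
The association step described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names, the grayscale 32-bin histogram, the two thresholds (min_overlap, min_similarity), and the choice to measure overlap as the fraction of the current box covered by the previous one are all assumptions. Only the 20-frame limit comes from the text. Scene-boundary detection with the global frame histogram is not shown.

```python
import numpy as np

MAX_GAP = 20  # frames without a detection before a track ends (from the text)


def overlap_ratio(box, prev_box):
    """Fraction of `box` covered by `prev_box`; boxes are (x, y, w, h)."""
    x1 = max(box[0], prev_box[0])
    y1 = max(box[1], prev_box[1])
    x2 = min(box[0] + box[2], prev_box[0] + prev_box[2])
    y2 = min(box[1] + box[3], prev_box[1] + prev_box[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / float(box[2] * box[3])


def patch_histogram(patch, bins=32):
    """L1-normalised intensity histogram of an image patch (assumed feature)."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)


def histogram_intersection(h1, h2):
    """Intersection of two normalised histograms; 1.0 means identical."""
    return float(np.minimum(h1, h2).sum())


class Track:
    def __init__(self, box, hist, frame_idx):
        self.boxes = [box]
        self.hist = hist
        self.last_seen = frame_idx


def update_tracks(active, detections, frame, frame_idx,
                  min_overlap=0.3, min_similarity=0.5):
    """Assign detections of one frame to tracks; return (active, finished)."""
    finished = [t for t in active if frame_idx - t.last_seen > MAX_GAP]
    active = [t for t in active if frame_idx - t.last_seen <= MAX_GAP]
    for box in detections:
        x, y, w, h = box
        hist = patch_histogram(frame[y:y + h, x:x + w])
        best, best_score = None, -1.0
        for t in active:
            if t.last_seen == frame_idx:  # already extended in this frame
                continue
            ov = overlap_ratio(box, t.boxes[-1])
            sim = histogram_intersection(hist, t.hist)
            if ov >= min_overlap and sim >= min_similarity and ov + sim > best_score:
                best, best_score = t, ov + sim
        if best is None:
            active.append(Track(box, hist, frame_idx))  # start a new track
        else:
            best.boxes.append(box)
            best.hist = hist
            best.last_seen = frame_idx
    return active, finished
```

Calling update_tracks once per frame, with that frame's detections, keeps a list of open tracks and returns the tracks that have just expired. A detection must pass both the spatial and the appearance test to extend a track, which mirrors the rule stated in the text.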

Mean Sequence Sparse Representation-based Classification

In recent years, Sparse Representation-based Classification (SRC) has received
much attention due to its high precision and ability to handle occlusions. More recently, we found that, combined with several features, SRC works well for real-world
face recognition and excels at rejecting unknown identities (see
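Face Recognition for Web-Scale Datasets). Now, given a face track $Y = [y_1, \ldots, y_M]$ with $M$ frames and a dictionary $A$ of still face images, we make the strong assumption that
all frames will result in a single coefficient vector $x$, based on the fact that all of the frames belong to the same person and should intuitively be
linearly represented by the same people in the dictionary. Based on this assumption, we produce the following formulation:

$\hat{x} = \arg\min_{x} \sum_{m=1}^{M} \| y_m - A x \|_2^2 + \lambda \| x \|_1$,

in which we minimize the summed residual error between every frame $y_m$ and the linear combination $A x$ while maximizing the sparsity of
$x$ through the $\ell_1$ penalty weighted by $\lambda$. By analyzing the least-squares formulation of the residual error, we find the interesting result that it reduces to a single problem on the mean face track vector, as
follows:

$\hat{x} = \arg\min_{x} M \| \bar{y} - A x \|_2^2 + \lambda \| x \|_1$,

where $\bar{y} = \frac{1}{M} \sum_{m=1}^{M} y_m$. The two objectives differ only by the term $\sum_{m=1}^{M} \| y_m - \bar{y} \|_2^2$, which does not depend on $x$. This formulation results in a speedup of at least 5x over a naive frame-by-frame
application of SRC, with the exact factor depending on the average length of the input face tracks.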
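
The mean-vector idea can be illustrated with a short Python sketch: average the frame features of a track, solve one sparse coding problem, and pick the identity whose dictionary atoms give the smallest reconstruction residual. This is a sketch under stated assumptions and not the toolbox code. The names lasso_ista and mssrc_classify are hypothetical, the plain ISTA solver stands in for whichever L1 solver the authors used, and the objective is scaled as 0.5 * ||y - A x||^2 + lam * ||x||_1, which only changes the effective value of the sparsity weight. Rejection of unknown identities is not shown.

```python
import numpy as np


def soft_threshold(v, t):
    """Element-wise shrinkage operator used by ISTA."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(A, y, lam=0.01, n_iter=500):
    """Minimise 0.5 * ||y - A x||_2^2 + lam * ||x||_1 with plain ISTA."""
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)  # 1 / Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x


def mssrc_classify(A, labels, track, lam=0.01):
    """Classify a face track from its mean feature vector.

    A      : (d, n) dictionary, one still-image feature vector per column
    labels : (n,) identity label of each dictionary column
    track  : (d, M) matrix, one frame feature vector per column
    """
    y_bar = track.mean(axis=1)       # one vector replaces M frames
    x = lasso_ista(A, y_bar, lam)    # a single sparse coding problem
    classes = np.unique(labels)
    residuals = [
        np.linalg.norm(y_bar - A[:, labels == c] @ x[labels == c])
        for c in classes
    ]
    return classes[int(np.argmin(residuals))], x
```

Because the solver runs once per track instead of once per frame, the cost of the sparse coding step no longer grows with the number of frames in the track.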

Movie Trailer Face Dataset

We built our Movie Trailer Face Dataset using 113 movie trailers from YouTube from the 2010 release year that contained celebrities present in our supplemented
PubFig+10 dataset. These videos were then processed to generate face tracks using the method described above. The resulting dataset contains 3,585 face tracks,
63% consisting of unknown identities (not present in PubFig+10) and 37% known.

Video Face Recognition Toolbox

To benchmark future methods on our data or on other custom data, we provide a Video Face Recognition Toolbox. The toolbox contains implementations of the tested
algorithms (NN, SVM, L2, SRC, and MSSRC). There are two principal scripts: