Large Scale Visual Recognition Challenge 2015 (ILSVRC2015)

Legend:
Yellow background = winner in this task according to this metric; authors are willing to reveal the method
White background = authors are willing to reveal the method
Grey background = authors chose not to reveal the method
Italics = authors requested entry not participate in competition

Given many box proposals (e.g. from selective search), the localization task becomes (a) finding the boxes that contain the ground truth, and (b) determining the category of the object in each box. Inspired by Fast R-CNN, we propose a classification and localization framework that combines global context information with local box information by selecting proper (class, box) pairs.

Yongyi Lu (The Hong Kong University of Science and Technology)
Hao Chen (The Chinese University of Hong Kong)
Qifeng Chen (Stanford University)
Yao Xiao (The Hong Kong University of Science and Technology)
Law Hei (University of Michigan)
Chi-Keung Tang (The Hong Kong University of Science and Technology)

Our system detects large- and small-resolution objects using different schemes; the threshold between large and small resolution is 100 x 100. For large-resolution objects, we average the scores of 4 models (Caffe, NIN, VGG16, VGG19) over the selective search bounding boxes. For small-resolution objects, scores are generated by Fast RCNN on selective search proposals (quality mode). The final result combines the outputs for large- and small-resolution objects after applying NMS. In training, we augment the data with our annotated HKUST-object-100 dataset, which consists of 219,174 images. HKUST-object-100 will be published after the 2015 competition to benefit the research community.
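
The final merge of the large- and small-resolution outputs relies on non-maximum suppression; below is a minimal NumPy sketch of standard greedy NMS (the [x1, y1, x2, y2] box format and the IoU threshold are our assumptions, not the team's actual settings):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = np.argsort(scores)[::-1]          # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the top-scoring box with the remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap the kept box too much.
        order = order[1:][iou <= iou_thresh]
    return keep
```

The same routine applies whether the two detection streams are merged before or after suppression; only the input box/score lists change.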

We adapted two image object detection architectures, namely Fast-RCNN [1] and Faster-RCNN [2], for the task of object detection from video. We used Edge Boxes [3] as the proposal generation algorithm for Fast-RCNN in our pipeline since we found that it outperformed other methods on blurred, low-resolution video data. To exploit the temporal information in video, we aggregated proposals from multiple frames to provide better proposals for each single frame. In addition, we devised a simple post-processing program, involving the CMT [4] tracker, to rectify the predictions.

Our algorithm is based on Fast R-CNN.
We fine-tuned the Fast R-CNN network using data selected from the ILSVRC2015 training set. After that, each test frame was fed to the network to obtain the predictions.
We also tried several methods of exploiting the similarity of neighboring frames. First, we compared object proposals created by different methods (selective search, RIGOR, EdgeBoxes, MOP, GOP), and we finally chose EdgeBoxes.
We tried adding the proposals of the preceding and following frames to the middle one to exploit their correlation. Our experiments showed that this works.
We also tried several object tracking algorithms, such as optical flow and streamlines, but unfortunately we did not manage to apply any of them to our model.
In any case, we have learned a lot from this competition, and thanks for your organization!
We will come back!

1. DET: Spatial cascade region regression
We first set up Faster RCNN [3] as our baseline (mAP 45.6% for VGG-16; mAP 47.2% for GoogleNet).

Object detection is to answer "Where" and "What".
We utilize a cascade region regression model to gradually refine the location of the object, which helps answer "where".

Solid tricks including:
Negative examples (discriminative features are inhomogeneous over objects; response maps are helpful for choosing reasonable positive and negative examples, instead of only using IoU).
Multi-scale (image, joint feature map, inception layer; answer "where" from earlier CNN layers and "what" from later, higher-capacity ones).
Learn to combine (inter-class NMS, intra-class NMS; exclusive in space).
Learn to rank (hypothesis: the distribution of the number of samples per class is similar between the validation and test sets, so we can choose class-specific parameters).
Add training samples for classes with little training data.
Design class-specific model for hard classes.
Rank low resolution and dense prediction later.
Model ensemble with multi-view learning.

2. VID: Temporal cascade region regression

An objectness-based tracker is designed to track objects in videos.

First, we train Faster-RCNN [3] (VGG-16) with the provided training data (sampled from frames).
The network provides features for tracking.

Second, the tracker uses the RoI-pooling features from the last conv layer together with temporal information; it can be seen as a temporal Fast R-CNN [2].
(It takes the location-indexed features from the current frame to predict the bounding box of the object in the next frame.)

Temporal information and scene clustering (different videos from one scene) are greatly helpful for deciding the classes in a video with high confidence.

Our submissions leverage multiple pre-trained CNNs and a second stage Random Forest classifier to choose which label or CNN to use for top 5 guesses. The second stage classifier is trained using the validation data set based on the 1000 class scores from each individual network, or based on which network(s) selected the correct label faster (i.e., closer to the top guess). The primary pre-trained CNNs leveraged are VGG VeryDeep-19, VGG VeryDeep-16, and VGG S. The second-stage Random Forest classifier is trained using 1000 trees.
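
The second-stage re-ranking described above can be sketched with scikit-learn (random score vectors stand in for the real CNN outputs; the shapes and variable names are our assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the real data: 1000-class score vectors from three CNNs
# (e.g. VGG-19, VGG-16, VGG S) on validation images, stacked per image.
rng = np.random.default_rng(0)
n_val = 120
scores_per_net = [rng.random((n_val, 1000)) for _ in range(3)]
features = np.hstack(scores_per_net)          # (n_val, 3000) joint score vector
labels = rng.integers(0, 1000, size=n_val)    # ground-truth classes

# Second-stage classifier: a Random Forest with 1000 trees, as in the entry.
rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=0)
rf.fit(features, labels)

# Re-rank: take the forest's top-5 classes for a new joint score vector.
proba = rf.predict_proba(features[:1])[0]
top5 = rf.classes_[np.argsort(proba)[::-1][:5]]
```

The same forest could alternatively be trained to predict which network to trust, rather than the label directly, as the entry also describes.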

The problem of classification and segmentation of objects in videos is one of the biggest challenges in computer vision, demanding simultaneous solutions to several fundamental problems, most of which are yet to be solved even separately. Perhaps the most challenging task in this context is object detection and classification. In this work, we utilized the feature-extraction capabilities of deep neural networks to construct robust object classifiers and to accurately localize objects in the scene. On top of that, we use analysis in time and space to capture the tracklet of each detected object over time. The results show that our system is able to localize multiple objects in different scenes while maintaining track stability over time.

We trained CNNs with a new activation function, called "exponential linear unit" (ELU) [1], which speeds up learning in deep neural networks.

Like rectified linear units (ReLUs) [2, 3], leaky ReLUs (LReLUs) and parametrized ReLUs (PReLUs), ELUs avoid a vanishing gradient via the identity for positive values. However, ELUs have improved learning characteristics compared to the other activation functions. In contrast to ReLUs, ELUs have negative values, which allows them to push mean unit activations closer to zero. Mean activations of zero speed up learning because they bring the gradient closer to the unit natural gradient.

The unit natural gradient differs from the normal gradient by a bias shift term, which is proportional to the mean activation of incoming units. Like batch normalization, ELUs push the mean towards zero, but with a significantly smaller computational footprint. While other activation functions like LReLUs and PReLUs also have negative values, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value with smaller inputs and thereby decrease the propagated variation and information. Therefore ELUs code the degree of presence of particular phenomena in the input, while they do not quantitatively model the degree of their absence. Consequently dependencies between ELU units are much easier to model and distinct concepts are less likely to interfere.

In this challenge, ELU networks considerably sped up learning compared to a ReLU network with similar classification performance.
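
The ELU nonlinearity itself is simple to state; here is a minimal NumPy sketch (alpha = 1 is an assumption on our part; see [1] for the actual formulation):

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise.
    Saturates to -alpha for large negative inputs, giving the noise-robust
    deactivation state discussed above."""
    x = np.asarray(x, dtype=float)
    # np.minimum avoids computing exp() of large positive values.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1.0))
```

Unlike ReLU, the negative branch is smooth and bounded below by -alpha, which is what pushes mean activations toward zero.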

The network for classification, pre-trained on the ILSVRC 2014 classification dataset, is modified for bounding box regression. Then the regressor that predicts a bounding box is fine-tuned using the features of the middle and last convolutional layers [2]. At inference time, the modified greedy-merge technique [3] for multi-scale prediction is applied at each scale, and the optimal scale is chosen to determine the final predicted box.

For the object detection challenge, our submission is based on the combination of two types of models, i.e. DeepID-Net in ILSVRC 2014 and Faster RCNN [a].

Compared with DeepID-Net in ILSVRC 2014, the new components are as follows.
(1) GoogleNet with batch normalization and VGG are pre-trained on 1000-class ImageNet classification/localization data (for the entries using official data only) and 3000-class ImageNet classification data (for the entries using extra data).
(2) A new cascade method is introduced to generate region proposals. It has higher recall rate with fewer region proposals.
(3) The models are fine-tuned on 200 detection classes with multi-context and multi-crop.
(4) The 200 classes are clustered hierarchically based on their visual similarity. Instead of fine-tuning with the 200 classes once, the models are fine-tuned multiple times as the 200 classes are divided into smaller clusters iteratively. Different clusters use different feature representations. Feature representations gradually adapt to individual classes in this way.

Compared with Faster RCNN, the new components are
(1) We cascade the RPN, where the proposals generated by the RPN are fed into an object/background Fast-RCNN. This leads to a 93% recall rate with about 126 proposals per image on val2.
(2) We cascade the Fast-RCNN, where a Fast-RCNN with category-wise softmax loss is used in the cascade step for hard negative mining. It leads to about 2% improvement in AP.
Deep-ID models and Faster RCNN models are combined with model averaging.
For the localization task, class labels are predicted with VGG. For each image and each predicted class, candidate regions are proposed by employing a learned saliency map and edge boxes [c] with a high recall rate. Candidate regions are assigned scores by VGG or GoogLeNet with BN fine-tuned on candidate regions. The 1000 classes are grouped into multiple clusters in a hierarchical way. VGG and GoogleNet are fine-tuned multiple times to adapt to different clusters.
The fastest publicly available multi-GPU Caffe code (only 6 seconds for 20 iterations with a mini-batch size of 256, using GoogleNet on 4 TitanX GPUs) gives us strong support [d].
[a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C. Loy, X. Tang, “DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection,” CVPR 2015.
[b] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” arXiv:1506.01497.
[c] C. Lawrence Zitnick, Piotr Dollár, “Edge Boxes: Locating Object Proposals from Edges”, ECCV2014
[d] https://github.com/yjxiong/caffe

First, around 4000 candidate object proposals are generated from selective search and structured edges. Then we extract CNN features from 12 different regions for each proposal and concatenate them as part of the final object representation, as in [1]. In detail, the region CNN is a 16-layer VGG-version SPPnet, modified with some random initialization, a one-level pyramid, very leaky ReLU, and a hand-designed two-level label tree for structurally sharing knowledge to counter class imbalance. A single model with another three deformation layers is also fused for capturing repeated patterns in just 3 regions per proposal. The semantic segmentation-aware CNN extension in [1] is also used, where the segmentation model is a mix of a deconvnet and a CRF.
Second, we use RPN [2], with convolutional layers initialized by the pre-trained RCNN above, to obtain at least 500 proposals per instance, and then retrain a new model as above to better use the RPN proposals and exclude trained patterns, though only part of the data was used due to the time limit. Apart from the model above, we add unsupervised segmented features using CPMC and encode them with shape and fixation maps to learn a latent SVM.
Third, we use the enlarged candidate box for bounding box regression, iterative localization, and bounding box voting as in [1] for object localization. Finally, thanks to the competition organisers at ImageNet and for GPU resources from NVIDIA and IBM Cloud.

For the Classification+Localization task we trained two neural nets with architectures inspired by "Inception-6" [1] (but without batch normalization), and one with an architecture inspired by "Inception-7a" [2] (with batch normalization and 3x1 + 1x3 filters instead of 3x3 filters in some layers).

For the scene classification task, our model is based on a convolutional neural network framework implemented in Caffe. We use the parameters of a vgg_19 model trained on the ILSVRC classification task as the initialization of our model [1]. Since the deep features learnt by convolutional neural networks trained on ImageNet are not competitive enough for scene classification, ImageNet being an object-centric dataset [3], we further train our model on Places2 [4]. Moreover, according to our experiments, "msra" initialization of filter weights for rectifiers is a more robust way of training extremely deep rectifier networks [2], so we use this method to initialize some fully-connected layers.

A VGG-like model has been trained for this scene classification task. We only use the resized 256x256 image data to train this model. In the training phase, random crops at multiple scales are used for data augmentation. The procedure generally follows the VGG paper; for example, the batch size was set to 128 and the learning rate was initially set to 0.01. The only difference is that we don't use the Gaussian method for weight initialization. We propose a new weight initialization method, which converges a bit faster than the MSRA weight filler. In the test phase, we convert the fully connected layers into convolutional layers, and this fully convolutional network is then applied over the whole image. Multi-scale images are used to evaluate dense predictions. Finally, the top-5 classification accuracy we obtained on the validation set is 80.0%.

We used an ensemble of 6 deep neural network models, consisting of 11-16 convolutional and 4-5 maximum pooling layers. No 1x1 convolution is involved, meaning no fully-connected layers are used.
On-line, random image deformation is adopted during training. Test samples are classified under the augmented pattern classification rule according to I. Sato, et al., arXiv:1505.03229.
Models are trained with machine-distributed deep learning software that we developed from scratch on the large-scale GPU supercomputer TSUBAME at the Tokyo Institute of Technology, taking advantage of the 1-year trial use period and the TSUBAME Grand Challenge in Fall 2015.
Up to 96 GPUs are used to speed up the training.

Our framework is mainly based on RCNN [1], and we make the following improvements:
1. Region proposals come from two sources: selective search and a region proposal network [3] trained on ILSVRC.
2. The initial models of several GoogLeNets are pretrained on images or bounding boxes following [4].
3. Inspired by [5], an adapted multi-region model is used.
4. Inspired by [2], we train a separate regression network to correct the detected positions.
5. Model averaging is performed on the SVM scores of all the used models.

Here we use a short cascade of classifiers for object detection. The first stage is our novel polyhedral conic classifier (PCC); the second is a kernelized SVM. PCC classifiers can return polyhedral acceptance regions for positive classes with a simple linear dot product, so they are better suited to object detection tasks than linear SVMs. We used LBP+HOG descriptors for image representation, and a sliding-window approach is used to scan images. Our first submission includes independent detector outputs for each class; for the second submission we apply a non-maximum suppression algorithm between classes.

It is the third time that we participate in ILSVRC. This year, we start with the GoogLeNet [1] model and apply it to all four tasks. Details are shown below.

Task 1 Object Classification/Localization
===============
We utilize GoogLeNet with batch normalization and PReLU for object classification. Three models are trained. The first uses the original GoogLeNet architecture but with all ReLU layers replaced by PReLU layers. The second model is the one described in Ref. [2]. The third model is fine-tuned from the second with multi-scale training. Multi-scale testing and model ensembling are used to generate the final classification result. 144 crops of an image [1] are used to evaluate one network. We also tried the 150-crop method described in Ref. [3]; the performance is almost the same, and merging the results of 144 crops and 150 crops does not bring much further improvement. The top-5 errors of the three models on validation are 8.89%, 8.00% and 7.79%, respectively. Using an ensemble of the three models, we decreased the top-5 error rate to 7.03%. An ensemble of more models would improve the performance further, but we did not have enough time and GPUs to do that.
To generate a bounding box for each label of an image, we first fine-tune the second classification model with object-level annotations of the 1,000 classes from the ImageNet CLS-LOC train data. Moreover, a background class is added to the network. Test images are then segmented into ~300 regions by selective search, and these regions are classified by the fine-tuned model into one of 1,001 classes. We select the top-3 regions whose classes have the highest probability according to the classification model. A new bounding box is generated by finding the minimal bounding rectangle of the three regions. The localization error is about 32.9% on validation. We also tried the third classification model, giving a localization error on validation of 32.76%. After merging the two aforementioned results, the localization error decreases to 31.53%.
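
The final box-generation step reduces to taking the minimal rectangle enclosing the selected regions; a small sketch (the [x1, y1, x2, y2] box format is our assumption):

```python
def enclosing_box(boxes):
    """Minimal bounding rectangle of a list of [x1, y1, x2, y2] boxes,
    e.g. the top-3 selective search regions for a predicted class."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    return [x1, y1, x2, y2]
```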

Task 2 Object Detection
===============
We employ the well-known Fast-RCNN framework [4]. First, we tried the AlexNet model pre-trained on the CLS-LOC dataset with image-level annotations. When training on the object detection dataset, we run SGD for 12 epochs, then lower the learning rate from 0.001 to 0.0001 and train for another 4 epochs. The other settings are the same as in the original Fast-RCNN method. This approach achieves 34.9% mAP on validation. Then we apply GoogLeNet within the Fast-RCNN framework. The pooling layer after the inception 4 layers is replaced by an ROI pooling layer. This trial achieves about 34% mAP on validation. In another trial, we move the ROI pooling layer from pool4 to pool5 and enlarge the input size from 600 (max 1000) to 786 (max 1280). The pooled width and height are set to 4x4 instead of 1x1. The mAP is about 37.8% on validation. It is worth noting that the last model needs about 6 GB of GPU memory to train and 1.5 GB to test, and it has nearly the same test speed as AlexNet but better performance. We employ a simple strategy to merge the three results and obtain 38.7% mAP.

Task 3 Scene Classification
===============
The Places2 training set has about 8 million images. We reduce the input size of images from 256x256 to 128x128 and try small network architectures to accelerate the training procedure. However, due to the large number of images, we still could not finish training before the submission deadline. We trained two models whose architectures are modified from the original GoogLeNet. The first only removes the inception 5 layers. We trained this model for only about 10 epochs; its top-5 error on validation is about 37.19%. The second model enlarges the stride of the conv1 layer from 2 to 4 and reduces the kernel number of the conv2 layer from 192 to 96. For the remaining inception layers, every kernel number is set to half of its original value. This model was trained for about 12 epochs and achieves 38.99% top-5 error on validation. Unfortunately, about one week ago we found that the final output vector had been set to 400 instead of 401 due to an oversight. We corrected the error and fine-tuned the first model for about 1 epoch; its top-5 error on validation is about 31.42%.

Task 4 Object Detection from Video
===============
A simple method for this task is to perform object detection in all frames, but this does not exploit the spatio-temporal constraints or the context information between consecutive frames in a video. Thus, we combine object detection and object tracking for this track. First, key frames are selected from a video and objects are detected in them using Fast RCNN. There are about 1.3 million frames in the training set. Due to temporal continuity, we select one frame every 25 frames to train an object detection model; 52,922 frames are used to train the model. Similar to the approach in the Object Detection track, we run SGD for 12 epochs and then lower the learning rate from 0.001 to 0.0001 and train for another 8 epochs. The training procedure takes 1 day and 3 days on a single K40 for AlexNet and GoogLeNet, respectively. More details about the object detection can be found in the description of Task 2. During testing, if a video has fewer than 50 frames, we choose two frames in the middle of the video; if it has more than 50 frames, we choose a frame every 25 frames. This results in 6,861 frames out of 176,126 on validation and 12,329 frames out of 315,176 on test. We run object detection on these frames and keep the objects whose confidence scores are larger than 0.2 (AlexNet) and 0.4 (GoogLeNet). Then, detected objects are tracked to generate the results for the other frames. The tracking method we use is TLD [5]. After tracking, we generate the final results for evaluation. It is worth noting that we set most of the parameters empirically because we had no time to validate them. Three entries are provided. The first uses AlexNet and achieves 19.1% mAP on validation. The second uses GoogLeNet and achieves 24.6% mAP on validation. A simple strategy to merge the two results yields the third entry, whose mAP is 25.3% on validation.
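
The key-frame sampling rule described above can be sketched as follows (the exact indices chosen for the two middle frames are our assumption):

```python
def select_key_frames(n_frames, step=25):
    """Pick key frames for detection: one frame every `step` frames for
    long videos, or two frames in the middle for videos shorter than
    2 * step frames (50 frames with the default step)."""
    if n_frames < 2 * step:
        mid = n_frames // 2
        # Two adjacent middle frames (collapses to one for 1-frame videos).
        return sorted({max(mid - 1, 0), mid})
    return list(range(0, n_frames, step))
```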

For the ILSVRC 2015 detection dataset, we trained Fast R-CNN and Faster R-CNN multiple times and used the trained models in an ensemble.

Three algorithms are used for region proposal: Selective Search [4], Region Proposal Network step 1 [3], and Region Proposal Network step 3 [3]. The three kinds of proposals are fed to detection models (VGG16, VGG19), which are then used in the ensemble.

Detection results are 41.7% mAP for a single model and 44.1% mAP for the ensemble of multiple models.

Dave Ojika, University of Florida
Liu Chujia, University of Florida
Rishab Goel, University of Florida
Vivek Viswanath, University of Florida
Arpita Tugave, University of Florida
Shruti Sivakumar, University of Florida
Dapeng Wu, University of Florida

We implement a Caffe-based convolutional neural network using the Places2 dataset for a large-scale visual recognition environment. We trained a network based on the VGG ConvNet with 13 weight layers and 3x3 kernels, plus 3 fully connected layers. All convolutional layers are followed by a ReLU layer. Due to the very large amount of time required to train the model with deeper layers, we deployed Caffe on a multi-GPU cluster environment and leveraged cuDNN libraries to improve training time.

Gil-Jin Jang, School of Electronics Engineering, Kyungpook National University, Daegu, Republic of Korea
Han-Gyu Kim, School of Computing, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea

A novel deep network architecture is proposed based on independent subspace analysis (ISA). We extract 4096-dimensional features with a baseline AlexNet trained on the Places2 database, and the proposed architecture is applied on top of this feature-extraction network. Every 4 consecutive nodes of the 4096 feature nodes are grouped into a single subspace, resulting in 1024 subspaces. The output of each subspace is the square root of the sum of squares of its components, and this architecture is repeated 3 times to generate 256 nodes before connecting to the final network output of 401 categories.
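
A single subspace stage of this architecture can be sketched in NumPy (a minimal illustration, not the authors' implementation):

```python
import numpy as np

def isa_pool(features, group_size=4):
    """Group consecutive units into subspaces of `group_size` and return
    the square root of the sum of squares within each subspace."""
    grouped = np.asarray(features, dtype=float).reshape(-1, group_size)
    return np.sqrt((grouped ** 2).sum(axis=1))

x = np.random.default_rng(0).standard_normal(4096)  # stand-in fc features
h = isa_pool(x)                                     # 1024 subspace responses
```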

Henry Machine

Henry Shu (Home)
Jerry Shu (Home)

Fundamentally different from deep learning/ConvNet/ANN/representation learning, Henry machine was trained using the traditional methodology: feature engineering --> classifier design --> prediction paradigm. The intent is to encourage continued research interest in many traditional methods in spite of the current popularity of deep learning.

The most recent (as of Nov 14, 2015) top-5 and top-1 accuracies of Henry machine for the Scene401 validation dataset are 68.53% and 36.15%, respectively.

Here are some characteristics of Henry machine.

- The features used are our own modified versions of a selection of features from the literature. These are engineered features, not learned features. Feature extraction was done for all 8.1M training, 380K test, and 20K validation images on their original, high-resolution versions with the original aspect ratio unchanged; i.e., no image resizing or rescaling was applied. The entire feature extraction step, using our own implementation, took 7 days to complete on a home-brewed CPU cluster (listed below) consisting of 12 low-end to mid-range computers borrowed from family and friends. Five of the twelve computers are laptops.

- We did not have time to study and implement many strong features from the literature. The accuracy of Henry machine is expected to increase once we include more such features. We believe that the craftsmanship of feature engineering encodes the expertise and ingenuity of the human designer behind it, and cannot (at least not yet) be dispensed with, in spite of the current popularity of deep learning and representation-learning approaches. Our humble goal with Henry machine is to encourage continued research interest in the craftsmanship of feature engineering.

- The training of Henry machine for Scene401 was also done using the home-brewed CPU cluster, and took 21 days to complete (not counting algorithm design/development/debugging time).

- While Henry machine was trained using traditional classification methodology, the classifier itself was devised by us from scratch instead of using conventionally available methods such as SVM. We did not use SVM because it suffers from several drawbacks. For example, SVM is fundamentally a binary classifier, and applying its maximum-margin formulation in a multiclass setting, especially with such a large number of classes (401 and 1000), could be ad hoc. Also, the output of SVM is inherently non-probabilistic, which makes it less convenient to use in a probabilistic setting. The training algorithm behind Henry machine is our attempt to address these and several other issues (not mentioned here), while at the same time making it efficient to train, with a small memory footprint, on mom-and-pop computers at home, using only CPUs. However, Henry machine is still very far from perfect, and our training algorithm still needs a lot of improvement. More details of the training and prediction algorithm will be available in our publication.

- As Nov 13 was fast approaching, we were pressed for time. The delay was mainly due to hardware heat stress from many triple-digit-temperature days in Sep and Oct here in California. At the end of Oct, we bought two GTX 970 graphics cards and implemented high-performance CUDA code (driver version 7.5) to help us finish the final prediction phase in time.

- We will also release the performance report of Henry machine on the ImageNet1000 CLS-LOC validation dataset.

- The source code for building Henry machine, including the feature extraction part, was primarily written in Octave (version 4.0.0) and C++ (gcc 4.8.4) on Linux machines (Ubuntu x86_64 14.04).

[DET] We follow the Fast R-CNN [1] framework for detection. EdgeBoxes is used for generating object proposals. A detection model is fine-tuned based on a pre-trained VGG16 [3] model on ILSVRC2012 CLS dataset. During testing, predictions on test images and their flipped version are combined by non-maximum suppression. Validation mAP is 42.3%.

[CLS-LOC] We train different models for classification and localization separately, i.e. GoogLeNet [4,5] for classification and VGG16 for bounding box regression. The final models achieve 28.9% top-5 cls-loc error and 6.62% cls error on the validation set.

[Scene] Due to limited time and GPUs, we trained just one CNN model for the scene classification task, namely VGG19, based on the resized 256x256 image datasets. The top-5 accuracy on the validation set with a single center crop is 79.9%. In the test phase, multi-scale dense evaluation is adopted for prediction, whose accuracy on the validation set is 80.7%.

[VID] First we apply Fast R-CNN with RPN proposals [2] to detect objects frame by frame. Then a Multi-Object Tracking (MOT) [6] method is utilized to associate the detections for each snippet. Validation mAP is 43.1%.

The submitted results were produced by a detection algorithm based on the YOLO detection method [1]. Compared to the original YOLO detector, I made several changes:
1. Downsize the training/testing images to 224x224 for faster training and testing;
2. Reduce the number of pooling layers to improve detection performance on small objects;
3. Add a weight-balancing method to deal with the unbalanced numbers of objects in training images.
[1] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, You Only Look Once: Unified, Real-Time Object Detection, arXiv:1506.02640

Isia_ICT

Xiangyang Li
Xinhang Song
Luis Herranz
Shuqiang Jiang

As the number of images per category in the training set is non-uniform, we sample 4,020 images for each class from the training dataset. We use this uniformly distributed subset to train our convolutional neural networks. In order to reuse the semantic information in the 205-category Places dataset [1], we also use models trained on that dataset to extract visual features for the classification task. Although the mid-level representations in convolutional neural networks are rich, their geometric invariance properties are poor [2], so we use multi-scale features. Precisely, we convert all layers in the convolutional neural network to convolutional layers and use the fully convolutional network to extract features at different input sizes. We use max pooling to pool the features to the same size as those produced by the fixed 227 input used to train the network. Finally, we combine features extracted from models that not only have different architectures but are also pre-trained on different datasets. We use the concatenated features to classify the scene images. Considering efficiency, we use a classifier composed of two fully-connected layers of 4096 and 401 units respectively and a softmax layer, trained on the sampled training examples.
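
The step of max-pooling multi-scale feature maps down to a common size can be sketched as follows (a minimal NumPy version; the bin layout and output size are our assumptions):

```python
import numpy as np

def pool_to_fixed(fmap, out_size=6):
    """Max-pool a (C, H, W) feature map (with H, W >= out_size) down to
    (C, out_size, out_size), so that maps extracted at different input
    scales become directly comparable and concatenable."""
    C, H, W = fmap.shape
    # Bin edges that partition the height and width into out_size slices.
    ys = np.linspace(0, H, out_size + 1).astype(int)
    xs = np.linspace(0, W, out_size + 1).astype(int)
    out = np.empty((C, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[:, i, j] = fmap[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(1, 2))
    return out
```

Applying this to the feature maps from each input scale and concatenating the results yields one fixed-length multi-scale descriptor per image.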

A hierarchical data-driven object detection framework is presented that considers the deep feature hierarchy of object appearances. We are motivated by the observations that many object detectors are degraded in performance by inter-class ambiguities and intra-class appearance variations, and that deep features extracted from visual objects show a strong hierarchical clustering property. We partition the deep features into unsupervised super-categories at the inter-class level and augmented categories at the object level to discover deep-feature-driven knowledge. We build a hierarchical feature model using the Latent Dirichlet Allocation (LDA) [6] algorithm and constitute a hierarchical classification ensemble.

Our method is mainly based on the Fast R-CNN framework [1]. The EdgeBoxes region proposal algorithm [2] is employed to generate regions of interest from an image, and features are computed using a 16-layer CNN [3] pre-trained on the ILSVRC 2013 CLS dataset and fine-tuned on the detection dataset. We make the final object localization decision using hierarchical ridge regression with extended bounding boxes similar to [5], and weighted non-maximum suppression similar to [1, 4].

We propose an object-detection-based tracking algorithm. Our method is mainly based on the Fast R-CNN framework [1]. The EdgeBoxes region proposal algorithm [2] is employed to generate regions of interest from each frame, and features are computed using a 16-layer CNN [3] pre-trained on the ILSVRC 2013 CLS dataset and fine-tuned on the ILSVRC 2015 video dataset. We implemented the tracking algorithm similarly to Partial Least Squares Analysis for generating a low-dimensional discriminative subspace in video [4]. For parameter optimization, we adopt the POMDP-based parameter learning approach described in our previous work [5]. We make the final object localization decision using bounding-box ridge regression and weighted non-maximum suppression similar to [1].

Our baseline for both the detection and classification/localization entries is a convolutional neural network. For the detection task, our method is based on the Fast R-CNN [1] framework: we trained a VGG16 network [2] on regions proposed by Selective Search [3], with Fast R-CNN's RoI pooling on top of the convolutional feature maps. For the classification and localization task, we trained a GoogLeNet network [4] with batch normalization [5]. The submitted model averages 5 models trained on multiple random crops and is tested on a single center crop, without any further data augmentation during training.

[1] R. Girshick, Fast R-CNN, in Proceedings of the International Conference on Computer Vision (ICCV), 2015.

In this work, we use a variant of GoogLeNet [1] for the localization task. We further use VGG classification models [2] to boost the performance of the GoogLeNet-based network. The overall training of the baseline localization network follows a procedure similar to [2]. Our CNN model is based on GoogLeNet [1]. We first trained the network to minimize the classification loss using batch normalization [3]. Starting from this pre-trained classification network, we fine-tuned it to minimize the localization loss. Then, we performed recursive localization to adjust the localization outputs using the outputs of a VGG-based classification network. To obtain the VGG-based network, we used pre-trained VGG-16 and VGG-19 models with multiple crops on a regular grid, selective crops based on objectness scores using a method similar to BING [4], and different image sizes. It is further tuned on the validation set.

Given top-5 object classes provided by multiple networks, including GoogLeNet [1] and VGG-16 [2], we localize each object with a single class-agnostic AttentionNet, which is a multi-class extension of [3]. In order to improve the localization accuracy, we significantly increased the network depth from 8 layers [3] to 22 layers. In addition, 1,000 class-wise direction layers and a classification layer are stacked on top of the network, sharing the convolutional layers. Starting from an initial bounding box, AttentionNet predicts quantized weak directions for the top-left and bottom-right corners pointing to a target object, and aggregates the predictions iteratively to guide the bounding box to an accurate object boundary. Since AttentionNet is a unified framework for localization, no independent pre/post-processing techniques, such as hand-engineered object proposals or bounding-box regression, are used in this submission.
[1] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[2] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR 2015.
[3] D. Yoo, S. Park, J.-Y. Lee, A. S. Paek, and I. S. Kweon. Attentionnet: Aggregating weak directions for accurate object detection. In ICCV, 2015.

Abstract:
In ILSVRC 2015 challenge, we (team MCG-ICT-CAS) participate in two tasks: object classification/localization (CLS-LOC) and object detection (DET) without using any outside images or annotations.
CLS-LOC: we perform object classification and localization sequentially, and describe them in turn below.
For object classification, we use a fusion of VGG [1] and GoogLeNet [2] models as our baseline. Although the top-5 accuracy of the GoogLeNet we used (released on the Caffe platform [3]) is 90.73% on the validation dataset, much lower than that of the original unreleased GoogLeNet from Google, we still reach a top-5 accuracy of 93.28% after fusion with VGG (92.55%). After further fusion with our own 2 models, we improve the accuracy to a final 93.72%. Our unique contributions are three-fold:
(1) Sparse CNN model (SPCNN): In early April of this year, we proposed to learn a compact CNN model, named SPCNN, to reduce computation and memory cost. Since the 4K x 1K (4M) connections between the 7th and 8th layers are among the densest between CNN layers, we focus on removing most of the small connections between them, because small connections are often unstable and may introduce noise. In fact, a given category (a neuron in the 8th layer) is naturally related to only a small number of category bases (neurons in the 7th layer); i.e., the former can be modelled as a sparse linear combination of a few of the latter. Hence, we propose to select only a small number of large-weight connections between a given category and the category bases for retraining, and to set the other small, unstable weights to a constant zero without retraining. Experiments on the validation dataset show that keeping and retraining only a very small percentage (about 9.12%) of the initial 4M connections still gains 0.32% in top-5 accuracy (92.87%) over the initial VGG model (92.55%), after removing the 90.88% small, unstable connections.
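The pruning idea behind SPCNN can be sketched roughly as follows (hypothetical NumPy code; `sparsify` and `keep_ratio` are our own names, and the subsequent retraining with the mask held fixed is omitted):

```python
import numpy as np

def sparsify(W, keep_ratio=0.0912):
    """Keep only the largest-magnitude connections; zero the rest.
    Mirrors the idea of retaining ~9.12% of the 4096x1000 fc7-fc8 weights."""
    k = max(1, int(round(W.size * keep_ratio)))
    thresh = np.sort(np.abs(W), axis=None)[-k]   # k-th largest magnitude
    mask = np.abs(W) >= thresh
    # The surviving weights would then be retrained with this mask fixed.
    return W * mask, mask

W = np.random.randn(4096, 1000)
W_sparse, mask = sparsify(W)
```

The mask makes the sparsity structure explicit, so the small "unstable" weights stay exactly zero during retraining.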
(2) Sparse Ensemble Learning (SEL): A large-scale training dataset poses a great challenge to the efficiency of both training and testing. To overcome the low efficiency and poor scalability of classical methods based on global classification, such as SVM, we proposed SEL for visual concept detection in [4]. It leverages sparse codes of image features both to partition the large-scale training set into small localities for efficient training and to coordinate the individual classifiers in each locality for final classification. The sparsity ensures that only a small number of complementary classifiers in the ensemble fire on a test sample, which not only gives better AP than other fusion methods such as average fusion and AdaBoost, but also allows highly efficient online detection. In this year's object classification task, we use 4K-dim CNN features as the image features instead of the traditional bag of words (BoW). After expanding the training dataset from 1.2 million samples to about 6 million through data augmentation, we use 8K-dim sparse codes of the CNN features to partition the training dataset into 8K small localities for efficient training, and use the sparse codes of each test sample for fusion. Experiments on the validation dataset show that, compared with CaffeNet, SEL with CNN features improves top-1 and top-5 accuracies by 3.6% and 1.0%, respectively.
(3) Additionally, to compare with the widely used average fusion (AVG) method, we try ordered weighted averaging (OWA) [5], in terms of top-5 accuracy, for the fusion of all models, including GoogLeNet, VGG, SPCNN and SEL. Experiments on the validation dataset show that OWA is better than average fusion.
For object localization, we mainly focus on the following three aspects:
(1) We apply the Fast R-CNN framework [6] to the object localization task. Instead of the standard Fast R-CNN Non-Maximum Suppression (NMS) filtering of overlapping detection proposals, we propose to fuse the most probable detection proposals with top-N (N=10) scores to obtain the final localization result.
(2) In order to get more confident proposals, we combine three different region proposal methods, Selective Search (SS) [7], EdgeBoxes (EB) [8] and Region Proposal Networks (RPN) [9], for both training and localization of objects.
(3) We try clustering-based object localization to obtain more positive training samples in each individual clustering subset, in addition to reducing the complexity of both training and localization.
With the above measures, we improve our localization accuracy from 83.2% (baseline) to around 84.9% on the validation dataset.
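The top-N fusion in (1) might look like this (an illustrative sketch; the exact weighting is not specified in the description, so we assume score-weighted averaging of coordinates):

```python
import numpy as np

def fuse_top_boxes(boxes, scores, n=10):
    """Fuse the top-N highest-scoring proposals into a single localization
    by score-weighted averaging of their [x1, y1, x2, y2] coordinates."""
    order = np.argsort(scores)[::-1][:n]
    w = scores[order] / scores[order].sum()
    return (boxes[order] * w[:, None]).sum(axis=0)
```

Unlike NMS, which discards all but one box per cluster, this keeps the evidence from all confident proposals in the final coordinates.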
DET: The overwhelming majority of existing object detection methods focus on reducing the number of region proposals while keeping high object recall, without considering category information, which may lead to many false positives due to interference between categories, especially when the number of categories is very large. To eliminate such interference, in this year's DET task we propose a novel category aggregation among region proposals, based on our observation that categories detected more frequently around an object have a higher probability of being present in the image. After further exploiting the co-occurrence relationships between categories, we can determine the most probable categories for an image in advance. Thus, many false-positive region proposals can be filtered out before the subsequent classification process. Our experiments on the validation dataset verify the effectiveness of both our proposed category aggregation and co-occurrence refinement approaches.

Masataka Yamaguchi (The Univ. of Tokyo)
Qishen Ha (The Univ. of Tokyo)
Katsunori Ohnishi (The Univ. of Tokyo)
Masatoshi Hidaka (The Univ. of Tokyo)
Yusuke Mukuta (The Univ. of Tokyo)
Tatsuya Harada (The Univ. of Tokyo)

We use Fast-RCNN[1] as the base detection system.

Before training models in the Fast R-CNN framework, we retrain VGG-16 [2] with object-level annotations from the CLS/LOC and DET data, as in [3]. We initialize two models with the original VGG-16 and the other two with the retrained one.

For all models, we concatenate whole-image features extracted by the CNN with the fc7-layer output and use the result as input to the inner-product layer before the softmax, while we use the original fc7-layer output as input to the bounding-box regressors. We multiply the whole-image features by a constant so that their norm is small compared to that of the fc7-layer output.
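The context-feature concatenation can be sketched like so (a hypothetical NumPy sketch; the actual scaling constant is not given, so `ratio` is an assumed knob):

```python
import numpy as np

def concat_context(fc7, global_feat, ratio=0.1):
    """Concatenate whole-image CNN features with a per-RoI fc7 output,
    scaling the global part so its norm stays small relative to fc7."""
    scale = ratio * np.linalg.norm(fc7) / (np.linalg.norm(global_feat) + 1e-12)
    return np.concatenate([fc7, global_feat * scale])
```

Scaling by relative norm keeps the global context from dominating the per-region evidence in the classification layer.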

We replace the pool5 layer of one model initialized from the original VGG-16 and the pool5 layer of one model retrained on annotated objects with RoI pooling layers, replace the pool4 layers of the other two models with RoI pooling layers, and then train them on the training set plus the val1 set (see [4]) using the Fast R-CNN framework.

During testing, we use object region proposals obtained from Selective Search [5] and MultiBox [6], and we test not only the original images but also horizontally flipped ones, combining the results.

In our experiments, replacing the pool4 layer rather than the pool5 layer with an RoI pooling layer improved mAP on the val2 set from 42.9% to 44.2%, and retraining the model with object-level annotations improved mAP from 44.2% to 45.6%.

We submitted two results: one obtained by model fusion with the same weights for all models, and the other by model fusion with weights learned separately for each class by Bayesian optimization on the val2 set.

* We used the GoogLeNet architecture for classification and localization. We also used Simple Linear Iterative Clustering (SLIC) for localization.

* For each ground-truth label, we selected 3 different nodes hierarchically, based on the WordNet tree structure. We then trained the model from scratch with these three selected nodes, one for each output level of the model.

* We tried two different schemes to train the GoogLeNet architecture. In the first scheme, we separated the architecture into three individual parts, such that each part contains one of the output layers of GoogLeNet, and trained these parts separately. We then reconstructed the original GoogLeNet architecture from these parts and fine-tuned it using all three output layers.
In the second scheme, we trained the GoogLeNet architecture with its three output layers without separating them.

* For each image, we selected multiple crops that are likely to contain objects using Simple Linear Iterative Clustering. Then, we selected five of the crops (one for each top-5 prediction) with the GoogLeNet classification model.

Our method learns bounding-box information from the outputs of a classifier rather than from CNN features. The classification score before the softmax is used as a feature. The input image is divided into grids of various sizes and locations, and testing is applied to the various crops of the input image. Only the classifier score for the ground-truth class is selected and stacked to generate a feature vector; we used 140 crops for this competition. The localization network is trained with this feature vector as input and bounding-box coordinates as output, using a Euclidean loss. Classification is performed by GoogLeNet [1] trained using the quick solver in [2]. Once the class is determined, the feature vector for localization is extracted and the bounding-box coordinates are predicted by the localization network. We used a single network for bounding-box estimation, so an ensemble of multiple models may improve performance.

The submitted result is computed using the VGG16 network, which contains 13 convolutional layers and 3 fully connected layers. After the network is trained for several epochs using the procedure described in the original paper, we fine-tune it with a weighted cross-entropy loss, where the weights are determined per class based on the class fitting errors. At test time, we conduct multi-resolution testing: the images are resized to three different resolutions, 10 crops are extracted at each resolution, and the final score is the average of the scores of the 30 crops.
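The 30-crop testing scheme can be sketched as follows (illustrative code; `score_fn` stands in for the trained VGG16, and the standard 4-corners-plus-center crop geometry is our assumption):

```python
import numpy as np

def ten_crops(img, size):
    """Four corners + center of an (H, W, C) image, plus horizontal flips."""
    H, W = img.shape[:2]
    tl = [(0, 0), (0, W - size), (H - size, 0), (H - size, W - size),
          ((H - size) // 2, (W - size) // 2)]
    crops = [img[y:y + size, x:x + size] for y, x in tl]
    return crops + [c[:, ::-1] for c in crops]

def multi_resolution_score(score_fn, images_at_resolutions, crop_size):
    """Average class scores over 10 crops from each resized image (3 x 10 = 30)."""
    scores = [score_fn(c) for img in images_at_resolutions
              for c in ten_crops(img, crop_size)]
    return np.mean(scores, axis=0)
```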

MPG_UT

Noriki Nishida,
Jan Zdenek,
Hideki Nakayama

Machine Perception Group
Graduate School of Information Science and Technology
The University of Tokyo

Our proposed system consists of two components:
the first component is a deep neural network that predicts bounding boxes in every frame while utilizing contextual information,
and the second component is a convolutional neural network that predicts class categories for given regions.
The first component uses a recurrent neural network ("inner" RNN) to sequentially detect multiple objects in every frame.
Moreover, the first component uses an encoding bidirectional RNN ("outer" RNN) to extract temporal dynamics.
To facilitate the learning of the encoding RNN, we develop decoding RNNs to reconstruct the input sequences.
We also use curriculum learning for training the "inner" RNN.
We use the Oxford VGG-Net with 16 layers in Caffe to initialize our ConvNets.

MSRA

Kaiming He
Xiangyu Zhang
Shaoqing Ren
Jian Sun

We train neural networks with depth of over 150 layers. We propose a "deep residual learning" framework [a] that eases the optimization and convergence of extremely deep networks. Our "deep residual nets" enjoy accuracy gains when the networks are substantially deeper than those used previously. Such accuracy gains are not witnessed for many common networks when going deeper.
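The core idea of residual learning can be sketched as follows (a toy NumPy sketch of y = relu(F(x) + x), our own illustration rather than the authors' implementation):

```python
import numpy as np

def residual_block(x, w1, w2):
    """Toy residual unit: the stacked layers learn only the residual F(x),
    and an identity shortcut adds the input back, y = relu(F(x) + x)."""
    relu = lambda v: np.maximum(v, 0)
    f = relu(x @ w1) @ w2   # residual mapping F(x)
    return relu(f + x)      # identity shortcut
```

Because the shortcut carries the input unchanged, a block can realize the identity mapping simply by driving its weights toward zero, which is what eases optimization at extreme depth.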

Our localization and detection systems are based on deep residual nets and the "Faster R-CNN" system in our NIPS paper [b]. The extremely deep representations generalize well, and greatly improve the results of the Faster R-CNN system. Furthermore, we show that the region proposal network (RPN) in [b] is a generic framework and performs excellently for localization.

We only use the ImageNet main competition data. We do not use the Scene/VID data.

This is a baseline entry relying on the current state-of-the-art in standard object detection.
We use the Faster-RCNN framework [1] and finetune the network for the 30 classes with the provided training data.
The model is essentially the same as in [1], except that the training data has changed.
We did not experiment much with the hyper-parameters, but we expect better results with more training iterations and proper sampling of the video frames.
This model does not exploit temporal information at all but is a good object detector on static images, which is why this entry should serve as a baseline.

We first built an AlexNet [NIPS2012_4824] model using Caffe [jia2014caffe] on ImageNet's object classification dataset.
Our AlexNet achieves 55.76% accuracy on object classification.
Then we replaced the classification layers (fc6 and fc7) with larger ones.
Comparing the size of the object classification dataset with that of the Places2 dataset, we doubled the number of hidden nodes in these two layers.
We connected this structure to a new output layer with 401 nodes for the 401 classes in the Places2 dataset.
The Place2 training dataset was split into 2 parts.
The first part was used to adjust weights on the new layers.
Using the first part, the model was trained for 1,000,000 iterations in total.
The learning rate was initialized to 0.001, then decreased by a factor of 10 every 200,000 iterations.
This yielded 43.18% accuracy on the validation set.
Then we retrained the whole convolutional network using the second part.
We set the learning rate of the new layers to be 100 times higher than that of the lower layers.
This raised the validation accuracy to 43.31%.
We also trained on the Places2 training dataset from scratch, but the method described above achieved a better result.
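The step schedule described above corresponds to the following (a one-line sketch):

```python
def step_lr(base_lr=0.001, gamma=0.1, step=200_000, it=0):
    """Step learning-rate schedule: divide the rate by 10 every 200,000 iterations."""
    return base_lr * gamma ** (it // step)
```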

For the scene task, we first train VGG-16 [5], VGG-19 [5] and Inception-BN [1] models; after this step, we use a CNN tree to learn fine-grained features [6]. Finally, we combine the CNN tree model with the VGG-16, VGG-19 and Inception-BN models for the final prediction.

We jointly train image classification and object localization on a single CNN
using cross entropy loss and L2 regression loss respectively. The network
predicts both the location of the object and a corresponding confidence score.
We use a variant of the network topology (VGG-A) proposed by [1]. This network
is initialized with the weights of a classification-only network. It is used to
identify bounding boxes for the objects, while a 144-crop classification is used
to classify the image.
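The joint objective can be sketched as follows (a NumPy sketch; `lam` is an assumed balancing weight not specified in the description):

```python
import numpy as np

def joint_loss(class_logits, label, box_pred, box_gt, lam=1.0):
    """Single-network joint objective: cross-entropy for the class,
    L2 regression for the box, combined with an assumed weight `lam`."""
    z = class_logits - class_logits.max()          # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    cls_loss = -log_probs[label]
    loc_loss = np.sum((box_pred - box_gt) ** 2)
    return cls_loss + lam * loc_loss
```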

The network has been trained on Intel Parallel Computing Lab’s deep learning library
(PCL-DNN) and all the experiments were performed on 32-node Xeon E5 clusters. A network
of this size typically takes about 30 hrs for training on our deep learning framework.
Multiple experiments for fine-tuning were performed in parallel on NERSC’s Edison and
Cori clusters, as well as Intel’s Endeavor cluster.

We present NeoNet, an inception-style [1] deep convolutional neural network ensemble that forms the basis for our work on object detection, object localization and scene classification. Where traditional deep nets in the ImageNet challenge are image-centric, NeoNet is object-centric. We emphasize the notion of objects during pseudo-positive mining, in the improved box proposals [2], in the augmentations, during batch-normalized pre-training of features, and via bounding box regression at run time [3].

Peng Han, Renmin University of China
Wenwu Yuan, Renmin University of China
Zhiwu Lu, Renmin University of China
Jirong Wen, Renmin University of China

The main components of our algorithm are the R-CNN model [1] and the video segmentation algorithm [2]. Given a video, we use the well-trained R-CNN model [1] to extract the potential bounding boxes and their object categories for each keyframe. Considering that R-CNN ignores the temporal context across the keyframes of the video, we further utilize the results (with temporal context) of the video segmentation algorithm [2] to refine those of R-CNN. In addition, we define several local refinement rules using the spatial and temporal context to obtain better object detection results.

Our submissions are trained with modified versions of [1] and [2]. We use the structure of [1], but remove the batch normalization layers and replace ReLU with PReLU [3]. Meanwhile, a modified version of latent semantic representation learning is integrated into the structure of [1].

We propose a new scene recognition system with deep convolutional models. Specifically, we address this problem from four aspects:

(i) Multi-scale CNNs: we utilize the Inception2 architecture as our main network structure due to its performance and efficiency. We propose a multi-scale CNN framework in which we train CNNs on image patches at two resolutions (224 cropped from 256, and 336 cropped from 384). For the CNN at low resolution, we use the same network as Inception2. For the CNN at high resolution, we design a deeper network based on Inception2.

(ii) Handling Label Ambiguity: As the scene labels are not mutually exclusive and some categories are easily confused, we propose two methods to handle this problem. First, according to the confusion matrix on the validation dataset, we merge some scene categories into a single super-category. Second, we utilize the Places205 scene model to test images and use its soft output as another target to guide the training of the CNN.

(iii) Better CNN Optimization: We use a large batch size (1024) to train the CNNs. Meanwhile, we decrease the learning rate in an exponential form. Moreover, we design a locally-supervised learning method to learn the CNN weights.

(iv) Combining CNNs of different architectures: considering the complementarity of networks with different architectures, we also fuse the prediction results of VGGNet13, VGGNet16, VGGNet19 and MSRANet-B.

Our method for scene classification is based on deep convolutional neural networks.
We used networks pre-trained on the ILSVRC2015 localization dataset and retrained them on the 256x256 Places2 dataset.
For testing, we used ten-crop data augmentation and a combination of four slightly different models.
This work was supported by Hanwha Techwin.

We design our detection model based on Fast R-CNN [1], and improve it in the following two aspects. First, we use a CNN-based method to re-rank and refine the proposals generated by proposal methods: we re-rank the proposals to reject low-confidence ones, and refine them to obtain more accurate locations. Second, we incorporate self-paced learning (SPL) in our optimization stage: we first initialize a detector with all the annotated training samples, assign a confidence to each candidate with this detector, and then fine-tune the detector using the high-confidence candidates.

The localization task is composed of two parts: indicating the category and pointing out where it is. Given the great progress in image classification [1][2], indicating the category no longer seems to be the bottleneck of the localization task. In previous competitions, dense box regression [3][2] was used to generate boundary predictions. However, the image crops for dense prediction are bound to certain positions and scales, and it is hard to regress to a single box when more than one object appears in the same crop.

Given many box proposals (e.g., from selective search [5], about 1500 boxes per image), the localization task becomes (a) finding the real boxes that contain ground truth, and (b) determining which category lies in each box. Enlightened by Fast R-CNN [4], we propose a classification and localization framework that combines global context information with local box information by selecting proper (class, box) pairs: first we pre-train a GoogLeNet for classification; then two kinds of transformations are applied to the network, yielding objectness and the local category within each box; finally, an SVM classifier is applied to select the proper (class, box) pairs.

1. Fine-tuning for Objectness
Similar to Fast R-CNN, we train a network to indicate the "objectness" of a box. The objectness probability is computed by a two-class softmax layer over a fully-connected layer. A box proposal is positive if its intersection over union (IoU) with ground truth is at least 0.7, and negative if IoU < 0.3. No 1000-category information is used at this stage. A box-offset regression layer is also jointly trained with objectness, and is ignored when the box is negative, as is standard practice in [4].
At test time, the objectness of all box proposals is evaluated by the network. Then we perform non-maximum suppression and keep the boxes with the highest objectness.
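Standard greedy NMS over objectness scores, as used here, can be sketched as follows (illustrative code with an assumed suppression threshold):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def nms(boxes, scores, thresh=0.3):
    """Keep the highest-objectness boxes, suppressing heavy overlaps."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```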

2. Fine-tuning for Local Classification
We obtain the global classification by averaging the results of multiple crops (e.g., 144 crops per image [1]) using the pre-trained GoogLeNet model. Separately, we fine-tune GoogLeNet to indicate the 1000 categories within a box, which we call "local classification". The training samples are drawn around the ground truth and resized to the network input.
At test time, we clip images at the regressed boundaries of the boxes with the highest objectness and pass them through the local classification network. Classes with high probability in both the local and global classifications are retained.

3. Combine Information and Pair Select
So far, we have obtained objectness and offset regressions for some boxes, plus classification results at both the local and global levels. For each possible (class, box) pair, all this information forms a feature vector, on which an SVM is trained.
At test time, the pairs with top-5 SVM confidence become our final result.

It may be surprising that this simple strategy can be more accurate even without model ensembles. We believe that object boundaries have much to do with low-level vision, such as corners, edges and shapes; as the network goes deeper, the features become more and more abstract. It is questionable to train a box regression while retaining 1000-category information, especially when there are only 500 images per category on average. Our solution avoids seeking boxes directly, and instead takes full advantage of the classification capability of deep networks.

Zifeng Wu, the University of Adelaide
Chunhua Shen, the University of Adelaide
Anton van den Hengel, the University of Adelaide

Our method is largely an improvement upon Fast R-CNN. Multiple VGG16 and VGG19 networks are involved, pre-trained on the ImageNet CLS-LOC dataset. Each network is initialized from a different model and/or tuned with a different data augmentation strategy. Furthermore, we observe that feature maps obtained with the 'hole convolution algorithm' are beneficial. Selective Search proposals are filtered by a pre-trained two-way (object vs. non-object) classifier. The outputs of each network for the original and flipped images at multiple scales are averaged to obtain the predictions.

THU-UTSA-MSRA

Liang Zheng, University of Texas at San Antonio
Shengjin Wang, Tsinghua University
Qi Tian, University of Texas at San Antonio
Jingdong Wang, Microsoft Research

Our team submits results on the scene classification task using the Places2 dataset [4].

We trained three CNNs on the Places2 dataset using GoogLeNet [1]; specifically, Caffe [2] is used for training. Among the three models, the first is trained by fine-tuning the GoogLeNet model trained on the Places dataset [3]. The classification accuracy of this model is: top-1 accuracy = 42.96%, top-5 accuracy = 75.35%. It is obtained after 3,320,000 mini-batches with a batch size of 32. The learning rate is set to 0.001, and gamma to 0.1, with a step size of 750,000.

The second and third models are trained using the "quick" and the default solvers provided with GoogLeNet, respectively, and both are trained from scratch. Specifically, for the second model, we use a base learning rate of 0.01, gamma = 0.7, and step size = 320,000. We run this model for 4,500,000 mini-batches, each of size 32. Its result on the validation set is: top-1 accuracy = 43.41%, top-5 accuracy = 75.37%. In more detail, we only change the GoogLeNet architecture to have 401 outputs in the last fully connected layer. The averaging operation is done after the softmax calculation.

For submission, we submit the results of each model as the first three runs (run 1, run 2, and run 3). Run 4 is the averaged result of the fine-tuned GoogLeNet + quick GoogLeNet. Run 5 is the averaged result of all three models.
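The run-4/run-5 fusion amounts to post-softmax averaging (a trivial sketch; function names are ours):

```python
import numpy as np

def fuse_runs(prob_list):
    """Average per-model class probabilities (after the softmax, as above)."""
    return np.mean(prob_list, axis=0)

def top5(probs):
    """Indices of the five highest-probability classes."""
    return np.argsort(probs)[::-1][:5]
```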

(The Third Research Institute of the Ministry of Public Security, P.R. China.)

Object detection:
Our models were trained based on Fast R-CNN and Faster R-CNN. 1) More training signals were added, including negative classes and objectness; some models were trained on 489 subcategories first, then fine-tuned on the 200 categories. 2) Pooling layers were replaced with strided convolutional layers for more accurate localization. 3) Extra data from Microsoft COCO and more anchors were used. 4) An iterative scheme alternates between scoring the proposals and refining their localizations with bounding-box regression. 5) Various models were combined with weighted NMS.

Object localization:
Different data augmentation methods were used, including random crops, multiple scales, and contrast and color jittering. Some models were trained maintaining the aspect ratio of the input images, while others were not. In the test phase, whole uncropped images are densely processed at various scales, and a fused classification result is generated from the scores and labels jointly. On the localization side, we follow the Fast R-CNN framework, with an iterative scheme as in the detection task. We then select the top-k regions and average their coordinates as the output. Results from multiple models are fused in different ways, using model accuracy as weights.

Object detection from video:
We use the same models as in the object detection task; part of these models were fine-tuned using the VID data. We also try a main-object constraint by considering the whole snippet rather than single frames.

Scene classification:
Based on both MSRA-net and BN-GoogLeNet, plus several improvements: 1) Choose a subset of the data stochastically at each epoch, ensuring each class has roughly the same number of images; this both accelerates training and increases model diversity. 2) To utilize whole-image and object-part information simultaneously, three patches of different sizes (the whole image at 224x224, a 160x160 crop, and a 112x112 crop) are fed into the network and concatenated at the last convolutional layer. 3) Enlarge the MSRA-net to 25 layers, and change some BN-net inputs from 224x224 to 270x270. 4) Use dense sampling and multi-crop (50x3) testing.
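The stochastic class-balanced subset selection of improvement 1) might look like this (hypothetical code; `per_class` is an assumed parameter):

```python
import random

def balanced_epoch(images_by_class, per_class):
    """Stochastically pick an equal-sized subset per class for one epoch,
    so every class contributes roughly the same number of images."""
    subset = []
    for cls, imgs in images_by_class.items():
        k = min(per_class, len(imgs))
        subset += [(cls, im) for im in random.sample(imgs, k)]
    random.shuffle(subset)
    return subset
```

Resampling each epoch also means different models see different subsets, which is where the extra diversity comes from.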

Our system uses deep convolutional neural networks (CNNs) for object detection and can be broken up into three distinct phases: (i) object proposal generation (ii) object classification and (iii) post-processing using non-maximum suppression (NMS). The first two phases are based on the faster R-CNN framework presented in [1].

First, we find image regions that may contain objects via bounding boxes (i.e. proposals) using the fully convolutional region-proposal network (RPN) of [1]. The network’s topology is similar to the Zeiler-Fergus (ZF) network [2]. Once the RPN has been trained, we train another neural network (RCNN) [1] that tries to classify the object in a given proposal. For this phase, we use a VGG16-style [3] network that was pre-trained on the ImageNet Classification and Localization Data (CLS) and only fine-tune the last fully-connected layer. Instead of trying to classify 200 objects, the layer has been altered to classify a proposal as being one of 30 classes. Both the RPN and RCNN are trained using the initial training release.

During testing, we pass an image to the RPN which extracts 500 proposals. We then apply the RCNN to each proposal and extract 30 confidence scores and 30 refined bounding boxes. We then consider one of two possible post-processing algorithms: (i) NMS on each frame (ii) sequence-NMS (SEQ-NMS) on each video snippet.

Regular frame-wise NMS operates on a single frame: it iteratively selects a class's most confident detection in the frame and removes nearby detections that overlap sufficiently. SEQ-NMS, on the other hand, operates on each video snippet. It iteratively selects the sequence of boxes over time with the maximum score and then suppresses detections that overlap with any member of the selected sequence. We accomplish this by constructing a graph over the snippet's frames and applying dynamic programming. Each sequence is scored using either the average (NMS-AVG) or the maximum (NMS-MAX) of its box scores, and each box in a kept sequence is re-scored with the overall sequence score. One of our models chooses between the average and the max per class, depending on each method's performance on the initial validation set.
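The dynamic-programming core of the sequence selection can be sketched as follows. This is a simplification of SEQ-NMS as described above: it finds only the single highest-scoring chain of linked boxes (sum of scores), and omits the suppression and re-scoring steps; all names are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def best_sequence(frames, iou_thr=0.5):
    """frames: list of frames, each a list of (box, score).

    Returns (total_score, [(frame_idx, box_idx), ...]) for the
    highest-scoring chain of boxes linked across consecutive
    frames with IoU >= iou_thr."""
    # dp[t][i] = best cumulative score of a chain ending at box i of frame t
    dp, back = [], []
    for t, frame in enumerate(frames):
        dp.append([s for _, s in frame])
        back.append([None] * len(frame))
        if t == 0:
            continue
        for i, (box, s) in enumerate(frame):
            for j, (pbox, _) in enumerate(frames[t - 1]):
                if iou(box, pbox) >= iou_thr and dp[t - 1][j] + s > dp[t][i]:
                    dp[t][i] = dp[t - 1][j] + s
                    back[t][i] = j
    # Trace back from the overall best endpoint
    t_best, i_best = max(((t, i) for t in range(len(frames))
                          for i in range(len(frames[t]))),
                         key=lambda ti: dp[ti[0]][ti[1]])
    seq, t, i = [], t_best, i_best
    while True:
        seq.append((t, i))
        j = back[t][i]
        if j is None:
            break
        t, i = t - 1, j
    return dp[t_best][i_best], seq[::-1]
```

In full SEQ-NMS this selection would repeat: after each chain is chosen, overlapping boxes are suppressed and the chain's boxes are re-scored with the sequence score.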

Abstract:
We develop a new architecture for deep Convolutional Neural Networks (CNNs), named Filter Panorama Convolutional Neural Networks (FPCNNs), for this scene classification competition. Convolutional layers are essential parts of CNNs, and each layer is comprised of a set of trainable filters. To enhance the representation capability of the convolutional layers with more filters while maintaining almost the same parameter size, the filters of one convolutional layer (or possibly several convolutional layers) of FPCNNs are replaced by a filter panorama, wherein each window of the filter panorama serves as a filter. With the densely extracted overlapping windows of the filter panorama, a significantly larger filter set is obtained without the risk of overfitting, since the parameter size of the filter panorama is the same as that of the original filters in CNNs.

The idea of the filter panorama is inspired by the epitome [1], developed in the computer vision and machine learning literature as a condensed version of Gaussian Mixture Models (GMMs). In an epitome, the Gaussian means are represented by a two-dimensional matrix in which each window contains the mean parameters of one Gaussian component; the same structure is adopted for the covariances. With almost the same parameter space as a GMM, the epitome possesses significantly more Gaussian components, since many more means and covariances can be extracted densely from its mean and covariance matrices. Therefore, the generalization and representation capability of the epitome outshines that of GMMs with almost the same parameter space, while circumventing potential overfitting.

The above characteristics of the epitome encourage us to arrange filters in a similar way in FPCNNs. More precisely, we construct a three-dimensional matrix named the filter panorama for a convolutional layer of FPCNNs, wherein each window of the panorama plays the same role as a filter in the convolutional layer of a CNN. The filter panorama is designed such that the number of non-overlapping windows in it is almost equal to the number of filters in the corresponding convolutional layer of a CNN. By densely extracting overlapping windows from the filter panorama, we obtain many more filters that vary smoothly in the spatial domain, with neighboring filters sharing weights in their overlapping regions. These smoothly varying filters tend to activate on more types of features that also exhibit small variations in the input volume, increasing the chance that the subsequent max-pooling layer extracts more robust, deformation-invariant features [2].
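The dense window extraction described above can be illustrated with a small sketch (shapes and names are assumptions for illustration): a panorama storing only C x H x W parameters yields far more filters at stride 1 than the non-overlapping tiling would.

```python
import numpy as np

def panorama_filters(panorama, kh, kw, stride=1):
    """Densely extract overlapping kh x kw windows from a filter
    panorama of shape (C, H, W); each window acts as one filter.

    Returns an array of shape (num_windows, C, kh, kw). Neighboring
    windows share weights in their overlapping regions.
    """
    C, H, W = panorama.shape
    windows = [panorama[:, y:y + kh, x:x + kw]
               for y in range(0, H - kh + 1, stride)
               for x in range(0, W - kw + 1, stride)]
    return np.stack(windows)
```

For example, a 5x5 panorama tiled by non-overlapping 3x3 windows holds roughly one filter's worth of extra parameters, yet stride-1 extraction yields nine filters from it.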

In addition to its superior representation capability, the filter panorama inherently enables better visualization of the filters by forming a "panorama" in which adjacent filters change smoothly across the spatial domain and similar filters group together. This benefits network design and makes it easier to observe the characteristics of the filters learnt in the convolutional layer.

Our localization models are based on deep convolutional neural networks.

For the classification model, which predicts the confidence score of each object class in the viewing window, we ensembled four deep neural networks: two that are similar to [1] (CNN) and two that are their variants. The variant models (CLSTM) replace the last two convolutional layers of [2] with a 2D-LSTM layer [3], which provides robustness to object warping as well as contextual information.

For the localization model, which predicts the location of the bounding box, we used the per-class regression (PCR) approach [3], which replaces the last 1000-D softmax layer with a 4000-D regression layer. Additionally, we replaced the SPP layer of CNN and CLSTM with a max-pooling layer. We submitted results from two combinations of these models: CNN-CNN and CNN-CLSTM.
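The idea behind the 4000-D per-class-regression output is that each of the 1000 classes owns its own 4 box coordinates; at test time the box for the predicted class is sliced out. A minimal sketch (the flat [x1, y1, x2, y2]-per-class layout is an assumption, not confirmed by the abstract):

```python
import numpy as np

def per_class_box(regression_out, class_id, num_classes=1000):
    """Slice one class's predicted box from a PCR output vector.

    regression_out: flat vector of length 4 * num_classes, assumed
    to store [x1, y1, x2, y2] for class 0, then class 1, and so on.
    """
    out = np.asarray(regression_out).reshape(num_classes, 4)
    return out[class_id]
```

The classification model supplies `class_id`; the regression head supplies `regression_out`.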

For training, we used the scale jittering strategy of [3], where the scale is sampled from a fixed set of scales. For testing, we used multi-scale testing and merged all bounding boxes obtained at each scale.
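The multi-scale merging step amounts to mapping each scale's boxes back to the original image resolution before pooling them; a minimal sketch under that assumption (names are illustrative):

```python
def merge_multiscale_boxes(detections_per_scale):
    """Pool detections from several test scales into one list.

    detections_per_scale: {scale: [((x1, y1, x2, y2), score), ...]}
    where the box coordinates were produced on an image resized by
    the factor `scale`. Boxes are mapped back to original resolution.
    """
    merged = []
    for scale, dets in detections_per_scale.items():
        for (x1, y1, x2, y2), s in dets:
            merged.append(((x1 / scale, y1 / scale,
                            x2 / scale, y2 / scale), s))
    return merged
```

The merged list would then typically go through NMS to remove duplicates across scales.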

All experiments were performed using our own deep learning library, VunoNet, on a GPU server with 4 NVIDIA Titan X GPUs.

We exploit a partially overlapping optimization strategy to improve convolutional neural networks, alleviating the optimization difficulty at lower layers and favoring better discrimination at higher layers. We have verified its effectiveness on VGG-like architectures [1]. We also apply two modified network architectures. Model A has 22 weight layers in total, adding three 3x3 convolutional layers to VGG-19 [1] and replacing the last max-pooling layer with an SPP layer [2]. Model B integrates multi-scale information. Moreover, we apply a balanced sampling strategy during training to tackle the non-uniform distribution of class samples. The algorithm and architecture details will be described in our arXiv paper (available online shortly).

In this competition, we submitted five entries. The first is a single model (Model A), which achieved 16.33% top-5 error on the validation set. The second is a single model (Model B), which achieved 16.36% top-5 error on the validation set. The third is a combination of multiple CNN models with an averaging strategy, the fourth a combination of these CNN models with a product strategy, and the fifth a combination of multiple CNN models with learnt weights.
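The three fusion strategies (averaging, product, learnt weights) can be sketched generically as follows; this is an illustration of the standard formulas, not the team's implementation:

```python
import numpy as np

def fuse_scores(model_probs, strategy="average", weights=None):
    """Fuse per-class probability vectors from several models.

    model_probs: array of shape (num_models, num_classes)
    strategy: 'average', 'product', or 'weighted' (learnt weights)
    """
    p = np.asarray(model_probs, dtype=float)
    if strategy == "average":
        fused = p.mean(axis=0)
    elif strategy == "product":
        fused = p.prod(axis=0)
    elif strategy == "weighted":
        w = np.asarray(weights, dtype=float)
        fused = (w[:, None] * p).sum(axis=0)
    else:
        raise ValueError(strategy)
    return fused / fused.sum()  # renormalize to a distribution
```

In the weighted case, `weights` would be learnt on held-out data rather than set by hand.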

The model is trained with the Fast R-CNN framework. Selective search is applied to generate object proposals. A VGG16 model pre-trained on the image-level classification task is used for initialization. A balanced fine-tuning dataset, constructed from the training and validation sets of the object detection task, is used to fine-tune the model. No other data augmentation or model combination is applied.

ZeroHero

Svetlana Kordumova, UvA; Thomas Mensink, UvA; Cees Snoek, UvA;

ZeroHero recognizes scenes without using any scene images as training data. Instead of using attributes for zero-shot recognition, we recognize a scene using a semantic word embedding spanned by a skip-gram model of thousands of object categories [1]. We subsample 15K object categories from the 22K-category ImageNet dataset, keeping those with more than 200 training examples, and use them to train an inception-style convolutional neural network [2]. An unseen test image is represented as the sparsified set of prediction scores of the last network layer with softmax normalization. For the embedding space, we learn a 500-dimensional word2vec model [3] trained on the title, description, and tag text of the 100M Flickr photos in the YFCC100M dataset [4]. The similarity between object and scene affinities in the semantic space is computed with cosine similarity over their word2vec representations, pooled with Fisher word vectors [1]. For each test image we predict the five highest-scoring scenes.
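The final scoring step can be illustrated with a simplified sketch: here the image is embedded as a score-weighted mean of object word vectors (a stand-in for the pooled Fisher word vectors above) and scenes are ranked by cosine similarity; all names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_scenes(object_scores, object_vecs, scene_vecs, k=5):
    """Rank unseen scene labels by semantic-embedding similarity.

    object_scores: {object_name: prediction score} for the test image
    object_vecs / scene_vecs: {name: word2vec vector}
    """
    dim = len(next(iter(object_vecs.values())))
    img = [0.0] * dim
    # Score-weighted mean of object embeddings as the image embedding
    for name, s in object_scores.items():
        for i, x in enumerate(object_vecs[name]):
            img[i] += s * x
    ranked = sorted(scene_vecs,
                    key=lambda sc: cosine(img, scene_vecs[sc]),
                    reverse=True)
    return ranked[:k]
```

Because no scene images are used, all supervision flows through the shared word2vec space linking object predictions to scene labels.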