Researchers

Accomplishments

Major Goals of the Project:

Motivated by the fact that multiple concepts that frequently co-occur across images form patterns which could provide contextual cues for individual concept inference, the objectives of the proposed EAGER project are:

(a) Develop a social network inspired formal framework for finding hierarchical co-occurrence correlation among concepts, and use these patterns of co-occurrence as contextual cues to improve the detection of individual concepts in multimedia databases.

(c) Develop an image content descriptor called concept signature that can record both the semantic concept and the corresponding confidence value inferred from low-level image features.

(d) Evaluate the effectiveness of the proposed approach in application domains such as automatic image annotation and concept-based image/video retrieval. The validation of the proposed techniques will be carried out by performing experiments on multiple databases using a variety of quantitative measures.

Accomplishments under these goals:

Major Activities (2017-18):

PI Bir Bhanu worked with his students Xiu Zhang and Raj Theagarajan to perform the proposed research, carry out the experiments and publish the research work.

During the project period, Xiu Zhang and Bir Bhanu wrote a paper on unbiased spatio-temporal semantic representation in video, which was accepted at a premier IEEE conference. This paper shows results on several large benchmark datasets which indicate that an unbiased spatio-temporal semantic representation can be learned in challenging re-identification videos.

2016-17

PI Bir Bhanu worked with his researchers/students Linan Feng, Xiu Zhang, Federico Pala and Raj Theagarajan to perform the proposed research, carry out the experiments and publish the research work. Xiu Zhang completed her course work, passed the PhD qualifying examinations and advanced to candidacy for the PhD degree in Computer Science.

During the project period Xiu Zhang, Federico Pala and Bir Bhanu wrote a paper on Attributes Co-occurrence Pattern Mining which has been accepted in a premier IEEE conference. This paper shows results on several large benchmark datasets which indicate that attributes can provide improvements both in accuracy and generalization capabilities.

Raj Theagarajan, Federico Pala and Bir Bhanu participated in a competition and workshop (held in conjunction with IEEE Conference on Computer Vision and Pattern Recognition) on identifying 10 vehicle classes using deep ensemble learning techniques that exploit logical reasoning as semantic information. A detailed paper on this work is forthcoming. The dataset for this work includes over ¾ million diverse images resembling a real-world environment.

2015-16

PI Bir Bhanu worked with his students Linan Feng and Xiu Zhang to perform the proposed research, carry out the experiments and publish the research work. Linan Feng and Bir Bhanu completed and revised a journal paper. Xiu Zhang worked on developing network-based hierarchical co-occurrence algorithms and exploiting correlation structure for available large image datasets for re-identification.

Specific Objectives:

2017-2018

For video analytics tasks, most current research encodes temporal and spatial information by using convolutional neural networks (CNNs) to extract spatial features and recurrent neural networks (RNNs) or their variants to discover time dependencies. However, this approach ignores the effect of complex backgrounds, which leads to a biased spatial representation. Further, it often uses back propagation through time (BPTT) to train the RNNs, and it is hard to learn long-term dependencies via BPTT because gradients vanish or explode. The significance of a frame should not be biased by its position in a given sequence. Semantic consistency (the identity of a person/concept) is a typical characteristic of the video-based person re-identification task. The challenge is that the target person is occluded by the complex background and by other pedestrians. To solve this problem, we propose a learning strategy that focuses on the silhouette region to remove background clutter and selects the important frames through which to back propagate when training the RNNs, which addresses the long-term dependency in video. Sparse attentive backtracking is used to prevent the biased temporal representation, produced by BPTT, that emphasizes only the last few frames.
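The position-independent frame selection described above can be illustrated with a minimal sketch. It is not the project's implementation: in the actual system the per-frame scores would come from a learned attention mechanism, and the names `select_frames`, `pool_features` and the parameter `k` are hypothetical. The sketch keeps the `k` highest-scoring frames wherever they occur in the sequence and normalizes their weights with a softmax.

```python
import math


def select_frames(scores, k):
    """Return {frame_index: weight} for the k highest-scoring frames.

    Selection depends only on the score, not on where a frame sits in
    the sequence, so an early frame can outweigh a late one.
    """
    if k <= 0 or not scores:
        return {}
    # Indices of the k largest scores (ties broken by the lower index).
    top = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]
    # Softmax over the selected scores only; subtract the max for stability.
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def pool_features(features, weights):
    """Weighted sum of per-frame feature vectors using the sparse weights."""
    dim = len(features[0])
    pooled = [0.0] * dim
    for i, w in weights.items():
        for d in range(dim):
            pooled[d] += w * features[i][d]
    return pooled
```

Frames outside the selected set receive zero weight, which in a trained model would also mean that no gradient flows back through them.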

2016-17

Unlike the widely used low-level descriptors, visual attributes (e.g., hair and shirt color) offer a human-understandable way to recognize objects such as people. In this work, a new way to take advantage of them is proposed for person re-identification, where the challenges include illumination, pose and viewpoint changes among non-overlapping camera views.

First, detect the attributes in images/videos by using deep learning-based convolutional neural networks.

Second, compute the dependencies among attributes by mining association rules, which are used to refine the attribute classification results.

Third, transfer the attribute learning task to person re-identification in video by using a metric learning technique.

Finally, integrate the attributes-based approach into an appearance-based method for video-based person re-identification and evaluate the results on benchmark datasets.
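The second step, mining association rules among attributes and using them to refine detections, can be sketched as follows. This is an illustrative stand-in, not the project's code (the report states that the Weka 3 package was used for mining). The function names, the thresholds and in particular the refinement rule in `refine_scores` are assumptions made for the example.

```python
from itertools import combinations


def mine_pair_rules(samples, min_support=0.3, min_confidence=0.8):
    """Mine rules A -> B between single attributes.

    samples: list of sets, each holding the attributes present in one image.
    Returns a sorted list of (antecedent, consequent, support, confidence).
    """
    n = len(samples)
    single = {}
    pair = {}
    for attrs in samples:
        for a in attrs:
            single[a] = single.get(a, 0) + 1
        for a, b in combinations(sorted(attrs), 2):
            pair[(a, b)] = pair.get((a, b), 0) + 1
    rules = []
    for (a, b), count in pair.items():
        support = count / n
        if support < min_support:
            continue
        # Test both directions of the rule.
        for x, y in ((a, b), (b, a)):
            confidence = count / single[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


def refine_scores(scores, rules, threshold=0.5):
    """Hypothetical refinement: raise a consequent's score when its
    antecedent is detected with a score of at least `threshold`."""
    refined = dict(scores)
    for x, y, _support, confidence in rules:
        if scores.get(x, 0.0) >= threshold:
            refined[y] = max(refined.get(y, 0.0), scores[x] * confidence)
    return refined
```

A full Apriori-style miner would also consider antecedents with more than one attribute; pairs are used here only to keep the example short.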

2015-16

1) Discover and represent the co-occurrence patterns as hierarchical communities by graph modularity maximization in a network with nodes and edges representing concepts and co-occurrence relationships, respectively.

Significant Results:

2017-18

A new method is developed to learn an unbiased semantic representation for video-based person re-identification. To handle the background clutter and occlusions, a pedestrian segmentation method is used to obtain the silhouette of the body. After the segmentation, an unbiased Siamese bi-directional recurrent convolutional neural network architecture is developed to learn the unbiased spatial and temporal representation. Experimental results on three public datasets demonstrate the effectiveness of the proposed method. The proposed method is capable of learning discriminative spatial representation by substituting invariant background and identifying the weights of frames independent of their positions in a video for learning a semantic concept.
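The background substitution step can be pictured with a small sketch. It assumes that a pedestrian segmentation method has already produced a binary silhouette mask for each frame; the function name `substitute_background` and the constant `fill` value are hypothetical, and the report does not specify what the background is replaced with.

```python
def substitute_background(frame, mask, fill=0):
    """Keep pixels where the mask is truthy and replace the rest with `fill`.

    frame: 2-D list of pixel values (rows of columns).
    mask:  2-D list of the same shape, 1 for the person, 0 for background.
    """
    same_rows = len(frame) == len(mask)
    if not same_rows or any(len(r) != len(m) for r, m in zip(frame, mask)):
        raise ValueError("frame and mask must have the same shape")
    return [
        [pixel if keep else fill for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(frame, mask)
    ]
```

Because every frame then shares the same background, the spatial features are driven by the silhouette region rather than by scene clutter.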

The proposed approach is evaluated on three of the most popular public datasets, iLIDS-VID, PRID 2011 and SDU-VID, and compared with other state-of-the-art methods. The iLIDS-VID dataset contains 300 persons recorded at an airport arrival hall using a CCTV network. Each person has two video acquisitions whose length varies from 23 to 192 frames. This dataset is very challenging due to the clothing similarities among people, changing illumination conditions and viewpoints, cluttered background and occlusions. The PRID 2011 dataset has 749 persons captured by two adjacent camera views. Only the first 200 pairs of videos are taken from both cameras. The length of the image sequences ranges from 5 to 675 frames, with an average of 100 frames. Compared with the iLIDS-VID dataset, this dataset is less challenging because it is captured in uncrowded outdoor scenes and has a relatively simple background and rare occlusions. The first 200 persons of this dataset are used for evaluation, as in prior research. The SDU-VID dataset includes 300 pairs of sequences. Each video has a variable length ranging from 16 to 346 frames with an average of 130 frames, which is somewhat longer than in the other two datasets. It is also a challenging dataset with cluttered background, occlusions and changing viewpoints.

We achieved rank 1 recognition rates of 64.7%, 75.2%, and 87.6% on the iLIDS-VID, PRID 2011 and SDU-VID datasets, respectively. In particular, our model outperforms the other compared methods in rank 1 accuracy on the iLIDS-VID and SDU-VID datasets. For the PRID 2011 dataset, our rank 1 accuracy of 75.2% is slightly lower than the best result of 78%. One reason is that we use an architecture similar to RCNN and BRCNN, whose performance is lower than that of other methods such as ASPTN and TSC. Another is that PRID 2011 has fewer occlusions and less background clutter than the other two datasets, whereas background substitution and bidirectional RNNs are most appropriate where the biased representation problem is pronounced.

2016-17

We validate our approach on two of the most important benchmark datasets for video-based person re-identification. The first dataset is iLIDS-VID, which consists of 2 acquisitions of 300 pedestrians at an airport arrival hall. The length of the videos varies from 23 to 192 frames with an average of 73 frames. This dataset is very challenging due to the changing illumination conditions and viewpoints, complex backgrounds and occlusions. The other dataset is PRID 2011, which includes 200 pairs of image sequences taken from two adjacent camera views. The length of the image sequences varies from 5 to 675 frames, with an average of 100 frames. Compared with the iLIDS-VID dataset, this is less challenging because of the relatively simple backgrounds and the rare presence of occlusions. PEdesTrian Attribute (PETA) is the dataset we used to learn attributes. It is a large-scale surveillance dataset of 19000 attribute-labeled images taken from 8707 persons. Each image is annotated with 61 binary and 4 multi-class attributes, such as hair style, clothing color and accessories. The images in this dataset are from 10 different datasets, including 477 images from iLIDS-VID and 1134 images from PRID. The images from the iLIDS-VID and PRID 2011 datasets are handled appropriately in performing experiments with PETA. For the attribute detection network, we use an NVIDIA Digits DevBox, which comes with four TITAN X GPUs with 7 TFlops of single precision, 336.5 GB/s of memory bandwidth, and 12 GB of memory/board. For the co-occurrence pattern mining, we use the Weka 3 package.

For each dataset, we randomly extract two equal subsets, one for training and one for testing. During the testing stage, for each query sequence, we compute the distance against each identity in the gallery set and return the top n identities. To measure the performance, the Cumulative Match Characteristic (CMC) plot is used, which represents the percentage of the test sequences that are correctly matched within the specified rank. The experiments are repeated 10 times and the average CMC plot is reported. For evaluating the attribute detection, the corresponding 477 and 1134 images of related person identities in iLIDS-VID and PRID 2011 are removed from the PETA dataset separately. Then we randomly select 16000 images from the remaining PETA dataset for training and leave the remaining 2523 and 1866 images from the two datasets for validation. We test the performance on two datasets of 150 and 100 videos from iLIDS-VID and PRID 2011, respectively. For all the experiments, the same parameters are used.
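The CMC measure described above can be computed as in the following minimal sketch. It assumes a precomputed query-by-gallery distance matrix and one gallery entry per identity; the function names `cmc_curve` and `average_curves` are hypothetical and not taken from the project's code.

```python
def cmc_curve(distances, query_ids, gallery_ids, max_rank):
    """Cumulative Match Characteristic for a set of queries.

    distances[q][g] is the distance between query q and gallery entry g.
    Entry r-1 of the result is the fraction of queries whose correct
    identity appears within the top r gallery matches.
    """
    hits = [0] * max_rank
    for q, row in enumerate(distances):
        # Gallery entries sorted from the closest to the farthest.
        order = sorted(range(len(row)), key=lambda g: row[g])
        ranked_ids = [gallery_ids[g] for g in order]
        if query_ids[q] not in ranked_ids:
            continue  # identity absent from the gallery: never a hit
        rank = ranked_ids.index(query_ids[q])  # 0-based rank of the match
        for r in range(rank, max_rank):
            hits[r] += 1
    return [h / len(distances) for h in hits]


def average_curves(curves):
    """Average several CMC curves, e.g. one per random train/test split."""
    return [sum(values) / len(values) for values in zip(*curves)]
```

The rank 1 rates quoted in this report correspond to the first entry of such an averaged curve.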

We compare our results with the following state-of-the-art methods: Recurrent Convolutional Neural Networks (RCNN), Top-Push (TDL), Temporally Aligned Pooling Representation (TAPR) and Simultaneously learning Intra-Video and Inter-video Distance Learning (SI2DL). We achieve rank 1 identification rates of 60.3% and 73.2% on the iLIDS-VID and PRID 2011 datasets, respectively, which are improvements of 2% and 2.6% over [McLaughlin et al. CVPR 2016]. For the iLIDS-VID dataset, our algorithm achieves the best rank 1 performance. For the PRID 2011 dataset, our results approach the best result, obtained by [Zhu et al. IJCAI 2016]. However, on the more challenging iLIDS-VID dataset, the method of [Zhu et al. IJCAI 2016] obtains the lowest recognition rates of all the listed results, whereas our method performs consistently well on both datasets. It is fair to say that attribute information and co-occurrence patterns are complementary to RCNN [McLaughlin et al. CVPR 2016]. Further details are given in the paper attached to this report.

2015-16

We carried out experiments for automatic image annotation and semantic image retrieval on several challenging datasets. We use three datasets: 10,000 images and 2500 concepts from the LabelMe dataset, 12,000 images and 5800 concepts from the SUN09 dataset, and 2682 images and 520 concepts from the OSR dataset. We use a variety of features (color, histogram of oriented gradients, etc.). We evaluate the results for automated image annotation using various measures, including the F1 measure and precision measures, and for retrieval using mean average precision. The key results are:

Co-occurrence pattern detection results: Our combined co-occurrence measure of Normalized Google Distance, normalized tag distance, and automated local analysis is more effective than each of the individual measures in co-occurrence network construction as well as in co-occurrence pattern detection. The combined measure gives the best performance in terms of the modularity measure.
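Of the three component measures, the Normalized Google Distance has a standard published definition, sketched below from occurrence counts. The function and argument names are chosen for this example. How the three measures are weighted in the combined measure is not stated in this report, so no combination is shown.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google Distance computed from occurrence counts.

    fx, fy: number of documents (or images) containing each concept.
    fxy:    number containing both concepts.
    n:      total number of documents indexed.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")  # the concepts never occur together
    if min(fx, fy) >= n:
        raise ValueError("n must exceed the smaller of fx and fy")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator
```

A value near 0 means the two concepts almost always appear together, and larger values mean weaker co-occurrence.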

Automated Image Annotation: To analyze the scalability of our approach, we compare the results on the three datasets in order of increasing complexity (OSR < SUN09 < LabelMe), measured by the total number of concepts in the dataset and the number of concepts per image. Our results show that, in general, the performance of all approaches drops as the images become more complex. In particular, we observe that our approach achieves a larger maximum performance gain when the images are more complex. For example, LabelMe usually has more than 10 concepts in an image, and the maximum performance gain reaches 20.59 percent when the training set contains 80 percent of the images. SUN09 contains on average 5-10 concepts per image, and the maximum performance gain is between 11.29 and 14.00 percent. OSR has the fewest concepts per image, and its maximum gain is also the lowest, at approximately 10.00 percent. This indicates that our approach is well suited for understanding images with complex scenes.

Image Retrieval: The proposed hierarchical concept co-occurrence patterns can boost the individual concept inference. In particular, we can observe that when using only a small fraction of the dataset for training, our method can still achieve comparatively good performance. Further, we observe that the returned images are more semantically related to the scene concept reflected in the query images rather than just visually related.

(a) The importance of the hierarchy of co-occurrence patterns and its representation as a network structure, and (b) The effectiveness of the approach for building individual concept inference models and the utilization of co-occurrence patterns for refinement of concept signature as a way to encode both visual and semantic information.

Key Outcomes or other Achievements:

2017-18

Our method demonstrates its effectiveness by improving the state-of-the-art results. It surpasses the previously published results of the McLaughlin method [best results, CVPR 2016], which uses the same architecture (RCNN), by 8.1%, 4.2% and 15.2% in rank 1 accuracy on the iLIDS-VID, PRID 2011 and SDU-VID datasets, respectively.

We plan to disseminate the software with the publication of the journal paper on spatio-temporal semantics.

2016-17

As compared to the state-of-the-art, our contribution can be summarized as follows:

We developed a novel framework that takes into account attributes and their co-occurrence.

We performed experiments that highlight the generalization capabilities of the framework. We train on a large independent attribute dataset and then test on two different re-id benchmarks. Unlike the work of Zhu et al. (IJCAI 2016), our approach performs consistently on both testing datasets. Experimental results on the two benchmark datasets indicate that attributes can provide improvements both in accuracy and generalization capabilities.

2015-16

Developed algorithms to represent the co-occurrence patterns as hierarchical communities by graph modularity maximization in a network with nodes and edges representing concepts and co-occurrence relationships, respectively.
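As an illustration of the quantity being maximized, the sketch below builds a concept co-occurrence network from per-image annotations and evaluates the Newman modularity of a given partition. It is not the project's algorithm: it only scores a candidate partition, whereas modularity maximization searches over partitions, and it uses raw co-occurrence counts as edge weights rather than the combined measure used in the project. All names are hypothetical, and every concept is assumed to belong to exactly one community.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_network(annotations):
    """Edge weight = number of images in which two concepts appear together."""
    weights = defaultdict(int)
    for concepts in annotations:
        for a, b in combinations(sorted(set(concepts)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def modularity(weights, communities):
    """Newman modularity Q of a partition of a weighted undirected graph."""
    label = {node: idx for idx, group in enumerate(communities) for node in group}
    total = sum(weights.values())  # m: total edge weight
    if total == 0:
        return 0.0
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed node strength per community
    for (a, b), w in weights.items():
        degree[label[a]] += w
        degree[label[b]] += w
        if label[a] == label[b]:
            internal[label[a]] += w
    # Q = sum over communities of (L_c / m) - (d_c / 2m)^2
    return sum(
        internal[c] / total - (degree[c] / (2 * total)) ** 2
        for c in degree
    )
```

A partition that groups frequently co-occurring concepts scores higher than one that lumps everything together, which always scores 0. Applying the search recursively inside each community is one way to obtain a hierarchy.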

Developed algorithms for a random walk process that works on the inferred concept probabilities with the discovered co-occurrence patterns to acquire the refined concept signature representation.
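One common way to realize such a refinement is a random walk with restart over the co-occurrence graph, sketched below. This is an assumption made for illustration, since the report does not give the exact formulation. The function name `refine_signature` and the `restart` and `iterations` parameters are hypothetical.

```python
def refine_signature(initial, weights, restart=0.5, iterations=50):
    """Random walk with restart over a concept co-occurrence graph.

    initial: {concept: confidence} from the individual concept detectors.
    weights: {(a, b): co-occurrence weight}, treated as undirected.
    Returns refined scores that sum to 1.
    """
    nodes = sorted(initial)
    neighbours = {n: {} for n in nodes}
    for (a, b), w in weights.items():
        if a in neighbours and b in neighbours:
            neighbours[a][b] = w
            neighbours[b][a] = w
    norm = sum(initial.values()) or 1.0
    start = {n: initial[n] / norm for n in nodes}
    current = dict(start)
    for _ in range(iterations):
        # Restart term pulls the walk back towards the detector outputs.
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            out = sum(neighbours[n].values())
            if out == 0:
                nxt[n] += (1 - restart) * current[n]  # isolated node keeps its mass
                continue
            for m, w in neighbours[n].items():
                nxt[m] += (1 - restart) * current[n] * w / out
        current = nxt
    return current
```

In this sketch a weakly detected concept gains score when it co-occurs strongly with a confidently detected one, which is the intended effect of using co-occurrence patterns as contextual cues. The output is a normalized distribution over concepts rather than a set of independent confidences.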

What opportunities for training and professional development has the project provided?

The project provided opportunities for research on large image databases, machine learning, deep learning and data mining, and for the development of algorithms/tools.

Websites

Acknowledgement

This material is based upon work supported by the National Science Foundation Project ID No. IIS-1552454. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.