Jia Li, Anlin Zheng, Xiaowu Chen, Bin Zhou

State Key Laboratory of Virtual Reality Technology and Systems, Beihang University

Published in ICCV, 2017

This paper proposes a novel approach for segmenting primary video objects using Complementary Convolutional Neural Networks (CCNN) and neighborhood reversible flow. The approach first pre-trains the CCNN end-to-end on a large collection of images with manually annotated salient objects; the trained CCNN has two separate branches that simultaneously handle two complementary tasks, i.e., foregroundness and backgroundness estimation. By applying the CCNN to each video frame, spatial foregroundness and backgroundness maps are initialized and then propagated between frames so as to segment primary video objects and suppress distractors. To enable efficient temporal propagation, each frame is divided into superpixels, and a neighborhood reversible flow is constructed that reflects the most reliable temporal correspondences between superpixels in far-away frames. Within this flow, the initialized foregroundness and backgroundness can be propagated efficiently and accurately along the temporal axis, so that primary video objects gradually pop out while distractors are well suppressed. Extensive experiments on three video datasets show that the proposed approach achieves impressive performance in comparison with 18 state-of-the-art models.

The Approach

The framework consists of two major modules. The spatial module trains the CCNN to initialize the foregroundness and backgroundness maps of each frame simultaneously; it runs on the GPU to provide pixel-wise predictions for every frame. The temporal module constructs a neighborhood reversible flow so as to propagate foregroundness and backgroundness along the most reliable inter-frame correspondences; it operates on superpixels for efficient temporal propagation.
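The exact construction of the neighborhood reversible flow is detailed in the paper. As a rough illustration only, assuming "neighborhood reversible" means that two superpixels appear in each other's k-nearest-neighbor lists under some superpixel feature distance (a mutual-kNN test; the function name and feature representation below are hypothetical), the correspondences between two frames could be sketched as:

```python
import numpy as np

def neighborhood_reversible_pairs(feat_a, feat_b, k=3):
    """Return index pairs (i, j) of superpixels that are mutually among each
    other's k nearest neighbors across two frames (a simplifying sketch of
    neighborhood reversibility, not the paper's exact construction).

    feat_a: (Na, d) array of superpixel features for frame A
    feat_b: (Nb, d) array of superpixel features for frame B
    """
    # Pairwise Euclidean distances between superpixel features.
    d = np.linalg.norm(feat_a[:, None, :] - feat_b[None, :, :], axis=2)
    # k nearest neighbors in each direction.
    nn_ab = np.argsort(d, axis=1)[:, :k]    # for each i in A: neighbors in B
    nn_ba = np.argsort(d, axis=0)[:k, :].T  # for each j in B: neighbors in A
    pairs = []
    for i in range(d.shape[0]):
        for j in nn_ab[i]:
            if i in nn_ba[j]:  # keep only mutually reversible matches
                pairs.append((i, j))
    return pairs
```

Restricting propagation to such mutually reversible pairs is what makes the correspondences reliable: a superpixel only exchanges foregroundness and backgroundness with superpixels that also point back at it.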

Complementary CNNs

Architecture of CCNN. Here CONV (3×3/2) denotes a convolutional layer with 3×3 kernels and a dilation of 2.

Foregroundness and backgroundness maps initialized by CCNN, together with their fusion maps (i.e., the element-wise maximum of the two maps). (a) and (e) video frames; (b) and (f) foregroundness maps; (c) and (g) backgroundness maps; (d) and (h) fusion maps. The foregroundness and backgroundness maps depict salient objects and distractors well in many frames (see (a)-(d)). However, they are not always perfectly complementary, leaving some areas mistakenly predicted in both the foreground and background maps (see the black areas in the fusion maps (h)).
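The fusion described above (taking the maximum of the two complementary maps at each pixel) can be sketched with NumPy; the map values here are random stand-ins for the outputs of the two CCNN branches:

```python
import numpy as np

# Stand-in foregroundness and backgroundness maps for a 4x4 frame,
# with values in [0, 1] as the two CCNN branches would produce.
rng = np.random.default_rng(0)
foreground = rng.random((4, 4))
background = rng.random((4, 4))

# Fusion map: element-wise maximum of the two complementary maps.
fusion = np.maximum(foreground, background)

# Pixels where both branches respond weakly stay dark in the fusion map;
# these are the mistakenly predicted black areas noted in the caption.
weak_both = fusion < 0.5
print(fusion.shape, int(weak_both.sum()))
```

If the two branches were perfectly complementary, every pixel would score high in exactly one map and the fusion map would be uniformly bright; the dark fusion regions expose where that assumption breaks down.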

Result

Representative results of our approach. Red masks are the ground truth, and green contours are the segmented primary objects.