Abstract

This paper presents a new approach to dynamic scene recognition based on a super descriptor tensor decomposition. Recently, local feature extraction based on dense trajectories has been used for modeling motion. However, dense trajectories usually include a large number of unnecessary trajectories, which increase noise, add complexity, and limit recognition accuracy. Another problem is that traditional bag-of-words techniques encode and concatenate the local features extracted from multiple descriptors into a single large vector for classification. This concatenation not only destroys the spatio-temporal structure among the features but also yields high dimensionality. To address these problems, we first refine the dense trajectories by selecting only salient trajectories within a region of interest containing motion. Visual descriptors consisting of histograms of oriented gradients and motion boundary histograms are then computed along the refined dense trajectories. In the case of camera motion, a short-window video stabilization step is integrated to compensate for global motion. Second, the features extracted from the multiple descriptors are encoded using a super descriptor tensor model. To this end, the Tucker-3 tensor decomposition is employed to obtain a compact set of salient features, followed by feature selection via Fisher ranking. Experiments are conducted on two benchmark dynamic scene recognition datasets: Maryland "in-the-wild" and YUPENN dynamic scenes. Experimental results show that the proposed approach outperforms several existing methods in recognition accuracy and achieves performance comparable to state-of-the-art deep learning methods, with classification rates of 89.2% on the Maryland dataset and 98.1% on the YUPENN dataset.
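The two encoding steps named above, a Tucker-3 decomposition followed by Fisher ranking, can be sketched in NumPy. This is a minimal illustration only, assuming truncated higher-order SVD (HOSVD) as one common way to compute a Tucker-3 model; the function names, rank choices, and the small epsilon in the Fisher score are ours, not the paper's implementation.

```python
import numpy as np

def unfold(T, mode):
    """Mode-m matricization: move the given mode to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def tucker3_hosvd(T, ranks):
    """Tucker-3 decomposition of a 3rd-order tensor via truncated HOSVD.

    Returns a core tensor G and factor matrices U so that
    T is approximated by G multiplied by U[m] along each mode m.
    """
    # Factor matrices: leading left singular vectors of each mode unfolding.
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    # Core tensor: project T onto each mode-m subspace.
    G = T
    for m, Um in enumerate(U):
        G = mode_multiply(G, Um.T, m)
    return G, U

def tucker_to_tensor(G, U):
    """Reconstruct the (approximate) tensor from the core and factor matrices."""
    T = G
    for m, Um in enumerate(U):
        T = mode_multiply(T, Um, m)
    return T

def fisher_scores(X, y):
    """Fisher score per feature: between-class over within-class variance.

    X is (samples, features); y holds class labels. Higher score means the
    feature separates the classes better, so features can be ranked by it.
    """
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)  # epsilon guards against zero within-class variance
```

With ranks smaller than the tensor's dimensions, the core `G` gives the compact feature set; sorting features by `fisher_scores` then selects the most discriminative ones for classification.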