Abstract

Learning visual representations from video for detection has emerged as one of the fundamental problems in general video understanding. It requires rich knowledge for spatial and/or temporal localization; however, manually collecting fully-supervised annotations is extremely expensive and does not scale. This thesis makes progress toward effectively employing weakly labeled data to learn video representations for detection tasks. Specifically, we focus on video object detection supervised by human action descriptions, and on temporal action detection supervised by video-level action categories.

For the weakly-supervised video object detection task, we propose a temporal dynamic graph long short-term memory network. It enables global temporal reasoning by constructing a dynamic graph that is based on temporal correlations among object proposals and spans the entire video. It significantly alleviates the missing-label issue in individual frames by transferring knowledge across correlated object proposals throughout the video. Extensive evaluations on a large-scale daily-life action dataset demonstrate the superiority of our proposed method.

For the weakly-supervised temporal action detection task, we propose three different approaches. (1) We propose an end-to-end framework that simultaneously updates the feature representation for classification and generates temporal proposals with a gated recurrent unit for detection. (2) We propose a novel structure-and-relation network, which includes a local structure module that leverages context information to improve localization, and a global relation module that processes all instances simultaneously by exploiting their interactions. These modules are integrated from a probabilistic perspective and can be learned in an end-to-end fashion. (3) We propose a marginalized dropout attention (MDA) mechanism for video feature aggregation, in order to learn more accurate per-frame action probabilities from the classification network. The MDA module acts as a structural regularizer, alleviating the common problem of attending only to the most salient frames. Our proposed methods outperform previous weakly-supervised approaches on several challenging video benchmarks.