
Video Segmentation – Related Work, Datasets and Benchmarks

Unsupervised video segmentation can be seen as an intermediate vision problem: although the labels carry no semantics, temporally coherent segments that adhere to object boundaries are desirable. Many video segmentation algorithms have been proposed, often generating oversegmentations, and this article is intended to give an overview of these algorithms, the datasets used, and benchmarks.

Unsupervised video segmentation inherits many of its difficulties from image segmentation, where it has been shown [1] that even humans have immense problems choosing the optimal number of segments and the granularity of the segmentation. In contrast, video provides valuable temporal information to guide segmentation. Still, most state-of-the-art approaches generate oversegmentations where the video volume is segmented into many so-called supervoxels [2]. Some authors [3] build hierarchies of segmentations by iteratively merging these supervoxels. However, choosing the optimal hierarchy level is difficult and no metrics are available to guide this decision.

This article is intended to give a brief overview of state-of-the-art video segmentation algorithms, evaluation benchmarks and datasets.

Video Segmentation

Video segmentation takes different forms and, therefore, highly varying approaches have been proposed. The earliest publications on video segmentation use mean-shift approaches [4], [5]. Later, Brendel and Todorovic [6] proposed to track regions over time. Lee et al. [7], Li et al. [8] as well as Papazoglou and Ferrari [9] present approaches for figure-ground segmentation. Other approaches are motivated by superpixel algorithms such as [10], [11] or [12] and introduce generalizations to video [3], [11], [13]. These algorithms are commonly referred to as supervoxel algorithms; however, the terms superpixel and supervoxel are not clearly defined throughout the literature. Galasso et al. [14] extend the algorithm of [15] to include motion cues, and a streaming version of this approach was introduced in [16]. Another streaming video segmentation algorithm, based on the approach by Grundmann et al. [3], was presented by Xu et al. [17].

Datasets

Proper datasets are essential for evaluating and comparing video segmentation algorithms. In the following, the most important datasets used for (semantic) video segmentation are presented. A short overview of additional datasets can also be found in [18].

Grundmann et al. [3] provide a set of 15 sequences of varying length and scenes. However, no ground truth annotation is provided, so the dataset can only be used for qualitative evaluation and most of the measures described later are not applicable.

Chen and Corso [19] provide a semantically labeled dataset with 8 sequences. The sequences are annotated according to 24 classes such as "building", "grass" or "body". Further, Chen and Corso [19] provide another dataset of 380 frames, split into 10 sequences and annotated with 4 classes ("vehicles", "road", "obstacles" and "others").

In [20], Xu and Corso use the SegTrack dataset by Tsai et al. [21], consisting of 6 sequences, to evaluate supervoxel algorithms. However, the dataset merely provides figure-ground annotations.

The Sintel dataset published by Butler et al. [22] provides 23 synthetic sequences taken from the open movie "Sintel". Ground truth annotation is derived from the material properties of the rendered scenes. As a result, the ground truth cannot be considered semantic; however, it still includes non-connected segments.

Liu et al. [23] provide the Wild8 dataset (currently not publicly available, but obtainable from the authors), consisting of 100 sequences of which 33 are semantically labeled. The sequences are taken from documentaries and segmented into 8 classes such as "water", "sky", "bird" or "lion".

Galasso et al. [18] provide the VSB100 dataset of short sequences as part of their video segmentation benchmark. They provide 40 training sequences and 60 test sequences with 11-15 frames per training sequence and 3-8 frames per test sequence. The sequences are taken from the Berkeley Video Dataset by Sundberg et al. [24].

An important drawback of most of the presented datasets is their semantic annotation: segments are not required to be connected. This may influence evaluation using some of the metrics discussed later. To convert the annotations into connected segments, a three-dimensional connected components algorithm can be used, as sketched below.
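
The following is a minimal sketch of such a conversion, assuming annotations are given as (T, H, W) integer arrays of semantic labels; the function name and the use of scipy.ndimage.label with its default 6-connectivity are my choices, not taken from any of the cited works.

```python
import numpy as np
from scipy import ndimage

def relabel_connected(labels):
    """Split semantic video labels into spatio-temporally connected segments.

    labels: (T, H, W) integer array of semantic labels (hypothetical layout).
    Returns an array of the same shape in which every segment is a
    3D connected component (default 6-connectivity over t, y and x).
    """
    relabeled = np.zeros_like(labels)
    next_label = 1
    for semantic_label in np.unique(labels):
        # Connected components within the mask of one semantic class.
        components, count = ndimage.label(labels == semantic_label)
        for component in range(1, count + 1):
            relabeled[components == component] = next_label
            next_label += 1
    return relabeled
```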

Benchmarks

To the best of my knowledge, two publications comparing state-of-the-art video segmentation algorithms are available: Xu and Corso [20] as well as Galasso et al. [18]. In both, the authors propose to generalize common metrics used for image segmentation:

Xu and Corso [20], as well as Galasso et al. [18], propose to generalize Boundary Recall to video volumes. Given a ground truth segmentation $G = \{G_i\}$ and a segmentation $S = \{S_j\}$, both partitions of the set of all pixels, 3D Boundary Recall is defined as

$3DRec(G, S) = \frac{TP(G, S)}{TP(G, S) + FN(G, S)}$

where $TP(G, S)$ is the number of true positive boundary pixels and $FN(G, S)$ the number of false negative boundary pixels. Thus, 3D Boundary Recall measures the fraction of correctly detected ground truth boundary pixels. Boundary pixels are identified both spatially and temporally; however, Xu and Corso [20] as well as Galasso et al. [18] do not distinguish between spatial and temporal boundary pixels. As a consequence, spatial boundary pixels in $S$ may be matched to temporal boundary pixels in $G$ and vice versa.
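
To make the definition concrete, here is a minimal sketch of 3D Boundary Recall in Python; the (T, H, W) label volumes, the two-sided boundary extraction and the tolerance parameter for small localization errors are my assumptions, common in boundary recall implementations but not prescribed by [20] or [18].

```python
import numpy as np
from scipy import ndimage

def boundaries(labels):
    """Mark voxels adjacent to a label change along any of the three axes.
    Both sides of a change are marked, so boundaries are two voxels thick."""
    b = np.zeros(labels.shape, dtype=bool)
    for axis in range(labels.ndim):
        change = np.diff(labels, axis=axis) != 0
        lo = tuple(slice(0, -1) if a == axis else slice(None) for a in range(labels.ndim))
        hi = tuple(slice(1, None) if a == axis else slice(None) for a in range(labels.ndim))
        b[lo] |= change
        b[hi] |= change
    return b

def boundary_recall_3d(gt, sv, tolerance=1):
    """Fraction of ground truth boundary voxels lying within `tolerance`
    voxels of a supervoxel boundary; gt and sv are (T, H, W) label volumes."""
    gt_boundary = boundaries(gt)
    sv_boundary = boundaries(sv)
    if tolerance > 0:
        # Dilating the supervoxel boundaries tolerates small localization
        # errors; the tolerance value is an assumption, not from [20]/[18].
        sv_boundary = ndimage.binary_dilation(sv_boundary, iterations=tolerance)
    tp = np.logical_and(gt_boundary, sv_boundary).sum()
    fn = np.logical_and(gt_boundary, ~sv_boundary).sum()
    return tp / max(tp + fn, 1)
```

Note that, as discussed above, this sketch also treats spatial and temporal boundary voxels uniformly.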

Furthermore, Xu and Corso [20] propose the following formulation of the 3D Undersegmentation Error:

$3DUE(G, S) = \frac{1}{|G|} \sum_{G_i} \frac{\left(\sum_{S_j \cap G_i \neq \emptyset} |S_j|\right) - |G_i|}{|G_i|}$

that is, for each ground truth segment $G_i$, the total volume of all supervoxels overlapping $G_i$ is compared to the volume of $G_i$ itself. As this formulation of the 3D Undersegmentation Error is not constrained to lie in $[0,1]$, comparison across datasets is difficult. Therefore, a generalization of the formulation by Neubert and Protzel [25] may be advantageous:

$3DUE_{NP}(G, S) = \frac{1}{N} \sum_{G_i} \sum_{S_j \cap G_i \neq \emptyset} \min\{|S_j \cap G_i|, |S_j \setminus G_i|\}$

where $N$ is the total number of pixels. Here, each overlapping supervoxel contributes the smaller of its volume inside and outside the ground truth segment, which bounds the measure to $[0,1]$.
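
The following sketch implements both formulations under the same assumptions as above, i.e. (T, H, W) label volumes; the function names are mine, and the implementation favors readability over speed.

```python
import numpy as np

def undersegmentation_error_3d(gt, sv):
    """3D Undersegmentation Error following Xu and Corso [20]: the excess
    volume of all supervoxels overlapping a ground truth segment, relative
    to that segment's volume, averaged over all ground truth segments."""
    errors = []
    for g in np.unique(gt):
        gt_mask = gt == g
        overlapping = np.unique(sv[gt_mask])
        overlap_volume = sum((sv == s).sum() for s in overlapping)
        errors.append((overlap_volume - gt_mask.sum()) / gt_mask.sum())
    return np.mean(errors)

def undersegmentation_error_np_3d(gt, sv):
    """Generalization of the Neubert and Protzel [25] formulation: each
    overlapping supervoxel contributes the smaller of its volume inside
    and outside the ground truth segment; normalization by the number of
    voxels bounds the result to [0, 1]."""
    error = 0
    for g in np.unique(gt):
        gt_mask = gt == g
        for s in np.unique(sv[gt_mask]):
            sv_mask = sv == s
            inside = np.logical_and(sv_mask, gt_mask).sum()
            outside = sv_mask.sum() - inside
            error += min(inside, outside)
    return error / gt.size
```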

18th October 2018, David Stutz
