Much progress has been made in image and video segmentation
in recent years. To a large extent, this success can be attributed to
strong appearance models learned entirely from data, in particular
using deep learning methods. However, to perform best these methods require
large representative training datasets with expensive pixel-level
annotations, which in the case of videos are prohibitively costly to obtain. Therefore,
there is a need to relax this constraint and to consider alternative forms
of supervision, which are easier and cheaper to collect. In this thesis,
we aim to develop algorithms for learning to segment in images and videos
with different levels of supervision.
First, we develop approaches for training convolutional networks with weaker
forms of supervision, such as bounding boxes or image labels, for object
boundary estimation and semantic/instance labelling tasks. We propose to
generate approximate pixel-level ground truth from these weaker forms of
annotation to train a network, which allows us to achieve high-quality
results comparable to those of full supervision without any
modifications of the network architecture or the training procedure.
Second, we address the problem of the excessive computational and memory
costs inherent to solving video segmentation via graphs. We propose
approaches to improve the runtime and memory efficiency as well as the
output segmentation quality by learning from the available training data
the best representation of the graph. In particular, we contribute methods for
learning must-link constraints and the topology and edge weights of the graph,
as well as for enhancing the graph nodes (superpixels) themselves.
Third, we tackle the task of pixel-level object tracking and address the
problem of the limited amount of densely annotated video data for training
convolutional networks. We introduce an architecture which allows training
with static images only and propose an elaborate data synthesis scheme
which creates a large number of training examples close to the target
domain from the given first frame mask. With the proposed techniques we
show that densely annotated consecutive video data is not necessary to
achieve high-quality temporally coherent video segmentation results.
In summary, this thesis advances the state of the art in weakly supervised
image segmentation, graph-based video segmentation and pixel-level object
tracking and contributes new ways of training convolutional
networks with a limited amount of pixel-level annotated training data.
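
To make the role of must-link constraints in the second contribution more concrete, the following is a minimal sketch of how such constraints can shrink a superpixel graph before partitioning. It assumes the constraints are already given as node pairs (in the thesis they are learned) and that the weights of parallel edges are summed after contraction. The function names and the summing rule are illustrative assumptions, not the method of the thesis.

```python
def find(parent, i):
    """Return the representative of node i, with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def contract_graph(num_nodes, edges, must_link):
    """Merge nodes joined by must-link constraints (hypothetical helper).

    edges:     dict mapping (u, v) node pairs to a float weight
    must_link: list of (u, v) pairs that must end up in the same segment
    Returns (node_map, reduced_edges), where node_map[i] is the new id of
    original node i.
    """
    parent = list(range(num_nodes))
    for u, v in must_link:
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[max(ru, rv)] = min(ru, rv)

    # Give the surviving groups consecutive ids.
    roots = sorted({find(parent, i) for i in range(num_nodes)})
    new_id = {root: k for k, root in enumerate(roots)}
    node_map = [new_id[find(parent, i)] for i in range(num_nodes)]

    # Assumption: parallel edges created by merging are combined by summing.
    reduced = {}
    for (u, v), weight in edges.items():
        a, b = node_map[u], node_map[v]
        if a == b:
            continue  # the edge now lies inside a merged node
        key = (min(a, b), max(a, b))
        reduced[key] = reduced.get(key, 0.0) + weight
    return node_map, reduced
```

In this sketch a single must-link pair removes one node and every edge that falls inside the merged group, which is where the runtime and memory savings of a smaller graph would come from.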

Convolutional networks reach top quality in pixel-level object tracking but
require a large amount of training data (1k ~ 10k) to deliver such results. We
propose a new training strategy which achieves state-of-the-art results across
three evaluation datasets while using 20x ~ 100x less annotated data than
competing methods. Instead of using large training sets hoping to generalize
across domains, we generate in-domain training data using the provided
annotation on the first frame of each video to synthesize ("lucid dream")
plausible future video frames. In-domain per-video training data allows us to
train high quality appearance- and motion-based models, as well as tune the
post-processing stage. This approach allows us to reach competitive results even
when training from only a single annotated frame, without ImageNet
pre-training. Our results indicate that using a larger training set is not
automatically better, and that for the tracking task a smaller training set
that is closer to the target domain is more effective. This changes the mindset
regarding how many training samples and general "objectness" knowledge are
required for the object tracking task.
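
The idea of synthesizing in-domain training data from the first frame annotation can be illustrated with a deliberately simplified sketch. It only translates the annotated object by a random offset and fills the vacated pixels with a constant, whereas a realistic synthesis scheme would also vary appearance, deform the object and reconstruct the background. The helper names, the constant fill and the list-of-lists image format are assumptions made for this illustration.

```python
import random


def shift_object(frame, mask, dy, dx, fill=0):
    """Move the masked object by (dy, dx) and return (new_frame, new_mask).

    frame and mask are lists of rows with equal size; mask holds 0 or 1.
    Assumption: pixels uncovered by the move are set to `fill`.
    """
    h, w = len(frame), len(frame[0])
    # Start from the background, with the object cut out.
    new_frame = [[fill if mask[y][x] else frame[y][x] for x in range(w)]
                 for y in range(h)]
    new_mask = [[0] * w for _ in range(h)]
    # Paste the object at its shifted position, dropping pixels that
    # would land outside the image.
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    new_frame[ny][nx] = frame[y][x]
                    new_mask[ny][nx] = 1
    return new_frame, new_mask


def synthesize_pairs(frame, mask, count, max_shift=2, seed=0):
    """Create `count` (frame, mask) training pairs from one annotated frame."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(count):
        dy = rng.randint(-max_shift, max_shift)
        dx = rng.randint(-max_shift, max_shift)
        pairs.append(shift_object(frame, mask, dy, dx))
    return pairs
```

Every synthesized pair comes with an exact mask at no extra annotation cost, which is what makes per-video training data cheap to produce in this setting.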

State-of-the-art learning-based boundary detection methods require extensive
training data. Since labelling object boundaries is one of the most expensive
types of annotations, there is a need to relax the requirement to carefully
annotate images, both to make training more affordable and to extend the
amount of training data. In this paper we propose a technique to generate
weakly supervised annotations and show that bounding box annotations alone
suffice to reach high-quality object boundaries without using any
object-specific boundary annotations. With the proposed weak supervision
techniques we achieve top performance on the object boundary detection
task, outperforming the current fully supervised state-of-the-art methods
by a large margin.
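
As a rough illustration of how a bounding box can be turned into a boundary annotation, the sketch below uses the crudest possible stand-in: it treats the filled box as the object mask and marks the outline of that mask as boundary. This is an assumption made for the example only and should not be read as the technique proposed in the paper, which would need a much better estimate of the object shape inside the box. The function names and the end-exclusive (x0, y0, x1, y1) box format are hypothetical.

```python
def box_to_mask(height, width, box):
    """Fill the box (x0, y0, x1, y1), end-exclusive, as a crude object mask."""
    x0, y0, x1, y1 = box
    return [[1 if (y0 <= y < y1 and x0 <= x < x1) else 0
             for x in range(width)]
            for y in range(height)]


def mask_to_boundary(mask):
    """Mark mask pixels that touch a non-mask pixel or the image border."""
    h, w = len(mask), len(mask[0])
    boundary = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < h and 0 <= nx < w)
                if outside or not mask[ny][nx]:
                    boundary[y][x] = 1
                    break
    return boundary
```

For a 5x5 image and the box (1, 1, 4, 4), the mask covers a 3x3 block and the boundary map marks its eight outer pixels while leaving the centre pixel unmarked.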

Graph-based video segmentation methods rely on superpixels as a starting point.
While most previous work has focused on the construction of the graph edges and
weights as well as solving the graph partitioning problem, this paper focuses
on better superpixels for video segmentation. We demonstrate by a comparative
analysis that superpixels extracted from boundaries perform best, and show that
boundary estimation can be significantly improved via image and time domain
cues. With superpixels generated from our better boundaries we observe
consistent improvement for two video segmentation methods on two different
datasets.
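
To show what extracting superpixels from boundaries can mean in its simplest form, the sketch below thresholds a boundary strength map and labels each connected region of non-boundary pixels as one superpixel. The fixed threshold, the 4-connectivity and the function name are assumptions for illustration. Practical pipelines typically rely on more elaborate tools such as watershed transforms or hierarchical contour maps, and this sketch does not reproduce the procedure evaluated in the paper.

```python
from collections import deque


def superpixels_from_boundaries(boundary, threshold=0.5):
    """Label connected non-boundary regions of a boundary strength map.

    boundary: list of rows of floats, higher meaning stronger boundary.
    Returns (labels, count); boundary pixels keep the label -1.
    """
    h, w = len(boundary), len(boundary[0])
    labels = [[-1] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if boundary[sy][sx] >= threshold or labels[sy][sx] != -1:
                continue
            # Breadth-first flood fill of one region.
            labels[sy][sx] = count
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny][nx] == -1
                            and boundary[ny][nx] < threshold):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
            count += 1
    return labels, count
```

In this sketch any gap in a boundary merges the two regions on either side of it, which gives some intuition for why better boundary estimates would translate into better superpixels.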