Recent advances in Generative Adversarial Networks (GANs) have shown increasing success in generating photorealistic images, but they also raise challenges for visual forensics and model authentication. We present the first study of learning GAN fingerprints towards image attribution: we systematically investigate the performance of classifying an image as real or GAN-generated. For GAN-generated images, we further identify their sources. Our experiments validate that GANs carry distinct model fingerprints and leave stable fingerprints in their generated images, which support image attribution. Even a single difference in GAN training initialization can result in different fingerprints, which enables fine-grained model authentication. We further validate that such a fingerprint is omnipresent in different image components and is not biased by GAN artifacts. Fingerprint finetuning is effective in immunizing against five types of adversarial image perturbations. Comparisons also show that our learned fingerprints consistently outperform several baselines in a variety of setups.

Internal thought refers to the process of directing attention away from a
primary visual task to internal cognitive processing. Internal thought is a
pervasive mental activity and closely related to primary task performance. As
such, automatic detection of internal thought has significant potential for
user modelling in intelligent interfaces, particularly for e-learning
applications. Despite the close link between the eyes and the human mind, only
a few studies have investigated vergence behaviour during internal thought and
none has studied moment-to-moment detection of internal thought from gaze.
While prior studies relied on long-term data analysis and required a large
number of gaze characteristics, we describe a novel method that is
computationally light-weight and that only requires eye vergence information
that is readily available from binocular eye trackers. We further propose a
novel paradigm to obtain ground truth internal thought annotations that
exploits human blur perception. We evaluate our method for three increasingly
challenging detection tasks: (1) during a controlled math-solving task, (2)
during natural viewing of lecture videos, and (3) during daily activities, such
as coding, browsing, and reading. Results from these evaluations demonstrate
the performance and robustness of vergence-based detection of internal thought
and, as such, open up new directions for research on interfaces that adapt to
shifts of mental attention.
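
As a rough illustration of the vergence signal such a method relies on, the sketch below computes a vergence angle from binocular gaze data. It assumes the eye tracker reports one 3D gaze direction vector per eye; the function name and the example vectors are hypothetical, and the detection method itself is not reproduced here.

```python
import math

def vergence_angle_deg(left_dir, right_dir):
    """Angle in degrees between the left-eye and right-eye gaze directions."""
    dot = sum(l * r for l, r in zip(left_dir, right_dir))
    norm_l = math.sqrt(sum(l * l for l in left_dir))
    norm_r = math.sqrt(sum(r * r for r in right_dir))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_l * norm_r)))
    return math.degrees(math.acos(cos_angle))

# Hypothetical samples: parallel gaze rays versus slightly converging rays.
parallel = vergence_angle_deg((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
converging = vergence_angle_deg((0.05, 0.0, 1.0), (-0.05, 0.0, 1.0))
```

Under this definition, parallel gaze rays give an angle of zero and the angle grows as the two rays converge on a nearer point.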

Recent methods to automatically calibrate stationary eye trackers were shown
to effectively reduce inherent calibration distortion. However, these methods
require additional information, such as mouse clicks or on-screen content. We
propose the first method that only requires users' eye movements to reduce
calibration distortion in the background while users naturally look at an
interface. Our method exploits that calibration distortion makes straight
saccade trajectories appear curved between the saccadic start and end points.
We show that this curving effect is systematic and the result of a distorted
gaze projection plane. To mitigate calibration distortion, our method undistorts
this plane by straightening saccade trajectories using image warping. We show
that this approach improves over the common six-point calibration and is
promising for reducing distortion. As such, it provides a non-intrusive
solution for alleviating the accuracy decrease of eye trackers during long-term
use.
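
One simple way to quantify how curved a recorded saccade appears is the largest perpendicular distance of its samples from the straight line joining its start and end points. The sketch below is only an illustration under that assumption: the measure and the function name are hypothetical, and the image-warping step of the method is not shown.

```python
import math

def max_deviation(points):
    """Largest perpendicular distance of any sample from the start-end chord."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    chord = math.hypot(x1 - x0, y1 - y0)
    if chord == 0.0:
        raise ValueError("start and end points coincide")
    # Point-to-line distance for the line through (x0, y0) and (x1, y1).
    return max(
        abs((x1 - x0) * (y0 - y) - (x0 - x) * (y1 - y0)) / chord
        for x, y in points
    )

# Hypothetical gaze samples in screen coordinates.
straight = max_deviation([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)])
curved = max_deviation([(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)])
```

A value near zero indicates a straight trajectory; larger values indicate stronger apparent curving.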

Automatic detection of emergent leaders in small groups from nonverbal
behaviour is a growing research topic in social signal processing but existing
methods were evaluated on single datasets -- an unrealistic assumption for
real-world applications in which systems are required to also work in settings
unseen at training time. It therefore remains unclear whether current methods
for emergent leadership detection generalise to similar but new settings and to
what extent. To overcome this limitation, we are the first to study a
cross-dataset evaluation setting for the emergent leadership detection task. We
provide evaluations for within- and cross-dataset prediction using two current
datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the
robustness of commonly used feature channels (visual focus of attention, body
pose, facial action units, speaking activity) and online prediction in the
cross-dataset setting. Our evaluations show that, using pose- and
eye-contact-based features, cross-dataset prediction is possible with an
accuracy of 0.68,
as such providing another important piece of the puzzle towards emergent
leadership detection in the real world.

We consider general discrete Markov Random Fields (MRFs) with additional
bottleneck potentials which penalize the maximum (instead of the sum) over
local potential values taken by the MRF-assignment. Bottleneck potentials or
analogous constructions have been considered in (i) combinatorial optimization
(e.g. bottleneck shortest path problem, the minimum bottleneck spanning tree
problem, bottleneck function minimization in greedoids), (ii) inverse problems
with $L_{\infty}$-norm regularization, and (iii) valued constraint satisfaction
on the $(\min,\max)$-pre-semirings. Bottleneck potentials for general discrete
MRFs are a natural generalization of the above direction of modeling work to
Maximum-A-Posteriori (MAP) inference in MRFs. To this end, we propose MRFs
whose objective consists of two parts: terms that factorize according to (i)
$(\min,+)$, i.e. potentials as in plain MRFs, and (ii) $(\min,\max)$, i.e.
bottleneck potentials. To solve the ensuing inference problem, we propose
high-quality relaxations and efficient algorithms for solving them. We
empirically show the efficacy of our approach on large-scale seismic horizon
tracking problems.
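
To make the two-part objective concrete, the sketch below evaluates it on a tiny chain MRF: the ordinary potentials are summed and the bottleneck potentials enter through their maximum. The data layout, the function names and the toy numbers are assumptions made for illustration, and the exhaustive search merely stands in for the relaxations and algorithms of the paper, which are not reproduced here.

```python
from itertools import product

def energy(x, unary, pairwise, bottleneck):
    """Sum of ordinary potentials plus the maximum bottleneck potential."""
    total = sum(unary[i][x[i]] for i in range(len(x)))
    total += sum(pairwise[i][x[i]][x[i + 1]] for i in range(len(x) - 1))
    worst = max(bottleneck[i][x[i]] for i in range(len(x)))
    return total + worst

def brute_force_map(unary, pairwise, bottleneck):
    """Exhaustive minimisation; only feasible for toy-sized problems."""
    label_ranges = (range(len(u)) for u in unary)
    return min(product(*label_ranges),
               key=lambda x: energy(x, unary, pairwise, bottleneck))

# Toy chain with two nodes and two labels each (hypothetical numbers).
unary = [[0, 1], [0, 1]]
pairwise = [[[0, 0], [0, 0]]]      # pairwise[i][a][b] belongs to edge (i, i + 1)
bottleneck = [[5, 0], [5, 0]]
best = brute_force_map(unary, pairwise, bottleneck)
best_energy = energy(best, unary, pairwise, bottleneck)
```

In this toy instance the sum terms alone would favour the assignment (0, 0), but the bottleneck term makes (1, 1) optimal with energy 2, since a single large bottleneck value dominates the maximum.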

Not Using the Car to See the Sidewalk: Quantifying and Controlling the Effects of Context in Classification and Segmentation. R. Shetty, B. Schiele and M. Fritz. 32nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019

From autonomous driving cars to surgical robots, robotic systems have enjoyed significant growth over the past decade. With the rapid development in robotics alongside the evolution of related fields, such as computer vision and machine learning, integrating perception, anticipation and manipulation is key to the success of future robotic systems. In this thesis, we explore different ways of such integration to extend the capabilities of a robotic system to take on more challenging real-world tasks. On anticipation and perception, we address the recognition of ongoing activity from videos. In particular, we focus on long-duration and complex activities and hence propose a new challenging dataset to facilitate the work. We introduce hierarchical labels over the activity classes and investigate the temporal accuracy-specificity trade-offs. We propose a new method based on recurrent neural networks that learns to predict over this hierarchy and realizes accuracy-specificity trade-offs. Our method outperforms several baselines on this new challenge. On manipulation with perception, we propose an efficient framework for programming a robot to use human tools. We first present a novel and compact model for using tools described by a tip model. Then we explore a strategy of utilizing a dual-gripper approach for manipulating tools, motivated by the absence of dexterous hands on widely available general-purpose robots. Afterwards, we embed the tool-use learning into a hierarchical architecture and evaluate it on a Baxter research robot. Finally, combining perception, anticipation and manipulation, we focus on a block stacking task. First, we explore how to guide a robot to place a single block into the scene without collapsing the existing structure. We introduce a mechanism to predict physical stability directly from visual input and evaluate it first on synthetic data and then on real-world block stacking.
Further, we introduce the target stacking task where the agent stacks blocks to reproduce a tower shown in an image. To do so, we create a synthetic block stacking environment with physics simulation in which the agent can learn block stacking end-to-end through trial and error, bypassing the need to explicitly model the corresponding physics knowledge. We propose a goal-parametrized GDQN model to plan with respect to the specific goal. We validate the model on both a navigation task in a classic gridworld environment and the block stacking task.

As first-person cameras in head-mounted displays become increasingly
prevalent, so does the problem of infringing user and bystander privacy. To
address this challenge, we present PrivacEye, a proof-of-concept system that
detects privacy-sensitive everyday situations and automatically enables and
disables the first-person camera using a mechanical shutter. To close the
shutter, PrivacEye detects sensitive situations from first-person camera videos
using an end-to-end deep-learning model. To open the shutter without visual
input, PrivacEye uses a separate, smaller eye camera to detect changes in
users' eye movements to gauge changes in the "privacy level" of the current
situation. We evaluate PrivacEye on a dataset of first-person videos recorded
in the daily life of 17 participants that they annotated with privacy
sensitivity levels. We discuss the strengths and weaknesses of our
proof-of-concept system based on a quantitative technical evaluation as well as
qualitative insights from semi-structured interviews.

In this study, we explore building a two-stage framework for enabling users
to directly manipulate high-level attributes of a natural scene. The key to our
approach is a deep generative network which can hallucinate images of a scene
as if they were taken at a different season (e.g. during winter), weather
condition (e.g. on a cloudy day) or time of the day (e.g. at sunset). Once the
scene is hallucinated with the given attributes, the corresponding look is then
transferred to the input image while preserving the semantic details intact,
giving a photo-realistic manipulation result. As the proposed framework
hallucinates what the scene will look like, it does not require any reference
style image as commonly utilized in most of the appearance or style transfer
approaches. Moreover, it allows simultaneous manipulation of a given scene
according to a diverse set of transient attributes within a single model,
eliminating the need to train a separate network for each translation task.
Our comprehensive set of qualitative and quantitative results demonstrates the
effectiveness of our approach against the competing methods.

We introduce Primal-Dual Wasserstein GAN, a new learning algorithm for
building latent variable models of the data distribution based on the primal
and the dual formulations of the optimal transport (OT) problem. We utilize the
primal formulation to learn a flexible inference mechanism and to create an
optimal approximate coupling between the data distribution and the generative
model. In order to learn the generative model, we use the dual formulation and
train the decoder adversarially through a critic network that is regularized by
the approximate coupling obtained from the primal. Unlike previous methods that
violate various properties of the optimal critic, we regularize the norm and
the direction of the gradients of the critic function. Our model shares many of
the desirable properties of auto-encoding models in terms of mode coverage and
latent structure, while avoiding their undesirable averaging properties, e.g.
their inability to capture sharp visual features when modeling real images. We
compare our algorithm with several other generative modeling techniques that
utilize Wasserstein distances on Frechet Inception Distance (FID) and Inception
Scores (IS).

Due to the importance of zero-shot learning, i.e. classifying images where
there is a lack of labeled training data, the number of proposed approaches has
recently increased steadily. We argue that it is time to take a step back and
to analyze the status quo of the area. The purpose of this paper is three-fold.
First, given that there is no agreed-upon zero-shot learning
benchmark, we define a new benchmark by unifying both the evaluation
protocols and data splits of publicly available datasets used for this task.
This is an important contribution as published results are often not comparable
and sometimes even flawed due to, e.g. pre-training on zero-shot test classes.
Moreover, we propose a new zero-shot learning dataset, the Animals with
Attributes 2 (AWA2) dataset which we make publicly available both in terms of
image features and the images themselves. Second, we compare and analyze a
significant number of the state-of-the-art methods in depth, both in the
classic zero-shot setting and in the more realistic generalized zero-shot
setting. Finally, we discuss in detail the limitations of the current status of
the area which can be taken as a basis for advancing it.

Machine learning is transforming the world. Its application areas span privacy
sensitive and security critical tasks such as human identification and self-driving
cars. These applications raise privacy and security related questions that are not
fully understood or answered yet: Can automatic person recognisers identify people
in photos even when their faces are blurred? How easy is it to find an adversarial
input for a self-driving car that makes it drive off the road?
This thesis contributes one of the first steps towards a better understanding of
such concerns. We observe that many privacy and security critical scenarios for
learned models involve input data manipulation: users obfuscate their identity by
blurring their faces and adversaries inject imperceptible perturbations to the input
signal. We introduce a data manipulator framework as a tool for collectively describing
and analysing privacy and security relevant scenarios involving learned models.
A data manipulator introduces a shift in data distribution for achieving privacy or
security related goals, and feeds the transformed input to the target model. This
framework provides a common perspective on the studies presented in the thesis.
We begin the studies from the user’s privacy point of view. We analyse the
efficacy of common obfuscation methods like face blurring, and show that they
are surprisingly ineffective against state-of-the-art person recognition systems. We
then propose alternatives based on head inpainting and adversarial examples. By
studying user privacy, we also study the dual problem: model security. From the model
security perspective, a model ought to be robust and reliable against small amounts
of data manipulation. In both cases, data are manipulated with the goal of changing
the target model prediction. User privacy and model security problems can be
described with the same objective.
We then study the knowledge aspect of the data manipulation problem. The more
one knows about the target model, the more effective manipulations one can craft.
We propose a game theoretic manipulation framework to systematically represent
the knowledge level on the target model and derive privacy and security guarantees.
We then discuss ways to increase knowledge about a black-box model by only querying
it, deriving implications that are relevant to both privacy and security perspectives.

Machine Learning techniques are widely used by online services (e.g. Google,
Apple) in order to analyze and make predictions on user data. As many of the
provided services are user-centric (e.g. personal photo collections, speech
recognition, personal assistance), user data generated on personal devices is
key to provide the service. In order to protect the data and the privacy of the
user, federated learning techniques have been proposed where the data never
leaves the user's device and "only" model updates are communicated back to the
server. In our work, we propose a new threat model that is not concerned with
learning about the content - but rather is concerned with the linkability of
users during such decentralized learning scenarios.
We show that model updates are characteristic for users and therefore lend
themselves to linkability attacks. We show identification and matching of users
across devices in closed and open world scenarios. In our experiments, we find
our attacks to be highly effective, achieving 20x-175x chance-level
performance.
In order to mitigate the risks of linkability attacks, we study various
strategies. As adding random noise does not offer convincing operation points,
we propose strategies based on using calibrated domain-specific data; we find
that these strategies offer substantial protection against linkability threats
with little effect on utility.
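
To illustrate why characteristic model updates enable linking, the sketch below matches an observed update to the most similar previously seen update using cosine similarity. This is a simplified stand-in under stated assumptions, not the attack evaluated in the paper: the similarity measure, the function names and the three-dimensional toy updates are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def link_update(update, known_updates):
    """Return the user id whose stored update is most similar to `update`."""
    return max(known_updates,
               key=lambda uid: cosine(update, known_updates[uid]))

# Hypothetical flattened model updates observed earlier, keyed by user id.
known = {"user_a": [1.0, 0.0, 0.0], "user_b": [0.0, 1.0, 0.0]}
guess = link_update([0.9, 0.1, 0.0], known)
```

In this closed-world toy setting the new update is attributed to whichever known user it resembles most; an open-world variant would additionally need a rejection threshold for unseen users.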

Encouraged by the recent progress in pedestrian detection, we investigate the gap between current state-of-the-art methods
and the “perfect single frame detector”. We enable our analysis by creating a human baseline for pedestrian detection (over the Caltech
pedestrian dataset). After manually clustering the frequent errors of a top detector, we characterise both localisation and background-
versus-foreground errors.
To address localisation errors we study the impact of training annotation noise on the detector performance, and show that we can
improve results even with a small portion of sanitised training data. To address background/foreground discrimination, we study convnets
for pedestrian detection, and discuss which factors affect their performance.
Other than our in-depth analysis, we report top performance on the Caltech pedestrian dataset, and provide a new sanitised set of
training and test annotations.

The inner structure of a material is called microstructure. It stores the
genesis of a material and determines all its physical and chemical properties.
While microstructural characterization is widely spread and well known, the
microstructural classification is mostly done manually by human experts, which
opens the door to large uncertainties. Since the microstructure can be a
combination of different phases with complex substructures, its automatic
classification is very challenging and only little work in this field has
been carried out. Prior related works mostly apply features designed and
engineered by experts and classify microstructures separately from the feature
extraction step. Recently, Deep Learning methods have shown surprisingly good
performance in vision applications by learning the features from data together
with the classification step. In this work, we propose a deep learning method
for microstructure classification, using the example of certain microstructural
constituents of low carbon steel. This novel method employs pixel-wise
segmentation via Fully Convolutional Neural Networks (FCNN) accompanied by
a max-voting scheme. Our system achieves 93.94% classification accuracy,
drastically outperforming the state-of-the-art method of 48.89% accuracy,
indicating the effectiveness of pixel-wise approaches. Beyond the success
presented in this paper, this line of research offers a more robust and, above
all, objective way to address the difficult task of steel quality appreciation.
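
The max-voting step can be pictured as a majority vote over the per-pixel predictions of the segmentation network. The sketch below assumes the network output is already available as a 2D map of class labels; the function name and the class names in the example are illustrative only, and the exact voting rule of the paper may differ.

```python
from collections import Counter

def max_vote(pixel_labels):
    """Return the most frequent class over all pixels of a 2D label map."""
    counts = Counter(label for row in pixel_labels for label in row)
    # On ties, Counter.most_common keeps the label that was counted first.
    return counts.most_common(1)[0][0]

# Hypothetical 2x2 segmentation output for one object.
label_map = [["martensite", "martensite"],
             ["bainite", "martensite"]]
object_class = max_vote(label_map)
```

Aggregating many pixel-level decisions in this way lets isolated misclassified pixels be outvoted by the rest of the object.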

For autonomous agents to successfully operate in the real world, anticipation
of future events and states of their environment is a key competence. This
problem can be formalized as a sequence prediction problem, where a number of
observations are used to predict the sequence into the future. However,
real-world scenarios demand a model of uncertainty of such predictions, as
future states become increasingly uncertain and multi-modal -- in particular on
long time horizons. This makes modelling and learning challenging. We cast
state of the art semantic segmentation and future prediction models based on
deep learning into a Bayesian formulation that in turn allows for a full
Bayesian treatment of the prediction problem. We present a new sampling scheme
for this model that draws from the success of variational autoencoders by
incorporating a recognition network. In the experiments we show that our model
outperforms prior work in accuracy of the predicted segmentation and provides
calibrated probabilities that also better capture the multi-modal aspects of
possible future states of street scenes.

We propose a deep representation of appearance, i.e. the relation of color,
surface orientation, viewer position, material and illumination. Previous
approaches have used deep learning to extract classic appearance
representations relating to reflectance model parameters (e.g. Phong) or
illumination (e.g. HDR environment maps). We suggest directly representing
appearance itself as a network we call a deep appearance map (DAM). This is a
4D generalization over 2D reflectance maps, which held the view direction
fixed. First, we show how a DAM can be learned from images or video frames and
later be used to synthesize appearance, given new surface orientations and
viewer positions. Second, we demonstrate how another network can be used to map
from an image or video frames to a DAM network to reproduce this appearance,
without using a lengthy optimization such as stochastic gradient descent
(learning-to-learn). Finally, we generalize this to an appearance
estimation-and-segmentation task, where we map from an image showing multiple
materials to multiple networks reproducing their appearance, as well as
per-pixel segmentation.

With the widespread use of machine learning (ML) techniques, ML as a service
has become increasingly popular. In this setting, an ML model resides on a
server and users can query the model with their data via an API. However, if
the user's input is sensitive, sending it to the server is not an option.
Equally, the service provider does not want to share the model by sending it to
the client for protecting its intellectual property and pay-per-query business
model. In this paper, we propose MLCapsule, a guarded offline deployment of
machine learning as a service. MLCapsule executes the machine learning model
locally on the user's client and therefore the data never leaves the client.
Meanwhile, MLCapsule offers the service provider the same level of control and
security of its model as the commonly used server-side execution. In addition,
MLCapsule is applicable to offline applications that require local execution.
Beyond protecting against direct model access, we demonstrate that MLCapsule
allows for implementing defenses against advanced attacks on machine learning
models such as model stealing/reverse engineering and membership inference.

The matching of multiple objects (e.g. shapes or images) is a fundamental
problem in vision and graphics. In order to robustly handle ambiguities, noise
and repetitive patterns in challenging real-world settings, it is essential to
take geometric consistency between points into account. Computationally, the
multi-matching problem is difficult. It can be phrased as simultaneously
solving multiple (NP-hard) quadratic assignment problems (QAPs) that are
coupled via cycle-consistency constraints. The main limitations of existing
multi-matching methods are that they either ignore geometric consistency and
thus have limited robustness, or they are restricted to small-scale problems
due to their (relatively) high computational cost. We address these
shortcomings by introducing a Higher-order Projected Power Iteration method,
which is (i) efficient and scales to tens of thousands of points, (ii)
straightforward to implement, (iii) able to incorporate geometric consistency,
and (iv) guarantees cycle-consistent multi-matchings. Experimentally we show
that our approach is superior to existing methods.

We propose a novel approach to jointly perform 3D object retrieval and pose
estimation from monocular images. In order to make the method robust to real
world scene variations in the images, e.g. texture, lighting and background, we
learn an embedding space from 3D data that only includes the relevant
information, namely the shape and pose. Our method can then be trained for
robustness under real world scene variations without having to render a large
training set simulating these variations. Our learned embedding explicitly
disentangles a shape vector and a pose vector, which alleviates both pose bias
for 3D shape retrieval and categorical bias for pose estimation. Having the
learned disentangled embedding, we train a CNN to map the images to the
embedding space, and then retrieve the closest 3D shape from the database and
estimate the 6D pose of the object using the embedding vectors. Our method
achieves 10.8 median error for pose estimation and 0.514 top-1-accuracy for
category agnostic 3D object retrieval on the Pascal3D+ dataset. It therefore
outperforms the previous state-of-the-art methods on both tasks.

Following a period of expedited progress in the capabilities of digital systems, society has begun to realize that systems designed to assist people in various tasks can also harm individuals and society. Mediating access to information and explicitly or implicitly ranking people in increasingly many applications, search systems have a substantial potential to contribute to such unwanted outcomes. Since they collect vast amounts of data about both searchers and search subjects, they have the potential to violate the privacy of both of these groups of users. Moreover, in applications where rankings influence people's economic livelihood outside of the platform, such as sharing economy or hiring support websites, search engines have immense economic power over their users in that they control user exposure in ranked results. This thesis develops new models and methods broadly covering different aspects of privacy and fairness in search systems for both searchers and search subjects. Specifically, it makes the following contributions: (1) We propose a model for computing individually fair rankings where search subjects get exposure proportional to their relevance. The exposure is amortized over time using constrained optimization to overcome searcher attention biases while preserving ranking utility. (2) We propose a model for computing sensitive search exposure where each subject gets to know the sensitive queries that lead to her profile in the top-k search results. The problem of finding exposing queries is technically modeled as reverse nearest neighbor search, followed by a weakly supervised learning-to-rank model ordering the queries by privacy-sensitivity. (3) We propose a model for quantifying privacy risks from textual data in online communities. The method builds on a topic model where each topic is annotated by a crowdsourced sensitivity score, and privacy risks are associated with a user's relevance to sensitive topics.
We propose relevance measures capturing different dimensions of user interest in a topic and show how they correlate with human risk perceptions. (4) We propose a model for privacy-preserving personalized search where search queries of different users are split and merged into synthetic profiles. The model mediates the privacy-utility trade-off by keeping semantically coherent fragments of search histories within individual profiles, while trying to minimize the similarity of any of the synthetic profiles to the original user profiles. The models are evaluated using information retrieval techniques and user studies over a variety of datasets, ranging from query logs, through social media and community question answering postings, to item listings from sharing economy platforms.

Natural language explanations of deep neural network decisions provide an
intuitive way for an AI agent to articulate a reasoning process. Current textual
explanations learn to discuss class discriminative features in an image.
However, it is also helpful to understand which attributes might change a
classification decision if present in an image (e.g., "This is not a Scarlet
Tanager because it does not have black wings.") We call such textual
explanations counterfactual explanations, and propose an intuitive method to
generate counterfactual explanations by inspecting which evidence in an input
is missing, but might contribute to a different classification decision if
present in the image. To demonstrate our method we consider a fine-grained
image classification task in which we take as input an image and a
counterfactual class and output text which explains why the image does not
belong to a counterfactual class. We then analyze our generated counterfactual
explanations both qualitatively and quantitatively using proposed automatic
metrics.

Deep neural perception and control networks have become key components of
self-driving vehicles. User acceptance is likely to benefit from
easy-to-interpret textual explanations which allow end-users to understand what
triggered a particular behavior. Explanations may be triggered by the neural
controller, namely introspective explanations, or informed by the neural
controller’s output, namely rationalizations. We propose a new approach to
introspective explanations which consists of two parts. First, we use a visual
(spatial) attention model to train a convolutional network end-to-end from
images to the vehicle control commands, i.e., acceleration and change of
course. The controller’s attention identifies image regions that potentially
influence the network’s output. Second, we use an attention-based video-to-text
model to produce textual explanations of model actions. The attention maps of
controller and explanation model are aligned so that explanations are grounded
in the parts of the scene that mattered to the controller. We explore two
approaches to attention alignment, strong- and weak-alignment. Finally, we
explore a version of our model that generates rationalizations, and compare
with introspective explanations on the same video segments. We evaluate these
models on a novel driving dataset with ground-truth human explanations, the
Berkeley DeepDrive eXplanation (BDD-X) dataset. Code is available at
https://github.com/JinkyuKimUCB/explainable-deep-driving

In-depth scene descriptions and question answering tasks have greatly
increased the scope of today's definition of scene understanding. While such
tasks are in principle open ended, current formulations primarily focus on
describing only the current state of the scenes under consideration. In
contrast, in this paper, we focus on the future states of the scenes which are
also conditioned on actions. We posit this as a question answering task, where
an answer has to be given about a future scene state, given observations of the
current scene, and a question that includes a hypothetical action. Our solution
is a hybrid model which integrates a physics engine into a question answering
architecture in order to anticipate future scene states resulting from
object-object interactions caused by an action. We demonstrate first results on
this challenging new problem and compare to baselines, where we outperform
fully data-driven end-to-end learning approaches.

While great progress has been made recently in automatic image manipulation,
it has been limited to object-centric images like faces or structured scene
datasets. In this work, we take a step towards general scene-level image
editing by developing an automatic interaction-free object removal model. Our
model learns to find and remove objects from general scene images using
image-level labels and unpaired data in a generative adversarial network (GAN)
framework. We achieve this with two key contributions: a two-stage editor
architecture consisting of a mask generator and image in-painter that
co-operate to remove objects, and a novel GAN-based prior for the mask
generator that allows us to flexibly incorporate knowledge about object shapes.
We experimentally show on two datasets that our method effectively removes a
wide variety of objects using weak supervision only.

Understanding physical phenomena is a key component of human intelligence and
enables physical interaction with previously unseen environments. In this
paper, we study how an artificial agent can autonomously acquire this intuition
through interaction with the environment. We created a synthetic block stacking
environment with physics simulation in which the agent can learn a policy
end-to-end through trial and error. We thereby avoid explicitly modelling
physical knowledge within the policy. We are specifically interested in tasks
that require the agent to reach a given goal state that may be different for
every new trial. To this end, we propose a deep reinforcement learning
framework that learns policies which are parametrized by a goal. We validated
the model on a toy example navigating in a grid world with different target
positions and in a block stacking task with different target structures of the
final tower. In contrast to prior work, our policies show better generalization
across different goals.

Computer Vision has undergone major changes over the last five years. Here, we investigate if the performance of such architectures generalizes to more complex tasks that require a more holistic approach to scene comprehension. The presented work focuses on learning spatial and multi-modal representations, and the foundations of a Visual Turing Test, where scene understanding is tested by a series of questions about its content. In our studies, we propose DAQUAR, the first ‘question answering about real-world images’ dataset, together with two methods that address the problem: a symbolic-based and a neural-based visual question answering architecture. The symbolic-based method relies on a semantic parser, a database of visual facts, and a Bayesian formulation that accounts for various interpretations of the visual scene. The neural-based method is an end-to-end architecture composed of a question encoder, image encoder, multimodal embedding, and answer decoder. This architecture has proven to be effective in capturing language-based biases. It has also become a standard component of other visual question answering architectures. Along with the methods, we also investigate various evaluation metrics that embrace uncertainty in word meaning, and various interpretations of the scene and the question.

What does human gaze reveal about a user's intents and to what extent can
these intents be inferred or even visualized? Gaze was proposed as an implicit
source of information to predict the target of visual search and, more
recently, to predict the object class and attributes of the search target. In
this work, we go one step further and investigate the feasibility of combining
recent advances in encoding human gaze information using deep convolutional
neural networks with the power of generative image models to visually decode,
i.e. create a visual representation of, the search target. Such visual decoding
is challenging for two reasons: 1) the search target only resides in the user's
mind as a subjective visual pattern, and can most often not even be described
verbally by the person, and 2) it is, as of yet, unclear if gaze fixations
contain sufficient information for this task at all. We show, for the first
time, that visual representations of search targets can indeed be decoded only
from human gaze fixations. We propose to first encode fixations into a semantic
representation and then decode this representation into an image. We evaluate
our method on a recent gaze dataset of 14 participants searching for clothing
in image collages and validate the model's predictions using two human studies.
Our results show that users were able to correctly select the categories of the
decoded image 62% of the time (chance level = 10%). In our second study we show
the importance of a local gaze encoding for decoding visual search targets of
user

Modern image classification methods are based on supervised learning algorithms that require labeled training data. However, only a limited amount of annotated data may be available in certain applications due to scarcity of the data itself or high costs associated with human annotation. Introduction of additional information and structural constraints can help improve the performance of a learning algorithm. In this thesis, we study the framework of learning using privileged information and demonstrate its relation to learning with instance weights. We also consider multitask feature learning and develop an efficient dual optimization scheme that is particularly well suited to problems with high dimensional image descriptors. Scaling annotation to a large number of image categories leads to the problem of class ambiguity where clear distinction between the classes is no longer possible. Many real world images are naturally multilabel yet the existing annotation might only contain a single label. In this thesis, we propose and analyze a number of loss functions that allow for a certain tolerance in top k predictions of a learner. Our results indicate consistent improvements over the standard loss functions that put more penalty on the first incorrect prediction compared to the proposed losses. All proposed learning methods are complemented with efficient optimization schemes that are based on stochastic dual coordinate ascent for convex problems and on gradient descent for nonconvex formulations.
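
To make the idea of tolerance in top-k predictions concrete, the following is a minimal sketch. It is an illustration only and is not claimed to be one of the loss functions proposed in the thesis: `topk_error` and `topk_hinge` are hypothetical names, and the hinge variant simply penalises a sample only when the k-th strongest competing class comes within a margin of the true class.

```python
def topk_error(scores, label, k):
    """Return 0 if `label` is among the k highest-scoring classes, else 1."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return 0 if label in ranked[:k] else 1


def topk_hinge(scores, label, k, margin=1.0):
    """Illustrative hinge-style surrogate for the top-k error.

    Assumes 1 <= k <= number of competing classes. The loss is zero as soon
    as the true class beats the k-th best competitor by at least `margin`,
    so up to k-1 higher-scoring competitors are tolerated.
    """
    others = sorted(
        (s for j, s in enumerate(scores) if j != label), reverse=True
    )
    kth_competitor = others[k - 1]
    return max(0.0, margin + kth_competitor - scores[label])


scores = [2.0, 3.0, 1.0, 0.5]  # class 1 outranks the true class 0
loss_top1 = topk_hinge(scores, label=0, k=1)
loss_top2 = topk_hinge(scores, label=0, k=2)
```

With k=1 the sample is penalised because one competitor scores higher than the true class; with k=2 the same prediction incurs no loss, which is the kind of tolerance the paragraph describes.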

Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset which contains transcribed ADs, which are temporally
aligned to full length movies. In addition we also collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015.

Visual object detection has seen substantial improvements during the last years due to the possibilities enabled by deep learning. While research on image classification provides continuous progress on how to learn image representations and classifiers jointly, object detection research focuses on identifying how to properly use deep learning technology to effectively localise objects. In this thesis, we analyse and improve different aspects of the commonly used detection pipeline. We analyse ten years of research on pedestrian detection and find that improvement of feature representations was the driving factor. Motivated by this finding, we adapt an end-to-end learned detector architecture from general object detection to pedestrian detection. Our deep network outperforms all previous neural networks for pedestrian detection by a large margin, even without using additional training data. After substantial improvements on pedestrian detection in recent years, we investigate the gap between human performance and state-of-the-art pedestrian detectors. We find that pedestrian detectors still have a long way to go before they reach human performance, and we diagnose failure modes of several top performing detectors, giving direction to future research. As a side-effect we publish new, better localised annotations for the Caltech pedestrian benchmark. We analyse detection proposals as a preprocessing step for object detectors. We establish different metrics and compare a wide range of methods according to these metrics. By examining the relationship between localisation of proposals and final object detection performance, we define and experimentally verify a metric that can be used as a proxy for detector performance. Furthermore, we address a structural weakness of virtually all object detection pipelines: non-maximum suppression. We analyse why it is necessary and what the shortcomings of the most common approach are. 
We present work to overcome these shortcomings and to replace typical non-maximum suppression with a learnable alternative. The introduced paradigm paves the way to true end-to-end learning of object detectors without any post-processing. In summary, this thesis provides analyses of recent pedestrian detectors and detection proposals, improves pedestrian detection by employing deep neural networks, and presents a viable alternative to traditional non-maximum suppression.
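
For readers unfamiliar with the post-processing step discussed above, the following is a minimal sketch of the widely used greedy form of non-maximum suppression. It is not the learnable alternative from the thesis. The function names, the `(x1, y1, x2, y2)` box format, and the default threshold of 0.5 are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    A box is kept only if it does not overlap any already kept,
    higher-scoring box by more than `iou_threshold`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

The fixed threshold illustrates one shortcoming: two distinct objects that overlap strongly, as pedestrians in a crowd often do, are merged into a single detection, while a looser threshold lets duplicate detections through.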

Learning to Segment in Images and Videos with Different Forms of Supervision
A. Khoreva, PhD Thesis, Universität des Saarlandes, 2017

Abstract

Much progress has been made in image and video segmentation
over the last years. To a large extent, the success can be attributed to
the strong appearance models completely learned from data, in particular
using deep learning methods. However, to perform best, these methods require
large representative datasets for training with expensive pixel-level
annotations, which in the case of videos are prohibitively costly to obtain. Therefore,
there is a need to relax this constraint and to consider alternative forms
of supervision, which are easier and cheaper to collect. In this thesis,
we aim to develop algorithms for learning to segment in images and videos
with different levels of supervision.
First, we develop approaches for training convolutional networks with weaker
forms of supervision, such as bounding boxes or image labels, for object
boundary estimation and semantic/instance labelling tasks. We propose to
generate pixel-level approximate groundtruth from these weaker forms of
annotations to train a network, which achieves high-quality
results comparable to the full supervision quality without any
modifications of the network architecture or the training procedure.
Second, we address the problem of the excessive computational and memory
costs inherent to solving video segmentation via graphs. We propose
approaches to improve the runtime and memory efficiency as well as the
output segmentation quality by learning from the available training data
the best representation of the graph. In particular, we contribute methods for
learning must-link constraints, the topology and edge weights of the graph,
as well as enhancing the graph nodes - superpixels - themselves.
Third, we tackle the task of pixel-level object tracking and address the
problem of the limited amount of densely annotated video data for training
convolutional networks. We introduce an architecture which allows training
with static images only and propose an elaborate data synthesis scheme
which creates a large number of training examples close to the target
domain from the given first frame mask. With the proposed techniques we
show that densely annotated consecutive video data is not necessary to
achieve high-quality temporally coherent video segmentation results.
In summary, this thesis advances the state of the art in weakly supervised
image segmentation, graph-based video segmentation and pixel-level object
tracking and contributes new ways of training convolutional
networks with a limited amount of pixel-level annotated training data.

Convolutional networks reach top quality in pixel-level object tracking but
require a large amount of training data (1k ~ 10k) to deliver such results. We
propose a new training strategy which achieves state-of-the-art results across
three evaluation datasets while using 20x ~ 100x less annotated data than
competing methods. Instead of using large training sets hoping to generalize
across domains, we generate in-domain training data using the provided
annotation on the first frame of each video to synthesize ("lucid dream")
plausible future video frames. In-domain per-video training data allows us to
train high quality appearance- and motion-based models, as well as tune the
post-processing stage. This approach reaches competitive results even
when training from only a single annotated frame, without ImageNet
pre-training. Our results indicate that using a larger training set is not
automatically better, and that for the tracking task a smaller training set
that is closer to the target domain is more effective. This changes the mindset
regarding how many training samples and general "objectness" knowledge are
required for the object tracking task.

People are often a central element of visual scenes, particularly in real-world street scenes. Thus it has been a long-standing goal in Computer Vision to develop methods aiming at analyzing humans in visual data. Due to the complexity of real-world scenes, visual understanding of people remains challenging for machine perception. In this thesis we focus on advancing the techniques for people detection and tracking in crowded street scenes. We also propose new models for human pose estimation and motion segmentation in realistic images and videos. First, we propose detection models that are jointly trained to detect single persons as well as pairs of people under varying degrees of occlusion. The learning algorithm of our joint detector facilitates a tight integration of tracking and detection, because it is designed to address common failure cases during tracking due to long-term inter-object occlusions. Second, we propose novel multi-person tracking models that formulate tracking as a graph partitioning problem. Our models jointly cluster detection hypotheses in space and time, eliminating the need for a heuristic non-maximum suppression. Furthermore, for crowded scenes, our tracking model encodes long-range person re-identification information into the detection clustering process in a unified and rigorous manner. Third, we explore the visual tracking task at different granularities. We present a tracking model that simultaneously clusters object bounding boxes and pixel-level trajectories over time. This approach provides a rich understanding of the motion of objects in the scene. Last, we extend our tracking model to the multi-person pose estimation task. We introduce a joint subset partitioning and labelling model where we simultaneously estimate the poses of all the people in the scene. In summary, this thesis addresses a number of diverse tasks that aim to enable vision systems to analyze people in realistic images and videos. 
In particular, the thesis proposes several novel ideas and rigorous mathematical formulations, pushes the boundary of the state of the art, and achieves superior performance.

Deep models are the de facto standard in visual decision problems due to their
impressive performance on a wide array of visual tasks. On the other hand,
their opaqueness has led to a surge of interest in explainable systems. In this
work, we emphasize the importance of model explanation in various forms such as
visual pointing and textual justification. The lack of data with justification
annotations is one of the bottlenecks of generating multimodal explanations.
Thus, we propose two large-scale datasets with annotations that visually and
textually justify a classification decision for various activities, i.e. ACT-X,
and for question answering, i.e. VQA-X. We also introduce a multimodal
methodology for generating visual and textual explanations simultaneously. We
quantitatively show that training with the textual explanations not only yields
better textual justification models, but also models that better localize the
evidence that supports their decision.

People nowadays share large parts of their personal lives through social
media. Being able to automatically recognise people in personal photos may
greatly enhance user convenience by easing photo album organisation. For the human
identification task, however, the traditional focus of computer vision has been
face recognition and pedestrian re-identification. Person recognition in social
media photos sets new challenges for computer vision, including non-cooperative
subjects (e.g. backward viewpoints, unusual poses) and great changes in
appearance. To tackle this problem, we build a simple person recognition
framework that leverages convnet features from multiple image regions (head,
body, etc.). We propose new recognition scenarios that focus on the time and
appearance gap between training and testing samples. We present an in-depth
analysis of the importance of different features according to time and
viewpoint generalisability. In the process, we verify that our simple approach
achieves the state-of-the-art result on the PIPA benchmark, arguably the
largest social media based benchmark for person recognition to date with
diverse poses, viewpoints, social groups, and events.
Compared to the conference version of the paper, this paper additionally
presents (1) analysis of a face recogniser (DeepID2+), (2) a new method, naeil2,
that combines the conference version method naeil and DeepID2+ to achieve state
of the art results even compared to post-conference works, (3) discussion of
related work since the conference version, (4) additional analysis including
the head viewpoint-wise breakdown of performance, and (5) results on the
open-world setup.

Many deployed learned models are black boxes: given an input, they return an output.
Internal information about the model, such as the architecture, optimisation
procedure, or training data, is not disclosed explicitly as it might contain
proprietary information or make the system more vulnerable. This work shows
that such attributes of neural networks can be exposed from a sequence of
queries. This has multiple implications. On the one hand, our work exposes the
vulnerability of black-box neural networks to different types of attacks -- we
show that the revealed internal information helps generate more effective
adversarial examples against the black box model. On the other hand, this
technique can be used for better protection of private content from automatic
recognition models using adversarial examples. Our paper suggests that it is
actually hard to draw a line between white box and black box models.

We study the problem of decomposing (clustering) a tree with respect to costs attributed to pairs of nodes, so as to minimize the sum of costs for those pairs of nodes that are in the same component (cluster). For the general case and for the special case of the tree being a star, we show that the problem is NP-hard. For the special case of the tree being a path, this problem is known to be polynomial time solvable. We characterize several classes of facets of the combinatorial polytope associated with a formulation of this clustering problem in terms of lifted multicuts. In particular, our results yield a complete totally dual integral (TDI) description of the lifted multicut polytope for paths, which establishes a connection to the combinatorial properties of alternative formulations such as set partitioning.
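
The objective can be illustrated with a brute-force sketch on a tiny instance. This is an exhaustive enumeration for illustration only, not an algorithm from the paper, and it is exponential in the number of edges. The names `components` and `best_decomposition` and the example costs are hypothetical. Costs are assumed to be allowed to be negative; with only non-negative costs, cutting every edge would be trivially optimal.

```python
from itertools import product


def components(n, kept_edges):
    """Label each of n nodes with a representative of its connected component."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in kept_edges:
        parent[find(u)] = find(v)
    return [find(x) for x in range(n)]


def best_decomposition(n, tree_edges, pair_cost):
    """Try every subset of tree edges to keep and return (cost, kept_edges).

    The cost of a decomposition is the sum of pair_cost[(u, v)] over all
    node pairs that end up in the same component.
    """
    best = None
    for mask in product([False, True], repeat=len(tree_edges)):
        kept = [e for e, keep in zip(tree_edges, mask) if keep]
        label = components(n, kept)
        cost = sum(c for (u, v), c in pair_cost.items() if label[u] == label[v])
        if best is None or cost < best[0]:
            best = (cost, kept)
    return best


# Path 0 - 1 - 2 with a costly "lifted" pair (0, 2) that is not a tree edge.
edges = [(0, 1), (1, 2)]
costs = {(0, 1): -2, (1, 2): -1, (0, 2): 4}
result = best_decomposition(3, edges, costs)
```

In this example, keeping both edges would also place nodes 0 and 2 in the same component and incur the cost of 4, so the best decomposition keeps only the edge (0, 1).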

Convnets have enabled significant progress in pedestrian detection recently,
but there are still open questions regarding suitable architectures and
training data. We revisit CNN design and point out key adaptations, enabling
plain FasterRCNN to obtain state-of-the-art results on the Caltech dataset.
To achieve further improvement from more and better data, we introduce
CityPersons, a new set of person annotations on top of the Cityscapes dataset.
The diversity of CityPersons allows us for the first time to train one single
CNN model that generalizes well over multiple benchmarks. Moreover, with
additional training with CityPersons, we obtain top results using FasterRCNN on
Caltech, improving especially for more difficult cases (heavy occlusion and
small scale) and providing higher localization quality.

Previous work focused on predicting visual search targets from human
fixations but, in the real world, a specific target is often not known, e.g.
when searching for a present for a friend. In this work we instead study the
problem of predicting the mental picture, i.e. only an abstract idea instead of
a specific target. This task is significantly more challenging given that
mental pictures of the same target category can vary widely depending on
personal biases, and given that characteristic target attributes can often not
be verbalised explicitly. We instead propose to use gaze information as
implicit information on users' mental picture and present a novel gaze pooling
layer to seamlessly integrate semantic and localized fixation information into
a deep image representation. We show that we can robustly predict both the
mental picture's category as well as attributes on a novel dataset containing
fixation data of 14 users searching for targets on a subset of the DeepFashion
dataset. Our results have important implications for future search interfaces
and suggest deep gaze pooling as a general-purpose approach for gaze-supported
computer vision systems.

Generating natural language descriptions for visual data links computer vision and computational linguistics. Being able to generate a concise and human-readable description of a video is a step towards visual understanding. At the same time, grounding natural language in visual data provides disambiguation for the linguistic concepts, necessary for many applications. This thesis focuses on both directions and tackles three specific problems. First, we develop recognition approaches to understand video of complex cooking activities. We propose an approach to generate coherent multi-sentence descriptions for our videos. Furthermore, we tackle the new task of describing videos at variable level of detail. Second, we present a large-scale dataset of movies and aligned professional descriptions. We propose an approach, which learns from videos and sentences to describe movie clips relying on robust recognition of visual semantic concepts. Third, we propose an approach to ground textual phrases in images with little or no localization supervision, which we further improve by introducing Multimodal Compact Bilinear Pooling for combining language and vision representations. Finally, we jointly address the task of describing videos and grounding the described people. To summarize, this thesis advances the state-of-the-art in automatic video description and visual grounding and also contributes large datasets for studying the intersection of computer vision and computational linguistics.

Boundary prediction in images as well as video has been a very active topic
of research and organizing visual information into boundaries and segments is
believed to be a cornerstone of visual perception. While prior work has
focused on predicting boundaries for observed frames, our work aims at
predicting boundaries of future unobserved frames. This requires our model to
learn about the fate of boundaries and extrapolate motion patterns. We
experiment on an established real-world video segmentation dataset, which provides
a testbed for this new task. We show for the first time spatio-temporal
boundary extrapolation in this challenging scenario. Furthermore, we show
long-term prediction of boundaries in situations where the motion is governed
by the laws of physics. We successfully predict boundaries in a billiard
scenario without any assumptions of a strong parametric model or any object
notion. We argue that our model has, with minimal model assumptions, derived
a notion of 'intuitive physics' that can be applied to novel scenes.

Together with the development of more accurate methods in Computer Vision and
Natural Language Understanding, holistic architectures that answer questions
about the content of real-world images have emerged. In this tutorial, we build
a neural-based approach to answer questions about images. We base our tutorial
on two datasets: (mostly on) DAQUAR, and (a bit on) VQA. With small tweaks the
models that we present here can achieve a competitive performance on both
datasets; in fact, they are among the best methods that use a combination of
LSTM with a global, full frame CNN representation of an image. We hope that
after reading this tutorial, the reader will be able to use Deep Learning
frameworks, such as Keras and the newly introduced Kraino, to build various architectures
that will lead to a further performance improvement on this challenging task.

Common computational methods for automated eye movement detection - i.e. the
task of detecting different types of eye movement in a continuous stream of
gaze data - are limited in that they either involve thresholding on
hand-crafted signal features, require individual detectors each only detecting
a single movement, or require pre-segmented data. We propose a novel approach
for eye movement detection that only involves learning a single detector
end-to-end, i.e. directly from the continuous gaze data stream and
simultaneously for different eye movements without any manual feature crafting
or segmentation. Our method is based on convolutional neural networks (CNN)
that recently demonstrated superior performance in a variety of tasks in
computer vision, signal processing, and machine learning. We further introduce
a novel multi-participant dataset that contains scripted and free-viewing
sequences of ground-truth annotated saccades, fixations, and smooth pursuits.
We show that our CNN-based method outperforms state-of-the-art baselines by a
large margin on this challenging dataset, thereby underlining the significant
potential of this approach for holistic, robust, and accurate eye movement
protocol analysis.

In this paper we extract surface reflectance and natural environmental
illumination from a reflectance map, i.e. from a single 2D image of a sphere of
one material under one illumination. This is a notoriously difficult problem,
yet key to various re-rendering applications. With the recent advances in
estimating reflectance maps from 2D images their further decomposition has
become increasingly relevant.
To this end, we propose a Convolutional Neural Network (CNN) architecture to
reconstruct both material parameters (i.e. Phong) as well as illumination (i.e.
high-resolution spherical illumination maps), that is solely trained on
synthetic data. We demonstrate decomposition of synthetic as well as real
photographs of reflectance maps, both in High Dynamic Range (HDR) and, for the
first time, in Low Dynamic Range (LDR). Results are compared to
previous approaches quantitatively as well as qualitatively in terms of
re-renderings where illumination, material, view or shape are changed.

Recovering natural illumination from a single Low-Dynamic Range (LDR) image
is a challenging task. To remedy this situation we exploit two properties often
found in everyday images. First, images rarely show a single material, but
rather multiple ones that all reflect the same illumination. However, the
appearance of each material is observed only for some surface orientations, not
all. Second, parts of the illumination are often directly observed in the
background, without being affected by reflection. Typically, this directly
observed part of the illumination is even smaller. We propose a deep
Convolutional Neural Network (CNN) that combines prior knowledge about the
statistics of illumination and reflectance with an input that makes explicit
use of these two observations. Our approach maps multiple partial LDR material
observations represented as reflectance maps and a background image to a
spherical High-Dynamic Range (HDR) illumination map. For training and testing
we propose a new data set comprising synthetic and real images with multiple
materials observed under the same illumination. Qualitative and quantitative
evidence shows that both multiple materials and a background image are essential to
improve illumination estimation.

Deep models are the de facto standard in visual decision models due to their
impressive performance on a wide array of visual tasks. However, they are
frequently seen as opaque and are unable to explain their decisions. In
contrast, humans can justify their decisions with natural language and point to
the evidence in the visual world which led to their decisions. We postulate
that deep models can do this as well and propose our Pointing and Justification
(PJ-X) model which can justify its decision with a sentence and point to the
evidence by introspecting its decision and explanation process using an
attention mechanism. Unfortunately there is no dataset available with reference
explanations for visual decision making. We thus collect two datasets in two
domains where it is interesting and challenging to explain decisions. First, we
extend the visual question answering task to not only provide an answer but
also a natural language explanation for the answer. Second, we focus on
explaining human activities which is traditionally more challenging than object
classification. We extensively evaluate our PJ-X model, both on the
justification and pointing tasks, by comparing it to prior models and ablations
using both automatic and human evaluations.

Gaze reflects how humans process visual scenes and is therefore increasingly
used in computer vision systems. Previous works demonstrated the potential of
gaze for object-centric tasks, such as object localization and recognition, but
it remains unclear if gaze can also be beneficial for scene-centric tasks, such
as image captioning. We present a new perspective on gaze-assisted image
captioning by studying the interplay between human gaze and the attention
mechanism of deep neural networks. Using a public large-scale gaze dataset, we
first assess the relationship between state-of-the-art object and scene
recognition models, bottom-up visual saliency, and human gaze. We then propose
a novel split attention model for image captioning. Our model integrates human
gaze information into an attention-based long short-term memory architecture,
and allows the algorithm to allocate attention selectively to both fixated and
non-fixated image regions. Through evaluation on the COCO/SALICON datasets we
show that our method improves image captioning performance and that gaze can
complement machine attention for semantic scene understanding tasks.

Marker-based and marker-less optical skeletal motion-capture methods use an
outside-in arrangement of cameras placed around a scene, with viewpoints
converging on the center. They often cause discomfort through the marker suits
they may require, and their recording volume is severely restricted and often
constrained to indoor scenes with controlled backgrounds. We therefore propose
a new method for real-time, marker-less and egocentric motion capture which
estimates the full-body skeleton pose from a lightweight stereo pair of fisheye
cameras that are attached to a helmet or virtual-reality headset. It combines
the strength of a new generative pose estimation framework for fisheye views
with a ConvNet-based body-part detector trained on a new automatically
annotated and augmented dataset. Our inside-in method captures full-body motion
in general indoor and outdoor scenes, and also crowded scenes.

Beyond the success in classification, neural networks have recently shown
strong results on pixel-wise prediction tasks like image semantic segmentation
on RGBD data. However, the commonly used deconvolutional layers for upsampling
intermediate representations to the full-resolution output still show different
failure modes, like imprecise segmentation boundaries and label mistakes in
particular on large, weakly textured objects (e.g. fridge, whiteboard, door).
We attribute these errors in part to the rigid way current networks aggregate
information, which can be either too local (missing context) or too global
(inaccurate boundaries). Therefore we propose a data-driven pooling layer that
integrates with fully convolutional architectures and utilizes boundary
detection from RGBD image segmentation approaches. We extend our approach to
leverage region-level correspondences across images with an additional temporal
pooling stage. We evaluate our approach on the NYU-Depth-V2 dataset comprised
of indoor RGBD video sequences and compare it to various state-of-the-art
baselines. Besides a general improvement over the state-of-the-art, our
approach shows particularly good results in terms of accuracy of the predicted
boundaries and in segmenting previously problematic classes.

Recently, Minimum Cost Multicut Formulations have been proposed and proven to
be successful in both motion trajectory segmentation and multi-target tracking
scenarios. Both tasks benefit from decomposing a graphical model into an
optimal number of connected components based on attractive and repulsive
pairwise terms. The two tasks are formulated on different levels of granularity
and, accordingly, leverage mostly local information for motion segmentation and
mostly high-level information for multi-target tracking. In this paper we argue
that point trajectories and their local relationships can contribute to the
high-level task of multi-target tracking and also argue that high-level cues
from object detection and tracking are helpful to solve motion segmentation. We
propose a joint graphical model for point trajectories and object detections
whose Multicuts are solutions to motion segmentation {\it and} multi-target
tracking problems at once. Results on the FBMS59 motion segmentation benchmark
as well as on pedestrian tracking sequences from the 2D MOT 2015 benchmark
demonstrate the promise of this joint approach.

Understanding physical phenomena is a key competence that enables humans and
animals to act and interact under uncertain perception in previously unseen
environments containing novel objects and their configurations. Developmental
psychology has shown that such skills are acquired by infants from observations
at a very early stage.
In this paper, we contrast a more traditional approach of taking a
model-based route with explicit 3D representations and physical simulation with
an end-to-end approach that directly predicts stability and related quantities
from appearance. We ask the question if and to what extent and quality such a
skill can directly be acquired in a data-driven way bypassing the need for an
explicit simulation.
We present a learning-based approach based on simulated data that predicts
stability of towers comprised of wooden blocks under different conditions and
quantities related to the potential fall of the towers. The evaluation is
carried out on synthetic data and compared to human judgments on the same
stimuli.

Rotations performed with the index finger and thumb involve some of the most complex motor actions among common multi-touch gestures, yet little is known about the factors affecting performance and ergonomics. This note presents results from a study where the angle, direction, diameter, and position of rotations were systematically manipulated. Subjects were asked to perform the rotations as quickly as possible without losing contact with the display, and were allowed to skip rotations that were too uncomfortable. The data show surprising interaction effects among the variables, and help us identify whole categories of rotations that are slow and cumbersome for users.

We are addressing an open-ended question answering task
about real-world images. With the help of currently available methods
developed in Computer Vision and Natural Language Processing, we would
like to push an architecture with a global visual representation to its
limits. In our contribution, we show how to achieve competitive
performance on VQA with global visual features (Residual Net) together
with a carefully designed architecture.

Undoing the image formation process and therefore decomposing appearance into
its intrinsic properties is a challenging task due to the under-constrained
nature of this inverse problem. While significant progress has been made on
inferring shape, materials and illumination from images only, progress in an
unconstrained setting is still limited. We propose a convolutional neural
architecture to estimate reflectance maps of specular materials in natural
lighting conditions. We achieve this in an end-to-end learning formulation that
directly predicts a reflectance map from the image itself. We show how to
improve estimates by facilitating additional supervision in an indirect scheme
that first predicts surface orientation and afterwards predicts the reflectance
map by a learning-based sparse data interpolation.
In order to analyze performance on this difficult task, we propose a new
challenge of Specular MAterials on SHapes with complex IllumiNation (SMASHINg)
using both synthetic and real images. Furthermore, we show the application of
our method to a range of image-based editing tasks on real images.

State-of-the-art learning based boundary detection methods require extensive
training data. Since labelling object boundaries is one of the most expensive
types of annotations, there is a need to relax the requirement to carefully
annotate images, both to make training more affordable and to extend the
amount of training data. In this paper we propose a technique to generate
weakly supervised annotations and show that bounding box annotations alone
suffice to reach high-quality object boundaries without using any
object-specific boundary annotations. With the proposed weak supervision
techniques we achieve the top performance on the object boundary detection
task, outperforming by a large margin the current fully supervised
state-of-the-art methods.

Non-maximum suppression (NMS) is used in virtually all state-of-the-art
object detection pipelines. While essential object detection ingredients such
as features, classifiers, and proposal methods have been extensively researched,
surprisingly little work has aimed to systematically address NMS. The de-facto
standard for NMS is based on greedy clustering with a fixed distance threshold,
which forces a trade-off between recall and precision. We propose a convnet
designed to perform NMS of a given set of detections. We report experiments on
a synthetic setup, and results on crowded pedestrian detection scenes. Our
approach overcomes the intrinsic limitations of greedy NMS, obtaining better
recall and precision.
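
For reference, the greedy baseline described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes axis-aligned boxes given as (x1, y1, x2, y2) and uses intersection over union (IoU) as the fixed overlap criterion, which is one common instance of the "fixed distance threshold". The names `iou`, `greedy_nms` and the example boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept detections, highest score first.

    A detection is kept only if it does not overlap any already kept
    detection by more than the fixed threshold (assumed criterion).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two heavily overlapping boxes and one far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, iou_threshold=0.5)
```

The single threshold is what creates the trade-off: a low value suppresses the second box even when it belongs to a different, nearby object (lost recall), while a high value keeps duplicate detections of the same object (lost precision).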

With the advent of affordable depth sensors, 3D capture becomes more and more
ubiquitous and already has made its way into commercial products. Yet,
capturing the geometry or complete shapes of everyday objects using scanning
devices (e.g. Kinect) still comes with several challenges that result in noise
or even incomplete shapes. Recent success in deep learning has shown how to
learn complex shape distributions in a data-driven way from large scale 3D CAD
Model collections and to utilize them for 3D processing on volumetric
representations and thereby circumventing problems of topology and
tessellation. Prior work has shown encouraging results on problems ranging from
shape completion to recognition. We provide an analysis of such approaches and
discover that training as well as the resulting representation are strongly and
unnecessarily tied to the notion of object labels. Furthermore, deep learning
research argues~\cite{Vincent08} that learning representations with
over-complete models is more prone to overfitting than learning from noisy
data. Thus, we investigate a fully convolutional volumetric
denoising auto encoder that is trained in an unsupervised fashion. It
outperforms prior work on recognition as well as more challenging tasks like
denoising and shape completion. In addition, our approach is at least two orders
of magnitude faster at test time and thus provides a path to scaling up 3D
deep learning.

Graph-based video segmentation methods rely on superpixels as a starting point.
While most previous work has focused on the construction of the graph edges and
weights as well as solving the graph partitioning problem, this paper focuses
on better superpixels for video segmentation. We demonstrate by a comparative
analysis that superpixels extracted from boundaries perform best, and show that
boundary estimation can be significantly improved via image and time domain
cues. With superpixels generated from our better boundaries we observe
consistent improvement for two video segmentation methods in two different
datasets.

Clearly explaining a rationale for a classification decision to an end-user
can be as important as the decision itself. Existing approaches for deep visual
recognition are generally opaque and do not output any justification text;
contemporary vision-language models can describe image content but fail to take
into account class-discriminative image aspects which justify visual
predictions. We propose a new model that focuses on the discriminating
properties of the visible object, jointly predicts a class label, and explains
why the predicted label is appropriate for the image. We propose a novel loss
function based on sampling and reinforcement learning that learns to generate
sentences that realize a global sentence property, such as class specificity.
Our results on a fine-grained bird species classification dataset show that our
model is able to generate explanations which are not only consistent with an
image but also more discriminative than descriptions produced by existing
captioning methods.

The goal of this paper is to advance the state-of-the-art of articulated pose
estimation in scenes with multiple people. To that end we contribute on three
fronts. We propose (1) improved body part detectors that generate effective
bottom-up proposals for body parts; (2) novel image-conditioned pairwise terms
that allow assembling the proposals into a variable number of consistent body
part configurations; and (3) an incremental optimization strategy that explores
the search space more efficiently thus leading both to better performance and
significant speed-up factors. We evaluate our approach on two single-person and
two multi-person pose estimation benchmarks. The proposed approach
significantly outperforms best known multi-person pose estimation results while
demonstrating competitive performance on the task of single person pose
estimation. Models and code available at http://pose.mpi-inf.mpg.de

Current top performing object detectors employ detection proposals to guide
the search for objects, thereby avoiding exhaustive sliding window search
across images. Despite the popularity and widespread use of detection
proposals, it is unclear which trade-offs are made when using them during
object detection. We provide an in-depth analysis of twelve proposal methods
along with four baselines regarding proposal repeatability, ground truth
annotation recall on PASCAL and ImageNet, and impact on DPM and R-CNN detection
performance. Our analysis shows that for object detection improving proposal
localisation accuracy is as important as improving recall. We introduce a novel
metric, the average recall (AR), which rewards both high recall and good
localisation and correlates surprisingly well with detector performance. Our
findings show common strengths and weaknesses of existing methods, and provide
insights and metrics for selecting and tuning proposal methods.
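
To illustrate how a single number can reward both recall and localisation, the sketch below averages recall over a range of IoU thresholds. It is an assumption-laden illustration rather than the paper's evaluation code: the threshold range of 0.5 to 1.0, the number of steps, and the names `recall_at` and `average_recall` are chosen here, and the input is assumed to be, for each ground-truth box, the IoU of its best-matching proposal.

```python
def recall_at(best_ious, threshold):
    """Fraction of ground-truth boxes whose best proposal IoU reaches the threshold."""
    if not best_ious:
        return 0.0
    return sum(1 for v in best_ious if v >= threshold) / len(best_ious)


def average_recall(best_ious, steps=50):
    """Approximate the mean recall over IoU thresholds in [0.5, 1.0] (assumed range)."""
    thresholds = [0.5 + 0.5 * k / steps for k in range(steps + 1)]
    return sum(recall_at(best_ious, t) for t in thresholds) / len(thresholds)


# Hypothetical best-match IoUs for two proposal methods on the same four objects.
loose_method = [0.55, 0.60, 0.52, 0.58]   # finds every object, poorly localised
tight_method = [0.90, 0.95, 0.85, 0.92]   # finds every object, well localised
ar_loose = average_recall(loose_method)
ar_tight = average_recall(tight_method)
```

Both hypothetical methods have full recall at an IoU of 0.5, yet the tightly localised one scores a much higher average, which is the behaviour the abstract attributes to AR.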

Head-mounted eye tracking has significant potential for
mobile gaze-based interaction with ambient displays but current
interfaces lack information about the tracker's gaze estimation error.
Consequently, current interfaces do not exploit the full potential of
gaze input as the inherent estimation error cannot be dealt with. The
error depends on the physical properties of the display and constantly
varies with changes in position and distance of the user to the display.
In this work we present a computational model of gaze estimation error
for head-mounted eye trackers. Our model covers the full processing
pipeline for mobile gaze estimation, namely mapping of pupil positions
to scene camera coordinates, marker-based display detection, and display
mapping. We build the model based on a series of controlled measurements
of a sample state-of-the-art monocular head-mounted eye tracker. Results
show that our model can predict gaze estimation error with a root mean
squared error of 17.99~px ($1.96^\circ$).

Mobile gaze-based interaction with multiple displays may
occur from arbitrary positions and orientations. However, maintaining
high gaze estimation accuracy still represents a significant challenge.
To address this, we present GazeProjector, a system that combines
accurate point-of-gaze estimation with natural feature tracking on
displays to determine the mobile eye tracker’s position relative to a
display. The detected eye positions are transformed onto that display
allowing for gaze-based interaction. This allows for seamless gaze
estimation and interaction on (1) multiple displays of arbitrary sizes,
(2) independently of the user’s position and orientation to the display.
In a user study with 12 participants we compared GazeProjector to
existing well-established methods such as visual on-screen markers and
a state-of-the-art motion capture system. Our results show that our
approach is robust to varying head poses, orientations, and distances to
the display, while still providing high gaze estimation accuracy across
multiple displays without re-calibration. The system represents an
important step towards the vision of pervasive gaze-based interfaces.

Szeliski et al. published an influential study in 2006 on energy minimization
methods for Markov Random Fields (MRF). This study provided valuable insights
in choosing the best optimization technique for certain classes of problems.
While these insights remain generally useful today, the phenomenal success of
random field models means that the kinds of inference problems that have to be
solved changed significantly. Specifically, the models today often include
higher order interactions, flexible connectivity structures, large
label-spaces of different cardinalities, or learned energy tables. To reflect
these changes, we provide a modernized and enlarged study. We present an
empirical comparison of 32 state-of-the-art optimization techniques on a corpus
of 2,453 energy minimization instances from diverse applications in computer
vision. To ensure reproducibility, we evaluate all methods in the OpenGM 2
framework and report extensive results regarding runtime and solution quality.
Key insights from our study agree with the results of Szeliski et al. for the
types of models they studied. However, on new and challenging types of models
our findings disagree and suggest that polyhedral methods and integer
programming solvers are competitive in terms of runtime and solution quality
over a large range of model types.

We present labelled pupils in the wild (LPW), a novel dataset of 66
high-quality, high-speed eye region videos for the development and evaluation
of pupil detection algorithms. The videos in our dataset were recorded from 22
participants in everyday locations at about 95 FPS using a state-of-the-art
dark-pupil head-mounted eye tracker. They cover people with different
ethnicities, a diverse set of everyday indoor and outdoor illumination
environments, as well as natural gaze direction distributions. The dataset also
includes participants wearing glasses, contact lenses, as well as make-up. We
benchmark five state-of-the-art pupil detection algorithms on our dataset with
respect to robustness and accuracy. We further study the influence of image
resolution, vision aids, as well as recording location (indoor, outdoor) on
pupil detection performance. Our evaluations provide valuable insights into the
general pupil detection problem and allow us to identify key challenges for
robust pupil detection on head-mounted eye trackers.

Progress in language and image understanding by machines has sparked the
interest of the research community in more open-ended, holistic tasks, and
refueled an old AI dream of building intelligent machines. We discuss a few
prominent challenges that characterize such holistic tasks and argue for
"question answering about images" as a particularly appealing instance of such a
holistic task. In particular, we point out that it is a version of a Turing
Test that is likely to be more robust to over-interpretations and contrast it
with tasks like grounding and generation of descriptions. Finally, we discuss
tools to measure progress in this field.

The eyes are a rich channel for non-verbal communication in
our daily interactions. We propose social gaze interaction as a game
mechanic to enhance user interactions with virtual characters. We
develop a game from the ground-up in which characters are designed to be
reactive to the player’s gaze in social ways, such as getting annoyed
when the player seems distracted or changing their dialogue depending on
the player’s apparent focus of attention. Results from a qualitative user
study provide insights about how social gaze interaction is intuitive for
users, elicits deep feelings of immersion, and highlight the players’
self-consciousness of their own eye movements through their strong
reactions to the characters.

A Field Study on Spontaneous Gaze-based Interaction with a Public Display using Pursuits. M. Khamis, F. Alt and A. Bulling. UbiComp & ISWC’15, ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015.

An increasing number of works explore collaborative human-computer systems in
which human gaze is used to enhance computer vision systems. For object
detection these efforts were so far restricted to late integration approaches
that have inherent limitations, such as increased precision without increase in
recall. We propose an early integration approach in a deformable part model,
which constitutes a joint formulation over gaze and visual data. We show that
our GazeDPM method improves over the state-of-the-art DPM baseline by 4% and a
recent method for gaze-supported object detection by 3% on the public POET
dataset. Our approach additionally provides introspection of the learnt models,
can reveal salient image structures, and allows us to investigate the interplay
between gaze attracting and repelling areas, the importance of view-specific
models, as well as viewers' personal biases in gaze patterns. We finally study
important practical aspects of our approach, such as the impact of using
saliency maps instead of real fixations, the impact of the number of fixations,
as well as robustness to gaze estimation error.

Humans can easily describe what they see in a coherent way and at varying
levels of detail. However, existing approaches for automatic video description
are mainly focused on single sentence generation and produce descriptions at a
fixed level of detail. In this paper, we address both of these limitations: for
a variable level of detail we produce coherent multi-sentence descriptions of
complex videos. We follow a two-step approach where we first learn to predict a
semantic representation (SR) from video and then generate natural language
descriptions from the SR. To produce consistent multi-sentence descriptions, we
model across-sentence consistency at the level of the SR by enforcing a
consistent topic. We also contribute both to the visual recognition of objects
proposing a hand-centric approach as well as to the robust generation of
sentences using a word lattice. Human judges rate our multi-sentence
descriptions as more readable, correct, and relevant than related work. To
understand the difference between more detailed and shorter descriptions, we
collect and analyze a video description corpus of three levels of detail.

Although gaze is an attractive modality for pervasive
interaction, real-world implementation of eye-based interfaces poses
significant challenges. In particular, user calibration is tedious and
time consuming. Pursuits is an innovative interaction technique that
enables truly spontaneous interaction with eye-based interfaces. A user
can simply walk up to the screen and readily interact with moving
targets. Instead of being based on gaze location, Pursuits correlates
eye pursuit movements with objects dynamically moving on the interface.
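
The correlation step can be sketched as below. This is only an illustration under stated assumptions: it uses the Pearson correlation per axis over a window of samples, takes the weaker of the two axes as the score, and accepts a target above a fixed threshold. The function names, the 0.8 threshold and the sample trajectories are hypothetical and not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def match_target(gaze, targets, threshold=0.8):
    """Return the name of the target whose motion best matches the gaze, or None.

    gaze is a list of (x, y) samples; targets maps a name to a list of (x, y)
    positions sampled at the same times. Targets are assumed to move along
    both axes, since a constant coordinate yields a correlation of 0.0 here.
    """
    gx = [p[0] for p in gaze]
    gy = [p[1] for p in gaze]
    best_name, best_score = None, threshold
    for name, traj in targets.items():
        tx = [p[0] for p in traj]
        ty = [p[1] for p in traj]
        score = min(pearson(gx, tx), pearson(gy, ty))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical window of 20 samples: target A moves down-right, target B up-right.
ts = range(20)
targets = {"A": [(t, 2 * t) for t in ts], "B": [(t, -2 * t) for t in ts]}
# Uncalibrated gaze: scaled, offset and slightly jittered, but following A's motion.
gaze = [(0.5 * t + 100 + (0.3 if t % 2 else -0.3), t + 50) for t in ts]
selected = match_target(gaze, targets)
```

Because correlation is unaffected by constant offsets and scaling, the gaze samples in the example match target A even though they are far from its on-screen position, which is why no user calibration is needed.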

Despite significant recent advances in image classification, fine-grained
classification remains a challenge. In the present paper, we address the
zero-shot and few-shot learning scenarios as obtaining labeled data is
especially difficult for fine-grained classification tasks. First, we embed
state-of-the-art image descriptors in a label embedding space using side
information such as attributes. We argue that learning a joint embedding space,
that maximizes the compatibility between the input and output embeddings, is
highly effective for zero/few-shot learning. We show empirically that such
embeddings significantly outperform the current state-of-the-art methods on
two challenging datasets (Caltech-UCSD Birds and Animals with Attributes).
Second, to reduce the amount of costly manual attribute annotations, we use
alternate output embeddings based on the word-vector representations, obtained
from large text-corpora without any supervision. We report that such
unsupervised embeddings achieve encouraging results, and lead to further
improvements when combined with the supervised ones.
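
As a rough illustration of compatibility-based zero-shot prediction, the sketch below scores an image feature against each class embedding and picks the best class. It assumes a bilinear compatibility of the form x^T W y, which is one common choice and is not confirmed by the abstract; the matrix `W`, the feature vector and the attribute vectors are toy values invented for the example.

```python
def compatibility(x, W, y):
    """Bilinear score x^T W y for an image feature x and a class embedding y."""
    return sum(x[i] * W[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def predict(x, W, class_embeddings):
    """Return the class whose embedding is most compatible with x.

    class_embeddings maps class names to output embeddings (e.g. attribute
    vectors); classes never seen in training can be scored the same way.
    """
    return max(class_embeddings,
               key=lambda name: compatibility(x, W, class_embeddings[name]))


# Toy setup: 2-D image features, 3-D attribute embeddings, hypothetical learned W.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
class_embeddings = {
    "striped_bird": [1.0, 0.0, 0.0],
    "spotted_bird": [0.0, 1.0, 0.0],
}
label = predict([1.0, 0.0], W, class_embeddings)
```

Since a class is represented only by its output embedding, adding a new class amounts to adding one entry to `class_embeddings`, with no labelled images required for it.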

While the majority of today's object class models provide only 2D bounding
boxes, far richer output hypotheses are desirable including viewpoint,
fine-grained category, and 3D geometry estimate. However, models trained to
provide richer output require larger amounts of training data, preferably well
covering the relevant aspects such as viewpoint and fine-grained categories. In
this paper, we address this issue from the perspective of transfer learning,
and design an object class model that explicitly leverages correlations between
visual features. Specifically, our model represents prior distributions over
permissible multi-view detectors in a parametric way -- the priors are learned
once from training data of a source object class, and can later be used to
facilitate the learning of a detector for a target class. As we show in our
experiments, this transfer is not only beneficial for detectors based on
basic-level category representations, but also enables the robust learning of
detectors that represent classes at finer levels of granularity, where training
data is typically even scarcer and more unbalanced. As a result, we report
largely improved performance in simultaneous 2D object localization and
viewpoint estimation on a recent dataset of challenging street scenes.

This paper introduces a new architecture for human pose estimation using a
multi-layer convolutional network architecture and a modified learning
technique that learns low-level features and higher-level weak spatial models.
Unconstrained human pose estimation is one of the hardest problems in computer
vision, and our new architecture and learning schema show significant
improvement over the current state-of-the-art results. The main contribution of
this paper is showing, for the first time, that a specific variation of deep
learning is able to outperform all existing traditional architectures on this
task. The paper also discusses several lessons learned while researching
alternatives, most notably, that it is possible to learn strong low-level
feature detectors on features that might even just cover a few pixels in the
image. Higher-level spatial models improve the overall result somewhat, but to
a much lesser extent than expected. Many researchers previously argued that the
kinematic structure and top-down information are crucial for this domain, but
with our purely bottom-up, weak spatial model, we could improve on other more
complicated architectures that currently produce the best results. This mirrors
what many other researchers, like those in the speech recognition, object
recognition, and other domains have experienced.

Current top performing Pascal VOC object detectors employ detection proposals to guide the search for objects, thereby avoiding exhaustive sliding window search across images. Despite the popularity of detection proposals, it is unclear which trade‐offs are made when using them during object detection. We provide an in-depth analysis of ten object proposal methods along with four baselines regarding ground truth annotation recall (on Pascal VOC 2007 and ImageNet 2013), repeatability, and impact on DPM detector performance. Our findings show common weaknesses of existing methods, and provide insights to choose the most adequate method for different settings.

As language and visual understanding by machines progresses rapidly, we are observing an increasing interest in holistic architectures that tightly interlink both modalities in a joint learning and inference process. This trend has allowed the community to progress towards more challenging and open tasks and refueled the hope of achieving the old AI dream of building machines that could pass a Turing test in open domains. In order to steadily make progress towards this goal, we realize that quantifying performance becomes increasingly difficult. Therefore we ask how we can precisely define such challenges and how we can evaluate different algorithms on these open tasks. In this paper, we summarize and discuss such challenges as well as try to give answers where appropriate options are available in the literature. We exemplify some of the solutions on a recently presented dataset for a question-answering task based on real-world indoor images that establishes a visual Turing challenge. Finally, we argue that, despite the success of unique ground-truth annotation, we likely have to step away from carefully curated datasets and rather rely on ‘social consensus’ as the main driving force to create suitable benchmarks. Providing coverage in this inherently ambiguous output space is an emerging challenge that we face in order to make quantifiable progress in this area.

Estimating a constrained relation is a fundamental problem in machine
learning. Special cases are classification (the problem of estimating a map
from a set of to-be-classified elements to a set of labels), clustering (the
problem of estimating an equivalence relation on a set) and ranking (the
problem of estimating a linear order on a set). We contribute a family of
probability measures on the set of all relations between two finite, non-empty
sets, which offers a joint abstraction of multi-label classification,
correlation clustering and ranking by linear ordering. Estimating (learning) a
maximally probable measure, given (a training set of) related and unrelated
pairs, is a convex optimization problem. Estimating (inferring) a maximally
probable relation, given a measure, is a 0-1 linear program. It is solved in
linear time for maps. It is NP-hard for equivalence relations and linear
orders. Practical solutions for all three cases are shown in experiments with
real data. Finally, estimating a maximally probable measure and relation
jointly is posed as a mixed-integer nonlinear program. This formulation
suggests a mathematical programming approach to semi-supervised learning.
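
To make the linear-time claim for maps concrete, the sketch below assumes a simple measure in which each pair (a, b) is related independently with probability p[a][b]; this independence assumption and all names are introduced here for illustration and may differ from the family of measures in the paper. Under that assumption, a maximally probable map assigns each element the label with the largest log-odds, so every pair is inspected exactly once.

```python
import math


def log_odds(q):
    """Log-odds of a probability q, assumed to lie strictly between 0 and 1."""
    return math.log(q) - math.log(1.0 - q)


def most_probable_map(p):
    """For each element a, pick the label b with the largest log-odds.

    p[a][b] is the (hypothetical) probability that a is related to b.
    Each pair is visited once, so the cost is linear in the number of pairs.
    """
    return [max(range(len(row)), key=lambda b: log_odds(row[b])) for row in p]


def log_probability(p, f):
    """Log-probability of the map f under independent pair probabilities."""
    total = 0.0
    for a, row in enumerate(p):
        for b, q in enumerate(row):
            total += math.log(q) if f[a] == b else math.log(1.0 - q)
    return total


# Two elements, three labels, with made-up pair probabilities.
p = [[0.2, 0.7, 0.1],
     [0.6, 0.3, 0.5]]
best = most_probable_map(p)
```

The per-element choice is valid because the constraint "exactly one label per element" decouples across elements. Equivalence relations and linear orders couple the pairs through transitivity, which is consistent with those cases being NP-hard.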

The recent progress in sparse coding and deep learning has made unsupervised
feature learning methods a strong competitor to hand-crafted descriptors. In
computer vision, success stories of learned features have been predominantly
reported for object recognition tasks. In this paper, we investigate if and how
feature learning can be used for material recognition. We propose two
strategies to incorporate scale information into the learning procedure
resulting in a novel multi-scale coding procedure. Our results show that our
learned features for material recognition outperform hand-crafted descriptors
on the FMD and the KTH-TIPS2 material classification benchmarks.

Over the last two decades we have witnessed strong progress on modeling
visual object classes, scenes and attributes that have significantly
contributed to automated image understanding. On the other hand, surprisingly
little progress has been made on incorporating a spatial representation and
reasoning in the inference process. In this work, we propose a pooling
interpretation of spatial relations and show how it improves image retrieval
and annotation tasks involving spatial language. Due to the complexity of the
spatial language, we argue for a learning-based approach that acquires a
representation of spatial relations by learning parameters of the pooling
operator. We show improvements on previous work on two datasets and two
different tasks as well as provide additional insights on a new dataset with an
explicit focus on spatial relations.

Garments made of smart textiles have an enormous potential for embedding sensors in close proximity to the body in an unobtrusive and comfortable manner. Combined with signal processing and pattern recognition technologies, complex high-level information about human behaviors or situations can be inferred from the sensor data. The goal of this chapter is to introduce the reader to the design of activity-aware systems that use body-worn sensors, such as those that can be made available through smart textiles. We start this chapter by emphasizing recent trends towards ‘wearable’ sensing and computing and we present several examples of activity-aware applications. Then we outline the role that smart textiles can play in activity-aware applications, but also the challenges that they pose. We conclude by discussing the design process followed to devise activity-aware systems: the choice of sensors, the available data processing methods, and the evaluation techniques. We discuss recent data processing methods that address the challenges resulting from the use of smart textiles.

We are very happy to present the proceedings of the 4th Augmented
Human International Conference (Augmented Human 2013). Augmented
Human 2013 focuses on augmenting human capabilities through technology
for increased well-being and enjoyable human experience. The conference
is in cooperation with ACM SIGCHI, with its proceedings to be archived
in ACM's Digital Library. With technological advances,
computing has progressively moved beyond the desktop into new physical
and social contexts. As physical artifacts gain new computational
behaviors, they become reprogrammable, customizable, repurposable,
and interoperable in rich ecologies and diverse contexts. They also
become more complex, and require intense design effort in order to
be functional, usable, and enjoyable. Designing such systems requires
interdisciplinary thinking. Their creation must not only encompass
software, electronics, and mechanics, but also the system's
physical form and behavior, its social and physical milieu, and beyond.

Eye-based interaction has commonly been based on estimation of eye
gaze direction, to locate objects for interaction. We introduce Pursuits,
a novel and very different eye tracking method that instead is based
on following the trajectory of eye movement and comparing this with
trajectories of objects in the field of view. Because the eyes naturally
follow the trajectory of moving objects of interest, our method is
able to detect what the user is looking at, by matching eye movement
and object movement. We illustrate Pursuits with three applications
that demonstrate how the method facilitates natural interaction with
moving targets.
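
The matching step can be illustrated with a minimal sketch. It assumes that Pearson correlation over a short window is used to compare the gaze trajectory with each object trajectory, and that the weaker of the two axis correlations must exceed a threshold; the function names, the threshold value of 0.8, and the data layout are hypothetical choices for illustration and are not taken from the abstract.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def match_target(gaze, targets, threshold=0.8):
    """Return the name of the target whose trajectory best matches the gaze window.

    gaze:    list of (x, y) gaze samples in arbitrary (uncalibrated) units
    targets: dict mapping a target name to its list of (x, y) positions,
             sampled at the same instants as the gaze
    Returns None if no target exceeds the (assumed) correlation threshold.
    Note: this sketch requires targets that move along both axes, because a
    constant coordinate yields a correlation of 0.0.
    """
    gx = [p[0] for p in gaze]
    gy = [p[1] for p in gaze]
    best_name, best_score = None, threshold
    for name, traj in targets.items():
        tx = [p[0] for p in traj]
        ty = [p[1] for p in traj]
        # Use the weaker axis so that both x and y must agree with the target.
        score = min(pearson(gx, tx), pearson(gy, ty))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Because correlation is invariant to offset and positive scaling, a sketch of this kind compares the shape of the movement rather than absolute gaze position.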

Biologically inspired, from the early HMAX model to Spatial Pyramid Matching,
pooling has played an important role in visual recognition pipelines. Spatial
pooling, by grouping local codes, equips these methods with a certain degree
of robustness to translation and deformation while preserving important spatial
information. Despite the predominance of this approach in current recognition
systems, we have seen little progress to fully adapt the pooling strategy to
the task at hand. This paper proposes a model for learning a task-dependent
pooling scheme -- including previously proposed hand-crafted pooling schemes as
a particular instantiation. In our work, we investigate the role of different
regularization terms showing that the smooth regularization term is crucial to
achieve strong performance using the presented architecture. Finally, we
propose an efficient and parallel method to train the model. Our experiments
show improved performance over hand-crafted pooling schemes on the CIFAR-10 and
CIFAR-100 datasets -- in particular improving the state-of-the-art to 56.29% on
the latter.

From the early HMAX model to Spatial Pyramid Matching, spatial pooling
has played an important role in visual recognition pipelines. By
aggregating local statistics, it equips the recognition pipelines
with a certain degree of robustness to translation and deformation
while preserving spatial information. Despite its predominance in
current recognition systems, we have seen little progress to fully
adapt the pooling strategy to the task at hand. In this paper, we
propose a flexible parameterization of the spatial pooling step and
learn the pooling regions together with the classifier. We investigate
a smoothness regularization term that in conjunction with an efficient
learning scheme makes learning scalable. Our framework can work with
both popular pooling operators: sum-pooling and max-pooling. Finally,
we show benefits of our approach for object recognition tasks based
on visual words and higher level event recognition tasks based on
object-bank features. In both cases, we improve over the hand-crafted
spatial pooling step showing the importance of its adaptation to
the task.

Eye gaze is a compelling interaction modality but requires a user
calibration before interaction can commence. State-of-the-art procedures
require the user to fixate on a succession of calibration markers,
a task that is often experienced as difficult and tedious. We present
a novel approach, pursuit calibration, that instead uses moving targets
for calibration. Users naturally perform smooth pursuit eye movements
when they follow a moving target, and we use correlation of eye and
target movement to detect the user's attention and to sample data
for calibration. Because the method knows when the user is attending
to a target, the calibration can be performed implicitly, which enables
more flexible design of the calibration task. We demonstrate this
in application examples and user studies, and show that pursuit calibration
is tolerant to interruption, can blend naturally with applications,
and is able to calibrate users without their awareness.

Humans use rich natural language to describe and communicate visual
perceptions. In order to provide natural language descriptions for
visual content, this paper combines two important ingredients. First,
we generate a rich semantic representation of the visual content
including e.g. object and activity labels. To predict the semantic
representation we learn a CRF to model the relationships between
different components of the visual input. And second, we propose
to formulate the generation of natural language as a machine translation
problem using the semantic representation as source language and
the generated sentences as target language. For this we exploit the
power of a parallel corpus of videos and textual descriptions and
adapt statistical machine translation to translate between our two
languages. We evaluate our video descriptions on the TACoS dataset,
which contains video snippets aligned with sentence descriptions.
Using automatic evaluation and human judgments we show significant
improvements over several baseline approaches, motivated by prior
work. Our translation approach also shows improvements over related
work on an image description task.

Category models for objects or activities typically rely on supervised
learning requiring sufficiently large training sets. Transferring
knowledge from known categories to novel classes with no or only
a few labels, however, is far less researched even though it is a common
scenario. In this work, we extend transfer learning with semi-supervised
learning to exploit unlabeled instances of (novel) categories with
no or only a few labeled instances. Our proposed approach Propagated
Semantic Transfer combines three main ingredients. First, we transfer
information from known to novel categories by incorporating external
knowledge, such as linguistic or expert-specified information, e.g.,
by a mid-level layer of semantic attributes. Second, we exploit the
manifold structure of novel classes. More specifically we adapt a
graph-based learning algorithm - so far only used for semi-supervised
learning - to zero-shot and few-shot learning. Third, we improve
the local neighborhood in such graph structures by replacing the
raw feature-based representation with a mid-level object- or attribute-based
representation. We evaluate our approach on three challenging datasets
in two different applications, namely on Animals with Attributes
and ImageNet for image classification and on MPII Composites for
activity recognition. Our approach consistently outperforms state-of-the-art
transfer and semi-supervised approaches on all datasets.

Research on human activity recognition has traditionally focused on
discriminating between different activities, i.e. to predict “which”
activity was performed at a specific point in time. The quality of
executing an activity, the “how (well)”,
has only received little attention so far, even though it potentially
provides useful information for a large variety of applications,
such as sports training. In this work we first define quality of
execution and investigate three aspects that pertain to qualitative
activity recognition: the problem of specifying correct execution,
the automatic and robust detection of execution mistakes, and how
to provide feedback on the quality of execution to the user. We illustrate
our approach on the example problem of qualitatively assessing and
providing feedback on weight lifting exercises. In two user studies
we try out a sensor- and a model-based approach to qualitative activity
recognition. Our results underline the potential of model-based assessment
and the positive impact of real-time user feedback on the quality
of execution.

Automatic annotation of life logging data is challenging. In this
work we present EyeContext, a system to infer high-level contextual
cues from human visual behaviour. We conduct a user study to record
eye movements of four participants over a full day of their daily
life, totalling 42.5 hours of eye movement data. Participants were
asked to self-annotate four non-mutually exclusive cues: social (interacting
with somebody vs. no interaction), cognitive (concentrated work vs.
leisure), physical (physically active vs. not active), and spatial
(inside vs. outside a building). We evaluate a proof-of-concept EyeContext
system that combines encoding of eye movements into strings and a
spectrum string kernel support vector machine (SVM) classifier. Using
person-dependent training, we obtain a top performance of 85.3%
precision (98.0% recall) for recognising social interactions. Our
results demonstrate the large information content available in long-term
human visual behaviour and open up new avenues for research on eye-based
behavioural monitoring and life logging.
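
A spectrum string kernel compares two strings by counting the length-k substrings they share. The following minimal sketch shows the kernel value that an SVM would consume; the single-character eye movement alphabet used in the usage note (for example `L`, `R`, `U`, `D` for saccade directions) and the choice of k are hypothetical and are not specified in the abstract.

```python
from collections import Counter


def spectrum(s, k):
    """Count every contiguous substring of length k in s (the k-spectrum)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(a, b, k):
    """k-spectrum kernel: inner product of the k-mer count vectors of a and b."""
    ca = spectrum(a, k)
    cb = spectrum(b, k)
    # Only k-mers present in both strings contribute to the inner product.
    return sum(count * cb[kmer] for kmer, count in ca.items() if kmer in cb)
```

For two encoded eye movement sequences such as `"LRLR"` and `"RLRL"`, `spectrum_kernel("LRLR", "RLRL", 2)` counts how often each shared pair of consecutive movements occurs in both sequences. A matrix of such values over all training sequences can then be passed to an SVM implementation that accepts precomputed kernels.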

Previous work has validated the eyes and mobile input as a viable
approach for pointing at, and selecting, out-of-reach objects. This
work presents Eye Pull, Eye Push, a novel interaction concept for
content transfer between public and personal devices using gaze and
touch. We present three techniques that enable this interaction:
Eye Cut & Paste, Eye Drag & Drop, and Eye Summon & Cast. We outline
and discuss several scenarios in which these techniques can be used.
In a user study we found that participants responded well to the
visual feedback provided by Eye Drag & Drop during object movement.
In contrast, we found that although Eye Summon & Cast significantly
improved performance, participants had difficulty coordinating their
hands and eyes during interaction.

People tracking in crowded real-world scenes is challenging due to
frequent and long-term occlusions. Recent tracking methods obtain
the image evidence from object (people) detectors, but typically
use off-the-shelf detectors and treat them as black box components.
In this paper we argue that for best performance one should explicitly
train people detectors on failure cases of the overall tracker instead.
To that end, we first propose a novel joint people detector that
combines a state-of-the-art single person detector with a detector
for pairs of people, which explicitly exploits common patterns of
person-person occlusions across multiple viewpoints that are a common
failure case for tracking in crowded scenes. To explicitly address
remaining failure cases of the tracker we explore two methods. First,
we analyze typical failure cases of trackers and train a detector
explicitly on those failure cases. And second, we train the detector
with the people tracker in the loop, focusing on the most common
tracker failures. We show that our joint multi-person detector significantly
improves both detection accuracy and tracker performance,
improving the state-of-the-art on standard benchmarks.

Object class recognition is an active topic in computer vision still
presenting many challenges. In most approaches, this task is addressed
by supervised learning algorithms that need a large quantity of labels
to perform well. This leads either to small datasets (< 10,000 images)
that capture only a subset of the real-world class distribution (but
with a controlled and verified labeling procedure), or to large datasets
that are more representative but also add more label noise. Therefore,
semi-supervised learning is a promising direction. It requires only
few labels while simultaneously making use of the vast amount of
images available today. We address object class recognition with
semi-supervised learning. These algorithms depend on the underlying
structure given by the data, the image description, the similarity
measure, and the quality of the labels. This insight leads to the
main research questions of this thesis: Is the structure given by
labeled and unlabeled data more important than the algorithm itself?
Can we improve this neighborhood structure by a better similarity
metric or with more representative unlabeled data? Is there a connection
between the quality of labels and the overall performance and how
can we get more representative labels? We answer all these questions,
i.e., we provide an extensive evaluation, we propose several graph
improvements, and we introduce a novel active learning framework
to get more representative labels.

In this paper we consider people detection and articulated pose estimation, two closely related and challenging problems in computer vision. Conceptually, both of these problems can be addressed within the pictorial structures framework (Felzenszwalb and Huttenlocher in Int. J. Comput. Vis. 61(1):55–79, 2005; Fischler and Elschlager in IEEE Trans. Comput. C-22(1):67–92, 1973), even though previous approaches have not shown such generality. A principal difficulty for such a general approach is to model the appearance of body parts. The model has to be discriminative enough to enable reliable detection in cluttered scenes and general enough to capture highly variable appearance. Therefore, as the first important component of our approach, we propose a discriminative appearance model based on densely sampled local descriptors and AdaBoost classifiers. Secondly, we interpret the normalized margin of each classifier as likelihood in a generative model and compute marginal posteriors for each part using belief propagation. Thirdly, non-Gaussian relationships between parts are represented as Gaussians in the coordinate system of the joint between the parts. Additionally, in order to cope with shortcomings of tree-based pictorial structures models, we augment our model with additional repulsive factors in order to discourage overcounting of image evidence. We demonstrate that the combination of these components within the pictorial structures framework results in a generic model that yields state-of-the-art performance for several datasets on a variety of tasks: people detection, upper body pose estimation, and full body pose estimation.

Current object class recognition systems typically target 2D bounding box localization, encouraged by benchmark data sets, such as Pascal VOC. While this seems suitable for the detection of individual objects, higher-level applications such as 3D scene understanding or 3D object tracking would benefit from more fine-grained object hypotheses incorporating 3D geometric information, such as viewpoints or the locations of individual parts. In this paper, we help narrow the representational gap between the ideal input of a scene understanding system and object class detector output, by designing a detector particularly tailored towards 3D geometric reasoning. In particular, we extend the successful discriminatively trained deformable part models to include both estimates of viewpoint and 3D parts that are consistent across viewpoints. We experimentally verify that adding 3D geometric information comes at minimal performance loss w.r.t. 2D bounding box localization, but outperforms prior work in 3D viewpoint estimation and ultra-wide baseline matching.

State-of-the-art methods for human detection and pose estimation require many training samples for best performance. While large, manually collected datasets exist, the captured variations w.r.t. appearance, shape and pose are often uncontrolled, thus limiting the overall performance. In order to overcome this limitation we propose a new technique to extend an existing training set that allows explicit control of pose and shape variations. For this we build on recent advances in computer graphics to generate samples with realistic appearance and background while modifying body shape and pose. We validate the effectiveness of our approach on the task of articulated human detection and articulated pose estimation. We report close to state-of-the-art results on the popular Image Parsing human pose estimation benchmark and demonstrate superior performance for articulated human detection. In addition we define a new challenge of combined articulated human detection and pose estimation in real-world scenes.

This paper presents a novel pedestrian detection system for intelligent
vehicles. We propose the use of dense stereo for both the generation of regions
of interest and pedestrian classification. Dense stereo allows the dynamic
estimation of camera parameters and the road profile, which, in turn, provides
strong scene constraints on possible pedestrian locations. For classification,
we extract spatial features (gradient orientation histograms) directly from
dense depth and intensity images. Both modalities are represented in terms of
individual feature spaces, in which discriminative classifiers (linear support
vector machines) are learned. We refrain from the construction of a joint
feature space but instead employ a fusion of depth and intensity on the
classifier level. Our experiments involve challenging image data captured in
complex urban environments (i.e., undulating roads and speed bumps). Our
results show a performance improvement by up to a factor of 7.5 at the
classification level and up to a factor of 5 at the tracking level (reduction
in false alarms at constant detection rates) over a system with static scene
constraints and intensity-only classification.

Location is key information for context-aware systems. While coarse-grained
indoor location estimates may be obtained quite easily (e.g. based on WiFi or
GSM), finer-grained estimates typically require additional infrastructure (e.g.
ultrasound). This work explores an approach to estimate significant places,
e.g., at the fridge, with no additional setup or infrastructure. We use a
pocket-based inertial measurement sensor, which can be found in many recent
phones. We analyze how the spatial layout such as geographic orientation of
buildings, arrangement and type of furniture can serve as the basis to estimate
typical places in a daily scenario. Initial experiments reveal that our
approach can detect fine-grained locations without relying on any
infrastructure or additional devices.

We propose a method to learn simultaneously a vector-valued function and a
kernel between its components. The obtained kernel can be used both to improve
learning performance and to reveal structures in the output space which may be
important in their own right. Our method is based on the solution of a suitable
regularization problem over a reproducing kernel Hilbert space of vector-valued
functions. Although the regularized risk functional is non-convex, we show that
it is invex, implying that all local minimizers are global minimizers. We
derive a block-wise coordinate descent method that efficiently exploits the
structure of the objective functional. Then, we empirically demonstrate that
the proposed method can improve classification accuracy. Finally, we provide a
visual interpretation of the learned kernel matrix for some well known
datasets.

Branch&rank is an object detection scheme that overcomes the inherent
limitation of branch&bound: this method works with arbitrary (classifier)
functions whereas tight bounds exist only for simple functions. Objects are
usually detected with fewer than 100 classifier evaluations, which paves the way
for using strong (and thus costly) classifiers: We utilize non-linear SVMs with
RBF-χ² kernels without a cascade-like approximation. Our approach features
three key components: a ranking function that operates on sets of hypotheses
and a grouping of these into different tasks. Detection efficiency results from
adaptively sub-dividing the object search space into decreasingly smaller sets.
This is inherited from branch&bound, while the ranking function supersedes a
tight bound which is often unavailable (except for too simple function
classes). The grouping makes the system effective: it separates image
classification from object recognition, yet combines them in a single,
structured SVM formulation. A novel aspect of branch&rank is that a better
ranking function is expected to decrease the number of classifier calls during
detection. We demonstrate the algorithmic properties using the VOC'07 dataset.

Geometric 3D reasoning has received renewed attention recently, in the context
of visual scene understanding. The level of geometric detail, however, is
typically limited to qualitative or coarse-grained quantitative
representations. This is linked to the fact that today's object class detectors
are tuned towards robust 2D matching rather than accurate 3D pose estimation,
encouraged by 2D bounding box-based benchmarks such as Pascal VOC. In this
paper, we therefore revisit ideas from the early days of computer vision,
namely, 3D geometric object class representations for recognition. These
representations can recover geometrically far more accurate object hypotheses
than just 2D bounding boxes, including relative 3D positions of object parts.
In combination with recent robust techniques for shape description and
inference, our approach outperforms state-of-the-art results in 3D pose
estimation, while at the same time improving 2D localization. In a series of
experiments, we analyze our approach in detail, and demonstrate novel
applications enabled by our geometric object class representation, such as
fine-grained categorization of cars according to their 3D geometry and
ultra-wide baseline matching.

We address the challenging task of decoupling material properties from lighting
properties given a single image. In the last two decades virtually all works
have concentrated on exploiting edge information to address this problem. We
take a different route by introducing a new prior on reflectance, that models
reflectance values as being drawn from a sparse set of basis colors. This
results in a Random Field model with global, latent variables (basis colors)
and pixel-accurate output reflectance values. We show that without edge
information high-quality results can be achieved, that are on par with methods
exploiting this source of information. Finally, we are able to improve on
state-of-the-art results by integrating edge information into our model. We
believe that our new approach is an excellent starting point for future
developments in this field.

Hearing instruments (HIs) have emerged as true pervasive computers
as they continuously adapt the hearing program to the user's
context. However, current HIs are not able to distinguish different
hearing needs in the same acoustic environment. In this work, we
explore how information derived from body and eye movements can be
used to improve the recognition of such hearing needs. We conduct
an experiment to provoke an acoustic environment in which different
hearing needs arise: active conversation and working while colleagues
are having a conversation in a noisy office environment. We record
body movements on nine body locations, eye movements using electrooculography
(EOG), and sound using commercial HIs for eleven participants. Using
a support vector machine (SVM) classifier and person-independent
training we improve the accuracy of 77% based on sound to an accuracy
of 92% using body movements. With a view to a future implementation
into a HI we then perform a detailed analysis of the sensors attached
to the head. We achieve the best accuracy of 86% using eye movements
compared to 84% for head movements. Our work demonstrates the potential
of additional sensor modalities for future HIs and motivates investigating
the wider applicability of this approach on further hearing situations
and needs.

Remarkable performance has been reported for recognizing single object classes.
Scalability to large numbers of classes, however, remains an important challenge
for today's recognition methods. Several authors have promoted knowledge
transfer between classes as a key ingredient to address this challenge.
However, in previous work the decision which knowledge to transfer has required
either manual supervision or at least a few training examples limiting the
scalability of these approaches. In this work we explicitly address the
question of how to automatically decide which information to transfer between
classes without the need of any human intervention. For this we tap into
linguistic knowledge bases to provide the semantic link between sources (what)
and targets (where) of knowledge transfer. We provide a rigorous experimental
evaluation of different knowledge bases and state-of-the-art techniques from
Natural Language Processing which goes far beyond the limited use of language
in related work. We also give insights into the applicability (why) of
different knowledge sources and similarity measures for knowledge transfer.

Object recognition is challenging due to high intra-class variability caused,
e.g., by articulation, viewpoint changes, and partial occlusion. Successful
methods need to strike a balance between being flexible enough to model such
variation and discriminative enough to detect objects in cluttered, real world
scenes. Motivated by these challenges we propose a latent conditional random
field (CRF) based on a flexible assembly of parts. By modeling part labels as
hidden nodes and developing an EM algorithm for learning from class labels
alone, this new approach enables the automatic discovery of semantically
meaningful object part representations. To increase the flexibility and
expressiveness of the model, we learn the pairwise structure of the underlying
graphical model at the level of object part interactions. Efficient
gradient-based techniques are used to estimate the structure of the domain of
interest and carried forward to the multi-label or object part case. Our
experiments illustrate the meaningfulness of the discovered parts and
demonstrate state-of-the-art performance of the approach.

Automatic recovery of 3D human pose from monocular image sequences is a
challenging and important research topic with numerous applications. Although
current methods are able to recover 3D pose for a single person in controlled
environments, they are severely challenged by real-world scenarios, such as
crowded street scenes. To address this problem, we propose a three-stage
process building on a number of recent advances. The first stage obtains an
initial estimate of the 2D articulation and viewpoint of the person from single
frames. The second stage allows early data association across frames based on
tracking-by-detection. These two stages successfully accumulate the available
2D image evidence into robust estimates of 2D limb positions over short image
sequences (= tracklets). The third and final stage uses those tracklet-based
estimates as robust image observations to reliably recover 3D pose. We
demonstrate state-of-the-art performance on the HumanEva II benchmark, and also
show the applicability of our approach to articulated 3D tracking in realistic
street conditions.

Knowledge transfer between object classes has been identified as an important
tool for scalable recognition. However, determining which knowledge to transfer
where remains a key challenge. While most approaches employ varying levels of
human supervision, we follow the idea of mining linguistic knowledge bases to
automatically infer transferable knowledge. In contrast to previous work, we
explicitly aim to design robust semantic relatedness measures and to combine
different language sources for attribute-based knowledge transfer. On the
challenging Animals with Attributes (AwA) data set, we report largely improved
attribute-based zero-shot object class recognition performance that matches the
performance of human supervision.

Recognizing 3D objects from arbitrary viewpoints is one of the
most fundamental problems in computer vision. A major challenge lies
in the transition between the 3D geometry of objects and 2D
representations that can be robustly matched to natural images. Most
approaches thus rely on 2D natural images either as the sole source of
training data for building an implicit 3D representation, or by
enriching 3D models with natural image features.
In this paper, we go back to the ideas from the early days of computer
vision, by using 3D object models as the only source of information for
building a multi-view object class detector. In particular, we use
these models for learning 2D shape that can be robustly matched to 2D
natural images. Our experiments confirm the validity of our approach,
which outperforms current state-of-the-art techniques on a multi-view
detection data set.

Many computer vision methods rely on annotated image databases without taking
advantage of the increasing number of unlabeled images available. This paper
explores an alternative approach involving unsupervised structure discovery and
semi-supervised learning (SSL) in image collections. Focusing on object
classes, the first part of the paper contributes an extensive evaluation of
state-of-the-art image representations underlining the decisive influence of the
local neighborhood structure, its direct consequences on SSL results, and the
importance of developing powerful object representations. In a second part, we
propose and explore promising directions to improve results by looking at the
local topology between images and feature combination strategies.

Finding injured humans is one of the primary
goals of any search and rescue operation. The aim of this paper
is to address the task of automatically finding people lying on
the ground in images taken from the on-board camera of an
unmanned aerial vehicle (UAV).
In this paper we evaluate various state-of-the-art visual
people detection methods in the context of vision based victim
detection from a UAV. The top performing approaches in
this comparison are those that rely on flexible part-based
representations and discriminatively trained part detectors. We
discuss their strengths and weaknesses and demonstrate that by
combining multiple models we can increase the reliability of the
system. We also demonstrate that the detection performance
can be substantially improved by integrating the height and
pitch information provided by on-board sensors. Jointly these
improvements allow us to significantly boost the detection
performance over the current de-facto standard, which provides
a substantial step towards making autonomous victim detection
for UAVs practical.