Detecting and identifying the different objects in an image quickly and reliably is an
important skill for interacting with one’s environment. The main problem is that, in
theory, all parts of an image have to be searched for objects at many different scales
to make sure that no object instance is missed. However, classifying the content of a
given image region takes considerable time and effort, and both the time and the
computational capacity that an agent can spend on classification are limited.
Humans use a process called visual attention to quickly decide which locations of
an image need to be processed in detail and which can be ignored. This allows us
to deal with the huge amount of visual information and to employ the capacities
of our visual system efficiently.
For computer vision, researchers have to deal with exactly the same problems,
so learning from the behaviour of humans provides a promising way to improve
existing algorithms. In the presented master’s thesis, a model is trained on eye-tracking
data recorded from 15 participants who were asked to search images for
objects from three different categories. It uses a deep convolutional neural network
to extract features from the input image that are then combined into a saliency
map. This map indicates which image regions are interesting
when searching for the given target object and can thus be used to reduce the
parts of the image that have to be processed in detail. The method is based on a
recent publication by Kümmerer et al., but in contrast to the original method, which
computes general, task-independent saliency, the presented model is designed to
respond differently depending on the target category.
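As an illustration of this kind of architecture, a minimal sketch in PyTorch could look as follows; the VGG backbone, the per-category 1x1 readout, and all names are assumptions made for illustration, not the implementation used in the thesis:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class TaskSaliencyModel(nn.Module):
    """Sketch: frozen CNN features -> per-category readout -> saliency map."""

    def __init__(self, num_categories=3):
        super().__init__()
        # Pretrained feature extractor (downloads ImageNet weights on first use).
        self.features = vgg16(weights="IMAGENET1K_V1").features
        for p in self.features.parameters():
            p.requires_grad = False
        # One linear readout (1x1 convolution) per search target category,
        # so the saliency map can differ depending on what is being searched for.
        self.readouts = nn.ModuleList(
            [nn.Conv2d(512, 1, kernel_size=1) for _ in range(num_categories)]
        )

    def forward(self, image, category):
        feats = self.features(image)               # (B, 512, H', W')
        logits = self.readouts[category](feats)    # (B, 1, H', W')
        b, _, h, w = logits.shape
        # Normalise to a probability distribution over image locations.
        return torch.softmax(logits.view(b, -1), dim=1).view(b, 1, h, w)

# Example: saliency map for category 0 on a random 224x224 image.
model = TaskSaliencyModel()
saliency = model(torch.rand(1, 3, 224, 224), category=0)  # shape (1, 1, 7, 7)
```

In such a setup, only the small readout layers would be fitted to the recorded fixation data, while the feature extractor stays fixed.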

For grasping and manipulation with robot arms, knowing the current pose of the arm is crucial
for successfully controlling its motion. Pose estimates can often be obtained from encoders
inside the arm, but these can be significantly inaccurate, which makes additional estimation
techniques necessary.
In this master’s thesis, a novel approach to robot arm pose estimation is presented that works on
single depth images without requiring prior foreground segmentation or other preprocessing
steps.
A random regression forest is used, which is trained only on synthetically generated data.
The approach improves on earlier work by Bohg et al. by considerably reducing the computational
effort at both training and test time. In the new method, the forest directly estimates the
desired joint angles, whereas in the earlier approach the forest casts 3D position votes for the
joints, which then have to be clustered and fed into an iterative inverse kinematics procedure to
finally obtain the joint angles.
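A minimal sketch of this direct regression setup could look like the following; scikit-learn, random feature vectors standing in for the rendered depth images, and an assumed 7-joint arm are all illustrative choices, not the thesis implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

NUM_JOINTS = 7      # assumed number of arm joints
NUM_FEATURES = 200  # stand-in for depth-difference features per image

# Synthetic training set: in the thesis the features come from synthetically
# rendered depth images of the arm; here random vectors merely stand in for them.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, NUM_FEATURES))
theta_train = rng.uniform(-np.pi, np.pi, size=(5000, NUM_JOINTS))

# Direct regression: feature vector -> joint-angle vector, with no per-joint
# 3D voting, clustering, or inverse kinematics step in between.
forest = RandomForestRegressor(n_estimators=50, max_depth=20, n_jobs=-1)
forest.fit(X_train, theta_train)

# At test time a single depth image yields one feature vector and one estimate.
theta_hat = forest.predict(rng.normal(size=(1, NUM_FEATURES)))[0]  # shape (NUM_JOINTS,)
```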
To improve estimation accuracy, the standard objective used to train the forest is
replaced by a specialized function based on a model-dependent distance metric called
DISP.
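To illustrate what such a model-dependent objective could look like, the sketch below scores a candidate split by the displacement of points on a toy planar arm rather than by per-angle variance; the forward kinematics, link lengths, and the exact form of the impurity are simplified assumptions, not the thesis formulation:

```python
import numpy as np

LINK_LENGTHS = np.array([0.4, 0.35, 0.25])  # toy planar 3-link arm (assumption)

def forward_kinematics(theta):
    """3-DoF planar arm: joint angles -> 2D positions of the link end points."""
    points, pos, angle = [], np.zeros(2), 0.0
    for length, t in zip(LINK_LENGTHS, theta):
        angle += t
        pos = pos + length * np.array([np.cos(angle), np.sin(angle)])
        points.append(pos)
    return np.array(points)

def disp_distance(theta_a, theta_b):
    """DISP-style distance: largest displacement of any model point between two poses."""
    return np.max(np.linalg.norm(
        forward_kinematics(theta_a) - forward_kinematics(theta_b), axis=1))

def disp_node_impurity(thetas):
    """Node impurity: mean DISP distance of the samples to the node's mean pose,
    replacing the standard per-angle variance used in regression forest training."""
    mean_pose = thetas.mean(axis=0)
    return float(np.mean([disp_distance(t, mean_pose) for t in thetas]))

def split_score(thetas_left, thetas_right):
    """Size-weighted impurity of the two children; the best split minimises this."""
    n_l, n_r = len(thetas_left), len(thetas_right)
    n = n_l + n_r
    return (n_l / n) * disp_node_impurity(thetas_left) \
         + (n_r / n) * disp_node_impurity(thetas_right)

# Example: score one candidate split of a node containing random poses.
rng = np.random.default_rng(0)
poses = rng.uniform(-np.pi, np.pi, size=(40, len(LINK_LENGTHS)))
print(split_score(poses[:20], poses[20:]))
```

The intent of such a criterion is that two configurations count as similar only if they place the arm at nearly the same position in space, not merely if their joint angles are numerically close.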
Experimental results show that the specialized objective indeed improves pose estimation,
and that the method, despite being trained only on synthetic data, is able to
provide reasonable estimates for real data at test time.
