It has been demonstrated in a number of robotic areas that the use of virtual fixtures improves task performance in terms of both execution time and overall precision [1]. However, the fixtures are typically inflexible, resulting in degraded performance in cases of unexpected obstacles or incorrect fixture models. In this paper, we propose the use of adaptive virtual fixtures that enable us to cope with the above problems. A teleoperative or human-machine collaborative setting is assumed, with the core idea of dividing the task that the operator is executing into several subtasks. The operator may remain in each of these subtasks as long as necessary and switch freely between them. Hence, rather than executing a predefined plan, the operator has the ability to avoid unforeseen obstacles and deviate from the model. In our system, the probability that the user is following a certain trajectory (subtask) is estimated and used to automatically adjust the compliance. Thus, an on-line decision of how to fixture the movement is provided.
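
As a rough illustration of how an estimated probability could drive the compliance, the sketch below attenuates the part of the operator's velocity that is orthogonal to the preferred fixture direction. The mapping `compliance = 1 - p` and the function names are assumptions made for this example and are not taken from the paper.

```python
def compliance_from_probability(p_following):
    """Hypothetical mapping: the more certain we are that the operator follows
    the fixture, the stiffer (less compliant) the fixture becomes."""
    p = min(1.0, max(0.0, p_following))  # clamp to [0, 1]
    return 1.0 - p


def apply_fixture(user_vel, direction, compliance):
    """Keep the velocity component along `direction` (assumed to be a unit
    vector) and scale the orthogonal component by `compliance` in [0, 1]."""
    dot = sum(u * d for u, d in zip(user_vel, direction))
    along = [dot * d for d in direction]
    ortho = [u - a for u, a in zip(user_vel, along)]
    return [a + compliance * o for a, o in zip(along, ortho)]


if __name__ == "__main__":
    k = compliance_from_probability(0.75)
    print(apply_fixture([1.0, 1.0], [1.0, 0.0], k))
```

With compliance 0 the motion is constrained to the fixture direction, and with compliance 1 the operator moves freely.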

Acquiring, representing and modeling human skills is one of the key research areas in teleoperation, programming-by-demonstration and human-machine collaborative settings. One of the common approaches is to divide the task that the operator is executing into several subtasks in order to provide manageable modeling. In this paper we consider the use of a Layered Hidden Markov Model (LHMM) to model human skills. We evaluate a gestem classifier that classifies motions into basic action-primitives, or gestems. The gestem classifiers are then used in a LHMM to model a simulated teleoperated task. We investigate the online and offline classification performance with respect to noise, number of gestems, type of HMM and the available number of training sequences. We also apply the LHMM to data recorded during the execution of a trajectory-tracking task in 2D and 3D with a robotic manipulator in order to give qualitative as well as quantitative results for the proposed approach. The results indicate that the LHMM is suitable for modeling teleoperative trajectory-tracking tasks and that the difference in classification performance between one and multi dimensional HMMs for gestem classification is small. It can also be seen that the LHMM is robust w.r.t. misclassifications in the underlying gestem classifiers.

Acquiring, representing and modelling human skills is one of the key research areas in teleoperation, programming-by-demonstration and human-machine collaborative settings. The problems are challenging mainly because of the lack of a general mathematical model to describe human skills. One of the common approaches is to divide the task that the operator is executing into several subtasks or low-level subsystems in order to provide manageable modelling. In this paper we consider the use of a Layered Hidden Markov Model (LHMM) to model human skills. We evaluate a gesteme classifier that classifies motions into basic action-primitives, or gestemes. The gesteme classifiers are then used in a LHMM to model a teleoperated task. The proposed methodology uses three different HMM models at the gesteme level: one-dimensional HMM, multi-dimensional HMM and multidimensional HMM with Fourier transform. The online and off-line classification performance of these three models is evaluated with respect to the number of gestemes, the influence of the number of training samples, the effect of noise and the effect of the number of observation symbols. We also apply the LHMM to data recorded during the execution of a trajectory tracking task in 2D and 3D with a mobile manipulator in order to provide qualitative as well as quantitative results for the proposed approach. The results indicate that the LHMM is suitable for modelling teleoperative trajectory-tracking tasks and that the difference in classification performance between one and multidimensional HMMs for gesteme classification is small. It can also be seen that the LHMM is robust with respect to misclassifications in the underlying gesteme classifiers.
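
The classification step at the gesteme level can be illustrated with a minimal sketch: each gesteme is represented by a discrete HMM, and an observation sequence is assigned to the model with the highest forward-algorithm likelihood. The toy models, their parameters and the names `forward_log_likelihood` and `classify` are assumptions for illustration only; the layered structure and the training procedure of the paper are not reproduced here.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.
    start[s], trans[p][s] and emit[s][symbol] are probabilities."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


def classify(obs, models):
    """Return the name of the model with the highest likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))


# Two hypothetical gesteme models over a binary observation alphabet.
TRANS = [[0.9, 0.1], [0.1, 0.9]]
MODELS = {
    "left": ([0.5, 0.5], TRANS, [[0.9, 0.1], [0.8, 0.2]]),
    "right": ([0.5, 0.5], TRANS, [[0.1, 0.9], [0.2, 0.8]]),
}

if __name__ == "__main__":
    print(classify([0, 0, 0, 1], MODELS))
```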

This paper presents our ongoing research in the design of a versatile service robot capable of operating in a home or office environment. Ideas presented here cover architectural issues and possible applications for such a robot system, with a focus on tasks requiring constrained end-effector motions. Two key components of such a system are a path planner and a reactive behavior capable of force relaxation and path adaptation. These components are presented in detail along with an overview of the software architecture they fit into.

One of the main challenges in the field of robotics is to make robots ubiquitous. To interact intelligently with the world, such robots need to understand the environment and situations around them and react appropriately; they need context-awareness. But how can robots be equipped with the capability to gather and interpret the necessary information for novel tasks through interaction with the environment, given only some minimal knowledge in advance? This has been a long-term question and one of the main drivers in the field of cognitive system development. The main idea behind the work presented in this paper is that the robot should, like a human infant, learn about objects by interacting with them, forming representations of the objects and their categories that are grounded in its embodiment. For this purpose, we study an early learning of object grasping process where the agent acts based on a set of innate reflexes and knowledge about its embodiment. We stress that this is not work on grasping; it is a system that interacts with the environment based on relations of 3D visual features generated through a stereo vision system. We show how geometry, appearance and spatial relations between the features can guide early reactive grasping, which can later be used in a more purposive manner when interacting with the environment.

This paper extends the recently developed Model-Aided Visual-Inertial Fusion (MA-VIF) technique for quadrotor Micro Air Vehicles (MAV) to deal with wind disturbances. The wind effects are explicitly modelled in the quadrotor dynamic equations, excluding the unobservable wind velocity component. This is achieved by a nonlinear observability analysis of the dynamic system with wind effects. We show that, using the developed model, the vehicle pose and two components of the wind velocity vector can be simultaneously estimated with a monocular camera and an inertial measurement unit. We also show that the MA-VIF is reasonably tolerant to wind disturbances, even without explicit modelling of wind effects, and explain the reasons for this behaviour. Experimental results using a Vicon motion capture system are presented to demonstrate the effectiveness of the proposed method and validate our claims.

The computation of persistent homology has proven a fundamental component of the nascent field of topological data analysis and computational topology. We describe a new software package for topological computation, with a design focus on the needs of the research community. This tool, replacing previous jPlex and Plex, enables researchers to access state of the art algorithms for persistent homology, cohomology, hom complexes, filtered simplicial complexes, filtered cell complexes, witness complex constructions, and many more essential components of computational topology. We describe herein the design goals we have chosen, as well as the resulting software package and some of its more novel capabilities.

In this paper, we propose a framework for gradually improving the quality of an already existing image descriptor. The descriptor used in this paper (Afkham et al., 2013) uses the response of a series of discriminative components for summarizing each image. As we will show, this descriptor has an ideal form in which all categories become linearly separable. While reaching this form is not feasible, we will argue that, by replacing a small fraction of these components, it is possible to obtain a descriptor which is, on average, closer to this ideal form. To do so, we initially identify which components do not contribute to the quality of the descriptor and replace them with more robust components. Here, a joint feature selection method is used to find improved components. As our experiments show, this change is directly reflected in the capability of the resulting descriptor to discriminate between different categories.

In this paper, we discuss the properties of a class of latent variable models that assumes each labeled sample is associated with a set of different features, with no prior knowledge of which feature is the most relevant one to use. Deformable-Part Models (DPM) can be seen as a good example of such models. While the Latent SVM framework (LSVM) has proven to be an efficient tool for solving these models, we will argue that the solution found by this tool is very sensitive to the initialization. To decrease this dependency, we propose a novel clustering procedure for these problems that finds cluster centers shared by several sample sets while ignoring the rest of the cluster centers. As we will show, these cluster centers provide a robust initialization for the LSVM framework.

This thesis is mostly about supervised visual recognition problems. Based on a general definition of categories, the contents are divided into two parts: one which models categories and one which is not category based. We are interested in data driven solutions for both kinds of problems.

In the category-free part, we study novelty detection in temporal and spatial domains as a category-free recognition problem. Using data driven models, we demonstrate that based on a few reference exemplars, our methods are able to detect novelties in ego-motions of people, and changes in the static environments surrounding them.

In the category level part, we study object recognition. We consider both object category classification and localization, and propose scalable data driven approaches for both problems. A mixture of parametric classifiers, initialized with a sophisticated clustering of the training data, is demonstrated to adapt to the data better than various baselines such as the same model initialized with less subtly designed procedures. A nonparametric large margin classifier is introduced and demonstrated to have a multitude of advantages in comparison to its competitors: better training and testing time costs, the ability to make use of indefinite/invariant and deformable similarity measures, and adaptive complexity are the main features of the proposed model.

We also propose a rather realistic model of recognition problems, which quantifies the interplay between representations, classifiers, and recognition performances. Based on data-describing measures which are aggregates of pairwise similarities of the training data, our model characterizes and describes the distributions of training exemplars. The measures are shown to capture many aspects of the difficulty of categorization problems and correlate significantly to the observed recognition performances. Utilizing these measures, the model predicts the performance of particular classifiers on distributions similar to the training data. These predictions, when compared to the test performance of the classifiers on the test sets, are reasonably accurate.

We discuss various aspects of visual recognition problems: what is the interplay between representations and classification tasks, how can different models better adapt to the training data, etc. We describe and analyze the aforementioned methods that are designed to tackle different visual recognition problems, but share one common characteristic: being data driven.

The non-linear decision boundary between object and background classes - due to large intra-class variations - needs to be modelled by any classifier wishing to achieve good results. While a mixture of linear classifiers is capable of modelling this non-linearity, learning this mixture from weakly annotated data is non-trivial and is the paper's focus. Our approach is to identify the modes in the distribution of our positive examples by clustering, and to utilize this clustering in a latent SVM formulation to learn the mixture model. The clustering relies on a robust measure of visual similarity which suppresses uninformative clutter by using a novel representation based on the exemplar SVM. This subtle clustering of the data leads to learning better mixture models, as is demonstrated via extensive evaluations on Pascal VOC 2007. The final classifier, using a HOG representation of the global image patch, achieves performance comparable to the state-of-the-art while being more efficient at detection time.

It has been shown that the performance of classifiers depends not only on the number of training samples, but also on the quality of the training set [10, 12]. The purpose of this paper is to 1) provide quantitative measures that determine the quality of the training set and 2) provide the relation between the test performance and the proposed measures. The measures are derived from pairwise affinities between training exemplars of the positive class and they have a generative nature. We show that the performance of the state of the art methods, on the test set, can be reasonably predicted based on the values of the proposed measures on the training set. These measures open up a wide range of applications to the recognition community enabling us to analyze the behavior of the learning algorithms w.r.t the properties of the training data. This will in turn enable us to devise rules for the automatic selection of training data that maximize the quantified quality of the training set and thereby improve recognition performance.

We propose a system for the automatic segmentation of novelties from the background in scenarios where multiple images of the same environment are available, e.g. obtained by wearable visual cameras. Our method finds the pixels in a query image corresponding to the underlying background environment by comparing it to reference images of the same scene. This is achieved despite the fact that all the images may have different viewpoints, significantly different illumination conditions and contain different objects (cars, people, bicycles, etc.) occluding the background. We estimate the probability of each pixel in the query image belonging to the background by computing its appearance inconsistency to the multiple reference images. We then produce multiple segmentations of the query image using an iterated graph cuts algorithm, initialized from these estimated probabilities, and successively combine these segmentations to arrive at a final segmentation of the background. Detection of the background in turn highlights the novel pixels. We demonstrate the effectiveness of our approach on a challenging outdoor data set.

This paper demonstrates a system for the automatic extraction of novelty in images captured from a small video camera attached to a subject's chest, replicating his visual perspective, while performing activities which are repeated daily. Novelty is detected when a (sub)sequence cannot be registered to previously stored sequences captured while performing the same daily activity. Sequence registration is performed by measuring appearance and geometric similarity of individual frames and exploiting the invariant temporal order of the activity. Experimental results demonstrate that this is a robust way to detect novelties induced by variations in the wearer's ego-motion such as stopping and talking to a person. This is an essentially new and generic way of automatically extracting information of interest to the camera wearer and can be used as input to a system for life logging or memory support.

The kind of interaction occurring between a conductor and musicians while performing a musical piece together is a unique instance of human non-verbal communication. This Musical Production Process (MPP) thus provides an interesting area of research, both from a communication perspective and in its own right. The long term goal of this project is to model the MPP with machine learning methods, for which large amounts of data are required. Since the data available to this master thesis stems from a single recording session (collected at KTH May 2014), a direct modeling of the MPP is unfeasible. As such, the thesis can instead be considered a pilot project which examines prerequisites for modeling of the MPP. The main aim of the thesis is to investigate how musical expression can be captured in the modeling of the MPP. Two experiments, as well as a theoretical investigation of the MPP, are performed to this end. The first experiment consists of an HMM classification of sound represented by expressive tone parameters extracted by the CUEX algorithm, and labeled by four emotions. This experiment complements the previous classification of conducting gesture in GP-LVM representation performed by Kelly Karipidou on the same data set. The result of the classification implicitly proves that expression has been transferred from conductor to musicians. As the first experiment considers expression over the musical piece as a whole, the second experiment investigates the transfer of expression from conductor to musician on a local level. To this end, local representations of the sound and conducting gesture are extracted, the separability of the four emotions is calculated for both representations by use of the Bhattacharyya distance, and the results are compared in search of correlation. Some indications of correlation between the representations of sound and gesture are found.
The conclusion is nevertheless that the utilized representations of conducting gesture do not capture musical expression to a sufficient extent.
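
To make the separability measure concrete, the sketch below computes the Bhattacharyya distance between two classes under the assumption that each class is summarised by a Gaussian with diagonal covariance. The thesis abstract does not state which distributional form was used, so the Gaussian assumption and the function name are illustrative only.

```python
import math


def bhattacharyya_diag_gauss(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two Gaussians with diagonal covariance.
    For diagonal covariances the distance is a sum of per-dimension terms:
    (m1 - m2)^2 / (4 (v1 + v2)) + 0.5 * ln((v1 + v2) / (2 sqrt(v1 v2)))."""
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        dist += (m1 - m2) ** 2 / (4.0 * (v1 + v2))
        dist += 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return dist


if __name__ == "__main__":
    # Two hypothetical emotion classes in a 2-D feature space.
    print(bhattacharyya_diag_gauss([0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [1.0, 1.0]))
```

A distance of zero means the two class models coincide; larger values indicate better separability.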

The qualitative structure of objects and their spatial distribution, to a large extent, define an indoor human environment scene. This paper presents an approach for indoor scene similarity measurement based on the spatial characteristics and arrangement of the objects in the scene. For this purpose, two main sets of spatial features are computed, from single objects and object pairs. A Gaussian Mixture Model is applied both on the single object features and the object pair features, to learn object class models and relationships of the object pairs, respectively. Given an unknown scene, the object classes are predicted using the probabilistic framework on the learned object class models. From the predicted object classes, object pair features are extracted. A final scene similarity score is obtained using the learned probabilistic models of object pair relationships. Our method is tested on a real world 3D database of desk scenes, using a leave-one-out cross-validation framework. To evaluate the effect of varying conditions on the scene similarity score, we apply our method on mock scenes, generated by removing objects of different categories in the test scenes.

In this work we summarize the solution developed by Team KTH for the Amazon Picking Challenge 2016 in Leipzig, Germany. The competition simulated a warehouse automation scenario and was divided into two tasks: a picking task, where a robot picks items from a shelf and places them in a tote, and a stowing task, which is the inverse task where the robot picks items from a tote and places them on a shelf. We describe our approach to the problem, starting from a high level overview of our system and later delving into details of our perception pipeline and our strategy for manipulation and grasping. The solution was implemented using a Baxter robot equipped with additional sensors.

We present a novel method for re-creating the static structure of cluttered office environments (which we define as the "meta-room") from multiple observations collected by an autonomous robot equipped with an RGB-D depth camera over extended periods of time. Our method works directly with point clusters by identifying what has changed from one observation to the next, removing the dynamic elements and at the same time adding previously occluded objects to reconstruct the underlying static structure as accurately as possible. The process of constructing the meta-rooms is iterative and it is designed to incorporate new data as it becomes available, as well as to be robust to environment changes. The latest estimate of the meta-room is used to differentiate and extract clusters of dynamic objects from observations. In addition, we present a method for re-identifying the extracted dynamic objects across observations, thus mapping their spatial behaviour over extended periods of time.

We present a novel method for clustering segmented dynamic parts of indoor RGB-D scenes across repeated observations by performing an analysis of their spatial-temporal distributions. We segment areas of interest in the scene using scene differencing for change detection. We extend the Meta-Room method and evaluate the performance on a complex dataset acquired autonomously by a mobile robot over a period of 30 days. We use an initial clustering method to group the segmented parts based on appearance and shape, and we further combine the clusters we obtain by analyzing their spatial-temporal behaviors. We show that using the spatial-temporal information further increases the matching accuracy.

The Connectivity Constrained UGV Surveillance Problem (CUSP) considered in this paper is the following. Given a set of surveillance UGVs and a user defined area to be covered, find waypoint-paths such that: 1) the area is completely surveyed, 2) the time for performing the search is minimized and 3) the induced information graph is kept recurrently connected. It has previously been shown that the CUSP is NP-hard. This paper presents four different heuristic algorithms for solving the CUSP, namely, the Token Station Algorithm, the Stacking Algorithm, the Visibility Graph Algorithm and the Connectivity Primitive Algorithm. These algorithms are then compared by means of Monte Carlo simulations. The conclusions drawn are that the Token Station Algorithm provides the solutions closest to optimal, the Stacking Algorithm has the lowest computational complexity, while the Connectivity Primitive Algorithm provides the best trade-off between optimality and computational complexity for larger problem instances.

This paper proposes an optimization based approach to multi-UGV surveillance. In particular, we formulate both the minimum time and the connectivity constrained surveillance problems, show that they are NP-hard and propose decomposition techniques that allow us to solve them efficiently in an algorithmic manner. The minimum time formulation is the following. Given a set of surveillance UGVs and a polyhedral area, find waypoint-paths for all UGVs such that every point of the area is visible from a point on a path and such that the time for executing the search in parallel is minimized. Here, the sensors' fields of view are assumed to be occluded by the obstacles and limited by a maximal sensor range. The connectivity constrained formulation extends the first by additionally requiring that the information graph induced by the sensors is connected at the time instants when the UGVs stop to perform the surveillance task. The second formulation is relevant to situations where mutual visibility is needed either to transmit the sensor data being gathered, or to protect the team from hostile persons trying to approach the stationary UGVs.
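
The connectivity requirement can be illustrated with a small check: given the positions at which the UGVs stop, build the information graph and test whether it is connected. In this sketch two UGVs are linked whenever they are within sensor range of each other; occlusion by obstacles, which the formulation above does account for, is deliberately ignored, and the function name is hypothetical.

```python
import math
from collections import deque


def is_connected(positions, sensor_range):
    """Breadth-first search over the graph whose nodes are UGV stop positions
    and whose edges link pairs within `sensor_range` (occlusion is ignored)."""
    n = len(positions)
    if n == 0:
        return True
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if j not in seen and math.dist(positions[i], positions[j]) <= sensor_range:
                seen.add(j)
                queue.append(j)
    return len(seen) == n


if __name__ == "__main__":
    print(is_connected([(0, 0), (1, 0), (2, 0)], 1.5))
    print(is_connected([(0, 0), (5, 0)], 1.5))
```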

Fusion of information from different complementary sources may be necessary to achieve a robust sensing system that degrades gracefully under various conditions. Many approaches use a specific tailor-made combination of algorithms that does not easily allow the inclusion of more, or other, types of algorithms. In this paper, we explore a variant of a generic algorithm for fusing visual cues, applied to the task of object segmentation in a video stream. The fusion algorithm combines the output of several segmentation algorithms in a straightforward way by using a Bayesian approach and a particle filter to track several hypotheses. Segmentation algorithms can be added or removed without changing the overall structure of the system. It was of particular interest to investigate whether the method was suitable when realistic real-world scenes with much noise were analysed. The system has been tested on image sequences taken from a moving vehicle, where stationary and moving objects are successfully segmented from the background. In conclusion, the fusion algorithm explored is well suited to this problem domain and is easily adopted. The context of this work is on-line pedestrian detection to be deployed in cars.

Segmentation of the scene is a fundamental component in computer vision to find regions of interest. Most systems that aspire to run in real-time use a fast segmentation stage that considers the whole image, and then a more costly stage for classification. In this paper we present a novel approach to segment moving objects from images taken with a moving camera. The segmentation algorithm is based on a special representation of optical flow, on which u-disparity is applied. The u-disparity is used to indirectly find and mask out the background flow in the image, by approximating it with a quadratic function. Robustness in the optical flow calculation is achieved by contrast content filtering. The algorithm successfully segments moving pedestrians from a moving vehicle with few false positive segments. Most false positive segments are due to poles and organic structures, such as trees. Such false positives are, however, easily rejected in a classification stage. The presented segmentation algorithm is intended to be used as a component in a detection/classification framework.
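
For readers unfamiliar with the u-disparity representation, the sketch below builds a column-wise histogram from a map of quantised values: entry `hist[d][u]` counts how many pixels in image column `u` take the value `d`. In the paper the input is a special representation of optical flow; here the input is simply assumed to be a 2D list of non-negative integers, and the masking of the background with a quadratic fit is not shown.

```python
def u_disparity(value_map, n_bins):
    """Column-wise histogram of quantised values.
    value_map: 2D list (rows x cols) of integers; values outside
    [0, n_bins) are ignored. Returns an n_bins x cols table."""
    cols = len(value_map[0])
    hist = [[0] * cols for _ in range(n_bins)]
    for row in value_map:
        for u, d in enumerate(row):
            if 0 <= d < n_bins:
                hist[d][u] += 1
    return hist


if __name__ == "__main__":
    toy_map = [[0, 1], [0, 2], [1, 2]]
    for line in u_disparity(toy_map, 3):
        print(line)
```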

A new mobile robot localization technique is presented which uses multiple Gaussian hypotheses to represent the probability distribution of the robot's location in the environment. A tree of hypotheses is built by the application of Bayes' rule with each new sensor measurement. However, such a tree can grow without bound, and so rules are introduced for the elimination of the least likely hypotheses from the tree and for the proper re-distribution of their probability. This technique is applied to a feature-based mobile robot localization scheme and experimental results are given demonstrating the effectiveness of the scheme.
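
The update-and-prune cycle can be sketched in one dimension as follows. Each hypothesis is a weighted Gaussian; a measurement reweights the hypotheses by Bayes' rule and refines each one with a Kalman update, after which unlikely hypotheses are dropped. Redistributing the removed probability by simple renormalisation, the threshold value and the 1-D state are all assumptions for this example; the paper's own elimination and redistribution rules are not reproduced here.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def update_hypotheses(hyps, z, meas_var, min_weight=0.01):
    """hyps: list of (weight, mean, var). z: direct position measurement
    with variance meas_var. Returns the reweighted, pruned hypothesis list."""
    updated = []
    for w, m, v in hyps:
        innov_var = v + meas_var
        lik = gaussian_pdf(z, m, innov_var)   # measurement likelihood
        gain = v / innov_var                  # Kalman gain
        updated.append((w * lik, m + gain * (z - m), (1.0 - gain) * v))
    total = sum(w for w, _, _ in updated)
    if total == 0.0:
        raise ValueError("measurement has zero likelihood under all hypotheses")
    updated = [(w / total, m, v) for w, m, v in updated]
    kept = [h for h in updated if h[0] >= min_weight]
    if not kept:  # always keep at least the most likely hypothesis
        kept = [max(updated, key=lambda h: h[0])]
    kept_total = sum(w for w, _, _ in kept)
    # Renormalising spreads the pruned probability over the survivors.
    return [(w / kept_total, m, v) for w, m, v in kept]


if __name__ == "__main__":
    print(update_hypotheses([(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)], 0.2, 1.0))
```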

Using geometric primitives to render live RGB-D data in the Oculus Rift

With advances in technology and lower prices, the use of RGB-D cameras for robot applications has become popular. They provide rich information about the environment, but because of the huge amount of data they produce and the often limited computational resources, processing and analysing the data is challenging. This creates the need for good and efficient compression methods.

In this thesis we suggest a lossy compression method that extracts planar surfaces from point cloud data, removes redundant interior points and stores the planes as a triangulation of the remaining points. The method can remove over 95% of input points for a given plane and can do so in real time, making it suitable for robotics applications. Despite the high compression ratio, the resulting compressed point cloud stays true to the original scene and is visually pleasing.
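
The idea of discarding interior points of a plane can be illustrated with a simplified sketch. It assumes the points of one detected plane have already been projected to 2D plane coordinates and keeps only their convex hull, which is then stored as a triangle fan. Using a convex hull is a simplification chosen for this example (real planar segments may be concave or contain holes), and it is not claimed to be the boundary extraction used in the thesis.

```python
def cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def fan_triangulation(hull):
    """Triangulate a convex polygon as a fan around its first vertex."""
    return [(hull[0], hull[i], hull[i + 1]) for i in range(1, len(hull) - 1)]


if __name__ == "__main__":
    grid = [(x, y) for x in range(5) for y in range(5)]  # 25 coplanar points
    hull = convex_hull(grid)
    print(len(grid), "points reduced to", len(hull), "in", len(fan_triangulation(hull)), "triangles")
```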

A mechanically scanned imaging sonar (MSIS) produces a 2D image of the range and bearing of return intensities. The pattern produced in this image depends on the environmental feature that caused it. These features are very useful for underwater navigation, but the inverse mapping of sonar image pattern to environmental feature can be ambiguous. We investigate problems associated with using MSIS for navigation. In particular, we show that support vector machines can be used to classify the existence and types of feature in a sonar image. We develop a sonar processing pipeline that can be used for navigation. This is tested on two sonar datasets collected from ROVs.

Robots are envisioned to take on jobs that are dirty, dangerous and dull, the three D's of robotics. With this mission, robotic technology today is ubiquitous on the factory floor. However, the same level of success has not occurred when it comes to robots that operate in everyday living spaces, such as homes and offices.

A big part of this is attributed to domestic environments being complex and unstructured as opposed to factory settings which can be set up and precisely known in advance. In this thesis we challenge the point of view which regards man-made environments as unstructured and that robots should operate without prior assumptions about the world. Instead, we argue that robots should make use of the inherent structure of everyday living spaces across various scales and applications, in the form of contextual and prior information, and that doing so can improve the performance of robotic tasks.

To investigate this premise, we start by attempting to solve a hard and realistic problem, active visual search. The particular scenario considered is that of a mobile robot tasked with finding an object on an entire unexplored building floor. We show that a search strategy which exploits the structure of indoor environments offers significant improvements over the state of the art and is comparable to humans in terms of search performance. Based on the work on active visual search, we present two specific ways of making use of the structure of space. First, we propose to use the local 3D geometry as a strong indicator of objects in indoor scenes. By learning a 3D context model for various object categories, we demonstrate a method that can reliably predict the location of objects. Second, we turn our attention to predicting what lies in the unexplored part of the environment at the scale of rooms and building floors. By analyzing a large dataset, we propose that indoor environments can be thought of as being composed of frequently occurring functional subparts. Utilizing these, we present a method that can make informed predictions about the unknown part of a given indoor environment.

The ideas presented in this thesis explore various sides of the same idea: modeling and exploiting the structure inherent in indoor environments for the sake of improving robot performance in various applications. We believe that in addition to contributing some answers, the work presented in this thesis will generate additional, fruitful questions.

In this paper we address the problem of simultaneous object class and pose estimation using nothing more than object class label measurements from a generic object classifier. We detail a method for designing a likelihood function over the robot configuration space. This function provides a likelihood measure of an object being of a certain class given that the robot (from some position) sees and recognizes an object as being of some (possibly different) class. Using this likelihood function in a recursive Bayesian framework allows us to achieve a kind of spatial averaging and determine the object pose (up to certain ambiguities to be made precise). We show how inter-class confusion from certain robot viewpoints can actually increase the ability to determine the object pose. Our approach is motivated by the idea of minimalistic sensing since we use only class label measurements albeit we attempt to estimate the object pose in addition to the class.

In this paper we present a principled planner based approach to the active visual object search problem in unknown environments. We make use of a hierarchical planner that combines the strengths of decision theory and heuristics. Furthermore, our object search approach leverages conceptual spatial knowledge in the form of object co-occurrences and semantic place categorisation. A hierarchical model for representing object locations is presented, with which the planner is able to perform indirect search. Finally, we present real world experiments to show the feasibility of the approach.

We present Kinect@Home, aimed at collecting a vast RGB-D dataset from real everyday living spaces. This dataset is planned to be the largest real world image collection of everyday environments to date, making use of the availability of a widely adopted robotics sensor which is also in the homes of millions of users, the Microsoft Kinect camera.

In this paper, we argue that there is a strong correlation between local 3D structure and object placement in everyday scenes. We call this the 3D context of the object. In previous work, this is typically hand-coded and limited to flat horizontal surfaces. In contrast, we propose to use a more general model for 3D context and learn the relationship between 3D context and different object classes. This way, we can capture more complex 3D contexts without implementing specialized routines. We present extensive experiments with both qualitative and quantitative evaluations of our method for different object classes. We show that our method can be used in conjunction with an object detection algorithm to reduce the rate of false positives. Our results support that the 3D structure surrounding objects in everyday scenes is a strong indicator of their placement and that it can give significant improvements in the performance of, for example, an object detection system. For evaluation, we have collected a large dataset of Microsoft Kinect frames from five different locations, which we also make publicly available.

Many robotics tasks require the robot to predict what lies in the unexplored part of the environment. Although much work focuses on building autonomous robots that operate indoors, indoor environments are neither well understood nor analyzed enough in the literature. In this paper, we propose and compare two methods for predicting both the topology and the categories of rooms given a partial map. The methods are motivated by the analysis of two large annotated floor plan data sets corresponding to the buildings of the MIT and KTH campuses. In particular, utilizing graph theory, we discover that local complexity remains unchanged for growing global complexity in real-world indoor environments, a property which we exploit. In total, we analyze 197 buildings, 940 floors and over 38,000 real-world rooms. Such a large set of indoor places has not been investigated in previous work. We provide extensive experimental results and show the degree of transferability of spatial knowledge between two geographically distinct locations. We also contribute the KTH data set and the software tools to work with it.

A significant amount of research in robotics is aimed towards building robots that operate indoors, yet there exists little analysis of how human spaces are organized. In this work we analyze the properties of indoor environments from a large annotated floorplan dataset. We analyze a corpus of 567 floors, 6426 spaces with 91 room types and 8446 connections between rooms corresponding to real places. We present a system that, given a partial graph, predicts the rest of the topology by building a model from this dataset. Our hypothesis is that indoor topologies consist of multiple smaller functional parts. We demonstrate the applicability of our approach with experimental results. We expect that our analysis paves the way for more data-driven research on indoor environments.
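To make the graph view of a floor plan concrete, here is a minimal sketch of how basic topology statistics of the kind quoted above (spaces, room types, connections) could be computed. The representation (a dict of room labels plus a list of undirected edges), the function name `topology_stats`, and the toy floor are assumptions made for illustration and are not taken from the dataset or its tools.

```python
def topology_stats(room_types, edges):
    """Summarize a floor-plan graph.

    room_types: dict mapping room id -> room type label
    edges:      list of (room id, room id) pairs, one per undirected connection
    """
    n_spaces = len(room_types)
    n_connections = len(edges)
    return {
        "spaces": n_spaces,
        "room_types": len(set(room_types.values())),
        "connections": n_connections,
        # Each undirected connection contributes to the degree of two rooms.
        "mean_degree": 2 * n_connections / n_spaces if n_spaces else 0.0,
    }


# Hypothetical toy floor: one corridor connected to three rooms.
rooms = {"r1": "corridor", "r2": "office", "r3": "office", "r4": "kitchen"}
doors = [("r1", "r2"), ("r1", "r3"), ("r1", "r4")]
stats = topology_stats(rooms, doors)
```

Statistics like these describe only the global shape of the graph; predicting the missing part of a partial graph, as described above, requires a model learned from many such graphs.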

In this paper, we study the problem of active visual search (AVS) in large, unknown, or partially known environments. We argue that by making use of uncertain semantics of the environment, a robot tasked with finding an object can devise efficient search strategies that can locate everyday objects at the scale of an entire building floor, which is previously unknown to the robot. To realize this, we present a probabilistic model of the search environment, which allows for prioritizing the search effort to those parts of the environment that are most promising for a specific object type. Further, we describe a method for reasoning about the unexplored part of the environment for goal-directed exploration with the purpose of object search. We demonstrate the validity of our approach by comparing it with two other search systems in terms of search trajectory length and time. First, we implement a greedy coverage-based search strategy that is found in previous work. Second, we let human participants search for objects as an alternative comparison for our method. Our results show that AVS strategies that exploit uncertain semantics of the environment are a very promising idea, and our method pushes the state-of-the-art forward in AVS.
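The idea of prioritizing search effort can be sketched with a simple greedy ranking. This is only an illustrative assumption, not the probabilistic model or planner of the paper: each candidate room is scored by the probability that the object is there divided by the cost of travelling to it. The function name `rank_rooms` and all numbers are hypothetical.

```python
def rank_rooms(p_object_in_room, travel_cost):
    """Order rooms by probability of containing the object per unit travel cost.

    p_object_in_room: dict mapping room -> P(object is in room)
    travel_cost:      dict mapping room -> positive cost of reaching the room
    """
    def utility(room):
        return p_object_in_room[room] / travel_cost[room]

    # Highest expected payoff per unit cost first.
    return sorted(p_object_in_room, key=utility, reverse=True)


# Hypothetical beliefs and costs for three rooms.
p = {"kitchen": 0.6, "office": 0.3, "corridor": 0.1}
cost = {"kitchen": 12.0, "office": 3.0, "corridor": 4.0}
order = rank_rooms(p, cost)
```

With these assumed values the nearby office is visited before the more probable but distant kitchen, which shows how probability and cost trade off in such a ranking.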

Objects are integral to a robot’s understanding of space. Various tasks such as semantic mapping, pick-and-carry missions or manipulation involve interaction with objects. Previous work in the field largely builds on the assumption that the object in question starts out within the ready sensory reach of the robot. In this work we aim to relax this assumption by providing the means to perform robust and large-scale active visual object search. Presenting spatial relations that describe topological relationships between objects, we then show how to use these to create potential search actions. We introduce a method for efficiently selecting search strategies given probabilities for those relations. Finally we perform experiments to verify the feasibility of our approach.

We present a method for utilising knowledge of qualitative spatial relations between objects in order to facilitate efficient visual search for those objects. A computational model for the relation is used to sample a probability distribution that guides the selection of camera views. Specifically we examine the spatial relation “on”, in the sense of physical support, and show its usefulness in search experiments on a real robot. We also experimentally compare different search strategies and verify the efficiency of so-called indirect search.

Two important components of a visual recognition system are representation and model. Both involve selecting and learning the features that are indicative for recognition and discarding those that are uninformative. This thesis, in its general form, proposes different techniques within the frameworks of two learning systems for representation and modeling, namely latent support vector machines (latent SVMs) and deep learning.

First, we propose various approaches to group the positive samples into clusters of visually similar instances. Given a fixed representation, the sampled space of the positive distribution is usually structured. The proposed clustering techniques include a novel similarity measure based on exemplar learning, an approach for using additional annotation, and an augmentation of latent SVM that automatically finds clusters whose members can be reliably distinguished from the background class.

In another effort, a strongly supervised DPM is suggested to study how these models can benefit from privileged information. The extra information comes in the form of semantic part annotations (i.e. their presence and location), which are used to constrain the DPM's latent variables during or prior to the optimization of the latent SVM. Its effectiveness is demonstrated on the task of animal detection.

Finally, we generalize the formulation of discriminative latent variable models, including DPMs, to incorporate a new set of latent variables representing the structure or properties of negative samples. We therefore term them negative latent variables. We show that this generalization affects state-of-the-art techniques and helps visual recognition by explicitly searching for counter-evidence of an object's presence.

Following the resurgence of deep networks, in the last works of this thesis we have focused on deep learning in order to produce a generic representation for visual recognition. A Convolutional Network (ConvNet) is trained on a large annotated image classification dataset called ImageNet with $\sim1.3$ million images. Then, the activations at each layer of the trained ConvNet can be treated as the representation of an input image. We show that such a representation is surprisingly effective for various recognition tasks, making it clearly superior to all the handcrafted features previously used in visual recognition (such as HOG in our first works on DPM). We further investigate the ways that one can improve this representation for a task in mind. We propose various factors, applied before or after the training of the representation, which can improve the efficacy of the ConvNet representation. These factors are analyzed on 16 datasets from various subfields of visual recognition.

Discriminative latent variable models (LVM) are frequently applied to various visual recognition tasks. In these systems the latent (hidden) variables provide a formalism for modeling structured variation of visual features. Conventionally, latent variables are defined on the variation of the foreground (positive) class. In this work we augment LVMs to include negative latent variables corresponding to the background class. We formalize the scoring function of such a generalized LVM (GLVM). Then we discuss a framework for learning a model based on the GLVM scoring function. We theoretically showcase how some of the current visual recognition methods can benefit from this generalization. Finally, we experiment on a generalized form of Deformable Part Models with negative latent variables and show significant improvements on two different detection tasks.

Computer vision tasks are traditionally defined and evaluated using semantic categories. However, it is known to the field that semantic classes do not necessarily correspond to a unique visual class (e.g. inside and outside of a car). Furthermore, many of the feasible learning techniques at hand cannot model a visual class which appears consistent to the human eye. These problems have motivated the use of 1) Unsupervised or supervised clustering as a preprocessing step to identify the visual subclasses to be used in a mixture-of-experts learning regime; 2) Felzenszwalb et al. part model and other works model mixture assignment with latent variables which is optimized during learning; 3) Highly non-linear classifiers which are inherently capable of modelling multi-modal input space but are inefficient at the test time. In this work, we promote an incremental view over the recognition of semantic classes with varied appearances. We propose an optimization technique which incrementally finds maximal visual subclasses in a regularized risk minimization framework. Our proposed approach unifies the clustering and classification steps in a single algorithm. The importance of this approach is its compliance with the classification via the fact that it does not need to know about the number of clusters, the representation and similarity measures used in pre-processing clustering methods a priori. Following this approach we show both qualitatively and quantitatively significant results. We show that the visual subclasses demonstrate a long tail distribution. Finally, we show that state of the art object detection methods (e.g. DPM) are unable to use the tails of this distribution comprising 50% of the training samples. In fact we show that DPM performance slightly increases on average by the removal of this half of the data.

Deformable part-based models [1, 2] achieve state-of-the-art performance for object detection, but rely on heuristic initialization during training due to the optimization of a non-convex cost function. This paper investigates limitations of such an initialization and extends earlier methods using additional supervision. We explore strong supervision in terms of annotated object parts and use it to (i) improve model initialization, (ii) optimize model structure, and (iii) handle partial occlusions. Our method is able to deal with sub-optimal and incomplete annotations of object parts and is shown to benefit from semi-supervised learning setups where part-level annotation is provided for a fraction of positive examples only. Experimental results are reported for the detection of six animal classes in PASCAL VOC 2007 and 2010 datasets. We demonstrate significant improvements in detection performance compared to the LSVM [1] and the Poselet [3] object detectors.

Evidence is mounting that ConvNets are the best representation learning method for recognition. In the common scenario, a ConvNet is trained on a large labeled dataset and the activation of the feed-forward units, at a certain layer of the network, is used as a generic representation of an input image. Recent studies have shown this form of representation to be astoundingly effective for a wide range of recognition tasks. This paper thoroughly investigates the transferability of such representations w.r.t. several factors. These include parameters for training the network, such as its architecture, and parameters of feature extraction. We further show that different visual recognition tasks can be categorically ordered based on their distance from the source task. We then show interesting results indicating a clear correlation between the performance of tasks and their distance from the source task conditioned on proposed factors. Furthermore, by optimizing these factors, we achieve state-of-the-art performances on 16 visual recognition tasks.

Evidence is mounting that Convolutional Networks (ConvNets) are the most effective representation learning method for visual recognition tasks. In the common scenario, a ConvNet is trained on a large labeled dataset (source) and the activation of the feed-forward units of the trained network, at a certain layer of the network, is used as a generic representation of an input image for a task with a relatively smaller training set (target). Recent studies have shown this form of representation transfer to be suitable for a wide range of target visual recognition tasks. This paper introduces and investigates several factors affecting the transferability of such representations. These include parameters for training the source ConvNet, such as its architecture and the distribution of the training data, and also parameters of feature extraction, such as the layer of the trained ConvNet and dimensionality reduction. Then, by optimizing these factors, we show that significant improvements can be achieved on various (17) visual recognition tasks. We further show that these visual recognition tasks can be categorically ordered based on their similarity to the source task such that a correlation between the performance of tasks and their similarity to the source task w.r.t. the proposed factors is observed.

Kernel methods have been used very successfully to classify data in various application domains. Traditionally, kernels have been constructed mainly for vectorial data defined on a specific vector space. Much less work has addressed the development of kernel functions for non-vectorial data. In this paper, we present a new kernel for encoding sequential data. We present our results comparing the proposed kernel to the state of the art, showing a significant improvement in classification and a much improved robustness and interpretability.

We define a novel kernel function for finite sequences of arbitrary length which we call the path kernel. We evaluate this kernel in a classification scenario using synthetic data sequences and show that our kernel can outperform state of the art sequential similarity measures. Furthermore, we find that, in our experiments, a clustering of data based on the path kernel results in much improved interpretability of such clusters compared to alternative approaches such as dynamic time warping or the global alignment kernel.
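For orientation, the dynamic time warping baseline named above can be sketched as follows. This is the textbook dynamic-programming formulation for scalar sequences with absolute difference as the local cost; it is a distance, not the path kernel itself, and the function name `dtw_distance` is chosen here for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

The first pair differs only by a repeated sample, so the warping absorbs the difference, while the second pair differs in value at every aligned step.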

Graph Networks are used to make decisions in potentially complex scenarios but it is usually not obvious how or why they made them. In this work, we study the explainability of Graph Network decisions using two main classes of techniques, gradient-based and decomposition-based, on a toy dataset and a chemistry task. Our study sets the ground for future development as well as application to real-world problems.

We present two approaches to modeling affordance relations between objects, actions and effects. The first is a probabilistic approach which uses a voting function to learn which objects afford which types of grasps. We compare the success rate of this approach to that of a second approach, which uses an ontological reasoning engine for learning affordances. This second approach employs a rule-based system with axioms to reason on grasp selection for a given object.

In this paper, we present a new adaptation of the regular polygon detection algorithm for real-time road sign detection for autonomous vehicles. The method is robust to partial occlusion and fading, and insensitive to lighting conditions. We experimentally demonstrate its application to the detection of various signs, particularly evaluating it on a sequence of roundabout signs taken from the ANU/NICTA vehicle. The algorithm runs faster than 20 frames per second on a standard PC, detecting signs of the size that appear in road scenes, as observed from a camera mounted on the rear-vision mirror. The algorithm exploits the symmetric nature of regular polygonal shapes; we also use the constrained appearance of such shapes in the road scene, as seen from the car, to facilitate their fast, robust detection.