From September 1st, Jeannette Bohg will be Assistant Professor at Stanford. She will remain affiliated as a guest researcher with the MPI. Her research focuses on perception for autonomous robotic manipulation and grasping. She is specifically interesting in developing methods that are goal-directed, real-time and multi-modal such that they can provide meaningful feedback for execution and learning.

Before joining the Autonomous Motion lab in January 2012, Jeannette Bohg was a PhD student at the Computer Vision and Active Perception lab (CVAP) at KTH in Stockholm. Her thesis on Multi-modal scene understanding for Robotic Grasping was performed under the supervision of Prof. Danica Kragic. She studied at Chalmers in Gothenburg and at the Technical University in Dresden where she received her Master in Art and Technology and her Diploma in Computer Science, respectively.

Real-Time Perception meet Reactive Motion Generation

This video shows the performance of our fully integrated manipulation system that consumes continuous visual feedback on the environment and thereby adapts motion plans online.

Dual Execution of Optimized Contact Interaction Trajectories

This video showcases a method which optimizes trajectories that are in contact with the environment to exploit these constraints for more robust reaching of a given target. It re-plans these trajectories online using force feedback.

This video showcases a novel variant of the particle filter in which re-sampling is not only done each time-step but also for each dimension of the state vector. Compared to a standard particle filter it yields more robust results the higher the dimension of the state.

Probabilistic Object Tracking using a Range Camera

We show a visual object tracking method that is specifically well suited for tracking during manipulation tasks. These are situations that are characterized by strong occlusions of the object and are therefore challenging for standard tracking methods. By explicitly modelling these occlusions and some clever factorization of the resulting system state, we developed a methods that has been successfully applied in the Phase 2 of the DARPA ARM Challenge.

Robot Arm Pose Estimation through Pixel-Wise Part Classification

Hand-eye coordination is a notorious problem for many robot systems. Even offline calibration may not always provide the final solution or sufficient accuracy for fine manipulation tasks or just simple precision grasps. Being able to estimate the arm pose directly in the image would provide a solution to this problem by providing hand-eye coordination online.
We present a frame-by-frame technique for arm pose estimation that does not require an initialization or any information from sensors other than an RGB-D camera.

Task-based Grasp Adaptation

We present the integration of many different modules to allow a robot to infer a task-relevant grasp for a perceived object. It relies on learning techniques to determine the category of an object and the associated grasp given a task. The system is demonstrated on the humanoid robot Armar IIIa at the Karlsruhe Institute for Technology.

Autonomous systems such as humanoid robots are characterized by a multitude of feedback control loops operating at different hierarchical levels and time-scales. Designing and tuning these controllers typically requires significant manual modeling and design effort and exhaustive experimental testing. For managing the ever greater c...

We have developed algorithms which enable an autonomous manipulation system to grasp a wide range of objects and to perform a certain number of manipulation tasks, such as drilling, using a stapler, unlocking a door with a key or changing a tire \cite{Righetti_AR_2013}. More generally, we are interested in providing complete integra...

Decision making requires knowledge of some variables of interest. In the vast majority of real-world problems, these variables are latent, i.e. they cannot be observed directly and must be inferred from available measurements. To maintain an up-to-date distribution over the latent variables, past beliefs have to ...

Motion generation is increasingly formalized as a large scale optimization over future outcomes of actions. For high dimensional manipulation platforms, this optimization is computationally so difficult that for a long time traditional approaches focused primarily on feasibility of the solution rather than even local optimality. Rec...

Hand-eye coordination is crucial for capable manipulation of objects. It requires to know the manipulator's and the objects' locations. These locations have to be inferred from sensory data. In this project we work with range sensors, which are wide spread in robotics and provide dense depth images.

Autonomous robotic grasping is one of the pre-requisites for personal robots to become useful when assisting humans in households. Seamlessly easy for humans, it still remains a very challenging task for robots. The key problem of robotic grasping is to automatically choose an appropriate grasp configuration given an object as perce...

Searching for a given target object in a scene not only requires detecting the target object if it is visible, but also to identify promising locations for search if not. The quantity that measures how interesting an image region is given a certain task is called top-down saliency.

While there exist many solutions and criteria for selecting the best grasp for an object of known shape, grasping an object whose shape is uncertain and noise remains a challenge. In this project, we consider the problem of object shape estimation when it is only partially observable. Once we have a prediction, we can apply the crit...

Data-driven methods towards grasping address the challenges of grasp synthesis that arise in the real world such as noisy sensors and incomplete information about the objects and the environment. They focus on finding a suitable representation of the perceptual data that allows to predict whether a certain grasp will succeed.

Purposeful and robust manipulation requires a good hand-eye coordination. To a certain extend this can be achieved using information from joint encoders and known kinematics. However, for many robots a significant error in the pose of the end-effector and fingers of several centimeters remains. Especially for fine manipulation tasks...

There is compelling evidence that perception in humans and animals is an active and exploratory process. For example, Gibson showed that physical interaction further augments perceptual processing beyond what can be achieved by just looking at the environment. In the specific experiment, human subjects had to find a reference object...

We address the challenging problem of robotic grasping and manipulation in the presence of uncertainty. This uncertainty is due to noisy sensing, inaccurate models and hard-to-predict environment dynamics. Our approach emphasizes the importance of continuous, real-time perception and its tight integration with reactive motion genera...

2017

In Proceedings of the IEEE/RSJ International Conference of Intelligent Robots and Systems, September 2017 (inproceedings)Accepted

Abstract

We aim to reliably predict whether a grasp on a known object is successful before it is executed in the real world. There is an entire suite of grasp metrics that has already been developed which rely on precisely known contact points between object and hand. However, it remains unclear whether and how they may be combined into a general purpose grasp stability predictor. In this paper, we analyze these questions by leveraging a large scale database of simulated grasps on a wide variety of objects. For each grasp, we compute the value of seven metrics. Each grasp is annotated by human subjects with ground truth stability labels. Given this data set, we train several classification methods to find out whether there is some underlying, non-trivial structure in the data that is difficult to model manually but can be learned. Quantitative and qualitative results show the complexity of the prediction problem. We found that a good prediction performance critically depends on using a combination of metrics as input features. Furthermore, non-parametric and non-linear classifiers best capture the structure in the data.

We address the challenging problem of robotic grasping and manipulation in the presence of uncertainty. This uncertainty is due to noisy sensing, inaccurate models and hard-to-predict environment dynamics. Our approach emphasizes the importance of continuous, real-time perception and its tight integration with reactive motion generation methods. We present a fully integrated system where real-time object and robot tracking as well as ambient world modeling provides the necessary input to feedback controllers and continuous motion optimizers. Specifically, they provide attractive and repulsive potentials based on which the controllers and motion optimizer can online compute movement policies at different time intervals. We extensively evaluate the proposed system on a real robotic platform in four scenarios that exhibit either challenging workspace geometry or a dynamic environment. We compare the proposed integrated system with a more traditional sense-plan-act approach that is still widely used. In 333 experiments, we show the robustness and accuracy of the proposed system.

We propose a probabilistic filtering method which fuses joint measurements with depth images to yield a precise, real-time estimate of the end-effector pose in the camera frame. This avoids the need for frame transformations when using it in combination with visual object tracking methods.
Precision is achieved by modeling and correcting biases in the joint measurements as well as inaccuracies in the robot model, such as poor extrinsic camera calibration. We make our method computationally efficient through a principled combination of Kalman filtering of the joint measurements and asynchronous depth-image updates based on the Coordinate Particle Filter.
We quantitatively evaluate our approach on a dataset recorded from a real robotic platform, annotated with ground truth from a motion capture system. We show that our approach is robust and accurate even under challenging conditions such as fast motion, significant and long-term occlusions, and time-varying biases. We release the dataset along with open-source code of our approach to allow for quantitative comparison with alternative approaches.

Recent approaches in robotics follow the insight that perception is facilitated by interactivity with the environment. These approaches are subsumed under the term of Interactive Perception (IP). We argue that IP provides the following benefits: (i) any type of forceful interaction with the environment creates a new type of informative sensory signal that would otherwise not be present and (ii) any prior knowledge about the nature of the interaction supports the interpretation of the signal. This is facilitated by knowledge of the regularity in the combined space of sensory information and action parameters. The goal of this survey is to postulate this as a principle and collect evidence in support by analyzing and categorizing existing work in this area. We also provide an overview of the most important applications of Interactive Perception. We close this survey by discussing the remaining open questions. Thereby, we hope to define a field and inspire future work.

Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems, IEEE, IROS, October 2016 (conference)

Abstract

One of the central tasks for a household robot is searching for specific objects. It does not only require localizing the target object but also identifying promising search locations in the scene if the target is not immediately visible. As computation time and hardware resources are usually limited in robotics, it is desirable to avoid expensive visual processing steps that are exhaustively applied over the entire image. The human visual system can quickly select those image locations that have to be processed in detail for a given task. This allows us to cope with huge amounts of information and to efficiently deploy the limited capacities of our visual system. In this paper, we therefore propose to use human fixation data to train a top-down saliency model that predicts relevant image locations when searching for specific objects. We show that the learned model can successfully prune bounding box proposals without rejecting the ground truth object locations. In this aspect, the proposed model outperforms a model that is trained only on the ground truth segmentations of the target object instead of fixation data.

An action-oriented perspective changes the role of an individual from a passive observer to an actively engaged agent interacting in a closed loop with the world as well as with others. Cognition exists to serve action within a landscape that contains both. This chapter surveys this landscape and addresses the status of the pragmatic turn. Its potential influence on science and the study of cognition are considered (including perception, social cognition, social interaction, sensorimotor entrainment, and language acquisition) and its impact on how neuroscience is studied is also investigated (with the notion that brains do not passively build models, but instead support the guidance of action).
A review of its implications in robotics and engineering includes a discussion of the application of enactive control principles to couple action and perception in robotics as well as the conceptualization of system design in a more holistic, less modular manner. Practical applications that can impact the human condition are reviewed (e.g. educational applications, treatment possibilities for developmental and psychopathological disorders, the development of neural prostheses). All of this foreshadows the potential societal implications of the pragmatic turn. The chapter concludes that an action-oriented approach emphasizes a continuum of interaction between technical aspects of cognitive systems and robotics, biology, psychology, the social sciences, and the humanities, where the individual is part of a grounded cultural system.

Since the 1950s, robotics research has sought to build a general-purpose agent capable of autonomous, open-ended interaction with realistic, unconstrained environments. Cognition is perceived to be at the core of this process, yet understanding has been challenged because cognition is referred to differently within and across research areas, and is not clearly defined. The classic robotics approach is decomposition into functional modules which perform planning, reasoning, and problem-solving or provide input to these mechanisms. Although advancements have been made and numerous success stories reported in specific niches, this systems-engineering approach has not succeeded in building such a cognitive agent.
The emergence of an action-oriented paradigm offers a new approach: action and perception are no longer separable into functional modules but must be considered in a complete loop. This chapter reviews work on different mechanisms for action- perception learning and discusses the role of embodiment in the design of the underlying representations and learning. It discusses the evaluation of agents and suggests the development of a new embodied Turing Test. Appropriate scenarios need to be devised in addition to current competitions, so that abilities can be tested over long time periods.

In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2016, IEEE, IEEE International Conference on Robotics and Automation, May 2016 (inproceedings)

Abstract

To achieve accurate vision-based control with a robotic arm, a good hand-eye coordination is required. However, knowing the current configuration of the arm can be very difficult due to noisy readings from joint encoders or an inaccurate hand-eye calibration. We propose an approach for robot arm pose estimation that uses depth images of the arm as input to directly estimate angular joint positions. This is a frame-by-frame method which does not rely on good initialisation of the solution from the previous frames or knowledge from the joint encoders. For estimation, we employ a random regression forest which is trained on synthetically generated data. We compare different training objectives of the forest and also analyse the influence of prior segmentation of the arms on accuracy. We show that this approach improves previous work both in terms of computational complexity and accuracy. Despite being trained on synthetic data only, we demonstrate that the estimation also works on real depth images.

In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2016, IEEE, IEEE International Conference on Robotics and Automation, May 2016 (inproceedings)

Abstract

In this paper, we consider the problem of robotic grasping of objects when only partial and noisy sensor data of the environment is available. We are specifically interested in the problem of reliably selecting the best hypothesis from a whole set. This is commonly the case when trying to grasp an object for which we can only observe a partial point cloud from one viewpoint through noisy sensors. There will be many possible ways to successfully grasp this object, and even more which will fail. We propose a supervised learning method that is trained with a ranking loss. This explicitly encourages that the top-ranked training grasp in a hypothesis set is also positively labeled. We show how we adapt the standard ranking loss to work with data that has binary labels and explain the benefits of this formulation. Additionally, we show how we can efficiently optimize this loss with stochastic gradient descent. In quantitative experiments, we show that we can outperform previous models by a large margin.

In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2016, IEEE, IEEE International Conference on Robotics and Automation, May 2016 (inproceedings)

Abstract

We propose a novel method that enables a robot to identify a graspable object part of an unknown object given only noisy and partial information that is obtained from an RGB-D camera. Our method combines the benefits of local with the advantages of global methods. It learns a classifier that takes a local shape representation as input and outputs the probability that a grasp applied at this location will be successful. Given a query data point that is classified in this way, we can retrieve all the locally similar training data points and use them to predict latent global object shape. This information may help to further prune positively labeled grasp hypotheses based on, e.g. relation to the predicted average global shape or suitability for a specific task. This prediction can also guide scene exploration to prune object shape hypotheses.
To learn the function that maps local shape to grasp stability we use a Random Forest Classifier. We show that our method reaches the same classification performance as the current state-of-the-art on this dataset which uses a Convolutional Neural Network. Additionally, we exploit the natural ability of the Random Forest to cluster similar data. For a positively predicted query data point, we retrieve all the locally similar training data points that are associated with the same leaf nodes of the Random Forest. The main insight from this work is that local object shape that affords a grasp is also a good predictor of global object shape. We empirically support this claim with quantitative experiments. Additionally, we demonstrate the predictive capability of the method on some real data examples.

In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), IEEE International Conference on Robotics and Automation, May 2016 (inproceedings)

Abstract

This paper proposes an automatic controller tuning framework based on linear optimal control combined with Bayesian optimization. With this framework, an initial set of controller gains is automatically improved according to a pre-defined performance objective evaluated from experimental data. The underlying Bayesian optimization algorithm is Entropy Search, which represents the latent objective as a Gaussian process and constructs an explicit belief over the location of the objective minimum. This is used to maximize the information gain from each experimental evaluation. Thus, this framework shall yield improved controllers with fewer evaluations compared to alternative approaches. A seven-degree- of-freedom robot arm balancing an inverted pole is used as the experimental demonstrator. Results of a two- and four- dimensional tuning problems highlight the method’s potential for automatic controller tuning on robotic platforms.

In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2016, IEEE, IEEE International Conference on Robotics and Automation, May 2016 (inproceedings)

Abstract

We consider the problem of model-based 3D- tracking of objects given dense depth images as input. Two difficulties preclude the application of a standard Gaussian filter to this problem. First of all, depth sensors are characterized by fat-tailed measurement noise. To address this issue, we show how a recently published robustification method for Gaussian filters can be applied to the problem at hand. Thereby, we avoid using heuristic outlier detection methods that simply reject measurements if they do not match the model. Secondly, the computational cost of the standard Gaussian filter is prohibitive due to the high-dimensional measurement, i.e. the depth image. To address this problem, we propose an approximation to reduce the computational complexity of the filter. In quantitative experiments on real data we show how our method clearly outperforms the standard Gaussian filter. Furthermore, we compare its performance to a particle-filter-based tracking method, and observe comparable computational efficiency and improved accuracy and smoothness of the estimates.

In Proceedings of the American Control Conference, Boston, MA, USA, July 2016 (inproceedings)

Abstract

Most widely-used state estimation algorithms, such as the Extended Kalman Filter and the Unscented Kalman Filter, belong to the family of Gaussian Filters (GF). Unfortunately, GFs fail if the measurement process is modelled by a fat-tailed distribution. This is a severe limitation, because thin-tailed measurement models, such as the analytically-convenient and therefore widely-used Gaussian distribution, are sensitive to outliers. In this paper, we show that mapping the measurements into a specific feature space enables any existing GF algorithm to work with fat-tailed measurement models. We find a feature function which is optimal under certain conditions. Simulation results show that the proposed method allows for robust filtering in both linear and nonlinear systems with measurements contaminated by fat-tailed noise.

2015

Machine Learning in Planning and Control of Robot Motion Workshop at the IEEE/RSJ International Conference on Intelligent Robots and Systems, September 2015 (conference)

Abstract

This paper proposes an automatic controller tuning framework based on linear optimal control combined with Bayesian optimization. With this framework, an initial set of controller gains is automatically improved according to a pre-defined performance objective evaluated from experimental data. The underlying Bayesian optimization algorithm is Entropy Search, which represents the latent objective as a Gaussian process and constructs an explicit belief over the location of the objective minimum. This is used to maximize the information gain from each experimental evaluation. Thus, this framework shall yield improved controllers with fewer evaluations compared to alternative approaches. A seven-degree-of-freedom robot arm balancing an inverted pole is used as the experimental demonstrator. Preliminary results of a low-dimensional tuning problem highlight the method’s potential for automatic controller tuning on robotic platforms.

Inverse Optimal Control (IOC) has strongly impacted the systems engineering process, enabling automated planner tuning through straightforward and intuitive demonstration. The most successful and established applications, though, have been in lower dimensional problems such as navigation planning where exact optimal planning or control is feasible. In higher dimensional systems, such as humanoid robots, research has made substantial progress toward generalizing the ideas to model free or locally optimal settings, but these systems are complicated to the point where demonstration itself can be difficult. Typically, real-world applications are restricted to at best noisy or even partial or incomplete demonstrations that prove cumbersome in existing frameworks. This work derives a very flexible method of IOC based on a form of Structured Prediction known as Direct Loss Minimization. The resulting algorithm is essentially Policy Search on a reward function that rewards similarity to demonstrated behavior (using Covariance Matrix Adaptation (CMA) in our experiments). Our framework blurs the distinction between IOC, other forms of Imitation Learning, and Reinforcement Learning, enabling us to derive simple, versatile, and practical algorithms that blend imitation and reinforcement signals into a unified framework. Our experiments analyze various aspects of its performance and demonstrate its efficacy on conveying preferences for motion shaping and combined reach and grasp quality optimization.

In Proceedings of the IEEE International Conference on Robotics and Automation, May 2015 (inproceedings)

Abstract

We propose a new large-scale database containing grasps that are applied to a large set of objects from numerous categories. These grasps are generated in simulation and are annotated with different grasp stability metrics. We use a descriptive and efficient representation of the local object shape at which each grasp is applied. Given this data, we present a two-fold analysis: (i) We use crowdsourcing to analyze the correlation of the metrics with grasp success as predicted by humans. The results show that the metric based on physics simulation is a more consistent predictor for grasp success than the standard Îµ-metric. The results also support the hypothesis that human labels are not required for good ground truth grasp data. Instead the physics-metric can be used to generate datasets in simulation that may then be used to bootstrap learning in the real world. (ii) We apply a deep learning method and show that it can better leverage the large-scale database for prediction of grasp success compared to logistic regression. Furthermore, the results suggest that labels based on the physics-metric are less noisy than those from the Îµ-metric and therefore lead to a better classification performance.

For robots to be able to manipulate in unknown and unstructured environments the robot should be capable of operating under partial observability of the environment. Object occlusions and unmodeled environments are some of the factors that result in partial observability. A common scenario where this is encountered is manipulation in clutter. In the case that the robot needs to locate an object of interest and manipulate it, it needs to perform a series of decluttering actions to accurately detect the object of interest. To perform such a series of actions, the robot also needs to account for the dynamics of objects in the environment and how they react to contact. This is a non trivial problem since one needs to reason not only about robot-object interactions but also object-object interactions in the presence of contact. In the example scenario of manipulation in clutter, the state vector would have to account for the pose of the object of interest and the structure of the surrounding environment. The process model would have to account for all the aforementioned robot-object, object-object interactions. The complexity of the process model grows exponentially as the number of objects in the scene increases. This is commonly the case in unstructured environments. Hence it is not reasonable to attempt to model all object-object and robot-object interactions explicitly. Under this setting we propose a hypothesis based action selection algorithm where we construct a hypothesis set of the possible poses of an object of interest given the current evidence in the scene and select actions based on our current set of hypothesis. This hypothesis set tends to represent the belief about the structure of the environment and the number of poses the object of interest can take. The agent's only stopping criterion is when the uncertainty regarding the pose of the object is fully resolved.

In Proceedings of the IEEE International Conference on Robotics and Automation, May 2015 (inproceedings)

Abstract

Parametric filters, such as the Extended Kalman Filter and the Unscented Kalman Filter, typically scale well with the dimensionality of the problem, but they are known to fail if the posterior state distribution cannot be closely approximated by a density of the assumed parametric form.
For nonparametric filters, such as the Particle Filter, the converse holds. Such methods are able to approximate any posterior, but the computational requirements scale exponentially with the number of dimensions of the state space. In this paper, we present the Coordinate Particle Filter which alleviates this problem. We propose to compute the particle weights recursively, dimension by dimension. This allows us to explore one dimension at a time, and resample after each dimension if necessary.
Experimental results on simulated as well as real data con- firm that the proposed method has a substantial performance advantage over the Particle Filter in high-dimensional systems where not all dimensions are highly correlated. We demonstrate the benefits of the proposed method for the problem of multi-object and robotic manipulator tracking.

2014

In IEEE International Conference on Robotics and Automation (ICRA) 2014, pages: 3143-3150, IEEE International Conference on Robotics and Automation (ICRA), June 2014 (inproceedings)

Abstract

We propose to frame the problem of marker-less robot arm pose estimation as a pixel-wise part classification problem. As input, we use a depth image in which each pixel is classified to be either from a particular robot part or the background. The classifier is a random decision forest trained on a large number of synthetically generated and labeled depth images. From all the training samples ending up at a leaf node, a set of offsets is learned that votes for relative joint positions. Pooling these votes over all foreground pixels and subsequent clustering gives us an estimate of the true joint positions. Due to the intrinsic parallelism of pixel-wise classification, this approach can run in super real-time and is more efficient than previous ICP-like methods. We quantitatively evaluate the accuracy of this approach on synthetic data. We also demonstrate that the method produces accurate joint estimates on real data despite being purely trained on synthetic data.

The ability to grasp unknown objects still remains an unsolved problem in the robotics community. One of the challenges is to choose an appropriate grasp configu- ration, i.e., the 6D pose of the hand relative to the object and its finger configuration. In this paper, we introduce an algo- rithm that is based on the assumption that similarly shaped objects can be grasped in a similar way. It is able to synthe- size good grasp poses for unknown objects by finding the best matching object shape templates associated with previously demonstrated grasps. The grasp selection algorithm is able to improve over time by using the information of previous grasp attempts to adapt the ranking of the templates to new situa- tions. We tested our approach on two different platforms, the Willow Garage PR2 and the Barrett WAM robot, which have very different hand kinematics. Furthermore, we compared our algorithm with other grasp planners and demonstrated its superior performance. The results presented in this paper show that the algorithm is able to find good grasp configura- tions for a large set of unknown objects from a relatively small set of demonstrations, and does improve its performance over time.

In Proceedings of the International Conference on Intelligent Robots and Systems, Chicago, IL, October 2014 (inproceedings)

Abstract

Efficient manipulation requires contact to reduce uncertainty. The manipulation literature refers to this as funneling: a methodology for increasing reliability and robustness by leveraging haptic feedback and control of environmental interaction. However, there is a fundamental gap between traditional approaches to trajectory optimization and this concept of robustness by funneling: traditional trajectory optimizers do not discover force feedback strategies. From a POMDP perspective, these behaviors could be regarded as explicit obser- vation actions planned to sufficiently reduce uncertainty thereby enabling a task. While we are sympathetic to the full POMDP view, solving full continuous-space POMDPs in high-dimensions is hard. In this paper, we propose an alternative approach in which trajectory optimization objectives are augmented with new terms that reward uncertainty reduction through contacts, explicitly promoting funneling. This augmentation shifts the responsibility of robustness toward the actual execution of the optimized trajectories. Directly tracing trajectories through configuration space would lose all robustnessâ??dual execution achieves robustness by devising force controllers to reproduce the temporal interaction profile encoded in the dual solution of the optimization problem. This work introduces dual execution in depth and analyze its performance through robustness experiments in both simulation and on a real-world robotic platform.

We review the work on data-driven grasp synthesis and the
methodologies for sampling and ranking candidate grasps. We
divide the approaches into three groups based on whether they
synthesize grasps for known, familiar or unknown objects. This
structure allows us to identify common object representations and
perceptual processes that facilitate the employed data-driven grasp
synthesis technique.
In the case of known objects, we concentrate on the approaches that
are based on object recognition and pose estimation. In the case of
familiar objects, the techniques use some form of a
similarity matching to a set of previously encountered objects.
Finally for the approaches dealing with unknown objects, the core part
is the extraction of specific features that are indicative of good
grasps.
Our survey provides an overview of the different methodologies and
discusses open problems in the area of robot grasping. We also draw a
parallel to the classical approaches that rely on analytic
formulations.

In IEEE International Conference on Robotics and Automation (ICRA), pages: 3547-3554, 2013 (inproceedings)

Abstract

In this work, we propose to reconstruct a complete 3-D model of an unknown object by fusion of visual and tactile information while the object is grasped. Assuming the object is symmetric, a first hypothesis of its complete 3-D shape is generated from a single view. This initial model is used to plan a grasp on the object which is then executed with a robotic manipulator equipped with tactile sensors. Given the detected contacts between the fingers and the object, the full object model including the symmetry parameters can be refined. This refined model will then allow the planning of more complex manipulation tasks. The main contribution of this work is an optimal estimation approach for the fusion of visual and tactile data applying the constraint of object symmetry. The fusion is formulated as a state estimation problem and solved with an iterative extended Kalman filter. The approach is validated experimentally using both artificial and real data from two different robotic platforms.

In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages: 3195-3202, IEEE, November 2013 (inproceedings)

Abstract

We address the problem of tracking the 6-DoF pose of an object while it is being manipulated by a human or a robot. We use a dynamic Bayesian network to perform inference and compute a posterior distribution over the current object pose. Depending on whether a robot or a human manipulates the object, we employ a process model with or without knowledge of control inputs. Observations are obtained from a range camera. As opposed to previous object tracking methods, we explicitly model self-occlusions and occlusions from the environment, e.g, the human or robotic hand. This leads to a strongly non-linear observation model and additional dependencies in the Bayesian network. We employ a Rao-Blackwellised particle filter to compute an estimate of the object pose at every time step. In a set of experiments, we demonstrate the ability of our method to accurately and robustly track the object pose in real-time while it is being manipulated by a human or a robot.

The International Journal of Robotics Research, 33(2):321-341, Sage, October 2013 (article)

Abstract

In this work, we propose to reconstruct a complete 3-D model of an unknown object
by fusion of visual and tactile information while the object is grasped. Assuming the
object is symmetric, a first hypothesis of its complete 3-D shape is generated. A grasp
is executed on the object with a robotic manipulator equipped with tactile sensors.
Given the detected contacts between the fingers and the object, the initial full object
model including the symmetry parameters can be refined. This refined model will then
allow the planning of more complex manipulation tasks.
The main contribution of this work is an optimal estimation approach for the fusion of
visual and tactile data applying the constraint of object symmetry. The fusion is
formulated as a state estimation problem and solved with an iterative extended
Kalman filter. The approach is validated experimentally using both artificial and real
data from two different robotic platforms.

2012

In IEEE International Workshop on Haptic Audio Visual Environments and Games (HAVE), pages: 25-31, October 2012 (inproceedings)

Abstract

In this paper, we address some of the challenges that arise as model-mediated teleoperation is applied to systems with multiple degrees of freedom and multiple sensors. Specifically we use a system with position, force, and vision sensors to explore an environment geometry in two degrees of freedom. The inclusion of vision is proposed to alleviate the difficulties of estimating an increasing number of environment properties. Vision can furthermore increase the predictive nature of model-mediated teleoperation, by effectively predicting touch feedback before the slave is even in contact with the environment. We focus on the case of estimating the location and orientation of a local surface patch at the contact point between the slave and the environment. We describe the various information sources with their respective limitations and create a combined model estimator as part of a multi-d.o.f. model-mediated controller. An experiment demonstrates the feasibility and benefits of utilizing vision sensors in teleoperation.

In this paper, we present an approach towards autonomous grasping of objects according to their category and a given task. Recent advances in the field of object segmentation and categorization as well as task-based grasp inference have been leveraged by integrating them into one pipeline. This allows us to transfer task-specific grasp experience between objects of the same category. The effectiveness of the approach is demonstrated on the humanoid robot ARMAR-IIIa.

We study visual servoing in a framework of detection and grasping of unknown objects. Classically, visual servoing has been used for applications where the object to be servoed on is known to the robot prior to the task execution. In addition, most of the methods concentrate on aligning the robot hand with the object without grasping it. In our work, visual servoing techniques are used as building blocks in a system capable of detecting and grasping unknown objects in natural scenes. We show how different visual servoing techniques facilitate a complete grasping cycle.

Current robotics research is largely driven by the vision of creating an intelligent being that can perform dangerous, difficult or unpopular tasks. These can for example be exploring the surface of planet mars or the bottom of the ocean, maintaining a furnace or assembling a car. They can also be more mundane such as cleaning an apartment or fetching groceries. This vision has been pursued since the 1960s when the first robots were built. Some of the tasks mentioned above, especially those in industrial manufacturing, are already frequently performed by robots. Others are still completely out of reach. Especially, household robots are far away from being deployable as general purpose devices. Although advancements have been made in this research area, robots are not yet able to perform household chores robustly in unstructured and open-ended environments given unexpected events and uncertainty in perception and execution.In this thesis, we are analyzing which perceptual and motor capabilities are necessary for the robot to perform common tasks in a household scenario. In that context, an essential capability is to understand the scene that the robot has to interact with. This involves separating objects from the background but also from each other.Once this is achieved, many other tasks become much easier. Configuration of object scan be determined; they can be identified or categorized; their pose can be estimated; free and occupied space in the environment can be outlined.This kind of scene model can then inform grasp planning algorithms to finally pick up objects.However, scene understanding is not a trivial problem and even state-of-the-art methods may fail. Given an incomplete, noisy and potentially erroneously segmented scene model, the questions remain how suitable grasps can be planned and how they can be executed robustly.In this thesis, we propose to equip the robot with a set of prediction mechanisms that allow it to hypothesize about parts of the scene it has not yet observed. Additionally, the robot can also quantify how uncertain it is about this prediction allowing it to plan actions for exploring the scene at specifically uncertain places. We consider multiple modalities including monocular and stereo vision, haptic sensing and information obtained through a human-robot dialog system. We also study several scene representations of different complexity and their applicability to a grasping scenario. Given an improved scene model from this multi-modal exploration, grasps can be inferred for each object hypothesis. Dependent on whether the objects are known, familiar or unknown, different methodologies for grasp inference apply. In this thesis, we propose novel methods for each of these cases. Furthermore,we demonstrate the execution of these grasp both in a closed and open-loop manner showing the effectiveness of the proposed methods in real-world scenarios.

We consider the problem of grasp and manipulation planning when the state of the world is only partially observable. Specifically, we address the task of picking up unknown objects from a table top. The proposed approach to object shape prediction aims at closing the knowledge gaps in the robot's understanding of the world. A completed state estimate of the environment can then be provided to a simulator in which stable grasps and collision-free movements are planned. The proposed approach is based on the observation that many objects commonly in use in a service robotic scenario possess symmetries. We search for the optimal parameters of these symmetries given visibility constraints. Once found, the point cloud is completed and a surface mesh reconstructed. Quantitative experiments show that the predictions are valid approximations of the real object shape. By demonstrating the approach on two very different robotic platforms its generality is emphasized.

We propose a novel human-robot-interaction framework for robust visual scene understanding. Without any a-priori knowledge about the objects, the task of the robot is to correctly enumerate how many of them are in the scene and segment them from the background. Our approach builds on top of state-of-the-art computer vision methods, generating object hypotheses through segmentation. This process is combined with a natural dialog system, thus including a `human in the loop' where, by exploiting the natural conversation of an advanced dialog system, the robot gains knowledge about ambiguous situations. We present an entropy-based system allowing the robot to detect the poorest object hypotheses and query the user for arbitration. Based on the information obtained from the human-robot dialog, the scene segmentation can be re-seeded and thereby improved. We present experimental results on real data that show an improved segmentation performance compared to segmentation without interaction.

In IROS’10 Workshop on Defining and Solving Realistic Perception Problems in Personal Robotics, October 2010 (inproceedings)

Abstract

Object grasping and manipulation pose major challenges for perception and control and require rich interaction between these two fields. In this paper, we concentrate on the plethora of perceptual problems that have to be solved before a robot can be moved in a controlled way to pick up an object. A vision system is presented that integrates a number of different computational processes, e.g. attention, segmentation, recognition or reconstruction to incrementally build up a representation of the scene suitable for grasping and manipulation of objects. Our vision system is equipped with an active robotic head and a robot arm. This embodiment enables the robot to perform a number of different actions like saccading, fixating, and grasping. By applying these actions, the robot can incrementally build a scene representation and use it for interaction. We demonstrate our system in a scenario for picking up known objects from a table top. We also show the system’s extendibility towards grasping of unknown and familiar objects.

This paper presents work on vision based robotic grasping. The proposed method adopts a learning framework where prototypical grasping points are learnt from several examples and then used on novel objects. For representation purposes, we apply the concept of shape context and for learning we use a supervised learning approach in which the classifier is trained with labelled synthetic images. We evaluate and compare the performance of linear and non-linear classifiers. Our results show that a combination of a descriptor based on shape context with a non-linear classification algorithm leads to a stable detection of grasping points for a variety of objects.

In this paper we present a framework for the segmentation of multiple objects from a 3D point cloud. We extend traditional image segmentation techniques into a full 3D representation. The proposed technique relies on a state-of-the-art min-cut framework to perform a fully 3D global multi-class labeling in a principled manner. Thereby, we extend our previous work in which a single object was actively segmented from the background. We also examine several seeding methods to bootstrap the graphical model-based energy minimization and these methods are compared over challenging scenes. All results are generated on real-world data gathered with an active vision robotic head. We present quantitive results over aggregate sets as well as visual results on specific examples.

We propose a method for multi-modal scene exploration where initial object hypothesis formed by active visual segmentation are confirmed and augmented through haptic exploration with a robotic arm. We update the current belief about the state of the map with the detection results and predict yet unknown parts of the map with a Gaussian Process. We show that through the integration of different sensor modalities, we achieve a more complete scene model. We also show that the prediction of the scene structure leads to a valid scene representation even if the map is not fully traversed. Furthermore, we propose different exploration strategies and evaluate them both in simulation and on our robotic platform.

A distinct property of robot vision systems is that they are embodied. Visual information is extracted for the purpose of moving in and interacting with the environment. Thus, different types of perception-action cycles need to be implemented and evaluated.
In this paper, we study the problem of designing a vision system for the purpose of object grasping in everyday environments. This vision system is firstly targeted at the interaction with the world through recognition and grasping of objects and secondly at being an interface for the reasoning and planning module to the real world. The latter provides the vision system with a certain task that drives it and defines a specific context, i.e. search for or identify a certain object and analyze it for potential later manipulation. We deal with cases of: (i) known objects, (ii) objects similar to already known objects, and (iii) unknown objects. The perception-action cycle is connected to the reasoning system based on the idea of affordances. All three cases are also related to the state of the art and the terminology in the neuroscientific area.

We present work on vision based robotic grasping. The proposed method relies on extracting and representing the global contour of an object in a monocular image. A suitable grasp is then generated using a learning framework where prototypical grasping points are learned from several examples and then used on novel objects. For representation purposes, we apply the concept of shape context and for learning we use a supervised learning approach in which the classifier is trained with labeled synthetic images. Our results show that a combination of a descriptor based on shape context with a non-linear classification algorithm leads to a stable detection of grasping points for a variety of objects. Furthermore, we will show how our representation supports the inference of a full grasp configuration.