We study visual servoing in a framework of detection and grasping of unknown objects. Classically, visual servoing has been used for applications where the object to be servoed on is known to the robot prior to the task execution. In addition, most of the methods concentrate on aligning the robot hand with the object without grasping it. In our work, visual servoing techniques are used as building blocks in a system capable of detecting and grasping unknown objects in natural scenes. We show how different visual servoing techniques facilitate a complete grasping cycle.

We propose a system for markerless pose estimation and tracking of a robot manipulator. By tracking the manipulator, we can obtain an accurate estimate of its position and orientation necessary in many object grasping and manipulation tasks. Tracking the manipulator allows also for better collision avoidance. The method is based on the notion of virtual visual servoing. We also propose the use of distance transform in the control loop, which makes the performance independent of the feature search window.

This paper investigates object categorization according to function, i.e., learning the affordances of objects from human demonstration. Object affordances (functionality) are inferred from observations of humans using the objects in different types of actions. The intended application is learning from demonstration, in which a robot learns to employ objects in household tasks, from observing a human performing the same tasks with the objects. We present a method for categorizing manipulated objects and human manipulation actions in context of each other. The method is able to simultaneously segment and classify human hand actions, and detect and classify the objects involved in the action. This can serve as an initial step in a learning from demonstration method. Experiments show that the contextual information improves the classification of both objects and actions.

This paper presents a vision based method for grasp classification. It is developed as part of a Programming by Demonstration (PbD) system for which recognition of objects and pick-and-place actions represent basic building blocks for task learning. In contrary to earlier approaches, no articulated 3D reconstruction of the hand over time is taking place. The indata consists of a single image of the human hand. A 2D representation of the hand shape, based on gradient orientation histograms, is extracted from the image. The hand shape is then classified as one of six grasps by finding similar hand shapes in a large database of grasp images. The database search is performed using Locality Sensitive Hashing (LSH), an approximate k-nearest neighbor approach. The nearest neighbors also give an estimated hand orientation with respect to the camera. The six human grasps are mapped to three Barret hand grasps. Depending on the type of robot grasp, a precomputed grasp strategy is selected. The strategy is further parameterized by the orientation of the hand relative to the object. To evaluate the potential for the method to be part of a robust vision system, experiments were performed, comparing classification results to a baseline of human classification performance. The experiments showed the LSH recognition performance to be comparable to human performance.

The visual analysis of human manipulation actions is of interest for e.g. human-robot interaction applications where a robot learns how to perform a task by watching a human. In this paper, a method for classifying manipulation actions in the context of the objects manipulated, and classifying objects in the context of the actions used to manipulate them is presented. Hand and object features are extracted from the video sequence using a segmentation based approach. A shape based representation is used for both the hand and the object. Experiments show this representation suitable for representing generic shape classes. The action-object correlation over time is then modeled using conditional random fields. Experimental comparison show great improvement in classification rate when the action-object correlation is taken into account, compared to separate classification of manipulation actions and manipulated objects.

Imagine that a robot fetched this thesis for you from a book shelf. How doyou think the robot would have been programmed? One possibility is thatexperienced engineers had written low level descriptions of all imaginabletasks, including grasping a small book from this particular shelf. A secondoption would be that the robot tried to learn how to grasp books from yourshelf autonomously, resulting in hours of trial-and-error and several bookson the floor.In this thesis, we argue in favor of a third approach where you teach therobot how to grasp books from your shelf through grasping by demonstration.It is based on the idea of robots learning grasping actions by observinghumans performing them. This imposes minimum requirements on the humanteacher: no programming knowledge and, in this thesis, no need for specialsensory devices. It also maximizes the amount of sources from which therobot can learn: any video footage showing a task performed by a human couldpotentially be used in the learning process. And hopefully it reduces theamount of books that end up on the floor.

This document explores the challenges involved in the creation of such asystem. First, the robot should be able to understand what the teacher isdoing with their hands. This means, it needs to estimate the pose of theteacher's hands by visually observing their in the absence of markers or anyother input devices which could interfere with the demonstration. Second,the robot should translate the human representation acquired in terms ofhand poses to its own embodiment. Since the kinematics of the robot arepotentially very different from the human one, defining a similarity measureapplicable to very different bodies becomes a challenge. Third, theexecution of the grasp should be continuously monitored to react toinaccuracies in the robot perception or changes in the grasping scenario.While visual data can help correcting the reaching movement to the object,tactile data enables accurate adaptation of the grasp itself, therebyadjusting the robot's internal model of the scene to reality. Finally,acquiring compact models of human grasping actions can help in bothperceiving human demonstrations more accurately and executing them in a morehuman-like manner. Moreover, modeling human grasps can provide us withinsights about what makes an artificial hand design anthropomorphic,assisting the design of new robotic manipulators and hand prostheses.

All these modules try to solve particular subproblems of a grasping bydemonstration system. We hope the research on these subproblems performed inthis thesis will both bring us closer to our dream of a learning robot andcontribute to the multiple research fields where these subproblems arecoming from.

Understanding the spatial dimensionality and temporal context of human hand actions can provide representations for programming grasping actions in robots and inspire design of new robotic and prosthetic hands. The natural representation of human hand motion has high dimensionality. For specific activities such as handling and grasping of objects, the commonly observed hand motions lie on a lower-dimensional non-linear manifold in hand posture space. Although full body human motion is well studied within Computer Vision and Biomechanics, there is very little work on the analysis of hand motion with nonlinear dimensionality reduction techniques. In this paper we use Gaussian Process Latent Variable Models (GPLVMs) to model the lower dimensional manifold of human hand motions during object grasping. We show how the technique can be used to embed high-dimensional grasping actions in a lower-dimensional space suitable for modeling, recognition and mapping.

This paper presents a method for vision based estimation of the pose of human hands in interaction with objects. Despite the fact that most robotics applications of human hand tracking involve grasping and manipulation of objects, the majority of methods in the literature assume a free hand, isolated from the surrounding environment. Our hand tracking method is non-parametric, performing a nearest neighbor search in a large database (100000 entries) of hand poses with and without grasped objects. The system operates in real time, it is robust to self occlusions, object occlusions and segmentation errors, and provides full hand pose reconstruction from markerless video. Temporal consistency in hand pose is taken into account, without explicitly tracking the hand in the high dimensional pose space.

We are developing a Programming by Demonstration (PbD) system for which recognition of objects and pick-and-place actions represent basic building blocks for task learning. An important capability in this system is automatic isual recognition of human grasps, and methods for mapping the human grasps to the functionally corresponding robot grasps. This paper describes the grasp recognition system, focusing on the human-to-robot mapping. The visual grasp classification and grasp orientation regression is described in our IROS 2008 paper [1]. In contrary to earlier approaches, no articulated 3D reconstruction of the hand over time is taking place. The input data consists of a single image of the human hand. The hand shape is classified as one of six grasps by finding similar hand shapes in a large database of grasp images. From the database, the hand orientation is also estimated. The recognized grasp is then mapped to one of three predefined Barrett hand grasps. Depending on the type of robot grasp, a precomputed grasp strategy is selected. The strategy is further parameterized by the orientation of the hand relative to the environment show purposes.

We study the problem of human to robot grasp mapping as a basic building block of a learning by imitation system. The human hand posture, including both the grasp type and hand orientation, is first classified based on a single image and mapped to a specific robot hand. A metric for the evaluation based on the notion of virtual fingers is proposed. The first part of the experimental evaluation, performed in simulation, shows bow the differences in the embodiment between human and robotic hand affect the grasp strategy. The second part, performed with a robotic system, demonstrates the feasibility of the proposed methodology in realistic applications.

Markerless, vision based estimation of human hand pose over time is a prerequisite for a number of robotics applications, such as Learning by Demonstration (LbD), health monitoring, teleoperation, human-robot interaction. It has special interest in humanoid platforms, where the number of degrees of freedom makes conventional programming challenging. Our primary application is LbD in natural environments where the humanoid robot learns how to grasp and manipulate objects by observing a human performing a task. This paper presents a method for continuous vision based estimation of human hand pose. The method is non-parametric, performing a nearest neighbor search in a large database (100000 entries) of hand pose examples. The main contribution is a real time system, robust to partial occlusions and segmentation errors, that provides full hand pose recognition from markerless data. An additional contribution is the modeling of based on temporal consistency in hand pose, without explicitly tracking the hand in the high dimensional pose space. The pose representation is rich enough to enable a descriptive humanto-robotmapping. Experiments show the pose estimation to be more robust and accurate than a non-parametric method without temporal constraints.

We show how matching and reconstruction of contour points can be performed using Dynamic Time Warping (DTW) for the purpose of 3D hand contour tracking. We evaluate the performance of the proposed algorithm in object manipulation activities and perform comparison with the Iterative Closest Point (ICP) method.