2016

The International Journal of Robotics Research, 35(14):1731-1749, December 2016 (article)

Abstract

The Gaussian Filter (GF) is one of the most widely used filtering algorithms; instances are the Extended Kalman Filter, the Unscented Kalman Filter and the Divided Difference Filter. The GF represents the belief of the current state by a Gaussian distribution, whose mean is an affine function of the measurement. We show that this representation can be too restrictive to accurately capture the dependences in systems with nonlinear observation models, and we investigate how the GF can be generalized to alleviate this problem. To this end, we view the GF as the solution to a constrained optimization problem. From this new perspective, the GF is seen as a special case of a much broader class of filters, obtained by relaxing the constraint on the form of the approximate posterior. On this basis, we outline some conditions which potential generalizations have to satisfy in order to maintain the computational efficiency of the GF. We propose one concrete generalization which corresponds to the standard GF using a pseudo measurement instead of the actual measurement. Extending an existing GF implementation in this manner is trivial. Nevertheless, we show that this small change can have a major impact on the estimation accuracy.

The invention comprises a learned model of human body shape and pose dependent shape variation that is more accurate than previous models and is compatible with existing graphics pipelines. Our Skinned Multi-Person Linear model (SMPL) is a skinned vertex based model that accurately represents a wide variety of body shapes in natural human poses. The parameters of the model are learned from data including the rest pose template, blend weights, pose-dependent blend shapes, identity- dependent blend shapes, and a regressor from vertices to joint locations. Unlike previous models, the pose-dependent blend shapes are a linear function of the elements of the pose rotation matrices. This simple formulation enables training the entire model from a relatively large number of aligned 3D meshes of different people in different poses. The invention quantitatively evaluates variants of SMPL using linear or dual- quaternion blend skinning and show that both are more accurate than a Blend SCAPE model trained on the same data. In a further embodiment, the invention realistically models dynamic soft-tissue deformations. Because it is based on blend skinning, SMPL is compatible with existing rendering engines and we make it available for research purposes.

Brief verbal descriptions of bodies (e.g. curvy, long-legged) can elicit vivid mental images. The ease with which we create these mental images belies the complexity of three-dimensional body shapes. We explored the relationship between body shapes and body descriptions and show that a small number of words can be used to generate categorically accurate representations of three-dimensional bodies. The dimensions of body shape variation that emerged in a language-based similarity space were related to major dimensions of variation computed directly from three-dimensional laser scans of 2094 bodies. This allowed us to generate three-dimensional models of people in the shape space using only their coordinates on analogous dimensions in the language-based description space. Human descriptions of photographed bodies and their corresponding models matched closely. The natural mapping between the spaces illustrates the role of language as a concise code for body shape, capturing perceptually salient global and local body features.

Workshop on the Algorithmic Foundations of Robotics, pages: 22, November 2016 (conference)

Abstract

Stochastic Optimal Control (SOC) typically considers noise only in the process model, i.e. unknown disturbances. However, in many robotic applications that involve interaction with the environment, such as locomotion and manipulation, uncertainty also comes from lack of pre- cise knowledge of the world, which is not an actual disturbance. We de- velop a computationally efficient SOC algorithm, based on risk-sensitive control, that takes into account uncertainty in the measurements. We include the dynamics of an observer in such a way that the control law explicitly depends on the current measurement uncertainty. We show that high measurement uncertainty leads to low impedance behaviors, a result in contrast with the effects of process noise variance that creates stiff behaviors. Simulation results on a simple 2D manipulator show that our controller can create better interaction with the environment under uncertain contact locations than traditional SOC approaches.

Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems, IEEE, IROS, October 2016 (conference)

Abstract

One of the central tasks for a household robot is searching for specific objects. It does not only require localizing the target object but also identifying promising search locations in the scene if the target is not immediately visible. As computation time and hardware resources are usually limited in robotics, it is desirable to avoid expensive visual processing steps that are exhaustively applied over the entire image. The human visual system can quickly select those image locations that have to be processed in detail for a given task. This allows us to cope with huge amounts of information and to efficiently deploy the limited capacities of our visual system. In this paper, we therefore propose to use human fixation data to train a top-down saliency model that predicts relevant image locations when searching for specific objects. We show that the learned model can successfully prune bounding box proposals without rejecting the ground truth object locations. In this aspect, the proposed model outperforms a model that is trained only on the ground truth segmentations of the target object instead of fixation data.

In International Conference on Intelligent Robots and Systems (IROS) 2016, IEEE/RSJ International Conference on Intelligent Robots and Systems, October 2016 (inproceedings)

Abstract

Binary descriptors allow fast detection and matching algorithms in computer vision problems. Though binary descriptors can be computed at almost two orders of magnitude faster than traditional gradient based descriptors, they suffer from poor matching accuracy in challenging conditions. In this paper we propose three improvements for binary descriptors in their computation and matching that enhance their performance in comparison to traditional binary and non-binary descriptors without compromising their speed. This is achieved by learning some weights and threshold parameters that allow customized matching under some variations such as lighting and viewpoint. Our suggested improvements can be easily applied to any binary descriptor. We demonstrate our approach on the ORB (Oriented FAST and Rotated BRIEF) descriptor and compare its performance with the traditional ORB and SIFT descriptors on a wide variety of datasets. In all instances, our enhancements outperform standard ORB and is comparable to SIFT.

We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D. To solve this, we first use a recently published CNN-based method, DeepCut, to predict (bottom-up) the 2D body joint locations. We then fit (top-down) a recently published statistical body shape model, called SMPL, to the 2D joints. We do so by minimizing an objective function that penalizes the error between the projected 3D model joints and detected 2D joints. Because SMPL captures correlations in human shape across the population, we are able to robustly fit it to very little data. We further leverage the 3D model to prevent solutions that cause interpenetration. We evaluate our method, SMPLify, on the Leeds Sports, HumanEva, and Human3.6M datasets, showing superior
pose accuracy with respect to the state of the art.

In European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science, Springer, 14th European Conference on Computer Vision, October 2016 (inproceedings)

Abstract

In this paper we propose a CNN architecture for semantic image segmentation. We introduce a new “bilateral inception” module that can be inserted in existing CNN architectures and performs bilateral filtering, at multiple feature-scales, between superpixels in an image. The feature spaces for bilateral filtering and other parameters of the module are learned end-to-end using standard backpropagation techniques. The bilateral inception module addresses two issues that arise with general CNN segmentation architectures. First, this module propagates information between (super) pixels while respecting image edges, thus using the structured information of the problem for improved results. Second, the layer recovers a full resolution segmentation result from the lower resolution solution of a CNN. In the experiments, we modify several existing CNN architectures by inserting our inception modules between the last CNN (1 × 1 convolution) layers. Empirical results on three different datasets show reliable improvements not only in comparison to the baseline networks, but also in comparison to several dense-pixel prediction techniques such as CRFs, while being competitive in time.

The caffe framework is one of the leading deep learning toolboxes in the machine learning and computer vision community. While it offers efficiency and configurability, it falls short of a full interface to Python. With increasingly involved procedures for training deep networks and reaching depths of hundreds of layers, creating configuration files and keeping them consistent becomes an error prone process.
We introduce the barrista framework, offering full, pythonic control over caffe. It separates responsibilities and offers code to solve frequently occurring tasks for pre-processing, training and model inspection. It is compatible to all caffe versions since mid 2015 and can import and export .prototxt files.
Examples are included, e.g., a deep residual network implemented in only 172 lines (for arbitrary depths), comparing to 2320 lines in the official implementation for the equivalent model.

The purpose of this thesis is the study of non-parametric models for structured data and their fields of application in computer vision. We aim at the development of context-sensitive architectures which are both expressive and efficient. Our focus is on directed graphical models, in particular Bayesian networks, where we combine the flexibility of non-parametric local distributions with the efficiency of a global topology with bounded treewidth. A bound on the treewidth is obtained by either constraining the maximum indegree of the underlying graph structure or by introducing determinism. The non-parametric distributions in the nodes of the graph are given by decision trees or kernel density estimators.
The information flow implied by specific network topologies, especially the resultant (conditional) independencies, allows for a natural integration and control of contextual information. We distinguish between three different types of context: static, dynamic, and semantic. In four different approaches we propose models which exhibit varying combinations of these contextual properties and allow modeling of structured data in space, time, and hierarchies derived thereof. The generative character of the presented models enables a direct synthesis of plausible hypotheses.
Extensive experiments validate the developed models in two application scenarios which are of particular interest in computer vision: human bodies and natural scenes. In the practical sections of this work we discuss both areas from different angles and show applications of our models to human pose, motion, and segmentation as well as object categorization and localization. Here, we benefit from the availability of modern datasets of unprecedented size and diversity. Comparisons to traditional approaches and state-of-the-art research on the basis of well-established evaluation criteria allows the objective assessment of our contributions.

In Proceedings of the American Control Conference (ACC), Boston, MA, USA, July 2016 (inproceedings)

Abstract

Most widely-used state estimation algorithms, such as the Extended Kalman Filter and the Unscented Kalman Filter, belong to the family of Gaussian Filters (GF). Unfortunately, GFs fail if the measurement process is modelled by a fat-tailed distribution. This is a severe limitation, because thin-tailed measurement models, such as the analytically-convenient and therefore widely-used Gaussian distribution, are sensitive to outliers. In this paper, we show that mapping the measurements into a specific feature space enables any existing GF algorithm to work with fat-tailed measurement models. We find a feature function which is optimal under certain conditions. Simulation results show that the proposed method allows for robust filtering in both linear and nonlinear systems with measurements contaminated by fat-tailed noise.

Realistic, metrically accurate, 3D human avatars are useful for games, shopping, virtual reality, and health applications. Such avatars are not in wide use because solutions for creating them from high-end scanners, low-cost range cameras, and tailoring measurements all have limitations. Here we propose a simple solution and show that it is surprisingly accurate. We use crowdsourcing to generate attribute ratings of 3D body shapes corresponding to standard linguistic descriptions of 3D shape. We then learn a linear function relating these ratings to 3D human shape parameters. Given an image of a new body, we again turn to the crowd for ratings of the body shape. The collection of linguistic ratings of a photograph provides remarkably strong constraints on the metric 3D shape. We call the process crowdshaping and show that our Body Talk system produces shapes that are perceptually indistinguishable from bodies created from high-resolution scans and that the metric accuracy is sufficient for many tasks. This makes body “scanning” practical without a scanner, opening up new applications including database search, visualization, and extracting avatars from books.

This paper considers the task of articulated human pose estimation of multiple people in real-world images. We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity of each other.
This joint formulation is in contrast to previous strategies, that address the problem by first detecting people and subsequently estimating their body pose. We propose a partitioning and labeling formulation of a set of body-part hypotheses generated with CNN-based part detectors. Our formulation, an instance of an integer linear program, implicitly performs non-maximum suppression on the set of part candidates and groups them to form configurations of body parts respecting geometric and appearance constraints. Experiments on four different datasets demonstrate state-of-the-art results for both single person and multi person pose estimation.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2016 (inproceedings)

Abstract

Video object segmentation is challenging due to fast moving objects, deforming shapes, and cluttered backgrounds. Optical flow can be used to propagate an object segmentation over time but, unfortunately, flow is often inaccurate, particularly around object boundaries. Such boundaries are precisely where we want our segmentation to be accurate. To obtain accurate segmentation across time, we propose an efficient algorithm that considers video segmentation and optical flow estimation simultaneously. For video segmentation, we formulate a principled, multiscale, spatio-temporal objective function that uses optical flow to propagate information between frames. For optical flow estimation, particularly at object boundaries, we compute the flow independently in the segmented regions and recompose the results. We call the process object flow and demonstrate the effectiveness of jointly optimizing optical flow and video segmentation using an iterative scheme. Experiments on the SegTrack v2 and Youtube-Objects datasets show that the proposed algorithm performs favorably against the other state-of-the-art methods.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2016 (inproceedings)

Abstract

In this paper, we propose a non-local structured prior for volumetric multi-view 3D reconstruction. Towards this goal, we present a novel Markov random field model based on ray potentials in which assumptions about large 3D surface patches such as planarity or Manhattan world constraints can be efficiently encoded as probabilistic priors. We further derive an inference algorithm that reasons jointly about voxels, pixels and image segments, and estimates marginal distributions of appearance, occupancy, depth, normals and planarity. Key to tractable inference is a novel hybrid representation that spans both voxel and pixel space and that integrates non-local information from 2D image segmentations in a principled way. We compare our non-local prior to commonly employed local smoothness assumptions and a variety of state-of-the-art volumetric reconstruction baselines on challenging outdoor scenes with textureless and reflective surfaces. Our experiments indicate that regularizing over larger distances has the potential to resolve ambiguities where local regularizers fail.

International Journal of Computer Vision (IJCV), 118(2):172-193, June 2016 (article)

Abstract

Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors. However, even most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error and with collision detection and physics simulation to achieve physically plausible estimates even in case of occlusions and missing visual data. Since all components are unified in a single objective function which is almost everywhere differentiable, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.

Existing optical flow methods make generic, spatially homogeneous, assumptions about the spatial structure of the flow. In reality, optical flow varies across an image depending on object class.
Simply put, different objects move differently. Here we exploit recent advances in static semantic scene segmentation to segment the image into objects of different types. We define different models of image motion in these regions depending on the type of object. For example, we model the motion on roads with homographies, vegetation with spatially smooth flow, and independently moving objects like cars and planes with affine motion plus deviations. We then pose the flow estimation problem using a novel formulation of localized layers, which addresses limitations of traditional layered models for dealing with complex scene motion. Our semantic flow method achieves the lowest error of any published monocular method in the KITTI-2015 flow benchmark and produces qualitatively better flow and segmentation than recent top methods on a wide range of natural videos.

Bilateral filters have wide spread use due to their edge-preserving properties. The common use case is to manually choose a parametric filter type, usually a Gaussian filter. In this paper, we will generalize the parametrization and in particular derive a gradient descent algorithm so the filter parameters can be learned from data. This derivation allows to learn high dimensional linear filters that operate in sparsely populated feature spaces. We build on the permutohedral lattice construction for efficient filtering. The ability to learn more general forms of high-dimensional filters can be used in several diverse applications. First, we demonstrate the use in applications where single filter applications are desired for runtime reasons. Further, we show how this algorithm can be used to learn the pairwise potentials in densely connected conditional random fields and apply these to different image segmentation tasks. Finally, we introduce layers of bilateral filters in CNNs and propose bilateral neural networks for the use of high-dimensional sparse data. This view provides new ways to encode model structure into network architectures. A diverse set of experiments empirically validates the usage of general forms of filters.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2016 (inproceedings)

Abstract

Occlusion boundaries contain rich perceptual information about the underlying scene structure. They also provide important cues in many visual perception tasks such as scene understanding, object recognition, and segmentation. In this paper, we improve occlusion boundary detection via enhanced exploration of contextual information (e.g., local structural boundary patterns, observations from surrounding regions, and temporal context), and in doing so develop a novel approach based on convolutional neural networks (CNNs) and conditional random fields (CRFs). Experimental results demonstrate that our detector significantly outperforms the state-of-the-art (e.g., improving the F-measure from 0.62 to 0.71 on the commonly used CMU benchmark). Last but not least, we empirically assess the roles of several important components of the proposed detector, so as to validate the rationale behind this approach.

In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), June 2016 (inproceedings)

Abstract

Semantic annotations are vital for training models for object recognition, semantic segmentation or scene understanding. Unfortunately, pixelwise annotation of images at very large scale is labor-intensive and only little labeled data is available, particularly at instance level and for street scenes. In this paper, we propose to tackle this problem by lifting the semantic instance labeling task from 2D into 3D. Given reconstructions from stereo or laser data, we annotate static 3D scene elements with rough bounding primitives and develop a probabilistic model which transfers this information into the image domain. We leverage our method to obtain 2D labels for a novel suburban video dataset which we have collected, resulting in 400k semantic and instance image annotations. A comparison of our method to state-of-the-art label transfer baselines reveals that 3D information enables more efficient annotation while at the same time resulting in improved accuracy and time-coherent labels.

In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2016, IEEE, IEEE International Conference on Robotics and Automation, May 2016 (inproceedings)

Abstract

To achieve accurate vision-based control with a robotic arm, a good hand-eye coordination is required. However, knowing the current configuration of the arm can be very difficult due to noisy readings from joint encoders or an inaccurate hand-eye calibration. We propose an approach for robot arm pose estimation that uses depth images of the arm as input to directly estimate angular joint positions. This is a frame-by-frame method which does not rely on good initialisation of the solution from the previous frames or knowledge from the joint encoders. For estimation, we employ a random regression forest which is trained on synthetically generated data. We compare different training objectives of the forest and also analyse the influence of prior segmentation of the arms on accuracy. We show that this approach improves previous work both in terms of computational complexity and accuracy. Despite being trained on synthetic data only, we demonstrate that the estimation also works on real depth images.

In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2016, IEEE, IEEE International Conference on Robotics and Automation, May 2016 (inproceedings)

Abstract

In this paper, we consider the problem of robotic grasping of objects when only partial and noisy sensor data of the environment is available. We are specifically interested in the problem of reliably selecting the best hypothesis from a whole set. This is commonly the case when trying to grasp an object for which we can only observe a partial point cloud from one viewpoint through noisy sensors. There will be many possible ways to successfully grasp this object, and even more which will fail. We propose a supervised learning method that is trained with a ranking loss. This explicitly encourages that the top-ranked training grasp in a hypothesis set is also positively labeled. We show how we adapt the standard ranking loss to work with data that has binary labels and explain the benefits of this formulation. Additionally, we show how we can efficiently optimize this loss with stochastic gradient descent. In quantitative experiments, we show that we can outperform previous models by a large margin.

In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2016, IEEE, IEEE International Conference on Robotics and Automation, May 2016 (inproceedings)

Abstract

We propose a novel method that enables a robot to identify a graspable object part of an unknown object given only noisy and partial information that is obtained from an RGB-D camera. Our method combines the benefits of local with the advantages of global methods. It learns a classifier that takes a local shape representation as input and outputs the probability that a grasp applied at this location will be successful. Given a query data point that is classified in this way, we can retrieve all the locally similar training data points and use them to predict latent global object shape. This information may help to further prune positively labeled grasp hypotheses based on, e.g. relation to the predicted average global shape or suitability for a specific task. This prediction can also guide scene exploration to prune object shape hypotheses.
To learn the function that maps local shape to grasp stability we use a Random Forest Classifier. We show that our method reaches the same classification performance as the current state-of-the-art on this dataset which uses a Convolutional Neural Network. Additionally, we exploit the natural ability of the Random Forest to cluster similar data. For a positively predicted query data point, we retrieve all the locally similar training data points that are associated with the same leaf nodes of the Random Forest. The main insight from this work is that local object shape that affords a grasp is also a good predictor of global object shape. We empirically support this claim with quantitative experiments. Additionally, we demonstrate the predictive capability of the method on some real data examples.

In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages: 270-277, IEEE, IEEE International Conference on Robotics and Automation, May 2016 (inproceedings)

Abstract

This paper proposes an automatic controller tuning framework based on linear optimal control combined with Bayesian optimization. With this framework, an initial set of controller gains is automatically improved according to a pre-defined performance objective evaluated from experimental data. The underlying Bayesian optimization algorithm is Entropy Search, which represents the latent objective as a Gaussian process and constructs an explicit belief over the location of the objective minimum. This is used to maximize the information gain from each experimental evaluation. Thus, this framework shall yield improved controllers with fewer evaluations compared to alternative approaches. A seven-degree- of-freedom robot arm balancing an inverted pole is used as the experimental demonstrator. Results of a two- and four- dimensional tuning problems highlight the method’s potential for automatic controller tuning on robotic platforms.

In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2016, IEEE, IEEE International Conference on Robotics and Automation, May 2016 (inproceedings)

Abstract

We consider the problem of model-based 3D- tracking of objects given dense depth images as input. Two difficulties preclude the application of a standard Gaussian filter to this problem. First of all, depth sensors are characterized by fat-tailed measurement noise. To address this issue, we show how a recently published robustification method for Gaussian filters can be applied to the problem at hand. Thereby, we avoid using heuristic outlier detection methods that simply reject measurements if they do not match the model. Secondly, the computational cost of the standard Gaussian filter is prohibitive due to the high-dimensional measurement, i.e. the depth image. To address this problem, we propose an approximation to reduce the computational complexity of the filter. In quantitative experiments on real data we show how our method clearly outperforms the standard Gaussian filter. Furthermore, we compare its performance to a particle-filter-based tracking method, and observe comparable computational efficiency and improved accuracy and smoothness of the estimates.

An action-oriented perspective changes the role of an individual from a passive observer to an actively engaged agent interacting in a closed loop with the world as well as with others. Cognition exists to serve action within a landscape that contains both. This chapter surveys this landscape and addresses the status of the pragmatic turn. Its potential influence on science and the study of cognition are considered (including perception, social cognition, social interaction, sensorimotor entrainment, and language acquisition) and its impact on how neuroscience is studied is also investigated (with the notion that brains do not passively build models, but instead support the guidance of action).
A review of its implications in robotics and engineering includes a discussion of the application of enactive control principles to couple action and perception in robotics as well as the conceptualization of system design in a more holistic, less modular manner. Practical applications that can impact the human condition are reviewed (e.g. educational applications, treatment possibilities for developmental and psychopathological disorders, the development of neural prostheses). All of this foreshadows the potential societal implications of the pragmatic turn. The chapter concludes that an action-oriented approach emphasizes a continuum of interaction between technical aspects of cognitive systems and robotics, biology, psychology, the social sciences, and the humanities, where the individual is part of a grounded cultural system.

Since the 1950s, robotics research has sought to build a general-purpose agent capable of autonomous, open-ended interaction with realistic, unconstrained environments. Cognition is perceived to be at the core of this process, yet understanding has been challenged because cognition is referred to differently within and across research areas, and is not clearly defined. The classic robotics approach is decomposition into functional modules which perform planning, reasoning, and problem-solving or provide input to these mechanisms. Although advancements have been made and numerous success stories reported in specific niches, this systems-engineering approach has not succeeded in building such a cognitive agent.
The emergence of an action-oriented paradigm offers a new approach: action and perception are no longer separable into functional modules but must be considered in a complete loop. This chapter reviews work on different mechanisms for action- perception learning and discusses the role of embodiment in the design of the underlying representations and learning. It discusses the evaluation of agents and suggests the development of a new embodied Turing Test. Appropriate scenarios need to be devised in addition to current competitions, so that abilities can be tested over long time periods.

Advances in 3D scanning technology allow us to create realistic virtual avatars from full body 3D scan data. However, negative reactions to some realistic computer generated humans suggest that this approach might not always provide the most appealing results. Using styles derived from existing popular character designs, we present a novel automatic stylization technique for body shape and colour information based on a statistical 3D model of human bodies. We investigate whether such stylized body shapes result in increased perceived appeal with two different experiments: One focuses on body shape alone, the other investigates the additional role of surface colour and lighting. Our results consistently show that the most appealing avatar is a partially stylized one. Importantly, avatars with high stylization or no stylization at all were rated to have the least appeal. The inclusion of colour information and improvements to render quality had no significant effect on the overall perceived appeal of the avatars, and we observe that the body shape primarily drives the change in appeal ratings. For body scans with colour information, we found that a partially stylized avatar was most effective, increasing average appeal ratings by approximately 34%.

Although commercial and open-source software exist to reconstruct a static object from a sequence recorded with an RGB-D sensor, there is a lack of tools that build rigged models of articulated objects that deform realistically and can be used for tracking or animation.
In this work, we fill this gap and propose a method that creates a fully rigged model of an articulated object from depth data of a single sensor.
To this end, we combine deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow.
The fully rigged model then consists of a watertight mesh, embedded skeleton, and skinning weights.

Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments and to use this understanding to design future systems