Our visual system can easily categorize objects (e.g. faces vs. bodies) and further differentiate them into subcategories (e.g. male vs. female). This ability is particularly important for objects of social significance, such as human faces and bodies. While many studies have demonstrated category selectivity to faces and bodies in the brain, how subcategories of faces and bodies are represented remains unclear. Here, we investigated how the brain encodes two prominent subcategories shared by both faces and bodies, sex and weight, and whether neural responses to these subcategories rely on low-level visual, high-level visual or semantic similarity. We recorded brain activity with fMRI while participants viewed faces and bodies that varied in sex, weight, and image size. The results showed that the sex of bodies can be decoded from both body- and face-responsive brain areas, with the former exhibiting more consistent size-invariant decoding than the latter. Body weight could also be decoded in face-responsive areas and in distributed body-responsive areas, and this decoding was also invariant to image size. The weight of faces could be decoded from the fusiform body area (FBA), and weight could be decoded across face and body stimuli in the extrastriate body area (EBA) and a distributed body-responsive area. The sex of well-controlled faces (e.g. excluding hairstyles) could not be decoded from face- or body-responsive regions. These results demonstrate that both face- and body-responsive brain regions encode information that can distinguish the sex and weight of bodies. Moreover, the neural patterns corresponding to sex and weight were invariant to image size and could sometimes generalize across face and body stimuli, suggesting that such subcategorical information is encoded with a high-level visual or semantic code.

This paper presents an overview of the Grassroots project Aerial Outdoor Motion Capture (AirCap) running at the Max Planck Institute for Intelligent Systems. AirCap's goal is to achieve markerless, unconstrained, human motion capture (mocap) in unknown and unstructured outdoor environments. To that end, we have developed an autonomous flying motion capture system using a team of micro aerial vehicles (MAVs) with only on-board, monocular RGB cameras. We have conducted several real robot experiments involving up to 3 aerial vehicles autonomously tracking and following a person in several challenging scenarios using our approach of active cooperative perception developed in AirCap. Using the images captured by these robots during the experiments, we have demonstrated successful offline body pose and shape estimation with sufficiently high accuracy. Overall, we have demonstrated the first fully autonomous flying motion capture system involving multiple robots for outdoor scenarios.

In International Conference on Computer Vision, October 2019 (inproceedings)

Abstract

To understand and analyze human behavior, we need to capture humans moving in, and interacting with, the world. Most existing methods perform 3D human pose estimation without explicitly considering the scene. We observe, however, that the world constrains the body and vice-versa. To motivate this, we show that current 3D human pose estimation methods produce results that are not consistent with the 3D scene. Our key contribution is to exploit static 3D scene structure to better estimate human pose from monocular images. The method enforces Proximal Relationships with Object eXclusion and is called PROX. To test this, we collect a new dataset composed of 12 different 3D scenes and RGB sequences of 20 subjects moving in and interacting with the scenes. We represent human pose using the 3D human body model SMPL-X and extend SMPLify-X to estimate body pose using scene constraints. We make use of the 3D scene information by formulating two main constraints. The interpenetration constraint penalizes intersection between the body model and the surrounding 3D scene. The contact constraint encourages specific parts of the body to be in contact with scene surfaces if they are close enough in distance and orientation. For quantitative evaluation, we capture a separate dataset with 180 RGB frames in which the ground-truth body pose is estimated using a motion-capture system. We show quantitatively that introducing scene constraints significantly reduces 3D joint error and vertex error. Our code and data are available for research at https://prox.is.tue.mpg.de.
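The two scene constraints described above can be illustrated with a small sketch. It is not the PROX implementation: the scene is a hypothetical flat floor at z = 0, `scene_sdf`, `interpenetration_penalty`, `contact_penalty`, and the `max_dist` threshold are names and values assumed for illustration, and the orientation check mentioned in the abstract is omitted for brevity.

```python
import numpy as np

def scene_sdf(points):
    """Signed distance to a hypothetical scene: a floor at z = 0.

    Positive above the floor, negative when a point is inside the scene geometry.
    """
    return points[:, 2]

def interpenetration_penalty(vertices):
    """Sum of squared penetration depths over all body vertices inside the scene."""
    d = scene_sdf(vertices)
    return float(np.sum(np.minimum(d, 0.0) ** 2))

def contact_penalty(vertices, contact_idx, max_dist=0.05):
    """Pull candidate contact vertices onto the surface, but only if already close.

    max_dist is an assumed threshold in meters; vertices farther away are ignored.
    """
    d = np.abs(scene_sdf(vertices[contact_idx]))
    near = d < max_dist
    return float(np.sum(d[near]))

# Three body vertices: one below the floor, one hovering just above, one far away.
vertices = np.array([[0.0, 0.0, -0.10],
                     [0.0, 0.0, 0.02],
                     [0.0, 0.0, 1.00]])
total = interpenetration_penalty(vertices) + contact_penalty(vertices, [1, 2])
```

In an optimization-based fit, terms like these would be added to the image-based data term, so that the body is pushed out of the scene and snapped to nearby support surfaces.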

In International Conference on Computer Vision, October 2019 (inproceedings)

Abstract

Deep neural networks provide powerful tools for pattern recognition, while classical graph algorithms are widely used to solve combinatorial problems. In computer vision, many tasks combine elements of both pattern recognition and graph reasoning. In this paper, we study how to connect deep networks with graph decomposition into an end-to-end trainable framework. More specifically, the minimum cost multicut problem is first converted to an unconstrained binary cubic formulation where cycle consistency constraints are incorporated into the objective function. The new optimization problem can be viewed as a Conditional Random Field (CRF) in which the random variables are associated with the binary edge labels. Cycle constraints are introduced into the CRF as high-order potentials. A standard Convolutional Neural Network (CNN) provides the front-end features for the fully differentiable CRF. The parameters of both parts are optimized in an end-to-end manner. The efficacy of the proposed learning algorithm is demonstrated via experiments on clustering MNIST images and on the challenging task of real-world multi-people pose estimation.
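To make the idea of folding cycle consistency into the objective concrete, here is a minimal sketch on a single triangle. It assumes binary edge labels with 1 meaning "cut", and uses a cubic penalty that is nonzero exactly when one edge of the triangle is cut while the other two are joined (the inconsistent case). The function names and the penalty weight are assumptions for illustration, and the brute-force search stands in for the CRF inference used in the paper.

```python
from itertools import product

def cycle_penalty(y_ab, y_bc, y_ac):
    """Cubic term that equals 1 iff exactly one edge of the triangle is cut."""
    return (y_ab * (1 - y_bc) * (1 - y_ac)
            + (1 - y_ab) * y_bc * (1 - y_ac)
            + (1 - y_ab) * (1 - y_bc) * y_ac)

def objective(labels, costs, weight):
    """Linear multicut cost plus the weighted cycle-consistency penalty."""
    linear = sum(c * y for c, y in zip(costs, labels))
    return linear + weight * cycle_penalty(*labels)

def best_labeling(costs, weight=10.0):
    """Exhaustive search over the 8 labelings of one triangle."""
    return min(product((0, 1), repeat=3),
               key=lambda y: objective(y, costs, weight))

costs = (-3.0, 0.5, 1.0)             # edge costs for (ab, bc, ac)
unconstrained = best_labeling(costs, weight=0.0)
consistent = best_labeling(costs, weight=10.0)
```

Without the penalty, the cheapest labeling cuts only edge ab, which does not correspond to any valid partition of the three nodes. With the penalty, the minimizer cuts a second edge and becomes a valid decomposition.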

In International Conference on Computer Vision, October 2019 (inproceedings)

Abstract

We present the first method to perform automatic 3D pose, shape and texture capture of animals from images acquired in-the-wild. In particular, we focus on the problem of capturing 3D information about Grevy's zebras from a collection of images. The Grevy's zebra is one of the most endangered species in Africa, with only a few thousand individuals left. Capturing the shape and pose of these animals can provide biologists and conservationists with information about animal health and behavior. In contrast to research on human pose, shape and texture estimation, training data for endangered species is limited, the animals are in complex natural scenes with occlusion, they are naturally camouflaged, travel in herds, and look similar to each other. To overcome these challenges, we integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation. Going beyond state-of-the-art methods for human shape and pose estimation, our method learns a shape space for zebras during training. Learning such a shape space from images using only a photometric loss is novel, and the approach can be used to learn shape in other settings with limited 3D supervision. Moreover, we couple 3D pose and shape prediction with the task of texture synthesis, obtaining a full texture map of the animal from a single image. We show that the predicted texture map allows a novel per-instance unsupervised optimization over the network features. This method, SMALST (SMAL with learned Shape and Texture), goes beyond previous work, which assumed manual keypoints and/or segmentation, to regress directly from pixels to 3D animal shape, pose and texture. Code and data are available at https://github.com/silviazuffi/smalst

In International Conference on Computer Vision, October 2019 (inproceedings) Accepted

Abstract

Capturing human motion in natural scenarios means moving motion capture out of the lab and into the wild. Typical approaches rely on fixed, calibrated cameras and reflective markers on the body, significantly limiting the motions that can be captured. To make motion capture truly unconstrained, we describe the first fully autonomous outdoor capture system based on flying vehicles. We use multiple micro-aerial vehicles (MAVs), each equipped with a monocular RGB camera, an IMU, and a GPS receiver module. These detect the person, optimize their position, and localize themselves approximately. We then develop a markerless motion capture method that is suitable for this challenging scenario with a distant subject, viewed from above, with approximately calibrated and moving cameras. We combine multiple state-of-the-art 2D joint detectors with a 3D human body model and a powerful prior on human pose. We jointly optimize for 3D body pose and camera pose to robustly fit the 2D measurements. To our knowledge, this is the first successful demonstration of outdoor, full-body, markerless motion capture from autonomous flying vehicles.

International Conference on Computer Vision (ICCV), October 2019 (conference)

Abstract

Large datasets are the cornerstone of recent advances in computer vision using deep learning. In contrast, existing human motion capture (mocap) datasets are small and the motions limited, hampering progress on learning models of human motion. While there are many different datasets available, they each use a different parameterization of the body, making it difficult to integrate them into a single meta dataset. To address this, we introduce AMASS, a large and varied database of human motion that unifies 15 different optical marker-based mocap datasets by representing them within a common framework and parameterization. We achieve this using a new method, MoSh++, that converts mocap data into realistic 3D human meshes represented by a rigged body model. Here we use SMPL [26], which is widely used and provides a standard skeletal representation as well as a fully rigged surface mesh. The method works for arbitrary marker-sets, while recovering soft-tissue dynamics and realistic hand motion. We evaluate MoSh++ and tune its hyper-parameters using a new dataset of 4D body scans that are jointly recorded with marker-based mocap. The consistent representation of AMASS makes it readily useful for animation, visualization, and generating training data for deep learning. Our dataset is significantly richer than previous human motion collections, with more than 40 hours of motion data spanning over 300 subjects and more than 11000 motions, and is available for research at https://amass.is.tue.mpg.de/.

In German Conference on Pattern Recognition (GCPR), September 2019 (inproceedings)

Abstract

Neural networks need big annotated datasets for training. However, manual annotation can be too expensive or even infeasible for certain tasks, like multi-person 2D pose estimation with severe occlusions. A remedy for this is synthetic data with perfect ground truth. Here we explore two variations of synthetic data for this challenging problem: a dataset with purely synthetic humans and a real dataset augmented with synthetic humans. We then study which approach better generalizes to real data, as well as the influence of virtual humans in the training loss. We observe that not all synthetic samples are equally informative for training, and that the informative samples differ at each training stage. To exploit this observation, we employ an adversarial student-teacher framework: the teacher improves the student by providing the hardest samples for its current state as a challenge. Experiments show that this student-teacher framework outperforms all our baselines.

In ACM Symposium on Applied Perception, September 2019 (inproceedings)

Abstract

The creation of realistic self-avatars that users identify with is important for many virtual reality applications. However, current approaches for creating biometrically plausible avatars that represent a particular individual require expertise and are time-consuming. We investigated the visual perception of an avatar’s body dimensions by asking males and females to estimate their own body weight and shape on a virtual body using a virtual reality avatar creation tool. In a method-of-adjustment task, the virtual body was presented in an HTC Vive head-mounted display either co-located with (first-person perspective) or facing (third-person perspective) the participants. Participants adjusted the body weight and dimensions of various body parts to match their own body shape and size. Both males and females underestimated their weight by 10-20% in the virtual body, but the estimates of the other body dimensions were relatively accurate and within a range of ±6%. There was a stronger influence of visual perspective on the estimates for males, but this effect was dependent on the amount of control over the shape of the virtual body, indicating that the results might be caused by where in the body the weight changes expressed themselves. These results suggest that this avatar creation tool could be used to allow participants to make a relatively accurate self-avatar in terms of adjusting body part dimensions, but not weight, and that the influence of visual perspective and amount of control needed over the body shape are likely gender-specific.

In Proceedings of the 8th IFAC Workshop on Distributed Estimation and Control in Networked Systems (NecSys), September 2019 (inproceedings) Accepted

Properly designing a system to exhibit favorable natural dynamics can greatly simplify designing or learning the control policy. However, it is still unclear what constitutes favorable natural dynamics and how to quantify its effect. Most studies of simple walking and running models have focused on the basins of attraction of passive limit cycles and the notion of self-stability. We instead emphasize the importance of stepping beyond basins of attraction. In this paper, we show an approach based on viability theory to quantify robust sets in state-action space. These sets are valid for the family of all robust control policies, which allows us to quantify the robustness inherent to the natural dynamics before designing the control policy or specifying a control objective. We illustrate our formulation using spring-mass models, simple low-dimensional models of running systems. We then show an example application by optimizing robustness of a simulated planar monoped, using a gradient-free optimization scheme. Both case studies result in a nonlinear effective stiffness providing more robustness.

We present a novel robotic front-end for autonomous aerial motion-capture (mocap) in outdoor environments. In previous work, we presented an approach for cooperative detection and tracking (CDT) of a subject using multiple micro-aerial vehicles (MAVs). However, it did not ensure optimal view-point configurations of the MAVs to minimize the uncertainty in the person's cooperatively tracked 3D position estimate. In this article, we introduce an active approach for CDT. In contrast to cooperatively tracking only the 3D positions of the person, the MAVs can actively compute optimal local motion plans, resulting in optimal view-point configurations, which minimize the uncertainty in the tracked estimate. We achieve this by decoupling the goal of active tracking into a quadratic objective and non-convex constraints corresponding to angular configurations of the MAVs w.r.t. the person. We derive this decoupling using Gaussian observation model assumptions within the CDT algorithm. We preserve convexity in optimization by embedding all the non-convex constraints, including those for dynamic obstacle avoidance, as external control inputs in the MPC dynamics. Multiple real robot experiments and comparisons involving 3 MAVs in several challenging scenarios are presented.

In Proceedings of the IEEE World Haptics Conference (WHC), pages: 229-234, Tokyo, Japan, July 2019 (inproceedings)

Abstract

Masking has been used to study human perception of tactile stimuli, including those created on haptic touch screens. Earlier studies have investigated the effect of in-site masking on tactile perception of electrovibration. In this study, we investigated whether it is possible to change the detection threshold of electrovibration at the fingertip of the index finger via remote masking, i.e. by applying a (mechanical) vibrotactile stimulus on the proximal phalanx of the same finger. The masking stimuli were generated by a voice coil (Haptuator). For eight participants, we first measured the detection thresholds for electrovibration at the fingertip and for vibrotactile stimuli at the proximal phalanx. Then, the vibrations on the skin were measured at four different locations on the index finger of subjects to investigate how the mechanical masking stimulus propagated as the masking level was varied. Finally, electrovibration thresholds were measured in the presence of vibrotactile masking stimuli. Our results show that vibrotactile masking stimuli generated sub-threshold vibrations around the fingertip, and hence did not mechanically interfere with the electrovibration stimulus. However, there was a clear psychophysical masking effect due to central neural processes. The absolute electrovibration threshold increased by approximately 0.19 dB for each dB increase in the masking level.

In this study, we develop a high-fidelity finite element (FE) analysis framework that enables multiphysics simulation of the human finger in contact with a surface that is providing tactile feedback. We aim to elucidate a variety of physical interactions that can occur at finger-surface interfaces, including contact, friction, vibration, and electrovibration. We also develop novel FE-based methods that will allow prediction of nonconventional features such as real finger-surface contact area and finger stickiness. We envision using the developed computational tools for efficient design and optimization of haptic devices by replacing expensive and lengthy experimental procedures with high-fidelity simulation.

Using a force-controlled robotic platform, we tested the human perception of positive and negative modulations in normal force during passive dynamic touch, which also induced a strong related change in the finger-surface lateral force. In a two-alternative forced-choice task, eleven participants had to detect brief variations in the normal force compared to a constant controlled pre-stimulation force of 1 N and report whether it had increased or decreased. The average 75% just noticeable difference (JND) was found to be around 0.25 N for detecting the peak change and 0.30 N for correctly reporting the increase or the decrease. Interestingly, the friction coefficient of a subject’s fingertip positively correlated with his or her performance at detecting the change and reporting its direction, which suggests that humans may use the lateral force as a sensory cue to perceive variations in the normal force.

52nd Annual Meeting of the Society for Mathematical Psychology, July 2019 (conference)

Abstract

Stimulus-response learning constitutes an important part of human experience over the life course. Independent of the domain, it is characterized by changes in performance with increasing task progress. But what cognitive mechanisms are responsible for these changes and how do additional task requirements affect the related dynamics? To inspect that in more detail, we introduce a computational modeling approach that investigates performance-related changes in learning situations with reference to chunk activation patterns. It leverages the cognitive architecture ACT-R to model learner behavior in abstract stimulus-response learning in two conditions of task complexity. Additional situational demands are reflected in embedded secondary tasks that interrupt participants during the learning process. Our models apply an activation equation that also takes into account the association between related nodes of information and the similarity between potential responses. Model comparisons with two human datasets (N = 116 and N = 123 participants) indicate a good fit in terms of both accuracy and reaction times. Based on the existing neurophysiological mapping of ACT-R modules on defined human brain areas, we convolve recorded module activity into simulated BOLD responses to investigate underlying cognitive mechanisms in more detail. The resulting evidence supports the connection of learning effects in both task conditions with activation-related patterns to explain changes in performance.
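For orientation, the activation equation referred to above has the following general form in the standard ACT-R theory. This is the textbook formulation, given here as an assumption; the abstract does not state the exact parameterization used in the models.

```latex
% Activation of chunk i: base level + spreading activation + partial matching + noise
A_i = B_i + \sum_{j} W_j \, S_{ji} + \sum_{k} P \, M_{ki} + \varepsilon

% Base-level learning: t_m is the time since the m-th use of chunk i, d is the decay rate
B_i = \ln\!\left( \sum_{m=1}^{n} t_m^{-d} \right)
```

Here $W_j S_{ji}$ captures the association between related nodes of information, and $P\,M_{ki}$ penalizes or rewards a chunk according to its similarity to the requested response.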

In Proceedings of the IEEE World Haptics Conference, pages: 467-472, July 2019 (inproceedings)

Abstract

A typical approach to creating realistic vibrotactile feedback is reducing 3D vibrations recorded by an accelerometer to 1D signals that can be played back on a haptic actuator, but some of the information is often lost in this dimensional reduction process. This paper describes seven representative algorithms and proposes four metrics based on the spectral match, the temporal match, and the average value and variability of these matches across 3D rotations. These four performance metrics were applied to four texture recordings, and the method utilizing the discrete Fourier transform (DFT) was found to be the best regardless of the sensing axis. We also recruited 16 participants to assess the perceptual similarity achieved by each algorithm in real time. We found the four metrics correlated well with the subjectively rated similarities for the six dimensional reduction algorithms, with the exception of taking the 3D vector magnitude, which was perceived to be good despite its low spectral and temporal match metrics.
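Two of the reduction strategies mentioned above can be sketched as follows. The vector-magnitude version is straightforward. The DFT-based version is one plausible formulation, assumed here rather than taken from the paper: it keeps the total spectral energy of the three axes in each frequency bin and takes the phase from the sum of the three spectra. The function names are illustrative.

```python
import numpy as np

def reduce_magnitude(acc):
    """Reduce an (N, 3) acceleration recording to 1D by taking the vector norm."""
    return np.linalg.norm(acc, axis=1)

def reduce_dft(acc):
    """Reduce an (N, 3) recording to 1D in the frequency domain.

    Per frequency bin, the output magnitude is the root of the summed squared
    magnitudes of the three axes, so total signal energy is preserved. The
    phase is taken from the sum of the three complex spectra (an assumption).
    """
    spectra = np.fft.rfft(acc, axis=0)                      # shape (N // 2 + 1, 3)
    magnitude = np.sqrt(np.sum(np.abs(spectra) ** 2, axis=1))
    phase = np.angle(np.sum(spectra, axis=1))
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=acc.shape[0])

rng = np.random.default_rng(0)
acc = rng.standard_normal((256, 3))    # synthetic stand-in for a texture recording
signal_1d = reduce_dft(acc)
```

Note that the vector magnitude is always non-negative, which distorts the spectrum of the original vibration. That is consistent with the low spectral and temporal match scores reported for it above.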

During hugs, humans naturally provide and intuit subtle non-verbal cues that signify the strength and duration of an exchanged hug. Personal preferences for this close interaction may vary greatly between people; robots do not currently have the abilities to perceive or understand these preferences. This work-in-progress paper discusses designing, building, and testing a novel inflatable torso that can simultaneously soften a robot and act as a tactile sensor to enable more natural and responsive hugging. Using PVC vinyl, a microphone, and a barometric pressure sensor, we created a small test chamber to demonstrate a proof of concept for the full torso. While contacting the chamber in several ways common in hugs (pat, squeeze, scratch, and rub), we recorded data from the two sensors. The preliminary results suggest that the complementary haptic sensing channels allow us to detect coarse and fine contacts typically experienced during hugs, regardless of user hand placement.

To understand the adhesive force that occurs when a finger pulls off of a smooth surface, we built an apparatus to measure the fingerpad’s moisture, normal force, and real contact area over time during interactions with a glass plate. We recorded a total of 450 trials (45 interactions by each of ten human subjects), capturing a wide range of values across the aforementioned variables. The experimental results showed that the pull-off force increases with larger finger contact area and faster detachment rate. Additionally, moisture generally increases the contact area of the finger, but too much moisture can restrict the increase in the pull-off force.

Dysgraphia is a neurological disorder characterized by writing disabilities that affects between 7% and 15% of children. It presents itself in the form of unfinished letters, letter distortion, inconsistent letter size, letter collision, etc. Traditional therapeutic exercises require continuous assistance from teachers or occupational therapists. Autonomous partial or full haptic guidance can produce positive results, but children often become bored with the repetitive nature of such activities. Conversely, virtual rehabilitation with video games represents a new frontier for occupational therapy due to its highly motivational nature. Virtual reality (VR) adds an element of novelty and entertainment to therapy, thus motivating players to perform exercises more regularly. We propose leveraging the HTC VIVE Pro and the EXOS Wrist DK2 to create an immersive spellcasting “exergame” (exercise game) that helps motivate children with dysgraphia to improve writing fluency.

In Proceedings of the IEEE World Haptics Conference (WHC), pages: 395-400, Tokyo, Japan, July 2019 (inproceedings)

Abstract

Both vision and touch contribute to the perception of real surfaces. Although there have been many studies on the individual contributions of each sense, it is still unclear how each modality’s information is processed and integrated. To fill this gap, we investigated the similarity of visual and haptic perceptual spaces, as well as how well they each correlate with fingertip interaction metrics. Twenty participants interacted with ten different surfaces from the Penn Haptic Texture Toolkit by either looking at or touching them and judged their similarity in pairs. By analyzing the resulting similarity ratings using multi-dimensional scaling (MDS), we found that surfaces are similarly organized within the three-dimensional perceptual spaces of both modalities. Also, between-participant correlations were significantly higher in the haptic condition. In a separate experiment, we obtained the contact forces and accelerations acting on one finger interacting with each surface in a controlled way. We analyzed the collected fingertip interaction data in both the time and frequency domains. Our results suggest that the three perceptual dimensions for each modality can be represented by roughness/smoothness, hardness/softness, and friction, and that these dimensions can be estimated by surface vibration power, tap spectral centroid, and kinetic friction coefficient, respectively.
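As an illustration of how pairwise judgments can be turned into a low-dimensional perceptual space, the sketch below implements classical (Torgerson) MDS with NumPy. The abstract does not say which MDS variant was used, so this is an assumption, and the ten random 3D points merely stand in for real dissimilarity ratings.

```python
import numpy as np

def classical_mds(dissimilarity, n_dims=3):
    """Embed items in n_dims dimensions from a symmetric dissimilarity matrix."""
    d2 = np.asarray(dissimilarity, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering        # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]      # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(1)
surfaces = rng.standard_normal((10, 3))             # stand-in for ten surfaces
dissimilarity = pairwise_distances(surfaces)
embedding = classical_mds(dissimilarity, n_dims=3)
```

The recovered coordinates match the original configuration only up to rotation and reflection, which is why perceptual dimensions from MDS have to be interpreted afterwards, for example by correlating them with physical measurements.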

We address the unsupervised learning of several interconnected problems in low-level vision: single view depth prediction, camera motion estimation, optical flow, and segmentation of a video into the static scene and moving regions. Our key insight is that these four fundamental vision problems are coupled through geometric constraints. Consequently, learning to solve them together simplifies the problem because the solutions can reinforce each other. We go beyond previous work by exploiting geometry more explicitly and segmenting the scene into static and moving regions. To that end, we introduce Competitive Collaboration, a framework that facilitates the coordinated training of multiple specialized neural networks to solve complex problems. Competitive Collaboration works much like expectation-maximization, but with neural networks that act as both competitors to explain pixels that correspond to static or moving regions, and as collaborators through a moderator that assigns pixels to be either static or independently moving. Our novel method integrates all these problems in a common framework and simultaneously reasons about the segmentation of the scene into moving objects and the static background, the camera motion, depth of the static scene structure, and the optical flow of moving objects. Our model is trained without any supervision and achieves state-of-the-art performance among joint unsupervised methods on all sub-problems.

In this paper, we provide a modern synthesis of the classic inverse compositional algorithm for dense image alignment. We first discuss the assumptions made by this well-established technique, and subsequently propose to relax these assumptions by incorporating data-driven priors into this model. More specifically, we unroll a robust version of the inverse compositional algorithm and replace multiple components of this algorithm using more expressive models whose parameters we train in an end-to-end fashion from data. Our experiments on several challenging 3D rigid motion estimation tasks demonstrate the advantages of combining optimization with learning-based techniques, outperforming the classic inverse compositional algorithm as well as data-driven image-to-pose regression approaches.

Fine-grained temporal action parsing is important in many applications, such as daily activity understanding, human motion analysis, surgical robotics and others requiring subtle and precise operations over long time periods. In this paper, we propose a novel bilinear pooling operation, which is used in intermediate layers of a temporal convolutional encoder-decoder net. In contrast to other work, our proposed bilinear pooling is learnable and hence can capture more complex local statistics than the conventional counterpart. In addition, we introduce exact lower-dimension representations of our bilinear forms, so that the dimensionality is reduced with neither information loss nor extra computation. We perform extensive experiments to quantitatively analyze our model and show superior performance over other state-of-the-art work on various datasets.
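For reference, the conventional (non-learnable) bilinear pooling that the proposed operation generalizes can be sketched as below: within each temporal window, the outer products of the per-frame feature vectors are averaged, yielding second-order statistics. The function name and the non-overlapping windowing are assumptions for illustration. The learnable variant from the paper is not reproduced here.

```python
import numpy as np

def bilinear_pool(features, window):
    """Average outer products of feature vectors over non-overlapping windows.

    features: array of shape (T, C), one C-dimensional feature per time step.
    Returns an array of shape (T // window, C * C).
    """
    t, c = features.shape
    n = t // window
    chunks = features[: n * window].reshape(n, window, c)
    outer = np.einsum("nwi,nwj->nij", chunks, chunks) / window
    return outer.reshape(n, c * c)

features = np.array([[1.0, 2.0],
                     [3.0, 4.0]])
pooled = bilinear_pool(features, window=2)
```

The output dimension grows quadratically with the number of channels, which is the cost that lower-dimensional representations of the bilinear form are meant to avoid.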

The estimation of 3D face shape from a single image must be robust to variations in lighting, head pose, expression, facial hair, makeup, and occlusions. Robustness requires a large training set of in-the-wild images, which, by construction, lack ground truth 3D shape. To train a network without any 2D-to-3D supervision, we present RingNet, which learns to compute 3D face shape from a single image. Our key observation is that an individual’s face shape is constant across images, regardless of expression, pose, lighting, etc. RingNet leverages multiple images of a person and automatically detected 2D face features. It uses a novel loss that encourages the face shape to be similar when the identity is the same and different for different people. We achieve invariance to expression by representing the face using the FLAME model. Once trained, our method takes a single image and outputs the parameters of FLAME, which can be readily animated. Additionally, we create a new database of faces “not quite in-the-wild” (NoW) with 3D head scans and high-resolution images of the subjects in a wide variety of conditions. We evaluate publicly available methods and find that RingNet is more accurate than methods that use 3D supervision. The dataset, model, and results are available for research purposes.
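The idea behind the shape loss can be sketched with a triplet-style hinge, assuming that each image has already been mapped to a vector of shape parameters. This is an illustrative simplification, not the RingNet loss itself: the function name, the squared Euclidean distance, and the margin value are all assumptions.

```python
import numpy as np

def shape_consistency_loss(anchor, same, other, margin=0.5):
    """Hinge loss on shape parameters.

    anchor, same: shape vectors predicted from two images of the same person.
    other: shape vector predicted from an image of a different person.
    The loss is zero once the same-identity pair is closer than the
    different-identity pair by at least the margin.
    """
    pos = np.sum((anchor - same) ** 2)
    neg = np.sum((anchor - other) ** 2)
    return float(max(0.0, pos - neg + margin))

anchor = np.array([0.0, 0.0])
same = np.array([0.1, 0.0])
well_separated = shape_consistency_loss(anchor, same, np.array([1.0, 0.0]))
too_similar = shape_consistency_loss(anchor, same, np.array([0.1, 0.1]))
```

Because the loss only compares predicted shapes with each other, it needs identity labels but no ground-truth 3D scans, which matches the motivation given in the abstract.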
