Technical session talks from ICRA 2012


Human Detection and Tracking

This paper presents a unified probabilistic framework to tackle two closely related visual tasks: pedestrian segmentation and pose tracking in monocular videos. Although the two tasks are complementary in nature, most previous approaches focus on them individually. Here, we resolve the two problems simultaneously by building and inferring a single body model. More specifically, pedestrian segmentation is performed by optimizing the body region under the constraint of body pose in a Markov Random Field (MRF), and pose parameters are inferred through Bayesian filtering, which takes the body silhouette as an observation cue. Since the two processes are inter-related, we resort to an Expectation-Maximization (EM) algorithm to refine them alternately. Additionally, a template matching scheme is utilized for initialization. Experimental results on challenging videos verify the framework's robustness to non-rigid human segmentation, cluttered backgrounds and moving cameras.

This paper presents a hierarchical, two-layer, connectionist-based human-action recognition system (CHARS) as a first step towards developing socially intelligent robots. The first layer is a K-nearest neighbor (K-NN) classifier that categorizes human actions into two classes based on the existence of locomotion, and the second layer consists of two multi-layer recurrent neural networks that distinguish between subclasses within each class. A pyramid of histograms of oriented gradients (PHOG) descriptor is proposed for extracting local and spatial features. The PHOG descriptor drastically reduces the dimensionality of the input space, which results in better convergence for the learning and classification processes. Computer simulations were conducted to illustrate the performance of the proposed CHARS and the role of the temporal factor in solving this problem. The widely used KTH human-action database and the human-action dataset from our lab were utilized for performance evaluation. The proposed CHARS was found to perform better than other existing human-action recognition methods and achieved a 95.55% recognition rate.
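The two-layer routing described above can be sketched as follows. This is only a minimal illustration of the coarse-to-fine structure, not the authors' system: the 2-D feature vectors, the class names, and the threshold-based second-layer functions are hypothetical stand-ins (the abstract uses PHOG features and recurrent neural networks in those roles).

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def classify_action(train, sub_classifiers, features, k=3):
    """Layer 1 picks the coarse class; layer 2 refines it within that class."""
    coarse = knn_predict(train, features, k)
    fine = sub_classifiers[coarse](features)
    return coarse, fine


# Hypothetical toy data: (feature vector, coarse label).
train = [
    ((0.9, 0.8), "locomotion"),
    ((0.8, 0.9), "locomotion"),
    ((1.0, 0.7), "locomotion"),
    ((0.1, 0.2), "stationary"),
    ((0.2, 0.1), "stationary"),
    ((0.0, 0.3), "stationary"),
]

# Stand-ins for the second layer (assumed; the abstract uses two recurrent networks).
sub_classifiers = {
    "locomotion": lambda f: "running" if f[0] > 0.85 else "walking",
    "stationary": lambda f: "waving" if f[1] > 0.15 else "boxing",
}

result = classify_action(train, sub_classifiers, (0.95, 0.75))
print(result)
```

Splitting on locomotion first means each second-layer model only has to separate actions within its own group, which is the motivation the abstract gives for the hierarchy.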

In this paper we address an important issue in human-robot interaction, that of accurately deriving pointing information from a corresponding gesture. Based on the fact that in most applications it is the pointed object rather than the actual pointing direction which is important, we formulate a novel approach which takes into account prior information about the location of possible pointing targets. To decide on the pointed object, the proposed approach uses the Dempster-Shafer theory of evidence to fuse information from two different input streams: head pose, estimated by visually tracking the out-of-plane rotations of the face, and hand pointing orientation. Detailed experimental results are presented that validate the effectiveness of the method in realistic application setups.
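As a rough illustration of how two evidence streams can be fused with Dempster-Shafer theory, the sketch below implements Dempster's rule of combination for two mass functions. The target names and mass values are hypothetical, and the abstract does not say how the authors derive masses from head pose or hand orientation, so those inputs are assumed here.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of candidate targets to its mass.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            # Mass assigned to incompatible hypotheses counts as conflict.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    return {subset: mass / norm for subset, mass in combined.items()}


# Hypothetical candidate targets and masses (assumed values for illustration).
targets = frozenset({"cup", "book", "phone"})
head_pose = {frozenset({"cup", "book"}): 0.7, targets: 0.3}
hand_pointing = {frozenset({"cup"}): 0.6, targets: 0.4}

fused = combine(head_pose, hand_pointing)
best = max(fused, key=fused.get)
print(sorted(best), round(fused[best], 2))
```

Mass placed on the full target set expresses ignorance, which lets a coarse cue such as head pose narrow the candidates without committing to a single object.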

Ensuring that an interaction is initiated with a particular and unsuspecting member of a group is a complex task. As a first step the robot must effectively, expediently and reliably recognise the humans as they carry on with their typical behaviours (in situ). A method is presented for constructing a scale and viewing angle robust feature vector (from analysing a 3D pointcloud) designed to encapsulate the inter-person variations in the size and shape of people's head to shoulder region (Head-to-shoulder signature - HSS). Furthermore, a method for utilising this feature vector as the basis of person recognition via a Support-Vector Machine is detailed. An empirical study was performed in which person recognition was attempted on in situ data collected from 25 participants over 5 days in an office environment. The results report a mean accuracy over the 5 days of 78.15% and a peak accuracy of 100% for 9 participants. Further, the results show a considerably better-than-random (1/23 = 4.5%) result when the participants were: in motion and unaware they were being scanned (52.11%), in motion and facing directly away from the sensor (36.04%), and after variations in their general appearance. Finally, the results show the HSS has considerable ability to accommodate a person's head, shoulder and body rotation relative to the sensor - even in cases where the person is facing directly away from the robot.

Language is a symbolic system unique to human beings. The acquisition of language, which has its meanings in the real world, is important for robots to understand the environment and communicate with us in our daily life. This paper proposes a novel approach to establishing a fundamental framework for robots that can understand language through their whole-body motions. The proposed framework is composed of three modules: "motion symbol", "motion language model", and "natural language model". In the motion symbol module, motion data is symbolized by Hidden Markov Models (HMMs). Each HMM represents an abstract motion pattern, and the HMMs are defined as motion symbols. The motion language model stochastically links motion symbols and words. This model consists of three layers: motion symbols, latent variables, and words. The connections between the motion symbol and the latent state, and between the latent state and the words, are each denoted by a probability. One connection is represented by the probability that the motion symbol generates the latent state, and the other by the probability that the latent state generates the words. Therefore, the motion language model can connect the motion symbols to the words through the latent state. The natural language model stochastically represents sequences of words. In this paper, a bigram, which is a special case of N-gram model, is adopted as the natura
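One way to read the three-layer motion language model is that the probability of a word given a motion symbol is obtained by summing over latent states, and that a bigram model then scores word order. The sketch below follows that reading. It is an assumption about how the two probabilities are combined, and every symbol name, latent state, word, and probability value is hypothetical.

```python
def word_given_symbol(p_latent_given_symbol, p_word_given_latent, symbol):
    """P(word | symbol) = sum over z of P(z | symbol) * P(word | z)."""
    scores = {}
    for z, p_z in p_latent_given_symbol[symbol].items():
        for word, p_w in p_word_given_latent[z].items():
            scores[word] = scores.get(word, 0.0) + p_z * p_w
    return scores


def bigram_probability(sentence, bigram, start="<s>"):
    """Probability of a word sequence under a bigram model (unseen pairs score 0)."""
    prob = 1.0
    previous = start
    for word in sentence:
        prob *= bigram.get((previous, word), 0.0)
        previous = word
    return prob


# Hypothetical tables for one motion symbol and two latent states.
p_latent_given_symbol = {"walk_hmm": {"z0": 0.8, "z1": 0.2}}
p_word_given_latent = {
    "z0": {"walk": 0.9, "run": 0.1},
    "z1": {"walk": 0.3, "run": 0.7},
}
word_scores = word_given_symbol(p_latent_given_symbol, p_word_given_latent, "walk_hmm")

# Hypothetical bigram table keyed by (previous word, current word).
bigram = {("<s>", "person"): 0.5, ("person", "walks"): 0.4}
sentence_prob = bigram_probability(["person", "walks"], bigram)
print(word_scores, sentence_prob)
```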

We describe an integrated, real-time multi-camera surveillance system that is able to find and track individuals, acquire and archive facial image sequences, and perform face recognition. The system is based around an inference engine that can extract high-level information from an observed scene, and generate appropriate commands for a set of pan-tilt-zoom (PTZ) cameras. The incorporation of reliable facial recognition into the high-level feedback is a main novelty of our work, showing how high-level understanding of a scene can be used to deploy PTZ sensing resources effectively. The system comprises a distributed camera system using SQL tables as virtual communication channels, Situation Graph Trees for knowledge representation, inference and high-level camera control, and a variety of visual processing algorithms including on-line acquisition of facial images and on-line recognition of faces by comparing image sets using subspace distance. We provide an extensive evaluation of this method using our system for both acquisition of training data and later recognition. A set of experiments in a surveillance scenario shows the effectiveness of our approach and its potential for real applications of cognitive vision.