
Sequence of the Most Informative Joints (SMIJ): A New Representation for Human Skeletal Action Recognition

Ferda Ofli 1, Rizwan Chaudhry 2, Gregorij Kurillo 1, René Vidal 2 and Ruzena Bajcsy 1
1 Tele-immersion Lab, University of California, Berkeley
2 Center for Imaging Sciences, Johns Hopkins University

Abstract

Much of the existing work on action recognition combines simple features (e.g., joint angle trajectories, optical flow, spatio-temporal video features) with somewhat complex classifiers or dynamical models (e.g., kernel SVMs, HMMs, LDSs, deep belief networks). Although successful, these approaches represent an action with a set of parameters that usually do not have any physical meaning. As a consequence, such approaches do not provide any qualitative insight that relates an action to the actual motion of the body or its parts. For example, it is not necessarily the case that clapping can be correlated to hand motion or that walking can be correlated to a specific combination of motions from the feet, arms and body. In this paper, we propose a new representation of human actions called Sequence of the Most Informative Joints (SMIJ), which is extremely easy to interpret. At each time instant, we automatically select a few skeletal joints that are deemed to be the most informative for performing the current action. The selection of joints is based on highly interpretable measures such as the mean or variance of joint angles, the maximum angular velocity of joints, etc. We then represent an action as a sequence of these most informative joints. Our experiments on multiple databases show that the proposed representation is very discriminative for the task of human action recognition and performs better than several state-of-the-art algorithms.

1. Introduction

Human motion analysis has remained one of the most important areas of research in computer vision. Over the last few decades, a large number of methods have been proposed for human motion analysis.
(See the surveys by Moeslund et al. [12, 13] and, most recently, by Aggarwal and Ryoo [1] for a comprehensive analysis.) In general, all methods use a mathematical representation of human motion and develop algorithms for comparing and classifying different instances of human activities under these representations. A common and intuitive method to represent human motion is to use a sequence of approximate human skeletal configurations. In the past, extracting accurate skeletal configurations from monocular videos was a difficult and unreliable process, especially for arbitrary human poses. Motion capture systems, on the other hand, provide very accurate skeletal configurations of human actions but are limited to laboratory settings. Therefore, methods that relied heavily on accurate skeletal data slowly fell out of favor, and most state-of-the-art activity recognition methods extract spatio-temporal interest points from monocular videos and learn their statistics [7, 8, 9].

Recently, with the release of several low-cost and relatively accurate 3D capturing systems, such as the Microsoft Kinect, 3D data collection and skeleton extraction have become much easier and more practical for applications such as natural human-computer interaction, gesture recognition and animation, thus reviving interest in skeleton-based action representation. Skeleton-based approaches for human activities have primarily focused on modeling the dynamics of either the full skeleton or a combination of limb parts. The majority of the methods use Linear Dynamical Systems (LDS) or Non-Linear Dynamical Systems (NLDS), e.g. in [4, , 3], or Hidden Markov Models (HMM), see e.g. the earlier work by Yamato et al. [17] and a review of several others in [2], to represent the dynamics of normalized 3D positions of joints or joint angle configurations. Recently, Taylor et al. [16, ] proposed using Conditional Restricted Boltzmann Machines (CRBM) to model the temporal evolution of human actions.
While these methods have been very successful for both human activity synthesis and recognition, they represent human motion with a set of dynamics/observation parameters that, in general, do not have a qualitatively interpretable meaning. A key observation that we make is that even though humans perform the same action differently, generating dissimilar joint trajectories, the same set of joints, roughly in the same order, is activated during the performance of these actions. In our approach we take advantage of this observation to capture invariances in human skeletal motion in a given action. We propose finding the relative informativeness of all the joints in a temporal window during an action. A joint is the most informative in a particular temporal window if, for example, it has the highest variance of motion as captured by the change in the joint angle. Such a notion of informativeness is very intuitive and interpretable. Furthermore, the ordered sequence of informative joints in a full skeletal motion implicitly models the temporal dynamics of the motion.

In this paper, we therefore propose a new representation for human motion based on the sequence of the most informative joints. We compare the performance of this representation to several holistic action representations based on histograms of motion words, as well as to methods that explicitly model the dynamics of the skeletal motion. We will show that our simple yet highly intuitive and interpretable representation performs much better than standard methods for the task of action recognition from skeletal motion data.

2. Sequence of the Most Informative Joints (SMIJ)

The human body is an articulated system that can be represented by a hierarchy of joints that are connected with bones, forming the skeleton. Different joint configurations produce different skeletal poses, and a time series of these poses yields the skeletal motion. An action can be simply described as a collection of time series of 3D positions (i.e., 3D trajectories) of the joints in the skeleton hierarchy. This representation, however, lacks important properties such as view- and scale-invariance. A better description is obtained by computing the joint angles between any two connected limbs and using the time series of joint angles as the skeletal motion data. Let a^i denote the joint angle time series of joint i, i.e., a^i = {a^i_t}_{t=1,...,T}, where T is the number of frames in an action sequence. An action sequence can then be seen as a collection of such time-series data from different joints, i.e., A = [a^1 a^2 ... a^J], where J is the number of joints in the skeleton hierarchy.
Hence, A is the T × J matrix of joint angle time series representing an action sequence. Common modeling methods such as LDS or HMM model the evolution of the time series of joint angles. However, instead of directly using the original joint angle time-series data A, one can also extract various types of features from A, such as the mean or variance of the joint angle time series, or the maximum angular velocity of each joint. For the sake of generality, we will denote this operation by O in the remainder of the paper unless an explicit specification is necessary. Here O : R^{T_a} → R is a function that maps a time series of scalar values to a single scalar value. Furthermore, one can extract such features either across the entire action sequence or across smaller segments of the time-series data. The former case describes an action sequence with its global statistics, whereas the latter case emphasizes more the local temporal statistics of an action sequence.

Our hypothesis in this paper is the following: different actions require humans to engage different joints of the skeleton at different intensity (energy) levels at different times. Hence, the ordering of joints based on their level of engagement across time should reveal significant information about the underlying dynamics, in other words, the invariant temporal structure of the action itself. In order to visualize this phenomenon, let us consider the labeled joint angle configuration shown in Figure 1(a) and perform a simple analysis on Dataset #1 (see Section 3.2 for details about the datasets).
The analysis is based on the following steps: i) partition an action sequence into a number of congruent segments; ii) compute the variance of the joint angle time series of each joint over each temporal segment (note that O is defined to be the variance operator in this particular case); iii) rank-order the joints within each segment based on their variance in descending order; iv) repeat the first three steps to get the orderings of joints for all the action sequences in the dataset. Below we investigate the resulting set of joint orderings for different actions.

Figure 1(b) shows the distribution of the top-ranking joints for different actions. We can see that some actions engage only a few joints (e.g., actions 4 (punch) and 6 (wave one)) whereas other actions engage more joints. Nevertheless, the set of the most engaged joints is different for different actions. Joint (RElbow) is the top-ranking joint 4% of the time, followed by joint 9 (RArm) 33% of the time, in action 6 (wave one). Both joints (RElbow) and 13 (LElbow) are the top-ranking joints more than % of the time in action 4 (punch). On the other hand, almost half of the joints appear in the top-ranking position at some point in actions 9 (sit-stand), 10 (sit) and 11 (stand); however, the differences across the sets of engaged joints in each of these three actions are still noticeable. For instance, joint 19 (LKnee) is engaged more in action 9 (sit-stand) than in actions 10 (sit) and 11 (stand). Figure 2 shows the histogram of the top-ranking joints for four different actions. While the differences in the distribution of 1st-, 2nd-, or 3rd-ranking joints, and so on, for actions 4 (punch) and 6 (wave one) are evident, actions 1 (jump) and 2 (jumping jacks) require a closer look at the histograms.
Specifically, even though joints (RKnee) and 19 (LKnee) appear more than 2% of the time as either the 1st- or 2nd-ranking joint for both actions 1 (jump) and 2 (jumping jacks), joints (RElbow) and 13 (LElbow) tend to rank in the top three at least % of the time for action 1 (jump), whereas joints 9 (RArm) and 12 (LArm) tend to rank in the top three for action 2 (jumping jacks). In short, different sets of joints reveal discriminative information about the underlying structure of the action. This is precisely the main observation that motivates us to consider sequences of the top N most informative joints as a new feature representation for human skeletal action recognition.

[Figure 1 appears here. (a) Labeled skeleton. (b) Matrix of top-ranking joints per action, for the actions 1: Jump, 2: Jumping Jacks, 3: Bend, 4: Punch, 5: Wave Two, 6: Wave One, 7: Clap, 8: Throw, 9: Sit Stand, 10: Sit, 11: Stand.] Figure 1. (a) The structure of the skeleton used in Dataset #1 and the corresponding set of 21 joint angles computed from it. (b) Distribution of the top-ranking joints, i.e., the most engaged joints, for different actions in Dataset #1. Nonzero entries in a row show the set of joints that are engaged the most for the corresponding action. Some actions (such as 4 (punch) and 6 (wave one)) can be described easily by only a few joints whereas other actions (such as 9 (sit-stand), 10 (sit) and 11 (stand)) require many more joints.

The new feature representation, which we call Sequence of the Most Informative Joints (SMIJ), has two main components: i) the set of the most informative joints in each time segment, and ii) the temporal evolution of the set of the most informative joints over all of the time segments. To extract this representation from the time-series data, we first partition the action sequence into N_s temporal segments and compute O over each segment. Let a^i_k = {a^i_t}_{t=t^s_k,...,t^e_k} be a segment of a^i, where t^s_k and t^e_k denote the start and end frames of segment k, respectively. Then, an action sequence is written as a collection of features, F = {f_k}_{k=1,...,N_s}, where

f_k = [ O(a^1_k)  O(a^2_k)  ...  O(a^J_k) ].   (1)

The feature function O(a^i_k) provides a measure of information (e.g., the mean or variance of the joint angles, or the maximum angular velocity) of joint i in the temporal segment a^i_k.
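As a concrete illustration, the per-segment feature computation of Equation (1) can be sketched in a few lines of numpy. This is a minimal sketch, not the authors' implementation; all function and variable names are ours:

```python
import numpy as np

def segment_features(A, n_segments, op=np.var):
    """Compute F = {f_k}, k = 1, ..., N_s, where
    f_k = [O(a^1_k) O(a^2_k) ... O(a^J_k)] as in Equation (1).

    A is a (T, J) matrix of joint-angle time series, and op plays the
    role of the scalar feature operator O (here, the variance).
    Returns an (n_segments, J) array F with F[k, i] = O(a^i_k).
    """
    frames = np.arange(A.shape[0])
    # Partition the sequence into N_s congruent temporal segments.
    segments = np.array_split(frames, n_segments)
    return np.vstack([op(A[idx], axis=0) for idx in segments])

# Toy action with T = 100 frames and J = 3 joints; joint 1 moves most.
rng = np.random.default_rng(0)
A = np.column_stack([
    0.1 * rng.standard_normal(100),          # joint 0: nearly still
    np.sin(np.linspace(0, 6 * np.pi, 100)),  # joint 1: large swings
    0.2 * rng.standard_normal(100),          # joint 2: small jitter
])
F = segment_features(A, n_segments=5)        # F has shape (5, 3)
```

With O as the variance, the row-wise argmax of F already identifies the most engaged joint in each segment, which is the quantity tallied in Figure 1(b).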
We then rank-order all the joints in f_k based on the value of O and define the SMIJ features as

SMIJ = {{idof(sort(f_k), n)}_{k=1,...,N_s}}_{n=1,...,N},   (2)

where the sort operator sorts the joints based on their local O score in descending order, the idof(·, n) operator returns the id of the joint that ranks n-th in the joint ordering, and N specifies the number of top-ranking joints included in the representation. In other words, the SMIJ features represent an action sequence by encoding the set of the N most informative joints at a specific time instant (by rank-ordering and keeping the top-ranking N joints) as well as the temporal evolution of the set of the most informative joints throughout the action sequence (by preserving the temporal order of the top-ranking N joints). The resulting feature descriptor is N_s × N-dimensional.

Metrics for SMIJ. Since the proposed representation is a set of sequences over a fixed alphabet (the joints), we use the Levenshtein distance, D_L(S_i, S_j) [], for comparing the SMIJ features of two different sequences, S_i and S_j. The Levenshtein distance measures the amount of difference between two sequences of symbols, such as strings. It is defined as the minimum number of operations required to transform one sequence into the other, where the allowable operations are insertion, deletion, or substitution of a single symbol. We use a normalized version of the Levenshtein distance,

D̄_L(S_i, S_j) = D_L(S_i, S_j) / (N_s · N),   (3)

where N_s is the number of time segments as defined in Equation 2. This allows us to compare pairs of distances that have been computed between two pairs of sequences of different lengths. For example, normalization allows us to say that D̄_L(abab, abad) = 1/4 < D̄_L(ab, ad) = 1/2, whereas the un-normalized version would give the same distance of 1 for both pairs of sequences. The size of the SMIJ feature depends on the number of segments N_s, which depends on how the action sequence is partitioned.
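A possible implementation of the SMIJ encoding of Equation (2) is sketched below. This is illustrative code under our own naming, not the authors' implementation; the toy input is constructed so that the ranking is unambiguous:

```python
import numpy as np

def smij(A, n_segments, top_n, op=np.var):
    """Equation (2): for each rank n = 1, ..., N, record the id of the
    n-th most informative joint in every temporal segment, preserving
    the temporal order of the segments.  A is a (T, J) matrix of
    joint-angle time series and op plays the role of O."""
    sequence = []
    for n in range(top_n):                    # outer index n of Eq. (2)
        for idx in np.array_split(np.arange(A.shape[0]), n_segments):
            f_k = op(A[idx], axis=0)          # f_k as in Equation (1)
            ranking = np.argsort(-f_k)        # sort(f_k), descending O
            sequence.append(int(ranking[n]))  # idof(sort(f_k), n)
    return sequence                           # N_s * N joint ids

# Toy sequence: joint 2 dominates, joint 0 is second, joint 1 is still.
A = np.column_stack([
    0.2 * np.cos(np.linspace(0, 4 * np.pi, 40)),  # joint 0: small swings
    np.zeros(40),                                 # joint 1: motionless
    np.sin(np.linspace(0, 8 * np.pi, 40)),        # joint 2: large swings
])
features = smij(A, n_segments=4, top_n=2)
# features is [2, 2, 2, 2, 0, 0, 0, 0] for this toy input
```

The resulting list of joint ids is exactly the sequence of symbols over the joint alphabet that the Levenshtein metric below operates on.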
The Levenshtein distance between two sequences is at least equal to the difference in lengths of the two sequences. Since we require a distance of zero when two actions have the same rank-ordering, irrespective of their actual temporal length, one natural choice is to fix N s to a constant value for all the action sequences.
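The normalized Levenshtein distance of Equation (3) can be sketched with the standard dynamic-programming recurrence. The code below is illustrative; the n_segments and top_n arguments must match the values used to build the SMIJ sequences:

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence s into sequence t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                           # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                           # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def normalized_levenshtein(s, t, n_segments, top_n):
    # Equation (3): divide by the feature length N_s * N.
    return levenshtein(s, t) / (n_segments * top_n)

# The example from the text (N = 1 for these toy strings):
d1 = normalized_levenshtein("abab", "abad", n_segments=4, top_n=1)  # 1/4
d2 = normalized_levenshtein("ab", "ad", n_segments=2, top_n=1)      # 1/2
```

As in the text, the un-normalized distance is 1 for both pairs, while normalization separates them.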

[Figure 2 appears here: four histograms of the top-ranking joints, one for each of the actions 1 (jump), 2 (jumping jacks), 4 (punch) and 6 (wave one), with the percentage of occurrence on the vertical axis.] Figure 2. Histogram distribution of the top-ranking joints for four actions selected from Dataset #1.

3. Evaluation of Feature Representations

In this section, we compare the proposed Sequence of the Most Informative Joints (SMIJ) features against several other standard feature representations using three different action classification datasets. Baseline feature representations are briefly described in Section 3.1. Details of the datasets are given in Section 3.2. The classification methods used are described in Section 3.3. Finally, the comparative results are presented in Section 3.4.

3.1. Baseline Feature Representations

We first propose an alternative feature representation based on a similar idea to SMIJ. Instead of stacking the top-ranking N joints from all temporal segments into a single sequence of symbols while keeping the temporal order of the joints intact, we create histograms separately for the 1st-ranking joints, 2nd-ranking joints, and so on, from all temporal segments, and then concatenate them into a feature descriptor, called Histograms of the Most Informative Joints (HMIJ), to represent the action sequence, i.e.,

HMIJ = {hist({idof(sort(f_k), n)}_{k=1,...,N_s})}_{n=1,...,N},   (4)

where the hist operator creates a J-bin l1-normalized histogram from the input joint sequence, resulting in a J × N-dimensional feature descriptor. It is important to note that HMIJ features ignore the temporal order of the top-ranking N joints, and hence will be used as a reference to assess the importance of preserving the temporal order information in the feature representation (as SMIJ features do).
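The histogramming in Equation (4) can be sketched as follows (again a minimal illustration under our own naming, not the authors' code):

```python
import numpy as np

def hmij(A, n_segments, top_n, op=np.var):
    """Sketch of Equation (4): for each rank n = 1, ..., N, build a
    J-bin, l1-normalized histogram of which joint ranks n-th across
    the N_s segments, then concatenate the N histograms."""
    J = A.shape[1]
    histograms = []
    for n in range(top_n):
        ids = []
        for idx in np.array_split(np.arange(A.shape[0]), n_segments):
            f_k = op(A[idx], axis=0)              # f_k as in Equation (1)
            ids.append(int(np.argsort(-f_k)[n]))  # idof(sort(f_k), n)
        h = np.bincount(ids, minlength=J).astype(float)
        histograms.append(h / h.sum())            # l1-normalized hist
    return np.concatenate(histograms)             # J * N dimensional

# Toy action: joint 0 always dominates, joint 1 is always second.
A = np.column_stack([
    np.sin(np.linspace(0, 6 * np.pi, 60)),        # joint 0: large swings
    0.1 * np.cos(np.linspace(0, 6 * np.pi, 60)),  # joint 1: small swings
    np.zeros(60),                                 # joint 2: motionless
])
desc = hmij(A, n_segments=5, top_n=2)
# desc = [1, 0, 0, 0, 1, 0]: joint 0 is always 1st, joint 1 always 2nd
```

Because the segment index k is summed out into histograms, any information about when a joint was informative is discarded, which is exactly the temporal order that SMIJ preserves.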
Another popular method for representing an action sequence is based on the bag-of-motion-words model. We first cluster the set of f_k's into K clusters (i.e., motion words) using k-means or k-medoids, and then count the number of motion words that appear in a particular action sequence, yielding the Histogram-of-Motion-Words (HMW) representation. As mentioned earlier, one of the most common techniques for analyzing human motion data is based on modeling the motion with a Linear Dynamical System over the entire sequence, e.g. in [4]; we use the LDS parameters (LDSP) as another alternative feature representation. Even though we do not provide an exhaustive list of all possible feature representations, we believe that these three feature representations, i.e., HMIJ, HMW and LDSP, are comprehensive enough to demonstrate the power of the proposed SMIJ features in terms of discriminability and interpretability for human action recognition.

3.2. Datasets

We evaluate the performance of each feature representation described above on three different human action

datasets of 3D skeleton data. Each dataset has an almost completely distinct set of actions, with different frame rates and different skeleton extraction methods, and hence skeleton data of varying quality.

Dataset #1: We recently collected a dataset that contains 11 actions performed by 12 subjects using an active optical motion capture system (PhaseSpace Inc., San Leandro, CA). The motion data was recorded with 43 active LED markers at 48 Hz. For each subject we collected repetitions of each action, yielding a total of 69 action sequences (after excluding the faulty one). We then extracted the skeleton data by post-processing the 3D optical motion capture data. The action lengths vary from 773 to 146 frames. In our experiments, we used 7 subjects (384 action sequences) for training and subjects (27 action sequences) for testing. The set of actions consisted of jump, jumping jacks, bend, punch, wave one hand, wave two hands, clap, throw, sit down, stand up, and sit down/stand up.

Dataset #2: From the HDM database [14] we used 11 actions performed by subjects. In this dataset, subjects performed each action with a varying number of repetitions, resulting in 21 action sequences in total. In addition to the marker locations, captured at a frequency of 1 Hz, the HDM database also provides the corresponding skeleton data. The duration of the action sequences ranges from 121 to 91 frames. In our experiments, we used 3 subjects (142 action sequences) for training and 2 subjects (9 action sequences) for testing. The set of actions consisted of deposit floor, elbow to knee, grab high, hop both legs, jog, kick forward, lie down floor, rotate both arms backward, sneak, squat, and throw basketball.

Dataset #3: We also evaluated action recognition on the MSR Action3D dataset [11], consisting of skeleton data obtained at Hz from a depth sensor similar to the Microsoft Kinect.
Due to missing or corrupted skeleton data in some of the action sequences, we selected a subset of 17 actions performed by 8 subjects, with 3 repetitions of each action. The subset consisted of 379 action sequences in total, with the duration of the sequences ranging from to 76 frames. We used subjects (226 action sequences) for training and 3 subjects (3 action sequences) for testing. The set of actions included high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side-boxing, forward kick, side kick, jogging, tennis swing, and tennis serve.

3.3. Classification Methods

In this section we examine the quality of the different feature representations by evaluating their classification performance using well-known methods, namely 1-nearest neighbor (1-NN) and support vector machine (SVM) classifiers. Since we are investigating feature descriptors with different characteristics, we need to select the distance metrics according to the feature representations. We use the Levenshtein distance, as explained in Section 2, for classification based on SMIJ. We use the χ2 distance for classification based on the histogram feature representations HMIJ and HMW. Finally, we use the Martin distance [6] as a metric between dynamical systems for classification based on LDSP. For SVM-based classification, we follow a one-vs-one classification scheme and use the Gaussian kernel K(S_i, S_j) = exp(-γ D^2(S_i, S_j)) with an appropriate distance function D(S_i, S_j) depending on the feature type listed above. As for the SVM hyperparameters, we set the regularization parameter C to 1 and the Gaussian kernel parameter γ to the inverse of the mean value of the distances between all training sequences, as in [18]. As proposed in Section 2, we use O to be the variance operator for Datasets #1 and #2 since it provides a good measure of activation (energy) of the skeletal joints.
However, since the frame rate of Dataset #3 is very low, computing the variance over only a few samples did not seem reasonable. Therefore, we defined O to be the maximum angular velocity of the joints for Dataset #3 (which is more informative than the variance over segments with just one or two frames). We choose N_s = 6 for Dataset #1, N_s = for Dataset #2, and N_s = 11 for Dataset #3 when computing the SMIJ and HMIJ features. For HMW, we choose K = . For LDSP, a system order of was used.

3.4. Experimental Results

Table 1 shows the performance of the proposed SMIJ representation. Note that using the N > 1 most informative joints is better than using only the single most informative joint. The best classification performance is obtained for different values of N for different datasets; specifically, 94.91% when N = 6 for Dataset #1, 84.% when N = 2 for Dataset #2, and 33.33% when N = , 6 for Dataset #3. That the best performance is obtained at an intermediate value of N is not unexpected. In general, as N increases, the proposed representation captures more and more information about the action being performed. At the same time, the number of classification parameters increases with N, while the amount of training data remains the same. Therefore, there is a risk of over-fitting when N is large.

Table 2 shows the classification results when using HMIJ for several values of N. Notice that the performance of HMIJ is in general worse than that of SMIJ. This is to be expected, because HMIJ does not capture the temporal order of the sequence of ordered joints and therefore loses discriminability. Table 3 shows classification results for all four feature representations. We choose to compare against SMIJ using N = 2 and HMIJ using N = 4. In general, the best classification results are obtained by SMIJ using N = 2 for the first two datasets. In Dataset #3, however, the classi-
