
Robust and fast algorithms for estimating the pose of a human given an image would have a far-reaching impact on many fields in and outside of computer vision. We address the problem using depth data that can be captured inexpensively using consumer depth cameras such as the Kinect sensor. To achieve robustness and speed on a small training dataset, we formulate the pose estimation task within a regression and Hough voting framework. Our approach uses random regression forests to predict joint locations from each pixel and accumulates these predictions with Hough voting. The Hough accumulator images are treated as likelihood distributions where maxima correspond to joint location hypotheses. We demonstrate our approach and compare to the state-of-the-art on a publicly available dataset.

1

INTRODUCTION

Estimation of human pose is a problem that has received significant attention in recent years. A fast, robust solution to the problem would have wide-ranging impact in gaming, human computer interaction, video analysis, action and gesture recognition, and many other fields. The problem remains a difficult one primarily because the human body is a highly deformable object. Additionally, there is large variability in body shape among the population, image capture conditions, clothing, camera viewpoint and occlusion of body parts (including self-occlusion), and the background is often complex.

In this paper we cast the pose estimation task as a continuous non-linear regression problem. We show how this problem can be effectively addressed by Random Regression Forests (RRFs). Our approach differs from a part-based approach since there are no part detectors at any scale. Instead, the approach is more direct, with features computed efficiently on each pixel used to vote for joint locations. The votes are accumulated in Hough accumulator images and the most likely hypothesis is found by non-maximal suppression.

Figure 1: Overview. Given a single input depth image, evaluate a bank of RRFs for every pixel. The output from each regressor is accumulated in a Hough-like accumulator image. Non-maximal suppression is applied to find the peaks of the accumulator images.

The availability of depth information from real-time depth cameras has simplified the task of pose estimation (Zhu and Fujimura, 2010; Ganapathi et al., 2010; Shotton et al., 2011; Holt et al., 2011) over traditional image capture devices by supporting high accuracy background subtraction, working in low-illumination environments, being invariant to color and texture, providing depth gradients to resolve ambiguities in silhouettes, and providing a calibrated estimate of the scale of the object. However, even with these advantages, there remains much to be done to achieve a pose estimation system that is fast and robust. One of the major challenges is the amount of data required in training to generate high accuracy joint estimates. The recent work of Shotton et al. (Shotton et al., 2011) constructs a training set of approximately two billion samples from one million computer generated depth images. If each value is stored in a 32 bit floating point number, the size of their training set would be 14TB, which is beyond the reach of what most researchers could store or process. Shotton et al. make use of a proprietary distributed training architecture using 1000 cores to train their decision trees.

We propose an approach that is in many ways similar to Shotton et al.'s approach, but requires significantly less data and processing power. Our approach applies advances made using RRFs reported recently in a wide range of computer vision problems. This technique has been demonstrated by Gall and Lempitsky (Gall and Lempitsky, 2009) to offer superior object detection results, and has been used successfully in applications as diverse as the estimation of head pose (Fanelli et al., 2011), anatomy detection and localisation (Criminisi et al., 2011), estimating age based on facial features (Montillo and Ling, 2009) and improving time-of-flight depth map scans (Reynolds et al., 2011). To the best of our knowledge Random Regression Forests have not been applied to pose estimation.

The contributions of this paper are the following. First, we show how RRFs can be combined within a Hough-like voting framework for static pose estimation, and secondly we evaluate the approach against state-of-the-art performance on publicly available datasets. The paper is organised as follows: Section 2 discusses related work, Section 3 develops the theory and discusses the approach, Section 4 details the experimental setup and results and Section 5 concludes.

2

RELATED WORK

A survey of the advances in pose estimation can be found in (Moeslund et al., 2006). Broadly speaking, static pose estimation can be divided into global and local (part-based) pose estimation. Global approaches to discriminative pose estimation include direct regression using Relevance Vector Machines (Agarwal and Triggs, 2006), using a parameter sensitive variant of Locality Sensitive Hashing to efficiently lookup and interpolate between similar poses (Shakhnarovich et al., 2003), using Gaussian Processes for generic structured prediction of the global body pose (Bo and Sminchisescu, 2010) and a manifold based approach using Random Forests trained by clustering similar poses hierarchically (Rogez et al., 2008).

Many of the state of the art approaches to pose estimation use part-based models (Sigal and Black, 2006; Tran and Forsyth, 2010; Sapp et al., 2010). The first part of the problem is usually formulated as an object detection task, where the object is typically an anatomically defined body part (Felzenszwalb and Huttenlocher, 2005; Andriluka et al., 2009) or Poselets (parts that are “tightly clustered in configuration space and appearance space”) (Holt et al., 2011; Bourdev et al., 2010; Wang et al., 2011). The subsequent task of assembly of parts into an optimal configuration is often achieved through a Pictorial Structures approach (Felzenszwalb and Huttenlocher, 2005; Andriluka et al., 2009; Eichner et al., 2009), but also using Bayesian Inference with belief propagation (Singh et al., 2010), loopy belief propagation for cyclical models (Sigal and Black, 2006; Wang and Mori, 2008; Tian and Sclaroff, 2010) or a direct inference on a fully connected model (Tran and Forsyth, 2010).

Work most similar to ours includes:

• Gall and Lempitsky (Gall and Lempitsky, 2009) apply random forests tightly coupled with a Hough voting framework to detect objects of a specific class. The detections of each class cast probabilistic votes for the centroid of the object. The maxima of the Hough accumulator images correspond to most likely object detection hypotheses. Our approach also uses Random Forests, but we use them for regression and not object detection.

• Shotton et al. (Shotton et al., 2011) apply an object categorisation approach to the pose estimation task. A Random Forest classifier is trained to classify each depth pixel belonging to a segmented body as being one of 32 possible categories, where each category is chosen for optimal joint localisation. Our approach will use the same features as (Shotton et al., 2011) since they can be computed very efficiently, but our approach skips the intermediate representation entirely by directly regressing and then voting for joint proposals.

• The work of (Holt et al., 2011) serves as a natural baseline for our approach, since their publicly available dataset is designed for the evaluation of static pose estimation approaches on depth data. They apply an intermediate step in which poselets are first detected, whereas we eliminate this step with better results.

3

PROPOSED APPROACH

The objective of our work is to estimate the configuration of a person in the 2D image plane parameterised by $B$ body parts by making use of a small training set. We define the set of body parts $\mathcal{B} = \{b_i\}_{i=1}^{B}$ where $b_i \in \mathbb{R}^2$ corresponds to the row and column of $b_i$ in the image plane. The labels corresponding to $\mathcal{B}$ comprise $Q = \{\text{head}, \text{neck}, \text{shoulder}_L, \text{shoulder}_R, \text{hip}_L, \text{hip}_R, \text{elbow}_L, \text{elbow}_R, \text{hand}_L, \text{hand}_R\}$ where $|Q| = B$.

The novelty in our approach is twofold. Firstly, our approach is able to learn the relationship between the context around a point $x$ in a training image and the offset to a body part $b_i$. Given a new point $x'$ in a test image, we can use the learned context to predict the offset from $x'$ to $b'_i$. Secondly, since the image features that we use are weak and the human body is highly deformable, we use Hough accumulators as body part likelihood distributions where the most likely hypothesis $\hat{b}_i$ is found using non-maximal suppression.

3.1

Image features

We apply the randomised comparison descriptor of (Amit and Geman, 1997; Lepetit and Fua, 2006; Shotton et al., 2011) to depth images. While this is an inherently weak feature, it is easy to visualise how the feature relates to the image, and when combined with many other features within a non-linear regression framework like Random Regression Forests it yields high accuracy predictions. Given a current pixel location $x$ and random offsets $\phi = (u, v)$, $|u| < w$, $|v| < w$, at a maximum window size $w$, define the feature

$$ f_\phi(I, x) = I\left(x + \frac{u}{I(x)}\right) - I\left(x + \frac{v}{I(x)}\right) \tag{1} $$

where $I(x)$ is the depth value (the range from the camera to the object) at pixel $x$ in image $I$ and $\phi = (u, v)$ are the offset vectors relative to $x$. As explained in (Shotton et al., 2011), we scale the offset vectors by a factor $\frac{1}{I(x)}$ to ensure that the generated features are invariant to depth. Similarly, we also define $I(x')$ to be a large positive value when $x'$ is either background or out of image bounds. The most discriminative features found to predict the head are overlaid on test images in Figure 2. These features make sense intuitively, because in Figure 2(a) the predictions of the row location of the head depend on features that compute the presence or absence of support in the vertical direction and similarly for Figure 2(b) in the horizontal direction.

Figure 3: Random Regression Forest. A forest is an ensemble learner consisting of a number of trees, where each tree contributes linearly to the result. During training, each tree is constructed by recursively partitioning the input space until stopping criteria are reached. The input subregion at each leaf node (shown with rectangles) is then approximated with a constant value that minimises the squared error distance to all labels within that subregion. In this toy example, the single dimension function $f(x)$ is approximated by constant values (shown in different colours) over various regions of the input space.
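The depth comparison feature of Section 3.1 can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `depth_at`, `depth_feature` and `BACKGROUND` are hypothetical, background pixels are assumed to be encoded as depth 0, $u$ and $v$ are assumed to be two-dimensional (row, column) offsets, and scaled offsets are rounded to the nearest pixel.

```python
BACKGROUND = 1e6  # assumed "large positive value" for background or out-of-bounds probes


def depth_at(image, row, col):
    """Depth at (row, col); BACKGROUND outside the image or on background (encoded as 0)."""
    if 0 <= row < len(image) and 0 <= col < len(image[0]) and image[row][col] > 0:
        return image[row][col]
    return BACKGROUND


def depth_feature(image, x, u, v):
    """f_phi(I, x) = I(x + u / I(x)) - I(x + v / I(x)) for 2-D offsets u and v."""
    d = depth_at(image, x[0], x[1])
    # Scale both offsets by 1 / I(x) so the probe pattern shrinks with distance.
    probe_u = (x[0] + int(round(u[0] / d)), x[1] + int(round(u[1] / d)))
    probe_v = (x[0] + int(round(v[0] / d)), x[1] + int(round(v[1] / d)))
    return depth_at(image, *probe_u) - depth_at(image, *probe_v)


# Toy 3x3 depth image: a subject at depth 2.0 with background in the right column.
image = [[2.0, 2.0, 0.0],
         [2.0, 2.0, 0.0],
         [2.0, 2.0, 0.0]]
response = depth_feature(image, (1, 1), (0, 2), (0, -2))
```

In the toy image the first probe lands on background and the second on the subject, so the response is large and positive; when both probes land on the subject the response is the difference of two nearby depths.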

3.2

Random Regression Forests


Figure 2: Image features. The most discriminative feature $\phi$ is that which yields the greatest decrease in mean squared error, and is therefore by definition the feature at the root node of the tree. In (a) the pixel $x$ is shown with the offsets $\phi = (u, v)$ that contribute most to $\text{head}_y$ (the row) and in (b) the offsets $\phi$ that contribute most to $\text{head}_x$ (the column).

A decision tree (Breiman et al., 1984) is a non-parametric learner that can be trained to predict categorical or continuous output labels. Given a supervised training set consisting of $p$ $F$-dimensional vector and label pairs $(S_i, l_i)$ where $S_i \in \mathbb{R}^F$, $i = 1, \ldots, p$ and $l_i \in \mathbb{R}$, a decision tree recursively partitions the data such that impurity in the node is minimised, or equivalently the information gain is maximised through the partition. Let the data at node $m$ be represented by $Q$. For each candidate split $\theta = (j, \tau_m)$ consisting of a feature $j$ and threshold $\tau_m$, partition the data into $Q_{left}(\theta)$ and $Q_{right}(\theta)$ subsets

$$ Q_{left}(\theta) = \{(x, l) \mid x_j \le \tau_m\} $$

Given a continuous target $y$, for node $m$, representing a region $R_m$ with $N_m$ observations, a common criterion $H()$ to minimise is the Mean Squared Error (MSE) criterion. Initially calculate the mean $c_m$ over a region

$$ c_m = \frac{1}{N_m} \sum_{i \in N_m} y_i \tag{6} $$

The MSE is the mean of the squared differences from the mean

$$ H(Q) = \frac{1}{N_m} \sum_{i \in N_m} (y_i - c_m)^2 \tag{7} $$
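The recursion step refers to an optimal split $\theta^*$, but the equations that select it do not appear in the text above. A standard CART-style formulation, given here as an assumption rather than as the authors' exact equations, takes the complementary subset, weights the impurity of each child by its share of the node's samples, and picks the split that minimises the weighted sum. The symbols $G$, $n_{left}$ and $n_{right}$ are introduced here for illustration only.

$$ Q_{right}(\theta) = Q \setminus Q_{left}(\theta) $$

$$ G(Q, \theta) = \frac{n_{left}}{N_m} H(Q_{left}(\theta)) + \frac{n_{right}}{N_m} H(Q_{right}(\theta)) $$

$$ \theta^* = \operatorname{argmin}_{\theta} G(Q, \theta) $$

Here $n_{left}$ and $n_{right}$ are the numbers of samples in the two subsets, so that $n_{left} + n_{right} = N_m$, and $H$ is the MSE criterion defined above.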

Recurse for subsets $Q_{left}(\theta^*)$ and $Q_{right}(\theta^*)$ until the maximum allowable depth is reached, $N_m < \mathrm{min}_{samples}$ or $N_m = 1$. Given that trees have a strong tendency to overfit to the training data, they are often used within an ensemble of $T$ trees. The individual tree predictions are averaged

$$ \hat{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t \tag{8} $$

to form a final prediction with demonstrably lower generalisation errors (Breiman, 2001).
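To make the tree-growing step concrete, the sketch below searches exhaustively for the feature and threshold that minimise the sample-weighted MSE of the two children, and averages per-tree predictions as in the ensemble equation. It is illustrative only: the function names are hypothetical, the weighted-child criterion is the usual CART choice assumed here, and a practical forest would test only a random subset of features and thresholds at each node.

```python
def mse(labels):
    """Mean squared deviation from the mean: the criterion H(Q) in the text."""
    if not labels:
        return 0.0
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels) / len(labels)


def best_split(samples, labels):
    """Return (impurity, feature index j, threshold tau) minimising the weighted child MSE."""
    n = len(samples)
    best = None
    for j in range(len(samples[0])):
        for tau in sorted({s[j] for s in samples}):
            left = [y for s, y in zip(samples, labels) if s[j] <= tau]
            right = [y for s, y in zip(samples, labels) if s[j] > tau]
            if not left or not right:
                continue  # a split must send samples to both children
            g = len(left) / n * mse(left) + len(right) / n * mse(right)
            if best is None or g < best[0]:
                best = (g, j, tau)
    return best


def forest_predict(tree_predictions):
    """Average the individual tree predictions to form the ensemble output."""
    return sum(tree_predictions) / len(tree_predictions)


# Toy data: feature 0 separates the labels perfectly at threshold 1.0, feature 1 is constant.
samples = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
labels = [1.0, 1.0, 3.0, 3.0]
split = best_split(samples, labels)
```

On the toy data the search selects feature 0 with threshold 1.0, which leaves both children with zero MSE; the constant feature 1 never produces a valid split.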

3.3

Hough Voting

Figure 4: Hough accumulator images. The Hough image is a probabilistic parameterisation that accumulates votes cast by the RRFs. The maxima in the parameterised space correspond to the most likely hypotheses in the original space. In this example the Hough accumulator shows the concentration of votes cast for the (b) left shoulder, (c) left elbow and (d) left hand.

Hough voting is a technique that has proved very successful for identifying the most likely hypotheses in a parameter space. It is a distributed approach to optimisation that sums individual responses to an input in a parameter space. The maxima are found to correspond to the most likely hypotheses. Our approach uses the two dimensional image plane as both the input and the parameter space. For each body part $q \in Q$ we define a Hough accumulator $H_q$, where the dimensions of the accumulator correspond to the dimensions of the input image $I$: $H_q \in \mathbb{R}^{I_w \times I_h}$, initialised to $H_q = 0$ for all pixels.

An example of the Hough voting step in our system can be seen in Figure 4, where the final configuration is shown alongside the accumulator images for the left shoulder, elbow and hand. Note that the left shoulder predictions are tightly clustered around the ground-truth location, whereas the left elbow is less certain and the left hand even more so. Nevertheless, the weight of votes in each case is in the correct area, leading to the successful predictions shown in Figure 4(a).
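The voting procedure described in this section can be sketched as follows. Each pixel adds its predicted (row, column) offset to its own location and increments the accumulator there; the hypothesis is then read off as the accumulator maximum. This is a minimal sketch under assumptions: the function names are hypothetical, the offsets would come from the trained regressors, votes that fall outside the image are discarded, and taking a single global maximum stands in for full non-maximal suppression.

```python
def hough_votes(height, width, pixels, offsets):
    """Accumulate one vote per pixel at (pixel + predicted offset)."""
    acc = [[0] * width for _ in range(height)]
    for (row, col), (d_row, d_col) in zip(pixels, offsets):
        r, c = int(round(row + d_row)), int(round(col + d_col))
        if 0 <= r < height and 0 <= c < width:  # drop votes that leave the image
            acc[r][c] += 1
    return acc


def strongest_peak(acc):
    """Return (row, col) of the first accumulator maximum in raster order."""
    best_rc, best_val = (0, 0), -1
    for r, row in enumerate(acc):
        for c, value in enumerate(row):
            if value > best_val:
                best_rc, best_val = (r, c), value
    return best_rc


# Three pixels agree on location (2, 2); the fourth votes outside the 5x5 image.
pixels = [(0, 0), (1, 1), (4, 4), (2, 3)]
offsets = [(2, 2), (1, 1), (-2, -2), (5, 5)]
acc = hough_votes(5, 5, pixels, offsets)
peak = strongest_peak(acc)
```

One accumulator of this kind would be kept per body part, so weak individual predictions only need to agree in aggregate for the maximum to land near the true joint.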

3.4

Training

Before we can train our system, it is necessary to extract features and labels from the training data. Firstly, we generate a dictionary of $F$ random offsets $\{\phi_j = (u_j, v_j)\}_{j=1}^{F}$. Then, we construct our training data and labels. For each image in the training set, a random subset of $P$ example pixels is chosen to ensure that the distribution over the various body parts is approximately uniform. For each pixel $x_p$ in this random subset, the feature vector $S$ is computed as

$$ S = \{f_{\phi_j}(I, x_p)\}_{j=1}^{F} \tag{9} $$

and the offset $o_i \in \mathbb{R}^2$ from every $x_p$ to every body part $q_i$ is

$$ o_i = b_i - x_p \tag{10} $$

The training set is then the set of all training vectors and corresponding offsets. With the training dataset constructed, we train $2B$ RRFs: $R^1_i$, $i \in 1..B$, to estimate the offset to the row of body part $b_i$, and $R^2_i$, $i \in 1..B$, to estimate the offset to the column of body part $b_i$.
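The construction of the training set can be sketched as below. Several choices are assumptions made for illustration: `build_training_set` and its parameters are hypothetical names, `feature_fn` is a caller-supplied function that evaluates one depth feature, background is encoded as depth 0, pixels are sampled uniformly from the foreground (a simplification of sampling that balances body parts), and offsets are stored as body part minus pixel so that adding a predicted offset to a pixel gives the part location.

```python
import random


def build_training_set(images, joints, feature_fn, num_offsets, window, pixels_per_image, seed=0):
    """Return (phis, features, row_targets, col_targets) for training the regressors.

    images: list of 2-D depth images (lists of rows); 0 marks background (assumed).
    joints: for each image, a list of B (row, col) body part locations.
    feature_fn(image, x, u, v): hypothetical callable computing one feature value.
    """
    rng = random.Random(seed)
    lo, hi = 1 - window, window - 1  # keeps every offset component strictly inside the window
    # Dictionary of F random offset pairs phi_j = (u_j, v_j).
    phis = [tuple((rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(2))
            for _ in range(num_offsets)]
    features, row_targets, col_targets = [], [], []
    for image, parts in zip(images, joints):
        foreground = [(r, c) for r, row in enumerate(image) for c, d in enumerate(row) if d > 0]
        for x in rng.sample(foreground, min(pixels_per_image, len(foreground))):
            features.append([feature_fn(image, x, u, v) for u, v in phis])
            # One regression target per body part and axis: offset from pixel to part.
            row_targets.append([b[0] - x[0] for b in parts])
            col_targets.append([b[1] - x[1] for b in parts])
    return phis, features, row_targets, col_targets


# Toy usage: one 2x2 image with three foreground pixels and B = 2 body parts.
images = [[[1.0, 1.0], [0.0, 1.0]]]
joints = [[(0, 0), (1, 1)]]
phis, feats, rows, cols = build_training_set(
    images, joints, lambda image, x, u, v: 0.0, num_offsets=3, window=2, pixels_per_image=10)
```

Column `i` of `row_targets` would then train the row regressor for body part `i`, and column `i` of `col_targets` the corresponding column regressor.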

3.5

Test

Since the output of an RRF is a single valued continuous variable, we let $f(R^{1,2}_i, I, x)$ be a function that evaluates the RRF $R^{1,2}_i$ on image $I$ at pixel $x$. We apply the following algorithm to populate the Hough parameter space $H_q$ $\forall q \in Q$.

Algorithm 1 Compute probability distribution $H_q$
Input: Image $I$
for each pixel $x$ do
    for each label $q_i \in Q$ do
        $o^1_i \Leftarrow R^1_i(x)$
        $o^2_i \Leftarrow R^2_i(x)$
        increment $H_{q_i}(x_{row} + o^1_i, x_{col} + o^2_i)$
    end for
end for

The key idea is that for each pixel in a test image, each RRF will be evaluated to estimate the location of the body part by adding the prediction (which is the offset) to the current pixel.

4

EXPERIMENTAL RESULTS

In this section we evaluate our proposed method and describe the experimental setup and experiments performed. We compare our results to the state-of-the-art (Holt et al., 2011) on a publicly available dataset, and evaluate our results both quantitatively and qualitatively. For each body part $q_i \in Q$, a Hough accumulator likelihood distribution is computed using Algorithm 1. Unless otherwise specified, we construct our training set from 1000 random pixels $x$ per training image $I$, where each sample has $F = 2000$ features $f_\phi(I, x)$. This results in a training set of 5.2GB.

4.1

Dataset

A number of datasets exist for the evaluation of pose estimation techniques on appearance images, for example Buffy (Ferrari et al., 2008) and Image Parse (Ramanan, 2006), but until recently there were no publicly available datasets for depth image pose estimation. CDC4CV Poselets (Holt et al., 2011) appears to be the first publicly available Kinect dataset, consisting of 345 training and 347 test images at 640x480 pixels, where the focus is on capturing the upper body of the subject. The dataset comes with annotations of all the upper body part locations.

4.2

Evaluation

We report our results using the evaluation metric proposed by (Ferrari et al., 2008): “A body part is considered as correctly matched if its segment endpoints lie within r = 50% of the length of the ground-truth segment from their annotated location.” The percentage of times that the endpoints match is then defined as the Percentage of Correctly Matched Parts (PCP). A low value for r requires a very high level of accuracy in the estimation of both endpoints for the match to be correct, and this requirement is relaxed progressively as the ratio r increases to its highest value of r = 50%. In Figure 6 we show the effect of varying r in the PCP calculation, and we report our results at r = 50% in Table 1 as done by (Ferrari et al., 2008) and (Holt et al., 2011). From Table 1 it can be seen that our approach represents an improvement on average of 5% for the forearm, upper arm and waist over (Holt et al., 2011), even though our approach makes no use of kinematic constraints to improve predictions.

In Figure 5(a) we show the effect of varying the maximum depth of the trees. Note how the Random Regression Forest trained on the training set with less data (10 pixels per image) tends to overfit to the data on deeper trees. Figure 5(b) shows the effect of varying the maximum window size w for the offsets φ. Confirming our intuition, a small window has too little context to make an accurate prediction, whereas a very large window has too much context, which reduces performance. The optimal window size is 100 pixels. Example predictions including accurate estimates and failure modes are shown in Figure 7.
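The PCP criterion used in this evaluation can be sketched as follows. This is a minimal illustration, not the evaluation code used for the reported results: the function names are hypothetical, each part is represented as a pair of (row, column) endpoints, and a distance exactly equal to the limit is assumed to count as a match.

```python
import math


def part_matched(pred, truth, r=0.5):
    """True if both predicted endpoints lie within r * ground-truth segment length."""
    (p0, p1), (g0, g1) = pred, truth
    limit = r * math.dist(g0, g1)
    return math.dist(p0, g0) <= limit and math.dist(p1, g1) <= limit


def pcp(predictions, ground_truth, r=0.5):
    """Fraction of parts whose two endpoints are both matched at ratio r."""
    matches = [part_matched(p, g, r) for p, g in zip(predictions, ground_truth)]
    return sum(matches) / len(matches)


# A ground-truth segment of length 10: endpoints may deviate by up to 5 at r = 0.5.
truth = ((0, 0), (10, 0))
good = ((1, 0), (12, 0))   # endpoint errors 1 and 2
bad = ((0, 0), (16, 0))    # second endpoint error 6
score = pcp([good, bad], [truth, truth])
strict = pcp([good, bad], [truth, truth], r=0.1)
```

Lowering `r` tightens the tolerance on both endpoints at once, which is why the score can only stay the same or fall as `r` decreases.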

4.3

Computation Times

Our implementation in Python runs at ∼ 15 seconds per frame on a single core modern desktop CPU. The memory consumption is directly proportional to the number of trees per forest and the maximum depth to which each tree has been trained. At 10 trees per forest and a maximum depth of 20 nodes, the classifier bank uses approximately 4 gigabytes of memory. The code is not optimised, meaning that further speedups could be achieved by parallelising the prediction process since the estimates of each pixel are independent of each other, by reimplementing the algorithm in C/C++, or by making use of an off the shelf graphics card that supports CUDA to run the algorithm in parallel.

Figure 5: Parameter tuning. Experiments on accuracy when (a) the depth of the trees is varied, (b) the maximum offset is varied. [Plots: Average Precision against (a) Maximum depth and (b) Maximum offset, each for 10 pixels per image and 1000 pixels per image.]

Method                Head   Shoulders   Side (L/R)   Waist   Upper arm (L/R)   Forearm (L/R)   Total
(Holt et al., 2011)   0.99   0.78        0.93/0.73    0.65    0.69/0.66         0.22/0.33       0.67
Our method            0.97   0.81        0.82/0.83    0.71    0.74/0.72         0.28/0.37       0.69

Table 1: Percentage of Correctly Matched Parts. Where two numbers are present in a cell, they refer to left/right respectively.

[Plot: PCP Curve. Percentage of Correctly Matched Parts for our method and Holt et al.]

In this paper we have shown how Random Regression Forests can be combined with a Hough voting framework to achieve robust body part localisation with minimal training data. We use data captured with consumer depth cameras and efficiently compute depth comparison features that support our goal of non-linear regression. We show how Random Regression Forests are trained, and then subsequently used on test images with Hough voting to accurately predict joint locations. We demonstrate our approach and compare to the state-of-the-art on a publicly available dataset. Even though our system is implemented in an unoptimised high level language, it runs in seconds per frame on a single core. As future work we plan to apply these results with the temporal constraints of a tracking framework for increased accuracy and temporal coherency. Finally, we would like to apply these results to other areas of cognitive vision such as HCI and gesture recognition.

ACKNOWLEDGMENTS

This work was supported by the EC project FP7-ICT-23113 Dicta-Sign and the EPSRC project EP/I011811/1. Thanks to Eng-Jon Ong and Helen Cooper for their insights and stimulating discussions.