1016 images from the original FLIC are held out as a test set, augmented with the aforementioned motion features.

Experiments were run with several lengths of frame difference (the temporal gap between the two images of each pair).

Warp one image of each pair using the inverse of the best-fitting projection between the pair, to remove camera motion.
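The warped frame-difference step above can be sketched as follows. The notes describe fitting a full projection between the pair; this sketch substitutes a pure-translation camera model (the `shift` argument, assumed to be estimated elsewhere) to keep the example short — it is an illustration of the idea, not the paper's implementation.

```python
import numpy as np

def motion_feature(frame_t, frame_t_plus_d, shift=(0, 0)):
    """Frame-difference motion feature after camera-motion compensation.

    `shift` stands in for the inverse of the best-fitting projection
    (here a simple translation, assumed estimated elsewhere).
    """
    # Warp the later frame back by the estimated camera translation.
    warped = np.roll(frame_t_plus_d, shift=shift, axis=(0, 1))
    # The per-pixel absolute difference then highlights body motion only.
    return np.abs(frame_t.astype(np.float32) - warped.astype(np.float32))
```

With a correctly estimated shift, pure camera motion cancels exactly and only the actor's motion survives in the difference image.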

Convolutional neural network:

Recent work [49, 50] has shown that ConvNet architectures are well suited to the task of human body pose detection.

Due to the availability of modern Graphics Processing Units (GPUs), we can perform Forward Propagation (FPROP) of deep ConvNet architectures at interactive frame-rates.

Similarly, the pose detection model can be realized as a deep ConvNet architecture.

Input: a 3D tensor containing an RGB image and its corresponding motion features.

Output: a 3D tensor containing response-maps, with one response-map for each joint.

Each response-map describes the per-pixel energy for the presence of the corresponding joint at that pixel location.

The detector is based on a sliding-window architecture.

The input patches are first normalized using:

Local Contrast Normalization (LCN [53]) for the RGB channels

A new normalization for the motion features, called Local Motion Normalization (LMN)

Local subtraction of the response of a Gaussian kernel with a large standard deviation, followed by divisive normalization.
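The LMN step can be sketched as below, assuming scipy is available. The subtractive step removes a large-sigma Gaussian-smoothed local mean (suppressing slowly varying background motion), and the divisive step rescales by the local standard deviation; `sigma` and `eps` are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_motion_normalization(motion, sigma=8.0, eps=1e-6):
    """Sketch of LMN: subtractive then divisive local normalization.

    sigma/eps are hypothetical parameters for illustration.
    """
    x = motion.astype(np.float32)
    centered = x - gaussian_filter(x, sigma)           # subtractive step
    local_var = gaussian_filter(centered ** 2, sigma)  # local second moment
    return centered / np.sqrt(local_var + eps)         # divisive step
```

A constant motion field (e.g. uniform camera drift left over after warping) normalizes to zero, which is exactly the background-suppression behavior described above.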

It removes some unwanted background camera motion and normalizes the local intensity of motion.

This helps the network generalize across motions of varying velocity but similar pose.

Prior to processing through the convolution stages, the normalized motion channels are concatenated along the feature dimension with the normalized RGB channels.
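The concatenation step is straightforward; a minimal sketch with numpy, where the channel counts (3 RGB channels, 2 motion channels as for optical flow) are illustrative — the motion channel count depends on which motion feature is used:

```python
import numpy as np

# Hypothetical normalized inputs for one 64x64 patch:
# H x W x 3 RGB after LCN, H x W x 2 motion after LMN.
rgb = np.random.rand(64, 64, 3).astype(np.float32)
motion = np.random.rand(64, 64, 2).astype(np.float32)

# Concatenate along the feature (channel) dimension to form the network input.
net_input = np.concatenate([rgb, motion], axis=-1)  # shape: 64 x 64 x 5
```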

The resulting tensor is processed through 3 stages of convolution:

Rectified linear units (ReLU)

Maxpooling

A single ReLU layer.

The output of the last convolution stage is then passed to a three stage fully-connected neural network.

The network is then applied to all 64 × 64 sub-windows of the image, stepped every 4 pixels horizontally and vertically to produce a dense response-map output, one for each joint.
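The dense sliding-window evaluation can be sketched as follows. Here `detector` is a stand-in for the ConvNet's FPROP on one 64 × 64 patch, returning one energy value per joint; in practice the convolutions would be shared across overlapping windows rather than recomputed per window as in this naive loop.

```python
import numpy as np

def dense_response_map(image, detector, win=64, stride=4):
    """Apply a per-patch detector at every stride-spaced win x win
    sub-window, producing one dense response-map per joint.

    `detector` is a hypothetical callable standing in for ConvNet FPROP.
    """
    H, W = image.shape[:2]
    rows = (H - win) // stride + 1
    cols = (W - win) // stride + 1
    # Probe the detector once to learn the number of joints.
    n_joints = len(detector(image[:win, :win]))
    out = np.zeros((rows, cols, n_joints), np.float32)
    for i in range(rows):
        for j in range(cols):
            y, x = i * stride, j * stride
            out[i, j] = detector(image[y:y + win, x:x + win])
    return out
```

Because the same detector is applied at every window position, the resulting response maps are translation invariant by construction, as noted below.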

The major advantage: the learned detector is translation invariant by construction.

Simple Spatial Model

The test images in FLIC-motion may contain multiple people; however, only a single actor per frame is labeled in the test set.

A rough torso location of the labeled person is provided at test time to help locate the “correct” person.

The rough torso location information is incorporated by means of a simple and efficient Spatial-Model.

The inclusion of this stage has two major advantages:

The correct feature activation from the Part-Detector output is selected for the person for whom a ground-truth label was annotated.

Since the joint locations of each part are constrained in proximity to the single ground-truth torso location, the connectivity between joints is also (indirectly) constrained, enforcing that inferred poses are anatomically viable.
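One simple way to realize such a Spatial-Model is to re-weight each joint's response map by proximity to the given torso location; the Gaussian mask and its `sigma` here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def apply_torso_prior(response_map, torso_yx, sigma=10.0):
    """Sketch of a proximity prior: multiply one joint's response map by a
    Gaussian mask centered on the labeled person's rough torso location,
    down-weighting activations from other, unlabeled people.

    sigma is a hypothetical proximity scale for illustration.
    """
    H, W = response_map.shape
    yy, xx = np.mgrid[0:H, 0:W]
    d2 = (yy - torso_yx[0]) ** 2 + (xx - torso_yx[1]) ** 2
    mask = np.exp(-d2 / (2.0 * sigma ** 2))
    return response_map * mask
```

After masking, the argmax of each response map is biased toward the labeled actor, which is the first of the two advantages listed above.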

Results

Training time for our model on the FLIC-motion dataset (3957 training set images, 1016 test set images) is approximately 12 hours, and FPROP of a single image takes approximately 50 ms (on a 12-core workstation with an NVIDIA Titan GPU).

For the proposed models that use optical flow as a motion feature input, the most expensive part of our pipeline is the optical flow calculation, which takes approximately 1.89s per image pair.

We plan to investigate real-time flow estimation in the future.

Comparison with Other Techniques

Compares the performance of our system with other state-of-the-art models on the FLIC dataset for the elbow and wrist joints:

The proposed detector is able to significantly outperform all prior techniques on this challenging dataset. Note that using only motion features already outperforms [6, 7, 8].

Using only motion features is less accurate than using a combination of motion features and RGB images, especially in the high-accuracy region. This is because fine details such as eyes and noses are missing from motion features.

Toshev et al. [49] suffers from inaccuracy in the high-precision region, which we attribute to inefficient direct regression of pose vectors from images.