READING

Molchanov et al. propose a 3D CNN for hand gesture recognition. The system consists of two networks, a high-resolution network and a low-resolution network, whose class predictions are multiplied at test time. The architecture is illustrated in Figure 1.
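The test-time fusion of the two networks can be sketched as follows; the function name and the example probabilities are hypothetical, but the element-wise multiplication of the two networks' class probabilities (with renormalization) is the idea described above:

```python
import numpy as np

def fuse_predictions(p_high, p_low):
    """Fuse the class probabilities of the high- and low-resolution
    networks by element-wise multiplication, then renormalize so the
    result is again a probability distribution."""
    fused = p_high * p_low
    return fused / fused.sum()

# Hypothetical per-class probabilities from the two networks:
p_high = np.array([0.7, 0.2, 0.1])
p_low = np.array([0.6, 0.3, 0.1])
print(fuse_predictions(p_high, p_low))
```

Note that multiplication acts like a soft AND: a class only keeps a high score if both networks assign it high probability.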

Figure 1: The two employed networks, i.e. the high-resolution network (top) and the low-resolution network (bottom), including all necessary parameters.

While the network architectures are quite simple, the authors perform thorough data augmentation during training. Fortunately, they detail both their training and their data augmentation approaches. For data augmentation they use:

reverse ordering of the frames and horizontal mirroring (computed offline; the remaining augmentations are computed online during training);

spatial rotation, scaling and translation;

spatial elastic deformation;

fixed-pattern dropout, i.e. setting the same (but randomly selected) pixels across all frames to zero;

random dropout;

temporal scaling (of duration) and translation;

temporal elastic deformation (elastic deformation extended to the temporal domain, see the paper for details).
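A few of the simpler augmentations above can be sketched on a clip stored as a `(frames, height, width)` array. This is a minimal illustration, not the authors' implementation; in particular, the nearest-neighbour resampling used for temporal scaling is an assumption on my part:

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_frames(clip):
    # Reverse the temporal order of the frames; clip has shape (T, H, W).
    return clip[::-1]

def mirror_frames(clip):
    # Horizontally mirror every frame.
    return clip[:, :, ::-1]

def fixed_pattern_dropout(clip, p=0.1):
    # Zero the SAME randomly selected pixels across all frames.
    mask = rng.random(clip.shape[1:]) >= p
    return clip * mask

def random_dropout(clip, p=0.1):
    # Zero independently selected pixels in each frame.
    mask = rng.random(clip.shape) >= p
    return clip * mask

def temporal_scale(clip, factor=1.2):
    # Stretch/compress the gesture duration by resampling frame indices
    # (nearest neighbour; the paper's exact interpolation may differ).
    t = len(clip)
    idx = np.clip(np.round(np.arange(t) / factor).astype(int), 0, t - 1)
    return clip[idx]

clip = rng.random((16, 32, 32))
augmented = temporal_scale(random_dropout(mirror_frames(clip)))
print(augmented.shape)  # spatial size and frame count are preserved
```

The spatial and temporal elastic deformations are more involved (smoothed random displacement fields); see the paper for details.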

In experiments, they show that using depth information alone performs better than using intensity data alone. Still, the combination of both outperforms either modality on its own. They also observe that including pre-computed gradients increases final performance.

What is your opinion on the summarized work? Do you know related work that is of interest? Let me know your thoughts in the comments below!