Abstract

Hand gestures are widely used in human-computer and human-robot interfaces. Head-mounted devices rely on gestures for interaction, as seen on HoloLens, Meta, and ARCore/ARKit-enabled smartphones. However, these devices are expensive, mainly because they carry powerful onboard processors and sensors, such as multiple cameras and depth and IR sensors, to process hand gestures. To enable mass-market reach through inexpensive MR headsets without built-in depth or IR sensors, we propose a real-time, in-air gestural framework that works on monocular RGB input alone. The fingertip is used to write in the air, analogous to a pen on paper. The major challenge in training egocentric gesture recognition models is obtaining sufficient labeled data for end-to-end learning. We therefore design a cascade of networks: a CNN with a differentiable spatial to numerical transform (DSNT) layer for fingertip regression, followed by a Bidirectional Long Short-Term Memory (Bi-LSTM) network for real-time classification of pointing hand gestures. The framework takes 1.73 s to run end-to-end and has a low memory footprint of 14 MB, which facilitates porting to a smartphone, while achieving an accuracy of 88.0% on an egocentric video dataset.
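
To make the role of the DSNT layer concrete, the following is a minimal NumPy sketch of the general DSNT idea: a heatmap is normalized with a softmax and the fingertip coordinate is read out as the expected value over a normalized coordinate grid, which keeps the regression differentiable. The function name `dsnt`, the heatmap size, and the coordinate range of (-1, 1) are assumptions made for illustration; this is not the implementation used in this work.

```python
import numpy as np

def dsnt(heatmap_logits):
    """Illustrative DSNT: map an (H, W) heatmap to normalized (x, y) in (-1, 1)."""
    h, w = heatmap_logits.shape
    # Numerically stable softmax over all pixels gives a probability map.
    z = heatmap_logits - heatmap_logits.max()
    p = np.exp(z)
    p /= p.sum()
    # Pixel-center coordinates, normalized so that they lie inside (-1, 1).
    xs = (2.0 * np.arange(w) + 1.0 - w) / w
    ys = (2.0 * np.arange(h) + 1.0 - h) / h
    # Expected coordinate: marginalize the probability map along each axis.
    x = float((p.sum(axis=0) * xs).sum())
    y = float((p.sum(axis=1) * ys).sum())
    return x, y

# Hypothetical 4x4 heatmap with a sharp peak at row 1, column 2.
logits = np.zeros((4, 4))
logits[1, 2] = 50.0
x, y = dsnt(logits)
```

In a cascade such as the one described above, the per-frame coordinates produced this way would form the sequence passed to the Bi-LSTM classifier.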