Continuous Gesture Segmentation and Recognition using 3DCNN and Convolutional LSTM

Abstract

Continuous gesture recognition aims to recognize ongoing gestures in continuous gesture sequences. It is better suited to practical applications, where the start and end frames of each gesture instance are generally unknown. This paper presents an effective deep architecture for continuous gesture recognition. First, continuous gesture sequences are segmented into isolated gesture instances using the proposed temporal dilated Res3D network. A balanced squared hinge loss function is proposed to deal with the imbalance between boundaries and non-boundaries. Temporal dilation preserves the temporal information needed for dense, fine-grained detection of the boundaries, and the large temporal receptive field makes the segmentation results more reasonable and effective. Then, a recognition network for isolated gestures is constructed from a 3D convolutional neural network (3DCNN), a convolutional Long Short-Term Memory network (ConvLSTM), and a 2D convolutional neural network (2DCNN). This "3DCNN-ConvLSTM-2DCNN" architecture is more effective at learning long-term and deep spatiotemporal features. Together, the proposed segmentation and recognition networks obtain a Jaccard index of 0.7163 on the ChaLearn LAP ConGD dataset, which is 0.106 higher than the winner of the 2017 ChaLearn LAP Large-scale Continuous Gesture Recognition Challenge.
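
The Jaccard index reported in the abstract measures the overlap between predicted and ground-truth gesture frames. As a minimal sketch, assuming each gesture instance is described by an inclusive (start, end) frame interval, the overlap for a single instance could be computed as below. The function name `frame_jaccard` and the interval representation are illustrative assumptions; the challenge's official score additionally averages over gesture classes and sequences, which is not reproduced here.

```python
def frame_jaccard(true_span, pred_span):
    """Jaccard index of two inclusive frame intervals given as (start, end)."""
    # Expand each interval into the set of frame indices it covers.
    true_frames = set(range(true_span[0], true_span[1] + 1))
    pred_frames = set(range(pred_span[0], pred_span[1] + 1))
    union = true_frames | pred_frames
    if not union:
        # Both intervals are empty (start > end); define the overlap as zero.
        return 0.0
    # Intersection over union of the covered frames.
    return len(true_frames & pred_frames) / len(union)


# Hypothetical example: ground truth covers frames 10..49, prediction 20..59.
score = frame_jaccard((10, 49), (20, 59))
print(score)
```

In this hypothetical example the two intervals share 30 frames out of 50 covered in total, so a segmentation boundary that is off by ten frames on each side already lowers the score noticeably. This illustrates why the metric rewards accurate boundary detection.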
