Keyword Spotting Based On Decision Fusion

Automatic speech recognition (ASR) technology is now available in virtually all handsets, where keyword spotting plays a vital role. Keyword spotting performance degrades significantly in real-world environments due to background noise. Since visual features are largely unaffected by acoustic noise, they offer a complementary cue and thus a better solution. In this paper, an audio-visual integration scheme is proposed that combines audio features with visual features, with decision fusion used to adapt to varying noise conditions. Visual features are extracted using both geometry-based and appearance-based features for facial landmark localization. To avoid similarities among the textons, a spatiotemporal lip feature (SPTLF) is used, which maps the features into an intra-class subspace. The dimensionality of the lip features is reduced using whitened PCA (WPCA). A hybrid HMM-ANN method is proposed for integrating the audio and visual features: adaptive fusion weights are generated by a neural network. A parallel two-step keyword spotting strategy is provided to avoid overlap between audio and visual keywords. Experimental results on the dataset demonstrate that the proposed HMM-ANN method outperforms state-of-the-art networks.
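The adaptive decision-fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the network size, the use of an SNR estimate as the reliability cue, and all function names (`FusionWeightNet`, `fuse_scores`) are assumptions, and the network weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class FusionWeightNet:
    """Tiny one-hidden-layer network mapping an audio reliability cue
    (e.g., an SNR estimate in dB) to a fusion weight lambda in (0, 1).
    Hypothetical stand-in for the paper's ANN; weights are untrained."""
    def __init__(self, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(size=(1, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(size=(hidden, 1))
        self.b2 = np.zeros(1)

    def weight(self, snr_db):
        h = np.tanh(np.array([snr_db]) @ self.W1 + self.b1)
        return float(sigmoid(h @ self.W2 + self.b2))

def fuse_scores(audio_loglik, visual_loglik, lam):
    """Weighted decision fusion of per-keyword log-likelihood scores:
    lam close to 1 trusts the audio (HMM) stream, close to 0 the visual one."""
    return lam * audio_loglik + (1.0 - lam) * visual_loglik

net = FusionWeightNet()
audio = np.array([-12.0, -9.5, -15.2])   # illustrative HMM scores per keyword
visual = np.array([-11.0, -13.0, -10.5]) # illustrative visual-stream scores
lam = net.weight(snr_db=5.0)             # low SNR should shift weight to visual
fused = fuse_scores(audio, visual, lam)
best = int(np.argmax(fused))             # index of the detected keyword
```

In a trained system, the weight network would be fit so that low-SNR inputs yield a small lambda, letting the visual stream dominate exactly when the audio stream is unreliable.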