In this paper, we investigate the use of articulatory information, and more specifically real-time Magnetic Resonance Imaging (rtMRI) data of the vocal tract, to improve speech recognition performance. For our experiments, we use data from the rtMRI-TIMIT database. First, Scale-Invariant Feature Transform (SIFT) features are extracted from each video frame. The SIFT descriptors of each frame are then aggregated into a single histogram per frame using the Bag of Visual Words methodology.
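
A minimal sketch of such a SIFT-plus-Bag-of-Visual-Words pipeline, assuming OpenCV and scikit-learn; the synthetic `frames` and the toy codebook size are illustrative stand-ins, not the paper's data or configuration:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for grayscale rtMRI video frames (hypothetical input).
frames = []
for i in range(4):
    img = np.zeros((128, 128), np.uint8)
    cv2.circle(img, (40 + 5 * i, 64), 20, 255, -1)  # a moving blob
    frames.append(img)

# Extract SIFT descriptors for each frame.
sift = cv2.SIFT_create()
descriptors = [sift.detectAndCompute(f, None)[1] for f in frames]
descriptors = [d for d in descriptors if d is not None]

# Learn a visual vocabulary over all descriptors, then map each frame's
# descriptors to one fixed-length histogram of visual words.
all_desc = np.vstack(descriptors)
k = min(16, len(all_desc))  # toy codebook; real vocabularies are far larger
codebook = KMeans(n_clusters=k, n_init=4).fit(all_desc)

def bovw_histogram(desc):
    words = codebook.predict(desc)                     # nearest visual word
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()                           # L1-normalize

histograms = [bovw_histogram(d) for d in descriptors]  # one per frame
```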

Speech recognition in digital assistants such as Google Assistant can potentially benefit from conversational context consisting of user queries and the agent's responses. We explore the use of recurrent Long Short-Term Memory (LSTM) neural language models (LMs) to model conversations with a digital assistant. Our proposed methods effectively capture the context of previous utterances in a conversation without modifying the underlying LSTM architecture. We demonstrate a 4% relative improvement in recognition performance.

We study the problem of direction-of-arrival (DOA) estimation for arbitrary antenna arrays. We formulate it as a continuous line spectral estimation problem and solve it under a sparsity prior without any gridding assumptions. Moreover, we incorporate the array's beampattern in the form of the Effective Aperture Distribution Function (EADF), which allows arbitrary (synthetic as well as measured) antenna arrays to be used. This generalizes known atomic-norm-based grid-free DOA estimation methods.
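
As a rough illustration of the EADF idea (not the paper's estimator): since the beampattern is periodic in azimuth, a DFT over sampled angles yields Fourier coefficients from which the array response can be evaluated at arbitrary, off-grid directions. The sketch below uses a toy 4-element ULA as a stand-in for measured calibration data:

```python
import numpy as np

# Toy stand-in for a measured beampattern: 4-element half-wavelength ULA,
# sampled at Q uniform azimuth angles (a real EADF would use measured data).
Q = 360
angles = 2 * np.pi * np.arange(Q) / Q
pos = 0.5 * np.arange(4)  # element positions in wavelengths
B = np.exp(2j * np.pi * pos[:, None] * np.sin(angles)[None, :])

# EADF: Fourier coefficients of each element's response over azimuth.
G = np.fft.fftshift(np.fft.fft(B, axis=1), axes=1) / Q
m = np.arange(Q) - Q // 2  # spatial harmonic indices

def steering(theta):
    """Array response at an arbitrary angle, interpolated via the EADF."""
    return G @ np.exp(1j * m * theta)

# Reproduces the sampled pattern exactly at the measurement angles.
assert np.allclose(steering(angles[10]), B[:, 10])
```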

In this paper, we collect a trivial event speech database that involves 75 speakers and 6 types of events, and report preliminary speaker recognition results on this database, by both human listeners and machines. In particular, the deep feature learning technique recently proposed by our group is used to analyze and recognize the trivial events, leading to acceptable equal error rates (EERs) ranging from 5% to 15% despite the extremely short durations (0.2-0.5 seconds) of these events. Among the different event types, 'hmm' appears to be the most speaker-discriminative.
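
For reference, the reported EER is the operating point where the false-acceptance and false-rejection rates coincide; a small sketch of how it is commonly computed from trial scores (using scikit-learn's ROC, not necessarily the authors' scoring code):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: threshold where false-acceptance rate equals false-rejection rate.

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    scores: higher means more likely to be the same speaker.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fpr - fnr))   # closest crossing on the ROC
    return 0.5 * (fpr[i] + fnr[i])

# Toy trials with one target/non-target score swap: prints ~0.333.
print(equal_error_rate([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]))
```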

As we are surrounded by an increasing number of mobile devices equipped with wireless links and multiple microphones, e.g., smartphones, tablets, laptops, and hearing aids, using them collaboratively for acoustic processing forms a promising platform for emerging applications.

Deep learning is still not a common tool in the speaker verification field. We study the performance of deep convolutional neural networks on the text-prompted speaker verification task. The prompted passphrase is segmented into word states (i.e., digits) so that each digit utterance can be tested separately. We train a single high-level feature extractor for all states and use the cosine similarity metric for scoring. The key feature of our network is the Max-Feature-Map activation function, which acts as an embedded feature selector.
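
A brief PyTorch sketch of the Max-Feature-Map idea (layer sizes are illustrative): the activation splits the channel dimension in half and keeps the elementwise maximum, so competing feature maps act as a built-in feature selector while halving the channel count:

```python
import torch
import torch.nn as nn

class MaxFeatureMap(nn.Module):
    # Split channels into two halves and keep the elementwise maximum;
    # the "losing" half of the features is suppressed at each position.
    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)
        return torch.max(a, b)

# Illustrative convolutional block: 32 conv channels -> 16 after MFM.
block = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    MaxFeatureMap(),
)
x = torch.randn(1, 1, 40, 100)   # e.g., a spectrogram-like input
print(block(x).shape)            # torch.Size([1, 16, 40, 100])
```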