Low-Latency Sound Source Separation Using Deep Neural Networks

Low-latency sound source separation requires that each
incoming frame of audio data be processed with very little
delay and output as soon as possible. For practical
applications involving human listeners, an algorithmic delay
of 20 ms is the upper limit that remains comfortable to the listener.
In this paper, we propose a low-latency (algorithmic delay
≤ 20 ms) deep neural network (DNN) based source separation
method. The proposed method takes advantage of an extended
past context and outputs soft time-frequency masking filters,
which are then applied to incoming audio frames to give better
separation performance than a non-negative matrix factorization
(NMF) baseline. Acoustic mixtures from five pairs of speakers from the
CMU Arctic database were used for the experiments. An
average improvement of at least 1 dB in source-to-distortion
ratio (SDR) was observed for our DNN-based system over a
low-latency NMF baseline across different processing and
analysis frame lengths. Incorporating previous temporal
context into the DNN inputs yielded significant improvements in
SDR for short processing frame lengths.
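
The frame-wise processing described above can be illustrated with a minimal sketch. It is not the network or feature pipeline used in this paper: `placeholder_mask_model` is a hypothetical stand-in for a trained DNN, and the function names, the Hann window, the context length of 4 frames, and the 16 kHz sampling rate are all assumptions made for illustration. The sketch only shows the general idea: each incoming frame is combined with a buffer of past frames, a soft mask with values in [0, 1] is predicted, and the mask is applied to the current frame without any look-ahead. Overlap-add synthesis is omitted.

```python
import numpy as np
from collections import deque


def frame_delay_ms(frame_len, sample_rate):
    """Delay from buffering one full frame, in ms (assumes no look-ahead)."""
    return 1000.0 * frame_len / sample_rate


def make_context_input(history, n_context, n_bins):
    """Stack the buffered magnitude spectra, oldest first, zero-padded at start-up."""
    pad = [np.zeros(n_bins)] * (n_context - len(history))
    return np.concatenate(pad + list(history))


def placeholder_mask_model(features, n_bins):
    """Hypothetical stand-in for a trained DNN: one soft mask value per bin.

    It applies a sigmoid to the centred log-magnitude of the newest frame only;
    a real model would learn the mapping from the whole context vector.
    """
    logmag = np.log1p(features[-n_bins:])
    return 1.0 / (1.0 + np.exp(-(logmag - logmag.mean())))


def separate_stream(frames, n_context=4, model=placeholder_mask_model):
    """Mask frames one at a time using only the current and past frames."""
    frame_len = frames.shape[1]
    n_bins = frame_len // 2 + 1
    window = np.hanning(frame_len)
    history = deque(maxlen=n_context)  # current frame plus n_context - 1 past frames
    outputs = []
    for frame in frames:
        spectrum = np.fft.rfft(frame * window)
        history.append(np.abs(spectrum))
        features = make_context_input(history, n_context, n_bins)
        mask = model(features, n_bins)  # soft time-frequency mask in [0, 1]
        outputs.append(np.fft.irfft(mask * spectrum, n=frame_len))
    return np.array(outputs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mixture_frames = rng.standard_normal((10, 320))  # 320 samples per frame
    separated = separate_stream(mixture_frames)
    print(separated.shape, frame_delay_ms(320, 16000), "ms")
```

Under these assumptions, a 320-sample frame at 16 kHz corresponds to the 20 ms limit, and lengthening the past context increases the DNN input size without adding to the delay, since no future frames are needed.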