Deep neural networks (DNNs) have been successfully employed
for the problem of monaural sound source separation, achieving
state-of-the-art results. In this paper, we propose using a convolutional
recurrent neural network (CRNN) architecture for tackling
this problem. We focus on a scenario where low algorithmic delay
(≤ 10 ms) is paramount, and relatively little training data is available.
We show that the proposed architecture can achieve slightly
better performance than feedforward DNNs and long
short-term memory (LSTM) networks. In addition to reporting
separation performance metrics (i.e., source-to-distortion ratios),
we also report extended short-time objective intelligibility (ESTOI)
scores, which better predict intelligibility in the presence
of non-stationary interferers.