Convolutional neural networks for acoustic scene classification

In this thesis we investigate the use of deep neural networks in the field of computational audio scene analysis, and in particular for acoustic scene classification. This task concerns the recognition of an acoustic scene, such as a park or a home, by an artificial system. Our aim is to contribute to what is, in our opinion, one of the most poorly explored use cases of deep models.
The neural architecture we propose in this work is a convolutional neural network specifically designed to operate on a time-frequency audio representation known as the log-mel spectrogram. The network output is an array of prediction scores, one for each class in a set of 15 predefined classes. In addition, the architecture features batch normalization, a recently proposed regularization technique used to improve network performance and to speed up training.
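To make the batch normalization step concrete, the following is a minimal sketch of its training-time forward pass in plain Python. It is not the implementation used in this thesis: the function name `batch_norm`, the list-of-lists data layout, and the value of `eps` are assumptions made for illustration. In a real network `gamma` and `beta` are learned parameters, and running statistics are kept for use at test time.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift it.

    batch: list of samples, each a list of floats of equal length.
    gamma, beta: per-feature scale and shift (hypothetical fixed values
    here; learned by gradient descent in a real network).
    """
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [sample[j] for sample in batch]
        mean = sum(column) / n
        # Biased (population) variance of this feature over the batch.
        var = sum((x - mean) ** 2 for x in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma[j] * (column[i] - mean) * inv_std + beta[j]
    return out


# Two samples with two features on very different scales.
normalized = batch_norm([[1.0, 10.0], [3.0, 30.0]],
                        gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

After normalization both features have zero mean and approximately unit variance over the batch, regardless of their original scale.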
We also investigate the use of different audio sequence lengths as the classification unit for our network. These experiments show that, for our artificial system, long sequences are not easier to recognize than medium-length ones, which is a counterintuitive behaviour. Moreover, we introduce a training procedure that aims to make the best of small datasets by using all the available labeled data for network training. This procedure, possible under particular circumstances, constitutes a trade-off between an accurate stopping point for training and a richer data representation available to the network. Finally, we compare our model to other systems, showing that its recognition ability can exceed that of both other neural architectures and other state-of-the-art statistical classifiers, such as support vector machines and Gaussian mixture models.
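The idea of a sequence length as classification unit can be illustrated with a short sketch. It assumes, hypothetically, that a recording is cut into consecutive non-overlapping sequences of a fixed number of frames, and that the per-sequence prediction scores are averaged to label the whole recording. Both choices, and the helper names `split_into_sequences` and `classify_clip`, are illustrative assumptions and not necessarily the procedure adopted in this thesis. Two classes are used instead of 15 for brevity.

```python
def split_into_sequences(frames, seq_len):
    """Cut frames into consecutive, non-overlapping sequences.

    An incomplete sequence at the end of the recording is dropped.
    """
    n_full = len(frames) // seq_len
    return [frames[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def classify_clip(sequence_scores):
    """Average per-sequence score vectors and pick the best class.

    sequence_scores: one list of class scores per sequence.
    Returns (index of the winning class, averaged scores).
    """
    n = len(sequence_scores)
    n_classes = len(sequence_scores[0])
    avg = [sum(scores[c] for scores in sequence_scores) / n
           for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: avg[c])
    return best, avg


# Ten dummy frames cut into sequences of three frames each.
sequences = split_into_sequences(list(range(10)), seq_len=3)

# Hypothetical network outputs for three sequences and two classes.
best_class, avg_scores = classify_clip([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
```

Changing `seq_len` changes how much audio the network sees per decision, which is the quantity varied in the experiments described above.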
The proposed system reaches good accuracy scores on two different databases collected in 2013 and 2016. The best accuracy scores, obtained according to two cross-validation setups, are 77% and 79% respectively. These scores represent accuracy increases of 22% and 6.1% over the corresponding baselines published together with the datasets.