Abstract

Bird audio detection (BAD) aims to determine whether
an audio recording contains a bird call. One difficulty of
this task is that bird sound datasets are weakly labelled: only
the presence or absence of a bird in a recording is known,
not when the birds call. We propose to apply a joint
detection and classification (JDC) model to the weakly labelled
data (WLD) to detect and classify an audio clip at the same time.
First, we apply a VGG-like convolutional neural network (CNN)
to the mel spectrogram as a baseline. Then we propose a JDC-CNN
model with VGG as the classifier and a CNN as the detector. We report
that denoising methods, including optimally-modified log-spectral
amplitude (OM-LSA), median filtering and spectral spectrogram,
worsen the classification accuracy, contrary to previous
work. JDC-CNN can predict the time stamps of events from
weakly labelled data, and can therefore perform sound event detection from
WLD. We obtained an area under the curve (AUC) of 95.70% on the
development data and 81.36% on the unseen evaluation data,
which is nearly comparable to the baseline CNN model.
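
To illustrate how a detector and a classifier can be combined under weak labels, the following is a minimal sketch and not the model evaluated here. It assumes that the classifier and the detector each emit one probability per time frame, and that the clip-level prediction is the detector-weighted average of the classifier outputs; this pooling rule, the function names and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np


def jdc_clip_probability(classifier_out, detector_out, eps=1e-8):
    """Pool frame-level outputs into one clip-level probability.

    Assumed pooling rule: the detector outputs are normalised over time
    and used as weights for the frame-level classifier outputs.
    """
    classifier_out = np.asarray(classifier_out, dtype=float)
    detector_out = np.asarray(detector_out, dtype=float)
    weights = detector_out / (detector_out.sum() + eps)
    return float(np.sum(weights * classifier_out))


def active_frames(detector_out, threshold=0.5):
    """Return indices of frames the detector marks as containing an event.

    The threshold value is a hypothetical choice.
    """
    return np.flatnonzero(np.asarray(detector_out, dtype=float) >= threshold)


# Toy frame-level outputs for a five-frame clip (made-up numbers).
classifier_out = np.array([0.1, 0.2, 0.9, 0.8, 0.1])
detector_out = np.array([0.0, 0.1, 0.9, 0.9, 0.1])

clip_prob = jdc_clip_probability(classifier_out, detector_out)
frames = active_frames(detector_out)
```

Only the clip-level value would be compared against the weak label during training, while the frame-level detector output is what yields time stamps at inference.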