0. Remind me, why am I reading this?

Actually, I have no idea, but last time we left off listening to random bird songs and looking at their spectrograms. This time we need to properly pre-process all the data to feed it to our CNN (convolutional neural network). In this article I will not bore you with implementation details and will point out only the most important code snippets.

1. Analyze which data we managed to download

Well, at the proof of concept stage we now have a nice dataframe of shape (9277, 34), indicating that we have ca. 9k bird songs.

Song length distribution looks like this (log10 scale)

File size distribution looks like this (log10 scale), which in its turn means that a third of the full dataset (the most representative data) will weigh ca. 50-100GB

Top bird class distribution

2. Test SVD / NMF for sound vs. noise decomposition

In the recent computational linear algebra course by fast.ai I stumbled upon a list of matrix decomposition methods in lesson 2 that are also used for noise / voice separation. Let's try to apply them here and see the result. (For more fast.ai material, consider their online textbook and series of online videos.)

Also NMF produces only non-negative values and requires a non-negative input matrix (which is really useful for real-world applications);

In NMF you explicitly set the number of components, while SVD gives you the full decomposition;

SVD

To quote fast.ai, SVD is an exact decomposition, since the matrices it creates are big enough to fully cover the original matrix. SVD is extremely widely used in linear algebra, and specifically in data science.
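As a quick sanity check of the "exact" claim, here is a minimal numpy sketch (the spectrogram here is random placeholder data): rebuilding the matrix from the full SVD recovers it up to floating-point error, while keeping only the top singular values gives a low-rank approximation.

```python
import numpy as np

# Hypothetical magnitude spectrogram: 1025 frequency bins x 400 time frames
rng = np.random.default_rng(0)
S = np.abs(rng.normal(size=(1025, 400)))

# Full (thin) SVD: S == U @ diag(s) @ Vt up to floating-point error
U, s, Vt = np.linalg.svd(S, full_matrices=False)
S_rebuilt = U @ np.diag(s) @ Vt
assert np.allclose(S, S_rebuilt)  # exact decomposition

# A rank-k approximation keeps only the k largest singular values
k = 10
S_low_rank = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```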

NMF

To quote fast.ai, nonnegative matrix factorization (NMF) is a non-exact factorization that factors into one skinny positive matrix and one short positive matrix. NMF is NP-hard and non-unique. There are a number of variations on it, created by adding different constraints.

Applications of NMF:

Face Decompositions

Collaborative Filtering, eg movie recommendations

Audio source separation

Chemistry

Bioinformatics and Gene Expression

Topic Modeling

It's easy to mentally project this approach onto separating bird song from background noise. The result, as far as bird song / noise separation goes: NMF works and SVD does not.
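A minimal sketch of how such a separation could look with scikit-learn's NMF (placeholder random data; treating component 0 as "bird" and the rest as background is purely illustrative, in practice you would pick components by listening to them or inspecting their spectral shape):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(42)
# Hypothetical non-negative magnitude spectrogram (freq bins x time frames)
S = np.abs(rng.normal(size=(513, 200)))

# Factor S ~= W @ H with a small number of components;
# each component pairs a spectral template (column of W)
# with its activation over time (row of H)
model = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(S)   # (513, 4) spectral templates
H = model.components_        # (4, 200) time activations

# Reconstruct the part of the signal explained by one component,
# e.g. treat component 0 as 'bird' and the rest as background
S_bird = np.outer(W[:, 0], H[0, :])
S_background = W[:, 1:] @ H[1:, :]
```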

NMF decomposition of random bird song

Note that I could not directly play back the decomposed tracks, because they were log-scaled (in dB, between -80 and 0 dB), so I had to shift them to non-negative values for the factorization and then shift them back.
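The shift back and forth is trivial; here is a sketch of what I mean (the offset of 80 matches the [-80, 0] dB range mentioned above, and the final dB-to-amplitude conversion is the standard 20*log10 inverse):

```python
import numpy as np

# Hypothetical log-magnitude spectrogram in dB, clipped to [-80, 0]
rng = np.random.default_rng(1)
S_db = rng.uniform(-80.0, 0.0, size=(128, 64))

# NMF needs non-negative input, so shift the dB values up...
offset = 80.0
S_nonneg = S_db + offset          # now in [0, 80]

# ...run the factorization on S_nonneg here...

# ...then shift back and convert to linear amplitude for playback
S_db_restored = S_nonneg - offset
amplitude = 10.0 ** (S_db_restored / 20.0)
assert np.allclose(S_db, S_db_restored)
```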


Component importance

First and second components

Initially I wanted to include NMF in my pre-processing pipeline, but I decided against it for simplicity and speed. In the actual mobile app it will definitely make sense to use some kind of noise reduction / sound pre-processing, but that is a topic for a separate investigation.

Also, researchers are known for deliberately adding noise to the data (e.g. see more in Andrew Ng's MOOC about pipelines) for regularization, data augmentation and sample-extension purposes. So for the proof of concept I abandoned the idea of noise removal.

3. Data preprocessing

Initially I wanted to try scikit-cuda or PyTorch for pre-processing, but I learned a couple of things:

Scikit-cuda does not really install well from packages;

Numpy arrays read from and write to consumer-level SSDs blazingly fast, provided you only need O(n)-type sequential operations;
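To illustrate the second point, a tiny benchmark sketch (array size and file path are arbitrary): a single sequential np.save / np.load round-trip of a ~65MB float32 array typically completes in well under a second on an SSD.

```python
import os
import tempfile
import time

import numpy as np

arr = np.random.rand(1000, 128, 128).astype(np.float32)  # ~65 MB
path = os.path.join(tempfile.gettempdir(), "chunk.npy")

t0 = time.perf_counter()
np.save(path, arr)                 # one sequential O(n) write
write_s = time.perf_counter() - t0

t0 = time.perf_counter()
loaded = np.load(path)             # one sequential O(n) read
read_s = time.perf_counter() - t0

assert np.array_equal(arr, loaded)  # lossless round-trip
```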

After some consideration, I chose the following hyperparameters for my dataset:

After some fiddling, I produced a code snippet that cuts bird songs into sliding windows of ~5s with ~2-3s intervals. Amazingly enough, it took only a couple of minutes for the whole process to finish.
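My actual snippet isn't shown here, but the windowing logic can be sketched as follows; the exact window and hop lengths are my assumptions based on the "~5s windows with ~2-3s intervals" described above:

```python
import numpy as np

def cut_windows(signal, sr, window_s=5.0, hop_s=2.5):
    """Cut a 1-D audio signal into fixed-length sliding windows.

    window_s and hop_s are illustrative values matching the
    '~5s windows with ~2-3s intervals' described in the text.
    """
    win = int(window_s * sr)
    hop = int(hop_s * sr)
    windows = [signal[start:start + win]
               for start in range(0, len(signal) - win + 1, hop)]
    return np.stack(windows) if windows else np.empty((0, win))

sr = 22050
song = np.random.randn(sr * 30)   # hypothetical 30-second recording
chunks = cut_windows(song, sr)
# each row is one ~5s window; consecutive rows overlap by ~2.5s
```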

Data annotation is costly and slow; for 100k songs it would take an unreasonable amount of time;

In the real environment (a mobile app) it's better to have relatively lightweight models (i.e. analyzing each 5s of sound) and then decide on the bird class using some kind of simple voting algorithm (logistic regression, regression trees, simple voting rules, etc.);

Birds have distinct 'sentences' but reasonable expert pattern recognition and annotation is simply not feasible;

There are CNN-inspired approaches to more automated image / array segmentation, but they require some kind of annotated input anyway;

This technique produces a dataset of ca. 62k samples spanning 132 classes.

4. Working with NNs - data preprocessing for Keras

Ok, now we have the dataset stored as .npy arrays, taking up ca. 3-5GB of storage. Ideally we could fit it into RAM and feed it to Keras. But despite this being the 'right' approach, I actually decided to convert all of the .npy arrays to .jpg pictures and feed those to Keras. Why? Here is a list of reasons:

The full dataset will contain 10-100x more data than now, and it WILL NOT fit in RAM;

Keras's image pre-processing routines, which work exceptionally well for regularization purposes, work only with pictures;

Code reuse;

(Important remark: the final trained production model will be exported to TensorFlow as a computational graph and will be fed spectrograms directly - no worries.)

Surprisingly enough, this process is also blazingly fast. A 70-30 train-validation split is used.