0. TLDR

Preprocessing 169GB of .mp3 files is really tricky if you are using high-level programming libraries because:

Multi-threading may be tricky if you have no prior experience in systems programming / high-load Python programming;

If you rely on high-level libraries for data preprocessing, speed may be an issue (rewriting audio preprocessing in something like PyTorch or TensorFlow can be even trickier and is not worth the investment if you can just wait a couple of days for the scripts to finish) - audio unpacking and spectrogram extraction take ca. 10x more time than all the other data manipulations combined;

Though the input data weighs a lot (~150GB), the output images weigh only ca. 2-3GB, which is a couple of orders of magnitude less. This is really nice for neural network training and exploration (the cost of training a CNN grows roughly with the square of the image side - if the picture resolution increases 2x, the activations and any fully connected layers on top become ~4x as heavy);

Obvious top-level advice (probably applicable to any domain):

Divide and conquer - divide a task into subtasks and slowly gnaw at each of them;

Probably if you know an 'ugly' but doable solution - just go for it;

Do not try to have more than 2 degrees of freedom in each task that is new to you (ideally have only one), e.g. if you have to write the preprocessing logic and implement multi-threading, you probably should not add any more sophistication on top;

Try as many simple approaches as fast as you can;

But I nailed it. Read on to see how.

1. Choosing the dataset

We finished the previous article with a proof-of-concept CNN that could tell birds apart by their songs with 70% accuracy - good enough, but not great. The obvious next step is to increase the dataset size 10-100x. It is well known that drastically increasing the dataset size allows neural networks to learn much better, often better than expected.

As you may have noticed, the MVP involved the following stages:

Choose a balanced dataset;

Download the files;

Extract spectrograms from them and convert them into sliding windows and save as images;

I will not bore you with the details of the implementation (as usual, I will just provide my scripts and notebooks at the end of the article), but basically I just compared how many songs we can get per bird genus with a colloquial name (vgenus) and per bird species. It turns out that we can have either ca. 200 bird genera with at least 50+ songs per genus, or ca. 1500-2000 bird species with a much lower number of songs per species.
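For illustration, that comparison boils down to a couple of group-bys over the recording metadata. A minimal sketch (the CSV file name and column names are hypothetical placeholders, not the ones from my actual notebook):

```python
import pandas as pd

# Hypothetical metadata table with one row per downloadable recording
meta = pd.read_csv("recordings_metadata.csv")

# How many songs are available per genus vs. per species?
per_genus = meta.groupby("genus").size().sort_values(ascending=False)
per_species = meta.groupby(["genus", "species"]).size().sort_values(ascending=False)

print("genera with 50+ songs: ", (per_genus >= 50).sum())
print("species with 50+ songs:", (per_species >= 50).sum())
```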

Though going for genus rather than species may not make much sense scientifically, I decided to do so anyway to ensure that we could ship a reasonable application.

2. Downloading the files

In a nutshell, it's not difficult to download 150+GB of files, but it is more difficult to do so file by file in a reasonable amount of time while utilizing 100% of the network speed. Therefore I started with a bit of research. It turned out that advanced web scraping is not really popular nowadays, and the most obvious candidate - the multi-curl library - has not been updated for ca. 5+ years. My friend advised me to use this high-level library for professional crawlers (also check out the developer's list of projects, they are impressive) that has multi-curl as a dependency, but in the end I decided that this was overkill and just found a working example of a multi-curl implementation.

In the end, I used a slightly modified version of the multi-curl download script. This allowed me to download ca. 130k files in a very reasonable amount of time, most of the time utilizing ~50% of my network connection speed.
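For reference, here is a trimmed-down sketch of what such a multi-curl download loop looks like, modelled on the classic pycurl "retriever-multi" pattern (the jobs list and connection settings are hypothetical placeholders, not my actual manifest):

```python
import pycurl

jobs = [("https://example.com/song1.mp3", "song1.mp3")]  # (url, local path) pairs
num_conn = 10  # number of parallel connections

m = pycurl.CurlMulti()
freelist = []
for _ in range(num_conn):
    c = pycurl.Curl()
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.CONNECTTIMEOUT, 30)
    c.setopt(pycurl.TIMEOUT, 600)
    c.setopt(pycurl.NOSIGNAL, 1)
    freelist.append(c)

queue, processed, total = list(jobs), 0, len(jobs)
while processed < total:
    # Start new transfers while we have both free handles and queued URLs
    while queue and freelist:
        url, path = queue.pop(0)
        c = freelist.pop()
        c.fp = open(path, "wb")
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.fp)
        m.add_handle(c)
    # Drive curl's internal state machine
    while True:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Harvest finished (or failed) transfers and recycle their handles
    while True:
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list:
            c.fp.close()
            m.remove_handle(c)
            freelist.append(c)
        for c, errno, errmsg in err_list:
            c.fp.close()
            m.remove_handle(c)
            print("Failed:", errmsg)
            freelist.append(c)
        processed += len(ok_list) + len(err_list)
        if num_q == 0:
            break
    # Wait until some of the sockets have data to read or send
    m.select(1.0)
```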

3. Preprocessing the files

If you have read the previous articles, then you have probably seen that I adopted the following pipeline for file preprocessing (a minimal sketch follows the list):

Use librosa library to produce sound spectrograms;

Normalize them by applying logarithms;

Cut them into 5s rolling windows with 3s steps (arbitrary choice made just by manually inspecting the data);

Save as images;
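Here is that minimal sketch for a single file, assuming mel spectrograms and illustrative parameters (22050 Hz sampling rate, 128 mel bands) that may differ from what my actual script used:

```python
import librosa
import numpy as np
from PIL import Image

def mp3_to_window_images(path, out_prefix, sr=22050, n_mels=128,
                         window_s=5.0, step_s=3.0):
    # Decoding the mp3 is the expensive part (~90% of the runtime)
    y, sr = librosa.load(path, sr=sr)
    # Mel spectrogram, then log (dB) scaling to compress the dynamic range
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)
    # Translate the 5s window / 3s step into spectrogram frames
    frames_per_s = S_db.shape[1] / (len(y) / sr)
    win, step = int(window_s * frames_per_s), int(step_s * frames_per_s)
    for i, start in enumerate(range(0, S_db.shape[1] - win + 1, step)):
        chunk = S_db[:, start:start + win]
        # Rescale to 0..255 and save as an 8-bit grayscale image
        scaled = (chunk - chunk.min()) / (chunk.max() - chunk.min() + 1e-6)
        Image.fromarray((255 * scaled).astype(np.uint8)).save(f"{out_prefix}_{i}.png")
```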

When making my final decision on which preprocessing approach to use, I mostly kept in mind the advice from this article and the video below.

After having written the full end-to-end preprocessing and logging script I timed it and realized that:

Processing each file takes ca. 0.25-0.7s, with mp3 unpacking taking ca. 90% of that time (all the other operations combined take 10x-100x less time);

About 5%-7% of all .mp3 files just do not open, and Python hangs forever trying to process them - probably because of corrupted files, or just because the library is unstable (it was only released earlier this year, after all);

0.5s * 130,000 files ≈ 65,000s, i.e. ~18 hours of pure computation, which in practice stretches to a couple of days with overhead and hanging files - if I could speed it up a bit (3-10x) it would be great;

So, if we try a multi-threading / multi-processing approach - since we cannot avoid using a high-level library for sound processing (rewriting it does not make sense considering the time constraints) - the best we can hope for is a speed-up of 4x (the number of cores in my CPU) or 8-12x (2 or 3 processes per core). When I started researching multi-threading, I mostly found a lot of old, stale boilerplate code where junk takes up ca. 90% of the space. Making sure we terminate stale processes also has to be taken into consideration.
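One way to cap the damage from files that hang the decoder - not the script I ended up with, just a sketch of the per-file timeout idea - is to run each file in a short-lived child process and terminate it if it does not finish in time:

```python
import multiprocessing as mp

def process_one(path):
    # Stand-in for the real worker: decode, build spectrogram, save windows
    ...

def process_with_timeout(path, timeout_s=60):
    """Run the worker in a child process and hard-kill it if it hangs."""
    p = mp.Process(target=process_one, args=(path,))
    p.start()
    p.join(timeout_s)
    if p.is_alive():
        p.terminate()  # e.g. a decode stuck on a corrupted mp3
        p.join()
        return False
    return p.exitcode == 0
```

(On platforms that spawn rather than fork, the calling code needs the usual `if __name__ == "__main__":` guard.)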

Given that, I found a few amazing blog posts / articles on the topic that would make anyone's life easier:

After painful examinations of the insides of Python libraries and testing a lot of scripts, I debugged the following multi-threaded script. There was just one problem - it quickly processed 4-12 items and then just hung (regardless of the number of workers - 4 workers survived a bit longer, more than 4 died after one pass).

The worst thing is that I cannot really understand why. My working hypothesis is that it has something to do with the ffmpeg decoder used by the librosa library - probably it has some issues with multi-processing, which I had no time to investigate.

Probably there is a reason why all multi-processing code examples show only the simplest workers - writing to files, opening files, making HTTP calls, etc.

Having no time to spare, I decided to use the old and faithful approach of just slowly iterating over all of the data, which produced the following script. Notice the .sample(frac=1.0) call on the dataframe. This is used for randomization, because I learned from experience that bad files usually cluster in big chunks, which would otherwise make performance evaluation of the script impossible.
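In spirit, the loop is just a shuffled pass over the file manifest. A sketch reusing the hypothetical function from the earlier spectrogram example (the CSV name and 'path' column are placeholders, not my actual dataframe):

```python
import pandas as pd

df = pd.read_csv("downloaded_files.csv")

# .sample(frac=1.0) shuffles the rows, so corrupted files (which tend to
# cluster together) are spread out and early timing estimates are meaningful
for row in df.sample(frac=1.0).itertuples():
    try:
        mp3_to_window_images(row.path, out_prefix=row.path.rsplit(".", 1)[0])
    except Exception as e:
        print(f"Failed on {row.path}: {e}")
```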

After running the above script for a couple of days, I managed to pre-process almost all of the files I had. Interestingly enough, ~169GB of bird sounds compressed into ca. 2-3GB of spectrograms, which is nice.

(Screenshots: compression level, file count, and a mesmerizing progress indicator.)

4. Downloads and links

As usual, I attach the files that I used, but this week they are a bit messy because I had to do a lot of fiddling with the data and scripts: