mirex

The 2018 edition of MIREX, the Music Information Retrieval Evaluation eXchange, was the sixth in a row for which we at the Centre for Digital Music submitted a set of Vamp audio analysis plugins for evaluation. For the third year in a row, the set of plugins we submitted was entirely unchanged — these are increasingly antique methods, but we have continued to submit them with the idea that they could provide a useful year-on-year baseline at least. It also gives me a good reason to take a look at the MIREX results and write this little summary post, although I’m a bit late with it this year, having missed the end of 2018 entirely!

Structural Segmentation

No results appear to have been published for this task in 2018; I don’t know why. Last time around, ours was the only entry. Maybe it was the only entry again, and since it was unchanged, there was no point in running the task.

Multiple Fundamental Frequency Estimation and Tracking

After 2017’s feast with 14 entries, 2018 is a famine with only 3, two of which were ours and the third of which (which I can’t link to, because its abstract is missing) was restricted to a single subtask, in which it got reasonable results. Results pages are here and here.

Audio Onset Detection

Almost as many entries as last time, and a new convolutional network from Axel Röbel et al disrupts the tidy sweep of Sebastian Böck’s group at the top of the results table. Our simpler methods are squarely at the bottom this time around. Röbel’s submission has a nice informative abstract which casts more light on the detailed result sets and is well worth a read. Results here.

Audio Beat Tracking

Pure consolidation: all the 2018 entries are repeats from 2017, and all perform identically (with the methods from Böck et al doing better than our plugins). Every year I say that this doesn’t feel like a solved problem, and it still doesn’t — the results we’re seeing here still don’t seem all that close to human performance, but perhaps there are misleading properties to the evaluation. Results here, here, here.

Audio Tempo Estimation

This is a busier category, with a new dataset and a few new submissions. The new dataset is most intriguing: all of the submissions perform better with the new dataset than the older one, except for our QM Tempo Tracker plugin, which performs much, much worse with the new one than the old!

I believe the new dataset is of electronic dance music, so it’s likely that much of it is high tempo, perhaps tripping our plugin into half-tempo octave errors. We could probe this next time by tweaking the submission protocol a little. Submissions are asked to output two tempo estimates, and the results report whether either of them was correct. Because our plugin only produces one estimate, we lazily submit half of that estimate as our second estimate (with a much lower salience score). But if our single estimate was actually half of the “true” value, as is plausible for fast music, we would see better scores from submitting double instead of half as the second estimate.
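To make the half-versus-double argument concrete, here is a minimal sketch. It is not the MIREX evaluation code: the function names are made up for illustration, the 8% tolerance is my assumption about how "correct" is judged, and the tempo values are invented.

```python
def either_correct(estimates, true_bpm, tolerance=0.08):
    """Return True if any estimate is within the relative tolerance of the true tempo.

    The 8% tolerance is an assumption for illustration, not a documented value.
    """
    return any(abs(e - true_bpm) <= tolerance * true_bpm for e in estimates)


def with_second_estimate(primary_bpm, factor):
    """Pair a single estimate with a scaled copy of itself as the second estimate."""
    return [primary_bpm, primary_bpm * factor]


# Hypothetical fast track: the true tempo is 140 bpm, but the tracker locks
# on to half of it.
true_bpm = 140.0
plugin_bpm = 70.0

half_pair = with_second_estimate(plugin_bpm, 0.5)    # what we submit now
double_pair = with_second_estimate(plugin_bpm, 2.0)  # the proposed alternative

print(either_correct(half_pair, true_bpm))
print(either_correct(double_pair, true_bpm))
```

With these invented numbers, only the doubled pair contains a value near the true tempo, which is the effect we would hope to see on a dataset of fast music.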

Audio Key Detection

Some novelty here from a pair of template-based methods from the Universitat Autonoma de Barcelona, one attributed to Galin and Castells-Rufas and the other to Castells-Rufas and Galin. Their performance is not a million miles away from our own template-based key estimation plugin.

The strongest results appear to be from a neural network method from Korzeniowski et al at JKU, an updated version of one of last year’s better-performing submissions, an implementation of which can be found in the madmom library.

Audio Chord Estimation

A lively (or daunting) category. A team from Fudan University in Shanghai, whence came two of the previous year’s strongest submissions, is back with another new method, an even stronger set of results, and once again a very readable abstract; and the JKU team have an updated model, just as in the key detection category, which also performs extremely impressively. Meanwhile a separate submission from JKU, due to Stefan Gasser and Franz Strasser, would have been at the very top had it been submitted a year earlier, but is now a little way behind. Convolutional neural networks are involved in all of these.

Our Chordino submission can still be described as creditable. Results can be found here.

For the fifth year in a row, this year the Centre for Digital Music submitted a number of Vamp audio analysis plugins to the MIREX evaluation for “music information retrieval” tasks. This year we submitted the same set of plugins as last year; there were no new implementations, and some of the existing ones are so old as to have celebrated their tenth birthday earlier in the year. So the goal is not to provide state-of-the-art results, but to give other methods a stable baseline for comparison and to check each year’s evaluation metrics and datasets against neighbouring years. I’ve written about this in each of the four previous years: see posts about 2016, 2015, 2014, and 2013.

Obviously, having submitted exactly the same plugins as last year, we expect basically the same results. But the other entries we’re up against will have changed, so here’s a review of how each category went.

(Note: we dropped one category this year, Audio Downbeat Estimation. Last year’s submission was not well prepared for reasons I touched on in last year’s post, and I didn’t find time to rework it.)

Structural Segmentation

Results for the four datasets are here, here, here, and here. Our results, for Segmentino from Matthias Mauch and the older QM Segmenter from Mark Levy, were the same as last year, with the caveat that the QM Segmenter uses random initialisation and so never gets exactly the same results twice.

Surprisingly, nobody else entered anything to this category this year, which seems a pity because it’s an interesting problem. This category seems to have peaked around 2012-2013.

Multiple Fundamental Frequency Estimation and Tracking

An exciting year for this mind-bogglingly difficult category, with 14 entries from ten different sets of authors and a straight fight between template decomposition methods (including our Silvet plugin, from Emmanouil Benetos’s work) and trendy convolutional neural networks. Results are here and here.

With so many entries and evaluations it’s not that easy to get a clear picture, and no single method appears to be overwhelmingly strong. There were fine results in some evaluations for CNN methods from Thickstun et al and Thomé and Ahlbäck, for Pogorelyuk and Rowley’s very intriguing “Dynamic Mode Decomposition”, and for a few others whose abstracts are missing from the entry site and so can’t be linked to.

Silvet, with the same results as last year, does well enough to be interesting, but in most cases it isn’t troubling the best of the newer methods.

Audio Onset Detection

Bit of a puzzle here, as our two plugin submissions both got slightly different results from last year despite being unchanged implementations of deterministic methods invoked in the same way on the same data sets.

Last year saw a big expansion in the number of entries, and this year there were nearly as many. Just as last year, our old plugins did modestly, but again some of the new experiments fared a bit less well so we weren’t quite at the bottom. Results here.

Audio Beat Tracking

Same puzzle as in onset detection: while our results were basically similar to last year, they weren’t identical. The 2015 and 2016 results were identical and we would have expected the same again in 2017.

Audio Tempo Estimation

Last year there were two entries in this category, ours and a much stronger one from Sebastian Böck. This year sees one addition, from Hendrik Schreiber and Meinard Müller, which fares creditably. The results are here.

Audio Key Detection

Two pretty successful new submissions this year, both using convolutional neural networks: one from Korzeniowski, Böck, Krebs and Widmer, and the other from Hendrik Schreiber. Our old plugin (from work by Katy Noland) does not fare tragically, but it’s clear that some other methods are getting much closer to the sort of performance one imagines should be realistic. The results are linked from here.

Intuitively, key estimation seems like the sort of problem that is interesting only so long as you don’t have enough training data. As a 24-way classification with large enough training datasets, it looks a bit mundane. The problem becomes, what does it mean for a piece of music to be in a particular key anyway? Submissions are not expected to answer that, but presumably it sets an upper bound on performance.
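For anyone wondering what "24-way classification" looks like in the template-matching style, here is a toy sketch. It is not how our plugin or any of the submissions works: the template weights are numbers I made up for illustration, and the input is assumed to be a single 12-bin chroma vector summarising the whole piece.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Illustrative weights only (an assumption, not taken from any plugin or paper):
# tonic 3, third and fifth 2, other scale notes 1, notes outside the scale 0.
MAJOR = [3, 0, 1, 0, 2, 1, 0, 2, 0, 1, 0, 1]
MINOR = [3, 0, 1, 2, 0, 1, 0, 2, 1, 0, 1, 0]


def rotate(template, tonic):
    """Shift a C-rooted template so that its tonic sits at pitch class `tonic`."""
    return [template[(pc - tonic) % 12] for pc in range(12)]


def correlation(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = (sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0


def estimate_key(chroma):
    """Score all 24 major and minor keys against the chroma vector; return the best label."""
    candidates = [
        (correlation(chroma, rotate(template, tonic)), f"{NOTE_NAMES[tonic]} {mode}")
        for mode, template in (("major", MAJOR), ("minor", MINOR))
        for tonic in range(12)
    ]
    return max(candidates)[1]


# A chroma vector that happens to match the G major template exactly.
print(estimate_key(rotate(MAJOR, 7)))
```

Real music never matches a template this cleanly, of course, which is where the question of what it means to be "in" a key starts to bite.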

Audio Chord Estimation

Another increase in the number of test datasets, from 5 to 7, and a strong category again. Last year our submission Chordino (by Matthias Mauch) was beginning to trail, though it wasn’t quite at the back. This year some of the weaker submissions have not been repeated, some new entries have appeared, and Chordino is in last place for every evaluation. It’s not far behind — perceptually it’s still a pretty good algorithm — but some of the other methods are very impressive now. Here are the results.

The abstracts accompanying the two submissions from the audio information processing group at Fudan University in Shanghai (one by Jiang, Li and Wu, the other by Wu, Feng and Li) are both well worth a read. The former paper refers closely to Chordino, using the same NNLS Chroma features with a new front-end. Meanwhile, the latter paper proposes a method worth remembering for dinner parties, using deep residual networks trained from MIDI-synchronised constant-Q representations of audio with a bidirectional long short-term memory and conditional random field for labelling.

This year, for the fourth year in a row, we submitted a number of Vamp audio analysis plugins published by the Centre for Digital Music to the annual MIREX evaluation. The motivation is to give other methods a baseline to compare against, to compare one year’s evaluation metrics and datasets against the next year’s, and to give our group a bit of visibility. See my posts about this process in 2015, 2014, and 2013.

Here’s a review of how we got on this year. We entered an extra category compared to last year, a makeshift entry in the audio downbeat estimation task, making this the widest range of categories we’ve covered with these plugins in MIREX so far.

Structural Segmentation

Results for the four datasets are here, here, here, and here. I don’t find the evaluations any easier to follow than I did last year, but I can see that both of our submissions (Segmentino from Matthias Mauch and the older QM Segmenter from Mark Levy) produced the same results as expected from previous years.

Segmentino actually comes across well in this year’s results, not least because the authors of last year’s best method (Thomas Grill and Jan Schlüter) didn’t submit anything this time.

Multiple Fundamental Frequency Estimation and Tracking

Results here and here. Our Silvet plugin performed much as before: reasonably well, though as usual in such a hard task, with hugely varying results from one test case to another.

Audio Onset Detection

Results here. Many more submissions than last year, which was already a broader field than the year before. Our two old plugins score the same as they did last year, but are no longer placed last, as three of the new submissions have lower scores.

Audio Beat Tracking

Results here, here, and here. Our BeatRoot and QM Tempo Tracker are once again placed near the back. There’s little change from last year at the top, still occupied by the work of Sebastian Böck and Florian Krebs — work which the authors have, to their great credit, made available as freely-licensed, readable, and well-documented Python code in the madmom library.

Audio Tempo Estimation

Results here. Only two entries this year, our QM Tempo Tracker and Sebastian Böck’s entry from the aforementioned madmom.

Audio Downbeat Estimation

Results here. In this category we submitted the QM Bar and Beat Tracker plugin by Matthew Davies, which has been around for a few years; it’s based on the QM Tempo Tracker with an additional downbeat estimator.

The results don’t come across very well, for varying reasons according to the dataset. The QM Bar and Beat Tracker needs to be prompted with the time signature and (following a last-minute decision to enter the category this year) I submitted a script which assumed fixed 4/4 time. This meant we knowingly threw away the Ballroom category, which was all 3/4, but the plugin was also ill-suited to several of the other categories. Not a strong submission then, but interesting to see.

Audio Key Detection

Results here and here. Last year I lamented the lack of any other entries than ours, since the category had just gained a second (and more realistic) test dataset. So I’m delighted to see a couple of new submissions this year, including one from Gilberto Bernardes and Matthew Davies at INESC in Porto which appears to perform well.

Audio Chord Estimation

Results here, now up to five test datasets. Last year saw a torrid time with a bug in the Chordino plugin, but this year it’s back to normal. Chordino still performs well, but in a strong category this year it’s no longer one of the top performers.

For the past three years now, we’ve taken a number of Vamp audio analysis plugins published by the Centre for Digital Music and submitted them to the annual MIREX evaluation. The motivation is to give other methods a baseline to compare against, to compare one year’s evaluation metrics and datasets against the next year’s, and to give our group a bit of visibility. See my posts about this process in 2014 and in 2013.

Here are this year’s outcomes. All these categories are ones we had submitted to before, but I managed to miss a couple of category deadlines last year, so in total we had more categories than in either 2013 or 2014.

Structural Segmentation

Results for the four datasets are here, here, here, and here. This is one of the categories I missed last year and, although I find the evaluations quite hard to understand, it’s clear that the competition has moved on a bit.

Our own submissions, the Segmentino plugin from Matthias Mauch and the much older QM Segmenter from Mark Levy, produced the expected results (identical to 2013 for Segmentino; similar for QM Segmenter, which has a random initialisation step). As before, Segmentino obtains the better scores. There was only one other submission this year, a convolutional neural network based approach from Thomas Grill and Jan Schlüter which (I think) outperformed both of ours by some margin, particularly on the segment boundary measure.

Multiple Fundamental Frequency Estimation and Tracking

Results here and here. In addition to last year’s submission for the note tracking task of this category, this year I also scripted up a submission for the multiple fundamental frequency estimation task. Emmanouil Benetos and I had made some tweaks to the Silvet plugin during the year, and we also submitted a new fast “live mode” version of it. The evaluation also includes a new test dataset this year.

Our updated Silvet plugin scores better than last year’s version in every test they have in common, and the “live mode” version is actually not all that far off, considering that it’s very much written for speed. (Nice to see a report of run times in the results page — Silvet live mode is 15-20 times as fast as the default Silvet mode and more than twice as fast as any other submission.) Emmanouil’s more recent research method does substantially better, but this is still a pleasing result.

This category is an extremely difficult one, and it’s also painfully difficult to get good test data for it. There’s plenty of potential here, but it’s worth noting that a couple of the authors of the best submissions from last year were not represented this year — in particular, if Elowsson and Friberg’s 2014 method had appeared again this year, it looks as if it would still be at the top.

Audio Onset Detection

Results here. Although the top scores haven’t improved since last year, the field has broadened a bit — it’s no longer only Sebastian Böck vs the world. Our two submissions, both venerable methods, are now placed last and second-last.

Oddly, our OnsetsDS submission gets slightly better results than last year despite being the same, deterministic, implementation (indeed exactly the same plugin binary) run on the same dataset. I should probably check this with the task captain.

Audio Beat Tracking

Results here, here, and here. Again the other submissions are moving well ahead and our BeatRoot and QM Tempo Tracker submissions, producing unchanged results from last year and the year before, are now languishing toward the back. (Next year will see BeatRoot’s 15th birthday, by the way.) The top of the leaderboard is largely occupied by a set of neural network based methods from Sebastian Böck and Florian Krebs.

This is a more interesting category than it gets credit for, I think — still improving and still with potential. Some MIREX categories have very simplistic test datasets, but this category introduced an intentionally difficult test set in 2012 and it’s notable that the best new submissions are doing far better here than the older ones. I’m not quite clear on how the evaluation process handles the problem of what the ground truth represents, and I’d love to know what a reasonable upper bound on F-measure might be.
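As a rough illustration of why the ground truth question matters for F-measure, here is a small sketch of the measure for beat times. It is a simplification rather than the MIREX implementation: the 70 ms tolerance window is my assumption, the matching is a simple greedy one, and the beat times are invented.

```python
def f_measure(estimated, reference, window=0.07):
    """F-measure for event times in seconds.

    Each reference event may be matched by at most one estimate lying within
    `window` seconds of it. The window size is an assumption for illustration.
    """
    estimated = sorted(estimated)
    reference = sorted(reference)
    used = [False] * len(estimated)
    hits = 0
    for ref in reference:
        # Greedily pick the closest unused estimate within the tolerance window.
        best = None
        for i, est in enumerate(estimated):
            if used[i] or abs(est - ref) > window:
                continue
            if best is None or abs(est - ref) < abs(estimated[best] - ref):
                best = i
        if best is not None:
            used[best] = True
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(estimated)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Invented example: annotated beats every half second, and a tracker that
# taps perfectly in time but at half the annotated rate.
reference_beats = [0.5, 1.0, 1.5, 2.0]
half_rate_beats = [0.5, 1.5]

print(round(f_measure(half_rate_beats, reference_beats), 3))  # 0.667
```

A listener tapping at the slower metrical level would not obviously be "wrong", yet here they score well below 1.0, which is part of why I would like to know what a realistic upper bound is.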

Audio Tempo Estimation

Results here. This is another category I missed last year, but we get the same results for the QM Tempo Tracker as we did in 2013. It still does tolerably well considering its output isn’t well fitted to the evaluation metric (which rewards estimators that produce best and second-best estimates across the whole piece).

The top scorer here is a neural network approach (spotting a theme here?) from Sebastian Böck, just as for beat tracking.

Audio Key Detection

The QM Key Detector gets the same results as last year for the dataset that existed then. It scores much worse on the new dataset, which suggests that may be a more realistic test. Again there were no other submissions in this category — a pity now that it has a second dataset. Does nobody like key estimation? (I realise it’s a problematic task from a musical point of view, but it does have its applications.)

Audio Chord Estimation

Poor results for Chordino because of a bug which I went over at agonising length in my previous post. This problem is now fixed in Chordino v1.1, so hopefully it’ll be back to its earlier standard in 2016!

Some notes

Neural networks

… are widely used this year. Several categories contained at least one submission whose abstract referred to a convolutional or recurrent neural network or deep learning, and in at least 5 categories I think a neural network method can reasonably be said to have “won” the category. (Yes I know, MIREX isn’t a competition…)

Structural segmentation: convolutional NN performed best

Beat tracking: NNs all over the place, definitely performing best

Tempo estimation: NN performed best

Onset detection: NN performed best

Multi-F0: no NNs I think, but it does look as if last year’s “deep learning” submission would have performed better than any of this year’s

Chord estimation: NNs present, but not yet quite at the top

Key detection: no NNs, indeed no other submissions at all

Categories I missed

Audio downbeat estimation: I think I just overlooked this one, for the second year in a row. As last year, I should have submitted the QM Bar & Beat Tracker plugin from Matthew Davies.

Real-time audio to score alignment: I nearly submitted the MATCH Vamp Plugin for this, but actually it only produces a reverse path (offline alignment) and not a real-time output, even though it’s a real-time-capable method internally.

Other submissions from the Centre for Digital Music

This feels like a pity — evaluation is always a pain and it’s nice to get someone else to do some of it.

It’s also a pity because several of the plugins I’m submitting are getting a bit old and are falling to the bottom of the results tables. There are very sound reasons for submitting them (though I may drop some of the less well performing categories next year, assuming I do this again) but it would be good if they didn’t constitute the only visibility QM researchers have in MIREX.

Why would this be the case? I don’t really know. The answer presumably must include some or all of

not working on music informatics signal-processing research at all

working on research that builds on feature extractors, rather than building the feature extractors themselves

research not congruent with MIREX tasks (e.g. looking at dynamics or articulations rather than say notes or chords)

research uses similar methods but not on mainstream music recordings (e.g. solo singing, animal sounds)

state-of-the-art considered good enough

lack the background to compete with current methods (e.g. the wave of NNs) and so sticking with progressive enhancements of existing models

lack the data to compete with current methods

not aware of MIREX

not prioritised by supervisor

The last four reasons would be a problem, but the rest might not be. It could really be that MIREX isn’t very relevant to the people in this group at the moment. I’ll have to see what I can find out.

I wasn’t sure whether to do a repeat submission this year—most of the plugins would be the same—but Simon Dixon persuaded me. The test datasets might change; it might be interesting to see whether results are consistent from one year to the next; and it’s always good to provide one more baseline for other submissions to compare themselves against. So I dusted off last year’s submission scripts, added the new Silvet note transcription plugin, and submitted them.

Multiple Fundamental Frequency Estimation and Tracking

The only category we didn’t submit to last year. This is the problem of deducing which notes are being played, and at what times, in music where more than one note happens at once. I submitted the Silvet plugin which is based on a method by Emmanouil Benetos that had performed well in MIREX in an earlier year.

The results for this category are divided into two parts, multiple fundamental frequency estimation and note tracking. I submitted a script only for the note tracking part. I would describe the performance of our plugin as “correct”, in that it was reliably mid-pack across the board, pretty good for piano transcription, and generally marginally better than the MIREX 2012 submission which inspired it.

This was a fairly popular category this year, and one submission in particular improved quite substantially on previous years’ results—it may be no coincidence that that submission’s abstract employs the phrase-of-the-moment deep learning.

Audio Onset Detection

The same two submissions as last year (OnsetsDS and QM Onset Detector) and exactly the same results—the test dataset is unchanged and the plugins are entirely deterministic. Last year I remarked that our methods are quite old and other submissions should improve on them over time, but this year’s top methods were actually no improvement on last year’s.

Audio Beat Tracking

Again the same two submissions as last year (BeatRoot and QM Tempo Tracker) and exactly the same results (1, 2, 3), behind the front-runners but still reasonably competitive. While the best-performing methods continue to advance, it’s clear that beat tracking is still not a solved problem.

Audio Key Detection

Last year we entered a plugin that wasn’t expected to do very well here, and it swept the field. This year everyone else seems to have dropped out, so our repeat submission was in fact the only entry! (It got the same results as last year.)

Audio Chord Estimation

This is interesting partly because our submission (Chordino) performed very well last year but the evaluation metric has since changed.

Sadly, there were only three submissions this year. Chordino still looks good in all three datasets (1, 2, 3) but it is now ranked second rather than first for all three. I’m a bit disappointed that the new leading submission seems to be lacking a descriptive abstract.

Categories we could have entered but didn’t

Audio Melody Extraction

Last year’s submission wasn’t really good enough to repeat.

Audio Downbeat Estimation

I overlooked this task, which was new this year. Otherwise I could have submitted the QM Bar and Beat Tracker plugin.

Audio Tempo Estimation, Structural Segmentation

These categories had an earlier submission deadline than the rest, and stupidly I missed it.

Some of these methods were, and remain, pretty good. Some are reasonably good, simplified versions of work that was state-of-the-art at the time, but might not be any more. Some have always been less impressive. They are all available free, with source code—or with commercial licences for companies that want to incorporate them into their products.

This year we thought we should give them a trial against the current state-of-the-art in academia. Luis Figueira and I prepared a number of entries for the annual Music Information Retrieval Evaluation Exchange (or MIREX), submitting a Vamp plugin from our group in every category where we had one available.

MIREX, which is an excellent large-scale community endeavour organised by J Stephen Downie at UIUC, works by running your methods across a known test dataset of music recordings, comparing the results against “ground truth” produced in advance by humans, and publishing scores for how well each method compares.

Here’s how we got on for each evaluation task.

Audio Onset Detection

(That is, identifying the times in the music recording where each of the individual notes begins.)

We submitted two plugins here: the QM Onset Detector plugin implementing a number of (by now) standard methods, from Juan Bello and others back in 2005; and OnsetsDS, a refinement by Dan Stowell aimed at real-time use (so not directly relevant to this task). Both did modestly well. These methods have been published for a long time and have become widely known, so it would be a disappointment if current work didn’t improve on them.

Audio Beat Tracking

(Tapping along with the beat.)

Here we entered the QM Tempo Tracker plugin, based on the work of Matthew Davies, and a Vamp plugin implementation of Simon Dixon’s BeatRoot beat tracker. Both of these are now quite old methods (especially BeatRoot, although the plugin is new). The results for three datasets are here, here and here.

Both the original BeatRoot and a different version of Matthew Davies’ work were included in the MIREX evaluation back in 2006, and the ’06 dataset is one of the three used this year. So you can compare the 2006 versions here and the 2013 evaluations over here. They perform quite similarly, which is a relief. You can also see that the state of the art has moved on a bit.

Audio Tempo Estimation

(Coming up with an overall estimate in beats-per-minute of the tempo of a recording. Presumably the evaluation uses clips in which the tempo doesn’t vary.)

We entered the same QM Tempo Tracker plugin, from Matthew Davies, as used in the Beat Tracking evaluation. It doesn’t quite suit the evaluation metric, because the plugin estimates tempo changes rather than the two fixed tempo estimates (higher and lower, to allow for beat-period “octave” errors) the task calls for—but it performed pretty well. Again, a related method was evaluated on the same dataset in MIREX ’06 with quite similar results.

Audio Key Detection

(Estimating the overall key of the piece, insofar as that makes sense.)

We entered the QM Key Detector plugin for this task. This plugin, from Katy Noland back in 2007, is straightforward and fast, and is intended to detect key changes rather than the overall key.

To everyone’s surprise (including Katy’s) it scored better than any other entry, and indeed better than any entry from the past four years! The test dataset is pretty simplistic, but this is a nice result anyway.

Audio Melody Extraction

(Writing down the notes for the main melody from a recording which may have more than one instrument.)

Here we submitted my own cepstral pitch tracker plugin. This is not actually a melody extractor at all, but a monophonic pitch tracker with note estimation intended for solo singing. And it was developed as an exercise in test-driven development, rather than as a research outcome. It was not expected to do well. It actually did come out well in one dataset (solo vocal?), but it got weak results in the other three. I’m quite excited about having submitted something all-my-own-work to MIREX though.

Audio Chord Estimation

(Annotating the chord changes in a piece based on the recording.)

For this task we entered the Chordino plugin from Matthias Mauch. This plugin is much the same as the “Simple Chord Estimate” method that Matthias entered for MIREX in 2010; it got the same excellent results then and now for the dataset that was used in both years, and it also got the highest scores in the other dataset.

Structural Segmentation

(Dividing a song up into parts based on musical structure. The parts might correspond to verse, chorus, bridge, etc—though the segmenter is not required to label them, only to identify which ones have similar structural purpose.)

Two entries here. The Segmentino plugin from Matthias Mauch is fairly new, and is the only submission we made for which plugin code has not yet been released, though we hope to remedy that soon. And we also entered Mark Levy’s QM Segmenter plugin, an older and more lightweight method.

The results for different test datasets are here, here, here and here. The evaluation metrics are slightly baffling (for me anyway). I have been advised to concentrate on:

Frame pair clustering F-measure: how consistently frames that belong to the same kind of section are grouped together; this measures getting matching segment types right. Segmentino does very well here, except in one dataset for some reason. The QM Segmenter is not as good, but actually not so bad either.

Segment boundary recovery evaluation measures: how accurately the segmenters report the precise locations of segment boundaries. Neither of our submissions does this very well, although Segmentino does well on precision at 3 seconds, meaning the segment boundaries it does report are usually fairly close to the real ones.
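Since I found these two measures baffling, here is a minimal sketch of how I understand them. It reflects my reading rather than the MIREX evaluation code: the function names are hypothetical, the boundary matching is simplified (an estimated boundary counts as a hit if any reference boundary is close enough), and the example data is invented.

```python
from itertools import combinations


def boundary_precision(estimated, reference, tolerance=3.0):
    """Fraction of estimated boundaries within `tolerance` seconds of a reference boundary.

    Simplified: several estimates may match the same reference boundary.
    """
    if not estimated:
        return 0.0
    hits = sum(1 for e in estimated if any(abs(e - r) <= tolerance for r in reference))
    return hits / len(estimated)


def same_label_pairs(labels):
    """Set of frame index pairs (i, j), i < j, that carry the same segment label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2) if labels[i] == labels[j]}


def pairwise_f_measure(estimated_labels, reference_labels):
    """F-measure over pairs of frames that are given the same label.

    Label names need not agree between the two sequences; only the grouping matters.
    """
    est = same_label_pairs(estimated_labels)
    ref = same_label_pairs(reference_labels)
    common = len(est & ref)
    if common == 0:
        return 0.0
    precision = common / len(est)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example: boundaries in seconds, and one label per analysis frame.
print(boundary_precision([10.0, 42.0, 80.0], [12.0, 40.0, 60.0]))

# The reference says the opening section returns at the end; the estimate
# finds the right boundaries but treats the return as a new section.
print(pairwise_f_measure("XXYYZZ", "AABBAA"))
```

In the second example, every pair the estimate groups together is correct, so precision is perfect, but it misses the pairs that link the first and last sections, so recall suffers. That seems to be the sense in which this measure rewards getting matching segment types right.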