The availability of common datasets is very important for
progress in the music information retrieval (MIR) community.
Whereas standard benchmark tasks are widely used in similar
research areas (e.g. speech or handwriting recognition), music
data is difficult to distribute freely due to very restrictive copyright laws.
However, different groups try to overcome this problem by using
music released under a free license (e.g. Creative Commons)
or by distributing only feature vectors instead of the audio data.

This page is an attempt to list the datasets that are already available.
Similar resources for MIR tools, papers, and conferences can be found on
the music-ir.org web page.
Furthermore, there is an annual Music Information Retrieval Evaluation
eXchange (MIREX) contest, held during the ISMIR conference,
where groups can evaluate and compare the performance of their algorithms.

Note

Please contact me (mail) if you know of any other free dataset,
or if you have other comments!

Artist20 is a database of six albums by each of 20 artists,
1,413 tracks in total.
It comes with a defined training set (three albums per artist),
validation set (one album), and test set (two albums),
along with a Matlab baseline classifier.
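As a minimal sketch of how such a fixed album-level split could be organized (the directory layout and album identifiers below are placeholders, not the actual Artist20 structure):

```python
# Sketch of an Artist20-style fixed split: for each of 20 artists,
# six albums are divided 3/1/2 into train/validation/test sets.
# Album identifiers here are placeholders, not the real album names.
def make_split(artists, albums_per_artist=6):
    split = {"train": [], "valid": [], "test": []}
    for artist in artists:
        albums = [f"{artist}/album{i}" for i in range(albums_per_artist)]
        split["train"] += albums[:3]   # three albums per artist
        split["valid"] += albums[3:4]  # one album
        split["test"]  += albums[4:]   # two albums
    return split

split = make_split([f"artist{i}" for i in range(20)])
```

Splitting by album rather than by track avoids the "album effect", where a classifier learns production characteristics of a recording session instead of the artist's style.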

A collection of 80 songs, each performed by two artists,
for the automatic detection of "cover songs",
i.e. alternative performances of the same basic musical piece
by different artists, typically with large stylistic
and/or harmonic changes.

Data with transition probabilities between different
chords (such as Dm->G), computed from a database of popular music.
The data is divided into four human-labeled categories:
Pop, Rock, Country, and Beatles.
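To illustrate what such transition probabilities represent, here is a minimal sketch (the chord sequence is invented for illustration) that estimates them from a labeled chord sequence as a first-order Markov model:

```python
from collections import Counter, defaultdict

def chord_transition_probs(chords):
    """Estimate P(next chord | current chord) from a chord sequence."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(chords, chords[1:]):
        counts[cur][nxt] += 1
    # Normalize each row of counts into a probability distribution.
    return {cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for cur, c in counts.items()}

# Toy example; the real probabilities come from the annotated database.
probs = chord_transition_probs(["Dm", "G", "C", "Dm", "G", "Am"])
```

In this toy sequence, Dm is always followed by G, so `probs["Dm"]["G"]` is 1.0, while G is followed by C and Am equally often.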

As part of the Linking Open Data on the Semantic Web community project,
DBTune hosts a number of servers, providing access to music-related
structured data in a Linked Data fashion.
It now provides access to more than 14 billion RDF triples, with data
from Jamendo, Magnatune, AudioScrobbler, MySpace, Musicbrainz,
BBC playcount data, Echonest and more.

Musipedia, inspired by Wikipedia, is building a searchable,
editable, and expandable collection of tunes, melodies,
and musical themes.
Every entry can be edited by anybody. An entry can contain a bit
of sheet music, a MIDI file, textual information about the work and
the composer, and last but not least the Parsons Code, a rough
description of the melodic contour.
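The Parsons Code records only whether each note moves up (u), down (d), or repeats (r) relative to the previous one, with an asterisk for the first note. A minimal sketch of deriving it from a list of MIDI pitch numbers:

```python
def parsons_code(pitches):
    """Derive the Parsons Code (melodic contour) from MIDI pitch numbers."""
    code = "*"  # the first note carries no contour information
    for prev, cur in zip(pitches, pitches[1:]):
        code += "u" if cur > prev else "d" if cur < prev else "r"
    return code

# Opening of "Twinkle, Twinkle, Little Star": C C G G A A G
print(parsons_code([60, 60, 67, 67, 69, 69, 67]))  # prints *rururd
```

Because it discards exact pitches and rhythm, the Parsons Code is robust to transposition and imprecise humming, which makes it useful for query-by-contour search.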

30,000 symbolically encoded melodies (Lilypond, MIDI) and 100,000
MIDI files, searchable with a SOAP interface or with a web interface.

Provides a web-based interface to the Humdrum thema command,
which in turn allows searching of databases containing
musical themes or incipits.
Currently there are three databases: Classical Instrumental Music,
European Folksongs, and Latin Motets from the sixteenth century.

A webservices system for submitting code and running it
against virtual collections:

"The NEMA team aims to create an open and extensible
webservice-based resource framework that facilitates the
integration of music data and analytic/evaluative tools that
can be used by the global MIR and CM research and education
communities on a basis independent of time or location."

The Echo Nest provides APIs for building your own MIR datasets.
It extracts tempo, beats, time signature, song sections, timbre,
key, and other musical attributes from an uploaded song,
can generate recommendations of similar songs, and returns feeds.
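As a sketch of how such an analysis response might be consumed (the JSON below is an invented example, not an actual Echo Nest payload; only the attribute names mirror those listed above):

```python
import json

# Invented sample response; the field names follow the attributes the
# API is said to extract (tempo, time signature, key, sections, ...).
sample = json.loads("""
{
  "tempo": 120.0,
  "time_signature": 4,
  "key": 9,
  "sections": [{"start": 0.0, "duration": 14.2},
               {"start": 14.2, "duration": 31.6}]
}
""")

print(f"{sample['tempo']:.0f} BPM, {len(sample['sections'])} sections")
```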