Here are the datasets backing the Music Ngram Viewer. These
datasets were generated in April and September 2011; I will update them as
the score recognition continues, and the updated versions will have
distinct and persistent version identifiers (20110401 for the current
set). You can also access data not yet published here via the API.

Each of the numbered links below will directly download a fragment of the
given corpus. In addition, for each corpus I provide a total counts file,
which records the total number of 1-grams contained in the scores that make up the corpus.
This file is useful for computing the relative frequencies of n-grams.

Details on the corpus construction are abbreviated here.
Of note, I report only the n-grams that appeared more than 3 times in any particular year.
Therefore, the sum of the 1-gram occurrences in any given corpus is smaller than the number
given in the total counts file.

File format: Each of the numbered files below is
gzipped tab-separated data. Each line has the following format:

ngram TAB year TAB match_count NEWLINE

As an example, here are the 7,000,000th and 7,000,001st lines from one of the files of the IMSLP interval 5-grams (imslp-interval-5gram-20110401.csv.gz):

3 -2 4 -5 3 1804 94
3 -2 4 -5 3 1805 21

The first line tells us that in 1804, the melody with the interval sequence 3 -2 4 -5 3
occurred 94 times overall.
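
As a minimal sketch, a line in this format could be read in Python as shown below. The names Record, parse_line and iter_records are hypothetical helpers, not part of any published tool. The sketch assumes that the three fields are separated by single tab characters, that the spaces inside the ngram field belong to the ngram itself, and that an unknown year appears literally as "?".

```python
import gzip
from typing import Iterator, NamedTuple, Optional


class Record(NamedTuple):
    ngram: str
    year: Optional[int]  # None when the year field is "?" (assumption)
    match_count: int


def parse_line(line: str) -> Record:
    """Split one 'ngram TAB year TAB match_count' line into typed fields."""
    ngram, year, count = line.rstrip("\r\n").split("\t")
    return Record(ngram, None if year == "?" else int(year), int(count))


def iter_records(path: str) -> Iterator[Record]:
    """Stream records from a gzipped file without loading it into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_line(line)
```

For instance, parse_line("3 -2 4 -5 3\t1804\t94\n") yields a record whose ngram is "3 -2 4 -5 3", whose year is 1804 and whose match_count is 94. Streaming with a generator is a deliberate choice here, since a single file can hold many millions of lines.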

The format of the total counts file is identical,
except that the ngram field is absent:
there is only one value match_count per year.
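
A relative frequency can then be obtained by dividing an n-gram's match_count by the total for the same year. The following Python sketch illustrates this under the assumption that each line of the total counts file reads "year TAB match_count". The function names are hypothetical, and the totals used in the example afterwards are invented for illustration only.

```python
from typing import Dict, Iterable


def parse_total_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Map each year (kept as a string, so "?" also works) to its total."""
    totals: Dict[str, int] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        year, count = line.split("\t")
        totals[year] = int(count)
    return totals


def relative_frequency(match_count: int, year: str, totals: Dict[str, int]) -> float:
    """Share of the year's total 1-gram count taken up by this n-gram."""
    return match_count / totals[year]
```

With an invented total of 1000 for the year 1804, relative_frequency(94, "1804", parse_total_counts(["1804\t1000\n"])) evaluates to 94 / 1000.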

Inside each file the ngrams are sorted alphabetically and, within each ngram,
chronologically by year.
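
Because all lines of one ngram are adjacent, a per-ngram time series can be collected in a single pass without holding the whole file in memory. A minimal Python sketch follows, assuming the records are already available as (ngram, year, match_count) tuples in file order; series_by_ngram is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, int, int]  # (ngram, year, match_count)


def series_by_ngram(records: Iterable[Row]) -> Iterator[Tuple[str, List[Tuple[int, int]]]]:
    """Group consecutive records that share an ngram into one time series.

    groupby only merges adjacent items, so this relies on the sort order
    of the files described above.
    """
    for ngram, group in groupby(records, key=itemgetter(0)):
        yield ngram, [(year, count) for _, year, count in group]
```

Applied to the two example lines above, this produces a single entry for "3 -2 4 -5 3" with the series [(1804, 94), (1805, 21)].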

Petrucci Music Library - Melodies

Version 20110401

Petrucci Music Library - Transposed Chord Progressions

This dataset contains chord progressions of up to four chords in length and their counts.
The chords represent all simultaneously active notes over all voices of a score.
This means that notes need not share the same onset time in order to appear in the same chord.

Counts of progressions contained in scores for which no year of composition/first publication is known
are stored under the "?" year.

The entries represent classes of chord sequences that are equivalent up to a pitch shift.
If the first chord of a sequence consists of multiple notes, the pitch of the lowest note is not stored
and the chord starts with an underscore sign "_". The following number indicates the difference in
semitones between the lowest and the second lowest notes. If the first chord consists of a single note,
then the ngram begins with a number indicating the difference in semitones between that single note and
the lowest note of the second chord.
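
The idea of equivalence up to a pitch shift can be illustrated with a small Python sketch. It does not reproduce the string encoding used in the files. It only assumes that chords are given as collections of pitches in semitones (for example MIDI note numbers), and it shifts a progression so that the lowest note of the first chord becomes 0. The name transposition_class is hypothetical.

```python
from typing import Sequence, Tuple

Chord = Tuple[int, ...]


def transposition_class(progression: Sequence[Sequence[int]]) -> Tuple[Chord, ...]:
    """Return a canonical form shared by all transpositions of a progression.

    Every pitch is shifted by the same amount, so the intervals inside and
    between chords are preserved while the absolute pitch is discarded.
    """
    if not progression or not progression[0]:
        raise ValueError("progression must start with a non-empty chord")
    base = min(progression[0])  # lowest note of the first chord
    return tuple(tuple(sorted(p - base for p in chord)) for chord in progression)
```

Under this sketch, C major followed by G major, [[60, 64, 67], [55, 59, 62]], and D major followed by A major, [[62, 66, 69], [57, 61, 64]], map to the same canonical form and would therefore be counted as one entry.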

Version 20110830

Petrucci Music Library - Exact Chord Progressions

This dataset contains chord progressions of up to four chords in length and their counts.
The chords represent all simultaneously active notes over all voices of a score.
This means that notes need not share the same onset time in order to appear in the same chord.

Counts of progressions contained in scores for which no year of composition/first publication is known
are stored under the "?" year.