Music information processing
has been widely deployed in music industries over the years.
Of course,
technologies oriented to musicians have long been studied
including sound synthesis on music synthesizers,
desktop music production based on MIDI sequencers,
and various kinds of support for
composing, performing, and recording music.
Such tools have already become
an essential part of the music-production process.
But more recently,
focus has shifted from these conventional tools to new technologies
that target the direct enjoyment of music
by end users who are not musicians.
For example,
it has become relatively easy to ``rip'' audio signals
from compact discs (CD) and compress them and
to deal with many musical pieces on a personal computer.
It has also become possible
to load a huge number of songs onto a portable music player
(e.g., Apple iPod)
enabling anyone to carry their personal collection of music anywhere and
to listen to it at anytime.

A variety of factors
can be given for this trend including advances in computer hardware
(high processing speeds and large-capacity/small-size memory and hard disks),
spread of the Internet,
and provision of low-cost audio input/output devices as standard equipment.
The standardization of MPEG Audio Layer 3 (MP3) in 1992 and
its spread in the latter half of the 1990s and
the establishment of MP3-based businesses in response to end-user demand
have also played a role here.
This trend is accelerating all the more
in the first half of the 2000s
with the proposal of Ogg Vorbis, MPEG-4 AAC, Windows Media Audio (WMA),
and other compression systems following on the heels of MP3.
Enterprises for delivering music via the Internet
are also appearing in rapid succession.

End users who are not musicians are not generally proficient in music
---
their knowledge of notes, harmony, and other elements of music
is usually limited.
Furthermore, they generally have little desire to create music.
They are quite interested,
however,
in retrieving and listening to their favorite music
or a portion of a musical piece in a convenient and flexible way.
Recent research themes of music information processing
reflect such end-user demand.
The target of processing is expanding
from the internal content of individual musical pieces
(notes, chords, etc.)
to entire musical pieces and even sets of musical pieces.
Accordingly,
research is becoming active in music systems
that can be used by people with no musical knowledge.
Typical technologies driving this trend are
technology for computing similarity between musical pieces and
for retrieving and classifying music;
technology for referring to what music friends and other people listen to
and for selecting music accordingly;
and technology for creating advanced music-handling interfaces.

Focusing on this emerging research trend,
this paper introduces recent studies on music information processing
from a unique perspective.

In contrast to past research
that focused on the internal contents of individual musical pieces,
the past ten years have seen
the growth of a new research field
targeting the retrieval, classification, and management of
large sets of music in which a single musical piece is treated as a unit.
This field,
which is called Music Information Retrieval (MIR),
has become quite active,
and since 2000,
the International Conference on Music Information Retrieval (ISMIR)
has been held annually.
A variety of topics are being researched in this field,
but in the following,
we introduce three ways of retrieving music based on audio signals
as opposed to text searches based on bibliographic information
(such as titles and artist names from CDDB,
an online database of CD information).

2.1. Research on Melody: Query by Humming

As the name implies,
Query by Humming (QBH)
enables one to retrieve the title of a musical piece
by humming or singing its melody using sounds like
``la-la-la...''
In other words,
humming or singing a melody becomes
the search key for finding a musical piece with that melody.
The use of such search keys raises some issues,
however,
such as how to deal with errors when singing off key and
how to absorb differences in key and tempo.
Specific methods differ in terms of the database used,
which may consist of melodies only
[1, 2, 3, 4, 5],
standard MIDI files (SMF) of entire musical pieces
[6, 7, 8],
or audio signals of entire musical pieces
[9, 10].
If we use a melody-only database,
similarity with a search key can be directly computed,
but for SMF,
the track containing a melody
must be identified before computing similarity.
In the case of audio signals,
similarity with a melody included in a mixture of sounds must be computed,
which is even harder to achieve.

For someone who would like to know the title of a musical piece
that is currently playing on the street or elsewhere,
this retrieval method enables that title to be identified
based on a fragment of that piece, which can be recorded on a cellular phone.
The fragment is therefore the search key and
the method searches for the musical piece containing that fragment.
Important issues here are
how to achieve efficient searching and
how to absorb acoustic fluctuations caused by noise and distortion
on the transmission path.
Proposed methods include
the time-series active search method
based on histograms of vector-quantized power-spectrum shapes
[11]
and
a method based on patterns of power-spectrum peaks
[12].

Given that one likes certain musical pieces,
this method searches for another musical piece having a similar feeling.
The search key is musical pieces themselves and
the method searches for a similar piece.
To this end,
similarity must be defined based on various features
such as timbral texture within a piece (power-spectrum shape)
[13, 14],
rhythm
[14, 15, 16, 17],
modulation spectrum
[18],
and singer voice
[19].
Similarity is also important for purposes other than retrieval.
For example,
the use of similarity to automatically classify musical pieces
(into genres, music styles, etc.)
is also being researched
[14, 17, 18, 19, 20].
It is difficult, however, to compute
the appropriate similarity between musical pieces
considering various factors.
This in conjunction with the understanding of musical audio signals
as introduced in the following section
needs further research in the years to come.

3. UNDERSTANDING MUSICAL AUDIO SIGNALS

Research related to the understanding of musical audio signals
has developed significantly over the last ten years.
Before that,
it was common to research
the segregation and extraction of individual sound components
making up an audio signal (sound source segregation) and
to use that information to automatically generate a musical score
(automatic transcription).
But in 1997,
in reconsideration of what it means for human beings to understand music,
a new research approach was proposed on music understanding
(Sect. 3.2,
Music scene description
[21, 22, 23])
based on the viewpoint that
listeners understand music
without segregating sound sources and
without mentally representing audio signals as musical scores
[24].
Research themes conforming to this approach,
including beat tracking, melody extraction, and music structure analysis,
have also been proposed.

This major development
on the understanding of musical audio signals
has been supported by advances
in hardware and in techniques for processing audio signals.
Ten years ago,
it was still difficult to calculate a Fast Fourier Transform (FFT)
in real time,
but nowadays,
it can be performed so fast that
the time required for its computation can essentially be ignored.
This jump in processing performance has let
researchers devise computationally intensive approaches
that could not be considered in the past, and
has also promoted the use of a wide range of statistical techniques.
For example,
techniques based on probabilistic models
such as the Hidden Markov Model (HMM)
and various techniques
making use of maximum likelihood estimation and Bayes estimation
have been proposed.

Automatic transcription has a long history as a research theme
going back to the 1970s, and
has progressed steadily
as the difficulty of the target music has increased
from monophonic sounds of melodies to
polyphonic sounds from a single instrument
and a mixture of sounds from several instruments.
This progression has been accompanied
by a shift toward more specialized research topics,
namely,
sound source segregation and
estimation of fundamental frequency (F0, perceived as pitch).

Because space does not allow an exhaustive introduction to
the many studies in this research field,
we here focus on new approaches
that first appeared in the past ten years.
In 1994,
Kashino et al.
introduced a method based on a probabilistic model and
implemented as a process model called OPTIMA
[25, 26].
This method was novel in its use of a graphical model
to describe the hierarchical structure of
frequency components, musical notes, and chords
and in determining the most likely interpretation
based on this hierarchical relationship.
Then, in 1999,
Goto proposed a predominant-F0 estimation method (PreFEst)
that does not assume the number of sound sources
[21, 23, 27].
This method prepares probability distributions
that represent the shape of harmonic structures
for all possible F0s, and
models input frequency components
as a mixture (weighted sum) of those distributions.
It then estimates the parameters of this model
---
the amplitude (weight) of each component sound
in the input sound mixture and
the shape of its harmonic structure
---
by using Maximum A Posteriori Probability (MAP) estimation
executed by the Expectation-Maximization (EM) algorithm.
This method can be extended, in principle, to an inharmonic structure
[23, 27],
and as such,
can be considered a framework for understanding general sound mixtures.

Other proposed methods include
a method for sequentially determining the components in a sound mixture
by repeatedly estimating the predominant F0 and
removing its harmonic components
[28];
a method for estimating model parameters
such as the number of simultaneous sounds,
the number of frequency components making up each sound,
F0s, and amplitude
by modeling the signal as
a weighted sum of sound waveforms in the time domain
and applying the Markov Chain Monte Carlo (MCMC) algorithm
[29];
a method for estimating notes, tempo, and waveforms
by associating them with a graphical model
that models the waveform-generation process
when performing a musical score at a certain (local) tempo
[30];
and
a method that formalizes the problem
as the clustering of frequency components
under harmonic-structure constraints and
determines the number of clusters (sound sources)
that minimizes the Akaike Information Criterion (AIC)
so as to estimate the median (F0) and weight (amplitude) of each cluster
[31].

3.2. Music Scene Description

Music scene description
[21, 22, 23]
aims to achieve an understanding of musical audio signals
at the level of untrained listeners.
This contrasts with most studies in the past
that aimed to achieve it at the level of trained musicians
by identifying all musical notes forming a musical score or
obtaining segregated signals from sound mixtures.
Music scene description
features the description of ``scenes'' that
occur within a musical performance
such as melody, bass, beat, chorus and phrase repetition,
structure of the musical piece,
and timbre of musical instruments.
The following introduce methods for obtaining descriptions of such scenes.

Melody and bass lines

Estimating the fundamental frequency (F0) of melody and bass lines
in CD recordings
containing sounds of various musical instruments
was first achieved in 1999
by applying the previously mentioned PreFEst method
[21, 23, 27]
with appropriate frequency-range limitation.
Another method for estimating the F0 of the bass line
was later proposed by Hainsworth et al.
[32].

Beat

While beat tracking and measure (bar line) estimation
for obtaining a hierarchical beat structure
(including beat and measure levels)
were researched in the 1980s using MIDI signals,
Goto et al. began
a series of studies
[24, 33, 34, 35, 36]
on real-world sound mixtures in 1994.
Then,
taking a hint from these studies,
Scheirer proposed a beat-tracking method
that could accommodate changes in tempo
[37].
Still later,
a variety of methods having even fewer restrictions were proposed
[38, 39, 40].
Methods targeting MIDI signals
[41, 42]
have also progressed significantly in recent years.

Chorus and phrase repetition, music structure

On entering the 2000s,
new approaches appeared
based on the detection of similar sections
that repeat within a musical piece (such as a repeating phrase).
These led to methods
for extracting the most representative section of a musical piece
(usually the chorus) from one location
[43, 44, 45];
music-summarization methods that shorten a musical piece
leaving only main sections
[46, 47];
and the RefraiD method that exhaustively detects all chorus sections
[48].
Among these methods,
RefraiD focuses on chorus detection
with the capability of determining
the start and end points of every chorus section
regardless of whether a key-change occurs.

Timbre of musical instruments

The second half of the 1990s saw
the development of sound-source identification methods
[49, 50]
that recognize the musical-instrument name
of each component sound in a polyphonic sound mixture
simultaneously with F0 estimation.
This period also saw
the development of methods for estimating the timing of drum sounds
in musical performances
[33, 34, 36, 51, 52].
Sound-source identification methods for isolated monophonic sounds
have also been researched from various viewpoints
[53, 54, 55, 56, 57, 58].

To respond directly to end-user demands
for flexible retrieval of music
for one's listening pleasure,
much research has been targeting the extraction and usage of metadata
to enhance the listening of musical pieces
or to facilitate their retrieval.
Such metadata include
information on the composers and performers of musical pieces and
the listener's preferences with regard to those pieces.

End users find it convenient
if they can refer to a list of music
that other people listened to
as a basis for selecting their own music.
On the Internet,
music-sales sites (e.g., Amazon) and
music-review sites (e.g., Allmusic)
perform daily collections of metadata
including user evaluations and impressions and purchase history.
This information can be subjected to collaborative filtering
to achieve services that promote the purchasing of music
by recommending artists and albums to users
[59, 60, 61]
and proposing playlists
[62, 63, 64, 65].
(The original meaning of ``playlist'' is
a broadcast or concert program of musical pieces
but it here means a list of musical pieces
to be played back on a media player or other device.)

Because collaborative filtering by itself
cannot easily deal with unknown new musical pieces,
it can be reinforced with content-based filtering
[61].
This makes the music-understanding methods described in
Section 3
all the more important.
Using the results of those methods
can aid in the generation of
more appropriate recommendations and playlists
based on the acoustical features and content of musical pieces.

As an example of using both metadata and acoustical features,
Whitman et al.
have developed methods for detecting artist styles
[20]
and identifying artists
[66]
by combining acoustical features of artist's musical pieces
with statistical data on words or phrases
on WWW pages that include that artist's name.
Ellis et al., moreover, have investigated
an automatic measure of the similarity between artists
by extracting metadata from lists of similar artists on a music-review site,
from end-user music collections,
and from statistical data on words or phrases
on WWW pages that include artist names
[67].
In addition,
Berenzweig et al. have compared
similarity based on such metadata with
similarity based on audio signals
[68].

While methods for encoding only musical-score information
have already reached a sufficiently practical level
[69],
there are currently several XML-based proposals
for music-description methods (including metadata) and
their standardization.
For example,
MusicXML
[70]
and WEDELMUSIC
[71]
have been proposed for describing music
at a symbol level including musical-score information.
Likewise,
MPEG-7 Audio
[72]
has been standardized for describing metadata
related to musical audio signals
such as melody contour and statistical data on the power spectrum.
We expect various research and development activities
conforming to these proposals to appear.

To enable end users without detailed knowledge of music
to deal with music on their own terms,
it is important that new types of interfaces be developed
since existing tools designed for musicians are
not sufficient for this purpose.

To achieve interfaces that can be used naturally without hassle,
we can consider the use of real-world objects themselves as interfaces.
Here,
it is important that such an approach conforms to conventional usage.
The musicBottles
[73]
and FieldMouse
[74]
are two examples of this real-world approach.
The musicBottles is a music-playback interface
that associates each musical-instrument part in a musical piece
with different glass bottles
and then enables a user to play the sound of any part
only when the cap of that bottle is in the open position.
The FieldMouse is an input device
that combines an ID-tag detector such as a barcode reader
with a relative-position detector such as a mouse.
A user can then select musical pieces and adjust the playback volume,
for example,
by moving the FieldMouse through space to read ID-tags
that correspond to those operations.

In conventional listening stations and media players,
an end user who would like to listen to only the chorus section of a song
must search for it himself by pressing the fast-forward button
repeatedly.
To make this task easier for an end user,
SmartMusicKIOSK
[75]
adds a ``NEXT CHORUS'' button
by employing the RefraiD method
[48]
described earlier.
Pressing this button
forces a jump to the next chorus section in that song
through automatic chorus detection.
This makes it easy for a user
to immediately skip a section (part) of no interest within a song
much like
the ``NEXT TRACK'' button on a CD player enables him
to skip a song (track) of no interest.

Considering that
end users will be faced with various types of music systems in the future,
they should be able to input musical information
(such as melody and rhythm)
associated with specific musical pieces.
Effective means to this end include humming
(Sect. 2.1)
and onomatopoeia as introduced in the following.
A music notation system ``Sutoton Music''
[76]
enables a melody to be described in text form in the manner of
``do re mii so-mmi re do''
for playback on a computer.
A drum-pattern retrieval method by voice percussion (beatboxing)
[77]
aims to recognize the drum part of a musical piece
that the user utters using natural sounds like
``dum ta dum-dum ta''
and to search for that piece on the basis of that drum part.

It was mentioned earlier that
end users do not generally have much interest in creating music.
Nevertheless,
if easy-to-use support functions for creating music
could be embedded in social networking tools (e.g., Orkut),
music might also become a means of communication for end users.

One example of music systems
that place particular emphasis on inter-user communication is
CosTune
[78].
In this system,
pads that control different sounds are attached to a user's jacket or pants
and touching these pads in a rhythmical manner
enables the user to jam with other nearby users via a wireless network.
Another example is Music Resonator
[79],
which enables a user
to process and edit annotated fragments of musical pieces and
to share that music with other users for collaborative music productions.
In addition,
RemoteGIG
[80]
enables remotely located users to
jam together
along a repetitive chord progression like 12-bar blues
in real time over the Internet
despite its relatively large latency.
This system overcomes the network latency
by having users listen to each other's performance
delayed by just one cycle of the chord progression (several tens of seconds).

There are other interesting themes
not introduced in this paper
that are being actively researched
in the field of music information processing.
While we have here focused on topics related to audio signal processing,
research into symbol-level processing including musical scores and MIDI
has also been progressing.
For example,
there has been much work
on computing similarity between melodies at a symbol level
[81]
and research on musical structure and expression in musical performances
[82].
Fusion of symbol processing and audio signal processing
is still far from sufficient
and will be targeted as an important issue in the future.
Such fusion will bridge a gap between them and
enable symbol processing to be based on proper symbol grounding
and audio signal processing to cover abstract semantic computing,
eventually achieving music computing
that reflects the manifold meaning of music.

The research environment for music information processing
is also expanding.
The years 2000 and 2001 saw
the construction of the world's first copyright-cleared music database
``RWC Music Database''
that can be used in common for research purposes
[83].
This database makes it easier
to use music for comparing and evaluating various methods,
for corpus-based machine learning,
and
for publishing research and making presentations
without conventional copyright restrictions.
Considering that
shared databases of various kinds
have long been constructed in other research fields
and have made significant contributions to their advancement,
we anticipate the RWC Music Database
to contribute to the advancement of music information processing
in a similar way.

Ten years ago,
it was necessary for us to explain
that music information processing was not ``amusement''
but a real research topic.
Today, it is common sense to treat it as an important research field.
This field is now experiencing
the birth of large-scale projects one after another,
an increase in international conferences year by year,
and an ever increasing number of researchers.
We look forward to
further advances in music information processing research.

Masataka Goto
received his Doctor of Engineering degree in
Electronics, Information and Communication Engineering
from Waseda University, Japan, in 1998.
He then joined the Electrotechnical Laboratory
(ETL; reorganized as the
National Institute of Advanced Industrial Science and Technology (AIST)
in 2001),
where he has been engaged as a researcher ever since.
He served concurrently as a researcher in
Precursory Research for Embryonic Science and Technology (PRESTO),
Japan Science and Technology Corporation (JST) from 2000 to 2003.
His research interests include
music information processing and spoken language processing.
Dr. Goto received
the IPSJ Yamashita SIG Research Awards
(MUS and SLP)
from the Information Processing Society of Japan (IPSJ),
Best Paper Award for Young Researchers from the
Kansai-Section Joint Convention of Institutes of Electrical Engineering,
WISS 2000 Best Paper Award and Best Presentation Award,
Awaya Prize for Outstanding Presentation and
Award for Outstanding Poster Presentation
from the Acoustical Society of Japan (ASJ),
Award for Best Presentation
from the Japanese Society for Music Perception and Cognition (JSMPC),
and Interaction 2003 Best Paper Award.
He is a member of the ASJ, IPSJ, JSMPC,
Institute of Electronics, Information and Communication Engineers (IEICE),
and
International Speech Communication Association (ISCA).

Keiji Hirata
received his Doctor of Engineering degree in
Information Engineering
from University of Tokyo, Japan, in 1987.
He then joined NTT Basic Research Laboratories.
He spent 1990 to 1993 at the
Institute for New Generation Computer Technology (ICOT),
where he was engaged in the research and development of
parallel inference machines.
In 1999,
he joined NTT Communication Science Laboratories,
where he has been engaged as a researcher ever since.
His research interests include
musical knowledge programming and interaction.
Dr. Hirata received
the IPSJ Best Paper Award
from the Information Processing Society of Japan (IPSJ) in 2001,
and is a member of the IPSJ,
Japanese Society for Artificial Intelligence (JSAI),
and Japan Society for Software Science and Technology.