Abstract

Rhythmic information plays an important role in Music Information Retrieval. Example applications include automatically annotating large databases by genre, meter, ballroom dance style, or tempo; fully automated DJing; and audio segmentation for further retrieval tasks such as automatic chord labeling. In this article, we therefore provide an introductory overview of basic and current principles of tempo detection. Subsequently, we show how to improve on these by including ballroom dance style recognition. We introduce a set of 82 rhythmic features for rhythm analysis on real audio. With this set, data-driven identification of the meter and ballroom dance style, employing support vector machines, is carried out in a first step. Next, this information is used to detect the tempo more robustly. We evaluate the suggested method on a large public database containing about 1.8k titles of standard and Latin ballroom dance music. Following extensive test runs, a clear boost in performance can be reported.

1. Introduction

Music Information Retrieval (MIR) has been a growing field of research over the last decade. The increasing popularity of portable music players and music distribution over the internet has made worldwide, instantaneous access to rapidly growing music archives possible. Such archives must be well structured and sorted in order to be user friendly. For example, many users face the problem of having heard a song they would like to buy but not knowing its bibliographic data, that is, title and artist, which is necessary to find the song in conventional (online) music stores. According to Downie [1], almost three quarters of all MIR queries are of a bibliographic nature. The querying person gives the information he or she knows about the song, most likely genre, meter, tempo, lyrics, or acoustic properties such as tonality, and requests information about title and/or artist. In order to have machines assist in building a song database queryable by features such as tempo, meter, or genre, intelligent Information Retrieval algorithms are necessary to automatically extract such high-level features from raw music data. Many works describe or give overviews of basic MIR methods, for example, [2–8]. Besides tonal features, temporal features play an important role. Tempo, meter, and beat locations form the basis for segmenting music and thus for further feature extraction such as chord change detection or higher-level metrical analysis, for example, as performed in [9]. Because of its importance, we focus primarily on robust tempo detection in this article.

Existing state-of-the-art tempo detection algorithms are, generally speaking, based on methods of periodicity detection. That is, they use techniques such as autocorrelation, resonant filter banks, or onset time statistics to detect the tempo. A good comparison and overview is given in [10]. However, very little work exists that combines various low-level detection methods, such as tempo induction, meter recognition, and beat tracking, into a system that uses features from all these subtasks to perform robust high-level classification tasks, for example, ballroom dance style or genre recognition, and in turn uses the classification results to improve the low-level detection results. Only a few works, such as [11, 12], present data-driven genre and meter recognition. Other methods, such as [13, 14], use rhythmic features only for specific tasks, like audio identification, and do not use rhythmic features in a multistep process to improve the results themselves.

A novel approach that aims at robust, data-driven rhythm analysis primarily targeted at database applications is presented in this article. A compact set of low-level rhythmic features is described, which is highly suitable for discrimination between duple and triple meter as well as for ballroom dance style classification. Based on the results of data-driven dance style and meter recognition, the quarter-note tempo can be detected very reliably, reducing errors in which half or twice the true tempo is detected. Beat tracking at the beat level for songs with an approximately constant tempo can be performed more reliably once the tempo is known; however, it is not discussed in this article. A beat tracking method that can be used in conjunction with the new data-driven rhythm analysis approach is presented in [15]. Although the primary aim of the presented approach is to robustly detect the quarter-note tempo, the complete procedure is referred to as rhythm analysis, because meter and ballroom dance style are also detected and used in the final tempo detection pass.

The article is structured as follows. In Section 2, an introduction to tempo detection, meter recognition, and genre classification is given along with an overview of selected related work. Section 3 describes the novel approach to improved data-driven tempo detection through prior meter and ballroom dance style classification. The results are presented in Section 4 and compared to results obtained at the ISMIR 2004 tempo induction contest, followed by the conclusion and outlook in Section 5.

2. Related Work

Tempo induction, beat tracking, and meter detection methods can roughly be divided into two major groups. The first group consists of those that attempt to explicitly find onsets in a first step (or use onsets obtained from a symbolic notation, e.g., MIDI), and then deduce information about tempo, beat positions, and possibly meter by analyzing the interonset intervals (IOIs) [9, 16–21]. The second group contains those that extract information about the tempo and metrical structure prior to onset detection. Correlation or resonator methods are mostly used for this task. If onset positions are required, onset detection can then be assisted by information from the tempo detection stage [2, 4–6, 8, 22].

The more robust methods, especially for database applications, are those from the second group. However, we will first explain the concept of onset detection used in the methods of the first group, as we believe it is a very intuitive way to approach the problem of beat tracking and tempo detection.

Before we start explaining the tempo induction methods, we take a look at some music terminology regarding meter. The metrical structure of a musical piece is composed of multiple hierarchical levels [23], where the period of each higher level is an integer multiple of the period of the lowest level. The latter is called the tatum level. The level at which we tap along when listening to a song is the beat or tactus level. Sometimes this tempo is referred to as the quarter-note tempo. The bar or measure level corresponds to the bar in notated music, and the period of its tempo gives the length of a measure. The relation between measure and beat level is often referred to as the time signature or, more generally, the meter.

To get familiar with the concept of onset detection, on which the first group of algorithms is based, let us assume that a beat basically corresponds to a sudden increase in the signal (energy) envelope. This is a very simplified assumption, which is valid only for music containing percussion and strong beats. There are basically two methods for computing an audio signal envelope (depicted in Figure 1) suitable for onset detection.

Figure 1

Waveform and envelope (dashed line) of 4 seconds from "OMD—Maid of Orleans."

(1) Full-wave rectification and lowpass filtering of the signal, followed by downsampling to approximately 100 Hz.

(2) Dividing the signal into small windows with a length of around 20 milliseconds and approximately 50% overlap, and then calculating the RMS energy of each window from the squared sample values in the window. This can be followed by an additional lowpass filter for smoothing purposes.

The first order differential of the resulting (energy) envelope is then computed (Figure 2). A local maximum in the differential of the envelope corresponds to a strong rise in the envelope itself. By picking peaks in the differential that are above a certain threshold (e.g., the mean value or a given percentage of the maximum of the differential over a certain time window) some onsets can be located. The magnitude, or strength, of the onset is related to the height of the peak.
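The envelope and peak-picking steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in this article: it uses envelope method (2) without the optional smoothing filter, takes the mean of the differential as the threshold, and the function names `rms_envelope` and `pick_onsets` are hypothetical.

```python
import math

def rms_envelope(x, win, hop):
    """RMS energy of overlapping windows (envelope method (2), no smoothing)."""
    env = []
    for start in range(0, len(x) - win + 1, hop):
        frame = x[start:start + win]
        env.append(math.sqrt(sum(s * s for s in frame) / win))
    return env

def pick_onsets(env):
    """Envelope indices reached by a peak of the first-order differential."""
    diff = [env[i] - env[i - 1] for i in range(1, len(env))]
    threshold = sum(diff) / len(diff)  # assumption: mean value as threshold
    onsets = []
    for i in range(1, len(diff) - 1):
        is_peak = diff[i] > diff[i - 1] and diff[i] >= diff[i + 1]
        if is_peak and diff[i] > threshold:
            onsets.append(i + 1)       # diff[i] is the rise into env[i + 1]
    return onsets
```

The height of the differential at each returned index could additionally be kept as the onset strength.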

Figure 2

Differential of envelope of 4 seconds from "Maid of Orleans."

Figure 3

Differentials of frequency band envelopes from 10 seconds of "Maid of Orleans."

Figure 4

Periodic ACF of band envelope differentials from 10 seconds of "Maid of Orleans."

Figure 5

Distance matrix for 10 seconds from the beginning of "Maid of Orleans" (OMD). White spots have a high correlation (or low distance) and black spots a low correlation (or high distance).

In [5], Scheirer states that the amplitude envelope does not contain all rhythmic information. Multiple nonlinear frequency bands must be analyzed separately, and the results are combined at the end. To improve the simple onset detection introduced in the last paragraph, the signal can be split into six nonlinear bands using a bandpass filter bank. Onsets are still assumed to correspond to an increase in the amplitude envelope, now not of the full-spectrum signal but of each bandpass signal. Therefore, the same onset detection procedure as described above can be performed for each bandpass signal. This results in onset data for each band. The data of the six bands must then be combined. This is done by adding the onsets of all bands and merging onsets that are sufficiently close together. Such a multiple-band approach gives better results for music in which no strong beats, such as the bass drums in electronic dance music, are present. A more advanced discussion of onset detection in multiple frequency bands is presented in [24].

All methods presented up to this point are based on detecting a sudden increase in signal energy. In recent years, phase-based [25] or combined energy/phase approaches [26] introduced by Bello et al. have been shown to give better results than energy-only approaches. Basically, onset detection incorporating phase and energy, that is, operating in the complex domain, is based on the assumption that there is both a notable phase deviation and an energy increase when an onset occurs. Yet, to preserve the general and introductory nature of this overview and to focus more on tempo detection, we will not go into detail on these techniques.

For tempo detection from onset data mainly a histogram technique is used in the literature [2, 18]. The basic idea is the following: duration and weight of all possible IOIs are computed. Similar IOIs are grouped in clusters and the clusters are arranged in a histogram. From the weights and the centers of the clusters the tempo of several metrical levels can be determined. Dixon in [2] uses a simple rule-based method. Seppänen in [18] uses a more advanced method. He extracts only the tatum pulse level (fastest occurring tempo) directly from the IOI histogram, by picking the cluster with the center corresponding to the smallest IOI. Features in a window around each tatum pulse are extracted. Using Bayesian pattern recognition, the tatum pulses are classified with respect to their perceived accentuation. Thus, the beat level is detected by assuming that beats are more accented than offbeat pulses. Although Seppänen's work stops at the tatum level, the score level could be detected in the same way, assuming that beats at the beginning of a score are more accented than beats within.
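As an illustration of the histogram idea, the following sketch groups all pairwise IOIs into fixed-width bins and picks the populated bin with the smallest IOI as a tatum estimate. It is a simplification under stated assumptions: IOIs are unweighted, clustering is replaced by fixed-width binning, and the names `ioi_histogram` and `tatum_period` as well as the parameters `max_ioi`, `bin_width`, and `min_count` are hypothetical.

```python
from collections import Counter

def ioi_histogram(onsets, max_ioi, bin_width):
    """Histogram of all pairwise inter-onset intervals up to max_ioi (seconds)."""
    hist = Counter()
    for i, t0 in enumerate(onsets):
        for t1 in onsets[i + 1:]:
            ioi = t1 - t0
            if ioi <= max_ioi:
                hist[round(ioi / bin_width)] += 1  # similar IOIs share one bin
    return hist

def tatum_period(hist, bin_width, min_count):
    """Centre of the sufficiently populated bin with the smallest IOI."""
    bins = [b for b, count in hist.items() if count >= min_count]
    return min(bins) * bin_width
```

For onsets spaced 0.25 seconds apart, the estimated tatum period is 0.25 seconds, that is, 240 pulses per minute.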

We will now take a look at the second group of algorithms, which attempt to find the tempo without explicitly detecting onsets. It is still assumed that rhythmic events such as beats, percussion, or note onsets correspond to a change in signal amplitude in a few nonlinear bands. Again, we start with either the envelopes or the differentials of the envelopes of the six frequency bands but omit the step of peak picking. To keep this overview general, the term "detection function" [26] will be used in the following to refer to either the envelope, its differential, or any other function related to perceivable change in the signal.

The beat level tempo, which is what we are interested in at this point, can be viewed as a periodicity in the envelope function. A commonly used method to detect periodicities in a function is autocorrelation [8, 27]. The periodic autocorrelation is computed over a small window (10 seconds) of the envelope function. The index of the highest peak in the autocorrelation function (ACF) indicates the strongest periodicity. However, as findings in [28] suggest, the strongest periodicity in the signal may not always be the dominant periodicity perceived. The findings suggest an interval of preferred tapping linked to a supposed resonance between our perceptual and motor system. Still, as a first guess, which will work fairly well on music with strong beats in the preferred tapping range, the highest peak can be assumed to indicate the beat level tempo. We also have to combine the results from all bands. The simplest way is to add up the ACF of all bands and pick the highest peak in the summary ACF (SACF). Determining the tempo for each band and choosing the tempo that was detected in the majority of bands as the final tempo is an alternative method. Dixon describes a tempo induction method based on autocorrelation in [2]. Uhle et al. use autocorrelation for meter detection in [8].
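A minimal sketch of tempo estimation by periodic autocorrelation of a single detection function is given below. The restriction of the lag search to a BPM range and the names `periodic_acf` and `acf_tempo` are assumptions made for illustration; for several bands, the per-lag values would be summed before picking the peak.

```python
def periodic_acf(d, lag):
    """Periodic (circular) autocorrelation of detection function d at one lag."""
    n = len(d)
    return sum(d[i] * d[(i + lag) % n] for i in range(n))

def acf_tempo(d, frame_rate, min_bpm, max_bpm):
    """Tempo in BPM at the highest ACF peak within the given tempo range."""
    shortest = int(round(60.0 * frame_rate / max_bpm))  # lag of fastest tempo
    longest = int(round(60.0 * frame_rate / min_bpm))   # lag of slowest tempo
    best = max(range(shortest, longest + 1),
               key=lambda lag: periodic_acf(d, lag))
    return 60.0 * frame_rate / best
```

For an ideal pulse train, the ACF is equally high at every integer multiple of the beat period, which hints at why half-tempo and double-tempo errors occur so easily with this kind of method.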

An alternative to autocorrelation is a resonant filter bank consisting of resonators tuned to different frequencies (periodicities), first introduced for beat tracking by Scheirer in [5]. The detection function is fed to all resonators, and the total output energy of each resonator is computed. In analogy to the highest autocorrelation peak, the resonator with the highest output energy matches the song's periodicity best, and thus the beat level tempo is assumed to be its resonance frequency. As explained in the last paragraph, this assumption does not fully match our perception of rhythm. This is one reason why it is so difficult, even for most state-of-the-art systems, to reliably detect the tempo on the beat level. Octave errors, that is, cases where double/triple or half/third the beat level tempo is detected, are very common according to [10]. Even human listeners in some cases do not agree on a common tapping level.

All the methods introduced so far require the extraction of a detection function. Publications exist discussing how such a detection function can be computed, considering signal processing theory [26] and applying psychoacoustic knowledge [24]. In order to bypass the issue of selecting a good detection function, a different periodicity detection approach as was introduced for tempo and meter analysis by Foote and Uchihashi [4] can be used. This approach is based on finding self-similarities among audio features. First, the audio data is split into small (20–40 milliseconds) overlapping windows. Feature vectors containing, for example, FFT coefficients or MFCC [29] are extracted from these windows and a distance matrix is computed by comparing every vector with all the remaining vectors via a distance measure or cross-correlation.

Using (1), a so-called beat spectrum [4] can be computed from the distance matrix. This beat spectrum is comparable to the ACF or the output of the resonant filter bank in the previously discussed methods:

(1)

Although the choice of the feature set may still influence the performance, this method has an advantage over computing the ACF of a detection function. In computing the correlation or distance of every feature vector to every other feature vector, all possible relations between all features in all feature vectors are accounted for. Detection functions for separate frequency bands can only account for (temporal) relations within each band. If the detection function is a sum over all bands, for example, relations between the frequency bands are accounted for, but only in a very limited way. This case would correspond to reducing the feature vector to one dimension by summing its elements before computing the distance matrix.
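The self-similarity approach can be sketched as follows. The sketch assumes cosine similarity as the comparison measure and takes the beat spectrum as the mean of the similarity matrix along each diagonal, which is one simple variant described in [4]; it is not necessarily the exact form of (1). All function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity_matrix(features):
    """Compare every feature vector with every other feature vector."""
    return [[cosine_similarity(u, v) for v in features] for u in features]

def beat_spectrum(sim, max_lag):
    """Mean similarity along the diagonal at each lag from 0 to max_lag - 1."""
    n = len(sim)
    return [sum(sim[k][k + lag] for k in range(n - lag)) / (n - lag)
            for lag in range(max_lag)]
```

Peaks of the beat spectrum at nonzero lags indicate periodicities, in the same way as peaks of the ACF.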

However, computing distance matrices is a very time-consuming task and might thus not be applicable to, for example, live applications that demand real-time algorithms. For most mainstream music, it can be assumed that the sensation of tempo corresponds to a loudness periodicity, as can be represented by a single detection function or a set of detection functions for a few subbands. Therefore, even though in our opinion the distance matrix method seems to be the theoretically most advanced method, it is not used in the rhythm analysis method presented in the following.

In the remainder of this overview section, we give a very short overview of selected methods for meter detection, ballroom dance style recognition, and genre recognition.

Various works exist on the subject of genre recognition, for example, [30, 31]. The basic approach is to extract a large number of features representing acoustic properties for each piece of music to be classified. Using a classifier trained on annotated training data, the feature vectors extracted from the songs are assigned a genre. Reference [30] extracts features related to timbral texture, rhythmic content, and pitch content. The rhythmic features are extracted from the result of autocorrelation of subband envelopes. Gaussian mixture models (GMMs) and K-nearest-neighbour (K-NN) are investigated as classifiers, and a discrimination rate of 61% for 10 musical genres is reported. Reference [31] investigates the use of a large open feature set and automatic feature selection combined with support vector machines as classifiers. A success rate of 92.2% is reported for discrimination between 6 genres.

The subject of ballroom dance style recognition is relatively new. Gouyon et al. have published a data-driven approach to ballroom dance style recognition in [12]. They test various features extracted from IOI histograms using 1-NN classification. The best result is achieved with 15 MFCC like descriptors computed from the IOI histogram. 90.1% accuracy is achieved with these descriptors plus the ground truth tempo by 1-NN classifiers. Without ground truth tempo, that is, only the 15 descriptors, 79.6% accuracy is reported.

Meter detection requires tempo information from various metrical levels. Klapuri et al. introduce an extensive method to analyze audio on the tatum, pulse, and measure level [6]. For each level, the period is estimated based on periodicity analysis using a comb filter bank. A probabilistic model encompasses the dependencies between the metrical levels. The method is able to deal with changing metrical structures throughout the song. It proves robust for phase and tempo on the beat level, but still has some difficulties on the measure level. The method is well suited for in-depth metrical analysis of a wide range of musical genres. For a limited set of meters, for example, as in ballroom dance music, the complexity can be reduced, with a gain in accuracy, to binary decisions between duple and triple periods on the measure level. Gouyon et al. assume a given segmentation of the song on the beat level and then focus on a robust discrimination between duple and triple meter [11] on the measure level. For each beat segment, a set of low-level descriptors is computed from the audio. Periodic similarities of each descriptor across beats are analyzed by autocorrelation. From the output of the autocorrelation, a decisional criterion is computed for each descriptor, which is used as a feature in meter classification.

3. Rhythm Analysis

A data-driven rhythm analysis approach is now introduced that is capable of extracting rhythmic features and robustly identifying duple and triple meter, quarter-note tempo, and ballroom dance style based on 82 rhythmic features, which are described in the following sections.

Robustly identifying the quarter-note or beat level tempo is a challenging task, since octave errors, that is, where double or half of the true tempo is detected, are very common. Therefore, a new tempo detection approach, based on integrated ballroom dance style recognition, is investigated.

The tatum tempo [8, 18], that is, the fastest tempo, presents the basis for extracting rhythmic features. A resonator-based approach, inspired by [5], is used for detecting this tatum tempo and extracting features containing information about the distribution of resonances throughout the song.

The features are used to decide whether the song is in duple or triple meter. Confining the metrical decision to a binary one was introduced in [11]. For dance music, the discrimination between duple and triple meter has the most practical significance. Identifying various time signatures, such as 2/4, 4/4, and 6/8 is a more complicated task and of less practical relevance for ballroom dance music. The rhythmic features are further used to classify songs into 9 ballroom dance style classes. These results will be used to assist the tempo detection algorithm by providing information about tempo distributions collected from the training data for the corresponding class. For evaluation 10-fold stratified cross-validation is used. This is described in more detail in Section 3.5.

3.1. Comb Filter Tempo Analysis

The approach for tatum tempo analysis discussed in this article is based on Scheirer's multiple resonator approach [5] using comb filters as resonators. His approach has been adapted and improved successfully in other work for tempo and meter detection [6, 10, 32]. The main concept is to filter the envelopes or detection functions (see Section 2) of six nonlinear frequency bands through a bank of resonators. The resonance frequency of the resonator with the highest output energy is chosen as tempo. The comb filters used here are a slight variation of Scheirer's filters. In the following paragraphs, there will be a brief theoretical discussion of IIR comb filters and a description of the chosen filter parameters.

In the following, a tempo is specified as a frequency with the unit BPM (beats per minute). Where the index IOI is appended to a tempo symbol, the tempo is given as an IOI period in frames.

A comb filter adds a signal to a delayed version of itself. Every comb filter is characterized by two parameters: the delay (or period, which is the inverse of the filter's resonance frequency) and the gain.

For tempo detection IIR comb filters are used as described in the discrete time domain by (2),

(2)

The filter has a transfer function in the z-domain given by (3),

(3)

The frequency response for two exemplary values of the gain is depicted in Figure 6.

Figure 6

Frequency responses of IIR comb filters for two values of the gain.

To achieve optimal tempo detection performance, an optimal value for the gain must be determined. Scheirer's [5] method of constant half-energy time, which uses a variable gain depending on the delay, did not perform well in our test runs. Instead, we use a fixed value for the gain. When choosing this value, we have to consider the small temporary tempo drifts that occur in most music performances, so the theoretically optimal gain cannot be used. We conducted test runs with multiple gain values in the range from 0.2 to 0.99, and the value that gave the best results is used as the fixed gain.
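To make the resonance behaviour concrete, the following sketch implements an IIR comb filter with feedback over one delay and measures its output energy. It assumes the recursion y[t] = (1 - gain) * x[t] + gain * y[t - delay], which follows the form used by Scheirer [5]; the exact normalization of (2) may differ, and the function names and the example gain are hypothetical.

```python
def comb_filter(x, delay, gain):
    """IIR comb filter: y[t] = (1 - gain) * x[t] + gain * y[t - delay]."""
    y = []
    for t, sample in enumerate(x):
        feedback = y[t - delay] if t >= delay else 0.0  # zero initial state
        y.append((1.0 - gain) * sample + gain * feedback)
    return y

def output_energy(x, delay, gain):
    """Total energy of the filter output for one delay."""
    return sum(v * v for v in comb_filter(x, delay, gain))
```

A pulse train whose period equals the filter delay builds up in the feedback loop and yields a much higher output energy than a mismatched delay, for example `output_energy(pulses, 50, 0.8)` compared with `output_energy(pulses, 37, 0.8)` for pulses spaced 50 frames apart.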

3.2. Feature Extraction

The comb filters introduced in the previous section are used to extract the necessary features for ballroom-dance style recognition, meter recognition, and tempo detection. The key concept is to set up comb filter banks over a much broader range than used by [5] in order to include higher metrical layers. The resulting features describe the distribution of resonances among several metrical layers, which provides qualitative information about the metrical structure.

To effectively reduce the number of comb filters required, we exploit the fact that several metrical layers are present in music performances (see Section 2). In a first step, the tempo on the lowest level, the tatum tempo, is detected. It is then assumed that all possibly existing higher metrical levels can only have periods that are integer multiples of the tatum period. This is true for a wide variety of musical styles.

3.2.1. Preprocessing

The input data is downsampled and converted to a monophonic signal by adding the stereo channels in order to reduce computation time. The input audio is split into frames with an overlap of 0.57, resulting in a final envelope frame rate of 100 fps (frames per second). A Hamming window is applied to each frame, and a fast Fourier transform (FFT) of the frame is computed, resulting in 128 FFT coefficients.
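The relation between sampling rate, frame length, overlap, and envelope frame rate can be checked with a short calculation. The sampling rate of 11.025 kHz and the frame length of 256 samples used below are assumptions chosen to be consistent with 128 FFT coefficients and a frame rate of about 100 fps; the function name is hypothetical.

```python
def envelope_frame_rate(sample_rate, frame_len, overlap):
    """Frames per second for a given frame length and fractional overlap."""
    hop = frame_len * (1.0 - overlap)  # samples between successive frame starts
    return sample_rate / hop

# Assumed values: 11025 Hz sampling rate, 256-sample frames, overlap of 0.57.
rate = envelope_frame_rate(11025, 256, 0.57)
```

Under these assumptions the hop size is about 110 samples, which gives roughly 100 frames per second.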

By using overlapping triangular filters, equidistant on the mel-frequency scale, the 128 FFT coefficients are reduced to envelope samples of a small number of nonlinear bands. These triangular filters are the same as those used in speech recognition for the computation of MFCC [29].

Such a small set of frequency bands, still covering the whole human auditory frequency range, contains the complete rhythmic structure of the musical excerpt, according to experiments conducted in [5].

The envelope samples of each mel-frequency band are converted to a logarithmic representation according to the following equation:

(4)

The envelopes of the mel-frequency bands are then lowpass filtered by convolution with a half-wave raised cosine filter with a length of 15 envelope samples, equal to 150 milliseconds. The impulse response of the filter is given in (5). This filter preserves fast attacks but filters out noise and rapid modulation, much as in the human auditory system:

(5)

A weighted differential of each lowpass-filtered mel-frequency band envelope is taken according to (6). For each sample position, a moving average is calculated over one window of 10 samples to the left of the sample (left mean) and over a second window of 20 samples to the right of the sample (right mean):

(6)

This method is based on the fact that a human listener perceives note onsets as more intense if they occur after a longer time of lower sound level and thus are not affected by temporal post-masking caused by previous sounds [33]. The weighting with the right mean incorporates the fact that note duration and total note energy play an important role in determining the perceived note accentuation [18].

3.2.2. Tatum Features

For detecting the tatum tempo, an IIR comb filter bank consisting of 57 filters is used, with a fixed gain and delays ranging from 18 to 74 envelope samples. This filter bank is able to detect tatum tempos in the range from 81 to 333 pulses per minute. The range might need adjustment when very slow music is processed, that is, music with no tempo faster than 81 pulses per minute.

The weighted differential of each mel-frequency band envelope is fed as input to each filter of the bank. The total energy output over all bands is computed for each filter with (7):

(7)

The result of this step is the unflattened tatum vector with 57 elements, one for each delay in the range from 18 to 74. Examples of this vector for three songs are plotted in Figures 7 and 8.
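The computation of the unflattened tatum vector can be sketched as follows. The sketch assumes that the total energy is the sum of squared filter outputs over all frames and bands, and it reuses the comb filter recursion y[t] = (1 - gain) * x[t] + gain * y[t - delay] as an assumed form of (2); the function names and the example gain are hypothetical.

```python
def comb_energy(x, delay, gain):
    """Output energy of an IIR comb filter with the given delay and gain."""
    y, energy = [], 0.0
    for t, sample in enumerate(x):
        feedback = y[t - delay] if t >= delay else 0.0
        out = (1.0 - gain) * sample + gain * feedback
        y.append(out)
        energy += out * out
    return energy

def tatum_vector(bands, gain, min_delay=18, max_delay=74):
    """Total output energy over all bands for each delay in the filter bank."""
    return [sum(comb_energy(band, delay, gain) for band in bands)
            for delay in range(min_delay, max_delay + 1)]
```

For band signals with pulses every 25 frames, the largest element of the resulting 57-element vector lies at index 7, which corresponds to a delay of 25 frames.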

From this vector, three additional features are extracted that reveal the quality of the peaks.

(i) The first is computed by dividing the highest value by the lowest.

(ii) The second is the fraction of the first value over the last value.

(iii) The third is computed as the mean of the maximum and minimum value, normalized by the global mean.
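The three peak-quality features can be computed directly from the vector, as in the following sketch. The function name and the dictionary keys are hypothetical labels, not the symbols used in this article.

```python
def peak_quality_features(vec):
    """Three ratios describing how pronounced the peaks of a vector are."""
    highest, lowest = max(vec), min(vec)
    global_mean = sum(vec) / len(vec)
    return {
        "max_over_min": highest / lowest,                      # feature (i)
        "first_over_last": vec[0] / vec[-1],                   # feature (ii)
        "mid_over_mean": 0.5 * (highest + lowest) / global_mean,  # feature (iii)
    }
```

The sketch assumes strictly positive vector elements, as is the case for output energies of a signal with any activity.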

These features describe how clearly visible the peaks of the vector are and how flat the vector is (see Figures 7 and 8). Since the employed comb filters tend toward higher resonances at higher tempos for songs with little rhythmic content (Figure 7), the vector is adjusted, that is, flattened, by considering the difference between the average of the first 6 values and the average of the last 6 values. From the resulting flattened tatum vector, the two most dominant peaks are picked as follows. First, all local minima and maxima are detected; then, for each maximum, its apparent height is computed by taking the average of the maximum minus its left and right minimum. The indices of the two maxima with the greatest apparent height are considered possible tatum candidates. For each candidate, a confidence is computed as follows:

(8)

The candidate for which the confidence is maximal is called the final tatum tempo in the following. Conversion from the IOI period of the final tatum tempo to the final tatum tempo in BPM is performed by the following equation:

(9)
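The peak picking by apparent height and the conversion from an IOI period in frames to BPM can be sketched as follows. The confidence of (8) is not part of this sketch, the conversion assumes the envelope frame rate of 100 fps from Section 3.2.1, and all function names are hypothetical.

```python
def apparent_heights(vec):
    """Map the index of each interior local maximum to its apparent height."""
    heights = {}
    for i in range(1, len(vec) - 1):
        if vec[i] > vec[i - 1] and vec[i] > vec[i + 1]:
            left = i
            while left > 0 and vec[left - 1] < vec[left]:
                left -= 1    # walk down to the left minimum
            right = i
            while right < len(vec) - 1 and vec[right + 1] < vec[right]:
                right += 1   # walk down to the right minimum
            heights[i] = 0.5 * ((vec[i] - vec[left]) + (vec[i] - vec[right]))
    return heights

def two_best_candidates(vec, min_delay=18):
    """Delays (in frames) of the two maxima with the greatest apparent height."""
    heights = apparent_heights(vec)
    best = sorted(heights, key=heights.get, reverse=True)[:2]
    return [index + min_delay for index in best]

def period_to_bpm(period_frames, frame_rate=100.0):
    """Convert an IOI period in frames to a tempo in BPM."""
    return 60.0 * frame_rate / period_frames
```

With this conversion, the delays of 18 and 74 frames correspond to about 333 and 81 pulses per minute, matching the tatum range given above.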

The 63 tatum features, consisting of the two tatum candidates, the final tatum tempo, the three peak-quality features, and the tatum vector with 57 elements, constitute the first part of the rhythmic feature set. A major difference from some existing work is the use of the complete tatum vector in the feature set. Reference [30] uses rhythmic features for genre classification. However, from a beat histogram, which is loosely comparable to the tatum vector (both contain information about the periodicities), only a small set of features is extracted there, considering only the two highest peaks and the sum of the histogram.

3.2.3. Meter Features

The tatum features only contain information from a very small tempo range, hence, they are not sufficient when one is interested in the complete metrical structure and other tempi than the tatum tempo. Thus, features that contain information about tempo distributions over a broader range are required. These are referred to as meter features, although they do not contain explicit information about the meter.

A so called meter vector is introduced. This vector shows the distribution of resonances among 19 metrical levels, starting at, and including the tatum level.

Each of the 19 elements of the meter vector is a normalized score value for one tempo, indicating how well that tempo resonates with the song. To compute it, first an unnormalized score value is computed by setting up a comb filter bank for each of the 19 tempi. Each filter bank consists of filters covering a range of delays. As in Section 3.2.2, the total energy output of each filter in the bank is computed, and the maximum value is assigned to the unnormalized score. The delay of the filter with the highest total energy output is saved as the adjusted tempo belonging to that score. The vector consisting of the 19 unnormalized elements is the unflattened meter vector. Exemplary plots of it are given in Figures 9 and 10:

(10)

Since the problem of higher resonances at higher tempi that exists for the tatum vector (see Section 3.2.2) also exists for the unflattened meter vector (see Figure 9), this vector is flattened in the same way as the tatum vector, by taking into account the difference between the values at its start and its end. The resulting vector is the flattened meter vector, referred to simply as the meter vector. For accurate meter vector computation, a minimal input length is required, since the higher metrical levels correspond to very slow tempi and thus to large comb filter delays.

Figure 9

Plots of the unflattened meter vector (a) for "Moon River (Waltz)" and the (flattened) meter vector (b).

The 19 elements of the meter vector, without further processing or reduction, constitute the second part of the rhythmic feature set. We would like to note at this point that no explicit value for the meter (i.e., duple or triple) is part of the meter features. The following sections show how the meter is detected in a data-driven manner using support vector machines (SVMs).

3.3. Feature Selection

A total of 82 features has been described in the previous two sections, including all 19 meter vector elements and the 63 tatum features, namely the two tatum candidates, the final tatum tempo, and the three peak-quality features, plus all 57 elements of the tatum vector (see Table 1). Together, these features are referred to as the complete feature set in the following. Based on our experience in [31, 32], SVMs with a polynomial kernel function of degree 1 are used for the following classification tasks. The SVMs are trained using a sequential minimal optimization (SMO) method as described in [34].

Table 1

Overview over all 82 rhythmic features. Feature set .

Tatum features

tatum vector (57 el.)

tatum candidates , [BPM]

final tatum tempo [BPM]

, ,

Meter features

Meter vector (19 el.)

In order to find relevant features for meter and ballroom dance style classification, the dataset is analyzed for each of these two cases by performing a closed-loop hill-climbing feature selection employing the target classifier's error rate as optimization criterion, namely, sequential forward floating search (SVM-SFFS) [31].

The feature selection reveals the following feature subset to yield the best results for meter classification: , meter vector elements 4, 6, 8, 16, and the tatum vector .
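To make the selection procedure more concrete, the following is a minimal sketch of a sequential forward floating search. It is a generic textbook-style variant, not necessarily the exact SVM-SFFS implementation of [31]. The `score` callback stands in for the target classifier's accuracy (one minus its error rate); the toy scoring function and feature names are purely hypothetical.

```python
def sffs(features, score, max_size):
    """Sequential forward floating search maximizing score(subset).
    Returns (best_score, best_subset)."""
    selected = []
    best_by_size = {}  # subset size -> (score, subset)
    while len(selected) < max_size:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Forward step: add the single feature that helps most.
        add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [add]
        s = score(selected)
        k = len(selected)
        if k not in best_by_size or s > best_by_size[k][0]:
            best_by_size[k] = (s, list(selected))
        # Floating step: drop features while this strictly beats the
        # best subset found so far for the smaller size.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_red = score(reduced)
            if s_red > best_by_size[len(reduced)][0]:
                selected = reduced
                best_by_size[len(reduced)] = (s_red, list(reduced))
            else:
                break
    return max(best_by_size.values(), key=lambda t: t[0])


# Hypothetical stand-in for a cross-validated classifier accuracy:
# features "a" and "c" are informative, every feature has a small cost.
def toy_score(subset):
    return len(set(subset) & {"a", "c"}) - 0.01 * len(subset)


best_score, best_subset = sffs(["a", "b", "c", "d"], toy_score, max_size=4)
```

In a closed-loop (wrapper) setting, each call to `score` would train and evaluate the target classifier on the candidate subset, which is why this kind of search is expensive but tuned to the classifier actually used.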

3.4. Song Database

A set of 1855 pieces of typical ballroom and Latin dance music obtained from [35] is used for evaluation. A more detailed list of the 1855 songs can be found at [36]. The set covers the standard dances Waltz, Viennese Waltz, Tango, Quick Step, and Foxtrot, and the Latin dances Rumba, Cha Cha, Samba, and Jive, giving a total of 9 classes. The songs cover a wide range of tempi, from 68 BPM to 208 BPM. 30 seconds of each song are available, which were converted from a RealAudio-like format to 44.1 kHz PCM so that the preprocessing from Section 3.2.1 can be applied. In total length, however, this set corresponds to 5 days of music. The distribution among dance styles is depicted in Table 3. This set is abbreviated in the following. Ground truth statistics about the tempo distribution for the whole set and in each dance style class are given in Table 2.

Table 2

Mean , standard deviation , minimum and maximum tempo in BPM for each class, and complete set .

For the dataset, the ground truth of tempo and dance style is known from [35]. The ground truth regarding duple or triple metrical grouping is also implicitly known from the given source because it can be deduced from the dance style: all Waltzes have triple meter, all other dances have duple meter. Unlike in [10], the tempo ground truths were not manually double-checked, so they may contain errors. Results with manually checked ground truths might improve slightly. This is further discussed near the end of Section 4.

3.5. Data-Driven Meter and Ballroom Dance Style Recognition

From the abstract features in set (see Section 3.3), meter and quarter-note tempo have to be extracted. While data-driven meter recognition by SVM yields excellent results, data-driven tempo detection is a complicated task because tempo is a continuous variable. An SVM regression method was investigated but did not prove successful: it was not able to identify tempi correctly within a tolerance of only a few percent relative BPM deviation. Therefore, a hybrid approach is used in which the data is divided into a small number of classes representing tempo ranges. The ranges are allowed to overlap slightly. As the database described in Section 3.4 already has one of nine ballroom dance styles assigned to each instance, the dance styles are chosen as the tempo classes, since music of the same dance style is generally limited to a specific tempo range. This is confirmed by other work, which uses tempo ranges to assign a ballroom dance style [2, 37].

In three consecutive steps (see Figure 11) meter, ballroom dance style, and quarter-note tempo are determined for the whole dataset in a 10-fold stratified cross validation (SCV) as described in the following.

(1) The feature set is extracted for all instances in the dataset. The 1855 instances are split into training and test splits for 10 stratified folds. An SVM model for meter classification is built on each training split using the feature subset . The model is used to assign a meter (duple or triple) to the instances in each test split. Doing this for all 10 folds, the meter can be determined for the whole dataset by SVM classification.

(2) The meter , from the previous step, is used as a feature in feature set (see Section 3.3) for ballroom dance style classification. The same 10-fold procedure as used for meter classification in step (1) is performed in order to assign a ballroom dance style to all instances in the dataset.

(3) With the results of both meter and ballroom dance style classification, it is now possible to detect the quarter-note tempo quite robustly. The following section describes the novel tempo detection procedure in detail.
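The first two steps can be sketched as follows. This is only an illustration of the stratified cross-validation protocol and of feeding the predicted meter into the second classifier. A simple nearest-centroid classifier is used as a stand-in for the SVM, and all function names and the toy data are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign instance indices to k folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        for j, idx in enumerate(by_class[label]):
            folds[(offset + j) % k].append(idx)
        offset += len(by_class[label])
    return folds


def cross_validated_predictions(rows, labels, fit, predict, k=10):
    """Predict every instance with a model trained on the other k - 1 folds."""
    predictions = [None] * len(labels)
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = predict(model, rows[i])
    return predictions


# Stand-in classifier (NOT the SVM used in the article): nearest centroid.
def fit_centroids(rows, labels):
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {label: [sum(col) / len(col) for col in zip(*members)]
            for label, members in grouped.items()}


def predict_centroid(model, row):
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], row))
    return min(model, key=squared_distance)


# Hypothetical toy data: one rhythmic feature per instance.
rows = [[0.01 * i] for i in range(20)] + [[1.0 + 0.01 * i] for i in range(20)]
meter_truth = ["duple"] * 20 + ["triple"] * 20
style_truth = ["Tango"] * 20 + ["Waltz"] * 20

# Step (1): meter classification in 10-fold stratified cross validation.
meter_pred = cross_validated_predictions(rows, meter_truth,
                                         fit_centroids, predict_centroid)

# Step (2): the predicted meter is appended as an additional feature.
rows_with_meter = [row + [1.0 if m == "triple" else 0.0]
                   for row, m in zip(rows, meter_pred)]
style_pred = cross_validated_predictions(rows_with_meter, style_truth,
                                         fit_centroids, predict_centroid)
```

Because every instance is predicted by a model that never saw it during training, the predicted meter used in step (2) reflects realistic, possibly erroneous, classifier output rather than ground truth.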

3.6. From Ballroom Dance Style to Tempo

For the training data of each of the 10 folds introduced in the previous section, the means and variances of the distributions of quarter-note tempi (ground truths) and tatum tempi are computed for each of the 9 ballroom dance styles. No ground truth for the tatum tempo is available, so the automatically extracted tatum tempo (see Section 3.2.2) from step (1) in Section 3.5 is used. Results might improve further if ground truth tatum information were available, since correct tatum detection is crucial for correct results.

For the test data in each fold the tempo is detected with the following procedure. Using the two tatum candidates and extracted in step (1) in Section 3.5, the final tatum for the instances in the test split in each fold now is chosen based upon the statistics estimated from the training data. The Gaussian function (11) is used instead of the confidence (see Section 3.2.2). Parameters and are set to the values of and for the corresponding ballroom dance style (assigned in step (2) in the previous subsection),

(11)

Now the candidate for which the function is maximal is chosen as the final tatum tempo . Based upon this new tatum, a new flattened meter vector is computed for all instances as described in Section 3.2.3.
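The tatum candidate selection described above can be sketched as follows. The sketch assumes an unnormalized Gaussian; since both candidates are scored with the same mean and standard deviation, any normalization constant in (11) would not change which candidate wins. The use of the population standard deviation, the function names, and all numeric values are illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def gaussian(x, mu, sigma):
    """Unnormalized Gaussian weight of x for mean mu and std. deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def style_statistics(tempi_by_style):
    """Per-style (mean, standard deviation) estimated from training data."""
    return {style: (mean(t), pstdev(t)) for style, t in tempi_by_style.items()}


def choose_tatum(candidates, style, stats):
    """Pick the tatum candidate that is most plausible for the given style."""
    mu, sigma = stats[style]
    return max(candidates, key=lambda t: gaussian(t, mu, sigma))


# Hypothetical tatum tempi (BPM) observed in the training split.
training_tatum = {
    "Waltz": [174, 180, 186],
    "Tango": [250, 256, 262],
}
stats = style_statistics(training_tatum)

# Two tatum candidates of a test instance classified as Waltz.
final_tatum = choose_tatum((90, 180), "Waltz", stats)
```

With these toy statistics the candidate at 180 BPM is selected, because it lies close to the tatum tempo typical for the assigned dance style.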

The new meter vector is used for detection of the quarter-note tempo. Each element is multiplied by a Gaussian weighting factor . The parameters and in (11) are now set to the values and of the corresponding ballroom dance style. indicates the tempo the meter vector element belongs to (see Section 3.2.3).

Next, the index , for which the expression is maximized, is identified. The tempo belonging to index is the detected quarter-note (beat level) tempo .
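As an illustration of this last step, the following sketch weights each meter vector element with a Gaussian centered on the quarter-note tempo typical for the recognized dance style and returns the tempo of the maximal weighted element. The Gaussian is left unnormalized, which does not affect the position of the maximum. The function names and all numeric values are hypothetical.

```python
import math


def gaussian(x, mu, sigma):
    """Unnormalized Gaussian weight of x for mean mu and std. deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def quarter_note_tempo(meter_vector, tempi, mu, sigma):
    """Return the tempo whose Gaussian-weighted meter vector element is maximal."""
    weighted = [m * gaussian(t, mu, sigma) for m, t in zip(meter_vector, tempi)]
    best = max(range(len(weighted)), key=weighted.__getitem__)
    return tempi[best]


# Hypothetical meter vector scores and the tempi (BPM) they belong to.
tempi = [45, 90, 180, 360]
meter_vector = [0.2, 0.9, 1.0, 0.4]

# Hypothetical quarter-note tempo statistics of the recognized dance style.
detected = quarter_note_tempo(meter_vector, tempi, mu=88.0, sigma=6.0)
```

In this toy example the unweighted maximum lies at 180 BPM, but the style-dependent weighting moves the decision to 90 BPM, which is how the dance style helps to resolve tempo octave ambiguities.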

4. Results

Results for tempo detection with and without prior ballroom dance style recognition are compared in Table 4. In both cases the tempo is detected as described in Section 3.6; without dance style recognition, however, a single predefined Gaussian for the tempo distribution is applied instead of the distributions determined for each dance style.

The results in Table 4 clearly show that the number of instances for which the correct tempo octave is identified increases by almost 20% absolute when the ballroom dance style recognized in step (2) is incorporated. Assuming optimal ballroom dance style recognition, that is, using ground truth ballroom data instead of the recognition results, the tempo octave is identified correctly in almost all cases in which the tempo is identified correctly. With the new data-driven approach to tempo detection, accuracies for the quarter-note tempo are improved by approximately 5% absolute for Waltz and over 10% for Viennese Waltz, compared to previous work on the same dataset [15]. On 88% of all instances the correct tempo octave was identified, which is remarkable considering the wide range of tempi in the dataset.

Detailed final results, after applying all the steps from Section 3.5 through Section 3.6, are given in Table 3. The tolerance for tempo detection is 3.5% relative BPM deviation, to maintain consistency with previous publications [32]. Note that ballroom dance style recognition was performed entirely without using the quarter-note tempo as a feature.
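How such a tolerance criterion can be evaluated is shown in the following minimal sketch. It assumes that the deviation is measured relative to the ground truth tempo; the function names and the example values are hypothetical.

```python
def within_tolerance(estimated_bpm, true_bpm, tolerance=0.035):
    """True if the estimate deviates by at most `tolerance` (relative)
    from the ground truth tempo."""
    return abs(estimated_bpm - true_bpm) <= tolerance * true_bpm


def accuracy(estimates, truths, tolerance=0.035):
    """Fraction of instances whose tempo estimate is within the tolerance."""
    hits = sum(within_tolerance(e, t, tolerance)
               for e, t in zip(estimates, truths))
    return hits / len(truths)


# Hypothetical estimates and ground truths in BPM.
estimates = [100, 103, 104, 60]
truths = [100, 100, 100, 120]
result = accuracy(estimates, truths)
```

Here the first two estimates fall within 3.5% of the ground truth, whereas the third deviates by 4% and the fourth is off by a factor of two, so half of the instances count as correct.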

In [2], Dixon et al. use a rule-based approach for dance style classification based on simple tempo ranges. However, results on a large dataset are not reported. In [12], Gouyon et al. test a data-driven approach on a subset of the dataset. They evaluate multiple feature sets and different classifiers. Using ground truth of tempo and meter from [35] with a k-nearest neighbour (k-NN) classifier, they report an accuracy of 82.3%. Using the same ground truths and SVM instead of k-NN, we achieve 84.6% of correctly classified instances. With a set of 15 MFCC-like features, comparable to our 82 rhythmic features, Gouyon et al. achieve an accuracy of 79.6%. Using SVM on the rhythmic features introduced in this article, the ballroom dance style recognition results improve by almost 10% absolute to 89.1%.

Meter detection results improve by approximately 2% over those reported by Gouyon et al. in [11]. However, different datasets and classifiers are used, so the results cannot be properly compared. Comparing meter detection results with those reported by Klapuri et al. [6] is not feasible because in our article meter detection is restricted to a simple binary decision, the main focus being on tempo detection incorporating ballroom dance style recognition. Klapuri et al. describe a more detailed, multilevel tempo and meter analysis system.

At ISMIR 2004, a tempo induction contest was held comparing state-of-the-art tempo induction algorithms. The results are reported in [10]. To show how our data-driven tempo induction approach compares to the algorithms of the contest participants, we conducted a test run on the publicly available ballroom dance set used in the contest (referred to as set in the following, obtainable at [38]). This set is approximately a subset of the dataset. The tempo ground truth of this set was manually double-checked. Two accuracies are evaluated in [10], namely accuracy 1, which corresponds to tempo correct in this article, and accuracy 2, which corresponds to the percentage of correctly identified tempo octaves. Table 5 shows the results obtained on this dataset. The winner of the ISMIR contest is an algorithm by Klapuri et al., which achieves 91.0% accuracy 1 and 63.2% accuracy 2 on the set. Scheirer's algorithm, on which our comb filter tatum detection stage is loosely based, was also evaluated in the contest. It achieves 75.1% accuracy 1 and 51.9% accuracy 2 on the same dataset. The novel approach presented in this article outperforms Scheirer's algorithm by 17.9% absolute and Klapuri's algorithm by 2.0% absolute regarding accuracy 1, and by 35.0% and 23.7% absolute, respectively, regarding accuracy 2. These results are the best reported so far. Still, it should be noted that tests were only performed on ballroom dance data. In future work, other datasets such as the song set from [10] or the MTV set from [32] must be assigned ground truth tempo range classes in order to evaluate performance on data other than ballroom songs. Even so, good results on ballroom dance music are already of practical use, for example, for virtual dance assistants [15].

5. Conclusion and Outlook

Within this article, an overview of basic and current approaches to rhythm analysis on real audio was given. Further, a method was presented that improves on today's robustness by combining tempo detection, rhythmic feature extraction, meter recognition, and ballroom dance style recognition in a data-driven manner. As opposed to other work, ballroom dance style classification is carried out first and significantly boosts the performance of tempo detection. 82 rhythmic features were described, and their usefulness for all of these tasks was demonstrated.

Further applications for these features, ranging from general genre recognition to song identification [13] or measuring rhythmic similarity [39], must be investigated. Preliminary test runs for discrimination between 6 genres (Documentary, Chill, Classic, Jazz, Pop-Rock, and Electronic) on the same dataset and with the same test conditions as used in [31] indicate accuracies of up to 70% using only the 83 rhythmic features.

It will further be investigated whether adding other features, such as those described in [8, 12] or [13], can further improve results for all the presented rhythm analysis steps. Moreover, the data-driven tempo detection approach will be extended to non-ballroom music, for example, popular and rock music.

Overall, automatic tempo detection on real audio, also outside of electronic dance music, has matured to a degree where it is ready for multiple intelligent Music Information Retrieval applications in everyday life.

Foote J, Uchihashi S: The beat spectrum: a new approach to rhythm analysis. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '01), August 2001, Tokyo, Japan, 881-884.

Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.