Prosody and prosodic boundaries carry significant linguistic and paralinguistic information and are important aspects of speech. In the field of prosodic event detection, many local acoustic features have been investigated; however, contextual information has not yet been thoroughly exploited. The most difficult aspect of this lies in learning the long-distance contextual dependencies effectively and efficiently. To address this problem, we introduce the use of an algorithm called auto-context. In this algorithm, a classifier is first trained based on a set of local acoustic features, after which the generated probabilities are used along with the local features as contextual information to train new classifiers. By iteratively using updated probabilities as the contextual information, the algorithm can accurately model contextual dependencies and improve classification ability. The advantages of this method include its flexible structure and its ability to capture contextual relationships. When using the auto-context algorithm based on support vector machines, we can improve the detection accuracy by about 3% and the F-score by more than 7% on both two-way and four-way pitch accent detection in combination with the acoustic context. For boundary detection, the accuracy improvement is about 1% and the F-score improvement reaches 12%. The new algorithm outperforms conditional random fields, especially on boundary detection in terms of F-score. It also outperforms an n-gram language model on the task of pitch accent detection.

Speech is often characterized across two levels of expression: the segmental level encompassing basic phonetic meaning and the prosodic level with additional suprasegmental information. The prosodic level expression plays a crucial role in speech communication, carrying much linguistic and paralinguistic information. Prosody enables listeners to recover word meanings, emphasis, and speaker intent and attitude. In addition, prosody carries information about the speaker’s emotional state. Prosody primarily manifests itself as pitch accent, pause, variations in speaking rate, and intonation. These prosodic events are realized by modulating acoustical correlates such as duration, pitch, and intensity at a syllable, word, or whole utterance level.

In spoken language processing, the detection of prosodic events is a primitive step needed for computers to access the critical high-level information regarding human speech interaction. As such, this task has wide application. It can provide assistance for automatic prosody annotation. Since the manual annotation of prosody for speech synthesis or speech understanding is time-consuming and laborious, the detection of prosodic events can provide substantial time savings. For second language learning, there is potential for computer-assisted language learning (CALL) systems to incorporate prosodic event detection to help detect learner mispronunciations at a prosodic level and provide feedback for improving pronunciation naturalness. Prosodic event detection can also be used as a foundation for downstream spoken language processing tasks such as speech summarization and topic segmentation.

Due to the suprasegmental nature of prosody, contextual information is very important for prosodic event recognition. Here, ‘context’ refers to the correlation between each prosodic unit and its surroundings. There are several types of contextual information, including nearby acoustic appearances (‘acoustic context’), nearby prosodic event distributions (‘prosodic context’), and nearby lexical and syntactic appearances (‘textual context’). In this paper, we focus only on acoustic context and prosodic context and do not address linguistic effects. Perceptually, pitch accent is perceived when the related acoustic features of a syllable stand out from its surroundings. In addition to the influence of the adjacent syllables, the overall surface realization of accent is also affected by many other broader-scale phenomena such as the presence of a phrasal boundary, phrase structure, and topic[1, 2].

In this paper, we investigate the utilization of contextual information for pitch accent and boundary detection by using the auto-context algorithm, which was first proposed in[3] for high-level computer vision tasks like image segmentation. In this algorithm, the classification probabilities obtained from the preceding iteration are used to provide possible contextual clues, together with acoustic features to improve the next iteration. Each detection object is supported by combinations of contextual probabilities from any contextual range. Our experimental results show that this algorithm enhances detection performance for both pitch accent and boundary detection tasks.

Many approaches have been explored for prosodic event detection. The target unit used to indicate the prosodic events can be a phone, syllable or word, with acoustic features typically modeled by statistical machine learning methods. In[4], Conkie et al. used short-frame speaker-normalized pitch and energy as acoustic representations modeled by prosodic context-dependent HMMs. In[5], Ananthakrishnan and Narayanan modeled the frame-level acoustic features using coupled hidden Markov models (CHMMs) to detect pitch accent and boundary in binary modes.

This work assumed that the modulation of prosody-related acoustic parameters, such as increase of the local energy, extension of the duration, and exaggeration of pitch movements, were all asynchronous and could be modeled by CHMMs with multiple data streams. In[6], bagging and boosting ensemble machine learning methods were adopted based on a decision tree learning algorithm. Both methods improved the overall accuracy of four-category pitch accent classification.

The work in[7] utilized a decision tree to model acoustic features. The posterior probabilities provided by the decision tree were combined with a bigram prosodic label sequence model to detect pitch accent and boundary tones at the syllable level. In[8], Ananthakrishnan and Narayanan used a maximum a posteriori (MAP) framework with multiple classifiers including GMMs, linear decision discriminants, and neural networks (NNs) to detect pitch accent and boundary, with NNs giving the best performance. When combined with a 4-gram de-lexicalized language model, this method achieved an accuracy of 80.1% on binary pitch accent classification and 89.6% on boundary detection. Jeon and Liu used NNs[9] and support vector machines (SVMs)[10] for acoustic modeling. With NNs, pitch accent and boundary detection accuracy reached 83.5% and 84.8%, respectively. The performance of the SVM classifier was better than that of NNs, achieving 85.7% accuracy for pitch accent detection. Recently, conditional random fields (CRFs) have become popular in prosodic event detection. In[11], acoustic features were modeled by a linear-chain CRF as well as a two-level factorial CRF. Linear-chain CRFs have been used extensively in recent work[2, 12, 13]. These reference approaches to prosodic event detection are summarized in Table 1.

To investigate the importance of contextual information in prosodic event detection, the work in[16] examined the detection performance of pitch accent at word, syllable, and vowel levels, respectively. When using a constant amount of context, the results showed that detection in the word domain achieved the best performance, showing that acoustic excursions exist beyond syllable range. In[1], the contextual influence of local coarticulatory constraints and broader range phrasal effects were investigated for the detection of prominence. The results showed that the incorporation of local acoustic context can significantly improve the detection performance, with phrasal effects less significant.

With respect to utilizing contextual information in acoustic modeling, there are two representative methods: the n-gram language model[7, 8] and the CRF model[2, 11–13]. An n-gram language model assumes that the current prosodic state depends on a finite history, with dependencies established in the form of conditional probabilities. CRFs are a class of graphical models that are undirected and conditionally trained. For the commonly used linear-chain CRF, the dependencies between prosodic labels are modeled in a pairwise neighborhood structure. CRFs have the advantage of modeling the relationships between sequential labels and have proven effective in prosodic prediction[2, 11–13]. Figure 1 shows the dependency diagrams of the two models. Unlike these models, auto-context simultaneously integrates the acoustic features together with the context information by learning a series of classifiers. As discussed in[3], auto-context can integrate any mode of neighborhood structure, including long range, to make good use of contextual information. It is up to the learning algorithm to select and fuse the informative context and acoustic features.

Figure 1

The relational structures adopted by n-gram and CRF models. Here, X represents the features, and Y represents the labels which are being predicted. (a) N-gram model. (b) Linear-chain CRF.

The data corpus used in this work is the Boston University Radio Speech Corpus (BURSC)[17], a standard corpus for prosodic event detection and prediction studies. This corpus is composed of news stories read by seven FM radio news announcers. Each paragraph-sized utterance typically consists of several sentences and is hand-annotated with the orthographic transcription, phonetic alignments, part-of-speech (POS) tags, and prosodic labels based on ToBI conventions. Utterances from two females (F1, F2) and two males (M1, M2) constitute the training and testing dataset used here. The distribution of these utterances is listed in Table 2.

Table 2

The distribution of the dataset used in our experiment

| Speaker | F1 | F2 | M1 | M2 |
|---|---|---|---|---|
| Number of utterances | 74 | 166 | 72 | 51 |
| Number of sentences | 279 | 1,176 | 391 | 209 |
| Number of words | 3,993 | 12,060 | 5,059 | 3,608 |
| Number of syllables | 6,580 | 20,836 | 8,168 | 5,915 |
| Number of accents | 2,253 | 7,063 | 2,564 | 1,933 |
| Number of boundaries | 977 | 3,702 | 1,092 | 882 |
| Accent occupancy (%) | 34.3 | 33.9 | 31.4 | 32.7 |
| Boundary occupancy (%) | 14.9 | 17.8 | 13.4 | 14.9 |

The ToBI annotation conventions use four parallel tiers to describe prosodic events[18]. Among these, the tone tier annotates the presence of pitch accent (* suffix) and phrase boundaries. There are two basic types of accent, high (H) and low (L), which can be further divided into subclasses such as downstepped accent (! prefix). The phrase boundaries include the intermediate phrase boundary (- suffix) and the intonational phrase boundary (% suffix), which follow the different types of phrase accent and boundary tones. The break tier is used to describe the disjuncture between words. The degree of disjuncture from weak to strong is marked by break indices ranging from 0 to 4. The phrase boundary locations usually score 3 or above, where ‘3’ indicates an intermediate phrase boundary and ‘4’ indicates a full intonational phrase boundary. These two kinds of boundaries differ in their degree of salience. Figure 2 shows a ToBI annotation example for the phrase ‘design improvement and schedule’. The top three layers are the orthographic tier, the tone tier, and the break tier, respectively.

Figure 2

An example of ToBI annotation for the phrase ‘design improvement and schedule’. The example includes the acoustic appearances of pitch, energy, and frequency spectrum.

In our work, we have implemented syllable partitions and take the syllable as the domain of pitch accent and boundary detection. Using the detailed representation of prosodic types from the ToBI annotation framework can cause serious data sparsity problems since there are only a few examples for some prosodic types. To balance the amount of training data against detection fineness, we implement pitch accent detection tasks in two-way and four-way modes, following previous approaches[8, 9, 11, 13]. For the two-way task, we divide the syllables into accented and unaccented based on the presence of an asterisk mark. As Table 2 shows, in this classification case, the percentage of accented syllables is about 33%. For the four-way task, we decompose the pitch accent into three types: high, low, and downstepped, in addition to the unaccented type. Four-way classification increases the detection complexity and fineness, but causes some data sparsity. For boundary detection, we set our detection task as a binary (presence/absence) classification problem, which identifies whether the syllable is followed by a boundary or not. Although intonational phrase boundaries can be more reliably detected and have been widely applied in many downstream spoken language processing tasks such as speech summarization, intermediate phrase boundaries are also important elements for phrasal analysis and prosodic annotation. Here, we treat the intonational and intermediate phrase boundaries equally and group the break indices of ‘3’ and ‘4’ together to represent our ‘boundary’ category, as has been done in some previous works[8, 13, 19]. As shown in Table 2, the syllables with boundary presence represent about 15% of the dataset. The boundary detection results obtained in this way can be used directly or as preliminary information that can be followed by a further identification of the boundary salience at the presence positions. We summarize the clusters of ToBI labels and their mapping relationships with prosodic categories for detection in Table 3.

Table 3

Mapping between clusters of ToBI labels and prosodic categories for detection

In this paper, the auto-context algorithm is introduced for prosodic event detection. The basic objective of the auto-context algorithm is to maximize $p(y_i \mid X)$ for all samples, where $X = (x_1, \cdots, x_n)$ is the input feature vector, and $y_i$ is the class label for sample $i$. The auto-context algorithm provides an iterative way to asymptotically approach this objective as follows. Given a set of training samples with ground truth labels $S = \{(Y_i, X(i)),\ i = 1, \ldots, m\}$, a classifier is first trained using local features so that the class probabilities of each sample can be obtained. In the ensuing iterative process, the probabilities of both the current and surrounding samples obtained in the current iteration $t$ are combined into a probability vector $P_t(i)$. This is then concatenated with the original acoustic features to construct a new feature vector. The new feature vectors of all the samples compose the training set $S_{t+1} = \{Y_i, [X(i), P_t(i)]\}$ for iteration $t+1$. Using this updated training set, the new model is trained.

During the iterative process, the auto-context algorithm selects the informative contexts automatically and fuses them with appearance cues. At first, samples with strong discriminant cues will be correctly classified by the initial model and obtain stable posterior probabilities. These probabilities can then influence their neighbors in subsequent iterations, especially when there are close correlations between them. Convergence has been proven, with a monotonically decreasing training error, as shown in[3].

We adopt this algorithm to explore contextual information for pitch accent and boundary detection. We choose syllables from one to five syllables away from the central syllable as contextual regions to conduct our analysis. In our tasks, the posterior probabilities $P(i)$ are used as the prosodic context (as opposed to ‘acoustic context’) since they represent the likelihoods of syllables belonging to different prosodic events. Acoustic features can include not only local features but also acoustic context. Correspondingly, $X(i)$ can be decomposed into the local feature vector $X_{\mathrm{local}}(i)$ and the acoustic context vector $X_{\mathrm{context}}(i)$, as denoted in Eq. (1). Here, we consider both of them. The local acoustic features $X_{\mathrm{local}}(i)$ used in this work are described in detail in Section 5.1. The first-order differential values are used as the acoustic context $X_{\mathrm{context}}(i)$. We use the following steps to implement the algorithm:

1. The acoustic features, including local features $X_{\mathrm{local}}(i)$ and acoustic context $X_{\mathrm{context}}(i)$, are used to train the initial acoustic model. After the first round of training and testing, we obtain the class probabilities.

$$X(i) = [X_{\mathrm{local}}(i),\ X_{\mathrm{context}}(i)] \tag{1}$$

2. The contextual information of each syllable is incorporated by combining the probabilities from its neighbors for the next iteration. The probability vector $P(i)$ for the $i$th sample is constructed as follows:

$$P(i) = [C^{*}(i-n), \cdots, C^{*}(i-1), C^{*}(i), C^{*}(i+1), \cdots, C^{*}(i+n)] \tag{2}$$

where $C^{*}(i)$ represents any collection of classification probabilities for the $i$th syllable, and $n$ is the range parameter controlling the extent of the context. After extension, the new training set for the second stage becomes

$$S(i) = \{y_i, [X(i), P(i)],\ i = 1, 2, 3, \ldots, m\} \tag{3}$$

3. Using the $S(i)$ training set, we train the next acoustic model and update $P(i)$.

4. Steps 2 and 3 are repeated until convergence to let the algorithm recursively learn the informative pitch accent context automatically.
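The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the `Trainer` interface, the padding vector used for positions outside the utterance, and the fixed iteration count (in place of a convergence check) are all assumptions introduced here, and the whole input is treated as a single syllable sequence.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# A trained model maps one feature vector to its class probabilities.
Model = Callable[[Vector], Vector]
# Hypothetical trainer interface: fit on features and labels, return a model.
Trainer = Callable[[List[Vector], Sequence[int]], Model]


def context_probabilities(probs: List[Vector], i: int, n: int, pad: Vector) -> Vector:
    """Eq. (2): concatenate the class probabilities of syllables i-n .. i+n."""
    out: Vector = []
    for j in range(i - n, i + n + 1):
        # Assumption: positions outside the sequence receive a padding vector.
        out.extend(probs[j] if 0 <= j < len(probs) else pad)
    return out


def extend_features(X: List[Vector], probs: List[Vector], n: int, pad: Vector) -> List[Vector]:
    """Eq. (3): append P(i) to the acoustic features X(i) of every syllable."""
    return [list(x) + context_probabilities(probs, i, n, pad) for i, x in enumerate(X)]


def auto_context(X: List[Vector], y: Sequence[int], train: Trainer,
                 n: int, iterations: int, pad: Vector) -> List[Model]:
    """Train the initial model plus `iterations` context-augmented models."""
    models: List[Model] = []
    feats = [list(x) for x in X]             # step 1: acoustic features only
    for _ in range(iterations + 1):
        model = train(feats, y)              # steps 1 and 3: train a classifier
        models.append(model)
        probs = [model(f) for f in feats]    # class probabilities per syllable
        feats = extend_features(X, probs, n, pad)  # step 2: rebuild [X(i), P(i)]
    return models
```

At test time the returned models would be applied in the same order, each one consuming the probabilities produced by its predecessor. Any classifier that outputs class probabilities can be plugged in as `train`.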

5.1 Features for pitch accent

Pitch accent typically correlates with a higher level of pitch and energy and an increased duration. We extracted acoustic features based on these acoustic measurements in syllable regions to detect pitch accent. These features can be categorized into two groups: frame-averaged features, including the mean values of pitch, energy and duration; and TILT features provided by parameterizing the pitch contour using the TILT model. In addition, the forward and backward difference values for both frame-averaged features and TILT features are extracted as acoustic contextual information. In the following section, we describe how to extract these features in detail.

5.1.1 Frame-averaged features

5.1.1.0 Loudness

Pitch accent is closely related to human auditory characteristics. For perceptual accuracy, loudness can be used instead of intensity to detect pitch accent[20]. Here, we use a loudness model proposed by Zwicker and Fastl[21] to extract the loudness feature. This starts with using the short-time Fourier transform (STFT) to transfer the signal from the temporal to the frequency domain. The linear-scale frequency is grouped into a critical band rate scale to model the human hearing mechanism. This mapping relationship is given by

$$z(\mathrm{Bark}) = T(f) = 13\tan^{-1}(0.00076 f) + 3.5\tan^{-1}\!\left[\left(\frac{f}{7500}\right)^{2}\right] \tag{4}$$

where f denotes frequency in Hertz, z represents the critical band rate measured in Bark units, and T(·) is the transform function between them. Here, we divide the audible range into 24 critical bands. The intensity of each critical band is obtained by summing up all the frequency points that are distributed within the band range of (z - 0.5,z + 0.5) and then by calculating the corresponding sound pressure level (SPL) according to

$$I(z) = 10\log\frac{\int_{L(z)}^{H(z)} I(f)\,df}{I_0}\ \mathrm{dB}, \quad \text{where } H(z) = T^{-1}(z + 0.5),\ L(z) = T^{-1}(z - 0.5). \tag{5}$$

Here, $I_0 = 10^{-12}$ W/m$^2$ is the standard threshold of hearing at 1 kHz.
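As a rough illustration of Eqs. (4) and (5), the sketch below maps frequencies to the Bark scale and accumulates a discrete spectrum into 24 critical bands. The function names, the half-open band range [z - 0.5, z + 0.5), the assumption that the input intensities are already in W/m^2 per frequency bin, and the handling of empty bands are choices made here for the example, not details taken from the experiments.

```python
import math
from typing import Dict, Sequence

I0 = 1e-12  # threshold of hearing in W/m^2, as given in the text


def hz_to_bark(f: float) -> float:
    """Eq. (4): critical band rate z (Bark) for a frequency f in Hz."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)


def band_spl(freqs: Sequence[float], intensities: Sequence[float]) -> Dict[int, float]:
    """Discrete form of Eq. (5): SPL in dB for each of the 24 critical bands."""
    sums = {z: 0.0 for z in range(1, 25)}
    for f, inten in zip(freqs, intensities):
        # Band z collects the bins whose Bark value lies in [z - 0.5, z + 0.5).
        z = math.floor(hz_to_bark(f) + 0.5)
        if z in sums:  # bins outside bands 1..24 are ignored
            sums[z] += inten
    # Assumption: a band with no energy is reported as -inf dB.
    return {z: 10.0 * math.log10(s / I0) if s > 0 else float("-inf")
            for z, s in sums.items()}
```

For example, a 1 kHz component maps to roughly 8.5 Bark and therefore falls into band 9 under this rounding convention.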

Stevens pointed out that the relationship between intensity and perceptual loudness obeys a power law[22]. Following this law, we calculate the loudness per Bark based on I(z) by

$$L(z) = 0.08\left(\frac{I_Q(z)}{I_0}\right)^{0.23}\left[\left(0.5 + 0.5\,\frac{I(z)}{I_Q(z)}\right)^{0.23} - 1\right] \tag{6}$$

where $L(z)$ denotes the specific loudness in the $z$th Bark, and $I_Q(z)$ is the corresponding threshold of intensity in a quiet environment. The total loudness $L$ of the frame is given by the summation of the loudness of every critical band, $L = \sum_{z=1}^{24} L(z)$.
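Eq. (6) and the summation over bands translate directly into code. The sketch below assumes that I(z) and IQ(z) are supplied as linear intensities in W/m^2 (the ratio in Eq. (6) is only meaningful in linear units) and that the per-band thresholds in quiet are provided by the caller; the function names are illustrative.

```python
from typing import Sequence

I0 = 1e-12  # threshold of hearing in W/m^2, as given in the text


def specific_loudness(i_z: float, iq_z: float, i0: float = I0) -> float:
    """Eq. (6): specific loudness of one critical band.

    i_z  -- band intensity (assumed linear, W/m^2)
    iq_z -- threshold in quiet for the same band (assumed linear, W/m^2)
    """
    return 0.08 * (iq_z / i0) ** 0.23 * ((0.5 + 0.5 * i_z / iq_z) ** 0.23 - 1.0)


def total_loudness(band_intensity: Sequence[float],
                   band_threshold: Sequence[float]) -> float:
    """Sum of the specific loudness over all critical bands of one frame."""
    return sum(specific_loudness(i, q)
               for i, q in zip(band_intensity, band_threshold))
```

A band whose intensity equals its threshold in quiet contributes zero loudness, and the contribution grows with a compressive exponent of 0.23 above that threshold.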

5.1.1.0 Semitone

Based on a similar perceptual consideration, we transform the pitch values in Hertz to the semitone scale to better match with human perception[23]. Raw pitch values are calculated first using Praat[24], then a log-scale transformation is conducted according to the following equation:

$$S = 69 + 12\log_{2}\frac{f}{440} \tag{7}$$

where f is the fundamental frequency in Hz, and S is in semitones.
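Eq. (7) can be checked with a small helper. The function name is illustrative, and rejecting non-positive values (for example, unvoiced frames reported as 0 Hz) is an assumption made for this sketch.

```python
import math


def hz_to_semitone(f: float) -> float:
    """Eq. (7): convert a fundamental frequency in Hz to semitones."""
    if f <= 0:
        # Assumption: unvoiced frames (F0 reported as 0) are handled by the caller.
        raise ValueError("F0 must be positive")
    return 69.0 + 12.0 * math.log2(f / 440.0)
```

With this scale, 440 Hz maps to 69 semitones and each doubling of frequency adds 12 semitones.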

5.1.1.0 Spectral emphasis

Previous studies have shown that midfrequency energy is more effective in accent classification than full-range distributed energy[25]. Midfrequency refers to the frequencies between 500 and 2,000 Hz. In this work, we use a finite impulse response (FIR) filter with a Kaiser window to extract the energy within this bandwidth as a spectral emphasis feature.

5.1.1.0 Duration

We compute the syllable duration using the boundary information generated by forced alignment. The speaker-independent speech recognizer was trained using the data from BURSC and the corresponding manual transcriptions.

Among these four features, the loudness, spectral emphasis, and semitone values are all extracted at a frame-by-frame level in the first iteration. After this, the loudness and the spectral emphasis are averaged across a syllable scope, while the semitone is averaged across a syllable nucleus to obtain frame-averaged values. In order to reduce the negative impact caused by different speakers and speaking rates, these features are all normalized by the mean value across the whole sentence.
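The sentence-level normalization, together with the forward and backward differences used as acoustic context, can be sketched as follows. The sign convention of the differences and the zero value used at sentence edges are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def normalize_by_sentence_mean(values: Sequence[float]) -> List[float]:
    """Divide each per-syllable value by the mean over the whole sentence."""
    mean = sum(values) / len(values)
    if mean == 0:
        raise ValueError("sentence mean is zero")
    return [v / mean for v in values]


def forward_backward_differences(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (forward, backward) first-order differences for every syllable.

    forward  = value of the next syllable minus the current one
    backward = value of the current syllable minus the previous one
    Assumption: a missing neighbor at the sentence edge yields 0.0.
    """
    out = []
    for i, v in enumerate(values):
        fwd = values[i + 1] - v if i + 1 < len(values) else 0.0
        bwd = v - values[i - 1] if i > 0 else 0.0
        out.append((fwd, bwd))
    return out
```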

5.1.2 TILT features

We follow the rise/fall/connection (RFC) model proposed in[26] to extract the TILT parameters as the representation of the pitch variation[27]. As an intonation model, the RFC model first categorizes the F0 contour into rise (R), fall (F), and connection (C) cases according to pitch trend and then continuously parameterizes the contour based on this classification. Here, we again use the semitone-scaled F0 contour of each syllable nucleus to extract the TILT features.

Linear interpolation is applied to smooth the contour. Then, a labeling procedure is conducted to mark the contour frames with one of the three categories. Successive frames with the same label are merged into a single interval. Within the range of these intervals, the amplitude-related measurement (tilta) is calculated by

$$\mathrm{tilt}_a = \frac{|A_{\mathrm{rise}}| - |A_{\mathrm{fall}}|}{|A_{\mathrm{rise}}| + |A_{\mathrm{fall}}|} \tag{8}$$

the duration-related measurement (tiltd) is calculated by

$$\mathrm{tilt}_d = \frac{D_{\mathrm{rise}} - D_{\mathrm{fall}}}{D_{\mathrm{rise}} + D_{\mathrm{fall}}} \tag{9}$$

and the overall measurement of tilt (tiltt) is calculated by

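$$\mathrm{tilt}_t = \frac{|A_{\mathrm{rise}}| - |A_{\mathrm{fall}}|}{2\,(|A_{\mathrm{rise}}| + |A_{\mathrm{fall}}|)} + \frac{D_{\mathrm{rise}} - D_{\mathrm{fall}}}{2\,(D_{\mathrm{rise}} + D_{\mathrm{fall}})} \tag{10}$$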

Arise and Afall are the sum of the rise and fall amplitudes, respectively, and Drise and Dfall represent the sum of the rise and fall durations, respectively. We extracted these features using the Edinburgh Speech Tools Library (EST). In addition, we also included the maximum semitone that can be calculated directly by EST as one of the tilt features.
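For readers who want to verify Eqs. (8) to (10) numerically, a minimal sketch is given below. It is independent of EST; the function name and the convention of returning 0.0 when a denominator vanishes (a contour with neither rise nor fall) are assumptions made here.

```python
from typing import Tuple


def tilt_parameters(a_rise: float, a_fall: float,
                    d_rise: float, d_fall: float) -> Tuple[float, float, float]:
    """Return (tilt_a, tilt_d, tilt_t) following Eqs. (8), (9), and (10)."""
    amp = abs(a_rise) + abs(a_fall)
    dur = d_rise + d_fall
    # Assumption: a contour with no rise and no fall yields a tilt of 0.0.
    tilt_a = (abs(a_rise) - abs(a_fall)) / amp if amp > 0 else 0.0
    tilt_d = (d_rise - d_fall) / dur if dur > 0 else 0.0
    # Eq. (10) is the average of the amplitude and duration tilts.
    tilt_t = 0.5 * tilt_a + 0.5 * tilt_d
    return tilt_a, tilt_d, tilt_t
```

All three values lie between -1 (pure fall) and +1 (pure rise), with 0 indicating a symmetric rise and fall.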

5.2 Features for boundary

As discussed in[28], the presence of a phrase boundary typically correlates with the presence of silence, the reset of pitch and energy, and the lengthening of pre-boundary duration. Each of these plays a role in the perception of increased disjuncture. We assume that acoustic variations caused by boundary phenomena exist only within the region of syllables, and extract acoustic features for boundary detection using a syllable unit, as has been done in[7, 8, 19]. We assume that the region affected by the boundary covers the end of the pre-syllable, the silence interval (possibly absent), and the beginning of the post-syllable, while the candidate result of this region is assigned to the pre-syllable. The acoustic features for boundary detection include the acoustic measurements of the preceding and following syllables and their differential features across syllable boundaries. There are 25 feature dimensions in total, as follows:

1. The duration of the two syllables and their ratio value (3)
2. The duration of the two syllable nuclei and their ratio value (3)
3. The silence duration between the two syllables (1)
4. The means and maxima of pitch of the two syllables, and their differential values (6)
5. The loudness and spectral emphasis means of the two syllables, and their differential values (6)
6. The amplitude, duration, and overall measurements of the TILT features of the two syllables (6)
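One possible way to assemble the 25-dimensional vector listed above is sketched below. The `Syllable` container, the direction of the ratios (pre over post) and of the differences (post minus pre), and the ordering of the dimensions are all assumptions made for illustration; only the grouping and the dimension counts come from the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Syllable:
    """Hypothetical container for the per-syllable measurements."""
    duration: float
    nucleus_duration: float
    pitch_mean: float
    pitch_max: float
    loudness_mean: float
    emphasis_mean: float
    tilt_a: float
    tilt_d: float
    tilt_t: float


def boundary_features(pre: Syllable, post: Syllable, silence: float) -> List[float]:
    """Build the boundary feature vector for the pre-syllable."""
    def ratio(a: float, b: float) -> float:
        return a / b if b else 0.0  # assumption: undefined ratios become 0.0

    return [
        # 1. syllable durations and their ratio (3)
        pre.duration, post.duration, ratio(pre.duration, post.duration),
        # 2. nucleus durations and their ratio (3)
        pre.nucleus_duration, post.nucleus_duration,
        ratio(pre.nucleus_duration, post.nucleus_duration),
        # 3. silence duration between the two syllables (1)
        silence,
        # 4. pitch means and maxima with their differences (6)
        pre.pitch_mean, post.pitch_mean, post.pitch_mean - pre.pitch_mean,
        pre.pitch_max, post.pitch_max, post.pitch_max - pre.pitch_max,
        # 5. loudness and spectral emphasis means with their differences (6)
        pre.loudness_mean, post.loudness_mean, post.loudness_mean - pre.loudness_mean,
        pre.emphasis_mean, post.emphasis_mean, post.emphasis_mean - pre.emphasis_mean,
        # 6. TILT measurements of the two syllables (6)
        pre.tilt_a, pre.tilt_d, pre.tilt_t,
        post.tilt_a, post.tilt_d, post.tilt_t,
    ]
```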

Although one could impose the linguistic constraint that only word-final syllables can carry a boundary, we apply no linguistic constraints in the implementation of our algorithm and allow all possibilities for all syllables. We simply let the algorithm itself make decisions according to acoustic features and contextual information.

We conduct a number of prosodic event detection experiments in this section. First, a classifier selection is conducted. The performances of the different classifiers are investigated, and the one with best performance is chosen as the baseline classifier for the auto-context experiments. The auto-context algorithm is then implemented for two-way and four-way pitch accent detections and for boundary detection. In these experiments, the effectiveness of the auto-context approach is verified from different aspects. Finally, comparisons are made between the auto-context, CRF, and n-gram methods.

The BURSC F1, F2, M1, and M2 data described in Section 3 are used for experimental evaluation. We use randomly selected fivefold cross-validation for each experiment. Accuracy and F-score are utilized to measure the performance of the pitch accent detection and boundary detection tasks. In addition, for boundary detection, we investigate the syllable-level detection performance, in which the measurement is presented as a fraction of all syllables, as well as the word-level detection performance, in which the measurement is presented as a fraction of word-final syllables. All test results presented here are obtained by averaging over the five cross-validation test folds. Acoustic model parameters for both baseline and proposed methods are optimized using a development dataset. This development dataset is constructed from the F3 and M3 data of the BURSC corpus, including 57 utterances, 282 sentences, and about 8,000 syllables. All experiments are implemented by first optimizing the methods using the development data and then testing with fivefold cross-validation over the primary dataset.
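The two evaluation measures can be written out as follows for the binary tasks. The text does not spell out the F-score definition, so this sketch assumes the usual balanced F-score (the harmonic mean of precision and recall) computed on the positive class; the function names are illustrative.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of syllables whose predicted label matches the reference."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Balanced F-score of the positive class (assumed definition)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```

Because accented syllables make up about a third of the data and boundaries about 15%, accuracy alone can look high for a classifier that rarely predicts the positive class, which is why the F-score is reported alongside it.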

6.1 Classifier selection

As mentioned in Section 4, the auto-context algorithm produces a sequence of classifiers through an iterative process. The performance of the chosen classifier plays a vital role in determining the final performance. The auto-context algorithm is not dependent on any specific classifier and can use either SVM or NN approaches. Since both have been used extensively in prosodic event detection, we first compare their performance here.

We use LIBSVM with a radial basis function (RBF) kernel to implement SVM classification[29]. In the four-way pitch accent detection, the one-versus-the-rest mode is adopted to decompose the multi-class classification into binary problems. For the NN, we use a three-layer network with a fully connected structure. The sigmoid activation function is chosen for the network nodes. Classical backpropagation is adopted to minimize the cross-entropy error between outputs and targets. The options of the two classifiers, namely the number of hidden nodes, the learning rate, and the momentum for the NN, and the cost and the gamma value of the RBF function for the SVM, are optimized with the development dataset. The results of the prosodic event detection by the two classifiers are given in Table 4.

Table 4

Performance of prosodic event detection by NN and SVM classifiers using all combined features

We can see from Table 4 that on this task, SVM outperforms NN. Another advantage of the SVM is that it has fewer parameters to tune and is thus less sensitive to parameter settings than the NN. Therefore, considering efficiency and accuracy, we adopt the SVM as our basic classifier in the following experiments. During the iterative process, the same configuration is used for all SVM classifiers. The posterior probabilities that the auto-context algorithm uses are obtained by mapping the distance between the sample and the classifying hyperplane with a sigmoid function.
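The sigmoid mapping from an SVM decision value to a posterior probability can be illustrated as below. This is a sketch of the commonly used parametric form with a slope `a` and an offset `b`; the default values are placeholders chosen for the example, and in practice both parameters would be fitted to held-out decision values rather than set by hand.

```python
import math


def sigmoid_probability(decision_value: float, a: float = -1.0, b: float = 0.0) -> float:
    """Map a signed distance to the hyperplane to P(positive class).

    Computes 1 / (1 + exp(a * d + b)) in a numerically stable way.
    `a` and `b` are placeholder values (assumptions), not fitted parameters.
    """
    t = a * decision_value + b
    if t >= 0:
        e = math.exp(-t)        # avoids overflow for large positive t
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(t))
```

With a negative slope, samples far on the positive side of the hyperplane receive probabilities close to 1, samples on the hyperplane receive 0.5, and samples far on the negative side receive probabilities close to 0.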

6.2 Auto-context algorithm for pitch accent detection

We conduct the following experiments to investigate the ability of the auto-context algorithm to model contextual information for the task of pitch accent detection. The first experiment investigates the effect of prosodic context utilized by auto-context for pitch accent detection. Performance is verified only in independent-syllable conditions, where no acoustic context has been used in the baseline. Based on this, a further investigation is then conducted to identify the effect of different contextual locations, e.g., preceding and following contexts. In the final experiment, acoustic differential features are employed to investigate the combination effect of prosodic context and acoustic context.

6.2.1 Auto-context algorithm without acoustic context

To perform this experiment, we select the number of contextual syllables before and after the current syllable to range from M=1 to M=5, where M is the maximum contextual extent. For each selected syllable, the probabilities of all classes were included.

Figure 4a,b shows the average detection accuracy of the auto-context algorithm calculated across the training and test sets of the fivefold cross-validation. For each case, the training set accuracy increases with each iteration, and the wider the contextual range, the better the performance. Correspondingly, the test set accuracy also increases gradually across iterations, as illustrated in Figure 4b. This suggests that the auto-context algorithm is able to iteratively capture informative prosodic context. The larger the range, the greater the ability of the algorithm to model these useful relationships.

Additionally, we can see that the greatest performance gain is often obtained in the first iteration. In our experiment, the first iteration produces no less than 1.5% net accuracy improvement (more than 70% of the total improvement) on test data. This is because it is at this stage that the context is first added to a baseline without any prior contextual information. It can also be observed that there is sometimes performance degradation on the test set when the classifiers are iterated too many times, which indicates over-fitting of the model. To avoid this problem in practice, a development dataset needs to be used to tune and determine an appropriate configuration before each implementation. In our experiment under the fivefold cross-validation scenario, the optimal configuration is determined by the average performance on the development dataset. Final results are obtained under the constraint of the chosen configuration. To show the full trend, the curves in the figures are plotted without any stopping criterion. From Figure 4b, we can also see that M=3 gives the best performance. This is consistent with statistics showing that most pitch accent events occur once every two or three syllables. The final result is obtained at M=3 and iteration number itr=3, achieving an accuracy of 81.2%.

The F-score performance on the training and test datasets is shown in Figure 4c,d, respectively. As with accuracy, we see a similar increasing trend. The first iteration again gives the most salient improvement, more than 3% net, representing more than 60% of the total improvement. The F-score improvement is more significant than the accuracy improvement. In our experiment, auto-context improves the F-score from 66.7% to 71.1% at M=3 and itr=4 on the test data.

The test accuracy and F-score performance on the four-way pitch accent detection are shown in Figure 5a,b. From this figure, we can see that auto-context substantially improves the four-way pitch accent detection result as well. In our experiments, it improves the accuracy from 74.2% to 75.7% at M=2 and itr=6, and improves the F-score from 54.3% to 59.5% at M=2 and itr=6.

6.2.2 Effect of contextual location

Auto-context is sensitive to the choice of context. Section 6.2.1 discussed the performance variation associated with the contextual range. In this experiment, we conduct a further analysis of how different contextual locations impact the effectiveness of auto-context. The contextual location is split into preceding and following based on the position relative to the current syllable. We choose an upper contextual limit of M=3 for the two-way task and M=2 for the four-way task, based on the results of the previous experiments. In each experiment, only the preceding or following probabilities are used. The final results are given in Table 5. We see that both the preceding and following contexts are useful for the tasks of two-way and four-way pitch accent detection. As expected, combining both of them gives the best performance.

Table 5

Accuracy performance of the auto-context algorithm using contextual information from different locations

| Auto-context (%) | None | Preceding | Following | Both |
|---|---|---|---|---|
| Two-way (M=3) | 79.3 | 80.4 | 80.1 | 81.2 (M=3, itr=3) |
| Four-way (M=2) | 74.2 | 75.3 | 75.0 | 75.7 (M=2, itr=6) |

Here, ‘None’ means not using auto-context, and ‘Preceding’, ‘Following’, and ‘Both’ refer to the directions of contextual prosodic information.
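The three context settings compared in Table 5 can be sketched as a single selection function. This is an illustrative sketch only: the function name, the `location` argument, and the padding value of 0.5 for positions outside the utterance are assumptions that do not come from the experiments.

```python
def context_probs(probs, i, M, location="both", pad=0.5):
    """Collect neighbor probabilities around syllable i.

    probs    -- per-syllable probabilities from the previous iteration
    M        -- contextual range (number of syllables on each side)
    location -- "preceding", "following", or "both"
    pad      -- value used where a neighbor falls outside the sequence
    """
    if location not in ("preceding", "following", "both"):
        raise ValueError("location must be 'preceding', 'following', or 'both'")

    offsets = []
    if location in ("preceding", "both"):
        offsets += range(-M, 0)        # i-M, ..., i-1
    if location in ("following", "both"):
        offsets += range(1, M + 1)     # i+1, ..., i+M

    n = len(probs)
    return [probs[i + d] if 0 <= i + d < n else pad for d in offsets]
```

For example, with `probs = [0.1, 0.2, 0.3, 0.4]`, `i = 1`, and `M = 2`, the "preceding" setting yields one padded value followed by 0.1, while "following" yields 0.3 and 0.4; "both" concatenates the two.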

6.2.3 Auto-context algorithm with acoustic context

The auto-context algorithm has the potential to explore relationships between prosodic events across a wider extent and to combine these with acoustic information in a unified framework. In this experiment, with the same experimental setup as in Section 6.2.1, we apply the auto-context algorithm to a feature set that includes not only local acoustic features but also acoustic context, in order to investigate the combined effect of prosodic context and acoustic context. The test accuracy and F-score results on the two-way task are shown in Figure 6a,b, and the test performance of the four-way classification is shown in Figure 7a,b.

The final results are listed in Table 6. From these results, we can see that although the prosodic context used by auto-context achieves performance comparable to that of direct acoustic context when each is used separately, combining them yields further improvement. On top of a feature set that already includes acoustic context, auto-context gives a further accuracy improvement on both pitch accent detection tasks. It improves the accuracy of two-way detection from 81.2% to 82.0% at M=3 and itr=4, and the accuracy of four-way detection from 76.3% to 77.0% at M=2 and itr=1. It also gives an F-score improvement of more than 2% (from 70.8% to 73.0% at M=3 and itr=4) for the two-way task and 3% (from 59.9% to 63.0% at M=2 and itr=4) for the four-way task. The addition of acoustic context improves the overall performance of auto-context as well, compared with using prosodic context alone. For example, without acoustic context information, auto-context achieves an accuracy of 75.7% on four-way pitch accent detection; when combined with acoustic context, the accuracy improves to 77.0%. Similar improvements are seen in the other detection tasks and are confirmed by the F-score measurements. This demonstrates that prosodic context and acoustic context provide complementary contextual information and that the auto-context algorithm can combine them effectively.

Table 6

Performance comparison of auto-context, acoustic context, and their combinations

| Task | Measure | No context | Acoustic context | Auto-context | Combination |
|---|---|---|---|---|---|
| Two-way (%) | Accuracy | 79.3 | 81.2 | 81.2 (M=3, itr=3) | 82.0 (M=3, itr=4) |
| Two-way (%) | F-score | 66.7 | 70.8 | 71.1 (M=3, itr=4) | 73.0 (M=3, itr=4) |
| Four-way (%) | Accuracy | 74.2 | 76.3 | 75.7 (M=2, itr=6) | 77.0 (M=2, itr=1) |
| Four-way (%) | F-score | 54.3 | 59.9 | 59.5 (M=2, itr=6) | 63.0 (M=2, itr=4) |
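One way to picture the "Combination" column is as a single feature vector per syllable that stacks acoustic context and prosodic context side by side. The sketch below is an illustration under assumptions: the acoustic context range `M_acoustic`, the zero padding for acoustic features, the 0.5 padding for probabilities, and the function name are all hypothetical and are not taken from the experimental setup.

```python
def stacked_features(acoustic, probs, i, M_acoustic, M_prob, prob_pad=0.5):
    """Build the input vector for syllable i.

    acoustic -- list of per-syllable acoustic feature lists (equal length)
    probs    -- per-syllable probabilities from the previous iteration
    """
    n, dim = len(acoustic), len(acoustic[0])
    features = []

    # Acoustic context: the current syllable plus its neighbors' raw features.
    for j in range(i - M_acoustic, i + M_acoustic + 1):
        features += acoustic[j] if 0 <= j < n else [0.0] * dim

    # Prosodic context: neighbors' probabilities, excluding the syllable itself.
    for j in range(i - M_prob, i + M_prob + 1):
        if j != i:
            features.append(probs[j] if 0 <= j < n else prob_pad)

    return features
```

Keeping the two ranges as separate parameters reflects that the useful extent of acoustic context need not match the range M used for the probabilities.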

6.3 Auto-context on boundary detection

This experiment investigates the performance of the auto-context algorithm on boundary detection. Using the boundary-related features introduced in Section 5.2, we obtain a syllable-level accuracy of 88.1% and an F-score of 45.3% using overall statistics, and a word-level accuracy of 81.2% and an F-score of 46.2% using word-final statistics. On this baseline, we implemented the auto-context algorithm across different contextual extents. The syllable-level accuracy and F-score performances are shown in Figure 8a,b, and the corresponding word-level results are shown in Figure 9a,b.

The final results are listed in Table 7. For boundary detection, the auto-context algorithm yields an accuracy improvement of nearly 1% using overall statistics and 1.6% using word-final statistics. For the former, it achieves an accuracy of 89.0% at M=4 and itr=4, while for the latter, it achieves an accuracy of 82.9% at M=3 and itr=7. As with pitch accent detection, the auto-context algorithm achieves a larger improvement in F-score than in accuracy for the boundary detection task, improving the F-score by about 12% for both statistics. It achieves a syllable-level F-score of 57.3% and a word-level F-score of 58.9%.

Table 7

Performance of boundary detection when using the auto-context algorithm

| Auto-context (%) | Accuracy (%): None | Accuracy (%): Auto-context | F-score (%): None | F-score (%): Auto-context |
|---|---|---|---|---|
| SL boundary | 88.1 | 89.0 (M=4, itr=4) | 45.3 | 57.3 (M=5, itr=6) |
| WL boundary | 81.2 | 82.9 (M=3, itr=7) | 46.2 | 58.9 (M=5, itr=7) |

Here, ‘None’ means not using auto-context.
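The difference between the syllable-level (SL) and word-level (WL) figures can be illustrated with a small scoring sketch. It rests on an assumption that this section does not spell out: that "overall statistics" scores every syllable, while "word-final statistics" scores only word-final syllables. The function name and the `word_final` flags are hypothetical.

```python
def boundary_accuracy(y_true, y_pred, word_final, word_level=False):
    """Accuracy over all syllables, or over word-final syllables only.

    word_final -- list of booleans, True where the syllable ends a word
    word_level -- if True, score only the word-final syllables
    """
    pairs = [
        (t, p)
        for t, p, is_final in zip(y_true, y_pred, word_final)
        if is_final or not word_level
    ]
    return sum(t == p for t, p in pairs) / len(pairs)
```

Under this assumption, word-internal syllables are easy non-boundary cases that raise the syllable-level accuracy, which would be consistent with the SL baseline accuracy being higher than the WL baseline in Table 7.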

6.4 Methods comparison

In this section, we compare the performance of our SVM-based auto-context system with two alternative methods: an n-gram language model and the CRF approach. For the n-gram approach, we refer to the results of the representative work [8], in which the same two-way pitch accent detection and binary boundary detection were implemented on the BURSC dataset using the syllable-level acoustic features of F0, timing cues, and energy. That work used an NN as the classifier and applied the n-gram language model in the rescoring stage in order to utilize context information. The experiments were conducted without considering acoustic dependencies across syllables, and fivefold cross-validation was used to measure detection accuracy. With a 4-gram language model, the accuracy of pitch accent detection improved substantially, from 74.1% to 80.1%. However, the boundary detection accuracy degraded slightly, from 90.0% to 89.6%.

We conducted the CRF-related experiments using the CRF++ toolkit [30] with the same data and acoustic features (including first-order differential features) as were used in the auto-context experiments. The CRF tool does not support continuous features, so we discretized them with a k-means approach. A linear-chain CRF model in bigram mode was used, trained with the limited-memory BFGS algorithm. Options such as the cut-off threshold and the number of quantization bins for the acoustic features were tuned on the development dataset.

The final results are listed in Table 8. We can observe that the SVM-based auto-context system achieves better performance than CRF, especially in terms of the F-score for boundary detection, where it surpasses the CRF algorithm by 2%. It also has an advantage in binary pitch accent detection over the NN-based n-gram language model, although the performance on binary boundary detection is not improved.
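The discretization step used for the CRF experiments can be sketched as a one-dimensional k-means quantizer applied to each continuous feature. This is a minimal sketch under assumptions: the text says only that a k-means approach was used, so the per-feature treatment, the quantile-based initialization, and the function names below are hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Return k sorted centroids for a list of scalar feature values."""
    srt = sorted(values)
    # Initialize centroids at evenly spaced quantiles of the data.
    centroids = [srt[int((i + 0.5) * len(srt) / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(g) / len(g) if g else centroids[c]
                   for c, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def quantize(value, centroids):
    """Map a continuous value to the index of its nearest centroid."""
    return min(range(len(centroids)), key=lambda c: abs(value - centroids[c]))
```

The number of centroids `k` plays the role of the number of quantization bins that the text describes as tuned on the development dataset; each resulting index can then be written out as a discrete symbol for the CRF input.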

In this paper, we introduce a flexible and effective algorithm called auto-context for prosodic event detection. The algorithm uses an iterative approach to incorporate contextual information and thereby improve prosodic event detection. The probabilities of neighboring syllables are integrated with acoustic features to recursively boost the classification performance of the acoustic models. The experiments on two-way and four-way pitch accent detection and binary boundary detection show that auto-context improves performance in terms of both accuracy and F-score. The combination of prosodic and acoustic context gives the best performance. For two-way pitch accent detection, accuracy is improved from 79.3% to 82.0% and F-score from 66.7% to 73.0%. For four-way pitch accent detection, accuracy is improved from 74.2% to 77.0% and F-score from 54.3% to 63.0%. Similar improvement is also shown for boundary detection. Using the overall statistical method, the detection accuracy is improved from 88.1% to 89.0% and F-score from 45.3% to 57.3%, while using the word-final statistical method, accuracy is improved from 81.2% to 82.9% and F-score from 46.2% to 58.9%.

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.