Abstract [en]

Continuous modeling of intonation in natural speech has long been hampered by a focus on modeling pitch, several normative aspects of which are particularly problematic. These include, among others, that pitch is undefined in unvoiced segments, that its absolute magnitude is speaker-specific, and that its robust estimation and modeling at a particular point in time rely on a patchwork of long-time stability heuristics. In the present work, we continue our analysis of the fundamental frequency variation (FFV) spectrum, a recently proposed instantaneous, continuous, vector-valued representation of pitch variation, which is obtained by comparing the harmonic structure of the frequency magnitude spectra of the left and right halves of an analysis frame. We analyze the sensitivity of a task-specific error rate in a conversational spoken dialogue system to the specific definition of the left and right halves of a frame, and derive operational recommendations regarding the framing policy and window shape.
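
As a rough illustration of the comparison described above, the following minimal sketch computes an FFV-like vector for a single frame. It is not the formulation evaluated in this work: the function name `ffv_sketch`, the strict split of the frame into two equal halves, the Hann windows, the linear interpolation, and the grid of dilation values `rhos` (in octaves per half-frame) are all assumptions made for illustration. For each candidate dilation, one half's magnitude spectrum is resampled on a stretched frequency axis and compared to the other half's spectrum by a normalized dot product.

```python
import numpy as np


def ffv_sketch(frame, rhos):
    """Return an FFV-like vector for one frame (illustrative assumptions only).

    frame : 1-D array of samples covering the whole analysis frame.
    rhos  : candidate pitch changes between the halves, in octaves.
    """
    frame = np.asarray(frame, dtype=float)
    half = len(frame) // 2

    # Assumption: the "left" and "right" parts are the two equal halves of the
    # frame, each tapered by a Hann window.
    window = np.hanning(half)
    mag_left = np.abs(np.fft.rfft(frame[:half] * window))
    mag_right = np.abs(np.fft.rfft(frame[len(frame) - half:] * window))

    bins = np.arange(len(mag_left), dtype=float)
    out = np.empty(len(rhos))
    for i, rho in enumerate(rhos):
        scale = 2.0 ** rho
        if rho >= 0:
            # Rising pitch: read the left spectrum at lower frequencies so its
            # harmonics line up with those of the right spectrum.
            a = np.interp(bins / scale, bins, mag_left)
            b = mag_right
        else:
            # Falling pitch: read the right spectrum at lower frequencies.
            a = mag_left
            b = np.interp(bins * scale, bins, mag_right)
        # Normalized dot product (cosine similarity) of the two spectra.
        denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
        out[i] = np.sum(a * b) / denom if denom > 0 else 0.0
    return out
```

Under these assumptions, a frame whose two halves have identical magnitude spectra (for example, a perfectly periodic signal with an integer number of periods per half) yields its maximum at a dilation of zero, while a rising or falling pitch shifts the maximum toward positive or negative values. Because the output is a similarity profile over dilations rather than a pitch estimate, it stays defined in unvoiced frames and does not depend on the speaker's absolute pitch.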