In the EMIME project, we
developed a mobile device that performs personalized
speech-to-speech translation such that a user’s spoken input in one
language is used to produce spoken output in another language that
still sounds like the user's own voice. We integrated two
techniques into a single architecture: unsupervised adaptation for
HMM-based TTS using word-based large-vocabulary continuous speech
recognition, and cross-lingual speaker adaptation (CLSA) for
HMM-based TTS. The CLSA is based on a state-level transform mapping
learned using minimum Kullback–Leibler divergence (KLD) between pairs of
HMM states in the input and output languages. Thus, an unsupervised
cross-lingual speaker adaptation system was developed. End-to-end
speech-to-speech translation systems for four languages (English,
Finnish, Mandarin, and Japanese) were constructed within this
framework. In this paper, the English-to-Japanese adaptation is
evaluated. Listening tests demonstrate that the adapted voices sound
more similar to the target speaker than average voices do, and that the
differences between supervised and unsupervised cross-lingual
speaker adaptation are small. Calculating the KLD state mapping on
only the first 10 mel-cepstral coefficients substantially reduces the
computational cost, with no detrimental effect on the quality of the
synthetic speech.
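As a rough illustration of the state-mapping idea described above (not the implementation used in the paper), the following Python sketch pairs each input-language HMM state with the output-language state of minimum KLD, restricting the divergence to the first 10 mel-cepstral coefficients. It assumes single diagonal-covariance Gaussian state emissions and a symmetrised KLD; the function names and data layout are illustrative assumptions.

```python
import numpy as np

def gaussian_kld(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians,
    given mean and variance vectors."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def build_state_mapping(input_states, output_states, n_coeffs=10):
    """Map each input-language HMM state to the output-language state
    with minimum symmetrised KLD (an illustrative choice), using only
    the first `n_coeffs` mel-cepstral coefficients.

    Each state is a (mean, variance) pair of 1-D numpy arrays.
    """
    mapping = {}
    for i, (mu_in, var_in) in enumerate(input_states):
        best_j, best_d = None, np.inf
        for j, (mu_out, var_out) in enumerate(output_states):
            # Truncate to the leading mel-cepstral coefficients.
            d = (gaussian_kld(mu_in[:n_coeffs], var_in[:n_coeffs],
                              mu_out[:n_coeffs], var_out[:n_coeffs])
                 + gaussian_kld(mu_out[:n_coeffs], var_out[:n_coeffs],
                                mu_in[:n_coeffs], var_in[:n_coeffs]))
            if d < best_d:
                best_j, best_d = j, d
        mapping[i] = best_j
    return mapping
```

In this sketch, restricting the KLD to the first 10 coefficients shrinks the per-pair cost without changing the structure of the exhaustive state-by-state search.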