Fully automatic building of synthesizers in unresearched languages is still a
long way off, but with the growing demand for support of minority
languages it is something that should be addressed.

Using acoustic information to find distinctions is implicitly what we
have been trying to do in unit selection synthesis, so taking explicit
advantage of it should not be a surprise.

Anecdotal evidence of this already shows up in other synthesizers we
have built. When an American English synthesizer is built with a US
English phoneset and a US English lexicon, but with a Scottish English
speaker, the lexical entries do not properly match the speaker's
pronunciations. For example, the palatalized /uw/ found in British
English, as in /t y uw z d ey/ (Tuesday), is defined as /t uw z d ey/
in the US English lexicon. When this labeling is applied to a Scottish
English speaker, the /y-uw/ segment is labeled simply as /uw/. When
other words are then synthesized in similar contexts, the
palatalization is still generated, so a word labeled as /s t uw d eh n
t/ (student) may, correctly for the dialect, be synthesized with
acoustics that would better be labeled /s t y uw d eh n t/.

It should be noted that it is rare that absolutely no phonetic
knowledge is available for a language, and often at least some
information (vowel/consonant) can be derived directly from the
orthographic system. However, it is not unusual that no
linguistically knowledgeable speakers of the language are available,
and native speakers are often not explicitly conscious of the
distinctions they are making. In practice, a gross classification of
phonemes can be reasonably specified, but fine distinctions are much harder.
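
As a concrete illustration of how little is needed for such a gross
classification, the following sketch maps letters of a Latin-script
language to coarse vowel/consonant classes. The letter classes and
example words are illustrative assumptions, not part of any particular
voice-building tool.

    # A minimal sketch of deriving a gross phoneme classification
    # directly from orthography. The letter classes below are a crude,
    # hand-specified assumption for a hypothetical Latin-script language.
    VOWEL_LETTERS = set("aeiou")

    def gross_classes(word):
        """Map each letter to a coarse class: 'V' (vowel-like) or 'C' (consonant-like)."""
        return ["V" if ch in VOWEL_LETTERS else "C" for ch in word.lower()]

    if __name__ == "__main__":
        for w in ["tuesday", "student"]:
            print(w, gross_classes(w))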

It is worth comparing the complexity of mapping letters directly to
acoustics with the more standard approach of having an intermediate
finite phone set. As we are considering mapping without explicit
lexicons, it is best to compare with the automatic letter to sound
rule mappings described in [7], where letters are mapped to a
predefined finite phone set. Importantly, letter to sound training
sets are bigger, because it is easier to collect text than speech.
However, the difference in size is only perhaps one order of magnitude
(5,000 words vs. 50,000 words), and in the letter to acoustic case we
have selected the data deliberately to get coverage.
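
The remark about selecting data deliberately for coverage can be made
concrete with a small, hedged sketch of greedy coverage-based prompt
selection: sentences are chosen so that letter trigram contexts are
covered with as few recordings as possible. The trigram scoring,
padding, and toy corpus are illustrative assumptions rather than the
selection procedure actually used.

    # A hedged sketch of greedy, coverage-driven prompt selection:
    # repeatedly pick the sentence that adds the most unseen letter
    # trigrams until nothing new is covered or the budget runs out.
    def letter_trigrams(sentence):
        s = "##" + sentence.lower() + "##"   # pad so edges appear as contexts
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def greedy_select(corpus, max_prompts):
        corpus, covered, selected = list(corpus), set(), []
        for _ in range(max_prompts):
            best = max(corpus, key=lambda s: len(letter_trigrams(s) - covered),
                       default=None)
            if best is None or not (letter_trigrams(best) - covered):
                break                         # nothing new left to cover
            selected.append(best)
            covered |= letter_trigrams(best)
            corpus.remove(best)
        return selected

    if __name__ == "__main__":
        corpus = ["the tuesday student", "a new tune", "the student reads"]
        print(greedy_select(corpus, 2))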

Machine learning techniques could allow us to assume a hidden layer
that explicitly represents a phone set, but we have not investigated
that yet.

Another direction that may be worth investigating is to cluster the
acoustics independent of any labeling and then match the types
identified by the clusters to letters. Such techniques for
acoustically derived units have been studied for speech recognition
(e.g. [8]) but have not yet been investigated for
unit selection synthesis.
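
A rough sketch of this direction, under strong simplifying
assumptions, might cluster unlabeled acoustic frames and then relate
the cluster identities to letters. The feature vectors, the uniform
frame-to-letter alignment, and the cluster count below are
illustrative assumptions, not a description of the method in [8].

    # A hedged sketch: cluster acoustic frames with no phonetic labels,
    # then count which letters each cluster co-occurs with, using a naive
    # uniform alignment of frames to letters as a stand-in for a real one.
    from collections import Counter, defaultdict
    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_and_match(utterances, n_clusters=8):
        """utterances: list of (word, frames), frames being an (n, d) array of
        acoustic features (e.g. MFCCs from some front end)."""
        all_frames = np.vstack([f for _, f in utterances])
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(all_frames)

        cooc = defaultdict(Counter)
        i = 0
        for word, frames in utterances:
            for j in range(len(frames)):
                # Assume frames are spread uniformly across the word's letters.
                letter = word[min(j * len(word) // len(frames), len(word) - 1)]
                cooc[labels[i + j]][letter] += 1
            i += len(frames)
        return {c: counts.most_common(3) for c, counts in cooc.items()}

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        data = [("tuesday", rng.normal(size=(30, 13))),
                ("student", rng.normal(size=(28, 13)))]
        print(cluster_and_match(data, n_clusters=4))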

It is clear that, depending on the language and the knowledge
available, there is a scale from pure letter to acoustic models
through to letter to phone plus phone to acoustic models. We would
like to make that scale available to the voice builder, so that they
may best take advantage of the information they currently have
available.

Another point that we wish to make clear is that without native
speakers' feedback for evaluation, the ultimate quality of a synthetic
voice cannot be determined. As those who work in the field quickly
notice, synthesis in languages you are not familiar with typically
sounds better than synthesis in languages you know well. It takes
fluent speakers to properly evaluate the content. In our experience
building synthesizers for minority languages, we find, anecdotally,
that listeners' reactions can be more extreme than those of listeners
of more common languages. On the one hand, the fact that a synthesizer
exists at all in their language can make some native listeners accept
what is not the best possible synthesis. On the other hand, listeners
of minority languages are likely to be unfamiliar with speech
synthesis, and may find even high quality recorded speech difficult to
understand.