A Flite-based synthesizer has been thoroughly tested, and it runs on
multiple platforms. The example voice distributed is an 8KHz diphone
voice, the same voice that is distributed with Festival as kallpc8K.
That voice is rather old and not very good, but we deliberately chose
a stable voice as our first example so that we could properly verify
that the quality in Flite is the same as it is under Festival.

The following table gives code/data size comparisons for
the 8KHz kal voice.

             Flite    Festival
core code    50K      2.6M
USEnglish    35K      ??
lexicon      1.6M     5M
diphonedb    2.1M     2.1M

Festival doesn't have a clear separation between its language
implementation and its core code, so it is difficult to give a figure
for the USEnglish entry. However, the Festival Scheme representation
of a basic duration model alone is 35 kilobytes.

Run-time memory requirements for Flite are less than twice the size of
the largest waveform built. In its current form, a complete 16-bit
waveform is built for each utterance being synthesized, and the
complete runtime memory requirement is about 1.75 times the size of
that waveform. For our test set of the first two chapters of
``Alice's Adventures in Wonderland,'' the requirement is less than
1 megabyte. For the same task, Festival using the equivalent 8KHz
diphone voice requires about 16-20 megabytes.
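
The memory estimate above can be sketched as simple arithmetic. This is
a rough illustration, not part of Flite: the function names are
hypothetical, a mono waveform is assumed, the 1.75 factor is taken from
the text as if it scaled linearly, and 1 megabyte is assumed to mean
2**20 bytes.

```python
def waveform_bytes(duration_s, sample_rate_hz=8000, bytes_per_sample=2):
    """Size in bytes of a mono PCM waveform (16-bit samples by default)."""
    return int(duration_s * sample_rate_hz * bytes_per_sample)


def estimated_runtime_bytes(duration_s, overhead_factor=1.75):
    """Approximate run-time memory: overhead factor times the waveform size."""
    return overhead_factor * waveform_bytes(duration_s)


def longest_utterance_s(budget_bytes, overhead_factor=1.75,
                        sample_rate_hz=8000, bytes_per_sample=2):
    """Longest single utterance (seconds) that fits in a memory budget."""
    return budget_bytes / (overhead_factor * sample_rate_hz * bytes_per_sample)


if __name__ == "__main__":
    # A 10 second utterance at 8KHz, 16 bit.
    print(waveform_bytes(10), "bytes of waveform")
    print(estimated_runtime_bytes(10), "bytes of estimated run-time memory")
    # Longest utterance that would stay under an assumed 2**20 byte budget.
    print(round(longest_utterance_s(2**20), 1), "seconds")
```

Under these assumptions, staying below 1 megabyte corresponds to a
longest utterance of somewhat under 40 seconds of speech.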

The current Flite system with an 8KHz diphone voice has a full
footprint of 5M: 4M of code and data plus 1M of RAM. The equivalent
for Festival is about 30-40M.
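
As a quick check, the 4M code-and-data figure can be reproduced by
summing the Flite column of the size table. The sketch below assumes
that K and M denote powers of 1024, which the text does not state; the
variable names are illustrative only.

```python
# Assumption: K and M are binary units (the source does not say).
K = 1024
M = 1024 * K

# Flite column of the code/data size table for the 8KHz kal voice.
flite_sizes = {
    "core code": 50 * K,
    "USEnglish": 35 * K,
    "lexicon": 1.6 * M,
    "diphonedb": 2.1 * M,
}

code_and_data = sum(flite_sizes.values())
code_and_data_m = code_and_data / M

if __name__ == "__main__":
    print(f"code and data: {code_and_data_m:.2f}M")
    # Adding the roughly 1M of run-time memory gives the full footprint.
    print(f"full footprint: about {code_and_data_m + 1:.1f}M")
```

The sum comes to a little under 4M, which together with about 1M of RAM
is consistent with the quoted 5M footprint.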

As for speed of synthesis, our test consists of the first two chapters
of ``Alice's Adventures in Wonderland,'' which render to just under
22 minutes of speech. On a 500MHz PIII running Linux, Flite renders
this in 19.1 seconds (70.6 times faster than real time) while the
equivalent voice in Festival takes 97 seconds (13.4 times faster).
Thus Flite is over 5 times faster.
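
The relative speed-up follows directly from the two render times. The
sketch below uses only the timings quoted above; `real_time_factor` is
a hypothetical helper, and the audio duration is left as a parameter
because the text gives it only as "just under 22 minutes."

```python
def real_time_factor(audio_s, render_s):
    """How many times faster than real time a synthesizer runs."""
    return audio_s / render_s


# Render times reported for the same text on the same machine.
flite_render_s = 19.1
festival_render_s = 97.0

# The audio duration cancels out when comparing the two systems.
speedup = festival_render_s / flite_render_s

if __name__ == "__main__":
    print(f"Flite is {speedup:.2f} times faster than Festival")
```

Because both systems render the same text, the comparison does not
depend on the exact length of the audio.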

Another key speed test we did was to time how quickly the system can
start to speak. For a 20 word utterance, Flite starts writing to the
audio device in 45ms; for a 40 word utterance it is about 75ms. The
startup time before the first synthesis function is called is about
23ms. For Festival running from the command line the equivalent is
about 4-5 seconds. When running Festival as a server and using the
client access method, which excludes the startup time, we still cannot
get the time below 1 second for the 20 word utterance, and it is
nearer 2 seconds for the 40 word utterance.