Luks.fe.uni-lj.si

Melita Hajdinjak and France Mihelič
University of Ljubljana, Faculty of Electrical Engineering, Slovenia
{melita.hajdinjak,france.mihelic}@fe.uni-lj.si and http://luks/
Keywords: natural-language dialogue systems, Wizard-of-Oz experiment, dialogue-manager evaluation, PARADISE evaluation framework
Human-human and human-computer dialogues differ in such an important way that the data from human interaction becomes an unreliable source of information for some important aspects of designing natural-language dialogue systems. Therefore, we began the process of developing a natural-language, weather-information-providing dialogue system by conducting the Wizard-of-Oz (WOZ) experiment. In WOZ experiments subjects are told to interact with a computer system, though in fact they are not since the system is partly simulated by a human, the wizard. During the development of the weather-information-providing dialogue system this experiment was used twice. While the aim of the first WOZ experiment was, first of all, to gather human-computer data, the aim of the second WOZ experiment was to evaluate the newly-implemented dialogue-manager component. The evaluation was carried out using the PARADISE evaluation framework, which maintains that the system's primary objective is to maximize user satisfaction, and it derives a combined performance metric for a dialogue system as a weighted linear combination of task-success measures and dialogue costs.

In a nutshell, a dialogue system or a voice interface enables users to interact with some application using spoken language. The application in question can be, for example, a piece of hardware (command & control systems) or a kind of database (interactive voice response, information-providing dialogue systems, problem-solving dialogue systems). A detailed overview is given by Krahmer [1]. In this article, we will focus on information-providing, natural-language dialogue systems, which have already been developed for different domains, for instance, restaurant information [2], theatre information [3], train travel information [4, 5], air travel information [6, 7], [...]

It is generally acknowledged that developing a successful computational model of natural-language dialogues requires extensive analysis of sample dialogues, but the question that arises is whether these sample dialogues should be human dialogues. On the one hand, it has often been argued that human dialogues should be regarded as guidance and a norm for the design of natural-language dialogue systems, i.e., that a natural dialogue between a person and a computer should resemble a dialogue between humans as much as possible. On the other hand, a computer is not a person. Consequently, human-human and human-computer dialogues differ in such an important way that the data from human interaction becomes an unreliable source of information for some important aspects of designing natural-language dialogue systems, in particular the style and complexity of interaction [...] language dialogue systems are influenced by the system's language [11], i.e., they often adapt their behaviour to the expected language abilities of the counterpart. Therefore, instead of gathering human-human data, we started the process of designing the Slovenian and Croatian spoken, weather-information-providing dialogue system [12] by conducting the Wizard-of-Oz (WOZ) experiment [10, 13], which is a more accurate predictor of actual human-computer interaction [9].
This is because in WOZ studies subjects are told to interact with a computer system, though in fact they are not. The system is at least partly simulated by a human, the wizard, with the consequence that the subjects can be given more freedom of expression or be constrained in more systematic ways than is the case in already existing systems. During the development of the weather-information-providing dialogue system the WOZ experiment was used twice. While the aim of the first WOZ experiment (section 2) was, first of all, to gather human-computer data, the aim of the second WOZ experiment (section 3) was to evaluate the newly implemented dialogue-manager component [14]. Consequently, while in the first WOZ experiment dialogue management was still one of the tasks of the wizard, in the second WOZ experiment it was performed by the newly implemented dialogue-manager component. The differences in the data from the two WOZ experiments therefore reflect the dialogue manager's performance. This data was evaluated with the PARADISE evaluation framework [15], i.e., a potential general methodology for evaluating and comparing the performance of spoken dialogue agents, which maintains that the system's primary objective is to maximize user satisfaction, and which derives a combined performance metric for a dialogue system as a weighted linear combination of task-success measures and dialogue costs.

The aim of the first WOZ experiment [13] was to gather data that would serve as the basis for the construction of the dialogue manager and the speech-understanding component within the developing Slovenian and Croatian spoken dialogue system for weather-information retrieval [12]. However, the first WOZ system consisted [...] wizard's graphical interface [13], designed as an internet application, which included facilities for the playback of predefined spoken responses as well as forms, image fields and [...] Slovenian text-to-speech synthesis [16].

Hence, the task of the wizard in the first WOZ experiment was to simulate Slovenian speech understanding (speech recognition and natural-language understanding) and dialogue management. Croatian speech understanding was not performed since only Slovene users were involved in the experiment. During the experiment, the wizard sat behind the graphical interface, listened to the users' queries and tried to mediate an appropriate response, which was then followed by the natural-language-generation process and the text-to-speech process.

A total of 76 Slovene users (38 female, 38 male) were chosen to take part in the first WOZ experiment. The statistical distributions of the users' ages, educations, dialects, the telephone units and the background environments from where the telephone calls were made were chosen to simulate the actual scenarios. The users were given verbal instructions about the general functionality of the system and a sheet of paper containing a description of the tasks they were supposed to complete. They had two scenarios to enact. The first task was to obtain a particular piece of weather-forecast information, such as the temperature in Ljubljana or the weather forecast for Slovenia tomorrow, and the second task was a given situation, such as "You are planning a trip to. What are you interested in?", the aim of which was to stimulate the user to ask context-specific questions. After these two scenarios, users were given the freedom to ask additional questions.

In order to evaluate user satisfaction, users were given the user-satisfaction survey [17] used within the PARADISE framework (section 4), which asks the respondent to specify the degree to which he or she agrees with several questions about the behaviour or the performance of the system (TTS Performance, [...] Pace, User Expertise, System Response, Expected Behaviour, Future Use). The answers to the questions were based on a five-class ranking scale from 1, indicating strong disagreement, to 5, indicating strong agreement. All the mean values are given in table 1. A comprehensive User Satisfaction was then computed by summing each question's score, and thus ranged in value from a low of 8 to a high of 40. In the first WOZ experiment, the mean User Satisfaction value was 34.08, with a standard deviation [...]

[...] (WOZ1) and the second (WOZ2) WOZ experiment [...]

The spontaneous speech data collected during the first WOZ experiment, named Slovenian Spontaneous Speech Queries (SSSQ), was transcribed with the Transcriber tool [18]. The transcription was labelled for turns and utterances, and special labels for dialectal words and non-speech sounds were added. An example [...]

The second WOZ experiment was carried out in order to evaluate the performance of the newly implemented dialogue manager [14], built on the basis of the data collected during the first WOZ experiment. Therefore, all the other components of the system remained the same. Hence, in comparison with the first WOZ experiment, the task of the wizard in the second WOZ experiment was only to simulate Slovenian speech understanding. The wizard sat behind the dialogue manager's interface and entered the meaning representation [...]

A total of 68 Slovene users (29 female, 39 male) were chosen to take part in the second WOZ experiment. They were given the same instructions and the same user-satisfaction survey as the users in the first experiment. All the mean user values, which were slightly worse than the values from the first WOZ experiment, are given in table 1. The mean User Satisfaction value was this time 31.96, with a standard deviation of 4.99. Note that the difference between the mean User Satisfaction values in the two experiments is expected, since the wizard with her human-level intelligence should have been able to manage the dialogue better than the implemented dialogue-manager component.

The Slovenian spontaneous speech data collected during the second WOZ experiment was named Slovenian Spontaneous Speech Queries 2 [...]

In agreement with previous studies [9, 10, 11], we observed that in both experiments the users adapted their behaviour to the expected language abilities of the natural-language-spoken WOZ system. In several dialogues the first question was much longer than the following ones and, in the case of repetitions requested by the system, the speech mode became more articulated, slower and/or louder. Moreover, while the wizard was mediating her response some users made fun of the system; they made comments like "What a voice - terribly", "It is thinking", "It is searching in the computer", and they laughed. But such side remarks would certainly be rather strange in a natural information-providing task because, in both experiments, subjects were basically role playing. They were not real users with real information requirements or real time constraints and telephone [...]

The dialogue-manager component [14] was evaluated using the PARADISE framework [15], which maintains that the system's primary objective is to maximize user satisfaction, and which derives a combined performance metric for a dialogue system as a weighted linear combination of task-success measures and dialogue costs (i.e., dialogue-efficiency costs and dialogue-quality costs). The PARADISE model of performance posits that a performance function can then be derived by applying multivariate linear regression (MLR) with user satisfaction as the dependent variable and task-success measures, dialogue-efficiency costs, and dialogue-quality costs as the independent variables. Here, user satisfaction, which has been frequently used in the literature as an external indicator of the usability of a dialogue system, is calculated with the survey [17] used in our WOZ experiments.

S: Hello. The dialogue system for weather-forecast information speaking. Can I help you?
S: Wait a moment, please. [the wizard is choosing her answer]
S: In Štajerska today - the visibility will be more than 10 km. Is there something else?
S: For which location are you asking for?
S: Wait a moment, please. [the wizard is choosing her answer]
S: No, in Štajerska today - the sky will clear up. Is there something else?
U: What about the weather in Poland in the next few days?
S: Wait a moment, please. [the wizard is choosing her answer]
S: In Varšava, Poland - it is cloudy, the air temperature is -6 degrees Celsius. Is there something else?
S: I do not offer this information. Do you have any other question?
S: Thank you for your cooperation. Goodbye.

Table 2: The Slovene-English translation of an example dialogue between a user (U) and the WOZ system (S), recorded during the first WOZ experiment.

In order to model the performance of both WOZ systems, we selected 17 regression parameters. First, we computed the task-success measure Kappa coefficient (κ) [19], reflecting the wizard's typing errors, and the dialogue-efficiency costs Mean Elapsed Time (MET), i.e., the mean elapsed time for the completion of the tasks that occurred within the interaction, and Number of User Turns (NUT). Second, the following dialogue-quality costs were selected: Task Completion (Comp), i.e., the user's perception of completing the given task; Mean Words per Turn (MWT), i.e., the mean number of words per user turn; Mean Response Time (MRT), i.e., the mean system-response time; Max Response Time (MaxRT), i.e., the maximum system-response time; Rejection Ratio (RR), i.e., the ratio of system moves asking for a repetition of the last utterance; Help-Message Ratio (HMR), i.e., the ratio of system help moves; Check Ratio (CR) and Number of Check moves (NC), i.e., the ratio and the number of system moves checking some information regarding past dialogue events; Non-Provided Information Ratio (NPR), i.e., the ratio of user-initiating moves that do not result in relevant information being provided; No-Data [...] Responses (NNR), i.e., the ratio and the number of system moves stating that the requested information is not available; Relevant-Data Ratio (RDR), i.e., the ratio of system moves directing the user to select relevant, available data; Unsuitable-Initiative Ratio (UIR), i.e., the ratio of user-initiating moves that are out of context; Non-Initiating Ratio (NIR), i.e., the ratio of [...]

[...] the first WOZ experiment to derive a performance [...] Ratio, Non-Provided-Information Ratio, Task [...] that significantly contributed to user satisfaction. On the other hand, the most significant parameters in the second WOZ experiment were [...]
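
The regression step of PARADISE described in this section can be illustrated with a minimal sketch. It sums eight survey answers into a User Satisfaction score and fits a linear performance function by least squares. The data values, the function names, and the choice of three parameters are hypothetical and are not taken from the experiments. The z-score normalization of each parameter follows the usual description of PARADISE and is an assumption here, since this text does not spell it out.

```python
from statistics import mean, pstdev

def user_satisfaction(answers):
    """Sum the eight 1-5 survey answers; the total ranges from 8 to 40."""
    if len(answers) != 8 or not all(1 <= a <= 5 for a in answers):
        raise ValueError("expected eight answers on a 1-5 scale")
    return sum(answers)

def z_normalize(column):
    """Z-score normalization N(x) = (x - mean) / std (assumed, see lead-in)."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_performance(columns, satisfaction):
    """Least-squares fit: satisfaction ~ intercept + sum_i w_i * N(column_i)."""
    normalized = [z_normalize(col) for col in columns]
    rows = [[1.0] + [col[k] for col in normalized] for k in range(len(satisfaction))]
    p = len(rows[0])
    # Normal equations (X^T X) w = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, satisfaction)) for i in range(p)]
    coef = solve(xtx, xty)
    return coef[0], coef[1:]

# Hypothetical toy data, one value per dialogue (not from the WOZ experiments).
kappa = [0.90, 0.80, 1.00, 0.70, 0.95, 0.85]   # task-success measure
nut = [5, 8, 4, 9, 6, 7]                       # Number of User Turns
npr = [0.10, 0.30, 0.00, 0.50, 0.20, 0.25]     # Non-Provided Information Ratio
surveys = [
    [5, 5, 4, 5, 4, 5, 4, 5],
    [4, 3, 4, 4, 3, 4, 3, 4],
    [5, 5, 5, 5, 5, 5, 5, 5],
    [3, 3, 2, 3, 3, 3, 2, 3],
    [4, 5, 4, 4, 4, 5, 4, 4],
    [4, 4, 3, 4, 4, 4, 3, 4],
]
satisfaction = [user_satisfaction(s) for s in surveys]
intercept, weights = fit_performance([kappa, nut, npr], satisfaction)
print(f"intercept={intercept:.2f}", [round(w, 2) for w in weights])
```

Because every parameter is normalized to zero mean and unit variance, the fitted weights are directly comparable: a larger absolute weight indicates a stronger contribution to user satisfaction, and a negative weight marks a parameter that behaves as a cost. With real data one would also test each weight for statistical significance, which this sketch omits.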

[...] Message Ratio is a consequence of the user's behaviour during the conversation, which is, on the other hand, influenced by the system's level of user-friendliness and cooperation. A user-friendly and cooperative dialogue system should not only play an active role in directing the dialogue flow toward a successful conclusion for the user, it should also be able to take the initiative and to instruct the user if he/she asks for help. However, because some novice users of a dialogue system who are not able to adapt quickly are likely to need instructions provided by the system, Help-Message Ratio is expected to reflect user satisfaction. Furthermore, because Check Ratio is in a way related to the speech-understanding process, which is usually the most problematic part of a dialogue system's performance, it is inappropriate to try to decrease it at any price. Consequently, user satisfaction can be remarkably improved only by decreasing [...] be done by preventing the dialogue manager from giving no information before first checking that there is no other available data that might be relevant to the user's request, i.e., the dialogue manager should be as flexible as possible in directing the user to select relevant, available data.

Walker et al. [17] found in their experiments that Task Completion, rather than Kappa, was a significant factor in predicting user satisfaction, and argued that this was because the user's perceptions of task completion sometimes varied from Kappa. In our experiments, Kappa only referred to the wizard and Task Completion was related only to the first task, which could be the reasons why we did not come to the same conclusion. On the one hand, in these experiments, Kappa and Task Completion were uncorrelated, but on the other hand, in the second WOZ experiment, Kappa was an even more significant predictor of user satisfaction.

[...] conducted WOZ experiments, the aim of which was to gather human-computer data and to evaluate the dialogue-manager component of the developing Slovenian and Croatian spoken dialogue system [...]

The results of applying PARADISE to the data from both WOZ experiments have been given. These have shown that user satisfaction is significantly correlated with the percentage of those user initiatives that did not result in relevant information [...] the ability to direct the user to select relevant, available data is of great importance, and, consequently, that a dialogue system should give no information only if there is no other available data that might be relevant to the user's request.

[1] Krahmer, E.J. (2001) The Science and Art of Voice Interfaces, Philips research report, [...]
[2] Jurafsky, D., Wooters, C., Tajchman, G., Segal, J., Stolcke, A., Fosler, E., and Morgan, N. (1994) The Berkeley Restaurant Project, Proc. of the 3rd International Conference on Spoken Language Processing, Acoustical Society of Japan, Yokohama, Japan, pp. 2139–[...]
[3] van der Hoeven, G., Andernach, J., van der Burgt, S., Kruijff, J., Nijholt, A., Schaake, J., and de Jong, F. (1995) A Natural Language Accessible Theatre Information and Booking System, Proc. of the 1st International Workshop on Applications of Natural Language to Data Bases, AFCET, Versailles, France, pp. [...]
[4] Eckert, W., Kuhn, T., Niemann, H., Rieck, S., Scheuer, A., and Schukat-Talamazzini, [...] quiries, Proc. of the 3rd European Conference on Speech Communication and Technology, ISCA, Berlin, Germany, pp. 1871–1874.
[5] Allen, J.F., Schubert, L.K., Ferguson, G., Heeman, P., Hwang, C.-H., Kato, T., Light, [...] Project: A Case Study in Building a Conversational Planning Agent, Journal of Experimental and Theoretical AI, Taylor and Francis Ltd, pp. 7 7–48.
[6] Ipšić, I., Mihelič, F., Dobrišek, S., Gros, J., and Pavešić, N. (1999) A Slovenian Spoken Dialogue System for Air Flight Inquires, Proc. of the 6th European Conference on Speech Communication and Technology, ISCA, Budapest, Hungary, pp. 2659–2662.
[7] Stallard, D. (2000) Talk'n'Travel: A Conversational System for Air Travel Planning, Proc. of the 6th Applied Natural Language Processing Conference, Association for Computational Linguistics, Seattle, USA, pp. 68–[...]
[8] Zue, V., Seneff, S., Glass, J., Polifroni, J., Pao, C., Hazen, T.J., and Hetherington, L. [...] Conversational Interface for Weather Information, IEEE Transactions on Speech and Audio Processing, IEEE, pp. 8(1) 85–96.
[9] Fraser, N.M. and Gilbert, G.N. (1991) Simulating Speech Systems, Computer, Speech and Language, Academic Press, pp. 5(1) 81–[...]
[10] Dahlbäck, N., Jönsson, A., and Ahrenberg, [...] How, Proc. of the International Workshop on Intelligent User Interfaces, ACM Press, Or[...]
[11] Zoltan-Ford, E. (1991) How to Get People [...] Understand, Journal of Man-Machine Studies, [...]
[12] Žibert, J., Martinčić-Ipšić, S., Hajdinjak, M., Ipšić, I., and Mihelič, F. (2003) Development of a Bilingual Spoken Dialogue System for Weather Information Retrieval, Proc. of the 8th European Conference on Speech Communication and Technology, ISCA, Geneva, [...]
[13] Hajdinjak, M. and Mihelič, F. (2003) The [...] tion Retrieval, Lecture Notes in Artificial Intelligence 2807: Text, Speech and Dialogue, pp. 400–405. Matoušek, V. and Mautner, P. [...]
[14] Hajdinjak, M. and Mihelič, F. (2004) [...] ment, Lecture Notes in Artificial Intelligence 3206: Text, Speech and Dialogue, pp. 595–602. Sojka, P., Kopecek, I. and Pala, K. [...]
[15] Walker, M.A., Litman, D., Kamm, C.A., and [...] Agents, Proc. of the 35th Annual Meeting of the Association of Computational Linguistics, Association for Computational Linguistics, [...]
[16] Gros, J., Pavešić, N., and Mihelič, F. (1997) Text-to-Speech Synthesis: a Complete System for the Slovenian Language, Journal of Computing and Information Technology, University Computing Centre Zagreb, pp. [...]
[17] Walker, M.A., Litman, D.A., Kamm, C.A., and Abella, A. (1998) Evaluating Spoken Dialogue [...] Studies, Computer, Speech and Language, Academic Press, pp. 12(3) 317–347.
[18] Barras, C., Geoffrois, E., Wu, Z., and Liber[...] and Use of a Tool for Assisting Speech Corpora Production, Speech Communication: Special Issue on Speech Annotation and Corpus Tools, Elsevier Science, pp. 33(1) 5–22.
[19] Di Eugenio, B. and Glass, M. (2004) The Kappa Statistic: A Second Look, Computational Linguistics, The MIT Press, pp. 30(1) [...]
