ABSTRACT In this paper, we propose a novel method for a robot to detect robot-directed speech, that is, to distinguish speech that users speak to a robot from speech that users speak to other people or to themselves. The originality of this work is the introduction of a multimodal semantic confidence (MSC) measure, which is used for domain classification of input speech based on whether the speech can be interpreted as a feasible action under the current physical situation in an object manipulation task. This measure is calculated by integrating speech, object, and motion confidence measures with weightings that are optimized by logistic regression. We then integrate this measure with gaze tracking and conduct experiments under conditions of natural human-robot interaction. Experimental results show that the proposed method achieves average recall and precision rates of 94% and 96%, respectively, for robot-directed speech detection.


Fig. 1. Robot used in the object manipulation task.
Fig. 2. Example of object manipulation tasks.

The user speaks a command to the robot, and the robot executes an action according to this speech. The solid line in Fig. 2 shows the trajectory of the moving object manipulated by the robot.

The commands used in this task are represented by a sequence of phrases, each of which refers to a motion, an object to be manipulated ("trajector"), or a reference object for the motion ("landmark"). In the case shown in Fig. 2, the phrases for the motion, trajector, and landmark are "Place-on," "Kermit," and "big box," respectively. Moreover, fragmental commands without a trajector phrase or a landmark phrase, such as "Place-on big box" or just "Place-on," are also acceptable.

To execute a correct action according to such a command, the robot must understand the meaning of each word in it, which is grounded in the physical situation. The robot must also hold a belief about the context information to estimate the corresponding objects for fragmental commands. In this work, we used the speech understanding method proposed by [5] to interpret the input speech as a possible action for the robot under the current physical situation. However, in an object manipulation task in a real-world environment, there may also be OOD speech such as chatting, soliloquies, or noise. Consequently, an RD speech detection method is needed.

III. PROPOSED RD SPEECH DETECTION METHOD

The proposed RD speech detection method is based on integrating gaze tracking and the MSC measure. A flowchart is given in Fig. 3.

Fig. 3. Flowchart of the proposed RD speech detection method.

First, a Gaussian mixture model based voice activity detection method (GMM-based VAD) [6] is carried out to detect speech in the continuous audio signal, and gaze tracking is performed to estimate the gaze direction from the camera images. (In this work, gaze direction was identified from the human face angle; we used faceAPI (http://www.seeingmachines.com) to extract face angles from the images captured by a camera.) If the proportion of the user's gaze at the robot during her/his speech is higher than a certain threshold η, the robot judges that the user was looking at it while speaking. Speech made during periods when the user is not looking at the robot is rejected. Then, for the speech detected while the user was looking at the robot, speech understanding is performed to output the indices of a trajector object and a landmark object, a motion trajectory, and the corresponding phrases, each of which consists of recognized words. Three confidence measures, for speech (C_S), object image (C_O), and motion (C_M), are then calculated to evaluate the feasibilities of the outputted word sequence, the trajector and landmark, and the motion, respectively. The weighted sum of these confidence measures plus a bias is fed into a logistic function; the bias and the weightings {θ_0, θ_1, θ_2, θ_3} are optimized by logistic regression [7]. The MSC measure is defined as the output of this logistic function and represents the probability that the speech is RD speech. If the MSC measure is higher than a threshold δ, the robot judges that the input speech is RD speech and executes an action according to it.
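To make the decision pipeline of Fig. 3 concrete, the following is a minimal Python sketch of the gaze gate followed by the MSC decision. It assumes the three confidence measures have already been computed by the modules described below; the function names, example confidence values, and per-frame gaze flags are hypothetical, while the weight and threshold values are those reported in Section IV.

```python
import math

def looked_at_robot(gaze_on_robot_frames, eta=0.5):
    """Gaze gate: true if the proportion of frames during the utterance in
    which the user faces the robot exceeds the threshold eta."""
    return sum(gaze_on_robot_frames) / len(gaze_on_robot_frames) > eta

def msc_measure(c_s, c_o, c_m, theta):
    """MSC measure: logistic function of the weighted sum (plus bias) of the
    speech, object, and motion confidence measures."""
    z = theta[0] + theta[1] * c_s + theta[2] * c_o + theta[3] * c_m
    return 1.0 / (1.0 + math.exp(-z))

def detect_rd_speech(gaze_on_robot_frames, c_s, c_o, c_m, theta, delta):
    """Return True if the utterance is judged to be robot-directed."""
    if not looked_at_robot(gaze_on_robot_frames):
        return False                      # rejected by the gaze gate
    return msc_measure(c_s, c_o, c_m, theta) > delta

# Weights and threshold as optimized in Section IV; the confidence values and
# gaze flags below are made up for illustration.
theta = (5.9, 0.00011, 0.053, 0.74)       # (theta_0, theta_1, theta_2, theta_3)
gaze = [1, 1, 1, 0, 1, 1]                 # per-frame "facing the robot" flags
print(detect_rd_speech(gaze, c_s=-150.0, c_o=-2.0, c_m=-1.5,
                       theta=theta, delta=0.79))
```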
In the rest of this section, we give details of the speech understanding process and the MSC measure.

A. Speech Understanding

Given input speech s and a current physical situation consisting of object information O and behavioral context q, speech understanding selects the optimal action a based on a multimodal integrated user model. O is represented as O = {(o_{1,f}, o_{1,p}), (o_{2,f}, o_{2,p}), ..., (o_{m,f}, o_{m,p})}, which includes the visual features o_{i,f} and positions o_{i,p} of all objects in the current situation, where m denotes the number of objects and i denotes the index of each object, assigned dynamically in the situation. q includes information on which objects were the trajector and the landmark in the previous action and on which object the user is now holding. a is defined as a = (t, ξ), where t and ξ denote the index of the trajector and a trajectory of motion, respectively. A user model integrating five belief modules – (1) speech, (2) object image, (3) motion, (4) motion-object relationship, and (5) behavioral context – is called an integrated belief. Each belief module and the integrated belief are learned through interaction between a user and the robot in a real-world environment.
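For concreteness, the following sketch shows one plausible way to represent the scene information O, the behavioral context q, and an action a = (t, ξ) in code. The field names, types, and example values are our own illustrative choices, not part of the original system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class SceneObject:
    features: Tuple[float, ...]            # o_{i,f}: visual features (size, L*a*b* color, shape)
    position: Tuple[float, float, float]   # o_{i,p}: 3-D position

@dataclass
class BehavioralContext:
    previous_trajector: Optional[int]      # index of the trajector in the previous action
    previous_landmark: Optional[int]       # index of the landmark in the previous action
    held_object: Optional[int]             # index of the object the user is holding, if any

@dataclass
class Action:
    trajector: int                                   # t: index of the object to move
    trajectory: List[Tuple[float, float, float]]     # xi: motion trajectory (3-D points)

# O is simply the list of objects currently observed in the scene.
O: List[SceneObject] = [
    SceneObject(features=(0.2, 55.0, 10.0, -30.0, 0.8), position=(0.1, 0.4, 0.0)),
    SceneObject(features=(0.5, 70.0, 5.0, 20.0, 0.3), position=(0.3, 0.2, 0.0)),
]
q = BehavioralContext(previous_trajector=1, previous_landmark=None, held_object=None)
```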


1) Lexicon and Grammar: The robot initially had basic linguistic knowledge, including a lexicon L and a grammar Gr. L consists of pairs of a word and a concept, each of which represents an object image or a motion. The words are represented by HMMs using mel-scale cepstrum coefficients and their delta parameters (25-dimensional) as features. The concepts of object images are represented by Gaussian functions in a multi-dimensional visual feature space (size, Lab color space (L*, a*, b*), and shape). The concepts of motions are represented by HMMs using the sequence of three-dimensional positions and their delta parameters as features.

The word sequence of speech s is interpreted as a conceptual structure z = [(α_1, w_{α_1}), (α_2, w_{α_2}), (α_3, w_{α_3})], where α_i represents the attribute of a phrase and takes a value in {M, T, L}. w_M, w_T, and w_L represent the phrases describing a motion, a trajector, and a landmark, respectively. For example, the user's utterance "Place-on Kermit big box" is interpreted as [(M, Place-on), (T, Kermit), (L, big box)]. The grammar Gr is a statistical language model represented by a set of occurrence probabilities for the possible orders of attributes in the conceptual structure.

2) Belief Modules and Integrated Belief: Each of the five belief modules in the integrated belief is defined as follows.

Speech B_S: This module is represented as the log probability of s conditioned on z, under lexicon L and grammar Gr.

Object image B_O: This module is represented as the log likelihood of w_T and w_L given the trajector's and the landmark's visual features o_{t,f} and o_{l,f}.

Motion B_M: This module is represented as the log likelihood of w_M given trajectory ξ.

Motion-object relationship B_R: This module represents the belief that, in the motion corresponding to w_M, features o_{t,f} and o_{l,f} are typical for a trajector and a landmark, respectively. This belief is represented by a multivariate Gaussian distribution with a parameter set R.

Behavioral context B_H: This module represents the belief that the current speech refers to object o, given behavioral context q, with a parameter set H.

Given a weighting parameter set Γ = {γ_1, ..., γ_5}, the degree of correspondence between speech s and action a is represented by the integrated belief function Ψ, written as

\[
\begin{aligned}
\Psi(s, a, O, q, L, Gr, R, H, \Gamma) = \max_{z,\,l} \Big(\,
 & \gamma_1 \log P(s \mid z; L)\,P(z; Gr) \\
+\ & \gamma_2 \big[ \log P(o_{t,f} \mid w_T; L) + \log P(o_{l,f} \mid w_L; L) \big] \\
+\ & \gamma_3 \log P(\xi \mid o_{t,p}, o_{l,p}, w_M; L) \\
+\ & \gamma_4 \log P(o_{t,f} - o_{l,f} \mid w_M; R) \\
+\ & \gamma_5 \big[ B_H(o_t, q; H) + B_H(o_l, q; H) \big] \Big),
\end{aligned}
\tag{1}
\]

where the five terms correspond to B_S, B_O, B_M, B_R, and B_H, respectively, l denotes the index of the landmark, o_t and o_l denote the trajector and landmark, respectively, and o_{t,p} and o_{l,p} denote the positions of o_t and o_l, respectively. Conceptual structure z and landmark o_l are selected to maximize the value of Ψ. Then, as the meaning of speech s, the corresponding action â is determined by maximizing Ψ:

\[
\hat{a} = (\hat{t}, \hat{\xi}) = \arg\max_{a} \Psi(s, a, O, q, L, Gr, R, H, \Gamma).
\tag{2}
\]

Finally, action â = (t̂, ξ̂), the index of the selected landmark l̂, and the conceptual structure (recognized word sequence) ẑ are outputted from the speech understanding process.
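To illustrate how Eqs. (1) and (2) are used, here is a schematic Python sketch of the action-selection step. Each belief module is reduced to a callable that returns a log score for a candidate interpretation; the candidate dictionaries, the stand-in modules, and the enumeration of candidates are illustrative simplifications rather than the original implementation, while the weights Γ are the values reported in Section IV-A.

```python
from typing import Callable, List, Sequence

# Each belief module scores a candidate interpretation (action a, conceptual
# structure z, landmark l) against the scene O and context q with a log score.
BeliefModule = Callable[[dict, list, dict], float]

def integrated_belief(candidate: dict, O: list, q: dict,
                      modules: Sequence[BeliefModule],
                      gamma: Sequence[float]) -> float:
    """Psi of Eq. (1) for a fixed candidate: weighted sum of the five
    belief-module log scores (B_S, B_O, B_M, B_R, B_H)."""
    return sum(g * m(candidate, O, q) for g, m in zip(gamma, modules))

def understand_speech(candidates: List[dict], O: list, q: dict,
                      modules: Sequence[BeliefModule],
                      gamma: Sequence[float]) -> dict:
    """Eq. (2): select the candidate (a_hat, z_hat, l_hat) maximizing Psi."""
    return max(candidates, key=lambda c: integrated_belief(c, O, q, modules, gamma))

# Toy usage with stand-in modules; real modules would come from the trained
# speech, object, motion, motion-object relation, and context models.
gamma = (1.00, 0.75, 1.03, 0.56, 1.88)    # Gamma reported in Section IV-A
dummy_modules = [lambda c, O, q, k=k: c["scores"][k] for k in range(5)]
candidates = [
    {"action": (0, "xi_0"), "landmark": 1, "scores": [-4.0, -2.0, -1.0, -0.5, 0.2]},
    {"action": (1, "xi_1"), "landmark": 0, "scores": [-3.0, -6.0, -2.5, -0.8, 0.1]},
]
print(understand_speech(candidates, O=[], q={}, modules=dummy_modules, gamma=gamma))
```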
B. MSC Measure

Next, we describe the proposed MSC measure. The MSC measure C_MS is calculated from the outputs of speech understanding and represents an RD speech probability. For input speech s and current physical situation (O, q), speech understanding is performed first, and then C_MS is calculated by logistic regression as

\[
C_{MS}(s, O, q) = P(\mathrm{domain} = \mathrm{RD} \mid s, O, q)
= \frac{1}{1 + e^{-(\theta_0 + \theta_1 C_S + \theta_2 C_O + \theta_3 C_M)}}.
\tag{3}
\]

Logistic regression is a type of predictive model that can be used when the target variable is a categorical variable with two categories, which makes it well suited to the domain classification problem in this work. In addition, the output of the logistic function lies in the range from 0.0 to 1.0 and can therefore be used directly to represent an RD speech probability. Given a threshold δ, speech s with an MSC measure higher than δ is treated as RD speech. The belief modules B_S, B_O, and B_M are also used for calculating C_S, C_O, and C_M, each of which is described as follows.

1) Speech Confidence Measure: The speech confidence measure C_S is used to evaluate the reliability of the recognized word sequence ẑ. It is calculated by dividing the likelihood of ẑ by the likelihood of a maximum-likelihood phoneme sequence under phoneme network G_p, and it is written as

\[
C_S(s, \hat{z}; L, G_p) = \frac{1}{n(s)} \log \frac{P(s \mid \hat{z}; L)}{\max_{u \in L(G_p)} P(s \mid u; A)},
\tag{4}
\]

where n(s) denotes the analysis frame length of the input speech, P(s|ẑ; L) denotes the likelihood of ẑ for input speech s and is given by a part of B_S, u denotes a phoneme sequence, A denotes the phoneme acoustic model used in B_S, and L(G_p) denotes the set of possible phoneme sequences accepted by the Japanese phoneme network G_p. For speech that matches the robot command grammar Gr, C_S has a greater value than for speech that does not match Gr.

The speech confidence measure is conventionally used as a confidence measure for speech recognition [8]. The basic idea is to treat the likelihood of the most typical (maximum-likelihood) phoneme sequence for the input speech as a baseline. Based on this idea, the object and motion confidence measures are defined as follows.

2) Object Confidence Measure: The object confidence measure C_O is used to evaluate the reliability with which the outputted trajector o_t̂ and landmark o_l̂ are referred to by ŵ_T and ŵ_L. It is calculated by dividing the likelihood of the visual features o_{t̂,f} and o_{l̂,f} by a baseline obtained from the likelihood of the most typical visual features for the object models of ŵ_T and ŵ_L. In this work, the maximum probability densities of the Gaussian functions are used as these baselines. The object confidence measure C_O is then written as

\[
C_O(o_{\hat{t},f}, o_{\hat{l},f}, \hat{w}_T, \hat{w}_L; L)
= \log \frac{P(o_{\hat{t},f} \mid \hat{w}_T; L)\, P(o_{\hat{l},f} \mid \hat{w}_L; L)}
            {\max_{o_f} P(o_f \mid \hat{w}_T)\, \max_{o_f} P(o_f \mid \hat{w}_L)},
\tag{5}
\]

where P(o_{t̂,f}|ŵ_T; L) and P(o_{l̂,f}|ŵ_L; L) denote the likelihoods of o_{t̂,f} and o_{l̂,f} and are given by B_O; furthermore, max_{o_f} P(o_f|ŵ_T) and max_{o_f} P(o_f|ŵ_L) denote the maximum probability densities of the Gaussian functions, and o_f denotes the visual features in the object models.
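The following sketch shows the object confidence computation of Eq. (5) under the assumption that each object-word concept is modeled by a single multivariate Gaussian over visual features, whose density is maximal at its mean. The feature vectors and model parameters are invented for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def object_confidence(traj_feat, land_feat, traj_model, land_model):
    """Eq. (5): log ratio of the selected objects' likelihoods to the maximum
    densities of their word concepts' Gaussian models. Each model is a
    (mean, covariance) pair; the Gaussian density is maximal at the mean."""
    def log_ratio(x, model):
        mean, cov = model
        g = multivariate_normal(mean=mean, cov=cov)
        return g.logpdf(x) - g.logpdf(mean)   # <= 0; 0 when x is prototypical
    return log_ratio(traj_feat, traj_model) + log_ratio(land_feat, land_model)

# Invented 3-D "visual features" and concept models for "red box" / "big box".
red_box_model = (np.array([0.9, 0.1, 0.1]), np.diag([0.05, 0.05, 0.05]))
big_box_model = (np.array([0.5, 0.5, 0.8]), np.diag([0.10, 0.10, 0.10]))
observed_trajector = np.array([0.2, 0.7, 0.3])   # poor match to "red box"
observed_landmark = np.array([0.5, 0.5, 0.75])   # close to "big box"
print(object_confidence(observed_trajector, observed_landmark,
                        red_box_model, big_box_model))
```

In this toy case the mismatched trajector dominates the sum and yields a strongly negative C_O, which is the behavior illustrated in the example of Fig. 4(a) below.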


For example, Fig. 4(a) illustrates a physical situation in which a low object confidence measure was obtained for the input OOD speech "There is a red box." Here, the speech understanding process recognized the input speech as the word sequence "Raise red box." An action in which the robot raises object 1 was then outputted (solid line), because no "red box" existed and object 1, which has the same color, was selected as the trajector. However, the visual features of object 1 were very different from those of "red box," resulting in a low value of C_O.

3) Motion Confidence Measure: The motion confidence measure C_M is used to evaluate the reliability with which the outputted trajectory ξ̂ is referred to by ŵ_M. It is calculated by dividing the likelihood of ξ̂ by a baseline obtained from the likelihood of the most typical trajectory ξ̃ for the motion model of ŵ_M. In this work, ξ̃ is written as

\[
\tilde{\xi} = \arg\max_{\xi,\, o^{traj}_p} P(\xi \mid o^{traj}_p, o_{\hat{l},p}, \hat{w}_M; L),
\tag{6}
\]

where o^{traj}_p denotes the initial position of the trajector; ξ̃ is obtained by treating o^{traj}_p as a variable. The likelihood of ξ̃ is the maximum output probability of the HMMs. In this work, we used the method proposed by [9] to obtain this probability. Differently from ξ̂, the trajector's initial position of ξ̃ is unconstrained, so the likelihood of ξ̃ has a greater value than that of ξ̂. The motion confidence measure C_M is then written as

\[
C_M(\hat{\xi}, \hat{w}_M; L) = \log \frac{P(\hat{\xi} \mid o_{\hat{t},p}, o_{\hat{l},p}, \hat{w}_M; L)}
{\max_{\xi,\, o^{traj}_p} P(\xi \mid o^{traj}_p, o_{\hat{l},p}, \hat{w}_M; L)},
\tag{7}
\]

where P(ξ̂ | o_{t̂,p}, o_{l̂,p}, ŵ_M; L) denotes the likelihood of ξ̂ and is given by B_M.

For example, Fig. 4(b) shows a physical situation in which a low motion confidence measure was obtained for the input OOD speech "Bring me that Chutotoro." Here, the speech understanding process recognized the input speech as the word sequence "Move-away Chutotoro." An action in which the robot moves object 1 away from object 2 was then outputted (solid line). However, the typical trajectory of "move-away" is for one object to move away from another object that is close to it (dotted line). The trajectory of the outputted action was very different from this typical trajectory, resulting in a low value of C_M.

Fig. 4. Example cases in which the object and motion confidence measures are low. (a) Object confidence measure: input speech "There is a red box.", recognized as [Raise red box]. (b) Motion confidence measure: input speech "Bring me that Chutotoro.", recognized as [Move-away Chutotoro].

4) Optimization of Weights: We now consider the problem of estimating the weights Θ in Eq. (3). The ith training sample is given as the pair of an input signal (s_i, O_i, q_i) and a teaching signal d_i. Thus, the training set T_N contains N samples:

\[
T_N = \{(s_i, O_i, q_i, d_i) \mid i = 1, \ldots, N\},
\tag{8}
\]

where d_i is 0 or 1, representing OOD speech or RD speech, respectively. The likelihood function is written as

\[
P(\mathbf{d} \mid \Theta) = \prod_{i=1}^{N} \big(C_{MS}(s_i, O_i, q_i)\big)^{d_i}\,\big(1 - C_{MS}(s_i, O_i, q_i)\big)^{1 - d_i},
\tag{9}
\]

where d = (d_1, ..., d_N). Θ is optimized by maximum-likelihood estimation of Eq. (9) using Fisher's scoring algorithm [10].
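As a reference for this optimization step, here is a compact NumPy sketch of maximum-likelihood estimation of Θ by iteratively reweighted least squares, which coincides with Fisher's scoring for logistic regression. The synthetic training data, iteration count, and ridge term are illustrative; the original system is trained on the speech-scene pairs described in Section IV.

```python
import numpy as np

def fit_msc_weights(C, d, n_iter=25, ridge=1e-6):
    """Maximum-likelihood estimation of Theta = (theta_0, ..., theta_3) for
    Eq. (9) by Fisher scoring / iteratively reweighted least squares.
    C is an (N, 3) array of (C_S, C_O, C_M) values and d an (N,) array of
    0/1 labels (0 = OOD, 1 = RD)."""
    X = np.hstack([np.ones((C.shape[0], 1)), C])      # prepend a bias column
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = np.clip(X @ theta, -30.0, 30.0)            # guard against overflow
        p = 1.0 / (1.0 + np.exp(-z))                   # current MSC values
        w = p * (1.0 - p)                              # Fisher information weights
        H = X.T @ (X * w[:, None]) + ridge * np.eye(X.shape[1])
        theta = theta + np.linalg.solve(H, X.T @ (d - p))
    return theta

# Synthetic training data: RD samples tend to have higher confidence values.
rng = np.random.default_rng(0)
C_rd = rng.normal(loc=[-100.0, -2.0, -1.0], scale=[60.0, 1.5, 1.0], size=(200, 3))
C_ood = rng.normal(loc=[-250.0, -5.0, -3.0], scale=[80.0, 2.0, 1.5], size=(200, 3))
C = np.vstack([C_rd, C_ood])
d = np.concatenate([np.ones(200), np.zeros(200)])
print(fit_msc_weights(C, d))
```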
IV. EXPERIMENTS

A. Experimental Setting

We first evaluated the performance of MSC. This evaluation was performed as an off-line experiment by simulation, in which gaze tracking was not used and speech was extracted manually, without using the GMM-based VAD, in order to avoid its detection errors. The weighting set Θ and the threshold δ were also optimized in this experiment. We then performed an on-line experiment with the robot to evaluate the entire system.

The robot lexicon L used in both experiments has 50 words, including 31 nouns and adjectives representing 40 objects and 19 verbs representing 10 kinds of motions. L also includes five Japanese postpositions. Different from the other words in L, none of the postpositions is associated with a concept. By using the postpositions, users can speak a command in a more natural way. The parameter set Γ in Eq. (1) was γ_1 = 1.00, γ_2 = 0.75, γ_3 = 1.03, γ_4 = 0.56, and γ_5 = 1.88.

B. Off-line Experiment by Simulation

1) Setting: The off-line experiment was conducted under both clean and noisy conditions using a set of pairs of speech s and scene information (O, q). Figure 4(a) shows an example of scene information. The yellow box on object 3 represents the behavioral context q, which means that object 3 was manipulated most recently. We prepared 160 different such scene files, each of which included three objects on average. To pair with the scene files, we also prepared 160 different speech samples and recorded them under both clean and noisy conditions as follows.

Clean condition: We recorded the speech in a soundproof room without noise. A subject sat on a chair one meter from the SANKEN CS-3e directional microphone and read out a text in Japanese.

Noisy condition: We added dining hall noise, having a level from 50 to 52 dBA, to each speech recording gathered under the clean condition.

We gathered the speech recordings from 16 subjects, including 8 males and 8 females. All subjects were native Japanese speakers. As a result, 16 sets of speech-scene pairs were obtained, each of which included 320 pairs (160 for the clean and 160 for the noisy condition). These pairs were manually labeled as either RD or OOD and then inputted into the system.


Fig. 5. Average precision-recall curves under the clean condition.

For each pair, speech understanding was first performed, and then the MSC measure was calculated. (In a separate experiment evaluating speech understanding on the RD speech-scene pairs, 99.8% and 96.3% of RD speech was correctly interpreted under the clean and noisy conditions, respectively.) During the speech understanding experiment, a Gaussian mixture model based noise suppression method [11] was applied, and ATRASR [12] was used for phoneme- and word-sequence recognition. With ATRASR, phoneme recognition accuracies of 83% and 67% were obtained under the clean and noisy conditions, respectively.

The evaluation under the clean condition was performed by leave-one-out cross-validation: 15 subjects' data were used as a training set to learn the weighting Θ in Eq. (3), the remaining subject's data were used as a test set, and this was repeated 16 times. The values of the weighting Θ̂ learned using all 16 subjects' data were used for the evaluation under the noisy condition, where all noisy speech-scene pairs collected from the 16 subjects were treated as a test set. For comparison, four cases were evaluated for RD speech detection, using (1) the speech confidence measure only, (2) the speech and object confidence measures, (3) the speech and motion confidence measures, and (4) the MSC measure.

2) Results: The average precision-recall curves over the 16 subjects under the clean and noisy conditions are shown in Fig. 5 and Fig. 6, respectively. The performance of each of the four cases is shown as "Speech," "Speech+Object," "Speech+Motion," and "MSC." From the figures, we found that (1) MSC outperforms all the others under both clean and noisy conditions and (2) both the object and motion confidence measures helped to improve performance. The average maximum F-measures under the clean condition are MSC: 99%, Speech+Object: 97%, Speech+Motion: 97%, and Speech: 94%; those for the noisy condition are MSC: 95%, Speech+Object: 92%, Speech+Motion: 93%, and Speech: 83%. In comparison with the speech confidence measure only, MSC achieved an absolute increase of 5% and 12% under the clean and noisy conditions, respectively, indicating that MSC was particularly effective under the noisy condition.

Fig. 6. Average precision-recall curves under the noisy condition.

We also performed paired t-tests. Under the clean condition, there were statistically significant differences between (1) Speech and Speech+Object (p < 0.01), (2) Speech and Speech+Motion (p < 0.05), and (3) Speech and MSC (p < 0.01). Under the noisy condition, there were statistically significant differences (p < 0.01) between Speech and all other cases. Here, p denotes the probability value obtained from the t-test.

The values of the optimized Θ̂ were θ̂_0 = 5.9, θ̂_1 = 0.00011, θ̂_2 = 0.053, and θ̂_3 = 0.74. The threshold δ for domain classification was set to δ̂ = 0.79, which maximized the average F-measure of MSC under the clean condition. This means that a piece of speech with an MSC measure of more than 0.79 is treated as RD speech and the robot executes an action according to this speech. The above Θ̂ and δ̂ were used in the on-line experiment.
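For reference, the following sketch shows one way to pick the classification threshold δ by maximizing the F-measure of MSC scores on labeled data, as was done for δ̂ = 0.79 above. The scores and labels are synthetic, and the simple grid search is an illustrative stand-in for whatever selection procedure was actually used.

```python
import numpy as np

def f_measure(scores, labels, delta):
    """F-measure when speech with an MSC score > delta is classified as RD."""
    pred = scores > delta
    tp = np.sum(pred & (labels == 1))
    if tp == 0:
        return 0.0
    precision = tp / np.sum(pred)
    recall = tp / np.sum(labels == 1)
    return 2 * precision * recall / (precision + recall)

def select_threshold(scores, labels, grid=np.linspace(0.0, 1.0, 101)):
    """Return the threshold delta on the grid that maximizes the F-measure."""
    return max(grid, key=lambda delta: f_measure(scores, labels, delta))

# Synthetic MSC scores: RD speech tends to score high, OOD speech low.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.beta(8, 2, 300), rng.beta(2, 8, 300)])
labels = np.concatenate([np.ones(300, dtype=int), np.zeros(300, dtype=int)])
print(select_threshold(scores, labels))
```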
C. On-line Experiment Using the Robot

1) Setting: In the on-line experiment, the entire system was evaluated using the robot. In each session of the experiment, two subjects, an "operator" and a "ministrant," sat in front of the robot at a distance of about one meter from the microphone. The operator ordered the robot to manipulate objects in Japanese. He was also allowed to chat freely with the ministrant. The threshold η of gaze tracking was set to 0.5, which means that if the proportion of the operator's gaze at the robot during input speech was higher than 50%, the robot judged that the speech was made while the operator was looking at it.

We conducted a total of four sessions of this experiment using four pairs of subjects, and each session lasted about 50 minutes. All subjects were adult males. There was constant ambient noise of about 48 dBA from the robot's power module in all sessions. For comparison, five cases were evaluated for RD speech detection, using (1) gaze only, (2) gaze and the speech confidence measure, (3) gaze and the speech and object confidence measures, (4) gaze and the speech and motion confidence measures, and (5) gaze and the MSC measure.

2) Results: During the experiment, a total of 983 pieces of speech were made, each of which was manually labeled as either RD or OOD. There were 708 pieces of speech made while the operator was looking at the robot, including 155 pieces of RD and 553 pieces of OOD speech. This means that, in addition to the RD speech, a lot of OOD speech was made while the subjects were looking at the robot.

The average recall and precision rates for each of the above five cases are shown in Fig. 7 and Fig. 8, respectively.


Fig. 7. Average recall rates obtained in the on-line experiment.

By using gaze only, an average recall rate of 94% was obtained (see the "Gaze" column in Fig. 7), which means that almost all of the RD speech was made while the operator was looking at the robot. The recall rate dropped to 90% when gaze was combined with the speech confidence measure, which means that some RD speech was rejected erroneously by the speech confidence measure. However, by integrating gaze with MSC, the recall rate returned to 94% because the mistakenly rejected RD speech was correctly detected by MSC. In Fig. 8, the average precision rate using gaze only was 22%. By using MSC, however, the instances of OOD speech were correctly rejected, resulting in a high precision rate of 96%, which means that the proposed method is particularly effective in situations where users make a lot of OOD speech while looking at a robot.

Fig. 8. Average precision rates obtained in the on-line experiment.

V. DISCUSSION

This work can be extended in many ways, and we mention some of them in this section. Here, we evaluated the MSC measure in situations where users usually order the robot while looking at it. In other situations, however, users might order a robot without looking at it. For example, in an object manipulation task where a robot manipulates objects together with a user, the user may give an order while looking at the object that she/he is manipulating instead of looking at the robot itself. For such tasks, the MSC measure should be used separately, without integrating it with gaze. Therefore, a method that automatically determines whether to use the gaze information according to the task and user situation should be implemented.

Moreover, aside from the object manipulation task, the MSC measure can also be extended to multi-task dialogs, including both physically grounded and ungrounded tasks. In physically ungrounded tasks, users' utterances express no immediate physical objects or motions. For such dialogs, a method that automatically switches between the speech confidence and MSC measures should be implemented. In future work, we will evaluate the MSC measure for various dialog tasks.

In addition, MSC can be used to develop an advanced interface for human-robot interaction. The RD speech probability represented by MSC can be used to provide feedback such as the utterance "Did you speak to me?", and this feedback should be given in situations where the MSC measure has an intermediate value. Moreover, each of the object and motion confidence measures can be used separately. For example, if the object confidence measures for all objects in the robot's vision were particularly low, the robot should actively explore its surroundings to search for a feasible object, and an utterance such as "I cannot do that" should be made in situations where the motion confidence measure is particularly low.

VI. CONCLUSION

This paper described a robot-directed (RD) speech detection method that enables a robot to distinguish the speech to which it should respond in an object manipulation task by combining speech, visual, and behavioral context with human gaze. The novel feature of this method is the introduction of the MSC measure, which evaluates the feasibility of the action that the robot is going to execute according to the user's speech under the current physical situation.
The experimental results show that the method detects RD speech with high recall and precision and thus provides an essential function for natural and safe human-robot interaction. Finally, we note that the basic idea adopted in the method is applicable to a broad range of human-robot dialog tasks.

REFERENCES

[1] H. Asoh, T. Matsui, J. Fry, F. Asano, and S. Hayamizu, "A spoken dialog system for a mobile robot," in Proc. Eurospeech, 1999, pp. 1139–1142.
[2] S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. A. Fink, and G. Sagerer, "Providing the basis for human-robot-interaction: A multi-modal attention system for a mobile robot," in Proc. ICMI, 2003, pp. 28–35.
[3] T. Yonezawa, H. Yamazoe, A. Utsumi, and S. Abe, "Evaluating crossmodal awareness of daily-partner robot to user's behaviors with gaze and utterance detection," in Proc. CASEMANS, 2009, pp. 1–8.
[4] T. Takiguchi, A. Sako, T. Yamagata, and Y. Ariki, "System request utterance detection based on acoustic and linguistic features," Speech Recognition, Technologies and Applications, pp. 539–550, 2008.
[5] N. Iwahashi, "Robots that learn language: A developmental approach to situated human-robot conversations," Human-Robot Interaction, pp. 95–118, 2007.
[6] A. Lee, K. Nakamura, R. Nishimura, H. Saruwatari, and K. Shikano, "Noise robust real world spoken dialogue system using GMM based rejection of unintended inputs," in Proc. Interspeech, 2004, pp. 173–176.
[7] D. W. Hosmer and S. Lemeshow, Applied Logistic Regression. Wiley-Interscience, 2009.
[8] H. Jiang, "Confidence measures for speech recognition: A survey," Speech Communication, vol. 45, pp. 455–470, 2005.
[9] K. Tokuda, T. Kobayashi, and S. Imai, "Speech parameter generation from HMM using dynamic features," in Proc. ICASSP, 1995, pp. 660–663.
[10] T. Kurita, "Iterative weighted least squares algorithms for neural network classifiers," in Proc. ALT, 1992.
[11] M. Fujimoto and S. Nakamura, "Sequential non-stationary noise tracking using particle filtering with switching dynamical system," in Proc. ICASSP, vol. 2, 2006, pp. 769–772.
[12] S. Nakamura, K. Markov, H. Nakaiwa, G. Kikui, H. Kawai, T. Jitsuhiro, J. Zhang, H. Yamamoto, E. Sumita, and S. Yamamoto, "The ATR multilingual speech-to-speech translation system," IEEE Trans. ASLP, vol. 14, no. 2, pp. 365–376, 2006.