Robot Asimo can understand three voices at once


By Colin Barras

(Image: IEEE)

Advanced humanoid robot Asimo just got a new superpower – it can understand three humans shouting at once. For now, the modified Asimo’s new ability is being used to judge rock-paper-scissors contests, in which three people call out their choices simultaneously. But the number of voices and the complexity of the sentences the software can deal with should grow in future.

Hiroshi Okuno at Kyoto University and Kazuhiro Nakadai at the Honda Research Institute in Saitama, both in Japan, designed the new software, which they call HARK.

HARK uses an array of eight microphones to work out where each voice is coming from and isolate it from other sound sources. The software then works out how reliably it has extracted an individual voice before passing it on to speech-recognition software to decode.

That quality-control step is important, because the other voices are likely to confuse speech-recognition software. Any parts of the sound file that contain a lot of background noise across a range of frequencies are automatically ignored when the patched-up recording of each voice is passed on to a speech-recognition system.

The HARK system actually goes beyond normal human listening capabilities, Okuno told New Scientist. “It can listen to several things at once, and not just focus on a particular single sound source.”

While focusing on a single voice among many is known as the “cocktail party effect”, Okuno calls the ability to focus on multiple voices at once the “Prince Shotoku effect”. “According to Japanese legend, Prince Shotoku listened to 10 people’s petitions at the same time,” he says.

Although the HARK software can’t yet comprehend 10 voices at once, Okuno and Nakadai say that, installed on Honda’s Asimo robot, it can follow three players calling simultaneously with 70 to 80% accuracy. The eight microphones are placed around Asimo’s face and body, which helps it to accurately detect and isolate simultaneous voices.
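The quality-control step described above – ignoring parts of the sound that are swamped by background noise before handing the rest to a speech recogniser – can be sketched as a simple time-frequency mask. The snippet below is a minimal illustration in Python/NumPy, not HARK’s actual implementation: the `reliability_mask` helper, the 0 dB threshold, and the toy spectrograms are all assumptions made for the example.

```python
import numpy as np

def reliability_mask(voice_spec, noise_spec, snr_threshold_db=0.0):
    """Mark a time-frequency cell as reliable when the separated voice's
    energy exceeds the estimated background noise there; unreliable
    cells would be dropped before recognition."""
    eps = 1e-12  # avoid log of zero
    snr_db = 10.0 * np.log10((voice_spec + eps) / (noise_spec + eps))
    return snr_db > snr_threshold_db

# Toy spectrograms: 4 time frames x 3 frequency bins of energy values.
voice = np.array([[9.0, 0.5, 4.0],
                  [8.0, 0.2, 5.0],
                  [7.0, 0.1, 6.0],
                  [9.0, 0.3, 7.0]])
noise = np.full_like(voice, 1.0)  # assumed flat noise floor

mask = reliability_mask(voice, noise)
# Keep only the cells where the voice dominates the noise floor.
kept = np.where(mask, voice, 0.0)
```

Here the middle frequency bin is always buried in noise, so the mask discards it while the other two bins pass through untouched.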
“The number of sound sources and their directions are not given to the system in advance,” says Nakadai.

Guy Brown at the University of Sheffield, UK, is impressed with the work, although he points out that it is largely built from existing elements used to process sound, such as using an array of microphones to localise a sound, and using automated software to block out difficult-to-interpret parts of a voice recording. “The main achievement has been to embed this technology in a robot and to get it all working in a real-time, interactive manner,” Brown says.

Rock-paper-scissors uses a small vocabulary, which makes the task easier. “Clearly there’s a long way to go before we can match the performance of human listeners in ‘cocktail party’ situations,” he says. Indeed, when Okuno and Nakadai used their software to follow several complicated sentences at once, as three people shouted out restaurant orders, it could identify only 30 to 40% of what was said.

Alexander Gutschalk at the Ruprecht-Karl University of Heidelberg in Germany has just conducted one of the first studies of brain activity during the cocktail party effect, and says future collaboration between neuroscientists and roboticists could make robots better party conversationalists.

Okuno and Nakadai presented their work at the 2008 IEEE International Conference on Robotics and Automation in Pasadena, California, last month.