Voice Dialogue in Different Environments

When voice is recorded with a microphone in an enclosed space such as a room, the sound is reflected off the walls and ceiling, and each reflection reaches the microphone after a different travel time. As a result, the voice observed by the microphone sounds blurred compared with the original. This blurring effect is referred to as reverberation, and it degrades the performance of speech recognition systems. We employ de-reverberation methods in our research to mitigate the blurring effect and realize a robust voice dialogue system.

Reflected speech, reflected noise from different sources, and conversations among more than two people all pose an “attention” problem for an intelligent system: when multiple people are speaking, or when various noises including reflected sounds are present, the system has difficulty deciding which sound source to attend to. Left to itself, the system attends to whatever sound source it detects, regardless of whether it is a voice or just noise. Accurately identifying which user to attend to is not an easy task, nor is it easy to link a voice command to the right user.

It is essential to design an intelligent system that can recognize and communicate with the intended user. HRI-JP is investigating a multi-modal approach in which information processing draws on both the visual and the hearing senses rather than on sound (or speech) alone.

When a person speaks in a room, a microphone receives not only the direct speech but also copies of it reflected by the walls and the ceiling. Because each reflection takes a different path to the microphone, each arrives with a different delay. Since the microphone captures the direct speech and all of the reflections together, the captured sound is smeared. This phenomenon is called reverberation.
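The smearing described above can be illustrated with a minimal sketch in which the microphone signal is modeled as the direct sound plus delayed, attenuated copies of it. This is only a toy model under stated assumptions: the function name `reverberate`, the delays, and the gains are hypothetical illustration values, not measurements or part of any de-reverberation method used in this research.

```python
def reverberate(direct, reflections):
    """Return the signal a microphone would observe under a toy model:
    the direct sound plus delayed, attenuated copies of it.

    direct      -- list of samples of the direct-path signal
    reflections -- list of (delay_in_samples, gain) pairs, one per
                   reflected path (hypothetical values for illustration)
    """
    max_delay = max((delay for delay, _ in reflections), default=0)
    observed = [0.0] * (len(direct) + max_delay)

    # Direct path: no delay, unit gain.
    for n, sample in enumerate(direct):
        observed[n] += sample

    # Each reflection travels a longer path, so it arrives later and weaker.
    for delay, gain in reflections:
        for n, sample in enumerate(direct):
            observed[n + delay] += gain * sample

    return observed


# A single click followed by silence.
click = [1.0, 0.0, 0.0, 0.0]
# Assumed wall and ceiling reflections: (delay in samples, gain).
paths = [(2, 0.5), (3, 0.25)]
smeared = reverberate(click, paths)
print(smeared)  # [1.0, 0.0, 0.5, 0.25, 0.0, 0.0, 0.0]
```

The single click is spread over several samples in the output, which is the blurring that a speech recognizer has to cope with.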

HARK Cloud Service

Research into practical applications of microphone array processing has been advancing as more consumers install smart speakers in their homes. HARK performs sound localization and separation, but running a complete system with these capabilities on a single computer drastically increases cost and power consumption. For user convenience, we are designing and developing an online cloud service, "HARK SaaS."

HARK SaaS allows users to take advantage of the capabilities of HARK via the internet. One application is in automobiles. The interior of a car is filled with undesirable noises generated by the engine, the audio system, the wind, and the road. In this environment, a system must accurately hear and analyze speech commands in order to execute a given task such as searching a map or playing music. However, mounting a high-performance computer in a car is inefficient because of its power consumption.

In these sorts of environments, cloud-based services such as HARK SaaS are desirable. When a small microphone array is installed on the dashboard and the recorded conversation is transmitted to HARK SaaS, the service can identify who is or is not speaking and who is dominant in the conversation.
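As a rough illustration of what "dominant in the conversation" could mean, the sketch below sums speaking time per speaker and picks the speaker with the largest total. It assumes the separated audio has already been turned into (speaker, start, end) segments; that segment format, the speaker labels, and the function names are hypothetical and do not represent the actual output or API of HARK SaaS.

```python
from collections import defaultdict


def talk_time(segments):
    """Sum speaking time in seconds per speaker.

    segments -- iterable of (speaker, start_sec, end_sec) tuples
                (an assumed format, not the HARK SaaS output format)
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)


def dominant_speaker(segments):
    """Return the speaker with the largest total speaking time, or None."""
    totals = talk_time(segments)
    if not totals:
        return None
    return max(totals, key=totals.get)


# Hypothetical segments from an in-car conversation.
segments = [
    ("driver", 0.0, 4.0),
    ("passenger", 4.0, 5.0),
    ("driver", 6.0, 8.0),
]
totals = talk_time(segments)
leader = dominant_speaker(segments)
print(totals, leader)
```

Total speaking time is only one possible measure of dominance; the number of turns or interruptions could be counted in the same way.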

A microphone array located at the center of the table simultaneously records the conversational voices of multiple people. The recorded data are sent to the HARK cloud service and analyzed in real time. The full functionality of HARK is therefore available even when large computational power is unavailable on the local side.