Thursday, August 23, 2012

In which I bark orders at a robot, and it actually listens!

So, here's my latest adventure with HouseBot. Since I have the Kinect, and since the Kinect has an array of directional microphones in it, I decided to do a little experimenting with the Microsoft Speech SDK. In an earlier iteration of HouseBot (well, same platform, much less powerful computer, much more finicky drive wheels... lots of improvements since then), I had played around with this too, but with much less success. The main reason was that I was using a microphone plugged directly into the computer instead of the Kinect microphones. With that microphone, you basically had to be right on top of it (and sometimes shout) to get it to respond. With the Kinect, since it's designed for gaming (et al.), you can be across the room and the mic will still pick you up.

So, I think I'll just jump right to the finished result, and then take a bit of a dive into how it's done. Here's a video of HouseBot responding to voice commands.

As you can see, she's still not going to get me a beer. *sigh*... Science, such a harsh mistress you are...

So how does it work?
In order to do something like this, you need to do a few things:

1. Build a robot
2. Build a vocabulary of the expected commands
3. Stream audio from the Kinect to the voice recognition code
4. On recognizing a command, take some action

Keep in mind that there's a lot of support code for the robot that I'm not going to show here. It'll be fairly obvious where I'm calling into the robot, but the basic idea is that the robot has various tools attached to it (objects that implement ITool), such as the light, the turret, and the speech generator. The robot itself is a platform (implements IMobilityPlatform) and a sensor provider (implements ISensorProvider). So, if you see something that says "UseTool", that method call is asking the tool to perform its core action or some alternate action, and if you tell a platform to turn or move forward, that's the robot itself. Some tools, such as the turret or the voice generation tool, also have specialized actions -- such as Say() on the voice tool.
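The interfaces themselves aren't shown in the post, but based on the description above, they might look roughly like this sketch (aside from ITool, IMobilityPlatform, ISensorProvider, UseTool, and Say(), every member name here is a guess):

```csharp
using System.Collections.Generic;

// Rough sketch of the robot interfaces described above. Member names
// other than UseTool and Say() are assumptions, not the real code.
public interface ITool
{
    string Name { get; }
    void UseTool();               // perform the tool's core action
    void UseTool(string action);  // hypothetical: an alternate action
}

public interface IMobilityPlatform
{
    void MoveForward();           // hypothetical signatures
    void Stop();
    void Turn(double degrees);
}

public interface ISensorProvider
{
    IEnumerable<object> Sensors { get; }  // hypothetical
}

// A tool with a specialized action, like the voice generation tool.
public interface IVoiceTool : ITool
{
    void Say(string text);
}
```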

What it looks like when you are consuming all this is that first, we'll do a little setup:
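(The setup snippet that was embedded here is missing from this copy of the post. A minimal sketch of what it likely did, using the Kinect v1 SDK: find a connected sensor, start it, and hand it to the recognizer. Recognizer and StringSensor are types from the HouseBot codebase, so their constructors and method names here are assumptions.)

```csharp
using System.Linq;
using Microsoft.Kinect;

// Hypothetical setup, reconstructed from the surrounding text.
KinectSensor sensor = KinectSensor.KinectSensors
    .FirstOrDefault(s => s.Status == KinectStatus.Connected);
sensor.Start();

// StringSensor and Recognizer are HouseBot types; shapes assumed.
var speechSensor = new StringSensor("Speech");
var recognizer = new Recognizer(sensor, speechSensor);
recognizer.Start();  // builds the vocabulary and begins listening
```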

And then we'll just wait for voice commands. A note: the recognizer object is the object actually doing the speech recognition, but, to make things easier, it fills in a value on a StringSensor object, which is a kind of base sensor type I use on the robotics platform to easily represent and sense things that are string values (RFID sensor values, recognized speech, values coming from an IR remote receiver, things like that). This is the code that interprets commands:
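(The command-interpretation snippet is also missing from this copy. Based on the description -- the recognizer writes the recognized phrase into the StringSensor, and commands drive the platform and tools -- it probably resembled something like this; the event name, the command phrases, and the tool variables are all assumptions:)

```csharp
// Hypothetical command loop: react when the StringSensor picks up a
// recognized phrase. Names and phrases are guesses, not the real code.
speechSensor.ValueChanged += (s, e) =>
{
    switch (speechSensor.Value)
    {
        case "robot forward":
            platform.MoveForward();
            break;
        case "robot stop":
            platform.Stop();
            break;
        case "robot light":
            lightTool.UseTool();      // ask the light for its core action
            break;
        case "robot speak":
            voiceTool.Say("Hello");   // specialized action on the voice tool
            break;
    }
};
```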

Under the covers, there's a little setup going on. In the recognizer object, we're building the vocabulary, getting a reference to the Kinect audio stream, and setting up the speech recognition engine provided with the Kinect SDK, like so:
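(This snippet is missing from this copy as well. The Kinect v1 SDK's speech sample does essentially what's described, so the recognizer internals were probably close to the following; the command phrases are examples, not the post's actual vocabulary:)

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.Speech.AudioFormat;
using Microsoft.Speech.Recognition;

// Reconstructed along the lines of the Kinect for Windows SDK v1
// speech sample. First, find the Kinect-specific acoustic model.
RecognizerInfo info = SpeechRecognitionEngine.InstalledRecognizers()
    .FirstOrDefault(r =>
    {
        string value;
        r.AdditionalInfo.TryGetValue("Kinect", out value);
        return "True".Equals(value, StringComparison.OrdinalIgnoreCase)
            && "en-US".Equals(r.Culture.Name, StringComparison.OrdinalIgnoreCase);
    });

var engine = new SpeechRecognitionEngine(info.Id);

// Build the vocabulary of expected commands (example phrases only).
var commands = new Choices();
commands.Add("robot forward");
commands.Add("robot stop");
commands.Add("robot light");

var builder = new GrammarBuilder { Culture = info.Culture };
builder.Append(commands);
engine.LoadGrammar(new Grammar(builder));

engine.SpeechRecognized += OnSpeechRecognized;

// Stream audio from the Kinect microphone array into the engine.
Stream audioStream = sensor.AudioSource.Start();
engine.SetInputToAudioStream(
    audioStream,
    new SpeechAudioFormatInfo(EncodingFormat.Pcm, 16000, 16, 1, 32000, 2, null));
engine.RecognizeAsync(RecognizeMode.Multiple);
```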

So that's about it. One of the hurdles I had to overcome was making sure I had references to the correct SDK assemblies. If I set the reference to the Microsoft.Kinect assembly in the IDE, things didn't work correctly; I had to look at the csproj file in the Developer Toolkit example and manually edit my csproj to match. Once that was figured out, it was smooth sailing. One last tip: play with the recognition threshold if you get spurious recognitions that you don't want. At one point I had it set to 0.9 -- less than 90% certainty, and it won't respond -- and that actually turned out to be a pretty good setting.
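To put a number on that last tip: the threshold is just a comparison against the Confidence value in the recognized-speech handler. A sketch (the StringSensor assignment is an assumption about the HouseBot code; the event args type is from Microsoft.Speech):

```csharp
using Microsoft.Speech.Recognition;

// Ignore low-confidence recognitions; 0.9 worked well in practice.
private const double ConfidenceThreshold = 0.9;

private void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    if (e.Result.Confidence < ConfidenceThreshold)
        return;  // spurious recognition -- drop it

    // Hand the recognized phrase to the rest of the robot via the
    // StringSensor (assumed property name).
    speechSensor.Value = e.Result.Text;
}
```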