of various parts of the image. Other robots use a laser
system in which a beam is shone over an object or area;
a camera then tracks the laser beam, translating the
deformations of the beam into a three-dimensional rendering.

The STAIR robot uses both of these processes in a
combination that allows for extremely accurate
identification of objects. The system is built on the use of
context. For example, if a human is presented with a
traditional number pad whose "4" has been scratched off,
that person can easily figure out where the "4" should be
from the positions of the other digits. The robot uses a
similar method, except that the numbers are often intact;
the robot simply cannot recognize them reliably with its
cameras alone.

A three-dimensional sensor can provide the basic layout
of a panel of buttons, and the camera fills in the
characters that it recognizes. The robot's software must
then reconcile the two data sets, using a map of what
typical panels look like to create a complete view. Once
the robot knows where the desired button is, the arm must
press it without hitting any other object along the way.
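The keypad example above can be sketched in code. This is a toy illustration, not the STAIR software: it assumes buttons have already been localized to grid cells, and it uses a standard phone keypad as the "map of what typical panels look like" to fill in a label the camera failed to read.

```python
# Illustrative sketch (not the STAIR code): inferring a missing
# keypad label from the positions of the recognized ones.

# Standard phone keypad layout used as the prior "map":
# (row, col) -> expected label.
KEYPAD = {
    (0, 0): "1", (0, 1): "2", (0, 2): "3",
    (1, 0): "4", (1, 1): "5", (1, 2): "6",
    (2, 0): "7", (2, 1): "8", (2, 2): "9",
    (3, 1): "0",
}

def infer_missing_labels(recognized):
    """recognized: {(row, col): label} for buttons the camera could read.
    Returns a complete {(row, col): label} map, filling gaps from context."""
    complete = dict(recognized)
    for cell, expected in KEYPAD.items():
        if cell not in complete:
            complete[cell] = expected  # fill in from the known layout
    return complete

# The camera read every digit except the scratched-off "4":
seen = {cell: label for cell, label in KEYPAD.items() if label != "4"}
full = infer_missing_labels(seen)
```

The same idea scales up: any panel whose layout matches a known template lets the robot assign identities to buttons its camera cannot read directly.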

This next step relies on the three-dimensional view of the
environment that the robot has gathered. The arm is given
precise movements to avoid objects that the scanner has
identified. When the robot finally contacts the button, it
has completed its objective.
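A crude way to see what "avoiding objects the scanner has identified" involves is to test a candidate arm path against the scanned points. This is a toy stand-in, not the robot's actual motion planner: it checks only a straight-line approach, sampled at discrete steps, against a point cloud with a fixed clearance margin.

```python
import math

def path_is_clear(start, goal, obstacle_points, clearance=0.05, steps=50):
    """Check a straight-line move from `start` to `goal` (3D tuples,
    meters) against scanned obstacle points, keeping `clearance`
    meters of margin. Toy sketch; a real planner searches over many
    candidate paths, not just the straight line."""
    for i in range(steps + 1):
        t = i / steps
        p = tuple(s + t * (g - s) for s, g in zip(start, goal))
        for q in obstacle_points:
            if math.dist(p, q) < clearance:
                return False  # this path would collide
    return True
```

If the straight-line path is blocked, a planner would propose detour waypoints and re-check each candidate segment the same way.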

How It Works

The way the STAIR robot recognizes objects in front of it
using the context method was researched
by Morgan Quigley, Siddharth Batra, Stephen Gould, Ellen
Klingbeil, Quoc Le, Ashley Wellman, and Andrew Y. Ng in
their paper entitled "High-Accuracy 3D Sensing for Mobile
Manipulation: Improving Object Detection and Door
Opening." I had the opportunity to speak with Ellen at the
Artificial Intelligence Lab at Stanford, where the project is
developed. She explained much about the complex vision
system that the robot uses. While the robot has more than
five sensors for interpreting the environment around
it, the 3D scan uses a laser and a single camera. Through a
process called the "laser-line triangulation scheme," the
camera notes the deformations in the laser as it passes over
objects of varying depths.
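The geometry behind laser-line triangulation can be written down compactly. The sketch below is a simplified model, not STAIR's calibration: it assumes a pinhole camera at the origin looking along +z, with the laser source offset by a known baseline along +x and its beam tilted back toward the camera axis by a known angle. Intersecting the camera ray with the laser ray gives depth from the pixel where the line appears.

```python
import math

def depth_from_laser_pixel(u_px, f_px, baseline_m, laser_angle_rad):
    """Depth (meters) of the surface point where the laser line
    appears at horizontal pixel offset u_px from the image center.

    Simplified geometry (illustrative assumption, not STAIR's setup):
      camera ray:  x = z * (u_px / f_px)
      laser ray:   x = baseline_m - z * tan(laser_angle_rad)
    Setting them equal and solving for z:
      z = baseline_m / (u_px / f_px + tan(laser_angle_rad))
    """
    return baseline_m / (u_px / f_px + math.tan(laser_angle_rad))

# Nearer surfaces push the laser line farther from the image center,
# so larger pixel offsets map to smaller depths:
near = depth_from_laser_pixel(100, 500, 0.1, 0.3)
far = depth_from_laser_pixel(50, 500, 0.1, 0.3)
```

Each camera frame yields one such depth profile along the laser line; sweeping the laser across the scene stacks these profiles into a full 3D model.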

The system creates 600 images in a single scan. The
combination of these yields a three-dimensional model of
points that convey the actual depth of the objects in the
scene. This 3D view is then reconciled with the view from a
traditional camera, adding a depth component to the pixels
in the image. With a color map of the objects and their
depth, the robot can implement various strategies to locate
the necessary object (or button) in a cluttered environment.
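Reconciling the point cloud with the camera image amounts to projecting each 3D point into the image and recording its depth at the pixel it lands on. The sketch below uses a basic pinhole projection; the focal length and image center are hypothetical parameters, and the STAIR implementation is certainly more involved.

```python
def attach_depth_to_pixels(points_3d, f_px, cx, cy):
    """Project 3D points (x, y, z), in the camera frame, into the
    image with a pinhole model, producing a {(u, v): depth} map.
    One simple way to 'add a depth component' to camera pixels
    (illustrative, not the STAIR implementation)."""
    depth_map = {}
    for x, y, z in points_3d:
        if z <= 0:
            continue  # point is behind the camera
        u = int(round(f_px * x / z + cx))
        v = int(round(f_px * y / z + cy))
        # when several points project to one pixel, keep the nearest
        # (it occludes the others from the camera's viewpoint)
        if (u, v) not in depth_map or z < depth_map[(u, v)]:
            depth_map[(u, v)] = z
    return depth_map
```

Looking up a pixel in the resulting map then gives both its color (from the image) and its depth (from the scan), which is the combined representation the detection strategies operate on.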

If the system used only a traditional camera, the
likelihood of identifying the correct object would still be
very high; however, a number of false positives would be
encountered. The three-dimensional component eliminates
much of this problem, because there are very few instances
in which an object with the coloring of a mug also has the
3D characteristics of a mug without actually being a mug.
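The false-positive argument can be made concrete as a simple conjunction test. This is a schematic sketch with made-up score names and thresholds, not the actual detector: a detection must pass both an appearance check (camera) and a shape check (3D scan) to survive.

```python
def confirm_detections(candidates, color_thresh=0.8, shape_thresh=0.8):
    """candidates: list of dicts with a 'color_score' (camera-only
    appearance match) and a 'shape_score' (3D geometry match), both
    in [0, 1]. Requiring BOTH to pass encodes the idea above: a
    mug-colored flat poster fails the 3D test. Score names and
    thresholds are hypothetical."""
    return [c for c in candidates
            if c["color_score"] >= color_thresh
            and c["shape_score"] >= shape_thresh]

detections = [
    {"name": "poster of a mug", "color_score": 0.95, "shape_score": 0.10},
    {"name": "actual mug",      "color_score": 0.90, "shape_score": 0.90},
]
confirmed = confirm_detections(detections)
```

The poster scores well on color alone and would fool a camera-only system, but its flat geometry rejects it once depth is considered.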

The potential and efficiency of this system are
demonstrated graphically in Figure 1, where the map
shown was generated entirely from the robot's vision. Each
red dot marks where the robot stopped and scanned a desk
for an object, which is then highlighted with an orange
circle within the yellow field of view.