Programming a Bolo Robot:

Recognizing Actions By Example

The Goal

The goal of this project is to implement an interactive agent in the multi-player tank
game Bolo that assists the player by recognizing his actions and responding
appropriately. This is achieved using a robot tank that follows the user's
tank as the game is played. The robot watches what the user is doing and tries to
recognize actions like "attacking nearest enemy tank" using features computed
from the game world. Once an action is recognized, the robot tank changes its behavior to
help the user. For instance, if the user is judged to be attacking a tank, the robot will
join in the attack. We call our robot tank Inducting Indy.

The inspiration for teaching the robot collaborative behavior by demonstration comes
from how people teach collaborative agents in the real world: first you tell your new
assistant exactly what you are doing, and then you demonstrate on a real example. Later,
after the assistant has seen enough examples, he can judge for himself when the time is
appropriate for the same course of action, without you having to tell him so.

This document describes our current system and the issues that it raises.

What is Bolo?

Bolo was created by Stuart
Cheshire (cheshire@cs.stanford.edu) of Stanford University. It has evolved from a
little-known tank game on the BBC Micro to an explosively successful Macintosh game that
is played constantly on the Internet as well as on many smaller networks. The author
summarizes his work:

"Bolo is a 16 player graphical networked real-time multi-player tank battle
game. It has elements of arcade-style shoot-em-up action, but for the serious players who
play 12 hour games with 16 players working in teams in different networked computer
clusters around an office or university campus, it becomes more of a strategy game. You
have to play it to understand."

The graphics and controls are pretty simple, as is the objective... conquer all the
bases and pills.

Bolo Brains

Bolo brains are plug-in pieces of code for Bolo that are passed the same (well, very
similar) information as the human players get, and return what actions they wish the tank
to perform. Brains are Bolo autopilots.

There are currently two main types of brains. The standard brain, known as a robot or
'bot, uses some form of AUI (Artificially Un-Intelligent) technique to play Bolo. The
second type of brain, known as a cyborg or 'borg, reads from the Mac keyboard and
essentially just helps a human play the game. These two types are merging.

Why Bolo?

We chose Bolo as an environment in which to study learning algorithms for several
reasons:

Interactivity

The game makes it possible to test learning algorithms in a real-time, interactive
environment where feedback to the user must be natural and fast.

Feature complexity

The key question for an interface is "which features matter?" Features in
Bolo's world can be easily extracted from the game and studied by a learning algorithm to
try to decide which ones are critical for recognizing various actions. However, the
number of features can also be restricted to a smaller subset to make the problem easier
to study.

Action complexity

Bolo is complex enough to contain many different types of actions that an agent might
recognize and respond to, but the game is structurally simple enough to make using it as a
research platform possible.

Game code

Source code to the Bolo game has not been released. However, the game is widely used and
very stable. It is one of the few real-time, interactive games that takes place in a
complex, simulated world with several different types of objects. The game is popular
because it permits many different types of actions, but it is possible to modify (and
simplify) the map of the game world when necessary to make experimentation easier.

Brain modules

Source code for plug-in brains is available. These are sophisticated robot tank
algorithms that can be modified without too much trouble to compute and output game
features to a learning algorithm and to react to another tank's actions.

Support

There is a large group of Bolo fans on the network who are willing to answer questions
about Bolo and Bolo brains.

Research

There are a few other people using Bolo for serious AI and psychological experimentation.

The only limitation we have discovered with using Bolo as a platform is that it is
impossible to stop the game and allow a learning algorithm to "think." However,
this could also be thought of as a "feature" that grounds the learning algorithm
in a realistic domain and forces us to consider the issue of the user interface more
carefully. Even simple learning algorithms running on fast machines require far more
processing time than a tank receives each clocktick. We will describe how we have worked
around this issue.

We have found that source is available on request for three 'bot Brains:

Indy 2.02

One of the best robots, with wonderfully clean C source.

aIndy

An advanced Indy with better play but messier code.

Standard Autopilot

The Bolo author's original sample Brain. Not very intelligent, with messy C source.

We therefore started by making modifications to Indy 2.02 code. (See the Brain list
for more information on the available Bolo Brains.) Indy 2.02 was written by John Lawrie
(jlawrie@sfu.ca) and Gersham Meharg (meharg@sfu.ca).

Computing Features

A Bolo brain tank has access to a large number of features, any of which might be
needed by a learning algorithm to recognize certain types of actions. See Bolo
Tidbits for a list of the information available to the robot that can potentially be
used for feature computation.

Given the vast number of potential features, pruning the set to a subset that the
learning algorithm can work with requires using some domain knowledge about the task and
making guesses about which features truly provide the most information for the task at
hand. For instance, since the actions we want the robot to recognize are very local, we
do not provide the learning algorithm with information about tanks outside the immediate
area. The raw features are also discretized into symbols like tank-near before they
reach the learner, as sketched below.
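
As a rough illustration of that discretization: the real feature code is in the Indy C
source, and the names and thresholds below are invented, but the idea is the same.

  ;; Illustrative only: our real feature computation is in the Indy C
  ;; code, and these names and thresholds are invented for the sketch.
  (defun discretize-distance (dist prefix)
    "Map a raw distance in map squares to a symbol like TANK-NEAR."
    (intern (format nil "~A-~A" prefix (if (< dist 8) "NEAR" "FAR"))))

  (defun make-feature-vector (shooting-p tank-dist base-dist)
    "Build one instance, e.g. (SHOOTING TANK-NEAR BASE-FAR)."
    (list (if shooting-p 'shooting 'not-shooting)
          (discretize-distance tank-dist "TANK")
          (discretize-distance base-dist "BASE")))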

There is no direct way to access features from the game with temporal extent (e.g.
derivatives of properties), although they could be computed with some effort. We have not
used temporal features here.

The Learning Algorithms

The goal of the learning algorithm is to take as input a set of features and
corresponding decisions and to produce a compact set of decision rules. The rules are used
to evaluate incoming features and make decisions about what action is taking place. PRISM
is an algorithm, developed as an alternative to decision-tree methods like ID3, that
computes disjunctive rules of features from a given input set of features and decisions.
Features are supplied in a form such as ((shooting tank-near base-far ATTACK)
(not-shooting tank-near base-near NOT-ATTACK)) and the resulting rules are of the form
((shooting & tank-near -> ATTACK) (base-far -> NOT-ATTACK)). For details on the
algorithm, see the original article: [J. Cendrowska, "PRISM: An Algorithm for Inducing
Modular Rules," Int. J. Man-Machine Studies 27 (1987), 349-370.]
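
To make the flavor of the algorithm concrete, here is a condensed Common Lisp sketch of
PRISM's covering loop over our feature representation. It is illustrative rather than
our exact implementation, and it omits the tie-breaking and other refinements of
Cendrowska's full algorithm:

  ;; A condensed sketch of PRISM's covering loop for our representation,
  ;; where an example is a list of feature symbols ending in a decision,
  ;; e.g. (SHOOTING TANK-NEAR BASE-FAR ATTACK).
  (defun decision (example) (first (last example)))
  (defun features (example) (butlast example))

  (defun covers-p (conds example)
    (subsetp conds (features example)))

  (defun best-feature (conds class examples)
    "Pick the feature whose addition to CONDS maximizes p(CLASS | CONDS)."
    (let (best (best-p -1))
      (dolist (ex examples)
        (dolist (f (features ex))
          (unless (member f conds)
            (let* ((covered (remove-if-not
                             (lambda (e) (covers-p (cons f conds) e))
                             examples))
                   (hits (count class covered :key #'decision))
                   (p (if covered (/ hits (length covered)) 0)))
              (when (> p best-p) (setf best f best-p p))))))
      (values best best-p)))

  (defun learn-rule (class examples)
    "Grow one conjunction until it covers only CLASS (or features run out)."
    (let ((conds '()) (pool examples))
      (loop
        (multiple-value-bind (f p) (best-feature conds class pool)
          (unless f (return conds))
          (push f conds)
          (setf pool (remove-if-not (lambda (e) (covers-p conds e)) pool))
          (when (= p 1) (return conds))))))

  (defun prism (class examples)
    "Return a list of condition lists that together cover CLASS."
    (let ((rules '()))
      (loop while (find class examples :key #'decision)
            do (let ((rule (learn-rule class examples)))
                 (push rule rules)
                 (setf examples
                       (remove-if (lambda (e) (and (covers-p rule e)
                                                   (eq (decision e) class)))
                                  examples))))
      (nreverse rules)))

Calling (prism 'attack examples) on data like the example above yields condition lists
such as (SHOOTING TANK-NEAR), each read as a conjunctive rule for ATTACK.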

PRISM will produce a rule to fit every combination of features it is shown.
Consequently, it tends to over-fit the data, producing a specialized rule for too many
situations. INDUCT, described in Gaines89, is an
extension to PRISM which tries to eliminate overly specialized rules at the expense of
possibly generating some incorrect rules. Our data violates two assumptions made by PRISM:
it is not complete, and our examples are not always correct. Consequently, PRISM discovers
some inconsistencies and also generates too many specialized rules. Therefore, we have
applied INDUCT to the input. We are not guaranteed correctness and we don't avoid
inconsistencies, but some over-fit rules are avoided.

Training the 'Bot

A typical training session with our learning 'bot, Inducting Indy, begins by
making sure that both the 'bot and the human player are allied to each other. In the
course of the game, the human player shows Inducting Indy examples of an activity (e.g.
"attack the pillbox") that the players wants the 'bot to perform. Just before
attacking the pillbox himself, the player sends a message indicating what action he is
performing. The inducting 'bot recognizes this message and accordingly tags the features
of the world that it stashes to disk.
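
In outline, the tagging works something like the following sketch, shown in Lisp for
brevity even though the actual logging happens in the Indy C code (the names are
hypothetical):

  ;; Lisp stand-in for the tagging done in the Indy C code; the names
  ;; are hypothetical.  A preconfigured message sets the current label,
  ;; and each logging tick stashes the feature vector with the label
  ;; appended, giving examples like (SHOOTING TANK-NEAR ... ATTACK-PILL).
  (defvar *current-label* 'no-action)

  (defun handle-player-message (msg)
    ;; e.g. "ATTACK-PILL" starts an example; "DONE" ends it.
    (setf *current-label*
          (if (string-equal msg "DONE")
              'no-action
              (intern (string-upcase msg)))))

  (defun stash-example (feature-set stream)
    (print (append feature-set (list *current-label*)) stream))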

In Bolo, sending messages to other tanks in a timely fashion can be cumbersome when the
player is trying to control his own tank. In our system the player can send one of a set
of preconfigured messages with a single keystroke; we've also experimented successfully
with Apple's PlainTalk speech recognition software and AppleScript to link spoken messages
to the keystrokes that actually send the messages.

When the player stops the action that he is demonstrating, he then sends Inducting Indy
a message indicating that the action is over for now.

After a fixed amount of time in the game (a few minutes), the action rules are
recomputed by the INDUCT algorithm described above. The inducting 'bot then uses these new
rules to classify the actions of the player. Later, if the player happens not to tell the
'bot what he's doing, and if a particular rule fits the set of features, our 'bot
concludes that the player must be engaged in the action corresponding to the rule. It then
adopts this action as its own.
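
Matching itself is cheap enough to run every tick. A minimal sketch, assuming the rules
are stored as condition lists paired with actions:

  ;; A rule fires when every one of its conditions is present in the
  ;; current feature set.  Rules here pair a condition list with an
  ;; action, e.g. (((SHOOTING TANK-NEAR) . ATTACK) ((BASE-FAR) . NOT-ATTACK)).
  (defun classify (feature-set rules)
    "Return the action of the first rule whose conditions all hold, or NIL."
    (loop for (conds . action) in rules
          when (subsetp conds feature-set)
            return action))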

During the course of the game, while the 'bot is comparing the state of the world from
the player's point of view against its set of rules, it is also free to pursue its own
goals (the ones built in by the creators of Indy), so long as none of the rules is firing
and it doesn't wander so far from the player that it can't see what he's doing.

The training doesn't stop when a single example has been shown to the 'bot, however.
The player is free to show the 'bot more examples of the action. In fact, this must happen
for INDUCT to derive rules that are sufficiently general to be useful. For example, the
rules derived from a single example of "attack the pillbox" will probably depend
on the particular pillbox being attacked, the direction the tank is pointed, and a host of
other features which in this case happen to be irrelevant. Since the rules are recomputed
each time from the entire set of examples collected over the whole session, they
continually evolve, becoming more general and therefore more useful.

This periodic recomputation of the rules permits a style of interactive
"teaching" that should be useful in a game situation. After showing the 'bot a
few examples of an action, the player can check whether it has learned the appropriate
concepts by seeing if it does the right thing in a new situation. If it does -- great. If
it doesn't, the player can add a training example on the spot rather than declare a total
failure of the 'bot's configuration.

The learning algorithm will sometimes receive one set of features with two different
decisions. For example, temporal delays by the user in labeling may lead to a (shot-in-air
ATTACK) input and a (shot-in-air EXPLORE) input; a shot may still be in flight when the
user switches from attacking to exploring. This type of contradictory input generates
unresolvable conflicts in the PRISM and INDUCT learning algorithms. Therefore, we have
added a modification to the learning code that keeps track of the number of each type of
example seen over the entire game. If 10 examples of (shot-in-air ATTACK) are received and
only 2 examples of (shot-in-air EXPLORE) are received, then the (shot-in-air EXPLORE)
examples are simply not used in training. However, if at a later time 9 more examples of
(shot-in-air EXPLORE) are received, (shot-in-air EXPLORE) will have one more example than
(shot-in-air ATTACK), and it becomes the case used in training. This method
of keeping track of multiple examples and eliminating conflicts based on the number of
examples received allows an action to be trained and then retained even if a few spurious
inputs are received that directly contradict the data used to learn the rules.
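
In outline, the conflict filter looks something like the following sketch (the names are
ours, not the exact code):

  ;; Hypothetical names; not our exact code.  We tally every
  ;; (feature-set . decision) pair seen so far and, when one feature set
  ;; carries two decisions, train only on the more frequent one.
  (defvar *example-counts* (make-hash-table :test #'equal))

  (defun record-example (feature-set decision)
    (incf (gethash (cons feature-set decision) *example-counts* 0)))

  (defun winning-decision (feature-set decisions)
    "Among DECISIONS seen with FEATURE-SET, return the most frequent."
    (let (best (best-n -1))
      (dolist (d decisions best)
        (let ((n (gethash (cons feature-set d) *example-counts* 0)))
          (when (> n best-n) (setf best d best-n n))))))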

Action

Once the Inducting Indy 'bot has recognized some behavior, it tries to respond.
Currently it can recognize attack-tank, attack-base, attack-pill, fortify-base, hiding,
and exploring. It can respond by attacking the appropriate object (using different
strategies depending upon the object), fortifying a base with mines, or hiding in the same
region of forest as the player. Temporal smoothing of the 'bot's actions is achieved by
extending each action for a few seconds. For example, if a HIDING decision is
made, the tank will try to hide for the next few seconds unless it receives another
action, like ATTACK-TANK, in which case it will stop hiding immediately and respond to the
new behavior. When no actions are recognized, Inducting Indy goes about its normal
business as it follows the player around.
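
A minimal sketch of this latching, assuming a hold time of a few seconds (the five-second
constant below is a stand-in for the actual interval):

  ;; The five-second hold below is a stand-in for the actual interval.
  (defvar *current-action* nil)
  (defvar *action-expires* 0)

  (defun adopt-action (action &optional (hold-seconds 5))
    "Latch ACTION for HOLD-SECONDS; a new recognition preempts it."
    (setf *current-action* action
          *action-expires* (+ (get-universal-time) hold-seconds)))

  (defun current-action ()
    "Return the latched action, or NIL once it has expired."
    (when (and *current-action*
               (> *action-expires* (get-universal-time)))
      *current-action*))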

The kind of interactive, incremental, learn-as-you-go method of training described in
the previous section seems particularly suited to teaching collaborative agents.

But it's also useful in teaching the teacher. Consider what Inducting Indy is actually
doing: it's recognizing how you go about attacking a pillbox or whatever. Later,
when it recognizes you attacking the pillbox, it proceeds to attack the pillbox in its
own way. The teacher stands to learn how the original Indy code goes about attacking a
pill, which in many cases (certainly in ours) may be more effective than the technique
shown in the training examples.

Examples

When training all of the recognized actions simultaneously, rule sets of 100 to 200
rules are created. Here's an example set of rules for a small test case where
HIDING, FORTIFY-BASE, and ATTACK-PILL were quickly trained by playing the game for about
10 minutes:

These rules were sufficient so that the player could attack an enemy pill and Inducting
Indy would then start attacking the same pill. In addition, the player could hide in the
forest and Inducting Indy would come hide nearby in the same forest. Finally, if the
player started to fortify a base, Inducting Indy would come help out.

The rules are not perfect, but given more training they would cover more cases. In
general, even this small set of rules led to interesting (and useful) performance by
Inducting Indy.

One 'bot called Speilborg generates a QuickTime segment of a Bolo game (see Bolo Brains
for more info). We have used Speilborg to generate a QuickTime movie of Inducting Indy and
overlaid sound, so that we can describe the training as the game progresses.

Conclusions and Possible Extensions

The system demonstrates that it is possible to adapt a machine learning technique like
PRISM to an interactive, real-time situation for the purposes of teaching collaborative
behavior. It works well ... better than we had expected.

Some possible extensions are:

When the player gives the 'bot an example of an action, the player sometimes forgets to
tell the 'bot that he has stopped performing the action (that the training example has
ended). This is not surprising: when we demonstrate an action in real life, we rarely have
to worry that the student might not realize that the demonstration is over when we leave
the room to get the phone (for example). Our 'bot might replicate this behavior by
watching how the world features change in the course of a training example. For example,
if after a while a number of the features change, it might conclude that the demonstration
is over.
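
A sketch of one way such a heuristic might look, with the threshold an outright guess:

  ;; The one-half threshold is purely a guess.
  (defun demo-probably-over-p (start-features current-features
                               &optional (threshold 1/2))
    "True when more than THRESHOLD of the starting features have changed."
    (> (count-if-not (lambda (f) (member f current-features))
                     start-features)
       (* threshold (length start-features))))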

Currently there is no way to incorporate the player's negative feedback into the 'bot's
learning; the idea would be to stop the 'bot when it wrongly decides to adopt some
behavior. This is partly a user-interface issue (should the player just send a message to
it, or should it ask for confirmation when it's unsure?) and partly a learning-algorithm
issue (should the offending rule and the examples that produced it be thrown out?).

There are other interesting PRISM-like learning algorithms that might be better suited to
our task. For example, David Maulsby's Cima system, derived from PRISM, includes ways to
assign saliency to features, factor in the user's hints, and compute rules while taking
advantage of the interactive nature of the learning. While we knew about Cima when we
started the project, our view was that it was probably smart to start with a simple
algorithm.

During training the player may be in a position to tell the 'bot when a particular
feature is irrelevant (or particularly relevant). For example, while demonstrating a
behavior like exploring, the player may indicate that the direction the tank is facing is
irrelevant, or that the fact that the tank is on the move is relevant. The problem here is
that such a facility assumes that the player knows the set of features that the 'bot is
calculating. One answer is simply to map a spoken keyword to a particular feature or set
of features.

Right now, the 'bot recognizes only features collected from a single frame of the game;
it does not recognize sequences of actions. One interesting extension involves permitting
it to recognize (and execute) sequences of actions from features collected over multiple
frames. Discrete Hidden Markov Modeling or Q-learning may be useful for recognizing
sequences.

Currently, once the 'bot is trained, it responds to what it thinks the player is doing.
However, if the player is still training, sometimes the player will indicate one action
and the 'bot will predict another. While the PRISM algorithm will eventually resolve these
inconsistencies, the fact that a direct contradiction has been confirmed is not used in
any special way, say to provide weights during rule construction.

One of the last phases of teaching collaborative agents might be called
"delegation". That is, eventually the teacher is so confident in his student
that he no longer has to be around to watch him. Also, the only kind of collaborative
behavior shown by our system is that of adopting the same behavior and the same
object of that behavior. Another kind of collaborative behavior (sometimes called
"divide and conquer") involves distributing tasks. In advanced phases of
learning collaborative behavior, both these styles of behavior, delegation and
divide-and-conquer, need to be developed. Anyone looking for a Ph.D. topic?

Technical Details

On the Mac, the Indy 2.02 C code was ported to compile under MetroWerks CodeWarrior on a
PowerPC (the Think C version had a code segment size limitation). Indy was modified in
several ways:

A "follow closest ally tank" mode was added so the robot will tag along behind
the player's tank.

The code was modified so that the 'bot can be told to perform specific behaviors at
specific times, such as attack base, attack pill, attack tank, hide, and fortify base.

Code to compute and discretize features with respect to the tank being followed was
added to Indy. As it follows, it writes the followed tank's features to a file a few times
a second.

The Indy code was modified so that it reads rules computed by the learning algorithm in
LISP, compares the rules to incoming features, and then reacts accordingly. For example,
when the ally tank being followed starts attacking an enemy tank, the following tank will
attack until the followed tank stops.

The PRISM and INDUCT learning algorithms were implemented in Common LISP for
flexibility and ease of programming and run under LispWorks on a fast
HP735 workstation. They read features from a file output by the Mac Bolo robot, compute
rules, and output the rules to a file read by the Bolo robot. Some effort was required to
get the learning algorithm and the Bolo Brain communicating via files. Presently, the
algorithm runs fast enough that it could probably be ported to MacLISP.
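
In outline, the Lisp side's round trip looks something like the sketch below; the file
names and the induct entry point are hypothetical:

  ;; File names and the INDUCT entry point are hypothetical.
  (defun relearn-and-publish (&key (examples-file "indy-examples.lisp")
                                   (rules-file "indy-rules.lisp"))
    "Read the examples the brain has logged, recompute rules, write them back."
    (let ((examples (with-open-file (in examples-file)
                      (loop for ex = (read in nil nil)
                            while ex collect ex))))
      (with-open-file (out rules-file :direction :output
                                      :if-exists :supersede)
        (print (induct examples) out))))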

A significant amount of time was required to write feature computation routines and to
tweak which features should be provided to the learner. Testing the algorithm required
running two or three tanks simultaneously. Testing was especially difficult, given our
ineptitude at actually playing the game! A new game map was created to make the training a
bit easier.

Using/Modifying/Fixing the Code

The C and Lisp code for Inducting Indy is available to anyone who wants it. Send a request via email. The version here ran on
a Unix workstation running Lisp and communicated with the Bolo C process via files. This
was a kludge that requires fixing. If you want to use Inducting Indy you have two options:
either convert the Lisp code to MacLisp and get the C and Lisp processes communicating
properly, or rewrite the PRISM algorithm in C code.