David's Simple Game - SARSA with an Adversary Controller

If the applet is not working, your browser may not support Java. It has been tested in Firefox, and it works there!

This applet shows a simple game on a 5x5 grid. The agent (shown
as a circle) can move up, down, left or right.

The Game

There can be a prize at one of the 4 corners (shown in cyan when it
is present). When the agent lands on a prize, it gets a reward of +10
and the prize disappears. When there is no prize, a prize can appear
at one of the 4 corners, and it stays there until the agent lands on
it. In this version an adversary chooses which corner the prize
appears at. Note that the adversary does not explore; it acts greedily
with respect to the Q-function (i.e., it chooses the prize location
that has the lowest Q-value).
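
As a minimal sketch of this prize lifecycle, assuming hypothetical names (PrizeModel, CORNERS, step) and that a new prize appears immediately when the old one is collected (the applet's exact timing is not stated here):

    /** Sketch of the prize lifecycle: +10 and removal when the agent lands on
     *  it; when absent, the adversary chooses the corner for the next prize.
     *  All names are assumptions, not the applet's actual identifiers. */
    class PrizeModel {
        static final int[][] CORNERS = {{0, 0}, {0, 4}, {4, 0}, {4, 4}};
        int prizeCorner = -1;  // -1 means no prize is currently on the grid

        /** Reward from the prize when the agent is at (row, col);
         *  adversaryChoice is the corner the adversary would pick next. */
        int step(int row, int col, int adversaryChoice) {
            if (prizeCorner >= 0 && CORNERS[prizeCorner][0] == row
                                 && CORNERS[prizeCorner][1] == col) {
                prizeCorner = -1;  // the prize disappears once collected
                return 10;         // reward of +10 for landing on the prize
            }
            if (prizeCorner < 0) {
                prizeCorner = adversaryChoice;  // adversary places the prize
            }
            return 0;
        }
    }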

There are 5 locations where monsters can appear randomly; a monster is
shown as red in its square. If a monster appears when the agent is at
that location, the agent gets damaged if it wasn't already damaged; if
it was already damaged, the agent receives a penalty of 10 (i.e., a
reward of -10). Monsters appear at these locations independently at
each time step. The agent can get repaired by visiting the repair
station (the second-from-left location on the top row, shown in
magenta). The agent is yellow when it isn't damaged and pink when
it is damaged.
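
A minimal sketch of these monster dynamics follows, assuming five arbitrary monster cells and an appearance probability of 0.5 (neither is given in this text); the repair cell follows the description above:

    import java.util.Random;

    /** Sketch of the monster/damage dynamics. MONSTER_CELLS and P_MONSTER
     *  are assumptions; the text only says there are 5 monster locations. */
    class DamageModel {
        static final int[][] MONSTER_CELLS = {{1, 1}, {1, 3}, {2, 2}, {3, 1}, {3, 3}};
        static final int REPAIR_ROW = 0, REPAIR_COL = 1; // second from left, top row
        static final double P_MONSTER = 0.5;             // assumed probability
        final Random rng = new Random();
        boolean damaged = false;

        /** Reward contribution of monsters/repair when the agent is at (row, col). */
        int step(int row, int col) {
            if (row == REPAIR_ROW && col == REPAIR_COL) damaged = false; // repaired
            for (int[] m : MONSTER_CELLS)
                if (m[0] == row && m[1] == col && rng.nextDouble() < P_MONSTER) {
                    if (damaged) return -10; // hit while damaged: reward of -10
                    damaged = true;          // undamaged agent becomes damaged
                }
            return 0;
        }
    }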

There are 4 actions available to the agent: up, down, left and
right. If the agent carries out one of these actions, it has a 0.7
chance of going one step in the desired direction and a 0.1 chance of
going one step in each of the other three directions. If it bumps into
the outside wall or an inside wall (i.e., the square computed as above
is outside the grid or through an internal wall), there is a penalty
of 1 (i.e., a reward of -1) and the agent doesn't actually move.
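
The movement model can be sketched as follows (internal walls are omitted for brevity; class and method names are assumptions):

    import java.util.Random;

    /** Sketch of the stochastic movement: probability 0.7 of moving in the
     *  chosen direction, 0.1 for each of the other three, and a reward of -1
     *  with no movement on bumping the outside wall. */
    class MoveModel {
        static final int SIZE = 5;
        static final int[][] DELTAS = {{-1, 0}, {1, 0}, {0, -1}, {0, 1}}; // up, down, left, right
        final Random rng = new Random();
        int row = 2, col = 2;  // agent position
        int lastReward = 0;

        void act(int chosen) {
            int dir = chosen;  // keep the chosen direction with probability 0.7
            if (rng.nextDouble() >= 0.7) {
                do { dir = rng.nextInt(4); } while (dir == chosen); // 0.1 each
            }
            int r = row + DELTAS[dir][0], c = col + DELTAS[dir][1];
            if (r < 0 || r >= SIZE || c < 0 || c >= SIZE) {
                lastReward = -1;  // bumped the wall: penalty, no move
            } else {
                lastReward = 0;
                row = r;
                col = c;
            }
        }
    }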

The Controller

The numbers in the squares represent the Q-values of the state made up
of that location, the current position of the prize, and the current
damage status. The blue arrows show the value of the optimal
action. The agent acts greedily with respect to the Q-values for the
percentage of time given in the applet, and chooses a random action
the rest of the time.
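
The exploit/explore choice can be sketched as below, assuming the Q-values for the current state are in an array (ties are broken by taking the first best action here; the applet may break ties differently):

    import java.util.Random;

    /** Sketch of epsilon-greedy action selection using the exploit percent. */
    class Policy {
        static int chooseAction(double[] qForState, double exploitPercent, Random rng) {
            if (rng.nextDouble() * 100 < exploitPercent) {
                int best = 0;  // greedy: argmax over this state's Q-values
                for (int a = 1; a < qForState.length; a++)
                    if (qForState[a] > qForState[best]) best = a;
                return best;
            }
            return rng.nextInt(qForState.length);  // explore: random action
        }
    }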

The adversary chooses the corner that results in the state with the
lowest Q-value (i.e., the state that is worst for the moving agent).
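
A minimal sketch of that choice, assuming stateValue[c] holds the agent's value (e.g., the best Q-value) of the state that results from the prize appearing at corner c:

    /** Sketch of the adversary: pick the corner whose resulting state is
     *  worst (lowest value) for the agent. Names are assumptions. */
    class Adversary {
        static int worstCorner(double[] stateValue) {
            int worst = 0;
            for (int c = 1; c < stateValue.length; c++)
                if (stateValue[c] < stateValue[worst]) worst = c;
            return worst;
        }
    }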

You can control the agent yourself (using the up, left, right and down
buttons) or you can step the agent for the number of steps specified
in the applet.

There are some parameters you can change:

Discount: specifies how much future rewards are discounted.

Step count: specifies the number of steps that occur for each
press of the "step" button.

Greedy exploit percent: specifies what percentage of the time
the agent chooses one of the best actions; the rest of the time it acts randomly.

Alpha: gives the learning rate if the fixed box is checked; otherwise
alpha changes over time so that the Q-values compute the empirical average.

Initial value: specifies the Q-values when Reset is pressed.
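
Putting these parameters together, the SARSA update is Q(s,a) <- Q(s,a) + alpha * (r + discount * Q(s',a') - Q(s,a)). A minimal sketch follows, assuming the non-fixed alpha is 1/k on the k-th visit to (s,a), a standard way to compute the empirical average (the applet's exact scheme is not stated here):

    /** Sketch of the SARSA learner using the parameters above. Array shapes
     *  and names are assumptions, not the applet's actual identifiers. */
    class Sarsa {
        final double[][] q;      // Q-values, initialized to the initial value
        final int[][] visits;    // visit counts per (state, action)
        final double discount;   // the Discount parameter
        final double fixedAlpha; // used when the "fixed" box is checked
        final boolean fixed;

        Sarsa(int nStates, int nActions, double discount,
              double fixedAlpha, boolean fixed, double initQ) {
            q = new double[nStates][nActions];
            visits = new int[nStates][nActions];
            this.discount = discount;
            this.fixedAlpha = fixedAlpha;
            this.fixed = fixed;
            for (double[] row : q) java.util.Arrays.fill(row, initQ);
        }

        /** One SARSA update for the transition (s, a) -> reward r -> (s2, a2). */
        void update(int s, int a, double r, int s2, int a2) {
            visits[s][a]++;
            double alpha = fixed ? fixedAlpha : 1.0 / visits[s][a];
            q[s][a] += alpha * (r + discount * q[s2][a2] - q[s][a]);
        }
    }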

The applet reports the number of steps and the total reward
received. It also reports the minimum accumulated reward (which
indicates when the agent has started to learn), and the point at which
the accumulated reward changes from negative to positive.
Reset initializes these to zero and the Q-values to the
initial value. Trace on console lists the history of
steps and rewards on the console, if you want to plot it.

The commands "Brighter" and "Dimmer" change the contrast (the
mapping
between non-extreme values and colour). "Grow" and "Shrink" change the
size of the grid.