The associative reinforcement comparison (ARC) algorithm [114] is an instance of the AHC
architecture for the case of boolean actions, consisting of two
feed-forward networks. One learns the value of situations, the other
learns a policy. These can be simple linear networks or can have
hidden units.

In the simplest case, the entire system learns only to optimize
immediate reward. First, let us consider the behavior of the network
that learns the policy, a mapping from a vector describing $s$ to a 0
or 1. If the output unit has activation $y_s$, then $a$, the action
generated, will be 1 if $y_s + \nu > 0$, where $\nu$ is normal noise,
and 0 otherwise.
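
A minimal sketch of this action-selection rule, assuming a linear policy network; the names `w` (policy weights), `s` (state features), and the noise scale `noise_std` are illustrative choices, not fixed by the original formulation:

```python
import numpy as np

def select_action(w, s, rng, noise_std=0.1):
    """Boolean action selection for the ARC policy network (a sketch).

    Assumes a linear policy: the output unit's activation is the dot
    product of the weights w with the state features s. The text only
    specifies that the noise is normal; its scale here is illustrative.
    """
    y = w @ s                          # activation of the output unit
    nu = rng.normal(0.0, noise_std)    # normal exploration noise
    return 1 if y + nu > 0 else 0      # action a in {0, 1}

# Example: rng = np.random.default_rng(0); a = select_action(w, s, rng)
```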

The adjustment for the output unit is, in the simplest case,
\[
\Delta = r \left( a - \tfrac{1}{2} \right) ,
\]
where the first factor is the reward received for taking the most recent
action and the second encodes which action was taken. The actions are
encoded as 0 and 1, so $a - \tfrac{1}{2}$ always has the same magnitude;
if the reward and $a - \tfrac{1}{2}$ have the same sign, then action 1 will be
made more likely, otherwise action 0 will be.
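
For a linear network, this adjustment might be applied to each weight in proportion to its input, as in the sketch below; the step size `alpha` and the per-weight form are assumptions, since the text gives only the output-unit adjustment:

```python
import numpy as np

def update_policy(w, s, a, r, alpha=0.1):
    """Simple ARC policy update, Delta = r * (a - 1/2), for a linear
    network. The step size alpha and the proportionality to the input
    features s are illustrative assumptions.
    """
    # Positive reward pushes the activation toward the action just taken;
    # with a in {0, 1}, (a - 0.5) is +0.5 or -0.5.
    return w + alpha * r * (a - 0.5) * s
```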

As described, the network will tend to seek actions that give
positive reward. To extend this approach to maximize reward, we can
compare the reward to some baseline, $b$. This changes the adjustment
to
\[
\Delta = (r - b) \left( a - \tfrac{1}{2} \right) ,
\]
where $b$ is the output of the
second network. The second network is trained in a standard
supervised mode to estimate $r$ as a function of the input state $s$.
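
Putting the two networks together, here is a sketch of one full reinforcement-comparison update in the linear case; the value-network step size `beta` and the delta-rule regression toward $r$ are one plausible reading of "standard supervised mode", not details given in the text:

```python
import numpy as np

def arc_update(w, v, s, a, r, alpha=0.1, beta=0.1):
    """One reinforcement-comparison update after observing reward r for
    action a in state s (linear networks; step sizes are illustrative).

    w: policy-network weights; v: value-network (baseline) weights.
    """
    b = v @ s                                   # baseline: predicted reward for s
    w = w + alpha * (r - b) * (a - 0.5) * s     # policy adjustment, (r - b)(a - 1/2)
    v = v + beta * (r - b) * s                  # delta rule: regress b toward r
    return w, v
```

In use, one would select an action, observe the immediate reward, and then apply this update: a reward above the baseline makes the most recent action more likely, while a reward below it favors the opposite action.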

Variations of this approach have been used in a variety of
applications [4, 9, 61, 114].