Abstract

We consider scenarios from the real-time strategy game StarCraft as new
benchmarks for reinforcement learning algorithms. We propose
micromanagement tasks, which present the problem of the short-term,
low-level control of army members during a battle. From a reinforcement
learning point of view, these scenarios are challenging because the
state-action space is very large, and because there is no obvious feature
representation for the state-action evaluation function. We describe our
approach to tackle the micromanagement scenarios with deep neural network
controllers from raw state features given by the game engine. In addition, we
present a heuristic reinforcement learning algorithm which combines direct
exploration in the policy space and backpropagation. This algorithm allows for the
collection of traces for learning using deterministic policies, which appears much
more efficient than, for example, ϵ-greedy exploration. Experiments show
that with this algorithm, we successfully learn non-trivial strategies for
scenarios with armies of up to 15 agents, where both Q-learning and
REINFORCE struggle.

1 Introduction

StarCraft (StarCraft and its expansion StarCraft: Brood War
are trademarks of Blizzard Entertainment™) is a
real-time strategy game in which each player must build an army and
control individual units to destroy the opponent’s army. As of
today, StarCraft is considered one of the most difficult games for
computers, and the best bots only reach the level of high amateur
human players (see
http://webdocs.cs.ualberta.ca/~cdavid/starcraftaicomp/report2015.shtml#mvm,
retrieved on August 23rd, 2016). The main difficulty comes from the need to
control a large number of units in a wide, partially observable
environment. This implies, in particular, extremely large state and
action spaces: in a typical game, there are at least $10^{1685}$
possible states (for reference, the game of Go has about $10^{170}$
states) and the joint action space is in $\Theta((\#\text{commands per
unit})^{\#\text{units}})$, with a peak number of units of about
400 [36]. From a machine learning point of
view, StarCraft provides an ideal environment to study the control of
multiple agents at large scale, and also an opportunity to define
tasks of increasing difficulty, from micromanagement, which
concerns the short-term, low-level control of fighting units during
battles, to long-term strategic and hierarchical planning under
uncertainty. While building a controller for the full game based on
machine learning is out-of-reach for current methods, we propose, as a
first step, to study reinforcement learning algorithms in
micromanagement scenarios in StarCraft.

Both the work on Atari games [19] and the recent Minecraft
scenarios studied by researchers [1, 22]
focus on the control of a single agent, with a fixed, limited set of actions.
Coherently controlling multiple agents (units) is the main challenge of
reinforcement learning for micromanagement tasks. This comes with two
main difficulties. The first difficulty is to efficiently explore the
large action space.
The implementation of a coherent strategy requires the units to take
actions that depend on each other, but it also implies that any small
alteration of a strategy must be maintained for a sufficiently long
time to properly evaluate the long-term effect of that change. In
contrast to this requirement of consistency in exploration, the
reinforcement learning algorithms that have been successful in
training deep neural network policies, such as Q-learning
[44, 34] and REINFORCE
[46, 7], perform exploration by
randomizing actions. In the case of micromanagement, randomizing
actions mainly disorganizes the units, which then rapidly lose the
battle without collecting relevant feedback.
The second major difficulty of micromanagement scenarios is that
there is no obvious way to parameterize the policy given the state and
the actions, because some actions describe a relation between entities
of the state, e.g., (unit A, attack, unit B) or (unit A, move,
position B) and are not restricted to a few constant symbols such as
“move left” or “move right”. The approach of “learning directly
from pixels”, in which the pixel input is fed to a multi-class
convolutional neural network, was successful in Atari games
[20]. However, pixels only capture spatial
relationships between units. These are only parts of the relationships
of interest, and more generally this kind of multi-class architecture
cannot evaluate actions that are parameterized by an entity of the
state.

The contribution of this paper is twofold. First, we propose several
micromanagement tasks from StarCraft (Section 3), then we
describe our approach to tackle them and evaluate well known reinforcement
learning algorithms on these tasks (Section 4), such as
Q-learning and REINFORCE (Subsection 4.1). In particular,
we present a greedy inference approach to break down the complexity of
taking the actions at each step (Subsection 4.2). We also describe
the features used to jointly represent states and actions, as well as a deep
neural network model for the policy (Section 5). Second, we
propose a heuristic reinforcement learning
algorithm to address the difficulty of exploration in these tasks
(Section 6). To avoid the pitfalls of exploration by taking
randomized actions at each step, this algorithm explores directly in policy
space, by randomizing a small part of the deep network parameters at the
beginning of an episode and running the altered, deterministic, policy
throughout the whole episode. Parameter updates are performed using a heuristic
approach combining gradient-free optimization for the randomized parameters,
and plain backpropagation for the others. Compared to algorithms for efficient
direct exploration in parameter space (see e.g.,
[18, 27, 37, 25]),
the novelty of our algorithm is to mix exploration through parameter
randomization and plain gradient descent. Parameter randomization is efficient
for exploration but learns slowly with a large number of parameters, whereas
gradient descent does not take part in any exploration but can rapidly learn
models with millions of parameters.

2 Related work

Multi-agent reinforcement learning has been an active area of research (see
e.g., [4]). Most of the focus has been on learning
agents in competitive environments with adaptive adversaries (e.g.,
[15, 13, 40]). Some work has
looked at learning control policies for individual agents in a collaborative
setting with communication constraints
[38, 3], with applications such as soccer
robot control [31], and methods such as hierarchical
reinforcement learning for communicating high-level goals
[12], or learning an efficient communication
protocol [32]. While the decentralized control
framework is most likely relevant for playing full games of StarCraft, here we
avoid the difficulty of imperfect information, therefore we use the
multi-agent structure only as a means to structure the action space. As in the
approach of [17] with reinforcement learning for
structured output prediction, we use a greedy sequential inference scheme at
each time frame: each unit decides on its action based solely on the state
combined with the actions of units that came before it in the sequence.

Algorithms that have been used to train deep neural network controllers in
reinforcement learning include Q-learning [44, 20],
the method of temporal differences
[33, 39], policy gradient and their
variants [46, 7], and actor/critic
architectures
[2, 29, 28]. Except
for the deterministic policy gradient (DPG) [29],
these algorithms rely on randomizing the actions at each step for exploration.
DPG collects traces by following deterministic policies that remain constant
throughout an episode, but can only be applied when the action space is
continuous. Our work is most closely related to works that explore the
parameter space of policies rather than the action space. Several approaches
have been proposed that randomize the parameters of the policy at the beginning
of an episode and run a deterministic policy throughout the entire episode,
borrowing ideas from gradient-free optimization (see e.g.,
[18, 27, 37]). However, these
algorithms rely on gradient-free optimization for all parameters, which does
not scale well with the number of parameters. Osband et al.
[25] describe another type of algorithm where the
parameters of a deterministic policy are randomized at the beginning of an
episode, and learn a posterior distribution over the parameters as in Thompson
sampling [41]. Their algorithm is particularly
suitable for problems in which depth-first search exploration is efficient, so
their motivation is very similar to ours. Their approach was proved to be
efficient, but applies only to linear functions and scales quadratically with
the number of parameters. The bootstrapped deep Q-networks (BDQN)
[24] are a practical implementation of the ideas of
[25] for deep neural networks. However, BDQN still
performs exploration in the action space at the beginning of the training, and
there is no randomization of the parameters. Instead, several versions of the
last layer of the deep neural network controller are maintained, and one of
them at a time is used during an entire episode to generate diverse traces
and perform Q-learning updates. In contrast, we randomize the parameters of the
last layer once at the beginning of an episode, and contrary to Q-learning,
our algorithm does not rely on the estimation of the state-action value
function.

In the context of StarCraft micromanagement, a large spectrum of AI approaches
have been studied. There has been work on Bayesian fusion of hand-designed
influence maps [36], fast heuristic search (in a
simplified simulator of battles without collisions) [5],
and even evolutionary optimization [16]. Closer to this
work, [45] successfully applied tabular Q-learning
[44] and SARSA [34], with and
without experience replay (“eligibility traces”), with a reward similar to the
one used in several of our experiments. However, the action space was
reduced to pre-computed “meta-actions”: fight and retreat, and the features
were hand-crafted. None of these approaches are used as is in existing
StarCraft bots, mainly for a lack of robustness to all micromanagement
scenarios that can happen in a full game, for a lack of completeness (both can
be attributed to hand-crafting), or for a lack of computational efficiency
(speed). For a more detailed overview of AI research on StarCraft, the
reader should consult [23].

3 StarCraft micromanagement scenarios

We focus on micromanagement, which consists of optimizing each unit’s
actions during a battle. The tasks presented in this paper represent
only a subset of the complexity of playing StarCraft. As StarCraft is
a real-time strategy (RTS) game, actions are durative (are not fully
executed on the next frame), and there are approximately 24 frames per
second. As we take an action for each unit every few frames
(e.g., every 9 frames here; see Appendix A.1 for
more details), we only consider actions that can be executed in this
time frame, which are: the 8 move directions, holding the current
position, and an attack action for each of the existing enemy units. In
all tasks, we control all units from one side, and the opponent
(built-in AI in the experiments) is attacking us:

m5v5 is a task in which we control 5 Marines
(ranged ground unit), against 5 opponent
Marines. A good strategy here is to focus fire, by whatever
means. For example, we can attack the weakest opponent unit (the
unit with the least remaining life points), with tie breaking, or
attack the closest to the group.

m15v16: same as above, except we have 15 Marines and the opponent
has 16. A good strategy here is also to focus fire, while avoiding
“overkill” (spread the damage over several units once the focused fire
is enough to kill one of the opponent’s units). A Marine has 40 hit
points, and can hit for 6 hit points every 15 frames.

dragoons_zealots: symmetric armies with two types of units:
3 Zealots (melee ground unit) and 2 Dragoons (ranged ground unit). Here a
good strategy requires focusing fire and, if possible, 1) not spending too
much time having the Zealots walk instead of fight, and 2) focusing the
Dragoons (which receive full damage from both Zealots and Dragoons
while inflicting only half damage on Zealots).

w15v17: we control 15 Wraiths (ranged flying unit)
while the opponent has 17. Flying units have no “collision”,
so multiple units can occupy the same tile. Here more than
anywhere, it is important not to “overkill”: Wraiths have 120
hit points, and can hit for 20 damage on a 22 frame cooldown. As
there is no collision, moving is easier.

other mXvY or wXvY scenarios. The 4 scenarios above
are the ones on which we train our models, but they can learn
strategies that overfit a given number of units, so we have similar
scenarios but with different numbers of units (on each side).

For all these scenarios, a human expert can win 100% of the time against the
built-in AI, by moving away units that are hurt (thus conserving firepower) and
with proper focus firing.
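
As a rough illustration of the scale of these scenarios, the sketch below counts the commands available to one unit (8 move directions, holding position, one attack per enemy unit) and the resulting number of joint actions, and derives the number of hits needed to kill a unit from the hit points and damage quoted above. The function names are ours, and the computation ignores armor, shields and upgrades, which is a simplifying assumption.

```python
import math


def commands_per_unit(n_enemies: int) -> int:
    """8 move directions + hold position + one attack command per enemy unit."""
    return 8 + 1 + n_enemies


def joint_actions(n_allies: int, n_enemies: int) -> int:
    """Number of joint actions when every allied unit picks one command."""
    return commands_per_unit(n_enemies) ** n_allies


def hits_to_kill(hit_points: int, damage_per_hit: int) -> int:
    """Hits needed to kill a unit, ignoring armor and shields (assumption)."""
    return math.ceil(hit_points / damage_per_hit)


if __name__ == "__main__":
    print(commands_per_unit(16))   # m15v16: 25 commands per Marine
    print(joint_actions(15, 16))   # 25 ** 15 joint actions per decision step
    print(hits_to_kill(40, 6))     # Marine against Marine: 7 hits
    print(hits_to_kill(120, 20))   # Wraith against Wraith: 6 hits
```

Under these assumptions, if all 15 Marines of m15v16 fire at the same healthy target in one volley, 8 of the 15 shots are wasted, which is the overkill the scenario description refers to.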

4 Framework: RL and multiple units

We now describe the notation and definitions underlying the
algorithms Q-learning and policy gradient (PG) used as baselines
here. We then reformulate the joint inference over the potential
actions for different units as a greedy inference which reduces to a
usual MDP with more states but fewer actions per state. We then show
how we normalize cumulative rewards at each state in order to keep
rewards in the full interval $[-1, 1]$ during an entire episode, even
when units disappear.

4.1 Preliminaries: Q-learning and REINFORCE

Notation

The environment is approximated as an MDP, with a finite set of states
denoted by $S$. Each state $s$ has a set of units
$U(s)$, and a policy has to issue a command $c \in C$ to
each of them. The set of commands is finite. An action in that MDP is
represented as a sequence of (unit, command) pairs $a = ((u_1, c_1), \dots, (u_{|s|}, c_{|s|}))$ such that
$\{u_1, \dots, u_{|s|}\} = U(s)$. $|s|$ denotes the number of units in state
$s$ and $A(s) = (U(s) \times C)^{|s|}$ the set of actions in
state $s$. We denote by $\rho(s'|s,a)$ the
transition probability of the MDP and by $\rho_1$ the probability
distribution of initial states. When there is a transition from state
$s_t$ to a state $s_{t+1}$, the agent receives the reward $r_{t+1} = r(s_t, s_{t+1})$, where $r: S \times S \to \mathbb{R}$ is the reward function. We assume that commands are received and
executed concurrently, so that the order of commands in an action does
not alter the transition probabilities. Finally, we consider the episodic
reinforcement learning scenario, with finite horizon $T$ and
undiscounted rewards. The learner has to learn a (stochastic) policy
$\pi(a|s)$, which defines a probability distribution over
actions in $A(s)$ for every $s \in S$. The objective
is to maximize the expected undiscounted cumulative reward over
episodes $R(\pi) = \mathbb{E}\big[\sum_{t=1}^{T-1} r(s_t, s_{t+1})\big] = \mathbb{E}[\bar r(s_{1..T})]$, where the expectation is taken with
respect to $s_1 \sim \rho_1$, $s_{t+1} \sim \rho(\cdot|a_t, s_t)$ and
$a_t \sim \pi(\cdot|s_t)$.

We now briefly describe the two algorithms we use as baseline,
Q-learning [34] and REINFORCE
[46].

Q-learning

The Q-learning algorithm in the finite-horizon setting learns an
action-value function Q by solving the Bellman equation

$$\forall s \in S, \forall a \in A(s), \quad Q_t(s,a) = \sum_{s' \in S} \rho(s'|s,a) \Big( r(s,s') + \max_{a' \in A(s')} Q_{t+1}(s',a') \Big), \tag{1}$$

where $Q_t$ is the state-action value function at stage $t$ of an
episode, and $Q_T(s,a) = 0$ by convention. $Q_t(s,a)$ is also 0
whenever a terminal state is reached, and transitions from a terminal
state only go to the same terminal state.
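
To make the finite-horizon recursion of Equation (1) concrete, here is a minimal sketch of backward induction on a tiny tabular MDP. The MDP `TOY`, its state and action names, and the function `backward_induction` are hypothetical and only serve as an illustration; the experiments in this paper use neural network approximations rather than tables.

```python
from typing import Dict, List, Tuple

# Hypothetical toy MDP (not from the paper):
# TOY[s][a] is a list of (probability, next_state, reward) outcomes.
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]
TOY: Transitions = {
    "fight": {
        "attack": [(0.5, "won", 1.0), (0.5, "fight", 0.0)],
        "hold": [(1.0, "fight", 0.0)],
    },
    # Terminal state: its only transition loops back with zero reward.
    "won": {"noop": [(1.0, "won", 0.0)]},
}


def backward_induction(mdp: Transitions, horizon: int) -> Dict[int, Dict[Tuple[str, str], float]]:
    """Compute Q_t(s, a) for t = horizon, ..., 1, with Q_horizon = 0 by convention."""
    q = {horizon: {(s, a): 0.0 for s in mdp for a in mdp[s]}}
    for t in range(horizon - 1, 0, -1):
        q[t] = {}
        for s, actions in mdp.items():
            for a, outcomes in actions.items():
                # Expected immediate reward plus the best value at the next stage.
                q[t][(s, a)] = sum(
                    p * (r + max(q[t + 1][(s2, a2)] for a2 in mdp[s2]))
                    for p, s2, r in outcomes
                )
    return q


if __name__ == "__main__":
    q = backward_induction(TOY, horizon=3)
    print(q[1][("fight", "attack")])  # 0.75
    print(q[1][("fight", "hold")])    # 0.5
```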

Training is usually carried out by collecting traces $(s_t, a_t, s_{t+1}, r_{t+1})_{t=1,\dots,T-1}$ using
$\epsilon$-greedy exploration: at state $s$ and stage $t$, an
action in $\operatorname{argmax}_{a \in A(s)} Q_t(s,a)$ is
chosen with probability $1-\epsilon$, or an action in $A(s)$
is chosen uniformly at random with probability $\epsilon$. In
practice, we use stationary $Q$ functions (i.e., $Q_t = Q_{t+1}$), which are neural networks, as described in Section
5. Training is carried out using the standard online
update rule for Q-learning with function approximation (see e.g.,
[20]), which we apply in mini-batches (see Section
A.2 for more details).

This training phase is distinct from the test phase, in which we
record the average cumulative reward of the deterministic
policy
$s \mapsto \operatorname{argmax}_{a \in A(s)} Q(s,a)$.
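
The difference between the exploration policy used for training and the deterministic policy used for testing can be sketched as follows. This is a minimal illustration in which the Q function is any callable returning a score; the names `epsilon_greedy`, `greedy` and `QFunction` are ours, not the authors' implementation.

```python
import random
from typing import Callable, Hashable, Optional, Sequence

QFunction = Callable[[Hashable, Hashable], float]


def epsilon_greedy(q: QFunction, state: Hashable, actions: Sequence[Hashable],
                   epsilon: float, rng: Optional[random.Random] = None) -> Hashable:
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=lambda a: q(state, a))


def greedy(q: QFunction, state: Hashable, actions: Sequence[Hashable]) -> Hashable:
    """Deterministic test-time policy: argmax of Q over the available actions."""
    return max(actions, key=lambda a: q(state, a))


if __name__ == "__main__":
    table = {"left": 0.1, "right": 0.9}   # hypothetical Q values for one state
    q = lambda s, a: table[a]
    print(greedy(q, "s0", ["left", "right"]))                                  # right
    print(epsilon_greedy(q, "s0", ["left", "right"], 0.1, random.Random(0)))   # usually right
```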

REINFORCE

The algorithm REINFORCE belongs to the family of policy gradient algorithms
[35]. Given a stochastic policy $\pi_\Theta$
parameterized by $\Theta$, learning is carried out by generating
traces $(s_t, a_t, s_{t+1}, r_{t+1})_{t=1,\dots,T-1}$ by
following the current policy. Then, stochastic gradient updates are
performed, using the gradient estimate:

$$\sum_{t=1}^{T} \bar r(s_{t..T}) \nabla_\Theta \log(\pi_\Theta(a_t|s_t)). \tag{2}$$

We use a Gibbs policy (with temperature parameter $\tau$) as the stochastic
policy:

$$\pi_\Theta(a|s) = \frac{\exp(\phi_\Theta(a,s)/\tau)}{\sum_{b \in A(s)} \exp(\phi_\Theta(b,s)/\tau)}, \tag{3}$$

where $\phi_\Theta$ is a neural network with parameters $\Theta$ that
gives a real-valued score to each (state, action) pair. For testing, we
use the deterministic policy $\pi_\Theta(s) = \operatorname{argmax}_{a \in A(s)} \phi_\Theta(a,s)$.
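
As an illustration of Equations (2) and (3), the following sketch computes the Gibbs policy from a list of scores and the per-step REINFORCE signal with respect to those scores, using the standard identity that the derivative of the log-probability of the chosen action with respect to the score of action b is (1[b = chosen] − π(b)) / τ. In a full implementation this signal would be backpropagated through the scoring network; here the scores are plain numbers, and all function names are ours.

```python
import math
from typing import List, Tuple


def gibbs_policy(scores: List[float], tau: float) -> List[float]:
    """Equation (3): softmax of the scores divided by the temperature tau."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((x - top) / tau) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(scores: List[float], tau: float, chosen: int) -> List[float]:
    """Gradient of log pi(chosen) w.r.t. each score: (1[b == chosen] - pi(b)) / tau."""
    probs = gibbs_policy(scores, tau)
    return [((1.0 if b == chosen else 0.0) - p) / tau for b, p in enumerate(probs)]


def reinforce_signal(trace: List[Tuple[List[float], int, float]], tau: float) -> List[List[float]]:
    """One term of estimate (2) per step: return-to-go times grad log pi (w.r.t. scores)."""
    return [[ret * g for g in grad_log_prob(scores, tau, chosen)]
            for scores, chosen, ret in trace]


if __name__ == "__main__":
    print(gibbs_policy([0.0, 0.0], tau=1.0))                  # [0.5, 0.5]
    print(reinforce_signal([([0.0, 0.0], 0, 2.0)], tau=1.0))  # [[1.0, -1.0]]
```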

4.2 The MDP for greedy inference

One way to break down the complexity of jointly inferring the commands
to each individual unit is to perform greedy inference at each step:
at each state, units choose a command one by one, knowing the commands
that were previously taken by other units. Learning a greedy policy
boils down to learning a policy in another MDP with fewer
actions per state but exponentially more states, where the additional
states correspond to the intermediate steps of the greedy
inference. This reduction was previously proposed in the context of
structured prediction by Maes et al. [17], who
proved that an optimal policy in this new MDP has the same cumulative
reward as an optimal policy in the original MDP.

A natural way to define the MDP associated with greedy inference,
hereafter called greedy MDP, is to define the set of atomic actions of
the greedy policy as all possible (unit, command) pairs for the units
whose command is still not decided. This would lead to
an inference with quadratic complexity with respect to the number of
units, which is undesirable.

Another possibility is to first choose a unit, then a command to apply
to that unit, which yields an algorithm with $2|s|$ steps for
state $s$. Since the commands are executed concurrently by the
environment after all commands have been decided, the cumulative
reward does not depend on the order in which we choose the
units. Going further, we can let the environment in the greedy MDP
choose the next unit, for instance, uniformly at random among
remaining units. The resulting inference has a complexity that is
linear in the number of units.
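More formally, using the notation
$a_{1..k}$ to denote the $k$ first (unit, command) pairs of an
action $a$ (with the convention $a_{1..0} = \emptyset$), the
state space $\tilde S$ of the greedy MDP is defined by

$$\tilde S = \big\{ (s, a_{1..k}, u_{k+1}) \;\big|\; s \in S,\ 0 \le k < |s|,\ a = ((u_1, c_1), \dots, (u_{|s|}, c_{|s|})) \in A(s) \big\}.$$

The action space $A(\tilde s)$ of each state $\tilde s \in \tilde S$ is
constant and equal to the set of commands $C$. Moreover, for each
state $s$ of the original MDP, any action $a = ((u_1, c_1), \dots, (u_{|s|}, c_{|s|})) \in A(s)$, the
transition probabilities $\tilde\rho$ in the greedy MDP are defined by

$$\forall k \in \{0, \dots, |s|-1\}, \quad \tilde\rho\big((s, a_{1..k}, u_{k+1}) \,\big|\, (s, a_{1..k-1}, u_k), c_k\big) = \frac{1}{|s|-k} \tag{4}$$

and

$$\forall s' \in S, \forall u' \in U(s'), \quad \tilde\rho\big((s', \emptyset, u') \,\big|\, (s, a_{1..|s|-1}, u_{|s|}), c_{|s|}\big) = \frac{1}{|s'|} \rho(s'|s,a). \tag{5}$$

Finally, using the same notation as above, the reward function $\tilde r$ between states that represent
intermediate steps of the algorithm is 0 and the last unit to play receives the reward:

$$\tilde r\big((s, a_{1..k-1}, u_k), (s, a_{1..k}, u_{k+1})\big) = 0 \quad \text{and} \quad \tilde r\big((s, a_{1..|s|-1}, u_{|s|}), (s', \emptyset, u')\big) = r(s, s').$$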
It can be shown that an optimal policy for this greedy MDP chooses
actions that are optimal for the original MDP, because the immediate
reward in the original MDP does not depend on the order in which the
actions are taken. This result only applies if the family of policies
has enough capacity. In practice, some orderings may be easier to learn
than others, but we did not investigate this issue because the gain,
in terms of computation time, of the random ordering was critical for
the experiments.
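
The greedy inference with a random unit ordering can be sketched as follows. The scoring function is passed in as a callable; `avoid_overkill_score` is a hypothetical hand-written stand-in for a learned model, used only to show that a unit's choice can depend on the commands already decided.

```python
import random
from typing import Callable, List, Sequence, Tuple

Decided = List[Tuple[str, str]]                  # (unit, command) pairs chosen so far
Score = Callable[[object, Decided, str, str], float]


def greedy_inference(state: object, units: Sequence[str], commands: Sequence[str],
                     score: Score, rng: random.Random) -> Decided:
    """Choose one command per unit, one unit at a time, in a random order."""
    order = list(units)
    rng.shuffle(order)  # same as drawing the next unit uniformly among the remaining ones
    decided: Decided = []
    for unit in order:
        # Each unit sees the state and the commands already decided.
        best = max(commands, key=lambda c: score(state, decided, unit, c))
        decided.append((unit, best))
    return decided


def avoid_overkill_score(state: object, decided: Decided, unit: str, command: str) -> float:
    """Hypothetical hand-written score: at most two units should attack enemy A."""
    on_a = sum(1 for _, c in decided if c == "attack_A")
    if command == "attack_A":
        return 1.0 if on_a < 2 else 0.0
    return 0.5


if __name__ == "__main__":
    plan = greedy_inference(None, ["m1", "m2", "m3"], ["attack_A", "attack_B"],
                            avoid_overkill_score, random.Random(0))
    print(sorted(c for _, c in plan))  # ['attack_A', 'attack_A', 'attack_B']
```

Each unit evaluates every command once, so the number of score evaluations is the number of units times the number of commands, which is linear in the number of units for a fixed command set.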

4.3 Normalized cumulative rewards

Immediate rewards are necessary to provide feedback that guides
exploration. In the case of micromanagement, a natural reward signal
is the difference between damage inflicted and incurred between two
states. The cumulative reward over an episode is the total damage
inflicted minus the total damage incurred along the episode. However,
the scale of this quantity heavily depends on the number of units
(both our units and enemy units) that are present in the state, a
quantity which significantly decreases along an episode. Without
proper normalization with respect to the number of units in the
current state, learning will be artificially biased towards the large
immediate rewards at the beginning of the episode.

We present a simple method to normalize the immediate rewards on a
per-state basis, assuming that a scale factor $z(s)$ is
available to the learner – it can be as simple as the number of units.
Then, instead of considering cumulative rewards from a starting state
$s_t$,
we define normalized cumulative rewards $\bar n_{t..T}$ as the following
recursive computation over an episode:

$$\forall t \in \{1, \dots, T-1\}, \quad \bar n_{t..T} = \frac{r_{t+1} + z(s_{t+1})\, \bar n_{t+1..T}}{z(s_t)}.$$

These normalized rewards maintain the invariant $\bar n_{t..T} = \frac{\bar r_{t..T}}{z(s_t)}$; but more importantly, the
normalization can be applied to the Bellman equation
(1), which becomes

$$Q_t(s,a) = \sum_{s' \in S} \frac{\rho(s'|s,a)}{z(s)} \Big( r(s,s') + z(s') \max_{a' \in A(s')} Q_{t+1}(s',a') \Big).$$

The stochastic gradient updates for Q-learning can easily be modified
accordingly, as well as the gradient estimate in REINFORCE
(2) in which we replace $\bar r$ by $\bar n$.

One way to look at this normalization process is to consider that the
reward is $\frac{r_{t+1}}{z(s_t)}$, and
$\frac{z(s_{t+1})}{z(s_t)}$ plays the role of an
(adaptive) discount factor, which is chosen to be at most 1, and
strictly smaller than 1 when the number of units changes.
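
The recursion for the normalized cumulative rewards and its invariant can be checked numerically with a short sketch. Indices are 0-based in the code, the terminal value is taken to be 0, and the reward and scale values in the example are made up for illustration.

```python
from typing import List


def normalized_returns(rewards: List[float], scales: List[float]) -> List[float]:
    """rewards[i] is the reward of the transition i -> i+1; scales[i] is z(s_i).

    Requires len(scales) == len(rewards) + 1. The value at the last state is 0.
    """
    n = [0.0] * len(scales)
    for i in range(len(rewards) - 1, -1, -1):
        n[i] = (rewards[i] + scales[i + 1] * n[i + 1]) / scales[i]
    return n


def plain_returns(rewards: List[float]) -> List[float]:
    """Unnormalized cumulative reward from each state to the end of the episode."""
    out = [0.0] * (len(rewards) + 1)
    for i in range(len(rewards) - 1, -1, -1):
        out[i] = rewards[i] + out[i + 1]
    return out


if __name__ == "__main__":
    rewards = [2.0, -1.0, 3.0]       # made-up damage differences
    scales = [4.0, 4.0, 2.0, 1.0]    # made-up z(s), e.g. number of units alive
    print(normalized_returns(rewards, scales))  # [1.0, 0.5, 1.5, 0.0]
    print(plain_returns(rewards))               # [4.0, 2.0, 3.0, 0.0]
```

Dividing each entry of the second list by the corresponding scale gives the first list, which is the invariant stated above.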

5 Features and model for micromanagement in StarCraft

The features and models we use are intended to test the ability of RL
algorithms to learn strategies when given as little
prior knowledge as possible. We deliberately restrict ourselves to raw
features extracted from the state description given by the game
engine, without encoding any prior knowledge of the game
dynamics. This contrasts with prior work on Q-learning for
micromanagement such as [45], which uses features
such as the expected inflicted damage. We do not allow ourselves
to encode the effect of an attack action on the hit points
of the attacked unit; nor do we construct cross-features or
provide any relevant discretization of the features (e.g., whether
unit A is in the range of unit B). The only transformation of the raw
features we perform is the computation of distances between units and
(between) targets of commands.

We represent a state as a sequence of feature vectors, one feature
vector per unit (ally or enemy) in the state. Recall that each
state in the greedy MDP is a tuple $\tilde s = (s, a_{1..k}, u_{k+1})$ and an action in that MDP corresponds to a command $c$
that $u_{k+1}$ shall execute. At each frame and for each unit, the
commands we consider are (1) attack a given enemy unit, and (2) move
to a specific position. In order to reduce the number of possible move
commands, we only consider 9 move commands, which either correspond
to a move in one of the 8 basic directions, or staying at the same
position.

Attack commands are non-trivial to featurize because the model needs
to be able to resolve the reference from the identifiers of the units that
attack or are attacked to their corresponding attributes. In order to
solve this issue, we construct a joint state/action feature
representation in which the unit positions (coordinates on the map)
are indirectly used to refer to the units. We now detail the feature
representation we use and the neural network model.

5.1 Raw state information and featurization

For each unit (ally or enemy), the following attributes are extracted
from the raw state description given by the game engine:

Unit attributes: the unit type, its coordinates on the
map (pos), the remaining number of hit points (hp),
the shield, which corresponds to additional hit points that
can be recovered when the unit is not attacked, and, finally the
weapon cooldown (cd, number of frames to wait to be able to
inflict damage again). An additional flag enemy is
used to distinguish between our units and enemy units.

Two attributes that describe the command that is currently
executed by the unit. First, the target attribute, which, if
not empty, is the identifier of the enemy unit currently under
attack. This identifier is an integer that is attributed arbitrarily
by the game engine, and does not convey any semantics. We do not
encode directly this identifier in the model, but rather only use
the position of the (target) unit (as we describe below in the
distance features).

The second attribute, target_pos gives the
coordinates on the map of the position of the target (the desired
destination if the unit is currently moving, or the position of the
target if the latter is not empty). These fields are
available for both ally and enemy units; from these we infer the
current command cur_cmd that the unit currently performs.

In order to assign a score to a tuple $((s, a_{1..k}, u_{k+1}), c)$ where $c$ is a candidate command for $u_{k+1}$,
the joint representation is defined as a sequence of feature vectors,
one for each unit $u \in U(s)$. The feature vector for unit
$u$, which is denoted by $F(u, a_{1..k}, u_{k+1}, c)$, is
a joint representation of $u$ together with its next command next_cmd if it
has already been decided (i.e., if $u$ is an ally unit whose next
command is in $a_{1..k}$), and of the command $c$ that is
evaluated for $u_{k+1}$. All commands have a field act_type
(attack or move) and a field target_pos. If we want to
featurize a command that is not available for a given unit, such as
the next command for an enemy unit, we set act_type to a
specific “no command” value and target_pos to the unit
position.

Given $u \in U(s)$, the vector $F(u, a_{1..k}, u_{k+1}, c)$
contains the 17 features described below. We use an object-oriented
programming notation “a.b” to refer to the value of attribute b of a:

Non-positional features

u.enemy (boolean), u.type
(categorical, one-hot encoding), u.hp, u.shield, u.cd (all
three real-valued), u.cur_cmd.act_type (categorical, one-hot
encoding), u.next_cmd.act_type, $u_{k+1}$.type. At this
stage, we do not encode the type of the command
c.act_type, which is another input to the network (see
Section 5.2).

Relative distance features

$\|a - b\|$ for
$a \in \{u.\mathrm{pos},\ u.\mathrm{cur\_cmd.target\_pos},\ u.\mathrm{next\_cmd.target\_pos}\}$ and
$b \in \{u_{k+1}.\mathrm{pos},\ u_{k+1}.\mathrm{cur\_cmd.target\_pos},\ c.\mathrm{target\_pos}\}$. These features,
in particular, encode which unit is $u_{k+1}$ because the distance
between its position and $u_{k+1}.\mathrm{pos}$ is 0. They also encode which unit is the target of
the command, and which units have the same target. This encoding is
unambiguous as long as units cannot have the same position, which is
not true for flying units (units that have the same position, either as
actor or target of a command, will be treated as the same). In
practice however, the confusion of units did not seem to be a major
issue since units rarely have exactly the same position.

Finally, the full (state, action) tuple of the greedy MDP $((s, a_{1..k}, u_{k+1}), c)$ is represented by an
$|U(s)| \times 17$ matrix, in which the $j$-th row is
$F(u_j, a_{1..k}, u_{k+1}, c)$. The model, which we describe
below, deals with the variable-size input with global pooling
operations.
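
The relative distance features can be sketched as the 3 × 3 grid of Euclidean distances between the three positions attached to a unit u and the three positions attached to the acting unit and the candidate command. Together with the 8 non-positional attributes listed above (counting each categorical attribute once, which is our reading of the text), this gives 9 + 8 = 17 entries per row. The function name and the example coordinates are ours.

```python
import math
from typing import List, Tuple

Pos = Tuple[float, float]


def relative_distances(unit_positions: List[Pos], reference_positions: List[Pos]) -> List[float]:
    """All pairwise distances ||a - b||.

    unit_positions:      [u.pos, u.cur_cmd.target_pos, u.next_cmd.target_pos]
    reference_positions: [u_{k+1}.pos, u_{k+1}.cur_cmd.target_pos, c.target_pos]
    """
    return [math.dist(a, b) for a in unit_positions for b in reference_positions]


if __name__ == "__main__":
    # Made-up coordinates: here u is the acting unit u_{k+1} itself, currently
    # targeting (3, 4), with no next command yet (target_pos set to its own position).
    u = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
    ref = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
    feats = relative_distances(u, ref)
    print(len(feats))   # 9
    print(feats[0])     # 0.0: u and u_{k+1} share a position
```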

5.2 Deep Neural Network model

As we shall see in Section 6, we consider
state-action scoring functions of the form
$\operatorname{argmax}_{c \in C} \langle w, \Psi_\theta((s, a_{1..k}, u_{k+1}), c) \rangle$, where $w$
is a vector in $\mathbb{R}^d$ and $\Psi_\theta((s, a_{1..k}, u_{k+1}), c)$ is a deep network with parameters $\theta$ which
embeds the state and command of the greedy MDP into $\mathbb{R}^d$.

The embedding network takes as input the $|U(s)| \times 17$ matrix described in Section 5.1, and operates in two steps:

(1) Cross featurization and pooling: in this step, each row
$F(u_j, a_{1..k}, u_{k+1}, c)$ goes through a 2-layer neural
network, with each layer of width 100, with an ELU nonlinearity
[6] for the first layer and hyperbolic tangents as
final activation functions. The resulting $|U(s)| \times 100$ matrix is then aggregated into two different vectors of
size 100: the first one by taking the mean value of each column
(average pooling), and the second one by taking the maximum (max
pooling). The two vectors are then concatenated and yield a
200-dimensional vector for the next step. We can note that this
final fixed-length representation is invariant to the ordering of rows
of the original matrix.

(2) Scoring with respect to action type: the 200-dimensional
vector is then concatenated with the type of action c.act_type
(one-hot encoding of two values: attack or move). The concatenation
goes through a 2-layer network with 100 activation units at each
layer. The first non-linearity is an ELU, while the second is a
rectified linear unit (ReLU).
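
The pooling in step (1) is what makes the representation independent of the number and ordering of units. A minimal sketch follows, in which `toy_row_fn` is a hypothetical stand-in for the shared 2-layer per-row network (it is not the architecture described above) and the matrix is a plain list of rows.

```python
from typing import Callable, List

Row = List[float]


def pool_rows(rows: List[Row], row_fn: Callable[[Row], Row]) -> Row:
    """Apply the same function to every row, then concatenate mean and max pooling."""
    embedded = [row_fn(r) for r in rows]
    columns = list(zip(*embedded))
    means = [sum(col) / len(col) for col in columns]
    maxes = [max(col) for col in columns]
    return means + maxes


def toy_row_fn(row: Row) -> Row:
    """Hypothetical stand-in for the per-row network: a fixed elementwise map."""
    return [max(0.0, x) for x in row]


if __name__ == "__main__":
    rows = [[1.0, -2.0], [3.0, 0.5], [-1.0, 4.0]]   # made-up 3-unit input
    pooled = pool_rows(rows, toy_row_fn)
    print(len(pooled))                                      # 4: twice the row width
    print(pooled == pool_rows(rows[::-1], toy_row_fn))      # True: order does not matter
```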

The rationale behind this model is that it can represent the answer
to a variety of questions regarding the relationship between the
candidate command and the state, such as: what is the type of unit of
the command’s target? How much damage will be inflicted? How many
units already have the same target? How many units are attacking
$u_{k+1}$?

Yet, in order to answer these questions, the learner must construct the
appropriate cross-features and parameter updates from the reinforcement
signal alone, so the learning task is non-trivial even for fairly
simple strategies.

6 Combining backpropagation and zero-order gradient estimates

We now present our algorithm for exploring deterministic policies in
discrete action spaces, based on policies parameterized by a deep
neural network. Our algorithm is inspired by finite-difference methods
for stochastic gradient-free optimization
[14, 21, 30] as well
as exploration strategies in parameter space
[26]. This algorithm can be viewed as a
heuristic. We present it within the general MDP formulation of Section
4.1 for simplicity, although our experiments apply
it to the greedy MDP of Section 4.2.

As described in Section 5.2, we consider the case where
(state, action) pairs $(s, a)$ are embedded by a parametric
function $\Psi_\theta(s,a) \in \mathbb{R}^d$. The deterministic
policy is parameterized by an additional vector $w \in \mathbb{R}^d$, so that
the action $\pi_{w,\theta}(s)$ taken at state $s$ is
defined as

$$\pi_{w,\theta}(s) = \operatorname{argmax}_{a \in A(s)} \langle w, \Psi_\theta(s,a) \rangle.$$

The overall algorithm is described in Algorithm 1. In order to
explore the policy space in a consistent manner during an episode, we
uniformly sample a vector $u$ on the unit sphere and run the policy
$\pi_{w+\delta u,\theta}$ for the whole episode, where $\delta > 0$ is
a hyper-parameter.

In addition to implementing a local random search in the policy space,
the motivation for this randomization comes from stochastic
gradient-free optimization
[14, 21, 30, 9, 11],
where the gradient of a differentiable function $x \in \mathbb{R}^d \mapsto f(x)$ can be estimated with finite difference methods by

$$\nabla f(x) \approx \mathbb{E}\Big[\frac{d}{\delta} f(x + \delta u)\, u\Big],$$

where the expectation is taken over the vector $u$ sampled on the unit
sphere [21, chapter 9.3]. The constant
$\frac{d}{\delta}$ will be absorbed by learning rates, so we ignore it
in the following. Thus, given a (state, action) pair $(s,a)$
and the observed cumulative reward $\bar r$, we use $\bar r u$ as an
estimator of the gradient of the expected cumulative reward with
respect to $w$ (line (*) of Algorithm 1).
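
The finite-difference estimator above can be checked numerically on a function whose gradient is known. The sketch below averages (d/δ) f(x + δu) u over vectors u drawn uniformly on the unit sphere (obtained by normalizing Gaussian samples) for a linear function, whose gradient is its coefficient vector. The function names, the test function and the sample size are our own choices for illustration.

```python
import math
import random
from typing import Callable, List


def sample_unit_sphere(d: int, rng: random.Random) -> List[float]:
    """Uniform sample on the unit sphere: a normalized standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def zero_order_gradient(f: Callable[[List[float]], float], x: List[float],
                        delta: float, n_samples: int, rng: random.Random) -> List[float]:
    """Monte Carlo average of (d / delta) * f(x + delta * u) * u."""
    d = len(x)
    grad = [0.0] * d
    for _ in range(n_samples):
        u = sample_unit_sphere(d, rng)
        value = f([xi + delta * ui for xi, ui in zip(x, u)])
        for i in range(d):
            grad[i] += (d / delta) * value * u[i] / n_samples
    return grad


if __name__ == "__main__":
    coeffs = [1.0, 2.0, -1.0]   # made-up linear function f(x) = <coeffs, x>
    f = lambda x: sum(c * xi for c, xi in zip(coeffs, x))
    estimate = zero_order_gradient(f, [0.1, -0.2, 0.3], 0.5, 20000, random.Random(0))
    print(estimate)   # close to [1.0, 2.0, -1.0], up to Monte Carlo noise
```

The estimate is noisy for a small number of samples, which is consistent with the remark that parameter randomization alone learns slowly when the number of parameters is large.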

The motivation for the update of the network parameters is the
following: given a function $(w \in \mathbb{R}^d, v \in \mathbb{R}^d) \mapsto g(\langle w, v \rangle)$, we have $\nabla_w g = g'(\langle w, v \rangle)\, v$ and $\nabla_v g = g'(\langle w, v \rangle)\, w$. Denoting by $\frac{w}{v}$ the term-by-term
division of vectors (assuming $v$ contains only non-zero values) and
$\odot$ the term-by-term multiplication operator, we obtain $\nabla_v g = (\nabla_w g) \odot \frac{w}{v}$. The update (**) in the algorithm
corresponds to taking $v = \Psi_\theta(s,a)$ in the above, and
using $\bar r u$ as the estimated gradient of the cumulative reward
with respect to $w$, as before. Since we need to make sure that the
ratios $\frac{w}{\Psi_\theta(s,a)}$ are bounded, in practice
we use the sign of $\frac{w}{\Psi_\theta(s,a)}$ to avoid
numerical issues. This “estimated” gradient is then backpropagated
through the network. Preliminary experiments suggested that taking the
sign was as effective as e.g., clipping, and was simpler since there
is no parameter, so we use this heuristic in all our experiments.
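
Based on the description above, the signal that is backpropagated through the embedding network can be sketched as the term-by-term product of the estimated gradient with respect to w and the sign of the ratio between w and the embedding. This is our reading of the text, not a reproduction of Algorithm 1. Since the sign of a ratio equals the product of the signs, the code avoids the division; setting the result to 0 where the embedding is exactly 0 is an assumption on our part.

```python
from typing import List


def sign(x: float) -> int:
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def embedding_gradient(r_bar: float, u: List[float], w: List[float],
                       psi: List[float]) -> List[float]:
    """Estimated gradient w.r.t. the embedding: (r_bar * u) * sign(w / psi), term by term.

    sign(w_i / psi_i) is computed as sign(w_i) * sign(psi_i); entries where
    psi_i == 0 get a zero gradient (assumption, the ratio is undefined there).
    """
    return [r_bar * ui * sign(wi) * sign(pi) for ui, wi, pi in zip(u, w, psi)]


if __name__ == "__main__":
    # Made-up values for a 3-dimensional embedding.
    print(embedding_gradient(2.0, [0.5, -0.5, 1.0], [1.0, -3.0, 2.0], [0.2, 0.4, 0.0]))
    # [1.0, 1.0, 0.0]
```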

The reasoning above is only a partial justification of the update
rule (**) of Algorithm 1, because we neglected the dependency
between the parameters and the argmax operation that chooses the
actions. Nonetheless, considering (**) as a crude approximation to
some real estimator of the gradient seems to work very well in
practice, as we shall see in our experiments. Finally, we use Adagrad
[8] to update the parameters of the different
layers. We found the use of Adagrad’s update scheme fairly important
in practice, compared to other approaches such as
RMSProp [42], even though RMSProp tended to work
slightly better with Q-learning or REINFORCE in our experiments.

7 Experiments

7.1 Setup

We use Torch7 (www.torch.ch) for all our experiments. We connect
our Torch code and models to StarCraft through a socket server. We ran experiments with deep Q networks (DQN)
[19], policy gradient (PG) [46], and
zero order (ZO). We did an extensive hyper-parameter search, in
particular over ϵ (for epsilon-greedy exploration in DQN), τ (for
policy gradient’s softmax), learning rates, optimization methods, RL algorithm
variants, and potential annealing schedules. See Appendix A.2 for
more details.

7.2 Baseline heuristics

As all the results that we report are against the built-in AI, we compare our
win rates to the ones of (strong) baseline heuristics. Some of these heuristics
often perform the micromanagement in full-fledged StarCraft bots
[23], and are the basis of heuristic search
[5]. The baselines are the following:

random no change (rand_nc): select a random target for each
of our units and do not change this target before it dies (or our unit
dies). This spreads damage over several enemy units, and can be rather
bad when there are collisions (because it can require our units to move
a lot to be in range of their target).

noop: literally send no action, which is something that our models are
forbidden to do. In this case, the built-in AI
will control our units, so this exhibits the symmetry (or not!) of a
given scenario. As we are always in a defensive position, with the
enemy commanded to walk towards us, all other things being equal (number of
units), it should be easier for the defending built-in AI than for the
attacking one.

closest (c): each of our units targets the enemy
unit closest to it. This is not a bad heuristic, as the formation of the
enemy units (because of collisions) usually ensures that several of
our units have the same opponent unit as closest unit (some form
of focus firing), but not all of them (no overkill). It is also
quite robust for melee units (e.g. Zealots) as it means they
spend less time moving and more time attacking.

weakest closest (wc): each of our units targets the weakest
enemy unit. The distance of the enemy unit to the center of mass of our
units is used for tie-breaking. This may overkill.

no overkill no change (nok_nc): same as the weakest closest
heuristic, but register the number of our units that target each
opponent unit, choosing another target to focus fire when it becomes
overkill to keep targeting a given unit. Each of our units keeps firing
on its target without changing (changing would lead to erratic behavior).
Note that the “no overkill” component of the heuristic cannot easily
take the dynamics of the game into account, and so if our units die
without doing their expected damage on their target, “no overkill”
can be detrimental (as it is implemented).
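Two of the target-selection baselines above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiments: the `Unit` record is hypothetical, "weakest" is read as "lowest remaining hit points", and distances are Euclidean.

```python
import math
from dataclasses import dataclass

@dataclass
class Unit:
    x: float
    y: float
    hp: float

def closest_target(unit, enemies):
    """'closest' (c): index of the enemy nearest to this unit."""
    return min(range(len(enemies)),
               key=lambda i: math.hypot(enemies[i].x - unit.x,
                                        enemies[i].y - unit.y))

def weakest_closest_target(our_units, enemies):
    """'weakest closest' (wc): lowest-hp enemy, ties broken by distance
    to the center of mass of our units. All our units share this target."""
    cx = sum(u.x for u in our_units) / len(our_units)
    cy = sum(u.y for u in our_units) / len(our_units)
    return min(range(len(enemies)),
               key=lambda i: (enemies[i].hp,
                              math.hypot(enemies[i].x - cx, enemies[i].y - cy)))

ours = [Unit(0.0, 0.0, 40.0), Unit(2.0, 0.0, 40.0)]
enemies = [Unit(3.0, 0.0, 40.0), Unit(1.0, 4.0, 30.0), Unit(-4.0, 0.0, 30.0)]
c_choice = closest_target(ours[0], enemies)
wc_choice = weakest_closest_target(ours, enemies)
```

In this toy configuration the two heuristics disagree: `closest` picks the undamaged enemy next to the unit, while `weakest closest` sends every unit to the damaged enemy nearest to the group, which is where overkill can arise.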

7.3 Results

The first thing that we looked at was the sliding average win rate (over 400
battles) during training against the built-in AI of the various models. In
Figure 1, we can see that DQN is much more dependent
on initialization and more variable (fickle) than zero order (ZO). DQN can
unlearn, reach suboptimal plateaux, or overall need a lot of exploration to
start learning (high sample complexity).
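A sliding average win rate of this kind can be tracked with a fixed-size window, as in the minimal sketch below. The class name is hypothetical; only the window size of 400 battles comes from the text.

```python
from collections import deque

class SlidingWinRate:
    """Win rate over the most recent `window` battles."""

    def __init__(self, window=400):
        # A bounded deque drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window)

    def update(self, won):
        """Record one battle outcome and return the current win rate."""
        self.outcomes.append(1 if won else 0)
        return sum(self.outcomes) / len(self.outcomes)

tracker = SlidingWinRate(window=3)   # tiny window, for illustration only
history = [tracker.update(won) for won in (True, False, True, True)]
```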

Figure 1: Example of the training uncertainty (one standard deviation) on 5
different initializations for DQN (left) and zero-order (right) on the m5v5
scenario.

For all the results that we present in Tables 1
and 2, we ran the models in “test mode”
by making them deterministic. For DQN we remove the epsilon-greedy
exploration (set ϵ=0), for PG we do not sample in the Gibbs
policy but instead take the value-maximizing action, and for ZO we do
not add noise to the last layer.
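For the policy gradient case, the difference between training and test mode can be sketched as follows: during training an action is sampled from a Gibbs (softmax) distribution with temperature τ, whereas in test mode the highest-scoring action is taken. The function names and the use of raw scores as input are assumptions made for this illustration.

```python
import math
import random

def gibbs_probabilities(scores, tau):
    """Softmax of scores / tau, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(scores, tau, rng=None, test_mode=False):
    """Sample from the Gibbs policy, or take the argmax in test mode."""
    if test_mode:
        return max(range(len(scores)), key=scores.__getitem__)
    probs = gibbs_probabilities(scores, tau)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]

scores = [0.1, 2.0, -1.0]
probs = gibbs_probabilities(scores, tau=1.0)
train_action = select_action(scores, tau=1.0, rng=random.Random(0))
test_action = select_action(scores, tau=1.0, test_mode=True)
```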

We can see in Table 1 that m15v16 favors
our player’s side (noop is at 81% win rate), whereas
w15v17 is hard (c is at 20% win rate). By looking
just at the results of the heuristics, we can see that overkill is a
problem on m15v16 and w15v17 (nok_nc is better than
wc). “Attack closest” (c) is approximately as
good as nok_nc at spreading damage, and thus better on
m15v16 because there are lots of collisions (and attacking the closest
unit is going to trigger fewer movements).

Overall, the zero order optimization outperforms both DQN and PG
(REINFORCE) on most of the maps. The only map on which DQN and PG
perform well is m5v5. It seems to be easier to learn a focus firing
heuristic (e.g. “attack weakest”) by identifying and locking on a
feature, than to also learn not to “overkill”.

| map | rand_nc | noop | c | wc | nok_nc | DQN | PG | ZO |
|---|---|---|---|---|---|---|---|---|
| dragoons_zealots | .14 | .49 | .67 | .83 | .50 | .61 | .69 | **.90** |
| m5v5 | .49 | .84 | .94 | .96 | .83 | .99 | .92 | **1.** |
| m15v16 | .00 | **.81** | **.81** | .10 | .68 | .13 | .19 | .79 |
| w15v17 | .19 | .10 | .20 | .02 | .12 | .16 | .14 | **.49** |

Table 1: Test win rates over 1000 battles for the training scenarios, for all methods and for heuristics baselines. The best result for a given map is in bold.

We then studied how well a model trained on one of the previous maps
performs on maps with a different number of units, to test
generalization. Table 2 contains the
results for this experiment. We observe that DQN performs the best on
m5v5 when trained on m15v16, because it learned a simpler (but more
efficient on m5v5) heuristic. “Noop” and “attack closest” are
quite good with the large Marines map because they generate fewer moves
(and fewer collisions). Overall, ZO is consistently significantly
better than other RL algorithms on these generalization tasks, even
though it does not reach an optimal strategy.

| train map | test map | best heuristic | DQN | PG | ZO |
|---|---|---|---|---|---|
| m15v16 | m5v5 | **.96** (wc/c) | **.96** | .79 | .80 |
| m15v16 | m15v15 | **.97** (c) | .27 | .16 | *.80* |
| m15v16 | m18v18 | **.98** (c/noop) | .18 | .25 | *.82* |
| m15v16 | m18v20 | **.63** (noop) | .00 | .01 | *.17* |
| w15v17 | w5v5 | **.78** (c) | .70 | .70 | *.74* |
| w15v17 | w15v13 | **1.** (rand_nc/c) | **1.** | .99 | **1.** |
| w15v17 | w15v15 | .95 (c) | .87 | .61 | **.99** |
| w15v17 | w18v18 | .99 (c) | .92 | .56 | **1.** |
| w15v17 | w18v20 | .71 (c) | .31 | .24 | **.76** |

Table 2: Win rates over 1000 games for out-of-training-domain maps, for all methods. The map on which each method was trained is indicated on the left. The best result is in bold; the best result among the reinforcement learning methods is in italics.

7.4 Interpretation of the learned policies

We visually inspected the model’s performance on large battles. On the larger
Marines map (m15v16), DQN learned to focus fire. Because this map has many
units, focus firing leads to units bumping into each other to try to focus on a
single unit. The PG player seemed to have a policy that attacks the closest
marine, though it doesn’t do a good job switching targets. The Marines that are
not in range often bump into each other. Our zero order optimization learns a
hybrid between focus firing and attacking the closest unit. Units would switch
to other units in range if possible, but still focus on specific targets. This
leads to most Marines attacking constantly, as well as focus firing when they
can. However, the learned strategy was not perfected, since Marines would still
split their fire occasionally when left with few units.

In the Wraiths map (w15v17), the DQN player’s strategy was hard to decipher.
The most likely explanation is that it tried to attack the closest target,
though it is likely the algorithm did not converge to a specific strategy. The
PG player learned to focus fire. However, because it only takes 6 Wraiths to
kill another, 9 actions are "wasted" (at the beginning
of the fight, when all our units are alive). Our zero order player learns that
focusing only on one enemy is not good, but it does not learn how many attacks
are necessary. This leads to a much higher win rate, but the player still
assigns more than 6 Wraiths to an enemy target (maybe for robustness to the
loss of one of our units), and occasionally will not focus fire when only a few
Wraiths are remaining. This is similar to what the zero order player learned
during the Marines scenario.

8 Conclusion

This paper presents two main contributions. First, it establishes StarCraft
micromanagement scenarios as complex benchmarks for reinforcement learning:
with durative actions, delayed rewards, and large action spaces making random
exploration infeasible. Second, it introduces a new reinforcement learning
algorithm that performs better than prior work (DQN, PG) for discrete action
spaces in these micromanagement scenarios, with robust training (see
Figure 1) and episodically consistent exploration
(exploring in the policy space).

This work leaves several doors open and calls for future work. Simpler
embedding models of state and actions, and variants of the model
presented here, have been tried, none of which produced efficient
unit movement (e.g. taking a unit out of the fight when its hit
points are low). There is ongoing work on convolutional networks based
models that conserve the 2D geometry of the game (while embedding the
discrete components of the state and actions). The zero order
optimization technique presented here should be studied more in depth,
and empirically evaluated on domains other than StarCraft
(e.g. Atari). As for StarCraft scenarios specifically, the subsequent
experiments will include self-play (training and evaluation),
multi-map training (training more generic models), and more complex
scenarios which include several types of advanced units with actions
other than move and attack. Finally, the goal of playing full games of
StarCraft should not get lost, so future scenarios would also include
the actions of “recruiting” units (deciding which types of unit to use),
as well as making use of them.

Acknowledgements

We thank Y-Lan Boureau, Antoine Bordes, Florent Perronnin, Dave
Churchill, Léon Bottou and Alexander Miller for helpful discussions
and feedback about this work and earlier versions of the paper. We
thank Timothée Lacroix and Alex Auvolat for technical
contributions to our StarCraft/Torch bridge. We thank Davide Cavalca
for his support on Windows virtual machines in our cluster
environment.

Appendix A

A.1 StarCraft specifics

We argue that using existing video games for RL experiments is interesting
because the simulators are oftentimes complex, and we (the AI programmers) do
not have control over the source code of the simulator. In RTS games like
StarCraft, we do not have access to a simulator (and writing one would be a
daunting task), so we cannot use (Monte Carlo) tree search
[10] directly, even less so in the setting of full games
[23]. In this paper, we consider the problem of
micromanagement scenarios, a subset of full RTS play. Micromanagement is
about making good use of a given set of units in an RTS game. Units have
different features, like range, cooldown, hit points (health), attack power,
move speed, collision box etc. These numerous features and the dynamics of the
game give an advantage to players who take the right actions at the right times.
Specifically for StarCraft, for which there are professional
players, very good competitive players and professional players perform more
than 300 actions per minute during intense battles.

We ran all our experiments on simple scenarios of battles of an
RTS game: StarCraft: Broodwar. These scenarios can be considered small scale
for StarCraft, but they already prove challenging for existing RL approaches.
For an example scenario of 15 units (that we control) against 16 enemy units,
even while reducing the action space to "atomic" actions (surrounding moves,
and attacks), we obtain 24 (8+16) possible discrete actions per unit for our
controller to choose from ($24^{15}$ actions in total) at the beginning of the
battle. Battles last for tens of seconds, with durative actions, simultaneous
moves, and at 24 frames per second. The strategies that we need to learn
consist of coordinated sets of actions that may need to be repeated, e.g. focus
firing without overkill. We use a featurization that gives access only to the
state from the game; we do not pre-process the state to make it easier to learn
a given strategy, thus keeping the problem elegant and unbiased.
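The size of the joint action space quoted above can be checked with a few lines. The split of the 24 per-unit actions into 8 moves and 16 attack targets (one per enemy unit) is our reading of "8+16"; the function name is hypothetical.

```python
def joint_action_count(n_units, n_moves, n_targets):
    """Per-unit and joint numbers of discrete actions at the start of a battle."""
    per_unit = n_moves + n_targets
    return per_unit, per_unit ** n_units

per_unit, joint = joint_action_count(n_units=15, n_moves=8, n_targets=16)
print(per_unit)            # number of actions available to each unit
print(f"{joint:.3e}")      # joint action count, in scientific notation
```

The joint count is on the order of $5 \times 10^{20}$, which illustrates why uniformly random exploration over joint actions is impractical even in these small scenarios.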

For most of these tasks (“maps”), the number of units that our RL agent has
to consider changes over an episode (a battle), as does its number of actions.
A consequence of playing in this specific adversarial environment is that
if the units do not follow a coherent strategy for a sufficient amount of time,
they will suffer an unrecoverable loss, and the game will reach a state in
which the units die very rapidly and deal little damage,
independently of how they play: a state that is mostly useless for learning.

Our tasks (“maps”) represent battles with homogeneous types of units, or with
little diversity (2 types of unit for each of the players). For instance, they
may use a unit of type Marine, that is one soldier with 40 hit points, an
average move speed, an average range (approximately 10 times its collision
size), 15 frames of cooldown, 6 of attack power of normal damage type (so a
damage per second of 9.6 hit points per second, on a unit without armor).
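The damage-per-second figure follows directly from the attack power, the cooldown and the frame rate, as the short computation below shows. The function name is hypothetical, and armor is ignored, as in the text; the time-to-kill line is an added illustration derived from the same numbers.

```python
def damage_per_second(attack_power, cooldown_frames, frames_per_second=24):
    """Damage dealt per second by a unit attacking continuously (no armor)."""
    attacks_per_second = frames_per_second / cooldown_frames
    return attack_power * attacks_per_second

marine_dps = damage_per_second(attack_power=6, cooldown_frames=15)
# Time for a single Marine to kill another 40-hit-point Marine, in seconds.
seconds_to_kill = 40 / marine_dps
```

With 24 frames per second and a 15-frame cooldown, a Marine attacks 1.6 times per second, hence 6 × 1.6 = 9.6 hit points per second.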

On symmetric and/or monotyped maps, strategies that are required to win (on
average) are “focus firing”, without overkill (not more units targeting a
unit than what is needed to kill it). For perfect win rates, some maps may
require that the AI moves its units out from the focus firing of the opponent.

A.2 Hyper-parameters

Taking an action on every frame (24 times per second at the speed at
which humans play StarCraft) for every unit would spam the game
needlessly, and it would actually prevent the units from
moving (because several actions are durative, including
moves; moves have a dynamic consisting of per-unit-type turn rate,
max speed, and acceleration parameters). We take actions for all
units synchronously on the same frame, every skip_frames
frames. We tried several values of this hyper-parameter (5, 7, 9, 11,
13, 17) and we only saw smooth changes in performance. We ran all the
following experiments with a skip_frames of 9 (meaning that
we take about 2.7 actions per unit per second, since 24/9 ≈ 2.67). We also report the
strongest numbers for the baselines over all these skip_frames values. We
optimize all the models after each battle (episode), with RMSProp
(momentum 0.99 or 0.95), except for zero-order for which we
optimized with Adagrad (Adagrad did not seem to work better for DQN
or REINFORCE). In any case, the learning rate was chosen among
$\{10^{-2}, 10^{-3}, 10^{-4}\}$.

For all methods, we tried experience replay, either with episodes (battles) as
batches (of sizes 20, 50, 100), or additionally with random batches of $(s_t, a_t, r_{t+1}, s_{t+1}, \text{terminal?})$ quintuplets in the case of Q-learning. It
did not seem to help compared to batching with the last battle. So, for
consistency, we only present results where the training batches consisted of
the last episode (battle).

For Q-learning (DQN), we tried two schemes of annealing for epsilon
greedy, $\epsilon = \frac{\epsilon_0}{\sqrt{1 + \epsilon_a \cdot \epsilon_0 \cdot t}}$ with $t$ the optimization batch, and $\epsilon = \max\left(0.01, \frac{\epsilon_0}{\epsilon_a \cdot t}\right)$, both with
$\epsilon_0 \in \{0.1, 1\}$, and respectively $\epsilon_a \in \{0, \epsilon_0\}$ and $\epsilon_a \in \{10^{-5}, 10^{-4}, 10^{-3}\}$. We
found that the first works marginally better and used that in the
subsequent experiments with $\epsilon_0 = 1$ and $\epsilon_a = 1$ for most
of the scenarios. We also used Double DQN as in [43]
(thus implemented as target DQN). For the target/double network, we
used a lag of 100 optimizations, thus a lag of 100 battles in all the
following experiments. According to our initial runs and sweeps, it seems
to slightly help in some cases of over-estimation of the Q value.
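The first annealing scheme for ϵ-greedy exploration can be sketched as below, assuming it reads ϵ = ϵ0 / sqrt(1 + ϵa · ϵ0 · t) with t the index of the optimization batch (one batch per battle). The function name and the default arguments, set to the values ϵ0 = 1 and ϵa = 1 reported above, are illustrative.

```python
import math

def epsilon_sqrt_schedule(t, eps0=1.0, eps_a=1.0):
    """Exploration rate after t optimization batches (assumed formula):
    eps0 / sqrt(1 + eps_a * eps0 * t)."""
    return eps0 / math.sqrt(1.0 + eps_a * eps0 * t)

# Exploration rate at a few points during training.
schedule = {t: epsilon_sqrt_schedule(t) for t in (0, 3, 99, 9999)}
```

Under this assumed form, exploration starts fully random (ϵ = 1), halves after 3 battles, reaches 0.1 after 99 battles and 0.01 after 9999 battles, so the decay is fast at first and then slow.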