Markov decision processes (MDPs) are now widely accepted as the
preferred model for decision-theoretic planning problems
[11]. The fundamental assumption behind the MDP
formulation is that both the system dynamics and the reward
function are Markovian: all information needed to
determine the reward at a given state must be encoded in the state
itself.
This requirement is not always easy to meet for planning problems, as
many desirable behaviours are naturally expressed as properties of
execution sequences, see
e.g., [19,27,4,40].
Typical cases include rewards for the maintenance of some property,
for the periodic achievement of some goal, for the achievement of a
goal within a given number of steps of the request being made, or even
simply for the very first achievement of a goal which becomes
irrelevant afterwards.
For instance, consider a health care robot which assists elderly or
disabled people by achieving simple goals such as reminding them to do
important tasks (e.g. taking a pill), entertaining them, checking or
transporting objects for them (e.g. checking the stove's temperature
or bringing coffee), escorting them, or searching (e.g. for glasses or
for the nurse) [14]. In this domain, we might
want to reward the robot for making sure a given patient takes his
pill exactly once every 8 hours (and penalise it if it fails to
prevent the patient from taking the pill more than once within this
time frame!). We may also reward it for repeatedly visiting all rooms
in the ward in a given order and reporting any problem it detects, or
give it a reward once for each patient request answered within the
appropriate time frame, etc.
Another example is the elevator control domain
[35], in which an elevator must get passengers
from their origin to their destination as efficiently as possible,
while attempting to satisfy a range of other conditions, such as
providing priority services to critical customers. In this domain,
some trajectories of the elevator are more desirable than others,
which makes it natural to encode the problem by assigning rewards to
those trajectories.
A decision process in which rewards depend on
the sequence of states passed through rather than merely on the
current state is called a decision process with non-Markovian
rewards (NMRDP) [2].
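The contrast can be stated formally; the notation below is a minimal sketch using standard conventions, not fixed by the definitions above. Writing S for the state space and S* for the set of finite state sequences, the two reward function types are:

```latex
% MDP: the reward depends only on the current state
R_{\mathrm{MDP}} : S \rightarrow \mathbb{R}
% NMRDP: the reward depends on the entire state sequence so far
R_{\mathrm{NMRDP}} : S^{*} \rightarrow \mathbb{R}
```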
A difficulty with NMRDPs is that the most efficient MDP solution
methods do not directly apply to them. The traditional way to
circumvent this problem is to formulate the NMRDP as an equivalent
MDP, whose states result from augmenting those of the original NMRDP
with extra information capturing enough history to make the reward
Markovian. Hand-crafting such an MDP can, however, be very difficult in
general. This is exacerbated by the fact that the size of the MDP
impacts the effectiveness of many solution methods. Therefore,
there has been interest in automating the translation into an MDP,
starting from a natural specification of non-Markovian rewards and of
the system's dynamics [2,3]. This is
the problem we focus on.
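To make the state-augmentation idea concrete, here is a minimal sketch (all names and the toy domain are our own, not from any of the cited systems). The non-Markovian reward pays 1.0 the very first time a goal state "g" is reached and nothing afterwards; augmenting each state with a single bit of history ("has g been seen?") suffices to make an equivalent Markovian reward.

```python
def nmrdp_reward(history):
    """Non-Markovian reward: a function of the whole state sequence.

    Pays 1.0 only when the current state is the goal 'g' and 'g' has
    never been visited before in this trajectory.
    """
    return 1.0 if history[-1] == "g" and "g" not in history[:-1] else 0.0


def mdp_step(aug_state, next_state):
    """Markovian step over augmented states (state, seen_g).

    The one extra bit of history makes the reward a function of the
    augmented state alone, as required by the MDP formulation.
    """
    _, seen_g = aug_state
    reward = 1.0 if next_state == "g" and not seen_g else 0.0
    return (next_state, seen_g or next_state == "g"), reward


# The two formulations assign identical rewards along any trajectory:
trajectory = ["a", "b", "g", "c", "g"]
aug = (trajectory[0], trajectory[0] == "g")
rewards = []
for s in trajectory[1:]:
    aug, r = mdp_step(aug, s)
    rewards.append(r)
# rewards == [0.0, 1.0, 0.0, 0.0]
```

Rewards that look further back (e.g. "the goal was achieved within k steps of the request") need correspondingly richer history annotations, which is why the size of the hand-crafted MDP can grow quickly.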