The first contribution of this paper is to provide a language and a
translation adapted to another class of solution methods which have
proven quite effective in dealing with large MDPs, namely anytime state-based heuristic search methods such as LAO*
[28], LRTDP [9], and
their ancestors [8,17,45]. These
methods typically start with a compact representation of the MDP based
on probabilistic planning operators, and search forward from an
initial state, constructing new states by expanding the envelope of
the policy as time permits. They may produce an approximate and even
incomplete policy, but explicitly construct and explore only a
fraction of the MDP. Neither of the two previous proposals is
well-suited to such solution methods: the first because the cost of
the translation (most of which is performed prior to the solution
phase) negates the benefits of anytime algorithms, and the second
because the size of the MDP obtained is an obstacle to the
applicability of state-based methods. Since both the cost of the
translation and the size of the resulting MDP will severely impact
the quality of the policy obtainable by the deadline, we need an
appropriate resolution of the tradeoff between the two.
Our approach has the following main features. The translation is
entirely embedded in the anytime solution method, which retains full
control over which parts of the MDP are explicitly constructed and
explored. While the MDP obtained is not minimal, it is the smallest
achievable without stepping outside the anytime framework, i.e.,
without enumerating parts of the state space that the solution method
would not necessarily explore. We formalise
this relaxed notion of minimality, which we call blind
minimality in reference to the fact that it does not require any
lookahead (beyond the fringe). This is appropriate in the context of anytime
state-based solution methods, where we want the minimal MDP achievable
without expensive pre-processing.
When the rewarding behaviours are specified in PLTL, there does not
appear to be a way of achieving a relaxed notion of minimality as
powerful as blind minimality without a prohibitive translation.
Therefore, instead of PLTL, we adopt a variant of future linear
temporal logic (FLTL) as our specification language, which we extend
to handle rewards. While the language has a more complex semantics
than PLTL, it enables a natural translation into a blind-minimal MDP
by simple progression of the reward formulae. Moreover, search
control knowledge expressed in FLTL [5] fits
particularly nicely in this framework, and can be used to dramatically
reduce the fraction of the search space explored by the solution
method.
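The progression idea mentioned above can be illustrated with a small sketch: given a formula and the current state, progression returns the formula that the remainder of the trajectory must satisfy. The tuple-based formula encoding and the chosen operator set (atoms, conjunction, "next", "always") are illustrative assumptions; the paper's $FLTL language and its reward-attributing construct are richer than this.

```python
TRUE, FALSE = ("true",), ("false",)

def progress(formula, state):
    """Progress `formula` through `state` (a set of atoms true now),
    returning what must hold of the rest of the trajectory."""
    op = formula[0]
    if op in ("true", "false"):
        return formula
    if op == "atom":
        return TRUE if formula[1] in state else FALSE
    if op == "and":
        left = progress(formula[1], state)
        right = progress(formula[2], state)
        if FALSE in (left, right):
            return FALSE
        if left == TRUE:
            return right
        if right == TRUE:
            return left
        return ("and", left, right)
    if op == "next":      # X f: f must hold at the next step
        return formula[1]
    if op == "always":    # G f: f holds now, and G f holds next
        now = progress(formula[1], state)
        if now == FALSE:
            return FALSE
        if now == TRUE:
            return formula
        return ("and", now, formula)
    raise ValueError(f"unknown operator {op!r}")

# Example: G p survives a state satisfying p, and collapses to false otherwise.
g_p = ("always", ("atom", "p"))
```

Because progression is a cheap, local operation on the formula attached to the current state, it is exactly the kind of step that can be interleaved with forward state expansion, which is what makes the translation compatible with the anytime framework.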