Risk Aversion in Finite Markov Decision Processes Using Total Cost Criteria and Average Value at Risk

Stefano Carpin
Yin-Lam Chow
Marco Pavone
Y.-L. Chow and M. Pavone
are with Stanford University, Stanford, CA, USA.
S. Carpin is with the School of Engineering, University of California-Merced, CA, USA.
S. Carpin is partially supported by the Army Research Lab under contract MAST-CNC-15-4-4. Y-L. Chow is partially supported by The Croucher Foundation doctoral scholarship.
M. Pavone is partially supported by the Office of Naval Research, Science of Autonomy Program, under Contract N00014-15-1-2673.
Any opinions, findings, and conclusions or recommendations expressed in these materials are those of
the authors and should not be interpreted as representing the official policies, either expressed or
implied, of the funding agencies of the U.S. Government.

Abstract

In this paper we present an algorithm to compute risk averse
policies in Markov Decision Processes (MDP) when the total cost
criterion is used together with the average value at risk (AVaR) metric.
Risk averse policies are needed when large deviations from the
expected behavior may have detrimental effects, and conventional MDP algorithms
usually ignore this aspect.
We provide conditions for the structure of the underlying
MDP ensuring that approximations for the exact problem
can be derived and solved efficiently.
Our findings are novel inasmuch as average value at risk
has not previously been considered in association with the total cost criterion.
Our method is demonstrated in a rapid deployment scenario in which a robot must reach a target
location within a temporal deadline and increased speed is associated
with an increased probability of failure. We demonstrate that
the proposed algorithm not only produces a risk averse policy reducing
the probability of exceeding
the temporal deadline, but also provides the statistical distribution of costs,
thus offering a valuable analysis tool.

Markov Decision Processes (MDPs) are extensively used to solve
sequential stochastic decision making problems in robotics
[26] and other disciplines [9].
A solution to an MDP problem instance provides a policy mapping
states into actions with the property of optimizing (e.g., minimizing)
in expectation a given objective function.
In many practical situations a formulation
based on expectation only is, however, not sufficient.
This is the case, in particular, when variability in the system’s behavior can cause a highly undesirable outcome.
For example, in an autonomous navigation system,
a robot attempting to minimize the expected length of the
traveled path will likely travel close to obstacles, and
a large deviation from the planned
path may then result in a collision, causing a major loss (e.g., damage to an expensive robot
or failure of the entire mission).
Metrics that study deviations from the expected value are often referred to as risk metrics.

Typical
solution algorithms for MDP problems, like value iteration or policy iteration,
aim exclusively at optimizing the expected cost. The computed policies are therefore labeled as risk neutral.
The problem of quantifying risk associated with random variables has a rich history
and is often related to the problem of managing financial assets [2]. In fact, many risk-related studies
motivated by financial problems
have recently found applications in domains such as robotics [19]. The term risk aversion
refers to a preference for stochastic realizations with limited deviation from the
expected value. In risk averse optimal control one may prefer a policy with higher cost in expectation
but lower deviations to one with lower cost but possibly larger deviations.
In the context of robotic planning in particular, introducing risk aversion in MDPs is crucial to guarantee mission safety. However, doing so creates a number of additional theoretical and computational hurdles. For example, in risk-averse
MDPs, optimal policies are not guaranteed to be stationary and Markovian, but are in general history dependent.

Average Value at Risk (AVaR – also known as Conditional Value at Risk or CVaR) is a risk metric that has gained notable popularity in the area of risk averse control [2], [17]. For a given
random variable and a predetermined confidence level, the AVaR is the average
of the distribution over the tail beyond the corresponding value at risk (see Section 3
for a formal definition).
Risk averse policies considering the AVaR metric have been studied for the case of MDPs with finite horizon
and discounted infinite horizon cost criteria.
In this paper we instead study how the AVaR metric can be applied
when an undiscounted, total cost criterion is used. Such a cost criterion appears particularly useful and natural for robotic applications, whereby one is usually interested in optimizing the undiscounted, total cost accrued during a mission until a random, mission-dependent stopping time. (As an aside, the total cost criterion is the typical cost model for stochastic shortest path problems, see, e.g., [20, 21].)
The contribution of this paper is three-fold:

We identify conditions for the underlying MDP ensuring that
the AVaR MDP problem is well defined when the total cost criterion is used.

We define a surrogate MDP problem that can be efficiently solved and whose solution approximates the optimal policy for the original problem with arbitrary
precision.

We validate our findings on a rapid robotic deployment task where the objective is to maximize the mission success rate under a given temporal deadline [8, 6].

The rest of the paper is organized as follows. We discuss related work in Section 2 and provide some background about risk metrics and MDPs in Section 3. In Section 4
we formulate the risk-averse, total cost MDP problem we wish to solve.
In Section 5 we propose and analyze an approximation strategy for the problem,
and in Section 6 we provide an algorithmic solution. Simulation results for a rapid deployment problem are given in Section 7,
and conclusions and future work are discussed in Section 8.

For a general introduction to MDPs the reader is referred to textbooks such as [4] or more recent collections
such as [9]. As pointed out in the introduction, risk aversion in MDPs has been studied for over four decades, with earlier efforts focusing on exponential utility [12], mean-variance [24], and percentile risk criteria [10].
With regard to mean-variance optimization in MDPs, it was recently
shown that computing an optimal policy under a variance constraint is NP-hard [15].
Recently, average value at risk was introduced in [17] in order to model the tail risk of a random outcome and to address some key limitations of the prevailing value-at-risk metric.
Efficient methods to compute AVaR are discussed in [18].
Leveraging the recent strides in AVaR risk modeling, there have been a number of efforts aimed at embedding the AVaR risk metric into risk-sensitive MDPs. In [3] the authors address the problem of minimizing the AVaR of the discounted cost over a finite and an infinite horizon, and propose a dynamic programming approach based on state augmentation. Similar techniques can be found in [5],
where the authors propose a dynamic programming algorithm for finite-horizon, AVaR-constrained MDPs. The algorithm is proven to asymptotically converge to an optimal risk-constrained policy. However, the algorithm involves computing integrals over continuous variables (Algorithm 1 in [5]) and, in general, its implementation appears quite challenging.
A different approach is taken by [16, 25, 7]
where the authors consider a finite dimensional parameterization of the control policies, and show that
an AVaR MDP can be optimized to a
local optimum using stochastic gradient descent (policy gradient).
However, this approach imposes additional restrictions on the policy space and, in
general, policy gradient algorithms only converge to a local optimum.
Haskell and Jain recently considered the
problem of risk aversion in MDPs using a framework based on occupancy measures [13] (closely connected to our recent works where constrained MDPs
are used to solve the multirobot rapid deployment problem
[8, 6]).
Their findings are, however, only valid for the case where an infinite horizon discounted cost
criterion is considered; the solution we propose
builds upon some of the ideas introduced in [13].

In this section we summarize some known concepts about risk metrics and MDPs. The reader is
referred to the aforementioned references for more details.

3.1 Risk

Consider a probability space S=(Ω,F,P),
and let L∞ be the space of all essentially bounded random variables on S.
A risk function (or risk metric) Γ:L∞→R
is a function that maps an
uncertain outcome Y∈L∞ onto the real line R.
A risk function that is particularly popular in many financial applications is the value at risk.
For τ∈(0,1) the value at risk of Y∈L∞ at level τ is defined as

VaR_τ(Y) := inf{ η ∈ ℝ : Pr(Y ≤ η) ≥ τ }.

Here VaRτ(Y) represents the percentile value of outcome Y at confidence level τ.
Despite its popularity, VaR_τ has a number of limitations.
In particular, VaR is not a coherent risk measure [2] and thus suffers from being unstable (high fluctuations under perturbations) when Y is not normally distributed. More importantly, it does not quantify the losses that might be incurred beyond its value in the τ-tail of the distribution [17].
An alternative measure that overcomes most shortcomings of VaR is the average value at risk, defined as

AVaR_τ(Y) := 1/(1−τ) ∫_τ^1 VaR_t(Y) dt,

where τ∈(0,1) is the confidence level as before. Intuitively, AVaRτ is the expectation of Y in the conditional distribution of its upper τ-tail. For this reason, it can be interpreted as a metric of “how bad is bad.” AVaRτ can be equivalently written as [18]

AVaR_τ(Y) = min_{s∈ℝ} { s + 1/(1−τ) E[(Y − s)^+] },   (1)

where x^+ := max(x, 0).
This paper relies extensively on Eq. (1) and aims at devising efficient methods to approximate the
expectation in Eq. (1) when the random variable Y is the total cost of an MDP.
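
Eq. (1) reduces the computation of AVaR to a one-dimensional minimization over s. As a quick numerical illustration of this form (not part of the algorithm developed below; the function name and grid parameters are ours), the following sketch estimates AVaR_τ of an empirical cost sample by a grid search over s:

```python
import numpy as np

def avar_empirical(samples, tau, num_grid=400):
    """Approximate AVaR_tau of a cost sample via Eq. (1):
    AVaR_tau(Y) = min_s { s + E[(Y - s)^+] / (1 - tau) }."""
    samples = np.asarray(samples, dtype=float)
    # The minimizer s* equals VaR_tau(Y), which lies inside the data range,
    # so a grid search over that range is enough for a numerical check.
    grid = np.linspace(samples.min(), samples.max(), num_grid)
    vals = grid + np.maximum(samples[None, :] - grid[:, None], 0.0).mean(axis=1) / (1.0 - tau)
    return vals.min()

# Example: an exponential cost has a heavy upper tail, so AVaR_0.95 is well above the mean.
rng = np.random.default_rng(0)
costs = rng.exponential(scale=1.0, size=10_000)
print(avar_empirical(costs, tau=0.95))  # roughly 1 + ln(20) ~ 4.0 for Exp(1)
```
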

Furthermore, it has been recently shown in [23] that optimizing the CVaR of the total reward is equivalent to optimizing the worst-case (robust) expected total reward of a system whose model uncertainty is subject to a trajectory-wise budget. This finding corroborates the fact that the CVaR risk metric captures both the variability of random costs and the robustness to system transition errors.

3.2 Total Cost, Transient Markov Decision Processes

For a finite set S, let P(S) indicate the set of mass distributions with support on
S. A finite, discrete-time Markov Decision Process (MDP) is a tuple M=(X,U,Pr,c) where

X, the state space, is a finite set comprising n elements.

U, the control space, is a collection of n finite sets {U(x_i)}_{i=1}^{n}. Set U(x_i), i = 1, …, n, represents the actions that can be applied when in state x_i ∈ X. The set of
allowable state/action pairs is defined as

K:={(x,u)∈X×U|u∈U(x)}.

Pr(y|x,u):K→R is the transition probability from state x to state y when action u∈U(x) is applied. According to our
definitions, Pr(⋅|x,u)∈P(X).

c:K→R≥0 is a non-negative cost function. Specifically, c(x,u) is the cost
incurred when executing action u∈U(x) at state x.

Let \bar{K} := max_{(x,u)∈K} c(x,u) and note that the maximum is attained since K is a finite set.
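
For concreteness, a minimal container for the tuple M = (X, U, Pr, c) could be organized as follows; the class and helper names are illustrative and not taken from the paper:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action = int, int

@dataclass
class FiniteMDP:
    """Finite MDP M = (X, U, Pr, c); states are indexed 0, ..., n-1."""
    n: int                                              # |X|
    U: Dict[State, List[Action]]                        # U(x): admissible actions at x
    Pr: Dict[Tuple[State, Action], Dict[State, float]]  # Pr(. | x, u)
    c: Dict[Tuple[State, Action], float]                # immediate cost on K

    def K(self):
        """Allowable state/action pairs K = {(x, u) : u in U(x)}."""
        return [(x, u) for x in range(self.n) for u in self.U[x]]

    def cost_bounds(self):
        """(min, max) immediate cost over K, i.e., (K_underline, K_bar)."""
        vals = [self.c[xu] for xu in self.K()]
        return min(vals), max(vals)
```
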
Define the set H_t of admissible histories up to time
t by H_t := K × H_{t−1}, for t ≥ 1, and H_0 := X.
An element of H_t has the form (x_0, u_0, x_1, …, x_{t−1}, u_{t−1}, x_t), and records all states traversed
and actions taken up to time t. In the most general case
a policy is a function π : H_t → P(U(x_t)),
i.e., it decides which action to take in state x_t considering the
entire state-action history. Note that according to this definition a policy is
in general randomized. Let Π be the set of all policies,
i.e., including history-dependent, randomized policies.
It is well
known that in the standard MDP setting where an expected cost is minimized
there is no loss of optimality in restricting
the optimization to deterministic, stationary, Markovian policies, i.e., policies of the type π : X → U.
However, in the risk-averse setting one needs to consider the more general class of history-dependent policies [1]. This is achieved through a state augmentation process
described later.

Following [13], we define the measurable space (Ω, B) := (K^∞, B(K^∞)), where the sample space is K^∞ = K × K × K × ⋯ and B(K^∞) is the Borel field on K^∞. Specific trajectories in the MDP are written as ω ∈ Ω, and we denote by x_t(ω) and u_t(ω) the state
and action at time t along trajectory ω. In general the exact initial state x_0 is unknown; rather, it is described by an
initial mass distribution β over X, i.e., β ∈ P(X).
A policy π and initial distribution β induce a probability distribution over (Ω, B)
that we will indicate as Pr^π_β.

In this paper we focus on transient, total cost MDPs, defined as follows. Consider
a partition of X into sets X_T and M, i.e., X = X_T ∪ M and X_T ∩ M = ∅. A transient MDP is an MDP
where each policy π satisfies the
following two properties:

∑_{t=0}^{∞} Pr^π_β[x_t = x] < ∞ for each x ∈ X_T, i.e., the state will eventually enter set M, and

Pr(y|x,u) = 0 for each x ∈ M, y ∈ X_T, u ∈ U(x), i.e., once the state enters M it cannot leave it.

A transient, total cost MDP is a transient MDP where

c(x,u)=0 for each x∈M, i.e., once the state enters M no additional cost is incurred,

the cost associated with each trajectory ω is given by

c(ω) := ∑_{t=0}^{∞} c(x_t(ω), u_t(ω)).

Note that the cost c(ω) is a random variable depending on both the policy π and the initial distribution β. The name total cost stems from the fact that an (undiscounted) cost is incurred throughout the “lifetime” of the system (i.e., until the state hits the absorbing set).
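
Under a fixed stationary policy, the total cost c(ω) can be estimated by straightforward Monte Carlo simulation, which also illustrates why c(ω) is a random variable. The sketch below is a generic illustration (function and variable names are ours); the max_steps cap is only a safeguard against non-transient inputs:

```python
import random

def rollout_total_cost(P, c, policy, x0, x_abs, max_steps=100_000):
    """Simulate one trajectory of a transient, total cost MDP and return
    c(omega), the sum of immediate costs until the absorbing state x_abs is hit."""
    x, total = x0, 0.0
    for _ in range(max_steps):
        if x == x_abs:
            return total
        u = policy(x)                          # stationary policy: state -> action
        total += c[(x, u)]
        succ = P[(x, u)]                       # dict: successor state -> probability
        x = random.choices(list(succ), weights=list(succ.values()))[0]
    raise RuntimeError("absorbing state not reached; is the MDP transient?")

# Toy example: state 0 is transient, state 1 is absorbing with zero cost.
P = {(0, 0): {0: 0.3, 1: 0.7}}
c = {(0, 0): 1.0}
samples = [rollout_total_cost(P, c, policy=lambda x: 0, x0=0, x_abs=1) for _ in range(5)]
print(samples)  # geometrically distributed number of unit-cost steps
```
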

Transient, total cost MDPs (closely related to stochastic shortest path problems, e.g., [20, 21]) represent an alternative to the more commonly used
discounted, infinite-horizon MDPs or finite horizon MDPs. As outlined in the introduction, for
many robotic applications the total cost, i.e., c(ω), is the most appropriate
cost function. We justify this statement by noting that most robotic tasks have
finite duration but such duration is usually not known in advance.
In these circumstances the finite horizon cost is inappropriate because
one cannot define the length of the finite horizon up front. Similarly,
the discounted infinite horizon cost is also ill suited because the task
does not continue forever and the cost will not be exponentially
discounted over time.

Without loss of generality, we assume that set M
consists of a single absorbing state x_M equipped with a single action u_{x_M}, i.e., M = {x_M} and
U(x_M) = {u_{x_M}} with Pr(x_M | x_M, u_{x_M}) = 1. In the following, with a slight abuse of notation, we denote by K the set {(x,u) ∈ X_T × U | u ∈ U(x)},
i.e., we exclude the absorbing state from the definition of K.
Moreover, we assume that for a transient, total cost MDP
β(xM)=0, i.e., the probability of starting at the absorbing state
is zero. In fact, whenever x0=xM the resulting state
trajectory will deterministically remain in xM,
and the corresponding cost is zero.

Our problem formulation relies on the following two technical assumptions
necessary to establish an a-priori upper bound
on the total cost incurred by any trajectory obtained under any policy, and to
define an approximate problem that can be efficiently solved.

The first assumption simply requires that all costs in the transient
states are positive (recall that we excluded x_M when re-defining K).
As will be shown later, this assumption ensures a non-zero discretization step
when approximating the cumulative cost accrued by the system along the
trajectory ω until it is absorbed in x_M.

Assumption 1 (Positivity of costs)

All costs in the transient states X_T are positive and bounded,
i.e., \underline{K} := min_{(x,u)∈K} c(x,u) > 0.

When considering cost criteria like finite horizon or discounted infinite
horizon with a finite state space, an a-priori upper bound on the accrued cost can be immediately
established assuming that all costs are finite (a fact crucially exploited in [13]). However, the situation is more complex
when considering the total cost case, because without introducing
further hypotheses on the structure of the MDP a malicious adversary could
devise a history-dependent policy capable of invalidating any
a-priori established bound on the cost.1
The second assumption then adds a “global reachability structure” to the MDP problem.
To this end, in the following, it will be useful to consider the Markov
Chain generated by the MDP when an input is selected for each state.
For an MDP M, select u_1 ∈ U(x_1), …, u_n ∈ U(x_n).
The selected inputs and the transition probabilities in M define
a finite Markov chain that we indicate as MC_{u_1,…,u_n}.
The state space of MC_{u_1,…,u_n} is equal to X
and for two states x_i, x_j ∈ X the transition probability Pr_{i,j} is
defined as Pr_{i,j} = Pr(x_j | x_i, u_i), where u_i ∈ U(x_i) is the input
selected in the definition of MC_{u_1,…,u_n} and Pr is the transition probability
of the associated MDP.

Assumption 2 (Reachability of MDP)

Let MC_{u_1,…,u_n} be the Markov chain induced by the
n inputs u_i ∈ U(x_i).
Then the absorbing state x_M, under Markov chain MC_{u_1,…,u_n}, is reachable from any state x ∈ X_T, for all u_1 ∈ U(x_1), …, u_n ∈ U(x_n).

We recall that a state j in a Markov chain is said to be reachable from another state i if there exists an integer k ≥ 1 such that the probability that the chain
will be in state j after k transitions is positive [11].
Note that when Assumption 2 holds, under every policy there is a
path of non-zero probability connecting every state to xM.
Therefore, it is impossible to devise a policy that prevents sure absorption for
an arbitrary number of steps. This holds for all policies, including history-dependent policies (see Figure 1).

Figure 1: Meaning of Assumption 2: after one input has been chosen for every state, an associated Markov chain
MC_{u_1,…,u_n} is defined. In this Markov chain, the absorbing state x_M is reachable from every state,
i.e., at every state there is a path of non-zero probability to x_M. The probability of a path is given by the product of the probabilities
of its edges, e.g., the probability of the path x_i → x_j → x_M is Pr(x_j | x_i, u_i) Pr(x_M | x_j, u_j). This requirement is imposed
for every possible choice of the inputs and every policy.

Building upon the previous material,
we can now define the problem we aim to solve in this paper: given a confidence level τ ∈ (0,1) and an initial distribution β, compute a policy

π* ∈ argmin_{π∈Π} AVaR_τ(c(ω)).   (2)
Note that, for a transient, total cost MDP, one can easily verify that E[c(ω)] < ∞. Since, by equation (1) with s = 0, AVaR_τ(c(ω)) ≤ 1/(1−τ) E[c(ω)], one obtains AVaR_τ(c(ω)) < ∞ as well.
However, to derive an optimization algorithm for the computation of π∗
it is necessary to formulate an a-priori
upper bound for the optimal cost in (2).
Assumptions 1 and 2 are introduced to ensure
that such a bound exists and can be computed.

In this section we study an approximation strategy for the risk averse total cost MDP in equation (2).
Similar to the method presented in [13], we aim at solving the problem by using the concept of occupancy measures.
However, unlike the cases studied in [13], in total cost MDPs an explicit upper bound for the accrued cost is not
available, which makes the solution strategy in [13] not directly applicable. Our strategy is to find a surrogate to
problem (2). By imposing an effective horizon, we construct a total cost MDP with time-out and recast this
problem into a bilinear programming problem. Furthermore we characterize the sub-optimality gap for such surrogate approximation. We start with a technical result characterizing the convergence rate to the absorbing state.

5.1 Convergence rate to the absorbing state

Consider a selection of inputs u_1 ∈ U(x_1), …, u_n ∈ U(x_n) and the corresponding
Markov chain MC_{u_1,…,u_n}. For each state x ∈ X_T, let
MinimumPath_{x→x_M}(MC_{u_1,…,u_n}) denote the simple (i.e., cycle-free) path
from x to x_M of lowest, strictly positive probability.
Note that MinimumPath_{x→x_M}(MC_{u_1,…,u_n})
exists due to Assumption 2.
Let Pr(MinimumPath_{x→x_M}(MC_{u_1,…,u_n})) be
the probability of the path, i.e., the product of the probabilities of
all the transitions along the path.
Since there are n nodes and by definition the path is simple,
MinimumPath_{x→x_M}(MC_{u_1,…,u_n})
includes at most n−1 transitions.
Let

γ := min_{u_k ∈ U(x_k), k=1,…,n}  min_{x ∈ X_T} Pr(MinimumPath_{x→x_M}(MC_{u_1,…,u_n})).

Note that the minimum is achieved as the minimization is over a finite set,
and that γ is strictly positive due to Assumption 2.
The constant γ lower bounds the probability that, under any
policy π∈Π, the absorbing state is reached in no more than n steps,
from any state x∈XT.
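
For small problems, γ can be computed by brute force, following its definition: enumerate every deterministic input selection, build the induced Markov chain, and search its simple paths to x_M. The sketch below uses our own helper names and is exponential in the number of states, so it is meant only as an illustration on small MDPs:

```python
import itertools

def min_simple_path_prob(chain, x, x_abs):
    """Probability of the lowest-probability simple path from x to x_abs in a
    Markov chain given as chain[state] = {successor: probability}."""
    best = None
    def dfs(node, prob, visited):
        nonlocal best
        if node == x_abs:
            best = prob if best is None else min(best, prob)
            return
        for succ, p in chain[node].items():
            if p > 0.0 and succ not in visited:
                dfs(succ, prob * p, visited | {succ})
    dfs(x, 1.0, {x})
    return best                  # None means x_abs is unreachable (Assumption 2 violated)

def compute_gamma(Pr, U, transient, x_abs):
    """gamma = min over input selections and transient states of the minimum
    simple-path probability to x_abs (brute force over all selections)."""
    gamma = 1.0
    for sel in itertools.product(*(U[x] for x in transient)):
        chain = {x: Pr[(x, u)] for x, u in zip(transient, sel)}
        chain[x_abs] = {x_abs: 1.0}
        for x in transient:
            p = min_simple_path_prob(chain, x, x_abs)
            if p is None:
                raise ValueError("Assumption 2 violated for this input selection")
            gamma = min(gamma, p)
    return gamma

# Example: two transient states, one action each, absorbing state 2.
Pr = {(0, 0): {1: 0.5, 0: 0.5}, (1, 0): {2: 0.2, 1: 0.8}}
print(compute_gamma(Pr, U={0: [0], 1: [0]}, transient=[0, 1], x_abs=2))  # 0.1
```
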
We are now in a position to characterize the convergence rate to the absorbing state.

Lemma 1 (Number of stages to reach the absorbing set)

For any policy π ∈ Π and initial distribution β,

Pr^π_β[x_{kn} ≠ x_M] ≤ (1 − γ)^k,   ∀ k ∈ ℕ.

Proof.

The claim is proven by induction on k. Base case: we prove that

Pr^π_β[x_n ≠ x_M] ≤ 1 − γ.

Indeed, Pr^π_β[x_n ≠ x_M] = ∑_{x∈X_T} Pr^π_β[x_n ≠ x_M | x_0 = x] Pr^π_β[x_0 = x]. By the definition of γ (which is strictly positive due to Assumption 2), for any policy π and any x ∈ X_T, Pr^π_β[x_n = x_M | x_0 = x] ≥ γ, and the base case follows.

For the inductive step, assume that Pr^π_β[x_{kn} ≠ x_M] ≤ (1 − γ)^k for some k ≥ 1. Then, Pr^π_β[x_{(k+1)n} ≠ x_M] = Pr^π_β[x_{(k+1)n} ≠ x_M | x_{kn} ≠ x_M] Pr^π_β[x_{kn} ≠ x_M]. By definition of γ,

Pr^π_β[x_{(k+1)n} ≠ x_M | x_{kn} ≠ x_M] ≤ 1 − γ,

hence Pr^π_β[x_{(k+1)n} ≠ x_M] ≤ (1 − γ)^{k+1}, and the claim follows.
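
A quick Monte Carlo check of the bound of Lemma 1 on a toy chain (here a single action per state, so there is only one induced Markov chain, and γ = 0.1 by inspection of its simple paths; all numbers are illustrative):

```python
import random

# Transient states {0, 1}, absorbing state 2, one action per state (a fixed policy).
chain = {0: {1: 0.5, 0: 0.5}, 1: {2: 0.2, 1: 0.8}, 2: {2: 1.0}}
n, x_abs, gamma = 3, 2, 0.1      # lowest simple-path probability: 0 -> 1 -> 2 = 0.1

def prob_not_absorbed(k, x0, trials=50_000):
    """Empirical estimate of Pr[x_{k n} != x_M] starting from x0."""
    miss = 0
    for _ in range(trials):
        x = x0
        for _ in range(k * n):
            x = random.choices(list(chain[x]), weights=list(chain[x].values()))[0]
        miss += (x != x_abs)
    return miss / trials

for k in (1, 2, 3):
    print(k, round(prob_not_absorbed(k, x0=0), 3), "<=", round((1 - gamma) ** k, 3))
```
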

5.2 Surrogate problem and approximation bounds

Our solution strategy is to solve a surrogate problem, whereby after a
deterministic number d∈N of steps, the state moves to the absorbing state xM surely. In other words, d acts as a “timeout” for the MDP
problem.
The surrogate problem is simpler to solve, and we will show in the following that
its solution can approximate the solution of the original problem with
arbitrary precision.2
Denote by c[d](ω) the total cost for such surrogate problem.
Additionally, for the original problem, let t*(ω) denote the absorbing
time, i.e., the time at which the state reaches x_M. If t*(ω) ≤ d, then the two processes coincide and
c[d](ω) = c(ω).
Otherwise, for each trajectory ω such that
t∗(ω)>d, the random process is stopped after d steps, and the state goes,
deterministically, to xM at stage d+1. In such a case one has
c[d](ω)≤c(ω).

We want to characterize the relation between AVaR_τ(c[d](ω)) (i.e., the risk for the surrogate problem) and AVaR_τ(c(ω)) (i.e., the risk for the original problem). To this end, let c_d(ω) be the total cost for the original problem up to time d, i.e.,

c_d(ω) := ∑_{t=0}^{d} c(x_t(ω), u_t(ω)).

The following lemma shows the equivalence between c[d](ω) and c_d(ω).

Lemma 2 (Correspondence of costs)

For any policy π ∈ Π and any trajectory ω, c[d](ω) = c_d(ω).

Proof.

Given a policy π, for any trajectory ω, the cost accumulated up to time d is the same for both the original and the surrogate problem. After time d, neither c_d(ω) nor c[d](ω) accumulates any additional cost, and the claim follows.

Theorem 3 (Suboptimality bound)

The left inequality is proven by noticing that min_π AVaR_τ(c(ω)) ≥ min_π AVaR_τ(c_d(ω)) = min_π AVaR_τ(c[d](ω)), where the equality follows from Lemma 2.

We now prove the right inequality. For any s ∈ ℝ and policy π, one has

E[(c(ω) − s)^+] = E[(c(ω) − s)^+ | t*(ω) ≤ d] P(t*(ω) ≤ d)
               + E[(c(ω) − s)^+ | t*(ω) > d] P(t*(ω) > d).   (3)

Let c_l(ω) := ∑_{t=d+1}^{∞} c(x_t(ω), u_t(ω)) be the tail cumulated cost and, as before, c_d(ω) := ∑_{t=0}^{d} c(x_t(ω), u_t(ω)). Since the function x ↦ x^+ is sub-additive, i.e., (x + y)^+ ≤ x^+ + y^+, and the expectation operator preserves monotonicity, one obtains the inequality

The claim then follows immediately, as the above upper bound is policy-independent.

Note that, according to Theorem 3, as d → ∞ the optimal cost of the surrogate problem recovers the optimal cost of the original problem, i.e., the surrogate problem provides a consistent approximation of the original problem, with a sub-optimality gap that is computable from problem data.

Lemma 2 and Theorem 3 ensure that c[d](ω) can approximate c(ω) with arbitrary
precision for a sufficiently large value of d.
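
Lemma 1 also suggests a simple way to pick the timeout d: to make the probability of truncating a trajectory at most δ, it suffices to take d = kn with (1 − γ)^k ≤ δ. The helper below (names are ours) only enforces this tail-probability target; how large d must be for a desired suboptimality gap in Theorem 3 additionally depends on \bar{K}, τ, and γ:

```python
import math

def timeout_for_tail(gamma, n, delta):
    """Smallest d = k*n such that, by Lemma 1, Pr[t*(omega) > d] <= (1 - gamma)^k <= delta."""
    k = math.ceil(math.log(delta) / math.log(1.0 - gamma))
    return k * n

print(timeout_for_tail(gamma=0.1, n=8, delta=1e-3))  # k = 66, so d = 528
```
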
In the next section we then show how to solve the minimization problem

min_{π∈Π} AVaR_τ(c[d](ω)) = min_{π∈Π} min_{s∈ℝ} { s + 1/(1−τ) E[(c[d](ω) − s)^+] }.   (5)
Leveraging the surrogate problem from the previous section, we can now
adapt the results proposed in [13] to solve problem (2).
An essential step to solve this optimization problem is to compute E[(c[d](ω)−s)+], which entails deriving
the probability distribution for the possible costs generated by the random
variable c[d](ω). This problem can be solved by suitably augmenting
the state space as described in the following, and then using occupancy measures.
In the space of occupancy measures, an optimal policy is determined
through the solution of a bilinear program, as explained below.
For a given policy π and initial
distribution β, we define the occupancy measure for (x,u)∈K as

ρ(x, u) = ∑_{t=0}^{∞} Pr^π_β[x_t(ω) = x, u_t(ω) = u].

Note that ρ(x,u) is non-negative but is in general not a probability itself.
In the following we will use occupancy measures to determine the probability distribution of the total costs c[d](ω) and then
to compute the needed expectation.
According to the definition,
occupancy measures depend on the policy π and the initial distribution β. Given an absorbing MDP M=(X,U,Pr,c), we define a new state-augmented absorbing MDP with additional state components that
track the cumulated total cost and current stage.
Although the original MDP M is finite and absorbing, the set of
costs c[d](ω) generated by all possible policies can be very large,
and this can subsequently
lead to a linear program with an unmanageable number of decision variables.
To counter this problem, we introduce a discretized approximation
for c[d](π,β) whose error can be arbitrarily bounded.
To this end, we set ζ = min{\underline{K}, d\bar{K}/N′},
where N′ ∈ ℕ is a parameter describing the desired number of discretized values for the cumulated cost. Due to Assumption 1, ζ is strictly positive.
The effective number of different values is
N = ⌈d\bar{K}/ζ⌉.
This value may be higher than N′ due to our definition of ζ.
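
The discretization step, the number of cost levels, and the resulting size of the augmented state space introduced next follow directly from these formulas; a small illustrative helper (names and numbers are ours):

```python
import math

def augmented_sizes(K_min, K_max, d, N_prime, n_states):
    """zeta = min(K_underline, d*K_bar/N'), N = ceil(d*K_bar/zeta),
    and |X'| = |X| * (N + 1) * (d + 1) for the augmented state space."""
    zeta = min(K_min, d * K_max / N_prime)
    N = math.ceil(d * K_max / zeta)
    return zeta, N, n_states * (N + 1) * (d + 1)

print(augmented_sizes(K_min=0.5, K_max=2.0, d=20, N_prime=50, n_states=10))
# zeta = 0.5, N = 80, |X'| = 10 * 81 * 21 = 17010
```
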
We then define a new MDP
M′_N = (X′, U′, Pr′, c′) as follows.
Its state space is X′ = X × ℕ_N × ℕ_d, where ℕ_N = {0, 1, …, N} and ℕ_d = {0, 1, …, d}.
Elements of the augmented state space will be indicated as (x, y, z).
As clarified in the following, the two additional components store the cumulated running cost (y) and the current stage (z). Recall that in the surrogate problem, after d steps, the state is guaranteed to have entered the absorbing set, i.e., it is guaranteed that x_d(ω) = x_M.
Thus the value of the z component
is in ℕ_d = {0, 1, …, d}. The input
sets, on the other hand, are defined as U′(x, y, z) = U(x). X′ and U′ induce a new set K′ = {(x, y, z, u) | (x, u) ∈ K ∧ y ∈ ℕ_N ∧ z ∈ ℕ_d}.
The new cost function c′ : K′ → ℝ_{≥0} is c′(x, y, z, u) = c(x, u). The transition probability function is modified as follows:

As evident from the definition of the new transition function, the new variables included in the state store the discretized running cost3 and the stage.
Consistent with our definition of the surrogate problem,
the revised transition function
includes a timeout that imposes a transition to the absorbing state x_M after d steps, and
from that point onwards the accrued cost does not change.
Note also that the additional state components y and z
are deterministic functions of the previous state and control input u.
Extending the previously introduced notation, for a given trajectory ω of M′_N, we write y_t(ω) for the second component
of the state at time t and z_t(ω) for the third component. Finally, for a given
initial distribution β on X, we define the following new initial distribution β′ on X′:

β′(x, y, z) = β(x) if y = 0 ∧ z = 0, and β′(x, y, z) = 0 otherwise.

Note that the properties of M carry over to M′_N. In particular, if Assumptions 1-2
hold for M then they hold for M′_N too,
and, if M is absorbing, then M′_N is also absorbing. Thus we indicate with X′_T the set of transient states of M′_N.
For a given realization ω, consider now c[d]_t(ω) := ∑_{i=0}^{t} c(x_i(ω), u_i(ω)), i.e., the true cumulative cost of the
surrogate MDP problem, without discretization, up to time t.
The following theorem establishes that even though the approximation error introduced by discretizing
the running cost grows linearly with t, it can be bounded with arbitrary precision.

Theorem 4

For each ε > 0 and each t ∈ {0, …, d}, there exists a discretization step ζ such that
|ζ y_t(ω) − c[d]_t(ω)| < ε.

Proof.
Pick ζ ≤ ε/(d+1).
Let e(t) := c[d]_t(ω) − ζ y_t(ω) be the approximation error at time t. Note
that by definition e(t) ≥ 0 and e(0) = c[d]_0(ω) − ζ y_0(ω) = 0.
From the definition of the transition probability function Pr′, it follows that e(t+1) ≤ e(t) + ζ,
which implies e(d) ≤ dζ < ε. Since for t > d we have
e(t) = e(d), the claim follows. □

A key step towards the solution of the problem in (5) is therefore to derive the statistical
description of the discretized total cost y_d(ω) that is used to approximate c[d](ω). This objective can be
achieved by exploiting the occupancy measures for the state-augmented MDP M′_N.
For (x, y, z, u) ∈ K′, the occupancy measure on M′_N induced by a policy π and an initial distribution β is given as:

ρ(x, y, z, u) = ∑_{t=0}^{∞} Pr^π_β[x_t(ω) = x, y_t(ω) = y, z_t(ω) = z, u_t(ω) = u].   (6)

The occupancy measure ρ is a
vector in ℝ_{≥0}^{|K′|}, i.e., it is a vector with |K′| non-negative components.
The set of legitimate occupancy vectors is constrained by the initial distribution β and defined by the
policy π. It is well known [1] that these constraints can be expressed as follows:

∑_{(x′,y′,z′)∈X′_T} ∑_{u∈U′(x′,y′,z′)} ρ(x′, y′, z′, u) [ δ_{(x,y,z)}(x′, y′, z′) − Pr′((x, y, z) | (x′, y′, z′), u) ] = β′(x, y, z)   ∀ (x, y, z) ∈ X′_T,

where δ_x(y) = 1 if and only if y = x.
For 0 ≤ k ≤ N we introduce variables θ(k) with the property
that θ(k) = Pr[y_d(ω) = k].
This is easily achieved using occupancy measures, i.e.,

θ(k) = ∑_{(x,y,z,u)∈K′} I(y = k ∧ z = d) ρ(x, y, z, u),

where I(⋅) is the indicator function equal to 1 when its argument is true, and 0 otherwise.
Note that by definition θ(k) is equal to Pr[y_d(ω) = k],
and by Theorem 4, y_d(ω) approximates c[d](ω) with arbitrary precision.
Combining the above definitions we then get to the following problem whose
solution approximates the solution to (5):

min_{ρ,θ} min_{s∈[0,\bar{K}d]}  s + 1/(1−τ) ∑_{y∈ℕ_N} (y − s)^+ θ(y)   (7)

s.t.

∑_{(x′,y′,z′)∈X′_T} ∑_{u∈U′(x′,y′,z′)} ρ(x′, y′, z′, u) [ δ_{(x,y,z)}(x′, y′, z′) − Pr′((x, y, z) | (x′, y′, z′), u) ] = β′(x, y, z)   ∀ (x, y, z) ∈ X′_T,

θ(k) = ∑_{(x,y,z,u)∈K′} I(y = k ∧ z = d) ρ(x, y, z, u),   0 ≤ k ≤ N.

When comparing this last optimization problem with (5),
the reader will note that the variable s is constrained to the interval [0, \bar{K}d]. Indeed, the objective
function is continuous with respect to s, and it is straightforward to verify that the partial derivative of
the objective function with respect to s is negative for s < 0 and positive for s > \bar{K}d.
The objective function given in Eq. (7) is concave with respect to θ(y) and is defined over
a convex feasibility set [13].
To the best of our knowledge, there exist no efficient methods to determine the global minimum for this class of problems.
Hence, the problem is approximately solved by fixing different values of s within the range [0, \bar{K}d] and then
solving the corresponding linear program over the optimization variables ρ and θ.
Comparing the problem in Eq. (7) with the one in Eq. (2),
one might initially think that the objective function in
Eq. (7) does not depend on the policy π.
However, the dependency on π is carried by the occupancy measure ρ, as evident from Eq. (6). Moreover,
it is well known from the theory of constrained MDPs [1] that there is a one-to-one correspondence between
policies and occupancy measures, i.e., every policy defines a unique occupancy measure and every occupancy measure induces a policy.
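
The following sketch summarizes the approximate solution strategy described above: grid the interval [0, \bar{K}d] for s and, for each fixed s, solve the inner problem of Eq. (7), which is linear in (ρ, θ). The constraint data A_eq, b_eq (occupancy balance plus the θ-coupling constraints), the index set theta_idx locating the θ block inside the decision vector, and the vector cost_levels of discretized cost values are assumed to be assembled from the augmented MDP by code not shown here; all names are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def solve_surrogate_avar(A_eq, b_eq, theta_idx, cost_levels, tau, K_bar, d, num_s=50):
    """Outer grid over s, inner LP over (rho, theta) for problem (7)."""
    best = (np.inf, None, None)                   # (value, decision vector, s)
    for s in np.linspace(0.0, K_bar * d, num_s):
        obj = np.zeros(A_eq.shape[1])
        # Objective for fixed s: (1/(1-tau)) * sum_k (cost_levels[k] - s)^+ * theta(k);
        # the constant s is added back after the solve.
        obj[theta_idx] = np.maximum(cost_levels - s, 0.0) / (1.0 - tau)
        res = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        if res.success and s + res.fun < best[0]:
            best = (s + res.fun, res.x, s)
    return best   # approximate optimal AVaR, occupancy/theta vector, and minimizing s
```
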

To illustrate the performance of the proposed algorithm, we adopt the rapid deployment scenario considered in [6, 8].
A graph is used to abstract and model the connectivity of a given map of an environment
(see, e.g., [14]).
One robot is positioned at a start vertex and is
tasked to reach the goal vertex within a given temporal deadline while providing some guarantee about its
probability of successfully completing the task.
When moving from vertex
to vertex, the robot can choose from a set of actions, each trading off
speed with probability of success. In particular, actions with
rapid transitions between two vertices have a higher probability of failure;
conversely, when the robot moves slowly between two vertices it has a higher
probability of success. In this scenario, failure means that the robot
does not move (e.g., it fails to pass through an opening), so elapsed time increases
without making progress towards the goal.
With a given temporal deadline T and success probability P, the robot is tasked to reach
the target vertex “safely” (such that the true mission success probability is at least P), while satisfying the temporal constraint.
From a design perspective it is of interest to know if there exists a policy π
achieving this objective, and to compute it. If the policy does not exist, it is
of interest to know how to modify the parameters in order to make the
task feasible.
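
As an illustration of how the scenario maps onto a total cost MDP, the sketch below builds transition probabilities and travel-time costs for a graph in which every edge can be traversed at a "slow" (reliable) or "fast" (failure-prone) speed; on a failure the robot stays at its current vertex but the time is spent anyway. Helper names and numbers are illustrative and are not the values used in the experiments reported below:

```python
def deployment_mdp(edges, goal, speeds=(("slow", 1.0, 0.95), ("fast", 0.5, 0.70))):
    """edges: list of (v, w, base_time); speeds: (name, time factor, success prob).
    Returns transition and cost dictionaries indexed by (vertex, action)."""
    P, c = {}, {}
    for v, w, base_time in edges:
        for name, time_factor, p_success in speeds:
            a = (w, name)                            # action: go to w at this speed
            P[(v, a)] = {w: p_success, v: 1.0 - p_success}
            c[(v, a)] = base_time * time_factor      # time elapses even on failure
    stay = ("stay", "-")
    P[(goal, stay)] = {goal: 1.0}                    # goal vertex is absorbing, zero cost
    c[(goal, stay)] = 0.0
    return P, c

P, c = deployment_mdp(edges=[(0, 1, 2.0), (1, 2, 3.0)], goal=2)
```
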

In our previous work we solved this problem by modeling it
using Constrained Markov Decision Processes (CMDPs).
In the CMDP approach, one maximizes the probability of success
while imposing a constraint on the temporal deadline.
However, this method only returns risk-neutral policies, i.e., the resultant policies only guarantee that the temporal deadline is met in expectation, and there is no explicit control on the tail probability of the constraint.
As a radical departure from the original problem formulation, the AVaR minimization method proposed in Eq. (2) searches for a policy that is feasible with respect to the temporal deadline constraint4 and systematically controls the worst-case variability of the total travel time. Note that a policy with low success probability will have a large tail probability in total travel time even if the temporal deadline is met in expectation. Therefore the optimal policies from AVaR minimization will have high success probability. This motivates the application of AVaR minimization to rapid robotic deployment.
First, note that, in the devised setting, the robot will eventually reach the final goal with positive probability.
However, due to possible failures one cannot put an a-priori bound on the
random total travel time. Therefore, the total cost criterion is indeed a
natural choice for this task. Moreover, Assumptions 1 and 2 are easily justified because the immediate cost function (i.e., time to move) is always positive and
the global reachability property follows from the graph structure.

To illustrate the performance of risk-averse deployment, two different policies are compared. Here both policies are computed using unconstrained stochastic control methodologies for which the immediate cost is the travel time between two vertices, and the actions correspond to all possible node transitions on the graph.
The first is the classic risk-neutral policy obtained with
value iteration. The second is a risk averse policy obtained with
the algorithm presented in this paper using τ=0.95.
For each policy, 1000 executions are run, and the distribution of total travel time is reported.
Figures 2 and 3 show the
distributions for the two cases. The risk-neutral policy obtains
a lower expected cost, but has a longer tail, as evidenced
by the 61 instances with a cost larger than or equal to 15 (notice that T = 15 is the desired time of completion in this example).
Moreover, as the shape of the histogram shows, costs are more spread out.
The risk averse policy, on the other hand, results in less variability, as desired. Fewer than 30 instances have a cost
larger than or equal to 15, reducing the weight of the tail by more than one half.

Importantly, when computing a risk-neutral policy using classic methods like
policy iteration or value iteration, one is merely provided with a policy that
minimizes the expected cost (in our case time to completion), and no additional information is readily available.
With our approach instead, one
not only obtains a policy minimizing the AVaR criterion, but a statistical description of the costs is also obtained as a byproduct. That is to say, for
each discretized completion time k, the probability Pr[c[d]=k] is computed
as well, thus unveiling the relationship between the time to complete
the deployment task and its probability.
This is shown in Figure 4 for different values of
τ.
Hence, if the computed policy
does not meet the desired performance, the designer has information
on how to tune T and P.

Figure 4: Comparison of the cost probability distributions for different values of τ.

In this paper we have considered how
risk aversion in MDPs can be introduced jointly with the AVaR risk metric under the total cost criterion.
Our results
advance the state of the art because AVaR has previously been considered only in MDPs
with finite horizon
or discounted infinite horizon cost criteria. This extension is important because the total cost criterion appears as a natural model for robotic applications, and it is non-trivial because current algorithms, e.g., from [13] and [3], only work with bounded cumulated costs (which is not the case for total cost formulations). Under two mild assumptions, an approximation algorithm with a provable sub-optimality gap was provided.
Furthermore, a rapid deployment scenario was used to demonstrate that
risk-aversion gives more informative policies when compared to
traditional risk-neutral formulations.
While our findings focus on risk averse MDPs with an AVaR risk metric,
our approach can be easily extended along multiple dimensions. In particular, by exploiting the results presented in [13], it is
possible to use our approximation for a broader range of risk metrics, i.e.,
metrics that are uniformly continuous and law invariant. Moreover,
since the algorithm we considered is based on occupancy measures,
it can be easily extended to the CMDP case. This will be the focus of future work.

Footnotes

First note that we are seeking
a uniform upper bound for all possible policies, including history-dependent policies. Hence, given a tentative bound B,
in the general case one could devise a
history-dependent policy ensuring that every trajectory generated by the policy is not absorbed in x_M
in fewer than B/\underline{K} steps, thus invalidating the bound.

An alternative strategy would be to investigate reductions of problem (2) to an equivalent risk-averse, discounted, infinite-horizon problem by using, e.g., the results recently presented in [22], and then apply the approach in [13] to the reformulated problem. This is an interesting direction left for future research.

To be precise, the discretized running cost is scaled by ζ.


For any random variable Z with finite expectation, AVaRτ(Z)≥E[Z] for τ∈[0,1]. Therefore, if the solution to the AVaR minimization problem is bounded above by the temporal deadline, then the corresponding minimizer is also a feasible policy to the original problem.
