Abstract

We consider the problem of steering a system with unknown,
stochastic dynamics to satisfy a rich, temporally-layered task given
as a signal temporal logic formula. We represent the system as a Markov decision process in which the states are built from a partition of the statespace and the transition probabilities are unknown. We present provably convergent reinforcement learning
algorithms to maximize the probability of satisfying a given
formula and to maximize the average expected robustness,
i.e., a measure of how strongly the formula is satisfied.
We demonstrate via a pair of robot navigation
simulation case studies that reinforcement learning with robustness
maximization performs better than probability maximization in terms
of both probability of satisfaction and expected robustness.

We consider the problem of controlling a system with unknown,
stochastic dynamics, i.e., a “black box”, to achieve a complex,
time-sensitive task. An example is controlling a noisy aerial vehicle
with partially known dynamics to visit a pre-specified set of regions
in some desired order while avoiding hazardous areas. We consider
tasks given as temporal logic (TL) formulae
[2], an extension of first order Boolean logic
that can be used to reason about how the state of a system evolves
over time. When a stochastic dynamical model is known, there exist
algorithms to find control policies for maximizing the probability of
achieving a given TL specification
[18, 17, 23, 13] by
planning over stochastic abstractions
[12, 1, 17]. However, only a handful
of papers have considered the problem of enforcing TL specifications
on a system with unknown dynamics.
Passive [3] and active [21, 9]
reinforcement learning have been used
to find a policy that maximizes the probability of satisfying a given
linear temporal logic formula.

In this paper, in contrast to the above works on reinforcement
learning which use propositional temporal logic, we use signal
temporal logic (STL), a rich predicate logic that can be used to
describe tasks involving bounds on physical parameters and time
intervals [7]. An example of such a property is
“Within t1 seconds, a region in which y is less than π1 is
reached, and regions in which y is larger than π2 are avoided
for t2 seconds.” STL admits a continuous measure called
robustness degree that quantifies how strongly a given sample
path exhibits an STL property as a real number rather than just
providing a yes or no answer
[8, 7]. This measure enables
the use of continuous optimization methods to solve inference
(e.g., [10, 11, 14]) or formal synthesis
problems (e.g., [20]) involving STL.

One of the difficulties in solving problems with TL formulae is the
history-dependence of their satisfaction. For instance, if the
specification requires visiting region A before region B, whether or
not the system should steer towards region B depends on whether or not
it has previously visited region A. For linear temporal logic (LTL) formulae with
time-abstract semantics, this history-dependence can be broken by
translating the formula to a deterministic Rabin automaton (DRA), a
model that automatically takes care of the history-dependent
“book-keeping” [4, 21]. In the case of STL,
such a construction is difficult due to the time-bounded semantics.
We circumvent this problem by defining a fragment of STL such that the
progress towards satisfaction is checked with some finite number
τ of state measurements. We thus define an MDP, called the
τ-MDP whose states correspond to the τ-step history of the
system. The inputs to the τ-MDP are a finite collection of
control actions.

We use a reinforcement learning strategy called Q-learning
[24], in which a policy is constructed by taking actions,
observing outcomes, and reinforcing actions that improve a given reward. Our
algorithms either maximize the probability of
satisfying a given STL formula, or maximize the expected robustness
with respect to the given STL formula. These procedures provably converge
to the optimal policy for each case. Furthermore, we
propose that maximizing expected robustness is typically more
effective than maximizing probability of satisfaction. We prove that
in certain cases, the policy that maximizes expected robustness
also maximizes the probability of satisfaction. However,
if the given specification is not satisfiable, probability
maximization will return an arbitrary policy, while robustness
maximization will return a policy that gets as close to satisfying the
formula as possible. Finally, we demonstrate through simulation case
studies that the policy that maximizes expected robustness in some
cases gives better performance in terms of both probability of
satisfaction and expected robustness when fewer training episodes are
available.

STL is defined with respect to continuously valued
signals. Let F(A,B) denote the set of mappings from A to B and define a signal as a member of F(N,R^n). For a
signal s, we denote by s_t the value of s at time t and by s_{t1:t2} the sequence of values s_{t1} s_{t1+1} … s_{t2}.
Moreover, we denote by s[t] the suffix from time t, i.e., s[t]={s_{t′} | t′≥t}.

In this paper, the desired mission specification is described by an STL
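fragment with the following syntax:

\[
\phi := F_{[0,T)}\psi \mid G_{[0,T)}\psi, \qquad \psi := f(s) \le d \mid \neg\varphi \mid \varphi_1 \wedge \varphi_2 \mid \varphi_1 U_{[a,b)} \varphi_2, \tag{1}
\]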

where T is a finite time bound, ϕ, ψ, and φ are STL formulae, a and b are non-negative real-valued constants, and f(s)≤d is a predicate where s is a
signal, f∈F(R^n,R) is a function, and d∈R is a constant.
The Boolean operators ¬ and ∧ are negation (“not”) and conjunction (“and”), respectively.
The other Boolean operators are defined as usual. The temporal operators F, G, and U
stand for “Finally (eventually)” , “Globally (always)”,
and “Until”, respectively. Note that in this paper, we use a discrete-time version of STL rather than
the typical continuous-time formulation.

In plain English, F[a,b)ϕ means “at some time between a and b time units in the future, ϕ is true,” G[a,b)ϕ means
“for all times between a and b time units in the future, ϕ is true,” and ϕ1U[a,b)ϕ2 means “there exists a time c between a and b time units in the
future such that ϕ1 is true until c and ϕ2 is true at c.”
STL is equipped with a robustness degree [8, 7] (also called “degree of
satisfaction”) that quantifies how well a given signal s satisfies a given formula ϕ. The robustness is calculated recursively according
to the quantitative semantics
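
For reference, the quantitative semantics of discrete-time STL are commonly written in the following form [8, 7]. We state it here as an assumed standard form for the predicate f(s)≤d and the half-open intervals used in (1), not as a transcription of the original display:

```latex
\begin{aligned}
r(s,\, f(s) \le d,\, t) &= d - f(s_t),\\
r(s,\, \neg\varphi,\, t) &= -\,r(s, \varphi, t),\\
r(s,\, \varphi_1 \wedge \varphi_2,\, t) &= \min\{\, r(s,\varphi_1,t),\; r(s,\varphi_2,t) \,\},\\
r(s,\, F_{[a,b)}\varphi,\, t) &= \max_{t' \in [t+a,\,t+b)} r(s,\varphi,t'),\\
r(s,\, G_{[a,b)}\varphi,\, t) &= \min_{t' \in [t+a,\,t+b)} r(s,\varphi,t'),\\
r(s,\, \varphi_1 U_{[a,b)} \varphi_2,\, t) &= \max_{t' \in [t+a,\,t+b)} \min\Big\{\, r(s,\varphi_2,t'),\; \min_{t'' \in [t,\,t')} r(s,\varphi_1,t'') \Big\}.
\end{aligned}
```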

We use r(s,ϕ) to denote r(s,ϕ,0). If r(s,ϕ) is large and
positive, then s would have to change by a large deviation in order to
violate ϕ. Similarly, if r(s,ϕ) is large in absolute value and negative, then s
strongly violates ϕ.
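
To make the recursive evaluation concrete, the following Python sketch computes a robustness degree on a sampled signal. It assumes the standard discrete-time quantitative semantics (d − f(s_t) for a predicate f(s)≤d, a sign flip for negation, min for conjunction and G, max for F). The tuple-based formula encoding and the function name `robustness` are hypothetical and chosen only for illustration.

```python
import math


def robustness(signal, formula, t=0):
    """Robustness r(s, formula, t) of a sampled signal (assumed standard semantics).

    Formulas are nested tuples (a hypothetical encoding):
      ("pred", f, d)           predicate f(s) <= d
      ("not", phi)             negation
      ("and", phi1, phi2)      conjunction
      ("F", a, b, phi)         eventually within [a, b)
      ("G", a, b, phi)         always within [a, b)
      ("U", a, b, phi1, phi2)  until within [a, b)
    The caller must supply a signal long enough for the formula's horizon.
    """
    op = formula[0]
    if op == "pred":
        _, f, d = formula
        return d - f(signal[t])
    if op == "not":
        return -robustness(signal, formula[1], t)
    if op == "and":
        return min(robustness(signal, formula[1], t),
                   robustness(signal, formula[2], t))
    if op in ("F", "G"):
        _, a, b, sub = formula
        values = [robustness(signal, sub, tp) for tp in range(t + a, t + b)]
        return max(values) if op == "F" else min(values)
    if op == "U":
        _, a, b, left, right = formula
        best = -math.inf
        for tp in range(t + a, t + b):
            # phi1 must hold on [t, tp); an empty window imposes no constraint.
            hold = min((robustness(signal, left, k) for k in range(t, tp)),
                       default=math.inf)
            best = max(best, min(robustness(signal, right, tp), hold))
        return best
    raise ValueError(f"unknown operator: {op!r}")
```

A positive return value indicates satisfaction with some margin, and a negative value indicates violation, matching the interpretation given above.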

Similar to [6], let hrz(ϕ) denote the horizon
length of an STL formula ϕ. The horizon length
is the required number of samples to resolve any (future or past) requirements
of ϕ. The horizon length can be computed recursively as

(2)

where ϕ,ϕ1,ϕ2 are STL formulae.

Example 1

Consider the robot navigation problem illustrated in Figure
1(a). The specification is “Visit Regions A or B and
visit Regions C or D every 4 time units along a mission horizon of
100 units.” Let s(t)=[x(t) y(t)]^T,
where x and y are the x- and y-components of the signal
s. This task can be formulated in STL as

Figure 1(a) shows two trajectories of the system beginning at the initial location of R and ending in region C that each
satisfies the inner specification ψ given in (3). Note that s2
barely satisfies ψ, as it only slightly penetrates region A, while s1 appears to satisfy it strongly, as it passes through
the center of region A and the center of region C. The
robustness degrees confirm this: r(s1,ψ)=0.3 while r(s2,ψ)=0.05.

For a system with unknown and stochastic dynamics, a critical problem is how to synthesize a controller that achieves a desired behavior.
A typical approach is to discretize the state and action spaces of the system and then apply reinforcement learning, i.e., learn
how to take actions through trial-and-error interactions with an unknown environment [22].
In this section, we present models of systems that are amenable to reinforcement learning
for enforcing temporal logic specifications. We start with a discussion of the widely used LTL before introducing the particular model that we will use for reinforcement learning with STL.

III-A Reinforcement Learning with LTL

One approach to the problem of enforcing LTL satisfaction in a
stochastic system is to partition the statespace and design
control primitives that can (nominally) drive the system from one
region to another. These controllers, the stochastic dynamical
model of the system, and the quotient obtained from the partition are
used to construct a Markov decision process (MDP), called a bounded parameter MDP or BMDP,
whose transition probabilities are interval-valued [1].
These BMDPs can then be composed with a DRA constructed from a given
LTL formula to form a product BMDP. Dynamic
programming (DP) can then be applied over this product BMDP to generate a
policy that maximizes the probability
of satisfaction. Other approaches to this problem include aggregating
the states of a given quotient until an MDP can be constructed such
that the transition probability can be considered constant (with
bounded error) [16]. The optimal policy can be
computed over the resulting MDP using DP [15] or
approximate DP, e.g., actor-critic methods [5].

Thus, even when the stochastic dynamics of a system are known and the logic
that encodes constraints has time-abstract semantics, the problem of constructing an abstraction
of the system that is amenable to control policy synthesis is difficult and computationally
intensive. Reinforcement learning methods for enforcing
LTL constraints make the assumption that the underlying model under control is an MDP [3, 21, 9]. Implicitly, these procedures
compute a frequentist approximation of the transition probabilities that asymptotically approaches the true (unknown)
value as the number of observed sample paths increases. Since these procedures do not explicitly
rely on any a priori knowledge of the transition probabilities, they could be applied to an
abstraction of a continuous-space system that is built from a proposition-preserving partition.
In this case, the uncertainty on the motion described by intervals in the BMDP that is reduced via computation
would instead be described by complete ignorance that is reduced via learning. The resulting policy would
map regions of the statespace to discrete actions that will optimally drive the real-valued
state of the system to satisfy the given LTL specification. Different partitions will result in different
policies. In the next section, we extend the above observation to derive a discrete
model that is amenable to reinforcement learning for STL formulae.

III-B Reinforcement Learning with STL: τ-MDP

In order to reduce the search space of the problem, we partition the statespace of the system to form the quotient graph G=(Σ,E),
where Σ is a set of discrete states corresponding to the regions of the statespace and E corresponds to the
set of edges. An edge between two states σ and σ′ exists in E if and only if σ and σ′ are neighbors
(share a boundary) in the partition.
In our case, since STL has time-bounded semantics, we cannot use an automaton with a time-abstract acceptance condition (e.g., a DRA)
to check its satisfaction. In general, whether or not a given trajectory s0:T satisfies an STL formula
would be determined by directly using the qualitative semantics. The STL fragment (1) consists of a sub-formula ψ
with horizon length hrz(ψ)=τ that is modified by either an F[0,T)
or a G[0,T) temporal operator. This means that in order to update at time t whether or not the given formula ϕ has been satisfied or violated, we can use
the τ previous state values st−τ+1:t. For this reason, we choose to learn policies over an MDP with finite memory, called a τ-MDP,
whose states correspond to sequences of length τ of regions in the defined partition.

Example 1 (cont’d)

Let the robot evolve according to the discrete-time Dubins dynamics

\[
x_{t+1} = x_t + v\,\delta_t \cos\theta_t, \qquad y_{t+1} = y_t + v\,\delta_t \sin\theta_t, \tag{4}
\]

where xt and yt are the x and y coordinates of the robot at time t,
v is its forward speed, δt is a time interval, and the robot’s
orientation is given by θt. The control primitives in this case are given
by Act={up,down,left,right} which correspond to the directions on the grid.
Each (noisy) control primitive induces a distribution with support θdes±Δθ, where θdes is the orientation where the robot is facing the desired cell. When a motion primitive is enacted, the robot
rotates to an angle θt drawn from the distribution and moves along that direction
for δt time units.
The partition of the statespace and the induced
quotient G are shown in Figures 1(b) and 1(c), respectively. A state σ(i,j) in the quotient (Figure 1(c))
represents the region in the partition of the statespace (Figure 1(b)) with the point (i,j) in the lower left hand corner.
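
A single noisy step of the dynamics (4) can be sketched in Python as follows. The sketch assumes that the heading noise is uniform on θdes±Δθ and that θdes is the axis-aligned direction of the chosen primitive; the text only specifies the support of the distribution, so both choices, as well as the names `HEADINGS` and `dubins_step`, are illustrative assumptions.

```python
import math
import random

# Assumed nominal headings (radians) for the four control primitives.
HEADINGS = {
    "right": 0.0,
    "up": math.pi / 2,
    "left": math.pi,
    "down": -math.pi / 2,
}


def dubins_step(x, y, action, v=1.0, dt=1.0, dtheta=0.3, rng=random):
    """Apply one noisy motion primitive following the update in (4).

    The heading is drawn uniformly from [theta_des - dtheta, theta_des + dtheta]
    (an assumed noise model), then the robot moves for dt time units at speed v.
    """
    theta = HEADINGS[action] + rng.uniform(-dtheta, dtheta)
    return x + v * dt * math.cos(theta), y + v * dt * math.sin(theta)
```

Repeatedly calling `dubins_step` and recording which cell of the partition contains (x, y) yields the sequence of regions from which the learning algorithm builds its states.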

Definition 1

Given a quotient of a system G=(Σ,E) and a finite set of actions Act, a τ-Markov Decision Process (τ-MDP) is a tuple Mτ=⟨S,Act,P⟩, where

S⊆(Σ∪{ϵ})^τ is the finite set of states, where ϵ is the empty
string.
Each state στ∈S corresponds to a τ-horizon (or shorter)
path in G. Shorter paths of length n<τ (representing the case in which the system has not yet evolved for τ time steps) have
ϵ prepended τ−n times.

P:S×Act×S→[0,1] is a probabilistic transition relation. P(στ,a,σ′τ) can be positive only
if the first τ−1 states of σ′τ are equal to the last τ−1 states of στ and there exists an edge in G between the final state of στ and the final state of σ′τ.

We denote the state of the τ-MDP at time t as σtτ.

Definition 2

Given a trajectory st−τ+1:t of the original system, we define
its induced trace in the τ-MDP Mτ as Tr(st−τ+1:t)=σt−τ+1:t=σtτ. That is, σtτ corresponds
to the previous τ regions of the statespace that the state has resided in from time t−τ+1 to time t.
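
The bookkeeping behind Definitions 1 and 2 amounts to a sliding window over visited regions. The Python sketch below illustrates it under two assumptions that are ours: ϵ is represented by `None`, and remaining in the same region counts as an admissible move (in line with the requirement that actions lead to the current or a neighboring region). All function names are hypothetical.

```python
EPS = None  # stands in for the empty string ϵ


def initial_state(tau):
    """State before any region has been observed: ϵ repeated tau times."""
    return (EPS,) * tau


def successor(state, region):
    """Shift the window: drop the oldest entry and append the new region."""
    return state[1:] + (region,)


def trace(regions, tau):
    """Induced tau-MDP state after observing a sequence of regions."""
    state = initial_state(tau)
    for region in regions:
        state = successor(state, region)
    return state


def is_admissible(state, next_state, edges):
    """Necessary condition for a positive transition probability.

    The last tau-1 entries of `state` must equal the first tau-1 entries of
    `next_state`, and the final regions must be neighbors in the quotient
    (or identical, which is an assumption of this sketch).
    """
    if next_state[:-1] != state[1:]:
        return False
    last, new = state[-1], next_state[-1]
    if last is EPS:
        return True  # assumption: the first observed region is unconstrained
    return last == new or (last, new) in edges or (new, last) in edges
```

For instance, with τ=4 a system that has only visited two regions is represented by a state whose first two entries are ϵ.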

The construction of a τ-MDP
from a given quotient and set of actions is straightforward.
The details are omitted due to length constraints. We make the following key assumptions on the quotient and the resulting τ-MDP:

The defined control
actions Act will drive the system either to a point in the current region or to a point
in a neighboring region of the partition, i.e., no regions are “skipped”.

The transition relation P is Markovian.

For every τ state σtτ, there exists a continuous set of sample paths {st−τ+1:t} whose
traces could be that state. The dynamics of the underlying system produces an unknown distribution
p(st−τ+1:t|Tr(st−τ+1:t)=σtτ). Since the robustness degree is a function of sample paths of length
τ and an STL formula ψ, we can define a distribution p(r(st−τ+1:t,ψ)|Tr(st−τ+1:t)=σtτ).

Example 1 (cont’d)

Figure 2 shows a portion of the τ-MDP constructed from
Figure 1. The states in M4 are labeled
with the corresponding sample paths of length 4 in G. The green and blue σ’s in
the states in M4 correspond to green and blue regions from Figure 1.

Problem 1 (Maximizing Probability of Satisfaction)

Let Mτ be a τ-MDP as described in the previous section. Given an STL formula ϕ with syntax (1), find a policy μ∗mp∈F(S×N,Act) such that

\[
\mu^*_{mp} = \arg\max_{\mu \in F(S \times N,\, Act)} \Pr_{s_{0:T}}\big[ s_{0:T} \models \phi \big] \tag{5}
\]

Problem 2 (Maximizing Average Robustness)

Let Mτ be as defined in Problem 1. Given an STL formula ϕ with syntax (1), find a policy μ∗mr∈F(S×N,Act) such that

\[
\mu^*_{mr} = \arg\max_{\mu \in F(S \times N,\, Act)} E_{s_{0:T}}\big[ r(s_{0:T}, \phi) \big] \tag{6}
\]

Fig. 2: Part of the τ-MDP constructed from the robot navigation MDP shown in Figure 1

Problems 1 and 2 are two alternative formulations for enforcing a given STL specification. The policy obtained by solving Problem 1, i.e., μ∗mp, maximizes the chance that ϕ will be satisfied, while the policy obtained by solving Problem 2, i.e., μ∗mr, drives the system to satisfy ϕ as strongly as possible on average. Problems similar to (5) have already been considered in the literature (e.g., [9, 21]). However, Problem 2 is a novel formulation that provides some advantages over Problem 1.
As we show in Section V, for some special systems, μ∗mr achieves the same probability of satisfaction as μ∗mp. Furthermore, if ϕ is not satisfiable, any arbitrary policy could be a solution to Problem 1, as all policies will result in a satisfaction probability of 0. If ϕ is unsatisfiable, Problem 2 yields a solution that attempts to get as close as possible to satisfying the formula, as the optimal solution will have an average robustness value that is least negative.
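
For a fixed policy, both objectives can be estimated empirically from the robustness values of simulated episodes. The Python sketch below assumes that a trajectory counts as satisfying ϕ exactly when its robustness is strictly positive; treating the boundary case r=0 as a violation is a simplifying assumption, and the function name `estimate_objectives` is hypothetical.

```python
def estimate_objectives(robustness_samples):
    """Monte Carlo estimates of the objectives in Problems 1 and 2.

    `robustness_samples` holds r(s_{0:T}, phi) for episodes generated under
    one fixed policy. Returns (estimated satisfaction probability,
    estimated expected robustness).
    """
    samples = list(robustness_samples)
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    # Assumption: strictly positive robustness is counted as satisfaction.
    p_sat = sum(1 for r in samples if r > 0) / n
    mean_robustness = sum(samples) / n
    return p_sat, mean_robustness
```

If every sampled trajectory violates ϕ, the first estimate is 0 for all policies, while the second still distinguishes policies by how negative their average robustness is.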

The forms of the objective functions differ for the two different types of formula,
ϕ=F[0,T)ψ and ϕ=G[0,T)ψ.

Case 1: Consider an STL formula ϕ=F[0,T)ψ. In this case, the
objective function in (5) can be rewritten as

Here, we demonstrate that the solution to (6) subsumes the solution to (5) for a certain class of systems. Due to space
limitations, we only consider formulae of the type ϕ=F[0,T)ψ.
Let Mτ=(Sτ,Pτ,Act) be a τ-MDP. For simplicity, we make the following
assumption on Sτ.

Assumption 1

For every state στ∈Sτ, either every trajectory st−τ+1:t
whose trace is στ satisfies ψ,
denoted στ⊨ψ, or every trajectory that passes through the sequence of regions associated with στ does not satisfy ψ,
denoted στ⊭ψ.

Assumption 1 can be enforced in practice during partitioning.
We define the set

Thus, any policy μ increasing J(μ) also leads to an increase in p. Since increasing J(μ) is equivalent to increasing ER(μ), we can conclude that the policy that maximizes the robustness also achieves the maximum satisfaction probability.

VI-A Policy Generation through Q-Learning

Since we do not know the dynamics of the system under control, we cannot a priori predict how a given control action will affect the evolution of the
system and hence its progress towards satisfying or violating a given specification. Thus, we use the well-known paradigm of reinforcement learning to learn policies that solve Problems 1 and 2. In reinforcement learning, the system takes actions and
records the rewards associated with each state-action pair. These rewards are then used to update a feedback policy that maximizes the
expected gathered reward. In our case, the rewards that we collect over Mτ are related to whether or not ψ
is satisfied (Problem 1) or how robustly ψ is satisfied/violated (Problem 2).

Our solutions to these problems rely on a Q-learning formulation [24]. Let R(σtτ,a) be the reward collected when
action a∈Act was taken in state σtτ∈S. Define the function Q:S×Act×N
as

where 0<γ<1 will cause Qt to converge to Q with probability 1 as t goes to infinity [24].

VI-B Batch Q-Learning

We cannot reformulate Problems 1 and 2 into the form (23)
(see Section IV). Thus, we
propose an alternate Q-learning formulation, called batch Q-learning, to solve these problems. Instead
of updating the Q-function after each action is taken, we wait until an entire episode s[0:T) is completed before
updating the Q-function. The batch Q-learning procedure is summarized in Algorithm 1.

The Q function is initialized to random values and μ is computed from the initial Q
values. Then, for Nep episodes, the system is simulated using μ. Randomization
is used to encourage exploration of the policy space. The observed trajectory
is then used to update the Q function according to Algorithm 2. The new value of the Q function is used
to update the policy μ. For compactness, Algorithm 2 as written only
covers the case ϕ=F[0,T)ψ. The case in which ϕ=G[0,T)ψ can be addressed similarly.
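
The overall structure of batch Q-learning can be sketched in Python as follows. Only the outer loop (random initialization, ϵ-greedy simulation, update after the whole episode) follows the description above. The per-episode update uses a max-style backup for ϕ=F[0,T)ψ, which is our assumption about the form of Algorithm 2 and not a transcription of it; the callables `reset`, `step`, and `reward`, and all parameter names, are likewise hypothetical.

```python
import random
from collections import defaultdict


def batch_q_learning(reset, step, reward, actions, horizon, n_episodes,
                     alpha=0.5, gamma=0.9, eps_decay=0.995, seed=0):
    """Sketch of batch Q-learning for a formula of the form F[0,T) psi.

    reset()                  -> initial tau-MDP state
    step(state, action, rng) -> next tau-MDP state (unknown dynamics)
    reward(state)            -> r(state, psi) for Problem 2, or a 0/1
                                satisfaction indicator for Problem 1
    """
    rng = random.Random(seed)
    # Q starts from small random values; unseen pairs are filled on first use.
    Q = defaultdict(lambda: 0.01 * rng.random())

    def greedy(state):
        return max(actions, key=lambda a: Q[(state, a)])

    for episode in range(n_episodes):
        eps = eps_decay ** episode  # probability of a random action
        state = reset()
        transitions = []
        for _ in range(horizon):
            if rng.random() < eps:
                action = rng.choice(actions)
            else:
                action = greedy(state)
            nxt = step(state, action, rng)
            transitions.append((state, action, nxt))
            state = nxt
        # Batch update: Q is only modified once the episode is complete.
        for s, a, nxt in reversed(transitions):
            best_next = max(Q[(nxt, b)] for b in actions)
            # Assumed backup: best psi-robustness reachable from here on.
            target = max(reward(nxt), gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q, greedy
```

Sweeping the stored transitions in reverse order is a design choice of this sketch: it lets information from the end of an episode propagate to its first states within a single update pass.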

VI-C Convergence of Batch Q-Learning

Given a formula of the form ϕ=F[0,T)ψ and an objective of maximizing the expected robustness (Problem 2), we will show that applying Algorithm 1 converges to the optimal solution. The
other three cases discussed in Section IV can be proven similarly. The following analysis is based on [19].
The optimal Q function derived from (8) is

We implemented the batch Q-learning algorithm (Algorithm
1) and applied it to two case studies that adapt the
robot navigation model from Example 1. For each case
study, we solved Problems 1 and 2 and compared
the performance of the resulting policies. All simulations were
implemented in Matlab and performed on a PC with a 2.6 GHz processor
and 7.8 GB RAM.

VII-A Case Study 1: Reachability

First, we consider a simple reachability problem. The given STL specification is

\[
\phi_{cs1} = F_{[0,20)}\big( F_{[0,1)}\varphi_{blue} \wedge G_{[1,4)}\neg\varphi_{blue} \big), \tag{34}
\]

where φblue is the STL subformula corresponding to being in a blue region.
In plain English, (34) can be stated
as “Within 20 time units, reach a blue region and then don’t revisit a blue region
for 4 time units.” The results from applying
Algorithm 1 are summarized in Figure 3. We used the
parameters γ=1, αt=0.95, Nep=300, and
ϵt=0.995^t, where ϵt is the probability at
iteration t of selecting an action at random.¹ Constructing the
τ-MDP took 17.2s. Algorithm 1 took 161s to solve
Problem 1 and 184s to solve Problem 2.

The two approaches perform very similarly. In the first row, we show
a histogram of the robustness of 500 trials generated from the system
simulated using each of the trained policies after learning has
completed, i.e. without the randomization that is used during the
learning phase. Note that both trained policies satisfied the
specification with probability 1. The performance of the two algorithms
is very similar, as the mean robustness is 0.2287
with standard deviation 0.1020 for probability maximization and 0.2617 with standard deviation 0.1004 for robustness maximization.
In the second
row, we see trajectories simulated by each of the trained
policies.

The similarity of the solutions in this case study is not surprising.
If the state of the system is deep within A or B, then the
probability that it will remain inside that region in the next 3 time
steps (satisfy ϕ) is higher than if it is at the edge of the
region. Trajectories that remain deeper in the interior of region A
or B also have a high robustness value. Thus, for this particular
problem, there is an inherent coupling between the policies that
satisfy the formula with high probability and those that satisfy the
formula as robustly as possible on average.

VII-B Case Study 2: Repeated Satisfaction

In this second case study, we look at a problem involving repeatedly
satisfying a condition finitely many times. The specification of
interest is

\[
\phi_{cs2} = G_{[0,12)}\big( F_{[0,4)}(\varphi_{blue}) \wedge F_{[0,4)}(\varphi_{green}) \big). \tag{35}
\]

In plain English, (35) is “Ensure that
every 4 time units over a 12 unit interval, a green region and a blue region are both entered.” Results from this case study are shown in Figure 4. We
used the same parameters as listed in Section VII-A, except Nep=1200, α=0.4, and
ϵt=0.9^t. Constructing the τ-MDP took 16.5s. Applying
Algorithm 1 took 257.7s for Problem 1 and 258.3s
for Problem 2.

Fig. 4: Comparison of policies for Case Study 2.
The subplots (a)-(d) have the same meaning as in Figure 3.

In the first row, we see that the solution to Problem 1
satisfies the formula with probability 0 while the solution to Problem
2 satisfies the formula with probability 1. At first, this
seems counterintuitive, as Proposition 2 indicates
that a policy that maximizes probability would achieve a probability
of satisfaction at least as high as the policy that maximizes the
expected robustness. However, this is only guaranteed with an infinite
number of learning trials. The performance in terms of robustness is obviously better
for the robustness maximization (mean 0.1052, standard deviation
0.0742) than for the probability maximization (mean -0.6432, standard
deviation 0.2081). In the second row, we see that the maximum robustness
policy enforces convergence to a cycle between two regions, while the maximum probability
policy deviates from this cycle.

The discrepancy between the two solutions can be explained by what
happens when trajectories that almost satisfy (35)
occur. If a trajectory that almost oscillates between
a blue and green region every four time units is encountered when solving
Problem 1, it collects 0 reward. On the other
hand, when solving Problem 2, the policy that produces the
almost oscillatory trajectory will be reinforced much more strongly,
as the resulting robustness is less negative. Since the robustness degree gives “partial
credit” for trajectories that are close to satisfying the formula, the
reinforcement learning algorithm performs a directed search to find
policies that satisfy the formula. Since probability maximization
gives no partial credit, the reinforcement learning algorithm is
essentially performing a random search until it encounters a
trajectory that satisfies the given formula. Therefore, if the family
of policies that satisfy the formula with positive probability is small,
it will on average take the Q-learning algorithm solving Problem
1 a longer time to converge to a solution that enforces
formula satisfaction.

In this paper, we presented a new reinforcement learning paradigm to enforce temporal logic specifications when
the dynamics of the system are a priori unknown. In contrast to existing works
on this topic, we use a logic (signal temporal logic) whose formulation
is directly related to a system’s statespace. We present a novel, convergent Q-learning algorithm that uses the robustness
degree, a continuous measure of how well a trajectory satisfies
a formula, to enforce the given specification. In certain cases, robustness maximization
subsumes the established paradigm of probability maximization and, in certain cases,
robustness maximization performs better in terms of both probability and robustness under partial training.
Future research includes formally connecting our approach to abstractions
of linear stochastic systems.

Footnotes

Although the conditions γ<1 and ∑_{k=0}^∞ α_k^2 < ∞ are technically
required to prove convergence, in practice these conditions can be relaxed without having adverse effects
on learning performance.