Abstract

We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under temporal drifts, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. This setting captures endogeneity, exogeneity, uncertainty, and partial feedback in sequential decision-making scenarios, and finds applications in various online marketplaces, epidemic control, and transportation. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Finally, we conduct numerical experiments to show that our proposed algorithms achieve superior empirical performance compared with existing algorithms.

Notably, the interplay between endogeneity and exogeneity presents a unique challenge, absent in existing (stationary and non-stationary) stochastic online learning settings, when one applies the conventional Optimism in Face of Uncertainty (OFU) principle to design algorithms with provably low dynamic regret for RL in non-stationary MDPs. We overcome this challenge by a novel confidence widening technique that incorporates additional optimism into our learning algorithms to ensure low dynamic regret bounds. To extend our theoretical findings, we demonstrate, in the context of single item inventory control with fixed cost, how one can leverage special structures on the state transition distributions to bypass the difficulty of exploring time-varying environments.

Consider a general sequential decision-making framework, where a decision-maker (DM) interacts with an initially unknown environment iteratively. At each time step, the DM first observes the current state of the environment, and then chooses an available action. After that, she receives an instantaneous random reward, and the environment transitions to the next state. The DM aims to design a policy that maximizes its cumulative rewards, while facing the following challenges:

• Endogeneity: At each time step, the reward follows a reward distribution, and the subsequent state follows a state transition distribution. Both distributions depend (solely) on the current state and action, which are influenced by the policy. Hence, the environment can be fully characterized by a discrete time Markov decision process (MDP).

• Exogeneity: The reward and state transition distributions vary (independently of the policy) across time steps, but the total variations are bounded by the respective variation budgets.

• Uncertainty: Both the reward and state transition distributions are initially unknown to the DM.

• Bandit/Partial Feedback: The DM can only observe the reward and state transition resulting from the current state and action in each time step.

It turns out that many applications, such as vehicle remarketing in used-car sales and real-time bidding in advertisement (ad) auctions, can be captured by this framework.

Example 1 (Vehicle Remarketing in Used-Cars Sales)

An automobile company disposes of continually arriving off-lease vehicles (i.e., leasing vehicles that have reached the end of their fixed term) via daily wholesale vehicle auctions (Manheim 2020, Vehicle Remarketing 2020). At the beginning of each auction, the company observes the number of on-hand vehicles (the “state”), and decides the number of off-lease vehicles to be listed (the “action”). Then, the car dealers bid for the purchases via a first-price auction. The sales of vehicles generate revenue to the company while unsold vehicles incur holding cost to the company (the “reward” and “state transition”). The company aims at maximizing profit by designing a policy that dynamically decides the vehicles to be listed in each auction. However, the dealers’ bidding behaviors are affected by many unpredictable (and thus exogenous) factors (e.g., real-time customer demands, vehicles’ depreciation, and inter-dealer competitions) in addition to the company’s decisions (i.e., the vehicles listed), and can vary across time.

Example 2 (Real-Time Bidding in Ads Auctions)

Advertisers repeatedly compete for ad display impressions via real-time online auctions (Google 2011, Cai et al. 2017, Flajolet and Jaillet 2017, Balseiro and Gur 2019, Guo et al. 2019). Each advertiser begins with a budget. Upon the arrival of a user, an impression is generated, and each advertiser submits a bid (the “action”) for it subject to her remaining budget (the “state”). The winning advertiser acquires the impression to display her ad to the user, and observes the user's click or no-click behavior (the “reward”). For each slot won, the advertiser has to make the payment (determined by the auction mechanism) using her remaining budget, and the budget is periodically refilled (the “state transition”). Each advertiser wants to maximize the number of clicks on her advertisement subject to her own (continuously evolving) budget constraint. Nevertheless, the competitiveness of each auction exhibits exogeneity, as the participating advertisers and the arriving users differ from time to time. Moreover, the popularity of an ad can change due to endogenous reasons. For instance, displaying the same ad too frequently in a short period of time might reduce its freshness and result in a temporarily low number of clicks (i.e., we can incorporate both the remaining budget and the number of times that the ad is shown within a given window size into the state of the MDP to model endogenous dynamics).

There exist numerous works in sequential decision-making that consider a subset of the four challenges (please refer to Table 1 for a summary and comparison). The traditional stream of research (Auer et al. 2002b, Bubeck and Cesa-Bianchi 2012, Lattimore and Szepesvári 2018) on stochastic multi-armed bandits (MAB) focused on the interplay between uncertainty and bandit feedback (i.e., challenges 3 and 4), and (Auer et al. 2002b) proposed the classical Upper Confidence Bound (UCB) algorithm. Starting from (Burnetas and Katehakis 1997, Tewari and Bartlett 2008, Jaksch et al. 2010), a volume of works (see Section id1) has been devoted to reinforcement learning (RL) in MDPs (Sutton and Barto 2018), which further involves endogeneity. RL in MDPs incorporates challenges 1, 3, and 4, and stochastic MAB is a special case of MDPs when there is only one state. In the absence of exogeneity, the reward and state transition distributions are invariant across time, and these three challenges can be jointly solved by the Upper Confidence bound for Reinforcement Learning (UCRL2) algorithm (Jaksch et al. 2010).

The UCB and UCRL2 algorithms leverage the optimism in face of uncertainty (OFU) principle to select actions iteratively based on the entire collection of historical data. However, both algorithms quickly deteriorate when exogeneity emerges, since the environment can change over time and the historical data becomes obsolete. To address the challenge of exogeneity, (Garivier and Moulines 2011) considered the piecewise-stationary MAB environment, where the reward distributions remain unaltered over certain time periods and change at unknown time steps. Later on, a line of research initiated by (Besbes et al. 2014) studied the general non-stationary MAB environment (Besbes et al. 2014, Cheung et al. 2019b, a), in which the reward distributions can change arbitrarily over time, but the total changes (quantified by a suitable metric) are upper bounded by a variation budget (Besbes et al. 2014). The aim is to minimize the dynamic regret, i.e., the optimality gap compared to the cumulative rewards of the sequence of optimal actions. Both the (relatively restrictive) piecewise-stationary MAB and the general non-stationary MAB settings consider the challenges of exogeneity, uncertainty, and partial feedback (i.e., challenges 2, 3, 4), but endogeneity (challenge 1) is not present.

In this paper, to address all four above-mentioned challenges, we consider RL in non-stationary MDPs where both the reward and state transition distributions can change over time, but the total changes (quantified by suitable metrics) are upper bounded by the respective variation budgets. We note that in (Jaksch et al. 2010), the authors also consider the intermediate setting of RL in piecewise-stationary MDPs. Nevertheless, we first demonstrate in Section id1, and then rigorously show in Section id1 that simply adopting the techniques for non-stationary MAB (Besbes et al. 2014, Cheung et al. 2019b, a) or RL in piecewise-stationary MDPs (Jaksch et al. 2010) to RL in non-stationary MDPs may result in poor dynamic regret bounds.

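| Setting                   | Endogeneity | Exogeneity | Uncertainty | Bandit feedback |
|---------------------------|-------------|------------|-------------|-----------------|
| Stationary MAB            | ✘           | ✘          | ✔           | ✔               |
| RL in stationary MDPs     | ✔           | ✘          | ✔           | ✔               |
| Non-stationary MAB        | ✘           | ✔          | ✔           | ✔               |
| RL in non-stationary MDPs | ✔           | ✔          | ✔           | ✔               |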

Table 1: Summary of different sequential decision-making settings. Among them, RL in non-stationary MDPs is the only setting that addresses all four challenges.

Assuming that, during the $T$ time steps, the total variations of the reward and state transition distributions are bounded (under suitable metrics) by the variation budgets $B_r\,(>0)$ and $B_p\,(>0)$, respectively, we design and analyze novel algorithms for RL in non-stationary MDPs. Let $D_{\max}$, $S$, and $A$ be respectively the maximum diameter (a complexity measure to be defined in Section id1), number of states, and number of actions in the MDP. Our main contributions are:

• We develop the Sliding Window UCRL2 with Confidence Widening (SWUCRL2-CW) algorithm. When the variation budgets are known, we prove it attains a $\tilde{O}(D_{\max}(B_r+B_p)^{1/4}S^{2/3}A^{1/2}T^{3/4})$ dynamic regret bound via a budget-aware analysis.

• We propose the Bandit-over-Reinforcement Learning (BORL) algorithm that tunes the SWUCRL2-CW algorithm adaptively, and retains the same $\tilde{O}(D_{\max}(B_r+B_p)^{1/4}S^{2/3}A^{1/2}T^{3/4})$ dynamic regret bound without knowing the variation budgets.

• We identify an unprecedented challenge for RL in non-stationary MDPs with conventional optimistic exploration techniques: existing algorithmic frameworks for non-stationary online learning (including non-stationary bandit and RL in piecewise-stationary MDPs) (Jaksch et al. 2010, Garivier and Moulines 2011, Cheung et al. 2019b) typically estimate unknown parameters by averaging historical data in a “forgetting” fashion, and construct the tightest possible confidence regions/intervals accordingly. They then optimistically search for the most favorable model within the confidence regions, and execute the corresponding optimal policy. However, we first demonstrate in Section id1, and then rigorously show in Section id1 that in the context of RL in non-stationary MDPs, the diameters induced by the MDPs in the confidence regions constructed in this manner can grow wildly, and may result in an unfavorable dynamic regret bound. We overcome this with our novel proposal of extra optimism via the confidence widening technique. A summary of the algorithmic frameworks for stationary and non-stationary online learning settings is provided in Table 2.

As a complement to this finding, suppose that for any pair of initial state and target state, there always exists an action such that the probability of transitioning from the initial state to the target state by taking this action is lower bounded uniformly over the entire time horizon. Then the DM can attain low dynamic regret without widening the confidence regions. We demonstrate that in the context of single item inventory control with fixed cost (Yuan et al. 2019), a mild condition on the demand distribution is sufficient for this extra assumption to hold.

The rest of the paper is organized as follows: in Section id1, we describe the non-stationary MDP model of interest. In Section id1, we review related works in non-stationary online learning and reinforcement learning. In Section id1, we introduce the SWUCRL2-CW algorithm, and analyze its performance in terms of dynamic regret. In Section 7, we design the BORL algorithm that can attain the same dynamic regret bound as the SWUCRL2-CW algorithm without knowing the total variations. In Section id1, we discuss the challenges in designing learning algorithms for reinforcement learning under drift, and manifest how the novel confidence widening technique can mitigate this issue. In Section 9, we discuss the alternative approach without widening the confidence regions in inventory control problems. In Section 11, we conduct numerical experiments to show the superior empirical performances of our algorithms. In Section id1, we conclude our paper.

In this section, we introduce the notation used throughout the paper, and describe the learning protocol for our problem of RL in non-stationary MDPs.

Throughout the paper, all vectors are column vectors, unless specified otherwise. We define $[n]$ to be the set $\{1,2,\ldots,n\}$ for any positive integer $n$. We denote $\mathbf{1}[\cdot]$ as the indicator function. For $p\in[1,\infty]$, we use $\|x\|_p$ to denote the $p$-norm of a vector $x\in\mathbb{R}^d$. We denote $x\vee y$ and $x\wedge y$ as the maximum and minimum between $x,y\in\mathbb{R}$, respectively. We adopt the asymptotic notations $O(\cdot)$, $\Omega(\cdot)$, and $\Theta(\cdot)$ (Cormen et al. 2009). When logarithmic factors are omitted, we use $\tilde{O}(\cdot)$, $\tilde{\Omega}(\cdot)$, $\tilde{\Theta}(\cdot)$, respectively. With some abuse, these notations are used when we try to avoid the clutter of writing out constants explicitly.

Model Primitives: An instance of non-stationary MDP is specified by the tuple $(\mathcal{S},\mathcal{A},T,r,p)$. The set $\mathcal{S}$ is a finite set of states. The collection $\mathcal{A}=\{\mathcal{A}_s\}_{s\in\mathcal{S}}$ contains a finite action set $\mathcal{A}_s$ for each state $s\in\mathcal{S}$. We say that $(s,a)$ is a state-action pair if $s\in\mathcal{S}$, $a\in\mathcal{A}_s$. We denote $S=|\mathcal{S}|$, $A=(\sum_{s\in\mathcal{S}}|\mathcal{A}_s|)/S$. We denote $T$ as the total number of time steps, and denote $r=\{r_t\}_{t=1}^{T}$ as the sequence of mean rewards. For each $t$, we have $r_t=\{r_t(s,a)\}_{s\in\mathcal{S},a\in\mathcal{A}_s}$, and $r_t(s,a)\in[0,1]$ for each state-action pair $(s,a)$. In addition, we denote $p=\{p_t\}_{t=1}^{T}$ as the sequence of state transition distributions. For each $t$, we have $p_t=\{p_t(\cdot|s,a)\}_{s\in\mathcal{S},a\in\mathcal{A}_s}$, where $p_t(\cdot|s,a)$ is a probability distribution over $\mathcal{S}$ for each state-action pair $(s,a)$.

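Exogeneity: The quantities $r_t$'s and $p_t$'s vary across different $t$'s in general. Following (Besbes et al. 2014), we quantify the variations on $r_t$'s and $p_t$'s in terms of their respective variation budgets $B_r, B_p\,(>0)$:

$$
\begin{aligned}
B_r &= \sum_{t=1}^{T-1} B_{r,t}, \text{ where } B_{r,t}=\max_{s\in\mathcal{S},a\in\mathcal{A}_s}\left|r_{t+1}(s,a)-r_t(s,a)\right|,\\
B_p &= \sum_{t=1}^{T-1} B_{p,t}, \text{ where } B_{p,t}=\max_{s\in\mathcal{S},a\in\mathcal{A}_s}\left\|p_{t+1}(\cdot|s,a)-p_t(\cdot|s,a)\right\|_1.
\end{aligned}
\tag{1}
$$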

We emphasize that although $B_r$ and $B_p$ might be used as inputs by the DM, the individual $B_{r,t}$'s and $B_{p,t}$'s are unknown to the DM throughout the paper. We also refer to Remark 2 for the choice of the infinity-norm and the 1-norm in eqn. (1).
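To make eqn. (1) concrete, the following is a minimal sketch that evaluates the two variation budgets for a small, fully specified instance. The function name `variation_budgets` and the dictionary-based encoding of the sequences of mean rewards and transition distributions are illustrative assumptions, not part of the paper; the DM itself never observes these quantities.

```python
from typing import Dict, List, Tuple

StateAction = Tuple[int, int]  # hypothetical encoding of a pair (s, a)


def variation_budgets(
    rewards: List[Dict[StateAction, float]],
    transitions: List[Dict[StateAction, List[float]]],
) -> Tuple[float, float]:
    """Return (B_r, B_p) as in eqn. (1).

    rewards[t][(s, a)] is the mean reward at step t+1, and
    transitions[t][(s, a)] is the next-state distribution at step t+1.
    """
    b_r = 0.0
    b_p = 0.0
    for t in range(len(rewards) - 1):
        # B_{r,t}: largest absolute change in mean reward over all pairs.
        b_r += max(abs(rewards[t + 1][sa] - rewards[t][sa]) for sa in rewards[t])
        # B_{p,t}: largest 1-norm change in the transition distribution.
        b_p += max(
            sum(abs(x - y) for x, y in zip(transitions[t + 1][sa], transitions[t][sa]))
            for sa in transitions[t]
        )
    return b_r, b_p
```

For instance, with two states, one action, and $T=3$, a reward sequence that changes by 0.1 and then by 0.3 (in the worst pair) gives $B_r=0.4$.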

Endogeneity: The DM faces a non-stationary MDP instance $(\mathcal{S},\mathcal{A},T,r,p)$. She knows $\mathcal{S},\mathcal{A},T$, but not $r,p$. The DM starts at an arbitrary state $s_1\in\mathcal{S}$. At time $t$, three events happen. First, the DM observes her current state $s_t$. Second, she takes an action $a_t\in\mathcal{A}_{s_t}$. Third, given $s_t,a_t$, she stochastically transits to another state $s_{t+1}$, which is distributed as $p_t(\cdot|s_t,a_t)$, and receives a stochastic reward $R_t(s_t,a_t)$, which is 1-sub-Gaussian with mean $r_t(s_t,a_t)$. In the second event, the choice of $a_t$ is based on a non-anticipatory policy $\Pi$. That is, the choice only depends on the current state $s_t$ and the previous observations $\mathcal{H}_{t-1}:=\{s_q,a_q,R_q(s_q,a_q)\}_{q=1}^{t-1}$.

Dynamic Regret: The DM aims to maximize the cumulative expected reward $\mathbb{E}[\sum_{t=1}^{T}r_t(s_t,a_t)]$, despite the model uncertainty on $r,p$ and the dynamics of the learning environment. To measure the convergence to optimality, we consider an equivalent objective of minimizing the dynamic regret (Besbes et al. 2014, Jaksch et al. 2010)

$$
\text{Dyn-Reg}_T(\Pi)=\sum_{t=1}^{T}\left\{\rho_t^*-\mathbb{E}[r_t(s_t,a_t)]\right\}.
\tag{2}
$$

In the oracle $\sum_{t=1}^{T}\rho_t^*$, the summand $\rho_t^*$ is the optimal long-term average reward of the stationary MDP with state transition distribution $p_t$ and mean reward $r_t$. The optimum $\rho_t^*$ can be computed by solving linear program (15) provided in Section id1. We note that the same oracle is used for RL in piecewise-stationary MDPs (Jaksch et al. 2010).

Remark 1 (Comparisons with Non-Stationary MAB)

When $S=1$, eqn. (2) reduces to the definition (Besbes et al. 2014) of dynamic regret for the non-stationary $K$-armed bandit. Different from the bandit case, however, the oracle $\sum_{t=1}^{T}\rho_t^*$ does not equal the expected optimum for the non-stationary MDP problem in general. Nevertheless, we justify this choice in Proposition 1.

Remark 2 (Definition of Variation Budgets)

For brevity of exposition, we choose to define the variation budgets (see eqn. (1)) for reward and state transition distributions with the infinity norm and 1 norm, respectively. One can also define them with respect to other commonly used metrics, such as the 2 norm (Cheung et al. 2019b), and this would only affect the dependence on $S$ and $A$ for the established dynamic regret bounds in the subsequent sections.

Next, we review relevant concepts on MDPs, in order to stipulate an assumption that ensures learnability and justifies our oracle.

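Consider a set of states $\mathcal{S}$, a collection $\mathcal{A}=\{\mathcal{A}_s\}_{s\in\mathcal{S}}$ of action sets, and a state transition distribution $\bar{p}=\{\bar{p}(\cdot|s,a)\}_{s\in\mathcal{S},a\in\mathcal{A}_s}$. For any $s,s'\in\mathcal{S}$ and stationary policy $\pi$, the hitting time from $s$ to $s'$ under $\pi$ is the random variable $\Lambda(s'|\pi,s):=\min\{t: s_{t+1}=s',\, s_1=s,\, s_{\tau+1}\sim\bar{p}(\cdot|s_\tau,\pi(s_\tau))\ \forall\tau\}$, which can be infinite. We say that $(\mathcal{S},\mathcal{A},\bar{p})$ is a communicating MDP iff $D:=\max_{s,s'\in\mathcal{S}}\min_{\text{stationary }\pi}\mathbb{E}[\Lambda(s'|\pi,s)]$ is finite. The quantity $D$ is the diameter associated with $(\mathcal{S},\mathcal{A},\bar{p})$.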
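As an illustration of the definition, the diameter of a small, known MDP can be approximated numerically. The sketch below is not taken from the paper: it assumes the transition distribution is given as a dictionary, restricts the maximum to pairs with $s\neq s'$ (the convention of Jaksch et al. 2010), and uses a simple value iteration on minimum expected hitting times, which converges only when the MDP is communicating. The function name `diameter` and its parameters are hypothetical.

```python
from typing import Dict, List, Tuple


def diameter(
    p: Dict[Tuple[int, int], List[float]],
    num_states: int,
    max_iters: int = 100_000,
    tol: float = 1e-10,
) -> float:
    """Approximate max over s != s' of the minimum expected hitting time.

    p[(s, a)][s2] is the probability of moving from s to s2 under action a.
    """
    actions = {s: [a for (s0, a) in p if s0 == s] for s in range(num_states)}
    worst = 0.0
    for target in range(num_states):
        # h[s]: current estimate of the minimum expected steps from s to target.
        h = [0.0] * num_states
        for _ in range(max_iters):
            new_h = [0.0] * num_states
            for s in range(num_states):
                if s == target:
                    continue  # hitting time from the target to itself is not used
                new_h[s] = 1.0 + min(
                    sum(p[(s, a)][s2] * h[s2] for s2 in range(num_states) if s2 != target)
                    for a in actions[s]
                )
            gap = max(abs(x - y) for x, y in zip(new_h, h))
            h = new_h
            if gap < tol:
                break
        worst = max(worst, max(h))
    return worst
```

In a two-state example where state 0 always moves to state 1, and state 1 returns to state 0 with probability 1/2, the hitting times are 1 and 2, so the sketch returns a value close to 2.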

Remark 3 (Diameter and RL in MDPs)

As shown in (Jaksch et al. 2010), the “diameter” plays a fundamental role in characterizing the complexity of RL in MDPs. Intuitively, in order to make informed decisions, the DM has to have accurate estimates of the quantities $r_t(s,a)$'s and $p_t(\cdot|s,a)$'s. In other words, she must visit every state $s\in\mathcal{S}$ and choose each of its available actions $a\in\mathcal{A}_s$ frequently enough to collect relevant samples. Consequently, the harder it is to transition from a state $s$ to another state $s'$, the more the DM would suffer during the learning process, and the diameter of an MDP captures the “hardness” of transitioning between states in this MDP.

With the above remark, we make the following assumption.

Assumption 1 (Bounded Diameters)

For each $t\in[T]$, the tuple $(\mathcal{S},\mathcal{A},p_t)$ constitutes a communicating MDP with diameter at most $D_t$. We denote the maximum diameter as $D_{\max}=\max_{t\in\{1,\ldots,T\}}D_t$.

The following proposition justifies our choice of oracle $\sum_{t=1}^{T}\rho_t^*$.

Proposition 1

Consider an instance $(\mathcal{S},\mathcal{A},T,p,r)$ that satisfies Assumption 1 with maximum diameter $D_{\max}$, and has variation budgets $B_r,B_p$ for the rewards and transition distributions, respectively. In addition, suppose that $T\geq B_r+2D_{\max}B_p>0$. Then it holds that

$$
\sum_{t=1}^{T}\rho_t^*\geq\max_{\Pi}\left\{\mathbb{E}\left[\sum_{t=1}^{T}r_t(s_t^{\Pi},a_t^{\Pi})\right]\right\}-4(D_{\max}+1)\sqrt{(B_r+2D_{\max}B_p)T}.
$$

The maximum is taken over all non-anticipatory policies $\Pi$. We denote $\{(s_t^{\Pi},a_t^{\Pi})\}_{t=1}^{T}$ as the trajectory under policy $\Pi$, where $a_t^{\Pi}\in\mathcal{A}_{s_t^{\Pi}}$ is determined based on $\Pi$ and $\mathcal{H}_{t-1}\cup\{s_t^{\Pi}\}$, and $s_{t+1}^{\Pi}\sim p_t(\cdot|s_t^{\Pi},a_t^{\Pi})$ for each $t$.

The proposition is proved in Section id1 of the appendix. In fact, our dynamic regret bounds (see the forthcoming Theorems 1 and 2) are larger than the error term $4(D_{\max}+1)\sqrt{(B_r+2D_{\max}B_p)T}$, thus justifying the choice of $\sum_{t=1}^{T}\rho_t^*$ as the oracle. It turns out that the oracle $\sum_{t=1}^{T}\rho_t^*$ is more convenient for analysis than the expected optimum, since the former can be decomposed into summations across different intervals, unlike the latter, where the summands are intertwined due to endogenous dynamics, i.e., $s_{t+1}^{\Pi}\sim p_t(\cdot|s_t^{\Pi},a_t^{\Pi})$.

RL in stationary (discounted and un-discounted reward) MDPs has been widely studied in (Burnetas and Katehakis 1997, Bartlett and Tewari 2009, Jaksch et al. 2010, Agrawal and Jia 2017, Fruit et al. 2018a, b, Sidford et al. 2018b, a, Wang 2019, Zhang and Ji 2019, Fruit et al. 2019, Wei et al. 2019). For the discounted reward setting, the authors of (Sidford et al. 2018b, Wang 2019, Sidford et al. 2018a) proposed (nearly) optimal algorithms in terms of sample complexity. For the un-discounted reward setting, the authors of (Jaksch et al. 2010) established a minimax lower bound $\Omega(\sqrt{D_{\max}SAT})$ on the regret when both the reward and state transition distributions are time-invariant. They also designed the UCRL2 algorithm and showed that it attains a regret bound $\tilde{O}(D_{\max}S\sqrt{AT})$. The authors of (Fruit et al. 2019) proposed the UCRL2B algorithm, which is an improved version of the UCRL2 algorithm. The regret bound of the UCRL2B algorithm is $\tilde{O}(S\sqrt{D_{\max}AT}+D_{\max}^2S^2A)$. A minimax optimal algorithm is provided in (Zhang and Ji 2019), although it is not computationally efficient.

In a parallel work (Ortner et al. 2019), the authors considered a similar setting to ours by applying the “forgetting principle” from non-stationary bandit settings (Garivier and Moulines 2011, Cheung et al. 2019a) to design a learning algorithm. To achieve its dynamic regret bound, the algorithm by (Ortner et al. 2019) partitions the entire time horizon $[T]$ into time intervals $\mathcal{I}=\{I_k\}_{k=1}^{K}$, and crucially requires access to $\sum_{t=\min I_k}^{\max I_k-1}B_{r,t}$ and $\sum_{t=\min I_k}^{\max I_k-1}B_{p,t}$, i.e., the variations in both reward and state transition distributions of each interval $I_k\in\mathcal{I}$ (see Theorem 3 in (Ortner et al. 2019)). In contrast, the SWUCRL2-CW algorithm and the BORL algorithm require significantly less information on the variations. Specifically, the SWUCRL2-CW algorithm does not need any additional knowledge on the variations except for $B_r$ and $B_p$, i.e., the variation budgets over the entire time horizon as defined in eqn. (1), to achieve its dynamic regret bound (see Theorem 1). This is similar to algorithms for the non-stationary bandit settings, which only require access to $B_r$ (Besbes et al. 2014). More importantly, the BORL algorithm (built upon the SWUCRL2-CW algorithm) enjoys the same dynamic regret bound even without knowing either $B_r$ or $B_p$ (see Theorem 2).

For online learning and bandit problems where there is only one state, the works by (Auer et al. 2002a, Garivier and Moulines 2011, Besbes et al. 2014, Keskin and Zeevi 2016) proposed several “forgetting” strategies for different non-stationary MAB settings. More recently, the works by (Karnin and Anava 2016, Luo et al. 2018, Cheung et al. 2019b, a, Chen et al. 2019b) designed parameter-free algorithms for non-stationary MAB problems. Another related but different setting is the Markovian bandit (Kim and Lim 2016, Ma 2018), in which the state of the chosen action evolves according to an independent time-invariant Markov chain while the states of the remaining actions stay unchanged. In (Zhou et al. 2020), the authors also considered the case when the states of all the actions are governed by the same (uncontrollable) Markov chain.

In this section, we first describe a unique challenge in RL in non-stationary MDPs, and then present the SWUCRL2-CW algorithm, which incorporates our novel confidence widening technique and sliding window estimates (Garivier and Moulines 2011) into UCRL2 (Jaksch et al. 2010).

For stationary MAB problems, the UCB algorithm (Auer et al. 2002b) suggests the DM should iteratively execute the following two steps in each time step:

1. Estimate the mean reward of each action by taking the time average of all observed samples.

2. Pick the action with the highest estimated mean reward plus the confidence radius, where the radius scales inversely with the square root of the number of observations (Auer et al. 2002b).

The UCB algorithm has been proved to attain optimal regret bounds for various stationary MAB settings (Auer et al. 2002b, Kveton et al. 2015). For non-stationary problems, (Garivier and Moulines 2011, Keskin and Zeevi 2016, Cheung et al. 2019a) showed that the DM could further leverage the forgetting principle by incorporating the sliding-window estimator (Garivier and Moulines 2011) into the UCB algorithms (Auer et al. 2002b, Kveton et al. 2015) to achieve optimal dynamic regret bounds for a wide variety of non-stationary MAB settings. The sliding window UCB algorithm with a window size $W\in\mathbb{R}_+$ is similar to the UCB algorithm, except that the estimated mean rewards are computed by taking the time average of the $W$ most recent observed samples.
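The forgetting principle can be sketched in a few lines. The class below is an illustrative assumption rather than the exact algorithm of (Garivier and Moulines 2011): the class name `SlidingWindowUCB`, the `radius_scale` constant, and the specific radius $\sqrt{\log(\cdot)/n}$ are hypothetical choices made only to show how the two UCB steps change when the statistics are restricted to the $W$ most recent rounds.

```python
import math
from collections import deque
from typing import Deque, List, Tuple


class SlidingWindowUCB:
    """Sliding-window UCB sketch: statistics use only the last `window` rounds."""

    def __init__(self, num_arms: int, window: int, radius_scale: float = 1.0) -> None:
        self.num_arms = num_arms
        self.radius_scale = radius_scale  # hypothetical tuning constant
        # Each entry is (arm, reward); the deque drops the oldest entry automatically.
        self.history: Deque[Tuple[int, float]] = deque(maxlen=window)

    def select_arm(self) -> int:
        counts: List[int] = [0] * self.num_arms
        sums: List[float] = [0.0] * self.num_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Play any arm that has no sample inside the current window.
        for arm in range(self.num_arms):
            if counts[arm] == 0:
                return arm
        log_term = math.log(max(2, len(self.history)))

        def index(arm: int) -> float:
            # Windowed mean plus a radius that shrinks with the windowed count.
            mean = sums[arm] / counts[arm]
            radius = self.radius_scale * math.sqrt(log_term / counts[arm])
            return mean + radius

        return max(range(self.num_arms), key=index)

    def update(self, arm: int, reward: float) -> None:
        self.history.append((arm, reward))
```

Because old samples leave the window, an arm that has not been played recently regains a large radius (or no samples at all) and is re-explored, which is what allows the method to track drifting reward distributions.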

As noted in Section id1, (Jaksch et al. 2010) proposed the UCRL2 algorithm, which is a UCB-like algorithm with nearly optimal regret for RL in stationary MDPs. It is thus tempting to think that one could also integrate the forgetting principle into the UCRL2 algorithm to attain a low dynamic regret bound for RL in non-stationary MDPs. In particular, one could easily design a naive sliding-window UCRL2 algorithm that follows exactly the same steps as the UCRL2 algorithm, with the exception that it uses only the $W$ most recent observed samples instead of all observed samples to estimate the mean rewards and the state transition distributions, and to compute the respective confidence radii.

Under non-stationarity and bandit feedback, however, we show in Proposition 3 of the forthcoming Section id1 that the diameter of the estimated MDP produced by the naive sliding-window UCRL2 algorithm with window size $W$ can be as large as $\Theta(W)$, which is orders of magnitude larger than $D_{\max}$, the maximum diameter of each individual MDP encountered by the DM. Consequently, the naive sliding-window UCRL2 algorithm may result in an undesirable dynamic regret bound. In what follows, we discuss in more detail how our novel confidence widening technique can mitigate this issue.

The SWUCRL2-CW algorithm first specifies a sliding window parameter W∈ℕ and a confidence widening parameter η≥0. Parameter W specifies the number of previous time steps to look at. Parameter η quantifies the amount of additional optimistic exploration, on top of the conventional optimistic exploration using upper confidence bounds. The latter turns out to be helpful for handling the temporal drifts in the state transition distributions (see Section id1).

The algorithm runs in a sequence of episodes that partitions the $T$ time steps. Episode $m$ starts at time $\tau(m)$ (in particular $\tau(1)=1$), and ends at the end of time step $\tau(m+1)-1$. Throughout an episode $m$, the DM follows a certain stationary policy $\tilde{\pi}_{\tau(m)}$. The DM ceases the $m$th episode if at least one of the following two criteria is met:

• The time index $t$ is a multiple of $W$. Consequently, each episode lasts for at most $W$ time steps. The criterion ensures that the DM switches the stationary policy $\tilde{\pi}_{\tau(m)}$ frequently enough, in order to adapt to the exogenous dynamics.

• There exists some state-action pair $(s,a)$ such that $\nu_{\tau(m)}(s,a)$, the number of time steps $t$ with $(s_t,a_t)=(s,a)$ within episode $m$, is at least as large as the total number of counts for it within the $W$ time steps prior to $\tau(m)$, i.e., from $(\tau(m)-W)\vee 1$ to $\tau(m)-1$. This is similar to the doubling criterion in (Jaksch et al. 2010), which ensures that each episode is sufficiently long so that the DM can focus on learning.
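The two stopping criteria can be checked with a single predicate evaluated before each action. The sketch below is a hedged illustration: the function name `continue_episode` and the dictionary-based counters are assumptions, and the comparison uses $\max\{1,N\}$ for the pre-episode count so that a pair never seen in the window still allows one visit.

```python
from typing import Dict, Tuple

StateAction = Tuple[int, int]  # hypothetical encoding of a pair (s, a)


def continue_episode(
    t: int,
    window: int,
    next_pair: StateAction,
    episode_counts: Dict[StateAction, int],
    prior_counts: Dict[StateAction, int],
) -> bool:
    """Return True if the current episode should continue at time step t.

    episode_counts[(s, a)]: visits to (s, a) so far within the episode.
    prior_counts[(s, a)]: visits to (s, a) in the W steps before the episode.
    next_pair: the pair (s_t, policy(s_t)) about to be played.
    """
    # Criterion 1: stop whenever t is a multiple of the window size W.
    if t % window == 0:
        return False
    # Criterion 2: stop once the in-episode count reaches max(1, prior count).
    n_plus = max(1, prior_counts.get(next_pair, 0))
    return episode_counts.get(next_pair, 0) < n_plus
```

For example, with $W=10$, an episode that has visited a pair twice continues at $t=5$ if that pair was seen three times in the preceding window, and stops at $t=10$ regardless of the counts.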

The combined effect of these two criteria allows the DM to learn a low dynamic regret policy with historical data from an appropriately sized time window and confidence widening parameter. One important ingredient is the construction of the policy $\tilde{\pi}_{\tau(m)}$ for each episode $m$. To allow learning under endogenous and exogenous dynamics, the SWUCRL2-CW algorithm computes the policy $\tilde{\pi}_{\tau(m)}$ based on the history in the $W$ time steps prior to the current episode $m$, i.e., from round $(\tau(m)-W)\vee 1$ to round $\tau(m)-1$. The construction of $\tilde{\pi}_{\tau(m)}$ involves the Extended Value Iteration (EVI) (Jaksch et al. 2010), which requires the confidence regions $H_{r,\tau(m)},H_{p,\tau(m)}(\eta)$ for rewards and state transition distributions as the inputs, in addition to a precision parameter $\epsilon$. The confidence widening parameter $\eta\geq 0$ is capable of ensuring that the MDP output by the EVI has a bounded diameter most of the time.

To describe the SWUCRL2-CW algorithm, we first define for each state-action pair $(s,a)$ and each time $t$ in episode $m$,

$$
N_t(s,a)=\sum_{q=(\tau(m)-W)\vee 1}^{t-1}\mathbf{1}\big((s_q,a_q)=(s,a)\big),\qquad N_t^+(s,a)=\max\{1,N_t(s,a)\}.
\tag{3}
$$

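For each state-action pair $(s,a)$ and each time $t$ in episode $m$, we consider the empirical mean estimator

$$
\hat{r}_t(s,a)=\frac{1}{N_t^+(s,a)}\left(\sum_{q=(\tau(m)-W)\vee 1}^{t-1}R_q(s,a)\,\mathbf{1}(s_q=s,a_q=a)\right),
$$

which serves to estimate the average reward

$$
\bar{r}_t(s,a)=\frac{1}{N_t^+(s,a)}\sum_{q=(\tau(m)-W)\vee 1}^{t-1}r_q(s,a)\,\mathbf{1}(s_q=s,a_q=a).
\tag{4}
$$

The confidence region $H_{r,t}=\{H_{r,t}(s,a)\}_{s\in\mathcal{S},a\in\mathcal{A}_s}$ for the rewards is constructed around $\hat{r}_t(s,a)$ with a corresponding confidence radius.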

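Inner loop of the SWUCRL2-CW algorithm within episode $m$ (fragment of Algorithm id1):

    while t is not a multiple of W and $\nu_{\tau(m)}(s_t,\tilde{\pi}_{\tau(m)}(s_t))<N_{\tau(m)}^+(s_t,\tilde{\pi}_{\tau(m)}(s_t))$ do
        Choose action $a_t=\tilde{\pi}_{\tau(m)}(s_t)$, observe reward $R_t(s_t,a_t)$ and the next state $s_{t+1}$.
        Update $\nu_{\tau(m)}(s_t,a_t)\leftarrow\nu_{\tau(m)}(s_t,a_t)+1$, $t\leftarrow t+1$.
        if $t>T$ then
            The algorithm is terminated.
        end if
    end while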

For each state-action pair $(s,a)$ and each time step $t$ in episode $m$, we consider the empirical mean estimator

$$
\hat{p}_t(s'|s,a)=\frac{1}{N_t^+(s,a)}\left(\sum_{q=(\tau(m)-W)\vee 1}^{t-1}\mathbf{1}(s_q=s,a_q=a,s_{q+1}=s')\right),
$$

which serves to estimate the average transition probability

$$
\bar{p}_t(s'|s,a)=\frac{1}{N_t^+(s,a)}\sum_{q=(\tau(m)-W)\vee 1}^{t-1}p_q(s'|s,a)\,\mathbf{1}(s_q=s,a_q=a).
\tag{5}
$$

Different from the case of estimating reward, the confidence region $H_{p,t}(\eta)=\{H_{p,t}(s,a;\eta)\}_{s\in\mathcal{S},a\in\mathcal{A}_s}$ for the transition probability involves a widening parameter $\eta\geq 0$:

$$
H_{p,t}(s,a;\eta)=\left\{\dot{p}\in\Delta^{\mathcal{S}}:\left\|\dot{p}(\cdot|s,a)-\hat{p}_t(\cdot|s,a)\right\|_1\leq\mathsf{rad}\text{-}_{p,t}(s,a)+\eta\right\},
\tag{6}
$$

with confidence radius $\mathsf{rad}\text{-}_{p,t}(s,a)=2\sqrt{2S\log(SAT/\delta)/N_t^+(s,a)}$. With $\eta>0$, the DM can explore state transition distributions that deviate from the sample average, and this exploration is crucial for learning MDPs under endogenous and exogenous dynamics. In a nutshell, the incorporation of $\eta$ provides an additional source of optimism. We treat $\eta$ as a hyper-parameter at the moment, and provide a suitable choice of $\eta$ when we discuss our main results (see Theorem 1).
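The windowed estimator and the widened region of eqn. (6) can be sketched as follows. The function names, the list-of-tuples encoding of the trajectory, and the 1-indexed time convention are illustrative assumptions; the caller is assumed to supply the window start $(\tau(m)-W)\vee 1$.

```python
import math
from typing import List, Sequence, Tuple

Transition = Tuple[int, int, int]  # hypothetical record (s_q, a_q, s_{q+1})


def windowed_transition_estimate(
    history: Sequence[Transition],
    s: int,
    a: int,
    num_states: int,
    window_start: int,
    t: int,
) -> Tuple[List[float], int]:
    """Return (p_hat_t(.|s,a), N_t^+(s,a)) using steps window_start..t-1.

    history[q - 1] holds the transition observed at time step q (1-indexed).
    """
    counts = [0] * num_states
    for q in range(window_start, t):
        s_q, a_q, s_next = history[q - 1]
        if (s_q, a_q) == (s, a):
            counts[s_next] += 1
    n_plus = max(1, sum(counts))
    return [c / n_plus for c in counts], n_plus


def radius_p(n_plus: int, num_states: int, num_actions: float,
             horizon: int, delta: float) -> float:
    """Confidence radius 2 * sqrt(2 S log(S A T / delta) / N_t^+(s,a))."""
    log_term = math.log(num_states * num_actions * horizon / delta)
    return 2.0 * math.sqrt(2.0 * num_states * log_term / n_plus)


def in_widened_region(p_dot: Sequence[float], p_hat: Sequence[float],
                      radius: float, eta: float) -> bool:
    """Check the 1-norm condition of eqn. (6) for one state-action pair."""
    l1 = sum(abs(x - y) for x, y in zip(p_dot, p_hat))
    return l1 <= radius + eta
```

A candidate distribution that lies just outside the un-widened region ($\eta=0$) can fall inside the region once $\eta$ is increased, which is precisely the extra optimism that confidence widening provides.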

The SWUCRL2-CW algorithm relies on the EVI, which solves MDPs with optimistic exploration to near-optimality. We extract and rephrase a description of EVI in Section id1 of the appendix. EVI inputs the confidence regions Hr,Hp for the rewards and the state transition distributions. The algorithm outputs an “optimistic MDP model”, which consists of reward vector r~ and state transition distribution p~ under which the optimal average gain ρ~ is the largest among all r˙∈Hr,p˙∈Hp:

Output: The returned policy π~ and the auxiliary output (r~,p~,ρ~,γ~). In the latter, r~, p~, and ρ~ are the selected “optimistic” reward vector, state transition distribution, and the corresponding long-term average reward. The output γ~∈ℝ+𝒮 is a bias vector (Jaksch et al. 2010). For each s∈𝒮, the quantity γ~(s) is indicative of the short-term reward when the DM starts at state s and follows the optimal policy. By the design of EVI, for the output γ~, there exists s∈𝒮 such that γ~(s)=0. Altogether, we express

EVI⁢(Hr,Hp;ϵ)→(π~,r~,p~,ρ~,γ~).

Combining the three components, a formal description of the SWUCRL2-CW algorithm is shown in Algorithm id1.

We now analyze the performance of the SWUCRL2-CW algorithm. First, we introduce two events ℰr,ℰp, which state that the estimated reward and state transition distributions lie in the respective (un-widened) confidence regions.

$$\mathcal E_r=\{\bar r_t(s,a)\in H_{r,t}(s,a)\ \ \forall s,a,t\},\qquad \mathcal E_p=\{\bar p_t(\cdot|s,a)\in H_{p,t}(s,a;0)\ \ \forall s,a,t\}.$$

We prove that ℰr,ℰp hold with high probability.

Lemma 1

We have Pr[ℰr]≥1-δ/2 and Pr[ℰp]≥1-δ/2.

The proof of Lemma 1 is provided in Section id1 of the appendix. In defining ℰp, the widening parameter η is set to be 0, since we are only concerned with the estimation error on p. Next, we bound the dynamic regret of each time step, under certain assumptions on Hp,t⁢(η). To facilitate our discussion, we define the following variation measure for each t in an episode m:

$$\mathsf{var}_{r,t}=\sum_{q=(\tau(m)-W)\vee 1}^{t-1}B_{r,q},\qquad \mathsf{var}_{p,t}=\sum_{q=(\tau(m)-W)\vee 1}^{t-1}B_{p,q}.$$

Proposition 2

Consider an episode m. Conditioned on the events ℰr, ℰp, suppose that there exists a state transition distribution p satisfying two properties: (1) for all s∈𝒮 and a∈𝒜s, we have p(⋅|s,a)∈Hp,τ(m)(s,a;η), and (2) the diameter of (𝒮,𝒜,p) is at most D. Then, for every t∈{τ(m),…,τ(m+1)-1} in episode m, we have

Remark 4 (Confidence Widening)

Similar to the regret analysis of the UCRL2 algorithm (Section 4 of (Jaksch et al. 2010)) and the UCRL2B algorithm (Lemma 3 and eqn. (10) of (Fruit et al. 2019)), Proposition 2 states that, if the confidence region Hp,τ(m)(η) contains a state transition distribution with diameter at most D, then the EVI provided with Hp,τ(m)(η) returns a policy whose dynamic regret grows at most linearly with D during episode m. However, as shown in Section id1 later, the parameter η has to be carefully chosen for D to be small: with η=0, the diameter of every state transition distribution in Hp,τ(m)(0) can grow as $\tilde\Omega(\sqrt W)$ in the worst case, which might result in an unfavorable dynamic regret bound. Here, the parameter η is the keystone of our novel confidence widening technique and the resulting dynamic regret bound: as η increases, the confidence region Hp,τ(m)(s,a;η) becomes larger for each state-action pair (s,a). Consider the first time step τ(m) of each episode m: if pτ(m)(⋅|s,a)∈Hp,τ(m)(s,a;η) for all state-action pairs (s,a), then Proposition 2 can be leveraged; otherwise, the widened confidence region enforces that a considerable amount of variation budget is consumed.

Remark 6 (Connections with Regret Bounds for RL in Stationary MDPs)

When r1=…=rT and p1=…=pT, our problem becomes the RL in stationary MDPs problem studied by (Jaksch et al. 2010), and the SWUCRL2-CW algorithm with W=T and η=0 recovers the regret bound $\tilde O(D_{\max}S\sqrt{AT})$ of the UCRL2 algorithm studied in (Jaksch et al. 2010).

Remark 7 (Dynamic Regret Bound without Knowing Br and Bp)

Similar to (Cheung et al. 2019b, a), if Bp, Br are not known, we can set W and η obliviously as $W=S^{2/3}A^{1/2}T^{1/2}$, $\eta=W/T=S^{2/3}A^{1/2}T^{-1/2}$ to obtain a dynamic regret bound $\tilde O(D_{\max}(B_r+B_p+1)S^{2/3}A^{1/2}T^{3/4})$.
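As a quick numerical illustration of the oblivious choice in Remark 7, the snippet below evaluates W and η for given S, A, T. The function name is hypothetical, and rounding W to an integer window size is left out because the text does not specify it.

```python
def oblivious_parameters(S, A, T):
    """W = S^(2/3) A^(1/2) T^(1/2) and eta = W / T, as stated in Remark 7."""
    W = S ** (2 / 3) * A ** 0.5 * T ** 0.5
    eta = W / T
    return W, eta

W_obl, eta_obl = oblivious_parameters(S=8, A=4, T=10_000)
```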

As pointed out by Remark 7, in the case of unknown Br and Bp, the dynamic regret of the SWUCRL2-CW algorithm scales linearly in Br and Bp. However, by Theorem 1, we are assured that a fixed pair of parameters (W*,η*) can ensure low dynamic regret. For the bandit setting, (Cheung et al. 2019a, b) propose the bandit-over-bandit framework that uses a separate copy of the EXP3 algorithm to tune the window size. Inspired by it, we develop a novel Bandit-over-Reinforcement Learning (BORL) algorithm with a parameter-free $\tilde O(D_{\max}(B_r+B_p+1)^{1/4}S^{2/3}A^{1/2}T^{3/4})$ dynamic regret bound.

Following a similar line of reasoning as (Cheung et al. 2019a), we make use of the SWUCRL2-CW algorithm as a sub-routine, and “hedge” (Bubeck and Cesa-Bianchi 2012) against the (possibly adversarial) changes of rt’s and pt’s to identify a reasonable fixed window size and confidence widening parameter.

As illustrated in Fig. 1, the BORL algorithm divides the whole time horizon into ⌈T/H⌉ blocks of equal length H (the last block may be shorter than H), and specifies a set J from which each pair of (window size, confidence widening) parameters is drawn. For each block i∈[⌈T/H⌉], the BORL algorithm first calls a master algorithm to select a pair of (window size, confidence widening) parameters (Wi,ηi)∈J, and restarts the SWUCRL2-CW algorithm with the selected parameters as a sub-routine to choose actions for this block. Afterwards, the total reward of block i is fed back to the master, and the “posterior” of these parameters is updated accordingly.
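The block structure can be sketched as follows. This is only an illustration of the partition of the horizon; the helper name `blocks` is an assumption, and the master algorithm and the sub-routine are omitted.

```python
def blocks(T, H):
    """Return the 1-indexed inclusive (start, end) rounds of each block of length H."""
    return [(start, min(start + H - 1, T)) for start in range(1, T + 1, H)]

example_blocks = blocks(10, 4)
```

Here `example_blocks` contains ⌈10/4⌉=3 blocks, the last of which covers only rounds 9 and 10.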

One immediate challenge that is absent in the bandit setting (Cheung et al. 2019b) is that the starting state of each block is determined by the previous moves of the DM. Hence, the master algorithm does not face a simple oblivious environment as in (Cheung et al. 2019b), and we cannot use the EXP3 algorithm (Auer et al. 2002a) as the master. Fortunately, the state is observed before the start of each block. Thus, we use the EXP3.P algorithm for multi-armed bandits against an adaptive adversary (Auer et al. 2002a, Bubeck and Cesa-Bianchi 2012) as the master algorithm. We follow the exposition in Section 3.2 of (Bubeck and Cesa-Bianchi 2012) for adapting the EXP3.P algorithm.

Figure 1: Structure of the BORL algorithm

We are now ready to state the details of the BORL algorithm. For some fixed choice of block length H (to be determined later), we first define a couple of additional notations:

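$$H=\lfloor 3S^{2/3}A^{1/2}T^{1/2}\rfloor,\quad \Phi=\frac{1}{2T^{1/2}},\quad \Delta_W=\lfloor\ln H\rfloor,\quad \Delta_\eta=\lfloor\ln\Phi^{-1}\rfloor,\quad \Delta=(\Delta_W+1)(\Delta_\eta+1), \tag{9}$$

$$J_W=\{H^0,\lfloor H^{1/\Delta_W}\rfloor,\ldots,H\},\quad J_\eta=S^{1/3}A^{1/4}\times\{\Phi^0,\Phi^{1/\Delta_\eta},\ldots,\Phi\},\quad J=\{(W,\eta):W\in J_W,\ \eta\in J_\eta\}.$$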

Here, JW and Jη are all possible choices of window size and confidence widening parameter, respectively, and J is the Cartesian product of them with |J|=Δ. We also let 𝖱i⁢(W,η,s) be the total rewards for running the SWUCRL2-CW algorithm with window size W and confidence widening parameter η for block i starting from state s,

The EXP3.P algorithm treats each element of J as an arm. It begins by initializing

Then it sets (ji,ki)=(j,k) with probability u(j,k),i for each (j,k)∈M. The selected pair of parameters is thus $W_i=\lfloor H^{j_i/\Delta_W}\rfloor$ and $\eta_i=\lfloor\Phi^{k_i/\Delta_\eta}\rfloor$. Afterwards, the BORL algorithm starts from state s(i-1)H+1 and selects actions by running the SWUCRL2-CW algorithm with window size Wi and confidence widening parameter ηi for each round t in block i. At the end of the block, the BORL algorithm observes the total rewards 𝖱(Wi,ηi,s(i-1)H+1). As a last step, it rescales 𝖱(Wi,ηi,s(i-1)H+1) by dividing it by H so that it is within [0,1], and updates

The formal description of the BORL algorithm (with H defined in the next subsection) is shown below.

Algorithm: BORL algorithm

    Input: Time horizon T, state space 𝒮, action space 𝒜, and initial state s1.
    Initialize H, Φ, ΔW, Δη, Δ, JW, Jη according to eqn. (9), and α, β, γ according to eqn. (10).
    M ← {(j′,k′): j′∈{0,1,…,ΔW}, k′∈{0,1,…,Δη}}, q(j,k),1 ← 0 for all (j,k)∈M.
    for i=1,2,…,⌈T/H⌉ do
        Define the distribution (u(j,k),i)(j,k)∈M according to eqn. (11), and set (ji,ki) ← (j,k) with probability u(j,k),i.
        $W_i\leftarrow\lfloor H^{j_i/\Delta_W}\rfloor$, $\eta_i\leftarrow\lfloor\Phi^{k_i/\Delta_\eta}\rfloor$.
        for t=(i-1)H+1,…,(i⋅H)∧T do
            Run the SWUCRL2-CW algorithm with window size Wi and confidence widening parameter ηi, and observe the total rewards 𝖱(Wi,ηi,s(i-1)H+1).
        end for
        Update q(j,k),i+1 according to eqn. (12).
    end for
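The candidate sets in eqn. (9) can be enumerated directly. The sketch below is a hedged illustration: it assumes Φ=1/(2T^{1/2}) and the prefactor S^{1/3}A^{1/4} on Jη as displayed in (9), assumes T is large enough that ΔW and Δη are positive, and uses the hypothetical name `borl_grid`. It does not implement the EXP3.P master.

```python
import math

def borl_grid(S, A, T):
    """Enumerate block length H and the candidate sets J_W, J_eta from eqn. (9)."""
    H = math.floor(3 * S ** (2 / 3) * A ** 0.5 * T ** 0.5)
    Phi = 1.0 / (2.0 * T ** 0.5)
    d_W = math.floor(math.log(H))            # Delta_W
    d_eta = math.floor(math.log(1.0 / Phi))  # Delta_eta
    J_W = [math.floor(H ** (j / d_W)) for j in range(d_W + 1)]
    scale = S ** (1 / 3) * A ** 0.25
    J_eta = [scale * Phi ** (k / d_eta) for k in range(d_eta + 1)]
    return H, J_W, J_eta

H, J_W, J_eta = borl_grid(S=2, A=2, T=5000)
num_arms = len(J_W) * len(J_eta)  # |J| = (Delta_W + 1)(Delta_eta + 1)
```

The window sizes are geometrically spaced between 1 and H, so the master only has to choose among a number of arms that is polylogarithmic in T.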

The dynamic regret guarantee of the BORL algorithm is stated as follows.

Theorem 2

Assume S>1. With probability 1-O(δ), the dynamic regret of the BORL algorithm is $\tilde O(D_{\max}(B_r+B_p+1)^{1/4}S^{2/3}A^{1/2}T^{3/4})$.

In stochastic online learning problems, one usually estimates a latent quantity by taking the time average of observed samples, even when the sample distribution varies across time. This has been proved to work well in stationary and non-stationary bandit settings (Auer et al. 2002b, Garivier and Moulines 2011, Cheung et al. 2019a, b). To extend to RL, it is natural to consider the sample average transition distribution p^t, which uses the data in the previous W rounds to estimate the time average transition distribution p¯t to within an additive error $\tilde O(1/\sqrt{N_t^{+}(s,a)})$ (see Section id1 and Lemma 1). In the case of stationary MDPs, where pt=p for all t∈[T], one has p¯t=p. Thus, the un-widened confidence region Hp,t(0) contains p with high probability (see Lemma 1). Consequently, the UCRL2 algorithm by (Jaksch et al. 2010), which optimistically explores Hp,t(0), has a regret that scales linearly with the diameter of p.

The approach of optimistically exploring Hp,t(0) is further extended to RL in piecewise-stationary MDPs by (Jaksch et al. 2010, Gajane et al. 2018). The latter establishes an $O(\ell^{1/3}D_{\max}^{2/3}S^{2/3}A^{1/3}T^{2/3})$ dynamic regret bound when there are at most ℓ changes. Their analyses involve partitioning the T-round horizon into $C\cdot T^{1/3}$ equal-length intervals, where C is a constant dependent on Dmax, S, A, ℓ. At least $CT^{1/3}-\ell$ intervals enjoy stationary environments, and optimistically exploring Hp,t(0) in these intervals yields a dynamic regret bound that scales linearly with Dmax. Bounding the dynamic regret of the remaining intervals by their lengths and tuning C yield the desired bound.

In contrast to the stationary and piecewise-stationary settings, optimistic exploration on Hp,t⁢(0) might lead to unfavorable dynamic regret bounds in non-stationary MDPs. In the non-stationary environment where pt-W,…,pt-1 are generally distinct, we show that it is impossible to bound the diameter of p¯t in terms of the maximum of the diameters of pt-W,…,pt-1. More generally, we demonstrate the previous claim not only for p¯t, but also for every p~∈Hp,t⁢(0) in the following Proposition. The Proposition showcases the unique challenge in exploring non-stationary MDPs that is absent in the piecewise-stationary MDPs, and motivates our notion of confidence widening with η>0. To ease the notation, we put t=W+1 without loss of generality.

Proposition 3

There exists a sequence of non-stationary MDP transition distributions p1,…,pW such that

• The diameter of (𝒮,𝒜,pn) is 1 for each n∈[W].

• The total variation in the state transition distributions is O(1).

Nevertheless, under some deterministic policy,

1. The empirical MDP (𝒮,𝒜,p^W+1) has diameter Θ(W).

2. Further, for every p~∈Hp,W+1(0), the MDP (𝒮,𝒜,p~) has diameter $\Omega(\sqrt{W/\log W})$.

Proof. The sequence p1,…,pW alternates between the following two instances $p^1,p^2$. Define the common state space 𝒮={1,2} and action collection 𝒜={𝒜1,𝒜2}, where 𝒜1={a1,a2} and 𝒜2={b1,b2}. We assume all the state transitions are deterministic, and a graphical illustration is presented in Fig. 2. Clearly, both instances have diameter 1.

Figure 2: Example MDPs. Since the transitions are deterministic, the probabilities are omitted.

Now, consider the following two deterministic and stationary policies π1:π1⁢(1)=a1,π1⁢(2)=b2, and π2:π2⁢(1)=a2,π2⁢(2)=b1. Since the MDP is deterministic, we have p^W+1=p¯W+1.

In the following, we construct a trajectory where the DM alternates between policies π1,π2 during time {1,…,W} while the underlying transition distribution alternates between the two instances. In the construction, the DM is almost always at the self-loop at state 1 (or 2) throughout the horizon, no matter what action a1,a2 (or b1,b2) she takes. Consequently, the DM is tricked into thinking that p^W+1(1|1,ai)≈1 for each i∈{1,2}, and likewise p^W+1(2|2,bi)≈1 for each i∈{1,2}. Altogether, this leads the DM to conclude that (𝒮,𝒜,p^W+1) constitutes a high-diameter MDP, since the probabilities of transiting from state 1 to 2 (and from 2 to 1) are close to 0.

The construction is detailed as follows. Let W=4⁢τ. In addition, let the state transition distributions be

$$p_1=\ldots=p_\tau=p^1,\quad p_{\tau+1}=\ldots=p_{2\tau}=p^2,\quad p_{2\tau+1}=\ldots=p_{3\tau}=p^1,\quad p_{3\tau+1}=\ldots=p_{4\tau}=p^2.$$

The DM starts at state 1. She follows policy π1 from time 1 to time 2⁢τ, and then policy π2 from 2⁢τ+1 to 4⁢τ.

Under the specified MDP models and policies, it can be readily verified that the DM takes action a1 from time 1 to τ+1, action b2 from time τ+2 to 2⁢τ, action b1 from time 2⁢τ+1 to 3⁢τ+1, and action a2 from time 3⁢τ+2 to 4⁢τ. As a result, the DM is at state 1 from time 1 to τ+1, state 2 from time τ+2 to 3⁢τ+1, and eventually state 1 from time 3⁢τ+2 to 4⁢τ as depicted in Fig. 3.

Figure 3: Illustration of the latent MDPs, policies, and state visits.

It can be readily verified that the diameter of (𝒮,𝒜,p^W+1) is τ+1=Θ(W). Finally, for the confidence region Hp,W+1(0)={Hp,W+1(s,a;0)}s,a constructed without confidence widening, for any p~∈Hp,W+1(0) we have

respectively, since the stochastic confidence radii $\Theta(\sqrt{\log W/(\tau+1)})$ and $\Theta(\sqrt{\log W/(\tau-1)})$ dominate the sample means $1/(\tau+1)$ and 0. Therefore, for any p~∈Hp,W+1(0), the diameter of the MDP (𝒮,𝒜,p~) is at least $\Omega(\sqrt{W/\log W})$. □
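The trajectory in the proof can be replayed numerically. The sketch below is an illustration under an explicit assumption: Fig. 2 is not reproduced in the text, so the two deterministic instances `P1` and `P2` are inferred from the state visits described above (each instance has diameter 1). All names are hypothetical.

```python
from fractions import Fraction

# Assumed deterministic instances: (state, action) -> next state.
P1 = {(1, "a1"): 1, (1, "a2"): 2, (2, "b1"): 2, (2, "b2"): 1}
P2 = {(1, "a1"): 2, (1, "a2"): 1, (2, "b1"): 1, (2, "b2"): 2}
PI1 = {1: "a1", 2: "b2"}   # policy followed from time 1 to 2*tau
PI2 = {1: "a2", 2: "b1"}   # policy followed from time 2*tau + 1 to 4*tau

def simulate(tau):
    """Replay the W = 4*tau steps; return empirical leave-probabilities and visit counts."""
    counts, moves, s = {}, {}, 1
    for t in range(1, 4 * tau + 1):
        env = P1 if ((t - 1) // tau) % 2 == 0 else P2
        a = (PI1 if t <= 2 * tau else PI2)[s]
        s_next = env[(s, a)]
        counts[(s, a)] = counts.get((s, a), 0) + 1
        if s_next != s:
            moves[(s, a)] = moves.get((s, a), 0) + 1
        s = s_next
    leave = {sa: Fraction(moves.get(sa, 0), n) for sa, n in counts.items()}
    return leave, counts

def empirical_diameter(leave):
    """Largest expected hitting time between the two states under the empirical model."""
    best_from_1 = max(leave[(1, "a1")], leave[(1, "a2")])
    best_from_2 = max(leave[(2, "b1")], leave[(2, "b2")])
    return max(1 / best_from_1, 1 / best_from_2)

leave, counts = simulate(tau=5)
diameter_hat = empirical_diameter(leave)
```

Under these assumed instances, each state is left only once in τ+1 visits of one action and never under the other, so the empirical diameter equals τ+1 even though every individual instance has diameter 1.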

Remark 8

In Proposition 3, there are two reasons for the discrepancy between the individual MDPs p1,…,pW and the MDPs in the un-widened confidence region Hp,W+1⁢(0):

• First, due to the bandit feedback, the samples used to construct p^W+1 come from different state-action pairs at different times. As a result, p^W+1 and p¯W+1 can be very different from each of the individual state transition probability distributions p1,…,pW.

• Second, the number of visits to each state-action pair is roughly W/4, which means we would have very “narrow” confidence regions (of the order $\tilde O(1/\sqrt W)$) if we follow standard optimistic exploration techniques based on concentration inequalities (i.e., the confidence regions shrink as the number of samples grows).

Critically, as shown in Proposition 2 (as well as in Section 4 of (Jaksch et al. 2010) and in Lemma 3 and eqn. (10) of (Fruit et al. 2019)), the minimum diameter of the MDPs in the confidence regions plays a key role in leading to low (dynamic) regret bounds. We thus believe that the caveat in learning non-stationary MDPs via conventional optimistic exploration is fundamental in general. In the current paper, we leverage our novel confidence widening technique to prevent the confidence regions from becoming too narrow even when we have many samples.

Remark 9

Inspecting the prevalent OFU-guided approach for stochastic MAB and RL in MDPs settings (Auer et al. 2002b, Abbasi-Yadkori et al. 2011, Jaksch et al. 2010, Bubeck and Cesa-Bianchi 2012, Lattimore and Szepesvári 2018), one usually concludes that a tighter design of the confidence region can result in a lower (dynamic) regret bound. In (Abernethy et al. 2016), this insight has been formalized in stochastic K-armed bandit settings via a potential function type argument. Nevertheless, Proposition 3 (together with Theorem 1) demonstrates that using the tightest confidence region in learning algorithm design may not be enough to ensure a low dynamic regret bound for RL in non-stationary MDPs.

As demonstrated in previous sections, running the proposed algorithms with the widened confidence regions can help the DM to attain provably low dynamic regret in general RL in non-stationary MDPs. Nevertheless, confidence widening is not always necessary if the state transition distributions bear a special structure. In particular, we consider the following assumption on the state transition distributions p1,…,pT.

Assumption 2

There exists a positive quantity (not necessarily known to the DM) ζ∈R+, such that for any pair of states s,s′∈S, there is an action a(s,s′)∈As that satisfies pt⁢(s′|s,a(s,s′))≥ζ for all t∈[T].

We can now analyze the dynamic regret bound of the SWUCRL2-CW algorithm under Assumption 2. Here, we follow the notations introduced in Section id1 for consistency. In general, Assumption 2 ensures that for every time step t∈[T], there exists a state transition distribution p∈Hp,t⁢(0) such that the induced diameter of the MDP (𝒮,𝒜,p) is upper bounded by the constant D¯:=1/ζ with high probability.

Proposition 4

Under Assumption 2 and conditioned on the event Ep, there exists a state transition distribution p in the confidence region Hp,t⁢(0), such that the induced diameter of the MDP (S,A,p) is at most D¯:=1/ζ for all t∈[T].

The proof of Proposition 4 is provided in Section id1 of the appendix. The proposition indicates that the DM can achieve a bounded dynamic regret by implementing the SWUCRL2-CW algorithm with η=0. To analyze its dynamic regret bound, we provide a variation of Proposition 2 as follows.

Proposition 5

Consider an episode m. Conditioned on the events ℰr, ℰp, for every t∈{τ(m),…,τ(m+1)-1} in episode m, we have

The proof is similar to that of Proposition 2 with Dτ⁢(m) replaced by D¯ and η set to 0, respectively. We are now ready to state the dynamic regret bound of the SWUCRL2-CW algorithm when Assumption 2 holds.

In this subsection, we first elaborate on Assumption 2 in the context of single non-perishable item inventory control problem with zero lead time, fixed cost, and lost sales similar to (Yuan et al. 2019), and then demonstrate how to implement the SWUCRL2-CW algorithm for this problem. For each time step t∈[T] of the inventory control problem (with some abuse of notations), the following sequence of events happens:

1. The seller first observes her stock level st, and decides the quantity at to order.

2. If at>0, a fixed cost f and a per-unit ordering cost c are incurred, and the order arrives instantaneously. The stock level then becomes st+at.

3. The demand Xt is realized, and the seller observes the censored demand Yt=min{Xt,st+at}. The DM faces non-stationary demands, in the sense that the demands X1,…,XT at time steps 1,…,T are independent but not identically distributed.

4. Unfulfilled demand incurs a per-unit lost sales cost l, while excess inventory leads to a per-unit holding cost h. The total cost for time step t is

$$C_t(s_t,a_t)=f\cdot\mathbf 1[a_t>0]+c\cdot a_t+l\cdot[X_t-s_t-a_t]^{+}+h\cdot[s_t+a_t-X_t]^{+}. \tag{13}$$

Due to demand censoring, the cost is not observable.
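The sequence of events above can be condensed into a single simulation step. The function below is a minimal sketch with a hypothetical name; it takes the realized demand x as an input for simulation purposes, whereas the seller only observes the censored demand.

```python
def inventory_step(s, a, x, f, c, l, h):
    """One period with stock s, order a, and realized demand x.

    Returns (cost from eqn. (13), censored demand Y_t, next stock level).
    """
    stock = s + a                      # stock after the order arrives
    y = min(x, stock)                  # censored demand
    cost = f * (a > 0) + c * a + l * max(x - stock, 0) + h * max(stock - x, 0)
    return cost, y, stock - y

lost_sales_case = inventory_step(s=2, a=3, x=7, f=1.0, c=0.5, l=2.0, h=0.25)
holding_case = inventory_step(s=2, a=3, x=4, f=1.0, c=0.5, l=2.0, h=0.25)
```

In the first call two units of demand are lost and the stock drops to zero; in the second, one unit is carried over and incurs the holding cost.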

The seller’s objective is to minimize the cumulative total cost ∑t=1TCt⁢(st,at). To map this into the non-stationary MDP model we described in Section id1, we represent the level of stock at the beginning of each time step as the state. Same as (Yuan et al. 2019) (and similar to (Huh and Rusmevichientong 2009, Zhang et al. 2018, Agrawal and Jia 2019)), we assume the DM has a limited shelf capacity, and she can hold at most S units of inventory at any time. Consequently, 𝒮={0,…,S}, and 𝒜s={0,…,S-s} for each s∈𝒮. We also define the reward and state transition distributions for all t∈[T],s,s′∈𝒮, and a∈𝒜s as follows,

Rt⁢(s,a)=-Ct⁢(s,a) and pt⁢(s′|s,a)=Pr⁡(s+a-min⁡{s+a,Xt}=s′).

However, it is worth emphasizing that, different than our setup in Section id1, Rt⁢(s,a) is not observable as Ct⁢(s,a) is not observable. Nevertheless, we shall demonstrate in Section id1 that one could use the technique of pseudo-reward proposed in (Agrawal and Jia 2019) to bypass this issue.

Following Assumption 2, we make the strictly positive probability mass function (PMF) assumption on X1,…,XT.

Assumption 3 (Strictly Positive PMF)

There is a ζ>0 such that Pr⁡(Xt=s)≥ζ>0 for all t∈[T] and s∈{0,…,S}.

Remark 10

It can be readily verified that if the demands satisfy the strictly positive PMF assumption, the underlying inventory control problem satisfies Assumption 2. Indeed, the DM could transit from a state s∈S to another state s′∈S with probability at least ζ by ordering S-s units of the item, since then pt⁢(s′|s,S-s)=Pr⁡(Xt=S-s′)≥ζ.
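The argument in Remark 10 can be checked numerically for a given demand distribution. The sketch below builds pt(⋅|s,a) from a demand PMF and verifies that ordering up to S reaches every state with probability at least ζ. The PMF and the helper name `transition_row` are illustrative assumptions.

```python
def transition_row(s, a, pmf, S):
    """p_t(. | s, a) for next stock s' = s + a - min(s + a, X_t), with Pr(X_t = x) = pmf[x]."""
    stock = s + a
    row = [0.0] * (S + 1)
    for x, prob in enumerate(pmf):
        row[stock - min(stock, x)] += prob
    return row

S = 3
pmf = [0.1, 0.2, 0.3, 0.25, 0.15]   # demand supported on {0, ..., 4}
zeta = min(pmf[: S + 1])            # strictly positive PMF constant
rows = [transition_row(s, S - s, pmf, S) for s in range(S + 1)]
reachable = all(min(row) >= zeta for row in rows)
```

Note that the state s′=0 collects all demand realizations of at least S, so its probability is also bounded below by ζ.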

We first compare our setting and existing ones on single non-perishable item inventory control problem with lost sales.

Similar to (Huh and Rusmevichientong 2009, Zhang et al. 2018, Yuan et al. 2019, Agrawal and Jia 2019), the model presented in this section studies the single non-perishable item inventory control problem with lost sales. However, there are several key differences between ours and the existing works in terms of cost functions, demand distributions, and lead time:

• Cost Functions: In (Huh and Rusmevichientong 2009), the authors assume a linear purchasing cost function without fixed cost, and linear lost sales and holding cost functions. In (Yuan et al. 2019), the authors additionally allow a fixed cost. In (Zhang et al. 2018, Agrawal and Jia 2019), the authors assume that the lost sales cost function and the holding cost function are linear, and that there is no purchasing cost. In our setting, the cost function is the same as that of (Yuan et al. 2019).

• Demand Distributions: In (Huh and Rusmevichientong 2009, Zhang et al. 2018, Yuan et al. 2019, Agrawal and Jia 2019), the authors assume stationary demand distributions, but they admit either continuous or discrete demand distributions. In contrast, we allow non-stationary demand distributions, but we require the demand distribution to be discrete and to satisfy the strictly positive PMF assumption described above.

• Lead Time: In (Zhang et al. 2018, Agrawal and Jia 2019), the authors allow the lead time to be positive, while in (Huh and Rusmevichientong 2009, Yuan et al. 2019) and our setting, the lead time is assumed to be zero.

As pointed out in Section id1, different than the model we present in Section id1, the reward in each time step t is not directly observable due to the censored demand. Nevertheless, we can follow the pseudo-reward technique proposed in (Agrawal and Jia 2019) to implement the SWUCRL2-CW algorithm on a sequence of suitably designed pseudo-reward distributions.

In particular, we define the pseudo-reward following (Agrawal and Jia 2019) for each time step t∈[T], every state s, and every action a∈𝒜s as

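$$R_t^{\text{pseudo}}(s,a):=R_t(s,a)+l\cdot X_t=-f\cdot\mathbf 1[a>0]-c\cdot a-h\cdot[s+a-Y_t]^{+}+l\cdot Y_t,$$

where we recall that Yt=min{s+a,Xt} is the censored demand. We note that the pseudo-reward is perfectly observable. We also define the mean pseudo-reward for each time step t∈[T], every state s, and every action a∈𝒜s as

$$r_t^{\text{pseudo}}(s,a):=\mathbb E\big[R_t^{\text{pseudo}}(s,a)\big]=r_t(s,a)+l\cdot\mathbb E[X_t].$$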

This indicates that, regardless of state and action, the mean pseudo-reward of a time step t is obtained by shifting the corresponding mean reward uniformly by l⋅𝔼[Xt]. Without loss of generality, we assume that for all t∈[T], s∈𝒮, and a∈𝒜s, the mean pseudo-reward is bounded, i.e., rtpseudo(s,a)∈[0,1], and the pseudo-reward Rtpseudo(s,a) is 1-sub-Gaussian with mean rtpseudo(s,a). Defining ρt*pseudo as the optimal long-term average reward of the stationary MDP with state transition distribution pt and mean reward rtpseudo={rtpseudo(s,a)}s∈𝒮,a∈𝒜s, we can show that for any policy Π, the dynamic regret of the non-stationary MDP instance specified by the tuple ℳ=(𝒮,𝒜,T,r,p) and the dynamic regret of the non-stationary MDP instance specified by the tuple ℳpseudo=(𝒮,𝒜,T,rpseudo={rtpseudo}t=1T,p) are the same.
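The identity behind the pseudo-reward, namely that it equals the true reward plus l⋅Xt while depending only on the censored demand, can be verified on a small grid. The function names and cost parameters below are illustrative assumptions.

```python
def reward(s, a, x, f, c, l, h):
    """True reward R_t(s, a) = -C_t(s, a); requires the uncensored demand x."""
    stock = s + a
    return -(f * (a > 0) + c * a + l * max(x - stock, 0) + h * max(stock - x, 0))

def pseudo_reward(s, a, y, f, c, l, h):
    """Pseudo-reward computed from the censored demand y = min(x, s + a) only."""
    return -f * (a > 0) - c * a - h * max(s + a - y, 0) + l * y

params = dict(f=1.0, c=0.5, l=2.0, h=0.25)
checks = [
    pseudo_reward(s, a, min(x, s + a), **params)
    == reward(s, a, x, **params) + params["l"] * x
    for s in range(4) for a in range(4 - s) for x in range(8)
]
```

Every entry of `checks` compares the observable pseudo-reward with the unobservable quantity Rt(s,a)+l⋅Xt for one combination of stock, order, and demand.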

Proposition 6

For any policy Π, we denote the sample path for following Π on M as {st⁢(M),at⁢(M)}t=1T, and the sample path for following Π on M𝑝𝑠𝑒𝑢𝑑𝑜 as {st⁢(M𝑝𝑠𝑒𝑢𝑑𝑜),at⁢(M𝑝𝑠𝑒𝑢𝑑𝑜)}t=1T, we have

The proof of Proposition 6 is provided in Section id1 in the appendix. Together with Theorem 3, we have the following dynamic regret bound guarantee for the SWUCRL2-CW algorithm on the single non-perishable item inventory control problem with zero lead time, fixed cost, and lost sales.

Theorem 4

For the inventory control model in Section id1, under Assumption 3 and assuming S>1, the SWUCRL2-CW algorithm with window size W, confidence widening parameter η=0, and $\delta=T^{-1}$ satisfies the dynamic regret bound

$$\text{Dyn-Reg}_T(\text{SWUCRL2-CW})=\tilde O\left(B_rW+\bar D\left[B_pW+\frac{S^{3/2}T}{\sqrt W}+\frac{S^{2}T}{W}+\sqrt T\right]\right).$$

If we further put $W=W^*=ST^{2/3}(B_r+B_p+1)^{-2/3}$, this dynamic regret bound is $\tilde O(\bar D(B_r+B_p+1)^{1/3}ST^{2/3})$.

Remark 11

To interpret the dynamic regret bound of the SWUCRL2-CW algorithm in the context of inventory control, we note that in Theorem 4, we normalize the cost functions so that the cost incurred in each time period is in [0,1]. This is slightly different from the setups in (Huh and Rusmevichientong 2009, Zhang et al. 2018, Yuan et al. 2019, Agrawal and Jia 2019), where the upper bounds of the cost functions are of order O(S).

As a complement to our theoretical results, we conduct numerical experiments on synthetic datasets to compare the dynamic regret performance of our algorithms with that of the UCRL2 algorithm (Jaksch et al. 2010), which is one of the most widely used benchmarks for RL in MDPs due to its nearly optimal regret bound in stationary environments (Wei et al. 2019), and with that of the restarting UCRL2 (denoted as UCRL2.S) algorithm for RL in piecewise-stationary MDPs (Jaksch et al. 2010).

Setup: We consider an MDP with 2 states {s1,s2} and 2 actions {a1,a2}, and set T=5000. The rewards are deterministically set to

rt⁢(s1,a1)=0.2+3⁢cos⁡(5⁢Vr⁢π⁢t/T),rt⁢(s1,a2)=0.2+cos⁡(5⁢Vr⁢π⁢t/T),

rt⁢(s2,a1)=0.2-cos⁡(5⁢Vr⁢π⁢t/T),rt⁢(s2,a2)=0.2-3⁢cos⁡(5⁢Vr⁢π⁢t/T).

The total variations in mean rewards is thus Br=15⁢Vr=Θ⁢(Vr). An illustration of the reward process of state s2 and action a2 is provided in Fig. 4 (the mean rewards of other (state,action) pairs are similar).

Figure 4: Illustrations of mean rewards rt⁢(s2,a2) (the mean rewards of other state-action pairs are similar)

The state transition distributions are set to

$$p_t(s_1|s_1,a_1)=1,\quad p_t(s_2|s_1,a_1)=0,\quad p_t(s_1|s_1,a_2)=1-\beta_t,\quad p_t(s_2|s_1,a_2)=\beta_t,$$

$$p_t(s_1|s_2,a_1)=0,\quad p_t(s_2|s_2,a_1)=1,\quad p_t(s_1|s_2,a_2)=\beta_t,\quad p_t(s_2|s_2,a_2)=1-\beta_t,$$

where βt is governed by the following process:

βt=0.5+0.3⁢sin⁡(5⁢Vp⁢π⁢t/T).

The total variations in the state transition distributions is thus Bp=12⁢Vp=Θ⁢(Vp). In this simulation, we allow both Vr and Vp to take values from {T0.2,T0.5} to evaluate the performances of the algorithms in low and high variations scenarios. Here, we assume the SWUCRL2-CW algorithm knows the variation budgets, and the UCRL2.S algorithm restarts the UCRL2 algorithm every ⌊T2/3⌋ time steps. All the results are averaged over 50 runs.
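For readers who wish to reproduce the drifting transition kernel, a minimal sketch follows. It assumes that, for action a2, the two listed probabilities refer to the next states s1 and s2 respectively, so that each row sums to one; the function names and array layout are hypothetical.

```python
import math

def beta(t, T, V_p):
    """beta_t = 0.5 + 0.3 sin(5 V_p pi t / T)."""
    return 0.5 + 0.3 * math.sin(5 * V_p * math.pi * t / T)

def transition_kernel(t, T, V_p):
    """Return P[s][a] = [Pr(next = s1), Pr(next = s2)], with s1 and a1 at index 0."""
    b = beta(t, T, V_p)
    return [
        [[1.0, 0.0], [1.0 - b, b]],   # from s1 under a1, a2
        [[0.0, 1.0], [b, 1.0 - b]],   # from s2 under a1, a2
    ]

T = 5000
kernels = [transition_kernel(t, T, V_p=T ** 0.2) for t in (1, 1234, T)]
```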

(a) Bp=Θ⁢(T0.2),Br=Θ⁢(T0.2)

(b) Bp=Θ⁢(T0.2),Br=Θ⁢(T0.5)

(c) Bp=Θ⁢(T0.5),Br=Θ⁢(T0.2)

(d) Bp=Θ⁢(T0.5),Br=Θ⁢(T0.5)

Figure 5: Cumulative rewards of the algorithms

Results: The cumulative rewards of the algorithms under various variation budgets are shown in Fig. 5. The results show that both the SWUCRL2-CW algorithm and the BORL algorithm collect at least 20% more rewards than the UCRL2 algorithm and the UCRL2.S algorithm, except for the case when $B_p=\Theta(T^{0.5})$ and $B_r=\Theta(T^{0.2})$, where the percentage improvement is 12%. Comparing the results in Figs. 5(a), 5(b), and 5(c), we can see that both the SWUCRL2-CW algorithm and the BORL algorithm are more robust to variations in the state transition distributions than to those in the reward distributions. This demonstrates the power of our confidence widening technique. Interestingly, we can see that in Figs. 5(a), 5(b), and 5(c), the cumulative rewards of the BORL algorithm (which does not know the variation budgets) are higher than those of the SWUCRL2-CW algorithm (which knows the variation budgets). This does not contradict our theoretical results. Theorems 1 and 2 state that the SWUCRL2-CW algorithm and the BORL algorithm enjoy the same (in the sense of O~(⋅)) worst-case dynamic regret bound. Nevertheless, the environments we construct in Fig. 4 are not the worst-case scenario, and the results indicate that the adaptive master algorithm (i.e., the EXP3.P algorithm) of the BORL algorithm is able to leverage this more benign environment to attain higher rewards.

In this paper, we study the problem of un-discounted reinforcement learning in a gradually changing environment. In this setting, the parameters, i.e., the reward and state transition distributions, can change over time as long as the total changes are bounded by their respective variation budgets. We first incorporate the sliding window estimator and the novel confidence widening technique into the UCRL2 algorithm to propose the SWUCRL2-CW algorithm, which has low dynamic regret when the variation budgets are known. We then design a parameter-free BORL algorithm that allows us to enjoy the same dynamic regret bound as the SWUCRL2-CW algorithm without knowing the variation budgets. The main ingredient of the proposed algorithms is the novel confidence widening technique, which injects extra optimism into the design of learning algorithms, and thus ensures low dynamic regret bounds. This is in contrast to the widely held belief that optimistic exploration algorithms for (stationary and non-stationary) stochastic online learning settings should employ the lowest possible level of optimism. To extend this finding, we also use the problem of single-item inventory control with fixed cost as an example to demonstrate how one can leverage special structures in the state transition distributions to attain a low dynamic regret bound without widening the confidence region.

Acknowledgments

The authors would like to express sincere gratitude to Dylan Foster, Negin Golrezaei, and Mengdi Wang, as well as various seminar attendees for helpful discussions and comments.

To arrive at (23), note that the second expectation in (21), which is a telescoping sum, is equal to 0, since sτ+1 is distributed as p(⋅|sτ,aτ). In addition, we trivially lower bound the first expectation in (22) by -2Dmax by applying Lemma 2. Next, consider partitioning the horizon of T steps into intervals of W time steps, where the last interval could have fewer than W time steps. That is, the first interval is {1,…,W}, the second is {W+1,…,2W}, and so on. Applying the bound (23) on each interval and summing the resulting bounds together gives

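$$\begin{aligned}\sum_{t=1}^{T}\rho_t^{*}&\ge\mathbb E\left[\sum_{t=1}^{T}r_t(s_t,a_t)\right]-2\left\lceil\frac{T}{W}\right\rceil D_{\max}-2W\sum_{t=1}^{T}(B_{r,t}+2D_{\max}B_{p,t})\\&\ge\mathbb E\left[\sum_{t=1}^{T}r_t(s_t,a_t)\right]-\frac{4TD_{\max}}{W}-2W(B_r+2D_{\max}B_p).\end{aligned}\tag{24}$$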

Choosing W to be any integer in [T/(Br+2⁢Dmax⁢Bp),2⁢T/(Br+2⁢Dmax⁢Bp)] yields the desired inequality in the Theorem. Finally, we go back to proving inequalities (18,19). These inequalities are clearly true when t=τ, so we focus on the case t<τ.

Proving inequality (18). It suffices to show that the solution (ρτ*+∑q=tτ-1(Br,q+2⁢Dmax⁢Bp,q),γτ*) is feasible to the linear program D(rt,pt). To see the feasibility, it suffices to check the constraint of D(rt,pt) for each state-action pair s,a:

Property 1

Property 2

For each state s∈S, we have

r~⁢(s,π~⁢(s))≥ρ~+γ~⁢(s)-∑s′∈𝒮p~⁢(s′|s,π~⁢(s))⁢γ~⁢(s′)-ϵ.

Property 1 ensures the feasibility of the output dual variables (ρ~,γ~) with respect to the dual program 𝖣(r˙,p˙) for any r˙,p˙ in the confidence regions Hr,Hp. The feasibility facilitates the bounding of maxs∈𝒮 γ~(s), which turns out to be useful for bounding the regret arising from switching among different stationary policies. To illustrate, suppose that Hp is so large that it contains a transition distribution p˙ under which (𝒮,𝒜,p˙) has diameter D. By Lemma 2, we have 0≤maxs∈𝒮 γ~(s)≤2D.

Property 2 ensures the near-optimality of the dual variables (ρ~,γ~) with respect to the (r~,p~) optimistically chosen from Hr,Hp. More precisely, the deterministic policy π~ is near-optimal for the MDP with time-homogeneous reward function r~ and time-homogeneous transition distribution p~, under which the policy π~ achieves a long-term average reward of at least ρ~*-ϵ.

We employ the self-normalizing concentration inequality (Abbasi-Yadkori et al. 2011). The following inequality is extracted from Theorem 1 in (Abbasi-Yadkori et al. 2011), restricted to the case when d=1.

Let {Fq}q=1T be a filtration. Let {ξq}q=1T be a real-valued stochastic process, such that ξq is Fq-measurable, and ξq is conditionally R-sub-Gaussian, i.e. for all λ≥0, it holds that E⁢[exp⁡(λ⁢ξq)|Fq-1]≤exp⁡(λ2⁢R2/2). Let {Yq}q=1T be a non-negative real-valued stochastic process such that Yq is Fq-1-measurable. For any δ′∈(0,1), it holds that

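\[
\Pr\left(\frac{\sum_{q=1}^{t}\xi_q Y_q}{\max\{1,\sum_{q=1}^{t}Y_q^2\}}\le 2R\sqrt{\frac{\log(T/\delta')}{\max\{1,\sum_{q=1}^{t}Y_q^2\}}}\ \text{ for all } t\in[T]\right)\ge 1-\delta'.
\]

In particular, if {Yq}q=1T is a {0,1}-valued stochastic process, then for any δ′∈(0,1), it holds that

\[
\Pr\left(\frac{\sum_{q=1}^{t}\xi_q Y_q}{\max\{1,\sum_{q=1}^{t}Y_q\}}\le 2R\sqrt{\frac{\log(T/\delta')}{\max\{1,\sum_{q=1}^{t}Y_q\}}}\ \text{ for all } t\in[T]\right)\ge 1-\delta'. \qquad (29)
\]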
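As an informal sanity check of the {0,1}-valued case, the following Python sketch simulates the inequality on sampled paths. It assumes that the right-hand side reads 2R√(log(T/δ′)/max{1,∑Yq}), takes ξq to be independent Rademacher noise (so that R=1 is a valid sub-Gaussian parameter), and draws each Yq as an independent Bernoulli variable. The function name, the visit probability 0.3, and all numerical values are illustrative assumptions and not part of the analysis.

```python
import math
import random


def bound_holds_on_path(T, R, delta_prime, visit_prob, rng):
    """Check the {0,1}-valued inequality for every t in [T] on one sampled path."""
    weighted_sum = 0.0  # running sum of xi_q * Y_q
    visits = 0          # running sum of Y_q
    for _ in range(T):
        y = 1 if rng.random() < visit_prob else 0  # Y_q, drawn before xi_q is revealed
        xi = rng.choice((-1.0, 1.0))               # Rademacher noise (assumed), R = 1
        weighted_sum += xi * y
        visits += y
        n = max(1, visits)
        radius = 2 * R * math.sqrt(math.log(T / delta_prime) / n)
        if weighted_sum / n > radius:
            return False
    return True


rng = random.Random(0)  # fixed seed so that the experiment is reproducible
T, R, delta_prime = 200, 1.0, 0.05
paths = 500
successes = sum(bound_holds_on_path(T, R, delta_prime, 0.3, rng) for _ in range(paths))
success_rate = successes / paths
```

Under these assumptions the empirical success rate should be at least 1-δ′; the simulation only illustrates the statement and does not replace the proof.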

The Lemma is proved by applying Proposition 7 with suitable choices of {ℱq}q=1T,{ξq}q=1T,{Yq}q=1T,δ′. We divide the proof into two parts.

It suffices to prove that, for any fixed s∈𝒮,a∈𝒜s,t∈[T], it holds that

since then Pr⁡[ℰr]≥1-δ/2 follows from the union bound over all s∈𝒮,a∈𝒜s,t∈[T]. Now, the trajectory of the online algorithm is expressed as {sq,aq,Rq}q=1T. Inequality (30) directly follows from Proposition 7, with {ℱq}q=1T,{ξq}q=1T,{Yq}q=1T,δ defined as

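\[
\begin{aligned}
\mathcal{F}_q &= \{(s_\ell,a_\ell,R_\ell)\}_{\ell=1}^{q}\cup\{(s_{q+1},a_{q+1})\},\\
\xi_q &= R_q(s,a)-r_q(s,a),\\
Y_q &= \mathbf{1}\big(s_q=s,\ a_q=a,\ ((t-W)\vee 1)\le q\le t-1\big),\\
\delta' &= \frac{\delta}{2SAT}.
\end{aligned}
\]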

Each ξq is conditionally 2-sub-Gaussian, since -1≤ξq≤1 with certainty. Altogether, the required inequality is shown.

We start by noting that, for two probability distributions p={p(s)}s∈𝒮, p′={p′(s)}s∈𝒮 on 𝒮, it holds that

\[
\|p-p'\|_1=\max_{\theta\in\{-1,1\}^{\mathcal{S}}}\sum_{s\in\mathcal{S}}\theta(s)\cdot\big(p(s)-p'(s)\big).
\]
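The identity can be checked numerically by enumerating all sign vectors θ on a small state space. The Python sketch below does so; the two distributions and the function names are illustrative assumptions.

```python
from itertools import product


def l1_distance(p, p_prime):
    """Direct computation of the L1 distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(p, p_prime))


def max_over_signs(p, p_prime):
    """Maximum over theta in {-1, 1}^S of sum_s theta(s) * (p(s) - p'(s))."""
    diffs = [a - b for a, b in zip(p, p_prime)]
    return max(
        sum(t * d for t, d in zip(theta, diffs))
        for theta in product((-1, 1), repeat=len(diffs))
    )


p = [0.5, 0.3, 0.2]        # illustrative distribution on three states
p_prime = [0.2, 0.3, 0.5]  # illustrative distribution on three states
```

The maximum is attained by choosing θ(s) to be the sign of p(s)-p′(s), which is why the union bound in the proof ranges over all 2^S sign vectors.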

Consequently, to show Pr⁡[ℰp]≥1-δ/2, it suffices to show that, for any fixed s∈𝒮,a∈𝒜s,t∈[T],θ∈{-1,1}𝒮, it holds that

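\[
\begin{aligned}
&\Pr\left(\sum_{s'\in\mathcal{S}}\theta(s')\cdot\big(\hat{p}_t(s'|s,a)-\bar{p}_t(s'|s,a)\big)\le \mathsf{rad}\text{-}_{p,t}(s,a)\right)\\
\ge\ &\Pr\left(\frac{1}{N_t^+(s,a)}\sum_{q=(\tau(m)-W)\vee 1}^{t-1}\left(\Big[\sum_{s'\in\mathcal{S}}\theta(s')\mathbf{1}(s_q=s,a_q=a,s_{q+1}=s')\Big]-\Big[\sum_{s'\in\mathcal{S}}\theta(s')p_q(s'|s,a)\cdot\mathbf{1}(s_q=s,a_q=a)\Big]\right)\le 2\sqrt{\frac{\log(2SAT^2 2^S/\delta)}{N_t^+(s,a)}}\right)\\
\ge\ &1-\frac{\delta}{2SAT2^S},
\end{aligned}
\qquad (31)
\]

since then the required inequality follows from a union bound over all s∈𝒮,a∈𝒜s,t∈[T],θ∈{-1,1}𝒮. Similar to the case of ℰr, (31) follows from Proposition 7, with {ℱq}q=1T,{ξq}q=1T,{Yq}q=1T,δ′ defined as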

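\[
\begin{aligned}
\mathcal{F}_q &= \{(s_\ell,a_\ell)\}_{\ell=1}^{q+1},\\
\xi_q &= \Big[\sum_{s'\in\mathcal{S}}\theta(s')\mathbf{1}(s_q=s,a_q=a,s_{q+1}=s')\Big]-\Big[\sum_{s'\in\mathcal{S}}\theta(s')p_q(s'|s,a)\Big],\\
Y_q &= \mathbf{1}\big(s_q=s,\ a_q=a,\ ((t-W)\vee 1)\le q\le t-1\big),\\
\delta' &= \frac{\delta}{2SAT2^S}.
\end{aligned}
\]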

Each ξq is conditionally 2-sub-Gaussian, since -1≤ξq≤1 with certainty. Altogether, the required inequality is shown.

In this section, we prove Proposition 2. Throughout the section, we impose the assumptions stated by the Proposition. That is, the events ℰr,ℰp hold, and there exists p with (1) p∈Hp,τ⁢(m)⁢(η), (2) (𝒮,𝒜,p) has diameter at most D. We begin by recalling the following notations:

Step (34) is by Property 1 of the output from EVI, which is applied with confidence regions Hr,τ(m),Hp,τ(m)(η). Step (35) is because of the assumption that p∈Hp,τ(m)(η). Altogether, the solution (ρ~τ(m),γ~τ(m)) is feasible to 𝖣(r˙,p) for any r˙∈Hr,τ(m). Now, by Lemma 2, we have maxs,s′∈𝒮 |γ~τ(m)(s)-γ~τ(m)(s′)|≤2D. Finally, inequality (32) follows from the fact that the bias vector γ~τ(m) returned by EVI is component-wise non-negative, and there exists s∈𝒮 such that γ~τ(m)(s)=0.

Step (36) is again by Property 1 of the output from EVI, and step (37) is by the assumptions that r¯τ⁢(m)∈Hr,τ⁢(m), and p¯τ⁢(m)∈Hp,τ⁢(m)⁢(0)⊂Hp,τ⁢(m)⁢(η).

Now, we claim that (ρ~τ⁢(m)+𝗏𝖺𝗋r,t+2⁢D⋅𝗏𝖺𝗋p,t,γ~τ⁢(m)) is a feasible solution to the tth period dual problem 𝖣⁢(rt,pt), which immediately implies the Lemma. To demonstrate the claim, for every state-action pair (s,a) we have

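\[
\bar{r}_{\tau(m)}(s,a)\ \ge\ r_t(s,a)-\mathsf{var}_{r,t}, \qquad (38)
\]

\[
\begin{aligned}
\sum_{s'\in\mathcal{S}}\tilde{\gamma}_{\tau(m)}(s')\,\bar{p}_{\tau(m)}(s'|s,a)
&\ge \sum_{s'\in\mathcal{S}}\tilde{\gamma}_{\tau(m)}(s')\,p_t(s'|s,a)-\|\tilde{\gamma}_{\tau(m)}\|_\infty\,\|p_t(\cdot|s,a)-\bar{p}_{\tau(m)}(\cdot|s,a)\|_1\\
&\ge \sum_{s'\in\mathcal{S}}\tilde{\gamma}_{\tau(m)}(s')\,p_t(s'|s,a)-2D\cdot\mathsf{var}_{p,t}.
\end{aligned}
\qquad (39)
\]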

Inequality (38) is by Lemma 3 on the rewards. Step (39) is by inequality (32), and by Lemma 3 which shows ∥pt(⋅|s,a)-p¯τ⁢(m)(⋅|s,a)∥1≤𝗏𝖺𝗋p,t. Altogether, putting (38), (39) to inequality (33), our claim is shown, i.e., for all s∈𝒮 and a∈𝒜s,

Step (41) is by Lemma 3 on t. Step (42) is by conditioning that event ℰr holds. Step (43) is by Property 2 for the output of EVI. In step (44), we upper bound ρ~τ⁢(m) by Lemma 4 and we upper bound ∑s′∈𝒮p~τ⁢(m)⁢(s′|st,at)⁢γ~τ⁢(m)⁢(s′) by Lemma 5. Rearranging gives the Proposition.

To facilitate the exposition, we denote by M(T) the total number of episodes. By abusing the notation, we let τ(M(T)+1)-1=T. Episode M(T), containing the final round T, is interrupted and the algorithm is forced to terminate as the end of time T is reached. We can now rewrite the dynamic regret of the SWUCRL2-CW algorithm as the sum of dynamic regret from each episode:

• Case 1. m∈U. Under this situation, we apply Proposition 2 to bound the dynamic regret during the episode, using the fact that pτ(m) satisfies the assumptions of the proposition with D=Dτ(m)≤Dmax.

• Case 2. m∈[M(T)]∖U. In this case, we trivially upper bound the dynamic regret of each round in episode m by 1.

For case 1, we bound the dynamic regret during episode m by summing the error terms in (7, 8) across the rounds t∈[τ(m),τ(m+1)-1] in the episode. The term (7) accounts for the error by switching policies. In (8), the terms 𝗋𝖺𝖽-r,τ(m),𝗋𝖺𝖽-p,τ(m) account for the estimation errors due to stochastic variations, and the terms 𝗏𝖺𝗋r,t,𝗏𝖺𝗋p,t account for the estimation error due to non-stationarity.

For case 2, we need an upper bound on ∑m∈[M⁢(T)]∖U∑t=τ⁢(m)τ⁢(m+1)-11, the total number of rounds that belong to an episode in [M⁢(T)]∖U. The analysis is challenging, since the length of each episode may vary, and one can only guarantee that the length is ≤W. A first attempt could be to upper bound as ∑m∈[M⁢(T)]∖U∑t=τ⁢(m)τ⁢(m+1)-11≤W⁢∑m∈[M⁢(T)]∖U1, but the resulting bound appears too loose to provide any meaningful regret bound. Indeed, there could be double counting, as the starting time steps for a pair of episodes in case 2 might not even be W rounds apart!

To avoid the trap of double counting, we consider a set QT⊆[M⁢(T)]∖U where the start times of the episodes are sufficiently far apart, and relate the cardinality of QT to ∑m∈[M⁢(T)]∖U∑t=τ⁢(m)τ⁢(m+1)-11.
The set QT⊆[M⁢(T)] is constructed sequentially, by examining all episodes m=1,…,M⁢(T) in the time order. At the start, we initialize QT=∅. For each m=1,…,M⁢(T), we perform the following. If episode m satisfies both criteria:

1. There exists some s∈𝒮 and a∈𝒜s such that pτ(m)(⋅|s,a)∉Hp,τ(m)(s,a;η);

2. For every m′∈QT, τ(m)-τ(m′)>W,

then we add m into QT. Afterwards, we move to the next episode index m+1. The process terminates once we arrive at episode M(T)+1. The construction ensures that, for each episode m∈[M(T)], if τ(m)-τ(m′)∉[0,W] for all m′∈QT, then pτ(m)(⋅|s,a)∈Hp,τ(m)(s,a;η) for all s∈𝒮 and a∈𝒜s; otherwise, m would have been added into QT.

We further construct a set Q~T to include all elements in QT and every episode index m such that there exists m′∈QT with τ(m)-τ(m′)∈[0,W]. By doing so, we can prove that every episode m∈[M(T)]∖Q~T satisfies pτ(m)(⋅|s,a)∈Hp,τ(m)(s,a;η) for all s∈𝒮 and a∈𝒜s. The procedure for building Q~T (initialized to QT) is as follows: for every episode index m∈[M(T)], if there exists m′∈QT such that τ(m)-τ(m′)∈[0,W], then we add m to Q~T. Formally,

Q~T=QT∪{m∈[M⁢(T)]:∃m′∈QT⁢τ⁢(m)-τ⁢(m′)∈[0,W]}.
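The sequential construction of QT and Q~T can be sketched in Python as follows. The sketch assumes that the episode start times τ(m) are given as an increasing list and that a Boolean flag per episode records whether criterion 1 holds; the function name, the inputs, and the five-episode example are hypothetical.

```python
def build_q_sets(start_times, violates, W):
    """Build Q_T and Q~_T from episode start times tau(m) (increasing).

    violates[m] is True when criterion 1 holds for episode m, i.e. some
    p_tau(m)(.|s,a) lies outside the widened confidence region.
    """
    q = []
    for m, tau_m in enumerate(start_times):
        # Criterion 2: the start time is more than W away from every episode already in Q_T.
        far_from_all = all(tau_m - start_times[mp] > W for mp in q)
        if violates[m] and far_from_all:
            q.append(m)

    # Q~_T adds every episode starting within W rounds after some episode of Q_T.
    q_tilde = set(q)
    for m, tau_m in enumerate(start_times):
        if any(0 <= tau_m - start_times[mp] <= W for mp in q):
            q_tilde.add(m)
    return q, sorted(q_tilde)


# Hypothetical example with five episodes and W = 10.
start_times = [0, 4, 12, 18, 25]
violates = [True, True, False, False, True]
q, q_tilde = build_q_sets(start_times, violates, W=10)
```

In this example the second episode satisfies criterion 1 but starts within W rounds of the first, so it enters Q~T only; the third and fourth episodes belong to neither set.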

Figure 6: Both episodes mi and mi+4 belong to QT (and thus Q~T) because pτ(mi)∉Hp,τ(mi)(η) and pτ(mi+4)∉Hp,τ(mi+4)(η). Episode mi+1 is added to Q~T (but not QT) because τ(mi+1)-τ(mi)∈[0,W]. Episodes mi+2 and mi+3 belong to neither QT nor Q~T, as pτ(mi+2)∈Hp,τ(mi+2)(η) and pτ(mi+3)∈Hp,τ(mi+3)(η).

We can formalize the properties of QT and Q~T as follows.

Lemma 6

Conditioned on ℰp, |QT|≤Bp/η.

Lemma 7

For any episode m∉Q~T, we have pτ⁢(m)(⋅|s,a)∈Hp,τ⁢(m)(s,a;η) for all s∈S and a∈As.

The proofs of Lemmas 6 and 7 are presented in Sections id1 and id1, respectively.

Together with eqn. (45), we can further decompose the dynamic regret of the SWUCRL2-CW algorithm as

where the last step makes use of Lemma 7 and Proposition 2. We accomplish the promised dynamic regret bound by the following four Lemmas, which bound the four dynamic regret terms in the decomposition (♠, ♣, and the two remaining terms).

Lemma 10

Lemma 11

The proofs of Lemmas 8, 9, 10, and 11 are presented in Sections id1, id1, id1, and id1, respectively. Putting all these pieces together, the dynamic regret of the SWUCRL2-CW algorithm is upper bounded as

\[
\tilde{O}\left(\frac{B_pW}{\eta}+B_rW+D_{\max}\left[B_pW+\frac{S\sqrt{A}\,T}{\sqrt{W}}+T\eta+\frac{SAT}{W}+\sqrt{T}\right]\right),
\]

and by setting W and η accordingly, we can conclude the proof.

We first claim that, for every episode m∈QT, there exists some state-action pair (s,a) and some time step tm∈[(τ(m)-W)∨1, τ(m)-1] such that

\[
\|p_{\tau(m)}(\cdot|s,a)-p_{t_m}(\cdot|s,a)\|_1>\eta. \qquad (46)
\]

For the sake of contradiction, suppose otherwise, that is, ∥pτ(m)(⋅|s,a)-pt(⋅|s,a)∥1≤η for every state-action pair (s,a) and every time step t∈[(τ(m)-W)∨1, τ(m)-1]. For each state-action pair (s,a), consider the following cases on Nτ(m)(s,a)=∑q=(τ(m)-W)∨1τ(m)-1𝟏(sq=s,aq=a):

Step (50) follows by the second criterion of the construction of QT, which ensures that for distinct m,m′∈QT, the time intervals [tm,τ⁢(m)], [tm′,τ⁢(m′)] are disjoint. Step (51) is by applying the triangle inequality, for each m∈QT, on the state-action pair (s,a) that maximizes the term ∥∑q=tmτ⁢(m)-1(pq+1(⋅|s,a)-pq(⋅|s,a))∥1=∥(pτ⁢(m)(⋅|s,a)-ptm(⋅|s,a))∥1. Step (52) is by applying the claimed inequality (46) on each m∈QT. Altogether, the Lemma is proved. ■

We prove by contradiction. Suppose there exists an episode m∉Q~T, a state s∈𝒮, and an action a∈𝒜s such that pτ(m)(⋅|s,a)∉Hp,τ(m)(s,a;η). Then m should have been added to QT. To see this, we first note that episode m trivially satisfies criterion 1 in the construction of QT. Moreover, at the time when m is examined, any m′ that has been added to QT must satisfy τ(m)-τ(m′)>W, as otherwise m would have been added to Q~T. Therefore, we have proved m∈QT⊆Q~T, which is clearly a contradiction.

Denote QT={m1,…,m|QT|}. By construction, for every element m∈Q~T, there exists a unique m′∈QT such that

\[
\tau(m)-\tau(m')\in[0,W]. \qquad (53)
\]

We can thereby partition the elements of Q~T into |QT| disjoint subsets Q~T⁢(m1),…,Q~T⁢(m|QT|) such that

Here, inequality (54) holds by boundedness of rewards, inequality (55) follows from the fact that episodes are mutually disjoint, inequality (56) makes the observations that each episode can last for at most W time steps (imposed by the SWUCRL2-CW algorithm) as well as criterion 2 of the construction of Q~T⁢(m′)’s, and the last step uses Lemma 6.

We first give an upper bound for M⁢(T), the total number of the episodes.

Lemma 12

Conditioned on events Er,Ep, we have M⁢(T)≤S⁢A⁢(2+log2⁡W)⁢T/W=O~⁢(S⁢A⁢T/W) with certainty.

First, to demonstrate the bound for M(T), it suffices to show that there are at most SA(2+log2 W) many episodes in each of the following cases: (1) between time steps 1 and W, (2) between time steps jW and (j+1)W, for any j∈{1,…,⌊T/W⌋-1}, (3) between time steps ⌊T/W⌋⋅W and T. We focus on case (2), and the edge cases (1, 3) can be analyzed similarly.

Between time steps j⁢W and (j+1)⁢W, a new episode m+1 is started when the second criterion νm⁢(st,π~m⁢(st))<Nτ⁢(m)+⁢(st,π~m⁢(st)) is violated during the current episode m. We first illustrate how the second criterion is invoked for a fixed state-action pair (s,a), and then bound the number of invocations due to (s,a). Now, let’s denote m1,…,mL as the episode indexes, where j⁢W≤τ⁢(m1)<τ⁢(m2)<…<τ⁢(mL)<(j+1)⁢W, and the second criterion for (s,a) is invoked during mℓ for 1≤ℓ≤L. That is, for each ℓ∈{1,…,L}, the DM breaks the while loop for episode mℓ because νmℓ⁢(s,a)=Nτ⁢(mℓ)+⁢(s,a), leading to the new episode mℓ+1.

To demonstrate our claimed bound for M(T), we show that L≤2+log2 W as follows. To ease the notation, let’s denote ψℓ=νmℓ(s,a). We first observe that ψ1=Nτ(m1)+(s,a)≥1. Next, we note that for ℓ∈{2,…,L}, we have ψℓ≥∑k=1ℓ-1ψk. (Here we proceed slightly differently from the stationary case, where the corresponding Nt(s,a) is non-decreasing in t (Jaksch et al. 2010), which is clearly not true here due to the use of sliding windows.) Indeed, we know that for each ℓ we have (τ(mℓ+1)-1)-τ(m1)≤W, by our assumption on m1,…,mL. Consequently, the counting sum in Nτ(mℓ)(s,a), which counts the occurrences of (s,a) in the previous W time steps, must have counted those occurrences corresponding to ψ1,…,ψℓ-1. The worst case sequence of ψ1,ψ2,…,ψL that yields the largest L is when ψ1=ψ2=1, ψ3=2, and more generally ψℓ=2^(ℓ-2) for ℓ≥2. Since ψℓ≤W for all ℓ, we clearly have L≤2+log2 W, proving our claimed bound on L.

Altogether, during the T time steps, there are at most (SAT(2+log2 W))/W episodes due to the second criterion and T/W due to the first, leading to our desired bound on M(T).
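The doubling argument can be illustrated by generating the slowest-growing sequence that satisfies ψ1≥1, ψℓ≥ψ1+…+ψℓ-1 and ψℓ≤W, and comparing its length with 2+log2 W. The Python sketch below does this for a range of window sizes; the function name and the tested range are illustrative assumptions.

```python
import math


def longest_doubling_sequence(W):
    """Slowest-growing sequence with psi_1 = 1 and psi_l = psi_1 + ... + psi_{l-1} <= W."""
    psis = [1]
    # The next term must be at least the sum of all previous terms and at most W.
    while sum(psis) <= W:
        psis.append(sum(psis))
    return psis


# For W = 16 the sequence is 1, 1, 2, 4, 8, 16, which has 2 + log2(16) = 6 terms.
example = longest_doubling_sequence(16)
lengths = {W: len(longest_doubling_sequence(W)) for W in range(1, 200)}
```

Any admissible sequence grows at least as fast as this one, so its length is an upper bound on L for the given W.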

Next, we establish the bound for (♣). By Lemma 7, we know that γ~τ⁢(m)⁢(s)∈[0,2⁢Dmax] for all m∈[M⁢(T)]∖Q~T and s. For each episode m∈[M⁢(T)]∖Q~T, we have

Define the filtration ℋt-1={(sq,aq,Rq⁢(sq,aq))}q=1t. Now, we know that 𝔼⁢[Yt|ℋt-1]=0, Yt is σ⁢(ℋt)-measurable, and |Yt|≤2⁢Dmax. Therefore, we can apply the Hoeffding inequality (Hoeffding 1963) to show that

since 𝗏𝖺𝗋r,t≥0 and 𝗏𝖺𝗋p,t≥0 for all t.
We can thus work with the latter quantity.

We first bound ∑t=1T 𝗏𝖺𝗋r,t. Recall that, for time t in episode m, we have defined 𝗏𝖺𝗋r,t=∑_{q=τ(m)-W}^{t-1} Br,q. Clearly, for iW≤q<(i+1)W, the summand Br,q only appears in 𝗏𝖺𝗋r,t for iW≤q<t≤(i+2)W, since each episode is contained in {i′W,…,(i′+1)W} for some i′ by our episode termination criteria (t is a multiple of W) of the SWUCRL2-CW algorithm. Altogether, we have

\[
2\sum_{t=1}^{T}\mathsf{var}_{r,t}\le 2\sum_{t=1}^{T-1}B_{r,t}W=2B_rW. \qquad (59)
\]

Next, we bound ∑t=1T𝗏𝖺𝗋p,t. Now, we know that τ⁢(m+1)-τ⁢(m)≤W by our episode termination criteria (t is a multiple of W) of the SWUCRL2-CW algorithm. Consequently,

\[
4D_{\max}\sum_{t=1}^{T}\mathsf{var}_{p,t}\le 4D_{\max}\sum_{t=1}^{T-1}B_{p,t}W=4D_{\max}B_pW. \qquad (60)
\]

Finally, combining (59, 60) with 2⁢Dmax⁢∑t=1Tη, the Lemma is established.

Due to non-negativity of 𝗋𝖺𝖽⁢-⁢rt⁢(s,a)’s and 𝗋𝖺𝖽⁢-⁢pt⁢(s,a)’s, we have

We analyze by considering the dynamics of the algorithm in each consecutive block of W time steps, in a way similar to the proof of Lemma 9. Consider the episode indexes m_0,m_1,…,m_⌈T/W⌉,m_{⌈T/W⌉+1}, where τ(m_0)=1, τ(m_j)=jW for j∈{1,…,⌈T/W⌉}, and m_{⌈T/W⌉+1}=M(T)+1 (so that τ(m_{⌈T/W⌉+1}-1) is the time index for the last episode in the horizon).

To prove (63, 64), it suffices to show that, for each j∈{0,1,…,⌊T/W⌋}, we have

\[
\sum_{m=m_j}^{m_{j+1}-1}\ \sum_{t=\tau(m)}^{\tau(m+1)-1}\frac{1}{\sqrt{N^+_{\tau(m)}(s_t,a_t)}}=O\left(\sqrt{SAW}\right). \qquad (65)
\]

Without loss of generality, we assume that j∈{1,…,⌊T/W⌋-1}, and the edge cases of j=0,⌊T/W⌋ can be analyzed similarly.

Now, we fix a state-action pair (s,a) and focus on the summands in (65):

1. It holds that ν_{m_j}(s,a)≤N_{τ(m_j)}(s,a), by the episode termination criteria of the SWUCRL2-CW algorithm.

2. For m∈{m_j+1,…,m_{j+1}-1}, we have ∑_{m′=m_j}^{m-1} ν_{m′}(s,a)≤N_{τ(m)}(s,a). Indeed, we know that episodes m_j,…,m_{j+1}-1 are inside the time interval {jW,…,(j+1)W}. Consequently, the counts of “(st,at)=(s,a)” associated with {ν_{m′}(s,a)}_{m′=m_j}^{m-1} are contained in the W time steps preceding τ(m), hence the desired inequality.

Step (67) is by Lemma 19 in (Jaksch et al. 2010), which bounds the sum in the previous line. Step (68) is by the fact that episodes m_j,…,m_{j+1}-1 partition the time interval jW,…,(j+1)W-1, and νm(s,a) counts the occurrences of (st,at)=(s,a) in episode m. Finally, observe that (66)=0 if νm(s,a)=0 for all m∈{m_j,…,m_{j+1}-1}. Thus, we can refine the bound in (68) to

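To begin, we consider the following regret decomposition: for any choice of (W†,η†)∈J, we have

\[
\begin{aligned}
\text{Dyn-Reg}_T(\mathtt{BORL})
&=\sum_{i=1}^{\lceil T/H\rceil}\mathbb{E}\left[\sum_{t=(i-1)H+1}^{i\cdot H\wedge T}\rho_t^*-\mathsf{R}_i\big(W_i,\eta_i,s_{(i-1)H+1}\big)\right]\\
&=\sum_{i=1}^{\lceil T/H\rceil}\mathbb{E}\left[\sum_{t=(i-1)H+1}^{i\cdot H\wedge T}\rho_t^*-\mathsf{R}_i\big(W^\dagger,\eta^\dagger,s_{(i-1)H+1}\big)\right]\\
&\quad+\mathbb{E}\left[\sum_{i=1}^{\lceil T/H\rceil}\Big(\mathsf{R}_i\big(W^\dagger,\eta^\dagger,s_{(i-1)H+1}\big)-\mathsf{R}_i\big(W_i,\eta_i,s_{(i-1)H+1}\big)\Big)\right].
\end{aligned}
\qquad (70)
\]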

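For the first term in eqn. (70), we can apply the results from Theorem 1 to each block i∈[⌈T/H⌉], i.e.,

\[
\sum_{t=(i-1)H+1}^{i\cdot H\wedge T}\rho_t^*-\mathsf{R}_i\big(W^\dagger,\eta^\dagger,s_{(i-1)H+1}\big)
=\tilde{O}\left(\frac{B_p(i)W^\dagger}{\eta^\dagger}+B_r(i)W^\dagger+D_{\max}\left[B_p(i)W^\dagger+\frac{S\sqrt{A}\,H}{\sqrt{W^\dagger}}+H\eta^\dagger+\frac{SAH}{W^\dagger}+\sqrt{H}\right]\right), \qquad (71)
\]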

where we have defined

Br⁢(i)=(∑t=(i-1)⁢H+1i⋅H∧TBr,t),Bp⁢(i)=(∑t=(i-1)⁢H+1i⋅H∧TBp,t)

for brevity. The second term captures the additional reward the DM would collect were it to use the fixed parameters (W†,η†) throughout, w.r.t. the trajectory of starting states of each block generated by the BORL algorithm, i.e., s1,…,s(i-1)H+1,…,s(⌈T/H⌉-1)H+1. This is exactly the regret of the EXP3.P algorithm when it is applied to a Δ-armed adaptive adversarial bandit problem with rewards in [0,H]. Therefore, for any choice of (W†,η†), we can upper bound this by

\[
\tilde{O}\left(H\sqrt{\Delta T/H}\right)=\tilde{O}\left(\sqrt{TH}\right)
\]

as Δ=O⁢(ln2⁡T). Summing these two, the regret of the BORL algorithm is

By virtue of the EXP3.P, the BORL algorithm is able to adapt to any choice of (W†,η†)∈J. Note that

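\[
H\ge W^*=\frac{3S^{2/3}A^{1/2}T^{1/2}}{(B_r+B_p+1)^{1/2}}\ge\frac{3T^{1/2}}{(3T)^{1/2}}\ge 1, \qquad (73)
\]

\[
S^{1/3}A^{1/4}\ge\eta^*=\frac{(B_p+1)^{1/2}S^{1/3}A^{1/4}}{(B_r+B_p+1)^{1/4}T^{1/4}}\ge\frac{S^{1/3}A^{1/4}}{2T^{1/2}}=S^{1/3}A^{1/4}\Phi. \qquad (74)
\]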

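Therefore, there must exist some j† and k† such that

\[
H^{j^\dagger/\Delta_W}\le W^*\le H^{(j^\dagger+1)/\Delta_W},\qquad S^{1/3}A^{1/4}\Phi^{k^\dagger/\Delta_\eta}\ge\eta^*\ge S^{1/3}A^{1/4}\Phi^{(k^\dagger+1)/\Delta_\eta}. \qquad (75)
\]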

By adapting W† to H^{j†/ΔW} and η† to Φ^{k†/Δη}, we further upper bound eqn. (72) as

\[
\text{Dyn-Reg}_T(\mathtt{BORL})=\tilde{O}\left(D_{\max}(B_r+B_p+1)^{1/4}S^{2/3}A^{1/2}T^{3/4}\right),
\]

where we use H^{1/ΔW}=exp(1) and Φ^{1/Δη}=exp(-1) in the last step.
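To see why a geometric grid loses only a constant factor, the Python sketch below brackets a target window size W* between two consecutive grid points H^{j/ΔW} and H^{(j+1)/ΔW}. It assumes ΔW=ln H exactly, so that consecutive grid points differ by a factor of e as used in the last step; the function name and the values H=1000 and W*=37.5 are illustrative assumptions.

```python
import math


def lower_grid_point(H, W_star):
    """Return (j, H**(j/delta_w)), the largest grid point not exceeding W_star.

    Assumes 1 <= W_star <= H and delta_w = ln(H), so that H**(1/delta_w) = e.
    """
    delta_w = math.log(H)
    j = math.floor(delta_w * math.log(W_star, H))  # equals floor(ln(W_star))
    return j, H ** (j / delta_w)


H, W_star = 1000, 37.5  # illustrative values only
j, grid_point = lower_grid_point(H, W_star)
ratio = W_star / grid_point  # lies in [1, e] when the bracketing holds
```

Since the ratio between W* and its neighbouring grid point is at most e, replacing W* by the grid point changes each term of the regret bound by at most a constant factor. The same reasoning applies to the grid for η with Φ in place of H.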

We first show the following lemma.

Lemma 13

Conditioned on Ep, there exists a state transition distribution p∈Hp,t⁢(0) such that for every pair of states s,s′∈S,

p⁢(s′|s,a(s,s′))≥ζ

for every time step t∈[T].

The proof of the lemma is provided in Section id1. We then consider the state transition distribution p∈Hp,t(0) specified in Lemma 13. For an arbitrary state s′∈𝒮, we consider the policy π such that π(s)=a(s,s′) for all s∈𝒮 (see Assumption 2 for the definition of a(s,s′)). Starting from an arbitrary state s∈𝒮, the policy either hits state s′ in the next step, which happens with probability at least ζ, or it transits to another state s′′≠s′, which would then hit state s′ in the next step with probability at least ζ. Therefore, the hitting process stochastically dominates the Bernoulli process with success probability ζ, and thus the expected hitting time is at most 1/ζ.
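The hitting-time bound can be checked numerically on a small example. The Python sketch below builds a random transition matrix in which every state reaches the target state with probability at least ζ in one step, and computes the expected hitting times by fixed-point iteration. The chain size, the value ζ=0.2, the number of sweeps, and the function names are illustrative assumptions.

```python
import random


def random_chain(n, target, zeta, rng):
    """Random transition matrix whose every row puts mass at least zeta on target."""
    P = []
    for _ in range(n):
        weights = [rng.random() for _ in range(n)]
        total = sum(weights)
        row = [(1 - zeta) * w / total for w in weights]
        row[target] += zeta
        P.append(row)
    return P


def expected_hitting_times(P, target, sweeps=500):
    """Iterate h(s) = 1 + sum_{u != target} P[s][u] * h(u), with h(target) = 0."""
    n = len(P)
    h = [0.0] * n
    for _ in range(sweeps):
        h = [
            0.0 if s == target
            else 1.0 + sum(P[s][u] * h[u] for u in range(n) if u != target)
            for s in range(n)
        ]
    return h


rng = random.Random(0)
zeta, target = 0.2, 0
P = random_chain(5, target, zeta, rng)
h = expected_hitting_times(P, target)
```

The iteration contracts at rate at most 1-ζ, so a few hundred sweeps are ample here, and every entry of h should be at most 1/ζ.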

First, we recall the definition of the confidence region Hp,t(s,a;0) in eqn. (6),

Hp,t(s,a;0)={p˙∈Δ𝒮:∥p˙(⋅|s,a)-p^t(⋅|s,a)∥1≤𝗋𝖺𝖽-p,t(s,a)}.

For every pair of states s,s′∈𝒮, we construct p by distinguishing the following two cases:

• If Nt(s,a(s,s′))=0, then by definition 𝗋𝖺𝖽-p,t(s,a(s,s′))≥1, and therefore every probability distribution p¯ on 𝒮 belongs to Hp,t(s,a(s,s′);0). Setting p(⋅|s,a(s,s′))=pt(⋅|s,a(s,s′)) for any t satisfies the requirement in the Lemma.

• Otherwise, by conditioning on ℰp, we know that p¯t(⋅|s,a(s,s′))∈Hp,t(s,a(s,s′);0), and we can thus set p(⋅|s,a(s,s′))=p¯t(⋅|s,a(s,s′)).

Combining the above cases, the transition probability distribution p satisfies the stated inequality in the Lemma, and we conclude the proof.

We first show that for any time step t∈[T], we have ρt*-ρt*pseudo=-l⋅𝔼[Xt]. From Section id1, we know that ρt* is equal to the optimal value of the linear program 𝖯(rt,pt), while ρt*pseudo is equal to the optimal value of the linear program 𝖯(rtpseudo,pt). The two linear programs have the same set of constraints and, by eqn. (14), the only difference is that the objective value of 𝖯(rtpseudo,pt) is l⋅𝔼[Xt] more than that of 𝖯(rt,pt) (note that the summation of x(s,a) over s∈𝒮 and a∈𝒜s is 1 from the second constraint of the linear program (15)). Therefore, we have

\[
\sum_{t=1}^{T}\big(\rho_t^*-\rho_t^{*\,\text{pseudo}}\big)=\sum_{t=1}^{T}-l\cdot\mathbb{E}[X_t]. \qquad (76)
\]

Next, conditioned on any demand realizations X1,…,XT, we can show by induction that the trajectory generated by Π on ℳ and ℳpseudo are exactly the same as they use the same sequence of state transition distributions. Therefore,