Abstract

We consider power allocation for an access-controlled transmitter with energy harvesting capability, based on causal observations of the channel fading state. We assume that the system operates in a time-slotted fashion and that the channel gain in each slot is a random variable which is independent across slots. Further, we assume that the transmitter is solely powered by a renewable energy source and that the energy harvesting process can practically be predicted. With the additional access control for the transmitter and the maximum power constraint, we formulate the stochastic optimization problem of maximizing the achievable rate as a Markov decision process (MDP) with continuous state. To efficiently solve the problem, we define an approximate value function based on a piecewise linear fit in terms of the battery state. We show that with the approximate value function, the update in each iteration consists of a group of convex problems with a continuous parameter. Moreover, we derive the optimal solution to these convex problems in closed-form. Further, we propose power allocation algorithms for both the finite- and infinite-horizon cases, whose computational complexity is significantly lower than that of the standard discrete MDP method while achieving improved performance. Extension to the case of a general payoff function and imperfect energy prediction is also considered. Finally, simulation results demonstrate that the proposed algorithms closely approach the optimal performance.

I Introduction

The utilization of renewable energy is an important characteristic of green wireless communication systems [1]. Renewable-energy-powered transmitters can be deployed in wireless sensor networks or cellular networks, reducing the reliance on traditional batteries and prolonging the transmitter’s lifetime [2][3]. However, the fluctuation of the energy harvesting process, together with the variation of the channel fading, poses many challenges to the design of energy-harvesting communication systems [4][5].

Wireless transmission schemes for energy-harvesting transmitters have been investigated in a number of recent works [6][7][8][9]. In order to achieve the optimal throughput, a “shortest path” based energy scheduling algorithm was proposed in [6] for a static channel with finite battery capacity and non-causal knowledge of the energy harvesting state. The authors of [7] discussed an MDP model for the case when the energy harvesting and channel fading are known causally and there is no maximum power constraint. A staircase water-filling algorithm was proposed in [7] for the case when the battery capacity is infinite and the energy harvesting and fading channel states are known non-causally. With a finite battery capacity and non-causal energy harvesting and fading channel states, a water-filling procedure was studied in [8], and with an additional maximum power constraint a dynamic water-filling algorithm was proposed in [9]. The authors of [10] developed an online, approximately optimal algorithm based on Lyapunov optimization, designed to maximize a utility function of the number of packet transmissions in energy harvesting networks. In [11], using the discrete MDP model, a reinforcement learning based approach was used to optimize the number of packet transmissions without prior knowledge of the statistics of the energy harvesting process and the channel fading process. The authors of [12] considered a static channel with causal knowledge of the stationary Poisson energy arrival process and gave an MDP-based solution to maximize the average throughput with unconstrained transmission power. On the other hand, the throughput optimization problem with causal information on the energy harvesting state and the fading channel state, and under the maximum power constraint, remains open. In this paper, we tackle this problem.

Specifically, we first consider the power allocation for an access-controlled transmitter, which is powered by a renewable energy source, equipped with a finite-capacity battery, and subject to a maximum power constraint. The channel fading in a slot is assumed to be a random variable that is independent across slots. For energy harvesting, we first assume that it can be predicted accurately for the scheduling period, which can be realized in practice [13][14], and later introduce the prediction error variables. Furthermore, we assume that a control center can temporarily suspend the transmitter’s access due to channel congestion. Such channel access control for the transmitter is modeled as a first-order Markov process. Under the above setting, this paper finds the approximately optimal power allocation for both the finite- and infinite-horizon cases.

To obtain the power allocation, we formulate the stochastic optimization problem as a discrete-time, continuous-state Markov decision process (MDP), with the objective of maximizing the sum of the payoff in the current slot and the discounted expected payoffs in the future slots, where the payoff function is the achievable channel rate. Since the state variables of the MDP problem, including the battery state and the channel state, are continuous, to avoid the prohibitively high complexity of updating the value function caused by the continuous states, this paper introduces an approximate value function. We show that the approximate value function is concave and non-decreasing in the variable corresponding to the energy stored in the battery, which further enables the approximate value function to be updated in closed-form. This is then used to find the approximately optimal power allocation for both the finite- and infinite-horizon cases.

The proposed algorithms provide approximate solutions, whose performance is lower bounded by that of the standard discrete MDP method. Also, to obtain the solution, we solve at most O((bmax/δ)⋅C) convex optimization problems, where bmax is the battery capacity, δ is the approximation precision, and C is the length of the horizon for the finite-horizon case or the maximum number of iterations for the infinite-horizon case. In particular, for the infinite-horizon case, given a convergence tolerance α, an α-converged solution can be obtained within O(logγα) iterations, where γ is the discount factor.

The remainder of the paper is organized as follows. In Section II, we describe the system model, formulate the energy scheduling problem as a continuous-state MDP problem and define the value function. In Section III, we define an approximate value function and prove that the approximate value function is non-decreasing and concave with respect to the continuous battery state. In Section IV, we derive the optimal closed-form procedure for updating the approximate value function and develop the power allocation algorithms for both finite- and infinite-horizon cases. The proposed algorithms are extended to deal with the model with a general payoff function and imperfect energy prediction in Section V. Section VI provides simulation results and Section VII concludes the paper.

II Problem Formulation

II-A System Model

We consider a point-to-point communication system with one transmitter and one receiver, as shown in Fig. 1. We assume a slow fading channel model where the channel gain is constant for a coherence time of Tc (corresponding to a time slot) and changes independently across slots. The signal model for slot k is given by

yk=Hkxk+wk,

(1)

where yk∈CTc is the received signal, xk∈CTc is the transmitted signal, Hk∈C is the channel gain in slot k and wk∈CTc is the additive white Gaussian noise consisting of CN(0,1) elements.

Fig. 1: The system block diagram.

At the beginning of each slot, the transmitter is informed of the channel access status Ak∈{0,1} for the current slot from the control center, where Ak=0 indicates that the channel access is not permitted for slot k while Ak=1 indicates otherwise. We assume that Ak follows a stationary first-order Markov process, whose transition probabilities are given as Pr(Ak+1=0|Ak=1)=qk and Pr(Ak+1=0|Ak=0)=q̃k. If Ak=0, then the transmit power in slot k is pk=0. On the other hand, if Ak=1, then the transmitter needs to decide its transmit power pk.
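As a concrete illustration, the access process can be simulated as a two-state Markov chain. This is only a sketch; the transition probabilities q and q_tilde below are illustrative values, not taken from the paper.

```python
import random

# Toy simulation of the access-control process A_k: a two-state Markov chain
# with Pr(A_{k+1}=0 | A_k=1) = q and Pr(A_{k+1}=0 | A_k=0) = q_tilde.
# The numerical values are illustrative assumptions.
q, q_tilde = 0.2, 0.6

def next_access(A):
    """Draw A_{k+1} given the current access state A_k."""
    p_block = q if A == 1 else q_tilde
    return 0 if random.random() < p_block else 1

A, trace = 1, [1]
for _ in range(20):
    A = next_access(A)
    trace.append(A)
```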

The transmitter is powered by an energy harvesting device, e.g., a solar panel, together with a battery. The battery, which buffers the harvested energy, has a finite capacity, denoted by bmax. Since the energy harvesting process is steady or can be well predicted, we assume that the energy harvested in each of the next K slots is known non-causally, with ek denoting the energy harvested in slot k (the causal energy harvesting model will be considered in Section V). We also assume that hk≜|Hk|2 is independent across slots (i.i.d. when K=∞).

In slot k, the transmitter transmits at a power level of pk (pk=0 if Ak=0), which is constrained by the maximum transmission power pmax and the available energy bk, i.e.,

0≤pk≤min{pmax,bk/Tc}.

(2)

The battery level at the beginning of slot k+1 is given as

bk+1=min{bmax,bk+ek−pkTc},

(3)

with the constraint that the battery level is non-negative for all slots, i.e.,

bk≥0.

(4)
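A minimal sketch of the slot dynamics in (2)-(4) is given below; all parameter values are illustrative assumptions, and the feasible power is drawn at random only to exercise the constraints.

```python
import random

# Battery recursion (3) under the power constraint (2); since
# p_k <= b_k / T_c, the non-negativity constraint (4) holds automatically.
# All parameter values are illustrative assumptions.
b_max, p_max, T_c = 10.0, 2.0, 1.0
e = [1.5, 0.5, 2.0, 1.0, 0.0]     # assumed harvested energy per slot

b = 5.0                           # initial battery level
levels = [b]
for e_k in e:
    p_k = random.uniform(0.0, min(p_max, b / T_c))   # constraint (2)
    b = min(b_max, b + e_k - p_k * T_c)              # battery update (3)
    levels.append(b)
```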

Further, the transmitter receives a payoff r(p,h) based on the transmission power and channel gain. In this paper, we use the achievable channel rate as the payoff, i.e., r(p,h)=log(1+ph). Also, in Section V, we consider a general payoff function r(p,h) which is continuous, non-decreasing, and concave with respect to p given h.

II-B Problem Formulation

We assume that ek can be predicted non-causally while all other variables are only known causally to the transmitter (we will relax this assumption in Section V where we assume that ek is predicted with a random error εk). Denote H≜[h1,h2,…,hK], A≜[A1,A2,…,AK], and a discount factor γ∈[0,1]. We assume that all the side information, e.g., the distributions of all random variables and the predictions of the harvested energy, is known before the first slot. Then the power allocation policy P≜{pk(Γk)|k=1,2,…,K} needs to be calculated to maximize the expected total payoff in the next K slots, where Γk≜(bk,hk,Ak) consists of the observations available at the beginning of slot k. Since bk and hk are continuous variables, it is not possible to store P in a look-up table. Instead, we only store some of the intermediate results, i.e., the approximate value function introduced in Section III, in an efficient way, and then calculate the power allocation when Γk is observed. Specifically, at the beginning of slot k, given Γk, if channel access is permitted, i.e., Ak=1, the transmitter calculates the power level pk. If the channel access is not permitted, i.e., Ak=0, then pk=0. To that end, we formulate the following optimization problem for defining the optimal policy

Note that by (3), the battery level bk forms a continuous-state first-order Markov chain, whereas the channel access state Ak is a discrete-state Markov chain by assumption. Then, we can convert the problem in (5) to its equivalent MDP recursive form [15] in terms of the value function, which represents the total payoff received in the current slot and expected to be received in the future slots.

Specifically, in the MDP model we treat the battery level b and the channel access state A, i.e., (b,A), as the state, the channel h as the observation, and the transmit power p as the decision. Then, the state space becomes {0≤b≤bmax}×{0,1}; and the corresponding decision space is D1(b)={0≤p≤min{b/Tc,pmax}} and D0={0}, corresponding to A=1 and A=0, respectively. The value function is then recursively defined as

Note that, vk(bk,Ak) represents the expected maximum discounted payoff between slots k and K given the side information bk and Ak. Due to the causality and the backward recursion, the observation Γk in slot k does not affect the value function for slot k+1. Also, when Ak=1, given the value function for slot k+1, the optimal power allocation for slot k can be obtained by

p∗k(Γk)=argmaxp∈D1(bk){log(1+phk)+γuk(bk,p,1)},

(9)

where uk(b,p,A) is calculated using (7). Moreover, when Ak=0, we always have

p∗k(Γk)=0.

(10)

III Approximate Value Function

By recursively computing the value function vk(b,A) defined in (6), in theory we can obtain the optimal solution to (9) for each k∈{1,2,…,K}. However, a closed-form expression for vk(b,A) is hard to obtain when K is large, e.g., K≥3. A typical approach is to quantize the continuous variables (b,p,h) to a finite number of discrete levels, i.e., to convert the original problem to a discrete MDP problem [15]. However, with such discretization, solving the corresponding discrete MDP problem involves an exhaustive search on D1(b) for all discretized h, and we can only obtain discrete power levels.

In order to efficiently solve the MDP problem and obtain the continuous power allocation, in this section, we will define an approximate value function by using a piecewise linear approximation based on some discrete samples of {vk(B,A)|B∈{0,δ,2δ,…,bmax},A∈{0,1}} where δ is an approximation precision. This approximate value function is shown to be concave and non-decreasing in the variable corresponding to the energy stored in the battery, making the optimal power allocation problem in (9) (or (18)) a convex optimization problem.

III-A Value Function Approximation

With an approximation precision parameter δ, we define a piecewise linear approximation operator:

Note that, in (13)-(15), we substituted Wkδ(b,A) for vk(b,A) and Uk(bk,pk,Ak) for uk(bk,pk,Ak) in (7) and (6), respectively. Thus we can treat the approximate value function Wkδ(b,A)≜L[Vk(b,A),δ], which is updated by (13)-(15), as an approximation to the value function vk(b,A), which is updated by (6)-(7).
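The piecewise linear operator L[V,δ] can be sketched as follows: the value function is sampled on the battery grid {0, δ, 2δ, …, bmax} and recovered by linear interpolation at any continuous battery level. Function names and the sample values below are illustrative, not the paper's notation.

```python
import numpy as np

# Sketch of L[V, delta]: grid samples of the value function are linearly
# interpolated to give W(b) for any continuous battery level b.
def approx_value(samples, delta, b):
    """W(b) = L[V, delta](b): linear interpolation of the grid samples."""
    grid = np.arange(len(samples)) * delta
    return np.interp(b, grid, samples)

delta, b_max = 0.5, 2.0
grid = np.arange(0.0, b_max + delta / 2, delta)
V = np.log(1.0 + grid)                 # concave, non-decreasing samples
w_mid = approx_value(V, delta, 0.75)   # query between two grid points
```

At grid points the approximation is exact; between grid points it lies between the neighboring samples, which is what makes the concavity arguments of Section III-B carry over.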

We consider the approximation error ||Wkδ(b,A)−vk(b,A)||∞ at slot k (or iteration i=K−k+1). In each iteration, the error is produced by the piecewise linear approximation in (15) and propagated through solving the problem in (14). Then, at the end of each iteration the total error accumulated by the obtained approximate value function is the sum of the newly produced error and the discounted propagated error, growing with the iteration number. Since the update rules for both vk(b,A) and Wkδ(b,A) start from the same initial value function vK(b,A), the total error in the i-th iteration (we use the subscript (i) to denote the i-th iteration, which corresponds to slot K−i+1) can be bounded by

With the approximate value function for each slot k, when A=1, the power allocation given Γ can be obtained by

p∗k(Γ)=argmaxp∈D1(b){log(1+ph)+γUk(b,p,1)}.

(18)

Define Bδ≜{0,δ,2δ,…,bmax}. Note that the approximate value function is linearly recovered from the sample set {Vk(b,A)|b∈Bδ} and Wkδ(b,A)=Vk(b,A) for all b∈Bδ. We can consider the standard dynamic programming with the discretized state space as a special case of the update rules in (13)-(15). Then, the performance achieved with the approximate value function can be characterized as follows.

Proposition 1

The approximate value function obtained by recursively solving (13)-(15) is no less than the discrete value function obtained by the standard dynamic programming method with the state space Bδ×{0,1}, where δ is the approximation precision.

Proof:

Given the discrete state space Bδ×{0,1}, since W(i)δ(B,A)=V(i)(B,A) for any (B,A)∈Bδ×{0,1}, the standard dynamic programming follows the same update rule in (13)-(15) but with a discrete feasible power allocation set for the optimization problem in (14), which is a subset of D1(b). \qed

Moreover, in the standard discrete dynamic programming, we discretize all continuous variables, i.e., bk,hk,ek,pk, and then perform the dynamic programming with an exhaustive search on pk for all possible combinations of (bk,hk); while with the proposed approximate value function, we only discretize the battery state bk and then obtain the approximate value function for each discretized bk in closed-form.
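For contrast, the discrete dynamic-programming baseline described above can be sketched as follows: both the battery state and the power are quantized with step δ and the power grid is searched exhaustively. To keep the sketch self-contained, the channel is frozen to a single known gain and the harvested energy is constant; all values are illustrative assumptions.

```python
import numpy as np

# Toy discrete-DP baseline: quantize b and p with step delta and search
# the power grid exhaustively in a backward recursion. Parameters are
# illustrative; delta divides all quantities so states stay on the grid.
delta, b_max, p_max, T_c, gamma, K, h, e = 0.5, 2.0, 1.0, 1.0, 0.9, 3, 1.0, 0.5
grid = np.arange(0.0, b_max + delta / 2, delta)

V = np.zeros(len(grid))                          # terminal value function
for _ in range(K):                               # backward recursion
    V_new = np.empty_like(V)
    for i, b in enumerate(grid):
        powers = grid[grid <= min(b / T_c, p_max)]   # discrete feasible set
        best = -np.inf
        for p in powers:
            b_next = min(b_max, b + e - p * T_c)
            j = int(round(b_next / delta))       # next state lies on the grid
            best = max(best, np.log(1.0 + p * h) + gamma * V[j])
        V_new[i] = best
    V = V_new
```

The cost of the exhaustive inner search over the power grid is what the closed-form update of Section IV avoids.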

III-B Concavity of Approximate Value Function

In (13)-(15), we note that the approximate value function is based on the solution to an optimization problem (14). To facilitate solving (14), in this subsection, we will show that the approximate value function Wkδ(b,A) given in (15) is concave for 0≤b≤bmax given A∈{0,1}. Then (14) is a convex optimization problem given h and b.

First, we introduce the following lemma, which can be easily shown and is illustrated in Fig. 2.

Lemma 1

If a function f:X→ℝ (X⊆ℝ) is non-decreasing, then for any x′∈X, f(min{x,x′}) is also non-decreasing. Further, if the non-decreasing function f(x) is concave, then f(min{x,x′}) is concave for x∈X∪[x′,∞).

Proposition 2

For any k∈{1,2,…,K−1}, if the approximate value function Wk+1δ(b,A) is non-decreasing with respect to b∈[0,bmax] given A∈{0,1}, so is Wkδ(b,A).

Proof:

If Wk+1δ(b,A) is non-decreasing with respect to b∈[0,bmax] for A∈{0,1}, by Lemma 1, we have that Wk+1δ(min{bmax,b},A) is also non-decreasing with respect to b∈[0,+∞). Then, we have that Uk(b,p,A), which is a linear combination of the terms of the form Wk+1δ(min{bmax,b+ek−pkTc},A), is also non-decreasing with respect to b∈[0,bmax], given p and A.

Given any battery level b∈[0,bmax), channel fading h, a power p0∈DA(b), and ϵ>0 such that b+ϵ≤bmax, we have

p0∈DA(b+ϵ),

(19)

and

log(1+p0h)+γUk(b,p0,A)

≤log(1+p0h)+γUk(b+ϵ,p0,A)

(20)

≤maxp∈DA(b+ϵ){log(1+ph)+γUk(b+ϵ,p,A)}.

(21)

Since Vk(b,A) is a non-negative linear combination of terms of the form maxp∈DA(b){log(1+ph)+γUk(b,p,A)}, Vk(b,A) is non-decreasing with respect to b∈[0,bmax]. Then, by (15), we have that Wkδ(b,A) is also non-decreasing with respect to b∈[0,bmax].
\qed

The next result is on the concavity of Wkδ(b,A).

Proposition 3

For any k∈{1,2,…,K}, if the approximate value function Wk+1δ(b,A) is non-decreasing and concave with respect to b∈[0,bmax] given A∈{0,1}, so is Wkδ(b,A).

Proof:

Since Wk+1δ(b,A) is non-decreasing and concave with respect to b∈[0,bmax] given A∈{0,1}, by Lemma 1, Wk+1δ(min{bmax,b},A) is non-decreasing and concave with respect to b≥0 given A∈{0,1}. Since b+e−pTc is linear in b and p, Wk+1δ(min{bmax,b+e−pTc},A) is jointly concave with respect to b and p. Moreover, it follows that Uk(b,p,A) is also jointly concave with respect to b and p given A∈{0,1} [16].

Since the feasible domain DA(b) differs between A=0 and A=1, we consider the two cases separately.

When A=0, since D0={0}, Vk(b,0) can be written as

Vk(b,0)=Ehk[γUk(b,0,0)].

(22)

Since Uk(b,p,A) is concave with respect to b∈[0,bmax] given p and A∈{0,1}, so is Vk(b,0) [16]. Then, by (15), Wkδ(b,0) is concave with respect to b∈[0,bmax].

When A=1, the feasible domain of the objective function in (6) is given by C≜{(b,p):0≤b≤bmax,0≤p≤min{b/Tc,pmax}}. It can be verified that C is a convex set. Then, for any (b1,p1),(b2,p2)∈C, their convex combination (θb1+¯θb2,θp1+¯θp2)∈C, where θ∈[0,1] and ¯θ≜1−θ.

where (25) follows from the joint concavity, and (26) follows from the definitions in (23) and (24).

Therefore, we have that maxp∈D1(b){log(1+ph)+γUk(b,p,1)} is concave with respect to b∈[0,bmax]. By (14) and (15), we further have that Wkδ(b,1) is concave with respect to b∈[0,bmax] [16].
\qed

From Propositions 2 and 3, we have that if Wk+1δ(b,A) is non-decreasing and concave, so is Wkδ(b,A) for any k∈{1,2,…,K−1}. Since log(1+ph) is non-decreasing and concave with respect to p, it is easily verified by (6) that WKδ(b,A)=VK(b,A)=vK(b,A) is also non-decreasing and concave with respect to b∈[0,bmax] given A. By induction, we obtain the following theorem.

Theorem 1

For k=1,2,…,K, the approximate value function Wkδ(b,A) is non-decreasing and concave with respect to b∈[0,bmax] given A∈{0,1}. Further, the problem in (14) is a convex optimization problem given b∈[0,bmax] and A∈{0,1}.

Since both V(i)(b,A) and W(i)δ(b,A) are concave and non-decreasing, where i=K−k+1 is the iteration number, we can further bound the approximation error ϵi(δ) in (17) as follows.

Proposition 4

For any iteration i, given A, we have

0≤ϵi(δ)≤2V(i)(δ,A)−V(i)(2δ,A)−V(i)(0,A).

(27)

Proof:

By Theorem 1, V(i)(b,A) is non-decreasing and concave with respect to b given A. As illustrated in Fig. 3, for b∈[0,δ], the value of V(i)(b,A) is smaller than the value on line (*) but larger than W(i)δ(b,A), and therefore the distance between line (*) and W(i)δ(b,A) can be taken as an upper bound on the approximation error V(i)(b,A)−W(i)δ(b,A) for b∈[0,δ]. By the decreasing-difference property of concave functions, we have that

V(i)((n+1)δ,A)−V(i)(nδ,A)−(V(i)((n+2)δ,A)−V(i)((n+1)δ,A)) ≥ V(i)((n+2)δ,A)−V(i)((n+1)δ,A)−(V(i)((n+3)δ,A)−V(i)((n+2)δ,A))

(28)

for all n≥0. Then, we further have that 0≤ϵi(δ)≤max{2V(i)(δ,A)−V(i)(2δ,A)−V(i)(0,A),2V(i)(2δ,A)−V(i)(3δ,A)−V(i)(δ,A),⋯}=2V(i)(δ,A)−V(i)(2δ,A)−V(i)(0,A), where ϵi(δ)=||V(i)(b,A)−W(i)δ(b,A)||∞.
\qed
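The bound in (27) can be checked numerically. The sketch below uses V(b)=log(1+b) as an illustrative stand-in for a concave, non-decreasing value function; it is not the value function of the paper's MDP.

```python
import numpy as np

# Numerical check of the Proposition 4 bound: the sup-norm error of the
# piecewise linear approximation never exceeds 2V(delta) - V(2*delta) - V(0).
# V here is an illustrative concave, non-decreasing function.
delta, b_max = 0.25, 2.0
grid = np.arange(0.0, b_max + delta / 2, delta)
V = lambda b: np.log(1.0 + b)

b_fine = np.linspace(0.0, b_max, 2001)
W = np.interp(b_fine, grid, V(grid))        # piecewise linear approximation
err = np.max(V(b_fine) - W)                 # sup-norm approximation error
bound = 2 * V(delta) - V(2 * delta) - V(0.0)
```

For this example the error concentrates on the first segment [0, δ], where the curvature of V is largest, matching the role of the first grid cell in the proof.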

Fig. 3: The piecewise linear approximation of the value function and the approximation error bound.

IV Power Allocation with Perfect Energy Prediction

Note that in (14), we need to solve the following optimization problem for a given B∈Bδ and A∈{0,1}:

p∗(h)=argmaxp(h)∈DA(B){log(1+p(h)h)+γUk(B,p(h),A)},h≥0.

(29)

When A=0, p∗(h)=0. On the other hand, when A=1, we will obtain the optimal solution p∗(h) in closed-form.

Since the approximate value function Wk+1δ(b,A) in (15) is a piecewise linear function of b given A, it follows that Uk(B,p,1) in (13) is also a piecewise linear function with respect to p given B, which is differentiable everywhere except at J≜{p|p=(B+ek−B0)/Tc,B0∈Bδ}. By Theorem 1 and Lemma 1, Uk(B,p,1) is concave and non-increasing with respect to p.

Since Uk(B,p,1) is a piecewise linear function, we denote I≜{p0,p1,…,pN} as the set of the non-differentiable points, where p0=0, pN=min{pmax,B/Tc}, and pi,(0<i<N) is the i-th smallest element in J∩D1(B)∖{p0,pN}. Also, we denote W={w1,w2,…,wN} as the set of the corresponding slopes, where wi is the slope of the segment [pi−1,pi], given by

wi≜−(γTc/δ)EA|1{Vk+1(⌈min{bmax,B+ek−piTc}/δ⌉δ,A)−Vk+1(⌊min{bmax,B+ek−piTc}/δ⌋δ,A)},

(30)

which is derived from (13) and (15). Hence, the derivative of Uk(B,p,1) for p∈D1(B)∖I is

w(p)=wi, if p∈(pi−1,pi).

(31)

Since Uk(b,p,A) is concave and non-increasing with respect to p, we have 0≥w1>w2>…>wN. Fig. 4 is a sketch of the staircase function w(p).
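The staircase structure of w(p) can be checked numerically: with a concave, piecewise linear W of the battery state, U(B,p)=γW(min{bmax,B+e−pTc}) is piecewise linear in p and its derivative is a non-increasing step function. The setup below is an illustrative sketch, not the paper's exact quantities.

```python
import numpy as np

# Numerical sketch of the staircase derivative in (31). The grid samples,
# B, e, and all parameters are illustrative assumptions.
delta, b_max, T_c, gamma = 0.5, 2.0, 1.0, 0.9
grid = np.arange(0.0, b_max + delta / 2, delta)
V = np.log(1.0 + grid)                        # concave grid samples
B, e = 1.0, 0.5

def U(p):
    """gamma * W(min{b_max, B + e - p*T_c}) with piecewise linear W."""
    b_next = min(b_max, B + e - p * T_c)
    return gamma * np.interp(b_next, grid, V)

p = np.linspace(0.0, B / T_c, 1001)
dU = np.diff([U(x) for x in p]) / np.diff(p)  # numerical derivative of U
```

The derivative jumps exactly where B+e−pTc crosses a grid point, i.e., at the non-differentiable points in J.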

Fig. 4: The derivative of Uk(B,p,1) with respect to p.

In this section we first obtain the closed-form solution to (29), and then use it to obtain the optimal power allocation for both finite- and infinite-horizon cases.

Lemma 2

Note that Condition 1 corresponds to the case where p∗ is in the interior of D1(B). In this case, the left-derivative should be non-negative and the right-derivative non-positive at p∗, so that moving p in either direction decreases the objective function. Conditions 2 and 3 correspond to the cases where p∗ is at either boundary of D1(B), where the objective function is non-decreasing and non-increasing for all p∈D1(B), respectively.

The following proposition gives a sufficient condition for the optimality of p∗(h) given B.

Proposition 5

Given any B∈Bδ, for h≥0, if the energy schedule p∗(h)∈intD1(B) satisfies

p∗(h)=−1/w(p∗(h))−1/h, when p∗(h)∈intD1(B)∖I; p∗(h)=−1/w(p∗(h)−)−1/h or −1/w(p∗(h)+)−1/h, when p∗(h)∈I,

Proof:

Substituting (37) into (34)-(35), we have g′h(p∗(h)+)=0 or g′h(p∗(h)−)=0 when p∗(h)∈I, and g′h(p∗(h)+)=g′h(p∗(h)−)=0 when p∗(h)∈intD1(B)∖I. Since g′h(p∗(h)+)≤g′h(p∗(h)−), we have g′h(p∗(h)+)≤0≤g′h(p∗(h)−). Moreover, since gh(p) is concave, we have 0≤g′h(p∗(h)−)<g′h(0+) and g′h(min{pmax,B/Tc}−)<g′h(p∗(h)+)≤0. By Lemma 2 (Condition 1), we conclude the optimality.
\qed
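The closed-form rule in Proposition 5 can be sketched as a scan over the segments of the staircase w(p): on segment i the stationarity condition h/(1+ph)=−wi gives the candidate p=−1/wi−1/h, and concavity pins the optimum to a kink (or to the boundary) when the candidate leaves the segment. The helper function and its inputs below are illustrative, not the paper's notation.

```python
# Sketch of the closed-form power allocation implied by Proposition 5.
# breakpoints = [p_0, ..., p_N]; slopes = [w_1, ..., w_N], negative and
# strictly decreasing. All names and values are illustrative assumptions.
def closed_form_power(h, breakpoints, slopes):
    """Scan the segments of the concave piecewise linear objective."""
    for i, w in enumerate(slopes):
        lo, hi = breakpoints[i], breakpoints[i + 1]
        p = -1.0 / w - 1.0 / h        # interior stationary point candidate
        if p < lo:
            return lo                  # derivative < 0 here: pinned at kink
        if p <= hi:
            return p                   # stationary point inside the segment
    return breakpoints[-1]             # objective increasing throughout: p_N
```

For example, with breakpoints [0, 0.5, 1.0] and slopes [−0.5, −1.0], a strong channel (large h) lands in the interior of a segment, while a weak channel pins the solution to 0.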

Then it is easy to verify that for 1/h∈[−1/wi−pi,−1/wi−pi−1]∩[0,+∞), i=1,2,…,N−1, the solution given by (36) satisfies the optimality condition in Proposition 5.

For 1/h∈(−1/wi+1−pi,−1/wi−pi)∩[0,+∞), i=1,2,…,N−1, we use the next proposition to prove the optimality of (36).

Proposition 6

For any non-differentiable point pi∈I∖{p0,pN}, pi is the optimal solution to (29) for any 1/h∈(−1/wi+1−pi,−1/wi−pi)∩[0,+∞).

Proof:

From (34)-(35), g′h(pi+) and g′h(pi−) are functions of 1/h for a given pi. If (−1/wi+1−pi,−1/wi−pi)∩[0,+∞) is not empty, it is easy to verify that 0=g′h(pi−)>g′h(pi+) when 1/h=−1/wi−pi, and g′h(pi−)>g′h(pi+) and g′h(pi+)