This question has already been posed on Cross Validated without receiving a correct formal answer, so I reformulate it here to gain the attention of mathematicians. I am referring to Chapter 3 of Sutton and Barto's book "Reinforcement Learning: An Introduction", available here:

Let us assume that we have three finite sets, $S$ the set of states, $A$ the set of actions, and $R\subset \mathbb{R}$ the set of rewards.
Let $p : S \times R \times S \times A \rightarrow [0,1]$ be a function such that

$$\sum_{s'\in S} \sum_{r \in R} p(s',r,s,a)=1$$ for each $s\in S, a \in A$, so that it defines a joint discrete probability distribution on $S\times R$ for every choice of $s\in S, a \in A$, and we denote it then, with a slight abuse of notation, as $p(s',r|s,a)$. From a probabilistic point of view, we can think that if at time $t$ we are in state $S_t=s$ and we perform an action $A_t=a \in A$, then the joint probability of the two random variables $S_{t+1},R_{t+1}$, representing respectively the next state and the reward obtained, is given exactly by $p$:

$$\mathbb{P}\{S_{t+1}=s',R_{t+1}=r|S_{t}=s,A_t=a\} = p(s',r|s,a)$$

and so, for example, given the present state $s$ and action $a$, the expected value of the immediate reward is $r(s,a)=\sum_{r \in R} r\sum_{s'\in S}p(s',r|s,a)$, and the state-transition probability (again with a slight abuse of notation) is $p(s'|s,a)=\sum_{r\in R}p(s',r|s,a)$.
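As a concrete illustration of these definitions, here is a minimal tabular encoding of such a dynamics function $p$ in Python; the two-state MDP below is invented purely for this sketch (it is not from the book):

```python
# A tiny hypothetical MDP, invented purely to illustrate the definitions above.
# p[(s, a)] maps outcomes (s', r) to probabilities, i.e. p(s', r | s, a).
S = ["s0", "s1"]
A = ["left", "right"]

p = {
    ("s0", "left"):  {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"):  {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 0.4, ("s0", 1.0): 0.6},
}

# Normalization: the sum over (s', r) of p(s', r | s, a) must be 1 for each (s, a).
for dist in p.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

def expected_reward(s, a):
    """r(s, a) = sum_r r * sum_{s'} p(s', r | s, a)."""
    return sum(r * prob for (s_next, r), prob in p[(s, a)].items())

def transition_prob(s_next, s, a):
    """p(s' | s, a) = sum_r p(s', r | s, a)."""
    return sum(prob for (sn, r), prob in p[(s, a)].items() if sn == s_next)

print(expected_reward("s0", "left"))        # 0.3
print(transition_prob("s1", "s0", "left"))  # 0.3
```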

Let us define $G_t$ as the random variable representing the sum of discounted future rewards obtainable from time $t$:
$$G_t=\sum_{k=0}^{+\infty} \gamma^k R_{t+1+k}$$
It is immediate to prove that $G_t$ admits the recursive representation
$$G_t= R_{t+1}+\gamma G_{t+1}$$
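For an episodic task (rewards identically zero after termination) the infinite sum reduces to a finite one, and the recursion can be checked numerically; a small sketch with an arbitrary, invented reward sequence:

```python
# Check G_t = R_{t+1} + gamma * G_{t+1} on a finite reward sequence (an
# episodic task: rewards are zero after the last entry, so the infinite
# discounted sum reduces to a finite one). The numbers are arbitrary.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0, 0.5]  # rewards[k] plays the role of R_{t+1+k}

def G(t):
    """Discounted return from time t: sum_{k>=0} gamma^k * R_{t+1+k}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# Verify the recursion at every time step (G is zero past the last reward).
for t in range(len(rewards)):
    tail = G(t + 1) if t + 1 < len(rewards) else 0.0
    assert abs(G(t) - (rewards[t] + gamma * tail)) < 1e-12
```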

We define a policy $\pi$ as a function $\pi: A \times S \rightarrow [0,1]$ such that for every $s \in S$, $\sum_{a \in A}\pi(a,s)=1$, so that it defines, for every choice of $s \in S$, a probability distribution over $A$, and we denote it with $\pi(a|s)$. We also define the state-value function $v_{\pi}$ for a policy $\pi$ as

$$v_{\pi}(s)=\mathbb{E}_{\pi}[G_t|S_t=s]$$

that is, the expected value of $G_t$ conditioned on being in state $S_t=s$ and using policy $\pi$ to select (randomly) actions at the present time and also at future time steps, and analogously the action-value function $q_{\pi}$ as

$$q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t |S_t=s,A_t=a]$$.
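A sketch of how $v_\pi$ and $q_\pi$ can be computed in practice by iterating the Bellman expectation equations on a small tabular MDP; the MDP and the policy below are invented for illustration:

```python
# Iterative policy evaluation on a hypothetical two-state MDP (all numbers
# invented for illustration). Computes v_pi and q_pi and checks the relation
# v_pi(s) = sum_a pi(a|s) * q_pi(s, a).
gamma = 0.9
S = ["s0", "s1"]
A = ["left", "right"]
p = {
    ("s0", "left"):  {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"):  {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 0.4, ("s0", 1.0): 0.6},
}
pi = {"s0": {"left": 0.5, "right": 0.5}, "s1": {"left": 1.0, "right": 0.0}}

def evaluate(pi, iters=2000):
    """Fixed-point iteration of v(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a)[r + gamma v(s')]."""
    v = {s: 0.0 for s in S}
    for _ in range(iters):
        v = {s: sum(pi[s][a] * sum(prob * (r + gamma * v[sn])
                                   for (sn, r), prob in p[(s, a)].items())
                    for a in A)
             for s in S}
    return v

def q_from_v(v):
    """q(s, a) = sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    return {(s, a): sum(prob * (r + gamma * v[sn])
                        for (sn, r), prob in p[(s, a)].items())
            for s in S for a in A}

v_pi = evaluate(pi)
q_pi = q_from_v(v_pi)
for s in S:
    assert abs(v_pi[s] - sum(pi[s][a] * q_pi[(s, a)] for a in A)) < 1e-6
```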

It is quite simple to prove that these two functions satisfy the mutual relations
$$v_{\pi}(s)=\sum_{a \in A} \pi(a|s)\, q_{\pi}(s,a)$$
$$q_{\pi}(s,a)=\sum_{s' \in S}\sum_{r \in R} p(s',r|s,a)\left[r+\gamma\, v_{\pi}(s')\right]$$

A policy $\pi$ is said to be better than another one $\pi'$ if and only if $v_{\pi}(s) \geq v_{\pi'}(s)$ for all $s\in S$; an optimal policy $\pi_*$ is a policy that is better than all other ones, that is, $v_{\pi_*}(s) \geq v_{\pi}(s)$ for each $s \in S$ and for each policy $\pi$.
Let us assume that there exists at least one optimal policy (this also should be proved, but let's skip it in this question). Then we can define the optimal state-value function

$$v_*(s) \doteq \max_\pi v_{\pi}(s)$$

for each $s \in S$, and it is clear that, for each optimal policy $\pi_*$, we have $v_{\pi_*}(s)=v_*(s)$ for each $s \in S$. The same applies to the optimal action-value function $$q_*(s,a) \doteq \max_{\pi} q_\pi(s,a)$$

that is, for each optimal policy $\pi_*$, we have $q_{\pi_*}(s,a)=q_*(s,a)$ for each $s \in S$ and $a \in A$.

Finally, here is my question:

why is it true that $$v_*(s) = \max_{a \in A} q_*(s,a)$$ for each $s \in S$?

It is obvious that, given an optimal policy $\pi_*$, if for each $s \in S$ we take $a_s \in \text{arg}\,\max\limits_{a \in A}q_{\pi_*}(s,a)$, then we have
$$v_*(s)=v_{\pi_*}(s)=\sum_{a \in A} \pi_*(a|s)q_{\pi_*}(s,a)\leq \sum_{a \in A} \pi_*(a|s)q_{\pi_*}(s,a_s)=q_{\pi_*}(s,a_s)=q_*(s,a_s)$$

but I couldn't find a simple way to prove the reverse inequality, neither directly nor by contradiction.
Can anybody help?
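Not a proof, of course, but the identity $v_*(s) = \max_{a \in A} q_*(s,a)$ is easy to check numerically with value iteration on a small invented MDP (same hedge as above: a sketch, not the book's code):

```python
# Value iteration on a hypothetical two-state MDP (numbers invented for
# illustration). Numerically checks v_*(s) = max_a q_*(s, a), where q_* is
# obtained from v_* via the Bellman relation q_*(s,a) = sum p(s',r|s,a)[r + gamma v_*(s')].
gamma = 0.9
S = ["s0", "s1"]
A = ["left", "right"]
p = {
    ("s0", "left"):  {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"):  {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 0.4, ("s0", 1.0): 0.6},
}

def value_iteration(iters=2000):
    """Fixed-point iteration of v(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    v = {s: 0.0 for s in S}
    for _ in range(iters):
        v = {s: max(sum(prob * (r + gamma * v[sn])
                        for (sn, r), prob in p[(s, a)].items())
                    for a in A)
             for s in S}
    return v

v_star = value_iteration()
q_star = {(s, a): sum(prob * (r + gamma * v_star[sn])
                      for (sn, r), prob in p[(s, a)].items())
          for s in S for a in A}
for s in S:
    assert abs(v_star[s] - max(q_star[(s, a)] for a in A)) < 1e-6
```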

3 Answers

If I have read through the text correctly, I would say that indeed $v_{\pi}(s) \le \max_{a \in A} q_{\pi}(s,a)$ for all policies $\pi$, since $v_{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[q_{\pi}(s,a)]$ and the expected value is less than or equal to the maximum value.

Now with respect to your question, if
$$
v_{\pi_*}(s) < \max_{a \in A} q_{\pi_*}(s,a) = q_{\pi_*}(s,a_s)
$$
then there should be an even better policy $\pi_*'(a|s) = [a=a_s]$ that would select $a_s$ deterministically when in state $s$, giving $v_{\pi_*}(s) < v_{\pi_*'}(s) = q_{\pi_*}(s,a_s)$ and contradicting the fact that $\pi_*$ is an optimal policy.

$\begingroup$The equality, @hardhu, is due to the fact that the policy $\pi_*'(a|s) = [a=a_s]$ is optimal by definition, given that $a_s \in \text{arg}\,\max\limits_{a \in A}q_{\pi_*}(s,a) = \text{arg}\,\max\limits_{a \in A}q_{*}(s,a)$. Note that $q_{*}(s,a)$ is policy-free, and thus the choice $\pi_*'(a|s) = [a=a_s]$ represents the best choice that can be made when in state $s$ (or one of the best, in case there are many global maxima of $q_{\pi_*}(s,a)$ with respect to $a$).$\endgroup$
– Sotiris, Jan 27 at 19:42

Since the characters allowed for comments are not enough to express my doubts about your answer, I write them here.
I agree with you that one possible way to answer my question by contradiction is the one you have provided, and intuitively this idea works, but I still believe it should be formalized correctly.
Let me be more precise: what you are writing is that if there exists $\bar{s} \in S$ such that $v_*(\bar{s}) < \max_{a \in A} q_*(\bar{s},a)$, then taking an optimal policy $\pi_*$ and $a_{\bar{s}} \in \text{arg}\,\max\limits_{a \in A}q_*(\bar{s},a)$ we can define a policy $\pi_{\bar{s}}$ such that
$$\pi_{\bar{s}}(a|s) = \begin{cases}
\pi_*(a|s) \qquad \text{if} \qquad s\not= \bar{s}\\
1 \qquad \text{if} \qquad s= \bar{s}, a=a_{\bar{s}}\\
0 \qquad \text{otherwise}
\end{cases}
$$

Then it is clear that for that policy $v_{\pi_{\bar{s}}}(\bar{s})=q_{\pi_{\bar{s}}}(\bar{s},a_{\bar{s}})$, so that if we prove that $q_{\pi_{\bar{s}}}(\bar{s},a_{\bar{s}})=q_*(\bar{s},a_{\bar{s}})$ we are done since we get the contradiction $v_{\pi_{\bar{s}}}(\bar{s}) > v_*(\bar{s})$.
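The single-state improvement step behind this argument can at least be illustrated numerically (a sketch on an invented two-state MDP, not a proof): switching one state of a policy to a greedy action never decreases the value, and at the switched state the new state value equals the new action value of the chosen action.

```python
# Numerical illustration of the single-state policy improvement step on a
# hypothetical two-state MDP (all numbers invented). Starting from an
# arbitrary policy pi, switch state s_bar to the greedy action a_bar and
# re-evaluate: the value never decreases anywhere, and at s_bar the new
# state value equals the new action value of a_bar.
gamma = 0.9
S = ["s0", "s1"]
A = ["left", "right"]
p = {
    ("s0", "left"):  {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"):  {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 0.4, ("s0", 1.0): 0.6},
}

def evaluate(pi, iters=2000):
    """Iterative policy evaluation: v(s) = sum_a pi(a|s) sum p(s',r|s,a)[r + gamma v(s')]."""
    v = {s: 0.0 for s in S}
    for _ in range(iters):
        v = {s: sum(pi[s][a] * sum(prob * (r + gamma * v[sn])
                                   for (sn, r), prob in p[(s, a)].items())
                    for a in A)
             for s in S}
    return v

def q_from_v(v):
    """q(s, a) = sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    return {(s, a): sum(prob * (r + gamma * v[sn])
                        for (sn, r), prob in p[(s, a)].items())
            for s in S for a in A}

pi = {s: {"left": 1.0, "right": 0.0} for s in S}  # arbitrary starting policy
v = evaluate(pi)
q = q_from_v(v)

s_bar = "s0"
a_bar = max(A, key=lambda a: q[(s_bar, a)])       # greedy action at s_bar
pi_bar = dict(pi)
pi_bar[s_bar] = {a: float(a == a_bar) for a in A}  # deterministic at s_bar only

v_bar = evaluate(pi_bar)
q_bar = q_from_v(v_bar)
assert all(v_bar[s] >= v[s] - 1e-9 for s in S)           # improvement everywhere
assert abs(v_bar[s_bar] - q_bar[(s_bar, a_bar)]) < 1e-6  # v_{pi_bar}(s_bar) = q_{pi_bar}(s_bar, a_bar)
```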

But my doubt is then exactly this:
why is it true that $q_{\pi_{\bar{s}}}(\bar{s},a_{\bar{s}})=q_*(\bar{s},a_{\bar{s}})$?

Let $T$ be the time step variable and let $t$ be a particular time step.

Given an agent, we define a policy $\pi$ to be the conditional PMF according to which that agent takes a particular action given a particular state. That is, $\pi(a \mid s)$ is the probability that that agent takes action $a$ given the state $s$. More explicitly, let $\Pi$ denote the set of all possible policies on $\mathcal{S}$ and $\mathcal{A}$ and, for a policy $\pi$, we define

Notice that the conditional PMF $\pi(a \mid s)$ is constant relative to the time step variable $T$. Many authors use this definition and hence assume that each policy behaves the same at all time steps.

But it is often helpful to define a policy where the conditional PMF's vary with the time step variable $T$. Indeed, there is no mathematical or probabilistic reason that precludes us from defining a policy with a different PMF at each time step. And when we learn about optimal policies (below), we can easily imagine some optimal policies whose behavior differs between time steps.

Hence, more generally, we define a policy $\pi$ by a set of conditional PMF's

Proof If $g_{t+1}$ is a possible outcome of $G_{t+1}$, then $p(g_{t+1} \mid s',\phi_t)=p(g_{t+1} \mid s',\pi)$, since $\phi_t=\pi$ for $T=t+1,t+2,\dots$. Also note that, for all $a\in\mathcal{A}(s_0)$, we have

$$p(s' \mid s_0,a,\phi_t)=p(s' \mid s_0,a,\pi)$$

That is, if we know the state and action in the current time step, then the policy is irrelevant to determining the next state. Hence, for all $a\in\mathcal{A}(s_0)$, we have

$\begingroup$The only doubt I have with regard to this proof is that usually policies are not defined with respect to a particular time step, but only w.r.t. the states (that is, $\pi(a|s)$, not $\pi(a|s,t)$). It reminds me of the proof given by S. Ross at page 31 of his book "Introduction to Stochastic Dynamic Programming", which I unsuccessfully tried to modify in order to apply to this case. Thanks for your contribution.$\endgroup$
– hardhu, Feb 3 at 11:03

$\begingroup$This looks like a great book, thanks. In regards to your doubt, can you think of a reason why the policy should not be dependent on the time step? There's no mathematical reason to preclude the definition of a policy as time-dependent. I will add a quick blurb about this to the beginning of my answer. I would really appreciate your further thoughts on this.$\endgroup$
– waynemystir, Feb 3 at 21:42

$\begingroup$Sorry, I wasn't very clear in my comment: you are right, from a general point of view there is no reason not to consider time-dependent policies, and this could instead be an advantage in tackling non-stationary problems. What I meant is that in the description of Markov decision processes in the Sutton and Barto book which I mentioned, policies were introduced as dependent only on states, since the aim there is to find a rule to choose the best action in a state regardless of the time step in which the state is visited.$\endgroup$
– hardhu, Feb 5 at 15:56

$\begingroup$I don't see that he makes any such restriction to stationary policies after chapter 2. And the similar equation to which your question refers is in chapter 3.$\endgroup$
– waynemystir, Feb 5 at 23:29