The following solution algorithms are currently implemented for the DiscreteDP type:

value iteration;

policy iteration (default);

modified policy iteration.

Policy iteration computes an exact optimal policy in finitely many iterations,
while value iteration and modified policy iteration return an $\varepsilon$-optimal policy
for a prespecified value of $\varepsilon$.

Value iteration relies only on the fact that
the Bellman operator $T$ is a contraction mapping,
so that iterative application of $T$ to any initial function $v^0$
converges to its unique fixed point $v^*$.
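To illustrate, here is a minimal Julia sketch of this idea. The two-state, two-action arrays below are hypothetical and chosen only for illustration (they are not from the text): `R[s, a]` is the reward of action `a` in state `s`, and `Q[s, a, :]` the corresponding transition distribution. Repeated application of the Bellman operator shrinks the sup-norm distance between successive iterates by at least a factor $\beta$.

```julia
using LinearAlgebra

# Hypothetical two-state, two-action example (illustration only)
R = [5.0  10.0;
     -1.0  2.0]
Q = Array{Float64}(undef, 2, 2, 2)
Q[1, 1, :] = [0.5, 0.5];  Q[1, 2, :] = [0.0, 1.0]
Q[2, 1, :] = [1.0, 0.0];  Q[2, 2, :] = [0.5, 0.5]
beta = 0.95

# (T v)(s) = max_a { r(s, a) + beta * sum_{s'} Q[s, a, s'] * v(s') }
function bellman(v, R, Q, beta)
    n, m = size(R)
    return [maximum(R[s, a] + beta * dot(Q[s, a, :], v) for a in 1:m) for s in 1:n]
end

# Iterative application of T contracts toward the fixed point v*
function demo_contraction(R, Q, beta; iters=5)
    v = zeros(size(R, 1))
    for i in 1:iters
        v_new = bellman(v, R, Q, beta)
        println(i, "  ", maximum(abs.(v_new .- v)))  # shrinks by a factor <= beta
        v = v_new
    end
    return v
end

demo_contraction(R, Q, beta)
```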

Policy iteration more closely exploits the particular structure of the problem:
each iteration consists of a policy evaluation step,
which computes the value $v_{\sigma}$ of a policy $\sigma$
by solving the linear equation $v = T_{\sigma} v$,
and a policy improvement step, which computes a $v_{\sigma}$-greedy policy.
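The following Julia sketch illustrates these two steps on the same hypothetical `R`, `Q`, `beta` as above; it is a simplified illustration, not the actual implementation. Evaluation solves the linear system $(I - \beta Q_{\sigma}) v = r_{\sigma}$ exactly, and improvement picks an argmax action in each state.

```julia
using LinearAlgebra

R = [5.0  10.0;
     -1.0  2.0]
Q = Array{Float64}(undef, 2, 2, 2)
Q[1, 1, :] = [0.5, 0.5];  Q[1, 2, :] = [0.0, 1.0]
Q[2, 1, :] = [1.0, 0.0];  Q[2, 2, :] = [0.5, 0.5]
beta = 0.95

# Policy evaluation: v_sigma solves v = r_sigma + beta * Q_sigma * v
function evaluate(sigma, R, Q, beta)
    n = size(R, 1)
    r_sigma = [R[s, sigma[s]] for s in 1:n]
    Q_sigma = vcat([Q[s, sigma[s], :]' for s in 1:n]...)
    return (I - beta * Q_sigma) \ r_sigma
end

# Policy improvement: a v-greedy policy picks an argmax action in each state
function greedy(v, R, Q, beta)
    n, m = size(R)
    return [argmax([R[s, a] + beta * dot(Q[s, a, :], v) for a in 1:m]) for s in 1:n]
end

function policy_iteration(R, Q, beta)
    sigma = ones(Int, size(R, 1))                  # arbitrary initial policy
    while true
        v_sigma = evaluate(sigma, R, Q, beta)      # policy evaluation step
        sigma_new = greedy(v_sigma, R, Q, beta)    # policy improvement step
        sigma_new == sigma && return sigma, v_sigma
        sigma = sigma_new
    end
end

println(policy_iteration(R, Q, beta))
```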

Modified policy iteration replaces the policy evaluation step
in policy iteration with "partial policy evaluation",
which computes an approximation of the value of a policy $\sigma$
by iterating $T_{\sigma}$ a specified number of times.
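Below is a sketch of partial policy evaluation in isolation, again on the hypothetical `R`, `Q`, `beta` from above and with an arbitrary fixed policy `sigma` and iteration count `k` chosen for illustration: instead of solving $v = T_{\sigma} v$ exactly, apply $T_{\sigma}$ a fixed number of times.

```julia
using LinearAlgebra

R = [5.0  10.0;
     -1.0  2.0]
Q = Array{Float64}(undef, 2, 2, 2)
Q[1, 1, :] = [0.5, 0.5];  Q[1, 2, :] = [0.0, 1.0]
Q[2, 1, :] = [1.0, 0.0];  Q[2, 2, :] = [0.5, 0.5]
beta = 0.95
sigma = [2, 2]                              # some fixed policy (hypothetical)

n = size(R, 1)
r_sigma = [R[s, sigma[s]] for s in 1:n]
Q_sigma = vcat([Q[s, sigma[s], :]' for s in 1:n]...)

# (T_sigma v)(s) = r(s, sigma(s)) + beta * sum_{s'} Q[s, sigma(s), s'] * v(s')
T_sigma(v) = r_sigma + beta * (Q_sigma * v)

# Partial evaluation: iterate T_sigma k times from an arbitrary starting point
function partial_evaluation(v0, k)
    v = v0
    for _ in 1:k
        v = T_sigma(v)
    end
    return v
end

v_approx = partial_evaluation(zeros(n), 20)
v_exact  = (I - beta * Q_sigma) \ r_sigma   # exact value from the linear equation
println(v_approx, "  ", v_exact)
```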

Below we describe our implementation of these algorithms in more detail.
(While not made explicit in the descriptions, in the actual implementation
each algorithm also terminates when the number of iterations reaches max_iter.)

Value iteration iterates the Bellman operator: starting from an initial guess $v^0$,
compute $v^{i+1} = T v^i$ until $\lVert v^{i+1} - v^i \rVert < [(1 - \beta) / (2 \beta)] \varepsilon$;
then compute a $v^{i+1}$-greedy policy $\sigma$, and return $v^{i+1}$ and $\sigma$.

Given $\varepsilon > 0$,
the value iteration algorithm terminates in a finite number of iterations,
and returns an $\varepsilon/2$-approximation of the optimal value function and
an $\varepsilon$-optimal policy function
(unless max_iter is reached).
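The sketch below puts the pieces together into a full value-iteration loop using the stopping rule $\lVert v^{i+1} - v^i \rVert < [(1 - \beta) / (2 \beta)] \varepsilon$, which is consistent with the $\varepsilon/2$ guarantee above. The arrays and the values of `epsilon` and `max_iter` are hypothetical, and the code is an illustration rather than the actual implementation.

```julia
using LinearAlgebra

R = [5.0  10.0;
     -1.0  2.0]
Q = Array{Float64}(undef, 2, 2, 2)
Q[1, 1, :] = [0.5, 0.5];  Q[1, 2, :] = [0.0, 1.0]
Q[2, 1, :] = [1.0, 0.0];  Q[2, 2, :] = [0.5, 0.5]
beta = 0.95

bellman(v) = [maximum(R[s, a] + beta * dot(Q[s, a, :], v) for a in 1:size(R, 2))
              for s in 1:size(R, 1)]
greedy(v)  = [argmax([R[s, a] + beta * dot(Q[s, a, :], v) for a in 1:size(R, 2)])
              for s in 1:size(R, 1)]

function value_iteration(epsilon; max_iter=10_000)
    tol = (1 - beta) / (2 * beta) * epsilon
    v = zeros(size(R, 1))                 # any initial v^0
    for i in 1:max_iter
        v_new = bellman(v)
        if maximum(abs.(v_new .- v)) < tol
            return greedy(v_new), v_new, i    # epsilon-optimal policy
        end
        v = v_new
    end
    return greedy(v), v, max_iter         # max_iter reached
end

println(value_iteration(1e-4))
```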

Given $\varepsilon > 0$,
provided that $v^0$ is such that $T v^0 \geq v^0$,
the modified policy iteration algorithm terminates in a finite number of iterations,
and returns an $\varepsilon/2$-approximation of the optimal value function and
an $\varepsilon$-optimal policy function
(unless max_iter is reached).

The condition for convergence, $T v^0 \geq v^0$, is satisfied
for example when $v^0 = v_{\sigma}$ for some policy $\sigma$,
or when $v^0(s) = \min_{(s', a)} r(s', a)$ for all $s$.
If v_init is not specified, it is set to the latter, i.e., $v^0(s) = \min_{(s', a)} r(s', a)$ for all $s$.
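As a quick check on the hypothetical example used in the sketches above, the following snippet verifies that this default initialization satisfies $T v^0 \geq v^0$ for that particular `R`, `Q`, `beta`.

```julia
using LinearAlgebra

R = [5.0  10.0;
     -1.0  2.0]
Q = Array{Float64}(undef, 2, 2, 2)
Q[1, 1, :] = [0.5, 0.5];  Q[1, 2, :] = [0.0, 1.0]
Q[2, 1, :] = [1.0, 0.0];  Q[2, 2, :] = [0.5, 0.5]
beta = 0.95

bellman(v) = [maximum(R[s, a] + beta * dot(Q[s, a, :], v) for a in 1:size(R, 2))
              for s in 1:size(R, 1)]

v0 = fill(minimum(R), size(R, 1))   # v^0(s) = min over (s', a) of r(s', a)
println(all(bellman(v0) .>= v0))    # true for this example
```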

If $\{\sigma^i\}$ is the sequence of policies obtained by policy iteration
with an initial policy $\sigma^0$,
one can show that $T^i v_{\sigma^0} \leq v_{\sigma^i}$ ($\leq v^*$),
so that the number of iterations required for policy iteration is weakly smaller than
that for value iteration,
and in many cases the former is significantly smaller than the latter.

The policy evaluation step in policy iteration,
which solves the linear equation $v = T_{\sigma} v$
to obtain the policy value $v_{\sigma}$,
can be expensive for problems with a large number of states.
Modified policy iteration reduces the cost of this step
by using an approximation of $v_{\sigma}$ obtained by iterating $T_{\sigma}$.
The tradeoff is that this approach only computes an $\varepsilon$-optimal policy,
and for small $\varepsilon$, takes more iterations than policy iteration
(though far fewer than value iteration).
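The following rough comparison counts iterations of the three algorithms on the hypothetical `R`, `Q`, `beta` used in the sketches above. The stopping rules are the simplified sup-norm rules from those sketches, not necessarily those of the actual implementation, and `epsilon` and `k` are illustrative values; still, it shows the typical ordering (policy iteration fewest, value iteration most, modified policy iteration in between).

```julia
using LinearAlgebra

R = [5.0  10.0;
     -1.0  2.0]
Q = Array{Float64}(undef, 2, 2, 2)
Q[1, 1, :] = [0.5, 0.5];  Q[1, 2, :] = [0.0, 1.0]
Q[2, 1, :] = [1.0, 0.0];  Q[2, 2, :] = [0.5, 0.5]
beta, epsilon, n, m = 0.95, 1e-5, 2, 2

bellman(v) = [maximum(R[s, a] + beta * dot(Q[s, a, :], v) for a in 1:m) for s in 1:n]
greedy(v)  = [argmax([R[s, a] + beta * dot(Q[s, a, :], v) for a in 1:m]) for s in 1:n]
r_of(sigma) = [R[s, sigma[s]] for s in 1:n]
Q_of(sigma) = vcat([Q[s, sigma[s], :]' for s in 1:n]...)

function vi_count()                                        # value iteration
    v, tol, i = zeros(n), (1 - beta) / (2 * beta) * epsilon, 0
    while true
        v_new = bellman(v)
        i += 1
        maximum(abs.(v_new .- v)) < tol && return i
        v = v_new
    end
end

function pi_count()                                        # policy iteration
    sigma, i = ones(Int, n), 0
    while true
        v_sigma = (I - beta * Q_of(sigma)) \ r_of(sigma)   # exact evaluation
        sigma_new = greedy(v_sigma)
        i += 1
        sigma_new == sigma && return i
        sigma = sigma_new
    end
end

function mpi_count(; k=20)                                 # modified policy iteration
    v, tol, i = fill(minimum(R), n), (1 - beta) / (2 * beta) * epsilon, 0
    while true
        u = bellman(v)
        i += 1
        maximum(abs.(u .- v)) < tol && return i
        sigma = greedy(u)                                  # u-greedy policy
        r_s, Q_s = r_of(sigma), Q_of(sigma)
        for _ in 1:k                                       # partial evaluation
            u = r_s + beta * (Q_s * u)
        end
        v = u
    end
end

println("VI: ", vi_count(), "  PI: ", pi_count(), "  MPI: ", mpi_count())
```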