Paper summary

The idea of the paper is to use reinforcement learning methods to learn a step size adaptation rule to be used with evolution strategies.
The authors use SARSA \cite{Rummery:1994} to learn a step size adaptation rule. Their goal is to learn a rule similar to the $1/5$ success rule \cite{schwefel1995evolution-and-o}, which updates the step size every $n$ mutations: it counts the number of successes over the preceding $10n$ mutations, and if this number is less than $2n$ the step lengths are multiplied by a factor of $0.85$; otherwise they are divided by $0.85$.
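The $1/5$ success rule described above can be sketched in a few lines (a minimal illustration; the function and argument names are my own, not from the paper):

```python
def one_fifth_rule(step_size, successes, n, factor=0.85):
    """Sketch of the 1/5 success rule.

    `successes` is the number of successful mutations observed over
    the preceding 10n mutations; checked once every n mutations.
    """
    if successes < 2 * n:          # success rate below 1/5
        return step_size * factor  # shrink the step size
    else:
        return step_size / factor  # grow the step size
```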
SARSA learns a policy which decides (after every $n$ mutations) to either divide the step size, multiply the step size, or keep the step size constant (using the same adaptation factor of $0.85$).
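The three-action step-size update and the standard SARSA temporal-difference rule could look as follows (a hedged sketch; the paper's actual state encoding, action representation, and hyperparameters $\alpha$, $\gamma$ are not specified in this summary, so the values below are illustrative):

```python
FACTOR = 0.85  # same adaptation factor as the 1/5 rule

def apply_action(step_size, action):
    """Apply one of the three step-size actions the policy chooses from."""
    if action == "divide":
        return step_size / FACTOR
    elif action == "multiply":
        return step_size * FACTOR
    else:  # "keep"
        return step_size

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Standard SARSA update: Q(s,a) += alpha * (r + gamma*Q(s',a') - Q(s,a))."""
    q = Q.get((s, a), 0.0)
    q_next = Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q + alpha * (r + gamma * q_next - q)
```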
They evaluate their approach on the Rosenbrock function in several dimensions as well as on the sphere function. The optimization is terminated as soon as the optimized function value falls below a threshold, $f < 10^{-10}$. If this criterion is not met within $N$ generations, the run is counted as not converged.
The authors consider four reward functions to learn the desired policy:
$$\begin{array}{ll}
R_1(s_t)=\left\{
\begin{array}{ll}
+1&\mathrm{SR}\;\mathrm{increased}\\
\;\;\;0&\mathrm{SR}\;\mathrm{unchanged}\\
-1&\mathrm{SR}\;\mathrm{decreased}
\end{array}\right. &
R_2(s_t)=f_{s_t} - f_{s_{t-1}}\\
R_3(s_t)=\mathrm{sgn}(f_{s_t} - f_{s_{t-1}})&
R_4(s_t)=||x_{s_t}-x_{s_{t-1}}||\cdot\mathrm{sgn}(f_{s_t} - f_{s_{t-1}})
\end{array}$$
where SR is the convergence rate, $R(s_t)$ is the reward at step $t$, $s_t$ is the current state, $s_{t-1}$ is the state at which the previous reward was computed, and $x_{s_t}$ is the point at which $f$ is evaluated at time step $t$. Additionally, they constrain the step size to lie in the interval $\left[10^{-15}, 10^{15}\right]$, giving a reward of $-1$ if the step size is out of bounds.
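The four reward functions can be computed as below (a sketch with illustrative names; `sr_change` encodes whether SR increased, stayed unchanged, or decreased as $+1$, $0$, $-1$, and the sign conventions follow the formulas exactly as written above):

```python
import math

def rewards(f_t, f_prev, x_t, x_prev, sr_change):
    """Compute R1..R4 from the current and previous evaluations."""
    sgn = lambda v: (v > 0) - (v < 0)   # sign function
    r1 = sr_change                       # +1 / 0 / -1 from the SR change
    r2 = f_t - f_prev                    # raw function-value difference
    r3 = sgn(f_t - f_prev)               # sign of the difference
    r4 = math.dist(x_t, x_prev) * sgn(f_t - f_prev)  # distance-weighted sign
    return r1, r2, r3, r4
```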
With their experiments they demonstrated that they could learn a policy that is very close to the $1/5$ rule.
The combined RL-ES approach converged more slowly than the pure ES approach (e.g. with $R_1$ about $3$ to $7$ times slower on the sphere function, and with $R_2$ approximately $3$ times slower), and the following table shows that different rewards lead to different convergence rates.
Problem | $N$ | $R_1$ conv. rate |$R_2$ conv. rate |$R_3$ conv. rate |$R_4$ conv. rate
---------:|:------:|------------------------:|----------------------:|-----------------------:|-----------------------:
Sph 1D| $10^4$| $0.72$|$1.00$|$0.96$|$1.00$
Sph 5D| $10^4$| $0.64$|$0.98$|$0.94$|$0.99$
Sph 20D| $10^4$| $0.50$|$0.92$|$0.90$|$0.94$
Sph 80D| $10^5$| $0.57$|$1.00$|$0.99$|$1.00$
Ros 2D| $10^6$| $0.63$|$0.99$|$0.80$|$1.00$
Ros 5D| $10^7$| $0.77$|$1.00$|$0.77$|$1.00$
where $N$ is the number of generations within which the optimized function value has to be lower than $10^{-10}$.
The authors demonstrated in a small setting that they were able to learn a well-tested adaptation rule. They showed convincingly that the choice of the reward function is crucial for learning and that similar rewards can lead to very different results.
