arXiv, Feb. 22nd

Many continuous control tasks have bounded action spaces and clip
out-of-bound actions before execution. Policy gradient methods often optimize
policies as if actions were not clipped. We propose clipped action policy
gradient (CAPG) as an alternative policy gradient estimator that exploits the
knowledge of actions being clipped to reduce the variance in estimation. We
prove that CAPG is unbiased and achieves lower variance than the original
estimator that ignores action bounds. Experimental results demonstrate that
CAPG generally outperforms the original estimator, indicating its promise as a
better policy gradient estimator for continuous control tasks.
( https://arxiv.org/abs/1802.07564 , 445kb)
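To make the idea concrete, here is a minimal 1-D sketch of what a clipped-action-aware log-probability gradient could look like for a Gaussian policy, based only on the idea described in the abstract (not the paper's actual derivation). All function and variable names (`capg_grad_logp`, `low`, `high`, etc.) are illustrative assumptions: when the sampled pre-clip action falls outside the bounds, the environment executes the boundary value, so the estimator differentiates the log of the Gaussian tail mass instead of the log density.

```python
import math

def _phi(z):
    # Standard normal pdf.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def standard_grad_logp(u, mu, sigma):
    """d/dmu log N(u; mu, sigma^2): the usual estimator that ignores clipping."""
    return (u - mu) / (sigma * sigma)

def capg_grad_logp(u, mu, sigma, low, high):
    """d/dmu of the log-probability of the *executed* (clipped) action.

    Illustrative sketch: if the pre-clip sample u lies outside [low, high],
    the executed action is the boundary, whose probability is the Gaussian
    tail mass, so we differentiate the log of that mass instead.
    """
    if u <= low:
        # Executed action is `low`; its probability is Phi((low - mu) / sigma).
        z = (low - mu) / sigma
        return -_phi(z) / _Phi(z) / sigma
    if u >= high:
        # Executed action is `high`; its probability is Phi((mu - high) / sigma).
        z = (mu - high) / sigma
        return _phi(z) / _Phi(z) / sigma
    # Inside the bounds the two estimators coincide.
    return standard_grad_logp(u, mu, sigma)
```

Inside the action bounds this reduces to the ordinary likelihood-ratio gradient; only clipped samples are treated differently, which is consistent with the abstract's claim that the estimator stays unbiased while using the knowledge that out-of-bound actions are clipped.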