arXiv on Mar. 1st

Deep reinforcement learning has emerged as a powerful tool for a variety of
learning tasks; however, deep nets typically exhibit forgetting when learning
multiple tasks in sequence. To mitigate forgetting, we propose an experience
replay process that augments the standard FIFO buffer and selectively stores
experiences in a long-term memory. We explore four strategies for selecting
which experiences will be stored: favoring surprise, favoring reward, matching
the global training distribution, and maximizing coverage of the state space.
We show that distribution matching successfully prevents catastrophic
forgetting, and is consistently the best approach on all domains tested. While
distribution matching has better and more consistent performance, we identify
one case in which coverage maximization is beneficial: when tasks that receive
less training are more important. Overall, our results show that selective
experience replay, when suitable selection algorithms are employed, can prevent
catastrophic forgetting.
( https://arxiv.org/abs/1802.10269 , 4709kb)
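The distribution-matching strategy described above is naturally implemented with reservoir sampling, which keeps a fixed-size buffer that approximates a uniform sample over every experience observed so far. A minimal sketch (class and method names are illustrative; the paper's exact selection algorithms may differ in detail):

```python
import random

class ReservoirBuffer:
    """Long-term memory whose contents match the global training distribution.

    Classic reservoir sampling: after n experiences have been observed, each
    one remains in the buffer with probability capacity / n, so the buffer is
    approximately a uniform sample of the whole stream.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0  # total experiences observed so far

    def add(self, experience):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
        else:
            # Keep the new experience with probability capacity / seen,
            # evicting a uniformly chosen slot.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = experience

# Feed a long stream of experiences; the reservoir stays at fixed capacity.
buf = ReservoirBuffer(capacity=100)
for t in range(10_000):
    buf.add(t)
print(len(buf.buffer))  # 100
```

The same interface could host the other selection strategies (favoring surprise or reward, maximizing coverage) by replacing the eviction rule.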

Model-free reinforcement learning (RL) methods are succeeding in a growing
number of tasks, aided by recent advances in deep learning. However, they tend
to suffer from high sample complexity, which hinders their use in real-world
domains. Alternatively, model-based reinforcement learning promises to reduce
sample complexity, but it tends to require careful tuning and has to date
succeeded mainly in restrictive domains where simple models are sufficient for
learning. In this paper, we analyze the behavior of vanilla model-based
reinforcement learning methods when deep neural networks are used to learn both
the model and the policy, and show that the learned policy tends to exploit
regions where insufficient data is available for the model to be learned,
causing instability in training. To overcome this issue, we propose to use an
ensemble of models to maintain the model uncertainty and regularize the
learning process. We further show that the use of likelihood ratio derivatives
yields much more stable learning than backpropagation through time. Altogether,
our approach, Model-Ensemble Trust-Region Policy Optimization (ME-TRPO),
significantly reduces the sample complexity compared to model-free deep RL
methods on challenging continuous control benchmark tasks.
( https://arxiv.org/abs/1802.10592 , 7192kb)
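The ensemble idea can be sketched in miniature: disagreement among independently initialized models acts as an uncertainty signal in regions where little data is available. The linear "models" below are toy stand-ins for the neural networks ME-TRPO would actually train; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# K toy dynamics models f_k(s, a) -> s'. Each "model" is a random linear map;
# in ME-TRPO these would be neural nets trained on the same data from
# different initializations.
K, state_dim, action_dim = 5, 3, 2
models = [
    (rng.normal(size=(state_dim, state_dim)),
     rng.normal(size=(state_dim, action_dim)))
    for _ in range(K)
]

def ensemble_predict(s, a):
    """Return the mean next-state prediction and the ensemble disagreement."""
    preds = np.stack([A @ s + B @ a for A, B in models])
    return preds.mean(axis=0), float(preds.std(axis=0).mean())

s = np.ones(state_dim)
a = np.zeros(action_dim)
mean_next, disagreement = ensemble_predict(s, a)
# High disagreement flags state-action regions with insufficient data,
# which the policy update can be regularized against.
```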

In this paper, we explore deep reinforcement learning algorithms for
vision-based robotic grasping. Model-free deep reinforcement learning (RL) has
been successfully applied to a range of challenging environments, but the
proliferation of algorithms makes it difficult to discern which particular
approach would be best suited for a rich, diverse task like grasping. To answer
this question, we propose a simulated benchmark for robotic grasping that
emphasizes off-policy learning and generalization to unseen objects. Off-policy
learning enables utilization of grasping data over a wide variety of objects,
and diversity is important to enable the method to generalize to new objects
that were not seen during training. On these benchmark tasks, we evaluate a
variety of Q-function estimation methods, a method previously proposed for
robotic grasping with deep neural network models, and a novel approach based on
a combination of Monte Carlo return estimation and an off-policy correction.
Our results indicate that several simple methods provide a surprisingly strong
competitor to popular algorithms such as double Q-learning, and our analysis of
stability sheds light on the relative tradeoffs between the algorithms.
( https://arxiv.org/abs/1802.10264 , 6883kb)
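As one hedged illustration of how a Monte Carlo return can be combined with an off-policy correction: the sketch below uses standard per-step importance sampling, which is not necessarily the paper's exact correction term, and the function name is invented for illustration.

```python
def corrected_mc_return(rewards, behavior_probs, target_probs, gamma=0.99):
    """Monte Carlo return with a per-step importance-sampling correction.

    Each reward is weighted by the cumulative likelihood ratio between the
    target policy and the behavior policy that generated the trajectory,
    so returns collected off-policy are re-weighted toward the target policy.
    """
    g, rho, discount = 0.0, 1.0, 1.0
    for r, b, p in zip(rewards, behavior_probs, target_probs):
        rho *= p / b            # cumulative importance ratio
        g += discount * rho * r
        discount *= gamma
    return g

# On-policy data (ratios of 1) reduces to the plain discounted MC return.
print(corrected_mc_return([1.0, 1.0], [1.0, 1.0], [1.0, 1.0], gamma=0.5))  # 1.5
```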
Title: DiGrad: Multi-Task Reinforcement Learning with Shared Actions
Authors: Parijat Dewangan, S Phaniteja, K Madhava Krishna, Abhishek Sarkar,
Balaraman Ravindran
Categories: cs.LG cs.AI cs.RO stat.ML

Most reinforcement learning algorithms are inefficient for learning multiple
tasks in complex robotic systems, where different tasks share a set of actions.
In such environments, a compound policy with shared neural network parameters
may be learnt to perform multiple tasks concurrently. However, such a compound
policy may become biased towards one task, or the gradients from different
tasks may negate each other, making learning unstable and sometimes less data
efficient. In this paper, we propose a new approach for the simultaneous
training of multiple tasks sharing a set of common actions in continuous
action spaces, which we call DiGrad (Differential Policy Gradient). The
proposed framework
is based on differential policy gradients and can accommodate multi-task
learning in a single actor-critic network. We also propose a simple heuristic
in the differential policy gradient update to further improve the learning. The
proposed architecture was tested on an 8-link planar manipulator and a 27
degrees-of-freedom (DoF) humanoid for learning multi-goal reachability tasks
with 3 and 2 end effectors, respectively. We show that our approach supports efficient
multi-task learning in complex robotic systems, outperforming related methods
in continuous action spaces.
( https://arxiv.org/abs/1802.10463 , 442kb)

We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in
the context of Reinforcement Learning (RL). SAC-X enables learning of complex
behaviors – from scratch – in the presence of multiple sparse reward signals.
To this end, the agent is equipped with a set of general auxiliary tasks that
it attempts to learn simultaneously via off-policy RL. The key idea behind our
method is that active (learned) scheduling and execution of auxiliary policies
allows the agent to efficiently explore its environment – enabling it to excel
at sparse reward RL. Our experiments in several challenging robotic
manipulation settings demonstrate the power of our approach.
( https://arxiv.org/abs/1802.10567 , 8663kb)
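A toy sketch of the active-scheduling idea (illustrative only; SAC-X's learned scheduler differs in its details, and all names here are invented): keep a running value estimate per auxiliary intention and sample intentions with a softmax over those estimates, so auxiliaries that lead to main-task reward get executed more often.

```python
import math
import random

class Scheduler:
    """Softmax scheduler over auxiliary intentions.

    q[i] is a running estimate of the main-task return achieved after
    executing intention i; intentions with higher estimates are sampled
    more often, focusing exploration where it pays off.
    """

    def __init__(self, n_tasks, temperature=1.0, lr=0.1):
        self.q = [0.0] * n_tasks
        self.temperature = temperature
        self.lr = lr

    def choose(self):
        weights = [math.exp(v / self.temperature) for v in self.q]
        r = random.random() * sum(weights)
        for i, w in enumerate(weights):
            r -= w
            if r <= 0:
                return i
        return len(weights) - 1

    def update(self, task, main_task_return):
        # Exponential moving average toward the observed return.
        self.q[task] += self.lr * (main_task_return - self.q[task])

random.seed(0)
sched = Scheduler(n_tasks=3)
for _ in range(100):
    sched.update(2, main_task_return=5.0)  # intention 2 proves useful
counts = [0, 0, 0]
for _ in range(1000):
    counts[sched.choose()] += 1
# Intention 2 is now scheduled far more often than the others.
```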