The authors propose a unified setting for evaluating the performance of batch reinforcement learning algorithms. The proposed benchmark is discrete and based on the popular Atari domain. The authors review and benchmark several current batch RL algorithms against a newly introduced version of BCQ (Batch-Constrained Deep Q-Learning) adapted to discrete action spaces.
https://i.imgur.com/zrCZ173.png
Note in line 5 of the algorithm above that the policy chooses actions with a restricted argmax operation, eliminating actions that lack sufficient support in the batch.
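The restricted argmax can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: `g_probs` stands for the action probabilities of a learned behavior-cloning model, and the threshold name `tau` is illustrative; the rule keeps only actions whose probability is at least a fraction `tau` of the most likely action's, then takes the argmax of Q over that set.

```python
import numpy as np

def restricted_argmax(q_values, g_probs, tau=0.3):
    """Pick the highest-Q action among actions with enough batch support,
    i.e. whose behavior-cloning probability is at least tau times that of
    the most likely action (tau is an illustrative hyperparameter)."""
    allowed = g_probs / g_probs.max() >= tau     # mask poorly supported actions
    masked_q = np.where(allowed, q_values, -np.inf)
    return int(np.argmax(masked_q))

# Example: action 1 has the highest Q-value but almost no support in the
# batch, so the policy falls back to the best well-supported action.
q = np.array([1.0, 5.0, 2.0])
g = np.array([0.6, 0.01, 0.39])
restricted_argmax(q, g)  # -> 2
```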
One of the key difficulties in batch RL is the divergence of value estimates. The authors therefore use Double DQN: actions are selected with the value network $Q_{\theta}$, and policy evaluation is done with a target network $Q_{\theta'}$ (line 6).
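The Double DQN target can be written out as a short sketch (illustrative, not the authors' code; `next_q_online` and `next_q_target` stand for the next-state action values of $Q_{\theta}$ and $Q_{\theta'}$, and in BCQ the argmax would additionally be restricted to well-supported actions):

```python
import numpy as np

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: select the next action with the online net Q_theta,
    evaluate it with the target net Q_theta'. Decoupling selection from
    evaluation mitigates the overestimation that drives value divergence."""
    if done:
        return reward
    a_star = int(np.argmax(next_q_online))         # selection: online net
    return reward + gamma * next_q_target[a_star]  # evaluation: target net

# The online net prefers action 0, so the target uses the target net's
# value for action 0, even though the target net prefers action 1.
double_dqn_target(1.0, np.array([3.0, 2.0]), np.array([1.0, 4.0]))  # -> 1.99
```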
**How is the batch created?**
A partially trained DQN agent (trained online for 10 million steps, i.e., 40 million frames) is used as behavioral policy to collect a batch $B$ containing 10 million transitions. The DQN agent acts $\epsilon$-greedily, using $\epsilon = 0.2$ with probability 0.8 and $\epsilon = 0.001$ with probability 0.2. The batch RL agents are trained on this batch for 10 million steps and evaluated every 50k time steps for 10 episodes. This process of batch creation differs from the settings used in other papers in i) having only a single behavioral policy, ii) the batch size, and iii) the proficiency level of the behavioral policy.
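The mixed-$\epsilon$ data-collection scheme above can be sketched as follows (a hedged sketch: `greedy_action` stands for the partially trained DQN's greedy choice, and whether $\epsilon$ is resampled per step or per episode is an implementation detail not specified in this summary; here it is resampled per call):

```python
import random

def behavioral_action(greedy_action, n_actions, rng=random):
    """Action of the partially trained DQN used to fill the batch:
    with probability 0.8 act with a noisy policy (eps = 0.2),
    with probability 0.2 act near-greedily (eps = 0.001)."""
    eps = 0.2 if rng.random() < 0.8 else 0.001
    if rng.random() < eps:
        return rng.randrange(n_actions)   # explore: uniform random action
    return greedy_action                  # exploit: DQN's greedy action

# On average the action is random roughly 0.8 * 0.2 + 0.2 * 0.001 ~ 16% of the time.
```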
The experiments, performed on the Arcade Learning Environment, include DQN, REM, QR-DQN, KL-Control, BCQ, online DQN, and Behavioral Cloning, and show that:
- among conventional RL algorithms, distributional variants (QR-DQN) outperform their plain counterparts (DQN)
- batch RL algorithms perform better than conventional algorithms, with BCQ outperforming every other algorithm in every tested game
In addition to the returns, the authors plot the value estimates of the Q-networks. A drop in performance corresponds in all cases to a divergence (upwards or downwards) of the value estimates.
The paper is an important contribution to the debate about the right setting for evaluating batch RL algorithms. It remains to be seen, however, whether the proposed choices of i) a single behavior policy, ii) the batch size, and iii) the quality level of the behavior policy will be accepted as standard. Further work is in any case required to establish a benchmark for continuous domains.

First published: 2019/10/03
Abstract: Widely-used deep reinforcement learning algorithms have been shown to fail in
the batch setting--learning from a fixed data set without interaction with the
environment. Following this result, there have been several papers showing
reasonable performances under a variety of environments and batch settings. In
this paper, we benchmark the performance of recent off-policy and batch
reinforcement learning algorithms under unified settings on the Atari domain,
with data generated by a single partially-trained behavioral policy. We find
that under these conditions, many of these algorithms underperform DQN trained
online with the same amount of data, as well as the partially-trained
behavioral policy. To introduce a strong baseline, we adapt the
Batch-Constrained Q-learning algorithm to a discrete-action setting, and show
it outperforms all existing algorithms at this task.
