OpenAI Baselines: DQN

We’re open-sourcing OpenAI Baselines, our internal effort to reproduce reinforcement learning algorithms with performance on par with published results. We’ll release the algorithms over upcoming months; today’s release includes DQN and three of its variants.

Reinforcement learning results are tricky to reproduce: performance is very noisy, algorithms have many moving parts which allow for subtle bugs, and many papers don’t report all the required tricks. By releasing known-good implementations (and best practices for creating them), we’d like to ensure that apparent RL advances are never due to comparison with buggy or untuned versions of existing algorithms.

RL algorithms are challenging to implement correctly; good results typically come only after fixing many seemingly trivial bugs. This post contains some best practices we use for correct RL algorithm implementations, as well as the details of our first release: DQN and three of its variants, algorithms developed by DeepMind.

Best Practices

Compare to a random baseline: in the video below, an agent is taking random actions in the game H.E.R.O. If you saw this behavior in early stages of training, it’d be really easy to trick yourself into believing that the agent is learning. So you should always verify your agent outperforms a random one.
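A minimal sketch of such a check, using only the standard library. The ToyEnv class, the average_return helper, and the "trained" policy are hypothetical stand-ins invented for illustration; in practice you would plug in your real environment and agent.

```python
import random
from statistics import mean


class ToyEnv:
    """Hypothetical stand-in environment: reward 1 when the action equals the observation."""

    def __init__(self, seed, n_actions=4, horizon=20):
        self.rng = random.Random(seed)
        self.n_actions = n_actions
        self.horizon = horizon

    def reset(self):
        self.t = 0
        self.obs = self.rng.randrange(self.n_actions)
        return self.obs

    def step(self, action):
        reward = 1.0 if action == self.obs else 0.0
        self.t += 1
        self.obs = self.rng.randrange(self.n_actions)
        done = self.t >= self.horizon
        return self.obs, reward, done


def average_return(policy, episodes=200, seed=0):
    """Mean undiscounted episode return of `policy` over freshly seeded environments."""
    returns = []
    for ep in range(episodes):
        env = ToyEnv(seed + ep)
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        returns.append(total)
    return mean(returns)


policy_rng = random.Random(123)


def random_policy(obs):
    # Ignores the observation entirely: this is the baseline to beat.
    return policy_rng.randrange(4)


def trained_policy(obs):
    # Placeholder for a learned agent; here it simply plays the optimal action.
    return obs


random_score = average_return(random_policy)
agent_score = average_return(trained_policy)
print(f"random: {random_score:.2f}, agent: {agent_score:.2f}")
```

Averaging over many episodes matters here: a single lucky random episode can look like learning, while the mean over a few hundred episodes is much harder to misread.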

See the world as your agent does: like most deep learning approaches, for DQN we tend to convert images of our environments to grayscale to reduce the computation required during training. This can create its own bugs: when we ran our DQN algorithm on Seaquest we noticed that our implementation was performing poorly. When we inspected the environment we discovered this was because our post-processed images contained no fish, as this picture shows.

When transforming the screen images into grayscale we had incorrectly calibrated our coefficients for the green color values, which led to the fish disappearing. After we noticed the bug we tweaked the color values and our algorithm was able to see the fish again.
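To illustrate how a bad channel weight can erase an object, here is a small sketch. The pixel colors and the "bad" weights are made up for illustration and are not the values from our Seaquest bug; the "good" weights are the standard ITU-R BT.601 luma coefficients, which may differ from what any particular DQN implementation uses.

```python
def to_gray(rgb, weights):
    """Weighted sum of an (R, G, B) pixel."""
    r, g, b = rgb
    wr, wg, wb = weights
    return wr * r + wg * g + wb * b


# Hypothetical colors, chosen only for illustration (not taken from Seaquest).
background = (0, 30, 140)  # dark blue water
fish = (0, 180, 140)       # differs from the water only in the green channel

good = (0.299, 0.587, 0.114)  # ITU-R BT.601 luma weights
bad = (0.299, 0.0, 0.114)     # assumed miscalibration: green contributes nothing

contrast_good = abs(to_gray(fish, good) - to_gray(background, good))
contrast_bad = abs(to_gray(fish, bad) - to_gray(background, bad))
print(f"contrast with good weights: {contrast_good:.2f}")
print(f"contrast with bad weights:  {contrast_bad:.2f}")
```

With the bad weights the fish and the water map to the same gray level, so the agent has no way to tell them apart, no matter how good the learning algorithm is.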

To debug issues like this in the future, Gym now contains a play function, which lets a researcher easily see the same observations as the AI agent would.

Fix bugs, then hyperparameters: After debugging, we started to calibrate our hyperparameters. We ultimately found that the annealing schedule for epsilon, a hyperparameter which controls the exploration rate, had a huge impact on performance. Our final implementation decreases epsilon to 0.1 over the first million steps and then down to 0.01 over the next 24 million steps. Had our implementation still contained bugs, we would likely have come up with different hyperparameter settings to try to compensate for faults we hadn’t yet diagnosed.
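The schedule described above can be sketched as a piecewise-linear function of the training step. The function name epsilon_at and the starting value of 1.0 are assumptions for illustration; only the 0.1 and 0.01 breakpoints come from the text.

```python
def epsilon_at(step):
    """Piecewise-linear exploration rate for a given training step."""
    # (step, epsilon) breakpoints; the initial value of 1.0 is an assumption.
    schedule = [(0, 1.0), (1_000_000, 0.1), (25_000_000, 0.01)]
    step = max(step, 0)
    for (t0, e0), (t1, e1) in zip(schedule, schedule[1:]):
        if t0 <= step < t1:
            frac = (step - t0) / (t1 - t0)
            return e0 + frac * (e1 - e0)
    # Past the last breakpoint, epsilon stays at its final value.
    return schedule[-1][1]


for step in (0, 500_000, 1_000_000, 13_000_000, 25_000_000, 40_000_000):
    print(step, round(epsilon_at(step), 4))
```

Keeping the breakpoints in one list makes the schedule easy to change during tuning without touching the interpolation logic.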

Double check your interpretations of papers: In the DQN Nature paper the authors write: “We also found it helpful to clip the error term from the update […] to be between -1 and 1.” There are two ways to interpret this statement: clip the objective, or clip the multiplicative term when computing the gradient. The former seems more natural, but it causes the gradient to be zero on transitions with high error, which leads to suboptimal performance, as found in one DQN implementation. The latter is correct and has a simple mathematical interpretation: Huber loss. You can spot bugs like these by checking that the gradients appear as you expect, which is easy to do in TensorFlow using compute_gradients.
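The difference between the two readings can be checked numerically. The sketch below uses plain Python and finite differences rather than TensorFlow; the function names are ours, and delta stands for the TD error.

```python
def huber_loss(delta, k=1.0):
    """Quadratic for |delta| <= k, linear beyond; its gradient is clip(delta, -k, k)."""
    a = abs(delta)
    return 0.5 * delta * delta if a <= k else k * (a - 0.5 * k)


def huber_grad(delta, k=1.0):
    """Analytic gradient of huber_loss: the squared-error gradient, clipped to [-k, k]."""
    return max(-k, min(k, delta))


def clipped_objective(delta, k=1.0):
    """The other reading: clip the error first, then square it."""
    c = max(-k, min(k, delta))
    return 0.5 * c * c


def numeric_grad(f, x, h=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)


for delta in (0.3, 2.0, -5.0):
    print(
        delta,
        round(numeric_grad(huber_loss, delta), 4),
        round(numeric_grad(clipped_objective, delta), 4),
    )
```

For small errors both readings give roughly the same gradient. Once the error exceeds 1 in magnitude, the Huber gradient saturates at plus or minus 1, while the clipped objective is flat and its gradient vanishes, so exactly the transitions with the largest errors stop contributing to learning.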

The majority of bugs in this post were spotted by going over the code multiple times and thinking through what could go wrong with each line. Each bug seems obvious in hindsight, but even experienced researchers tend to underestimate how many passes over the code it can take to find all the bugs in an implementation.

Deep Q-Learning

We use Python 3 and TensorFlow. This release includes:

DQN: A reinforcement learning algorithm that combines Q-Learning with deep neural networks to let RL work for complex, high-dimensional environments, such as video games or robotics.

Dueling DQN: Splits the neural network into two streams. One learns to estimate the value of the state at every timestep, the other estimates the potential advantage of each action, and the two are combined into a single action-advantage Q function.
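The combining step in Dueling DQN can be sketched as follows. Subtracting the mean advantage is the aggregation proposed in the dueling network paper, not something specified in this post, and the function name and the example numbers are ours. In a real network the two inputs would be the outputs of the value and advantage streams.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value with per-action advantages into Q-values.

    Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


# Made-up example outputs of the two streams for one state with three actions.
q_values = dueling_q_values(2.0, [1.0, 0.0, -1.0])
print(q_values)
```

Centering the advantages keeps the two streams identifiable: without it, a constant could be moved freely between the value and the advantages without changing the Q-values.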

AI is an empirical science, where the ability to do more experiments directly correlates with progress. With Baselines, researchers can spend less time implementing pre-existing algorithms and more time designing new ones. If you’d like to help us refine, extend, and develop AI algorithms then join us at OpenAI.
