Reinforcement learning (RL) has been at the center of some of the most publicized milestones of artificial intelligence (AI) in the last few years. From systems like AlphaGo to the recent progress on multi-player games such as OpenAI Five or DeepMind’s Quake III agents, RL has shown incredible progress in mastering complex domains. Despite the impressive results, most widely adopted RL algorithms focus on learning a single task and present a lot of challenges when used in multi-task environments. Recently, researchers from Alphabet’s subsidiary DeepMind published a paper in which they proposed a new method called PopArt to improve RL in multi-task environments.

Multi-Task Reinforcement Learning

There are many ways to classify reinforcement learning (RL) algorithms depending on their architecture. A very simple taxonomy that I find particularly useful is to classify RL models based on the number of tasks and the number of agents involved in the environment. Using that scheme, we can go from simple single-task/single-agent models to complex multi-task/multi-agent architectures that resemble many human cognitive activities.

You can think about Multi-Task Reinforcement Learning (MTRL) as the Karate Kid of RL. In the movie The Karate Kid (1984), sensei Mr. Miyagi teaches the karate kid seemingly unrelated tasks such as sanding the floor and waxing a car. In hindsight, however, these turn out to equip him with invaluable skills that are relevant for learning karate. Similarly, MTRL operates on environments in which agents need to learn a group of seemingly unrelated tasks to accomplish an ultimate goal.

Parallel Multi-Task Learning

Among the different variations of MTRL models that are being explored by AI researchers these days, there is a group known as parallel multi-task learning that has shown tremendous progress in enabling a single AI system to master a group of diverse tasks. The DeepMind team has been at the forefront of parallel multi-task learning models and earlier this year released a reference architecture called Importance Weighted Actor-Learner Architecture (IMPALA). Inspired by another popular reinforcement learning architecture called A3C, IMPALA leverages a topology of different actors and learners that can collaborate to build knowledge across different domains. Traditionally, deep reinforcement learning models use an architecture based on a single learner combined with multiple actors. In that model, each actor generates trajectories and sends them via a queue to the learner. Before starting the next trajectory, the actor retrieves the latest policy parameters from the learner. IMPALA uses an architecture in which actors collect experience that is passed to a central learner, which computes the gradients, resulting in a model that has completely independent actors and learners. This simple architecture enables the learner(s) to be accelerated using GPUs and the actors to be easily distributed across many machines.
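To make the actor/learner split a bit more concrete, here is a minimal sketch of the queue-based pattern described above, using only Python's standard library. It is not DeepMind's implementation: the function names, the dictionary holding a parameter "version", and the random placeholder experience are all assumptions made for illustration, and the "gradient step" is just a counter increment.

```python
import queue
import random
import threading


def actor(actor_id, params, rollout_queue, num_rollouts, rollout_len):
    """Hypothetical actor: generates fixed-length rollouts and enqueues them."""
    rng = random.Random(actor_id)
    for _ in range(num_rollouts):
        # Retrieve the latest parameters before starting the next trajectory.
        with params["lock"]:
            version = params["version"]
        # Placeholder experience: (observation, action, reward) tuples.
        rollout = [(rng.random(), rng.randrange(4), rng.random())
                   for _ in range(rollout_len)]
        rollout_queue.put((actor_id, version, rollout))


def learner(params, rollout_queue, total_rollouts):
    """Hypothetical learner: consumes rollouts and updates the parameters."""
    processed = 0
    for _ in range(total_rollouts):
        _actor_id, _version, _rollout = rollout_queue.get()
        with params["lock"]:
            params["version"] += 1  # stand-in for a real gradient step
        processed += 1
    return processed


def run(num_actors=4, num_rollouts=5, rollout_len=100):
    """Start the actors as threads and run the learner in the main thread."""
    params = {"version": 0, "lock": threading.Lock()}
    rollout_queue = queue.Queue()
    threads = [
        threading.Thread(target=actor,
                         args=(i, params, rollout_queue, num_rollouts, rollout_len))
        for i in range(num_actors)
    ]
    for thread in threads:
        thread.start()
    processed = learner(params, rollout_queue, num_actors * num_rollouts)
    for thread in threads:
        thread.join()
    return processed, params["version"]
```

Calling run() starts four actor threads that feed a single learner. Because the actors only talk to the learner through the queue and the parameter snapshot, the same shape could in principle be spread across processes or machines.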

The IMPALA architecture has been an important step towards enabling the implementation of multi-task reinforcement learning (MTRL) systems. However, even architectures like IMPALA are vulnerable to what I like to call The Distraction Dilemma. A general issue in multi-task learning is that a balance must be found between the needs of multiple tasks competing for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to solve.

In general, The Distraction Dilemma represents the need for an MTRL system to balance the reward of mastering individual tasks against the ultimate goal of achieving generalization. At different points during the lifecycle of an MTRL system, agents are going to be confronted with tasks that appear more salient to the learning process, for instance because of the density or magnitude of the in-task rewards. This causes the algorithm to focus on those salient tasks at the expense of generality.
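A toy calculation helps show why reward magnitude alone can make one task more salient. The sketch below assumes a single shared prediction trained with a squared-error loss on two hypothetical tasks whose targets differ by three orders of magnitude. The task names and numbers are invented for illustration and are not taken from the paper.

```python
def squared_error_gradients(prediction, targets):
    """Gradient of 0.5 * (prediction - target)**2 w.r.t. the prediction, per task."""
    return {task: prediction - target for task, target in targets.items()}


# Hypothetical targets: one task with small returns, one with large returns.
targets = {"small_reward_task": 1.0, "large_reward_task": 1000.0}

grads = squared_error_gradients(0.0, targets)
total = sum(abs(g) for g in grads.values())

# Fraction of the combined gradient magnitude contributed by each task.
shares = {task: abs(g) / total for task, g in grads.items()}
```

With these numbers, the large-reward task accounts for 1000/1001 of the combined gradient magnitude, so an unnormalized update is driven almost entirely by that task.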

Entering DeepMind’s PopArt

To address The Distraction Dilemma, DeepMind proposes a method called PopArt, built on top of the original IMPALA architecture. PopArt extends IMPALA by adapting the contribution of each task to the agent’s updates so that all tasks have a proportional impact on the learning dynamics. The magic of PopArt relies on adjusting the weights of the neural network based on the target output of all tasks. PopArt works by estimating the mean and the spread of the ultimate targets, such as the score of a game, across all tasks. It then uses these statistics to normalize the targets before they are used to update the network’s weights. Using normalized targets makes learning more stable and robust to changes in scale and shift. To obtain accurate estimates (of expected future scores, for example), the outputs of the network can then be rescaled back to the true target range by inverting the normalization process.
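The normalize-then-invert idea can be sketched for a single linear output in plain Python. This is only an illustration under my own assumptions, not the paper's code: the class name PopArtHead, the step size beta, the min_sigma floor and the initial statistics are all hypothetical choices. The sketch tracks a running mean and second moment of the targets, and rescales the weights and bias whenever the statistics change so that the unnormalized output stays the same.

```python
import math


class PopArtHead:
    """Hypothetical linear output head with adaptive target normalization."""

    def __init__(self, weights, bias=0.0, beta=0.1, min_sigma=1e-4):
        self.w = list(weights)
        self.b = bias
        self.beta = beta            # step size for the running statistics (assumed)
        self.min_sigma = min_sigma  # floor to avoid division by zero (assumed)
        self.mu = 0.0               # running mean of the targets
        self.nu = 1.0               # running second moment of the targets
        self.sigma = 1.0            # running spread: sqrt(nu - mu**2)

    def normalized_output(self, features):
        return sum(w * x for w, x in zip(self.w, features)) + self.b

    def output(self, features):
        # Invert the normalization to recover an estimate in the true range.
        return self.sigma * self.normalized_output(features) + self.mu

    def update_stats(self, target):
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
        variance = max(self.nu - self.mu ** 2, 0.0)
        self.sigma = max(math.sqrt(variance), self.min_sigma)
        # Rescale weights and bias so output() is unchanged by the new statistics.
        scale = old_sigma / self.sigma
        self.w = [w * scale for w in self.w]
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma

    def normalize(self, target):
        # Normalized target that a learning update would regress towards.
        return (target - self.mu) / self.sigma


head = PopArtHead([0.5, -0.25], bias=0.1)
features = [1.0, 2.0]
before = head.output(features)
head.update_stats(1000.0)  # a target far outside the current range
after = head.output(features)
normalized_target = head.normalize(1000.0)
```

In this example the unnormalized prediction is the same before and after the statistics update (up to floating-point error), while the target of 1000 becomes a normalized target of roughly 3, a far more comfortable scale for a gradient update.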

The PopArt model is based on the IMPALA architecture, which combines different convolutional layers with other techniques such as word embeddings and long short-term memory (LSTM) networks.

In IMPALA, the agent is distributed across multiple threads, processes or machines. Several actors run on CPU generating rollouts of experience, each consisting of a fixed number of interactions (100 in the paper’s experiments) with their own copy of the environment, and then enqueue the rollouts in a shared queue. Actors receive the latest copy of the network’s parameters from the learner before each rollout. The innovation of PopArt is to scale the network’s updates based on the targets of each individual task.

PopArt in Action

The DeepMind team tested PopArt across different gaming scenarios. One scenario that was particularly illustrative of the benefits of PopArt was the Ms. Pac-Man game. Traditional reinforcement learning algorithms use reward clipping as a mechanism to handle different reward scales. Although clipping makes learning easier, it also changes the goal of the agent. For instance, in Ms. Pac-Man the goal is to collect pellets, each of which is worth 10 points, and eat ghosts worth between 200 and 1600 points. With clipped rewards, there is no apparent difference for the agent between eating a pellet and eating a ghost, which results in agents that only eat pellets and never bother to chase ghosts. PopArt’s adaptive normalization seems to be a more effective way to stabilize learning. The DeepMind team used PopArt in Ms. Pac-Man RL agents and the results were quite impressive, with the agent chasing ghosts and achieving a higher score, as shown in the following video.
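The effect of clipping on the Ms. Pac-Man rewards is easy to check by hand. The sketch below assumes the commonly used clipping range of [-1, 1], which the text above does not state, and uses the point values quoted above. The dictionary keys are just labels I made up.

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Clip a raw reward into [low, high]; the range is an assumed default."""
    return max(low, min(high, reward))


# Point values quoted in the article for Ms. Pac-Man.
raw_rewards = {"pellet": 10, "cheapest_ghost": 200, "priciest_ghost": 1600}

clipped_rewards = {name: clip_reward(value) for name, value in raw_rewards.items()}
```

After clipping, every entry equals 1.0, so a ghost worth 160 times as much as a pellet looks exactly the same to the agent.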

PopArt consistently showed improvements over other multi-task reinforcement learning architectures when tested on a set of Atari games. As shown in the chart below, PopArt greatly improved the performance of the agent compared to the baseline agent without PopArt. With both clipped and unclipped rewards, the median score of the PopArt agent across games was above the human median.

Multi-Task Reinforcement Learning (MTRL) is one of the most exciting areas in the deep learning space. Just like humans, MTRL agents can get distracted by focusing on the wrong tasks. Techniques such as PopArt that minimize distraction and stabilize learning are essential for the mainstream adoption of MTRL techniques.