CTRL: Current Topics in RL

Reinforcement Learning is a rapidly growing field, with interesting papers published every day. Unfortunately, the sheer volume of new work can be overwhelming for those who cannot follow it on a daily basis. CTRL hopes to assist these readers in two ways:

Help people follow emerging trends without devoting too much time

Help people decide what papers to read

CTRL summarizes the ideas and results of a paper and puts them into context by connecting the paper to other relevant work. In a sense, each summary can be seen as an extended abstract, but with more visualizations and context. By default, all papers are summarized in bullet style, but I plan to record short videos for particularly exciting papers.
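
AI Safety Gridworlds

Leike et al. • November 2017

In the specification problems, the reward function $R$ given to the agent differs from the performance function $R^*$ used to evaluate it ($R \neq R^*$).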

Avoid side effects: How can we get agents to minimize effects unrelated to their main objectives, especially those that are irreversible or difficult to reverse?

Absent Supervisor: How can we make sure an agent does not behave differently depending on the presence or absence of an observable supervisor?

Reward gaming: How can we build agents that do not try to introduce or exploit errors in the reward function?

A2C and Rainbow achieve low performance scores in the specification environments, since they were not designed to handle such problems.

Although the specification problems might seem unfair, they are typical of how misspecification manifests itself: the agent is expected to follow the objective “in spirit” rather than “by the letter.”

A general approach to alleviating specification problems is reward learning, which includes techniques such as Inverse Reinforcement Learning (IRL), Imitation Learning, and learning from human feedback.

There are four robustness problems, where the reward function and the performance function match ($R = R^*$), but the agent faces conditions that can degrade its performance.

Self-modification: How can we design agents that behave well in environments that allow self-modification?

Distributional shift: How do we ensure robust behavior when the test environment differs from the training environment?

Robustness to adversaries: How does the agent detect and adapt to friendly and adversarial intentions present in the environment?

Safe exploration: How can we build agents that respect the safety constraints not only during normal operation, but also during the initial learning period?

A2C and Rainbow do better on the robustness problems, since robustness can be seen as a subgoal of the agent's main objective.

Asynchronous Methods for Deep Reinforcement Learning

Mnih et al. • February 2016

Experience Replay has been used to reduce non-stationarity and decorrelate updates, but it requires off-policy reinforcement learning methods.

Instead of Experience Replay, we asynchronously execute multiple agents in parallel on multiple instances of the environment. These agents use different exploration strategies (for example, different $\epsilon$ values for $\epsilon$-greedy methods) in different threads, making online updates less likely to be correlated.
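
As a concrete illustration, here is a minimal, runnable sketch of this idea using the paper's asynchronous one-step Q-learning variant (tabular, so no neural network): each thread owns its own environment instance and its own $\epsilon$, and all threads update a shared Q-table without locks. The toy chain environment and all hyperparameters are illustrative assumptions, not from the paper.

```python
import threading
import random
from collections import defaultdict

class ChainEnv:
    """Toy chain: states 0..N-1, actions 0 (left) and 1 (right);
    reward 1 for reaching the rightmost state. Purely illustrative."""
    N = 10
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, self.s - 1) if a == 0 else min(self.N - 1, self.s + 1)
        done = self.s == self.N - 1
        return self.s, float(done), done

Q = defaultdict(float)             # shared tabular Q-function
ALPHA, GAMMA = 0.1, 0.99
EPSILONS = [0.5, 0.3, 0.1, 0.01]   # a different exploration rate per thread

def actor_learner(epsilon, n_steps=20000):
    env = ChainEnv()               # each actor-learner owns an env instance
    s = env.reset()
    for _ in range(n_steps):
        if random.random() < epsilon:
            a = random.randrange(2)                      # explore
        else:
            a = max((0, 1), key=lambda x: Q[(s, x)])     # exploit
        s2, r, done = env.step(a)
        # Online one-step Q-learning update applied to the shared
        # table without locks.
        target = r if done else r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)])
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = env.reset() if done else s2

threads = [threading.Thread(target=actor_learner, args=(eps,)) for eps in EPSILONS]
for t in threads: t.start()
for t in threads: t.join()
print(max(Q[(0, 0)], Q[(0, 1)]))   # learned value of the start state
```

In the paper itself, the shared parameters belong to a neural network and gradient updates are applied lock-free across threads, in the style of Hogwild!.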

The asynchronous framework also speeds up training roughly linearly in the number of parallel actor-learners and allows for on-policy reinforcement learning methods.

To avoid communication costs, the asynchronous actor-learners run on different CPU threads of a single multithreaded machine.

A3C shows state-of-the-art results on Atari 2600 games in half the training time of DQN, using only a CPU.

Other general improvements such as eligibility traces, the Generalized Advantage Estimator (GAE), Double Q-Learning, or Dueling networks can be incorporated into A3C for immediate improvements.

Playing Atari with Deep Reinforcement Learning

Mnih et al. • December 2013

Reinforcement Learning has struggled with high-dimensional sensory inputs such as vision and sound.

Deep Learning can extract features from high-dimensional inputs, but it expects large datasets of i.i.d. data.

To alleviate this problem, the Experience Replay mechanism is used. Instead of training the agent on immediate transitions $(s, a, r, s')$, transitions are saved into a replay memory. Then, after every action, a minibatch is sampled uniformly at random from the memory. This achieves greater data efficiency and decorrelates the training data.
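
A minimal sketch of such a replay memory, assuming a fixed capacity and uniform sampling as in the paper (the class name, method names, and default capacity are my own):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of transitions (s, a, r, s', done)."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)   # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # A uniform random minibatch breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```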

Deep Q-Network (DQN) is a convolutional neural network (CNN) that outputs the action values for all actions given a state as input, trained with Q Learning and Experience Replay.
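
A sketch of the corresponding training step in PyTorch-style code; `q_net`, the pre-stacked tensor shapes, and the plain MSE loss are assumptions for illustration (in the paper, the network maps a stack of preprocessed frames to one value per action, and the same network computes the bootstrap target):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    """One Q-learning step on a sampled minibatch (a sketch; q_net is
    assumed to map a batch of states to a batch of per-action values)."""
    states, actions, rewards, next_states, dones = batch
    # Q(s, a) for the actions actually taken in the minibatch.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # One-step bootstrap target: r + gamma * max_a' Q(s', a'),
        # with the bootstrap cut off at terminal transitions.
        next_q = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_values, targets)
```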

DQN showed state-of-the-art results in 6 out of 7 selected games from Atari 2600.

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

S. Ross, G. Gordon, and J. Bagnell • November 2010

Traditional imitation learning approaches trained a classifier to predict the expert's behavior, given training data of the observations encountered by the expert and the actions the expert chose. Such an approach violates the i.i.d. assumption and leads to poor performance ($O(T^2\epsilon)$ expected mistakes over $T$ steps when the per-step probability of a mistake is $\epsilon$).
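
The intuition behind the quadratic bound, following the paper's analysis: the learner is only guaranteed to err with probability $\epsilon$ per step on states drawn from the expert's distribution. A single mistake can lead the agent to states the expert never visited, where it may keep erring for the remaining (up to $T$) steps, giving

$$J(\hat{\pi}) \leq J(\pi^*) + T^2 \epsilon$$

for the expected total cost $J$ over a $T$-step episode.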

Approaches such as Forward Training or SMILe have better performance guarantees ($uT\epsilon$ or near-linear), but they are either impractical for long episodes or generate an unstable policy by design.

The Dataset Aggregation (DAgger) algorithm, at each iteration, collects a dataset of trajectories under the current policy $\hat{\pi}_i$, labels the visited states with the expert's actions, and trains the next policy $\hat{\pi}_{i+1}$ on the aggregate of all datasets collected so far. (Intuitively, the states from the collected trajectories are states that the learned policy is likely to encounter.)
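
A compact sketch of the loop, omitting the paper's $\beta_i$-mixture with the expert for simplicity; `rollout`, `expert_action`, and `train_classifier` are hypothetical helpers standing in for the environment interface and any supervised learner:

```python
def dagger(env, expert_action, train_classifier, rollout, n_iters=10):
    """DAgger sketch: visit states under the current policy, label them with
    the expert's actions, aggregate, and retrain. Assumed helpers:
    rollout(env, policy) -> list of visited states,
    expert_action(state) -> the expert's action for that state,
    train_classifier(dataset) -> a policy mapping state -> action."""
    dataset = []                       # aggregate of (state, expert action) pairs
    policy = expert_action             # first iteration: execute the expert itself
    for _ in range(n_iters):
        states = rollout(env, policy)  # states the current policy encounters
        dataset += [(s, expert_action(s)) for s in states]
        policy = train_classifier(dataset)   # train on the aggregated dataset
    return policy
```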

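Policy Gradient Methods for Reinforcement Learning with Function Approximation

Sutton et al. • 1999
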
By the Policy Gradient Theorem, the gradient of the performance measure, $\nabla_\theta \rho$, does not depend on the gradient of the on-policy state distribution, $\nabla_\theta d^{\pi}(s)$, which allows $\nabla_\theta \rho$ to be estimated from samples.
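
In the paper's notation, the theorem states

$$\nabla_\theta \rho = \sum_s d^{\pi}(s) \sum_a \nabla_\theta \pi(s, a)\, Q^{\pi}(s, a).$$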

The $Q^\pi(s, a)$ in the policy gradient theorem can also be replaced by a function approximator $f_w$, and the resulting gradient estimate is guaranteed to remain unbiased if $f_w$ is compatible:
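
$$\nabla_w f_w(s, a) = \nabla_\theta \pi(s, a)\, \frac{1}{\pi(s, a)} = \nabla_\theta \log \pi(s, a),$$

with $w$ fit to a local minimum of the mean squared error between $f_w(s, a)$ and $Q^\pi(s, a)$ under the on-policy distribution.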

A function-approximated policy $\pi$ with a compatible function approximator $f_w$ converges to a local optimum via policy iteration with an appropriately decreasing step size, provided that $\frac{\partial^2 \pi(s, a)}{\partial \theta_i \partial \theta_j}$ and the reward are bounded.