The theoretical limitations of DQN

Aug 29, 2017
• Aidan Rocke

Introduction:

Less than three years after DeepMind published ‘Playing Atari with Deep Reinforcement Learning’,
the practical impact of this method on RL literature has been profound, as evidenced by the above graphic. However, the
theoretical limitations of the original method haven’t been thoroughly investigated. As I will show, such an analysis
actually clarifies the evolution of DQN and highlights which research directions are worth prioritising.

Background on DQN:

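The main idea behind Deep Q-learning, hereafter referred to as DQN, is that given actions $a$ and states $s$ in a Markov
Decision Process (MDP), it’s sufficient to optimise action selection with respect to the expected return:

$$Q_\pi(s, a) = \mathbb{E}\left[ R_1 + \gamma R_2 + \gamma^2 R_3 + \dots \mid S_0 = s, A_0 = a, \pi \right] \tag{1}$$

In particular the aim is to approximate a parametrised value function $Q(s, a; \theta_t)$ where estimation is shifted towards the target:

$$Y_t^Q = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t) \tag{2}$$

and gradient descent updates are done as follows:

$$\theta_{t+1} = \theta_t + \alpha \left( Y_t^Q - Q(S_t, A_t; \theta_t) \right) \nabla_{\theta_t} Q(S_t, A_t; \theta_t) \tag{3}$$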

In addition, epsilon-greedy action selection is used for exploration. To avoid estimates that merely reflect
recent experience, the authors of DQN regularly have the network perform experience replay: batch updates
based on less recent experience.
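The ingredients above (a value estimate pulled towards a bootstrapped target, epsilon-greedy exploration, and replay of older transitions) can be sketched in a few lines. This is a minimal sketch under loud assumptions, not the DQN implementation: a linear approximator stands in for the deep network, the dynamics and rewards are made up, and every name (`q_values`, `select_action`, `td_update`, `replay`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2               # hypothetical problem sizes
theta = np.zeros((n_actions, n_features))  # one weight row per action

def q_values(state, theta):
    """Linear stand-in for the Q-network: Q(s, a; theta) = theta[a] . s"""
    return theta @ state

def select_action(state, theta, epsilon):
    """Epsilon-greedy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values(state, theta)))

def td_update(theta, batch, gamma, alpha):
    """One averaged semi-gradient step on a batch of (s, a, r, s_next, done)."""
    new_theta = theta.copy()
    for s, a, r, s_next, done in batch:
        # Selection (the max) and evaluation both use the same parameters theta.
        target = r if done else r + gamma * np.max(q_values(s_next, theta))
        td_error = target - q_values(s, theta)[a]
        # For a linear model, the gradient of theta[a] . s w.r.t. theta[a] is s.
        new_theta[a] += alpha * td_error * s / len(batch)
    return new_theta

# Experience replay: store every transition, then update on a random batch
# drawn from the whole history rather than only the latest step.
replay = []
state = rng.normal(size=n_features)
for _ in range(200):
    action = select_action(state, theta, epsilon=0.1)
    next_state = rng.normal(size=n_features)   # made-up dynamics
    reward = float(action == 1)                # made-up reward
    replay.append((state, action, reward, next_state, False))
    state = next_state
    idx = rng.integers(len(replay), size=min(32, len(replay)))
    theta = td_update(theta, [replay[i] for i in idx], gamma=0.99, alpha=0.01)
```

Note that the target inside `td_update` takes a maximum over estimates produced by the very parameters being updated, which is the coupling discussed in the rest of this post.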

Given the above description of DQN, we may note the following:

Selection and evaluation in DQN are done with respect to the same parameters $\theta$.

Assuming that variance is unavoidable, the $\max$ operator in (2) leads to over-optimistic estimates.

The expression in (1) provides an asymptotic guarantee which implicitly requires an ergodic MDP.

These issues shall be addressed in the sections that follow.

Asymptotic nonsense or the data-inefficiency of DQN:

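In the simple case of i.i.d. data $X_1, \dots, X_n$, if $\mathbb{E}[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$, a simple application of Chebyshev’s inequality to the sample mean $\bar{X}_n$ gives:

$$P\left( |\bar{X}_n - \mu| \geq \epsilon \right) \leq \frac{\sigma^2}{n \epsilon^2}$$

Essentially, this inequality shows that even in simple scenarios convergence of the sample mean to the expectation requires a lot of data
and the rate of convergence depends on the variance $\sigma^2$. Furthermore, we must note that this inequality ignores
the following facts:

For fixed $(s, a)$, the distribution of returns is rarely unimodal in practice.

The return rarely has negligible variance.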

Our data is sequential and hardly ever i.i.d.

From these points it follows that significant estimation errors are unavoidable but, as I will show, this isn’t the main
problem.
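To get a feel for the data requirements, recall that for the mean of n i.i.d. samples with variance sigma^2, Chebyshev’s inequality bounds the probability of an error of at least epsilon by sigma^2 / (n * epsilon^2). Requiring that bound to be at most delta gives a sample size. The helper below is a hypothetical illustration of that arithmetic, and the bound is conservative, so it should be read as a worst-case figure rather than a prediction.

```python
import math

def chebyshev_sample_size(variance, epsilon, delta):
    """Smallest n such that variance / (n * epsilon**2) <= delta."""
    return math.ceil(variance / (delta * epsilon ** 2))

# Variance 4, tolerance 0.5, failure probability at most 0.25.
n_small = chebyshev_sample_size(variance=4.0, epsilon=0.5, delta=0.25)

# Halving the tolerance quadruples the number of samples required.
n_tight = chebyshev_sample_size(variance=4.0, epsilon=0.25, delta=0.25)
```

The required sample size grows linearly in the variance and quadratically in 1/epsilon, and this is before accounting for non-i.i.d., sequential data.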

The unreasonable optimism of DQN:

Over-optimism with respect to estimation errors:

The authors of [3] highlight that in (2), evaluation of the target and action selection are done with respect to
the same parameters $\theta$, which makes over-optimistic value estimates more likely because of the $\max$ operator.
This suggests that estimation errors of any kind are more likely to result in overly optimistic policies.
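The effect is easy to reproduce numerically. In the sketch below, which is my own toy setup rather than an experiment from [3], every action has a true value of zero and the estimates are corrupted by zero-mean Gaussian noise. Each individual estimate is unbiased, yet the maximum over them is biased upwards.

```python
import numpy as np

def max_bias(n_actions=10, noise_std=1.0, n_trials=10_000, seed=0):
    """Average over-estimation of max_a Q(a) when every estimate is noisy."""
    rng = np.random.default_rng(seed)
    true_q = np.zeros(n_actions)  # all actions equally good; the true max is 0
    noisy_q = true_q + rng.normal(0.0, noise_std, size=(n_trials, n_actions))
    return float(noisy_q.max(axis=1).mean() - true_q.max())

bias_many_actions = max_bias(n_actions=10)  # clearly positive
bias_one_action = max_bias(n_actions=1)     # close to zero: nothing to maximise over
```

The bias grows with the number of actions and with the noise level, which is why unavoidable variance turns into systematic optimism.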

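While this is problematic, the authors of [3] proposed the following elegant solution:

$$Y_t^{\text{DoubleQ}} = R_{t+1} + \gamma Q\left( S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta'_t \right)$$

The resulting method, known as Double DQN, essentially decouples selection and evaluation by using two sets of weights
$\theta$ and $\theta'$.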
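The difference between the two targets fits in a few lines. This is a sketch with hypothetical function names: `q_next_online` and `q_next_second` stand for the next-state action values under the two sets of weights, and the numbers are invented for illustration.

```python
import numpy as np

def dqn_target(reward, gamma, q_next_online):
    """Single estimator: the same values select and evaluate the action."""
    return reward + gamma * np.max(q_next_online)

def double_dqn_target(reward, gamma, q_next_online, q_next_second):
    """Double estimator: one set of weights selects, the other evaluates."""
    best_action = int(np.argmax(q_next_online))
    return reward + gamma * q_next_second[best_action]

q_online = np.array([1.0, 3.0, 2.0])  # made-up values under the first weights
q_second = np.array([0.5, 1.0, 4.0])  # made-up values under the second weights

y_single = dqn_target(1.0, 0.5, q_online)                   # 1 + 0.5 * 3.0
y_double = double_dqn_target(1.0, 0.5, q_online, q_second)  # 1 + 0.5 * 1.0
```

If the first set of weights over-rates action 1 by chance, the second set is unlikely to share exactly the same error, so the double target is pulled back down.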

Over-optimism with respect to risk regardless of estimation error:

Consider the classic problem in decision theory of having to choose between an envelope which contains $90.00 and an envelope
which contains $200.00 or $0.00 with equal probability. Although the second envelope has the higher expected value (100 versus 90),
it pays nothing half of the time. Our agent’s ignorance of the bimodality of the second envelope’s payoff would lead it to act in an
over-optimistic fashion. Due to the max operator it would make a decision solely based on the fact that 100 > 90.

The above problem clearly requires a very different perspective.
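The envelope example can be made concrete by comparing what an expectation-only agent sees with what the full payoff distribution contains. The helper names below are hypothetical, and the sketch only tabulates the two distributions described above.

```python
import numpy as np

# (outcomes, probabilities) for each envelope
envelopes = {
    "safe":  (np.array([90.0]),       np.array([1.0])),
    "risky": (np.array([200.0, 0.0]), np.array([0.5, 0.5])),
}

def expected_value(outcomes, probs):
    return float(outcomes @ probs)

def std_dev(outcomes, probs):
    mean = outcomes @ probs
    return float(np.sqrt(((outcomes - mean) ** 2) @ probs))

def prob_below(outcomes, probs, threshold):
    """Probability of receiving strictly less than `threshold`."""
    return float(probs[outcomes < threshold].sum())

# An agent that only tracks expectations picks the argmax of the means.
means = {name: expected_value(*dist) for name, dist in envelopes.items()}
greedy_choice = max(means, key=means.get)

# Quantities the expectation hides.
risky_std = std_dev(*envelopes["risky"])
risk_of_regret = prob_below(*envelopes["risky"], threshold=90.0)
```

Here the greedy choice is the risky envelope, even though half of the time it pays less than the safe one. An agent that represents the distribution, rather than only its mean, has the information needed to trade expected value against risk.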

Two papers which address the second problem are [5] and [7]. While I won’t go into either paper in any detail, I would recommend that the
reader start with [5], which provides an elegant and scalable solution with what can be thought of as a data-dependent
version of dropout [8]. The consideration of value distributions helps reduce uncertainty and improve inference.

The latent value of hierarchical models:

Perhaps the most important question when considering the evolution of DQN is how these agents will develop rich conceptual abstractions
that allow scientific induction or generalisation. Although one can argue that a DQN learns good statistical representations of
environmental states, it doesn’t learn any higher-order abstractions such as concepts. Moreover, vanilla DQN is purely reactive
and doesn’t incorporate planning in any meaningful sense. This is where Hierarchical Deep Reinforcement Learning can play a very important role.

In particular, I would like to mention the promising work of Tejas Kulkarni, who investigated the use of hierarchical DQN with the following architecture:

Controller: which learns policies in order to satisfy particular goals

Meta-Controller: which chooses goals

Critic: which evaluates whether a goal has been achieved

Together these three components cooperate so that a high-level policy is learned over intrinsic goals and a lower-level policy is learned
over ‘atomic’ actions to satisfy the given goals. The work, which I’ve only vaguely described, opens up a lot of interesting
research directions which may not seem immediately obvious. One I’d like to mention is the possibility of learning a
grammar over policies. I think this might be a necessary component for the emergence of language in machines.
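The division of labour between the three components can be sketched as a control loop. This is a toy illustration and not Kulkarni’s implementation: nothing is learned here, the environment is a made-up one-dimensional corridor, the meta-controller picks goals at random, the controller follows a hand-written rule, and all class and function names are hypothetical.

```python
import random

class MetaController:
    """Chooses which goal to pursue next (here: uniformly at random)."""
    def __init__(self, goals, seed=0):
        self.goals = goals
        self.rng = random.Random(seed)

    def choose_goal(self, state):
        return self.rng.choice(self.goals)

class Controller:
    """Chooses 'atomic' actions for a given goal (here: a hand-written rule)."""
    def act(self, state, goal):
        if state == goal:
            return 0
        return 1 if goal > state else -1

class Critic:
    """Evaluates whether the goal has been achieved and rewards the controller."""
    def goal_reached(self, state, goal):
        return state == goal

    def intrinsic_reward(self, state, goal):
        return 1.0 if self.goal_reached(state, goal) else 0.0

def run_episode(n_goals=3, max_steps=50, corridor_length=10):
    meta = MetaController(goals=list(range(corridor_length)))
    controller, critic = Controller(), Critic()
    state, steps, achieved = 0, 0, []
    while len(achieved) < n_goals and steps < max_steps:
        goal = meta.choose_goal(state)            # high-level decision
        while not critic.goal_reached(state, goal) and steps < max_steps:
            state += controller.act(state, goal)  # low-level atomic action
            steps += 1
        if critic.goal_reached(state, goal):
            achieved.append(goal)
    return achieved, state

achieved_goals, final_state = run_episode()
```

In a learned version, both `choose_goal` and `act` would be driven by their own value functions: the meta-controller trained on the environment’s reward and the controller trained on the critic’s intrinsic reward.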

The interpretation of the ‘Critic’ is also very interesting. Perhaps one can argue that it provides the agent with a rudimentary form of
introspection.

Conclusion:

I find it remarkable that a simple method such as DQN should inspire many new approaches. Perhaps it’s not so much the brilliance
of the method but rather its generality which allowed this method to adapt and evolve. In particular, I think the coupling
of Distributional RL with Hierarchical Deep RL has a very bright future. Together, these will lead to significant improvements in terms of inference and generalisation.