MIRI have uploaded a third set of videos from their Colloquium Series on Robust and Beneficial AI, which was co-hosted with the Future of Humanity Institute. These talks were part of the week focused on preference specification in AI systems, including the difficulty of specifying safe and useful goals, or specifying safe and useful methods for learning human preferences. All released videos are available on the CSRBAI web page.

How can we design good goals for arbitrarily intelligent agents? Reinforcement learning (RL) may seem like a natural approach. Unfortunately, RL does not work well for generally intelligent agents, as RL agents are incentivised to shortcut the reward sensor for maximum reward — the so-called wireheading problem.

In this paper we suggest an alternative to RL called value reinforcement learning (VRL). In VRL, agents use the reward signal to learn a utility function. The VRL setup allows us to remove the incentive to wirehead by placing a constraint on the agent’s actions. The constraint is defined in terms of the agent’s belief distributions, and does not require an explicit specification of which actions constitute wireheading. Our VRL agent offers the ease of control of RL agents and avoids the incentive for wireheading.

An artificial agent is corrigible if it accepts or assists in outside correction for its objectives. At a minimum, a corrigible agent should allow its programmers to turn it off. An artificial agent is functional if it is capable of performing non-trivial tasks. For example, a machine that immediately turns itself off is useless (except perhaps as a novelty item).

In a standard reinforcement learning agent, incentives for these behaviors are essentially at odds. The agent will either want to be turned off, want to stay alive, or be indifferent between the two. Of these, indifference is the only safe and useful option, but there is reason to believe that this is a strong condition on the agent’s incentives. In this talk, I will propose a design for a corrigible, yet functional, agent as the solution to a two-player cooperative game where the robot’s goal is to maximize the human’s sum of rewards.

We do an equilibrium analysis of the solutions to the game and identify three key properties. First, we show that if the human acts rationally, then the robot will be corrigible. Second, we show that if the robot has no uncertainty about human preferences, then the robot will be incorrigible or non-functional if the human is even slightly suboptimal. Finally, we analyze the Gaussian setting and characterize the necessary and sufficient conditions, as a function of the robot’s belief about human preferences and the degree of human irrationality, to ensure that the robot will be corrigible and functional.
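To make the first two properties concrete, here is a minimal Python sketch of a one-shot version of such a game. It is an illustration under stated assumptions, not the model from the talk: the robot holds a Gaussian belief N(mu, sigma^2) over the utility U of a proposed action, switching off is worth 0, and a human who is asked for permission makes the wrong call with a fixed probability `error_rate`. The function names and the constant-error model of human suboptimality are hypothetical choices made for this example.

```python
import math


def std_normal_pdf(z):
    """Density of the standard normal distribution at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_positive_part(mu, sigma):
    """E[max(U, 0)] for U ~ N(mu, sigma^2); sigma == 0 means no uncertainty."""
    if sigma == 0:
        return max(mu, 0.0)
    z = mu / sigma
    return mu * std_normal_cdf(z) + sigma * std_normal_pdf(z)


def option_values(mu, sigma, error_rate=0.0):
    """Expected utility of each robot option under the assumed model.

    act        : take the action without asking, worth E[U] = mu
    switch_off : do nothing, worth 0
    defer      : ask the human, who permits the action when U >= 0 and
                 blocks it otherwise, but errs with probability error_rate
                 (hypothetical constant-error model)
    """
    pos = expected_positive_part(mu, sigma)  # E[max(U, 0)]
    neg = mu - pos                           # E[min(U, 0)], since U = max + min
    defer = (1.0 - error_rate) * pos + error_rate * neg
    return {"act": mu, "switch_off": 0.0, "defer": defer}


def best_option(mu, sigma, error_rate=0.0):
    """Name of the option with the highest expected utility."""
    values = option_values(mu, sigma, error_rate)
    return max(values, key=values.get)


if __name__ == "__main__":
    # Uncertain robot, rational human: deferring is strictly best.
    print(best_option(0.5, 1.0, 0.0))
    # Certain robot, slightly suboptimal human: the robot bypasses the human.
    print(best_option(0.5, 0.0, 0.1))
    # Certain robot that believes the action is bad: it just switches off.
    print(best_option(-0.5, 0.0, 0.1))
```

In this toy model, a robot that is uncertain about U gains from deferring because a rational human filters out the bad outcomes. Once sigma is 0, any positive error rate makes deferring strictly worse than either acting directly (when mu > 0) or switching off (when mu < 0), which mirrors the incorrigible-or-non-functional dichotomy in the abstract.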

Jan Leike, a recent hire at the Future of Humanity Institute, spoke about general reinforcement learning (slides). Abstract:

General reinforcement learning (GRL) is the theory of agents acting in unknown environments that are non-Markov, non-ergodic, and only partially observable. GRL can serve as a model for strong AI and has been used extensively to investigate questions related to AI safety. Our focus is not on practical algorithms, but rather on the fundamental underlying problems: How do we balance exploration and exploitation? How do we explore optimally? When is an agent optimal? We outline current shortcomings of the model and point to future research directions.

We will discuss ongoing research into value learning: how an agent can gradually learn to understand the world it’s in, learn to understand what we mean for it to do, learn to understand as well as be compelled to adhere to proper values, and learn to do so robustly in the face of inaccurate, inconsistent, and incomplete information as well as underspecified, conflicting, and updatable goals. To fulfill this ambitious vision we have a long road of gradual teaching and testing ahead of us.

For a recap of the week 2 videos on robustness and error-tolerance, see the previous blog post. For a summary of how the event as a whole went, and videos of the opening talks by Stuart Russell, Alan Fern, and Francesca Rossi, see the first blog post.