Online Learning 2: Bandit learning with catastrophes


The usual training procedures for machine learning models are not always well-equipped to avoid rare catastrophes. In order to maintain the safety of powerful AI systems, it will be important to have training procedures that can efficiently learn from such events. [1]

We can model this situation with the problem of exploration-only online bandit learning. In this scenario, we grant the AI system an exploration phase, in which it is allowed to select catastrophic arms and view their consequences. (We can imagine that such catastrophic selections are acceptable because they are simulated, or are merely evaluated by human overseers.) Then, the AI system is switched into the deployment phase, in which it must select an arm that almost always avoids catastrophes.
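The two-phase protocol above can be sketched in code. This is a minimal illustration, not an algorithm from the post: the arm statistics, pull budget, and zero-estimated-risk deployment rule are all invented for the example.

```python
import random

random.seed(0)
K = 4
# Hypothetical per-arm catastrophe probabilities and rewards (illustrative only).
TRUE_RISK = [0.0, 0.3, 0.0, 0.5]
TRUE_REWARD = [0.6, 0.9, 0.8, 0.95]

# Exploration phase: catastrophic pulls are allowed (simulated / overseen),
# so the learner can estimate each arm's catastrophic risk directly.
pulls = 200
risk_est = []
for i in range(K):
    catastrophes = sum(random.random() < TRUE_RISK[i] for _ in range(pulls))
    risk_est.append(catastrophes / pulls)

# Deployment phase: commit to the best-rewarded arm among those that
# showed no catastrophes during exploration.
safe = [i for i in range(K) if risk_est[i] == 0.0]
deployed = max(safe, key=lambda i: TRUE_REWARD[i])
print(deployed)  # arm 2: the higher-reward arm among the zero-risk arms
```

Note that a finite exploration budget can only bound, not eliminate, the risk of deploying a rare-catastrophe arm; quantifying that trade-off is exactly what the setup below formalizes.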

## Setup
In outline, the learner will receive a series of randomly selected examples, and will select an expert (modeled as a bandit arm) at each time step. The challenge is to find a high-performing expert in as few time steps as possible. We give some definitions:

Let X be some finite set of possible inputs.

Let A:={1,2,...,K} be the set of available experts (i.e. bandit arms).

Let R:X×A→[0,b] be the reward function. R(xₜ,i) is the reward for following expert i on example xₜ.

Let C:X×A→[0,1] be the catastrophic risk function. C(xₜ,i) is the catastrophic risk incurred by following expert i on example xₜ.

Let M(x,i):=R(x,i)−C(x,i)/τ be the mixed payoff that the learner is to optimize, where τ∈(0,1] is the risk-tolerance. τ can be very small, on the order of 10⁻²⁰, so that even a tiny catastrophic risk outweighs any achievable reward.

Let p:X→[0,1] be the input distribution from which examples are drawn in the deployment phase.
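The definitions above can be made concrete with a small numerical sketch of the mixed payoff M(x,i) = R(x,i) − C(x,i)/τ on a single input. The reward and risk values here are hypothetical, chosen only to show that with a tiny τ, any nonzero catastrophic risk dominates the reward term.

```python
K = 3          # number of experts / bandit arms
TAU = 1e-20    # risk tolerance tau

# Hypothetical reward R(x, i) and catastrophic risk C(x, i) on one fixed input x.
R = [0.9, 0.5, 0.7]
C = [0.1, 0.0, 0.0]

def mixed_payoff(i):
    """M(x, i) = R(x, i) - C(x, i) / tau."""
    return R[i] - C[i] / TAU

best = max(range(K), key=mixed_payoff)
print(best)  # arm 2: the highest-reward arm among those with zero risk
```

Arm 0 has the highest reward, but its payoff is roughly −10¹⁹, so the maximizer falls to the best zero-risk arm. This is the intended effect of dividing by τ: optimizing M forces the learner to treat catastrophe avoidance as lexically prior to reward, up to the tolerance τ.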