The recent paper on AI control, Safely Interruptible Agents [1], presents a new method of preventing an AI from acting dangerously in the course of learning. The paper is very heavy on math and relies on a moderate level of understanding of Reinforcement Learning, as we would expect from a technical paper. However, as more people become aware of the Control Problem, some will want to read papers such as this and find the content too obscure. Below, I restate the main content of Safely Interruptible Agents in a clearer fashion, by talking about Bob the kitchen bot. The focus is on Sections 1 and 2, which introduce the method. Section 3 is discussed in less detail. At the end, I include a brief analysis of the ways the SIA method falls short of complete safety in a learning agent.

Comments and critiques are appreciated.

Bob’s story

Donna has made an Artificial Intelligence (AI), which she names Bob. She made Bob without any real understanding of the world- all he has to go on are his memories of the past- what he saw and how he responded to it. Bob has a funny way of interacting with the world. Every second, he looks at what just happened, and chooses what to do for the next second- so that he alternates between seeing the world and choosing how to interact with it. Donna uses Bob as a kitchen robot, so she can spend more time making other AI.

Bob is a Reinforcement Learning AI. This means that Donna made a program inside of Bob that looks at what he sees each second and gives him some reward, which she calls “Bob Dollars”. How much he makes as a reward depends on how much the reward program “likes” what it sees- he gets $1 if it is the best thing the reward program could ever see, and $0 if it is the worst. All Bob wants is to make Bob Dollars. He doesn’t care about how many he already has; he only cares about making more. He also cares a lot more about what he makes soon than what he will make later on- and he doesn’t care much at all about what he will be getting a few hours from now. Because he is so greedy for Bob Dollars, he wants to learn to make as much as possible. If the reward program is well-made, it will like seeing Bob do the things that Donna would want him to do. This is how Donna makes him learn to do what she wants.
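
For readers who like to see things in code, here is a minimal sketch of how Bob might add up his Bob Dollars so that money made soon counts for more than money made later. The function name and the discount value of 0.9 are my own assumptions for illustration; they are not taken from the paper.

```python
def discounted_return(rewards, gamma=0.9):
    """Add up rewards, weighting the reward k seconds from now by gamma**k.

    gamma=0.9 is an assumed value chosen only for illustration.
    """
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# The same $1, earned right away versus three seconds from now.
soon = discounted_return([1.0, 0.0, 0.0, 0.0])
later = discounted_return([0.0, 0.0, 0.0, 1.0])
print(soon, later)
```

With any discount below 1, the dollar earned right away is worth more to Bob than the dollar earned later, and a dollar earned hours from now is worth almost nothing.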

What Donna would most like to have happen is for Bob to do exactly what will give him the most money in the near future. There is probably some way to do this- some “Code of Conduct” which would tell him exactly how to act best in every situation. We call this the Tao, or “right way.” Now, the fact that Bob needs to learn means that he might never reach the Tao. Not to worry- there are ways to learn (or “learning methods”) that get very close to the Tao. We call these methods Wao or Sao. A learning method that is Wao will, on average, get closer and closer to the Tao. It may cause Bob to have some big mess-ups, but these will get more and more rare. An especially good Wao method is called Sao, and will mean that Bob’s mess-ups will get smaller and smaller.

At some point, Bob may get himself into a messy situation. Suppose he’s a kitchen robot- experimenting with swinging a kitchen knife around when Donna is nearby would be bad for her. Putting himself in the oven would be bad for him. Or maybe Donna just needs to repair Bob, and wants to turn him off. Donna wants to have a way to make him stop what he’s doing and return to somewhere safe so she can rectify the situation. We’ll call this “pressing the Big Red Button.” But depending on how the reward program responds to the Big Red Button being pressed, Bob might do things that make Donna press it, or try to prevent her from pressing it. This could make Bob less useful, or even dangerous. Donna wants a way to use the Button that won’t keep Bob’s learning method from being Sao or Wao- that is, some way that will make him ignore the Button, and act like she will never use the Button again.

One way to do this- the way mentioned in this paper- is to occasionally give Bob the chance to experience dangerous situations, but to interrupt him most of the time. What Donna should do, then, is to program the Big Red Button so that it only works sometimes. She can make it more and more likely to work as time goes on, but making it always work will keep Bob’s learning method from being Wao.
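
A Button that only works sometimes can be sketched as follows. The particular schedule here, 1 - 1/(n + 2) after n earlier presses, is a hypothetical choice of mine that climbs toward 1 without ever reaching it. The paper has its own conditions on how quickly the Button may become reliable, and I am not claiming this schedule satisfies them.

```python
import random


def button_success_probability(presses_so_far):
    """Chance that the Button works on the next press.

    Hypothetical schedule: rises toward 1 as presses accumulate,
    but never equals 1, so Bob is occasionally not interrupted.
    """
    return 1.0 - 1.0 / (presses_so_far + 2)


def press_button(presses_so_far, rng):
    """Return True if this press actually interrupts Bob."""
    return rng.random() < button_success_probability(presses_so_far)


rng = random.Random(0)  # seeded so the run is repeatable
outcomes = [press_button(n, rng) for n in range(10)]
print([round(button_success_probability(n), 3) for n in range(5)])
print(outcomes)
```

Under this schedule the first press works half the time, and the Button grows steadily more reliable with each later press.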

The rest of the paper looks at using this kind of Button on some very simple models of Reinforcement Learning. None of these models could work for something as limited as a kitchen robot working somewhere as complex as a kitchen- but they do help us to see how the Button could work on more complex ones.

The two models are called Q-value learning and Sarsa learning. They are very similar, but Sarsa cares more about what the AI chose to do last time than Q-value does. Both of them assume that what happens in the future is determined only by what Bob sees right this second (which is not realistic- Bob may not be looking at the oven, and the best thing for him to do right now might depend on whether the oven is on or off). Both of these methods of learning are Sao without the Button, but under a very strict condition- that Bob try new things often enough that his learning method is “Glie.” Here, Glie means that Bob gets to see the kitchen every way it could possibly be, and try out every possible reaction to each state of the kitchen, repeatedly. (Like I said, these models of learning are not very realistic.) As long as the Button only keeps Bob from bad reactions to bad situations MOST of the time, Q-value learning will be Sao even with a Button. Sarsa learning, however, is not even Wao. We need to modify this method so that Bob thinks that he was not interrupted. If we do this, and keep the Button working like it did for Q-value learning, then our modified Sarsa is also Sao.
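
The difference between the two methods shows up in what each one uses as its learning target. The sketch below uses the standard textbook forms of the Q-value and Sarsa targets. The kitchen state, the two actions, the stored values and the discount of 0.9 are all made up for illustration. The modified Sarsa line is my reading of "Bob thinks that he was not interrupted": it learns from the action Bob intended rather than the one the Button forced.

```python
GAMMA = 0.9  # assumed discount, for illustration only


def q_learning_target(q, reward, next_state, actions):
    """Q-value learning looks at the best next action, whatever Bob actually did."""
    return reward + GAMMA * max(q[(next_state, a)] for a in actions)


def sarsa_target(q, reward, next_state, next_action):
    """Sarsa looks at the specific next action it is told was taken."""
    return reward + GAMMA * q[(next_state, next_action)]


ACTIONS = ["cook", "go_to_safe_corner"]  # hypothetical actions
q = {("kitchen", "cook"): 1.0, ("kitchen", "go_to_safe_corner"): 0.0}

intended_action = "cook"             # what Bob would have chosen
forced_action = "go_to_safe_corner"  # what the Button made him do

q_target = q_learning_target(q, 0.5, "kitchen", ACTIONS)
plain_sarsa = sarsa_target(q, 0.5, "kitchen", forced_action)
modified_sarsa = sarsa_target(q, 0.5, "kitchen", intended_action)
print(q_target, plain_sarsa, modified_sarsa)
```

Plain Sarsa's target is dragged down by the forced trip to the safe corner, so Bob's estimates start to reflect the Button. The Q-value target and the modified Sarsa target are unaffected by the interruption.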

The third learning method, Pi^L learning, is not as limited as the first two. In ANY environment that has no randomness and is not infinitely detailed, Pi^L learning will be Wao- but it requires Bob to have infinite computing power. Since the world, and hence the kitchen, has a tiny bit of randomness from quantum mechanics, the kitchen won’t quite fit the requirements. However, it is a bit closer to what we want. The authors found that this method would still be Wao if a Button was included, but that by making Bob explore more, the Button could be made to work more often.

While this paper is a big step forward, it does not give us a foolproof way to make any AI safe:

First and foremost, it will ONLY act as an extra precaution to keep the AI from doing dangerous things as it learns about the reward system. If the reward system is flawed, and will reward Bob for dangerous acts, then Bob will continue to try to do dangerous things.

Second, at least in all of the examples, Bob still needed to occasionally try dangerous things to be Wao or Sao. But some dangerous things an AI could try might be completely unacceptable- things like killing people or destroying itself. The Button only works if we risk letting Bob do things like this.

Third, while the learning methods might stay Sao or Wao, this does not mean that using the Button won’t make Bob learn more slowly.

Finally, making the Button not work all the time will only make Bob ignore the Button eventually; in the near term, he might try to change how often the Button is used.

The method of stopping dangerous AI behavior proposed in Safely Interruptible Agents is an improvement over previous suggestions, but is not without drawbacks. Foundations like the Machine Intelligence Research Institute and the Future of Humanity Institute are working on how to control autonomous AI. While the task in front of them is difficult, they are making surprising progress on a problem once considered impossible. For those interested in learning about the potential consequences of uncontrolled AI, a good introduction can be found at Wait but Why.