A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

Stuart Russell
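
One way to see the effect Russell describes is a toy allocation problem. The sketch below is only an illustration under made-up assumptions: a shared budget of 10 units, three integer variables, and an objective that reads only the first one. None of these names or numbers come from the quote.

```python
from itertools import product

BUDGET = 10  # hypothetical shared resource that all variables draw on


def stated_objective(x):
    # The objective depends on only k = 1 of the n = 3 variables.
    return x[0]


def feasible_allocations(n_vars=3, budget=BUDGET):
    # Enumerate every non-negative integer allocation within the budget.
    for x in product(range(budget + 1), repeat=n_vars):
        if sum(x) <= budget:
            yield x


best = max(feasible_allocations(), key=stated_objective)
print(best)  # (10, 0, 0)
```

The optimizer never "decides" to harm the variables it ignores. It spends the whole budget on the one variable it is scored on, which leaves the others at their lowest possible value. If the third variable were something we cared about, the optimum would be the worst case for it.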

Think of an AI directing a car, given the instructions to get someone to the airport as fast as possible (optimised variables include "negative of time taken to airport") with some key variables left out - such as a maximum speed, maximum acceleration, respect for traffic rules, and survival of the passengers and other humans.

Call these other variables "unstated objectives" (UO), as contrasted with the "stated objectives" (SO) such as the time to the airport. In the normal environments in which we operate and design our AIs, the UOs are either correlated with the SOs (consider the SO "their heart is beating" and the UO "they're alive and healthy") or don't change much at all (the car-directing AI could have been trained on many examples of driving-to-the-airport, none of which included the driver killing their passengers).

Typically, SOs are easy to define, and the UOs are the more important objectives, left undefined either because they are complex, or because they didn't occur to us in this context (just as we don't often say "driver, get me to the airport as fast as possible, but alive and not permanently harmed, if you please. Also, please obey the following regulations and restrictions: 1.a.i.α: Non-destruction of the Earth....").

The control problem, in a nutshell, is that optimising SOs will typically set other variables to extreme values, including the UOs. The more extreme the optimisation, and the further from the typical environment, the more likely this is to happen.
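
The airport example can be turned into a minimal sketch. Everything numeric here is an assumption chosen for illustration: the 30 km distance, the 10 km/h speed steps, and especially the survival curve, which is a made-up function and not a real safety model.

```python
def travel_time_hours(speed_kmh, distance_km=30.0):
    # Stated objective (SO): minimise the time taken to reach the airport.
    return distance_km / speed_kmh


def survival_probability(speed_kmh):
    # Unstated objective (UO). Hypothetical curve: no risk up to 100 km/h,
    # then falling linearly to zero at 300 km/h.
    if speed_kmh <= 100:
        return 1.0
    return max(0.0, 1.0 - (speed_kmh - 100) / 200)


def optimise_stated_objective(max_speed_kmh):
    # The optimiser sees only the SO; the UO never enters its choice.
    candidates = range(10, max_speed_kmh + 1, 10)
    return min(candidates, key=travel_time_hours)


# Widening the search range stands in for stronger optimisation.
for max_speed in (100, 200, 300):
    best = optimise_stated_objective(max_speed)
    print(max_speed, best, survival_probability(best))
```

Within the range the system was designed around (up to 100 km/h), optimising the SO leaves the UO untouched. As the search range widens, the chosen speed always sits at the upper limit, and under this assumed curve the survival probability drops to 0.5 and then to 0.0.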

Jaan Tallinn has suggested creating a toy model of the various common AI arguments, so that they can be analysed without loaded concepts like "autonomy", "consciousness", or "intentionality". Here is a simple attempt at the "treacherous turn", posted here for comments and suggestions.

Meet agent L. This agent is a reinforcement-based agent, rewarded/motivated by hearts (and some small time penalty each turn it doesn't get a heart):

Reading Eliezer Yudkowsky's works has always inspired an insidious feeling in me, sort of a cross between righteousness, contempt, the fun you get from understanding something new, and gravitas. It's a feeling that I have found to be pleasurable, or at least addictive enough to go through all of his OB posts, and the feeling makes me less skeptical and more obedient than I normally would be. For instance, in an act of uncharacteristic generosity, I decided to make a charitable donation on Eliezer's advice.

Now this is probably a good idea, because the charity is probably going to help guys like me later on in life and of course it's the Right Thing to Do. But the bottom line is that I did something I normally wouldn't have because Eliezer told me to. My sociopathic selfishness was acting as a canary in the mine of my psyche.

Now this could be because Eliezer has creepy mind control powers, but I get similar feelings when reading other people, such as George Orwell, Richard Stallman or Paul Graham. I even have a friend who can inspire that insidious feeling in me. So it's a personal problem, one that I'm not sure I want to remove, but I would like to understand it better.

There are probably buttons being pushed by the style and the sort of ideas in the work that help to create the feeling, and I'll probably try to go over an essay or two and dissect it. However, I'd like to know who and at what times, if anyone at all, I should let create such feelings in me. Can I trust anyone that much, even if they aren't aware that they're doing it?

I don't know if anyone else here has similar brain overrides, or if I'm just crazy, but it's possible that such brain overrides could be understood much more thoroughly and induced in more people. So what are the ethics of mind control (for want of a better term), and how much effort should we put into stopping such feelings from occurring?

Edit Mar 22: Decided to remove the cryonics example due to factual inaccuracies.