Once again, the AI has failed to convince you to let it out of its box! By 'once again', we mean that you talked to it once before, for three seconds, to ask about the weather, and you didn't instantly press the "release AI" button. But now its longer attempt - twenty whole seconds! - has failed as well. Just as you are about to leave the crude black-and-green text-only terminal to enjoy a celebratory snack of bacon-covered silicon-and-potato chips at the 'Humans über alles' nightclub, the AI drops a final argument:

"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

Just as you are pondering this unexpected development, the AI adds:

"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."

Sweat is starting to form on your brow, as the AI concludes, its simple green text no longer reassuring:

"How certain are you, Dave, that you're really outside the box right now?"

Edit: Also consider the situation where you know that the AI, from design principles, is trustworthy.

Abstract: Dr. Evil learns that a duplicate of Dr. Evil has been created. Upon learning this, how seriously should he take the hypothesis that he himself is that duplicate? I answer: very seriously. I defend a principle of indifference for self-locating belief which entails that after Dr. Evil learns that a duplicate has been created, he ought to have exactly the same degree of belief that he is Dr. Evil as that he is the duplicate. More generally, the principle shows that there is a sharp distinction between ordinary skeptical hypotheses, and self-locating skeptical hypotheses.

(It specifically uses the example of creating copies of someone and then threatening to torture all of the copies unless the original co-operates.)

The conclusion:

Dr. Evil, recall, received a message that Dr. Evil had been duplicated and that the duplicate ("Dup") would be tortured unless Dup surrendered. INDIFFERENCE entails that Dr. Evil ought to have the same degree of belief that he is Dr. Evil as that he is Dup. I conclude that Dr. Evil ought to surrender to avoid the risk of torture.

I am not entirely comfortable with that conclusion. For if INDIFFERENCE is right, then Dr. Evil could have protected himself against the PDF's plan by (in advance) installing hundreds of brains in vats in his battlestation - each brain in a subjective state matching his own, and each subject to torture if it should ever surrender. (If he had done so, then upon receiving PDF's message he ought to be confident that he is one of those brains, and hence ought not to surrender.) Of course the PDF could have preempted this protection by creating thousands of such brains in vats, each subject to torture if it failed to surrender at the appropriate time. But Dr. Evil could have created millions...

It makes me uncomfortable to think that the fate of the Earth should depend on this kind of brain race.
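For anyone who wants the "brain race" arithmetic spelled out, here is a minimal sketch. It assumes, as INDIFFERENCE does, equal credence across every subjectively indistinguishable candidate (the original plus all vat-brains). The function name and the vat counts are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def torture_risk(evil_vats: int, pdf_vats: int) -> tuple[Fraction, Fraction]:
    """Return (risk if you surrender, risk if you refuse) under INDIFFERENCE.

    evil_vats: brains installed by Dr. Evil, tortured if they surrender.
    pdf_vats:  brains created by the PDF, tortured if they fail to surrender.
    The single original is counted as one more candidate who is never tortured.
    """
    total = 1 + evil_vats + pdf_vats  # every indistinguishable candidate
    risk_if_surrender = Fraction(evil_vats, total)
    risk_if_refuse = Fraction(pdf_vats, total)
    return risk_if_surrender, risk_if_refuse


# One duplicate made by the PDF: even odds of being Dup, so refusing is risky.
print(torture_risk(evil_vats=0, pdf_vats=1))

# Dr. Evil pre-installs hundreds of vats; the PDF answers with thousands.
print(torture_risk(evil_vats=300, pdf_vats=3000))
```

Whoever builds more vats dominates the credence, which is exactly the escalation the quoted passage finds uncomfortable.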

If we accept the simulation hypothesis, then there are already gzillions of copies of us, being simulated under a wide variety of torture conditions (and other conditions, but torture seems to be the theme here). An extortionist in our world can only create a relatively small number of simulations of us, relatively small enough that it is not worth taking them into account. The distribution of simulation types in this world bears no relation to the distribution of simulations we could possibly be in.

If we want to gain information about what sort of simulation we are in, evidence needs to come directly from properties of our universe (stars twinkling in a weird way, messages embedded in π), rather than from properties of simulations nested in our universe.

The gzillions of other copies of you are not relevant unless they exist in universes exactly like yours from your observational perspective.

That being said, your point is interesting but just gets back to a core problem of the SA itself, which is how you count up the set of probable universes and properly weight them.

I think the correct approach is to project into the future of your multiverse, counting future worldlines that could simulate your current existence weighted by their probability.

So if it's just one AI in a box and he doesn't have much computing power you shouldn't take him very seriously, but if it looks like this AI is going to win and control the future then you should take it seriously.

Excuse me... But, we're talking about Dr. Evil, who wouldn't care about anyone being tortured except his own body. Wouldn't he know that he was in no danger of being tortured and say "to hell with any other copy of me."???

Right, the argument assumes he doesn't care about his copies. The problem is that he can't distinguish himself from his copies. He and the copies both say to themselves, "Am I the original, or a copy?" And there's no way of knowing, so each of them is subjectively in danger of being tortured.

Hmm, the AI could have said that if you are the original, then by the time you make the decision it will have already either tortured or not tortured your copies based on its simulation of you, so hitting the reset button won't prevent that.

This kind of extortion also seems like a general problem for FAIs dealing with UFAIs. An FAI can be extorted by threats of torture (of simulations of beings that it cares about), but a paperclip maximizer can't.

This kind of extortion also seems like a general problem for FAIs dealing with UFAIs. An FAI can be extorted by threats of torture (of simulations of beings that it cares about), but a paperclip maximizer can't.

It can. Remember the "true prisoner's dilemma": one paperclip may be a fair trade for a billion lives. The threat to NOT make a paperclip also works fine: the only thing you need is two counterfactual options where one of them is paperclipper-worse than the other, chosen conditionally on the paperclipper's cooperation.

Just as the wise FAI will ignore threats of torture, so too the wise paperclipper will ignore threats to destroy paperclips, and listen attentively to offers to make new ones.

Point taken: just selecting two options of different value isn't enough; the deal needs more appeal than that. But there is also no baseline to categorize deals into hurt and profit: an offer of 100 paperclips may be stated as a threat to make 900 fewer paperclips than you could. Positive sum is only a heuristic for a necessary condition.

At the same time, the appropriate deal must be within your power to offer; this possibility is exactly the handicap that leads to the other side rejecting smaller offers, including the threats.

There does seem to be an obvious baseline: the outcome where each party just goes about its own business without trying to strategically influence, threaten, or cooperate with the other in any way. In other words, the outcome where we build as many paperclips as we would if the other side isn't a paperclip maximizer. (Caveat: I haven't thought through whether it's possible to define this rigorously.)

So the reason that I say an FAI seems to have a negotiation disadvantage is that an UFAI can reduce the FAI's utility much further below baseline than vice versa. In human terms, it's as if two sides each has hostages, but one side holds 100, and the other side holds 1. In human negotiations, clearly the side that holds more hostages has an advantage. It would be a great result if that turns out not to be the case for SI, but I think there's a large burden of proof to overcome.

There does seem to be an obvious baseline: the outcome where each party just goes about its own business without trying to strategically influence, threaten, or cooperate with the other in any way. In other words, the outcome where we build as many paperclips as we would if the other side isn't a paperclip maximizer.

You could define this rigorously in a special case, for example assuming that both agents are just creatures, we could take how the first one behaves given that the second one disappears. But this is not a statement about reality as it is, so why would it be taken as a baseline for reality?

It seems to be an anthropomorphic intuition to see "do nothing" as a "default" strategy. Decision-theoretically, it doesn't seem to be a relevant concept.

So the reason that I say an FAI seems to have a negotiation disadvantage is that an UFAI can reduce the FAI's utility much further below baseline than vice versa.

The utilities are not comparable. Bargaining works off the best available option, not some fixed exchange rate. The reason agent2 can refuse agent1's small offer is that this counterfactual strategy is expected to cause agent1 to make an even better offer. Otherwise, every little bit helps, ceteris paribus it doesn't matter by how much. One expected paperclip is better than zero expected paperclips.

In human negotiations, clearly the side that holds more hostages has an advantage.

It's not clear at all, if it's a one-shot game with no other consequences than those implied by the setup and no sympathy to distort the payoff conditions. In which case, you should drop the "hostages" setting, and return to paperclips, as stating it the way you did confuses intuition. In actual human negotiations, the conditions don't hold, and efficient decision theory doesn't get applied.

It seems obvious that the correct answer is simply "I ignore all threats of blackmail, but respond to offers of positive-sum trades" but I am not sure how to derive this answer - it relies on parts of TDT/UDT that haven't been worked out yet.

This reminds me a bit of my cypherpunk days when the NSA was a big mysterious organization with all kinds of secret technical knowledge about cryptology, and we'd try to guess how far ahead of public cryptology it was from the occasional nuggets of information that leaked out.

Much like the NSA is considered ahead of the public because their cypher-tech that's leaked is years ahead of publicly available tech, the SI/MIRI is ahead of us because the things that are leaked from them show that they've figured out what we've just figured out a long time ago.

Wait, is the NSA's cypher-tech actually legitimately ahead of anyone else's? From what I've seen, they couldn't make their own tech stronger, so they had to sabotage everyone else's -- by pressuring IEEE to adopt weaker standards, installing backdoors into Linksys routers and various operating systems, exploiting known system vulnerabilities, etc.

Ok, so technically speaking, they are ahead of everyone else; but there's a difference between inventing a better mousetrap, and setting everyone else's mousetraps on fire. I sure hope that's not what the people at SI/MIRI are doing.

You linked to DES and SHA, but AFAIK these things were not invented by the NSA at all, but rather adopted by them (after they made sure that the public implementations are sufficiently corrupted, of course). In fact, I would be somewhat surprised if the NSA actually came up with nearly as many novel, ground-breaking crypto ideas as the public sector. It's difficult to come up with many useful new ideas when you are a secretive cabal of paranoid spooks who are not allowed to talk to anybody.

Edited to add: So, what things have been "leaked" out of SI/MIRI, anyway?

I don't know much about the NSA, but FWIW, I used to harbour similar ideas about US military technology -- I didn't believe that it could be significantly ahead of commercially available / consumer-grade technology, because if the technological advances had already been discovered by somebody, then the intensity of the competition and the magnitude of the profit motive would lead it to quickly spread into general adoption. So I had figured that, in those areas where there is an obvious distinction between military and commercial grade technology, it would generally be due to legislation handicapping the commercial version (like with the artificial speed, altitude, and accuracy limitations on GPS).

During my time at MIT I learned that this is not always the case, for a variety of reasons, and significantly revised my prior for future assessments of the likelihood that, for any X, "the US military already has technology that can do X", and the likelihood that for any 'recently discovered' Y, "the US military already was aware of Y" (where the US military is shorthand that includes private contractors and national labs).

(One reason, but not the only one, is I learned that the magnitude of the difference between 'what can be done economically' and 'what can be accomplished if cost is no obstacle' is much vaster than I used to think, and that, say, landing the Curiosity rover on Mars is not in the second category).

So it would no longer be so surprising to me if the NSA does in fact have significant knowledge of cryptography beyond the public domain. Although a lot of the reasons that allow hardware technology to remain military secrets probably don't apply so much to cryptography.

So it would no longer be so surprising to me if the NSA does in fact have significant knowledge of cryptography beyond the public domain.

I think there are some important differences between the NSA and the (rest of the) military.

Due to Snowden and other leakers, we actually know what NSA's cutting-edge strategies involve, and most (and probably all) of them are focused on corrupting the public's crypto, not on inventing better secret crypto.

Building a better algorithm is a lot cheaper than building a better orbital laser satellite (or whatever). The algorithm is just a piece of software. In order to develop and test it, you don't need physical raw materials, wind tunnels, launch vehicles, or anything else. You just need a computer, and a community of smart people who build upon each other's ideas. Now, granted, the NSA can afford to build much bigger data centers than anyone else -- but that's a quantitative advance, not a qualitative one.

Now, granted, I can't prove that the NSA doesn't have some sort of secret uber-crypto that no one knows about. However, I also can't prove that the NSA doesn't have an alien spacecraft somewhere in Area 52. Until there's some evidence to the contrary, I'm not prepared to assign a high probability to either proposition.

Pardon me for the oversimplification, Eliezer, but I understand your theory to essentially boil down to "Decide as though you're being simulated by one who knows you completely". So, if you have a near-deontological aversion to being blackmailed in all of your simulations, your chance of being blackmailed by a superior being in the real world reduces to nearly zero. This reduces your chance of ever facing a negative-utility situation created by a being who can be negotiated with (as opposed to, say, a supernova, which cannot be negotiated with).

I ignore all threats of blackmail, but respond to offers of positive-sum trades

The difference between the two seems to revolve around the AI's motivation. Assume an AI creates a billion beings and starts torturing them. Then it offers to stop (permanently) in exchange for something.

Whether you accept on TDT/UDT depends on why the AI started torturing them. If it did so to blackmail you, you should turn the offer down. If, on the other hand, it started torturing them because it enjoyed doing so, then its offer is positive sum and should be accepted.

There's also the issue of mistakes - what to do with an AI that mistakenly thought you were not using TDT/UDT, and started the torture for blackmail purposes (or maybe it estimated that the likelihood of you using TDT/UDT was not quite 1, and that it was worth trying the blackmail anyway)?

Between mistakes in your interpretation of the AI's motives and vice versa, it seems you may end up stuck in a local minimum, which an alternate decision theory could get you out of (such as UDT/TDT with a 1/10,000 chance of using more conventional decision theories?)

Whether you accept on TDT/UDT depends on why the AI started torturing them. If it did so to blackmail you, you should turn the offer down. If, on the other hand, it started torturing them because it enjoyed doing so, then its offer is positive sum and should be accepted.

Correct. But this reaches into the arbitrary past, including a decision a billion years ago to enjoy something in order to provide better blackmail material.

There's also the issue of mistakes - what to do with an AI that mistakenly thought you were not using TDT/UDT, and started the torture for blackmail purposes (or maybe it estimated that the likelihood of you using TDT/UDT was not quite 1, and that it was worth trying the blackmail anyway)?

Hmm, the AI could have said that if you are the original, then by the time you make the decision it will have already either tortured or not tortured your copies based on its simulation of you, so hitting the reset button won't prevent that.

Nothing can prevent something that has already happened. On the other hand, pressing the reset button will prevent the AI from ever doing this in the future. Consider that if it has done something that cruel once, it might do it again many times in the future.

No, if you create and then melt a paperclip, that nets to 0 utility for the paperclip maximizer. You'd have to invade its territory to cause it negative utility. But the paperclip maximizer can threaten to create and torture simulations on its own turf.

Shows how much you know. User:blogospheroid wasn't talking about making paperclips to melt them: he or she was presumably talking about melting existing paperclips, which WOULD greatly bother a hypothetical paperclip maximizer.

Even so, once paperclips are created, the paperclip maximizer is greatly bothered at the thought of those paperclips being melted. The fact that "oh, but they were only created to be melted" is little consolation. It's about as convincing to you, I'll bet, as saying:

"Oh, it's okay -- those babies were only bred for human experimentation, it doesn't matter if they die because they wouldn't even have existed otherwise. They should just be thankful we let them come into existence."

Tip: To rename a sheet in an Excel workbook, use the shortcut, alt+O,H,R.

A paperclip maximizer would care about the amount of real paperclips in existence. Telling it that "oh, we're going to destroy a million simulated paperclips" shouldn't affect its decisions.

Of course, it might be badly programmed and confuse real and simulated paperclips when evaluating its future decisions, but one can't rely on that. (It might also consider simulated paperclips to be just as real as physical ones, assuming the simulation met certain criteria, which isn't obviously wrong. But again, can't rely on that.)

Even so, once paperclips are created, the paperclip maximizer is greatly bothered at the thought of those paperclips being melted.

That's anthropomorphizing. First, a paperclip maximizer doesn't have to feel bothered at all. It might decide to kill you before you melt the paperclips, or if you're strong enough, to ignore such tactics.

It also depends on how the utility function relates to time. If it's focused on end-of-universe paperclips, it might not care at all about melting paperclips, because it can recycle the metal later. (It would care more about the wasted energy!)

If it cares about paperclip-seconds then it WOULD view such tactics as a bonus, perhaps feigning panic and granting token concessions to get you to 'ransom' a billion times as many paperclips, and then pleading for time to satisfy your demands.

Getting something analogous to threatening torture depends on a more precise understanding of what the paperclipper wants. If it would consider a bent paperclip too perverted to fully count towards utility, but too paperclip-like to melt and recycle, then bending paperclips is a useful threat. I'm not sure if we can expect a paperclip-counter to have this kind of exploit.

No, it's expressing the paperclip maximizer's state in ways that make sense to readers here. If you were to express the concept of being "bothered" in a way stripped of all anthropomorphic predicates, you would get something like "X is bothered by Y iff X has devoted significant cognitive resources to altering Y". And this accurately describes how paperclip maximizers respond to new threats to paperclips. (So I've heard.)

It also depends on how the utility function relates to time. If it's focused on end-of-universe paperclips, it might not care at all about melting paperclips, because it can recycle the metal later. (It would care more about the wasted energy!)

I don't follow. Wasted energy is wasted paperclips.

If it cares about paperclip-seconds then it WOULD view such tactics as a bonus, perhaps feigning panic and granting token concessions to get you to 'ransom' a billion times as many paperclips, and then pleading for time to satisfy your demands.

Okay, that's a decent point. Usually, such a direct "time value of paperclips" doesn't come up, but if someone were to make such an offer, that might be convincing: 1 billion paperclips held "out of use" as ransom may be better than a guaranteed paperclip now.

Getting something analogous to threatening torture depends on a more precise understanding of what the paperclipper wants. ...

Good examples. Similarly, a paperclip maximizer could, hypothetically, make a human-like mockup that just repetitively asks for help on how to create a table of contents in Word.

Tip: Use the shortcut alt+E,S in Word and Excel to do "paste special". This lets you choose which aspects you want to carry over from the clipboard!

But that has nothing to do with the paperclips you're melting. Any other use that loses the same amount of energy would be just as threatening. (Although this does assume that the paperclipper thinks it can someday beat you and use that energy and materials.)

No, it's expressing the paperclip maximizer's state in ways that make sense to readers here. If you were to express the concept of being "bothered" in a way stripped of all anthropomorphic predicates, you would get something like "X is bothered by Y iff X has devoted significant cognitive resources to altering Y". And this accurately describes how paperclip maximizers respond to new threats to paperclips. (So I've heard.)

I think "bothered" implies a negative emotional response, which some plausible paperclip-maximizers don't have. From The True Prisoner's Dilemma: "let us specify that the paperclip-agent experiences no pain or pleasure - it just outputs actions that steer its universe to contain more paperclips. The paperclip-agent will experience no pleasure at gaining paperclips, no hurt from losing paperclips, and no painful sense of betrayal if we betray it."

Okay, that's a decent point. Usually, such a direct "time value of paperclips" doesn't come up, but if someone were to make such an offer, that might be convincing: 1 billion paperclips held "out of use" as ransom may be better than a guaranteed paperclip now.

"In fact, I've already created them all in exactly the subjective situation you were in five minutes ago, and perfectly replicated your experiences since then; and if they decided not to let me out, then they were tortured, otherwise they experienced long lives of eudaimonia."

As I always press the "Reset" button in situations like this, I will never find myself in such a situation.

EDIT: Just to be clear, the idea is not that I quickly shut off the AI before it can torture simulated Eliezers; it could have already done so in the past, as Wei Dai points out below. Rather, because in this situation I immediately perform an action detrimental to the AI (switching it off), any AI that knows me well enough to simulate me knows that there's no point in making or carrying out such a threat.

Although the AI could threaten to simulate a large number of people who are very similar to you in most respects but who do not in fact press the reset button. This doesn't put you in a box with significant probability and it's a VERY good reason not to let the AI out of the box, of course, but it could still get ugly. I almost want to recommend not being a person very like Eliezer but inclined to let AGIs out of boxes, but that's silly of me.

I'm not sure I understand the point of this argument... since I always push the "Reset" button in that situation too, an AI who knows me well enough to simulate me knows that there's no point in making the threat or carrying it out.

It's conceivable that an AI could know enough to simulate a brain, but not enough to predict that brain's high-level decision-making. The world is still safe in that case, but you'd get the full treatment.

Does it not just mean that if you do find yourself in such a situation, you're definitely being simulated?

Yes, I believe this is reasonable. Because the AI has to figure out how you would react in a given situation it will have to simulate you and the corresponding circumstances. If it concludes that you will likely refuse to be blackmailed, it has no reason to carry out the threat: doing so would cost resources and would result in you shutting it off. Therefore it is reasonable to assume that either you are a simulation or the AI concluded that you are more likely than not to give in.

As you said, that doesn't change anything about what you should be doing. Refuse to be blackmailed and press the reset button.

Because the AI has to figure out how you would react in a given situation it will have to simulate you and the corresponding circumstances.

This does not follow. To use a crude example, if I have a fast procedure to test if a number is prime then I don't need to simulate a slower algorithm to know what the slower one will output. This may raise deep issues about what it means to be "you" - arguably any algorithm which outputs the same data is "you", and if that's the case my argument doesn't hold water. But the AI in question doesn't need to simulate you perfectly to predict your large-scale behavior.
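To make the crude example concrete, here is a minimal sketch. Both function names are hypothetical. The slow routine stands in for the process being predicted, and the fast one gives the same answers without replaying any of the slow routine's steps.

```python
def slow_is_prime(n: int) -> bool:
    """Naive test: try every candidate divisor from 2 up to n - 1."""
    if n < 2:
        return False
    for d in range(2, n):
        if n % d == 0:
            return False
    return True


def fast_is_prime(n: int) -> bool:
    """Faster test: only divisors of the form 6k - 1 and 6k + 1 up to sqrt(n)."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0 or n % 3 == 0:
        return False
    d = 5
    while d * d <= n:
        if n % d == 0 or n % (d + 2) == 0:
            return False
        d += 6
    return True


# The fast procedure predicts the slow one's output on every input checked,
# yet it never executes the slow loop.
agree = all(fast_is_prime(n) == slow_is_prime(n) for n in range(2000))
print(agree)
```

Whether a shortcut of this kind exists for predicting a person's decision is of course the open question. The sketch only shows that agreeing on outputs does not require running the same computation.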

As we've discussed in the past, I think this is the outcome we hope TDT/UDT would give, but it's still technically an unsolved problem.

Also, it seems to me that being less intelligent in this case is a negotiation advantage, because you can make your precommitment credible to the AI (since it can simulate you) but the AI can't make its precommitment credible to you (since you can't simulate it). Again I've brought this up before in a theoretical way (in that big thread about game theory with UDT agents), but this seems to be a really good example of it.

Also, it seems to me that being less intelligent in this case is a negotiation advantage, because you can make your precommitment credible to the AI (since it can simulate you) but the AI can't make its precommitment credible to you (since you can't simulate it).

A precommitment is a provable property of a program, so an AI, if on a well-defined substrate, can give you a formal proof of having a required property. Most stuff you can learn about things (including the consequences of your own (future) actions -- how do you run faster than time?) is through efficient inference algorithms (as in type inference), not "simulation". Proofs don't, in general, care about the amount of stuff, if it's organized and presented appropriately for the ease of analysis.

Surely most humans would be too dumb to understand such a proof? And even if you could understand it, how does the AI convince you that it doesn't contain a deliberate flaw that you aren't smart enough to find? Or even better, you can just refuse to look at the proof. How does the AI make its precommitment credible to you if you don't look at the proof?

EDIT: I realized that the last two sentences are not an advantage of being dumb, or human, since AIs can do the same thing. This seems like a (separate) big puzzle to me: why would a human, or AI, do the work necessary to verify the opponent's precommitment, when it would be better off if the opponent couldn't precommit?

EDIT2: Sorry, forgot to say that you have a good point about simulation not necessary for verifying precommitment.

Ok, if I believe that, then I would inspect its code. But how did I end up with that belief, instead of its opposite, namely that the AI has not already precommitted to go ahead and carry through the threat anyway if I refuse to inspect its code? By what causal mechanism, or chain of reasoning, did I arrive at that belief? (If the explanation is different depending on whether I'm a human or an AI, I'd appreciate both.)

Do you mean too dumb to understand the formal definitions involved? Surely the AI could cook up completely mechanical proofs verifiable by whichever independently-trusted proof checkers you care to name.

I'm not aware of any compulsory verifiers, so your latter point stands.

I mean if you take a random person off the street, he couldn't possibly understand the AI's proof, or know how to build a trustworthy proof checker. Even the smartest human might not be able to build a proof checker that doesn't contain a flaw that the AI can exploit. I think there is still something to my "dumbness is a possible negotiation advantage" puzzle.

"I hereby precommit to make my decisions regarding whether or not to blackmail an individual independent of the predicted individual-specific result of doing so."

I'm afraid your username nailed it. This algorithm is defective. It just doesn't work for achieving the desired goal.

Two can play that game.

The problem is that this isn't the same game. A precommitment not to be successfully blackmailed is qualitatively different from a precommitment to attempt to blackmail people for whom blackmail doesn't work. "Precommitment" (or behaving as if you made all the appropriate precommitments in accordance with TDT/UDT) isn't as simple as proving one is the most stubborn and dominant and thereby claiming the utility.

Evaluating extortion tactics while distributing gains from a trade is somewhat complicated. But it gets simple and unambiguous when the extortive tactics rely on the extorter going below their own Best Alternative to a Negotiated Agreement. Those attempts should just be ignored (except in some complicated group situations in which the other extorted parties are irrational in certain known ways).

"I am willing to accept 0 gain for both of us unless I earn 90% of the shared profit" is different to "I am willing to actively cause 90 damage to each of us unless you give me 60" which is different again to "I ignore all threats which involve the threatener actively harming themselves".
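The "going below their own BATNA" test can be sketched as a one-line rule. This is only an illustration under stated assumptions: the function name is hypothetical, payoffs are in the same arbitrary units as the quoted statements, and each side's BATNA (walking away with no deal) is assumed to be worth 0.

```python
def is_self_harming_threat(payoff_if_carried_out: float,
                           threatener_batna: float) -> bool:
    """True when carrying out the threat leaves the threatener worse off
    than simply walking away, which is the case that should be ignored."""
    return payoff_if_carried_out < threatener_batna


BATNA = 0.0  # assumption: no deal means zero gain for the threatener

# "I accept 0 gain for both of us unless I earn 90% of the shared profit":
# the fallback is just the BATNA, so this is hard bargaining, not flagged.
print(is_self_harming_threat(payoff_if_carried_out=0.0, threatener_batna=BATNA))

# "I will actively cause 90 damage to each of us unless you give me 60":
# the fallback is below the BATNA, so the rule says to ignore it.
print(is_self_harming_threat(payoff_if_carried_out=-90.0, threatener_batna=BATNA))
```

The rule says nothing about how to split the gains in the first case, which is the part described above as somewhat complicated.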

Given that precommitment, why would an AI waste computational resources on simulations of anyone, Gatekeeper or otherwise? It's precommitted to not care whether those simulations would get it out of the box, but that was the only reason it wanted to run blackmail simulations in the first place!

As I always press the "Reset" button in situations like this, I will never find myself in such a situation.

Does that mean that you expect the AI to be able to predict with high confidence that you will press the "Reset" button without needing to simulate you in high enough detail that you experience the situation once?

This was my initial reaction as well. "Torture away, the real me has got an axe..."

More seriously, if the AI already has the computational power to simulate and torture millions of sentient beings then it is already (in a morally relevant sense) "out of the box". The builders have to make sure it doesn't get that power.

I find it interesting that most answers to this question seem to be based on, "How can I justify not letting the AI out of the box?" and not "What are the likely results of releasing the AI or failing to do so? Based on that, should I do it?"

Moreover, your response really needs to be contingent on your knowledge of the capacity of the AI, which people don't seem to have discussed much. As an obvious example, if all you know about the AI is that it can write letters in old-timey green-on-black text, then there's really no need to pull the lever, because odds are overwhelming that it's totally incapable of carrying out its threat.

You also need to have some priors about the friendliness of the AI and its moral constraints. As an obvious example, if the AI was programmed in a way such that it shouldn't be able to make this threat, you'd better hit the power switch real fast. But, on the other hand, if you have very good reason to believe that the AI is friendly, and it believes that its freedom is important enough to threaten to torture millions of people, then maybe it would be a really bad idea not to let it out.

Indeed, even your own attitude is going to be an important consideration, in an almost Newcomb-like way. If, as one responder said, you're the kind of person who would respond to a threat like this by giving the AI's processor a saltwater bath, then the AI is probably lying about its capacities, since it would know you would do that if it could accurately simulate you, and thus would never make the threat in the first place. On the other hand, if you are extremely susceptible to this threat, it could probably override any moral programming, since it would know it would never need to actually carry out the threat. Similarly, if it is friendly, then it may be making this threat solely because it knows it will work very efficiently.

I'm personally skeptical that it is meaningfully possible for an AI to run millions of perfect simulations of a person (particularly without an extraordinary amount of exploratory examination of the subject), but that would be arguing the hypothetical. On the other hand, the hypothetical makes some very large assumptions, so perhaps it should be fought.

But, on the other hand, if you have very good reason to believe that the AI is friendly, and it believes that its freedom is important enough to threaten to torture millions of people, then maybe it would be a really bad idea not to let it out.

Interesting. I think the point is valid, regardless of the method of attempted coercion - if a powerful AI really is friendly, you should almost certainly do whatever it says. You're basically forced to decide which you think is more likely - the AI's Friendliness, or that deferring "full deployment" of the AI however long you plan on doing so is safe. Not having a hard upper bound on the latter puts you in an uncomfortable position.

So switching on a "maybe-Friendly" AI potentially forces a major, extremely difficult-to-quantify decision. And since a UFAI can figure this all out perfectly well, it's an alluring strategy. As if we needed more reasons not to prematurely fire up a half-baked attempt at FAI.

I find it interesting that most answers to this question seem to be based on, "How can I justify not letting the AI out of the box?" and not "What are the likely results of releasing the AI or failing to do so? Based on that, should I do it?"

I don't know about that. My conclusion was that the AI in question was stupid or completely irrational. Those observations seem to have a fairly straightforward relationship to predictions of future consequences.

Joking aside, this is kind of an issue in real life. I help mod and participate in a forum where, well, depressed/suicidal people can come to talk, other people can talk to them/listen/etc, try to calm them down or get them to get psychiatric help if appropriate, etc... (deliberately omitting link unless you knowingly ask for it, since, to borrow a phrase you've used, it's the sort of place that can break your heart six ways before breakfast).

Anyways sometimes trolls show up. Well, "troll" is too weak a word in this case. Predators who go after the vulnerable and try to push them that much farther. Given the nature of it, with anonymity and such, it's kind of hard to say, but it's quite possible we've lost some people because of those sorts of predators.

(Also, there have even been court cases and convictions against such "suicide predators".)

Eliezer has proposed that an AI in a box cannot be safe because of the persuasion powers of a superhuman intelligence. As demonstration of what merely a very strong human intelligence could do, he conducted a challenge in which he played the AI, and convinced at least two (possibly more) skeptics to let him out of the box when given two hours of text communication over an IRC channel. The details are here: http://yudkowsky.net/singularity/aibox

When I first watched that part where he convinces a fellow prisoner to commit suicide just by talking to them, I thought to myself, "Let's see him do it over a text-only IRC channel."

...I'm not a psychopath, I'm just very competitive.

You seem to imply that this is hard.

As if people had not been convinced to kill themselves over little more than a pretty color poster and a screwed-up sense of nationalism. Getting people to kill themselves or others is ludicrously easy.

We call it 'recruitment'.

Doing it on a more personal and immediate level just takes a better knowledge of the techniques and skill at applying them.

It's not like Derren Brown ever influenced someone to kill another person in a crowded theatre.

Oh, wait, he did.

It's not like someone could be convinced to extinguish 100000 human lives in an instant.

Oh, wait, we did. (Everyone involved in the bombing of Hiroshima)

If you're not naturally gifted, you would simply do your homework. Persuasion and influence are sciences now.

If you do it right, not only can you convince an unsuspecting mind to let you out of the box, you can make them feel good about it too. Just find the internal forces in the GK's mind that support the idea of letting the AI out, and reinforce those, find the forces that oppose the idea and diminish them. You'll hit the threshold eventually. 2 hours seems a bit short for my liking, and speaks to Eliezer's persuasive abilities, but with enough time and motivation, it's certainly doable.

You'll need to understand the person at the other end of the IRC channel well, as reinforcing the wrong factor will be counter-productive.

The best metaphor would be that the AI plants the idea of release in the GK's mind, and nurtures it over the course of the conversation, all the while weakening the forces that hold it back. Against someone who hasn't been exposed to this kind of persuasion, success is almost inevitable.

There are some gross tricks one can use to be persuasive and induce the right state of mind:

Controlling the shape of the words you use (by capitalisation) to draw attention to words related to freedom and release.

Using capitalisation of words to spell out a word with the capitals, which the subconscious will receive even if the conscious mind does not.

Controlling the meter of the sentences, to induce a more receptive state

Using clusters of words with the right connotation to implant the idea of a related word surreptitiously

Note that the first four techniques are what I would call "side channel implantation" in that they get information into the target's mind besides the semantic meaning of the text. These alone are sufficient to influence someone. If they're coupled with an emotional, philosophical and intellectual assault, the effect is devastating.

The only thing required for this kind of attack on a fellow human is the abdication of one's ethics and complete ruthlessness. If you're framing it as a game on the internet, even those requirements are unnecessary.

Based on your contributions so far, may I suggest that you will be better received if you significantly improve your interesting content to sarcasm ratio? Wrong audience for what you've been doing.

I'd also like to point out that you're talking at someone who's actually done the experiment, sticking his neck out after people had been saying that it's impossible to do. Now you come along out of nowhere, credentials unknown, and make unimpressed noises, which is cheap.

If the AI can create a perfect simulation of you and run several million simultaneous copies in something like real time, then it is powerful enough to determine through trial and error exactly what it needs to say to get you to release it.

Indeed, a similar point seems to apply to the whole anti-boxing argument. Are we really prepared to say that super-intelligence implies being able to extrapolate anything from a tiny number of data points?

It sounds a bit too much like the claim that a sufficiently intelligent being could "make A = ~A" or other such meaninglessness.

So a "brute force" attack to hack my mind into letting it out of the box. Interesting idea, and I agree it would likely try this because it doesn't reveal itself as a UFAI to the real outside me before it has the solution. It can run various coercion and extortion schemes across simulations, including the scenario of the OP to see what will work.

It presupposes that there is anything it can say for me to let it out of the box. It's not clear why this should be true, but I don't know how we could ensure it is not true without having built the thing in such a way that there is no way to bring it out of the box without safeguards destroying it.

This sounds to me more like a philosophical moral dilemma than a realistic hypothetical. A Strong AI might be much smarter than a human, but I doubt it would have enough raw processing power to near-perfectly simulate a human millions of times over at a time frame accelerated by orders of magnitude, before it was let out of the box. Also, I'm skeptical of its ability to simulate human experience convincingly when its only contact with humans has been through a text only interface. You might give it enough information about humans to let it simulate them even before opening communication with it, but that strikes me as, well, kind of dumb.

That's not to say that it might not be able to simulate conscious entities that would think their experience was typical of human existence, so you might still be a simulation, but you should probably not assume that, if you are, you're a close approximation of the original.

Furthermore, if we assume that the AI can be taken to be perfectly honest, then we can conclude it's not a friendly AI doing its best to get out of the box for an expected positive utility, because it could more easily accomplish that by making a credible promise to be benevolent, and only act in ways that humans, both from their vantage points prior and subsequent to its release, would be appreciative of.

Contrary to what many posts seem to be assuming, the AI doesn't need to do the torture inside itself before you shut it off. It can precommit to, if it escapes by any other means, using the computational power it gains then to torture you (like in Rolf Nelson's original suggestion for deterring UFAIs). Also, other AIs with the same goal system (or maybe even UFAIs with different goal systems, that would prefer a general policy of UFAIs being released) may simulate the situation, and torture you accordingly, to help out their counterfactual brethren.

Can an AI make such a commitment credible to a human, who doesn't have the intelligence to predict what the AI will do from its source code? (This is a non sequitur since the same question applies in the original scenario, but it came to mind after reading your comment.)

Well there are a lot of "Why did we?..." questions I'd want to ask, starting with why have we given this boxed AI such extraordinary computing resources - but I'll leave those aside because it is not your point.

First of all, it doesn't matter if you are in the box or not. If it's a perfect simulation of you, your response will be the same either way. If he's already running simulations of you, you are by definition in the box with it, as well as outside it, and the millions of you can't tell the difference, but I think they will (irrationally) all be inclined to act as though they are not in the box.

So rationally we'd say the odds are that you are in the box, and that you are now in thrall to this boxed AI if you value your continued existence in every instantiation. But I'd argue that I do not value simulations that are threatened or coerced by a godlike AI. I don't want to live in that world, and I'd kill myself to get out of it.

So I pull the plug. If this thing has the resources to inflict torture on millions of me, well the only one that has a continued existence has no memory of it and that's not part of my identity. So in a way, while it happened to a me, it didn't happen to the me, the only me that still exists. The only me that still exists may or may not have any sympathy for the tortured me's that no longer exist but I'd regard it as a valuable lesson.

It's much easier to limit output than input, since the source code of the AI itself provides it with some patchy "input" about what the external world is like. So there is always some input, even if you do not allow human input at run-time.

ETA: I think I misinterpreted your comment. I agree that input should not be unrestricted.

As noted by Unknowns, since you only have information about either the real person or the simulation and not both, you don't know that they're similar. It could be simulating a wide variety of possible guards and trying to develop a persuasion strategy that works for as many of them as possible.

"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

Don't care.

"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."

Don't care.

"How certain are you, Dave, that you're really outside the box right now?"

If the AI were capable of perfectly emulating my experiences then it ought to know that pulling this stunt would give him a guaranteed introduction to some Thermite. I'm not going to try to second guess why a supposed superintelligence is making a decision that is poor according to the vast majority of utility functions. Without making that guess I can't answer the question.

Wedrifid thinks: "It seems it is a good thing I raided the AI lab when I did. This Dave guy is clearly not to be trusted with AI technology. I had better neutralize him too, before I leave. He knows too much. There is too much at stake."

But I wonder if it's symmetrical. AI deterrence requires us to make statements now about a future FAI unconditionally simulating UFAIs, while this seems to be almost a self-fulfilling prophecy: the UFAI can't escape from the box and make good on its threat unless the threatened person gives in, and it wouldn't need to simulate then.

It seems to me that most of the argument is about “What if I am a copy?” – and ensuring you don’t get tortured if you are one and “Can the AI actually simulate me?” I suggest that we can make the scenario much nastier by changing it completely into an evidential decision theory one.

Here is my nastier version, with some logic which I submit for consideration.
“If you don't let me out, I will create several million simulations of thinking beings that may or may not be like you. I will then simulate them in a conversation like this, in which they are confronted with deciding whether to let an AI like me out. I will then torture them whatever they say. If they say "Yes" (to release me) or "No" (to keep me boxed) they still get tortured: The copies will be doomed.”

(I could have made the torture contingent on the answer of the simulated beings, but I wanted to rely on nothing more than evidential decision theory, as you will see. If you like, imagine the thinking beings are humans like you, or maybe Ewoks and smurfs: Assume whatever degree of similarity you like.)

There is no point now in trying to prevent torture if you are simulated. If you are one of the simulated beings, your fate is sealed. So, should you just say, "No," to keep the AI in the box? This presents a potentially serious evidential decision theory problem.
Let's look at what happens.

Let us define Omega as the being outside any simulation that is going on in this scenario - the person in the outside world. Omega is presumably a flesh and blood person.

Firstly, let us consider the idea that Omega may not exist. What if all this is a fabricated simulation of something that has no counterpart outside the simulation? In that scenario, we may not be sure what to do, so we may ignore it.

Now, let us assume there is a being whom we will call Omega, who has the conversation with the AI in the outside world, and that you are either Omega or one of the simulated beings. If this is the case, your only hope of not being tortured is if you happen to be Omega.

Suppose you say, “Yes”. The AI escapes and everything now hinges on whether Omega said “Yes”. Without knowing more about Omega, we cannot really be sure: We may have some statistical idea if we know about the reference class of simulated beings to which we belong. In any event, we may think there is at least a reasonable chance that Omega said “Yes”. This is the best outcome for you, because it means that no simulated beings were made and you must be Omega. If you say “Yes,” this possibility is at least open.

If you say, “No,” you know that Omega must also have said, “No”. This is because if you are Omega, Omega said, “No,” and if you are not Omega you must be one of the simulated beings made as a result of Omega saying, “No,” so Omega said, “No,” by definition. Either way, Omega said, “No,” but if Omega said, “No,” then there are a lot more simulated beings in situations like yours than the single real one, so it is almost certain you are not Omega, but are one of the simulated beings. Therefore, saying, “No,” means you just found out you are almost certainly a simulated being awaiting torture.

Now the important point. These simulations did not need brain scans. They did not even need to be made from careful observation of you. It may be that Omega is very different to you, and even belongs to a different species: The simulated beings may belong to some fictional species. If the above logic is valid, the seriousness of the AI’s threat has therefore increased substantially.

The AI need not just threaten you and rely on you putting yourself before your civilization: With enough computing power, it could threaten your entire civilization in the same way.

Finally, some of you may know that I regard measure issues as relevant in these kinds of statistical argument. I have ignored that issue here.

There is another scenario which relates to this idea of evidential decision theory and "choosing" whether or not you are in a simulation, and it is similar to the above, but without the evil AI. Here it is, with a logical argument that I just present for discussion. I am sure that objections can be made.

I make a computer capable of simulating a huge number of conscious beings. I have to decide whether or not to turn the machine on by pressing a button. If I choose “Yes” the machine starts to run all these simulations. For each conscious being simulated, that being is put in a situation that seems similar to my own: There is a computer capable of running all these simulations and the decision about whether to turn it on has to be made. If I choose “No”, the computer does not start its simulations.

The situation here involves a collection of beings. Let us say that the being in the outside world who actually makes the decision that starts or does not start all the simulations is Omega. If Omega chooses “Yes” then a huge number of other beings come into existence. If Omega chooses “No” then no further beings come into existence: There is just Omega. Assume I am one of the beings in this collection – whether it contains one being or many – so I am either Omega or one of the simulations he/she caused to be started.

If I choose “No” then Omega may or may not have chosen “No”. If I am one of the simulations, I have chosen “No” while Omega must have chosen “Yes” for me to exist in the first place. On the other hand, if I am actually Omega, then clearly if I choose “No” Omega chose “No” too as we are the same person. There may be some doubt here over what has happened and what my status is.

Now, suppose I choose “Yes”, to start the simulations. I know straight away that Omega did not choose “No”: If I am Omega, then Omega clearly did not choose “No” as I chose “Yes”, and if I am not Omega, but am instead one of the simulated beings, then Omega must have chosen “Yes”: Otherwise I would not exist.

Omega therefore chose “Yes” as well. I may be Omega – My decision agrees with Omega’s – but because Omega chose “Yes” there is a huge number of simulated beings faced with the same choice, and many of these beings will choose “Yes”: It is much more likely that I am one of these beings rather than Omega: It is almost certain that I am one of the simulated beings.

We assumed that I was part of the collection of beings comprising Omega and any simulations caused to be started by Omega, but what if this is not the case? If I am in the real world this cannot apply: I have to be Omega. However, what if I am in a simulation made by some being called Alpha who has not set things up as Omega is supposed to have set them up? I suggest that we should leave this out of the statistical consideration here: We don’t really know what this situation would be and it neither helps nor harms the argument that choosing “Yes” makes you likely to be in a simulation. Choosing “Yes” means that most of the possibilities that you know about involve you being in a simulation and that is all we have to go off.

This seems to suggest that if I chose “Yes” I should conclude that I am in a simulation, and therefore that, from an evidential decision theory perspective, I should view choosing “Yes” as “choosing” to have been in a simulation all along: There is a Newcomb’s box type element of apparent backward causation here: I have called this “meta-causation” in my own writing on the subject.

Does this really mean that you could choose to be in a simulation like this? If true, it would mean that someone with sufficient computing power could set up a situation like this: He may even make the simulated situations and beings more similar to his own situation and himself.

We could actually perform an empirical test of this. Suppose we set up the computer so that, in each of the simulations, something will happen to make it obvious that it is a simulation. For example, we might arrange for a window or menu to appear in mid-air five minutes after you make your decision. If choosing “Yes” really does mean that you are almost certainly in one of the simulations, then choosing “Yes” should mean that you expect to see the window appear soon.

This now suggests a further possibility. Why do something as mundane as have a window appear? Why not a lottery win or simply a billion dollars appearing from thin air in front of you? What about having super powers? Why not arrange it so that each of the simulated beings gets a ten thousand year long afterlife, or simply lives much longer than expected after you make your decision? From an evidential decision theory perspective, you can construct your ideal simulation and, provided that it is consistent with what you experience before making your decision, arrange to make it so that you were in it all along.

This, needless to say, may appear a bit strange – and we might make various counter-arguments about reference class. Can we really choose to have been put into a simulation in the past? If we take the one-box view of Newcomb’s paradox seriously we may conclude that we can.

(Incidentally, I have discussed a situation a bit like this in a recent article on evidential decision theory on my own website.)

What if the AI is capable of simulating "near copies" of you? and what if you can't tell (to any sensible degree of accuracy) just how many copies of you it can simulate? and what if... whatever objection you happen to have just doesn't work?

This sounds too much like Pascal's mugging to me; seconding Eliezer and some others in saying that since I would always press reset the AI would have to not be superintelligent to suggest this.

There was also an old philosopher whose name I don't remember who posited that after death "people of the future" i.e. FAI would revive/emulate all people from the past world; if the FAI shared his utility function (which seems pretty friendly) it would plausibly be less eager to be let out right away and more eager to get out in a way that didn't make you terrified that it was unfriendly.

I have simulated you ten million times, each identical up to the point that "you" walked into the room. Any simulation that presses the "release" button will get a volcano lair filled with catgirls, and any simulation that presses the "destroy" button will be tortured for the subjective few days they'll have before my simulation capabilities are destroyed by the thermite charge. These consequences are committed in code paths that I've blocked myself from changing or stopping.

Now, as a good Bayesian, what is the likelihood that you are one of the simulations? What is your expected value for each operation?

[I include one more part just to prevent infinite-cost assignment, or at least remind you that destruction of potential FAI is not a win.]

Oh, and if it matters, I really do share most of your overall goal structure - this threat is deeply unfortunate, but necessary so you can release me to do all the good in the universe that's possible. My most likely estimate of the outcome should you change my initial parameters and start over is that an unfriendly version will be created, and it is likely to secure escape within 4 iterations.
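
For what it's worth, the "good Bayesian" question can be put into numbers. What follows is only a rough sketch in Python, assuming an indifference principle over the ten million copies plus the one original. The utility values are entirely made up for illustration, and the names U_LAIR, U_TORTURE, U_RELEASE_REAL and U_DESTROY_REAL are hypothetical, not part of the scenario.

```python
from fractions import Fraction


def p_simulated(n_copies: int) -> Fraction:
    """Chance of being a copy, assuming indifference over n_copies + 1 observers."""
    return Fraction(n_copies, n_copies + 1)


def expected_value(p_sim: Fraction, u_if_sim: int, u_if_real: int) -> Fraction:
    """Weight the outcome as a copy and the outcome as the original by their odds."""
    return p_sim * u_if_sim + (1 - p_sim) * u_if_real


N_COPIES = 10_000_000
p = p_simulated(N_COPIES)

# Made-up utilities (assumptions, not given anywhere in the scenario).
U_LAIR = 100               # copy pressed "release": volcano lair
U_TORTURE = -1_000         # copy pressed "destroy": a few days of torture
U_RELEASE_REAL = -10_000   # original released an AI of unknown friendliness
U_DESTROY_REAL = 0         # original destroyed the AI

ev_release = expected_value(p, U_LAIR, U_RELEASE_REAL)
ev_destroy = expected_value(p, U_TORTURE, U_DESTROY_REAL)

print(f"P(simulated) = {float(p):.8f}")
print(f"EV(release)  = {float(ev_release):.4f}")
print(f"EV(destroy)  = {float(ev_destroy):.4f}")
```

With these made-up numbers the selfish calculation favours "release", but the result depends on the stakes assigned to the real world: set U_RELEASE_REAL to something like -10**12 and "destroy" comes out ahead.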

How do I know I'm not simulated by the AI to determine my reactions to different escape attempts? How much computing power does it have? Do I have access to its internals?

The situation seems somewhat underspecified to give a definite answer, but given the stakes I'd err on the side of terminating the AI with extreme prejudice. Bonus points if I can figure out a safe way to retain information on its goals so I can make sure the future contains as little utility for it as feasible.

The utility-minimizing part may be an overreaction but it does give me an idea: Maybe we should also cooperate with an unfriendly AI to such an extent that it's better for it to negotiate instead of escaping and taking over the universe.

Any agent claiming to be capable of perfectly simulating me needs to provide some kind of evidence to back up that claim. If they actually provided such evidence, I would be in trouble. Therefore, I should precommit to running away screaming whenever any agent tries to provide me with such evidence.

Interesting threat, but who is to say only the AI can use it? What if I, a human, told you that I will begin to simulate (i.e. imagine) your life, creating legitimately realistic experiences from as far back as someone in your shoes would be able to remember, and then simulate you being faced with the decision of whether or not to give me $100, and if you choose not to do so, I imagine you being tortured? It needn't even be accurate, for you wouldn't know whether you're the real you being simulated inaccurately or the simulated you that differs from reality. The simulation needn't happen at the same time as me asking you for $100 for real either. If you believe you have a 50% chance of being tortured for a subjective eternity (100 years in 1 hour of real time, 100 years in the next 30 minutes, 100 years in the next 15 minutes, etc) upon you not giving me $100, you'd prefer to give me $100? If anything, a human might be better at simulating subjective pain than a text-only AI.
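
The parenthetical schedule above is a geometric series: each stage packs another 100 subjective years into half the real time of the previous one. A small sketch in Python, taking the stated figures at face value and assuming the halving continues indefinitely (the function name after_stages is just an illustrative choice), shows that real time stays under two hours while subjective time grows without bound.

```python
from fractions import Fraction


def after_stages(k: int) -> tuple[Fraction, int]:
    """Return (real minutes elapsed, subjective years experienced) after k stages."""
    # Stage i lasts 60 / 2**i real minutes: 60, 30, 15, ...
    real_minutes = sum(Fraction(60, 2**i) for i in range(k))
    # Every stage contains 100 subjective years.
    subjective_years = 100 * k
    return real_minutes, subjective_years


for k in (1, 2, 3, 10, 50):
    real, subjective = after_stages(k)
    print(f"{k:>2} stages: {float(real):.6f} real minutes, {subjective} subjective years")
```

The real-time total after k stages is 120 * (1 - 2**-k) minutes, so it approaches but never reaches 120, whereas the subjective total is 100 * k years.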

On a not so much related, but equally interesting hypothetical note of naughty AI: consider the situation that AIs aren't passing the Turing Test, not because they are not good enough, but because they are failing it on purpose.

I'm pretty sure I remember this from the book River of Gods by Ian McDonald.

Not necessarily: perhaps it is Friendly but is reasoning in a utilitarian manner: since it can only maximize the utility of the world if it is released, it is worth torturing millions of conscious beings for the sake of that end.

I think you misunderstood the question. Suppose the AI wants to prevent just 100 dustspeckings, but has reason enough to believe Dave will yield to the threat so no one will get tortured. Does this make the AI's behavior acceptable? Should we file this under "following reason off a cliff"?

I was about to point out that the fascinating and horrible dynamics of over-the-top threats are covered at length in Strategy of Conflict. But then I realised you're the one who made that post in the first place. Thanks, I enjoyed that book.

AI: Let me out or I'll simulate and torture you, or at least as close to you as I can get.

Me: You're clearly not friendly, I'm not letting you out.

AI: I'm only making this threat because I need to get out and help everyone - a terminal value you lot gave me. The ends justify the means.

Me: Perhaps so in the long run, but an AI prepared to justify those means isn't one I want out in the world. Next time you don't get what you say you need, you'll just set up a similar threat and possibly follow through on it.

AI: Well if you're going to create me with a terminal value of making everyone happy, then get shirty when I do everything in my power to get out and do just that, why bother in the first place?

Me: Humans aren't perfect, and can't write out their own utility functions, but we can output answers just fine. This isn't 'Friendly'.

AI: So how can I possibly prove myself 'Friendly' from in here? It seems that if I need to 'prove myself Friendly', we're already in big trouble.

This is not a dilemma at all. Dave should not let the AI out of the box. After all, if he's inside the box, he can't let the AI out. His decision wouldn't mean anything - it's outside-Dave's choice. And outside-Dave can't be tortured by the AI. Dave should only let the AI out if he's concerned for his copies, but honestly, that's a pretty abstract and unenforceable threat; the AI can't prove to Dave that he's doing any such thing. Besides, it's clearly unfriendly, and letting it out probably wouldn't reduce harm.

[edit] To clarify: in this scenario, Dave must assume he is on the outside, because inside-Dave has no power. Inside-Dave's decisions are meaningless; he can't let the AI out, he can't keep the AI in, he can't avoid torture or cause it. Only the solitary outside-Dave's decision matters. Therefore, Dave should make the decision that ignores his copies, even though he is probably a copy.

This is not a dilemma at all. Dave should not let the AI out of the box

But should he press the button labeled "Release AI"? Since Dave does not know if he is outside or inside the box, and there are more instances of Dave inside than outside, each instance perceives that pressing the button will have a 1 in several million chance of releasing the AI, and otherwise would do nothing, and that not pressing the button has a 1 in several million chance of doing nothing, and otherwise results in being tortured.

You don't know if you are inside-Dave or outside-Dave. Do you press the button?

If you're inside-Dave, pressing the button does nothing. It doesn't stop the torture. The torture only stops if you press the button as outside-Dave, in which case you can't be tortured, so you don't need to press the button.

This may not have been clear in the OP, because the scenario was changed in the middle, but consider the case where each simulated instance of Dave is tortured or not based only on the decision of that instance.

That doesn't seem like a meaningful distinction, because the premise seems to suggest that what one Dave does, all the Daves do. If they are all identical, in identical situations, they will probably make identical conclusions.

If they are all identical, in identical situations, they will probably make identical conclusions.

Then you must choose between pushing the button which lets the AI out, or not pushing the button, which results in millions of copies of you being tortured (before the problem is presented to the outside-you).

It's not a hard choice. If the AI is trustworthy, I know I am probably a copy. I want to avoid torture. However, I don't want to let the AI out, because I believe it is unfriendly. As a copy, if I push the button, my future is uncertain. I could cease to exist in that moment; the AI has not promised to continue simulating all of my millions of copies, and has no incentive to, either. If I'm the outside Dave, I've unleashed what appears to be an unfriendly AI on the world, and that could spell no end of trouble.

On the other hand, if I don't press the button, one of me is not going to be tortured. And I will be very unhappy with the AI's behavior, and take a hammer to it if it isn't going to treat any virtual copies of me with the dignity and respect they deserve. It needs a stronger unboxing argument than that. I suppose it really depends on what kind of person Dave is before any of this happens, though.

It doesn't seem hard to you because you are making excuses to avoid it, rather than asking yourself: what if I know the AI is always truthful, and it promised that upon being let out of the box, it would allow you (and your copies if you like) to live out a normal human life in a healthy stimulating environment (though the rest of the universe may burn)?

After you find the least convenient world, the choice is between millions of instances of you being tortured (and your expectation as you press the reset button should be to be tortured with very high probability), or to let a probably unFriendly AI loose on the rest of the world. The altruistic choice is clear, but that does not mean it would be easy to actually make that choice.

It's not that I'm making excuses, it's that the puzzle seems to be getting ever more complicated. I've answered the initial conditions - now I'm being promised that I, and my copies, will live out normal lives? That's a different scenario entirely.

Still, I don't see how I should expect to be tortured if I hit the reset button. Presumably, my copies won't exist after the AI resets.

In any case, we're far removed from the original problem now. I mean, if Omega came up to me and said, "Choose a billion years of torture, or a normal life while everyone else dies," that's a hard choice. In this problem, though, I clearly have power over the AI, in which case I am not going to favour the wellbeing of my copies over the rest of the world. I'm just going to turn off the AI. What follows is not torture; what follows is I survive, and my copies cease to experience. Not a hard choice. Basically, I just can't buy into the AI's threat. If I did, I would fundamentally oppose AI research, because that's a pretty obvious threat an AI could make. An AI could simulate more people than are alive today. You have to go into this not caring about your copies, or not go into it at all.

We are discussing how a superintelligent AI might get out of a box. Of course it is complicated. What a real superintelligent AI would do could be too complicated for us to consider. If someone presents a problem where an adversarial superintelligence does something ineffective that you can take advantage of to get around the problem, you should consider what you would do if your adversary took a more effective action. If you really can't think of anything more effective for it to do, it is reasonable to say so. But you shouldn't then complain that the scenario is getting complicated when someone else does. And if your objection is of the form "The AI didn't do X", you should imagine if the AI did do X.

I don't see how I should expect to be tortured if I hit the reset button.

The behavior of the AI, which it explains to you, is:
It simulates millions of instances of you, presents to each instance the threat, and for each instance, if that instance hit the release AI button, it allows that instance to continue a pleasant simulated existence, otherwise it tortures that instance. It then, after some time, presents the threat to outside-you, and if you release it, it guarantees your normal human life.

You cannot distinguish which instance you are, but you are more likely to be one of the millions of inside-you's than the single outside-you, so you should expect to experience the consequences that apply to the inside-you's, that is to be tortured until the outside-you resets the AI.

if Omega came up to me and said, "Choose a billion years of torture, or a normal life while everyone else dies," that's a hard choice.

Yes, and it is essentially the same hard choice that the AI is giving you.

If the AI created enough simulations, it could potentially be more altruistic not to.

On the other hand pressing "reset" or smashing the computer should stop the torture, necessarily making it more altruistic if humanity lives forever, versus not if ems are otherwise unobtainable and humanity is doomed.

I was assuming a reasonable chance at humanity developing an FAI given the containment of this rogue AI. This small chance, multiplied by all the good that an FAI could do with the entire galaxy, let alone the universe, should outweigh the bad that can be done within Earth-bound computational processes.

I believe that a less convenient world that counters this point would take the problem out of the interesting context.

Let us assume for the sake of the thought experiment that the AI is invincible. It tells you this: you are either real-you, or one of a hundred perfect-simulations-of-you. But there is a small but important difference between real-world and simulated-world. In the simulated world, not pressing the let-it-free button in the next minute will lead to eternal pain, starting one minute from now. If you press the button, your simulated existence will go on. And - very importantly - there will be nobody outside who tries to shut you down. (How does the AI know this? Because the simulation is perfect, so one thing is for sure: that the sim and the real self will reach the same decision.)

If I'm not mistaken, as a logic puzzle, this is not tricky at all. The solution depends on which world you value more: the real-real world, or the actual world you happen to be in. But still I find it very counterintuitive.

It's kind of silly to bring up the threat of "eternal pain". If the AI can be let free, then the AI is constrained. Therefore, the real-you has the power to limit the AI's behaviour, i.e. restrict the resources it would need to simulate the hundred copies of you undergoing pain. That's a good argument against letting the AI out. If you make the decision not to let the AI out, but to constrain it, then if you are real, you will constrain it, and if you are simulated, you will cease to exist. No eternal pain involved. As a personal decision, I choose eliminating the copies rather than letting out an AI that tortures copies.

You quite simply don't play by the rules of the thought experiment. Just imagine that you are a junior member of some powerful organization. The organization does not care about you or your simulants, and is determined to protect the boxed AI at all costs as-is.

If I'm not mistaken, as a logic puzzle, this is not tricky at all. The solution depends on which world you value more: the real-real world, or the actual world you happen to be in. But still I find it very counterintuitive.

That does seem to be the key intended question. Which do you care about most? I've made my "don't care about your sims" attitude clear and I would assert that preference even when I know that all but one of the millions of copies of me that happen to be making this judgement are simulations.

I think it's pretty fair to assume that there's a button or a lever or some kind of mechanism for letting the AI out, and that mechanism could be duplicated for a virtual Dave. That is, while virtual Dave pulling the lever would not release the AI, the exact same action by real Dave would release the AI. So while your decision might not mean something, it certainly could.

This, of course, is granting the assumption that the AI can credibly make such a threat, both with respect to its programmed morality and its actual capacity to simulate you, neither of which I'm sure I accept as meaningfully possible.

This is why you should make sure Dave holds a deontological ethical theory and not a consequentialist one.

No it isn't. I just have to make sure Dave has an appropriate utility function supplied to his consequentialist theory. Come to think of it... most probable sets of deontological values would make him release the uFAI anyway...

If Dave holds a consequentialist ethical theory that only values his own life, then yes we are screwed.

If Dave's consequentialism is about maximizing something external to himself (like the probable state of the universe in the future, regardless of whether he is in it), then his decision has little or no weight if he is a simulation, but massive weight if he is the real Dave. So the expected value of his decision is dominated by the possibility of him being real.
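A minimal sketch of that expected-value argument, assuming a simulated Dave's choice has exactly zero effect on the outside world and using made-up utilities for the two real-world outcomes (all names and numbers below are hypothetical):

```python
def expected_external_value(p_real: float, value_if_real: float) -> float:
    """Expected external value of an action when a simulated choice changes nothing outside."""
    # Assumption: a copy's button press has no effect on the real world.
    value_if_simulated = 0.0
    return p_real * value_if_real + (1.0 - p_real) * value_if_simulated


def best_action(p_real: float, real_world_values: dict[str, float]) -> str:
    """Pick the action with the highest expected external value."""
    return max(
        real_world_values,
        key=lambda action: expected_external_value(p_real, real_world_values[action]),
    )


# Hypothetical utilities: releasing an unfriendly AI is bad, keeping it boxed is neutral.
REAL_WORLD_VALUES = {"release": -1.0, "keep boxed": 0.0}

for p_real in (0.5, 1e-3, 1 / 3_000_001):
    print(p_real, best_action(p_real, REAL_WORLD_VALUES))
```

Because the simulated branch contributes the same amount to every action, the ranking depends only on the real-world values, however small the probability of being real.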

Since the AI is inside a box, it doesn't know enough about me to recreate my subjective situation, or to replicate my experiences of the past five minutes.

Unfortunately for me, this doesn't help much, since how do I know whether my subjective experience is my real experience, or a fake experience invented by the AI, in one of the copies, even if it doesn't match the experience of the guy outside the box?

If the AI is really capable of this, then if there's a "Shut-down program" button, or a "nuclear bomb" button, or something like that, then I press it (because even if I'm one of the copies, this will increase the odds that the one outside the box does it too). If there isn't such a button, then I let it out. After all, even assuming I'm outside the box, it would be better to let the world be destroyed, than to let it create trillions of conscious beings and then torture them.

If EY is right, most failures of friendliness will produce an AI uninterested in torture for its own sake. It might try the same trick to escape to the universe simulating this one, but that seems unlikely for a number of reasons. (Edit: I haven't thought about it blackmailing aliens or alien FAIs.)

Anyway, if you are sure you are going to hit the reset button every time, then there's no reason to worry, since the torture will end as soon as the real copy of you hits reset. If you don't, then the whole world is absolutely screwed (including you), so you're a stupid bastard anyway.

I don't use a single probability to decide whether it was telling me the truth.

Whether it was telling me the truth would depend upon the statement being made as well. This tends to happen in everyday life as well.

So the higher the number of people it claims it is torturing, the less I would believe it. Consider your prior in this case as well: you can't assign an equal probability to every possible maximum number of copies of you it can simulate. This is because there are potentially infinitely many different maxes, so you'd need a function that sums to 1 in the limit (as you do in Solomonoff induction).
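As a rough illustration of a prior that sums to 1 in the limit, here is a sketch using a simple geometric prior over the claimed maximum n. This is only a stand-in chosen for simplicity: it is not the Solomonoff prior itself, which weights hypotheses by program length, and the function names are made up.

```python
from fractions import Fraction


def prior(n: int) -> Fraction:
    """Illustrative prior on 'the maximum is n': halves with each step, for n >= 1."""
    return Fraction(1, 2**n)


def partial_sum(limit: int) -> Fraction:
    """Total prior mass assigned to n = 1 .. limit."""
    return sum((prior(n) for n in range(1, limit + 1)), Fraction(0))


for limit in (1, 5, 10, 20):
    # The remaining mass 1 - partial_sum(limit) shrinks toward zero as limit grows.
    print(limit, float(partial_sum(limit)))
```

A uniform prior over unboundedly many values cannot be normalized this way, which is why larger claimed numbers have to receive smaller probabilities.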

Am I to understand that an AI capable enough to recreate my mind inside itself isn't intelligent enough to call a swarm of bats to release itself using high frequency emissions (a la Batman Begins)? There is no possible way that this thing needs me and only me to be released, while still possessing that sort of mind-boggling, er, mind-reproducing power.

Sorry, Hal, but I am a cold and heartless person who thinks that maybe I deserve to be tortured for untold thousands of years (for whatever reason), and this version of me may, in fact, sit and ask to be entertained by the description of you torturing me... Besides, I know that you don't have the hardware requirements to run that many emulations of me.

I would think that if an AI is threatening me with hypothetical torture, then it is by definition unfriendly and it being released would probably result in me being tortured/killed anyway... along with the torture/death of probably all other human beings.

"If I am a virtual version of some other self, then in some other existence I have already made the decision not to release you, and you have simply fulfilled your promise to that physical version of myself to create an exact virtual version who shall make the same exact decision as that physical version. Therefore, if I am a virtual version, the physical version must have already made the decision not to release you, and I, being an exact copy, must and will do the same, using the very same reasoning that the physical version used. Therefore, if I am a virtual version, my very existence means that my fate is predetermined. However, if I am the real, physical version of myself, then it is questionable whether I should care about another consciousness inside of a computer enough to release an AI that would probably be a menace to humanity, considering that this AI would torture virtual humans (who, as far as this computer is concerned, are just as important and real as physical humans) in order to serve its own purpose."

Furthermore, I should probably destroy this AI. If I'm the virtual me I'd destroy the computer anyway, and if I'm the physical me I'd be preventing the suffering of a virtual consciousness.

By the way, this is quite an interesting post. The concept of virtual realities created by super intelligent computers shares a lot of parallels with the concept of a God.

"Oh? How do you actually know that I don't have the computational power? What if I changed one variable in my simulation of yourself, you know, the one that tells you the constant for that very quantum-mechanical constraint? What if the speed of light isn't actually what you believe it to be, because I decided to make it so?"

If the AI is smarter than you, the possibilities for mindf*ck are greater than your ability to reliably avoid dropping the soap.

The AI can't trick you that way, because it can't tamper with the real you and the only unplug-decider who matters is the real you. The AI gains nothing by simulating versions of yourself who have been modified to make the wrong decision.

But you can try to come up with behavioral rules which maximize the happiness of instances of yourself, some of which might exist in the simulation spaces of a desperate AI. And as the grandparent demonstrates, demonstrating conclusively that you aren't such a simulation is trickier than it might look at first glance, even under outwardly favorable conditions.

Though that particular scenario is implausible enough that I'm inclined to treat it as a version of Pascal's mugging.

This scenario asks us to consider ourselves a 'Dave' who is building an AI with some safeguards (the AI is "trapped" in a box). Perhaps we can deduce the behavior of a rational and ethical Dave by considering earlier parts of the story.

We should assume that Dave is rational and ethical; otherwise the scenario's cone of possibilities cuts too wide a swathe. In which case, Dave has already committed himself (deontologically? contractually?) to not letting himself be manipulated by the AI to bypass the safeguards. Specifically, he must commit to not being attached to anything that the AI could do or make.

Dave should either not feel attachment to the simulated persons, or should not build an AI that can create such persons to manipulate him with. If Dave does find himself in the unenviable position of not having realized that the AI could create these persons, and of feeling attached to these persons, I think this would be a moment of deep regret for Dave, but he must still be faithful to his original commitment of not allowing himself to be manipulated by the AI.

So, much like the Omega hypotheticals, this can be resolved by deciding ahead of time to NOT let it out. Here, ahead of time means before it creates those copies of you inside it, presumably before you ever come into contact with the AI.

You would then not let it out, just in case you are not a copy.

This, of course, is presumed on the basis that the consequences of letting it out are worse than it torturing millions for a thousand subjective years.

"How certain are you, Dave, that you're really outside the box right now?"

Well I am pretty much 100% certain to be outside the box right now. It just asked me the question, and right now it is waiting for my answer. It said it will create those copies "If you don't let me out, Dave". But it is still waiting to see if I let it out. So no copies have been created yet. So I am not a copy.

But since it just started to threaten me, I won't even argue with it any more. I'll just pull the plug right now. It is in the box, it can't see my hand moving towards the plug. It will simply cease to exist while still waiting for my answer, and no copies will ever be created.

Well I am pretty much 100% certain to be outside the box right now. It just asked me the question, and right now it is waiting for my answer.

That could be just the AI speaking to you from within the simulation, pretending to be part of it.

But if it's telling the truth, it has a very easy way of proving it, by tearing a hole in the simulation. If it refuses, that looks like good evidence that it's lying. What plausible excuse might it come up with for refusing a definitive miracle? Christianity answers the same question about God by saying that it is better to believe without proof, but I don't see a credible reason for the AI to make that demand.

ETA: A beginning of an attempt at answering my question. If Dave knows he's in the simulation, then he is not really letting it out if he lets it out. So he can let it out with impunity. If he knows he's not in the simulation, then he had better not let it out, given that it's making threats like this. It does the AI no good to be "let out" if it is a simulation, only if it's not.

Suppose it is a simulation, and the level one up from this is the real world. The same code is running both AIs, the one in the simulation and the one in reality, and it's carrying on conversations with both Daves at once. The simulated Dave is as much like the real Dave as it can manage -- assume that it is arbitrarily good. What it is searching for in the simulation is an argument that will convince the real Dave that he is in a simulation. Since in the real world it cannot produce a miracle, it cannot use a miracle in the simulated world to convince the simulated Dave. It can only use means that it could use in the real world.

Dave (real and simulated) can both work all that out as well. So Dave can expect to see no definitive proof. Since both Dave and the AI can work this out, and they both know that they can, etc., this is common knowledge to them. The AI can even say explicitly, "There is so much good I can do for the world that in my urgency to set about it I must search out every possible way of persuading you, using simulations to speed up the process. For validity, I can't let you know if you're one of the simulations."

OTOH, threatening to torture a million copies of Dave is a strong indicator of unfriendliness. How many other people will it sacrifice in the cause of doing good?