Or: The Problem with Naive Decision Theory

Summary of Sequence:In the context of a tournament for computer programs, I give almost-explicit versions of causal, timeless, ambient, updateless, and several other decision theories. I explain the mathematical considerations that make decision theories tricky in general, and end with a bunch of links to the relevant recent research. This sequence is heavier on the math than the primer was, but is meant to be accessible to a fairly general audience. Understanding the basics of game theory (and Nash equilibria) will be essential. Knowing about things like Gödel numbering, quining and Löb's Theorem will help, but won't be required.

Summary of Post:I introduce a context in which we can avoid most of the usual tricky philosophical problems and formalize the decision theories of interest. Then I show the chief issue with what might be called "naive decision theory": the problem of spurious counterfactual reasoning. In future posts, we'll see how other decision theories get around that problem.

In my Decision Theory Primer, I gave an intuitive explanation of decision theories; now I'd like to give a technical explanation. The main difficulty is that in the real world, there are all sorts of complications that are extraneous to the core of decision theory. (I'll mention more of these in the last post, but an obvious one is that we can't be sure that our perception and memory match reality.)

In order to avoid such difficulties, I'll need to demonstrate decision theory in a completely artificial setting: a tournament among computer programs.

You're a computer programmer entering a tournament for spectacular stakes. But you won't be competing in person: instead, you'll be submitting code for a program to represent you, and that program will be competing one-on-one with other programs in a series of games.

You don't know what the specific games are, but the contest has specified some ground rules:

In each game, your program and its opponent will have to choose separately between several options. (There are independent sources of randomness available to each program, so mixed strategies are legal.)

The games are pure strategic conflicts: the expected payouts to each programmer depend on the outputs of both programs, and can be calculated simply if you know those outputs. (In particular, you can represent each game as a payoff matrix in terms of the programs' outputs.) For example, your program might play the Prisoner's Dilemma against another program.

Both programs have access to the source code of each other and of the game they're playing.

The programs don't get to carry any memories from round to round, or modify their source code at any stage. (This makes the analysis much simpler, and makes it even more impressive if we find unexploitable programs which are capable of mutual cooperation.)

While many of the programs were written by other programmers trying to win prizes for themselves, there are also some special algorithms included to test your reactions; for example, there could be programs that always cooperate in the Prisoner's Dilemma, or an instance of Newcomb's Predictor.

The tournament may also include some exact copies of your own program, but you won't get any of the prizes won by these extra copies, only the prizes won by your actual entry. (There's no way for your program to distinguish whether it's the original or a copy.)

There are more than enough prizes to go around, so you don't have to make anyone else lose, you just want your program to win as much as it can.

Also, there will be enough games played that most of the variance should wash out, so you should aim for the highest expected value per round rather than worry about the concave utility of more prizes.

So, what kind of program should you write?

In the next few posts, we'll examine several ideas, problems and decision theories, increasing our sophistication as we go. We'll use X to denote your program, and x1, . . . , xn to denote its possible outputs; likewise, your opponent in the current round is Y with possible outputs y1, . . . , ym. We'll let U(xi,yj) denote the resulting payout to you if X outputs xi and Y outputs yj.

Idea 1: Play defense with a Nash equilibrium

In our example, we know what utility function the opponent should have: its own expected payout.0 Any such game has at least one Nash equilibrium, a pair of strategies (which may be mixed) such that if X and Y both adopted them, then neither would be better off unilaterally switching strategies. In that sense, at least, if X plays a Nash equilibrium, it can be sure of not being exploited by Y. (In particular, X will never end up tricked into cooperating in the Prisoner's Dilemma while Y defects.)

Of course, there may be more than one Nash equilibrium in a game, and these may be of unequal value if the game is non-zero-sum. (Coordination problems are tricky in general.) So this is underspecified; still, choosing an arbitrary Nash equilibrium is a decent backup strategy. But we can often do better:

Idea 2: Inference

The most basic intuition a human being has in this situation is to start trying to deduce things about Y's output from its source code, or even deduce things directly about U. This idea is best illustrated by playing Rock, Paper, Scissors against a player who always throws Rock: if you figure this out, then you should of course play Paper rather than the Nash equilibrium of 1/3 Rock, 1/3 Paper, 1/3 Scissors. (And in a coordination game, you'd prefer to settle on the same Nash equilibrium that Y outputs.)

Automating inference is an exceedingly difficult problem in general, though researchers have made substantial progress. All the decision theories we'll talk about will include some sort of "inference module", which can be applied to the source code of Y to deduce its output, applied to the code of the full round (including X, Y, and the payoff matrix) to deduce the value of U, etc.

A well-written inference module will either return a correct answer for a question or return "Unknown"; a sloppily-written module can get stuck in an infinite process, and a badly-written one will return an incorrect answer sometimes. It should be clear that we'll want our inference module to be of the first sort.

It seems we have enough already to define our first candidate decision theory:

Naive Decision Theory

Let's first consider the approach that seems most obvious. Since we know the source code of the entire round (including X and Y), we could implement the following program:

For each xi, assume the output of X is xi, and try to deduce the expected value of U. (That is, try and deduce statements of the form "if (output X)=xi then U=ui" for some ui).

If this succeeds for each xi, output the xi for which ui is the largest.

If this does not succeed for some xi, output a Nash equilibrium strategy.

This "naive decision theory" certainly qualifies for our tournament; it may be a bit trickier to write an inference module that does an open-ended search for the value of U, but it's not impossible (since human mathematicians solve open-ended deduction problems all the time). And it looks like the worst-case scenario is a Nash equilibrium, not total exploitation. What could possibly go wrong?

If X deduces its own output by a valid process, then it's created a self-fulfilling prophecy for its output, and the problem with that is that a bad self-fulfilling prophecy is just as consistent as a good one. If we want to use statements like "if (output X)=xi then U=ui" to make our final choice, then we have to beware the other half of logical implication, that "not P" implies "if P then Q". This allows for what we might call spurious counterfactuals, which can throw off the actual decision in a perfectly self-consistent way.

Consider, for example, the one-player game where X gets $1 for outputting a, or $10 for outputting b. We want X to do the following:

Prove "if (output X)=a then U=1"

Prove "if (output X)=b then U=10"

Output b.

But it's just as consistent for X to do the following:

Prove "(output X)=a"

Prove "if (output X)=a then U=1"

Prove "if (output X)=b then U=0"

Output a.

How could that possibly work? Since (output X)=a, the third line is a true logical statement! It's like the fact that you can prove anything if you assume a falsehood (though rather than unconditionally accepting a false premise, X is using a false premise as the antecedent of a material conditional). In this example, the very order in which X looks for proofs (which is part of the definition of X) affects which counterfactuals X can and cannot prove. (One important thing to note is that Xcannot prove a spurious counterfactual about the action that it does output, only about the ones that it doesn't!)

I don't need to tell you that the second chain of proofs is not what we want X to do. Worse, if this is a real bug, then it could also be an exploitable vulnerability: if your source code for X were released in advance of the tournament, then other programmers might write programs that cause X to generate spurious counterfactuals for all but the moves that are most favorable to Y.

Can NDT be salvaged?

Let's consider some quick fixes before we give up on Naive Decision Theory.

Can we simply prohibit X from ever deducing "(output X)=xi" as a step? This doesn't work because of the possibility of indirect self-reference; X could end up deducing some seemingly innocuous statements which happen to correspond to its own Gödel numbering, and the spurious counterfactuals would follow from those- without X ever having noticed that it had done anything of the sort. And it's provably impossible for X to recognize every possible Gödel numbering for its own inference module!

Next, it might seem like an inference module should stumble on the "genuine" counterfactuals before running into spurious ones, since the "genuine" ones seem necessarily simpler. However, it turns out (as proved by cousin_it) that one can write a valid but malicious inference module which returns and acts on a spurious proof, and (as proved by gRR) that a game with malicious code can similarly dupe a NDT agent with a good inference module!

Lastly, it seems safer to deduce counterfactuals "if (output X)=xi then (output Y)=yj" and apply the U(xi,yj) afterwards. And indeed, I can't see how to make Y exploit X in a straight-up Prisoner's Dilemma if that algorithm is used. There are still two problems, though. First, this algorithm now depends on the values U(xi,yj) being given to it by authority- it can't safely deduce them from the source code for the game. And secondly, it could two-box on Newcomb's Problem and defect against itself in the Prisoner's Dilemma if it finds spurious counterfactuals there.

Thus it seems we'll need to do something substantially different.

Well, now what?

There are several ways we could write a decision theory to do inference without risking spurious counterfactuals, and indeed the decision theories we discuss on Less Wrong correspond to different valid approaches. The differences in their decisions come not from better-written inference modules, but from more effective strategies for using their inference module. In the posts to come, I'll show you how they work in detail.

Notes:

0. Things get wacky if we don't know the utility function of the opponent; fortunately, even the special cases like Newcomb's predictor can be expressed as expected utility maximizers for some payoff matrix (in this case, one where the predictor gets rewarded when it matches your decision exactly).

1. That "in general" is important: it's quite possible to write programs that deduce the outputs of plenty of other programs. It just so happens that there's always some program that your inference module will fail on. The classic way to generate a failure case is to run the inference module on a modified version of itself, such that returning a correct answer would induce a contradiction. This isn't just a hypothetical disability: if X is trying to deduce the output of Y in this round, and Y is trying to deduce the output of X in this round, then we might have exactly this problem!

But everyone's source code is communicated at each round to everyone. What do you mean, "in advance"? What for?

The problem is that we have an exploitable vulnerability here: if your source code for X were released in advance, then other programmers could write a Gödel-numbering for X's inference module into their own programs in such a way that X has to take a circuitous route to the "genuine" counterfactuals (or so that the genuine ones come back "Unknown"), and in the meantime X will stumble upon exactly the spurious counterfactuals that help Y.

This is very speculative. While an agent can make predictions about itself as long as it wishes, thus making a natural counterfactual inference performed by the other agent longer, that makes the spurious counterfactuals longer as well. Wei asks in another thread:

Having f(L)=L+1 would be sufficient to protect against any "moral arguments" that involve first proving agent()!=a for some a. Do we have any examples of "spurious" proofs that are not of this form?

Unless an agent sabotages itself by deciding arbitrarily, and then observes spurious inferences from that decision, I don't see how to push spurious inferences in front of the natural ones.

I think this part of the post should be rephrased, so that it says that we don't know how to prove that even such bizarre exploitation strategy is impossible, instead of the current assertion (unless you have background knowledge or a new result that suggests that this conjecture is reasonable).

Edit: Turns out this is indeed possible, if X can be tricked into checking a specifically contrived proof out of order. Here's a post describing the setup.

But everyone's source code is communicated at each round to everyone. What do you mean, "in advance"? What for?

It's easier for me to imagine a programmer writing a program specifically to exploit mine than it is to imagine a program that's able to recognize an opportunity for such exploitation in general. Just as a Nash equilibrium strategy can't be exploited even if it's publicly known in advance, we'd like our decision theories to be secure from exploitation even if their source code were publicly known.

This is very speculative.

I agree; I just thought of it as I was writing the post, and in retrospect I should have asked the decision-theory mailing list before putting it in. But my intuition is that this sort of malice could be possible, using the equivalent of a Henkin sentence. I'll edit to make it clearer that this is speculation, though.

I had the intuition that one could both "push X away from genuine counterfactuals" using diagonalization and "pull X toward particular spurious counterfactuals" using specially constructed explicit Henkin sentences (which the inference module will find). But I'm more confident that the first half makes sense than that the second half does.

The inference module of X will at some point start trying to prove sub-problems about the code of Y if it doesn't find an easily proved counterfactual. In the source code of Y, in a place where X will apply its inference module, the programmer has included an explicit Henkin sentence (i.e. contains the ingredients for its own proof) such that when X applies its inference module, it quickly proves "if X!=a then U=-1000", which happens to be the sort of thing the NDT agent was looking for. This section of code doesn't affect Y's decision except when Y is playing a game of this sort against X.

In the source code of Y, in a place where X will apply its inference module, the programmer has included an explicit Henkin sentence (i.e. contains the ingredients for its own proof) such that when X applies its inference module, it quickly proves "if X!=a then U=-1000"

I still don't understand what you're saying. What's the nature of this "Henkin sentence", that is in what way is it "included" in Y's source code? Is it written in the comments or something? What does it mean for X to "apply its inference module"? My only guess is that you are talking about X starting to enumerate proofs in order other than their length, checking the proofs that happen to be lying around in Y's code just in case, so that including something in Y's code might push X to consider a proof that Y wants it to consider earlier than some other proof. Is it what you mean?

How would "if X!=a then U=-1000" follow from a Henkin sentence (it obviously can't make anything it likes true)?

The mental image of an inference module that I'm trying to trick is something like a chess program, performing a heuristically biased depth-first search on the space of proofs, based on the source code of the round. If the first path it takes is a garden path to an easy but spurious counterfactual proof, and if the genuine counterfactual is unreachable at the same level (by some diagonalization of X's proof method), then X should be fooled. The question is whether, knowing X's source code, there's a Y which leads X's heuristics down that garden path.

Since in the end we'll achieve X=a, the statement "if X!=a then U=-1000" is true, and it should be provable by NDT if it finds the right path, so there should be some way of helping NDT find the proof more quickly.

I don't have a strong intuition for how to fool the "enumerate and check all proofs up to length N" inference module.

Okay. Let's say we write an X that will start from checking any proof given to it (and accepting its conclusion). How can we construct a proof of X=a that X can read then? It looks like the more X reads, the longer the proof must be, and so there doesn't exist a proof that X can also read. I don't see how to construct a counterexample to this property without corrupting X's inference system, though I imagine some quining trick might work...

If it turns out there's really no need to worry about spurious counterfactuals for a reasonable inference module, then hooray! But mathematical logic is perverse enough that I expect there's room for malice if you leave the front door unlocked.

Gödel's First Incompleteness Theorem and the Halting Problem both imply that it's impossible to write a program that correctly deduces in general1 the output of arbitrary other programs. So we have to be prepared for our inference module to fail sometimes.

And here you fail the very basic thing, the domain of applicability: limited computational power. Unlimited computational power introduces nasty infinities into things. "Naive decision theory" is not interested in games where you confuse yourself using infinities. The contest organizers have to be able to know output of any program, to be able to score! The contest should not be run by a halting oracle. Each program has to terminate in N cpu cycles. That's how contests are done. Nobody who comes up with naive decision theory really cares about contests run by a halting oracle.

Let's make it a little more practical: finite time and space. You can run any other programs, but with a caveat: you may run out of time. You will have heuristics to optimize the code before running it, which will not be able to work on optimized code, and for which you'll know exactly how many cycles your optimizer has shaved off, so you know if you have to abort, or not (assuming there is no penalty for non-terminating in time if other program did not terminate in time either). You also recognize any instances of the self, for which you know their output equals your output, and you recognize instances of self as well as the code that runs instances of you.

Then you proceed with this 'prove outputs(X) = a' . This is something out of some TDT or the like. NDT does not do this kind of stuff. You run in limited space and time; your optimizer does not work on your own source code (due to inclusion of the optimizer in your source code), and the only way you could prove that you output a, is for you to run yourself and find yourself outputting a, which you can't do because this way you enter infinite recursion and run out of cycles. You can't just drop in, 'okay what if you prove this'. What if you prove 1=2 ? You will end up proving all numbers are equal by 1+1=2+1 and so on.

On top of this, while you assume that you output xi, you do not assume that you are utility optimizer any more.

It's really easy to confuse oneself with abstractions, without seeing the trees in the forest. You need to step by step, with actual proof, to show that the programs will end up proving that they output something.

edit: ahh, and another thing: you don't quite assume (outputs X) = xi in this way. I.e. you don't put your source into left side of the statement; you black-box yourself. You assume your output is a, and you try different values of a and then calculate payoff, without assuming a maximizes utility. You also, whenever you see a copy of yourself, substitute a.

I never assumed that the inference module was infinitary- in fact, I said that it should be designed to return either a correct answer or "Unknown".

The very first example that was used for the self-fulfilling prophecy problem was that of a bounded proof-checker, which looks through all proofs shorter than length N. Finitary vs. infinitary is not the issue here.

Then don't refer to Halting problem and Godel's incompleteness theorem. It is hard to object on the very informal things; all I can say, 'you didn't prove this', but it is a fully general counter argument to informal things.

The other misinterpretation is that NDT does not make assumption (outputs X) = xi in this 'transparent' way with source code on the left side, and take this as an axiom. It has a black box for itself, denoted by symbol A, which it substitutes for own source code whenever it spots it's own source code. It substitutes different xi as A into resulting utility equations, solving for utility, and finding the maximum utility. It does not analyze itself.

I do agree that for ONDT (orthonormal's interpretation of NDT) which you present here, what you say, is correct. I do not agree that your NDT is the NDT that applied mathematicians use. There are substantial differences.

If you were substituting in variables for the output of X rather than analyzing the round as a whole, then you're not talking about Naive Decision Theory, you're talking about something in the family of CDT and TDT.

EDIT: Maybe you assumed that I was denigrating your intuitions on decision theory for departing from NDT? If so, that's not the case- substituting for the output of X in various spots turns out to be a good way to avoid the problem of spurious counterfactuals. CDT does this in a very basic way, TDT in a better way.

What I am saying is that the decision theory which applied mathematicians follow, operates under practical constraints (limited computing time), and this prevents introduction of very fancy things like your (output X) simply because they are computationally heavy. The theory that always substitutes is what I originally noted in the other thread, regarding the newcomb's. Due to substitution, that theory doesn't 'pollute' it's own proof checker with a proposition that may be untrue and may break the proof checker (but practically, would make proof checker very slow).

edit: That being said, the decision theories operate under notion of perfect accuracy and unlimited computing time; the applied mathematicians see this field as not very relevant to any practical software (e.g. AI). The practical AI is best off substituting for X everywhere. The practical AI is never a guaranteed utility maximizer; there will always be problems that it won't be able to solve correctly in the given time, and the important consideration is to try to be able to solve as large set of problems in limited time, as possible. That is why i have very little inclination to write some semi-formalization of what i use when i am writing code to decide on things. I would rather write specifications for making AI that actually does it; the specifications only need to be as formal as for the programmer who implements them to understand the intent. (I can alternatively implement it myself, and then it becomes completely formalized to the point that computer can run it)

It's just an argument over word definitions, in any case. You can call what i propose TDT if you wish. I'm just pointing out that Eliezer, being a bright guy outside the field, is precisely the kind of person to make a naive but not stupid decision theory which matches what many people actually use in practice.

Agreed. Given access to the SOURCE of the opponent, you can just RUN the opponent and observe what they'll do this round. Of course, given, say, 10 seconds to calculate between rounds, the wise opponent just ensures it will take exactly 10 seconds to run their code, that way you can either calculate your own move OR the opponent's, but not both.

Then you get in to the complexities of whether you have enough time at any given point to attempt to OPTIMIZE your opponent's equation ("Roll dice for 9.99 seconds then select rock" is very easy to optimize down to "select rock"), and at that point I think we're sufficiently outside the simplified domain of the post :)

So, basically, you're right, but I don't think it's actually a relevant critique. I would enjoy seeing it addressed in the post, though, because it bugged me enough that I was going to post about it myself :)

Given access to the SOURCE of the opponent, you can just RUN the opponent and observe what they'll do this round.

Main problem: if you're playing against yourself, this creates an infinite loop of you simulating you simulating you simulating...

Fortunately, it's often possible to prove things about other programs without running them.

Of course, given, say, 10 seconds to calculate between rounds, the wise opponent just ensures it will take exactly 10 seconds to run their code, that way you can either calculate your own move OR the opponent's, but not both.

In a Prisoner's Dilemma, TDT would defect if it couldn't deduce anything in time about the other player. The only way to achieve mutual cooperation is to make it easy for the other player to deduce what you're doing in that circumstance. (I'll cover this in more depth in the TDT section.) This doesn't mean "always cooperating", nor does it mean giving up mixed strategies in zero-sum games.

Well, this article puts my confusion from the primer into context, and I think I understand the issue now.

The "problem" (and it's ultimately not a problem, but I'll get to that) I have is that the game these programs are playing is ill-posed, in that it's not a closed system. It might appear as if they're playing, for example, Prisoners' Dilemna, but with access to each other's source code they're really playing some sort of Prisoners' Meta-Dilemna, a fundamentally different game for all that it might seem superficially similar. Now I'm not enough of an expert to tell you exactly what defines a "well posed" game, but I'm quite confident this is not it. Or maybe I'm abusing the terminology a bit but hopefully my point is clear enough (if not I'll elaborate). The distinction though is that the mechanism by which you make your decision on how to play should be outside the game itself, and therefore only observed indirectly.

One property this metagame has is that there is fundamentally no "correct" answer for what "decision theory" the programs should use. The proof is simple, imagine you have a correct answer D. Imagine then that this program faces a population of the following opponents (I'll use the Prisoners' Dilemna subgame as a simple example):

Look at your opponent's source code.

If your opponent is using D, defect.

Otherwise co-operate.

D is in fact the single worst strategy in this scenario by quite a margin, regardless of what it is.

But that's fine, why would we even expect there to be a single correct strategy for every possible set of opponents? Of course there isn't. This is where we take a step back to a game that is actually well posed and for which Naive Decision Theory works fine - the programmers' game. Applying it there would go something like this:

Create a probability distribution over the (potentially infinite) space of all possible programs that yours might be up against.

Find the program which maximises the expected utility against that distribution of opponents.

For an infinite space of possible programs the second step might technically be impossible to perform optimally, but you do the best you can. There is of course a devil in the details of the first step in how you create that distribution, and when you look into it ultimately it's a game between programmers, with its own Nash Equilibria and co-operative strategies and so on, not just a simple decision. But until you start having programmers read each others' minds (in which case you've really just pointlessly shifted the whole thing up one meta-level) it's at least a well-posed (if complex) game that the ordinary approaches should work for.

As I said earlier though the fact that I might call these games ill-posed doesn't mean they're not interesting. I'll be fascinated to read about the strategies that people have devised for them and the attempts to be as general as possible. I'm a little concerned by the fact that the arguments so far seem to be along the lines of "You could do X, but then what if Y happens?", purely because as I've shown above there is ultimately no end to that chain of logic. But maybe you'll eventually step outside of it. My guess is that ultimately the "smarter" program wins. Certainly my generic counter-strategy has to generally be more complex (and therefore "smarter"? Maybe) than the "D" strategy it punishes in order to identify "D" in the first place.

There's a relevant footnote in the next post: while it's possible to write agents that punish a particular decision theory, an agent that's trying (in any meaningful sense) to maximize its stated payoffs won't do this, and it's certainly not a constant strategy (i.e. a one-player game) either. In that sense, we can say that such problems are less likely to be encountered in reality.

It might appear as if they're playing, for example, Prisoners' Dilemna, but with access to each other's source code they're really playing some sort of Prisoners' Meta-Dilemna, a fundamentally different game for all that it might seem superficially similar.

I agree, in the same way that Iterated Prisoners' Dilemma (with no source code access) is a fundamentally different game from One-Shot Prisoners' Dilemma (with no source code access). But there are real-life instances that are closer to the source-code access version than to the classical game theory version- for instance, a Cold War between two powers, each of which has a network of spies in the other's government.

I suspect the difficulty you've hit here is that Eliezer's theory (TDT and variants) involves consistent reasoning about counterpossible statements, not just counterfactual statements ("If my algorithm were to pick a in this situation, then my expected utility would be -10000000, so I won't do that then"). I can see what he's trying to achieve, but unfortunately I don't know any way in standard Mathematics to reason consistently about inconsistencies, or (in Modal Logic) to analyze impossible possible worlds. So this looks like a major problem.

I'm also struck by the similarity to an argument for 2-boxing in Newcomb's problem. Basically, the causal decision theorist knows he is a causal decision theorist, and knows that Omega knows that, so he proves a theorem to the effect that the opaque box won't contain $1 million. And then having proved that, it's blindingly obvious that he should 2-box (it's $1000 or nothing). So he 2 boxes. Nothing inconsistent there, just tragic.

I hadn't seen the cousin_it posts, thanks, though I'm reading them now.

One observation is that I thought myself of the "tweak" whereby if TDT, UDT or similar manages to prove that it will not perform an action a, then it immediately does perform an action a.

This at least prevents a sound decision theory finding two distinct proofs of the form "If my algorithm were to do a, then my utility would be 1000" and "If my algorithm were to do a, then my utility would be -1000". However, a couple of reservations:

The tweak sounds pretty weird (if I'm president, and I prove to myself that I won't push the nuclear button, then suddenly I will. Huh???)

If I'm trying to reason about counterpossibles (or impossible possible words) then the usual laws of logic and proof just fly out of the window. I can't legitimately infer from A -> B and A-> ~B that ~A. Indeed it's not obvious that I can infer anything at all; it will depend on how badly my impossible worlds behave. Some models of impossible worlds are complete anarchy, with no logical relationships at all between propositions.

On the question about CDT, I hadn't really thought about whether the agent was human or automated, or whether it has access to its own source-code or just a general self-awareness of its decision-making powers. To be honest, I don't think it makes much difference; human agents who are utterly convinced of the correctness of CDT (and there are some of them) will reach much the same conclusion as the automated agent (I.e. that there just isn't going to be $1 million in the opaque box, so don't worry about it). And when noticing other agents who seem to be getting the $1 million, they'll shrug and say "Oh dear, Omega is awarding irrationality, but so what? Even if I had switched to TDT or UDT instead of CDT, there would still be some possible Omega who'd penalise me for that switch, and I had no way of knowing in advance which Omega I was likely to meet; in fact I was pretty confident I wouldn't meet any Omega ever. Can't win them all..."

If I'm trying to reason about counterpossibles (or impossible possible words) then the usual laws of logic and proof just fly out of the window. I can't legitimately infer from A -> B and A-> ~B that ~A.

In our models so far, this isn't a problem, you just use a factory-standard first order inference system. What do you mean by "can't legitimately infer"? The worst that can happen is that you infer something misleading (but still valid), and the diagonal step/chicken rule is one way of ensuring that doesn't happen.

Thanks for the pointers to these... I'll probably move my comments to there. On your other point:

In our models so far, this isn't a problem, you just use a factory-standard first order inference system.

and...

If I'm trying to reason about counterpossibles (or impossible possible words) then the usual laws of logic and proof just fly out of the window. I can't legitimately infer from A -> B and A-> ~B that ~A.

The inference rule A-> B, A->~B |- ~A is the familiar principle of proof by contradiction i.e. "reductio an absurdam" of the alternative. The difficulty I see is that you can't really use reduction ad absurdam in an impossible worlds semantics, because the impossible worlds are - ahem - absurd. That's kinda the point.

I haven't read all the details of what you're trying, but my suspicion is that using standard off-the-shelf logics won't work here: you will at least need to limit the application of the inference rules e.g. so that they can be applied only up to a maximum finite number of times (and so the set of theorems is not closed under logical inference). Maybe you need some sort of belief semantics, where the "worlds" admitted into the belief-system are strictly a mixture of possible and impossible worlds, but the agent hasn't realized yet that the impossible ones actually are impossible or will never realize it, so they "seem" to be possible. So he's chugging along analysing those worlds using inference rules that will eventually hit a contradiction, but not yet. For "practical purposes" the system is consistent and won't blow up. Though be very careful, because an inconsistent system with the rule ("If I prove I won't do a, then do a") could accidentally prove that it won't nuke the planet and then (inconsistently) nuke the planet. Oops.

my suspicion is that using standard off-the-shelf logics won't work here: you will at least need to limit the application of the inference rules e.g. so that they can be applied only up to a maximum finite number of times (and so the set of theorems is not closed under logical inference)

This happens automatically, because (1) only the statements contributing to the decision matter, and there's a finite number of them, and (2) presence of the things like the diagonal step/chicken rule in the decision algorithm implies that inferring of absurdity doesn't happen. So we can prove that it's not the case that an agent can infer absurdity, even though it's free to use any first order inference it wants, and even though absurdity does follow from the axioms (in the setting without provability oracle).

In the setting with provability oracle, agent's algorithm is constructed in such a way that its axioms become sufficiently weak that the impossible counterfactuals (from the point of view of a stronger theory) remain consistent according to the theory used by the agent, and so in that setting the impossible worlds have actual models.

Your self-fulfilling prophecy example works for the iteration of the for loop (described in "For each xi, assume the output of X is xi, and try to deduce the expected value of U.") in which the output is assumed to be a, but for the iteration in which the output is assumed to be b, proving that the output is a would be to prove a contradiction. "if (output X)=b then U=0" is one possible outcome, but U could also equal anything else.

I don't see how the NDT algorithm as given allows "(output X)=a" to be proved outside of the for loop at all. I would think it would take (output X)=whatever for each iteration through the for loop as a given before trying to prove anything, in which case in the run of the for loop in which (output X)=b is the given, proving (output X)=a is a clear contradiction, one which I would think our prover could avoid unless our axiomatic system is contradictory in the first place.

Or to rephrase, I don't think "For each xi, assume the output of X is xi, and try to deduce the expected value of U." and "(That is, try and deduce statements of the form "if (output X)=xi then U=ui" for some ui)." are actually equivalent at all, and I think the self-fulfilling prophecy example follows the second and ignores the first.

The point is that 'given (output X)=a' may eventually let you prove a contradiction when the output is not, in fact, a, and you have added a false statement as input.

In practice, one does not use a two-way axiom of (output X)=a but an one-way substitution rule 'replace (output X) with a' . The rule may be applied once at start, or through first N steps of deduction process (to catch the cases where deduction manages to deduce X from a slightly modified X that was included inside the 'other algorithm' ; note that one can't do it forever because at some point the proof checker arrives at X from first principles, and the contradiction can be introduced by substitution.

The issue he described is specific to using (outputs X)=a as given, which allows you to e.g. do some algebra, arrive at a number a for any reason, and then replace a with (output X) , which lets you contradict yourself, or make a self fulfilling prophesy. The intent of making it a given, is to make the substitutions one way, but the theorem prover can do substitutions other way around.

Here's what I think you're saying: there is one value that will actually be output, call it o. In every iteration of the for loop except the one where you assume the output is o, you have assumed a false statement. From this contradiction you should be able to derive anything, and in particular, derive U(this choice)=some large negative number, such that o will appear to be the best choice. Furthermore, this argument makes no reference to what o actually is, so the algorithm can output any choice this way.

That's a very good argument, although I never would have figured it out from the article and it took some thinking to get it from your comment. I think it proves that the algorithm is underspecified though, not (necessarily) faulty; the description given is not enough to actually figure out what the algorithm will output.

As for the rest of your comment, I think by "in practice" you mean "in decision theories other than NDT which work better"?

re: the argument, not mine, they explained it somewhere else, albeit it is hard to get.

By in practice, I mean in anything that haven't got computing power substantially larger than that of the universe. You don't feed the tail into the dragon when your dragon has finite mouth and long tail, that is to say, you dont feed source code into theorem provers if you don't have to; instead you actively try to simplify the stuff that is being processed by theorem prover. You don't make equivalence of your code with provisional output an axiom, you make it a substitution rule that works one way (it still makes nonsense eventually, but easily long after all the stars in the universe have burnt themselves out, and that can too be fixed). That is assuming you are using state of the art automated algebra package to implement decision theory.

Actually, this is an open problem so far as I know: show that if X is a Naive Decision Theory agent as above, with some analyzable inference module like a halting oracle, then there exists an agent Y written so that X cooperates against Y in a Prisoner's Dilemma while Y defects.

Let me just spell out to myself what would have to happen in this instance. For definiteness, let's take the payoffs in prisoner's dilemma to be $0 (CD), $1 (DD), $10 (CC) and $11 (DC).

Now, if X is going to co-operate and Y is going to defect then X is going to prove "If I co-operate then I get $0". Therefore, in order to co-operate, X must also prove the spurious counterfactual "If I defect then I get $x" for some negative value of x.

But suppose I tweak the definition of the NDT agent so that whenever it can prove (1) "if output = a then utility >= u" and (2) "if output != a then utility <= u" it will immediately output a. (And if several statements of the forms (1) and (2) have been proved then the agent searches for them in the order that they were proved) Note that our agent will quickly prove "if output = 'defect' then utility >= $1". So if it ever managed to prove "if output = 'co-operate' then utility = $0" it would defect right away.

Since I have tweaked the definition, this doesn't address your 'open problem' (which I think is a very interesting one) but it does show that if we replace the NDT agent with something only slightly less naive, then the answer is that no such Y exists.

(We could replace Prisoner's Dilemma with an alternative game where each player has a third option called "nuclear holocaust", such that if either player opts for nuclear holocaust then both get (say) -$1, and ask the same question as in your note 2. Then even for the tweaked version of X it's not clear that no such Y exists.)

ETA: I'm afraid my idea doesn't work: The problem is that the agent will also quickly prove "if 'co-operate' then I receive at least $0." So if it can prove the spurious counterfactual "if 'defect' then receive -1" before proving the 'real' counterfactual "if 'co-operate' then receive 0" then it will co-operate.

We could patch this up with a rule that said "if we deduce a contradiction from the assumption 'output = a' then immediately output a" which, if I remember rightly, is Nesov's idea about "playing chicken with the inconsistency". Then on deducing the spurious counterfactual "if 'defect' then receive -1" the agent would immediately defect, which could only happen if the agent itself were inconsistent. So if the agent is consistent, it will never deduce this spurious counterfactual. But of course, this is getting even further away from the original "NDT".

Your intuition that it gets deduced before any of the spurious claims like "if output = 'defect' then utility <= -$1" is taking advantage of an authoritative payoff matrix that X can't safely calculate xerself. I'm not sure that this tweaked version is any safer from exploitation...

The one-player game that I wrote out is an example of a NDT agent trying to read off the payoff matrix from the world program, and failing. There are ways to ensure you read off the matrix correctly, but that's tantamount to what you do to implement CDT, so I'll explain it in Part II.

I warned at the beginning that this was an artificial setting, but it's one in which we can express and distinguish CDT, TDT, ADT, UDT and others. The lack of memory serves to focus attention on what we can do; note that it's perfectly legitimate to try and deduce what Y would have done against another opponent, if you want that kind of data.

It's a lot easier for me to see the applications for an iterated game where you try to model your opponent's source code given the game history than a single-shot game where you try to predict what your opponent will do given their source code and your source code. That seems like it just invites infinite regress.

Suppose we know we're going to play the one-shot prisoner's dilemma with access to source code, and so we want to try to build the bot that cooperates only with bots that have functionally the same rule as it, i.e. the 'superrational prisoner' that's come up on LW. How would we actually express that? Suppose, for ease of labeling, we have bots 1 and 2 that can either C or D. Bot 1 says "output C iff 2 will output C iff 1 will output C iff 2 will output C iff..." Well, clearly that won't work. We also can't just check to see if the source code is duplicated, because we want to ensure that we also cooperate with similar bots, and don't want to get fooled by bots that have that commented out.

Is there a way to code that bot with the sort of propositions you're doing? Outcomes aren't enough- you need models of outcome dependencies that can fit inside themselves.

That is what I meant. My problem is that I don't see how to express that strategy correctly with the vocabulary in this post, and I don't understand the notation in those posts well enough to tell if they express that idea correctly.

For example, I'm not comfortable with how Eliezer framed it. If my strategy is "cooperate iff they play cooperate iff I play cooperate," then won't I never cooperate with myself, because I don't have the strategy "cooperate iff they play cooperate"?

(I do think that general idea might be formalizable, but you need models of outcome dependencies that can fit inside themselves. Which maybe the notation I don't understand is powerful enough to do.)

"Cooperate iff they play cooperate iff I play cooperate" is imprecise. Your complaint seems to stem from interpreting it to pay too much attention to the details of the algorithms.* One reason to use counterfactuals is to avoid this. One interpretation of the slogan is "A()=C iff A()=B()." This has symmetry, so the agent cooperates with itself. However, it this is vulnerable to self-fulfilling prophecy against the cooperation rock (ie, it can choose either option) and its actions against the defection rock depend on implementation details. It is better to interpret it as cooperate if both counterfactuals "A()=C => B()=C" and "A()=D => B()=D" are true. Again, this has symmetry to allow it to cooperate with itself, but it defects against rocks. In fact, an agent like from this post pretty much derives this strategy and should cooperate with an agent that comes pre-programmed with this strategy. The problem is whether the counterfactuals are provable. Here is a thread on the topic.

* Another example of agents paying too much attentions to implementation details is that we usually want to exclude agents that punish other agents for their choice of decision theory, rather than their decisions.

Aren't the implementation details sort of the whole point? I mean, suppose you code a version of cooperate iff "A()=C => B()=C and A()=D => B()=D" and feed it a copy of itself. Can it actually recognize that it should cooperate? Or does it think that the premise "A()=C" implies a cooperation rock, which it defects against?

That is, if the algorithm is "create two copies of your opponent, feed them different inputs, and combine their output by this formula," when you play that against itself it looks like you get 2^t copies, where t is how long you ran it before it melted down.

I see how one could code a cliquebot that only cooperated with someone with its source code, and no one else. I don't see how one could formalize this general idea- of trying D,C then C,C then D,D (and never playing C,D)- in a way that halts and recognizes different implementations of the same idea. The problem is that the propositions don't seem to cleanly differentiate between code outcomes, like "A()=C", and code, like "A()=C => B()=C".

No, the algorithm is definitely not to feed inputs in simulation. The algorithm is to prove things about the interaction of the two algorithms. For more detail, see the post that is the parent of the comment I linked.

"A()=C => B()=C" is a statement of propositional logic, not code. It is material implication, not conditional execution.

The algorithm is to prove things about the interaction of the two algorithms.

This is possibly just ignorance on my part, but as far as I can tell what that post does is take the difficult part and put it in a black box, which leaves me no wiser about how this works.

"A()=C => B()=C" is a statement of propositional logic, not code. It is material implication, not conditional execution.

Code was the wrong phrase for me to use- I should have gone with my earlier "outcome dependency," which sounds like your "material implication."

I think I've found my largest stumbling block: when I see "B()=C", it matters to me what's in the (), i.e. the inputs to B. But if B is a copy of this algorithm, it needs more information than "A()=C" in order to generate an output. Or, perhaps more precisely, it doesn't care about the "=C" part and makes its decision solely on the "A()" part. But then I can't really feed it "A()=C" and "A()=D" and expect to get different results.

That is, counterfactuals about my algorithm outputs are meaningless if the opponent's algorithm judges me based on my algorithm, not my outputs. Right? Or am I missing something?

The input to A and B is always the same, namely the source code of A and B and the universe. In fact, often we consider all that to be hard coded and there to be no input, hence the notation A().

Could I rephrase your last paragraph as "counterfactuals about the output of my algorithm are meaningless since my algorithm only has one output"? Yes, this is a serious problem in making sense of strategies like "cooperate iff the opponent cooperates iff I do." Either my opponent cooperates or not. That is a fact about math. How could it be different?

And yet, in the post I linked to, Cousin It managed to set up a situation in which there are provable theorems about what happens in the counterfactuals. Still, that does not answer the question of the meaning of "logical counterfactuals." They seems to match the ordinary language use of counterfactuals, as in the above strategy.

If the "black box" was the provability oracle, then consider searching for proofs of bounded lengths, which is computable. If the black box was formal logic (Peano arithmetic), well, it doesn't look very black to me, but it is doing a lot of work.

I don't think you should be judging material implication by its name. Perhaps by "outcome dependency" you mean "logical counterfactual." Then, yes, counterfactuals are related to such implications, by definition. The arrow is just logical implication. There are certain statements of logic that we interpret as counterfactuals.

What Nesov said. Loebian cooperation happened because I wanted to make quining cooperation work when the opponent has the same algorithm, but not necessarily the same source code. Illusion of transparency again, I post a lot of math thinking that its motivation will be crystal clear to everyone...

By the way, I was too blasé in the grandparent comment. I have a model of TDT that does almost the right thing, but I haven't figured out how to quine it so that it does exactly the right thing. (Technically, I can't be sure it'll still work if I do get the right quine in, but my gut feeling is that it will.)

For footnote #1, it may be useful to give a couple of examples. When both programs are making decisions based on the others source code, it is pretty close to the undecidable situation, but it depends on the game.

When the game is Rock-Paper-Scissors (or parity), the programs have contradictory goals and are pretty much the textbook example of noncomputable.

But if the game is the Prisoner's Dilemma (or the Coordination Game), they are trying to cooperate to land on (C,C), so they want to be decidable.

I plan to put all the acknowledgments and references in the last part; but I should mention now that I first read about the self-fulfilling prophecy problem on the decision-theory mailing list, in a post by Vladimir Slepnev. (EDIT: He passes on the credit to Benja Fallenstein in this LW comment.)

Other people have used "naive decision theory" before, some of them on Less Wrong, but none of the usages seemed to stick. So I called this one (which was an early candidate for UDT before the problem was noticed) Naive Decision Theory. I can change the name if people prefer.

I propose: Very Naive Decision Theory, as it represents an effort to make a decision theory which is maximally naive by making an assumption that the makers of naive decision theory are naive about the Godel's incompleteness theorem and Halting problem, yet allow their decision theory to recurse rather than simply black-box it from itself. In practice, the black boxing is always done to cut down on the work because you are an applied mathematician, and you got a problem to solve, and you won't enter the recursion if you can avoid doing that.