The two envelopes problem

Introduction

I came across the two envelopes problem a couple of weeks ago. I read the Wikipedia article, and this article by a mathematician named Keith Devlin, and for a while I felt like I understood the solution. I've been thinking about this a lot since then, and now I believe that those two articles are wrong about the solution.

I've decided to post my thoughts here, so that I can get some input from others. If I'm wrong about something, I'd like to know about it.

The problem

For those who aren't familiar with the problem, here it is:

You have two indistinguishable envelopes in front of you. Both of them contain money, one of them twice as much as the other. You're allowed to choose one of the envelopes and keep whatever's in it. So you pick one at random, and then you hesitate and think "maybe I should pick the other one". Let's call the amount in the envelope that you picked first A. Then the other envelope contains either 2A or A/2. Both amounts are equally likely, so the expectation value of the amount in the other envelope is

$$\frac{1}{2}\cdot 2A + \frac{1}{2}\cdot\frac{A}{2} = \frac{5}{4}A.$$

Since this is more than the amount in the envelope we have, we should definitely switch.

This conclusion is of course absurd. The symmetry of the problem alone is enough to guarantee that it doesn't matter if we switch or not. A more formal way of showing it is to note that if the smaller amount is X, then both expectation values are equal to

$$\frac{1}{2}\cdot X + \frac{1}{2}\cdot 2X = \frac{3}{2}X,$$

so it doesn't matter if we switch or not.

The problem is that we have two calculations that look correct, but only one of them can be. It's obvious that the second calculation is the correct one, but it's surprisingly difficult to understand exactly what's wrong with the first one.

Why this post is so long

Because this problem is much more difficult than it seems, and because I've been discussing this problem with a few friends who aren't very familiar with Bayes's theorem. In order to make it easier for them (and others who are interested) to understand the discussion, I will go through the basics of Bayesian probabilities in the next section. Then I will examine the line of reasoning that Devlin claims solves the problem.

If you already know about Bayesian probabilities, you can skip to the section titled "The solution?".

Probabilities, conditional probabilities and Bayes's theorem

Let S be a set. We define a function P that takes a subset of S to a number in the closed interval [0,1] by

$$P(X) = \frac{|X|}{|S|},$$

where |X| and |S| are the number of elements in X and S respectively. We call P(X) the probability of X. Note that this agrees with our intuitive idea of what a probability is: For example, if S is the set of all people and X the subset of all people with dark hair, then P(X) is the probability that a person chosen at random has dark hair. We call S the sample space.

We now define another function, also represented by the letter P, that takes two subsets of S to a number in the closed interval [0,1] by

$$P(X|Y) = \frac{P(X\cap Y)}{P(Y)},$$

where the P on the right-hand side is the probability function we defined above. To interpret this, we note that the right-hand side is equal to

$$\frac{|X\cap Y|}{|Y|}.$$

Suppose for example that X is the set of all people with dark hair, and Y the set of all females. Then P(X|Y) is the probability that a randomly chosen female has dark hair. We call P(X|Y) the conditional probability of X, given Y.
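To make the two definitions concrete, here is a minimal sketch that computes P(X) and P(X|Y) by counting elements of finite sets. The five-person sample space is made up purely for illustration, and the helper names `prob` and `cond_prob` are my own.

```python
from fractions import Fraction

# Hypothetical toy sample space: (name, hair colour, sex). Made up for illustration.
S = {
    ("Ann", "dark", "F"),
    ("Bea", "fair", "F"),
    ("Cat", "dark", "F"),
    ("Dan", "dark", "M"),
    ("Eli", "fair", "M"),
}
X = {p for p in S if p[1] == "dark"}  # people with dark hair
Y = {p for p in S if p[2] == "F"}     # females


def prob(A, sample_space=S):
    """P(A) = |A| / |S| for a finite sample space."""
    return Fraction(len(A), len(sample_space))


def cond_prob(A, B):
    """P(A|B) = P(A intersect B) / P(B)."""
    return prob(A & B) / prob(B)


print(prob(X))          # probability that a random person has dark hair
print(cond_prob(X, Y))  # probability that a random female has dark hair
```

Because P(X∩Y)/P(Y) reduces to |X∩Y|/|Y|, `cond_prob(X, Y)` is just the fraction of females who have dark hair.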

The definition of the conditional probability implies that

$$P(X\cap Y) = P(X|Y)P(Y),$$

but since the left-hand side is symmetric under exchange of X and Y, the right-hand side must be too, so we must have

$$P(X|Y)P(Y) = P(Y|X)P(X).$$

Bayes's theorem is obtained by solving for P(X|Y):

$$P(X|Y) = \frac{P(Y|X)P(X)}{P(Y)}.$$

When we need to calculate conditional probabilities $P(X_i|Y)$, where the $X_i$ are members of a family of pairwise disjoint subsets that cover S (e.g. $X_1$="people aged 0-9", $X_2$="people aged 10-19", and so on), we often already know the $P(Y|X_i)$, so it's convenient to put Bayes's theorem in a different form:

$$P(X_i|Y) = \frac{P(Y|X_i)P(X_i)}{\sum_j P(Y|X_j)P(X_j)}.$$

This is all very abstract, so let's look at an example.

An example of how to use Bayes's theorem

Drawer 1 contains 10 white socks, Drawer 2 contains 10 white socks and 90 black socks. First a drawer is chosen at random, and then a sock from that drawer is chosen at random. This sock is white. What is the probability that we took it from drawer 1?

Let X be the set of events such that a sock from drawer 1 is chosen. Then $X^c$ is the set of events such that a sock from drawer 2 is chosen. The information that the drawer was chosen at random tells us that the number of elements in $X$ and in $X^c$ (=S-X) must be the same. This implies that $P(X)=P(X^c)=\frac{1}{2}$. Let $Y$ be the set of events such that a white sock is picked. What we're looking for is the conditional probability of X (drawer 1), given Y (white sock), so we use Bayes's theorem:

$$P(X|Y) = \frac{P(Y|X)P(X)}{P(Y|X)P(X) + P(Y|X^c)P(X^c)}.$$

$P(Y|X)$ is the probability that a random sock from drawer 1 is white. This is 10/10=1. $P(X)$ and $P(X^c)$ are both 1/2. $P(Y|X^c)$ is the probability that a random sock from drawer 2 is white. This is 10/100=1/10. So we get

$$P(X|Y) = \frac{1\cdot\frac{1}{2}}{1\cdot\frac{1}{2} + \frac{1}{10}\cdot\frac{1}{2}} = \frac{10}{11}.$$

If we had not known that the sock we picked is white, we would have said that the probability that we picked it from drawer 1 is P(X)=1/2. When we learned that it's white, our assessment changed to P(X|Y)=10/11. Because of this, P(X) is often called the prior probability of X, and P(X|Y) the posterior probability of X.
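The sock calculation can be checked with a few lines of code. This is only a sketch: the function name `posterior` is my own, and it implements the partition form of Bayes's theorem with exact fractions.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Return P(X_i|Y) for each i, given P(X_i) and P(Y|X_i) for a partition X_i."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(Y|X_i) * P(X_i)
    total = sum(joint)                                    # P(Y)
    return [j / total for j in joint]


# Drawer 1 and drawer 2 are equally likely to be chosen.
priors = [Fraction(1, 2), Fraction(1, 2)]
# Probability of drawing a white sock from each drawer.
likelihoods = [Fraction(10, 10), Fraction(10, 100)]

post = posterior(priors, likelihoods)
print(post[0])  # 10/11
```

The same function works for any finite partition, for example the age groups mentioned earlier, as long as the priors sum to 1.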

What if the sample space is infinite?

The equation that defines probability above has the number of elements in the sample space in a denominator. If the sample space is infinite, all probabilities would be zero, unless the subsets we consider are also infinite, in which case the probability wouldn't be well defined. In the discussion that follows, I'm assuming that the definitions above can be generalized to the infinite case, and that Bayes's theorem still holds. If someone has objections to this, or would like to show us the definitions and proofs, I would appreciate it.

The solution?

I don't know who suggested this solution first, but I read about it in Devlin's article. The difference between what I've written here and what Devlin wrote is mainly in the notation. I'm using the same notation as I did in the sections preceding this one.

The first thing we need to realize is that when we try to calculate the expectation value of the contents of the other envelope as a function of the value in the first envelope, we need to use conditional probabilities. For example, the first 1/2 in the calculation that gave us the answer 5/4*A must be replaced by the conditional probability that the first envelope contains the smaller amount, given that the first envelope contains A. Are these conditional probabilities also 1/2? Devlin argues that they can't be. We will soon see why.

Just as in the example with the two sock drawers, we have a sample space that is split in equally large halves (whatever that means when the sample space is infinite) by the requirement that one of two options is chosen at random. Let X be the set of events such that the smaller amount is chosen first. Then X^c is the set of events such that the larger amount is chosen first. Let Y be the set of events such that the envelope that gets chosen first contains A.

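With these definitions, the expectation value can be expressed as

$$P(X|Y)\cdot 2A + P(X^c|Y)\cdot\frac{A}{2}.$$

The requirement that the two envelopes are chosen at random means that $P(X)=P(X^c)=\frac{1}{2}$, but the conditional probabilities aren't necessarily both 1/2. Let's find out if they can be. Bayes's theorem tells us that

$$P(X|Y) = \frac{P(Y|X)P(X)}{P(Y|X)P(X) + P(Y|X^c)P(X^c)}.$$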

We already know what P(X) and P(X^c) are, but what are P(Y|X) and P(Y|X^c)? For example, P(Y|X) is the conditional probability that the envelope we choose first contains A, given that we have chosen the envelope with the smaller amount. Another way of saying that is that P(Y|X) is the probability that the amounts in the envelopes are specifically A and 2A, rather than some other amounts.

This is when we must realize that it's not possible to make sense of the expression that we have claimed is the expectation value, unless we postulate the existence of something like a function $Q:[0,\infty)\rightarrow[0,1]$ that tells us the probability Q(x) that the smaller value is x.

Actually we should postulate the existence of a probability distribution g on the set of positive real numbers. This is basically a "function" that assigns a probability density g(x) to each x. The reason I put quotation marks around the word "function" is that g isn't technically a function, since it can have infinitely sharp peaks like a delta function (which is also not a function...yeah, I know). Given the probability distribution g, we could obtain the probability that the smaller value is in the interval (a,b) by integrating g over that interval.

For the moment I will only consider discrete probability distributions. This is equivalent to saying that we are allowing ourselves to use the function Q as I defined it above, with the added requirement that it can only be non-zero on a countable subset of its domain of definition. (This subset can still be unbounded, and hence infinite).

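Now that we have defined Q, we see that

$$P(Y|X) = Q(A), \qquad P(Y|X^c) = Q(A/2).$$

This means that

$$P(X|Y) = \frac{Q(A)}{Q(A) + Q(A/2)}.$$

This is only =1/2 if

$$Q(A) = Q(A/2).$$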

If this doesn't hold for arbitrary A, then the expression we've been using for the expectation value simply isn't valid. We're assuming that it is valid, so we must have Q(x/2)=Q(x) for all x. But we can't have Q(A/2)=Q(A) for arbitrary A, because that would make the sum of all Q(x) infinite. That sum has to be =1 of course, since it's the sum of the probabilities of all possible values of the smaller amount.
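To see concretely how the conditional probability depends on Q, here is a small sketch with a made-up prior: the smaller amount is equally likely to be 1, 2, 4 or 8. Both the prior and the function names are assumptions chosen for illustration; nothing in the problem statement specifies them.

```python
from fractions import Fraction

# Hypothetical prior: the smaller amount is 1, 2, 4 or 8 with probability 1/4 each.
Q = {1: Fraction(1, 4), 2: Fraction(1, 4), 4: Fraction(1, 4), 8: Fraction(1, 4)}


def q(x):
    """Probability that the smaller amount is x (zero outside the support)."""
    return Q.get(x, Fraction(0))


def p_smaller_given_amount(a):
    """P(X|Y) = Q(a) / (Q(a) + Q(a/2)); the factors P(X) = P(X^c) = 1/2 cancel."""
    denom = q(a) + q(Fraction(a) / 2)
    if denom == 0:
        raise ValueError("amount is impossible under this prior")
    return q(a) / denom


for a in (1, 2, 4, 8, 16):
    print(a, p_smaller_given_amount(a))
```

With this prior the conditional probability is 1/2 in the middle of the range, but it is 1 when the first envelope contains 1 and 0 when it contains 16, so it cannot be 1/2 for every A.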

According to Devlin, this result solves the problem. He doesn't say much after arriving at this result, and ends his article with the following words:

Originally Posted by Devlin

To summarize: the paradox arises because you use the prior probabilities to calculate the expected gain rather than the posterior probabilities. As we have seen, it is not possible to choose a prior distribution which results in a posterior distribution for which the original argument holds; there simply are no circumstances in which it would be valid to always use probabilities of 0.5.

Why I don't think that this solves the problem

If the above line of reasoning is correct, then the result of the calculation shouldn't just be different from 5/4*A. It should be exactly A, i.e. we should have

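$$\frac{Q(A)}{Q(A)+Q(A/2)}\cdot 2A + \frac{Q(A/2)}{Q(A)+Q(A/2)}\cdot\frac{A}{2} = A,$$

$$2Q(A) + \frac{1}{2}Q(A/2) = Q(A) + Q(A/2),$$

$$Q(A) = \frac{1}{2}Q(A/2).$$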

So if the expression we've been using for the expectation value really is correct, then Q must satisfy Q(x)=1/2*Q(x/2) for all x. But if it does, then Q has a form that's even more absurd than the uniform distribution that we dismissed earlier, since Q(x) goes to infinity as x goes to zero.

Devlin found that there's no Q that makes $P(X|Y)=P(X^c|Y)=\frac{1}{2}$ (and hence the expectation value equal to 5/4*A) for arbitrary A, and concluded that this resolves the paradox. But we have found, using the same methods, that there's also no Q that makes the expectation value equal to A for arbitrary A.

This makes me think that Devlin's solution is incorrect.

Interpretation of the probability distribution

When I started thinking about this problem, I thought of the probability distribution g (or the function Q) as representing the method that someone used to determine the amounts that he or she later put in the envelopes. The problem with that is that since that method isn't known to us, we won't ever be able to calculate the expectation value this way. The alternative is to think of it as representing our belief about what the smaller amount is. Then we can assume that the probability distribution is known to us, and we can calculate the expectation value, but we will only get the right answer if we had the right "belief" to begin with!

My conclusions

To define A to be the amount in the first envelope is to treat that value as if it's already known to us (regardless of whether we actually know it or not). Because of this, the factors of 1/2 in the first calculation in this post must be replaced by conditional probabilities, as we did in the section titled "The solution?". These conditional probabilities must be calculated using Bayes's theorem, but when we try, we see that the result depends on a probability distribution g that isn't known to us. This means that we can't calculate the expectation value this way.

However, since we know that the correct expectation value is A, we might be able to use Bayes's theorem to find a class of probability distributions that are reasonable instead. A "reasonable" probability distribution would be one that makes the expectation value equal to A, when we calculate it using Bayes's theorem. Since the probability distribution can be interpreted as representing our belief about what the smaller amount is, this procedure might tell us which beliefs are reasonable and which ones are absurd. We have only investigated the discrete probability distributions, and found that all of them must be considered absurd. However, that doesn't seem reasonable, so now I'm back to thinking that there's something fundamentally wrong with this method of calculating the expectation value.

So what is the correct solution of this problem? I still don't know. Maybe I'll find it in some other article.

The mistakes in the Wikipedia article

The simple "solution" that's suggested early in the Wikipedia article isn't a solution at all.

Originally Posted by Wikipedia

The most common way to explain the paradox is to observe that A isn't a constant in the expected value calculation, step 7 above. In the first term A is the smaller amount while in the second term A is the larger amount. To mix different instances of a variable or parameter in the same formula like this shouldn't be legitimate, so step 7 is thus the proposed cause of the paradox.

The "paradox" is that we have two calculations, the first and the second in this post, that give us different answers. The challenge isn't to discover the second calculation, but to explain what's wrong with the first one. The text I just quoted is equivalent to saying "you should use the second instead of the first". This is to ignore the problem, not to solve it. It can't possibly be wrong to define A as the amount in the first envelope, and once that definition has been made, A is a constant.

The Wikipedia article is also wrong when it claims that the problem is more difficult when you're allowed to look inside the first envelope and find out what A is. It seems to me that the author has misunderstood why we are using conditional probabilities. The reason is that we're calculating the expectation value as a function of A. That forces us to treat A as if we know its value, even though we don't.


Here are my views:
Suppose you know beforehand that in one of the envelopes there are 10 bucks and in the other envelope there are 20 bucks.
When you pick up your first envelope, your expectation would be $\frac{1}{2}\cdot 10 + \frac{1}{2}\cdot 20 = 15$.
Similarly, when you are going to pick up the second envelope, you know it contains either 10 bucks or 20 bucks.
Hence your expectation would be $\frac{1}{2}\cdot 10 + \frac{1}{2}\cdot 20 = 15$.

Now suppose that you are again beginning the game and you don't know how many bucks are kept inside the envelopes (in the first case you were told 10 and 20 bucks).
You picked up one envelope and called this amount A.
Now there are two possibilities: the amounts are either A and A/2, or A and 2A. Hence the expectation would be $\frac{1}{2}\cdot\frac{A}{2} + \frac{1}{2}\cdot 2A = \frac{5}{4}A$.

Thanks for the replies guys, but I think you're both missing the point.

Originally Posted by malaygoel

Suppose you know beforehand that in one of the envelopes there are 10 bucks and in the other envelope there are 20 bucks.
When you pick up your first envelope, your expectation would be $\frac{1}{2}\cdot 10 + \frac{1}{2}\cdot 20 = 15$.
Similarly, when you are going to pick up the second envelope, you know it contains either 10 bucks or 20 bucks.
Hence your expectation would be $\frac{1}{2}\cdot 10 + \frac{1}{2}\cdot 20 = 15$.

If you don't know the amounts in the envelopes you can still do that same calculation. You would have to write something like X instead of the number 10, but you would still get the result that both expectation values are the same. (This is the second calculation in my post).

Originally Posted by malaygoel

Now suppose that you are again beginning the game and you don't know how many bucks are kept inside the envelopes (in the first case you were told 10 and 20 bucks).
You picked up one envelope and called this amount A.
Now there are two possibilities: the amounts are either A and A/2, or A and 2A. Hence the expectation would be $\frac{1}{2}\cdot\frac{A}{2} + \frac{1}{2}\cdot 2A = \frac{5}{4}A$.

You can do this calculation even when you know what A is. The Wikipedia article even argues that it's only when you know A that you can even consider using this calculation. (I strongly disagree with this).

Originally Posted by malaygoel

Your expectation changes with your knowledge about the envelopes.

There is information that we could receive that would change the expectation, but it's not sufficient to know the amount in the first envelope. We also need to know the method that was used to determine the amounts in the first place (i.e. what I called g or Q in my post)

Originally Posted by Quick

So your expected outcome is the average of the two envelopes...

Yes this is obvious, and as you can see I wrote that in my post. However that doesn't solve the problem. The challenge isn't to find the correct expectation value, it's to find what's wrong with the calculation that tells us it's a good idea to switch.

You have two indistinguishable envelopes in front of you. Both of them contain money, one of them twice as much as the other. You're allowed to choose one of the envelopes and keep whatever's in it. So you pick one at random, and then you hesitate and think "maybe I should pick the other one". Let's call the amount in the envelope that you picked first A. Then the other envelope contains either 2A or A/2. Both amounts are equally likely, so the expectation value of the amount in the other envelope is

$$\frac{1}{2}\cdot 2A + \frac{1}{2}\cdot\frac{A}{2} = \frac{5}{4}A.$$

Since this is more than the amount in the envelope we have, we should definitely switch.

This calculation is clearly nonsense. Make it specific: suppose the two sums are $10 and $5.

If the envelope that you have contains the $5, then A=5, but there is zero probability that the other contains $2.50.

The expected content of the other envelope is clearly:

p(A=$10)5+p(A=$5)10=$7.50

which is identical to the expected content of the envelope that you already
hold.

The published articles about this do not dismiss it as nonsense. I think it probably is nonsense, but it's not easy to see why.

Originally Posted by CaptainBlack

Make it specific: suppose the two sums are $10 and $5.

If the envelope that you have contains the $5, then A=5, but there is zero probability that the other contains $2.50.

The expected content of the other envelope is clearly:

p(A=$10)5+p(A=$5)10=$7.50

which is identical to the expected content of the envelope that you already
hold.

If A=5, then P(A=10) is 0, but I think I understand what you're trying to say. You're just repeating the second calculation from my post (the one that says that the EV is 1/2*X+1/2*2X=3/2*X).

Also note that the EV in dollars depends on the prior distribution, i.e. the method that was used to select the amount that went into the envelopes. (The EV in units of whatever the smaller amount is, is always 3/2, assuming that we don't know the value of A). Suppose for example that the person who prepared the envelopes flipped a coin until it came up heads and then put 2^n dollars and 2^(n+1) dollars, where n is the number of "tails" before the first "heads", into the envelopes. Then (assuming that we don't know the value of A) the EV in dollars is

1/2*1+1/4*2+1/8*4+...=1/2+1/2+1/2+...

The possibility of infinite expectation values is probably the reason why this problem is so strange.
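The divergence of this series is easy to check numerically. The sketch below sums the first few terms for the coin-flip scheme described above, where n tails before the first heads has probability (1/2)^(n+1) and the smaller amount is 2^n dollars. The function name is my own.

```python
from fractions import Fraction


def partial_expected_smaller(n_terms):
    """Sum of P(n tails) * 2^n for n = 0 .. n_terms - 1, with P(n tails) = (1/2)^(n+1)."""
    return sum(Fraction(1, 2 ** (n + 1)) * 2 ** n for n in range(n_terms))


for k in (1, 10, 100):
    print(k, partial_expected_smaller(k))
```

Every term contributes exactly 1/2, so the partial sum after k terms is k/2 and grows without bound.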

If the envelope that you have contains the $5 A=5 but there is zero probability that the other contains $2.50.

The expected content of the other envelope is clearly:

p(A=$10)5+p(A=$5)10=$7.50

which is identical to the expected content of the envelope that you already
hold.

If A=5, then P(A=10) is 0, but I think I understand what you're trying to say. You're just repeating the second calculation from my post (the one that says that the EV is 1/2*X+1/2*2X=3/2*X).

You misunderstand my notation: P(A=$10) denotes the probability that
A is $10. At no point have I considered what I would write as
P(A=$10|A=$5) which what you write implies you think I mean.

None of the probabilities I have written in the quoted fragment are
conditioned on A.

I have made the sums specific as otherwise we have an ill posed problem.
I could have used any sum of money for the lower sum and twice that for the
larger, and the result would be to scale the expectation by the appropriate
factor.

One way to cut through the confusion surrounding such problems is to ask
yourself how you would simulate the problem. It will force you to make
everything specific, and so eliminate the ambiguity which leads to most of
the confusion.

Now instead of a fixed sum let the smaller sum S be generated by some process such that the probability that RV S has value s is P(s)

Then the expected value of the other envelope is:

sum(P(s)[p(A=2s)s+p(A=s)2s],s=...)=sum(1.5 P(s) s, s=...)=1.5 mean(S)

which is the same as the expectation of your envelope.
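A simulation along these lines might look like the sketch below. The process that generates the smaller sum (a uniform draw from 1 to 100) is an assumption made only to have something specific to run; any other process with a finite mean could be substituted.

```python
import random


def simulate(trials=200_000, seed=0):
    """Return (mean amount if you keep, mean amount if you switch)."""
    rng = random.Random(seed)
    keep_total = 0
    switch_total = 0
    for _ in range(trials):
        s = rng.randint(1, 100)    # assumed process for the smaller sum S
        envelopes = [s, 2 * s]
        first = rng.randrange(2)   # pick one of the two envelopes at random
        keep_total += envelopes[first]
        switch_total += envelopes[1 - first]
    return keep_total / trials, switch_total / trials


mean_keep, mean_switch = simulate()
print(mean_keep, mean_switch)
```

Under this assumed process mean(S) is 50.5, so both averages should come out close to 1.5 * 50.5 = 75.75, with neither strategy ahead beyond sampling noise.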

Yes, I agree, but isn't this just another calculation that proves what we already know, i.e. that both expectation values are the same? Does it really tell us anything about why the other calculation is wrong? If it does, I don't see it.

Originally Posted by CaptainBlack

The problem as stated is in fact ill posed, and until you specify how the sums to be put in the envelopes are generated there is in fact no problem.

That was my conclusion too. It's strange that this point isn't emphasized in the articles I've read.

OK, let's see if we agree about why we shouldn't trust the calculation that gave us EV=5/4*A.

We can't take for granted that the EV of the other envelope is 1/2*2A+1/2*A/2, because the correct expression is actually

$$P(X|Y)\cdot 2A + P(X^c|Y)\cdot\frac{A}{2},$$

and the conditional probabilities in this expression aren't well defined because they depend on the prior distribution. So to make progress here, we have to assume that a prior distribution has been specified.

Suppose that a discrete distribution has been specified. This is effectively a function Q with the property that Q(x) is the probability that the smaller amount is x. There is no Q that makes both of the conditional probabilities =1/2 regardless of what A is (because such a Q would have to satisfy Q(x)=Q(x/2) for all x, and that would make the sum of all Q(x) infinite instead of =1).

So it is in fact not possible to calculate the EV of the other envelope as 1/2*2A+1/2*A/2.

This is the argument that several articles about this problem have used. Do you agree so far? If you do, then what do you think about what I said in the section titled "Why I don't think that this solves the problem"? My conclusion there is that simply replacing the factors of 1/2 with the appropriate conditional probabilities is not enough. The expression still doesn't give us the correct result. (The result should be A, but that implies that Q(x)=1/2*Q(x/2) for all x, which is absurd).

Edit: If it isn't obvious, the problem I have with the line of reasoning that supposedly solves the problem is that the solution relies on the claim that the expression with the conditional probabilities is the correct expression for the EV, but if you assume that it is, you get a contradiction. So unless I'm wrong about that last part, this "solution" (which has appeared in several articles on the problem) isn't a solution at all!

I think that the reason you are getting two different expectations is because you are using two different base values.
In one (X) you are using the smaller value in the two envelopes as the base value, and in the other (A) you are using the money in the first picked envelope as the base value.
(I am not sure, but please consider it)

and the conditional probabilities in this expression aren't well defined because they depend on the prior distribution. So to make progress here, we have to assume that a prior distribution has been specified.

We agree, but the reason that I'm not bothered with refuting the particular erroneous arguments presented is that there is no need; what is needed is for its supporters to provide a clear explanation of why they think this argument is valid.

If we can show an argument is wrong, and we trust in the consistency of (Bayesian) statistics, and of reality at a macroscopic level, then we don't have to show where a specific argument goes astray, just that it does.

I have this discussion at regular intervals with my significant other and her mother. At regular intervals they receive mail telling them that they have won multiple mega-bucks in some draw that they have never heard of, all they have to do to claim their prize is ... My meta argument runs: the company organising this "thing" has a business model which involves them making money. From what they say we cannot see how that is possible, so the offer cannot be what it seems and so is a scam. However this explanation never seems to satisfy them, they always want to know how it is a scam - if I can't explain (can't be bothered to find the detail is the truth) they still think this one may be different, and are also reluctant to apply the argument themselves in future (and so avoid bothering me with it).