Tuesday, August 26, 2014

I still don't understand the philosophy of Bayesian probability

Brad DeLong is having an extremely fascinating conversation with an E.E. Doc Smith deus ex machina character, an emulation of a Princeton professor, looser emulations of two famous dead probabilists, and a made-up Greek mediator himself about the philosophy of Bayesian probability (see also here). DeLong focuses on the question of whether probabilities should be "sharp" - i.e., whether we should always say "I believe the probability of the event is x%" (as Bayesians always do), or whether we should say something along the lines of "I believe the probability of the event is between x% and y%."

But I want to focus on a deeper question, which is: What is a probability in the first place? I mean, sure, it's a number between 0 and 1 that you assign to events in a probability space. But how should we use that mathematical concept to represent events in the real world? What observable things should we represent with those numbers, and how should we assign the numbers to the things?

The philosophy of Bayesian probability says that probabilities should be assigned to beliefs. But are beliefs observable? Only through actions. So one flavor (the dominant flavor?) of Bayesian probability theory says that you observe beliefs by watching people make bets. As DeLong writes:

Thomas Bayes: It is simple. [Nate Silver assigning a 60% probability to a GOP takeover of the Senate in 2014] means that Nate Silver stands ready to bet on [Republican] Senate control next January at odds of 2-3.

Thrasymakhos: “Stands ready”?

Thomas Bayes: Yes. He stands ready to make a (small) bet that the Majority Leader of the Senate will be a Republican on January 5, 2015 if he gets at least 2-3 odds, and he stands ready to make a (small) bet that the Majority Leader of the Senate will not be a Republican on January 5, 2015 if he gets at least 3-2 odds.

DeLong is very careful to write "a (small) bet". If he wrote "a bet", we would have to introduce Nate Silver's risk aversion into our interpretation of the observed action, if the bet size were large. DeLong is assuming that a small bet will get rid of Silver's risk aversion.

However, there's a problem: DeLong's assumption, though characteristic of the decision theory used in most economic models, does not fit the evidence. People do seem to be risk-averse over small gambles. One (probably wrong) explanation for this is prospect theory. Loss aversion (one half of prospect theory) makes people care about losing, no matter how small the loss is. To back out beliefs from bets, you need a model of preferences. And that model might be right for one person at one time, but wrong for other people and/or other times!
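To make the point about backing beliefs out of bets concrete, here is a minimal sketch. It assumes a deliberately crude preference model in which losses are weighted by a constant factor (the `loss_aversion` parameter and both function names are illustrative, not taken from DeLong's dialogue or from prospect theory proper). The same 60% belief produces different break-even odds depending on that factor, so an observer who assumes risk neutrality recovers the wrong probability.

```python
def breakeven_payout(p, loss_aversion=1.0):
    """Smallest payout per 1 staked at which the agent accepts a bet on the event.

    The agent believes the event has probability p and weighs losses
    `loss_aversion` times as heavily as gains (1.0 means risk-neutral).
    The bet is accepted when p * payout - loss_aversion * (1 - p) >= 0.
    """
    return loss_aversion * (1.0 - p) / p


def naive_implied_probability(payout):
    """Belief an observer infers from break-even odds, assuming risk neutrality."""
    return 1.0 / (1.0 + payout)


belief = 0.6
for lam in (1.0, 2.0):  # 2.0 is a hypothetical loss-aversion weight
    payout = breakeven_payout(belief, lam)
    inferred = naive_implied_probability(payout)
    print(f"loss aversion {lam}: needs {payout:.3f} per 1 staked, "
          f"observer infers p = {inferred:.3f}")
```

With a weight of 1.0 the break-even payout is 2/3 per unit staked (the 2-3 odds in the dialogue) and the observer recovers 0.6. With a weight of 2.0 the observer would infer roughly 0.43 from the same underlying belief.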

But isn't that just a practical, technological problem? Why do we need real-world observation in order to define a philosophical notion? Well, we don't. We already defined a probability as a real number between 0 and 1 (which gets assigned to the latter slot in the tuples that are the elements of a probability measure). That's fine. But the Bayesian philosophical definition of probability, if it is to be more than "a number between 0 and 1," seems like it has to include a scientific component. The Bayesian notion of "probability as belief" explicitly posits a believer, and ascribes the probability to that real, observable entity (note: This is also why I think the "Weak Axiom of Revealed Preference" is not an axiom). If we can't observe the probability, then it doesn't exist - or, rather, it goes back to just being "a number between 0 and 1".

So can't we just posit a hypothetical purely rational person, and define beliefs as his bet odds? Well, it seems to me that this will probably lead to circular reasoning. "Rational" will probably be defined, in part, as "taking actions based on Bayesian beliefs." But the Bayesian beliefs, themselves, will be defined based on the actions taken by the person! This means that imagining this purely rational person gets us nowhere. Maybe there's a way around this, but I haven't thought of it.

Does all this mean that the definition of Bayesian probability is logically incoherent? No. It means that defining Bayesian probability without reference to preferences (or other decision-theoretical rules that stand in for preferences) is scientifically useless. In physics, a particle that interacts with no other particles - and is hence unobservable, even indirectly - might as well not exist. So by the same token, I claim that Bayesian probabilities might as well not exist independently of other elements of the decision theory in which they are included. You can't chop decision theory up into two parts; it's all or nothing.

I assume philosophers and decision-theory people thought of this long ago. In fact, I'm probably wrong; there's probably some key concept I'm missing here.

But does it matter? Well, yes. If I'm right, it means the argument over whether stock prices swing around because of badly-formed beliefs or because of hard-to-understand risk preferences is pretty useless; there's no fundamental divide between "behavioral" and "rational" theories of asset pricing.

It's also going to bear on the more complicated question Brad is thinking about. If you're talking to people who make decisions differently than you do, it might not be a good idea to report a number whose meaning is conditional on your own decision-making process (which your audience does not know). So that could be a reason not to report sharp probabilities to the public, even if you would make your own decisions in the standard Bayesian-with-canonical-risk-aversion way. But what you should do instead, I'm not sure.

You are right that WARP is not an axiom. Certainly not by Aristotle's or Euclid's definitions. Aristotle held that an axiom was a fact so unquestionably true that no student would challenge it, and an assumption used not in one specific area of study but in all fields of study. WARP satisfies neither criterion: not only has WARP been challenged in serious academic research (so it's not unquestionably true), it is also not applicable to all fields; it's really just an economics topic. All of this is to say that WARP is actually just a "postulate," a maintained assumption for a limited field of study. So, it should be called WPRP.

Anonymous, your comment points out the utter tragedy that blogspot does not allow upvoting or something similar. There is no doubt that it is but a "klaim". Since 2008 how many economists have said to themselves, "as God is my witness, I thought those klaims could fly"?

Aristotle's definition of an axiom is not what modern-day mathematicians take an axiom to be. The axioms of Euclidean geometry are sufficient conditions for Euclidean spaces, but are not unconditional truths.

I think the best response from a Bayesian would be to say that "rational" isn't necessarily defined by "takes actions based on Bayesian probability." Rather, it would be something more like "isn't subject to a Dutch book" or "can't get money pumped." From that definition of rationality, the idea that a rational agent has to operate from something like Bayesian probabilities falls out.

Of course, this could be construed as placing a constraint on preferences: people cannot have beliefs that allow them to lose money in every state of the world. But it doesn't necessitate a model as restrictive as traditional expected utility (an agent could be loss averse and still avoid a Dutch book).
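For readers who haven't met the Dutch book idea, a toy sketch may help. It assumes an agent who will buy or sell a $1 ticket on an event A and on its complement at quoted prices (the function name and the prices are made up for illustration). If the two prices don't sum to 1, a bookie can lock in the same loss for the agent whichever way A turns out.

```python
def agent_net_by_state(price_a, price_not_a):
    """Agent's net payoff in each state after a bookie exploits the quoted prices.

    The agent is assumed to buy or sell a $1 ticket on A at price_a and a
    $1 ticket on not-A at price_not_a. Exactly one of the two tickets pays out.
    """
    total = price_a + price_not_a
    if total > 1:
        # Bookie sells both tickets to the agent: agent pays `total`, collects $1.
        net = 1 - total
    elif total < 1:
        # Bookie buys both tickets from the agent: agent collects `total`, pays $1.
        net = total - 1
    else:
        # Coherent prices: no guaranteed loss is available.
        net = 0.0
    return {"A happens": net, "A does not happen": net}


print(agent_net_by_state(0.6, 0.5))  # incoherent: prices sum to 1.1
print(agent_net_by_state(0.6, 0.4))  # coherent: prices sum to 1
```

Note that the check only constrains the prices to behave like probabilities. It says nothing about how the agent values gains versus losses, which is the sense in which avoiding a Dutch book is weaker than full expected utility.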

Yeah, I've had a good bunch of back-and-forth conversations with Bayesians about CPT, and I never found a really good response from them on that point. I guess I more meant to say that they seem to have some extension of the Dutch book idea in mind when talking about "rationality" (which, depending on how that extension works, might wind up just being circular).

Yeah, I guess I should have said that Bayesians would view "rationality" as some kind of extension of the idea of "not-getting-dutch-booked," and they would (maybe wrongly) object to that reasoning being called circular.

But in any case, I don't even buy the Dutch Book argument in the first place, since in the real world there are strategic interactions. Remember how noise traders earn higher expected returns than arbitrageurs in a DeLong-Shleifer-Summers-Waldmann model? Sure, they earn lower risk-adjusted returns, but over time, their higher non-adjusted returns will make them like Kelly bettors and drive the arbitrageurs right out of the market! Where's the Dutch Book now? ;-)

Also, check out this old-ish paper on coherent preferences (the generalization of the dutch-book argument): http://www.jstor.org/discover/10.2307/2958578?uid=3739808&uid=2&uid=4&uid=3739256&sid=21104626866523

One of DeLong's characters tried to talk about Lady Luck drawing balls from an urn, which DeLong N Silver rejected. But what's wrong with the idea that there are 28 billion parallel universes that will exist in January 2015, and we don't know which one of them our observation will collapse into the one we inhabit? In 16.8 billion (60%) of them, a bet on the R's will pay off, while in 40%, for whatever reason, it will not. Under any concave monotonic utility, the bet will be made. (Presumably, DeLong N Silver is untroubled by prospect theory, as his real concern is his reputation as a sage, not making money, so he pushes his actual belief as his prediction.)

I may be wrong, but the point made in that article is that the many worlds interpretation of quantum mechanics doesn't give a reason for the Born rule, not that it doesn't define probability (actually, many worlds seems to be a frequentist definition of probability).

I think I see what you are saying, Will, and I agree with your point as stated, but I don't agree that that is how a frequentist would solve the Schrodinger equation. It's not like the mathematics of probability using a frequentist interpretation or a Bayesian interpretation of probability are different. If you calculate P = 0.5, then it will be P = 0.5 regardless of whether you are a frequentist or Bayesian. The interpretation of what P = 0.5 means is where the difference lies.

Both a frequentist and a Bayesian with a many worlds view of quantum mechanics can use the Schrodinger equation and the Born rule to turn the wave function into a probability distribution P(x). A frequentist interpretation of P(x) doesn't then go back in time and cause the frequentist to make a mathematical error in solving the Schrodinger equation (you don't solve the Schrodinger equation by counting the outcomes in the separate worlds).

Holy hell. Did you not take any decision theory during your PhD? If you want to learn about some philosophical decision theory, don't read DeLong; read someone like Itzhak Gilboa, who has written extensively on the subject and actually knows what he is talking about.

Maybe you should read the works of Laplace. He literally wrote down the philosophy of Bayesian probability. It seems to me that the concept of Bayesian statistics is grounded on the belief that it is possible to look at a certain infinitely small moment of time, to realize how all the matter and energy in the universe is moving, and to know the future of the universe ad infinitum. That's determinism. It's the situation we should strive for to make accurate predictions. But since we're human this will never be possible. But every bit of knowledge we have received up to the moment when we attempt a prediction plus a bit of guesswork and intuition can at least enable us to assign a probability of a certain forecast coming true. I guess the alternative is the approach of Sam Wang there in DeLong's example. The probability cannot be exactly determined and for now merely gives an idea of what the probability is. I don't know, is that the difference between unknown knowns and unknown unknowns, Knightian uncertainty?

The problem of the "small" bet is that it raises the question of what "small" is. When Mitt Romney attempted to make a $10,000 bet with Rick Perry at one of the primary debates of the 2012 presidential race, he did not convince the audience of his beliefs, he just showed how irrelevant those $10,000 are to him. People are more risk-averse when they have more to lose. And that depends not only on current wealth, but also on current income, age, future income, future wealth (inheritances?) and so on...

Noah, I think you might want to read David Lewis on the definition of theoretical terms for psychological states characterized by functional role, because I think that's where you're headed with your thinking on the behavioral interdependence of unobservable psychological states:

Lewis also did important and highly influential work on the distinction between subjective (Bayesian) probability and objective chances as the latter are, arguably, required by fundamental physics. He argued that subjectivists were actually the best-positioned to make sense of the latter concept:

One way forward is to (1) specify rules for the rational observer to follow in learning from her observations, (2) define her degrees of belief to be summaries of her present epistemic state, and (3) prove first that her degrees of belief satisfy the laws of probability, and second that these evolve in response to new observations per Bayes' Theorem.

E.g., at the outset, the learner takes all (mathematically) possible worlds to be candidates for the actual world, and then proceeds to rule out worlds as they turn out to contradict new observations. Her degrees of belief in X would then be the share of epistemically possible worlds (the ones that remain candidates) consistent with X. You can then easily prove that her degrees of belief are probabilities, and that when a new observation is made, her probabilities undergo a Bayesian update. Notice that betting (and so on) make no appearance here.
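The world-counting scheme just described can be sketched in a few lines. The toy universe here (three coin flips, every world counted equally) is an assumption made purely for illustration, as are the function names; the point is only that ruling out worlds reproduces the usual conditional-probability update, with no betting anywhere.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy universe: a "world" is one complete sequence of three coin flips.
worlds = list(product("HT", repeat=3))


def degree_of_belief(candidates, proposition):
    """Share of still-possible worlds in which the proposition holds."""
    consistent = [w for w in candidates if proposition(w)]
    return Fraction(len(consistent), len(candidates))


def observe(candidates, observation):
    """Rule out every world that contradicts the new observation."""
    return [w for w in candidates if observation(w)]


def two_or_more_heads(world):
    return world.count("H") >= 2


def first_is_heads(world):
    return world[0] == "H"


prior = degree_of_belief(worlds, two_or_more_heads)
remaining = observe(worlds, first_is_heads)
posterior = degree_of_belief(remaining, two_or_more_heads)

# Conditional probability: P(X | E) = P(X and E) / P(E), computed on the original worlds.
joint = degree_of_belief(worlds, lambda w: two_or_more_heads(w) and first_is_heads(w))
evidence = degree_of_belief(worlds, first_is_heads)

print(prior, posterior, joint / evidence)
```

The belief in "two or more heads" moves from 1/2 to 3/4 after the first flip is seen to be heads, and the share of surviving worlds matches the ratio P(X and E) / P(E).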

In practice, when I say the probability of heads on a coin flip is 0.5, then, what I've got in mind is a model of the rational observer's prior distribution, which is itself the posterior distribution following every observation to date, and then I combine that with a likelihood and arrive at 0.5. In other words, Bayesian probability statements are true or false inasmuch as they correctly describe the present epistemic state of the rational observer, and the rational observer's epistemic state is a well-defined objective thing, just one that is computationally impossible for us to pin down exactly. So we model it.

Yes, observed betting behavior will almost always incorporate some degree of risk aversion (or risk-seeking in the case of some people), but that doesn't kill the idea of a "true" probability, or a belief about expected probability. The probability of a fair 6-sided die coming up 3 is 1/6. Someone might need 7:1 odds to bet on a single roll coming up 3, or 1:5 odds to bet on a single roll not coming up 3, but those decisions are informed by a belief that the likelihood of the outcome being 3 is 1/6. If someone is forced to set odds without knowing which side of the bet they will be forced to take (like a bookie setting odds on a sports game), the rational thing to do consistent with one's beliefs would be to set the odds at 5:1 against. It would be irrational to set the odds otherwise, because whoever gets to pick a side would take the one with positive expected value, and over a long period of time you would lose money. I don't see the difficulty in applying this framework to other events. Given all the information available to a person at the time, probability is where that person would set odds if he didn't get to choose which side of the bet he would be on.
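A quick check of the die arithmetic, as a sketch (the function name is illustrative). For an event with probability 1/6, the quote at which neither side has an edge is 5:1 against; at any other quote one side has a positive expected profit, and a counterparty who gets to choose sides will take it.

```python
from fractions import Fraction


def expected_profit_backer(p, odds_against):
    """Expected profit per 1 staked by someone betting ON the event at odds_against:1."""
    return p * odds_against - (1 - p)


p_three = Fraction(1, 6)
for odds in (4, 5, 6, 7):
    edge = expected_profit_backer(p_three, odds)
    # The other side of the same bet earns exactly the negative of this.
    print(f"{odds}:1 against -> backer's expected profit per unit staked: {edge}")
```

Only the 5:1 line comes out at zero. A 6:1 quote hands the backer an expected 1/6 per unit staked.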

(1) Bayesian probabilities are not facts about the world (like frequentist probabilities), but facts about beliefs. (2) Beliefs exist in brains, so we can't talk about "the probability of X", but only "the probability of X according to brain Y". (3) There is no neuron or pattern of neurons in a human brain that codes for "the probability of X is p", so we can't define Bayesian probabilities with reference to the physical brain. (4) We can't consistently infer beliefs from people's decisions without (probably false) assumptions about their decision algorithms.

It seems to me that this is all true as far as it goes, but it doesn't bother me too much. I'm fine saying that a probability is somebody's subjective belief in something, even though I can't define what I mean by that at the level of neurons, or always infer it from their actions. It's still a coherent concept, and you can usually get close to putting a number on someone's belief if you try hard enough. In other words, it's a useful descriptive concept, even if it's not unambiguously defined in all cases.

But more than this, I think of Bayesian probability as an ideal. If you wanted to design an AI to reason under uncertainty, you would probably make it Bayesian (unless you had weird preferences over AIs!). And further, I think that any sort of successful reasoning under uncertainty is (at its root) some sort of kludgy computationally-bounded approximation to Bayesian reasoning. Human brains are a pretty bad implementation in a lot of ways (we don't even have explicit representations of numerical probabilities in our brains!), but to the extent they work at all, it's because they're doing something like Bayesian reasoning.

This business of using "small" bets to measure the degree of belief of an individual has always seemed unworkable to me. Opposed to risk aversion, there is a countervailing force that one might call get-out-of-bed aversion. If a bet is big enough to be interesting, it seems that it must inherently incorporate some degree of risk aversion. And even if there is some sweet spot where the stakes are high enough to induce a bet but low enough to allow risk aversion to be ignored, how are very small or very large probabilities to be measured? Either the amount to be lost will be too high or the amount to be gained too small.

And, is "small" supposed to be scale-invariant? Is a "small" bet the same size for me and for Bill Gates? If not, then "small" is endogenous to the outcome of bets. In that case, a completely risk-neutral, hyper-rational bettor will not bet according to ensemble probabilities, which are arithmetic averages, but according to geometric averages (the Kelly criterion you alluded to above.) In that case, riskiness will get tangled up in probability measurements even when the subject has no risk aversion!
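A small sketch of the arithmetic-versus-geometric distinction, using the standard Kelly formula for a bet that pays `b` per unit staked with win probability `p`. The specific numbers (a 60% even-money bet, stakes of 20% and 50% of wealth) are illustrative assumptions.

```python
import math


def kelly_fraction(p, b):
    """Fraction of wealth that maximizes expected log growth for a bet
    won with probability p that pays b per 1 staked."""
    return p - (1 - p) / b


def expected_arithmetic_return(p, b, f):
    """Ensemble-average return per bet when staking fraction f of wealth."""
    return f * (p * b - (1 - p))


def expected_log_growth(p, b, f):
    """Time-average (geometric) growth rate per bet when staking fraction f."""
    return p * math.log(1 + f * b) + (1 - p) * math.log(1 - f)


p, b = 0.6, 1.0
f_star = kelly_fraction(p, b)
for f in (f_star, 0.5):
    print(f"stake {f:.2f}: arithmetic {expected_arithmetic_return(p, b, f):+.4f}, "
          f"log growth {expected_log_growth(p, b, f):+.4f}")
```

The arithmetic average keeps rising with the stake, while the log growth rate peaks at the Kelly fraction (0.2 here) and turns negative at a 50% stake. That is one way the size of the bet gets entangled with the measurement even before any taste for risk is introduced.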

Crikey! Another economist who won't mention Keynes in connection with probability!

Keynes' interpretation was in fact Bayesian, of the "logical" school. You can see his influence on Cox and Jaynes pretty clearly. But Keynes not only disputed the claim that probability must have a numerical value (sharp or otherwise), he asserted that probability wasn't necessarily even cardinal!

You can also get risk mixed into probability even when considering only ensemble expectations. Suppose you believe that the probability of winning a bet is exactly 60%; you would be unlikely to bet your house. But most people would be quite pleased to bet one-millionth of their house on one million such independent bets. Now suppose that 60% is only a point estimate and that your credible interval is 40%-80%. Most people would no longer be so keen to make a million of these bets.
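The numbers behind this example can be sketched as follows. The assumptions are mine, not Rebonato's: the bets are even-money, the "house" is normalized to a total stake of 1, and in the uncertain case the unknown win probability is the same for every bet and is spread uniformly over the 40%-80% interval.

```python
import math


def spread_bets_summary(p, n_bets, total_stake=1.0):
    """Mean and standard deviation of total profit from n independent even-money
    bets, each staking total_stake / n_bets, with a known win probability p."""
    stake = total_stake / n_bets
    mean = n_bets * stake * (2 * p - 1)
    std = 2 * stake * math.sqrt(n_bets * p * (1 - p))
    return mean, std


def chance_of_net_loss_uniform_p(low, high):
    """With very many small bets the total is close to (2p - 1) * total_stake,
    so a net loss occurs roughly when the shared, unknown p is below 0.5.
    Assumes p is uniform on [low, high]."""
    return max(0.0, min(0.5, high) - low) / (high - low)


print(spread_bets_summary(0.6, 1))          # one bet of the whole house
print(spread_bets_summary(0.6, 1_000_000))  # a million tiny bets
print(chance_of_net_loss_uniform_p(0.4, 0.8))
```

With a known 60%, splitting the stake a million ways leaves the expected gain at 0.2 of the house and shrinks the standard deviation from about 0.98 to about 0.001. When the 60% is only a point estimate shared across all the bets, the splitting no longer helps: under the uniform assumption there is about a one-in-four chance that the whole program loses money.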

That example comes by way of Riccardo Rebonato's book Plight of the Fortune Tellers. Brad was attempting to use a critique of Rebonato's interpretation of Bayesianism by Shalizi as a stick with which to beat critics of Nate Silver. He was too busy riding his hobby horse to pay attention to his source material - lazy.

This seems like a good explanation of what I was driving at before, which was the idea that if a person uses Bayesian probabilities to make decisions, their method will likely be self-reinforcing somewhat independent of whether it is appropriate because the pseudo-random nature of the process itself will make it look valid over time.

We can only be sure this works well with truly random processes, and we know of few truly random processes in the macroscopic world.

I have a hard time reading your posts on Bayesian probability for some reason, and I struggle with why that is. I think it is because you seem to be unclear about how a Bayesian is different from a frequentist when it comes to how they conceive of probability. It seems like you overcomplicate something that to me seems rather simple (maybe I am ignorant though) and you tend to mix objectivist and subjectivist Bayesian interpretations.

In the context of Silver's 60% probability, a frequentist would view that to mean that if the same set of elections happened 100 times, 60 of those "runs" would result in Republican control of the Senate. This is because they see the unobserved parameters as fixed propensities. A Bayesian would view that same situation as meaning that given what I know (which mixes what I originally thought and what I thought after observing some evidence), there is a 60% chance that Republicans will control the Senate after the elections. That probability isn't a frequency; rather, it is a reflection of my certainty about the potential outcome. I prefer the Bayesian conception because the election isn't going to happen 100 times. It's going to happen once, period. It's not like flipping a coin or playing cards, so fixed frequencies are irrelevant and confusing. What matters is my state of knowledge (expressed as a probability) after accounting for my biases and considering the available evidence. I fail to see what's confusing about that philosophy.

With respect to DeLong's issue: should the probability be expressed as a range, or as a point estimate? To me it doesn't really matter because both give me the same information. Saying the probability is 60%, or between 50% and 70%, or between 40% and 80%, tells me that you are quite uncertain about what is going to happen. However, if you give me 2-3 odds, I take the bet because that's higher than what I would consider a "fair" bet, which is 3-5.

I think this is a good explanation. However, I would offer different advice.

I don't call this a probability, rather I call it a degree of certainty. Using this method, a bettor should take 50-50 odds with a small bet. If they are 80% sure, they should take a 50-50 bet with a larger amount, therefore a larger risk, but they should not change the odds at which they are willing to bet.

@jefftopia, I don't see how the concept of fixed frequencies in repeated trials makes sense at all in the case of a one-time election, or for that matter many scientific applications (hypothesis testing in particular). Again, this isn't coin flipping or card playing where, admittedly, fixed propensities make sense. Lastly, just because it is easier to conceptualize frequencies or easier for you to understand them doesn't make the Bayesian conception of probability incoherent.

"Well, it seems to me that this will probably lead to circular reasoning. "Rational" will probably be defined, in part, as "taking actions based on Bayesian beliefs." But the Bayesian beliefs, themselves, will be defined based on the actions taken by the person!"

Lucas and Sargent beat you to this point about 4 decades ago, only they (a) weren't explicit about it and (b) considered it a feature rather than a bug.

Bayesian probability is a way of distilling information about the real world given a set of measurements and a model of the real world. When Nate Silver says that the Republicans have a 60% chance of winning the Senate, he is stating that, given the polling results available, 60% of the possible election results consistent with that polling information result in Republican control. Of course, he isn't just using the polling results. He also has a model using demographic information and information he has gathered about the predictive effectiveness and methodologies of the various polls.

You get this a lot in Bayesian probabilities: observations and models.

If you have a standard 52 card deck, you can say the probability of the second card drawn being an ace is 4/51 if the first card was not an ace, but only 3/51 if the first card was an ace. In this example we have the advantage of having a perfect model of a deck of cards.
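The card arithmetic can be checked with exact fractions. This is just a sketch of the "perfect model" case; the function name is illustrative.

```python
from fractions import Fraction


def p_second_is_ace(first_was_ace):
    """Probability the second card is an ace, given what the first card was."""
    aces_left = 3 if first_was_ace else 4
    return Fraction(aces_left, 51)  # 51 cards remain after the first draw


p_first_ace = Fraction(4, 52)

# Averaging over the unseen first card recovers the unconditional probability.
unconditional = (p_first_ace * p_second_is_ace(True)
                 + (1 - p_first_ace) * p_second_is_ace(False))

print(p_second_is_ace(False), p_second_is_ace(True), unconditional)
```

`Fraction` reduces 3/51 to 1/17 when printing. If you have not seen the first card, the second card is an ace with probability 1/13, the same as the first: the observation is what moves the number.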

If you get a negative result from some medical test, your doctor will tell you the probability that you actually have the disorder. Similarly, if you get a positive result, you may be told the odds that you are in fact healthy. This probability is based on a statistical model of the prevalence of the disorder and the reliability of the test.
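Here is a minimal sketch of that calculation using Bayes' theorem. The prevalence, sensitivity, and specificity values are invented for illustration and do not describe any real test.

```python
def p_disorder_given_result(prevalence, sensitivity, specificity, positive):
    """P(disorder | test result) from the prevalence and the test's reliability.

    sensitivity = P(positive | disorder), specificity = P(negative | no disorder).
    """
    if positive:
        p_result_if_sick = sensitivity
        p_result_if_healthy = 1 - specificity
    else:
        p_result_if_sick = 1 - sensitivity
        p_result_if_healthy = specificity
    numerator = p_result_if_sick * prevalence
    return numerator / (numerator + p_result_if_healthy * (1 - prevalence))


# Hypothetical numbers: 1% prevalence, 90% sensitivity, 95% specificity.
after_positive = p_disorder_given_result(0.01, 0.90, 0.95, positive=True)
after_negative = p_disorder_given_result(0.01, 0.90, 0.95, positive=False)
print(f"P(disorder | positive) = {after_positive:.3f}")
print(f"P(disorder | negative) = {after_negative:.4f}")
```

Under these made-up numbers a positive result still leaves the patient more likely healthy than not (about 15% chance of the disorder), which is exactly the kind of answer that depends on both the observation and the model.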

You are absolutely right about Bayesian probability being tightly linked to the real world and observations. If there is no underlying world model or no method of observation, then Bayesian probability is inapplicable. I think there is a problem with using the word "belief". This word opens up a can of worms. It might be better to use a phrase like "model results" or "modeled estimate". A model doesn't have to be rational. You don't have to "believe" the model, you just need to be able to apply it. Using words like "belief" invokes all sorts of messy philosophy, or worse, gets one into doxastic logic which is rather far afield.

I think Sam Wang's issue is less the philosophy than the expression of uncertainty as a formal probability. A percentage chance implies that you know all possible states (e.g., that the die has six sides, not eight or twenty). In the real world we often do not know just how many sides the die has (although we may be able to set some bounds). Brains deal with this by first operating as analogue rather than digital computers and secondly by linking the issues at stake together. So our beliefs are not points of probability but assemblages of states continuously varying not only in relative weight but also in relationship to one another. Expressing this as a percentage is a drastic simplification. It is not hard, for instance, to think of possible events which could alter not only the outcome in one Senate race but in many or all of them, or even events which render the election as a whole moot. These events cannot, as Keynes noted, be sensibly assigned any definite mathematical probability.

Try reading Jaynes: "Probability Theory: The Logic of Science". The Bayesian believer is even more of a metaphor than is immediately noticed, a metaphor taken too seriously these days.

"Obviously, the operation of real human brains is so complicated that we can make no pretense of explaining its mysteries; and in any event we are not trying to explain, much less reproduce, all the aberrations and inconsistencies of human brains. That is an interesting and important subject; but it is not the subject we are studying here. Our topic is the normative principles of logic; and not the principles of psychology or neurophysiology. To emphasize this, instead of asking, 'How can we build a mathematical model of human common sense?' let us ask, 'How could we build a machine which would carry out useful plausible reasoning, following clearly defined principles expressing an idealized common sense?'"

@KV you haven't demonstrated any understanding of the subject matter. You're barely even coherent and worse, you're acting like a dick. Maybe go read up on the Dunning-Kruger effect before you study what Bayesianism is actually about.

I have dealt with Bayesian analysts in my work; I speak from experience; belief has no probability and Bayesian analysis is BS par excellence, like many who take two data vectors and run a correlation and BS about inventing a new science. I would much rather prefer a seed value when we cannot model or estimate a probability of an event; and, say it is a guess, and not contaminate sciences by beliefs.

By the way, if I am barely even coherent, how the heck did you figure that? By some Bayesian belief?

The mathematical approach is simple: take the model parameters as random variables belonging to some distribution, hold the data as a fixed outcome of a model in those parameters. A frequentist will differ in at least one respect: they ignore the fact that the parameter is in a probability space even though its estimate and confidence interval are drawn from it (see Box, Hunter and Hunter). As far as probabilities go, posit the Beta distribution for your model parameter (the marginal distribution) theta. Maybe a bet on theta has more to do with your confidence in your models than with whatever the future event is?
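A sketch of what positing a Beta distribution for theta looks like in practice. It assumes the data are a count of successes and failures with a binomial likelihood, which the comment above does not actually specify; the prior and the data are invented for illustration.

```python
def beta_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after binomial data, starting from Beta(alpha, beta)."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


prior = (1, 1)  # hypothetical flat prior on theta
for successes, failures in ((6, 4), (60, 40)):
    a, b = beta_update(*prior, successes, failures)
    print(f"{successes}/{successes + failures} successes: "
          f"mean {beta_mean(a, b):.3f}, variance {beta_variance(a, b):.5f}")
```

Both data sets point to a theta near 0.6, but the variance of the posterior shrinks by roughly a factor of eight with ten times the data. The width of that distribution, rather than its mean, is what a statement of confidence in the model would be about.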

...take the model parameters as random variables belonging to some distribution, hold the data as a fixed outcome of a model in those parameters...

This presumes a model; its parameters are random variables...which are presumed to belong to undefined distributions......collect data without any rhyme or reason and assign these data as the "outcome" of an undefined model...

Formal mathematics is based on provable truth. In mathematics, any number of cases supporting a conjecture, no matter how large, is insufficient for establishing the conjecture's veracity, since a single counterexample would immediately bring down the conjecture. Conjectures disproven through counterexample are sometimes referred to as false conjectures (cf. Pólya conjecture and Euler's sum of powers conjecture).

http://en.wikipedia.org/wiki/Conjecture

Belief is a state of the mind, treated in various academic disciplines, especially philosophy and psychology, as well as traditional culture, in which a subject roughly regards a thing to be true. "Dispositional and occurrent belief" is the contextual activation of a belief system in specific thoughts or ideas.

Did you think about this problem from a generalized utility function angle? People make decisions based on their U(). Past decision behavior is an excellent proxy for future decision making. Under an assumed U(), you can still say a lot about individual level risk preferences. I know you'll attack the word "assume", but there is ample evidence to show that a sample of U()s from a cohort (say by income group) will display remarkable homogeneity.

This is a very big mistake, and from this starting point no good consequences can be derived.

I think that I had the benefit of being taught statistics and econometrics by people who had awesome learning and insight (Bayesians BTW), and the one thing they kept banging on about all the time to us students was that *stochastic* numbers are not at all the same as real numbers, even if they are written the same way; and in particular that the algebra of stochastic numbers is not the same as that of real numbers.

In particular because stochastic numbers are involved in expressing the properties of samples rather than populations (the average of a sample is a completely different concept from the average of a population).

Some of the commenters above seem to base implicitly some of their points on an understanding that the algebras of samples and populations are conceptually very different and should never be confused. This is good.

The idea that P(T) could be based on a mere hunch may seem unsettling. After all, different people may have very different hunches about the truth of a theory, and so may begin the process with very different values for P(T)! In a way, however, this does not matter very much. This is because of the phenomenon of the washing out of the priors. If you and I begin with very different evaluations of T, but we agree on P(E|T) and P(E|¬T), then our posterior probabilities will get closer and closer to each other the more evidence we investigate. In the long run, we will end up with the same assessment of T even if we started out with very different guesses.

Before I give the link, I would like to stress these words in the above para:

...the more evidence we investigate. In the long run,...

The question is how much more evidence we MUST investigate, and for how LONG...
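One way to put rough numbers on "how much" is to count updates until two observers agree. This is a sketch under strong assumptions that are mine, not the quoted author's: both observers see the same evidence E every time, they share P(E|T) and P(E|¬T), and "agreement" means their posteriors differ by less than one percentage point. The priors and likelihoods are invented.

```python
def update(prior, p_e_given_t, p_e_given_not_t):
    """One Bayesian update of P(T) after observing evidence E."""
    numerator = p_e_given_t * prior
    return numerator / (numerator + p_e_given_not_t * (1 - prior))


def observations_until_agreement(prior_a, prior_b, p_e_given_t, p_e_given_not_t,
                                 tolerance=0.01, max_steps=10_000):
    """Number of identical observations of E needed before the two posteriors
    differ by less than `tolerance`; None if that never happens within max_steps."""
    a, b = prior_a, prior_b
    for n in range(max_steps + 1):
        if abs(a - b) < tolerance:
            return n
        a = update(a, p_e_given_t, p_e_given_not_t)
        b = update(b, p_e_given_t, p_e_given_not_t)
    return None


# Hypothetical priors of 0.9 and 0.1, with strong versus weak evidence for T.
strong = observations_until_agreement(0.9, 0.1, 0.8, 0.4)    # likelihood ratio 2
weak = observations_until_agreement(0.9, 0.1, 0.55, 0.5)     # likelihood ratio 1.1
print(strong, weak)
```

The priors do wash out, but the waiting time depends heavily on how diagnostic each observation is: with a likelihood ratio close to 1, it takes several times as many observations to reach the same level of agreement.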