Monday, February 11, 2013

I've been talking recently about robust statistics, and the consequences of replacing means with medians. However, I've only looked at this in a fairly limited way, asking about one particular distribution (the bell curve). Mean values are everywhere in statistics; perhaps to a greater degree than you realize, because we often refer to the mean value as the "expected value". It's a simple alias for the same thing, but that may be easy to forget when we are taking expectations everywhere.

In some sense, the "expectation" seems to be a more basic concept than the "mean". We could think of the mean as simply one way of formalizing the intuitive notion of expected value. What happens if we choose a different formalization? What if we choose the median?

The post on altering the bell curve is (more or less) an exploration of what happens to some of classical statistics if we do this. What happens to Bayesian theory?

The foundations of Bayesian statistics are really not touched at all by this. A Bayesian does not rely as heavily on "statistics" in the way a frequentist statistician does. A statistic is a number derived from a dataset which gives some sort of partial summary. We can look at mean, variance, and higher moments; correlations; and so on. We distinguish between the sample statistic (the number derived from the data at hand) and the population statistic (the "true" statistic which we could compute if we had all the examples, ever, of the phenomenon we are looking at). We want to estimate the population statistics, so we talk about estimators; these are numbers derived from the data which are supposed to be similar to the true values. Unbiased estimators are an important concept: ways of estimating population statistics whose expected values are exactly the population statistics.

These concepts are not exactly discarded by Bayesians, since they may be useful approximations. However, to a Bayesian, a distribution is a more central object. A statistic may be a misleading partial summary. The mean (/mode/median) is sort of meaningless when a distribution is multimodal. Correlation does not imply... much of anything (because it assumes a linear model!). Bayesian statistics still has distribution parameters, which are directly related to population statistics, but frequentist "estimators" are not fundamental because they only provide point estimates. Fundamentally, it makes more sense to keep a distribution over the possibilities, assigning some probability to each option.

However, there is one area of Bayesian thought where expected value makes a great deal of difference: Bayesian utility theory. The basic law of utility theory is that we choose actions so as to maximize expected value. Changing the definition of "expected" would change everything! The current idea is that in order to judge between different actions (or plans, policies, designs, et cetera) we look at the average utility achieved with each option, according to our probability distribution over the possible results. What if we computed the median utility rather than the average? Let's call this "robust utility theory".

From the usual perspective, robust utility would perform worse: to the extent that we take different actions, we would get a lower average utility. This raises the question of whether we care about average utility or median utility, though. If we are happy to maximize median utility, then we can similarly say that the average-utility maximizers are performing poorly by our standards.

At first, it might not be obvious that the median is well-defined for this purpose. The median value coming from a probability distribution is defined to be the median in the limit of infinite independent samples from that distribution, though. Each case will contribute instances in proportion to its probability. What we end up doing is lining up all the possible consequences of our choice in order of utility, with a "width" determined by the probability of each, and taking the utility value of whatever consequence ends up in the middle. So long as we are willing to break ties somehow (as is usually needed with the median), it is actually well-defined more often than the mean! We avoid problems with infinite expected value. (Suppose I charge you to play a game where I start with a $1 pot, and start flipping a coin. I triple the pot every time I get heads. Tails ends the game, and I give you the pot. Money is all you care about. How much should you be willing to pay to play?)
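Here is a minimal sketch of that "line everything up and take the middle" rule, applied to the coin game in the parenthetical. The function names, the truncation at `max_flips`, and the choice to break ties toward the lower utility are my own assumptions for illustration; nothing here is a standard library routine.

```python
def median_utility(outcomes):
    """outcomes: list of (utility, probability) pairs whose probabilities sum to 1.

    Sort by utility, accumulate probability "width", and return the first
    utility at which the running total reaches one half. Ties are broken
    toward the lower utility (an arbitrary choice for this sketch).
    """
    total = 0.0
    for utility, prob in sorted(outcomes):
        total += prob
        if total >= 0.5:
            return utility
    raise ValueError("probabilities sum to less than 0.5")


def mean_utility(outcomes):
    """Ordinary expected value: probability-weighted average of utilities."""
    return sum(utility * prob for utility, prob in outcomes)


def tripling_game(max_flips):
    """The coin game, truncated after max_flips heads so the list is finite.

    k heads followed by a tail has probability 0.5**(k+1) and pays 3**k.
    The leftover mass 0.5**max_flips (all heads) pays 3**max_flips.
    """
    outcomes = [(3 ** k, 0.5 ** (k + 1)) for k in range(max_flips)]
    outcomes.append((3 ** max_flips, 0.5 ** max_flips))
    return outcomes


if __name__ == "__main__":
    for n in (10, 20, 40):
        game = tripling_game(n)
        print(n, mean_utility(game), median_utility(game))
```

The mean keeps growing as the truncation point is pushed out, while the median stays put: half the time the very first flip is tails and you walk away with the starting pot.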

Since the median is more robust than the mean, we also avoid problems dealing with small-probability but disproportionately high-utility events. The typical example is Pascal's Mugging. Pascal walks up to you and says that if you don't give him your wallet, God will torture you forever in hell. Before you object, he says: "Wait, wait. I know what you are thinking. My story doesn't sound very plausible. But I've just invented probability theory, and let me tell you something! You have to evaluate the expected value of an action by considering the average payoff. You multiply the probability of each case by its utility. If I'm right, then you could have an infinitely large negative payoff by ignoring me. That means that no matter how small the probability of my story, so long as it is above zero, you should give me your wallet just in case!"

A Robust Utility Theorist avoids this conclusion, because small-probability events have a correspondingly small effect on the end result, no matter how high a utility we assign.
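To make the contrast concrete, here is a small sketch of Pascal's offer under both rules. The numbers are made up: I assume a probability of 1e-30 that Pascal is right, a utility of -100 for losing the wallet, and negative infinity for hell. The median helper uses the same lining-up rule described earlier, with ties broken toward the lower utility.

```python
def mean_utility(outcomes):
    """Probability-weighted average of utilities."""
    return sum(utility * prob for utility, prob in outcomes)


def median_utility(outcomes):
    """First utility (in sorted order) at which cumulative probability reaches 0.5."""
    total = 0.0
    for utility, prob in sorted(outcomes):
        total += prob
        if total >= 0.5:
            return utility
    raise ValueError("probabilities sum to less than 0.5")


HELL = float("-inf")      # assumed utility of eternal torture
P_PASCAL_RIGHT = 1e-30    # hypothetical, tiny but nonzero

# Each action is a list of (utility, probability) outcomes.
keep_wallet = [(HELL, P_PASCAL_RIGHT), (0.0, 1.0 - P_PASCAL_RIGHT)]
give_wallet = [(-100.0, 1.0)]

if __name__ == "__main__":
    print("mean:  ", mean_utility(keep_wallet), mean_utility(give_wallet))
    print("median:", median_utility(keep_wallet), median_utility(give_wallet))
```

Under the mean, keeping the wallet is infinitely bad however small the probability, so the averager pays up. Under the median, the hell outcome occupies a sliver far too thin to reach the middle, so keeping the wallet wins.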

Now, a lot of nice results (such as the representation theorem) have been derived for average utilities over the years. Naturally, taking a median utility might do all kinds of violence to these basic ideas in utility theory. I'm not sure how it would all play out. It's interesting to think about, though.

Sunday, February 10, 2013

Hi all! This month, I have the honor of hosting the monthly blog review, Philosopher's Carnival. This reviews some of the best philosophy postings of the previous month.

Two Metaphysical Pictures, by Richard Yetter Chappell of Philosophy et cetera, outlines a broad classification of metaphysical theories into two different types. (For the record, I prefer the second type! However, as one comment rightly points out, we should avoid lumping views together to create false dichotomies in this way.)

Special relativity and the A-theory, at Alexander Pruss's Blog, discusses the relationship between the philosophical view of time and what we know from physics. (This is related to the two views of the previous post, at least to the extent that you buy the false dichotomy.)

Metaphysical Skepticism a la Kriegel, by Eric Schwitzgebel of The Splintered Mind, reviews a paper by Uriah Kriegel which suggests that although there may be meaningful metaphysical questions about which there are true and false answers, we cannot arrive at knowledge of those answers by any means. In particular, there is no means by which we can come to know if sets of objects have a literal existence or are merely mental constructs.

Grim Reapers vs. Uncaused Beginnings, by Joshua Rasmussen of Prosblogion, gives a discussion of some "grim reaper" arguments. The Grim Reaper argument is an argument which is supposed to show the implausibility of an infinite past. Joshua shows that a very similar argument would conclude that a finite past with an uncaused beginning is equally implausible.

Monday, February 4, 2013

Taleb and others have mentioned that the bell curve (or Gaussian) does not deal with outliers well; it gives them a very small probability, and the parameter estimates end up being highly dependent on them.

Yet, one of the justifications of the Gaussian is that it's the max-entropy curve for a given mean and standard deviation.

Entropy is supposed to be a measure of the uncertainty associated with a distribution; so, shouldn't we expect that the max-entropy distribution would give as high a probability to outliers as possible?

There are several answers.

First: a basic problem is that phrase "a given mean and standard deviation". In particular, to choose a standard deviation is to choose an acceptable range for outliers (in some sense). If we have uncertainty about the standard deviation, it turns out the resulting curve has a polynomial decay rather than an exponential one! (This means distant outliers are far more probable.) Essentially, estimating deviations from data (using maximum-likelihood) makes us extremely overconfident that the data will fall in the range we've experienced before. A little Bayesian uncertainty (which still estimates from data, but admits a range of possibilities) turns out to be much less problematic in this respect.
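To get a feel for the difference between exponential and polynomial decay, here is a rough numerical sketch. It assumes the textbook case where uncertainty about the variance is expressed with an inverse-gamma prior, which makes the resulting curve a Student-t distribution; the post does not name a particular prior, so treat that (and the choice of 3 degrees of freedom) as my assumption. Both densities are written out by hand using only the standard library.

```python
import math


def normal_logpdf(x):
    """Log density of the standard normal at x."""
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)


def student_t_logpdf(x, nu):
    """Log density of the Student-t with nu degrees of freedom at x.

    The tail behaves like |x|**-(nu + 1): polynomial, not exponential.
    """
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1) / 2 * math.log1p(x * x / nu))


if __name__ == "__main__":
    for x in (2.0, 5.0, 10.0, 20.0):
        print(x, normal_logpdf(x), student_t_logpdf(x, nu=3))
```

The log densities are close near the center, but the gap widens quickly as x moves out: the normal's log density falls quadratically, while the Student-t's falls only logarithmically.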

This is definitely helpful, but doesn't solve the riddle: it still feels strange that the max-entropy distribution would have such a sharp (super-exponential!) decay rate. Why is that?

My derivation will be somewhat heuristic, but I felt that I understood "why" much better by working it out this way than by following other proofs I found (which tend to start with the normal and show that no other distribution has greater entropy, rather than starting with desired features and deriving the normal). [Also, sorry for the poor math notation...]

First, let's pretend we have a discrete distribution over n points, $x_1, \dots, x_n$. The result will apply no matter how many points we have, which means it applies in the limit of a continuous distribution. Continuous entropy is not the limit of discrete entropy, so I won't actually be maximising discrete entropy here; I'll maximise the discrete version of the continuous entropy formula:

$f$ maximising $-\sum_i f(x_i) \log f(x_i)$

Next, we constrain the distribution to sum to a constant, have a constant mean, and have constant variance (which also makes the standard deviation constant):

$\sum_i f(x_i) = C_1$
$\sum_i x_i f(x_i) = C_2$
$\sum_i x_i^2 f(x_i) = C_3$

To solve the constrained optimisation problem, we make Lagrange multipliers for the constraints:
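The intermediate steps are not shown at this point, so here is a sketch of how the standard argument goes. I take the objective to be the entropy with its usual sign, $-\sum_i f(x_i) \log f(x_i)$; the multiplier names $\lambda_1, \lambda_2, \lambda_3$ are my own labels, one per constraint.

```latex
% Lagrangian: entropy objective plus one multiplier per constraint
\mathcal{L} = -\sum_i f(x_i) \log f(x_i)
  + \lambda_1 \Big( \sum_i f(x_i) - C_1 \Big)
  + \lambda_2 \Big( \sum_i x_i f(x_i) - C_2 \Big)
  + \lambda_3 \Big( \sum_i x_i^2 f(x_i) - C_3 \Big)

% Stationarity with respect to each f(x_i)
\frac{\partial \mathcal{L}}{\partial f(x_i)}
  = -\log f(x_i) - 1 + \lambda_1 + \lambda_2 x_i + \lambda_3 x_i^2 = 0

% Solving for f(x_i)
f(x_i) = \exp\big( \lambda_1 - 1 + \lambda_2 x_i + \lambda_3 x_i^2 \big)
```

The multipliers are then fixed by plugging this back into the three constraints; for the result to be normalisable over the whole line, $\lambda_3$ has to come out negative.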

That's exactly the form of the Gaussian: a constant to the power of a 2nd-degree polynomial!

So, we can see where everything comes from: the exponential comes from our definition of entropy, and the function within the exponent comes from the Lagrange multipliers. The Gaussian is quadratic precisely because we chose a quadratic loss function! We can get basically any form we want by choosing a different loss function. If we constrain the fourth moment (the quantity behind kurtosis) rather than the variance, we will get a fourth degree polynomial rather than a second degree one. If we choose an exponential function, we can get a doubly exponential probability distribution. And so on. There should be some limitations, but more or less, we can get any probability distribution we want, and claim that it is justified as the maximum-entropy distribution (fixing some measure of spread). We can even get rid of the exponential by putting a logarithm around our loss function.

Last time, I mentioned robust statistics, which attempts to make statistical techniques less sensitive to outliers. Rather than using the mean, robust statistics recommends using the median: whereas a sufficiently large outlier can shift the mean by an arbitrary amount, a single outlier has the same limited effect on the median no matter how extreme its value.
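A quick sketch of that claim, using Python's standard `statistics` module. The data values and the helper name `shift_from_outlier` are made up for illustration.

```python
import statistics


def shift_from_outlier(data, outlier):
    """Return (mean shift, median shift) caused by appending one outlier."""
    sample = data + [outlier]
    mean_shift = statistics.mean(sample) - statistics.mean(data)
    median_shift = statistics.median(sample) - statistics.median(data)
    return mean_shift, median_shift


if __name__ == "__main__":
    data = [2.0, 3.0, 5.0, 7.0, 11.0]  # arbitrary example values
    for outlier in (100.0, 1e6, 1e12):
        print(outlier, shift_from_outlier(data, outlier))
```

The mean shift grows in proportion to the outlier, while the median moves by the same small step each time: the outlier only counts as "one more point on the high side".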

I also mentioned that it seems more intuitive to use the absolute deviation, rather than the squared deviation.

If we fix the absolute deviation and ask for the maximum-entropy function, we get something like $e^{-|x|}$ as our distribution. This is an ugly little function, but the maximum-likelihood estimate of the center of the distribution is precisely the median! $e^{-|x|}$ justifies the strategy of robust statistics, reducing sensitivity to outliers by making extreme outliers more probable. (The reason is: the max-likelihood estimate will be the point which minimizes the sum of the loss functions centred at each data point. The derivative at x is equal to the number of data points below x minus the number above x. Therefore the derivative is only zero when these two are equal. This is the minimum loss.)
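The parenthetical argument can be checked numerically with a brute-force sketch. Maximising the likelihood under a density proportional to $e^{-|x - c|}$ (the Laplace distribution) is the same as minimising the summed absolute deviations from the center c, just as the Gaussian case corresponds to summed squared deviations. The data and the grid of candidate centers are arbitrary choices of mine.

```python
import statistics


def abs_loss(center, data):
    """Sum of absolute deviations: negative log-likelihood under e^{-|x - c|}, up to a constant."""
    return sum(abs(x - center) for x in data)


def sq_loss(center, data):
    """Sum of squared deviations: the Gaussian counterpart, up to scale and a constant."""
    return sum((x - center) ** 2 for x in data)


data = [1.0, 2.0, 4.0, 8.0, 100.0]          # arbitrary sample with one large value
grid = [i / 10 for i in range(0, 1001)]      # candidate centers 0.0, 0.1, ..., 100.0

best_abs = min(grid, key=lambda c: abs_loss(c, data))
best_sq = min(grid, key=lambda c: sq_loss(c, data))

if __name__ == "__main__":
    print("absolute-loss minimiser:", best_abs, "median:", statistics.median(data))
    print("squared-loss minimiser: ", best_sq, "mean:  ", statistics.mean(data))
```

With an odd number of points the absolute-loss minimiser is unique and lands on the median; with an even number, any point between the two middle values does equally well, which is the tie-breaking issue that always comes with medians.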

What matters, of course, is not what nice theoretical properties a distribution may have, but how well it matches the true situation. Still, I find it very interesting that we can construct a distribution which justifies taking the median rather than the mean... and I think it's important to show how arbitrary the status of the bell curve as the maximum entropy distribution is.

Just to be clear: the Bayesian solution is not usually to think too much about what distributions might have the best properties. This is important, but when in doubt, we can simply take a mixture distribution over as large a class of hypotheses as we practically can. Bayesian updates give us nice convergence to the best option, while also avoiding overconfidence (for example, the asymptotic probability of outliers will be on the order of the most outlier-favouring distribution present).

Even so, a complex machine learning algorithm may still need a simple one as a sub-algorithm to perform simple tasks; a genetic programming approximation of Bayes may need simpler statistical tools to make an estimation of distribution algorithm work. More generally, when humans build models, they tend to compose known distributions such as Gaussians to make them work. In such cases, it's interesting to ask whether classical or robust statistics is more appropriate.