Softmax without Overflow

Softmax is another name for the generalization of the logistic sigmoid function to n outcomes. It’s popular as an activation function for neural networks and as an (inverse) link function for generalized linear models like logistic regression.

Softmax converts an arbitrary real-valued vector into a multinomial probability vector. Except when it doesn’t.

LingPipe uses softmax it in two places. First, to construct a classify.JointClassification using joint log probabilities, which requires properly scaled linear probabilities in order to construct its superclass classify.ConditionalClassification. The second is inside multinomial logistic regression, where its used to convert the linear predictors into probabilities.

Suppose we have a vector of finite numbers:

The softmax operation converts it into a vector whose entries are finite, non-negative, and sum to 1:

where the normalizing term is the usual sum:

so that by construction:

The goal is to actually compute the terms in the softmax vector. The problem? Overflow. The terms in x might be large. So large that Math.exp(x[i]) overflows. That doesn’t even need to be that big, just Math.log(Double.MAX_VALUE), or about 710. It’s especially easy to do in logistic regression under many conditions (e.g. large absolute feature values, large variance within a dimension of features, large predictive differences, weak priors with separable features).

It’s algebra month here at Alias-i, so you shouldn’t be surprised when I note that the vector x can be shifted dimensionwise without changing the softmax result:

where the additive term factors through the normalizing constant:

and thus drop out to establish the equivalence.

What we do is rescale so there’s no overflow by subtracting the maximum of the x values from each x[i]. That converts overflow into underflow. Underflow is no problem, because that rounds off to zero, which is a well-behaved floating point number. The code is roughly as follows:

The old classifier implementation used the max rather than the min to rescale, which isn’t agressive enough in that it still allows overflow when the inputs may be large an dpositive. That wasn’t a problem for classifiers when the numbers were log probabilities, because they’re already all negative; so there the scaling prevents catastrophic underflow, and we already had that right.

This method is guaranteed not to have any overflow. And it’s guaranteed not to have all zero values, too. I’m afraid it’s simply not possible to represent two numbers more than exp(711) apart or so using IEEE double-precision floating point arithmetic. And frankly, once events are relatively this unlikely, we’re not too worried about rounding to zero.

Look for the new implementation in LingPipe 4.0’s logistic regression classification function. Coming soon to a web site near you.

5 Responses to “Softmax without Overflow”

People who often deal with Bayesian estimation and inference often have a logsumexp(x) function (x=[x1,…,xn]) in their library. It’s like (Again in matlab language ):
function s=logsumexp(x)
y=max(x);
s=y+log(sum(exp(x-y)));