4 Answers
4

Logits simply means that the function operates on the unscaled output of earlier layers and that the relative scale to understand the units is linear. It means, in particular, that the sum of the inputs may not equal 1 and that the values are not probabilities (you might have an input of 5).

tf.nn.softmax produces just the result of applying the softmax function to an input tensor. The softmax "squishes" the inputs so that sum(input) = 1: it's a way of normalizing. The shape of output of a softmax is the same as the input: it just normalizes the values. The outputs of softmax can be interpreted as probabilities.
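
For example, a minimal sketch against the TF 1.x API (the input values here are made up):

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])   # unscaled scores; note they don't sum to 1
probs = tf.nn.softmax(logits)             # same shape as the input, each row sums to 1

with tf.Session() as sess:
    print(sess.run(probs))                # ~[[0.659, 0.242, 0.099]]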

In contrast, tf.nn.softmax_cross_entropy_with_logits computes the cross entropy of the result after applying the softmax function (but it does it all together in a more mathematically careful way). It's similar to the result of:

sm = tf.nn.softmax(x)
ce = -tf.reduce_sum(y_true * tf.log(sm), reduction_indices=[1])  # cross entropy against one-hot labels y_true

The cross entropy is a summary metric: it sums across the elements. The output of tf.nn.softmax_cross_entropy_with_logits on a shape [2,5] tensor is of shape [2]: one loss value per example (the first dimension is treated as the batch).
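
A small sketch of that shape behaviour (TF 1.x style, reusing the import above; the values are arbitrary):

logits = tf.constant([[1.0, 2.0, 3.0, 4.0, 5.0],
                      [0.0, 0.0, 0.0, 0.0, 1.0]])
labels = tf.constant([[0.0, 0.0, 0.0, 0.0, 1.0],
                      [1.0, 0.0, 0.0, 0.0, 0.0]])
losses = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
with tf.Session() as sess:
    print(sess.run(losses).shape)         # (2,) -- one loss value per example in the batch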

If you want to do optimization to minimize the cross entropy AND you're softmaxing after your last layer, you should use tf.nn.softmax_cross_entropy_with_logits instead of doing it yourself, because it covers numerically unstable corner cases in the mathematically right way. Otherwise, you'll end up hacking it by adding little epsilons here and there.

Edited 2016-02-07:
If you have single-class labels, where an object can only belong to one class, you might now consider using tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. This function was added after release 0.6.0.
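
A minimal sketch of the sparse variant (TF 1.x style, reusing the import above; note the labels are plain class indices rather than one-hot rows):

labels = tf.constant([2, 0])              # integer class indices, shape [batch_size]
logits = tf.constant([[1.0, 2.0, 3.0],
                      [4.0, 1.0, 0.5]])
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(losses)             # scalar training loss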

About softmax_cross_entropy_with_logits, I don't know if I am using it correctly. The result is not stable in my code: running the same code twice, the total accuracy changes from 0.6 to 0.8. cross_entropy = tf.nn.softmax_cross_entropy_with_logits(tf.nn.softmax(tf.add(tf.matmul(x, W), b)), y); cost = tf.reduce_mean(cross_entropy). But when I use another way, pred = tf.nn.softmax(tf.add(tf.matmul(x, W), b)); cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1)), the result is stable and better.
– RidaJul 12 '16 at 12:30

12

You're double-softmaxing in your first line. softmax_cross_entropy_with_logits expects unscaled logits, not the output of tf.nn.softmax. You just want tf.nn.softmax_cross_entropy_with_logits(tf.add(tf.matmul(x, W, b)) in your case.
– dgaJul 14 '16 at 21:56

6

@dga I think you have a typo in your code, the b needs to be outside of the bracket: tf.nn.softmax_cross_entropy_with_logits(tf.add(tf.matmul(x, W), b))
– jriekeSep 15 '16 at 16:16

1

what does "that the relative scale to understand the units is linear." part of your first sentence mean?
– Charlie ParkerJan 22 at 20:37

2

Upvoted, but your answer is slightly incorrect when you say that "[t]he shape of output of a softmax is the same as the input - it just normalizes the values". Softmax doesn't just "squash" the values so that their sum equals 1. It also redistributes them, and that's possibly the main reason why it's used. See stackoverflow.com/questions/17187507/…, especially Piotr Czapla's answer.
– Paolo PerrottaJun 7 at 20:13

In the output layer of your neural network, you will probably compute an array that contains the class scores for each of your training instances, such as from a computation y_hat = W*x + b. To serve as an example, below I've created a y_hat as a 2 x 3 array, where the rows correspond to the training instances and the columns correspond to classes. So here there are 2 training instances and 3 classes.
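
A sketch of such an array (TF 1.x style; the specific numbers are illustrative, chosen so that the softmax output matches the probabilities quoted below):

import tensorflow as tf
import numpy as np

sess = tf.Session()

# 2 training instances (rows) x 3 classes (columns) of raw, unnormalized scores
y_hat = tf.convert_to_tensor(np.array([[0.5, 1.5, 0.1],
                                       [2.2, 1.3, 1.7]]))
print(sess.run(y_hat))
# [[ 0.5  1.5  0.1]
#  [ 2.2  1.3  1.7]]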

Note that the values are not normalized (i.e. the rows don't add up to 1). In order to normalize them, we can apply the softmax function, which interprets the input as unnormalized log probabilities (aka logits) and outputs normalized linear probabilities.
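
Continuing the sketch above, applying the softmax (printed values are rounded):

y_hat_softmax = tf.nn.softmax(y_hat)
print(sess.run(y_hat_softmax))
# ~[[ 0.228  0.619  0.153]
#   [ 0.497  0.202  0.301]]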

It's important to fully understand what the softmax output is saying. Below I've shown a table that more clearly represents the output above. It can be seen that, for example, the probability of training instance 1 being "Class 2" is 0.619. The class probabilities for each training instance are normalized, so the sum of each row is 1.0.
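
Laid out as a table (values rounded from the softmax output above):

              Class 1    Class 2    Class 3
Instance 1    0.228      0.619      0.153
Instance 2    0.497      0.202      0.301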

So now we have class probabilities for each training instance, where we can take the argmax() of each row to generate a final classification. From the output above, we can conclude that training instance 1 belongs to "Class 2" and training instance 2 belongs to "Class 1".
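
Continuing the sketch, taking the argmax of each row gives the predicted class indices:

print(sess.run(tf.argmax(y_hat_softmax, 1)))
# [1 0]  -> column 1 ("Class 2") for instance 1, column 0 ("Class 1") for instance 2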

Are these classifications correct? We need to measure against the true labels from the training set. You will need a one-hot encoded y_true array, where again the rows are training instances and columns are classes. Below I've created an example y_true one-hot array where the true label for training instance 1 is "Class 2" and the true label for training instance 2 is "Class 3".
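
Continuing the sketch, a one-hot y_true consistent with that description:

y_true = tf.convert_to_tensor(np.array([[0.0, 1.0, 0.0],
                                        [0.0, 0.0, 1.0]]))
print(sess.run(y_true))
# [[ 0.  1.  0.]
#  [ 0.  0.  1.]]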

Is the probability distribution in y_hat_softmax close to the probability distribution in y_true? We can use cross-entropy loss to measure the error.
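
(Concretely, the per-instance cross entropy here is -sum_j y_true[i, j] * log(y_hat_softmax[i, j]), which reduces to the negative log of the probability assigned to the true class.)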

We can compute the cross-entropy loss on a row-wise basis and see the results. Below we can see that training instance 1 has a loss of 0.479, while training instance 2 has a higher loss of 1.200. This result makes sense because in our example above, y_hat_softmax showed that training instance 1's highest probability was for "Class 2", which matches training instance 1 in y_true; however, the prediction for training instance 2 showed a highest probability for "Class 1", which does not match the true class "Class 3".
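
Continuing the sketch, here are the row-wise losses and the two total losses compared in the next paragraph (a reconstruction consistent with the figures above; printed values are rounded):

loss_per_instance = sess.run(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1]))
# ~[ 0.479  1.200 ]

total_loss_1 = sess.run(tf.reduce_mean(-tf.reduce_sum(y_true * tf.log(y_hat_softmax), reduction_indices=[1])))
total_loss_2 = sess.run(tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_hat, labels=y_true)))
# both are ~0.8393433; only the last few digits differ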

Note that total_loss_1 and total_loss_2 produce essentially equivalent results with some small differences in the very final digits. However, you might as well use the second approach: it takes one less line of code and accumulates less numerical error because the softmax is done for you inside of softmax_cross_entropy_with_logits().

I get it. Why not call the function, tf.nn.softmax_cross_entropy_sans_normalization?
– auroJan 26 '16 at 6:57

5

@auro because it normalizes the values (internally) during the cross-entropy computation. The point of tf.nn.softmax_cross_entropy_with_logits is to evaluate how much the model deviates from the gold labels, not to provide a normalized output.
– erickrfMay 11 '16 at 8:52

Given that tf.nn.sparse_softmax_cross_entropy_with_logits() computes the cost of a sparse softmax layer, and thus should only be used during training, what would be the alternative when running the model against new data? Is it possible to obtain probabilities from this one?
– SerialDevMar 14 '17 at 17:33

1

@SerialDev, it's not possible to get probabilities from tf.nn.sparse_softmax_cross_entropy_with_logits. To get probabilities use tf.nn.softmax.
– NandeeshJun 30 '17 at 15:36

Adding to that, TensorFlow has optimised the combined operation of applying the activation function and then computing the cost into a single fused op. Hence it is good practice to use tf.nn.softmax_cross_entropy_with_logits() rather than applying tf.nn.softmax() yourself and then computing the cross entropy separately.

You can find a prominent difference between them in a resource-intensive model.