Month: October 2018

There are far fewer exercises in this chapter than in the previous ones and so this will be a much shorter post! Chapter 6 is on frequentist statistics, and it is clear that the author is not a fan!

6.1 Pessimism of LOOCV

Suppose we have a completely random labelled dataset (i.e. the features tell us nothing about the class labels) with \(N_1\) examples of class 1 and \(N_2\) examples of class 2, with \(N_1 = N_2\). What is the best misclassification rate that any method can achieve? What does LOOCV predict as the misclassification rate?

Let’s say \(N_1 = N_2 = N\) for notational simplicity. Since the features tell us nothing about the labels, no method can do better than a 50% misclassification rate on average. Now think about what happens when we apply leave-one-out cross-validation: each time we leave out one sample (of either class; to be concrete, say class 1). The remaining data, which is all any method has to learn from, then contains \(N-1\) examples of class 1 and \(N\) examples of class 2. A classifier that matches these empirical frequencies will therefore predict class 1 only with probability \(\frac{N-1}{2N-1} < \frac{1}{2}\). This means that the estimated misclassification rate will be \( 100 \times \frac{N}{2N-1} \% \), which is worse than the true 50% (hence the “pessimism” of LOOCV).
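This pessimism is easy to check numerically. Here is a hypothetical simulation (my own sketch, assuming as above that the classifier predicts each class with its empirical training-fold frequency):

```python
import random

def loocv_error(labels, trials_per_fold=1000, seed=0):
    """LOOCV misclassification rate for a classifier that predicts a label
    drawn from the empirical label distribution of the training fold."""
    rng = random.Random(seed)
    errors, total = 0, 0
    for i in range(len(labels)):
        train = labels[:i] + labels[i + 1:]
        p_true = train.count(labels[i]) / len(train)  # = (N-1)/(2N-1) here
        for _ in range(trials_per_fold):
            errors += rng.random() >= p_true  # predicted the wrong class
            total += 1
    return errors / total

N = 5
est = loocv_error([0] * N + [1] * N)
# analytic value: N / (2N - 1) = 5/9 ≈ 0.556, above the true 50% rate
```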

I imagine my rate of doing these exercises is going to drop slightly as I’ve started my machine learning internship at BMLL Technologies and will be there up until Christmas. Still, my plan is to (hopefully) have up to about Chapter 8 finished by then – although this may be optimistic as my working days are pretty long at the moment! Still, I am learning a lot, which means the exercises should be easier, right?! Anyway, chapter 5 is about Bayesian statistics and there are significantly fewer exercises than in the previous two chapters!

5.1 Proof that a mixture of conjugate priors is indeed conjugate

We consider the case where we have a prior of the form: \(p(\theta) = \sum_k p(Z=k) p(\theta | Z=k) \) where each \(p(\theta | Z=k)\) is conjugate to the likelihood, i.e. \(P(\theta | D, Z=k) \) is the same type of probability distribution as \(p(\theta | Z=k)\). We want to derive the general result:

\(p(\theta | D) = \sum_k p(Z=k | D) p(\theta | D, Z=k) \)

as this shows that the posterior is a weighted sum of the posteriors of each mixture component – so the whole thing is conjugate to the mixture prior we started with. I’m not sure if I’m missing something and this is too simplistic, but it seems we can just apply Bayes’ rule and expand the prior:

\( p(\theta | D) = \frac{p(D | \theta) p(\theta)}{p(D)} = \sum_k p(Z=k) \frac{p(D | \theta) p(\theta | Z=k)}{p(D)} \)

We then note that \(p(D | \theta) = p(D | \theta, Z=k)\), since if we already know \(\theta\) then knowing \(Z\) tells us nothing extra about the data. We then use that \(p(D | \theta, Z=k) p(\theta | Z=k) = p(\theta | D, Z=k) p(D | Z=k) \) and then \(\frac{p(D| Z=k) p(Z=k)}{p(D)} = p(Z=k | D) \), which gives us the result we want!
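To make this concrete, here is a hypothetical numerical example (not from the book): a Bernoulli likelihood with a two-component mixture-of-Betas prior. Each posterior component is the standard Beta update, and the mixing weights are reweighted by each component’s marginal likelihood, \(p(Z=k | D) \propto p(Z=k) p(D | Z=k)\):

```python
from math import lgamma, exp

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_mixture_posterior(prior, heads, tails):
    """prior: list of (weight, a, b) Beta mixture components.
    Returns the posterior components (weight', a + heads, b + tails),
    where weight' ∝ weight * p(D | Z=k), the Bernoulli marginal likelihood
    p(D | Z=k) = B(a + heads, b + tails) / B(a, b)."""
    updated = [
        (w * exp(log_beta(a + heads, b + tails) - log_beta(a, b)),
         a + heads, b + tails)
        for (w, a, b) in prior
    ]
    norm = sum(w for w, _, _ in updated)
    return [(w / norm, a, b) for (w, a, b) in updated]

# prior: 50/50 between "biased to tails" and "biased to heads"
prior = [(0.5, 2.0, 8.0), (0.5, 8.0, 2.0)]
post = beta_mixture_posterior(prior, heads=9, tails=1)
# after 9 heads in 10 flips, almost all weight moves to the second component
```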

5.2 Optimal threshold on classification probability

Consider a case where we have learned a conditional probability distribution \(p(y | x)\), and suppose there are only two classes. Let \(p_0 = P(Y=0 | x)\) and \(p_1 = P(Y=1 | x)\). Consider the following loss matrix:

The loss matrix has rows indexed by the true label \(y\) and columns by the predicted label \(\hat{y}\):

True label \(y = 0\): loss \(0\) if \(\hat{y} = 0\), loss \(\lambda_{01}\) if \(\hat{y} = 1\)

True label \(y = 1\): loss \(\lambda_{10}\) if \(\hat{y} = 0\), loss \(0\) if \(\hat{y} = 1\)

(a) Show that the decision \( \hat{y}\) which minimises the expected loss is equivalent to setting a probability threshold \(\theta\) and predicting \(\hat{y} = 1\) iff \(p_1 \ge \theta\). What is \(\theta\) in terms of \(\lambda_{01}\) and \(\lambda_{10}\)?

The expected loss of predicting \(\hat{y} = 1\) is \(\lambda_{01} p_0\), and of predicting \(\hat{y} = 0\) is \(\lambda_{10} p_1\). So we should predict \(\hat{y} = 1\) whenever \(\lambda_{01} (1 - p_1) \le \lambda_{10} p_1\), i.e. whenever \(p_1 \ge \frac{\lambda_{01}}{\lambda_{01} + \lambda_{10}} = \theta\).

(b) Show a loss matrix for which the threshold is 0.1.

We just want \( 0.1 = \frac{\lambda_{01}}{\lambda_{01} + \lambda_{10}}\), which implies that \(9 \lambda_{01} = \lambda_{10}\). So any loss matrix which satisfies this works, for example \(\lambda_{01} = 1\) and \(\lambda_{10} = 9\).

5.3 Reject option in classifiers

In many classification problems it is useful to include a reject option – where we are too uncertain to make a classification. Let \(\alpha_i\) be the action you choose with \(i \in \{1, \dots, C+1 \}\), where \(C\) is the number of classes and action \(C+1\) is reject. Let \(Y=j\) be the true (but unknown) state of nature. Consider the following loss:

\( L(\alpha_i | Y=j) = \begin{cases} 0 & \text{if } i = j \text{ and } i, j \in \{1, \dots, C\} \\ \lambda_r & \text{if } i = C+1 \\ \lambda_s & \text{otherwise} \end{cases} \)

Let’s first define \(P(Y=j | x) = p_j\). Now if we choose an action \( a \in \{1, \dots, C\}\), then we have the following expected loss:

\( \rho(a | x) = \sum_{i=1}^C L(a | Y=i) p_i\)

But of course \(L(a | Y=a) = 0\), and for all other values of \(Y\) the loss is the constant \(\lambda_s\), so this gives:

\(\rho(a | x) = (1-p_a) \lambda_s\)

So choosing amongst the classes, we want the class \(a\) with the minimum value of \((1-p_a) \lambda_s\). So we need:

\(p_a \ge p_b \ \forall b \in \{1,\dots,C\}\)

This shows that if we are going to choose a class then, as we should intuitively expect, we choose the most probable one. However, we can also choose to reject. In this case we have an expected loss equal to \(\lambda_r\), and so if we are going to choose a class rather than reject then we need the best class to satisfy:

\( (1-p_a) \lambda_s \le \lambda_r \iff p_a \ge 1 - \frac{\lambda_r}{\lambda_s} \)

(b) Describe qualitatively what happens as \(\lambda_r\) is increased from 0 to 1.

The threshold for rejection, \(1 - \frac{\lambda_r}{\lambda_s}\), falls from 1 down to 0 (taking \(\lambda_s = 1\)): when rejection is free we almost always reject, whereas once \(\lambda_r\) reaches \(\lambda_s\) there is never any benefit in rejecting.

5.4 More reject options

Consider, for example, a case in which the cost of misclassification is $10, but the cost of having a human manually make the decision is only $3.

(a) Suppose \(P(y=1 | x)\) is predicted to be 0.2. Which decision minimises the expected loss?

This is pretty much the same question as the previous one – clearly we are going to choose \(y=0\) if we are going to classify at all. The threshold for rejection is then \(1-\frac{\lambda_r}{\lambda_s} = 1 - \frac{3}{10} = 0.7\). So as \(P(y=0 | x) = 0.8 > 0.7\), we don’t reject and classify as \(y=0\).

(b) Now suppose \(P(y=1 | x) = 0.4\). Now which decision is optimal?

In this case, \(P(y=0 | x)=0.6\) which is below the reject threshold, and so we choose the reject option as we are not sure enough to predict \(y=0\).

(c) Show that in general there are two thresholds \(\theta_0\) and \(\theta_1\) such that the optimal decision is to predict 0 if \(p_1 < \theta_0\), reject if \( \theta_0 \le p_1 \le \theta_1\) and predict 1 if \(p_1 > \theta_1\)

This is clear given the results of the previous parts: in general \(\theta_0 = \frac{\lambda_r}{\lambda_s}\) and \(\theta_1 = 1 - \frac{\lambda_r}{\lambda_s}\) (predicting 0 is optimal when \(p_0 = 1 - p_1 \ge 1 - \lambda_r/\lambda_s\), i.e. when \(p_1 \le \lambda_r/\lambda_s\)), which here gives \(\theta_0 = 0.3\) and \(\theta_1 = 0.7\).
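Both parts (a) and (b) can be checked with a few lines of code (a minimal sketch; the function name is my own):

```python
def decide(probs, lam_reject, lam_sub):
    """Choose the class with the smallest expected loss, or reject.
    Picking class a costs (1 - p_a) * lam_sub; rejecting costs lam_reject."""
    a = max(range(len(probs)), key=lambda i: probs[i])
    if (1 - probs[a]) * lam_sub <= lam_reject:
        return a
    return "reject"

# misclassification costs $10, asking a human costs $3
print(decide([0.8, 0.2], lam_reject=3, lam_sub=10))  # -> 0 (predict y=0)
print(decide([0.6, 0.4], lam_reject=3, lam_sub=10))  # -> reject
```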

5.5 Newsvendor problem

Consider the following classic problem from decision theory / economics. Suppose you are trying to decide how much quantity of some product (e.g. newspapers) to buy to maximize your profits. The optimal amount will depend on how much demand D you think there is, as well as its cost C and its selling price P. Suppose D is unknown but has pdf \(f(D)\) and cdf \(F(D)\). We can evaluate the expected profit by considering two cases: if \(D > Q\), then we sell all Q items and make profit \(\pi = (P-C)Q\); but if \( D < Q\) we only sell D items, at profit \( (P-C)D\), and have wasted \(C(Q-D)\) on the unsold items. So the expected profit if we buy quantity Q is:

\( \mathbb{E}_{\pi}(Q) = \int_Q^{\infty} (P-C)Q f(D) dD + \int_0^Q \left[ (P-C)D - C(Q-D) \right] f(D) dD \)

\( = (P-C)Q - PQ F(Q) + CQ F(Q) + (P-C)\int_0^Q D f(D) dD - CQ F(Q) + C \int_0^Q D f(D) dD \)

But the second to last term is just \(-CQ F(Q)\) which cancels with the third term. The last term also cancels with a part of the fourth term, so we just have:

\( \mathbb{E}_{\pi}(Q) = (P-C)Q - PQ F(Q) + \int_0^Q P D f(D) dD\)

Differentiating with respect to Q and setting the derivative to zero at the optimum:

\( (P-C) – P F(Q) – PQ f(Q) + PQ f(Q) = 0\)

And so therefore:

\((P-C) = P F(Q^*) \implies F(Q^*) = \frac{P-C}{P}\)
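As a sanity check (hypothetical numbers, assuming demand is uniform on \([0, 100]\)): with \(P = 5\) and \(C = 2\) the formula gives \(F(Q^*) = 3/5\), i.e. \(Q^* = 60\), which should agree with brute-force maximisation of the expected profit:

```python
def expected_profit(Q, P, C, D_max=100.0):
    """E[profit] from the derivation above, for demand uniform on [0, D_max]:
    (P-C)*Q - P*Q*F(Q) + integral_0^Q P*D*f(D) dD, with f = 1/D_max."""
    F = Q / D_max
    integral = P * Q * Q / (2 * D_max)  # ∫_0^Q P*D/D_max dD = P*Q²/(2*D_max)
    return (P - C) * Q - P * Q * F + integral

P, C = 5.0, 2.0
# brute-force maximiser over a fine grid of order quantities
Q_best = max((q / 10 for q in range(0, 1001)),
             key=lambda q: expected_profit(q, P, C))
# theory: F(Q*) = (P - C)/P = 0.6, so Q* = 0.6 * 100 = 60
```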

5.6 Bayes factors and ROC curves

Let \( B= p(D | H_1) / p(D | H_0)\) be the Bayes factor in favour of model 1. Suppose we plot two ROC curves, one computed by thresholding \(B\) and the other computed by thresholding \(p(H_1 | D)\). Will they be the same or different?

We can write \(B = \frac{p(D | H_1)}{ p(D | H_0)} = \frac{ p(H_0) p(H_1 | D) }{ p (H_1) p(H_0 | D)}\). Writing \(p = p(H_1 | D)\), this is \(B = \frac{p(H_0)}{p(H_1)} \cdot \frac{p}{1-p}\), and since the prior ratio is a fixed constant, \(B\) is a strictly increasing function of \(p\). An ROC curve depends only on how the score orders the examples, and a strictly monotonic transformation of the score preserves that ordering, so thresholding \(B\) and thresholding \(p(H_1 | D)\) sweep out the same ROC curve.

5.7 Bayes model averaging helps predictive accuracy

Let \(\Delta\) be a quantity we are trying to predict, let \(\mathcal{D}\) be the observed data and let \(\mathcal{M}\) be a finite set of models. Suppose our “action” is to provide a probabilistic prediction \(p(\cdot)\), with loss function \( L(\Delta, p(\cdot)) = - \log p(\Delta) \). We can predict \(\Delta\) in two different ways – firstly by using Bayes model averaging:

\( p(\Delta | \mathcal{D}) = \sum_{m \in \mathcal{M}} p(\Delta | m, \mathcal{D}) p(m | \mathcal{D}) \)

Let \( x \in \{0,1\} \) denote a result of a coin toss (\(x=0\) for tails, \(x=1\) for heads). Let’s say that heads occurs with probability \(\theta_1\). Suppose that someone else observes the coin flip and observes the outcome, y. But this person is unreliable and only reports the result correctly with probability \(\theta_2\). Assume that \(\theta_2\) is independent of x and \(\theta_1\).

(a) Write down the probability distribution \(p(x, y | \theta)\) in terms of \(\theta_1\) and \(\theta_2\).

\(p(x=0, y=0 | \theta) = (1-\theta_1) \theta_2\)

\(p(x=0, y=1 | \theta) = (1-\theta_1)(1-\theta_2) \)

\(p(x=1, y=0 | \theta) = \theta_1 (1-\theta_2) \)

\(p(x=1, y=1 | \theta) = \theta_1 \theta_2 \)
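As a quick check that these four entries define a valid joint distribution for any \(\theta_1\) and \(\theta_2\):

```python
def joint(theta1, theta2):
    """p(x, y | theta) as a dict keyed by (x, y)."""
    return {
        (0, 0): (1 - theta1) * theta2,
        (0, 1): (1 - theta1) * (1 - theta2),
        (1, 0): theta1 * (1 - theta2),
        (1, 1): theta1 * theta2,
    }

p = joint(0.3, 0.9)  # the four entries always sum to 1
```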

(b) Suppose we have the following dataset: \(\mathbf{x} = (1,1,0,1,1,0,0), \mathbf{y} = (1,0,0,0,1,0,1)\). What are the MLEs for \(\theta_1\) and \(\theta_2\)? What is \(P(D | \mathbf{\hat{\theta}}, M_2)\) (where \(M_2\) denotes this 2-parameter model)?

There are 4 heads in the data, and 4 places where the values of x and y are in agreement, so:

\( \hat{\theta}_1 = \frac{4}{7}, \qquad \hat{\theta}_2 = \frac{4}{7} \)

The likelihood factorises as \(p(x, y | \theta) = p(x | \theta_1) p(y | x, \theta_2)\), which gives:

\( P(D | \mathbf{\hat{\theta}}, M_2) = \hat{\theta}_1^4 (1-\hat{\theta}_1)^3 \, \hat{\theta}_2^4 (1-\hat{\theta}_2)^3 = \left( \frac{4}{7} \right)^4 \left( \frac{3}{7} \right)^3 \left( \frac{4}{7} \right)^4 \left( \frac{3}{7} \right)^3 \simeq 7.0 \times 10^{-5} \)

(c) Now consider a model with 4 parameters, \(\mathbf{\theta} = (\theta_{0,0}, \theta_{0,1}, \theta_{1,0}, \theta_{1,1})\), representing \(p(x,y|\mathbf{\theta}) = \theta_{x,y}\). Only three of these parameters are free, as they must sum to one. What is the MLE of \(\mathbf{\theta}\)? What is \(p(D|\mathbf{\hat{\theta}}, M_4)\), where \(M_4\) denotes this 4-parameter model?

The MLE is just the vector of empirical pair frequencies, \(\mathbf{\hat{\theta}} = (\frac{2}{7}, \frac{1}{7}, \frac{2}{7}, \frac{2}{7})\). We can then evaluate \(P(D| \mathbf{\hat{\theta}}, M_4) = \left( \frac{2}{7} \right)^6 \frac{1}{7} = \frac{2^6}{7^7} \simeq 7.8 \times 10^{-5}\). This isn’t much higher than the 2-parameter model achieves with two fewer parameters, so already this is reason to suspect that the 2-parameter model would be preferable.

(d) Suppose we are not sure which model is correct. We compute the leave-one-out cross-validated log-likelihood of each model:

\( L(m) = \sum_{i=1}^{N} \log p(x_i, y_i | m, \hat{\theta}(\mathcal{D}_{-i})) \)

where \(\hat{\theta}(\mathcal{D}_{-i}) \) denotes the MLE computed on \(\mathcal{D}\) excluding row i. Which model will CV pick and why?

Initially I wrote some code to generate this for each model, but actually to answer the question this isn’t necessary. The reason is that for \(M_4\) there is only one data point with \(x=0, y=1\), so when that point is held out the MLE for \(\theta_{0,1}\) is zero. This means that when we evaluate the held-out log-likelihood using our MLE estimate we get a \(\log(0) \to -\infty\) term. This means that clearly \(M_2\) will be favoured by this cross-validation technique.
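For completeness, here is a minimal sketch of that computation (the function names are my own):

```python
from math import log

x = [1, 1, 0, 1, 1, 0, 0]
y = [1, 0, 0, 0, 1, 0, 1]
data = list(zip(x, y))

def lik_m2(xi, yi, train):
    """p(xi, yi) under M2, with the MLEs theta1, theta2 fit on `train`."""
    theta1 = sum(a for a, b in train) / len(train)       # P(heads)
    theta2 = sum(a == b for a, b in train) / len(train)  # P(report correct)
    px = theta1 if xi == 1 else 1 - theta1
    py = theta2 if xi == yi else 1 - theta2
    return px * py

def lik_m4(xi, yi, train):
    """p(xi, yi) under M4: the empirical frequency of the pair."""
    return sum(1 for pair in train if pair == (xi, yi)) / len(train)

def loocv_loglik(lik, data):
    """Leave-one-out cross-validated log-likelihood."""
    total = 0.0
    for i in range(len(data)):
        rest = data[:i] + data[i + 1:]
        p = lik(*data[i], rest)
        total += log(p) if p > 0 else float("-inf")
    return total

print(loocv_loglik(lik_m2, data))  # finite (about -12.1)
print(loocv_loglik(lik_m4, data))  # -inf: the lone (0,1) pair has
                                   # probability zero when held out
```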

(e) Recall that an alternative to CV is to use the BIC score, defined as: