Exercise

Using bins

One of the difficulties in working with a binary response variable is understanding how it "changes." The response itself (\(y\)) is either 0 or 1, while the fitted values (\(\hat{y}\))---which are interpreted as probabilities---are between 0 and 1. But if every medical school applicant is either admitted or not, what does it mean to talk about the probability of being accepted?

What we'd like is a larger sample of students, so that for each GPA value (e.g. 3.54) we had many observations (say \(n\)), and we could then take the average of those \(n\) observations to arrive at the estimated probability of acceptance. Unfortunately, since the explanatory variable is continuous, this is hopeless---it would take an infinite amount of data to make these estimates robust.

Instead, what we can do is put the observations into bins based on their GPA value. Within each bin, we can compute the proportion of accepted students, and we can visualize our model as a smooth logistic curve through those binned values.

We have created a data.frame called MedGPA_binned that aggregates the original data into separate bins for each 0.25 of GPA. It also contains the fitted values from the logistic regression model.

Here we are plotting \(y\) as a function of \(x\), where that function is
$$
y = \frac{\exp{( \hat{\beta}_0 + \hat{\beta}_1 \cdot x )}}{1 + \exp( \hat{\beta}_0 + \hat{\beta}_1 \cdot x ) } \,.
$$
Note that the left hand side is the expected probability \(y\) of being accepted to medical school.

Instructions

100 XP

Create a scatterplot called data_space for acceptance_rate as a function of mean_GPA using the binned data in MedGPA_binned. Use geom_line() to connect the points.

Augment the model mod. Create predictions on the scale of the response variable by using the type.predict argument.