Modeling completely separated data in R

You’ll notice that the y-axis is binary, there are only 0s and 1s. That’s so meta if you think about it…

See in the lower graph how there are only zeroes until a specific point on X when they all turn into ones? That’s completely separated data, also called ‘monotone likelihood’ (ever wonder why scientists have trouble speaking with normal folks?). Complete separation is common with binomial data. And because I fancy binomial data, I guess I’m being punished (or rewarded?) with separation. Hence this post.

The problem of complete separation may be new to some folks, but it actually rears its ugly head fairly often. I’ve already experienced it in two places, first with predation data (e.g. predators ATE EVERYTHING; see figure) and again with physiological data (e.g. everything died at 32 degrees C!). We gave it a pretty hefty explanation in our recent paper. Separation causes issues by producing parameter estimates that diverge to + infinity (this is bad).

The fuzzy sarlacc pit. This voracious predator is ready to consume all of its prey, producing completely separated data that foils scientists trying to unlock its secrets.

But for those of you that have separated data, or those procrastinating on writing dissertation chapters, you’ve come to the right place!! Here’s some background reading and you should check out Heinze and Schemper 2002 for the solution that the ‘brglm’ package uses. Complete separation occurs often enough in medicine and other sciences that this paper has just under 800 citations. In brief, the solution reduces the bias of maximum likelihood estimates and produces robust standard errors.

Enough jabbering, let’s first try modeling separated data with the good ol’ glm function and watch in amusement as R freaks out.

First, let’s load packages, create the data, plot it, then turn it into a dataframe.

Excellent! But this is a pretty simple scenario to model. What if we had two groups? Good question!

w<-c(rep(0,20),rep(1,20)) #create a group
z<-c(rep(0,32),rep(1,8)) #create another group
data2<-data.frame(x,w,z) #this is wide format, we need it to be in long, use 'str' if this doesn't make sense
data3<-gather(data2, group,y, w:z)#conversion to long format
str(data3) #take a peek, long format baby!

Ok, so here, I’ve created two groups to model. Both exhibit complete separation. Don’t believe me?

By the way, those color codes are the default for the two color scheme in ggplot2, thanks Stack Exchange!

You’ll notice that the slopes for those two groups are exactly the same (same logistic shape, just differing intercepts).
What if we wanted to allow the slopes to vary? First, let’s tweak the data so it makes sense for the slopes to vary.

Hey alright! The rate of change from one binary state to another might differ between two groups. Interesante!

You could also use the package ‘logistf’ but the ‘brglm’ package I’ve used here seems promising. Check out Heinze and Ploner 2004 that describes ‘logistf’. If you are of the Bayesian persuasion, there is a solution to complete separation using weakly informative priors (i.e. Cauchy distribution; Gelman et al 2008). Perhaps this will be the next post!