Modeling Trick: Impact Coding of Categorical Variables with Many Levels

One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by switching to another modeling technique, such as Naive Bayes; however, you then lose some of the advantages of regression -- namely, the model’s explicit estimates of each variable’s explanatory value, and explicit insight into and control of variable-to-variable dependence.

Here we discuss one modeling trick that allows us to keep categorical variables with a large number of values, and at the same time retain much of logistic regression’s power.

For this example, we will use a data set that contains all the police incidents (except homicide and manslaughter) that were reported in San Francisco in June 2012. San Francisco publishes the most recent past month’s incident data online. The data set contains the date, time of day, and day of week that each incident was reported, along with the incident’s category, a brief description, and location (as police district, lat-long coordinates, and address to the nearest block).

Suppose we are interested in predicting the likelihood of a given incident being a violent crime, as a function of time, day, and location. We will define violent crimes ourselves, as assault, robbery, rape, kidnapping, and purse snatching.
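The original code listing did not survive here, so the following is only a minimal sketch of this setup in R, using a small synthetic stand-in for the incident data. The rows, and the exact column names `Category`, `Time`, and `PdDistrict`, are illustrative assumptions, not the original data or code:

```r
# Synthetic stand-in for the SFPD incident data (rows are made up).
set.seed(2012)
n <- 500
incidents <- data.frame(
  Category   = sample(c("ASSAULT", "ROBBERY", "LARCENY/THEFT", "VANDALISM"),
                      n, replace = TRUE),
  Time       = sprintf("%02d:%02d", sample(0:23, n, replace = TRUE),
                       sample(0:59, n, replace = TRUE)),
  PdDistrict = sample(c("MISSION", "TENDERLOIN", "RICHMOND"),
                      n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Flag the incident categories we are defining as violent
# (only the categories present in this toy data are listed).
violentCategories <- c("ASSAULT", "ROBBERY")
incidents$violent <- incidents$Category %in% violentCategories

# Hour of day as a categorical variable
incidents$hour <- as.factor(sub(":.*$", "", incidents$Time))

# Logistic regression on time of day and police district
modelHr <- glm(violent ~ hour + PdDistrict,
               data = incidents, family = binomial(link = "logit"))
summary(modelHr)
```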

The time-of-day effect is a little stronger, but still not very strong. The rate of violent incidents increases significantly (relative to midnight) at about 1 am; the rate is significantly lower than at midnight in the late morning and around noon.

In fact, using time of day and district only reduces the deviance from the "null model" -- that is, from simply predicting the global rate of violent incidents -- by about 2%. There is always the danger, of course, that using block-level data will lead to overfit, but let's give it a try anyway.
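The deviance-reduction figure can be read directly off a fitted glm object. A self-contained sketch (the helper name and the toy data below are ours, for illustration):

```r
# Fraction of the null deviance explained by a fitted glm
devianceExplained <- function(model) {
  1 - model$deviance / model$null.deviance
}

# Toy check on a small synthetic fit (data made up for illustration):
# group "a" has a 75% violent rate, group "b" 25%.
d <- data.frame(
  y = c(rep(TRUE, 15), rep(FALSE, 5), rep(TRUE, 5), rep(FALSE, 15)),
  x = factor(rep(c("a", "b"), each = 20))
)
m <- glm(y ~ x, data = d, family = binomial)
devianceExplained(m)   # strictly between 0 and 1 for this data
```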

To do that, we replace the categorical variable with a submodel that returns the probability of a violent incident, conditional on each category value (in this case, the city block). In our case, it's possible that there are city blocks that had no reported incidents this month; that may change next month. We guard against this contingency by smoothing novel levels to the grand average. We call this trick impact coding because it summarizes the impact of each category value on the outcome. (This is not standard terminology.)
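A minimal sketch of such an impact-coding submodel in R. The function names, the one-pseudo-observation smoothing choice, and the toy data are ours, for illustration -- not the original code:

```r
# Impact-code a high-cardinality categorical variable: replace each level
# with the observed probability of a violent incident at that level,
# smoothing rare and novel levels toward the grand average.
impactModel <- function(xcol, ycol) {
  grandAvg <- mean(ycol)
  counts <- tapply(ycol, xcol, length)
  sums   <- tapply(ycol, xcol, sum)
  # Smooth toward the grand average with one pseudo-observation per level
  probs  <- (sums + grandAvg) / (counts + 1)
  list(levelProbs = probs, grandAvg = grandAvg)
}

applyImpactModel <- function(model, xcol) {
  p <- model$levelProbs[as.character(xcol)]
  # Levels never seen in training fall back to the grand average
  p[is.na(p)] <- model$grandAvg
  as.numeric(p)
}

# Toy example (data made up for illustration)
addr <- c("100 MAIN", "100 MAIN", "200 OAK", "200 OAK", "200 OAK")
viol <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
im <- impactModel(addr, viol)
coded <- applyImpactModel(im, c("100 MAIN", "300 ELM"))
coded   # novel block "300 ELM" gets the grand average, 0.4
```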

Location is the most predictive variable; but even after controlling for location, the proportion of violent incidents peaks in the early hours of the morning: from about 1 am to 4 am. The resulting model explains about 48% of the null deviance. Not great, but not bad, either.

When we compare modelHr and modelHrAddr, we see that, overall, the time coefficients of modelHrAddr are less significant than those of modelHr. This indicates some correlation between time and location. This is where logistic regression's handling of dependence is useful -- a Naive Bayes model that used time and location would tend to overestimate the proportion of violent incidents at hotspots. The logistic regression model compensates for these dependencies and provides more accurate estimates.

And even though the impactAddr variable is less transparent than the corresponding categorical variable, the effect of time is clearer, since we have pulled out the effect of location. This is how we can make statements like "time has the following impact on the proportion of violent incidents, even after controlling for address," even though address is a very large categorical variable, which is itself subject to possible overfitting -- as well as being too large for typical logistic regression code.

Thus, while impact coding has limitations, it lets us do more than we would otherwise be able to do, especially with other categorical variables.

I would commonly suggest using multiple magnitudes of shrinkage and including each of them as a feature. The resulting features are fairly collinear, so you’d need to be sure to ridge any secondary model if your chosen algorithm doesn’t already do so. I often get much better results this way.
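That suggestion might be sketched like this, impact-coding the same variable at several smoothing strengths and keeping each coding as its own column. The pseudo-count values (1, 10, 100) and the toy data are illustrative choices, not prescribed:

```r
# Impact-code one variable at a given smoothing strength (pseudoCount):
# larger pseudo-counts shrink level estimates harder toward the grand average.
impactCode <- function(xcol, ycol, pseudoCount) {
  grandAvg <- mean(ycol)
  counts <- tapply(ycol, xcol, length)
  sums   <- tapply(ycol, xcol, sum)
  probs  <- (sums + pseudoCount * grandAvg) / (counts + pseudoCount)
  as.numeric(probs[as.character(xcol)])
}

# Toy data (made up for illustration)
addr <- c("A", "A", "A", "B", "B", "C")
viol <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# One collinear feature per shrinkage magnitude; a downstream ridge
# penalty keeps a secondary model stable despite the collinearity.
features <- sapply(c(1, 10, 100), function(k) impactCode(addr, viol, k))
colnames(features) <- c("impact1", "impact10", "impact100")
features
```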
