Each time we face real applications in an applied econometrics course, we have to deal with categorical variables. And the same question always comes up from students: how can we automatically combine factor levels? Is there a simple R function?

I have uploaded a few blog posts on this over the past few years, but so far nothing satisfying. Let me write down a few lines about what could be done. And if someone wants to write a nice R function, that would be awesome. To illustrate the idea, consider the following (simulated) dataset:
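As a minimal sketch of such a dataset (the seed, the sample size, the noise level and the constants in `shift` are all assumptions chosen for illustration, not the values used for the results discussed in this post), one could simulate a continuous covariate, a ten-level factor, and a response with a common slope and level-specific constants:

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n  <- 200
x1 <- runif(n)
x2 <- factor(sample(LETTERS[1:10], size = n, replace = TRUE))

# Hypothetical level-specific constants; several are deliberately close
shift <- c(A = 0, B = 0.05, C = 0.1, D = 1, E = 1.05,
           F = 1.1, G = 2, H = 2.05, I = 0, J = 2.1)

# Same slope for x1 whatever the level, plus a constant depending on x2
y  <- 1 + 2 * x1 + unname(shift[as.character(x2)]) + rnorm(n, sd = 0.3)
db <- data.frame(y = y, x1 = x1, x2 = x2)
```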

The slope for x_1 is the same for every level; we simply add a different constant for each level. As we can see, some levels are very close to each other, so it seems legitimate to combine them into a single category. Here is the output of the linear regression:
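A regression of this kind can be fitted along the following lines. This is only a sketch on hypothetical simulated data (the object names `db` and `reg` and all numerical values are assumptions), so its output will not match the figures discussed in this post. The reference category is set explicitly with `relevel()`.

```r
# Hypothetical simulated data (all values are illustrative assumptions)
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- factor(sample(LETTERS[1:10], size = n, replace = TRUE))
shift <- c(A = 0, B = 0.05, C = 0.1, D = 1, E = 1.05,
           F = 1.1, G = 2, H = 2.05, I = 0, J = 2.1)
y  <- 1 + 2 * x1 + unname(shift[as.character(x2)]) + rnorm(n, sd = 0.3)
db <- data.frame(y = y, x1 = x1, x2 = x2)

# Use "I" as the reference category, then fit the linear model
db$x2 <- relevel(db$x2, ref = "I")
reg   <- lm(y ~ x1 + x2, data = db)
summary(reg)   # one coefficient per non-reference level of x2
```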

Here the reference category is "I", and it looks like we could actually combine that category with several others. One strategy would be to select all categories that do not seem to be significantly different from the reference, and to run a joint (multiple) test:
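One way to sketch that joint test in base R is to compare the full model with a restricted model in which the candidate levels are merged with the reference; `anova()` on the two nested fits then gives the F-test that all the corresponding coefficients are zero. The data are hypothetical, the 5% threshold is an assumption, and this nested-model comparison is a stand-in for whatever testing function one prefers.

```r
# Hypothetical simulated data (all values are illustrative assumptions)
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- factor(sample(LETTERS[1:10], size = n, replace = TRUE))
shift <- c(A = 0, B = 0.05, C = 0.1, D = 1, E = 1.05,
           F = 1.1, G = 2, H = 2.05, I = 0, J = 2.1)
y  <- 1 + 2 * x1 + unname(shift[as.character(x2)]) + rnorm(n, sd = 0.3)
db <- data.frame(y = y, x1 = x1, x2 = relevel(x2, ref = "I"))

reg <- lm(y ~ x1 + x2, data = db)

# Levels whose individual t-test against "I" is not significant at 5%
ct   <- summary(reg)$coefficients
rows <- grep("^x2", rownames(ct))
cand <- sub("^x2", "", rownames(ct)[rows][ct[rows, "Pr(>|t|)"] > 0.05])

# Restricted model: the candidates and the reference share one level
db$x2m <- db$x2
levels(db$x2m)[levels(db$x2m) %in% c("I", cand)] <- "merged"
reg0 <- lm(y ~ x1 + x2m, data = db)

# Joint F-test of H0: all candidate coefficients are zero
joint <- anova(reg0, reg)
joint
```

A large p-value in the second row of `joint` means the merge is not rejected. Keep in mind that the candidates were picked by looking at the data, so the nominal level of this test is only indicative.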

Actually, it is possible to use another strategy. We start from some level, say "A", and merge it with all the levels that are not significantly different from it. If "B" is not one of them, we use it as the new reference, and so on.
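The sequential strategy could be sketched as follows. The function `combine_levels()` is hypothetical (it is not an existing R function), it assumes a data frame with columns `y`, `x1` and `x2`, and it relies on individual t-tests at an assumed 5% level; the argument `visit` gives the order in which levels are used as the reference.

```r
# Hypothetical helper: sequentially merge levels of data$x2
combine_levels <- function(data, level = 0.05, visit = levels(data$x2)) {
  f    <- data$x2
  todo <- visit                       # levels not yet assigned to a group
  while (length(todo) > 1) {
    ref    <- todo[1]
    others <- todo[-1]
    data$f <- relevel(f, ref = ref)   # current reference category
    ct <- summary(lm(y ~ x1 + f, data = data))$coefficients
    pv <- ct[paste0("f", others), "Pr(>|t|)"]
    near <- others[pv > level]        # not significantly different from ref
    # Merge ref with its neighbours; the new name concatenates the old ones
    levels(f)[levels(f) %in% c(ref, near)] <- paste(c(ref, near), collapse = "")
    todo <- setdiff(others, near)
  }
  f
}

# Hypothetical simulated data (all values are illustrative assumptions)
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- factor(sample(LETTERS[1:10], size = n, replace = TRUE))
shift <- c(A = 0, B = 0.05, C = 0.1, D = 1, E = 1.05,
           F = 1.1, G = 2, H = 2.05, I = 0, J = 2.1)
y  <- 1 + 2 * x1 + unname(shift[as.character(x2)]) + rnorm(n, sd = 0.3)
db <- data.frame(y = y, x1 = x1, x2 = x2)

merged <- combine_levels(db)
table(db$x2, merged)                  # which original level went where
```

The result depends on the visiting order, which can be changed with, for example, `combine_levels(db, visit = sample(levels(db$x2)))`.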

I guess it would be necessary to randomize the order in which we go through the levels, since the result may depend on it.

Last, but not least, one can use regression trees. The problem is that there is another explanatory variable that might interfere. So I would suggest (1) to fit a linear model of y on x_1 and to calculate its residuals, then (2) to run a regression tree to explain those residuals with the categorical variable x_2 (I explained how trees are built when the explanatory variable is categorical in a previous post).
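The two steps can be sketched with the `rpart` package, which ships with standard R distributions. The data are again hypothetical, the tree uses `rpart`'s default tuning parameters (an assumption, not a recommendation), and the object names are illustrative.

```r
library(rpart)

# Hypothetical simulated data (all values are illustrative assumptions)
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- factor(sample(LETTERS[1:10], size = n, replace = TRUE))
shift <- c(A = 0, B = 0.05, C = 0.1, D = 1, E = 1.05,
           F = 1.1, G = 2, H = 2.05, I = 0, J = 2.1)
y  <- 1 + 2 * x1 + unname(shift[as.character(x2)]) + rnorm(n, sd = 0.3)
db <- data.frame(y = y, x1 = x1, x2 = x2)

# (1) remove the effect of x1, and keep the residuals
db$e <- residuals(lm(y ~ x1, data = db))

# (2) regression tree on the residuals, with x2 as the only predictor
tree <- rpart(e ~ x2, data = db)

# Each leaf of the tree defines one group of levels
leaf_of_level <- sapply(split(tree$where, db$x2), function(v) v[[1]])
groups <- split(names(leaf_of_level), leaf_of_level)
groups
```

Since `x2` is the only splitting variable, every observation with a given level falls in the same leaf, so `groups` is a partition of the original levels.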