Let's say you have a logistic regression model. Some of the factors are intrinsically categorical but some are continuous variables. Under which circumstances should a continuous variable be binned into categories?

For example, logistic regression is widely used in retail credit modeling, and age is an explanatory variable. When is it wise to bin age (e.g., 2-3, 4-10, 11+) and when should you leave it as a continuous variable?

The points raised in the various comments and references here do not seem to address this question. There are numerous situations where detailed, intelligent binning is not only appropriate, but adds value to the model.

Let's break it down to the basics, which is that in a digital world everything is categorical. We never measure AGE down to the second, minute, day, week or even month. Why not? Because we assume that at those minute intervals the response variable is the same. How is that different from assuming that the response variable is the same for those between the ages of 25 and 27? If the data show that the credit risk for people with 10-15 years of credit history is the same across that range, why would we assume there is a linear relationship there?

Isn't that just trying to impute more in the data than exists?

It is true that by discretizing the data we use up more degrees of freedom, but only if each interval is modeled as a separate variable, and with large data sets (thousands of observations are not only common, but on the low side of many data sets) that cost is negligible. I think the problem with many comments here is that they come from areas where sample sizes are small: biomed, social sciences, and so on. In the marketing, financial, and other consumer worlds there is more data than you can shake a stick at.

Finally, binning has been an accepted and proven practice in the consumer industry since Fair, Isaac first started building scorecards back in the 1960s. FICO still uses complex binning techniques for almost all of its models today. One of the current top data mining tools, TreeNet from Salford Systems, is essentially based on binning techniques.

So, anyone who considers binning not to be a best-practice, potentially transformative technique is not just behind the technology curve, they are way behind.

Hi Mike, welcome to quant.SE. Great answer; I completely agree that intelligent binning can add value to a model. You are essentially taking a somewhat Bayesian view that priors should be imposed, even when using classical statistical tools.
– Tal Fishman, Jul 23 '12 at 13:42

Is there a reference to back these assertions?
– Jase, Aug 15 '13 at 12:10

I am a practitioner in the field and I have used both continuous transforms and binning approaches.

In practical terms, a model would have around 10 factors. If your model development dataset is large and robust enough, you could easily bin each of the 10 factors into 10 bins. Roughly speaking, using the binned scorecard you could then assign $10^{10}$ different scores to your customers. This is far larger than any bank's customer base, so don't worry too much about your model not being granular enough. In most cases it will be.
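To spell out where the $10^{10}$ comes from, one common (WOE/points-based) way to write such a scorecard is

$$\text{score} = \text{offset} + \sum_{j=1}^{10} \text{points}_j\bigl(\text{bin}_j(x_j)\bigr), \qquad \underbrace{10 \times 10 \times \cdots \times 10}_{10\ \text{factors},\ 10\ \text{bins each}} = 10^{10}\ \text{possible bin combinations.}$$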

For datasets with small sample sizes, you will likely end up with adjacent bins with large point differentials. This is not desirable! You don't want your customer's score to change drastically just because he/she moves into the adjacent bin on one of the factors. In practice I have overcome this by employing an automated binning algorithm on the data and bootstrapping it 100 times. If you think of binning as a step function that maps raw values to a weight of evidence (WOE), then simply taking the average of the 100 bootstrapped binnings will naturally smooth out sudden jumps in WOE. This makes the large point differential problem go away. The method is completely data-driven and does not need any other subjective input.
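Here is a minimal sketch of that idea in Python, assuming quantile-based binning (`pandas.qcut`) stands in for the automated binning algorithm; the function names, the smoothing constant and the synthetic data are purely illustrative:

```python
import numpy as np
import pandas as pd

def woe_step_function(x, y, n_bins=10, eps=0.5):
    """Automated binning on (x, y): quantile bin edges plus the
    weight-of-evidence (WOE) value of each bin (smoothed by eps)."""
    _, edges = pd.qcut(x, q=n_bins, retbins=True, duplicates="drop")
    edges[0], edges[-1] = -np.inf, np.inf          # cover the whole real line
    bins = pd.cut(x, bins=edges)
    stats = (pd.DataFrame({"bin": bins, "bad": y})
             .groupby("bin", observed=False)["bad"]
             .agg(["sum", "count"]))
    bad = stats["sum"] + eps
    good = stats["count"] - stats["sum"] + eps
    woe = np.log((good / good.sum()) / (bad / bad.sum()))
    return edges, woe.to_numpy()

def bootstrapped_woe(x, y, n_boot=100, n_bins=10, seed=0):
    """Average the WOE step functions from n_boot bootstrap resamples,
    evaluated at the original observations, to smooth out sudden jumps."""
    rng = np.random.default_rng(seed)
    woe_matrix = np.empty((n_boot, len(x)))
    for b in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))             # bootstrap resample
        edges, woe = woe_step_function(x[idx], y[idx], n_bins)  # re-bin the resample
        bin_idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(woe) - 1)
        woe_matrix[b] = woe[bin_idx]                            # WOE of each original obs
    return woe_matrix.mean(axis=0)                              # smoothed WOE per observation

# Illustrative use on synthetic data (x ~ years of credit history, y = default flag)
rng = np.random.default_rng(1)
x = rng.exponential(scale=8.0, size=5000)
y = (rng.uniform(size=5000) < 1.0 / (1.0 + np.exp(0.3 * x))).astype(int)
smoothed = bootstrapped_woe(x, y)
```

The averaged step function is then what goes into the logistic regression in place of the raw variable.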

In other approaches, where a continuous function is fitted, the analyst building the model often has to limit the range of values over which the curve is fitted. This could be because the data sample is not robust enough near the ends of the distribution (e.g., for income variables, very few of your customers would be earning \$10 million+, so someone has to subjectively choose where to stop fitting the function, say at \$1 million). This is all manual and does not lend itself well to automation. The bootstrapped binning approach described above avoids this problem.
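For contrast, that manual step in the continuous approach typically amounts to something like the following capping of the raw variable before curve fitting (the \$1 million cut-off and the synthetic income data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.5, sigma=1.2, size=5000)  # synthetic income values

# The subjective, manual choice: where to stop trusting the sparse tail
INCOME_CAP = 1_000_000
income_capped = np.clip(income, None, INCOME_CAP)
# income_capped (not income) would then feed the continuous transform / curve fit
```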

The binning approach is also easier to implement, as a series of if statements should suffice.
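For example, a single binned factor might be deployed like this (cut-offs and point values are hypothetical):

```python
def age_points(age):
    """Scorecard points for the binned AGE factor.
    Cut-offs and point values are illustrative only."""
    if age < 25:
        return 10
    elif age < 35:
        return 22
    elif age < 50:
        return 31
    else:
        return 40
```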

Generally, even though I use continuous transforms sometimes, I prefer binning and use it whenever possible. If your sample size is really small and you have fewer than 30 defaults, too few to even apply binning, then you may want to relax the bad definition and lengthen your observation period. However, if even that does not bring your total bad count up to around 60-70, then a continuous transform might be the way to go.