Introduction

Chapter 5 of Machine Learning for Hackers is a relatively simple
exercise in running linear regressions. Therefore, this post will be
short, and I’ll only discuss the more interesting regression example,
which nicely shows how patsy formulas handle categorical variables.

Linear regression with categorical independent variables

In chapter 5, the authors construct several linear regressions, the last
of which is a multivariate regression describing the number of page
views of top-viewed websites. The regression is pretty straightforward,
but includes two categorical variables: HasAdvertising, which takes
values True or False; and InEnglish, which takes values Yes,
No and NA (missing).

If we include these variables in the formula, then patsy/statsmodels will
automatically generate the necessary dummy variables. For
HasAdvertising, we get a dummy variable equal to one when the
value is True. For InEnglish, which takes three values, we get two
separate dummy variables, one for Yes, one for No, with the missing
value serving as the baseline.
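
To make the dummy coding concrete, here is a minimal sketch that builds the design matrix for just these two variables. The four rows are hypothetical toy values, not the book's data, and the sketch assumes NA is stored as the literal string "NA" (so patsy sees it as a third level instead of a missing value). With patsy's default treatment coding, the first level in sorted order is dropped as the baseline, and "NA" sorts before "No" and "Yes".

```python
import numpy as np
from patsy import dmatrix

# Hypothetical toy rows, not the book's data set.
# "NA" is kept as a literal string so that patsy treats it as a third
# level of InEnglish rather than as a missing value.
data = {
    "HasAdvertising": [True, False, True, False],
    "InEnglish": ["Yes", "No", "NA", "Yes"],
}

# Build only the right-hand side of a formula. patsy adds an intercept and
# expands each categorical variable using treatment (dummy) coding.
design = dmatrix("HasAdvertising + InEnglish", data)

columns = design.design_info.column_names  # names of the generated columns
values = np.asarray(design)                # plain array of 0/1 values

print(columns)
print(values)
```

The column names include one dummy for HasAdvertising (True) and two for InEnglish (No and Yes). The third row, whose InEnglish value is "NA", has zeros in both InEnglish columns, which is what it means for NA to be the baseline.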