An intro to dummy encoding with Skoot

Using Skoot to accelerate your ML pre-processing workflow

Posted on June 18, 2018

This post will introduce you to dummy encoding in skoot, one of my projects dedicated to helping machine learning practitioners automate as much of their workflow as possible. Those who have worked in the field for a while know that 80-90% of a data scientist's time is spent cleaning up data or building bespoke transformers to fit into an eventual production pipeline. Skoot aims to solve exactly this problem by abstracting common transformer classes and data cleansing tasks into a reusable API.

Note that this is a very high-level intro to the package; the full package documentation is available for review here

Mo’ data, mo’ problems

(Kinda. And not to say you'd ever ask for less data, but you know what I'm getting at…)

Imagine a client comes to you with a business question and hands you all the data you'll need to solve it. Is it ever sparkling clean and free of errors (typos, erroneous sensor values, missing data, or otherwise)?

NO! Even when the data has been used for modeling before, you’ll generally spend a significant amount of time cleaning your data, and the more features you have, the more time you’ll spend on data cleansing tasks.

Our aim with this dataset (the classic UCI Adult census data) is to predict whether a person makes less than or greater than $50k per year (binary classification). It's immediately recognizable that the data contains several different datatypes that will require transformation before we can do any modeling. Typically, a data scientist would spend an immense amount of time cleaning this data and preparing meaningful features. With skoot, we can begin to chip away at this bottleneck in a matter of minutes.

Converting categorical fields to numeric fields

If you want the cleanest pipeline possible, you'll end up building several custom TransformerMixin classes over the course of your modeling, one of which typically handles categorical encoding and dummy variables. There are a number of solutions to this problem out there, including pandas' pd.get_dummies, but not all of them account for two issues that skoot does: unseen factor levels at transform time, and the dummy variable trap (perfect collinearity when every level is retained).
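As a quick illustration, consider what a stateless helper like pd.get_dummies does when the test data contains a factor level the training data never saw (the toy frames below are invented for this example):

```python
import pandas as pd

# toy frames invented for illustration -- not the adult dataset
train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["red", "purple"]})  # "purple" never appears in train

# pd.get_dummies is stateless: it derives columns from whatever levels
# happen to be present, so train and test disagree on the feature space
train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

print(sorted(train_enc.columns))  # ['color_blue', 'color_green', 'color_red']
print(sorted(test_enc.columns))   # ['color_purple', 'color_red']
```

A model fit on train_enc cannot score test_enc without manually realigning the columns, which is exactly the kind of glue code a stateful, fit/transform encoder eliminates.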

Skoot addresses both of these for us seamlessly. If we look at the dtypes of the dataset, we can identify which columns will need dummy encoding:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("~/Downloads/adult.data.txt", header=None,
                 names=["age", "workclass", "fnlwgt", "education",
                        "education-num", "marital-status", "occupation",
                        "relationship", "race", "sex", "capital-gain",
                        "capital-loss", "hours-per-week", "native-country",
                        "target"])

y = df.pop("target")

object_cols = df.select_dtypes(["object", "category"]).columns.tolist()

# with some examination we can see that "education-num" is just
# an ordinal mirror of "education", so we can drop it
df.drop("education-num", axis=1, inplace=True)

# As always, we need to split our data
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2,
                                                    random_state=42)

This gives us the following fields as “object” (or string) type:

workclass

education

marital-status

occupation

relationship

race

sex

native-country

With skoot we can very quickly one-hot encode all of the categorical variables and drop one level from each (to avoid the dummy variable trap). Note that skoot does not force types when defining the DummyEncoder; this is because int fields are often actually ordinal categorical features that should be encoded (like "education-num" above). Instead, skoot lets us specify exactly which columns the transformation should be applied to:

To apply this to your test data, just as with any other scikit-learn transformer, you simply use the transform method:


encoder.transform(X_test)

Things to note

The resulting features drop one factor level from each categorical variable when drop_one_level=True (the default).

We address the situation where an unknown factor level is present at transform time.

Here’s a demo of what happens when there’s a new factor level present:

# select a test row (copy so we don't mutate X_test in place):
test_row = X_test.iloc[0].copy()

# set the country to something that is obviously not real
# (Series.set_value is deprecated, so we use .at):
test_row.at['native-country'] = "Atlantis"

# transform the new row:
trans2 = encoder.transform(pd.DataFrame([test_row]))

# prove that we did not assign a country encoding:
nc_mask = trans2.columns.str.contains("native-country")
assert trans2[trans2.columns[nc_mask]].sum().sum() == 0

And there you have it! <2 minutes to dummy encode your categorical features. The full code for this example is located in the code folder.