Data Preparation for Gradient Boosting with XGBoost in Python

XGBoost is a popular implementation of Gradient Boosting because of its speed and performance.

Internally, XGBoost models represent all problems as a regression predictive modeling problem that only takes numerical values as input. If your data is in a different form, it must be prepared into the expected format.

In this post, you will discover how to prepare your data for use with gradient boosting via the XGBoost library in Python.

Notice how the XGBoost model is configured to automatically model the multiclass classification problem using the multi:softprob objective, a variation on the softmax loss function that models class probabilities. This suggests that, internally, the output class is automatically converted into a one hot type encoding.
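A minimal sketch of that multiclass setup is shown below, using scikit-learn's built-in copy of the iris data in place of a CSV file; the data source and split parameters here are assumptions for illustration.

# minimal sketch: multiclass classification with XGBoost and label encoded string classes
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

data = load_iris()
X = data.data
# use the species names as string class labels
Y = data.target_names[data.target]

# encode string class values as integers
label_encoder = LabelEncoder()
label_encoded_y = label_encoder.fit_transform(Y)

# fit the model; printing it should show the automatically chosen multi:softprob objective
X_train, X_test, y_train, y_test = train_test_split(X, label_encoded_y, test_size=0.33, random_state=7)
model = XGBClassifier()
model.fit(X_train, y_train)
print(model)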

One Hot Encode Categorical Data

Some datasets only contain categorical data, for example the breast cancer dataset.

This dataset describes the technical details of breast cancer biopsies and the prediction task is to predict whether or not the patient has a recurrence of cancer.

We can see that all 9 input variables are categorical and described in string format. The problem is a binary classification prediction problem and the output class values are also described in string format.

We can reuse the same approach from the previous section and convert the string class values to integer values to model the prediction using the LabelEncoder. For example:

# encode string class values as integers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

We can use this same approach on each input feature in X, but this is only a starting point.

# encode string input values as integers
features = []
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:, i])
    features.append(feature)
encoded_x = numpy.array(features)
# transpose so that rows are samples and columns are features
encoded_x = encoded_x.transpose()

XGBoost may assume that encoded integer values for each input variable have an ordinal relationship. For example, 'left-up' encoded as 0 and 'left-low' encoded as 1 for the breast-quad variable would be assumed to have a meaningful relationship as integers. In this case, this assumption is untrue.

Instead, we must map these integer values onto new binary variables, one new variable for each categorical value.

We can one hot encode each feature after we have label encoded it. First, we must transform the feature array into a 2-dimensional NumPy array where each integer value is a feature vector of length 1.

feature = feature.reshape(X.shape[0], 1)

We can then create the OneHotEncoder and encode the feature array.

onehot_encoder = OneHotEncoder(sparse=False)
feature = onehot_encoder.fit_transform(feature)

Finally, we can build up the input dataset by concatenating the one hot encoded features, one by one, adding them on as new columns (axis=1). We end up with an input vector composed of 43 binary input variables.

# encode string input values as integers, then one hot encode each feature
encoded_x = None
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:, i])
    feature = feature.reshape(X.shape[0], 1)
    onehot_encoder = OneHotEncoder(sparse=False)
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = numpy.concatenate((encoded_x, feature), axis=1)
print("X shape: ", encoded_x.shape)

Ideally, we might experiment with not one hot encoding some of the input attributes, as we could instead encode them with an explicit ordinal relationship; for example, the first column, age, has values like '40-49' and '50-59'. This is left as an exercise if you are interested in extending this example, but a hint is sketched below.
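As a hint for that exercise, a minimal sketch of an explicit ordinal encoding for the age column might look like the following; the list of age bins used here is an assumption about the dataset.

# hypothetical sketch: encode the age column (column 0) with an explicit ordinal mapping
# instead of one hot encoding it; the age bins listed here are an assumption
age_order = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
age_map = {value: index for index, value in enumerate(age_order)}
age_encoded = numpy.array([age_map[value] for value in X[:, 0]]).reshape(X.shape[0], 1)

The resulting column could then be concatenated alongside the other features in place of its one hot encoded version.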

Below is the complete example with label and one hot encoded input variables and label encoded output variable.
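A sketch of that complete example, reconstructed from the snippets above, might look like this; the file name datasets-uci-breast-cancer.csv, the column layout with the class label in the last column, and the train/test split parameters are assumptions here.

# sketch: binary classification with label and one hot encoded inputs
import numpy
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score

# load data (file name and column layout are assumptions)
data = read_csv('datasets-uci-breast-cancer.csv', header=None)
dataset = data.values
# split data into X and y
X = dataset[:, 0:9]
X = X.astype(str)
Y = dataset[:, 9]
# encode string input values as integers, then one hot encode each feature
encoded_x = None
for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    feature = label_encoder.fit_transform(X[:, i])
    feature = feature.reshape(X.shape[0], 1)
    # note: newer scikit-learn versions use sparse_output=False instead of sparse=False
    onehot_encoder = OneHotEncoder(sparse=False)
    feature = onehot_encoder.fit_transform(feature)
    if encoded_x is None:
        encoded_x = feature
    else:
        encoded_x = numpy.concatenate((encoded_x, feature), axis=1)
print("X shape: ", encoded_x.shape)
# encode string class values as integers
label_encoder = LabelEncoder()
label_encoded_y = label_encoder.fit_transform(Y)
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(encoded_x, label_encoded_y, test_size=0.33, random_state=7)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
print(model)
# make predictions for test data and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))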

Again we can see that the XGBoost framework chose the 'binary:logistic' objective automatically, the right objective for this binary classification problem.

Support for Missing Data

XGBoost can automatically learn how to best handle missing data.

In fact, XGBoost was designed to work with sparse data, like the one hot encoded data from the previous section, and missing data is handled the same way that sparse or zero values are handled, by minimizing the loss function.

Once loaded, we can see that the missing data is marked with a question mark character (‘?’). We can change these missing values to the sparse value expected by XGBoost which is the value zero (0).

# set missing values to 0
X[X == '?'] = 0

Because the missing data was marked as strings, those columns with missing data were all loaded as string data types. We can now convert the entire set of input data to numerical values.

# convert to numeric
X = X.astype('float32')

Finally, this is a binary classification problem although the class values are marked with the integers 1 and 2. We model binary classification problems in XGBoost as logistic 0 and 1 values. We can easily convert the Y dataset to 0 and 1 integers using the LabelEncoder, as we did in the iris flowers example.
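As an alternative to marking the missing values with zero, we can impute them, for example with the mean of each column. A minimal sketch of that variation, assuming scikit-learn's SimpleImputer (the imputer class and the mean strategy are assumptions here):

import numpy
from sklearn.impute import SimpleImputer

# alternative preparation: mark the '?' entries as NaN instead of 0
# (this applies to the raw string-valued X, before the zero replacement above)
X[X == '?'] = numpy.nan
X = X.astype('float32')

# impute each missing value with the mean of its column
imputer = SimpleImputer(missing_values=numpy.nan, strategy='mean')
imputed_x = imputer.fit_transform(X)

The model is then fit on imputed_x rather than X.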

Running this example with the imputed values, we see results equivalent to fixing the value to one (1). This suggests that, at least in this case, we are better off marking the missing values with a distinct value of zero (0) rather than a valid value (1) or an imputed value.

Accuracy: 79.80%

It is a good lesson to try both approaches (automatic handling and imputing) on your data when you have missing values.
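Also note that the XGBoost scikit-learn wrapper exposes a missing parameter, so you can be explicit with the model about which value marks missing data in your dataset. A minimal sketch:

from xgboost import XGBClassifier

# explicitly tell the model that the value 0 marks missing data
model = XGBClassifier(missing=0)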

Summary

In this post you discovered how you can prepare your machine learning data for gradient boosting with XGBoost in Python.

Specifically, you learned:

How to prepare string class values for binary classification using label encoding.

How to prepare categorical input variables using a one hot encoding to model them as binary variables.

How XGBoost automatically handles missing data and how you can mark and impute missing values.

Do you have any questions about how to prepare your data for XGBoost or about this post? Ask your questions in the comments and I will do my best to answer.

Thanks for the tutorial with such useful information! I have one question regarding the label encoding and the one hot encoding you applied on the breast cancer dataset.

You perform label encoding and one hot encoding on the whole dataset and then split it into train and test sets. This way, it is ensured that all of the data is transformed with the same encoding configuration.

However, if we have new unseen data with the raw dataset type, how can we ensure that label encoding and one hot encoding is still transforming the unseen data in the same way? Do we need to save the encoders for the sake of processing unseen data?

Besides these kinds of data transformations, do we need to consider scaling or normalisation of the input variables before passing them to XGBoost? We know that it generally yields better results for SVMs, especially with kernel functions.

I had a question about how to treat "default" values of continuous predictors for XGBoost. For example, let's say attribute X may take continuous values (say, in the range 1-100). But certain records may have some default value (say 9999) which denotes a certain segment of customers for whom that predictor X cannot be calculated or is unavailable. Can we directly use predictor X as an input variable for an XGBoost model? Or should we do some data treatment for X? If so, what would that be?

For your missing data part, you replaced '?' with 0. But you have not mentioned, while defining the XGBClassifier model, that 0 should be treated as the missing value in your dataset. By default, the 'missing' parameter value is None, which is equivalent to treating NaN as the missing value. So I don't think your model is handling missing values.

On that web page, your code classifies IRIS with 96.6% accuracy, which is very good.
In the comments section, you told another reader (Abhilash Menon) to use gradient boosting with XGBoost, which is this tutorial; it also covers IRIS classification.
What I do not understand:
Here, the classification with gradient boosting with XGBoost yields only 79% or 83% accuracy.
Why should we use gradient boosting with XGBoost then?
The accuracy is too low.

Hi, Jason. First of all, I would like to thank you for the wonderful material. I have been trying the XGBoost algorithm, and it seems it's acting weird on my PC. On the first iris species dataset I got a score of 21.2%. Now on this breast cancer one my accuracy is 2.1%. I really don't know what's wrong; can you please help me?

Hello Jason, I know that for regression models we should drop the first dummy variable to avoid the "dummy variable trap". I don't see you doing that in this case. Is it because the dummy variable trap only applies to linear regression models and not gradient boosting algorithms?

Unfortunately, I could not get the "datasets-uci-breast-cancer.csv" file; it looks like it was removed from the website. I searched the web but could not find a similar file to try out your example.
Is there any possibility you could send me this file?

Thanks for taking the time to map this all out. Prior to reading your tutorial, I used the DataCamp course on XGBoost as a guide, where they use two steps for encoding categorical variables: LabelEncoder followed by OneHotEncoder.

So a categorical variable with 5 levels is converted to values 0-4 and then these are one-hot encoded into five columns.

I am interested in the feature importance, so xgb.plot_importance is a great tool. However, the features are two steps removed from their original state.

How would you undo this two-step encoding to get the original variable names?

Jason, do you have any idea about applying XGBoost to a multilabel classification problem? I just need some help from you on the data preparation part. One suggestion from Stack Overflow I found is to massage your data a little by making k copies of every data point that has k correct/positive labels; then you can hack your way to a simpler multi-class problem…. Any suggestions would be highly appreciated!