Tutorial Overview

The process is quite simple. Below is an overview of the steps we are going to complete in this tutorial.

Setup Your AWS Account (if needed).

Launch Your AWS Instance.

Login and Run Your Code.

Train an XGBoost Model.

Close Your AWS Instance.

Note: it costs money to use a virtual server instance on Amazon. The cost is very low for ad hoc model development (e.g. less than one US dollar per hour), which is why this approach is so attractive, but it is not free.

The server instance runs Linux. It is desirable although not required that you know how to navigate Linux or a Unix-like environment. We’re just running our Python scripts, so no advanced skills are needed.

1. Setup Your AWS Account (if needed)

You need an account on Amazon Web Services.

1. You can create an account using the Amazon Web Services portal by clicking “Sign in to the Console”. From there you can sign in using an existing Amazon account or create a new account.

AWS Sign-in Button

2. If creating an account, you will need to provide your details as well as a valid credit card that Amazon can charge. The process is a lot quicker if you are already an Amazon customer and have your credit card on file.

Note: If you have created a new account, you may have to make a request to Amazon support in order to be approved to use the larger (non-free) server instances in the rest of this tutorial.

2. Launch Your Server Instance

Now that you have an AWS account, you want to launch an EC2 virtual server instance on which you can run XGBoost.

Launching an instance is as easy as selecting the image to load and starting the virtual server.

We will use an existing Fedora Linux image and install Python and XGBoost manually.

3. Select “N. California” from the drop-down in the top right-hand corner. This is important; otherwise, you may not be able to find the image (called an AMI) that we plan to use.

Select N California

4. Click the “Launch Instance” button.

5. Click “Community AMIs”. An AMI is an Amazon Machine Image. It is a frozen instance of a server that you can select and instantiate on a new virtual server.

Community AMIs

6. Enter AMI: “ami-02d09662” in the “Search community AMIs” search box and press enter. You should be presented with a single result.

This is an image for a base installation of Fedora Linux version 24, a very easy-to-use Linux distribution.

Select the 64-bit Fedora Linux AMI

7. Click “Select” to choose the AMI in the search result.

8. Now you need to select the hardware on which to run the image. Scroll down and select the “c3.8xlarge” hardware.

This is a large instance that includes 32 CPU cores, 60 gigabytes of RAM, and two large SSD disks.

Select the c3.8xlarge Instance Type

9. Click “Review and Launch” to finalize the configuration of your server instance.

You will see a warning like “Your instance configuration is not eligible for the free usage tier”. This just indicates that you will be charged for your time on this server. We know this; ignore the warning.

Your instance configuration is not eligible for the free usage tier

10. Click the “Launch” button.

11. Select your SSH key pair.

If you have a key pair because you have used EC2 before, select “Choose an existing key pair” and choose your key pair from the list. Then check “I acknowledge…”.

If you do not have a key pair, select the option “Create a new key pair” and enter a “Key pair name” such as “xgboost-keypair”. Click the “Download Key Pair” button.

12. Open a Terminal and change directory to where you downloaded your key pair.

13. If you have not already done so, restrict the access permissions on your key pair file. This is required as part of the SSH access to your server. For example, on your console you can type:

cd Downloads
chmod 600 xgboost-keypair.pem

14. Click “Launch Instances”.

Note: If this is your first time using AWS, Amazon may have to validate your request, and this could take up to two hours (often just a few minutes).

15. Click “View Instances” to review the status of your instance.

Review Your Running Instance and Note its IP Address

Your server is now running and ready for you to log in.

3. Login and Configure

Now that you have launched your server instance, it is time to login and configure it for use.

You will need to configure your server each time you launch it. Therefore, it is a good idea to batch all work so you can make good use of your configured server.

Configuring the server will not take long, perhaps 10 minutes total.

1. Click “View Instances” in your Amazon EC2 console if you have not already.

2. Copy the “Public IP” (at the bottom of the screen, in the Description tab) to the clipboard.

In this example, my IP address is 52.53.185.166. Do not use this IP address; your IP address will be different.

3. Open a Terminal and change directory to where you downloaded your key pair. Log in to your server using SSH; for example, you can type:

ssh -i xgboost-keypair.pem fedora@52.53.185.166

4. You may be prompted with a warning the first time you log into your server instance. You can ignore this warning; just type “yes” and press enter.

You are now logged into your server.

Double-check the number of CPU cores on your instance by typing:

cat /proc/cpuinfo | grep processor | wc -l

You should see:

32

3a. Install Supporting Packages

The first step is to install all of the packages to support XGBoost.

This includes GCC, Python and the SciPy stack. We will use the Fedora package manager dnf (the new yum).
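As a minimal sketch, the configuration might look like the commands below. The exact package names (e.g. python2-numpy) are assumptions that may vary with the Fedora release, and XGBoost is built from source from its GitHub repository:

# install build tools and the Python SciPy stack (package names may vary by release)
sudo dnf install gcc gcc-c++ make git unzip python python2-numpy python2-scipy python2-scikit-learn python2-pandas
# build and install XGBoost from source
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
make -j32
cd python-package
sudo python setup.py install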

4. Train an XGBoost Model

We will use the Otto Group Product Classification Challenge dataset, which is available for free from Kaggle (you will need to sign up to Kaggle to be able to download it). It describes more than 61,000 products with 93 obfuscated features, grouped into 10 product categories (e.g. fashion, electronics, etc.). The input variables are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 10 categories; models are evaluated using multiclass logarithmic loss (also called cross entropy).
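For intuition, multiclass log loss can be computed with scikit-learn’s log_loss function. The sketch below uses made-up probabilities for three products and three classes (the competition uses 10):

from sklearn.metrics import log_loss
# true class labels for three hypothetical products (integer encoded)
y_true = [0, 2, 1]
# one row of predicted class probabilities per product
y_pred = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6], [0.3, 0.4, 0.3]]
# multiclass logarithmic loss: lower is better, 0 is perfect
print(log_loss(y_true, y_pred))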

This competition was completed in May 2015, and the dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem, and the fact that little data preparation is required (other than encoding the string class variable as integers).

Create a new directory called work/ on your workstation.

You can download the training dataset train.csv.zip from the Data page and place it in your work/ directory on your workstation.

We will evaluate the time taken to train an XGBoost model on this dataset using different numbers of cores.

We will try 1 core, half of the cores (16), and all 32 cores. We can specify the number of cores used by the XGBoost algorithm by setting the nthread parameter of the XGBClassifier class (the scikit-learn wrapper for XGBoost).

The complete example is listed below. Save it in a file with the name work/script.py.

# Otto multi-core test
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
import time
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:, 0:94]
y = dataset[:, 94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# evaluate the effect of the number of threads
results = []
num_threads = [1, 16, 32]
for n in num_threads:
	start = time.time()
	model = XGBClassifier(nthread=n)
	model.fit(X, label_encoded_y)
	elapsed = time.time() - start
	print(n, elapsed)
	results.append(elapsed)

Now we can copy the work/ directory with the data and script to your AWS server.

From your workstation in the current directory where the work/ directory is located, type:

scp -r -i xgboost-keypair.pem work fedora@52.53.185.166:/home/fedora/

Of course, you will need to use your key file and the IP address of your server.

This will create a new work/ directory in your home directory on your server.

Log back onto your server instance (if needed):

ssh -i xgboost-keypair.pem fedora@52.53.185.166

Change directory to your work directory and unzip the training data.

cd work
unzip train.csv.zip

Now we can run the script and train our XGBoost models and calculate how long it takes using different numbers of cores:

python script.py

You should see output like:

(1, 84.26896095275879)
(16, 6.597043037414551)
(32, 7.6703619956970215)

You can see little difference between 16 and 32 cores. I believe the reason for this is that AWS provides 16 physical cores with hyperthreading, which appear as 32 virtual cores. Nevertheless, building a large XGBoost model in about 7 seconds is great.
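If you want to check this on your own instance, one way (a standard Linux command, not specific to this tutorial) is to compare the socket, core, and thread counts reported by lscpu:

# show threads per core, cores per socket, and socket count
lscpu | egrep 'Thread|Core|Socket'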

You can use this as a template for copying your own data and scripts to your AWS instance.

A good tip is to run your scripts as a background process and forward any output to a file. This is just in case your connection to the server is interrupted or you want to close it down and let the server run your code all night.

You can run your code as a background process and redirect output to a file by typing:

nohup python script.py >script.py.out 2>&1 &
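You can then watch the progress of a background run by following the output file with the standard tail command:

# follow new lines as they are appended to the log
tail -f script.py.out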

Now that we are done, we can shut down the AWS instance.

5. Close Your AWS Instance

When you are finished with your work you must close your instance.

Remember you are charged by the amount of time that you use the instance. It is cheap, but you do not want to leave an instance on if you are not using it.

1. Log out of your instance at the terminal, for example you can type:

exit

2. Log in to your AWS account with your web browser.

3. Click EC2.

4. Click “Instances” from the left-hand side menu.

5. Select your running instance from the list (it may already be selected if you only have one running instance).

6. Click the “Actions” button, select “Instance State”, and choose “Terminate”. Confirm the termination when prompted.
