Loading the data

First we are going to load the dataset as a data frame. We assume that the current working directory is the directory where the dataset is stored. We pass the sep option because the default separator of read.table is whitespace. In addition, as one can observe from the dataset instructions, missing values are denoted with ?, so we set na.strings accordingly. To check the documentation of the read.table function, use the command ?read.table.
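The loading step could look like the following sketch; the file name and the comma separator are assumptions about the dataset, so adapt them to the actual file:

```r
# A sketch of the loading step; file name and separator are assumptions.
data <- read.table("dataset.data",   # file in the current working directory
                   sep = ",",        # assumed comma-separated values
                   na.strings = "?", # missing values are denoted with ?
                   header = FALSE)   # assumed: the file has no header row
```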

The next step is to split the dataset into a training set (70%) and a validation set (30%). To later compare different models, or the same model trained with different parameters, we are going to use the same training and validation sets throughout. Since the split is random, we set a seed so that it stays the same across our experiments.
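A minimal sketch of the split, assuming the data frame is called data; the seed value and the name validData are arbitrary choices, while trainData matches the name used in the rpart calls below:

```r
set.seed(1234)  # fix the RNG so the split is reproducible
idx <- sample(seq_len(nrow(data)), size = 0.7 * nrow(data))
trainData <- data[idx, ]   # 70% of the rows for training
validData <- data[-idx, ]  # remaining 30% for validation
```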

Training

Now we load the libraries rpart, rpart.plot and party. If they are not on your system, you will have to install them with the commands install.packages("rpart"), install.packages("rpart.plot") and install.packages("party").

library(rpart)
library(rpart.plot)
library(party)

rpart

Let's produce a decision tree by training the induction algorithm on the training set. Check out the options of rpart with the command ?rpart.

tree <- rpart(Class ~ ., data = trainData, method = "class")
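Since we loaded rpart.plot, the fitted tree can be visualized with it; a single call with the defaults is enough for a first look:

```r
# Plot the fitted classification tree; rpart.plot chooses sensible
# defaults (predicted class and node proportions) for "class" trees.
rpart.plot(tree)
```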

One can also use a different split criterion, such as the entropy (information) split rule:
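In rpart this is selected through the parms argument; split = "information" requests the entropy criterion, the default being the Gini index:

```r
# Train the same tree, but split nodes using the information
# (entropy) criterion instead of the default Gini index.
tree_entropy <- rpart(Class ~ ., data = trainData, method = "class",
                      parms = list(split = "information"))
```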

The behaviour of rpart can also be tuned through its control parameters. For their meaning you can check out the documentation ?rpart.control. The most important of them are:

- minsplit: the minimum number of observations that must exist in a node for a split to be attempted.
- minbucket: the minimum number of observations in any terminal (leaf) node.
- maxdepth: the maximum depth of any node of the final tree.
- cp: the complexity parameter; a split is only kept if it improves the fit by at least a factor of cp. It is set empirically, and larger values lead to more pruning.
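A sketch of how these parameters are passed via rpart.control; the particular values here are just the package defaults, shown for illustration:

```r
# Tune tree growth; all values below are illustrative (they are
# in fact rpart's defaults, except maxdepth which is lowered).
ctrl <- rpart.control(minsplit = 20,   # need >= 20 observations to try a split
                      minbucket = 7,   # every leaf must hold >= 7 observations
                      cp = 0.01,       # drop splits improving fit by < 1%
                      maxdepth = 10)   # cap the depth of the tree
tree_tuned <- rpart(Class ~ ., data = trainData, method = "class",
                    control = ctrl)
```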