Matt Gregory

There seems to be nothing the British press likes more than a good house price story. Accordingly, we use the Kaggle dataset on house prices as a demonstration of the data science workflow. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home (every dataset has a story, see here for details). I found this dataset particularly interesting, as it tells someone new to the housing market which variables to ask questions about when buying a house.

We start by downloading the data from Kaggle and reading the training data into R using the readr package, a subset of the excellent package of packages that is the tidyverse. We then check that the variables (features) have the appropriate data class.

library(tidyverse)

train <- readr::read_csv(file = "./data/2016-10-23-train.csv", col_names = TRUE)
test <- readr::read_csv(file = "./data/2016-10-23-test.csv", col_names = TRUE)

# problems with the names of some variables having `ticks
names(train) <- make.names(names(train))
names(test) <- make.names(names(test))

# VARIABLE TYPE -------------------------------------------------------------------
# note from documentation, all character class are factor variables, no strings
# We identify and select all character variables and get their names
is_char <- sapply(train, is.character)
to_correct <- names(select(train, which(is_char)))

# Correct data type of variable
# use mutate_each_, which is the standard evaluation version, to change variable classes
train_correct_type <- train %>% mutate_each_(funs(factor), to_correct)
# glimpse(train_correct_type)

Inspecting some of our factors reveals, unsurprisingly, that their levels are not in a meaningful order. R defaults to alphabetical order, and the factors are not flagged as ordered because that is not the default for factor(). This loses information: if we compare what R thinks is the case with what the documentation says should be the case, we see a discrepancy. Setting the levels correctly for each factor could improve our predictions.

levels(train_correct_type$BsmtQual)

## [1] "Ex" "Fa" "Gd" "TA"

is.ordered(train_correct_type$BsmtQual)

## [1] FALSE

BsmtQual: Evaluates the height of the basement

| Code | Description | Inches |
|------|-------------|--------|
| Ex | Excellent | 100+ |
| Gd | Good | 90-99 |
| TA | Typical | 80-89 |
| Fa | Fair | 70-79 |
| Po | Poor | <70 |
| NA | No Basement | NA |
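As a minimal sketch of setting the order explicitly, assuming the quality codes run from Po (worst) to Ex (best) as in the table above. The vector `bsmt_qual` is a made-up stand-in, not the Kaggle data:

```r
# Hypothetical toy vector of basement quality codes (not the Kaggle data)
bsmt_qual <- c("Gd", "TA", "Ex", "Fa", "TA", NA)

# Levels listed from worst to best, following the documentation table
quality_levels <- c("Po", "Fa", "TA", "Gd", "Ex")

bsmt_qual_ord <- factor(bsmt_qual, levels = quality_levels, ordered = TRUE)

levels(bsmt_qual_ord)                # order follows the documentation, not the alphabet
is.ordered(bsmt_qual_ord)            # TRUE
bsmt_qual_ord[1] > bsmt_qual_ord[2]  # "Gd" ranks above "TA"
```

Note that the NA entry, which the documentation uses for "No Basement", stays missing here. It would need its own level if you wanted to keep that information.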

Missing data

Every messy data is messy in its own way - Hadley Wickham

There is missing data for a variety of combinations of the variables. A shortcut for visualising indicator matrices for missing values is visna(). The columns represent the variables and the rows the missing-data patterns. Here we have plenty of missing patterns. The bars beneath the columns show the proportion of missing values by variable, and the bars on the right show the relative frequencies of the patterns. Most of the data is complete.

The missing data is found in about a dozen or so of the variables (those on the left). A handful of variables contribute the bulk of the missing data (note the heavy skew of the dodgy variables): PoolQC, MiscFeature, Alley and Fence tend to be missing. These variables warrant closer inspection of the supporting documentation to suggest why this might be the case.

extracat::visna(train_correct_type,sort="b")
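If you prefer numbers to a plot, the same proportions can be computed in base R. This is only a sketch on an invented data frame `toy`, whose column names are borrowed from the data set but whose values are made up:

```r
# Hypothetical toy data frame standing in for the training data
toy <- data.frame(
  LotArea = c(8450, 9600, 11250, 9550),
  Alley   = c(NA, NA, "Grvl", NA),
  PoolQC  = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Proportion of missing values per variable, largest first
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
missing_share

# Proportion of rows with no missing values at all
complete_share <- mean(complete.cases(toy))
complete_share
```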

Keep it simple

For now, we will keep things simple by ignoring all categorical variables and dropping them. We also remove the artificial Id variable.
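The code for this step is not shown, although later snippets use an object called train_no_factors. One possible construction, assuming it simply keeps the numeric columns and drops Id, is sketched below on a made-up data frame (`toy` and `toy_no_factors` are illustrative names):

```r
# Hypothetical toy data frame (values invented for illustration)
toy <- data.frame(
  Id        = 1:3,
  BsmtQual  = factor(c("Gd", "TA", "Ex")),
  GrLivArea = c(1500, 1200, 1800),
  SalePrice = c(200000, 150000, 250000)
)

# Keep numeric columns only, then drop the artificial Id column
is_num <- vapply(toy, is.numeric, logical(1))
toy_no_factors <- toy[, is_num & names(toy) != "Id", drop = FALSE]

names(toy_no_factors)
```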

Outliers

Let’s look for any unusual outliers that may affect our fitted model during training. As SalePrice is our response variable and what we are trying to predict, we get a quick overview using the scatterplot matrix for numeric data. We don’t plot it here but provide the code for you to explore (it takes a minute to compute).

# p1 <- GGally::ggpairs(train_no_factors)
# p1

Let’s take a closer look at GrLivArea: there seem to be four outliers in the training data. The data set’s documentation recommends removing any houses with more than 4000 square feet (which, in the full data set, eliminates five unusual observations), so we do the same here.

plot(train_no_factors$SalePrice,train_no_factors$GrLivArea)

# remove outliers
train <- train_correct_type %>% filter(GrLivArea < 4000)

Inspecting numeric variables correlated with the response variable

# plot matrix and ordered by first principal component
corrplot::corrplot(
  cor(train_no_factors),
  method = "circle",
  type = "lower",
  diag = FALSE,
  order = "FPC",
  tl.cex = 0.6,
  tl.col = "black"
)

This display of the correlation matrix shows the variables most strongly associated with SalePrice. This provides a good starting point for modelling and/or feature selection. For example, OverallCond shows poor correlation with SalePrice; perhaps we need to adjust this variable to improve its information content. Or perhaps people ignore the condition and think of the property as a fixer-upper opportunity. As you can see, there is huge depth to the data and it would be easy to feel overwhelmed. Fortunately, we’re not trying to win the competition, just produce some OK predictions quickly.
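To read the ranking off as numbers rather than from a plot, one option is to sort the correlations with the response. A small sketch on simulated data follows; the data frame `toy`, its coefficients and the noise level are invented purely for illustration:

```r
set.seed(1)  # simulated data, not the Kaggle set
n <- 50
toy <- data.frame(
  GrLivArea   = rnorm(n, mean = 1500, sd = 400),
  OverallCond = sample(1:9, n, replace = TRUE)
)
# Price driven by living area plus noise; condition plays no role by construction
toy$SalePrice <- 100 * toy$GrLivArea + rnorm(n, mean = 0, sd = 20000)

# Correlation of every numeric variable with the response
cors <- cor(toy, use = "pairwise.complete.obs")[, "SalePrice"]
cors <- cors[names(cors) != "SalePrice"]

# Strongest absolute correlation first
ranked <- cors[order(abs(cors), decreasing = TRUE)]
ranked
```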

Premature optimization is the root of all evil - Donald Knuth

Regression

To simplify the problem and celebrate the mlr package release (or at least my discovery of it), I implement some of the package’s tools for regression and feature selection here. For a detailed tutorial, which this post draws heavily from, see the mlr home page. Some Kagglers have also contributed many and varied useful ideas about this problem.

Machine Learning Tasks

Learning tasks encapsulate the data set and further relevant information about a machine learning problem, for example the name of the target variable for supervised problems, in this case SalePrice.

library(mlr)

# first column is Id
regr_task <- makeRegrTask(id = "hprices", data = train[, 2:81], target = "SalePrice")
regr_task

As you can see, the Task records the type of the learning problem and basic information about the data set, e.g., the types of the features (numeric vectors, factors or ordered factors), the number of observations, or whether missing values are present.

Constructing a Learner

A learner in mlr is generated by calling makeLearner. In the constructor we specify the learning method we want to use. Moreover, you can:

Set hyperparameters.

Control the output for later prediction, e.g., for classification whether you want a factor of predicted class labels or probabilities.

Set an ID to name the object (some methods will later use this ID to name results or annotate plots).
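The three options above might look like this in practice. This is a sketch rather than the configuration used for the results in this post: the learner classes, the ID and the maxdepth value are arbitrary choices, and "regr.rpart" and "classif.rpart" need the rpart package installed.

```r
library(mlr)

# Regression tree with one hyperparameter set and a custom ID
lrn <- makeLearner(
  "regr.rpart",
  id = "shallow_tree",             # hypothetical ID, used to label results
  par.vals = list(maxdepth = 3)    # arbitrary hyperparameter value
)
lrn
getHyperPars(lrn)

# For classification, predict.type controls labels vs. probabilities
clf <- makeLearner("classif.rpart", predict.type = "prob")
clf$predict.type
```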

Function train returns an object of class WrappedModel, which encapsulates the fitted model, i.e., the output of the underlying R learning method. Additionally, it contains some information about the Learner, the Task, the features and observations used for training, and the training time. A WrappedModel can subsequently be used to make a prediction for new observations.
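A minimal end-to-end sketch of training and predicting is below. It uses the built-in mtcars data as a stand-in for the house prices and a plain linear model ("regr.lm"). The odd/even split and the choice of features are assumptions made only to keep the example small.

```r
library(mlr)

# mtcars ships with R and stands in for the house price data here
demo_data <- mtcars[, c("mpg", "wt", "hp")]
task <- makeRegrTask(id = "mtcars_demo", data = demo_data, target = "mpg")
lrn  <- makeLearner("regr.lm")

# Odd rows for training, even rows held out
train_idx <- seq(1, nrow(demo_data), by = 2)
test_idx  <- setdiff(seq_len(nrow(demo_data)), train_idx)

mod  <- train(lrn, task, subset = train_idx)          # returns a WrappedModel
pred <- predict(mod, task = task, subset = test_idx)  # predictions for held-out rows

class(mod)
head(as.data.frame(pred))
performance(pred, measures = rmse)
```

The underlying fitted object (here the result of lm) can be pulled out with getLearnerModel(mod) if you want to inspect it directly.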

Submission

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
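The metric is easy to write down in base R. The helper name `rmse_log` is my own, and the prices are invented to illustrate the point about relative errors:

```r
# RMSE between log(predicted) and log(actual); both must be strictly positive
rmse_log <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# A 10% overestimate costs the same for a cheap and an expensive house
cheap     <- rmse_log(actual = 100000,  predicted = 110000)
expensive <- rmse_log(actual = 1000000, predicted = 1100000)
c(cheap = cheap, expensive = expensive)
```

Both values equal log(1.1), because the difference of logs depends only on the ratio of predicted to actual price.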

Feature engineering

We likely want to code categorical variables into dummy variables and think about how to combine or use the available variables for this regression problem to further reduce our RMSE below 0.27. There are also some tools for feature filtering in mlr.
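One way to create dummy variables, sketched in base R with model.matrix() on a made-up data frame. mlr also has its own helper for this, but the base version keeps the example free of dependencies:

```r
# Hypothetical toy data frame (values invented for illustration)
toy <- data.frame(
  BsmtQual  = factor(c("Gd", "TA", "Ex", "TA")),
  GrLivArea = c(1500, 1200, 1800, 1600)
)

# "- 1" drops the intercept so the factor gets one 0/1 column per level
dummies <- model.matrix(~ BsmtQual + GrLivArea - 1, data = toy)

colnames(dummies)
dummies
```

Each row has exactly one of the BsmtQual columns set to 1, and numeric columns such as GrLivArea pass through unchanged.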

Conclusion

Following house prices is a national obsession. Here we offer an alternative to the traditional MASS::Boston suburban house values: a more contemporary, comprehensive and complicated data set. This contributes to the Kaggle learning experience by providing Kagglers with a gentle introduction to machine learning with R, specifically the mlr package. We predict house prices with a respectable 0.27 RMSE using out-of-the-box approaches. Feature selection will help nudge the accuracy towards the dizzying heights of the Kaggle scoreboard, albeit with a high demand on the Kaggler’s time and insight.