Predicting house prices on Kaggle: a gentle introduction to data science – Part I

Data is ubiquitous these days, and being generated at an ever-increasing rate. However, left untouched and unexplored, it is of course of little use. This post will be the first in a series of tutorial articles exploring the process of moving from raw data to a predictive model. We’ll walk through the basic steps involved, and talk about some of the common pitfalls along the way.

Find a data set and ask it a question

Before we can begin any analysis, we first need to obtain some data and decide on a quantity that we would like to predict. For this, we’ll turn to Kaggle. The House Prices: Advanced Regression Techniques challenge asks us to predict the sale price of a house in Ames, Iowa, based on a set of information about it, such as size, location, condition, etc. A real estate agent might be able to do this based on intuition, experience and various rules of thumb, but we – lacking this ability and knowledge – would like to do so based only on the data we have about house sales in the past.

Although the details vary from problem to problem, the general process to get from data to predictive model tends to involve three major components:

Getting to know the data.

Cleaning and preparing the data for modelling.

Fitting models and evaluating their performance.

In this post, we will focus on the first of these, returning to the rest in Parts II and III. We’ll use Python as our language of choice throughout.

So let’s begin!

Load and get to know the data

First up, let’s load the python packages we’ll be using for the rest of this post.

Numpy is the basis of scientific computing in Python and will give us powerful array objects and the ability to perform mathematical operations on them. Pandas will give us efficient and convenient data structures (dataframes) that we will use to store and transform our data. Finally, matplotlib and seaborn will allow us to create some nice visualizations.

In the Kaggle House Prices challenge we are given two sets of data:

A training set which contains data about houses and their sale prices.

A test set which contains data about a different set of houses, for which we would like to predict sale price.

Here, we see that that training set contains 81 columns. The first 80 of these also appear in the test set: these will be the features on which we will base our predictions. The final column, SalePrice, is our target variable. A brief description of each column and its contents is provided by Kaggle in the ‘data_description.txt’ file.

Notice that, in total, the training set contains 1460 rows: each of these represents one house sold. Some columns, however, contain notably fewer entries. This tells us that we have missing values in our dataset.

Notice also that the data types of the columns are mixed: we have floats, integers and objects (strings). Looking more closely, we see that this is not merely a question of representation. Our features come in fundamentally different types:

Some features are inherently numerical: they are quantities that we can measure or count. Some of these are continuous, such as the total living area (GrLivArea), while others are discrete, such as the number of rooms (TotRmsAbvGrd).

Other features are categorical: they are qualitative or descriptive in nature. For example, this includes the neighbourhood in which the house is located (Neighborhood), and the type of foundation the house was built on (Foundation). There is no inherent ordering to these features and mathematical operations don’t make sense.

Yet others are ordinal: they comprise categories with an implicit order. Examples of this include the overall quality rating (OverallQual) or the irregularity of the lot (LotShape). We can think of them as representing values on an arbitrary scale.

We will talk about how to deal with missing values and non-numerical data types in Part II. For now, however, let’s have a closer look at the data.

Explore and visualize the data

Exploratory data analysis is a rich topic in its own right and there are many ways in which we can proceed, depending on the particular problem at hand. In general, however, we will typically want to have a look at

The distribution of the target variable and of individual features (univariate analysis).

The relationship between pairs of variables (bivariate analysis).

Spending some time doing this before launching into model building can make a huge impact on results. It can give us ideas about which kinds of feature transformations and models might be most useful, and help us find outliers and spurious values in our data. We won’t go into a full in-depth analysis here – we’ll leave that for the reader – but let’s have a look at a few examples.

1. Univariate analysis

Numerical variables

To get an idea of the distribution of numerical variables, histograms are an excellent starting point. Let’s begin by generating one for SalePrice, our target variable.

plt.hist(data_train.SalePrice, bins = 25)

Immediately, we see that the distribution is skewed towards cheaper homes, with a relatively long tail at high prices. To make the distribution more symmetric, we can try taking its logarithm:

plt.hist(np.log(data_train.SalePrice), bins = 25)

Besides making the distribution more symmetric, working with the log of the sale price will also ensure that relative errors for cheaper and more expensive homes are treated on an equal footing. In fact, if we have a look at the metric used to evaluate this Kaggle competition, we see that it is actually based on the log of the sale price rather than sale price itself (see here). As such, we can think of log(SalePrice) as our true target variable.

Categorical variables

For categorical variables, bar charts and frequency counts are the natural counterparts to histograms. As an example, let’s have a look at the distribution of Foundation in our training set:

Here we can immediately see that only two types of foundation (poured concrete (PConc) and cinderblock (CBlock)) dominate in our dataset. Stone and wood are very rare indeed. As with the histogram example above, this might prompt us to make a transformation. For example, depending on the type of model we decide to use, we may want to merge Stone, Wood and Slab into a single (‘other’) category. Alternatively, if we think stone and wood houses are very important, it may alert us to the fact that we have a deficiency in our data, and need to go out and collect more stone and wood examples.

Related to this last point, it is also important to check whether the distributions in our training and test sets are similar to each other. Since our models can only learn to make predictions for the kinds of data they have seen, if the distributions are very different, our models may not perform as well as we would hope. Plotting Foundation for the test set, we can see that this is luckily not the case here:

sns.countplot(data_test.Foundation)

2. Bivariate analysis

Having looked at some of our variables individually, let’s move on to exploring the relationships between them. Of course, most interesting will be the relationship between the target variable (sale price) and the features we will use for prediction. However, as we will see, studying relationships among features can also be important.

Numerical variables

For numerical features, scatter plots are the go-to tool. Since the total living area of a house is likely to be an important factor in determining its price, let’s create one for GrLivArea and SalePrice. We’ll plot the living area against the log of the sale price as well for comparison.

Immediately, we see that there is indeed a strong dependence of sale price on the total living area. As expected: the larger the house, the more expensive it tends to be. Notice that in the first plot the data points are bunched up at smaller values, just as we saw in the SalePrice histogram, and the amount variation in sale price increases with increasing area. When we take the log in the second plot, the distribution looks notably more balanced, giving us further motivation to use the log of the sale price as our target variable.

While there is clearly a trend of sale price increasing with area, if we look a little more closely, we also see that there are two points that don’t seem to fit in with the rest. Towards the lower right part of the plot, there are two very large houses (bigger than 4500 sqft) with unusually low sale prices. Such data points are known as outliers and, left untreated, can have a huge impact on the accuracy of a model. The way we handle outliers in general will very much depend on the problem we want to solve and the origin of the outlier values. In the simplest case, if we have a good reason to believe that the outliers represent spurious values or mistakes in the data – that is, they are instances we don’t want the model to learn from – they can simply be removed. In other cases, however, outliers can be crucially important. For example, in fraud detection, the outliers would be precisely the points we would be most interested in. In our example, according to the statistics professor who originally supplied the housing data, the outlier points are “Partial Sales that likely don’t represent actual market values” (see here). As such, we can take the simplest approach and exclude them.

Before moving on to categorical features, let’s see if we can also learn something by looking at the relationship between pairs of features. We would expect YearBuilt and GarageYrBlt to be related, so let’s create a scatter plot for them. Note that since we are not considering SalePrice this time, we can plot both training and test data.

As we might expect, the figure tells us that the majority of garages were built at the same time as the houses they belong to: these form the diagonal line that runs across the plot. A significant number were also added later: these are the points above the line. Inspired by this, we might consider creating a new feature that tells us whether or not a garage was originally constructed with the house or how many years later one was added.

In addition to this, we also see a number of values that seem rather strange. In both training and test sets, we have several garages that were built as many as 20 years earlier than their houses (the points below the diagonal line), and in the training set we have a garage from the future – the record claims that it was built in 2207! Clearly something has gone wrong with these entries and – if we have some means to do so – we would ideally replace them with corrected values. If this is not possible, however, we can proceed to treat them as if they were missing: a topic we will come to in Part II.

As a final remark, note that we have used the alpha parameter in our scatter plots above to make the points partially transparent. This allows us to keep track of the density of points, which can be particularly useful in cases when the number of points is very large. Doing so can reveal structure in the data that wouldn’t be visible otherwise, and thereby give us further ideas for data preprocessing and model selection. Besides using alpha within plot as we have done here, we could, for example, also use hexbin to create 2D density plots – for very large datasets, this may well be the better choice.

Categorical variables

For categorical variables, seaborn offers several nice alternatives to the scatter plot, including stripplot, pointplot, boxplot and violinplot (see here for a tutorial). Let’s have a look at a couple of examples for sale price as a function of neighbourhood – another feature that’s likely to be important for our predictive models.

The figure above is directly analogous to the scatter plots we looked at for numerical variables, with two main differences:

Jitter is used to randomly shift the points horizontally within each neighbourhood to make them more visible.

Since neighbourhood is categorical and therefore has no natural ordering, we are free to order the values along the x-axis as we like. In the plot above, we have sorted the neighbourhoods alphabetically.

As we might expect, there is considerable variation in price between neighbourhoods. The figure allows to get an idea of how different areas compare to each other at a glance. We can go further if we sort the neighbourhoods by average sale price:

Here, the points represent the average sale price for each neighbourhood, while the vertical bars indicate the uncertainty in this value.

Until next time…

There is, of course, a great deal more we can explore and discover in this dataset. However, this is where we’ll leave it for now. In Part II, we’ll move on to looking at what we need to do to get the data ready for modelling: in particular, we’ll talk about how to handle missing values and how to treat non-numerical variables. Until then, happy data-exploring!