Predict the Price of Your Favorite House

In this post we are going to talk about Linear Regression, one of the most widely used statistical tools in Machine Learning. The idea is very simple. We have some features and we want to know how our predictions change as we change the values of those features. The features are the square footage of the house, the number of bathrooms, the number of bedrooms, etc., and the observation is the price of the house.

So we want to generate a model that takes features as input and outputs the predicted price.

First, let’s look at a naive method for building such a model. We will consider only the square footage of the house as a feature and try to predict the price of the house. Let’s take our observations and make a plot of them.

The X axis represents the “feature”, square feet. It is also called the “covariate” or “predictor”. The Y axis represents the “observation” that we collect. It is also called the “response” or “dependent variable”. Each point on the graph represents a previous house sale.

So the question is: how are we going to use these observations to estimate the price of a house? One way is to look at how big the house is and then look at the prices of similar-sized houses, as shown below.
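A minimal sketch of this lookup, assuming we simply average the prices of the two sales closest in size. The function name naive_estimate and the sample numbers are made up for illustration and do not come from the King County data.

```python
def naive_estimate(sales, sqft, k=2):
    """Average the prices of the k past sales closest in size to `sqft`."""
    nearest = sorted(sales, key=lambda sale: abs(sale[0] - sqft))[:k]
    return sum(price for _, price in nearest) / float(len(nearest))

# Hypothetical (square feet, price) pairs, for illustration only.
past_sales = [
    (1000, 300000),
    (1500, 400000),
    (2000, 520000),
    (2100, 500000),
    (3000, 800000),
]

# Only the 2000 and 2100 sq ft sales influence this estimate.
estimate = naive_estimate(past_sales, 2050)
print(estimate)
```

Every other sale in the list is ignored, no matter how much it could tell us about the overall trend.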

The problem with that approach is that we base our estimate on only 2 house sales. We throw out all the other house sales. Is that reasonable?

Of course not. That approach treats all the other observations as if they had nothing to do with our prediction. We can instead model the relationship between the square footage of the house and the house sales price. To do this we are going to use Linear Regression.

Our main goal is to understand the relationship between the square footage of the house and the house sales price. The simplest model is a straight line fitted to the data.

This line is defined by:

y = W0 + W1 * X

Here y is the predicted price, W0 is the intercept, and W1 is the slope, that is, the weight on the feature X. The intercept and slope are the parameters of our model.

So now the question is, which line is the best line? To find the best fit we need to define a cost for any given line. We will use the Root Mean Square Error (RMSE), also known as the Root Mean Square Deviation, as our cost, and the best line is the one that minimizes it.
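To make the cost concrete, here is a small pure-Python sketch that scores two candidate lines by their RMSE. The helper names predict and rmse and the sample data are assumptions made for illustration. They are not part of GraphLab Create.

```python
import math

def predict(w0, w1, x):
    """Price predicted by the line with intercept w0 and slope w1."""
    return w0 + w1 * x

def rmse(w0, w1, xs, ys):
    """Root mean square error of the line (w0, w1) on the data."""
    squared_errors = [(y - predict(w0, w1, x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

# Made-up data lying exactly on the line price = 200 * sqft.
sqft = [1000, 2000, 3000]
price = [200000, 400000, 600000]

rmse_good = rmse(0, 200, sqft, price)      # this line passes through every point
rmse_bad = rmse(50000, 200, sqft, price)   # this line is shifted up by 50,000
print(rmse_good, rmse_bad)
```

The line with the lower RMSE fits the observations better, so the second candidate loses to the first.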

Now we are ready to make predictions with some real data. First, click here to download the dataset, which contains house sales in King County, the region where the city of Seattle, WA is located. Then open up your IPython Notebook. If you are not familiar with the IPython Notebook and GraphLab Create, I strongly encourage you to read this post.

We start by importing the GraphLab Create library, and then we load our data.

import graphlab
sales = graphlab.SFrame('home_data.gl/')

You can view the data in the IPython Notebook by typing:

sales

This will show the first few rows of the data.

Let’s explore the data a little more. We expect the house price to be correlated with the number of square feet of living space. Let’s show this on a scatter plot.

# Set the target to iPython Notebook, so it won't open in a new tab
graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")

Now it is time to create a simple linear regression model of sqft_living to price.

We need to split the data into a training set and a test set.

We will use seed = 0 so that everyone running this notebook gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).

# We take roughly 80% of the data as our training set and the remaining rows as the test set.
train_data, test_data = sales.random_split(.8, seed=0)
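To see why a fixed seed gives everyone the same split, here is a pure-Python sketch of a seeded random split. The function random_split below is a hypothetical stand-in written for illustration. It is not the SFrame method and does not claim to reproduce its exact row assignment.

```python
import random

def random_split(rows, fraction, seed=None):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a private generator, so the seed fully determines the split
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of the sales data
train, test = random_split(rows, 0.8, seed=0)
train_again, test_again = random_split(rows, 0.8, seed=0)

print(len(train), len(test))
print(train == train_again)  # same seed, same split
```

Because each row is assigned at random, the training set holds roughly, not exactly, 80% of the rows.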

Now we can build our model using the linear_regression.create function with only sqft_living as a feature.
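In GraphLab Create such a call typically looks like graphlab.linear_regression.create(train_data, target='price', features=['sqft_living']). As a rough illustration of what fitting a one-feature line involves, here is a minimal pure-Python sketch of the closed-form least squares solution. It is not GraphLab's solver, and fit_line and the sample numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (w0, w1) minimizing the squared error of y = w0 + w1 * x."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    # Intercept: forces the line through the point (mean_x, mean_y).
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Made-up training data generated from price = 50000 + 200 * sqft.
sqft = [1000, 1500, 2000, 2500]
price = [250000, 350000, 450000, 550000]

w0, w1 = fit_line(sqft, price)
predicted = w0 + w1 * 1800  # predicted price of an 1800 sq ft house
print(w0, w1, predicted)
```

Minimizing the sum of squared errors also minimizes the RMSE, so this line is the best fit under the cost defined earlier.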

Wouldn’t it be great to see what our predictions look like? We will use Matplotlib, a Python plotting library, to visualize them. You can install it with: ‘pip install matplotlib’

Our advanced model did a great job! The actual price of the house was $2,200,000 and we predicted $2,115,905, which is within about 4% of the actual price.

Finally, note that for some individual houses the model with more features may give a worse prediction than the simpler model with only 1 feature. On average, however, the model with more features is better. Your predictions may also differ slightly from the ones shown here.