Worthless in Seattle?

Aside from a being a director at my previous company, he has a lucrative side gig in property.

Given his computer science background, he’s able to scrape data from property websites, and analyse how a house’s attributes affect its price.

He can therefore identify properties that are significantly undervalued relative to expectations, invest smartly, and make handsome profits when re-selling.

The business case for any kind of data science pertaining to the murky world of property sales has therefore been plain to me for some time (or at least from the first time he took me to one of his two private members’ clubs in central London).

The Skyneedle — probably an outlier in King County’s property market…The dataset we consider here is for property sales in King County, Washington, which runs from Seattle in the North West, to the Okanogan-Wenatchee National Forest in the East.

This blog will explore how we can clean and mung this dataset, with a view to performing a multi-variate linear regression model to predict a property’s price.

The github repository for this work can be found here.

Step 1.

Inspecting and understanding the dataWe can import the data (given to us as a csv file) using the Pandas library, and inspect it using the .

info( ) method.

df = pd.

read_csv(‘kc_house_data.

csv’)df.

info()The key takeaways from this:We have 21597 observations (each representing one sale transaction), so not a bad sample size.

Data is complete, apart from in ‘waterfront’ and ‘yr_renovated’ columns.

We will need to investigate these to see if we can fill in the blanks.

‘date’ should have a datetime type to facilitate further analysis.

Also, ‘sqft_basement’ has type ‘object’, which we would not expect (this should be a float or an integer).

This will require investigation.

Many of the column titles are fairly self-explanatory, though this is not universal.

We can make use of the (rather good) King County website to get a better understanding of attributes such as grade and condition.

We may need to do further investigation of the dataset to get to grips with attributes such as ‘view’, whose meaning is not obvious.

Before continuing, we should also check that we have no duplicate data by checking that the below code returns 0.

len(df[df.

duplicated()])Step 2.

Investigating incomplete dataWe check the ‘waterfront’ and ‘yr_renovated’ columns to see what share of our rows contains at least one NaN value.

len(df.

loc[(df[‘yr_renovated’].

isna()) | (df[‘waterfront’].

isna())]) / len(df)This returns 27% — a chunk of the dataset that is too large to delete outright.

Let’s investigate the columns in turn and see if we can be more nuanced.

Looking first at the renovation column, let’s get a feel for the data by looking at the most commonly occurring values.

df.

groupby(by = ‘yr_renovated’)[‘yr_renovated’].

count().

sort_values(ascending = False).

head()So we see that, as well as missing values, 17,011 rows have ‘0’ as the year of last renovation.

This suggests that the house has never been renovated, or, as is more likely, data is not available.

We can drop this column — it will not provide insight given the share of unknowns here.

df.

drop(‘yr_renovated’, axis = 1, inplace = True)We now check back in on the ‘waterfront’ column, and see that only 11% of rows have N/As.

len(df.

loc[(df[‘waterfront’].

isna())]) / len(df)This is a small enough part of the sample for us to simply drop the offending rows, using the .

dropna( ) method.

df.

dropna(inplace=True)Step 3.

Fixing data typesEnsuring that data is cast as the most appropriate type is very important for when we come to do our exploratory analysis — data in the wrong type can throw errors when we try to use even basic functions (calculating a mean of strings is, of course, conceptually impossible).

The date column is easy enough to deal with, using the to_datetime( ) method:df[‘date’] = pd.

to_datetime(df[‘date’])This format will allow Pandas to recognise the date’s day of the week, month number, and will make the feature ‘ordinal’ — i.

e.

we will be able to put the dates in chronological order.

The ‘sqft_basement’ feature is currently type ‘object’, suggesting that we might have string entries as placeholders in the column.

Let’s look at the most commonly occurring values.

df.

groupby(by = ‘sqft_basement’)[‘sqft_basement’].

count().

sort_values(ascending = False).

head(4)It looks like the question mark ‘?’ entries are the ones causing us bother here.

We need to get rid of them, but ideally, we would replace the question marks with values, rather than simply deleting the rows.

Happily, we can calculate the basement area from the difference between the overall living space and the ‘above’ living space, given in the ‘sqft_living’ and ‘sqft_above’ columns.

We can write code that isolates the indexes of ‘?’ basement values, and replaces them with this calculation.

for i in list(df.

loc[df[‘sqft_basement’] == ‘?’].

index.

values): df.

loc[i,’sqft_basement’] = df.

loc[i,’sqft_living’] – df.

loc[i,’sqft_above’]Having done this, we can cast the column as a float.

df[‘sqft_basement’] = df[‘sqft_basement’].

astype(‘float64’)Step 4 .

Looking for outliersWhereas the last few steps were quite objective (a basement can’t have an area of ? square feet), dealing with outliers requires a little bit more nuance.

It’s perfectly reasonable to ask: if there’s no objective way to identify and eliminate an outlier, then why do it all?.The answer really gets to the heart of what it is we’re doing.

Ultimately, we need to decide what we want our model to do.

This normally involves making a trade off between model accuracy, and model scope.

In other words, do you want:A very accurate model that covers a kind of wide range of property types, or:2.

A kind of accurate model that covers a very wide range of property types.

Typically, we would want the first option — better predictions are what we’re after, even if it means narrowing the scope a bit.

So how do we choose what to drop?.Given that we’re getting a bit subjective, we can think about this visually.

In particular, histograms will show long tails (of, for example, very expensive properties).

We could conclude that a house with more than 7 bedrooms, more than 5 bathrooms, or with a lot of over 300k square feet, is unlikely to be representative of the sample.

Similarly, properties selling for over $1m (and indeed $100k) are liable to skew the model, as per the histograms above.

Looking at the lower end of the market — we can see only 21 houses with grades 3 or 4 (cabins or houses that do not meet building standards).

These are unlikely to be representitive.

We can also run with an assumption that a property probably ought to have at least 1 bathroom.