Become a data scientist in less than an hour

We had a very special guest presenter for this last webinar: Louis Dorard, author of the popular book Bootstrapping Machine Learning, showed us how anyone can do data science with the help of a few tools. To prove it, he walked through a simple three-step way to analyse the relationship between a house's price and its other attributes (square footage, number of bathrooms, etc.). In practical terms, this should allow us to predict how much a house might sell for based on its characteristics. Pretty useful if you're looking to sell your place soon!

Get the data

Before we can do any of that awesome data analysis and work out how much we can sell our houses for, we need some data! Using import.io, Louis built a quick Crawler for a popular US real estate website.

In this case he built a Crawler for the results page for Las Vegas, NV. Because there are multiple houses on each page, we need to map it as multiple results. Remember, too, that you need to give the Crawler five examples of pages that contain the data you're interested in.

Cleanse it with Pandas

Once your Crawler has finished, you can download all that awesome data as a CSV. The next step on your journey to becoming a data scientist is what's known as "data cleansing". It's exactly what it sounds like: cleaning up your data and getting rid of anything you don't need that might mess up your model.

To do this, Louis used the Pandas library. If you haven’t heard of it, I highly recommend you check out the “10 minutes to Pandas” tutorial for a quick introduction.

If you want to follow along with Louis’ example exactly you can use this handy notebook he created to clean your data set. You simply input your CSV and then follow the steps to clean it.
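Loading the exported CSV into Pandas is one line. Here's a minimal sketch; since we don't have the actual export on hand, it uses a tiny inline CSV with assumed column names (`price`, `bathrooms`, `sqft`) standing in for the real file:

```python
import io
import pandas as pd

# Stand-in for the CSV downloaded from import.io (columns are assumptions)
csv_data = io.StringIO(
    "price,bathrooms,sqft\n"
    "250000,2,1500\n"
    "310000,3,2100\n"
)

# With a real export you'd pass the filename instead of a StringIO buffer
df = pd.read_csv(csv_data)
print(df.shape)  # (2, 3)
```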

First, you'll want to get rid of all the columns of data that you don't need. To do this, select only the columns you want to keep and then display the data again. The next thing to check is whether you have any duplicate data; Pandas makes it easy to delete duplicate rows.
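Both of those steps can be sketched in a couple of lines of Pandas. The column names here are assumptions, not the actual fields from the webinar:

```python
import pandas as pd

# A tiny stand-in for the scraped listings (column names are assumptions)
df = pd.DataFrame({
    "price": [250000, 250000, 310000],
    "bathrooms": [2, 2, 3],
    "sqft": [1500, 1500, 2100],
    "listing_url": ["a", "a", "b"],  # an example column we don't need
})

# Keep only the columns we care about
df = df[["price", "bathrooms", "sqft"]]

# Drop exact duplicate rows
df = df.drop_duplicates()
print(len(df))  # 2 rows remain
```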

Then, you'll want to look for missing data. If you'll remember, some of the listings we pulled from the real estate site were empty lots, which means they didn't have attributes like number of bathrooms. In Pandas you can specify which columns must have data in them for a row to be kept in your data set.
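In Pandas this is `dropna` with a `subset` argument. A minimal sketch, again with assumed column names:

```python
import numpy as np
import pandas as pd

# Empty lots have no bathroom count (columns are assumptions)
df = pd.DataFrame({
    "price": [250000, 90000, 310000],
    "bathrooms": [2.0, np.nan, 3.0],
})

# Keep only rows where 'bathrooms' is present
df = df.dropna(subset=["bathrooms"])
print(len(df))  # 2
```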

In some cases you won't want to delete the row altogether, such as when a house has 2 bathrooms but no ½ bathrooms recorded. Instead, you'll want to specify a default value for those cases.
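Filling in a default is `fillna`. Here's a sketch under the assumption that the half-bathroom count lives in its own column:

```python
import numpy as np
import pandas as pd

# A house with full bathrooms but no half-bath count recorded (assumed columns)
df = pd.DataFrame({
    "bathrooms": [2.0, 3.0],
    "half_bathrooms": [np.nan, 1.0],
})

# Fill missing half-bathroom counts with a default of 0 instead of dropping rows
df["half_bathrooms"] = df["half_bathrooms"].fillna(0)
```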

The next thing to look for is mixed units, like when the lot size is measured in square feet in some rows and acres in others. To fix this, you need a conversion function to change all the acres into square feet.
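One way to do the conversion, assuming (hypothetically) that the unit is recorded in a separate column, is to multiply just the acre rows by 43,560 square feet per acre:

```python
import pandas as pd

# Lot sizes with mixed units (this two-column layout is an assumption)
df = pd.DataFrame({
    "lot_size": [0.25, 10890.0, 0.5],
    "lot_unit": ["acres", "sqft", "acres"],
})

ACRE_SQFT = 43560  # 1 acre = 43,560 square feet

# Convert only the rows measured in acres, then record the unit as sqft
mask = df["lot_unit"] == "acres"
df.loc[mask, "lot_size"] = df.loc[mask, "lot_size"] * ACRE_SQFT
df["lot_unit"] = "sqft"
print(df["lot_size"].tolist())  # [10890.0, 10890.0, 21780.0]
```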

The final step is to randomly shuffle the rows so that the machine learning model isn't biased by the order of the data.
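A common Pandas idiom for shuffling is `sample(frac=1)`, which returns every row in random order. A sketch with toy data:

```python
import pandas as pd

# Toy data; random_state just makes the shuffle reproducible
df = pd.DataFrame({"price": [100, 200, 300, 400]})

# sample(frac=1) returns all rows in random order; reset the index afterwards
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```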

Create a predictive model

Now it's time for the good stuff! To analyse the data, Louis used BigML, a powerful machine learning service with an easy-to-use interface for importing your data and getting predictions out of it.

Louis didn't have time to run through the steps of building a BigML predictive model in detail, but you can read them on his blog. In the end you should get a decision tree like this one:
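If you'd like a feel for what a decision tree does without signing up for BigML, here's a minimal local analogue using scikit-learn (not BigML's actual API, and the data is made up): it fits a small regression tree on house attributes and predicts a price for a new listing.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Made-up listings; columns mirror the assumed attributes from earlier
df = pd.DataFrame({
    "sqft": [1500, 2100, 900, 3000],
    "bathrooms": [2, 3, 1, 4],
    "price": [250000, 310000, 150000, 450000],
})

X = df[["sqft", "bathrooms"]]
y = df["price"]

# A shallow tree, roughly analogous to the model BigML builds and visualises
model = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Predict the price of a hypothetical 1,800 sq ft, 2-bathroom house
prediction = model.predict(pd.DataFrame({"sqft": [1800], "bathrooms": [2]}))
```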

Once you’ve analysed and refined your model it’s time to make some predictions!

If you want to learn more about machine learning and predictive analytics, pick up a copy of Louis’ book: Bootstrapping Machine Learning. Also, make sure to follow him on Twitter (@louisdorard) and check out his blog for more great tips and how-tos of machine learning. You can also follow along with the slides from his presentation.