RichardKappa – http://richardkappa.co.uk
Data blog where I post new things as I learn them. Currently interested in learning Excel skills, VBA and R.

Simple machine learning in R – ROC (1 Dec 2017)

A quick follow up on the last post. I forgot to write about plotting ROC curves in R based on the different models. In the last post I created five progressively more complicated decision trees, which didn't really add any benefit when looking at the accuracy of the model. But accuracy is just one metric: what do the ROC curves look like, and what are the areas under the curves?

For this we need to load the package ROCR, and then the process is pretty simple. Here is how to plot the ROC curve for one of the models.
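A minimal sketch, assuming the tree1 model and the test holdout sample from the decision trees post:

library(ROCR)

# Predicted probability that a house is in the 500k+ class; column 2 of
# rpart's probability matrix corresponds to the "1" class
probs <- predict(tree1, newdata = test, type = "prob")[, 2]

# Pair the predictions with the true labels, then extract the ROC curve
pred <- prediction(probs, test$FiveHundredPlus)
perf <- performance(pred, "tpr", "fpr")
plot(perf)

# Area under the curve
performance(pred, "auc")@y.values[[1]]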

On this measure the more complex trees do add value, which shows the worth of using more than one performance metric.

Simple machine learning in R - Decision Trees (1 Nov 2017)

Now that the land registry data has been imported and had some initial exploratory work done on it, let's have a go at making a price prediction model. I'll use a small subset of the data and initially only try to predict whether the house is worth more or less than £500k, rather than the more complicated task of predicting the price itself. The code used in this post is largely based on the DataCamp course "Introduction to Machine Learning". Code for this project is on my GitHub page here. This post focuses on decision trees using the package rpart.

The initial part of the code just sets up the data. We then define the variable “FiveHundredPlus”:

sales2016$FiveHundredPlus <- ifelse(sales2016$Price <= 500000, 0, 1)

It’s 0 if the house sold for £500k or less and 1 if it sold for more. It would be nice to jump straight into predicting the final price, but this course seems to start off by predicting class labels rather than continuous variables.
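A minimal sketch of fitting and drawing the first tree with rpart, assuming a training subset train of sales2016 (the predictor column names are assumptions):

library(rpart)

# Fit a classification tree for the 500k+ flag
tree1 <- rpart(FiveHundredPlus ~ Property_Type + Old_New + Duration + County,
               data = train, method = "class")

# Draw the tree to see the splits described below
plot(tree1)
text(tree1)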

If the house is in one of the counties Greater London, Buckinghamshire, Hertfordshire, Surrey, Windsor and Maidenhead or Wokingham and is a detached house, then the tree predicts it is worth more than £500k (70% of these houses are). Otherwise it predicts that it is worth less than £500k.

The accuracy of this tree is 0.92, which seems pretty good to me.

Models 2, 3, 4 and 5 - Changing the complexity control

The first tree is pretty simple, with just three branches. We can now change the settings so that the algorithm allows splits which have a smaller impact on the overall fit.
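In rpart this is controlled by the complexity parameter cp; a minimal sketch with illustrative cp values (not the post's exact settings):

# Lower cp values let rpart keep splits that improve the fit by less,
# giving progressively bushier trees
tree2 <- rpart(FiveHundredPlus ~ Property_Type + Old_New + Duration + County,
               data = train, method = "class",
               control = rpart.control(cp = 0.005))
tree5 <- rpart(FiveHundredPlus ~ Property_Type + Old_New + Duration + County,
               data = train, method = "class",
               control = rpart.control(cp = 0.0005))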

I’m not going to try to explain what is going on in these trees in words. The question is, which is the most accurate at predicting on the holdout sample?

Model              Accuracy
Decision Tree 1    0.9201290
Decision Tree 2    0.9202014
Decision Tree 3    0.9202771
Decision Tree 4    0.9204318
Decision Tree 5    0.9204054
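A minimal sketch of how each accuracy can be computed on the holdout sample, assuming a fitted tree and a test dataframe as before:

# Confusion matrix of actual versus predicted classes on the holdout sample
pred <- predict(tree1, newdata = test, type = "class")
conf <- table(test$FiveHundredPlus, pred)

# Accuracy is the proportion of correct predictions (the diagonal)
sum(diag(conf)) / sum(conf)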

Adding the extra complexity didn't really achieve anything, so in my opinion the best model is the first, simplest tree.

More mapping data in Excel (7 Oct 2017)

Quite a long time ago now I wrote a post on my map of UK postcode towns in Microsoft Excel, based on various posts I had seen on clearlyandsimply.com. It turns out I do actually use this map quite a lot, as it is very quick to use and doesn't require any extra specialist software. So I thought I would revisit this.

First things first, I thought I'd go back to the source of my original map to see if there were any new posts, and there were. There is a new post on mapping US counties, but instead of just one big map there is now a main US map and a more detailed map of a particular county. This is pretty cool, and while I'd like to do it for postcode districts or even sectors, that would be too time consuming.

This brings me onto my next point. While these maps are handy for quickly mapping simple data, at some point we need to stop using Excel and start using some proper GIS software. But let's not do that yet. What are the ways of using Excel?

One option is based on ArcGIS. This looks very good, but you'd expect that from a professional piece of software rather than something you found on the internet. The biggest issue here is that you have to buy ArcGIS, which I haven't done.

Another appears in a dashboard I found; I don't really like the way it is used there, but you might find a use for it.

From a fairly extensive Google search, it seems that most posts on mapping in Excel are derived from the Clearly and Simply posts that I originally based my post on. Chandoo might have another method, but I can't find one.

Please let me know if you know of other methods.

Importing data into R (18 Jan 2017)

Recently I've been working through the various courses on DataCamp. It soon became apparent that this was something I should have done long ago: it would have saved me a lot of time reading through various blogs and Stack Overflow questions.

The most recent course I've worked on is on importing data. It starts off using the standard functions, e.g. read.csv, but it then moves on to what they claim to be better and quicker functions from the packages readr and data.table. While I do believe the course, I wanted to find out which was quicker for myself. So, using the data from the UK land registry, I set up this little test.
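A minimal sketch of the test, with the file name as an assumption:

library(readr)
library(data.table)

file <- "pp-monthly.csv"  # a land registry price paid extract (assumed name)

system.time(df1 <- read.csv(file))   # base R
system.time(df2 <- read_csv(file))   # readr
system.time(dt1 <- fread(file))      # data.table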

I ran this on my not very good home laptop and on a virtual machine on AWS, with the results below:

My laptop

AWS

Wow, my computer is slow compared to the Amazon one. I checked, and the data was loaded correctly in the Amazon case.

From this not very scientific test we can see that on my laptop the fread and read_csv functions are both much quicker than read.csv, and that fread is slightly quicker in this case, but there isn't much in it. This is reflected on the AWS machine, so from now on I'll use one of fread or read_csv. Thanks DataCamp.

On a different note, it turns out that, like physical computers, virtual machines need to be rebooted every so often, otherwise they crash.

Land Registry Motion Chart (25 Oct 2016)

Following on from the previous Land Registry posts, I've had a go at using the googleVis package to plot the data. So far my favourite method is the motion chart below, although I need to make the sales areas more equal in size, as London ruins the x-axis a bit. The best view is the histogram view.
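A minimal sketch of building such a chart, where the summary dataframe and its column names are assumptions:

library(googleVis)

# One bubble per area, animated over time
chart <- gvisMotionChart(area_summary,
                         idvar = "Area",    # bubble identifier
                         timevar = "Month") # the animation axis
plot(chart)  # opens the interactive chart in a browser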

Hopefully I'll write a more comprehensive post on all of this at some point soon.

Land Registry Part 2 (20 Oct 2016)

In my first land registry post I imported a month's worth of land registry data, named the columns and had a go at using the ggplot2 package to produce a number of nice looking charts. This time I want to progress a little further. My aims, using the same dataset, are to:

Look at the distribution of prices

Look at the prices by different factors

Initially just using factors in the land registry data

The first aim is easy; plotting a histogram only requires the code

# Look at the distribution of prices
hist(landregistry$Price)

Which produces the chart

The problem is that this data contains sales from different times, not just 2016 sales. I only want to see the sales from 2016, so I need to first filter the data and then produce a histogram of just those sales. This is done using the code below.
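A minimal sketch, assuming the dates were parsed into a Date column:

# Keep only the 2016 sales, then redraw the histogram
sales2016 <- landregistry[format(landregistry$Date, "%Y") == "2016", ]
hist(sales2016$Price)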

In the end it looks very similar to the unfiltered dataframe, but it was the right thing to do either way.

Now to move onto the house prices by the different factors. In the data there are a number of different factors which could be worth looking at:

Property type

Old/new

Duration

County

Date of transfer, I’ll just look at the month

There are details on these variables here. I could also look at factors based upon the postcode, but I don't feel like looking at them at the moment. Throughout this section I'll be using the 2016-only data.
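A minimal sketch of the kind of chart used here, assuming column names like County, Property_Type and Old_New:

# Price split by county; a log scale stops London squashing everything else
boxplot(Price ~ County, data = sales2016, log = "y", las = 2)

# The same idea for the other factors
boxplot(Price ~ Property_Type, data = sales2016, log = "y")
boxplot(Price ~ Old_New, data = sales2016, log = "y")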

This is a bit better, but not much. Using the same process for property type and old/new we get:

These show, unsurprisingly, that detached houses are the most expensive, with flats, terraced houses and semi-detached houses being cheaper. It is interesting that on this view new builds are more expensive, but I think there is more to it than that. I haven't looked at the other variables I said I would because I have run out of time.

Hopefully I'll have time to look into this data more and post what I find. I'll try to post my R code from what I do here.

Land Registry Data (13 Oct 2016)

My computer has been struggling with some of the code I've been trying to run; it is pretty old and doesn't have enough memory for large datasets in R. So rather than buy a better laptop, I've set up an Amazon Web Services account and, using this guide, set up a machine so I don't have to use mine. I'm only using the free tier for now, but if I want to have a go at processing something larger, this will allow me to pay a small fee to use a more powerful machine for a short period of time.

One thing this does mean is that I can't just download my data to my computer and reference it from a file location, but that isn't exactly an issue in R.

So, on to my first project using this cloud-based R. I'm going to have a look at the different analysis tools in R using the UK land registry data from data.gov.uk. In this post I'll be playing with just one month's worth of data and using the package ggplot2.

ggplot2 is easy to use and produces very nice looking charts with almost no effort at all. After loading the data into a dataframe in R and doing a little processing around the dates, only a couple of lines of code are needed.
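A minimal sketch, assuming the month's data is in a dataframe landregistry with Price and property type columns (the names are assumptions):

library(ggplot2)

# Price by property type for the month, on a log scale
ggplot(landregistry, aes(x = Type, y = Price)) +
  geom_boxplot() +
  scale_y_log10()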

I haven’t had much time to play with all of this but what I have done so far I’ve found interesting. Hopefully I’ll be able to post more soon. The code I’ve used to get this far is saved here.

Plotting shapefiles in R (2 May 2016)

After my last post on the ONS data structure, this post is the first of a few on taking that structure and some other public data, mostly UK government data, and mapping it using R. This first post is about getting shapefiles from various locations, loading them into R and plotting them.

First I need some shapefiles of the UK. I've found UK shapefiles for counties and countries, the ONS LSOA boundaries, and postcode boundaries at sector, district and area level, from a few different sources.

Note that you will need to read each source's licensing policies to see how their data can be used. UK government data can usually be used privately and commercially, but the other sources may not be so generous.

The code I'm using throughout this project is saved on my GitHub account here. The different data sources I've got my data from generally use one of two map projections:

OSGB36 – the Ordnance Survey's national grid. This uses eastings and northings, giving the number of metres north and east of a point just off the south west of Cornwall.

WGS84 – latitude and longitude, as used by GPS and by Google Maps.

The general process for loading in the shapefiles is to first use the function readOGR from the package rgdal, then use the function spTransform, again from rgdal, to convert the coordinates into the correct coordinate system. You also need the package sp loaded: its CRS function specifies a projection, and proj4string tells you which projection the current SpatialPolygonsDataFrame uses.
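A minimal sketch for one shapefile, with the folder and layer names as assumptions:

library(rgdal)  # loads sp as well

# Read the postcode area shapefile
Post.Area.SP <- readOGR(dsn = "shapefiles", layer = "PostalArea")

# Check the current projection, then convert OSGB36 eastings/northings
# into WGS84 latitude/longitude
proj4string(Post.Area.SP)
Post.Area.SP <- spTransform(Post.Area.SP, CRS("+init=epsg:4326"))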

The code also tries to dissolve the postcode areas away to give a map of the UK, using the functions spChFIDs (sp) and gUnaryUnion (rgeos), but it doesn't seem to work that well. That is also why the code loads the various ONS lookup files and doesn't use them: I just can't get the dissolve to work to get the boundaries for MSOAs and LADs.

This process is repeated for each of the shapefiles.

Now that they have all been loaded, we can see what they look like using progressively more complicated code. All of the examples below use the postcode area file, since this contains a nice number of areas to plot. It also makes a nice comparison with what I did with choropleth maps in Excel, and shows how much simpler this process is!

At the moment this just gives the areas random colours (I think), so how do we get some real data on here? Here I merge on some real data (saved here) and plot the population of each area in England and Wales.
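A minimal sketch of both steps, assuming a population dataframe pop with Area and Population columns matching the shapefile's area names:

library(sp)

# With no data attached, plot() just draws the boundaries
plot(Post.Area.SP)

# Attach a population figure to each postcode area, matching on the area name
Post.Area.SP$Population <- pop$Population[match(Post.Area.SP$Area, pop$Area)]

# Colour each area by its population
spplot(Post.Area.SP, "Population")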

If you output the HTML file you can open it even if you aren't connected to the internet; your browser just doesn't load the map underneath. The HTML file is here: GooglePostArea.

Now the problem here is that the above map is over 20MB, which is a bit too big for such a simple map. This is because the shapefile is very, very detailed, so for plotting purposes we can simplify the data using the function gSimplify (rgeos).

# Simplify the boundaries to a 500m tolerance, preserving the topology between areas
Post.Area.SP2 <- gSimplify(Post.Area.SP, 500, topologyPreserve = TRUE)

This creates this chart, GooglePostArea500m, which is what I'll use for plotting in future posts.

Now I just need to find some actual data to process.

ONS Data Structure (17 Apr 2016)

I have been looking through the ONS geographic data on their Geo Portal, and there are acronyms and variables everywhere, so I thought it best to understand what they all mean. Whenever I refer to output areas and super output areas, I mean the ones defined at the 2011 census in England and Wales.

UDPRN

Unique Delivery Point Reference Number. A unique eight-digit code representing each postal address in the Royal Mail PAF database.

Postcode

(I know what this is, but for completeness:) an alphanumeric value representing a number of addresses (UDPRNs) within an area. A full postcode is known as the “unit postcode”.

Postcodes are split into two parts, the outer and inner codes, separated by a space.

The outer code is the bit before the space. The letter(s) at the start of the outer code are called the postcode area (see mapping postcodes in Excel). The full outer code is also known as the district.

The district plus the numeric part of the inner code is called the sector. For example, in SW1A 1AA the area is SW, the district is SW1A, the sector is SW1A 1 and the unit postcode is the whole thing.

Output Area (OA)

A cluster of adjacent postcodes with a minimum (at the 2011 census) of 100 residents and 40 households. The aim was for them to contain around 125 households.

Lower Layer Super Output Area (LSOA)

A cluster of adjacent output areas.

Middle Layer Super Output Area (MSOA)

A cluster of adjacent LSOAs. The table below compares the two:

             LSOA                             MSOA
Population   1,000-1,200                      5,000-15,000
Households   400-1,200                        2,000-6,000
Count        32,844 (England), 1,909 (Wales)  6,791 (England), 410 (Wales)

Local Authority District (LAD)

The LSOAs and MSOAs line up with the 2011 local authority boundaries in England and Wales.

Regions

The highest level of area below country. Includes “South East”, “North West”, etc.

The ONS Geoportal contains mappings between the above, and more on top of that. The main postcode lookup file on the portal has all of these classifications, plus many more, for each postcode, so you can merge this data on easily.

Changing a colour scheme and extracting data from the internet (28 Mar 2016)

Sometimes I have a workbook containing lots of worksheets in similar formats, all using the same colour scheme. If after a while I decide I don't like the colours any more, it can be quite annoying to change them all. So I decided to write this short little macro to change the colours.

I didn't have a long workbook to hand, so I decided to create one using some of the Met Office's data. They provide historical data for a number of their weather stations in text files, all in similar formats. This workbook opens each of the web links, copies the data into a worksheet and puts it in a nice(ish) standardised format.