Saturday, 19 March 2016

Using R to visualize data with Google Earth

Recently I had to analyse a file with 3.5 million rows of data. This becomes challenging because a file that size cannot be opened in Excel or even Notepad. The first challenge was to check that we had the right amount of data: we knew we should have 3,455,385 rows. Before writing some C# utilities or importing it into SQL, I decided to try R.

R is a statistics language and framework that is very popular in academia and over the last 7 years has seen increasingly widespread use. At Build last year it was announced that R would become part of SQL Server 2016, and R is also used in Machine Learning on Azure. So it looked worth a closer look.

So the first task was to determine the total number of rows. This can be done as below.

We load a data frame using read.csv and assign it to a variable frame2. This takes a couple of minutes. A data frame is like a table; internally it is a list of vectors, and each vector can have a different data type. The stringsAsFactors=FALSE argument stops the default behaviour of reading strings as a Factor (an enumeration in C#). The total number of rows can then be found using nrow.
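A minimal sketch of the steps above (the file name "data.csv" is a placeholder, not the actual file from the original post):

```r
# Load the CSV into a data frame. stringsAsFactors = FALSE keeps text
# columns as plain character vectors instead of converting them to factors.
# "data.csv" is a placeholder file name.
frame2 <- read.csv("data.csv", stringsAsFactors = FALSE)

# Total number of rows in the data frame
nrow(frame2)
```

For a file this size, nrow(frame2) is the quick way to confirm the expected row count of 3,455,385.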

To see the data we can use the head function to get the first 5 rows.

> head(frame2,n=5)

If frame2 were a SQL table, this would be select top 5 * from frame2.

In the case of 3.5 million rows the file was still small enough to fit into memory. Apparently there are ways of handling truly huge files, aka big data. In any case my data is small enough to be handled in memory, so I don't have to worry about that yet.

Now we would like to query this data to learn more about its quality.
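Base R already offers a few quick quality checks. A minimal sketch, assuming frame2 has been loaded as above (these are generic checks, not the specific queries from the original analysis):

```r
# Count missing (NA) values in each column
colSums(is.na(frame2))

# Count fully duplicated rows
sum(duplicated(frame2))

# Per-column summary statistics (min/max, quartiles, NA counts)
summary(frame2)
```

These give a fast first impression of which columns have gaps or suspicious values before writing more targeted queries.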