Data Analysis

As a DBA I am learning an enormous amount about managing data – which is an important step in my big plan… to become a data scientist. Right now it is about finding the feathers to fit the bow; it’s going to take a lot of hunting and trial-and-error.

This page is full of tidbits of information, projects, experiments and just general ‘play’ in the areas of data science and data analysis. There will be posts on programming, posts on data management, posts about me learning R, the projects I am currently working on and all sorts of other things. Current projects / activities are centered around:

Interactive data analysis with .NET (C#) and the JavaScript library D3: This project is driven by a need to present information in a way that is familiar and visually appealing. Information must be readily accessible through appropriate visualisations. SQL Server, C#, JavaScript and CSS will be used to deliver these goals.

Data Analysis with the R Programming language: You hear a lot about the increasing popularity of Python in data science, and as it is one of my favourite languages I am biased towards Python. However, R is still incredibly popular, is a very mature language and seems to have a great community around it. My initial experience has been very positive.

Aggregate Data Models: Most of the interesting questions involve aggregating data in some form. But the cost of aggregating large datasets is huge. The traditional database solution is a data warehouse, where transactional data is transformed into aggregate reporting structures. Science (and increasingly, industry) makes use of distributed platforms to farm out the workload and return aggregated results (e.g. SETI@home, MapReduce, Hadoop, Google). And there has been a rise of non-relational data models through NoSQL databases (MongoDB, Cassandra, Redis, CouchDB, …). This is a massive area of interest to me.
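To make the "farm out the workload and return aggregated results" idea concrete, here is a minimal single-machine sketch of the MapReduce pattern in Python. The region names and values are made up for illustration, and `map_phase` / `reduce_phase` are hypothetical helper names, not part of Hadoop or any other framework. Each chunk is summarised independently, and the partial results are then merged.

```python
from collections import defaultdict
from functools import reduce

def map_phase(chunk):
    """Summarise one chunk as {key: (sum, count)} partial aggregates."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in chunk:
        partial[key][0] += value
        partial[key][1] += 1
    return {key: tuple(acc) for key, acc in partial.items()}

def reduce_phase(left, right):
    """Merge two partial aggregates by adding sums and counts per key."""
    merged = dict(left)
    for key, (total, count) in right.items():
        prev_total, prev_count = merged.get(key, (0.0, 0))
        merged[key] = (prev_total + total, prev_count + count)
    return merged

# Illustrative data: each inner list stands in for a chunk held by one worker.
chunks = [
    [("north", 10.0), ("south", 4.0)],
    [("north", 20.0), ("south", 6.0), ("east", 5.0)],
]

partials = [map_phase(chunk) for chunk in chunks]   # could run in parallel
totals = reduce(reduce_phase, partials, {})          # combine partial results
means = {key: total / count for key, (total, count) in totals.items()}
print(means)
```

Carrying (sum, count) pairs rather than per-chunk means is what lets the merge step stay correct when chunks have different sizes.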

If these things interest you, please comment on my blogs below or check out my repos on GitHub.

The days of single-device applications are long gone. The current generation of applications must support multiple devices, screen sizes, resolutions, operating systems… the list goes on. Designing responsive applications is no longer a professional courtesy – it is a must-have. With all the data entry methods complete, I will begin to design the layout and style of the application using responsive methods and techniques.

This weekend, while working on the design aspects of the data visualisation project, I stumbled across Flat UI. Flat UI is a minimalistic approach that is entirely… flat. That is, it is primarily 2-dimensional, using block colours and contrast to deliver a simple, striking palette to display your content. The contrast between the understated design and interactive data visualisations can only enhance the impact of the visualisations. All in all, I think Flat UI is the perfect platform for this project.

Like any multi-paradigm language, there are a number of options for looping in R. R’s declarative constructs provide a convenient mechanism for aggregating data. This Experimental-RLab aggregates the data before calculating the correlation between two variables.
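The aggregate-then-correlate idea can be sketched outside R as well. The Python below is only an illustration under assumed data: the rows, group labels and the helper names `aggregate_means` and `pearson` are all hypothetical, and it is not the code from the RLab itself. It averages two variables within each group, then computes the Pearson correlation between the group means.

```python
from collections import defaultdict
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def aggregate_means(rows):
    """Group (group, x, y) rows and return {group: (mean_x, mean_y)}."""
    groups = defaultdict(list)
    for group, x, y in rows:
        groups[group].append((x, y))
    return {
        group: (mean([x for x, _ in pts]), mean([y for _, y in pts]))
        for group, pts in groups.items()
    }

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up observations: (group, x, y)
rows = [
    ("a", 1.0, 2.0), ("a", 3.0, 4.0),
    ("b", 5.0, 9.0), ("b", 7.0, 11.0),
    ("c", 9.0, 16.0), ("c", 11.0, 20.0),
]

group_means = aggregate_means(rows)
xs = [mx for mx, _ in group_means.values()]
ys = [my for _, my in group_means.values()]
r = pearson(xs, ys)
print(group_means, r)
```

Note that a correlation between group means describes the groups, not the individual observations, so it can look stronger than the row-level correlation.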

Sorting data is one of the fundamental techniques of any data analysis. In R, sorting is all done in memory with a variety of functions (e.g. `sort`, `order`). SQL-like queries are also possible through the `sqldf` library.
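The distinction between R's `sort` (which returns the sorted values) and `order` (which returns the positions that would sort them) is easy to miss. As a rough analogue only, here is the same distinction in Python with made-up names and scores. One difference to keep in mind: Python positions are 0-based, whereas R's `order` returns 1-based positions.

```python
names = ["ana", "ben", "cam", "dee"]
scores = [72, 95, 58, 88]

# Analogue of sort(): the values themselves, in ascending order.
sorted_scores = sorted(scores)

# Analogue of order(): the positions that would put scores in ascending order.
ordering = sorted(range(len(scores)), key=lambda i: scores[i])

# The positions can then reorder any parallel column, which is why
# order-style indexing is what you use to sort a whole table by one column.
names_by_score = [names[i] for i in ordering]

print(sorted_scores)
print(ordering)
print(names_by_score)
```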

The University of Auckland maintains an outstanding blog for statisticians, StatsChat, and there has been a lot of discussion around this letter, most of it centered on the validity of the statistics and the choice of phrasing. Thankfully, there have been more balanced views on this as well. Do we sometimes forget that the primary role of statistics is to help us convey a story?

7 PM Saturday night rolled around and we tuned the television in to the live election broadcasts. The polls were closed and everyone was expecting a closely fought battle between a hopeful Labour party and a beleaguered National party. Negative publicity around the DotCom scandal and the shadow of alleged dirty politics made it seem that if National were to cross the line, it would do so with a severe limp. But despite all the hoo-ha, I didn’t wait up for the live results. Instead, I called it an early night and woke up early to start digging into the voting data!

Accurate and efficient prediction of heat stress indicators, such as the wet-bulb globe temperature (WBGT), is important for the modeling of climate change. There are no accurate numerical solutions for the calculation of WBGT where Tw is unknown, which has led to the development of iterative algorithms. However, iterative algorithms are expensive when applied to large databases of real-world climate data. To improve the accuracy and efficiency of calculating WBGT, we propose a method for the relaxation of Tw that uses the well-known and widely used gradient descent algorithm. We have shown excellent agreement between the original algorithm and the gradient descent algorithm, to within 0.18 ± 0.125 degrees Celsius (95% confidence interval). Overall, it appears that the gradient descent algorithm is able to accurately and efficiently approximate WBGT. Also, as gradient descent is widely used in learning algorithms, such as linear regression and logistic regression, the algorithm should be relatively easy to implement and share across research groups.
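For readers unfamiliar with the technique, the general shape of relaxing an unknown with gradient descent can be sketched in a few lines of Python. This is not the WBGT method itself: `toy_residual` is a hypothetical stand-in for a balance equation that should equal zero at the right temperature, and the learning rate and tolerance are illustrative assumptions. The sketch minimises the squared residual J(t) = ½·r(t)², whose gradient is r(t)·r′(t).

```python
import math

def toy_residual(t):
    # Hypothetical stand-in for a balance equation; NOT the WBGT physics.
    return t + 0.5 * math.sin(t) - 3.0

def toy_residual_slope(t):
    # Analytic derivative of toy_residual with respect to t.
    return 1.0 + 0.5 * math.cos(t)

def relax_by_gradient_descent(t0, learning_rate=0.5, tol=1e-8, max_iter=1000):
    """Step t downhill on J(t) = 0.5 * r(t)**2 until |r(t)| < tol."""
    t = t0
    for step in range(max_iter):
        r = toy_residual(t)
        if abs(r) < tol:
            return t, step
        # Gradient of J is r(t) * r'(t); move against it.
        t -= learning_rate * r * toy_residual_slope(t)
    return t, max_iter

t_star, steps = relax_by_gradient_descent(0.0)
print(t_star, steps, toy_residual(t_star))
```

The learning rate has to suit the function being relaxed: too large and the iteration can oscillate or diverge, too small and it converges slowly, so a real implementation would need to tune it against the actual equations.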