Blend it like a Bayesian!

Sunday, 9 August 2015

More and more real-world systems are relying on data science and analytical models to deliver sophisticated functionality or improved user experiences. For example, Microsoft combined the power of advanced predictive models and web services to develop the real-time voice translation feature in Skype. Facebook and Google continuously improve their deep learning models for better face recognition features in their photo service.

Some have characterised this trend as a shift from Software-as-a-Service (SaaS) to an era of Models-as-a-Service (MaaS). These models are often written in statistical programming languages (e.g., R, Python), which are especially well suited to analytical tasks.

With analytical models playing an increasingly important role in real-world systems, and with more models being developed in R and Python, we need powerful ways of turning these models into APIs for others to consume.

But how?

It is actually a lot easier than you might think. I wrote a step-by-step guide about deploying analytical models as REST APIs. This guide will walk you through how to set up your own MaaS WITHOUT a team of full-stack developers/engineers. All you need are the R/Python models you develop and a Domino Data Lab account. You can find the full article on ProgrammableWeb here.

When I created this blog back in 2013, my aim was simply to learn ggplot2. Thanks to the feedback and advice from the R community, I continued to learn new stuff and somehow found an opportunity to work for Domino and Virgin Media. I wouldn't say I have seen enough to make a fair comparison with other programming communities, but so far the support from the R community has, for me, been truly special! I believe blogging is one of the best ways to contribute, so I'd better get back into the writing habit! For the next post, I would like to talk about using R with other Microsoft tools (SQL Server, PowerPoint) in a commercial environment.

London Kaggle Meetup

I met Alex Glaser and Wojtek Kostelecki after my LondonR talk. They have already set up a meetup for Kagglers. We are working on a collaborative project to build / test / stack models on the Domino platform. For more information, join the meetup first. Let's Kaggle together!

Saturday, 8 August 2015

Before we get over-excited, let's have a look at the past. Here is a shiny app to visualise some stats from the last ten seasons. It is my first attempt at developing data-driven sports stories with R, Shiny and Shiny Dashboard. Those stories ain't new, but I had always wanted to visualise the data instead of just reading them from articles.

It includes the last-minute title decider in 2011-2012 season (I am a red devil. That. Still. Hurts.) ...

Monday, 22 September 2014

Following up on my previous posts about H2O Deep Learning (TTTAR1) and RUGSMAPS (TTTAR2), here is a quick update on two interesting things I have been working on: a Kaggle tutorial and a new RUGSMAPS app.

Short Tutorials based on a Kaggle Competition

As a sequel to TTTAR1, this blog post is a more in-depth machine learning article with starter code and short tutorials. The purpose is to get more people started with R, H2O Deep Learning and Domino Data Lab using a recent Kaggle competition as case study. The short tutorials should be generic enough for Kaggle competitions as well as any general data mining exercises. I hope you will find it useful if you are interested in machine learning stuff.

RUGSMAPS2: A Crowd-Sourcing Experiment

Shortly after RUGSMAPS went public, Ines Garmendia kindly pointed out that there were several mistakes in the app (thanks, Ines! … at the same time, doh!!!). For example, the two groups in Madrid are supposed to be much further apart than I thought. More importantly, they are subgroups of the main “Comunidad R Hispano” group. Without Ines' feedback, I would never have been able to notice that myself. I was only relying on the source data from the contest.

I know a lot more local knowledge is required to make RUGSMAPS a much better app for the R community. It is also necessary to streamline the updating process so that new groups can be added easily. Therefore, I am now proposing a crowd-sourcing experiment and am hoping that more RUGs organisers/members can contribute in future. My idea is a dynamic web app (RUGSMAPS2) that reads information directly from a live Google spreadsheet.

Let's start with my favourite LondonR and its sister group ManchesterR. I know a lot about them personally so I can provide information like their key sponsor (Mango Solutions), venues and websites. To make this information available to all other R users, I just need to update the Google spreadsheet and the new RUGSMAPS2 will automatically render maps with new data.

Adding venue, key sponsors, websites and other information.

LondonR and ManchesterR with additional information.

For the Comunidad R Hispano issues I mentioned above, you can see that Ines helped me to fill in some new information about the four subgroups:

So if you're interested in helping me out or know someone else who might be able to help, please spread the word and forward this Google spreadsheet. Oh, wait ... can EVERYONE edit that spreadsheet??? Yes. I understand there might be issues if everyone can edit the spreadsheet without my permission. That's why I am calling it an experiment! I would like to see whether I can turn this into a successful crowd-sourcing project (or, otherwise, organised chaos). The app won't go live until I am happy with the new information and features. You can check out the RUGSMAPS2 repository for all the latest development!

That's it for now. Take my word for it: give H2O and Domino a try and you won't regret it!

Tuesday, 26 August 2014

Thing To Try After useR! part 2 (TTTAR2)

Originally, this post was supposed to be a sequel to TTTAR1 about h2o machine learning. Since TTTAR1 I have been carrying out more h2o tests both locally and on the cloud with the very kind support of Nick Elprin from Domino. The more I find out about h2o and Domino, the more I get addicted. So my original plan was to write a new post about combining the best of h2o and Domino - building a scalable data analytic platform on the cloud with R as the main interface.

Well something happened unexpectedly so here is a change of plan – I am going to talk about Shiny web application and data visualisation again. I will get back to h2o machine learning and Domino later. In the meantime, check out Nick’s recent blog post about Domino which is directly relevant to what I am planning to blog next.

Beautiful Shiny Apps at useR!

I know how to use Shiny and Bootstrap. Yet, before useR!, I had never felt the need to combine them. Shiny already has a clean and tidy layout by default – so why bother changing it? But then I changed my mind completely after useR!. There were quite a few beautiful Shiny apps with custom CSS on display. For example, during the second poster session, I saw this amazing Shiny app created by Christian Gonzalez and Robert Youmans.

When I had the chance to play with the app on Christian’s laptop, that sleek and professional interface just completely blew my mind. It was at this moment I realised that I was wrong and it’s time to up my game.

I did, however, spend a lot more time on the layout design and user interface. Font sizes, colours, white space, choice of base map ... you name it. I also tried quite a few Bootstrap themes and finally settled on Spacelab. I am not the best person to describe the final design with correct art terminology. I just feel that this combination of blue, grey and white gives a clean layout (yet not too flashy). The RUGSMAPS app is currently hosted on ShinyApps (once again, thanks RStudio!!!)

Don't agree with my colour choices? No worries, fork the repository and modify the bootstrap.css in the www folder. More information can be found on the "About" page of the app so I am not going to repeat it here. Please try it out and tell me what you think about it.

What's Next?

As the RUGSMAPS app is my final submission to the contest, I am going to lock down the version so the app becomes a reference point version 1. Further improvements will be made independently (let’s just call it RUGSMAPS v2 for now). The idea is to collect more local information from the RUGs and display it accordingly on the maps. For example, in addition to the group name and city location, I can include more information such as usual meetup location, key sponsors, website, photos etc.

I have seen a cool example from Alex Bresler that shows images in the markers’ pop-up window. I am certain that we can display more information than just the group name and city location. BTW, Alex and his colleagues at Aragorn Technologies are doing some very cool interactive sports data visualisation with R for huge events (e.g. NBA, US Open) – do check them out!!!

Together We Can Make It Better!

If you have some local information about the RUGs and are interested in improving the RUGSMAPS for the community, please drop me an email (jofai.chow@gmail.com), comment on this post or create issues on the repository. We can make this a central info hub of all RUGs! Let’s do it!!!

Acknowledgement

First of all, I would like to thank Revolution Analytics (David and Joseph) for the award! I would also like to emphasise that the RUGSMAPS app did not just come out of my head from nowhere. It is the direct result of many cool ideas I learned and borrowed from the R community over the last year or so. To all my R friends I met on Twitter / GitHub / useR!, if you're reading this, you know I am talking about you. So thank you very much everyone!

The Rise of R in Hydroinformatics

(View from the Empire State Building Observatory - it was at this moment I realised that I should begin my career as Batman)

Just a bit off topic, I attended the Hydroinformatics conference in New York last week and I witnessed the rise of R in this field. Although MATLAB still seems to be the most common programming language (ever wonder why my twitter handle sounds a bit too odd?), there were more talks about using R for Hydroinformatics this time (compared to none at the same conference two years ago). People were presenting their R packages and discussing how they use R for interfacing as well as teaching. The RStudio IDE was on the big screen several times!

Friday, 25 July 2014

Annual R User Conference 2014

The useR! 2014 conference was a mind-blowing experience. Hundreds of R enthusiasts, the beautiful UCLA campus - I am really glad that I had the chance to attend! The only problem is that, after a few days of non-stop R talks, I was (and still am) completely overwhelmed by all the cool new packages and ideas.

What's H2O?

"The Open Source In-Memory, Prediction Engine for Big Data Science" - that's what 0xdata, the creator of H2O, says. Joseph Rickert's blog post is a very good introduction to H2O, so please read that if you want to find out more. I am going straight into the deep learning part.

Deep Learning in R

Deep learning tools in R are still relatively rare at the moment compared to other popular algorithms like Random Forest and Support Vector Machines. A nice article about deep learning can be found here.
Before discovering H2O, my deep learning coding experience was mostly in Matlab with the DeepLearnToolbox. Recently, I have started using 'deepnet', 'darch' as well as my own code for deep learning in R. I have even started developing a new package called 'deepr' to further streamline the procedures. Now that I have discovered the 'h2o' package, I may well shift the design focus of 'deepr' towards further integration with H2O instead!

But first, let's play with the 'h2o' package and get familiar with it.

The H2O Experiment

The main purpose of this experiment is to get myself familiar with the 'h2o' package. There are quite a few machine learning algorithms that come with H2O (such as Random Forest and GBM). But I am only interested in the Deep Learning part and the H2O cluster configuration right now. So the following experiment was set up to investigate:

How to set up and connect to a local H2O cluster from R.

How to train a deep neural networks model.

How to use the model for predictions.

Out-of-bag performance of non-regularized and regularized models.

How the memory usage varies over time.

Experiment 1:

For the first experiment, I used the Wisconsin Breast Cancer Database. It is a very small dataset (699 samples of 10 features and 1 label) so that I could carry out multiple runs to see the variation in prediction performance. The main purpose is to investigate the impact of model regularization by tuning the 'Dropout' parameter in the h2o.deeplearning(...) function (or basically the objectives 1 to 4 mentioned above).

Experiment 2:

The next thing to investigate is the memory usage (objective 5). For this purpose, I chose a bigger (but still small by today's standards) dataset, the MNIST Handwritten Digits Database (LeCun et al.). I would like to find out whether the memory usage can be capped at a defined allowance over a long period of model training.

Findings

OK, enough for the background and experiment setup. Instead of writing this blog post like a boring lab report, let's go through what I have found out so far. (If you want to find out more, all code is available here so you can modify it and try it out on your clusters.)

Setting Up and Connecting to an H2O Cluster

Smoooooth! - if I have to explain it in one word. 0xdata made this really easy for R users. Below is the code to start a local cluster with a 1GB or 2GB memory allowance. However, if you want to start the local cluster from the terminal (which is also useful if you want to see the messages during model training), you can run java -Xmx1g -jar h2o.jar (see the original H2O documentation here).
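A minimal sketch of the setup (hedged: argument names vary between h2o package versions - recent versions set the memory cap via max_mem_size, while early versions used Xmx):

```r
library(h2o)

# Start (or connect to) a local H2O cluster, capping the JVM memory at 1GB
localH2O <- h2o.init(ip = "localhost", port = 54321, max_mem_size = "1g")

# When finished:
# h2o.shutdown()
```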

By default, H2O starts a cluster using all available threads (8 in my case). The h2o.init(...) function has no argument for limiting the number of threads yet (well, sometimes you do want to leave one thread idle for other important tasks like Facebook). But it is not really a problem.

Loading Data

In order to train models with the H2O engine, I need to link the datasets to the H2O cluster first. There are many ways to do it. In this case, I linked a data frame (Breast Cancer) and imported CSVs (MNIST) using the following code.
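A hedged sketch of both routes, assuming a running local cluster. The mlbench package is one convenient source of the Wisconsin Breast Cancer data, and the MNIST file paths are purely illustrative:

```r
library(h2o)
library(mlbench)              # one convenient source of the Breast Cancer data
data(BreastCancer)

# Route 1: push an existing R data frame into the H2O cluster
bc_hex <- as.h2o(BreastCancer)

# Route 2: import CSV files (e.g. the MNIST train/test sets) directly
# train_hex <- h2o.importFile("mnist_train.csv")
# test_hex  <- h2o.importFile("mnist_test.csv")
```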

Training a Deep Neural Network Model

The syntax is very similar to other machine learning algorithms in R. The key difference is the inputs for x and y, which take column numbers as identifiers.
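A sketch of a regularised run on the Breast Cancer data (assuming bc_hex is an H2OFrame already in the cluster with the label in the last column; parameter names follow recent h2o versions and may differ in older ones):

```r
# Train a deep neural network with dropout regularisation
model <- h2o.deeplearning(
  x = 2:10,                          # feature columns, by number
  y = 11,                            # label column
  training_frame = bc_hex,           # an H2OFrame already in the cluster
  hidden = c(50, 50),                # two hidden layers of 50 neurons each
  epochs = 100,
  activation = "TanhWithDropout",    # dropout-enabled activation
  hidden_dropout_ratios = c(0.5, 0.5)
)
```

Setting activation to a plain "Tanh" (and dropping hidden_dropout_ratios) gives the non-regularized counterpart.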

Using the Model for Prediction

Again, the code should look very familiar to R users.

The h2o.predict(...) function will return the predicted label with the probabilities of all possible outcomes (or numeric outputs for regression problems) - very useful if you want to train more models and build an ensemble.
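A minimal sketch (assuming model is a trained H2O model and test_hex an H2OFrame of unseen samples):

```r
# Score the test set; the result is an H2OFrame with the predicted label
# plus one probability column per class
predictions <- h2o.predict(model, newdata = test_hex)

# Pull the predictions back into a plain R data frame for inspection
pred_df <- as.data.frame(predictions)
head(pred_df)
```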

Out-of-Bag Performance (Breast Cancer Dataset)

No surprise here. As I expected, the non-regularized model overfitted the training set and performed poorly on the test set. Also as expected, the regularized models gave consistent out-of-bag performance. Of course, more tests on different datasets are needed, but this is definitely a good start for using deep learning techniques in R!

Memory Usage (MNIST Dataset)

This is awesome and really encouraging! In near idle mode, my laptop uses about 1GB of memory (Ubuntu 14.04). During the MNIST model training, H2O successfully kept the memory usage below the capped 2GB allowance over time with all 8 threads working like a steam train! OK, this is based on just one simple test but I already feel comfortable and confident to move on and use H2O for much bigger datasets.

Conclusions

OK, let's start from the only negative point. The machine learning algorithms are limited to the ones that come with H2O. I cannot leverage the power of other available algorithms in R yet (correct me if I am wrong. I will be very happy to be proven wrong this time. Please leave a comment on this blog so everyone can see it). Therefore, in terms of model choices, it is not as handy as caret and subsemble.

Having said that, the included algorithms (Deep Neural Networks, Random Forest, GBM, K-Means, PCA etc) are solid for most of the common data mining tasks. Discovering and experimenting with the deep learning functions in H2O really made me happy. With the superb memory management and the full integration with multi-node big data platforms, I am sure this H2O engine will become more and more popular among data scientists. I am already thinking about the Parallella project but I will leave it until I finish my thesis.

I can now understand why John Chambers recommended H2O. It has already become one of my essential R tools for data mining. The deep learning algorithm in H2O is very interesting; I will continue to explore and experiment with the rest of the regularization parameters such as 'L1', 'L2' and 'Maxout'.

Code

Personal Highlight of useR! 2014

Just a bit more on useR! During the conference week, I met so many cool R people for the very first time. You can see some of the photos by searching #user2014 and my twitter handle together. Other blog posts about the conference can be found here, here, here, here, here and here. For me, the highlight has to be this text analysis by Ajay:

During the conference banquet, Jeremy Achin (from DataRobot) suggested that I might as well change my profile photo to a Python logo just to make it even more confusing! It was also very nice to speak to Matt Dowle in person and to learn about his amazing data.table journey from S to R. I have started updating some of my old code to use data.table for the heavy data wrangling tasks.

By the way, Jeremy and the DataRobot team (a dream team of top Kaggle data scientists including Xavier who gave a talk about "10 packages to Win Kaggle Competitions") showed me an amazing demo of their product. Do ask them for a beta account and see for yourself!!!

There are more cool things that I am trying at the moment. I will try to blog about them in the near future. If I have to name a few right now ... that will be:

Friday, 6 June 2014

Interactive Parallel Coordinates with Multiple Colours

For my research project, I need a tool to visualise results from multi-objective optimisations. Below is one of my early attempts using base R and parcoord from the MASS package; I have no problem using these for publication. However, these charts are all static. For a practical decision support tool (something I am working on), I need the charts to be interactive, so that users can adjust the range/thresholds of each parameter and narrow down the things to display in real time.

Many thanks to Ken (timelyportfolio) who kindly pointed me to his code examples. Based on that, I developed a prototype version of the interactive parallel coordinates plot with multiple colours (as shown above). OK, the values in the chart are totally unrelated to my research - I just used the 'Theoph' dataset in R for testing purposes. Yet, this is a much needed exercise to see if I can use rCharts parallel coordinates for my research. The answer, of course, is YES. It also works with my customised colour palette too (using Bart Simpson this time)!
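For comparison with the interactive version, the static base-R starting point can be sketched as follows (using parcoord from MASS on the same Theoph dataset; the column selection and colours are illustrative):

```r
library(MASS)

# Static parallel coordinates of the numeric Theoph columns,
# one line (and colour) per observation
theo_num <- Theoph[, c("Wt", "Dose", "Time", "conc")]
parcoord(theo_num, col = rainbow(nrow(theo_num)), lwd = 1)
```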

Showing your rCharts on bl.ocks.org

In the process of making this plot, I also discovered how to display rCharts (d3, html or practically any code) on Mike Bostock's site "bl.ocks.org". If you haven't seen his site, do check this out. It is one of the coolest things on earth.

I wanted to have a gallery like that too ... but I didn't know how. I used to think that Ramnath and Ken must have bought Mike a beer so that they could have their stuff hosted on bl.ocks.org (see bl.ocks.org/ramnathv and bl.ocks.org/timelyportfolio). I was very wrong: everyone with a GitHub account can do it. All you need is your imagination (and some gists). The site automatically pulls your gists and displays them as a beautiful blocks gallery.

In order to display your cool rCharts on bl.ocks.org, you can either:

publish the rCharts to gist using the '$publish' function (e.g. r1$publish('name.of.gist', host = 'gist') where r1 is the rCharts object)

save the rCharts as a stand-alone HTML (e.g. r1$save('index.html', cdn = TRUE)) and then include it in a gist.
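Both routes in one sketch (assuming r1 is an existing rCharts object; the gist name is illustrative):

```r
# Route 1: publish the chart directly to a new gist
r1$publish('my-parallel-coords', host = 'gist')

# Route 2: save a stand-alone HTML page (assets loaded from a CDN),
# then add the file to a gist manually
r1$save('index.html', cdn = TRUE)
```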

For optimal display, I would recommend setting your rCharts size to 960 x 500 (same as the display size on bl.ocks.org). You can also include a 'README.md' file and a 'thumbnail.png' to provide more information. I think the best resolution for the thumbnail is 230 x 120 (about the same aspect ratio as full display). You will need to manually push the png file (see this post for more details).

Latest on Colour Palette Generator

First, let me point you to Russell Dinnage's blog post. It is easily one of the finest R blog posts I've read so far. All these colours and graphs. Wow! It's yet another #RCanDoThat moment for me (so good it needs a hashtag).

Thanks to his effort and cool ideas, we continue to add more functions to the rPlotter package. It is also a great opportunity for us to better understand the GitHub pull/merge mechanism.

Credits

Again, I would like to thank Ken for his help (not only this time but many times before this on visualisation stuff) as well as Ramnath, Mike and Russell.

Tuesday, 27 May 2014

Why?

I love colours; I love using colours even more. Unfortunately, I have to admit that I don't understand colours well enough to use them properly. It is the same frustration I had about one year ago when I first realised that I couldn't plot anything better than the defaults in Excel and Matlab! It was for that very reason that I decided to find a solution and eventually learned R. I am still learning it today.

What's wrong with my previous attempts to use colours? Let's look at CrimeMap. The colour choices, when I first created the heatmaps, were entirely based on personal experience. In order to represent danger, I always think of yellow (warning) and red (something just got real). This combination eventually became the default settings.

"Does it mean the same thing when others look at it?"

This question has been bugging me since then. As a temporary solution for CrimeMap, I included controls for users to define their own colour scheme. Below are some examples of crime heatmaps that you can create with CrimeMap.

Personally, I really like this feature. I even marketed this as "highly flexible and customisable - colour it the way you like it!" ... I remember saying something like that during LondonR (and I will probably repeat this during useR later).

Then again, the more colours I can use, the more doubts I have about the default Yellow-Red colour scheme. What do others see in those colours? I need to improve on this! In reality, you have one chance, maybe just a few seconds, to deliver your key messages and get attention. You can't ask others to tweak the colours of your data visualisation until they get what it means.

Therefore, I knew another learning-by-doing journey was required to better understand the use of colours. Only this time, with about a year of R experience under my belt, I decided to capture all the references, thinking and code in one R package.

Existing Tools

Given my poor background in colours, a bit of research on what's available was needed. So far I have found the following. Please suggest other options you think I should be aware of (thanks!). I am sure this list will grow as I continue to explore.

Other Languages:

The Plan

"In order to learn something new, find an interesting problem and dive into it!" - This is roughly what Sebastian Thrun said during "Introduction to A.I.", the very first MOOC I participated in. It had a really deep impact on me and has been my motto ever since. Fun is key. This project is no exception, but I do intend to achieve a bit more this time. Algorithmically, the goal of this mini project can be represented as the code below:

> is.fun("my.colours") & is.informative("my.colours")
[1] TRUE

Seriously speaking, based on the tools and packages mentioned above, I would like to develop a new R package that performs the following tasks. Effectively, each should translate into a key function (plus a wrapper that goes through all the steps in one go).

Extracting colours from images (local or online).

Selecting (and, if needed, adjusting) colours with web design and colour blindness in mind.

Arranging colours based on colour theory.

Evaluating the aesthetic of a palette systematically (quantifying beauty).

I decided to start experimenting with colourful movie posters, especially those from Quentin Tarantino. I love his movies but I also understand that those movies might be offensive to some. That is not my intention here as I just want to bring out the colours. If these examples somehow offend you, please accept my apologies in advance.

First function - rPlotter :: extract_colours( )

The first step is to extract colours from an image. This function is based on dsparks' k-means palette gist. I modified it slightly to include the excellent EBImage package for easy image processing. For now, I am including this function in my rPlotter package (a package with functions that make plotting in R easier - still in early development).

Note that this is the very first step of the whole process. This function ONLY extracts colours and then returns the colours in simple alphabetical order (of the hex code). The following examples further illustrate why a simple extraction alone is not good enough.
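For the curious, the idea can be sketched in a few lines. This is a simplified, illustrative version of what extract_colours() does, not the actual rPlotter implementation (the function name and arguments here are my own), and it requires the EBImage package from Bioconductor:

```r
library(EBImage)

# Cluster the pixels of an image in RGB space with k-means and
# return the cluster centres as hex codes
extract_colours_sketch <- function(img_file, num_col = 5) {
  img <- readImage(img_file)                 # read a local (or online) image
  rgb_df <- data.frame(                      # one row per pixel
    red   = as.vector(imageData(img)[, , 1]),
    green = as.vector(imageData(img)[, , 2]),
    blue  = as.vector(imageData(img)[, , 3])
  )
  km <- kmeans(rgb_df, centers = num_col)    # cluster pixels in RGB space
  hex <- rgb(km$centers[, "red"],
             km$centers[, "green"],
             km$centers[, "blue"])
  sort(hex)                                  # simple alphabetical order
}
```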

Example One - R Logo

Let's start with the classic R logo.

So the three-colour palette looks OK. The colours are less distinctive when we have five colours. For the seven-colour palette, I cannot tell the difference between colours (3) and (5). This example shows that additional processing is needed to rearrange and adjust the colours, especially when you're trying to create a many-colour palette for proper web design and publication.

Example Two - Kill Bill

What does Quentin Tarantino see in Yellow and Red?

Actually the results are not too bad (at least I can tell the differences).

Example Three - Palette Tarantino

OK, how about a palette set based on some of his movies?

I know more work is needed but for now I am quite happy playing with this.

Example Four - Palette Simpsons

Don't ask why, ask why not ...

I am loving it!

Going Forward

So the above examples show my initial experiments with colours. It will be, to me, a very interesting and useful project in the long term. I look forward to making some sports-related data viz when the package reaches a stable version.

The next function in development will be "select_colours()". This will be based on further study on colour theory and other factors like colour blindness. I hope to develop a function that automatically picks the best possible combination of original colours (or adjusts them slightly only if necessary). Once developed, a blog post will follow. Please feel free to fork rPlotter and suggest new functions.

Acknowledgement

I would like to thank Karthik Ram for developing and sharing the wesanderson package in the first place. I asked him if I could add some more colours to it and he came back with some suggestions. The conversation was followed by some more interesting tweets from Russell Dinnage and Noam Ross. Thank you all!