Data science, statistics or machine learning in broken English

Actually, I've known about MXnet for weeks as one of the most popular libraries / packages among Kagglers, but only recently I heard that most of the bugs have been fixed, and some friends told me the latest version looks stable, so I finally installed it.

Convolutional Neural Network (CNN)

I believe almost all readers of this blog already know Deep Learning and Convolutional Neural Networks (CNNs) well... so here I'll just give a brief overview.

CNNs are a variant of Deep Learning, well known for excellent performance in image recognition. In particular, after a CNN won ILSVRC 2012, CNNs have become more and more popular in image recognition. The most recent success of CNNs is probably AlphaGo, I believe.

Indeed, we already have a lot of implementations of CNNs as libraries / packages. For example, a traditional implementation in Theano requires high coding skill but is still popular and widely used. I think PyLearn2 is a little easier, although it seems to drive many users to despair :P) Torch and Caffe are also great implementations, and Chainer provides intuitive coding of CNNs*1. Then came TensorFlow and CNTK, distributed by global IT giants aiming at a global standard. Very recently, Keras was released as a wrapper for both Theano and TensorFlow.

I won't write anything about its theoretical basis or the details of the algorithm, because I'm neither an expert in machine learning nor a mathematician :P))) Please search the web for keywords such as "convolutional neural network" and you'll find plenty of useful pages, slides and textbooks! At least please make sure you understand the basic concept: "input layer --> (convolution layer x m --> pooling layer) x n --> fully connected layer x p --> output layer", illustrated below.

This architecture is very similar to, and inspired by, visual information processing in the visual cortex of the human brain*2. First, input signals are filtered based on orientation selectivity, feature extraction, etc.*3; second, invariance to parallel shifts is added; finally, all the processed features are integrated.
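The layer structure above can be written down quite directly with {mxnet}'s symbolic API. The following is a minimal LeNet-style sketch along the lines of the MXnet tutorial (the variable names and hyperparameters here are my own illustration, not necessarily what the tutorial uses):

```r
require(mxnet)

data <- mx.symbol.Variable("data")
# convolution + pooling block no. 1
conv1 <- mx.symbol.Convolution(data = data, kernel = c(5, 5), num_filter = 20)
tanh1 <- mx.symbol.Activation(data = conv1, act_type = "tanh")
pool1 <- mx.symbol.Pooling(data = tanh1, pool_type = "max",
                           kernel = c(2, 2), stride = c(2, 2))
# convolution + pooling block no. 2
conv2 <- mx.symbol.Convolution(data = pool1, kernel = c(5, 5), num_filter = 50)
tanh2 <- mx.symbol.Activation(data = conv2, act_type = "tanh")
pool2 <- mx.symbol.Pooling(data = tanh2, pool_type = "max",
                           kernel = c(2, 2), stride = c(2, 2))
# fully connected layers on top
flatten <- mx.symbol.Flatten(data = pool2)
fc1   <- mx.symbol.FullyConnected(data = flatten, num_hidden = 500)
tanh3 <- mx.symbol.Activation(data = fc1, act_type = "tanh")
fc2   <- mx.symbol.FullyConnected(data = tanh3, num_hidden = 10)
# softmax output layer for 10 digit classes
lenet <- mx.symbol.SoftmaxOutput(data = fc2)
```

You can see the "conv x m --> pooling" blocks repeated n times and the fully connected layers stacked p times, exactly as in the formula above.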

Classification of the short version of MNIST using the R package {mxnet} from MXnet

MXnet is a framework implementing Deep Learning with graph abstraction, very similar to TensorFlow. It's newer than other major Deep Learning libraries / packages, so it incorporates many useful ideas from the pioneers. For example, MXnet can distribute computations, can switch between CPU and GPU easily, provides pre-trained models for ImageNet, supports not only DNNs / CNNs but also LSTM-RNNs, and provides wrappers for Python, R, C++ and Julia, which are popular in the data science and machine learning communities. Even the documentation alone looks attractive!

In particular, as far as I know, there are almost no useful R packages implementing CNNs, so I think MXnet and its R package {mxnet} would be a great tool for R users. Just personally, Theano requires complicated coding and is a little annoying for me, PyLearn2 is not so easy for me either, and Caffe / Chainer practically require GPU instances, which I don't like, so I had only tried TensorFlow on my own AWS EC2 instance. MXnet is really great for me because it works on either CPU or GPU, even on a local laptop.

OK, let's try the R package {mxnet} following MXnet's tutorial to see how it works. For your information, my computing environment was as follows:

Installation

It's impressive that the installation uses drat rather than devtools. It looks like the latest technology of 2016, doesn't it?
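For reference, the installation went roughly like this, following the MXnet documentation at the time (the repository name "dmlc" is as documented then and may have changed since):

```r
# drat registers the DMLC package repository, after which
# {mxnet} installs like any ordinary CRAN package
install.packages("drat")
drat:::addRepo("dmlc")
install.packages("mxnet")
```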

Preparing datasets

As mentioned in the title of this post, let's use the short version of the MNIST handwritten digits dataset on my GitHub repository (5,000 rows for training and 1,000 for test). This dataset is not large, so I think no classifier can reach an accuracy of 0.98. Let's run the code below to transform the dataset into a format that {mxnet} can handle.
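The transformation is essentially the same as in the MXnet MNIST tutorial: scale the pixels, transpose, and reshape into 28x28x1 arrays for the convolutional layers. The file names below are placeholders for the CSVs in my repository, and I assume the label sits in the first column, as in the Kaggle MNIST format:

```r
train <- data.matrix(read.csv("short_mnist_train.csv", header = TRUE))
test  <- data.matrix(read.csv("short_mnist_test.csv",  header = TRUE))

# split labels from pixels, scale pixels to [0, 1], and transpose
# because {mxnet} expects one column per sample
train.y <- train[, 1]
train.x <- t(train[, -1] / 255)
test.y  <- test[, 1]
test.x  <- t(test[, -1] / 255)

# reshape into 28x28x1 arrays (width, height, channel, sample)
train.array <- train.x
dim(train.array) <- c(28, 28, 1, ncol(train.x))
test.array <- test.x
dim(test.array) <- c(28, 28, 1, ncol(test.x))
</imports>
```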

An accuracy of 0.976 was the performance of our simple CNN. In the tutorial, num.round is set to 1, but I changed it to 20*4. Computation time was 270 seconds, even shorter than the case with {h2o}. It's great! :O)
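The training and evaluation step looks roughly like this, assuming a LeNet-style symbol `lenet` and arrays prepared as in the tutorial; the learning rate, momentum and batch size here are illustrative values in the spirit of the tutorial, not necessarily the exact ones I used:

```r
mx.set.seed(0)
model <- mx.model.FeedForward.create(
  lenet, X = train.array, y = train.y,
  ctx = mx.cpu(),            # change to mx.gpu() and nothing else
  num.round = 20,            # the tutorial uses 1; I raised it to 20
  array.batch.size = 100,
  learning.rate = 0.05, momentum = 0.9,
  eval.metric = mx.metric.accuracy)

# predict() returns one column of class probabilities per test sample
preds <- predict(model, test.array)
pred.label <- max.col(t(preds)) - 1
mean(pred.label == test.y)   # accuracy on the 1,000 test rows
```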

As just a trial, 4 samples of the digit '9' in the test dataset that were incorrectly predicted as 4, 5 or 8 are visualized below.
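Visualizing an individual sample only needs base R. A small helper like the following works on the 28x28x1 arrays prepared above (`test.array` and the index of a misclassified sample are assumptions from my setup; the flip inside `image()` is just to draw the digit upright):

```r
# draw the idx-th test sample as a greyscale image
plot_digit <- function(idx) {
  m <- matrix(test.array[, , 1, idx], nrow = 28, ncol = 28)
  image(m[, 28:1], col = grey.colors(255), axes = FALSE)
}
```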

Hey, who could identify these? :P))) They are too ambiguous to be recognized correctly even by our brief CNN based on the {mxnet} tutorial. Of course, I know it's fun for Kagglers to recognize such weird digit samples in MNIST.

Comparison to other methods

An accuracy of 0.976 on the short version of MNIST is the best benchmark ever on this blog. Let's review the benchmarks given by other classifiers.

The accuracy was 0.962 with the best parameter tuning shown in a certain SlideShare deck by H2O... our brief CNN with {mxnet} overtook even that. This means that even a simple CNN is better than the best-tuned DNN. No wonder that these days almost everybody loves to use CNNs for image recognition.

My comments

The first point is its usability. Switching between CPU and GPU is very easy, and its coding is intuitive: you can compose almost any kind of deep net just by specifying parameters. The second point is its speed. {mxnet} is very fast, just like {xgboost} from the same DMLC. It's even faster than DNNs in {h2o}, which runs on the Java VM.

The more important point is that we can run CNNs in almost the same manner in both R and Python. This is a huge advantage for R / Python users, in particular people doing ad-hoc analysis; in many business missions, we have to build a prototype on a local machine, like "a Kaggle competition only by me", and if it's successful we then implement it in a product. In such cases, data scientists like me often use R first and Python second.

In my personal opinion, R is better than Python for building prototypes because manipulating variables is much easier in R than in other languages, but R is not good for implementation in products. Python is the other way around... so a lot of data scientists like me love both R and Python. Actually, in my previous job, I once built a prototype of a machine learning system in R and then implemented it in Python with XGBoost for a product. MXnet enables us to run almost the same procedure for CNNs in both languages. This is very helpful for all R users, I believe.

Of course, some problems still remain; in particular, parameter tuning of CNNs is an incredibly annoying job and still an open issue in machine learning research. As far as I've heard, there are some academic studies in which parameters are optimized through Bayesian sampling and/or Monte Carlo search... I can't imagine how long that would take on a local machine. Even with MXnet, we have to keep struggling with these remaining problems.

At any rate, I think MXnet can be a strong candidate among Deep Learning libraries, one that can compete with Chainer / TensorFlow. I hope there will be more interesting and useful Deep Learning libraries / packages in the future.

In a coming post, I'm planning to try other frameworks in MXnet such as LSTM-RNN, but that requires further understanding of the theoretical background on my side. I won't say when that post will be published :P)

Appendix 1

When the activation function was replaced with ReLU, performance improved.
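Swapping tanh for ReLU is a one-word change in each activation layer, e.g. for the first convolution block (with `conv1` as in the network definition above):

```r
# same layer as before, only act_type changes from "tanh" to "relu"
relu1 <- mx.symbol.Activation(data = conv1, act_type = "relu")
```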