
Month: November 2015

When you start learning programming, it is recommended to visit the sites of the language communities. “R” and “Python” have big communities, and these communities have been contributing to the progress of each language. This is good for all users. H2O.ai also held its annual community conference, “H2O WORLD 2015”, this month. Videos and presentation slides are now available on the internet. I could not attend the conference as it was held in Silicon Valley in the US, but I can follow and enjoy it just by going through the websites. I recommend having a quick look to understand how knowledge and experience can be shared at the conference. It is good for anyone who is interested in data analysis.

1. The user communities can accelerate the progress of open source languages

When I started learning “MATLAB®” in 2001, there were few user communities in Japan as far as I knew, so I had to attend paid seminars to learn the language, which were not cheap. Now, however, most user communities are available without any fee, and they have been getting bigger and bigger. One of the main reasons is that the number of “open source languages” has been increasing. “R” and “Python” are open source languages, too. It means that when someone wants to try a certain language, all they have to do is download and use it. Therefore, the number of users can increase at an astonishing pace. On the other hand, if someone wants to try “proprietary software” such as MATLAB, they must buy a license before using it. I loved MATLAB for many years and recommended it to my friends, but unfortunately no one uses it privately, because it is difficult to pay the license fee out of one's own pocket. I imagine that most users of proprietary software are in organizations such as companies and universities. In that case, the organizations pay the license fees, so individuals have no freedom to choose the languages they want to use. Generally, it is difficult to switch from one language to another when proprietary software is used. This is called “vendor lock-in”. Open source languages can avoid that, and this is one of the reasons why I love them now. The more people can use a language, the more progress can be achieved. New technologies such as “machine learning” can be developed through user communities because more users will join going forward.

2. The real industry experiences can be shared in communities

This is the most exciting part of a community. As a lot of data scientists and engineers from industry join communities, their knowledge and experience are shared frequently. It is difficult to find this kind of information anywhere else. The theory of algorithms and programming methods can be found in university courses on MOOCs, but there is little real-time information about industry experience there. In H2O WORLD 2015, for example, there were sessions with many professionals and CEOs from industry, who shared their knowledge and experience. It is a treasure not only for experts in data analysis, but also for business personnel who are interested in it. I would like to share my own experience in user communities in the future.

3. Big companies are supporting user communities

Recently, major IT companies have noticed the importance of user communities and are trying to support them. For example, Microsoft supports the “R Consortium” as a platinum member, and Google and Facebook support the communities around their open source projects, such as “TensorFlow” and “Torch”. New things are likely to happen and be developed among users outside the companies, so supporting user communities is also beneficial to the big IT companies themselves. Many other IT companies are supporting communities, too; you can find many of their names among the sponsors of the big user-community conferences.

Note: Toshifumi Kuga’s opinions and analyses are personal views and are intended to be for informational purposes and general interest only and should not be construed as individual investment advice or solicitation to buy, sell or hold any security or to adopt any investment strategy. The information in this article is rendered as at publication date and may change without notice and it is not intended as a complete analysis of every material fact regarding any country, region market or investment.

Data from third-party sources may have been used in the preparation of this material, and I, the author of the article, have not independently verified or validated such data. I and TOSHI STATS.SDN.BHD. accept no liability whatsoever for any loss arising from the use of this information, and reliance upon the comments, opinions and analyses in the material is at the sole discretion of the user.

When I learned data analysis a long time ago, the number of samples in a dataset was between 100 and 1,000, because teachers had to explain the data in detail. There were also only a few parameters to calculate. Therefore, most statistical tools could handle these data within a reasonable time; even spreadsheets worked well. Now, however, there are huge volumes of data, and more than 1,000 or even 10,000 parameters to calculate. We have a problem: it takes too long to complete the analysis and obtain the results. This is the problem of the age of big data.

This is one of the biggest reasons why a new generation of machine learning tools and languages has appeared in the market. Torch was open-sourced by Facebook in January 2015, H2O 3.0 was released as open source in May 2015, and TensorFlow was released as open source by Google this month. Each of them describes itself as “very fast”.

Let us consider each of these latest languages. I think each of them puts importance on the speed of calculation. Torch uses LuaJIT+C, H2O uses Java behind the scenes, and TensorFlow uses C++. LuaJIT, Java and C++ are usually much faster than script languages such as Python or R. Therefore, the new generation of languages should be faster when big data is analyzed.

Last week, I mentioned deep learning with R+H2O. Now let me check how fast H2O runs models to complete an analysis. This time, I use H2O FLOW, an awesome GUI, shown below. The deep learning model runs on my MacBook Air 11 (1.4 GHz Intel Core i5, 4GB memory, 121GB HD) as usual. A summary of the data used is as follows.

Then I create a deep learning model with three hidden layers of 1024, 1024 and 2048 units respectively. You can see it in the red box here. It is a relatively complex model, as it has three layers.

It took just 20 minutes to complete. It is amazing! It is very fast, despite the fact that deep learning requires many calculations to develop the model. If deep learning models can be developed within 30 minutes, we can try many models with different parameter settings to understand what the data mean and obtain insight from them.

I did not stop the model before it had fitted the data. These confusion matrices tell us the error rate is 2.04% for the training data (red box) and 3.19% for the test data (blue box). It looks good in terms of data fitting. It means that 20 minutes is enough to create good models in this case.
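To make the arithmetic behind those percentages concrete, an error rate can be read off a confusion matrix directly: the off-diagonal counts are the wrong predictions, and dividing them by the total number of samples gives the rate. This is a minimal sketch with a made-up 3×3 matrix, not the actual H2O output:

```python
# Minimal sketch: reading an error rate off a confusion matrix.
# Rows are true classes, columns are predicted classes.
# The counts below are made up for illustration, not real H2O results.
confusion = [
    [95,  3,  2],   # true class 0: 95 correct, 5 misclassified
    [ 4, 90,  6],   # true class 1: 90 correct, 10 misclassified
    [ 1,  2, 97],   # true class 2: 97 correct, 3 misclassified
]

total = sum(sum(row) for row in confusion)                      # 300 samples
correct = sum(confusion[i][i] for i in range(len(confusion)))   # diagonal = 282
error_rate = (total - correct) / total                          # 18 / 300

print(f"error rate: {error_rate:.2%}")  # error rate: 6.00%
```

H2O FLOW reports exactly this kind of quantity for the 10 digit classes, once for the training data and once for the test data.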

Now it is almost impossible to understand data just by looking at them carefully, because they are too big to inspect with our eyes. However, through analytic models, we can understand what the data mean. The faster analyses can be completed, the more insight can be obtained from the data. It is wonderful for all of us. Yes, we can have enough time to relax and enjoy coffee and cakes after our analyses are completed!


Last Sunday, I tried “deep learning” in H2O because I need this method of analysis in many cases. H2O can be called from R, so it is easy to integrate H2O into R. The results are completely beyond my expectations. Let me look at them in detail now!

1. Data

The data used in the analysis is “The MNIST database of handwritten digits”. It is well known among data scientists because it is frequently used to validate the performance of statistical models. The handwritten digits look like this (1).

Each row of the data contains the 28^2 = 784 raw grayscale pixel values, from 0 to 255, of the digitized digits (0 to 9). The original MNIST data set is as follows.

Training set: 60,000 examples

Test set: 10,000 examples

Number of features: 784 (28 × 28 pixels)

The data used in this analysis can be obtained from the website (training set of 19,000 examples, test set of 10,000 examples).
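As a side note, each 784-value row can be reshaped into a 28 × 28 image, and the raw 0–255 grayscale values are often rescaled to [0, 1] before training (H2O can standardize inputs automatically, so this is only an illustration of the data layout). The row below is synthetic, not real MNIST data:

```python
# Sketch: one MNIST-style row is a label plus 784 grayscale values (0-255).
# The values here are synthetic, just to show the 28 x 28 layout.
label = 5
pixels = [0] * 784           # a blank 28 x 28 image, all black
pixels[14 * 28 + 14] = 255   # one bright pixel near the centre

# Scale raw values from [0, 255] down to [0, 1], a common preprocessing step.
scaled = [p / 255.0 for p in pixels]

rows, cols = 28, 28
assert len(pixels) == rows * cols == 784
print(max(scaled))  # 1.0
```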

2. Developing models

Statistical models learn from the training set and then predict what each digit is on the test set. The error rate is obtained as “number of wrong predictions / 10,000”. The world record is “0.83%” for models without convolutional layers, data augmentation (distortions) or unsupervised pre-training (2). It means that the model makes only 83 wrong predictions out of 10,000 samples.
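That definition can be written out directly: count the mismatches between predicted and true labels and divide by the number of test samples. The ten labels below are made up for illustration; the real test set has 10,000 samples.

```python
# Sketch: error rate = number of wrong predictions / number of test samples.
# These ten labels are made up; the real MNIST test set has 10,000 samples.
true_labels      = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
predicted_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 0]  # last digit predicted wrongly

errors = sum(t != p for t, p in zip(true_labels, predicted_labels))
error_rate = errors / len(true_labels)
print(f"{error_rate:.2%}")  # 1 error in 10 samples -> 10.00%

# At the world-record level, an error rate of 0.83% on 10,000 samples
# corresponds to 83 wrong predictions:
assert round(0.0083 * 10_000) == 83
```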

This is an image of RStudio, an IDE for R. I called H2O from R and wrote the code “h2o.deeplearning( )”. The details are shown in the blue box below. I developed a model with 2 hidden layers of 50 units each. The error rate is 15.29% (in the red box). The model needs more improvement.

Then I increased the number of layers and units. This time, I developed a model with 3 hidden layers of 1024, 1024 and 2048 units respectively. The error rate is 3.22%, much better than before (in the red box). It took about 23 minutes to complete, so there is no need to use more powerful machines or clusters so far (I use only my MacBook Air 11 in this analysis). I think I can improve the model further if I tune the parameters carefully.

Usually, deep learning programming is a little complicated. But H2O enables us to use deep learning without programming when the graphical user interface “H2O FLOW” is used. When you would like to use R, the command to call H2O's deep learning is similar to the commands for linear models (lm) or generalized linear models (glm) in R. Therefore, it is easy to use H2O with R.

This is my first deep learning with R+H2O. I found that it can be used in a wide variety of data analysis cases. When I am not satisfied with traditional methods, such as logistic regression, I can use deep learning without difficulty. Although it needs a little parameter tuning, such as choosing the number and size of layers, it can bring better results, as my experiment shows. I would like to try “R+H2O” in Kaggle competitions, where many experts compete for the best result in predictive analytics.

P.S.

The strongest competitor to H2O appeared on 9 November 2015. It is “TensorFlow” from Google. Next week, I will report on this open source software.


Last week I found an awesome tool for digital marketing as well as data analysis. It is called “H2O”. Although it is open source software, its performance is incredible and it is easy to use. I would like to introduce it to sales/marketing personnel who are interested in digital marketing.

“H2O is open-source software for big-data analysis. It is produced by the start-up H2O.ai (formerly 0xdata), which launched in 2011 in Silicon Valley. The speed and flexibility of H2O allow users to fit hundreds or thousands of potential models as part of discovering patterns in data. With H2O, users can throw models at data to find usable information, allowing H2O to discover patterns. Using H2O, Cisco estimates each month 20 thousand models of its customers’ propensities to buy, while Google fits different models for each client according to the time of day,” according to Wikipedia (1).

Its performance looks very good, and yet it is open source software. It means that everyone can use this awesome tool without any fee. It is incredible! “H2O” was awarded one of the “Bossie Awards 2015: The best open source big data tools” (2). This image shows the H2O user interface, “H2O FLOW”.

By using this interface, you can use state-of-the-art algorithms such as “deep learning” without programming. This is very important for beginners in data analysis, because they can start analyzing data without programming. Dr. Arno Candel, Physicist & Hacker at H2O.ai, said, “And the best thing is that the user doesn’t need to know anything about Neural Networks” (3). Once models are developed through this user interface, a “Java” program of the model is automatically generated, so it can be used in production systems with ease.

One of the advantages of open source is that many user cases are publicly available. Open source is public by nature, so users’ experiences of “what is good” and “what is bad” are easy to share. This image is a collection of tutorials, “H2O University”. It is also available for free. There are many other presentations and videos about H2O on the internet, too! You may find cases from your own industry among them. Therefore, there are a lot of materials for learning H2O by ourselves.

In addition to that, “H2O” can be used as an extension of “R”. R is one of the most widely used analytical languages, and “H2O” can be controlled from the R console easily. Therefore “H2O” can be integrated with R. “H2O” can also be used with Python.

There are so many other functionalities in H2O that I cannot write about everything here. I am sure it is an awesome tool for both business personnel and data scientists. I would like to start using “H2O” and publish my experiences with it going forward. Why don’t you join the “H2O community”?
