Kaggle offers machine learning competitions and has polled its user base about the tools and programming languages that participants use in competitions. They posted the results in 2011 under the title Kagglers’ Favorite Tools (also see the forum discussion). The results suggested abundant use of R. They also showed good use of MATLAB and SAS, with much lower Python representation. I can attest that I prefer R over Python for competition work. It just feels as though it has more on offer in terms of data analysis and algorithm selection.

Ben comments that MATLAB/Octave is a good language for matrix operations and can be good when working with a well-defined feature matrix. Python is fragmented but comprehensive and can be very slow unless you drop into C. He prefers Python when not working with a well-defined feature matrix and uses Pandas and NLTK. On R, Ben comments that “As a general rule, if it’s found to be interesting for statisticians, it’s been implemented in R” (well said). He also complains about the R language itself being ugly and painful to work with. Finally, Ben comments that Julia doesn’t have much to offer in the way of libraries but is his new favorite language. He comments that it has the conciseness of languages like MATLAB and Python with the speed of C.

Anthony Goldbloom, the CEO of Kaggle, gave a presentation to the Bay Area R user group in 2011 on the popularity of R in Kaggle competitions, titled Predictive modeling competitions: making data science a sport (see the PowerPoint slides). The slides give more detail on the use of programming languages and suggest an Other category that is almost as large as the usage of R. It would be nice to have the raw data that was collected (why didn’t they release it to their own data community, seriously!?).

John Langford, on his blog Hunch, has an excellent article on the properties of a programming language to consider when working with machine learning algorithms, titled “Programming Languages for Machine Learning Implementations”. He divides the properties into concerns of speed and concerns of programmability (programming ease). He points to powerful industry-standard implementations of algorithms, all in C, and comments that he has not used R or MATLAB (the post was written 8 years ago). Take some time and read some of the comments by academics and industry specialists alike. This is a deep and nuanced problem that really comes down to the specifics of the problem you are solving and the environment in which you are solving it.

Machine Learning Languages

I think of programming languages in the context of the machine learning activities I want to perform.

MATLAB/Octave

I think MATLAB is excellent for representing and working with matrices. As such, I think it’s an excellent language or platform to use when climbing into the linear algebra of a given method. I think it’s suited to learning about algorithms both superficially the first time around and deeply when you are trying to figure something out or go deep into the method. For example, it’s popular in university courses for beginners, like Andrew Ng’s Coursera Machine Learning course.

R

R is a workhorse for statistical analysis and, by extension, machine learning. Much talk is given to the learning curve, but I didn’t really see the problem. It is the platform to use to understand and explore your data using statistical methods and graphs. It has an enormous number of machine learning algorithms, including advanced implementations written by the developers of the algorithms themselves.

Python

Python is a popular scientific language and a rising star for machine learning. I’d be surprised if it can take the data analysis mantle from R, but matrix handling in NumPy may challenge MATLAB, and communication tools like IPython are very attractive and a step into the future of reproducibility.

I think the SciPy stack for machine learning and data analysis can be used for one-off projects (like papers), and frameworks like scikit-learn are mature enough to be used in production systems.

Java-family/C-family

Implementing a system that uses machine learning is an engineering challenge like any other. You need good design and developed requirements. Machine learning is algorithms, not magic. When it comes to serious production implementations, you need a robust library or you customize an implementation of the algorithm for your needs.

There are robust libraries; for example, Java has Weka and Mahout. Also, note that the deeper implementations of core algorithms like regression (LIBLINEAR) and SVM (LIBSVM) are written in C and leveraged by Python and other toolkits. I think that if you are serious, you may prototype in R or Python, but you will implement in a heavier language for reasons such as execution speed and system reliability. For example, the backend of BigML is implemented in Clojure.

Other Concerns

Not a Programmer: If you are not a programmer (or not a confident programmer), I recommend playing with machine learning via a GUI like Weka.

One Language for Research and Ops: You may want to use the same language for prototyping and for production to reduce risk of not effectively transferring the results.

Pet Language: You may have a pet language or favorite language and want to stick to that. You can implement algorithms yourself or leverage libraries. Most languages have some form of machine learning package, however primitive.

The question of machine learning programming language is popular on blogs and question and answer sites. A few choice discussions include:

27 Responses to Best Programming Language for Machine Learning

I am admittedly new to ML but have recently had the opportunity to try it with R, python, and Matlab. You can divide up the problem into different parts. In all cases, it’s a good idea to go beyond the basic installation: for R, you want RStudio as an IDE; for python, IPython notebooks and several major libraries are a must; and Matlab is much nicer to work in than Octave.

1. Data input, output, preprocessing, and postprocessing: Python, hands down. It’s all fine and good if you are just dealing with CSVs but that is often not the case, so in the real world python is quite handy. Frankly, there are few languages better at this than python, and it is surely a big part of its popularity.

5. Exploration: R (with RStudio) or IPython are both very good. R is probably a bit better, since it handles matrices better. IPython makes it easy to record and rerun your efforts.

6. Teaching: Matlab/octave has the most concise expression of matrix operations, so for many algorithms it is the one of choice. I kind of wonder about tree structures though.

7. Sharing and dissemination: IPython notebooks are pretty nice and don’t require viewers to install anything. R vignettes are good if they have R and the proper libraries installed.

8. Performance: I can’t really say for sure, as I have not properly tested. Python is the only one of the three in which out-of-core or online processing is particularly natural to express, thanks to generators, as far as I can tell. There are many interesting code performance initiatives in place for Python. Other languages should obviously perform better (C, java; as noted Julia is particularly interesting).
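The point about generators and out-of-core processing can be illustrated with a minimal sketch. It assumes numeric CSV input and uses Welford's online algorithm to compute per-column mean and standard deviation in a single pass; the function names `stream_rows` and `running_mean_std` and the sample data are hypothetical, chosen only for illustration.

```python
import csv
import io
import math


def stream_rows(handle):
    """Yield one row of floats at a time, so the whole file never sits in memory."""
    for record in csv.reader(handle):
        if record:  # skip blank lines
            yield [float(value) for value in record]


def running_mean_std(rows):
    """Welford's online algorithm: per-column mean and sample std in one pass."""
    count = 0
    mean = []
    m2 = []  # running sum of squared deviations from the current mean
    for row in rows:
        if count == 0:
            mean = [0.0] * len(row)
            m2 = [0.0] * len(row)
        count += 1
        for j, x in enumerate(row):
            delta = x - mean[j]
            mean[j] += delta / count
            m2[j] += delta * (x - mean[j])
    if count < 2:
        raise ValueError("need at least two rows")
    std = [math.sqrt(s / (count - 1)) for s in m2]
    return count, mean, std


# Made-up data; in practice `handle` would be an open file too large for memory.
handle = io.StringIO("1.0,10.0\n2.0,20.0\n3.0,30.0\n")
count, mean, std = running_mean_std(stream_rows(handle))
```

Because `stream_rows` is a generator, only one row is held in memory at a time, and the same consumer works unchanged whether the rows come from a file, a socket, or another generator.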

Really great comments, thanks. R is my go-to platform when I’m looking to get the most out of a problem.

I’ve explored using theano with Python on GPUs and played a lot with various parallel packages on R to get speed-ups. In the end, I’ve found rolling my own implementation the best when speed is the highest priority.

Another language to consider is Lua, specifically the LuaJIT implementation with Torch7. This is what the Google and Facebook AI groups use, probably because they hired folks from Yann LeCun’s lab. Torch7 has been extended further with more ML stuff produced at Facebook, and they have made it available to the public. Probably check out stuff on why Lua/LuaJIT over Python, and LuaJIT’s interface with C code. Also, LuaJIT is used a lot by gamers, and I heard LuaJIT (or was it Lua) will replace ActionScript in Adobe’s products.

Sorry, no, not off the top of my head. I can say that there are great libs written in c like libsvm that are often used via wrappers in python or R. Learning the native lib in c might be a fun experience!

Hi Jason. I’m new to machine learning. I’ve gone through the AI online course from Berkeley and plan to go through Yaser Abu-Mostafa’s “Learning from Data”. It is a language-agnostic course, however, which, according to some reviews, demands intense effort in implementing algorithms by ourselves, without guidance. I like this approach, since it really forces one to research and deal with the real challenges of implementation, not just concepts. The problem is that my language of choice, for other reasons, is C#, which I don’t see listed, here or elsewhere, among the languages used for machine learning. I have limited experience with Python, from the AI and linear algebra courses, which made most of the framework available.
The question is: how far apart is C# from Python, in terms of libraries useful for machine learning? How would it compare to Java, in the same terms?
Should I use a language like Python to develop machine learning code and make it interact with C# code, considering it will continue to be my main developing language? What about Accord.Net? Is it any good?

You make several good points about context. I would add that there is a dimension which runs from “scripting” (summoning existing machine learning routines) to “programming” (writing the machine learning routines oneself). Some languages lend themselves to one of these operations more than the other. In SAS, for instance, analysts tend to call existing SAS “procs”: they are not writing logistic regression from scratch.

If a script-writing analyst and I fit the same model form to the same data, we will get the same model parameters. The differences are that I know how and why that modeling process works (and when it won’t), and I can modify it directly when needed.
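To make the “programming” end of that dimension concrete, here is a minimal sketch of logistic regression written from scratch with plain batch gradient descent. It is an illustration only, not how SAS or any particular library fits the model; the function names, the learning rate, the epoch count, and the toy data are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.3, epochs=5000):
    """Batch gradient descent on the mean log loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for one example: p(y=1 | x) minus the label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Label 1 when the predicted probability is at least 0.5."""
    labels = []
    for xi in X:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        labels.append(1 if p >= 0.5 else 0)
    return labels


# Toy data made up for illustration: the label is 1 when the feature is large.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
labels = predict(X, w, b)
```

Writing the loop by hand exposes the choices a packaged routine hides, such as the step size, the stopping rule, and the lack of regularization, which is where direct modification becomes possible.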

Hi Jason,
When I started a Data Science course, I had two choices: Python or R. As I have always had a passion for programming, I chose Python and worked with it throughout the course. Though the course series preferred R for time series, I was following your blog on time series using Python.
Some friends are suggesting Andrew Ng’s course on Coursera as a next step. But I felt that, as a newbie to the machine learning field, I should stick to one language and get used to various algorithms using it. Once comfortable, I can then explore more of R and MATLAB.

Even I (I am the author of Accord.NET, mentioned a few comments above) use scikit-learn on a daily basis for production use at work. However, if for any reason you or any of your blog readers would like to use machine learning in contexts where Python just wouldn’t be available (such as embedded devices through Xamarin, UWP apps or even Java), please give Accord.NET a try.

If you find issues in your application, or something that you believe should have been done better, register it at the project’s issue tracker and it should be taken care of in no time. The goal of this project is also to address platforms which have not historically been served very well by Python-only implementations.