The War on Data Science: Python versus R

Akram Hussain

July 27th, 2014

Data science

The relatively new field of data science has taken the world of big data by storm. Data science gives valuable meaning to large sets of complex and unstructured data. The focus is around concepts like data analysis and visualization. However, in the field of artificial intelligence, a valuable concept known as Machine Learning has now been adopted by organizations and is becoming a core area for many data scientists to explore and implement.

In order to fully appreciate and carry out these tasks, data scientists are required to use powerful languages. R and Python currently dominate this field, but which is better and why?

The power of R

R offers a broad, flexible approach to data science. As a programming language, R focuses on allowing users to write algorithms and computational statistics for data analysis. R can be very rewarding to those who are comfortable using it. One of the greatest benefits R brings is its ability to integrate with other languages like C++, Java, C, and tools such as SPSS, Stata, Matlab, and so on. The rise to prominence as the most powerful language for data science was supported by R’s strong community and over 5600 packages available.

However, R is very different to other languages; it’s not as easily applicable to general programming (not to say it can’t be done). R’s strength and its ability to communicate with every data analysis platform also limit its ability outside this category. Game dev, Web dev, and so on are all achievable, but there’s just no benefit of using R in these domains. As a language, R is difficult to adopt with a steep learning curve, even for those who have experience in using statistical tools like SPSS and SAS.

The violent Python

Python is a high level, multi-paradigm programming language. Python has emerged as one of the more promising languages of recent times thanks to its easy syntax and operability with a wide variety of different eco-systems. More interestingly, Python has caught the attention of data scientists over the years, and thanks to its object-oriented features and very powerful libraries, Python has become the go-to language for data science, many arguing it’s taken over R.

However, like R, Python has its flaws too. One of the drawbacks in using Python is its speed. Python is a slow language and one of the fundamentals of data science is speed!

As mentioned, Python is very good as a programming language, but it’s a bit like a jack of all trades and master of none. Unlike R, it doesn’t purely focus on data analysis but has impressive libraries to carry out such tasks.

The great battle begins

While comparing the two languages, we will go over four fundamental areas of data science and discuss which is better. The topics we will explore are data mining, data analysis, data visualization, and machine learning.

Data mining: As mentioned, one of the key components to data science is data mining. R seems to win this battle; in the 2013 Data Miners Survey, 70% of data miners (from the 1200 who participated in the survey) use R for data mining. However, it could be argued that you wouldn’t really use Python to “mine” data but rather use the language and its libraries for data analysis and development of data models.

Data analysis: R and Python boast impressive packages and libraries. Python, NumPy, Pandas, and SciPy’s libraries are very powerful for data analysis and scientific computing. R, on the other hand, is different in that it doesn’t offer just a few packages; the whole language is formed around analysis and computational statistics. An argument could be made for Python being faster than R for analysis, and it is cleaner to code sets of data. However, I noticed that Python excels at the programming side of analysis, whereas for statistical and mathematical programming R is a lot stronger thanks to its array-orientated syntax. The winner of this is debatable; for mathematical analysis, R wins. But for general analysis and programming clean statistical codes more related to machine learning, I would say Python wins.

Data visualization: the “cool” part of data science. The phrase “A picture paints a thousand words” has never been truer than in this field. R boasts its GGplot2 package which allows you to write impressively concise code that produces stunning visualizations. However. Python has Matplotlib, a 2D plotting library that is equally as impressive, where you can create anything from bar charts and pie charts, to error charts and scatter plots. The overall concession of the two is that R’s GGplot2 offers a more professional feel and look to data models. Another one for R.

Machine learning: it knows the things you like before you do. Machine learning is one of the hottest things to hit the world of data science. Companies such as Netflix, Amazon, and Facebook have all adopted this concept. Machine learning is about using complex algorithms and data patterns to predict user likes and dislikes. It is possible to generate recommendations based on a user’s behaviour. Python has a very impressive library, Scikit-learn, to support machine learning. It covers everything from clustering and classification to building your very own recommendation systems. However, R has a whole eco system of packages specifically created to carry out machine learning tasks. Which is better for machine learning? I would say Python’s strong libraries and OOP syntax might have the edge here.

One to rule them all

From the surface of both languages, they seem equally matched on the majority of data science tasks. Where they really differentiate is dependent on an individual’s needs and what they want to achieve. There is nothing stopping data scientists using both languages. One of the benefits of using R is that it is compatible with other languages and tools as R’s rich packagescan be used within a Python program using RPy (R from Python). An example of such a situation would include using the Ipython environment to carry out data analysis tasks with NumPy and SciPy, yet to visually represent the data we could decide to use the R GGplot2 package: the best of both worlds.

An interesting theory that has been floating around for some time is to integrate R into Python as a data science library; the benefits of such an approach would mean data scientists have one awesome place that would provide R’s strong data analysis and statistical packages with all of Python’s OOP benefits, but whether this will happen remains to be seen.

The dark horse

We have explored both Python and R and discussed their individual strengths and flaws in data science. As mentioned earlier, they are the two most popular and dominant languages available in this field. However a new emerging language called Julia might challenge both in the future. Julia is a high performance language. The language is essentially trying to solve the problem of speed for large scale scientific computation. Julia is expressive and dynamic, it’s fast as C, it can be used for general programming (its focus is on scientific computing) and the language is easy and clean to use. Sounds too good to be true, right?