I have been programming for about a year and I am really interested in data analysis and machine learning. I am taking part in a couple of online courses and am reading a couple of books.

Everything I am doing uses either R or Python and I am looking for suggestions on whether or not I should concentrate on one language (and if so which) or carry on with both; do they complement each other?

-- I should mention that I use C# in school but am familiar with Python through self-study.


This question appears to be off-topic. The users who voted to close gave this specific reason:

"Questions about what language, technology, or project one should take up next are off topic on Programmers, as they can only attract subjective opinions for answers. There are too many individual factors behind the question to create answers that will have lasting value. You may be able to get help in The Whiteboard, our chat room." – MichaelT, JeffO, World Engineer

5 Answers

I use both Python (for data analysis, of course including numpy and scipy) and R alongside each other. However, I use R exclusively to perform data analysis, and Python for more generic programming tasks (e.g. workflow control of a computer model).

In terms of basic operations, say operations on arrays and the like, R and Python + numpy are very comparable. It is in the very large library of statistical functions that R has an advantage. In addition, matplotlib does not seem to be as good as ggplot2, but I have not used matplotlib that much. I would also focus first on one language and become good at the specifics of that. You seem to be primarily interested in data analysis, not software engineering, so I would pick R and stick to that. That said, I think choosing Python + numpy + scipy + scikit is definitely an excellent choice; it is just that I feel that R is a bit more excellent.

I would also look at what your colleagues and other people in your field are using. If they all use, say, Python, it would make sense to stick to that in order to more easily learn from them and exchange code.

Disclaimer: Note that I am a heavy R user, so my opinion might be biased, although I have tried to keep my answer as objective as possible. In addition, I have not used Python + numpy extensively, although I know colleagues who do all their data analysis in it.

I use R and Python for all my research (with Rcpp or Cython as
needed), but I would rather avoid writing in C or C++ if I can avoid
it. R is a wonderful language, in large part because of the incredible
community of users. It was created by statisticians, which means that
data analysis lies at the very heart of the language; I consider this
to be a major feature of the language and a big reason why it won't
get replaced any time soon. Python is generally a better overall
language, especially when you consider its blend of functional
programming with object orientation. Combined with Scipy/Numpy,
Pandas, and statsmodels, this provides a powerful combination. But
Python is still lacking a serious community of
statisticians/mathematicians.

I mean Python + numpy yes, otherwise the choice would be even easier. I think in terms of basic operations, say operations on arrays and the sort, R and Python + numpy are very comparable. It is in the very large library of statistical functions that R has an advantage. In addition, matplotlib does not seem to be as good as ggplot2, but I have not used matplotlib that much.
–
Paul HiemstraJan 3 '13 at 13:14

That said, I think choosing Python + numpy + scipy + scikit is definitely an excellent choice; it is just that I feel that R is a bit more excellent.
–
Paul HiemstraJan 3 '13 at 13:17

@PaulHiemstra - You make some nice points in your comments that would probably improve your answer - if the question is re-opened and you have that opportunity.
–
psrJan 3 '13 at 19:54

@psr I edited in my comments, apparently no need for the question to be open if I want to edit it (maybe only I can edit it...).
–
Paul HiemstraJan 3 '13 at 20:43

Background: I'm a data scientist at a startup in Austin, and I come from grad school (Physics). I use Python day-to-day for data analysis, but use R a bit. I also use C#/.NET and Java (just about daily), and I used C++ heavily in grad school.

I think the main problem with using Python for numerics (over R) is the size of the user community. Since R has been around forever, lots of people have already done the things that you're likely to want to do. This means that, when faced with a hard problem, you can just download the package and get to work. And R "just works": you give it a dataset, and it knows what summary statistics are useful. You give it some results, and it knows what plots you want. All the common plots you'd want to make are there, even some pretty esoteric ones that you'll have to look up on Wikipedia. As nice as scipy/numpy/pandas/statsmodels/etc. are for Python, they're not at the level of the R standard library.

The main advantage of Python over R is that it's a real programming language in the C family. It scales easily, so it's conceivable that anything you have in your sandbox can be used in production. Python has Object Orientation baked in, as opposed to R where it feels like kind of an afterthought (because it is). There's other stuff that Python does nicely too: threading and parallel processing are pretty easy, and I'm not sure if that's the case in R. And learning Python gives you a powerful scripting tool, too. There are also really good (free) IDEs for Python, and much better ones if you're willing to pay (less than $100). I'm not sure this is the case for R--the only R IDE I know of is RStudio, which is pretty good, but isn't as good as PyDev + Eclipse, in my experience.

I'll add this as a bit of a kicker: since you're still in school, you should think about jobs. You'll find more job postings for highly skilled Python devs than you will for highly skilled R devs. In Austin, jobs for Django devs are kind of falling out of the sky. If you know R really well, there are a few places where you'll be able to capitalize on that skill (Revolution Analytics, for example), but lots of shops seem to use Python. Even in the field of data analysis/data science, more people seem to be turning to Python.

And don't underestimate that you may work with/for people who only know (say) Java. Those people will be able to read your Python code pretty easily. This won't necessarily be the case if you do all of your work in R. (This comes from experience.)

Finally, this may sound superficial, but I think the Python documentation and naming conventions (which are religiously adhered to, it turns out) are a lot nicer than the utilitarian R docs. This will be hotly debated, I'm sure, but the emphasis in Python is readability. That means that arguments to Python functions have names that you can read, and that mean something. In R, argument names are often truncated---I've found this less true in Python. This may sound pedantic, but it drives me nuts to write things like 'xlab' when you could just as easily name an argument 'x_label' (just one example)---this has a huge effect when you're trying to learn a new module/package API. Reading the R docs is like reading Linux man pages---if that's what floats your boat, then more power to you. When I have a question about how something works in R, I avoid the R documentation, whereas I START with the Python docs when I'm confused about Python.

All of that being said, I'd suggest the following (which is also my typical workflow): since you know Python, use that as your first tool. When you find Python lacking, learn enough R to do what you want, and then either:

Write scripts in R and run them from Python using the subprocess module, or

Install the RPy module.

Use Python for what Python is good at and fill in the gaps with one of the above. This is my normal workflow---I usually use R for plotting things, and Python for the heavy lifting.
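
As a rough sketch of the first option (running an R script from Python with the subprocess module): the helper names, the script name `fit_model.R`, and its `prices.csv` argument are all hypothetical, and the sketch assumes R's `Rscript` command-line front end is installed and on your PATH.

```python
import subprocess


def build_rscript_command(script_path, args=(), executable="Rscript"):
    """Return the argument list for running an R script non-interactively."""
    return [executable, script_path, *map(str, args)]


def run_r_script(script_path, args=(), executable="Rscript"):
    """Run the script and return whatever it printed to standard output.

    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    completed = subprocess.run(
        build_rscript_command(script_path, args, executable),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output as str rather than bytes
        check=True,           # turn a non-zero exit status into an exception
    )
    return completed.stdout


# Hypothetical usage, assuming fit_model.R exists and prints its results:
# output = run_r_script("fit_model.R", ["prices.csv"])
```

Passing the command as a list rather than a single shell string avoids quoting problems with file names, and `check=True` means a failing R script is not silently ignored.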

So to sum up: because of Python's emphasis on readability (search Google for "Pythonic"), the availability of good, free IDEs, the fact that it's in the C family of languages, the greater possibility that you'll be able to capitalize on the skillset, and the all-around better documentation style of the language, I'd suggest making Python your go-to, and relying on R only when necessary.

Ok, this is (by far) my most popular answer ever on a stack site, and it's not even #1 :) I hope this has helped a few people along the path.

At any rate, I've come to the following conclusion after several years in the field:

This is probably the wrong question to ask.

Asking "should I learn this particular technology" is a bad question. Why?

Technology changes. You'll always have to learn another technology. If you go work at Twitter, they run Scala. Some places are Python shops. Some places don't care. You're not going to be hired because you know or don't know some particular piece of tech--if you can't learn a new tech, you can (and should) be fired. It's like, if a new pipe wrench comes out, and you're a plumber, and you can't figure out how the new pipe wrench works, you're probably a pretty lousy plumber.

Given the choice of "Do I learn this technology" or "Do I spend more time solving real problems", you should always choose the latter, without exception.

As a data scientist, your job is to solve problems. That single bit of wisdom is pretty much always lost at every conference or meetup you go to--every "big data" talk I've ever seen has focused on tech, not on solving problems. The actual problem solving is usually relegated to a few slides at the end:

[Talk title = "Deep learning at Cool New Startup"]...[45 minutes of diagrams and techno-babble during which I zone out and check my phone]...And, after implementing our Hadoop cluster and [Ben zones out again] we can run our deep learning routine, [wake up: this is why I came!] the details of which are proprietary. Questions?

This gives the false impression that the field is about tech, and it's just not true. If you're really good at Scala, or Python, or R, but you're really bad at solving problems, you will make a lousy data scientist.

Paco Nathan was in Austin a few months ago at a day long "big data" conference, and said something like "Chemistry isn't about test tubes". That pretty much sums it up--data science isn't about Scala, or Hadoop, or Spark, or whatever-other-tech-du-jour pops up. At the end of the day, I want to hire people who think, not people who are adept at using Stack Overflow to learn toolkits.

Likewise, if you go to a job interview, and they don't hire you just because you don't know some programming language, then that company sucks. They don't understand what "data scientist" means, and it's probably better for you if it didn't work out.

Finally, if your problem solving abilities are marginal (be honest with yourself), or you really just enjoy the tech side of things, or learning tech is what you really love (again, be honest) then learn a lot of tech. You'll always be able to find "data engineer" type roles that fit your skill set. This isn't a bad thing: data engineers grease the wheels and make it possible for you to do your job as a data scientist. (The difference is akin to software architect vs. the development team.)

I will say, though, that if I were working on a trading floor, and the head trader came to me with a csv of option prices and wanted me to fit them with a log-linear distribution and back out the mean and standard deviation, I wouldn't even consider Python. I think it's like three lines of code to do this in R.
–
BenDundeeJan 23 '13 at 2:31

So, I have primarily done data analysis in Matlab, but have done some in Python (and used Python more for general-purpose work), and I've also started a bit of R. I am going to go against the grain here and suggest you use Python. The reason is that you are doing data analysis from a Machine Learning perspective, not stats (where R is dominant) or digital signal processing (where Matlab is dominant).

There is obviously heavy overlap between Machine Learning and Stats. But overlap is not identity. Machine Learning uses ideas from CS that I for one would not want to implement in R. Sure, you can compute a minimal spanning tree in R. It may look like an ugly mess though. Machine learning people will assume you have easy access to hash tables, binary search trees, and so on. It is easier in my mind to implement a stats algorithm afresh when necessary, than to try to shoehorn what is basically a domain specific language into a general programming language.
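
To make the point about general-purpose data structures concrete, here is a minimal sketch of a minimum spanning tree in plain Python. It uses Prim's algorithm, which is my choice for illustration rather than anything prescribed above, with a dict of dicts as the graph, a set for visited nodes, and the standard-library `heapq` binary heap as the priority queue. The example graph is made up.

```python
import heapq


def minimum_spanning_tree(graph, start):
    """Prim's algorithm on an undirected, connected, weighted graph.

    `graph` maps each node to a dict of {neighbour: edge_weight}.
    Returns the tree as a list of (weight, from_node, to_node) edges.
    """
    visited = {start}                      # hash set: O(1) membership tests
    frontier = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(frontier)                # binary heap keyed on edge weight
    tree = []
    while frontier and len(visited) < len(graph):
        weight, u, v = heapq.heappop(frontier)
        if v in visited:
            continue                       # edge would close a cycle; skip it
        visited.add(v)
        tree.append((weight, u, v))
        for nxt, w in graph[v].items():
            if nxt not in visited:
                heapq.heappush(frontier, (w, v, nxt))
    return tree


# Made-up example graph with four nodes.
example = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 1, "c": 2, "d": 5},
    "c": {"a": 4, "b": 2, "d": 1},
    "d": {"b": 5, "c": 1},
}
print(minimum_spanning_tree(example, "a"))
# → [(1, 'a', 'b'), (2, 'b', 'c'), (1, 'c', 'd')]
```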

The side benefits of Python for data analysis are much higher too. You will learn a real programming language at the same time, which can handle scripting, create larger applications, etc. R is really a niche language of the stats community; even Matlab is far more widely used.

I guess I would look at some of the papers first, and see in what language they post code. If it's not in R, then don't use it.

Thank you very much. I am definitely more interested in the ML side of things.
–
The_Cthulhu_KidJan 9 '13 at 5:32


Just a minor addendum: I'm sure R can do this in some fashion as well, but Python is well known for its ability to call C or compile functions into C using Cython with minimal overhead. So you can usually get faster with less effort, a major consideration when looking at real data. Another (final) exotic note: Java has some really good machine learning libraries (like WEKA). However, what's cool is you can call these as well from Python, using Jython :-)
–
Nir FriedmanJan 9 '13 at 6:52

As an old school (over 50) scientist who has used and continues to use a number of these tools, I will add my two cents. I have worked with colleagues who still write every piece of code in Fortran, from trivial one-off data analysis jobs to code that dominates some of the world's supercomputers. Recent Fortran dialects (F90, F95, F2003, F2008) are, IMHO, some of the best designed languages in existence. Decades of experience with high performance computing have led to quite impressive language development.

I have only used Python at times, and will revisit it (mostly because of Sage), but I use a time-tested suite of languages that work well for me: Fortran, C, Perl, R, and Scheme (with tcl for scripting VMD). I find the combination of R and Fortran and C to be very comfortable. In contrast to other comments made about the object model in R, it is a good object model for interactive work, based upon the CLOS concept of generic functions and method dispatch. When working interactively with a new package you can often rely upon generic functions like “print” and “plot” to do something productive.
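
For readers more at home in Python, a loose analogue of this generic-function style is `functools.singledispatch` from the standard library. The sketch below is only an illustration of the idea, not a description of how R implements it: `describe` and `LinearFit` are hypothetical names, and `singledispatch` chooses a method from the type of the first argument only.

```python
from functools import singledispatch


@singledispatch
def describe(obj):
    """Fallback used when no more specific method is registered."""
    return f"<{type(obj).__name__}>"


@describe.register
def _(obj: list):
    return f"list of {len(obj)} items"


@describe.register
def _(obj: dict):
    return "mapping with keys: " + ", ".join(sorted(map(str, obj)))


class LinearFit:
    """Hypothetical result object, standing in for what a package might return."""

    def __init__(self, slope, intercept):
        self.slope = slope
        self.intercept = intercept


@describe.register
def _(obj: LinearFit):
    # A package can add its own method without touching describe() itself.
    return f"y = {obj.slope} * x + {obj.intercept}"


print(describe([1, 2, 3]))        # → list of 3 items
print(describe(LinearFit(2, 1)))  # → y = 2 * x + 1
print(describe(3.5))              # → <float>
```

The caller only ever types `describe(x)`; which implementation runs depends on what `x` is, which is the convenience being described for “print” and “plot”.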

The API to Fortran and C is very easy to use. If you're used to working in Fortran and C for modeling and data analysis, this is a big plus. The ability to dynamically generate R code and evaluate it, while not nearly as clean as the macro systems in Lisp and C, is very useful when working up dynamic data sets.

Some limitations of R for real data include the call by value approach. While there are CS reasons for call by value, real world programming with big numeric data requires some form of call by reference (note the importance of Fortran common blocks in older code, or module data in newer code). The approach adopted by PDL (Perl Data Language) is especially elegant in this regard. (Pdls are essentially call by reference unless you request a copy. Sub-pdls reference a subsection of a parent pdl, in a far cleaner syntax than Fortran or C provide.)
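
The reference-versus-copy distinction can be sketched in Python using only the standard library. This is an analogy, not PDL or R code: a `memoryview` slice over an `array` refers to the parent's memory in roughly the way a sub-pdl is described above, while `tolist()` makes an independent copy. The function name and the data are made up.

```python
from array import array


def scale_in_place(values, factor):
    """Multiply each element of a writable numeric buffer by `factor`."""
    for i in range(len(values)):
        values[i] = values[i] * factor


data = array("d", [1.0, 2.0, 3.0, 4.0])

# A memoryview slice refers to the parent's memory; nothing is copied.
middle = memoryview(data)[1:3]
scale_in_place(middle, 10.0)

# An explicit copy, by contrast, is independent of the parent.
snapshot = middle.tolist()
snapshot[0] = -1.0

print(data.tolist())  # → [1.0, 20.0, 30.0, 4.0]
print(snapshot)       # → [-1.0, 30.0]
```

With large numeric data the difference matters because the in-place version never duplicates the buffer, whereas a copy-on-call approach has to allocate a second one.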

It is good to learn many languages. Python is undoubtedly an important language, but R is as well in its domain. But when the rubber really needs to meet the road in science, Fortran and C (and C++ for some) will be hard to displace.

A key feature of R is that it is a library of packages, as much as it is a programming language. Every package writer has access, in principle, to what is in every other package. This dramatically reduces the need to re-invent, to re-document, to re-learn. This applies both to package authors and to users. Of course, this infrastructure comes at a cost. Package authors must accommodate standards that become increasingly finicky with the passage of time. Some of this may spill over into what users encounter.

Python does not, as I understand, have a package management system. There is no equivalent of R's Comprehensive R Archive Network (http://cran.r-project.org), and no direct equivalent of the R task views (http://cran.csiro.au/web/views/). Thus it is, to an extent that is not the case for R, a tool for programmers working pretty much on their own rather than as part of a communitarian effort to build on what is already available.

For data analysis and machine learning, the demand is surely, to a very large extent, to build on and take advantage of abilities that are already in place. For more generic programming tasks, Python may well have advantages. Will you do this type of work enough to justify the effort involved in learning Python?

Python has a package management system called pip. It is not part of the standard library, but it will come shipped with standard Python starting with Python 3.4, which will be released next month (March 2014).
–
Cody PiersallFeb 17 '14 at 1:20