Posted
by
samzenpus
on Friday April 05, 2013 @05:25AM
from the brand-new dept.

DaBombDotCom writes "Version 3.0.0 of R, a popular software environment for statistical computing and graphics, codenamed "Masked Marvel," has been released. From the announcement: 'Major R releases have not previously marked great landslides in terms of new features. Rather, they represent that the codebase has developed to a new level of maturity. This is not going to be an exception to the rule. Version 3.0.0, as of this writing, contains only [one] really major new feature: The inclusion of long vectors (containing more than 2^31-1 elements!). More changes are likely to make it into the final release, but the main reason for having it as a new major release is that R over the last 8.5 years has reached a new level: we now have 64 bit support on all platforms, support for parallel processing, the Matrix package, and much more.'"
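For reference, the 2^31-1 figure is simply the largest value of a signed 32-bit integer, which R previously used as its vector index type; a quick sanity check:

```python
# Largest value representable by a signed 32-bit integer --
# R's old ceiling on the number of elements in a single vector.
INT32_MAX = 2**31 - 1
print(INT32_MAX)  # 2147483647

# "Long vectors" are simply vectors indexed past this bound,
# which requires a wider (64-bit) index type internally.
```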

It also feels more appropriate, somehow, to do research code in R: It's supposed to be shareable and reproducible, and using an expensive and proprietary language kind of defeats the purpose. Besides, CRAN and Bioconductor have rather a lot of useful stuff...

Tell that to all the "scientists" and "researchers" paying money for _and_ investing lifetimes worth of effort into writing libraries for Matlab, Maple, Mathematica, LabView and other proprietary environments, instead of contributing to make the existing free environments better.

Times are changing. There are many forces at work here:

1. Cutbacks in funding are making lead scientists look for ways to save money.
2. Proprietary vendors keep upgrading their software and charging license fees for each version (one particular vendor licenses specific minor versions).
3. There's a growing desire to share work, and non-proprietary methods are the best way to do it.
4. New postdocs are familiar with Python (they like working in IPython in particular) and its libraries.
5. R is gaining ground with the older scientists due to its features and price.

Drives me crazy. At least with statisticians, R is by far the dominant package now. But in science, it's Matlab Matlab Matlab.

Python + Numpy/Scipy is such a better alternative now it's not even funny. It's actually a real language, and has loads of packages. And unlike Matlab, you don't have to pay extra money for additional packages (or any money).
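A minimal sketch of the style the parent is praising, assuming NumPy is installed: elementwise work over whole arrays needs no explicit loop, much like Matlab's vectorised idiom.

```python
import numpy as np

# Elementwise arithmetic over whole arrays -- no explicit loop.
x = np.linspace(0.0, 1.0, 5)   # [0.0, 0.25, 0.5, 0.75, 1.0]
y = 3.0 * x**2 + 1.0           # broadcasting applies per element

print(y.tolist())  # [1.0, 1.1875, 1.75, 2.6875, 4.0]
```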

The use of closed source software in science is a waste of scarce resources, and it hurts openness. Another thing is that every numerical type class I've had has used Matlab. It's really unfair to expect students to purchase a copy. I use Octave when I have to deal with this, but it is not perfectly compatible.

Students have it good when it comes to Matlab -- you can get a student version of Matlab + Simulink (with 10 or so toolboxes) for $99. The people who are really hurt by Matlab's pricing schemes are the hobbyists who don't qualify for a student copy. There's this huge price dichotomy: when you're a student it's $99, after you graduate it's $5000+, and that's without any toolboxes.

However, for academic use it makes perfect sense for scientists to use matlab over the alternatives. At least in the UC (universi

I tried switching to Python + Numpy/Scipy from Matlab. In the end I switched back to Matlab. I'm already familiar with Python, and have done a lot of C++ programming, so slight language differences were not the issue. Here are some of the reasons I switched back to Matlab:

IDE: Matlab comes with a ready-to-use IDE.

Value semantics: Matlab treats matrices (and all classes that are not derived from "handle") using value semantics, so you know that Y=f(X) won't change X, if X is a matrix. However it also u
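For contrast, Python passes references, so a function can mutate its argument unless it copies explicitly; Matlab's copy-on-write value semantics rule that out for plain matrices. A minimal illustration (function names are made up):

```python
def double_in_place(v):
    # Mutates the caller's list: Python passes object references.
    for i in range(len(v)):
        v[i] *= 2
    return v

def double_copy(v):
    # Leaves the caller's list untouched -- Matlab-style value semantics.
    return [x * 2 for x in v]

a = [1, 2, 3]
double_in_place(a)
print(a)        # [2, 4, 6] -- the argument changed

b = [1, 2, 3]
c = double_copy(b)
print(b, c)     # [1, 2, 3] [2, 4, 6] -- the argument survived
```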

Yeah, it's incredibly easy just to offload loops or whatever into Fortran and just use F2py. As an aside, Fortran 90 is just about as easy as Numpy or Matlab, so I do 90% of my work in there. I just use F2py to compile my Fortran modules as Python modules. Then I have the flexibility of using an interpreter with the speed of Fortran.

I haven't used it, but PyPy is a fast JIT-compiled implementation of Python that can speed up native Python code considerably.

While I despise MATLAB for a large set of reasons, I agree that a large variety of toolboxes is available for pretty much anything you might want to do.
And those are what most of the MATLAB users look for, in my experience: they might dislike the language, but MATLAB provides/they can purchase toolboxes that do what they need for their research.

Tell that to all the "scientists" and "researchers" paying money for _and_ investing lifetimes worth of effort into writing libraries for Matlab, Maple, Mathematica, LabView and other proprietary environments

Depends on who they're doing it for, it seems. It's a time vs. money balance.

In a commercial environment, Matlab tends to win out purely because of the toolboxes - especially current ones where Matlab has real-world interfaces so after modelling, you can prototype your control system with real hardware.

I haven't used labview, but Knime is both opensource and awesome. I can quickly prototype pretty much any workflow I want and get really good reproducibility. Debugging and unit tests need to be more directly integrated, but it is still a great package for practical science. It has R/Java/python integration as well!

LabView is similar - a horrible mess if you want to program with it, but scientists and the like love it because it means not having to mess with code.

I'm not sure "love" is the right word, at least in my experience. Although I've worked on several projects that used LabView, 90+% of the people working on it seemed to bitch and complain about it constantly. Many use it because they have existing code using it, or because of equipment that has drivers that are much, much easier to use in LabView, or because in a few select cases, it lets you bang out a GUI control really quickly. But otherwise, it seems to make the rest of the project a nightmare.

I have a license for SAS through my university. I gave up trying to convince the stupid thing to install. If the installer wasn't crashing, the license manager was.

MatLab has similar, though less severe problems.

R had a nice double click installer that worked the first time. Later I compiled it, which worked without any headaches. There's a nice bridge from R to Python and you can extend either one, or embed either or both in other applications.

You might read that as R having better accessibility options for the disabled, but I mean it's just plain more accessible.

I am a SAS developer and have never run into any such problems but I won't say I don't believe you. However, the benefit of that large licensing fee is the easy access to SAS help resources (real live people living over there in Cary, NC) who get back to you VERY QUICKLY for ANY level of technical question you have.

Their employees, at least the hundred or so I've met over the years when presenting at SAS Global Forum, have been INCREDIBLY friendly and helpful.

If commercial software is your thing, and you can afford it, and the vendor offers good support, 100% agreed.

If you're looking for R help the best two places to start are:

* Get a copy of The R Book, by Crawley -- it'll save you days of pointless/incomplete searches for web resources
* Swing by the R IRC channel on Freenode (irc://irc.freenode.net/#R) -- we welcome n00bz

A couple of years ago I ran into SAS at a trade show. It really surprised me that they were still around; I'd previously seen their products on mainframes back in the late 70s, with punch cards. (I forget by now whether I'd used SAS or SPSS, which were the two competing commercial stats packages in that environment.)

Julia is available as a 64-bit build on other platforms, but posting it as a reply in a thread that was complaining about how late R is to the 64-bit game is a bit rich. R has had 64-bit releases for all platforms for 3 years now. What's new in 3.0.0 is the removal of the remaining 32-bit limit on individual objects.

I recently switched my scientific programming from R to Python with NumPy and Matplotlib, as I couldn't bear programming in such a misdesigned and underdocumented language any more. R is fine as a statistical analysis system, i.e. as a command line interface to the many ready-made packages available in CRAN, but for programming it's a perfect example of how not to design and implement a programming language. It's also unusably slow unless you vectorise your code or have a tiny amount of data. Unfortunately, vectorisation is not always possible (i.e. the algorithm may be inherently serial), and even when it is, it tends to yield utterly unreadable code. Then there is the dysfunctional memory management system, which leads you to run out of memory long before you should, and documentation even of the core library that leaves you no choice but to program by coincidence [pragprog.com].

As an example of a fundamental problem, here's an R add-on package [r-project.org] that has as its goal to be "[..] a set of simple wrappers that make R's string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA's and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.". Needless to say that there is absolutely no excuse for having such problems in the first place; if you can't write consistent interfaces, you have no business designing the core API of any programming language, period.

Python has its issues as well, but it's overall much nicer to work with. It has sane containers including dictionaries (R's lists are interface-wise equivalent to Python's dictionaries, but the complexity of the various operations is...mysterious.) and with NumPy all the array computation features I need. Furthermore it has at least a rudimentary OOP system (speaking of Python 2 here, I understand they've overhauled it in 3, but I haven't looked into that) and much better performance than R. On the other hand, for statistics you'd probably be much better off with R than with Python. I haven't looked at available libraries much, but I don't think the Python world is anywhere near R in that respect.
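On the container point above: R's named lists and Python's dicts expose the same interface (named insertion, lookup, deletion), but in Python those operations are documented as amortised O(1). A tiny sketch:

```python
# Named insertion, lookup, and deletion on a dict -- the same
# interface as an R named list, with known amortised O(1) cost.
params = {"alpha": 0.05, "beta": 0.2}
params["gamma"] = 0.1       # insert
print(params["alpha"])      # 0.05
del params["beta"]          # delete
print(sorted(params))       # ['alpha', 'gamma']
```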

Anyway, for doing statistics I don't really think there's anything more extensive out there than R, proprietary or not, although some proprietary packages have easier-to-learn GUIs. In that field, R is not going to go anywhere in the foreseeable future. For programming, almost anything is better than R, and I agree that those improvements you mention are not doing much to improve R's competitiveness in that area.

I can somewhat relate to the documentation issue although I believe that it is more a question of organizing the documentation.

One of the things that bothers me about the documentation is that there's often no distinction between interface and implementation. Instead of a description of what a function does, you get implementation details mixed up with what it approximately hopes to achieve, leaving you unable to see the forest for the trees.

When you mention "a fundamental problem" you mention function implementations, thus library rather than language issues. R itself is an extremely expressive, functional (or rather multi-paradigm) language that can be programmed to run efficient code. Yet it is syntactically minimalistic without unneeded syntax (as opposed to all of the scripting languages perl/python/ruby). This makes it a truly postmodern language IMO.

Well, there's only one implementation, so it's rather pointless that it could be implemented efficiently. The language specification isn't exactly good enough to create a competing, compatible

Despite R's weaknesses as a programming language, R has such a large number of well-documented, well-tested, statistical functions with a wide array of arguments to vary that it is very difficult for another language to match. For example, maybe you want to build an arima time series model. OK, not too tough to find a library in Python or C++ that does that. Now what if you want to add an exogenous variable to the arima model? Maybe a seasonal component? Next maybe you want to automatically pick the best model according to AIC? Oops, make that BIC. Looking at it again maybe a Vector Autoregressive model is best. Or a VECM?
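The "pick the best model by AIC, oops, make that BIC" step is the same in any language; a generic sketch with made-up log-likelihoods (the model names and numbers here are purely hypothetical):

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k*ln(n) - 2*ln(L)."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical candidates: (name, fitted log-likelihood, parameter count)
models = [("ARMA(1,1)", -512.3, 3),
          ("ARMA(2,1)", -510.9, 4),
          ("ARMA(2,2)", -510.8, 5)]
n = 200  # number of observations

best_aic = min(models, key=lambda m: aic(m[1], m[2]))
best_bic = min(models, key=lambda m: bic(m[1], m[2], n))
print(best_aic[0], best_bic[0])  # ARMA(2,1) ARMA(1,1) -- the criteria disagree
```

BIC penalises parameters more heavily (ln 200 ≈ 5.3 per parameter versus AIC's 2), which is exactly why switching criteria can change the chosen model.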

While I'm sure there are excellent implementations of all of these wrinkles in other languages, with R, I have great confidence that the functions that I want and need now and in the future are going to be there and are going to be implemented correctly, and kudos to the R team for giving us that kind of confidence.

R does have a lot of problems, among the worst is loop performance. It really forces you to vectorize everything, which leads to less maintainable code, and is generally a coding technique that new hires coming from other languages will face a steep learning curve with. What I have found useful is to use R as a data exploration and model parameterization tool, but once the model is ready to be put into production, you can use the parameters calculated by R in an implementation in the language of your choice, e.g., C++.

I guess this is a long-winded way of saying that, as with so many questions of "which language is best," the real question is "which language is best for you and your application?" R is usually the best language only for people who are regularly using such a wide variety of statistical analyses that you won't find a large part of what you need in the libraries of other languages. For me, I couldn't imagine working without it.

Needless to say that there is absolutely no excuse for having such problems in the first place; if you can't write consistent interfaces, you have no business designing the core API of any programming language, period.

I guess you missed the memo that the K&R string functions are deprecated in many projects such as OpenBSD, which has its own recommended set of string functions.

Way back when, Iverson and his APL cronies put a great deal of effort into defining the APL arithmetic operator set to conform t

Writing a tutorial from nothing is hard. You can do this to get some good ideas:

1. Download a free evaluation copy of 'Minitab'. (I'm not connected with Minitab, but I've used it a lot, and it's great 'basic' stats analysis software)
2. Install, and then open help
3. Consult 'tutorials' section :)

Obviously, don't just rip off their stuff; not cool.

As a suggested flow, I've found that, as a start, you can introduce basic stats, then demonstrate how the software works. Using the same data-set for the first few (say ten) lessons is better. Minitab tutorials keep changing the data, which confuses students. You'll only need 5 columns or so, and remember to include some discrete variables to enable stratification of your continuous variables. Use a real-world example, such as household expenses for different families, whatever.

For tutorial flow, what works for me as a 'basic' intro to a stats package:

1. What is data? What are statistics?
2. Types of data, how they look as raw data (in the database) and then once we start to analyse them with stats and graphs (to start, just 'common' stuff like continuous variables, normal & lognormal, and discrete, binomial & poisson).
3. Basic stats & graphical analysis for single variables. Normality tests. Include time series plots as well as histograms / dotplots / boxplots.
4. Multivariate analysis; x/y charts, matrix plots, interaction plots.
5. Hypo tests (for both cont & disc variables)
6. Regression (simple, then multiple if you're feeling brave)
7. Control charts (for both cont & disc variables)
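The simple-regression step in a flow like this reduces to two closed-form estimates (slope and intercept by least squares); a pure-Python sketch with made-up data, useful as a worked example alongside whatever the stats package reports:

```python
# Least-squares fit of y = a + b*x, no libraries needed.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x with noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²),  a = ȳ - b*x̄
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

print(round(a, 3), round(b, 3))   # intercept ≈ 0.09, slope ≈ 1.99
```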

If you work out how to do this in 'R', by actually using it, your tutorial will pretty much write itself, (keep saving your screens - Irfanview is a great, free, tool I use for this. Install, open, hit 'C' for manual or automatic screen save options.)

A new, easy to use, free, online R system is StatAce (www.statace.com). The GUI analysis is still in infancy (only descriptives, correlation and OLS at this stage) but it supports any and all R code, many libraries, and has good data management (e.g. allows you to save results).

The single best R resource I've ever used was The R Book, by Crawley. Before buying it I invested way too much time searching all over the web for solutions to simple and complicated things alike, almost always with poor or incomplete results. The O'Reilly R books are barely OK. Short circuit the BS and go straight to The R Book. It paid for itself in about 2 hours of coding (it's expensive and runs between $80 and $150, when it's available -- my time is way more valuable, though).

R's developers are, unlike many other Open Source developers, very careful about releasing production-quality software.

As in: when they release it, you can trust it to work.

Hence they didn't mess around with major reconstruction of R's guts until they could release something that's finished (and well-tested !) and bumped the version number to 3.0.0 when they did in order to properly differentiate it from previous versions.

This is very gratifying as R happens to see widespread use in academia, government and business when it comes to data analysis and statistics.

If R has a weakness, it is that it uses an in-memory approach to data processing, unlike e.g. SPSS, which keeps almost nothing in memory and simply makes passes through data files whenever it needs something. R is also a bit memory-hungry, so the need for genuine 64-bit implementations should be clear.
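For simple statistics the pass-through style is easy to imitate in any language; a minimal stdlib-only sketch in Python, computing a mean in one pass without ever holding the dataset in memory (the in-memory file here just stands in for a real CSV):

```python
import csv
import io

# A stand-in for a large file on disk; csv.DictReader streams it
# row by row, so only one record is in memory at a time.
data = io.StringIO("x\n1.0\n2.0\n3.0\n4.0\n")

total = 0.0
count = 0
for row in csv.DictReader(data):
    total += float(row["x"])
    count += 1

print(total / count)  # 2.5
```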

Apart from sporting about 4000 useful and ready-to-run statistical applications packages, R has convenient and efficient integration with C code and has what's probably a contender for the best support for data-graphics anywhere.

For those who didn't know, even packages like SPSS and SAS have incorporated R interfaces to tap into the wealth of application packages that R offers. Can't think of a more significant compliment right now.

Even I have used R in the past, for my thesis. My statistician was using S-Plus to do magical things that the hospital's SPSS definitely could not do. However, S-Plus was not available to us non-statisticians. As a complete non-programmer and mediocre statistician, I was able to reproduce and build upon his examples in R.

But what I truly missed was a usable GUI. There were some, and I tried them all at the time, but none were able to do more than the basics. For someone using R daily, a GUI will be more trouble and

There are no new GUIs in the R distribution, but there are several GUIs produced by third parties that probably weren't available when you were doing your thesis. I like RStudio and recommend it to my students, but there are others too.

I think you are confusing GUI with IDE; RStudio and most of the other R "guis" don't make R more discoverable. SPSS and the like are used because they offer guidance on what one should try given what they already know. With an IDE, you still have to know how to program. Throwing together a text editor, an output window, and an execution button doesn't do much.

It's really disheartening that a professor thinks this solves any of the major pedagogical problems that R forces. I really wish you would STOP re

I have recently implemented RStudio for a customer. http://www.rstudio.com/ [rstudio.com]
It's a web interface for R which appears to be clean and easy to use. Installation was straightforward from RPM; you only need R-core, R-devel, xdgutils and the rstudio RPM itself.

There are usable GUIs for R, and best of all: they can be installed as packages from within R.

The best-known one is called 'R commander' (package name = Rcmdr). It gives you a point-and-click interface and (like SPSS) drops the R code to repeat what you did using the menu (so that your work is reproducible).

Every time I switch institutions I can use it. No problem with lack of a site license, no grant money for a license, or activation problems on a new machine. I can use it on whatever OS the organization owns. I can get it up and running in about 5 minutes and it will work.

Awesome community. If you have a problem, there's a good chance there's something on CRAN that solves it.

But super steep learning curve. Beginner documentation is at best suboptimal ("go bu

If you just use R to run data through a package (which in my opinion is the quickest way to get a lot of value out of R) then the learning curve is tolerable. Less steep than for SAS (I think), but steeper than for SPSS.

On the other hand: R in and by itself is mostly a tool for statisticians and data analysts (or anyone else who doesn't flinch at having to write scripts, who's acquainted with the phenomenon of 'manual', and who's used to spending a few hours or s