Wednesday, January 21, 2009

After reading Ajay's interesting post, I decided to write a post about the good aspects of R. First, I would like to state that I'm neither a SAS nor a Clementine user. So the following arguments are my opinions as an R programmer:

R is easy and free to improve: R contains hundreds of useful packages (data mining, finance, etc.). If this is not enough, you can program your own packages and share them with others. You do not depend on someone else's programmers.

R is a white box: Since R is a programming language, it is easy to understand the overall process of the system in development. There is no GUI that lets you drop in black-box components whose inner workings may be unclear.

When you know R, you know everything: OK, this is a bit too much. But the message is that it is much easier to start with R and then move to SAS or Clementine than the opposite, especially for users who only use the GUI.

R is free: This is very good since small companies don't have the money to buy SAS or Clementine. Also, if several users need such tools, then the price increases. Of course, in a large company, SAS and SPSS tools may be an alternative.

R is a good choice: R is as convenient as Matlab (or even more?) and as cheap as Java (which means free). This makes R an excellent choice among existing tools and programming languages.

8 comments:

Steffen
said...

I totally agree...

I wondered ... Sandro, can you recommend a good R programming book? Or (more important) one on software development with R (S4 ...)? One of the drawbacks of a scripting language like R is the invitation to hack code together...

I would like to give what I think is the top reason why R is not used in operational data mining: one of R's main weaknesses is the way data is managed. There is a workspace in memory into which data has to be imported and from which results are then exported. This means that for big data sets, memory issues are frequent.

Remember that the vast majority of operational data mining (by that I mean the data mining projects whose results are used operationally on a day-to-day basis) is done in CRM. In this field, we regularly have training data sets with hundreds or thousands of columns and hundreds of thousands of rows, so R is cornered into domains with fewer data volume constraints.

@Steffen: I don't know about R books, but I'm sure they exist. I prefer to use tutorials such as Data Mining with R, for example.

@Matthias: I agree that R has some limitations, and maybe in some situations (very big data sets) it is not possible to use R.

@Erik: That's a very good point. In fact, I have the same issue when using R in finance, since I have to load all prices for a given time period and a set of stocks... in my case, this is not feasible under Windows (due to RAM limitations).

Actually, I am curious about scalability. I see that someone else has mentioned that data size is limited by physical RAM, but I wonder more about speed of computation. In my limited experience several years ago with S-Plus (R's commercial cousin), performance on data sets I would consider small was abysmally slow. Can you characterize R's performance on data tables whose sizes are typical of data mining projects?

@Will: Thanks for your comment! What I meant by "R is as convenient as Matlab" was from a programming point of view (I realized the sentence was not clear enough). It is easy to program in R and Matlab (compared to other languages). Of course, this is a very personal point of view.