Why R and not spreadsheets?

Graphics should be considered vital when doing anything with data. R has amazing graphical capabilities.

It is becoming more and more common for advertisements for jobs involving data analysis to mention R. The demand for people who know R is growing rapidly. The jobs that mention R are better paid than those that just require spreadsheets. And rightly so — data analysis is done better and faster in R.

If you are an employer, you will get more data analysis for the amount you spend by moving to R (plus the analyses are more likely to be bug-free). If you do data analysis, then you may be able to get higher pay by knowing R — maybe not now but probably soon.

Get to the starting gate

Obviously you need to install R on your computer before you can use it. That’s easy to do — you’re unlikely to have any problems.

Data frames are familiar

The superbowl object that was created above is a data frame. Data frames are R objects that are very much like the most common way of using spreadsheets:

the data are rectangular

columns hold variables

rows hold observations

In both spreadsheets and R there are likely to be different types of data in different columns: numbers, character data, dates and so on. The difference is that R forces there to be only one type of data in a column.
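A minimal sketch of that one-type-per-column rule. The object `toy` and its values are invented here for illustration and are not part of the superbowl data:

```r
# 'toy' is a hypothetical data frame, made up for this illustration
toy <- data.frame(
  Winner = c("National", "American", "National"),
  Return = c(0.15, -0.15, 0.05),
  stringsAsFactors = FALSE
)

# each column has exactly one type
sapply(toy, class)

# putting text in among numbers turns every element into character data
mixed <- c(0.15, "n/a", 0.05)
class(mixed)
```

In a spreadsheet a stray "n/a" in a column of numbers can go unnoticed; in R the whole vector becomes character data, so the problem shows up as soon as you try to do arithmetic on it.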

You can look at the first few rows:

> head(superbowl)
       Winner DowJonesSimpleReturn DowJonesUpDown DowJonesCorrect
1967 National           0.15199379             Up         correct
1968 National           0.04269094             Up         correct
1969 American          -0.15193642           Down         correct
1970 American           0.04817832             Up           wrong
1971 American           0.06112621             Up           wrong
1972 National           0.14583240             Up         correct

The “> ” is R’s prompt; you type what comes after it (and then hit the “return” or “enter” key).

You can also see how big it is:

> dim(superbowl)
[1] 45 4

This says that there are 45 rows and 4 columns. You might have expected the number of columns to be 5 and not 4. The years on the very left are row names rather than actually part of the data, similar to how “Winner” is a column name and not part of the data proper.
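The distinction between row names and data can be sketched with a small stand-in. The object `sb` below is hypothetical: it copies only the first two rows and two of the columns shown above, and the `Year` column is an addition made here for illustration:

```r
# 'sb' is a hypothetical two-row stand-in for the superbowl data frame
sb <- data.frame(
  Winner = c("National", "National"),
  DowJonesSimpleReturn = c(0.15199379, 0.04269094),
  row.names = c("1967", "1968")
)

dim(sb)        # rows and columns; the row names are not counted
rownames(sb)   # the years live here
colnames(sb)   # the column names

# if the years are wanted as data, copy them into a real column
sb$Year <- as.integer(rownames(sb))
dim(sb)        # one more column than before
```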

A slice of computing

R includes a number of datasets that are automatically attached. One of them is airquality:
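A quick first look at it, in the same spirit as the head and dim commands used on superbowl. This is a sketch using only base R, and the particular commands are a choice made here for illustration:

```r
# airquality ships with R in the 'datasets' package
head(airquality)            # first six rows
dim(airquality)             # number of rows and columns
sapply(airquality, class)   # type of each column

# some Ozone readings are missing (NA), so ask mean() to skip them
mean(airquality$Ozone, na.rm = TRUE)
```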


If you’re using Postgres as your RDBMS, then you can embed R in your application using PL/R. And, while I’ve not attempted it, PG/9.3 has improved foreign database (aka federation) integration, so one could use R functions in the PG engine and report (and write, to some extent) from other databases. There is, as yet, no native driver for DB2 (JDBC/ODBC will work), but there are for most other databases.

and yet…the example provided could easily be done in excel
and the R output has, at least in the graph shown, the same hideous looking default formatting as excel
so…
show me a REAL example of why R is better than excel
(a good one would be x,y data, x irregular time intervals, y some variable; you do a linear regression in excel, fine, no problem, but what happens when you want lines showing the 95% ci for the regression line
possible, but hard
or smoothly fit a 4,5 parameter logistic fit to a sigmoid curve and extract min, max, slope, midpoint with CI…)

Great stuff Pat, nice to see R being extolled for spreadsheeters. First thing I do with any spreadsheet-loving students is send them to read your Spreadsheet Addiction. But they still get scared by the code. What do you think of the new R Commander version? I’m pretty impressed, as an entry-level for beginners.

Your introduction, about higher pay for R experience, reminded me of a paper I recently read. It offers some first evidence on the relation between early investment in experts with data science skills and firm productivity: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2294077

Teaser from the abstract: “The estimates indicate that from 2006 to 2011, firms’ use of big data technologies, measured as the employment of engineers with Hadoop skills, was associated with 3% faster productivity growth, but only for firms with a) significant data assets and b) access to technical workers from other early big data adopters.”

[…] If you analyse data and particularly statistics, you should really have R in your toolbox. Like all programming, it makes light work of repetitive tasks. As a trade off, there’s a learning curve involved. Thankfully the syntax is easy to get to grips with (if I can manage it, anyone can). There are some tips here for moving from spreadsheets to R. […]