R users already know why the R language is the lingua franca of statisticians today: because it's the most powerful statistical language in the world. Revolution Analytics builds on the power of open source R, and adds performance, productivity and integration features to create Revolution R Enterprise. In this webinar, author and blogger David Smith will introduce the additional capabilities of Revolution R Enterprise, including:

Multi-processor speed improvements and parallel processing

Productivity and debugging with an integrated development environment (IDE) for the R language

Web Services for R, to integrate R computations and graphics into Web-based applications

Technical support and consulting services for R

This webinar will be of value to current R users in industry and government who want to learn more about the additional capabilities of Revolution R Enterprise to enhance the productivity, ease of use, and enterprise readiness of open source R. R users in academia will also find this webinar valuable: we will explain how all members of the academic community can obtain Revolution R Enterprise free of charge.

If you know of anyone who could benefit from learning about R and Revolution R Enterprise, please send them the link below.

August 19, 2011

As I stand here at Heathrow waiting for my flight back to the States, I thought I'd dash off a few quick reflections on the useR! 2011 conference at the University of Warwick. It was an outstanding event. There's something about a conference of just a few hundred attendees (there were about 450) that creates a sense of camaraderie and common purpose you just don't get at larger conferences. It was wonderful to re-connect with colleagues, finally meet long-standing collaborators previously known only via email, and make many new friends. The event was tremendously well-run, and as a community-run conference it proceeded much more smoothly than many professionally-managed conferences I've been to. A big thank-you to the useR! 2011 organizing committee (John Aston, Julia Brettschneider, David Firth, Ashley Ford, Ioannis Kosmidis, Tom Nichols, Elke Thönnes and Heather Turner) for such a fantastic conference; we at Revolution Analytics were proud to have been a sponsor. (Photo courtesy useR! 2011.)

There was some great information shared in the tutorials and invited and contributed sessions, too. Here are a few quick nuggets compiled from my notes:

From Max Kuhn's tutorial:

The caret package is a powerful yet easy-to-use front end to more than 120 different kinds of predictive models in R. It provides a consistent user interface to all of them, and makes it easy to tune and compare models to select the one with the best predictive power for your data.
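As a quick illustration, here's a minimal sketch of what that consistent interface looks like (the model and data choices are mine, not from the tutorial, and it assumes the caret and randomForest packages are installed):

library(caret)

# train() uses the same call structure regardless of the underlying model;
# here we tune a random forest with 10-fold cross-validation
fit <- train(Species ~ ., data = iris,
             method = "rf",
             trControl = trainControl(method = "cv", number = 10))

# Swapping method = "rf" for, say, method = "svmRadial" changes the model
# but not the interface, which makes side-by-side comparisons easy
print(fit)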

August 18, 2011

I have always thought it odd that statisticians who live and die by the formal machinery of Neyman-Pearson hypothesis testing will also examine residual plots and qqplots to assess the validity of their models. To be fair, looking at plots is far from being all that is done, but still – looking? Even after having gained some experience myself in these matters, there still lingers a cognitive dissonance. Think of all the anguish associated with setting up the experiment, choosing the level of the test and steeling oneself to accept the grim tyranny of the p-value. Given this, how could anyone without the prescience of a guild navigator be guided by just looking?

Well, I got over it. Now, I hate to build any model without looking at something. But, this need to look has made working with big data files emotionally challenging: it’s just not practical to plot galaxies of millions and billions of points. To feel better about things I have taken to sampling.

The code below, adapted from Maindonald and Braun’s book “Data Analysis and Graphics Using R”, uses Revolution Analytics’ RevoScaleR package to run a simple linear regression on the entire airlines challenge data set (120M+ observations) and then samples from an Xdf file containing the residuals to produce lots and lots of qqplots. Each graph contains a qqplot of the residuals in the lower-left corner, surrounded by 7 reference plots drawn from a normal distribution with mean and variance equal to the sample statistics. Figure 1 is one such cluster of plots. By itself, it may not mean much, but if thousands of the samples looked like this, I would be inclined to say that the residuals are not close to being normal.
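If you don't have the book handy, here's a minimal base-R sketch of the plotting step alone; it assumes the RevoScaleR regression and the sampling from the Xdf file have already produced a vector of residuals (the rnorm() stand-in below just keeps the sketch self-contained):

# Stand-in for a sample of residuals drawn from the Xdf file
res <- rnorm(1000)
m <- mean(res)
s <- sd(res)

# 8 panels: the real qqplot in the lower-left corner, plus 7 reference
# panels drawn from a normal distribution with the sample mean and sd
par(mfrow = c(2, 4))
for (i in 1:8) {
  if (i == 5) {  # with mfrow = c(2, 4), panel 5 is the lower-left corner
    qqnorm(res, main = "Residuals")
    qqline(res)
  } else {
    ref <- rnorm(length(res), mean = m, sd = s)
    qqnorm(ref, main = "Reference")
    qqline(ref)
  }
}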

If you like the idea, adapt this code to navigate through your favorite galactic-size residual space, prepare the orange spice and, guided by your intuition, gaze intently for an afternoon. (It takes my 8-core Dell laptop about 13 seconds to paint the screen with each plot.)

August 17, 2011

"The R-Files" is an occasional series from Revolution Analytics, where we profile prominent members of the R Community.

Name: Martyn Plummer

Occupation: Statistician at International Agency for Research on Cancer

Nationality: British

Years Using R: 16

Known for: Member of R core group; member of R Journal editorial board

Martyn Plummer is a longtime contributor to the R community and a member of the R core group, which consists of 20 members who help oversee the continued evolution of the project. Plummer also serves on the editorial board of the R Journal, the official journal of the R project. By day, he is a Statistician and Epidemiologist at the International Agency for Research on Cancer (IARC), based in Lyon, France.

Plummer, who has been using R since 1995, has developed or contributed to a number of popular packages, including coda, for analyzing Markov Chain Monte Carlo output; JAGS, a clone of the popular WinBUGS software for Bayesian analysis; and Epi, which provides functions for epidemiologists and accompanies an annual course that aims to introduce epidemiologists to R.

He has also incorporated R into his work at IARC, where he works in the Infection and Cancer Epidemiology group. Much of the work of this group is focused on human papillomavirus (HPV), which causes half a million cases of cervical cancer per year worldwide. Plummer and his colleagues use R (including his own Epi package) to analyze epidemiological studies of HPV infection and try to tease out some aspects of HPV natural history that are difficult to understand without statistical modeling, such as whether different HPV types interact with each other. He also relies heavily on R’s graphical capabilities for visualizing data in scientific publications.

Prior to R, Plummer worked primarily with S+ for analyzing data. He had been working in the Biostatistics Unit of the Medical Research Council in Cambridge, United Kingdom, when he was offered a position at the IARC in Lyon. He recalls the transition, and how his new position introduced an entirely different computing environment. Soon after moving, he was introduced to the recently-formed R project by his colleague David Clayton.

“From the beginning, I saw enormous potential in R,” says Plummer. “While I was accustomed to S+, it wasn’t long before I completely switched over to R. It was and continues to be unparalleled in its flexibility in terms of data analysis.”

Plummer also points to R’s extensible nature as one of its defining features. As a modern language, R is able to effectively adapt to the changing nature of data analysis in an era of increasingly large, unstructured data sets. “One of the most important features of R is that it’s built around the data; it’s designed for programming with data, so it can take these developments in stride,” he says.

He went on to describe a recent article in the R Journal that analyzed 18 months’ worth of text from the R mailing lists and identified relationships between prominent members of the R community based on the topics they discussed. Plummer cites it as an example of R’s ability to keep up with the ever-changing notion of “data.”

“10 years ago, I would have never called such an amalgamation of text a ‘data set,’” he says. “Today, though, we find ourselves in a situation where we can elicit structure from large and complex data sets and glean meaning from it.”

When asked about how he sees the R project evolving in coming years, Plummer speaks of a delicate yet effective balance. “R manages a difficult equilibrium; it’s partly a frontier for innovation in statistical computing, yet it’s also a stable platform for data analysis. It’s unique in this regard and I don’t see it facing serious competition for quite some time.”

He sees the current situation being maintained at least over the next few years, though one challenge for R users is to navigate the increasing number of contributed packages. While there’s incredible innovation being done for a diverse range of functions, Plummer says, there are also opportunities for the community as a whole to pool and share their work.

“One of the most important and oft-overlooked values of the R community is its interdisciplinary nature,” he says. “It’s remarkable to be able to collaborate with so many talented people from a diverse range of fields. We’re all statisticians, but statistics has a terrible tendency to fragment by subject matter. R gives us all a common platform and brings us together to encourage innovation.”

August 16, 2011

R Core member Professor Brian Ripley from Oxford University gave the first keynote presentation of useR! 2011 today, offering some insights into what goes on behind the scenes to create two updates to R (plus several patches) every year. He began with some facts about the history of R (noting that if they'd known R would take off like it has, there would be better records of the early days):

The first still-existing version of R dates from June 1995; the distribution totals 465Kb.

R 1.0.0 was released on February 29, 2000 (up to 2.8Mb)

R 2.0.0 was released on October 4, 2004 (mostly because the name R 1.10.0 was unappealing, as it would sort to the top of the list of R versions in some systems). At this point the distribution had grown to some 10Mb in size.

Prof. Ripley also showcased some of the major improvements from recent versions of R, including multi-language support (thanks to which R is widely used in China and Japan, for example), support for R as a scripting language (most of R's own build scripts are now written in R), and improved graphics rendering.

Looking to the future, Prof. Ripley noted that there are no plans to make any backwards-incompatible changes that would warrant a jump to a 3.x numbering scheme. R 2.14 is planned for October, and after that the Core Group will move to an annual (rather than bi-annual) release schedule, beginning with R 2.15 in (provisionally) March 2012. [Updated Nov 17 2011: this originally, and erroneously, said October 2012.] He also gave a glimpse of some of R's development plans, with low-level support for multi-threaded computing, a standard parallel computing library, and support for a 64-bit native R engine possibly on the horizon.

The talk also included some rather poignant insight into the level of altruistic commitment provided by the active members of R-core to keep the R project running as smoothly as it does. For example, there are more than 110 contributions to CRAN each week, each of which requires manual review and often direct feedback on how to fix problems from CRAN maintainer Kurt Hornik. Also, many members of R-core spend a lot of volunteer time on the R-help and R-devel mailing lists interacting with R users: so many requests for help and suggestions for changes to R take a lot of effort to respond to, even when asked respectfully -- and these contributions from R-core perhaps aren't always treated with respect.

So I'd like to join the rest of the R community in giving thanks to Prof. Ripley and the R core team for making R available to the community at large. Each of us has benefited greatly from their selfless contributions in taking statistical computing to the next generation, and I, amongst many I'm sure, am extremely thankful to them for their generosity.

I gave my talk to the useR! 2011 conference this morning: The R Ecosystem. The goal of the talk was to show R in context: that the combination of the R project and its leadership, the R userbase, and the companies supporting and using R makes for a thriving ecosystem, and is indicative of an extremely successful open source project. I also summarized some statistics from the R userbase survey from last week, and extrapolated (unscientifically, but the best I can do for now) to the size of the R userbase, which I estimate exceeds 2 million users. The link to the Prezi presentation is below; for best effect, play it in full screen (click "More") and note that the links to resources are clickable.

August 15, 2011

Back when I was a grad student, I was a devoted Emacs user. I basically used it like an operating system: it wasn't just my text editor, but also my mail reader, my Web browser, my news reader, and so much more. (I once even asked our sysadmin to change my default shell to /usr/bin/emacs. He refused.) So when I started doing development in the S language, it was inevitable that I'd think I could tweak some existing Emacs scripts and make it easier to edit S code. Sure enough, this turned into a major project (S-mode) with several collaborators, which culminated in being able to run the S interpreter within Emacs and get (then-radical) features like command history and transcript management. When R came along, a new team adapted S-mode for R, and ESS -- Emacs Speaks Statistics -- was born.

I'm ashamed to say that while I still use Emacs occasionally, I never got around to installing it on my Mac. Today, in the ESS tutorial at useR! presented by Stephen Eglen, I learned that I needn't have hesitated: pre-compiled binaries of Emacs for Windows and Mac are available thanks to Vincent Goulet, with ESS (and much more) pre-installed. No configuration necessary, just install and run -- I was up and running with ESS in less than 2 minutes.

The Revolution Analytics team is at the R user conference useR! at Warwick University this week. We'll bring you the updates from the conference with the latest from the R community as we go, but as we're on UK time blogging will be at unusual hours for the next few days. Things are off to a great start, with lots of people here already for the pre-conference tutorials. Much more to come, so stay tuned.

August 12, 2011

I love astronomy images, and I look at a lot of them, but this one from APOD last month is the most jaw-dropping image I've seen in a long time:

The photo, capturing almost the full 360-degree-by-180-degree spherical projection of space seen by a viewer floating in the Solar System, is a composite of two dark-sky photographs, one from Chile in the southern hemisphere and the other from the Canary Islands in the north. What's jaw-dropping about this picture is the S-shaped band of light: it's not the Milky Way. In these images, the Milky Way is behind the mountains on the horizon. The band is actually the interplanetary dust between the planets in our own Solar System (or rather, sunlight reflected off the dust orbiting along the ecliptic plane). Amazing. Also, the sight of the Magellanic clouds warms the cockles of my ex-Southern-hemisphere heart. There's something about seeing another galaxy with the naked eye that gives one context, y'know?