Daily news about using open source R for big data analysis, predictive modeling, data science, and visualization since 2008

April 2010

April 30, 2010

If you're being particularly cost-conscious about your use of printer ink or toner, you may be wondering which font you should choose to minimize ink use. Here's an infographic with the answer:

This is an interesting infographic in its own right, but what makes it cool is that these are not photoshopped images of Bic biros. Matt Robinson created this chart by writing the word "Sample" on a wall using a brand new biro for each font; the amount of ink remaining indicates the density of the type. In other words, the data is the chart. Brilliant! Click the link below to see how he did it.

April 29, 2010

Let's face it: you can do some pretty awesome things with R -- statistical models, beautiful charts, you name it -- but if the only way to do those things is from the R command line, you're restricting the audience for all this awesomeness to a small subset: R programmers.

What if you could get the results of R programs, charts and data, into a Web application? R's an open system, so it's definitely possible. To get started, Neil Saunders shows us how to build a simple Web application that displays data and a chart from an R script, by serving JSON or CSV from Rails to RApache and sending a result back to Rails. As he points out, it's not a complete solution -- "a real application would require more model or controller methods, views, R functions, error checks and prettier views, perhaps with a dash of AJAX thrown in", but it's a great starting point.
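To give a flavor of the RApache side of such a setup, here's a minimal sketch of a handler script that returns summary statistics as JSON. (This is my illustration, not Neil's actual code: it assumes Apache is configured to run the script via RApache and that the rjson package is installed, and the statistics themselves are stand-ins for real analysis results.)

    # Minimal RApache handler sketch: compute something in R and
    # return it to the caller (e.g. a Rails app) as JSON.
    library(rjson)
    setContentType("application/json")  # RApache helper: set the response MIME type
    x <- rnorm(100)                     # stand-in for a real analysis
    result <- list(n = length(x), mean = mean(x), sd = sd(x))
    cat(toJSON(result))                 # write the JSON body to the response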

April 28, 2010

Steve Miller has posted his interview with Revolution's CEO Norman Nie at Information Management blogs. In the interview, Steve digs into Norman's motivations for taking on a new venture around R after his successes with SPSS, and how what he learned there applies to Revolution Computing. Also up for discussion: the benefits and challenges of an open-source business model; the relationship between academia and R; and how analytics can be relevant for a non-expert audience. Read Part 1 of the interview at the link below; a follow-up will appear next week.

April 27, 2010

Conditional probabilities are the bane of many students of Statistics, but statements of conditional probability come up surprisingly often in real life. For example, as Steven Strogatz writes in the New York Times, when doctors are asked to estimate the probability that a woman has breast cancer given a positive mammogram test result, most get the answer wildly wrong, despite being given the population frequency of breast cancer and the conditional probability of false positives from a mammogram test. Here's one doctor's experience trying to come up with a number:

“[He] was visibly nervous while trying to figure out what he would tell the woman. After mulling the numbers over, he finally estimated the woman’s probability of having breast cancer, given that she has a positive mammogram, to be 90 percent. Nervously, he added, ‘Oh, what nonsense. I can’t do this. You should test my daughter; she is studying medicine.’ He knew that his estimate was wrong, but he did not know how to reason better. Despite the fact that he had spent 10 minutes wringing his mind for an answer, he could not figure out how to draw a sound inference from the probabilities.” [The correct answer is 9 percent.]

Most students (and doctors!) are taught to use Bayes' Theorem to calculate marginal probabilities from conditional probabilities, but as Strogatz points out, this isn't exactly an intuitive calculation, with the dividing of probabilities by probabilities and all. He suggests a more intuitive (but slightly less accurate) method is to think instead about frequencies within concrete groups and sub-groups. For the mammogram test, the calculation becomes:

Eight out of every 1,000 women have breast cancer. Of these 8 women with breast cancer, 7 will have a positive mammogram. Of the remaining 992 women who don’t have breast cancer, some 70 will still have a positive mammogram. Imagine a sample of women who have positive mammograms in screening. How many of these women actually have breast cancer?

Since a total of 7 + 70 = 77 women have positive mammograms, and only 7 of them truly have breast cancer, the probability of having breast cancer given a positive mammogram is 7 out of 77, which is 1 in 11, or about 9 percent.
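If you'd like to check the arithmetic, here's a quick R sketch using the rates from the example; the frequency method and Bayes' Theorem arrive at the same 9 percent:

    # Rates from Strogatz's mammogram example
    p_cancer  <- 8 / 1000    # prevalence of breast cancer
    p_pos_yes <- 7 / 8       # P(positive test | cancer)
    p_pos_no  <- 70 / 992    # P(positive test | no cancer)

    # Frequency method: counts within a concrete group of 1,000 women
    true_pos  <- 1000 * p_cancer * p_pos_yes        # 7 women
    false_pos <- 1000 * (1 - p_cancer) * p_pos_no   # 70 women
    true_pos / (true_pos + false_pos)               # 7/77, about 0.09

    # Bayes' Theorem, dividing probabilities by probabilities,
    # gives the same answer directly
    p_cancer * p_pos_yes /
      (p_cancer * p_pos_yes + (1 - p_cancer) * p_pos_no)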

This method is frowned upon by textbooks, because it's not as accurate (in the example above, rounding to whole numbers of women in the groups), and because it implicitly assumes that the frequency of the event (here, breast cancer) is determined solely by the probability, with no accounting for variation. But it is an intuitive method for understanding conditional probability, one that seems more likely (ha!) to yield a reasonably accurate answer for many people.

Read the rest of Strogatz's article for other examples of intuitive conditional probability calculations, including a great example from the OJ Simpson trial.

April 23, 2010

This one's for the musicians out there. (By the way, in my purely anecdotal experience, musical aptitude appears to have a higher-than-expected representation amongst stats folks. I, however, am the exception that proves the rule, as anyone who's suffered through my Rock Band vocals can attest. But I digress.) What do the chords C#minor, A, E and B have in common? Quite a lot, as it turns out. Australian comedy group Axis of Awesome explain. (Some language NSFW.)

At the Information Management blogs, Steve Miller has posted a great roundup of last weekend's R/Finance 2010 conference in Chicago. Here's Steve's overall take:

This year's conference was even better than the 2009 inaugural, the in-excess-of-200 participants consumed by more than 20 consecutive high-powered presentations over the fast-paced day and a half. And while I'm a quantitative finance welterweight at best, there was plenty to pique my interest, including the latest developments to scale R for size and performance.

Check out the rest of Steve's post for a great review of the other talks at the conference.

As Steve mentions, analysis of large data sets was a big focus of the conference, with at least six presentations on the topic, including my own. I talked about a research project we've been working on at REvolution for a while: making data processing and statistical analysis techniques for huge data sets available in REvolution R, breaking the bottlenecks of single-CPU processing, slow disk I/O, and the limits of RAM on a single machine. I deviated from the pre-advertised title, and the title in the slides, "A Herd of Unicorns" (download as PDF), may require a little explanation out of context. The "unicorn" here is something powerful and (at least today) mythical: the combination of analytic algorithms for really large data sets and a flexible programming environment that enables modern statistical analysis -- exploration, data manipulation, visualization, model evaluation. In other words, the R environment. And if you had the freedom to do large-scale data analysis in R, while making use of the power of multiple machines in a cluster or in the cloud, then that would be, well, a herd of unicorns. We're working hard to make that fantasy a reality, soon.

April 22, 2010

There's a new local R User Group in San Diego (CA, USA), and they're meeting tonight. If you're in the area, why not RSVP and come along? The topic looks great:

Our speaker, Scott Wallihan, will be covering how to expand R's functionality through custom packages. This topic will be covered over two meetings. In our April meeting, we will cover the basics of creating R packages in standard R code, and also in C and C++. In the following meeting, we will cover how to write high-performance R packages using multithreaded C/C++ programs, and also CUDA, enabling R programs to leverage the highly parallel computational power of NVIDIA's GPU compute devices, including ordinary video cards.
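If you'd like a taste of the compiled-code approach before the meeting, here's a minimal sketch (my example, not Scott's material) using the inline package from CRAN to embed a small C function directly in an R session:

    # Prototype a compiled extension without building a full package.
    # Assumes the 'inline' package is installed: install.packages("inline")
    library(inline)
    vec_sum <- cfunction(signature(x = "numeric"), language = "C", body = "
      int i, n = length(x);
      double s = 0.0;
      for (i = 0; i < n; i++) s += REAL(x)[i];
      return ScalarReal(s);
    ")
    x <- runif(10)
    vec_sum(x)  # same result as sum(x), computed in C

A full package would compile code like this in its src/ directory and call it via .Call() -- presumably the territory the talks will cover.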

The latest version of R from the R Project, R 2.11.0, is now available in source code form. Binaries for Windows, Mac and Linux will appear in your local CRAN mirror in the next few days. Some new features include:

Support for rendering bitmap images in graphics devices, via a new function rasterImage()

The new function vapply() is like sapply(), but checks each result for consistency against a template you specify (quick sketches of both new functions appear below)
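Here are quick, self-contained sketches of both new functions (the data is made up for illustration):

    # rasterImage(): draw a bitmap onto an existing plot, with its
    # corners given in user coordinates -- here, a random 10x10
    # grayscale image
    img <- as.raster(matrix(runif(100), nrow = 10))
    plot(0:1, 0:1, type = "n", xlab = "", ylab = "")
    rasterImage(img, 0.25, 0.25, 0.75, 0.75)

    # vapply(): like sapply(), but every result must match the
    # declared template (here, a single numeric) or an error is thrown
    x <- list(a = 1:5, b = rnorm(10), c = runif(3))
    vapply(x, mean, FUN.VALUE = numeric(1))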

This release also marks support of the 64-bit Windows platform by the R Project for the first time, with the assistance of members of the R Core Group. REvolution Computing pioneered support for R on 64-bit Windows: it's been a major platform for our REvolution R Enterprise product for almost 2 years. While R 2.11.0 is built using the free MinGW compiler, REvolution R Enterprise is built using a commercial toolchain, and links to multi-threaded libraries for improved processing speed on multi-core systems.

April 21, 2010

I had the great pleasure of sitting down for a beer with Steve O'Grady (from the open-source analyst group RedMonk), at the MySQL conference last week. It was great to get the perspective of someone who knows the tech industry so well, sees predictive analytics as a hot area, and is taking an active interest in statistics and R (Steve has been getting into R programming recently). I asked him, amongst all the software tools available, why choose to learn R for predictive analytics? He answered with a great analogy to scuba diving, which he just shared on his blog:

I had the opportunity to dive in a lot of interesting places, from Key West to Cayman Brac to Bonaire to plain old Rockport, MA. One of the things I noticed was that most of the professionals, pretty much to a person, used the same BCD [scuba equipment]: workman-like, beat up Scubapro designs. Ugly, even industrial-looking, but functional. Day after day, dive after dive.

Which begged the question that so many ask themselves in so many industries: what did I know about diving that the professionals did not?

Exactly. My next BCD, which I still own today, was a Scubapro.

I relate this story here because I told it to REvolution Computing’s David Smith last week to explain our interest in the R language.

I love this analogy. R today may not be as pretty as some of the alternatives (though we do have big plans for REvolution R), but it sure is functional, reliable and powerful. And that's why the professionals are using it.