The Bay Area R User Group Meeting on Data Mining with R

Put up a poster that says something like “Data Mining with R” anywhere in the Bay Area and you will surely draw a crowd. But it was still a bit of a surprise that the monthly meeting of the Bay Area R User Group was so well attended. At one point there were 160 people on the meetup list signed up to attend the event, and 79 people on the waiting list. (BARUG members are either excessively optimistic or they have some good models of the dynamics of waiting lists.)

George Roumeliotis, our host for the evening and Data Scientist at Intuit, began the meeting by welcoming the attendees. The announcements included a request for BARUG members to submit ideas for speakers and topics for 2012 to the BARUG organizers at [email protected], as well as an offer from BARUG sponsor Revolution Analytics to allow BARUG members to test-drive Revolution R Enterprise 5.0 free of charge for 90 days.

Sanjiv Das, a speaker who would make the lead-off spot on anybody’s R lecture team, delivered the first talk, a summary of his 60-page paper on the identification and analysis of Venture Capital communities. In 10 minutes, with skills no doubt honed by lecturing to twitter-tuned students, Professor Das presented simple R code wrapped around an implementation of the walktrap algorithm, showed how to identify the communities, and then glided through the econometric arguments that show the communities to be influential. (Sanjiv Das's excellent talk from the November meetup, Using R in Academic Finance, is also available online.)
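For readers curious about the mechanics, the walktrap algorithm is implemented in the igraph package. The snippet below is a minimal illustrative sketch on a toy graph, not the code from Professor Das’s talk:

```r
# Hypothetical sketch: walktrap community detection with igraph
# (not the code presented in the talk).
library(igraph)

# Toy co-investment network: edges link investors that co-invested
g <- graph_from_literal(A-B, A-C, B-C, D-E, E-F, D-F, C-D)

# Walktrap detects communities via short random walks,
# which tend to stay trapped within densely connected groups
wc <- cluster_walktrap(g)

membership(wc)  # community label assigned to each vertex
```

On a real data set, the graph would be built from a data frame of co-investment pairs with `graph_from_data_frame()` before running the same community-detection step.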

Next up, Anthony Sabbadini, founder of Economic Risk Management, a San Francisco start-up, presented different ways to visualize a company’s supply chain. In addition to highlighting the ease with which R code can be made to work with other systems, Anthony’s mash-up of NOAA weather data with truck and rail shipments showed the kind of aesthetic sensibility that grabs your attention and draws you into the data.

The third speaker was Nicholas Lewin-Koh, a statistician from Genentech and one of the BARUG organizers. In his 10 minutes, Nicholas covered 10 years of the history of optimization algorithms in R with the authority of someone who has been grappling with the nitty-gritty details of optimization challenges in statistical applications for at least that long. The big takeaway for people not working in this area is that R now has a rich variety of easy-to-use optimizers to choose from.
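As a small taste of how accessible these optimizers are, base R’s general-purpose `optim()` function can minimize an arbitrary objective in a couple of lines. This sketch uses the Rosenbrock function, a standard test problem, and is an illustration rather than an example from the talk:

```r
# Minimal sketch of base R's general-purpose optimizer, optim().
# The Rosenbrock "banana" function is a classic optimization test case
# with a known minimum at (1, 1).
rosenbrock <- function(x) {
  (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
}

# BFGS is a quasi-Newton method; optim() approximates the gradient
# numerically if none is supplied
fit <- optim(par = c(-1.2, 1), fn = rosenbrock, method = "BFGS")

fit$par    # should land near c(1, 1)
fit$value  # objective value at the solution, near 0
```

Packages such as optimx extend this interface with many more solvers behind a single calling convention.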

Batting “clean-up,” Giovanni Seni, now a Data Scientist at Intuit, provided an introduction to the cutting-edge work on Rule Ensembles being done in R. Starting with an example of decision trees viewed as conjunctive rule ensembles from the book “Ensemble Methods in Data Mining” that he and John Elder co-authored, Giovanni moved quickly to more complex examples. Giovanni showed an example of the kinds of regularized models with mixed linear and non-linear terms that can be fit with the Stanford R package RuleFit and the Toolkit for Multivariate Data Analysis with ROOT (TMVA). Low-key but eye-opening, Giovanni’s presentation provided a window into this research area.

Houtao Deng, also a data scientist at Intuit, followed with an overview of the general problem of feature selection in classification models, the pros and cons of both univariate and multivariate filter methods, and R packages that implement these methods. Pointers to the work of Isabelle Guyon on support vector machines with recursive feature elimination and Ramon Diaz-Uriarte on random forests with recursive feature selection opened up a whole new area for me. I think that anyone trying to sort through the literature in this field will find Houtao’s guidance on which feature selection methods are appropriate for various types of data sets (linearly separable, non-linearly separable, etc.) to be very valuable.
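To make the random-forest approach concrete, Diaz-Uriarte’s varSelRF package implements recursive feature selection by repeatedly fitting forests and dropping the least important variables. The following is a hedged sketch on simulated data, not code from the talk:

```r
# Illustrative sketch of recursive feature selection with random forests,
# using the varSelRF package (Diaz-Uriarte); not code from the meetup.
library(varSelRF)

set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100)  # 20 candidate features
colnames(x) <- paste0("f", 1:20)

# Only f1 and f2 actually drive the class label
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "A", "B"))

# Fit a forest, rank variables by importance, and iteratively
# drop the least important fraction until accuracy degrades
sel <- varSelRF(x, y, ntree = 500)

sel$selected.vars  # features that survive the elimination
```

The hope, as in this toy example, is that the procedure discards the 18 noise features and retains the two informative ones.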

Last up, Thomson Nguyen, Data Scientist from Lookout Mobile Security, delivered an engaging and informative talk on his work with the Heritage Health Care Kaggle competition. Any way you look at them, the raw HHC data sets are pretty ugly, but Thomson’s exposition of the preliminary data preparation and cleaning steps he worked through was so thoughtfully done and rationally laid out that it ought to be a paradigm for how to go about data cleaning. One might make different choices than Thomson on some of the big issues (throw away obviously bad observations, or impute) but his overall process and his parting advice to seek the help of domain experts in the cleaning process were spot on. Anybody serious about chasing the $3M prize should have a good look at Thomson’s work.

All of the speakers showed remarkable knowledge and discipline in presenting their topics within the very tight time limits, and Houtao Deng and his colleagues at Intuit did a first-class job of providing and preparing the venue.