Strata + Hadoop World 2013 Recap

[This article was first published on Revolutions, and kindly contributed to R-bloggers].

We're back from NYC after a very busy Strata + Hadoop World 2013 conference. Many thanks to all the friendly folks who dropped by the Revolution Analytics booth, attended Joe and Antonio's R and Hadoop tutorial, or simply came up to say hello during the event. It was a jam-packed conference, literally standing room only at many times, so it's great to hear it will be moving to a larger venue next year for a little more breathing room.

Monsanto's presentation on using geospatial data to help farmers increase crop yield was fascinating. The focus was on planting and growing, and making use of the massive quantities of data generated by sensor-laden farm equipment (these aren't your granddad's tractors we're talking about!). By measuring the physical layout of fields (aspect, slope) and the spatially-tagged data on soil characteristics and yield measured every 18 inches, Monsanto can provide planting recommendations to farmers, enabling them to grow more food on the same fields. For example, a grower can vary the planting density and the fertilizer used across the field almost on a plant-by-plant basis, to account for the varying fertility, soil types and environmental exposure on the ground. Monsanto has an impressive architectural stack to analyze the 10 TB of data collected, including Cloudera's Hadoop for the data, HBase and Solr for queries, and Revolution R Enterprise for the statistical analysis. The slide presentation includes a diagram of the stack.

eHarmony had an interesting and fun presentation about the Data Science of Love: using affinity matching to try to predict which couples are most likely to have a long-lasting relationship, while presenting a diversity of options to a customer. eHarmony uses massive quantities of data to support this operation: technologies include Hadoop and Scala for data, and R and Vowpal Wabbit for the analysis. There's lots more info in the slide presentation.

Wes McKinney, founder and CEO of DataPad, shared his experiences as the developer of Pandas on building effective data science workflows from a variety of tools. His basic point was that there is a lot of value in building tools that let developers be more productive at data analysis: interfaces that provide elegance, simplicity and time savings are important, even if the functionality might already exist. Wes cited Hadley Wickham's plyr and ggplot R packages as examples, and he's working on a new project of his own as well. One particular nugget from his talk that was new to me: one of the reasons so many tools built on top of SQL databases have such terrible support for time series analysis is that SQL itself is terrible at time series operations. (Unfortunately, the slides from Wes's talk haven't yet been published to the Strata website.)
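
To make the time series point a little more concrete, here is a minimal sketch of one typical operation: resampling irregularly timed readings onto a regular hourly grid, carrying the last known value forward. This is my own illustration and not something from Wes's talk; the function name `resample_ffill` and the sample readings are hypothetical. Because SQL treats rows as unordered sets, this kind of "most recent value as of time t" lookup typically needs window functions or a self-join there, whereas in a general-purpose language it is a single ordered pass over the data.

```python
from datetime import datetime, timedelta


def resample_ffill(observations, start, end, step):
    """Resample irregular (timestamp, value) pairs onto a regular grid.

    Each grid point takes the most recent observation at or before it
    (forward fill). Grid points before the first observation get None.
    """
    observations = sorted(observations)
    result = []
    last_value = None
    i = 0
    t = start
    while t <= end:
        # Consume every observation that has occurred by time t.
        while i < len(observations) and observations[i][0] <= t:
            last_value = observations[i][1]
            i += 1
        result.append((t, last_value))
        t += step
    return result


# Hypothetical sensor readings taken at irregular times.
readings = [
    (datetime(2013, 10, 28, 9, 2), 10.0),
    (datetime(2013, 10, 28, 9, 47), 12.5),
    (datetime(2013, 10, 28, 11, 15), 11.0),
]

hourly = resample_ffill(
    readings,
    start=datetime(2013, 10, 28, 9, 0),
    end=datetime(2013, 10, 28, 12, 0),
    step=timedelta(hours=1),
)

for timestamp, value in hourly:
    print(timestamp.isoformat(), value)
```

Libraries such as Pandas wrap this sort of logic in a one-line call, which is the kind of productivity gain the talk was arguing for.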

Roger Magoulas, who is stepping into Edd Dumbill's shoes as the Strata co-presenter with Alistair Croll starting next year, presented interim results of the Data Scientist Salary Survey (follow this link for the video). One result from the survey so far: the most commonly used software tool is SQL (71% of respondents), followed by R (43%) and then Python (41%). The survey is still open: if you're a data scientist, you can contribute to the survey here.

There were plenty of other great presentations, but I couldn't make it to even a fraction of them. It's worth browsing through the conference program — many of the talks have video and/or slides for you to view.