For data scientists, creating a perfect statistical model is all for naught if the compute power required is prohibitive. We need tools to assess the performance impact of modeling alternatives.

Big data is all about advanced analytics at extreme scales. Data scientists are, among many other things, key application developers for this new age. Quite often, the statistical models they build become production assets that must scale with performance commensurate with the volume, velocity, and variety of the business's analytic workloads.

However, most data scientists are, at heart, statistical analysts. While conducting their deep data explorations, they may not focus on the downstream production performance of the analytic models they build and refine. If the regression, neural network, or natural language processing algorithms they've incorporated don't scale under heavy loads, the models may have to be scrapped or significantly reworked before they can be considered production-ready.

Here's where devops can assist. Devops is a software development method that stresses collaboration and integration between developers and operations professionals. It's not yet in the core vocabulary of business data scientists, but it should be. Intensifying performance requirements on advanced analytics will bring greater focus on the need for rapid, thorough performance testing of analytic models in production-grade environments. As these needs grow, the mismatches in perspective and practice between data scientists (who may treat performance as an afterthought) and IT administrators (who live and breathe performance) will become more acute.

A recent article on applied predictive modeling echoed my concern. In reviewing a book on the topic, author Steve Miller offered this observation:

One critique I have of statistical learning (SL) pedagogy is the absence of computation performance considerations in the evaluation of different modeling techniques. With its emphases on bootstrapping and cross-validation to tune/test models, SL is quite compute-intensive. Add to that the re-sampling that's embedded in techniques like bagging and boosting, and you have the specter of computation hell for supervised learning of large data sets. In fact, R's memory constraints impose pretty severe limits on the size of models that can be fit by top-performing methods like random forests. Though SL does a good job calibrating model performance against small data sets, it'd sure be nice to understand performance versus computational cost for larger data.
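Miller's point about resampling costs can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only; the function name and the workload numbers (20 hyperparameter settings, 10-fold cross-validation, 500 bagged trees) are my own assumptions, not figures from the review:

```python
# Back-of-envelope cost of resampling-based model tuning.
# Cross-validation and bagging multiply the number of base-learner
# fits, which is what makes statistical learning so compute-intensive.

def total_model_fits(n_param_settings, k_folds, n_resamples):
    """Each hyperparameter setting is cross-validated over k folds,
    and each fold's fit may itself involve n_resamples bootstrap fits
    (e.g., the trees in a bagged ensemble or random forest)."""
    return n_param_settings * k_folds * n_resamples

# A modest grid search: 20 settings x 10 folds x 500 trees per fit
fits = total_model_fits(20, 10, 500)
print(fits)  # 100000 base-learner fits for a single tuning run
```

The multiplication is the whole story: each resampling layer scales the total fit count by its own factor, which is why techniques that "do a good job calibrating model performance against small data sets" can become computation hell at production scale.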

I second that recommendation and take it a step further. It's always best to spot a resource-hogging algorithm before it's too late. If you inadvertently build one into your production big data analytic application, you'll pay the piper eventually. Either your company will need to invest in the expensive CPU, memory, storage, and interconnect capacity necessary to feed the beast -- or your data scientists will have to rebuild it all from scratch using more resource-efficient approaches.
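Spotting the resource hog early is, in the end, a measurement problem. As one minimal sketch of what such a check might look like, the Python snippet below times a model-fitting call and records its peak memory with the standard library's `tracemalloc`; the `profile_fit` helper and the `toy_fit` stand-in workload are hypothetical names of my own, not part of any particular toolchain:

```python
import time
import tracemalloc

def profile_fit(fit_fn, *args):
    """Measure wall-clock time and peak traced memory of a
    model-fitting call. fit_fn is any callable; swap in a real
    training routine to compare modeling alternatives."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fit_fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Stand-in "model fit": a CPU-bound loop standing in for training
def toy_fit(n):
    return sum(x * x for x in range(n))

result, secs, peak_bytes = profile_fit(toy_fit, 1_000_000)
print(f"{secs:.3f}s, peak {peak_bytes / 1e6:.2f} MB")
```

Running a harness like this against candidate models on production-sized samples, before they ship, is exactly the kind of routine performance testing that devops collaboration between data scientists and IT administrators would institutionalize.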