Data Science Versus Data Engineering

In the third installment of this blog, we told you that “the analytic discovery process has more in common with research and development (R&D) than with software engineering.” But – symmetrically, if confusingly – what comes after the discovery process typically has more in common with software engineering than with R&D.

The objective of our analytic project will very often be to produce a predictive model that we can use, for example, to predict the level of demand for different products next week, to understand which customers are most at risk of churning, or to forecast whether free cash flow next quarter will be above – or below – plan.

For that model to be useful in a large organization, we may need to apply it to large volumes of data – some of it freshly minted – so that we can, for example, continuously predict product demand at the store / SKU level. In other cases, we may need to re-run the model in near real-time and on an event-driven basis, for example if we are building a recommendation engine to try to cross-sell and up-sell products to customers on the web, based both on the historical preferences that they have expressed in previous interactions with the website and on what they are browsing right now. And in almost all cases, the output of the model – a sales forecast, a probability-to-churn score, or a list of recommended products – will need to be fed back into one of the transactional systems that we use to run the business, so that we can take some useful action based on the insight that the model has provided us.

To deliver any value to the business, then, we may need to take a model built in the lab from brown paper and string and use it to crunch terabytes, or petabytes, of data on a weekly, daily – or even hourly – basis. Or we may need to simultaneously perform thousands of complex calculations on smaller datasets – and to send the results back to an operational system within only a few hundred milliseconds. Achieving those sorts of levels of performance and scalability will require that we build a well-engineered system on top of a set of robust and well-integrated technologies.


Our system may have to ingest data from several sources, integrate them, transform the raw data into “features” that are the input to the model, crunch those features using one or more algorithms – and send the resulting output somewhere else. When you hear data engineers talking about building “machine learning pipelines”, it is this fetch-integrate-transform-crunch-send process that they are referring to.
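The shape of such a pipeline can be sketched in a few lines of plain Python. Everything here is illustrative: the function names, the toy records, and the hard-coded scoring weights (which stand in for a trained model) are all hypothetical, and a production pipeline would read from real data stores, call a real model, and write to a real downstream system.

```python
def fetch():
    """Fetch raw records from (hypothetical) source systems."""
    sales = [{"customer_id": 1, "spend": 120.0},
             {"customer_id": 2, "spend": 30.0}]
    web = [{"customer_id": 1, "visits": 8},
           {"customer_id": 2, "visits": 1}]
    return sales, web

def integrate(sales, web):
    """Join the two sources on their shared key."""
    visits = {r["customer_id"]: r["visits"] for r in web}
    return [{**r, "visits": visits.get(r["customer_id"], 0)} for r in sales]

def transform(records):
    """Turn raw fields into the features the model expects."""
    return [{"customer_id": r["customer_id"],
             "features": [r["spend"] / 100.0, r["visits"] / 10.0]}
            for r in records]

def crunch(rows, weights=(0.6, 0.4)):
    """Score each row; a weighted sum stands in for a trained model."""
    return [{"customer_id": r["customer_id"],
             "score": sum(w * f for w, f in zip(weights, r["features"]))}
            for r in rows]

def send(scores, sink):
    """Deliver scores to a downstream (here: in-memory) system."""
    sink.extend(scores)
    return sink

# Run the full fetch-integrate-transform-crunch-send chain.
sink = []
sales, web = fetch()
send(crunch(transform(integrate(sales, web))), sink)
```

The engineering challenge described below is precisely that each of these five stages – trivial here – must be made robust, monitored, and fast at production scale.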

Building, tuning and optimizing these pipelines at web and Internet of Things scale is a complex engineering challenge – and one that often requires a different set of skills from those required to design and prove the prototype model that demonstrates the feasibility and utility of the original concept. Some organizations put data scientists and data engineers into multi-disciplinary teams to address this challenge; others focus their data scientists on the discovery part of the process – and their data engineers on operationalizing the most promising of the models developed in the laboratory by the data scientists. Both of these approaches can work, but it is important to ensure that you have the right balance of both sets of skills. Over-emphasize creativity and innovation and you risk creating lots of brilliant prototypes that are too complex to implement in production; over-emphasize robust engineering and you risk diminishing marginal returns, as the team focusses on squeezing the last drop from an existing pipeline, rather than considering a completely new approach and process.

Of course, not every analytic discovery project will automatically result in a complex implementation project. As we pointed out in a previous blog, the “failure” rate for analytic projects is relatively high – so we may go several times around the CRISP-DM cycle before we settle on an approach worth implementing. And sometimes our aim may be merely to understand. For example, a bricks-and-mortar retailer might want to identify and to understand different shopping missions – and “implementation” might then be about making changes to ranging and merchandising strategies, rather than about deploying a complex, real-time software solution.

Whilst there are several ways of employing the insight from an analytic discovery project, the one thing that they all have in common is this: change. As a former boss once said to one of us: old business process + expensive new technology = expensive old business process. And achieving meaningful and significant change in large and complex organizations is never merely about data, analytics and engineering – it’s also about organizational buy-in, culture and good change management. Whilst data scientists and data engineers often have different backgrounds and different skill sets, one thing that they often have in common – and that they may take for granted in others – is a belief that the data know best. Since plenty of other stakeholders see the business through a variety of entirely different lenses, securing the organizational buy-in required to action the insight derived from an analytic project is often as complex a process as the most sophisticated machine learning pipeline. Involve those other stakeholders often and early if you want to discover something worth learning in your data – and if you want that learning to change the way that you do business.