Data Science automation has been a hot topic recently, with several articles about it[1]. Most of them discuss the so-called “automation” tools[2]. Too often, vendors claim that their tools can automate the Data Science process, giving the impression that combining these tools with a Big Data architecture can solve any business problem.

The misconception comes from the confusion between the whole Data Science process[3] and the sub-tasks of data preparation (feature extraction, etc.) and modeling (algorithm selection, hyper-parameter tuning, etc.), which I call Machine Learning. The issue is amplified by the recent success of platforms such as Kaggle (www.kaggle.com) and DrivenData (www.drivendata.org). Competitors are given a clearly defined problem and clean data, so choosing and tuning a machine learning algorithm is the main task, and participants are evaluated on metrics such as test set accuracy. In industry, data scientists are evaluated on the value they add to the business, not on algorithm accuracy. A project with 99% classification accuracy that is never deployed in production brings no value to the company.

I recently read how the winner of a Kaggle competition, Gert Jacobusse, spent his time solving the challenge[4]: “I spent 50% on feature engineering, 40% on feature selection plus model ensembling, and less than 10% on model selection and tuning”. This is very far from what I have experienced in industry, where the split is usually more like 10% on data preparation and modeling and 90% on the rest. I will explain below what I mean by “the rest”. Between news about tools that automate Data Science and the visibility of Data Science competitions, people with no industry experience may be confused into thinking that Data Science is only modeling and can be fully automated.

Automating Data Science (Source: Shutterstock)

On my blog[5], I listed the different Data Science steps and discussed which ones can be automated. The most complex and time-consuming tasks, such as defining the problem to solve, getting data, exploring data, deploying the project, debugging and monitoring, can’t be fully automated. And this is without mentioning the iterative aspect of the whole process (see the CRISP-DM figure). In a recent study from MIT[6], researchers said their tool bested more than 600 teams out of 900. What was the benchmark? Clearly defined, closed-world problems from Kaggle competitions. Such challenges don’t represent the heart of data scientists’ activities. It’s not that available tools are useless; on the contrary, they can free up time for the data scientist. Still, they don’t automate Data Science.
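To make concrete what these tools do automate, here is a minimal sketch of automated hyper-parameter tuning, the Machine Learning sub-task defined above. The library (scikit-learn) and dataset are my own illustrative choices, not the tools discussed in the cited articles; the point is that everything here is specified up front, a closed-world problem, unlike problem definition, data gathering or deployment.

```python
# Automated model tuning: a grid search over hyper-parameters with
# cross-validation. This is the part of the process tools can automate.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The search space and evaluation metric are fully specified in advance,
# exactly like a Kaggle challenge.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, 5, None]},
    cv=3,
)
search.fit(X_train, y_train)

accuracy = search.score(X_test, y_test)
print(search.best_params_, accuracy)
```

Notice that nothing in this snippet decides whether accuracy is the right business metric, whether the data answers the right question, or how the model will be deployed and monitored; those steps remain manual.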

Don’t get me wrong: Kaggle and the like are really good places to start learning about Machine Learning algorithms, and they will certainly improve your feature engineering and modeling skills. However, you won’t learn the main aspects of Data Science from these competitions: business problem definition, data gathering and cleaning, deployment, stakeholder management, email communications, presentation skills… well, “the rest”. A recent article claims that Data Science will be automated within a few years[7]. Machine Learning, as defined above, can be automated to a certain extent. A good example is the meta-mining framework described by Phong Nguyen in the Swiss Analytics Magazine #6. However, we are far from automating the whole Data Science process. Even for Machine Learning, we need specialists to develop new algorithms adapted to our business challenges, people who will make the field progress. Here is an interesting metaphor from Berry and Linoff[8], relating Data Science to photography:

“The camera can relieve the photographer from having to set the shutter speed, aperture and other settings every time a picture is taken. This makes the process easier for expert photographers and makes better photography accessible to people who are not experts. But this is still automating only a small part of the process of producing a photograph. Choosing the subject, perspective and lighting, getting to the right place at the right time, printing and mounting, and many other aspects are all important in producing a good photograph.”

The main reason Data Science is difficult to automate is that business challenges are, by definition, ill-posed open-world problems. To the often-asked question “Will machines replace Data Scientists?”, my answer is “Yes, just after all the other jobs in the world”.

If you read this blog, you are very likely involved in some kind of data collection, manipulation or analysis. If not performed wisely, your analysis will lead you to incorrect conclusions. Alex Reinhart, in his book Statistics Done Wrong, lists several concepts that are key when analysing data, such as statistical power, correlation versus causation and publication bias.

Superforecasting – by Tetlock and Gardner – describes the huge study Tetlock performed on people’s ability to predict future events (mainly geo-political). The closed questions (i.e. yes/no choices) are far from the real numbers you will predict in business forecasting. Tetlock discusses the skills that have been identified as driving accurate forecasts. The authors’ point is that forecasting…

Big data is changing the way the financial world handles client interaction. No matter what sector data analytics is employed in (IT, marketing, sales, etc.), its implications are leading to a new wave of Business Intelligence (BI).

Any company that uses analytics on a daily basis will understand the ability of big data to transform customer relations and optimize management…

While working on forecasting (understand: time series analysis), I found several interesting and state-of-the-art articles by Rob J. Hyndman. He is the co-author, with George Athanasopoulos, of Forecasting: Principles and Practice, an excellent, concise and comprehensive text explaining the concepts behind forecasting, common algorithms and how to implement them in R (for a business…

“You don’t need to predict the future. Just choose a future — a good future, a useful future — and make the kind of prediction that will alter human emotions and reactions in such a way that the future you predicted will be brought about. Better to make a good future than predict a bad one.”

Big data is the latest competitive advantage for businesses. Data are now woven into every industry and function across the global economy. Big data will become a basis of competition and growth, enhancing productivity and creating significant value for the global economy through waste reduction and increased quality of products and services.