The goal of data scientists is to put themselves out of business

We are now in the era of big data 2.0, as defined by Foster Provost and Tom Fawcett in Data Science for Business. There’s growing interest in predictive analytics solutions powered by machine learning. As InsightsOne’s CEO Waqar Hasan puts it: “Predictive is the ‘killer app’ for big data.” Interestingly, McKinsey & Company has predicted a shortage of machine learning talent in the coming years, while at the same time services have started to appear that make machine learning and predictive analytics accessible to the masses. More of these services keep arriving: Apigee launched one last April, just a couple of months after buying InsightsOne.

One of the first things I learned when I took computer science classes at university was that it was our job to “put ourselves out of business.” There are things that we do by hand, and our job as computer scientists is to make programs that do the same things, then other programs that replace them and that are quicker, more reliable, require less maintenance, and so on. The same applies to data science.

Technology that replaces data scientists

Most of a data scientist’s time is spent creating predictive models: finding the variables that matter for making predictions, the right type of model, the best set of parameters, and so on. Work is being done to automate all of this, and so far it has resulted in solutions such as Emerald Logic’s FACET and in the creation of prediction APIs such as Google’s and Ersatz Labs’. These APIs abstract away the complexities of learning models from data. You focus on preparing the data (collecting, enriching and cleaning it) and send it to the API. The API then automatically creates a model and uses that model whenever you ask for predictions.
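To make that workflow concrete, here is a minimal sketch of what using such an API looks like from the caller's side. Everything in it is hypothetical: `ToyPredictionAPI`, `create_model` and `predict` are illustrative names, not the interface of Google's, Ersatz Labs' or any other real service, and the "model" is a deliberately naive lookup of the most common outcome. The point is only the shape of the interaction: prepare rows, hand them over, ask for predictions.

```python
from collections import Counter, defaultdict


class ToyPredictionAPI:
    """Hypothetical stand-in for a hosted prediction API (not a real service)."""

    def __init__(self):
        self._models = {}

    def create_model(self, name, rows, target):
        """'Upload' prepared rows; the service builds a model on its own."""
        # Assumption: the model is just the most common target value seen
        # for each combination of inputs, with a global fallback.
        by_inputs = defaultdict(Counter)
        overall = Counter()
        for row in rows:
            key = tuple(sorted((k, v) for k, v in row.items() if k != target))
            by_inputs[key][row[target]] += 1
            overall[row[target]] += 1
        self._models[name] = (by_inputs, overall)

    def predict(self, name, inputs):
        """Return the predicted target value for one set of inputs."""
        by_inputs, overall = self._models[name]
        key = tuple(sorted(inputs.items()))
        counts = by_inputs.get(key, overall)  # unseen inputs use the fallback
        return counts.most_common(1)[0][0]


# The caller only prepares data and asks questions (made-up example rows).
rows = [
    {"plan": "free", "active": "no", "churned": "yes"},
    {"plan": "free", "active": "no", "churned": "yes"},
    {"plan": "pro", "active": "yes", "churned": "no"},
    {"plan": "pro", "active": "no", "churned": "no"},
    {"plan": "free", "active": "yes", "churned": "no"},
]
api = ToyPredictionAPI()
api.create_model("churn", rows, target="churned")
prediction = api.predict("churn", {"plan": "free", "active": "no"})
print(prediction)
```

Nothing in the calling code mentions an algorithm or a parameter, which is exactly the abstraction these services aim for.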

These new tools imply a new paradigm in which no data scientist is involved, but everyone else in the company is: business execs set a vision, managers define specs for integrating predictions, software engineers work on implementation. This requires that everyone knows a bit about machine learning, but that can be rapidly learned even by non-technical types once you skip algorithms and theory to only focus on studying the core concepts, intuitions and possibilities of machine learning, and some key examples.

In fact, if the domain experts are in charge, they are better placed to incorporate domain knowledge into the system they’re building, by picking the right representations of the domain (“features”) to use to make better predictions.

It can only go further

Machine learning is a set of artificial intelligence techniques where “intelligence” is built by referring to example data.

We’re building artificial intelligence but still need manual model/algorithm selection and tuning? Surely we can come up with an intelligent and automatic way to do this! Hence a trend in artificial intelligence of building “meta AI algorithms” whose job is to find the right AI algorithm with the right parameters for a given problem.

The way to do this in machine learning can be principled, such as with probabilistic inference for setting parameters and finding weights to assign to features. It can also be brute force: the computational power that we have today allows us to rapidly test a multitude of possibilities and see what works well. This brute force can be regular cross-validation or it can be based on evolutionary techniques, as is the case with Emerald Logic’s FACET.
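The brute-force route via cross-validation can be sketched in a few lines. This is an illustrative toy, not how FACET or any particular product works: the model is a one-dimensional nearest-neighbour classifier, the only setting being searched is the number of neighbours, and the function names and data are made up for the example.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(data, k, folds=5):
    """Fraction of points classified correctly when each is held out once."""
    correct = 0
    for i in range(folds):
        test = data[i::folds]  # every folds-th point, starting at i
        train = [p for j, p in enumerate(data) if j % folds != i]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(data)


def brute_force_select(data, candidates):
    """Try every candidate setting and keep the one that validates best."""
    scores = {k: cross_val_accuracy(data, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Made-up data: values below 10 are "low", the rest "high".
data = [(x, "low" if x < 10 else "high") for x in range(20)]
data[3] = (3, "high")  # one deliberately noisy label
best_k, scores = brute_force_select(data, candidates=[1, 3, 5, 7])
print(best_k, scores)
```

The same loop extends to searching over model types as well as parameters; the cost grows with the number of combinations, which is why cheap computation makes this approach practical.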

Detecting (and thus exploiting) domain specificities by looking at data starts with simple things. For instance, if we see data for a binary classification task in which the classes are heavily imbalanced, then an anomaly detection algorithm is likely to be a better fit than a standard classifier.
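A check like this is simple to automate. The sketch below looks only at the labels and suggests a framing for the problem; the function name and the 5% cutoff are assumptions chosen for illustration, not an established rule.

```python
from collections import Counter


def suggest_approach(labels, imbalance_threshold=0.05):
    """Suggest a problem framing from the class balance of binary labels."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # Share of the data that belongs to the rarer class.
    minority_share = min(counts.values()) / len(labels)
    if minority_share < imbalance_threshold:  # arbitrary illustrative cutoff
        return "anomaly detection", minority_share
    return "binary classification", minority_share


# Made-up labels: 10 rare cases among 1,000 examples.
skewed = ["ok"] * 990 + ["fraud"] * 10
approach, share = suggest_approach(skewed)
print(approach, share)
```

A real system would look at more than the label counts, but the principle is the same: inspect the data first, then choose the technique.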

Because of the new tools, the role of the data scientist is evolving. It might be getting easier to become one, but what is certain is that doing the work of a data scientist is getting easier still, thanks to prediction APIs. That work ends up being performed by database and software engineers, which has led some to go as far as to say that data science is not a real thing. I would just say that data science is evolving.

In the world of prediction APIs, data scientists still have a role in helping teams use these APIs and become autonomous. If their expertise is needed, it should be in a supervisory role, and they should be much less involved than before for a similar result.

Most importantly, data scientists should keep working on automating more machine learning techniques. It’s encouraging to see that, after supervised learning, we’re now seeing reinforcement learning APIs. Also, work is needed to create simple formalisms that allow domain experts to describe more specificities of the domain and encode their knowledge into algorithms.

“If we can get usable, flexible, dependable machine learning software into the hands of domain experts, benefits to society are bound to follow.” — Dr Kiri L. Wagstaff, researcher at NASA JPL