Saturday, 3 March 2018

A.I. will be like Electrical Current

I am just returning from bitkom's Big-Data.AI Summit 2018: two days packed with talks about data, A.I., data science, and everything in between. The conference attracted about 1,200 people, mostly from industry, but also a few from academia (like myself). It was a very inspiring event. Congrats to bitkom for pulling this off!

A selection of some loosely connected takeaways:

Wave–particle duality => A.I.-Big Data duality: A.I. and Big Data are like wave and particle in physics: they are two views on the same thing. It therefore does not make sense to talk about A.I. without talking about (Big) Data, and vice versa; the two are heavily intertwined. That is why it was a very good idea to host both topics at the same conference. There should be much more interaction between these two fields, ahem, I mean "views".

Bla bla bla: The confusion in industry about all these buzzwords like big data, data lakes, NoSQL, machine learning, AI, data science, <you name it> is insane. This insanity is good to a certain degree (when you want to sell stuff to laymen), but also quite bad (when you are trying to understand what this is all about). As academics, our role here is to lift the fog.

"big data"="data lake": These days, big data is typically read as either "large data" or "something with HDFS and a data lake". So, as observed in the past 10 years already, the term "big data" is a moving target.

It is a looonnnngggg pipeline. People sometimes ignore that the entire pipeline required to analyze data is pretty long. It includes collecting data, cleaning data, curating data, normalizing data, managing data, selecting the right features (which is often more important than picking the right machine learning model anyway), ..., and, yes, eventually doing some fancy, or often just very old-fashioned, machine learning. But getting there may take a while.
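To make the length of that pipeline a bit more tangible, here is a minimal sketch in Python. Everything in it is made up for illustration: the records, the field names (city, temp_c), and the stage functions collect(), clean(), normalize(), and select_features() are hypothetical stand-ins, and real stages are of course far messier.

```python
def collect():
    # Hypothetical raw records, as they might arrive from some source system.
    return [
        {"city": " Berlin ", "temp_c": "21.5"},
        {"city": "berlin", "temp_c": ""},        # measurement is missing
        {"city": "Hamburg", "temp_c": "18.0"},
    ]

def clean(records):
    # Drop records whose measurement is empty.
    return [r for r in records if r["temp_c"].strip()]

def normalize(records):
    # Bring city names into one spelling and parse the numbers.
    return [
        {"city": r["city"].strip().title(), "temp_c": float(r["temp_c"])}
        for r in records
    ]

def select_features(records):
    # Keep only what a model would actually consume.
    return [(r["temp_c"],) for r in records]

def run_pipeline():
    records = collect()
    for stage in (clean, normalize):
        records = stage(records)
    return records, select_features(records)

records, features = run_pipeline()
```

Note that not a single line of this is machine learning yet: the model would only come after select_features().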

Do you remember why data warehousing projects typically fail? Correct, it is not due to machine learning; it is due to difficult data cleaning and integration tasks. And many machine learning problems sound to me like good old warehousing projects where the final step, the data warehouse, is replaced by some ML/AI stuff.

No data. Some companies don't even have the data to do a meaningful analysis. Their task is then to identify which data should be collected in the future to allow for any meaningful analysis. This is much better than doing nothing: if you don't have the right data, you can't do meaningful analysis. Can you afford to wait another year until you even have some toy data to play with?

A.I. will be like electrical current. Supervised machine learning allows us to learn a function f(x)=y. We train models and learn that function with some data and then test it with some other data. In production, we use f() to make some prediction, i.e., we name the function predict():=f(). So we have one function. One.
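As a minimal sketch of that train/test/predict cycle, assuming a toy problem where the true relationship is y = 2x + 1 plus a little noise (the data and the helper names train() and mean_squared_error() are my own illustration, and plain least squares stands in for whatever model you prefer):

```python
import random

def train(points):
    """Fit y = a*x + b by ordinary least squares; return the learned function f."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def mean_squared_error(f, points):
    return sum((f(x) - y) ** 2 for x, y in points) / len(points)

rng = random.Random(0)  # fixed seed so the toy data is reproducible
# Hypothetical toy data: y = 2x + 1 with noise in [-0.1, 0.1].
data = [(x, 2 * x + 1 + rng.uniform(-0.1, 0.1)) for x in range(20)]
rng.shuffle(data)
train_set, test_set = data[:15], data[15:]

f = train(train_set)                           # learn f with some data
test_error = mean_squared_error(f, test_set)   # test it with some other data
predict = f                                    # in production: predict() := f()
```

Calling predict(10) then returns something close to 21, the value of the underlying function at x = 10.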

But any computer software has anywhere from a handful to zillions of functions. So what happens if we start gradually replacing those functions in our software? This was the topic of a very interesting workshop, which unfortunately discussed this effect only briefly. I mean, forget about the performance implications; these will be solved and are non-issues for many types of software anyway these days. What kind of software will that be where many of its parts are simply trained models? How probabilistic will that software be? What will control flow look like in such software? It will depend on the probabilistic outcome of a model. This sounds pretty scary. Or maybe not.
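Here is a small sketch of what such model-dependent control flow could look like. Everything in it is an assumption for illustration: spam_probability() is a stand-in for a trained model (its weights are set by hand, not learned), and route() and its threshold are hypothetical.

```python
import math

def spam_probability(text):
    """Stand-in for a trained model: maps a text to a score in [0, 1]."""
    # Hypothetical hand-set weights; a real model would have learned these.
    weights = {"free": 1.5, "winner": 2.0, "meeting": -1.5}
    score = sum(weights.get(word, 0.0) for word in text.lower().split())
    return 1.0 / (1.0 + math.exp(-score))  # logistic function

def route(text, threshold=0.8):
    """Ordinary control flow, but the branch taken depends on a model output."""
    p = spam_probability(text)
    if p >= threshold:
        return "quarantine"
    if p <= 1 - threshold:
        return "inbox"
    return "ask_user"  # the model is unsure, so fall back to a human
```

The if statements look like classical code, but which branch runs is decided by a probability. The third branch is one possible answer to the "scary" part: make the uncertainty explicit instead of hiding it.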

So, imagine your favorite word processor (or database system or whatever) being implemented as a bunch of functions which are actually trained models. This sounds pretty nondeterministic. But hey, this might actually be an improvement over current word processors (or whatever software you have in mind), and more deterministic and robust than what we have today...

So if you say "ML/AI will become a commodity" or "ML/AI is the new oil", well, I feel this is probably not even strong enough. The atoms and molecules of software are functions. And those atoms and molecules will not necessarily be hand-crafted and coded anymore. Soon.

This also applies to hardware. Some or many of those functions will be replaced by trained models. These functions will be everywhere and they will be used everywhere.