Machine Learning (Theory)


I have just started reading "Blink: The Power of Thinking Without Thinking" by Malcolm Gladwell. Like Freakonomics, this book has sold very well in the US, which made me curious about it. Overall, it is fun to read, although a bit disorganized. But what is especially striking to me is the main claim: humans can reason unconsciously. More precisely, there are many situations (and the book gives a large number of surprising yet convincing examples) where humans are able to perform difficult "classification" tasks unexpectedly fast. For example, some art experts can tell apart genuine sculptures from fakes virtually in a blink. Even more surprising: they are completely unable to explain what makes them think a specific sculpture is fake!

It thus seems (and there are plenty of psychological studies on this) that, with enough training, humans are able to learn very difficult classification tasks (classification in the classical Machine Learning sense), including tasks that are not natural ones.

Let me try to explain what is new here. We know that humans are very good at learning certain classification tasks: young children can classify objects from images very easily and achieve much better performance than any computer to date. We also know that, once this has been learned, the actual classification of a new image is done in a few milliseconds. Hence, with enough training, the brain is able to perform this complicated task very easily, without any conscious reasoning taking place.

However, I used to think that the tasks we can learn easily are those for which we have a sufficiently strong prior encoded into our genes. In other words, I thought that the ability to learn visual classification tasks was the result of a long natural evolution (which provides us with the appropriate pre-wiring, or the prior in Bayesian terms) combined with a short period of adaptation (similar to computing the posterior, in Bayesian terms again). What is new to me in this book is the following: we can be trained to perform tasks that have nothing to do with evolutionary constraints, and this training can happen unconsciously (without any explicit or conscious reasoning). One example given in the book: a tennis coach realized that he could predict whether a player would miss his serve just before the ball was hit, yet he was unable to explain why or how he could do so!

This may show that our brain hosts a powerful learning engine (with a powerful feature extractor to isolate the relevant information) that does not even require our attention to be triggered and that can deal with many different learning tasks. Of course, this raises the question of the prior: we know that there is no universally better learning algorithm, only algorithms better adapted to particular learning problems. In other words, we can only learn the problems that have a large enough weight under the prior, which means it is hard to be good simultaneously at many different tasks. Why is it that the prior encoded into our brain allows us to learn such useless tasks as telling whether a tennis player will miss his serve? And why is this prior not more "peaked" around the tasks that really matter for our survival? I guess this book touches on a lot of interesting cognitive science problems, but it also revived my interest in human learning and its relationship to Machine Learning...

It seems natural that the goal of any good Machine Learning algorithm should be to extract information from the available data. However, when you are faced with practical problems, this is not enough. More precisely, the data by itself does not hold the solution. One also needs "prior knowledge" or "domain knowledge". So far, nothing new.

But what matters is how to actually obtain and use this knowledge, and this is very rarely addressed or even mentioned! My point here is that building efficient algorithms should mean building algorithms that can extract and make maximum use of this knowledge. To achieve this, here are some possible directions:

A first step is probably to think about what the natural "knowledge bits" one may have about a problem are, and how to formalize them. For example, this can be knowledge about how the data was collected, what the features mean, what kind of errors can occur during data collection, and so on.

A second step is to provide simple but versatile tools for encoding prior knowledge. This can be done off-line, for example by allowing the probability distributions of a probabilistic framework to be customized, or on-line (i.e. interactively) with a trial-and-error procedure (based on cross-validation or on expert validation).

There is also a possibility to go one level higher: knowledge is often gained by integrating very diverse sources of information, and humans (as learning systems) are never isolated: every problem they can solve has some relationship to their environment. So, ideally, our systems should be able to integrate several sources and have some sort of meta-learning capability, rather than starting from scratch every time a new dataset is to be used and focusing only on that specific dataset.
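As a minimal illustration of the second direction above (encoding prior knowledge off-line in a probabilistic framework), here is a sketch in which a hypothetical expert belief about a regression slope is turned into a Gaussian prior and combined with data. All the names and numbers are illustrative choices of mine, not a prescribed method:

```python
import numpy as np

# Hypothetical scenario: an expert believes the slope relating temperature to
# yield is around +0.5. We encode this belief as a Gaussian prior N(0.5, tau^2)
# on the slope and combine it with noisy observations (all numbers illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=20)
y = 0.8 * x + rng.normal(0, 2.0, size=20)      # data generated with true slope 0.8

sigma2 = 4.0                                   # assumed noise variance
prior_mean, tau2 = 0.5, 0.25                   # expert's belief and its (un)certainty

# Maximum-likelihood slope (ignores the expert) vs. MAP slope (uses the prior).
w_mle = x @ y / (x @ x)
precision = x @ x / sigma2 + 1.0 / tau2
w_map = (x @ y / sigma2 + prior_mean / tau2) / precision

print(f"slope without prior knowledge: {w_mle:.3f}")
print(f"slope with the expert's prior: {w_map:.3f}")
```

The point is not the specific model, but that the expert's statement enters the computation through the prior, rather than through more data.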

All the above explains the title of my post, and to be more precise, I even tend to think that research efforts should be focused on knowledge extraction from experts rather than from data!!!

Finally, I would like to give examples of such extraction (we are not talking about sitting experts in a chair with electrodes connected to their brains, but simply about providing software that can interact with them a bit). Below is a (non-exhaustive) list of what a learning system can learn from its user:

Implicit knowledge (when the data is collected and put in a database)

data representation: the way the data is represented (the features that are used to represent the objects) already brings a lot of information and often a problem is solved once the appropriate representation has been found.

setting of the problem: the way the problem is set up (i.e. the choice of which variables are inputs and which are outputs, the choice of the samples, ...) also brings information.

Basic information (when the analysis starts)

choice of features: choosing the right features, ignoring those that are irrelevant...

Interactive knowledge: all the above can be repeated by iteratively trying various options. Each trial can be validated using data (cross-validation) or expertise (judging the plausibility of the built model).

As a final remark, let me just mention that this interactive mode is often used (although not explicitly) by practitioners who try several different algorithms and keep the one that seems best (on a validation set). Of course this creates a risk of overfitting, especially because the information brought by the interaction is very limited. Indeed, it simply amounts to the validation error, which cannot really be considered knowledge: this kind of interaction brings in more data (the validation data) rather than more knowledge. It would probably be interesting to formalize these notions a bit better...
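To make this "interactive mode" concrete, here is a minimal sketch (using scikit-learn on a synthetic dataset; the candidate models are arbitrary choices of mine) of the common practice of trying several algorithms and keeping the one with the best validation score. As noted above, the validation error of the selected model then becomes optimistically biased, which is exactly the overfitting risk mentioned:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real problem (purely illustrative).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "5-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
}

# "Interactive" selection: keep whichever model looks best on the validation set.
scores = {name: model.fit(X_train, y_train).score(X_val, y_val)
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print("selected:", best)
# Caveat: the winning validation score is now an optimistic estimate; an
# untouched test set would be needed for an honest assessment.
```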

Inductive inference usually proceeds in two steps. The first consists in constructing a set of assumptions that summarizes the knowledge one has about a phenomenon of interest prior to observing instances of it. The second consists in actually observing these instances and deriving new knowledge from the observations.

A possible question is: what principle may guide each of these steps? A possible answer is: be as rational as possible. In other words, try to avoid inconsistencies.

Regarding the second step, it is sometimes possible to formulate the problem as a purely deductive one. Indeed, the question is: "given such assumptions and given such data, what can I deduce?" For example, in a probabilistic framework, one would have a prior distribution and observations, and the aim would be to obtain an updated distribution. The rational way of doing this is to apply Bayes' rule. In other settings, when the assumptions are not formulated in a probabilistic language, or when the objective is to optimize some sort of worst-case performance, other rules could be used. The point is that once the objective is clearly and formally specified, rationality naturally leads to the solution via pure deduction.
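In the probabilistic setting just mentioned, the deductive step is nothing more than Bayes' rule: writing θ for the assumptions (the unknown hypothesis or parameter) and D for the observed data,

```latex
p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
\qquad
p(D) = \int p(D \mid \theta)\, p(\theta)\, \mathrm{d}\theta .
```

Everything on the right-hand side is fixed by the assumptions and the data, so computing the left-hand side is indeed a purely deductive operation.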

Regarding the first step (constructing the assumptions), the situation is less obvious. There are guiding principles, though, which again rely on rationality. One such principle is that of symmetry: if there is no reason to prefer one side of a coin to the other (or to assume that the two faces have different properties), simply consider them equally probable. A more elaborate version of this principle is the principle of maximum entropy: when choosing a prior distribution over a set of possibilities, choose, among the distributions that are consistent with your prior beliefs, the one with maximum entropy. Finally, there is also the principle of simplicity (Occam's razor), which suggests giving more prior weight to simple hypotheses than to complex ones.
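For reference, the maximum entropy principle can be written as a small optimization problem. For a discrete set of possibilities x_1, ..., x_n, and prior beliefs expressed (as in the textbook formulation, chosen here purely for illustration) as a constraint on the expectation of some known function f:

```latex
\max_{p}\; H(p) = -\sum_{i=1}^{n} p_i \log p_i
\quad \text{subject to} \quad
\sum_{i} p_i = 1, \qquad \sum_{i} p_i\, f(x_i) = c ,
```

whose solution has the exponential-family form p_i ∝ exp(λ f(x_i)), with λ chosen so that the constraint holds. With no constraint besides normalization, the solution is the uniform distribution, which recovers the symmetry principle.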

However, none of these principles can be justified in a fully formal way. One can surely construct settings where applying one specific principle is the "best" thing to do, but this is somewhat artificial and does not provide a justification.

Instead of proving things, I guess the best one can do is to give recommendations. One such recommendation is "be rational": in other words, try to take into account every piece of evidence you may have before observing the data, and do so in a way that does not lead to contradictions and does not expose you to more risk than you are willing to accept. So, in a way, inferences should take into account both your knowledge and your uncertainty, and be calibrated according to what you accept to lose if you fail. I like the idea that performing an inference is like betting on a horse race: you try to get as much information as you can about the horses, but you know there will always be some missing piece of information. Even if gambling is somewhat irrational, when you have no choice but to do it, you had better do it in the most rational way!

In a recent issue of The Economist, there is a very nice article (see here) about how everyday reasoning can be compared to Bayesian inference. The article is based on a recent paper by Griffiths and Tenenbaum (see here). What they did was to ask several people questions such as "How long do you think a man who is xx years old will live?". It turns out that the answers matched very well those that would have been obtained by applying Bayes' rule. Moreover, they tried this with several different types of questions, for which the implicit priors are very different (Gaussian, Erlang or power-law distributions), and in all cases the intuitive answers people gave had the right form (in terms of distribution).
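As a rough numerical sketch of the kind of computation involved (the Gaussian prior over total lifespans and its parameters below are my own illustrative choices, not the values used by Griffiths and Tenenbaum), one can assume that the man's current age is a uniformly drawn point within his unknown total lifespan, apply Bayes' rule, and report the posterior median:

```python
import numpy as np

def predicted_lifespan(current_age, prior_mean=75.0, prior_std=15.0):
    """Posterior median of the total lifespan given the current age.

    Prior over total lifespans: Gaussian (illustrative parameters).
    Likelihood: the current age is uniform on [0, total lifespan].
    """
    t = np.arange(1.0, 160.0)                              # candidate total lifespans
    prior = np.exp(-0.5 * ((t - prior_mean) / prior_std) ** 2)
    likelihood = np.where(t >= current_age, 1.0 / t, 0.0)
    posterior = prior * likelihood
    posterior /= posterior.sum()
    return t[np.searchsorted(np.cumsum(posterior), 0.5)]   # posterior median

for age in (18, 40, 60, 85, 96):
    print(f"current age {age:>2} -> predicted lifespan {predicted_lifespan(age):.0f}")
```

Under such a Gaussian prior, the predictions barely move for young ages and stay just above the current age for very old ones.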

What they conclude from this is that the way people intuitively reason about the world is quite similar to applying Bayesian inference.

What is intriguing is that the article in The Economist tries to see in this a proof that the Bayesian point of view dominates the frequentist one. Also, in the paper by Griffiths and Tenenbaum, the term "optimal" is used when talking about Bayes' rule. I think this is very misleading and inaccurate.

Indeed, the only conclusion one should draw from this study is that the way people naturally make inferences about events in the world is very much rational, and this confirms something that has been observed many times before: the intuitive notion of rationality we have matches very well the rules of the calculus of probabilities. But this is no surprise, because these rules were designed to be intuitively rational (what else?). What is interesting is that rationality necessarily leads to these rules and no others, but this has been known for years.

I do not see what this study has to do with the Bayesian vs frequentist debate. First of all, there is no real opposition between these points of view: they lead to the same rules for combining probabilities, and the only difference is in the meaning attached to these probabilities. So this debate is mostly philosophical and should not interfere with cognitive science studies, nor (even less) with machine learning.

I recently came across the webpage of Jeffrey Shallit, a very impressive computer scientist, and I saw that he gave a talk on a topic that may interest readers of this blog: Can a Computer Think? The slides are well documented and comprehensive, and he also has a reading list for this talk on his website. What I especially like about this talk is that it gives an interesting historical perspective, showing how many people predicted that computers would achieve some task in the near future, and how none of these predictions turned out to be correct.

Also of interest is the quote by Hofstadter who essentially says that "intelligence" is what computers cannot do. Indeed, once computers can do something, we start to think that it does not require intelligence.

But since the definition of "thinking" is very controversial, it might be a better choice to focus on simpler things like "learning" and to ask the question "Can a computer learn?". Of course, ML researchers are after exactly that, and to some extent it is clear that computers can learn.

However, if we try to define more precisely what learning is, there are several issues. In particular, there are at least three levels at which we can define the learning phenomenon:

Low-level: Ability to adapt to a (changing) environment

Medium-level: Ability to perform a task or to improve at performing a task without being taught explicitly (by practice or imitation)

High-level: Ability to infer general laws from particular instances (induction)

The first level is somewhat "unconscious" and is something that could be said of most animals. The second level is also something many animals can do. The third level is more "conceptual" and seems to require some "thinking". But it is not necessarily an exclusively human ability: when a dog learns that bringing back the stick will earn him a pat, this is also a kind of induction.

I am not sure the above distinction really makes sense, and it might be impossible to say which form of learning actually occurs in a specific situation. However, computers have clearly demonstrated all of them, at least in a very simple way.

I have been asked several questions revolving around the usefulness of research: why do you do research in mathematics when computers can do the calculations? Why do you do research in computer science, is it to build faster computers? What is the use of all these complicated calculations, is there any application to make money from?

At some point I used to answer that there is nothing more useful than something that seems useless, like a new mathematical theory. My argument was that things that are immediately useful are only useful immediately, while things for which we do not see any immediate application may very well turn out to lead to entirely new technologies in the long run. Take complex numbers as an example. When they were invented, they were considered a nice creation of the mind, something only a few mathematicians understood, and something that would never have any application in the real world. Centuries later, they are at the basis of fields of technology we could hardly live without, such as signal processing and electronics. Unfortunately, it is not easy to predict which of the many mathematical works done today will be most useful in a few centuries. There is thus potentially a lot of wasted effort.

Another thing I used to say is the following: it is more fruitful to build a theory that explains several phenomena than to solve a single specific problem (this is the difference between science and engineering). A quote I like is: "There is nothing so practical as a good theory". It is originally from Kurt Lewin (although some ML people attribute it to Vapnik because he uses it often).

Anyway, instead of trying to justify scientific research, it is probably more interesting to think about how this research should be conducted, and in particular about what the motivations of someone doing it should be. I recently found some interesting answers in the following quotes from Albert Einstein (taken from "The Einstein-Besso Manuscript", Scriptura, Aristophil 2005):

"My scientific work is motivated by an irresistible longing to understand the secrets of Nature and by no other feelings. My love for justice and the striving to contribute toward the improvement of human conditions are quite independent from my scientific interests."

"The important thing is not to stop questioning. Curiosity has its own reason for existing. One cannot help but be in awe when he contemplates the mysteries of eternity, of life, of the marvelous structure of reality. It is enough if one tries merely to comprehend a little of this mystery every day. Never lose a holy curiosity.

"To be sure, it is not the fruits of scientific research that elevate a man an enrich his nature, but the urge to understand, the intellectual work, creative or receptive."

"Where the world ceases to be the scene of our personal hopes and wishes, where we face it as free beings admiring, asking and observing, there we enter the realm of Art and Science."

"The most beautiful thing we can experience is the mysterious. It is the source of all true art and science, who can no longer pause to wonder and stand rapt in awe is as good as dead: his eyes are closed."

"After a certain high level of technical skill is achieved, science and art tend to coalesce in aesthetics, plasticity, and form. The greatest scientists are always artists as well.

"It is my inner conviction that the development of science seeks in the main to satisfy the longing for pure knowledge."

So, as a conclusion, the main motivation is curiosity, or the desire to understand; there should be no other. This is probably a bit idealistic, but what is life without a bit of idealism?

Learning theory is about the process of induction, that is, the process of building theories or models from observations. Most of what physicists do is actually induction about natural phenomena, so one may wonder whether there is some relationship between Physics and Learning Theory. One could argue that Physics uses induction in a particular setting, while Learning Theory studies induction in general, so that they cannot really be compared. But here are some surprising connections:

In quantum physics, Bell's inequality provides a test for the existence of "hidden variables" that would explain entanglement. This inequality is based on a statistical argument.

Some physicists study the connection between Bayes' formula for updating probabilities and the collapse of the wave function when a quantum system is measured (see e.g. the work of Christopher Fuchs). There are even "Bayesian" and "non-Bayesian" physicists, just like in the Machine Learning community!

Going even further, Lucien Hardy tries to rethink the way physical theories are built. His starting point is that the work of any physicist is to accumulate and correlate data, so he develops physical theories as theories of how data should be handled! (see e.g. http://arxiv.org/PS_cache/gr-qc/pdf/0509/0509120.pdf)

The formalization of the concept of probability has a long history. Probability Theory is now a well-founded and very mature part of Mathematics, mainly thanks to its axiomatization by Kolmogorov, who grounded the concept of probability in measure theory. It may thus seem that defining and combining probabilities can be done in a unique way, without any open questions. However, there is still a lot of disagreement on a crucial issue: the interpretation of probability. The problem here is that interpretation means connection to the real world. In that respect, the issue is not just a technical one but also a philosophical one, which explains why there can be many different points of view.

First of all, it is possible to distinguish between the objective and the subjective points of view:

Objective probability: the objective point of view consists in postulating that probabilities do not depend on the person observing events or performing experiments. This means that there exists some absolute notion of probability for every possible event, and this probability originates from Nature itself. Once this is assumed, the question becomes: how to "measure" these pre-existing probabilities, or how to confirm that the probability of a given event has a given value?

Subjective probability: in the subjective point of view, probabilities are not something that can be measured but something one assumes. The idea is that events either occur or do not occur, and probability is not a property of Nature but rather a convenient way of representing someone's uncertainty before the event actually occurs.

There are two classical (and opposed) ways of interpreting probabilities: the frequency and Bayesian interpretations.

Probability as frequency: in this approach, the probability of an event is defined as the ratio of the number of times the event occurs to the number of times a similar experiment is performed. For example, if you repeatedly flip a coin, the probability of the coin landing on "heads" is defined as the percentage of trials in which it does land on "heads". Of course, this percentage depends strongly on the number of trials and may vary from one sequence of trials to another. However, this issue is resolved by the theorem called the "law of large numbers", which essentially states that the frequency of an event in successive independent trials converges to a fixed value (its probability). In other words, if you flip your coin again and again, the frequency will (slowly but surely) converge to a definite value. There are some issues about the definition of independent trials and about whether one can really repeat an experiment in exactly the same way, but we will not worry about this now.
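The law of large numbers is easy to see numerically. Below is a small simulation (the coin's bias and the number of flips are arbitrary choices) showing the running frequency of heads settling down around the underlying probability:

```python
import numpy as np

rng = np.random.default_rng(42)
p_heads = 0.5                                   # the "true" probability (arbitrary)
flips = rng.random(100_000) < p_heads
running_freq = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>6} flips: frequency of heads = {running_freq[n - 1]:.4f}")
```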

Bayesian probability: it is obvious that not all notions of probability (as they are used in everyday life) can be properly captured by the frequency definition given above. For example, when one speaks about the probability of an event that may occur only once (so that it is not possible to perform repeated experiments), such as the probability of a politician winning a given election, it is clear that frequency makes no practical sense and cannot be tested. Another issue with frequency is that it only makes sense in the limit: say we start flipping a coin and it keeps landing heads up; how many times does it need to land heads up before we decide that this is not happening with probability 1/2? Five? Ten? A thousand? A million? There is no reasonable answer to this question. Hence (subjective) Bayesians do not attempt to measure probabilities; rather, they consider a probability to be a "degree of belief" that someone has in the fact that a given event will occur. The whole point is that how you obtain your "prior" probability, or initial degree of belief (before observing anything), does not matter. What matters is how these values are combined and updated when events are observed.
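A standard illustration of this "combine and update" view is the coin again: encode the degree of belief about the probability of heads as a Beta distribution (the simplest conjugate choice; the initial values below are arbitrary) and update it after each flip.

```python
# Degree of belief about the probability of heads, encoded as a Beta(a, b)
# distribution; a and b are arbitrary values representing the initial belief.
a, b = 2.0, 2.0                  # prior: mildly centred on a fair coin
observations = [1, 1, 1, 0, 1]   # 1 = heads, 0 = tails

for obs in observations:
    a += obs                     # with a Beta prior, Bayes' rule reduces to counting
    b += 1 - obs
    print(f"observed {obs}: posterior mean belief in heads = {a / (a + b):.3f}")
```

Whatever the initial (a, b), the update rule is the same: only the way the prior and the observations are combined is dictated by the rules of probability.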

There is of course a lot to be said about these two interpretations, and there are many refinements of and deviations from them. I hope to explore this in more detail in later posts.

It is enlightening to try to define terms properly when trying to understand the foundations of a scientific domain. In the case of the learning phenomenon, the distinction between deduction and induction is a crucial one.

Deductive reasoning consists in combining logical statements according to certain agreed-upon rules in order to obtain new statements. This is how mathematicians prove theorems from axioms: proving a theorem is nothing but combining a small set of axioms according to certain rules. Of course, this does not mean proving a theorem is a simple task, but it could in principle be automated.

Inductive reasoning consists in constructing the axioms from the observation of their supposed consequences. This is what scientists, physicists for example, do: observing natural phenomena, they postulate the laws of Nature.

Both deduction and induction have limitations. One limitation of deduction is exemplified by Gödel's theorem, which essentially states that, for any rich enough set of axioms, one can produce statements that can be neither proved nor disproved. Induction, on the other hand, is limited by the fact that it is impossible to prove that an inductive statement is correct. At most, one can empirically observe that the deductions made from this statement are not contradicted by experiments, but one can never be sure that no future observation will contradict it.

To someone outside the learning community, it may seem that researchers spend their time looking for THE optimal learning algorithm, a completely generic algorithm that would beat all the others. One may even think that such an algorithm is so sophisticated that researchers can only approach it incrementally, which would explain why progress is relatively slow in this area. However, this is a serious misinterpretation of what is going on in this research field.

First of all, there is no such thing as an optimal learning algorithm, and my point here is to explain why this is so. There are at least three ways to explain it, and I will go from the most informal to the most formal one:

Any algorithm has a bias: given a data sample, a learning algorithm typically builds a function (or a model) that agrees (to a certain degree) with this data and that can make predictions for new data; that is, it extrapolates the data. However, for each data sample there are infinitely many ways to extrapolate, and each learning algorithm does it in its own way. The bias of a learning algorithm can be thought of as the way this algorithm ranks the possible functions: in order to build a function, one needs a way to decide which function to pick among all those that (at least partially) agree with the data. So the question is whether there could exist some optimal ranking of functions, or optimal way of deciding which function to pick. The problem is that, for a learning problem characterized by a particular target function, one can always construct an algorithm that performs optimally, simply by choosing a ranking that puts this particular function first. So there is always an optimal algorithm for each problem, but this optimal algorithm will necessarily be sub-optimal on other problems. Roughly speaking, there is no way to have good performance simultaneously on all problems.

There are several results that make this precise, in particular the so-called No Free Lunch (NFL) theorem. This theorem essentially says that, if you consider all possible learning problems, then all learning algorithms have the same performance on average over these problems. As a consequence, a learning algorithm that performs well on some problems must perform poorly on others to balance this out. Thus there cannot exist an optimal and universal learning algorithm.
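The flavour of the NFL theorem can be checked by brute force on a toy universe. In the sketch below (my own toy construction, not a standard benchmark), every Boolean function on three bits is taken in turn as the "true" problem, two deliberately different algorithms are trained on the same four inputs, and their errors on the four unseen inputs are averaged over all problems; both averages come out at exactly 0.5:

```python
from itertools import product

inputs = list(product([0, 1], repeat=3))        # the 8 possible 3-bit inputs
train_x, test_x = inputs[:4], inputs[4:]        # fixed split: 4 seen, 4 unseen

def majority_learner(train):
    """Predict the majority training label everywhere (ties broken towards 1)."""
    ones = sum(label for _, label in train)
    guess = 1 if 2 * ones >= len(train) else 0
    return lambda x: guess

def nearest_neighbour_learner(train):
    """Predict the label of the closest training point (Hamming distance)."""
    def predict(x):
        return min(train, key=lambda pair: sum(a != b for a, b in zip(x, pair[0])))[1]
    return predict

def average_off_training_error(learner):
    targets = list(product([0, 1], repeat=len(inputs)))   # all 256 Boolean functions
    total = 0.0
    for labels in targets:
        truth = dict(zip(inputs, labels))
        predict = learner([(x, truth[x]) for x in train_x])
        total += sum(predict(x) != truth[x] for x in test_x) / len(test_x)
    return total / len(targets)

for name, learner in [("majority vote", majority_learner),
                      ("nearest neighbour", nearest_neighbour_learner)]:
    print(f"{name}: average error on unseen inputs = {average_off_training_error(learner):.3f}")
```

The two learners disagree wildly on individual problems, yet their averages over all problems are identical, which is exactly the point of the theorem.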

A more involved version of the NFL theorem is the "slow rate" theorem, which states that for any learning algorithm and any sequence of numbers converging to zero, there exists a learning problem on which the algorithm's generalization error converges to zero more slowly than the chosen sequence (as the sample size increases). In other words, a learning algorithm may converge to the optimal solution arbitrarily slowly, and can have arbitrarily poor performance for any fixed sample size.
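A formalization along the lines given by Devroye, Györfi and Lugosi for classification (stated loosely here; the precise conditions on the sequence are omitted) reads:

```latex
\forall\, (a_n)_{n \ge 1} \text{ decreasing to } 0,\ \ \forall\, \text{learning rule } (g_n),\ \
\exists\, P \text{ with Bayes risk } L^{*} = 0 \ \text{ such that }\ 
\mathbb{E}\big[L(g_n)\big] \;\ge\; a_n \quad \text{for all } n .
```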

All this implies that there cannot be an algorithm that is both universal (able to learn any problem) and optimal (performing better than the others on all problems). So the best we can hope for is an algorithm that has good properties on a restricted set of problems. This is not so bad, however, since one can assume that the problems encountered in the real world are somehow well-behaved and do not span the space of all possible problems.

As a conclusion, what Machine Learning researchers do is not to look for THE optimal algorithm, but to look for a learning algorithm that is optimal for a small set of learning problems, namely the "real-world problems".