Ever since the "Prior Confusion" post I was planning to formulate one of its paragraphs as the following abstract puzzle, but somehow it took me 8 years to write it up.

According to fictional statistical studies, the following is known about a fictional chronic disease "statistite":

About 30% of people in the world have statistite.

About 35% of men in the world have it.

In Estonia, 20% of people have statistite.

Out of people younger than 20 years, just 5% have the disease.

A recent study of a random sample of visitors to the Central Hospital demonstrated that 40% of them suffer from statistite.

Mart, a 19-year-old Estonian male medical student, is standing in the foyer of the Central Hospital, reading these facts from an information sheet and wondering: what are his current chances of having statistite? How should he model himself: should he consider himself as primarily "an average man", "a typical Estonian", "just a young person", or "an average visitor of the hospital"? Could he combine the different aspects of his personality to make better use of the available information? How? In general, what would be the best possible probability estimate, given the data?

Basic linear algebra, introductory statistics and some familiarity with core machine learning concepts (such as PCA and linear models) are the prerequisites of this post. Otherwise it will probably make no sense. An abridged version of this text is also posted on Quora.

Most textbooks on statistics cover covariance right in their first chapters. It is defined as a useful "measure of dependency" between two random variables:

$$\mathrm{cov}(X, Y) = E[(X - E[X])(Y - E[Y])].$$

The textbook would usually provide some intuition on why it is defined as it is, prove a couple of properties, such as bilinearity, define the covariance matrix for multiple variables as $\Sigma_{i,j} = \mathrm{cov}(X_i, X_j)$, and stop there. Later on the covariance matrix would pop up here and there in seemingly random ways. In one place you would have to take its inverse, in another - compute the eigenvectors, or multiply a vector by it, or do something else for no apparent reason apart from "that's the solution we came up with by solving an optimization task".

In reality, though, there are some very good and quite intuitive reasons for why the covariance matrix appears in various techniques in one or another way. This post aims to show that, illustrating some curious corners of linear algebra in the process.

Meet the Normal Distribution

The best way to truly understand the covariance matrix is to forget the textbook definitions completely and depart from a different point instead. Namely, from the definition of the multivariate Gaussian distribution:

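We say that the vector $\mathbf{x}$ has a normal (or Gaussian) distribution with mean $\mu$ and covariance $\Sigma$ if:

$$\Pr(\mathbf{x}) = |2\pi\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right).$$

To simplify the math a bit, we will limit ourselves to the centered distribution (i.e. $\mu = \mathbf{0}$) and refrain from writing out the normalizing constant $|2\pi\Sigma|^{-1/2}$. Now, the definition of the (centered) multivariate Gaussian looks as follows:

$$\Pr(\mathbf{x}) \propto \exp\left(-\frac{1}{2}\mathbf{x}^T \Sigma^{-1} \mathbf{x}\right).$$

Much simpler, isn't it? Finally, let us define the covariance matrix as nothing else but the parameter $\Sigma$ of the Gaussian distribution. That's it. You will see where it will lead us in a moment.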

Transforming the Symmetric Gaussian

Consider a symmetric Gaussian distribution, i.e. the one with $\Sigma = \mathbf{I}$ (the identity matrix). Let us take a sample from it, which will of course be a symmetric, round cloud of points:

We know from above that the likelihood of each point $\mathbf{z}$ in this sample is

$$P(\mathbf{z}) \propto \exp\left(-\frac{1}{2}\mathbf{z}^T\mathbf{z}\right). \qquad (1)$$

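Now let us apply a linear transformation $\mathbf{A}$ to the points, i.e. let $\mathbf{x} = \mathbf{A}\mathbf{z}$. Suppose that, for the sake of this example, $\mathbf{A}$ scales the vertical axis by 0.5 and then rotates everything by 30 degrees. We will get the following new cloud of points $\mathbf{x}$:

What is the distribution of $\mathbf{x}$? Just substitute $\mathbf{z} = \mathbf{A}^{-1}\mathbf{x}$ into (1), to get:

$$P(\mathbf{x}) \propto \exp\left(-\frac{1}{2}(\mathbf{A}^{-1}\mathbf{x})^T(\mathbf{A}^{-1}\mathbf{x})\right) = \exp\left(-\frac{1}{2}\mathbf{x}^T(\mathbf{A}\mathbf{A}^T)^{-1}\mathbf{x}\right). \qquad (2)$$

This is exactly the Gaussian distribution with covariance $\Sigma = \mathbf{A}\mathbf{A}^T$. The logic works both ways: if we have a Gaussian distribution with covariance $\Sigma$, we can regard it as a distribution which was obtained by transforming the symmetric Gaussian by some $\mathbf{A}$, and we are given $\mathbf{A}\mathbf{A}^T$.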
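As a quick numerical check of this argument, here is a minimal NumPy sketch. The sample size, the seed and the variable names are illustrative assumptions; the transformation is the one from the example (scale the vertical axis by 0.5, then rotate by 30 degrees).

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric Gaussian sample: each row of Z is a point z with identity covariance.
Z = rng.standard_normal((100_000, 2))

# A = (rotation by 30 degrees) applied after (scaling of the vertical axis by 0.5).
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([1.0, 0.5])
A = R @ S

# Points are stored row-wise, so x = A z becomes X = Z A^T.
X = Z @ A.T

empirical_cov = X.T @ X / len(X)
print(np.round(empirical_cov, 3))  # estimated from the transformed cloud
print(np.round(A @ A.T, 3))        # what the derivation predicts
```

The two printed matrices should agree up to sampling noise.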

More generally, if we have any data, then, when we compute its covariance to be $\Sigma$, we can say that if our data were Gaussian, then it could have been obtained from a symmetric cloud using some transformation $\mathbf{A}$, and we just estimated the matrix $\mathbf{A}\mathbf{A}^T$, corresponding to this transformation.

Note that we do not know the actual $\mathbf{A}$, and it is mathematically totally fair. There can be many different transformations of the symmetric Gaussian which result in the same distribution shape. For example, if $\mathbf{A}$ is just a rotation by some angle, the transformation does not affect the shape of the distribution at all. Correspondingly, $\mathbf{A}\mathbf{A}^T = \mathbf{I}$ for all rotation matrices. When we see a unit covariance matrix we really do not know whether it is the “originally symmetric” distribution or a “rotated symmetric distribution”. And we should not really care - those two are identical.

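There is a theorem in linear algebra, which says that any symmetric matrix $\Sigma$ can be represented as:

$$\Sigma = \mathbf{Q}\mathbf{D}\mathbf{Q}^T, \qquad (3)$$

where $\mathbf{Q}$ is orthogonal (i.e. a rotation) and $\mathbf{D}$ is diagonal (i.e. a coordinate-wise scaling). If we rewrite it slightly, we will get:

$$\Sigma = (\mathbf{Q}\mathbf{D}^{1/2})(\mathbf{Q}\mathbf{D}^{1/2})^T = \mathbf{A}\mathbf{A}^T, \qquad (4)$$

where $\mathbf{A} = \mathbf{Q}\mathbf{D}^{1/2}$. This, in simple words, means that any covariance matrix $\Sigma$ could have been the result of transforming the data using a coordinate-wise scaling $\mathbf{D}^{1/2}$ followed by a rotation $\mathbf{Q}$. Just like in our example with $\mathbf{x}$ and $\mathbf{z}$ above.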
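The decomposition can be checked numerically. The sketch below uses an arbitrary positive definite matrix chosen purely for illustration and calls the transformation `A`, as in the text.

```python
import numpy as np

# An arbitrary symmetric positive definite matrix (illustrative values).
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# eigh is meant for symmetric matrices: Sigma = Q diag(eigenvalues) Q^T.
eigenvalues, Q = np.linalg.eigh(Sigma)
D = np.diag(eigenvalues)

# A = Q D^(1/2): a coordinate-wise scaling followed by an orthogonal map.
# (Q returned by eigh is orthogonal; it may include a reflection.)
A = Q @ np.sqrt(D)

print(np.allclose(Q @ D @ Q.T, Sigma))  # decomposition (3)
print(np.allclose(A @ A.T, Sigma))      # decomposition (4)
```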

Principal Component Analysis

Given the above intuition, PCA already becomes a very obvious technique. Suppose we are given some data. Let us assume (or “pretend”) it came from a normal distribution, and let us ask the following questions:

What could have been the rotation $\mathbf{Q}$ and scaling $\mathbf{D}^{1/2}$, which produced our data from a symmetric cloud?

What were the original, “symmetric-cloud” coordinates before this transformation was applied?

Which original coordinates were scaled the most by $\mathbf{D}$ and thus contribute most to the spread of the data now? Can we only leave those and throw the rest out?

All of those questions can be answered in a straightforward manner if we just decompose $\Sigma$ into $\mathbf{Q}$ and $\mathbf{D}$ according to (3). But (3) is exactly the eigenvalue decomposition of $\Sigma$. I’ll leave you to think for just a bit and you’ll see how this observation lets you derive everything there is about PCA and more.
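One possible way to turn the three questions into code is sketched below. The generating matrix `A_true`, the sample size and the seed are hypothetical choices made only so that there is some data to analyze.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a symmetric cloud pushed through an arbitrary matrix.
A_true = np.array([[2.0, 0.0],
                   [1.0, 0.5]])
X = rng.standard_normal((50_000, 2)) @ A_true.T
X = X - X.mean(axis=0)  # center the data

# Question 1: estimate Sigma and decompose it into a rotation Q and a scaling D.
Sigma = X.T @ X / len(X)
eigenvalues, Q = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]  # largest spread first
eigenvalues, Q = eigenvalues[order], Q[:, order]

# Question 2: undo the rotation and the scaling to get "symmetric-cloud" coordinates.
Z = X @ Q / np.sqrt(eigenvalues)

# Question 3: keep only the coordinate that was scaled the most.
X_reduced = X @ Q[:, :1]

print(np.round(Z.T @ Z / len(Z), 3))  # covariance of the recovered cloud
print(eigenvalues)                    # spread along each principal direction
```

Note that the recovered coordinates `Z` need not equal the original sample: as discussed above, the transformation is only determined up to a rotation of the symmetric cloud.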

The Metric Tensor

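Bear with me for just a bit more. One way to summarize the observations above is to say that we can (and should) regard $\Sigma^{-1}$ as a metric tensor. A metric tensor is just a fancy formal name for a matrix, which summarizes the deformation of space. However, rather than claiming that it in some sense determines a particular transformation $\mathbf{A}$ (which it does not, as we saw), we shall say that it affects the way we compute angles and distances in our transformed space.

Namely, let us redefine, for any two vectors $\mathbf{u}$ and $\mathbf{v}$, their inner product as:

$$\langle \mathbf{u}, \mathbf{v} \rangle_{\Sigma^{-1}} = \mathbf{u}^T \Sigma^{-1} \mathbf{v}. \qquad (5)$$

To stay consistent we will also need to redefine the norm of any vector as

$$\|\mathbf{u}\|_{\Sigma^{-1}} = \sqrt{\mathbf{u}^T \Sigma^{-1} \mathbf{u}}, \qquad (6)$$

and the distance between any two vectors as

$$\|\mathbf{u} - \mathbf{v}\|_{\Sigma^{-1}} = \sqrt{(\mathbf{u} - \mathbf{v})^T \Sigma^{-1} (\mathbf{u} - \mathbf{v})}. \qquad (7)$$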

Those definitions now describe a kind of a “skewed world” of points. For example, a unit circle (a set of points with “skewed distance” 1 to the center) in this world might look as follows:

And here is an example of two vectors, which are considered “orthogonal”, a.k.a. “perpendicular” in this strange world:

Although it may look weird at first, note that the new inner product we defined is actually just the dot product of the “untransformed” originals of the vectors:

$$\mathbf{u}^T \Sigma^{-1} \mathbf{v} = \mathbf{u}^T (\mathbf{A}\mathbf{A}^T)^{-1} \mathbf{v} = (\mathbf{A}^{-1}\mathbf{u})^T (\mathbf{A}^{-1}\mathbf{v}). \qquad (8)$$

The following illustration might shed light on what is actually happening in this $\Sigma^{-1}$-“skewed” world. Somehow “deep down inside”, the ellipse thinks of itself as a circle and the two vectors behave as if they were (2,2) and (-2,2).
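A small numerical sketch of this "skewed orthogonality", assuming the same transformation as in the earlier example (vertical scaling by 0.5 followed by a 30 degree rotation). The function name `skewed_inner` is made up for illustration.

```python
import numpy as np

theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A = R @ np.diag([1.0, 0.5])
Sigma_inv = np.linalg.inv(A @ A.T)

def skewed_inner(u, v):
    """Inner product u^T Sigma^{-1} v in the skewed space."""
    return u @ Sigma_inv @ v

# Transformed versions of the perpendicular vectors (2, 2) and (-2, 2).
u = A @ np.array([2.0, 2.0])
v = A @ np.array([-2.0, 2.0])

print(u @ v)                        # ordinary dot product: not zero
print(skewed_inner(u, v))           # skewed inner product: zero up to rounding
print(np.sqrt(skewed_inner(u, u)))  # skewed norm equals the norm of (2, 2)
```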

Getting back to our example with the transformed points, we could now say that the point-cloud $\mathbf{x}$ is actually a perfectly round and symmetric cloud “deep down inside”, it just happens to live in a skewed space. The deformation of this space is described by the tensor $\Sigma^{-1}$ (which is, as we know, equal to $(\mathbf{A}\mathbf{A}^T)^{-1}$). The PCA now becomes a method for analyzing the deformation of space, how cool is that.

The Dual Space

We are not done yet. There’s one interesting property of “skewed” spaces worth knowing about. Namely, the elements of their dual space have a particular form. No worries, I’ll explain in a second.

Let us forget the whole skewed space story for a moment, and get back to the usual inner product $\mathbf{w}^T\mathbf{x}$. Think of this inner product as a function $f_{\mathbf{w}}(\mathbf{x})$, which takes a vector $\mathbf{x}$ and maps it to a real number, the dot product of $\mathbf{x}$ and $\mathbf{w}$. Regard the $\mathbf{w}$ here as the parameter (“weight vector”) of the function. If you have done any machine learning at all, you have certainly come across such linear functionals over and over, sometimes in disguise. Now, the set of all possible linear functionals $f_{\mathbf{w}}$ is known as the dual space to your “data space”.

Note that each linear functional is determined uniquely by the parameter vector $\mathbf{w}$, which has the same dimensionality as $\mathbf{x}$, so apparently the dual space is in some sense equivalent to your data space - just the interpretation is different. An element $\mathbf{x}$ of your “data space” denotes, well, a data point. An element $\mathbf{w}$ of the dual space denotes a function $f_{\mathbf{w}}$, which projects your data points on the direction $\mathbf{w}$ (recall that if $\mathbf{w}$ is unit-length, $\mathbf{w}^T\mathbf{x}$ is exactly the length of the perpendicular projection of $\mathbf{x}$ upon the direction $\mathbf{w}$). So, in some sense, if $\mathbf{x}$-s are “vectors”, $\mathbf{w}$-s are “directions, perpendicular to these vectors”. Another way to understand the difference is to note that if, say, the elements of your data points numerically correspond to amounts in kilograms, the elements of $\mathbf{w}$ would have to correspond to “units per kilogram”. Still with me?

Let us now get back to the skewed space. If $\mathbf{x}$ are elements of a skewed Euclidean space with the metric tensor $\Sigma^{-1}$, is a function $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$ an element of a dual space? Yes, it is, because, after all, it is a linear functional. However, the parameterization of this function is inconvenient, because, due to the skewed tensor, we cannot interpret it as projecting vectors upon $\mathbf{w}$ nor can we say that $\mathbf{w}$ is an “orthogonal direction” (to a separating hyperplane of a classifier, for example). Because, remember, in the skewed space it is not true that orthogonal vectors satisfy $\mathbf{w}^T\mathbf{x} = 0$. Instead, they satisfy $\mathbf{w}^T\Sigma^{-1}\mathbf{x} = 0$. Things would therefore look much better if we parameterized our dual space differently. Namely, by considering linear functionals of the form $f^{\Sigma^{-1}}_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T\Sigma^{-1}\mathbf{x}$. The new parameters $\mathbf{w}$ could now indeed be interpreted as an “orthogonal direction” and things overall would make more sense.

However when we work with actual machine learning models, we still prefer to have our functions in the simple form of a dot product, i.e. $f_{\mathbf{w}}$, without any ugly $\Sigma$-s inside. What happens if we turn a “skewed space” linear functional from its natural representation into a simple inner product?

$$f^{\Sigma^{-1}}_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T\Sigma^{-1}\mathbf{x} = (\Sigma^{-1}\mathbf{w})^T\mathbf{x} = f_{\mathbf{z}}(\mathbf{x}), \qquad (9)$$

where $\mathbf{z} = \Sigma^{-1}\mathbf{w}$. (Note that we can lose the transpose because $\Sigma^{-1}$ is symmetric).

What it means, in simple terms, is that when you fit linear models in a skewed space, your resulting weight vectors will always be of the form $\Sigma^{-1}\mathbf{w}$. Or, in other words, $\Sigma^{-1}$ is a transformation, which maps from “skewed perpendiculars” to “true perpendiculars”. Let me show you what this means visually.

Consider again the two “orthogonal” vectors from the skewed world example above:

Let us interpret the blue vector as an element of the dual space. That is, it is the $\mathbf{w}$ vector in a linear functional $\mathbf{w}^T\Sigma^{-1}\mathbf{x}$. The red vector is an element of the “data space”, which would be mapped to 0 by this functional (because the two vectors are “orthogonal”, remember).

For example, if the blue vector was meant to be a linear classifier, it would have its separating line along the red vector, just like that:

If we now wanted to use this classifier, we could, of course, work in the “skewed space” and use the expression $\mathbf{w}^T\Sigma^{-1}\mathbf{x}$ to evaluate the functional. However, why don’t we find the actual normal to that red separating line so that we wouldn’t need to do an extra matrix multiplication every time we use the function?

It is not too hard to see that $\Sigma^{-1}\mathbf{w}$ will give us that normal. Here it is, the black arrow:

Therefore, next time, whenever you see expressions like $\mathbf{w}^T\Sigma^{-1}\mathbf{x}$ or $(\mathbf{x} - \mathbf{y})^T\Sigma^{-1}(\mathbf{x} - \mathbf{y})$, remember that those are simply inner products and (squared) distances in a skewed space, while $\Sigma^{-1}\mathbf{w}$ is a conversion from a skewed normal to a true normal. Also remember that the “skew” was estimated by pretending that the data were normally-distributed.
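The conversion from a "skewed normal" to a "true normal" can be verified with a few lines of NumPy. The sketch reuses the transformation from the earlier example; the names `w`, `r` and `true_normal` and the test point are illustrative, with `w` playing the role of the dual vector and `r` the role of the data vector along the separating line.

```python
import numpy as np

theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A = R @ np.diag([1.0, 0.5])
Sigma_inv = np.linalg.inv(A @ A.T)

w = A @ np.array([2.0, 2.0])   # dual vector (the "skewed normal")
r = A @ np.array([-2.0, 2.0])  # data vector, skew-orthogonal to w

true_normal = Sigma_inv @ w    # ordinary normal to the line along r

x = np.array([0.7, -1.3])      # an arbitrary data point

print(w @ Sigma_inv @ r)                    # zero: "orthogonal" in the skewed space
print(true_normal @ r)                      # zero: orthogonal in the ordinary sense
print(w @ Sigma_inv @ x, true_normal @ x)   # the same functional, two forms
```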

The Dual Space Metric Tensor

“But wait”, you should say here. “You have been talking about expressions like $\mathbf{w}^T\Sigma^{-1}\mathbf{x}$ all the time, while things like $\mathbf{w}^T\Sigma\mathbf{x}$ are also quite common in practice. What about those?”

Hopefully you know enough now to suspect that $\mathbf{w}^T\Sigma\mathbf{x}$ is again an inner product or a squared norm in some deformed space, just not the “internal data metric space”, that we considered so far. Which space is it? It turns out it is the “internal dual metric space”. That is, whilst the expression $\mathbf{x}^T\Sigma^{-1}\mathbf{y}$ denoted the “new inner product” between the points, the expression $\mathbf{w}^T\Sigma\mathbf{v}$ denotes the “new inner product” between the parameter vectors. Let us see why it is so.

Consider an example again. Suppose that our space transformation $\mathbf{A}$ scaled all points by 2 along the $x$ axis. The point (1,0) became (2,0), the point (3, 1) became (6, 1), etc. Think of it as changing the units of measurement - before we measured the $x$ axis in kilograms, and now we measure it in pounds. Consequently, the norm of the point (2,0) according to the new metric will be 1, because 2 pounds is still just 1 kilogram “deep down inside”.

What should happen to the parameter ("direction") vectors due to this transformation? Can we say that the parameter vector (1,0) also got scaled to (2,0) and that the norm of the parameter vector (2,0) is now therefore also 1? No! Recall that if our initial data denoted kilograms, our dual vectors must have denoted “units per kilogram”. After the transformation they will be denoting “units per pound”, correspondingly. To stay consistent we must therefore convert the parameter vector (“1 unit per kilogram”, 0) to its equivalent (“0.5 units per pound”, 0). Consequently, the norm of the parameter vector (0.5,0) in the new metric will be 1 and, by the same logic, the norm of the dual vector (2,0) in the new metric must be 4. You see, the “importance of a parameter/direction” gets scaled inversely to the “importance of data” along that parameter or direction.

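More formally, if $\mathbf{x}' = \mathbf{A}\mathbf{x}$, then

$$f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T\mathbf{x} = \mathbf{w}^T\mathbf{A}^{-1}\mathbf{x}' = ((\mathbf{A}^{-1})^T\mathbf{w})^T\mathbf{x}' = f_{(\mathbf{A}^{-1})^T\mathbf{w}}(\mathbf{x}'). \qquad (10)$$

This means that the transformation $\mathbf{A}$ of the data points implies the transformation $\mathbf{B} := (\mathbf{A}^{-1})^T$ of the dual vectors. The metric tensor for the dual space must thus be:

$$(\mathbf{B}\mathbf{B}^T)^{-1} = ((\mathbf{A}^{-1})^T\mathbf{A}^{-1})^{-1} = \mathbf{A}\mathbf{A}^T = \Sigma. \qquad (11)$$

Remember the illustration of the “unit circle” in the $\Sigma^{-1}$ metric? This is how the unit circle looks in the corresponding $\Sigma$ metric. It is rotated by the same angle, but it is stretched in the direction where it was squished before.

Intuitively, the norm (“importance”) of the dual vectors along the directions in which the data was stretched by $\mathbf{A}$ becomes proportionally larger (note that the “unit circle” is, on the contrary, “squished” along those directions).

But the “stretch” of the space deformation in any direction can be measured by the variance of the data. It is therefore not a coincidence that $\mathbf{w}^T\Sigma\mathbf{w}$ is exactly the variance of the data along the direction $\mathbf{w}$ (assuming $\|\mathbf{w}\| = 1$).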
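Both claims of this section can be checked numerically: the norms from the kilograms-to-pounds example, and the variance along a direction. In the sketch below the helper `metric_norm`, the direction `w`, the sample size and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# The units example: scale the first axis by 2.
A = np.diag([2.0, 1.0])
Sigma = A @ A.T
Sigma_inv = np.linalg.inv(Sigma)

def metric_norm(vec, metric):
    """Norm of vec under the given metric tensor."""
    return float(np.sqrt(vec @ metric @ vec))

data_norm = metric_norm(np.array([2.0, 0.0]), Sigma_inv)  # data point (2, 0)
dual_norm_half = metric_norm(np.array([0.5, 0.0]), Sigma)  # dual vector (0.5, 0)
dual_norm_two = metric_norm(np.array([2.0, 0.0]), Sigma)   # dual vector (2, 0)
print(data_norm, dual_norm_half, dual_norm_two)

# Variance of transformed data along a unit direction w versus w^T Sigma w.
X = rng.standard_normal((200_000, 2)) @ A.T
w = np.array([1.0, 1.0]) / np.sqrt(2.0)
variance_along_w = np.var(X @ w)
print(variance_along_w, w @ Sigma @ w)
```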

The Covariance Estimate

Once we start viewing the covariance matrix as a transformation-driven metric tensor, many things become clearer, but one thing becomes extremely puzzling: why is the inverse covariance of the data a good estimate for that metric tensor? After all, it is not obvious that $\mathbf{X}^T\mathbf{X}/n$ (where $\mathbf{X}$ is the data matrix) must be related to the $\Sigma$ in the distribution equation $\exp(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x})$.

Here is one possible way to see the connection. Firstly, let us take it for granted that if $\mathbf{X}$ is sampled from a symmetric Gaussian, then $\mathbf{X}^T\mathbf{X}/n$ is, on average, a unit matrix. This has nothing to do with transformations, but just a consequence of pairwise independence of variables in the symmetric Gaussian.

Now, consider the transformed data, $\mathbf{X}' = \mathbf{X}\mathbf{A}^T$ (vectors in the data matrix are row-wise, hence the multiplication on the right with a transpose). What is the covariance estimate of $\mathbf{X}'$?

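$$\mathbf{X}'^T\mathbf{X}'/n = (\mathbf{X}\mathbf{A}^T)^T(\mathbf{X}\mathbf{A}^T)/n = \mathbf{A}(\mathbf{X}^T\mathbf{X}/n)\mathbf{A}^T \approx \mathbf{A}\mathbf{A}^T, \qquad (12)$$

the familiar tensor.

This is a place where one could see that a covariance matrix may make sense outside the context of a Gaussian distribution, after all. Indeed, if you assume that your data was generated from any distribution with uncorrelated variables of unit variance and then transformed using some matrix $\mathbf{A}$, the expression $\mathbf{X}'^T\mathbf{X}'/n$ will still be an estimate of $\mathbf{A}\mathbf{A}^T$, the metric tensor for the corresponding (dual) space deformation.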
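To illustrate that nothing here depends on Gaussianity, the following sketch repeats the estimate with a uniform distribution. The matrix `A`, the sample size and the seed are hypothetical; the only requirement is that the initial coordinates are uncorrelated with unit variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Uniform on [-sqrt(3), sqrt(3)] has zero mean and unit variance.
half_width = np.sqrt(3.0)
X = rng.uniform(-half_width, half_width, size=(n, 2))

# A hypothetical transformation of the initial cloud.
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
X_prime = X @ A.T  # points are rows, hence the transpose

estimate = X_prime.T @ X_prime / n
print(np.round(estimate, 2))
print(A @ A.T)
```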

However, note that out of all possible initial distributions with a given mean and covariance, the normal distribution is exactly the one with the maximum entropy, i.e. the “most generic”. Thus, if you base your analysis on the mean and the covariance matrix (which is what you do with PCA, for example), you could just as well assume your data to be normally distributed. In fact, a good rule of thumb is to remember that whenever you even mention the words "covariance matrix", you are implicitly fitting a Gaussian distribution to your data.

A question on Quora reminded me that I wanted to post this explanation here every time I got a chance to teach SVMs and kernel methods, but I never found the time. The post expects basic knowledge of those topics from the reader.

Introductory Background

The concept of kernel methods is probably one of the coolest tricks in machine learning. With most machine learning research nowadays being centered around neural networks, they have gone somewhat out of fashion recently, but I suspect they will strike back one day in some way or another.

The idea of a kernel method starts with the curious observation that if you take a dot product of two vectors, $\mathbf{x}^T\mathbf{y}$, and square it, the result can be regarded as a dot product of two "feature vectors", where the features are all pairwise products of the original inputs:

$$(\mathbf{x}^T\mathbf{y})^2 = \Big(\sum_i x_i y_i\Big)^2 = \sum_{i,j} x_i x_j y_i y_j = \langle (x_1x_1, x_1x_2, \dots, x_nx_n), (y_1y_1, y_1y_2, \dots, y_ny_n) \rangle.$$

Analogously, if you raise $\mathbf{x}^T\mathbf{y}$ to the third power, you are essentially computing a dot product within a space of all possible three-way products of your inputs, and so on, without ever actually having to see those features explicitly.
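This identity is easy to verify directly. In the sketch below, the helper `pairwise_features` and the two example vectors are illustrative.

```python
import numpy as np

def pairwise_features(v):
    """All pairwise products v_i * v_j, flattened into one feature vector."""
    return np.outer(v, v).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

squared_dot = (x @ y) ** 2
feature_dot = pairwise_features(x) @ pairwise_features(y)

print(squared_dot, feature_dot)  # the two numbers coincide
```

The kernel side needs one dot product of length 3, while the explicit side needs 9 features. For the third power the explicit vector would already have 27 elements.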

If you now take any linear model (e.g. linear regression, linear classification, PCA, etc) it turns out you can replace the "real" dot product in its formulation with such a kernel function, and this will magically convert your model into a linear model with nonlinear features (e.g. pairwise or triple products). As those features are never explicitly computed, there is no problem if there were millions or billions of them.

Consider, for example, plain old linear regression: $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$. We can "kernelize" it by first representing $\mathbf{w}$ as a linear combination of the data points (this is called a dual representation):

$$f(\mathbf{x}) = \Big(\sum_i \alpha_i \mathbf{x}_i\Big)^T\mathbf{x} + b = \sum_i \alpha_i (\mathbf{x}_i^T\mathbf{x}) + b,$$

and then swapping all the dot products $\mathbf{x}_i^T\mathbf{x}$ with a custom kernel function:

$$f(\mathbf{x}) = \sum_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b.$$

If we now substitute $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y})^2$ here, our model becomes a second degree polynomial regression. If $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y})^5$ it is the fifth degree polynomial regression, etc. It's like magic, you plug in different functions and things just work.
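Here is a minimal sketch of such a kernelized regression. The text does not say how the dual coefficients are fitted, so the least-squares solve below, the omission of the intercept $b$, the toy data and the names `poly_kernel` and `predict` are all assumptions made for illustration.

```python
import numpy as np

def poly_kernel(X1, X2, degree=2):
    """K(x, y) = (x^T y)^degree for every pair of rows of X1 and X2."""
    return (X1 @ X2.T) ** degree

# Toy one-dimensional data following y = 3 x^2 (no noise, no intercept).
X_train = np.array([[-2.0], [-1.0], [0.5], [1.0], [3.0]])
y_train = 3.0 * X_train[:, 0] ** 2

# Dual representation: f(x) = sum_i alpha_i K(x_i, x).
# K is rank-deficient here (more points than implicit features),
# so the coefficients come from a least-squares solve with an explicit cutoff.
K = poly_kernel(X_train, X_train)
alpha, *_ = np.linalg.lstsq(K, y_train, rcond=1e-10)

def predict(X_new):
    return poly_kernel(X_new, X_train) @ alpha

prediction = predict(np.array([[2.0]]))[0]
print(prediction)  # close to 3 * 2^2
```

The model never sees the feature $x^2$ explicitly, yet it fits a quadratic function of the input.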

It turns out that there are lots of valid choices for the kernel function $K$, and, of course, the Gaussian function is one of these choices:

$$K(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right).$$

It is not too surprising - the Gaussian function tends to pop up everywhere, after all, but it is not obvious what "implicit features" it should represent when viewed as a kernel function. Most textbooks do not seem to cover this question in sufficient detail, so let me do it here.

The Gaussian Kernel

To see the meaning of the Gaussian kernel we need to understand the couple of ways in which any kernel functions can be combined. We saw before that raising a linear kernel to the power $i$ makes a kernel with a feature space, which includes all $i$-wise products. Now let us examine what happens if we add two or more kernel functions. Consider $K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y} + (\mathbf{x}^T\mathbf{y})^2$, for example. It is not hard to see that it corresponds to an inner product of feature vectors of the form

$$\phi(\mathbf{x}) = (x_1, x_2, \dots, x_n, \quad x_1x_1, x_1x_2, \dots, x_nx_n),$$

i.e. a concatenation of the original features and all their pairwise products.

Multiplying a kernel function with a constant $c$ is also meaningful. It corresponds to scaling the corresponding features by $\sqrt{c}$. For example, $c\,(\mathbf{x}^T\mathbf{y}) = (\sqrt{c}\,\mathbf{x})^T(\sqrt{c}\,\mathbf{y})$.

Still with me? Great, now let us combine the tricks above and consider the following kernel:

$$K(\mathbf{x}, \mathbf{y}) = 1 + \mathbf{x}^T\mathbf{y} + \frac{(\mathbf{x}^T\mathbf{y})^2}{2} + \frac{(\mathbf{x}^T\mathbf{y})^3}{6}.$$

Apparently, it is a kernel which corresponds to a feature mapping, which concatenates a constant feature, all original features, all pairwise products scaled down by $\sqrt{2}$ and all triple products scaled down by $\sqrt{6}$.

Looks impressive, right? Let us continue and add more members to this kernel, so that it would contain all four-wise, five-wise, and so on up to infinity-wise products of input features. We shall choose the scaling coefficients for each term carefully, so that the resulting infinite sum would resemble a familiar expression:

$$\sum_{i=0}^{\infty} \frac{(\mathbf{x}^T\mathbf{y})^i}{i!} = \exp(\mathbf{x}^T\mathbf{y}).$$

We can conclude here that $K(\mathbf{x}, \mathbf{y}) = \exp(\mathbf{x}^T\mathbf{y})$ is a valid kernel function, which corresponds to a feature space, which includes products of input features of any degree, up to infinity.

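But we are not done yet. Suppose that we decide to normalize the inputs before applying our linear model. That is, we want to convert each vector $\mathbf{x}$ to $\frac{\mathbf{x}}{\|\mathbf{x}\|}$ before feeding it to the model. This is quite often a smart idea, which improves generalization. It turns out we can do this “data normalization” without really touching the data points themselves, but by only tuning the kernel instead.

Consider again the linear kernel $K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y}$. If we normalize the vectors before taking their inner product, we get

$$\left(\frac{\mathbf{x}}{\|\mathbf{x}\|}\right)^T\left(\frac{\mathbf{y}}{\|\mathbf{y}\|}\right) = \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|} = \frac{K(\mathbf{x}, \mathbf{y})}{\sqrt{K(\mathbf{x}, \mathbf{x})\,K(\mathbf{y}, \mathbf{y})}}.$$

With some reflection you will see that the latter expression would normalize the features for any kernel.

Let us see what happens if we apply this kernel normalization to the “infinite polynomial” (i.e. exponential) kernel we just derived:

$$\frac{\exp(\mathbf{x}^T\mathbf{y})}{\sqrt{\exp(\mathbf{x}^T\mathbf{x})\exp(\mathbf{y}^T\mathbf{y})}} = \exp\left(-\frac{1}{2}\left(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - 2\mathbf{x}^T\mathbf{y}\right)\right) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2}\right).$$

Voilà, the Gaussian kernel. Well, it still lacks $\sigma^2$ in the denominator but by now you hopefully see that adding it is equivalent to scaling the inputs by $1/\sigma$.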
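The normalization identity can be confirmed numerically. The function names and the example vectors below are illustrative; the last check also covers the claim about scaling the inputs by $1/\sigma$.

```python
import numpy as np

def exp_kernel(x, y):
    """The "infinite polynomial" kernel exp(x^T y)."""
    return np.exp(x @ y)

def normalized(kernel, x, y):
    """Kernel normalization K(x, y) / sqrt(K(x, x) K(y, y))."""
    return kernel(x, y) / np.sqrt(kernel(x, x) * kernel(y, y))

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = np.array([0.3, -1.2])
y = np.array([1.0, 0.4])
sigma = 2.0

print(normalized(exp_kernel, x, y), gaussian_kernel(x, y))
print(normalized(exp_kernel, x / sigma, y / sigma), gaussian_kernel(x, y, sigma))
```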

To conclude: the Gaussian kernel is a normalized polynomial kernel of infinite degree (where feature products of $i$-th degree are scaled down by $\sqrt{i!}$ before normalization). Simple, right?

An Example

The derivations above may look somewhat theoretic if not "magical", so let us work through a couple of numeric examples. Suppose our original vectors are one-dimensional (that is, real numbers), and let , . The value of the Gaussian kernel for these inputs is:

Let us see whether we can obtain the same value as a simple dot product of normalized polynomial feature vectors of a high degree. For that, we first need to compute the corresponding unnormalized feature representation:

$$\phi(x) = \left(1, x, \frac{x^2}{\sqrt{2!}}, \frac{x^3}{\sqrt{3!}}, \frac{x^4}{\sqrt{4!}}, \dots\right).$$

As our inputs are rather small in magnitude, we can hope that the feature sequence quickly approaches zero, so we don't really have to work with infinite vectors. Indeed, here is what the feature sequences look like:

In boldface are the decimal digits, which match the value of . The discrepancy is probably more due to lack of floating-point precision rather than to our approximation.
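The same experiment can be reproduced in a few lines. The inputs below are illustrative and not necessarily the ones used in the text; the truncation degree of 20 and the function names are likewise assumptions.

```python
import math
import numpy as np

def features_1d(x, max_degree=20):
    """Unnormalized features (1, x, x^2/sqrt(2!), ..., x^d/sqrt(d!))."""
    return np.array([x ** i / math.sqrt(math.factorial(i))
                     for i in range(max_degree + 1)])

def normalize(v):
    return v / np.linalg.norm(v)

x, y = 0.5, 1.5  # illustrative inputs

approx = normalize(features_1d(x)) @ normalize(features_1d(y))
exact = math.exp(-0.5 * (x - y) ** 2)

print(approx, exact)
print(features_1d(y)[-3:])  # the tail of the feature sequence is tiny
```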

A 2D Example

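The one-dimensional example might have seemed somewhat too simplistic, so let us also go through a two-dimensional case. Here our unnormalized feature representation is the following:

$$\phi(x_1, x_2) = \left(1, \; x_1, x_2, \; \frac{x_1x_1}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{2!}}, \frac{x_2x_1}{\sqrt{2!}}, \frac{x_2x_2}{\sqrt{2!}}, \; \frac{x_1x_1x_1}{\sqrt{3!}}, \frac{x_1x_1x_2}{\sqrt{3!}}, \dots\right).$$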

This looks pretty heavy, and we didn't even finish writing out the third degree products. If we wanted to continue all the way up to degree 20, we would end up with a vector with 2097151 elements!

Note that many products are repeated, however (e.g. $x_1x_2 = x_2x_1$), hence these are not really all different features. Let us try to pack them more efficiently. As you'll see in a moment, this will open up a much nicer perspective on the feature vector in general.

Basic combinatorics will tell us that each feature of the form $\frac{x_1^a x_2^b}{\sqrt{n!}}$ (where $n = a + b$) must be repeated exactly $\frac{n!}{a!\,b!}$ times in our current feature vector. Thus, instead of repeating it, we could replace it with a single feature, scaled by $\sqrt{\frac{n!}{a!\,b!}}$. "Why the square root?" you might ask here. Because when combining a repeated feature we must preserve the overall vector norm. Consider a vector $(v, v, v)$, for example. Its norm is $\sqrt{v^2 + v^2 + v^2} = \sqrt{3}\,v$, exactly the same as the norm of the single-element vector $(\sqrt{3}\,v)$.

As we do this scaling, each feature gets converted to a nice symmetric form:

$$\sqrt{\frac{n!}{a!\,b!}} \cdot \frac{x_1^a x_2^b}{\sqrt{n!}} = \frac{x_1^a x_2^b}{\sqrt{a!\,b!}} = \frac{x_1^a}{\sqrt{a!}} \cdot \frac{x_2^b}{\sqrt{b!}}.$$

This means that we can compute the 2-dimensional feature vector by first expanding each parameter into a vector of powers, like we did in the previous example, and then taking all their pairwise products. This way, if we wanted to limit ourselves with maximum degree 20, we would only have to deal with $\binom{22}{2}$ = 231 features instead of 2097151. Nice!
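Both the feature counts and the packed representation can be checked with a short script. The inputs `x` and `y` and the function names are illustrative assumptions; the maximum degree of 20 follows the text.

```python
import math
import numpy as np

max_degree = 20

# Unpacked representation: 2^d ordered products for every degree d = 0..20.
naive_count = sum(2 ** d for d in range(max_degree + 1))
# Packed representation: one feature per pair (a, b) with a + b <= 20.
packed_count = math.comb(max_degree + 2, 2)
print(naive_count, packed_count)

def powers(value, max_degree):
    """One-dimensional features value^i / sqrt(i!)."""
    return np.array([value ** i / math.sqrt(math.factorial(i))
                     for i in range(max_degree + 1)])

def packed_features(point, max_degree=20):
    """Features x1^a x2^b / sqrt(a! b!) for all a + b <= max_degree."""
    p1 = powers(point[0], max_degree)
    p2 = powers(point[1], max_degree)
    return np.array([p1[a] * p2[b]
                     for a in range(max_degree + 1)
                     for b in range(max_degree + 1 - a)])

def normalize(v):
    return v / np.linalg.norm(v)

x = np.array([0.7, 0.2])  # illustrative inputs
y = np.array([0.1, 0.4])

approx = normalize(packed_features(x)) @ normalize(packed_features(y))
exact = math.exp(-0.5 * np.sum((x - y) ** 2))
print(approx, exact)
```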

Here is a new view of the unnormalized feature vector up to degree 3:

Let us limit ourselves to this degree-3 example and let , (if we picked larger values, we would need to expand our feature vectors to a higher degree to get a reasonable approximation of the Gaussian kernel). Now:

There are many ways in which smart people tend to explain Bayesian statistics and contrast it with a "non-Bayesian" one. One usually highlights that the primary concept of a Bayesian approach is the desire to model everything as a probability distribution. Once this fact is clear, many smart people would proceed to claim that this is, in fact, what fundamentally sets Bayesian statistics aside from the "classical" one. However, I feel that this kind of explanation is somewhat incomplete. It is not like classical statisticians do not use complete probability distributions. The difference is in general somewhat more subtle and philosophical.

Consider the question "what is your height?". For a classical statistician there exists some abstract "true answer", say "180cm", which is a fixed number - your one and only height. The problem is, of course, you do not know this number because every measurement is slightly different, so the classical statistician will add that "there is a normally-distributed measurement error". In the world of a pure Bayesian there are almost no "fixed numbers" - everything is a probability distribution, and so is your height! That is, a Bayesian should say that "your height is a Normal distribution centered around 180cm".

Note that from the mathematical perspective there is no difference between the two representations - in both cases the number 180cm is mentioned, and the normal distribution. However, from philosophical, syntactical, methodological and "mental" perspectives this tends to have serious implications, and there has been historically a kind of an ongoing intellectual feud between the statisticians who lean more towards the first or the second approach (it is somewhat reminiscent of how there is a divide among the physicists with regard to their support of the Copenhagen interpretation of quantum mechanics).

One of the implications of denying the fact that things in the world are mostly fixed (and are all pure distributions instead) is that you may not use many of the common sense inference methods directly. What is my height if I stand on a chair? "Well, it is your height plus the height of a chair", a classical statistician would say. He can keep in mind the measurement errors, if necessary, but those could be dealt with later. In the Bayesian world heights are not numbers, so the procedure of adding heights implies convolving two distributions to get the resulting distribution. If both distributions are Gaussian, the result will match that of the "common sense", but note that now the common sense somehow became "just one special case". Moreover, a Bayesian might even keep the possibility that "your height and the height of the chair are dependent" in the back of his mind, just in case. Because when you speak about two numbers in the Bayesian world, you must immediately start thinking about their joint distribution.

On the other hand, modeling everything in probabilities lets you use probability theory inference methods (Bayes rule, convolutions, marginalizations, etc) everywhere, without the need to differentiate between "fixed numbers" and "random measurement errors" and this adds peace of mind as well as tends to make your explanations clearer. A Bayesian confidence interval, for example, is a "fixed interval such that 95% of height measurements fall into it". A classical confidence interval, on the other hand, is "a random interval such that the true height may fall into it with 95% probability". Again, mathematically and numerically those may often be the same, but think how different the two explanations are.

Bayesian "thinking" tends to be more flexible for complex models. Many classical statistics models would stick to fixed parameters, point or "interval" inferences, and try to "hide" the complexity of probability distributions as much as possible. As a result, reasoning about a system with many highly interconnected concepts becomes flawed. Consider a sequence of three questions:

What is the height of this truck?

Will it fit under this 3m bridge?

Do we need to pick another route?

In the "classical" mindset you would tend to give fixed answers to the questions.

"Height of the truck is 297".

"Yes, 297<300, hence it will fit".

"No, we do not need to".

Sometimes you may be more careful and work with confidence intervals, but it still feels unwieldy:

"The confidence interval on the height of the truck is 290..310"

".. aahm, it might not fit..."

"let's pick another route, just in case"

Note, if a followup question appears that depends on the previous inferences (e.g. "do we need to remodel the truck") answering it becomes even harder because the true uncertainty is "lost" in the intermediate steps. Such problems are never present if you are disciplined as a Bayesian. Note the answers:

"The height of the truck is a normal distribution N(297, 10)"

"It will fit under the bridge with probability 60%"

"We need another route with probability 40%"

At any point the information about the uncertainty is preserved in the distributions and you are free to combine it further, or apply a decision-theoretic utility model. This makes Bayesian networks possible, for example.
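The numbers in the Bayesian answers can be reproduced with a few lines of Python, under the assumption that the second parameter of N(297, 10) is the standard deviation in centimeters and that the bridge clearance is exactly 300cm. The helper `normal_cdf` is written out by hand for illustration.

```python
import math

def normal_cdf(value, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((value - mean) / (std * math.sqrt(2.0))))

# Truck height ~ N(297, 10), with 10 read as the standard deviation (assumption).
p_fit = normal_cdf(300.0, mean=297.0, std=10.0)
p_other_route = 1.0 - p_fit

print(round(p_fit, 2), round(p_other_route, 2))
```

Under this reading the probability of fitting comes out at about 0.62, in line with the rounded 60% and 40% quoted in the answers.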

It is interesting to see how this largely philosophical preference leads to two completely different (albeit complementary) sets of techniques. Indeed, if you are a true classical statistician, your work revolves around parameterized probability distributions. You write them down like $p_\theta(x)$, where $x$ is the "truly random" value from some probability space, and $\theta$ is the "fixed but unknown" parameter. Your whole "school of thought" is now focused on clever ad-hoc techniques for computing estimates of this fixed parameter from the provided distribution.

For a pure Bayesian, however, there is no "fixed" $\theta$ that has to be treated somehow separately. Instead, $\theta$ is also a part of some probability space, and instead of writing $p_\theta(x)$ he would safely write $p(x, \theta)$, $p(x \mid \theta)$, or $p(\theta \mid x)$. As a result, the probability distributions he works with are not parameterized any more, and all of the clever techniques that the classical statisticians have invented over the centuries for estimating parameters become seemingly useless. At this point a classical statistician puts his hands down and goes home, as there is nothing to do for him - there are no "unknowns". The Bayesian is, however, left to struggle with mathematically trivial, yet computationally incredibly heavy methods for extracting essentially the same values that the classical statistician could have obtained using his "parameter estimation" approaches. That's why the Bayesian "school of thought" is mostly focused on computationally-efficient methods for marginalization and sampling.

In reality, of course, a Bayesian would quite often give up and "cheat", at least partially parameterizing his models and making use of the classical estimation methods, while a "classical" statistician might happen to write and apply the Bayes rule here and there, whenever it seems appropriate. A number of computations derived from the two theoretical backgrounds end up exactly the same.

Thus, in practice, labeling things as "Bayesian" or "non-Bayesian" is still largely a philosophical choice. For example, there are methods in machine learning, ensemble learners, that are for some reason never labeled/marketed as being "Bayesian" nor were they probably invented by someone "Bayesian", although at their core those would be among the best examples of where a Bayesian approach is different from a classical one. Those are also among the best-performing models quite often, by the way.

Collecting large amounts of data and then using it to "teach" computers to automatically recognize patterns is pretty much standard practice nowadays. It seems that, given enough data and the right methods, computers can get quite precise at detecting or predicting nearly anything, whether it is face recognition, fraud detection or movie recommendations.

Whenever a new classification system is created, it is taken for granted that the system should be as precise as possible. Of course, classifiers that never make mistakes are rare, but if it is possible, we should strive to have them make as few mistakes as possible, right? Here is a fun example, where things are not as obvious.

Consider a bank, which, as is normal for a bank, makes money by giving loans to its customers. Of course, there is always a risk that a customer will default (i.e. not repay the loan). To account for that, the bank has a risk scoring system which, for a given loan application, assesses the probability that the corresponding customer may default. This probability is later used to compute the interest rate offered to the customer. To simplify a bit, the interest on a loan might be computed as the sum of the customer's predicted default probability and a fixed profit margin. For example, if a customer is expected to default with probability 10% and the bank wants 5% profit on its loans on average, the loan might be issued at slightly above 15% interest. This would cover both the expected losses due to non-repayments as well as the profit margin.

Now, suppose the bank managed to develop a perfect scoring algorithm. That is, each application gets a rating of either having 0% or 100% risk. Suppose as well that within a month the bank processes 1000 applications, half of which are predicted to be perfectly good, and half - perfectly bad. This means that 500 loans get issued with a 5% interest rate, while 500 do not get issued at all.

Think what would happen if the system did not do such a great job and confused 50 of the bad applications with good ones. In this case 450 applications would be classified as "100%" risk, while 550 would be assigned a risk score of "9.1%" (we still require the system to provide valid risk probability estimates). The bank would then issue a total of 550 loans at about 15%. Of course, 50 of those would not get repaid, yet this loss would be covered by the increased interest paid by the honest borrowers. The rate of return on the money lent is thus exactly the same as with the perfect classifier. However, the bank now has more clients. More applications were signed, and more contract fees were received.

True, the clients might be a bit less happy about getting a higher interest rate, but assuming they were ready to pay it anyway, the bank does not care. In fact, the bank would be more than happy to segment its customers by offering higher interest rates to low-risk customers anyway. It cannot do it openly, though. The established practices usually constrain banks to make use of "reasonable" scorecards and offer better interest rates to low-risk customers.

Hence, at least in this particular example, a "worse" classifier is in fact better for business. Perfect precision is not really the ultimately desired feature. Instead, the system is much more useful when it provides a relevant and "smooth" distribution of predicted risk scores, making sure the scores themselves are decently precise estimates for the probability of a default.
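The arithmetic of the two scenarios above can be checked with a short sketch. It rests on simplifying assumptions of my own: every loan has a principal of 1, lasts one period, and a defaulting customer repays nothing. Under those assumptions the rate that yields the desired margin on average is (risk + margin) / (1 - risk), which is the "slightly above" version of the simple sum used in the text. The function names are hypothetical.

```python
def offered_rate(default_prob, margin):
    """Interest rate at which the expected return per unit lent equals `margin`.

    Solves (1 - default_prob) * (1 + rate) = 1 + margin for `rate`.
    """
    return (default_prob + margin) / (1.0 - default_prob)


def portfolio(n_loans, n_defaults, margin=0.05):
    """Return (rate, absolute profit, profit per unit lent) for a batch of loans."""
    risk = n_defaults / n_loans               # calibrated risk score of the batch
    rate = offered_rate(risk, margin)
    repaid = (n_loans - n_defaults) * (1.0 + rate)
    profit = repaid - n_loans                 # principal of 1 per loan
    return rate, profit, profit / n_loans


perfect = portfolio(500, 0)    # perfect classifier: 500 good loans issued
sloppy = portfolio(550, 50)    # 50 bad applications slipped through
print(perfect)
print(sloppy)
```

With these assumptions both portfolios earn 5% on the money lent; the imperfect classifier simply lends more money (550 instead of 500 units) at a rate of 15.5%, so the absolute profit is slightly larger even before any contract fees are counted.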

The development of proper GPU-based implementations of neural network training methods in recent years has led to a steady growth of exciting practical examples of their potential. Among others, the topic of face recognition (not to be confused with face detection) is on the steady rise. Some 5 years ago or so, decent face recognition tools were limited to Google Picasa and Facebook, some research labs and a few commercial products, often branded with the word "Biometrics" (that somehow seems to grow out of fashion nowadays).

Hence, the growth in popularity and use of face recognition is apparent. Given that the initially overinflated hype around the whole deep learning buzzword seems to have more-or-less settled down to reality, this time the growth is realistic. We are probably on the "enlightenment" segment of the hype cycle here, very close to reaching actual productivity (which is not without issues, though).

During a Skype University Hackathon, which happened in April, I had the chance to join forces with them to present their efforts in the form of a fun public web app. The idea was to let people search for similar faces in the Estonian archive photos. The resulting site was called "teisik.ee" ("doppelganger" in Estonian). Although it does not seem to exactly fulfil the "doppelganger finding" purpose, it does manage to identify persons known to the system from the database surprisingly well.

Output from teisik.ee

Having observed that finding matches to celebrities, even if they are not perfect matches, is entertaining in its own way, we also managed to put up a second version of the service (all within that same weekend!), codenamed celebritymatch.me. This app lets you search the dataset of celebrity faces for those which are apparently similar (according to the opinion of our neural network, at least). Try it, it is not perfect, but rather fun.

The implementation of the recognition system is rather straightforward for anyone who knows what a convolutional network is and otherwise pretty impossible to grasp in full, hence I won't go into much technical detail. It is implemented using Caffe and consists of three consecutive convolutional layers (with ReLU and max-pooling), followed by a fully-connected hidden layer, which is then fully-connected to a softmax classification output layer. The outputs of the penultimate layer (of size 64) are used as the feature representation of the face. Those feature representations are then used to find the nearest neighbors, by Euclidean distance, in the database of faces. The plan is to later apply the probably smarter FaceNet approach to network training for the same use case. The webapp is done using Flask.
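The last step, looking up the Euclidean nearest neighbors of a feature vector, can be sketched in a few lines. This is not the code behind the site: the function name, the toy 3-dimensional vectors (standing in for the 64-dimensional features) and the names in the database are all made up for illustration.

```python
import math


def nearest_faces(query, database, k=3):
    """Return the k database entries closest to `query` in Euclidean distance.

    `database` maps a (hypothetical) person name to a feature vector.
    """
    scored = [(math.dist(query, vec), name) for name, vec in database.items()]
    scored.sort()
    return [(name, dist) for dist, name in scored[:k]]


# Toy feature vectors; real ones would come from the network's penultimate layer.
faces = {
    "person_a": [0.0, 0.0, 0.0],
    "person_b": [1.0, 0.0, 0.0],
    "person_c": [5.0, 5.0, 5.0],
}
print(nearest_faces([0.9, 0.1, 0.0], faces, k=2))
```

A linear scan like this is fine for a small archive; for a large database one would normally switch to a spatial index or an approximate nearest-neighbor library.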

Right after the hackathon Tambet was invited to give an interview about the project on Estonian television. If you understand Estonian, check it out, it is very good.

Suppose that you are working as a security guard at the airport. Your task is to look at people who pass the security line and pick some of them as being worthy of a more detailed screening. Now, of course, telling whether a person is a potential criminal or not by just looking at him/her is hard, if at all possible, but you need to do something. You have been put there for some reason, after all.

One of the simplest ways to approach the problem, mentally, is the following. You assign a "risk value" for each person. At the beginning (when you don't have any information about the person at all) you set this value to zero.

Now you start studying various features of the person in front of you: is it a male or a female? Is it a kid? Is he behaving nervously? Is he carrying a big bag? Is he alone? Did the metal detector beep? Is he a foreigner? etc. For each of those features you know (subconsciously due to your presuppositions, or from actual statistics) the average increase or decrease in risk of the person being a criminal that it entails. For example, if you know that the proportion of males among criminals is the same as the proportion of males among non-criminals, observing that a person is male will not affect his risk value at all. If, however, there are more males among criminals (suppose the percentage is, say, 70%) than among decent people (where the proportion is around 50%), observing that a person in front of you is a male will increase the "risk level" by some amount (the value is log(70%/50%) ~ 0.3, to be precise). Then you see that a person is nervous. OK, you think, 90% of criminals are nervous, but only 50% of normal people are. This means that nervousness should entail a further risk increase (of log(0.9/0.5) ~ 0.6, to be technical again, so by now you have counted a total risk value of 0.9). Then you notice it is a kid. Wow, there is only 1% of kids among criminals, but around 10% among normal people. Therefore, the risk value change due to this observation will be negative (log(0.01/0.10) ~ -2.3, so your totals are around -1.4 by now).

You can continue this as long as you want, including more and more features, each of which will modify your total risk value by either increasing it (if you know this particular feature is more representative of a criminal) or decreasing it (if the feature is more representative of a decent person). When you are done collecting the features, all that is left for you is to compare the result with some threshold level. Say, if the total risk value exceeds 10, you declare the person in front of you to be potentially dangerous and take him or her for a detailed screening.

The benefit of such an approach is that it is rather intuitive and simple to compute. The drawback is that it does not take the cross-play of features into account. It may very well be the case that while the feature "the person is a kid" on its own greatly reduces the risk value, and the feature "has a moustache" on its own has close to no effect, a combination of the two ("a kid with a moustache") would actually have to increase the risk by a lot. This would not happen when you simply add the separate feature contributions, as described above.
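The additive scoring described above fits in a few lines of code. The sketch below uses the percentages from the example (natural logarithms, starting value 0, threshold 10); the function and variable names are mine, and the numbers are of course just as fictional as in the text.

```python
import math


def risk_increment(share_among_criminals, share_among_decent):
    """Change in the risk value caused by observing one feature."""
    return math.log(share_among_criminals / share_among_decent)


def total_risk(observed_features, start=0.0):
    """Sum the per-feature increments, ignoring any interplay between features."""
    return start + sum(risk_increment(c, d) for _, c, d in observed_features)


# (feature, share among criminals, share among decent people)
observed = [
    ("male", 0.70, 0.50),
    ("nervous", 0.90, 0.50),
    ("kid", 0.01, 0.10),
]

THRESHOLD = 10.0
score = total_risk(observed)
print(round(score, 2), "-> screen" if score > THRESHOLD else "-> let through")
```

The total comes out at about -1.38, matching the running total of roughly -1.4 in the example. Because each feature contributes independently, a "kid with a moustache" would get exactly the sum of the two separate contributions, which is the limitation discussed above.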

The issues related to scientific publishing, peer-review and funding always make for popular discussion topics at conferences. In fact, the ongoing ECML PKDD 2012 had a whole workshop, where researchers could complain about (or rather, discuss) some of their otherwise interesting results that were hard or impossible to publish. The rejection reasons ranged from "a negative result" or "too small to be worthy of publication" to "lack of theoretical justification". The overall consensus seemed to be that this is indeed a problem, at least in the field of machine learning.

The gist of the problem is the following. Machine learning relies a lot on computational experiments - empirically measuring the performance of methods in various contexts. The current mainstream methodology suggests that such experiments should primarily play a supportive role, either demonstrating a general theoretic statement, or simply measuring the exact magnitude of the otherwise obvious benefit. This, unfortunately, leaves no room for "unexpected" experimental results, where the measured behaviour of a method is either contradicting or at least not explained by the available theory. Including such results in papers is very difficult, if not impossible, as they get criticised heavily by the reviewers. A reviewer expects all results in the paper to make sense. If anything is strange, it should either be explained or would better be disregarded as a mistake. This is a natural part of the quality assurance process in science as a whole.

Quite often, though, unexpected results in computational experiments do happen. They typically have little relevance to the main topic of the paper, and the burden of explaining them can be just too large for a researcher to pursue. It is way easier to either drop the corresponding measurement, or find a dataset that behaves "nicely". As a result, a lot of relevant information about such cases never sees the light of day. Thus, again and again, other researchers would continue stumbling on similar unexpected results, but continue shelving them away.

The problem would not be present if the researchers cared to, say, write up such results as blog posts or tech reports on arXiv, thus making the knowledge available. However, even formulating the unexpected discoveries in writing, let alone going any deeper, is often regarded as a waste of time that won't get the researcher much (if any) credit. Indeed, due to how scientific funding works nowadays, the only kind of credit that counts for a scientist is (co-)authoring a publication in a "good" journal or conference.

I believe that with time, science will evolve to naturally accommodate such smaller pieces of research into its process (mini-, micro-, nano-publications?), providing the necessary incentives for the researchers to expose, rather than shelve their "unexpected" results. Meanwhile, though, other methods could be employed, and one of the ideas that I find interesting is the concept I'd call "co-authorship licensing".

Instead of ignoring a "small", "insignificant", or an "unexpected" result, the researcher should consider publishing it as either a blog post or a short (yet properly written) tech report. He should then add an explicit requirement that the material may be referred to, cited, or used as-is in a "proper" publication (a journal or a conference paper) on the condition that the author of the post must be included in the author list of the paper.

I feel there could be multiple benefits to such an approach. Firstly, it non-invasively addresses the drawbacks of the current science funding model. If being cited as a co-author is the only real credit that counts in the scientific world, why not use it explicitly and thus allow researchers to effectively "trade" smaller pieces of research? Secondly, it enables a meaningful separation of work. "Doing research" and "publishing papers" are two very different types of activities. Some scientists, who are good at producing interesting experimental results or observations, can be completely helpless when it comes to the task of getting their results published. On the other hand, those who are extremely talented in presenting and organizing results into high-quality papers may often prefer the actual experimentation to be done by someone else. Currently, the two activities have to be performed by the same person or, at best, by people working at the same lab. Otherwise, if the obtained results are not immediately "properly" published, there is no incentive for the researchers to expose them. "Co-authorship licensing" could provide this incentive, acting as an open call for collaboration at the same time. (In fact, the somewhat ugly "licensing" term could be replaced with a friendlier equivalent, such as "open collaboration invitation", for example. I do feel, though, that it is more important to stress that others are allowed to collaborate rather than that someone is invited to).

I'll conclude with three hypothetical examples.

A Bachelor's student makes a nice empirical study of System X in his thesis, but has no idea how to turn this into a journal article. He publishes his work on arXiv under a "co-authorship license", where it is found by a PhD student working in this area, who was lacking exactly those results for his next paper.

A data miner at company X, as a side-effect of his work, ends up with a large-scale evaluation of learning algorithm Y on an interesting dataset. He puts those results up as a "co-authorship licensed" report. It is discovered by a researcher, who is preparing a review paper about algorithm Y and is happy to include such results.

A bioinformatician discovers unexpected behaviour of algorithm X on a particular dataset. He writes his findings up as a blog post with a "co-authorship license", where those are discovered by a machine learning researcher, who is capable of explaining the results, putting them in context, and turning them into an interesting paper.

It seems to me that without the use of "co-authorship licensing" the situations above would end in no productive results, as they do nowadays.

Of course, this all will only make sense once many people give it a thought. Unfortunately, no one reads this blog 🙂

For many people, the ability for learning and adaptation seems like something unique, extremely complicated and mysterious. Indeed, those are the abilities we almost exclusively associate with high levels of intelligence and knowledge. This is, however, an illusion. Although adaptive behaviour might indeed look complex, it is not necessarily driven by "intelligent" mechanisms. One of the best illustrations of this is a fully-fledged self-learning machine made from plain matchboxes.

A Tic-Tac-Toe machine by James Bridle

The idea for such a machine was first introduced in 1960 by Donald Michie, who devised a simple self-learning algorithm for Tic-Tac-Toe (reminiscent of what is now known as reinforcement learning). Due to the lack of appropriate computing power, he implemented it "in hardware" using 300 or so matchboxes.

The idea of the machine is simple. There is a matchbox corresponding to each game position where the "computer" has to make a move. The matchbox contains colored beads, each color corresponding to a particular move. The decision is made by picking a random bead from the matchbox. Initially (when the machine is "untrained"), there is an equal number of beads of each color, and the machine thus picks each move with equal probability. After each game, however, the machine is "punished" by removing beads corresponding to losing moves, or "rewarded" by adding beads corresponding to winning moves. Thus, after several games, the machine will adapt its strategy towards a winning one.

The idea was popularized by Martin Gardner in one of his Scientific American articles (later published in the book "The Unexpected Hanging and Other Mathematical Diversions"). Gardner invented a simple game of "Hexapawn", and derived a matchbox machine for it, which required as few as 19 matchboxes. In the same article, however, he also suggests creating a matchbox machine for "Mini-checkers", that is, checkers played on a 4x4 board. Ever since I saw this article some 20 or so years ago I have been thinking of making one. This summer, while teaching a machine learning course in a summer school in Kiev, I actually made one. It served both to fulfil my age-old desire and as a teaching aid. You can make one too now, if you are interested.

The Mini-checkers Machine

The rules of mini-checkers are exactly like those of usual checkers, with three modifications:

The game is played on a 4x4 field. White is the first one to move. Machine plays for black.

Whenever both players get a King, the game immediately ends in a draw.

The King must always move to the furthest possible position in the chosen direction.

To make the machine, you first have to buy and empty 24 matchboxes. Next, print out and stick the 24 game positions onto the boxes. On each box, draw all of black's possible moves as arrows using colored markers. Finally, for each colored arrow, add 2 beads of the same color into the matchbox. That's it, your machine is ready to play.

The Mini-checkers machine

The game proceeds as already described: whenever the machine (the black player) has to make a decision (i.e. whenever it has to make a move and there is more than one possibility), find the matchbox with the current position depicted on it, shake it, and pick a random bead. This will tell you the decision of the machine. If the corresponding matchbox is empty, the machine forfeits. You should keep the matchboxes, corresponding to the moves that were made, open until the end of the game.

Once the game is over, the machine is "taught":

If the machine won, do nothing.

If the game was a draw, remove the bead corresponding to the machine's last move from the matchbox, unless it was the last bead of that color in the box.

If the machine lost, remove all the beads, corresponding to the machine's last move, from the last matchbox.
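The bead bookkeeping above is easy to mimic in code. The sketch below models only the matchboxes and the three teaching rules, not the game of mini-checkers itself; the class name, the move labels and the way a game history is passed around are my own assumptions.

```python
import random


class Matchbox:
    """One game position: a bag of colored beads, one color per legal move."""

    def __init__(self, moves, beads_per_move=2):
        self.beads = {move: beads_per_move for move in moves}

    def pick(self, rng=random):
        """Draw a random bead; return None if the box is empty (machine forfeits)."""
        moves = [m for m, count in self.beads.items() if count > 0]
        if not moves:
            return None
        weights = [self.beads[m] for m in moves]
        return rng.choices(moves, weights=weights)[0]


def teach(history, outcome):
    """Apply the teaching rules after a game.

    `history` is the list of (matchbox, move) pairs the machine used, in order;
    only the last pair is ever modified. `outcome` is "win", "draw" or "loss".
    """
    if outcome == "win" or not history:
        return
    box, move = history[-1]
    if outcome == "draw":
        if box.beads[move] > 1:      # keep the last bead of that color
            box.beads[move] -= 1
    elif outcome == "loss":
        box.beads[move] = 0          # remove all beads of the losing move


demo_box = Matchbox(["red", "green"])
teach([(demo_box, "red")], "loss")
print(demo_box.beads, demo_box.pick())
```

After the single lost game in the demo, the "red" move can no longer be drawn from that box, so the machine will always play "green" from this position.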

It takes about 30 games or so for the machine to actually learn to play well enough. Of course, a human would understand the strategy much earlier, but it's fun none the less.

Playing with the machine will immediately lead you towards two important questions:

How efficient is the suggested learning procedure? Can it be improved and generalized?

How do you make a matchbox machine for a more complex game without having to manage thousands of matchboxes?

As far as I know, contemporary machine learning has only partial answers to both of them.

The following text will only make sense to you if you know the technical details of Support Vector Machine learning.

Having recently had to give some lectures (1,2,3,4) and exercise sessions (1,2,3,4) on linear classification in a machine learning course, I discovered that one of the most famous linear classification methods, the classical Support Vector Machine, does not seem to be fully specified in any of the prominent reference materials. Namely, the SVM learning procedure produces two parameters: the coefficient vector $w$ and the bias term $b$. While the problem of finding $w$ is typically explained in sufficient detail, the procedure of computing $b$ is usually swept under the carpet.

Here is, for example, what we see in Vapnik's seminal work, the birthplace of SVMs (Section 10.2, page 411):

According to my calculations this is plain incorrect, because the corresponding Kuhn-Tucker conditions (in the "soft margin" SVM that we are considering here) are actually:

The important difference is the presence of the slack term $\xi_i$, which is unknown, hence the equation is not useful for finding $b$. Later on, the reader is also presented with the following summarizing description of the algorithm:

Do you see any mention of the way to compute $b$? I would expect the author to be somewhat more helpful with respect to this detail, which is obviously important for anyone willing to implement their own version of SVM.

This suggests that in order to compute we need to find , such that , i.e. there must exist at least one training point lying exactly on the margin. Although in most practical situations such support vectors will indeed exist, it is also theoretically possible that there won't be any, i.e. not a single support vector will lie exactly on the margin. Thus, for purposes of implementing SVMs, the provided specification is incomplete.

The same problem is present in the second book:

Take a careful look at lines 5-6, which claim that in order to compute we need to choose such that the corresponding are strictly between 0 and . This is not necessarily true for any .

So then, what is the proper, fully general way of computing $b$? Of course, it is not too hard to derive, and thus makes for a great course homework exercise (e.g. Exercise 4 here). If you managed to read up to this point, I presume you should be motivated enough to try solving it. If you give up, you are welcome to consult my sample solution (Exercise 4, page 7).
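For readers who would rather see one possible answer in code, here is a minimal sketch of a common approach. It is not taken from the sample solution mentioned above, and the function name, argument names and tolerance are my own. It assumes the dual coefficients of a soft-margin SVM with box constraint C are already known, and uses the Kuhn-Tucker conditions to bound the bias from both sides: a point with coefficient below C must not violate the margin, and a point with a positive coefficient must not lie strictly outside it. When some coefficient lies strictly between 0 and C the two bounds coincide; otherwise any value in the interval satisfies the conditions, and taking the midpoint is a common choice.

```python
import math


def bias_interval(alphas, ys, xs, C, kernel, tol=1e-9):
    """Return (low, high): the range of bias values allowed by the KKT conditions."""
    lows, highs = [], []
    for a_i, y_i, x_i in zip(alphas, ys, xs):
        # Decision value without the bias term.
        f_i = sum(a * y * kernel(x, x_i) for a, y, x in zip(alphas, ys, xs))
        e_i = y_i - f_i
        below_c = a_i < C - tol      # slack must be zero: y_i * (f_i + b) >= 1
        above_0 = a_i > tol          # on or inside the margin: y_i * (f_i + b) <= 1
        if (y_i > 0 and below_c) or (y_i < 0 and above_0):
            lows.append(e_i)         # this point demands b >= e_i
        if (y_i > 0 and above_0) or (y_i < 0 and below_c):
            highs.append(e_i)        # this point demands b <= e_i
    return max(lows, default=-math.inf), min(highs, default=math.inf)


def linear(u, v):
    return u * v


# Two 1-D points; with C = 0.25 both dual coefficients sit at the upper bound,
# so no support vector lies exactly on the margin.
low, high = bias_interval([0.25, 0.25], [-1, 1], [-1.0, 1.0], 0.25, linear)
b = (low + high) / 2
print(low, high, b)
```

In the toy example the allowed interval is [-0.5, 0.5]: every bias in that range gives the same value of the primal objective, which is exactly the situation where a rule of the form "pick a support vector on the margin" has nothing to pick.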