AI News, Where did the least-square come from?

Where did the least-square come from?

Question: Why do you square the error in a regression machine learning task?

But yes, we should definitely have the argument ready about where popular loss functions like least-square and cross-entropy come from — at least when we try to find the most likely hypothesis for a supervised learning problem using Bayesian argument.

Read on… Bayes’ theorem is probably the most influential identity of probability theory for modern machine learning and artificial intelligence systems.

This essentially tells that you update your belief (prior probability) after seeing the data/evidence (likelihood) and assign the updated degree of belief to the term posterior probability.

The equation above looks simple but it is notoriously tricky to compute in practice — because of extremely large hypothesis space and complexity in evaluating integrals over complicated probability distribution functions.

After these two simplifying assumptions, the maximum likelihood (ML) hypothesis can be given by, We generally start using least-square error while learning about simple linear regression back in Stats 101 but this simple-looking loss function resides firmly inside pretty much every supervised machine learning algorithm viz.

This means that every data point (d) in a supervised learning training data set can be written as the sum of the unknown function f(x) (which the learning algorithm is trying to approximate) and an error term which is drawn from a Normal distribution of zero mean (μ) and unknown variance σ².

For some dependent variable $y$ and explanatory variables $x_1, \dots, x_n$, where $\varepsilon$ is the residual due to random variation or other noisy factors.

Of course, we do not know the true values for these $\beta$ parameters (also called regression coefficients) so they end up being point estimates as well.

When given data, one technique we can use is ordinary least squares, sometimes just called least squares regression, which looks for parameters $\beta_0, \dots, \beta_n$ such that the sum of the squared residuals (i.e.

Much of machine learning can be framed as optimization problems - there is some kind of objective (also called loss or cost) function which we want to optimize (e.g.

Where: Some optimization terminology: So we have our training set $\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}) \}$ where $y \in \{0, 1\}$ and with the hypothesis function from before.

Here is the cost function for linear regression: $$ J(\theta) = \frac{1}{m} \sum^m_{i=1} \frac{1}{2} (h_{\theta}(x^{(i)}) - y^{(i)})^2 $$ Note that the $\frac{1}{2}$ is introduced for convenience, so that the square exponent cancels out when we differentiate.

The cost function for logistic regression is different than that used for linear regression because the hypothesis function of logistic regression causes $J(\theta)$ to be non-convex, that is, look something like the following with many local optima, making it hard to converge on the global minimum.

You could use other cost functions for logistic regression, but this one is derived from the principle of maximum likelihood estimation and has the nice property of being convex, so this is the one that basically everyone uses for logistic regression.

For example, if the right-hand side of that equation was a function func(j, t0, t1), you would implement it like so (example is $n=2$): $\alpha$ is the learning rate and tells how large a step/increment to change the parameters by.

With calculus, you find the optimum of a function by calculating where its derivatives equal 0 (the intuition is that derivatives are rates of change, when the rate of change is zero, the function is 'turning around' and is at a peak or valley).

So we can take the same cost function we've been using for linear regression and take the partial derivatives of the cost function $J$ with respect to every parameter of $\theta$ and then set each of these partial derivatives to 0: $$ J(\theta_0, \theta_1, \dots, \theta_m) = \frac{1}{2m} \sum^m_{i=1} (h_{\theta}(x^{(i)}) - y^{(i)})^2 $$ And for each $j$ $$ \frac{\partial}{\partial \theta_j} J(\theta) = \dots = 0 $$ Then solve for $\theta_0, \theta_1, \dots, \theta_m$.

The fast way to do this is to construct a matrix out of your features, including a column for $x_0 = 1$ (so it ends up being an $m \times (n+1)$ dimensional matrix) and then construct a vector out of your target variables $y$ (which is an $m$-dimensional vector): If you have $m$ examples, $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$, and $n$ features and then include $x_0 = 1$, you have the following feature vectors: $$x^{(i)} = \begin{bmatrix} x^{(i)}_0

One class of selection strategies is called stepwise selection because they iteratively remove or add one feature at a time, measuring the goodness of fit for each.

The two approaches here are the backward-elimination strategy which begins with the full model and removes one feature at a time, and the forward-selection strategy which is the reverse of backward-elimination, starting with one feature and adding the rest one at a time.

It's a combination of an encoder function that converts input data into a different representation and a decoder function which converts the new representation back into its original format.

These may not be explicit in the data, 'they may exist either as unobserved objects or forces in the physical world that affect the observable quantities, or they are constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data.'

For example, if we trained a MLP for image recognition, the first layer may end up learning representations of edges, the next may see corners and contours, the next may identify higher level features like faces, etc.

More formally, with feature scaling you want to get every feature into approximately a $-1 \leq x_i \leq 1$ range (it doesn't necessarily have to be between -1 and 1, just so long as there is a consistent range across your features).

With feature scaling, you could also apply mean normalization, where you replace $x_i$ with $x_i - \mu_i$ (that is, replace the value of the $i$th feature with its value minus the mean value for that feature) such that the mean of that feature is shifted to be about zero (note that you wouldn't apply this to $x_0 = 1$).

The resulting $U$ matrix will be an $n \times n$ orthogonal matrix which provides the projected vectors you're looking for, so take the first $k$ column vectors of $U$.

This $n \times k$ matrix can be called $U_{\text{reduce}}$, which you then transpose to get these vectors as rows, resulting in a $k \times n$ matrix which you then multiply by your feature matrix.

In factor analysis, the retained principal components are called common factors and their correlations with the input variables are called factor loadings.

Basic idea: Generate more data from your existing data by resampling Bagging (stands for Bootstrap Aggregation) is the way decrease the variance of your prediction by generating additional data for training from your original dataset using combinations with repetitions to produce multisets of the same cardinality/size as your original data.

By increasing the size of your training set you can't improve the model predictive force, but just decrease the variance, narrowly tuning the prediction to expected outcome.

The hypothesis takes the form: $$ h_{\theta}(x) = \theta_0 + \theta_1 x $$ Where the $\theta_i$s are the parameters that the learning algorithm learns.

In this case, we call it $J$, which looks like: $$ J(\theta_0, \theta_1) = \frac{1}{2m} \sum^m_{i=1} (h_{\theta}(x^{(i)}) - y^{(i)})^2 $$ Here it is the squared error function - it is probably the most commonly used cost function for regression problems.

You would, for example, just treat $x$ as a feature $x_1$, $x^2$ as another feature $x_2$, $x^3$ as another feature $x_3$, and so on: $$ \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n $$ Note that in situations like this, feature scaling is very important because these features' ranges differ by a lot due to the exponents.

Say we have our hypothesis function $$ h_{\theta}(x) = \theta^T x $$ With logistic regression, we apply an additional function $g$ : $$ h_{\theta}(x) = g(\theta^T x) $$ where $$ g(z) = \frac{1}{1+e^{-z}} $$ This function is known as the sigmoid function, also known as the logistic function, with the form: So in logistic regression, the hypothesis ends up being: $$ h_{\theta}(x) = \frac{1}{1+e^{\theta^Tx}} $$ The output of the hypothesis is interpreted as the probability of the given input belonging to the positive class.

Logistic regression does not have a closed form solution - that is, it can't be solved in a finite number of operations, so we must estimate its parameters using other methods, more specifically, we use iterative methods.

The technique of one-vs-all (or one-vs-rest) involves dividing your training set into multiple binary classification problems, rather than as a single multiclass classification problem.

Your first binary classifier will distinguish between class 1 and classes 2 and 3,, your second binary classifier will distinguish between class 2 and classes 1 and 3, and your final binary classifier will distinguish between class 3 and classes 1 and 2.

We can use linear models for non-regression situations, as we do with logistic regression - that is, when the output variable is not an unbounded continuous value directly computed from the inputs (that is, the output variable is not a linear function of the inputs), such as with binary or other kinds of classification.

We might come up with some linear function based on income and number of items purchased in the last month, but this won't give us a 0/no or a 1/yes, it will give us some continuous value.

Logistic regression is a type of models called generalized linear models (GLM), which involves two steps: This probability distribution is taken from the exponential family of probability distributions, which includes the normal, Bernoulli, beta, gamma, Dirichlet, and Poisson distributions (among others).

We may run into situations like the following: Where our data seems to encompass multiple models (the red, green, blue, and black ones going up from left to right), but if we try to model them all simultaneously, we get a complete incorrectly model (the dark grey line going down from left to right).

For instance, in the first example above, the intercepts varied, in which case the intercept coefficient would be replaced with a random variable $\alpha^i$ drawn from the normal distribution: $$y = \alpha^i + \beta^i x + \epsilon $$

Let's say only the intercept is affected - if we wanted to model the effects of US regions and US states on separate levels, then the $\alpha_i$ will be drawn from a distribution according to the US region, $\alpha_i \sim \Phi(\mu_{\text{region}}, \sigma^2_\alpha)$, and then the regional mean which parameterizes $\alpha_i$'s distribution is drawn from a distribution of regional means, $\mu_{\text{region}} \sim \Phi(\mu, \sigma_r^2)$.

we have just removed the $\frac{1}{m}$ term (removing it does not make a difference to our result because it is a constant) and substituted the log hypothesis terms for two new cost functions.

That is, where increasing $\lambda$ brings your parameters closer to zero, the regularization parameter $C$ has the opposite effect - as it grows, so do your parameters, and vice versa.

With that representation in mind, we can rewrite the objective by replacing the $\lambda$ with $C$ on the first term: $$min_{\theta} C \sum^m_{i=1} [y^{(i)}\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)}) \text{cost}_0(\theta^T x^{(i)})] + \frac{1}{2} \sum^n_{j=1} \theta_j^2 $$

It is the optimal one because it has the largest margins, illustrated by the red lines on the right (technically, the margin is orthogonal from the decision boundary to those red lines).

With this approach, classification becomes based on distances to the landmarks - points that are far away from certain landmarks will be classified 0, points that are close to certain landmarks will be classified 1.

Then given a training example $(x^{(i)}, y^{(i)})$, we can compute a feature vector $f$, where $f_0 = 1$, like so: $$f = \begin{bmatrix} f_0

If $n$ is small (1-1000) but $m$ is large (50000+), then you can create more features and then use logistic regression or SVM without a kernel, since otherwise SVMs struggle at large training sizes.

This basically means nonlinear kernels lose their appeal: they require way more resources to train with little to no gain in predictive performance, so why bother.

To make things easier to work with mathematically, we set $b = -c$ and rewrite this as: $$\vec w \cdot \vec u + b \geq 0 $$

We then add an additional constraint for an $x_i$ in the gutter (that is, within the margin of the decision boundary): $$y_i(\vec w \cdot \vec x + b) - 1 = 0 $$

You can take a negative example $\vec x_-$ and a positive example $\vec x_+$ and compute their difference $\vec x_+ - \vec x_-$.

This resulting vector is not orthogonal to the decision boundary, so we can project it onto the unit vector $\hat w$ (the unit vector of the $\vec w$, which is orthogonal to the decision boundary): $$\text{width} = (\vec x_+ - \vec x_-) \cdot \frac{\vec w}{||\vec w||} $$

We want to maximize the margins, that is, we want to maximize the width, and we can divide by $\frac{1}{2}$ because we still have a meaningful maximum, and that in turn can be interpreted as the minimum of the length of $\vec w$, which we can rewrite in a more mathematically convenient form (and still have the same meaningful minimum): $$max(\frac{2}{||\vec w||}) \to max(\frac{1}{||\vec w||}) \to min(||\vec w||) \to min(\frac{1}{2}||\vec w||^2) $$

We have to use Lagrange multipliers which provide us with this new function we can maximize without needing to think about our constraints anymore: $$L = \frac{1}{2} ||\vec w||^2 - \sum_i \alpha_i [y_i (\vec w \cdot \vec x_i + b) - 1] $$

Since the maximization and the decision rule depend only on the dot products of vectors, we can just substitute the transformation, so that: Since these are just dot products between the transformed vectors, we really only need a function which gives us that dot product: $$K(\vec x_i, \vec x_j) = \phi(\vec x_i) \cdot \phi(\vec x_j) $$

Some popular kernels: Many machine learning algorithms can be written in the form: $$w^T x + b = b + \sum_{i=1}^m \alpha_i x^T x^{(i)} $$

We can substitute $x$ with the output of a feature function $\phi(x)$ and the dot product $x^T x^{(i)}$ with a function $k(x, x^{(i)}) = \phi(x)^T \phi(x^{(i})$.

That is, within the $m$ leaf you have $N_m$ objects to consider and you count the number of a particular class $k$ in that set of objects and divide it by $N_m$ to get the probability $\hat p_{mk}$.

The hinge loss function takes the form $\ell(y) = \max(0, 1-t \cdot y)$ and is typically used for SVMs (sometimes squared hinge loss is used, which is just the previous equation squared).

The general algorithm is: Now suppose $w^{t+1}_i = \frac{w^t_i}{Z}e^{-\alpha^t h^t(x) y(x)}$, where $y(x)$ gives you the right classification (the right sign) for a given Training example.

For linear regression, you accomplish this by modifying the cost function to include the term $\lambda \sum^n_{i=1} \theta_j^2$ at the end: $$J(\theta) = \frac{1}{2m} \sum^m_{i=1} (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \lambda \sum^n_{i=1} \theta_j^2 $$

If you make $\lambda$ too large for your problem, you may make your parameters too close to 0 for them to be meaningful - large values of lambda can lead to underfitting problems (since the parameters get close to 0).

The additional $\lambda \sum^n_{i=1} \theta_j^2$ term is called the regularization loss, and the rest of the loss function is called the data loss.

the L2 norm of the parameters is constrained so that it less than or equal to some specified value (that is, this is L2 regularization): $$\hat \beta = \argmin_{\beta} (\sum_{i=1}^N(y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^p \beta^2_j) $$

Where: LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization method which constrains the L1 norm of the parameters such that it is less than or equal to some specified value: $$\hat \beta = \argmin_{\beta} (\frac{1}{2} \sum_{i=1}^N(y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^p |\beta_j|) $$

Then we can update gradient descent with the new derivative of this cost function for the parameters $\theta_j$ where $j \neq 0$ $$\begin{aligned}\theta_0 &:= \theta_0 - \alpha \frac{1}{m}\sum^m_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \\ \theta_j

We have some data $x_1, x_2, \dots, x_n$ and some latent variables $y_1, y_2, \dots, y_n$ we want to uncover, which correspond to each of our data points.

So the problem of inference is about learning about our latent variables given the observed data, which we can get via the posterior distribution: $$P(y_1, \dots, y_n |

Or, for classification, if we some classes, each parameterizing a joint distribution, we want to pick the class which maximizes the probability of the observed data: $$\argmax_{c}P(x_{n+1}|\theta^c)$$ Discriminative learning algorithms include algorithms like logistic regression, decision trees, kNN, and SVM.

Discriminative approaches try to find some way of separating data (discriminating them), such as in logistic regression which tries to find a dividing line and then sees where new data lies in relation to that line.

So we are looking to estimate the $\theta$ which maximizes this likelihood (this estimate is often notated $\hat \theta$, the hat typically indicates an estimator): $$\hat \theta = \argmax_{\theta} L(\theta;x_1, x_2, \dots, x_n) $$

(Notation note: $p(X;\theta)$ is read 'the probability density of $X$ as parameterized by $\theta$') Though the logarithm version mentioned above is typically preferred to avoid underflow.

the probability of $y$ given $x$, parameterized by $\theta$), in which case, given all our inputs $X$ and all our targets $Y$, we have the conditional maximum likelihood estimator: $$\theta_{\text{ML}} = \argmax_{\theta} P(Y|X;\theta) $$

for binomial distributions it is rather trivial Because $p$ is the probability of a successful trial, and it's intuitive that the most likely $p$ just reflects the number of observed successes over the total number of observed trials: $$\begin{aligned}\tilde \pi_{MLE} &= \argmax_{\pi} P(X|\pi) \\ P(y|X)

The two stages work as such: Intuitively, what EM does is tries to find the parameters $\hat \theta$ which maximizes the log probability $\log P(x|\theta)$ of the observed data $x$, much like MLE, except does so under the conditions of incomplete data.

Then we generate the weighted set of these possible completions by computing how much each of these coins, as weighted by the probabilities we just computed, contributed to the results for this experiment ($(5H, 5T)$): $$\begin{aligned}0.44(5H, 5T) &= (2.2H, 2.2T), \, \text{(coin A)} \\ 0.56(5H,

EM is similar to K-Means, but we use soft assignments instead - datapoints can belong to multiple clusters in varying strengths (the 'strengths' are probabilities of assignment to each cluster).

Machine Learning Theory - Part 2: Generalization Bounds

Last time we concluded by noticing that minimizing the empirical risk (or the training error) is not in itself a solution to the learning problem, it could only be considered a solution if we can guarantee that the difference between the training error and the generalization error (which is also called the generalization gap) is small enough.

This is an instance of wildly known fact about probability that if we retried an experiment for a sufficiency large amount of times, the average outcome of these experiments (or, more formally, the sample mean) will be very close to the true mean of the underlying distribution.

By recalling that the empirical risk is actually the sample mean of the errors and the risk is the true mean, for a single hypothesis $h$ we can say that: Well, that’s a progress, A pretty small one, but still a progress!

The law of large numbers is like someone pointing the directions to you when you’re lost, they tell you that by following that road you’ll eventually reach your destination, but they provide no information about how fast you’re gonna reach your destination, what is the most convenient vehicle, should you walk or take a cab, and so on.

samples of a random variable $X$ distributed by $P$, and $a \leq x_i \leq b$ for every $i$, then for a small positive non-zero value $\epsilon$: You probably see why we specifically chose Heoffding’s inequality from among the others.

We can naturally apply this inequality to our generalization probability, assuming that our errors are bounded between 0 and 1 (which is a reasonable assumption, as we can get that using a 0/1 loss function or by squashing any other loss between 0 and 1) and get for a single hypothesis $h$: This means that the probability of the difference between the training and the generalization errors exceeding $\epsilon$ exponentially decays as the dataset size goes larger.

But the learning problem doesn’t know that single hypothesis beforehand, it needs to pick one out of an entire hypothesis space $\mathcal{H}$, so we need a generalization bound that reflects the challenge of choosing the right hypothesis.

Using the union bound inequality, we get: We exactly know the bound on the probability under the summation from our analysis using the Heoffding’s inequality, so we end up with: Where $|\mathcal{H}|$ is the size of the hypothesis space.

By denoting the right hand side of the above inequality by $\delta$, we can say that with a confidence $1 - \delta$: And with some basic algebra, we can express $\epsilon$ in terms of $\delta$ and get: This is our first generalization bound, it states that the generalization error is bounded by the training error plus a function of the hypothesis space size and the dataset size.

Our theoretical result was able to account for some phenomena (the memorization hypothesis, and any finite hypothesis space) but not for others (the linear hypothesis, or other infinite hypothesis spaces that empirically work).

This means that the event that $h_1$ has a generalization gap bigger than $\epsilon$ should be independent of the event that also $h_2$ has a generalization gap bigger than $\epsilon$, no matter how much $h_1$ and $h_2$ are close or related;

This would be a very good solution if we’re only interested in the empirical risk, but our inequality takes into its consideration the out-of-sample risk as well, which is expressed as: This is an integration over every possible combination of the whole input and output spaces $\mathcal{X, Y}$.

Take for example the rainbow of hypotheses in the above plot, it’s very clear that if the red hypothesis has a generalization gap greater than $\epsilon$, then, with 100% certainty, every hypothesis with the same slope in the region above it will also have that.

But this is not helpful for our mathematical analysis, as the regions seems to be dependent on the distribution of the sample points and there is no way we can precisely capture these dependencies mathematically, and we cannot make assumptions about them without risking to compromise the supremum claim.

In order to measure the accuracy of our model, we hold out a part of the training set to evaluate the model on after training, and we consider the model’s accuracy on this left out portion as an estimate for the generalization error.

Now that the right hand side in expressed only in terms of empirical risks, we can bound it without needing to consider the the whole of $\mathcal{X \times Y}$, and hence we can bound the term with the risk $R(h)$ without considering the whole of input and output spaces!

We can assume the independence of the hypotheses in $\mathcal{H}_{|S}$ like we did before with $\mathcal{H}$ (but it’s more plausible now), and use the union bound to get that: Notice that the hypothesis space is restricted by $S \cup S’$ because we using the empirical risk on both the original dataset $S$ and the ghost $S’$.

we consider a hypothesis to be a new effective one if it produces new labels/values on the dataset samples, then the maximum number of distinct hypothesis (a.k.a the maximum number of the restricted space) is the maximum number of distinct labels/values the dataset points can take.

This fact can be used to get a better bound on the growth function, and this is done using Sauer’s lemma: If a hypothesis space $\mathcal{H}$ cannot shatter any dataset with size more than $k$, then: This was the other key part of Vapnik-Chervonenkis work (1971), but it’s named after another mathematician, Norbert Sauer;

With that, and by combining inequalities (1) and (2), the Vapnik-Chervonenkis theory follows: This can be re-expressed as a bound on the generalization error, just as we did earlier with the previous bound, to get the VC generalization bound: or, by using the bound on growth function in terms of $d_\mathrm{vc}$ as:

Professor Vapnik standing in front of a white board that has a form of the VC-bound and the phrase “All your bayes are belong to us”, which is a play on the broken english phrase found in the classic video game Zero Wing in a claim that the VC framework of inference is superior to that of Bayesian inference.

Consider for example the case of linear binary classifiers in a very higher n-dimensional feature space, using the distribution-free $d_\mathrm{vc} = n + 1$ means that the bound on the generalization error would be poor unless the size of the dataset $N$ is also very large to balance the effect of the large $d_\mathrm{vc}$.

For example, For data points that are linearly separable, contained in a ball of radius $R$, with a margin $\rho$ between the closest points in the two classes, one can prove that for a hyperplane classifier: It follows that the larger the margin, the lower the $d_\mathrm{vc}$ of the hypothesis.

However, no matter what the exact form of the bound produced by any of these methods is, it always takes the form: where $C$ is a function of the hypothesis space complexity (or size, or richness), $N$ the size of the dataset, and the confidence $1 - \delta$ about the bound.

This form of the inequality holds to any learning problem no matter the exact form of the bound, and this is the one we’re gonna use throughout the rest of the series to guide us through the process of machine learning.