Wiki Category: Machine Learning

Locally Weighted Linear Regression

Locally weighted linear regression is a non-parametric learning algorithm: the amount of data we must keep around to represent the hypothesis hθ(x) grows linearly with the size m of the training set. Memory requirements therefore increase with the training set.

Motivation: we want an algorithm that makes it easy to fit curved (non-linear) data.

Look at the data in a small neighborhood around the point you’re interested in.

Build a local hypothesis just for that neighborhood and use it to make the prediction there.

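Given a location x where we want to make a prediction, fit θ to minimize $\sum_{i=1}^{m} w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$, where $w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$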

The weights depend on the particular point x at which we’re trying to evaluate the hypothesis:
if |x(i) − x| is small, then w(i) is close to 1
if |x(i) − x| is large, then w(i) is small (close to 0)

So how do we determine the appropriate values of θ?
We pick the θ that minimizes the weighted squared error, so the fit is dominated by the training examples closest to the query point (those with the highest weights) and largely ignores the far ones.

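Weight function: This form is selected because we want a bell-shaped curve that peaks at x and then falls off quickly away from it.

Bandwidth parameter τ: Controls the shape of the curve (fat vs. thin), i.e. how quickly the weight of a training example falls off with its distance from the query point x.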

Regular normal equation: $\theta = (X^T X)^{-1} X^T \vec{y}$
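Locally weighted normal equation: $\theta = (X^T W X)^{-1} X^T W \vec{y}$, where W is the diagonal matrix with entries $W_{ii} = w^{(i)}$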
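The local fit described above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `lwr_predict`, the default bandwidth, and the toy data are assumptions made for the example, and for multi-dimensional inputs it assumes the weights use the squared Euclidean distance to the query point.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Predict y at x_query with locally weighted linear regression.

    X: (m, n) design matrix whose first column is all ones (intercept).
    y: (m,) targets. x_query: (n,) query point, also with a leading 1.
    tau: bandwidth; smaller values give a thinner bell curve.
    """
    # Bell-shaped weights: close to 1 near the query, close to 0 far away.
    # The intercept column contributes 0 to the distance.
    sq_dist = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-sq_dist / (2.0 * tau ** 2))
    # Weighted normal equations: (X^T W X) theta = X^T W y.
    XtW = X.T * w  # same as X.T @ np.diag(w), without building W
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return x_query @ theta

# Toy data (made up): a curve y = x^2 that a single straight line fits badly.
x = np.linspace(-3.0, 3.0, 61)
X = np.column_stack([np.ones_like(x), x])
y = x ** 2
pred = lwr_predict(X, y, np.array([1.0, 2.0]), tau=0.3)  # close to 2^2 = 4
```

Note that θ is re-fit for every query point, which is why the whole training set has to be kept in memory.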

Probabilistic interpretation of data

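Assume the target variables and the inputs are related via
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where $\epsilon^{(i)}$ is an error term which captures unmodeled effects or random noise. Assuming the $\epsilon^{(i)}$ are IID Gaussian with mean zero and variance $\sigma^2$, the density of $\epsilon^{(i)}$ is given by
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$

This implies that
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$

where $p(y^{(i)} \mid x^{(i)}; \theta)$ is the distribution of y(i) given x(i), parameterized by θ. Equivalently, $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$.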

Likelihood function

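Given the design matrix X, which contains all the x(i)’s, and θ, the probability of the data viewed as a function of θ is the likelihood
$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)$$
where the product form follows from the independence of the error terms.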

Maximum likelihood estimation

We should choose θ so as to make the data as high probability as possible.

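Instead of maximizing L(θ) directly, we can maximize the log likelihood ℓ(θ):
$$\ell(\theta) = \log L(\theta) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

Maximizing ℓ(θ) is the same as minimizing $\frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$, which is the cost function J(θ).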
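A small numerical check of this equivalence, assuming Gaussian noise with a known standard deviation σ. The function names, the toy data, and the two candidate parameter vectors are made up for illustration. The point is that the log likelihood and the half squared error differ only by a constant and a factor of −1/σ², so whichever θ has the smaller squared error has the larger likelihood.

```python
import numpy as np

def log_likelihood(theta, X, y, sigma):
    """Gaussian log likelihood of the data for y = X theta + noise."""
    m = len(y)
    resid = y - X @ theta
    const = m * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma))
    return const - 0.5 * np.sum(resid ** 2) / sigma ** 2

def half_squared_error(theta, X, y):
    """One half of the sum of squared residuals (the least-squares cost)."""
    return 0.5 * np.sum((X @ theta - y) ** 2)

# Made-up data and two arbitrary candidate parameter vectors.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.4, 4.1])
sigma = 2.0
theta_a = np.array([1.0, 1.5])
theta_b = np.array([0.0, 1.0])

diff_ll = log_likelihood(theta_a, X, y, sigma) - log_likelihood(theta_b, X, y, sigma)
diff_j = half_squared_error(theta_a, X, y) - half_squared_error(theta_b, X, y)
# diff_ll equals -diff_j / sigma**2: the sigma-dependent constant cancels.
```

The choice of σ rescales the log likelihood but does not change which θ maximizes it.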

X = Y = R, where X denotes the space of input values, and Y the space of output values

Building the hypothesis function h(x)

To perform supervised learning, we must decide how we’re going to represent functions/hypotheses h in a computer. As an initial choice, let’s say we decide to approximate y as a linear function of x:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$$

We let x0 = 1 (this is the intercept term), further simplifying to
$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$$
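As a sketch of how the hypothesis might be represented in code: prepend a column of ones for x0 and take a dot product with θ. The helper names `add_intercept` and `h`, as well as the feature and parameter values, are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def add_intercept(X_raw):
    """Prepend the x0 = 1 column so theta[0] acts as the intercept term."""
    return np.column_stack([np.ones(len(X_raw)), X_raw])

def h(theta, X):
    """Linear hypothesis theta^T x, evaluated for every row x of X."""
    return X @ theta

# Made-up examples with two features each (say, living area and rooms).
X_raw = np.array([[2104.0, 3.0], [1600.0, 3.0]])
theta = np.array([50.0, 0.1, 20.0])  # made-up parameters
preds = h(theta, add_intercept(X_raw))
# First row: 50 + 0.1 * 2104 + 20 * 3 = 320.4
```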

Selecting the best parameters θ to predict y

We could have many features in our training set (e.g. number of rooms, lot size, taxes), and each feature gets its own parameter θj. So how do we choose the parameters θ in our hypothesis function in order to best predict our outcome (y)? Ans: Pick values of θ where the predicted value h(x) is close to the actual value y, at least for the training examples we have.

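Cost Function: The following formula defines a function J of θ that measures how far the predictions h(x(i)) are from the corresponding y(i); the objective is to make this difference as small as possible.
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$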
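A minimal sketch of the least-squares cost in NumPy, assuming the design matrix already contains the intercept column. The name `cost` and the three data points are illustrative only.

```python
import numpy as np

def cost(theta, X, y):
    """Half the sum of squared differences between predictions and targets."""
    residuals = X @ theta - y
    return 0.5 * (residuals @ residuals)

# Made-up data lying exactly on the line y = 2x.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

j_exact = cost(np.array([0.0, 2.0]), X, y)  # perfect fit, cost is 0
j_off = cost(np.array([0.0, 1.0]), X, y)    # residuals -1, -2, -3 give 0.5 * 14 = 7
```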

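Least mean squares (ordinary linear regression) is a parametric learning algorithm: even as the training set size m approaches ∞, the size of hθ(x) stays constant, because the hypothesis is fully described by the fixed set of parameters θ. Memory requirements are therefore constant.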

How do we optimize J(θ)? Ans: We want to choose θ so as to minimize J(θ)

There are several ways to do this:

Gradient Descent
Normal Equations

Gradient descent algorithm: A search algorithm that starts with some “initial guess” for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ).

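$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

where α is called the learning rate and is typically a small number. The update is performed simultaneously for all values of j = 0, …, n.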

1. For a single training example: Widrow-Hoff learning rule or LMS update rule.

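For a single training example (x, y), $\frac{\partial}{\partial \theta_j} J(\theta)$ simplifies to $(h_\theta(x) - y)\, x_j$; this results in the update rule
$$\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$$

Characteristics: The magnitude of the update is proportional to the error term $\left(y^{(i)} - h_\theta(x^{(i)})\right)$, such that if hθ(x(i)) is close to y(i), the parameters change by only a small increment.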
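The single-example rule can be sketched as one vectorized step. The function name `lms_update`, the learning rate, and the example values are assumptions for illustration. Applying the step twice to the same example shows the characteristic noted above: the second update is smaller because the error has shrunk.

```python
import numpy as np

def lms_update(theta, x, y, alpha=0.1):
    """One LMS (Widrow-Hoff) step on a single training example (x, y)."""
    error = y - x @ theta          # error term: target minus prediction
    return theta + alpha * error * x

x = np.array([1.0, 2.0])  # made-up example, x0 = 1 is the intercept term
y = 3.0
theta0 = np.zeros(2)
theta1 = lms_update(theta0, x, y)  # error is 3.0, so theta1 = [0.3, 0.6]
theta2 = lms_update(theta1, x, y)  # error is 1.5, so theta2 = [0.45, 0.9]
```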

Courtesy of Stanford SCPD – CS229 Machine Learning

Gradient descent can, in general, be susceptible to local optima.

For linear regression, however, the cost function J(θ) is a convex quadratic function, so it has no local optima other than the global optimum.

2. For multiple training examples:

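M1: Batch Gradient Descent

Repeat until convergence:
$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)} \quad \text{(for every } j\text{)}$$

The summation in the update rule above is just −∂J(θ)/∂θj (for the original definition of J), so this is simply gradient descent on the original cost function J.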

Evaluation: Batch gradient descent has to scan through the entire training set before taking a single step
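A minimal sketch of batch gradient descent for linear regression. The function name, the learning rate, the iteration count, and the toy data are assumptions picked so that this small example converges; they are not recommended defaults. Each iteration touches every training example once before θ changes.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, n_iters=2000):
    """Minimize the least-squares cost with full-batch gradient steps."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        errors = y - X @ theta                  # uses the whole training set
        theta = theta + alpha * (X.T @ errors)  # sum over all examples, every j at once
    return theta

# Made-up data lying exactly on y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = batch_gradient_descent(X, y)  # approaches [1, 2]
```

If α is too large for the data, the iterates diverge instead of converging, so in practice the learning rate has to be tuned.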

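M2: Stochastic Gradient Descent (also called incremental gradient descent)

Loop: for i = 1 to m,
$$\theta_j := \theta_j + \alpha \left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)} \quad \text{(for every } j\text{)}$$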

Evaluation: We repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only.

This method often gets θ close to the minimum much more rapidly than batch gradient descent, but it may never settle exactly at the minimum; the parameters can keep overshooting and oscillating around it. By slowly letting the learning rate α decrease to zero as the algorithm runs, it is also possible to ensure that the parameters converge to the global minimum rather than merely oscillate around it.
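A sketch of stochastic gradient descent with a decaying learning rate. The schedule α0 / (1 + decay · step) is just one simple choice assumed for this example, and the function name, constants, and data are likewise illustrative. The parameters are updated after every single training example.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha0=0.05, decay=1e-3, n_epochs=2000):
    """Run repeated in-order passes, updating theta after each example."""
    theta = np.zeros(X.shape[1])
    step = 0
    for _ in range(n_epochs):
        for i in range(len(y)):
            alpha = alpha0 / (1.0 + decay * step)  # assumed decay schedule
            error = y[i] - X[i] @ theta            # error on example i only
            theta = theta + alpha * error * X[i]
            step += 1
    return theta

# Made-up data lying exactly on y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = stochastic_gradient_descent(X, y)  # approaches [1, 2]
```

Because this toy data is noise-free, every per-example error goes to zero at the optimum; with noisy data the decaying α is what damps the oscillation.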

Normal Equation: The goal is to minimize J in closed form, by explicitly taking its derivatives with respect to the θj’s and setting them to zero (at a minimum, the gradient is 0).
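A minimal sketch of the closed-form approach, assuming XᵀX is invertible. The function name and data are illustrative. It solves the linear system XᵀXθ = Xᵀy rather than forming an explicit inverse, and then checks that the gradient of J really is zero at the solution.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, the zero-gradient condition for J."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up data lying exactly on y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = normal_equation(X, y)     # [1, 2] up to rounding
gradient = X.T @ (X @ theta - y)  # gradient of J at theta, should be ~0
```

Unlike gradient descent, there is no learning rate to tune and no iteration, at the price of solving an n × n linear system.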

Matrix derivatives
For a function f mapping from m × n matrices to real numbers, the gradient of f w.r.t. A, written ∇Af(A), is itself an m × n matrix whose (i, j) entry is ∂f/∂Aij.

Trace: tr A is the sum of the diagonal entries of a square n × n matrix A, $\mathrm{tr}\,A = \sum_{i=1}^{n} A_{ii}$

Properties of Traces

tr AB = tr BA
tr ABC = tr CAB = tr BCA
(AB)⊤ = B⊤A⊤
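These identities are easy to spot-check numerically. The sketch below uses arbitrary random matrices (the shapes and seed are assumptions) chosen so that every product is defined and square where a trace is taken.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 3))

# tr AB = tr BA, even though AB is 3x3 and BA is 4x4.
tr_ab, tr_ba = np.trace(A @ B), np.trace(B @ A)

# Cyclic permutations leave the trace unchanged.
tr_abc = np.trace(A @ B @ C)
tr_cab = np.trace(C @ A @ B)
tr_bca = np.trace(B @ C @ A)

# Transposing a product reverses the order of the factors.
transpose_ok = np.allclose((A @ B).T, B.T @ A.T)
```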

Additional rules

(1) ∇A tr AB = B⊤

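For (1), we can treat the trace of AB as the inner product of two long vectors: [A(1,1),…,A(1,n),…,A(n,1),…,A(n,n)] (the entries of A read by row) and [B(1,1),…,B(n,1),…,B(1,n),…,B(n,n)] (the entries of B read by column).

Thus, the trace is $\mathrm{tr}(AB) = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} B_{ji}$, which is a linear function of all elements of A.

Taking the partial derivative with respect to a single entry Apq gives
$$\left(\nabla_A \mathrm{tr}(AB)\right)_{pq} = \frac{\partial \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} B_{ji}}{\partial A_{pq}}$$
Since only one term in the RHS numerator contains Apq (namely ApqBqp), after applying the partial derivative we conclude (∇Atr(AB))pq = Bqp ⟹ ∇Atr(AB) = B⊤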
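The result can be checked numerically with finite differences. This is only a sanity check under assumed random 3 × 3 matrices; the helper name `numerical_gradient` and the step size are made up for the example. Each entry Apq is perturbed in turn and the resulting change in tr(AB) is compared against Bqp.

```python
import numpy as np

def numerical_gradient(f, A, eps=1e-6):
    """Central-difference estimate of the gradient of scalar f with respect to A."""
    grad = np.zeros_like(A)
    for p in range(A.shape[0]):
        for q in range(A.shape[1]):
            E = np.zeros_like(A)
            E[p, q] = eps  # perturb only the (p, q) entry
            grad[p, q] = (f(A + E) - f(A - E)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
grad = numerical_gradient(lambda M: np.trace(M @ B), A)
# grad agrees with B.T up to floating-point rounding
```

Because tr(AB) is linear in the entries of A, the central difference is exact apart from rounding error.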