This tutorial showcases LinearLeastSquares.jl through a few involved examples of linearly
constrained least squares problems. We’ll refer to LinearLeastSquares.jl as “LLS”
throughout.
The plots generated by the following examples use the Gadfly package.
The documentation on how to use Gadfly can be found here,
but reading it is not necessary unless you would like to create plots yourself.
To install Gadfly, run the following command in a Julia shell:

Pkg.add("Gadfly")

Some of the examples will also use data files, which can all be found here.

We will first try to fit a line to the data. A general function for a line is

\[f(x) = \alpha + \beta x\]

where \(\alpha\) is the offset and \(\beta\) is the slope.
We would like to pick \(\alpha\) and \(\beta\) so that our data points lie “close” to
our line. For a point with coordinates \(x\) and \(y\) the residual between the point
and our line is defined as

\[r(x, y) = f(x) - y.\]

One reasonable way to measure how different the line is from the data is to
sum the squares of the residuals between each data point and the line:
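For \(m\) data points \((x_1, y_1), \ldots, (x_m, y_m)\), this objective is

\[\sum_{i = 1}^m r(x_i, y_i)^2 = \sum_{i = 1}^m (\alpha + \beta x_i - y_i)^2.\]

A sketch of this fit in LLS, assuming the data vectors x and y have already been loaded:

```julia
using LinearLeastSquares

# the two coefficients of the line are the unknowns
alpha = Variable()
beta = Variable()

# sum of squared residuals between the line and the data
residuals = alpha + x * beta - y
fit_error = sum_squares(residuals)

# solve the least squares problem; the fitted coefficients
# can then be read off with evaluate(alpha) and evaluate(beta)
optval = minimize!(fit_error)
```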

A line is probably not the best function to fit to this data. Instead, let’s try
to fit a quadratic function, which has the form:

\[f(x) = \alpha + \beta x + \gamma x ^ 2\]

where the new coefficient \(\gamma\) corresponds to the quadratic term.
The same residual function from the linear regression example applies here;
we measure the error of our quadratic fit by summing the squares of the
residuals.
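For the same data points \((x_1, y_1), \ldots, (x_m, y_m)\), this objective is

\[\sum_{i = 1}^m (\alpha + \beta x_i + \gamma x_i ^ 2 - y_i)^2.\]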

A simple control problem on a system usually involves a variable \(x(t)\)
that denotes the state of the system over time, and a variable \(u(t)\) that
denotes the input into the system over time. Linear constraints are used to
capture the evolution of the system over time:

\[x(t) = Ax(t - 1) + Bu(t), \ \mbox{for} \ t = 1,\ldots, T,\]

where the numerical matrices \(A\) and \(B\) are called the dynamics and input matrices,
respectively.

The goal of the control problem is to find a sequence of inputs
\(u(t)\) that will allow the state \(x(t)\) to achieve specified values
at certain times. For example, we can specify initial and final states of the system:
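For instance, with given vectors \(x_{\mbox{init}}\) and \(x_{\mbox{final}}\) (the names here are illustrative), these constraints take the form

\[x(0) = x_{\mbox{init}}, \ \ x(T) = x_{\mbox{final}}.\]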

Additional states between the initial and final states can also be specified;
these are known as waypoint constraints. The input and state of the system
usually have physical meaning, so we often want to find a sequence of inputs
that also minimizes a least squares objective like the following:

\[\sum_{t = 0}^T \|Fx(t)\|^2_2 + \sum_{t = 1}^T\|Gu(t)\|^2_2,\]

where \(F\) and \(G\) are numerical matrices.

We’ll now apply the basic format of the control problem to an example of controlling
the motion of an object in a fluid over \(T\) intervals, each of \(h\) seconds.
The state of the system at time interval \(t\) will be given by the position and the velocity of the
object, denoted \(p(t)\) and \(v(t)\), while the input will be forces
applied to the object, denoted by \(f(t)\).
By the basic laws of physics, the relationship between force, velocity, and position
must satisfy:
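Over each interval of \(h\) seconds, a simple forward Euler discretization captures these laws:

\[p(t + 1) = p(t) + h \, v(t), \ \mbox{for} \ t = 0, \ldots, T - 1,\]

\[v(t + 1) = v(t) + h \, a(t), \ \mbox{for} \ t = 0, \ldots, T - 1.\]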

Here, \(a(t)\) denotes the acceleration at time \(t\), for which we use
\(a(t) = f(t) / m + g - d v(t)\),
where \(m\), \(d\), \(g\) are constants for the mass of the object, the drag
coefficient of the fluid, and the acceleration from gravity, respectively.

We would like to keep the forces small, perhaps to save fuel, and to keep
the velocities small for safety.
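One objective that captures this tradeoff, following the general form above, is

\[\mu \sum_{t = 0}^T \|v(t)\|_2^2 + \sum_{t = 1}^T \|f(t)\|_2^2.\]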
Here \(\mu\) serves as a parameter to control which part of the objective we
deem more important, keeping the velocity small or keeping the force small.

Tomography is the process of reconstructing a density distribution from given
integrals over sections of the distribution. In our example, we will
work with tomography on black and white images.
Let \(x\) be the vector of \(n\) pixel densities, with \(x_j\)
denoting how white pixel \(j\) is.
Let \(y\) be the vector of \(m\) line integrals over the image, with \(y_i\)
denoting the integral for line \(i\).
We can define a matrix \(A\) to describe the geometry of the lines. Entry
\(A_{ij}\) describes how much of pixel \(j\) is intersected by line \(i\).
Assuming our measurements of the line integrals are perfect, we have the relationship that

\[y = Ax\]

In practice, however, measurements usually contain small errors.
It therefore makes sense to minimize

\[\|y - Ax\|_2^2.\]

This is simply an unconstrained least squares problem, something we can
readily solve in LLS!
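A sketch of this problem in LLS, assuming the line matrix A, the measurement vector y, and the pixel count n have been loaded from the data files:

```julia
using LinearLeastSquares

# one density variable per pixel
x = Variable(n)

# minimize the sum of squared measurement errors
objective = sum_squares(y - A * x)
optval = minimize!(objective)

# the reconstructed pixel densities
pixels = evaluate(x)
```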

One common problem found in machine learning is the classification of a group of objects into two subgroups.
In this example, we will try to separate sports articles from
other texts in a collection of documents.

When classifying text documents, one of the most common techniques is to build
a term-by-document frequency matrix \(F\), where \(F_{ij}\)
reflects the frequency of term \(j\) in document \(i\).

The documents are then split into a training set and a testing set. Each
document in the training set is given a label: sports articles are labelled
with a \(1\) and all other text documents are labelled with a \(-1\).
One reasonable approach to classify the documents is to model the label
as an affine function of the term frequencies of the document:

\[\mbox{label}(i) = v + \sum_{j = 1}^n w_jF_{ij}.\]

The goal now is to find a scalar \(v\) and a weight vector \(w\), where \(w_j\) reflects how
important term \(j\) is in determining the label of the document. In our context, a positive value
means that the term is often seen in sports articles, while a negative value means
the term is often seen in the other documents. One reasonable approach to
finding the best \(w\) and \(v\) is to minimize the following objective:
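With \(m\) documents in the training set and \(n\) terms, one such objective is

\[\sum_{i = 1}^m \left(v + \sum_{j = 1}^n w_j F_{ij} - \mbox{label}(i)\right)^2 + \lambda \sum_{j = 1}^n w_j^2.\]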

The first part of the objective ensures that our linear model closely
reproduces the labels of our training documents. The second part ensures that
the components of \(w\) are relatively small;
keeping \(w\) small helps our model behave better on documents outside the training set.
The regularization parameter \(\lambda\)
is used to control how much we should prioritize keeping \(w\) small versus
how close the affine function should fit the labels.

We can now sort our weight vector \(w\) to see which words were the most
indicative of sports articles and which were most indicative of nonsports.

# print out the 5 words most indicative of sports and of nonsports
words = String[]
f = open("largeCorpusfeatures.txt")
for i = 1:length(evaluate(w))
    push!(words, readline(f))
end
close(f)

# sort the weights in ascending order
indices = sortperm(vec(evaluate(w)))

# the 5 most negative weights: most indicative of nonsports
for i = 1:5
    print(words[indices[i]])
end

# the 5 most positive weights: most indicative of sports
for i = 0:4
    print(words[indices[length(words) - i]])
end

Each run will yield different words, but it’ll be clear which words
come from sports articles.

A time series is a sequence of data points, each associated with a time.
In our example, we will work with a time series of daily
temperatures in the city of Melbourne, Australia over a period of a few years.
Let \(x\) be the vector of the time series, and \(x_i\) denote
the temperature in Melbourne on day \(i\).
Here is a picture of the time series:

We can quickly compute the mean of the time series to be \(11.2\). If
we were to always guess the mean as the temperature of Melbourne on a given day,
the RMS error of our guesswork would be \(4.1\). We’ll try to lower
this RMS error by coming up with better ways to model the temperature than
guessing the mean.
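These two statistics can be computed directly; a sketch, assuming the temperatures are loaded into a vector temps:

```julia
# mean of the series; the text reports roughly 11.2
mean_temp = mean(temps)

# RMS error of always guessing the mean; the text reports roughly 4.1
rms_error = sqrt(mean((temps .- mean_temp) .^ 2))
```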

A simple way to model this time series would be to find a smooth curve that
approximates the yearly ups and downs.
We can represent this model as a vector \(s\) where \(s_i\)
denotes the temperature on the \(i\)-th day.
To force this trend to repeat yearly, we simply want

\[s_i = s_{i + 365}\]

for each applicable \(i\).

We also want our model to have two more properties. The first is that
the temperature on each day in our model should be relatively close to the actual temperature of that day.
The second is that our model needs to be smooth, so the change in temperature from day to
day should be relatively small. The following objective would capture both properties:
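Introducing a parameter \(\lambda > 0\) to weight smoothness against closeness to the data, one such objective over the \(n\) days of the series is

\[\sum_{i = 1}^n (x_i - s_i)^2 + \lambda \sum_{i = 1}^{n - 1} (s_{i + 1} - s_i)^2.\]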

Our smooth model has an RMS error of \(2.7\), a significant improvement over
just guessing the mean, but we can do better.

We now make the hypothesis that the residual temperature on a given day,
\(\mbox{r}(i) = x_i - s_i\), is some linear combination of the residuals on the
previous \(5\) days. Such a model is called autoregressive: we are essentially
trying to fit the residuals as a function of other parts of the data itself.
We want to find a vector of coefficients \(a\) such that

\[\mbox{r}(i) \approx \sum_{j = 1}^5 a_j \mbox{r}(i - j)\]

This can be done by simply minimizing the following sum of squares objective:
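With \(n\) residuals in the series, one such objective is

\[\sum_{i = 6}^n \left(\mbox{r}(i) - \sum_{j = 1}^5 a_j \mbox{r}(i - j)\right)^2,\]

where the outer sum starts at \(i = 6\) so that all five earlier residuals exist.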