OPRs and Least Squares Approximation, Geometrically

Oct 8, 2017

Recently posted on The Blue Alliance blog was a fantastic post by Eugene, outlining the math behind OPRs through the use of the Least Squares approximation to solve an overdetermined system of equations.
In this post, I’ll be going over the specifics of the normal equation and give an understanding of how
it works.

Note: This post assumes basic familiarity with matrices and how they relate to systems of equations. It is recommended you read the Blue Alliance blog post prior to this one.

Note that we can’t solve this system as-is: the matrix has more rows than columns (it isn’t square), so we can’t find its inverse \(A^{-1}\). We call this system overdetermined, and as such its equations are
inconsistent, meaning we can’t possibly satisfy all equations at once.

To overcome this, we need to introduce the concept of ‘error’. If we pick values for \(x_1\) through \(x_4\) with regards to the first 4 rows of the matrix, and then plug those values into
the last row of the matrix, there will be some error in our solution. In order to solve our equation, we need to make our “best guess”, while reducing the error as low
as possible.
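As a concrete sketch (using a made-up system, not the one from the TBA post), here’s what an overdetermined system looks like in NumPy, and how a least-squares “best guess” is obtained:

```python
import numpy as np

# A hypothetical overdetermined system: 5 equations, 4 unknowns.
# A is taller than it is wide, so it has no inverse.
A = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
], dtype=float)
b = np.array([10., 7., 12., 5., 9.])

# np.linalg.lstsq finds the "best guess" x that minimizes ||b - Ax||.
x, residual, rank, _ = np.linalg.lstsq(A, b, rcond=None)
```

The returned `residual` is the leftover sum of squared errors — it can’t reach zero here, because no single choice of unknowns satisfies all five equations at once.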

Least Squares, Geometrically

In order to reduce the error in our guess, we’re going to use the Least Squares approximation method, which is a form of linear regression. If you’ve ever
used the ‘line of best fit’ function, this is likely what powers it.

Here, we’re going to look into what least squares is, geometrically. First, we must understand linear independence.
A group of vectors is linearly independent when no vector in the group can be written as a combination of the others; the number of linearly independent vectors (the rank) is the minimum number of dimensions needed to express the space those vectors span.
To demonstrate this simply, the 2D XY space can be fully expressed by the unit vectors \(\hat{i}\) and \(\hat{j}\), which are linearly independent. If we add another vector into the mix, we get 3 “basis” vectors
for 2D space, meaning we must eliminate one of them to reach linear independence.

Now, we must consider column space. Column space is simply an \(n\)-dimensional space where all column vectors of the matrix live, where \(n\) is the rank of the matrix, i.e. its number of
linearly independent columns. For example:
\[ C\left( \begin{bmatrix} a & b \\ c & d \end{bmatrix} \right) = \operatorname{span}\left\{ \begin{bmatrix} a \\ c \end{bmatrix}, \begin{bmatrix} b \\ d \end{bmatrix} \right\} \]

For now, we don’t need to know what \(n\) is, but we do need to know that the vectors of all columns of the matrix exist in this space. In the below diagram, the column space of
some 2D matrix is expressed as a plane.

In this diagram, our ‘ideal’ solution exists within the plane (the point \(A\underset{\sim}{x}\)). Since we know our solution isn’t perfect, there will be some error when we come
to apply this solution to our equation to find the values of \(\underset{\sim}{b}\) (the match outcomes). This error can be expressed by the vector \(\underset{\sim}{e}\), which is orthogonal (perpendicular)
to the plane. The reason this sits orthogonal is that it must enter a higher dimension: \(\underset{\sim}{b}\) does not exist in our column space, but instead sits slightly out of it. With this
in mind, we can express the position vector \(\underset{\sim}{b}\) as:
\[ \underset{\sim}{b} = A\underset{\sim}{x} + \underset{\sim}{e} \]

It should be noted here that the “Least Squared Error” is in fact the minimum distance of \(\underset{\sim}{b}\) from the plane. Since we know the minimum distance between a point and the
plane is the magnitude of the vector orthogonal to the plane that meets the point, the Least Squared Error is the magnitude of \(\underset{\sim}{e}\), i.e.:
\[ ||\underset{\sim}{e}|| = ||\underset{\sim}{b} - A\underset{\sim}{x}|| \]

Since the error vector is orthogonal to the column space, we know that it must also be orthogonal to every column of \(A\), as otherwise, the error vector would exist within the column space.
We know from basic linear algebra that two vectors are orthogonal if their dot product is equal to 0 (\(\underset{\sim}{u} \cdot \underset{\sim}{v} = 0\)). As we’re checking for orthogonality between our error vector and
all column vectors of \(A\), we can check for all of them at once by transposing \(A\) such that the columns become rows, and multiplying it by our error vector. Equating this to 0 enforces orthogonality.
\[ A^T \underset{\sim}{e} = 0 \]
\[ A^T(\underset{\sim}{b} - A\underset{\sim}{x}) = 0 \]
\[ A^T\underset{\sim}{b} - A^T A\underset{\sim}{x} = 0 \]
\[ A^T A\underset{\sim}{x} = A^T\underset{\sim}{b} \]
\[ \underset{\sim}{x} = (A^T A)^{-1} A^T\underset{\sim}{b} \]
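To sanity-check the derivation numerically, here’s a small sketch (with made-up numbers) showing that the normal-equation solution matches NumPy’s built-in pseudoinverse, and that the resulting error vector really is orthogonal to the columns of \(A\):

```python
import numpy as np

A = np.array([[1., 1.],
              [1., 2.],
              [1., 3.]])
b = np.array([1., 2., 2.])

# x = (A^T A)^(-1) A^T b, straight from the derivation above.
x_normal = np.linalg.inv(A.T @ A) @ A.T @ b

# The same solution via NumPy's pseudoinverse A^+.
x_pinv = np.linalg.pinv(A) @ b

# The error vector e = b - Ax is orthogonal to every column of A,
# so A^T e should be (numerically) zero.
e = b - A @ x_normal
```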

The matrix \((A^T A)^{-1} A^T\) is known as the pseudoinverse, and is denoted as \(A^+\). The pseudoinverse (also known as the generalized inverse, Moore-Penrose inverse and others)
is used as a stand-in for \(A^{-1}\) for matrices that aren’t square (i.e. overdetermined systems), and is used to minimize the magnitude of the error vector (i.e. find the Least Squared Error solution).

We can apply this pseudoinverse to our original data: in place of the usual \( \underset{\sim}{x} = A^{-1}\underset{\sim}{b} \), we write
\[ \underset{\sim}{x} = A^+\underset{\sim}{b} \]
\[ \underset{\sim}{x} = (A^T A)^{-1} A^T \underset{\sim}{b} \]
or, equivalently, the normal equation:
\[ A^T A \underset{\sim}{x} = A^T \underset{\sim}{b} \]
This is identical to what we determined geometrically, and will provide the best-fitting values that minimize the error, giving the ‘best guess’ at what those values should be. In our case, this is
what gives our OPR.
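To make the OPR case concrete, here’s a toy sketch: three hypothetical teams playing three matches in two-team alliances (the numbers are invented, not from a real event):

```python
import numpy as np

# Row i has a 1 in column j if (hypothetical) team j was on the
# alliance in match i; b holds each alliance's total score.
A = np.array([
    [1, 1, 0],   # teams 0 and 1 scored 60 together
    [0, 1, 1],   # teams 1 and 2 scored 80 together
    [1, 0, 1],   # teams 0 and 2 scored 70 together
], dtype=float)
b = np.array([60., 80., 70.])

# Solve the normal equation A^T A x = A^T b for the OPRs.
opr = np.linalg.solve(A.T @ A, A.T @ b)
print(opr)  # -> [25. 35. 45.]
```

Each entry of `opr` is that team’s estimated per-match contribution; with more matches than teams (the realistic case), the same line of code produces the least-squares best guess instead of an exact solution.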

Efficiency and Cholesky Decomposition

It should be noted that while the use of pseudoinverses is practical for smaller datasets (such as OPRs for a single event), it becomes very expensive in both time and storage to compute the pseudoinverse for much larger
datasets. When speed and efficiency are a necessity for the larger datasets, it’s common to instead use Cholesky Decomposition in order to solve the normal equation \(A^T A \underset{\sim}{x} = A^T \underset{\sim}{b}\).

While most of the detail is left up to the reader, Cholesky Decomposition states that for some real, symmetric, positive-definite matrix \(M\), \(M = LL^T\), where \(L\) is a lower triangular matrix with real positive diagonals, as shown below:
\[ L = \begin{bmatrix} 1 & 0 & 0 \\ 4 & 6 & 0 \\ 5 & 6 & 1 \end{bmatrix} \]

Writing the normal equation as \(LL^T\underset{\sim}{x} = A^T\underset{\sim}{b}\) and letting \(\underset{\sim}{y} = L^T\underset{\sim}{x}\), \(\underset{\sim}{y}\) can be solved for through forward-substitution, since \(L\) is known to be lower triangular. Likewise, \(\underset{\sim}{x}\) can then be solved for through back-substitution, both of which are relatively inexpensive operations.
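As a sketch of how this looks in practice (with made-up data), here are the forward- and back-substitution loops written out by hand; in real code you’d typically reach for a library routine such as SciPy’s `cho_solve` instead:

```python
import numpy as np

# Hypothetical overdetermined system: 4 matches, 3 teams.
A = np.array([[1., 1., 0.],
              [0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 1.]])
b = np.array([60., 80., 70., 100.])

# Normal equation: (A^T A) x = A^T b. A^T A is symmetric
# positive-definite, so it factors as L L^T.
M = A.T @ A
rhs = A.T @ b
L = np.linalg.cholesky(M)  # lower triangular, positive diagonal

n = len(rhs)

# Forward-substitution: solve L y = A^T b, top row first.
y = np.zeros(n)
for i in range(n):
    y[i] = (rhs[i] - L[i, :i] @ y[:i]) / L[i, i]

# Back-substitution: solve L^T x = y, bottom row first.
x = np.zeros(n)
for i in reversed(range(n)):
    x[i] = (y[i] - L[i+1:, i] @ x[i+1:]) / L[i, i]
```

Each substitution step is \(O(n^2)\), versus \(O(n^3)\) for a general matrix inverse — which is where the savings come from on large datasets.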

Thanks to Ether for the help in this particular section of the post.

Final Notes

The same logic behind pseudoinverses can be applied to other elements of scoring in FRC other than just OPRs.
Let’s take a look at a 2D dataset, that is, a set of \((x, y)\) points. This could be something like High Fuel Scored vs OPR, or whatever stat you want to measure.

You can use this knowledge to generate trend lines for your data, or for applying OPR to different datasets. All of this can be used to broaden your range of data for analysis,
which is a great quality to have. Never look at one set of data on its own, look at a range of data before making deductions.
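For example, a line of best fit through a hypothetical 2D dataset falls straight out of the same pseudoinverse machinery (the numbers below are invented):

```python
import numpy as np

# Hypothetical 2D data, e.g. High Fuel Scored (x) vs OPR (y).
xs = np.array([2., 4., 5., 7., 9.])
ys = np.array([11., 19., 24., 32., 41.])

# Fit y = m*x + c: each point gives one row of m*x_i + c = y_i,
# so the design matrix has a column of x's and a column of 1's.
A = np.column_stack([xs, np.ones_like(xs)])
m, c = np.linalg.pinv(A) @ ys
```

Five points and two unknowns is another overdetermined system, so the slope `m` and intercept `c` come out as the Least Squared Error solution — exactly the ‘line of best fit’ mentioned earlier.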