Performance Results of Low-Rank Trick

March 09, 2019

In recent posts I explored matrices with structure that enables fast linear solution algorithms. I specifically mentioned Toeplitz and Circulant matrices. I also showed how to still take advantage of the fast linear solvers even if the input matrices do not exactly satisfy the necessary structure. I didn’t give much results though in terms of actual performance. I do that here. I will focus on circulant matrices and I will implement its fast solver in terms of FFTs from scipy. I then perturb circulant systems by low-rank updates and see if my previously developed ideas actually still result in fast solvers. I start by explaining circulant matrices.

Circulant Matrices

Without going into too much math circulant matrices are those matrices \(C\) which may be generated by a one dimensional vector \(c\) such that each row of \(C\) is \(c\) shifted a fixed number of times (with wrap-around).

Thus if we want to solve a system \(Cx=b\) for \(x\) we can simply use scipy.linalg.solve as follows:

importnumpyasnp#Make some data for bn=4b=np.random.rand(n)#Solve for x in Cx=bx=scipy.linalg.solve(C,b)

This however requires \( O(n^3) \) operations because it uses the most generic solver, which typically falls back to LU factorization. When we use the fact that \( C \) is circulant we can solve this system with \( O(n\log(n)) \) operations by the following approach:

importscipy.linalgimportscipy.fftpackimportnumpyasnp#Shorthand definitions for forward,inverse FFT operationsfft=scipy.fftpack.fftifft=scipy.fftpack.ifftfftcs=fft(c)#Solves linear systems involving C.#Note it does not directly refrence the#matrix C, but rather only uses its#generator c.defsolveC(b):returnnp.real(ifft(fft(b)/fftcs))n=4b=np.random.rand(n)#Solve for x in Cx=bx=solveC(b)#x is the same answer we get by #using scipy.linalg.solve, but we took much fewer operations#to get it this time.

Performance of this approach speaks for itself. I timed both approaches with matrix sizes \(n=1,2,4,8,16,32,64,128,256,512,1024,2048\), producing the following plot: (see note after the figure)

(Note: one should always keep in mind that FFTs perform very well on power-of-two sized inputs. Some FFT libraries gracefully handle input sizes of different types and others do not. I chose to only measure the power-of-two case so as to avoid this complicating factor)

In my previous posts I suggested we can still obtain performance like this even if our matrices aren’t exactly circulant, but circulant plus a low rank update. Now I show that this is indeed true

Low-rank Modification of Circulant Matrices

Now let’s consider the case of \( B = C + UV^T \) where \( C \) is a circulant matrix and \(U,V\) are \(n \times k \) matrices. In my previous post I gave the following algorithm which requires \(k+1\) evaluations of the fast solver for \( C \):

Let’s see how this compares to simple scipy.linalg.solve for different values of \( k \). We should expect that the larger \( k \) is, the more expensive the modified solver is (relative to the pure circulant solver).

Rank of update

Timing measurements

Circulant+Rank-1 update

Circulant+Rank-10 update

Circulant+Rank-20 update

Conclusions

Here I showed performance of generic linear solution algorithms which do not exploit structure and compared that performance against a solver that does exploit structure. The structure I chose for illustration purposes was the circulant property. I used some tricks I developed in previous posts to recover the fast structured solve performance even when the input matrix does not precisely have the desired structure.

What we see is that after a certain matrix size there is a crossover point at which a rank-k updated fast-circulant-solver begins to outperform standard direct solution techniques used in scipy.linalg.solve. As expected this crossover point grows with the rank \(k \) since this number has a strong relationship with the number of iterations that GMRES must take to fully solve the system. Generally, if the rank of the update is \( k \) then the circulant+low-rank solver will take \(k+1\) GMRES iterations to converge.

One unexpected effect was the dramatic performance drop for the \(k=1\) case compared to the \(k=0\) case (pure circulant, no updates). The reason for this I believe is that the pure circulant solving algorithm is very cache efficient because it only requires \( O(n) \) (storage of the one dimensional array \(c\) which generates \(C\), but does \(O(n \log(n))\) operations. Thus for large \(n\) the operation/byte ratio is favorable for caching.

The low-rank-updated matrix \( B \) however must be computed explicitly and stored in memory in order to be used in GMRES, this dramatically increases the memory requirement to \( O(n^2) \), but the overall operation count is \( O(n^2 + n\log(n))\). The quadratic factor comes from needing to evaluate the matrix-vector product \( Bx \) to drive GMRES.