I have an upper Hessenberg matrix, i.e. an upper trapezoidal matrix with a single nonzero band directly below the diagonal, starting from some column onwards, and I need to triangularize it as fast as possible. This use case arises from computing e.g. a QR factorization and then deleting one column (it is always exactly one column) from R. I need to make R upper triangular again by annihilating those subdiagonal elements (one per affected column) in the fastest possible way. So far I have considered:
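To make the setup concrete, here is a small NumPy sketch (variable names are mine) showing that deleting column k from an upper triangular R leaves exactly one nonzero subdiagonal entry per column from k onwards, and nothing below that band:

```python
import numpy as np

# Illustration of the structure described above: take the R factor of a QR
# decomposition, delete one column, and observe the resulting "stairstep"
# of subdiagonal entries from that column onward.
rng = np.random.default_rng(0)
n = 6
R = np.triu(rng.standard_normal((n, n)))   # upper triangular R
k = 2                                      # column to delete (0-based)
H = np.delete(R, k, axis=1)                # n x (n-1), Hessenberg from column k

sub = np.diag(H, -1)
print(sub[:k])   # zeros: columns before the deleted one are untouched
print(sub[k:])   # generically nonzero: the entries to annihilate
print(np.allclose(np.tril(H, -2), 0))      # nothing below the first subdiagonal
```

As an aside, if a Python-level solution were ever acceptable, SciPy appears to expose this exact update as scipy.linalg.qr_delete, which performs the retriangularization with Givens rotations internally.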

Using Givens rotations, combining dlartg and dlasr in a loop. Issue: very bad spatial locality; the update keeps sweeping along the rows, which results in many cache misses.
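For reference, the Givens sweep can be sketched in plain NumPy as follows. This mimics what dlartg plus a two-row update would do; the function name is mine, and a real implementation would call the LAPACK routines instead of forming the 2x2 matrix explicitly:

```python
import numpy as np

def retriangularize_givens(H, k):
    """Annihilate H[j+1, j] for j >= k, in place, one rotation per column.
    H is assumed n x (n-1), upper Hessenberg from column k onward."""
    ncols = H.shape[1]
    for j in range(k, ncols):
        a, b = H[j, j], H[j + 1, j]
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r
        G = np.array([[c, s], [-s, c]])     # maps (a, b) to (r, 0)
        # rotate rows j and j+1 over the trailing columns j..ncols-1
        H[j:j + 2, j:] = G @ H[j:j + 2, j:]
    return H
```

The row-pair updates are exactly the access pattern with poor locality mentioned above: each rotation streams along two long rows before moving one step down.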

Using Householder reflectors, combining dlarfg and dlarf in a double loop. Issue: the same very bad spatial locality; the updates keep moving along the rows, which results in many cache misses.
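Since each column has only one subdiagonal entry here, every reflector in this scheme has length 2 and is mathematically equivalent to a Givens rotation. A NumPy sketch of the dlarfg/dlarf pattern (function name mine; the sign choice for alpha follows the usual rule to avoid cancellation):

```python
import numpy as np

def retriangularize_householder(H, k):
    """Annihilate H[j+1, j] for j >= k in place with length-2 reflectors."""
    for j in range(k, H.shape[1]):
        x = H[j:j + 2, j].copy()
        normx = np.hypot(x[0], x[1])
        if normx == 0.0:
            continue
        alpha = -np.copysign(normx, x[0])   # opposite sign of x[0]: no cancellation
        v = x - np.array([alpha, 0.0])
        v /= np.linalg.norm(v)
        # apply the reflector I - 2 v v^T to rows j, j+1 of the trailing columns
        H[j:j + 2, j:] -= 2.0 * np.outer(v, v @ H[j:j + 2, j:])
    return H
```

Note the diagonal entries may come out negative with this sign convention; the result is still upper triangular and orthogonally equivalent to the input.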

Using the full QR factorizations from LAPACK 3.4.x, e.g. dgeqrt. This is overkill for such a simple update and algorithmically wasteful even when the deleted column is among the first ones.

Can anyone point me to a set of lower-level functions I can use to implement my use case in the fastest way? Otherwise I might end up reinventing the wheel: studying the update pattern and doing it manually in a block of registers along the columns.

We also need to triangularize upper Hessenberg matrices in the context of the GMRES method. This is traditionally done with Givens rotations, similarly to what you describe. The main difference with your application is that in GMRES the Hessenberg matrix is known one column at a time, so for each column we first apply the previous rotations and compute the new one, and then move to the next column. There is not much opportunity for data reuse. In your case you have the whole matrix at once, so you might be able to do something smarter.
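That incremental pattern can be sketched as follows (illustrative names; the rotations are kept as (c, s) pairs, which is how GMRES codes usually store them):

```python
import numpy as np

def process_column(h, rotations):
    """Triangularize one new Hessenberg column of length len(rotations)+2:
    first apply all previously computed rotations, then compute one new
    rotation that zeroes the subdiagonal entry, and append it to the list."""
    h = h.copy()
    for j, (c, s) in enumerate(rotations):
        h[j], h[j + 1] = c * h[j] + s * h[j + 1], -s * h[j] + c * h[j + 1]
    j = len(rotations)
    r = np.hypot(h[j], h[j + 1])
    c, s = (1.0, 0.0) if r == 0.0 else (h[j] / r, h[j + 1] / r)
    rotations.append((c, s))
    h[j], h[j + 1] = r, 0.0
    return h
```

Each call touches only one short column, which is why there is little data reuse to exploit in this setting.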

This reminds me of "bulge chasing". What people have been doing in the bulge-chasing case is to accumulate the Givens rotations in a small matrix. Say your matrix is 100-by-100: you would consider the first 10 rows and construct the first nine rotations, working only on the top-left 10-by-10 submatrix, then accumulate all these rotations into a 10-by-10 matrix (you multiply against the identity first). Then you apply them all at once to the remaining columns using a DGEMM. There is a 2002 SIMAX paper by Braman, Byers and Mathias entitled "The Multishift QR Algorithm. Part I: Maintaining Well-Focused Shifts and Level 3 Performance". What I had in mind was the Level 3 BLAS performance. You do more flops, so it is not obvious this is a winning strategy.
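A sketch of that accumulate-and-GEMM strategy for the single-subdiagonal case, assuming the n x (n-1) shape from the question (the block size nb and the function name are mine):

```python
import numpy as np

def retriangularize_blocked(H, k, nb=8):
    """Annihilate H[j+1, j] for j >= k, nb rotations at a time:
    build the rotations on a small (nb+1) x nb diagonal window, accumulate
    them into one orthogonal factor Q (starting from the identity), then
    apply Q to the trailing columns with a single matrix product."""
    ncols = H.shape[1]
    j = k
    while j < ncols:
        b = min(nb, ncols - j)
        panel = H[j:j + b + 1, j:j + b]    # (b+1) x b diagonal window (a view)
        Q = np.eye(b + 1)                  # accumulates the b rotations
        for i in range(b):
            a, t = panel[i, i], panel[i + 1, i]
            r = np.hypot(a, t)
            if r == 0.0:
                continue
            c, s = a / r, t / r
            G = np.array([[c, s], [-s, c]])
            panel[i:i + 2, i:] = G @ panel[i:i + 2, i:]
            Q[i:i + 2, :] = G @ Q[i:i + 2, :]
        # one GEMM applies all b rotations to the trailing columns at once
        H[j:j + b + 1, j + b:] = Q @ H[j:j + b + 1, j + b:]
        j += b
    return H
```

The GEMM treats Q as dense, so this does more flops than applying the rotations pair by pair (consistent with the caveat above), trading them for Level 3 BLAS data reuse.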

In any case, I do not know much about performance optimization for the operation you are performing.