My work involves computing the QR decomposition of a p-cyclic matrix M. It is square, L blocks high and L blocks wide, and each block is N by N. Each block column has two nonzero blocks: one on the diagonal and one directly below it, like so
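In NumPy terms, a matrix with this structure can be built like this (a small sketch for illustration; I include the wraparound block in the last block column that makes the matrix p-cyclic — if your last block column has only the diagonal block, drop the modulo):

```python
import numpy as np

def p_cyclic(L, N, seed=0):
    """Build an (L*N) x (L*N) matrix where each block column has two
    nonzero N x N blocks: one on the diagonal, one directly below it
    (wrapping to the top block row in the last block column)."""
    rng = np.random.default_rng(seed)
    M = np.zeros((L * N, L * N))
    for j in range(L):
        cols = slice(j * N, (j + 1) * N)
        M[j * N:(j + 1) * N, cols] = rng.standard_normal((N, N))       # diagonal block
        i = (j + 1) % L                                                # block directly below (cyclic)
        M[i * N:(i + 1) * N, cols] = rng.standard_normal((N, N))
    return M

M = p_cyclic(L=4, N=3)  # 2*L = 8 nonzero blocks of N*N entries each
```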

My problem is that although BSOF always runs in less time than DGEQRF, as L increases (and with it the number of zeros in M), DGEQRF gets much, much faster in terms of GFlop/s. Below are results from a test where the size of M is held constant at 10,000 while L grows and N shrinks. If DGEQRF were unaffected by the structure, its speed and execution time would be the same in every test, but they are not. So my question is: why is DGEQRF going so fast? My theory is that some heuristic skips a number of flops when it detects a certain pattern of zeros, making my flop count incorrect and the resulting GFlop/s rate misleading. But I have no idea how or where this is being done.
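For reference, this is roughly how I compute the DGEQRF rate (a sketch using SciPy's LAPACK wrapper; the flop count is the textbook dense count 2mn² − (2/3)n³, which is exactly the number that becomes misleading if the routine skips work on zeros):

```python
import time
import numpy as np
from scipy.linalg import lapack

def dgeqrf_gflops(A):
    """Time DGEQRF on a copy of A and report GFlop/s using the dense
    flop count 2*m*n^2 - (2/3)*n^3 (only valid if no flops are skipped)."""
    m, n = A.shape
    t0 = time.perf_counter()
    qr, tau, work, info = lapack.dgeqrf(A.copy())
    elapsed = time.perf_counter() - t0
    assert info == 0
    flops = 2.0 * m * n * n - (2.0 / 3.0) * n ** 3
    return flops / elapsed / 1e9

A = np.random.default_rng(0).standard_normal((1000, 1000))
print(f"{dgeqrf_gflops(A):.2f} GFlop/s")
```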

Due to the large size of your matrices, DGEQRF is using the blocked code path: instead of computing the QR factorization one column at a time, it computes a block of columns at a time. The steps of the block factorization can be seen in http://www.netlib.org/lapack/double/dgeqrf.f, and there are three main ones: (1) compute a QR factorization of the block of columns (a tall, skinny matrix), with Q represented as single-column Householder reflectors; (2) combine the individual reflectors into a block reflector with an upper triangular factor T (in routine DLARFT). The details of that algorithm are not at all obvious from the code, and there does not seem to be much literature on the subject (I am interested in knowing more about it myself, but the only reference I have found is this paper: http://www.sciencedirect.com/science/article/pii/0045782576900323). Finally, (3) the block reflector is applied to the trailing matrix.
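The three steps can be sketched in NumPy like this (a simplified illustration of the blocked algorithm, not the actual LAPACK code: it stores V and T explicitly rather than packing the reflectors into the lower triangle of A, and it omits workspace handling entirely):

```python
import numpy as np

def householder(x):
    """DLARFG-style reflector: v (with v[0] == 1) and tau such that
    (I - tau*v*v^T) @ x = beta * e1, where beta = -sign(x[0]) * ||x||."""
    normx = np.linalg.norm(x)
    if normx == 0.0:
        v = np.zeros_like(x); v[0] = 1.0
        return v, 0.0                      # tau == 0: reflector is the identity
    alpha = x[0]
    beta = -np.copysign(normx, alpha)
    v = np.empty_like(x)
    v[0] = 1.0
    v[1:] = x[1:] / (alpha - beta)
    return v, (beta - alpha) / beta

def blocked_qr(A, nb):
    """Blocked Householder QR; returns A overwritten so that its upper
    triangle holds R. Mirrors the three steps of DGEQRF conceptually."""
    A = A.copy()
    m, n = A.shape
    for k in range(0, n, nb):
        b = min(nb, n - k)
        V = np.zeros((m - k, b))
        taus = np.zeros(b)
        # (1) Panel QR: one Householder reflector per column (like DGEQR2).
        for j in range(b):
            v, tau = householder(A[k + j:, k + j].copy())
            taus[j] = tau
            A[k + j:, k + j:k + b] -= tau * np.outer(v, v @ A[k + j:, k + j:k + b])
            V[j:, j] = v
        # (2) DLARFT: accumulate T so that Q_panel = I - V @ T @ V.T.
        T = np.zeros((b, b))
        for j in range(b):
            T[j, j] = taus[j]
            if j > 0:
                T[:j, j] = -taus[j] * T[:j, :j] @ (V[:, :j].T @ V[:, j])
        # (3) DLARFB: apply Q^T = I - V @ T.T @ V.T to the trailing columns at once.
        if k + b < n:
            W = V.T @ A[k:, k + b:]
            A[k:, k + b:] -= V @ (T.T @ W)
    return A
```

The payoff of steps (2) and (3) is that the trailing update becomes a few large matrix-matrix products (BLAS 3) instead of many rank-1 updates (BLAS 2), which is where the blocked code gets its speed on big matrices.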

Throughout this process, there are several optimizations which exploit the zero structure of the intermediate factors. In DLARFT, the code checks whether any of the tau values (the scale factors of the Householder reflectors) are zero, and if so, saves some time when filling in T. This occurs when the diagonal blocks are low rank, but it only saves a small amount of time since the diagonal block is relatively small. In DLARFB, the code checks for reflectors with trailing zeros and saves time there. This happens when the matrix was roughly upper triangular to begin with, so that the panel QR in step (1) produces reflectors with trailing zeros; that is true in your case, and it saves a lot of time. So in this sense, what DGEQRF is doing is not a whole lot different from your BSOF routine, except that it has to explicitly check for the zero structure at run time. In the limit of large block sizes, the trailing-zero estimate returned by ILADLR becomes exactly correct, and the amount of work done by the two methods is roughly the same.
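The trailing-zeros trick is easy to see in isolation (a sketch, not the LAPACK source: `iladlr_like` plays the role of ILADLR, and the update is the DLARFB application of a block reflector restricted to the nonzero rows):

```python
import numpy as np

def iladlr_like(V):
    """Index of the last row of V containing a nonzero entry, i.e. what
    ILADLR computes; rows below it contribute nothing to the update."""
    nz = np.nonzero(np.any(V != 0.0, axis=1))[0]
    return nz[-1] + 1 if nz.size else 0

def apply_block_reflector(C, V, T):
    """C <- (I - V @ T @ V.T).T @ C, but touching only the first r rows,
    where r is the trailing-zero cutoff. When V has many trailing zero
    rows (banded input), most of the work is skipped."""
    r = iladlr_like(V)
    W = V[:r].T @ C[:r]
    C[:r] -= V[:r] @ (T.T @ W)
    return C
```

Restricting the products to `V[:r]` and `C[:r]` gives the same result as the full update, because the zero rows of V multiply out to zero anyway; the flops simply never happen.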