Beware that if you unroll too much, the result won't fit in the L1 cache any more. A branch over code that stays in cache is likely to be cheaper than the cache misses caused by bloated code.

Also bear in mind that on architectures with branch prediction, the cost of a correctly predicted branch is close to zero.

As ever, it is probably a good idea to start with a baseline of "simple" code which you can profile and benchmark. For one thing, optimisers are most effective when the programmer isn't being too "clever" with the code.

Ad-hoc hackery without a consistent approach to measuring the result won't get you anywhere.

-fno-guess-branch-probability
Do not guess branch probabilities using heuristics. GCC will use heuristics to guess branch probabilities if they are not provided by profiling feedback (-fprofile-arcs). These heuristics are based on the control flow graph. If some branch probabilities are specified by `__builtin_expect', then the heuristics will be used to guess branch probabilities for the rest of the control flow graph, taking the `__builtin_expect' info into account. The interactions between the heuristics and `__builtin_expect' can be complex, and in some cases, it may be useful to disable the heuristics so that the effects of `__builtin_expect' are easier to understand.
The default is -fguess-branch-probability at levels -O, -O2, -O3, -Os.

yes, i had begun to consider that, although it's not a subject i know a lot about. if i am off base (good chance of that; i'm an engineer by trade, not a computer scientist...), please disabuse me of my ignorance:

if i assume a cache size of 32k (64k is the de-facto standard now?), i should be able to fit ~4k 64-bit values before worrying about cache misses

if i use specialization, i could partition this into 2 chains of 2k operations (or possibly 512 __m128d operations) each.

When it comes to dealing with matrices you should make your code work smarter, not harder. It depends on what you're doing with those matrices, but there are algorithms out there that can, for example, perform multiplications faster.
Loop unrolling gives you diminishing returns: the more you unroll, the less it is worth. I suspect that in your polynomial case the only reason you got anywhere near such an improvement is that you were unrolling multiple levels of nested loops, so the improvements compound. For such a large matrix, though, I don't expect that much unrolling to buy you much.

i don't know why it took me so long to see this, but there's no reason to use inheritance at all. static class methods + specialization are the way to go. there are apparently no limits (or much larger limits) on the number of class specializations you can have.

Oh that explains it! I wasn't going to say it, but more than a 2x improvement through loop unrolling is next to impossible, so yeah, I figured something was up. It turns out that probably 99% of the speed improvement you got there was from not using the very slow pow function! (Though the speed of pow with an integer exponent does vary somewhat among compilers.)
Aside from the obvious bug there (a few +s where there should be *s), you could make it even faster by avoiding a lot of the multiplications, if you do it like this:

Now that you know the speed improvement didn't have much to do with loop unrolling, perhaps you'll reconsider unrolling your matrix stuff at all, huh!

possibly, but i don't think so. i'm getting really good performance with dot-products. i'm still tweaking the benchmarking setup to ensure the optimizer doesn't give me an unfair baseline, but so far the results are highly compelling.