Occam's razor

Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on January 8, 2014 2:51 pm wrote:
> Nicolas Capens (nicolas.capens.delete@this.gmail.com) on January 8, 2014 2:33 pm wrote:
> > Occam's razor is about making the fewest assumptions. You're making two assumptions:
> > 1) The L0 cache is very complex.
> > 2) The Intel compiler is stupid.
> >
> > Firstly an L0 cache isn't complex. Only a few entries are needed and there's only ~20 bits worth of address
> > encoding to test for a match (http://www.realworldtech.com/forum/?threadid=138897&curpostid=139072).
>
> An extra cache layer which must be coherent with another with 72 cores isn't complex and requires
> only ~20 bits worth of address? Seriously?

Although I'm calling it an L0 cache, it's not like any of the other cache "layers". It is read-only, and keeping it coherent with L1 is easy: store the corresponding L1 cache line index with each entry, and invalidate any matching entries in the same cycle the L1 cache line is invalidated.
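To make that concrete, here is a minimal sketch of the idea as a toy software model. Everything in it is an assumption for illustration: the class name, the four-entry size, the oldest-first replacement, and the method names are hypothetical and don't describe any actual Intel design.

```python
class L0OperandBuffer:
    """Toy model of a tiny read-only buffer in front of L1.

    Hypothetical: entry count, tag format and method names are
    illustrative only, not a description of real hardware.
    """

    def __init__(self, num_entries=4):
        self.num_entries = num_entries
        # Each entry is (address_tag, l1_line_index, data), oldest first.
        self.entries = []

    def lookup(self, address_tag):
        """Return the buffered data on a tag match, else None."""
        for tag, _line, data in self.entries:
            if tag == address_tag:
                return data
        return None

    def fill(self, address_tag, l1_line_index, data):
        """Record an operand read from L1, evicting the oldest entry if full."""
        self.entries = [e for e in self.entries if e[0] != address_tag]
        if len(self.entries) >= self.num_entries:
            self.entries.pop(0)
        self.entries.append((address_tag, l1_line_index, data))

    def invalidate_l1_line(self, l1_line_index):
        """Drop every entry backed by the L1 line being invalidated."""
        self.entries = [e for e in self.entries if e[1] != l1_line_index]


buf = L0OperandBuffer()
buf.fill(address_tag=0x12345, l1_line_index=7, data="operand A")
buf.fill(address_tag=0x12346, l1_line_index=9, data="operand B")
hit_before = buf.lookup(0x12345)      # entry present, so this hits
buf.invalidate_l1_line(7)             # L1 line 7 is invalidated
hit_after = buf.lookup(0x12345)       # entry was dropped with the line
still_cached = buf.lookup(0x12346)    # entry for line 9 is untouched
```

The point is that the buffer never has to take part in the coherence protocol itself. It only has to react to L1 invalidations, and the stored line index makes that a simple match.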

> With no other processor ever implementing anything
> even remotely similar? Its existence sounds like a HUGE assumption to make considering there's
> not even the slightest hint of it anywhere (Care to point to some patent? A diagram? Another processor
> doing something similar? Anything at all that's not pulled out of thin air?).

Sure. Having multiple line buffers is fairly similar, and it saves power: http://caps.cs.binghamton.edu/papers/ghose_islped_1999.pdf

> > Secondly it is one of the easiest optimizations to avoid reading memory operands multiple times. It has
> > to be a deliberate choice (http://www.realworldtech.com/forum/?threadid=138897&curpostid=138966).
>
> Right, because compilers never have bugs and never emit sub-optimal code for unreleased architectures. Never.

I don't think you're grasping just how much this code is deviating from the norm. It is not reusing any data that has already been loaded into a register, anywhere, in either example! That is usually one of the fundamental tasks of register allocation. Any course in compiler design covers this as a basic optimization.
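To show how little analysis this takes, here is a rough sketch of redundant load elimination over straight-line code. It is a toy, not how ICC or any real compiler does it: the tuple-based instruction format, the function name, and the assumption that every virtual register is written exactly once are all made up for illustration.

```python
def eliminate_redundant_loads(instrs):
    """Remove loads whose value is already held in a virtual register.

    Hypothetical toy IR (straight-line code, each register written once):
      ("load", dst, addr)
      ("op", name, dst, src1, src2)
      ("store", addr, src)
    """
    available = {}  # address -> register already holding its value
    rename = {}     # eliminated register -> register to use instead
    out = []
    for ins in instrs:
        kind = ins[0]
        if kind == "load":
            _, dst, addr = ins
            if addr in available:
                # Value is already in a register: drop the load, reuse it.
                rename[dst] = available[addr]
                continue
            available[addr] = dst
            out.append(ins)
        elif kind == "store":
            _, addr, src = ins
            # Conservative: assume the store may alias any address.
            available.clear()
            out.append(("store", addr, rename.get(src, src)))
        else:
            _, name, dst, a, b = ins
            out.append(("op", name, dst, rename.get(a, a), rename.get(b, b)))
    return out


program = [
    ("load", "v0", "x"),
    ("load", "v1", "y"),
    ("op", "mul", "v2", "v0", "v1"),
    ("load", "v3", "x"),              # second read of x
    ("op", "add", "v4", "v2", "v3"),
    ("store", "z", "v4"),
]
optimized = eliminate_redundant_loads(program)
```

In this example the second load of `x` disappears and the `add` reads `v0` instead. The pass needs no knowledge of the source code, only a record of which addresses have already been loaded.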

So this isn't some heuristic making the wrong decision in a corner case. It requires no advanced analysis or deeper knowledge of the source code to perform this trivial and unambiguous optimization. It would only take one glance at the assembly output for the Intel engineers to have spotted this as a grave error, if it were unintentional. Furthermore, even though KNL is unreleased, AVX-512 is only a minor variation on Intel's previous 512-bit ISA extensions. So they've had plenty of time to make sure they're generating the code they intended to generate.

It is not an easy assumption at all to consider this a compiler bug. Looking at ICC and Intel's contributions to LLVM, their compiler engineers are exceptionally skilled. They would not ship a compiler with this behavior unless it was intentional. So it's easier to look for other explanations, such as the existence of a structure to provide duplicate memory operands.