L1 vs L0 access cost

NoSpammer (no.delete@this.spam.com) on January 11, 2014 3:39 pm wrote:
> Nicolas Capens (nicolas.capens.delete@this.gmail.com) on January 11, 2014 12:24 am wrote:
> > The paper above says it takes 40% more power. That's a huge deal. Even if large scale coherency
> > is a bigger problem, you can't ignore this. 40% is a lot of opportunity for doing something a little
> > different when doubling the number of execution units, like, an L0 cache perhaps? Even if the L0
> > cache itself costs 10% of power per access, that's a huge saving over a second L1 access.
>
> Look at the whole picture supposing your source is right. 40% more power for one operand
> means that three operand instructions use only 13.33% extra power when sourcing one
> operand from L1. If you work the numbers some more you will notice 10 load-to-use instructions
> are still slightly cheaper than 1 extra load-to-register instruction.

The 40% figure is for the same arithmetic instruction with a memory source operand instead of a register one, hitting L1. So each L1 access carries a very substantial 40% cost on its own; it can't be amortized down to 13.33% by dividing it across the instruction's operands.
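To make the two readings of the figure concrete, here is a toy power model contrasting them. All numbers are illustrative placeholders, not measurements from the paper:

```python
# Toy power model for the two interpretations of the 40% figure.
# REG_INSTR_POWER and L1_ACCESS_EXTRA are illustrative assumptions.

REG_INSTR_POWER = 1.0      # arithmetic instruction, register operands only
L1_ACCESS_EXTRA = 0.40     # extra power when one operand is sourced from L1

# Parent post's reading: spread the 40% across three operands.
amortized = L1_ACCESS_EXTRA / 3          # ~13.33% "per operand"

# Reading argued here: the whole 40% is paid per L1-sourced operand,
# so the memory-operand form costs 1.40x the register-only form.
mem_instr_power = REG_INSTR_POWER + L1_ACCESS_EXTRA

print(f"amortized per-operand figure: {amortized:.2%}")
print(f"memory-operand instruction vs register form: {mem_instr_power:.2f}x")
```

The point of contention is only whether the division in `amortized` is meaningful: if the 40% is charged per access, the per-operand figure doesn't describe any real saving.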

> > There's a difference between it being the wrong solution and two L1 read ports being the right
> > solution. The L0 cache is just one way Intel might avoid a 40% increase in power consumption,
> > but the most likely one I've been able to come up with so far based on the compiler output.
> > That you think that output is "probably" the result of an immature compiler, which it clearly
> > isn't, is a much more interesting expression of doubt. So what's your other explanation?
>
> Get over your fixed idea.

It's not a fixed idea. I can let go of it in a heartbeat. But someone still has to provide a reasonable explanation of why an extremely tiny cache, which only needs a handful of small comparators to detect hits on reads and to invalidate entries on writes, would not be power-efficient enough to save some of the L1's 40% access power.
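A minimal sketch of the kind of structure being described: a hypothetical fully-associative L0 with a handful of entries, where a lookup is just a few tag comparisons and stores simply invalidate matching entries to stay consistent with L1. The entry count and interface are assumptions for illustration, not anything Intel has disclosed:

```python
class TinyL0:
    """Toy fully-associative L0 cache. A hit is a plain tag comparison per
    slot, so only a handful of comparators are needed in hardware."""

    def __init__(self, num_entries=4):
        self.entries = [None] * num_entries  # each slot: (tag, data) or None
        self.next_victim = 0                 # trivial round-robin replacement

    def read(self, tag):
        """Return cached data on a hit, or None on a miss (fall back to L1)."""
        for slot in self.entries:
            if slot is not None and slot[0] == tag:  # one comparator per slot
                return slot[1]
        return None

    def fill(self, tag, data):
        """Install a line fetched from L1."""
        self.entries[self.next_victim] = (tag, data)
        self.next_victim = (self.next_victim + 1) % len(self.entries)

    def write(self, tag):
        """Stores just invalidate any matching entry, keeping the L0
        consistent with L1 without needing write ports."""
        self.entries = [None if (s is not None and s[0] == tag) else s
                        for s in self.entries]
```

Invalidate-on-write is the design choice that keeps the structure cheap: the store path never updates the L0, it only clears it.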

> BTW, you are religiously naive about the optimality of Intel compilers.

This isn't about a minor tweak in "optimality". Reloading data from memory when it's already sitting in a register is a big deal, and avoiding it is a trivial optimization; they already do it for other architectures. So I don't think I'm being naive in assuming Intel's compiler engineers are capable of this optimization. They must have disabled it deliberately, for a reason.
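The optimization in question, keeping a loaded value in a register instead of re-reading it from memory, is easy enough to sketch as a peephole pass over a toy instruction stream. The instruction format here is invented purely for illustration and is not Intel's compiler IR:

```python
def eliminate_redundant_loads(instrs):
    """Drop a load whose address is already held in a live register, and
    rewrite later uses of the duplicate register to the original one.
    Toy instructions: ('load', dst, addr), ('store', addr, src),
    ('use', reg, reg)."""
    addr_in_reg = {}  # address -> register currently holding its value
    rename = {}       # eliminated destination register -> original register
    out = []
    for op in instrs:
        if op[0] == 'load':
            _, dst, addr = op
            if addr in addr_in_reg:           # value already in a register:
                rename[dst] = addr_in_reg[addr]
                continue                      # the second load is dropped
            addr_in_reg[addr] = dst
            out.append(op)
        elif op[0] == 'store':
            _, addr, src = op
            addr_in_reg.pop(addr, None)       # memory changed; forget the copy
            out.append(('store', addr, rename.get(src, src)))
        else:
            out.append((op[0],) + tuple(rename.get(r, r) for r in op[1:]))
    return out
```

For example, two back-to-back loads of the same address collapse to one, with the second consumer redirected to the first register.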