Prescott: The Last of the Mohicans? (Pentium 4: from Willamette to Prescott). Part II. Page 9

We continue our detailed article on the NetBurst architecture and its peculiarities. Today we are going to raise the curtain of mystery over a few more exciting aspects of NetBurst, which you may have never heard of before. This is the second article of the trilogy devoted to X-bit’s in-depth NetBurst architecture investigation!

This is how micro-operations can be sent out “in advance” based on the data load forecast. This keeps the execution units loaded with work in the most efficient way.

So, if the distance between the scheduler and the execution units is quite big, only the optimistic strategy can keep this long pipeline loaded with work efficiently enough. It is important, however, that the scheduler:

always assumes the best-case scenario for data availability;

never lacks information about the status of the execution units.

Well, everything seems to be turning out quite nicely. We have finally found the best strategy for the long pipeline of the NetBurst micro-architecture, one that maintains a high execution rate for the processed micro-operations. And it really works the way we have just described, provided that one important condition is fulfilled: the instruction must actually execute in the best-case scenario. In the example above, where we considered loading data from memory, the “best case” means the data is found in the L1 cache. This optimistic forecast is justified for two reasons. Firstly, the probability that the requested data is there is very high. Secondly, data transfer from this cache takes the minimum time, so the waiting will also be minimal if the load succeeds.
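To make the payoff of the optimistic strategy concrete, here is a minimal toy model in Python. The 5-cycle scheduler-to-execution distance is a purely hypothetical figure chosen for illustration (it is not a real Pentium 4 parameter); the model compares the optimistic scheduler against a “pessimistic” one that waits for each result to be confirmed before releasing the next dependent micro-operation:

```python
# Toy model: cycle at which a chain of dependent uops finishes executing.
# PIPELINE_DEPTH is a hypothetical scheduler-to-execution distance,
# not a real Pentium 4 figure.

PIPELINE_DEPTH = 5  # cycles a uop travels from the scheduler to the execution unit

def completion_cycle(chain_length, optimistic):
    """Cycle at which the last uop of a dependent chain executes."""
    if optimistic:
        # Release one uop per clock, assuming every input will be ready in time:
        # uop i is dispatched at cycle i and executes at cycle i + PIPELINE_DEPTH.
        return PIPELINE_DEPTH + chain_length - 1
    # Pessimistic: dispatch uop i only after uop i-1 has actually executed,
    # so every link in the chain pays the full pipeline traversal.
    return chain_length * PIPELINE_DEPTH

print(completion_cycle(4, optimistic=True))   # 8
print(completion_cycle(4, optimistic=False))  # 20
```

With these assumed numbers, a four-uop dependent chain finishes at cycle 8 instead of cycle 20: the longer the scheduler-to-execution distance, the bigger the win from dispatching “in advance.”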

But if this condition is not met, our nicely polished mechanism turns into a trap. Let me explain.

Imagine the following situation. The scheduler has released a chain of four micro-operations, each dependent on the previous one. They are all on their way to the execution unit, dispatched before the data has actually been loaded: everything happens according to the scheduler’s optimistic strategy.

Here they are approaching the execution unit. And then - oh no! - it suddenly turns out that the requested data is not in the L1 cache: instead of the long-awaited operand we receive the deadly “L1 cache miss” signal.

What should be done in this case? Of course, we would have to halt the pipeline and go looking for the missing data. But the pipeline cannot be stopped just like that: it keeps processing new micro-operations every clock. The scheduler has already released a succession of uops, and we know that each operation depends on the previous one. What will happen now?

The first micro-operation arrived at the execution unit. Since there was no data waiting for it, it was executed incorrectly. On the next clock the second micro-operation received the incorrect result of the first one and was also executed incorrectly. And the same thing will happen to every uop in the chain, down to the very last one. Moreover, yet another serious problem appears.

Suppose the data was found in the L2 cache. The Northwood core needs 7 clock cycles to load the data from there. But the pipeline works in one direction only and cannot “reverse” the instruction flow. Our chain of micro-operations has already passed the execution units and been processed with the wrong operand. If we do not take urgent measures, it will continue down the pipeline, carrying incorrect results, and will eventually be retired.
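This failure mode can be sketched with a short Python model. Only the 7-cycle L2 load latency comes from the text above; the 5-cycle scheduler-to-execution distance is an assumed, purely illustrative figure, and `None` stands in for whatever garbage the missed load returned:

```python
# Toy model of an L1 miss under optimistic scheduling. L2_LOAD_LATENCY is
# the Northwood figure from the text; DISPATCH_TO_EXEC is a hypothetical
# scheduler-to-execution distance chosen for illustration.

DISPATCH_TO_EXEC = 5  # assumed cycles from dispatch to the execution unit
L2_LOAD_LATENCY = 7   # cycles to bring the operand from L2 after an L1 miss

def run_chain(chain_length, operand):
    """Each uop adds 1 to its predecessor's result; None models garbage."""
    results = []
    for i in range(chain_length):
        exec_cycle = DISPATCH_TO_EXEC + i          # uops arrive back-to-back
        operand = None if operand is None else operand + 1
        results.append((exec_cycle, operand))
    return results

# L1 hit: the load returned 42 and every uop computes a correct result.
print(run_chain(4, 42))

# L1 miss: the chain executes with garbage (None) long before the real
# data could arrive from L2 at cycle DISPATCH_TO_EXEC + L2_LOAD_LATENCY.
miss_results = run_chain(4, None)
data_ready = DISPATCH_TO_EXEC + L2_LOAD_LATENCY
print(miss_results)
print(all(cycle < data_ready for cycle, value in miss_results))  # True
```

In this model the last dependent uop executes at cycle 8, while the correct operand would arrive from L2 only at cycle 12: every result in the chain is already wrong by the time the data shows up, which is exactly why the processor needs a mechanism to catch and re-execute such uops instead of letting them retire.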