Prescott: The Last of the Mohicans? (Pentium 4: from Willamette to Prescott). Part II

We continue our detailed article on the NetBurst architecture and its peculiarities. Today we are going to lift the curtain on a few more fascinating aspects of NetBurst that you may never have heard of before. This is the second article of the trilogy devoted to X-bit's in-depth investigation of the NetBurst architecture!

Chapter X: Replay System: Cons and Pros

Here I have to make the brave assumption that you are not yet too tired of all these details, because we are going to dig a little deeper into the replay mechanism. The details we discovered are very interesting and important for a better understanding of how the Pentium 4 processor works.

To be on the same page, let's review a few things we already covered in the previous chapter. We said that when a micro-operation gets into the replay system, it is actually executed twice: the first time with incorrect operands, and the second time with the correct ones. As a result, the execution units waste a cycle each time the uop passes through them on the first, invalid attempt. Moreover, a micro-operation that goes into replay can drag several more micro-operations along with it. In particular, we managed to create chains of thousands of commands that circled around the replay pipeline hundreds of times as a result of a single cache miss! To be fair, I have to stress that hundreds of replay loops is not a frequent situation: in most cases it is all over after a few, or at most a few dozen, loops. Still, this circulation of the same micro-operations reduces the efficiency of the execution units, because about half of all commands going through them turn out to be wasted work.
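The accounting above can be sketched with a toy model. This is not Intel's implementation, just an illustration under stated assumptions: the scheduler dispatches a load's dependents speculatively, betting on an L1 hit, and a miss sends the load and every dependent uop back through the execution ports a second time.

```python
# Toy model of NetBurst-style speculative dispatch with replay.
# Assumptions (ours, not from the article): one load heads a chain of
# dependent uops; all of them are dispatched speculatively on the
# assumption of an L1 hit; on a miss the whole group re-executes once
# more with valid data. We only count execution-port slots consumed.

def slots_used(chain_length, load_hits):
    """Execution-port slots spent retiring one load plus
    `chain_length` dependent uops."""
    # First pass: the load and all dependents each take a slot.
    slots = 1 + chain_length
    if not load_hits:
        # Replay pass: the same uops go through the ports again,
        # this time with correct operands from L2 (or beyond).
        slots += 1 + chain_length
    return slots

print(slots_used(4, load_hits=True))   # → 5  (every slot useful)
print(slots_used(4, load_hits=False))  # → 10 (half the slots wasted)
```

With a single replay pass, exactly half of the dispatched work is thrown away, which matches the article's "about half" estimate; chains that circle the loop many times waste proportionally more.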

Let's take a closer look at the situation where a chain of micro-operations rotates through the replay loop multiple times. Why does this problem arise?

The reason, strange as it might sound, is the enthusiasm of the scheduler. To be more exact, there are two reasons: the scheduler's drive to keep the execution units loaded as fully as possible, and its unawareness of the current status of a micro-operation inside the execution unit.

Let's return to our example with a chain of dependent commands. Suppose the first one gets into the replay system; then the second one, the third one, and so on will also go into replay. There is one important detail here: since the original micro-operation moves along the main pipeline in parallel with its clone moving along the fictitious pipeline, the distance between micro-operations in the replay pipeline remains unchanged. In fact, this ability to maintain the same spacing between micro-operations is one of the fundamental features of replay: if the distances could change, the processor logic would have a much harder time coordinating the work of these two pipelines.

So, if there was a gap between two dependent commands, during which the scheduler did not release any micro-operations, the same gap will exist between their clones in the replay pipeline. By analogy with hole conduction in semiconductors, we will call it a "hole".
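The fixed-spacing property can be shown with a minimal sketch (again our own model, with an assumed replay-loop length, not a documented Pentium 4 parameter): each trip around the loop shifts every uop by the same number of cycles, so a dispatch gap between two dependent uops survives as a "hole" on every pass.

```python
# Minimal sketch: uops circling the replay loop keep their relative
# spacing. REPLAY_LATENCY is an assumed loop length in cycles, chosen
# only for illustration.

REPLAY_LATENCY = 7

def reentry_cycles(dispatch_cycles, trips):
    """Cycle at which each uop re-enters the execution port after
    `trips` passes around the replay loop."""
    return [c + trips * REPLAY_LATENCY for c in dispatch_cycles]

# Two dependent uops dispatched with a one-cycle hole between them:
first_pass = reentry_cycles([0, 2], trips=0)   # → [0, 2]
second_pass = reentry_cycles([0, 2], trips=1)  # → [7, 9]

# The hole's width is preserved on every trip around the loop:
print(second_pass[1] - second_pass[0])          # → 2
```

The point of the sketch is only that the hole never closes by itself: as long as the uops keep circulating, the empty slot between them circulates too, and the scheduler sees it as a free slot it can fill.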

This hole turns out to have very interesting properties. In particular, it is these holes, together with the enthusiasm of the scheduler, that result in the phenomenon of a replay loop engulfing an entire chain of uops.