Replay: Unknown Features of the NetBurst Core. Page 13

In the third part of our NetBurst Architecture investigation trilogy we are going to reveal the details of the Replay mechanism implemented in Intel Pentium 4 processors, which Intel keeps quiet about. This mechanism and its working principles explain why Pentium 4 processors perform rather slowly despite their high clock frequencies.

If we consider our chain of commands with “holes”, the solution could be quite simple: we should prevent the scheduler queues from sending out new commands for a while. This is a natural solution, because the issue arises from the fact that the scheduler has no idea of what is going on in the RL and keeps sending new instructions there. However, this method is not a good way to resolve livelocks: it is the exact opposite of what is needed in that case. To get out of a livelock, we need to somehow fit an instruction stuck in the scheduler into the RL, whereas all the attempts discussed above result in complete isolation of this instruction. Besides, if the chain keeps growing, there will be at least 5 additional instructions in the replay system, and this is not what we would expect.

Well, there is one more option left to check out: locking the scheduler input. Since it would take another huge article to cover every detail of this, we will only provide the test results and the conclusions drawn from them.

We managed to find out that there is a moment when only 8 instructions can be sent for execution (with further replay), which exactly matches the depth of the schQ FAST_0 scheduler queue.

The conclusion is evident: we are dealing with the locking of the scheduler input. This is a very important statement, because this mechanism is a logical basis for a unified system used to resolve livelocks as well as “long chain” issues. The tests prove that once the scheduler input is closed, not only the schQ FAST_0 queue is locked, but all other schedulers as well. We can see this from the limited number of instructions getting into replay from these schedulers once their input has been locked. The price we have to pay for this stall can be roughly estimated at 15-35 clock cycles.
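To make the reasoning behind the "8 instructions" observation concrete, here is a minimal sketch of a single scheduler queue whose input gets locked. It is a toy model, not a description of the real hardware: the function name `uops_sent_after_lock`, the one-uop-per-step dispatch rate and the assumption that the front end keeps the queue full are all ours. The only figure taken from the tests is the queue depth of 8.

```python
from collections import deque

FAST_0_DEPTH = 8  # schQ FAST_0 depth quoted in the text


def uops_sent_after_lock(total_uops, lock_step, depth=FAST_0_DEPTH):
    """Return how many uops are still dispatched once the input is locked.

    Toy assumptions (not measured behaviour): the front end keeps the
    queue full while the input is open, one uop leaves per step, and
    the lock takes effect during step `lock_step`, after the refill.
    """
    queue = deque()
    next_uop = 0
    sent_after_lock = 0
    step = 0
    locked = False
    while True:
        if not locked:
            # Input open: top the queue up to its full depth.
            while next_uop < total_uops and len(queue) < depth:
                queue.append(next_uop)
                next_uop += 1
        if step == lock_step:
            locked = True  # from now on nothing new enters the queue
        if not queue:
            break
        queue.popleft()  # dispatch one uop towards execution/replay
        if locked:
            sent_after_lock += 1
        step += 1
    return sent_after_lock


if __name__ == "__main__":
    print(uops_sent_after_lock(total_uops=100, lock_step=20))
```

In this model, however long the instruction stream is, the number of uops leaving the queue after the lock can never exceed the queue depth, which is the kind of upper bound the tests revealed.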

The system presumably watches for a while the proportions of replayed instructions, new instructions and free positions (let’s call them patterns). However, what we know so far does not allow us to fully reproduce the exact algorithm that decides when to stall. Our experiments show that a serializing event serves as a starting point, which is why, if we take the test code of our example as is but slightly modify the calling procedure, the maximum number of replayed instructions may turn out different.

Note that not all long chains are stalled. For example, a chain of dependent shifts will never be stopped. The same is true for many other instructions. In other words, not stopping is the rule rather than the exception. The model “chains with holes” are always stopped when they get into RL-7, but in some cases, when they get into RL-12, the “holes” keep looping indefinitely. The looping of our model chains makes the pipeline state repeat identically every 14 clocks, which cannot happen in real-life program code. So, we get the impression that we are dealing not with a mechanism designed to resolve this difficult situation, but with a system initially intended for other tasks, which gets involved here only under specific favorable conditions.

We believe that this mechanism is intended primarily for resolving livelock issues. If you have been reading our article carefully, you might ask: how can a blocked scheduler input help to get out of a livelock? We think it works like this: when the scheduler input is closed, the chain of commands circling in the replay loop is sent to some special buffer, and the free positions in the pipeline get occupied by the micro-operations left in the scheduler. These micro-operations may be executed successfully, thus resolving the livelock.
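The hypothesis above can be illustrated with a small sketch. It is purely a toy model built on our guess, not a reconstruction of the actual hardware: the function `simulate`, the uop names and the seven loop positions are chosen for illustration only, and the "special buffer" is modelled as a plain list.

```python
def simulate(consumers, producer, park_on_lock, max_passes=50):
    """Return the uops that complete, in completion order.

    consumers: uops filling every replay-loop position; each needs the
               result of `producer` and is replayed until it has it.
    producer:  the uop stuck in the scheduler.
    park_on_lock: if True, model the hypothesised special buffer.
    """
    capacity = len(consumers)  # the loop starts completely full
    loop = list(consumers)
    scheduler = [producer]
    parked = []
    completed = []

    for _ in range(max_passes):
        if park_on_lock and scheduler and len(loop) == capacity:
            # Input locked: move the circling chain aside (assumption).
            parked, loop = loop, []
        # Uops left in the scheduler take any free loop positions.
        while scheduler and len(loop) < capacity:
            loop.append(scheduler.pop(0))
        # One pass: a uop completes if it is the producer or if the
        # producer has already completed; otherwise it is replayed.
        replayed = []
        for uop in loop:
            if uop == producer or producer in completed:
                completed.append(uop)
            else:
                replayed.append(uop)
        loop = replayed
        if parked and not scheduler:
            # Scheduler drained: return the parked chain to the loop.
            loop, parked = loop + parked, []
        if not (loop or scheduler or parked):
            break
    return completed


if __name__ == "__main__":
    chain = [f"c{i}" for i in range(7)]
    print(simulate(chain, "p", park_on_lock=False))  # livelock: nothing completes
    print(simulate(chain, "p", park_on_lock=True))   # producer first, then the chain
```

Without the side buffer, the stuck uop never finds a free position and the chain replays until the pass limit runs out. With it, the stuck uop executes first and the parked chain completes on the next pass.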