While Fujitsu has made some very respectable Sparc64 chips aimed at the supercomputing market, it has been a long time since the Japanese chip and server maker put out a new Sparc64 processor that went into general purpose servers.
That changes in a big way with the forthcoming Sparc64-X processor, which will be used in …

COMMENTS

How?

The Sparc64-X core has a deeper pipeline, which enables a higher clock frequency on the processor.

How does a deeper pipeline allow a higher clock frequency?

I seem to remember that one of the reasons the Pentium 4 was such a flop was that it had a deep pipeline. On branch prediction misses, it would take ages to refill the deep pipeline with the correct instructions.

Re: How?

All high-clocked CPUs must have deep pipelines. I don't know why, maybe someone can explain, but this is what I have read in several places.

Basically, each stage in the pipeline does some of the work of executing machine code. High GHz means many stages that can do work. Maybe each stage is very specialized and can act fast, and the higher the GHz, the more specialized? I don't know.


Where is Matt Bryant, who all the time claimed there would be no SPARC64, and that SPARC64 is dead? And the rest of the IBM supporters? :o)

Re: How?

You want to execute one instruction. That instruction can be divided into several tasks - fetch, decode, execute, write. Suppose executing the whole instruction always takes N seconds.

If the processor is "single stage", it will perform one instruction every N seconds, with most of the circuitry idle as it waits for something to do:

--->time

F>D>E>W

.............F>D>E>W

If you manage to design the processor (and the program) so that each of the instruction tasks listed above can be done independently, you have a pipeline 4 stages deep. In that case, you can issue a new instruction every N/4 seconds and more of the circuitry is active at any given time. Big win. In reality instruction interdependence and jumps may force the processor to "flush the pipeline", i.e. discard the partially executed instructions, which evidently slows throughput. See in particular "branch prediction".

--->time

F1>D1>E1>W1

..>F2>D2>E2>W2

.....>F3>D3>E3>W3

.......>F4>D4>E4>W4

You can now deepen the pipeline by dividing the tasks into subtasks to issue even more instructions per N. Depending on your expected workload, this may or may not make sense.
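A toy timing model of the argument above (all numbers invented, purely to illustrate the arithmetic; an ideal pipeline with no flushes is assumed):

```python
# Toy pipeline timing model (illustrative numbers only).
# A non-pipelined CPU finishes one instruction every N seconds.
# An ideal k-stage pipeline, once full, retires one instruction every N/k seconds.

def execution_time(num_instructions, n_seconds, stages):
    """Seconds to run num_instructions on an ideal pipeline of `stages` stages."""
    cycle = n_seconds / stages          # each stage takes N/k
    fill = stages * cycle               # first instruction flows through every stage (= N)
    return fill + (num_instructions - 1) * cycle

N = 4e-9  # pretend a whole instruction takes 4 ns

single = execution_time(1000, N, 1)    # "single stage": 1000 instructions back to back
piped  = execution_time(1000, N, 4)    # 4-stage pipeline from the diagram above

print(single * 1e9, piped * 1e9)       # roughly 4000 ns vs 1003 ns
```

The deep-pipeline caveat shows up in the `fill` term: the deeper the pipe, the more you pay every time a flush forces you to refill it from empty.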

In the limit, you would get a processor that works asynchronously, without a central clock, where each logic gate does its work as soon as all its inputs have been set.

This has nothing to do with overall clock speed, though as frequency increases you cannot reliably deliver a good clock signal to all of the chip area "at the same time", so you are forced to compartmentalize anyway.

Re: How?

See http://forums.anandtech.com/archive/index.php/t-1011555.html for a quick explanation.

But in essence, it's not quite that high clock rates mean deep pipelines as a matter of physics; rather, the latency between inputs and outputs on your CPU determines how many operations you can run per second. So to get more oomph out of your CPU, you find something else for the input pipelines to feed while the original input thread finishes, and you never, ever want to run out of work by draining the pipeline(s). So if you are going to run fast, you need multiple input pipelines, sufficiently deep, and a core or cores that can pull data off the pipes so that the cores are always busy. On the other hand, getting your prediction of the content of the pipelines wrong has a killer effect on performance...
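That "killer effect" can be put in numbers with the standard cycles-per-instruction model (the branch frequency, misprediction rate, and penalties below are all made up for illustration):

```python
# Effective cycles-per-instruction (CPI) with branch misprediction flushes.
# Ideal pipelined CPI is 1; every mispredicted branch adds a flush penalty
# roughly equal to the pipeline depth, because the pipe must refill.

def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty

# Hypothetical workload: 20% branches, 5% of them mispredicted.
shallow = effective_cpi(1.0, 0.20, 0.05, 10)   # 10-stage flush penalty
deep    = effective_cpi(1.0, 0.20, 0.05, 30)   # 30-stage flush penalty

print(shallow, deep)   # about 1.1 vs 1.3 cycles per instruction
```

So the deeper pipe needs roughly 18% more cycles per instruction on this made-up workload, which is why it must clock correspondingly higher just to break even.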

Re: How?

Err, I am confused by the answers here.

Basically, electrical signals propagate at a speed comparable to the speed of light. At a given temperature that speed is fixed by the interconnect and materials of the chip. Because the propagation speed is fixed, the only way to reduce the time is to make the traces shorter, and this can only be achieved by adding pipeline stages. That is why the Pentium 4 "NetBurst" design even had pipeline stages that did no work at all: their only purpose was to shorten the critical paths of other stages. Reducing the time needed to propagate a signal along the critical path increases the maximum possible frequency.
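The critical-path argument above is just arithmetic: the clock period cannot be shorter than the slowest stage's delay plus the fixed latch overhead at each stage boundary. A rough sketch (the delays and the 50 ps overhead are invented for illustration):

```python
# Max clock frequency set by the critical path through one pipeline stage.
# Each stage boundary adds a fixed latch/flip-flop overhead that does not
# shrink when you split a stage - the diminishing return of deep pipelines.

def max_frequency_ghz(stage_delay_ps, latch_overhead_ps=50):
    period_ps = stage_delay_ps + latch_overhead_ps   # shortest workable clock period
    return 1e3 / period_ps                           # 1 / (period in ps), in GHz

# Splitting one 950 ps stage into two ~475 ps stages shortens the
# critical path, so the same logic can be clocked faster:
print(max_frequency_ghz(950))   # one long stage  -> 1.0 GHz
print(max_frequency_ghz(475))   # after splitting -> ~1.9 GHz, not 2.0,
                                # because the latch overhead doesn't split
```

This is also consistent with the NetBurst "drive" stages mentioned above: a stage that does nothing but move a signal still buys you a shorter critical path per stage, hence a higher clock.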