Rationale for Multiprocessors on a Chip

The primary compute-intensive application that most consumers will care
about is graphics (3D animation, MPEG decoding, and so on), and there is
plenty of parallelism in graphics to enable effective use of multiple
processors on a chip.

With coming graphics programs requiring thousands of MIPS, I'd estimate
the utilization of multiple processors to be over 90%.

Compare wide-issue processors to multiprocessors. In a wide-issue
processor, each time you increase the issue rate by one instruction
per clock, you must add logic to support the increased issue rate,
and the overall average utilization of that logic falls off.

Similarly, each time you add a processor to a multiprocessor system,
the overall average utilization falls off. However, it's easier to
find exploitable parallelism across a whole program than within a
single thread, so the falloff should be less rapid in a multiprocessor
system, given the coming graphics applications. For this reason, I
think that processors issuing more than a few instructions per clock
will be less than optimum for graphics applications.
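The slot-utilization falloff above can be sketched numerically. This is a
back-of-the-envelope model, not a measurement: the assumption (mine, not
from any study) is that a single thread exposes a fixed amount of
exploitable ILP, so issue slots beyond that simply sit idle.

```python
# Toy model: if a thread has a fixed amount of exploitable ILP,
# issue slots beyond that go idle, so average slot utilization
# falls as the machine gets wider. available_ilp=2.0 is an
# assumed, illustrative figure.

def slot_utilization(issue_width, available_ilp=2.0):
    """Fraction of issue slots doing useful work each clock."""
    return min(issue_width, available_ilp) / issue_width

for width in (1, 2, 4, 8):
    print(width, slot_utilization(width))
# widths 1 and 2 are fully used; width 4 is half idle, width 8 is 3/4 idle
```

The same shape of curve applies to adding processors, but with more
parallelism available across threads, the knee sits further out.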

Supercomputers, as I recall, went to multiprocessors at less than 1000
MIPS in the early '80s. Apparently, it was the path of least cost to
higher performance.

It seems that a certain threshold must be crossed before multiple
processors are the better choice. Supercomputers crossed that threshold
in the early '80s. I surmise that microprocessors will soon cross it
because graphics programs have a lot of parallelism.

As CMOS processes shrink, wiring delay becomes a larger fraction of the
clock cycle. This tends to reduce the performance of wide-issue processors.

At some point, multithreading becomes attractive as a
way to keep the processor busy. However, each thread needs cache
space, and it seems to me that once you're up to 2-4 threads,
it's better to add more processors rather than attempt to run
even more threads to keep each processor busy.

Multithreading a processor can mask some of the DRAM access time.
One could even execute another thread while waiting for a branch
condition to become available. This could be a viable option when
there are plenty of threads.
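How many threads it takes to mask DRAM time can be estimated with a
simple interleaving model. The numbers below are illustrative assumptions
(20 busy cycles between stalls, a 60-cycle DRAM stall), chosen only to
show the shape of the tradeoff.

```python
# Back-of-the-envelope latency-hiding model: each thread runs for
# run_cycles, then stalls for stall_cycles on DRAM. Interleaving
# n_threads lets other threads' work cover one thread's stall.

def utilization(n_threads, run_cycles, stall_cycles):
    """Fraction of cycles the processor does useful work."""
    return min(1.0, n_threads * run_cycles / (run_cycles + stall_cycles))

R, L = 20, 60  # assumed: 20 busy cycles, then a 60-cycle DRAM stall
for n in (1, 2, 3, 4):
    print(n, utilization(n, R, L))
# one thread leaves the processor 75% idle; four threads cover the stall
```

With these (hypothetical) numbers, utilization saturates right around
four threads, which is consistent with the 2-4 thread crossover point
suggested above.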

As the amount of logic that can be put on a chip increases, it's only a
matter of time before multiple processors become more viable than a
really wide-issue single processor. Then it becomes a matter of optimizing
issue width versus productivity within the individual processors used in
a multiprocessor system.

There is a tradeoff between low utilization of hardware in a
high-ILP processor and low utilization of hardware in a multiprocessor
due to a lack of parallelism. The balance will soon tip
in favor of multiple processors, if it hasn't already.

Looking down the road, the ILP needed for good multiprocessor
performance is inversely proportional to the clock speed, so as
clock rates increase, the ILP can decrease. For example, by the
time clock rates have increased by a factor of 5, the ILP may
have dropped by a factor of 2, down to where ILP is near optimum
with respect to cost-performance.
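The inverse relationship is just arithmetic on delivered throughput:
per-processor performance scales roughly as clock rate times ILP. A
minimal sketch, using an assumed (hypothetical) per-processor target of
2000 MIPS:

```python
# Delivered performance scales roughly as clock rate times
# sustained instructions per clock (ILP).

def mips(clock_mhz, ilp):
    """Rough throughput in MIPS for a given clock and sustained ILP."""
    return clock_mhz * ilp

# Hypothetical target: 2000 MIPS per processor.
# At 200 MHz that demands an impractically wide ILP of 10;
# at 1000 MHz (1 GHz), an ILP of 2 already gets there.
print(mips(200, 10))   # 2000
print(mips(1000, 2))   # 2000
```

So as clock rates climb past 1 GHz, a narrow machine can meet a fixed
per-processor target, which is the basis for the ILP < 2 claim below.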

Once you're up to clock rates above 1 GHz, an ILP of less than 2 would
seem to be optimum from the standpoint of cost-performance.

The Stanford Hydra project uses four processors on a chip to
cooperatively work on a single thread. This improves single-thread
execution speed when that is the bottleneck; when it isn't, the multiple
processors can execute code more efficiently than a single processor.