Single-threaded vs. Multi-threaded position statements

Joel Emer -
In the early 1970s, when Intel introduced the 4004 microprocessor,
researchers immediately proposed that they could deliver immense
performance improvements by hooking together 100s or 1000s of 4004s.
You have probably never seen one of these wonders, or the fruits of any of
the subsequent proposals to build massively parallel machines out of each
successive microprocessor generation. What, if anything, has changed now?
Clearly it is not just that we cannot achieve IPC gains in direct
proportion to the number of transistors used, because that's never been
true. If it had been true then we'd have IPCs several orders of magnitude
larger than we have today. Our recent rule of thumb has been that processor
performance improves as the square root of the number of transistors, and
cache miss rates likewise improve as the square root of the cache size.
Yet, despite these sub-linear architectural improvements, such machines
have been the preferred trajectory. Why is this insufficient for today?
Are there no ideas that will bring even that sub-linear architectural
performance gain? Why aren't the order-of-magnitude gains being promised
by SIMD, vector and/or streaming processors of interest?
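
The square-root rule of thumb above can be put into numbers with a minimal
sketch. The function names are hypothetical, and reading "miss rates improve
as the square root of the cache size" as the miss rate falling in proportion
to 1/sqrt(size) is an assumption about what the rule means, not a measured
model.

```python
import math


def relative_performance(transistor_ratio: float) -> float:
    """Performance multiplier under the square-root rule of thumb."""
    if transistor_ratio <= 0:
        raise ValueError("transistor_ratio must be positive")
    return math.sqrt(transistor_ratio)


def relative_miss_rate(cache_size_ratio: float) -> float:
    """Miss-rate multiplier, assuming miss rate scales as 1 / sqrt(cache size)."""
    if cache_size_ratio <= 0:
        raise ValueError("cache_size_ratio must be positive")
    return 1.0 / math.sqrt(cache_size_ratio)


if __name__ == "__main__":
    # Compare the transistor (or cache) budget with what it buys under the rule.
    for ratio in (2, 4, 16, 100):
        print(
            f"{ratio:>4}x transistors -> {relative_performance(ratio):.2f}x performance; "
            f"{ratio:>4}x cache -> {relative_miss_rate(ratio):.2f}x miss rate"
        )
```

Under this reading, quadrupling the transistor budget only doubles
performance, which is the sub-linear return the questions above refer to.
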
Is our problem one of lack of innovation rabbits, now that we've lost MIPS
and Alpha and marginalized SPARC and Power? Is the complexity of the x86
architecture a factor in the inability to push across the next
architectural performance step?
Or is there a feeling that, irrespective of architecture, we've crossed
a complexity threshold beyond which we can't build better processors in a
timely fashion? Is using full (but simpler) processors as the building
blocks the right granularity? Are multiprocessors really such a panacea of
simplicity? How much complexity is going to be introduced in the
inter-processor interconnect, in the shared cache hierarchy and in
processor support for mechanisms like transactional memory?
And much of the challenge of past generations has been coping with the
increasing disparity between processor speed and memory speed or the
limits of die-to-memory bandwidth. Does having a multiprocessor with its
multiple contexts just make this problem worse, not better?
And even if multiprocessors really are a simpler alternative, what is the
application domain over which they are going to provide a benefit? And
will enough people be able to program them?

Yale Patt -
I start with the premise that the purpose of a chip is to optimally
execute someone's desired single application. That is, the reason for
designing an expensive chip, rather than a network of simple chips, is to
speed up the execution of individual programs. For servers, multiple
cheaper chips could be much more effective. Second, I do not believe that
everything we do must be transparent to the buffoon-programmer who cannot
be expected to understand anything beyond the highest level, template-type
programming language.

This does not suggest that research should not be undertaken to figure out
how to get people to think via a "parallel programming model," or that
multiprocessor processing is not important. Cache coherency, memory
consistency, contention issues, etc. are all relevant avenues for useful
research. The fact is that current programmers do not naturally think
"parallel programming model," which means a lot of applications are
single-threaded. If their performance is important, then we need to provide
single-thread performance. It is true that more and more applications
naturally support lots of threads. Thus, the ability to handle lots of
parallel threads is also important. However, we need to consider Amdahl's
Law -- getting the whole job done fast also requires the ability to handle
the serial part. Ergo, Pentium_X/Niagara_Y. To do this:

1. We treat what I have called the Levels of Transformation (from natural
language problem statement to logic circuits) as one integrated whole.

a. That is, we add large functional units to the microarchitecture that
remain powered off when not in use, but are powered up via compiled code
when necessary to carry out a needed piece of work specified by the
algorithm. (My "Refrigerator" analogy.) ...and we let the algorithm writer
and the compiler writer know that they are available.

b. We add many very light-weight processor engines for processing the
embarrassingly parallel part of an algorithm. (Niagara X.)

c. We add some (very few, perhaps only one) very heavy-weight processors
with serious hybrid branch prediction, out-of-order execution, etc. to
handle the serial part of an algorithm. (Pentium Y.)

2. We provide appropriate interconnect to allow the Pentium Y core to talk
to the Niagara processors, without throwing away a lot of performance
waiting for information to go from one part of the chip to another.

3. We deal with the off-chip memory bandwidth demands by asking how to
reduce this bandwidth. Re code: denser encoding of the I-stream. Re data:
representing values with the minimum number of bits required. Re on-chip
storage: we store what we need. In all three cases, can we take
advantage of the enormous increase in logic capability to save off-chip
bandwidth?
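
The Amdahl's Law argument for pairing many light-weight engines with one
heavy-weight processor can be illustrated with a minimal sketch. The function
and parameter names are hypothetical, and the single `heavy_core_gain` factor
(how much faster the heavy-weight core runs serial code than a light-weight
engine does) is a simplifying assumption, not a figure from this statement.
With `heavy_core_gain=1.0` the formula reduces to the classic Amdahl's Law.

```python
def speedup(serial_fraction: float, light_cores: int,
            heavy_core_gain: float = 1.0) -> float:
    """Speedup over one light-weight core for a program with a serial part.

    serial_fraction: share of the work that cannot be parallelized (0..1).
    light_cores:     number of light-weight engines running the parallel part.
    heavy_core_gain: assumed serial speedup of the heavy-weight core relative
                     to one light-weight engine (1.0 = no heavy-weight core).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if light_cores < 1 or heavy_core_gain <= 0:
        raise ValueError("light_cores must be >= 1 and heavy_core_gain > 0")
    serial_time = serial_fraction / heavy_core_gain       # runs on one core
    parallel_time = (1.0 - serial_fraction) / light_cores  # spread over engines
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    for gain in (1.0, 2.0, 4.0):
        s = speedup(serial_fraction=0.1, light_cores=64, heavy_core_gain=gain)
        print(f"heavy_core_gain={gain:.0f}x -> overall speedup {s:.2f}x")
```

In this model, a program that is 10% serial gains less than 9x from 64
light-weight engines alone, while doubling the speed of the serial part
lifts the overall speedup above 15x.
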

Mark Hill -
For decades technologists have provided architects with more transistors
that we have used to make faster processors via bit-level parallelism,
instruction-level parallelism, and memory hierarchies. It now appears
that not even Yale Patt can figure out ways to use even more transistors
to speed up processors in cost- and power-effective ways. A critical
barrier is that it appears hard to do useful work on behalf of a single
thread for the hundreds of instruction opportunities it now takes to
access main memory. Thus, it is time to turn to the easier task of teaching
Yale multi-threaded programming.
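
The "hundreds of instruction opportunities" figure follows from simple
arithmetic, sketched below. The function name and the sample latency, clock
rate and issue width are all illustrative assumptions, not numbers taken from
this statement.

```python
def issue_slots_per_memory_access(latency_ns: float, clock_ghz: float,
                                  issue_width: int) -> int:
    """Instruction issue slots that elapse during one main-memory access."""
    if latency_ns < 0 or clock_ghz <= 0 or issue_width < 1:
        raise ValueError("latency must be >= 0, clock > 0, issue_width >= 1")
    cycles = latency_ns * clock_ghz  # ns * (cycles per ns) = cycles
    return round(cycles * issue_width)


if __name__ == "__main__":
    # Hypothetical machine: 100 ns memory latency, 2 GHz clock, 2-wide issue.
    print(issue_slots_per_memory_access(latency_ns=100, clock_ghz=2.0, issue_width=2))
```
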