We begin our investigation by evaluating multithreading in its own right.
Later we will examine the benefits of combining multithreading with
prefetching.

Figure shows performance results for 1-, 2-, and
4-context processors with context switching penalties of 4 and 16 cycles.
Each bar in the graphs is broken down into the following components: time
spent executing instructions, time spent switching between contexts,
and time when the processor is idle. The idle time is further divided
into all-idle time, during which every context is idle waiting for a
reference to complete, and no-switch time, during which the current
context is idle but is not switched out. Most of the latter is due to
the processor being locked out of the primary cache while fill
operations of other contexts complete.
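The four components above can be expressed as fractions of total execution time. The sketch below is only illustrative: the category names follow the figure's breakdown, but the cycle counts are hypothetical rather than measured values.

```python
def time_breakdown(busy: int, switching: int, no_switch_idle: int,
                   all_idle: int) -> dict:
    """Normalize the four execution-time components (in cycles) to
    fractions of total time. Hypothetical inputs, not measured data."""
    total = busy + switching + no_switch_idle + all_idle
    return {
        "busy": busy / total,                      # executing instructions
        "switching": switching / total,            # switching between contexts
        "no_switch_idle": no_switch_idle / total,  # current context stalled, not switched out
        "all_idle": all_idle / total,              # every context waiting on a reference
    }

# Hypothetical 1000-cycle window.
print(time_breakdown(600, 100, 50, 250))
```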

Most of the applications benefited from multithreading. The noteworthy
exceptions are CHOLESKY and PTHOR, where the performance is worse
with four contexts than with a single context. The reason for this is that
these two applications do not scale well to 64 processes, and therefore the
processes spend too much time spinning waiting for work. This extra
spinning time can be seen as the increase in the instruction category
in Figure .

To provide some insight into these results, Table
shows the median run length and average primary miss latency for each
application. A rough estimate of the number of contexts necessary to hide
memory latency is the miss latency divided by the run length. For example,
MP3D has one of the more favorable ratios (roughly two-to-one), which helps
explain why two contexts eliminate a large fraction of the idle time. In
contrast, OCEAN has a ratio of more than three-to-one, which helps explain
why two contexts eliminate only part of the idle time.
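The rough estimate above can be written down directly. In the sketch below, the function name is ours and the cycle counts are hypothetical; it simply rounds the miss-latency-to-run-length ratio up to a whole number of contexts.

```python
import math

def contexts_to_hide_latency(miss_latency: float, run_length: float) -> int:
    """Rough estimate of the number of contexts needed to hide memory
    latency: the miss latency divided by the run length, rounded up."""
    return math.ceil(miss_latency / run_length)

# Hypothetical cycle counts: a roughly two-to-one ratio (as in MP3D)
# suggests two contexts suffice, while a more than three-to-one ratio
# (as in OCEAN) calls for four.
print(contexts_to_hide_latency(30, 15))  # 2
print(contexts_to_hide_latency(50, 15))  # 4
```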

However, a favorable miss-latency-to-run-length ratio does not ensure good
performance. For example, in BARNES this ratio suggests that two contexts
should be sufficient to hide the latency, but in fact only about half of
the all-idle time is eliminated. The reason is the clustering of cache
misses: when several misses from the same context arrive close together,
the other contexts cannot supply enough work to cover them all. In
addition, cache miss rates can deteriorate as the different contexts
compete for the same cache; we observe this effect in LOCUS, where the
primary data miss rate more than doubles, from 14% to 30%, as we go from
one to four contexts.

The importance of minimizing the context switch latency varies depending on
whether there is frequently another ready-to-run context during a context
switch. On the one hand, when some of the applications are run with only 2
contexts (e.g., OCEAN, LU, and PTHOR), there typically is not a
ready-to-run context during a context switch, and therefore reducing the
switch penalty from 16 to 4 cycles has little impact on performance. On the
other hand, the switch penalty does affect performance significantly in
most cases with 4 contexts, and even in some cases with 2 contexts (e.g.,
CHOLESKY and LOCUS). Therefore, given that there are enough contexts to
hide the latency, it is important to minimize the context switch latency.
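The effect of the switch penalty can be seen with a simple model: if the processor switches once per primary-cache miss and switching never overlaps useful work, the overhead fraction grows directly with the penalty. The miss count and busy-cycle figures below are hypothetical, chosen only to contrast 16-cycle and 4-cycle penalties.

```python
def switch_overhead_fraction(misses: int, penalty: int,
                             busy_cycles: int) -> float:
    """Fraction of total time spent context switching, assuming one
    switch per primary-cache miss and no overlap with useful work."""
    overhead = misses * penalty
    return overhead / (busy_cycles + overhead)

# Hypothetical workload: one million busy cycles, 50,000 misses.
print(round(switch_overhead_fraction(50_000, 16, 1_000_000), 3))  # 0.444
print(round(switch_overhead_fraction(50_000, 4, 1_000_000), 3))   # 0.167
```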

To summarize, we see that multithreading can increase performance
significantly when the run-length-to-latency ratio is favorable. However,
enough parallelism must be available in the application to keep the
additional contexts busy. We further observe that destructive interference
of the contexts in the processor cache can undo any gains achieved.
Interference is more of a problem with multithreading than with
prefetching because multiple working sets interfere with each other in the
same cache. The smaller the number of cycles required for context
switching, the lower the total overhead due to multithreading. A context
switch cost of 16 cycles introduces significant overhead, whereas the
overhead is much more reasonable with a 4-cycle switch penalty.