Figure 8: Relative run time of MATMUL and StreamMD, with and without inter-cluster communication, normalized to the (1, 16, 4) configuration.

threads. StreamFEM contains several barrier synchronization points, and load imbalance increases the synchronization time. This is most clearly seen in the extreme (16,1,4) configuration.

Based on the results presented in [10], we expected the benefit of TLP for irregular control to be significant (up to 20% for StreamMD and 30% for StreamCDP). Our results, however, show much smaller performance improvements. In the baseline configuration, with a 1MB SRF and inter-cluster communication, we see no advantage at all to utilizing TLP.

For StreamMD, the reason is the effective inter-cluster communication enabled by SIMD. In the baseline configuration with 2-ILP, the algorithm utilizes the entire 1MB of the SRF for exploiting locality (the cross-lane duplicate removal method of [10]) and is thus able to match the performance of the MIMD-enabled organizations. As discussed earlier, the 4-ILP configurations perform poorly because of a less effective VLIW schedule. If we remove the inter-cluster switch for better area scaling (Sec. 5), the performance advantage of MIMD grows to a significant 10% (Fig. 8). Similarly, increasing the SRF size to 8MB provides enough state for exploiting locality within the SRF space of a single cluster, and the performance advantage of utilizing TLP is again 10% (Fig. 7).
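The cross-lane sharing referred to above can be illustrated with a heavily simplified sketch (our own, not the implementation of [10]): when several SIMD lanes request the same record, one load suffices and the result is shared across lanes via the inter-cluster switch.

```python
# Illustrative sketch of cross-lane duplicate removal, simplified from
# the idea attributed to [10]: lanes requesting the same index share
# a single load instead of each issuing its own.

def dedup_loads(lane_requests):
    """Return (unique indices to actually load,
    per-lane slot into that list of unique loads)."""
    unique = []   # indices that must be fetched from memory
    slots = []    # for each lane, which fetched entry it consumes
    for idx in lane_requests:
        if idx not in unique:
            unique.append(idx)
        slots.append(unique.index(idx))
    return unique, slots

# Four lanes request indices 5, 9, 5, 2: only three loads are issued.
unique, slots = dedup_loads([5, 9, 5, 2])
```

Without an inter-cluster switch this sharing is confined to a single cluster, which is consistent with the larger MIMD advantage observed in that configuration.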

In the case of StreamCDP, the advantage of TLP is negated because the application is memory-throughput bound. Even though computation efficiency is improved, the memory system throughput limits performance, as is evident from the run-time category in which all sequencers are idle while the memory system is busy (white bars in Fig. 6).
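The memory-bound condition can be stated as a simple roofline-style check: if a kernel's arithmetic intensity falls below the machine balance, no amount of extra compute parallelism (TLP included) raises performance. The numbers below are invented placeholders, not measurements from the paper.

```python
# Simple roofline-style classification (illustrative only; the
# FLOP/byte figures here are assumed, not taken from StreamCDP).

def bound_kind(flops, bytes_moved, peak_flops, peak_bw):
    """Classify a kernel as compute- or memory-bound."""
    intensity = flops / bytes_moved   # FLOPs the kernel does per byte moved
    balance = peak_flops / peak_bw    # FLOPs/byte needed to saturate the ALUs
    return "compute-bound" if intensity >= balance else "memory-bound"

# Assumed values: 0.5 FLOPs/byte against a machine balance of 4 FLOPs/byte.
kind = bound_kind(flops=1e9, bytes_moved=2e9,
                  peak_flops=128e9, peak_bw=32e9)   # "memory-bound"
```

A memory-bound result of this kind matches the white (memory-busy, sequencers-idle) bars in Fig. 6.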

7. CONCLUSION

We extend the stream architecture to support and scale along the three main dimensions of parallelism. VLIW instructions utilize ILP to drive multiple ALUs within a cluster, clusters are grouped under SIMD control of a single sequencer exploiting DLP, and multiple sequencer groups rely on TLP. We explore the scaling of the architecture along these dimensions, as well as the tradeoffs between choosing different values of ILP, DLP, and TLP control for a given set of ALUs. Our methodology provides a fair comparison of the different parallelism techniques within the scope of applications with scalable parallelism, as the same execution model and basic implementation were used in all configurations.
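The tradeoff space can be enumerated directly: for a fixed ALU budget, each configuration is a (TLP, DLP, ILP) triple, i.e. sequencer groups, clusters per group, and ALUs per cluster, whose product equals the budget. The sketch below assumes a 64-ALU budget, consistent with configurations named in this section such as (1, 16, 4) and (16, 1, 4).

```python
# Enumerate (TLP, DLP, ILP) factorizations of a fixed ALU budget
# (our sketch of the configuration space, assuming 64 total ALUs).

def configurations(total_alus):
    """All (sequencer groups, clusters/group, ALUs/cluster) triples."""
    cfgs = []
    for t in range(1, total_alus + 1):          # TLP: sequencer groups
        if total_alus % t:
            continue
        for d in range(1, total_alus // t + 1):  # DLP: clusters per group
            if (total_alus // t) % d:
                continue
            i = total_alus // (t * d)            # ILP: ALUs per cluster
            cfgs.append((t, d, i))
    return cfgs

cfgs = configurations(64)   # includes (1, 16, 4) and (16, 1, 4)
```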

We develop a detailed hardware cost model based on area normalized to a single ALU, and show that adding TLP support is beneficial as the number of ALUs on a processor scales above 32 − 64. However, the cost of increasing the degree of TLP to the extreme of a single sequencer per cluster can be significant, ranging between 15 − 86%. The low end of this range corresponds to configurations in which introducing sequencer groups partitions global switch structures on the chip, whereas high overhead is unavoidable when the datapath is narrow (32 bits) and no inter-cluster communication is available.
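The shape of such a cost model can be sketched as follows. This is purely illustrative: the coefficients and the quadratic switch term are invented placeholders, not the paper's actual model, but they capture the two opposing effects described above (per-group sequencer overhead versus a global switch whose area shrinks when partitioned).

```python
# Illustrative area model in ALU-equivalents; all coefficients are
# assumed values, NOT the paper's cost model.

def relative_area(alus, seq_groups, alu_area=1.0,
                  seq_area=8.0, switch_area_per_alu=0.05):
    """Total area = ALUs + per-group sequencer overhead + a global
    switch term that shrinks as sequencer groups partition it."""
    switch = switch_area_per_alu * alus * alus / seq_groups
    return alus * alu_area + seq_groups * seq_area + switch

# With these placeholder coefficients, one sequencer per cluster
# (64 groups) costs more area than a single-sequencer baseline.
extreme = relative_area(64, seq_groups=64)
baseline = relative_area(64, seq_groups=1)
```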

Our performance evaluation shows that a wide range of numerical applications with scalable parallelism are fairly insensitive to the type of parallelism exploited. This is true for both regular- and irregular-control algorithms, and overall the performance speedup is in the 0.9 − 1.15 range. We provide a detailed explanation of the many subtle sources of the performance differences and discuss the sensitivity of the results.

Continuing increases in transistor count force all the major parallelism types – DLP, ILP, and TLP – to be exploited in order to provide performance scalability as the number of ALUs per chip grows. The multi-threaded stream processor introduced in this paper is a step toward expanding the applicability and performance scalability of stream processing. The addition of hardware MIMD support allows efficient mapping of applications with a limited amount of fine-grain parallelism. In addition, our cost models enable the comparison of stream processors with other chip multiprocessors in both performance and efficiency.