In this paper the authors compare the single-processor performance of the SGI Origin and PowerChallenge and use a previously reported performance model for hierarchical memory systems to explain the results. Both machines use the same microprocessor (MIPS R10000) but differ significantly in their memory subsystems. The memory model includes the effect of overlap between CPU and memory operations, which allows the authors to infer the individual contributions of all three improvements in the Origin's memory architecture and to relate the effectiveness of each improvement to application characteristics.

The performance and scalability of high-performance scientific applications on large-scale parallel machines depend more on the hierarchical memory subsystems of these machines than on the peak instruction rate of the processors employed. This dependence is likely to increase in the future: while single-processor performance may double every eighteen months, memory bandwidth increases by only 15% during the same period. In addition, distributed shared memory (DSM) architectures are now being implemented which extend the concept of single-processor cache hierarchies across an entire physically distributed multiprocessor machine. Machines which will be available to the Department of Energy's Accelerated Strategic Computing Initiative (ASCI) can have as many as 128 processors in a single DSM. Scalability of these machines to large numbers of processors is ultimately tied to memory hierarchy performance, which includes data migration policies and distributed cache coherence protocols. Investigations of the performance improvements of applications over time and across new generations of machines must therefore explicitly account for the effects of memory performance. In this paper, the authors characterize application performance with a memory-centric view. The applications are a representative part of the ASCI workload. Using a simple Mean Value Analysis (MVA) strategy and observed performance data, they infer the contribution of each level in the memory system to the application's overall performance in cycles per instruction (CPI). Their empirical model accounts for the overlap of processor execution with memory accesses.
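The kind of per-level CPI accounting described above can be illustrated with a minimal sketch. It is not the authors' model or their MVA procedure. It assumes a common textbook form in which total CPI is a base CPI plus, for each memory level, references per instruction times access latency, scaled by the fraction of memory time that is not hidden behind computation. The names `MemoryLevel`, `cpi_breakdown`, `infer_overlap`, the single `overlap` factor, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryLevel:
    """One level of the hierarchy (hypothetical structure, for illustration)."""
    name: str
    refs_per_instr: float   # references satisfied at this level, per instruction
    latency_cycles: float   # cost of one access at this level, in cycles


def cpi_breakdown(cpi0, levels, overlap):
    """Return (total CPI, per-level CPI contributions).

    Assumed form: CPI = cpi0 + (1 - overlap) * sum(refs_i * latency_i),
    where `overlap` in [0, 1] is the fraction of memory time hidden
    behind processor execution.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie in [0, 1]")
    contributions = {
        lv.name: (1.0 - overlap) * lv.refs_per_instr * lv.latency_cycles
        for lv in levels
    }
    return cpi0 + sum(contributions.values()), contributions


def infer_overlap(cpi_measured, cpi0, levels):
    """Solve the assumed form for `overlap`, given a measured CPI."""
    raw_stall = sum(lv.refs_per_instr * lv.latency_cycles for lv in levels)
    if raw_stall == 0:
        raise ValueError("no memory time to overlap")
    return 1.0 - (cpi_measured - cpi0) / raw_stall


# Hypothetical inputs, not measurements from either machine.
levels = [
    MemoryLevel("L2", refs_per_instr=0.02, latency_cycles=10),
    MemoryLevel("memory", refs_per_instr=0.005, latency_cycles=100),
]
total_cpi, per_level = cpi_breakdown(cpi0=1.0, levels=levels, overlap=0.5)
```

With hardware-counter values for the reference rates and a measured CPI, `infer_overlap` inverts the same assumed relation, which is the sense in which an empirical model can attribute observed CPI to individual levels of the hierarchy.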