Proceedings of the 1997 Winter Simulation Conference
ed. S. Andradóttir, K. J. Healy, D. H. Withers, and B. L. Nelson
EXECUTION-DRIVEN SIMULATORS FOR PARALLEL SYSTEMS DESIGN
Anand Sivasubramaniam
Department of Computer Science and Engineering
The Pennsylvania State University
University Park, Pennsylvania 16802, U.S.A.
ABSTRACT

Evaluating, analyzing and predicting the performance of a parallel system is challenging due to the complex interplay between the application characteristics and architectural features. The overheads in a parallel system that limit its scalability have to be identified and separated in order to enable performance-conscious parallel application design and the development of high-performance parallel machines. We have developed an evaluation framework that uses a combination of experimentation, simulation and analytical modeling to quantify these parallel system overheads. At the heart of this framework is an execution-driven simulation testbed called SPASM which uses a suite of real applications as the workload. We discuss our experiences in using this simulator in a wide range of architectural projects in this paper.

1 INTRODUCTION

High Performance Computing is becoming increasingly important to scientific advancement and economic development and is at the point of significantly improving our standard of living. With the inherent limitations of sequential computing, parallel machines have been proposed as the solution for high-performance computing. Despite their promise and attractiveness to the research community, parallel machines have not been very successful in the commercial world due to two main reasons. First, their delivered performance often falls short of the projected peak performance. Second, the cost of these machines is high compared to their sequential counterparts.

For the success of parallel computation, we should build machines that bridge the gap between projected and delivered performance over a spectrum of important real-world applications in a cost-effective manner. Performance evaluation of parallel systems plays a crucial role towards this goal.

Applications exhibit different characteristics, thus imposing diverse demands on the underlying hardware, while parallel machines also come in several flavors. To find out how good a job a machine does of meeting an application's demands, we need a way of evaluating the match between an application and an architecture. Evaluating the performance of an application-architecture combination has widespread applicability in parallel systems research. The results from such an evaluation may be used to: select the best architecture platform for an application domain, select the best algorithm for solving the problem on a given hardware platform, predict the performance of an application on a larger configuration of an existing architecture, predict the performance of large application instances, identify application and architectural bottlenecks in a parallel system to suggest application restructuring and architectural enhancements, and evaluate the cost vs. performance trade-offs in important architectural design decisions. But evaluating and analyzing the performance of parallel systems pose several problems due to the complex interaction between application characteristics and architectural features.

Performance evaluation techniques have to grapple with several more degrees of freedom exhibited by parallel systems compared to their sequential counterparts. Experimentation and measurement on actual hardware, analytical modeling and simulation are three well-known performance evaluation techniques. But each technique has its own limitations. Experimentation requires the hardware to be built, analytical models often make unreasonable assumptions about the underlying system to keep the modeling tractable, and simulation requires immense resources in terms of storage and time.

In this paper, we summarize our previous and ongoing effort in developing a framework for evaluating the performance of parallel systems and using this framework to develop cost-effective platforms that meet the demands of numerous real-world applications.
First, we identify performance metrics which are essential to understand the intrinsic algorithmic and architectural artifacts that impact the performance of a parallel system. Next, we outline an evaluation framework that we have developed to quantify these metrics. The framework uses all three performance evaluation techniques to alleviate their individual limitations. At the heart of this framework lies SPASM (Simulator for Parallel Architectural Scalability Measurements), which provides detailed performance profiles for applications on a range of parallel hardware platforms. This simulator helps identify, isolate and quantify the algorithmic and architectural bottlenecks in an execution, which can be used for application restructuring and to suggest architectural enhancements. In the rest of the paper, we illustrate the utility of an execution-driven simulator such as SPASM in several architectural projects.

The rest of this paper is organized as follows. In Section 2, we identify performance metrics that we require from evaluating a parallel system and discuss different evaluation techniques. Section 3 outlines our evaluation framework and the SPASM simulator. Section 4 summarizes our experience and results in using execution-driven simulators for several architectural projects. Finally, Section 5 presents concluding remarks.

2 EVALUATING PARALLEL SYSTEMS

In conducting any evaluation, we need to identify a set of performance metrics that we would like to measure and the techniques and tools that will be used to gather these metrics.

2.1 Performance Metrics

Metrics which capture the "available" compute power (MFLOPS, MIPS, etc.) are often not a true indicator of the performance actually "delivered" by a parallel system. Metrics for parallel system performance evaluation should quantify this gap between available and delivered compute power, since understanding application and architectural bottlenecks is crucial for application restructuring and architectural enhancements. Many performance metrics, such as speedup, scaled speedup and isoefficiency, have been proposed to quantify the match between the application and architecture in a parallel system. While these metrics are useful for tracking overall performance trends, they provide little additional information about where performance is lost. Some of these metrics attempt to identify the cause (the application or the architecture) of the problem when the parallel system does not scale as expected. However, it is essential to find the individual application and architectural artifacts that lead to these bottlenecks and quantify their relative contribution towards limiting the overall scalability of the system. Traditional metrics do not help further in this regard.

Parallel system overheads may be broadly classified into a purely algorithmic component (algorithmic overhead), a component arising from the interaction of the application with the system software (software interaction overhead), and a component arising from the interaction of the application with the hardware (hardware interaction overhead). Algorithmic overheads arise from the inherent serial part in the application, the work-imbalance between the executing threads of control, any redundant computation that may be performed, and additional work introduced by the parallelization. Software interaction overheads, such as overheads for scheduling, message-passing, and software synchronization, arise due to the interaction of the application with the system software. Hardware slowdowns due to network latency (the transmission time for a message in the network), network contention (the amount of time spent in the network waiting for availability of network resources), and synchronization and cache coherence actions contribute to the hardware interaction overhead. Each of these components would cause the performance to deteriorate from the available compute power (potential peak performance) of the hardware. To fully understand the scalability of a parallel system, it is important to isolate and quantify the impact of each of these components on the overall execution. In our earlier research, we have proposed the notion of an overhead function (Sivasubramaniam et al. 1994) that tracks the growth of a particular system overhead with respect to a specific system parameter.

2.2 Evaluation Techniques

Experimentation, analytical modeling and simulation are three well-known techniques for evaluating parallel systems. Experimentation involves implementing the application on the actual hardware and measuring its performance. Analytical models abstract hardware and application details in a parallel system and capture complex system features by simple mathematical formulae. These formulae are usually parameterized by a limited number of degrees of freedom so that the analysis is kept tractable. Simulation is a valuable technique which exploits computer resources to model and imitate the behavior of a real system in a controlled manner. Each technique has its own limitations. The amount of statistics that can be gleaned by experimentation (to quantify the overhead functions) is limited by the monitoring/instrumentation support provided by the underlying system.
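As a toy illustration of the overhead-function notion from Section 2.1: whichever technique gathers the data, the bookkeeping amounts to tracking each overhead component's share of execution time as a system parameter (here, the processor count) grows. The function name and all numbers below are invented for this sketch; they are not SPASM output.

```python
# Hypothetical sketch of an overhead function: for each processor count p,
# the measured execution time is split into per-component overheads, and
# the growth of each component's share is tracked against p.

def overhead_function(measurements):
    """measurements: {p: {"total": t, "algorithmic": a, "software": s, "hardware": h}}
    Returns {component: [(p, share_of_total), ...]} ordered by p."""
    components = ("algorithmic", "software", "hardware")
    table = {c: [] for c in components}
    for p in sorted(measurements):
        m = measurements[p]
        for c in components:
            table[c].append((p, m[c] / m["total"]))
    return table

# Invented measurements (seconds) for 4, 8, and 16 processors.
data = {
    4:  {"total": 100.0, "algorithmic": 5.0, "software": 8.0,  "hardware": 12.0},
    8:  {"total": 60.0,  "algorithmic": 5.0, "software": 10.0, "hardware": 16.0},
    16: {"total": 45.0,  "algorithmic": 5.0, "software": 14.0, "hardware": 20.0},
}

for comp, growth in overhead_function(data).items():
    print(comp, [round(share, 3) for _, share in growth])
```

In this invented example the hardware interaction overhead grows from 12% to 44% of the execution time as processors are added, which is exactly the kind of trend an overhead function is meant to expose.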
Additional instrumentation can sometimes perturb the evaluation. Analytical models are often criticized for the unrealism and simplifying assumptions made in expressing the complex interaction between the application and the architecture. Simulation of realistic computer systems demands considerable resources in terms of space and time.

3 THE FRAMEWORK

[Figure 1: The Framework]

We have developed an evaluation framework that uses a combination of the three techniques to avoid some of their individual drawbacks. Experimentation is used to implement real-world applications on parallel machines, to understand their behavior and to extract interesting kernels (abstractions of applications that capture representative phases of the execution) that occur in them. These kernels are fed to an execution-driven simulator called SPASM which faithfully models the details of the parallel system interactions. The statistics that are drawn from the simulation are used to develop new analytical models or to validate and refine existing models. Simulation is used for detailed study of smaller systems in a non-intrusive manner. Analytical models are used to complement the simulation results to project the performance and overheads for larger systems (than those that can be simulated). When an analytical model is sufficiently validated/refined, it may be possible to use this model in the simulator itself to abstract details in the simulation to ease resource requirements. Using this approach, we have illustrated (Sivasubramaniam et al. 1995a) how the details of cache simulation and the details of interconnection network simulation may be abstracted by suitable models to gain substantial savings in the simulation time.

At the heart of our evaluation framework lies a simulation platform called SPASM which is used to identify, isolate and quantify the individual parallel system overheads.

SPASM is an execution-driven simulator written in CSIM used for simulating the execution of a parallel program on a parallel machine. As with other recent simulators, the bulk of the instructions in the parallel program is executed at the speed of the native processor (SPARC in our studies), and only instructions such as LOADs/STOREs on a shared memory platform, and SENDs/RECEIVEs on a message passing platform, that may potentially involve a network access are simulated. The rationale behind this approach is that since uniprocessor architecture is getting standardized with the advent of RISC technology, we can fix most of the processor characteristics (such as instruction sets, clocks per instruction, floating point capabilities, pipelining) by using a commodity processor as the baseline for each processor in our parallel system. A detailed simulation of the processor architecture is not likely to contribute significantly to our understanding of the scalability of a parallel system. The inputs to the simulator are parallel applications written in C. On a message passing system, the calls (SENDs/RECEIVEs) which trap to the simulator are inserted into the application program explicitly by the programmer. On a shared memory system, a pre-processor inserts code into the application program to trap to the simulator on a shared memory reference. On both systems, the compiled assembly code is augmented with cycle counting instructions which are used to keep track of the time spent in the application program since the last trap to the simulator. Finally, the assembled binary is linked with the rest of the simulator code.

A simulation platform like SPASM allows us to vary a wide range of hardware parameters such as the number of processors, the CPU clock speed, the network topology, the bandwidth of the links in the network, the network switching delays, and the cache parameters (the block size, cache size, associativity, etc.). SPASM gives a wide range of statistics that isolate and quantify the contribution of each parallel system overhead on the overall execution time of the application. Further, these overheads can be quantified for different phases of the execution, which can help in performance debugging for application restructuring and for suggesting architectural enhancements.

4 PROJECTS USING THE FRAMEWORK

We have used the above framework in a wide spectrum of architectural projects that are summarized below. Even though SPASM has been used to model and study message passing systems, the projects discussed here have used only its shared memory capabilities.
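Before turning to the individual projects, the trap-based simulation style of Section 3, in which ordinary computation merely advances a cycle counter while potentially network-bound references are simulated in detail, can be caricatured in a few lines. This is an illustrative toy, not SPASM's interface; the class name and latencies are invented.

```python
# Toy illustration of execution-driven simulation: "native" work only
# advances the simulated clock, while shared-memory references trap into
# the simulator, which charges cache-hit or network-miss latency.
# All names and cycle costs here are invented for illustration.

class ToySimulator:
    def __init__(self, hit_cycles=1, miss_cycles=40):
        self.clock = 0          # simulated cycles
        self.memory = {}        # simulated shared memory
        self.cache = set()      # addresses "cached" at this node
        self.hit_cycles = hit_cycles
        self.miss_cycles = miss_cycles

    def compute(self, cycles):
        # Work executed at native speed; only its cycle count is recorded.
        self.clock += cycles

    def shared_load(self, addr):
        # A reference that may involve the network is simulated in detail.
        if addr in self.cache:
            self.clock += self.hit_cycles
        else:
            self.clock += self.miss_cycles
            self.cache.add(addr)
        return self.memory.get(addr, 0)

    def shared_store(self, addr, value):
        self.memory[addr] = value
        self.cache.add(addr)
        self.clock += self.hit_cycles

sim = ToySimulator()
sim.compute(100)            # 100 cycles of local computation
sim.shared_store(0x10, 7)   # 1 cycle (write into local cache)
sim.shared_load(0x10)       # 1 cycle (hit)
sim.shared_load(0x20)       # 40 cycles (miss)
print(sim.clock)            # 142
```

In SPASM the equivalent of `compute` is the cycle-counting code injected into the compiled assembly, and the equivalent of the `shared_load`/`shared_store` traps is inserted by the pre-processor.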
4.1 Validating Abstractions

Abstracting features of parallel systems is a technique often employed in performance analysis and algorithm development. For instance, abstracting parallel machines by theoretical models like the PRAM has facilitated algorithm development and analysis. Such models try to hide hardware details from the programmer, providing a simplified view of the machine. Similarly, analytical models used in performance evaluation abstract complex system interactions with simple mathematical functions, parameterized by a limited number of degrees of freedom that are tractable. Abstractions are also useful in execution-driven simulators, where details of the hardware and the application can be captured by abstract models in order to ease the demands on resource (time and space) usage in simulating large parallel systems. Some simulators already abstract details of instruction-set simulation, since such a detailed simulation is not likely to contribute significantly to the performance analysis of parallel systems.

An important question that needs to be addressed in using abstractions is their validity. Our framework serves as a convenient vehicle for evaluating the accuracy of these abstractions using real applications. In (Sivasubramaniam et al. 1995a), we have illustrated the use of the framework to evaluate the validity and use of abstractions in simulating the interconnection network and locality properties of parallel systems. An outline of the evaluation strategy and results is presented below.

For abstracting the interconnection network, we have used the recently proposed LogP model, which incorporates the two defining characteristics of a network, namely latency and contention. For abstracting the locality properties of a parallel system, we have modeled a private cache at each processing node in the system to capture data locality. Shared memory machines with private caches usually employ a protocol to maintain coherence. With a diverse range of cache coherence protocols, our abstraction would become overly specific if it were to model any particular protocol. Further, memory references (locality) are largely dictated by application characteristics and are relatively independent of cache coherence protocols. Hence, instead of modeling any particular protocol, we have chosen to maintain the caches coherent in our abstraction but do not model the overheads associated with maintaining the coherence. Such an abstraction would represent an ideal coherent cache that captures the true inherent locality in an application. Furthermore, if our abstraction closely models the behavior of a machine with a simple cache coherence protocol, then it would even more closely model the behavior of a machine with a fancier cache coherence protocol.

We have used our simulation framework for evaluating these abstractions. We have compared the results from simulating the five applications on a machine incorporating these abstractions with the results from an exact simulation of the actual hardware. Our results show that the latency overhead modeled by LogP is fairly accurate. On the other hand, the contention overhead modeled by LogP can become pessimistic for some applications, since the model does not capture communication locality. The pessimism gets amplified as we move to networks with lower connectivity. With regard to data locality, results show that our ideal cache, which does not model any coherence protocol overheads, is a good abstraction for capturing locality over the chosen range of applications.

Apart from evaluating these abstractions in the context of real applications, the isolation and quantification of parallel system overheads has helped us validate the individual parameters used in each abstraction. The simulation of the system which incorporates these two abstractions is around 250-300% faster than the simulation of the actual machine. Using a similar approach, one may also use this framework to refine existing models (like reducing the pessimism in LogP in modeling contention), or even develop new models for accurately capturing parallel system behavior.

4.2 Synthesizing Network Requirements

For building a general-purpose parallel machine, it is essential to identify and quantify the architectural requirements necessary to assure good performance over a wide range of applications. Such a synthesis of requirements from an application viewpoint can help us make cost vs. performance trade-offs in important architectural design decisions. Our framework provides a convenient platform to study the impact of hardware parameters on application performance and use the results to project architectural requirements. We have conducted such a study in (Sivasubramaniam et al. 1995b) towards synthesizing the network requirements of the applications mentioned earlier, and the experimental strategy along with interesting results from our study are summarized here.

To quantify link bandwidth requirements for a particular network topology, we have simulated the execution of the applications on such a topology and varied the bandwidth of the links in the network.
As the bandwidth is increased, the network overheads (latency and contention) decrease, yielding a performance that is close to the ideal execution. From these results, we have arrived at link bandwidths that are needed to limit network overheads (latency and contention) to an acceptable level of the overall execution time. We have also studied the impact of the number of processors, the CPU clock speed and the application problem size on bandwidth requirements. The computation to communication ratio tends to decrease when the number of processors or the CPU clock speed is increased, making the network requirements more stringent. An increase in problem size improves the computation to communication ratio, lowering the bandwidth needed to maintain an acceptable efficiency. Using regression analysis and analytical techniques, we have extrapolated requirements for systems built with a larger number of processors.

The results from the study suggest that the existing link bandwidth of 200-300 MBytes/sec available on machines like the Intel Paragon and Cray T3D can easily sustain the requirements of some applications even on high-speed processors of the future. For the other applications studied, one may be able to maintain network overheads at an acceptable level if the problem size is increased commensurate with the processing speed.

The separation of the parallel system overheads plays an important role in synthesizing the communication requirements of applications. For instance, an application may have an algorithmic deficiency due to either a large serial part or work-imbalance, in which case 100% efficiency is impossible regardless of other architectural parameters. The separation of overheads enables us to quantify bandwidth requirements as a function of acceptable network overheads (latency and contention). The framework may also be used for synthesizing requirements of other architectural features, such as synchronization primitives and locality capabilities, from an application perspective.

4.3 Deriving Architectural Mechanisms

The single most important overhead limiting the performance of parallel applications is the communication overhead. One solution is to make the network as fast as possible, so that even though the application does not make any fewer network accesses, the overheads will not manifest as a significant component of the total execution time. But the resources to sustain the necessary bandwidth may simply not be available in some cases. A second approach is to reduce the network accesses incurred in the execution, or to tolerate the communication overhead if these accesses are unavoidable. Cache coherence protocols, weak memory consistency models, prefetch, poststore, and multithreading are some of the proposed latency reducing and tolerating techniques in the context of shared memory architectures. It has been shown that no one technique is universally applicable for all applications. On the other hand, a close examination of the communication behavior of a range of applications can help derive a set of architectural mechanisms that may prove beneficial, and we have conducted such a study in (Ramachandran et al. 1995) using our evaluation framework. By examining the communication properties of applications, we have proposed a set of explicit communication primitives that are generalizations of the poststore and prefetch mechanisms.

Cache coherence protocols broadly fall into two categories: write-invalidate and write-update. Invalidation-based schemes are more suited to migratory data and can become inefficient when the producer-consumer relationship for shared data remains relatively unchanged during the course of execution. On the other hand, update-based protocols can result in significant overheads due to repeated updates to the same data before they are used by another processor, as well as redundant updates when there are changes to the sharing pattern of a data item. The update and invalidation based schemes thus have their relative advantages and disadvantages, and based on application characteristics one may be preferable over the other. Invalidations are useful when an application changes its sharing pattern, and updates are useful to effect direct communication once a sharing pattern is established.

By examining the communication properties of a spectrum of applications, we have derived a set of explicit communication primitives that use sender-initiated communication within the context of an underlying invalidation-based protocol. The three proposed primitives intelligently propagate the data items to one or more consumers as soon as the data items are produced. The first primitive is intended for applications with static communication behavior, where the consumer set of a data item is available at compile time. As a result, this set can be directly supplied to the hardware when the data item is produced. The second primitive is intended for variables governed by locks, and it uses the lock structure to propagate data items to the processor next in line for the lock. The third primitive is for applications with dynamic communication behavior; it detects the arrival of a new consumer to a current sharing pattern, and uses this information to intelligently mix invalidates with updates.
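A back-of-the-envelope model conveys why sender-initiated updates help when a sharing pattern is stable: an invalidation-based protocol pays an invalidation plus a subsequent miss per consumer on every iteration, while a pushed update pays a single message. The message counts below are invented for illustration, not measurements from the study.

```python
# Invented per-iteration coherence-message counts for a stable
# producer-consumer pattern, comparing write-invalidate against a
# sender-initiated update of a known consumer set (Section 4.3).

def invalidate_cost(consumers, iterations):
    # Each iteration: one invalidation per cached consumer copy, then
    # one miss-and-fetch per consumer when it next reads the data.
    return iterations * (consumers + consumers)

def update_cost(consumers, iterations):
    # Sender-initiated: one update pushed to each consumer, after which
    # the consumers hit in their local caches.
    return iterations * consumers

for n in (1, 4, 8):
    print(n, invalidate_cost(n, 100), update_cost(n, 100))
```

The model also shows the flip side noted above: if the sharing pattern changes, pushed updates to departed consumers become redundant traffic, which is what motivates mixing invalidates with updates dynamically.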
The execution-driven simulation of real applications has played an important role in this exercise. It has helped us identify and isolate typical communication scenarios in applications and derive a set of mechanisms that can optimize these scenarios. It has also helped us evaluate the cost-effectiveness of these primitives, and the benefits of these primitives over alternate mechanisms. A related study (Shah, Singla, and Ramachandran 1995) develops a realistic model for a shared memory machine, and using SPASM shows that for a spectrum of applications almost all the inherent communication in them may be overlapped with computation. This serves as the motivation for further research in developing explicit sender-initiated communication mechanisms.

4.4 Evaluating Network Designs

The complex interaction between a parallel architecture and an application makes it essential to use realistic workloads for evaluating parallel systems. Performance analyses of processors, caches, memory and I/O subsystems have therefore been conducted with parallel benchmarks. However, unlike other subsystems, the design and analysis of the interconnection network, which is perhaps the most crucial hardware component in a parallel machine, has rarely used the knowledge of workloads generated by parallel applications.

There are two differing perspectives of viewing the multiprocessor interconnection network. From the viewpoint of a software designer or an application programmer, it helps to make certain simplifying assumptions about the interconnection network, such as assuming a constant delay or a simple model which does not take into account the details of message traversal within the actual network. These assumptions are sufficiently accurate when the objective is to minimize the communication required. By making these assumptions, performance evaluation of the system can be simplified and speeded up. Interconnection network designers have a more network-centric viewpoint. From this viewpoint, improving the network performance is critical. Network topology, switching mechanism, routing, flow control, and communication workload together determine the network performance. Until recently, network research has primarily focussed on the first four parameters to optimize network latency and throughput. Network designers have traditionally used synthetic benchmarks to evaluate their designs. At best, these benchmarks try to mimic some typical communication behavior in applications. The performance results derived from synthetic workloads can provide a general guideline or bounding values, while it may be difficult to make cost-performance architectural design decisions using these results.

SPASM is perhaps the first execution-driven simulator that has been used to integrate both these viewpoints into a single evaluation framework. It has been used extensively to study performance over a wide range of real applications and network parameters. For instance, in (Vaidya, Sivasubramaniam, and Das 1997a) we have used it to study the performance of a 2-dimensional mesh network for 5 shared memory applications. The specific aim in this study is to verify whether the promised performance improvement (for synthetic workloads) using recently proposed network enhancements, such as virtual channels and adaptive routing, is indeed obtained for real applications, and, if so, whether these benefits override the cost of providing these enhancements.

The performance results show that there is a modest performance benefit with these enhancements in the average network latency for the messages. However, with respect to the overall execution time, this improvement is dwarfed in comparison to the other components which constitute the execution time. When considered in the context of application scalability in terms of the number of processors and the problem size considered, even though many of the considered applications inject a large number of messages into the network, their arrival into the network does not seem to generate any significant contention for network resources. Consequently, virtual channels and adaptive routing algorithms, which attempt to lower the network contention and not the raw network latency, do not show substantial savings in execution time. Further, our results suggest that the performance rewards may not justify the cost of these enhancements unless an application is highly communication intensive and potentially scaling poorly. On the other hand, if any of these enhancements were to slow down the network router, then there is a significant degradation in performance.

This study (Vaidya, Sivasubramaniam, and Das 1997a) has served as the motivation for yet another project (Vaidya, Sivasubramaniam, and Das 1997b), where we are trying to develop better routers for interconnection networks. In this project, we have formalized a pipelined model for the network router, and we have evaluated the trade-offs between different router designs using our simulator. We have also proposed and evaluated dynamically adaptable selection functions within the router to route messages along less congested paths.
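The idea of a congestion-aware selection function is simple to sketch: among the output channels the routing function permits, prefer the one that is currently least congested. The channel names and occupancies below are invented for illustration and do not reflect the router model of the study.

```python
# Minimal sketch of a dynamically adaptable selection function: given
# the set of output channels permitted by the (adaptive) routing
# function, choose the channel with the shortest output queue.

def select_channel(permissible, queue_depth):
    """permissible: iterable of channel ids allowed by the routing
    function; queue_depth: dict mapping channel id -> flits queued.
    Returns the least-congested permissible channel."""
    return min(permissible, key=lambda ch: queue_depth[ch])

# Invented per-channel occupancy (flits) at a 2-D mesh router.
occupancy = {"x+": 7, "x-": 2, "y+": 0, "y-": 5}

# Adaptive routing toward (+x, +y) permits either dimension first.
print(select_channel(["x+", "y+"], occupancy))   # "y+"
```

In an actual router the congestion estimate would come from buffer or virtual-channel state rather than a dictionary, but the selection step itself reduces to this kind of comparison.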
mance of existing architectures and to design better destination distribution.
architectures in the future. The communication traf- The results obtained from the analysis of the ap-
ﬁc of a parallel application can be captured by three plication traces show that the inter-arrival times of
attributes namely the temporal, spatial and volume all applications except one can be ﬁtted to known
components. Temporal behavior is captured by the probability distribution functions, which are varia-
message generation rate, spatial behavior is expressed tions of exponential distribution. Also, the average
in terms of the message distribution or traﬃc pattern, message generation rate can be obtained for the un-
and volume of communication is speciﬁed by the num- derlying distribution. For the spectrum of applica-
ber of messages and the message length distribution. tions considered, the message generation distribution
These three attributes together deﬁne the communi- can be expressed in terms of exponential, hypoex-
cation workload and have been used extensively in ponential or weibull distributions. Our results also
many types of architectural evaluations. In particu- conﬁrm that the spatial distributions of parallel ap-
lar, one of the most extensively studied areas of re- plications can be captured mathematically. For the
search in parallel architectures is the interconnection applications considered, the spatial distributions are
networks. A plethora of network topologies that sup- uniform, bimodal uniform and univariate polynomial.
port various types of switching mechanism and mes- The sensitivity of these results to diﬀerent application
sage routing algorithms have been proposed to design and hardware parameters has also been studied. We
scalable parallel machines. Performance analyses of have found that only the means of the distributions
all these networks either via simulation or analysis change as we vary many of the parameters. These
require the above three communication attributes. results lead us closer to the belief that it is possible
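To make the three attributes concrete, the sketch below extracts the temporal, spatial and volume components from a small message trace. The trace format and field names are hypothetical illustrations, not SPASM's actual log layout.

```python
from collections import Counter

# Hypothetical trace records: (arrival_time_in_cycles, length_in_bytes, destination_node).
# A real simulator log would be far larger and richer than this toy example.
trace = [(10, 64, 1), (12, 64, 3), (19, 256, 1), (27, 64, 2), (30, 256, 3)]

def characterize(trace):
    times = [t for t, _, _ in trace]
    # Temporal component: average message generation rate (messages per cycle).
    duration = times[-1] - times[0]
    rate = (len(trace) - 1) / duration if duration else 0.0
    # Volume component: number of messages and the message length distribution.
    lengths = Counter(length for _, length, _ in trace)
    # Spatial component: destination (traffic-pattern) distribution.
    dests = Counter(dest for _, _, dest in trace)
    return rate, len(trace), lengths, dests

rate, count, lengths, dests = characterize(trace)
print(rate, count, dict(lengths), dict(dests))
```

Together, the returned quantities define the communication workload in the sense used above: a rate for temporal behavior, a destination histogram for spatial behavior, and message count plus length histogram for volume.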
In the previous subsection, we discussed two studies that examined the network for real applications using execution-driven simulation. But such a detailed simulation of the network makes the evaluation exceedingly slow. Mathematical models, on the other hand, do not suffer from this drawback. However, most of these models for interconnection networks have been accused of making unrealistic assumptions about the communication workload. It is not clear what traffic patterns are generated by parallel applications, or how these patterns can be captured by a distribution function for subsequent study. Therefore, the credibility of many model-based performance results has been questioned frequently.

It is thus crucial to develop formal techniques to capture the communication properties of parallel applications. The appeal of such a characterization is that its attributes are useful for many divergent studies: a system architect can use the communication information for better architectural design; an algorithm developer can use the communication cost for better algorithm design and analysis; and a system analyst can develop more accurate performance models using realistic workloads.

In (Chodnekar et al. 1997) and (Seed, Sivasubramaniam, and Das 1997), we have embarked on characterizing the communication traffic generated by a spectrum of applications using SPASM. We conduct a detailed execution-driven simulation on a chosen network configuration for each application. The network logs the arrival of messages along with the time of arrival, length and destination information. These logs are then presented to a statistical package (SAS) for regression analysis to calculate the message generation rate, the message length distribution, and the destination distribution.
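The fitting step can be sketched as follows. The studies cited above used SAS for the statistical analysis; SciPy is used here purely as an illustrative stand-in, and the arrival-time trace is synthetic rather than a real SPASM log.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a logged arrival-time trace (in cycles); in the real
# methodology these times come from the execution-driven simulation.
rng = np.random.default_rng(0)
arrivals = np.cumsum(rng.exponential(scale=5.0, size=2000))

# Temporal characterization: fit the inter-arrival times to an exponential
# distribution by maximum likelihood (location pinned at zero).
inter_arrival = np.diff(arrivals)
loc, scale = stats.expon.fit(inter_arrival, floc=0)
gen_rate = 1.0 / scale  # average message generation rate

# A goodness-of-fit test indicates whether the fitted form is an acceptable
# abstraction of the application's temporal behavior.
ks_stat, p_value = stats.kstest(inter_arrival, "expon", args=(loc, scale))
print(f"scale={scale:.2f} rate={gen_rate:.3f} KS={ks_stat:.3f}")
```

In practice one would try several candidate families (exponential, hypoexponential, Weibull) and keep the one with the best fit.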
The results obtained from the analysis of the application traces show that the inter-arrival times of all applications except one can be fitted to known probability distribution functions, which are variations of the exponential distribution. Also, the average message generation rate can be obtained for the underlying distribution. For the spectrum of applications considered, the message generation distribution can be expressed in terms of exponential, hypoexponential or Weibull distributions. Our results also confirm that the spatial distributions of parallel applications can be captured mathematically. For the applications considered, the spatial distributions are uniform, bimodal uniform and univariate polynomial. The sensitivity of these results to different application and hardware parameters has also been studied. We have found that only the means of the distributions change as we vary many of the parameters. These results lead us closer to the belief that it is possible to abstract the communication properties of parallel applications in convenient mathematical forms that have wide applicability.
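One payoff of such mathematical abstractions is that a fitted model can generate synthetic traffic to drive a network simulator or analytical model directly, without replaying the full application. The sketch below, with made-up parameters, draws exponential inter-arrival times and uniformly distributed destinations:

```python
import random

def synthetic_traffic(n_messages, mean_interarrival, n_nodes, msg_len, seed=42):
    """Generate synthetic messages from fitted distributions: exponential
    inter-arrival times and a uniform destination distribution. This is an
    illustrative sketch of using an abstracted workload, not the actual
    SPASM methodology; all parameter values here are hypothetical."""
    rng = random.Random(seed)
    t = 0.0
    msgs = []
    for _ in range(n_messages):
        t += rng.expovariate(1.0 / mean_interarrival)  # temporal component
        dest = rng.randrange(n_nodes)                  # uniform spatial component
        msgs.append((t, msg_len, dest))                # volume: count and length
    return msgs

msgs = synthetic_traffic(1000, mean_interarrival=5.0, n_nodes=16, msg_len=64)
```

Swapping in a hypoexponential or Weibull inter-arrival generator, or a bimodal destination distribution, requires changing only the two draws inside the loop.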
5 CONCLUDING REMARKS

Performance evaluation is an integral part of any systems design process: it serves to evaluate the cost-effectiveness of a given design, to compare different designs, and to derive alternate designs. This process is made particularly difficult for parallel systems, where the complex interaction between application and architecture introduces several more degrees of freedom than in their sequential counterparts.

Performance evaluation techniques should clearly isolate and quantify the different overheads in a parallel system execution that limit its scalability. Experimentation on the actual system, analytical modeling and simulation are three well-known techniques, but each has its own limitations. Execution-driven simulation offers the most promise because of its ability to study the parallel system accurately and in great detail in a non-intrusive manner. However, we need to confine ourselves to smaller systems with this technique, and complement the evaluation with mathematical models and experimentation to extrapolate performance for larger systems. In this paper, we have described one such simulator called SPASM that has been used extensively to study parallel architectures over a spectrum of applications. We have also briefly discussed five architectural projects that have used this simulator.

Recent trends show that a Network of Workstations (NOW) is a cost-effective solution for high performance computing. A wide range of architectural issues needs to be addressed if this platform is to become more prevalent. Our ongoing research is focusing on architectural projects towards this goal. We have recently implemented an execution-driven simulator called pSNOW (Kasbekar, Nagar, and Sivasubramaniam 1997) to specifically study hardware and system software issues for NOW platforms. We intend to use this simulator to design and evaluate architectural innovations concurrently with the development of an actual prototype in our laboratory.

ACKNOWLEDGMENTS

This research is supported in part by NSF Career Award MIP-9701475 and equipment grants from NSF and IBM.

REFERENCES

Chodnekar, S., V. Srinivasan, A. Vaidya, A. Sivasubramaniam, and C. Das. 1997. Towards a communication characterization methodology for parallel applications. In Proceedings of the Third International Symposium on High Performance Computer Architecture, 310–319.

Kasbekar, M., S. Nagar, and A. Sivasubramaniam. 1997. pSNOW: A tool to evaluate architectural issues for NOW environments. In Proceedings of the ACM 1997 International Conference on Supercomputing, 100–107.

Ramachandran, U., G. Shah, A. Sivasubramaniam, A. Singla, and I. Yanasak. 1995. Architectural mechanisms for explicit communication in shared memory multiprocessors. In Proceedings of Supercomputing '95.

Seed, D., A. Sivasubramaniam, and C. Das. 1997. Communication in parallel applications: Characterization and sensitivity analysis. To appear in Proceedings of the 1997 International Conference on Parallel Processing.

Shah, G., A. Singla, and U. Ramachandran. 1995. The quest for a zero overhead shared memory parallel machine. In Proceedings of the 1995 International Conference on Parallel Processing, 194–201.

Sivasubramaniam, A. 1997. Reducing the communication overhead of dynamic applications on shared memory multiprocessors. In Proceedings of the Third International Symposium on High Performance Computer Architecture, 194–203.

Sivasubramaniam, A., A. Singla, U. Ramachandran, and H. Venkateswaran. 1994. An approach to scalability study of shared memory parallel systems. In Proceedings of the ACM SIGMETRICS 1994 Conference on Measurement and Modeling of Computer Systems, 171–180.

Sivasubramaniam, A., A. Singla, U. Ramachandran, and H. Venkateswaran. 1995. Abstracting network characteristics and locality properties of parallel systems. In Proceedings of the First International Symposium on High Performance Computer Architecture, 54–63.

Sivasubramaniam, A., A. Singla, U. Ramachandran, and H. Venkateswaran. 1995. On characterizing bandwidth requirements of parallel applications. In Proceedings of the ACM SIGMETRICS 1995 Conference on Measurement and Modeling of Computer Systems, 198–207.

Vaidya, A., A. Sivasubramaniam, and C. Das. 1997. Performance benefits of virtual channels and adaptive routing: An application-driven study. In Proceedings of the ACM 1997 International Conference on Supercomputing, 140–147.

Vaidya, A., A. Sivasubramaniam, and C. Das. 1997. The PROUD pipelined routers for high performance networks. Technical Report CSE-97-007, Department of Computer Science and Engineering, The Pennsylvania State University.

AUTHOR BIOGRAPHY

ANAND SIVASUBRAMANIAM is an Assistant Professor in the Department of Computer Science and Engineering at The Pennsylvania State University. He received his B.Tech in Computer Science from the Indian Institute of Technology, Madras, in 1989, and the MS and Ph.D. degrees in Computer Science from the Georgia Institute of Technology in 1991 and 1995, respectively. His research interests are in architecture, operating systems, performance evaluation and application aspects of high performance computing.