Performance Comparisons

A small set of programs written in Fortran 77, Fortran 90 and C++
is compared on run-time performance. The Fortran 90 and C++ programs
are object-oriented and were derived from the original Fortran 77
programs. Simulations have been developed in one, two and three
dimensions. For illustration, we have selected test cases from a
two-stream instability experiment.

The benchmark code is a plasma particle-in-cell (PIC) code based on
the General Concurrent PIC algorithm [2]. The Fortran 77 codes have
been well benchmarked [1]. The Fortran 90 and C++ versions [3,4,5]
were designed from the original Fortran 77 codes.

IBM RS/6000 (AIX 4.1) Sequential Performance Comparison

Machine   Language     Compiler    Particles   Time (sec)

One-Dimensional Program
RS/6000   Fortran 77   IBM xlf       450,000       245.49
RS/6000   Fortran 90   IBM xlf90     450,000       364.25
RS/6000   C++          IBM xlC       450,000       508.00

Two-Dimensional Program
RS/6000   Fortran 90   IBM xlf90     327,680       526.71
RS/6000   Fortran 77   IBM xlf       327,680       549.23
RS/6000   C++          IBM xlC       327,680       667.00
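The timings in the RS/6000 table are easier to compare as ratios against the Fortran 77 run. The following is a minimal C++ sketch of that arithmetic; the names `relative_time`, `Timings`, `kOneDimensional` and `kTwoDimensional` are illustrative and do not come from the benchmark codes.

```cpp
#include <cassert>

// Run time of one language version relative to a baseline run time.
// A result above 1.0 means slower than the baseline; below 1.0 means faster.
double relative_time(double seconds, double baseline_seconds) {
    assert(baseline_seconds > 0.0);  // a zero baseline would make the ratio meaningless
    return seconds / baseline_seconds;
}

// Timings in seconds, copied from the RS/6000 sequential table.
// The struct and constant names are hypothetical, chosen for this sketch.
struct Timings {
    double fortran77;
    double fortran90;
    double cpp;
};

const Timings kOneDimensional = {245.49, 364.25, 508.00};
const Timings kTwoDimensional = {549.23, 526.71, 667.00};
```

With these values, `relative_time(kOneDimensional.fortran90, kOneDimensional.fortran77)` is about 1.48, while the same ratio for the two-dimensional program is about 0.96, so Fortran 90 is slower than Fortran 77 in one dimension and slightly faster in two.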

Calls to functions that access private data, made without inlining,
contributed to the Fortran 90 overhead in the one-dimensional program.
In the two-dimensional case, a different object model with better
abstractions allows the Fortran 90 program to outperform both the
Fortran 77 and C++ versions, as seen in the graph below.
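The call pattern behind this kind of overhead can be sketched in C++ as follows. This is an illustration only, not the authors' code: the class `Particles` and the function `push_via_accessors` are hypothetical, and the bulk `push` method is shown as one common way to avoid per-element calls, not necessarily the change made in the two-dimensional object model. C++ compilers will usually inline accessors this small, so the sketch shows the shape of the calls rather than a guaranteed slowdown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle container whose data is private.
class Particles {
public:
    explicit Particles(std::size_t n) : x_(n, 0.0) {}

    std::size_t size() const { return x_.size(); }

    // Per-element accessors: every read or write from outside is a call.
    double position(std::size_t i) const {
        assert(i < x_.size());
        return x_[i];
    }
    void set_position(std::size_t i, double x) {
        assert(i < x_.size());
        x_[i] = x;
    }

    // Bulk operation: one call, with the loop running next to the private data.
    void push(double dx) {
        for (double& x : x_) x += dx;
    }

private:
    std::vector<double> x_;
};

// Outside code updating particles one at a time through the accessors.
// If the accessors are not inlined, this costs two calls per particle.
void push_via_accessors(Particles& p, double dx) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        p.set_position(i, p.position(i) + dx);
    }
}
```

Both paths produce the same particle positions; they differ only in how many function calls cross the encapsulation boundary.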

IBM SP2 Parallel Performance Comparison

The table below shows performance comparisons for a two-dimensional
parallel Fortran 90 program using the MPI message-passing library,
alongside the Fortran 77 and C++ versions.

Machine   PEs   Language     Compiler    Particles   Time (sec)

Two-Dimensional Program
SP2        32   Fortran 77   IBM xlf     3,571,712       159.08
SP2        32   Fortran 90   IBM xlf90   3,571,712       202.88
SP2        32   C++          IBM xlC     3,571,712       359.00

Two-Dimensional Program
SP2         4   Fortran 77   IBM xlf       327,680       114.31
SP2         4   Fortran 90   IBM xlf90     327,680       117.49
SP2         4   C++          IBM xlC       327,680       249.00

Much more extensive performance comparisons are available in the
publications, including comparisons among various machines and
compilers from additional vendors. A plot of the 32-processor
experiment is shown below.

Performance results for a three-dimensional parallel Fortran 90
program using MPI are also available. Details of this work can be
found in [5].

Machine   PEs   Language     Compiler    Particles   Time (sec)

Three-Dimensional Program
SP2        32   Fortran 77   IBM xlf90   7,962,624      1548.71
SP2        32   Fortran 77   IBM xlf     7,962,624      1550.14
SP2        32   Fortran 90   IBM xlf90   7,962,624      1339.91
SP2        32   C++          IBM xlC     7,962,624      2797.00

The Fortran 90 version outperformed the Fortran 77 versions because
of improved cache utilization of field components. The Fortran 90 (and
C++) version encapsulates the components in a single derived type,
whereas the Fortran 77 version stores field elements in separate arrays.
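The two storage layouts can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the benchmark code: the three components `ex`, `ey`, `ez` and the names `SeparateArrays`, `FieldPoint`, `gather_separate` and `gather_grouped` are hypothetical. When all components at one grid point are read together, the grouped layout keeps them adjacent in memory, while the separate-array layout reads from three different arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All field components at one grid point, grouped in a single type
// (the analogue of a Fortran 90 derived type). Names are illustrative.
struct FieldPoint {
    double ex, ey, ez;
};

// Separate-array layout, as described for the Fortran 77 version:
// one array per component, so the components of one grid point live
// in three different memory regions.
struct SeparateArrays {
    std::vector<double> ex, ey, ez;
    explicit SeparateArrays(std::size_t n) : ex(n), ey(n), ez(n) {}
};

// Gather the field at grid point i from three separate arrays.
FieldPoint gather_separate(const SeparateArrays& f, std::size_t i) {
    assert(i < f.ex.size());
    return FieldPoint{f.ex[i], f.ey[i], f.ez[i]};
}

// Gather the field at grid point i from the grouped layout:
// the three components sit next to each other in memory.
FieldPoint gather_grouped(const std::vector<FieldPoint>& f, std::size_t i) {
    assert(i < f.size());
    return f[i];
}
```

Both gathers return the same values for the same field; only the memory access pattern differs, which is where the cache effect described above would come from.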

Comparison against the KAI Optimizing C++ Compiler

The chart below shows results for a 3D code on the Cornell SP,
recently upgraded with P2SC chips. The C++ code was compiled with
the KAI C++ compiler.

The most aggressive optimizations produced the fastest timings, and
these are the timings reported. The KAI C++ compiler with K3
-O3 --abstract_pointer spent over 2 hours in the compilation
process, whereas the IBM F90 compiler with -O3 -qlanglvl=90std
-qstrict -qalias=noaryovrlp took 5 minutes. (The KAI compiler
generated faster executables than the IBM xlC C++ compiler.)

Times in yellow use the -qarch=pwr2 -qtune=pwr2 hardware optimization
switches.