ARSC HPC Users' Newsletter 389, June 27, 2008

Contents

On the Hazards of Array Syntax (or How I Achieved Six Orders of Magnitude Speedup)
Handling Little Endian Files with XLF
Iceberg Farewell

On the Hazards of Array Syntax (or How I Achieved Six Orders of Magnitude Speedup)

[ By: Lee Higbie ]

For a few computational problems, the phenomenon being modeled is completely local: what happens at one point in the model does not depend on what goes on around it. To first order, a forest fire's burning depends on the weather and the fuel, but not on the fire's behavior a mile away, in another part of the forest. These programs, sometimes called embarrassingly parallel, are the easiest to parallelize for supercomputers because the model can easily be split across many cores.

Most programs require information from neighboring points, which means that the cores running the program must communicate. That communication is the major source of speed and programming bottlenecks for many parallelized programs. For example, the weather here depends not only on the local temperature, pressure, and wind speed, but also on their values in nearby areas. Wind and barometric pressure changes propagate rapidly compared to fires.

I received a program from an Outside source ("Outside" is Alaskan for "Lower 48"). It is a first-order, production forest fire model. We wanted to adapt it for the fires we have in Interior Alaska, fires that often account for more than half of the acreage burned in the entire US.

In the middle of this Fortran code, inside the main triply nested loop, is an innocuous-looking block of array-syntax statements:
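
Rather than reproduce the actual listing, here is a hypothetical sketch of its shape. Apart from the array name mc1 and the loop indices i and j, every name below is invented, the dimensions are made up (about 2.7M cells and 49 time steps, as described later), and an outer loop over the forecast time steps encloses the pair of loops shown:

   ! Hypothetical sketch only: emc and tlag stand in for the real
   ! variable names; the grid dimensions are illustrative.
   integer, parameter :: nlon = 1800, nlat = 1500, ntime = 49
   real, dimension(nlon, nlat, ntime) :: mc1, emc, tlag
   integer :: i, j

   do j = 1, nlat                      ! latitude
      do i = 1, nlon                   ! longitude
         ! ... roughly 100 lines of per-cell physics ...

         ! Whole-array syntax: each statement below touches every element
         ! of an nlon x nlat x ntime array, even though only cell (i,j)
         ! is of interest on this pass through the loops.
         mc1 = emc + (mc1 - emc) * exp(-1.0 / tlag)
         mc1 = min(mc1, 0.6)
      end do
   end do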

After changing the geometry to cover the larger region needed for Alaska, updating the grid to use 1 km cells, and changing to a projection more suitable for our latitude, I had a model grid with about 2.7M cells instead of the original 10K. In the course of modifying the program, we also went from 24 time steps to 49.

I ran the program. It bombed when it ran out of time, so I started a performance analysis. With gprof I quickly found that the main routine accounted for nearly all of the execution time. After inserting a few timing calls of the form:

   call cpu_time(t1)
   ! ... the statements being timed ...
   call cpu_time(t2)
   print *, 'cumulative time in code block: ', t2 - t1

I isolated the time to those simple lines above. Twice I had ignored them--no loop, nothing much going on. Right? Wrong!

I figured this change would stop thrashing the cache and reduce the stores to the array mc1 to one store per element. It sped up the loop nest by about a factor of three, pretty good for an hour's work, but each innermost loop iteration still took on the order of a second.
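
The details of that first change are not shown here, but the idea of one store per element is easy to illustrate with the hypothetical names from the sketch above: when mc1 appears on the left-hand side of several successive whole-array statements, every element is stored once per statement and often re-loaded in between, and folding the updates into a single assignment cuts that to one store per element. A sketch of the principle, not of the actual change:

   ! Before: two whole-array statements, so every element of mc1 is
   ! stored twice and re-loaded once in between.
   mc1 = emc + (mc1 - emc) * exp(-1.0 / tlag)
   mc1 = min(mc1, 0.6)

   ! After: one fused assignment, one store per element of mc1.
   mc1 = min(emc + (mc1 - emc) * exp(-1.0 / tlag), 0.6)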

But I kept thinking about the physics. The inner pair of loops was working on the mesh's 2.7M grid cells; why would any cell require more than a second's worth of computation? If you do the arithmetic, that works out to a couple of core-weeks of CPU time, still far from feasible for a production code that should run once or twice daily during the fire season. Even using all of Midnight's 2312 cores, it would take thousands of seconds. Embarrassing.

I reread the paper describing the fire model, the one the original program was based on, and decided that nearly all the work implied by the array statements was redundant. Each variable in both of the statement blocks above is triply subscripted for longitude, latitude and time. The innermost loop pair is also over longitude and latitude. Only the value for the current cell is being used in any iteration.

Conveniently, i and j are used for longitude and latitude. With the redundant calculations removed, the statement block becomes:
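
Continuing the hypothetical sketch from above (again, not the actual listing), the change amounts to making the i and j subscripts explicit so that each statement works only on the current cell's column of time steps:

   ! Hypothetical sketch: only cell (i,j) is touched; the remaining colon
   ! runs over the ntime time steps.
   mc1(i,j,:) = min(emc(i,j,:)                                        &
              + (mc1(i,j,:) - emc(i,j,:)) * exp(-1.0 / tlag(i,j,:)), 0.6)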

The innermost loop pair includes more than 100 lines of code, and roughly 396M loads and stores have been eliminated from each iteration of the pair: about 2.7M grid cells times 49 time steps is some 132M elements per array, and the statements reference a few such arrays. (The third subscript, the colon above, is over the time steps.)

Those 396 million extra loads and stores per iteration took so much time that the hundred-line loop pair now runs about a million times faster than the original. Embarrassment gone.

[Ed. Note: On the entire program, which performs extensive I/O, the
changes would save more than 100,000,000,000,000,000 stores and reduce
the run time by a factor of about 5000.]

Handling Little Endian Files with XLF

[ By: Don Bahls ]

By default, IBM systems such as iceberg generate big endian binary files. This can cause problems when you attempt to use binary files that were generated on a little endian system, such as Linux systems using AMD or Intel processors. Version 10 of the XLF compiler has a run-time option that allows Fortran unformatted I/O to use little endian files.
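
The option is ufmt_littleendian, set through the XLFRTEOPTS environment variable before the program runs. The sketch below shows the general idea; the file name, unit number, and array size are just examples, and the exact unit-list syntax should be checked against the XLF documentation on iceberg:

   ! Sketch: read a little endian unformatted file with XLF on iceberg.
   ! Before running, set the run-time option from the shell, for example:
   !
   !    export XLFRTEOPTS="ufmt_littleendian=10"
   !
   ! which asks the runtime to treat unformatted I/O on unit 10 as
   ! little endian.
   program read_le
      implicit none
      real :: a(1000)
      open(unit=10, file='little_endian.dat', form='unformatted', status='old')
      read(10) a
      close(10)
      print *, 'first value: ', a(1)
   end program read_le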

Iceberg Farewell

Back in 2004, iceberg became the first allocated IBM system at ARSC. It was one of the first systems that IBM deployed with the then-new Federation 2 switch. Over the last four years, iceberg has been a workhorse for HPCMP and academic users alike. In a little under a month, iceberg will be retired to make way for our latest allocated system.
