Fortran Programming Language

Speed versus memory

I am going through and updating a computational fluid dynamics solver from F77-style programming to "modern" Fortran (F95, not 2003, since the latter doesn't seem to be generally available). The code will be used to run very large jobs (upwards of 100 million grid points and many thousands of time steps) on massively parallel architectures.

My initial impulse was to use vector processes everywhere. To some extent this was driven by books such as "Numerical Recipes in F90" and "Fortran 90/95 Explained", which take the stance that do-loops introduce an artificial order to the processes (do i=1, then i=2, then i=3...) which limits a compiler's ability to generate fast executables. Algorithms written using vector operators are also generally less verbose, which I find makes them easier to read and check for bugs.

However, I've noticed that writing complex algorithms (such as an upwind quadratic interpolation routine) using vector notation requires an absurd amount of memory. Such a routine only has two input arrays and one output array but goes through ten intermediate arrays all of which must have space allocated for them when using vector processes. Even if the compiler splits the thread among a large number of processors I have found that the routine will use thousands of times more memory than the same routine written with do-loops (with ten continually redefined scalars). I can't imagine this is computationally fast and it's limiting the size of jobs I can run.
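To make the contrast concrete, here is a minimal sketch of the kind of thing I mean, written both ways (not my actual routine; the stencil, its coefficients, and all names are invented, and boundary handling is omitted):

! Array-notation version: every intermediate is a full-size array.
! (Hypothetical stencil for illustration only.)
subroutine update_array(u, f, n)
  implicit none
  integer, intent(in)  :: n
  real,    intent(in)  :: u(n)
  real,    intent(out) :: f(n)
  real :: du(n), slope(n), curv(n)   ! three of the intermediate arrays
  du(2:n)    = u(2:n) - u(1:n-1)
  slope(2:n) = 0.5 * du(2:n)
  curv(3:n)  = du(3:n) - du(2:n-1)
  f(3:n)     = u(3:n) + slope(3:n) - 0.125 * curv(3:n)
end subroutine update_array

! DO-loop version: the same intermediates are continually redefined scalars.
subroutine update_loop(u, f, n)
  implicit none
  integer, intent(in)  :: n
  real,    intent(in)  :: u(n)
  real,    intent(out) :: f(n)
  real    :: du, dum1, slope, curv
  integer :: i
  do i = 3, n
    du    = u(i)   - u(i-1)
    dum1  = u(i-1) - u(i-2)
    slope = 0.5 * du
    curv  = du - dum1
    f(i)  = u(i) + slope - 0.125 * curv
  end do
end subroutine update_loop

The first version carries three extra arrays of length n per call (my real routine carries ten); the second carries four scalars that a compiler will normally keep in registers.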

I have hacked together some methods using pointers and elemental routines that seem like they might be ideal. As I understand it, pointers don't really use up memory (just a few bytes per array), and it seems that intermediate variables within an elemental routine are allocated efficiently (they aren't blown up to the full array size unless the machine can operate simultaneously on all those components), but I have no idea how compiler- and platform-dependent that will be.
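For illustration, the elemental formulation I'm experimenting with looks roughly like this (again invented names, and a simpler stencil than my real one); the intermediates are per-invocation scalars, so nothing forces the compiler to expand them to full array size:

elemental subroutine update_point(um2, um1, u0, f)
  real, intent(in)  :: um2, um1, u0
  real, intent(out) :: f
  real :: du, dum1, slope, curv   ! scalar intermediates per element
  du    = u0  - um1
  dum1  = um1 - um2
  slope = 0.5 * du
  curv  = du - dum1
  f     = u0 + slope - 0.125 * curv
end subroutine update_point

Called with conformable array sections, e.g. call update_point(u(1:n-2), u(2:n-1), u(3:n), f(3:n)), it does the same work as the whole-array version. (It has to live in a module or have an interface block, since an elemental procedure needs an explicit interface at the call site.)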

I'm wondering if other people have any hard-and-fast rules or guiding principles in regards to the problem of speed versus memory allocation. Thanks.

I assume from context that by "vector processes" you mean what are more commonly called something like whole array operations. Your "vector processes" made me think of things like OpenMP and the like.

> To some extent this was driven by books such as "Numerical Recipes in
> F90" and "Fortran 90/95 Explained", which take the stance that do-loops
> introduce an artificial order to the processes (do i=1, then i=2, then
> i=3...) which limits a compiler's ability to generate fast executables.

Do those texts really say that? It has been quite a while since I looked at Numerical Recipes, but I know the authors of the "Explained" series and I wouldn't expect any of them to say quite that.

Generalizing about performance is *VERY* dangerous in many areas, specifically including this one. Counterexamples of almost any generalization can be found. That being said, my general impression is pretty much the opposite - Fortran compilers have been optimizing DO loops for over 5 decades now. They have gotten pretty darned good at it. Optimization of whole array expressions is a much newer field. It is done, but things about it are still being learned and working their way into practice.

> Algorithms written using vector operators are also generally less
> verbose, which I find makes them easier to read and check for bugs.

Sometimes. Plenty of counterexamples exist, but I think that is more often correct than not. In any case, it is at least often correct. I do think that, on the whole, that's a better reason than performance for choosing the form.

> However, I've noticed that writing complex algorithms...
> ... intermediate arrays all of which must have space allocated...

And that allocation/deallocation takes time independent of memory use issues.

> I have found that the routine will use thousands of times
> more memory than the same routine written with do-loops

Thousands of times sounds implausible. Certainly one can generate things arbitrarily bad, but I wouldn't expect a factor of thousands. This sounds to me like a memory leak... also known as a bug... rather than expected behavior. I couldn't say whether the bug is in your coding of the array expressions or in the compiler's implementation. (I have in the past seen compiler bugs with failing to deallocate compiler temporaries.) In either case, it sounds an awful lot like a bug.

>> To some extent this was driven by books such as "Numerical Recipes in
>> F90" and "Fortran 90/95 Explained", which take the stance that do-loops
>> introduce an artificial order to the processes (do i=1, then i=2, then
>> i=3...) which limits a compiler's ability to generate fast executables.
>
> Do those texts really say that? It has been quite a while since I looked
> at Numerical Recipes, but I know the authors of the "Explained" series
> and I wouldn't expect any of them to say quite that.
>
> Generalizing about performance is *VERY* dangerous in many areas,
> specifically including this one. Counterexamples of almost any
> generalization can be found. That being said, my general impression is
> pretty much the opposite - Fortran compilers have been optimizing DO
> loops for over 5 decades now. They have gotten pretty darned good at it.
> Optimization of whole array expressions is a much newer field. It is
> done, but things about it are still being learned and working their way
> into practice.

My understanding of the above is that DO loops work well for serial processors, but that array operations might work better for parallel processors. It should be easier to vectorize an array expression, for example. Still, it will depend much on how each one is actually written.

>> Algorithms written using vector operators are also generally less
>> verbose, which I find makes them easier to read and check for bugs.
>
> Sometimes. Plenty of counterexamples exist, but I think that is more
> often correct than not. In any case, it is at least often correct. I do
> think that, on the whole, that's a better reason than performance for
> choosing the form.

My feeling is that array operators are a good choice for simpler operations, and DO loops for more complex operations. Though that is mostly considering serial processors.

One reason is that array operations tend to need more temporary arrays. Also, sometimes the expression as written will require much more work even though it looks simpler mathematically.
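A standard illustration of both points (placeholder array, not code from this thread):

subroutine overlap_demo(a, n)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: a(n)
  integer :: i
  ! Array form: the right-hand side must behave as if fully evaluated
  ! before any element of a(2:n) is stored, so the compiler generally
  ! creates a temporary the size of the section (unless it can prove
  ! some traversal order is safe).
  a(2:n) = a(1:n-1) + a(2:n)
  ! A similar-looking DO loop needs no temporary, but note it is NOT
  ! equivalent: each iteration reads the already-updated a(i-1).
  do i = 2, n
    a(i) = a(i-1) + a(i)
  end do
end subroutine overlap_demo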

>> However, I've noticed that writing complex algorithms...
>> ... intermediate arrays all of which must have space allocated...
>
> And that allocation/deallocation takes time independent of memory use
> issues.
>
>> I have found that the routine will use thousands of times
>> more memory than the same routine written with do-loops
>
> Thousands of times sounds implausible. Certainly one can generate things
> arbitrarily bad, but I wouldn't expect a factor of thousands. This
> sounds to me like a memory leak... also known as a bug... rather than
> expected behavior. I couldn't say whether the bug is in your coding of
> the array expressions or in the compiler's implementation. (I have in
> the past seen compiler bugs with failing to deallocate compiler
> temporaries.) In either case, it sounds an awful lot like a bug.

If it is one very large temporary array it could easily be thousands of times more than needed for DO loops. If it is running on multiple processors, though, it shouldn't need so much for each one.

Richard already stated that performance optimisation of array notation statements is still developing and maturing.

Regarding your initial problem, I would suggest that you write the code using array notation because it's often more concise and (more importantly) makes it explicit whether execution order is relevant or not (the same goes for the use of FORALL and WHERE constructs).
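For example, both constructs below state outright that the elements may be processed in any order (variable names here are just placeholders):

subroutine order_free(rho, e, u, p, lap, n, gamma)
  implicit none
  integer, intent(in)  :: n
  real,    intent(in)  :: rho(n), e(n), u(n), gamma
  real,    intent(out) :: p(n), lap(n)
  integer :: i
  ! WHERE: a masked assignment with no implied element order.
  where (rho > 0.0)
    p = (gamma - 1.0) * rho * e
  elsewhere
    p = 0.0
  end where
  ! FORALL: index-parallel; every right-hand side is evaluated
  ! before any left-hand side is stored.
  lap(1) = 0.0
  lap(n) = 0.0
  forall (i = 2:n-1)
    lap(i) = u(i+1) - 2.0*u(i) + u(i-1)
  end forall
end subroutine order_free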

To optimise the serial part of your program, you will have to do performance measurements; if there are memory problems caused by inefficient array syntax, you can always rewrite the critical part in explicit DO-loop terms. Having clearly stated where execution order matters or not helps a great deal when you're parallelising the code.

Citing Hoare: "Premature optimisation is the root of all evil"

Sebastian

Thanks for all three responses. First off, I did mean to say array notation. Thanks for putting up with my poor Fortran vocabulary.

The point that compilers are more experienced at optimizing do-loops is a very interesting one. "Recipes" says many times that array notation and use of intrinsics allow compilers to make faster code. For example, from section 22.1: "In fact the serial code above [nested do-loop example] overspecifies the desired task... The essence of parallel programming is not to force 'into the time dimension' (i.e., to serialize) operations that naturally extend across a span of data..." And on page 1015: "One quick (if superficial) test of how much parallelism is achieved in a Fortran 90 code is to count its do-loops..." You read enough of these kinds of comments and you'll start thinking do-loops are slow as mud and should be avoided like the plague. "Fortran 90/95 Explained" isn't so damning, but they still tout array operations as the best thing since sliced bread.

And while my estimate of 1000 was just a back of the envelope thing it seems reasonable to me. If I've got 10 dummy arrays with more than 1000 components each (even after breaking the full grid over a large number of processors) then wouldn't that take more than 1000 times more memory than the serial version which just has 10 scalars?

Also, does anyone have any opinion on elemental routines? I've found that converting routines with array operations to elemental routines is straightforward, but I don't have enough experience to know if there's a consistent advantage in terms of memory or speed. (I've found that my upwind quadratic interpolation algorithm uses orders of magnitude less memory on my 250,000 grid point test case when written with pointer arrays passed to an elemental routine.)

Gabe <DrWeymo@gmail.com> wrote:

> On page 1015: "One quick (if superficial) test of how much parallelism
> is achieved in a Fortran 90 code is to count its do-loops..." You read
> enough of these kinds of comments and you'll start thinking do-loops
> are slow as mud and should be avoided like the plague.

Wow. That quote tells me a lot... but not about DO loops. It tells me more about the authors of NR. That's horrible advice. I've seen people follow advice like that. They take clear and efficient code using DO loops and turn it into convoluted and slow code using array operations. I'm not saying array operations are bad. Sometimes they are simpler and/or faster. But I have trouble expressing how horrible I think that advice is.

> And while my estimate of 1000 was just a back of the envelope thing it
> seems reasonable to me. If I've got 10 dummy arrays with more than
> 1000 components each (even after breaking the full grid over a large
> number of processors) then wouldn't that take more than 1000 times
> more memory than the serial version which just has 10 scalars?

No. Because there is no code around that is going to take only the storage of 10 scalar variables. With most current systems, that would be 40 bytes (maybe 80 if you are talking double precision). There is all kinds of overhead and other cost. Try to find an executable program that takes only 80 bytes.

Now that you detail it, I wonder why you are expressing this as a factor of the memory size anyway. Seems to me that absolute numbers would make more sense and tell you more. Is that actually the size of numbers you are talking about? Even in back-of-the-envelope terms? Is your increase of 1,000 times something like from 40 bytes to 40,000? If that is the case, then the answer to your question is simple: no, using 40,000 bytes will not have a measurable effect on the performance of any current system you are likely to have. If you are talking about the difference between 40,000 bytes and 40 million, that might start to get more interesting, though it still isn't likely to be a huge performance issue.

> Also, does anyone have any opinion on elemental routines? I've found
> that converting routines with array operations to elemental routines
> is straightforward, but I don't have enough experience to know if
> there's a consistent advantage in terms of memory or speed.

Elemental functions are convenient because they are generic and can reduce the amount of duplicative code in a program, but I have never seen evidence that they are faster than equivalent code with array arguments.

Beliavsky <beliav@aol.com> wrote:

> Elemental functions are convenient because they are generic...

That's not in general so. The ideas are similar in some senses, but elemental procedures aren't technically generic. You can make them generic if desired, but that's more or less independent of their being elemental.
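To illustrate with a made-up example: the ELEMENTAL attribute lets one function accept a scalar or an array of any rank, while accepting more than one type still takes a separate generic interface:

module cube_mod
  implicit none
  interface cube                    ! genericity: a separate feature
    module procedure cube_sp, cube_dp
  end interface
contains
  elemental function cube_sp(x) result(y)
    real, intent(in) :: x           ! elemental: scalar dummy, but the
    real             :: y           ! actual argument may be an array
    y = x**3
  end function cube_sp

  elemental function cube_dp(x) result(y)
    double precision, intent(in) :: x
    double precision             :: y
    y = x**3
  end function cube_dp
end module cube_mod

cube(2.0) and cube of a rank-3 default-real array both work because the specific functions are elemental; cube(2.0d0) resolving to the double precision version works because of the generic interface, not because of the elemental attribute.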

>> And while my estimate of 1000 was just a back of the envelope thing it
>> seems reasonable to me. If I've got 10 dummy arrays with more than
>> 1000 components each (even after breaking the full grid over a large
>> number of processors) then wouldn't that take more than 1000 times
>> more memory than the serial version which just has 10 scalars?

> No. Because there is no code around that is going to take only the
> storage of 10 scalar variables. With most current systems, that would
> be 40 bytes (maybe 80 if you are talking double precision). There is
> all kinds of overhead and other cost. Try to find an executable program
> that takes only 80 bytes.

I think there is a misunderstanding. I mean the amount of memory used for the temporary arrays will be ~1000 times larger than the amount of temporary memory required for the serial version. My full code uses around 15 full-sized variables, so the change in total memory requirements couldn't be more than 67% (10/15). Taking overhead (and other temporary arrays) into account we're probably talking less than 50%.

> Is that actually the size of numbers you are talking about? Even in
> back-of-the-envelope terms?

The jobs I need to eventually run will use tens of millions of grid points in double precision, so ten temporary arrays come to order 1e9 bytes (10 arrays x 1e7 points x 8 bytes). Again, this might be split up over ~100 processors, so ~1e7 bytes each.