Report on Experiences with Using SESP Tools and Code Conversion Issues

Section 1. Introduction

The Software Engineering Support Program (SESP) was formed in response to SLA Review Recommendation 7 (2004) that 'CSED should consider implementing minimum standards of design, documentation and testing to ensure that software distributed by Daresbury (and Rutherford Appleton) and used by the scientific community acquires a reputation of being of the highest quality'. High on the agenda of this recommendation is the need to address legacy code issues: a generic term for outdated software techniques not suited to code modularity, extensibility ('plug-n-play'), or general ease of maintenance.

In June 2006, members of CSED, including Graham Fletcher and Jens Thomas of the Computational Chemistry Group (CCG), met with Chris Greenough, head of SESP, to discuss how to move forward with the tools identified by SESP for use with CSED codes. The code of relevance to CCG here is GAMESS-UK. GAMESS-UK is a mixture of a number of different FORTRAN dialects (66, 77, 90 & 95), together with some C code. As a result, much of the code is "legacy code" that includes many non-standard language features, and there are also many "machine-specific" pieces of code that were written to take advantage of particular computer architectures that may be long since obsolete.

As GAMESS-UK is still being developed and new features added to it, we are interested in modernising the code to take advantage of new features of the FORTRAN language and remove obsolete and unreliable code. A suitable strategy was discussed for transforming GAMESS-UK which involves tools that can automatically update any code to F90 or F95 once strict compliance with F77 has been achieved. SESP has identified several tools, mostly from the Numerical Algorithms Group (NAG), that can assist in all these areas.

At over a million lines of code, however, the full code was deemed too complex to submit to the tools directly. In addition, GAMESS-UK makes heavy use of pre-processing to include or remove sections of the code for compilation, and none of the available tools is able to cope with code containing pre-processor directives. A smaller "kernel" program was therefore written, called SCFK, which contains some of the core code from GAMESS-UK and uses similar programming practices to the original code. SCFK computes a self-consistent field wave function so that any alterations can be checked by the correct execution of the program for an example case.

This kernel program was used to test the software transformation tools to see how feasible it would be to use them on GAMESS-UK itself.

Section 2. Experiences Using the Tools.

Here we provide a step-by-step account of using various NAG tools on SCFK. Details on what each tool does can be found on the SESP web site, http://www.sesp.cse.clrc.ac.uk/ .

Starting with the NAG tools, we first tried nag_pfort, which
reported an error (rather than a warning) wherever the "!" symbol
was used to begin a comment.

Next we found nag_decs to be useful and understandable except for
a bug where DFLOAT gets declared as EXTERNAL rather than INTRINSIC
under the -generify flag.

At first we tried to use spag from the plusFort tools. This didn't
work as it didn't appear to be able to deal with common blocks
where the size of an array was specified in a parameter
statement.

We therefore tried to use the NAG tools, starting with
nag_pfort. The problem with the f90 "!" syntax for a comment was
only a minor annoyance, as a line of sed was able to fix this. The
tool was then able to detect a number of minor errors with unused
common blocks and ordering of statements that didn't adhere to the
Fortran standard.
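
The kind of substitution described above can be sketched in a few lines of Python. This is not the sed command that was actually used (the report does not record it); the function names find_bang and bang_to_c_comments are hypothetical. The sketch assumes fixed-form source, turns a whole-line "!" comment into a "C" comment line, and moves a trailing "!" comment onto its own "C" line ahead of the statement. It leaves a "!" alone when it sits inside a character literal or in column 6, where it acts as a continuation mark.

```python
def find_bang(line):
    """Return the index of a comment-starting '!' in a fixed-form line, or None."""
    # Lines that are already F77 comments need no change.
    if line[:1] in ("C", "c", "*"):
        return None
    quote = None
    for i, ch in enumerate(line):
        if quote:
            # Inside a character literal: only the closing quote matters.
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "!" and i != 5:
            # Index 5 is column 6, where '!' is a continuation mark.
            return i
    return None


def bang_to_c_comments(source):
    """Rewrite '!' comments as 'C' comment lines (assumes fixed-form source)."""
    out = []
    for line in source.splitlines():
        pos = find_bang(line)
        if pos is None:
            out.append(line)
            continue
        code = line[:pos].rstrip()
        comment = line[pos + 1:].strip()
        # The comment goes on its own line, ahead of any statement text.
        out.append(("C " + comment).rstrip())
        if code:
            out.append(code)
    return "\n".join(out)


sample = "      X = 1.0   ! set X\n! full line\n      S = 'a!b'\n     !  + 2"
print(bang_to_c_comments(sample))
```

Placing the comment before the statement, rather than after it, keeps a trailing comment on an initial line from ending up between that line and a label or statement it was not describing; either placement is a judgment call.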

The tool was only able to get through its first
pass, but then failed on the second (where it checks things between
subroutines) due to the way the memory allocation works in most
electronic structure codes. As you probably know, in GAMESS-UK and
its ilk, we allocate a chunk of memory using doubles and then pass
this into subroutines sometimes as doubles, sometimes as integers,
which seriously confuses the tools.
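
The effect can be illustrated outside Fortran. The sketch below is only a Python analogue of the practice, not code from GAMESS-UK: one buffer of 8-byte reals is viewed both as doubles and as native integers, so the same storage has two types depending on who is looking at it. The names pool, as_double and as_int are hypothetical. A tool that checks argument types across subroutine boundaries sees exactly this kind of disagreement.

```python
from array import array

# One pool of 8-byte reals, standing in for the large DOUBLE PRECISION
# work array that the driver routines carve up.
pool = array("d", [0.0] * 4)

# Two views of the same bytes: one typed as doubles, one as native integers.
as_double = memoryview(pool)
as_int = as_double.cast("B").cast("i")

# A routine that "receives integers" writes into the first slot of the pool...
as_int[0] = 7
as_int[1] = 9

# ...while another routine keeps using the next slot as a double.
as_double[1] = 3.5

# How many integers share the storage of one double on this platform.
ints_per_double = as_double.itemsize // as_int.itemsize

print(len(pool), len(as_int), ints_per_double)
print(pool[1], as_int[0], as_int[1])
```

Nothing here is wrong at run time, which is why the practice survived; it is the static view of the program that becomes inconsistent.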

It would be interesting to
know whether any tool can deal with this, or whether the memory
allocation stuff would have to be re-written in order to be able to
take the code into F90 land (which would be a major task for
GAMESS-UK, which is ultimately where we'd like to go with
this).

We then moved on to look at nag_decs, running it with the
options: nag_decs -declare -ardicb -generify. We then hit the
problem with DFLOAT already mentioned. There was also an identical
problem with the LOC intrinsic, so we had to edit our own source
file to change both of these ourselves.

The next problem that we hit was that the tool declared data types
for various quantities that had been implicitly declared, but with
their values specified in parameter statements. The declarations
were placed after the parameter statements and this caused the
compiler to complain that there were two declarations. Once this
had been fixed by hand, the file finally compiled.
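
A rough check for this ordering problem can be sketched in Python. The function name typed_after_parameter is hypothetical, and the sketch makes several simplifying assumptions: fixed-form source, one statement per line with no continuation lines, and PARAMETER values that contain no commas. It reports named constants whose type statement appears after the PARAMETER statement that defines them, which is the situation described above.

```python
import re

PARAM = re.compile(r"^\s*PARAMETER\s*\((.*)\)\s*$", re.IGNORECASE)
TYPED = re.compile(
    r"^\s*(?:INTEGER|REAL|DOUBLE\s+PRECISION|LOGICAL|COMPLEX)\b(.*)$",
    re.IGNORECASE,
)
NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def typed_after_parameter(source):
    """List constants that are typed only after their PARAMETER statement."""
    constants = set()
    late = []
    for line in source.splitlines():
        if line[:1] in ("C", "c", "*", "!"):
            continue  # comment line
        stmt = line[6:72]  # statement field, columns 7 to 72
        match = PARAM.match(stmt)
        if match:
            # Assumes no commas inside the constant expressions themselves.
            for item in match.group(1).split(","):
                constants.add(item.split("=")[0].strip().upper())
            continue
        match = TYPED.match(stmt)
        if match:
            # Drop array bounds so WORK(MAXAT) does not count as typing MAXAT.
            names = re.sub(r"\([^()]*\)", "", match.group(1))
            for name in NAME.findall(names):
                upper = name.upper()
                if upper in constants and upper not in late:
                    late.append(upper)
    return late


sample = (
    "      PARAMETER (MAXAT=100, TOL=1.0D-8)\n"
    "      INTEGER MAXAT\n"
    "      DOUBLE PRECISION TOL, WORK(MAXAT)\n"
    "      INTEGER N"
)
print(typed_after_parameter(sample))
```

The manual fix amounts to moving each reported type statement ahead of the PARAMETER statement.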

Responses from Chris Greenough and David Worth are as follows:

It is clear that the code is an interesting mixture of a variety of
Fortrans.

Comments: "!" is not Fortran 77, so one might expect nag_pfort, an
f77 tool, to complain. As Jens says, this is easy enough to fix, but
there might be problems if the comments are appended to the ends of
executable statements.

DFLOAT & LOC: neither of these is in the Fortran 77 standard as an
intrinsic function, so nag_decs will treat them as EXTERNALs. All
references to DFLOAT should be replaced by DBLE, which is both a
Fortran 77 and Fortran 95 intrinsic. LOC is an aberration: could
you send us a short section of code showing how you use it? It
clearly needs replacing with some form of interface to a suitable
f90 feature; which one, I don't know yet.
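
The DFLOAT replacement is a plain string substitution and can be sketched as follows. The function name dfloat_to_dble is hypothetical. The sketch assumes fixed-form source in which DFLOAT is written without embedded blanks, and it does not try to protect character literals or to remove an existing EXTERNAL DFLOAT declaration, both of which would need checking by hand. Because DBLE is shorter than DFLOAT, the substitution cannot push a statement past column 72.

```python
import re

# Whole-word match, so names such as XDFLOAT or DFLOAT2 are left alone.
_DFLOAT = re.compile(r"\bDFLOAT\b", re.IGNORECASE)


def dfloat_to_dble(line):
    """Replace DFLOAT with the standard DBLE intrinsic in one source line."""
    # Leave comment lines untouched.
    if line[:1] in ("C", "c", "*", "!"):
        return line
    return _DFLOAT.sub("DBLE", line)


print(dfloat_to_dble("      X = DFLOAT(N) + dfloat(M)"))
```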

spag and COMMON blocks: spag appeared to be fine with the test
program - see test_nag_decs.f and test_nag_decs.spg. Are we missing
something?

nag_pfort: this checks against a slightly restricted subset of f77
that was thought to be portable, so it is stricter than f77. An
example output would be useful. Most tools require reasonable
input to start with. The memory management tricks are common in
Fortran 77. Tools would need to reduce their type checking level
for these constructs to pass through. Converting to Fortran
90/95 memory management is really the only sensible option;
however, as you say, it is time consuming.

nag_decs: The -declare option only adds declarations for implicit
variables but doesn't check for declarations via parameters. I
would call this a bug in the tool. Without this flag all the
declaration statements are rewritten and there is no problem. See
test_nag_decs_new.f. DFLOAT and LOC are declared as EXTERNAL as
you might expect.

In terms of what to do next: clearly the idea is for you to use
the tools in a process of slowly transforming the code to f95. As
we might expect, the tools are not totally automatic. There still
needs to be significant investment of people time.

Section 3. Code Conversion Activities.

Compliance of SCFK with F77 was also monitored by examining the output of the program FORCHECK. Based on the output of FORCHECK, the following types of conversion were found necessary:

change DFLOAT to DBLE

replace REAL*8 with DOUBLE PRECISION

'&' as continuation replaced by '*'

names reduced to ≤6 chars

DO ... END DO replaced with DO # ... CONTINUE

DO 50 WHILE ... replaced with DO 50 ...IF...

underscores removed from names

added IMPLICIT DOUBLE PRECISION in subroutine RHFBLD

reordered declarations in subroutine INTS1E

remove unused variables

IMPLICIT NONE removed

'!' comments have been changed to 'C'

remove declaration of intrinsic functions (e.g. LOC)

I=X2 replaced with I=NINT(X2) in subroutine FFUN

TEMP replaced with TEMP(1) in subroutine SRTORB
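
Two of the items above, names longer than six characters and names containing underscores, lend themselves to a simple lexical check before running a tool such as FORCHECK. The sketch below is not FORCHECK and does not reproduce its output; the function name nonstandard_names is hypothetical and the keyword list is deliberately partial. It assumes fixed-form source and looks only at the statement field in columns 7 to 72.

```python
import re

# Fortran keywords longer than six characters (a partial list, assumed
# sufficient for illustration) that must not be reported as names.
KEYWORDS = {
    "SUBROUTINE", "FUNCTION", "PROGRAM", "CONTINUE", "DIMENSION",
    "PRECISION", "PARAMETER", "IMPLICIT", "EXTERNAL", "INTRINSIC",
    "EQUIVALENCE", "CHARACTER", "COMPLEX", "LOGICAL", "INTEGER",
    "BACKSPACE", "ENDFILE", "INQUIRE", "BLOCKDATA",
}
NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
STRING = re.compile(r"'[^']*'")


def nonstandard_names(source):
    """Return names that break the strict F77 rule of at most six
    alphanumeric characters."""
    found = set()
    for line in source.splitlines():
        if line[:1] in ("C", "c", "*", "!"):
            continue  # comment line
        # Blank out character literals so their text is not scanned.
        code = STRING.sub("''", line[6:72])
        for name in NAME.findall(code):
            upper = name.upper()
            if upper in KEYWORDS:
                continue
            if len(upper) > 6 or "_" in upper:
                found.add(upper)
    return sorted(found)


sample = (
    "      SUBROUTINE BLDFOCK(N_BAS, DENSITY)\n"
    "C     LONGCOMMENTWORD\n"
    "      DOUBLE PRECISION DENSITY(N_BAS)\n"
    "      WRITE(6,*) 'VERYLONGTEXT'\n"
    "      END"
)
print(nonstandard_names(sample))
```

Choosing the replacement names is still a manual step, since shortened names must remain unique across the whole program.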

Having completed the above conversions, the output from FORCHECK given below is obtained:

From the above output it can be seen that the remaining error messages are associated with coding practices such as the re-dimensioning of COMMON block arrays and the re-typing of variables between subroutine call and definition statements. These practices, which have become widespread in scientific computing, were traditionally regarded as essential measures to overcome serious limitations in strict F77.

Together with the experiences of two other groups maintaining CSED codes (PDVR3D, from Jonathan Tennyson's group, and THOR (?)), the following conclusions were drawn:

The tools encounter difficulties with very mixed dialect
programs.

The process of transforming the code to F95 is expected to be
slow.

There still needs to be significant investment of people
time.

The tools rely on being given a reasonable starting point. However,
many of CSED's major codes are not a good starting point.

Most of the tools are very strict on language conformance and this
leads to a large number of trivial errors which might be better
highlighted as warnings.

Some of the constructs being used are from the dim and distant past
and rely on dubious side effects in the language.

The scale of CSED applications presents a problem in
itself. Although the tools can process large quantities of code, the
output can be vast.

The simple task of assessing a code's conformance to the language
standard is non-trivial and requires significant time and input by
the developer.

The long-term benefits in terms of ease of maintenance and ease of future development will only become apparent over years, not months.

Section 4. Legacy Practices.

So far, two general areas can be identified over which the conversion of scientific programs to strict F77 can encounter difficulties.

1. 'Non-standardizations'

These include the use of non-standard FORTRAN features (some of
which were mentioned above) and coding 'techniques' employed by
programmers to circumvent the limitations of the language.

Some of the above issues are discussed in 'Features and Things to
Avoid in F95!', L. S. Chin, C. Greenough, and D. J. Worth, Software
Engineering Group Note SEG-N-003.

2. Traditional problem areas in F77

There are numerous coding practices fully supported in F77 that are
usually seen as unhelpful in the quest for greater modularity and
extensibility. These include the use of 'GO TO's, alternate RETURN's,
multiple loops on the same CONTINUE line, IMPLICIT declarations, and
functions appearing indistinguishable from array elements, etc. Many
of these can be addressed following conversion of the
fully-F77-compliant code to f95 using the available tools (e.g. nag_cbm95,
nag_chname95, nag_decs95, nag_struct95, etc). However, at least two
legacy practices related to the use of COMMON blocks present a
potentially tougher problem in the automatic conversion of code. These
are:

Re-naming of COMMON block variables

Re-dimensioning of COMMON block arrays

With luck, the majority of legacy practices could be addressed by
string-substitution, while traditional techniques for simulating
dynamic memory (over-indexing plus data type redefinition) could be
accomplished in a series of incremental transformations. Re-naming
COMMON block variables may affect a relatively minor proportion of
many codes and could be moderately automatable (see below), though
choices may be required. However, since the re-dimensioning of arrays
between different subroutines impacts the loop structures, automatic
tools would need to handle modifications to the code semantics of
those subroutines.

Below, the scale of such efforts applied to a widely used and
fairly typical scientific software package in this arena is
indicated.

Example package: GAMESS (US version)

In GAMESS, only a handful of key COMMON blocks are re-defined with
regard to their contents, and there is some minor use of pre-processing
and non-standard naming. GAMESS maintains the 'dynamic memory' pool
in an array simply called 'X' which is partitioned in driver routines
before passing to subroutines that label the workspace more
conveniently and do the computation. Therefore the occurrence of this
array passed with an integer offset, or address, such as X(IPOINT),
serves as a rough indicator of the scale of the problem associated
with measures to simulate dynamic memory in older codes.

Line type                       Approximate number of lines (thousands)

Entire                          770
Comments                        70
Executable                      700
Occurrences of over-indexing    40
Percentage                      6

Thus, the bulk of legacy coding practices in GAMESS that may demand
techniques currently unavailable are accounted for by the occurrence
of 'X(I)' type statements, and these comprise somewhat less than 10%
of the entire code.
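
The report does not say how the occurrences were counted. One plausible approach, sketched here in Python with the hypothetical names count_over_indexing and percentage, is to count non-comment lines containing the work array X indexed by a single integer name, and then express that count as a share of the executable lines. The pattern is loose and would also match an ordinary reference to X(I) inside a loop, so it gives an indicator rather than an exact figure. The rounded value of 6 in the table is consistent with dividing by the executable line count; that reading is an inference from the table, not something the report states.

```python
import re

# The work array X indexed by one integer name, e.g. X(IPOINT).
OVER_INDEX = re.compile(r"\bX\(\s*[A-Z][A-Z0-9]*\s*\)", re.IGNORECASE)


def count_over_indexing(source):
    """Count non-comment lines that reference X(<integer name>)."""
    count = 0
    for line in source.splitlines():
        if line[:1] in ("C", "c", "*", "!"):
            continue  # comment line
        if OVER_INDEX.search(line):
            count += 1
    return count


def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Figures from the table, in thousands of lines.
executable_klines = 700
over_indexed_klines = 40
share = percentage(over_indexed_klines, executable_klines)
print(round(share, 1))

sample = (
    "      CALL HSTAR(X(LDEN), X(LFOCK), N)\n"
    "C     X(IPOINT) in a comment\n"
    "      Y = MAX(I) + 1\n"
    "      CALL JKINT(X(LXINT))"
)
print(count_over_indexing(sample))
```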

While it may be feasible to consider converting key functionality
in a piecemeal fashion, the task of converting entire packages such as
those found in CSED would demand significant investment in
resources.

Conclusion.

Many scientific packages, such as those used in CSE, are probably
too complex to be readily converted or analysed by the software tools
available to SESP. For instance, for codes that rely on pre-processing
to include or remove sections of the code for compilation, none of the
available tools is able to cope with the pre-processor
directives. Furthermore, it is hard to imagine such techniques as the
simulation of dynamic memory, the renaming of common block variables,
the re-dimensioning of common block arrays, or indeed many of the
numerous legacy practices that have been employed by scientists over
the decades to circumvent the limitations of the languages then
available, ever being converted in an automatic way to a more modern
programming paradigm without significant investment in human time.

For more information about the work of the Computational Chemistry Group please contact
Paul Sherwood p.sherwood@dl.ac.uk.