TriDenT CoLUMBO

Trees in HPF2

My research team was at the University of Bordeaux I, in the LaBRI laboratory (Laboratoire Bordelais de Recherche en Informatique, the Bordeaux laboratory for computer-science research). Our work began with the study of a new way to describe irregular data structures in HPF. The method is based on a tree-like formulation of the data structures, using the derived types of Fortran 90/95 and the approved extensions of HPF 2.0. These notations make it possible to remove the indirect accesses that are unavoidable when writing classic irregular data structures such as CCS (Compressed Column Storage). Indirect accesses prevent the compiler from generating efficient parallel code, because the values of the indirect indices are unknown at compile time.
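To illustrate why such indirect accesses defeat compile-time analysis, here is a minimal sketch (in Python, not from the original work) of a matrix-vector product over a CCS-stored matrix; the function name and the toy matrix are ours.

```python
def ccs_matvec(n, col_ptr, row_idx, values, x):
    """y = A @ x for an n x n matrix stored column-wise in CCS.

    col_ptr[j]..col_ptr[j+1] delimits column j; row_idx holds the row of
    each stored value. The write y[row_idx[k]] is *indirect*: its target
    is unknown until run time, which blocks static parallelization.
    """
    y = [0.0] * n
    for j in range(n):
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * x[j]   # indirect write
    return y

# 3x3 example:  [[1, 0, 2],
#                [0, 3, 0],
#                [0, 4, 5]]
col_ptr = [0, 1, 3, 5]
row_idx = [0, 1, 2, 0, 2]
values  = [1.0, 3.0, 4.0, 2.0, 5.0]
print(ccs_matvec(3, col_ptr, row_idx, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

The tree notation replaces exactly this kind of run-time indirection by structure that is visible to the compiler.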

The following example shows a sparse matrix with column-oriented accesses, its representation as a tree, and its formulation in HPF 2.0 together with its distribution.

We also propose extending the distribution possibilities to trees, and adding a new directive, TREE, to the HPF language: this directive specifies a restricted use of pointer notations and derived types, restrictions that guarantee an actual tree-shaped data structure whose depth is known at compile time.

These restrictions, together with the additions for distribution, allow an efficient implementation of derived types by the compiler, as well as compile-time optimizations based on regular-case dependence and locality analyses. Such analyses are not practicable with the standard irregular formats.

Inspection-Execution

Despite this improvement in compile-time analysis, some information remains unknown at compilation. The parallel properties of these applications therefore cannot be exploited at compile time, and the inspection-execution (inspector-executor) paradigm must be used.

The compiler generates code containing two phases. The first phase, the inspection, scans the information that becomes known only at run time in order to produce, for the second phase, the execution, optimized plans both for the communications and for the distribution of the computations. We used this paradigm for several properties that had not been studied in previous work. We were notably interested in loops with partial inter-iteration dependences (LPIID; BDIIP in French). Note that our inspector-executor differs markedly from the classic approaches: in our study, the inspection must pay for itself over a single execution phase, whereas the other approaches operate in an iterative context, that is, with several execution phases that are identical in terms of accesses and communications.
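The two phases can be sketched as follows; `inspect`, `execute`, the index arrays, and the toy data are hypothetical illustrations of the pattern, not the generated code.

```python
def inspect(n, index, nb):
    """Inspection: precompute, for each iteration i, the list of source
    indices j it actually references (known only at run time)."""
    return [[index[i][j1] for j1 in range(nb[i])] for i in range(n)]

def execute(plan, a, b, f):
    """Execution: run the computation using the precomputed plan, with no
    indirection left to resolve in the inner loop."""
    for i, sources in enumerate(plan):
        for j in sources:
            a[i] += b[j]
        a[i] = f(a[i])
    return a

index = [[0], [0, 1], [1]]        # run-time-only indirection data
nb = [1, 2, 1]
plan = inspect(3, index, nb)      # inspection: paid once ...
a = execute(plan, [0.0] * 3, [1.0, 2.0, 3.0], lambda x: 2 * x)
print(a)  # [2.0, 6.0, 4.0]       # ... execution: used once
```

In the classic iterative setting the cost of `inspect` is amortized over many identical executions; in our setting it must pay off within the single call to `execute`.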

Irregular sets of processors

Irregular loop indexes

Inter- and extra-iteration communications (LPIID)

Scheduling of the DAG associated with an LPIID

Progressive Irregular Prefix Operations

We briefly present each of these inspector-executors, giving for each its nature, its status in HPF 2.0, and an example:

Irregular sets of processors

HPF 2.0: not supported

DO I = 1, N
!HPF$ ON HOME (A(J), ANY(INDEX(I,1:NB(I)) == J) .OR. J == I), BEGIN
!HPF$ INDEPENDENT, REDUCTION(A(I))
   DO J1 = 1, NB(I)
      J = INDEX(I,J1)
      A(I) = A(I) + B(J)
   END DO
   A(I) = f(A(I))
!HPF$ END ON
END DO

The extended directive ON HOME (containing an algorithmic expression of FORALL type) allows an irregular set of processors to be declared: the reduction operation then involves only this subset, reducing the implied synchronizations.

Irregular Loop Indexes

HPF 2.0: not specified

DO I = 1, N
!HPF$ INDEPENDENT, REDUCTION(A(I))
   FORALL (J = 1:I, ANY(INDEX(I,1:NB(I)) == J))
      A(I) = A(I) + B(J)
   END FORALL
   A(I) = f(A(I))
END DO

The loop-index inspector precomputes the valid indexes for every processor (here, from the FORALL expression). In some cases this reduces the algorithmic cost implied by evaluating the guards at every iteration.
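What the inspector saves can be sketched like this (hypothetical Python, 0-based indices): the guarded version re-tests the FORALL mask for every candidate J, while the inspected version walks the valid indexes directly.

```python
def valid_js_guarded(i, index, nb):
    """Naive version: evaluate the guard ANY(INDEX(I,1:NB(I)) = J)
    for every candidate J in the FORALL range (0..i here)."""
    return [j for j in range(i + 1)
            if any(index[i][j1] == j for j1 in range(nb[i]))]

def valid_js_inspected(i, index, nb):
    """Inspector version: the valid J are exactly the stored indices
    that fall in the range, collected in one pass."""
    return sorted({index[i][j1] for j1 in range(nb[i]) if index[i][j1] <= i})

index = [[0], [0, 1], [1, 0]]
nb = [1, 2, 2]
assert valid_js_guarded(2, index, nb) == valid_js_inspected(2, index, nb)
print(valid_js_inspected(2, index, nb))  # [0, 1]
```

The guarded scan costs O(I * NB(I)) per iteration; the inspected walk costs O(NB(I)), which is where the saving comes from.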

Inter- and extra-iteration communications (LPIID)

HPF 2.0: not supported

This inspector makes it possible to optimize the communications that can be hoisted out of the enclosing loop, and to send inter-iteration communications as soon as possible. Communications are split into two steps, emission and reception (contrary to the standard HPF code generation).
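The emission/reception split can be mimicked as follows (an illustrative sketch with an in-process queue standing in for the network; `emit` and `receive` are our names, not the runtime's API).

```python
import queue

channel = queue.Queue()   # stands in for the network
stash = {}                # messages received but not yet consumed

def emit(tag, value):
    """Emission: a non-blocking send, which can be posted as soon as the
    value is ready, possibly long before it is consumed."""
    channel.put((tag, value))

def receive(tag):
    """Reception: block only when the consumed value has not yet arrived,
    draining the channel into a stash in the meantime."""
    while tag not in stash:
        t, v = channel.get()
        stash[t] = v
    return stash.pop(tag)

# the owner emits everything it owns up front ...
for j, b in enumerate([10.0, 20.0, 30.0]):
    emit(("B", j), b)
# ... and the consumer receives only when each value is actually used
total = sum(receive(("B", j)) for j in (2, 0))
print(total)  # 40.0
```

Splitting the two sides is what lets the inspector hoist the emissions out of the loop while leaving the receptions at their point of use.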

Scheduling of the DAG associated with an LPIID

HPF 2.0: not supported

!HPF$ SCHEDULE (J = 1:I-1, ANY(INDEX(I,1:NB(I)) == J))
DO I = 1, N
   ...
END DO

Gives:

CALL Schedule(task, 1, N)

SUBROUTINE task(I)
!HPF$ ON HOME (A(J), ANY(INDEX(I,1:NB(I)) == J) .OR. J == I), BEGIN
!HPF$ INDEPENDENT, REDUCTION(A(I))
   FORALL (J = 1:I, ANY(INDEX(I,1:NB(I)) == J))
      A(I) = A(I) + B(J)
   END FORALL
   A(I) = f(A(I))
!HPF$ END ON
END SUBROUTINE

This new directive, SCHEDULE, allows an LPIID and its associated task graph to be taken into account (every iteration is associated with a task, and the DAG is extracted from the algorithmic expression of the directive, which makes explicit the dependences of the current iteration I). The loop body is then rewritten as a subroutine, which is executed according to the result of the scheduler.
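A possible shape for what the scheduler consumes can be sketched in Python (hypothetical names, 0-based indices): each iteration becomes a task, an edge J -> I is added whenever the dependence expression holds with J < I, and a level-by-level topological schedule exposes the independent tasks.

```python
def build_dag(n, index, nb):
    """Predecessors of each task i, from the dependence expression
    ANY(INDEX(I,1:NB(I)) = J) restricted to J < I."""
    preds = [set() for _ in range(n)]
    for i in range(n):
        for j1 in range(nb[i]):
            j = index[i][j1]
            if j < i:                  # mirrors J = 1:I-1 in the directive
                preds[i].add(j)
    return preds

def schedule_levels(preds):
    """Level-by-level topological schedule: all tasks within one level
    are mutually independent and may run in parallel."""
    n = len(preds)
    indeg = [len(p) for p in preds]
    succs = [[] for _ in range(n)]
    for i, p in enumerate(preds):
        for j in p:
            succs[j].append(i)
    level = [i for i in range(n) if indeg[i] == 0]
    levels = []
    while level:
        levels.append(sorted(level))
        nxt = []
        for j in level:
            for i in succs[j]:
                indeg[i] -= 1
                if indeg[i] == 0:
                    nxt.append(i)
        level = nxt
    return levels

index = [[0], [0], [0], [1, 2]]
nb = [1, 1, 1, 2]
print(schedule_levels(build_dag(4, index, nb)))  # [[0], [1, 2], [3]]
```

Here iterations 1 and 2 both depend only on iteration 0, so they land in the same level and can be executed concurrently.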

Progressive Irregular Prefix Operations

HPF 2.0: not supported

!HPF$ PREFIX(A)
DO I = 1, N
!HPF$ ON HOME (A(J), ANY(INDEX(I,1:NB(I)) == J) .OR. J == I), BEGIN
!HPF$ INDEPENDENT, PREFIX(A)
   FORALL (J = 1:I, ANY(INDEX(I,1:NB(I)) == J))
      A(I) = A(I) + A(J)
   END FORALL
   A(I) = f(A(I))
!HPF$ END ON
END DO

The PREFIX directive declares the data-parallel operation in the loop to be an irregular progressive prefix operation. The operation is progressive because every entry is computed from final values of the array. It is not a simple reduction, because the input array is also the output array. The implementation of this operation uses asynchronous communications.
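A sequential sketch of the semantics (hypothetical Python, 0-based indices): each A(I) accumulates *final* values A(J) of the same array before being finalized itself, which is what distinguishes it from a reduction over a separate input array.

```python
def irregular_prefix(n, index, nb, a, f):
    """Progressive irregular prefix: a[i] reads already-finalized entries
    a[j] of the SAME array, then is finalized by f. The updates must
    therefore follow the dependence order, unlike an ordinary reduction."""
    for i in range(n):
        for j1 in range(nb[i]):
            j = index[i][j1]
            if j < i:              # only entries finalized before i
                a[i] += a[j]       # reads a final value, not an input copy
        a[i] = f(a[i])             # a[i] is now final
    return a

index = [[0], [0], [0, 1]]
nb = [1, 1, 2]
print(irregular_prefix(3, index, nb, [1.0, 1.0, 1.0], lambda x: x + 1))
# [2.0, 4.0, 8.0]
```

A parallel runtime can overlap independent entries with asynchronous communications, but every a[j] read must still carry its final value, which is what the inspector has to guarantee.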

Experimental Results

These are the results of a Cholesky factorization of sparse matrices in a 1D block-column version (distribution of the block-columns). Each successive version includes more and more optimizations.

The last version corresponds to the team's high-performance Cholesky factorization [P. Henon, P. Ramet and J. Roman, "A Mapping and Scheduling Algorithm for Parallel Sparse Fan-In Numerical Factorization", EuroPar'99, 1999].

The factorization is placed in an iterative setting where the sparse-matrix structure is constant (only the values evolve), as in Karmarkar's algorithm, which solves the system AX = B while minimizing an objective function C^T X.

The first iteration is obtained after the static scheduling (if any), and its time includes the inspection time; the second is obtained with the dynamic scheduling (for the PREFIX directive, it is the time of the subsequent iterations), without the inspection time.