1 READEX Concepts and Formalism

The goal of this section is to present a formal mathematical description of the READEX concepts to minimize ambiguity in their definitions, to improve common understanding and communication between the project partners, and to support the development of READEX DTA and RAT methodologies and software components.

In Section 1.1, we present the fundamental concepts of the READEX methodology. Section 1.2 presents the concept of a tuning model which is the final outcome of the DTA stage and which is passed to the RRL to guide dynamic runtime tuning of the HPC stack.

In Section 1.3, we describe how the base READEX concepts are applied to define the tuning potential metric, a metric quantifying the potential improvement to be achieved by the tuning, and thus central to the overall READEX methodology. Section 1.4 extends the base concepts to include the central concept of inter-phase dynamism. Finally, in Section 1.5 we describe how these fundamental concepts can be extended to handle a multiplicity of application inputs, which also typically leads to varying application behaviour during runtime.

1.1 Fundamental concepts

This section introduces the core concepts that will be referred to throughout this document and the READEX project more generally.

1.1.1 Significant regions

The set of all instrumented regions in the program, e.g., functions, parallel OpenMP regions, and MPI operations, is denoted by R instr , i.e., all regions that are known to the READEX tool suite. During execution of the application, a region might be executed multiple times. Each region execution is called a region instance. It is assumed that for all regions r ∈ R instr the instrumentation overhead is insignificant.

A region r ∈ R instr is called a significant region if it covers a significant part of the execution time. The READEX tuning approach only targets the significant regions. The set of significant regions are denoted R sig ⊆ R instr . From this point in the deliverable, the subscript sig may be omitted, in which case R stands for R sig .

1.1.2 Identifiers

An identifier is an element that contains information to predict the characteristics of the consequent execution.
The set of all identifiers is denoted as ID. ID r ⊆ ID is the set of identifiers for a region r ∈ R and ID R is the set of identifiers of all regions. Identifiers of regions considered in READEX include the region name, region call path, and region parameters. The region call path is the sequence of nested region instances when a region is executed. Region parameters are variables associated with the region as identifiers (see Section 4.3 for more details).
Further subsets of the set ID are specified in Section 1.4.3 and Section 1.5.2. Values that can be taken by identifiers are denoted as v ∈ V .

1.1.3 Context elements

A context element c is a tuple consisting of an identifier and its value. C := ID × V is the set of all context elements, and is defined as the cross product of all identifiers and the values taken by them.
The set of context elements for region r ∈ R is defined as C r := {c ∈ C|c = (id, v)∧id ∈ ID r }.

1.1.4 Runtime situations

A runtime situation (rts) is an instance of a significant region r ∈ R during an execution. Instances of regions r ∈ R instr \ R sig are not rts’s. The set of all possible runtime situations is denoted by RTS.

The region of an rts ∈ RTS is denoted by r rts ∈ R and its context by C rts ⊆ C. Multiple rts’s may have the same context.

1.1.5 Application execution

An application is executed by a set of processes P , each of which can use multiple threads. An execution of the application on a given input by a process p is exe p := rts 1 , rts 2 , . . . , rts n ,where n = len(exe). An application execution is the set of executions of all processes and is defined as EXE := {exe p |∀p ∈ P }.

1.1.6 Tuning parameters

A tuning parameter tp ∈ TP is a parameter of the HPC stack (e.g. CPU frequency, accelerator offloading switch, application parameter, etc.). READEX focuses on tuning parameters that have the potential to influence the energy consumption of an application running on an extreme-scale system and can be affected by the RRL at runtime.

A tuning parameter tp ∈ TP can take a value from VAL tp . The set of all values of tuning parameters is VAL = ∪ tp∈TP VAL tp . See Section 3 for possible tuning parameters considered in READEX.

1.1.7 System configurations

A system configuration cfg ∈ CFG is a function that maps a tuning parameter tp ∈ TP onto its value val ∈ VAL tp and is defined as cfg : TP −→ VAL.

1.1.8 Switching point

A switching point sp ∈ SP is a point during application execution when the monitoring library is entered and can perform switching of the system configuration. The monitoring library can determine the region r at every switching point. Switching points are triggered when a region is entered or exited.

1.1.9 Objective function

The objective function o : RTS × CFG −→ R is a function mapping a given runtime situation rts and a given system configuration onto a real number. The objective of tuning is to minimise or maximise a given objective function by varying the system configuration.

1.1.10 Static system configuration

In READEX, we consider a static system configuration cfg static ∈ CFG as a static system configuration that is either a system-wide default or was obtained by tuning for a best static configuration. For example, we could select the best fixed CPU frequency for an entire program run, optimising the energy consumption of the application.

The static system configuration cfg static gives a baseline for a comparison of the results achieved with READEX runtime tuning.

1.1.11 Example

This section describes the terms introduced in Section 1.1 with an example code.

The program in Listing 1 consists of the main() function that calls three other functions, namely laplace(), reduction(), and fftw(). After instrumenting the application with Score-P, we obtain R instr = {laplace, reduction, fftw }, the set of all regions in the application. In READEX, the most interesting regions are the ones that are coarse granular so that there is low overhead while switching between configurations.

Suppose the execution times of each instance of laplace(), reduction(), and fftw() are 2 s, 0.1 ms and 1 s respectively. The execution times for laplace() and fftw() are much greater than the switching overhead. Thus, laplace and fftw are significant regions R sig = R = {laplace, fftw } and will be the targets of the READEX tuning approach.

The region name and call-path of these functions can be selected as identifiers. For example, identifiers for region laplace would be ID laplace = {RegionN ame, CallP ath}.

Suppose we have two processes, P 0 and P 1 , executing this program. An rts of region laplace 1 in process P 1 for iteration iter = 1 of the loop is denoted as rts 1,P laplace . Assuming that we have two iterations, all possible runtime situations are denoted by

A context element c rts 1,P 0 =
(CallP ath, main/laplace) for region laplace
laplace is a tuple containing the identifier CallP ath and its value.
C rts 1,P 0 = laplace {(RegionN ame, laplace), (CallP ath, main/laplace)} is the set of all context elements of this rts. The application execution on process P 0 is defined as

exe P 0 = rts 1,P laplace , rts fftw , rts laplace , rts fftw.

Switching points are triggered when a region is entered or exited and SP = {sp laplace enter , sp laplace exit , sp fftw enter , sp fftw exit } is the set of switching points. The sequence of switching points and rts’s for process P 0 are shown in Figure 1.1.

The goal of READEX is to optimize an objective function o(rts, cfg) by switching the system configuration dynamically at the switching points. In this example, the tuning parameter tp ∈ T P that influences the energy consumption is CPU Freq, the CPU frequency setting. CPU Freq can take values from VAL CPU Freq = {1.8, 2.0, 2.1, 2.3, 2.4, 2.5}.

1.2 Tuning model

As a result of DTA, a tuning model is produced. The tuning model captures the knowledge about the best-found system configurations for individual rts’s. It is based on the idea of scenario-based tuning which aggregates similar rts’s into scenarios.

1.2.1 Scenarios

The set of rts’s is partitioned into a set, S, of scenarios. The rts’s are grouped into one scenario if they have the same best-found configuration, i.e., the same selector as defined in Section 1.2.3 below, or if they have the same context C rts .

1.2.2 Classifier

1.2.3 Selectors

A selector of a scenario s ∈ S is a function sel s : ∅ −→ CF G that returns a single configuration. The configuration is usually the best configuration with respect to the chosen objective. The set of all selectors is SEL. Selectors can have quite different implementations based on the knowledge included in the tuning model for the scenario. Examples of selectors are:

It is based on a single best configuration determined at design time.

It is based on a set of pareto-optimal configurations and returns one of those according to some runtime priorities for the objectives.

It could choose from a set of good configurations determined during design time analysis based on probabilities. This could actually dynamically check for the best solution and go beyond the limitation of design time analysis.

It could select from different configurations that were combined because of merging rts’s with different best configurations, but with the same context (see Section 1.2.4 for an example).

Selectors are also used to store the tuning information for individual rts’s, which we call rts selectors.

1.2.4 Example (continued)

This section describes the terms introduced in Section 1.2 using the example defined in Section 1.1.11.

It may be argued that the rts’s of laplace and fftw have different best configurations, and so, they should be partitioned into different scenarios. This does not happen because it is not possible to differentiate between the rts’s based on the context, and hence, they are merged into the same scenario. These cases will happen rarely if the identifiers are well chosen.

1.3 Applying the formalism

This section describes how the concepts described so far can be applied, both in terms of tuning-relevant dynamism and in terms of the tuning potential.

1.3.1 Optimal system configuration

The function ocfg : RTS × OBJ −→ CFG is an oracle and determines the optimal system configuration for a runtime situation and a given objective. It is the goal of READEX to approximate ocfg as best as possible based on the READEX methodology.

1.3.2 Tuning-relevant dynamism

An application exhibits tuning relevant dynamism iff ∃rts, rts 0 ∈ RTS such that ocfg(rts, o) 6 = ocfg(rts 0 , o). This means that there are two rts’s during one execution exe p ∈ EXE that have different optimal configurations with respect to objective o. In such cases, the system configuration has to be switched in order to achieve the best possible objective function value.

Whether the tuning potential can be achieved with the READEX methodology depends on the quality of the tuning model, the switching overhead, and mutual side effects between processes.

1.3.4 Example (continued)

This section extends the example defined in Section 1.1.11 and Section 1.2.4 to describe the terms introduced in Section 1.3.

The optimal system configuration function ocfg(rts, o) returns an optimal configuration cfg ∈ CFG for each rts as shown in Table 2.

During the application execution, we see that rts 1,P laplace and rts laplace have different optimal configurations with respect to the objective. Hence, tuning relevant dynamism exists because a configuration switch is determined to be worthwhile.

The tuning potential tpo(rts 1,P
laplace , o) = o(rts laplace , ocfg(rts laplace , o)) − o(rts laplace , cfg static ) 0 for rts 1,P laplace specifies the improvement in the value of the objective function (consumed energy) as compared to the value for cfg static (CPU Freq) = 2.5. The improvement for 0 rts 1,P laplace is 92.857%, and for the application execution exe P 0 is 90.196%. However, the tuning effect is diminished because of the merging of the rts’s. Due to this limitation, we can only realize an improvement of 88% for execution exe P 0 . To avoid this, an additional identifier can be introduced to distinguish the rts’s (see Section 2.4.5).

1.4 Extension for inter-phase dynamism

This section describes an extension of the formalism for inter-phase dynamism. We define an inter-phase tuning model, which is a tuning model according to the definition in Section 1.2 and can provide different system configuration for rts’s in phases with different characteristics. The definitions in this section and in Section 1.5 align strongly with the definitions for identifiers, scenarios, and tuning model in Section 1.1, which demonstrates the generality of the fundamental concepts.

1.4.1 Phase region

A phase region is a program region r ∈ R instr that defines the phases of execution. All significant regions should be included in the dynamic extent 1 of the phase region. The set of all phase regions is R phase ⊆ R instr . In READEX, we initially assume that there is exactly one phase region, i.e., |R phase | = 1.

1.4.2 Phase

A phase ph ∈ P H is an instance of a phase region and thus an rts on phase level. It is a single execution of the phase region. In READEX we assume that the phases are executed collectively by all of the processes. Thus, all the processes go through the same phase sequence ps = ph 1 , ph 2 , . . . , ph k .

1.4.3 Phase identifiers

For the phase region r ∈ R phase , identifiers can be given, as for any other significant region. We denote the set of phase identifiers as ID phase ⊆ ID. The phase identifiers should be chosen such that they allow to distinguish different behaviour across phases of that phase region. For region identifiers, we define the context elements based on phase identifiers as C phase ⊆ C.

1.4.4 Inter-phase tuning model

Based on the phase identifiers, the rts’s of a region with the same context but in phases with different characteristics can now be mapped to different rts scenarios with different selectors.
Thus, we obtain again a set of scenarios, S, which might be different from the one for the same execution ignoring phases. The partitioning into rts scenarios is now given by a phase-aware classifier cl : P(C phase ∪ C R ) −→ S.
Since the tuning model, T M = (S, cl, SEL), is based on a phase-aware classifier, it can now cover the scope of inter-phase dynamism.

1.4.5 Example (continued)

This section describes the terms introduced in Section 2.4 with an extension to the example code described in Section 1.1.11.

The program in Listing 2 now consists of the phase region r ∈ R phase . ph 1 is the first execution of r, the body of the f or loop and the first loop iteration. Phase ph 2 is the second loop iteration. Thus, the sequence of phases is ps = ph 1 , ph 2 . All the processes execute the same sequence.

In Section 1.2.4, both the rts’s of laplace and fftw were merged into scenarios s 0 and s 1 respectively, and the tuning potential was diminished. To avoid this, a phase identifier is introduced, ID phase = {P haseCharact}, which distinguishes phases with different characteristics. Let P haseCharact = A for ph 1 and P haseCharact = B for ph 2 . Now, the merged rts’s can be distinguished because we can draw information about a phase’s characteristic, and thus determine which phase an rts belongs to.

1.5 Extension for multiple application inputs

This section presents an extension of the previous concepts for different application runs with different application inputs. The application input comprises input data and execution environment configurations. Application runs with different inputs will usually result in different performance and energy characteristics. Since in READEX, we assume that the application has reproducible performance and energy characteristics when run with the same input several times, the tuning model should be extended to account for this varying behaviour.

1.5.1 Application input

The application input ai ∈ AI includes the input data to the application as well as other application-external settings at the start of an application that influence the performance and dynamic behaviour of the execution, such as the set of processes P. As a result of the application behaving differently for different application inputs, the DTA will have to consider a set of application inputs. AI denotes all possible application inputs for the given application.

1.5.2 Input identifier

Identifiers can be given for the application input that can be used to differentiate different performance and energy characteristics. Such identifiers are called input identifiers and are denoted by ID input . The context elements for those identifiers are denoted by C input.

1.5.3 Application tuning model

Based on the application input identifiers we can now obtain the application tuning model that considers application input, application phases, and significant regions. It is based on a partitioning of all rts’s in all executions for the different inputs into rts scenarios. The partitioning will be given by an input-aware classifier cl : P(C input ∪ C phase ∪ C R ) −→ S that maps rts’s based on the input context, the phase context, and the rts context into rts scenarios.

1.5.4 Example (continued)

This section describes the terms introduced in Section 2.5 with reference to the code in Listing 2.

Different application inputs with different characteristics i 1 , i 2 , i 3 , . . . result in different measured values of the consumed energy. Assuming there is an input identifier InputCharact,
which takes the values i 1 , i 2 , i 3 , . . . and identifies these characteristics, it can be used to differentiate different energy characteristics.

The contexts of the rts’s of laplace and fftw can now also contain the contexts of the input identifiers. The contexts of the rts’s for the first execution are:

1.6 Summary

The previous sections presented a formalism to describe the READEX methodology as well as the tool suite described in detail in later sections. The tuning model serves as the interface between the DTA and RAT stages and is based on a partitioning of rts’s into scenarios. For each scenario, the decision logic for selecting the system configuration to use is provided by a selector, which in turn returns a system configuration. The rts’s are mapped by a classifier to those scenarios. In the tuning model, the classifier is based on the input context, the phase context, and the context of the rts. As a result, all of the sources of application dynamism can be used to select a system configuration.