Sid-Ahmed-Ali Touati, Julien Worms and Sebastien Briais. The Speedup Test. 2010. Software is included with the document: the software implements the speedup-test protocol. Abstract: Numerous code optimisation methods are usually evaluated by making multiple observations of the initial and the optimised execution times in order to declare a speedup. Even with fixed input and execution environment, program execution times vary in general. Hence different kinds of speedups may be reported: the speedup of the average execution time, the speedup of the minimal execution time, the speedup of the median, etc. Many published speedups in the literature are observations of a set of experiments. In order to improve the reproducibility of experimental results, this technical report presents a rigorous statistical methodology for program performance analysis. We rely on well-known statistical tests (Shapiro-Wilk's test, Fisher's F-test, Student's t-test, Kolmogorov-Smirnov's test, Wilcoxon-Mann-Whitney's test) to study whether the observed speedups are statistically significant or not. By fixing a desired risk level $0 < \alpha < 1$, we study whether the probability that an individual execution of the optimised code is faster than an individual execution of the initial code exceeds $\frac{1}{2}$. Our methodology defines a consistent improvement compared to the usual performance analysis method in high performance computing as in \cite{Jain:1991:ACS,lilja:book}. We explain in each situation which hypotheses must be checked to declare a correct risk level for the statistics. The Speedup-Test protocol certifying the observed speedups with rigorous statistics is implemented and distributed as an open source tool based on the R software.

@techreport{TWBr10,
author = "Touati, Sid-Ahmed-Ali and Worms, Julien and Briais, Sebastien",
title = "{The Speedup Test}",
year = "{2010}",
note = "{Software is included with the document: the software implements the speedup-test protocol.}",
abstract = "{Numerous code optimisation methods are usually evaluated by making multiple observations of the initial and the optimised execution times in order to declare a speedup. Even with fixed input and execution environment, program execution times vary in general. Hence different kinds of speedups may be reported: the speedup of the average execution time, the speedup of the minimal execution time, the speedup of the median, etc. Many published speedups in the literature are observations of a set of experiments. In order to improve the reproducibility of experimental results, this technical report presents a rigorous statistical methodology for program performance analysis. We rely on well-known statistical tests (Shapiro-Wilk's test, Fisher's F-test, Student's t-test, Kolmogorov-Smirnov's test, Wilcoxon-Mann-Whitney's test) to study whether the observed speedups are statistically significant or not. By fixing a desired risk level $0 < \alpha < 1$, we study whether the probability that an individual execution of the optimised code is faster than an individual execution of the initial code exceeds $\frac{1}{2}$. Our methodology defines a consistent improvement compared to the usual performance analysis method in high performance computing as in \cite{Jain:1991:ACS,lilja:book}. We explain in each situation which hypotheses must be checked to declare a correct risk level for the statistics. The Speedup-Test protocol certifying the observed speedups with rigorous statistics is implemented and distributed as an open source tool based on the R software.}",
affiliation = "Parall{\'e}lisme, R{\'e}seaux, Syst{\`e}mes d'information, Mod{\'e}lisation - PRISM - CNRS : UMR8144 - Universit{\'e} de Versailles-Saint Quentin en Yvelines - ALCHEMY - INRIA Saclay - Ile de France - INRIA - CNRS : UMR8623 - Universit{\'e} Paris Sud - Paris XI - Laboratoire de Math{\'e}matiques de Versailles - LM-Versailles - CNRS : UMR8100 - Universit{\'e} de Versailles-Saint Quentin en Yvelines",
file = "SpeedupTestDocument.pdf:http\://hal.inria.fr/inria-00443839/PDF/SpeedupTestDocument.pdf:PDF",
hal_id = "inria-00443839",
keywords = "Code optimisation, program performance evaluation and analysis, statistics",
language = "English",
owner = "MOIS",
timestamp = "2011.07.25",
url = "http://hal.inria.fr/inria-00443839/en/"
}
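The protocol itself is distributed as an R-based tool; the following stdlib-only Python sketch illustrates just its parametric branch, a Welch t-test on mean execution times, on invented timing samples. The sample sizes, the data and the normal approximation of the p-value are assumptions of this illustration, not part of the report.

```python
import math
import random
import statistics

random.seed(0)
# Invented timing samples (seconds): 30 runs of each code version.
initial   = [2.00 + random.gauss(0, 0.05) for _ in range(30)]
optimised = [1.60 + random.gauss(0, 0.05) for _ in range(30)]

def welch_t(a, b):
    """Welch's t statistic: compares means without assuming equal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

t = welch_t(initial, optimised)
# Two-sided p-value via the normal approximation (adequate for n >= 30;
# the report's tool relies on the exact test distributions instead).
pvalue = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

alpha = 0.05                       # the desired risk level
speedup = statistics.mean(initial) / statistics.mean(optimised)
significant = pvalue < alpha
print(f"speedup of the mean = {speedup:.2f}, significant = {significant}")
```

The full protocol also covers the non-parametric path (Wilcoxon-Mann-Whitney when normality is rejected by Shapiro-Wilk), which this sketch omits.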

Abdelhafid Mazouz, Sid-Ahmed-Ali Touati and Denis Barthou. Measuring and Analysing the Variations of Program Execution Times on Multicore Platforms: Case Study. 2010. Abstract: The recent growth in the number of processing units in today's multicore processor architectures enables multiple threads to execute simultaneously, achieving better performance by exploiting thread-level parallelism. With the architectural complexity of these new state-of-the-art designs comes a need to better understand the interactions between the operating system layers, the applications and the underlying hardware platforms. The ability to characterise and quantify those interactions can be useful for performance evaluation and analysis, compiler optimisation and operating system job scheduling, allowing better performance stability, reproducibility and predictability to be achieved. In our study we consider performance instability as variations in program execution times. While these variations are statistically insignificant for large sequential applications, we observe that parallel native OpenMP programs have less performance stability. Understanding performance instability in current multicore architectures is made even more complicated by the variety of factors and sources influencing application performance.
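The kind of variability comparison described here can be quantified with the coefficient of variation of repeated runs. The timing samples below are invented for illustration; only the metric, standard deviation over mean, reflects the study's topic.

```python
import statistics

# Invented repeated-run timings (seconds) for one benchmark, sequential
# version vs. a hypothetical 4-thread OpenMP version.
runs = {
    "sequential":  [10.02, 10.01, 10.03, 10.02, 10.02],
    "openmp-4thr": [3.10, 3.45, 2.98, 3.60, 3.25],
}

# Coefficient of variation: relative dispersion of execution times.
cvs = {name: statistics.stdev(t) / statistics.mean(t) for name, t in runs.items()}
for name, cv in cvs.items():
    print(f"{name}: CV = {cv:.1%}")
```

A higher CV for the parallel version matches the paper's observation that OpenMP programs are less performance-stable than sequential ones.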

Yuanjie Huang, Liang Peng, Chengyong Wu, Yury Kashnikov, Jorn Rennecke and Grigori Fursin. Transforming GCC into a research-friendly environment: plugins for optimization tuning and reordering, function cloning and program instrumentation. In 2nd International Workshop on GCC Research Opportunities (GROW'10). 2010. Google Summer of Code'09. Abstract: Computer scientists are always eager to have a powerful, robust and stable compiler infrastructure. However, until recently, researchers had to either use available and often unstable research compilers, create new ones from scratch, try to hack open-source non-research compilers, or use source-to-source tools. This often requires duplicating a large amount of functionality available in current production compilers, while calling the practicality of the obtained research results into question. The Interactive Compilation Interface (ICI) has been introduced to avoid such time-consuming replication and to transform popular production compilers such as GCC into research toolsets, by providing the ability to access, modify and extend GCC's internal functionality through a compiler-dependent hook and a clear compiler-independent API with external portable plugins, without interrupting the natural evolution of the compiler. In this paper, we describe our recent extensions to GCC and ICI, with preliminary experimental data, to support selection and reordering of optimization passes with a dependency grammar, control of individual transformations and their parameters, generic function cloning and program instrumentation. We are synchronizing these developments, implemented during the Google Summer of Code'09 program, with the mainline GCC 4.5 and its native low-level plugin system.
These extensions are intended to enable and popularize the use of GCC for realistic research on empirical iterative feedback-directed compilation, statistical collective optimization, run-time adaptation and the development of intelligent self-tuning computing systems, among other important topics. Such a research infrastructure should help researchers prototype and validate their ideas quickly in realistic production environments while keeping their research plugins portable across different releases of the compiler. Moreover, it should allow successful ideas to be moved back to GCC much faster, thus helping to improve, modularize and clean it up. Furthermore, we are porting GCC with ICI extensions for performance/power auto-tuning for data centers and cloud computing systems with heterogeneous architectures, and for continuous whole-system optimization.

@inproceedings{HPW+10,
author = "Huang, Yuanjie and Peng, Liang and Wu, Chengyong and Kashnikov, Yury and Rennecke, Jorn and Fursin, Grigori",
title = "{Transforming GCC into a research-friendly environment: plugins for optimization tuning and reordering, function cloning and program instrumentation}",
booktitle = "{2nd International Workshop on GCC Research Opportunities (GROW'10)}",
year = "{2010}",
address = "{Pisa, Italie}",
month = jan,
note = "{Google Summer of Code'09}",
abstract = "{Computer scientists are always eager to have a powerful, robust and stable compiler infrastructure. However, until recently, researchers had to either use available and often unstable research compilers, create new ones from scratch, try to hack open-source non-research compilers, or use source-to-source tools. This often requires duplicating a large amount of functionality available in current production compilers, while calling the practicality of the obtained research results into question. The Interactive Compilation Interface (ICI) has been introduced to avoid such time-consuming replication and to transform popular production compilers such as GCC into research toolsets, by providing the ability to access, modify and extend GCC's internal functionality through a compiler-dependent hook and a clear compiler-independent API with external portable plugins, without interrupting the natural evolution of the compiler. In this paper, we describe our recent extensions to GCC and ICI, with preliminary experimental data, to support selection and reordering of optimization passes with a dependency grammar, control of individual transformations and their parameters, generic function cloning and program instrumentation. We are synchronizing these developments, implemented during the Google Summer of Code'09 program, with the mainline GCC 4.5 and its native low-level plugin system. These extensions are intended to enable and popularize the use of GCC for realistic research on empirical iterative feedback-directed compilation, statistical collective optimization, run-time adaptation and the development of intelligent self-tuning computing systems, among other important topics. Such a research infrastructure should help researchers prototype and validate their ideas quickly in realistic production environments while keeping their research plugins portable across different releases of the compiler.
Moreover, it should allow successful ideas to be moved back to GCC much faster, thus helping to improve, modularize and clean it up. Furthermore, we are porting GCC with ICI extensions for performance/power auto-tuning for data centers and cloud computing systems with heterogeneous architectures, and for continuous whole-system optimization.}",
affiliation = "Institute of Computing Technology - Chinese Academy of Science - ICT - Chinese Academy of Science (CAS) - Parall{\'e}lisme, R{\'e}seaux, Syst{\`e}mes d'information, Mod{\'e}lisation - PRISM - CNRS : UMR8144 - Universit{\'e} de Versailles-Saint Quentin en Yvelines - ALCHEMY - INRIA Saclay - Ile de France - INRIA - CNRS : UMR8623 - Universit{\'e} Paris Sud - Paris XI",
audience = "internationale",
file = "hpwp2010.pdf:http\://hal.inria.fr/inria-00451106/PDF/hpwp2010.pdf:PDF",
hal_id = "inria-00451106",
language = "English",
owner = "MOIS",
timestamp = "2011.07.25",
url = "http://hal.inria.fr/inria-00451106/en/"
}

Sebastien Briais, Sid-Ahmed-Ali Touati and Karine Deschinkel. Ensuring Lexicographic-Positive Data Dependence Graphs in the SIRA Framework. 2010. Abstract: Usual cyclic scheduling problems, such as software pipelining, deal with precedence constraints having non-negative latencies. This seems a natural way of modelling scheduling problems, since instruction delays are generally non-negative quantities. However, in some cases we need to consider edge latencies that do not only model instruction latencies, but model other precedence constraints. For instance, in register optimisation problems, a generic machine model can allow considering access delays into/from registers (VLIW, EPIC, DSP). In this case, edge latencies may be non-positive, leading to a difficult scheduling problem in the presence of resource constraints. This research report studies the problem of cyclic instruction scheduling with register requirement minimisation (without resource constraints). We show that pre-conditioning a data dependence graph (DDG) to satisfy register constraints before software pipelining under resource constraints may create cycles with non-positive distances, resulting from the acceptance of non-positive edge latencies. Such a DDG is called non lexicographic positive because it does not define a topological sort between the instruction instances: in other words, its full unrolling does not define an acyclic graph. As a compiler construction strategy, we cannot allow the creation of cycles with non-positive distances during the compilation flow, because a non lexicographic positive DDG does not guarantee the existence of a valid instruction schedule under resource constraints. This research report examines two strategies to avoid the creation of these problematic DDG cycles. The first strategy is reactive: it tolerates the creation of non-positive cycles in a first step and, if they are detected in a later check step, backtracks to eliminate them.
The second strategy is proactive: it prevents the creation of non-positive cycles in the DDG during the register minimisation process. Our extensive experiments on FFMPEG, MEDIABENCH, SPEC2000 and SPEC2006 benchmarks show that the reactive strategy is faster and works well in practice, but may require more registers than the proactive strategy. Consequently, the reactive strategy is a suitable working solution for compilation if the number of available architectural registers is already fixed and register minimisation is not necessary (it suffices to consume fewer registers than the available capacity). However, the proactive strategy, while more time consuming, is a better alternative for register requirement minimisation: this may be the case when dealing with reconfigurable architectures, i.e. when the number of available architectural registers is defined after the compilation of the application.

@techreport{BTDe10,
author = "Briais, Sebastien and Touati, Sid-Ahmed-Ali and Deschinkel, Karine",
title = "{Ensuring Lexicographic-Positive Data Dependence Graphs in the SIRA Framework}",
year = "{2010}",
abstract = "{Usual cyclic scheduling problems, such as software pipelining, deal with precedence constraints having non-negative latencies. This seems a natural way of modelling scheduling problems, since instruction delays are generally non-negative quantities. However, in some cases we need to consider edge latencies that do not only model instruction latencies, but model other precedence constraints. For instance, in register optimisation problems, a generic machine model can allow considering access delays into/from registers (VLIW, EPIC, DSP). In this case, edge latencies may be non-positive, leading to a difficult scheduling problem in the presence of resource constraints. This research report studies the problem of cyclic instruction scheduling with register requirement minimisation (without resource constraints). We show that pre-conditioning a data dependence graph (DDG) to satisfy register constraints before software pipelining under resource constraints may create cycles with non-positive distances, resulting from the acceptance of non-positive edge latencies. Such a DDG is called {\it non lexicographic positive} because it does not define a topological sort between the instruction instances: in other words, its full unrolling does not define an acyclic graph. As a compiler construction strategy, we cannot allow the creation of cycles with non-positive distances during the compilation flow, because a non lexicographic positive DDG does not guarantee the existence of a valid instruction schedule under resource constraints. This research report examines two strategies to avoid the creation of these problematic DDG cycles. The first strategy is reactive: it tolerates the creation of non-positive cycles in a first step and, if they are detected in a later check step, backtracks to eliminate them. The second strategy is proactive: it prevents the creation of non-positive cycles in the DDG during the register minimisation process.
Our extensive experiments on FFMPEG, MEDIABENCH, SPEC2000 and SPEC2006 benchmarks show that the reactive strategy is faster and works well in practice, but may require more registers than the proactive strategy. Consequently, the reactive strategy is a suitable working solution for compilation if the number of available architectural registers is already fixed and register minimisation is not necessary (it suffices to consume fewer registers than the available capacity). However, the proactive strategy, while more time consuming, is a better alternative for register requirement minimisation: this may be the case when dealing with reconfigurable architectures, i.e. when the number of available architectural registers is defined after the compilation of the application.}",
affiliation = "Parall{\'e}lisme, R{\'e}seaux, Syst{\`e}mes d'information, Mod{\'e}lisation - PRISM - CNRS : UMR8144 - Universit{\'e} de Versailles-Saint Quentin en Yvelines - ALCHEMY - INRIA Saclay - Ile de France - INRIA - CNRS : UMR8623 - Universit{\'e} Paris Sud - Paris XI - Laboratoire d'Informatique de Franche-Comt{\'e} - LIFC - Universit{\'e} de Franche-Comt{\'e} : EA4269",
collaboration = "PRiSM-INRIA",
file = "_negcycle.pdf:http\://hal.inria.fr/inria-00452695/PDF/main\\_report\\_negcycle.pdf:PDF",
hal_id = "inria-00452695",
keywords = "Compilation, Code optimisation, Register pressure, Cyclic instruction scheduling, Instruction level parallelism",
language = "English",
owner = "MOIS",
timestamp = "2011.07.25",
url = "http://hal.inria.fr/inria-00452695/en/"
}
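The key property in this report, a lexicographic-positive DDG (every cycle has strictly positive total distance), can be checked mechanically. The sketch below is an illustration, not the SIRA implementation: the graph, node names and distances are invented, and Bellman-Ford is run on rescaled weights so that detecting a negative cycle is equivalent to detecting a cycle of non-positive distance.

```python
def is_lexicographic_positive(nodes, edges):
    """edges: list of (src, dst, distance) with integer iteration distances;
    assumes at most one edge per (src, dst) pair."""
    n = len(nodes)
    # Rescale: a cycle of total distance S and length L gets weight
    # S*(n+1) - L, which is negative iff S <= 0 (for S >= 1 it is at least
    # (n+1) - L > 0, since a simple cycle has L <= n). So Bellman-Ford's
    # negative-cycle test on these weights detects exactly the bad cycles.
    w = {(u, v): d * (n + 1) - 1 for u, v, d in edges}
    dist = {v: 0 for v in nodes}       # implicit super-source at distance 0
    for _ in range(n - 1):
        for (u, v), wt in w.items():
            if dist[u] + wt < dist[v]:
                dist[v] = dist[u] + wt
    # One extra relaxation round: any improvement reveals a bad cycle.
    return all(dist[u] + wt >= dist[v] for (u, v), wt in w.items())

# A 3-instruction loop body: the cycle a->b->c->a has total distance 1 (> 0).
ok  = is_lexicographic_positive("abc", [("a","b",0), ("b","c",0), ("c","a",1)])
# An edge closing a zero-distance cycle a->b->a breaks the property.
bad = is_lexicographic_positive("abc", [("a","b",0), ("b","a",0), ("c","a",1)])
print(ok, bad)
```

In the second example, fully unrolling the loop would leave a cycle between the instances of a and b, which is exactly the situation the report's reactive and proactive strategies are designed to avoid.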

Frederic Brault, Benoit Dupont-De-Dinechin, Sid-Ahmed-Ali Touati and Albert Cohen. Software Pipelining and Register Pressure in VLIW Architectures: Preconditionning Data Dependence Graphs is Experimentally Better Than Lifetime-Sensitive Scheduling. In 8th Workshop on Optimizations for DSP and Embedded Systems (ODES'10). 2010. Abstract: Embedding register-pressure control in software pipelining heuristics is the dominant approach in modern back-end compilers. However, aggressive attempts at combining resource and register constraints in software pipelining have failed to scale to real-life loops, leaving weaker heuristics as the only practical solutions. We propose a decoupled approach where register pressure is controlled before scheduling, and evaluate its effectiveness in combination with three representative software pipelining algorithms. We present conclusive experiments in a production compiler on a wealth of media processing and general-purpose benchmarks.

Jean-Christian Angles D'Auriac, Denis Barthou, Damir Becirevic, Rene Bilhaut, François Bodin, Philippe Boucaud, Olivier Brand-Foissac, Jaume Carbonell, Christine Eisenbeis, P Gallard, Gilbert Grosdidier, P Guichon, P F Honore, G Le Meur, P Pene, L Rilling, P Roudeau, André Seznec and A Stocchi. Towards the Petaflop for Lattice QCD Simulations the PetaQCD Project. In J Gruntorad and M Lokajicek (eds.). Journal of Physics Conference Series 219. 2010, 052021. Abstract: The study and design of a very ambitious petaflop cluster exclusively dedicated to Lattice QCD simulations started in early '08 among a consortium of 7 laboratories (IN2P3, CNRS, INRIA, CEA) and 2 SMEs. This consortium received a grant from the French ANR agency in July '08, and the PetaQCD project kickoff took place in January '09. Building upon several years of fruitful collaborative studies in this area, the aim of this project is to demonstrate that the simulation of a 256 x $128^3$ lattice can be achieved through the HMC/ETMC software, using a machine with efficient speed/cost/reliability/power consumption ratios. It is expected that this machine can be built out of a rather limited number of processors (e.g. between 1000 and 4000), although capable of a sustained petaflop CPU performance. The proof-of-concept should be a mock-up cluster built as much as possible with off-the-shelf components, and two particularly attractive axes will be mainly investigated, in addition to fast all-purpose multi-core processors: the use of the new brand of IBM-Cell processors (with on-chip accelerators) and the very recent Nvidia GP-GPUs (off-chip co-processors). This cluster will obviously be massively parallel, and heterogeneous. Communication issues between processors, implied by the Physics of the simulation and the lattice partitioning, will certainly be a major key to the project.

@inproceedings{ABB+10,
author = "Angles D'Auriac, Jean-Christian and Barthou, Denis and Becirevic, Damir and Bilhaut, Rene and Bodin, François and Boucaud, Philippe and Brand-Foissac, Olivier and Carbonell, Jaume and Eisenbeis, Christine and Gallard, P. and Grosdidier, Gilbert and Guichon, P. and Honore, P.F. and Le Meur, G. and Pene, P. and Rilling, L. and Roudeau, P. and Seznec, André and Stocchi, A.",
title = "{Towards the Petaflop for Lattice QCD Simulations the PetaQCD Project}",
booktitle = "{Journal of Physics Conference Series}",
year = "{2010}",
editor = "Gruntorad, J. and Lokajicek, M.",
volume = 219,
pages = "052021",
address = "Prague, Czech Republic",
publisher = "IOP Publishing",
abstract = "{The study and design of a very ambitious petaflop cluster exclusively dedicated to Lattice QCD simulations started in early '08 among a consortium of 7 laboratories (IN2P3, CNRS, INRIA, CEA) and 2 SMEs. This consortium received a grant from the French ANR agency in July '08, and the PetaQCD project kickoff took place in January '09. Building upon several years of fruitful collaborative studies in this area, the aim of this project is to demonstrate that the simulation of a 256 x $128^3$ lattice can be achieved through the HMC/ETMC software, using a machine with efficient speed/cost/reliability/power consumption ratios. It is expected that this machine can be built out of a rather limited number of processors (e.g. between 1000 and 4000), although capable of a sustained petaflop CPU performance. The proof-of-concept should be a mock-up cluster built as much as possible with off-the-shelf components, and two particularly attractive axes will be mainly investigated, in addition to fast all-purpose multi-core processors: the use of the new brand of IBM-Cell processors (with on-chip accelerators) and the very recent Nvidia GP-GPUs (off-chip co-processors). This cluster will obviously be massively parallel, and heterogeneous. Communication issues between processors, implied by the Physics of the simulation and the lattice partitioning, will certainly be a major key to the project.}",
affiliation = "Laboratoire de Physique Subatomique et de Cosmologie - LPSC - CNRS : UMR5821 - IN2P3 - Universit{\'e} Joseph Fourier - Grenoble I - Institut Polytechnique de Grenoble - Parall{\'e}lisme, R{\'e}seaux, Syst{\`e}mes d'information, Mod{\'e}lisation - PRISM - CNRS : UMR8144 - Universit{\'e} de Versailles-Saint Quentin en Yvelines - Laboratoire de Physique Th{\'e}orique d'Orsay - LPT - CNRS : UMR8627 - Universit{\'e} Paris Sud - Paris XI - Laboratoire de l'Acc{\'e}l{\'e}rateur Lin{\'e}aire - LAL - CNRS : UMR8607 - IN2P3 - Universit{\'e} Paris Sud - Paris XI - Institut de Recherches sur les lois Fondamentales de l'Univers (ex DAPNIA) - IRFU - CEA : DSM/IRFU - ALF - INRIA - IRISA - INRIA - Universit{\'e} de Rennes I",
audience = "internationale",
doi = "10.1088/1742-6596/219/5/052021",
hal_id = "in2p3-00380246",
owner = "MOIS",
timestamp = "2011.07.25",
url = "http://hal.in2p3.fr/in2p3-00380246/en/"
}