Resum:

Nowadays, supercomputers deliver an enormous amount of computation power; however, it is well-known that applications only reach a fraction of it. One limiting factor is the single processor performance because it ultimately dictates the overall achieved performance. Performance analysis tools help locating performance inefficiencies and their nature to ultimately improve the application performance. Performance tools rely on two collection techniques to invoke their performance monitors: instrumentation and sampling. Instrumentation refers to inject performance monitors into concrete application locations whereas sampling invokes the installed monitors to external events. Each technique has its advantages. The measurements obtained through instrumentation are directly associated to the application structure while sampling allows a simple way to determine the volume of measurements captured. However, the granularity of the measurements that provides valuable insight cannot be determined a priori. Should analysts study the performance of an application for the first time, they may consider using a performance tool and instrument every routine or use high-frequency sampling rates to provide the most detailed results. These approaches frequently lead to large overheads that impact the application performance and thus alter the measurements gathered and, therefore, mislead the analyst.
This thesis introduces the folding mechanism that takes advantage of the repetitiveness found in many applications. The mechanism smartly combines metrics captured through coarse-grain sampling and instrumentation mechanisms to provide instantaneous metric reports within instrumented regions and without perturbing the application execution. To produce these reports, the folding processes metrics from different type of sources: performance and energy counters, source code and memory references. The process depends on their nature. While performance and energy counters represent continuous metrics, the source code and memory references refer to discrete values that point out locations within the application code or address space. This thesis evaluates and validates two fitting algorithms used in different areas to report continuous metrics: a Gaussian interpolation process known as Kriging and piece-wise linear regressions. The folding also takes benefit of analytical performance models to focus on a small set of performance metrics instead of exploring a myriad of performance counters. The folding also correlates the metrics with the source-code using two alternatives: using the outcome of the piece-wise linear regressions and a mechanism inspired by Multi-Sequence Alignment techniques. Finally, this thesis explores the applicability of the folding mechanism to captured memory references to detail which and how data objects are accessed.
This thesis proposes an analysis methodology for parallel applications that focus on describing the most time-consuming computing regions. It is implemented on top of a framework that relies on a previously existing clustering tool and the folding mechanism. To show the usefulness of the methodology and the framework, this thesis includes the discussion of multiple first-time seen in-production applications. The discussions include high level of detail regarding the application performance bottlenecks and their responsible code. Despite many analyzed applications have been compiled using aggressive compiler optimization flags, the insight obtained from the folding mechanism has turned into small code transformations based on widely-known optimization techniques that have improved the performance in some cases. Additionally, this work also depicts power monitoring capabilities of recent processors and discusses the simultaneous performance and energy behavior on a selection of benchmarks and in-production applications.