This paper shows that the barrier of one triangle per clock in rasterization rate for microtriangles can be broken efficiently in modern GPU architectures. The fixed throughput of the special-purpose culling and triangle setup stages of the classic pipeline limits the GPU's ability to rasterize many triangles in parallel when these cover very few pixels. In contrast, the shader core counts and increasing GFLOPs in modern GPUs clearly suggest parallelizing this computation entirely across multiple shader threads, making use of the powerful wide-ALU instructions. In this paper, we present a very efficient SIMD-like rasterization code targeted at very small triangles that scales very well with the number of shader cores and has higher performance than traditional edge-equation-based algorithms. We have extended the ATTILA GPU shader ISA (del Barrio et al. in IEEE International Symposium on Performance Analysis of Systems and Software, pp. 231–241, 2006) with two fixed-point instructions to meet the rasterization precision requirement. This paper also introduces a novel subpixel Bounding Box size optimization that adjusts the bounds much more finely, which is critical for small triangles, and doubles the 2x2-pixel stamp test efficiency. The proposed shader rasterization program can run on top of the original pixel shader program in such a way that selected fragments are rasterized, attribute interpolated and pixel shaded in the same pass. Our results show that our technique yields better performance than a classic rasterizer at 8 or more shader cores, with speedups as high as 4x for 16 shader cores.
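To make the role of a subpixel-accurate bounding box concrete, the following Python sketch rasterizes one triangle with integer edge functions over fixed-point coordinates. It is an illustration only, not the SIMD-like shader code or the exact Bounding Box optimization of the paper: the 4 subpixel bits, the function names, and the simplified coverage rule (pixel centers lying exactly on an edge count as covered, with no top-left fill rule) are all assumptions made for this example. It uses the traditional edge-equation test, which the paper treats as the baseline.

```python
SUBPIXEL_BITS = 4                 # assumed precision: 1/16 of a pixel
ONE = 1 << SUBPIXEL_BITS          # one pixel in fixed-point units
HALF = ONE >> 1                   # offset of a pixel center


def to_fixed(v):
    """Convert a floating-point coordinate to fixed point (hypothetical helper)."""
    return int(round(v * ONE))


def edge(ax, ay, bx, by, px, py):
    """Integer edge function: sign tells on which side of a->b the point p lies."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def center_bounds(lo, hi):
    """First and last pixel index whose center lies inside [lo, hi] (fixed point)."""
    first = (lo - HALF + ONE - 1) >> SUBPIXEL_BITS   # ceil((lo - HALF) / ONE)
    last = (hi - HALF) >> SUBPIXEL_BITS              # floor((hi - HALF) / ONE)
    return first, last


def rasterize(v0, v1, v2):
    """Return the (x, y) pixels whose centers are covered by the triangle."""
    (x0, y0), (x1, y1), (x2, y2) = [(to_fixed(x), to_fixed(y)) for x, y in (v0, v1, v2)]
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return []                                    # degenerate triangle
    if area < 0:                                     # normalize the winding order
        (x1, y1), (x2, y2) = (x2, y2), (x1, y1)
    x_first, x_last = center_bounds(min(x0, x1, x2), max(x0, x1, x2))
    y_first, y_last = center_bounds(min(y0, y1, y2), max(y0, y1, y2))
    covered = []
    for py in range(y_first, y_last + 1):
        for px in range(x_first, x_last + 1):
            cx = (px << SUBPIXEL_BITS) + HALF        # pixel center, fixed point
            cy = (py << SUBPIXEL_BITS) + HALF
            if (edge(x0, y0, x1, y1, cx, cy) >= 0
                    and edge(x1, y1, x2, y2, cx, cy) >= 0
                    and edge(x2, y2, x0, y0, cx, cy) >= 0):
                covered.append((px, py))
    return covered
```

In this sketch, a microtriangle such as `rasterize((0.1, 0.1), (0.4, 0.1), (0.1, 0.4))` produces an empty bounding box, because no pixel center falls between its extremes, so it is discarded without a single coverage test. A box rounded outward to whole pixels would still have tested pixel (0, 0).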

Two of the most important performance limiters in today's processor families come from solving the memory wall and handling control dependencies. In order to address these issues, cache memories and branch predictors are well-known hardware proposals that take advantage of, among other things, both temporal memory reuse and branch correlation. In other words, they try to exploit the dynamic redundancy existing in programs. This redundancy comes partly from the way that programmers write source code, but also from limitations in the compilation model of traditional compilers, which introduces unnecessary memory and conditional branch instructions. We believe that today's optimizing compilers should be very aggressive in optimizing programs, and should therefore be expected to optimize a significant part of this redundancy away. On the other hand, optimizations performed at link-time or directly applied to final program executables have received increased attention in recent years, due to limitations in the traditional compilation model. First, even though they perform sophisticated interprocedural analyses and transformations, traditional compilers do not have the opportunity to optimize the program as a whole. A similar problem arises when applying profile-directed compilation techniques: large projects will be forced to re-build every source file to take advantage of profile information. By contrast, it would be more convenient to build the full application, instrument it to obtain profile data and then re-optimize the final binary without recompiling a single source file. In this thesis we present new profile-guided compiler optimizations for eliminating the redundancy encountered in executable programs at the binary level (i.e., binary redundancy), even though these programs have been compiled with full optimizations using a state-of-the-art commercial compiler.
In particular, our Binary Redundancy Elimination (BRE) techniques are targeted at eliminating both redundant memory operations and redundant conditional branches, which are the most important ones for addressing the performance issues that we mentioned above in today's microprocessors. These new proposals are mainly based on Partial Redundancy Elimination (PRE) techniques for eliminating partial redundancies in a path-sensitive fashion. Our results show that, by applying our optimizations, we are able to achieve a 14% execution time reduction in our benchmark suite. In this work we also review the problem of alias analysis at the executable program level, identifying why memory disambiguation is one of the weak points of object code modification. We then propose several alias analyses to be applied in the context of link-time or executable code optimizers. First, we present a must-alias analysis to recognize memory dependencies in a path-sensitive fashion, which is used in our optimization for eliminating redundant memory operations. Next, we propose two speculative may-alias data-flow algorithms to recognize memory independencies. These may-alias analyses are based on introducing unsafe speculation at analysis time, which increases alias precision on important portions of code while keeping the analysis reasonably cost-efficient. Our results show that our analyses prove to be very useful for increasing memory disambiguation accuracy of binary code, which translates into opportunities for applying optimizations. All our algorithms, both for the analyses and the optimizations, have been implemented within a binary optimizer, which overcomes most of the existing limitations of traditional source-code compilers. Therefore, our work also points out the most relevant issues of applying our algorithms at the executable code level, since most of the high-level information available in traditional compilers is lost.
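As a rough illustration of how a must-alias relation enables the removal of redundant memory operations, the Python sketch below eliminates fully redundant loads inside a single straight-line block. It is far simpler than the path-sensitive, PRE-based techniques described in the thesis and is not taken from it. The tuple-based instruction format, the register names, and the function names are hypothetical, and the sketch assumes that all memory accesses have the same width and that offsets are multiples of that width, so two accesses through the same unmodified base register with different offsets cannot overlap.

```python
def kill(available, reg):
    """Drop every known memory value that depends on a redefined register."""
    for key in [k for k, v in available.items() if k[0] == reg or v == reg]:
        del available[key]


def eliminate_redundant_loads(block):
    """Replace loads whose value is already held in a register by moves.

    Instructions are tuples (hypothetical toy format):
      ("load", dst, base, offset), ("store", src, base, offset),
      ("move", dst, src), or (op, dst, a, b) for any other ALU operation.
    """
    available = {}   # (base, offset) -> register known to hold that memory value
    out = []
    for ins in block:
        op = ins[0]
        if op == "load":
            _, dst, base, off = ins
            src = available.get((base, off))
            if src == dst:
                continue                       # value is already in dst: drop the load
            out.append(("move", dst, src) if src is not None else ins)
            kill(available, dst)               # dst now holds a new value
            if src is None and dst != base:
                available[(base, off)] = dst   # must-alias: same base, same offset
        elif op == "store":
            _, src, base, off = ins
            out.append(ins)
            # Conservative may-alias rule: keep only entries that provably do not
            # overlap the store (same base register, different offset).
            available = {k: v for k, v in available.items()
                         if k[0] == base and k[1] != off}
            available[(base, off)] = src       # store-to-load forwarding
        else:
            out.append(ins)
            kill(available, ins[1])            # ins[1] is the destination register
    return out
```

Any store through a different base register invalidates every tracked value in this sketch, which shows why weak memory disambiguation limits the optimization at the binary level.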

Fernandez Gomez, Manel; Espasa Sans, Roger. 8th Annual Workshop on Interaction between Compilers and Computer Architecture (INTERACT-8), in conjunction with the IEEE 10th International Symposium on High-Performance Computer Architecture (HPCA-10). Presentation of work at congresses.

The papers presented in this combined topic consider issues related to the broad theme of computer architecture research. The program reflects the current emphasis of research on the exploitation of instruction-level parallelism and thread-level parallelism, with the papers presented covering several important aspects of both approaches: branch prediction, speculative multithreading, pipelining and superscalar architecture design, SIMD extensions, and dynamic scheduling issues in multithreaded architectures.