PrimeTile - A parametric multi-level tiler for imperfect
loop nests

Loop tiling, as one of the most important compiler optimization
techniques, is beneficial for both parallel machines and
uniprocessors. Efficient generation of multi-level tiled code is
essential to maximize data reuse in deep memory hierarchies. Tiled
loops with parameterized tile sizes (not compile time constants)
enable runtime optimizations used in iterative compilation and
automatic tuning. Previous parametric multi-level tiling approaches
have been restricted to perfectly nested loops, where all statements
are contained inside the innermost loop of a loop nest. Previous
solutions to tiling for imperfect loop nests have been limited to the
case where tile sizes are fixed. PrimeTile provides an
effective way to generate efficient parameterized multi-level tiled
code for imperfectly nested loops. The generated tiled code contains
loops that iterate over full rectangular tiles, making them suitable
for compiler optimizations such as register tiling.

Tiling is a crucial transformation for achieving high performance,
especially with deep multi-level memory hierarchies. It is a well
known technique for improving data locality and register reuse. Tiling
has received a lot of attention in the compiler community. However,
the majority of work only addresses tiling of perfectly nested
loops. With perfectly nested loops, tiling is possible when a band of
loops is fully permutable. From a code viewpoint, tiling involves
strip mining, i.e. splitting loops into an adjacent pair of tile and
intra-tile loops, and permutation to bring all intra-tile loops
innermost. The condition for permutability of a band of loops is that
all correspondingly permuted dependence vectors must be
lexicographically positive.

With imperfectly nested loops too, tiling involves strip mining loop
splitting and permutation, along with loop distribution. The reasoning
about legality of tiling of imperfectly nested loops requires a more
general dependence model than dependence vectors. The earliest works
to develop approaches to tile imperfectly nested loops effectively
mapped the instances of statements of an imperfectly nested loop (of
possibly different nesting depths in the input code) into a common
embedding iteration space and then performed tiling in the framework
of the common embedding space. A critical unresolved challenge with
this approach is that of developing an effective algorithm for
generating efficient tiled code where lower dimensional statements are
hoisted out so that they do not have more loops surrounding them than
their inherent dimensionality - heuristics have been proposed, but no
generally effective implementation has been developed to our
knowledge.

A major breakthrough in efficient code generation for multiple
statements in imperfectly nested loops was achieved by the development
of the polyhedral scanning algorithm of Quillere et al. and its
refinement and optimization in CLooG. An innovative scheme for
efficient parametric tiling of arbitrary polyhedral statement domains
was developed by Renganarayana et al. and extended to efficiently
generate multi-level parametric tiled code. However, this work is only
applicable to perfect loop nests. The first effective approach for
tiling of imperfectly nested loops was recently developed in the Pluto
polyhedral transformation framework. However, it is only able to
generate tiled code with fixed tile sizes. While the approach can
handle multi-level tiling, the code generation complexity grows
explosively; the latest distribution of Pluto was found to fail for
some inputs when generating multi-level tiled code.

In this work on PrimeTile, we develop an effective approach to parametric
multi-level tiling of imperfectly nested affine loops. The importance
of parametric tiling is exemplified by the highly successful ATLAS
system for empirical tuning of BLAS kernels. ATLAS uses parametrically
tiled BLAS kernels that are repeatedly executed on the target
architecture for different problem sizes using an empirical search
strategy that varies the tile sizes. The ATLAS system was manually
engineered by experts with insights into tiling for optimization of
BLAS kernels. There has been much recent interest in developing
generalized tuning systems that can similarly tune and optimize codes
input by users or library developers. An efficient parametric tiling
tool is extremely valuable for generating input tiled codes for such
empirical tuning systems.

The key to our approach PrimeTile is the exploitation of the power and
effectiveness of the Quillere algorithm in generating untiled
imperfectly nested code, along with use of input scattering functions
(affine schedules for the different statements) satisfying a
generalized tiling condition and a geometric approach to generation of
parametric tiles by post-processing the abstract syntax tree (AST) of
the loop structure generated by CLooG.