Details

Introduction

Currently llvm-mca only accepts assembly code as input. We would like to
extend llvm-mca to support object files, allowing users to analyze the
performance of binaries. The proposed changes optionally introduce an object
file section, but this can be stripped out if desired.

For the llvm-mca binary support feature to be useful, a user needs to tell
llvm-mca which portions of their code they would like analyzed. Currently,
this is accomplished via assembly comments. However, assembly comments are not
preserved in object files, which motivated this RFC. For the proposed
binary support, we need to introduce changes to clang and llvm to allow the
user's object code to be recognized by llvm-mca:

We need a way for a user to identify a region/block of code they want analyzed by llvm-mca.

We need the information defining the user's region of code to be maintained in the object file so that llvm-mca can analyze the desired region(s) from the binary object file.

We define a "code region" as a subset of a user's program that is to be
analyzed via llvm-mca. The sequence of instructions to be analyzed is
represented as a pair: <start, end> where 'start' marks the beginning of
the user's region and 'end' terminates the sequence. The instructions
between 'start' and 'end' form the region that can be analyzed by llvm-mca at a
later time.

Example

Before we go into the details of this proposed change, let's first look at a
simple example:
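
A minimal sketch of such a marked region is given here. The function name dot3
is hypothetical, and the two marker functions are defined as no-op stand-ins
(an assumption, so that the snippet compiles without the proposed patch); with
the patch applied they would be clang builtins rather than ordinary functions.

```cpp
#include <cassert>

// ASSUMPTION: no-op stand-ins for the proposed builtins, so this sketch
// compiles with an unpatched compiler.
static inline void mca_code_region_start(int /*region_id*/) {}
static inline void mca_code_region_end() {}

// Hypothetical user function; the marked region is one dot-product expression.
int dot3(const int *a, const int *b) {
  assert(a && b && "both vectors must be non-null");
  mca_code_region_start(42);  // begin region 42 (the id is chosen by the user)
  int result = a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
  mca_code_region_end();      // end of the region opened above
  return result;
}
```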

In the example above, we have identified a code region, in this case a single
dot-product expression. For the sake of brevity and simplicity, we've chosen
a very simple example, but in reality a more complicated example could use
multiple expressions. We have also denoted this region as number 42. That
identifier is only for the user, and simplifies reading an llvm-mca analysis
report later.

When this code is compiled, the region markers (the mca_code_region markers)
are transformed into assembly labels. While the markers are presented as
function calls, in reality they are no-ops.

The assembly has been trimmed to show the portions relevant to this RFC.
Notice the labels enclose the user's defined region, and that they preserve the
user's arbitrary region identifier, the ever-so-important region 42.

In the object file section .mca_code_regions, we have noted the user's region
identifier (.quad 42), start address, and region size. A more complicated
example can have multiple regions defined within a single .mca_code_regions
section. This section can be read by llvm-mca, allowing llvm-mca to take
object files as input instead of assembly source.
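
As a rough illustration of how a reader of this section might decode it, the
sketch below parses a byte buffer into region records. The record layout is an
assumption: the text only states that each region carries an identifier, a
start address, and a size, so three consecutive little-endian 64-bit fields are
assumed here. The names CodeRegionRecord, readLE64, and parseCodeRegions are
hypothetical and are not taken from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMED record layout: id, start address, size, each a 64-bit field.
struct CodeRegionRecord {
  std::uint64_t id;
  std::uint64_t start;
  std::uint64_t size;
};

// Read a little-endian 64-bit value starting at bytes[offset].
static std::uint64_t readLE64(const std::vector<std::uint8_t> &bytes,
                              std::size_t offset) {
  std::uint64_t value = 0;
  for (int i = 7; i >= 0; --i)
    value = (value << 8) | bytes[offset + i];
  return value;
}

// Decode every record in the raw contents of the section.
std::vector<CodeRegionRecord>
parseCodeRegions(const std::vector<std::uint8_t> &section) {
  const std::size_t RecordSize = 3 * sizeof(std::uint64_t);
  assert(section.size() % RecordSize == 0 && "truncated record");
  std::vector<CodeRegionRecord> regions;
  for (std::size_t off = 0; off < section.size(); off += RecordSize)
    regions.push_back({readLE64(section, off), readLE64(section, off + 8),
                       readLE64(section, off + 16)});
  return regions;
}
```

Because the records are fixed-size under this assumption, multiple regions are
simply laid out back to back in the buffer.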

Details

We need a way for a user to identify a region/block of code they want analyzed
by llvm-mca. We solve this problem by introducing two intrinsics that a user can
insert to identify regions of code for analysis.

The two intrinsics are: llvm.mca.code.regions.start and
llvm.mca.code.regions.end. A user can identify a code region by inserting the
mca_code_region_start and mca_code_region_end markers. These are simply
clang builtins and are transformed into the aforementioned intrinsics during
compilation. The code between the intrinsics is what we call a "code region"
and is meant to be easily identifiable by llvm-mca; any code between a start/end
pair can be analyzed by llvm-mca at a later time. A user can define multiple
non-overlapping code regions within their program.

The llvm.mca.code.region.start intrinsic takes an integer constant as its only
argument. This argument is implemented as a metadata i32, and is only used
when generating llvm-mca reports. This value allows a user to more easily
identify a specific code region. llvm.mca.code.region.end takes no arguments.
Since we disallow nesting of regions, the first 'end' intrinsic lexically
following a 'start' intrinsic represents the end of that code region.
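
The pairing rule can be sketched as a single pass over the markers in program
order. This is only an illustration of the rule stated above, not the code in
the patch; the types Marker and Region and the function pairMarkers are
hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class MarkerKind { Start, End };

// A start or end marker as it appears in program order (hypothetical type).
struct Marker {
  MarkerKind kind;
  int regionId;          // meaningful for Start markers only
  std::size_t position;  // position of the marker in the instruction stream
};

struct Region {
  int id;
  std::size_t begin;
  std::size_t end;
};

// Pair each start with the first end that follows it. Returns false if the
// markers are nested, unbalanced, or left unterminated.
bool pairMarkers(const std::vector<Marker> &markers,
                 std::vector<Region> &regions) {
  regions.clear();
  bool haveOpen = false;
  Marker open = {MarkerKind::Start, 0, 0};
  for (const Marker &m : markers) {
    if (m.kind == MarkerKind::Start) {
      if (haveOpen)
        return false;  // a second start before an end would be nesting
      open = m;
      haveOpen = true;
    } else {
      if (!haveOpen)
        return false;  // an end with no preceding start
      assert(m.position >= open.position && "markers must be in program order");
      regions.push_back({open.regionId, open.position, m.position});
      haveOpen = false;
    }
  }
  return !haveOpen;  // a trailing start without an end is rejected
}
```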

Now that we have a way to identify regions for analysis, we need a way to
preserve that information so it can be read at a later time. To accomplish
this we propose adding a new section (.mca_code_regions) to the object file
generated by llvm. During code generation, the start/end intrinsics described
above will be transformed into start/end labels in assembly. When llvm
generates the object file from the user's code, these start/end labels form a
pair of values identifying the start address of the user's code region and its
size. The size is the number of bytes between the addresses of the start and
end labels. Note that the labels are emitted during assembly printing. We hope
that these labels have no influence on code generation or basic-block
placement. However, the target assembler strategy for handling labels is
outside of our control.

This proposed change affects the size of a binary, but only if the user calls
the start/end builtins mentioned above. The additional size of the
.mca_code_regions section, which we imagine to be very small (on the order of a
few bytes), can trivially be stripped by tools like 'strip' or 'objcopy'.

Implementation Status

We currently have the proposed changes implemented at the URL posted below.
This initial patch only targets ELF object files, and does not handle
relocatable addresses. Since the start of a code region is represented as an
assembly label, and referenced in the .mca_code_regions section, that address
is relocatable. That value can be represented as a section-relative relocatable
symbol (.text + addend), but we are not handling that case yet. Instead, the
proposed changes only handle linked/executable object files.

The change is presented as a monolithic patch; however, when the time comes
it will be split into three patches:

Following some discussion of the RFC on the mailing list, this patch makes a few improvements:

Object files and executables are supported. This is accomplished by scanning the symbol table for mca_code_region_start and mca_code_region_end symbols.

This solution does not rely on target-specific relocations.

Regions cannot be nested, so a start of region label/symbol must be followed by an end (this has always been the case).

Symbol names are encoded with the user's defined region number. That number is purely cosmetic and only helps the user; llvm-mca can use it to annotate its analysis reports.

I forgot to mention, this also removes the need for the .mca_code_regions object file section. All of the parsing and code-region identification is performed via the symbol table.
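
To make the symbol-table approach concrete, here is a small sketch of decoding
a region marker from a symbol name. The exact name encoding is not given in
this summary, so the format assumed below, a prefix followed by '.' and a
decimal region number (for example "mca_code_region_start.42"), is purely
hypothetical, as are the names RegionSymbol and parseRegionSymbol.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

struct RegionSymbol {
  bool isStart;            // true for a start marker, false for an end marker
  std::uint64_t regionId;  // the user's region number
};

// ASSUMED encoding: "<prefix>.<decimal id>". Returns false for any symbol
// that does not match, so ordinary symbols are simply skipped.
bool parseRegionSymbol(const std::string &name, RegionSymbol &out) {
  static const std::string StartPrefix = "mca_code_region_start.";
  static const std::string EndPrefix = "mca_code_region_end.";
  const std::string *prefix = nullptr;
  if (name.compare(0, StartPrefix.size(), StartPrefix) == 0)
    prefix = &StartPrefix;
  else if (name.compare(0, EndPrefix.size(), EndPrefix) == 0)
    prefix = &EndPrefix;
  else
    return false;
  assert(prefix != nullptr);

  const std::string digits = name.substr(prefix->size());
  if (digits.empty())
    return false;
  std::uint64_t id = 0;
  for (char c : digits) {
    if (!std::isdigit(static_cast<unsigned char>(c)))
      return false;
    id = id * 10 + static_cast<std::uint64_t>(c - '0');
  }
  out.isStart = (prefix == &StartPrefix);
  out.regionId = id;
  return true;
}
```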

In short, we let llvm-mca handle multiple blocks as it always has. To be fair here, llvm-mca doesn't handle branch instructions, but a user can currently place LLVM-MCA assembly comments such that the instructions cross multiple blocks.

This update allows llvm-mca to sort code regions based on both the start address of the region and the compiler-generated sequence number.
The sequence number is useful when sorting regions that might begin/end at the same address. This can be helpful in cases where a region begins immediately after the previous one ends.
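
A minimal sketch of that ordering, assuming each region records its start
address and sequence number. The struct CodeRegion and the function sortRegions
are hypothetical names and do not reflect the data structures used in llvm-mca.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical in-memory form of a region recovered from the symbol table.
struct CodeRegion {
  std::uint64_t start;  // address of the start marker
  std::uint64_t end;    // address of the end marker
  unsigned sequence;    // compiler-generated sequence number
  int userId;           // the user's region number (cosmetic)
};

// Order by start address, using the sequence number to break ties.
void sortRegions(std::vector<CodeRegion> &regions) {
  for (const CodeRegion &r : regions)
    assert(r.start <= r.end && "a region cannot end before it starts");
  std::sort(regions.begin(), regions.end(),
            [](const CodeRegion &a, const CodeRegion &b) {
              return std::tie(a.start, a.sequence) <
                     std::tie(b.start, b.sequence);
            });
}
```

Comparing (start, sequence) tuples keeps the order deterministic even when two
regions share a start address.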

I've set hasSideEffects to true, so that DeadMachineInstructionElim does not remove the llvm-mca code markers under optimization. However, it's certainly possible that optimizations will move code outside of the region.

Update the SelectionDAG handling of the mca_code_region_start and mca_code_region_end intrinsics.

The previous version of this patch just emitted the machine instructions when building the SelectionDAG. Now we are generating SDNodes, which seems to give a better representation of the code blocks during inlining/optimization.