Compilation performance

This document describes a systematic method for detecting, isolating,
correcting, and preventing regressions in compiler performance. The
method focuses on writing small, end-to-end scaling tests that
characterize, and stand in for, larger codebases on which the compiler
is observed to perform poorly. These scaling tests have a number of
advantages over "full sized" performance-testing codebases:

- They run quickly, taking just long enough to measure the relevant counters.
- They use a clear criterion (super-linearity) for "bad performance".
- They are small and comprehensible.
- They can focus on a single, isolated performance problem.
- They can be written and shared by users, characterizing patterns in private codebases.

The method is first described in general; we then work through an
example of finding and fixing a compilation-performance bug.

(The same method can also be used for measuring and correcting
problems in generated code, insofar as the problem you're interested
in manifests as something that can be counted while compiling,
e.g. the quantity of LLVM IR emitted.)

General method

1. Formulate a linear (or sub-linear) relationship that should hold
between some aspect of program input size N and some measurement W of
work done by the compiler.

2. Add a statistic counter to the compiler that measures W.

3. Write a small scaling testcase: a .gyb file varying in N, that can
be run by utils/scale-test (a minimal template is sketched after this
list).

4. Run utils/scale-test selecting for W, and see whether W grows
faster than O(N^1).

5. If not, return to step 1 and examine a new relationship.

6. If so, run the testcase at a small scale under a debugger, breaking
in the context where W is counted, and find the unexpected contexts
and causes of the extra work.

7. Fix the unexpected causes of work, and rerun the test to confirm
(sub-)linearity.

8. Add the testcase to the validation-test/compiler_scale directory
to prevent future performance regressions.
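For step 3, a minimal scaling testcase can be a gyb template that
emits one trivial declaration per unit of scale. This is a
hypothetical sketch, assuming the template is rendered with N bound to
the current scale value (as the testcases in the repository are):

```
% for i in range(int(N)):
func function_${i}() {
}
% end
```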

Example

As an example (arising from a bug reported in the field), consider the
work done typechecking the function bodies of the property getters and
setters that are synthesized for nominal types. Ideally, when
compiling a multi-file module, we expect that only the getters and
setters in the file currently being compiled are subject to
function-body typechecking.

That is, our independent scaling variable N will be the number of
files in the module, and our dependent work variable W will count the
number of calls to the function-body typechecking routine in the
compiler. We will check whether this relationship is linear.

To formulate this relationship precisely (as a testcase), we begin by
ensuring that the compiler has a statistic that measures the dependent
variable. To do this, we select a location in the compiler to
instrument, ensure the containing file includes
swift/Basic/Statistic.h and defines a local DEBUG_TYPE macro naming
the local context, and then add a single SWIFT_FUNC_STAT macro to the
function we want to count. In this case, we place the counter
immediately inside typeCheckAbstractFunctionBody:
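Sketched in code, the instrumentation looks something like this (the
DEBUG_TYPE value and the abridged function signature are illustrative;
the actual definitions live in the typechecker sources):

```
#include "swift/Basic/Statistic.h"

#define DEBUG_TYPE "typecheck"   // illustrative; names the statistics context

bool TypeChecker::typeCheckAbstractFunctionBody(AbstractFunctionDecl *AFD) {
  // Bumps a counter named after this function each time it is called,
  // in compiler builds with statistics support.
  SWIFT_FUNC_STAT;
  // ... existing function-body typechecking ...
}
```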

Recall that the relationship we're interested in testing is one that
arises during multi-file compilation: swiftc is invoked with multiple
input files comprising a module, and its driver subsequently runs one
frontend job per file in the module, treating that file as primary and
parsing (but not translating) all the other files as additional
inputs.

This type of scale test is not the default mode of the scale-test
script, but it is supported via the --sum-multi command-line flag. In
this mode, scale-test collects and sums together the statistics of all
the primary-file frontend jobs in a multi-file driver job.
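An invocation of this mode might look like the following; the --select
flag (choosing which counter to track), the counter name, and the
testcase filename are assumptions on our part, while --sum-multi is
the flag described above:

```
$ utils/scale-test --sum-multi --select typeCheckAbstractFunctionBody scale_getset.gyb
```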

In order to test the relationship, we start with a single
declaration in each file, and vary the number of files. The simplest
test therefore looks like this:

```
struct Struct${N} {
  var Field : Int?
}
```

When we run this file under scale-test, however, we see only a linear
relationship between the number of files and the counter.

This is encouraging, but since it doesn't reproduce the bad behaviour
described in the bug report, we make our testcase just a little more
complicated, having the structs in each file refer to the adjacent
file in the module:
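Concretely, the template can compute each struct's field type from its
index; the gyb expression here is our reconstruction of the testcase
the surrounding text describes:

```
struct Struct${N} {
  var Field : ${"Struct%d?" % (int(N) - 1) if int(N) > 0 else "Int?"}
}
```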

Now the 0th struct has an Int?-typed property and every other struct
has a property whose type refers to a definition in a neighbouring
file; this is a bit more pathological than most codebases, but not too
unrealistic. When we run this case under scale-test, the problematic
behaviour shows up very clearly: the counter grows super-linearly in
the number of files.

Next we get to the part that's hard to describe through anything other
than debugging experience: finding and fixing the bug. The scale-test
script can save us some setup difficulty by invoking lldb on one of
the frontend jobs if we pass it --debug, but beyond that it's up to us
to diagnose and fix the problem.

In this case the unwanted work can be inhibited by modifying
addTrivialAccessorsToStorage; the fix involves redirecting some calls
from typeCheckDecl to the less-involved validateDecl (along with some
other compensating changes, omitted here for brevity).
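Schematically, the change looks something like this. This is a
simplified sketch, not the actual patch: the real typeCheckDecl and
validateDecl calls take additional arguments, and the compensating
changes are not shown:

```
// Before: fully typechecks the storage declaration, which ends up
// typechecking synthesized accessor bodies even in non-primary files.
TC.typeCheckDecl(storage);

// After: merely validates the declaration, establishing its type
// without typechecking function bodies.
TC.validateDecl(storage);
```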

To prevent future regression, we put the scale test in the testsuite.
The scale-test driver script considers a test as failing if the
counter it selects scales worse than O(n^1.2) (the extra 0.2 gives a
little room for error).

So our investigation testcase is nearly usable as a regression test;
we need only a few modifications. First, we pass --parse, since there
is no need to run full code generation on each iteration. Second, we
pass a more limited set of scaling steps: enough to show the problem,
but faster to run. Finally, we use the lit.py test-execution framework
both to run the test and to make it conditional on a release-mode
compiler, further limiting its impact on testsuite execution time. The
final regression test, which we put in
validation-test/compiler_scale/scale_neighbouring_getset.gyb,
looks like this:
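What follows is our reconstruction of that test. The RUN-line step
bounds, the --select counter name, and the REQUIRES condition are
assumptions on our part; the file in the repository is authoritative:

```
// RUN: %scale-test --parse --sum-multi --begin 5 --end 16 --step 5 --select typeCheckAbstractFunctionBody %s
// REQUIRES: asserts

struct Struct${N} {
  var Field : ${"Struct%d?" % (int(N) - 1) if int(N) > 0 else "Int?"}
}
```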