I began implementing compilers on painfully slow machines like the IBM PC, so having the
compiler run fast was always a big deal. It got a lot of attention, and the compilers
I wrote were known for being very fast. I naturally built the techniques learned along the way into
the DMD D compiler, and D itself is designed to be a language that is amenable to fast
compilation.

This has been a big success for D, and many engineers have told me that D's compile speed
was a huge factor in being seduced by D. Every big boost in compile speed has a
transformative effect on the development process.

It's easy to get complacent and focus on other issues with DMD. Subtle degradations in speed
can creep in, especially with a lot of people contributing to the codebase.
Compiling the D test suite was taking too much time - making that faster has a lot of leverage
in improving productivity.
In particular, some D code makes very heavy use of templates,
and these were not compiling as fast as I'd like. The library module
std.algorithm stood out.

The obvious low hanging fruit was that some templates were being instantiated 70,000 times
or more. This is a problem well known among C++ compiler guys - given a template and the
arguments to instantiate it, look to see if it's already been instantiated and simply point
to the existing one. The lookup for this was, I'm ashamed to admit, a linear list.
It's like dragging a block of concrete behind your sports car. Replacing it with a hash
table netted a mere 10% improvement; I had expected more. How disappointing.
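The lookup-before-instantiate idea can be sketched as follows. This is a minimal illustration, not DMD's actual code; the `Instance` struct and the use of `std::unordered_map` keyed on the mangled name are assumptions for the sketch.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical instantiation record; DMD's real data structures differ.
struct Instance { std::string mangledName; };

// Cache keyed by the mangled template name: on a hit, simply point to the
// existing instantiation instead of scanning a linear list.
static std::unordered_map<std::string, Instance*> instantiationCache;

Instance* getInstance(const std::string& mangledName)
{
    auto it = instantiationCache.find(mangledName);
    if (it != instantiationCache.end())
        return it->second;                  // already instantiated: reuse it
    Instance* inst = new Instance{mangledName};  // never freed, like the compiler
    instantiationCache.emplace(mangledName, inst);
    return inst;
}
```

With 70,000 instantiations of the same template, every lookup after the first is a single hash probe rather than a walk down a list.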

My next idea revolved around the issue that templates, when instantiated, are given a
string identifier that uniquely identifies them. You can see these names if you disassemble
any template-using code written in C++ or D. They tend to be long, as they incorporate a
“mangled” version of each and every template argument.

The thing is, an awful lot of templates are generated that never appear in the generated
code, as they are templates that produce a value or a type. These don't need generated
identifiers. I switched identifier generation to be lazy, i.e. an identifier is only
generated when the object code generation needs it. This also produced a perceptible
improvement, but a much more modest one than I'd anticipated.
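The lazy scheme looks roughly like this. It's a sketch under assumed names: `TemplateInstance`, `mangle()`, and the `"_D4inst"` prefix are illustrative stand-ins, not DMD's real mangling.

```cpp
#include <string>

// Hypothetical template-instance node; mangle() stands in for the real
// name-mangling routine, which is comparatively expensive for long
// argument lists.
struct TemplateInstance {
    std::string args;
    std::string cachedId;   // empty until someone actually asks for it

    std::string mangle() const { return "_D4inst" + args; }

    // The identifier is computed only on first request, so template
    // instantiations that never reach the object file never pay for it.
    const std::string& identifier() {
        if (cachedId.empty())
            cachedId = mangle();
        return cachedId;
    }
};
```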

Time to stop guessing where the speed problems were, and start instrumenting.

Time to trot out gprof, the GNU profiler. I fired it up on the slow example, and waited.
And waited, and waited, and waited. I waited overnight. It pole-axed my Ubuntu box; I had to
cold boot it. The test
case was so big that it plus gprof was too much. gprof slows things down a lot, so I had
to cut things way down to get it to work.

The top three time-eating tasks were:

1. lexing
2. hash table lookups
3. storage allocation

Lexing has always been the succubus leeching the cycles out of a compiler. It's written
with that in mind (and so is D's grammar), and I didn't see much more oil to be pressed out
of that olive.

I figured hash table lookups would be high on the list, but not that high. DMD makes heavy
use of hash tables internally. The symbol tables are all hash tables, the template instantiations
are (now!), etc. Looking at the code, nothing too obvious stuck out:
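The snippet in question was a garden-variety chained hash lookup, along these lines. The names and layout here are illustrative, not DMD's actual symbol table code:

```cpp
#include <cstddef>
#include <cstring>

// Simplified chained hash table in the spirit of DMD's symbol tables.
struct Entry { const char* key; int value; Entry* next; };

struct HashTable {
    Entry** buckets;
    size_t  len;       // number of buckets

    Entry* lookup(const char* key, size_t hash) {
        size_t i = hash % len;          // the division that shows up in profiles
        for (Entry* e = buckets[i]; e; e = e->next)
            if (strcmp(e->key, key) == 0)
                return e;
        return nullptr;
    }
};
```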

But there is something interesting about it - the key%len operation. Division is a
notoriously expensive operation. I could replace it with a mask, but that supposedly leads
to a not-so-uniform distribution. A few months back, I implemented in the compiler back
end the optimization of replacing an integral divide by constant with a multiply by the reciprocal.
[4] It's a lot faster, but here the denominator is a variable, and so I can't use that
optimization here.

Or can I? The denominator is a prime number pulled out of a table of primes:
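Something along these lines; the exact prime values below are illustrative, not necessarily DMD's:

```cpp
#include <cstddef>

// Illustrative prime table. The initial table has 4 buckets; rehashing
// steps up through primes like these as the chains grow.
static const size_t primes[] = {
    31, 127, 509, 2053, 8191, 32771, 131071, 524287,
};

// Pick the next table size once the chains get too long.
size_t nextSize(size_t current)
{
    for (size_t p : primes)
        if (p > current)
            return p;
    return current * 2 + 1;   // fallback past the end of the table
}
```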

If the hash table chains get too long, it is rehashed using a larger prime from this table.
But most of the time, the symbol tables are small (every scope has its own symbol table).
So I rewrote the key%len as:
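A sketch of the rewrite, special-casing the common small sizes (a real version would enumerate more of the table):

```cpp
#include <cstddef>

// Special-case the common small table sizes so the compiler sees a
// constant divisor and can emit the fast modulo sequence; only unusual
// sizes pay for a genuine runtime division.
size_t bucketIndex(size_t key, size_t len)
{
    size_t i;
    if (len == 4)
        i = key % 4;          // constant divisor: reduces to a mask
    else if (len == 31)
        i = key % 31;         // constant divisor: multiply by reciprocal
    else
        i = key % len;        // rare case: real division
    return i;
}
```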

The smallest table has 4 buckets, and the next larger size is 31. So, for the most common
cases, the compiler is able to generate the fast modulo code because it's dividing by
a known constant.

This change alone was worth a 5% speed boost. Not bad for 3 or 4 lines of code!

That left storage allocation.

Storage allocation is one of the great unsolved problems in programming. You can do manual
allocation at the expense of endless pointer bugs. You can do reference counting with its
performance problems and bloat. You can do garbage collection with its pausing and excessive
memory consumption.

DMD does memory allocation in a bit of a sneaky way. Since compilers are short lived programs,
and speed is of the essence, DMD just malloc's away, and never free's. This eliminates the
scaffolding and complexity of figuring out who owns the memory and when it should be released.
(It has the downside of consuming all the resources of your machine if the module being compiled
is big enough.)

But malloc() itself is designed with the presumption that the memory allocated will eventually
get free'd. Since that never happens in DMD, I tried replacing the storage allocator with a dead
simple bump-the-pointer greedy allocator:
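A bump allocator in this spirit looks like the following. The chunk size and 16-byte alignment are assumptions for the sketch, not necessarily DMD's values:

```cpp
#include <cstddef>
#include <cstdlib>

// Bump-the-pointer allocator: carve allocations out of large malloc'd
// chunks and never free anything.  The OS reclaims it all at exit.
enum : size_t { CHUNK = 256 * 1024 };   // illustrative chunk size

void* allocmemory(size_t size)
{
    static size_t heapleft = 0;
    static char*  heapp    = nullptr;

    size = (size + 15) & ~size_t(15);   // keep allocations aligned

    if (size > heapleft) {              // current chunk exhausted: grab another
        size_t newsize = size > CHUNK ? size : CHUNK;
        heapp = static_cast<char*>(malloc(newsize));
        heapleft = newsize;
    }
    void* p = heapp;
    heapp += size;
    heapleft -= size;
    return p;                           // never freed
}
```

Each allocation is just an add and a subtract on the fast path; the leftover tail of an old chunk is abandoned, which is cheap when chunks are large.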

It's like I poured nitrous down the compiler's throat. Along with the other improvements,
this achieved a near doubling of the compiler speed. (It even reduced the memory consumption
somewhat, apparently because malloc() needed memory for its data structures.)

The benchmark used was compiling Phobos' std.algorithm for unit tests, which is:

dmd std\algorithm -unittest -main

dmd 2.063 does the nasty in 21.56 seconds, and the latest development version does it
in 12.19. This was done on Windows. There's a little bit of apples vs oranges here,
because the latest Phobos is larger than the 2.063 one.

(DMD 2.064 with this improvement hasn't been released yet, but of course you can try it
out from github.)

Conclusion

Even if you know your code well, you're likely wrong about where the performance bottlenecks
are. Use a profiler. If you haven't used one on your codebase in a while, it's highly likely
there's a bottleneck in there that's fixable with just a few lines of code.