Appfolio Engineering

These days, my job is to speed up Ruby (thanks, AppFolio!) One way to do that is by careful configuration options for building the Ruby source. Let's talk about a nice recent discovery which may help you out. It works if you use CLang -- as Mac OS does by default, and Linux easily can. CLang is intended to be the next GCC by replacing it with a fully-compatible LLVM-based successor. CLang is most of the way there, and ships by default on the Mac. It's available pretty much everywhere. In addition to MacOS, we'll also talk about this speedup on CentOS, a popular OS to deploy Ruby over top of.

Function Inlining

By default, GCC will inline functions on the higher optimization levels. That is, for short, simple functions, it will just substitute the whole function body where you called the function. That lets you skip a function call at runtime. But more important, it lets the compiler skip parts of the function that aren't used, special-case for the parameters that get passed in and otherwise speed up the code.

The down-side is that the code gets bigger - if you put in the full function definition at each call site, things get big. So there's a balancing act for how big a function you inline -- if you inline huge functions it doesn't buy you much speed and the code gets huge. If you never inline anything, the code is as small as it can be, but also slow.

CLang has an easy way to tune what size function you inline. It has an internal metric of how big/complex a function is, and a threshold (default 225) for where it inlines. But you can turn that threshold way up and inline much bigger functions if you want to.

The result is a bit hard to measure. I'll write another blog post on the stats-based profiler I wrote to convince myself that it does speed things up noticeably. But I'm now convinced - with higher inlining settings, you can make a Ruby that's faster than vanilla Ruby, which uses -O3 and the default inlining threshold of 225.

How Big? How Fast? (Mac Edition)

I'm using a standard Ruby benchmark called optcarrot to measure speed here as I tend to do. I'm adding inlining options to Ruby's defaults for optimization ("export optflags='-O3 -fno-fast-math -mllvm -inline-threshold=5000'"). Each of the measurements below used hundreds of trials, alternating between the two implementations, because benchmarks can be noisy :-/

Still, how much faster?

cflags="-mllvm -inline-threshold=1000": about 5% faster?

cflags="-mllvm -inline-threshold=1800": 5% faster

cflags="-mllvm -inline-threshold=2500": 6% faster

cflags="-mllvm -inline-threshold=5000": 7% faster

Those Rubies are also bigger, though. Plain Ruby on my Mac is about 3MB. With an inline threshold of 1000, it's about 4.8MB. At a threshold of 2500, it's about 5.5MB. With a threshold of 5000, it's about 6MB. This doesn't change the size of most Ruby processes, but it does (once per machine) take that extra chunk of RAM -- a 6MB Ruby binary will consume 3MB of additional RAM, over what a 3MB Ruby binary would have taken.

Note: these may look linear-ish. Really they're just very unpredictable, because it depends which functions get inlined, how often they're called, how well they optimize... Please take the "X% faster" above as being noisy and hard to verify, even with many trials. The 1000-threshold and 1800-threshold are especially hard to tell apart, says my statistical profiling tool. Even 1000 and 2500 aren't easy. Also: measured vs recent MacOS with a recent Ruby 2.4.0 prerelease version.

How Big? Host Fast? (CentOS 6.6 Edition)

I didn't do the full breakdown for CentOS, but... It's actually more of a difference. If I measure the difference between -inline-threshold=5000 and default CLang (with -O3 and -fno-fast-math, the defaults, which uses 225 as the inline threshold), I measure about 14% faster.

But that's cheating - the version of CLang available here (3.4.2) is slower than the default GCC on CentOS 6.6. So how about measuring it versus GCC?

Still about 9% faster than GCC.

The CentOS binaries are a bit bigger than MacOS, closer to 8MB than 3MB with debugging info. So the doubling in size makes it bigger, about 18MB instead of 6MB. Beware if you're tight on RAM. But the speedup appears to be true for CentOS, too.

Disclaimers

So: can you justmake Ruby 10% faster? Only if you're CPU-bound. Which you probably aren't.

I'm measuring versus optcarrot, which is intentionally all about CPU calculations in Ruby. "Normal" Ruby apps, like Rails apps, tend to waiting on the OS, or the database, or the network, or...

Nothing you can do to Ruby will speed up the OS, the database or the network by much.

But if you're waiting on Ruby, this can help.

Did Somebody Do It For Me?

Wondering if this got picked up in mainline Ruby, or otherwise made moot? You can check the flags that your current Ruby was compiled with easily:

ruby -r rbconfig -e "puts RbConfig::CONFIG['CFLAGS']"

You're looking for flags like "-O3" (optimization) and "-mllvm -inline-threshold=5000" (to tune inlining.) You can do the same thing with ./tool/runruby inside a built Ruby directory. If you see the inline-threshold flag, somebody has already compiled this into your Ruby.

Conclusion

This option is a great deal if you're on a desktop Mac -- 3MB of RAM for a 7% faster Ruby -- but questionable on a tiny Linux virtual host -- 8MB for 10% faster, but you're probably not CPU-bound.

It is, however, only available with CLang, not with vanilla GCC.

I've opened a bug with a patch to make this behavior default with CLang. Let's see if the Ruby team goes for it! And even if they don't, perhaps ruby-build or RVM will take up the torch. On a modern Mac, this seems like a really useful thing to add.