Tuesday, October 6, 2015

Jitterdämmerung

So, Windows 10 has just been released, and with it the Ahead-of-Time (AOT) compilation feature .NET Native. Google also recently introduced ART for
Android, and I just discovered that Oracle is planning an AOT compiler for mainstream Java.

With Apple doggedly sticking to Ahead of Time Compilation for Objective-C and now their new Swift, JavaScript
is pretty much the last mainstream hold-out for JIT technology. And even in JavaScript, the state-of-the-art
for achieving maximum performance appears to be asm.js, which largely eschews JIT techniques by acting as
object-code in the browser represented in JavaScript for other languages to be AOT-compiled into.

I think this shift away from JITs is not a fluke but was inevitable; in fact, the big question is why
it has taken so long (probably industry inertia). The benefits were always less than advertised,
the costs higher than anticipated. More importantly though, the inherent performance characteristics
of JIT compilers don't match up well with most real world systems, and the shift to mobile has only
made that discrepancy worse. Although JITs are not going to go away completely, they are fading
into the sunset of a well-deserved retirement.

Advantages of JITs less than promised

I remember reading the IBM Systems Journal issue on Java Technology back in 2000, I think. It had a bunch of research articles describing
super amazing VM technology with world-beating performance numbers. It also had a single real-world report,
from IBM's San Francisco project. In the real world, it turned out, performance was a bit more "mixed", as
they say. In other words: it was terrible, and they had to do an incredible amount of work for the system
to be even remotely usable.

There was also the experience of the New Typesetting System (NTS), a rewrite of TeX in Java. Performance
was atrocious; the team took it with humor and chose a snail as their logo.

One of the reasons for this less than stellar performance was that JITs were invented for highly dynamic
languages such as Smalltalk and Self. In fact, the Java Hotspot VM can be traced in a direct line to
Self via the Strongtalk system, whose creator, Animorphic Systems, was purchased by Sun in order to acquire its VM technology.

However, it turns out that one of the biggest benefits of JIT compilers in dynamic languages is figuring
out the actual types of variables. This is a problem that is theoretically intractable (equivalent to
the halting problem) and practically fiendishly difficult to do at compile time for a dynamic language.
It is trivial to do at runtime: all you need to do is record the actual types as they fly by. If you
are doing Polymorphic Inline Caching, just look at the contents of the caches after a while. It is
also largely trivial to do for a statically typed language at compile time, because the types are right
there in the source code!

So gathering information at runtime simply isn't as much of a benefit for languages such as C# and Java
as it was for Self and Smalltalk.
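To make the "just look at the caches" point concrete, here is a minimal sketch of a polymorphic inline cache in Python. This is hypothetical illustration, not any real VM's implementation; the `CallSite` class and its names are mine:

```python
# A minimal sketch of a polymorphic inline cache: each call site records
# the receiver types it actually sees, so "type inference" becomes a
# simple cache lookup after the program has run for a while.

class CallSite:
    def __init__(self, selector):
        self.selector = selector
        self.cache = {}                  # receiver class -> method

    def send(self, receiver, *args):
        klass = type(receiver)
        method = self.cache.get(klass)
        if method is None:               # cache miss: slow lookup, once per class
            method = getattr(klass, self.selector)
            self.cache[klass] = method
        return method(receiver, *args)

site = CallSite("upper")
site.send("hello")                       # seeds the cache with str
# "Recording the types as they fly by" is now just reading the cache:
print([k.__name__ for k in site.cache])  # → ['str']
```

A real VM would inline the fast path into generated code and cap the cache at a handful of entries, but the information gathered is exactly this: which concrete types showed up at each call site.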

Significant Costs

The runtime costs of a JIT are significant. The obvious cost is that the compiler has to run
alongside the program being executed, so time spent compiling is not available for executing. Apart
from this direct cost, it also means that the compiler is limited in the kinds of analyses
and optimizations it can do. The impact is particularly severe on startup, so short-lived
programs such as TeX/NTS suffer most and can often run slower overall
than interpreted byte-code.

In order to mitigate this, you end up needing multiple compilers, plus heuristics
for deciding when to use which one. In other words: complexity increases dramatically,
and the problem is only somewhat mitigated, not solved.

A less obvious cost is an increase in virtual memory pressure, because the code pages created by the JIT
are "dirty", whereas executables paged in from disk are clean. Dirty pages have to be written
to disk when memory is needed; clean pages can simply be unmapped and paged back in later. On devices without a
swap file, as on most smartphones, the difference between dirty and clean can be the difference between a few unmapped
pages and a process getting killed by the OS.
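The dirty/clean distinction can be sketched with Python's mmap module. This is illustrative only: the variable names are mine, and which pages the OS actually reclaims, and how, is entirely its own business:

```python
import mmap, os, tempfile

# A one-page file to back a "clean" mapping, standing in for an
# AOT-compiled executable paged in from disk: the OS can drop these
# pages at any time and re-read them from the file later.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * mmap.PAGESIZE)
    path = f.name

with open(path, "rb") as backing:
    clean = mmap.mmap(backing.fileno(), mmap.PAGESIZE, access=mmap.ACCESS_READ)

# An anonymous mapping, like a JIT's code buffer, has no file behind it.
# As soon as it is written, its pages are dirty: reclaiming them means
# writing them out to swap, if there is any swap at all.
dirty = mmap.mmap(-1, mmap.PAGESIZE)
dirty[:4] = b"JIT!"                      # writing is what dirties the page

first_bytes = bytes(dirty[:4])
clean.close()
dirty.close()
os.unlink(path)
```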

VM and cache pressure is generally considered a much more severe performance problem than a little extra
CPU use, and often even than a lot of extra CPU use. Most CPUs today can multiply numbers
in a single cycle, yet a single main memory access has the CPU stalled for a hundred cycles or more.

In fact, keeping non-performance-critical code as compact interpreted
byte-code may actually be better than turning it into native code, as long as the code density
is higher.
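One concrete data point for the density argument, from CPython (illustrative, not a benchmark; the exact byte count varies by Python version):

```python
# A whole loop compiles to a few dozen bytes of byte-code, typically an
# order of magnitude smaller than the native code a compiler would emit
# for the same logic.
def accumulate(items):
    total = 0
    for x in items:
        total += x
    return total

size = len(accumulate.__code__.co_code)
print(size)        # a few dozen bytes for the entire function body
```

Smaller code means more of it fits in the instruction cache and fewer pages need to be resident, which is exactly the trade-off in the paragraph above.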

Security risks

Having memory that is both writable and executable is a security risk. And forbidden on iOS,
for example. The only exception is Apple's own JavaScript engine, so on iOS you simply
can't run your own JITs.

Machines got faster

On the low-end of performance, machines have gotten so fast that pure interpreters are often
fast enough for many tasks. Python is used for many tasks as is and PyPy isn't really taking
the Python world by storm. Why? I am guessing it's because on today's machines, plain old
interpreted Python is often fast enough. Same goes for Ruby: it's almost comically slow
(in my measurements, serving HTTP via Sinatra was almost 100 times slower than using libµhttp),
yet even that is still 400 requests per second, exceeding the needs of the vast majority of web-sites,
including my own blog, which until recently didn't see 400 visitors per day.

Successful hybrids

The technique used by Squeak, an interpreter plus C primitives for the heavy lifting (for example multi-media
or cryptography), has been applied successfully in many different cases. This hybrid approach was described
in detail by John Ousterhout in Scripting: Higher-Level Programming for the 21st Century: high-level "scripting" languages are used to glue together
high-performance code written in "systems" languages. Examples include NumPy, but the ones I found most
impressive were the "computational steering" systems apparently used in supercomputing facilities such as
Oak Ridge National Laboratory. Written in Tcl.
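The hybrid pattern in miniature, using NumPy as the example from above (the `normalize` helper is mine, purely for illustration):

```python
# The "scripting" layer is plain interpreted Python; each array
# operation below runs as an AOT-compiled C loop inside NumPy.
# No JIT is involved at either level.
import numpy as np

def normalize(samples):
    """Scale samples to zero mean and unit standard deviation."""
    a = np.asarray(samples, dtype=np.float64)
    return (a - a.mean()) / a.std()      # vectorized C loops do the work

print(normalize([1.0, 2.0, 3.0, 4.0]))
```

The interpreter overhead is paid once per operation rather than once per element, so for large arrays the glue cost disappears into the noise.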

What's interesting with these hybrids is that JITs are being squeezed out at both ends: at the "scripting"
level they are superfluous, at the "systems" level they are not sufficient. And I don't believe that this
idea is only applicable to specialized domains, though there it is most noticeable. In fact, it seems
to be an almost direct manifestation of the observations in Knuth's famous(ly misquoted) quip about
"Premature Optimization":

Experience has shown (see [46], [51]) that most of the running time in non-IO-bound programs is concentrated in about 3 % of the source text.

[..]
The conventional wisdom shared by many of today's software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can't debug or maintain their "optimized" programs. In established engineering disciplines a 12 % improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering. Of course I wouldn't bother making such optimizations on a one-shot job, but when it's a question of preparing quality programs, I don't want to restrict myself to tools that deny me such efficiencies.

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3 %. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.

[..]

(Most programs are probably only run once; and I suppose in such cases we needn't be too fussy about even the structure, much less the efficiency, as long as we are happy with the answers.)
When efficiencies do matter, however, the good news is that usually only a very small fraction of the code is significantly involved.
Structured Programming with go to Statements, Donald Knuth, 1974

For the 97%, a scripting language is often sufficient, whereas the critical 3% are both critical enough
as well as small and isolated enough that hand-tuning is possible and worthwhile.

I agree with Ousterhout's critics who say that the split into scripting languages and systems languages
is arbitrary. Objective-C, for example, combines both approaches in a single language, though one that is
very much a hybrid itself. The "Objective" part is very similar to a scripting language, despite being
compiled ahead of time, in both performance and ease/speed of development, while the C part does the heavy
lifting of a systems language. Alas, Apple has worked continuously and fairly successfully at destroying
both of these aspects and turning the language into a bad caricature of Java. However, although the
split is arbitrary, the competing and diverging requirements are real; see Erlang's split into a
functional language in the small and an object-oriented language in the large.

Unpredictable performance model

The biggest problem I have with JITs is that their performance model is extremely unpredictable. First,
you don't know when optimizations are going to kick in, or when extra compilation is going to make you
slower. Second, predicting which bits of code will actually be optimized well is also hard and a moving
target. Combine these two factors, and you get a performance model that is somewhere between unpredictable
and intractable, and therefore at best statistical: on average, your code will be faster. Probably.

While there may be domains where this is acceptable, most of the domains where performance matters at all
are not of this kind; they tend to be (soft) real time. In real-time systems, average performance matters
not at all; predictably meeting your deadline does. As an example, delivering 80 frames in 1 ms each and
20 frames in 20 ms each, for 480 ms total, means failure (you missed your 60 fps target 20% of the time),
whereas delivering 100 frames in 10 ms each, for 1000 ms total, means success (you met your 60 fps target 100% of the time),
despite the fact that the first scenario is more than twice as fast on average.
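The arithmetic behind that example, as a worked script (the names `jittery`, `steady`, and `totals` are mine):

```python
# Average throughput vs. meeting a 60 fps deadline (about 16.7 ms/frame).
DEADLINE_MS = 1000 / 60

jittery = [1] * 80 + [20] * 20    # 80 fast frames, 20 slow ones
steady  = [10] * 100              # every frame takes 10 ms

def totals(frame_times_ms):
    total = sum(frame_times_ms)
    missed = sum(1 for t in frame_times_ms if t > DEADLINE_MS)
    return total, missed

print(totals(jittery))   # (480, 20): twice as fast on average, 20% misses
print(totals(steady))    # (1000, 0): slower on average, zero misses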

I really learned this in the '90s, when I was doing pre-press work and delivering highly optimized
RIP and Postscript processing software. I was stunned when I heard about daily newspapers switching
to pre-rendered, pre-screened bitmap images for their workflows. This is the most inefficient format
imaginable for pre-press work, with each page typically taking around 140 MB of storage uncompressed,
whereas the Postscript source would typically be between 1/10th and 1/1000th of that size. (And at
the time, 140 MB was a lot even for disk storage, never mind RAM or network capacity.)

The advantage of pre-rendered bitmaps is that your average case is also your worst case. Once you
have provisioned your infrastructure to handle this case, you know that your tech stack will be able to
deliver your newspaper on time, no matter what the content. With Postscript (and
later PDF) workflows, your average case is much better (and your best case ridiculously so), but you
simply don't get any bonus points for delivering your newspaper early. You just get problems
if it's late, and you are not allowed to average the two.

Eve could survive and be useful even if it were never faster than, say, Excel. The Eve IDE, on the other hand, can't afford to miss a frame paint. That means Imp must be not just fast but predictable - the nemesis of the SufficientlySmartCompiler.
Eve blog

I also saw this effect in play with Objective-C and C++ projects: despite the fact that Objective-C's
primitive operations are generally more expensive, projects written in Objective-C often had better
performance than comparable C++ projects, because Objective-C's performance model was so much
simpler, more obvious, and more predictable.

When Apple was still pushing the Java bridge, Sun engineers did a stint at a WWDC to explain how
to optimize Java code for the Hotspot JIT. It was comical. In order to write fast Java code,
you effectively had to think of the assembler code that you wanted to get, then write the
Java code that you thought might net that particular bit of machine code, taking into
account the various limitations of the JIT. At that point, it is a lot easier to just
write the damn assembly code. And vastly more predictable: what you write is what you get.

Modern JITs are capable of much more sophisticated transformations, but what the creators of these
advanced optimizers don't realize is that they are making the problem worse rather than
better. The more they do, the less predictable the code becomes.

The same, incidentally, applies to SufficientlySmart AOT compilers such as the one for the Swift
language, though the problem is not quite as severe as with JITs because you don't have the
dynamic component. All these things are well-intentioned but all in all counter-productive.

Conclusion

Although the idea of Just-in-Time compilers was very good, their area of applicability, which was
always smaller than imagined and/or claimed, has shrunk ever further due to advances in technology,
changing performance requirements, and the realization that for most performance-critical tasks,
predictability is more important than average speed. They are therefore slowly being phased
out in favor of simpler, generally faster, and more predictable AOT compilers. Although they
are unlikely to go away completely, their significance will be drastically diminished.

Alas, the idea of writing high-level code without any concessions to performance (often
justified by misinterpreting or simply misquoting Knuth) and then letting a sufficiently
smart compiler fix it lives on. I don't think this approach to performance is viable; more
predictability is needed. A language with a hybrid nature and the ability for the programmer
to specify behavior-preserving transformations that alter the performance characteristics of
code is probably the way to go for high-performance, high-productivity systems. More on that
another time.

What do you think? Are JITs on the way out or am I on crack? Should we have a more manual
way of influencing performance without completely rewriting code or just trusting the
SmartCompiler?

Update: Nov. 13th 2017

The Mono Project has just announced that they are adding a byte-code interpreter: "We found that certain programs can run faster by being interpreted than being executed with the JIT engine."

I think there are some great things about PyPy and Tracing JITs in general, and there are some great things about compilers.

I've always fallen on the Compiler side of this particular holy war, but I think that's down to me being a Pascal and C guy, at the end of the day.

I still don't trust these VM things. Nor do I trust these heavyweight non-composable standard libraries (usually called Frameworks) that must come along for the ride. But I don't kid myself by thinking that you can just wave your hands and make the VM go away; I will believe it when I see C# compile to a binary exe on my own desktop, and not far away in the bowels of some Windows Store server farm.

You missed the most common "JIT" we have today: the CPU! Modern CPUs do what we would have called "JIT" 10 or 20 years ago (and many people in the industry do say that CPUs implement a JIT). We had a modern fully-AOT CPU -- it was called Itanium, and nobody wanted it.

I'd say the third most common JIT today is LLVM, and it's extremely popular, too. Mac OS X uses it for optimizing both OpenGL and OpenCL, which are in turn used by the whole system.

The pattern here isn't that JITs are going away. It's that they're successful when they can become invisible. (You get a new CPU, and it's just like the old one but faster. You download a newer version of your web browser or operating system, and it's just like the old one but faster.) I'm not running my programs by typing a command that starts with the letter "j", but more of my software is JITted than ever before.

When I click on the "I'm not a robot" checkbox here and it turns green, I'm seeing code run that's been JITted at 3 different levels of abstraction! I see no "shift away from JITs" here. The war is over, and JITs have firmly established their place. They are the stagehands of modern computing. Their job is to stay hidden, but that makes them no less vital.

@Warren: I also think JITs are amazing pieces of technology, and it surprised me that for example PyPy hasn't taken the Python world by storm as I thought it would/should. What's interesting is that mostly it really isn't a shortcoming of the technology, but shifting requirements and technology.

@Anonymous: Interesting perspective about CPUs, but I don't quite buy it, they look more like interpreters to me. I remember Chris presenting the OpenGL/OpenCL stuff while I was still at Apple, very clever indeed. Alas, as a member of the performance team at the time, I also noticed it causing significant performance regressions in common situations, usually when you least wanted them: "I want to do some graphics really fast, quick turn on the OpenGL path. Uh oh, driver's not ready, turn on the compiler. Hey, where are my graphics, I wanted them fast?!" I haven't looked in too much detail yet, but what I've seen of Metal suggests to me that it's much more about having resources (including shaders) compiled ahead of time as much as possible.

@Richard: I am sure they are. I've read about Truffle/Graal, sat through talks at Splash/DLS, and it looks like impressive technology. So did all the super-duper optimizing Java compilers/VMs in the IBM Systems Journal referenced. Real world performance lagged a bit. And yes, needing super-sophisticated technology to run your language with adequate speed does sound like a lock-in strategy :-)

You almost seem to be writing for a theoretical POV. You might want to have a look at some of the extraordinary optimizations that JITs can provide which are almost impossible for an AOT compiler, including all those speculative optimizations which are usually correct. Have a look for articles by Cliff Click (who sees the pros and cons in a pretty balanced way), for example.

Java has had an AOT compiler for many many years (actually several in the past, but the only one surviving is Excelsior JET - because the market doesn't need more than one since JITs do so well). The low-level stuff where you need to write code that matches assembly-type instructions is incredibly atypical, not even 1% of 1% of applications need that. For your average application, Java JITs work spectacularly well.

The unpredictability is accurate, but AOTs don't really help there, jitter exists for those just as much as for JITs. The JITed apps tend to settle down pretty quickly; we've identified a lot of the JIT causes of jitter and baked those into the tools, e.g. have a look at the JMH microbenchmarking tool.

To an extent it's irrelevant. As you point out in your article, CPUs are so fast nowadays that "access to memory" patterns matters more than the raw speed of the generated assembly. And what matters even more is the ability to use multiple cores efficiently. JIT vs AOT is far less interesting than "tools or languages that make multi-core transparent".

"The final compile will be done in the cloud for any apps going to the store, but the tools will be available locally as well to assist with testing the app before it gets deployed to ensure there are no unseen consequences of moving to native code."

"Interesting perspective about CPUs, but I don't quite buy it, they look more like interpreters to me."

I didn't think that was a controversial idea. For example, the first paragraph of the Wikipedia article for "Out-of-order execution" says: "It can be viewed as a hardware based dynamic recompilation or just-in-time compilation (JIT) to improve instruction scheduling."

JIT compilation is any compilation that's not AOT, right? Compilation is a translation from one language to another (unless you're one of those folks who insist that "translation" is somehow a different class of operation than "compilation", for reasons which have never been adequately explained to me, or my compilers professors). On any CPU with microcode, I don't see how you can categorize microcode as anything other than JIT compilation.

It's true that interpreters and JIT compilers are very similar -- JIT compilation is essentially a type of interpreter. I wouldn't call a modern CPU an "interpreter", though, for the primary reason that your code gets faster the more you run it. That's symptomatic of a JIT compiler, not an interpreter. For starters, they've got sophisticated branch prediction. Some even have trace caches. There's no controversy about that, is there? Once you've got a trace cache, you're definitely a JIT compiler!

@Anonymous: As I said, I think that's an interesting perspective, and the Wikipedia entry you quote seems to agree with me: "can be viewed as". Not "is". And yes, "JIT compilation". The CPU is not "compiling". It is doing binary-to-binary translation. For example, I don't see the CPU reconstructing ASTs, inlining and specializing code, etc.

Again, I would agree that some aspects of the process, the trace cache you mentioned springs to mind, smell a little bit of a JIT, but not everything that makes code faster is a JIT.

Proposals for trace cache reoptimization definitely have included JIT-like profile-based optimizations. There's been some proposals going back quite a ways; one paper I recall from my time at Illinois is http://impact.crhc.illinois.edu/shared/journal/ieeetc.hotspot.01.pdf, and some earlier work is rePLay (https://www.ideals.illinois.edu/bitstream/handle/2142/74569/B53-CRHC_99_16.pdf).

My old research group also had some related work in this area, though it was higher-level, only relying on the hardware for speculation support and doing profiling and trace optimization in software. One of my officemates was able to publish an evaluation of this technique through a modification of the Transmeta Code Morphing Software — using real (if obsolete at the time) hardware, much nicer than using a simulator! http://www.xsim.com/papers/neelakantam-real.pdf

(My own PhD was yet higher-level, developing techniques for limited hardware speculation for improving performance of dynamic languages on the JVM/CLR, or a JIT that had to fit into preexisting ABI constraints, like Psyco.)

Nobody ever defined JIT compilation as needing to transform ASTs at runtime. Binary-to-binary translation has been done for a long time, and it just as clearly qualifies as a just-in-time compilation as anything else.