The Indirection Problem

April 20, 2012

In this post I’m going to talk about a performance disadvantage inherent to most (if not all) current implementations of dynamic languages and, more generally, languages that make use of garbage collection. This problem affects JavaScript, but also Python, Lua, Scheme and even Java to some extent. Surprisingly, this performance issue is one that is seldom talked about. When you ask someone why JavaScript code is slower than C, they will mention dynamic typing, run-time checks, number types and other sources of overhead, but they will seldom mention indirection.

What do I mean by indirection? Why is it a problem? In JavaScript (and most other dynamic languages), objects, arrays, functions and strings are manipulated through reference values. Concretely speaking, all JavaScript objects are allocated on the heap and manipulated through pointers. This means that accessing object properties necessarily means going through one or more levels of indirection. You might think that this isn’t such a big problem. Most objects in C/C++ are also heap-allocated, after all. There is one big difference, however: in C and C++, the programmer can intentionally nest (inline) objects inside of each other. In JavaScript, this cannot be done manually.

As a simple example, suppose you wanted to define and instantiate an object representing a car and its 4 wheels, with each wheel having angles of direction and rotation. In C, you could do the following:

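In JavaScript, the intuitive equivalent would be a nested object literal (again a sketch; the angles are zero-initialized here):

```javascript
// Every {...} and [...] below becomes a separate heap-allocated
// object, reached through a pointer from its parent.
var car = {
    wheels: [
        { direction: 0, rotation: 0 },
        { direction: 0, rotation: 0 },
        { direction: 0, rotation: 0 },
        { direction: 0, rotation: 0 }
    ]
};

car.wheels[0].direction = 0.32;
console.log(car.wheels[0].direction); // 0.32
```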
Both of these accomplish the same goal. In either language, one can write car.wheels[0].direction = 0.32, for example. In practice, however, the two are implemented quite differently:

In C, the car object is explicitly allocated on the stack. In JavaScript, it is implicitly allocated on the heap.

The C object, including the wheel objects and the rotation angles, is one contiguous piece of memory. This is not the case in JavaScript. Modern JS engines will allocate a car object, a string object, an array object to store the wheels, and individual objects for each wheel. The JS car object only contains pointers.

In C, the rotation angles are contained within the wheel objects, in the same contiguous piece of memory. Under Google V8, this would not be so: the rotation angles, if they are floating-point, would each be stored in objects of their own.

These differences have a performance impact. In C, accessing a wheel rotation angle only requires reading memory at a fixed offset from the stack pointer. There is only one level of indirection. In JavaScript, there is indirection from the car variable to the car object, from the car object to the wheel array, from the wheel array to the wheel objects, and potentially from the wheel objects to the rotation angles. Concretely, this means:

Allocating all of the needed JavaScript objects will be slower. There may need to be multiple calls to the allocator. Object headers and pointers may need to be initialized.

Accessing object properties will also be slower in JavaScript. This is not so relevant in this toy example, but you could imagine that if we were updating a list of car objects in a game engine loop there could be a significant impact on performance.

The JavaScript representation will use more memory. In addition to storing the wheel objects, we have to store a header for each of these objects and pointers to those wheels. Memory alignment constraints might impose additional memory waste as well.

Because the JavaScript representation is larger, it will not fit in as few cache lines as the C representation; this can cause further performance losses when dealing with large sets of objects.

In Google V8, incrementing a wheel rotation angle would cause a new floating-point object to be allocated.

I’m well aware that there are ways to deal with some of these issues in JavaScript. Perhaps the objects could be represented differently (we could have avoided the array). Perhaps we could have used the new Float64Array extension for rotation angles. The point is simply that often, the intuitive JavaScript representation performs worse due to the various reasons I’ve outlined. The programmer can do some things to recoup part of the performance, but this is often at the cost of readability. Ideally, one would want the (JIT) compiler to be able to mitigate or eliminate these performance issues without the code having to be changed.
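As a rough sketch of that typed-array approach (the flat index convention below is my own, not from the post), the eight angles can live in one contiguous buffer of unboxed doubles:

```javascript
// All eight angles in a single contiguous Float64Array instead of
// nested objects. By convention here, wheel i's direction is at
// index 2*i and its rotation at index 2*i + 1.
var car = { wheelAngles: new Float64Array(8) };

car.wheelAngles[0] = 0.32;       // wheels[0].direction
console.log(car.wheelAngles[0]); // 0.32
```

This trades the readable `car.wheels[0].direction` for index arithmetic, which is exactly the readability cost mentioned above.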

One of the main reasons why I wrote this post is because I know that compilers can do more in this regard, and more would probably be done if compiler writers were more aware of these issues. The Java VM can do escape analysis and automatically allocate some objects on the stack when this is found to be safe. Papers have also been published on the topic of automatic object inlining, that is, automatically allocating child objects inside of parent objects when possible. Such optimizations can help eliminate indirection and memory usage overhead. However, these techniques are not widespread in modern JIT compilers, and more research is needed to refine them.


I think many people underestimate the performance loss due to indirection and discontiguous memory. Even in C++ it’s the reason why a vector can often outperform a set despite a higher time complexity: the locality of data, and minimal redirects, just lets the CPU execute it so much faster.

You mentioned GC, so it should be pointed out that a moving GC may actually suffer double indirection. The outermost pointer is actually a pointer to another pointer which can be moved/changed when the GC kicks in. Of course, this moving should, in theory, be able to help with memory locality.

I’m reminded of a feature from the haXe language. I used it to do Flash programming, and objects are very expensive in Flash. So a feature was added which could inline structures: if you had a Point { int x; int y; } class, but your code didn’t actually need the Point itself, the compiler would remove the Point type and inline x and y. I haven’t used the language in a while, but its maintainer was good at finding ways to make it execute faster in less than optimal VMs.

I read Henrik Wann Jensen’s book on photon mapping. The guy actually spent quite a bit of time designing a kd-tree data structure to hold photons where each entry was tiny enough to fit in a single cache line and the whole tree was stored in a flat array. This was to maximize performance in an algorithm that can require hundreds of millions of lookups in a data structure with thousands or even millions of entries.

A straightforward reimplementation of this in JavaScript would probably be scarily slow. I wouldn’t be too surprised if it was actually 100 times slower. I’m not sure Java would fare that much better either.

We got bit by a linked list used for undo. We would remove redundant undo steps from the middle of the list so we used a simple linked list. When we switched to STL vectors everything ran several times faster.

On the other hand too much locality with a concurrent program can cause big performance hits. I have seen problems on a 16 core 32 thread workstation.