Allocate each goroutine a contiguous piece of memory for its stack, and grow it by reallocating and copying when it fills up.

Why?

The current split-stack mechanism has a “hot split” problem: if the stack is almost full, a call will force a new stack chunk to be allocated. When that call returns, the new stack chunk is freed. If the same call happens repeatedly in a tight loop, the repeated allocation and freeing causes significant overhead.

Stack allocation/deallocation work is never complete with split stacks - every time the stack size passes a threshold in either direction, extra work is required.

With contiguous stacks, we avoid both of these problems. Stacks are grown to the size required, and no more work is needed from then on (modulo shrinking, see below).

How?

How can you move a stack? It turns out that our compiler’s escape analysis provides a very important invariant: pointers to data on the stack are only ever passed down in the call tree. Any other escape (written to global variables, returned to a parent, written to the heap, etc.) prevents that data from being allocated on the stack. This invariant means that the only pointers to stack data are in the stack itself (but see below for exceptions), which makes them easy to find and update in case we need to copy the stack.

Overflow check

The stack overflow check is very similar to the overflow check for split stacks. The only major difference is that we no longer need to know what the argument size is, as there is no longer the need to copy the arguments to a new stack segment. This simplifies the proliferation of the morestack routine family somewhat, and eliminates the need for most of the NOSPLIT directives.

We also don’t even need to know the frame size. Just double the stack size and try again; if the frame still doesn’t fit, we’ll hit the overflow check again and double the stack size again. Repeat until it fits.

Copying

When a stack overflow check triggers:

Allocate a new stack somewhat bigger than the old stack. We should allocate in exponentially increasing sizes (to limit work done copying), and possibly also round up to a multiple of a nice size (so the allocator can be more efficient). I’m currently planning on using power-of-two sizes and just doubling on each realloc.

Copy the old stack to the new stack. The stack is treated as an array of *byte and each *byte is copied from old to new. Each *byte is compared with the bounds of the old stack, and if it is in range it is adjusted to point to the same offset from the top of stack (TOS) in the new stack.

Only real pointers can be adjusted - we need to avoid integers that look like pointers. So we need a data structure that tells us, for each pointer-sized word on the stack, whether it is a pointer or not. Fortunately, Carl is generating just such a data structure to be used for precise GC.

Any pointers into the stack from outside need to be adjusted. Despite the escape analysis, there are a few. Those include:

The function object for a deferred call

The arguments to a deferred call

The closure parameters of the function object for a deferred call

The function object that is being passed to the overflowing frame (it gets passed in a register, not on the stack, at least on x86)

The closure parameters to the function object that is being passed to the overflowing frame

The same rules apply when copying - we need to know whether the data really is a pointer and if so, whether it points into the old stack.

Reflect.call

Reflect.call calls a function given a pointer to an argument area and an argument size. The old stack-split mechanism forced a new stack chunk so this variable-sized argument area can be allocated at the top of the stack. The new mechanism needs to allocate these args anywhere on the stack. The implementation is simpler if all stack frames are fixed size, so to “simulate” a variable-sized frame we define a set of fixed-size-frame functions in powers-of-two sizes which can be used as trampolines.

    reflect.call(function, args, argsize):
        if argsize <= 16: jump reflect.call16(function, args, argsize)
        if argsize <= 32: jump reflect.call32(function, args, argsize)
        if argsize <= 64: jump reflect.call64(function, args, argsize)
        ...

    reflect.call16(function, args, argsize):
        sp -= 16
        copy [args, args+argsize] to [sp, sp+argsize]
        call function
        copy [sp, sp+argsize] to [args, args+argsize]   (for return values)
        sp += 16
        return

I propose a maximum reflect.call arg size of 64K.

The stack copier needs to know what is a pointer and what is not for these frames. This info can be derived from the signature of the called function (and, for vararg calls, also the argument size?).

Stack tracing

Stack tracing code simplifies somewhat because it doesn’t need to deal with splits on the stack.

GC

As in stack tracing, GC walk of the stack for a G is simplified.

Defer/panic/recover

The implementations of these mechanisms change slightly, and are simplified in most instances. The old mechanisms used stack-split locations to mark where panics are happening.

Testing

For testing, all stacks will be allocated somewhere outside the heap. When a stack is copied, the old stack will be munmap’d so any errant access to the old stack will trap.

CGo

TBD; I don’t know what the issues are yet.

Shrinking

If a goroutine uses lots of stack at the start of its lifetime but much less stack for the rest of it, a lot of stack space is left idle. We need to recover this space somehow. Ideas:

At GC time, if a goroutine is not using more than X% of its stack, copy it to a smaller stack. It will regrow later if it needs to.

At GC time, if a goroutine is not using more than X% of its stack, lower its stack guard to a smaller value. If it doesn’t hit the guard by the next GC, copy it to a smaller stack.

My current plan: at GC time, if a goroutine is using at most ¼ of its stack, free the bottom ½ of the stack. I’ll free directly if the stack is big enough to allow it, and copy otherwise.

Starting stacks

All Gs by default start with a minimum-sized stack. When a G finishes, we deallocate its stack, as it may be large and not what is required by the next goroutine that happens to start on that G. (Or maybe we let the shrinking code handle this?)

We could keep statistics of how much stack each goroutine uses (keyed by function address) and pre-allocate that much stack before starting that goroutine. Statistics can be updated each time a goroutine finishes. It won’t be perfect (for example, for a goroutine whose stack use is data-dependent), but it may be an adequate starting point.

This mechanism could be used in the split-stack world as well, if we were willing to record the maximum stack usage during a goroutine’s lifetime. It would have to be more conservative, however: in the split-stack world, allocating too large an initial stack can’t be undone.

Disadvantages

More virtual memory pressure / fragmentation. Especially with large stacks, finding contiguous memory to place the stack becomes more difficult. It is possible that some hybrid scheme could be better, for example copying up to a size of 1 MB and splitting after that. Although it would probably help, I don’t think it is worth the additional implementation complexity to retain both mechanisms.

Pointer-into-stack restrictions. All pointers into the stack area need to be strictly accounted for. Currently, our escape analysis ensures that there are no stray pointers into the stack. There may be situations in the future where we wish to allow pointers into the stack, and this implementation will preclude them. I’m not really sure what those situations would be, but this limitation is worth mentioning.

Probably need to do an audit of the runtime C/asm code to make sure nothing tricky with stack pointers is going on.

Performance

Peano (test/peano.go) modified to run up to 11! is about 10% faster. It is very sensitive to the growth rate, though, so take it with a lot of salt. (Doubling the stack size on each copy is 10% faster. Growing the stack by 50% each copy is only 2% faster. Growing by 25% is 20% slower.)

I modified the segmented stack and contiguous stack implementations to use an environment variable to configure their stack segment size / initial stack size. We can use this feature to test a benchmark’s sensitivity to stack segment size. (For these experiments, the contiguous stack size is limited to powers of 2.)

encoding/json/BenchmarkEncode: this one has been complained about historically. See bug 3787.

Presumably the max stack use is 19KB or so; below that, bad splits cause the downward spikes.

html/template/BenchmarkEscapedExecute: this benchmark has a particular problem at 4096 bytes/segment (the default size, which is how I found it).