What do memory allocation, histograms, and event scheduling have in
common? They all benefit from rounding values to predetermined
buckets, and the same bucketing strategy combines acceptable precision
with reasonable space usage for a wide range of values. I don’t know
if it has a real name; I had to come up with the (confusing) term
“linear-log bucketing” for this post! I also used it twice last
week, in otherwise unrelated contexts, so I figure it deserves more
publicity.

I’m sure the idea is old, but I first came across this strategy in
jemalloc’s binning scheme for
allocation sizes. The general idea is to simplify allocation and reduce
external fragmentation by rounding allocations up to one of a few bin
sizes. The simplest scheme would round up to the next power of two,
but experience shows that’s extremely wasteful: in the worst case, an
allocation for \(k\) bytes can be rounded up to \(2k - 2\) bytes,
for almost 100% space overhead! Jemalloc further divides each
power-of-two range into 4 bins, reducing the worst-case space overhead
to 25%.
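
(Concretely, a 65-byte allocation rounds up to 128 bytes with power-of-two bins, nearly doubling it, but only to 80 bytes with four bins per power-of-two range, roughly a 23% overhead.)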

This sub-power-of-two binning covers medium and large allocations. We
still have to deal with small ones: the ABI forces alignment on every
allocation, regardless of its size, and we don’t want to have too
many small bins (e.g., 1 byte, 2 bytes, 3 bytes, …, 8 bytes).
Jemalloc adds another constraint: bins are always multiples of the
allocation quantum (usually 16 bytes).

I like to think of this sequence as a special initial range with 4
linearly spaced subbins (0 to 63), followed by power-of-two ranges
that are again split in 4 subbins (i.e., almost logarithmic binning).
There are thus two parameters: the size of the initial linear range,
and the number of subbins per range. We’re working with integers, so
we also know that the linear range is at least as large as the number
of subbins (it’s hard to subdivide 8 integers into 16 bins).
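
With the parameters above (a 0 to 63 linear range and 4 subbins per range), the bucket boundaries run 16, 32, 48, 64, then 80, 96, 112, 128, then 160, 192, 224, 256, and so on: linear steps within each range, with the step size doubling from one power-of-two range to the next.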

Assuming both parameters are powers of two, we can find the bucket for
any value with only a couple x86 instructions, and no conditional jump or
lookup in memory. That’s a lot simpler than jemalloc’s
implementation; if you’re into Java,
HdrHistogram’s binning code is nearly
identical to mine.

Common Lisp: my favourite programmer’s calculator

As always when working with bits, I first doodled in SLIME/SBCL: CL’s
bit manipulation functions are more expressive than C’s, and a
REPL helps exploration.

Let linear be the \(\log\sb{2}\) of the linear range, and subbin
the \(\log\sb{2}\) of the number of subbins per range, with
linear >= subbin.

The key idea is that we can easily find the power-of-two range (with a
BSR), and that we can determine the subbin in that range by shifting
the value right to keep only its subbin most significant (nonzero)
bits.

I clearly need something like \(\lfloor\log\sb{2} x\rfloor\):

“lb.lisp”

(defun lb (x)
  (1- (integer-length x)))

I’ll also want to treat values smaller than 2**linear as
though they were about 2**linear in size. We’ll do that with

n-bits := (lb (logior x (ash 1 linear))) === (max linear (lb x))

We now want to shift away all but the top subbin bits of x

shift := (- n-bits subbin)
sub-index := (ash x (- shift))

For a memory allocator, the problem is that the last rightward shift
rounds down! Let’s add a small mask to round things up:
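
In Common Lisp, the round-up version might look like the following sketch; it mirrors the C code below and reuses the lb defined above, but the function name and the two-value return convention are my choices:

(defun bin-of (size linear subbin)
  "Return the bin index for SIZE, and SIZE rounded up to its bucket."
  (let* ((n-bits (lb (logior size (ash 1 linear))))
         (shift (- n-bits subbin))
         (mask (1- (ash 1 shift)))          ; bits the shift would discard
         (rounded (+ size mask))            ; round up before truncating
         (sub-index (ash rounded (- shift)))
         (range (- n-bits linear)))
    (values (+ (ash range subbin) sub-index)
            (logandc2 rounded mask))))

For example, with linear = 6 and subbin = 2 (a 0 to 63 linear range in 16-byte steps), (bin-of 17 6 2) returns bin 2 and a rounded size of 32.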

The same, in GCC

static inline unsigned int
lb(size_t x)
{
        /* I need an extension just for integer-length (: */
        return (sizeof(long long) * CHAR_BIT - 1) - __builtin_clzll(x);
}

/*
 * The following isn't exactly copy/pasted, so there might be
 * transcription bugs.
 */
static inline size_t
bin_of(size_t size, size_t *rounded_size,
       unsigned int linear, unsigned int subbin)
{
        size_t mask, range, rounded, sub_index;
        unsigned int n_bits, shift;

        n_bits = lb(size | (1ULL << linear));
        shift = n_bits - subbin;
        mask = (1ULL << shift) - 1;
        rounded = size + mask; /* XXX: overflow. */
        sub_index = rounded >> shift;
        range = n_bits - linear;
        *rounded_size = rounded & ~mask;
        return (range << subbin) + sub_index;
}

static inline size_t
bin_down_of(size_t size, size_t *rounded_size,
            unsigned int linear, unsigned int subbin)
{
        size_t range, sub_index;
        unsigned int n_bits, shift;

        n_bits = lb(size | (1ULL << linear));
        shift = n_bits - subbin;
        sub_index = size >> shift;
        range = n_bits - linear;
        *rounded_size = sub_index << shift;
        return (range << subbin) + sub_index;
}

What’s it good for?

I first implemented this code to mimic jemalloc’s binning scheme: in
a memory allocator, a linear-logarithmic sequence gives us
alignment and bounded space overhead (bounded internal fragmentation),
while keeping the number of size classes down (controlling external
fragmentation).

High dynamic range histograms use the same
class of sequences to bound the relative error introduced by binning,
even when recording latencies that vary between microseconds and
hours.

I’m currently considering this binning strategy to handle a large
number of timeout events, when an exact priority queue is overkill. A
timer wheel would work, but tuning memory usage is annoying. Instead
of going for a hashed or hierarchical timer wheel, I’m thinking of
binning events by timeout, with one FIFO per bin: events may
be late, but never by more than, e.g., 10% of their timeout. I also
don’t really care about sub-millisecond precision, but wish to treat
zero specially; that’s all taken care of by the “round up” linear-log
binning code.

In general, if you ever think to yourself that dispatching on the
bitwidth of a number would mostly work, except that you need more
granularity for large values, and perhaps less for small ones,
linear-logarithmic binning sequences may be useful. They let you tune
the granularity at both ends, and we know how to round values and map
them to bins with simple functions that compile to fast and compact
code!

P.S. If a chip out there has fast int->FP conversion and slow bit
scans(!?), there’s another approach: convert the integer to FP,
scale by, e.g., \(1.0 / 16\), add 1, and shift/mask to extract
the bottom of the exponent and the top of the significand. That’s not
slow, but unlikely to be faster than a bit scan and a couple
shifts/masks.

http://www.pvk.ca/Blog/2015/04/26/pointer-less-scapegoat-trees (2015-04-26)

I’m trying something new this week: gathering a small group after
work for 90 minutes of short talks and
discussions. We’ll also have one longer slot because not everything
fits in a postcard, but my main goal is really to create opportunities
for everyone to infect us with their excitement for and interest in an
idea or a question. I successfully encouraged a couple people to
present, although many seemed intimidated by the notion… perhaps because we
have grown to expect well researched and rehearsed performances.
However, I believe that simple presentations of preliminary work are
worthwhile, and probably more likely to spark fresh conversations than
the usual polished fare: it’s healthy to expose our doubts,
trials, and errors, and there’s certainly value in reminding ourselves
that everyone else is in that same boat.

Here’s what I quickly (so quickly that my phone failed to focus
correctly) put together on embedding search trees in sorted
arrays. You’ll note that the “slides” are very low tech; hopefully,
more people will contribute their own style to the potluck next time
(:

I didn’t really think about implementing search trees until 3-4 years
ago. I met an online collaborator in Paris who, after a couple G&T,
brought up the topic of “desert island” data structures: if you were
stuck with a computer and a system programming guide on a desert
island, how would you rebuild a standard library from scratch?
Most data structures and algorithms that we use every day are fairly
easy to remember, especially if we don’t care about proofs of
performance: basic dynamic memory allocation, hash tables, sorting,
not-so-bignum arithmetic, etc. are all straightforward. He even had a
mergeable priority queue, with
skew heaps. However,
we both got stuck on balanced search trees: why would anyone want to
remember rotation rules? (Tries were rejected on what I argue are
purely theoretical grounds ;)

I
love searching
in sorted arrays, so I kept looking for a way
to build simpler search trees on top of that. That led me to
Bentley and Saxe’s (PDF) dynamisation trick. The gist of it is
that there’s a family of methods to build dynamic sets on top of
static versions. For sorted arrays, one extreme is an unsorted list
with fast inserts and slow reads, and the other is a single sorted
array, with slow inserts and fast lookups. The most interesting
design point lies in the middle, with \( \log n \) sorted arrays,
yielding \( \mathcal{O}(\lg n) \) time inserts and
\( \mathcal{O}(\lg\sp{2}n) \) lookups; we can see that design in
write-optimised databases. The problem is that my workloads tend to
be read heavy.
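
To make the lookup cost concrete, here is a minimal sketch (mine, not the post’s) of a lookup in that middle design point: the set is a list of sorted vectors, and a query binary searches each vector in turn, which is where the \( \mathcal{O}(\lg\sp{2}n) \) bound comes from.

(defun sorted-member-p (x vector)
  "Binary search for X in a sorted vector of numbers."
  (let ((lo 0)
        (hi (length vector)))
    (loop while (< lo hi)
          do (let ((mid (floor (+ lo hi) 2)))
               (if (< (aref vector mid) x)
                   (setf lo (1+ mid))
                   (setf hi mid))))
    (and (< lo (length vector))
         (= (aref vector lo) x))))

(defun logarithmic-member-p (x vectors)
  "Lookup in a list of sorted vectors, one per size class."
  (some (lambda (v) (sorted-member-p x v)) vectors))

Inserts would add a one-element vector and repeatedly merge vectors of equal size, much like carries in binary addition; that part is omitted here.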

Some time later, I revisited a paper by Brodal, Fagerberg, and Jacob (PDF). They
do a lot of clever things to get interesting performance bounds, but
I’m really not convinced it’s all worth the complexity [1]… especially in
the context of our desert island challenge. I did find one trick very
interesting: they preserve logarithmic time lookups when binary
searching arrays with missing values by recasting these arrays as
implicit binary trees and guaranteeing that “NULLs” never have valid
entries as descendants. That’s a lot simpler than other arguments
based on guaranteeing a minimum density. It’s so much simpler that we
can easily make it work with a branch-free binary search: we only
need to treat NULLs as \( \pm \infty \) (depending on whether we
want a predecessor or a successor).
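
A sketch of that search (my code, not the paper’s or the post’s): a successor-style binary search over a sorted vector in which NIL marks an empty slot and acts as \( +\infty \). It only stays correct because of the invariant above: when the probe lands on a NIL, everything else in the current range must be NIL as well.

(defun successor-split-point (x vector)
  "Index I such that every non-NIL entry before I is < X and every
non-NIL entry at or after I is >= X."
  (let ((lo 0)
        (hi (length vector)))
    (loop while (< lo hi)
          do (let* ((mid (floor (+ lo hi) 2))
                    (value (aref vector mid)))
               (if (and value (< value x)) ; NIL behaves like +infinity
                   (setf lo (1+ mid))
                   (setf hi mid))))
    lo))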

While lookups are logarithmic time, inserts are
\(\mathcal{O}(\lg\sp{2} n) \) time. Still no satisfying
answer to the desert island challenge.

I went back to my real research in optimisation, and somehow
stumbled on
Igal Galperin’s PhD thesis
on both on-line optimisation/learning and… simpler balanced binary search trees!

Scapegoat trees (PDF)
rebalance by guaranteeing a bound \( \alpha > 0 \) on the
relative difference between the optimal depth
(\( \lceil\lg n\rceil \)) for a set of \(n\) values and the height (maximal depth) of
the balanced tree (at most \( (1+\alpha)\lceil\lg n\rceil \)). The
only property that a scapegoat tree has (in addition to those of
binary search trees) is this bound on the height of the
tree, as a function of its size. Whenever a new node would be
inserted at a level too deep for the size of the tree, we go up
its ancestors to find a subtree that is small enough to accommodate
the newcomer and rebuild it from scratch. I will try to provide an
intuition of how they work, but the paper is a much better source.

For a tree of \(n = 14\) elements, we could have \(\alpha =
0.25\), for a maximum depth of \(1.25 \lceil\lg 14\rceil = 5\).
Let’s say we attempt to insert a new value, but the tree is structured such
that the value would be the child of a leaf that’s already at depth \(5\);
we’d violate the (im)balance bound. Instead, we go up until we find
an ancestor \(A\) at depth, e.g., \(3\) with \(4\) descendants. The
ancestor is shallow enough that it has space for \(5 - 3 = 2\)
levels of descendants, for a total height of \(2 + 1 = 3\) for the
subtree. A full binary tree of height \(3\) has
\(2\sp{3} - 1 = 7\) nodes, and we thus have enough space for
\(A\), its \(4\) descendants, and the new node! These 6 values
are rebuilt in a near-perfect binary tree: every level must be
fully populated, except for the last one.

The criteria to find the scapegoat subtree are a bit
annoying to remember–especially given that we don’t want to
constantly rebuild the whole tree–but definitely simpler than rotation
rules. I feel like that finally solves the desert island balanced
search tree challenge… but we still have gapped sorted arrays to
address.

What’s interesting about scapegoat trees is that rebalancing is always
localised to a subtree. Rotating without explicit pointers is hard
(not impossible, amazingly enough), but scapegoat trees just
reconstruct the whole subtree, i.e., a contiguous section of the
sorted array. That’s easy: slide non-empty values to the right, and
redistribute recursively. But, again, finding the scapegoat subtree
is annoying.

That made me think: what if I randomised scapegoat selection? Rather than
counting elements in subtrees, I could approximate that
probabilistically by sampling from an exponential distribution… which
we can easily approximate with the geometric for \(p = 0.5\) by
counting leading zeros in bitstrings.
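
As a sketch (hypothetical code, not from the post): the number of leading zeros in a uniformly random word is exactly a geometric variable with \(p = 0.5\), i.e., the number of tails before the first heads.

(defun random-geometric (&optional (bits 32))
  "Number of leading zero bits in a random BITS-bit word."
  (let ((word (random (ash 1 bits))))
    (- bits (integer-length word))))

The result is 0 half the time, 1 a quarter of the time, and so on; scaling that count gives the number of levels to climb when picking a scapegoat.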

I’m still not totally convinced that it works, but I vaguely remember
successfully testing an implementation and sketching a proof that we
can find the scapegoat subtree by going up according to a scaled
geometric to preserve amortised logarithmic time inserts. The
probability function decreases quickly enough that we preserve
logarithmic time inserts on average, yet slowly enough that we can
expect to redistribute a region before it runs out of space.

The argument is convoluted, but the general idea is based
on the observation that, in a tree of maximum height \(m\), a
subtree at depth \(k\) can contain at most
\(n\sp\prime = 2\sp{m - k + 1} - 1\) elements (including the
subtree’s root).

We only violate the imbalance bound in a subtree if
we attempt to insert more than \(n\sp\prime\) elements in it.
Rebalancing works by designating the shallowest subtree that’s not yet
full as the scapegoat. We could simplify the selection of the
scapegoat tree by counting the number of inserts in each subtree, but
that’d waste a lot of space. Instead, we count probabilistically and
ensure that there’s a high probability (that’s why we always go up by
at least \(\lg \lg n\) levels) that each subtree will be
rebalanced at least once before it hits its insertion count limit.
The memoryless property of the geometric distribution means that
this works even after a rebalance. If we eventually fail to find
space, it’s time to completely rebuild the subtree; this case happens
rarely enough (\(p \approx \frac{\lg n}{n}\)) that the amortised
time for insertions is still logarithmic.

(Again, a caption correction for the whiteboard photo: \(2\sp{\alpha \lg n}\) should read \(2\sp{(1 + \alpha)\lg n}\).)

We can do the same thing when embedding scapegoat trees in implicit
trees. The problem is that a multiplicative overhead in depth results
in an exponential space blowup. The upside is that the overhead is
tunable: we can use less space at the expense of slowing down
inserts.

In fact, if we let \( \alpha \rightarrow 0 \), we find Brodal et
al’s scheme (I don’t know why they didn’t just cite Galperin and
Rivest on scapegoat trees)! The difference is that we are now pretty
sure that we can easily let a random number generator guide our
redistribution.

I only covered insertions and lookups so far. It turns out that
deletions in scapegoat trees are easy: replace the deleted node with
one of its leaves. Deletions should also eventually trigger
a full rebalance to guarantee logarithmic time lookups.

Classical implicit representations for sorted sets make us choose
between appallingly slow (linear time) inserts and slow lookups.
With stochastic scapegoat trees embedded in implicit binary trees, we
get logarithmic time lookups, and we have a continuum of choices
between wasting an exponential amount of space and slow
\( \mathcal{O}(\lg\sp{2} n) \) inserts. In order to get there, we
had to break one rule: we allowed ourselves \(\mathcal{O}(n)\)
additional space, rather than \(o(n)\), but it’s all
empty space.

What other rule or assumption can we challenge (while staying true to
the spirit of searching in arrays)?

I’ve been thinking about interpolation lately: what if we had a
monotone (not necessarily injective) function to map from the set’s
domain to machine integers? That’d let us bucket values or
interpolate to skip the first iterations of the search. If we can
also assume that the keys are uniformly distributed once mapped to
integers, we can use a linear Robin Hood hash table: with a linear
(but small) space overhead, we get constant time expected inserts and
lookups, and what seems to be \( O(\lg \lg n) \) worst-case
lookups [2] with high probability.

Something else is bothering me. We embed in full binary trees, and
thus binary search over arrays of size \(2\sp{n} - 1\)… and we know
that’s a
bad idea.
We could switch to ternary trees, but that means inserts and deletes
must round to the next power of three. Regular div-by-mul and
scaling back up by the divisor always works; is there a simpler way to round
to a power of three or to find the remainder by such a number?

I don’t know! Can anyone offer insights or suggest new paths to explore?

[1] I think jumping the van Emde Boa[s] is a thing, but they at least went for the minor version, the van Emde Boas layout ;)

[2] The maximal distance between the interpolation point and the actual location appears to scale logarithmically with the number of elements. We perform a binary search over a logarithmic-size range, treating empty entries as \(\infty\).

http://www.pvk.ca/Blog/2015/01/13/lock-free-mutual-exclusion (2015-01-13)

Specialised locking schemes and lock-free data structures are a big
part of my work these days. I think the main reason the situation is
tenable is that, very early on, smart people decided to focus on an
SPMC architecture: single writer (producer), multiple readers
(consumers).

As programmers, we have a tendency to try and maximise generality: if
we can support multiple writers, why would one bother with measly SPMC
systems? The thing is SPMC is harder than SPSC, and MPMC is
even more complex. Usually, more concurrency means programs are harder to
get right, harder to scale and harder to maintain. Worse: it also
makes it more difficult to provide theoretical progress guarantees.

Last week, I got lost doodling with x86-specific cross-modifying code,
but still stumbled on a cute example of a simple lock-free protocol:
lock-free sequence locks. This sounds like an oxymoron, but I promise
it makes sense.

Lock-free sequence locks

It helps to define the terms
better. Lock-freedom
means that the overall system will always make progress, even if some
(but not all) threads are suspended.
Classical sequence locks are
an optimistic form of write-biased reader/writer locks: concurrent
writes are forbidden (e.g., with a spinlock), read transactions abort
whenever they observe that writes are in progress, and a generation
counter avoids
ABA problems (when a read
transaction would observe that no write is in progress before and after a
quick write).

In
Transactional Mutex Locks (PDF),
sequence locks proved to have enviable performance on small systems and
scaled decently well for read-heavy workloads. They even allowed lazy
upgrades from reader to writer by atomically checking that the
generation has the expected value when acquiring the sequence lock for
writes. However, we lose nearly all progress guarantees: one
suspended writer can freeze the whole system.

The central trick of lock-freedom is cooperation: it doesn’t matter if
a thread is suspended in the middle of a critical section, as long as
any other thread that would block can instead complete the work that
remains. In general, this is pretty hard, but we can come up with
restricted use cases that are idempotent. For lock-free sequence
locks, the critical section is a precomputed set of writes: a series
of assignments that must appear to execute atomically. It’s fine if
writes happen multiple times, as long as they stop before we move on
to another set of writes.

There’s a primitive based on compare-and-swap that can easily achieve
such conditional writes: restricted double compare and single swap
(RDCSS, introduced in
A Practical Multi-Word Compare-and-Swap (PDF)).
RDCSS atomically checks if both a control word (e.g., a generation
counter) and a data word (a mutable cell) have the expected values and,
if so, writes a new value in the data word. The pseudocode for
regular writes looks like
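
Roughly the following; this is a reconstruction rather than the original pseudocode: the rdcss structure and its slot names are mine, it relies on the box struct defined under “Real code” below, and it omits the path where a thread finds another thread’s descriptor already installed and helps it along.

(defstruct rdcss
  control expected   ; control box and its expected value (the generation)
  data old new)      ; data box, its old value, and the value to write

(defun rdcss-write (self)
  "Write SELF's new value into its data box iff the control box still
holds the expected value; otherwise roll the data box back to its old value."
  (let ((data (rdcss-data self)))
    ;; Phase 1: reserve the data word by installing the descriptor.
    (when (eql (rdcss-old self)
               (sb-ext:compare-and-swap (box-%value data)
                                        (rdcss-old self)
                                        self))
      ;; Phase 2: check the control word, then commit or roll back.
      (let ((target (if (eql (box-%value (rdcss-control self))
                             (rdcss-expected self))
                        (rdcss-new self)
                        (rdcss-old self))))
        (sb-ext:compare-and-swap (box-%value data) self target)))))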

The trick is that, if the first CAS succeeds, we always know how to
undo it (data’s old value must be self.old), and that
information is stored in self so any thread that observes the first
CAS has enough information to complete or rollback the RDCSS. The
only annoying part is that we need a two-phase commit: reserve data,
confirm that control is as expected, and only then write to data.

For the cost of two compare-and-swap per write – plus one to acquire the
sequence lock – writers don’t lock out other writers (writers help
each other make progress instead). Threads (especially readers) can
still suffer from starvation, but at least the set of writes can be
published ahead of time, so readers can even lookup in that set rather
than waiting for/helping writes to complete. The generation counter
remains a bottleneck, but, as long as writes are short and happen
rarely, that seems like an acceptable trade to avoid the 3n CAS in
multi-word compare and swap.

Real code

Here’s what the scheme looks like in SBCL.

First, a mutable box, because we don’t have raw pointers in CL (I
could also have tried to revive my sb-locative hack).

(defstruct (box
            (:constructor make-box (%value)))
  %value)

Next, the type for write records: we have the value for the next
generation (once the write is complete) and a hash table of box to
pairs of old and new values. There’s a key difference with the way
RDCSS is used to implement multiple compare and swap: we don’t check
for mismatches in the old value and simply assume that it is correct.
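
A minimal reconstruction of that record type (the generation slot name is my guess), consistent with the record-p and record-ops accessors used below:

(defstruct record
  generation   ; value of the generation counter once this write is complete
  ops)         ; hash table: box -> (old-value . new-value)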

I see two ways to deal with starting a read transaction while a write
is in progress: we can help the write complete, or we can overlay the
write on top of the current heap in software. I chose the latter:
reads can already be started by writers. If a write is in progress
when we start a transaction, we stash the write set in *current-map*
and lookup there first:

(defvar *current-map* nil)

(defun box-value (box)
  (prog1 (let* ((map *current-map*)
                (value (if map
                           (cdr (gethash box map (box-%value box)))
                           (box-%value box))))
           (if (record-p value)
               ;; if we observe a record, either a new write is in
               ;; progress and (check) is about to fail, or this is
               ;; for an old (already completed) write that succeeded
               ;; partially by accident.  In the second case, we want
               ;; the *old* value.
               (car (gethash box (record-ops value)))
               value))
    (check)))

We’re now ready to start read transactions. We take a snapshot of the
generation counter, update *current-map*, and try to execute a
function that uses box-value. Again, we don’t need a read-read
barrier on x86oids (nor on SPARC, but SBCL doesn’t have threads on
that platform).
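
A sketch of what starting a read transaction might look like; this is my guess rather than the original code: *sequence* is a hypothetical name for the global box that holds either the current generation or the record of an in-progress write, and the retry logic around (check) is left out.

(defvar *sequence* (make-box 0))

(defun call-with-read-transaction (function)
  "Snapshot the sequence box, overlay any in-progress write set, and
call FUNCTION; a real version would retry when (check) signals an abort."
  (let* ((generation (box-%value *sequence*))
         (*current-map* (when (record-p generation)
                          (record-ops generation))))
    (values (funcall function) generation)))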

Now we can commit with a small wrapper around help. Transactional
mutex lock has the idea of transactions that are directly created as
write transactions. We assume that we always know how to undo writes,
so transactions can only be upgraded from reader to writer.
Committing a write will thus check that the generation counter is
still consistent with the (read) transaction before publishing the new
write set and helping it forward.
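
And a rough sketch of the commit wrapper, again my reconstruction: it reuses the hypothetical *sequence* box from the previous sketch, and help (mentioned above but not shown here) stands for the function that applies the record’s writes and publishes the next generation.

(defun commit (read-generation record)
  "Publish RECORD only if the sequence box still holds the generation
observed by the read transaction, then help it to completion."
  (when (eql read-generation
             (sb-ext:compare-and-swap (box-%value *sequence*)
                                      read-generation
                                      record))
    (help record)
    t))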

The function test-reads counts the number of successful read
transactions and checks that (box-value a) and (box-value b) are
always equal. That consistency is preserved by test-writes, which
counts the number of times it succeeds in incrementing both
(box-value a) and (box-value b).

The baseline case should probably be serial execution, while the ideal
case for transactional mutex lock is when there is at most one
writer. Hopefully, the lock-free sequence lock also does well when there
are multiple writers.

Let’s try this!

First, the serial case. As expected, all the transactions succeed, in
6.929 seconds total (6.628 without GC time). With one writer and two
readers, all the writes succeed (as expected), and 98.5% of reads do as
well; all that in 4.186 non-GC seconds, a 65% speed up. Finally, with
two writers and two readers, 76% of writes and 98.5% of reads complete in
4.481 non-GC seconds. That 7% slowdown compared to the single-writer
case is pretty good: my laptop only has two cores, so I would expect
more aborts on reads and a lot more contention with, e.g., a spinlock.

There’s almost no allocation (there’s no write record), but the lack
of read parallelism makes locks about 20% slower than the lock-free
sequence lock. A reader-writer lock would probably close that gap.
The difference is that the lock-free sequence lock has stronger
guarantees in the worst case: no unlucky preemption (or crash, with
shared memory IPC) can cause the whole system to stutter or even halt.

The results above correspond to my general experience. Lock-free
algorithms aren’t always (or even regularly) more efficient than well
thought out locking schemes; however, they are more robust and easier
to reason about. When throughput is more than adequate, it makes
sense to eliminate locks, not to improve the best or even the average
case, but rather to eliminate a class of worst cases – including
deadlocks.

P.S., here’s a sketch of the horrible cross-modifying code hack. It
turns out that the instruction cache is fully coherent on (post-586)
x86oids; the prefetch queue will even reset itself based on the linear
(virtual) address of writes. With a single atomic byte write, we can
turn a xchg (%rax), %rcx into xchg (%rbx), %rcx, where %rbx
points to a location that’s safe to mutate arbitrarily. That’s an
atomic store predicated on the value of a control word elsewhere
(hidden in the instruction stream itself, in this case). We can then
dedicate one sequence of machine code to each transaction and reuse them
via some
Safe Memory Reclamation mechanism (PDF).

There’s one issue: even without preemption (if a writer is pre-empted,
it should see the modified instruction upon rescheduling), stores
can take pretty long to execute: in the worst case, the CPU has to
translate to a physical address and wait for the bus lock. I’m pretty
sure there’s a bound on how long a xchg m, r64 can take, but I
couldn’t find any documentation with a hard figure. If we knew that xchg
m, r64 never lasts more than, e.g., 10k cycles, a program could wait
that many cycles before enqueueing a new write. That wait is bounded
and, as long as writes are disabled very rarely, should improve
the worst-case behaviour without affecting the average throughput.

http://www.pvk.ca/Blog/2014/10/19/performance-optimisation-~-writing-an-essay (2014-10-19)

Skip to the meaty bits.

My work at AppNexus mostly involves
performance optimisation, at any level from microarchitecture-driven
improvements to data layout and assembly code to improving the
responsiveness of our distributed system under load. Technically,
this is similar to what I was doing as a lone developer on
research-grade programs. However, the scale of our (constantly
changing) code base and collaboration with a dozen other coders mean
that I approach the task differently: e.g., rather than
single-mindedly improving throughput now, I aim to pick an evolution
path that improves throughput today without imposing too much of a
burden on future development or fossilising ourselves in a design
dead-end. So, although numbers still don’t lie (hah), my current
approach also calls for something like judgment and taste, as well as
a fair bit of empathy for others. Rare are the obviously correct
choices, and, in that regard, determining what changes to make and
which to discard as
over-the-top ricing feels like
I’m drafting a literary essay.

This view is probably tainted by the fact that, between English and
French classes, I spent something like half of my time in High School
critiquing essays, writing essays, or preparing to write one.
Initially, there was a striking difference between the two languages:
English teachers had us begin with the five paragraph format where one
presents multiple arguments for the same thesis, while French teachers
imposed a thesis/antithesis/synthesis triad (and never really let it
go until CÉGEP, but that’s another topic). When I write that
performance optimisation feels like drafting essays, I’m referring to
the latter “Hegelian” process, where one exposes arguments and
counterarguments alike in order to finally make a stronger case.

I’ll stretch the analogy further. Reading between the lines gives us
access to more arguments, but it’s also easy to get the context wrong and
come up with hilariously far-fetched interpretations. When I try to
understand a system’s performance, the most robust metrics treat the
system as a black box: it’s hard to get throughput under production
data wrong. However, I need finer grained information (e.g.,
performance counters, instruction-level profiling, or
application-specific metrics) to guide my work, and, the more useful
that information can be – like domain specific metrics that highlight
what we could do differently rather than how to do the same thing more
efficiently – the easier it is to measure incorrectly. That’s not a
cause for despair, but rather a fruitful line of skepticism that helps
me find more opportunities.

Just two weeks ago, questioning our application-specific metrics
led to an easy 10% improvement in throughput for our biggest
consumer of CPU cycles. The consumer is an application that
determines whether internet advertising campaigns are eligible to bid
on an ad slot, and if so, which creative (ad) to show and at what bid
price. For the longest time, the most time-consuming part of that
process was the first step, testing for campaign eligibility.
Consequently, we tracked the execution of that step precisely and
worked hard to minimise the time spent on ineligible campaigns,
without paying much attention to the rest of the pipeline. However,
we were clearly hitting diminishing returns in that area, so I asked
myself how an adversary could use our statistics to mislead us. The
easiest way I could think of was to have campaigns that are eligible
to bid, but without any creative compatible with the ad slot (e.g.,
because it’s the wrong size or because the website forbids Flash ads):
although the campaigns are technically eligible, they are unable to
bid on the ad slot. We added code to track these cases and found that
almost half of our “eligible” campaigns simply had no creative in the
right size. Filtering these campaigns early proved to be a
low-hanging fruit with an ideal ratio of code complexity to performance
improvement.

I recently learned that we also had to second-guess instruction level
profiles. Contemporary x86oids are out of order, superscalar, and
speculative machines, so profiles are always messy: “blame” is
scattered around the real culprit, and some instructions (pipeline
hazards like conditional jumps and uncached memory accesses, mostly)
seem to account for more than their actual share. What I never
realised is that, in effect, some instructions systematically mislead
and push their cycles to others.

Some of our internal spinlocks use mfence. I expected that to be
suboptimal, since it’s common knowledge that locked instructions are
more efficient barriers: serialising
instructions like mfence have to affect streaming stores and other
weakly ordered memory accesses, and that’s a lot more work than just
preventing store/load reordering. However, our profiles showed that
we spent very little time on spinlocking so I never gave it much thought…
until eliminating a set of spinlocks had a much better impact on
performance than I would have expected from the profile. Faced with
this puzzle, I had to take a closer look at the way mfence and
locked instructions affect hardware-assisted instruction profiles on
our production Xeon E5s (Sandy Bridge).

I came up with a simple synthetic microbenchmark to simulate locking
on my E5-4617: the loop body is an adjustable set of memory accesses
(reads and writes of out-of-TLB or uncached locations) or computations
(divisions) bracketed by pairs of normal stores, mfence, or lock
inc/dec to cached memory (I would replace the fences with an
increment/decrement pair and it looks like all read-modify-write
instructions are implemented similarly on Intel). Comparing runtimes
for normal stores with the other instructions helps us gauge their
overhead. I can then execute each version under perf and estimate
the overhead from the instruction-level profile. If mfence is
indeed extra misleading, there should be a greater discrepancy between
the empirical impact of the mfence pair and my estimate from the
profile.

It looks like the loads from uncached memory represent ~85% of the
runtime, while the mfence pair might account for at most ~15%, if
I include all the noise from surrounding instructions.

If I trusted the profile, I would worry about eliminating locked
instructions, but not so much for mfence. However, runtimes (in
cycles), which is what I’m ultimately interested in, tell a different
story. The same loop of LLC load misses takes 2.81e9 cycles for 32M
iterations without any atomic or fence, versus 3.66e9 for lock
inc/dec and 19.60e9 cycles for mfence. So, while the profile for
the mfence loop would let me believe that only ~15% of the time is
spent on synchronisation, the mfence pair really represents 86%
\(((19.6 - 2.81) / 19.6)\) of the runtime for that loop! Inversely,
the profile for the locked pair would make me guess that we spend
about 40% of the time there, but, according to the timings, the real
figure is around 23%.

The other tests all point to the same conclusion: the overhead of
mfence is strongly underestimated by instruction level profiling,
and that of locked instructions exaggerated, especially when
adjacent instructions write to memory.

I can guess why we observe this effect; it’s not like Intel is
intentionally messing with us. mfence is a full pipeline flush: it
slows code down because it waits for all in-flight instructions to
complete their execution. Thus, while it’s flushing that slows us
down, the profiling machinery will attribute these cycles to the
instructions that are being flushed. Locked instructions instead
affect stores that are still queued. By forcing such stores to
retire, locked instructions become responsible for the extra cycles
and end up “paying” for writes that would have taken up time anyway.

Losing faith in hardware profiling being remotely representative of
reality makes me a sad panda; I now have to double check perf
profiles when hunting for misleading metrics. At least I can tell
myself that knowing about this phenomenon helps us make better
informed – if less definite – decisions and ferret out more easy
wins.

P.S., if you find this stuff interesting, feel free to send an email
(pkhuong at $WORK.com). My team is hiring both experienced developers
and recent graduates (:

http://www.pvk.ca/Blog/2014/09/13/doodle-hybridising-sbcls-gencgc-with-mark-and-sweep (2014-09-13)

Meta-note: this is more of a journal entry than the usual post
here. I’ll experiment with the format and see if I like publishing
such literal and figurative doodles.

Garbage collection is in the air. My friend
Maxime
is having issues with D’s garbage collector, and Martin Cracauer has a
large patch to improve SBCL’s handling of conservative references. I
started reviewing that patch today, and, after some discussion with
Alastair Bridgewater, I feel like adding a mark-and-sweep component to
SBCL’s GC might be easier than what the patch does, while achieving
its goal of reducing the impact of conservative references. That led
to the whiteboarding episode below and a plan to replace the garbage
collecting half of SBCL’s generational GC. But first, a summary of
the current implementation.

The present, and how we got here

CMUCL started out with a Cheney-style two-space collector. Two-space
collectors free up space for more allocations by copying objects that
might still be useful (that are reachable from “roots,” e.g.,
variables on the stack) from the old space to the new space. Cheney’s
algorithm elegantly simplifies this task by storing bookkeeping
information in the data itself. When we copy an object to the new
space (because it is reachable), we want to make sure that all other
references to that object are also replaced with references to the
copy. Cheney’s solution to that desideratum is obvious: overwrite the
old object with a broken heart (forwarding pointer), a marker that

the object has already been copied to the new space;

the copy lives at address x.

This adds a constraint that heap-allocated objects can never be
smaller than a broken heart, but they’re usually one or two words (two
in SBCL’s case) so the constraint is rarely binding.

When the garbage collector traverses the roots (the stack, for
example) and finds a pointer, the code only has to dereference that
pointer to determine if the object it points to has been moved. If
so, the GC replaces the root pointer with a pointer to the copy in the
new space. Otherwise, the GC copies the object to the new space,
repoints to that copy, and overwrites the old object with a broken heart.

We also need to traverse objects recursively: when we find that an
object is live and copy it to the new space, we must also make sure
that anything that objects points to is also preserved, and that any
pointer in that object is updated with pointers to copies in the new
space.

That’s a graph traversal, and the obvious implementation
maintains a workset of objects to visit which, in the worst case,
could include all the objects in old space. The good news is we don’t have
to worry about objects re-entering that workset: we always
overwrite objects (in old space) with a broken heart when we visit
them for the first time.

Cheney proposed a clever trick to implement this workset. Whenever an
object enters the workset, it has just been copied to the new space;
as long as we allocate in the new space by incrementing an allocation
pointer, the new space itself can serve as the workset! In addition
to the allocation pointer, we now need a “to-scan” pointer. Any
object in the new space that’s below the to-scan pointer has already
been scanned for pointers and fixed to point in the new space; any object
between the to-scan pointer and the allocation pointer must be scanned
for pointers to the old space. We pop an element from the workset by
looking at the next object (in terms of address) after the to-scan
pointer and incrementing that pointer by the object’s size. When the
to-scan and the allocation pointers meet, the workset is empty
and GC terminates.

Some SBCL platforms still use this two-space collector, but it doesn’t
scale very well to large heaps (throughput is usually fine, but we
waste a lot of space and GC pauses can be long). The generational
conservative garbage collector (GENCGC, GENGC on
precise/non-conservative platforms) is a hack on top of that Cheney GC.

The GC is “generational” because most passes only collect garbage from
a small fraction of the heap, and “conservative” because we have to
deal with values that may or may not be pointers (e.g., we don’t always
know if the value in a register is a Lisp reference or just a machine
integer) by considering some objects as live (not up for collection)
while pinning them at their current address.

The runtime uses mprotect to record writes to the heap, except for the
nursery (newly allocated objects) where we expect most writes to land.
The heap is partitioned in pages, and the first write to a page after
a GC triggers a protection fault; the signal handler marks that page
as mutated and changes the protection to allow writes.

Pinned objects are also handled by abusing the root set: pages that
contain at least one pinned object don’t undergo garbage collection
and are directly scanned for pointers, like the stack in Cheney GC.

Instead of having two heaps, an old space and a new space, we now have
a lot of pages, and each page belongs to a generation. When we want
to collect a given generation, pages in that generation form the old
space, and pages allocated during GC the new space. This means that
we lose the simplicity of Cheney’s new-space-is-the-workset trick: the
new space isn’t contiguous, so a single to-scan pointer doesn’t cut it
anymore! GENGC works around that by scanning the page table, but it’s
not pretty and I really don’t know if Cheney is a good fit anymore.

Martin Cracauer’s patch

GENCGC’s approach to pinned objects is stupid. If a page has no
reference except for one conservative pointer, the whole page is
considered live and scanned for references.

Martin’s solution is to allocate additional temporary metadata only
for pinned pages and track the pinned status of individual objects.
When the GC encounters a pointer to a page with pinned objects, it
checks if it’s a pointer to a pinned object. If so, the pointee is
left in place. Otherwise, it’s copied normally.

The patch has code to mark objects as live (pinned) and to overwrite
objects once they have been copied. Basically, it is half of a
mark-and-sweep garbage collector. The main difference is that the set
of pinned objects doesn’t grow (being pinned isn’t a contagious
property), so we don’t need a worklist for pinned objects. However, I
already noted that I’m not convinced the worklist hack in GENGC is a
good idea.

A hybrid collector!

Instead of marking pages as containing pinned objects, I feel it may
be simpler to collect some pages by copying, and others by marking.
Any pinned page would have the “mark” GC policy, while pages that
likely contain few live objects (e.g., the nursery and pages with a
lot of unused memory) would be collected by copying. This too would
avoid the issue with considering whole pages as live when pinned,
and I think that having the choice of copying or marking at a page
granularity will be simpler than toggling at the level of individual
objects.

Each “mark” page now has two (bit)sets, one for live objects and
another for live objects that have already been scanned. We can
maintain a worklist at the page granularity with an embedded linked
list: whenever a “mark” page gains a new live object and it’s not
already in the worklist, that page is enqueued for scanning.

Instead of emulating Cheney’s trick by looking for newly allocated
pages in our page table, we can add pages in new space to the worklist
whenever they become full.

Finally, at the end of the pass, we traverse all “mark” pages and
clear dead objects.

That’s pretty simple (arguably simpler than the current
implementation!), and shouldn’t involve too many changes to the rest
of the code. Mostly, I’d have to adapt the weak pointer machinery to
stop assuming that it can use forwarding pointers to determine when
objects have been marked as live.

However, we might lose the ability to run medium GCs, to collect more
than the nursery but less than the whole heap. If we only want to GC
the nursery, the mprotect write barrier gives us all the information
we need to find references from the rest of the heap to the nursery.
If we wish to collect the whole heap, we only have to consider stacks
and some statically allocated space as roots.

For medium GCs, e.g., collect only generations 1-4 out of 7, GENGC
exploits the way that garbage collection (almost) always copies to
easily track pages with pointers to younger generations. It’s coarse,
but usually acceptable thanks to the copying. I don’t know that it
would work as well if the default is to only copy the nursery.
Moreover, if we have a hybrid GC, it probably makes sense to focus
copying on pages that are mostly empty, regardless of their age. If
we do want medium GCs, we might have to track, for each page, the set
of pages that point there. This set can include false positives, so
it’s probably easiest to clear it before major GCs, and otherwise only
add to that set (removing pages that were emptied by a GC pass sounds
reasonable). I also expect that some pages will have many referrers;
I’m thinking we might use a distinguished value to mean “referred by
every pages” and not consider them for medium GC.

What’s next

Martin’s patch clearly addresses an important weakness in SBCL’s
garbage collector. If I can’t make good progress on the hybrid GC
soon, I’ll make sure the patch is cleaned up for master, hopefully by
Thanksgiving.