What’s in a word? Memory access interference for languages and CPUs.

I’ve been working on a new malloc implementation for Jinx recently. The last time a developer at Corensic wrote a version of malloc(), I asked him why he didn’t use a great off-the-shelf malloc like tcmalloc. Now it’s my turn, and I also wrote my own. I guess you can accuse me of failing to learn from my mistakes; I’m certainly a hypocrite!

In my defense, a few things about our hypervisor make using someone else’s malloc a pain. First, we’re effectively an embedded system working inside a small chunk of memory allocated in a device driver. The hypervisor works in its own virtual address space, but we never modify that address space. We do some large allocations along the way, but we also have some long-lived smaller memory allocations, so fragmentation is a concern. We code our hypervisor in C. In addition to these constraints, we have one opportunity: we have no pre-emption in our threading package.

I decided to create a pretty simple global allocator, but with the addition of per-physical-processor freelists. Since there’s no pre-emption in our system, we can put freelists in local storage on physical processors, and access the freelists without a lock, without fear of migration.

An example of the global heap data structure follows:

[Figure: an example Jinx heap data structure]

The heap itself is a linked list of regions from HEAD to TAIL. Each element on the linked list is in-use, free, or the head or tail of the list. The heap linked list must never contain two adjacent free regions; any such pair is coalesced into one free region. In addition, there are freelist heads for powers of two. For a given freelist head with size S, every item of size P on that freelist obeys S <= P < S * 2. At the head of every region, there’s a data structure that looks like this:
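A header along these lines captures the idea; the names and types here are my guesses, inferred from the fields discussed later in the post, not the actual Jinx code:

```c
#include <stddef.h>

/* Possible allocation states for a region (HEAD/TAIL mark the list ends). */
enum alloc_state {
    ALLOC_STATE_FREE,
    ALLOC_STATE_ALLOCATED,
    ALLOC_STATE_HEAD,
    ALLOC_STATE_TAIL
};

/* Sketch of the per-region header that sits at the start of every region. */
struct region_header {
    size_t bytes;                        /* size of this region */
    size_t bytes_to_previous_header;     /* distance back to the left neighbor,
                                            so free() can find it for coalescing */
    enum alloc_state alloc_state;        /* free, in-use, head, or tail */
    struct region_header *freelist_next; /* freelist linkage; meaningful only
                                            while alloc_state == ALLOC_STATE_FREE */
};
```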

Coalescing at free time preserves the heap invariant that no two adjacent free regions exist.

We use one giant lock for all items in the heap, relying on per-processor freelists to provide concurrency. Those freelists reuse the freelist field of the header structure for linkage. Those fields are protected by the global heap lock only as long as alloc_state == ALLOC_STATE_FREE.
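The fast path can be sketched with a minimal, self-contained model; the names (freelist_push, percpu_freelist, and so on) are mine, not the Jinx code. The key property is that with no pre-emption, code running on a processor owns that processor’s list outright, so push and pop need neither locks nor atomics:

```c
#include <stddef.h>

#define MAX_CPUS 64

struct free_node {
    struct free_node *next;
};

/* One freelist head per physical processor; indexed by CPU number. */
static struct free_node *percpu_freelist[MAX_CPUS];

/* Safe without a lock: we cannot migrate or be pre-empted between
 * reading the list head and updating it. */
static void freelist_push(unsigned cpu, struct free_node *n) {
    n->next = percpu_freelist[cpu];
    percpu_freelist[cpu] = n;
}

static struct free_node *freelist_pop(unsigned cpu) {
    struct free_node *n = percpu_freelist[cpu];
    if (n != NULL)
        percpu_freelist[cpu] = n->next; /* on NULL, callers fall back to the
                                           locked global heap */
    return n;
}
```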

A bug in this code, like many concurrency errors, caused intermittent, unusual crashes and corruption. It was rare enough that I almost checked the bug in; I was saved when one of the many unit tests we run before committing crashed a single time.

Now, in retrospect, setting the alloc_state field outside of the lock was a pretty obvious bug. Imagine that a manipulator of our left-hand neighbor, holding the lock, observed that our region had become free, and coalesced its region with ours. That would be catastrophic, as we’d end up with a pointer to a region that no longer was! Because regions consult their neighbors’ allocation status during coalescing, that field must be protected by that lock.
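The shape of the race can be sketched as follows; all of the names here are hypothetical stand-ins, and the lock is modeled by a plain flag just to keep the example self-contained:

```c
/* Toy stand-ins for the real global heap lock. */
static int heap_lock_held;
static void heap_lock(void)   { heap_lock_held = 1; }
static void heap_unlock(void) { heap_lock_held = 0; }

enum { ALLOC_STATE_ALLOCATED, ALLOC_STATE_FREE };

struct region { int alloc_state; };

/* BUGGY: alloc_state is written while the lock is NOT held. In the
 * window before heap_lock(), the left neighbor (already holding the
 * lock) can observe us free and coalesce our region away. */
static void region_free_buggy(struct region *r) {
    r->alloc_state = ALLOC_STATE_FREE;
    heap_lock();
    /* ... insert r on a freelist, coalesce with neighbors ... */
    heap_unlock();
}

/* FIXED: every read or write of alloc_state happens under the lock,
 * because coalescing consults neighbors' allocation state. */
static void region_free_fixed(struct region *r) {
    heap_lock();
    r->alloc_state = ALLOC_STATE_FREE;
    /* ... insert r on a freelist, coalesce with neighbors ... */
    heap_unlock();
}
```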

However, there’s a more subtle bug lurking here. Consider a different structure definition:
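Something along these lines; the widths are invented, but the essential feature is that alloc_state now shares a single machine word with a user-visible bitfield:

```c
/* Hypothetical variant of the region header: alloc_state packed into
 * a bitfield adjacent to a field reserved for the block's owner. */
struct region_header {
    unsigned long bytes;
    unsigned alloc_state : 2;  /* protected by the global heap lock */
    unsigned available   : 8;  /* reserved for the user of the block */
};
```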

Let’s say the “available” field is reserved for use by the user of the malloc interface. Now we have a problem with non-atomicity of modification: even though that field is owned by the owner of the memory block, not by the holder of the lock, it cannot safely be modified using ordinary operations.

An assignment such as header->alloc_state = ALLOC_STATE_ALLOCATED (with ALLOC_STATE_ALLOCATED having the value 2), compiled into x86-64 instructions, might look like

and $~3, (%rax)
or $2, (%rax)

As covered in a recent blog post about non-serializable single instructions, variants of these instructions without the LOCK prefix are not atomic, so there’s ample opportunity here to clobber ongoing modifications to bitfields that share our storage. My coworker David points out that it’s more likely to get compiled as this:

mov (%rax), %rdx
and $~3, %rdx
or $2, %rdx
mov %rdx, (%rax)

Either way, this is a non-atomic update!

This is yet another occasion where local reasoning about code is foiled by concurrency. Unlike in single-threaded code, you have to know the exact layout of the structure you’re dealing with to know whether modifying available is legal. In addition, you need to know what sort of atomicity your compiler or language guarantees when you’re accessing variables. For example, would this be OK?
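My reconstruction of the question, with invented widths: “available” packed between other bitfields that are protected by the heap lock.

```c
/* Is it legal to modify "available" without the heap lock, given that
 * it may share a word with bytes_to_previous_header and alloc_state? */
struct region_header {
    unsigned long bytes;
    unsigned alloc_state              : 2;   /* heap lock */
    unsigned available                : 8;   /* block owner */
    unsigned bytes_to_previous_header : 22;  /* heap lock */
};
```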

When I modify “available”, how do I know that “bytes_to_previous_header” doesn’t take a read-modify-write trip along with it? Can I use that field without acquiring the global heap lock? I sent this question out to the rest of our development team and our UW CSE faculty: in what languages is a byte read or write guaranteed to be isolated from adjacent fields? How about a 16-bit number? How about numbers that are the natural register width? How about larger?

Consensus was that all languages on all commonly used architectures guarantee that if one reads or writes a byte, adjacent fields don’t go along for the ride, and that this is true, outside of bitfields, for all types up to the register size of a CPU. Programs would clearly be too hard to reason about were this not true.

Bartosz illustrated this to me for C++ with some language from the C++ spec:

(1.7.3) A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width. […] Two threads of execution (1.10) can update and access separate memory locations without interfering with each other.
(1.7.4) [ Note: Thus a bit-field and an adjacent non-bit-field are in separate memory locations, and therefore can be concurrently updated by two threads of execution without interference. The same applies to two bit-fields, if one is declared inside a nested struct declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field declaration. It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields of non-zero width. —end note ]
(1.7.5) [ Example: A structure declared as

struct {
char a;
int b:5,
c:11,
:0,
d:8;
struct {int ee:8;} e;
}

contains four separate memory locations: The field a and bit-fields d and e.ee are each separate memory locations, and can be modified concurrently without interfering with each other. The bit-fields b and c together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but b and a, for example, can be. —end example ]

I would be delighted to hear from people who have examples from computers, compilers, and languages, old and new, where this is not true.

One last note on this topic: In the FreeBSD kernel, the developers have a good habit of documenting what lock protects which field of a structure, via a locking guide. In their world, that structure might look like this:
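In the FreeBSD style, each field carries a parenthesized key referring to a locking table in the structure’s comment. A hypothetical version of our header, with keys and widths invented for illustration:

```c
#include <stddef.h>

/*
 * Locking key (FreeBSD kernel style, hypothetical):
 *   (L) global heap lock
 *   (F) current processor's freelist; no lock, relies on no pre-emption
 *   (O) owner of the allocated block
 */
struct region_header {
    size_t bytes;                        /* (L) */
    size_t bytes_to_previous_header;     /* (L) */
    struct region_header *freelist_next; /* (L) when free, (F) when on a
                                            per-processor freelist */
    unsigned alloc_state : 2;            /* (L) */
    unsigned available   : 8;            /* (O) -- but it shares a word with
                                            alloc_state, hence the reprimand */
};
```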

In this world, the author of the locking guide would be reprimanded for trying to apply locking rule L/F to the field alloc_state or the hypothetical available field while other members of the same bitfield had locking rule L. Any time you document who locks what in a structure, the granularity can be no finer than what your compiler and your architecture guarantee, assuming you’re using non-atomic operations to modify the fields! This is a great practice that has served Corensic well in the development of its hypervisor. My non-adherence to this practice cost me a few hours in debugging!

So, in summary:

The combination of bitfields and concurrency is dangerous, because bitfield members of a structure are implicitly intertwined with adjacent members.

Those problems could be mitigated somewhat by wrapping all bitfields in their own structure. If you don’t do this (as we don’t), then you are denied local reasoning about fields of structures.

For these reasons, converting a field to a bitfield under space pressure is also an operation that cannot be performed with local reasoning alone. When you change a field of a structure to a bitfield, you have to examine every use of that field to ensure you don’t violate the interference assumptions in the code.
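To make that concrete, here is a minimal before/after sketch with hypothetical fields:

```c
/* Before: each char is its own memory location (in the C11/C++11 sense),
 * so concurrent writers of flags and refcount don't interfere. */
struct plain_fields {
    unsigned char flags;
    unsigned char refcount;
};

/* After: the same two fields as adjacent bitfields form a single memory
 * location; every write to one is a read-modify-write of both. */
struct packed_fields {
    unsigned flags    : 8;
    unsigned refcount : 8;
};
```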

Use of a locking guide in your C structures will alert you to many of these problems.

Not sure this is a legal optimization under modern C/C++ language definitions, but I worked on a compiler that would localize small structs (under 64 bits) into a register.

In this case, updating a single byte field in a struct would be treated much like updating 8 bits of a bitfield. The entire struct would be handled in a non-atomic read-modify-write of the entire structure. So obviously this would be problematic unless the struct could be marked.

Certainly a C++0x struct with an atomic_char would never be subject to this optimization.

And in that particular mid-’90s compiler, this optimization didn’t happen unless an alignment directive was used on the struct or whole-program optimization was enabled (which essentially added the alignment directive to small structs).

True that optimizing compilers are more of a worry than dead architectures 🙂

My understanding is that the C++11 memory model, and its port to C1X, essentially disallows this optimization. Each field is a distinct memory location, and the compiler is not allowed to write to unrelated memory locations. For obvious reasons, there is a loophole for bitfields: contiguous non-zero-width bitfields in the same structure or union may be treated as a single memory location. A zero-sized bitfield forces separation of bitfields in a struct, and bitfields that happen to be contiguous in memory but belong to different structs are not covered by the loophole; then again, alignment requirements kick in anyway in those cases.

Even after compilers are updated to follow the rules, we’ll of course be hosed by bugs in the compilers. Many compilers got volatile wrong; it’ll be no surprise if they miscompile this stuff as well.

Nice writeup. I’d often wondered whether some systems might lack a way to write back less than an entire word, since clearly that requires a more complicated bus. It makes me nervous about trying to do atomic operations on chars.

Personally, I can’t wait for C++0x’s memory model because I don’t want to have to worry about how the bus is wired up for every architecture. Leave that up to the G++ guys to figure out, and I can just read the spec.