So there I was, implementing a one element ring buffer. Which,
I'm sure you'll agree, is a perfectly reasonable data structure.

It was just surprisingly annoying to write, due to reasons we'll
get to in a bit. After giving it a bit of thought, I realized I'd always
been writing ring buffers "wrong", and there was a better way.

Array + two indices

There are two common ways of implementing a queue with a ring buffer.

One is to use an array as the backing storage plus two indices
into the array: read and write. To shift a
value from the head of the queue, index into the array by the read
index, and then increment the read index. To push a value to the back,
index into the array by the write index, store the value in that
offset, and then increment the write index.

Both indices will always be in the range 0..(capacity - 1). This
is done by masking the value after an index gets incremented.

The downside of this representation is that you always waste one element
in the array. If the array is 4 elements, the queue can hold at most 3. Why?
Well, an empty buffer will have a read index that's equal
to the write index; a buffer with capacity N and size N would also
have a read index equal to the write index. Like this:

The 0 and 4 element cases are indistinguishable, so we need to prevent one
from ever happening. Since empty queues are kind of necessary, it follows
that the latter case needs to go. The queue has to be defined
as full when one element in the array is still unused. And that's the
way I've always done it.

Losing one element isn't a huge deal when the ring buffer has thousands
of elements. But when the array is supposed to have just one element...
That's 100% overhead, 0% payload!

Array + index + length

The alternative is to use one index field and one length field. Shifting
an element indexes to the array by the read index, increments the read
index, and then decrements the length. Pushing an element writes to
the slot that is "length" elements after the read index, and then
increments the length. That looks something like this:

This uses the full capacity of the array, with the code not getting much more
complex.

But I've never liked this representation. The most common use
for ring buffers is to be the intermediary between a concurrent
reader and writer (be it two threads, two processes sharing memory, or
a software process communicating with hardware). And for that, the
index + size representation is kind of miserable. Both the reader and
the writer will be writing to the length field, which is bad for
caching. The read index and the length will also need to always be
read and updated atomically, which would be awkward.

(Obviously my one element ring buffer wasn't going to be used in a
concurrent setting. But it's a matter of principle.)

Array + two unmasked indices

So is there an option that gets the benefits of both
representations, without introducing a third state variable?
(Whether it's two indices + a size, or two indices + some kind
of a full vs. empty flag). Turns out there is, and it's really
simple. It uses two indices, but with one tweak compared to the
first solution: don't squash the indices into the correct
range when they are incremented, but only when they are used to index into
the array. Instead you let them grow unbounded, and eventually
wrap around to zero once the unsigned integer overflows. So:

This reclaims the wasted slot.
The code modifying the indices also becomes simpler, since the
clumsy ordering of increments vs. array accesses was only needed for
maintaining the invariant that the index is always in range.

This does require that the implementation language supports wraparound
on unsigned integer overflow.
If it doesn't, this approach doesn't really buy anything. (What will
happen in these languages is that the indices get promoted to bignums
which will be bad, or they get promoted to doubles which will be worse.
So you'll need to manually restrict their range anyway).

The capacity must always be a power of two. (Edit: This
limitation does not come just from the definition of mask
using a bitwise and.
It applies even if mask were defined using modular arithmetic or a
conditional. It's required for the code to be correct on unsigned
integer overflow.)

The maximum capacity can only be half the range of the index
data type. (So 2^31 when using 32 bit unsigned integers.)
In a way that could be interpreted as stealing the top bit of the
index to function as a flag. But the case against flags isn't so much
the extra memory as having to maintain the extra state.

All of those seem like non-issues. What kind of a monster would make
a non-power of two ring anyway?

This is of course not a new invention. The earliest instance I
could find with a bit of searching was from 2004, with Andrew Morton
mentioning it in a code review so casually that it seems to have been a
well established trick. But the vast majority of implementations
I looked at do not do this.

So here's the question: Why do people use the version that's inferior
and more complicated? I must have written a dozen ring buffers over
the years, and before being forced to really think about it, I'd always
just used the first definition. I can understand why a textbook wouldn't
take advantage of unsigned integer wraparound. But it seems like it
should be exactly the kind of cleverness that hackers would relish
using and passing on.

Could it just be tradition? It seems likely that
this is the kind of thing one learns by osmosis, and then never
revisits. But even so, you'd expect the "good"
implementations to push out the "bad" ones at some
point, which doesn't seem to be happening in this case.

Is it resistance to having code actually take advantage of integer
overflow, rather than it be a sign of a bug?

Are non-power of two capacities for ring buffers actually
common?

Join me next week for the exciting sequel to this post, "I've
been tying my shoelaces wrong all these years".

Comments

If you replace masking with integer-modulo-the-size, you can use sizes that are not powers of two. More expensive but handy if you really *need* a 17-byte ring.

By Juho on 2016-12-13

Unfortunately I don't think that works. It needs to be a power of two or the integer overflow causes a discontinuity. Let's say we're using mod 17. The last slot before wraparound would map to slot 0 in the array:

0xffffffff % 17 == 0

Increment it by one, it wraps around to 0, which obviously again maps to slot 0.

0 % 17 == 0

By Peter Bhat Harkins on 2016-12-13

> Join me next week for the exciting sequel to this post, "I've been tying my shoelaces wrong all these years".

That's certainly an option. But that bumps up the memory use of the non-array parts from 3*4 bytes to 3*8 bytes for the indices vs. reclaiming one slot in the array. (You need three integers of this size: Read index, write index, and capacity). It might still be a win over the naive version if the elements are large enough.

By Jack on 2016-12-13

0xffffffff % 17 DOES equal 0. His point stands.

By Mike Spooner on 2016-12-14

Standing corrected, there is a discontinuity at overflow, although the 17 case is less than obvious at first glance because it divides 0xffffffff *exactly*. 49 is a more illuminating case: 0xffffffff % 49 == 38, and with a 32-bit unsigned int, the "next" value would be 0 rather than the desired 39, followed by 1 rather than 40, a clearer oops. To use modulo, you would need to *reduce* the value modulo-N at each step, which is no more expensive.

Why not store indices modulo 2*capacity? Same as the last solution, but prevents overflow. An additional bit in the index can be viewed as a 'fold number'. An array is full when the indices refer to the same cell on different folds, and empty when they refer to the same cell on the same fold. And we don't need more than two folds, as indices can't be more than 'capacity' elements apart.

By jj on 2016-12-14

dizzy57 has the answer. All the talk of when to mask is misguided. The original implementation had a simple problem, the need for one extra boolean to distinguish empty from full. The proposed fix uses all of the bits that were being masked off to implement that bit. Instead, just use one (which is what dizzy57's proposal does).

What kind of a monster would make a non-power of two ring? The kind that ran out of microcontroller SRAM for a large enough power of two ring, but could spare CPU cycles for an expensive modulo operation.

By Darrell Wright on 2016-12-14

uint64_t is so large that it would take over 100 years to overflow. Don't worry about it. :)

By SmallStepForMan on 2016-12-14

Instead of using 2 indices for read/write, use 1 index for write, and an int for count pending. Read_index = write_index - pending. When pending == 0, there is nothing more to read, and when pending == capacity, your buffer is full.

By Jim Fisher on 2016-12-14

I spent a few minutes working out how `val & (array.capacity - 1)` could possibly work. I now realize that this only works if the array has a power-of-two length. This seems to be an implicit assumption of the article.

By sepia on 2016-12-14

I think a two indices version is the better option to go with when you intend to do it thread safe. If there was a way to atomically put a new index in place, the code would be thread safe by default. x86 provides an atomic version in case you prefer to play with assembler rather than mutexes.

By Alex on 2016-12-14

This looks similar to willemt's mmap ring buffers: https://github.com/willemt/cbuffer

The main difference is that you use mask() to wrap the index before indexing into the array, and willemt asks the operating system to map the memory out there to the same place so it's automatic.

The justification and the details are a bit different, but his implementation, like yours, has the nice property that offer() and poll() (his names for these functions) each only need to write one pointer.

By Tensility on 2016-12-15

@Tomaž: 2^0 == 1

By Anonymous on 2016-12-15

Hardware engineers have been doing this for decades -- we generally describe the head/tail (put/get) as free-running indexes here.

By Dave on 2016-12-15

See Disruptor pattern for something better

By Jesper on 2016-12-15

I reached the same conclusion when writing https://github.com/jbro/ring_buffer.

It also uses a neat trick with mmap, so the ring's pages are wired into memory twice. That way, when writing multiple elements, you can just write them over the end, and they will magically appear at the front of the ring.

Only downside is that instead of your power-of-two constraint, I have a must-be-a-multiple-of-the-page_size constraint.

By Bram on 2016-12-15

Does it still work if write wraps around at 4G writes? (Note: I don't mean wrap around in the buffer, but overflow in uint32.)

By Chris on 2016-12-15

All in all, unless the subtly weird code resulting from this lengthy analysis yields *significantly* better performance and big-O scaling, I think it's better just to use the original, straightforward, intuitive algorithm, with the dreaded Third Variable (I prefer the empty-vs-non-empty flag, myself). That leaves behind more-readable, more-easily-understood code for the next guy. If everybody were an ivory-tower programming guru who could look at the code that resulted from all this navel-gazing and understand it without having to have this article handy to reverse-engineer it, that would be fine. But, frankly, in thirty years as a software engineer, I've met *maybe one* guy who could have done it - - but he was back in the days when Men Were Men and I myself could decompile a piece of executable code back to Fortran, by hand. In the last twenty-five years, certainly, there's been nobody else. So, if you use this in production code, and your product is anything other than a write-once, never-look-at-it-again utility library, OS, etc., the next guy in your job is going to curse you, especially if it's been a while and this article has fallen off the Internet. That poor bastard. (At the very least, *print* this article, make six copies, and leave them behind when you go.)

I'm also going slowly insane trying to figure out why on Earth anybody would need a *one-element* ring buffer. That's barely a ring buffer at all, so I deduce you must be less interested in the buffering aspect than in the lockstep serialization of writes and reads - - and from, again, the clarity-and-readability standpoint, there are undoubtedly better ways to achieve that. But who am I? What do I know? Maybe there's some well-known purpose for a one-element ring buffer, that wasn't yet well-known the lady time I implemented one (circa 1991). Just tell us!

By Chris on 2016-12-15

Typo correction in the second-to-last sentence of my first post: "lady" should be "last". (This is what I get for writing from my phone, in bed, first thing upon waking up.)

Yes. Any time this post talks about overflow, it is talking about exactly that: an unsigned integer wrapping around to 0.

By Adam Sawicki on 2016-12-15

That's a great post, thank you so much! I think that people (including me) tend to do it the "two masked indices" way before they learn the better way because this is the most intuitive solution that comes first when thinking about the problem.

By wheels on 2016-12-15

I've always used the first method, personally. Once, though, I had to fix a bug in someone else's ring buffer implementation, which involved several *pages* of code implementing a state machine with 64 states.

As much as I'd have liked to rip it out and replace it, the entire system it was part of had so many intertwingled "moving parts" that I couldn't guarantee that it would be safe.

By Edward Kmett on 2017-04-18

The Chase-Lev workstealing deque is based on the "good" approach mentioned here:

The main additions are growing the array, and ensuring reads/writes are done in the correct manner so that 'stealing' can be done lock-free, but all indices are left unmasked.

By Aristotle Pagaltzis on 2017-04-23

I had the same thought as dizzy57 but also actually went and implemented it to get a feel for it. It turns out to be more complicated than the original approach, but just slightly, and less complicated than the length+index approach.

This avoids the wasted buffer slot, only updates one index for any given operation, and generalises to arrays of arbitrary sizes (by switching to arithmetic mask()/mask2() functions).

By Aristotle Pagaltzis on 2017-04-23

Actually, the empty() I gave is unnecessarily complex (and dizzy57 was first on that too) – just `return read==write` will do.

By Aristotle Pagaltzis on 2017-04-23

Also, I botched size() – once slots are no longer wasted, clamping the value to `array.capacity - 1` is wrong:

size() { return mask2(write - read); }

I should have written some tests… (Disclaimer: and I still haven’t.)

By TheBonobo on 2017-08-10

BUMP!

Surely, there is a textbook version which is easier to understand that uses mod n+1 (*) arithmetic?

Let me write one...

(*In addition to the existing mod n code)

By mjm on 2017-09-06

I seem to be missing something obvious. Since write should increment ahead of read, there should be a time that write has rolled and read hasn't. It seems to me that (write - read) for determining full() would not work correctly then. What am I missing?

By Juho on 2017-09-07

mjm,

The wraparound will work correctly since these are unsigned integers and the ring size is a power of two.

By Jens on 2017-11-08

> And for that, the index + size representation is kind of miserable. Both the reader and the writer will be writing to the length field, which is bad for caching.

I don't follow this argument. Assuming that writer and reader update distinct fields you still have bad caching. Because assuming both values are on the same cache line (which is often the case I'll guess) you have to keep the cache line coherent anyway. But I do understand that from a threading point of view updating more than one value atomically is worse than updating only one.

By andrewl on 2018-01-08

I have the same concern as mjm. If write and read approach MAX_UINT and then write wraps but read does not, size() is miscalculated.

By Juho on 2018-01-10

Andrew; no, it will be calculated correctly. That's the whole point here. If you don't believe me, just try it with some example numbers! :)