I have created a fortune file for the all-time top 10,000
/r/Showerthoughts posts, as of October 2016. As a word of
warning: Many of these entries are adult humor and may not be
appropriate for your work computer. These fortunes would be
categorized as “offensive” (fortune -o).

The copyright status of this file is subject to each of its thousands
of authors. Since it’s not possible to contact many of these authors —
some may no longer even be alive — it’s obviously never going to be under
an open source license (Creative Commons, etc.). Even more, some
quotes are probably from comedians and such, rather than by the
redditor who made the post. I distribute it only for fun.

Installation

To install this into your fortune database, first process it with
strfile to create a random-access index, showerthoughts.dat, then
copy both files to the directory with the rest.
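
For example, assuming the fortune file is named showerthoughts and the
system keeps its fortunes in /usr/share/games/fortunes (the path varies
by system), the steps would look something like this:

$ strfile showerthoughts
$ cp showerthoughts showerthoughts.dat /usr/share/games/fortunes/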

$ fortune showerthoughts
Not once in my life have I stepped into somebody's house and
thought, "I sure hope I get an apology for 'the mess'."
―AndItsDeepToo, Aug 2016

If you didn’t already know, fortune is an old unix utility that
displays a random quotation from a quotation database — a digital
fortune cookie. I use it as an interactive login shell greeting on
my ODROID-C2 server:

if shopt -q login_shell; then
    fortune ~/.fortunes
fi

How was it made?

Fortunately I didn’t have to do something crazy like scrape reddit for
weeks on end. Instead, I downloaded the pushshift.io submission
archives, which are currently around 70 GB compressed. Each file
contains one month’s worth of JSON data, one object per submission,
one submission per line, all compressed with bzip2.

Unlike so many other datasets, especially ones made up of arbitrary
inputs from millions of people, the /r/Showerthoughts posts are
surprisingly clean and require virtually no touching up. It’s really
fantastic data.

A nice feature of bzip2 is that concatenating compressed files produces
a valid stream of the concatenated uncompressed files. Additionally, it’s easy to
parallelize bzip2 compression and decompression, which gives it an
edge over xz. I strongly recommend using lbzip2 to
decompress this data, should you want to process it yourself.

cat RS_*.bz2 | lbunzip2 > everything.json

jq is my favorite command line tool for processing JSON (and
rendering fractals). To filter all the /r/Showerthoughts posts,
it’s a simple select expression. Just mind the capitalization of the
subreddit’s name. The -c tells jq to keep it one per line.
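
The filter itself is just this (shown on its own here; it slots into
the full pipeline below):

jq -c 'select(.subreddit == "Showerthoughts")'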

However, you’ll quickly find that jq is the bottleneck, parsing all
that JSON. lbzip2 won’t be able to exploit all your cores as it should. So
I throw grep in front to dramatically decrease the workload for
jq.

This will let some extra things through, but it’s a superset. The -a
option is necessary because the data contains some null bytes. Without
it, grep switches into binary mode and breaks everything. This is
incredibly frustrating when you’ve already waited half an hour for
results.

To reduce the workload further down the pipeline, I take
advantage of the fact that only four fields will be needed: title,
score, author, and created_utc. The rest can — and should, for
efficiency’s sake — be thrown away where it’s cheap to do so.
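
Putting the whole pipeline together (the output filename here is just
an example):

cat RS_*.bz2 | \
    lbunzip2 | \
    grep -a Showerthoughts | \
    jq -c 'select(.subreddit == "Showerthoughts") |
           {title, score, author, created_utc}' \
    > showerthoughts.json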

This gathers all 1,199,499 submissions into a 185 MB JSON file (as of
this writing). Most of these submissions are terrible, so the next
step is narrowing it to the small set of good submissions and putting
them into the fortune database format.

It turns out reddit already has a method for finding the best
submissions: a voting system. Just pick the highest scoring posts.
Through experimentation I arrived at 10,000 as the magic cut-off
number. After this the quality really starts to drop off. Over time
this should probably be scaled up with the total number of
submissions.

I did both steps at the same time using a bit of Emacs Lisp, which is
particularly well-suited to the task.

This Elisp program reads one JSON object at a time and sticks each
into an AVL tree sorted by score (descending), then timestamp
(ascending), then title (ascending). The AVL tree is limited to 10,000
items, with the lowest items being dropped. This was a lot faster than
the more obvious approach: collecting everything into a big list,
sorting it, and keeping the top 10,000 items.

Formatting

The most complicated part is actually paragraph wrapping the
submissions. Most are too long for a single line, and letting the
terminal hard wrap them is visually unpleasing. The submissions are
encoded in UTF-8, some with characters beyond simple ASCII. Proper
wrapping requires not just Unicode awareness, but also some degree of
Unicode rendering. The algorithm needs to recognize grapheme
clusters and know the size of the rendered text. This is not so
trivial! Most paragraph wrapping tools and libraries get this wrong,
some counting width by bytes, others counting width by codepoints.

Emacs’ M-x fill-paragraph knows how to do all these things — only
for a monospace font, which is all I needed — and I decided to
leverage it when generating the fortune file: insert the string into a
temporary buffer, run fill-paragraph on it, and take the buffer
contents back out as the wrapped string.

For the file format, items are delimited by a % on a line by itself.
I put the wrapped content, followed by a quotation dash, the
author, and the date. A surprising number of these submissions have
date-sensitive content (“on this day X years ago”), so I found it was
important to include a date.

April Fool's Day is the one day of the year when people critically
evaluate news articles before accepting them as true.
―kellenbrent, Apr 2015
%
Of all the bodily functions that could be contagious, thank god
it's the yawn.
―MKLV, Aug 2015
%

There’s the potential that a submission itself could end with a lone
% and, with a bit of bad luck, it happens to wrap that onto its own
line. Fortunately this hasn’t happened yet. But, now that I’ve
advertised it, someone could make such a submission, popular enough
for the top 10,000, with the intent to personally trip me up in a
future update. I accept this, though it’s unlikely, and it would be
fairly easy to work around if it happened.

The strfile program looks for the % delimiters and fills out a
table of file offsets. The header of the .dat file indicates the
number of strings along with some other metadata. What follows is a table
of 32-bit file offsets.

struct {
    uint32_t str_version;  /* version number */
    uint32_t str_numstr;   /* # of strings in the file */
    uint32_t str_longlen;  /* length of longest string */
    uint32_t str_shortlen; /* shortest string length */
    uint32_t str_flags;    /* bit field for flags */
    char     str_delim;    /* delimiting character */
};

Note that the table doesn’t necessarily need to list the strings in
the same order as they appear in the original file. In fact, recent
versions of strfile can sort the strings by sorting the table, all
without touching the original file. Though none of this is important to
fortune.

Now that you know how it all works, you can build your own fortune
file from your own inputs!

Suppose you need to write a C program to access a long sequence of
structures from a binary file in a specified format. These structures
have different lengths and contents, but they share a common header
identifying each structure’s type and size. Here’s the definition of that header
(no padding):
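
/* A hypothetical reconstruction — only the struct tag, event, comes up
 * again later; the field names here are stand-ins. */
struct event {
    uint64_t time;   /* event timestamp */
    uint16_t type;   /* structure type */
    uint16_t size;   /* total size in bytes, including this header */
};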

If all integers are stored in little endian byte order (least
significant byte first), there’s a strong temptation to lay the
structures directly over the data. After all, this will work correctly
on most computers.

First, the host machine may not use little endian byte order, though
such machines are now uncommon. Sometimes developers will attempt to
detect the byte order at compile time and use the preprocessor to
byte-swap if needed. This is a mistake.

Second, the host machine may have different alignment requirements and
so introduce additional padding into the structure. Sometimes this can
be resolved with a non-standard #pragma pack.

Integer extraction functions

Fortunately it’s easy to write fast, correct, portable code for this
situation. First, define some functions to extract little endian
integers from an octet buffer (uint8_t). These will work correctly
regardless of the host’s alignment and byte order.
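
For example, the 16- and 32-bit versions might look like this:

static uint16_t
extract_u16le(const uint8_t *buf)
{
    return (uint16_t)buf[0] << 0 |
           (uint16_t)buf[1] << 8;
}

static uint32_t
extract_u32le(const uint8_t *buf)
{
    return (uint32_t)buf[0] <<  0 |
           (uint32_t)buf[1] <<  8 |
           (uint32_t)buf[2] << 16 |
           (uint32_t)buf[3] << 24;
}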

The big endian version is identical, but with shifts in reverse order.

A common concern is that these functions are a lot less efficient than
they could be. On x86 where alignment is very relaxed, each could be
implemented as a single load instruction. However, on GCC 4.x and
earlier, extract_u32le compiles to a series of byte loads, shifts, and
ORs rather than a single load.
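
The tempting shortcut is to cast the buffer and dereference it
directly, something like:

uint32_t
extract_u32le(const uint8_t *buf)
{
    return *(uint32_t *)buf;  /* unaligned access and an aliasing violation */
}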

It’s unportable, it’s undefined behavior, and worst of all, it might
not work correctly even on x86. Fortunately I have some great
news. On GCC 5.x and above, the correct definition compiles to the
desired, fast version. It’s the best of both worlds.

Unfortunately, Clang/LLVM is not this smart as of 3.9, but I’m
betting it will eventually learn how to do this, too.

Member offset constants

For this next technique, that struct event from above need not
actually be in the source. It’s purely documentation. Instead, let’s
define the structure in terms of member offset constants — a term I
just made up for this article. I’ve included the integer types as part
of the name to aid in their correct use.
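
Using the hypothetical struct event from earlier, the constants and a
member access might look like this:

#define EVENT_U64LE_TIME  0
#define EVENT_U16LE_TYPE  8
#define EVENT_U16LE_SIZE  10

uint16_t type = extract_u16le(buf + EVENT_U16LE_TYPE);
uint16_t size = extract_u16le(buf + EVENT_U16LE_SIZE);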

On x86 with GCC 5.x, each member access will be inlined and compiled
to a one-instruction extraction. As far as performance is concerned,
it’s identical to using a structure overlay, but this time the C code
is clean and portable. A slight downside is the lack of type checking
on member access: it’s easy to mismatch the types and accidentally
read garbage.

Memory mapping and iteration

There’s a real advantage to memory mapping the input file and using
its contents directly. On a system with a huge virtual address space,
such as x86-64 or AArch64, this memory is almost “free.” Already being
backed by a file, paging out this memory costs nothing (i.e. it’s
discarded). The input file can comfortably be much larger than
physical memory without straining the system.

Unportable structure overlay can take advantage of memory mapping this
way, but has the previously-described issues. An approach with member
offset constants will take advantage of it just as well, all while
remaining clean and portable.

I like to wrap the memory mapping code into a simple interface, which
makes porting to non-POSIX platforms, such as Windows, easier. Caveat:
This won’t work with files whose size exceeds the available contiguous
virtual memory of the system — a real problem for 32-bit systems.
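
A minimal POSIX sketch of such an interface might look like this (the
names are illustrative):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only; returns its base, or NULL on failure. */
static void *
map_file(const char *path, size_t *size)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return NULL;
    }
    *size = st.st_size;
    void *p = mmap(NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}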

Now they’ve gone and done it. An unidentified agency has spread a
potent computer virus across all the world’s computers and deleted the
binaries for every copy of every software development tool. Even the
offline copies — it’s that potent.

Most of the source code still exists, even for the compilers, and most
computer systems will continue operating without disruption, but no
new software can be developed unless it’s written byte by byte in raw
machine code. Only real programmers can get anything done.

The world’s top software developers have been put to work
bootstrapping a C compiler (and others) completely from scratch so
that we can get back to normal. Without even an assembler, it’s a
slow, tedious process.

In the meantime, rather than wait around for the bootstrap work to
complete, the rest of us have been assigned individual programs hit by
the virus. For example, many basic unix utilities have been wiped out,
and the bootstrap would benefit from having them. Having different
groups tackle each missing program will allow the bootstrap effort to
move forward somewhat in parallel. At least that’s what the compiler
nerds told us. The real reason is that they’re tired of being asked
if they’re done yet, and these tasks will keep the rest of us quietly
busy.

Fortunately you and I have been assigned the easiest task of all:
We’re to write the true command from scratch. We’ll have to
figure it out byte by byte. The target is x86-64 Linux, which means
we’ll need the following documentation: (1) the ELF specification, (2)
the Intel instruction set reference, and (3) the x86-64 System V ABI.

Manual Assembly

The program we’re writing is true, whose behavior is documented as
“do nothing, successfully.” All command line arguments are ignored and
no input is read. The program only needs to perform the exit system
call, immediately terminating the process.

According to the ABI document (3) Appendix A, the registers for system
call arguments are: rdi, rsi, rdx, r10, r8, r9. The system
call number goes in rax. The exit system call takes only one
argument, and that argument will be 0 (success), so rdi should be
set to zero. It’s likely that it’s already zero when the program
starts, but the ABI document says its contents are undefined (§3.4),
so we’ll set it explicitly.

For Linux on x86-64, the system call number for exit is 60,
(/usr/include/asm/unistd_64.h), so rax will be set to 60, followed
by syscall.

xor edi, edi
mov eax, 60
syscall

There’s no assembler available to turn this into machine code, so it
has to be assembled by hand. For that we need the Intel manual (2).

The first instruction is xor, so look up that mnemonic in the
manual. Like most x86 mnemonics, there are many different opcodes and
many ways to encode the same operation. In this case, there are 22
options.

The operands are two 32-bit registers, so there are two options
available (opcodes 0x31 and 0x33):

31 /r XOR r/m32, r32
33 /r XOR r32, r/m32

The “r/m32” means the operand can be either a register or the address
of a 32-bit region of memory. With two register operands, both
encodings are equally valid, both have the same length (2 bytes), and
neither is canonical, so the decision is entirely arbitrary. Let’s
pick the first one, opcode 0x31, since it’s listed first.

The “/r” after the opcode means the register-only operand (“r32” in
both cases) will be specified in the ModR/M byte. This is the byte
that immediately follows the opcode and specifies one or two of the
operands.

The ModR/M byte is broken into three parts: mod (2 bits), reg (3
bits), r/m (3 bits). This gets a little complicated, but if you stare
at Table 2-1 in the Intel manual for long enough it eventually makes
sense. In short, two high bits (11) for mod indicates we’re working
with a register rather than a load. Here’s where we’re at for ModR/M:

11 ??? ???

The order of the x86 registers is unintuitive: ax, cx, dx, bx,
sp, bp, si, di. With 0-indexing, that gives di a value of 7
(111 in binary). With edi as both operands, this makes ModR/M:

11 111 111

Or, in hexadecimal, FF. And that’s it for this instruction. With the
opcode (0x31) and the ModR/M byte (0xFF):

31 FF

The encoding for mov is a bit different. Look it up and match the
operands. Like before, there are two possible options:

B8+rd id MOV r32, imm32
C7 /0 id MOV r/m32, imm32

The B8+rd notation means the 32-bit register operand (rd for
“register double word”) is added to the opcode instead of having a
ModR/M byte. It’s followed by a 32-bit immediate value (id for
“integer double word”). That’s a total of 5 bytes.

The “/0” in the second means 0 goes in the “reg” field of ModR/M, and the
whole instruction is followed by the 32-bit immediate (id). That’s a
total of 6 bytes. Since this is longer, we’ll use the first encoding.

So, that’s opcode 0xB8 + 0, since eax is register number 0,
followed by 60 (0x3C) as a little endian, 4-byte value. Here’s the
encoding for the second instruction:

B8 3C 00 00 00

The final instruction is a cakewalk. There are no operands, and it
comes in only one form: two opcode bytes.

0F 05 SYSCALL

So the encoding for this instruction is:

0F 05

Putting it all together the program is 9 bytes:

31 FF B8 3C 00 00 00 0F 05

Aren’t you glad you don’t normally have to assemble entire programs by
hand?

Constructing the ELF

Back in the old days you may have been able to simply drop these bytes
into a file and execute it. That’s how DOS COM programs worked.
But that definitely won’t work on Linux. Binaries must
be in the Executable and Linking Format (ELF). This format tells the
loader how to initialize the program in memory and how to start it.

Fortunately for this program we’ll only need to fill out two
structures: the ELF header and one program header. The binary will be
the ELF header, followed immediately by the program header, followed
immediately by the program.

To fill this binary out, we’d use whatever method the virus left
behind for writing raw bytes to a file. For now I’ll assume the echo
command is still available, and we’ll use hexadecimal \xNN escapes
to write raw bytes. If this isn’t available, you might need to use the
magnetic needle and steady hand method, or the butterflies.

The very first structure in an ELF file must be the ELF header, from
the ELF specification (1):
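
/* Shown here with stdint types; elf.h spells these Elf64_Half,
 * Elf64_Word, Elf64_Addr, and so on. */
typedef struct {
    unsigned char e_ident[EI_NIDENT]; /* magic number and other info */
    uint16_t e_type;                  /* object file type */
    uint16_t e_machine;               /* architecture */
    uint32_t e_version;               /* object file version */
    uint64_t e_entry;                 /* entry point virtual address */
    uint64_t e_phoff;                 /* program header table offset */
    uint64_t e_shoff;                 /* section header table offset */
    uint32_t e_flags;                 /* processor-specific flags */
    uint16_t e_ehsize;                /* ELF header size */
    uint16_t e_phentsize;             /* program header entry size */
    uint16_t e_phnum;                 /* program header entry count */
    uint16_t e_shentsize;             /* section header entry size */
    uint16_t e_shnum;                 /* section header entry count */
    uint16_t e_shstrndx;              /* section name string table index */
} Elf64_Ehdr;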

No other data is at a fixed location because this header specifies
where it can be found. If you’re writing a C program in the future,
once compilers have been bootstrapped back into existence, you can
access this structure in elf.h.

The ELF header

The EI_NIDENT macro is 16, so e_ident is 16 bytes. The first 4
bytes are fixed: 0x7F, E, L, F.

The 5th byte is called EI_CLASS: a 32-bit program (ELFCLASS32 =
1) or a 64-bit program (ELFCLASS64 = 2). This will be a 64-bit
program (2).

The 6th byte indicates the integer format (EI_DATA). The one we want
for x86-64 is ELFDATA2LSB (1), two’s complement, little-endian.

The 7th byte is the ELF version (EI_VERSION), always 1 as of this
writing.

The 8th byte is the ABI (ELF_OSABI), which in this case is
ELFOSABI_SYSV (0).
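
The remaining bytes of e_ident — EI_ABIVERSION and padding — are all
zero, so the first 16 bytes of the file can presumably be written in
one shot (note the single > to start the file):

echo -ne '\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' > true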

The next field is the e_type. This is an executable program, so it’s
ET_EXEC (2). Other options are object files (ET_REL = 1), shared
libraries (ET_DYN = 3), and core files (ET_CORE = 4).

echo -ne '\x02\x00' >> true

The value for e_machine is EM_X86_64 (0x3E). This value isn’t in
the ELF specification but rather the ABI document (§4.1.1). On BSD
this is instead named EM_AMD64.

echo -ne '\x3E\x00' >> true

For e_version it’s always 1, just like the EI_VERSION byte above.

echo -ne '\x01\x00\x00\x00' >> true

The e_entry field will be 8 bytes because this is a 64-bit ELF. This
is the virtual address of the program’s entry point. It’s where the
loader will pass control and so it’s where we’ll load the program. The
typical entry address is somewhere around 0x400000. For a reason I’ll
explain shortly, our entry point will be 120 bytes (0x78) after that
nice round number, at 0x40000078.

echo -ne '\x78\x00\x00\x40\x00\x00\x00\x00' >> true

The e_phoff field holds the offset of the program header table. The
ELF header is 64 bytes (0x40) and this structure will immediately
follow. It’s also 8 bytes.

echo -ne '\x40\x00\x00\x00\x00\x00\x00\x00' >> true

The e_shoff field holds the offset of the section header table. In an
executable program we don’t need sections, so this is zero.

echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' >> true

The e_flags field has processor-specific flags, which in our case is
just 0.

echo -ne '\x00\x00\x00\x00' >> true

The e_ehsize holds the size of the ELF header, which, as I said, is
64 bytes (0x40).

echo -ne '\x40\x00' >> true

The e_phentsize is the size of one program header, which is 56 bytes
(0x38).

echo -ne '\x38\x00' >> true

The e_phnum field indicates how many program headers there are. We
only need the one: the segment with the 9 program bytes, to be loaded
into memory.

echo -ne '\x01\x00' >> true

The e_shentsize is the size of a section header. We’re not using
this, but we’ll do our due diligence. These are 64 bytes (0x40).

echo -ne '\x40\x00' >> true

The e_shnum field is the number of sections (0).

echo -ne '\x00\x00' >> true

The e_shstrndx is the index of the section with the string table. It
doesn’t exist, so it’s 0.
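
It would presumably be written just like the other two-byte fields:

echo -ne '\x00\x00' >> true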

The program header

The p_type field indicates the segment type. This segment will hold
the program and will be loaded into memory, so we want PT_LOAD (1).
Other kinds of segments set up dynamic loading and such.

echo -ne '\x01\x00\x00\x00' >> true

The p_flags field gives the memory protections. We want executable
(PF_X = 1) and readable (PF_R = 4). These are ORed together to
make 5.

echo -ne '\x05\x00\x00\x00' >> true

The p_offset is the file offset for the content of this segment.
This will be the program we assembled. It will immediately follow
this header: a 64-byte ELF header plus a 56-byte program header puts
it at offset 120 (0x78).

echo -ne '\x78\x00\x00\x00\x00\x00\x00\x00' >> true

The p_vaddr is the virtual address where this segment will be
loaded. This is the entry point from before. A restriction is that
this value must be congruent with p_offset modulo the page size.
That’s why the entry point was offset by 120 bytes.

echo -ne '\x78\x00\x00\x40\x00\x00\x00\x00' >> true

The p_paddr is unused for this platform.

echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' >> true

The p_filesz is the size of the segment in the file: 9 bytes.

echo -ne '\x09\x00\x00\x00\x00\x00\x00\x00' >> true

The p_memsz is the size of the segment in memory, also 9 bytes. It
might sound redundant, but these are allowed to differ, in which case
it’s either truncated or padded with zeroes.

echo -ne '\x09\x00\x00\x00\x00\x00\x00\x00' >> true

The p_align indicates the segment’s alignment. We don’t care about
alignment.
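
Zero works fine for that, and with the last field written, all that’s
left is to mark the result executable and try it out (the exact test
command is up to you):

echo -ne '\x00\x00\x00\x00\x00\x00\x00\x00' >> true
chmod +x true
./true && echo success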

Suppose you want to bake binary data directly into a program’s
executable. It could be image pixel data (PNG, BMP, JPEG), a text
file, or some sort of complex data structure. Perhaps the purpose is
to build a single executable with no extraneous data files — easier to
install and manage, though harder to modify. Or maybe you’re lazy and
don’t want to worry about handling the various complications and
errors that arise when reading external data: Where to find it, and
what to do if you can’t find it or can’t read it. This article is
about two different approaches I’ve used a number of times for C
programs.

The linker approach

The simpler, less portable option is to have the linker do it. Both
the GNU linker and the gold linker (ELF only) can create
object files from arbitrary files using the --format (-b) option
set to binary (raw data). It’s combined with --relocatable (-r)
to make it linkable with the rest of the program. MinGW supports all
of this, too, so it’s fairly portable so long as you stick to GNU
Binutils.

For example, to create an object file, my_msg.o, with the
contents of the text file my_msg.txt:

$ ld -r -b binary -o my_msg.o my_msg.txt

The object file will have three symbols, each named after the input
file. Unfortunately there’s no control over the symbol names, section
(.data), alignment, or protections (e.g. read-only). You’re completely
at the whim of the linker, short of objcopy tricks.
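
For my_msg.txt those symbols follow the usual _binary_<mangled
filename>_* convention, and a C program would declare them roughly
like this:

extern char _binary_my_msg_txt_start[];
extern char _binary_my_msg_txt_end[];
extern char _binary_my_msg_txt_size[];  /* absolute ("A" in nm); see below */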

The size symbol, _binary_my_msg_txt_size, is misleading. The “A”
from nm means it’s an absolute symbol, not relocated. It doesn’t refer
to an integer that holds the size of the raw data. The value of the
symbol itself is the size of the data. That is, take the address of it
and cast it to an integer.

size_t size = (size_t)&_binary_my_msg_txt_size;

Alternatively — and this is my own preference — just subtract the
other two symbols. It’s cleaner and easier to understand.
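
That is:

size_t size = _binary_my_msg_txt_end - _binary_my_msg_txt_start;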

If this was binary data, such as an image file, the program would
instead read the array as if it were a memory mapped file. In fact,
that’s what it really is: the raw data memory mapped by the loader
before the program started.

How about a data structure dump?

This could be taken further to dump out some kinds of data structures.
For example, this program (table_gen.c) fills out a table of the
first 90 Fibonacci numbers and dumps it to standard output.
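
A sketch of what that generator might look like, its output captured
into a file named table.bin (./table_gen > table.bin) for the linker
step described above:

/* table_gen.c (sketch) */
#include <stdio.h>

#define N 90

int
main(void)
{
    long long fib[N] = {0, 1};
    for (int i = 2; i < N; i++)
        fib[i] = fib[i - 1] + fib[i - 2];
    fwrite(fib, sizeof(fib[0]), N, stdout);
    return 0;
}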

However, there are some good reasons not to use this feature in this
way:

The format of table.bin is specific to the host architecture
(byte order, size, padding, etc.). If the host is the same as the
target then this isn’t a problem, but it will prohibit
cross-compilation.

The linker has no information about the alignment requirements of
the data. To the linker it’s just a byte buffer. In the final
program the long long array will not necessarily be aligned properly
for its type, meaning a program using it directly might crash. The Right Way
is to never dereference the data directly but rather memcpy() it
into a properly-aligned variable, just as if the data was an
unaligned buffer.

The data structure cannot use any pointers. Pointer values are
meaningless to other processes and will be no different than
garbage.

Towards a more portable approach

There’s an easy way to address all three of these problems and
eliminate the reliance on GNU linkers: serialize the data into C code.
It’s metaprogramming, baby.
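
For the Fibonacci table, the serializing generator might instead emit
a C array (again a sketch):

/* table_gen.c, serializing version (sketch) */
#include <stdio.h>

#define N 90

int
main(void)
{
    long long fib[N] = {0, 1};
    for (int i = 2; i < N; i++)
        fib[i] = fib[i - 1] + fib[i - 2];
    printf("const long long fib_table[%d] = {\n", N);
    for (int i = 0; i < N; i++)
        printf("    %lldLL,\n", fib[i]);
    puts("};");
    return 0;
}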

Any C compiler and linker could do all of this, no problem, making it
more portable. The intermediate metaprogram isn’t a barrier to cross
compilation. It would be compiled for the host (typically identified
through HOST_CC) and the rest is compiled for the target (e.g.
CC).

The output of table_gen.c isn’t dependent on any architecture,
making it cross-compiler friendly. There are also no alignment
problems because it’s all visible to the compiler. The type system isn’t
being undermined.

Dealing with pointers

The Fibonacci example doesn’t address the pointer problem — it has no
pointers to speak of. So let’s step it up to a trie using the trie
from the previous article. As a reminder, here it is:
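
/* A sketch of its shape — the exact field order and types are a
 * guess, but the names match the traversal code that uses it: */
#define TRIE_ALPHABET_SIZE 4
#define TRIE_TERMINAL_FLAG (1U << 0)

struct trie {
    struct trie *next[TRIE_ALPHABET_SIZE];
    unsigned flags;
};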

Dumping these structures out raw would definitely be useless since
they’re almost entirely pointer data. So instead, fill out an array of
these structures, referencing the array itself to build up the
pointers (later filled in by either the linker or the loader). This
code uses the in-place breadth-first traversal technique from
the previous article.

This trie can be immediately used at program startup without
initialization, and it can even have new nodes inserted into it. It’s
not without its downsides, particularly because it’s a trie:

It’s really going to blow up the size of the binary, especially
when it holds lots of strings. These nodes are anything but
compact.

If the code is compiled to be position-independent (-fPIC), each
of those nodes is going to hold multiple dynamic relocations,
further exploding the size of the binary and preventing the trie
from being shared between processes. It’s 24 bytes per
relocation on x86-64. This will also slow down program start up
time. With just a few thousand strings, the simple test program was
taking 5x longer to start (25ms instead of 5ms) than with an empty
trie.

Even without being position-independent, the linker will have to
resolve all the compile-time relocations. I was able to overwhelm the
linker and run it out of memory with just some tens of thousands of
strings. This would make for a decent linker stress test.

This technique obviously doesn’t scale well with trie data. You’re
better off baking in the flat string list and building the trie at run
time — though you could compute the exact number of needed nodes at
compile time and statically allocate them (in .bss). I’ve personally
had much better luck with other sorts of lookup tables.
It’s a useful tool for the C programmer’s toolbelt.

As part of a demonstration in an upcoming article, I wrote a
simple trie implementation. A trie is a search tree where the
keys are a sequence of symbols (i.e. strings). Strings with a common
prefix share an initial path down the trie, and the keys themselves
are stored implicitly by the structure of the trie. It’s commonly used
as a sorted set or, when values are associated with nodes, an
associative array.

This wasn’t my first time writing a trie. The curse of programming in
C is rewriting the same data structures and algorithms over and over.
It’s the problem C++ templates are intended to solve. This rewriting
isn’t always bad since each implementation is typically customized for
its specific use, often resulting in greater performance and a smaller
resource footprint.

Every time I’ve rewritten a trie, my implementation is a little bit
better than the last. This time around I discovered an approach for
traversing, both depth-first and breadth-first, an arbitrarily-sized
trie without memory allocation. I’m definitely not the first to
discover something like this. There’s Deutsch-Schorr-Waite pointer
reversal for binary graphs (1965) — which I originally learned
from reading the Scheme 9 from Outer Space garbage collector
source — and Morris in-order traversal (1979) for binary
trees. The former requires two extra tag bits per node and the latter
requires no modifications at all.

What’s a trie?

But before I go further, some background. A trie can come in many
shapes and sizes, but in the simple case each node of a trie has as
many pointers as its alphabet. For illustration purposes, imagine a
trie for strings of only four characters: A, B, C, and D. Each node is
essentially four pointers.
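
A node for that four-symbol alphabet might be declared like this (a
sketch consistent with the traversal code later in the article):

#define TRIE_ALPHABET_SIZE 4
#define TRIE_TERMINAL_FLAG (1U << 0)

struct trie {
    struct trie *next[TRIE_ALPHABET_SIZE];
    unsigned flags;
};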

It includes a flags field, where a single bit tracks whether or not
a node is terminal — that is, a key terminates at this node. Terminal
nodes are not necessarily leaf nodes, which is the case when one key
is a prefix of another key. I could instead have used a 1-bit
bit-field (e.g. int is_terminal : 1;) but I don’t like bit-fields.

A trie with the following keys, inserted in any order:

AAAAA
ABCD
CAA
CAD
CDBD

Each of these keys traces a path down from the root, ending at a
terminal node.

The root of the trie is the empty string, and each child represents a
trie prefixed with one of the symbols from the alphabet. This is a
nice recursive definition, and it’s tempting to write recursive
functions to process it. For example, here’s a recursive insertion
function.
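
A sketch of that function, assuming the node definition above and a
calloc()-based allocator (the return convention here is a guess):

#include <stdlib.h>

struct trie *
trie_insert(struct trie *t, const char *s)
{
    if (!*s) {
        t->flags |= TRIE_TERMINAL_FLAG;
        return t;
    }
    int i = *s - 'A';
    if (!t->next[i]) {
        t->next[i] = calloc(1, sizeof(*t->next[i]));
        if (!t->next[i])
            return NULL;  /* out of memory */
    }
    return trie_insert(t->next[i], s + 1);  /* tail call */
}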

If the string is empty (!*s), mark the current node as terminal.
Otherwise recursively insert the substring under the appropriate
child. That’s a tail call, and any optimizing compiler would optimize
this call into a jump back to the beginning of the function
(tail-call optimization), reusing the stack frame as if it were a
simple loop.

If that’s not good enough, such as when optimization is disabled for
debugging and the recursive definition is blowing the stack, this is
trivial to convert to a safe, iterative function. I prefer this
version anyway.
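
The iterative equivalent, again as a sketch:

struct trie *
trie_insert_iter(struct trie *t, const char *s)
{
    for (; *s; s++) {
        int i = *s - 'A';
        if (!t->next[i]) {
            t->next[i] = calloc(1, sizeof(*t->next[i]));
            if (!t->next[i])
                return NULL;  /* out of memory */
        }
        t = t->next[i];
    }
    t->flags |= TRIE_TERMINAL_FLAG;
    return t;
}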

Depth-first traversal is stack-oriented. The stack represents the
path through the graph, and each new vertex is pushed into this stack
as it’s visited. A recursive traversal function can implicitly use the
call stack for storing this information, so no additional data
structure is needed.

The downside is that the call is no longer tail-recursive, so a large
trie will blow the stack. Also, the caller needs to provide a callback
function because the stack cannot unwind to return a value: The stack
has important state on it. Here’s a typedef for the callback.

typedef void (*trie_visitor)(const char *key, void *arg);

And here’s the recursive depth-first traversal function. The top-level
caller passes the same buffer for buf and bufend, which must be at
least as large as the largest key. The visited key will be written to
this buffer and passed to the visitor.
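
A sketch of it (the function name here is made up):

static void
visit(struct trie *t, char *buf, char *bufend, trie_visitor v, void *arg)
{
    if (t->flags & TRIE_TERMINAL_FLAG) {
        *bufend = 0;
        v(buf, arg);
    }
    for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
        if (t->next[i]) {
            *bufend = 'A' + i;
            visit(t->next[i], buf, bufend + 1, v, arg);
        }
    }
}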

Heap-allocated Traversal Stack

Moving the traversal stack to the heap would eliminate the stack
overflow problem and it would allow control to return to the caller.
This is going to be a lot of code for an article, but bear with me.

First define an iterator object. The stack will need two pieces of
information: which node did we come from (p) and through which
pointer (i). When a node has been exhausted, this will allow return
to the parent. The root field tracks when traversal is complete.

A special value of -1 in i means it’s the first visit for this node
and it should be visited by the callback if it’s terminal.

The iterator is initialized with trie_iter_init. The max indicates
the maximum length of any key. A more elaborate implementation could
automatically grow the stack to accommodate (e.g. realloc()), but I’m
keeping it as simple as possible.
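
A sketch of the iterator and its initializer; exactly who owns the key
buffer is a guess (here the iterator allocates it), but the field
names match the trie_iter_next() code below:

struct trie_iter_frame {
    struct trie *p;   /* node we came from */
    int i;            /* which child pointer to try next */
};

struct trie_iter {
    struct trie *root;               /* where traversal started */
    char *buf;                       /* start of the key buffer */
    char *bufend;                    /* end of the key built so far */
    struct trie_iter_frame *stack;   /* top of the traversal stack */
    struct trie_iter_frame *stack_base;
};

int
trie_iter_init(struct trie_iter *it, struct trie *root, size_t max)
{
    it->root = root;
    it->buf = malloc(max + 1);
    it->stack_base = malloc(sizeof(*it->stack) * (max + 1));
    if (!it->buf || !it->stack_base) {
        free(it->buf);
        free(it->stack_base);
        return 0;
    }
    it->bufend = it->buf;
    it->stack = it->stack_base;
    it->stack->p = root;
    it->stack->i = -1;
    return 1;
}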

And finally the complicated part. This uses the allocated stack to
explore the trie in a loop until it hits a terminal, at which point it
returns. A further call continues the traversal from where it left
off. It’s like a hand-coded generator. With the way it’s
written, the caller is obligated to follow through with the entire
iteration before destroying the iterator, but this would be easy to
correct.

int
trie_iter_next(struct trie_iter *it)
{
    for (;;) {
        struct trie *current = it->stack->p;
        int i = it->stack->i++;
        if (i == -1) {
            /* Return result if terminal node. */
            if (current->flags & TRIE_TERMINAL_FLAG) {
                *it->bufend = 0;
                return 1;
            }
            continue;
        }
        if (i == TRIE_ALPHABET_SIZE) {
            /* End of current node. */
            if (current == it->root)
                return 0; // back at root, done
            it->stack--;
            it->bufend--;
            continue;
        }
        if (current->next[i]) {
            /* Push on next child node. */
            *it->bufend = 'A' + i;
            it->stack++;
            it->bufend++;
            it->stack->p = current->next[i];
            it->stack->i = -1;
        }
    }
}

This works, but it has some downsides. Initialization could fail (not
checked in the example) since it allocates memory.

Either the caller has to keep track of the maximum key length, or
the iterator grows the stack automatically, which would mean
iteration could fail at any point in the middle.

In order to destroy the trie, it needs to be traversed: Freeing
memory first requires allocating memory. If the program is out of
memory, it cannot destroy the trie to clean up before handling the
situation, nor to make more memory available. It’s not good for
resilience.

Wouldn’t it be nice to traverse the trie without memory allocation?

Modifying the Trie

Rather than allocate a separate stack, the stack can be allocated
across the individual nodes of the trie. Remember those p and i
fields from before? Put them on the trie.
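
A sketch of the modified node and its iterator, matching the names
used in the code below:

struct trie_v2 {
    struct trie_v2 *next[TRIE_ALPHABET_SIZE];
    struct trie_v2 *p;   /* traversal: node we came from */
    int i;               /* traversal: which child pointer to try next */
    unsigned flags;
};

struct trie_v2_iter {
    struct trie_v2 *current;
    char *buf;
};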

The iteration function itself is almost identical to before. Rather
than increment a stack pointer, it uses p to chain the nodes as a
linked list.

int
trie_v2_iter_next(struct trie_v2_iter *it)
{
    for (;;) {
        struct trie_v2 *current = it->current;
        int i = it->current->i++;
        if (i == -1) {
            /* Return result if terminal node. */
            if (current->flags & TRIE_TERMINAL_FLAG) {
                *it->buf = 0;
                return 1;
            }
            continue;
        }
        if (i == TRIE_ALPHABET_SIZE) {
            /* End of current node. */
            if (!current->p)
                return 0;
            it->current = current->p;
            it->buf--;
            continue;
        }
        if (current->next[i]) {
            /* Push on next child node. */
            *it->buf = 'A' + i;
            it->buf++;
            it->current = current->next[i];
            it->current->p = current;
            it->current->i = -1;
        }
    }
}

During traversal, the p pointers chain the nodes on the current path
back toward the starting node, forming the stack in place.

This is not without its downsides:

Traversal is not re-entrant nor thread-safe. It’s not possible to
run multiple in-place iterators side by side on the same trie since
they’ll clobber each other.

It uses more memory — O(n) rather than O(max-key-length) — and sits
on this extra memory for its entire lifetime.

Breadth-first Traversal

The same technique can be used for breadth-first search, which is
queue-oriented rather than stack-oriented. The p pointers are
instead chained into a queue, with a head and tail pointer
variable for each end. As each node is visited, its children are
pushed into the queue linked list.

This isn’t good for visiting keys by name. buf was itself a stack
and played nicely with depth-first traversal, but there’s no easy way
to build up a key in a buffer breadth-first. So instead here’s a
function to destroy a trie breadth-first.
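
A sketch of such a function, chaining the p pointers into the queue
just described:

#include <stdlib.h>

void
trie_v2_destroy(struct trie_v2 *root)
{
    struct trie_v2 *head = root;
    struct trie_v2 *tail = root;
    root->p = NULL;
    while (head) {
        /* Enqueue the children of the node at the head of the queue. */
        for (int i = 0; i < TRIE_ALPHABET_SIZE; i++) {
            struct trie_v2 *child = head->next[i];
            if (child) {
                child->p = NULL;
                tail->p = child;
                tail = child;
            }
        }
        /* Dequeue the head and free it. */
        struct trie_v2 *dead = head;
        head = head->p;
        free(dead);
    }
}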

Further Research

In my real code there’s also a flag to indicate the node’s allocation
type: static or heap. This allows a trie to be composed of nodes from
both kinds of allocations while still safe to destroy. It might also
be useful to pack a reference counter into this space so that a node
could be shared by more than one trie.

For a production implementation it may be worth packing i into the
flags field since it only needs a few bits, even with larger
alphabets. Also, I bet, as in Deutsch-Schorr-Waite, the p field
could be eliminated and instead one of the child pointers is
temporarily reversed. With these changes, this technique would fit
into the original struct trie without changes, eliminating the extra
memory usage.

Update: Over on Hacker News, psi-squared has interesting
suggestions such as leaving the traversal pointers intact,
particularly in the case of a breadth-first search, which, until the
next trie modification, allows for concurrent follow-up traversals.

Two months ago Emacs 25 was released and introduced a new dynamic
module feature. Emacs can now load shared libraries built
against Emacs’ module API, defined in emacs-module.h. What’s
interesting about this API is that it doesn’t require linking against
Emacs or any sort of library. Instead, at run time Emacs supplies the
module’s initialization function with function pointers for the entire
API.

As a demonstration, in this article I’ll build an Emacs joystick
interface (Linux only) using a dynamic module. It will allow Emacs to
read events from any joystick on the system. All the source code is
here:

Currently, Emacs’ emacs-module.h header is the entirety of the module
documentation. It’s a bit thin and leaves ambiguities that require
some reading of the Emacs source code. Even reading the source, it’s
not clear which behaviors are a reliable part of the interface. For
example, if there’s a pending non-local exit, it’s safe for a function
to return NULL since the return value is never inspected (Emacs
25.1), but will this always be the case? While mistakes are
unforgiving (a hard crash), the API is mostly intuitive and it’s been
pretty easy to feel my way around it.

Dynamic Module Types

All Emacs values — integers, floats, cons cells, vectors, strings,
etc. — are represented as the polymorphic, pointer-valued type,
emacs_value. Despite being a pointer, NULL is not a valid value,
as convenient as that would be. The API includes functions for
creating and extracting the fundamental types: integers, floats,
strings. Almost all other object types can only be accessed by making
Lisp function calls to regular Emacs functions from the module.

Modules also introduce a brand new Emacs object type: a user
pointer. These are non-readable, opaque pointer values
returned by modules, typically representing a handle to some resource,
be it a memory block, database connection, or a joystick. These
objects include a finalizer function pointer — which, surprisingly, is
not permitted to be NULL — and their lifetime is managed by Emacs’
garbage collector.

User pointers are a somewhat dangerous feature since there’s little to
stop Emacs Lisp code from misusing them. A Lisp program can take a
user pointer from one module and pass it to a function in a different
module. Since it’s just a pointer, there’s no way to type check it. At
best, a module could maintain a table of all its live pointers,
checking all user pointer arguments against the table before
dereferencing. But I don’t expect this to be normal practice.

Module Initialization

After loading the module through the platform’s mechanism, the first
thing Emacs does is check for the symbol plugin_is_GPL_compatible.
While tacky, this is not surprising given the culture around Emacs.

Next it calls emacs_module_init(), passing it the first function
pointer. From this, the module can get a Lisp environment and start
doing Emacs things, such as binding module functions to Lisp symbols.
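
Stripped down, the skeleton looks something like this:

#include "emacs-module.h"

int plugin_is_GPL_compatible;

int
emacs_module_init(struct emacs_runtime *ert)
{
    emacs_env *env = ert->get_environment(ert);
    /* ... intern symbols, create function objects, bind them ... */
    return 0;
}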

In a real module, it’s common to create function objects for native
functions, then fetch the fset symbol and make a Lisp call on it to
bind the newly-created function object to a name. You’ll see this in
action later.

Joystick API

The joystick API will closely resemble Linux’s own joystick API,
making for a fairly thin wrapper. It’s so thin that Emacs almost
doesn’t even need a dynamic module. This is because, on Linux,
joysticks are just files under /dev/input/. Want to see the input
events on the first joystick? Just read /dev/input/js0. So Plan 9.

Emacs already knows how to read files, but these virtual files are a
little too special for that. The header linux/joystick.h defines a
struct js_event:
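
/* Roughly as defined in <linux/joystick.h> (kernel integer types): */
struct js_event {
    __u32 time;     /* event timestamp in milliseconds */
    __s16 value;    /* value */
    __u8  type;     /* event type */
    __u8  number;   /* axis/button number */
};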

The idea is to read from the joystick device into this structure. The
first several reads are initialization that define the axes and
buttons of the joystick and their initial state. Further events are
queued up for the file descriptor. This all means that the file can’t
just be opened each time joystick input is needed. It has to be held
open for the duration, and is typically configured non-blocking.

The Emacs package will be called joymacs and there will be three
functions: joymacs-open, joymacs-close, and joymacs-read.

joymacs-open

The joymacs-open function will take an integer, opening the Nth
joystick (/dev/input/jsN). It will create a file descriptor for the
joystick device, returning it as a user pointer. Think of it as a sort
of “joystick handle.” Now, it could instead return the file
descriptor as an integer, but the user pointer has two significant
benefits:

The resource will be garbage collected. If the caller loses
track of a file descriptor returned as an integer, the joystick
device will be held open until Emacs shuts down, using up one of
Emacs’ file descriptors. By putting it in a user pointer, the
garbage collector will have the module release the file
descriptor if the user loses track of it.

It should be difficult for the user to make a dangerous call.
Emacs Lisp can’t create user pointers — they only come from modules
— and so the module is less likely to get passed the wrong thing.
In the case of joymacs-close, the module will be calling
close(2) on the argument. We definitely don’t want to make that
system call on file descriptors owned by Emacs. Further, since user
pointers are mutable, the module can ensure it doesn’t call
close(2) twice.

Here’s the implementation for joymacs-open. I’ll go over each part
in detail.
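
A sketch of roughly what it might look like (nil is a global
emacs_value bound during initialization, fin_close is the finalizer
sketched below, and the usual POSIX, <stdint.h>, and emacs-module.h
headers are assumed):

static emacs_value
joymacs_open(emacs_env *env, ptrdiff_t n, emacs_value *args, void *ptr)
{
    (void)n;
    (void)ptr;
    /* Which joystick? (/dev/input/jsN) */
    int id = env->extract_integer(env, args[0]);
    if (env->non_local_exit_check(env) != emacs_funcall_exit_return)
        return nil;
    /* Open the device read-only and non-blocking. */
    char path[64];
    snprintf(path, sizeof(path), "/dev/input/js%d", id);
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd == -1) {
        emacs_value signal = env->intern(env, "file-error");
        emacs_value data = env->make_string(env, path, strlen(path));
        env->non_local_exit_signal(env, signal, data);
        return nil;
    }
    /* Stuff the file descriptor into a user pointer. */
    return env->make_user_ptr(env, fin_close, (void *)(intptr_t)fd);
}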

The C function name doesn’t matter to Emacs. It’s static because the
function doesn’t even need to be visible outside the module. Emacs will
get the function pointer later as part of initialization.

This is the prototype for all functions callable by Emacs Lisp,
regardless of arity. It has four arguments:

It gets an environment, env, through which to call back into
Emacs.

It gets n, the number of arguments. This is guaranteed to be the
correct number of arguments, as specified later when creating the
function object, so only variadic functions need to inspect this
argument.

The Lisp arguments are passed as an array of values, args.
There’s no type declaration when declaring a function object, so
these may be of the wrong type. I’ll go over how to deal with this.

Finally, it gets an arbitrary pointer, supplied at function object
creation time. This allows the module to create closures, but will
usually be ignored.

The first thing the function does is extract its integer argument.
This is actually an intmax_t, but I don’t think anyone has that many
USB ports. An int will suffice.

As for not underestimating fools, what if the user passed a value that
isn’t an integer? Will the world come crashing down? Fortunately Emacs
checks that in extract_integer and, if there’s a mismatch, sets a
pending error signal in the environment. This is really great because
checking types directly in the module is a real pain in the ass. So,
before committing to anything further, such as opening a file, I check
for this signal and bail out early if necessary. In Emacs 25.1 it’s
safe to return NULL since the return value will be completely ignored,
but I’d rather hedge my bets.

By the way, the nil here is a global variable set in initialization.
You don’t just get that for free!

The next step is opening the joystick device, read-only and
non-blocking. The non-blocking is vital because the module would
otherwise hang Emacs later if there are no events (well, except for
the read being quickly interrupted by a POSIX signal).

If the joystick fails to open (e.g. it doesn’t exist, or the user
lacks permission), manually set an error signal for a non-local exit.
I chose the file-error signal and I’m just using the filename as the
signal data.

Otherwise create the user pointer. No need to allocate any memory;
just stuff it in the pointer itself. If the user mistakenly passes it
to another module, that module is sure to be in for a surprise when it tries to
dereference it.

The garbage collector will call this function when the user pointer is
lost. If the user closes it early with joymacs-close, that function
will set the user pointer to -1, an invalid file descriptor, so that
it doesn’t get closed a second time here.
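
That finalizer might look something like this (fin_close is the name
assumed in the sketch above):

static void
fin_close(void *ptr)
{
    int fd = (intptr_t)ptr;
    if (fd != -1)
        close(fd);
}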

joymacs-read

The joymacs-read function is doing something a little unusual for an
Emacs Lisp function. It takes two arguments: the joystick handle and a
5-element vector. Instead of returning the event in some
representation, it fills the vector with the event details. There are
two reasons for this:

The API has no function for creating vectors … though the module
could get the make-vector symbol and call it to create a
vector.

The idiom for event pumps is for the caller to supply a buffer to
the pump. This has better performance by avoiding lots of
unnecessary allocations, especially since events tend to be
message-like objects with a short, well-defined extent.

Otherwise fill out the event vector. If the second argument isn’t a
vector, or if it’s too short, the signal will automatically get raised
by Emacs. The module can keep plowing through the vec_set() calls
safely since it’s not committing to anything.

The Linux event struct has four fields and the function fills out five
values of the vector. This is because the type field has a bit flag
indicating initialization events. This is split out into an extra
t/nil value. It also normalizes axis values and converts button values
into t/nil, which makes more sense for Emacs Lisp. The event itself is
returned since it’s a truthy value and it’s convenient for the caller.

The astute programmer might notice that the negative side of the axis
could go just below -1.0, since INT16_MIN has one extra value over
INT16_MAX (two’s complement). It doesn’t seem to be documented, but
the joystick drivers I’ve seen never exactly return INT16_MIN, so
this is in fact the correct way to normalize it.

Initialization

All that’s left is the initialization function. First declare some
global variables to keep track of frequently-used symbols.

These are interned at the very beginning of initialization. The
symbols :button and :axis are given global references so that the
garbage collector doesn’t rip them out from under the module. It’s
unclear from the API, but the make_global_ref() function returns the
object being referenced. I trust that the t and nil symbols will never
be garbage collected, so these don’t need global references.
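
In this sketch, those globals and the interning might look like:

static emacs_value nil;
static emacs_value t;
static emacs_value button;
static emacs_value axis;

/* Inside emacs_module_init(): */
nil = env->intern(env, "nil");
t = env->intern(env, "t");
button = env->make_global_ref(env, env->intern(env, ":button"));
axis = env->make_global_ref(env, env->intern(env, ":axis"));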

Finally, bind the functions. The second and third arguments to
make_function are the minimum and maximum number of arguments, which
may look familiar. The last argument is that closure
pointer I mentioned at the beginning.
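
Binding one of them, joymacs-open, might go something like this:

emacs_value fn = env->make_function(env, 1, 1, joymacs_open,
                                    "Open joystick N, returning a handle.",
                                    NULL);
emacs_value symbol = env->intern(env, "joymacs-open");
emacs_value args[] = {symbol, fn};
env->funcall(env, env->intern(env, "fset"), 2, args);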

The source repository now includes a port to Windows (XInput). If
you’re on Linux or Windows, have Emacs 25 with modules enabled, and a
joystick is plugged in, then make run in the repository should bring
up Emacs running a joystick calibration demonstration. The module
can’t poke at Emacs when events are ready, so instead there’s a timer
that polls the module for events.

I’d like to someday see an Emacs Lisp game well-suited for a joystick.

In a C program, suppose I have a table of color names of similar
length. There are two straightforward ways to construct this table.
The most common would be an array of char *.

char *colors_ptr[] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

The other is a two-dimensional char array.

char colors_2d[][7] = {
    "red", "orange", "yellow", "green", "blue", "violet"
};

The initializers are identical, and the syntax by which these tables
are used is the same, but the underlying data structures are very
different. For example, suppose I had a lookup() function that
searches the table for a particular color.
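
A hypothetical lookup() over the two-dimensional table; swapping in
colors_ptr requires no change to the body:

#include <string.h>

int
lookup(const char *name)
{
    int n = sizeof(colors_2d) / sizeof(colors_2d[0]);
    for (int i = 0; i < n; i++)
        if (strcmp(colors_2d[i], name) == 0)
            return i;
    return -1;
}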

Thanks to array decay — array arguments are implicitly converted to
pointers (§6.9.1-10) — it doesn’t matter if the table is char
colors[][7] or char *colors[]. It’s a little bit misleading because
the compiler generates different code depending on the type.

Memory Layout

Here’s what colors_ptr, a jagged array, typically looks like in
memory.

The array of six pointers will point into the program’s string table,
usually stored in a separate page. The strings aren’t in any
particular order and will be interspersed with the program’s other
string constants. The type of the expression colors_ptr[n] is char *.

On x86-64, suppose the base of the table is in rax, the index of the
string I want to retrieve is rcx, and I want to put the string’s
address back into rax. It’s one load instruction.

mov rax, [rax + rcx*8]

Contrast this with colors_2d: six 7-byte elements in a row. No
pointers or addresses. Only strings.

The strings are in their defined order, packed together. The type of
the expression colors_2d[n] is char [7], an array rather than a
pointer. If this was a large table used by a hot function, it would
have friendlier cache characteristics — both in locality and
predictability.

In the same scenario before with x86-64, it takes two instructions to
put the string’s address in rax, but neither is a load.

imul rcx, rcx, 7
add rax, rcx

In this particular case, the generated code can be slightly improved
by increasing the string size to 8 (e.g. char colors_2d[][8]). The
multiply turns into a simple shift and the ALU no longer needs to be
involved, cutting it to one instruction. This looks like a load due to
the LEA (Load Effective Address), but it’s not.

lea rax, [rax + rcx*8]

Relocation

There’s another factor to consider: relocation. Nearly every process
running on a modern system takes advantage of a security feature
called Address Space Layout Randomization (ASLR). The virtual address
of code and data is randomized at process load time. For shared
libraries, it’s not just a security feature, it’s essential to their
basic operation. Libraries cannot possibly coordinate their preferred
load addresses with every other library on the system, and so must be
relocatable.

If the program is compiled with GCC or Clang configured for position
independent code — -fPIC (for libraries) or -fpie + -pie (for
programs) — extra work has to be done to support colors_ptr. Those
are all addresses in the pointer array, but the compiler doesn’t know
what those addresses will be. The compiler fills the elements with
temporary values and adds six relocation entries to the binary, one
for each element. The loader will fill out the array at load time.

However, colors_2d doesn’t have any addresses other than the address
of the table itself. The loader doesn’t need to be involved with each
of its elements. Score another point for the two-dimensional array.

On x86-64, in both cases the table itself typically doesn’t need a
relocation entry because it will be RIP-relative (in the small code
model). That is, code that uses the table will be at a fixed
offset from the table no matter where the program is loaded. It won’t
need to be looked up using the Global Offset Table (GOT).

In case you’re ever reading compiler output, in Intel syntax the
assembly for putting the table’s RIP-relative address in rax looks
like so:

;; NASM:
lea rax, [rel address]
;; Some others:
lea rax, [rip + address]

Or in AT&T syntax:

lea address(%rip), %rax

Virtual Memory

Besides (trivially) more work for the loader, there’s another
consequence to relocations: Pages containing relocations are not
shared between processes (except after fork()). When loading a
program, the loader doesn’t copy programs and libraries to memory so
much as it memory maps their binaries with copy-on-write semantics. If
another process is running with the same binaries loaded (e.g.
libc.so), they’ll share the same physical memory so long as those
pages haven’t been modified by either process. Modifying the page
creates a unique copy for that process.

Relocations modify parts of the loaded binary, so these pages aren’t
shared. This means colors_2d has the possibility of being shared
between processes, but colors_ptr (and its entire page) definitely
does not. Shucks.

This is one of the reasons why the Procedure Linkage Table (PLT)
exists. The PLT is an array of function stubs for shared library
functions, such as those in the C standard library. Sure, the loader
could go through the program and fill out the address of every
library function call, but this would modify lots and lots of code
pages, creating a unique copy of large parts of the program. Instead,
the dynamic linker lazily supplies jump addresses for PLT
function stubs, one per accessed library function.

However, as I’ve written it above, it’s unlikely that even colors_2d
will be shared. It’s still missing an important ingredient: const.

Const

They say const isn’t for optimization but, darnit, this
situation keeps coming up. Since colors_ptr and colors_2d are both
global, writable arrays, the compiler puts them in the same writable
data section of the program, and, in my test program, they end up
right next to each other in the same page. The other relocations doom
colors_2d to being a local copy.

Fortunately it’s trivial to fix by adding a const:

const char colors_2d[][7] = {/* ... */};

Writing to this memory is now undefined behavior, so the compiler is
free to put it in read-only memory (.rodata) and separate from the
dirty relocations. On my system, this is close enough to the code to
wind up in executable memory.

Note, the equivalent for colors_ptr requires two const qualifiers,
one for the array and another for the strings. (Obviously the const
doesn’t apply to the loader.)

const char *const colors_ptr[] = {/* ... */};

String literals are already effectively const, though the C
specification (unlike C++) doesn’t actually define them to be this
way. But, like setting your relationship status on Facebook, declaring
it makes it official.

It’s just micro-optimization

These little details are all deep down the path of micro-optimization
and will rarely ever matter in practice, but perhaps you learned
something broader from all this. This stuff fascinates me.

I’ve worked on many programs that frequently require small,
short-lived buffers for use as a temporary workspace, perhaps to
construct a string or array. In C this is often accomplished with
arrays of automatic storage duration (i.e. allocated on the
stack). This is dirt cheap — much cheaper than a heap allocation —
and, unlike a typical general-purpose allocator, involves no thread
contention. However, the catch is that there may be no hard bound on
the buffer size. For correctness, the scratch space must scale
appropriately to match its input. Whatever arbitrary buffer size I pick
may be too small.

A widespread extension to C is the alloca() pseudo-function. It’s like
malloc(), but allocates memory on the stack, just like an automatic
variable. The allocation is automatically freed when the function (not
its scope!) exits, even with a longjmp() or other non-local exit.

void *alloca(size_t size);

Besides its portability issues, the most dangerous property is the
complete lack of error detection. If size is too large, the
program simply crashes, or worse.

For example, suppose I have an intern() function that finds or creates
the canonical representation/storage for a particular string. My
program needs to intern a string composed of multiple values, and will
construct a temporary string to do so.
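
A sketch of what that might look like with alloca(), assuming an
intern() provided elsewhere:

#include <alloca.h>
#include <stdio.h>
#include <string.h>

const char *intern(const char *);  /* provided elsewhere */

const char *
intern_prefixed(const char *prefix, const char *name)
{
    size_t len = strlen(prefix) + 1 + strlen(name) + 1;
    char *buf = alloca(len);               /* no way to detect failure */
    snprintf(buf, len, "%s.%s", prefix, name);
    return intern(buf);
}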

I expect the vast majority of these prefix strings to be very small,
perhaps on the order of 10 to 80 bytes, and this function will handle
them extremely efficiently. But should this function get passed a huge
prefix, perhaps by a malicious actor, the program will misbehave
without warning.

A portable alternative to alloca() is variable-length arrays (VLA),
introduced in C99. Arrays with automatic storage duration need not
have a fixed, compile-time size. It’s just like alloca(), having
exactly the same dangers, but at least it’s properly scoped. It
was rejected for inclusion in C++11 due to this danger.

There’s a middle-ground to this, using neither VLAs nor alloca().
Suppose the function always allocates a small, fixed size buffer —
essentially a free operation — but only uses this buffer if it’s large
enough for the job. If it’s not, a normal heap allocation is made with
malloc().
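
Reworking the sketch above along those lines (this version additionally
needs <stdlib.h>):

const char *
intern_prefixed(const char *prefix, const char *name)
{
    char small[64];                        /* covers the common case */
    char *buf = small;
    size_t len = strlen(prefix) + 1 + strlen(name) + 1;
    if (len > sizeof(small)) {
        buf = malloc(len);
        if (!buf)
            return NULL;                   /* allocation error */
    }
    snprintf(buf, len, "%s.%s", prefix, name);
    const char *result = intern(buf);
    if (buf != small)
        free(buf);
    return result;
}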

Since the function can now detect allocation errors, this version has
an error condition. Though, intern() itself would presumably return
NULL for its own allocation errors, so this is probably transparent to
the caller.

We’ve now entered the realm of small-size optimization. The vast
majority of cases are small and will therefore be very fast, but we
haven’t given up on the odd large case either. In fact, it’s been made
a little bit worse (via the unnecessary small allocation), selling
it out to make the common case fast. That’s sound engineering.

Visual Studio has a pair of functions that nearly automate this
solution: _malloca() and _freea(). It’s like alloca(), but
allocations beyond a certain threshold go on the heap. This allocation
is freed with _freea(), which does nothing in the case of a stack
allocation.

void *_malloca(size_t);
void _freea(void *);

I said “nearly” because Microsoft screwed it up: instead of returning
NULL on failure, it generates a stack overflow structured exception
(for a heap allocation failure).

I haven’t tried it yet, but I bet something similar to malloca() /
freea() could be implemented using a couple of macros.

Toward Structured Small-Size Optimization

CppCon 2016 was a couple weeks ago, and I’ve begun catching up on the
talks. I don’t like developing in C++, but I always learn new,
interesting concepts from this conference, many of which apply
directly to C. I look forward to Chandler Carruth’s talks the most,
having learned so much from his past talks. I recommend all of his talks.

Chandler’s talk this year was the one on hybrid data structures. I’d
already been mulling over small-size optimization for months, and the
first 5–10 minutes of his talk showed me I was on the right track. In
his talk he describes LLVM’s SmallVector class (among others), which
is basically a small-size-optimized version of std::vector, which, due
to constraints on iterators under std::move() semantics, can’t itself
be small-size optimized.

I picked up a new trick from this talk, which I’ll explain in C’s
terms. Suppose I have a dynamically growing buffer “vector” of long
values. I can keep pushing values into the buffer, doubling the
storage in size each time it fills. I’ll call this one “simple.”

struct vec_simple {
    size_t size;
    size_t count;
    long *values;
};

Initialization is obvious. Though for easy overflow checks, and for
another reason I’ll explain later, I’m going to require the starting
size, hint, to be a power of two. It returns 1 on success and 0 on
error.

int
vec_simple_init(struct vec_simple *v, size_t hint)
{
    assert(hint && (hint & (hint - 1)) == 0); // power of 2
    v->size = hint;
    v->count = 0;
    v->values = malloc(sizeof(v->values[0]) * v->size);
    return !!v->values;
}

Pushing is straightforward, using realloc() when the buffer fills,
returning 0 for integer overflow or allocation failure.
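
Something like this, mirroring the small-buffer version later on:

int
vec_simple_push(struct vec_simple *v, long x)
{
    if (v->count == v->size) {
        size_t value_size = sizeof(v->values[0]);
        size_t new_size = v->size * 2;
        if (!new_size || value_size > (size_t)-1 / new_size)
            return 0; // overflow
        void *new_values = realloc(v->values, new_size * value_size);
        if (!new_values)
            return 0; // out of memory
        v->size = new_size;
        v->values = new_values;
    }
    v->values[v->count++] = x;
    return 1;
}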

And finally, cleaning up. I hadn’t thought about this before, but if
the compiler manages to inline vec_simple_free(), that NULL pointer
assignment will probably get optimized out, possibly even in the face
of a use-after-free bug.
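
That cleanup might look like:

void
vec_simple_free(struct vec_simple *v)
{
    free(v->values);
    v->values = NULL;  // trap use-after-free bugs
}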

If the common case is only a handful of long values, and this
function is called frequently, we’re doing a lot of heap allocation
that could be avoided. Wouldn’t it be nice to put all that on the
stack?

Applying Small-Size Optimization

Modify the struct to add this temp field. It’s probably obvious what
I’m getting at here. This is essentially the technique in SmallVector.

struct vec_small {
    size_t size;
    size_t count;
    long *values;
    long temp[16];
};

The values field is initially pointed at the small buffer. Notice
that unlike the “simple” vector above, this initialization function
cannot fail. It’s one less thing for the caller to check. It also
doesn’t take a hint since the buffer size is fixed.
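
Something like:

void
vec_small_init(struct vec_small *v)
{
    v->size = sizeof(v->temp) / sizeof(v->temp[0]);
    v->count = 0;
    v->values = v->temp;
}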

Pushing gets a little more complicated. If it’s the first time the
buffer has grown, the realloc() has to be done “manually” with
malloc() and memcpy().

int
vec_small_push(struct vec_small *v, long x)
{
    if (v->count == v->size) {
        size_t value_size = sizeof(v->values[0]);
        size_t new_size = v->size * 2;
        if (!new_size || value_size > (size_t)-1 / new_size)
            return 0; // overflow
        void *new_values;
        if (v->temp == v->values) {
            /* First time heap allocation. */
            new_values = malloc(new_size * value_size);
            if (new_values)
                memcpy(new_values, v->temp, sizeof(v->temp));
        } else {
            new_values = realloc(v->values, new_size * value_size);
        }
        if (!new_values)
            return 0; // out of memory
        v->size = new_size;
        v->values = new_values;
    }
    v->values[v->count++] = x;
    return 1;
}

Finally, only call free() if the buffer was actually allocated on the
heap.
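
Something like:

void
vec_small_free(struct vec_small *v)
{
    if (v->values != v->temp)
        free(v->values);
    v->values = NULL;
}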

If 99% of these vectors never exceed 16 elements, then 99% of the time
the heap isn’t touched. That’s much better than before. The 1% case
is still covered, too, at what is probably an insignificant cost.

An important difference to SmallVector is that they parameterize the
small buffer’s size through the template. In C we’re stuck with fixed
sizes or macro hacks. Or are we?

Using a Caller-Provided Buffer

This time remove the temporary buffer, making it look like the simple
vector from before.

struct vec_flex {
    size_t size;
    size_t count;
    long *values;
};

The user will provide the initial buffer, which will presumably be an
adjacent, stack-allocated array, but whose size is under the user’s
control.

void
vec_flex_init(struct vec_flex *v, long *init, size_t nmemb)
{
    assert(nmemb > 1); // we need that low bit!
    assert(nmemb && (nmemb & (nmemb - 1)) == 0); // power of 2
    v->size = nmemb | 1;
    v->count = 0;
    v->values = init;
}

The power of two size, greater than one, means the size will always be
an even number. Why is this important? There’s one piece of
information missing from the struct: Is the buffer currently heap
allocated or not? That’s just one bit of information, but adding just
one more bit to the struct will typically pad it out another 31 or 63
more bits. What a waste! Since I’m not using the lowest bit of the
size (always being an even number), I can smuggle it in there. Hence
the nmemb | 1, the 1 indicating that it’s not heap allocated.

When pushing, the actual_size is extracted by clearing the bottom
bit (size & ~1) and the indicator bit is extracted with a 1 bit mask
(size & 1). The bit is cleared by virtue of not intentionally
setting it again.
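
A sketch of the push with those checks in place:

int
vec_flex_push(struct vec_flex *v, long x)
{
    size_t actual_size = v->size & ~(size_t)1;
    int is_initial = v->size & 1;
    if (v->count == actual_size) {
        size_t value_size = sizeof(v->values[0]);
        size_t new_size = actual_size * 2;
        if (!new_size || value_size > (size_t)-1 / new_size)
            return 0; // overflow
        void *new_values;
        if (is_initial) {
            /* First growth: leave the caller's buffer behind. */
            new_values = malloc(new_size * value_size);
            if (new_values)
                memcpy(new_values, v->values, actual_size * value_size);
        } else {
            new_values = realloc(v->values, new_size * value_size);
        }
        if (!new_values)
            return 0; // out of memory
        v->size = new_size;  /* even, so the low bit now means "heap" */
        v->values = new_values;
    }
    v->values[v->count++] = x;
    return 1;
}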

If you were to log all vector sizes as part of profiling, and the
assumption about their typical small number of elements was correct,
you could easily tune the array size in each case to remove the vast
majority of vector heap allocations.

Now that I’ve learned this optimization trick, I’ll be looking out for
good places to apply it. It’s also a good reason for me to stop
abusing VLAs.