Welcome back and thanks for joining us for the thirteenth installment of our
series on ELF files: what they are, what they can do, what the dynamic linker
does to them, and how we can do it ourselves.

I've been pretty successfully avoiding talking about TLS so far (no, not that
one) but I guess we've reached a point where it cannot be delayed any further, so.

Sure, we may execute some instructions, we may even politely request that
certain devices tend to our needs but ultimately the one who's calling the
shots is the kernel. We are tolerated here. Honestly, the kernel would
rather nothing execute at all.

Occasionally though, the kernel will let non-kernel code execute. And again,
it's in charge of exactly how and when that happens - and for how long.

By now we've formed a fairly good idea of how processes are loaded into
memory: the kernel parses the file we want to execute (if it's ELF, it isn't
interested in nearly as many things as we are), maps a few things, then
“hands off” control to it.

But what does hand off mean? In concrete terms, what happens? Well, today's
not the day we get into kernel debugging (although… nah. unless? no.), but
we sure can get a rough idea what's going on.

What is a computer? A miserable little pile of registers. That's right - it's
global variables all the way down.

Here's the value of some of the CPU registers just as echidna's main function
starts executing:

Is that all of them? Nope! There's 128-bit registers (SSE), 256-bit registers
(AVX), 512-bit registers (AVX-512) - and of course we still have the x87/FPU
registers, from back when you needed a co-processor for
that.

TL;DR - it's a whole mess. The point is, we have a bunch of global
variables that are, like, really fast to read from and write to. So
optimizing compilers tend to want to use them whenever possible.

And by “them” I mean the general-purpose ones in the bunch - from %rax to
%r15. And sometimes, if your optimizer feels particularly giddy, some of
the %xmmN registers as well (as we have painfully learned in the last
article).

And then there's special-purpose registers, like cs, ss, ds, es, etc.
We're not overly concerned with those four in particular, because we're on
64-bit Linux and our memory model is somewhat simpler.

In fact, we've been using registers all this time to send the kernel love
letters - in echidna's write function for example:

So both the kernel and userland applications use registers. One of my
favorite registers - seeing as I'm in the middle of writing a series
about ELF files - is %rip, the instruction pointer.

I'm being told that it wasn't always that simple, but on 64-bit Linux, it
just points to the (virtual) address of the code we're currently executing.
Whenever program execution moves forward, so does %rip - by however many
bytes it took to encode the instruction that was just executed:

So, this answers part of our question - how does the kernel “hand off”
control to a program: it just changes %rip! And the processor does the
rest. Well. Sorta kinda. “Among other things”, let's say.

(Note that, on x86, you can't write to the %rip register directly - you
have to use instructions like jmp, call, or ret.)

To be fair, it also switches from ring 0 to ring 3 - again, something we've
briefly discussed in Reading files the hard way Part
2. And it switches from the
“kernel virtual address space” to the “userland virtual address space”.

Point is - that's also how switching between processes works. As far as the
user is concerned, processes execute in parallel, but as far as the kernel is
concerned, its scheduler is handing out slices of time. Whenever it lets
process “foo” execute for a bit, it:

- Sets up a system timer interrupt
- Restores the state of all CPU registers to what they were when “foo” was
  last interrupted
- Switches from Ring 0 to Ring 3, also jumping to whatever address %rip
  had when process “foo” was last interrupted

Eventually, the system timer interrupt goes off, and execution immediately
jumps back to the kernel's interrupt handler - at which point the kernel
decides whether the process has been naughty or nice and whether it merits
more time.

If not - for example, if it decides we really should be giving process “bar”
more time next, then the kernel saves the state of “foo”, (most of the
registers), resets a bunch of CPU state (mostly memory caches), and switches
to “bar” the way we've just described.

That's the very distant overview of things. It's also not entirely correct.
But for our purposes, it's correct enough.

That's for processes. But what about threads? Threads are also “preemptive
multitasking” - instead of explicitly relinquishing control, their execution
can be violently interrupted (ie. “preempted”) so that other threads can be
executed.

Cool bear's hot tip

The “other” multitasking is cooperative multitasking - which you don't need
the kernel's help to do. That's how coroutines work - just bits of userland
state that all play nice together when it comes to deciding whose turn it is.

Switching between threads is simpler, though, because all threads of a
given process share the same address space. So there's less state to save
and restore when switching from one to the other.
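That shared address space is easy to observe from userland. Here's a quick
Rust illustration (nothing elk-specific): several threads bump one global
counter through the very same virtual address, no copying involved.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// A single address space: every thread sees (and mutates) the same
// `SHARED` global through the same virtual address.
static SHARED: AtomicUsize = AtomicUsize::new(0);

fn spawn_and_count(n_threads: usize, bumps: usize) -> usize {
    let before = SHARED.load(Ordering::SeqCst);
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..bumps {
                    // no copies, no message passing - just shared memory
                    SHARED.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    SHARED.load(Ordering::SeqCst) - before
}

fn main() {
    println!("all threads saw the same counter: {}", spawn_and_count(4, 1000));
}
```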

But then… the question arises: how do you tell threads apart? If several
threads are started with the same entry point, how do you know which is
which? Is that something the CPU handles? or the kernel?

t a a i r fs is just the obscure way of saying thread apply all info register fs.

That's right - whenever it's not ambiguous, GDB lets you shorten any command or
option name. In fact, if you see a shortcut being used and you're not sure what it does,
you can ask gdb, since its help command also accepts the shortcut form.

Is this a lie? Yes. If that were the case, pthread_self would try to read
from memory address 0x0+0x10 and definitely segfault.

But it doesn't:

(gdb) print (void*) pthread_self()
[Switching to Thread 0x7ffff7da5700 (LWP 14474)]
The program stopped in another thread while making a function call from GDB.
Evaluation of the expression containing the function
(pthread_self) will be abandoned.
When the function is done executing, GDB will silently stop.

So GDB is lying. But it's not entirely surprising - the %fs register
is thread-local (on Linux 64-bit! remember that whatever a register is
used for is entirely defined in the ABI and it's up to the kernel to make it
so), and GDB itself is running its own threads distinct from the inferior's
threads.

Cool bear's hot tip

It's been a while since we've been over weird GDB terminology, so, just in
case, the “inferior” is the “process being debugged”. I know. Weird. Moving on.

Is there another way to grab the contents of the %fs register? Sure there
is! We can ask the kernel politely via the arch_prctl syscall. We'll use
libc's wrapper for it:
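If we didn't want to go through libc at all, the raw syscall is simple enough
to sketch in Rust - x86-64 Linux only, and note that 0x1003 and 158 are the
ARCH_GET_FS operation and arch_prctl syscall numbers from the kernel headers:

```rust
use std::arch::asm;

const ARCH_GET_FS: u64 = 0x1003; // from <asm/prctl.h>
const SYS_ARCH_PRCTL: u64 = 158; // x86-64 syscall number

/// Asks the kernel for the current thread's %fs base.
fn get_fs_base() -> u64 {
    let mut fs_base: u64 = 0;
    let ret: u64;
    unsafe {
        asm!(
            "syscall",
            inlateout("rax") SYS_ARCH_PRCTL => ret,
            in("rdi") ARCH_GET_FS,
            in("rsi") &mut fs_base as *mut u64,
            // the syscall instruction clobbers rcx and r11
            lateout("rcx") _,
            lateout("r11") _,
        );
    }
    assert_eq!(ret, 0, "arch_prctl failed");
    fs_base
}

fn main() {
    println!("%fs base = {:#x}", get_fs_base());
}
```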

A short (and mostly incorrect) history of Intel chips

An Intel 8008 chip

Digital Equipment Corporation (DEC), Fairchild Semiconductor and National
Semiconductor have all released some form of 16-bit microprocessor. One year
prior, National even released the
PACE, a single
chip based loosely on its own IMP-16 design.

Meanwhile, Intel is one year into the iAPX
432 project, which.. really
warrants at least one entire article. Ada was the intended programming language
for the processor, and it supported object-oriented programming and capability-based addressing.

The iAPX 432 project was struggling though - turns out those abstractions
weren't free. Not only did they require significantly more transistors,
performance of equivalent programs suffered compared to competing
microprocessors.

So, in May of 1976, the folks at Intel go “okay, let's work on some 16-bit
chip that we can release before iAPX 432 is done cooking”. This is one month
before Texas Instruments (TI) releases the
TMS9900, another
single-chip 16-bit microprocessor - the pressure is real.

But what does “a 16-bit chip” really mean? Well actually… it all depends.

For example, I've referred to the Intel
8008 as an “8-bit chip” - but it's
not that simple.

Sure, the registers of the 8008 were eight bits. Each bit can be on or off:

Each bit also corresponds to a power of two - by adding the power of two
of each of the “on” bits, we get the value as an unsigned integer:
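In code, that's just a sum over the set bits - a throwaway Rust sketch:

```rust
// Sum the power of two of each "on" bit; bits[0] is the most
// significant bit here, worth 2^7.
fn byte_value(bits: [bool; 8]) -> u8 {
    bits.iter()
        .enumerate()
        .map(|(i, &on)| if on { 1u8 << (7 - i) } else { 0 })
        .sum()
}

fn main() {
    // 0b0010_1010 = 32 + 8 + 2 = 42
    let bits = [false, false, true, false, true, false, true, false];
    println!("{}", byte_value(bits));
}
```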

Signed integers are a bit more involved - and floating-point numbers are even
more involved. But let's not get too distracted.

If you only used eight bits to encode memory addresses, then you could only
address, well, 256 bytes of memory.

Which is very little. Like, not enough for a non-trivial program.

So, even eight-bit chips usually had a larger “address bus”. The 8008
had a 14-bit address bus - which means the width of its PC register
(program counter, which we call instruction pointer on x86-64) was..
14 bits.

How do you manipulate 14-bit addresses with 8-bit general-purpose registers?
With two of them! Why 14-bit and not 16-bit? Well, when you're making a
chip, every pin counts:

The chip has a 8 bit wide data bus and 14 bit wide address bus, which can
address 16 KB of memory. Since Intel could only manufacture 18 pin DIP
packages at 1972, the bus has to be three times multiplexed. Therefore the
chip's performance is very limited and it requires a lot external logic to
decode all signals.

So, thanks to pin multiplexing, the 8008 could address 16KiB of memory.
Which is still not a lot. And back in the 70s, Intel was a startup
devoted to making memory chips. It
stands to reason they'd like people to use microprocessors that allow
addressing a lot more memory.

The 8086's design is bigger. It ships in a 40-pin package, so they're able to
bump the number of address pins to 20 - still with some multiplexing. And with
a 20-bit address bus, the 8086 is able to provide a whopping 1 MiB physical
address space.

Intel C8086 Chip

But just as before, the 8086's general-purpose registers are smaller -
they're only 16 bits. A single register is still not enough to refer to a
physical memory address.

What to do? Use segments! The 8086 introduces four segment registers:
the code segment (CS), from which instructions are read, the data segment
(DS) for general memory, the stack segment (SS), and the extra segment
(ES), useful as a temporary storage space when you need to copy from one
area of memory to another.

Instructions would typically take 16-bit offset arguments and, depending on
the nature of the instruction, the CPU would add that offset to the relevant
segment register, shifted left by four bits. Each of the 8086's segment
registers was… also 16 bits. 16 + 16 = 20, all is well.

That means that, for the 8086, each single memory address can be referred to
by 4096 different segment:offset pairs.
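The arithmetic is easy to check - here's a quick Rust sketch of the 8086's
physical address computation (20-bit wrap included):

```rust
// 8086-style address translation: physical = segment * 16 + offset,
// truncated to the 20-bit address bus.
fn physical(seg: u16, off: u16) -> u32 {
    (((seg as u32) << 4) + off as u32) & 0xF_FFFF
}

fn main() {
    // Two different segment:offset pairs, same physical address:
    assert_eq!(physical(0x1234, 0x0010), physical(0x1235, 0x0000));

    // And for an address far enough from either end of the address
    // space, there are exactly 4096 such pairs:
    let target: u32 = 0x2_5000;
    let aliases = (0u16..=0xFFFF)
        .filter(|&seg| {
            let base = (seg as u32) << 4;
            base <= target && target - base <= 0xFFFF
        })
        .count();
    println!("{} segment:offset pairs alias {:#x}", aliases, target);
}
```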

This also means that - as long as your entire program (code and data) fits
within a single 64K segment - you can have nice offsets that start at 0 (for
your segment).

If it doesn't fit in a single 64K segment, well then your offsets don't fit
in 16 bits anymore, and you have to start juggling between different segments,
and deal with funky pointer sizes.

If you want to refer to memory in the same segment, you can use a near pointer:

If you want to refer to memory in another segment, you can use a far pointer

If you want to refer to memory in another segment and you may have pointer arithmetic that changes the pointer's value to refer to yet another segment, you can use a huge pointer:

Needless to say, writing code for this architecture was not pleasant.

The 286

In 1982, Intel launches the 80286, which we'll just call the 286, and which
introduces several novelties. First off, the address pins are no longer
multiplexed - the chip has 68 pins, 24 of which are dedicated to the address
bus.

A “ring” is a privilege level, and the current privilege level is stored
in the lower two bits of the CS register. And what do you know, our sample program
is running…

(gdb) p/u $cs & 0b11
$24 = 3

…in Ring 3! As it should, since it's a regular userland program, not kernel code.

However the 286's protected mode is kind of annoying to use - for starters,
it breaks compatibility with old 8086 applications. And to make things worse,
once you switch it from “real” mode to “protected” mode, you can't switch
back without performing a hard reset.

But, the few applications that do make use of the 286's protected mode are
able to use the full 24-bit physical address space: 16 MiB. In theory. In
practice, 286 motherboards only support up to 4 MiB of RAM - and even then,
buying that much memory is prohibitively expensive.

Fast forward to 1985. The Japan-US semiconductor war is raging. Intel eventually decides to stop producing DRAM, now focusing on microprocessors.

The 386

In October of 1985, Intel releases the 80386 (which we'll call “the 386”) -
the first implementation of the 32-bit extension to the 80286 architecture.
Finally, finally, the data width and the address width are the same: 32
bits.

Intel 80386DX Chip

Which means - in theory - the 386 is able to address 4 GiB of RAM.

Advertisement for Memory Boards by Tall Tree Systems

InfoWorld, September of 1985

In practice though, boards that let you have that much memory - or anywhere
close to it - do not exist. Even a couple megabytes of RAM will set you back
a small fortune.

The advertisement shown above reads:

Tall Tree Systems presents JRAM-3, the newest member of the JRAM family.
JRAM-3 is a fourth generation multifunction memory board and the successor of
the highly praised JRAM-2. Designed to meet the latest expanded memory
specification standard being implemented by the major spreadsheet vendors,
JRAM-3 can access up to eight megabytes of memory for larger, more efficient
spreadsheets. JRAM-3 can also be used for DOS memory, electronic disk, print
spooler, and program swapping applications!

Determined to maintain our reputation as the price leader in memory
expansion, Tall Tree Systems offers JRAM-3 fully populated with two megabytes
for an amazing $699.

Nevertheless, the 386 is a game changer. So much so that Intel will go on to
produce 386 chips until
2007.

Paging is a huge deal. Although the concept existed previously in non-mass-market
computers, having it
in the 386, a consumer-grade x86 device, enabled tons of cool tricks.

We said the 8086 had “segment registers”. And we've also used the word “segments”
to refer to different parts of an ELF file…

Cool bear's hot tip

Oh look at him, tying his history lesson back into the series… nice going pal.

…and that's not a coincidence! Before paging, even in protected mode, a program
had to be loaded contiguously in physical memory. If you didn't have a contiguous
area in physical memory that could fit the entire program, you.. could not load it.

This issue of “memory fragmentation” became much less of a problem with virtual
address spaces, since you could map virtual pages to any available physical
pages:

The program's memory appears contiguous - in virtual memory it is. In physical
memory, it isn't, but that's an implementation detail. It's the MMU's job.
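A toy MMU in Rust makes the trick visible - three contiguous virtual pages
backed by whatever physical pages happened to be free (page numbers made up):

```rust
use std::collections::HashMap;

const PAGE_SIZE: u64 = 4096;

// A toy page table: virtual page number -> physical page number.
struct Mmu {
    page_table: HashMap<u64, u64>,
}

impl Mmu {
    fn translate(&self, vaddr: u64) -> Option<u64> {
        let vpn = vaddr / PAGE_SIZE;
        let offset = vaddr % PAGE_SIZE;
        // The low bits pass through untouched; only the page number
        // gets remapped.
        self.page_table.get(&vpn).map(|ppn| ppn * PAGE_SIZE + offset)
    }
}

fn main() {
    // Virtual pages 0, 1, 2 are contiguous; physical pages 7, 2, 42
    // are anything but.
    let mmu = Mmu {
        page_table: [(0, 7), (1, 2), (2, 42)].into_iter().collect(),
    };
    for vaddr in [0x0u64, 0x1000, 0x2000, 0x2004] {
        println!("{:#07x} -> {:?}", vaddr, mmu.translate(vaddr));
    }
}
```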

The 64-bit era

The story doesn't end with the 386 of course. In 2001, Intel and HP introduce
the IA-64 architecture, with a
VLIW instruction
set.

IA-64 makes a lot of changes, mostly as a means to enable parallelism with
the help of the compiler. It has 128 64-bit integer and floating-point
registers, performs speculation and branch prediction, and other cool tricks.

This new architecture completely breaks compatibility with x86, which is fine
because it's geared for enterprise servers - and those clients can afford to
recompile their applications for a new architecture. Right? Right.

Anyway, in 2003, AMD releases its own 64-bit architecture, which is “just”
a set of x86 extensions, which means it's backward-compatible with… pretty
much everything relevant on desktop? The exception being
PowerPC, which Apple will still be
shipping for 3 years.

AMD releases not only a series of workstation processors,
Opteron, but also consumer-grade
processors like the Athlon 64.

And with that move, 64-bit computing moves into the mainstream. The IA-64
architecture eventually loses the war against the more traditional AMD64,
and Intel starts shipping AMD64 processors, rebranding the architecture as,
successively, “IA-32e”, “EM64T”, and finally “Intel 64”.

The first Intel consumer-grade desktop processor to implement “Intel 64” is
the Pentium 4 “Prescott” - and this paves the way for at least two decades of
the architecture we usually refer to as “x86-64” being mainstream.

Intel Pentium 4 Prescott SL79K chip

So there you have it - in just 31 years, we moved from 8-bit chips to
64-bit chips. And for one glorious moment in the 2000s, AMD led the charge
and Intel had to follow:

| Year | Model          | Pins | Data width | Address width | Address space |
|------|----------------|------|------------|---------------|---------------|
| 1972 | Intel 8008     | 18   | 8          | 14            | 16 KiB        |
| 1978 | Intel 8086     | 40   | 16         | 20            | 1 MiB         |
| 1982 | Intel 80286    | 68   | 16         | 24            | 16 MiB        |
| 1985 | Intel 80386    | 132  | 32         | 32            | 4 GiB         |
| 2003 | AMD Athlon 64  | 754  | 64         | 64            | 16 EiB        |
| 2004 | Intel P4 SL79K | 478  | 64         | 64            | 16 EiB        |

What about segmentation?

Back to memory models. The real game-changer here was the 386. When the
data width and the address width are equal, you don't need segmentation
anymore.

Whereas on the 286, you had to have one code segment at a time, that
started on a 64K boundary, and could not overlap the other segments:

…on the 386, you can just set all the segment bases to 0, and since the
offsets are 32-bit, pointers can refer to anywhere in the virtual address
space:

Additionally, the 386 introduces two other segment registers: FS (
for “fxtra data”) and GS (for “gxtra data”). Those don't really have
a specific purpose… but we can make good use of them.

How?

Well, consider a program loaded into memory. Among other things, we have
the .text section, with code, and the .data section, with (mutable) global
variables, mapped at a constant offset from each other.

Since the combo can be loaded at any base address in memory, the .text
segment uses the %rip register to refer to global variables in the same object.

For variables in other objects, as we've seen, it uses the GOT (global
offset table), and for functions in other objects, the PLT (procedure linkage
table).

But with thread-local data… we need another section:

Again, this is for mutable data. Immutable data can all go in .rodata, which
isn't shown here.

The problem with the .tdata section is we must have one copy of it per thread.
Threads share the .text section, the .data section, even the .bss section -
and those are at the same place for every thread - but the .tdata section
is somewhere different for every thread - at a different offset from .text:

So we can't use rip-relative addressing! There has to be a place, somewhere,
that says to the thread “this is the start of the .tdata section for you”.

And we can't use a general-purpose register like %rax or %rdi because
those are taken by the ABI - to return values or pass arguments. They're also
taken by the compiler - whenever it's not calling functions, the compiler
is free to use %rax through %r15 to store temporary values.

So, what to do? Use those extra segment registers! They're not used for
anything right now - so %gs becomes used to indicate the address of
the thread-local storage area on Linux x86, and %fs on Linux x86-64.

Let's see that in action.

We're going to add some thread-local variables in our echidna test program

…then we'd crash in Process::apply_relocations() - since we haven't called
Process::allocate_tls(), the tls field is still None, and we can't
apply TPOFF64 relocations.

Ideally, our API would be designed in such a way that it would be impossible
for us to do those operations out of order. But it would still let us inspect
fields like objects and tls at various stages, if we wanted to add a little
debug printing - as a treat.

We can also define methods that are callable in any state. For example,
Process::lookup_symbol() needs only read access, it doesn't have any side
effects, why not allow it all the time, for debugging purposes?
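Here's roughly what that typestate pattern looks like - a condensed sketch
with made-up fields and values, not elk's real definitions:

```rust
use std::marker::PhantomData;

// Each loader phase gets its own marker type...
struct Loaded;
struct TlsAllocated;

// ...and Process is generic over the phase it's in.
struct Process<S> {
    objects: Vec<String>,
    tls_offsets: Vec<u64>,
    _state: PhantomData<S>,
}

impl Process<Loaded> {
    fn new() -> Self {
        Process {
            objects: vec!["echidna".into()],
            tls_offsets: vec![],
            _state: PhantomData,
        }
    }

    // Consumes the Loaded process, hands back a TlsAllocated one -
    // there's no way to skip this step.
    fn allocate_tls(self) -> Process<TlsAllocated> {
        Process {
            objects: self.objects,
            tls_offsets: vec![16],
            _state: PhantomData,
        }
    }
}

impl Process<TlsAllocated> {
    // Only exists once TLS is allocated - calling it earlier is a
    // compile error, not a crash.
    fn apply_relocations(&self) -> u64 {
        self.tls_offsets[0]
    }
}

// Read-only helpers can be made available in every state:
impl<S> Process<S> {
    fn lookup_symbol(&self, name: &str) -> bool {
        self.objects.iter().any(|o| o == name)
    }
}

fn main() {
    let p = Process::new();
    assert!(p.lookup_symbol("echidna"));
    // p.apply_relocations(); // ← would not compile: wrong state
    let p = p.allocate_tls();
    println!("tls offset: {}", p.apply_relocations());
}
```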

It's important to note that after calling set_fs, we should avoid doing a lot
of things:

- Calling println! will lock stdout, and locks use thread-local storage, so
  that will crash now.
- Allocating memory on the heap will call malloc, and malloc uses
  thread-local storage, so that will also crash.

In fact, we should try doing as few things as possible. If we did need
logging after set_fs, we'd have to write our own logging functions on top of
the write syscall, and only do stack allocation. Which, as it turns out, is
relatively easy to do in Rust, as we've seen in echidna!
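Such a logger might look like this - a sketch (x86-64 Linux only), not elk's
actual code: a stack-buffer number formatter plus a raw write(2), no heap, no
locks, no thread-local storage:

```rust
use std::arch::asm;

// Format `n` into a caller-provided stack buffer - no heap, no
// thread-local storage involved.
fn format_u64(n: u64, buf: &mut [u8; 20]) -> &str {
    let mut i = buf.len();
    let mut n = n;
    loop {
        i -= 1;
        buf[i] = b'0' + (n % 10) as u8;
        n /= 10;
        if n == 0 {
            break;
        }
    }
    std::str::from_utf8(&buf[i..]).unwrap()
}

// Raw write(2) to stdout - bypasses Rust's stdout lock entirely.
fn raw_write(s: &str) {
    unsafe {
        asm!(
            "syscall",
            inlateout("rax") 1u64 => _, // syscall number 1 = write
            in("rdi") 1u64,             // fd 1 = stdout
            in("rsi") s.as_ptr(),
            in("rdx") s.len(),
            lateout("rcx") _,
            lateout("r11") _,
        );
    }
}

fn main() {
    let mut buf = [0u8; 20];
    raw_write("tls offset: ");
    raw_write(format_u64(1337, &mut buf));
    raw_write("\n");
}
```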

Ooh, a new relocation type! We've kind of ignored relocations higher
than Relative (8) so far, but the table does continue:

| Name    | Value | Field  | Calculation |
|---------|-------|--------|-------------|
| TPOFF64 | 18    | word64 |             |

Cool bear's hot tip

Again, this is taken from the “System V AMD64 ABI” document.

Of course, the empty “calculation” column doesn't bode well, but… we've
seen the assembly, we know pretty much what's expected here: a negative offset
which, added to tcb_addr, will give the actual address of the symbol.
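With toy numbers, that computation might look like this (names made up for
illustration; elk's actual code differs):

```rust
// The relocated value is the symbol's (negative) distance from the
// TCB address - TLS blocks live below %fs base on x86-64.
fn tpoff64(tcb_addr: u64, sym_addr: u64) -> i64 {
    sym_addr.wrapping_sub(tcb_addr) as i64
}

fn main() {
    let tcb_addr: u64 = 0x7f00_0000_1000;
    let sym_addr: u64 = 0x7f00_0000_0ff8; // 8 bytes below the TCB
    let off = tpoff64(tcb_addr, sym_addr);
    assert_eq!(off, -8);
    // At runtime, *(fs_base + off) reaches the variable again:
    assert_eq!(tcb_addr.wrapping_add(off as u64), sym_addr);
    println!("TPOFF64 value: {}", off);
}
```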

We should probably take a look at what the TLS symbols look like in the file
though:

I know, I know, you're disappointed. So am I! So is cool bear. But do not
worry. The series is reaching critical mass… and so that must mean the
dénouement will be upon us soon. Very soon.

What did we learn?

In 2020, as far as CPU memory models are concerned, we have it somewhat
good. Segmentation is mostly a thing of the past, except for thread-local
storage, where Linux 64-bit uses the %fs segment register to store the
address of the “TCB head” (thread control block).

In GDB, the $fs pseudo-variable is always 0 - we can use $fs_base to find
the value we're looking for. In code, we can use the arch_prctl syscall
with the ARCH_SET_FS and ARCH_GET_FS operations.

TLS variables come with a new type of relocation: TPOFF64. The way the
value is computed is specific to the dynamic loader - in elk's case, we chose
to only support a single thread, and we store object offsets in a HashMap.
The resulting value is always a negative offset from $fs_base.

Typestates are a neat way to encode the state of an object in its type,
to prevent API misuse. They probably would've warranted a whole article, but
adding that pattern after-the-fact to elk's codebase was all in all
relatively painless.