Geek in training

My laptop is a 6-year-old ThinkPad X220. Although it’s almost
falling apart from years of constant abuse, I don’t see myself replacing it
anytime soon: it’s easy to repair, has a great keyboard, and is a very
dependable machine.

And it’s supported by Coreboot.
Replacing the proprietary firmware with it is trivial: I followed
the instructions on this blog post
and they worked out of the box. (I also went the extra mile and flashed the
firmware after passing it through me_cleaner.)

Flashing the serial flash using a Raspberry Pi 3. Yes, I need to clean
up this computer.

The major difference from my previous setup is that my SSD had
hardware-based full disk encryption. I
ended up disabling this for two reasons: first, it isn’t very secure (the
key remains in the disk’s RAM for as long as power is supplied); second, I
wasn’t sure whether Coreboot supported it. So I disabled encryption prior to
flashing the new firmware.

But keeping a hard drive unencrypted on a laptop isn’t good practice. I
decided to use LUKS
instead.

However, instead of using SeaBIOS
as the payload and having a standard bootloader, I opted for a
slightly different route: have a custom-built Linux inside the ROM, open the
/boot partition with LUKS, and
kexec the current
vmlinuz/initrd.

Compared to the usual setup of using SeaBIOS as a payload, this setup
reduces boot time by cutting out the middlemen. With the ability to boot from
external devices removed, it’s also arguably more secure. The in-ROM Linux
has only the bare minimum: no network subsystem, only the necessary
filesystems and drivers built in, USB limited to HID devices, etc.; the
compressed kernel is ~1.7MiB, with room to shrink. The in-ROM initrd is
also quite minimal, containing just one file.

The only file is a hacked version of
cryptsetup that acts as a primitive init: it creates and mounts /proc and
/dev, decrypts and mounts /boot, and performs the kexec. It’s
statically linked against musl libc.

Flashing this requires opening the laptop, and I’m planning to do this next
weekend when replacing the USB ports. However, the setup works very well
under QEMU.

This blog post isn’t meant as a tutorial – feel free to contact me if you
have questions or ideas on how to improve this. If you end up using
something similar to this idea, I’d love to know as well.

There are many libraries out there to parse JSON files. It might be a
futile attempt, then, to write yet another one. However, when you’re
working on an RTOS where memory is golden, and the alternatives don’t look
that great, you have to do something about it.

Recently I wrote a JSON parser for a project at
work. This parser uses constant memory, regardless of the amount of data
it’s working with, and deserializes directly to a C struct. It’s similar,
in spirit, to the JSON parser in the Golang standard library, which
encodes and decodes data based on a tagged structure.

The lexer is the usual state machine, where the state itself is a function
pointer to a function that handles that particular state.

I’ve been using this technique for a while, and I’ve found that it’s a very
clean and efficient way of describing state machines, especially for lexers.

I began using it after a coworker wrote a parser for a DSL
using it – and he got the idea from – you guessed it – the Golang template
package. (There’s a nice talk by Rob Pike about it – I recommend this talk
not only for the lexing goodness, but also for the tips on how to evolve a
concurrent design.)

The parser implementation itself is nothing to write home about. However,
by using the same idea used in Lwan’s mustache template engine to obtain
the variables, it
manages to do some things that are not common in JSON parsers written in C:

It will accept only values of known types for a particular key.

It will save the decoded value directly in a struct field.

It won’t try to decode the same field twice.

The first point is crucial when working with data received from the network,
which is precisely the kind of thing I’m dealing with. It avoids problems
such as type confusion, and moves the responsibility of checking
the types to the library rather than its user.

By saving the decoded value directly into a struct field, it uses a
predictable amount of memory. This is good, as memory use isn’t going to
balloon out of control, nor require guesswork to know beforehand how many
tokens are going to be necessary to deserialize some values. The C compiler
already knows exactly how many bytes a struct needs.

Some fields might be optional in a JSON blob. So the parser uses a bitmask
to mark which fields have been decoded (and returns that, so that the
library user can efficiently test if a value has been deserialized or not).
Since it was easy to do, the library refuses to decode a key that has been
deserialized before.

So, a typical use is the following:

/* First, define a struct to hold the values. */
struct values {
    char *some_string;
    int some_int;
    bool some_bool;
};

/* Then, define a descriptor for that struct. */
static const struct json_descr values_descr[] = {
    FIELD(struct values, some_string, JSON_TOK_STRING),
    FIELD(struct values, some_int, JSON_TOK_NUMBER),
    FIELD(struct values, some_bool, JSON_TOK_TRUE),
};
/* (FIELD is just a macro that saves the offsetof() of
 * each struct member so that a pointer can be produced
 * afterwards.) */

/* It's now just a matter of parsing the JSON. */
struct values values;
int32_t ret = json_parse(serialized, strlen(serialized),
                         values_descr, ARRAY_SIZE(values_descr),
                         &values);
/* Bits 0, 1, and 2 of ret will be set if some_string,
 * some_int, and some_bool have been successfully
 * deserialized. */

Another thing that could be done – but that hasn’t been implemented yet –
is the opposite: use the descriptor and a struct to produce
JSON-encoded data. This has many advantages over the usual JSON libraries
that require generating a JSON tree in memory just to serialize it
afterwards.

And although I’m quite happy with this code, there are still some
limitations that I’ll address whenever I have the need.

Mainly, there’s no way to parse nested objects or arrays. I’ve written code
to do this, but these changes haven’t gotten any fuzz-testing action, so I’m
holding them off until my living room heater^Wcomputer has worked on the
problem for at least a week.

Another one that’s not a deal breaker for an embedded OS is the lack of
floating point numbers (only integers for now). Parsing floating point
is trickier than it sounds, and
there’s no strtod() in Zephyr’s minimal libc.

As a minor issue to work around, there’s the return value: this limits the
number of fields to be parsed to 32; that should be plenty for most uses.
However, in the unlikely event that’s not sufficient, this can be worked
around by having multiple descriptors.

The major problem with this piece of code, one that can’t be fixed by writing
more code, is that I can’t use it with Lwan due to licensing reasons:
although it is open source, as part of the Zephyr project it is licensed
under the Apache 2
license, which is incompatible with the GPL2+ used by Lwan (I would have to
bump it to [L]GPL3).

(For those who might ask how fast it is: it’s fast enough. I didn’t
measure, I didn’t compare, and I don’t really care: it’s readable,
maintainable, and does the job.)

Bandwagons

I’m not one to jump on each and every bandwagon I see. Sometimes that’s a
good decision, sometimes it’s better to just wait and see where they go
before taking any action.

Containers are one of those ideas that, while promising and intriguing, were
quite clumsy in the beginning, so I ignored them for a good while. The
technology is sufficiently mature now; so much so that it’s quite difficult
to ignore. Time to investigate it again.

Now, most of my work revolves around writing embedded software that runs on
bare metal; containers don’t really solve any work-related problem I have.
For personal usage, package management is more than sufficient to install
programs. However, the sandboxing aspect of containers is quite interesting,
and it’s something I’d like to know more about.

There are many articles around the web explaining how containers on Linux
work. Some go out of their way to explain in depth all the machinery
necessary to make them work, so there’s no need to repeat it here.

But, in sum: almost all of the kernel side of things was already present
before containers were actually a thing: cgroups, system call filters, etc.
Containers (and their runtimes) merely make all of this so simple to use
that it’s transparent to the user.

I usually have a hard time understanding things that I cannot build, so I
decided to build a toy container runtime. It’s crude and it’s a far cry
from what any industrial-strength container runtime is capable of, but it’s
not only a start, it’s implemented in a way that makes things a
lot easier for the user.

Virulent tutorials

Before I go into details on how my contraption works, a little bit of
background. I’ve been using Linux for over 18 years, and began my forays
into C about 14 years ago.

Around that time, a pretty interesting HOWTO explaining how to create
viruses for ELF binaries came out. It
explained not only various methods of infecting an ELF executable, but also
methods to detect them. Suffice it to say, I couldn’t understand a thing back
then. A few months ago, though, a conversation in the local hackerspace
brought up that tutorial; now I could finally
not only understand the techniques but also put them to use.

One of the techniques explained in the HOWTO involves finding some
unused space in an ELF segment that’s also executable, writing shellcode
to that area, rewiring the executable’s entry point to point to the
shellcode, and modifying the shellcode so that it jumps to the original
entry point. It’s all quite Rube Goldberg-ey, but it’s actually quite simple.

This way, a chunk of code can be executed every time that program
starts, without altering the size of the program. The perfect crime.

Dual use technology

By now, you’ve most likely connected the dots: the idea is to use the
very same technique, originally designed for viruses, to create a program that
transforms any program into a sandboxed version of itself.

The prototype I wrote is very elementary; the only thing it does is
limit, just once, which system calls a program can execute.

It’s sort of a less powerful version of OpenBSD’s
pledge(2)
(née tame(2)), which can be called repeatedly to reduce the
privileges a process has. That’s useful, for instance, in cases where a
configuration file has to be read before processing user-supplied work.
The OpenBSD folks have been sprinkling calls to
pledge()
in almost all of the programs in the base install (which is easier for a
BSD system, since everything is kept under the same roof).

But, unlike pledge(2), this thing can be applied to binaries that have
already been built. No source code modifications are necessary. If your
distribution can withstand the stench, “infected” binaries could be a
thing in the default installation.

Filtering the system calls

Any respectable container runtime will perform a lot of tasks to sandbox
a process and its children. So, for a proof of concept, I decided to
do just the bare minimum: limit system calls using seccomp-BPF.

Seccomp is a set of features,
present in the Linux kernel since the 2.6.x days, that allows
restricting what a program can do, system call-wise. The original intent
was to permit no system calls except those to end the
program and to read from and write to already-open file descriptors. In some
scenarios, this is perfectly acceptable. For others, there’s the
seccomp-BPF extension.

BPF stands for Berkeley Packet
Filter. A
famous use of BPF is in the tcpdump program, where rules such as “only
give me back TCP segments with the RST flag set” can be passed to the
kernel; packets that don’t match the filter are not copied back to
userland, greatly reducing the chatter between the two lands.

Obviously, this must be extremely performant, since kernel time must be
conserved at all costs (the kernel is there only to serve userland, after
all). Linux has many ways to speed up BPF programs, including an in-kernel
JIT compiler. Restrictions are in place so that BPF
programs can’t take an unbounded amount of time to execute, and this blog
post is a good
introductory read on the subject.

Another, slightly less famous use of BPFs is with the seccomp-BPF
extension. Instead of filtering network packets, processes can, for
instance, pick which system calls they’re allowed to perform. And that’s
precisely what’s necessary for my proof of concept.

Scripting like a kid

There are many ways to skin a cat. I decided to take a look at how other
programs were doing their sandboxing, and eventually found one that
seemed easy
enough
to copy the technique from.

Unfortunately, writing shellcode in C isn’t that easy, especially if you
don’t know which C library a program was linked with (or whether it was
linked against a C library in the first place). Luckily, all the shellcode
has to do is make two system calls, which is straightforward to do with
a little bit of assembly.

The first call will forbid the process from getting more privileges. The
second call will actually copy the BPF program to the kernel side.

The first call is painless: just set a few registers, invoke the syscall,
done.

The other one takes a little bit more work. A few things helped: I’ve
used nasm, which is a macro
assembler,
and wrote a few macros that let me write BPF programs as if they were
standard x86-64 instructions.

The remaining issue is that a pointer to the BPF program must be passed
to the call to prctl(), and the shellcode must be relocatable. A common
trick to perform in these
scenarios is to employ the fact that, on x86, when a call instruction is
made, the return address (i.e. the address of the byte right after the
call instruction) is pushed to the stack:

The bpf label doesn’t point to any x86 instruction: it contains only
macros that expand to the definitions of struct sock_filter as defined
in linux/filter.h. To copy the BPF program to the kernel, the prctl()
call expects a struct sock_fprog, which contains the BPF program length
(in number of struct sock_filter elements) and a pointer to the base of
that array. Since there’s no way to know beforehand where this code is
going to land in memory, this trick comes in handy: after the
call apply_filter instruction, the top of the stack contains the base
address of that array.

Now that I had a way to write the shellcode, it was just a matter of
shoehorning it into the executable.

Scoring a goal

For the proof of concept, I was initially going to write the infection
program in Python, as I usually do for throwaway code. However, I wasn’t
successful in finding a working ELF library that would let me dump the
modified executable.

I was too lazy to actually fix or write support for that, so I kept
looking for alternatives and ended up finding the
ELFkickers
suite from the always excellent Muppet Labs. It includes an “infect”
program that does exactly what it says on the tin: it takes in an
executable file, and produces another executable file that creates a
setuid shell before continuing to the original program. Exactly what one
would expect from a program with nefarious purposes.

So I replaced the original shellcode with the one I’d just assembled, and
now I had a proof of concept. Which, of course, didn’t work the first few
tries. In fact, it took a long while to get it right.

Debugging the contraption with gdb

The GNU Debugger is indeed very powerful, but ease of use (compared to
the Turbo Debugger I used back in the DOS days) is not its strong
suit. I’m not used to using it to debug programs without access to the
source, and this was a good opportunity to learn a few things.

Since the infection program modifies the ELF entry point, setting a
breakpoint on main() won’t actually work. But this is easily solvable:
just use readelf(1) to find
where the new entry point is, and set a breakpoint there:

From now on, it’s just the usual
execute-inspect-modify-reassemble-reinfect loop until it works. Although
it’s no td, I’m
certainly glad GDB has layouts that display both the disassembly and
the
registers.

Watching the magic happen

The hello program is very short, and the call to socket(2) doesn’t
make much sense there. It’s just a way to test what’s going to happen when
the filter is in place, without the need to modify the program to test this
assumption. (Lots of things
happen when executing a simple program such as this.)

Notice the calls to prctl(), very similar to the ones found in the
previously-mentioned commit. And then the program executes as usual.
Now, if an argument is passed, the program will attempt to create a
socket:

And Seccomp kicks in and kills the program with a SIGSYS signal. As
expected. It’s alive!

Next steps

The prototype works. But there are a few things that must be addressed
before even considering this idea for anything serious.

System call whitelist

The list of system calls is still hardcoded within the shellcode. That’s
not optimal. Maintaining a list like this for each and every program
will most likely be so boring that nobody is going to do it.

I can think of three possible ways of coming up with this list.

The first would be doing the same thing pledge(2) does: allowing a very
restricted set of system calls at first, with some limitations, and then
providing a few sets of calls per feature a program might use: stdio,
inet, tty, etc. The nice thing about this is that the filters are more
fine-grained; it’s not just a whitelist of system calls. (The man page has
more details.)

The second way would involve running the program under strace(1) and
recording which system calls the program makes over a few runs. If the test
coverage for each run is sufficiently high, this will work very reliably;
this isn’t always the case, though, so your mileage may vary. Also, for
certain large, complicated programs, stracing everything automatically could
prove to be a challenge.

Another way would be the following: grab the list of undefined symbols a
program uses, and find them in the shared libraries it links to. Then scan
the executable and the libraries for sequences like mov eax, 57; syscall
(for the old-school fork(2) syscall on x86-64) or mov rdi, 57; call
syscall@plt. This is still not foolproof, since a system
call number (loaded into eax) won’t necessarily be hardcoded within a
program or shared library.

There’s a fourth idea, as well, which involves both doing the automated
static analysis on the binary and running strace to catch “runaway”
syscalls. This can get quite complicated and it’s unlikely I’ll get it
correct in the first few tries (and, yet, the same shortcomings will
apply in the end.)

For me, though, these experiments are all about the hunt, not about the
treasure. So the tried and true approach that pledge(2) uses won’t be
used at first.

Filter optimization

Another thing that might be a problem: on x86-64, Linux has hundreds of
system calls. (329 according to sys/syscall.h at the moment I write
this.)

Even if the JIT for BPF is quite efficient, doing a linear search before each
and every system call will certainly be a bottleneck. Also, BPF programs are
limited in size, and a large whitelist implemented the same way as the
prototype will limit the possibility of more fine-grained filters. Things
like “the socket(2) call is allowed only for UNIX-domain sockets”, rather
than allowing any call to socket(2), would become impractical.

Since each syscall is identified by a number, a simple bitmap could be
used to implement the whitelist. This will also free up some space in
the BPF program for more detailed whitelisting for certain syscalls (for
instance, only allowing certain family of sockets to be created).

After a quick read of
networking/filter.txt,
this seems doable by using an algorithm such as this, which will reduce
the number of comparisons as the number of acceptable system calls
increases:

(Some of the if syscall_number < N blocks could be changed to
syscall_number -= M if their respective bitmask is 0.)

Or maybe just a bloom filter instead of a series of bitmaps. I’ll have
to experiment.

Getting a larger vessel

Containers, of course, are not just about restricting which system calls a
program is allowed to perform. There are many things that must be
considered before calling this a container runtime, or really claiming
that it sandboxes anything. Namespaces,
cgroups, and virtual machines are certainly on the list of things to learn
about.

Conclusion

While the prototype I built isn’t practical and is of very limited use,
I find the idea of sandboxed programs without the need for specialized
runtimes very enticing.

Programs can still be packaged the way they have been packaged for the
past decades, without throwing away some of the sandboxing benefits that
containers provide, all the while not introducing new concepts for
users.

Of course, something like this – even if properly implemented – won’t
be a replacement for containers. Especially if one considers their role
as packages ready for deployment, which has a lot of value for devops
personnel.

A pretty common mistake that happens when programming things in C is to
allocate less memory than necessary to hold a structure:

struct foobar *foobar = malloc(sizeof(struct foobaz));

Note that struct foobaz is passed instead of struct foobar. We might get
lucky, and sizeof(struct foobaz) might be larger than or equal to
sizeof(struct foobar), but we might not.

There are lots of tools out there that will catch these mistakes: static
analyzers such as the one from Clang, and Memcheck from Valgrind are just
two examples that should be in any C programmer’s toolbelt.

Even then, people often resort to a nicer idiom: sizeof(*foobar),
which not only avoids these problems, but is also somewhat future-proof,
should the type of foobar change:

struct foobar *foobar = malloc(sizeof(*foobar));

However, structures often have members that, if someone forgets to
initialize them, will inflict some undefined-behavior pains on the user. The
tools in the toolbelt might help here, as might the calloc()
function, which, in addition to allocating memory, also zeroes out the
memory block:

struct foobar *foobar = calloc(1, sizeof(*foobar));

One doesn’t always want to zero out the whole memory chunk just to fill
in the important fields afterwards, though.

Here’s a trick that’s being used in a yet-to-be-released project I’ve been
working on, on and off, for the past few months. It starts by defining the
generic-chunk-of-memory equivalent of strdup(), memdup():

It’s a pretty nice idiom that I haven’t seen anywhere else, and I’m blogging
here as the project I’m working on might not ever see the light of day and
it would be a shame if at least this didn’t become public.

I’ve been thinking for a while about how to reduce the overhead in Lwan’s
string buffer when the strings are small. There are a number of
ways of accomplishing this.

A somewhat common way is what std::string does: it reuses the bits
reserved for the effective string length, the allocated buffer size, and
the pointer to the buffer to store short string contents inline.

A clever improvement is, when the string is small, to turn the
effective-string-length counter into a bytes-remaining counter and
put it after the buffer that’s storing the string; this way, when the
string is at full capacity, that byte serves as a \0 terminator, which
is very useful for compatibility with C. And, of course, as a result,
one more byte can be stored in the string.

Another common approach is the strings used in Pascal, where the first
byte gives the length of the string. This has the advantage of allowing
strings to contain \0, but the disadvantage of limiting the maximum
size of the string. If someone were to implement this in C, the
advantage would turn into a disadvantage, as most string-handling
routines present in the standard library would be rendered useless.

Or would it?

I’m sure I’m not the first person to come up with the idea of a
C/Pascal string hybrid. But at least the Wikipedia article on
strings doesn’t seem to mention the variant I just came up with:

Keep the \0 to terminate the string. This helps reuse the
string-handling routines from the C standard library, which are usually
very fast, hand-tuned functions.

The first byte gives the size, not in bytes, but in 8-byte blocks.
To calculate the string length, one just jumps over that many 8-byte
blocks and finds the position of the \0 terminator.

Larger blocks could be considered if SIMD instructions were available.

With 8-byte blocks, this can yield strings up to 2KiB in size (256 ×
8), with an overhead of only two bytes, while retaining compatibility
with C strings. With SIMD, the maximum string size could easily be
doubled or quadrupled.

Of course, this isn’t actually an improvement on the kind of small
string optimization performed by std::string, so I’m not yet
convinced this is the way to go. That’s one of the reasons I haven’t
implemented it yet, but I might take advantage of the fact that I’m
currently enjoying some vacation time and write a prototype.