
Tue, 30 Oct 2018 - SAT solvers for fun and fairness

Trøndisk 2018, the
first round of the Norwegian ultimate series (the frisbee sport, not the
fighting style) is coming up this weekend! Normally that would mean that
I would blog something about all the new and exciting things we are doing
with Nageru for the stream, but for now,
I will just point out that the stream is on plastkast.no
and will be live from 0945–1830 CET on Saturday (group stage) and 1020–1450
(playoffs) on Sunday.

Instead, I wanted to talk about a completely different but interesting
subproblem we had to solve; how do you set up a good schedule for the group
stages? There are twelve teams, pre-seeded and split into two groups (call them A0–A5
and B0–B5) that are to play
round-robin, but there are only two fields—and only one of them is streamed.
You want a setup that maximizes fairness in the sense that people get
adequate rest between matches, and also more or less equal number of streamed
games. Throw in that one normally wants the more exciting games last, and
it starts to get really tricky to make something good by hand. Could we do
it programmatically?

My first thought was that since this is all about the ordering, it sounded
like a variant of the infamous travelling salesman problem. It's
well-known that TSP is NP-hard (or NP-complete, but I won't bother with
the details), but there are excellent heuristic implementations in practice.
In particular, I had already used OR-Tools, Google's optimization
toolkit, to solve TSP problems in the past; it contains a TSP solver that
can deal with all sorts of extra details, like multiple agents to travel
(in our case, multiple fields), subconstraints on ordering and so on.
(OR-Tools probably doesn't contain the best TSP solver in the world—there
are specialized packages that do even better—but it's much better than
anything I could throw together myself.)

However, as I tried to figure out a formulation and couldn't quite get it to
fit (there are so many extra nonlocal constraints), I saw that the OR-Tools
documentation had a subsection on scheduling problems. It turns
out this kind of scheduling can be represented as a so-called SAT
(satisfiability) problem, and OR-Tools also has a SAT solver. (SAT, in its
general forms, is also NP-hard, but again, there are great heuristics.)
I chose the Python frontend, which probably wasn't the best idea in the
world (it's poorly documented, and I do wonder when Python will take the
step into the 90s and make spelling errors in variables into compile-time
errors instead of throwing a runtime exception four hours into a
calculation), but that's what the documentation used, and the backend
is in C++ anyway, so speed doesn't matter.

The SAT solver works by declaring variables and various constraints
between them, and then asking the machine to either come up with a solution
that fits, or to prove that it's not possible. Let's have a look at some
excerpts to get a feel for how it all works:

We know we have 15 rounds, two fields on each, and every field should
contain a match. So let's generate 30 such variables, each containing
a match number (we use the convention that match 0, 2, 4, 6, etc. are
on the stream field and 1, 3, 5, 7, etc. are played in parallel on the
other field):

So this is 30 variables, each going from 0 to 29, inclusive.
We start with a fairly obvious constraint; we can only play each match once:

model.AddAllDifferent(matchnums)

The SAT solver might make this into a bunch of special constraints
underneath, or it might not. We don't care; it's abstracted away for us.

Now, it's not enough to just find any ordering—after all, we want to
find an ordering with some constraints. However, the constraints are
rarely about the match numbers, but more about the teams that play in
those matches. So we'll need some helper variables. For instance,
it would be interesting to know which teams play in each match:

AddElement() here is simply an indexing operation; since there's no difference
between home and away teams for us, we've just pregenerated all the matches as
A0 vs. A1, A0 vs. A2, etc. up until A3 vs. A5, A4 vs. A5, and then similarly for
the other group. The “element” constraint makes sure that e.g. home_team_match0 = home_teams_for_match_num[matchnum0].
Note that even though I think of this as an assignment where the home team
for match 0 follows logically from which match is being played as match 0,
it is a constraint that goes both ways; the solver is free to do inference
that way, or instead first pick the home team and then deal with the
consequences for the match number. (E.g., if it picks A4 as the home team,
the match number most certainly needs to be 14, which corresponds to A4–A5.)

We're not quite done with the helpers yet; we want to explode these variables
into booleans:

So now we have a bunch of variables of the type “is the home team in match 6
A4 or not?”. Finally we can make some interesting constraints! For instance,
we've decided already that the group finals (A0–A1 and B0–B1) should be the
last two matches of the day, and on the stream field:

Note that there's nothing here that says the same team can't be assigned
to play on both fields at the same time! However, this is taken care of
by some constraints on the scheduling that I'm not showing for brevity
(in particular, we established that each round must have exactly one game
from group A and one from group B).

Now we're starting to get out of the “hard constraint” territory and more
into things that would be nice. For this, we need objectives. One such
objective is what I call “tiredness”; playing matches nearly back-to-back
(ie., game - rest - game) should have a penalty, and the solution should
try to avoid it.

So here we have helper variables that are being set to the minimum
(effectively a logical AND) of “do I play in round N” and “do I play in round
N + 2”. Tiredness is simply a sum of those 0–1 variables, which we can seek
to minimize:

model.Minimize(sum_tiredness)

You may wonder how we went from a satisfiability problem to an
optimization problem. Conceptually, however, this isn't so hard.
Just ask the solver to find any solution, e.g. something with sum_tiredness 20.
Then simply add a new constraint saying sum_tiredness <= 19 and ask
for a re-solve (or continue). Eventually, the solver will either come back
with a better solution (in which case you can tighten the constraint
further), or the message that you've asked for something impossible, in
which case you know you have the optimal solution. (I have no idea whether
modern SAT solvers actually work this way internally, but again, conceptually
it's simple.)

As an extra bonus, you do get incrementally better solutions as you go.
These problems are theoretically very hard—in fact, I've let it run for fun
for a week now, and it still hasn't found an optimal solution—and in practice,
you just take some intermediate solution that is “good enough”. There are
always constraints that you don't bother adding to the program anyway, so there's
some eyeballing involved, but it still feels like a fairer process than trying
to nudge it by hand.

We had many more objectives, some of them contradictory (e.g., games between
more closely seeded opponents are more “exciting”, and should be put last—but
they should also be put on the stream, so do you put them early on the stream
field or late on the non-stream field?). It's hard to weigh all the factors
against each other, but in the end, I think we ended up with something pretty
nice. Every team gets to play two or three times (out of five) on the stream,
only one team needs to be “tired” twice (and I checked; if you ask for a
hard maximum of once for every team, it comes back pretty fast as
infeasible), many of the tight matches are scheduled near the end… and most importantly, we don't have to play the first matches
while I'm still debugging the stream. :-)

You can see final schedule
here.
Good luck to everyone, and consider using a SAT solver next time you have
a thorny scheduling problem!

Tue, 12 Sep 2017 - rANS encoding of signed coefficients

I'm currently trying to make sense of some still image coding
(more details to come at a much later stage!), and for a variety
of reasons, I've chosen to use rANS as the entropy coder. However,
there's an interesting little detail that I haven't actually seen
covered anywhere; maybe it's just because I've missed something,
or maybe because it's too blindingly obvious, but I thought I would
document what I ended up with anyway. (I had hoped for something
even more elegant, but I guess the obvious would have to do.)

For those that don't know rANS coding, let me try to handwave it
as much as possible. Your state is typically a single word
(in my case, a 32-bit word), which is refilled from the input stream
as needed. The encoder and decoder work in reverse order;
let's just talk about the decoder. Basically it works by looking
at the lowest 12 (or whatever) bits of the decoder state, mapping
each of those 2^12 slots to a decoded symbol. More common symbols
are given more slots, proportionally to the frequency. Let me just
write a tiny, tiny example with 2 bits and three symbols instead,
giving four slots:

Lowest bits   Symbol
00            0
01            0
10            1
11            2

Note that the zero coefficient here maps to one out of two slots
(ie., a range); you don't choose which one yourself, the encoder
stashes some information in there (which is used to recover the
next control word once you know which symbol there is).
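As a toy illustration of the table above (not a full rANS decoder; renormalization and the encoder are omitted):

```python
# Toy illustration of the 2-bit table above: map the lowest bits of the
# decoder state to a symbol. Renormalization and the encoder are omitted.
SLOT_BITS = 2
slot_to_symbol = [0, 0, 1, 2]  # slots 00, 01, 10, 11

def decode_symbol(state):
    slot = state & ((1 << SLOT_BITS) - 1)  # look at the lowest 2 bits
    return slot_to_symbol[slot]
```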

Now for the actual problem: When storing DCT coefficients, we
typically want to also store a sign (ie., not just 1 or 2, but
also -1/+1 and -2/+2). The statistical distribution is symmetrical,
so the sign bit is incompressible (except that of course there's
no sign bit needed for 0). We could have done this by introducing new symbols
-1 and -2 in addition to our three other ones, but this means we'll
need more bits of precision, and accordingly larger look-up tables
(which is negative for performance). So let's find something better.

We could also simply store it separately somehow; if the coefficient
is non-zero, store the bits in some separate repository. Perhaps
more elegantly, you can encode a second symbol in the rANS stream
with probability 1/2, but this is more expensive computationally.
But both of these have the problem that they're divergent in terms
of control flow; nonzero coefficients potentially need to do a lot
of extra computation and even loads. This isn't nice for SIMD, and
it's not nice for GPU. It's generally not really nice.

The solution I ended up with was simulating a larger table with
a smaller one. Simply rotate the table so that the zero symbol
has the top slots instead of the bottom slots, and then replicate
the rest of the table. For instance, take this new table:

Lowest bits   Symbol
000           1
001           2
010           0
011           0
100           0
101           0
110           -1
111           -2

(The observant reader will note that this doesn't describe the
exact same distribution as last time—zero has twice the relative
frequency as in the other table—but ignore that for the time
being.)

In this case, the bottom half of the table doesn't actually need
to be stored! We know that if the three bottom bits are >= 110
(6 in decimal), we have a negative value, can subtract 6, and then
continue decoding. If we go past the end of our 2-bit table
despite that, we know we are decoding a zero coefficient (which
doesn't have a sign), so we can just clamp the read; or for a GPU,
reads out-of-bounds on a texture will typically return 0 anyway.
So it all works nicely, and the divergent I/O is gone.
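A Python sketch of the trick: only the stored top half of the table, plus the subtract-and-clamp logic described above:

```python
# Sketch of the folded table: only the top half (slots 000-011) is
# stored; slots >= 6 decode as negated symbols, and reads past the end
# of the stored table are clamped to the zero symbol.
stored_table = [1, 2, 0, 0]  # slots 000, 001, 010, 011

def decode_symbol(slot):  # slot = the lowest 3 bits of the decoder state
    sign = 1
    if slot >= 6:  # 110 and up: negative values
        slot -= 6
        sign = -1
    if slot >= len(stored_table):
        return 0  # clamp; a GPU out-of-bounds texture read gives 0 anyway
    return sign * stored_table[slot]
```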

If this piqued your interest, you probably want to read up on
rANS in general; Fabian Giesen (aka ryg) has some notes
that work as a good starting point, but beware; some of this is
pretty confusing. :-)

Sun, 25 Dec 2016 - Cracking a DataEase password

I recently needed to get access to a DataEase database; the person I helped
was the legitimate owner of the data, but had forgotten the password,
as the database was largely from 1996. There are various companies
around the world that seem to do this, or something similar (like give
you an API), for a usually unspecified fee; they all have very 90s homepages
and in general seem like they have gone out of business a long time ago.
And I wasn't prepared to wait.

For those of you who don't know DataEase, it's a sort-of relational database
for DOS that had its heyday in the late 80s and early 90s (being sort of
the cheap cousin of dBase); this is before SQL gained traction as the
standard query language, before real multiuser database access, and before
variable-width field storage.

It is also before reasonable encryption. Let's see what we can do.

DataEase has a system where tables are mapped through the data dictionary,
which is a table on its own. (Sidenote: MySQL pre-8.0 still does not have
this.) This is the file RDRRTAAA.DBM; I don't really know what RDRR stands
for, but T is the “database letter” in case you wanted more than one database
in the same directory, and AAA, AAB, AAC etc. is a counter in case a table
grows too big for one file. (There are also .DBA files for the structure of
non-system tables, and then some extra stuff for indexes.)

DBM files are pretty much the classical, fixed-length 80s-style database
files; each row has some flags (I believe these are for e.g. “row is
deleted”) and then just the rows in fixed format right after each other.
For instance, here's one I created as part of testing (just the first few
lines of the hexdump are shown):

Even without going in-depth, we can see the structure here; there's
“testbase” which maps to C:RDRRTAAA.DBM (the RDRR itself), there's a table
called “REPORT DIRECTORY” that maps to C:REPOTAAA.DBM, and then more stuff
after that, and so on.

However, other tables are not so easily read, because you can ask DataEase
to encrypt a table. Let's look at such an encrypted table, like the “Users”
table (containing usernames, passwords—not password hashes—and some extra
information like access level), which is always encrypted:

Clearly, this isn't very good encryption; it uses a very short, repetitive
key of eight bytes (64 bits). (The data is mostly zero padding, which makes
it much easier to spot this.) In fact, in actual data tables, only five of
these bytes are set to a non-zero value, which means we have a 40-bit key;
export controls?

My first assumption here was of course XOR, but through some experimentation,
it turned out what you need is actually 8-bit subtraction (with wraparound).
The key used is derived from both a database key and a per-table key,
both stored in the RDRR; again, if you disassemble, I'm sure you can find the
key derivation function, but that's annoying, too. Note, by the way, that
this precludes making an attack by just copying tables between databases,
since the database key is different.

So let's do a plaintext attack. If you assume the plaintext of the bottom row
is all padding, that's your key and here's what you end up with:

Not bad, eh? Actually the first byte of the key here is wrong as far as I
know, but it didn't interfere with the fields, so we have what we need to
log in. (At that point, we've won, because DataEase will helpfully decrypt
everything transparently for us.)
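The whole attack is simple enough to sketch in a few lines of Python (the byte values in the usage below are made up for illustration, not the real key):

```python
# Sketch of the known-plaintext attack on the subtraction cipher.
def recover_key(padding_row):
    # Encryption is c = (p + k) mod 256, so a row of all-zero padding
    # gives the key bytes directly.
    return bytes(padding_row)

def decrypt(data, key):
    return bytes((c - key[i % len(key)]) & 0xff for i, c in enumerate(data))
```

Feed recover_key() the bottom (assumed all-padding) row to get the key, then decrypt the whole table with it.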

However, there's a twist; if the password is longer than four characters,
the entire decryption of the Users table changes. Of course, we could run
our plaintext attack against every data table and pick out the information
by decoding the structure, but again; annoying. So let's see what it looks
like if we choose “passs” instead:

Distinctly more confusing. At this point, of course, we know at which byte
positions the username and password start, so if we wanted to, we could just try
setting the start byte of the password to every possible byte in turn until
we hit 0x00 (DataEase truncates fields at the first zero byte), which would
allow us to get in with an empty password. However, I didn't know the
username either, and trying two bytes would mean 65536 tries, and I wasn't
up for automating macros through DOSBox. So an active attack wasn't
too tempting.

However, we can look at the last hex byte
(where we know the plaintext is 0); it goes 0x5d, 0x2d, 0xfd... and some
other bytes go 0x08, 0xd8, 0xa8, 0x78, and so on. So clearly there's an
obfuscation here where we have a per-line offset that decreases by 0x30
per line. (Actually, the increase/decrease per line seems to be derived from
the key somehow, too.) If we remove that, we end up with:

Well, OK, this wasn't much more complicated; our fixed key is now 16 bytes
long instead of 8 bytes long, but apart from that, we can do exactly the same
plaintext attack. (Also, it seems to change per-record now, but we don't see
it here, since we've only added one user.) Again, assume the last line is
supposed to be all 0x00 and thus use that as a key (plus the last byte from
the previous line), and we get:

Well, OK, it wasn't perfect; we got “pass\x12” instead of “passs”, so we
messed up somehow. I don't know exactly why the fifth character gets messed
up like this; actually, it cost me half an hour of trying because the
password looked very real but the database wouldn't let me in, but
eventually, we just guessed at what the missing letter was supposed to be.
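The de-obfuscation step can be sketched like this (the 0x30 step comes from the observed bytes, while the initial offset and the 16-byte line length in the usage are my assumptions):

```python
# Sketch of removing the per-line obfuscation: each line carries an
# extra offset that decreases by 0x30 per line. Initial offset and
# line length are assumptions.
def remove_line_offsets(data, first_offset, step=0x30, line_len=16):
    out = bytearray()
    for line_no in range(len(data) // line_len):
        offset = (first_offset - step * line_no) & 0xff
        for c in data[line_no * line_len:(line_no + 1) * line_len]:
            out.append((c - offset) & 0xff)
    return bytes(out)
```

After this, you're back to the fixed-key case and can run the same plaintext attack as before.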

So there you have it; practical small-scale cryptanalysis of DOS-era
homegrown encryption. Nothing advanced, but the user was happy about getting
the data back after a few hours of work. :-)

Wed, 26 Oct 2016 - Why does software development take so long?

Nageru 1.4.0 is out (and on its way through
the Debian upload process right now), so now you can do live video mixing with multichannel
audio to your heart's content. I've already blogged about most of the
interesting new features, so instead, I'm trying to answer a question:
What took so long?

To be clear, I'm not saying 1.4.0 took more time than I really anticipated
(on the contrary, I pretty much understood the scope from the beginning,
and there was a reason why I didn't go for building this stuff into 1.0.0);
but if you just look at the changelog from the outside, it's not immediately
obvious why “multichannel audio support” should take the better part of three
months of development. What I'm going to say is of course going to be obvious
to most software developers, but not everyone is one, and perhaps my
experiences will be illuminating.

Let's first look at some obvious things that aren't the case: First of all,
development is not primarily limited by typing speed. There are about 9,000
lines of new code in 1.4.0 (depending a bit on how you count), and if it was
just about typing them in, I would be done in a day or two. On a good
keyboard, I can type plain text at more than 800 characters per minute—but
you hardly ever write code for even a single minute at that speed. Just as
when writing a novel, most time is spent thinking, not typing.

I also didn't spend a lot of time backtracking; most code I wrote actually
ended up in the finished product as opposed to being thrown away. (I'm not
as lucky in all of my projects.) It's pretty
common to do so if you're in an exploratory phase, but in this case, I had a
pretty good idea of what I wanted to do right from the start, and that plan
seemed to work. This wasn't a difficult project per se; it just needed
to be done (which, in a sense, just increases the mystery).

However, even if this isn't at the forefront of science in any way (most code
in the world is pretty pedestrian, after all), there's still a lot of
decisions to make, on several levels of abstraction. And a lot of those
decisions depend on information gathering beforehand. Let's take a look at
an example from late in the development cycle, namely support for using MIDI
controllers instead of the mouse to control the various widgets.

I've kept a pretty meticulous TODO list; it's just a text file on my laptop,
but it serves the purpose of a ghetto bugtracker. For 1.4.0, it contains 83
work items (only a single-digit number of them are not ticked off, mostly
because I decided not to do those things), which corresponds roughly 1:2 to the
number of commits. So let's have a look at what the ~20 MIDI controller items went into.

First of all, to allow MIDI controllers to influence the UI, we need a way
of getting to it. Since Nageru is single-platform on Linux, ALSA is the
obvious choice (if not, I'd probably have to look for a library to put
in-between), but seemingly, ALSA has two interfaces (raw MIDI and sequencer).
Which one do you want? It sounds like raw MIDI is what we want, but actually,
it's the sequencer interface (it does more of the MIDI parsing for you,
and generally is friendlier).

The first question is where to start picking events from. I went the simplest
path and just said I wanted all events—anything else would necessitate a UI,
a command-line flag, figuring out if we wanted to distinguish between
different devices with the same name (and not all devices potentially even
have names), and so on. But how do you enumerate devices? (Relatively simple,
thankfully.) What do you do if the user inserts a new one while Nageru is
running? (Turns out there's a special device you can subscribe to that will
tell you about new devices.) What if you get an error on subscription?
(Just print a warning and ignore it; it's legitimate not to have access to
all devices on the system. By the way, for PCM devices, all of these answers
are different.)

So now we have a sequencer device; how do we get events from it? Can we do it in the main loop? Turns out
it probably doesn't integrate too well with Qt, but it's easy enough to put
it in a thread. The class dealing with the MIDI handling now needs locking;
what mutex granularity do we want? (Experience will tell you that you nearly
always just want one mutex. Two mutexes give you all sorts of headaches with
ordering them, and nearly never gives any gain.) ALSA expects us to poll()
a given set of descriptors for data, but on shutdown, how do you break out
of that poll to tell the thread to go away? (The simplest way on Linux is
using an eventfd.)

There's a quirk where if you get two or more MIDI messages right after each
other and only read one, poll() won't trigger to alert you there are more
left. Did you know that? (I didn't. I also can't find it documented. Perhaps
it's a bug?) It took me some looking into sample code to find it. Oh, and
ALSA uses POSIX error codes to signal errors (like “nothing more is
available”), but it doesn't use errno.

OK, so you have events (like “controller 3 was set to value 47”); what do you do
about them? The meaning of the controller numbers is different from
device to device, and there's no open format for describing them. So I had to
make a format describing the mapping; I used protobuf (I have lots of
experience with it) to make a simple text-based format, but it's obviously
a nightmare to set up 50+ controllers by hand in a text file, so I had to
make a UI for this. My initial thought was making a grid of spinners
(similar to how the input mapping dialog already worked), but then I realized
that there isn't an easy way to make headlines in Qt's grid. (You can
substitute a label widget for a single cell, but not for an entire row.
Who knew?) So after some searching, I found out that it would be better
to have a tree view (Qt Creator does this), and then you can treat that
more-or-less as a table for the rows that should be editable.

Of course, guessing controller numbers is impossible even in an editor,
so I wanted it to respond to MIDI events. This means the editor needs
to take over the role as MIDI receiver from the main UI. How do you do
that in a thread-safe way? (Reuse the existing mutex; you don't generally
want to use atomics for complicated things.) Thinking about it, shouldn't the
MIDI mapper just support multiple receivers at a time? (Doubtful; you don't
want your random controller fiddling during setup to actually influence
the audio on a running stream. And would you use the old or the new mapping?)

And do you really need to set up every single controller for each bus,
given that the mapping is pretty much guaranteed to be similar for them?
Making a “guess bus” button doesn't seem too difficult, where if you
have one correctly set up controller on the bus, it can guess from
a neighboring bus (assuming a static offset). But what if there's
conflicting information? OK; then you should disable the button.
So now the enable/disable status of that button depends on which cell
in your grid has the focus; how do you get at those events? (Install an event
filter, or subclass the spinner.) And so on, and so on, and so on.

You could argue that most of these questions go away with experience;
if you're an expert in a given API, you can answer most of these questions
in a minute or two even if you haven't heard the exact question before.
But you can't expect even experienced developers to be an expert in all
possible libraries; if you know everything there is to know about Qt,
ALSA, x264, ffmpeg, OpenGL, VA-API, libusb, microhttpd and Lua
(in addition to C++11, of course), I'm sure you'd be a great fit for
Nageru, but I'd wager that pretty few developers fit that bill.
I've written C++ for almost 20 years now (almost ten of them professionally),
and that experience certainly helps boosting productivity, but I can't
say I expect a 10x reduction in my own development time at any point.

You could also argue, of course, that spending so much time on the editor
is wasted, since most users will only ever see it once. But here's the
point; it's not actually a lot of time. The only reason why it seems
like so much is that I bothered to write two paragraphs about it;
it's not a particular pain point, it just adds to the total. Also,
the first impression matters a lot—if the user can't get the editor
to work, they also can't get the MIDI controller to work, and are likely
to just go do something else.

A common misconception is that just switching languages or using libraries
will help you a lot. (Witness the never-ending stream of software that
advertises “written in Foo” or “uses Bar” as if it were a feature.)
For the former, note that nothing I've said so far is specific to my choice
of language (C++), and I've certainly avoided a bunch of battles by making
that specific choice over, say, Python. For the latter, note that most of these problems are actually related
to library use—libraries are great, and they solve a bunch of problems
I'm really glad I didn't have to worry about (how should each button look?),
but they still give their own interaction problems. And even when you're a
master of your chosen programming environment, things still take time,
because you have all those decisions to make on top of your libraries.

Of course, there are cases where libraries really solve your entire problem
and your code gets reduced to 100 trivial lines, but that's really only when
you're solving a problem that's been solved a million times before. Congrats
on making that blog in Rails; I'm sure you're advancing the world. (To make
things worse, usually this breaks down when you want to stray ever so
slightly from what was intended by the library or framework author. What
seems like a perfect match can suddenly become a development trap where you
spend more of your time trying to become an expert in working around the
given library than actually doing any development.)

The entire thing reminds me of the famous essay
No Silver Bullet by Fred
Brooks, but perhaps even more so, this quote from
John Carmack's .plan has
stuck with me (incidentally about mobile game development in 2006,
but the basic story still rings true):

To some degree this is already the case on high end BREW phones today. I have
a pretty clear idea what a maxed out software renderer would look like for
that class of phones, and it wouldn't be the PlayStation-esq 3D graphics that
seems to be the standard direction. When I was doing the graphics engine
upgrades for BREW, I started along those lines, but after putting in a couple
days at it I realized that I just couldn't afford to spend the time to finish
the work. "A clear vision" doesn't mean I can necessarily implement it in a
very small integral number of days.

In a sense, programming is all about what your program should do in the first
place. The “how” question is just the “what”, moved down the chain of
abstractions until it ends up where a computer can understand it, and at that
point, the three words “multichannel audio support” have become those 9,000
lines that describe in perfect detail what's going on.

Wed, 23 Jul 2014 - The sad state of Linux Wi-Fi

I've been using 802.11 on Linux now for over a decade, and to be honest, it's
still a pretty sad experience. It works well enough that I mostly don't
care... but when I care, and try to dig deeper, it always ends up in the
answer “this is just crap”.

I can't say exactly why this is; between the Intel cards I've always been
using, the Linux drivers, the firmware, the mac80211 layer, wpa_supplicant
and NetworkManager, I have no idea who is supposed to get all these things
right, and I have no idea how hard or easy they actually are to pull off.
But there are still things annoying me frequently that we should really
have gotten right after ten years or more:

Why does my Intel card consistently pick 2.4 GHz over 5 GHz? The 5 GHz
signal is just as strong, and it gives a less crowded 40 MHz channel
(twice the bandwidth, yay!) instead of the busy 20 MHz channel the 2.4 GHz
one has to share. The worst part is, if I use an access point with
band-select (essentially forcing the initial connection to be to 5
GHz—this is of course extra fun when the driver sees ten APs
and tries to connect to all of them over 2.4 in turn before trying
5 GHz), the driver still swaps onto 2.4 GHz a few minutes later!

Rate selection. I can sit literally right next to an AP and get a
connection on the lowest basic rate (which I've set to 11 Mbit/sec
for the occasion). OK, maybe I shouldn't trust the output of iwconfig
too much, since rate is selected per-packet, but then again, when
Linux supposedly has a really good rate selection algorithm (minstrel),
why are so many drivers using their own instead? (Yes, hello
“iwl-agn-rs”, I'm looking at you.)

Connection time. I dislike OS X pretty deeply and think that many
of its technical merits are way overblown, but it's got one thing
going for it; it connects to an AP fast.
RFC4436 describes some of the
tricks they're using, but Linux uses none of them. In any case, even
the WPA2 setup is slow for some reason, it's not just DHCP.

Scanning/roaming seems to be pretty random; I have no idea how much
thought really went into this, and I know it is a hard problem,
but it's not unusual at all to be stuck at some low-speed AP when
a higher-speed one is available. (See also 2.4 vs. 5 above.)
I'd love to get proper support for CCX (Cisco Client Extensions) here,
which makes this tons better in a larger Wi-Fi setting (since the access
point can give the client a lot of information that's useful for
roaming, e.g. “there's an access point on channel 52 that sends
its beacons every 100 ms with offset 54 from mine”, which means you
only need to swap channel for a few milliseconds to listen instead of
a full beacon period), but I suppose that's covered by licensing
or patents or something. Who knows.

With now a billion mobile devices running Linux and using Wi-Fi
all the time, maybe we should have solved this a while ago.
But alas. Instead we get access points trying to layer hacks upon
hacks to try to force clients into making the right decisions.
And separate ESSIDs for 2.4 GHz and 5 GHz.

Thu, 08 May 2014 - Static library linking behavior

A static library (an .a file) is, at its heart, nothing special—it's
just a bunch of object files (.o files) in an archive, with an optional
index. Thus, I've always mentally treated them the same as a bunch of object
files, but it turns out there's an edge case where linking against an
.a file differs from linking against the corresponding .o files (even if
you discount the fact that the .a file acts as a linker group, and the
.o files would need -( and -) around them to get the same effect).

The difference manifests itself when you are building a shared library
(a .so file).
For an .o file, the linker does what you'd expect: any symbol that has
global linkage (i.e., for C or C++: it's not marked as static or in an
anonymous namespace) and has default visibility (which, unfortunately,
means almost all symbols, unless you're clever enough to use
-fvisibility=hidden and explicitly mark your exports
with __attribute__((visibility("default"))) or similar)
will be exported in your shared library to external clients. Among other
things, this means it will take up space, because the name and a few other
bits of metadata need to be stored.

However, for an .a file, things are different! If you have a symbol in an .a file, and it's not used by anything
from an .o file (directly or indirectly), the linker will silently just
discard it as unused. It doesn't matter what visibility it has; it will
just not be there. (To be honest, I haven't checked if this is on the level
of a symbol, a section, an .o file within the library, or the entire .a file.)
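The difference is easy to demonstrate (a sketch, assuming GCC and binutils; all file and symbol names here are made up for the example):

```shell
# Two trivial translation units, one of which nothing will reference.
cat > used.c   <<'EOF'
int used(void) { return 1; }
EOF
cat > unused.c <<'EOF'
int unused(void) { return 2; }
EOF
cat > driver.c <<'EOF'
int used(void);
int driver(void) { return used(); }
EOF
cc -fPIC -c used.c unused.c driver.c
ar rcs libboth.a used.o unused.o

# Linking the .o files directly: every default-visibility symbol ends up
# in the dynamic symbol table, referenced or not.
cc -shared -o direct.so driver.o used.o unused.o
nm -D direct.so | grep ' T unused'      # present

# Linking the same code via the .a file: unused.o is never pulled in,
# so "unused" silently disappears from the shared library.
cc -shared -o fromar.so driver.o libboth.a
nm -D fromar.so | grep ' T unused' || echo "unused: not exported"
```

This also shows that (at least here) the pruning happens at the granularity of an .o file within the archive: used.o gets pulled in to resolve the reference, while unused.o is skipped entirely.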

The workaround for this is to add --whole-archive before the
.a file in question (and presumably --no-whole-archive after it,
so that subsequent archives are treated normally), which disables this
pruning. However, I am unsure whether it also has other effects, like
messing with visibility, so beware.

To be honest, I think this is insane behavior, but given that so few people
set visibility explicitly, I guess there would never be any pruning if
the behavior were different (and ELF symbol visibility is a relatively new
invention anyway), so I can understand why it is like that.

Of course, I figured this out the hard way. Poof, there goes a few hours of
my life.

Tue, 04 Feb 2014 - FOSDEM video stream goodiebag

Borrowing a tradition from TG, we have released
a video streaming goodiebag from FOSDEM 2014. In short, it contains all the
scripts we used for the streaming part (nothing from the video team itself,
although I believe most of what they do is developed out in the open).

If you've read my earlier posts on the subject, you'll know that it's all
incredibly rough, and we haven't cleaned it up much afterwards. So you get
the truth, but it might not be pretty :-) However, feedback is of course
welcome.

Sun, 02 Feb 2014 - FOSDEM video streaming, post-mortem

Wow, what a ride that was. :-)

I'm not sure if people generally are aware of it, but the video streaming
at FOSDEM this year came together on extremely short notice. I got word
late Wednesday that the video team was overworked and would not have the
manpower to worry about streaming, and consequently, that there would be
none (probably not even of the main talks, like last year).

I quickly conferred with Berge on IRC; we both agreed that something
as big as FOSDEM shouldn't be without at least rudimentary streams. Could we
do something about it? After all, all devrooms (save for some that could not be,
due to licensing issues) would be recorded using DVswitch anyway, where it's
trivial to just connect another sink to the master, and we both had extensive
experience doing streaming work from The Gathering.

So, we agreed to do a stunt project; either it would work or it would crash
and burn, but at least it would be within the playful spirit of free software.
The world outside does not stand still, and neither should we.

The FOSDEM team agreed to give us access to the streams, and let us use the
otherwise unused cycles on the “slave” laptops (the ones that just take in
a DV switch from the camera and send it to the master for mixing).
Since I work at Google, I was able to talk to the Google Compute Engine people,
who were able to turn around on extremely short notice and sponsor GCE resources
for the actual video distribution. This took a huge unknown out of the
equation for us; since GCE is worldwide and scalable, we'd be sure to have
adequate bandwidth for serving our viewers almost no matter how much load we got.

The rest was mainly piecing together existing components in new ways. I dealt
with the encoding (on VLC, using WebM, since that's what FOSDEM
wanted), hitting one or two really obscure bugs in the process, and
Berge dealt with all the setup of distribution (we used cubemap,
which had already been tuned for the rather unique needs of WebM during last
Debconf), parsing the FOSDEM schedule to provide live program information,
and so on. Being a team of two was near-ideal here; we already know each other
extremely well from previous work, and despite the frantic pace, everything
felt really relaxed and calm.

So, less than 72 hours after the initial “go”, the streaming laptops started
coming up in the various devrooms, and I rsynced over my encoding chroot
to each of them and fired up VLC, whose output cubemap would then pick up and send
on. And amazingly enough, it worked! We had a peak of about 380 viewers,
which is about 80% more than the peak of 212 last year (and this was with
almost no announcement before the conference). Amusingly, the most popular
stream by far was not a main track, but that of the Go devroom; at times,
they had over half the total viewers. (I never got to visit it myself, because
it was super-packed every time I went there.)

I won't pretend everything went perfectly—we found a cubemap segfault
on the way, and also some other issues (such as initially not properly
restarting the encoding when the DVswitch master went down and up again).
But I'm extremely happy that the video team believed in us and gave us
the chance; it was fun, it was the perfect icebreaker when meeting new
people at FOSDEM, and hopefully, we let quite a few people sitting at home
learn something new or interesting.

Mon, 19 Aug 2013 - Whole-disk dm-cache

dm-cache is an interesting new technology in the 3.10 kernel onwards;
basically, it is a way to use SSDs as a cache layer in front of rotating
media, supposedly getting the capacity of the latter and the speed of the
former, similar to how the page cache already tries to exploit the good
properties of both RAM and disks. (This is, historically, nothing new; for
instance, ZFS has had this ability for years, in the form of a patented
cache algorithm called L2ARC.)

dm-cache is not the only technology that does this; it competes with,
for instance, bcache (also merged in 3.10). However, bcache expects you
to format the data volume, which was a no-go in my case: What I wanted,
was for dm-cache to sit below my main RAID-6 LVM (which has tons of volumes),
without having to erase anything.

This is all a bit raw. Bear with me.

First of all, after a new enough kernel has been installed (you probably
want 3.11-rc-something, actually), we want some basic scripts to hook onto
initramfs-tools and so on. I used
dmcache-tools, and simply
converted it to a Debian package with alien. It comes with a tool called
dmcache-format-blockdev that tries to partition your block
device as an LVM, split into blocks and metadata
volumes (seemingly they are separate in case you want e.g. RAID-1 for
your metadata only), but I found that it made a metadata volume that was
too small for my use. I ended up with 512 MB for metadata and then the
rest for blocks.
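By hand, that corresponds to something like the following (a sketch, assuming the SSD RAID is /dev/md2 and a volume group named “cache”, matching the device names used below; the exact layout dmcache-format-blockdev produces may differ):

```shell
# Put LVM on the SSD RAID, then carve out a small metadata volume plus
# one big volume for the cache blocks themselves.
pvcreate /dev/md2
vgcreate cache /dev/md2
lvcreate -L 512M -n metadata cache
lvcreate -l 100%FREE -n blocks cache
```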

The next part is how to get startup right. First of all, we want an
/etc/cachetab so that dmcache-load-cachetab knows
how to set up the cache:
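The cachetab entry ultimately translates into a device-mapper table for the kernel's cache target. A rough sketch of the equivalent manual setup (assuming the metadata and blocks volumes from above, /dev/md1 as the slow origin device, and a cache block size of 512 sectors; the cachetab syntax itself is dmcache-tools-specific):

```shell
# Table format, per the kernel's device-mapper cache documentation:
#   <start> <length> cache <metadata dev> <cache dev> <origin dev>
#   <block size> <#feature args> [features] <policy> <#policy args>
dmsetup create cache --table "0 $(blockdev --getsz /dev/md1) cache \
  /dev/cache/metadata /dev/cache/blocks /dev/md1 512 1 writeback default 0"
```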

This gives you a new /dev/mapper/cache that's basically
identical to /dev/md1, except faster due to the extra cache.
Then, you'll have to tell LVM that it should never try to use
/dev/md1 as a physical volume on its own (that would be
very bad if the cache had dirty blocks!), so /etc/lvm/lvm.conf
needs to contain something like:

filter = [ "a/md2/", "r/md/", "a/.*/" ]

Note that my SSD RAID is on md2, so I'll need to make an
exception for it. LVM aficionados will probably know of something
more efficient here (r/md1/ didn't work for me, since there's also
/dev/md/1 and possibly others).
Then, we need to get everything set up right during boot.
This is governed by /sbin/dmcache-load-cachetab.
Unfortunately, LVM is not started by udev, but rather late
in the process, so /dev/cache/blocks and
/dev/cache/mapper are not available when dmcache-load-cachetab
runs! I hacked that in, just before the “Devices not ready?” comment,
by simply adding the LVM load line used elsewhere in the initramfs:

/sbin/lvm vgchange -aly --ignorelockingfailure

Finally, we need to make sure the hook is installed in the first place.
The hook script has a line to check if dm-cache is needed for the root
volume, but it's far too simplistic, so I simply changed
/usr/share/initramfs-tools/hooks/dmcache so that
should_install() always returned true:

should_install() {
    # sesse hack
    echo yes
    return
}

After that, all you need to do is clear the first few kilobytes of
the metadata filesystem using dd, update the initramfs,
and voila! Cache.
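In commands, that's roughly the following (assuming the metadata LV lives at /dev/cache/metadata as above; run as root, and double-check the device name first, since this is destructive):

```shell
# Wipe any stale superblock so dm-cache initializes fresh metadata.
dd if=/dev/zero of=/dev/cache/metadata bs=4k count=256
# Rebuild the initramfs so the dmcache hook and cachetab are included.
update-initramfs -u
```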

It would seem the code in the kernel is still a bit young; it has
memory allocation issues and doesn't cache all that aggressively
yet, but most of my writes are already going to the cache, and
an increasing amount of reads, so I think this is going to be quite
OK in a few revisions.

Sun, 28 Apr 2013 - Precise cache miss monitoring with perf

This should have been obvious, but seemingly it's not (perf is amazingly
undocumented, and has this huge lex/yacc grammar for its command-line
parsing), so here goes:

If you want precise cache miss data from perf (where “precise” means using
PEBS, so that it gets attributed to the actual load and not some random
instruction a few cycles later), you cannot use “cache-misses:pp” since
“cache-misses” on Intel maps to some event that's not PEBS-capable.
Instead, you'll have to use “perf record -e r10cb:pp”. The trick is,
apparently, that “perf list” very much suggests that what you want is
rcb10 and not r10cb, but that's not the way it's really encoded.

FWIW, this is LLC misses, so it's really things that go to either another
socket (less likely), or to DRAM (more likely). You can change the 10
to something else (see “perf list”) if you want e.g. L2 hits.
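To make the encoding concrete (./your_program is a placeholder): the raw event number is just (umask << 8) | event_code, so umask 0x10 and event 0xcb combine to r10cb, with the umask first despite what “perf list” seems to suggest:

```shell
# umask in the high byte, event code in the low byte:
# (0x10 << 8) | 0xcb = 0x10cb.
printf 'perf record -e r%x:pp -- ./your_program\n' $(( (0x10 << 8) | 0xcb ))
# prints: perf record -e r10cb:pp -- ./your_program
```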