
Wed, 26 Oct 2016 - Why does software development take so long?

Nageru 1.4.0 is out (and on its way through
the Debian upload process right now), so now you can do live video mixing with multichannel
audio to your heart's content. I've already blogged about most of the
interesting new features, so instead, I'm trying to answer a question:
What took so long?

To be clear, I'm not saying 1.4.0 took more time than I really anticipated
(on the contrary, I pretty much understood the scope from the beginning,
and there was a reason why I didn't go for building this stuff into 1.0.0);
but if you just look at the changelog from the outside, it's not immediately
obvious why “multichannel audio support” should take the better part of three
months of development. What I'm going to say is of course going to be obvious
to most software developers, but not everyone is one, and perhaps my
experiences will be illuminating.

Let's first look at some obvious things that aren't the case: First of all,
development is not primarily limited by typing speed. There are about 9,000
lines of new code in 1.4.0 (depending a bit on how you count), and if it was
just about typing them in, I would be done in a day or two. On a good
keyboard, I can type plain text at more than 800 characters per minute—but
you hardly ever write code for even a single minute at that speed. Just as
when writing a novel, most time is spent thinking, not typing.

I also didn't spend a lot of time backtracking; most code I wrote actually
ended up in the finished product as opposed to being thrown away. (I'm not
as lucky in all of my projects.) Throwing away code is pretty
common in an exploratory phase, but in this case, I had a
pretty good idea of what I wanted to do right from the start, and that plan
seemed to work. This wasn't a difficult project per se; it just needed
to be done (which, in a sense, just increases the mystery).

However, even if this isn't at the forefront of science in any way (most code
in the world is pretty pedestrian, after all), there's still a lot of
decisions to make, on several levels of abstraction. And a lot of those
decisions depend on information gathering beforehand. Let's take a look at
an example from late in the development cycle, namely support for using MIDI
controllers instead of the mouse to control the various widgets.

I've kept a pretty meticulous TODO list; it's just a text file on my laptop,
but it serves the purpose of a ghetto bugtracker. For 1.4.0, it contains 83
work items (all but a single-digit number of them ticked off, the rest mostly
because I decided not to do those things), which corresponds roughly 1:2 to
the number of commits. So let's have a look at what went into the ~20 MIDI
controller items.

First of all, to allow MIDI controllers to influence the UI, we need a way
of getting to it. Since Nageru is single-platform on Linux, ALSA is the
obvious choice (if not, I'd probably have to look for a library to put
in-between), but as it turns out, ALSA has two interfaces (raw MIDI and
sequencer). Which one do you want? It sounds like raw MIDI is what we want,
but actually, it's the sequencer interface (it does more of the MIDI parsing
for you, and is generally friendlier).

The first question is where to start picking events from. I went the simplest
path and just said I wanted all events—anything else would necessitate a UI,
a command-line flag, figuring out if we wanted to distinguish between
different devices with the same name (and not all devices potentially even
have names), and so on. But how do you enumerate devices? (Relatively simple,
thankfully.) What do you do if the user inserts a new one while Nageru is
running? (Turns out there's a special device you can subscribe to that will
tell you about new devices.) What if you get an error on subscription?
(Just print a warning and ignore it; it's legitimate not to have access to
all devices on the system. By the way, for PCM devices, all of these answers
are different.)

So now we have a sequencer device; how do we get events from it? Can we do it in the main loop? Turns out
it probably doesn't integrate too well with Qt, but it's easy enough to put
it in a thread. The class dealing with the MIDI handling now needs locking;
what mutex granularity do we want? (Experience will tell you that you nearly
always just want one mutex. Two mutexes give you all sorts of headaches with
ordering them, and nearly never gives any gain.) ALSA expects us to poll()
a given set of descriptors for data, but on shutdown, how do you break out
of that poll to tell the thread to go away? (The simplest way on Linux is
using an eventfd.)

There's a quirk where if you get two or more MIDI messages right after each
other and only read one, poll() won't trigger to alert you there are more
left. Did you know that? (I didn't. I also can't find it documented. Perhaps
it's a bug?) It took me some looking into sample code to find it. Oh, and
ALSA uses POSIX error codes to signal errors (like “nothing more is
available”), but it doesn't use errno.

OK, so you have events (like “controller 3 was set to value 47”); what do you do
about them? The meaning of the controller numbers is different from
device to device, and there's no open format for describing them. So I had to
make a format describing the mapping; I used protobuf (I have lots of
experience with it) to make a simple text-based format, but it's obviously
a nightmare to set up 50+ controllers by hand in a text file, so I had to
make a UI for this. My initial thought was making a grid of spinners
(similar to how the input mapping dialog already worked), but then I realized
that there isn't an easy way to make header rows in Qt's grid. (You can
substitute a label widget for a single cell, but not for an entire row.
Who knew?) So after some searching, I found out that it would be better
to have a tree view (Qt Creator does this), and then you can treat that
more-or-less as a table for the rows that should be editable.

Of course, guessing controller numbers is impossible even in an editor,
so I wanted it to respond to MIDI events. This means the editor needs
to take over the role of MIDI receiver from the main UI. How do you do
that in a thread-safe way? (Reuse the existing mutex; you don't generally
want to use atomics for complicated things.) Thinking about it, shouldn't the
MIDI mapper just support multiple receivers at a time? (Doubtful; you don't
want your random controller fiddling during setup to actually influence
the audio on a running stream. And would you use the old or the new mapping?)

And do you really need to set up every single controller for each bus,
given that the mapping is pretty much guaranteed to be similar for them?
Making a “guess bus” button doesn't seem too difficult, where if you
have one correctly set up controller on the bus, it can guess from
a neighboring bus (assuming a static offset). But what if there's
conflicting information? OK; then you should disable the button.
So now the enable/disable status of that button depends on which cell
in your grid has the focus; how do you get at those events? (Install an event
filter, or subclass the spinner.) And so on, and so on, and so on.
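
The “guess bus” decision itself is just offset arithmetic; a minimal Python
sketch (my names and numbers, not Nageru's actual code) of guess-or-disable:

```python
def guess_bus(known, neighbor):
    """Guess this bus's controller numbers from a neighboring bus, assuming
    a static offset between buses. 'known' holds the controllers already set
    up on this bus; returns None when there is no usable information or it
    conflicts (in the UI, that is when the button gets disabled)."""
    offsets = {known[name] - neighbor[name] for name in known if name in neighbor}
    if len(offsets) != 1:
        return None  # nothing to go on, or conflicting information
    offset = offsets.pop()
    return {name: number + offset for name, number in neighbor.items()}

neighbor = {"fader": 0, "pan": 16, "mute": 32}  # hypothetical controller numbers
known = {"fader": 1}                            # one controller set up on this bus
guessed = guess_bus(known, neighbor)            # {'fader': 1, 'pan': 17, 'mute': 33}
```

The logic is trivial; as the text says, the time goes into wiring it to the
UI's focus and enable/disable state, not into the guessing itself.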

You could argue that most of these questions go away with experience;
if you're an expert in a given API, you can answer most of these questions
in a minute or two even if you haven't heard the exact question before.
But you can't expect even experienced developers to be an expert in all
possible libraries; if you know everything there is to know about Qt,
ALSA, x264, ffmpeg, OpenGL, VA-API, libusb, microhttpd and Lua
(in addition to C++11, of course), I'm sure you'd be a great fit for
Nageru, but I'd wager that very few developers fit that bill.
I've written C++ for almost 20 years now (almost ten of them professionally),
and that experience certainly helps boost productivity, but I can't
say I ever expect a 10x reduction in my own development time.

You could also argue, of course, that spending so much time on the editor
is wasted, since most users will only ever see it once. But here's the
point: it's not actually a lot of time. The only reason why it seems
like so much is that I bothered to write two paragraphs about it;
it's not a particular pain point, it just adds to the total. Also,
the first impression matters a lot—if the user can't get the editor
to work, they also can't get the MIDI controller to work, and are likely
to just go do something else.

A common misconception is that just switching languages or using libraries
will help you a lot. (Witness the never-ending stream of software that
advertises “written in Foo” or “uses Bar” as if it were a feature.)
For the former, note that nothing I've said so far is specific to my choice
of language (C++), and I've certainly avoided a bunch of battles by making
that specific choice over, say, Python. For the latter, note that most of these problems are actually related
to library use—libraries are great, and they solve a bunch of problems
I'm really glad I didn't have to worry about (how should each button look?),
but they still give their own interaction problems. And even when you're a
master of your chosen programming environment, things still take time,
because you have all those decisions to make on top of your libraries.

Of course, there are cases where libraries really solve your entire problem
and your code gets reduced to 100 trivial lines, but that's really only when
you're solving a problem that's been solved a million times before. Congrats
on making that blog in Rails; I'm sure you're advancing the world. (To make
things worse, usually this breaks down when you want to stray ever so
slightly from what was intended by the library or framework author. What
seems like a perfect match can suddenly become a development trap where you
spend more of your time trying to become an expert in working around the
given library than actually doing any development.)

The entire thing reminds me of the famous essay
No Silver Bullet by Fred
Brooks, but perhaps even more so, this quote from
John Carmack's .plan has
stuck with me (incidentally about mobile game development in 2006,
but the basic story still rings true):

To some degree this is already the case on high end BREW phones today. I have
a pretty clear idea what a maxed out software renderer would look like for
that class of phones, and it wouldn't be the PlayStation-esq 3D graphics that
seems to be the standard direction. When I was doing the graphics engine
upgrades for BREW, I started along those lines, but after putting in a couple
days at it I realized that I just couldn't afford to spend the time to finish
the work. "A clear vision" doesn't mean I can necessarily implement it in a
very small integral number of days.

In a sense, programming is all about what your program should do in the first
place. The “how” question is just the “what”, moved down the chain of
abstractions until it ends up where a computer can understand it, and at that
point, the three words “multichannel audio support” have become those 9,000
lines that describe in perfect detail what's going on.

Tue, 04 Feb 2014 - FOSDEM video stream goodiebag

Borrowing a tradition from TG, we have released
a video streaming goodiebag from FOSDEM 2014. In short, it contains all the
scripts we used for the streaming part (nothing from the video team itself,
although I believe most of what they do is developed out in the open).

If you've read my earlier posts on the subject, you'll know that it's all
incredibly rough, and we haven't cleaned it up much afterwards. So you get
the truth, but it might not be pretty :-) However, feedback is of course
welcome.

Sun, 02 Feb 2014 - FOSDEM video streaming, post-mortem

Wow, what a ride that was. :-)

I'm not sure if people generally are aware of it, but the video streaming
at FOSDEM this year came together on extremely short notice. I got word
late Wednesday that the video team was overworked and would not have the
manpower to worry about streaming, and consequently, that there would be
none (probably not even of the main talks, like last year).

I quickly conferred with Berge on IRC; we both agreed that something
as big as FOSDEM shouldn't be without at least rudimentary streams. Could we
do something about it? After all, all devrooms (save for some that opted out
due to licensing issues) would be recorded using DVswitch anyway, where it's
trivial to just connect another sink to the master, and we both had extensive
experience doing streaming work from The Gathering.

So, we agreed to do a stunt project; either it would work or it would crash
and burn, but at least it would be within the playful spirit of free software.
The world outside does not stand still, and neither should we.

The FOSDEM team agreed to give us access to the streams, and let us use the
otherwise unused cycles on the “slave” laptops (the ones that just take in
a DV stream from the camera and send it to the master for mixing).
Since I work at Google, I was able to talk to the Google Compute Engine people,
who were able to turn around on extremely short notice and sponsor GCE resources
for the actual video distribution. This took a huge unknown out of the
equation for us; since GCE is worldwide and scalable, we'd be sure to have
adequate bandwidth for serving our viewers almost no matter how much load we got.

The rest was mainly piecing together existing components in new ways. I dealt
with the encoding (on VLC, using WebM, since that's what FOSDEM
wanted), hitting one or two really obscure bugs in the process, and
Berge dealt with all the setup of distribution (we used cubemap,
which had already been tuned for the rather unique needs of WebM during last
Debconf), parsing the FOSDEM schedule to provide live program information,
and so on. Being a team of two was near-ideal here; we already know each other
extremely well from previous work, and despite the frantic pace, everything
felt really relaxed and calm.

So, less than 72 hours after the initial “go”, the streaming laptops started
coming up in the various devrooms, and I rsynced over my encoding chroot
to each of them and fired up VLC, which then cubemap would pick up and send
on. And amazingly enough, it worked! We had a peak of about 380 viewers,
which is about 80% more than the peak of 212 last year (and this was with
almost no announcement before the conference). Amusingly, the most popular
stream by far was not a main track, but that of the Go devroom; at times,
they had over half the total viewers. (I never got to visit it myself, because
it was super-packed every time I went there.)

I won't pretend everything went perfect—we found a cubemap segfault
on the way, and also some other issues (such as initially not properly
restarting the encoding when the DVswitch master went down and up again).
But I'm extremely happy that the video team believed in us and gave us
the chance; it was fun, it was the perfect icebreaker when meeting new
people at FOSDEM, and hopefully, we let quite a few people sitting at home
learn something new or interesting.

Mon, 19 Aug 2013 - Whole-disk dm-cache

dm-cache is an interesting new technology in the 3.10 kernel onwards;
basically, it is a way to use SSDs as a cache layer in front of rotating
media, supposedly getting the capacity of the latter and the speed of the
former, similar to how the page cache already tries to exploit the good
properties of both RAM and disks. (This is, historically, nothing new; for
instance, ZFS has had this ability for years, in the form of a patented
cache algorithm called L2ARC.)

dm-cache is not the only technology that does this; it competes with,
for instance, bcache (also merged in 3.10). However, bcache expects you
to format the data volume, which was a no-go in my case: What I wanted,
was for dm-cache to sit below my main RAID-6 LVM (which has tons of volumes),
without having to erase anything.

This is all a bit raw. Bear with me.

First of all, after a new enough kernel has been installed (you probably
want 3.11-rc-something, actually), we want some basic scripts to hook onto
initramfs-tools and so on. I used
dmcache-tools, and simply
converted it to a Debian package with alien. It comes with a tool called
dmcache-format-blockdev that tries to partition your block
device as an LVM, split into blocks and metadata
volumes (seemingly they are separate in case you want e.g. RAID-1 for
your metadata only), but I found it to make a metadata volume that was
too small for my use. I ended up with 512MB for metadata and then the
rest for blocks.

The next part is how to get startup right. First of all, we want an
/etc/cachetab so that dmcache-load-cachetab knows
how to set up the cache:

This gives you a new /dev/mapper/cache that's basically
identical to /dev/md1 except faster due to the extra cache.
Then, you'll have to tell LVM that it should never try to use
/dev/md1 as a physical volume on its own (that would be
very bad if the cache had dirty blocks!), so /etc/lvm/lvm.conf
needs to contain something like:

filter = [ "a/md2/", "r/md/", "a/.*/" ]

Note that my SSD RAID is on md2, so I'll need to make an
exception for it. LVM aficionados will probably know of something
more efficient here (r/md1/ didn't work for me, since there's also
/dev/md/1 and possibly others).
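
To see why the a/md2/ exception has to come first, here is a toy model of
the filter semantics (first matching pattern wins; this mimics, but is not,
LVM's actual parser):

```python
import re

def lvm_filter_accepts(device, filters):
    """Toy model of lvm.conf device filters: scan the patterns in order,
    first match wins; "a/.../" accepts, "r/.../" rejects, no match accepts."""
    for entry in filters:
        action, pattern = entry[0], entry[2:-1]
        if re.search(pattern, device):
            return action == "a"
    return True

filters = ["a/md2/", "r/md/", "a/.*/"]
assert lvm_filter_accepts("/dev/md2", filters)           # SSD RAID: accepted
assert not lvm_filter_accepts("/dev/md1", filters)       # raw HDD RAID: rejected
assert not lvm_filter_accepts("/dev/md/1", filters)      # its alias: also rejected
assert lvm_filter_accepts("/dev/mapper/cache", filters)  # the cache device: accepted
```

Swapping the first two entries would reject /dev/md2 too, since r/md/ matches
it as well.
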
Then, we need to get everything set up right during boot.
This is governed by /sbin/dmcache-load-cachetab.
Unfortunately, LVM is not started by udev, but rather late
in the process, so /dev/cache/blocks and
/dev/cache/mapper are not available when dmcache-load-cachetab
runs! I hacked that in, just before the “Devices not ready?” comment,
by simply adding the LVM load line used elsewhere in the initramfs:

/sbin/lvm vgchange -aly --ignorelockingfailure

Finally, we need to make sure the hook is installed in the first place.
The hook script has a line to check if dm-cache is needed for the root
volume, but it's far too simplistic, so I simply changed
/usr/share/initramfs-tools/hooks/dmcache so that
should_install() always returned true:

should_install() {
    # sesse hack
    echo yes
    return
}

After that, all you need to do is clear the first few kilobytes of
the metadata volume using dd, update the initramfs,
and voila! Cache.

It would seem the code in the kernel is still a bit young; it has
memory allocation issues and doesn't cache all that aggressively
yet, but most of my writes are already going to the cache, and
an increasing amount of reads, so I think this is going to be quite
OK in a few revisions.

Sun, 28 Apr 2013 - Precise cache miss monitoring with perf

This should have been obvious, but seemingly it's not (perf is amazingly
undocumented, and has this huge lex/yacc grammar for its command-line
parsing), so here goes:

If you want precise cache miss data from perf (where “precise” means using
PEBS, so that it gets attributed to the actual load and not some random
instruction a few cycles later), you cannot use “cache-misses:pp” since
“cache-misses” on Intel maps to some event that's not PEBS-capable.
Instead, you'll have to use “perf record -e r10cb:pp”. The trick is,
apparently, that “perf list” very much suggests that what you want is
rcb10 and not r10cb, but that's not the way it's really encoded.

FWIW, this is LLC misses, so it's really things that go to either another
socket (less likely), or to DRAM (more likely). You can change the 10
to something else (see “perf list”) if you want e.g. L2 hits.
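
The encoding is just “umask first, then event code”, each as two hex digits
(on Intel, the raw config is umask << 8 | event); a one-liner makes the
r10cb-vs-rcb10 confusion concrete, using the LLC-miss event from above:

```python
def perf_raw_event(event_code, umask):
    # perf's raw-event syntax puts the umask in front of the event code,
    # each as two hex digits: rUUEE.
    return "r%02x%02x" % (umask, event_code)

print(perf_raw_event(0xCB, 0x10))  # r10cb
```

So the PEBS-capable LLC-miss event (code 0xCB, umask 0x10) really is r10cb,
whatever “perf list” may lead you to believe.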

Mon, 15 Apr 2013 - TG and VLC scalability

With The Gathering 2013 well behind us, I
wanted to write a followup to the posts I had on video streaming earlier.

Some of you might recall that we identified an issue at TG12, where the video
streaming (to external users) suffered from us simply having too fast
network; bursting frames to users at 10 Gbit/sec overloads buffers in the
down-conversion to lower speeds, causing packet loss, which triggers new
bursts, sending the TCP connection into a spiral of death.

Lacking proper TCP pacing in the Linux kernel, the workaround was simple
but rather ugly: Set up a bunch of HTB buckets (literally thousands),
put each client in a different bucket, and shape each bucket to approximately
the stream bitrate (plus some wiggle room for retransmits and bitrate peaks,
although the latter are kept under control by the encoder settings).
This requires a fair amount of cooperation from VLC, which we use as both
encoder and reflector; it needs to assign a unique mark (fwmark) to each
connection, which then tc can use to put the client into the right HTB
bucket.

Although we didn't collect systematic user experience data (apart from my own
tests done earlier, streaming from Norway to Switzerland), it's pretty clear
that the effect was as hoped for: Users who had reported quality for a given
stream as “totally unusable” now reported it as “perfect”. (Well, at first
it didn't seem to have much effect, but that was due to packet loss caused by
a faulty switch supervisor module. Only shows that real-world testing can be
very tricky. :-) )

However, suddenly this happened on the stage:

which led to this happening to the stream load:

and users, especially ones external to the hall, reported things breaking up
again. It was obvious that the load (1300 clients, or about 2.1 Gbit/sec) had something to do
with it, but the server wasn't out of CPU—in fact, we killed a few other
streams and hung processes, freeing up three or so cores, without any effect.
So what was going on?

At the time, we really didn't get to go deep enough into it before the load
had lessened; perf didn't really give an obvious answer (even though HTB is
known to be a CPU hog, it didn't really figure high up in the list), and the
little tuning we tried (including removing HTB) didn't really help.

It wasn't before this weekend, when I finally got access to a lab with 10gig
equipment (thanks, Michael!), that I could verify my suspicions: VLC's HTTP server is
single-threaded, and not particularly efficient at that. In fact, on the lab
server, which is a bit slower than what we had at TG (4x2.0GHz Nehalem versus
6x3.2GHz Sandy Bridge), the most I could get from VLC was 900 Mbit/sec, not
2.1 Gbit/sec! Clearly we were both a bit lucky with our hardware, and that we
had more than one stream (VLC vs. Flash) to distribute our load on. HTB was
not the culprit, since this was run entirely without HTB, and the server
wasn't doing anything else at all.

(It should be said that this test is nowhere near 100% exact, since the
server was
only talking to one other machine, connected directly to the same switch,
but it would seem a very likely bottleneck, so in lieu of $100k worth of
testing equipment and/or a very complex netem setup, I'll accept it as
the explanation until proven otherwise. :-) )

So, how far can you go, without switching streaming platforms entirely?
The answer comes in form of
Cubemap, a replacement
reflector I've been writing over the last week or so. It's multi-threaded,
much more efficient (using epoll and sendfile—yes, sendfile), and also
is more robust due to being less intelligent (VLC needs to demux and remux
the entire signal to reflect it, which doesn't always go well for more
esoteric signals; in particular, we've seen issues with the Flash video mux).

Running Cubemap on the same server, with the same test client (which is
somewhat more powerful), gives a result of 12 Gbit/sec—clearly better than
900 Mbit/sec! (Each machine has two Intel 10Gbit/sec NICs connected with LACP
to the switch, and load-balance on TCP port number.) Granted, if you did this kind of test using real users, I doubt
they'd get a very good experience; it was dropping bytes like crazy since it
couldn't get the bytes quickly enough to the client (and I don't think it was
the client that was the problem, although that machine was also clearly very
very heavily loaded). At this point, the problem is
almost entirely about kernel scalability; less than 1% is spent in userspace,
and you need a fair amount of mucking around with multiple NIC queues to get
the right packets to the right processor without them stepping too much on
each others' toes. (Check out
/usr/src/linux/Documentation/network/scaling.txt
for some essential tips here.)

And now, finally, what happens if you enable our HTB setup? Unfortunately,
it doesn't really go well; the nice 12 Gbit/sec drops to 3.5–4 Gbit/sec!
Some of this is just increased amounts of packet processing (for instance,
the two iptables rules we need to mark non-video traffic alone take the
speed down from 12 to 8), but it also pretty much shows that HTB doesn't
scale: A lot of time is spent in locking routines, probably the different
CPUs fighting over locks on the HTB buckets. In a sense, it's maybe not
so surprising when you look at what HTB really does; you can't process
each packet independently; the entire point is to delay packets based
on other packets. A more welcome result is that setting up a single fq_codel
qdisc on the interface hardly mattered at all; it went down from 12 to
11.7 or something, but inter-run variation was so high, this is basically
only noise. I have no idea if it actually had any effect at all, but it's
at least good to know that it doesn't do any harm.

So, the conclusion is: Using HTB to shape works well, but it doesn't
scale. (Nevertheless, I'll eventually post our scripts and the VLC patch
here. Have some patience, though; there's a lot of cleanup to do after
TG, and only so much time/energy.) Also, VLC only scales up to a thousand
clients or so; after that, you want Cubemap. Or Wowza. Or Adobe Media Server.
Or nginx-rtmp, if you want RTMP. Or… or… or… My head spins.

Thu, 14 Mar 2013 - Introduction to gamma

Stuck in a suburb of Auckland for the night, mostly due to Air New Zealand.
*sigh* Well, OK, maybe I can at least write that blog entry I've been
meaning to for a while...

When I wrote about color a month ago, my post included, in a small
parenthesis, the following: “Let me ignore the distinction between
Y and Y' for now.” Such a small sentence, and so much it hides :-)
Let's take a look.

First, let's remember that Y measures the overall brightness, or luminance.
Let's ignore the fact that there are multiple frequencies in play
(again, sidestepping “what is white?”), and let's just think of them as a bunch
of equal photons. If so, there's a very natural way to measure the luminance
of a pixel; conceptually, just look at the number of photons emitted per second,
and normalize for some value.

However, this is not usually the way we choose to store these values.
First of all, note that there's typically not
infinite precision when storing pixel data; although we could probably allow
ourselves to store full floating-point these days (and we sometimes do,
although it's not very common), back in the day, where all of these conventions
were effectively decided, we certainly could not. You had a fixed number of
bits to represent the different gray tones, and even today's eight bits (giving
256 distinct levels, bordering on the limits of what the human eye can
distinguish) was a far-fetched luxury.

So, can we quantize linearly to 256 levels and just be done with it? The answer
is no, and there are two good reasons why not. The first has to do, as so many
things, with how our visual system works. Let's take a look at a chart
that I shamelessly stole from Anti-Grain Geometry:

To quote AGG: “On the right there are two pixels and we can credibly say
that they emit two times more photons pre (sic) second than the pixel on the
left.” Yet, it doesn't really appear twice as bright! (What does
“twice as bright” really mean, by the way? I don't know, but there's some
sort of intuitive notion of it. In any case, we could rephrase the question in
terms of being capable of distinguishing between different levels, but it just
complicates things.)

So, the eye's response to luminance is not linear, but more like the square root
(actually, more like the exponent of 1/2.2 or 1/2.4). Thus, if we want to
quantize luminance into N (for instance 256) distinct levels, we'd better
not space them out linearly; let's instead do x^(1/2.2) (or something similar)
and then quantize linearly. This is equivalent to a non-uniform quantizer;
we say that we have encoded the signal with gamma 2.2. (In reality, we
don't use exactly this, but it's close, and the reasons are more of
electrical than perceptual nature.) Also, to distinguish this gamma-compressed
representation of the luminance from the actual (linear) luminance Y,
we now add a little prime to the symbol, and say that Y' is the luma.

The other reason is a very interesting coincidence. A CRT monitor takes in
an input voltage and outputs (through some electronics controlling an electron
gun, lighting up phosphor) luminance. However, the output luminance is not
linearly dependent on the input voltage; it's more like the square! (This
has nothing to do with the phosphor, by the way; it's the electrical circuits
behind it. It's partially by coincidence and partially by engineering.)
In other words, a CRT doesn't even need to undo the gamma-compressed quantization,
it can just take the linear signal in and push it through the circuit, and
get the intended luminance back out.

Of course, LCDs don't work that way anymore, but by the time they became
commonplace, the convention was already firmly in place, and again, the
perceptual reasons still apply.

Now, what does this mean for pixel processing, and Movit in particular?
Noting that many of the filters we typically apply to our videos (say, blur)
are physical processes that work on light, and that light behaves quite
linearly, it's quite obvious that we want to process luminance, not some
arbitrarily-compressed version of it. But this is not what most software does.
Most software just takes the gamma-encoded RGB values (you encode the three
channels separately) and does mathematics on them as if they were representing
linear values, which ends up being subtly wrong in some cases and massively
wrong in others.
There's an article by Eric Brasseur
that has tons of detail about this if you care, but in general, I can say
that correct processing is more the exception than the norm.

So, what does Movit do? The answer is quite obvious: Convert to linear values
on the input side (by applying the right gamma curve; something like x^2.2
for each color channel), do the processing, and then compress back again
afterwards. (Movit works in 16-bit and 32-bit floating point internally,
by virtue of it being supported and fast in modern GPUs, so we don't have
problems with quantization that you'd get in 8-bit fixed point.) Actually, it's a bit more complex than that, since some filters
don't really care (e.g., if you just want to flip an image vertically, who
cares about gamma), but the general rule is:

If you want to do more with pixels than moving them around (especially
combining two or more, or doing arithmetic on them), you want to work in
linear gamma.
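
A tiny experiment shows how much it matters; here using a simplified pure
power curve of 2.2 rather than the exact piecewise sRGB one:

```python
def to_linear(v):
    return v ** 2.2        # simplified: pure power curve, not exact sRGB

def to_gamma(v):
    return v ** (1 / 2.2)

# Average a black (0.0) and a white (1.0) pixel, as a blur or resize would:
naive = (0.0 + 1.0) / 2                                   # 0.5
linear = to_gamma((to_linear(0.0) + to_linear(1.0)) / 2)  # ~0.73
```

Doing the averaging on the gamma-encoded values gives 0.5; doing it in linear
light and re-encoding gives roughly 0.73, a visibly lighter gray, which is
exactly the kind of subtle wrongness described above.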

There, I said it. And now to try to get dinner before getting up at 5am tomorrow
(which is 2am on my internal clock, since I just arrived from Tokyo).
Gah.

Sun, 24 Feb 2013 - IPv6 reverse generation

We recently renumbered (for the first, and I hope the last time), and in the
process, the question of IPv6 addressing came up; how do you assign static
IPv6 addresses within a given /64?

I won't be going into the full discussion of the various different
strategies, but I'll say that one element of the solution chosen was that
if you had an IPv4 address ending in .123, you'd also get an IPv6 address
ending in ::123. (IPv6-only hostnames would be handled differently.)

But then, how do you make sure the reverses are in sync? For some reason,
BIND doesn't have a good way of synthesizing a PTR name from an IPv6 address,
so you're stuck with typing 3.2.1.0.0.0.0.0.0.etcetcetc and hoping you got
everything right. It's a pain.
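
The nibble-reversed name is at least easy to generate mechanically, so one
option is to script the zone generation. A minimal sketch in Python (the
address is a made-up example; this is not part of any BIND tooling):

```python
# Sketch: synthesizing the PTR owner name for an IPv6 address, so a
# script can emit the reverse zone instead of a human typing nibbles
# by hand. The address below is just an example.
import ipaddress

def ptr_name(addr):
    """Expand an IPv6 address to its nibble-reversed .ip6.arpa name."""
    return ipaddress.IPv6Address(addr).reverse_pointer

print(ptr_name("2001:db8::123"))
# 3.2.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.b.d.0.1.0.0.2.ip6.arpa
```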

Sun, 03 Feb 2013 - Color and color spaces: An introduction

One of the topics that has come up a few times during the development
of Movit (my high-performance, high-quality video filter library)
is the one of color and color spaces. There's a lot of information out there,
but it took me quite a while to put everything together in my own head.
Thus, allow me to share a distilled version; I'll try to skip all the detail
and the boring parts. Color is an extremely complex topic, though; the more
I understand, the more confusing it becomes. So this post will probably end
up quite long anyway.

What is color?

Color is, ultimately, the way our vision reacts to the fact that light
comes in many different frequencies. (In a sense, the field of color is
actually more a subfield of biology than of physics.) In its most exact form,
you can describe this with a frequency spectrum. For instance, here is
(from Wikipedia) a typical spectrum of the sky on a clear summer day:

However, human eyes are not spectrometers; there are many colors with
different frequency spectra that we perceive as the same. Thus, it's useful
to invent some sort of representation that more closely corresponds to
how we see color.

Now, almost everybody knows that we represent colors on computers with
various amounts of red, green and blue. This is correct, but how do we
go from those spectra to RGB values?

XYZ

The first piece of the puzzle comes in the form of the CIE 1931 XYZ color
space. It defines three colors, X, Y and Z, that look like this
(again from Wikipedia, as are all the images in this post):

Don't be confused that they are drawn in red, green and blue, because they
don't correspond to RGB. (They also don't correspond to the different cones
in the eye.)

In any case, almost all modern color theory starts off by saying that
describing frequency spectra by various mixtures of X, Y and Z is a good
enough starting point. (In particular, this means we discard infrared
and ultraviolet.) As a handy bonus, Y corresponds very closely to our
perception of overall brightness, so if you set X=Z=0, you can describe
a black-and-white picture with only Y. (This is the same Y as you might
have seen in YUV or YCbCr. Let me ignore the distinction between Y and Y'
for now.)

Actually, we tend to go one step further when discussing color, since we
don't care about the brightness; we normalize the XYZ coordinates so that
x+y+z=1, after which a color is uniquely defined with only its x and y
(note that we now write lowercase!) values. (If we also include the
original Y value, we have the full description of the same color again,
so the xyY color space is equivalent to the XYZ one.)
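
As a small sketch of that normalization in Python (the XYZ triple used here
is the standard D65 white point, which also shows up as the white point in
the gamut diagrams below):

```python
# Sketch: normalizing XYZ down to xy chromaticity coordinates.
# The projection onto x+y+z=1 discards brightness; keeping the
# original Y alongside x and y (the xyY space) recovers it.

def xyz_to_xyY(X, Y, Z):
    """Project XYZ onto the x+y+z=1 plane; keep Y for brightness."""
    s = X + Y + Z
    return (X / s, Y / s, Y)

# The D65 white point, in XYZ:
x, y, Y = xyz_to_xyY(0.9505, 1.0000, 1.0890)
print(f"{x:.4f} {y:.4f}")  # 0.3127 0.3290
```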

RGB and spectral colors

So, now we have a way to describe colors in an absolute sense with
only three numbers. As we said earlier, we usually do this
using RGB. However, this raises the question: When I say “red”,
which color do I mean exactly? What would be its xy coordinates?

The natural answer would probably be some sort of spectral color.
We all know the spectral colors from the rainbow; they are the ones
that contain a single wavelength of light. (Then we start saying
stupid things like “white is not a color” since it is a mixture of
many wavelengths, conveniently ignoring that e.g. brown is also
a mixture and thus not in the rainbow. I've never heard anyone
saying brown is not a color.) If you take all the spectral values,
convert them to xy coordinates and draw them in a diagram, you get
something like this:

(To be clear, what I'm talking about is the big curve, with markings
from 380 nm to 700 nm.)

So now, we can define “red” to be e.g. light at 660 nm, and similar
for green and blue. This gives rise to a gamut, the range of colors
we can represent by mixing R, G and B in various amounts. For instance,
here's the Rec. 2020 (Rec. = Recommendation) color space, used in
the upcoming UHDTV standard:

You can see that we've limited ourselves a bit here; some colors
(like spectral light at 500 nm) fall outside our gamut and cannot
be represented except as some sort of approximation. Still, it's pretty good.

For a full description, we also need a white point that says where we
end up when we set R=G=B, but let me skip over the discussion of
“what is white” right now. (Hint: Usually it's not “equal amount
of all wavelengths”.) There's also usually all sorts of descriptions
about ambient lighting and flare in your monitor and whatnot—again, let
me skip over them. You can see the white point marked off as “D65” in
the diagram above.

A better compromise

You might have guessed by now that we rarely actually use spectral
primaries today, and you're right. This has a few very important reasons:

First, it makes for a color space that is very hard to realize in practice.
How many things do you know of that can make exactly single-frequency
light? Probably only one: Lasers. I'm sure that having a TV built
with a ton of lasers would be cool (*pew pew*!), but right now,
we're stuck with LCD and LED and such. (You may have noticed that
outside a certain point, all the colors in the diagram look the
same. Your monitor simply can't show the difference anymore.)
You could, of course, argue that we should let it be the monitor's
problem to figure out what to do with the colors it can't represent,
but proper gamut mapping is very hard, and the subject of much research.

Second, the fact that the primaries are far from each other means that
we need many bits to describe transitions between them smoothly.
The typical 8 bits of today are not really enough; UHDTV will be done
with 10- or 12-bit precision. (Processing should probably have even more;
Movit uses full floating-point.)

Third, pure colors are actually quite dim (they contain little energy).
When producing a TV, color reproduction is not all you care about; you also
care about light level for e.g. white. If we reduce the saturation of our
primaries a bit (moving them towards the white point), we make it easier
to get a nice and bright output image.

So, here are the primaries of the sRGB color space, which is pretty
much universally used on PCs today (and the same primaries as Rec. 709, used for
today's HDTV broadcasts):

Quite a bit narrower; in particular, we've lost a lot in the
greens. This is why some photographers prefer to work in a wider color
space like e.g. Adobe RGB; no need to let your monitor's limitations come between what your
camera and printer can do. (Printer gamuts are a whole new story, and they
don't really work the same way monitor gamuts do.)

Color spaces and Movit

So, this is why Movit, and really anything processing color data,
has to care: To do accurate color processing, you must know what
color space you are working in. If you take in RGB pixels from
an sRGB device, and then take those exact values and show them on an
SDTV (which uses a subtly different color space, Rec. 601),
your colors will be slightly off. Remember, red is not red.
sRGB and SDTV are not so different, but what about sRGB and Rec. 2020?
If you take your sRGB data and try to send it on the air for UHDTV,
it will look strangely oversaturated.

You could argue that almost everything is sRGB right now anyway,
and that the difference between sRGB and Rec. 601 is so small
that you can ignore it. Maybe; I prefer not to give people too many
reasons to hate my software in the future. :-)

So Movit solves this by moving everything into the same color space
on input, processing everything as sRGB internally. (Basically, what
you do is you convert the color from whatever colorspace to XYZ,
and then from XYZ to sRGB. On output, you go the other way.)
Lightroom does something similar, only with a huge-gamut colorspace
(so big it includes “imaginary colors”, colors that can't actually be represented
as spectra) called Pro Photo RGB; I might go that way in the future, but currently,
sRGB will do.
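
As a rough sketch of that XYZ round trip (in Python rather than Movit's
actual code, using the standard linear-sRGB ↔ XYZ matrices for Rec. 709
primaries with a D65 white point; a real implementation would of course
also handle the gamma curves discussed earlier):

```python
# Sketch: converting colors between a source color space and sRGB by
# going through XYZ. Only the linear matrix part is shown; the gamma
# encode/decode steps are omitted for brevity.

XYZ_TO_SRGB = [
    [ 3.2406, -1.5372, -0.4986],
    [-0.9689,  1.8758,  0.0415],
    [ 0.0557, -0.2040,  1.0570],
]
SRGB_TO_XYZ = [
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
]

def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

# Round-trip a linear sRGB color through XYZ and back:
rgb = [0.2, 0.5, 0.8]
xyz = mat_vec(SRGB_TO_XYZ, rgb)
back = mat_vec(XYZ_TO_SRGB, xyz)
print([round(c, 3) for c in back])  # [0.2, 0.5, 0.8]
```

Converting from some other space (say, Rec. 601) would use that space's
own RGB→XYZ matrix in the first step, with the XYZ→sRGB matrix unchanged.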