Writing - rachelbythebay.com

1213486160 has a friend: 1195725856 (2016-10-07)
Back in February, I wrote about finding a ridiculous number being passed
to malloc. That number was 1213486160, and it turned out to be "HTTP" on
the wire.

If you search the web for that, besides finding my original post, you'll
find lots of people who have had stuff break and can't quite figure out
why.

I'm making this mini-post tonight to do a public service. There's
another number you'll probably see a fair bit if you work in the same
space that has 1213486160 show up in it. That number is 1195725856.

>>> hex(1195725856)
'0x47455420'

See those 4x and 5x hex values with a 20 on the end? That should
get you to raise an eyebrow. What does it say? Well, it's the flip
side of the whole "HTTP" situation.

>>> chr(0x47), chr(0x45), chr(0x54), chr(0x20)
('G', 'E', 'T', ' ')

Yep, "GET ", as in "GET / HTTP/1.0" or similar.

In other words, if you see 1195725856 showing up in your logs, you're
probably getting connections from things speaking HTTP at you: actual
web browsers, security scanners, people running curl and wget, elite
hax0rs trying to own you, and so on.
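Both of these numbers give up their secret the same way: take the 32-bit
value as four big-endian ASCII bytes. Here's that trick wrapped up in a
small helper, a sketch for the next suspicious number that crosses your
logs:

```python
import struct

def magic_to_ascii(n):
    """Render a 32-bit value as its four big-endian ASCII bytes."""
    return struct.pack(">I", n).decode("ascii", errors="replace")

print(magic_to_ascii(1213486160))  # HTTP
print(magic_to_ascii(1195725856))  # GET (note the trailing space)
```

On a little-endian capture you may need to flip the byte order ("<I")
before the letters appear.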

This one goes out to my friends at work who found another oddity
in a production system and unraveled the significance. Welcome to
the funny number spotting club!

Go upgrade Xcode. Fix your git security hole. (2016-05-05)

Remember my post from a couple of weeks ago, where I warned OS X users
that they probably had a vulnerable git version installed by default?

Remote code execution, git, and OS X (2016-04-17)

Sometimes I think about all of those pictures which show a bunch of
people in startups. They have their office space, which might be big,
or it might be small, but they tend to have Macs. Lots of Macs. A lot
of them also use git to do stuff, perhaps via GitHub, or via some other
place entirely. There are lots of one-off repos all over the place.

So you have lots of people running Macs and running git. Great, right?
Peace, love, and free software?

It's not all good. Let's say you're running El Capitan. What git do
you get with your system? I installed it tonight just to find out:

mini$ git --version
git version 2.6.4 (Apple Git-63)

git 2.6.4. Is anything wrong with that? Well, yeah, actually. Say
hello to CVE-2016-2324 and CVE-2016-2315, present in everything before
2.7.1 according to the report. You should check this out.
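If you're wondering about a whole fleet of machines, the banner is easy
enough to parse. A sketch (the helper names are mine, and note that
vendor builds sometimes backport fixes, so the upstream version number
is only a first approximation):

```python
def parse_git_version(banner):
    """Pull (major, minor, patch) out of 'git version X.Y.Z (...)' output."""
    words = banner.split()
    if words[:2] != ["git", "version"]:
        raise ValueError("unexpected banner: %r" % banner)
    return tuple(int(p) for p in words[2].split(".")[:3])

def is_vulnerable(banner, fixed=(2, 7, 1)):
    """CVE-2016-2324 / CVE-2016-2315 were fixed upstream in 2.7.1."""
    return parse_git_version(banner) < fixed

print(is_vulnerable("git version 2.6.4 (Apple Git-63)"))  # True
```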

My reading suggests that if you were to point a vulnerable version at a
repository which is controlled by an attacker, then they could run code
as you on your machine. Once that's done, it's game over. They own
you.

So, what's the big deal? Crappy C code gets exploited every day, and we
upgrade it, and then we're "safe" until the next huge hole that's been
there forever is reported. (In the meantime, people party with their
private stash of vulnerabilities.)

But what if you couldn't upgrade it? Remember when I said El Capitan?
Apple is doing something new which basically keeps you from twiddling
certain system-level programs without going to fantastic lengths. Not
even root is enough to do it. In short, you can't just replace
/usr/bin/git.

Maybe you want to be clever and protect your users by disabling it
until you can figure something else out. Well, sorry: "chmod -x" to
keep it from being used will fail, too.

You might know a thing or two about how Apple provides these programs,
and the fact that git and 64 other files in /usr/bin are all the
same size (18176 bytes on my machine) might clue you in. You might
think "aha, it's some kind of abstraction", and you're probably right.
But, where's the "real git" then?

I know, I'll just strace it to see what it execs! Oh wait, this isn't
Linux. Uh, I'll dtruss it to see what it execs!

Or not. Running strings on it is the same deal: you get no output on
the OS X machine. It takes GNU strings on a Linux box to get results,
and none of them point you in the right direction.

I'll save you the trouble of digging through your entire machine and
tell you that you will probably find it in
/Applications/Xcode.app/Contents/Developer/usr/bin. If you chmod -x
that binary, then 'git --version' will eventually fail
to run, as you originally wanted. Now you can be sure you won't
accidentally run it while you figure out an upgrade strategy.

Of course, upgrading over top of that will almost certainly screw
something up later. It's times like this when you depend on the
vendor. Well, vendor, how about it?

Oh, and, incidentally:

$ xcodebuild -version
Xcode 7.3
Build version 7D175

If you rely on machines like this, I am truly sorry. I feel for you. I
wrote this post in an attempt to goad them into action because this is
affecting lots of people who are important to me. They are basically
screwed until Apple deigns to deliver a patched git unto them.

Let's see what happens.

May 5, 2016: This post has an update.

That VMware IPv6 NAT thing is stranger than it looks (2016-03-27)

Right, so, a couple of days ago I wrote about VMware Fusion, trying to
use their IPv6 NAT "feature", and failing
miserably when attempting certain types of TCP connections over it. If
you haven't read it yet, go check it out first, so this will make more
sense.

Since then, there has been a steady trickle of feedback from the usual
sources. One comment said that it sounded like a path MTU discovery
problem. I'm going to have to disagree with that, since none of the
packets are particularly large, and besides, this will repro easily even
with no funky ICMP filtering going on. As for whether I might be
able to recognize this in the wild, I present my post on that very
topic from May 2015 -- IPv6, even.

Two different people commented that there is no "TCPv6", so I patched
the original post. No particular need to get people worked up over
some shorthand for something longer (that being "TCP, operating over
IPv6").

There were questions if I was seeing checksum errors from TCP. I am
not. It's a lot more evil than that, but you'll have to check out
my packet traces to see just how far this rabbit hole goes.

Some people didn't get it working with 100 msec. I'm sorry to report
that you may have to go higher. I don't have a solid explanation for
why yet. While working towards that during some idle time this weekend,
I found something particularly shocking which warranted putting out
this post first.

One commenter on HN wondered if connect() is returning "before it's done
its work". That's actually kind of what's happening, but not because of
a nonblocking connect call. It's connecting because, as far as I can
tell, VMware is spoofing the connection. Yeah. Telebit
called, and they want their TurboPEP back.

Seriously, check this out. I went and installed Fusion 8 Pro on my
underused Mac Mini and then installed Ubuntu LTS on that.
Then I started trying to connect outward to my dual-stack machine at
SoftLayer (the one that's probably feeding you this page). This is what
I saw from inside the VM:

Okay, so, yeah, this is a lot of crazy cruft for people who don't speak
this language. I will attempt to boil it down to the key pieces.

That fd15:... address is just whatever is being generated by the Fusion
NAT setup. It's not the host Mac's actual (routable!) v6 address. It's
what the Linux VM thinks it is, though. The 2607:...:2 address is
magpie at SoftLayer, aka rachelbythebay.com at the time I write this.

With that in mind, here we go.

At 16:04:34.801556, the test Ubuntu VM sends out the bare SYN to my
server's port 8080.

245 microseconds later, at 16:04:34.801801, magpie
allegedly responds with its own SYN and ACKs the SYN it
supposedly received.

We don't even need to talk about the rest. It's just this simple.
magpie is not here at my house. magpie is in Dallas, somewhere, at a
SoftLayer colo facility. I'm using a Mac here in Silicon Valley. There
is NO WAY you could get a SYN from here to there, respond to it, and get
that response back in 245 microseconds.

You know how far light can travel in 245 usec? Just under 75 km. If
you don't believe me, go feed this to your favorite solver:

((.801801 seconds) - (.801556 seconds)) * c

75 km, one way, wouldn't even get my packet out of the state, never mind
to Texas.
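Feeding that same interval to Python instead of a solver gives the
one-way distance light could cover:

```python
C = 299_792_458  # speed of light, m/s

dt = 0.801801 - 0.801556          # SYN to SYN-ACK, in seconds
distance_km = C * dt / 1000.0
print(round(distance_km, 1))      # 73.4 -- km, one way, in a vacuum
```

And that's the theoretical best case; real packets through real fiber
and real routers do far worse.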

Therefore, we have a hypothesis: VMware is spoofing the session
establishment. With that in mind, can we test and see if it's really
happening? Of course we can.

How? Filter (DROP) packets bound for port 8080 on the server. The
server will NEVER respond. If we get anything back, it's a
dirty lie. Here's a screen shot. You're going to need to click on it
to see it full-size to make any sense of it.

I added a rule with ip6tables to drop TCP traffic to port 8080, and
then watched with tcpdump to see what would show up. Sure enough, my
Mac Mini's IPv6 address sends a whole bunch of SYNs to port 8080, and
my server ignores them.

However, down in the Linux VM, it's actually received an ACK to its SYN
and it ACKed that in turn, and it thinks it has a connection! Yes!

And, no, it doesn't do this on TCP over IPv4. I tried that, too. When
you try to connect to a black hole, it doesn't spoof the damn
connection.

Okay, one last thing to prove that this is what's going on. I'm going
to connect to an IPv6 host that doesn't even exist, and Fusion is going
to let me.

Yep. "ss" (think 'netstat' if you're not familiar with it) shows a nice
ESTABLISHED connection. As far as Linux knows, it's connected to this
elite host with an address I just made up.
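If you want to test your own setup for this behavior, try "connecting"
to an address that cannot possibly answer; 2001:db8::/32 is reserved
for documentation and should never route anywhere. A sketch:

```python
import socket

def tcp_connects(host, port, timeout=3.0):
    """Return True if the local stack claims the TCP connect succeeded."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

# On a sane network this prints False. Behind the NAT behavior
# described above, the spoofed handshake makes it come back True.
print(tcp_connects("2001:db8::bad:1", 8080))
```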

This looks less like them doing NAT and more like them doing some kind
of SLiRP-ish thing where they see a connection attempt and then make a
new one to the same destination and copy data between them. The problem
is that
apparently you can "race" it, and thus confuse it by speaking "too
quickly".

If this is in fact what they are doing, I would expect all sorts of
other TCP stuff to misbehave. Do they handle TCP OOB data? You know,
the whole URG flag thing, perhaps only known for starring in 1997's
WinNuke?

I'm not going to do their work for them. Figure it out, VMware.

HTTP/HTTPS not working inside your VM? Wait for it. (2016-03-22)

Have you ever run into a problem, done some troubleshooting, but decided
it only affected yourself, and shelved it for a while? I did that a
couple of months ago with a particular problem which forced its way
back into my life this morning.

The problem is simple enough: parts of the web are going IPv6-only, and
I couldn't get to them from inside my VMware Fusion virtual machine.
What was really strange is that ping6 worked, and traceroute6 worked,
and even connecting to the port by hand with netcat worked. It was more
stuff like Firefox and curl that would fail... or, a request piped
into netcat.

Yes, now you should be wondering what was going on there. Running "nc
-6 rachelbythebay.com 80" and typing out "HEAD / HTTP/1.0 [enter]
[enter]" would work fine. But, rigging something to echo that and
piping it into the same nc command would fail!

Not only would it fail, but it would fail in a most peculiar manner: it
would just hang. Sniffing the traffic on the VM, on the host, and on my
web server didn't really help me figure it out, either.

Eventually, I somehow realized that it was the delay. If I typed it in
myself, there was a small but nonzero interval between the TCP
connection opening and my request going out. If, however, I piped it
in, netcat and the kernel would fire it down the pipe as soon as it
could.

Firefox, curl, and just about everything else intended to speak over the
web also has this situation: open a socket, fire a request down it.
That's how HTTP works.

Around this time I also realized that ssh over IPv6 was working just
fine. I chalked this up to the fact that ssh clients remain silent
until the server pushes back a banner, and only then do they start
handshaking. This adds a nice delay, and apparently it's usually enough
to get past whatever the "danger zone" was.

That's where I left it until this morning. I figured VMware would come
along and patch it eventually. They couldn't possibly miss the fact
that trying to do real work with TCP over v6 from their VM hangs every
single time, could they?

They could, and they did. Someone else in the world reported the
problem back in September, and aside from some random person asking a
totally useless question, nothing had happened on the thread.

I didn't know any of this until it came back into my life, though.
Certain web servers have been going IPv6-only of late, and I'd been
kludging around it in my own twisted way, but now it was starting to
affect other people who were also using VMs. A friend remembered my
mention of "it works with a delay" and pulled me in, and I decided to
take this through to a fulfilling conclusion.

The first thing to do was to quantify the exact nature of the situation.
I knew it needed some delay, but how much? I decided the easiest way
was to pair up sleep and echo in a subshell, piped into netcat. I'd
start with a large delay and would dial it back until it stopped
working.

I ran that. I got "HTTP/1.1 200 OK" back, as expected. I dialed it
back to 0.5 -- half a second. That worked, too. Then I went half of
that, at 0.25. That also worked. Rather than binary-searching it down,
I jumped to a natural value a human might use: .1 -- 100 milliseconds.
That actually worked.

I backed it off a little bit to .095 -- 95 milliseconds. This worked,
too, so I backed off 5 more milliseconds to .09, and that's when it got
stuck. At 90 milliseconds of delay from userspace, it doesn't work. At
95, it will. I figure that's about 100 milliseconds with some slop for
round-trip times and whatnot in the TCP session.

Just to be sure, I went beyond that to 25 milliseconds and below, and
none of them worked. They'd all hang.
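The search I just described boils down to a simple loop. Here it is
sketched in Python with the network probe stubbed out so the logic is
visible (the real probe was the sleep-and-echo subshell piped into
netcat):

```python
def smallest_working_delay(try_request, candidates):
    """Walk candidate delays from largest to smallest; return the
    smallest one for which the request still succeeds."""
    best = None
    for delay in sorted(candidates, reverse=True):
        if try_request(delay):
            best = delay
        else:
            break
    return best

# Stand-in for the real probe, which was a subshell along the lines of
#   (sleep $d; echo "HEAD / HTTP/1.0"; echo "") | nc -6 host 80
# Here we just pretend anything >= 95 ms works, as observed.
fake_probe = lambda d: d >= 0.095

print(smallest_working_delay(fake_probe,
                             [0.5, 0.25, 0.1, 0.095, 0.09, 0.025]))
# 0.095
```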

Then, just to prove my point, I kept the tiny sleep and flipped it to
be "-4" to force an old-school (historic, even) IPv4 connection. It
worked just fine, of course.

I grabbed a screen shot of all of this because nobody would believe this
otherwise.

In relaying this story to my friends who had brought me into the
discussion, I suddenly had a very bad idea: what if I purposely delay
all traffic leaving my machine by 100 milliseconds? Would it work?

I'll let this screenshot answer that question for you.

That's a big yes. A huge yes. I went into Firefox and loaded
test-ipv6.com.
For the first time ever, it was able to pass all of the checks when
accessed from inside my VM.

I wish I was making this up.

For those of you similarly afflicted, here's that command so you don't
have to transcribe it from my screenshot:

tc qdisc add dev eth0 root netem delay 100ms

Bonus points if you can figure out enough of the 'tc' syntax to only
match outgoing TCP IPv6 connections and not touch everything else
that's leaving your VM. I didn't care about that since all I needed
was a simple proof of concept, and that did it.

Unbelievable.

March 27, 2016: This post has an update.

Literal-minded robots make messes easily (2016-03-11)

If someone gave you a list of commands to run in sequence, and you hit a
problem while running one of them, what would you do? Would you say
something about it? Would you stop following the list and ask for help?
Or would you just carry on like nothing happened?

Okay, with that answer in mind, now consider what you get if later
commands rely on earlier commands succeeding and providing output. Do
you still like that answer? Are you sure?

I'll give you a scenario. You're a custom cake robot thingy, and you do
what you are told, and someone tells you to do these things:

Pick a spot on the kitchen counter. Call that spot KC.

Grab a cake stand from the cabinet.

Put the cake stand at spot KC.

Grab the cake from the fridge.

Set the cake above spot KC at the height of a cake stand.

Grab the icing from the cabinet.

Ask the person next to you what their name is.

Write their name and "is awesome" on the cake with the icing.

Hand the cake stand to the person.

How many different ways can this go wrong? I'll take a whack at trying
to list some of them without going for everything.

There is no cabinet, so you fail to get a cake stand. You place the
cake at the spot where the top of the stand would be. It's supposed to
be a tall stand, so it's a long way down to the counter. The cake is
ruined, and it just goes downhill from there.

There's a cabinet, but it's fresh out of stands. The same thing
happens. Splat.

You get a stand, but there's no kitchen counter this time so you picked
a spot in the air where the counter should be. When you try to set the
cake stand there, gravity takes over and it hits the floor, exploding
into a million tiny pieces. Then the cake follows a minute later and
joins it on the floor. The rest is about the same as before.

You get a stand, but there's no cake in the fridge for whatever reason.
You wind up spraying icing all over the spot where a cake should be, and
so end up with "so and so is awesome" written in icing on a bare stand.

There's no icing, but you go through the motions anyway. The cake isn't
ruined, but it's also not personalized. The person is sad.

There's some icing, but not enough. It runs out part way through. You
write "So and so is aweso". They aren't amused.

You ask them their name, but they're distracted and go "Huh? One sec,
I'm on the phone". You give them a cake which proudly proclaims that
huh one sec I'm on the phone is awesome.

You ask them their name, but they don't hear you and continue their
earlier conversation. You think they're answering you, and take
everything they say as their name. You wind up writing a full
transcript of their conversation in icing on top of a cake, at least,
until you run out of cake and start spraying it where the cake would be
if it was a mile long.

They say their name is "Mary Jo Smith". You've been taught that you
only want their first name, and that first names are the first word in
whatever they say to you. You write "Mary is awesome". Mary Jo is
displeased.

The person walks away after answering your question while you're still
doing the icing work. As a result, you hand the completed cake to thin
air, and it falls to the ground where they had been standing a moment
before. Splat.

Would you let a robot act like this? I certainly hope not. But, the
thing is, if you write bash scripts in a relatively plain way, you are
basically creating a brainless robot which will barrel straight on no
matter what sort of bad things happen.

Suppose the uninstall script does something like this:

X=$(get_path_to_foobar)
rm -rf $X/

Guess what happens if get_path_to_foobar is missing. X winds up empty.
So, then, you're going to run ahead with 'rm -rf /', because X will
expand to nothing. Awesome. Unless you have a recent version of GNU
coreutils which prevents such shenanigans (probably due to people
writing scripts like this), you'll blow away everything you can access
on the box.

Bonus: if you're running on certain flavors of UEFI boxes and have
write access to it through a particular pseudo filesystem setup, you
might just render it unable to boot or even brick the management stuff
which gives you out-of-band access for remote reboots and consoles.
(You don't have to be running systemd to have this possibility.)

You dig into the bash manual and find something that claims to kill the
script after a command fails, and you stick this up top:

set -e

Now you get something like this:

./uninst.sh: line 3: get_path_to_foobar: command not found

That's it. It stops there.

You might think you're done, but then some day someone changes the way
the get_path stuff works. Now it can return multiple lines of output
with "A=...", "B=..." prefixes, and only one of them matters now, so
they add a "grep" and "cut" to grab the right part of that one line.
Now it looks something like this:

X=$(get_path_to_foobar | grep "^X=" | cut -d= -f2)

What now? Now you go looking for something that'll trip if anything in
the whole pipe | line | of | commands fails, and find this:

set -o pipefail

With that set, it's all or nothing: everything in that pipeline works,
or the whole thing is considered a failure. This is a slight
improvement.

You still have to worry about the result of the pipeline even if it
succeeds. What if the line in that file is just "X=", and cut hands
you back the empty string? What if the line is just "X=" plus a bunch
of spaces, and it hands you back those spaces?

What if that line got messed up when someone hit ^J in nano and
unwrapped a bunch of lines, and so now it looks like this?

X=/real/path/to/thing We support photos now

When that gets into your script, you now have this:

rm -rf /real/path/to/thing We support photos now/

Guess what: you just blew away any files or directories in the current
working directory called "We", "support", or "photos", and any directory
in there called "now".
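Python's shlex module splits words the same basic way the shell does
here, so you can watch the damage take shape without endangering any
files:

```python
import shlex

cmd = "rm -rf /real/path/to/thing We support photos now/"
print(shlex.split(cmd))
# ['rm', '-rf', '/real/path/to/thing', 'We', 'support', 'photos', 'now/']
```

Each of those words after "-rf" becomes its own argument to rm.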

What if the user had a real directory called "photos"? All of their cat
pictures are now gone, gone, gone, and it's your fault.

Now you get to learn about explicitly defining each positional argument
in your call to "rm" so spaces don't trip it up. The fun never ends.

Just imagine all of the stuff I didn't cover here. It just goes on like
this. Considering how complex this stuff is, and how many different
ways it can fail, it's a wonder anything gets done.

A mystery with memory leaks and a magic number (2016-02-21)

Have you ever seen a number that looks out of place in its current
context, or otherwise looks like it belongs somewhere else? Once in a
great while, I get that feeling, and sometimes, the hunch turns out to
be right. It's just a matter of how you look at it. Here's a situation
where it happened to apply as part of a bigger group effort.

There was a system which was supposed to report in with telemetry data
but it wasn't managing to pull it off most of the time. It seemed to
work on other hosts, but on this one, it had real problems. Someone
started looking at this and noticed that the telemetry collector daemon
was getting restarted every few minutes. What's more, it was happening
because this nanny process thought it was broken.

Looking more closely at the collector's logs showed that indeed, it was
broken. One thing it tries to do is run subprocesses on your behalf
(think 'netstat' type stuff, kinda-sorta), and then it grabs the output
and sends it off to the big stateful storage service. Well, after a few
minutes of being up, those subprocesses would fail to start.
Specifically, it would fail to fork(). It logged something like ENOMEM
-- out of memory.

(Aside: fortunately, this program is smart enough to know that fork can
fail and doesn't try to treat "-1" as a pid.)

That got people looking at the memory size of this program, and sure
enough, it was growing fairly consistently. Every minute or so, it
would grab a bit more than a gigabyte of memory. After a few minutes of
this, it ran up against the limits of what the kernel was willing to do,
and so the fork (which was really a clone at the syscall level) would
fail.

Now the question had changed to "why is this thing leaking memory". The
first thing to do was to get the approximate times of when it would
grow. A quick loop in a shell to print the size and time provided that,
and that gave us log entries to look at. We'd see it grow between
2:30:15 and 2:30:16 and would then check out what got logged around
then. A minute later, it would grow around 2:31:15 to 2:31:16 and we'd
look again.

These log entries didn't show any smoking guns, but there were a few
things to remember. While most of these telemetry collector jobs showed
up a lot, a couple of them seemed to show up right around the time the
program would start chewing memory. One of them was also complaining
about failing to do some network I/O -- something about data being
truncated.

At this point I decided to take a whack at it in a debugger. While I'm
generally a "debug with printf" sort of person, in this case it was
going to take a while to build a new binary, and we had debug symbols,
so why not parallelize and try gdb at the same time? After a bit of
fooling around in the UI, I got it going and added a breakpoint at the
beginning of the common code used to do network-based status counter
collections. Then I started it up inside the debugger.

This means the program would run along as usual, and then it would just
return to the prompt and wait for input from me. Meanwhile, my other
terminal was printing the current process size over and over. Then I
could tell it to continue and could watch what happened. Sure thing,
once in a while, one of the calls would be followed by the memory
utilization jumping up.

After doing this for a bit I realized that these function calls had some
parameters which could be used to tell them apart, like the port number
in use. The one I suspected was going to port 8080, so I changed my
breakpoint to only make it fire when it was called with that value.
This worked nicely: the program would run without nearly as many
interruptions, then it would break, and I could continue it, and that's
when I'd see the growth.

Now this pretty much meant that this particular job was trouble,
somehow, but what was going on? There was a particularly good
suggestion to trap mmap to see which part of the code was actually
calling for more memory. After some false starts, I decided to wait for
the breakpoint to fire, and then would manually add a break on mmap on
just that thread, because trapping all mmaps was just far too noisy.
Then I continued the program and waited.

A few mmaps later, I had something plausible. There was a stack showing
mmap, and a bunch of stuff going on leading to it. I left this up on my
screen to ponder. Among the noise, one thing stuck out: the reason why
it was growing by so much. It looked something like this.

malloc(1213486160) at some/path/to/foo.cc:1234

I don't know why, but that number just seemed... strange. It seemed to
belong somewhere else. My next step was the Python interpreter, since
it's great for looking at this kind of stuff.

That's when I saw this:

>>> hex(1213486160)
'0x48545450'

This tells us that 1213486160 expressed in hex is
48 54 54 50. Again, I don't know exactly why, but this
tripped something in my head that made me think "printable ASCII,
perhaps letters". My next stop was to figure out what those were. I
used the "ascii" CLI tool at the time, but for this story I'll use
Python to keep it all in one place:

>>> chr(0x48), chr(0x54), chr(0x54), chr(0x50)
('H', 'T', 'T', 'P')

Now it all made sense. The number of bytes it was trying to allocate
was this crazy number, but that crazy number was actually just the
letters "HTTP" taken as a 32 bit value on this particular platform!

Why was it getting back HTTP? That part was easy: it was trying to
speak its custom protocol to something on port 8080, and there was some
web server going "huh, what?" by returning "HTTP/1.0 400 Bad Request".
Apparently the network client code assumes the first four bytes are a
length, and tries to set up a buffer of that size and then goes to read
that much into it.

Now, since there isn't ~1.2 GB of data waiting on the other end of the
pipe, it fails to fill up the buffer, and ultimately throws an error
about "not enough data" -- remember the truncation thing before?
Exactly.

So what do you do about it? Well, right away, you turn off the thing
which is trying to speak non-HTTP to a HTTP port. Then, you see about
making the network client code a little less trusting, and maybe add
some magic numbers or other sanity checks before it jumps in feet-first.
Also, there's that whole memory leak issue when it bails out which needs
to be fixed.
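I don't have that client's source in front of me, but the shape of the
sanity check it needed is generic: cap the length before trusting it.
A sketch with made-up names and a made-up cap:

```python
import struct

MAX_REPLY = 16 * 1024 * 1024  # sanity cap: no legitimate reply is 1.2 GB

def read_length_prefix(raw4):
    """Interpret 4 bytes as a big-endian length, rejecting absurd values."""
    if len(raw4) != 4:
        raise ValueError("short read on length prefix")
    (length,) = struct.unpack(">I", raw4)
    if length > MAX_REPLY:
        raise ValueError("implausible length %d (0x%08x, bytes %r) -- "
                         "wrong protocol on this port?"
                         % (length, length, raw4))
    return length

# A real length is fine; "HTTP" as a length is caught immediately.
print(read_length_prefix(b"\x00\x00\x01\x00"))  # 256
```

Feed it the first four bytes of "HTTP/1.0 400 Bad Request" and it
raises an error instead of trying to allocate 1.2 GB.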

This puts the fire out, then keeps it from happening again in the first
place to someone else who mixes up their ports or protocols. Then life
goes on.

I guess this means 1213486160 is going on my list of magic
numbers. Also, it means that 2008-06-14 23:29:20 UTC is, in a strange
sense, "HTTP time". If you see either that number or that date
(with possible adjustments for your local time zone) showing up in
your life inexplicably, this might just be why.
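That "HTTP time" claim is easy to verify; it's just the magic number
run through an epoch conversion:

```python
from datetime import datetime, timezone

print(datetime.fromtimestamp(1213486160, tz=timezone.utc))
# 2008-06-14 23:29:20+00:00
```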

October 7, 2016: This post has an update.

Troubleshooting another spot of downtime (2015-12-31)

Yep, the site was down this morning. Method of detection? Certainly
not any automated monitoring. Nope, in this case, I got mail from
someone who was kind enough to notice and who guessed at a contact
e-mail address (which worked -- thanks Lin!), and sure enough, the site
was toast. Time for some troubleshooting.

Obviously, Apache was down, but why was it down? It was down because it
wouldn't restart. (Duh.) But, why would it not restart? Time to try
it by hand.

The restart command dutifully printed an "[OK]", but things were
certainly not OK.
Apache was not running. A quick 'ss' and a double-check with netstat
confirmed: no, nothing is bound to the port, as you would expect with
the web server down. So, that can only mean one thing: the web server
is stupid enough to try to bind to the port twice and is
conflicting with itself.

Into the conf.d directory I went, and it was obvious what I had to do.

# grep -r 443 .
./some_domain.conf: Listen 443
./ssl.conf: Listen 443

some_domain.conf is one of the files I've written myself. That's not
its real name, but that's not important to the story. The point is, it
has its own config for all things SSL/TLS on the site: all of the
yak-shaving stuff I did earlier this year to make it all look nice on
those "grade your SSL" pages.

But, what's this? ssl.conf also existed, and it was sporting an
identical Listen. What's ssl.conf? That's easy: it comes with an
install of the httpd rpm... or an update. Yep, an update, like the one
I had just run on purpose.

So this brings us to a good couple of hours ago: I decided it was time
to upgrade things on the box, and just let it run to completion by
itself. The first machine used to do RHEL upgrades by itself, so what
could go wrong? Then, I
walked away without checking things, because it was late and what could
go wrong, and all of those sentiments.

From personal experience (which we'll get to later), it's clear what
happened: despite my home-brewed config management stuff which sets up
all of the files in and under /etc/httpd, nothing stops something from
adding a new, unmanaged file to the mix. In this case, it was a file
which I probably deleted ages ago while migrating to this machine, and
then RPM "helpfully" put back during the update.

This re-introduced the extra Listen directive, and the rest is history.

I now have a placeholder file with that same name in there just to keep
it from "helpfully" reappearing on a subsequent update.

This should sound very familiar to long-time readers... because it's the
same thing which happened three -- almost four -- years ago: another RPM
update resurrected a 'dead' config file and hilarity ensued.

There, it was mod_php. Here, it was SSL configs. Same effect: a dead
site.

One other part of this story may be making you go "wait a minute..."
here, and it should. Remember the whole bit about configuration
management systems only managing the files they know about, and not
noticing new ones?

"Any file in a magic directory becomes part of the active
configuration."

So, there's my story: nothing new under the sun, and the same failure
modes as before. The only thing that's any different is how much faster
I fixed things this time because they all seemed too familiar.

There's a lot of stupidity in this story. How much can you find?
I'm sure someone will be along shortly to tell me how these things
should work, from monitoring, to root-causing, to patching stuff in
Apache, to sending things upstream.

To those people: you aren't wrong... but you don't know the whole story,
either.

]]>One checkbox equals non-UTC funtag:rachelbythebay.com,2015-09-07:noleap2015-09-07T22:53:01Z
GPS time is not quite the same as NTP time. They are actually somewhat
different things, and if you use one of them to feed the other verbatim,
you might just have a bad day. What's the big deal? Yet again, it
comes down to leap seconds.

GPS time doesn't have leap seconds. It's just been incrementing
steadily for a little more than 35 years.
UTC,
on the other hand, still does have leap seconds, and it makes life
interesting for people every couple of years when the accumulated
offset calls for an adjustment. We just went through one of these back
at the end of June/beginning of July.

Why am I bothering to write about this? Well, it's actually not that
difficult to shoot yourself in the foot and mix up these time scales
with a commonly-used commercial GPS-based NTP appliance. There's a nice
checkbox which says, quite simply:

Ignore UTC Corrections from GPS Reference

It looks tempting, right? The default is unchecked, which gives you the
double-negative effect, resulting in honoring UTC corrections
from GPS... which is probably what you want, whether you understand it
or not.

If you check that and restart ntpd on the box, it will start giving you
time that's (currently) 17 seconds fast of what most people would
expect. This number will grow as more leap seconds are inserted in the
future.
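The arithmetic here is simple enough to sketch in a few lines of Python. The 17 is correct as of mid-2015 and grows with each new leap second; `gps_to_utc` is a hypothetical helper, not anything the appliance exposes:

```python
from datetime import datetime, timedelta

# GPS time starts at 1980-01-06 00:00:00 UTC and never inserts leap
# seconds; UTC does. GPS_UTC_OFFSET is the accumulated difference: 17
# as of mid-2015, and it grows by one with every new leap second.
GPS_EPOCH = datetime(1980, 1, 6)
GPS_UTC_OFFSET = 17

def gps_to_utc(gps_seconds):
    """Convert seconds since the GPS epoch to a UTC datetime."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds - GPS_UTC_OFFSET)

# A box shipping raw GPS time over NTP is exactly this far fast:
raw = GPS_EPOCH + timedelta(seconds=1_000_000_000)
utc = gps_to_utc(1_000_000_000)
print((raw - utc).total_seconds())   # 17.0
```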

So let's say you have a bunch of these things in your organization
spread throughout your network. Maybe one of them has it checked and
the others don't. Odds are, ntpd will declare shenanigans on the one
outlier and it will ignore it. But, what if no other sources are
available? Or, what if multiple sources get this checked somehow?

If enough sources claim the time is 17 seconds fast, ntpd will
eventually conclude that the local machine is the crazy one,
and will step the time to compensate. The machine will skip over that
time and will be running fast.

Now let's say at some point after that, the local consensus is that time
is actually back where it should be. At first, ntpd will act like all
of its sources are insane, with huge offsets and jitter values
approaching -17000 milliseconds. However, as time goes by, the jitter
will drop because they are consistently offset by the same amount.
Eventually, ntpd will again realize the local clock is the crazy one,
and it will step the clock just like it did before, but now, it'll
go backwards.

When this happens, any program on the machine which is looking at the
real-time clock is going to start rocking and rolling as it repeats the
last 17 seconds. If you're using functions like time(), gettimeofday(),
or even some flavors of clock_gettime (the REALTIME ones) in your code,
you're going for a ride!
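One way to stay off that ride, at least when all you need is an interval rather than the wall-clock time, is the monotonic clock. A minimal Python illustration:

```python
import time

# time.time() reads the realtime clock, which ntpd can step backwards;
# time.monotonic() only ever moves forward, so it's the one to use for
# measuring intervals.
start_wall = time.time()
start_mono = time.monotonic()

time.sleep(0.05)

wall_elapsed = time.time() - start_wall        # could be negative after a step
mono_elapsed = time.monotonic() - start_mono   # never negative

print(mono_elapsed > 0)   # True
```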

There's a nice warning about this in the vendor's documentation:

CAUTION: NTP time is based on the UTC time scale. Distributing GPS time
over NTP is non-standard and can have serious consequences for systems
that are synchronized to UTC. This action should only be performed by a
person who is knowledgeable and authorized to do so.

It's kind of neat in an evil way. Flip that switch and you will start
publishing time over NTP that isn't equivalent to what you'll usually
find when you speak NTP to arbitrary hosts on the Internet. Again, the
vendor warns users to distribute such time only on "private or closed
networks", and to avoid doing it in public, and to even lock it down to
keep people away from it.

Somehow, I think a mere checkbox + daemon restart is a little thin in
terms of protecting people from themselves. I think there's an added
bonus in that you can apparently check the box and not restart
the daemon right away. That leaves a wonderful
timebomb just waiting
for the next time something causes it to restart: a power outage,
reconfiguring the network, changing ntpd's peers, or just clicking the
UI's restart button.

You could be good for years, and then one day...

]]>It's all fun and games until someone [XOFF]tag:rachelbythebay.com,2015-06-20:blocking2015-06-20T23:00:41Z
Way back in the dark ages of Linux, I had a bunch of machines which
didn't run X. They were strictly text mode, and sat there doing
whatever I needed them to do: routing, DNS, dialups, mail, RADIUS auth,
you name it. There were plenty of daemons working for me, and most of
them had things to say via the syslog.

To keep an eye on things, I would just watch the syslog by tailing files
like /var/log/messages. That was all well and good, but it meant having
to be logged in. If I was logged in, then there was a console open,
just a single ^C away from giving a shell to anyone who came by. This
was before the days of "cheesing", "Biebering", or "jelloing" unlocked
terminals, but I still didn't like it.

One day I noticed that you could create a new text console on a Linux
box just by shoving some data at the right /dev entry. Just "echo"
something to /dev/tty12, then ALT-F12 and you'd be able to see it.
Prior to that, ALT-F12 would do nothing since it didn't exist yet.

From that discovery it was just a hop, skip and a jump to having syslogd
write a second copy of everything to /dev/tty12. Then I could log out,
flip to that virtual console, and watch things that way. Any time I
wondered what was happening, a quick tap on the keyboard to turn the
screen back on would let me see without logging in.

That's how it went for a while, but then one day basically everything
on that machine stopped responding. It's been a long time so I forget
exactly what transpired in the middle, but eventually it came down to one
thing: my keyboard's scroll lock light was glowing at me when I was on
tty12.

Yep, the key that has basically been part of the PC keyboard layout
forever and never really did anything finally had a purpose, and on
Linux, it served to stop writes to a text console. This was all well
and good in theory, but since syslogd was doing blocking writes to it,
that also meant syslogd would get jammed. These days, that wouldn't be
a problem, but back then,
/dev/log was connection-based.

When syslogd got stuck, it stopped reading /dev/log, and eventually that
became a trap, too. Anything trying to talk to it also blocked. Given
that sendmail and a bunch of other things all called syslog(), this made
for a pretty messy situation.

Someone had pressed the scroll lock key on the keyboard while that tty
was active. Maybe it was me, or maybe it was someone fooling around,
or maybe it was just the cleaning crew "doing me a favor". It doesn't
really matter.

...

What inspired me to write about this today is stumbling over something
similar not too long ago. I have a couple of programs which run in
screen, mostly out of sheer laziness on my part. This is fine until you
accidentally hit ^S while attached to it. ^S is XOFF, or for those of
you who are lucky enough to not have to know this... is part of
software flow control.

Yep, ^S, 0x13, decimal 19, DC3, whatever you want to call it, is usually
interpreted as XOFF. It'll sit there and block writes until someone
sends it ^Q, 0x11, decimal 17, DC1, which is XON.
Really.

Anyway, let's say you're in screen, which by default uses ^A as a
command key. ^A N goes to the next screen, ^A D detaches, and so on.
If you're on a vaguely QWERTY-ish layout, you'll notice that A and S
are right next to each other.

It's not much of a stretch to imagine accidentally hitting that S
instead of A. At that point, all of the writes block until you either
unwedge it with ^Q or bail out of screen. If you didn't know what was
going on behind the scenes, it might just seem like "one of those things
that happens sometimes".

If you've ever wondered why some folks hit ^Q any time things seem to
have frozen in their terminal, this might be why.
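If you'd rather take ^S off the table entirely, you can clear the IXON bit in the terminal's input flags, which is what `stty -ixon` does. Here's a sketch of just the bit manipulation, starting from a made-up flags word instead of a real tcgetattr() call:

```python
import termios

# On a real tty you'd fetch the current settings with
# termios.tcgetattr(fd) and write them back with tcsetattr(); here we
# fake a plausible input-flags word so the bit math stands alone.
iflag = termios.IXON | termios.ICRNL   # pretend these are the current iflags

# The equivalent of `stty -ixon`: clear the software flow control bit
# so ^S stops freezing the terminal.
iflag_no_flow = iflag & ~termios.IXON

print(bool(iflag_no_flow & termios.IXON))   # False
```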

]]>Filter all ICMP and watch the world burntag:rachelbythebay.com,2015-05-15:pmtud2015-05-16T04:13:00Z
I helped a friend troubleshoot a problem the other day, and it turned
out to be something that first bit me almost 20 years ago. When I talk
about the cyclical nature of this industry, I really mean it!

The problem was simple enough: with a certain combination of wireless
providers and tunnels, certain web sites became inaccessible. We
started from the base report: https://something.or.other/ doesn't work.
Trying with 'openssl s_client -connect ...' showed the same thing, so it
wasn't the browser. Likewise with curl.

However, doing a curl to the http version of the site worked, but this
site just redirects on http requests.

ssh also did not work. It would connect, flip some data back and forth
(as shown with -v) and then would hang.

I had another machine with me that had a different setup and used it to
jump on the server, at which point I started sshd on another port in
debug mode and we kicked off another connection attempt. The same thing
happened, and sshd reported about the same thing ssh did: it connects,
it says some stuff, but gets to the key exchange and hangs.

At some point I decided to leave tcpdump running in the background and
then we did another ssh attempt, and somehow, this triggered a memory of
my problem from so long ago. Maybe it was the frames being transmitted
from the server to the client (and never arriving), but it all clicked
into place: "we need to lower the MTU".

I was still on the server and so I ran the 2015 equivalent command to
add a host route which has a purposely small MTU. It used "ip", not
"route", primarily because this particular connection was over IPv6. It
was something like this:

ip -6 route add what:ever/128 via what:ever dev eth0 mtu 960

I picked 960 arbitrarily. It was a good guess, since the next
connection attempt worked flawlessly.

Sure enough, when we tried to push some ICMP between the hosts, it
disappeared. This is all I needed to see to be sure: in doing this,
someone blackholed the very important packets which say "fragmentation
needed but DF set". Without them, how can you expect the poor sender to
figure it out?

Well, it turns out that in 2015, we have more options. While you wait
for the network providers to sort out their ICMP filtering situation,
you can enable net.ipv4.tcp_mtu_probing on your server and go on with
life. I won't explain how it works here, but in short, it will figure
things out even if the magic ICMP responses are missing.
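For the record, flipping that sysctl is just a one-line write under /proc. The hypothetical helper below defaults to a dry run, since the real write needs root and a Linux box:

```python
from pathlib import Path

# The knob lives at this path on Linux; writing it requires root, so
# this sketch only reports the value it would write unless told otherwise.
KNOB = Path("/proc/sys/net/ipv4/tcp_mtu_probing")

def enable_mtu_probing(knob=KNOB, dry_run=True):
    value = "1"   # 1 = probe when an ICMP black hole is suspected
    if not dry_run:
        knob.write_text(value + "\n")   # equivalent to `sysctl -w ...=1`
    return value

print(enable_mtu_probing())   # dry run: nothing actually touched
```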

It was gratifying to turn this on (after dropping my route hack) and
watch Linux bounce the MTU around until it found the one that worked.
It was a little like watching the
WOPR
crack nuclear launch codes.

Remember how I said this happened to me back in the '90s? It turns out
I
wrote about it
in July 2013. Back then, it was dialup SLIP into BSD/OS. Now it's IPv6
over crazy cellular stuff and crypto tunnels, but the problems remain
the same.

Filter ICMP at your peril, folks.

CPE 1704 TKS

]]>The dangers of resetting everything at oncetag:rachelbythebay.com,2015-05-02:lockstep2015-05-03T04:00:21Z
I want to describe a scenario I've seen too many times before, and I
know I will see again.

Let's say you have a bunch of systems running approximately the same
software. For the sake of this example, say they are all running Linux.
We'll also assume they have a bug which makes the machine lock up after
300 days of uptime. Nobody knows about this yet, but the bug exists.

(Incidentally, Linux boxes have had at least one bug where things went
stupid after a certain amount of time, like 208 days, but this isn't
about that. This also isn't about Win95 and 49.7 days of uptime. Stay
with me here. Focus.)

One day, there's a big announcement about the problem. It's all over
the news in big headlines: "Linux boxes die after exactly 300 days
uptime". People see it and react to it in different ways. One of those
reactions leads to the bad scenario.

There's one sort of person who will see this and will take it upon
themselves to reboot all of the machines right away. This way, by their
logic, they "just bought the company 300 days of trouble-free
operation". That's somewhat true: any host which was approaching 300 is
now reset to 0 and won't trip this bug any time soon, but that's the end
of the good news.

The bad news is that this person bought the company exactly 300
days of operation, and it's all synced up across the fleet. Whether
this becomes an actual problem depends on what happens next.

If all of the systems are then patched and have the root cause fixed,
then everything is fine. If the systems wind up getting put on some
kind of "scheduled reboot" list, then it's goofy, but it winds up okay.

No, the problem is when that's the only thing that happens, and
then nothing else changes after that. 300 days will elapse, and then
all of the hosts will come down at the same time. So much for
redundancy!

Even if most of the hosts are fixed before then, if even one group of
them is left out, then whatever service is run by that group will go
down.

So here's the trick: any time you see an announcement on date X of
something bad that happens after item Y has been up for more than Z
days, calculate what X + Z is and make a note in your calendar. That's
the first possible date you should see a cluster of events beginning.
It'll actually drag on for a few days past that point since not everyone
gets the news at the same time, and those that do get the news don't do
the "reboot the fleet" right away, either. It might take a bit.
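The X + Z arithmetic is trivial to automate. A quick sketch, using the reboot-the-fleet scenario from above (the helper name is mine, not anything standard):

```python
from datetime import date, timedelta

# X = the announcement date, Z = the uptime threshold in days; the
# first synchronized wave of failures can't land before X + Z.
def first_failure_date(announced, uptime_days):
    return announced + timedelta(days=uptime_days)

# Fleet rebooted on announcement day, 300-day uptime bug:
print(first_failure_date(date(2015, 5, 2), 300))   # 2016-02-26
```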

There's a variant on this, too. Let's say there's an existing system
which always gets rebooted or reset at a given interval. SSL
certificates are typically renewed in terms of a year or two. Well, a
little over a year ago, Heartbleed was the latest bit of online
buffoonery, and everyone who was affected had to do two things: patch
their OpenSSL code (or chuck it entirely...) and then re-issue their
certs. This reset a bunch of certificates to having April expiration
dates.

Sure enough, April just rolled around again, and a bunch of sites all
had certs expire and had outages stemming from that. It's interesting
to see that you can sometimes tell who heard about and acted on the
Heartbleed news based on the order in which they expired.

Now, not everyone was affected by this clustering effect. Some folks
probably managed to keep their old expiration date and just replaced the
cert. Others probably saw this coming and have taken steps to spread
out their expiration dates to get rid of the "April hotspot".

Still, I would expect to see ripples of this every April for quite some
time to come.

]]>Some features just aren't worth the troubletag:rachelbythebay.com,2015-04-25:tls2015-04-25T20:05:38Z
I've received some responses to
last week's post
about changing out the certs on this site. One says I should look at
OCSP stapling, HSTS and HPKP. This took some digging to unravel all of
the alphabet soup.

OCSP stapling basically would have me call out to my CA to fetch a small
update that says "as of right now, this cert is still current". Then
I'd hand that out in addition to the usual handshake stuff, instead of
making clients do their own revocation list lookups.

It sounds mildly interesting, but it's complete overkill for this site.

HSTS is something else where the server basically says "hey, you have to
hit me over https", so someone in the middle can't force people into
http mode. I actually don't want to force this, in the name of maximum
availability for people browsing from everywhere. Who knows what kind
of silly regimes are out there blocking http, right?

Sufficiently paranoid individuals who care about this can certainly do
the requisite client-side mangling and then fewer people will know
exactly what they are reading on here. Of course, they will still be
able to see that someone is coming here, so there's that. "Uh oh,
they're hitting rachelbythebay.com, so they must be reading snarky tech
writing. If only we knew which post..."

Finally, there's HPKP. This seems to be something which lets you pin a
given public key to thwart forged certs, or some such. This is another
one of those "gee whiz" type of technologies, but I just can't bring
myself to worry about it. If someone really wants to make you see
another version of this site than the one I published, assume it's going
to happen. That goes for basically everything else, too.

What's the point of getting IPv6 working and going to SHA2, then, you
might ask? Easy. IPv6 is finally to the point where it solves more
problems than it creates. Those of you who haven't even played with it,
it's time to get started. The number of parsing bugs alone you will
find in your software (hint: ":" isn't always the separator between an
IP address and a port number) will shock you.

As for SHA2, well, I don't want goofy errors on page loads.

...

Another asks how many 4096-bit signing operations my server can do. The
answer is: I don't have an answer. This is a brand new box, and
honestly, I never really stress my web server boxes. They all tend to
be largely static content, with a smattering of background processes
running to provide the occasional database lookup. That leaves a lot
of CPU power just sitting around.

Honestly though, I also am not really concerned about the performance of
this site. If the page loads in a reasonable amount of time and the
machine doesn't catch fire, I'm pretty happy. This isn't anything
important. People aren't doing any kind of critical communications over
it, after all. Nobody will die or be nervous if it loads slowly one
day.

That said, the old box held up to HN and Reddit and so forth
just fine, and the new box is much faster and bigger, mostly because
time has passed and the world has changed. Still, for the record, a ~2
GHz single processor "Celeron 440" (whatever that implies) with a 10
Mbps connection worked just fine for my needs.

Such is life for my personal servers: many years of stable operation
with not much in the way of demands from me.

My professional stuff, on the other hand, is another story.

]]>SHA2 certificate now onlinetag:rachelbythebay.com,2015-04-18:tls2015-04-25T20:06:49Z
While it wasn't actually time for a renewal of my rachelbythebay.com
certificate, it turns out that having a SHA1 signature was a recipe for
trouble. Various folks have decided their browsers will no longer trust
them in increasingly nasty ways, so it's time to move on.

The result: as of right now as this post goes live, you can hit this
site over IPv6, with TLS 1.2, and with a certificate that has a
SHA2 signature.

If you run a https web site, you'd better
figure this out
quickly. If you have https clients, particularly embedded hardware
(PS3, I'm looking at you), you'd better make sure they can handle these
new certs. Otherwise, it's going to be a real mess out there as the
world re-adjusts.

What's next in terms of sysadmin stuff this year? Leap second, you say?

]]>Server migration, IPv6, and bye bye to the beachtag:rachelbythebay.com,2015-04-17:v62015-04-18T21:51:54Z
rachelbythebay.com now speaks IPv6. Hopefully you didn't notice a
thing. If you're wondering how this came to pass, well, there's
definitely a story involved here.

Have you ever been with a service provider for a long time, and then
only realized you were still there because you had been there? That is,
n+1 only existed because n existed, and so on. Did you then re-assess
your situation and see if they still managed to provide what you wanted?

That's pretty much what just happened with me and ServerBeach. I've had
a machine with them since 2004, or basically back when I was working in
the web hosting business as a support monkey myself. It made sense
initially because they were part of the extended family, and I needed a
place to park my stuff.

A couple of years later, that first box started having disk problems and
so I replaced it. It was a simple matter to go to their web site to
find out what my options would be. You could click around and see what
sort of price points existed. I picked what I wanted, said "go" and a
new machine appeared on my account a few hours later.

That was 2008. Unfortunately, since then, they have lost their way. A
couple of months ago, I decided to spec out a new machine. It should
have been a simple task: they show me a bunch of options on a web
page, and I pick one just like before. This time, the
self-provisioning stuff was nowhere to be found.

I also needed information about whether they supported IPv6
connectivity. Their web pages didn't answer that, either. It's not a
good sign when something which is going to be pretty important rather
soon has been completely neglected by a service provider.

These two failings meant I had to resort to contacting support: the
folks doing the job I used to have so long ago. My request was simple
enough: do you have IPv6 in any datacenters, and what sort of details
apply to a basic server in one of them? I figured someone would give me
a simple answer. They did not, and instead forwarded it to my account
manager. Oh boy.

Did he mail me the details? Well, no, not at first. He called. It
went to voicemail. I did not want to get into a synchronous hard sell
situation, because, remember, I used to work for one of these places. I
just wanted the straight facts: what do you have, where do you have
it, and for how much?

I finally replied by mail and asked for the details in a reply. To
his credit, that reply included the info, but that's the end of the good
news.

The self-provisioning stuff is gone. No more popping your own servers.

They don't do IPv6. ANYWHERE. Yes, in 2015. Even Comcast
gets this much right! ServerBeach, 2015, IPv6, no. Negative. Zero.

It's not just the lack of IPv6 connectivity. Their DNS tool doesn't let
you set AAAA records, so even if you stand up a machine somewhere else,
you still can't point people at it. Instead, you have to swing your
whole zone around to some other place (which you'd do eventually anyway,
but this forces you to do it first) in order to start migrating.

As for the actual base-level hardware? He quoted something higher than
what I could find myself in the shiny sales pages on the web site. I
half-expected something like this, but this was just barbaric.

I went looking for alternatives, and I found one. They do IPv6
everywhere as far as I can tell, and they let you provision your own
stuff and do "what if" twiddling of hardware choices to see how much it
will cost. There's no wheeling and dealing. I went for it.

After that, the migration commenced in fits and starts: a few hours
here, a few hours there. I don't have a whole lot of time to spend on
this, but managed to glue it together this week, and the switch on
rachelbythebay.com itself was thrown yesterday.

DNS being what it is, I expect a bunch of wayward people to find their
way to the old machine for several more days. They'll encounter a
distinct lack of updates, and then eventually a bunch of RSTs when it
finally goes dark. Meanwhile, life goes on out here on the new machine.

There are more than a few old projects which have been retired as part
of this. Some of them, like the scanner, will come over eventually,
but others like fred are no more. Nobody should really notice that.

IPv6 is no longer magic. Suck it up and turn it on.

]]>The load-balanced capture effecttag:rachelbythebay.com,2015-02-16:capture2015-02-17T00:55:06Z
I'm going to describe a common frontend/backend split design. Try to
visualize it and see if it resembles anything you may have run, built,
or relied on over the years. Then see if you've encountered the
"gotcha" that I describe later on.

Let's say we have a service which has any number of clients, some small
number of load balancers, and a few dozen or a hundred servers. While
this could be a web site with a HTTP proxy frontend and a bunch of
Apache-ish backends, that's not the only thing this can apply to. Maybe
you've written a system which flings RPCs over the network using your
company's secret sauce. This applies here too.

Initially, you'll probably design a load balancing scheme where every
host gets fed the same amount of traffic. It might be a round-robin
thing, where backend #1 gets request #1, then backend #2 gets request
#2, and so on. Maybe you'll do "least recently used" for the same basic
effect. Eventually, you'll find out that requests are not created
equal, and some are more costly than others. Also, you'll realize that
the backend machines occasionally have other things going on, and will
be unevenly loaded for other reasons.

This will lead to a system where the load balancers or even the clients
can learn about the status of the backend machines. Maybe you export
the load average, the number of requests being serviced, the depth of
the queue, or anything of that sort. Then you can see who's actually
busy and who's idle, and bias your decisions accordingly. With this
running, traffic ebbs and flows and finds the right place to be. Great!

So now let's test it by injecting a fault. Maybe someone logs in as
root to one of your 100 backend machines and does something goofy like
"grep something-or-other /var/crashlogs/*", intending to only search the
stack traces, but unfortunately also hitting tens of GB of
core dumps. This makes the machine very busy with 100% disk
utilization, and it starts queuing requests instead of servicing them in
a timely fashion.

The load balancers will notice this and will steer traffic away from the
wayward machine, and onto its buddies. This is what you want. This
will probably work really well most of the time! But, like so many of
my posts, this isn't about when it works correctly. Oh no, this one is
about when it all goes wrong.

Now let's inject another fault: this time, one of the machines is going
to have a screw loose. Maybe it's a cosmic ray which flipped the wrong
bit, or maybe one of your developers is running a test build on a
production machine to "get some real traffic for testing".
Whatever. The point is, this one is not right in the head, and it
starts doing funny stuff.

When this broken machine receives a request, it immediately fails that
request. That is, it doesn't attempt to do any work, and instead just
throws back a HTTP 5xx error, or a RPC failure, or whatever applies in
that context. The request dies quickly and nastily.

For example, imagine a misconfigured web server which has no idea of
your virtual hosts, so it 404s everything instead of running your Python
or Perl or PHP site code. It finds a very quick exit instead of doing
actual work.

Do you see the problem yet? Keep reading... it gets worse.

Since the failed request has cleared out, the load balancers notice and
send another request. This request is also failed quickly, and is
cleared out. This opens the way for yet another request to be sent to
this bad machine, and it also fails.

Let's say 99 of your 100 machines are taking 750 msec to handle a
request (and actually do work), but this single "bad boy" is taking
merely 15 msec to grab it and kill it. Is it any surprise that it's
going to wind up getting the majority of incoming requests? Every time
the load balancers check their list of servers, they'll see this one
machine with nothing on the queue and a wonderfully low load value.

It's like this machine has some future alien technology which lets it
run 50 times faster than its buddies... but of course, it doesn't. It's
just punting on all of the work.

In this system, a single bad machine will capture an unreasonable
amount of traffic to your environment. A few requests will manage to
reach real machines and will still succeed, but the rest will fail.
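You can watch the capture effect happen in a toy simulation: a least-loaded balancer, 99 healthy backends at 750 msec, and one broken backend that "finishes" (fails) in 15 msec. All the numbers are the hypothetical ones from above:

```python
import heapq

# Backend 0 is broken and fails requests in 15 ms; the other 99 do real
# work in 750 ms. The balancer always hands the next request to
# whichever backend frees up first.
HEALTHY_MS, BROKEN_MS, REQUESTS = 750, 15, 10_000

backends = [(0, i) for i in range(100)]   # (time the backend is free, id)
heapq.heapify(backends)
hits = [0] * 100

for _ in range(REQUESTS):
    free_at, i = heapq.heappop(backends)
    hits[i] += 1
    cost = BROKEN_MS if i == 0 else HEALTHY_MS
    heapq.heappush(backends, (free_at + cost, i))

# The broken box grabs roughly a third of all traffic; each healthy
# backend sees well under 1%.
print(hits[0] / REQUESTS)
```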

How do you catch this? Well, it involves taking a few steps backward,
to see the forest for the trees.

Let's say you have 100 backend servers, and they tend to handle requests
in 750 msec. Some of them might be faster, and others might be slower,
but maybe 99% of requests will happen in a fairly tight band between 500
and 1000 msec. That's 250 msec of variance in either direction.

Given that, a 15 msec response is going to seem completely ridiculous,
isn't it? Any machine creating enough of them is worthy of scrutiny.

There's more, too. You can look at each machine in the pool and say
that any given request normally has a n% chance of failing. Maybe it's
really low, like a quarter of a percent: 0.25%. Again, maybe some
machines are better and others are worse, but they'll probably cluster
around that same value.

With that in mind, a machine which fails 100% of requests is definitely
broken in the head. Even if it's just 1%, that's still completely
crazy compared to the quarter-percent seen everywhere else.

Believe it or not, at some point you may have to break out that
statistics book from school and figure out a mean, a median, and a
standard deviation, and then start looking for outliers. Those which
seem to be off in the weeds should be quarantined and then analyzed.
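With the statistics module, the outlier hunt is only a few lines. The per-backend failure rates below are invented: a cluster around the quarter-percent from above, plus one machine failing everything:

```python
import statistics

# Invented per-backend failure rates (%): most hover near 0.25, one
# machine is failing every request it touches.
rates = [0.25, 0.30, 0.20, 0.27, 0.22, 0.26, 0.24, 100.0]

mean = statistics.mean(rates)
stdev = statistics.stdev(rates)

# Flag anything more than two standard deviations from the mean.
outliers = [r for r in rates if abs(r - mean) > 2 * stdev]

print(outliers)   # [100.0]
```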

This is another one of those cases where a reasonable response ("let's
track backend load and route requests accordingly") to a reasonable
problem ("requests aren't equal and servers don't all behave
identically") can lead to unwanted consequences ("one bad machine
pulls in far too many requests like a giant sucking vortex").

Think back to systems you may have used or even built. Do they have
one of these monsters lurking within? Can you prove it with data?

Pleasant dreams, sysadmins.

]]>Kill init by touching a bunch of filestag:rachelbythebay.com,2014-11-24:touch2014-11-24T08:31:37Z
init is a pretty big deal on a Linux box. If you manage to kill it, the
machine panics. Everything stops, and if you're lucky, it reboots by
itself a few seconds or minutes later. Naturally, you'd like it to be
stable and robust so that your machine doesn't go down.

What if I told you I found a way to kill the "Upstart" init in RHEL and
CentOS 6 with just a bunch of "touch" commands? Yep, it's true. You
can even reproduce it in qemu. In fact, I had to do it in there in
order to get these screenshots.

Step 1: Create a new directory under /etc/init. I called mine
"kill.the.box".

Step 2: Fill that path with a few hundred or thousand files. I used
'touch $(seq 1 1000)' but do it however you like.

Step 3: Kick off a bunch of changes to those files in parallel. As you
can see here, I started 10 instances of 'touch', each one receiving 7
copies of the list of files thanks to the shell expanding all of those
asterisks.

Step 4: Wait. It takes a little while to happen on the emulated
machine, at least. Bare metal might be more forthcoming. You will be
rewarded with the following:

Step 5: There is no step 5 unless you're curious. I rigged up syslog in
the emulated machine to fling data outward so I could see it with
tcpdump on my actual workstation machine. With that, you can finally
see why the box dies.

Here's the text transcribed for the sake of those web indexing sites...

Actually understanding this takes a bit of digging. First, you learn
that init on RHEL and CentOS 6 is something called Upstart, and that it
uses something called "libnih" to do a lot of work. One of the
utilities provided by that library is the ability to watch files and
directories for changes.

In this case, init uses libnih's "watch" feature to keep tabs on its
configuration files, so when they change it can know about it. If you
poke around in the source, you can find that it actually registers
watches for the entire directory of its config file for various reasons,
so it winds up following /etc (due to /etc/init.conf) and /etc/init (its
"job dir").

This takes us down into libnih. It opens an inotify socket and tells
the kernel what the caller (Upstart's init) wants to see. Then it waits
around to receive updates from the kernel on that socket. It takes the
raw data received on the socket, aligns a "struct inotify_event" over
that buffer, and reads out the result.

One of the fields in that buffer is "wd" -- the watch descriptor. This
is an opaque number you get from the kernel when you set up a watch.
That is, inotify_add_watch() might return 4, and then later, the struct
you get back will have 4 in the "wd" field. Easy enough.

libnih's consumer of this data then has the actual line of code which
causes things to die. It asserts that "wd" is greater than or equal to
0. Normally, that would be true, but here it obviously isn't. The
assert fires a SIGABRT, init goes down, and the system goes with it.

What's going on from here requires leaving userspace behind and going
into the kernel code. inotify itself has the concept of a "notification
overflow". When you first create an inotify instance, it creates an
event that already has all of the flags set up to say "hey, something
went wrong". This way, the memory is already allocated and is ready to
roll should it need it later.

If for some reason you manage to fill up the event queue and it needs to
write something more, it will then send you that prepopulated event.
It's not like a normal event: its "mask" is FS_Q_OVERFLOW. You're
supposed to notice that and presumably treat it differently.
Unfortunately, libnih barrels on ahead and I get fodder for another
post.
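
A consumer that checks the mask before trusting wd never hits this.
Here's a sketch of that check -- my own illustration, not libnih's code:

```c
#include <string.h>
#include <sys/inotify.h>

/* Interpret one raw event from an inotify read() buffer. Returns 1 for
   a queue overflow event, 0 for a normal one. On overflow, wd is -1 by
   design, so it must not be treated as a watch descriptor. */
int event_is_overflow(const char *buf)
{
    struct inotify_event ev;

    memcpy(&ev, buf, sizeof(ev));  /* copy out; the buffer may be unaligned */
    return (ev.mask & IN_Q_OVERFLOW) != 0;
}
```

With something like this in the loop, the overflow case can be logged
(or trigger a full rescan of the watched directories) instead of taking
init down.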

So where does the negative value for "wd" (the one tripping the assert)
come from? Just eyeballing the kernel code for "copy_event_to_user"
shows me a likely source:

In a nutshell, when there's private data, get the wd value from it,
otherwise set it to -1. We can be fairly certain there's no private
data for this pseudo-event because of a nice comment back in
fsnotify_add_notify_event where it actually detects the overflow and
goes into fallback mode:

/* sorry, no private data on the overflow event */
priv = NULL;

(Note: I've only eyeballed the kernel part. It might be off a bit.)

It's starting to make sense. You flood the filesystem with changes,
such that inotify has a lot to say. You do it faster than they can be
consumed, so the queue fills up. Then it fires off a notification about
the overflow which has no private data by design, and that creates a
message to userspace with wd set to -1. libnih insists that wd be
non-negative (>= 0), so the assert fires, the program dies, and the box
goes with it.

Given all of this, the right approach would seem to be a patch to libnih
so it notices the warning given by the kernel and does something
different. Or, maybe, some kind of option to let the ultimate consumer
(init, in this case) get a message without being brutally slain by the
assert.

I should point out that this is not theoretical. I went through all of
the above because some real machines hit this for some reason. I don't
have access to them, so I had to work backwards from just the message
logged by init. Then I worked forwards with a successful reproduction
case to get to this point. I have no idea what the original machines
are doing to make this fire, but it's probably something bizarre like
spamming /etc with whatever kinds of behavior will generate those
inotify events libnih asked to see.

Let me say this again: this happens in the wild. The "touch * * *..."
repro looks contrived, because, well, that's the whole point of a repro.

If you're wondering how big the inotify queue is, well, it's probably
16384 on your machine. Run "sysctl fs.inotify.max_queued_events" to see
exactly what it is. If you're thinking you can raise that to avoid the
possibility of hitting the bug, you're probably right, but not
necessarily. It only gets used when the inotify instance is set up.

group = inotify_new_group(inotify_max_queued_events);

So, in all likelihood, init will be running with the old value if you
raise it later. You can try to get init to re-exec itself and hopefully
establish new watches, or try to get that value raised via the kernel
command line so it's already set to a higher value when init shows up.
Or you could edit fs/inotify/inotify_user.c. Whatever.

Of course, it would be far better for libnih to be patched to handle
this and for a new release to be pushed to RHEL, CentOS and whatever
else, but what are you going to do?

Reading /proc/pid/cmdline can hang forever (2014-10-27)
Back in August, I wrote that
fork() can fail
and it made a pretty big splash. Continuing with that general theme,
I'll tell you about something else that can fail that you probably won't
expect. It has a failure mode that you're probably not equipped to
handle.

Here's the short version: ps, pgrep, top, and all of those things can
hang. I'm talking state D, uninterruptible wait, just like when you
"cat /something/on/a/nfs/mount" with the server down. Not even ^C will
get you out of this one. It can't be killed.

Do your system maintenance scripts and one-off tools expect things like
ps and pgrep to hang? What about pkill? How about stuff that randomly
grovels around in /proc looking for bits of data? I bet they don't.

Now that I've told you what happens, let's work backwards into why.

Linux systems have this wonderful pseudo filesystem called /proc. You
can get all sorts of neat data about what's running. /proc/pid/cmdline
will tell you more or less what the argv looks like for a given process.
/proc/pid/exe is probably a link to the actual executable that's running
(even if it's been deleted). You get the idea.
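
As a quick illustration (the function name is mine), cmdline is just
the argv strings joined with NUL bytes, so counting the NULs counts the
arguments:

```c
#include <stdio.h>

/* Read our own /proc/self/cmdline and count the NUL-separated argv
   entries. Returns -1 if /proc isn't available. */
int count_cmdline_args(void)
{
    char buf[4096];
    size_t n, i;
    int args = 0;
    FILE *f = fopen("/proc/self/cmdline", "rb");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf), f);
    fclose(f);
    for (i = 0; i < n; i++)
        if (buf[i] == '\0')
            args++;
    return args;
}
```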

Recent Linux systems also have these things called cgroups. You might
call them containers. You can use them to enforce limits on certain
resources that are smaller than the whole machine provides. A program
might be limited to "only" 2 GB of memory instead of the whole 4 GB
machine, for instance. You might put caps on how much CPU time it gets,
which cores it runs on, or how much disk bandwidth it can consume.

The most common use of cgroups I've seen so far is the "memcg" for
memory limits. Someone will come along and create a container, set a
limit on it, then plop some processes into it. If they grow too big,
the kernel's OOM killer will fire inside the container, something will
die, and life goes on.

Assuming you don't run afoul of any kernel bugs in the oomkiller itself,
you're fine. No, the problems start when you disable the kernel's OOM
killer and try to act on your own.

Some process managers have decided they would rather be the ones who
keep tabs on memory size and do the killing themselves. To do this,
they run a task in a container, set a limit, and disable the kernel's
oomkiller behavior in that container. Then they wait for notifications
of it getting too big and kill it themselves.

This is all well and good, but have you ever seen what happens when the
process manager fails? Life gets ... interesting.

Now you have a container that's at its limit, and the kernel is saying
"yup, you've gone and done it now", but it's not enforcing those limits
since you told it to. All it will do now is stop accesses to that
memory space until you do something about it.

What this means in practice is that calls which reference memory inside
the container will hang forever. Remember /proc/pid/cmdline and
/proc/pid/exe? Yeah, bad news. Those are encumbered by this and
attempting to read them will get you stuck. Kiss all of those psutils
goodbye. You get to troubleshoot this the hard way. Certain things
under /proc/pid will work, but others will not. You'll get used to
doing "cat foo &" just to avoid losing yet another login shell.

The most direct method for dealing with this is to scan through your
memory cgroups (probably under /sys/fs/cgroup/memory, but your mileage
may vary) and see which of them are reporting being in an OOM condition
(look for "under_oom 1") and yet have the OOM killer disabled. To
verify this is your problem, read the "tasks" pseudo-file in there, get
a PID of something in the container, then try to access
/proc/that_pid/cmdline. If it hangs, that's at least part of your
problem.
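
That check is easy to automate. Here's a sketch (my own helper, not a
standard tool) that decides from the text of a cgroup's
memory.oom_control file whether it's in the stuck state:

```c
#include <string.h>

/* Given the contents of a memory cgroup's memory.oom_control file,
   return 1 if the OOM killer is disabled AND the group is under OOM --
   the combination that leaves readers of /proc/<pid>/cmdline hung. */
int cgroup_is_stuck(const char *oom_control_text)
{
    int disabled = strstr(oom_control_text, "oom_kill_disable 1") != NULL;
    int under    = strstr(oom_control_text, "under_oom 1") != NULL;

    return disabled && under;
}
```

Feed it the contents of each group's memory.oom_control file and you
have your scanner.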

Now you know about the bad cgroup, what can you do? Well, you can raise
the limit, and it'll get going again, at least until it grows again and
hits the new limit. You can also switch the kernel's enforcer back on
and let it lay waste to whatever it wants to kill. I guess you could
also reboot the machine, but that's just goofy. It's up to you.

Obviously, you will want to work backwards and find the problems in your
stack that lead both to the uncontrolled memory growth and the lack of
handling by your process manager. Otherwise, you'll be right back here
again soon.

So really, after this is all said and done and the fire is out, you have
gained a new little nugget of data. When a machine starts acting
strangely, the load average is climbing without bound, commands like "ps
aux" are getting stuck at the same point every time, and you're using
memory cgroups, you might just be in this situation.

Now you know how to spot it, and what to do about it.

Go make things better.

fork() can fail: this is important (2014-08-19)
Ah, fork(). The way processes make more processes. Well, one of them,
anyway. It seems I have another story to tell about it.

It can fail. Got that? Are you taking this seriously? You should.
fork can fail. Just like malloc, it can fail. Neither of them fail
often, but when they do, you can't just ignore it. You have to
do something intelligent about it.

People seem to know that fork will return 0 if you're the child and some
positive number if you're the parent -- that number is the child's pid.
They sock this number away and then use it later.

Guess what happens when you don't test for failure? Yep, that's right,
you probably treat "-1" (fork's error result) as a pid.

That's the beginning of the pain. The true pain comes later when it's
time to send a signal. Maybe you want to shut down a child process.

Do you kill(pid, signal)? Maybe you do kill(pid, 9).

Do you know what happens when pid is -1? You really should. It's
Important. Yes, with a capital I.

...

...

...

Here, I'll paste from the kill(2) man page on my Linux box.

If pid equals -1, then sig is sent to every process for which the
calling process has permission to send signals, except for process 1
(init), ...

See that? Killing "pid -1" is equivalent to massacring every
other process you are permitted to signal. If you're root, that's
probably everything. You live and init lives, but that's it.
Everything else is gone gone gone.

Do you have code which manages processes? Have you ever found a machine
totally dead except for the text console getty/login (which are
respawned by init, naturally) and the process manager? Did you blame
the oomkiller in the kernel?

It might not be the guilty party here. Go see if you killed -1.
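
The inoculation is a couple of lines. A sketch of the guard, with a
wrapper name of my own invention:

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Refuse pids that fan out: -1 signals everything we're allowed to
   signal, 0 signals our own process group, and other negative values
   hit whole groups. Only a pid actually returned by a successful
   fork() belongs here. */
int kill_child(pid_t pid, int sig)
{
    if (pid <= 0)
        return -1;  /* most likely someone stored fork()'s error result */
    return kill(pid, sig);
}
```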

Unix: just enough potholes and bear traps to keep an entire valley going.

Multithreaded forking and environment access locks (2014-08-16)
Back in 2011, I
wrote
that you shouldn't mix forks and threads. That particular story was
about dealing with Python, but don't think that you're immune just
because you use C++ or even "nice simple C". You aren't. If you use
certain parts of your C library after "fork" but before "exec", you run
the risk of getting stuck forever.

This dumb little program does just enough to demonstrate my case. It
starts up a background worker thread that calls setenv() every 100
microseconds. It's purposely doing this a lot to make sure you can see
the problem right away.

The main thread of this program continues on and forks off children
which then attempt to call unsetenv. It also sets up an alarm handler
which will print an "a" after 1 second. In other words, if 1 second
elapsed between that alarm() call and the exit(), it should print "a".
Every time a child gets stuck, you get another "a".

It's trying to get a lock in glibc within unsetenv(). This never
succeeds, since we're in the child and no thread exists on this side of
things to release that lock. We copied the lock in the "set" state, and
there it will stay forever.

All you have to do to trigger this is make a copy of your process (with
fork) while setenv or unsetenv is running in another thread. If you
then try to use one of those functions in your child process, it will
hang.
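
The robust pattern, sketched below with names of my own, is to never
call setenv or unsetenv in the child at all: build the environment
array yourself and hand it straight to execve, which doesn't touch
glibc's environment lock.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a command with an explicit environment. Between fork and exec
   the child only makes async-signal-safe calls, so no lock copied from
   the parent can wedge it. Returns the wait status, or -1 on error. */
int run_with_env(const char *path, char *const argv[], char *const envp[])
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        execve(path, argv, envp);
        _exit(127);  /* exec failed; plain exit() is not safe here */
    }
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return status;
}
```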

You can actually get away with this for a very long time if you get
lucky. But, sooner or later, it will happen. If you see your process
blocked in "__lll_lock_wait_private" with unsetenv or setenv listed
earlier in the stack, you've probably done this, and it just decided now
was the right time to pop up and make trouble for you.

Play with the values for those usleep calls to see what I mean. Lower
numbers mean less time spent sleeping and more time spent with a lock
potentially held. Lower them both to 1 and you'll have a monster on
your hands. Try it and see.

Compromised customer databases (2014-08-10)
Long ago, I used to buy tech stuff for work from a company called CDW.
I also set up a "tagged" e-mail address which was given only to them.
Nobody else had that particular variant. I also never sent mail from
it, since it was only used for inbound order confirmations and the
usual one-way gunk associated with buying online.

This went fine for a long time. It was a clean address which only got
mail from the one vendor. Then, one day about 10 years ago, it changed.
I got mail from "MNJ Technologies Direct". This to me had "data leak"
written all over it. I looked them up when this happened, and it turned
out both companies were based in Illinois and were only about five miles
apart.

At the time I figured it was either two branches of the same bigger
(hidden) company, or some rogue employee had sold them out. This same
pattern had just happened with other well-known companies, so it
wouldn't have been much of a surprise.

I reported it to CDW and (unsurprisingly) never heard back.

Well, I have a long memory for these things, and this whole story
randomly resurfaced today. A quick search of the web turned up a great
bit of data and it all makes sense. The web of 2014 has sites where
people review their employers. One of these reviews... is for MNJ Tech.
What does it say? This:

Almost all of the sales people come from other VARS. 90% are from CDW.
Upper management is from CDW.

It all makes sense now. Some sales type obviously jumped ship and took
some or all of the customer database with them. Go watch Mad Men
reruns: it happened then and it happens now.

I like closure to mysteries even if many years elapse in the middle.

What can be learned from this? First, databases leak. Constantly.
If you use a tagged e-mail address, you will eventually have this
happen. If you don't have one, well, it's probably happened but you
can't tell.

If you have a list of customers and any employees, you have a non-zero
chance of a data leak, whether from inside or outside. You might not
know this has happened until one of your fake customer accounts is
contacted. You do have fake customer accounts which act as
canaries, right?

Project 25 phase 2 (TDMA) decoding in software works (2014-08-04)
Over a year ago, I started worrying about what would become of my
software-based scanner site after the county built their new radio
system. Whereas my existing work from 2011 worked on a Motorola
Smartnet system with regular analog FM audio channels, this new system
would be P25 with TDMA channels ("phase II") -- two talk paths per
frequency.

We've all known this was coming, and so back in June 2013 I put out an
offer
to supply a raw recording of a new-style system to anyone who wanted to
hack on it. I didn't get many queries, but fortunately out of the few
people I did manage to reach, I found the right ones: the folks behind
the
OP25 project.

Yep, that's right, they managed to do it. There is TDMA decoding
support in the OP25 tree as of earlier this year. You have to glue
together the pieces yourself, but that's relatively easy. The hard work
of figuring out all of the TIA-102 standards documents and turning them
into code has been done already.

In terms of practical applications, this means it's possible to take a
wideband stream of a whole trunked system, filter out the control
channel, and send it through OP25 to get P25 messages out. You still
have to figure out what all of those opcodes and bitfields mean, but
that's a largely solved problem. Tools like UniTrunker have been doing
this for years. Plus, the TIA-102 documents explain all of it if you
can afford to get a copy... or just find it online at some sketchy PDF
viewer site overseas.

With the control channel messages decoded, it becomes possible to look
for "voice grant" and "voice grant update" messages which tell you who's
talking (the radio ID), where they are talking (channel number) and who
they are talking to (talkgroup). It's around this time you realize that
the channel number itself does not directly tell you where to tune your
receiver. For that, you have to capture a few other messages which tell
you how the system lays out its bands.

Once you know that band 1 is from X to Y MHz and is using 2 slot TDMA
with channels that are Z kHz wide, then when you see a call arrive for
"band 1 chan 1234", you can do the math and arrive at some actual usable
frequency. It'll also tell you what slot number you should be decoding
from the data stream.
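
To make that concrete with entirely made-up numbers (this is not the
county's real band plan): say the band messages give a base frequency,
a channel spacing, and 2-slot TDMA, with channel numbers counting slots
so two consecutive numbers share a carrier. The math is just:

```c
/* Hypothetical band plan: base 851.00625 MHz, 12.5 kHz spacing, 2-slot
   TDMA. chan/2 picks the carrier and chan%2 picks the timeslot. Real
   values come from the system's band identifier messages. */
static const double BASE_HZ    = 851006250.0;
static const double SPACING_HZ = 12500.0;

double chan_to_hz(int chan)
{
    return BASE_HZ + (chan / 2) * SPACING_HZ;
}

int chan_to_slot(int chan)
{
    return chan % 2;
}
```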

This is the point anyone could have gotten to before with the
previously-available tools: knowing a TDMA transmission was happening,
and even knowing exactly where it was, but the actual voice was
unattainable. OP25 developments of the past year have changed that.
It's now possible to set up a receiver on that channel (as determined
above) and do the whole demod/slicing/framing pipeline into their new
phase II handler, and it will yield 8 kHz audio, ready for storage in a
WAV file or whatever else you might like.

I can say this with some certainty since as of last weekend, this now
exists. There is a prototype system running here which will decode
control channel messages and kick off TDMA listeners to grab audio.
When the call is done, I wind up with a nice MP3 file.

Want proof? Okay, here's a call recorded on Saturday night using a $20
RTLSDR stick, a Mac mini running GNU Radio, OP25, and my own software:

As with the existing system, parallel calls are no problem, even if
they are two timeslots from the same channel. It's currently handling
this rather inefficiently by duplicating the tuning paths, but
optimization is not worth the trouble at this point. Eventually it will
get smarter.

If OP25 had this support months ago, why did this not happen for me
until Saturday? That's easy. It took a "real event" on this new radio
system to generate enough traffic to make it worth working on this
problem. Before that point, it was thousands (not an exaggeration) of
test calls over and over from different radio techs calling each other
from different parts of the valley. For a little while we did have some
Public Works stuff going on, but that's completely uninteresting and it
stopped anyway.

It took an actual stadium event using 3 talkgroups on the new system at
the same time to make it worth the trouble. While there are now
scanners being sold which can receive this new traffic, no scanner can
receive more than one call at a time, never mind three. That kind of
parallelism is the real reason to bother with this kind of software
setup.

What happens now? Well, there still is no traffic on the new system
from day to day. As I write this on a Monday morning, there hasn't been
anything significant to report for over 24 hours. (I can tell by
looking at the recordings from the new program!) It won't start getting
really interesting until local police and fire agencies switch over.

At that point, it'll make sense to work on it some more. Until then, I
have bigger fish to fry.

1 second per second is harder than it sounds (2014-06-14)
If you've never had problems keeping your clock synced, you just haven't
run enough machines yet. Once you start scaling beyond a certain point,
it will become obvious that even elementary timekeeping is no small
feat, even with help from tools like ntpd.

First of all, to be clear, I'm not talking about the hardware clock.
That really only matters when you read from it, and you generally do
that once: at boot. Instead, I'm talking about the system clock -- the
thing which gives values to programs which call time() or
gettimeofday() or clock_gettime() or any of the other variants. It's
maintained by the kernel, assuming Linux here.

What sort of situations can you have with system clock timekeeping?
Here are a few I have encountered.

S0: your clock consistently ticks at 1 second per second, and is synced
to the correct time, as best you can tell.

S1: your clock consistently ticks at 1 second per second, but is a
little offset from the correct time: it's "running fast" or "running
slow". It might say 13:01:02 when the actual time is 13:01:05.

S2: your clock ticks at 1 second per second, but is wildly offset from
correct time -- over 1000 seconds, or much much more. It might be set
to tomorrow, next week, next year, or the year 1970.

S3: your clock's pace wanders slightly -- a hair fast one moment, a
hair slow the next -- but stays within the small range ntpd can
correct.

S4: your clock consistently ticks noticeably faster or slower than 1
second per second, beyond what ntpd is willing to correct.

S5: your clock ticks fast, slow, normally, and everything else
in-between, and changes constantly. Sometimes it gets more than 1
second per second, other times it gets less. Even then, the degree by
which it runs too quickly or too slowly changes unpredictably.

You will eventually see all of these if you have to tend a big enough
fleet. What can you do about them? That all depends on what tools you
have available.

S0 is the ideal state. If, somehow, you have a clock which stays here
by itself, you have an anomaly -- nothing is truly perfect when it
comes to timekeeping. Still, ntpd will be quite happy to keep it there.

S1 is more likely. ntpd will adjust (slew) the system clock to make it
tick slightly more quickly or slowly to "burn off" the offset. These
are tiny little adjustments, on the order of parts per million. If it
works, your clock will try to approach S0 but will probably be more like
S3.

S2 happens far too often. If you start up ntpd and the system clock is
out in the middle of nowhere, it'll refuse to fix things. By default,
at startup, it'll step the clock up to 1000 seconds to fix something
that's wildly wrong. It's a bit like picking up the needle on a record
player and dropping it back down: you get a harsh result, but you'll
probably be closer to where you want to be.

It's easy to make S2 happen by mistake. Just set your hardware clock to
something insane, then reboot and start ntpd by itself without doing
anything else before that. It'll see the offset and will give up.
Messing up the sense of whether you use local time or UTC in your
hardware clock is a great way to start this chain of events in motion.

The usual workaround is to make ntpdate run at startup just after
reading the hardware clock and just before starting ntpd. ntpdate will
adjust the clock from anything to anything else. Of course, if it
fails, your startup will continue, and ntpd will start up, and it'll
bomb. If you're reading this at a point when ntpdate has actually been
retired, then just think "ntpd -q" instead.

If you live in this world, ntpd's "-g" will let it step more than 1000
seconds for that one-time adjustment. Use it at your own risk.

S3 is probably where you wind up on most systems. Your clock's pace
will always vary a tiny little bit for different reasons (manufacturing
variance, temperature changes, cosmic rays, the phase of the moon, and
whatever you want), but ntpd will correct for it. It'll fix an error
of up to 500 parts per million in either direction.

S4 is when we start getting into the realm of things which are
increasingly difficult to fix with just a *clickity click* at the
keyboard. This is when you have a system which has a substantial offset
in the number of microseconds it gets per kernel "tick". The usual
value I see on healthy machines is 10000. Run "adjtimex -p" to see what
yours is using right now.

When I say unhealthy, I mean machines which are dozens or hundreds of
ticks offset from 10000. They might be 9800 or even less, so every time
the rest of the world gets 10000 ticks, it only gets 9800. That's a
clock which is running slowly.

ntpd will not correct for this. Clock adjustment on Linux comes in two
flavors: macro, in the form of the "tick" setting, and micro, in the
form of the "freq" setting. ntpd will only adjust up to 500 ppm, and on
a machine with USER_HZ=100, that's a "freq" value of +/- 32768000 (65536
* 500). Outside of that (about 5 ticks either way), it won't touch it.
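
To put numbers on that (the helper names are mine): with USER_HZ=100,
the kernel adds "tick" microseconds 100 times a second, so each count
of tick offset is worth 100 ppm.

```c
/* With USER_HZ=100 the clock advances tick*100 microseconds per real
   second; a healthy tick of 10000 yields exactly 1000000. Since
   1 ppm = 1 us per second, the difference is the drift in ppm. */
long tick_ppm_error(long tick)
{
    const long user_hz = 100;

    return tick * user_hz - 1000000;
}

/* ntpd's freq word is scaled by 2^16 and capped at 500 ppm. */
long ntpd_freq_limit(void)
{
    return 500L * 65536;  /* 32768000 */
}
```

A tick of 9800 comes out to -20000 ppm: forty times what ntpd will
chase.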

ntpd will probably interpret this as increasing jitter from its time
sources, even though it's the local clock which has the problem. If you
have one of these "time fixer" scripts which looks for insanity from
ntpd and restarts it any time it goes out of sync, you will keep getting
the "one time step at startup" behavior from ntpd, and your clock will
just keep getting dragged along by this fixit script. That's a lot of
needle-dropping onto the record which is your clock.

Now, let's say you're stuck with a box like this. You can use adjtimex
yourself to set the tick value to about what it should be. As long as
it's then within 500 ppm of what it's actually doing, ntpd will
eventually figure it out, and it will start declaring itself to be in
sync! This is a pretty dirty hack.

It's not entirely clear what the impact of doing this to your system
might be. I wouldn't be surprised if programs got differing numbers of
timeslices or something equally weird in such an environment.

S5 is the bottom of this pile of crazy: a clock which is so broken that
it can't even decide how broken it wants to be. Sometimes it's really
fast, and other times it's only slightly fast. Or maybe sometimes it's
really slow, but other times it's only slightly slow. In other words,
not only does it have a ridiculous variance, but even that variance
varies. It jitters.

ntpd can't handle this. You also won't be able to do the S4 workaround
of setting your own tick value once and letting ntpd work it out, since
the resulting difference will still be too large. If this is the state
of your system, I'd give up. Swap the hardware and try again on another
motherboard. It's just not worth the trouble.

One thing I've deliberately omitted here is any discussion of the actual
clock source used by the kernel. Whether you're talking TSC, HPET or
something else, that also can change how your system behaves (and which
situations you get into), but I'll have to cover that on another
occasion.

Buckets of energy and evil preprocessor tricks (2014-05-22)
I think about projects as requiring energy. There's this virtual
bucket, if you will, and it's what holds onto that energy. Imagine it
as a liquid and now you can get a better idea of how it might work.
Every day, you wake up and there's that much more energy in your bucket.
You can go off and "spend" this on tasks like building things, fixing
things, teaching people, networking, and yes, writing. You can also
invest it in your family, hobbies, games and things of that nature.

Once that bucket gets low, it's harder to do some of those things. It's
usually not a big deal because there's always tomorrow, at which point
things will have refilled in time to get going on more things. But
really, it's a matter of whatever gets first crack at the bucket of
energy. Whatever gets there first wins. There's a time-based element
to this (early birds and all that), but also a matter of personal
priorities.

So what have I been up to? Well, all of the above. There are quite a
few more things calling for energy these days, and something had to
give. Unfortunately for my readers, it turned out to be what had been
daily posts for two years. There just isn't enough to go around to do
this sort of thing well since I'm already giving it to other parts of my
life. Sorry.

I should note that I'm basically out of rants about the inside of That
Place. I got that stuff out of my system quite some time back and have
moved on. Writing about it was still the right thing to do, and I
stand by all of those old posts, but there's just no reason to go back
to that well. It can rest for now.

...

Here's a quick technical story about something I found and fixed not too
long ago. There's this project which relies on a bunch of
externally-sourced libraries, just like lots of other things you find on
Linux boxes these days. One of these libraries had diverged quite a
bit from upstream, so one of the developers decided to do a fresh
import to pull it in. This went out in the next release.

Not too long after that, certain functions in this program started
misbehaving. It was relatively low-level most of the time, but it
definitely had not been there before. Debug logging was added in an
attempt to figure out what was going on.

Finally, one night while working on something else, I came across one of
these things running and running and running instead of shutting down.
It seems it was stuck on a particular inbound request and couldn't
finish it. I used some profiling stuff and found out it was spinning in
strcmp(), burning CPU like it was going out of style. Wait, what?

The strcmp in question turned out to be part of a check that was run on
every item in a linked list. For whatever reason, the linked list
never reached the end. Instead, it looped back onto itself at some point, and
that particular thread would stay stuck in here. Minutes. Hours.
Days, even, if the parent process lived that long.

Then I found another, and another, and another. Some spelunking turned
up what the request was supposed to do, and it was associated
with the aforementioned external library. So we're getting into an
infinite loop in code that's closely associated with something that
changed recently. Wonderful. Now what?

Somehow, in looking at all of this, I noticed this code had a bunch of
weird C preprocessor gunk going on which set up mutexes and other guards
but only if you purposely enabled "thread safety mode". By default, it
did no such thing and it was up to you to only have one call outstanding
at any given time.

In playing with it some more I found the actual problem: the #ifdefs had
changed. The old version was effectively enabling thread safety with
"#ifdef FOO", and the new version was effectively gating it on "#ifdef
BAR". Since we never defined BAR, we got the unsafe version, and it was
only a matter of time before it caught up with us.

I added a #define to enable BAR, but that didn't do it. The code would
not just switch on the safety stuff with that. Oh no. Instead, it
actually did a bunch of #defines to further abstract away the actions,
like this:

At first glance, this looks simple enough. In local_portability.h, I'd
just override this stuff with some pthread stuff. I'd make sure that
MUTEX_TYPE was a pthread_mutex_t, and that would be fine.

#define MUTEX_TYPE pthread_mutex_t

That was all well and good, but then it got ugly. I needed to call
pthread_mutex_init() on that thing, but that second #define throws away
the argument! pthread_mutex_init wants to get the mutex as an
argument. It doesn't return the value to store, like their
"setup_a_mutex()" obviously does.

So now what? I resorted to great evil. It turned out they only ever
used a single variable name (mu_) with this call, so I just hard-coded
it into my next compatibility hack.

#define setup_a_mutex(x) mu_; \
pthread_mutex_init(&mu_, 0);

See that? I don't use "x" anywhere in the macro. Instead, I cheat and
directly reference that thing which was thrown away in the prior macro.
See that one "mu_;" hanging out by itself? That's there because I had
to do something to make the setup_mutex macro compile. Recall
that it takes "x" and turns it into "x = setup_a_mutex();". I have to
give it something for the right-hand side of that =, so I give it itself.
The final code winds up being something like this:

mu_ = mu_;
pthread_mutex_init(&mu_, 0);

Yep, I assign mu_ to itself. Fortunately, this is just C code, and a
self-assignment like that does exactly nothing.

Why didn't I just fix helpers.h to do everything the right way without
these silly intermediate steps? Oh, that's the best part. That is
part of the library, and it would get overwritten the next time they do
a "pull" from upstream. The developers didn't want to "own" the
maintenance of that particular fork, so the only way to inject anything
safely was in local_portability.h.

All of this actually works, and it's in production right now. Those
requests run properly yet again. I documented all of this insanity in
the code for the next poor sucker who comes along and tries to make
sense of it.

So what am I doing when I'm not writing? Stuff like that.

It's old news, and it's not really that interesting any more (2014-04-26)
"Did you hear? Did you hear? So and so is out."

People have been reaching out to me to see what my reaction is on yet
another personnel change in the valley this past week. Okay, obviously
this one was a little closer to home than most, but it's also far in
the past. That chapter of my life ended nearly three years ago now, and
I'm over it.

The treasury has been looted, and people have moved on.

Oh sure, just like everyone else who was touched by this, I'm still
placing my bets with friends regarding what will come to pass with that
which was left behind, but there's no reason to go into it here. I've
already covered this ground.

There's better stuff to worry about.

Segfaulting atop and another trip down the rabbit hole (2014-03-02)
Let's say you're writing a program which is intended to take a snapshot
of your system's status every minute or so. The idea is to grab
whatever data you might have looked at directly had you been on the box
at that moment. You might have the results of an all-encompassing call
to "ps", "netstat", "lsof" or whatever else suits your fancy.

Odds are, after some iterations of this, you'd end up with a tool which
did some kind of binary-level data logging so as to be more efficient.
Even once a minute is (usually) 1440 samples per day, after all. So
there you are, with your binary blobs and a growing file. Your logic
for the writer probably looks something like this:

Open file for writing and appending

Write some stuff

Write some more stuff

Possibly write even more

Close the file

Then you shut down or go to sleep or whatever until the next time you're
supposed to take another snapshot. If you're the "sysstat" package,
this probably means having one of your binaries run by cron in a minute
or two. If you're "atop", then you're probably a long-lived daemon of
some sort which just sleeps in the middle.

So now here's where things get interesting. What if you get interrupted
in the middle of this? What happens now? The box might go down, or you
might get OOM-killed, or some runaway admin might hit you with a SIGKILL
by accident, or someone might decide it's time to upgrade your binary
and down you go.

Now, when it comes to data integrity, there's the obvious angle of
worrying about whether you use fsync() and if your filesystem is smart
enough to put things back in a sensible state, but that's not the end
of it. Remember the list earlier? You probably did multiple writes to
the data file. What if the first one got in but you died before making
any others? Chances are, you have the beginning of a record written to
your file but nothing else.

This now brings up two questions: what happens when you go to read this
file, and what happens when it's time to continue writing to this
file? Trying to read this last (interrupted) record from the file will
probably result in an EOF. If the program doesn't want to do anything
with that, I can certainly understand. That makes sense.

The situation of having the end of the file contain a partial record
lasts only until the writer starts up again. Then, it'll
probably run the list as above: open for writing and appending, spew out
some blobs, close the file, and go to sleep again. Now you have a file
which has a bunch of good records, one broken attempt at a record, and
then probably more good records.

What happens when the reader comes along and hits this? If you're
"atop", you pull in data, assume it's a record, and use it as such.
Since things are completely mis-aligned in that file now, this means
incredible and impossible numbers appear in fields because they are
actually part of different fields from the next (complete)
record.

In my experience, the first time anyone finds out about this is when
they go to run atop after a system crash and it segfaults. Then they go
to find out why it segfaulted and realize it's happening in mktime() or
somesuch. It takes a while of backing up to find out that the segfault
is caused by dereferencing a NULL. Where did that NULL come from? A
call to localtime(), of course. Wait, what?

Did you catch that? localtime() almost always returns a pointer to a
struct tm, except when there's some kind of error. It seems simple
enough: you give it a time_t (seconds since 1970-01-01 00:00:00Z, more
or less) and it gives you back a structure which has things broken out
into a bunch of neat little ints.

It does... except when it doesn't, of course.

From the ctime(3) man page on my Linux box:

Each of these functions returns the value described, or NULL (-1 in
case of mktime()) in case an error was detected.

Got that? NULL can and will happen. Still, we're talking about taking
a number and slicing it up to make it into years, months, days, and so
on. That's just a bunch of division and subtraction and stuff like
that. How can you make that fail?

This is about the point where you start thinking "just what is so magic
about 67768036191705600", and, well, I can give you a hint. Think
about it in terms of years.

Here, look at just one second before that.

$ date -d @67768036191705599
Wed Dec 31 23:59:59 PST 2147485547

That "2147" is what got me paying attention. That magical "highest
number you can pass to this version of localtime()" yields a year which
is in the neighborhood of 2^31, aka 2147483648, aka 0x80000000. It's
actually a little past it - by 1899. Remember that for later.

tm_year is documented as an int, not an unsigned int, so anything with
that high bit set would actually be negative. That gives us an
effective top end of 2147483647 or 0x7fffffff.

Given all of this, how are we rocking a year which is somehow
above that value? For this, we must also remember what tm_year
stands for. It's not exactly the year number. Oh no. Quoting from the
man page yet again:

tm_year: The number of years since 1900.

Remember all of that software which went goofy about 14 years ago
because people were doing things like this?

printf("The year is 19%02d.\n", t.tm_year);

That gave us such gems as "The year is 19100.", or "19114" as it would
say today, assuming it's been unmaintained all of that time.

That's how we seemingly get past the (2^31)-1 year limit: it's offset.

So let's pop up a few levels here, and get back to talking about atop.
atop is using data which is bad, and it blows up. Due to the nature of
how it traverses a file, it's now impossible to seek past this bad
record, since it always starts from 0 and reads along until it gets to
the time offset you requested. Now what?

This is how it came to me: atop was blowing up on machines, seemingly
randomly, but usually right after a machine went down. This is exactly
when people wanted to use it to look for signs of insanity, and it would
let them down unless they were super careful to only request times
before the first crash. They couldn't see anything beyond it.

My investigation took me through all of the above and the inescapable
conclusion that we were getting incomplete data. However, I suspected
that there was good data in that file past the interruption, if only we
could get to it. This got me looking for some kind of "beginning of
record" marker. I found none. I looked for an "end of record" marker,
and again found none. That would be less ideal since it would mean
potentially skipping the first good record to be sure of a sync, but it
wasn't there, either.

You can't just skip a fixed number of bytes, since each record is
variable length. The first part of a record has a header which says
"this many entries are coming", and then you get that many entries. If
some or all of them fail to appear (due to the crash), then you're in
trouble.

Ultimately, I resorted to brute force. There are a number of fields in
the header which are always set to zero since they are "reserved for
future use". There's also the time field which needs to be in a small
range of values: basically a couple of years ago to a handful of years
in the future. I can use them to figure out if I'm lined up with the
record header.

So now here's what happens: I read a record and then do a sanity check
on it. If it seems fine, then I exit to the normal display code and let
it do whatever it wants. Otherwise I declare shenanigans and start
single-stepping through the file until it seems to be reasonable again.
If this succeeds, then we carry on from there. Otherwise, after some
number of failed attempts or hitting EOF, we just give up.

That's about where things are now: patching the code means it usually
can survive certain kinds of problems. It's a nasty little hack driven
by the desire to change as little as possible about some
externally-sourced code which may be a moving target.

What would be nicer is if the file format had some stronger assertions
about exactly what you were looking at. If it had both a beginning and
end marker for a record, then you could be reasonably sure that you had
gotten a complete view. If those markers also included some sort of
hopefully-unique identifier for each record, you could be pretty
confident that you got a single record and didn't somehow pull in parts
of two or three.

These markers should probably include a whole bunch of magic numbers, or
at least magic bytes within the magic numbers so you can be sure of
alignment. If you're using 64 bit types, there's a lot of room for
leaving your favorite easter egg values around for the program to find
later.

I'm not asking for every little sysadmin type tool to have its own
log-backed, transaction-based storage system underneath. I just want
them to behave better in the face of the sorts of common problems which
send people to those tools in the first place. It doesn't take much to
greatly improve the situation.

Responses to reader comments regarding some older posts (2014-02-17)
This is a batch of updates to older posts and responses to reader
questions.

One question is about my scanner project: am I
motivated by profit or is this a hobby? That one is a little hard to
answer, but I can definitely remove one angle from it: there is no
profit in this kind of work. Nobody wants to pay for it. They don't
want to pay for the hardware, the software, or the development time.

Back when you had to pay $800 or more to get into this field with a
$500 USRP1 and a $300 daughterboard, hardly anyone did it. It took the
existence of the $20 RTLSDR sticks to get more people to pay attention.

So I guess that leaves the "hobby" angle. I don't listen nearly as
often as before, but other people have found it useful on occasion, so
it's worth keeping up. I got a few mails from someone who was concerned
for a friend out this way when something scary happened involving some
criminals, and my site was able to help them out.

If I don't want that to go away, then I have to adapt my site to the
new TDMA-based system
somehow. There's just no way around it.

...

In regards to a post from 2012 about
working tech support,
there was a question as to how big the day shift was. To be honest, I
don't remember exact numbers any more. I think the Linux side of the
house had 3 teams back then just for the non-high-end customers. Each
team probably had three or four "level 2" techs and one or two "level 3"
techs, plus however many "level 1" phone firewalls they could cram in.

So... maybe 12-15 people? That number seems low. I'm trying to
visualize the area which held their desks now and it seemed to go on
forever.

I know that's not much of an answer. Sorry. It's the best I can do.

...

Finally, I got a couple of questions about my
RPM cleanup project,
and whether I'd make it available. Well, I'm sorry to say that it's not
mine to give away. Someone paid for the development of those tools, and
they own them as a result.

Fortunately, we're not talking about rocket science here. Any of these
tools could be re-created by anyone so inclined. Run "ps" with the
right flags to select things like yum and rpm and have it return the
pid and wchan. If it's in the futex state, poke it with ptrace as
previously described. Try a few times and then punt if you must.

That's one tool. Another one just tries to do the equivalent of "rpm
-qa" using the librpm calls, and if it gets stuck or fails, it kicks off
db_recover. It's basically a bit of paranoia to keep from running the
recovery tool every single time, since I'm not sure exactly what that
would do to a concurrent RPM operation which is otherwise fine. The
only magic here is that it happens in a child process to avoid getting
itself stuck when things go bad in libdb.

Finally, for the NFS thing, you do the same "fork and test in child"
scheme, but you're looking at entries from /proc/mounts instead. If
they don't work, kick them out and remount them if at all possible.
Lather, rinse, repeat.

I don't have the cycles to maintain or support something like this for
the world. If you need them, write them. They aren't terribly
complicated, and the wins you'd get from local customization beat any
savings you might get by copying code from someone else.

Welcome to 2014, in which RPM still gets stuck in futex (2014-02-09)
I get questions now and then: what do I do when I'm not writing? These
days, it could be any number of things. Some of them can be described
in fairly general terms and others I prefer to leave alone for now.

One technical issue was that of
wrangling RPM
as described back in December. I got through all of that and got my
tools working to clean up whatever messes might come along. That much
is basically done.

Along the way, I've run into yet another class of problem: rpm and yum
processes which do some rather broken things in how they handle futexes.
It always starts as the same thing: someone has a box which is supposed
to install or upgrade a package by itself, and it hasn't. They take a
look and notice that yum or rpm or whatever can't get any work done.
Sooner or later, it comes to me.

A pattern became evident with these systems: they all had one or more
RPM-related processes sitting in "futex_wait_queue_me". Those which
weren't in that particular function were sitting in a call to fcntl()
or similar, trying to get a lock. Naturally, I turned to strace to see
just what they were up to, and that's when things got weird.

While attaching to the fcntl-waiters didn't do anything, more often than
not, attaching to the futex-waiters did. They'd manage to
finish the futex call and would get going on their work. Then they'd
exit. This would then release the locks, and everything else would wake
up and go to work, too. A few seconds later, it was all back to normal
as if nothing had ever broken.

What a nice heisenbug! Go to investigate it, and it changes... and
disappears, too. How fun.

Like the prior situation, I had two major ways to pursue this problem:
I could go digging into RPM and the Berkeley DB stuff and maybe even
glibc and beyond yet again, or I could come up with a fix first and then
maybe come back to the problem. That's about where I am now: I wrote a
fix. I might go back and dig into this crusty old RHEL-variant version
some day, but the value of that is already low and continues to drop.

So, what's the workaround? It's pretty ugly, but it does let the
machines keep doing their jobs without sending humans in. It's also a
"light touch" and does as little damage as possible.

First, look for processes which are touching the RPM database which are
in this state. Then attach to them just like strace would -- yep, that
means ptrace(). Then detach and give them a poke if necessary to get
them running again. Wait a few ms and look again. It'll probably
disappear all by itself, having run to completion. If another one is
there, then poke it, too. Run a few passes just to be sure.

If this fails, then okay, go nuclear and SIGKILL the stuck ones.

"Look for stuck workers and SIGKILL 'em" is what everyone has done up to
this point. It has the potential to make things worse.

I found a happy middle ground. It's a hack, but it does work.

Pragmatism is weird that way.

More lies from Siri (2014-01-19)
Siri lies. Not too long ago, it completely misparsed one of my requests
and turned it into a cheeky encouragement to upgrade the OS on my
phone. It looked like this:

It's done this with other things. Things which probably wouldn't be
more complicated than displaying an HTML table. I have no idea what it
thought it was going to do by creating an "out station" (???), but
clearly it thinks that a newer iOS will help.

News flash: it won't. I have access to another device running the
latest (and I'm stopping short of calling it the "greatest") iOS
publicly available. I asked it the same thing on purpose (in fact,
doing it at the same time on both devices, so as to get the above
screenshot). It said this instead:

Bait and switch, eh? Typical programmers.

User: "I'm having a problem."

Programmer: (without checking anything else) "You should upgrade."

(time passes)

User: "I upgraded. It's still happening."

Programmer: (radio silence... they've already moved on)

Inflicting a worst-case scenario on my own binary tree code (2014-01-01)
I have unintentionally written ticking time bombs in my code.
Hopefully, I don't do it nearly as much any more, but the old ones
remain. One of them bit me earlier this week.

About 12 years ago, I sat down and wrote a program to manage my personal
diary. It's not the sort of thing I ever shared with the world, and it
was just enough to take a bunch of flat text files and render them into
somewhat-nicer HTML. It was later the inspiration for the "publog"
software which generates these posts, the Atom feed, and yes, even the
books.

The original software, however, never changed. Aside from some cosmetic
changes over the years and adaptations to varying path schemes when I
moved it to a Mac, the core has remained. It was written in C with very
different techniques than I might use in C++ today.

One of the things I did when writing that program was to cache
information about entries. Instead of having the program rewrite every
input file into an output file on every pass, it would instead stat()
the input file and compare its mtime to the cached mtime. If it was the
same, nothing had changed and so nothing needed to be done.

On disk, the format was simple enough. It was something like this:

2014/01/01 1388621917 "thing 1" "thing 2" "thing 3"

This was then stored in my own hand-rolled binary tree implementation,
keyed off the first blob: "2014/01/01" in this case. I figured this
would make for speedy lookups later on when it came time to check the
data for an arbitrary record.

So, I wrote it this way in 2002, and it's been fine all this time, and
then it broke. It just started segfaulting a couple of days ago. I
decided to feed it to gdb just to get some idea of where it was blowing
up and was rewarded with quite a mess.

The backtrace showed several thousand calls to one function with a name
like add_to_tree(). That plus the segfault makes it pretty clear: I was
blowing up the stack with excessive recursion. What was going on?

To understand this problem I first had to get back into the groove of my
old code. The best way to do that was just to read it "in order",
starting from "int main" and following branches, taking notes along the
way. Remember, this thing is 12 years old and I forgot all about the
finer points of how it worked along the way.

Eventually I had a sort of narrative about how it was supposed to work,
and things started fitting into place. That's when I realized what
happened: I had created a pathological case for my binary tree.

Recall that I have a cache, and I write it to disk between runs. It
gets written to disk using an in-order traversal, so I get things like
this in that file:

2013/12/01 ...
2013/12/02 ...
2013/12/03 ...

You get the idea. When this gets loaded in, my binary tree code just
sees it as being "greater than" the root node, and then that node's
right child, and then that node's right child, and so on.

Yep, I turned my binary tree into a linked list.

This held up for years and years until I finally gave it the entry which
broke the camel's back, and it simply could not recurse any more given
the available stack space. Then it just died.

This would also explain why it had been getting slow over the years, but
it all happened so gradually and I didn't really notice it. That and
upgrading my workstation hardware multiple times made it disappear
into the noise.

If my binary tree had been balanced, this wouldn't have happened.

Rather than trying to hack up one of those, I just pulled this crusty
old code into my current tree of C++ projects and introduced it to a
little thing called the STL. It's still really horrible code, but I've
managed to chop out the nastiest bits which were keeping it from
working. Maybe one of these days I'll go back and bring the rest of it
up to spec.

Of course, that may be another 12 years from now. See you in 2026?

Floats and version numbers: just say no (2013-12-29)
3.2.

See that floating point number? It probably means you harm. It wants
to coerce you into doing things that look really nice and simple on the
surface, but are really scaly and evil underneath. Have you ever seen
that
cartoon
where one stick figure asks another how they manage to hold things when
they have no fingers, let alone hands, and one whips out a magnifying
glass to reveal tons of microscopic tentacle-ish things making it work?
It's like that.

Let's say that 3.2 is the first two parts of a Linux kernel version.
That is, it's the 3.2 from "3.2.45". Someone took that string, sliced
it using the periods as separators, and then cast it to a float. Then
they compared it to something else, like 3.0. 3.2 is in fact bigger
than 3.0, and thus some feature is supported.

That's all well and good until 3.9 came and went, and 3.10 followed.

Guess what 3.10 really means when you're a float...

>>> s = '3.10.4'
>>> float(s[0:4])
3.1

... it's 3.1, more or less, and 3.1 is less than 3.2. Oops, better
disable support for that feature!

Want to compare version numbers? You're already headed into a deep,
dark pit of despair. It's just like date and time handling: seemingly
simple because you do it every day, but a complete disaster once you try
to get a computer to do it for you.

There are helper libraries. Hopefully whatever need you have will be
satisfied by them. One thing is clear, though: chopping up a string and
turning it into a float will yield pain. It's just a matter of when.

...

Homework for those who think this is no big deal: put the following
versions in order using no other information about them.

I haven't even tried to be outlandish here. I think you can find at
least one real-world example for every format I put up there in just
the ~20 year histories of the Linux kernel and Slackware.

Helping RPM stay afloat on big fleets of machines (2013-12-15)
I've been doing some work lately which touches a lot of Red Hat-derived
machines. This is not a new pattern in my life -- they always tend to
turn up in one place or another. I first encountered their "enterprise"
flavor during my days as a web support monkey, and came to like it.
When it comes to having a dedicated server somewhere which needs to
basically keep working and not have massive changes sprung on you, they
have that down cold.

This is not to say that it's perfect. There are things which tend to go
wrong, particularly when you start talking about lots of people using
lots of machines. Beyond a certain point, it becomes impossible to care
for each and every one on an individual basis, and you have to start
rigging up checkers and fixers.

These situations are not always straightforward. Consider the case that
machines can and will reboot at any time. This could be due to the
kernel flipping out, a wayward command run on the wrong machine,
hardware errors, power problems, or anything else you can imagine. Now
also consider this can happen during an update of the oh-so-important
RPM database -- the thing which tracks all of your system's packages --
the OS itself, if you will.

Most of the time, nothing bad happens. You get lucky and it manages to
finish up in a consistent state, or at least a trivially-recoverable
one. Nothing bad happens, and all of your stuff keeps working. All of
those automated processes which check on the machine, install, remove,
and upgrade your packages continue to work. If you have a small number
of machines, this is probably your life: it just never happens.

So then, let's crank up the number of machines and suddenly those tiny
little percentages of failures start yielding actual results. Your RPM
database decides it's run out of lock slots and won't run. This stops
your automatic upgrades, patches, and everything else involving
packages. It might even start breaking other things depending
on how your system management stuff works.

Let's say someone notices the "failing RPM" situation and they decide
to do something about it. I imagine the result will be a shell script
with the best of intentions. First, it'll try to do something with RPM
to see if it works or not. If it gets an error, then it'll do a "rpm
--rebuilddb" and exit.

It'll resemble this:

rpm -q foo || rpm --rebuilddb

I imagine that will automatically resolve a few situations. It might
even do its job without making life too terrible for other "tenants" of
the machine. It'll go into cron, and it'll run regularly.

Then, one day, it'll stop helping and it'll start hurting. The RPM
database will get into a state where db4 (the base library under librpm
itself) decides to enter an infinite loop. This one is fun, since it's
not making syscalls and it's not making library calls. That means both
strace and ltrace show you absolutely nothing, and it'll just sit there
burning CPU. Start another process, and it, too, will chew another core
on your machine.

Every time that cron job starts up, it gets another core. Give it a
couple of days, and soon you're in a world of hurt thanks to the loop
from hell which will never end. You're bleeding machines, and something
has to be done about it. It's time to start troubleshooting.

It takes gdb to show you that it's __env_alloc deep inside db4, complete
with nested levels of C preprocessor gunk.

This one takes a value, adds something else to it, and then bounces the
result through two casts for some reason. One of those casts is
u_int8_t, and the other one comes from the macro call.

So jump back to SH_TAILQ_FIRST. Now mentally try to wedge both of those
macros into that one, and don't forget about the ternary stuff going
on -- see that question mark and colon? Yeah.

You're testing something's stqh_first to see if it's equal to -1, and if
so, yielding NULL, otherwise, you're yielding the result of this
doubly-cast addition on ... something else.

It just goes on and on like this. I won't spam you with the rest.

I realized I could keep going down the rabbit hole of how db4 worked,
or I could start figuring out what to do about it. The "analysis" side
looked like a bottomless pit and I wanted to start delivering results.
As it turned out, the "recovery" side was simple enough: db4's own
"db_recover" will put things back together without an exhaustive
rebuild.

Of course, this recovery needs to be kicked off somehow. Attempting a
RPM database operation and seeing if it hangs is good enough, but how do
you do that without getting stuck yourself? Well, you do, and you
don't. That is, you have to fork off a child, and let that child make
the attempt while sending updates back to the parent over a pipe. If it
stops phoning home, you know you have a problem.

Then you just call db_recover and set about killing off the other stuck
processes - you did want to get back that CPU time, right?

This is just the beginning, of course. There are other failure modes,
including those where you can open the RPM database but then it refuses
to let you query anything. There's one where you can query some, but
not all of the packages. Get far enough down the list and it'll get
stuck trying to acquire a lock of some sort.

Pursue this long enough and you will discover the fun of running "fuser"
on a machine which uses NFS and has at least one dead mount. It'll
start digging through /proc, it'll find a reference to the dead
filesystem, and will enter "state D" forever. Joy! Soon, you will be
in the business of monitoring and maintaining NFS mounts. Say hello to
the forced umount and the inevitable tradeoffs between staying frozen
and possibly losing data by killing a stuck write in another process.
(If this sounds familiar, it's because I
did it with smbfs
back in the '90s. The names change, but the problems remain.)

Get through this, and the next problem to emerge won't be NFS or RPM.
It'll be yum instead. Yep, you can have a system which runs RPM
commands and looks healthy, but then fails a "yum check". Those are all
sorts of fun, too.

This is pretty much where things are now: I'm chasing a variety of
loosely-related problems which can screw up fleets of machines. The
biggest concern is not getting stuck doing nothing but this. Just like
meddling in the affairs of other countries with military force, this
sort of thing needs a clear exit strategy from the outset.

Get in, get it done, and get back out.

My advice: skip the Intel museum (2013-11-24)
There are some parts of the valley where I don't go too often. Maybe
they're far away, or they don't seem interesting, or whatever, but time
just goes by without me experiencing them. Then, out of the blue, I'll
have family visiting from far away and they'll want to check out
something specific. I can't exactly turn them down, so off we go to do
something new.

This is the standard "you live here but you never go see X until someone
visits" story. For me, it was the Intel "museum". It was a pretty sad
little setup with a gift shop tacked on. I caught two pictures worthy
of snark.

I have to apologize for the picture quality here. This one popped up so
quickly I barely had time to snap this photo before it disappeared.
It's this dumb little "digital photo postcard" thing they have when you
first enter. You're supposed to sit down, somehow bend yourself into
the camera's field of view, and then have it shoot a picture.

Then you get to hold your hand out at arm's length and tap-tap-tap
on the screen to enter your e-mail address. Imagine a typical
flatscreen display, but tilt it back maybe 15 degrees, and then try to
"type" on it. You get "gorilla arm" really quickly doing this. Then,
while you're doing that, this message appears.

You'll just have to trust me that it says "UNREGISTERED VERSION" in the
top grey bar, and down below it said something about entering a
registration key, or that the registration had expired, or something
else equally clowny.

A bit later, there's a display which is supposed to teach you about
binary, I guess, but it winds up being an interesting lesson about ASCII
and the limitations thereof. Couldn't they come up with something
better than this?

There are 8 seven-segment displays, each with a button beneath to toggle
them between 0 and 1. Then there's an [Enter] button, and a [Clear]
button. Above it (not shown in my picture), there's a screen which
displays the characters you've entered.

At first, this seems like a simple use of a shift register. You set up
some bits, hit the button, and it shifts them in. When you're ready to
start over, frob the clear line and it starts over. Easy, right?
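
A toy model of how a toggle-and-shift display like that might work (my guess at the mechanism, not Intel's actual hardware): eight toggled bits become one 7-bit-ASCII-plus-spare-bit character per press of [Enter].

```python
# Hypothetical sketch of the exhibit's behavior, not its real firmware:
# eight toggled bits, most significant first, shifted in per [Enter].
def enter_bits(bits):
    """bits: a string of eight '0'/'1' toggles, MSB first."""
    assert len(bits) == 8
    return chr(int(bits, 2))

# Spelling out a safer suggestion than kids' names, one byte at a time:
word = "".join(enter_bits(b) for b in
               ["01001001", "01001110", "01010100", "01000101", "01001100"])
print(word)  # INTEL
```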

Well, but there's a problem. I guess I started thinking about it after
noticing that it had an 8th bit present, and that was silly because
ASCII doesn't go that high. That got me thinking about character sets.

Notice that it only shows the patterns for A-Z. Just how many people
out there have names which can't be expressed with just those? While
I'm not personally in that camp, I've met a fair number of people in my
life who are, particularly out here in the multicultural land that is
the Bay Area.

Imagine a bunch of kids coming through here, and then one of them steps
up and tries to make her name pop up, but it's no good. The system
can't render whatever interesting character she had in mind. Sure,
trying to enter UTF-8 this way would be silly, but why make some
kid sad just because her name has some other character in it?

They could just amend it to not talk about names. They could have the
kids try to type in "INTEL" or "PENTIUM" or
"F00F BUG"
for that matter.

After looking at this list again, I'm kicking myself for not trying a
name like "Nishit" at the Intel museum. It could have been interesting.

If you're looking for a bunch of computer history stuff in museum
format, drive the extra 10 miles to Mountain View and go to the Computer
History Museum. As long as you don't make any wrong turns, you'll have
a much better time.

SVRCS is alive (2013-11-17)
The county's new radio system is alive. Right now, the only things
you'll hear are radio techs calling back and forth with "test... 1-2-3",
but it's there. It's running Phase II (TDMA) audio as promised.

In terms of software or the scanner site I've been running, there isn't
much to say. While there are efforts afoot to make decoding the control
channel a simpler thing, there still isn't anything to be done about
audio. Without audio, the rest is pointless, and so here we are.

There have been some changes in the world of scanning. I got a PSR-800
from some random seller on eBay, and it works just fine. It's been
letting me hear the radio techs go about their business, since it's the
only currently-available radio which actually handles the 2-slot TDMA
channels.

It would be a really evil stopgap, but in theory, I could use the
PSR-800 to drive the site. There would be no talkgroup information or
anything else of the sort, but there would be audio at least. It would
be subject to all of the usual annoyances, too.

What happens next? Who knows...

Configuration management vs. real people with root (2013-11-03)
There are all sorts of configuration systems for Linux boxes which have
popped up in recent years. These are the Chefs and cfengines of the
world, plus all of those other ones which I haven't heard of. Every
niche has its own angle on the problem, complete with varying degrees of
configurability.

Some have their own little domain-specific configuration languages.
Others just hand you an existing programming language and let you do
whatever you want. If you can figure out how to write it, you can have
it. Just don't mess up.

One common behavior seems to be the notion of having a list of things
which must happen to turn a machine from state A to state B. Let's say
state A is "doesn't have foo installed" and state B, unsurprisingly, is
"does have foo installed". You write a script, rule, recipe or whatever
for "A to B", and then any time someone wants foo on their machine, they
run your little creation.

This creation of yours might be smart or it might be stupid. Imagine
a completely braindead script, for instance:
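
(The script from the original post isn't reproduced here; a hypothetical stand-in in the same braindead spirit, with made-up names and paths, might look like this.)

```python
import os, tempfile

# Hypothetical braindead "A to B" recipe: it clobbers the config file
# every run, checks nothing first, and reports nothing afterwards.
def install_foo(conf_path):
    with open(conf_path, "w") as f:   # blows away any local edits
        f.write("port = 8080\n")      # hardcoded value, no error handling

# Run it against a scratch path just to show the (lack of) behavior:
path = os.path.join(tempfile.mkdtemp(), "foo.conf")
install_foo(path)
install_foo(path)   # runs "fine" a second time, but verifies nothing
```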

I won't pick that one apart, since I did that sort of analysis in
another
post
earlier this year. Instead, I'll talk about another problem which seems
to be ignored: that of humans and entropy in general.

Maybe you have a few hundred Linux boxes and a system like this. When
you want to set up a new server running foo, you have it run your
sequence and now you have foo installed. Time passes. The company
grows. More people come along who have access to the server. One day,
one of them logs in and changes a config file directly.

What happens now? Let's say the change fixes a real problem, and that
ad-hoc change persists out there for a year. Then, one day, that
machine gets reinstalled and the change is gone. It's been so long that
the original person doesn't even remember what they did to the machine.
Maybe they don't even work there any more.

Now let's say you switch to a config system which actively tracks the
files it installs. It makes sure they keep the same values and will
flip them back if necessary. If someone makes an ad-hoc change in that
environment, it'll be reverted fairly soon, and they'll realize
something is wrong. It breaks during that short window when everyone
still has context regarding the problem, in other words.
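
The "actively tracks files" idea can be sketched in a few lines (a minimal hypothetical, not how any particular tool is built): compare each managed file against its desired content and flip it back when someone has drifted it.

```python
import os, tempfile

# Minimal drift-and-revert loop over a set of managed files.
def converge(managed):
    reverted = []
    for path, want in managed.items():
        try:
            with open(path) as f:
                have = f.read()
        except FileNotFoundError:
            have = None
        if have != want:
            with open(path, "w") as f:   # revert the ad-hoc change
                f.write(want)
            reverted.append(path)
    return reverted

# Demo: a human edits the managed file by hand; the next run fixes it.
conf = os.path.join(tempfile.mkdtemp(), "foo.conf")
managed = {conf: "port = 8080\n"}
converge(managed)                    # first run installs the file
with open(conf, "w") as f:
    f.write("port = 9999\n")         # someone's ad-hoc fix
converge(managed)                    # next run flips it back
```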

Okay, that's an improvement. But there's still more which could happen.
Maybe you're running the type of software where any file in a magic
directory becomes part of the active configuration. For examples, look
at places like /etc/cron.d, /etc/profile.d, and anything else of the
sort. Programs like qmail also behaved this way: individual config
directives were handled with individual files.

So you're in this world and your config program is tracking files A, B
and C. Then some human drops in and adds file D. That changes the way
the system behaves, but your config program never catches it. You're
back to the earlier situation.

Now you're facing a bigger configuration job: having the system
maintain entire directories. That way, any new files which appear in those
managed spaces will be removed. This also applies to subdirectories and
anything else which might be dumped out there.
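
Whole-directory ownership can be sketched the same way (again hypothetical, with invented file names): anything in the managed directory that isn't in the manifest gets removed, so a hand-dropped "file D" can't silently change the system's behavior.

```python
import os, tempfile

# Reconcile a managed directory against a manifest of expected names.
def converge_dir(dirpath, manifest):
    removed = []
    for name in sorted(os.listdir(dirpath)):
        if name not in manifest:
            os.remove(os.path.join(dirpath, name))  # unmanaged: delete it
            removed.append(name)
    return removed

# Demo: recipes own files A and C; some human drops in file D.
d = tempfile.mkdtemp()
for name in ("cron_a", "cron_c", "cron_d"):
    open(os.path.join(d, name), "w").close()
converge_dir(d, {"cron_a", "cron_c"})   # removes the stray cron_d
```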

Maybe you do that too, and now your various recipes own entire slices of
the filesystem. How long can this last? How long will they stay
separate? Eventually, your config system will need to create unions of
the constituent parts on a given system. If two recipes touch the same
directory, they need to somehow fit together in a compatible way.

The alternative is having two warring recipes, constantly loading and
unloading each other. That's not exactly productive.

It seems like ultimately you're bound to hit a wall in which two
conflicting configurations are required at the same time. Maybe one
tool expects /lib/libfoo.so to point at version 1.2, and another wants
it to point to version 1.6. Now what? Do you resort to LD_PRELOAD
hacks for one... or both? Do you chroot one of them?

My guess is the only sane solution is to have a base system image which
is small and which has some kind of generic "overlord" to add things.
Anything which gets added is walled off in its own chroot or possibly
even an entire lxc style container. It doesn't run with root
privileges, and as far as it knows, everything on the system exists
solely to make it happy.

Of course, it's also possible to go too far down that path and wind up
with a
monoculture problem
in your fleet where one bug wipes out all of them.

That's obviously no good, either. More base system flavors are needed.

Remember the overlord process and how it walls off its tasks from the
rest of the system? This is a good thing. It means they shouldn't have
any idea of what's going on underneath. Now you can have base system A
running some kind of Red Hat variant, base system B with some kind of
Debian environment, and base system C with Slackware just for fun.

Screwing that up sounds a whole lot harder. One bad RPM won't take down
the entire fleet if only because it won't install on systems B or C --
assuming that you don't do something clowny like making the other ones
auto-convert and install them, of course!

This can have other benefits. Within your organization, you can have
the different base systems be owned by totally separate groups. They
can even live in different parts of the world if you want. Maybe one's
on the west coast of the US and the other's in Dublin. That's a pretty
common arrangement in tech companies.

Someone who likes working on RH-flavor systems joins team A, and someone
who likes working on Debian goes for B. Then the sole engineer who
enjoys playing with Slackware maintains C in her spare time. You get
the idea.

I would even suggest taking this kind of behavior further up the stack
if at all possible. If the "overlord" software can be reduced to a
common API, why not have multiple implementations? Let them split into
two flavors which are owned by the teams described earlier.

So many things become possible in this kind of world. You can run tests
to see which side delivers better performance for a given task and have
some friendly competition to keep improving.

People are going to log into machines and make changes by hand.
Sometimes they intend for them to stick. Other times they don't, but
they forget and those changes stay around far too long. You can try to
legislate this out of existence, or you can create a resilient,
self-healing system which doesn't allow things to get out of hand.

What'll it be?

ping and inet_aton, revisited (2013-10-20)
Last year, I
wrote
about dissecting ping and glibc to find out why the one on my system
(and many others) supports things like 192.168.01234 as a valid address.
If you haven't read that post (or that chapter in the second book), go
have a look for context.

Anyway, it came up again today: someone linked to a big number on Hacker
News and said it was a valid IPv4 address. I
commented,
linking back to my old post to help explain why it works... sometimes.
For at least one person, this wasn't good enough, and so I clarified
that it's up to the tool and perhaps even the system's libraries.

I made this point back in the original post, too: that your ping and C
library implementation will likely affect whether this works or not. For
most people, I suspect they'll be on a BSD or Linux stack, and it will
work as advertised. So, when exactly will it not work? It took some
digging, but I came up with some scenarios: dietlibc and uClibc. They
both have their own inet_aton implementations, and they have their own
takes on the weird corner cases.

You get the idea. Some will work and some won't. The implication is
clear: the same ping program could work or fail with those weirdo
addresses depending on which C library you happened to use.
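
For a sense of what those implementations are wrestling with, here's a rough sketch of the classic BSD-style inet_aton parsing rules (an approximation for illustration, not a drop-in replacement for any libc): one to four dot-separated parts, each decimal, octal (leading 0), or hex (leading 0x), with the final part filling all remaining low-order bytes.

```python
# Approximate reimplementation of inet_aton's numbers-and-dots rules.
def aton(s):
    def part_val(p):
        if p.lower().startswith("0x"):
            return int(p[2:], 16)        # hex part
        if len(p) > 1 and p.startswith("0"):
            return int(p, 8)             # leading zero means octal
        return int(p, 10)

    parts = [part_val(p) for p in s.split(".")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("bad address")
    addr = parts[-1]                     # last part covers the rest
    for shift, p in zip((24, 16, 8), parts[:-1]):
        if p > 0xFF:
            raise ValueError("part out of range")
        addr |= p << shift
    if addr > 0xFFFFFFFF:
        raise ValueError("address out of range")
    return addr

# "01234" is octal 668, and as the last of three parts it fills the
# low 16 bits, so the weirdo address resolves to a normal one:
print("%d.%d.%d.%d" % tuple((aton("192.168.01234") >> s) & 0xFF
                            for s in (24, 16, 8, 0)))  # 192.168.2.156
```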

Now imagine using a different ping (or any other program which
intends to speak TCP/IP) which doesn't use inet_aton at all. Will it
support all of these strange addresses? Considering that you have to
deliberately act to support this behavior, I'd expect it to be missing
more often than not.

Future frequencies but no way to decode in software (2013-10-06)
Earlier this year, I
wrote
about a new radio system which is coming to Santa Clara County. The
first stage of it is going to include Sunnyvale and Santa Clara (the
city of the same name as the county, that is), and it's going to use a
tech called P25 phase II that hasn't been used in the area before.

This particular technology is relatively hard to monitor as an
individual. The only scanner which ever did it was the GRE PSR-800, and
that company has since gone out of business. There are rumors of
production being restarted for that model or another one with the same
capabilities under the auspices of another company, but I'll believe it
when I see it.

Meanwhile, the clock is ticking on the new local system. They finally
got their FCC licenses going a couple of weeks ago for stage 1. These
are the frequencies which will be simulcast from the sites in Sunnyvale
and Santa Clara:

770.08125
771.40625
771.85625
772.15625
772.30625
772.45625

As of this afternoon, I haven't seen any activity on these channels.
I've tuned in with one of my RTLSDR sticks but haven't seen anything
happening. Of course, it's not like I'd be able to do anything useful
with the signals, given that there's no way to decode phase II in
software. The most I could do is look at the control channel (once
it exists) and say "oh, look, there goes a call".

So, I've hedged my bets. I took a risk, went on eBay, and bought a
PSR-800 from some random individual on the other side of the country.
It showed up in the exact condition described, and seems to work. In
theory, I think I have it programmed for the SVRCS system and set to
record anything which might pop up. In practice, I have no idea what
will happen if and when they start transmitting.

I hope I will have some good news to share on this topic at some point.

Evolving systems and pushing the reset button (2013-09-21)
I've been trying to put together a bigger sense of the evolution of
software and technology as some parts graduate from being difficult to
being relatively simple.

Imagine Ethernet networking and TCP/IP. Once upon a time, it required
expensive additions to your computer hardware, a bunch of programming
and configuration skills with trial and error happening continuously.
Now you can just walk into a store and buy any number of consumer
products which will plug in to a wired network or associate with a
wireless one using little more than a couple of GUI prompts.

Look at just TCP/IP. Assume you had the benefit of having a working
stack already. At some point, it was a matter of running 'ifconfig'
and 'route' by hand. Then it probably got scripted - the same commands
were happening, but now they were automated. After that, those scripts
morphed again to pick up parameters from an external source. The
scripts themselves would no longer be edited since they'd find their
settings in a configuration file.

The configuration file itself probably required manual editing at some
point. Then in a later version, it became a series of text-based
prompts which themselves were provided by the OS packager. Your
answers were written to that same file. This interface probably
changed to be a text-mode pseudo-GUI, and then maybe even a "real"
graphical mode interface with pointing, clicking, and all of that.

Obviously, DHCP helped quite a bit here, but even if it had never
happened, things still got easier for regular users. They no longer
needed to summon a network guru just to get online and be productive.

I wonder what would have happened if TCP/IP hadn't been relatively
stable since it was unleashed upon the world, and instead had kept
changing in non-trivial ways. Would we have ever made it past the
level of slapping patches on our kernels and running config commands by
hand every time we wanted to use the network?

This seems like a call back to my original disconnect between config
files and source code when I didn't yet know how compilers worked. When
a once-complicated system gets to the point that it can be expressed as
a series of finite configuration directives instead of requiring the
flexibility of an actual program, you can guess "normal" users are not
far behind.

Of course, if someone keeps pulling the rug out from under the tech to
reset it every couple of years, will it ever get to that point?

Come to think of it, there are probably some profitable (if nefarious)
reasons for doing that sort of thing every so often. It's the technical
equivalent of creating a
sick system:
the users have to depend on you, the programmer who actually understands
this stuff.

Why torture them? What have they ever done to you?

Would your code work after a trip back in time? (2013-09-14)
I recently had an idea for one more thing to consider when choosing how
to build a complicated software project. There are the usual concerns
about money and time and programmer ability and all of this. I'm
talking about something unusual here. It's a deceptively deep question
that the stakeholder should ask the programmer who is ready to launch
into building this new thing.

Would the program still work if you sent the source code back in time 10
years?

Let's look at some of the answers.

The first one is "yes", given quickly. Either they didn't think about
it, or they have already thought about it in advance and know for sure.
The person who answers without thinking about it is probably going to
give you other problems down the road. The one who has already
considered this is "on your wavelength", so to speak.

Another answer would be "yes", after a considerable delay. I assume
this wouldn't happen too much since it would make for an awkward silence
as the programmer pondered it on the spot. I think it would be more
likely to hear something else instead.

What about "I'm not sure" or "maybe", possibly after some small amount
of silence? That seems like a much more reasonable place to start.
Who's really going to know this sort of thing in advance, anyway?

A "no" given quickly might reflect "shinything" syndrome on the part of
the programmer. They have probably made a conscious decision to use the
very latest version of everything, and their programming language or
framework flat out did not exist back in 2003. They know this and so
are proud to answer quickly. They like being on the bleeding edge.

The goal is to get to a place where you start talking about nuts and
bolts. No matter what the initial answer is, you start using it to
dissect the project. Maybe the programming language did not exist in
2003, or did not exist in any usable form. That's good to know. Maybe
it relies on some libraries which did not exist, or if they did, some
newer feature of them hadn't been written yet. Then you get to talk
about those features and what sort of workarounds you'd have to do after
a trip back in time with this code.

It's basically a sneaky way to start exposing dependencies. It's also
like nerd candy, because, come on, time travel, right? Sending code
back in time? They'll be so caught up in that fantasy that they won't
realize you're asking high-level risk assessment questions in the
present.

So okay, what's my point in even asking this? Easy. I'm looking for
ways to see how fragile something might be. Sure, it'll probably work
on the one machine the programmer configures "just so", but how well
will it work on the server-class machine with a different flavor of
Linux? Server-type installs tend to lag behind desktop-level stuff in
terms of features and versions, after all.

How about finding other people to work on the project? Is it using some
language that nobody's going to know yet because it was only invented in
2010? Are there only 10 people on the planet who know how to be useful
in this language or framework yet? Are you prepared to pay for one of
them to replace your programmer when that person moves on to the next
shiny job?

There's also the matter of whether something is common or not. Will you
have to do a special install of some runtime environment or library on
all of your servers to support this thing, or is it already there?
Don't laugh - this has happened with Ruby, Python, and Java to name just
three. For years, various Unix flavors came without one or more of
those things installed, until market pressure finally forced the OS
vendors to provide a solid way to have them installed.

Until that happened, you had to do it yourself. That's one more thing
to worry about and maintain apart from the usual vendor-supplied
updates.

Some of this is an underhanded way to see just how much your programmer
knows about what's really going on under the hood. If they say
something like "we'd probably have to disable
ECN
since a bunch of firewalls still thought those were evil packets 10
years ago", that suggests a decent amount of domain knowledge. It's
also a good chance that if you happen to trip over a customer site
which still uses one of those dinosaur firewalls with the ECN-hating
rulesets, this person will find it and know what to do about it.

I imagine this kind of question will lose most of its effectiveness if
it takes off. It's just like the "round manhole cover" question. At
some point it was fresh, but now it's well-known and worn out.

I hope this one turns out to be useful and has a nice run.

10 years is an arbitrary value, deliberately chosen to be too long.
It's far enough out to expose a whole bunch of things which really
should not matter any more. The problem is that if you pick something
shorter, you might not get people to use the kind of wide-ranging
thinking that you need to get the whole picture.

So, ask about 10 year compatibility... but maybe insist on less.

Spend the time to make it simple (2013-09-08)
Here's an observation of a system I'm sure a lot of people have said,
heard, or thought at some point in time:

Wow, that's really complicated. It must have taken a long time to
write.

My question is simple enough: did it need to be complicated?
If the answer is "no" or "I don't know", then there's a piece missing
from that observation.

It might have taken a long time, but it still didn't take long enough.

If all other things are equal, I'd prefer a simpler solution to a
complicated one. Complicated solutions have a way of creating all sorts
of weird effects down the road.

If you were around for the big multi-vendor, multi-product SNMP security
disaster around 2002, you might remember some of this. SNMP uses this
encoding called
ASN.1
which is pretty big and scary. I bet a bunch of engineers took one look
at that spec, decided it was just too much work to write their own
implementation, and punted on it. Given there was a product
(ucd/net-snmp) with a compatible license, it shouldn't be surprising
that so many were built around it (officially or otherwise). This
effective monoculture meant one vulnerability worked on all of them.

I'm not joking. If you never saw this, or if it's faded from your
memory in the past 11 years, you need to look at the
advisory report
again. Notice how long the page is, all due to the list of affected
products. It's nuts!

Think about this another way. Let's say you're going to build a network
much like the IPv4 Internet. You want to make it so it can have
millions or billions of nodes, with some as end stations and others as
routers. There will be any number of paths through this network, and
you're worried about routing loops.

This can happen if router A sends traffic for target X to router B,
while router B sends that traffic to A. Pretty soon, you have an
electronic food fight, and the pipe becomes saturated as packets check
in and never check out. Even if there are useful routes for other
traffic which cross that pipe, they become useless, because the pipe
itself is full.

What now? Do you come up with some elaborate packet tracking system so
that all routers remember every single packet they forward so they can
check for duplicates later? Wouldn't that take up an enormous amount of
memory and CPU time, and delay every single packet going through the
system? Wouldn't it also delay all packets instead of just the ones
which were genuinely in a loop? Yeah, that's bad.

Or, you know, you could do something really simple. Have the sender put
a
number
in the packet, and every time someone passes it on, they just
decrement that number. If that makes it hit zero, then they drop it and
hopefully emit an error back to the sender. Packets will still get
caught up in a loop, but at least that loop is finite and they will
eventually "die".

This concept is simple enough to be used in a children's game. Get a
bunch of kids together, and hand one a legal pad. Tell them to rip off
the top sheet every time they hand it on to the next kid. If ripping it
off reveals the cardboard backing underneath, then the game is over.
Will programmers find a way to screw up something that simple? Oh,
sure, but at least it'll be easier to spot.

Would you rather troubleshoot a decrementing counter?

"Oh, the number didn't go from 2 to 1 here."
"Oh, we didn't catch when it went from 1 to 0 here."

... or would you rather hunt for bugs in a global stateful packet
tracking system?

"Two different packets hashed to the same value, which was supposed to
be impossible, but in this build of the software we only sample X bytes
instead of the full Y bytes and so more packets appeared to be the
same."

I know which one I'd rather work on.

Pictures from the real world: menu boards and movie displays (2013-09-02)
Okay, here's one which seems snark-worthy at first, but it may take
quite a bit more thought to figure out exactly what's going on.

Context: this is on a drive-through menu board for a fast food place.

Discard "illiteracy". Keep thinking.

A hint: the drive-through does not have a video conference screen
letting you see the order taker as they talk to you. Even then, that isn't
going to work for everyone.

...

Seen at Fry's in Sunnyvale:

New ... and recent new. Not new and recent, new and recent new. Okay.

Policy vs implementation in source code (2013-09-01)
Imagine encountering some new code. It's in a language which you know
well, but there is no limit to the way to do things. Sure, there are
certain techniques which show up more than others, but really, to know
what's going on, you have to run the code in your own head.

(The "some_strxxx_function" stuff is probably actually much more
complicated than just one line. It might be several calls, with values
being passed from one to the next, and with a bunch of things glued
together with logical operators. It depends on how clever and/or
knowledgeable the original programmer may have been at the time.)

If you actually know what all of that stuff does, you might be able to
figure it out. There are no unit tests, and there are no comments
(which isn't necessarily bad, since it would be a "what" comment and not
a "why"), so you get to play computer and think it through.

"Okay, so this is a standard C library function that operates on a char*
and it looks for foo or bar, and we're handing it these values, so it
should return X when this thing happens, or Y otherwise".

Depending on how noisy and/or distracting your work area is, getting
this all down may take a while. You might be taken down a bunch of
weird dead-end paths and need to accumulate a bunch of mental state.
Keeping this straight while dealing with other bits of real life can be
challenging in a normal office environment.

Ultimately, you figure out what the code does, and then get to unravel
what the code in the if-block actually does. Oh sure, it puts some
value at data[x].bar, but what does that mean, really?

You get to dig around some more and find out what -1 means. It's a
magic value and we all supposedly know those are evil, but here it is.
It's been used, and it's been there for years, and now it's your
problem. It turns out that -1 means something special here, and -1 is
also used to mean something else in another context. -1 in this context
means "this thing is really a directory".

Eventually you wrap your head around this, and now you know both what it
means and when that gets set. Hopefully, by now, the "why" is evident.

Imagine how it might have been if the code looked more like this.

if (EndsWith(x, "/")) {
data_[x].SetDirectory();
}

"EndsWith" probably does the same weird str* function calls, and
"SetDirectory" is probably setting "len" to -1, but it reads completely
differently, doesn't it?

All I did was move the actual implementation into some other location
and left this caller as a collection of policy: when a string ends a
certain way, we go flip a flag in a labeled way.

Context: this is C++ code, despite the "const char*" and str* function
use all over the place - note the "string".

Do you have any idea what this is doing?

Put it this way - would it be easier to follow the general policy
if this function used stuff like "starts_with", "ends_with", and
"erase"?

I think so.

If this code winds up being a hot spot because it gets called thousands
of times per second, okay, fine. Unroll it again and go crazy with your
optimizations and see if you can beat the compiler. Then document the
hell out of why you did it so nobody tries to roll it back up again.

Until then, be kind to your future maintenance programmers.

A roaming fixer encounters some strange code (2013-08-27)
What sort of life does a roaming fixer have? It might involve a whole
bunch of reverse-engineering. Here's a bit of a brain-dump on the work
which starts happening.

Let's say you have a big complicated service. It has a bunch of moving
parts, all made by different people. The parts come in all sorts of
flavors: new, old, stable, wobbly, complicated, simple. The entire
system probably "works" for the most part, but like so many things,
there might be ways to improve it. Such improvements could save
programmer time, operation engineer time, user time or frustration, for
instance. There might be money to be saved if you recover a bunch of
CPU time or disk space and don't have to grow your cluster this quarter.

One of these events winds up with you being pointed in the general
direction of some code. Naturally, you've never seen it before.
Really, nobody's looked at it in quite some time. It's been there all along:
functional, but not really stellar. Maybe it used to work well but was
based on some assumptions which are no longer valid, and in this new
world it doesn't perform at the same level.

You encounter the code. It's context-free and comment-free. It has no
tests. There are no design notes, and there is no documentation. Now
what?

There has to be some way to crack the ice. One approach would be to
grab onto an interface and try to run the code. Writing a very dumb and
simple unit test would be a start. Assuming it's a class, then pick a
public function that looks promising. Maybe there's a "set" function.
Hit it and see what happens. Then maybe poke "save" and see if it'll
dump to disk. You get the general idea.
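
That first ice-breaker probe might look like this (Foo, set, and save are stand-ins, not names from any real codebase; in real life you'd import the mystery class rather than define it):

```python
import os, tempfile

# Stand-in for the undocumented class; pretend this came from a grep.
class Foo:
    def __init__(self):
        self._data = {}
    def set(self, key, value):
        self._data[key] = value
    def save(self, path):
        with open(path, "w") as f:
            for k, v in sorted(self._data.items()):
                f.write(f"{k}={v}\n")

# The dumb first test: poke "set", poke "save", see what happens.
probe_path = os.path.join(tempfile.mkdtemp(), "probe")
f = Foo()
f.set("color", "blue")   # does set() even take two arguments?
f.save(probe_path)       # does save() really dump to disk?
print(open(probe_path).read())  # color=blue
```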

Maybe that works, or maybe not. When I get stuck at this point, I
usually start working backwards from the code to try to come up with
some pseudocode. Basically, I write the pre-code notes I would have
created way back when I first wrote it... if I had written it. I
essentially turn the code into something between English and the native
language, dropping bits of the syntax which aren't really useful.

Here's a really stupidly simple example.

foo.h:
Foo::checkWOD()
- poke backend for word list
- if this fails, try the cache
- if not found in the cache, yell and return
- flip through list
- if name == 'xyzzy': scream('word of the day') and return true
otherwise return false

As you can see, it's a mix of C++-ish syntax, python-ish "ifs" and
single-quoted strings, and some plain old bullet point action going on.
Even though there's only really 6 lines of description up there, the
actual implementation might be 30 or 40 lines of code depending on how
it was written and organized.
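To make that concrete, here's one way those 30-40 lines might have looked. The backend and cache are stubbed as injected callables so the sketch stays self-contained; apart from the names taken from the pseudocode (Foo, checkWOD, "xyzzy"), everything here is invented:

```cpp
#include <functional>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Hypothetical reconstruction of Foo::checkWOD() from the pseudocode
// above. "Backend" and "cache" are modeled as callables that may fail
// by returning nothing; the real code presumably talks to actual
// services.
class Foo {
 public:
  using Fetch = std::function<std::optional<std::vector<std::string>>()>;

  Foo(Fetch backend, Fetch cache)
      : backend_(std::move(backend)), cache_(std::move(cache)) {}

  bool checkWOD() {
    // poke backend for word list; if this fails, try the cache
    std::optional<std::vector<std::string>> words = backend_();
    if (!words) words = cache_();

    // if not found in the cache, yell and return
    if (!words) {
      std::cerr << "checkWOD: no word list available\n";
      return false;
    }

    // flip through list
    for (const std::string& name : *words) {
      if (name == "xyzzy") {
        std::cout << "word of the day\n";  // scream(...)
        return true;
      }
    }
    return false;
  }

 private:
  Fetch backend_;
  Fetch cache_;
};
```

Notice how much of the real code is plumbing (the fallback, the error path) that the six pseudocode lines compress into asides.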

If I had written this in the first place, something like this probably
existed at some point. It might have been on paper, in a text file, on
a marker board, or just in my head for a while, but it was there. Here,
I had to work backwards to get this sort of thing.

What next? Well, who talks to this code? What sort of hooks do they
have into this code? In other words, where are the "seams" between the
code I'm targeting and every one of its customers? Are they clean or
messy? Is it going to be easy to do a replacement with a new
implementation, or will the call sites themselves need to be reworked
too?

That might be another list of notes, starting with the filenames found
in a grep, and expanding with details of what those bits of code do, and
what parts of the target code are accessed. Sometimes you get lucky and
can find things which aren't even used any more -- dead code! This
means it doesn't need to be ported to the new scheme.
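The hunt for seams can start as crudely as a grep over the tree. The little demo layout below exists only so the commands run as-is; in practice you'd point them at the real repo:

```shell
# Build a throwaway tree to grep over (purely illustrative).
mkdir -p demo/src
printf 'bool Foo::checkWOD() { return false; }\n' > demo/src/foo.cc
printf 'void poll(Foo* f) { f->checkWOD(); }\n'   > demo/src/bar.cc
printf 'void unrelated() {}\n'                    > demo/src/baz.cc

# Who touches the target besides its own definition? The survivors are
# the seams: call sites that may need reworking along with the code.
grep -rln 'checkWOD' demo/src | grep -v 'foo\.cc'
```

Files that never show up for any of the target's symbols are candidates for the "dead code" pile, though sneaky indirection (see below) means a clean grep isn't proof.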

Sometimes, people do sneaky things. Stuff like pointers and references
makes it possible to hand copies of things to other things, and they can
then call you from weird places. They might even do this under a name
which doesn't resemble the name you grepped for. This expands your
search parameters. Now you have to look for references to Foo, but also
references to TheBar. Yeah, it happens.

There's more, of course, but this is how it starts.

This is the kind of stuff which keeps me busy.

]]>Carrier grade NAT vs. IP devices for ordinary peopletag:rachelbythebay.com,2013-08-22:nat2013-08-23T06:23:23Z
I noticed something interesting with a bit of consumer electronics the
other day. After being plugged into a network, you then configure it by
going to the company's web site. The device itself has no knobs,
switches, or indeed, any way to get into it. Assuming your network does
DHCP and grants the device a route to the outside world, then it'll be
reasonably happy.

How do they manage to match up your eventual visit to their web site
with the device you just put online? They don't have you install
anything weird and interesting on your computer. It's just a visit in
an ordinary, unmodified browser.

As best I can tell, they just match you up to your device by the public
IP address -- that is, the one they see you using relative to their web
site. After all, it's a pretty good bet that if you're coming from a
given cable modem or DSL address, odds are, your device is too. You
probably don't have one network for your device and another for your
computer in your house.

Thinking about this some more, I wonder what will happen at the point
that things like
carrier-grade NAT
really get cooking on the Internet in large quantities. It seems that
if one of these ISPs winds up with more customers than public IP
addresses, something will have to give. I imagine more than one
customer might show up from the same ISP IP address. If two of these
boxes are being set up at the same time, that could make life
interesting.

This probably doesn't happen that often, but think about Christmas
morning or something like that. Lots of people turn on their new scale
or thermostat or video sender device, and it phones home. What if their
neighbors also got one? All it would take is one really popular device
some holiday season, and life would get interesting.

There's also the possibility of having multiple external addresses for
the same household. I don't imagine that would be as common, but if
you're already going for forced NAT craziness, what's one more step?
Any of these scenarios could turn a design with good intentions into one
of trouble and pain.

I don't have any good and easy solutions to this. I seem to recall that
some devices used to come with these nasty little "helper" programs that
would only work on specific builds of Windows. It would install from a
CD and would then do some black magic on the network to look for the
device. Then it would jump in and set it up with some proprietary gunk,
and if it worked, then the device would be online.

Thinking of it now, some of these things probably just added a
temporary alias to your network adapter to put it into a hard-coded
network just long enough to reach the device... and woe to those who
actually used that network in real life.

]]>Pull the plug alreadytag:rachelbythebay.com,2013-08-20:reader2013-08-21T01:31:27Z
It's been
over a month
since my last check, but what's this...

... yep, Google Reader's "Feedfetcher" is still going strong. It's also
claiming a bunch of subscribers, but that's impossible, right?

Did you ever watch Ferris Bueller's Day Off through the end credits? He
hands out some
good advice.

It's over. Go home.

]]>My new development environment solutiontag:rachelbythebay.com,2013-08-16:virt2013-08-17T03:02:38Z
Earlier this week, I
wrote
about my woes in trying to do my usual brand of dev work on a Mac
machine. I went to some lengths to get it to work, but kept running
into some subtle problems here and there.

I got a bunch of feedback about this, and I figured I'd reply to some of
them publicly just to clarify my position on my UI.

One reader asked if the Mac OS still supports X11, since it might help.
Well, sure, it does. You have to jump through some hoops, but you can
in fact get an X server on the machine. Then you can kick off X
programs and they will pop up. You can even ssh to another machine with
X forwarding enabled and run something (xterm, xclock, xeyes, whatever)
and it will actually start the X server and then open a window
for the program. This is pretty neat.

But... it's still running within the overall scope of the Mac OS. The X
stuff fits into it, but doesn't replace it. It's the window management
regime of the Mac OS which bugs me, so that's not a fix, unfortunately.

...

Regarding my colors, I actually shifted to green-on-black after buying a
beautiful new monitor in 2011. It's just too bright and powerful to run
the black-on-white thing I do on my Mac laptop display. Even the usual
bright backgrounds of web pages are too bright to look at with that
thing sometimes.

Did I mention that I wind up using it at night, and the ambient light in
the room isn't exactly high? I'd have to boost the light in the room
considerably to make that monitor not be the brightest thing in it after
dark. I'd really like to have something like f.lux running on there,
but it seems to be binary-only incompatible craziness on Linux.

I guess if I had to actively use that screen with bright backgrounds
then I'd have to sit down and figure out a way to dial down the
brightness without turning everything into mud. It's a tricky balancing
act.

...

I was off a little about Terminal.app and spaces. It looks like you can
have some windows on one space and some windows on another, too. I
forgot this was possible since I have it set to be on all
spaces on my personal laptop. This is because I tend to want them to
stay around no matter whether I'm in Firefox (one space), Mail.app
(another space), iTunes (yet another), and so on.

...

I think I also said something about switching terminal windows with
CTRL-TAB. It's actually CMD-`... that's a backtick: right above tab
and right below esc. It's close enough to almost be the same movement.
Neither of them loop in all of the windows across all apps, though.

...

I should note that it's not possible to solve all of my woes by
switching terminals. iTerm is not the answer, in other words. It would
take something akin to replacing the OS X window manager to really get
things the way I like them, but that's just not an option. The system
is not built to let that be negotiable. You take what they give you or
you leave it.

Note: I mean window manager in the X11 sense. There seem to be "window
managers" for the Mac which amount to "things which will shovel windows
around within the existing regime". That's not at all what I mean. A
WM in the X11 sense is basically responsible for taking an ordinary
rectangle and giving it borders, scroll bars, title bars, making them
movable and resizable, and all of that.

...

Anyway, there's actually closure to this story. Someone introduced me
to VMware Fusion. I had no idea that anything like it existed on the
Mac, and that such a thing would actually run full-screen and not just
in its own window (or worse, as multiple first-class OS X windows).

So, I installed it and dove in. It turns out the folks at VMware have
obviously been paying a lot of attention to new user experience. For
instance, it realized I was installing Ubuntu and did things to make it
work properly and removed some manual setup steps I'd ordinarily have
to do.

An hour or two later, I had switched everything over to my new
virtualized Linux environment. I now run it full-screen and get all of
my work done in there.

I also managed to rip through two good-sized projects in the day and a
half I've had it running.

Mac fans, don't take it personally. Everyone's different in this
regard. I'm sure my setup would grate on you just as much as this
problem grated on me.

It's fortunate that we have options in these matters.

]]>Quantifying your ideal dev environment is surprisingly hardtag:rachelbythebay.com,2013-08-13:dev2013-08-14T03:46:09Z
You know that saying "you don't know what you've got until it's gone"?
Well, it's starting to make sense. I could never quantify my usual
development environment until I started trying to function without it.
It just felt "off", but I couldn't make a list of things which bothered
me. I never kept using it for long enough to figure it out.

Now, I've been using a Mac as the sole driver of the screen(s) in front
of me for the past couple of weeks, and it's been interesting. There
are some bits I was able to get back, but sometimes the utilities I
added to help make things more adjustable wind up causing their own
problems.

All of this is getting me to get a handle on what it is I actually
"miss" when I'm using some foreign dev environment. Some of these
things can be duplicated on new systems, but others are just too hard to
get working.

Here's what I've been able to determine so far.

I like having enough space to see at least four terminals without
overlap. This has actually grown over the years, since I used to run X
at 1024x768 on a 17" CRT and my aterms (as in AfterStep) used to
overlap. One was buried in the NW corner of my screen and the other was
in the SE, and they'd overlap a small square in the center. I
could handle that. Over the years, things slowly improved to where I
could have several terminals running with no overlap.

This is because I usually have one (on top) for editing code and one
(on the bottom) for compiling and running that code, with a third for
auxiliary stuff. That is, up top, I might be editing foo.cc, and on
the bottom might be doing my "bb lib/foo && bin/lib/foo" calls, while
on the third terminal I have "less" running with some header file for a
library I'll be calling.

Usually, if I have more open, they tend to get paired up the same way:
editor on top, running code on the bottom. If there's client/server
stuff going on, then that's at least four (dev, dev, run, run), plus
others as needed.

The Mac has no problem with this. I have a nice big screen with a crisp
resolution and I can fit tons of terminals on it. That part is fine.

What bugs me, however, is that I can't partition them nicely. The
terminal windows apparently all have to be on one "space" or another (or
whatever they call it this year).

I tend to use another workspace for "heads down hack mode" stuff. In
Fluxbox, that's just an ALT-F2, F3, ... away. My main screen (via
ALT-F1) has a bunch of log tails running so it's not good when avoiding
distractions is desirable. On the Mac, I'd be trying to escape my IM
client, IRC client, web browser, mailer, and more.

On the Mac, I have those spaces, but they come at a terrible price. The
entire screen slides off and slides back on when you switch between
them. You can't just flip, flip, flip. The actual keys you use to
select them don't really feel right, either: like you can go left and
right, but not direct to a particular screen.

I may simply have not found the right key combos yet, but that "let's
slide everything around" stuff really stinks. I don't know how to turn
it off. If you ever have your Mac laptop connected to an external
monitor, this actually gets worse, since it slides the whole mess.

I like ALT-TAB to cycle between windows. I also like it to stick to the
things which I have on screen. This way, I can have a workspace with
four terminals running, and I won't ever get pulled away from them. If
I want to go to another terminal in another workspace, I can just ALT-Fx
to that workspace first. Easy.

On the Mac, every program is always eligible for ALT-TAB targets, so I
have to be careful about what the next one in order may be. Lots of
stuff will wind up flopping back up in my face right after I
deliberately hid it if I don't hit TAB enough times before letting go of
ALT.

Despite all this, you can't flip between terminals that way on the Mac.
They are all windows which are part of the single program, so you have
to CTRL-TAB between them. It's a small adjustment, sure, but it's
bothersome. I thought I'd like the whole CMD-1, CMD-2, ... thing to
pick terminals, but it turns out to introduce other problems.

I really like my #1 to be the window nearest the top, my #2 to be below
that, and so on. The problem is that those numbers are determined by
the order in which you've opened them, or re-opened them if you closed
some in the series. Sometimes, I'll start doing something, and the
thing I'm doing will "belong" in a certain place on the screen. Maybe
I'm editing source. I want that near the top of the screen. Maybe I'm
compiling. I want that near the bottom.

However, if that "editing source" window is not #1, then it's going to
be in the wrong place in the window order, and I'm going to do CMD-1 and
get the wrong window. ALT-TAB doesn't care which one is which, since
it's entirely based on which ones I've been to recently, and it's fast
enough to just keep going until I get the right one. Remember, there
are only a couple on any given workspace. The others aren't up for
selection.

Then there are my hotkeys. I have the infernal Windows "Menu" key bound
to a short script which will kick off a terminal. I used to just run
rxvt directly, but since switching to urxvt found myself needing to
force things which really called for a script.

For anyone who really wants to run terminals like me for some reason,
this is the script:

(No, you are not expected to understand or like my font or color
choices.)

The point is that I can pick a space (ALT-Fx) and then just smack Menu a
few times and will receive a bunch of terminals which are ready to go.
That brings me to two more things: window positioning and size.

I can hold down ALT and use a left-button-drag to position a window. I
don't have to grab it on the title bar or anywhere else in particular.
If I can grab it, I can move it. Likewise, ALT with a
right-button-drag will let me resize any window I can grab.

So, when I'm ready to do something new, I usually go ALT-F2, then Menu a
couple of times, reach for my mouse and slide them around into whatever
configuration fits what I'm about to do, then I let go of the mouse and
use the keyboard from then on out. I actually push the mouse out of the
way much of the time, since it's pointless when writing programs this
way.

I've talked about my "F9" maximize keys somewhere else in the past, so
I'll be brief this time. In short, I can maximize a window vertically,
horizontally, or both, or restore it back down using F9 with or without
various modifier keys.

F9 is one of those keys which are hard to miss because there's a big gap
between it and F8. It also means I can come at it from below and not
hit anything else on the way.

When you're trying to go quickly, this sort of thing matters.

There are programs like BetterTouchTool and Zooom2 and iTerm which
attempt to provide some of this behavior on the Mac. I have not found a
combination which has managed to duplicate all of the above yet. One of
them has actually introduced a failure mode where Terminal.app will stop
accepting input until I click on the title bar (!) at which point it
will act like I pasted in three or four full lines of garbage (!!). The
results of this in my shell or editor are anything but pretty.

I'm pretty sure that a $300 eMachines laptop would make a great
terminal-running machine. I'd give it a real external keyboard, a
wireless mouse, and a nice big screen, but the machine itself doesn't
need to be much. Running X and a bunch of xterms just isn't asking that
much of a machine any more.

I'm also pretty sure it would never give me stupid beachballs and
unexplainable lag when all I wanted to do is open a terminal and start
typing. If it did lag, it would be because I asked it to do something
big right then and there, and it would make sense.

I tried. I really did. I think it's time to give up and move on to
a known-good situation.

]]>The year-long bugtag:rachelbythebay.com,2013-08-10:year2013-08-11T07:14:20Z
The age of a problem, bug report, or ticket can be a signal of
complexity, but it must not be taken alone. There are other things
which are important, and one of them is the individual who winds up
working on it.

I've showed up in situations where there has been some "it does that"
problem with a system that's been around for a year. Somehow, it gets
assigned to me, and I start sniffing around, trying to figure it out.
One thing I try to do is reproduce the problem. Does it even still
exist a year later? Let's say it does. Yep, that weird thing
definitely happens when you do this other thing. Guess I should try to
fix it.

What happens next depends on what sort of system it is. Is it
something common which exists everywhere? Is it something like Linux,
or one of the BSDs? Apache or MySQL? sendmail, postfix, or qmail? Or,
is it some proprietary system for which no analog exists on the outside?
Is there any possibility I could have worked on this thing before, or is
it entirely new to me?

Let's say it's an Apache system which is doing something weird. Odds
are, I've played with something like it in the past. Even if this
specific anomaly is new to me, at least the general neighborhood is
familiar. There isn't a whole stack of knowledge required just to find
things. That's when "domain experts" can be useful.

On the other hand, what if it's the big bag of proprietary stuff? It
could be something which evolved organically to solve some problem, with
its own rules, layers, protocols, user interface guidelines,
precedent-setting decisions, and yes, bugs. Actually fixing that is
going to take far more work. There's a whole body of knowledge which
must be acquired to properly understand the context of the problem, and
only then can a real fix be created.

Otherwise, you're liable to just slap yet another patch onto a system
which might already be nothing but patches. Which one will be the one
which finally brings the whole thing down? Or, you might do
something which immediately conflicts with a decision made somewhere
else, because they already decided they liked it this way. Also, if
it's been open for a year, assigned to people who have already been
working on this stuff for at least that long, the fact they never
solved it in that time does not bode well. In theory, they're the ones
who already have that body of knowledge, so whatever it is must not be
a simple fix as far as any of them know, or they hopefully would have
patched it already.

This might be the sort of thing you assign to a person for "trial by
fire" purposes, where you give it to them because you want them to learn
the entire stack and become a new member of that team. The path through
solving this problem will establish the necessary bits of data in this
new person's head which will make them productive later on.

I would not characterize such a task as the sort of thing you hand to a
"tourist" -- someone who's just stopping by to help tidy things up, and
is not expected to know the entire system. You might also call them
mercenaries: you bring them in for a quick fix, but not a total system
rewrite.

In such situations, it's the duty of the "tourist" to speak up and say
something, and then take decisive action. Either they drop the tourist
designation and commit to becoming part of that team, or they drop the
task as inappropriate.

Anything else is just a recipe for sickness and stress.

]]>Positive connectionstag:rachelbythebay.com,2013-08-07:adopt2013-08-08T02:48:34Z
Have you ever rescued a cat or dog from a shelter? If you get lucky and
find the right one, it becomes part of your life for what is hopefully a
long time. You look forward to seeing it, and it looks forward to
seeing you.

It's not that you expect to come home to a cat or dog. It's that
you're coming home to that cat or dog (or whatever you like).
You didn't adopt an animal, you adopted that animal.

I imagine it works much the same on the other side. They are happy to
have been recognized as a specific someone, and not just a cat, or a
dog, or a generic pet for that matter. There's no "pet spot" in your
life. It's a spot for exactly them.

Now then, have you ever been on their side of that? Have you been
somewhere that expects you to be there, and not some generic
creature which does the things you do? It's a place where they go
looking for you and only you, and are delighted when you show up.

That sounds like a really good situation.

]]>Turns to avoidtag:rachelbythebay.com,2013-08-04:map2013-08-05T03:44:13Z

The above is how Apple Maps in iOS 6 renders a pretty busy interchange:
Central Expressway at San Tomas Expressway. They get the names right,
all of the connecting roads are there, and they even put the right road
(Central) on top.

There's just one tiny little problem.

Neither of the circled spots are usable for traffic. If you're in a car
and try to use either of those ramps, you're going to have a supremely
bad time. Why? Easy. They aren't actually open, and haven't been for
a very long time -- longer than I've been living here, for sure.

Here's a look at what you see if you're going west (left) on Central and
want to go north (up) on San Tomas:

If you're dumb (or awesome, funny how that works) enough to drive off
the main roadway surface, bounce over the telco access hole, and smash
through the fence, you immediately encounter something else:

That appears to be some kind of reinforcement for a vault cover. I'm
sure it's very heavy and solid and would make an even bigger mess of
your car than all of the other obstacles.

Now, that said, if you somehow managed to get past all of those, you
would in fact reach San Tomas. The rest of the ramp appears to be in
a usable, if neglected, state, and you could pop out down there and
freak out a bunch of unsuspecting drivers.

The other ramp (coming from San Tomas) also has a nice little surprise
for you. Right at the beginning of that ramp, someone (probably the
county roads department) put a storage shed in the middle of the road.
No, I am not kidding. Look:

You'll notice I stood behind this one to get the picture, since standing
in front of it would have put me in the middle of San Tomas.

But, hey, if somehow you manage to go around, over, or through (!) the
shack, you wind up on a ramp which still exists and seems to be just
fine.

Google Maps, Bing Maps, and MapQuest do not show these connectors.
OpenStreetMap shows them as some kind of trail or track but definitely
not a road. Apple stands alone.

Go on, crash through those barriers. Your GPS said so.

]]>My, that grass is rather green, isn't ittag:rachelbythebay.com,2013-08-03:bias2013-08-04T05:27:45Z
What does it mean when the only people who come out in support of
something are those who keep saying things like "I work for X, but this
is my own opinion, not theirs"? It just seems mighty odd. A
technology, service, product, or other thing which could stand on its
own merits should do it based on them.

In other words, if something is any good, then someone from the
community at large will probably speak up about it. If, however, you
only hear from the same half-dozen or so people who are unambiguously
connected with the company which made it (while somehow claiming to not
speak for them), what's going on there?

Back in the early days of Slashdot, people used to accuse Microsoft of
"astroturfing".
I never heard whether that was proven or not. However, there's no need
for a time machine. I think there are still plenty of examples of it
going on right now with the forums and companies of today.

I like the times when the ordinary people who worked for a company just
shut up about it and let other people (users, customers, ...) do the
talking about it, or let PR do their jobs and generally stayed out of
it themselves.

The names may change, but the actions remain the same.

]]>Your simple script is someone else's bad daytag:rachelbythebay.com,2013-08-01:script2013-08-02T04:08:48Z
Let's say you're a programmer and you have to automate a sequence of
tasks. The end result isn't important, but if it helps, think of it as
something like baking a cake, changing the oil in your car, or whatever.
What matters is that there are quite a few steps in this sequence, they
have to happen in the right order, and they all have to succeed.

How do you handle this? Assuming a Unix type environment here, if it
was originally just a bunch of commands people ran by hand, it might
have started life as a cheat sheet and perhaps became a "runbook" entry
at some point. Then, I suppose you might turn it into a shell script.

If that's how you handle it, you're walking a thin line. If all of
those steps actually succeed, then sure, okay, you win, and it's
probably an improvement over the old manual processes. But no, I'm not
going to let you get off that easy.

Pick one of the steps in this file. Now imagine it failing. What
happens next? I imagine all of the subsequent steps will still be
started, since there's nothing happening to check for success up there.
They'll probably fail, since I imagine they depend on the prior steps
actually working. Switching it to use bash and adding a "set -e" would
be a good start. Even just having some "|| exit 1" blobs appended to
some of those calls would be better than nothing.
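As a sketch of the difference (the steps and the failure are invented for illustration): with `set -e`, the first failing step stops the run instead of letting the dependent steps fire anyway.

```shell
# Write out a tiny "runbook script" with set -e, where the drain step
# fails, then run it and observe that the later step never happens.
cat > /tmp/oil_change.sh <<'EOF'
set -e                  # bail on the first failing step
echo "draining old oil"
false                   # the drain step fails here
echo "adding new oil"   # must NOT run after a failed drain
EOF

if bash /tmp/oil_change.sh > /tmp/oil_change.log; then
  echo "all steps succeeded"
else
  echo "aborted after first failure"
fi
```

Without the `set -e` line, the same script would cheerfully print "adding new oil" after the drain failed, which is exactly the mess described below.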

Without those checks, what happens if the subsequent steps run, and
actually manage to get in some weird state because they ran when they
shouldn't have? It might even make it unable to run again later without
manual intervention, since now it won't be starting from a clean slate.

I find that sometimes when such processes exist, the best way to make
them work is to start from scratch every time. In other words, if it
doesn't run all of the steps all the way through successfully, running
the script a second time probably won't help and might even make things
worse. It may be necessary to wipe out all of the work and do it from
the beginning to get a usable result.

To call back to the example of changing the oil in your car, imagine
a simple-minded automaton like a robot. It takes things literally, no
matter what sort of hilarity and craziness might result. Now let's say
someone writes up a list of instructions for changing the oil in the
car, and one of the steps is literally "add five quarts of oil". If
everything works properly, you're fine. It's the corner cases where it
gets interesting.

An early version of the script would probably not catch errors, and so
would continue running the later steps even when it should have bailed
out. This means even if your dumb oil-changing robot fails at draining
the old oil, it still tries to add new oil! This creates a mess and
potentially a hazmat situation. Way to go.

So you add error checking, and now every single step is tested to be
sure it didn't return an error code. If an error occurs, it bails out
and lets someone step in and take a look. They fix whatever's wrong and
run it again. This probably works out most of the time, too.

Of course, this eventually fails far enough along that the new oil is
now in the car, and re-running it causes the robot to needlessly drain
that new oil as it follows instructions. Hopefully, the programmer
comes along and adds some checks so it won't do that. Or, maybe they
do some clever checkpointing so it can restart at the failed step
instead of re-running all the early ones.

That works for a while, then it fails during the "add oil" stage. It's
trying to follow the "add 5 quarts of oil" instruction, and runs out of
oil about halfway through the fill stage. It errors out. Someone gives
the robot more oil to use, and restarts it. It restarts the step.

It again tries to "add 5 quarts of oil". Of course, the car already had
about 2.5 quarts in it, given that it ran this step about halfway
before, and so it overfills the container, and spills out, and yes,
there's another mess and possibly a call to the fire department's
hazmat team as it heads for the storm drain. It's embarrassing.

It's around this point that someone figures out that the command needs
to be something more like "fill with oil until there are 5 quarts in the
car", but of course that also fails spectacularly the first time a car
with an oil leak comes through. The robot keeps filling and filling and
filling and never hits the 5 quart mark. Meanwhile, the fire department
has been called once again.

Someday, someone might eventually manage to implement it properly with
the proper amounts of paranoia: "Add new oil to the car, using up to 5
quarts, until the car has 5 quarts in it. Also, look for leaks and stop
immediately if anything escapes from the car."
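That final, paranoid instruction might look like this sketch, using integer tenths of a quart so plain shell arithmetic works; every number and name here is invented:

```shell
# Measure-then-top-up, with a leak check. Restartable by design: if a
# previous run left 2.5 quarts in the car, this one only adds the
# missing 2.5 instead of blindly pouring another 5.
current=25    # tenths of a quart already in the car
target=50     # want 5.0 quarts total

pour_half_quart() { current=$((current + 5)); }

while [ "$current" -lt "$target" ]; do
  before=$current
  pour_half_quart
  # paranoia: the level must rise after each pour, or something is
  # escaping from the car -- stop before the hazmat team gets called
  if [ "$current" -le "$before" ]; then
    echo "leak suspected, stopping" >&2
    exit 1
  fi
done
echo "done at $current tenths of a quart"
```

The interesting property is that the loop's goal is a state ("the car has 5 quarts") rather than an action ("pour 5 quarts"), which is what makes re-running it after an interruption safe.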

It takes a non-trivial amount of work to automate a process which can
notice failure, be restartable, and not create messes when interrupted
in arbitrary locations. In terms of dev time, it's probably faster and
definitely easier to just list a bunch of steps and do no checking, but
that just shifts the load onto someone else when it fails.

I guess the question is: are you feeling lucky? How about your users?

...

Wikipedia has a
few things to say
about some of the fundamentals here.

]]>Efficiency, yes, but for how longtag:rachelbythebay.com,2013-07-30:efficiency2013-07-30T12:11:36Z
I heard a story the other day about a company which decided to donate
efficiency experts to a charitable organization instead of just
throwing money at them. This was hailed as a marvelous thing, and
apparently it made quite a difference and found all kinds of places
where they could get more done with the same resources. Much praise was
passed around for this accomplishment.

I hear about a story like that and wonder if there's more to it. This
isn't meant to diminish the contributions of the experts or the company
which provided them, and certainly not the outcomes. It's more of a
thought about how "sticky" these changes might be. It's about root
causes and fundamental foundation type stuff.

Let's say the charitable organization is C, and they are managing to
accomplish 10 units of awesomeness for every 5 units of donations. Then
this efficiency expert comes along, and now C can deliver 20 units of
awesomeness for the same 5 units of donations. Sure, that's a doubling
of output, and that by itself is great, but I'm wondering what happened
to make that necessary in the first place.

Obviously, something wasn't quite right in the original configuration of
C. Some of their policies or techniques weren't delivering their full
potential. What I want to know is why this deficiency existed in the
first place. I'm not even assuming anything nefarious here. It's
probably just as simple as ignorance -- someone in the system didn't
know about a technique, and these outsiders taught it to them. With the
skill acquired, now they can be more effective. Okay, that works.

But of course, I don't want to stop there. Now I want to know why this
ignorance of those techniques existed. I wonder if they would have
discovered them on their own if they had purposely gone out looking for
fresh ways to do things. That leads to my next question: were they
actually looking for those fresh ways, or were they just keeping things
about the same from year to year?

Let's say they were just leaving things alone and weren't actively
trying to improve efficiency. This would also make me ask some
questions. Why were they sitting still? Did they think things were
okay as-is? Or, more likely, were they just too busy to handle anything
new? Remember that being busy can be a real thing, or it can be a state
of mind caused by stress or any number of other problems. I'd want to
know which of those might have been happening.

Why do I ask questions like this? Well, I wonder what happens when the
efficiency experts are gone. They go back to doing whatever they did
before they were donated, and the recipient organization just sort of
coasts along on whatever procedures they may have changed or added plus
whatever learning they may have picked up. In the short-term, all is
well.

What happens in the long run, though? People change jobs and even move
to other companies. They even retire sometimes (imagine that with this
economy). When they go, they might "take knowledge with them" in the
sense that nobody else might know about something that would help the
organization be more effective. This probably gets worse with time.

Does the organization ultimately regress back to a less-effective state?

I assume that unless there's been some deliberate design change to make
sure someone cares about that in general, it will eventually decay back
to nothing.

]]>Post frequencytag:rachelbythebay.com,2013-07-28:freq2013-07-28T21:10:49Z
I've been thinking about the quantity and frequency of my writing these
past few days. As it is, I've been writing daily for a bit over two
years, with some days getting two or even three posts. This adds up,
and there are just over 1000 posts total at the moment.

This has some unintended side-effects. I don't want people to read
anything into a gap in posts. It doesn't mean I have something big
queued up or anything special like that, and it doesn't mean I've run
out of topics. It just means I didn't post that day.

Just like I can't predict whether I will post every day or not, I also
can't predict whether I'll take any days off.

Wait and see, I guess.

]]>Good programmer with good libraries got something donetag:rachelbythebay.com,2013-07-27:mach2013-07-28T06:43:14Z
There's a
slide deck
going around describing how a C++ program written in 2007 did a number
of bad things, and didn't hold up under load in 2012. The replacement,
written in Go in 2012, eliminated a great many problems and simplified
operations. It also brought the code back under the "maintained"
umbrella, whereas previously this service apparently had gone
unmaintained and unloved, despite having countless active external
users.

I never heard of it before the other day, for the record - not the
original implementation and certainly not the replacement. I don't even
know who wrote the original version.

The C++ vs. Go part of it doesn't seem that interesting to me. What
interested me more was that the original code seems to have gone to
lengths to handle many things by itself. It's almost like there was no
HTTP server code available as a library. Of course, that would be quite
a claim, considering all of the other things which are built on
top of HTTP and are built in C++ from that same code base.

Why did the original author or authors do that? Who knows.

The first part which really grabbed me was one bullet point: "C++
single-threaded event-based callback spaghetti". Ah yes. I've seen
that before.

This makes me wonder: just what sort of problem is this, exactly? On
the surface, it sounds like something that just takes a bunch of HTTP
requests and pumps out data. I imagine something which figured out the
path from a GET against its local mapping of paths to files and then
just threw bytes at those file descriptors would be a good place to
start.

If you started from absolutely nothing and had no help from libraries
and wanted to do this the fairly old-school way without things like
epoll, what would this look like? It seems like you'd probably write
something which does the usual socket + bind + listen thing to create a
listening file descriptor on some TCP port. Then, you'd have a main
loop where you create a fd_set and put that listening fd in there
before calling select. (More will be added here, but sit tight.)

The return code of select dictates what happens next: if it's negative,
something potentially bad happened, but it might just be an
interruption from a signal. You might need to bail out, but odds are
you just need to "continue" and restart the loop. If it returned 0,
then nothing was active, so you restart the loop. Otherwise, it's time
to see if the listening fd is active.

If that's the case, then you get to use accept (or similar - see
accept4) to turn that incoming connection into a fd, and sock it away in
some kind of client structure. This means the first part of the loop
now grows to add the client fds from this structure to the fd_set, and
the middle now grows to check those same fds for read activity much as
the listen fd is checked already.

If one of those client fds should have activity, then you have to read
from it and see what happens. If it returns less than 1, something
failed and/or the client disconnected, so you get to deal with that -
call shutdown, close the fd, and clean up the client state info.
Otherwise, it's time to figure out what they sent us. This means
accumulating enough characters to be able to parse it all out into
something usable. That might mean a state machine of some kind - more
flags and buffers in that per-client structure, I guess. It might have
to hang onto these bytes across multiple read calls, since you might
get "GET /" and "foobar.zip HTTP/1.0" in two different calls. That's
supposed to be an HTTP/1.0 request for "/foobar.zip", not an HTTP/0.9
request for "/", after all.

Let's say all of this is figured out, and it's understood what the
client wants, and it maps onto something which can actually be
fulfilled. Now it's time to start sending data to the client. Okay, so
you open a file descriptor and start copying bytes, right? Well, yes
and no. Unless it's a tiny file, odds are a single call to write isn't
going to cut it. There are buffers for network I/O, but they usually
won't absorb more than a few KB. Then the write will return with EAGAIN
or similar, assuming you set nonblocking mode -- you did
remember to do that, right?

Now you have to remember that this particular client connection is in
"push file mode" so you can come back to it a bit later. This also
means changing up how you call select() in that main loop, since you
probably want to have it tell you when those fds for clients in the
"file sending" mode become writable. Otherwise, you'll end up making a
bunch of pointless "would block" write calls over and over, and that
would mean using select with a zero delay value so you can get back to
those write attempts. That sucks a serious amount of CPU, so that's
bad.

Anyway, this means another way for select to return a positive count,
and now you have
to also check all of your "file send mode" client fds for membership in
the writable set. If any of them come up, then you get to (attempt to)
push another block, and see if you hit EOF while doing this. If so,
then you get to make another decision: do you hang up on them now, or do
you somehow reset for another request? That is, do you support things
like HTTP/1.1 pipelining? If so, how did you deal with bytes coming in
from the client while you were still pushing that file to them?
Even if you don't do pipelining, you have to watch out for what you wind
up doing with that input side of this connection. Closing it down early
might have
suboptimal results.

(This might be a non-issue depending on what sort of frontends are
sitting between you and the user. Cough.)

At this point, in theory, you have this big round-robin thing going on.
It looks for new connections and adds them. It looks for bytes waiting
to be read from clients and reads them and feeds them to a parser of
some kind. It also potentially pushes bytes out if those connections
have gotten that far, and handles whatever happens when it finishes
sending. It also cleans up when the work is done or the connection
goes away abnormally.

Let's say you do all of this perfectly, and it manages to support all of
the specs which are required for this server project. With all of this
work done, there's still something out there, taunting you. This thing
is only ever going to run on a single CPU. It might eat that entire CPU
if you give it enough work (or program it badly), but it can't possibly
spread out to use others. If you want to use the other processors on
your machine, you're going to need to run multiple instances of the same
code somehow.

What now? Well, threads are one way, but you can't just throw threading
into this situation without creating a big mess. Threads by their very
nature share memory because they're all in the same space, so you have
to be very careful about data updates. How do you delete a client from
the state structure without interrupting someone else who might be
iterating over it? You get the idea.

Forking might give you some options. If the program starts up, creates
the listening file descriptor, and then forks, I imagine all of the
children would end up getting different clients. That is, only one
child will manage to "win" the accept() call for a new incoming
connection. Of course, the others might end up waking up in select(),
only to have accept() fail since someone else got it. It's the
"thundering herd" thing all over again, and you'd have to figure out
exactly how your system and implementation would handle this.

So now you start thinking about having just one process run accept().
It sits there and waits for new incoming connections. Then, once it has
a viable file descriptor, it can pass the file descriptor to one of its
children. Yep, this is actually possible, if you use a Unix domain
socket and do some ancillary data magic on a system which supports it.
This changes the children so that instead of having the real listener
fd, they now look for activity from their parent, and read those
messages to acquire new clients.

If you did all of this correctly and balanced it sanely, then you have a
decent chance of fanning out your incoming connections to different
processes, and you might actually get to use more than one CPU for your
serving duties.

It also means you have duplicated the efforts of a great many people who
have come before you, possibly even at the same company. Their code
might even be right there in the depot in a sufficiently generic form.
Unless your needs are extraordinary, odds are you can probably get
things done by using those libraries. If it's not a direct match, you
might even be able to add those features by submitting a patch or just
by wrapping it with some of your own code.

They've probably already solved the parallel-HTTP-serving problem. I
mean, if you work at a company where the web is their lifeblood, you'd
think there would be a good solution to this already. Otherwise, how
else is everyone else getting anything done?

That's really what this is all about. It's about having solid, useful,
and proven libraries, knowing about them, and using them. The
replacement project here had a single person at the helm who knew about
a good environment and went and used it to good effect. The original
project may have had multiple programmers who may or may not have
needlessly reinvented the wheel -- it's not clear exactly what happened
there.

Without a lot of internal proprietary information, it's impossible to
say how the original program came to be or why. I've done post-mortems
for such things in the days when I had access, and it's no small task.

]]>Binary editing thwarted by a simple optimizationtag:rachelbythebay.com,2013-07-26:opt2013-07-27T05:14:18Z
Some years back, I was trying to get rid of the thing in my IM client
which would send the "so and so has closed the conversation window"
message. This was an annoying misfeature which could not be disabled
through the UI. I needed to get rid of it since it was sending the
wrong impression to my chat friends.

When I'm at work, sometimes I like to "garbage collect" my chat windows
by closing the ones which haven't been active for a while. It's nothing
personal, and it doesn't mean I "hung up" on someone. It just means I
don't want their text in my face any more since it's not currently
serving a purpose. So, I'd close the window.

Unfortunately, some version of Gaim (or was it Pidgin by then, hmm...)
changed things so that it would emit a message to the server, and that
would then bounce out to the other person who would find out. I decided
to turn it off. The trouble was that I was running a binary version at
work and didn't want to go through the mess of rebuilding it from
source. I figured I could just find the string and change it to
something else so it would send a harmless no-op message to the server.
It was easier than trying to find the actual code which sent
the message and NOPping it out.

I set to digging around in the binary, and hit a snag: the string in
question didn't occur anywhere by itself. This was bizarre because the
string itself was right there in the source.

Still, looking for "gone" didn't turn up anything useful. The closest
thing I found was "host-gone" which was used for something else
entirely. I stared at this one for a minute or two and then it occurred
to me: the compiler was probably doing something clever.

Think about it in terms of pointers. That literal "gone" and the other
literal "host-gone" can both be expressed with a single entry in a table
somewhere and two slightly different pointers. References to
"host-gone" would amount to "ptr", whereas "gone" would be "ptr + 5".
Since it would never change, it was safe to point both of them at the
same memory.

I verified this by taking the source and changing "host-gone" to
something else. Once that happened, "gone" popped up as its own entry
in the symbol table.

This pretty much established that it would be impossible to merely patch
a string in the binary to turn off this unwanted feature. I wound up
having to patch the annoyance out of the source and run a custom binary
instead. This was suboptimal but it was better than having people
feeling offended because I decided to clean up my desktop.

]]>Some half-baked thoughts on scoped sharingtag:rachelbythebay.com,2013-07-25:strata2013-07-26T05:06:06Z
Recent developments in the news have me thinking about information
classification. There's the whole compartmentalized thing, top secret,
and all kinds of other codewords and other labels which come up
depending on the context. That's all well and good for government
agencies and companies, but what about individuals and their personal
data?

There's a term from the world of
NTP
which comes up when configuring time synchronization: stratum. The
actual clock devices or receivers are thought of as "stratum 0", and
the number only increments from there. Stratum 1 is a system connected
to a stratum 0 device. Stratum 2 is a system which is synced to a
stratum 1 system, and so on down the line.

It seems like some of this might apply to the notion of sharing online.
I came up with a possible starting place for describing this.

"Level 0" in this scheme would be the data itself: something you know
and keep in your head. You haven't stored it in any computer system as
far as you know.

"Level 1" could then be that same data, but now it lives in your own
personal computer. This is a client-only system like a laptop or maybe
even a tablet if that's the way you roll. No other people have access
to it.

"Level 2" extends this concept a little further. Now it's also on your
own personal server which is also accessible by others. This might be a
dedicated server somewhere in the world, a VPS, or maybe even some kind
of server software running on your little bitty plastic consumer router
box. The point is: you control the server.

"Level 3" then goes beyond that, and it's when you park the data on
someone else's server. You don't have admin powers on it, and are just
a user as far as they are concerned. There are access controls on this,
so you can say that person X can see it and person Y can't, but you're
ultimately trusting that the server's code is solid and it's managed
competently... and ethically, for that matter.

At "level 4", you're still hosting it on someone else's server, but now
there are no access restrictions, or nearly none. Anyone who wants to
see it can get to it. This might happen if the security fails at level
3.

I'm not quite sure what happens after this. There might also be some
other states which I forgot about, and these five numbered levels might
need to be spread out to accommodate them. This is not meant to be an
exhaustive list, but rather a suggestion to get people thinking about
the notion.

There's one other state which occurred to me: "level null", or "nil", or
your favorite "not-really-a-value" value. This is when you don't even
have the data in the first place, so you can't store it anyway or give
it away, for that matter.

In practical terms, level 0 might be an idea for a post I think of while
out in the world and don't write down or type in anywhere. Level 1
could be when it transfers to a local plain text file. Level 2 could be
when that idea gets fleshed out and turns into a real post online, since
I run this server. It just goes on like this.

Sharing doesn't have to be all-or-nothing.

]]>Sniffing for rogue unmanaged switchestag:rachelbythebay.com,2013-07-24:layer2013-07-25T02:37:25Z
In the past week or so, I ran into an interesting sysadmin type question
online: how do you detect unmanaged switches on your network? They
don't have IP addresses or any other way to be contacted directly.
Imagine your typical "desk" switch: it might have four or five ports, a
power plug, and a couple of happy little LEDs. There is no serial
console since it is just a dumb little embedded device.

This seemed like a potentially interesting problem to solve. Obviously,
you could just brute-force the matter by picking one of your own
(managed) switches and tracing out from each port to the target system.
If you encounter another device along the way, then you have your proof.

Clearly, this sort of approach does not scale. It might be good if you
suspect a certain person and just need to find out, but it would be
ridiculous to try to do this in any real quantity. There just aren't
enough interns in the world to support this kind of manual monkey work.

You could also try to monitor the port for traffic from too many MAC
addresses. This also takes a fair amount of work, assuming your switch
even supports the kind of "mirroring" required to get a copy of traffic
on that port. If your switch supports port security and MAC filtering,
you might be able to do that, but again, that requires upkeep, since
people move around and NICs change. There has to be another way.

After pondering this one for a while, I had an idea: link state. Given
that I have control of the switch infrastructure in the area to be
scanned, that should mean that I can also administratively enable and
disable ports in software. This might mean jumping through a web or ssh
(or even telnet, gag) interface, or maybe could just be a SNMP message.
The point is, I'd be able to temporarily turn off a port.

Ideally, this would make the port look just like it had been unplugged
from the switch. If the end station is in fact directly connected to
that line, it should see the link go down. How can you find out?
That's easy. If this is a corporate environment, you should also have
access to the machine in question. Turn the port back on and go look in
the syslog.

I unplugged the cable from one of my test machines to demonstrate this,
and the link-down event duly showed up in the syslog.

Now, not all operating systems or NICs will log this sort of event, but
quite a few will. If the machine logs link state changes which
correspond to your frobbing of the port admin state, it's a pretty good
guess that they are directly connected. Here's why.

Imagine the situation where someone has a rogue switch plugged into the
drop in their office and then has a whole bunch of machines plugged into
that. When you turn off the port on your end, that makes the "uplink"
port on their switch drop out. However, all of the machines in that
office still see a link since they are plugged into the local
switch. Sure, they're off the rest of the network from a logical
perspective, but at the physical layer, they can still see something out
there.

This might fail if the unauthorized switch is somehow set to drop the
link state on its "inside" ports when its "uplink" port also drops.
This might sound weird, but I used to have a couple of fiber
transceivers which had a feature called "MissingLink". When the fiber
link went down, they'd purposely drop off their twisted-pair ports so
the switch could see there was a problem. This allowed you to use port
state in the switch as an alert condition even though the fiber wasn't
directly connected.

So let's say for some reason this doesn't work. Maybe you can't get to
the syslog on the box. Did you notice the other thing which happened
when the port went down on my test box? My DHCP client started freaking
out. Then, after the link came back up, it scrambled to recover.

It basically went right back out on the network to make sure it had a
healthy DHCP lease. You should find that many systems will do this
after they've had an "out of network experience". So, even if you can't
use the syslog, you can still use your own DHCP server logs to see if
that machine starts asking for a new lease after you bounce the port.
If it doesn't, odds are, it didn't see the port state change, and that
means something else was in the way.

Here's another possibility. Maybe you can get into the machine, but not
as root, and the DHCP thing isn't working for you for whatever reason.
Maybe it has a static IP address. You suspect it's on an unmanaged
switch with some other machines but you can't be sure. I'd set up
something to run a broadcast ping or fire off a UDP datagram to the
broadcast address or anything else I might be able to do as a mere
user. Then I'd start it running in a loop: run once, sleep 5 seconds,
run again, and so on. With this running in the background, I'd pull
down the switch port for 30 seconds or so.

That should let the broadcast generator thing run 5 or 6 times while
disconnected from my network. When I put the port back online, I should
be able to log back in (or resume my existing session... TCP can survive
such breaks usually) and look at what it captured. If that machine
logged traffic from any other system in the meantime, it's obviously on
some other switch!

Other things to watch would be the ARP cache. If it manages to grow
while kicked off the main network, then something is out there
with it. This one would probably need a much longer period of downtime
to be sure everything else expires, but it should work with no special
permissions on the machine.

I hope I never need to use anything like this, but it's a fun thought
experiment at any rate.

]]>Save the frogstag:rachelbythebay.com,2013-07-23:frogs2013-07-24T04:59:29Z
How do you grow a love of weird electronic noises in a kid? I think
part of it comes from giving that kid weird electronic devices. Sure, I
had a VIC-20, and later would own a C-64, but in the middle, there was
something else. I had a Speak & Spell.

For those who aren't lucky enough to have grown up with these things, I
will explain. It's a small electronic device which runs from 4 C
batteries and lives in a rugged plastic box with a built-in handle.
Depending on how old it is, it might have raised buttons or a membrane
keypad like a microwave.

The basic idea was that it would ask you to spell stuff by talking to
you, and then you had to key it in. The non-QWERTY keyboard seems
rather odd to me now, naturally, but back then it didn't seem to be a
problem. You'd type in the word and hit ENTER and it would tell you how
you did before moving on.

One really fun part about this system was that it could be expanded. In
1982, when E.T. hit theaters, they did a tie-in module which added
content. There were a bunch of words and other puzzles stored on this
tiny little plastic module which referred to events from the movie.

The one I'll never forget has something to do with a scene I otherwise
would have forgotten from the film. One of the prompts says "Spell
frogs, as in save the frogs... RIBBIT".

Yes, the little plastic box actually says "ribbit". I loved
that one and would play it over and over just to hear it make the noise.

Recently, I stopped by a store and bought some C batteries just to bring
this thing back to life, and hooked it up to my computer to record some
samples from it. Here, some 30 years later, this little device and
module both still work fine, and I can share this wonderful recording
with everyone:

Isn't that great? Unfortunately, as far as I know, there's nothing else
in there which even comes close in terms of hilarity.

There's more to this, though. The top row had several buttons which
could be used to select modes, and it would play a short sound when you
pushed them. It seemed to choose from a set of three or four sounds and
you could never be sure which one you would get. I caught a recording
of some of this, too.

I actually
wrote about this
back in 2012 and professed my love for the beeps and squawks of the
Speak & Spell. Back then, I was talking about it in the context of
a series of "bumpers" used by a TV station which used to play kids
shows. That old post linked to a Youtube recording of some of the
audio.

Unfortunately, that link is no longer valid since it's been marked
private, but I did manage to save a copy of it. For the sake of
preserving this data, here it is again. Listen carefully to the
background noises which become easier to hear towards the end of the
recording, and you should notice the same beep-boop-blorp stuff.

That's pretty solid evidence as far as I'm concerned. What else would
make that noise?

Almost every key on that thing would make a noise, and I'm not just
talking about mode selects and the letters. They even had a sharp funky
noise which corresponded to the apostrophe (') key for some reason. I
made a recording of the entire alphabet and added 5 apostrophes at the
end just for completeness. Enjoy.

I think this is one of those cases where emulation might not be able to
capture the whole experience.

Ribbit.

]]>Apple's dev center seems like a glorified web BBStag:rachelbythebay.com,2013-07-22:dev2013-07-23T04:27:36Z
A couple of years ago, I decided to experiment with developing for iOS
devices and signed up for Apple's "dev center". I paid the $100 and got
to write code to run on my own devices. It didn't work out, however,
and so I never released anything and let it lapse. I still have an
account in this system, but it's not authorized for anything.

That was enough to earn me membership in this particular exclusive club
this evening:

Yep, they had a huge security hole, and basically freaked out and pulled
the plug. The site has been down for several days at this point. There
have been a bunch of stories about it all over the usual sources for
nerd news, but this is the first official notification I've received
directly from them.

The actual hole seems to be rather amazing. If certain things I've been
reading are to be believed, it amounts to a remote command
execution vulnerability. Basically, you can say "hey webserver,
run this shell command", and it will do exactly that.

Want to see more? Check out this
Struts project document.
Notice how it practically tells you how to drive a truck through the
hole.

In talking with a friend about this situation tonight, I tried to figure
out exactly what this site is supposed to do. I never really used it
that much. He said that they provide beta downloads, documentation, and
forums. There's also some kind of way to configure all of the things
which go with adding an app to the store and managing the business side
of that.

This all seemed very familiar. Beta file downloads? That's a file
transfer section. Documentation? That's the library. Forums? Those
are the message boards. These are all things we had back in the days of
BBSing!

So, what about the whole "app store" management angle? That's easy.
That's a separate program written specifically for the job which
augments the rest of the system. It doesn't really connect to it in any
meaningful way beyond the base system saying "this is user so and so,
have fun!".

We used to call them doors. Most doors were things like games (Trade
Wars 2002, that sort of thing), but there were a fair number of
utilities. If you wanted to download QWK packets, for instance, that's
how you usually did it.

It sure sounds like everything they did on this site could have existed
20 years ago as a BBS. Sure, it would have involved terminal-mode
access and a bunch of typing and very little clicking, but I bet it
would have been mighty fast and rather effective.

I also suspect it wouldn't have been compromised just by having someone
ask the server to run nefarious commands which it gladly executed on
their behalf.

Just how often did a BBS get cracked through the front door? It seems
like the few "hacks" I heard about in those days all turned out to be
inside jobs, like a co-sysop gone bad, or a software author with a chip
on their shoulder. It just worked differently back then.

Sometimes I wonder if we will ever wind up back in that sort of world.

]]>My second computer storytag:rachelbythebay.com,2013-07-21:c642013-07-23T04:49:01Z
My first computer was a Commodore VIC-20 which was given to me by a
neighbor and friend of the family. His business was doing pretty well
and he tended to buy nice things for his family. In that case, he
upgraded his step-daughter to a C-64, and I got her VIC-20. That along
with a TV interface (in place of a monitor) and a cassette tape for
storage got me going. I've written about this before, but the point
bears repeating: I got into this stuff through an unusual vector, and if
not for that chance occurrence, it might not have happened at all.

My second computer also has a weird set of circumstances for how it came
to be. I was up in the Northeast for about a week, visiting family, and
on one afternoon we went for a drive to check out some part of the
countryside. While we were out there, we happened to stop into a local
gas station and convenience store type place that also hand-dipped ice
cream.

As I stood there with my parents, waiting to get my cone, I noticed a
sign high on the wall: SHERIFF'S SALE. It listed a bunch of
things: a television, a stereo, and ... a Commodore computer. It looked
like a really good price for a whole bunch of stuff, actually. There
was a location listed, and it seemed to be that same day, so I convinced
them to take me over, and so we went.

We rolled up to this house where this guy was basically in the process
of having the Sheriff clean him out to pay for some settlement. I don't
know if he was late on rent, or what, but it was one of those things
where they seize assets and sell them off to make good on your debts.
Anyway, we showed up and went looking for this guy and eventually found
him. He told us that the computer stuff "had been sold already", but
was kind of doing this nudge-nudge-wink-wink thing, and asked us to
meet him early the next morning at this diner on the main road just
outside of town.

I don't know why, but my parents actually agreed to this, and so we
came back out to that little town early the
next morning, right around sunrise. Sure enough, in the parking lot,
there he was. Apparently the Sheriff or one of his deputies had been
there when we first turned up, and since the guy was supposed to have
sold the computer off already (to pay off the debts), he had lied to
them about it. In truth, he hadn't managed to sell it yet, and had
instead hidden it somewhere so they couldn't find it.

Anyway, he popped his trunk lid, and sure enough, there it was: a C-64,
a floppy drive, a joystick, and a whole bunch of games and other
programs (like GEOS). We handed over the cash, and he moved the stuff
into our trunk, and thanked us for helping him out, and then
disappeared. He went one way and we went the other, back to where we
were staying at a relative's house.

So now I had this complete computer system with me in the middle of a
vacation. What did I do? Well, I read the manual and tried to hook it
up, naturally! We were in this 1930s house, and it had matching
(original) electrical hookups. This meant very few outlets, and those
outlets which did exist weren't even polarized. Somehow, I managed to
stretch everything out so the computer and floppy drive could get power
and could still reach the TV.

How did I get around the polarized plug angle, you ask? That involves a
little treachery on my part. Somehow, by this point in my life, I knew
about those little adapters which let you plug in a three-prong device
and which connected to the screw that held on the wall plate for grounding.
They were typically orange, so they were hard to miss. This house had a
pair of them down in the basement for the washer and dryer, so, well, I
"borrowed" them for a couple of hours.

Somehow, all of this worked out and I was able to play with my new
computer right there. Nobody died from being shocked by the appliances
in the basement, nobody tripped over the cords, and the house didn't
burn down due to my experimentation.

When it came time to go home, we took everything which was theoretically
disposable and threw it out, like the boxes for all of the games and
other parts. The computer, floppy drive, and everything else wound up
crammed into our respective suitcases, using our clothes as padding.
This was before the days of grabby-hands TSA people, so amazingly, it
all made it through okay. Back at home, the components came back out of
the luggage unscathed and were ready to run.

I have to hand it to Commodore for making their ports basically the same
across those computer systems. I was able to unplug my VIC-20 and put
it in a safe place and drop in the C-64. The modem, printer and
joystick all plugged in and worked fine. I had to swap the power brick
and switch up how the TV hookup worked (since this C-64 had its own
built-in modulator), but that was it. It all just fit together.

This second system opened a bunch of doors to me. Now that I had a
floppy drive and a 64, I could actually use the "Common Sense" terminal
program which had come with my modem instead of writing a horrible hack
by hand every time I wanted to "get online" (not what I called it then,
but you get the idea). Common Sense was a pretty weird terminal for the
time, but it supported Xmodem, and that was enough to let me download
another terminal a little more my style, like TouchTerm or CCGMS. Those
programs actually supported Punter and PETSCII mode, and thus made much
more sense when calling the Commodore BBSes of the era.

It was a series of bootstrapping moves, and it introduced me to a great
many new things and interesting people. It ultimately set the stage for
running computer systems for other people, and a career in system
administration.

Now, many years later, as my career continues to evolve in new and
amazing directions, I still remember how it all came to be, and what a
strange and unlikely road it has been. To my former neighbor, my
parents, and to that weird guy who was down on his luck back east,
thanks for going for those unusual choices and taking a few chances with
me and for me.

...

Footnote: want to know what the "evil hack" was? I had to type
something like this in any time I wanted to use my modem.

Translated into English, that opens the "user port" (where the modem
was), tries to get a key from the keyboard, and pushes it at the modem
if so. Then it tries to get a character from the modem and pushes it at
the screen if so. Then it starts over.

This evil little scheme neglected to do all of the crazy CHR$(...) stuff
required to set up the software (!) UART on the machine for higher
speeds, so I wound up having to chug along at 300 bps. Actually making
1200 bps work would have involved getting the OPEN line right, and I
never really figured that out.

If that last paragraph meant anything to you, I'm sorry. Come join me
in celebrating the commitment of neurons to technical trivia which no
longer has any practical value.

Cheers!

]]>Seeking permissions with SSH public keys with help from GitHubtag:rachelbythebay.com,2013-07-20:ssh2013-07-21T04:46:42Z
Remember back in April when I
wrote
about how GitHub exposes your ssh public keys to the world? Back then,
I had a half-baked idea about using it to establish identities of people
trying to connect.

I've since discovered another use. Apparently, you can take a public
key and use it to generate an ssh login attempt without actually
doing the sort of crypto stuff which requires a private key.

If you want to try to replicate this yourself, here's what you do.
First, create a test account, then in that account create a .ssh
directory mode 0700 and an authorized_keys file in that, mode 0600.
Then, in another account, use ssh-keygen to make a public/private key
pair. Take the public key only and copy it into some other filename
that has no relation to the private key, or delete the private key. It
doesn't matter.

Now try to connect as the test user in verbose mode using that public
key. It'll look like this: "ssh -v -i public.key.file user@host".
It'll spit out a lot of garbage and then it will either ask you for a
password (if the host allows password auth), or it'll bomb. The point
is, nothing useful will happen.

Now take the public key contents and paste it into the
authorized_keys file in the test account. Then run the same ssh command
again. You should get a very different result.

This time, you should see something like this:

debug1: Server accepts key: [...]

Then it'll probably complain about an unprotected private key file and
will fail, but that's not important. The point has been made: this
public key is known to exist in that account's authorized_keys file.
This by itself is not enough to let you break into an account, but if
you're doing some kind of security analysis, being able to figure out
who can get to what is a great place to start. If there's an
organization with 50 role accounts and 500 employees, being able to
narrow down the possibilities for the most tasty accounts can save you a
lot of work. Once the targets are known, you can specifically pursue
them and try to compromise their private keys.
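The check described above is easy to script. Here's a minimal sketch (the function names are mine, not from the post) that shells out to ssh with only a public key and scans the verbose output on stderr for the tell-tale acceptance line:

```python
import subprocess

ACCEPT_MARKER = "debug1: Server accepts key:"

def server_accepts_key(debug_output):
    """True if ssh's -v output shows the server accepted the offered key."""
    return any(ACCEPT_MARKER in line for line in debug_output.splitlines())

def probe_key(pubkey_file, target):
    """Offer a public key (no private key needed) and report whether the
    server recognizes it. The login itself is expected to fail either
    way; only the debug chatter on stderr matters."""
    result = subprocess.run(
        ["ssh", "-v",
         "-o", "BatchMode=yes",             # never prompt for anything
         "-o", "PasswordAuthentication=no",
         "-i", pubkey_file, target],
        capture_output=True, text=True)
    return server_accepts_key(result.stderr)
```

The `BatchMode` and `PasswordAuthentication=no` options just keep ssh from stopping to ask questions, so the probe can run unattended across a list of accounts.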

So here's the new evil idea: scrape GitHub, get all the public keys you
can, then try all of them against every ssh account you can find. Then
build up a list of which mappings look particularly interesting and
start chasing them down.

I think I have a good bit of key auditing to do. I now consider
anything I might have stored out there in the past to be less than ideal
for continued production use.

You might consider doing the same if this bothers you at all.

]]>Before ISDN and before PPP, I used to run SLIPtag:rachelbythebay.com,2013-07-19:slip2013-07-20T04:24:18Z
I've had a variety of technologies to deliver the "last mile" of
Internet access to my residences over the years. I've done cable
modems, DSL, ISDN and even wireless. Before all of that, however, were
years and years of dialup access. Some of that was PPP, sure, but for a
very long time, it was plain old SLIP, and it wasn't always so pretty.

The earliest SLIP setup I had involved manually logging in to an ISP's
terminal server, and then I'd run "slip" at the prompt to get it to
switch modes on their end. Then I'd have to do an escape code to get
back to the prompt on my end and do a bunch of things to define the
connection: "set remote x.x.x.x" "set local x.x.x.x" "default" and "mode
slip", if I remember correctly. This was all done in a little program
called "dip" on Linux.

Later, with the help of a friend, we got it to actually dial in by
itself, log in, and even parse out the IP addresses in order to set up
the connection properly. It would then play a very loud recording of my
cat meowing which could be heard from far away. Why would I do such a
thing? That's easy: it could take several minutes for it to actually
get an open line and get connected successfully. This let me start it
up and walk away, and then I could return after the "wharrrllll!" went
off.

The stateless nature of ordinary SLIP made some interesting things
possible. At one point, I reconfigured my modem to ignore the DTR line.
This is what a modem usually uses to know when you want to drop the
connection without doing the whole escape-and-ATH thing. By disabling
that, I could do all sorts of crazy things with the computer and the
modem would still keep the link up.

For instance, I could reboot the computer... and the modem would just
sit there, still connected. Since SLIP itself had no sense of health
checks or other keepalives, never mind sessions, the other end wouldn't
care, either. This let me go from Linux into Windows and back without
having to redial. Granted, I had to rig up a second set of
configurations for both dip and Trumpet Winsock which would set up a
connection without dialing first, but that was a one-time chore.

Being able to flip back and forth was pretty handy in those days, and
not losing my precious dialup port was a definite plus. You just can't
do a trick like this with anything which would need to maintain state on
either side, but SLIP is little more than IP with some framing
characters and an escape scheme for those magic characters, so it didn't
care. It's like the honey badger of Internet connectivity.
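That framing-and-escape scheme really is the whole protocol; RFC 1055 describes it in about a page. One reserved END byte delimits packets, and an ESC byte keeps the two magic values from appearing inside the payload. A sketch:

```python
# SLIP framing per RFC 1055: END delimits packets, ESC escapes the
# two reserved bytes so they never appear raw inside a frame.
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD

def slip_encode(packet):
    out = bytearray([END])            # leading END flushes line noise
    for b in packet:
        if b == END:
            out += bytes([ESC, ESC_END])
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])
        else:
            out.append(b)
    out.append(END)
    return bytes(out)

def slip_decode(frame):
    out = bytearray()
    it = iter(frame)
    for b in it:
        if b == END:
            continue                  # frame delimiter, not data
        if b == ESC:
            nxt = next(it)            # a truncated frame would raise here
            out.append(END if nxt == ESC_END else ESC)
        else:
            out.append(b)
    return bytes(out)
```

There's no checksum, no session, no keepalive: `slip_decode(slip_encode(pkt))` round-trips any packet, and that's all the protocol ever promised.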

Not everything about SLIP was good. Some of the systems I dialed into
weren't particularly good about how they handled it. I think the BSD/OS
boxes at the time didn't do anything more than an MTU of 296 bytes. This
had some impact on the practical amount of bandwidth I could use, but
there was a far bigger problem: broken people with admin powers on
their routers and firewalls.

One of the sites I regularly accessed had some people who decided it was
a great idea to block all ICMP traffic crossing their Internet
connection. They figured this would keep evil IRC warriors from trying
to swamp their entire company's circuit. Maybe it would stop some of
the replies, but it also had the side-effect of breaking path MTU
discovery.

For those who haven't encountered this, here's the premise in a
nutshell. Any given interface has some limit on how big the packets can
be. If something bigger arrives, it gets rejected. The error is
approximately "fragmentation is needed to make this work, but you said
not to fragment this packet". This error is returned to the sender, and
they are expected to respond by emitting smaller packets until the
errors stop.

Trouble is, those errors are conveyed over ICMP, and these guys had
filtered it. That meant their machine would never hear the errors
coming in from my SLIP host. As a result, any time a packet above the
magic cutoff left their end for mine, my connection would probably go
dead. Worse still, they seemed to be horribly inconsistent about when
this rule would be running or not, so I couldn't be sure when I could
get in.
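The interaction can be modeled as a toy loop (this is an illustrative model, not real socket code): the sender retries with smaller packets each time a "fragmentation needed" error comes back, and an ICMP filter turns that conversation into a silent black hole.

```python
def send_with_pmtud(packet_size, path_mtus, icmp_blocked=False):
    """Toy path-MTU discovery. path_mtus holds the MTU of each hop
    along the path; returns the packet size that finally got through,
    or None if the 'frag needed' errors are filtered and the packet
    just vanishes."""
    while True:
        bottleneck = min(path_mtus)
        if packet_size <= bottleneck:
            return packet_size        # fits every hop: delivered
        if icmp_blocked:
            return None               # error never arrives: black hole
        # ICMP "fragmentation needed, DF set" reports the limit; retry.
        packet_size = bottleneck
```

With the filter in place, any transfer whose packets exceed the bottleneck MTU hangs forever, which is exactly the "connection goes dead" behavior described above.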

As for why I was still running SLIP, well, that's mostly my fault. My
home gateway machine had a really old Linux install that I didn't want
to update, and the long-suffering kernel it had only had SLIP support.
It would need a recompile to add PPP, and that sort of thing was an
all-night affair. It just wasn't worth the trouble.

I wound up working around their ICMP filter shenanigans by first
telnetting to another machine which had a reasonable MTU to the world,
and then connected to them from there.

Sometimes I get nostalgic about old stuff, but I don't miss dialup
connections one bit. The past can keep it.

]]>Backported patches mean version numbers of variable utilitytag:rachelbythebay.com,2013-07-18:ver2013-07-19T04:39:10Z
Version numbers are everywhere, and yet, they can be misleading. The
worst interactions I've had with them came from my life in the web
hosting support business. The intersection of library versioning,
operating system backports and internal versioning and relatively
unsophisticated customer audits would occasionally make my life
complicated.

It goes like this. First, you have well-known programs like Apache and
OpenSSH. Then you have libraries used by those programs like OpenSSL,
libcurl. There are also runtime environments like PHP or Java. All of
them have their own version numbering schemes from their respective
developers and/or corporate goons.

Then there are operating system vendors like Red Hat, and their products
like RHEL (Red Hat Enterprise Linux). They tend to roll up a bunch of
releases of Apache and OpenSSH and all of those into a given product of
their own, and then call it "version 5" or "version 6" or whatever.
They then have point releases along the way, so it might be 5.1 or 5.2
or 6.3 or 6.4. Between those, there are security updates which might
come out at any time, and might even be applied without you realizing it
if your system is configured to auto-apply patches.

The catch is they tend to keep the same base versions of everything for
the entire lifetime of a given product: 5.x, 6.x, etc. RHEL 5, for
example, is "still" running Linux 2.6.18. Of course, it's not the stock
2.6.18 which was released way back when. It's had a whole bunch of
custom patches applied to add features they decided were important but
which actually came out in some later kernel. They also apply security
patches as flaws are found. It's still 2.6.18 at its core, and so all
of those basic behaviors remain in place, but it slowly improves with
time while still remaining more-or-less stable.

Normally, this is not a big deal. People find out about flaws, find out
their release was affected, then go and look and find that their machine
has already applied the new RPM build from upstream. That new build
included a patch to whatever the problem was, and now it's no longer
vulnerable. This continues until they retire the release, and that
takes on the order of 10 years if you're willing to pay for that kind of
stability.

It gets ugly when you have a customer who only sees the "external"
version numbers and doesn't see the whole picture. They find out that
their server is "running Linux 2.6.18" and flip out because there might
have been a dozen security holes discovered in that since it came out.

We'd usually get tickets from people who had run some external auditing
script against their server. It would see the "Apache/1.3.30" or
whatever and would proclaim that it had bunches of vulnerabilities from
that alone. It didn't actually check for whether they were
still vulnerable or not. Every time someone did this, some poor tech
like me would have to go through and explain exactly what was going on.
It usually went something like this:

"It wants at least version X of program Y because of vulnerabilities A,
B, and C. Red Hat patched A in build 1, B in build 15, and C in build
30. Your server has build 45, so you have patches for all of those
problems even though it still displays as some version before X".

Basically, these programs would trip over the version string in the
Server: line of an HTTP response and would miss the fact that it came
from an RPM that was in fact up to date.
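The explanation boils down to comparing build numbers rather than upstream version strings. A toy model (the vulnerability names and build numbers here are made up, mirroring the quoted example):

```python
# Hypothetical fix history for one distro package: each flaw was
# patched in a particular *build*, while the upstream version string
# on the wire never changed.
FIXED_IN_BUILD = {"vuln-A": 1, "vuln-B": 15, "vuln-C": 30}

def open_vulns(installed_build):
    """Flaws not yet patched at this package build number."""
    return [v for v, fixed in FIXED_IN_BUILD.items()
            if installed_build < fixed]

# A build-45 package still advertises the old upstream version, but:
# open_vulns(45) -> []  (all three fixes applied)
# open_vulns(10) -> ['vuln-B', 'vuln-C']
```

A scanner that only reads the advertised version string is effectively ignoring `installed_build` and assuming it is zero.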

Usually, I could grab one of these tickets, do the footwork and prove
that everything was already patched, and they'd be happy. I'd also try
to explain the situation to them so they didn't freak out nearly as much
in the future.

Unfortunately, now and then, someone would take them literally and would
start trying to get everything upgraded to meet their demands.
That is, if the customer said the audit program wanted PHP version X,
then the tech would try to build it manually on the machine. This would
mean rebuilding a great many other things which relied on it, since a
Red Hat type distribution is a huge mass of interlinked dependencies.

This is a non-trivial matter, and you really don't want people trying to
do this on a whim. They usually get it wrong, and remember, it's
probably completely unnecessary due to the backporting of patches I
described earlier. There's another problem that's not quite as
obvious, one which will actually make things worse down the road:
taking them off the main line.

By that, I mean that once a server is running a custom build of some
package, be it PHP or Apache or even the kernel, odds are, it's not
going to ever run the "official" package of that program ever again.
It'll probably just have that one-off build installed as the result of
that ticket, and that build will sit there and rot until the machine is
shut down... or cracked.

This is because when a customer demands and receives a special build,
they usually aren't moved to a new package stream. Instead, they only
get the one patch and then that's it forever. Actually moving the
customer's machine to a new provider for that package would mean
having a provider and "build master" who maintains such
things, and that's well beyond the scope of what support techs usually
get to do.

So, if you're one of these customers, be careful what you ask for. If
you make some unreasonable request and they make a one-off build of some
package, you run the risk of never receiving another automatic update
for that program ever again. If you really need to do this, make sure
they move you onto another maintained source of packages and don't just
"rpm -Uvh blah" and leave it like that until the heat death of the
universe.

This one is on the Oracle campus in Santa Clara which is also a park of
sorts where a mental hospital used to be.

Remember the '90s, when "... and stuff" was a common thing to append to
sentences? "Yeah, we went down to the mall and stuff." They apparently
do.

Obviously, this one should be "... death of 117 patients and
staff", but "stuff" passes the spell-checker, so... ship it!

...

I actually helped make this next one happen.

Back in 2007, there was this
ill-executed publicity stunt
for an animated show which involved putting a bunch of weird-looking
electronic devices all over Boston. The bomb squad got involved and a
bunch of people online reacted to their seeming over-reaction.

This is how we reacted to it in Mountain View:

That's the side of Google's building 46, and specifically the windows on
the "bullpen" where my pager monkey team worked. I bought a whole bunch
of colorful Post-It notes at an office supply store, and we set up some
pixel art based on one of the characters.

It stayed up for a good bit after we moved and then moved again, but
it's since come down. If you're one of the subsequent occupants of our
former space, well, now you know how that got there. It was me!

As a side-effect, I still have a mighty stack of colorful note pads.

...

One of my usual pizza haunts in Santa Clara has a Chicago theme.
They have pizzas like "The Ditka" and "The Loop". There's one line on
the menu which always makes me wonder.

"Sorry, we cannot put sausage in your calzone".

No? Damn!

...

There's a shopping center in Santa Clara which is being rebuilt. As
part of this work, they built a new Walgreens to bring it right up to
the street frontage. It used to be set back almost two blocks in
this shopping center and was hard to find.

When they did this, they also redid the corner where the shopping
center's access road reaches the city streets. This corner had a stop
sign which had been a bit out of place for a while (Anna at McCormick,
if you want to look it up), but now it's just ridiculous.

I call this one "stop where, exactly?":

The sign is now set back so far that there's enough room to have a
hydrant, a drain grate, and an entire driveway between it and
the actual intersection and stop line. It's practically far enough back
to create another intersection!

Is this all because they didn't want to install another pole for the
sign?

...

Finally, I caught this vanity plate in Mountain View over the weekend:

If you don't see the problem with this one, back up from your computer
or take off your glasses. Also imagine what happens when you see it
from slightly below and that chrome license plate cover occludes the
bottom of the letters.

]]>Silly radio liners and commute efficiencytag:rachelbythebay.com,2013-07-16:boneyard2013-07-17T03:26:13Z
I used to work second shift tech support, and that meant I had to be at
the office by 4. Due to the length of the commute and presence of some
serious construction along the way, I tended to leave really early.
This wound up having me on a particular stretch of freeway at the same
time every day. I could actually tell how well things were going based
on the time of day vs. my position.

Normally I wouldn't notice any particular time during a drive, but there
was one thing in particular which happened every day at 3:20 my time.
My car had come with an XM receiver which was still activated, and I was
able to listen for free at first. I had discovered a channel which
played some good music called "The Bone Yard" and would usually keep it
on while rolling to work.

You see, 3:20 in my location was 4:20 one time zone east of me, and they
used to "celebrate" 4:20 every day by playing a 20 second
"liner" to
fill the time. Since it was a national service, that meant doing it
four times a day! It was the craziest thing I had ever heard on the
radio at the time and always gave me a chuckle. Sometimes I'd catch it
at 4:20 in my local time zone or the subsequent ones for the rest of
the country, too.

This whole thing seems to have fallen down the "memory hole" of the web.
You can find a few references to it on some web boards, but the actual
audio seems to have been lost. I've decided to help that situation a
little since I happened to grab a copy of two of the four way back in
the day.

For the sake of archivists, nostalgic satellite radio metal heads, and
other people who just like amusing audio content, here it is.

Central time zone:

Pacific time zone (West coast):

If you heard that every day at the same time, I imagine you'd start
using it as a gauge for the efficiency of your commute, too.

"Hey, I always go through this narrow spot when this comes on..."

Oh, one final note: when they played this, the radio would display a
song name: "BONG ME". Subtle, right?

]]>Make it harder to screw up and make unsafe things obvioustag:rachelbythebay.com,2013-07-15:taint2013-07-16T03:30:49Z
Right up front I want to say something: I am not a PHP programmer. I've
had to support the runtime for people, and I've modified existing code
in the course of getting other things to run, but I've never willingly
written new code in it. There are probably things I'm missing here.

With that said, onto the rant.

Last weekend, I was trolling the /new page on Hacker News looking for
some inspiration between projects, and found someone who had posted
a link to a personal web site. That web site told a story about
creating a web app which acted like a guest book but would also attempt
to collect some info about why the person was visiting. The idea was to
have it run on a tablet in your business to "capture" people in real
life.

There was a link to see the source code for this project, so I clicked
through and started sniffing around to see how it worked. I wanted to
get some idea of where this person was coming from and what sort of
techniques they used or possibly avoided in the course of creating it.
What I found was not encouraging.

In the first file I checked, I found something like this:

$query = "SELECT * FROM table WHERE `name` = '$_POST[name]' ...

I won't even get into the whole thing about "SELECT *" versus explicitly
listing columns and why you don't want to get your DBAs angry at you.
This is more about how the query was being constructed. It's a string
which then gets handed to MySQL and as a result has the ability to do a
great many things.

Since it's being constructed that way, the user-supplied input (in this
case the contents of a POST) gets a free ride right into that string.
This can open the door to all kinds of badness in the form of SQL
injection attacks. It's old, old news, just like how using certain
techniques in other languages will open you to buffer overflows and
attacks in that realm.

It's almost boring because it's that old. Yet, it persists.

This got me thinking about the human side of this equation. Here we
have someone who is trying to make something work, and has created a
vector for a well-known attack. I assume this comes from having the
seemingly direct route be one of string construction, and countless
examples do nothing to disprove it. There are always better ways (like
parameterization), but it's going to take something special to make the
world take notice.
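Parameterization is the better way alluded to above. Sketched with Python's sqlite3 module (the table and values here are made up for illustration; MySQL client libraries offer the same placeholder mechanism):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE guests (name TEXT, reason TEXT)")
conn.execute("INSERT INTO guests VALUES ('alice', 'demo')")

hostile = "alice' OR '1'='1"

# String construction: the input rewrites the query itself.
injected = conn.execute(
    "SELECT name FROM guests WHERE name = '" + hostile + "'").fetchall()
# injected -> [('alice',)] -- the OR clause matched every row

# Parameterized: the driver binds the value as data, never as SQL.
safe = conn.execute(
    "SELECT name FROM guests WHERE name = ?", (hostile,)).fetchall()
# safe -> [] -- no guest is literally named "alice' OR '1'='1"
```

The placeholder version isn't any longer to write, which is what makes the persistence of the string-building idiom so frustrating.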

My thoughts turned to those of "taint". This is where you have the
ability to flag a variable as containing user (and thus, untrusted)
input. It's a latch, meaning once you set this flag on a variable, it
can't be cleared. It also "rubs off", so if you use the data from a
tainted variable A to populate variable B, B is also now tainted.
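To make the latch-and-rub-off idea concrete, here's a toy taint wrapper in Python. (Real taint systems, like PHP's third-party extensions, hook the interpreter itself; a wrapper class like this only catches concatenation, not formatting or slicing.)

```python
class Tainted(str):
    """A string latched as untrusted. The flag 'rubs off': any string
    built by concatenating with a Tainted value is itself Tainted."""
    def __add__(self, other):
        return Tainted(str.__add__(self, str(other)))
    def __radd__(self, other):
        return Tainted(str.__add__(str(other), self))

def run_query(sql):
    """A query runner that hollers when handed tainted input."""
    if isinstance(sql, Tainted):
        raise ValueError("query built from tainted input; parameterize it")
    return "query ran"

name = Tainted("bob' OR '1'='1")     # straight from user input
query = "SELECT * FROM t WHERE name = '" + name + "'"
# isinstance(query, Tainted) is True: the latch propagated through
# both concatenations, and run_query(query) refuses to execute it.
```

This is the failure mode you want: the vulnerable construction still *compiles*, but it stops working, and the resulting error message is what sends the author off to read about parameterized queries.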

I went digging around and found that while PHP apparently has some
third-party extensions to add taint, it does not do it natively. If I'm
wrong, write in and tell me, but nothing I found suggests that the
version typically encountered by someone on a random LAMP environment
will have this configured and running full-time.

This seems to be the root of the problem. I want to see them do
something bold. Put in full-time taint checking. Make functions which
run SQL queries holler if someone passes in a tainted variable. Make it
so you have to set some kind of special "yes, I accept that this will
open a gaping hole" flag per source file if you want to disable
it. Make it fail to run properly if you don't do it right or turn it
off.

Then you can turn people loose on it. They'll still try to write their
potentially-vulnerable code, but this time, it won't work. Then,
instead of thinking they're done, they will now have a new problem to
solve, and in chasing down the problem, might encounter some useful
documentation which says "please don't build your queries this way".

Of course, there's also the possibility that someone will just drop the
"shut up and leave me alone" setting into all of their source files, but
that's okay too, since it makes code auditing really simple. All you
have to do is make it eminently greppable, and then someone like me can
just grep the whole tree for it upon encountering a new project. If
it shows up, you know the mindset of the author and can decide whether
it's worth your while to proceed.

Obviously, this would have no effect for the legions of machines which
are running older builds of PHP before this vaporware always-on taint
thing was released. To address that, allow me to tilt at another
windmill and come up with yet another vaporware concept.

There should be a test program which purposely does bad things with
tainted data. It would be the sort of thing which the new hypothetical
taint-always-on PHP version would reject right away. You would use this
to verify if the runtime environment on someone's host was actually
enforcing taint or not.

Once this existed, you could ask a new PHP programmer to run this
program and post the results. If it says something like "taint check
not present in this interpreter", then you know there's no point in
grepping the code for the tell-tale "go away" string, since the odds
they ran into the roadblock are slim.

Basically, I'm coming at this from an angle of trying to be a helpful
web forum participant without trying to get too involved with what could
be a big bag of hurt. Having a few more tools available would reduce
some of the round-trips required to know where someone is and how much
help they might need.

So, to review:

1. They need to run the taint checker and include the output. If they
don't have it run, they will be asked to run it before anything else
will happen.

2. If the taint checker is shown to be running on their host, then a
grep for the "disable taint check" setting will be run. If it's there,
it might be better to ignore this person since they're purposely aiming
a loaded gun at their feet.

3. At this point, the checker has been shown to work (or they're lying,
I suppose) and they haven't intentionally disabled it in their code (or
they're hiding it really well), and you can start looking for more
interesting things to audit.

Things like this would make it a lot easier to be helpful and not mean.

Finally, nothing I said is unique to PHP or this program. This could
apply to anything potentially unsafe in code. Lots of languages might
benefit from something like this.

Your language might call it "use strict" or "-Wall" or something else
entirely, but establishing that you've already passed that hurdle tends
to make other people treat you more seriously when you go asking for
help.

]]>Don't get on this planetag:rachelbythebay.com,2013-07-14:pos2013-07-15T01:28:55Z
Quick, what's the whole point of an airplane?

Stay in the air? Sure.

Try not to break apart? Absolutely.

Go really fast? Probably.

Back up a little. What's the actual point?

How about carrying stuff? People, cargo, you know, ... stuff.

With that in mind, what about this thing?

That... is a flying wood chipper. It carries no cargo. It exists only
to carry itself. Attempting to store something in it will result in
immediate consumption and destruction.

Gee, it's almost Freudian.

]]>Something's always wrongtag:rachelbythebay.com,2013-07-13:wrong2013-07-14T04:48:34Z
If your entire metric for whether you're doing well or not is a web
forum, you will probably never be satisfied because the posters on
these forums are never satisfied. No matter what happens, there will
always be something you did incorrectly. I've seen this manifest itself
a bunch of times. Sometimes I've just been a bystander, and sometimes
I've been on the receiving end of it.

Case in point: let's say you're going to write a software project. No
code exists yet. You have some kind of sketch or mock-up or some other
shiny stuff, but it's definitely not code. This gets put up on a
Kickstarter type site in an attempt to make money.

You will hear from people who will call it "vaporware". As far
as they are concerned, you should have written it first. Otherwise, how
can they know you'll deliver?

Then there's the case where you write a program and then offer to
release the code after the fact. It also goes out for some kind of
crowd funding, but this time there really is code behind all of this.
Perhaps there is a demo version to show what it does.

Having done this, you will hear from people who say you should
have just cut it loose instead of "holding it hostage".

If you just release it and then ask for money in order to support it
after the fact, people will yell at you for having the gall to ask.
It's not just that they aren't supporting the project, but they also
go to lengths to throw a few rocks in your general direction before
departing the area.

Then let's say you write something totally on your own and then release
it for free and never ask for any money. Someone will eventually come
along and say that it's crap and they could do it better. They'll talk
about how they managed to create a variant which is smaller or faster or
both. Then it'll become apparent that theirs is so much "better"
because it only handles their personal set of use cases and forgets
about everyone else.

Of course, when you mention this, they vanish, never to be seen again.
They sure don't come back to actually try doing something about the
"problem" which caused them to start mouthing off in the first place.

Given some of this behavior, it should come as no surprise that some
projects are deliberately created in private and not shared widely.
It's not even because it's some kind of "secret sauce", because it
isn't. Instead, it's merely done this way to avoid the spitballs and
other hurled insults from the global peanut galleries.

One wonders if there are secret forums which exist just to share such
tools among people who promise to not be dicks about it. These forums
are private not because of any illegality of the tools, but rather to
keep out the annoyances who contribute nothing and only serve to make
people needlessly doubt their own abilities.

Let the haters suffer.

Finally, here's a test for whether people are deliberately being mean.
Wait for one to report a fundamental problem with something. Then show
that it does not actually exist. See if they accept it or not. If they
drop it, then see if they turn around and find something else to yell
about. If this happens, you can be sure they are purposely
looking for ways to hate on your project.

With that established, treat them appropriately.

]]>There's still some room for C in this modern worldtag:rachelbythebay.com,2013-07-12:c2013-07-13T08:37:22Z
I received a question from an anonymous reader last week:

Other than embedded systems or low level systems programming, do you
think there is any room for C in today's world? What features of C++ do
you use most often?

They also mentioned that some people are proud to be using C instead of
C++ and asked why I thought that. There was a mention of a specific
project, but I'm sure we can all think of some which work this way. The
Linux kernel is a good example of a project with a leader who has very
strong
opinions about avoiding all things C++.

I think there is room for C, but there are specific situations which
call for it and others where it would be reckless to recommend it. The
problem is that the statement I just made is basically content-free,
since you could say it about any programming language and it would
still be spot-on. So, I'll try to go a little deeper with it.

First, some definitions for the sake of clarity. When I talk about C,
I'm talking about writing code which specifically does things in the
old school "C way" like using certain types of for loops to iterate
through char* spaces. Take a look at this, for instance:

That will in fact print out the characters in "x", one per line. The
question is: do you "get" what's happening in that for loop? Obviously,
you're initializing the pointer y to x, and you're also bumping it along
one spot every time through, but what's this "*y" business?

Old-school C people sometimes use constructs like this. They know it
means "keep going until this thing points at a \0". It could also look
like this:

for (y = x; *y != 0; ++y) {

That has the same effect, but it's a little more verbose about what it's
up to: testing for the presence of 0 at whatever's under the pointer.

Why might you do this? Well, for one thing, it saves on a call to
strlen() which would make it look like this instead:

This one has to include an extra system header to get the prototype for
strlen(), and the resulting binary is a wee bit larger by about 200
bytes on my machine.

There's more to this, though. The strlen() call has to flip through the entire
string until it finds that terminating \0. Then the for loop flips
through it again. You now have two passes over the same data where
previously you had one. This might be a performance issue... maybe.

Of course, there are plenty of horrible ways to do this, too. For
instance, you could neglect to cache the result from strlen, and instead
call it every time through the loop. Now you're in real trouble.

Why is that trouble? Well, now you're scanning the entire string at least once for
every character in the string, and none of them are telling you anything
new. It's wasted effort on the part of your system, and again, could be
a performance issue.

Are you going to notice this kind of mistake? Maybe, maybe not. It
depends on what your workloads look like. If you have a bunch of inner
loops written like this with really horrible blowup factors and they're
constantly being hammered by requests, then sure, it'll probably add up.

So now let's look at what happens when we switch over to using C++ and
std::string for this kind of task.

We no longer use pointers or indexes and array lookups. This time, our
traversal happens by way of an iterator, and there's a bunch of stuff
going on behind the scenes to make it actually work.

If you want to get some idea of what really happens inside those
libraries, try compiling the above with '-g', then load it into gdb, do
'start' to get it up to main, and then 'step' through it. There's a lot
of stuff happening in there!

By way of comparison, the original C example, when run through the
debugger in a stepwise fashion, looks like this:

... you get the idea. It's just running lines 7 and 8 over and over
until it eventually falls out of the loop and returns. There's stuff
going on, but it doesn't involve any library calls. Here's how it ends:

At some level, this represents less work to do. The question is: does
it matter? Well, that depends on what you're doing. If you have places
where such minutiae would add up to a real difference, then maybe you
need to perform such micro-optimizations and use a language which allows
for it.

Some languages force you to do this low-level fiddling all of the time.
Other languages give you a choice between that kind of behavior and some
organized higher-level functions and other helpful bits of code which
handle some of it for you but might technically be less efficient.

Then there are languages which always force you down the "expensive"
route and have no way around it. Even this might not be a problem if
you're only doing lightweight operations with it. This kind of stuff
would probably be lost in the noise of startup and shutdown if it's some
program which lazily sits there and waits to be poked by a user.

On the other hand, if you have something at the core of a program which
is getting thousands of queries per second, you probably won't have that
kind of luxury. "Buy more servers" starts getting mighty painful at
some point!

It's up to you to pick which way to go for any given part of your code.
You can also start with the relatively lazy approach and then optimize
later after
actually finding
hot spots.

In summary, I think old-school C behavior should continue to be
possible in order to solve specific problems, but I don't think it's
always appropriate. In particular, I think there is a danger of using
it to "show off" when something simpler (in terms of code) and yet more
expensive (in terms of computing resources) would be better.

Maybe you don't care about latency if it's below 5 seconds, and you're
running on a big box that's always plugged in, so CPU time doesn't
bother you. It has tons of RAM and a huge disk, so those don't matter
either. However, you have precious few programmer cycles to spare, and
you have to get it out soon or your company will go out of business.
That gives you one way through it.

Someone else needs super low latency with as little CPU consumption as
possible since they're on a hand-held device which runs from a tiny
little battery and are dealing with finger taps. People want to see it
move as soon as they touch it. There's not much memory or long-term
storage, either, so you can't be a pig about those.

I suspect these two situations have very different answers.

So how do you figure it out? Painful experience. That's how you
justify your worth to clients and potential employers.

...

Finally, to answer the last question about "features of C++ I use", I
assume this is asking about what sort of non-pure-C type things I do in
my C++ code. That's a fair question, and I'll attempt to list some of the
more obvious ones which come to mind, with the understanding that this might
not be a complete list.

I use classes to organize things. They tend to arrange themselves
around the logical "seams" in a design with different moving parts.

That said, I don't usually touch multiple inheritance or any other kind
of polymorphic whatsits in day to day operations. It just isn't needed.

I use the stream operator "<<" for logging. I picked this up from
a former employer and it agrees with me, so I kept using it for my own
work.

I use STL containers for storing all kinds of stuff. I'd say that maps
and vectors show up the most, but there are a couple of deque, list and
set users in my tree at the moment.

I also use std::string quite a bit. It's a lot harder to screw things
up when you eliminate a bunch of uses of pointers. Ordinary C
string-like behavior with char* means nothing but pointer wrangling, and
there are decades of proven badness from failed attempts at doing it
safely.

I do use "new" (instead of malloc) but it's usually just to stick
something into a "scoped_ptr<>", which again is something I picked
up at a prior job. It basically lets you hand a pointer to an object
which will hang onto it and will call the right "delete" on it when that
object goes out of scope.

This is great for pointers to utility classes, and that in turn lets you
do some fun stuff when it comes time to test classes in isolation. For
instance, imagine this:
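A sketch of that arrangement, using std::unique_ptr where my tree uses
its own scoped_ptr helper (the two behave alike for this purpose; all
class names here match the ones in the discussion):

```cpp
#include <memory>
#include <string>

// A utility class with a virtual interface, so an alternate
// implementation can be swapped into its slot.
class Util {
 public:
  virtual ~Util() {}
  virtual std::string fetch() { return "real data"; }
};

class Foo {
 public:
  Foo() { util_.reset(new Util); }  // hang onto the pointer
  std::string work() { return util_->fetch(); }

  // Deleted automatically when the Foo goes out of scope.
  // (Public here only to keep the sketch short.)
  std::unique_ptr<Util> util_;
};

// A mock which "fits" the same slot; Foo never notices the swap.
class MockUtil : public Util {
 public:
  std::string fetch() override { return "fake data"; }
};
```

In a test, `foo.util_.reset(new MockUtil)` drops the fake into place.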

Somewhere in Foo, something will do "util_.reset(new Util)" or
similar. That creates an instance of that class and hangs onto the
pointer and then whacks it when the Foo goes out of scope and is deleted
itself.

This MockUtil will "fit" into the scoped_ptr spot for a Util, and Foo
will use it without realizing anything has changed. Now you can
eliminate actual calls to Util (which might be costly) and create a
bunch of interesting situations without having to set up elaborate fake
versions of whatever Util is going to touch.

(Actually getting your MockUtil into Foo to replace the usual Util is
left as an exercise for the reader.)

Could you do this with a plain old "Util* util_;" pointer? Of course!
But now, you'd have to keep track of it and make sure it gets deleted
when Foo gets destroyed. Otherwise, you'd leak it.

The important part is that you couldn't easily do this if you had it as
"Util util_;" instead. In that case, you have a genuine Util bound up
inside of your class and it's not going to accept anything else. You
can't just swing it around to another implementation, and that can make
it harder to test the calling code.

The "scoped_ptr" I use is not a part of C++ (and so it exists as my own
little helper library in my tree), but I understand that something very
much like it now exists in "unique_ptr". This assumes that relying on
C++11 support is okay for your project, of course!

Regarding pointers and references, it's pretty much like this: if I pass
a reference to a function, it's const. I pass const pointers around
if the situation really calls for it. Actual non-const pointers are
usually there as a way to have an "out" argument on a function.

This means you can get some idea of whether your data is going to be
accessed or not by looking at how a function is called.

log_something("something which will not be changed");
string value;
bool success = get_value("also not changed", &value);

The magic & up there basically says "hey, this function might
wind up writing to it" to me.

None of this is perfect, but this is about where I am with it. I
definitely do not use the whole
buffalo
and I'm just fine with that.

]]>Feedfetcher is still polling as if nothing happenedtag:rachelbythebay.com,2013-07-12:reader2013-07-12T22:29:31Z
Would someone please turn out the lights at Google Reader? If it's
truly dead, let it die.

This is the situation right now, with a log file that rolled over Sunday
morning:

$ grep Feedfetcher-Google access_log | wc -l
171

It keeps checking in just like nothing has happened. It even still
declares the same number of subscribers even though I imagine none of
them actually exist in any tangible form any more.

Why keep grabbing content if you're not going to share it with others,
particularly when you went to the trouble of booting them off?

]]>Take a loopy movie and make it even loopiertag:rachelbythebay.com,2013-07-11:prank2013-07-12T04:12:52Z
I had another really bad idea a couple of days ago. I was thinking
about how certain movies become favorites, and you can't help but wind
up learning all of the lines. Before long, you're practically
performing the movie along with the characters. Any one person might
only have one or two cherished movies "burned in" this way, but to them,
they are very special.

What might be interesting would be to throw them off. Usually, watching
a movie on TV will throw a good couple of monkey wrenches into the
works, both due to censoring and also editing for time. Not all edits
make the movie shorter -- some actually seem to add "new" footage to
stretch it out. A longer movie means more opportunities for ad breaks,
right?

Eventually, the most hard-core fan will get used to the better known
edits like "yippie ki-yay, major falcon" and the whole thing
about "monkey-fighting snakes on this Monday to Friday
plane". What I'm talking about is to give them a situation where they
won't see it coming, and then be rather subtle about it.

This would work by reordering scenes, and occasionally swapping in
alternate takes. It could all be done through the magic of DVD. I
think there are ways to have different "routes" through a DVD depending
on options which were chosen at the beginning of the movie. I know I've
seen some which offer "play original cut" and "play with deleted scenes
re-added" over the years.

So the question is: would it be possible to make the disc grab a random
number somehow and then have it use that to mix up playback? It would
be pretty amazing if so.

It might be a little bit like some of those old gag records. A few of
the records made over the years actually had multiple grooves instead of
just the one, and depending on where you dropped the needle, you'd get
one or the other. I understand that while some of them were
straightforward and obvious such that you'd realize the difference right
away, others were subtle about it and would only have slight differences
later in the song.

I was inspired by a re-watch of (affiliate link!)
Primer,
which is
well known as a totally loopy movie.
The first time I saw it, I had no idea what was going on when it was
all over. On my second time through, it almost seemed to make sense by
the end. At first I wondered if that meant I "got it".

Then, later, doubt started creeping in. What if I didn't actually get
it this time? Maybe I was actually confident in an understanding which
did not exist. While pondering that conundrum, my thoughts turned to
those of prankster-level evil: how might one intentionally mess up
viewers?

Just imagine it: a movie as loopy as Primer, but with DVD mastery hacks
which deliberately make it move scenes around and create new paths
through the movie every time you watch it. Then, release it and don't
tell anyone about it. See how long it takes for people to come up with
their own theories of what happened, and then watch the flame wars erupt
online as they argue with each other.

The best part will be when someone goes to refer back to their own copy
and finds it subtly different. At first, they'll think they're going
crazy. Eventually, someone sufficiently nerdy will dissect the DVD
itself and will find the "code" which makes it all happen.

Maybe this has already happened, and there's some quirky DVD release of
a movie out there, subtly messing with viewers every time they press
play.

Which version will you get this time?

]]>More corporate IT avoidance by geeky employeestag:rachelbythebay.com,2013-07-10:domain2013-07-11T03:16:00Z
This is a story about what happens when you have a whole bunch of techs
who do all sorts of web hosting work and an IT department which is more
of a liability than an asset. It's about the things which are done that
fly in the face of security, reliability, documentation and everything
else that's good and proper, all in the name of getting
something done.

So there were these techs, and they did mostly web hosting technical
support. They took phone calls and they worked tickets. The company
had giant pipes to the outside world and a strong DNS infrastructure...
as long as you were a customer. If you were an employee, it was like
pulling teeth to get hosting space or a hostname in the official
corporate domain/namespace. It got to where people just stopped asking,
and would tell newcomers to not even bother lest they get branded a
troublemaker.

One tech went too far with it and installed an access point to get
wireless access on the support floor. Management walked him upstairs and
terminated him on the spot. Eventually, the president of the company
heard about it and rescinded the firing, but it was a sign of the hair trigger
they tended to have. So, when things needed to be done, they tended to
happen on the "down low" and only the right folks heard about it. That
led to the whole domain name thing.

Every tech had their own workstation (or two or three) at their desks,
each with an arbitrary name that didn't really matter. There was no
standard operating system so it was a big bunch of weird. Some of
these techs would develop useful things: web pages with helpful docs,
CGI tools and random multi-user web-based amusements (like the
much-enjoyed "paint on the annoying customer's face by clicking" page).

Other techs took on the roles of unofficial "buildmasters", and they
started rolling RPMs and other packages for frequently-encountered
situations. Customers used to ask for custom builds of PHP, MySQL,
Apache, and the usual bits of the "LAMP stack", and it got tiring to
keep re-doing it. Having both binary RPMs for the quick load and go
situations and source RPMs for those requiring customization eliminated
a lot of work.

It would have been helpful to have some kind of way to give all of these
hosts names in one or more of the official corporate domain names, but
that did not exist. Finally, one day, one of these techs registered a
real domain name with an ordinary registrar. Then they set up the
domain just like it was any other customer domain in terms of the
primary nameservers, and added it to a "customer" account. That is, it
was the personal account of this employee, and since customers can host
domains, that's how it was added.

Then the word went out: if you want to give your workstation a hostname,
now you could. This would let you get back to it from other places
inside the company without trying to remember the IP address. You could
share the hostname with other people and let them get at your stuff -
docs, games, toys, or your massive MP3 cache. Whatever. All you had to
do was access this one employee's account, drop into the DNS editor, and
edit this one domain.

It took off like gangbusters. All sorts of people added things to it.
They even started getting clever and started adding entries for other
stuff besides their own workstations. Some of the corporate servers
had really stupid and unhelpful hostnames for services which didn't run
over HTTP. So, they would set up a memorable hostname in this new
domain pointing at the same IP address, and would then just refer to it
from then on.

As far as I know, their IT department never found out.

Anyway, this is what happens when you have a bunch of geeks without a
useful corporate infrastructure for workstation names or good ways to
create memorable URLs. They'll find a way to make one. It might not
have anything to do with the IT department, but they'll find a way!

]]>A correction about the 1920s Dumbarton Bridge fishing piertag:rachelbythebay.com,2013-07-09:pier2013-07-10T00:43:15Z
I have a correction to make to an older post of mine about the Bay Area.
In August of 2012, I
wrote
that the remaining bits of the 1920s Dumbarton Bridge had a pair of
fishing piers: one you could access from the Fremont/Newark side, and
one which was accessible from the East Palo Alto/Menlo Park side.

As it turns out, I was only half-right.

In the past year or so, they have wrapped up a seismic refit of the
bridge, and part of that involved using the old western ("Ravenswood")
pier as temporary scaffolding. Once they were done, they took down the
construction rigging and removed the old span. As a result, there is no
longer a second, smaller bridge if you approach from the west side.

It looks like this when viewed from the Ravenswood side now:

[Click to see this one full-size. It's way better.]

I was operating from outdated info and assumed that since it had been
there 90 years, it would be there a while longer. I was wrong.

The good news is that the eastern side isn't going anywhere any time
soon from all appearances, so if you want to go a good distance into the
bay without getting wet or hopping in a boat, you can still make it
happen. The fishing opportunities still exist on that side.

]]>Comparing "bb" and "ekam" by requesttag:rachelbythebay.com,2013-07-08:build2013-07-09T03:20:30Z
I recently received a request to compare my
C++ build tool with
"ekam". I have to admit
that I haven't really looked at ekam beyond the superficial clicking
around on a few web pages over a year ago, so I'm coming at this "cold".
This post is basically a stream-of-consciousness log as I poke around at
the documentation, in other words. I can't actually build it and
run it (more about that below).

Right off the bat, my build tool (let's call it "bb", since that's how I
usually invoke it) is not open source. It could be, but it isn't. I
won't get into why -- too much drama about that in recent posts already.
ekam is open source.

According to its
Quickstart page,
Ekam is said to use C++11/C++0x features, which means it relies on the
"unreleased" gcc 4.6 (or later). I assume the wiki page is a bit out of
date since gcc 4.6.0
landed on March 25, 2011.

"bb" is pretty boring C++ by comparison. I don't even think of version
numbers when I write C++, to be honest. I actually hate programming
languages where versioning matters - see a
prior rant.
I guess "bb" is technically C++98, but that's not a conscious decision.
The mix of C++ I use is basically C with the STL, classes, and one or
two other things I'm probably forgetting. I don't
use the whole buffalo,
in other words.

In practical terms, this means I can build my own build tool on all of
my Linux boxes and my Macs. I imagine I would not be able to build
Ekam because I do not have a sufficiently advanced version of gcc
installed.

On the topic of platforms, according to the Ekam wiki, it had FreeBSD
and Mac OS X support but it has "atrophied". However, if the wiki
itself is out of date (like with the gcc release info), Ekam itself may
have been rectified to work on those platforms again.

(Then again, I just found
Cap'n Proto
docs which suggest it's Linux-only and best used with Eclipse. Those
pages seem to be more recent so perhaps the Wiki is correct.)

"bb" only has binaries for Linux x86_64 (known to work on Slackware64
13.37 and 14 and RHEL 5) and Mac OS X (known to work on Mountain Lion),
but I've gotten it to compile and run just fine on OpenBSD. I bet I
could probably make it run on FreeBSD without any funny stuff, but I
haven't actually tried it yet. I guess I need to put another test
partition on one of my scratch boxes and give it a spin.

I suspect the reason Ekam is platform-sensitive is that it does neato
(and spooky) stuff to figure out what you're doing. It basically uses
(again, according to the Wiki) LD_PRELOAD to get into your compiler's
"brain", and then intercepts library calls and sniffs around to find
out what's going on. This seems to let it "discover" a great many
things about what it takes to build your project.

"bb" has no magic or cool stuff. It just parses .cc and .h files to
look for three things:

I guess technically it has a side case for #2 where it will notice that
you're including "foo.pb.h" and will interpret that as a protocol
buffer. Then it'll go out and look for foo.proto and run 'protoc' to
create foo.pb.cc and foo.pb.h in a "genfiles" directory. This
is baked into the tool.

Anything outside of those triggers is not going to be picked up. I
suspect Ekam will pick up a great many things and thus technically wipes
the floor with "bb" in terms of detection, but at the cost of
portability. There are some other potential quirks involving platforms
which use a statically-compiled compiler. "bb", on the other hand, is a
pretty boring Unixy program that will probably compile and run most
anywhere.

Ekam also supports protobufs, but it seems to require some extra steps
to make it happen. "bb" just wants to see the right sort of #include, a
.proto file, and 'protoc' in your path.

Ekam has continuous building and client/server magic. "bb" has neither.
This basically means that you can keep editing things and Ekam will
notice them and automatically re-build as appropriate. "bb" makes you
run it again by hand, just like make and friends.

Ekam talks to Eclipse with a plugin. "bb" has nothing of the sort.
I write my code in nano
(yes, really)
so it doesn't really work that way for me.

Ekam can build itself. "bb" can, too. They both have bootstrapping
scripts. Ekam's bootstrapping seems to be relatively clever and smart:
it looks like it keeps bashing things together until they "stick":

When Ekam builds itself, some errors will be produced, but the build
will be successful overall. Pay attention to the final message printed
by the bootstrap script to determine whether there was a real problem.

The reason for the errors: Some files are platform-specific (to Linux,
FreeBSD, or OSX). The ones for the platform you aren't using will fail
to compile, producing errors. Ekam will figure out which files did
compile successfully and use those. For example, KqueueEventManager
works on FreeBSD and OSX but not Linux. EpollEventManager works on Linux
but not any other platform. But Ekam only needs one EventManager
implementation to function, so it will use whichever one compiled
successfully.

"bb", on the other hand, has a bootstrap script which spirits a copy of
the source code and required libraries from my main tree into an empty
directory and then throws the whole mess at g++ to get a binary:

It isn't pretty, but it's not supposed to be. As soon as that much
works, it's possible to turn around and run it to build itself in the
usual fashion.

Why "dep_cli"? It's dull, but here goes. "dep" is the name of
the class which does most of the work at the moment, and "dep_cli" is
just a dumb little wrapper with enough of a main() to start it up. See,
I said it was dull. Why "dep"? It's a "Dependency" - a part of the
project.

"bb" can "stamp" binaries with build details. This is why "dep_cli -V"
will spit out a build time. I don't know if Ekam has that as an option.

"bb" has several build types. At present, it does "debug", "coverage",
a default called "bin", "opt", "optstatic", and "prof". Those
essentially manifest themselves in terms of different gcc compiler and
linker flags. They're hard coded at the moment but I intend to expose
those via a config file for overriding and augmenting the stock list.
The build type also determines where the "genfiles" (stuff like protoc
output) and output files (objects, binaries) wind up. This is
important, since you don't want to mix up your objects from different
"flavors" of builds.

I can't tell if Ekam has build variants. I suspect it does, knowing the
sort of build environment its author came from, but I can't find hard
evidence either way from the docs or a cursory romp through the source
browser.

Ekam seems to support parallel operations with a "-j" flag just like
Make. "bb" does not. It is decidedly serial. It probably has enough
state information to safely run things in parallel and block others
until the constituent parts have been (re)built, but at present it does
not. It hasn't bothered me enough to make me try to write it yet -- my
builds are fast enough for my purposes right now.

The "bb" source does not contain the word
"factory"
either in the actual project code or in any of the helper classes found
elsewhere in my tree. (Sorry.)

Ekam has a warning about being experimental and not being ready for real
use. I doubt it's actually that bad, but that's what it says. "bb" has
no such warning, but you're still at the whims of a binary-only build
tool which you haven't funded and may cease to exist at any time (just
like any other project).

I actually use "bb" for "production" stuff, or as close as I can get to
it. For instance, it builds all of the stuff behind my
Super Trunking Scanner
site: the stuff which uses GNU Radio to chug down raw I/Q data from my
USRP, the MP3 generation and metadata logging to MySQL, the RPC client
to push calls to production from my logging machine, the RPC server to
accept those calls, and the CGI stuff on my web server to hand out those
MP3s and JSON lists of calls. It's all C++ and it's all built by this
tool.

Other stuff? fred.
"publog", the software which generates these very
posts,
Atom feed,
protofeed,
and
books. The backend for
the web store which sells my books, and does the
"Virtual Rachel" logins,
sales, and product management (so you can't request a book you haven't
paid for - naughty, naughty).

... and, well, everything else I do in C++ nowadays. I write stuff for
my clients using it and then write them a nice Makefile when it's ready
for handoff. They're welcome to use "bb", too, but I give them the
power of not being bound to a proprietary build tool as a matter of
honor.

Final thoughts? Ekam makes me say "wow" - capturing library calls to
stay on top of things is way beyond what ordinary build tools do.

"bb" makes me say "eh, it works" - to me, it's boring.

But hey, I'm obviously biased. The question is: what do other people
think?

]]>Responding to HN and other comments on the kickstartertag:rachelbythebay.com,2013-07-07:ks2013-07-08T02:29:36Z
Oh dear, it seems that HN has gotten a hold of
yesterday's post.
Rather than trying to wade through the comments, I'll attempt to reply
to some of them here. I'll also try to reply to the feedback I've
received directly on my site.

Point: "There are other currencies than dollars".

True, but there's a problem. Fry's doesn't take
"whuffie". Neither
do Ettus, Halted, or the ham radio shop in Sunnyvale. I can't buy a
USRP, a right-angle SMA connector, and a flexible antenna on social
capital alone. They usually only deal in USD.

Point: I could use something other than Kickstarter.

True. I could do that, but that means being stuck in a state of "will
it or won't it" even longer. The last month of waiting on this thing
to either work or not sucked, okay? I don't want to go through that
again. I like knowing when something is going to happen so I can get
working on it.

Point: I should explain how it excludes me from income.

I thought I did, but in case I didn't: it takes time to work on this
kind of stuff. Besides the actual known matters of making it
releasable to the rest of the world, there are support issues. Once
it's out there, people are going to have problems, and I would feel
compelled to help them out.

Yes, even for the people who paid nothing and are merely riding on the
goodwill of the others who did pay for it. If I wrote it and it's
causing pain to someone, then I'm probably going to feel the
need to help them. The only way this breaks down is if something
higher-priority comes along and bumps it out. Then I have to just walk
away from it.

The funding was about making fred the one bumping out other projects,
and thus it would "win" and get the time from me that it deserves.

I have multiple projects and clients competing for my attention. I was
trying something different here to let the whole world be my "client".
It was not successful.

Point: my "video thumbnail" could have used "15 minutes of graphic
design input".

Sorry, that's not a video thumbnail. It's a static image that I
generated in Omnigraffle. It replaced the initial hand-drawn
image that I did on my marker board and photographed (!) to get things
going.

For reference, this is what was originally on the KS page:

... and this is what I made with Omnigraffle:

What you didn't see is that I actually took a series of pictures as I
drew the first image. I had a camera on a tripod and was playing around
with the concept of making it into a stop-motion animation. The logo
appears, fills itself in, and then gets wiped away.

Why didn't I do that, or indeed, any video? Easy. I'm not a
video producer. I'm a programmer, sysadmin, writer, and a few other
things. But I'm not any good at making videos, or really, at making
polished graphic art that looks professional. I can't be good at
everything! I have distinct weaknesses and that is among them.

Sometimes I get lucky. My second book managed to get a pretty nice look
if I do say so myself.

I also did that with Omnigraffle. It's based on a picture which
originally showed up in a post from December where County Roads put up
the wrong sign. It looks good in a list of books. My original book did
not.

I call this the "Charlie Brown color theme". It's the product of maybe
10 minutes sitting on my couch with my now-departed iPad in the Brushes
app. I needed a cover, and had no idea what to do about it. I also
wanted to get the book out after working on it for so long.

It wound up looking like that, and yes, it's horrible. I actually
intend to go back and fix it to make something more like The Stupid Hour
one of these days. Then this hand-drawn abomination will hopefully fade
away.

Point: There are so many other feed reader projects.

There are now. There weren't nearly as many, or at least, it didn't
seem like it, back several months ago. The destruction of Reader
basically made a bunch of stuff come into being.

fred started back in 2011 when I decided to bail out from
Google properties. It predates this no-more-Reader thing.

Point: Release it anyway. A tarball of shit is better than nothing.

Nope. Not gonna do that. Not attributable to me, at least! If you
want to put your name on a big pile of poop, I'm not going to stop you.
Just don't expect me to join in.

Point: This software must have been trivial or it would have been easy
to get the money later.

Well, actually, it is trivial... for some people. Some people
exist to write plumbing and plug stuff together to make bigger systems.
I am one of them. I write stuff to solve my own problems.

Other people are better at other things, and for them, this problem
would be non-trivial. Their solutions to it would look as crappy as my
attempts at graphical design (see above). But their graphical design
might look awesome.

We're not all good at everything. Expecting that is folly.

Point: Holding onto it won't give you any revenue or popularity.

I never wanted to be the queen of feed readers. I did this as a direct
response to the feedback from my
"what should I do now"
post. Remember, for me, fred is basically done. It lets me read my
subscriptions in peace and doesn't get in my way. I didn't need to do
anything more to it.

Point: You should have promoted it more.

I think I told everyone I know about it at least once. Probably more
than once. I even bribed some of my friends with baked goods in order
to get more eyeballs.

Have you ever baked an apple pie and then delivered it personally in
order to make sure they'd remember you and tell everyone they know? I
have.

I wanted this to work.

Point: Blah blah money evil dumb whore.

Remember
protolog? I
came up with it, tweaked it, and cut it loose on Github. It's still
there. I didn't ask for any money.

An equivalent to protolog doesn't exist anywhere. People are still
using regexes to grovel through this crap! I gave it away in an attempt
to change that.

I haven't asked for any money for protolog, and I haven't received any.

Point: fred is just built on other people's free stuff anyway.

Which stuff, exactly? The libraries: libcurl, libxml2, and jansson?
Did you really want me to write my own HTTP client, XML parser, and JSON
serializer?

I mean, I could... I've done stuff like that before. They wouldn't be
good implementations of those libraries because they would only
have the absolute minimum needed to make fred work. They'd probably
have a lot of stupid holes which had already been found and squashed in
the better-known libraries.

But hey, it would be all my code... or would it?

I'd still link to the C library on your system. The binary would run
atop the kernel. I didn't write those. Did you want that, too?

Where does it end?

]]>"Just open-source it" is not realistictag:rachelbythebay.com,2013-07-06:fred2013-07-07T20:29:47Z
I've received a couple of questions about fred following the
failure
of the kickstarter to fund it as open source yesterday. One of them
was also asked before this whole thing started, and I had hoped it would
just go away. I was hoping to not have to answer it. However, since it
has come up again, it probably will keep coming up, so I might as well
answer it.

The big one is: "why don't you just open-source it as-is?".

My answer is: it's not that simple, and it wouldn't help anyone as-is.
First, you have to appreciate that fred is part of a much bigger code
base. It's just like the "//depot/..." thing that some of my readers
will recognize, only much smaller. It's a collection of libraries which
I have written which have found their way into multiple projects.

I am not willing to just cut this all loose to the world. It represents
a non-trivial amount of work, and why would I undercut my own ability to
license it appropriately? Seriously, how can you possibly justify that?
It would be one thing if I was riding a train of fat paychecks from some
"day job", but I'm not. This is what pays my bills, and I'm not giving
it up for free.

For fred to be released, it would first need to be split off from this
code. This is not a huge deal. I've done it before. When splitting
things off, some of the more generic aspects become less meaningful and
sometimes I replace them with smaller versions. That is, instead of
having a whole "Swiss army knife" class go out with the forked project,
I whittle it down to just the parts which are used by that one project.

With this "freestanding fred" code base thus established, I'd go about
making it somewhat less embarrassing. There are a few things in there
which are great when used on my own systems but really should not be
inflicted on anyone else. For those who are not familiar with this
sentiment, it's called empathy. There would be an unnecessary amount of
suffering needed to make it run as-is.

Let's also keep in mind that fred has no Makefiles. Seriously. None.
It also has no build scripts or any other kind of build files. This is
because it's part of my tree, and my entire tree is built with my
build tool. I just say "bb fred/getpost" or whatever, and it gives me
a binary.

As much as I want my little tool to take over the world of building C++,
it would be unreasonable to expect other people to just give themselves
over to it. It's also binary-only for the moment, and that means that
only those people on certain similar x86_64 Linuxes and Mac OS versions
can run it. Everyone else is out in the cold.

(Why is the build tool not open source? Easy, for the same reason fred
isn't: it represents a non-trivial amount of work, and why should I
compromise my own ability to license this and make some money from it?)

Given that relying on the build tool to exist would be foolhardy, that
means I would have to create a Makefile. This is not a big deal, but
making it handle the multitude of configurations is. That means
autoconf. I have plenty of experience with these things (which is
partly why I hate them so much), but getting it right still requires
effort.

To release something without this work done would mean that someone else
would have to do it. If someone was capable of doing that work, they
probably would have done it already, and wouldn't be complaining about
me not doing it for them. The whole point of doing this kind of
compatibility and release engineering work is to make it useful for
people who can't or won't get up to their armpits in HTTP, RSS, Atom,
XML, SQL, C++, Make, and autoconf grunginess.

Let me say that again: the people who are willing to meddle in such
affairs aren't even here. They've already gone off and hacked their own
thing in their language of choice. It's a feed reader, and the major
pieces already exist. It should be possible to go from a general idea
to a simple proof of concept single feed grabber in a day. If not,
perhaps that task should be delegated.

The kickstarter was an opportunity to have me do that meddling on behalf
of others on a grand scale. Once done, it would have been turned loose
to the world to succeed or fail on its own merits.

Finally, I wanted to pass on an insight from a reader: Fred Brooks, in
The Mythical Man-Month, says that a systems product costs at least 9
times as much as a program.
Given that, does it seem so unreasonable to ask for some support when
investing this kind of work in a fully-baked system?

For those who backed the project, I thank you. I wish we could have
gotten to where this would have worked, and then I would have been able
to fully commit to this project. As it is, I must now turn to other
projects and clients. Such is the reality of capitalism.

(If anyone wants to dump money on me just so I can write code and give
it away without having to worry about capitalizing on it, that would be
just fine, too.
You know how to reach
me.)

July 7, 2013: This post has an update.

The fred kickstarter has failed (July 5, 2013)

Well, the
fred kickstarter
has failed. I take this to mean that adequate interest does not exist
for this project as stated.

fred itself will still exist for my own purposes but I will no longer
attempt to push it as a subscription service (the original idea, way
back in 2011) or a one-time "let's turn this thing loose" (the
kickstarter).

I guess I'm not much of a "product person".

IRC netsplits, spanning trees and distributed state (July 4, 2013)

I last used IRC in a meaningful way at a prior job. There was a channel
where I would spend a lot of time lurking with other people who worked
on kernel and platforms stuff elsewhere in the company. I hadn't used
IRC in quite a while when I joined that channel and expected things had
evolved. They hadn't.

Oh sure, there are nick and channel services now, even though that's
kind of pointless on an IRC network which is completely controlled by
your employer and is only accessible from the corporate WAN. There are
features which will hide some or all of your address to keep people
from attacking you, but again, that's also pointless on a private
network like that one. There's even something which will kick off a
low-level PING of your client before letting you on just in case you're
one of those dreaded TCP sequence number spoofs. It's not like anyone's
using an OS which still is vulnerable to that, but, hey, it can't hurt,
right?

No, what I'm talking about is the fundamental model for linking servers.
Everything I can find about linking IRC daemons even right now in 2013
seems to suggest that any given client, be it a server or a user, can
"appear" down only one file descriptor at any given time. That means
the entire network is a series of links which themselves constitute a
single point of failure for people on either side of it. When something
goes wrong between two servers, this manifests as a "netsplit".

Whole swaths of users (appear to) sign off en masse when this happens,
and then will similarly seem to rejoin after the net relinks, whether
via the same two servers or through some other path.

This was the situation back in the '90s, and it's still happening today.
From all appearances, it would seem that it has been deemed "too
difficult to solve" and thus has been left alone. One problem mentioned
was that of duplicate messages flying all over the place.

It seems like this problem has been solved multiple times in different
venues. Fidonet echomail flowed out in a fashion where messages would
go to a node and then "explode" out to all of the subscribers, and would
travel out from there. They tagged those messages with "Path" and
"Seen-By" lines to keep them from going to places where they had already
been. I suspect they also had message ID histories to squash
duplicates.
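
A toy version of that scheme fits in a few lines. This is my own
sketch, not real echomail code: a message-ID history squashes
duplicates, and a Seen-By set keeps the "explosion" from revisiting
nodes, even when the topology has loops.

```python
# Hypothetical Fidonet-style flooding with duplicate suppression.
class Node:
    def __init__(self, name):
        self.name = name
        self.peers = []          # directly linked nodes
        self.seen_ids = set()    # message-ID history
        self.delivered = []      # what this node's users actually see

    def receive(self, msg_id, body, seen_by):
        if msg_id in self.seen_ids:
            return               # already arrived via another path
        self.seen_ids.add(msg_id)
        self.delivered.append(body)
        seen_by = seen_by | {self.name}
        for peer in self.peers:  # "explode" out, skipping Seen-By nodes
            if peer.name not in seen_by:
                peer.receive(msg_id, body, seen_by)

# Three nodes linked in a loop: without the checks, "hello" would
# bounce around forever.
a, b, c = Node("a"), Node("b"), Node("c")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
a.receive("msg-1", "hello", set())
print([len(n.delivered) for n in (a, b, c)])  # each delivered exactly once
```

A real network would also need to expire old IDs from the history,
since it can't grow forever.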

Or, from the world of Ethernet, how about spanning tree? This is where
you have a whole bunch of potential links nailed up and ready to go, but
the switches collectively come up with a path in which no loops are
present. The others are reserved and don't receive traffic normally.
If a link goes down, another one can be brought up. Downtime is
minimal... when it works, that is.

So, okay, there's also the matter of the whole "channel state" thing
which isn't particularly easy when you are talking about actions
happening in parallel with some amount of lag between servers. Even
though it seems like the days of multi-minute lags between servers are
gone thanks to the relative availability of large pipes, it's still a
problem which would need to be solved.

One solution would be to have a "master server" which maintained all
channel state, and requests from users would have to travel "upstream"
to it. This would solve the problem of parallel actions since there
would be only one reference point: that single server. Of course,
having that means you now have a massive single point of failure for
your network. Lose the stateful server and all of your channels are
toast.

That particular problem could take a page from the GT BBS network which
worked a little like Fidonet but had its own quirks. In particular, the
GT echomail scheme had "sponsors". Instead of having echo messages
"explode" at every point after leaving the origin system, they instead
first flowed upstream to the sponsor, much as an ordinary unicast
message might. Then they were batched up on the sponsor, and would be
released along with others in a daily "bag" (file).

To map this back onto IRC, each channel would have a sponsoring/master
server. They wouldn't all need to be in the same place. This would
remove the problem of having a single server which held all of the
power, but it still could have issues. For one thing, what happens if
that one server goes down? Some other server is going to need to pick
up the slack.

This starts calling back to matters of distributed state management and
consensus. That in turn gets me thinking
"Paxos".
For those who aren't familiar with it, it's decidedly non-trivial, but
what it gives you is easy enough: you have several systems which are in
charge of tracking state, and they can come and go over time. If
certain conditions are met, the state can continue to be read and
updated with confidence.
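
To give a feel for those conditions, here's a toy single-decree round
in Python. It's my own stripped-down sketch (no networking, no failure
handling, illustrative names), showing only the majority-quorum checks
the text refers to.

```python
# Simplified single-decree Paxos: a value is chosen once a majority of
# acceptors accepts it under a high-enough ballot number.
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest ballot promised so far
        self.accepted = None  # (ballot, value) last accepted, if any

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    # Phase 1: give up unless a majority promises to hear us out.
    replies = [a.prepare(ballot) for a in acceptors]
    granted = [prior for ok, prior in replies if ok]
    if len(granted) <= len(acceptors) // 2:
        return None
    # Adopt the highest-ballot value already accepted, if any; this is
    # what keeps later proposers from overwriting a chosen value.
    prior = [p for p in granted if p is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: the value is chosen once a majority accepts it.
    votes = sum(a.accept(ballot, value) for a in acceptors)
    return value if votes > len(acceptors) // 2 else None

quorum = [Acceptor() for _ in range(3)]
print(propose(quorum, 1, "mode +o rachel"))  # chosen
print(propose(quorum, 2, "mode -o rachel"))  # still the first value
```

A competing proposer with a higher ballot is forced to re-propose the
already-chosen value, which is the "read and updated with confidence"
property in miniature.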

None of this would be trivial. It would be a mighty big amount of work,
in fact. I suspect there might be a way to shoehorn some of this into
the existing client-side IRC protocol at the expense of some surprise on
the part of experienced users. Things like mode changes and kicking
people off a channel wouldn't actually happen just because you issued
them. They'd have to travel upstream and would happen on the master,
and only then would your local server apply it.

I suspect this lag is probably what historically scared people off from
any sort of "submit to the master" concept. The thing is, people
probably aren't trying to run IRC networks over "switched 56" leased
lines and other bandwidth-starved pipes. It's a different world now,
and it's probably not going to take multiple seconds to get a packet
out to a host and back again.

In re-reading the Wikipedia page on Paxos, I noticed that the vanilla
flavor does not handle malicious injections. I guess this means using
it on an IRC network with a multitude of unassociated operators might be
a bad thing. After all, server-level code hacks to allow IRC operators
to overstep their powers are definitely nothing new. Perhaps the
Byzantine variant would make more sense, along with extensive
instrumentation to detect and publish synchronization anomalies.

There are solutions to problems which don't exist, and then there are
assortments of technologies which might be useful for problems which
technically exist, but for which nobody may care about fixing. I
suspect the whole IRC netsplit thing is a case of the latter.

Perhaps the patent monster is lurking, and that's keeping people from
playing around in this space. That would be rather unfortunate, if so.

It's gone gone gone and it's not coming back (July 3, 2013)

That service was so useful.

It was free.

Nobody ever thought it could ever disappear.

Unfortunately, the people running it didn't see it that way.

They probably could have kept it running for a long time. It's just a
bunch of bits and bytes. But they didn't.

They decided to shut it down. The users were told.

Users set out in search of alternatives. Some found them. Others built
them. More than a few just abandoned the space entirely.

Some of those users might have been done with that kind of service.
Others just didn't want to start all over again and decided they were
better off not getting stuck in that kind of relationship for a second
time.

A few genuinely wanted to keep using this sort of thing but were unable
to find a replacement. Perhaps someone else had set them up in the
first place, and they were just happy users. Heavy technical lifting
was not their thing.

The rationalization and pleading started.

Why can't they keep it up? It can't use that many resources.

Why now?

Why is there no replacement?

The questions went on like this.

But there's one question which kept coming up.

Why didn't they just start charging for it?

That one's easy. They'd have to support it. You can't just charge for
something and expect to ignore paying customers. They get certain
amounts of leverage once that starts happening, and the provider can't
have that.

They'd have to start supporting it. It might not be good support, but
it would have to be something. Otherwise the paying customers might
start grabbing their pitchforks and torches.

More than anything else, it might just come down to one thing: charging
money means supporting it, and supporting it means dealing with people.

They hate that.

Humans are so squishy and nondeterministic while computers are so
wonderfully cold and calculating.

Anything which would involve squishiness is to be shunned. Only the
heartless pursuit of technology is worthy of their efforts.

...

How do you tell an introvert programmer from an extrovert programmer?
The extroverted one is looking at your shoes.

]]>My "good book" of TCP/IP Networkingtag:rachelbythebay.com,2013-07-02:book2013-07-02T21:22:43Z
A bunch of people have written in asking which book I was referencing in
my
post
about simultaneous TCP opens. I should have included some links in that
post. Sorry about that.

Upon going to get that link to Amazon, I was surprised to discover that
it was updated in 2011 and now has a second edition (referral link!)
with another author at the helm since Stevens died in 1999. I never
thought of this as the sort of book which would need updating, but I
guess the Internet has evolved a bit in the past 20 years. Amazing!

Back then, I also had a huge book called PC Interrupts. It talked all
about the weird things you had to do to talk to DOS itself, stuff like
Netware and Lantastic, and a bunch of other software which was common in
that world. It all boiled down to defining values in registers like AX
and BX and then kicking off a software interrupt. Then it would tell
you what to expect and which registers would hold the results.

The first edition of this book included a whole section on network
interrupts. By the second edition, they included a note which said that
the book would have been too big to bind, and so they had to create a
spin-off called Network Interrupts. Notice that these also came out in
the beginning of 1994.

These three books plus a whole bunch of messing around on my home
network is what taught me a lot about the logical layers of networking.
I never played with raw voltage levels and other stuff at the
physical layer, but just slinging packets around and watching
stuff happen was quite useful in and of itself.

I've since given away the Interrupts books since I no longer develop for
DOS systems, but the Stevens book has been on my bookshelf this whole
time. Once in a great while I still break it out to show someone the
diagrams and other bits which made it such a great reference.

It can be handy to have around when someone doesn't believe that you
could possibly know what you're talking about when some aspect of
TCP/IP networking comes up. Pro tip: judging someone's technical
abilities based on how they look is a great way to get in trouble.

Simple-minded literalism and avoiding the big picture (July 1, 2013)

Back when I was in school, I had more than a few classes which were
utterly silly and yet couldn't be avoided. The worst ones centered on a
single implementation of a certain technology. That is, instead of
being about spreadsheets, it was about Excel. Instead of being about
word processing, it was about Word.

There were some seriously broken assignments which went along with these
classes. One of them was something like this:

So and so created an Excel workbook to track his video game collection.
He has headings and wants to sort on Name, Date Purchased, Purchase
Price, Size and Location for each game. His wife wants to know how much
he's spent on this hobby, and his son wants to know when he started this
collection. He also wants to know which games were purchased prior to
2000.

This screams "SQL database and a handful of SELECT queries", doesn't it?
Being able to say "use a database, you idiot" would have felt good, but
they wouldn't have accepted it, since this was a spreadsheet class, and
specifically was all about whatever version of Excel was current at the
time. I could have said it but it wouldn't have accomplished much.
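
For contrast, here's roughly what that forbidden answer looks like,
sketched with Python's sqlite3 module. The table layout and sample
rows are invented for illustration.

```python
import sqlite3

# The hypothetical game collection as a table instead of a worksheet.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE games (
    name TEXT, purchased TEXT, price REAL, size TEXT, location TEXT)""")
db.executemany("INSERT INTO games VALUES (?, ?, ?, ?, ?)", [
    ("Doom",        "1994-03-01", 40.00, "4 MB",   "Shelf A"),
    ("Quake",       "1996-08-10", 50.00, "30 MB",  "Shelf A"),
    ("Half-Life 2", "2004-11-16", 55.00, "4.5 GB", "Shelf B"),
])

# The wife's question: total spent on the hobby.
total = db.execute("SELECT SUM(price) FROM games").fetchone()[0]

# The son's question: when the collection started.
started = db.execute("SELECT MIN(purchased) FROM games").fetchone()[0]

# Which games were purchased prior to 2000?
old = [row[0] for row in db.execute(
    "SELECT name FROM games WHERE purchased < '2000-01-01' ORDER BY name")]

print(total, started, old)
```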

It took me a while to realize what I had to do to make this work. Part
of the problem was understanding that my own internal monologue for how
this should work was down on step F or G and they wanted me to explain
step A. That is, instead of talking about the fundamentals of the
problem and proposing something which went with it, I had to get down to
basics and just cough up a list of actions.

Create rows and columns. Columns are the name, date purchased, and so
on. The rows are individual games. The various cells get filled in
appropriately. The columns are tagged with whatever formatting might be
needed, if any. Then they should be resized to avoid truncation if at
all possible.

This is all obvious to me, but I had to write this as if for an audience
which had no idea what was going on. To someone with the same context
as myself, there would be nothing to say, but for this hypothetical user
scenario, it would be rather different.

This assignment was actually conducted using the class online forum
where everyone first posts their essay and then responds to those of
several other students, so assuming anything about the reader was
dangerous at best. Some people in there had probably never so much as
touched a spreadsheet before this class and would have no idea what I
was talking about unless I was very careful with my wording.

I guess this video game tracking situation doesn't really exist in my
world, since so many of these things can be solved as a "Simple Matter
of Programming" (tm). Sure, most of the time when I describe something
like that, it's intended to be snarky and calls attention to the fact
that a problem is distinctly non-trivial, but not always. Some
problems are in fact the sort of thing which can be handled by gluing
together a couple of existing modules.

This one would need some kind of database, and the exact implementation
isn't that important. It could be a flat file, MySQL or Postgres,
SQLite, or some funky NoSQL thing and for this problem space it would
not matter. There just isn't enough data for it to make any difference.
Then there would be just enough of a frontend to collect data, and a
viewer to render the results. That's it. A specific problem gets a
specific solution. Trying to be generic would be hard, so I punt on
that.

The spreadsheet neophytes don't live in a world like this. They need a
very blunt hammer to solve their problems. Maybe the world would be a
better place if they had tools to let them do the Right Thing which
didn't create too much of a stink. In that sense, having them use
spreadsheets might not be so terrible.

The essay I wound up writing focused on the exact things to do in Excel
with none of the "bigger picture" / existential view of what was
actually going on. I talked about the problem at hand, and then
described which labels should be created and the cells where they should
go. Then I explained where the different data points would go and the
cells which would hold them.

I even went as far as describing the right-click action on the
top-of-column label for dates to enable the magic formatting for that
kind of data.
This way it would know to treat those numbers as dates and not guess as
to what it might mean. This repeated for other columns with potentially
interesting data - prices are currency, for instance.

Then I laid out how to select the data (dragging, or clicking that weird
corner box thing) in order to feed it to the sorting and filtering flow.
Everything else followed from that.

It answered everything which had been in the problem description with
stupidly simple language and offered absolutely no "outside the box"
thinking.

...

A few people wrote in regarding my half-baked idea to
bring foreign planes
to places where certain things are impractical or even illegal. There
were the obvious problems of people figuring out what was really going
on and then refusing access to those planes.

Still, there is some precedent for this. I learned about the "butter
ferries" between Eemshaven and Borkum. They would travel across an
international border and would sell tax-free items. They also sold
surplus butter and sugar which couldn't be sold normally due to trading
restrictions, hence the name. It seems this went away in 1999, though.

There is also at least one organization which takes their ship around to
various ports and will take people into international waters before
helping them out. I appreciate that this is simpler from a perspective
of not running afoul of the airport's governing entities, but there's a
problem here: large chunks of my country are nowhere near an ocean.
People in the so-called "flyover states" are probably land locked and
would have to travel quite a distance to make it to a port.

Then there's the relative size of these states. Imagine being in Texas.
Sure, it has a huge coastline on the Gulf of Mexico, but what if you
aren't living somewhere along the coastal bend? You'd have to get
there first. Texas is the kind of place where you can drive at
high-but-legal speeds (85 mph in places!) and still burn a
whole day just to get out of the state. Then there is the matter of
actually funding it.

I think Interstate 10 is something like 900 miles from one end to the
other in Texas alone. I think my car gets something like 300 miles on a
single tank of gas, so that's three tanks of gas. I think it holds 18
gallons, so 3 tanks is 54 gallons. Gas prices are something like
$5/gallon (here in CA, obviously somewhat cheaper in TX, but go with me
here...), so figure that's up to $270 just for gas... one way.
Those numbers are rough, but you get the general idea.
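
Those rough figures do hang together, for what it's worth:

```python
# Back-of-the-envelope math using the estimates from the text above
# (all of them guesses, not measurements).
i10_texas_miles = 900
miles_per_tank = 300
gallons_per_tank = 18
dollars_per_gallon = 5

tanks = i10_texas_miles / miles_per_tank        # 3 tanks
gallons = tanks * gallons_per_tank              # 54 gallons
cost_one_way = gallons * dollars_per_gallon     # $270, one way
print(tanks, gallons, cost_one_way)
```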

Meanwhile, a plane which isn't monster-sized can land at a great many
airfields. Or, hey, maybe they can have planes as feeders to the ships.
Whatever. I just want to see people start treating insanity as damage
and start routing around it as creatively as possible. Tell those
people what they can do with their law-making.

...

In response to my
update about fred,
I got a question about whether I had looked at other open source
server-based feed reader systems. My answer is easy enough: I did not.
Back in 2011, I decided it was time to get away from Google Reader and
blew a day or two gluing stuff together to run some tests: fetch a
URL, parse the XML, then dump out a static HTML page with a bunch of
posts in it.

It worked well enough as a proof-of-concept that I decided to evolve it
to handle both Atom and RSS and had it use a proper database, track
things in terms of individual posts and different user accounts. It
basically grew to be a "proper" feed reader with a clear division of
responsibilities without going overboard with structure and
"architecture astronauting".

I did not look at any alternatives to Reader, and didn't look at
downloading anything. Such pursuits are usually a waste of time, since
the things I find aren't anything I would want to run in the first
place.

One of the better-known "run your own backend" systems is based on PHP.
I don't do PHP. I actually
go to lengths
to avoid PHP on my systems. That means anything written in it won't
work for me, and all of those programs are no longer options.

I feel the same way about mod_python, incidentally, so toss that out
too. It would need to run as CGI. Besides, I'm not crazy about having
Python stuff sticking around, since I'd eventually need to do something
to it, and then I'm writing Python again. I'd like to avoid returning
to that world if at all possible.

I'm not really happy with any interpreted language in this kind of role,
so cast out anything written in Ruby, and yes, things like node.js, too.
Forget Perl for the same reason.

What's left? Well, without getting into crazy stuff that will require
non-trivial tools to compile on my machine, there's C, C++, and Java. I
don't do plain old C any more if I can help it, so that's probably out.
I'm definitely not going to write Java for myself, so that's out too.

That basically leaves C++, and that's what I used for fred. The fact
that I already had a nice little bunch of libraries in C++ from my
earlier post-corporate-job experiments didn't hurt.

For me to be happy with something I had downloaded, it would either have
to be perfect, or it would have to let me hack on it so I could deal
with any quirks. That basically means something written in C or C++,
since dealing with the others just annoys me.

So this is where reality catches up with me. Who writes solid code in C
or C++ any more? It seems like the shiny is long gone from
that world. Everything new seems to be in some other language now:
JavaScript (via node.js), Ruby (via Rails), Haskell, Clojure, and
all sorts of other stuff. This reduces the set of possibilities
greatly.

Now, if someone were to come along and give me exactly what I want in
such a way that I never needed to fiddle with it, I'd probably switch.
I don't run fred because I want to be known as the author of a feed
reader. I run it because I want a feed reader and nothing else looks
appealing. It's also a waste of time to go looking for alternatives
when there are no problems with what I have now.

The time to catch me and get me to consider an alternative was about two
years ago when I first wanted to bail from Reader. Now that fred exists
and requires no regular maintenance, there's little use in switching
yet again.

...

Yesterday's
post
about TCP simultaneous opens elicited a question from an anonymous
reader. They mentioned using IRC and using DCC to shoot files around
and wanted to know if that's what I was talking about.

Well, no, not really. DCC was (and still is, I guess) a particularly
nasty way to send files between two IRC clients. It seems to have a
whole extra layer of ACKs which need to be lobbed back at the sender to
keep things going. Never mind that it's running over TCP and thus the
socket itself is already as reliable as it needs to be.

There have been a bunch of hacks to this, but it's still fundamentally
one of these things where one host starts listening on a port (TCP
passive open) and another one (attempts to) connect to it. For this to
work, there has to be a way to get all the way to that listening host.
Sometimes, this is only possible if the firewall(s) or NAT devices in
line know what's going on by snooping in the IRC session for the "^ADCC
SEND blah blah blah^A" and noticing what port number has been declared.
Then they make sure to cut a hole for it.

If you've never used this before, I'll give a scenario to better explain
it. You're on a typical machine running a dumb old IRC client behind a
relatively stupid NAT box. You want to send a file to someone using
old-school DCC SEND, so your client starts listening on some port (call
it 12345), and then sends something like this to the other side:

^ADCC SEND leet-warez.rar 192.168.1.5 12345^A

(I forget if it used a dotted quad or some other representation for the
IP address. Maybe it was a 32-bit integer like
MacTCP
had. It's been a while...)
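For what it's worth, my recollection is that classic CTCP DCC sent the address as exactly that kind of thing: the four octets packed big-endian into one unsigned 32-bit integer, written out in decimal. A quick round-trip sketch (the function names are my own, not from any IRC client):

```python
import socket
import struct

def ip_to_dcc_int(dotted):
    """'192.168.1.5' -> 3232235781 (octets packed big-endian)."""
    return struct.unpack('!I', socket.inet_aton(dotted))[0]

def dcc_int_to_ip(n):
    """3232235781 -> '192.168.1.5'."""
    return socket.inet_ntoa(struct.pack('!I', n))
```

So the actual message would have carried something like "DCC SEND leet-warez.rar 3232235781 12345" between the ^A delimiters.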

Right away you have two problems. First, the dumb NAT box doesn't know
about this, so it won't allow connections back to its port 12345.
Second, that won't even matter, because the IP address your IRC client
sent is the internal RFC-1918 address for that host, and it
is unusable by the other host (unless it happens to be on the same
network as you, that is, but let's assume they aren't).

For this sort of thing to work, the NAT box has to watch for such
traffic and rewrite the outgoing DCC SEND to have the actual external IP
address for that connection. Then it also has to set up a mapping to
allow the inevitable SYN to port 12345 from the other host to go through
its firewall and reach your client.

If you're running IRC in some way that your NAT box can't see it (like
over SSL/TLS) or on some "nonstandard port" (other than 6667, even
though some would say IRC should only be on 194), forget about having it
help you out.

Assuming nothing else got in the way, simultaneous open would remove the
need to do this packet inspection, rewriting, and hole-chopping for
incoming connections. There's still the matter of determining an
appropriate external address for hosts behind NATs, but that's where the
third party comes in. Whatever mechanism is used to help hosts find
each other and coordinate these connections can see which addresses
they appear to come from and pass those along to the other client.

Still, this is a solution in search of a problem, and that's why I
believe none of this really matters.

]]>TCP simultaneous open and peer to peer networkingtag:rachelbythebay.com,2013-06-29:sim2013-07-02T21:21:33Z
Many years ago, I picked up a copy of a book which taught me a great
many things about networking as we use it on the Internet today. While
I haven't had a reason to refer to it recently, there was a time when it
never left my desk. In the days when I was actively playing with packet
drivers and tcpdump and ka9q and things like that, it was essential to
understanding what actually happened on the wire.

One thing which stuck with me from back then was an interesting quirk
about the way TCP operated. The book uses a bunch of great timeline
diagrams to show which packets are emitted by hosts and when, and this
explains things like a three-way handshake or how you tear down a TCP
connection. After it gets done with that, it then goes on to describe
the "simultaneous open". I thought it was fascinating.

In that situation, both hosts somehow know to open connections to each
other at the same time. They configure the connection explicitly rather
than grabbing an ephemeral port. The exact nature of how they figure
out the port numbers to use and how they know when to connect to each
other isn't given, but that's not important. What's interesting is what
happens when they do this.

At a glance, you might think this results in two connections. One would
be the connection from A to B, and the other would be from B to A.
However, because there's an explicit specification for how to handle
this, it instead collapses into a single connection. The SYNs which
cross each other on the network wind up (eventually) bringing it up.

If you try to do this with two hosts on the same network, it might be
hard to do. If one of the hosts gets going just a bit too early, it
probably will hit the other end before that one is ready to do anything.
That host won't be expecting the connection on that port, and will fire
off a RST. The too-quick host will then get that and assume it's not
listening and so will give up. Then when the slower host "dials out",
the same thing may happen.

It's easier to make this happen if there's some distance between the
machines, but it also helps if they aren't going to emit a RST when a
connection fails. It just turns out that ordinary consumer-grade
Internet connections happen to provide both of those traits most of the
time. You're probably not "right next to" anyone of interest, and your
"router" (plastic consumer-grade NAT box) probably will ignore any
packets which don't match an established connection.

This means if one packet (from A) should make it to the other end
(B) "too soon", then the NAT box at B will probably just drop it.
Moments later, host B will start its own connection attempt, and that
will add an entry to the NAT box's table of valid connections. When A
retries and retransmits shortly after that, it will match and will be
let through.

When this works, it means you can take two hosts which both cannot
receive incoming TCP connections from the outside world and manage to
connect them to each other. You just have to have all of the details
right: addresses, ports, and timing, plus nobody in the middle flinging
RSTs at you.
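There's one quirky way to watch the crossing-SYN state machine without needing a second host at all: on Linux (and, as far as I know, most BSD-derived stacks), a socket that connects to its own bound address completes a simultaneous open with itself. No listen() anywhere, and yet a connection comes up:

```python
import socket

# Bind to a concrete port but never listen(); then connect to ourselves.
# Our SYN "crosses" itself, the kernel's simultaneous-open handling kicks
# in, and we end up with one self-connected socket.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 0))        # let the kernel pick the port
s.connect(s.getsockname())      # SYN_SENT meets its own SYN
s.sendall(b'hello self')
echoed = s.recv(32)             # the bytes come right back to us
s.close()
```

It's a party trick, but it is the same transition in the TCP state diagram that two cooperating hosts would exercise.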

So that's what I learned from this book: a weird little trick which can
be used to do what might seem impossible at first glance. Sure, there's
the matter of synchronization, but that's what third parties are for.

I've been disappointed that nothing seems to use a simultaneous TCP open
to get things done. There are plenty of ways for it to go wrong, like
if there are nefarious devices in-between which mangle port numbers or
worse. I guess it just wouldn't be sufficiently reliable for someone to
build a service on top of it.

If it could be made to work reliably, it could be quite a thing. Hosts
could connect to a shared server somewhere to find each other, and then
agree on specifics and bang away at their connection attempts until they
succeeded with a simultaneous open. Then there would be a path between
the two hosts which did not involve the third party and they could
communicate with somewhat better privacy.

I had this vision of a bunch of ordinary client systems on ordinary
consumer-grade Internet connections with incoming connections filtered
and NAT in place, and yet they'd still be able to reach each other
directly with TCP. They'd get all of the utility it provides without
trying to reproduce it over UDP with weird hole-punching tricks or
the craziness that is requesting holes in the firewall with UPNP.

Exactly what would then run over those connections would be up to the
users. I'd personally love to see everyone have their own little
"platform" for hosting messages, cat pictures, and baby videos which
requires nothing more than their own machine at home. The third party
"sync server" would only be used to bring up the peer connections, and
the actual pictures, chat messages and so on would then run over those
peer to peer links.

User data would never go anywhere near the sync servers, in other words.
This is important.

I even have an idea of a model which might be used for inspiration.
Flip back about 20 years and think about what BBS networks looked like.
Now imagine a BBS network where every one of your friends and relatives
runs a personal node, and they all talk to each other to do the whole
"store and forward" thing. Add a dash of cat pictures and memes and now
you're caught up to the present day.

My most recent inspiration for this general idea was an XKCD comic in
which two characters try to exchange a large file. Burning it to a CD
or DVD is impractical, and that means e-mailing it as an attachment is
also a bad idea. Both hosts are behind NATting firewalls, so direct
connections seem impossible. Neither user has a way to bounce
it through a third host somewhere.

A bunch of services exist which do a "transient dropbox" type of thing
where person A uploads to them, and then person B downloads from them.
I don't like that. It sends the data to people who have no business
seeing it. It may also be woefully inefficient depending on where the
three hosts are located.

Sure, obstacles exist, but I still think it would be interesting. Of
course, since it adds little in the way of new features, no normal
person would want this. They can already do their instant messaging,
slower messaging (e-mail), picture and video sharing through any number
of services which are run by third parties. Sure, those services morph
and change and die and sometimes do creepy or stupid stuff, but for the
most part, the end users manage to get their content out.

The XKCD situation will still exist, but it won't bother people to the
point of having to worry about this sort of thing.

I expect this entire space to remain in the realm of "technically
possible but dead on arrival in the marketplace" for a very long time.

July 2, 2013: This post has an update.
]]>Remote control and heavyweight authenticationtag:rachelbythebay.com,2013-06-28:auth2013-06-29T02:20:19Z
After
writing
about ssh yesterday, I started thinking about alternatives, and the
sorts of responses I was likely to get. Sure thing, within a few hours,
a few people wrote in to basically tell me that I was nuts and that ssh
would work just fine.

It's not black and white here. ssh can work. It can also not work.
The exact requirements matter with this kind of thing. If you're doing
one short-lived command on 20 machines, it'll probably work. If you're
doing a long-lived command on 200 machines, it might not be that easy.

Fortunately, I also got some reports of people who have run into the
same types of problems. One mentioned hitting a wall at merely 50
connections. That's harsh, and that's exactly what I was talking about.

If ssh isn't going to handle every situation, then there will be times
when some other technique will be required. I'd start by thinking about
what's really going on with your system. If what you actually want is
specific chunks of data at regular intervals, perhaps it would be better
to write something which could serve those up. Then it could serialize
a message and fire it at you over the network. This is a great
opportunity to avoid having to parse text output from tools like "df" or
"uptime".

ssh provides both encryption and authentication, so any replacement
needs to at least consider doing that in order to allow the same level
of service. Now there's the matter of doing some kind of crypto, and
doing it properly. Hopefully this means deferring to some proven
library with a good track record instead of trying to roll your own.
Still, there are some issues which can crop up with this technique.

For example, what if you intend to have a whole bunch of different
services/daemons on a machine, and all of them need to do this auth
stuff? Are they all going to link to the auth code? Will they all have
to keep their own crypto state with random numbers and all of that?
That sounds like a lot of overhead to me.

I also wonder about what would happen when the inevitable security
upgrade comes down the pipe. Maybe your system was built without the
ability to revoke a certificate because it never came up before. Then
you fired someone for the first time and really need to make sure their
old workstation can't shoot RPCs at your services.

If all of that crypto code happens by virtue of linking in a library and
pushing out a bunch of new binaries, that's going to take a nontrivial
amount of time to pick up an upgrade. Even if it's all API-compatible,
just having to rebuild and repush everything might be a lot of work. If
something in the interface changed and the clients need to be updated,
forget about any sort of speedy rollout. It just won't happen.

I would assume the logical outcome would be some kind of daemon which
runs on the local machine and keeps a crypto engine humming at all
times. It might take connections over loopback or even a Unix domain
socket. Client code would link in a library, sure, but all that library
would do is call out to this long-lived crypto engine server to present
some credentials and say "is this okay?".

Now imagine in this world if something needs to change in your crypto
regime. You could just replace the crypto engine server thing on each
machine instead of relinking and repushing all of the binaries for
everything else. As long as the new crypto engine server can still talk
to the existing crypto engine client library code which is already baked
into all of those binaries, a lot of work should be avoidable.
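Here's a toy sketch of that split: one long-lived daemon on a Unix domain socket answers yes/no, and client binaries only carry the thin shim that asks it. The one-line protocol and the token check are made up for illustration; a real engine would be verifying certificates and the rest of it.

```python
import os
import socket
import tempfile
import threading

VALID_TOKENS = {b'letmein'}      # stand-in for real credential checks

def serve(sock):
    # The "crypto engine" daemon: accept, check, answer, repeat.
    while True:
        conn, _ = sock.accept()
        with conn:
            token = conn.recv(64)
            conn.sendall(b'OK' if token in VALID_TOKENS else b'DENY')

def ask_engine(path, token):
    """All the 'client library' ever does: defer to the local daemon."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as c:
        c.connect(path)
        c.sendall(token)
        return c.recv(8)

path = os.path.join(tempfile.mkdtemp(), 'engine.sock')
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)
srv.listen(4)
threading.Thread(target=serve, args=(srv,), daemon=True).start()

ok = ask_engine(path, b'letmein')    # b'OK'
no = ask_engine(path, b'wrong')      # b'DENY'
```

Swapping out the daemon changes the policy everywhere at once; none of the callers need a rebuild as long as ask_engine's little protocol holds still.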

I think of it like how the alcohol thing works at an "all ages" show at
a venue which serves booze. Everyone gets in a line out front, and then
the bouncers start checking tickets and IDs. They have good light and
can make sure the picture on your ID matches your face, and they can
look for obvious forgeries. Then they mark your hands appropriately -
if you aren't 21 or can't prove it, or if you just don't want to drink,
you get the big black X on your hands. Otherwise you might get their
particular branded stamp and that's it.

Inside, the bartenders can look for the X or special stamp instead of
attempting to do the whole ID check routine again. The light inside a
rockin' club probably wouldn't allow for a good match of the photos and
it takes time away from selling drinks. If the people inside can trust
the marks, then they can apply the restrictions without having to
actually check IDs right then and there.

Now imagine one day something changes, and the security regime needs
another check added. Would you rather re-train your 20 bartenders and
related drink-slinging personnel (hey, it's a big place!) or just the
people at the door? As long as people still fit into the "booze OK"
and "no booze" categories in the new scheme, the "clients" (bartenders)
don't need to be "updated" (re-trained).

Again, this is one of those things where you need a large operation to
really start feeling the pain. If your "club" (server farm) is just a
handful of "people" (machines), then having everyone check IDs probably
isn't a big deal.

You just can't expect the routines used by a cute little operation to
scale infinitely. At some point, you have to step back and rethink it.

]]>Taking ssh too fartag:rachelbythebay.com,2013-06-27:ssh2013-06-28T03:56:27Z
ssh and shell scripts together represent one of those methods which
seems to draw a lot of people in. It's a way of working which
practically begs for you to build systems on top of it. It can do so
many things. If you want to run a command on a distant system and you
have done some basic setup work (keys and such), it'll do it for you.
I think it's actually a little too tempting and it's easy to take it
too far.

You don't have to actually log in, since you can tell it whatever needs
to run and it will run it directly instead of starting a login session.

ssh host uname -a

So now you know that 'host' is running Linux x.y.z with the foobar
patch. Good enough.

Things change. Now you want to run it on 20 systems. This might happen
with a simple for loop or something like that. You also adjust the
command a bit to keep the output from getting mixed up.

for i in `cat list_of_hosts`
do
ssh $i 'echo `hostname` - `uptime`'
done

Of course, if you used "" instead of '' to wrap that command intended
for the remote machines, you have a new problem: the backticks expand
locally, so you probably just ran "hostname" and "uptime" on your own
machine.

That's going to run them in series. If any of the machines are
particularly slow, then everything after it is going to suffer. ssh
will eventually give up if the machine is down. You might still get
stuck if the machine is up but isn't particularly responsive to
userspace for whatever reason. Machines which are chewing swap will
tend to do this.

So okay, you think, let's take these things out of series. Now it's
time to run them in parallel. The command is adjusted to stick a
"&" on the end and now these ssh commands run in the background.
All 20 of them are launched at the same time.

Of course, the first time this is run, the script drops out just as soon
as it can start all of the ssh sequences, and the results come back all
willy-nilly. Some of them hit while the script is still running, and
others return much later as they manage to succeed. Others just fail.

This can't stand, so now it's time to adjust the script to make it wait
for the various child ssh processes to finish. This is simple: just a
single "wait" at the bottom will make it do that, and as long as your
machines are basically responsive, you're probably okay. Of course, if
any of them are slow, then your entire script hangs at the bottom. If
they are so wedged that running commands is impossible but ssh logins
still nominally work (and this happens!), your script will sit forever.

Now what? I guess you have to change the script yet again to add some
kind of timeout for each call. Unless your system already has some kind
of "alarm" helper which will run a command with a timeout (enforced by
SIGALRM, perhaps?) then you either get to write one or start getting
really clever with your scripting.

Maybe you just punt and just put something really horrible at the
bottom.

( sleep 15; kill $$ ) &
wait

So now if it takes too long, the script will still manage to exit... by
killing itself. Of course, all of those ssh processes are still in the
background, and they could still manage to succeed somehow. Or they
could stay around for a very long time. This means yet another change
to make it kill the process group instead of just the process.

( sleep 15; kill -15 -$$ ) &
wait

This whacks any children which are still hanging around, and so okay,
now you're probably not too likely to still have "klingons". In theory
the ssh children could go off and become their own session leaders, but
in practice you'll probably get away with it.

Once this "system" reaches this stage, it will probably be good enough
to keep working for a while. Then, some day, someone will start adding
more systems to the list, or will otherwise come up with a workload
which is far bigger than 20 systems, and it will get ugly again.

Have you ever looked at what sort of resources an ssh connection
consumes? It might be "only" 2 MB resident, but how far can you take
that before it starts to be a problem? How much memory does your
machine have, anyway? How about CPU time, or network bandwidth? Do you
really want all of that stuff running in parallel?

Okay, forget about the steady-state requirements. How computationally
intensive is it to bring up all of those connections at the
same time? Do you really want to do that?

I imagine at this point, people in this situation frequently find
themselves trying to write some kind of batch scheduler thing... in
shell scripting languages... for ssh connections. This is so it can
kick off a bunch of connections in parallel (so as to avoid the
serialization problem from before), but not too many (so as to avoid
melting down the controller machine).
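Roughly what that scheduler grows into, once it escapes shell: at most N commands in flight, each with its own timeout. In this sketch 'echo' stands in for ssh so it runs anywhere; swap in something like ['ssh', host, 'uptime'] for the real thing.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = ['host1', 'host2', 'host3']

def run_one(host):
    # One bounded unit of work: run the command, but never wait forever.
    try:
        out = subprocess.run(['echo', host],      # stand-in for ssh
                             capture_output=True, timeout=15)
        return host, out.stdout.decode().strip()
    except subprocess.TimeoutExpired:
        return host, '<timed out>'

# The pool is the "batch scheduler": 20 in flight, no more.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = dict(pool.map(run_one, HOSTS))
```

Which is to say: by the time you've correctly handled parallelism, timeouts, and stragglers, you've written a small job runner whether you meant to or not.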

Of course, all of this assumes the connections will be short-lived, like
my example to run 'hostname' and 'uptime' shown above. What happens if
these connections have to last a long time? Now you don't have the
luxury of letting 20 run, then finish, and then starting 20 more. Now
you somehow have to figure out how to run 40 in parallel. If 40 is
okay, then you have to figure out 80, and so on.

Does this sound far-fetched? If you try to write a system which does
cluster-scale testing of large software systems, you might find yourself
trying to get something to bring up hundreds of ssh connections
in parallel... just for one test. Meanwhile, another test will
be running on another cluster with another few hundred machines, and it
will also need a bunch of long-lived ssh connections.

Why long-lived ssh connections? Easy. They're running "tail -f" on a
bunch of system files while the test does its thing. You know, "just in
case". Maybe they're using ssh as a command channel for the testing
framework. Whatever. The point is, you can't tear it down. It has to
stay up.

This sort of thing never gets easier. It just gets more and more dark
as you try to find new ways to work around the latest problem which has
cropped up.

Maybe this is why I tend to think "oh, that's cute" whenever someone
releases yet another system management framework which runs over ssh.
To me, it's cute because it might work for a handful of well-behaved
machines. It just seems likely to start dragging and eventually
breaking once it's time to scale up beyond some point.

If you don't know where that point is, and don't know how to find it,
that's okay. It'll find you.

]]>A C-64 intro with a quirk, and finding that quirk againtag:rachelbythebay.com,2013-06-26:intro2013-06-27T04:09:28Z
I've always had a soft spot for the scrolly
raster graphic hacks
which used to show up on demos and in front of cracked software on the
C-64. While it seemed that every pirated program had some kind of intro
back in those days, these scrolls were also used to distribute
information.

One day back in 1990, I downloaded something which was nothing but a
scroller with some music and a whole bunch of text. It was advertising
a "SFU" party which was to be held at a motel somewhere in town.
Apparently SFU meant "Sysops For Unification" but I have no idea what
that was anymore. It might have been a warez group or something else
entirely. That is: I have no idea what the unification was about.

I don't think it was fighting Ma Bell, since that one was called
COSUARD -- the "Coalition Of Sysops Unified Against Rate
Discrimination". (How I still remember this stuff, I may never know.)

Somehow, this party announcement made it onto one of my floppies and
rode out the years in a box in my closet somewhere. Then, one day not
too long ago, I built a cable to recover this data and started making
images of everything which could still be read. This was one of the
files which made it.

I've managed to transcribe the entire thing. For the sake of history
and folks looking for bits of the "scene" from those days, here it is:

On July 14, 1990 SFU is planning a user party. The 3rd annual SFU user
party. For complete directions to the party, you can call any of the
following SFU BBS'S: Phase 4 713-371-0121 Control Room 713-558-5016
Texas 64 713-441-7138 City on the Edge of Forever 713-937-9318 Rangers
Den 713-852-9772 Jammer's Place 713-458-7283 Cyber Space Express
713-688-8203 Improb. Drive-in 713-339-3373!!! and attention machine
language coders! There will even be a special contest! The prize for the
best SFU demo will be a 2400 baud modem, so everyone sharpen their ML
abilities. For more complete information on the demo contest, simply
plug in yer modem and give one of the SFU BBS's a call. Credits:
Coding/logo/sprites/character set by Panther Modern Music by Alpha
Flight, Look for redline BBS coming soon and possibly the return of
Mike's game room. Call code zero vmb at 1-800-242-4674 box #745 (Panther
Modern #746 Greenleaf box #747 and Neurofire box #748) Also call code
zero hq BBS: Pet Semetary at 713-580-1470 L8R SK8RS! Panther Modern
-WRAP-

What a wild bunch of text, right? I liked how they admitted how they
had a bunch of voicemail boxes ("vmbs") set up, presumably obtained by
cracking into a corporate system somewhere.

Remember, this was a scroller, so it was a bar at the bottom of the
screen which displayed a handful of letters at a time. They would
travel from right to left, and as some fell off the left, new ones would
appear at the right. This continued until it hit the end of the
sequence, where the author had signed it with his handle ("Panther
Modern") and something unusual: "-WRAP-".

That last bit about "wrap" was nice, since it meant you didn't need to
keep watching for new data. You had hit the end and it was just going
to display the same long string from the beginning again. You could
move on to something else. Most people didn't do this in my experience.

Back on the day when I recovered this program and did this
transcription, I did some digging around. I wondered if this "Panther
Modern" character was still out there anywhere. What I found was
surprising, to say the least.

It was a web page with Flash embedded in order to display raster bars,
weird pixelated art, and yes, even a scrolling message. Any doubts
about whether this was the same person were eliminated when it reached
the end: "**wrap**".

Unfortunately, that site is no longer active, but I managed to catch
that screen shot while it still lived. Maybe this person will resurface
again with more retro awesomeness some day.

Has the music finished yet? Did you figure out the song? I'll give you
a hint: Roxette.

]]>Just fly in a friendlier jurisdictiontag:rachelbythebay.com,2013-06-25:foreign2013-06-26T03:27:24Z
There's a lot of controversial stuff going on in politics at any given
time. Some of it tends to revolve around what might be legal in a
certain jurisdiction. In this part of the world, some states are trying
to stop certain activities, and they sometimes succeed at it. A lot of
this is in the news this week, and particularly this evening. When
these new laws succeed, it basically means the people who live in those
states must now exit those states in order to make something happen.

Of course, if you live in a really big state far from any borders, this
is no small task. It takes a fair number of resources in order to be
mobile. Just look at what happened in 2005 with the people who were
stuck in New Orleans during Katrina since they simply did not have any
way to travel short of walking, and that wasn't nearly enough.

If you're in a place where it takes a day of driving just to get to the
next political entity, and even that might not be compatible with your
goals, you will probably find yourself unable to accomplish certain
things. Whether that's smoking something, drinking something, or
certain medical procedures, you might not be able to do it at home...
legally, that is.

If you have to go to another jurisdiction but can't actually get there,
then what do you do? That brings me to my latest half-baked idea: bring
a friendlier jurisdiction to you.

I've read some things online which suggest that the inside of an
aircraft is effectively treated as the soil of whatever country in which
it is registered. There are a bunch of forum posts in which armchair
lawyers pontificate on the subject and give examples. One of them talks
about someone who flew a US-registered plane across the Bering Strait to
Russia. It had a GPS on board and apparently that's normally illegal in
Russia, but it was okay in this case as long as it stayed on the plane
and thus "in" the US.

This makes me wonder about turning it around. What if some country with
compatible laws flew in a plane, landed it, picked up potential clients,
then took off and started doing whatever they intended to do? I
understand some airlines serve booze to people at 18 instead of 21
thanks to policies like this, so why not bring over the whole gamut of
things which people need?

All it would take is someone with a few airplanes, a bit of money which
could be spared for running the operation, and who doesn't mind pushing
the envelope to get things done. I can think of one or two airline
moguls who might fit this description.

The question is: who's going to be the first to bring a chunk of
friendly "soil" into these embattled states in order to help out the
less-fortunate who live there?

Pedro wrote in to say that a cable TV service in Portugal has a remote
which is just for kids, like in my
post
from Saturday. Apparently this remote has some flaws, like being able
to "control all digital equipments", which can probably put the system
in a bad mode by being too flexible. It also seems to have an "adult
mode" which doesn't make much sense.

Still, it shows that someone was starting to think of the problem space.
I guess they just haven't quite gotten it yet.

An anonymous reader wrote in on this same topic and mentioned the use of
Windows as media centers. Windows itself supports multiple accounts so
they might be able to have different permissions. Of course, that
requires support throughout the whole stack, and it won't make sense if
everything isn't expecting it. How much of that stuff expects to have
administrator rights, or just expects to be "the only user" on a system?

One thing I never understood about Tivo type systems is why they didn't
seem to have multiple user accounts. I should be able to watch my Mad
Men and "delete" it without knocking it off the list of someone else who
shares the same system with me. They can "delete" it at their own
leisure, and once it's no longer being referenced anywhere, then it can
really go away. It's a weird mix of garbage collection, reference
counting, and the general concept of hard links on a Unixy filesystem.

This relative lack of innovation is what got me to write a
post
last year about why Tivo still exists. Someone else should have come
along and eaten their lunch by outdoing them in terms of features. It
hasn't happened. Where is the competition? My money is on patents
scaring everyone else out of this realm.

...

Regarding yesterday's
rant
about Apple moving "Spotlight", I had a couple of comments advise me
that command+space would bring this up. I think I was aware of this
previously, but for whatever reason, that's not how I usually bring up
that thing.

It boils down to the whole "forced retraining" thing. I'm getting tired
of it happening over and over again with software which I cannot
control.

Imagine what life would be like if every time you visited your kitchen,
there was a nonzero chance of having a knob or button move, disappear,
or otherwise change its behavior. Would people really stand for that
kind of insanity?

That's where we are with software right now.

...

Finally, I was asked what I think of Larry, Sergey, and Eric.

I don't know them. I have no reason to think they know me.

They are three people who used to show up to run TGIF all the time.
Then they started becoming aloof. Then, one day, they were all gone,
and that "used car salesman" type guy ran TGIF instead. It turned a
usually laid-back afternoon thing into a high-pressure rah-rah session.
It was never the same after that. Sure, they made appearances, but it
was a completely different vibe.

All of this happened far away from me in the organization. The people
who cared about me had no power. The ones with power didn't care.

]]>10 days left on the fred kickstartertag:rachelbythebay.com,2013-06-24:fred2013-07-05T21:35:55Z
There are 10 days left until the
fred kickstarter
comes to a close. If it reaches the goal by that point, then I'll clean
it up and turn it loose. Everyone will be able to run their own thing
without having to answer to anyone else. They can tell those cloud
providers what they can do with their quarterly goals which ignore
the users.

Of course, if the goal isn't reached, then I'll have to turn to other
projects. That's the way this stuff works out.

I should point out that Google Reader is going away very soon now. If
you've been hiding out on that service, you should start making plans to
migrate somewhere. To get my feed going on whatever new system you wind
up using, just come back here and hit the feed button in your browser
(if you still have one), the little feed icon at top, or just this link:

You can also change that to https if you prefer that kind of transport.

July 5, 2013: This post has an update.
]]>Missing the target because someone moved ittag:rachelbythebay.com,2013-06-23:aim2013-06-24T05:26:36Z
Some time back, I learned about
Fitts's law.
It sometimes pops up as a citation on a discussion about computer user
interfaces. If there's motion and pointing and targeting involved in
the interface (say, in a GUI), there's a good chance it's related.

I think I first heard it used in the context of the Mac interface, and
specifically for the top bar which is there no matter where you go.
You can be in the Finder, or Firefox, or something else entirely, and
there's always a single bar at the top which has a bunch of pull-down
menus and other potentially useful things. The pull-downs change
depending on which program you're "in" at that moment but they're always
in the same place.

The idea as it was related to me is simple enough: it's hard to "miss"
the top of the screen. All you have to do is keep pushing "up" with
the mouse (or now, a trackpad) and eventually it won't be able to go
that way any more. The pointer will reach the top and will be unable to
go any higher. Once this happens, you're free to click on something in
the menu.

I think of this as a column of infinite height over each of the menu
items. Even though the pointer doesn't actually move higher than the
topmost row of pixels, you're driving the input device as if it was
going for a button which was impossibly tall. Imagine the "File" menu,
for instance, but while it might be an inch across, you can "aim" for
anywhere at or above that point and you'll still manage to hit it.
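
You can put rough numbers on this with the Shannon formulation of Fitts's law, ID = log2(D/W + 1), where D is the distance to the target and W its size along the axis of motion. The pixel values below are made up for illustration; the point is that a screen-edge target behaves as if it were enormously "tall" in the throw direction:

```python
import math

def index_of_difficulty(distance, width):
    """Shannon formulation of Fitts's law: ID = log2(D/W + 1), in bits.
    Lower ID means an easier target to acquire."""
    return math.log2(distance / width + 1)

# A 20px-tall menu item 500px away, sitting mid-screen:
mid_screen = index_of_difficulty(500, 20)

# The same item at the screen edge: overshoot is impossible, so its
# effective height in the throw direction is huge (say, 400px of "slop"):
edge = index_of_difficulty(500, 400)

assert edge < mid_screen   # the edge target is much easier to hit
```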

Now imagine a button labeled [File] with the same font and all of that,
only somewhere else on screen. Instead of being able to do a rough
"throw to top and finesse horizontally a bit" type of movement, you now
must control your inputs with sufficient precision to hit the exact box.
If you go too high, you will miss it (and possibly hit something else).

This is why having clickable things at the very edge of the addressable
pointer space makes sense to me. It's also why I can't figure out what
Apple is up to these days with their "Notification Center" thing on
Mountain Lion.

I've been using Macs for a couple of years now, and as long as I can
remember, the top left corner has been the Apple menu, and the top right
corner has been a search widget. This basically meant I could "throw"
the pointer at the top right corner, then click, and type in something
to find arbitrary files on my machines. I could also go back up there
to access the results of my last search if needed.

By being both the topmost and rightmost item, it meant I didn't really
have to think about it, and didn't have to be careful, either. Using
the inertia of my input device would let me hit it easily.

Then, one day, I installed Mountain Lion (OS X 10.8), and this happened:

The "Notification Center" had taken over the top right corner,
displacing the search widget. I'd "throw" the pointer up there like
always and I'd trigger this stupid thing:

Great, it's a list of janitorial tasks, and other uninteresting things.
It's nothing I care about, in other words, and now I have to purposely
adjust every time I want to search for something.

What's more, it had a tendency to "pop up" and would try to get my
attention about more of these uninteresting things whenever it felt like
it. I thought we established that pop-ups and pop-unders and all of
those were evil back in the '90s! Why would we purposely have one
that's always on?

So, I found how to turn it off and did exactly that.

launchctl remove com.apple.notificationcenterui.agent

It takes a little bit for it to figure out exactly what's happened and
get the icons back in the right spots without any weird gaps, but it
eventually works out and I'm back to something like this:

It stays away... until the machine is updated, that is. Every time
they do an update to OS X 10.8, this stupid thing comes back, and I have
to run that command to whack it once again.

I assume this is where things are going: more and more little annoyances
which have to be taken out and shot every time they have a chance to be
revived by their software updater buddies.

What I still can't figure out is why they gave up some prime target real
estate to something which mostly opens by itself whenever it feels like
it. Meanwhile, the search tool, which is something you have to access
deliberately, is now that much harder to hit.

]]>Limited access can actually be a good thingtag:rachelbythebay.com,2013-06-22:remote2013-06-23T04:49:53Z
Interfaces are complicated. People who work with them frequently get
used to them and probably don't notice it any more. This comes up
enough with computers, but what about other items? How about home
electronics where the computer inside isn't the focus? Imagine the
potential complexity of a home entertainment system, for instance.

Maybe you're in this situation. You've put together your own custom mix
of items which provides TV shows and movies on demand. Maybe there's a
cable hookup, or a satellite provider. There might be a recorder or two
adding to the mix of things which are waiting to be played.

It's not much of a stretch to imagine the owner eventually getting used
to this system. This is both a matter of familiarity from being around
it often and the general need to understand the stuff you paid for.

But... what about guests? They're valid users, too. They didn't build
your system and have no idea what went into it. They also haven't been
using it for years. Some of the elements might be familiar, but can you
really rely on that? Maybe some of your guests know enough about
certain systems to know that you might have a significant investment in
how it's configured or what programs are currently stored. Will they be
able to amuse themselves without worrying about "hurting something"?

Imagine a Tivo, for instance. You might have a whole bunch of season
passes and suggestions just waiting to be played. You want your
visitors to be able to play through that content on their own, but maybe
you're worried about them deleting stuff without realizing it. Or,
maybe they are poking around and manage to cancel a recording.

Chances are, they aren't trying to make life miserable for you. They
just wanted to watch some TV on an unfamiliar setup and made a misstep.
It seems like there should be something which can be done about this.
Guests should be able to enjoy the setup without worrying about hurting
anything.

Cars already have a system like this. A "valet key" might allow the car
to be driven, but it won't remember changes to the seat position, for
instance. It might be set to keep them out of the trunk or other
non-essential areas of the car.

Why doesn't this exist for home entertainment systems? Imagine a remote
control which is recognized as special and can't be used to change
season passes, delete recordings, or otherwise interrupt the usual
scheduled tasks which happen regularly. It would also be great if it
subtracted all of the menu options which make no sense for a guest.
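
The frontend side of that could be nothing more than a command filter. Here's a minimal sketch of the "valet mode" idea; the command names and roles are invented for illustration:

```python
# A limited remote is just a role check in front of the command dispatcher:
# guests get playback, owners get everything.

DESTRUCTIVE = {"delete", "cancel_recording", "edit_season_pass"}

def handle(command, role):
    if role == "guest" and command in DESTRUCTIVE:
        return "ignored"          # guests can't leave a lasting effect
    return "executed"

assert handle("play", "guest") == "executed"
assert handle("delete", "guest") == "ignored"
assert handle("delete", "owner") == "executed"
```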

This would benefit multiple parties. Guests would be able to explore
without worrying about hurting anything. The system would not let them
do anything that might leave a lasting effect. They'd know this and
could relax in the knowledge that they couldn't do anything bad. Hosts
would also be able to relax and not have to keep an eye on things.

Of course, guests aren't the only interesting use case for something
like this. Why not a similarly limited remote control for younger kids?
Let them watch their cartoons or other favorites over and over with no
danger of them making a mistake and deleting something precious. I hate
to think of the tears which might follow a bad button press, especially
one which might be avoidable.

In theory, all of this should already be possible. If these
entertainment systems were constructed as proper layers with their own
responsibilities and strong, well-defined interfaces, then it should be
a "simple matter of programming" to make another frontend with limited
capabilities. Of course, in practice, I'm sure it'll seem practically
impossible to some developers and on some types of A/V hardware.

It's not hard to make a nerd-friendly system. The question is: can you
create one that inspires confidence in a young one or a guest who just
wants to watch TV and doesn't want to worry about making mistakes?

Not everyone wants to be root on the TV.

]]>We want you! Well, no, actually, no, not really.tag:rachelbythebay.com,2013-06-21:svp2013-06-22T04:37:57Z
I saw this a few minutes ago. It's the usual article about ratios.

It must be really awkward to be on the spot like that when your brother
in law is one of the founders. Business spats and family dinners, oh
my!

]]>Reminder: Google Talk federation runs unencryptedtag:rachelbythebay.com,2013-06-21:ssl2013-06-22T01:08:42Z
Just a short reminder for people who worry about privacy and things like
that: even if you're using your own XMPP (Jabber) server for chat
purposes, if you talk to people who use Google, you're doing it in the
clear. Their federation links (inter-domain communications) are
established without encryption.

I
wrote
about this last year, but in light of recent revelations it seems worthy
of a second mention.

Here's how it works. I run my own XMPP server. I connect to it from an
IM client on my laptop. It negotiates TLS using my rachelbythebay.com
certificate, and then it's running relatively securely. Then, it sees
that I have "buddies" on gmail.com and other domains which are
also hosted by Google.

This makes my server reach out to the XMPP servers for gmail.com and
friends, and it puts up "s2s" (server to server) links. These links are
then established without TLS. There is no encryption.

Now, there's encryption on the client side for people using Google's
talk servers, but there's still a weak link in the chain.

To review:

My IM client talks to my server over TLS. Good.

Their IM client talks to Google's server over TLS. Also good.

Our servers talk to each other in the clear. Bad.

If you run your own server and care about this kind of thing, you might
want to try sniffing your own traffic heading out to Google to see
exactly what I'm talking about. Try a command like this:

tcpdump -nl -s 0 -X net 74.125.0.0/16 and port 5269

Look carefully. If there's anything other than random-looking garbage,
it's running in the clear. Don't trust your log files. Do your own
sniffing and see what an attacker would see.
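
If you'd rather not eyeball hexdumps, a crude heuristic gets you most of the way: TLS traffic looks like random bytes, while cleartext XMPP federation traffic is mostly printable ASCII full of XML. This is a sketch for eyeballing captured payloads, not a protocol analyzer, and the sample bytes are invented:

```python
def looks_like_cleartext(payload: bytes, threshold=0.9):
    """Guess whether a captured payload is plaintext by measuring the
    fraction of printable-ASCII (plus tab/CR/LF) bytes."""
    if not payload:
        return False
    printable = sum(1 for b in payload if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(payload) >= threshold

xmpp = b"<stream:stream to='gmail.com' xmlns='jabber:server'>"
tls = bytes([0x16, 0x03, 0x01, 0x9c, 0xe7, 0x02, 0xff, 0x81, 0x00, 0xd4])

assert looks_like_cleartext(xmpp) is True   # readable XML: running in the clear
assert looks_like_cleartext(tls) is False   # random-looking garbage: encrypted
```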

I should note that it's not my server's fault. It happily runs TLS over
federated links to everyone else.

Just not Google.

]]>Subtractive framework creation makes more sensetag:rachelbythebay.com,2013-06-20:framework2013-06-21T03:19:10Z
Does anyone really start out on a project by saying "I want to build
something huge and crufty that doesn't serve any real purpose"? I don't
think so. I think it just looks that way after seeing enough failed
projects come and go over the years. I'm thinking about the notion of
frameworks in particular, but this might apply to other sufficiently
large software systems as well.

Consider this situation. You're in a "green field" environment: there
is no code already written to do what you want to do, or if there is,
you can't bring it along for whatever reason. Maybe the licensing isn't
compatible, or it's in a language which can't be used any more. You get
the idea. You know you're going to wind up making a whole bunch of
utilities along the way.

So what happens next? Do you whip out your crystal ball and attempt to
predict everything which will be needed? Does that lead to a whole load
of functions which then becomes the to-do list? Are these tons of
feature requests or sticky notes put up somewhere to get all of these
written?

How about the structure? Even though none of this code exists yet, is
there a whole hierarchy being built out somewhere? Is someone already
arranging places for this stuff to "hang" on the tree even though there
isn't anything to show for it at this point?

It seems like you could refer to this approach of framework building as
"addition": you keep identifying things which "should be useful" and
start trying to build them. You may actually succeed at building a fair
number of them, and now you have this whole pile of tools which may or
may not be useful.

This is the point where actual work towards solving the problem starts,
and now it's a matter of mapping that work onto the pre-existing
framework. I sure hope you had a good "read" on that crystal ball and
nailed all of the utility requirements even without actual code use
cases, since otherwise your utilities are going to need to be fixed.
This may change their behaviors significantly. Whole assumptions about
who worries about what may have to be cast out and replaced. What does
this do to your wonderful schema now?

I guess if you're some kind of never-fail wizard type, then you might be
able to sit down and just slap all of this together and never have to go
back to change something. I don't know anyone like that. I don't know
if that kind of person can even exist.

My suspicion is that this could only happen if an existing project was
being literally rewritten piece by piece. Since the architecture was
the same, just with a different language used for the implementation, in
theory, the structure might be the same. Such projects do exist, but
really, how often do you do a straight-across port without changing
something?

Wouldn't it make more sense for framework creation to be a
subtractive process? First go and build your program which
accomplishes some goal. Then build something to make a second goal
happen. Then do a third. Now take two or three steps back and look at
what you've created.

What kind of common stuff is there, repeated across the three programs?
Call it F. What sort of specialized things exist for the three goals?
Call them X, Y, and Z.

F would be those elements which make sense on their own. They'd have to
remain useful even after being translated to a more generic form in
order to suit the needs of multiple calling programs.

I've been down this subtractive road many times. Initially, when it
came to doing JSON, I used to just bodge it together myself and built
the strings myself. Yes, really. Then I got paranoid about encodings
and wrote something to handle escaping evil characters. That lasted for
a while, but then the prospect of releasing
fred
to the world came along and got me thinking about less crufty
implementations. I started using a helper library called jansson to
handle my storage and JSON encoding needs.

Initially, this was just part of fred, and so it lived in that
directory. I think it was three separate classes spread across three
sets of files: one for the base-level JSON storage and types, another
for arrays, and a third for objects. fred used this for a while and it
worked okay.

Eventually I found myself working on a project for a client which also
did JSON. Now I had three different uses of JSON in my tree: my
scanner, which was
still manually generating JSON, fred with its jansson wrapper, and now
this new thing for my client. I finally had enough things approaching
JSON in their own ways to give me a better sense of the problem space,
and could come up with an API that made sense.

I copied the fred code to a new place and subtracted the fred-specific
stuff. A few things were adjusted to make them suitably generic, and
then fred, the scanner, and the new program were changed to use it.

Now, after these iterations, I have a json.cc and json.h which has
everything I need to handle my current use cases. I don't have to have
special handlers for fred or the scanner or the third project. How
could I possibly have coded this from scratch when there was no way to
know what was coming?

This code, the closest thing to a "framework" I have, had to evolve from
other code which works and is in active use. Trying to come at it from
the other side probably would have resulted in some time-wasting
monstrosity which would have to be thrown out anyway.

Finally, doesn't this seem to fit best with an "Agile" mindset? Code
what you need when you need it, and reconsider everything regularly.

Otherwise, aren't you throwing darts at a dart board which might not
even be there? How can you possibly see through the fog of time?

]]>Wear blinders and you can justify anythingtag:rachelbythebay.com,2013-06-19:interface2013-06-20T00:46:38Z
Here's a pattern I keep seeing. I don't understand it.

There's some software product or web site. It has an interface which
has pretty much remained the same for three or four years. A couple of
dropdown menus or chooser lists might have gotten a little bit longer as
features were added, but it's basically the same thing. Someone who had
been away from the product for several years would still know their way
around.

You're wary of interface changes, so you don't switch to it immediately
when it first comes out. Eventually, it becomes clear that it isn't
going anywhere and it's a stable experience, so you try it out. The
experience seems to agree with you, and so you slowly use it more and
more. Before long, it's a part of your daily routine. You know how it
will behave and there are no surprises. You can count on it being the
same when you come back every day.

Then, one day, something changes. There's been a change of leadership
on the product, and soon they want to start moving things around. Some
options are hidden behind new menus and are no longer directly
accessible. Others disappear entirely. A handful of strange new things
which you really don't need and never asked for are forced in your face
to get you to pay attention to them.

You deal with it at first, but the changes slowly migrate around the
interface until eventually it starts impacting your day to day behavior.
Perhaps something you used to be able to do is no longer possible, or
it's become complicated where before it was simple. It's a new source
of friction, and it's not going away.

This inspires you to push back on the changes. How can the product
people know there's a problem unless they hear from the users, right?
So, you write in and report that the removal of feature X and the hiding
of feature Y is really crimping your style. You even write up a
detailed explanation of how you used to use the system so they can
better appreciate your use case.

It's no good. Someone on the team responds and says that they have
metrics, and the metrics say that only 0.7% of all users used feature Y,
and even fewer used feature X. That's their story and they're sticking
to it.

But, they say, it's okay. This interface is good stuff. It passes all
sorts of approval metrics and allows you to do this, and that, and this
other thing. It's the new direction for this product, so you might as
well get on board now.

Let's say for the sake of argument that you accept this and you get on
board with the changes. You re-learn the interface and find different
ways to do whatever it was you need to do. You find some way to live
without the features which have evaporated completely. You still feel
the friction every time you run into a situation where the old feature
would have helped, but you tell yourself to ignore it. After all, you
don't want to be one of "those people". Managers read books about
"those people" and try to peck them out of the organization for not
being "team players".

About a year passes. Then, without warning, they throw out that new
interface and drop in yet another interface. This third interface has
no relation to either the first one that you liked or the second one
which you were making yourself use. You have to start all over again,
and there are even more features missing or otherwise behaving
differently.

The users suffer through two major upheavals in a short span of time.

When this happens, all sorts of questions pop up in my head. If the
second one was so good, why did they switch to the third one? Was there
some initial condition which made the second one good, and that
condition is now gone? Were the claims of the second system being good
even valid? If they weren't, why should anyone believe their claims
about the third system?

I think it's more like this. Someone decides they are going to make
a new interface. They get it ready for launch, and they also prepare a
nice package of justifications. They can even do bad science to back it
up -- that is, run experiments, and only keep the results which support
their desired outcome. Then, they launch it.

When it's time to do it again, they just follow the same pattern.
Notice that the past has absolutely no influence on this routine. They
want to make a change and they don't care about what may have been there
before. Maybe the old stuff came from someone else and they want to
"make their mark" somehow.

]]>Apple moved the Lucile Packard Children's Hospitaltag:rachelbythebay.com,2013-06-18:hospital2013-06-18T07:15:28Z
I was digging around through Apple Maps on my old phone which runs iOS
6 in search of even more broken stuff. If you missed my older posts, be
sure to check them out:
1, 2, 3

What got me looking for it was actually the new iOS 7 maps icon. It
seems to show Interstate 280 going across the lower-left with a green
square at top (a park?) and a pink square at top right (a hospital?). I
tried to find a spot which resembles that somewhere along the entirety
of 280 but failed.

This got me looking for other pink spots, though, and I found one in a
most unusual place: right next to Mission College in Santa Clara.
Here's what I saw while relatively zoomed-out:

OK, so that's GAP at Mission College. Great America is in the right
place, IHOP is right, Togo's is close, Bennigan's looks okay... but
what's that pink blob? Is that supposed to be a hospital? I zoomed in.

WTF? Stop the presses! The Lucile Packard Children's Hospital has been
picked up and flown from Palo Alto to Santa Clara! Are they serious?

No, there's no hospital there. Trust me. Look at it on Street View or
whatever if you want. It is NOT A HOSPITAL.

I pulled up the info page to see what was going on.

According to their data source, LPCH is at 2807 Mission College Blvd.
That's the right address for that building, but they are certainly NOT
there. They are still up in Palo Alto, on the Stanford campus (surprise
surprise).

So how in the hell did Apple get this address? I did some more digging.

It turns out that 2807 Mission College is a
data center...
you know, a "co-lo". It will hold lots and lots of computers... but no
patients, for it is definitely not the children's hospital, and it never
has been.

So how did Apple get this mapping? A little more digging turned that
up. Netcraft had the answer.

The LPCH, not surprisingly, has a domain name called lpch.org. Once
upon a time, it was hosted at a data center... at 2807 Mission College
Blvd in Santa Clara. Yes. That's where they got the mapping: the
web site's network address.

Imagine driving a kid to a data center when they get sick.

Does this sound familiar? It's the second hospital they got
completely wrong just in Santa Clara. I reported on the first one back
in my
third post
about Apple Maps back on May 12th. They still have the Kaiser hospital
at the location which closed in August 2007, and by that, I mean right
now as I'm writing this a month later.

That's two bad hospital entries in a relatively small city. What other
important things have they dropped or relocated based on crap data?

]]>Fastrak as used for more than just paying tollstag:rachelbythebay.com,2013-06-17:track2013-06-18T03:41:36Z
I recently bought a transponder tag for my car. It's registered with
the state and is associated with my license plate. If I drive through a
toll facility which is suitably equipped, it will automatically subtract
from my balance. When the balance drops too low, it will replenish
itself by applying a charge to my usual payment instrument.

In the Bay Area, this mostly applies to bridges, but there are also a
couple of freeways which have "express lanes" -- lanes where it's free
if you're a carpool, but you have to pay otherwise during rush hour.
Having the transponder mounted on your windshield so it can be read is
how you pay for using those lanes if you're a solo driver. If you're
carpooling, you're supposed to remove it and put it in a little
anti-static bag they supply.

There is at least one other use for these transponders: traffic speed
detection. Some freeway segments have the equipment needed to read the
tags as they pass underneath. If the same ID appears at two known
places a known distance apart in a known amount of time, then
calculating the speed is a simple matter. This isn't particularly
surprising or even new. It's been done in a bunch of places for years.
In San Antonio, the state gave away a whole bunch of these things just
so they'd have more roaming data points on the roads.
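
The math really is that simple. A quick sketch, with made-up reader positions and timings:

```python
def segment_speed_mph(miles_between_readers, seconds_between_reads):
    """Same tag ID seen at two readers a known distance apart in a known
    amount of time: distance over time gives the segment speed."""
    hours = seconds_between_reads / 3600.0
    return miles_between_readers / hours

# A tag passes reader A, then reader B 2.0 miles away, 100 seconds later:
speed = segment_speed_mph(2.0, 100)
assert abs(speed - 72.0) < 0.001   # 2 miles in 100 seconds is 72 mph
```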

For a while, I had a transponder tag for a parking garage on my car. It
was made by the same company who made the tags the state was giving
out. I imagine that every time I drove under one of their readers, it
picked up on the ID and made a note of it. The next time I passed
another reader, it did the math and figured out a speed for that road
segment.

Rejoining the world of tagged vehicles got me thinking about other
potential uses for this technology. Mostly, I started thinking about
how to stop criminals with it. Imagine what would happen if most cars
on the road had some kind of unique identifier which could be
interrogated at a distance. It would open up a whole bunch of
possibilities.

First, realize that probably every car already has a unique ID: the VIN.
You have to get up close and personal to actually read it, though. If
the vehicle is in motion or you otherwise can't get right up to the
windshield, you probably can't read it. There are license plates, but
those can be hard to copy down on the fly, and they can be removed or
changed.

So let's say this exists, and most cars have a transponder tag which has
been added on, or some equivalent device which is part of the built-in
electronics. Now let's say a car is involved in some kind of crime. As
long as there was some kind of listening device which logged it in the
vicinity of the crime at the right time, it can probably be found later.

What's going to do that sort of logging? Easy: other cars. I imagine a
"smart grid" situation in which cars talk to each other and the road
itself. There are already city transit agencies starting to get
licenses for this kind of technology. The magic terms are "5.9 GHz
DSRC".
Look it up - it's fascinating stuff.

Is it such a stretch to think that cars are going to start noticing each
other? I don't think so. Airplanes are already headed in that
direction with
ADS-B. One interesting
point about the ADS-B system is how aircraft learn about each other and
share that data with others. If I'm reading this correctly, plane A can
see plane B and tell plane C about it. C couldn't see B directly, but
knows of its existence thanks to the report from A. This even applies
if C is using a receive-only "in" system.
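
The transitive-visibility idea boils down to a set union. This sketch is loosely inspired by that description, with invented call signs; real ADS-B rebroadcast is far more involved:

```python
# C's picture of the sky is everything it sees directly, plus everything
# reported by aircraft it can hear -- even on a receive-only "in" system.

def known_traffic(direct_sightings, reports_heard):
    return direct_sightings | reports_heard

a_sees = {"B"}          # A has B on its screen
c_sees = set()          # C has no direct line of sight to B
a_report = a_sees       # A shares its picture of the sky

assert "B" in known_traffic(c_sees, a_report)
```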

It's definitely not a stretch to imagine cars being scanned on purpose
after a crime happens. Police agencies already drive their
"PIPS" cars through an area
where something has occurred. This way, if the bad guys left their car
in the area and later come back to it, there's a log of it being there.
If that same car shows up at yet another crime site later, it can be
correlated.

I've seen my local PD driving up and down the lanes at the big movie
theater parking lot to look for interesting cars. The cameras on top
scan plates as they pass by, and the computer does the rest. They wind
up recovering a bunch of stolen cars this way.

I'm not trying to make dire predictions of a
"grim meathook future"
or anything like that. It just seems likely that such technologies
will be much more closely related not too long from now. It doesn't
even really have to involve law enforcement or their federal buddies.
The nerd contingent could get going on it right now for their own
enjoyment. They're already doing it for ADS-B data with those $20
RTLSDR TV tuner sticks. (Really.)

Let's just hope this tech is used for the right reasons.

]]>Self-limiting groups and more than 100% membershiptag:rachelbythebay.com,2013-06-16:meet2013-06-17T04:05:04Z
I keep seeing people who assert that companies are filtering their
applicants and are creating a hiring process based on IQ or similar.
The idea is that they draw a line and anyone who falls below that line
won't get in. Then they maintain this over time no matter what. A
thinly-veiled advertisement post which made it onto HN today continues
this line of thinking.

I need to shoot some holes in this theory
again.
But, before I do that, I want to talk about classic car dinners.

So there's a group of people who own a specific kind of classic car.
Maybe it's the Delorean made famous by the Back to the Future films.
It could be the 1948 Tucker Sedan, which figured into a movie about the
creator, Preston Tucker. The point is, they own a car which was never
made in huge quantities in the first place.

One year, a bunch of these people ran into each other at a classic car
meet-up. They decided to hold a BBQ just for fellow owners of this one
car. Nobody else was allowed in. It went off so well, that a year
later, they went and did it again. Before long, they had a vibrant
group and looked forward to those dinners all year long.

These dinners became legendary. People wanted to get into them, and
would seek out these cars in order to gain membership. This drove up
prices for a car which had only had a small amount of interest
previously. Now, any time one would come up on an auction site, a bunch
of would-be owners would pile on. Only one lucky person would come away
with a car, and they'd be at the next dinner.

The group continued to grow. They outgrew a simple campground and had
to book an actual lodge one year. Then they found themselves unable to
keep up with the BBQing duties themselves and had to bring in catering
people to do it instead. This seemed weird, but they were having a good
time and so they rolled with it.

Eventually, one of the folks who had been in the group for a while
looked around and said... wait, just how many people are here? They did
a head count. They found that their latest dinner had 1200
registrations. These were people who supposedly owned one of these cars
and thus were members of the club.

There's a problem, though: they only made 1000 of those cars, and they
had been tracked in exquisite detail over the years. Each one had a
unique serial number and was tracked in the club newsletter as it was
handed off to a new owner, got in a wreck, or was lost to a garage fire.

Somehow, they had 1200 members of a club that should have maxed out at
1000, and was actually lower due to various cars being destroyed. They
didn't have 200+ cars suddenly materialize out of nowhere. Nobody was
making replicas. Somewhere, something went wrong.

They thought they were filtering based on ownership but in reality
something had slipped through. As the group continued to grow, this
became more and more obvious. The group continued to assert that it was
just for owners of this special car, but it couldn't possibly be true.
By the 2000th member, it was clear that something else was going on.

It wouldn't have been a big deal if they had kept the group going with
some new reason, like "people who own the X car, plus people who just
like it", but they stuck with their "only the owners, and nobody but
the owners" motto. That just did not make sense, and the numbers did
not check out.

That's sort of what it's like when companies talk about only hiring the
best. There can be only so many people at or above a certain line, and
even though there are seven billion people on the planet at the moment,
the pool of candidates shrinks rather rapidly when you start applying
restrictions.

Maybe some are too young to work for a company. 8-year-olds can be
wicked smart, but they can't be employees in your software sweatshop...
in this country, anyway. Perhaps others are too old. Once you hit 35,
that little turkey timer thing pops out of your neck and all of the
other valley techies start treating you differently.

Some won't speak whatever language or languages the business happens to
use. English is spoken in a lot of places, but it sure isn't everywhere. Some
people can't legally work in the countries where a business might have
an office. This might be a temporary problem or it might be a permanent
one. Besides, not everyone wants to move.

Keep subtracting the people who refuse to work for that company or have
already worked for that company and then moved on. Also remove the ones
who are at other companies and are happily employed, and won't be
moving. How many are left? That's your global talent pool.

Let's say you come up with about 25,000 people, and your company has
30,000 engineers. Where did the other people come from? Who let them
into the exclusive classic car dinner?
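That filtering arithmetic is easy to sketch. Here's a back-of-the-envelope version in Python; every fraction below is made up for illustration (the real numbers are unknowable), but the point survives: multiplying even generous filters together shrinks seven billion people down to a pool far smaller than a big company's headcount.

```python
# Hypothetical successive filters on the global talent pool.
# The fractions are invented for the sake of the example -- only the
# shape of the result matters, not the specific values.
pool = 7_000_000_000  # roughly everyone on the planet

filters = [
    ("at or above the claimed skill line", 0.01),
    ("in a workable age range",            0.50),
    ("speaks the business's language",     0.20),
    ("can legally work where you hire",    0.30),
    ("actually willing to work for you",   0.10),
]

for reason, fraction in filters:
    pool = round(pool * fraction)  # round to whole people
    print(f"{reason}: {pool:,} left")

print(f"final pool: {pool:,}")
```

With those made-up fractions, the pool lands around 210,000 people. Tune the numbers however you like; any plausible set of filters still leaves a pool that can't support a 30,000-engineer company's "only the best" story.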

Just like the hypothetical dinner group, they're basing their acceptance
of new people on something other than what they claim. You can believe
it and go along with the story even though it can't be true, or you can
call it out and try to guilt them into changing their story.