If killall is the answer, you're asking the wrong question

Have you ever noticed that when you see one thing that's cosmically bad
in a program, it's usually not alone? It's the broken window
theory: once someone smashes through and nobody else cleans it up, it's
open season for anyone else to do the same. This is a story about one
of many such cases I found, and this one involves the "killall" command.

So there was a daemon which ran on a bunch of machines. Its job was to
act as a software installing (or removing) agent. It just sat there and
kept track of what it had installed and what its "boss" server thought
it should have installed. When there was a difference, it started
applying changes and kept going until there was nothing left.

There was a catch, though: this thing was slow. It was intentionally
made that way so that it wouldn't perturb the machine much, or at least,
that was the official explanation. The problem is, of course, if you
wanted to load something on and get going right away, slow is not what
you want.

Someone decided to provide a way to pull out the limiters and make it
sync 100% right then and there. That by itself was uninteresting. What
made it bizarre is how they implemented it.

Remember this thing was a daemon. It just sat there waiting for
commands and other stuff on a network socket. So, they just added an
endpoint where you could poke it and it would run the fast syncer,
right?

Wrong.

Oh no. You had to log in via ssh and then run a command. Let's call it
"sync-now". That particular command would then start up and kill the
existing daemon. It would run "killall sync-daemon", then it would
pause to let it die.

Oh, it gets better. Next, it would restart the "sync-daemon" process
with different flags. These flags basically took the limiters out of
the main loop and let it run as fast as it wanted to run. Meanwhile,
"sync-now" would start polling the "sync-daemon" via its network
connection to see what it was doing. It would keep doing this until the
daemon went idle again.

Once it got to this point, it would then use killall on the daemon
yet again, wait for it to die, then it would restart it one
more time with "normal" (rate-limited) flags. Then, finally, "sync-now"
would die.

So that's pretty ridiculous already, but the rabbit hole goes deeper.
If you got curious and went looking, it turned out that "sync-now" was
just a symlink to "sync-daemon". Yep, it was the same binary for both
jobs. It actually looked at argv[0] when starting to figure out which
"personality" to become.

Putting this all together, it means you had a single program which
looked at its own name and decided to behave differently. It then went
out and shot the instance of its other personality in the head and stood
up yet another instance of that personality, but with its own
special values. Then it let that one do some dirty work, then it killed
that mercilessly and swapped the original configuration back in.

This isn't software. This is like a bad psychological thriller novel.

There's a final bit of craziness here. Let's say you had some automated
process which wanted to set up a bunch of machines and then run some
tests on them. You might write a simple loop which connected to the
machines and ran "sync-now".

Odds are your loop would come back around before it finished syncing.
Let's say you had a naive implementation which would then connect back
to the boxes and say "hey you, run sync-now again". That's when your
life got really interesting.

Remember that "sync-now" works by shooting the "sync-daemon" in the head
with killall. It'll do that to any instance of it, including
the one which gets started up temporarily to do a full-speed sync. So,
when you logged in again and ran it again, you broke the first one which
was still going. It would just sit there and flail around while the
second one started doing its thing.

This would just keep happening over and over. If the syncer didn't make
any progress in the interval between calls, you would never finish. It
would keep pulling the rug out from under itself over and over and over.

I won't even bother describing how evil the code looked. All you need
to know is that it looked at argv[0] and went this way or that way in a
big if-then-else type construct. Test coverage was low or nil.

Naturally, you know someone got a nice promotion for writing that thing.