Autoscaling Is Hard

I was discussing autoscaling with a friend when it magically occurred
to me that it’s a pretty hard problem unless you know all of the
stack inside and out and have incredibly awesome and precise
introspection capabilities. I’m pretty sure that even then, autoscaling
is tantamount to bailing water out of a boat with a bucket. WHAT
HAPPENS NEXT WILL SURPRISE YOU. (Your
mileage may vary; terms and conditions apply; “autoscaling” here
means out-of-the-box solutions to make your things scale, as provided
for example in Kubernetes; these opinions are mine and mine (probably)
alone, definitely not my employer’s; I have no easy solution to
sell you, and I most definitely don’t make money from this blog. See
the utter lack of ads for details.)

Once upon a time you had a computer…

…and it was SOOOOO SLOOOOOOW. Without looking at it, no cheating,
which part do you need to improve? More RAM? A spinny disk that
spinnies faster? A more powerful central processing unit? Or maybe
you need to tell your little brother to chill out with the illegal
peer-to-peer file transfer, because it’s saturating your router?

In my mind (and in my car, we can’t
rewind, we’ve gone too far), autoscaling is the exact same
deal. “My memory usage/CPU usage/response time is spiking, let’s
spin up more instances”, except it doesn’t help, because the reason
for the spike is a few levels upstream: a misbehaving
service completely unrelated to you is making the database hang,
because it really needs to do a join on two fields that aren’t
indexed, because reasons. Or that other service has an MLG1 library
that transcodes from XML to JSON to YAML, and it’s been handcrafted
by a grandmaster hacker dude-man with such passion and fury that
any attempt to make it simpler and faster somehow breaks it. If you
have a decently-sized infra that’s broken out into a few services,
you know what I’m talking about: it’s the dark beast hidden in the
legacy, unvisited crevices of your code, the one that only you (and
apparently nobody else) understand. At the end of the day, you can
scale upstream all you want; if you can’t pinpoint why that specific
operation is slow, “ya ded”.
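
For concreteness, here is roughly what the out-of-the-box Kubernetes
flavor of “CPU is spiking, spin up more instances” looks like: a
HorizontalPodAutoscaler watching average CPU on a deployment. This is
only a sketch; the deployment name and every number in it are made up
for illustration.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-api                  # hypothetical service, for illustration only
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-api
      minReplicas: 2
      maxReplicas: 20
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80   # "CPU looks busy, add more pods"

Note what it knows about: CPU on this one deployment. If the real
problem is that unindexed join a few services away, twenty replicas
of my-api will mostly line up twenty times as many requests in front
of the same sad database.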

Okay cool let’s scale the database then

Hahahahaha no. So if your database is non-trivial in size, and
you’re not using the shiny “enterprise” tools, spinning up a new
database is an error-prone (which you definitely don’t want) and
time-consuming (which defeats the purpose) process, because all that
data has to make its way onto the new instance somehow. By the time
you’re done spinning it up, your traffic spike will be long gone,
and you’ll just spin it back down again.

Know thyself

As I said in the introduction blurb in small(er) (e-)print, I have
nothing to sell you and I don’t think there’s an “easy way out”.
I’m not arguing that “spinning up more things” is not useful. I’m
arguing that it’s inefficient. I’m arguing that most of the time,
it won’t help. It’s the infra equivalent of solving a performance
problem by adding a cache; you’re just buying extra time. Might be
enough. Might not be enough. That’s for you to know. Not my circus,
not my monkeys, so to speak.

On the other hand, you could learn about your system, figure out
where the real bottleneck is, and fix it there. And/or figure out
the exact, specific cases in which each of your systems should scale
up. I don’t know how much money it’ll save you, but it might save
you sleep. Those of you with kids (plural) know very well that you
can’t buy that back.
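
If you do go down that road, the same Kubernetes machinery will
happily scale on a signal you picked because you actually understand
the system, rather than a generic CPU percentage. A sketch, assuming
you have a custom-metrics pipeline (prometheus-adapter or similar)
exposing something like a per-pod queue depth; every name and number
below is invented.

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-worker               # hypothetical, for illustration only
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-worker
      minReplicas: 1
      maxReplicas: 10
      metrics:
        - type: Pods
          pods:
            metric:
              name: queue_depth     # whatever actually predicts *this* service needing help
            target:
              type: AverageValue
              averageValue: "30"    # scale out past ~30 queued items per pod

It’s still just buying time, of course, but at least it buys time
for the reason you meant it to.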