Thoughts Evoked By CircleCI's July 2015 Outage

After a bit of downtime, CircleCI's team was kind enough to post a
very detailed post mortem.
I'm a post mortem junkie, so I always appreciate when companies are honest
enough to openly discuss what went wrong.

I also greatly enjoy analyzing these things, especially through the complex
systems lens. Each one of these posts is an opportunity to learn and to
reinforce otherwise abstract concepts.

NOTE: This post is NOT about what the CircleCI team should or shouldn't
have done - hindsight is always 20/20, complex systems are difficult, and
hidden interactions actually are hidden. Everyone's infrastructures are full of
traps like the one that ensnared them, and some days, you just land on the
wrong square. Basically, that PM made me think of stuff, so here is that
stuff. Nothing more.

Database As A Queue

The post mortem states:

Our build queue is not a simple queue, but must take into account customer
plans and container allocation in a complex platform. As such, it's built on
top of our main database.

As soon as I read that, I knew exactly what happened. I'd lived this exact
problem before, so here's that story:

At Flickr, we would put everything into MySQL until it didn't work
anymore.
This included the Offline Tasks queue (aside: good grief, this post was
written in 2008). One day, we had an issue that slowed down the processing of
tasks. The queue filled up like it was supposed to, but when we finished fixing
the original problem, we noticed that the queue was not draining. In fact, it
was still filling up at almost the same rate as during the outage.

When you put tasks into MySQL, you have to index them, presumably by some
date field, to be able to fetch the oldest tasks efficiently. If you have
additional ways you want to slice your queues, which both CircleCI and Flickr
did, that index probably contains several columns. Inserting data into RDBMS indexes
is relatively expensive, and usually involves at least some locking. Note that
dequeueing jobs also involves an index update, so even marking jobs as in
progress or deleting them on completion runs into the same locks. So now you have
contention from a bunch of producers on a single resource, and the updates to it
are getting more and more expensive and time consuming. Before long, you're
spending more time updating the job index than you are actually performing the
jobs. The "queue" essentially fails to perform one of its most basic functions.
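The mechanics are easy to see in miniature. Here's a sketch of a database-backed queue with the kind of multi-column index described above; the schema and column names are mine, not CircleCI's or Flickr's:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    id         INTEGER PRIMARY KEY,
    plan       TEXT    NOT NULL,              -- customer plan / priority
    status     TEXT    NOT NULL DEFAULT 'queued',
    created_at INTEGER NOT NULL
);
-- The multi-column index that makes "oldest queued job for a plan" cheap
-- to fetch. Every INSERT, and every status UPDATE on dequeue or completion,
-- must also maintain this index; that maintenance (and its locking) is
-- where the contention piles up in a real RDBMS under load.
CREATE INDEX jobs_by_plan_age ON jobs (plan, status, created_at);
""")

def enqueue(plan, ts):
    conn.execute("INSERT INTO jobs (plan, created_at) VALUES (?, ?)",
                 (plan, ts))

def dequeue(plan):
    # Both the SELECT and the UPDATE touch jobs_by_plan_age.
    row = conn.execute(
        "SELECT id FROM jobs WHERE plan = ? AND status = 'queued' "
        "ORDER BY created_at LIMIT 1", (plan,)).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    return row[0]

for ts in range(5):
    enqueue("pro", ts)
first = dequeue("pro")   # oldest job (id 1) comes out first
```

SQLite won't reproduce the locking behavior, but the shape is the point: every enqueue and every dequeue is also an index write.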

Maybe my reading is not quite right on the CircleCI issue, but I'd bet it
was something very similar.

In the aftermath of that event at Flickr, we swapped the MySQL table out for a
bunch of lists in Redis. There were pros and cons involved, of course, and we
had to replace the job processing logic completely. Redis came with its own set
of challenges (failover and data durability being the big ones), but it
was a much better tool for the job. In 2015, Redis almost certainly isn't the
first thing I'd reach for, but options are plentiful for all sorts of use cases.
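The contrast with the database version is that Redis list operations are O(1) at either end and maintain no index. Here's the pattern, sketched with an in-process stand-in (collections.deque) rather than an actual Redis client:

```python
from collections import defaultdict, deque

# Stand-in for Redis: one list per queue key. In production this would be
# a Redis client's rpush(key, job) / blpop(key); both O(1), no index.
queues = defaultdict(deque)

def rpush(key, job):
    queues[key].append(job)             # enqueue at the tail

def lpop(key):
    q = queues[key]
    return q.popleft() if q else None   # FIFO dequeue from the head

rpush("queue:pro", "job-1")
rpush("queue:pro", "job-2")
oldest = lpop("queue:pro")              # "job-1"
```

One list per plan gives you the "slicing" the indexed columns used to provide, without any shared contended resource.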

Coupling at the Load Balancer

First we tried to stop new builds from joining the queue, and we tried it from
an unusual place: the load balancer. Theoretically, if the hooks could not
reach us, they couldn't join the queue. A quick attempt at this proved
ill-advised: when we reduced capacity to throttle the hooks naturally they
significantly outnumbered our customer traffic, making it impossible for our
customers to reach us and effectively shutting down our site.

I don't actually think that's an "unusual" place to start at all. If one of the
problems is that updates to the queue are becoming too expensive and every
additional update is exacerbating the problem, start eliminating updates!

The rest of that paragraph is also not unusual at all. It hints at some
details about the CircleCI infrastructure that you would find in an
overwhelming majority of infrastructures.

The public site and the Github hooks endpoint share a loadbalancer

The processes serving the site and the github hooks run on the same hardware
(likely in the same process, as they're probably just endpoints in the same
app)

There is no way to turn off one without turning off the other

Everyone that knows me knows I LOVE to talk about "unnecessary coupling" in
complex systems. This is a really good example.

The two functions have key differences - for one, their audience. Let's focus
on that. The hooks serve an army of robots residing somewhere in Github's
datacenter. The site serves humans. As a general rule, robots can always wait,
but making humans on the internet wait for anything is a big no-no. To me, this
is a natural place to split things up, all the way through. You can still use
the same physical load balancer or ELB instance, but you could make two paths
through it - one for the human oriented stuff, another for the robots. Sure,
there'll probably be some coupling farther down the line, like when both
processes query the same databases. But at least now the site will only go down
if the database is actually inaccessible, not when it has a single contended
resource that has nothing to do with serving the site.

A Long Aside: Traffic Segregation At Opsmatic

I do obsess over this stuff, and we've already had our fair share of outages
with very similar causes. I want to talk a bit about how traffic is currently
handled at Opsmatic. This section is full of admissions that we have flavors of
the same issues described above, to further drive home the point that no one's
infrastructure is perfect, certainly not ours. It's also meant to demonstrate that following
some very high level guidelines built on prior learning can go a long way
towards improving an infrastructure's posture in the event of unexpected
issues, especially surges.

There are three entry points into Opsmatic:

opsmatic.com is our company's website and the actual product app

api.opsmatic.com is our REST API, which has historically been used mostly by
the app (that's changing quickly)

ingest.opsmatic.com is the API to which our collection agents talk

Here's an ugly drawing to help you along:

The first two are configured to talk to the same AWS Elastic Loadbalancer (ELB).
The ELB forwards the traffic on ports 80 and 443 to a pool of
instances where nginx is listening. nginx in turn directs the requests.
Traffic to (www.)opsmatic.com goes to one process (a Django app run under
gunicorn), traffic to api.opsmatic.com goes through a completely different
pipeline where it's teed off to the appropriate backend depending on the URL
pattern. Currently, most of the API traffic is actually coming from humans
using the app. As we flesh out, expand, and document our REST API, that's
bound to change, at which point we may put even more buffer between the two
traffic streams - separate nginx processes with appropriate tuning, possibly
even separate hardware.
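A sketch of what that nginx routing might look like - two server blocks on the same instances, splitting traffic by hostname. All names and URL patterns here are illustrative, not our actual config:

```nginx
# Human traffic: the Django app under gunicorn.
server {
    listen 80;
    server_name opsmatic.com www.opsmatic.com;
    location / { proxy_pass http://gunicorn_app; }
}

# API traffic: teed off to backends by URL pattern.
server {
    listen 80;
    server_name api.opsmatic.com;
    location /v1/events { proxy_pass http://events_backend; }
    location /          { proxy_pass http://api_backend; }
}
```

The two streams share hardware, but each one can be tuned, throttled, or shut off without touching the other's server block.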

The third ingest.opsmatic.com subdomain is pointed at a completely different
ELB. That's our equivalent for the Github hooks - the agents are always
running, always sending heartbeats, always sending updates. An unexpected surge
in traffic - for example, an enormous new customer spinning up agents on their
whole fleet of servers all at once without warning us - could certainly
overwhelm the currently provisioned hardware. At the moment, this would take
the app down as well - while the Opsmatic backend is extremely modular, we
currently run all those pieces on the same machines. This limits operational
overhead at the expense of introducing plenty of unnecessary coupling.

However, just having the separate ELB gives us recourse in the event of a sudden
surge in robot traffic: we can just blackhole THAT traffic at the ELB and
continue serving site and API read traffic. The robots would be mad, and the
data you were browsing would gradually get more and more stale, but it
beats the big ugly 500 page.

The Opsmatic agent is also built to accumulate data locally if it can't
phone home, so the robots would build up a local version of the change history
without losing any data or timestamp accuracy. When we were back up, they'd
eventually backfill all that data. This event itself could cause a stampede,
but we've found it to be a real nice luxury to have.

The modularity combined with reasonably healthy automation allows us to regain
our balance quickly. If a certain service is overloading a shared database, we
can kill just that service while we work out what's going on or scramble to
add capacity.

Every Incident Is A Push Towards Self Improvement

The next time this sort of event does happen, we'd likely follow up with a few
more steps that have been put off solely due to resource constraints:

Split up stack-role into smaller pieces, likely along the lines of
"human-facing services" and "robot-facing services". That is, physically
separate services that deal with agent traffic from services that deal with
human traffic. Possibly we'd go a step further and split up web services from
background job processors that pull work from queues.

Split the opsmatic.com and api.opsmatic.com load balancers up

A bunch of auxiliary work on various internal tools to better
accommodate the fragmentation

The upshot - we currently have a bit of coupling and resource sharing
going on for things that really shouldn't be coupled, but it's only because
we've postponed actually splitting everything up in favor of other projects. We
are:

Seconds away from being able to blackhole automation traffic in favor of
preserving the app, as well as turning off any background processing that might
be causing issues - we can just let that queue grow, turn the service on and off
as we try different fixes, etc.

A few minutes of fast typing away from adding capacity while most of our
customers likely don't even know anything is amiss

A few more minutes of fast typing from completely decoupling robot traffic
from human traffic so that the next surge doesn't affect the app at all

Hey, that's pretty good! If we have to fight a fire, at least we can fight it
mostly calmly. That, in and of itself, is huge. Being able to isolate the
problem and say "OK, this is the problem, it is not the whole infrastructure, it
is contained to a particular set of actions and now we're going to work on it"
is huge for morale during an outage. I do not envy how the CircleCI
team must have felt when attempts to bring back the queue took down the main
site.

I used the word "posture" earlier - I have in mind a very specific property
when I use that word. It's not so much about "how resilient to failures is our
infrastructure?" but rather "how operator-friendly is our infrastructure during
an incident?" Things like well-labeled kill swtiches, well segmented traffic, well
behaved background and batch processing systems that operate indepenently from
the transactional part of the app go a long way towards decreasing stress levels
during incidents.

Conclusion?.. What is this post, even..

This turned into a bit of a rambling piece. Hope you found it interesting.
Here are my key takeaways:

You can use a database as a queue, but you should keep a close eye on the
timing data for the "work about work" your database is doing just to get jobs
in and out. One day, you're going to have a bad time. That is ok. It'll make
you stronger.

It pays to think about the sources of traffic to your infrastructure and how
they interact with each other. Over time, it pays even more to have parallel,
as-decoupled-as-time-allows paths through your system, any of which can be shut
off in isolation.

Every infrastructure is a work in progress; computers are hard, and
distributed systems are even harder

A Story and Some Tips For Sustainable OSS Projects

This past week Kyle Kingsbury tweeted about being flooded with pull requests
caused by changes to the InfluxDB API. Coincidentally, I had just spent
several hours over the July 4th weekend dealing with the same problem in
go-metrics, albeit on a smaller scale. I think these are symptoms of a very,
very common problem with OSS projects.

A bit of history

The Metrics library has a very simple core API made up of various
metrics-related interfaces - you can create metrics, push in new values, and
read the metrics' current values and aggregates. Simple and beautiful.

The library was originally put together by the epic Richard Crowley while he was working
at Betable. He was starting to experiment with using Go for services, and needed
a way to keep track of them. Finding no satisfactory equivalent to Coda Hale's
metrics library for Java, Richard made his
own. Folks quickly wrote adapters to push metrics into their time series system
of choice - I wrote one for Librato. Richard happily merged the PRs.

The core features were built, everything worked reasonably well, and Richard
moved on to a job that doesn't use Go nearly as heavily. Several months later, I
noticed go-metrics had 20+ open pull requests. I pinged Richard and offered to
help maintain the project. We were using it heavily, and were happy to pay our
dues. Richard immediately made me and Wade, a
Betable employee, collaborators on the repository. I started looking over the
PRs.

The Paralysis

I quickly realized that I was not qualified to review a good chunk of the PRs:

Update for InfluxDB 0.9

Fallback to old influxdb client snapshot

Update influxdb client

"I don't know jack about InfluxDB," I thought. "How am I supposed to decide
what gets merged and what doesn't?" There was a Riemann client in there too. Who
am I to judge a Riemann client lib?

I had also observed that the InfluxDB API was still changing quite a bit. I
remembered that there had previously been a wave of PRs about InfluxDB. Wait,
was this the same wave?

Another issue that gave me pause was that I had no idea how many people were
already using this library with Influx, expecting the current client to continue
working. How many builds would break? Go's notoriously loosey-goosey dependency
management made it likely that as soon as I merged any API-changing PR, I would
get another PR changing it back the next day.

There was also a PR about adding a Riemann client. Welp, I don't use that
regularly either..

Clarity

In the summer of 2012, I did a brief consulting stint with Librato. Among other things, I
helped build a Java client library. They also asked me to tie that client to
Coda's library, so I obliged and submitted a PR.
Coda replied fairly tersely:

Really cool functionality, but I've been declining further modules for the
main Metrics distribution. I suggest you run this as your own project. I'll be
adding a section in the Metrics documentation with links to related libraries,
and this should definitely be in it.

At the time, I thought "Well that kinda sucks. I want my code up there, with the
cool kids' code in the really popular library." Now, literally 3 years later, I
understand exactly why Coda made that move. He didn't use Librato. He had no
idea what would make a good or bad Librato client. It was just more surface area
to support. He had enough to worry about with core Metrics and DropWizard
features, keeping up with JVM changes and compatibility issues, etc. Never mind
other projects.

The Path Forward

Though Kyle points out that this may not be the best approach for every
project, it seemed very clear to me that the only way the go-metrics lib could continue
to be maintained, at least by myself and Wade, was to modularize and move
any external dependencies out to their own libraries - with their own
maintainers, and hopefully their own communities. It's not going to make the
"moving target API" problem any easier, but it'll put the
solution into the hands of the people who are actually interacting with the
problem and have a vested interest in achieving and maintaining a palatable
solution. It removes me, Richard, and Wade, completely disinterested and
uninitiated bystanders, from the critical path to a solution.

At the end of the day, it's just Separation of Concerns. It's just good
organization. The task is broken up into small semi-independent pieces with
responsibility for each piece given to the person with the most interest in that
piece. There's a corresponding and very palpable feeling of psychological
relief. "Review the PRs for go-metrics" is no longer this huge nebulous task
that will require a huge amount of context and deep understanding of some
additional system. I know the core APIs. I can evaluate changes to that fairly
quickly.

Practical Tips For Maintainers

If you find yourself maintaining a small OSS project with a fairly well defined
scope and API, here are some tips to keep yourself sane (some of these are more
general, not specific to the above story):

Always have a buddy. If your project gets any traction and you start
seeing community adoption, find one or more particularly enthusiastic users and
convince them to help carry the load. We all want to take care of our baby
projects, but real life is what it is. People change jobs, have health issues,
go on lengthy vacations, start families, become vampires. Some combination of
those things will likely make your interest in any given project oscillate, and
you should have a framework in place for making sure you don't create another
zombie on GitHub.

Resist dependencies. If someone creates a PR which brings in a new library,
especially code that talks to something over the network - a server or SaaS
of some kind - strongly consider pushing the author towards starting their own
library. If this is not possible due to a lack of APIs, invest the time in
adding hooks instead. It'll be worth it.
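A sketch of what "hooks instead of dependencies" can look like: the core exposes a small read-only iteration hook, and every reporter builds on that from its own repository. The names here are mine, sketched in Python rather than go-metrics' actual Go API:

```python
# Core library: knows nothing about Librato, InfluxDB, or Riemann.
class Registry:
    def __init__(self):
        self._counters = {}

    def incr(self, name, n=1):
        self._counters[name] = self._counters.get(name, 0) + n

    def each(self, visit):
        # The hook: reporters get read-only iteration over every metric,
        # and live in their own repositories with their own maintainers.
        for name, value in sorted(self._counters.items()):
            visit(name, value)

# An out-of-tree "reporter" built only on the public hook.
def snapshot(registry):
    out = {}
    registry.each(lambda name, value: out.__setitem__(name, value))
    return out

reg = Registry()
reg.incr("builds.started")
reg.incr("builds.started")
reg.incr("builds.failed")
```

The core maintainer only ever reviews changes to Registry; what a reporter does with each snapshot is someone else's review queue.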

Have a concise contribution policy. This will greatly reduce the burden of
having to reply to PRs that suffer from obvious code quality issues. It is an
absolute MUST to have a pre-written set of rules to appeal to instead of having
to post seemingly arbitrary responses to individual PR authors.

Enforce guidelines automatically whenever possible. We are living in a
remarkable age. The tools available to maintainers are simply amazing. With the
help of services like GitHub, TravisCI, CodeClimate, etc., there's no need to
maintain a mailing list, apply patches by hand, set up some jury-rigged systems
for running tests. It's all free, and it's all great. Use it. go-metrics and
go-tigertonic do not take advantage of the OSS ecosystem, and I am about to
fix that. One other small note here: you should make it very easy to replicate
the exact process that the build is going to perform locally. There should be a
Makefile or something similar containing the one command that the build tool
is going to run so that folks can validate their branches easily without having
to wait on the CI tool to run against their PR.
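For a Go project like go-metrics, such a Makefile can be tiny; this is an illustrative sketch, not the project's actual build file:

```make
# One entry point that both developers and the CI tool run, so a green
# local `make ci` means a green build on the server.
.PHONY: ci fmt vet test

ci: fmt vet test

fmt:
	test -z "$$(gofmt -l .)"

vet:
	go vet ./...

test:
	go test ./...
```

The CI config then just runs `make ci`, and contributors can do exactly the same thing before opening a PR.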

Hopefully you find our experience with maintaining and reviving go-metrics
helpful, and this story helps you avoid similar pitfalls. Happy hacking.

A failure months in the making

This is the story of an outage that occurred on September 25th 2014 and has
previously been discussed in the context of blameless post mortems on the
PagerDuty blog.

If you attended Surge 2014, you may have noticed something strange: a man was
sitting on one of the cube-shaped stools in the Fastly expo area hunched over
his laptop almost the entire day, and well into the evening hours. Even if you
didn't notice, and even if you weren't even AT the conference, you may be
curious about this man. The security guard certainly was, as he made his rounds
after dark, long after everyone had left the expo area..

That man was yours truly; I was fixin' stuff. This is the story of what
happened.

The Outage

On September 24th Opsmatic was one of the many AWS customers to receive
one of these emails:

One or more of your Amazon EC2 instances are scheduled to be rebooted for
required host maintenance. The maintenance will occur sometime during the
window provided for each instance. Each instance will experience a clean
reboot and will be unavailable while the updates are applied to the underlying
host. This generally takes no more than a few minutes to complete.

The EC2 Event Console confirmed that quite a few instances in our infrastructure
would be affected:

All the servers would be rebooted early Friday or Saturday morning SF time..
while I was at the conference. There was not much certainty in the exact timing
or order of the reboots (the windows were 4 hours long), but we did eventually
discover some good news:

Any instances using EBS for their root volume could be put through a
stop/start cycle in advance of the window to avoid the reboot. When you "stop"
an instance, you're essentially destroying it, but the EBS volume survives. When
you "start" it back up, you get no guarantees about which "host" will receive
the instance that will then boot that volume. This is where "ephemeral" drives
get their names - they are attached to the "host" and do not survive a
stop/start.

Any instances provisioned after the notifications went out would not need to
be rebooted. As we later learned, the reboots were necessary for Amazon to
roll out a patch to Xen which fixed XSA 108.
Many hypervisor "hosts" were already running patched code, so Amazon would
simply put new instances on already-patched hosts.

Since every single piece of Opsmatic's infrastructure is redundant at least at
the instance level, we quickly concluded that this was actually not that big of
a deal:

All of our nodes used EBS root volumes, so they could be stop/started

Most of our nodes do not use ephemeral storage for anything important

The affected nodes that DID use ephemeral storage were Cassandra nodes. Since
we use a replication factor of 3, we can afford to have at least one of those
rebooted at any time.

We briefly debated pre-emptively re-provisioning the Cassandra nodes anyway, but
decided that it was better to just let the reboot happen. Copying data is time
consuming, and the reboots were hours away. We would get up just before the
maintenance window started and gracefully stop Cassandra on the node about to be
rebooted, out of an over-abundance of caution.

To minimize the amount of odd-hours activity, we decided to stop/start all the
stateless nodes that were scheduled to be rebooted on our own terms, during
business hours. Since I was already at a conference, I'd take care of it in
order to minimize disruption to the rest of the team back home, cranking away.

At around 13:50 PDT I started the process. I stop/started one of our NAT nodes
without incident. Then things got a little murky.

For some reason, I decided to actually replace one of the nodes, but I don't
remember why. I did not make any record of my reasoning. It is entirely possible
that I got distracted between the last node and the next one and went to
reprovision it instead of just doing a stop/start cycle. It's also possible
there was some other issue with the node, and I simply failed to document it.

At about 14:15 PDT, I terminated one of our "stack" nodes (they run all the
services that power the Opsmatic app) and then went to replace it.

We had provisioned our AWS infrastructure using Chef Metal
so replacing the node should have been as simple as terminating it and then
"converging" the infrastructure - a single, global command that does not take
any parameters other than the declaration of what your infrastructure should
look like (number of nodes in each cluster, etc). Chef, in theory, would detect
that the "stack" cluster was missing a node and provision a new one to replace
it.

So that is what I did. Replacing a node in our infrastructure is a routine
operation that we had practiced several times without incident.

At 14:20 PDT Opsmatic went down in flames. The Chef run restarted every single
instance in our infrastructure.

Talk about a "Game Day"...

As soon as the instances came back up, we scrambled to make sure that all the
services were back to normal. We were down for a total of about 30 minutes, in
part because there were certain parts of the recovery process that were not as
smoothly automated as we had thought; these defects became very apparent during
the previously un-tested "restart the entire infrastructure" scenario.

The Causes

Once service was restored, we started trying to figure out what the hell had
happened. Meanwhile, the delightful Surge lightning talks were drawing
uproarious laughter in the main ballroom behind me.

As I scrolled frantically through the log from my fateful Chef run, I saw a bunch
of lines like this:

[2014-09-25T21:18:39+00:00] WARN: Machine ******.opsmatic.com (i-*******
on fog:AWS:************:*********) was started but SSH did not come up.
Rebooting machine in an attempt to unstick it ...

One per server. We quickly confirmed in the #chef IRC channel that this was a
bug - because Chef could not establish an SSH connection to these nodes,
it decided to reboot them. That, apparently, should not have happened.

[2014-09-25T18:30:13-0400]
<johnewart> Ah, well -- you managed to uncover a bug by doing that
<johnewart> we should only reboot it if it's within the first 10 minute window
<johnewart> like, you create, and then try to run again 5 minutes later and it can't connect

After a bit more digging, we sorted out that chef-metal had been relying on
the ubuntu user being present on all our machines along with a specific
private key. Something had caused the home directory for the ubuntu user to be
deleted.

At this point I remembered something: a LONG time ago, before Opsmatic even had
a name, I had done some experiments with AWS. As part of that, I had a
bootstrapping scheme which relied on the same ubuntu user (standard practice
when provisioning Ubuntu AMIs), but also included a recipe called
remove_default_users which nuked the ubuntu user once bootstrap was
complete.

This bootstrap process was never used for anything serious - the initial
iteration of Opsmatic's infrastructure was one big server at an MSP; from there,
we moved straight to the Chef-driven AWS setup. However, that small bit
of cruft persevered in our chef-repo.

My hunch was correct. Although remove_default_users was never part of any
roles or run lists in the new infrastructure, we were able to confirm that it
was applied on all the nodes on August 31st (just a couple of days after the
last time we had practiced replacing a node) by performing a search in Opsmatic
itself:

However, by the time of the outage it was once again absent from all run lists.
So how did it get there on August 31st and how was it ultimately removed? That
would take another couple of weeks to figure out.

The remove_default_users recipe was clearly dead weight; we had gotten a
little sloppy and let a bit of invisible technical debt accumulate. In order to
prevent the same thing happening again, we immediately deleted the recipe. This
has another nice side-effect: the next time this recipe appeared in a run list,
Chef would fail. We have good visibility into those failures in Opsmatic, so we
would be able to react and debug "in the moment."

That exact thing happened on October 14th: as I was doing some
refactoring in our cookbooks and roles, I found chef failing because it could
not find remove_default_users. I knew I was about to find something important
- something slippery, elusive, confusing, and damaging. Indeed.

The recipe was originally part of a cookbook called base - a collection of
resources that needed to be applied to all nodes. As we moved to a
"more-than-one-node" setup, we started using Chef roles to define run lists. The
base cookbook was pulled apart and reconstituted as a role to be included in
other roles. There was a step in the refactor where "parity" was achieved - the
role was made to replicate the previous behavior exactly. At that point, the
role was copied into another file called base-original.json to be used as a
reference as pieces of it were pulled into other cookbooks etc. Many edits were
then made to the role in the base.json file.

The base-original.json file stuck around in the roles directory.

But here's the thing about a role file: unlike cookbooks, the name of the role
doesn't just come from the filename; it comes from the name field defined
inside.

$ head roles/base.json
{
"name": "base",
"description": "base role configures all the defaults every host should have",
"json_class": "Chef::Role",
...
$ head roles/base-original.json
{
"name": "base",
"description": "base role configures all the defaults every host should have",
"json_class": "Chef::Role",
...

The majority of time spent working on Chef is spent working on cookbooks, so
it's easy to forget the subtle differences in behavior with roles.
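Because a role's identity comes from its name field, this class of collision is cheap to lint for before uploading. A minimal check, sketched in Python (the function name and layout are mine, not a standard Chef tool):

```python
import collections
import json
import pathlib

def duplicate_role_names(roles_dir):
    """Return {role_name: [files, ...]} for names claimed by 2+ role files."""
    seen = collections.defaultdict(list)
    for path in sorted(pathlib.Path(roles_dir).glob("*.json")):
        # The role's real identity is the "name" field, not the filename.
        name = json.loads(path.read_text()).get("name")
        seen[name].append(path.name)
    return {name: files for name, files in seen.items() if len(files) > 1}
```

Run against a roles/ directory, this would have flagged base.json and base-original.json as both claiming the name "base" before either was uploaded.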

So what had happened was this: while modifying something else about the base
role, I had assumed that base and base-original were different roles that
were both in use. I had modified both files and uploaded them both to the Chef
server, first base, then base-original. In reality, they both updated the
same role, and the base-original content won out because it was uploaded
second. Chef ran at least once with this configuration, deleting the ubuntu
user. Some time later, someone who DID know that base-original was not to be
uploaded made yet more changes and only uploaded base, wiping
remove_default_users out once more. By the time the epic reboot happened, it
was gone from the run list again, leaving us to scratch our heads.

Because the ubuntu user was created by the provisioning process and not
explicitly managed by Chef, it was not re-created.

Whoever ran chef-metal next was going to cause a global reboot. It just so
happened that I did it from a conference and ended up spending my evening
plugged into an expo booth's outlet.

Remediations and Learnings

Computers are Hard

Managing even a small infrastructure requires discipline, precision, and
thoroughness. The smallest bit of cruft can combine with other bits of cruft to
form a cruft snowball (cruftball?) of considerable heft over a relatively short
time period.

Cookbooks vs Roles

This sort of failure is exactly the cause of the trend towards "role cookbooks"
replacing the role primitive. Having a recipe that is simply a collection of
other recipes is functionally identical to a role, but has a few advantages -
namely versioning (enough said) and consistent behavior with resource cookbooks.
Having a recipe named base-original.rb would have had no effect on a recipe
named base.rb.

chef-metal

While the theory behind chef-metal sounds good, we have started switching away
from it. Bugs and maturity are the immediate problems, but it would be foolish
to act like those don't exist in all software, including whatever other scheme
we end up using. This single bug is not why we're migrating away.

The theory behind chef-metal itself sounds good, and it's the
"right" sort of automation, i.e. it's not just scripting steps normally
performed by a human.
However, it was very alarming how easily a very localized, routine change which
had been successfully executed fairly recently turned into a global disaster.
This is a big red flag for any system. It is an indicator of unnecessary
coupling. Every time we wanted to add any node to our infrastructure, however minor
and auxiliary, we'd have to perform an operation that touches everything.
Having witnessed the potential for disaster, this would elicit a healthy dose of
The Fear each time. In the long run, if we're afraid to perform simple tasks
with the provisioning system, we're not going to provision and replace nodes
as frequently. Whenever you stop doing something regularly, you become bad at
it. Routine operations should have routine consequences.

There are also more tactical concerns: "can't SSH to this server, better reboot
it" sounds EXACTLY like automating a manual ops process, and a bad one at that.
Then there's the security angle: even with the bug fixed, chef-metal still
requires SSH access to the servers it manages with elevated credentials. In
other words, you have to keep the provisioning user (ubuntu in our case)
around on your instances forever. We strongly dislike that - it adds another
little bit to the surface area. Sure, you need to be on a private network in
order to get to SSH in the first place, but it's another hidden back door that's
easy to neglect. We'd rather not have it.

We haven't had much time to think about it, but this approach may work much
better when applied at the container level, one step removed from the actual
infrastructure. We may investigate it in the future. For now, our infrastructure
is small, homogeneous, and simple enough that we will simply be switching to a
more "transactional" provisioning process.

Documenting and Finishing Big Migrations Quickly

A huge part of this was just technical debt - recipes, cookbooks, and roles left
over through consecutive refactors. Even in a "simple" infrastructure, success
and safety depend on a vast set of shared assumptions about how things work. As
individuals change the systems' behavior, the change has to be explicit, easy to
understand, and easy to remember. Pieces being left around from "the old way"
make it easy to make a no-longer-valid assumption.

Things We Should Add To Opsmatic

We're constantly improving teams' visibility into changes and important events
in their infrastructure. Being able to find when a particular recipe changed
was great, but the experience also illuminated some gaps in our view of CM (e.g.
role/run list changes, and some "meta" features to surface such changes). We're
hard at work converting what we learned into real improvements in the product.

Parting Thoughts

As soon as we recovered from this outage, I thought "I'm going to have to write
about this." It is a great example of a complex system failure, "like the ones
you read about." It served as a great, rapid refresher course on complex system
theory; it reminded us that we have to minimize coupling and interactions within
our systems constantly and ruthlessly.

If you enjoyed this story (you sadist), you'll probably like the following posts
and books in the broader literature.

Two Factor Auth: Allow AWS IAM users to manage their own MFA devices

In light of all the recent incidents involving attackers taking control of a
company's root AWS account, I, and most everyone I know who manages any sort of
infrastructure, have been re-auditing accounts and stepping up efforts to get
everyone within our teams to turn on MFA (multi-factor authentication). MFA
makes it impossible for someone to log in as you with just a username/password
combo. An additional "factor" is required to confirm the user's identity -
typically a code from a synchronized number sequence. This has been standard
practice in larger companies and capital-E Enterprise for many years, and is now
starting to be taken seriously by folks operating at a smaller scale and in the
cloud. No one wants to be the next
tragedy.

MFA (or 2-factor auth) has traditionally been embodied by RSA tokens
attached to a keychain or a badge lanyard. These days, your phone can act as an
adequate substitute.

Turning on MFA for your root AWS account is fairly easy.

However, it took me an unfortunate amount of time to figure out how to allow
users created as IAM accounts to manage their own MFA devices. Setting people's
devices up by hand through the root account was simply not an acceptable
solution. Even at our size it was going to be a major headache, especially
for our remote employee.

In the end, it's all documented in AWS docs, but it's a bit buried, and multiple
steps are involved. Hopefully this post saves you some time.

Just The Right Amount

The critical thing is to give everyone JUST what they need and no more. Since
you've already secured your root account, you can likely curtail the breach of
an IAM account reasonably quickly, but it's best if the account can wreak minimal
havoc in the first place. For example, if a compromised account was able to
fiddle with the credentials of other users, the exposure and cleanup effort
would increase greatly.

Unfortunately, the IAM permissions policy system is rather arcane. That is an
undesirable property for a security-related system to have (easy to get wrong),
but alas, it's the one we've got.

IAM policies are made up of combinations of JSON blobs ("stanzas"), each containing a
unique identifier, an effect (Allow, Deny), an action, and a resource to
which the effect/action combo should be applied. There's a whole bunch of
documentation on the subject
here
so I won't spend too much time elucidating it. Let's cut straight to what we
need.
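To make that structure concrete, here's a hypothetical stanza sketched out in
Python (the account ID is made up, the action list is abridged, and the exact
action names should be taken from the AWS documentation linked above rather
than from here):

```python
import json

# A hypothetical, abridged policy letting IAM users manage their own
# virtual MFA devices. The account ID (123456789012) is a placeholder;
# yours is on the root account's "Security Credentials" page.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowUsersToManageTheirOwnMFA",  # unique identifier
            "Effect": "Allow",                       # Allow or Deny
            "Action": [                              # what may be done
                "iam:CreateVirtualMFADevice",
                "iam:EnableMFADevice",
                "iam:ResyncMFADevice",
            ],
            "Resource": [                            # what it applies to
                "arn:aws:iam::123456789012:mfa/${aws:username}",
                "arn:aws:iam::123456789012:user/${aws:username}",
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

The `${aws:username}` policy variable is what scopes the permission down to
each user's own resources rather than everyone's.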

MFA Device Permissions

When you create an IAM user, by default they are unable to do literally
anything. When you pull up the IAM dashboard (where you have to go in order to
set up your MFA device), you literally just see permissions errors everywhere.

"Well that sucks," I thought, looking over a co-worker's shoulder. Googling
"allow IAM user to manage own mfa device," we find this lovely page:
Example Policies for Administering IAM Resources
Under the heading "Allow Users to Manage Their Own Virtual MFA Devices (AWS
Management Console)", we find an example policy that should do the trick.

Since this is in no way obvious, I will also note that the account ID is found
on the "Security Credentials" page of the root AWS account.

This appears to be sufficient to let users find themselves in the "Users" menu,
click the "Manage MFA Device" button, and go through the rest of the process.

Passwords etc

I also found it useful to give our users the ability to manage the rest of their
own credentials. The relevant policy stanzas can be found
here.

Surprisingly, the default "Password Policy" on our AWS account was set to
allow passwords as short as 6 characters with no additional requirements. Even
with MFA enabled, you'll want to crank that up to something quite a bit more
robust.

Keeping the robots at bay

One other important aspect of our setup is the fact that only humanoid users are
able to manage their own credentials. We have a number of automation-related
"bot" accounts who have security policies tailored specifically to their
purpose - the backup user only has access to a specific S3 bucket, the
dnsupdater user only has access to a specific Route53 zone, etc. Even with
this limited set of permissions, it's important to make it difficult for an
attacker to gain control of these users. They do not have passwords, and they
are never granted permissions to manage their own credentials. This is
accomplished by attaching the policies described above to a humans group and
only adding users with a verified heartbeat to that group.

Enforcing a Policy

We have a policy of not allowing access to any AWS resources without an MFA
device enabled. However a policy is only as good as its enforcement. I did a
brief google and didn't find any automated tools to do the job, though I did not
try very hard. I did find that the AWS CLI
tool has an aws iam get-credential-report
command, which returns a base64-encoded CSV file containing information about
all the IAM users' credentials. One of the columns is mfa_active, so the data
is all there to automatically enforce an MFA policy.

(NB: you have to run aws iam generate-credential-report beforehand. Full docs are here.)

For example, the following python snippet (available as a gist
here) will parse the
contents of the report and tell you who doesn't have MFA enabled. All you have
to do is chmod +x the file to make it executable, then pipe the report into it
like so: aws iam get-credential-report | ./scripts/parse_credential_report.py.
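In case the gist ever disappears, here's a minimal sketch of the same idea. I'm
assuming the raw base64 body is what you're handed; the sample report below is
fabricated, and real reports carry many more columns:

```python
import base64
import csv
import io

def users_without_mfa(report_b64):
    """Return usernames from the credential report whose mfa_active
    column is anything other than 'true'."""
    raw = base64.b64decode(report_b64).decode("utf-8")
    reader = csv.DictReader(io.StringIO(raw))
    return [row["user"] for row in reader if row["mfa_active"] != "true"]

# Fabricated two-user report standing in for the real CSV.
sample_csv = "user,mfa_active\nalice,true\nbob,false\n"
sample_b64 = base64.b64encode(sample_csv.encode("utf-8")).decode("ascii")
print(users_without_mfa(sample_b64))  # ['bob']
```

Run it on a schedule and nag anyone who shows up in the output.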

For our current team size and growth rate, and compliance needs, this is
sufficient. I did come across an example of what a fully-fleshed out tool would
look like in the excellent DevOps Weekly: The
Guardian's gu-who for performing
account audits on GitHub accounts.

Low-hassle HTTP metrics with Tigertonic and Go-metrics

First things first: What the shit is tigertonic?

Tigertonic is a framework for making webservices in Go written by Richard
Crowley (I have contributed a bug fix or a feature here and there). Its defining
characteristic is that it allows you to translate functions which take and
return specific Go types into http.Handler implementations that understand and
return JSON payloads. Define your signature, pass it into the correct Tigertonic
wrapper, and out comes a web service that takes in JSON, unmarshals it to the
input type, passes it to your handler, then takes the return value from your
handler and marshals it into JSON for the response.
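To make the shape of that pipeline concrete, here's the same idea sketched in
Python rather than Go (the names here are mine, not Tigertonic's, and Go's
static types are what make the real thing worthwhile):

```python
import json

def marshaled(func):
    """Wrap a function of (input dict -> output dict) into a 'handler'
    that speaks JSON on both ends - the flavor of what a typed
    JSON-marshaling wrapper does for you."""
    def handler(request_body: bytes) -> bytes:
        payload = json.loads(request_body)          # unmarshal the request
        result = func(payload)                      # call the business logic
        return json.dumps(result).encode("utf-8")   # marshal the response
    return handler

# Business logic knows nothing about JSON or HTTP.
def create_widget(req):
    return {"id": 1, "name": req["name"]}

handler = marshaled(create_widget)
print(handler(b'{"name": "sprocket"}'))  # b'{"id": 1, "name": "sprocket"}'
```

The point is that `create_widget` stays a plain function over plain values;
serialization concerns live entirely in the wrapper.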

It's similar to JAX-RS/Jersey annotations, but with much less code, and with
most of the ugly bits hidden from the framework's user.

Check out the README for
more info. Richard has also written
and spoken about
Tigertonic on various occasions. It's all well worth reading.

So You Want Some Metrics

At Opsmatic we strive to be a "learning organization" -
we want to learn something from every release, every change, every customer
interaction. An important component of that philosophy is an obsession with
measuring things. Jim, our CEO, wants "If you can't measure it, don't ship it"
written on his headstone when the time is right. No joke.

One of the things we wanted to measure was the number of requests served by our
API. While we were at it, we thought we'd grab the timing data too for
operational purposes.

go-metrics and Tigertonic

Richard is adamant about everything in Tigertonic reducing to an implementation
of http.Handler, and with good reason: doing so enables the Handler that
actually performs the business logic to be wrapped in any number of completely
orthogonal Handlers that handle all sorts of other concerns - logging, CORS rules,
authentication... and metrics! (The
README lists
the available handlers.) The separation of concerns afforded by this approach is
truly refreshing.

Go-metrics is a library, also
maintained by Richard, that provides similar capabilities to Coda Hale's great
Java metrics library. It makes it very easy to
time and count things, as well as to extract the data from the timers and
counters.

Tigertonic comes with a few wrappers that hook up our Handlers directly
to these metrics. We're going to look at a couple in particular: Timed and
CountedByStatusXX. The former is a very thin wrapper around the functionality
of a go-metrics Timer - it just times the request and records the reading.

The latter is a bit more involved, but is also ultimately a thin wrapper around
some go-metrics primitives; it counts the number of requests that result in a
given class of response codes (2XX, 5XX, etc.). You can look at the code
here.

Adding a counter is done by calling tigertonic.Counted(yourHandlerHere, ...).
Since the return value is also an http.Handler, you can pass that to
tigertonic's multiplexer or really anything that operates on http.Handler -
including the stdlib http server.
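The wrapping idea itself is language-agnostic. Here's a rough Python rendition
of what Timed- and CountedByStatusXX-style wrappers do; the handler shape and
names are made up for illustration and are not tigertonic's actual API:

```python
import time

def timed(handler, timings):
    """Wrap a handler, recording elapsed time as a side effect.
    The wrapped thing has the same shape as the original handler."""
    def wrapper(request):
        start = time.monotonic()
        try:
            return handler(request)
        finally:
            timings.append(time.monotonic() - start)
    return wrapper

def counted_by_status(handler, counts):
    """Wrap a handler, counting responses by status class (2XX, 5XX, ...)."""
    def wrapper(request):
        status, body = handler(request)
        key = f"{status // 100}XX"
        counts[key] = counts.get(key, 0) + 1
        return status, body
    return wrapper

# A trivial handler: takes a request dict, returns (status, body).
def hello(request):
    return 200, "hello, " + request.get("name", "world")

timings, counts = [], {}
instrumented = timed(counted_by_status(hello, counts), timings)
instrumented({"name": "ops"})
instrumented({})
print(counts)  # {'2XX': 2}
```

Because each wrapper returns something with the same shape it received, they
compose in any order, which is exactly the orthogonality being praised above.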

Putting it all together

The goal at the outset was to easily capture metrics on all our endpoints. How are we doing on that?

Quite well, it turns out. All we have to do to achieve the goal is some wrapping.

Et voilà. We need to give our handlers some names for the purposes of metrics
collection, so we create a little wrapper function that takes that name and a
Handler and wraps it in all the properly named metrics collectors. When we
need to add more handlers, we wrap those too and the data shows up for
free. In the instrumented version of the code
you can see that I've also made a call to metrics.Log which spawns a
reporter goroutine off into the background, printing out the stats every 10
seconds. There are a number of more useful reporters available - for example,
I've contributed a Librato reporter
which posts the metrics to the Librato API.

Slightly More Advanced

The full Opsmatic version of the above code is included below for additional
illustration. It is expanded to include the name of the service, some CORS
defaults, and two versions of the wrap method - one that includes a call to
tigertonic.Marshal and one that does not; we need the latter to accommodate a
couple of endpoints we have that do not return JSON.

Conclusion

Using this little bit of boilerplate code, we can readily instrument new
endpoints as they come online without cluttering the code with counters and
timers. Using the aforementioned Librato reporter, we get graphs for new
endpoints that we deploy instantly and with zero additional wrangling. It's
quite a nice setup that required a fairly modest amount of code and requires
very minimal marginal effort on new endpoints. We hope that you enjoy it as
well.

While I agree that stretches of concentration are important for figuring out a
specific task, I think that this chorus - the insistence that programmers must
be shielded from interruptions at all costs - is at the heart of a serious
misunderstanding many engineers have about their value as members of an
organization, one that is resulting in a tremendous amount of waste.

Sure, constant interruptions and context switches are exhausting and difficult.
I'm not suggesting that we should spend all day turning from one conversation to
another. It's easy to overdo meetings and office shenanigans. However, a
healthy amount of interaction and socialization has some very important
benefits.

Interruptions cause you to retrace your steps - this is often good

There is a much less edifying real-life counterpoint to the widely romanticized
deeply concentrated programmer. It's that of a programmer spending 4 hours
trying to track down a confusing, elusive bug, only to figure it all out 5
minutes after walking away from it. I've done it, I've seen it, and I continue doing
it and seeing it.

There's a very simple explanation for this phenomenon: in order to be able to
reason about an algorithm, especially a complex one, we have to assume and take
a whole load of things for granted. The stack, the configuration, the interfaces
on top of which we're working.

An incorrect assumption is a common source of confusion and infuriating
debugging. If you're lucky, the false assumption will be illuminated by a
debugger or a log line. However, the longer you've been staring at the same
problem, the more likely you are to miss something much simpler. That helper
function you stubbed out earlier while testing something else? Yeah, that's
still there. You'll feel real dumb when you remember.

Interruptions - planned or unplanned - cause you to "resurface" and to have to
re-engage the problem almost from scratch. Part of that process is rebuilding
that chain of assumptions. Stepping back from a problem and seeing the bigger
picture is often much more productive than spinning down in the bowels of your
code.

(Here's a great talk by Joe Damato with a pretty good
discussion of discovering violations in your basic assumptions)

Re-reading your own code is the best way to write readable code

If you're writing a bunch of code in a hurry, and especially if you're doing so
while fighting through bugs, you're likely leaving a disaster zone in your wake.
Even if you think you're writing "clean code" and writing tests to go along with
it, there are probably sections in your code that barely make any sense by the
time you've gotten them to do what you want.

Pair programming is one way of solving this - your passenger will point at the
screen and call you out for getting too fancy or too casual with your
single-letter variables. I'm still torn on pair programming, but I do think
it's a great idea to re-read your own code regularly for reasons related to the
first section.

While an interruption causing you to lose context can be annoying, the forced
re-construction of context can point out flaws in your reasoning and force you
to recognize sections of code that are hard to read - because you'll have
trouble reading them too.

Your peer has likely seen the same problem before

We spend a lot of time talking about sharing code and know-how in the OSS
community. We've also been putting lots of emphasis on DRY - "Don't Repeat
Yourself." Well, it's more like DRO - "Don't repeat others." This broader
message applies to your peers as well. When you're dealing with OSS code and you
find a bug you can't sort out, you ask the internet and see if anyone else has
had the same problem. For whatever reason, we find this easy, but we find
turning to our neighbor and asking the same thing difficult - PROBABLY because
we're afraid of the stigma of interrupting them. So we spin our wheels. Awesome.

Don't forget that someone in the room is very likely to have used the same
software and tools you're using, seen similar problems in the same or similar
systems, or, if you're really lucky, wrote the damn thing in the first place.

Interruptions often come with an opportunity to ask your colleagues - they
may well be interrupted too.

Are you even solving the correct problem?

Many conversations between engineers about productivity make it sound like the
goal of programming is to write as many lines of code as possible. This has been
reinforced by stories of companies like Google which were "run by the
engineers." I believe this has caused people to imagine the original Google
employees all furiously writing code for 16 hours a day without uttering a word
to each other or anyone else, inevitably producing the world's best search
engine.

This is pure professional hubris. Hubris is all I hear when engineers bitch
about product and project managers interrupting them with all their "process."
Sure, it's easy to overdo, but it brings us back to that whole
"know your business"
thing.

Sure, if you sit in your little cave for 16 hours, you're going to write a whole
bunch of code. But... what did you just produce? Sure, it's "correct" in the
strict engineering sense of the word - the right inputs produce the right
outputs, etc.. But is it correct in the context of a product? Did you actually
build something people will want? Does it work, as in, does it behave the way a
customer would expect? Chances are it does not, because it's hard to build
things for humans without talking to them.

The reality of the matter is that Google's early engineers were successful
because they were good at all those other things as well, not because they
ignored everything around them and ground out code.

How hard are you concentrating, anyway?

You can tell engineers don't REALLY mind being interrupted by just looking at
the constant shitpile of activity on HackerNews, Twitter, Google Plus, IRC, etc.
It's not about interruptions. It's just flat out whining. We don't like getting
out of our comfort zone and thinking about things we're not that good at
thinking about. Stop coming up with excuses and get better at it.

Interruptions force you to ship.

There's no disputing that interruptions and context switches are painful and
difficult, but knowing that they're coming can have a positive impact - if you
anticipate only having a couple of hours before you're interrupted, you will
work in more incremental chunks, which lend themselves better to testing,
documentation, abstraction etc. These are all good things.

For example - there are guests coming over for dinner shortly, so I'm just
going to wrap this up and post it. It's too long as is.

tl;dr

Sitting in a dark basement in silence is great for leveling up your World of
Warcraft character. It's no way to build good, usable software. There's no
substitute for good communication.

A Reliable, simple way to get a PDF out of Showoff

Perpetually agonized by actually using Keynote or Powerpoint to make slides, I
continue to use Showoff to make my slide
decks. Unfortunately, the codebase appears a bit neglected, and certain features
have stopped working very well over the course of re-installs. I have neither
the Ruby-fu nor the time nor the patience to figure out why PDF generation has
stopped working (I actually don't think that particular feature ever worked for
me at all), so I've had to resort to trickery.

I am posting this here because I keep forgetting how to do this and having to
blindly figure it out each time. Hopefully my own blog will be an obvious enough
place to look. This has only been tested on a Mac using Chrome, but it looks
like Safari will work too, with a bit of tweaking.

How Do I DevOps?

There is lots of talk about what DevOps is and means, even a Wikipedia
page, to which I may soon give some much
needed love. However, a friend recently asked if I knew anyone worth hiring for
a "devops" role, and I found myself asking clarifying questions about the sort
of person he had in mind. Seemed worth writing down.

The friend was looking for engineers. So what does it mean for an engineer to be
devops-y?

TL;DR

Understand the Whole Company as a System

Respect Other Functions Within The Organization Profoundly

Have a Strong Sense of Personal Accountability

Build your software like you give a shit about the people whose jobs and lives
are affected by it.

1. Understand the Whole Company as a System

Your company has inputs (money, labor, etc) and outputs (product, money, etc).
I've grown to loathe the phrase "above my pay grade" because it tends to betray
a complete lack of interest in the big picture. Hanging around my new colleague
Jim, aka Mr Manager, I've recently started to identify things as "tactical" vs
"strategic." Strategic is the big picture - where is the company going; what are
the company's goals; what will make or break our success. Tactical is the every
day - what features are left on the current project and which one should I work
on next; how much time should I spend on this bug, what with the massive
deadline looming; hell, should I even be looking at bugs? If you don't have a
good grip on how you and your project fit into the bigger picture of the
company, you are always tactical. Tactical can quickly become boring,
repetitive, and unrewarding. It's also a nice way to never grow as an
individual. In the DevOps picture, it means you probably don't make judgment
calls well with regards to what is and isn't important, distributing your time
poorly. Your colleagues probably notice; they probably don't like it.

This is a great segue to:

2. Respect Other Functions Within The Organization Profoundly

For our immediate purposes, we can focus on just the ops team, but it applies
well beyond. Understanding and respecting the priorities and needs of
non-technical teams and taking them seriously helps greatly reduce the number of
surprises on both sides. Also, if you're really living number 1 above, you
probably won't be surprised that your goals are very closely related.

But back to your relationship with the ops team (or, if you're living in devops
dream land, your colleagues, since you're all part of the combined devops
utopia, right?). What makes them tick? What wakes or keeps them up at night? What
makes their job harder? Easier? I like to make it personal: how have I made
their lives better or worse?

Let's look inwards for a moment: what if someone is asking these questions about
me? Well, I'm a software engineer. I grind code for a living. I get some
requirements (new product spec, a bug, something I think up in my free time and
don't tell anyone about, etc), figure out how to meet those requirements, write
some code, and push it to production.

What are the things that make me happy while performing these functions? Well,
there's a whole bunch of them, but they can all be summed up very easily: lack
of friction. A relatively low number of things I have to do beyond my core
activities in order to get to the end; a limited number of context switches. A
clean, consistent, reproducible dev environment. A responsive, intelligible
build system. A mostly-automated way of moving my code through various
environments.

What has ops done for me?

Well, shit, I'm actually mad spoiled. Flickr was a PHP site with a well oiled
deploy
machine
that we've all heard about - since you didn't need to restart anything to get
your code out (an under-appreciated side effect of the way PHP is traditionally
served), we'd literally just push a button and the new code got rsynced to
the boxes while also keeping a nice, visible record of the what, when, and why
(a form of this is now available to the masses via Etsy's
Deployinator). SimpleGeo and Urban
Airship use(d) Puppet and Chef respectively to great success, and there was an
ever-improving set of tools available to make it easier to start working
on a project and to test it as I went along. When I was done, it got reviewed,
merged, built and sent off to a package repo, then deployed to production using
automation. I spent most of my time actually debugging or writing code, not
shepherding it around environments or struggling to get it to run in the first
place. It's also easy to forget the little things that helped keep computers
out of my way - federated logins etc.

These are just the more salient examples - specific things ops has done to make
my life easier; it is by no means an exhaustive list of what I see as the core
strength of my prior ops teams.

What have I done for ops?

Let's look at what my teams at each of these orgs did that I think was helpful
to and appreciated by the ops teams. This is in no particular order, and I'm
going to forego the names of the organizations because there's a ton of overlap.

Painstakingly instrumented our services so that their state could be more
easily examined in the wild

Pumped as much data as we could into the monitoring tools kindly provided to us

Thoughtfully considered what metrics and properties were helpful in
determining the health of each particular system being worked on. Business
people might call this a KPI; Mathias Meyer called it a "Soul Metric" in his
Monitorama talk.

Carefully set up alerts that interpreted the above to try to minimize noise
and non-actionable alerts.

Learned at least enough about the configuration management tools to be able to
submit pull requests for desired changes in production without personal
involvement and hand holding from someone on the ops team.

Considered and tested how the software being written behaved itself before
an emergency - how is failover handled? how are configuration changes handled?

Automated or helped automate parts of the process that were difficult to
remember or tedious.

Worked on tools in our spare time that made any of the above easier.

Broadly, we tried to be sensitive to how the operators interacted with the thing
in production and how reasonable the experience was - during changes, during
outages and failures, etc. We focused on operability.

Why did we do all this?

3. Have a Strong Sense of Personal Accountability

Because it felt like the right thing to do. When people got woken up at three in the
morning because something I had deployed broke in a confusing,
difficult-to-debug way, it felt bad. I wanted it to be less confusing the
next time. If we're being honest with ourselves, it probably helped with the
motivation that I woke up too and was just as frustrated and annoyed.

Go back to #2 and think "Do people in the other organizations have the right
tools to perform their jobs?" The better the tools, the less friction there is,
the more quickly people can perform their reactive tasks (ops responding to
pages; marketing compiling a traffic report that the CEO suddenly needs for a
board meeting; support dealing with a massive DDoS or spam influx). The less time
people spend reacting, the better - reacting is by definition tactical, and
spending all your time in tactical mode, as we've covered, is not great. The
list in section 2 was focused on ops, but a lot of the same stuff, especially the tools
bit, applies to other teams as well.

It'll never be perfect, but often the smallest change makes the biggest
difference. Re-arranging a dashboard ever-so-slightly could be the difference
between someone getting RSI while trying to track down spammers until late at
night and them going home in time for dinner. A good DevOps engineer in my
mind is one that feels personally responsible and accountable for the parts of
his or her job that have an effect on colleagues' happiness and success.
Remember, everyone likes going home for dinner.

Conclusion

Coming back to what this all means for a software engineer: it's all about the
big picture. In an organization whose primary output is software, everybody
depends on how well that software is equipped to help them succeed in their
particular job. Understanding your effect on these needs and striving to meet
them - that's what DevOps means to me.

Further Reading

Developing Operability
(slides) A talk by Richard
Crowley with specific advice for smoothing the journey of code to production
for both devs and operators; more on the meaning of "DevOps" (warning: a wall of
text)

It's a train.. no, it's a computer.. can't it be both??

I am delighted to let you spread the word about an amazing innovation from Lian
Li, the acclaimed maker of computer cases. They have thrown caution to the wind
and finally introduced the thing we've all been waiting for - the Choo Choo
Train Computer
Case.

Yes. Yes. Let that sink in. It's a computer case shaped like an old
steam-powered locomotive. It has a 300 watt power supply in the front section,
and the cart can fit a Mini ITX motherboard, a slim optical drive, and a single
internal hard drive. One might point out that these are somewhat weak specs as
far as cases go, but hey, IT'S A FUCKING TRAIN.

But wait. There's more. No, seriously, there's more.

I saw that the case had 5-star reviews, so I clicked to see what proud owners had
to say about it.

This SKU, which ends in an S, does NOT move compared to the more expensive SKU
that ends in an L. It has no motor, it's just a case that looks like a train.
The more expensive model
[...] actually has a
motor and a transmission, and comes with extra rails, so it will roll back and
forth when the computer is turned on.

Yup. Lian Li's product page for this puppy is
epic.
Not only is there a more expensive version that moves (and comes with "Rail x6"
instead of "Rail x1"), there's a limited edition one that has an atomizer.
That's right. It makes steam!

Basically, I'm spent just thinking about this. The amount of space accommodated
by the case isn't ideal for the plans I have for an HTPC (I was on Newegg for a
reason, after all), and I definitely couldn't handle "Rail x6" and a computer
case scooting back and forth, but y'all know what to get me for my birthday now.
I'll make it work.

Unfortunately, it's out of stock on NewEgg and I can't find it for sale
anywhere, so I fear that the opportunity may have passed. Who knows when Lian Li
will elect to share their genius with us again? I'll probably end up having to
purchase one of these guys on eBay for thousands of dollars as a collector's item
years from now.

Some Love For Ishmael

Back in the days of fire fighting and database optimizing at Flickr, when I
could debate the merits of different MVCC options comfortably, I built a little
tool called Ishmael to help us make sense
of mk-query-digest data more easily (apparently, the project has been moved to
the "Percona Toolkit" and renamed
pt-query-digest).
Tim Denike made some improvements during his remaining time at Flickr after I
had left, and then Asher Feldman took the project with him to The Wikimedia
Foundation. Eventually, he sent in a large enough pull request that I simply did
not have the capacity to test it - I, after all, have not used MySQL in anger in
ages. So I did the natural thing and made Asher a collaborator on the repo.

This past week, during a moment of vanity, I noticed that there were quite a few
more stars on the repo than there had been. I wondered what might have caused
it, and shrugged. Then on Sunday the DevOps Weekly
email provided the answer: Asher had written a post about MariaDB on Wikimedia's
blog, in which he mentions their use of Ishmael in comparing performance between
old and new database versions. It is a good read for anyone interested in
database migrations and upgrades, especially "doing it live!"

Still I see repositories with READMEs like "Gathers and relays system metrics" (not to pick on anyone in particular, just an example) and ones that skip straight to installation instructions and licensing info. That's bad.

The README is a project's face to the world. It should tell the audience what the project does and what motivated its creation at a minimum. Not to belabor the already-well-made point too much further, I've created a tool for myself and others to help formulate a basic README when inspiration betrays us.

It's called readmeme - and yet again, Richard Crowley helped me name it. It is apt. I hope it starts a meme of informative READMEs.

The 3 Little Mistakes

There are many mistakes people make when programming for the web. Here are three that I've seen everywhere I've worked. I think they deserve extra attention because they are relatively easy to avoid, but very difficult to recover from.

Encodings

Even if you only ever have US-based users, enough folks have accents in their names, and there are plenty of other ways non-ASCII characters find their way into your systems. Enforce UTF-8 at all levels, and especially at storage time. Fixing an encoding issue is difficult, and usually involves rewriting all the data. It is unpleasant, no matter the underlying datastore.
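One way to enforce this at the storage boundary: validate that incoming bytes actually decode as UTF-8 and reject anything that doesn't, instead of letting mojibake into the database. A minimal sketch (the `ensure_utf8` helper is hypothetical, not from any particular framework):

```python
def ensure_utf8(value: bytes) -> str:
    """Reject (rather than silently mangle) bytes that aren't valid UTF-8
    before they reach storage."""
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"input is not valid UTF-8: {exc}") from exc

# Accented names pass through untouched...
name = ensure_utf8("José Müller".encode("utf-8"))

# ...but Latin-1 bytes, which would otherwise be stored as garbage,
# are rejected loudly at write time instead of discovered years later.
try:
    ensure_utf8("José".encode("latin-1"))
except ValueError:
    print("rejected non-UTF-8 input")
```

Rejecting at write time is the whole point: a loud failure during development is vastly cheaper than a table full of mixed encodings later.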

Limits and Pagination on API Requests

This is pretty straightforward, but I've seen it neglected in practice quite a few times. Whether it's a POST where the batch size isn't bounded or a GET that returns "all the _____," it inevitably becomes a nightmare that is difficult to fix due to clients not expecting to have to paginate. This is particularly exacerbated in APIs used by mobile apps - a total fix requires getting all the client apps to upgrade to a new version of the library AND for all their users to upgrade.
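The fix costs almost nothing if it's there from day one: clamp every requested page size to a server-side maximum and hand back a cursor for the next page. A minimal sketch (the limits and the `paginate` helper are illustrative, not from any real API):

```python
MAX_LIMIT = 100      # hard server-side cap, regardless of what clients ask for
DEFAULT_LIMIT = 25   # what clients get when they don't specify a limit

def paginate(items, limit=None, offset=0):
    """Return one bounded page plus the offset of the next page (or None)."""
    limit = min(limit or DEFAULT_LIMIT, MAX_LIMIT)
    page = items[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(items) else None
    return {"results": page, "next_offset": next_offset}

# A client asking for "all the things" still gets at most MAX_LIMIT back.
response = paginate(list(range(1000)), limit=10**6)
print(len(response["results"]))  # 100
```

Because the cap exists from the first release, every client is written to follow `next_offset`, and you never face the "all clients must upgrade" problem described above.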

Timezones

This issue is similar to the encoding one - a matter of consistency. Storing everything in UTC forces consistency on the data. Not doing so leaves an opening for rows in the same table to be written with different time zones, making querying and comparisons more complicated and possibly expensive. Almost without exception, you only want to think about timezones at display or query time; it's much easier to deal with DST when your data is normalized to the same, consistent, season-immune representation.
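The pattern can be sketched as two tiny boundary functions: normalize to UTC on the way into storage, convert to the user's zone only on the way out. This is an illustrative sketch (the helper names are made up), using the standard library's `zoneinfo`:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_storage(dt: datetime) -> datetime:
    """Normalize any timezone-aware datetime to UTC before it hits the database."""
    if dt.tzinfo is None:
        raise ValueError("refusing to store a naive datetime")
    return dt.astimezone(timezone.utc)

def for_display(stored: datetime, tz_name: str) -> datetime:
    """Convert back to a user's zone only at the edges (display/query time)."""
    return stored.astimezone(ZoneInfo(tz_name))

# A 9am Pacific event and a noon Eastern event are the same instant,
# and comparing them is trivial once both are stored as UTC.
pacific = datetime(2012, 6, 1, 9, 0, tzinfo=ZoneInfo("America/Los_Angeles"))
eastern = datetime(2012, 6, 1, 12, 0, tzinfo=ZoneInfo("America/New_York"))
assert to_storage(pacific) == to_storage(eastern)
```

Refusing naive datetimes at the storage boundary is the same idea as rejecting non-UTF-8 bytes: fail loudly at write time rather than discover mixed representations in your tables later.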

Cal's book has a lot of really good info on other details that merit attention when working on the Internets.

Just Use Your Words

It is becoming increasingly plausible that I have a partially torn meniscus in my left knee. The unusual, difficult-to-place pain, the occasional painful popping and "catching" in the joint, the frequent unprompted discomfort... I run into people on a regular basis playing soccer - any number of recent collisions could damage a ligament.

But the ligament is not the issue.

The issue is the reluctance to call it what it is - the orthopedic surgeon dictated "possible meniscus irregularity" into his recorder. The physical therapist opted for the cautious "possibility of some meniscus involvement." Having already been primed, I pressed a bit: "A torn meniscus, you mean?" "Yes, a partial tear."

Come on guys, I'm not going to die of a torn meniscus. The upper bound on recovery from a meniscus repair is 4 months, and I'm 26 - it would likely take less than that. Surely they don't expect me to break down crying right there in the office when confronted with the news...

The piece ends with a call to abandon euphemisms for a day. I highly recommend this exercise, having lived it long before this article was published. You'll piss some people off, but at least they (and you) will know where you stand.

The Lonely Wait

As I exited the Civic Center metro stop on this beautiful sunny morning, I glanced up Hyde street. There was no 19 bus to be seen up the street, which meant it was at least 5 minutes away. At best, it would make it to my destination at 8th and Bryant around the same time as I would on foot.

Rounding the corner onto 8th, just past the 19 stop, I came upon 20 to 30 people, mostly men, mostly in their mid 20s, lined up against a fence. Nobody was talking, everybody had headphones on, some were smoking. They were waiting for the Zynga shuttle to take them to their office 6 blocks away, about 2/3 of a mile down 8th Street. 15 minutes by foot, 5 by bike.

At their feet started a freshly painted bike lane, possibly the widest in the city.

Somewhere far away, in an empty bar completely unaware of the human tragedy unfolding at 8th and Market, the jukebox played "Don't Know What You Got Till It's Gone."

Windshield: Display Incoming Geo Data Using PolyMaps

While discussing what to do for a Free Friday project, Neil and I decided we wanted to build some sort of visualization related to the location data UA had been processing. I quickly thought of the dashboard that Jon Rohan wrote for SimpleGeo, which would plot the coordinates that API requests were targeted at as those requests came in. After finding Jon's code, I realized that the front end portion was going to be very easy to adopt, as well as make generic. Having obtained Jon's blessing, I tidied the code up a bit and open-sourced it. Of course, the backend code that supplies our geo data will remain closed source and proprietary, but there is an example data source included to help others get started.

I called the project Windshield because the points look like bugs that show up on the glass over the course of a long drive. The source is here. I have an example up here.

I could make this a wordy post about programming practices, javascript, and technology, but this was a really simple project. Besides, other people did most of the hard work:

Jon Rohan wrote the code I aped heavy-handedly to get a grip on the thing.

I made the function that supplies the points pluggable so that it was easy to test and extend, so HURRAY for higher order functions.

Probably the most surprising thing was Aaron submitting a pull request hours after I had open sourced the damn thing to make the demo work correctly in Firefox. Thanks, Aaron!

Things I know I still need to do:

Make it easier to manipulate the map once it's created (perhaps return something from the windshield function)

Explore the concept of a PolyMaps "layer"

Can I create a custom layer for my points and use the DOM element for that layer to more efficiently prune points over time? The present implementation of selecting all circle tags, then removing the parent of their parent of their parent brings the browser to its knees

Remove points over time in a FIFO arrangement? Would require quite a bit more javascript than I presently have appetite for, but who knows...

Highly relevant:

On "Infrastructure for Startups"

Conference talks on the subject of infrastructure are often lacking in actionable advice - especially for fledgling startups. I am shamefully guilty of this myself.

A notable exception is a recent talk by Paul Hammond, my old manager and good friend. His Velocity 2012 talk titled "Infrastructure for Startups" was a refreshing dose of pragmatism, drawn from the experience of building and growing TypeKit. Paul ran Flickr Engineering before that - he has street cred for weeks. Unfortunately, video of the talk does not appear to be available, though that may change soon according to Paul.

Though I am prone to lengthy rants about building things the "right" way and am often heard advocating more rigorous planning at the start, I can't agree more with most of what Paul says - a 2-3 person startup just doesn't have the time to be mucking around with anything but the product they're building. This being 2012, there is an army of service providers ready to share the burden - for much less than the opportunity cost of building everything yourself.

Don't forget to measure

I'd add one more thing to Paul's lists (here and here) - you need good graphs right away. I'm surprised Paul didn't mention this after "all performance problems have been on things we don't yet measure." Good metrics collection and display are critical to both business success and technical efficiency. The easier it is to put together dashboards that zero in on meaningful metrics and correlations, the more you'll do it, and the more quickly you'll identify inefficiencies and opportunities.

I've yet to hear a favorable review of the baked-in EC2 monitoring tool (CloudWatch), so I, as usual, recommend the slick, easy-to-use, gorgeous Librato Metrics for these purposes. As a bonus, the product comes with some basic alerting features (which, in the interest of honesty, I haven't tried yet), so it may help stall or obviate the need to set up Nagios or one of the related monsters. All the tools for getting data in have already been written.

Speaking of alerts, PagerDuty is another no-brainer for small teams starting to set up more fine-grained monitoring. Big surprise: Librato has PagerDuty integration.

I had some experience with the competing Cloudkick product and sadly don't have many kind words, although much has probably changed since our last interaction.

Start, New Game

I'm tired of WordPress and the associated bullshit (upgrades, vulnerabilities, etc), so I'm finally ripping the band-aid off and moving my blog to a Jekyll-generated static site hosted on S3. I will not be migrating the old content, as that would be an epic pain in the ass, the benefits of which do not seem substantial to me. The old site will live on, its database and directories set to read-only mode, to serve what little Google traffic still washes up there.