Operations Infrastructure Month in Review #4

What’s this about?

The Remind OpsEng team has “Open Sourced” our monthly
status reports. This post briefly describes some of the bigger tasks and
projects we have worked on over the past month.

If you want us to elaborate on a specific topic, let us know!

Tuned EBS

We noticed that the Docker volumes on our ECS container instances were
showing unusually high disk latency, sometimes spiking above 30 seconds. We’ve
taken a few steps to ensure that disk operations on the Docker volume are faster
and more predictable.

Dnsmasq Upgrade

Autospotting reroll

Due to concerns about some poor performance we experienced, we rolled back
Autospotting in order to eliminate variables (lots of things change in our
system daily). To make sure there were no lingering questions about it going
forward, we built dashboards to monitor these issues and then slowly rolled it
back out.

We didn’t experience the issues again – rather the opposite. Our standard
operational unit is c4.2xlarge, and since Autospotting treats the previous
generation (c3.2xlarge) as a compatible instance type, it scaled them into
our autoscaling groups. When this happened we noticed better performance on
some metrics.

Templates can now be uploaded directly to CloudFormation (no bucket needed).
This is useful for testing (see above), but it also means that your
CloudFormation templates must be no larger than 51,200 bytes.
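
Because directly uploaded templates are subject to that size limit, a quick pre-flight check can catch an oversized template before a deploy fails. The following is a minimal sketch, assuming the rendered template is available as a string; the function name is hypothetical and is not part of stacker.

```python
# Maximum size, in bytes, of a template body passed directly to CloudFormation
# (the 51,200-byte limit quoted above). A body of exactly this size is treated
# as acceptable here, on the assumption that the limit is an inclusive maximum.
MAX_TEMPLATE_BODY_BYTES = 51200


def fits_direct_upload(template_body: str) -> bool:
    """Return True if the UTF-8 encoded template is within the direct-upload limit.

    The limit is measured in bytes, not characters, so the template is encoded
    before it is measured.
    """
    return len(template_body.encode("utf-8")) <= MAX_TEMPLATE_BODY_BYTES
```

Templates that fail this check would still need to go through an S3 bucket.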

Stack-specific tags

Protected mode for stacks: stacker will switch to interactive mode for changes
to these stacks

AWS Network Load Balancer testing

AWS released a new Network Load Balancer
(NLB), similar to Application Load Balancer but working at layer 4 instead of
layer 7. This will seemingly replace classic load balancers in TCP mode, and
offer better performance and scalability.

We currently use a classic ELB to load balance Postgres queries to a pool of
PgBouncer hosts. Proxying it with a classic ELB did noticeably increase the
latency for these queries, so the introduction of a new, higher performance TCP
load balancer was intriguing, and we wanted to see if we could get better
latency on queries with it.

Unfortunately, we found that in its current version it’s not possible to set
security groups or to attach subnet mappings to an NLB. Instead, the NLB you
set up gets assigned an internal DNS name automatically per subnet.

This makes it difficult to set up ingress rules that limit access to and from
the NLB, since ingress rules require either a source security group or an IP
address to set up from CloudFormation.

While there are workarounds to this (allowing access from an internal subnet,
or creating a security rule outside CloudFormation by resolving the address),
we’ve decided to wait for the ability to set security groups to be released
before testing the NLB.
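
For illustration, the second workaround (resolving the NLB's address and creating a rule outside CloudFormation) could look roughly like the sketch below. It is not something we run: the hostname and port in the usage note are hypothetical, and the rule dictionaries simply mirror the `IpProtocol`, `FromPort`, `ToPort` and `CidrIp` fields of an EC2 security group ingress rule. Applying the rules to a security group is left out.

```python
import socket
from typing import Callable, Dict, List, Union

Rule = Dict[str, Union[str, int]]


def resolve_ipv4(hostname: str) -> List[str]:
    """Resolve a hostname to its sorted, de-duplicated IPv4 addresses."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); for IPv4 the
    # sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})


def ingress_rules_for(
    hostname: str,
    port: int,
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> List[Rule]:
    """Build one single-address (/32) TCP ingress rule per resolved address."""
    return [
        {"IpProtocol": "tcp", "FromPort": port, "ToPort": port, "CidrIp": f"{ip}/32"}
        for ip in resolver(hostname)
    ]
```

Calling `ingress_rules_for("my-nlb.example.internal", 6432)` (both values made up) would yield one rule per address the name resolves to. The resolver is injectable so the rule-building logic can be exercised without DNS access.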

Improvements to our internal workflow

We keep an internal repository of Stacker blueprints where all of Remind’s
infrastructure is declared as code, and we encourage every team to propose
changes and additions via Pull Requests to this repository.

In order to improve our workflow, we’ve made use of GitHub’s
.github/CODEOWNERS file so the OpsEng team gets automatically added as a
reviewer to any submitted PR.
We’ve also set up a .github/PULL_REQUEST_TEMPLATE
file with helper boilerplate (nature of the change, security questions,
estimated associated costs, etc.) in order to make the process smoother and get
the PRs rolling as fast as possible.