Engineering

Taking Zero-Downtime Load Balancing even Further

Joseph Lynch, Lawrence Matthews

May 15, 2017

Ever since we rolled out our zero-downtime HAProxy reload system
a few years ago, we have been disappointed that it required additional
investment to work well for our external load balancing on our edge. We did
build a prototype that used an intermediary qdisc so we could apply the
approach, but after evaluating the prototype, and finding that the upstream
kernel issue wasn't going to be
fixed, we decided to go another
way.

Our edge is different from our internal load balancing tier because we have
typically terminated TLS with another great proxy:
NGINX. NGINX is useful because it does support
hitless HTTP (and recently TCP) listeners through file descriptor passing, so
we naturally started thinking about how we could replace or combine NGINX with
HAProxy to solve the problem of global, dynamic, highly available external and
internal load balancing. This article sketches out the answer we came up with
as the next iteration of our load balancing infrastructure.

Load Balancing Gremlins

Yelp, like many internet companies, faces two distinct load balancing
challenges: getting requests to our edge (external load balancing) and routing
service calls within our Service Oriented Architecture (internal load
balancing). In late 2016 our systems for both of these problems were not
keeping up with our rapidly evolving application architecture.

Traditionally, our edge was fairly static and did not need the rapid
reconfiguration that
SmartStack
provides. Web servers came and went infrequently and our HAProxy configuration
only needed to reload when we bought new hardware or manually launched new
instances. A relatively static NGINX terminating TLS and forwarding to an
HAProxy load balancer was sufficient. This assumption stopped holding true,
however, as we moved fully onto AWS, used autoscaling more heavily, and shifted
our monolith into our PaaS. Our old, mostly
static, configuration had to become dynamic, which meant we needed SmartStack.
We also hit scalability issues because our original edge design had NGINX
connect back to HAProxy over a loopback IP, which eventually led to exhaustion
of ephemeral TCP ports.
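To see why the loopback hop runs out of ports, here is a back-of-the-envelope sketch (assuming the default Linux ephemeral range and a 60-second TIME_WAIT; the actual numbers on a given box depend on its sysctls):

```python
# With NGINX connecting to HAProxy over a single loopback IP and port,
# every field of the connection four-tuple is fixed except the client's
# ephemeral port, and each closed connection ties that port up for the
# length of TIME_WAIT.
low, high = 32768, 61000   # default net.ipv4.ip_local_port_range
time_wait = 60             # seconds a closed connection holds its port
ports = high - low + 1
sustainable_rate = ports // time_wait  # new connections per second
print(ports, sustainable_rate)         # 28233 470
```

A few hundred connections per second is easy to exceed at the edge, which is why the loopback design eventually hit a wall.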

Our internal load balancing solution was also showing signs of deterioration
under changing requirements. Hundreds of services appeared on our PaaS and we
started running into the edge cases of Marathon where containers constantly
appear and disappear for brief periods. For example, Marathon needs to move
tasks around when a container comes up briefly but then dies due to an
application bug, or when container autoscaling
scales the service up or down. This can cause issues for dynamic load balancing
because the routing data plane must quickly converge across your entire
infrastructure or else slow down application deployments, or accidentally route
traffic meant for one service to a different service’s container. Marathon’s
distributed control plane and inability to keep tasks in the same “place” make
it very difficult for highly available dynamic load balancing tiers to achieve
this.

Synapse and
Nerve could handle the churn, but as we
scaled we started seeing some performance issues on the internal load balancing
system with our previous reload approach.
In particular, we started observing performance issues with HAProxy reloads,
which now introduced 50-100ms of latency due to large HAProxy configuration
size and were triggered extremely frequently by changes in Marathon. Reloads
would also, very rarely, drop new connections because of various low
probability race conditions.
In addition, we ran into a kernel bug when we upgraded to Linux 4.4 where
localhost queuing disciplines arbitrarily drop TCP packets and introduce
200ms of latency. Linux 4.2 did fix the three-way handshake bug that our qdisc
was primarily defending against, but other races (e.g. accept before close)
remained and the 4.4 qdisc performance regression was a serious issue.

We wanted a solution that could solve our entire problem. In particular, we
wanted to be able to:

Add, remove, and modify load balancing configuration without any
possibility of downtime or added latency.

Propagate load balancing updates across a global infrastructure in a few
seconds in the typical case.

Our Solution

Our infrastructure team decided to solve this general problem by combining these
two great pieces of software, and by using some relatively new features of
HAProxy and NGINX.

HAProxy 1.5+ (Jun 2014) supports listening on Unix domain sockets, and NGINX
1.9.0+ (Apr 2015) supports both TCP and HTTP listeners (previously just HTTP).
While Linux’s limitations meant that HAProxy couldn’t support hitless reloads
using SO_REUSEPORT, HAProxy Unix domain socket listeners have always been
zero-downtime because the new listen sockets are atomically moved into place
with rename before the
old process calls close. We believe there is technically still a very low
probability race that the old HAProxy calls close with connections on its
accept() queue. In practice, we have not observed this race as the new HAProxy
binds the sockets first and only then signals the old HAProxy. Since the old
HAProxy stops receiving new connections the moment the new one binds, it
typically has more than enough time to accept() its entire
remaining queue before getting the shutdown signal.
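The rename trick is easy to demonstrate in miniature. The sketch below (paths are made up, and the two "processes" are just two sockets in one script) shows a new listener bound at a temporary path and atomically renamed over the old one; connections made before the swap stay on the old socket's accept queue, while connections made after reach the new socket:

```python
import os
import socket

PATH = "/tmp/demo-hap.sock"   # hypothetical service socket path
TMP = PATH + ".new"

for p in (PATH, TMP):
    try:
        os.unlink(p)
    except FileNotFoundError:
        pass

# The "old" HAProxy's listener, bound directly at the service path.
old = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
old.bind(PATH)
old.listen(128)

# A client that connects before the swap lands on old's accept queue.
early = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
early.connect(PATH)

# The "new" HAProxy binds at a temporary path, then atomically renames
# it into place. Clients resolve the path at connect() time, so from
# this instant every new connection reaches the new socket.
new = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
new.bind(TMP)
new.listen(128)
os.rename(TMP, PATH)

late = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
late.connect(PATH)

# The old process can still drain its queue before closing, so the
# early connection is never dropped.
drained, _ = old.accept()
handed_off, _ = new.accept()
```

At no point is the path unbound, which is exactly why there is no window in which a client can get a connection refused.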

With this understanding, we can create the design shown in Figure 1, where
NGINX terminates TCP (or TLS) and proxies back to an instance of HAProxy
listening on local Unix sockets for each service. For our load-balancers that
terminate TLS, we run NGINX with multiple workers. For our internal
load-balancers that don’t terminate TLS we explicitly choose to use only
stream sections in NGINX to avoid any risk of NGINX messing with our low
latency internal traffic (e.g. stripping headers, adding headers, buffering
requests to disk, etc.).

Results

To show this really works, we’ve created a test setup on a Linux 4.4 (also
reproduced on 3.13) Ubuntu Trusty VM running on AWS (4 cores, 7 GB of RAM). We
have:

NGINX as the listening proxy

HAProxy as the load balancing proxy

Local NGINX serving a canned response to port 80

All details of the setup can be found in this gist.
Our NGINX config is set up with two TCP listeners proxying back to Unix
sockets.
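A sketch of what that stream config could look like (the listen ports and socket paths here are illustrative stand-ins, not the values from the gist):

```nginx
# Two plain TCP listeners in the stream module. NGINX never parses the
# HTTP; it just terminates the TCP connection and forwards the bytes
# to HAProxy over a local Unix socket.
stream {
    server {
        listen 1234;
        proxy_pass unix:/var/run/haproxy-http.sock;
    }

    server {
        listen 1235;
        proxy_pass unix:/var/run/haproxy-tcp.sock;
    }
}
```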

And our HAProxy config listens on those Unix sockets, with one backend in
HTTP mode and one in TCP mode.
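A sketch of the corresponding HAProxy side (socket paths, timeouts, and the backend address are again illustrative):

```haproxy
defaults
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend http_in
    mode http
    bind unix@/var/run/haproxy-http.sock
    default_backend local_http

frontend tcp_in
    mode tcp
    bind unix@/var/run/haproxy-tcp.sock
    default_backend local_tcp

# Both backends point at the local NGINX serving the canned response.
backend local_http
    mode http
    server canned 127.0.0.1:80

backend local_tcp
    mode tcp
    server canned 127.0.0.1:80
```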

We run three interesting stress tests:
restarting HAProxy
with -sf, reloading
NGINX workers, and upgrading
NGINX masters. We run a control where nothing is restarting to get a feel of the
typical latency of the system, which you can see in Figure 2. In all of these
latency graphs the x-axis shows the progression of the benchmark
(left to right) and the y-axis shows the response time as measured by
ApacheBench. This is not a heatmap; the vast majority of data is in the 1-2ms
range, but the outliers are what we care about in this analysis.

We observe the control latency to be better than the qdisc approach because
qdiscs add a few milliseconds of latency under high concurrency workloads.

Under HAProxy reloads as seen in Figure 3, there are a few minor latency spikes
(< 5ms), which are most easily observed in the overlay Figure 4.


We can also reload the NGINX configuration and measure that as well.

Finally, we can check upgrading the NGINX binary.

If we look at NGINX reload/upgrade latency overlaid on the control, we observe
in Figure 7 a greater impact on latency when reloading NGINX. This added latency
is still extremely small (< 10ms), and in this design NGINX is reloaded so
rarely that in practice this is perfectly acceptable.

The results clearly show that by combining both pieces of software we can
reload our load balancing proxy (HAProxy) as much as we want, and SmartStack
will ensure going forward that everything is perfectly automated.

Design Tradeoffs and Alternatives

While designing our solution, we considered a number of alternative designs,
and made a set of tradeoffs based on our engineering organization’s preferences
and the technologies available at the time. We explored a number of options,
but in particular four main options:

Fix the Linux kernel

To be honest, we were hoping that the Linux kernel would fix the long standing
issues with gracefully switching listen ports during the 4.2 rewrite of the
kernel's TCP listen subsystem, and they did fix the SO_REUSEPORT three-way
handshake bug as far as we are aware. Unfortunately, to the best of our
knowledge there is still the accept-close race as described in this
netdev thread.

We could have invested engineering effort in fixing the kernel, but we run a
number of versions of the Linux kernel, and the time it would take to engineer
this solution and get it integrated simply didn’t make business sense.

Just Use HAProxy

The closest contender is a design that uses two HAProxy instances, where the
front HAProxy terminates TLS with multiple processes (as NGINX does in our
design), and the back HAProxy listens on Unix sockets and does the actual load
balancing. This is possible because HAProxy added pretty good TLS support in
versions 1.5/1.6. We decided against it, however, because of the connection
issues mentioned above and because our SREs have significant operational
experience with NGINX as a TLS termination layer.

At the time we made this decision, we would have had to accept connection
issues if we chose to use just HAProxy, but this is
no longer true!
Recently, (April 2017)
patches
have been submitted to HAProxy which should
allow perfectly graceful reloads by passing TCP sockets over a local Unix
socket
and reserving server slots that can be
dynamically updated,
preventing the need for restarting. These patches are hopefully going to land
in version 1.8 in a few months. Given these awesome changes, our next iteration
will likely change back to using just HAProxy and improving
Synapse and
our automation
on top of Synapse to fully take advantage of these new dynamic features.

Just Use NGINX

Another option would be to just use NGINX, and for simple use cases this is a
good option, which is why Synapse now
supports NGINX as a first class
load balancer. For complex load balancing applications such as ours, however,
the open source version of NGINX lacks a number of important load balancing
features.

For one, it is very difficult to configure NGINX correctly for transparent
reverse proxying as most HTTP proxy defaults are set up for replacing something
like Apache rather than HAProxy. For example NGINX by default manipulates a
number of headers and buffers requests and responses (potentially to disk!).
This concern can be solved relatively easily with the right incantations of
options, but it’s worth noting that in our experience HAProxy is designed first
and foremost to be a load balancer, NGINX is designed first and foremost to be
a web server, and these are actually different challenges.
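For instance, taming NGINX's HTTP proxy defaults takes directives like the following (a sketch of the kind of incantations meant here, not our exact configuration):

```nginx
location / {
    proxy_pass http://backend;
    proxy_http_version 1.1;

    # Don't buffer responses in memory or spill them to disk.
    proxy_buffering off;
    # Stream request bodies through instead of buffering them first.
    proxy_request_buffering off;

    # Pass the original Host header instead of rewriting it.
    proxy_set_header Host $host;
    # Don't force "Connection: close" on the upstream.
    proxy_set_header Connection "";
}
```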

Another major problem is that open source NGINX has no online control
interface, so you need to restart the proxy for it to pick up any configuration
changes - including downing servers in an emergency. In a PaaS environment, you
end up with constant NGINX worker reloads. Combine those constant reloads with
long lived TCP connections, and you can quickly waste gigabytes of memory on
every machine in your fleet due to lingering NGINX processes which must remain
running as long as active TCP sessions exist. You can
rate limit
restarts to save your boxes (as we do with
HAProxy),
but this then increases the time it takes SmartStack to respond to failed
machines. With HAProxy stats socket updates, SmartStack can down a server
globally in a few seconds, but with NGINX it may take minutes.
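To make the "down a server in seconds" point concrete, here is a sketch of the stats socket interaction. The `disable server` command is real stats-socket syntax, but the socket path and backend names are made up, and the listener below is a stand-in so the example runs without HAProxy (the real socket speaks the same newline-terminated text protocol):

```python
import os
import socket
import threading

SOCK = "/tmp/haproxy-stats.sock"  # stand-in for HAProxy's stats socket

try:
    os.unlink(SOCK)
except FileNotFoundError:
    pass

# Stand-in listener: record the command and acknowledge with a blank
# line, roughly as the real stats socket does for successful commands.
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(SOCK)
srv.listen(1)
received = []

def fake_stats_socket():
    conn, _ = srv.accept()
    received.append(conn.recv(1024))
    conn.sendall(b"\n")
    conn.close()

t = threading.Thread(target=fake_stats_socket)
t.start()

# Client side: take a server out of rotation immediately, no reload.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(SOCK)
cli.sendall(b"disable server my_backend/server1\n")
reply = cli.recv(1024)
cli.close()
t.join()
```

Because this is a single in-memory state change in a running process, it propagates as fast as the tooling can fan the command out, rather than waiting on a config render and reload cycle.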

NGINX also lacks crucial load balancing tools like statistics, monitoring
dashboards, healthchecks (to an extent) and support for complex routing ACLs.
These are not theoretical issues: we actively rely on the HAProxy stats socket
to quickly up and down servers in an emergency, and monitor
application replica health.
We also use a number of
HAProxy ACLs
and routing rules for transparent service instance failover (between AWS
availability zones and from one region to another) and for universal
service caching (an ACL routes to a universal caching instance before routing
to the actual instance).

NGINX+, the paid version, does solve some of these problems, but it is
expensive (> free) and would require re-tooling.

Use Both

At the time we wrote this solution both proxies had unique benefits, and while
it may have been simpler to use only one, using both gave us the ability to get
the best of both worlds.

This solution has worked perfectly for our external load balancing for many
months, but for our internal load balancing we had to make the main tradeoff
that our system is now significantly more complex, requiring a substantial
re-architecture of SmartStack,
of how we configure SmartStack,
and a new implementation of SmartStack's config_generator API that
supports NGINX.
The qdisc solution we have used for years only took about one week of
engineering time, but in contrast supporting multiple proxies has taken months
of engineering time. We chose to do this because it gave us flexibility in
other regards, especially in the face of an ever changing proxy landscape.

The other major tradeoff is that listening on Unix sockets is not an especially
common practice with HAProxy and it’s also not as widely supported in other
software. For example, curl only began supporting HTTP over Unix sockets in
version 7.40.0+ (Jan 2015). This makes debugging harder, and potentially
exposes us to uncommon bugs, for example
this load related bug in HAProxy
(we have not observed this bug in production).
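When a new enough curl isn't available, you can still speak HTTP to a Unix socket by hand, which is what `curl --unix-socket` does under the hood. A runnable sketch (the socket path is hypothetical, and the canned server below stands in for a real service):

```python
import os
import socket
import threading

PATH = "/tmp/service.sock"  # hypothetical service socket

try:
    os.unlink(PATH)
except FileNotFoundError:
    pass

# Tiny stand-in server returning a canned response, like the test setup.
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(PATH)
srv.listen(1)

def serve_once():
    conn, _ = srv.accept()
    conn.recv(1024)
    conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()

t = threading.Thread(target=serve_once)
t.start()

# Connect to the socket and speak HTTP/1.0 directly.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(PATH)
cli.sendall(b"GET / HTTP/1.0\r\nHost: localhost\r\n\r\n")
reply = cli.recv(4096)
cli.close()
t.join()
print(reply.decode().splitlines()[0])  # HTTP/1.0 200 OK
```

Workable, but clearly clumsier than pointing curl at an IP and port, which is the debugging tax this tradeoff imposes.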

Conclusion

Highly available, dynamic load balancing is a constantly changing
infrastructure area. In addition to stalwarts like HAProxy and NGINX, we are
seeing new players like envoy,
linkerd, and
vulcand come onto the scene. In
this iteration we decided to go with the simplest, most proven technology we
could, but with the infrastructure we’ve built around SmartStack it will be
very easy to continue iterating and making the best choices for our platform
going forwards.

Acknowledgements

This project had a number of key contributors we would like to acknowledge for
their design, implementation, and rollout ideas:

Josh Snyder for coming up with the idea of using Unix sockets in the first place