Engineering

Moving the Rest of the Monolith to PaaSTA

Kyle Anderson, Site Reliability Engineer

Jun 19, 2017

This past April (2017) we finally migrated our monolith to
PaaSTA
(our open source PaaS based on Apache Mesos). Yes, although Yelp subscribes
to the Service-Oriented Architecture philosophy and we constantly try to
reduce the scope of the monolith,
realistically it still looms over us as a large towering codebase that pays the
bills. But that doesn’t mean we can’t try to constantly improve it. This blog
post is about our latest improvement to the monolith: treating it just like any
other service at Yelp and running it on PaaSTA.

Background: What is Yelp’s Monolith Made Of?

Yelp’s monolith is composed of
perfectly proportioned parts
of Puppet, Apache, mod_wsgi, and Python + virtualenv. Before PaaSTA, it was
deployed directly to servers (a hybrid of on-premise datacenter and Amazon EC2)
using a bespoke rsync-based deployment system. For the purposes of this blog
post I’m going to call this “Classic” infrastructure. (“Legacy” has such a
negative connotation, I rarely use it. I like “Classic” because it comes with a
sense of respect for the past. Think “Classic Cars.”)

This in itself isn’t very exciting but it comes with some challenges that are
baked into the system:

The application is tightly coupled to the operating system it runs on (hard to upgrade)

The bespoke deployment system means it is different from all our other applications; improvements to it only apply to the monolith

Our Puppet configuration for the host is tightly coupled to the application; some changes require Ops help and are hard to coordinate (a developer can’t just try a new version of mod_wsgi on stage)

Servers for the monolith are different from servers that run services (in many ways…)

Yeah Well, How Do You Make the Monolith “Just Like Any Other Service”?

The direction is obvious: the monolith should be run on the PaaS like any other
service. The roadmap is unclear: how do we go from Puppet to PaaSTA without
breaking the website?

Step 1: Dark launch

Dark launching the monolith on new infrastructure

The same best-practices around launching large experimental features apply
equally to infrastructure and applications at Yelp. To “dark launch” our
monolith on PaaSTA, we first deploy it under a new, separate
SmartStack
endpoint. SmartStack is the service-discovery tool we use, created by Airbnb,
that allows us to decouple the deployment of a service from the discovery
of that service. By launching under a new endpoint (haproxy frontend) in
SmartStack, our normal traffic to the monolith won’t be able to discover it,
but we can still test it through a special HAProxy access control list (ACL)
that will send you to the PaaSTA deployment if you have a special cookie.
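
To make this concrete, here is a hedged sketch of what such an opt-in rule
can look like in haproxy terms (all names here are hypothetical, and in
reality SmartStack generates our haproxy configuration for us):

    # Hypothetical haproxy frontend for the monolith endpoint
    frontend monolith
        bind *:20001
        # Route to PaaSTA only if the request carries the opt-in cookie
        # (the cookie name is made up for illustration)
        acl paasta_opt_in req.cook(paasta_dogfood) -m found
        use_backend monolith_paasta if paasta_opt_in
        default_backend monolith_classic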

Doing it this way gives us some good benefits:

We exercise the PaaSTA components every time we deploy, allowing us to find breakages before the new stack goes live

In these very early days there is no risk of normal users landing on the PaaSTA-powered webservers

Core team members can opt in with the cookie to eat our own dogfood :)

Once the glaring issues were found and fixed, we could increase the scope.

Step 2: Canary

Sending canary traffic to the new infrastructure

We already used
canary deployments with our
classic rsync-based system. For this step we have two canaries: one on PaaSTA
and one on Classic. There is a critical difference between this canary step and
the “dark launch” step: the canary gets live traffic!

We need the live traffic to hit this thing so we can find more bugs and
evaluate its performance. At this stage we make critical decisions about
container sizes, hardware / instance classes, etc.
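
As a rough illustration of what those decisions turn into: a PaaSTA service
declares its canary alongside its main instance in a few lines of soa-configs
YAML. The file name, instance names, and numbers below are hypothetical:

    # marathon-norcal-prod.yaml (hypothetical cluster name)
    main:
      cpus: 4
      mem: 8192
      instances: 200
    canary:
      # a couple of tasks registered under the live endpoint
      cpus: 4
      mem: 8192
      instances: 2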

This is also a good time to start fixing things like monitoring dashboards,
alerting tools, orchestration scripts, etc. All of these little odds and ends
will need to handle the hybrid mode. For example, this is where we want to
make sure the tooling we have for rolling back code is solid and fast on both
platforms. It is “ok” for some of these to be broken during the canary, but
these small breakages are blockers for the next step.

Step 3: Migrate! (Rampup)

Sending 50% traffic to the new infrastructure

Once the canary has proven itself, it’s time to crank up the traffic onto more
and more servers. You might consider a Blue/Green deployment from here, but for
a change this large and fundamental we decided not to do this for a business
reason: it would cost too much. Remember from the introduction that we run a
hybrid infrastructure: servers in our own datacenter as well as on AWS. We
can’t just rack
twice as many servers and flip everything. No, for a change this large we took
it slow and re-imaged our physical and virtual servers over the course of a
month.
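
Mechanically, the split is simple: the share of live traffic each platform
receives is proportional to the servers registered under the endpoint. In
haproxy terms, the 50% stage is equivalent to something like this
(illustrative only; the real config is generated by SmartStack):

    backend monolith
        # Equal weights: roughly half the requests go to each platform
        server classic-01 10.0.0.10:31337 weight 50
        server paasta-01  10.0.1.10:31337 weight 50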

A concrete example of an issue we found at this stage was the classic “running
out of ephemeral ports”
problem.
We knew we would encounter issues like this as we exercised the new stack.
Luckily we have the classic infrastructure still in place to hold us over while
we fix these types of bugs.
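
The usual remediation for that particular problem is a pair of kernel knobs;
here is a sketch of the generic fix (not necessarily the exact change we
shipped):

    # Widen the range of ports available for outbound connections
    sysctl -w net.ipv4.ip_local_port_range="10240 65535"
    # Let the kernel reuse sockets in TIME_WAIT for new outbound connections
    sysctl -w net.ipv4.tcp_tw_reuse=1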

Step 4: Cleanup

All PaaSTA, all the time

Although the cleanup step is not that interesting, it is kinda fun. I hear that
in some shops engineering time is not always allocated to this step, but for
operations and infrastructure teams that would be crazy; you have to clean up
or you will drown.

For Yelp we were able to clean up a custom AMI baking pipeline, tons of Puppet
code, and of course, the classic rsync-based deployment mechanism.

Step 5: … Profit?

Once on a new platform, some things that were very difficult become easy!
Here are some examples:

Using PyPy instead of CPython. With a
Docker-based deployment system, this is “just” a change to the Dockerfile
(after blacklisting some packages that have PyPy-incompatible C-extensions;
see the sketch after this list). Some teams at Yelp can now easily use this
alternative interpreter and get massive speedups.

Upgrading the base Linux distro with a code push. Again, a
container-based approach gives good isolation between the host OS and the
application’s base image. This is no longer a large multi-team effort spanning
multiple months.

Taking advantage of the built-in goodies of a PaaS, like
automatic monitoring, error reporting,
autoscaling,
etc.
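
For the PyPy example above, the diff really can be tiny. A hypothetical
sketch, with the image tag, file names, and module name made up:

    # Dockerfile: swap the interpreter by changing the base image
    FROM pypy:2            # was, say, a CPython 2.7 base image
    COPY requirements.txt /app/
    # requirements.txt pre-filtered to exclude PyPy-incompatible C-extensions
    RUN pip install -r /app/requirements.txt
    COPY . /app/
    CMD ["pypy", "-m", "myservice.serve"]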

Perhaps an underrated gain from this migration is the massive reduction in
cognitive load now that there is one deployment system instead of two. Yelp
developers and operations engineers have a unified experience when deploying
services, even if the monolith is a very big service.

An unanticipated bonus side-effect for Yelp is that our deploys are faster! The
speed of the new system is really a function of how quickly we can launch new
Docker containers and how much spare capacity there is on the cluster, and we
can tune these knobs to hit our desired speed/cost balance.

And of course, literally profit. Running on PaaSTA means that
we can declare
how many resources a service actually needs (cpu/ram/instance count) based on
real data, and Mesos can pack the cluster as tightly as it can. This means
that spare resources on a machine no longer go to waste; new, smaller tasks
can be scheduled into the gaps. On top of that, we can autoscale the entire
cluster to make our
compute spend match our actual compute demand, on an hour-by-hour basis!
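
In PaaSTA that declaration is, again, just a few lines of YAML per instance
(hypothetical numbers):

    # marathon-norcal-prod.yaml (hypothetical)
    main:
      cpus: 2        # sized from real utilization data
      mem: 4096      # MB
      instances: 150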

And then there is our true “secret” weapon for saving money by running on
PaaSTA: Using
Amazon Spot Fleet.
The nitty-gritty details of how we do this sanely without sacrificing
availability of the website are reserved for another blog post.

Current State of The Art

The monolith and almost all other services at Yelp run on PaaSTA. We don’t run
our stateful things (Kafka, Cassandra, Memcached) on it, yet. While the
migration was rough and slow, the payoff makes it worth it. There is still
plenty of work to do! There are still many use-cases for running code at Yelp
that PaaSTA can’t handle, like large analytic (EMR) jobs, realtime streaming
workloads (Apache Flink), and even just random one-off tasks (xargs!). Now that
the biggest use-case (web serving) is migrated, I look forward to extending
PaaSTA to do even more new and exciting things!