How Yelp (mostly) shut down its own data centers and moved to AWS

Back in 2013, Yelp was a 9-year old company built on a set of internal systems. It was coming to the realization that running its own data centers might not be the most efficient way to run a business that was continuing to scale rapidly. At the same time, the company understood that the tech world had changed dramatically from 2004 when it launched and it needed to transform the underlying technology to a more modern approach.

That’s a lot to take on in one bite, but it wasn’t something that happened willy-nilly or overnight says Jason Fennell, SVP of engineering at Yelp . The vast majority of the company’s data was being processed in a massive Python repository that was getting bigger all the time. The conversation about shifting to a microservices architecture began in 2012.

The company was also running the massive Yelp application inside its own datacenters, and as it grew it was increasingly becoming limited by long lead times required to procure and get new hardware online. It saw this was an unsustainable situation over the long-term and began a process of transforming from running a huge monolithic application on-premises to one built on microservices running in the cloud. It was a quite a journey.

The data center conundrum

Fennell described the classic scenario of a company that could benefit from a shift to the cloud. Yelp had a small operations team dedicated to setting up new machines. When engineering anticipated a new resource requirement, they had to give the operations team sufficient lead time to order new servers and get them up and running, certainly not the most efficient way to deal with a resource problem, and one that would have been easily solved by the cloud.

“We kept running into a bottleneck, I was running a chunk of the search team [at the time] and I had to project capacity out to 6-9 months. Then it would take a few months to order machines and another few months to set them up,” Fennell explained. He emphasized that the team charged with getting these machines going was working hard, but there were too few people and too many demands and something had to give.

“We were on this cusp. We could have scaled up that team dramatically and gotten [better] at building data centers and buying servers and doing that really fast, but we were hearing a lot of AWS and the advantages there,” Fennell explained.

To the cloud!

They looked at the cloud market landscape in 2013 and AWS was the clear leader technologically. That meant moving some part of their operations to EC2. Unfortunately, that exposed a new problem: how to manage this new infrastructure in the cloud. This was before the notion of cloud-native computing even existed. There was no Kubernetes. Sure, Google was operating in a cloud-native fashion in-house, but it was not really an option for most companies without a huge team of engineers.

Yelp needed to explore new ways of managing operations in a hybrid cloud environment where some of the applications and data lived in the cloud and some lived in their data center. It was not an easy problem to solve in 2013 and Yelp had to be creative to make it work.

That meant remaining with one foot in the public cloud and the other in a private data center. One tool that helped ease the transition was AWS Direct Connect, which was released the prior year and enabled Yelp to directly connect from their data center to the cloud.

Laying the groundwork

About this time, as they were figuring out how AWS works, another revolutionary technological change was occurring when Docker emerged and began mainstreaming the notion of containerization. “That’s another thing that’s been revolutionary. We could suddenly decouple the context of the running program from the machine it’s running on. Docker gives you this container, and is much lighter weight than virtualization and running full operating systems on a machine,” Fennell explained.

Another thing that was happening was the emergence of the open source data center operating system called Mesos, which offered a way to treat the data center as a single pool of resources. They could apply this notion to wherever the data and applications lived. Mesos also offered a container orchestration tool called Marathon in the days before Kubernetes emerged as a popular way of dealing with this same issue.

“We liked Mesos as a resource allocation framework. It abstracted away the fleet of machines. Mesos abstracts many machines and controls programs across them. Marathon holds guarantees about what containers are running where. We could stitch it all together into this clear opinionated interface,” he said.

Pulling it all together

While all this was happening, Yelp began exploring how to move to the cloud and use a Platform as a Service approach to the software layer. The problem was at the time they started, there wasn’t really any viable way to do this. In the buy versus build decision making that goes on in large transformations like this one, they felt they had little choice but to build that platform layer themselves.

In late 2013 they began to pull together the idea of building this platform on top of Mesos and Docker, giving it the name PaaSTA, an internal joke that stood for Platform as a Service, Totally Awesome. It became simply known as Pasta.

Photo: David Silverman/Getty Images

The project had the ambitious goal of making their infrastructure work as a single fabric, in a cloud-native fashion before most anyone outside of Google was using that term. Pasta developed slowly with the first developer piece coming online in August 2014 and the first production service later that year in December. The company actually open sourced the technology the following year.

“Pasta gave us the interface between the applications and development teams. Operations had to make sure Pasta is up and running, while Development was responsible for implementing containers that implemented the interface,” Fennell said.

Moving to deeper into the public cloud

While Yelp was busy building these internal systems, AWS wasn’t sitting still. It was also improving its offerings with new instance types, new functionality and better APIs and tooling. Fennell reports this helped immensely as Yelp began a more complete move to the cloud.

He says there were a couple of tipping points as they moved more and more of the application to AWS — including eventually, the master database. This all happened in more recent years as they understood better how to use Pasta to control the processes wherever they lived. What’s more, he said that adoption of other AWS services was now possible due to tighter integration between the in-house data centers and AWS.

Photo: erhui1979/Getty Images

The first tipping point came around 2016 as all new services were configured for the cloud. He said they began to get much better at managing applications and infrastructure in AWS and their thinking shifted from how to migrate to AWS to how to operate and manage it.

Perhaps the biggest step in this years-long transformation came last summer when Yelp moved its master database from its own data center to AWS. “This was the last thing we needed to move over. Otherwise it’s clean up. As of 2018, we are serving zero production traffic through physical data centers,” he said. While they still have two data centers, they are getting to the point, they have the minimum hardware required to run the network backbone.

Fennell said they went from two weeks to a month to get a service up and running before this was all in place to just a couple of minutes. He says any loss of control by moving to the cloud has been easily offset by the convenience of using cloud infrastructure. “We get to focus on the things where we add value,” he said — and that’s the goal of every company.