
Develop, Deploy, and Operate Services at Reddit Scale (OSCON 2018)

The last few years have been a period of tremendous growth for Reddit. Process, tooling, and culture have all had to adapt to an organization that has tripled in size and ambition. Greg Taylor discusses Reddit's evolution and explains how one of the world’s busiest sites develops, deploys, and operates services at significant scale.

I’m Greg Taylor, and I’m an engineering manager in our Infrastructure department. Today I’ll be sharing the story of how we’ve scaled our infrastructure, our process, and our culture as we’ve grown one of the most active sites on the internet.

What is Reddit? But first: what is Reddit? Depending on who you ask, you’ll get all kinds of answers: “The Front Page of the Internet,” in a nod to its origins as a content aggregator; where the cat memes live; or a place for lively discourse that is always civil and serious. Today we tend to define Reddit as a social network whose focus lies on user-operated communities called subreddits. These subreddits cover topics such as news, politics, sports, gaming, technology, music, and entertainment. To make this more concrete, let’s run through some of my current favorites.

r/CrappyDesign Welcome to CrappyDesign, self-described as “where the comic sans and lens flare flow unfiltered.” As you attempt to read this slide, you’ll probably get an idea of what this subreddit is about. Cupholders that don’t hold cups, interesting error messages, cringeworthy images: these are all the province of CrappyDesign.

r/theocho Next up is The Ocho, dedicated to spreading knowledge of seldom seen and obscure sports. You are seeing the crowd-pleasing “Trampoline Roman candle battle”. And while not all of the sports covered by this sub are as explosive, they are every bit as unusual. As a sidenote, please make good decisions, friends!

r/news But Reddit offers far more than just fun and memes. For many, it’s the place to go to discuss happenings around the world. In most cases, you’ll be able to find discussion of current events moments after they happen. We’re showcasing /r/news on this slide, but there are many more news-oriented communities. You can use subreddits like these to stay in touch with what’s going on in your city, your state, your country, or the wider world.

r/daddit As my shameless Dad jokes probably gave away, I’m a father. A relatively new one at that. If I’m telling the truth, I have no idea what I’m doing. And unfortunately, I’ve been unable to find the user manual for my little guy thus far. But r/daddit is the next best thing. You’ll find other dads in here asking questions, sharing tips, telling stories, celebrating milestones, and comforting one another during the more difficult times. Everyone gets something different from a community like this, but I will say that daddit has comforted, informed, and assisted me in my ongoing pursuit of one of those “World’s Best Dad” coffee mugs. Before we move on, I’ll also take a moment to plug a few other similar communities: r/mommit, r/beyondthebump, and r/parenting.

r/IAMA Reddit is also great for Q&A-style interactions. In r/IAMA, noteworthy folks stop by to conduct AMAs, or “Ask Me Anything” sessions. The participants have included politicians, athletes, entertainers, musicians, successful business people, philanthropists, and even the second person to walk on the moon. You won’t find a place on the internet where you have more direct access to interesting individuals like these. Now that we’ve covered a few subreddits, let’s get a sense of the tremendous scale that Reddit operates at.

Reddit by the numbers We’re in the top 10 most-trafficked sites in the world. Our monthly active users number over 330 million. To help put that into perspective, the population of the US is about 326 million. We’ve got over 138 thousand active communities (or subreddits). These communities produce over 12 million posts per month, and those posts generate about 2 billion votes per month. Now, given the numbers you see here, I want you all to take a moment to guess the size of the engineering organization that operates and develops something like this. We’ll show you the actual numbers on the next slide.

Rapid Recent Growth If you guessed about 185 engineers, you’d be correct. An interesting fact about Reddit is that we’ve been a tiny engineering org for most of our thirteen years, at least relative to our traffic. You can see the general trend on the right, with engineering in dark orange. We started 2016 with about 25 engineers. This was a record high for us at the time; we thought we were hot stuff. Today we clock in at over 185 engineers, which is more than a 7x increase over a period of less than three years. And while this growth has been exciting to be a part of, it hasn’t been without its challenges. I’m here today to share the story of the growth and evolution of our team, our culture, our process, and our stack.

So hop in your choice of time machine and fasten your seatbelts. We’re going back to 2016.

Reddit Engineering in 2016 And here we are, Reddit Engineering as it was in early 2016. We numbered about 25 engineers, with about three teams writing code in some form. Pictured in the center, you’ll see our lovable monolith surrounded by its maintainers.

2016 - The Infrastructure Team One of the few teams to rise from the primordial soup was the Infrastructure team. There were about five of us on the team. At the time, we handled a mixture of operations and some backend development on the monolith. Since our infrastructure was relatively static and we weren’t spawning new services, the Infrastructure team handled all provisioning and configuration of systems. We also handled most non-trivial debugging, since we had enough access to poke into the dustiest and most forgotten corners of the codebase. This arrangement served us well for over a decade.

2016 - The Stack Here’s what our stack looked like at the time. There’s nothing too crazy or noteworthy here, which was a positive for us. This stack helped us run one of the most active sites in the world with a very small infrastructure team.

Mid 2016 - Rapid Growth As we got to mid 2016, we started growing. Rapidly. Reddit’s ambitions had expanded, necessitating the growth of the engineering org. Pictured here is one of our new employee onboarding sessions in 2016. Well, not really. But this is what it felt like.

Diminishing returns As our ranks swelled, we began having difficulties scaling our development of the monolith. While there are plenty of examples of larger teams working together productively on a monolith, our particular monolith was fragile and full of sharp edges that had accumulated over many years. It was difficult to make changes without unexpected breakages. Sort of like the development equivalent of the butterfly effect, but with more things exploding. As a result, our productivity was not scaling as well as we’d have liked as the org grew.

Determining the path forward So we had a spirited gathering to decide a path forward. Our goal was to set a course that would allow many engineering teams to efficiently develop and operate alongside one another. We considered a number of options, running the gamut from rehabilitating the monolith to a complete rewrite. In the meantime, we had a roadmap full of exciting things that needed to be built. Regardless of our chosen approach, we couldn’t afford to halt all development. After much discussion and gnashing of teeth, we decided to...

Service-oriented Architecture Start picking the monolith apart into a service-oriented architecture, or SOA. Previously, as a smaller org, we found the additional complexity of an SOA unattractive. But as a larger (and still rapidly growing) org with a sprawling and fragile monolith, we saw this as the best path forward. We thought that this would give us a better separation of concerns, and we were ready and equipped to deal with the tradeoffs. But I will say: if you have a monolith that your team is productively developing, go home and give it a big hug. Tell it you love it, warts and all.

r/IAMA With the decision made, our first service came to be in early-to-mid 2016. < Hit next > As we brought it online alongside the monolith, trumpets sounded, fireworks took flight, and a choir of angels sang. It was a beautiful moment.

Growing pains: Automated testing But this exercise drove home the point that we needed to improve our story for automated testing before we went much further. This was something we had always wanted to improve, but it never made its way up “the list.” There were three or four different attempts to more deeply incorporate automated testing over the company’s history, but it never took root. With many new services on the horizon, we knew that had to change. So we researched options and decided to incorporate Drone CI, which we found to be easy to use and powerful enough. We started doing more linting and testing, and endeavored not to break master branches. Over time, automated testing became an expectation rather than a nicety.
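To make that CI expectation concrete, here is a minimal Drone pipeline of the sort a service might use. This is an illustrative sketch in modern Drone syntax, not one of our actual pipelines; the images, paths, and step names are assumptions.

```yaml
# .drone.yml -- illustrative sketch, not an actual Reddit pipeline
kind: pipeline
type: docker
name: default

steps:
  - name: lint
    image: python:3.9
    commands:
      - pip install flake8
      - flake8 .

  - name: test
    image: python:3.9
    commands:
      - pip install -r requirements.txt pytest
      - pytest tests/
```

With something like this in place, every push gets linted and tested before it can land on master.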

Growing pains: Something to build on The other thing that quickly became apparent was the need for a foundation to build services on. Rather than copy-pasta-ing new services from one another, we’d develop a Reddit service framework. We named it “baseplate”, and we still use it to this day. By building our services on baseplate, we get a measure of consistency between services: they are configured the same way, they expose the same or similar ports, they have the same async event loop, and they all handle RPC clients and servers the same way. We’ll revisit why this is important later on. We’ve also baked in a bunch of our patterns and best practices so that other teams don’t have to reinvent the wheel. For example, you get logging, instrumentation, tracing, secrets fetching, and a bunch of other things out of the box. The other interesting bit is that we develop baseplate in the open on GitHub. While this is a very opinionated building block, we make it available with the hope that it helps or inspires someone out there.
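To illustrate what a shared service framework buys you, here is a toy sketch of the pattern. To be clear, this is not baseplate's actual API; every name below is invented for illustration.

```python
# Hypothetical sketch of a "shared service framework" pattern.
# This is NOT baseplate's real API; all names here are invented.

import logging
import time
from contextlib import contextmanager


class ServiceFramework:
    """Every service gets the same logging and instrumentation for free."""

    def __init__(self, name, config):
        self.name = name
        self.config = config
        self.logger = logging.getLogger(name)
        self.metrics = {}  # stand-in for a real metrics client

    @contextmanager
    def instrumented(self, operation):
        """Time a block of work and record its latency under `operation`."""
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed = time.monotonic() - start
            self.metrics.setdefault(operation, []).append(elapsed)
            self.logger.debug("%s took %.4fs", operation, elapsed)


# A service team writes only its business logic; the framework handles the rest.
svc = ServiceFramework("example_service", config={"port": 9090})
with svc.instrumented("handle_request"):
    result = 2 + 2  # placeholder for real request handling

print(result)  # 4
print("handle_request" in svc.metrics)  # True
```

The point is the consistency: because every service times and logs work the same way, shared tooling can make assumptions about all of them.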

Hey! That was pretty cool. Let’s do it again! Back to mid 2016. Fresh off of our success with our first service, it was time to stand up a second using baseplate, our new service framework. < Hit Next > We didn’t bother with the fireworks or choir of angels this time, but this went pretty smoothly as well.

Growing pains: Artisanal Infrastructure However, the realization was setting in that our artisanal infrastructure was going to be a liability in a service-oriented world. By artisanal, I mean going into the AWS Console and point-and-clicking your way through things. This seems so obvious to us now, but keep in mind that Reddit predates CloudFormation, Terraform, and a host of other similar technologies. We resolved to shift toward an infrastructure-as-code mentality. That means that instead of clicking around in the AWS Web Console, we’d define and manage our infrastructure in a declarative fashion. It also meant we could use our normal code review flows and policies. The idea of being able to define reusable modules was also very attractive. After evaluating the options, we went with Terraform. I’ll be candid with you: this was a tough change for us. It was, and still is, difficult to pull existing infrastructure into Terraform. Is anyone here fighting with this sort of thing right now? Added to that, we didn’t go into this with a set of commonly accepted best practices in hand. Discussions with other users made it evident that everyone was doing things slightly differently at the time. So we started chipping away as best we could. We made tons of mistakes, battled with HCL at times, and even accidentally exploded a few things. But we got better over time, and today we consider Terraform a key piece of our stack.
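To make the declarative idea concrete, a minimal Terraform sketch might look like the following. The resource name, bucket name, and tags are invented for illustration; this is not Reddit's actual configuration.

```hcl
# Declare an S3 bucket in code instead of clicking through the AWS console.
# Changes to this file go through normal code review before being applied.
resource "aws_s3_bucket" "service_logs" {
  bucket = "example-service-logs"

  tags = {
    Team    = "infrastructure"
    Service = "example-service"
  }
}
```

Running `terraform plan` then shows exactly what would change before anything is actually touched, which is a big part of the appeal over console clicking.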

Once more, with feeling! Which brings us to late 2016. We’ve got a decent testing story, a service framework, and we’re starting to manage infrastructure with code. It was time to start pumping these new services out as more teams and engineers flowed into the company. < Hit Next > But as we deployed our third and fourth services, our lack of a good staging story started to hurt.

Growing pains: Staging/integration woes The issue here was that we didn’t have a great way to develop and test new services against one another. The monolith had a very rigid, opinionated staging system that had done the job for many years, but it wasn’t something that we wanted to stamp out for our services. And this Kubernetes thing had really started to take off at the time, and we also had some folks with previous exposure to it. We were also struggling to manage our background jobs at the time, and saw Kubernetes as potentially relevant there. So we resolved to see what kind of staging system we could come up with in the span of about three weeks. By this point in late 2016, Helm was still very young and many (or most) of us were managing clusters with bash scripts and static manifests. Which is exactly what we did: we wrote a CLI that wrapped kubectl to deploy Docker images that our CI system built. By the third week of our effort, developers were using the new staging system. It ended up being very well-received, and we resolved to continue developing it. While our staging system looks much different today, it was wonderful to be able to address a pressing need so quickly!
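As a rough sketch of that kind of tool (not our actual CLI; the manifest shape and registry URL are invented), a thin wrapper might render a Deployment for a CI-built image and pipe it to kubectl:

```python
# Illustrative sketch of a thin kubectl-wrapping deploy tool. The manifest
# fields and registry URL are assumptions, not Reddit's actual staging system.

import json
import subprocess


def render_manifest(service, image_tag):
    """Build a minimal Deployment manifest for a CI-built image."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": service},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": service}},
            "template": {
                "metadata": {"labels": {"app": service}},
                "spec": {
                    "containers": [
                        {
                            "name": service,
                            "image": f"registry.example.com/{service}:{image_tag}",
                        }
                    ]
                },
            },
        },
    }


def deploy(service, image_tag):
    """Pipe the rendered manifest into `kubectl apply` (requires a cluster)."""
    manifest = json.dumps(render_manifest(service, image_tag))
    subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest.encode(), check=True)


manifest = render_manifest("example-service", "abc123")
print(manifest["spec"]["template"]["spec"]["containers"][0]["image"])
```

Nothing fancy, but a developer can deploy any CI-built tag to staging with one command instead of hand-editing manifests.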

Reddit Engineering in 2017 As 2017 rolled around, we had a growing tapestry of services, over 60 engineers, and quite a few different teams writing code. Patterns were beginning to form in Terraform land, baseplate continued to evolve, and we were settling into this brave new service-oriented world. But we on the Infrastructure team had reached a threshold.

Growing pains: Infra team as a bottleneck As mentioned earlier, the Infrastructure team had historically been the arbiter of production. That had served us well for a decade, but the rest of the organization had recently grown far faster than we had. We had become a bottleneck. Service owners had to wait for us to stand up and configure new infrastructure for them. We also handled most of the incident response and production debugging, since we had the most permissive access. This led to frustration all around: the Infrastructure team was discouraged by the hefty backlog of teams waiting on us, and the other teams were (rightly) not pleased with having to wait long periods of time for their services to be stood up. Seeking some short-term relief, we trained a few of the more infrastructure-oriented teams to work with our systems and operations tooling. These teams were able to block less on Infra while reusing some of our Terraform and Puppet building blocks. This helped them move faster and took some load off of us, though it didn’t come close to fully solving our Infra-as-a-bottleneck issue.

One size fits some The reality is that not all teams want to operate the full stack for their service. There’s an entire spectrum of infra comfort to contend with. When faced with the prospect of doing some infrastructure work, some teams will chortle with glee and tell you about the time they did a stage 1 Gentoo install on their refrigerator. Others will get up, scream loudly, and run away, never to be seen again.

You want me to do WHAT?!? I’d argue that at most organizations, it’s not reasonable to expect deep infrastructure experience on every team. I think we are at our best when our development teams can focus more directly on the problems that they have been tasked to solve, instead of doing that plus wading into the deep, murky waters of infrastructure. And to be honest, our stack is built from some pretty ho-hum stuff for someone who spends 40 hours a week in operations land. But this is a lot to bite off for others who don’t have that much contact with this side of things.

What do we really really want? But let’s step back and look at these issues from a higher-level perspective. Our fellow engineering teams are: frustrated by having to wait on us; feeling like they aren’t in control of their service’s lifecycle; unable to debug some issues in production themselves. If we had to distill all of this down into a very specific request, what they desire is...

Service Ownership! But what does that mean? We’ve established that not every team wants to manage its own infrastructure. But most of our teams do want to own the full lifecycle for their service. They want to be empowered to jump into development quickly, then graduate to a production-like test environment where the finishing touches can be applied. After that, they want to take their service to production without much dependence on other teams. Once in production, they want to be able to instrument, monitor, and iterate on their service, again, without blockage. When problems arise, they want to be able to start by trying to diagnose the issues on their own. These are all very reasonable expectations, and it is mutually beneficial for all of us to figure out how more responsibility can lie with the service owner. But there are a few challenges to overcome...

Service ownership challenges: Learning curve No matter how smoothly paved or neatly groomed a path we present for our service owners, the days of throwing code over the wall for an operations team are over. They may not need to get a deep understanding of their infrastructure, but they’ll need to learn some basics about whatever contact surfaces we present to them.

Service ownership challenges: Responsibility Depending on the team, the response to this newfound notion of service ownership will range from “this is awesome” to “can you just do it for me?” In our case, we are there to help, but do expect the service owner to make an attempt to work towards more self-sufficiency. And nine times out of ten, the service owner will see the value in being able to accomplish more and faster. Buuuut...

Service ownership challenges: Mistakes are going to happen… often! There will be explosions. Probably at least three. Maybe even five. But instead of getting caught up on what could happen, we can look at these happy little accidents as the learning opportunities that they are. And since the service owner is the first to get paged, they are highly motivated to get things sorted!

Early 2018 - How to enable service ownership? We’ve spent time with the other engineering teams trying to get a feel for what they want. We’ve worked with them to define a notion of service ownership. We’ve also established that it’s not reasonable for most teams to become Infrastructure engineers in addition to their primary responsibilities. So how can we limit the learning curve while empowering teams?

Reddit Infrastructure as a Product We have to change our thinking a bit. Instead of expecting each team to moonlight as operators and sys admins, we needed to put together a cohesive Infrastructure Product. A Reddit Engineer should be able to take this thing off of the shelf, read and follow the instructions, and have their service running without blockage or significant fuss.

Selecting the right foundation So we started looking at ways to re-package and more humanely offer our tools and processes. There are even companies out there with a similar stack that have managed that very feat. But ultimately, we felt there might be a better way to go. Perhaps we could continue using most of what we already had in Infrastructure land, but present a more singular “API” or contract to the users. You’ll recall that I mentioned a Kubernetes-based staging system earlier on, all the way back in late 2016. We had, at this point, over a year of success in using Kubernetes for this and a few other miscellaneous purposes. It had served us well, though we didn’t have a well-defined idea of how it would best serve us longer term. To move in the direction of Kubernetes would mean a huge paradigm shift in how we developed, deployed, and operated our services. This was spooky to us, in that we’d be forced to take our bumps and bruises all over again with a new stack, when we had one that was working for us. < Hit Next > But as we sketched and experimented, we came to believe that this was a great foundation for us to build on. We felt that it’d be the singular, cohesive platform from which we could build our product. So we steeled our nerves and started on the journey to enabling service ownership for all.

How to build an Infrastructure Product But how do we actually build an Infrastructure Product? It’s not enough to toss Kubernetes at the other engineering teams, wish them well, then run away. Our goal of offering a cohesive product means that we need to build for ease of use, consistency, the right level of flexibility, and safety. In thinking through our problems and our ideal future, we arrived at a set of core tenets that have directed our development. Let’s walk through what we have and how we’ve satisfied each.

Limit the surface area Our first core tenet is the idea that we need to limit the total technical surface area that our service owners are expected to interact with. Getting back to this notion of a packaged Product, we have to get away from expecting users to “build something from these hundred pieces.” Instead, we need to say “here’s one or two things you can use to develop and operate your service.” By limiting the set of distinct technologies that the user needs to touch, we lessen context switching and lower the learning curve. So instead of requiring knowledge of everything you see before you… < Hit Next > We decided to focus our users on Kubernetes as that single contact point. And by “Kubernetes”, we mean the barest of essentials: Pods and Services. We support this with plenty of tooling, training, and documentation. By not having to learn our entire systems stack, users are able to focus more on their services. But this kind of philosophy only goes so far unless we make responsibilities and expectations explicit.
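For illustration, the “barest of essentials” contact surface might amount to something like this (the names, image, and ports are invented, not from one of our actual services):

```yaml
# A service owner's entire Kubernetes contact surface: one Pod, one Service.
apiVersion: v1
kind: Pod
metadata:
  name: example-service
  labels:
    app: example-service
spec:
  containers:
    - name: app
      image: registry.example.com/example-service:latest
      ports:
        - containerPort: 9090
---
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  selector:
    app: example-service
  ports:
    - port: 80
      targetPort: 9090
```

Two small resources to understand, rather than the whole systems stack underneath them.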

The user/operator contract To do that, we defined a “contract” that codifies the user/operator relationship. We define the “product user” as the service owner, and the “product operator” as the Infrastructure team. Product users are expected to learn the barest of basics about Kubernetes, and they’re expected to develop, deploy, and operate their own services. By extension, production incidents involving a service result in the owner getting paged. Product operators are responsible for provisioning, maintaining, and scaling the underlying Kubernetes clusters. If there are incidents with the Kubernetes clusters or supporting infrastructure, the Infrastructure team gets paged. If a service requires an S3 bucket, a cache, or a database, the product operators are well-equipped to provision and operate those. In drawing a dividing line for responsibilities, we allow each party to focus on the thing that they are best equipped to handle.

Batteries-included For our user and operator contract to work, we have to provide as much as we can out of the box. This goes back to focusing each party on the tasks that they are best equipped to handle. For example: by providing starter test, build, and deploy pipelines, we avoid the need for our users to become Release Engineers. By baking in a set of default metrics, dashboards, and alerts, we provide an immediately observable foundation for the service owner to start with. The same holds true for logging; this isn’t something our users should have to wire up themselves. If a service interacts with another service, the owner shouldn’t have to learn the intricacies of our service mesh. If secrets or credentials are involved, our user should be able to quickly and easily set those up in our secrets store. We are able to include all of these things out of the box due to the consistency baseplate gives us: we’re able to make a bunch of assumptions about how a Reddit service operates, and can make our tooling simpler as a result. But getting back to the main takeaway in all of this: your users should get all of these things just by virtue of standing their services up using your product. If they have to rig up all of these things themselves, you don’t have an infrastructure product. You have a pile of technology.

Paint-by-numbers As we discussed earlier, teams differ in their ability and willingness to handle infrastructure work. Some are all about it; others want nothing to do with it. We have to be able to support both ends of this spectrum, and that means not requiring a deep understanding of the infrastructure. Instead of expecting users to take their lumps by slogging through upstream documentation, poring over blog posts, and rooting around in forums, we provide a library of paint-by-numbers guides to cover common tasks and problems. For example: “here’s how to set and retrieve secrets.” “Here’s how to have your application assume a specific IAM role.” “Here’s how to deploy an experimental release that handles a fraction of production traffic.” We supplement these guides with training and extensive documentation. If you don’t have excellent docs, training, and support, you do not have an infrastructure product! We also must auto-generate or template everything that we can for the users. For example, instead of asking your users to write Kubernetes manifests or Helm Charts, we can generate something that will get them most of the way there, save for some minor adjustments. In summary, let’s read the last sentence on this slide together. OK? Ready?

Consistency throughout Ah, look at this. More consistency! Supporting the previous three tenets, we have to provide consistency throughout the development cycle. For us, this means Kubernetes in local dev, Kubernetes in our lab and staging environments, and Kubernetes in production. As far as how this works: we reuse the same Helm Charts and Docker images through all three phases of the lifecycle. If you’d like to learn more about how this works, see our 2018 Helm Summit presentation on this very topic. The TL;DR is that by pairing our Helm Charts with phase-specific values, we can account for the few differences there are at each step along the way. The result of this is that our developers don’t need to learn to use three different systems. They build familiarity and competency with our offering even in local development. When it’s time to go to production, they have a decent understanding of the basics. And finally, we prevent many cases where something works locally but not in production.
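The phase-specific values approach can be sketched like this; the file names and keys are invented for illustration and are not our actual charts.

```yaml
# values-staging.yaml -- rendered against the same chart used in production
replicaCount: 1
ingressHost: example-service.staging.example.internal
---
# values-production.yaml -- only the values differ, not the chart
replicaCount: 12
ingressHost: example-service.example.internal
```

Rendering a phase is then just `helm template ./chart -f values-staging.yaml` (or the equivalent values file for local dev or production), so the manifests that reach each environment come from one shared source.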

Deploy with some confidence Everything that we have discussed thus far must be offered with tools and processes that help our users develop and deploy with confidence. Otherwise we’re just throwing crap over the wall, hoping that it won’t make a mess. For example, we have an org-wide code review policy: we require that at least one other person review all commits and designs. It can be frustrating to have to factor in additional time for review, but you’ll be collaboratively improving the quality of your contributions. This is also a great way to learn from others. In addition to code review, we need a robust unit and integration testing story. We need to be able to test more isolated chunks of code, and we should also be able to test larger systems (or even services) against one another. To tie this all together, by the time our changes are in line to deploy to production: they will have been run through linters, automated unit tests, and integration tests; they will have been reviewed by at least one other colleague; and they will have been deployed to a staging environment. These tools and processes give us more confidence that our deploy will be a success.

Guardrails and Safeties With all of that said: despite our best efforts, mistakes will happen in production. When the worst inevitably happens, it’s important to have guardrails and safeties: things that will limit the damage. For example, resource limits on CPU and memory can reduce the likelihood of resource starvation. On the network side, a service mesh can be used to set service-to-service rate limits. We’ll want to tightly scope IAM and RBAC permissions on applications and users to avoid compromise or unexpected fires. And by virtue of running all of this on Kubernetes, we end up with a platform that makes it easy to use the API to scan for common mistakes. For example, if someone somehow gets through our CI checks and fires up a Service of type LoadBalancer that hasn’t been approved, perhaps we alert our Security team. Going a step further, Docker image policies can whitelist or blacklist the acceptable images or image repositories that can be used with your clusters. Having these guardrails and safeties is a critical part of being able to trust the integrity of our offering.
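As an illustrative sketch of that kind of API-driven scan: plain dicts stand in for responses from the Kubernetes API here, and the approval list and service names are invented.

```python
# Illustrative guardrail check: flag LoadBalancer Services that aren't on an
# approved list. In a real cluster this data would come from the Kubernetes
# API (e.g. via the official client); here it's plain dicts for clarity.


def find_unapproved_load_balancers(services, approved):
    """Return names of LoadBalancer-type Services not explicitly approved."""
    return [
        svc["metadata"]["name"]
        for svc in services
        if svc["spec"].get("type") == "LoadBalancer"
        and svc["metadata"]["name"] not in approved
    ]


services = [
    {"metadata": {"name": "public-api"}, "spec": {"type": "LoadBalancer"}},
    {"metadata": {"name": "sneaky-svc"}, "spec": {"type": "LoadBalancer"}},
    {"metadata": {"name": "internal"}, "spec": {"type": "ClusterIP"}},
]

offenders = find_unapproved_load_balancers(services, approved={"public-api"})
print(offenders)  # ['sneaky-svc']
```

A periodic job running a check like this could page or alert the Security team whenever an unexpected load balancer appears.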

Reddit Infrastructure Product Core Tenets We’ve covered a lot of ground, so let’s review before I end with some parting words of wisdom: Limit the surface area The user/operator contract Batteries-included Paint-by-numbers Consistency throughout Deploy with confidence Guardrails and Safeties

What does all of this buy us? And this all sounds cool, right? But what change did all of our work bring about at Reddit?

Infra team: From Operators to Enablers Our Infrastructure Product has helped us… transform. Let’s compare and contrast before and after.

Infra team: From Operators to Enablers Before: Infra team provisions all infrastructure After: Infra provides infra as a product Before: Infra deploys new services After: Service owners deploy new services Before: Infra operates most services After: Service owners operate their services Before: Infra is a blocking dependency on production work After: Infra is an adviser and enabler for other teams to do production work

Organizational scalability Reddit has seen rapid growth over the last few years, and we’re likely to see more in the future. By building a coherent infrastructure product, we’ve packaged and distributed the Infrastructure team’s expertise for the other engineering teams to use. In doing so, we’re scaling ourselves enough to be able to handle the next few hundred engineers to come.

Closing remarks I presented quite a bit of higher-level content in this talk. I’m going to hang around in here for a bit, or out in the hall if they kick us out. Please feel free to stop by and ask questions. Also feel free to contact me on Reddit as /u/gctaylor, or on Twitter under the same name.

Jobs plug Also, I’d be remiss if I didn’t mention that we’re hiring across all sorts of teams and disciplines.

Speaker info and resources Here are my details, along with some resources. I’ll take a moment to plug the kubernetes subreddit, which is approaching 10k subscribers. You can also see some other interesting content on the technology section of our blog at redditblog.com.
