A Few Good Rules

What’s wise to standardize?

Imagine a first date with someone very special. As you arrive at your favorite restaurant, you are enveloped in perfect darkness: all the lights are out. Strangely, noises from the kitchen indicate it’s business as usual. You hear a waitress approaching, waiting to guide you to your unlit seat. Your companion is confused and a bit frightened. Would you stay or would you find a more conventional place to eat?

Web applications, just like restaurants, are judged by the entire experience they provide. Even a brief outage can seriously harm the service provider’s reputation or bottom line. Policies and guidelines serve an important role in preventing costly downtime. Unfortunately, they can also lead to irrational decisions which do more harm than good. One example is the establishment of a "DevOps team" within the company, which results in all operations know-how being siloed in a single team. Although such a management directive may be heralded as DevOps, it is anything but.

Engineers despise illogical, bureaucratic rules which act as obstacles to progress, yet there seem to be at least a few at every company. Chances are, there were excellent reasons for enacting them at some point in the past. Over time they become obsolete, but the original authors cannot (or dare not) revoke them. Anyone who has worked on C++ codebases which forbid the use of the STL for historic reasons, or Java projects which staunchly refuse to move past version 1.4 of the language, understands just how counterproductive these measures can be.

Should we forget all the rules?

Confronted with rules which act as hindrances, it’s tempting to want to get rid of them all. Unfortunately, “laissez-faire” companies usually aren’t very successful. Good rules are a vital form of communication about long-term strategy, costly lessons learned in the past and what has been discovered about users’ needs. Ideally, the rules an organization develops over time give individuals confidence in their ability to make good decisions. Does this ever happen in practice?

Netflix comes to mind as a good example of a company which has really effective guidelines. At least that’s the impression one gets from reading their blog and the code they’ve open sourced. For example, even without talking to anyone who works at Netflix, I’m certain that “You build it, you run it” is a good summary of how they approach development versus operations. Another clear principle is that they write code to build a reliable and scalable service, not the other way around. Nothing proves this more eloquently than the fact that they open source much of the backend software they write.

Netflix has built the Netflix Internal Web Service Framework (NIWS), a custom software stack for creating internal web services which run reliably in the cloud. NIWS often takes the road less travelled by siding with technologies which have been traditionally less popular than other alternatives. Such a departure from the usual set of best practices takes considerable confidence; without a doubt, this can be partly chalked up to empowering policies which let engineers think “outside the box”.

Loadbalancing a la Netflix

My favorite example of where Netflix challenged the norm is how they implemented loadbalancing in the NIWS stack. Customer-facing traffic is still handled by a traditional loadbalancer (a standard Amazon EC2 ELB), but for traffic between Netflix servers, they chose something completely different called client-side loadbalancing. The basic idea is simple: instead of running dedicated loadbalancer nodes which forward traffic to a set of backend servers, the client itself maintains the list of available backend nodes. When the client makes the request, it contacts the selected backend instance directly, rendering the loadbalancer unnecessary.
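To make the idea concrete, here is a minimal Python sketch of a client that keeps its own backend list and picks a node for each request. It is an illustration only, not the NIWS implementation: the `ClientSideBalancer` class, the round-robin strategy and the addresses are all assumptions made for the example.

```python
import itertools


class ClientSideBalancer:
    """Keeps the backend list inside the client and picks a node per request."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._cycle = itertools.cycle(self._backends)

    def choose(self):
        # Plain round-robin (an assumption for this sketch). A production
        # client would also refresh the list and skip unhealthy nodes.
        return next(self._cycle)


# Hypothetical backend addresses, used only for illustration.
balancer = ClientSideBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
picks = [balancer.choose() for _ in range(4)]
```

The client would then open a connection to whatever `choose()` returns, with no intermediate hop between it and the backend.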

Netflix didn’t invent client-side loadbalancing, but they were one of the first well-known companies to bet their entire infrastructure on it (to be fair, Twitter and Yahoo also experimented with the concept around the same time). The standard way to balance requests among several backend servers is via a loadbalancer like Amazon EC2 ELB or servers running software such as HAProxy. For such a critical component, it makes sense to be conservative and commit to a technology which many engineers know well. But the real reason very few companies experimented with client-side loadbalancing before Netflix was that they never even considered it.

Software engineers working on large-scale applications are surrounded by the libraries and components they use every day much like fish are by water. After years (perhaps decades) of successfully building systems a particular way, questioning the battle-proven recipe or its building blocks seems like a waste of time. In many companies, these decisions are encoded in policies which are essentially unchangeable. Yet Netflix made significant gains due to its adoption of client-side loadbalancing. First, they removed a single-point-of-failure from the communication channel (a big win in the days when EC2 instances frequently terminated without prior warning). Second, by integrating the load balancing logic into the client, the loadbalancing strategy could take into account information which was readily available to the client. For example, consider the following loadbalancing rule:

Make the request to any node in the client’s EC2 Availability Zone if one is available. If no such instance exists, find one in the current region which has an uptime greater than a day.

Traditional loadbalancers were not designed to execute this sort of custom logic. They also don’t typically know too much about their clients (such as which EC2 availability zone or region a client belongs to). Custom loadbalancing logic becomes part of the application, written in the same language as the rest of the codebase. This means it’s easy to write unit tests for code which was traditionally considered “infrastructure stuff”. So not only is it possible to make more complex and intelligent decisions, one can be more confident that they work as expected. In a way NIWS takes DevOps to the next level: not only are developers and operations engineers sitting side by side, working in the same team, they’re also using the same language, committing into the same repositories.
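The zone-and-uptime rule quoted above could be expressed as ordinary application code along these lines. This is a sketch under stated assumptions rather than Netflix’s code: the `Node` fields, the `choose_node` function and the sample data are hypothetical, and “any node” is simplified to “the first matching node”.

```python
from dataclasses import dataclass

ONE_DAY_SECONDS = 24 * 60 * 60


@dataclass(frozen=True)
class Node:
    host: str
    zone: str            # e.g. "us-east-1a"
    region: str          # e.g. "us-east-1"
    uptime_seconds: int


def choose_node(nodes, client_zone, client_region):
    """Prefer a node in the client's zone; otherwise fall back to a node in
    the client's region that has been up for more than a day."""
    for node in nodes:
        if node.zone == client_zone:
            return node
    for node in nodes:
        if node.region == client_region and node.uptime_seconds > ONE_DAY_SECONDS:
            return node
    return None  # no suitable node; the caller decides what to do


# Hypothetical sample data for illustration.
nodes = [
    Node("a", "us-east-1b", "us-east-1", 3600),
    Node("b", "us-east-1c", "us-east-1", 2 * ONE_DAY_SECONDS),
    Node("c", "eu-west-1a", "eu-west-1", 5 * ONE_DAY_SECONDS),
]
```

Because the rule is a pure function of its inputs, a unit test can feed it a handful of `Node` values and assert on the result without any network or loadbalancer hardware involved.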

Prezi joins the club

Was Netflix’s gain from replacing the venerable standard loadbalancers with an in-house client-side implementation an isolated episode that worked only for them? Hardly. At prezi.com, we also adopted this technology for internal traffic. Some of our application servers run several services. When such services communicate, we prefer that they contact the instance of the service running on the same machine instead of making a network request. If, however, the service is not running locally, then any instance will do. The benefit for Prezi is avoiding unnecessary network traffic when possible, reducing our AWS bill and response times. The rule is implemented by a short Scala snippet which is currently running on prezi.com’s production servers.
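The production code is not reproduced here. As a rough stand-in, the rule (use the local instance when there is one, otherwise any instance) might look like the following Python sketch. It is not Prezi’s Scala implementation; `pick_instance`, its parameters and the host names are hypothetical.

```python
import random


def pick_instance(instances, local_host, rng=random):
    """Return the local instance if there is one, otherwise any instance."""
    if not instances:
        raise LookupError("no instances of the service are registered")
    if local_host in instances:
        # Same machine: no network round trip is needed.
        return local_host
    # Not running locally, so any registered instance will do.
    return rng.choice(instances)


# Hypothetical host names for illustration.
local_pick = pick_instance(["app-1", "app-2"], local_host="app-2")
remote_pick = pick_instance(["app-1", "app-2"], local_host="app-3")
```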

Netflix engineers could design NIWS without worrying too much about challenging the technical status quo because their company rules empowered them to do so. Even though the NIWS stack is available to anyone, only companies which have a similar mindset will be able to build products with it. Specifically, companies which expect engineers to make decisions primarily based on technical merit, and ones where perfectionism is discouraged.

The Netflix test

Expecting all engineering decisions to be entirely free of office politics, pet technologies and fear of change is a tall order. Yet minimizing the impact of these voices goes a long way in ensuring development is not sidetracked. An army of arbitrary restrictions generally tends to make engineers’ designs less creative and functional. In contrast, a few good rules limit the problem space and clarify constraints, improving the quality of the product.

Judging by the source of the NIWS stack, Netflix generally considered two things when deciding how a component should be implemented:

What is the likelihood and consequence of the component’s failure?

Is it easy to modify the component's behavior when the assumptions on which it was built start to change?

I dubbed these questions the Netflix test. They are closely related: one could even say that the first question is just a specialization of the second. What makes them significant is how well they mirror Netflix’s business goals of providing a reliable and scalable service. Other companies which share these goals may benefit from this test, but the real power of the test lies in what’s missing. There’s no mention of any specific technology or vendor.

Perfectionists need not apply

What really surprised me about the Netflix code is how focused it was on being good enough and nothing more. Don’t get me wrong, the code I’ve seen so far is easy to read and unit test coverage is quite high. That said, I did not expect this level of focus on Netflix’s specific use cases. For example, in many parts of the code, background threads are started without any means of stopping them later on. That seems like a big problem until you realise that Netflix doesn’t upgrade software on nodes. They deploy a new version of the application by starting a new cluster of EC2 instances, killing off the old cluster when monitoring proves that the new version can handle the load without unexpected problems. If one adopts these deployment tools (which they also open-sourced), then the zombie threads are not an issue. If one tries to use Netflix libraries from an application server like Glassfish, however, each redeployment of the application will leak threads and memory.
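For readers unsure what “a means of stopping” a background thread involves, here is a minimal Python sketch of a worker with an explicit shutdown path. It is an illustration of the general technique, not code from the Netflix libraries; the `Poller` class and its interval are assumptions.

```python
import threading


class Poller:
    """Background worker that can be stopped, unlike a fire-and-forget thread."""

    def __init__(self, interval_seconds=0.01):
        self._stop_event = threading.Event()
        self._interval = interval_seconds
        self.ticks = 0
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait returns True once stop() has set the event,
        # which ends the loop and lets the thread exit.
        while not self._stop_event.wait(self._interval):
            self.ticks += 1

    def stop(self):
        self._stop_event.set()
        self._thread.join()

    def is_running(self):
        return self._thread.is_alive()


poller = Poller()
poller.start()
poller.stop()
```

A container that redeploys applications in place needs something like `stop()` to be called on every undeploy. A deployment model that replaces whole instances never reaches that code path, which is why omitting it cost Netflix nothing.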

The large number of singleton classes generously sprinkled across the codebase were also unexpected. When we tried to reuse code from one of the NIWS libraries in a way Netflix did not foresee, we quickly found ourselves struggling with convoluted hacks involving multiple classloaders to get around these issues.

Finally, while the wiki pages documenting the Netflix code are immensely helpful, they are few, leaving lots of ground uncovered. More often than not, the documentation is the source code. On several occasions, I’ve found the best advice on solving a specific NIWS problem in the GitHub issue tracker.

Many of my colleagues who encounter the Netflix ecosystem for the first time are left a bit underwhelmed. Their first reaction is to condemn the engineers who wrote the code for being undisciplined or lazy. “There should be rules for closing unneeded threads”, I hear them say. Yet for Netflix, none of the listed disadvantages of NIWS is a real problem. The engineering time not spent on closing threads was put to better use elsewhere. The singleton classes are only an issue if one tries to reuse code in unsupported ways. Finally, while documentation is generally a good thing, readable code and lots of in-house expertise make it optional. By enacting rules about thread management, frowned-upon design patterns or minimum documentation, Netflix would have stolen focus from the primary problem its engineers faced.

In fact, I’ve come to the realisation that the Netflix stack is successful precisely because of its scrappiness. Not only is it acceptable that Netflix “cut some corners”, it actually resulted in a better product. Writing mounds of text describing code which is constantly evolving just guarantees documentation will be out of date. Writing features that the authors don’t use is demotivating for developers, difficult to justify for their teams, and bad for the community as well since this code will not be tested in production. At Prezi, we have about a dozen projects we’ve been wanting to open source for months to years, but lacking the time to add the polish we’d like, we’ve been unable to open them up so far. Netflix successfully open-sourced such a vast amount of code without going broke because they were aiming for readable and unit-tested, not overpolished and futureproof. The few good rules Netflix enforced enabled engineering to keep up with its quickly growing user base and even open source what they wrote on the side.

So are all specific rules bad?

If the Netflix test is reformulated as a set of guidelines, the result is pretty generic. What about concrete nuggets of hard-earned wisdom like “devote at least 10% of your time to decreasing technical debt” or technical information like “NodeJS version 0.6.1 causes our web app to be unstable, don’t use it”? Wouldn’t it be wasteful to forget the lessons we learned from postmortems after outages?

Such advice, best practices and well-known components are generally valuable allies. They’ve gained the trust of engineers by proving their merit through the years in accelerating development and facilitating the operation of systems. For example, at Prezi most of our backend systems are written in Python and use the gunicorn webserver, the Django web framework and the MySQL database. In the company’s infancy, this stack gave developers the stability to focus on new product features. For years, “write your service using Django and MySQL” seemed about as obvious to us as “don’t deploy after 3pm on a Friday”. Neither of those was a written rule at Prezi, but both could have been.

As the number of registered users climbed from 0 to more than 40 million, many of the facts which led to the adoption of this platform no longer held true. For example, when all of the website traffic was served by a single application, it made sense to store all user data in a single MySQL database. Today, Prezi is composed of dozens of independent services, with vastly different requirements in terms of latency, reliability and consistency. Many of these services are running in EC2 and accessing data through a single primary key, using the database as a key-value store. No technical guideline formulated during Prezi’s first year, however useful at the time, would help us face our current engineering challenges.

Standard technologies and specific rules can boost engineers’ output, as long as they don’t reach their expiration date. The problems with specific rules start when they no longer apply, but they are still enforced.

Interfaces set in stone

A particularly bad (and also very common) form of specifics which have outlived their usefulness is the deprecated software interface. My favorite example is the Java Servlet API, even though it’s not really deprecated. In fact, it’s a good interface: intuitive, stable, well documented and implemented by many different application servers.

When Prezi decided to explore the JVM as an alternative platform to our trusted Django stack, we chose a lightweight proxy application as our pilot project. I argued passionately for using Jetty and the Servlet API instead of the no-name Scala webserver which was the other alternative the team considered. Six months later, we scrapped the original proxy and wrote a new one using Spray (the technology I voted against), partly because it was much more efficient for our use case, where response times were dominated by the roundtrip time for outgoing HTTP requests. I had been thinking on the code level: what objects and interfaces we wanted to use, how we would write unit tests and how large the developer community was. This is the level of abstraction addressed by the Servlet API. I should have been thinking (and talking) about how the application would utilize the underlying hardware. Specifically, where the bottlenecks are: is processing an incoming request going to require significant CPU or IO resources? In our case no, since the majority of the time is spent waiting for the response to outgoing HTTP requests; this is the nature of proxies. In light of this, dedicating a thread to each incoming request for its entire duration, which is how blocking servlets work, needlessly limits the number of requests which can be served in parallel and makes inefficient use of memory.
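The difference can be illustrated with a small Python sketch in which a single thread keeps hundreds of proxied requests in flight at once, because waiting on the upstream response does not block a thread. This is only an analogy using Python’s `asyncio`; it does not show how Spray or our proxy is implemented, and `proxy_request`, `serve` and the simulated round-trip time are assumptions.

```python
import asyncio
import time

ROUND_TRIP_SECONDS = 0.05  # stand-in for the outgoing HTTP round trip


async def proxy_request(request_id):
    # Waiting on the upstream response yields control to the event loop,
    # so no thread is parked while the request is in flight.
    await asyncio.sleep(ROUND_TRIP_SECONDS)
    return request_id


async def serve(count):
    # All requests wait concurrently on one thread.
    return await asyncio.gather(*(proxy_request(i) for i in range(count)))


start = time.perf_counter()
responses = asyncio.run(serve(500))
elapsed = time.perf_counter() - start
```

Handled one at a time, 500 such requests would spend 25 seconds waiting. Handled concurrently, the waits overlap, so the total is much closer to a single round trip. A thread-per-request design can overlap them too, but only by holding one thread, and its stack memory, per waiting request.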

The fact that the Servlet API was not a good fit for this problem does not serve to discredit the venerable interface or the Java programming language. Thousands of companies use the Servlet API to build amazing products, and other programming languages have webserver interfaces with similar semantics. The moral of the story is that I applied a specific guideline out of context. Interfaces solve a specific problem. When that problem is no longer the one you’re trying to solve, the given interface is not a good choice (regardless of how widespread or new-and-shiny it is).

Rules in DevOps

The power of DevOps comes from the collaboration of people with very different skillsets. A team whose members are randomly distributed on the full-bearded-sysadmin-type vs functional-programming-fanboy spectrum has a better chance of building reliable and scalable services than a homogeneous group.

This difference in backgrounds amplifies the need for clear rules by which the team operates. Developers don’t necessarily need to know the full reasoning behind why the custom Linux kernel they use has a particular constellation of compile-time options. Similarly, not everyone needs to worry about how many singleton objects exist in the code. Standards like “don’t write a shell script without a shebang line” or “have unit tests for classes which parse user data” apply to everyone in the team, and serve to help those who don’t have enough experience in the given domain to do the right thing by habit. Provided specific rules are enforced only while they still apply, they can do wonders.

More generic rules, like those of the Netflix test, only apply to higher-level decision making, but they apply for much longer. Operating a team takes both kinds; the trick is to notice when a rule we’ve grown to rely on no longer serves its purpose.

If we return to the metaphorical restaurant at the beginning of the article and open the refrigerator door, the expiration dates on the various boxes will range from months into the future for something like ketchup to just a few hours for raw fish. Meals are composed of an array of ingredients, each with its own expiration date. It’s a cook’s job to make sure all of them are fresh enough that the final product tastes good. In the same vein, real wisdom is required not only to decide what we standardize, but also to minimize how long it takes us to notice when our standards no longer make sense.

About the Author

Peter Neumark is a devops guy at Prezi. He lives in Budapest, Hungary with his wife Anna and two small children. When not debugging Python code or changing diapers, Peter likes to ride his bicycle.