Strong opinions, weakly held

At Etsy, our engineering team is well known for practicing continuous deployment. For all of the talk in the industry about continuous deployment, I don’t think that its impact on personal productivity is fully understood.

If you don’t work in a shop that does continuous deployment, you may assume that the core of it is that releases are not really planned. Code is pushed as soon as it’s finished, not according to some schedule, and that’s true, but there’s a much deeper truth at work. The real secret to the success of continuous deployment is that code is pushed before it’s done.

When people are practicing continuous deployment the Etsy way, they start out by adding a config flag to the code base behind which they can hide all of the code for their feature. As soon as the flag has been added, they add some conditional code to the application that creates a space that they can fill with the code for their new feature. At that point, they should be pushing code as frequently as is practical.
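
To make the pattern concrete, here is a minimal sketch in Python. It is not Etsy’s actual code: the flag name, the `FEATURE_FLAGS` dictionary, and the checkout functions are all hypothetical, chosen only to show a flag guarding an empty space that unfinished code can fill.

```python
# Hypothetical flag store; a real system would load this from a config file.
FEATURE_FLAGS = {"new_checkout": False}


def feature_enabled(name: str) -> bool:
    """Return the flag's value, treating unknown flags as off."""
    return FEATURE_FLAGS.get(name, False)


def legacy_checkout(cart):
    """The existing, working code path."""
    return {"flow": "legacy", "total": sum(cart)}


def new_checkout(cart):
    """The space for the new feature; it can be pushed while unfinished."""
    raise NotImplementedError("new checkout is still in progress")


def checkout(cart):
    # The conditional is the only place where the two paths meet.
    if feature_enabled("new_checkout"):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

With the flag off, `checkout` never reaches the unfinished function, so each small push is safe to deploy even though the feature does not work yet.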

This is the core principle of continuous deployment. It doesn’t matter if the feature doesn’t work at all or there’s nothing really to show; you should be pushing code in small, digestible chunks. Ideally, you’ve written tests that are then part of the continuous integration suite, and you’re having people review that code before it goes out. So even though you don’t have a working feature, you’re confident the code you’re producing is robust because other people have looked at it, and it’s being tested every time anyone deploys or runs the suite of automated tests. You’re also reducing the chances of having to spend hours working through a painful merge scenario.

Many engineers are not prepared to work this way. There’s a strong urge to hold onto your code until you’ve made significant progress. On many teams, working on a feature for a week or two to build something real before you push it is completely normal. At Etsy, we see that as a risky thing to do. At the end of those two weeks you’re pushing a sizable chunk of code that has never been tested and has never run on the production servers out into the world. That chunk of code very well may be too big for another engineer to review carefully in a reasonable amount of time. It should have been broken up.

Pushing code frequently is the main factor that mitigates the risk of abandoning the traditional software release cycle. If you deploy continuously but the developers all treat the project like they’re developing in a more traditional fashion, it won’t work.

That’s the systems-based argument for pushing code at a rate that tends to make people uncomfortable, but what I want to talk about is how taking this approach improves personal productivity. I’m convinced that one thing that separates great developers from good developers is that great developers don’t allow themselves to get stuck. And if they do get stuck, they get stuck on design issues, and not on problem solving or debugging.

Thinking in terms of what code you can write that you can push immediately is one way to help keep from getting stuck. In fact, a mental exercise I use frequently when I’m blocked on solving a problem is to try to come up with the smallest thing I can do that represents progress. Focusing on deploying code frequently helps me stay in that mindset.

A great description of continuous deployment. This is something we’ve had to help some of the new people at Automattic wrap their heads around. I’ve taken to using a variation of the traditional open source motto of ‘release early, release often’, making it ‘commit early, commit often’.

Having the ability to enable a feature for just a specific user, or group of users, or if using the site via a development sandbox / internal proxy, is a huge help. We do that all the time.
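
A rough sketch of that kind of targeting, in Python. The `User` and `FlagRule` types and their field names are assumptions made for illustration, not anything described above; the point is only that one rule can combine specific users, groups, and internal traffic.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    id: int
    groups: frozenset = frozenset()


@dataclass
class FlagRule:
    """Hypothetical targeting rule for a single feature flag."""
    enabled_for_all: bool = False
    user_ids: set = field(default_factory=set)
    groups: set = field(default_factory=set)
    enabled_for_internal: bool = False  # sandbox / internal proxy traffic


def is_enabled(rule: FlagRule, user: User, via_internal_proxy: bool = False) -> bool:
    if rule.enabled_for_all:
        return True
    if rule.enabled_for_internal and via_internal_proxy:
        return True
    if user.id in rule.user_ids:
        return True
    # True when the user belongs to at least one targeted group.
    return bool(rule.groups & user.groups)
```

For example, `FlagRule(user_ids={42}, groups={"staff"})` would turn the feature on for user 42 and for anyone in a `staff` group, and leave it off for everyone else.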

We have different configurations for development, QA, and production. So the code will be hidden for production users, but will be seen by other users in development, and will also be exercised by the CI suite (assuming you wrote tests for it). Plus, of course, even if it’s behind a config flag it still usually has to compile.

Interesting. I can see the good points behind continuous deployment, namely where you are sure that all the required code packages, database tables, certificates, etc. are in production and there should be nothing missing when you flip the switch to ‘go live’.

However, I can see additional issues like: ‘Oops – the flag was set wrong, or the code looking at the flag was wrong, and something went to production too quickly’. Or that, in production, you have to manage so many versions of database tables and jars which are live at the same time. Your site is no longer ‘production’; instead it’s ‘many versions of test/dev/production on production’.

@Brandon:
The point is to keep the set of hidden code as small as possible. You hide features only for as long as required. By focussing on releasable chunks of code and getting hidden features visible as quickly as possible the distinction between dev, test and production becomes irrelevant in the big scheme of things.

Once you release multiple times per day the oopses (which occur regardless of size of a release) are fixed so quickly the user might think they just imagined it.

This is a fantastic explanation of continuous deploy. Having done both large and small deploys, I can agree that finite deploys are smarter and less costly overall. Even when these simple finite deploys are difficult, having the continuous peer review that the Etsy continuous deploy model encourages is essential.

Thanks for your reply. It still sounds functionally the same as having separate branches for dev and production and releasing less frequently. With branches, the other developers are still seeing the current dev code, and the CI test suite is still testing it, so the only difference is that it doesn’t go to production hidden behind conditionals.

Don’t get me wrong, I like the concept. I just think that adding in the conditionals kind of cancels out the gains from “releasing” the code.

@Perrin – It’s completely different than branches. With branches you are enabling the capabilities in that branch for an entire environment – be that dev/staging/production. With toggles you introduce logic so that you can choose who is exposed to those features within any given environment. This is important when you consider how you test code against a production dataset & production traffic. How do you expose 1% of your users to a particular branch? It can be done, but it becomes more of a network routing exercise than something you control in code logic. Also – if you are using conditionals in your code you can expose features in a “dark launch” scenario where traffic follows both paths & you can test the accuracy of a new pathway without exposing customers to the results of that code path.

Lastly, if something goes wrong – the difference in speed between deploying a different branch and toggling a configuration value is vast. If the configuration toggles are done correctly, there’s no restart necessary and the change is immediate.
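
Two of the ideas in this comment, exposing a small percentage of users and running a “dark launch”, might be sketched in Python as follows. The function names and the hashing scheme are assumptions for illustration only; real toggle systems differ in how they bucket users and report mismatches.

```python
import hashlib


def in_rollout(flag_name: str, user_id: int, percent: float) -> bool:
    """Deterministically place a user in a bucket in [0, 100) and
    compare it to the rollout percentage, so the same user always
    gets the same answer for the same flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = (int(digest[:8], 16) % 10000) / 100.0
    return bucket < percent


def dark_launch(old_path, new_path, request, mismatches: list):
    """Run both code paths, but only ever return the old result.
    Disagreements and errors from the new path are recorded, not shown."""
    result = old_path(request)
    try:
        candidate = new_path(request)
        if candidate != result:
            mismatches.append((request, result, candidate))
    except Exception as exc:  # the new path must never break the request
        mismatches.append((request, result, exc))
    return result
```

Calling `in_rollout("new_search", user.id, 1.0)` would select roughly 1% of users in code rather than through network routing, and `dark_launch` lets production traffic exercise the new path while customers only see the old one.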

How do you manage the cleanup of feature flags? One of my concerns about switching to CI is the sprawl of conditionals around already launched features. If the switch is still there, then how do you protect against taking it away on a subsequent release?

Managing old config flags is definitely a source of technical debt, further complicated by the fact that for some features, you never want to take the flag away. In practice, it turns out not to be a big enough problem to demand some kind of process for resolving it.

I really want to push us more in this direction, but we have things like external cams components that don’t support feature toggling. How do you suggest supporting continuous deployment when a lot of third party integrations don’t?

(i) First, don’t branch. Or, if you want to do so to avoid committing lots of little changes while you fiddle/test, then keep your branches local only and short-lived. Branches make it hard for other users to get visibility into your code, and to start integrating your code with things they’re doing. Just use your master branch; everyone then knows that’s where the most recent code is, and nobody has to go digging around different branches for something.

(ii) Push changes as the author suggests, in bite-size chunks. And if they aren’t ‘done’ (complete), so what? The point is that you make this known. The code should do what it claims to do, and if you throw an exception ‘not implemented’ then it’s doing exactly that. (If you claim it’s implemented and it doesn’t work as it should, then that’s a bug.)

(iii) In terms of the config option, that may even be overkill. But code to handle backward-compatibility is often a good trade-off.

(iv) “Commit” (or “push” in Git) should be synonymous with hitting production, or at least there should be minimal delay. Hence the need for solid unit testing, code review, and other lightweight processes to make sure code does what it claims to do before you push.

A big part of continuous development is an open code base, and a philosophy of trust and communication between developers.

Also, by pushing small changes, when things break it’s much easier to see why. Compare that with integrating a huge change in one massive release (say, where a branch of some 1,000 commits was merged in one fell swoop).

@Daniel – One strategy I’ve seen in the past for old feature flags is to build the removal of that flag into the story planning process. This was a kanban shop, so when they would write a story to create a feature w/ a toggle in front of it they would also create a story to remove that toggle. This meant that the feature wasn’t “done” until the toggle was removed – and the dev team responsible for adding it would drive getting it removed because they had the story in their backlog. That said – some toggles live on for a long time, sometimes outlasting the teams who added them, so it probably depends on the scenario but the above works for most short to medium-length dev efforts behind a toggle.

If you make a small effort and design a nice toggle framework, with (among other features) a good and simple way to see and control all toggles with their values in different environments, then they are not a debt, but an asset, when for example, for some business reason, you suddenly want to hide a feature.
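
As a very small illustration of what such a framework might offer, here is a Python sketch with a single registry of toggles per environment, a report of every toggle’s state, and a runtime switch. The toggle names, environment names, and functions are all hypothetical.

```python
# Hypothetical registry: one entry per toggle, one value per environment.
TOGGLES = {
    "new_checkout": {"development": True, "qa": True, "production": False},
    "gift_cards": {"development": True, "qa": False, "production": False},
}

ENVIRONMENTS = ("development", "qa", "production")


def is_on(name: str, env: str) -> bool:
    """Unknown toggles and unknown environments are treated as off."""
    return TOGGLES.get(name, {}).get(env, False)


def set_toggle(name: str, env: str, value: bool) -> None:
    """Flip a known toggle at runtime; raises KeyError for unknown names."""
    TOGGLES[name][env] = value


def report(toggles=TOGGLES, envs=ENVIRONMENTS):
    """Return one line per toggle showing its state in every environment."""
    lines = []
    for name in sorted(toggles):
        states = []
        for env in envs:
            state = "on" if toggles[name].get(env, False) else "off"
            states.append(f"{env}={state}")
        lines.append(f"{name}: " + ", ".join(states))
    return lines
```

Having every toggle visible in one place is what makes it cheap to hide a feature on short notice, for example with `set_toggle("new_checkout", "production", False)`.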