Bing: Continuous Delivery

Microsoft’s Bing is on a long-term journey to build the largest, most relevant, best performing search engine in the world. This one goal presents monumental software engineering challenges that demand the most from its development platform. So while Bing has climbed to become the world’s second largest search engine, we have also been innovating in how we build, deploy, and manage our software platform. What we’ve discovered: rapid innovation in user-facing features and performance is only possible through innovation in the software platform and the developer experience. In this series we hope to convey the challenges we faced in moving over 600 engineers from a way of working they were very comfortable with to one that seemed not only infeasible but downright insane. The culmination of the platform and developer experiences is referred to as Continuous Delivery (or simply “Agility”), and when we made the leap to Continuous Delivery we not only changed the way our developers write code – we fundamentally altered the way our business operates. Regardless of organizational size, we believe this is a journey worth taking.

Craig Miller

Chap Alex

Jonathan Bergeron

Let's get you up to speed

Agility has allowed Bing developers to be incredibly efficient and productive. To give some context: we deploy thousands of services 20 times per week, with 600 engineers contributing to the codebase. We push over 4,000 individual changes per week, and each code submission goes through a test pass of more than 20,000 tests. That test pass takes about 10 minutes, down from several hours or even days. In short, Agility has been a game-changer for Bing.

Agility. You keep using that word…

At Bing, we have been so deeply invested in Agility over the last four years, and phrases like continuous delivery and idea velocity are so deeply ingrained in our developers, that at times we need to step back and remember that many feature teams have only the vaguest understanding of what we’re talking about.

Simply put, Agility is the speed at which ideas can transition from the whiteboard (an idea), to the keyboard (an implementation), and from there to the live site (deployment) and finally to users (experimentation).

It is the speed of experimentation—all the way from back-of-the-napkin ideation to full-blown analysis of the feature’s impact on user engagement. And to say that test automation is central to this success would be a vast understatement.

Inner Loop:
The loop that spans ideation through code commit, and most often centers on an engineer writing the code for a given feature. Because engineers require an idea from which to design and build a feature, this loop also includes prototyping, crowd-sourced feature engagement and feasibility studies.

Outer Loop:
This is the loop that gets committed code out to Production. Because a feature is not really finished until we have successfully navigated live user trials, our experimentation-centered culture requires that we consider the cost of flighting features within this loop.

Before we started working on Agility, feature ideation amounted to hallway conversations, elevator pitches, and “gut feelings.” The Inner Loop was not much better – there were some tests. Some of which were automated. And some of those passed.

But we lacked a process of checks and balances to really measure effectiveness.

There was a suggested check-in gate, but it was not enforced. Because the Inner Loop was so loosely guarded, production deployment was also inconsistent. We pushed production deployments every four weeks and considered ourselves lucky when a release didn’t slip.

The monthly release cadence, along with the uninformed nature of feature ideation, made feature experimentation a real gamble. Feature teams often created multiple variations of a feature ahead of time and shipped them all together. This turned the whole experimentation process into a kind of Choose Your Own Adventure story. If any given flight was underperforming, it could be replaced with another (fingers crossed) to try to recover some of the productivity that would otherwise be lost while waiting for the next monthly ship cycle.

Engineers begin to riot if we deploy less than 10 times a week.

- Chap Alex -

Today, feature ideation has strong support via rapid prototyping tools and real-time crowd-sourced feature analysis. The Inner Loop has been radically altered to better support engineers during the development process. Efficiency has been gained through cloud-based builds, highly parallelized and distributed cloud-based testing, and a streamlined code submission process that requires all tests to pass prior to check-in.
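As a sketch of what such a gate can look like, the snippet below accepts a submission only when every test in the corpus passes. The runner and test names are hypothetical stand-ins for the real distributed system, not Bing's actual tooling.

```python
# Hypothetical sketch of a pre-check-in gate: a change is accepted only
# if the entire test corpus passes. run_test is a placeholder for a
# dispatcher to distributed cloud test runners.
from concurrent.futures import ThreadPoolExecutor

def run_test(test_name: str) -> bool:
    # Placeholder: a real gate would execute the test remotely.
    return True

def gate_submission(tests: list) -> bool:
    """Return True only when every test passes; otherwise reject the check-in."""
    with ThreadPoolExecutor(max_workers=32) as pool:
        return all(pool.map(run_test, tests))

tests = [f"test_{i}" for i in range(1000)]
print("accepted" if gate_submission(tests) else "rejected")
```

Because the gate runs before check-in rather than after, a failing test blocks exactly one change instead of breaking the shared branch for everyone.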

Because of the stability of the Inner Loop, the Outer Loop also moves much faster - in fact, the same engineers who were happy with the monthly ship cadence are now impatient if we deploy their services less than 10 times per week. Our ability to deliver code quickly has also transformed A/B testing, and feature teams no longer need to build backup experiments to avoid the downtime that was endemic to monthly releases.
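For illustration, the mechanics of assigning users to an A/B flight can be sketched as deterministic hash bucketing. This is a generic technique, not Bing's actual experimentation platform; the function name and bucket scheme are invented.

```python
# Generic sketch of A/B flight assignment by hash bucketing (illustrative,
# not Bing's experimentation system).
import hashlib

def assign_flight(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically place a user in 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# The same user always lands in the same arm for a given experiment,
# so a flight can be started, resized, or stopped without redeploying code.
arm = assign_flight("user-42", "new-answer-card")
assert arm == assign_flight("user-42", "new-answer-card")
```

Hashing on the experiment name as well as the user id keeps assignments independent across experiments, which is what lets many flights run concurrently.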

Agility is a journey, not a destination

Agility did not spring up overnight. We overcame many difficult challenges along the way to continuous delivery, and some of the hardest battles centered on the culture around—and implementation of—testing. Seasoned developers know that a code base is only as good as its tests. When we were shipping monthly, there was no hard dependency on tests. Issues could be triaged away, or signed off on, with little visibility into what the longer-term consequences might be.

By committing to Agility, we put test owners front and center, which was a difficult change to manage. Every test in the test corpus is run for each code submission attempt. In addition to enforcing accountability for each test, we invested heavily in increasing the reliability of the tests. Flakiness became verboten; it is something our current culture of Agility does not allow for.
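One hedged sketch of what actively managing reliability can look like: track pass/fail history per test and flag for quarantine any test that produces both outcomes on unchanged code. The class, names, and threshold below are hypothetical, not Bing's system.

```python
# Hypothetical flakiness tracker: a test that both passes and fails on the
# same unchanged code is flagged so it can be quarantined and fixed.
from collections import defaultdict

class FlakinessTracker:
    def __init__(self, threshold: float = 0.001):
        self.runs = defaultdict(lambda: [0, 0])  # test -> [passes, failures]
        self.threshold = threshold               # tolerated failure fraction

    def record(self, test: str, passed: bool) -> None:
        self.runs[test][0 if passed else 1] += 1

    def quarantined(self) -> list:
        flagged = []
        for test, (passes, failures) in self.runs.items():
            total = passes + failures
            # Mixed outcomes above the threshold => the test is flaky.
            if passes and failures and failures / total > self.threshold:
                flagged.append(test)
        return flagged

tracker = FlakinessTracker()
for _ in range(999):
    tracker.record("test_stable", True)
tracker.record("test_flaky", True)
tracker.record("test_flaky", False)
print(tracker.quarantined())  # ['test_flaky']
```

Quarantining keeps a flaky test from blocking unrelated check-ins while still holding its owner accountable for fixing it.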

Manual testing also became too much of a burden. Simply put, there’s no time for it: shipping up to twenty times per week leaves no window for manual testing. The shift away from manual testing encouraged feature teams to add functional automation. Teams who cannot or will not automate their tests quickly learn that manual testing and Continuous Delivery do not go well together.

All Your Hardware are Belong to Us

When we made the shift to Agility, our data highlighted that engineers spent most of their time in the Inner Loop. So we worked on improving productivity by speeding up and standardizing the Inner Loop.

We started by leveraging a cloud-based build system, as well as using Azure and Microsoft’s Test Authoring and Execution Framework (TAEF) to build out a custom, highly parallelized and distributed feature validation system.

Our goal was to scale with hardware. And for the last four years that's exactly what we've done. Every part of our Agility pipeline is designed for scalability and massive parallelization. It’s scaled so well that our 100-developer org grew to over 600 in a two-year span. Our test corpus has increased ten-fold. And we've increased test reliability to 99.9% and reduced time-to-test to 10 minutes.
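The "scale with hardware" idea can be sketched as bin packing the test corpus across machines by estimated duration, greedily assigning the longest tests first. This is a generic scheduling scheme, not the actual TAEF-based system; the names and durations are invented.

```python
# Generic sketch of sharding a test corpus across machines so wall-clock
# time scales with hardware: greedy longest-first bin packing.
import heapq

def shard_tests(durations: dict, num_machines: int) -> list:
    """durations: {test_name: estimated_seconds}. Returns one test list per machine."""
    shards = [(0.0, i, []) for i in range(num_machines)]  # (load, id, tests)
    heapq.heapify(shards)
    for test, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, i, tests = heapq.heappop(shards)  # least-loaded machine
        tests.append(test)
        heapq.heappush(shards, (load + secs, i, tests))
    return [tests for _, _, tests in sorted(shards, key=lambda s: s[1])]

durations = {"t_slow": 60, "t_mid": 30, "t_a": 10, "t_b": 10, "t_c": 10}
shards = shard_tests(durations, 2)  # each shard ends up with ~60s of work
```

With a scheme like this, cutting time-to-test is mostly a matter of adding machines, which matches the scaling behavior described above.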

A 100% test pass is 100% impossible.

With our focus on testing over the past several years, one of our key learnings is that there is no clean test pass. Any complex feature will experience some test flakiness, and Bing services are no exception. One of our biggest success stories has been the amount of flakiness we have eliminated from our tests and systems. And while we often get surprised looks when we share our reliability numbers, those numbers already account for flakiness in the system. Flakiness is extremely destructive and requires active management systems to contain it.
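A back-of-the-envelope calculation shows why a clean pass is effectively impossible without active management. The 20,000-test corpus size comes from this article; the 0.1% per-test flake rate and the independence assumption are ours, for illustration only.

```python
# Back-of-the-envelope: with any nonzero per-test flake rate, a flake-free
# run of a large corpus is essentially impossible without retries.
# f = 0.001 is a hypothetical flake rate; test independence is assumed.
f = 0.001          # hypothetical per-test flake rate (0.1%)
n = 20_000         # corpus size mentioned in the article

p_clean = (1 - f) ** n              # one attempt per test: about 2e-09
p_with_retries = (1 - f ** 3) ** n  # up to 3 attempts per test: ~0.99998

print(f"single attempt:  {p_clean:.2e}")
print(f"3 attempts/test: {p_with_retries:.5f}")
```

The retry column is the whole story: a tiny amount of containment machinery (rerunning a flaked test a couple of times) turns a near-certain spurious failure into a near-certain clean signal.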

At the same time, it’s extremely challenging to fully automate testing of a complex feature. Our sophisticated test platform gives feature teams a number of different ways to test their features, including driving browser-based interactions and configuration-driven test authoring.

Even with a very robust test platform, there are features that are so difficult or complex to automate that the result may not be worth the investment. Our platform allows us to achieve a balance between testing features thoroughly while getting them to our end users efficiently.
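To illustrate configuration-driven authoring, a test can be expressed as a declarative spec interpreted by one generic runner. The schema, selectors, URL, and driver below are invented for this sketch and are not the platform's real API.

```python
# Hypothetical sketch of configuration-driven test authoring: the test is
# data, and a single generic runner interprets it. Schema and selectors
# are invented for illustration.
TEST_SPEC = {
    "name": "answer_card_appears",
    "steps": [
        {"action": "navigate", "url": "https://example.test/search"},
        {"action": "type", "selector": "#query", "text": "seattle weather"},
        {"action": "submit", "selector": "#search-form"},
        {"action": "assert_present", "selector": ".answer-card"},
    ],
}

class RecordingDriver:
    """Stand-in for a browser driver; records each action it is asked to run."""
    def __init__(self):
        self.log = []
    def __getattr__(self, action):
        return lambda **kwargs: self.log.append((action, kwargs))

def run_spec(spec, driver) -> None:
    for step in spec["steps"]:
        args = {k: v for k, v in step.items() if k != "action"}
        getattr(driver, step["action"])(**args)

driver = RecordingDriver()
run_spec(TEST_SPEC, driver)
print(len(driver.log))  # 4 steps executed
```

The appeal of this style is that feature owners write no imperative test code at all, which lowers the bar for the teams that would otherwise fall back to manual testing.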

Risking failure pays off big

This is not the culture you’re looking for.

Cultural change is difficult. It cannot be imposed on a team; instead, the team must be encouraged along the journey one step at a time. Another culture-based problem is that not all feature owners and stakeholders connect the perception of agility with the laborious reality of agility. Having little concern for quality, they are impatient with testing and discouraged when the pipeline slows due to test failures.

There’s nothing special about releasing code. However, it takes a culture of quality to ship quality code fast. It takes relentless commitment and attention to quality, and a willingness to course correct when Agility is perceived as a blocker to shipping quality code.

It all starts with an idea.

Consistently churning out groundbreaking work requires a process for bringing ideas to life. At Bing, we encourage everyone to contribute to the process. It’s the diversity of viewpoints that creates the most interesting ideas.

So what’s our process? We’re glad you asked. We encourage good ideas by providing a number of forums through which different segments of the workforce can contribute comfortably:

Growth Hacks
are aimed at tackling well understood problems that have big growth opportunities. For example, improving the efficiency of our engineering systems or improving the user engagement on a major segment. Growth hacks are tracked at the VP level, so it’s a great way to have a major impact on the direction of our organization.

BingCubator
is a forum where entrepreneurs can pitch an idea that is large enough for funding. Their ideas funnel through an incubation process that is managed by a v-team before they can present them to upper management for funding.

Hack Days
closely model engineers’ typical daily interactions, though they are designed to let engineers temporarily shelve their normal deliverables and pursue something outside their area of expertise.

The most important guiding principle for us is that a feature idea can come from anywhere. While we still receive strong guidance from above (top-down), and feature teams develop features in the normal way (bottom-up), we built out systems to allow ideas to come from anywhere.

We organize our engineering ecosystem into an efficient idea funnel, where we make it easy to iterate on ideas with end users at the top so we can churn through as many ideas as possible.

Ask and you shall find out.

Who hasn’t asked their colleagues for advice on a feature by the water cooler? It would be far more valuable, though much more difficult, to get feedback from real end users to confirm your feature hypothesis.

At Bing, we have a system to do just that. We provide tooling that lets engineers get feedback from external users about their ideas within minutes. Engineers submit their concepts and questions and select their target audience, and their experiments are then sent to hundreds of people for feedback. Microsoft has its own crowd-sourcing platform with a panel of several thousand external participants, so feedback usually comes back within two hours, and engineers can experiment visually without writing any code.

Want to know if your hack day idea is a good one? Build a quick prototype, compose a brief survey, push it to the crowd, and evaluate the feasibility of your prototype. This has empowered our developers with real-time feedback on whether their idea can really scale to production for our end users.

It’s not about getting code to production, it’s about getting ideas into production.

- Craig Miller -

A vision of the future.

We’re not done yet, and we are excited by the challenges that remain on the way to becoming an even more agile organization.

We are looking to decrease test execution time by dynamically executing only those tests that are relevant to a given change.
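One common way to approach this is test impact analysis: keep a map from each test to the source files it exercises, and run only the tests whose footprint intersects the changed files. The coverage map and file names below are invented for illustration.

```python
# Sketch of dynamic test selection via test impact analysis. The coverage
# map (test -> files it exercises) is hypothetical example data.
COVERAGE = {
    "test_ranker": {"ranker.py", "scoring.py"},
    "test_ui": {"layout.py"},
    "test_cache": {"cache.py", "scoring.py"},
}

def select_tests(changed_files, coverage=COVERAGE) -> list:
    """Return only the tests whose covered files overlap the change."""
    changed = set(changed_files)
    return sorted(t for t, files in coverage.items() if files & changed)

print(select_tests(["scoring.py"]))  # ['test_cache', 'test_ranker']
```

The trade-off is safety versus speed: a stale coverage map can skip a relevant test, so schemes like this are usually paired with a periodic full-corpus run.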

We’re looking to decrease deployment times through more proactive deployment health signals.

And in a pleasant twist, feature teams are asking us to help them be even more agile. We’re looking at ways to enable self-service, quick-twitch deployments for partners who simply can’t wait hours and hours for their code to get to production.

Our long-term goal of building the largest, most relevant, best performing search engine in the world marches onward – at double-time. We started with a simple question: “Can we do it faster?”, and we have spent the last few years asking that question and implementing answers. From prototyping to building, and from validation to deployment and analysis, no area has been off limits. More importantly, we have used data and metrics to drive decisions about the next round of innovations, including real-time developer feedback!