Tech Roots » Continuous Delivery
http://blogs.ancestry.com/techroots
Ancestry.com Tech Roots Blogs
Fri, 31 Jul 2015 22:24:36 +0000

Making Chef Dynamic
http://blogs.ancestry.com/techroots/making-chef-dynamic/
Thu, 02 Jan 2014 21:13:52 +0000

We’ve been working with Chef (formerly Opscode) for a couple of years now. I can safely say that I’ve spent more time crawling through forums and reviewing code documentation than I care to admit. In all that time, I’ve never seen anyone clearly document how to use Chef dynamically. I hope that I can share some cool ideas in this post and maybe somebody out there can make some magic happen similar to what we’ve done.

For those of you who may not know, Chef is a framework that makes system configuration ridiculously easy. Their terminology is a little strange as it follows a cooking analogy (hence Chef, get it?). A recipe is essentially a script, or some form of code that yields a result. A cookbook is a collection of recipes. Tools to interact with Chef include knife, etc. The analogy only stretches so far and soon you get into more practical terms like node, role, and environment.

I mentioned using Chef dynamically. So here’s the problem. You build a role with some recipes and assign it to a node. The node then takes the role, downloads the recipes and acts on them. The result is a configured server. It’s very cool, but what happens when you have hundreds of services iterating through hundreds of code versions, each of which configures its servers slightly differently? Managing hundreds of roles under version control can quickly become a nightmare. The trick is to handle all of the code dynamically and in a single maintainable place. You have one master dynamic recipe that is capable of reading in a configuration file and acting on the settings inside. Then, you take the configuration file and push it up into a Data Bag (Chef’s generalized data storage system).
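To make the idea concrete, here is a minimal sketch in Python rather than in Chef's own Ruby DSL. It is not our production recipe: the handler functions, the step types, and the shape of the JSON document are all assumptions made up for illustration. The JSON string stands in for a data bag item, which in Chef would be fetched from the server.

```python
import json

# Hypothetical handlers: each one stands in for a real configuration step.
def install_package(settings):
    return f"install {settings['name']} {settings.get('version', 'latest')}"

def create_site(settings):
    return f"create site {settings['name']} on port {settings['port']}"

# One table maps a step type found in the configuration to the code that handles it.
HANDLERS = {"package": install_package, "site": create_site}

def run_master_recipe(config_text):
    """Read one service's configuration and act on every step in order."""
    config = json.loads(config_text)
    actions = []
    for step in config["steps"]:
        handler = HANDLERS.get(step["type"])
        if handler is None:
            raise ValueError(f"unknown step type: {step['type']}")
        actions.append(handler(step["settings"]))
    return actions

# Stand-in for a data bag item (assumed layout, not a Chef requirement).
EXAMPLE_CONFIG = """
{
  "id": "search-service",
  "steps": [
    {"type": "package", "settings": {"name": "nginx", "version": "1.4.4"}},
    {"type": "site", "settings": {"name": "search", "port": 8080}}
  ]
}
"""

if __name__ == "__main__":
    for action in run_master_recipe(EXAMPLE_CONFIG):
        print(action)
```

The point of the design is that adding a new service means uploading a new configuration document, not writing a new role or recipe.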

That’s a really rough, really quick example, but it covers the gist of the idea. You can even use off-the-shelf cookbooks by taking the settings specified in the data bag and assigning them to the node as attributes, which causes them to take precedence over cookbook default values.

We’ve had the discussion about just using role files instead of data bags a number of times. Using a role is the standard for everybody out there, so why don’t we go that way? The answer is version control. Data bags can be versioned. When you’re dealing with individual specifics for each stack, versioning the configuration data allows you to tie configuration directly to a version of code. This allows for dependency matching, as well as a myriad of other niceties that can be realized once you get into it. The data bags approach also allows you to easily roll back a configuration, meaning if somebody uploads a bad configuration and screws up their stack, we can just roll back to a previous data bag with no sweat off our backs. It adds a layer of complexity to the code, so this isn’t an efficient answer for everything out there, but for configurations that change often, this makes server change control as easy as checking in a file.
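The rollback idea can be sketched in a few lines of Python. This is an in-memory illustration only, not how Chef stores data bags; the class name, method names, and version strings are hypothetical.

```python
class VersionedConfigStore:
    """Ties each uploaded configuration to a code version and allows rollback."""

    def __init__(self):
        self._configs = {}  # (service, code_version) -> config dict
        self._active = {}   # service -> activated code versions, oldest first

    def upload(self, service, code_version, config):
        """Store a configuration for a code version and make it the active one."""
        self._configs[(service, code_version)] = dict(config)
        self._active.setdefault(service, []).append(code_version)

    def active(self, service):
        """Return (code_version, config) for the currently active configuration."""
        version = self._active[service][-1]
        return version, self._configs[(service, version)]

    def rollback(self, service):
        """Deactivate the newest configuration and return the previous one."""
        history = self._active[service]
        if len(history) < 2:
            raise ValueError("no earlier version to roll back to")
        history.pop()
        return self.active(service)


if __name__ == "__main__":
    store = VersionedConfigStore()
    store.upload("search", "1.0", {"port": 8080})
    store.upload("search", "1.1", {"port": "oops"})  # a bad upload
    print(store.rollback("search"))
```

Because the bad configuration stays stored under its own version key, you can still inspect it after the rollback to see what went wrong.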

To make it even nicer, build a UI that creates the configuration file for you. This will allow your development teams to just click a couple boxes to indicate their desired settings, click a button, and have a pile of fully configured boxes ready to take live traffic. Just like magic.

Agile Development Leads to a Happier Company
http://blogs.ancestry.com/techroots/agile-development-leads-to-a-happier-company/
Fri, 27 Dec 2013 18:22:52 +0000

Over the past few years, Ancestry.com has undergone a significant transformation in many aspects of its business. Steady growth, combined with Agile adoption, continuous delivery, and improvements in technology and infrastructure, has created an organization focused on delivering value. In November, Ancestry.com CEO Tim Sullivan presented at the DevOps conference FlowCon about the organization’s shift to Agile development practices. The presentation covered what inspired the change in the business and offered advice to other executives and senior managers on what they can do to start their own transformation and increase flow. During his presentation, Sullivan also touched on how the transition to Agile practices and culture enabled greater innovation, making for a happier company. Instead of waiting to make bulk updates, teams can make smaller changes and iterations at any time, which allows them to get product and site features to market faster.

Featured Article: Migration to Continuous Delivery at Ancestry.com
http://blogs.ancestry.com/techroots/featured-article-migration-to-continuous-delivery-at-ancestry-com/
Sat, 07 Dec 2013 00:50:57 +0000

Starting with the adoption of Agile development practices, Ancestry.com has progressed to a continuous delivery model to enable code release whenever the business requires it. Transitioning from large, weekly or bi-weekly software rollouts to smaller, incremental updates has allowed Ancestry.com to increase responsiveness and deliver new features to customers more quickly. Ancestry.com has come a long way in developing a continuous delivery model and will continue to evolve to adapt to the fast-changing pace of the market.

The lessons learned from our efforts in building a continuous delivery model have been featured in TechTarget’s SearchSoftwareQuality online magazine. You can view our photo story here.

Let DevOps be DevOps…Nothing More Nothing Less
http://blogs.ancestry.com/techroots/let-devops-be-devopsnothing-more-nothing-less/
Fri, 12 Jul 2013 22:26:09 +0000

I recently attended DevOps Days in Santa Clara, a two-day conference/event focused on the DevOps culture and practices. Even though the conference is in its fourth year, you still get talks about what DevOps is and what it means. The term “DevOps” names a relatively new movement in the software development industry. Patrick Debois, a sysadmin/developer, is credited with coining the term just a few years back. Because it is a newer concept in the development community, its definition is still forming and coming into its own. What it really means is still discussed a lot, and I’m going to add my own perspective now as well.

To get a better sense of DevOps evolution, I recently emailed Patrick and simply asked, “What did you see that inspired you to call it DevOps?” His response:

“…I was a sysadmin working a lot with agile dev teams and I was jealous of their way of working. This turned me into experimenting with Scrum and Kanban in operations teams. Then I got involved with agile infrastructure (as Andrew Shaefer called it). The first mailing list was ‘agile system administration,’ but for organizing the first conference that name was a bit long. Also, the fact that it only involved ops or sysadmins bothered me. Hence I came up with devopsdays (also a pun on dead on delivery – dod).”

“In hindsight, the name worked fine. The only thing that bothers me now, is that it is narrow when people first hear it, as my broader definition includes all company silos, also security, finance, HR, etc… but then again, it’s at the dev & ops barrier where most of the pressure builds up, so it works.”

Patrick was inspired by the way agile practices were positively affecting development and wanted to adopt them into operations. Another blog, devops.com, embraces that same idea with its tagline, “DevOps – Helping finish what Agile development started.” Thus, DevOps is fundamentally an extension of Agile into operations as it attempts to instill those values and practices into operations. At its core, Agile is about delivering value frequently so practices and technologies have been adopted in the DevOps discipline to support the rapid delivery of value. DevOps values high automation, managing infrastructure using development tools and languages (infrastructure as code), and such practices as continuous delivery.

Recently, however, broader definitions of DevOps have come onto the scene. Here is one I recently read in a Wired article covering an interview with Phoenix Project Co-Author Kevin Behr, “DevOps are really a group of folks engaged in developing critical thinking skills. Instead of viewing the world as optimizing what’s best for developers and optimizing what’s best for operations, DevOps see the world as a continuum and look at optimizing the whole system rather than sub-optimizing the parts at the expense of each other.”

What!? I like what the author is saying and I agree with the premise, but how would anyone get that out of the name “DevOps”? Whenever you say the word “DevOps,” the name itself conjures up some combination of those two disciplines. Even Patrick, the godfather of DevOps, states that he now has a “broader definition” which includes all company silos (groups), even security, finance, and HR! Hmmm, what should that be called? ManagementProductDevQASecurityFinanceHROps? Or, how about xOps? It’s just hard to make the mental leap from dev & ops to include all that other stuff!

In actuality, the broader definitions of DevOps that people are throwing around now really reflect what Lean is about and espouses. Lean values systems thinking, which, along with many other Lean principles and practices, attempts to “look at the whole.” All of that thinking is great for DevOps, but I suggest we keep the two ideas separate. DevOps can easily be integrated into an organization in which the Lean mindset exists. As a corollary, DevOps (and Agile for that matter) are effective catalysts for driving an organization into a Lean transformation.

Another clarification is in order: DevOps is not devs doing ops. It is true that a successful DevOps adoption transforms the way ops is done, but the development organization is fundamentally about building and servicing a product, while operations is about building and supporting the infrastructure for that product. Saying developers should do ops is like saying we should all be building roads or running the power company.

Some think of DevOps as dev [noun] + ops [noun], i.e. the dev and ops organizations working together and collaborating. It does embrace that notion, but DevOps goes deeper than that. Based on Patrick’s original notion of DevOps and my synthesis of ideas, I view DevOps as an adjective + noun combination, or Dev [adj.] + Ops [noun], meaning applying development principles and practices to operations.

So, what is DevOps for me? Applying the KISS principle (keep it simple, stupid) I get:

applying developer tools and practices to operations: source control, infrastructure as code, modern development languages, TDD, and continuous integration.

creation of self-service tools and products which enhance product development and business productivity.

lastly, by applying these principles, operations groups become more effective because the nature of operations itself is transformed. In the DevOps model, operations groups become enablers of increased flow of value. They build the infrastructure and get out of the way!

I think Patrick made a key point when he said, “it’s at the dev & ops barrier where most of the pressure builds up.” I wholeheartedly agree with that statement and find it extremely valuable to identify DevOps, representing the breaking down of that barrier, as a solution to that all-too-prevalent problem. So, Patrick, don’t lament your choice of DevOps! DevOps: applying development principles and practices to operations. If we keep the focus there, DevOps will be more understandable, easier to sell, and far more successful at fulfilling what it was created to solve.

Creating a Dynamic NAnt Fileset
http://blogs.ancestry.com/techroots/creating-a-dynamic-nant-fileset-2/
Wed, 03 Jul 2013 20:24:36 +0000

Every once in a while when building NAnt targets you will come across the desire to create a <fileset> which is defined at runtime. Maybe you’d like to define the <fileset> as a property in a previous target or a calling target and have it be used in a subsequent task. For example, we have a shared NAnt target that runs a test suite. It is designed to be flexible and reusable so that you can use it for a lot of different kinds of test runs (unit tests, functional tests, performance tests). You set up a few parameters in an upstream target which is specific to an individual component, then call the common test target and it runs the tests using the parameters you set. In pseudo-NAnt (which is a thing I just invented, patent pending), it looks like this:

<!-- This is in a build file that belongs to a specific component -->
<target name="Upstream_ComponentSpecific_Test">
    <!-- set up some properties for this test run -->
    <property name="TestCategory" value="UnitTest" />
    <property name="ResultsFile" value="${ResultFilePath}" />
    <!-- run the tests -->
    <call target="CommonTestTarget" />
</target>

<!-- This is in some shared build file that many components re-use -->
<target name="CommonTestTarget">
    <!-- a shared target for running tests using parameters set as NAnt properties -->
    <!-- Pretend you have a custom NAnt task called "test"; this is pseudo-NAnt, remember? -->
    <test category="${TestCategory}" resultsfile="${ResultsFile}">
        <fileset>
            <!-- Dangit! How do we tell this task which files have tests and which don't? Wouldn't it be cool if we could pass that fileset in as a property? -->
        </fileset>
    </test>
</target>

You can see the fatal flaw in this design (because I point it out in the comments, isn’t that nice?). You need to tell the reusable NAnt target which files contain the tests (I’m using this example for simplicity so let’s assume you can’t use a test definition file or a test container or some other way of achieving the same end). We ended up with many other similar cases where this would be nice, such as excluding assemblies from code coverage, excluding files from copy tasks, including non-managed assemblies into builds, and a list of other scenarios.

So we needed to find a way to tell the Test target which assembly files include the tests, but we had to do it dynamically from an upstream target. You cannot set a <fileset> as the value of a NAnt <property>, which is probably everyone’s first thought. I don’t know why this doesn’t work, but at the time of this posting, it doesn’t and I’m not going to hold my breath until it does. The solution we arrived at was to use the refid attribute of the <fileset> type, then dynamically assign files to the <fileset> with the matching id using a NAnt target whose only purpose was to define the <fileset>. So now it looks like this:

<!-- This is in a build file that belongs to a specific component -->
<!-- This target is new; its purpose is to define a custom fileset and give it an id you can reference later -->
<target name="SetFilesetForUnitTests">
    <fileset id="UnitTestFileset">
        <!-- Declare the files that contain the tests. Notice I put in some includes and excludes; you can do anything that the NAnt fileset supports -->
        <include name="${DeploymentDirectory}\TestAssembly1.dll" />
        <include name="${DeploymentDirectory}\TestAssembly2.dll" />
        <include name="${DeploymentDirectory}\Test.*.dll" />
        <exclude name="${DeploymentDirectory}\Test.Broken.dll" />
    </fileset>
</target>

<target name="Upstream_ComponentSpecific_Test">
    <!-- set up some properties for this test run -->
    <property name="TestCategory" value="UnitTest" />
    <property name="ResultsFile" value="${ResultFilePath}" />
    <!-- call the target that sets up the dynamic fileset (the name must match the target defined above) -->
    <call target="SetFilesetForUnitTests" />
    <!-- run the tests -->
    <call target="CommonTestTarget" />
</target>

<!-- This is in some shared build file that many components re-use -->
<target name="CommonTestTarget">
    <!-- a shared target for running tests using parameters set as NAnt properties -->
    <!-- Pretend you have a custom NAnt task or something; this is pseudo-NAnt, remember? -->
    <test category="${TestCategory}" resultsfile="${ResultsFile}">
        <!-- Now you can use the fileset that was set dynamically upstream -->
        <fileset refid="UnitTestFileset" />
    </test>
</target>

Et voilà: a dynamic NAnt fileset that you can set at runtime from an upstream target or task!

FlowCon: Continuous Delivery, Continuous Design and Lean Product Development
http://blogs.ancestry.com/techroots/flowcon-continuous-delivery-continuous-design-and-lean-product-development/
Fri, 28 Jun 2013 22:33:24 +0000

Over the past few years, Ancestry.com has been moving toward agile development methods. In fact, that’s the main reason I joined the company three years ago. Since starting at the company, I have worked at moving Ancestry.com into agile methods, including teaching Scrum. Over the years, agile has taken root in the company and the mindset of the teams, and it has expanded to ultimately impact multiple areas of the business, particularly in operations. I’m proud that the team has received quite a bit of recognition in the industry for our significant adoption of the continuous delivery model.

I have seen firsthand how agile can benefit a business, which is why last year I approached ThoughtWorks and Continuous Delivery author Jez Humble about creating a conference around the theme of continuous delivery. I felt a conference that took a multidisciplinary approach to delivering business value would be beneficial to the community.

Many conferences tend to focus on a specific activity, group, or methodology. We know that creating business value requires “flow” across many parts of the enterprise and I wanted a conference that would reflect that. Ultimately, ThoughtWorks and Trifork (creators of QCon and GOTO) joined to start FlowCon, which will run for one day in San Francisco on November 1, 2013. As well as being co-founder, I am excited to be a member of the Program Committee, which includes Elisabeth Hendrickson (TestObsessed.com), Gene Kim (author of The Phoenix Project and Visible Ops Handbook), Jez Humble and Lane Halley.

FlowCon’s focus is “flow” and, as mentioned, how to create and adopt mindset, methods, and practices that enable continuous, fluid delivery of value in the enterprise. It brings together technologists and industry leaders passionate about innovation through continuous delivery, continuous design, and lean product development. The conference will explore the role of culture, technology and design in building great products and services and will provide inspiring and actionable information for key decision makers responsible for products and services that depend on software.

The target audience is everyone involved in creating products and services that depend on software. You can see a list of our confirmed speakers here. You will notice that Ancestry.com’s CEO Tim Sullivan has graciously agreed to participate in this new conference and lend his perspective as well. We intend to have speakers across a wide variety of disciplines with the intent of expanding our understanding of flow.

I encourage folks to attend the day conference in San Francisco at the Mission Bay Conference Center (UCSF).

DevOps: Finding the right place for a new idea
http://blogs.ancestry.com/techroots/devops-finding-the-right-place-for-a-new-idea/
Thu, 02 May 2013 20:26:44 +0000

For the last year and a half, we’ve been breaking in a new concept at Ancestry.com called a DevOps engineer. There is a ton of material on the internet about what DevOps means to various groups, and how they’ve implemented it. A lot of it revolves around Scrum, Agile processes, and other approaches to increase productivity within a team, but the underlying premises are significantly different. As the name implies, DevOps is the combination of development (dev) and operations (ops). That statement alone conjures up all kinds of interesting images. Is it a developer that occasionally racks servers? Is it a network engineer that writes hyper-intelligent scripts? Is it an installation tech that manages to automate the entire installation process? Or is it simply someone who can translate why a development team suddenly needs twice the processing power because a new version of a software platform was released with a great new feature set? There are any number of combinations that could result from joining development and operations.

As a DevOps team, our vision was to provide development with an API for operations. We quickly discovered that what mattered wasn’t necessarily how we defined ourselves, but how other groups in the organization viewed us. Long story short, development thought we were operations, and operations thought we were development. At least at first.

From the developer’s perspective, we were an ops team that was there to help developers. At first we received all kinds of interesting requests. It seemed like development hoped that we’d be the avenue to deliver all the things that operations kept promising but never got around to actively delivering. As a newly formed team, we liked the feeling of being useful and productive, and tried to facilitate as many requests as possible, but quickly found ourselves overwhelmed. After all, if an entire ops organization couldn’t manage to accomplish the laundry list of developer requests, why would a single team, that was still getting its bearings, be able to?

Once we realized that we were getting nowhere quickly, we decided to try facilitating the needs of the developers instead of the actual requests. That isn’t to say that the developers didn’t know what they wanted, but rather, the developers may not have been aware of other avenues that could make life easier for everyone. The number one complaint from developers was that turn-around times for hardware were way too long. Lowering server delivery times was going to be an easy win, as the company had already begun experimenting with virtualization, and it was a short leap to go from experimentation to implementation. In a matter of a few months, we managed to drop the time for server delivery from months to days. Ops was able to monitor the virtual host load to know when more hardware was required, and order proactively. Developers got their servers much faster, and productivity increased as a result. Win-win, right? Except that DevOps suddenly became the man in the middle for every server configuration situation imaginable.

That kicked off our journey to create self-service tools to do all the manual yet very automatable processes. Everything from VM creation and configuration to deploying code and monitoring. These features, while independently insignificant, could add up to a ton of time that the developers didn’t want to spend on operational concerns. They wanted to develop and we were more than happy to facilitate. Once they realized that we were working in their best interest they were more than happy to switch over to our automated processes. That freed them up to focus on development and furthering the business goals.

From the operations perspective, we initially were another demanding dev team that wanted more of their precious resources, which were often already spread thin. Coordinating with several different operations teams, each with its own field of responsibility, is a significant undertaking for any team. Initially the DevOps team would get resistance from various ops teams that felt we had no business trying to do their jobs when they were perfectly capable of doing them themselves. But we didn’t want to do their jobs for them. We wanted to give them tools to make their jobs easier. Why manually create 50 Active Directory objects, each with a slew of details, when a script could easily handle it for you? Why bother tracking static IP address allocation by hand when it could all be done in a database with a nice API? It took a while to find common ground, but eventually many of the time-intensive tasks that consumed much of ops’ precious resources were scripted, allowing them to focus on improving infrastructure and making the site run better.
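As an illustration of the static IP example, here is a small sketch of what "a database with a nice API" could look like. It is not the tool we built: the class name, table layout, and hostnames are assumptions, and it uses only Python's standard `ipaddress` and `sqlite3` modules.

```python
import ipaddress
import sqlite3

class IpAllocator:
    """Tracks static IP assignments for one subnet in a SQLite table."""

    def __init__(self, cidr, db_path=":memory:"):
        self.network = ipaddress.ip_network(cidr)
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS allocations ("
            "address TEXT PRIMARY KEY, hostname TEXT UNIQUE NOT NULL)"
        )

    def allocate(self, hostname):
        """Return the host's address, assigning the lowest free one if needed."""
        row = self.db.execute(
            "SELECT address FROM allocations WHERE hostname = ?", (hostname,)
        ).fetchone()
        if row:
            return row[0]
        taken = {r[0] for r in self.db.execute("SELECT address FROM allocations")}
        for host in self.network.hosts():
            address = str(host)
            if address not in taken:
                with self.db:  # commits on success, rolls back on error
                    self.db.execute(
                        "INSERT INTO allocations (address, hostname) VALUES (?, ?)",
                        (address, hostname),
                    )
                return address
        raise RuntimeError(f"no free addresses left in {self.network}")

    def release(self, hostname):
        """Free whatever address the host currently holds."""
        with self.db:
            self.db.execute("DELETE FROM allocations WHERE hostname = ?", (hostname,))


if __name__ == "__main__":
    allocator = IpAllocator("10.0.0.0/29")
    print(allocator.allocate("web01"))
    print(allocator.allocate("web02"))
```

The primary key on `address` and the unique constraint on `hostname` let the database itself refuse double bookings, which is exactly the kind of check that gets missed in a hand-maintained spreadsheet.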

As I said before, the team has been around for over a year now. Our vision was, and still is, to provide development with an API for operations. It will be pretty cool one day for a developer to be able to poke a single API and get everything they need to get their code from inception to delivery. On the operations side, it would be pretty cool if all the automatable minutiae were handled automatically, so they could completely focus on improvements. We’re getting there, and quickly.

How Not to Handle an Overloaded Server – A Cautionary Tale
http://blogs.ancestry.com/techroots/how-not-to-handle-an-overloaded-server-a-cautionary-tale/
Mon, 25 Mar 2013 20:30:39 +0000

We have been working hard at Ancestry to build automated server health monitoring into our service infrastructure. One of the most recent efforts was the addition of some simple health tests that our load balancers could run by themselves to check on the health of our servers. We added three types of tests for each server: a simple ICMP ping; an HTTP GET request to a static resource; and a monitor that watches actual response times and HTTP response codes across the wire, looking for errors and abnormal times. The decision was made that if two of these three health monitors reported a failure over a specific time period, the server would be flagged as failing and would be pulled from the pool. Automated health tests are cool, and load balancers that can fix their own problems are extra cool, so the geek factor for this addition was quite high and we were pretty excited to have it working. To play off the famous quote attributed to George Orwell: “We sleep soundly in our beds because rough load balancers stand ready in the night to visit violence on the servers who would do us harm.” If you are thinking ahead, you can probably see the potential for problems.
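The two-out-of-three rule is simple enough to sketch. The snippet below is an illustration in Python, not our load balancer configuration; the field names are made up, and it ignores the time window over which failures were counted.

```python
from dataclasses import dataclass

@dataclass
class HealthReport:
    """Result of the three monitors for one server (field names are hypothetical)."""
    ping_ok: bool        # ICMP ping answered
    static_get_ok: bool  # HTTP GET of a static resource succeeded
    traffic_ok: bool     # live response times and status codes look normal

def should_pull(report, threshold=2):
    """Flag the server when at least `threshold` of the three monitors fail."""
    results = [report.ping_ok, report.static_get_ok, report.traffic_ok]
    return results.count(False) >= threshold


if __name__ == "__main__":
    print(should_pull(HealthReport(True, True, False)))   # one failure
    print(should_pull(HealthReport(True, False, False)))  # two failures
```

Note that nothing in this rule distinguishes a broken server from a busy one: an overloaded machine can fail both the static GET and the traffic monitor while being otherwise perfectly healthy.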

One day we started to get some unusually heavy traffic. One of our services was being hit especially hard. At one point one of the servers started to return a large number of timeouts or HTTP errors because it was overloaded and was having trouble keeping up. Because the health monitors started showing a failure, they flagged the machine and it was pulled from the pool. Again, if you are thinking ahead, you can predict what happened next. The rest of the servers in that pool now were taking all of the original load, plus the added load of the server that was pulled. So, of course, they also began to be flagged and pulled from the pool one by one. We were monitoring the situation and watched that pool tick down to zero servers as we frantically tried to put servers back in fast enough. That is what kids these days are calling a “cascading failure,” where failure of one component triggers failure in other components.

In retrospect it all seems terribly obvious, but at the time we were all so excited to have these automated health tests running that we were a little blinded by our geekish enthusiasm. We had inadvertently designed cascading failures into our automated health monitoring. After thinking things through more clearly in the light of day, we all agreed that a better approach would be to reduce traffic to misbehaving servers instead of pulling them from the pool. Let them cool down a bit and see if they start responding again. If they do not, then maybe it is OK to pull them, but if they are simply taking too much traffic, taking them out of the pool is the last thing you want to do.

During this process, we realized that we already use the correct model for several other systems such as some of our filers and cache servers, which go into a reduced traffic “dead pool” when they get overloaded and then get put back in the normal pool if cooling down solves their problem. Based on what we learned, we have revised the load balancer monitoring so that it follows this model more closely and it has been working great.
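A toy simulation shows why the two policies behave so differently. Everything here is an assumption for illustration: the capacities, the load figure, the even split of traffic, and the 0.8 backoff factor are invented, and real load balancers weigh traffic in more sophisticated ways.

```python
def simulate_pull_policy(capacities, total_load):
    """Pull overloaded servers one at a time; return how many remain in the pool."""
    pool = list(capacities)
    while pool:
        share = total_load / len(pool)  # traffic is split evenly
        overloaded = [c for c in pool if c < share]
        if not overloaded:
            break
        pool.remove(min(overloaded))    # pull the weakest overloaded server
    return len(pool)

def simulate_throttle_policy(capacities, total_load, backoff=0.8, max_rounds=50):
    """Keep every server in the pool but shrink the weight of overloaded ones.

    Returns the per-server load after the weights settle (or after max_rounds).
    """
    weights = [1.0] * len(capacities)
    shares = []
    for _ in range(max_rounds):
        total_weight = sum(weights)
        shares = [total_load * w / total_weight for w in weights]
        overloaded = [i for i, (s, c) in enumerate(zip(shares, capacities)) if s > c]
        if not overloaded:
            break
        for i in overloaded:
            weights[i] *= backoff       # send this server less traffic next round
    return shares


if __name__ == "__main__":
    capacities = [100, 120, 120, 120]   # requests/sec each server can handle
    load = 440                          # total requests/sec hitting the pool
    print("servers left after pulling:", simulate_pull_policy(capacities, load))
    print("per-server load after throttling:", simulate_throttle_policy(capacities, load))
```

With these numbers the pool has enough total capacity (460) for the load (440), yet pulling the one slightly weaker server pushes each survivor past its limit and the pool drains to zero. Throttling the weak server instead leaves all four serving traffic within their limits.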

The purpose of building automation around load balancing and health tests is to have a more self-healing server pool. Ideally the system would react to problems on a single server by routing traffic away from that server and sending it to healthy servers instead. Once we stopped treating an overloaded server the same way as an unhealthy server, the system started doing what we wanted it to do. Unhealthy servers get pulled from the pool so we can re-provision them or whatever is needed and the healthy servers take up the slack. But when a server is simply overloaded, it stays in the pool. This is a great place to be. People don’t spend as much time monitoring server pools, and fixing problems requires less manual work.